<a href="https://colab.research.google.com/github/BennoKrojer/ML2/blob/main/Part2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# COMP 551 - Mini Project 2


In [82]:
import numpy as np
import pandas as pd
import sklearn as sk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn import metrics

### Data Loading

In [None]:
!wget https://raw.githubusercontent.com/BennoKrojer/ML2/main/fake_news_data/fake_news_train.csv
!wget https://raw.githubusercontent.com/BennoKrojer/ML2/main/fake_news_data/fake_news_val.csv
!wget https://raw.githubusercontent.com/BennoKrojer/ML2/main/fake_news_data/fake_news_test.csv

In [4]:
!ls

fake_news_test.csv  fake_news_train.csv  fake_news_val.csv  sample_data


In [8]:
raw_data_train = pd.read_csv("fake_news_train.csv")
raw_data_val   = pd.read_csv("fake_news_val.csv")
raw_data_test  = pd.read_csv("fake_news_test.csv")

raw_data_train.head()
raw_data_train.columns

Index(['text', 'label'], dtype='object')

### Data Pre-Processing
To perform machine learning on text, we first need to extract numerical features from the text.

In [45]:
# Separate raw data into text and label
raw_data_train_text = raw_data_train["text"]
raw_data_train_label = raw_data_train["label"]

# Create numpy array to store target labels
Y_train = np.asarray(raw_data_train_label)
Y_train

array([0, 0, 0, ..., 0, 1, 1])

In [52]:
# Separate raw data for validation set
raw_data_val_text = raw_data_val["text"]
raw_data_val_label = raw_data_val["label"]
Y_val = np.asarray(raw_data_val_label)
Y_val

array([0, 0, 1, ..., 0, 1, 0])

In [53]:
# Separate raw data for test set
raw_data_test_text = raw_data_test["text"]
raw_data_test_label = raw_data_test["label"]
Y_test = np.asarray(raw_data_test_label)
Y_test

array([0, 1, 0, ..., 1, 0, 0])

#### Bag of Words (Vectorizing)
In the bag-of-words method, each word in the training data set is assigned an integer ID. For each data sample, we then count the number of occurrences of each word and store that count as the feature for that word.  

Ex:  
`Data sample 'i' = "The quick brown fox jumped over the brown dog"`  
`if id for word "brown" = 3`  
`store X[i, 3] = 2 (word count)`    

This means that the number of features equals the number of unique words across all training samples, which is typically more than 100,000.  

If the feature matrix were stored densely, it would need n_samples x n_features x 4 bytes (assuming 4-byte values). With 100,000 features, even 10,000 samples would take 10,000 x 100,000 x 4 bytes = 4 GB, which is not very practical. Luckily, most entries are zero, because each sample contains only a small subset of the total set of words in the data set. For this reason, bag-of-words feature matrices are usually described as *high-dimensional sparse data sets*.  

`CountVectorizer` returns the features as a `scipy.sparse` matrix, which stores only the non-zero entries.
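
As a rough illustration of the counting scheme described above, here is a minimal standard-library sketch. It is not how `CountVectorizer` is implemented; the helper name `bag_of_words`, the second toy sentence, and the dict-based sparse rows are assumptions made for this example only.

```python
import re
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, rows) for a list of strings.

    vocabulary maps each word to an integer column id.
    rows holds one {column id: count} dict per document, so words that
    do not occur in a document take up no space (a sparse representation).
    """
    # Lowercase and keep tokens of two or more word characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    unique_words = sorted({word for tokens in tokenized for word in tokens})
    vocabulary = {word: idx for idx, word in enumerate(unique_words)}
    rows = [dict(Counter(vocabulary[word] for word in tokens)) for tokens in tokenized]
    return vocabulary, rows


# Hypothetical toy documents; the first one is the example sentence from the text
docs = ["The quick brown fox jumped over the brown dog", "The dog slept"]
vocabulary, rows = bag_of_words(docs)

brown_id = vocabulary["brown"]
print("id of 'brown':", brown_id)
print("count of 'brown' in sample 0:", rows[0][brown_id])
print("non-zero entries per sample:", [len(row) for row in rows])
```

Only the non-zero counts are stored per sample, which is the same idea a `scipy.sparse` matrix uses at a much larger scale.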




In [18]:
# Use sklearn.feature_extraction.text.CountVectorizer
# To build a dictionary of features, and transform data samples into feature vectors
count_vect = CountVectorizer()

# training feature vectors
X_train_counts = count_vect.fit_transform(raw_data_train_text)
X_train_counts.shape

(20000, 145402)

In [30]:
# Peek at output of count vectorizer
count_vect.vocabulary_.get(u'algorithm')
#count_vect.vocabulary_.get(u' ')

#### Occurrences -> Frequencies (Transforming)
Occurrence counts are a good start, but they are skewed towards longer documents: longer texts will on average have higher counts than shorter texts. To compensate for this, we can look at term frequencies, which are the number of occurrences of a word in a text divided by the total number of words in that text.  

#### Downscale Word Weights
Another pre-processing technique for text is to give higher weights to rarer words and lower weights to words that appear frequently across the overall set of texts. This is done by downscaling the weights of words that appear in many of the texts.  

These two approaches can be combined into tf-idf, or Term Frequency times Inverse Document Frequency. It can be computed using sklearn's `TfidfTransformer`.
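
To make the two steps concrete, here is a minimal sketch of the textbook form of tf-idf: tf = count / document length, and idf = log(number of documents / number of documents containing the word). This is an assumption for illustration. `TfidfTransformer` with default settings uses a smoothed idf and L2-normalizes each row, so its numbers will differ. The function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {word: weight} dict per document (documents must be non-empty)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights.append({
            # term frequency times inverse document frequency
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, already split into words
toy_docs = [
    ["news", "about", "the", "election"],
    ["news", "about", "the", "weather"],
    ["news", "about", "aliens"],
]
toy_weights = tf_idf(toy_docs)
for doc_weights in toy_weights:
    print({word: round(weight, 3) for word, weight in doc_weights.items()})
```

A word that occurs in every document ("news") gets weight zero in this variant, while a word unique to one document ("aliens") gets the largest weight.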

In [58]:
tfidf_transformer = TfidfTransformer()
# fit_transform combines fit and transform into one step
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(20000, 145402)

### Model Training
Linear model: logistic regression, using `sklearn.linear_model.LogisticRegression`.  

Start from the default parameters of the logistic regression classifier (the cell below raises only `max_iter`, to 1000):  
- penalty  - norm of the penalty (default = 'l2')  
- tol      - tolerance for stopping criteria (default = 1e-4)  
- max_iter - maximum number of iterations taken for the solvers to converge (default = 100)  
- n_jobs   - number of CPU cores to use (default = None, which means 1; use -1 for all processors)

#### Fit

In [50]:
# clf = classifier (the model being trained)
clf = LogisticRegression(penalty='l2', tol=1e-4, max_iter=1000)

# Fit the classifier on the tf-idf training features and the training labels
clf.fit(X_train_tfidf, Y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

#### Validate

In [60]:
# Pre-process the validation data in the same way as the training data,
# but only call transform: count_vect and tfidf_transformer have already been
# fit on the training data (we don't want to fit on validation or test data)

X_val_counts = count_vect.transform(raw_data_val_text)
print("X_val_counts.shape: ", X_val_counts.shape)
X_val_tfidf = tfidf_transformer.transform(X_val_counts)
print("X_val_tfidf.shape: ", X_val_tfidf.shape)

val_pred = clf.predict(X_val_tfidf)

print("val_pred.shape: ", val_pred.shape)
print("Y_val.shape: ", Y_val.shape)

# Compare predictions to labels
np.mean(val_pred == Y_val)

X_val_counts.shape:  (2000, 145402)
X_val_tfidf.shape:  (2000, 145402)
val_pred.shape:  (2000,)
Y_val.shape:  (2000,)


0.731
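
The reason for calling only `transform` on the validation data can be seen on a toy example. The sketch below uses made-up sentences and a separate vectorizer named `toy_vect` (both assumptions for illustration): the vocabulary is fixed at fit time, so new text is mapped onto the same columns and words never seen during fitting are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit the vocabulary on (toy) training text only
toy_vect = CountVectorizer()
X_fit = toy_vect.fit_transform(["red apple", "green apple"])

# Transform unseen text with the already-fitted vocabulary
X_new = toy_vect.transform(["blue apple pie"])

print("vocabulary:", sorted(toy_vect.vocabulary_))
print("shapes:", X_fit.shape, X_new.shape)
print("new sample as counts:", X_new.toarray())
```

The column count of the new matrix matches the training matrix, which is what lets the classifier fitted on the training features be applied to validation and test features.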

### Pipeline
To simplify the sequence of pre-processing and classification steps, sklearn lets us chain them into a pipeline that is applied to the data in order: vectorizer->transformer->classifier.  
- Vectorizer ("vect") = Bag of Words, transform text to counts of words in text
- Transformer ("tfidf") = TF-IDF transform
- Classifier ("lr-clf") = Logistic Regression classifier

In [67]:
text_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("lr-clf", LogisticRegression(max_iter=500)),
])

In [71]:
# Can now train using a single command / function
text_pipeline.fit(raw_data_train_text, Y_train)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('lr-clf',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,

In [72]:
# Accuracy of the fitted pipeline on the train set
text_pipeline.score(raw_data_train_text, Y_train)

0.85525

In [79]:
# Predict labels for train set using pipeline
Y_pred = text_pipeline.predict(raw_data_train_text)
#print("Y_pred.shape: ", Y_pred.shape, ", Y_train.shape: ", Y_train.shape)
acc = np.mean(Y_pred == Y_train)
print(acc)

0.85525


In [77]:
# Validate using pipeline
Y_pred = text_pipeline.predict(raw_data_val_text)
print("Y_pred.shape: ", Y_pred.shape, ", Y_val.shape: ", Y_val.shape)
acc = np.mean(Y_pred == Y_val)
print(acc)

Y_pred.shape:  (2000,) , Y_val.shape:  (2000,)
0.731


### Model Evaluation (testing)
Now test the performance of the model on the test data.

#### Test accuracy

In [80]:
# Test model acc with test data
Y_pred = text_pipeline.predict(raw_data_test_text)
acc = np.mean(Y_pred == Y_test)
print(acc)

0.71


#### Model Metrics

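Confusion matrix: positive = fake news, negative = not fake news (**double check this**: it assumes label 1 means fake news, which might be backwards).  
`metrics.confusion_matrix` puts the true labels on the rows and the predicted labels on the columns, sorted by label value, so for labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]` with label 1 as the positive class.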

In [83]:
# Confusion matrix for binary classification: 
metrics.confusion_matrix(raw_data_test_label, Y_pred)

array([[ 554,  676],
       [ 194, 1576]])
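
As a sketch of how the usual metrics follow from this matrix, the snippet below derives accuracy, precision, recall, and F1 from the four counts. It assumes sklearn's row = true label, column = predicted label layout and treats label 1 as the positive class. Whether label 1 actually means fake news is not confirmed here. The helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(confusion):
    """Compute metrics from a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = confusion
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)  # of the samples predicted 1, how many are truly 1
    recall = tp / (tp + fn)     # of the samples that are truly 1, how many were found
    return {
        "accuracy": (tn + tp) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts copied from the confusion matrix output above
test_confusion = [[554, 676], [194, 1576]]
scores = binary_metrics(test_confusion)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

The accuracy computed this way matches the test accuracy reported earlier. The first row also shows that most errors come from label-0 samples predicted as label 1 (676 of 1230).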