TF (term frequency): the number of times a word appears in a document divided by the total number of words in that document. (https://towardsdatascience.com/natural-language-processing-feature-engineering-using-tf-idf-e8b9d00e7e76)
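
A minimal sketch of the TF definition, assuming simple lowercase whitespace tokenization (the helper name `term_frequency` is made up for illustration and is not part of scikit-learn):

```python
from collections import Counter


def term_frequency(document: str) -> dict[str, float]:
    """Return each word's count divided by the document's total word count."""
    words = document.lower().split()  # assumption: whitespace tokenization
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


tf = term_frequency("the cat sat on the mat")
print(tf["the"])  # "the" is 2 of the 6 words, about 0.333
```

Because every count is divided by the same total, the TF values of one document sum to 1.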

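IDF (inverse document frequency): measures how significant a word is across the entire corpus. It is based on the number of documents that contain the word: the fewer documents a word appears in, the higher its IDF, so rare words are weighted up and very common words are weighted down. (https://tealfeed.com/detecting-fake-news-python-machine-learning-2c3b7)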
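
As a rough illustration, the textbook form is idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. The sketch below assumes whitespace tokenization and uses a made-up helper name. Note that scikit-learn's default is a smoothed variant, ln((1 + N) / (1 + df(t))) + 1, so its numbers will differ from these.

```python
import math


def inverse_document_frequency(corpus: list[str]) -> dict[str, float]:
    """Textbook IDF: log(number of documents / documents containing the word)."""
    n_docs = len(corpus)
    doc_freq: dict[str, int] = {}
    for document in corpus:
        # set() so that a word is counted once per document, not once per occurrence
        for word in set(document.lower().split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


corpus = ["the cat sat", "the dog barked", "the bird sang"]
idf = inverse_document_frequency(corpus)
print(idf["the"])  # appears in all 3 documents: log(3 / 3) = 0.0
print(idf["cat"])  # appears in 1 document: log(3 / 1), about 1.0986
```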

TfidfVectorizer: converts a collection of raw documents into a matrix of TF-IDF features. (https://tealfeed.com/detecting-fake-news-python-machine-learning-2c3b7)
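
A small sketch of what the vectorizer produces, using three toy documents invented for illustration (not the article data) and default settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for this example
docs = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs"]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

print(matrix.shape)                    # (3, 10): 3 documents, 10 distinct terms
print(sorted(vectorizer.vocabulary_))  # the learned terms, one per column

# With the default norm="l2", every row is scaled to unit length
dense = matrix.toarray()
row_norms = [sum(value * value for value in row) ** 0.5 for row in dense]
print(row_norms)
```

Each row is one document and each column is one term from the learned vocabulary, so the matrix is mostly zeros and is stored in sparse form.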

-----
Passive Aggressive Classifier
- an online learning algorithm that remains passive when an example is classified correctly (with sufficient margin) and turns aggressive on a misclassification, updating its weights just enough to correct the error
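
The passive/aggressive behaviour can be sketched as a single update step. This is a simplified illustration of the PA-I rule with hinge loss, not scikit-learn's implementation; the function name `pa_update`, the {-1, +1} label encoding, and the plain-list weights are assumptions made for the example.

```python
def pa_update(weights, x, y, C=1.0):
    """One passive-aggressive (PA-I) step for a label y in {-1, +1}."""
    margin = y * sum(w * xi for w, xi in zip(weights, x))
    loss = max(0.0, 1.0 - margin)  # hinge loss
    if loss == 0.0:
        return weights  # passive: the example is already handled correctly
    norm_sq = sum(xi * xi for xi in x)
    if norm_sq == 0.0:
        return weights  # an all-zero example carries no information
    tau = min(C, loss / norm_sq)  # aggressive: step size capped by C
    return [w + tau * y * xi for w, xi in zip(weights, x)]


w = [0.0, 0.0]
w = pa_update(w, [1.0, 0.0], +1)       # loss 1 -> weights move to [1.0, 0.0]
w_same = pa_update(w, [1.0, 0.0], +1)  # loss 0 -> weights unchanged
w_next = pa_update(w, [0.0, 2.0], -1)  # loss 1 -> weights move to [1.0, -0.5]
print(w, w_same, w_next)
```

The cap C limits how far one noisy example can move the weights; a smaller C makes the updates less aggressive.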


RESOURCES:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer

https://data-flair.training/blogs/advanced-python-project-detecting-fake-news/comment-page-2/

https://www.kaggle.com/hassanamin/fake-news-classifier/code

https://tealfeed.com/detecting-fake-news-python-machine-learning-2c3b7

https://www.datacamp.com/community/tutorials/scikit-learn-fake-news

https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/

In [74]:
# Import dependencies 
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

In [75]:
# Read CSV
df = pd.read_csv("articles.csv")
df.head()

Unnamed: 0,title,text,subject,label
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,Politics,1
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,Politics,1
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",Politics,1
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",Politics,1
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,Politics,1


In [76]:
# Assign variables: target (y) is the label column, features (X) are the raw article text
y = df.label
X = df.text

# Split data into training and testing sets 
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78, test_size=0.20)

In [77]:
# Initialize the TF-IDF vectorizer
# stop_words='english' removes common English stop words;
# max_df=0.7 discards terms that appear in more than 70% of the documents.
tf_idf = TfidfVectorizer(stop_words = 'english', max_df = 0.7)

# Fit and transform training data
tf_idf_train = tf_idf.fit_transform(X_train)

# Transform testing data 
tf_idf_test = tf_idf.transform(X_test)

# Inspect the sparse test matrix: (document index, feature index) followed by the TF-IDF weight
print(tf_idf_test)

  (0, 94854)	0.08181449639385187
  (0, 94186)	0.04047253404119041
  (0, 94171)	0.04131863908303009
  (0, 93368)	0.028275013176832408
  (0, 92534)	0.0368421451909895
  (0, 92199)	0.06112354626064536
  (0, 90349)	0.03877934264942043
  (0, 90258)	0.07511785819960073
  (0, 88160)	0.09825082121464127
  (0, 87935)	0.04710833493771305
  (0, 87683)	0.07732734742647794
  (0, 87657)	0.0473188880236601
  (0, 87441)	0.04702507861729999
  (0, 86984)	0.09152276721835384
  (0, 86045)	0.026626543530901395
  (0, 85615)	0.03217499586141256
  (0, 84107)	0.09430846780403815
  (0, 83228)	0.033896279848797665
  (0, 81539)	0.08982182141490197
  (0, 81532)	0.05002057417846024
  (0, 81431)	0.04710137608931464
  (0, 80441)	0.06358479220111414
  (0, 80292)	0.05152310208109519
  (0, 78306)	0.11997464839118269
  (0, 77593)	0.07403566488363002
  :	:
  (6794, 27849)	0.07918946205463527
  (6794, 27170)	0.10856368602661397
  (6794, 25215)	0.04650214809990692
  (6794, 23353)	0.06610291491546187
  (6794, 23320)	0.051600

In [78]:
# MODEL 3: Logistic Regression

#Import library
from sklearn.linear_model import LogisticRegression

# Instantiate model
lg_model = LogisticRegression(solver='lbfgs',
                                max_iter=200,
                                random_state=1)

# Train model
lg_model.fit(tf_idf_train, y_train)

# Create model predictions
y_pred = lg_model.predict(tf_idf_test)

# Validate Model(1): Accuracy Score 
accuracy = accuracy_score(y_test, y_pred)*100
print(accuracy)

# OPTIONAL:
# Validate Model(2): Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(
    cm, index=["Actual 0", "Actual 1"], columns=["Predicted 0", "Predicted 1"]
)

print(cm_df)

# Validate Model(3): Classification Report
print(classification_report(y_test, y_pred))

98.10154525386314
          Predicted 0  Predicted 1
Actual 0         2142           77
Actual 1           52         4524
              precision    recall  f1-score   support

           0       0.98      0.97      0.97      2219
           1       0.98      0.99      0.99      4576

    accuracy                           0.98      6795
   macro avg       0.98      0.98      0.98      6795
weighted avg       0.98      0.98      0.98      6795



In [79]:
# MODEL 6: PassiveAggressiveClassifier

#Import library
from sklearn.linear_model import PassiveAggressiveClassifier

# Instantiate model (no random_state is set, so results can vary slightly between runs)
pac_model = PassiveAggressiveClassifier(max_iter=50)

# Train model
pac_model.fit(tf_idf_train, y_train)

# Create model predictions
y_pred = pac_model.predict(tf_idf_test)

# Validate Model(1): Accuracy Score 
accuracy = accuracy_score(y_test, y_pred)*100
print(accuracy)

# OPTIONAL:
# Validate Model(2): Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(
    cm, index=["Actual 0", "Actual 1"], columns=["Predicted 0", "Predicted 1"]
)

print(cm_df)

# Validate Model(3): Classification Report
print(classification_report(y_test, y_pred))

99.24944812362031
          Predicted 0  Predicted 1
Actual 0         2193           26
Actual 1           25         4551
              precision    recall  f1-score   support

           0       0.99      0.99      0.99      2219
           1       0.99      0.99      0.99      4576

    accuracy                           0.99      6795
   macro avg       0.99      0.99      0.99      6795
weighted avg       0.99      0.99      0.99      6795

