<a href="https://colab.research.google.com/github/Koo8/ML-text-classification-logistic-regression-nltk-news/blob/main/newsML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Using ML for Fake News Analysis**  
Keywords:  

1.   text data - TfidfVectorizer - convert text data to numerical data
2.   classification + logistic regression model + stratification: stratify=y for classification problems
3.   nltk
4.   stemming - nltk PorterStemmer - strip suffixes to reduce each word to its stem


**Fake News Prediction with Python**  
This project deals with text data.  
It is a binary classification project: each article is predicted to be either Fake or Real.  
The preprocessed training data is fed to a *logistic regression model*, a common and effective baseline for binary classification.  
**About the Dataset**  
1 id: unique article id  
2 title: news article title  
3 author: article author  
4 text: the text of the article; could be incomplete  
5 label: mark the article fake(1) or real(0)  



In [1]:
# import dependency and dataset
import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer # strip suffixes and return the word stem
from sklearn.feature_extraction.text import TfidfVectorizer  #Convert a collection of raw documents to a matrix of TF-IDF features(basically numbers)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score



In [2]:
# first: download the stopwords
import nltk
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [3]:
# show stopwords
print(stopwords.words('english'))
# these words carry little information for the analysis, so they are removed from the dataset

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

**Data Pre-processing**

In [4]:
news_data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/fake_news_dataset.csv')

In [5]:
news_data.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [6]:
news_data.shape

(20800, 5)

In [7]:
news_data.isna().sum()  # title, author and text all have null values


id           0
title      558
author    1957
text        39
label        0
dtype: int64

**Handle Missing Values**  
The dataset has 20800 rows, and only a small share of them have missing values (558 titles, 1957 authors, 39 texts). Two common options are:
* imputation with a central tendency (mean() for a normal distribution, median() or mode() for a skewed distribution), e.g. dataset.fillna(dataset.mean(), inplace=True). This only applies to numeric columns.
* dropping the rows that contain missing values.

The missing values here are all in text columns, so instead of either option I replace each n/a with an empty string, which keeps every row.

In [8]:
news_data = news_data.fillna("")

In [15]:
# check again any NA value -> no missing values anymore
news_data.isna().sum()


id         0
title      0
author     0
text       0
label      0
content    0
dtype: int64
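As a small illustration of the fill step, the sketch below uses a made-up three-row frame (the rows are hypothetical; only the column names mirror the dataset). It shows that `fillna("")` keeps every row and makes the later string concatenation safe.

```python
import pandas as pd

# Hypothetical mini-frame; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "title": ["Budget vote delayed", None, "Storm hits coast"],
    "author": ["A. Reporter", "B. Writer", None],
    "label": [0, 1, 0],
})

missing_before = toy.isna().sum()

# Replace every missing value with an empty string (no rows are dropped).
filled = toy.fillna("")
missing_after = filled.isna().sum()

# Concatenation now yields a string for every row instead of NaN.
filled["content"] = filled["title"] + " " + filled["author"]

print(missing_before["title"], missing_before["author"])
print(missing_after.sum())
print(filled["content"].tolist())
```

Rows that had a missing title or author end up with a leading or trailing space, which the later `split()` in the stemming step discards.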

**Prediction Approach**  
Only 'title' and 'author' are used to predict fake news. 'text' could be used as well, but it contains far more words and would need much more compute. Without it, the prediction is already pretty good.

In [10]:
# merge the two columns into a new 'content' column
news_data['content']=news_data['title'] + " " + news_data['author']
news_data['content'].head()

0    House Dem Aide: We Didn’t Even See Comey’s Let...
1    FLYNN: Hillary Clinton, Big Woman on Campus - ...
2    Why the Truth Might Get You Fired Consortiumne...
3    15 Civilians Killed In Single US Airstrike Hav...
4    Iranian woman jailed for fictional unpublished...
Name: content, dtype: object

**Stemming Words**  
Stemming reduces a word to its stem by stripping suffixes.  
e.g. acting, acted, acts --> act  
This way, the vectorizing step (turning text into numbers) assigns one feature to each stem instead of one per inflected form.  
Only once all the data is numerical can I feed the dataset to a model for training.
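A quick way to see this is to run the stemmer on a few inflected forms. This is a minimal sketch; the word list is just an illustration, and `PorterStemmer` needs no corpus download.

```python
from nltk.stem.porter import PorterStemmer

demo_stemmer = PorterStemmer()

# Inflected forms of the same verb collapse to a single stem.
words = ["acting", "acted", "acts"]
stems = [demo_stemmer.stem(w) for w in words]
print(stems)

# stem() lowercases its input by default (to_lowercase=True).
capitalized_stem = demo_stemmer.stem("Acting")
print(capitalized_stem)
```

Stems are not always dictionary words: in the output further down, "house" becomes "hous" and "hillary" becomes "hillari".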


In [11]:
stemmer = PorterStemmer()

In [12]:
# remove non-alphabetical characters, drop stopwords, stem each remaining word, then join the words with a " " space
stop_words = set(stopwords.words('english'))  # build the lookup set once instead of once per word

def stemming(content):
  words = re.sub('[^a-zA-Z]', ' ', content).split()
  # the stopword check runs before stem() lowercases the word (to_lowercase=True is the default),
  # so capitalized stopwords such as "We" or "The" are kept
  stemmed = [stemmer.stem(word) for word in words if word not in stop_words]
  return " ".join(stemmed)

In [13]:
news_data['content'] = news_data['content'].apply(stemming) # 2minutes

In [14]:
news_data.content


0        hous dem aid we didn even see comey letter unt...
1        flynn hillari clinton big woman campu breitbar...
2           whi truth might get you fire consortiumnew com
3        civilian kill in singl us airstrik have been i...
4        iranian woman jail fiction unpublish stori wom...
                               ...                        
20795    rapper t i trump poster child for white suprem...
20796    n f l playoff schedul matchup odd the new york...
20797    maci is said receiv takeov approach hudson bay...
20798    nato russia to hold parallel exercis in balkan...
20799                       what keep f aliv david swanson
Name: content, Length: 20800, dtype: object
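The output above still contains words such as "we", "in" and "the": capitalized tokens are compared against a lowercase stopword list, and only afterwards lowercased by the stemmer. One possible variant lowercases first. In this sketch `stem_lowercased` is a hypothetical name, and the small `STOP_WORDS` set is an assumption standing in for `stopwords.words('english')` so that the example runs without a download.

```python
import re
from nltk.stem.porter import PorterStemmer

demo_stemmer = PorterStemmer()
# Assumption: a tiny stand-in for set(stopwords.words('english')).
STOP_WORDS = {"we", "the", "in", "it", "is", "to", "for", "until"}

def stem_lowercased(content):
    # Lowercase before filtering so "We" and "The" match the stopword list.
    tokens = re.sub("[^a-zA-Z]", " ", content).lower().split()
    return " ".join(demo_stemmer.stem(t) for t in tokens if t not in STOP_WORDS)

result = stem_lowercased("We Saw the Letter in The Times")
print(result)
```

Switching to this variant would change the vocabulary, so the matrix shape and accuracy figures reported below would no longer apply as printed.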

**Train Test Split** with the 'content' column + 'label' column only. The 'text' column is not used because it has too many words.

In [38]:
# separate X and Y
X = news_data['content'].values
Y = news_data['label'].values
# type(Y) # pandas.core.series.Series -> numpy.ndarray
print(X)

['hous dem aid we didn even see comey letter until jason chaffetz tweet it darrel lucu'
 'flynn hillari clinton big woman campu breitbart daniel j flynn'
 'whi truth might get you fire consortiumnew com' ...
 'maci is said receiv takeov approach hudson bay the new york time michael j de la merc rachel abram'
 'nato russia to hold parallel exercis in balkan alex ansari'
 'what keep f aliv david swanson']


**However**, before the train-test split (tts), the text values of 'content' need to be converted into numerical values using TfidfVectorizer.  
*Term Frequency*: how often a given word appears within a document.  
*Inverse Document Frequency*: a factor that downscales words that appear in many documents.
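To make the two terms concrete, here is a minimal sketch on three made-up stemmed headlines (hypothetical, not taken from the dataset). A word that occurs in every document receives a lower weight than a word that occurs in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stemmed headlines, not taken from the dataset.
toy_docs = [
    "trump sign new order",
    "trump meet senat leader",
    "trump poster child",
]

toy_vec = TfidfVectorizer()
toy_matrix = toy_vec.fit_transform(toy_docs)  # sparse matrix, one row per document

vocab = toy_vec.vocabulary_                   # maps each term to its column index
row0 = toy_matrix[0].toarray().ravel()

# "trump" occurs in every document, so its IDF (and weight) is lower than
# that of "order", which occurs in only one.
print(toy_matrix.shape)
print(row0[vocab["trump"]] < row0[vocab["order"]])
```

Each `(row, column)  value` line in the printed matrix below is one such weight: the row is the document, the column is the term index from `vocabulary_`.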

In [39]:
tfidf_vec = TfidfVectorizer()
tfidf_vec.fit(X) # learn the vocabulary and IDF weights (here from all rows, since the split happens later)
X = tfidf_vec.transform(X) # transform documents to a document-term matrix using the vocabulary and document frequencies (df) learned by fit

print(X)
#  (0, 16586)	0.19235766814206556
# (0, 16055)	0.29053629972079575
# (0, 15762)	0.25180682754084327 ....


<class 'scipy.sparse._csr.csr_matrix'>
  (0, 16586)	0.19235766814206556
  (0, 16055)	0.29053629972079575
  (0, 15762)	0.25180682754084327
  (0, 13528)	0.22682424464275375
  (0, 8949)	0.3214177525383042
  (0, 8670)	0.25823746120670493
  (0, 7731)	0.219100353547038
  (0, 7639)	0.1611238443347535
  (0, 7040)	0.19336678366028098
  (0, 4998)	0.20625126961544935
  (0, 4028)	0.2667908047240904
  (0, 3811)	0.23915031396884368
  (0, 3619)	0.31814479104571186
  (0, 2977)	0.21821000835755877
  (0, 2501)	0.3250028762214839
  (0, 272)	0.2387684291853839
  (1, 16892)	0.30071745655510157
  (1, 6849)	0.1904660198296849
  (1, 5528)	0.7143299355715573
  (1, 3587)	0.26373768806048464
  (1, 2831)	0.19094574062359204
  (1, 2241)	0.3827320386859759
  (1, 1910)	0.15521974226349364
  (1, 1512)	0.2939891562094648
  (2, 17095)	0.29337976465513754
  :	:
  (20797, 9631)	0.17242189281191916
  (20797, 9561)	0.2918128273796141
  (20797, 9030)	0.35719284755530417
  (20797, 8404)	0.22049990081059304
  (20797, 7611)	0.

**TIP**  
Are you using train_test_split with a **classification** problem?  
Be sure to set **"stratify=y"** so that class proportions are preserved when splitting: both the train and the test data then contain the 'fake' and 'real' classes in the same ratio as the full dataset.  
This matters most when the classes are imbalanced.
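The effect is easiest to see on imbalanced labels. The sketch below uses made-up labels (90 of class 0, 10 of class 1) and placeholder features; the names `toy_X` and `toy_y` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 of class 0 and 10 of class 1.
toy_y = np.array([0] * 90 + [1] * 10)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=2, stratify=toy_y
)

# With stratify, both splits keep the 10% share of class 1.
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, a random 20-row test set could by chance contain very few or no examples of the minority class.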

In [45]:
pd.Series(Y).value_counts()  # 1    10413  | 0    10387 (pd.value_counts is deprecated in recent pandas)

1    10413
0    10387
dtype: int64

In [40]:
# train test splitting

trainX, testX, trainY, testY = train_test_split(X, Y, test_size=0.2, random_state=2, stratify=Y)

In [41]:
print(trainX.shape, testX.shape, X.shape)

(16640, 17227) (4160, 17227) (20800, 17227)
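In the cells above the vectorizer is fit on every row before the split, so the vocabulary and IDF weights also reflect the test rows. A stricter alternative, sketched here on made-up strings, splits the raw text first and lets a scikit-learn pipeline fit the vectorizer on the training part only. The toy texts and labels are hypothetical, and this is not the approach the notebook's reported numbers come from.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical stemmed strings and labels (1 = fake, 0 = real).
toy_texts = ["shock secret reveal", "senat pass budget bill",
             "alien cure reveal", "court rule tax case",
             "miracl secret cure", "minist visit trade summit",
             "shock alien miracl", "budget tax trade bill"]
toy_labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw text first ...
tr_txt, te_txt, tr_y, te_y = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=2, stratify=toy_labels
)

# ... then fit vectorizer and classifier on the training rows only.
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipe.fit(tr_txt, tr_y)

# predict() transforms the test text with the training vocabulary.
toy_pred = pipe.predict(te_txt)
print(toy_pred)
```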


In [32]:
trainX

array(['liber moder conserv see how facebook label you the new york time jeremi b merril',
       'c i a develop tool spi mac comput wikileak disclosur show the new york time vindu goel',
       'texa man allegedli have sex with fenc arrest warner todd huston',
       ..., '',
       'alton sterl shoot baton roug prompt justic dept investig the new york time richard fausset richard p rez pe campbel robertson',
       'purchas loyalti foreign aid jacob g hornberg'], dtype=object)

**Logistic Regression predicts whether something is TRUE or FALSE**  
**It is a standard, well-suited model for binary classification problems**  
For each variable, we can test whether its effect on the prediction is significantly different from 0.  
If it is not, the variable is not helping the prediction.  
Logistic regression provides probabilities (0-100%) and classifies (True, False) new samples using continuous or discrete measurements.  
1. good for classifying samples
2. can use different types of data (continuous data or discrete data)
3. useful for assessing which variables help classify samples
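The probability comes from passing a weighted sum of the features through the sigmoid function and thresholding the result at 0.5. A minimal sketch, with hypothetical weights, bias and feature values chosen only for illustration:

```python
import math

def sigmoid(z):
    # Maps any real number to a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for two TF-IDF features plus a bias term.
weights = [2.0, -1.5]
bias = 0.25
features = [0.6, 0.2]

z = bias + sum(w * x for w, x in zip(weights, features))
prob_fake = sigmoid(z)
label = 1 if prob_fake >= 0.5 else 0  # 1 = fake, 0 = real

print(round(prob_fake, 3), label)
```

A weight close to 0 means the corresponding feature barely moves `z`, which is the sense in which a variable "is not helping the prediction".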



In [47]:
logis_reg_model = LogisticRegression()

In [48]:
logis_reg_model.fit(trainX, trainY)

**Check accuracy**

In [51]:
train_y_predict = logis_reg_model.predict(trainX)
accuracy_train = accuracy_score(trainY, train_y_predict)  # signature is (y_true, y_pred); accuracy is the same either way

In [52]:
print(accuracy_train) #0.9868990384615385 pretty accurate

0.9868990384615385


**What matters most is not the accuracy on the training data but the accuracy on the test data**

In [54]:
test_y_predict = logis_reg_model.predict(testX)

In [55]:
accuracy_test = accuracy_score(testY, test_y_predict)  # signature is (y_true, y_pred); accuracy is the same either way

In [56]:
print(accuracy_test) #0.9762019230769231

0.9762019230769231
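Accuracy gives one number but does not say which kind of mistake the model makes. A confusion matrix separates real articles flagged as fake from fake articles that slip through. This sketch uses made-up labels and predictions, not the notebook's results:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (1 = fake, 0 = real).
toy_true = [1, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 0, 1, 0, 1]

acc = accuracy_score(toy_true, toy_pred)
# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(toy_true, toy_pred)

print(acc)
print(cm)
```

On the real data, the same call would be `confusion_matrix(testY, test_y_predict)`.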


**Build a prediction system and check its results on test data**

In [63]:
new_X = testX[0]
predict_y = logis_reg_model.predict(new_X) # return [1]
print(predict_y[0] == testY[0])

True


In [64]:
new_X = testX[10]
predict_y = logis_reg_model.predict(new_X) # return [1]
print(predict_y[0] == testY[10])

True
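The checks above reuse rows that are already vectorized. For a new headline, the same preprocessing has to run first, and the fitted vectorizer must be reused with `transform()`. The sketch below wires these steps together on a made-up training set; `toy_preprocess`, `predict_news` and all rows are hypothetical, and stopword removal is left out so the example runs without a download.

```python
import re
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

toy_stemmer = PorterStemmer()

def toy_preprocess(title, author):
    # Same idea as stemming(): keep letters only, then stem each token.
    tokens = re.sub("[^a-zA-Z]", " ", title + " " + author).split()
    return " ".join(toy_stemmer.stem(t) for t in tokens)

# Hypothetical training rows (1 = fake, 0 = real); not from the dataset.
rows = [("Shocking secret cure revealed", "Anon Blogger", 1),
        ("Senate passes budget bill", "Jane Doe", 0),
        ("Aliens endorse miracle diet", "Anon Blogger", 1),
        ("Court rules on tax case", "John Roe", 0)]
texts = [toy_preprocess(t, a) for t, a, _ in rows]
labels = [y for _, _, y in rows]

toy_vec = TfidfVectorizer()
toy_model = LogisticRegression().fit(toy_vec.fit_transform(texts), labels)

def predict_news(title, author):
    # Reuse the fitted vectorizer: transform(), never fit(), on new text.
    features = toy_vec.transform([toy_preprocess(title, author)])
    return "Fake" if toy_model.predict(features)[0] == 1 else "Real"

verdict = predict_news("Secret miracle cure", "Anon Blogger")
print(verdict)
```

With the notebook's own objects, the equivalent call chain would be `stemming(...)`, then `tfidf_vec.transform([...])`, then `logis_reg_model.predict(...)`.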
