In this notebook, I use text mining to predict whether a tweet is fake or true.

In [3]:
import pandas as pd
import fakenews_utilities as fns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

### Read the data file

In [2]:
df = pd.read_csv('tweets_labeled.csv')
df.head()

Unnamed: 0,tweet_id,text,label
0,1161040537207463936,'RT @SenJeffMerkley: The Endangered Species Ac...,1
1,1176360756239118342,'RT @LindseyGrahamSC: Interesting concept -- i...,1
2,1099036648573145088,'RT @RealJamesWoods: #BuildTheWall #DeportThem...,0
3,1092915693203480577,'RT @PatriotJackiB: Why would the MEXICAN GOV’...,0
4,1149038450668187654,'RT @TheOnion: Sweden Announces Plan To Get 10...,0


In [4]:
df_clean = fns.wash_pandas_str(df)
df_clean.head()

Unnamed: 0,tweet_id,text,label
0,1161040537207463936,The Endangered Species Act saved the bald eagl...,1
1,1176360756239118342,"Interesting concept -- impeach first, find fac...",1
2,1099036648573145088,#BuildTheWall #DeportThemAll,0
3,1092915693203480577,Why would the MEXICAN GOV’T fund this? Who are...,0
4,1149038450668187654,Sweden Announces Plan To Get 100% Of Energy Fr...,0


### Generate a document-feature matrix

In [5]:
text = df_clean['text'].values.astype('U')  # tweet text as a NumPy array of Unicode strings
vect = CountVectorizer(stop_words='english')  # bag-of-words counter that drops English stop words
vect = vect.fit(text)  # learn the vocabulary from the tweet text
docu_feat = vect.transform(text)  # sparse document-feature matrix of token counts
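
To make the idea of a document-feature matrix concrete, here is a minimal sketch on a toy corpus. The two sentences and the `toy_*` names are made up for illustration and are not part of the tweet data. Each row is a document, each column is a vocabulary word, and each cell is a count.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the tweet dataset
toy_docs = ["the wall and the border wall", "the eagle and the border"]

toy_vect = CountVectorizer(stop_words='english')
toy_matrix = toy_vect.fit_transform(toy_docs)  # fit and transform in one call

# Vocabulary words ordered by their column index
toy_vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
toy_counts = toy_matrix.toarray()  # dense view; fine here because the toy matrix is tiny

print(toy_vocab)
print(toy_counts)
```

The stop words "the" and "and" get no column, which is why only three features remain.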


In [6]:
print(docu_feat)

  (0, 663)	1
  (0, 1789)	1
  (0, 5944)	1
  (0, 6237)	1
  (0, 10529)	1
  (0, 14172)	1
  (0, 15298)	1
  (0, 15950)	1
  (0, 16584)	1
  (0, 17025)	1
  (1, 3892)	1
  (1, 6780)	1
  (1, 9199)	1
  (1, 9647)	1
  (1, 10502)	1
  (1, 10818)	1
  (1, 11249)	1
  (1, 11545)	1
  (1, 16964)	1
  (1, 17420)	1
  (1, 19451)	1
  (1, 20310)	1
  (2, 2733)	1
  (2, 5125)	1
  (3, 2876)	1
  :	:
  (225402, 4429)	1
  (225402, 9028)	1
  (225402, 9111)	1
  (225402, 11723)	1
  (225402, 14761)	1
  (225402, 14999)	1
  (225403, 2409)	1
  (225403, 2655)	2
  (225403, 3035)	1
  (225403, 5827)	1
  (225403, 11681)	1
  (225403, 12945)	1
  (225403, 13180)	1
  (225403, 19567)	1
  (225403, 20078)	1
  (225404, 1201)	1
  (225404, 3013)	1
  (225404, 3067)	1
  (225404, 6420)	1
  (225404, 7522)	1
  (225404, 7680)	1
  (225404, 8963)	1
  (225404, 13355)	1
  (225404, 16319)	1
  (225404, 20257)	1
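
Each printed line is a `(row, column)` pair followed by a count: the row is the tweet, the column is a vocabulary index, and the count is how often that word occurs in that tweet. The sketch below, on a hypothetical two-document corpus, shows one way to turn the column indices back into words. The names `docs`, `cv`, and `index_to_token` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus used only for illustration
docs = ["wall wall border", "eagle border"]
cv = CountVectorizer()
m = cv.fit_transform(docs)

# Invert the vocabulary mapping: column index -> token
index_to_token = {idx: tok for tok, idx in cv.vocabulary_.items()}

# COO format exposes the same (row, column, count) triples that print() shows
coo = m.tocoo()
triples = sorted(
    (int(r), index_to_token[int(c)], int(v))
    for r, c, v in zip(coo.row, coo.col, coo.data)
)
for row, token, count in triples:
    print(f"document {row}: '{token}' x{count}")
```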


### Building the model

Use the Naïve Bayes classifier from `sklearn`.

In [7]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
x = docu_feat  # the document-feature matrix is the feature matrix
y = df_clean['label']  # the label column is the target vector

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)  # hold out 30% for testing
nb = nb.fit(x_train, y_train)  # fit the model: x = token counts, y = labels
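
One possible variation, shown here only as a sketch, is to split the raw text first and let a pipeline fit the vectorizer on the training portion alone, so the vocabulary is never learned from test tweets. A fixed `random_state` also makes the split repeatable. The miniature `texts` and `labels` lists are hypothetical stand-ins for `df_clean['text']` and `df_clean['label']`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical miniature dataset standing in for the tweet text and labels
texts = [
    "senate passes wildlife protection bill",
    "committee schedules budget hearing",
    "governor signs education funding law",
    "court upholds voting rights ruling",
    "aliens secretly control the weather",
    "miracle pill cures every disease overnight",
    "moon landing filmed in a basement",
    "celebrity cloned by secret lab",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text before any vocabulary is learned
text_train, text_test, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits CountVectorizer on the training text only
model = make_pipeline(CountVectorizer(stop_words='english'), MultinomialNB())
model.fit(text_train, y_tr)
preds = model.predict(text_test)
print(preds)
```

With only eight made-up sentences the predictions themselves mean little; the point is the order of the steps.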


### Evaluating the model

In [8]:
y_test_p = nb.predict(x_test)
nb.score(x_test, y_test)

0.9665641359320931

The accuracy is 96.7%.

In [9]:
df_clean['label'].value_counts(normalize=True)

0    0.680828
1    0.319172
Name: label, dtype: float64

Label 0 accounts for 68.1% of the tweets and label 1 for 31.9%, so the classes are imbalanced.
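
These class shares give a reference point for the accuracy score: a model that always predicts the majority label would already be right about as often as that label occurs. The sketch below compares the two numbers copied from the outputs above. It assumes the test split has roughly the same class mix as the full dataset, and the variable names are illustrative.

```python
# Class shares reported by value_counts(normalize=True) above
class_share = {0: 0.680828, 1: 0.319172}
model_accuracy = 0.9665641359320931  # nb.score(x_test, y_test) from the notebook

# Always predicting the most common label scores that label's share
majority_label = max(class_share, key=class_share.get)
baseline_accuracy = class_share[majority_label]

# Fraction of the baseline's errors that the classifier removes
error_reduction = 1 - (1 - model_accuracy) / (1 - baseline_accuracy)

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"model accuracy:    {model_accuracy:.3f}")
print(f"error reduction:   {error_reduction:.3f}")
```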

Let's create a confusion matrix.

In [10]:
cm = confusion_matrix(y_test, y_test_p)
cm = pd.DataFrame(cm, index=['0', '1'], columns=['0-pred', '1-pred'])
cm

Unnamed: 0,0-pred,1-pred
0,44161,1808
1,453,21200


Let's calculate precision and recall.

In [11]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_p))

              precision    recall  f1-score   support

           0       0.99      0.96      0.98     45969
           1       0.92      0.98      0.95     21653

    accuracy                           0.97     67622
   macro avg       0.96      0.97      0.96     67622
weighted avg       0.97      0.97      0.97     67622



For label 0, precision is 0.99 and recall is 0.96.

For label 1, precision is 0.92 and recall is 0.98.
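
The figures in the report can be reproduced by hand from the confusion matrix. The sketch below copies the four counts from the output above. Precision divides each diagonal entry by its column total (everything predicted as that label), and recall divides it by its row total (everything that actually has that label). The variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the notebook: rows = actual label, columns = predicted label
cm_values = np.array([[44161, 1808],
                      [453, 21200]])

correct = np.diag(cm_values)                 # correctly classified tweets per label
precision = correct / cm_values.sum(axis=0)  # divide by column totals (predicted counts)
recall = correct / cm_values.sum(axis=1)     # divide by row totals (actual counts)
accuracy = correct.sum() / cm_values.sum()   # all correct predictions over all tweets

for label in (0, 1):
    print(f"label {label}: precision={precision[label]:.2f}, recall={recall[label]:.2f}")
print(f"accuracy={accuracy:.4f}")
```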
