In [29]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB

In [30]:
df = pd.read_csv('Fake News Data.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


In [31]:
x = df.text
y = df.label

In [32]:
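# Note: train_size=0.33 keeps only a third of the rows for training
# (2,090 training rows vs. 4,245 test rows in the outputs below)
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.33, random_state=53)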

In [33]:
# Initialize a CountVectorizer object: count_vectorizer
count_vectorizer = CountVectorizer(stop_words="english")

# Transform the training data using only the 'text' column values
count_train = count_vectorizer.fit_transform(x_train.values)

# Transform the test data using only the 'text' column values
count_test = count_vectorizer.transform(x_test.values)

# Print the first 10 features of the count_vectorizer
count_vectorizer.get_feature_names_out()[:10]

array(['00', '000', '0000', '00000031', '000km', '001', '003', '004',
       '006s', '008'], dtype=object)


**CountVectorizer for text classification**

We use pandas alongside scikit-learn to build sparse text vectors for training and testing a simple supervised model. The CountVectorizer set up above maps each article to raw term counts; the TfidfVectorizer below reweights those counts.
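As a rough illustration of what term counting produces, here is a minimal pure-Python sketch on a made-up three-document corpus. The helper `bag_of_words` and the toy sentences are hypothetical. The regular expression mirrors the idea of keeping tokens of two or more word characters, and stop-word removal is left out for brevity, so this is not scikit-learn's implementation.

```python
from collections import Counter
import re

def bag_of_words(docs):
    """Build a sorted vocabulary and per-document term counts (toy illustration)."""
    # Lowercase each document and keep tokens of two or more word characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    # The vocabulary is the sorted set of all tokens seen in any document
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        # One column per vocabulary term; absent terms count as 0
        rows.append([counts[term] for term in vocab])
    return vocab, rows

# Hypothetical toy corpus, not taken from the dataset
toy_docs = ["Fake news spreads fast", "Real news travels slowly", "Fake fake fake"]
vocab, rows = bag_of_words(toy_docs)
print(vocab)
for row in rows:
    print(row)
```

Each row has one entry per vocabulary term, and most entries are zero. That is why the real vectorizers return sparse matrices rather than dense arrays.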

In [34]:
tfidf_vectorizer =  TfidfVectorizer(stop_words="english", max_df=0.7)
tfidf_train = tfidf_vectorizer.fit_transform(x_train.values)
tfidf_test = tfidf_vectorizer.transform(x_test.values)
tfidf_vectorizer.get_feature_names_out()[:10]

array(['00', '000', '0000', '00000031', '000km', '001', '003', '004',
       '006s', '008'], dtype=object)

In [35]:
tfidf_train

<2090x42427 sparse matrix of type '<class 'numpy.float64'>'
	with 553402 stored elements in Compressed Sparse Row format>

In [36]:
tfidf_test

<4245x42427 sparse matrix of type '<class 'numpy.float64'>'
	with 1075745 stored elements in Compressed Sparse Row format>

In [37]:
# Convert the first five rows to a dense array (.toarray() is the explicit form of the .A shorthand)
tfidf_train[:5].toarray()

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.03510333, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

**Inspecting the vectors**


To get a better idea of how the vectors work, we investigate them by converting them into pandas DataFrames.

In [38]:
# Create the CountVectorizer DataFrame (dense copy of the sparse counts)
count_df = pd.DataFrame(count_train.toarray(), columns=count_vectorizer.get_feature_names_out())

# Create the TfidfVectorizer DataFrame (dense copy of the sparse weights)
tfidf_df = pd.DataFrame(tfidf_train.toarray(), columns=tfidf_vectorizer.get_feature_names_out())


In [39]:
count_df

Unnamed: 0,00,000,0000,00000031,000km,001,003,004,006s,008,...,حلب,عربي,عن,لم,ما,محاولات,من,هذا,والمرضى,ยงade
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2085,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2086,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2087,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2088,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [40]:
tfidf_df

Unnamed: 0,00,000,0000,00000031,000km,001,003,004,006s,008,...,حلب,عربي,عن,لم,ما,محاولات,من,هذا,والمرضى,ยงade
0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.035103,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2085,0.0,0.014323,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2086,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2087,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2088,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [41]:
# Calculate the difference in columns: difference
difference = set(count_df.columns) - set(tfidf_df.columns)
print(difference)

set()


In [42]:
# Check whether the DataFrames are equal
count_df.equals(tfidf_df)

False

**Training and testing the "fake news" model with CountVectorizer**

In [43]:
nb_classifier = MultinomialNB()

In [44]:
nb_classifier.fit(count_train, y_train)
metrics.accuracy_score(y_test, nb_classifier.predict(count_test))

0.8810365135453475

In [45]:
metrics.confusion_matrix(y_test, nb_classifier.predict(count_test), labels=['FAKE', 'REAL'])

array([[1738,  349],
       [ 156, 2002]])

**Training and testing the "fake news" model with TfidfVectorizer**

In [46]:
nb_classifier.fit(tfidf_train, y_train)
metrics.accuracy_score(y_test, nb_classifier.predict(tfidf_test))

0.8473498233215547

In [47]:
metrics.confusion_matrix(y_test, nb_classifier.predict(tfidf_test), labels=['FAKE', 'REAL'])

array([[1512,  575],
       [  73, 2085]])
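Accuracy alone hides how the two models differ. The sketch below recomputes accuracy from the two confusion matrices printed above and adds recall and precision for the FAKE class. It relies on the scikit-learn convention that rows are true labels and columns are predicted labels, here ordered FAKE then REAL. The helper `summarize` is hypothetical.

```python
def summarize(matrix):
    """Derive accuracy plus FAKE recall and precision from a 2x2 confusion matrix.

    Rows are true labels and columns are predicted labels, ordered FAKE, REAL.
    """
    (tp, fn), (fp, tn) = matrix  # FAKE is treated as the positive class
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "fake_recall": tp / (tp + fn),
        "fake_precision": tp / (tp + fp),
    }

# Confusion matrices copied from the notebook outputs
count_cm = [[1738, 349], [156, 2002]]
tfidf_cm = [[1512, 575], [73, 2085]]

count_summary = summarize(count_cm)
tfidf_summary = summarize(tfidf_cm)
for name, summary in [("count", count_summary), ("tfidf", tfidf_summary)]:
    print(name, {key: round(value, 4) for key, value in summary.items()})
```

With the default alpha, the TF-IDF model labels more FAKE articles as REAL (575 versus 349) but mislabels fewer REAL articles as FAKE (73 versus 156).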

**Improving the model**

Next, we test a few different alpha (smoothing) values with the TF-IDF vectors to see whether any of them performs better than the default.
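For intuition about what alpha controls, the sketch below applies additive smoothing to the term counts of a single class. The helper `smoothed_likelihoods` and the counts are hypothetical, and this shows only the general formula, not MultinomialNB's internal code.

```python
def smoothed_likelihoods(term_counts, alpha):
    """Return P(term | class) estimated with additive smoothing."""
    total = sum(term_counts.values())
    vocab_size = len(term_counts)
    # Every term gets alpha pseudo-counts, so unseen terms keep a non-zero probability
    return {
        term: (count + alpha) / (total + alpha * vocab_size)
        for term, count in term_counts.items()
    }

# Hypothetical term counts for one class; "budget" was never seen in it
fake_counts = {"shocking": 3, "senate": 1, "budget": 0}
low_alpha = smoothed_likelihoods(fake_counts, 0.1)
high_alpha = smoothed_likelihoods(fake_counts, 1.0)

for alpha, probs in [(0.1, low_alpha), (1.0, high_alpha)]:
    print(alpha, {term: round(p, 3) for term, p in probs.items()})
```

Larger alpha values pull the estimates towards a uniform distribution. Smaller values trust the observed counts more, and alpha equal to 0 leaves unseen terms with zero probability.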


In [48]:
# Create the list of alphas: alphas
alphas = np.arange(0, 1, 0.1)

# Define train_and_predict()
def train_and_predict(alpha):
    # Instantiate the classifier: nb_classifier
    nb_classifier = MultinomialNB(alpha = alpha)
    # Fit to the training data
    nb_classifier.fit(tfidf_train, y_train)
    # Predict the labels: pred
    pred = nb_classifier.predict(tfidf_test)
    # Compute accuracy: score
    score = metrics.accuracy_score(y_test, pred)
    return score

# Iterate over the alphas and print the corresponding score
for alpha in alphas:
    print('Alpha: ', alpha)
    print('Score: ', train_and_predict(alpha))
    print()

Alpha:  0.0
Score:  0.8650176678445229

Alpha:  0.1
Score:  0.8954063604240282

Alpha:  0.2
Score:  0.887396937573616

Alpha:  0.30000000000000004
Score:  0.880565371024735

Alpha:  0.4
Score:  0.8751472320376914

Alpha:  0.5
Score:  0.8685512367491166

Alpha:  0.6000000000000001
Score:  0.8605418138987043

Alpha:  0.7000000000000001
Score:  0.855359246171967

Alpha:  0.8
Score:  0.8530035335689046

Alpha:  0.9
Score:  0.850412249705536



In [49]:
with open('bbc_news.txt', 'w') as f:
  f.write(x.iloc[2])

In [50]:
with open('bbc_news.txt', 'r') as f:
  random_text = f.read()

In [51]:
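# Vectorize the saved article text. Note: nb_classifier was last fit on TF-IDF features (In [46]);
# the count features used here line up only because both vectorizers share the same vocabulary.
vectorized_test = count_vectorizer.transform([random_text])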

In [52]:
nb_classifier.predict(vectorized_test)

array(['REAL'], dtype='<U4')

In [53]:
# True label of the article saved above (x.iloc[2]), so index y rather than y_test
y.iloc[2]

'REAL'