In [None]:
# Name: Shamita Goyal

In this lab you will train an ML model to identify whether a news article is real news or fake news (Just in time for the election coming up in November).

The input file is `news.csv` ([source](https://www.kaggle.com/datasets/meruvulikith/realtime-news-classification-dataset/data))

In [270]:
# import modules
import pandas as pd
import matplotlib.pyplot as plt
import random
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

1. Read data from `news.csv` into a DataFrame.<br>
Then __print the number of rows and columns of the DataFrame__, and __print the first 5 rows of the DataFrame__

In [272]:
news_data = pd.read_csv("news.csv", encoding='ISO-8859-1')
news_data.head()

Unnamed: 0,title,text,subject,date,label
0,Ben Stein Calls Out 9th Circuit Court: Committ...,"21st Century Wire says Ben Stein, reputable pr...",US_News,"February 13, 2017",fake
1,Trump drops Steve Bannon from National Securit...,WASHINGTON (Reuters) - U.S. President Donald T...,politicsNews,"April 5, 2017",true
2,Puerto Rico expects U.S. to lift Jones Act shi...,(Reuters) - Puerto Rico Governor Ricardo Rosse...,politicsNews,"September 27, 2017",true
3,OOPS: Trump Just Accidentally Confirmed He Le...,"On Monday, Donald Trump once again embarrassed...",News,"May 22, 2017",fake
4,Donald Trump heads for Scotland to reopen a go...,"GLASGOW, Scotland (Reuters) - Most U.S. presid...",politicsNews,"June 24, 2016",true


2. Analyze data

2a. __Print the subject and count of each subject__

In [274]:
news_data.subject.value_counts()

subject
politicsNews       11272
worldnews          10145
News                9050
politics            6841
left-news           4459
Government News     1570
US_News              783
Middle-east          778
Name: count, dtype: int64

2b. Use regular expression, change the date to show only the years<br>
Then __print the year and count of each year__ to see how old the training data is.

In [275]:
news_data["date"] = news_data["date"].str.extract(r', (\d{4})')
news_data.date.value_counts()

date
2017    25904
2016    16470
2015     2479
Name: count, dtype: int64

2c. __Show whether the label data are balanced__<br>
then __create a RawNB Convert cell to explain your result__.

In [276]:
news_data.label.value_counts()

label
fake    23481
true    21417
Name: count, dtype: int64

3. Clean data

3a. Since the titles are generally short, we will train the model with the text.<br>
The subject can be dropped since there can be real and fake news on any subject.<br>
Similarly the date can be dropped since real and fake news can appear on any day.

__Drop the title, subject, and date__ columns.

In [277]:
cleanedD = news_data[["text","label"]]

3b. __Check for NaNs and drop any NaNs__.

In [278]:
cleanedD.isna().sum()

text     0
label    0
dtype: int64

__Print the resulting DataFrame__.

In [279]:
cleanedD.head()

Unnamed: 0,text,label
0,"21st Century Wire says Ben Stein, reputable pr...",fake
1,WASHINGTON (Reuters) - U.S. President Donald T...,true
2,(Reuters) - Puerto Rico Governor Ricardo Rosse...,true
3,"On Monday, Donald Trump once again embarrassed...",fake
4,"GLASGOW, Scotland (Reuters) - Most U.S. presid...",true


3c. Note that row index 44893 has no news article. Perhaps there are other rows with empty text.<br>
__Find the number of rows where the text looks empty__, like row index 44893

In [280]:
print("There are", len(cleanedD.loc[cleanedD['text'].str.contains(r'^\s*$', regex=True)]), "rows where the text looks empty.")

There are 631 rows where the text looks empty.


__Create a new DataFrame that doesn't contain these rows__ of empty text.<br>
Then __find the difference__ by subtracting:<br>
(number of rows of the original DataFrame) - (number of rows of new DataFrame)

The difference should be the same as the number of rows where the text is empty.

In [281]:
index = cleanedD.loc[cleanedD['text'].str.contains(r'^\s*$', regex=True)].index
cleanedD = cleanedD.drop([i for i in index])

In [282]:
news_data.shape[0] - cleanedD.shape[0]

631

4. Create X and y datasets

4a. Now that the data is cleaned, __create the X and y datasets__ from the new DataFrame.<br>
Then __print the number of rows and columns of X and y__.

In [283]:
y = cleanedD.label
label = {"fake":0, "true":1}
y = y.replace(label)
X = cleanedD.drop(columns=['label'])
X = X.reset_index(drop =True)

print(f"the rows and columns of y are {y.shape}.")
print(f"the rows and columns of X are {X.shape}.")

the rows and columns of y are (44267,).
the rows and columns of X are (44267, 1).


5. Preprocessing X

__Print the X dataset__, then __preprocess X__ as discussed in class.

_It's a good idea to use multiple code cells, one for each main step of preprocessing_.<br>
_Also note that some processing steps could take a minute or two_.

In [284]:
# print the X dataset
X.head()

Unnamed: 0,text
0,"21st Century Wire says Ben Stein, reputable pr..."
1,WASHINGTON (Reuters) - U.S. President Donald T...
2,(Reuters) - Puerto Rico Governor Ricardo Rosse...
3,"On Monday, Donald Trump once again embarrassed..."
4,"GLASGOW, Scotland (Reuters) - Most U.S. presid..."


In [285]:
tokenizer = RegexpTokenizer('[a-z]+')
stop_words=set(stopwords.words("english"))
stemmer = PorterStemmer()

In [286]:
# 1.remove all numbers and punctuations (ony words with letters should remain)
# 2. change all words to lowercase
# 3. remove all stop words
# 4. stem all words

def preprocess(s) :
    w = tokenizer.tokenize(s.lower())  
    w = [word for word in w if word not in stop_words] 
    w = [stemmer.stem(word) for word in w] 
    return ' '.join(w)        

# with each row, preprocess the text string, and store all rows in the X_processed DataFrame
X_processed = pd.DataFrame([preprocess(X.loc[i,'text']) for i in range(len(X))])
X_processed.head()  

Unnamed: 0,0
0,st centuri wire say ben stein reput professor ...
1,washington reuter u presid donald trump remov ...
2,reuter puerto rico governor ricardo rossello s...
3,monday donald trump embarrass countri accident...
4,glasgow scotland reuter u presidenti candid go...


6. Train an ML model.
- __Split into training and testing sets__
- __Train the ML model__
- __Show the accuracy__

In [287]:
vect = CountVectorizer()
vect.fit(X_processed[0])
X_vectors = vect.transform(X_processed[0])
X_vectors.shape

(44267, 89633)

In [288]:
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_vectors,y,test_size=0.2)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(35413, 89633) (35413,) (8854, 89633) (8854,)


In [289]:
# train the model
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

In [290]:
# accuracy 
print("the accuracy score is:", metrics.accuracy_score(y_test, y_pred))
print("the confusion matrix is:")
metrics.confusion_matrix(y_test, y_pred, labels=[0,1])

the accuracy score is: 0.9489496272871019
the confusion matrix is:


array([[4272,  240],
       [ 212, 4130]])

---

7. Further analysis

7a. __Print the DataFrame__ from step 3b

Note that news article with (Reuters) seem to be labeled 'true' most of the time.<br>
From the rows with Reuters in the text, __print the number of 'true' and 'fake' labels__.

In [291]:
cleanedD.head()

Unnamed: 0,text,label
0,"21st Century Wire says Ben Stein, reputable pr...",fake
1,WASHINGTON (Reuters) - U.S. President Donald T...,true
2,(Reuters) - Puerto Rico Governor Ricardo Rosse...,true
3,"On Monday, Donald Trump once again embarrassed...",fake
4,"GLASGOW, Scotland (Reuters) - Most U.S. presid...",true


In [292]:
cleanedD.label[cleanedD['text'].str.contains('Reuters')].value_counts()

label
true    21378
fake      311
Name: count, dtype: int64

7b. __Create a RawNB Convert to explain the result__ of step 7a.

7c. Check to see if Reuters have an effect on the accuracy of the model.<br>
Using multiple code cells:
- Start from the same X and y as step 4
- __Remove the word Reuters from X and preprocess X__
- __Train and check the accuracy of the model__

In [293]:
# remove the word reuters from X
X['text'] = X['text'].str.replace('(Reuters)', '')
X

Unnamed: 0,text
0,"21st Century Wire says Ben Stein, reputable pr..."
1,WASHINGTON - U.S. President Donald Trump remo...
2,- Puerto Rico Governor Ricardo Rossello said ...
3,"On Monday, Donald Trump once again embarrassed..."
4,"GLASGOW, Scotland - Most U.S. presidential ca..."
...,...
44262,Miss Universe 1996 Alicia Machado is now an Am...
44263,LONDON/TOKYO - British Prime Minister Theresa...
44264,BERLIN - Chancellor Angela Merkel said German...
44265,Jesus f*cking Christ our President* is a moron...


In [None]:
# preprocess X again 
def preprocess(s) :
    w = tokenizer.tokenize(s.lower())  
    w = [word for word in w if word not in stop_words] 
    w = [stemmer.stem(word) for word in w] 
    return ' '.join(w)        

# with each row, preprocess the text string, and store all rows in the X_processed DataFrame
X_noR = pd.DataFrame([preprocess(X.loc[i,'text']) for i in range(len(X))])
X_noR.head()  

In [294]:
vect1 = CountVectorizer()
vect1.fit(X_noR[0])
X_vectors = vect1.transform(X_noR[0])
X_vectors.shape

(44267, 89633)

In [295]:
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_vectors,y,test_size=0.2)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(35413, 89633) (35413,) (8854, 89633) (8854,)


In [296]:
# train the model
classifier_noR = MultinomialNB()
classifier_noR.fit(X_train, y_train)
y_pred = classifier_noR.predict(X_test)

In [297]:
# accuracy 
print("the accuracy score is:", metrics.accuracy_score(y_test, y_pred))
print("the confusion matrix is:")
metrics.confusion_matrix(y_test, y_pred, labels=[0,1])

the accuracy score is: 0.9436412920713801
the confusion matrix is:


array([[4266,  238],
       [ 261, 4089]])

7d. Does removing Reuters make the accuracy go up, go down, or remain the same?<br>
__Create a RawNB Convert cell for your answer__.

Removing the reuters makes the accuracy go only a little down but it is almost the same as the previous accuracy score we found for the data with the reuters still in the news article titles. 

---

8. Test the ML model with real data.

8a. Follow these steps to get your choice of data.
- Go to one of the official news source (ABC News, BBC News, Reuters News, AP News, etc.) and choose 2-3 news articles. Copy the first 2-3 sentences of each article into a Python string.
- Go to the AP News fake news [site](https://apnews.com/hub/not-real-news), or other fake news site you know, and choose 1-3 articles. Copy the first 2-3 sentences of each article into a Python string. Make sure you copy the fake news part only and not the explanation for it, if there are any.
- You should end up with 4-6 Python strings.

In [312]:
# trusted news sources:

abcNews = """The biggest night in Hollywood is finally here: The 2024 Oscars take place tonight, March 10.
Jimmy Kimmel is hosting the 96th Academy Awards, a ceremony which will honor excellence in cinematic achievements for some of the past year's biggest films.
"Oppenheimer" has the most nominations heading into the night with a total of 13 nods."""

bbcNews = """Donald Trump and Joe Biden have both held campaign rallies in the US state of Georgia, as their general election showdown comes into greater focus.
The former president, 77, slammed Thursday's State of the Union speech as an "angry, dark and hate-filled rant". """

reutersNews = """ March 10 (Reuters) - Ukraine on Sunday rebuffed Pope Francis's call to negotiate an end to the war with Russia, with President Volodymyr Zelenskiy saying 
the pontiff was engaging in "virtual mediation" and his foreign minister saying Kyiv would never capitulate. Francis said that when things were going badly for a party to
a conflict one had to show the "courage of the white flag" and negotiate. """


# fake news sources:

apNews = """Executive Order 9066 authorized Japanese detention during World War II, not gift cards for recent migrants
CLAIM: President Joe Biden issued Executive Order 9066, which provides people who enter the U.S. illegally with a $5,000 Visa gift card."""

hollywoodGazette = """This article investigates the real story behind the legendary bloodied and bandaged face worn by Weeknd in his music video titled “What Happened to Weekend Face.”
Are you looking forward to finding out more about What Happened to Weekend Face?
If such is the case, you have arrived to the right place to quickly and easily get complete information on it."""


8b. __Create a list named `y_actual`__ which contains the 'true' or 'fake' value corresponding to the strings you have above.<br>
Make sure the y_actual data is in the same order as your news strings above (otherwise all bets are off).

In [313]:
y_actual = ["true", "true", "true", "false", "false"]

8c. __Create a DataFrame from your 4-6 news strings__ so that it looks similar to the X DataFrame of step 5 (not counting the index)

In [314]:
testDf = pd.DataFrame([abcNews,bbcNews,reutersNews,apNews,hollywoodGazette], columns=["text"])

In [315]:
testDf

Unnamed: 0,text
0,The biggest night in Hollywood is finally here...
1,Donald Trump and Joe Biden have both held camp...
2,March 10 (Reuters) - Ukraine on Sunday rebuff...
3,Executive Order 9066 authorized Japanese deten...
4,This article investigates the real story behin...


In [316]:
# preprocess X again 
def preprocess(s) :
    w = tokenizer.tokenize(s.lower())  
    w = [word for word in w if word not in stop_words] 
    w = [stemmer.stem(word) for word in w] 
    return ' '.join(w)        

# with each row, preprocess the text string, and store all rows in the X_processed DataFrame
X_news = pd.DataFrame([preprocess(testDf.loc[i,'text']) for i in range(len(testDf))])
X_news.head()  

Unnamed: 0,0
0,biggest night hollywood final oscar take place...
1,donald trump joe biden held campaign ralli us ...
2,march reuter ukrain sunday rebuf pope franci c...
3,execut order author japanes detent world war i...
4,articl investig real stori behind legendari bl...


8d. __Test your ML model with your own test data__.

In [317]:
XNews = vect.transform(X_news[0])
XNews.shape

(5, 89633)

In [318]:
prediction = classifier.predict(XNews)
prediction

array([0, 0, 1, 0, 0])

8e. __Create a DataFrame to show the actual y and predicted y values__.

In [319]:
df = pd.DataFrame({"prediction":prediction, "actual values":y_actual})
df

Unnamed: 0,prediction,actual values
0,0,True
1,0,True
2,1,True
3,0,False
4,0,False


8f. __Explain your test result__ in a RawNB Convert cell.
- Is it about the same rate as the model's accuracy?
- What could you do with the news articles that you choose, that would cause your test result to be lower than the accuracy? Refer to observations you've made in the training data in the first steps of the notebook.