In [1]:
import pandas as pd # import pandas library to create and manipulate dataframes

df = pd.read_csv("WELFAKE_Dataset.csv") # read the csv file into a dataframe

df.head() # display the first 5 rows of the dataframe

Unnamed: 0.1,Unnamed: 0,title,text,label
0,0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1
1,1,,Did they post their votes for Hillary already?,1
2,2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1
3,3,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0
4,4,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1


In [2]:
df.shape # checking the dimensions of the dataframe

(72134, 4)

In [3]:
df = df.sample(n = 5000,random_state = 42).reset_index(drop = True)
# reduce the dataframe to 5000 randomly chosen rows; random_state = 42 ensures that the same 5000 rows are selected every time
# reset_index(drop = True) renumbers the rows starting from 0, and drop = True discards the old index instead of keeping it as a column
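
The effect of `random_state` and `reset_index(drop = True)` can be checked on a small frame. This is a minimal sketch: the `toy` dataframe and its contents are made up for illustration and are not part of the WELFake data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the full dataset
toy = pd.DataFrame({
    "text": [f"article {i}" for i in range(10)],
    "label": [i % 2 for i in range(10)],
})

# Two draws with the same seed pick the same rows in the same order
first = toy.sample(n=4, random_state=42).reset_index(drop=True)
second = toy.sample(n=4, random_state=42).reset_index(drop=True)

print(first.equals(second))  # whether both samples are identical
print(list(first.index))     # index after reset_index(drop=True)
```

Without `drop = True`, the old row numbers would be kept as an extra column named `index`.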

In [4]:
df['label'].value_counts() # count the number of occurrences of each unique value in the 'label' column

label
1    2528
0    2472
Name: count, dtype: int64

The next step is data cleaning.

In [5]:
import re # import the 're' library to use regex, i.e. regular expressions

def clean_text(text): # create a function to clean the text
    text = str(text).lower() # convert the text to lower case

    text = re.sub(r"[^a-z0-9\s]", '', text) # remove every character that is not a lowercase letter, a digit or whitespace

    text = " ".join(text.split()) # remove extra spaces
    
    return text # return the cleaned text

df['clean_text'] = df['text'].apply(clean_text) # create a new column named 'clean_text' that holds the cleaned version of each row's 'text' value
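
To see what the cleaning does in practice, here is a small sketch that repeats the same three steps on a made-up headline and on a missing value. The sample strings are assumptions chosen for illustration.

```python
import re

def clean_text(text):
    # Same steps as the function above, repeated so this block runs on its own
    text = str(text).lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return " ".join(text.split())

# Hypothetical input with punctuation, a curly apostrophe and extra spaces
sample = "BREAKING:  Obama’s   plan, 100% real!"
cleaned = clean_text(sample)
print(cleaned)

# str() turns a missing value (NaN) into the literal string 'nan'
missing = clean_text(float("nan"))
print(repr(missing))
```

Two side effects are worth knowing about. Punctuation is deleted rather than replaced by a space, so "Obama’s" becomes "obamas". Any missing value in the 'text' column is turned into the word "nan" instead of an empty string.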

Now we are ready to train our model. First we will split the dataset into training and testing sets.

In [None]:
from sklearn.model_selection import train_test_split # import train_test_split function from scikit-learn to split the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(df['clean_text'], df['label'], test_size = 0.2, random_state = 42)
# split the dataset into 80% training data and 20% testing data by setting test_size = 0.2
# random_state = 42 ensures that the dataset is split into the same training and testing sets each time the code is run
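
The split above is purely random, so the class proportions in the test set can drift slightly from those of the full sample. As an optional variation (not what this notebook does), `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in both parts. The toy `texts` and `labels` below are made up for illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 documents per class
texts = [f"doc {i}" for i in range(20)]
labels = [0] * 10 + [1] * 10

# stratify=labels keeps the 50/50 class ratio in both the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(len(X_tr), len(X_te))
print(Counter(y_te))
```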

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer # import the TfidfVectorizer class from scikit-learn library to convert text data into numerical features
# this is because machine learning models cannot work with text data directly, so we need to convert them into numerical features

tfidf = TfidfVectorizer(stop_words = 'english') # create an instance of the TfidfVectorizer class with the stop_words parameter set to 'english' to remove common English stop words from the text data like 'is', 'am', 'the', 'are' etc.
X_train_tfidf = tfidf.fit_transform(X_train) # fit the TfidfVectorizer instance to the training data and transform the training data into numerical features
X_test_tfidf = tfidf.transform(X_test) # transform the testing data into numerical features using the TfidfVectorizer instance
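
The vectorizer is fitted on the training data only, which means the vocabulary is learned from the training text and words that appear only in the test text are ignored. A minimal sketch on made-up sentences (the documents and the name `vec` are assumptions for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents
train_docs = ["the senate passed the budget", "shocking secret about the senate"]
test_docs = ["the budget was shocking and unprecedented"]

vec = TfidfVectorizer(stop_words="english")
train_matrix = vec.fit_transform(train_docs)  # learns the vocabulary and IDF weights
test_matrix = vec.transform(test_docs)        # reuses them without refitting

print(sorted(vec.vocabulary_))                # learned vocabulary, stop words removed
print(train_matrix.shape, test_matrix.shape)  # same number of columns in both
```

Calling `fit_transform` on the test data instead would build a different vocabulary, and the resulting columns would no longer match what the model was trained on.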

Now that we have split the data into training and testing sets, and converted text data into numerical features, we can train a model on the training set and evaluate its performance on the testing set.

In [8]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

# create a dictionary of models with their names as keys and instances of their classes imported from scikit-learn as values
models = {
    'Logistic_Regression': LogisticRegression(),
    'Random_Forest': RandomForestClassifier(),
    'SVC': SVC()
}

# iterate over the dictionary and sequentially train and evaluate each model by printing its classification report and confusion matrix
for name, model in models.items():
    model.fit(X_train_tfidf, y_train)
    
    y_pred = model.predict(X_test_tfidf)

    print(f"-----------{name}-----------\n")
    print("Classification Report \n", classification_report(y_test, y_pred))
    print("Confusion Matrix \n", confusion_matrix(y_test, y_pred))
    print("\n")

-----------Logistic_Regression-----------

Classification Report 
               precision    recall  f1-score   support

           0       0.90      0.91      0.90       471
           1       0.92      0.91      0.91       529

    accuracy                           0.91      1000
   macro avg       0.91      0.91      0.91      1000
weighted avg       0.91      0.91      0.91      1000

Confusion Matrix 
 [[429  42]
 [ 49 480]]


-----------Random_Forest-----------

Classification Report 
               precision    recall  f1-score   support

           0       0.87      0.94      0.90       471
           1       0.94      0.87      0.90       529

    accuracy                           0.90      1000
   macro avg       0.90      0.91      0.90      1000
weighted avg       0.91      0.90      0.90      1000

Confusion Matrix 
 [[443  28]
 [ 69 460]]


-----------SVC-----------

Classification Report 
               precision    recall  f1-score   support

           0       0.91 

Judging by precision, recall and F1-score on the test set, SVC performs best of the three models here. Hence, we will train an SVC model.
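
A single train/test split can favor one model by chance, especially when the scores are close. One common way to get a steadier comparison is k-fold cross-validation. The sketch below is an optional addition and not part of the original workflow. It uses a made-up toy corpus and wraps the vectorizer and classifier in a pipeline so that TF-IDF is refitted inside each fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus with arbitrary 0/1 labels
docs = ["officials confirmed the report on monday",
        "you will not believe this shocking secret"] * 10
labels = [0, 1] * 10

candidates = {"Logistic_Regression": LogisticRegression(), "SVC": SVC()}
mean_f1 = {}
for name, clf in candidates.items():
    # The pipeline refits TF-IDF on each training fold, so no test-fold text leaks in
    pipe = make_pipeline(TfidfVectorizer(stop_words="english"), clf)
    scores = cross_val_score(pipe, docs, labels, cv=5, scoring="f1_macro")
    mean_f1[name] = scores.mean()
    print(name, round(mean_f1[name], 3))
```

On the real data, `docs` and `labels` would be replaced by `df['clean_text']` and `df['label']`. Expect the SVC run to take noticeably longer than a single fit.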

In [9]:
model = SVC()

model.fit(X_train_tfidf,y_train)

Now let's test the model by predicting the output for some new inputs.

In [10]:
text = 'And now a message of peace and unity from one of our neighbors to the South:  We, Mexicans, have to kill Donald J. Trump before he becomes President. He is a threat to every single one of us. There are many Mexican Americans living in the U.S. right now and I m asking them to kill Donald Trump before he becomes President. The one in Mexico who have the means, I m asking you to cross the border and go and kill Donald Trump, and as many of his supporters as possible. Anywhere he goes just try to bomb the place, shoot up the place, do something. B..b but he looked like such a nice boy. Don t call him  illegal he just wanted a better life he s just a victim This punk isn t the only one threatening the life of Trump. Watch this video that exposes the truth about how the media ignores these threats:'
text = clean_text(text)
text = tfidf.transform([text])
print(model.predict(text)[0]) # expected output: 1 (true news)

1


In [None]:
text = 'WINNIPEG, Manitoba (Reuters) - Former U.S. President Jimmy Carter, appearing fully recovered from dehydration suffered while helping to build a home for charity in Canada, was released from an overnight hospital stay on Friday and addressed the project’s closing ceremony. Carter, 92, collapsed while working on Thursday at the Winnipeg construction site for Habitat for Humanity, which promotes affordable home ownership, and was taken to St. Boniface General Hospital for medical treatment and tests. By Friday morning, Carter was smiling as he returned to the building site to help kick off the project’s last day. Hours later, he and his wife, Rosalynn Carter, 89, attended closing ceremonies at the Canadian Museum for Human Rights in Manitoba’s capital, receiving a rousing ovation from the crowd. Dressed in blue jeans, a T-shirt and light-weight jacket, a relaxed, fit-looking Carter climbed a short flight of steps to the stage to salute Habitat’s members for their contributions. “I look upon all the volunteers, in a very sincere way, as human rights heroes, and I thank you for it,” he said, and joked that his “bringing attention to this Habitat project was completely unintentional.” The former first lady told the crowd her husband received a clean bill of health after an extensive battery of tests during his brief hospitalization, including from one test designed to detect heart damage. The results showed “there has never been any kind of damage at all to Jimmy Carter’s heart,” she said. “I knew he had a good heart.” Thursday’s health scare generated an outpouring of support for Carter, a Democrat who served in the White House from January 1977 to January 1981 and has lived longer after his term in office than any other president in U.S. history. A high point of his presidency was Carter’s role in brokering the 1978 Camp David Accords that ushered in peace between Israel and Egypt. But having left office profoundly unpopular, he is widely regarded as a better former president than he was a president, and received the Nobel Peace Prize in 2002 for his humanitarian work. Carter disclosed in August 2015 that he had been diagnosed with a form of skin cancer called melanoma that had spread to his brain and elsewhere and had been spotted during liver surgery. But months later, Carter told the Maranatha Baptist Church, where he teaches Sunday school in his home town of Plains, Georgia, that his latest brain scan showed no sign of the disease. Carter was in Canada for a project to build 150 new homes for needy families, celebrating the country’s 150th independence anniversary.'
text = clean_text(text)
text = tfidf.transform([text])
print(model.predict(text)[0]) # expected output: 0 (fake news)

0
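
Both test cells repeat the same three steps: clean the text, transform it with the fitted vectorizer, and call `predict`. These could be wrapped in one small function. The sketch below is only an illustration. `predict_label` is a hypothetical helper name, and the toy training data stands in for the fitted `tfidf` and `model` objects from this notebook.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

def clean_text(text):
    # Same cleaning steps as earlier, repeated so this block runs on its own
    text = str(text).lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return " ".join(text.split())

def predict_label(raw_text, vectorizer, classifier):
    """Hypothetical helper: clean, vectorize and classify one raw string."""
    features = vectorizer.transform([clean_text(raw_text)])
    return int(classifier.predict(features)[0])

# Toy training data with arbitrary 0/1 labels, standing in for the real fitted objects
train_docs = ["officials confirmed the report on monday",
              "you will not believe this shocking secret"] * 5
train_labels = [0, 1] * 5

toy_tfidf = TfidfVectorizer(stop_words="english")
toy_model = SVC().fit(
    toy_tfidf.fit_transform([clean_text(d) for d in train_docs]), train_labels
)

result = predict_label("Officials CONFIRMED the report!", toy_tfidf, toy_model)
print(result)
```

With the real objects, the call would look like `predict_label(text, tfidf, model)`.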


Finally, we will save the fitted TF-IDF vectorizer and the SVC model using the pickle module.

In [12]:
import pickle
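# use context managers so that both files are flushed and closed after writing
with open("tfidf.pkl", 'wb') as f:
    pickle.dump(tfidf, f)
with open("model.pkl", 'wb') as f:
    pickle.dump(model, f)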
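
The saved files can later be loaded back with `pickle.load`. The sketch below shows a full save and load round trip on a made-up toy model, written to a temporary directory so it does not touch the real `tfidf.pkl` and `model.pkl`. The toy data and variable names are assumptions for illustration.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical toy data standing in for the notebook's fitted objects
docs = ["officials confirmed the report on monday",
        "you will not believe this shocking secret"] * 5
labels = [0, 1] * 5
toy_tfidf = TfidfVectorizer(stop_words="english")
toy_model = SVC().fit(toy_tfidf.fit_transform(docs), labels)

with tempfile.TemporaryDirectory() as tmp:
    tfidf_path = os.path.join(tmp, "tfidf.pkl")
    model_path = os.path.join(tmp, "model.pkl")

    # Save both objects
    with open(tfidf_path, "wb") as f:
        pickle.dump(toy_tfidf, f)
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)

    # Load them back, as a separate script or app would do
    with open(tfidf_path, "rb") as f:
        loaded_tfidf = pickle.load(f)
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)

before = toy_model.predict(toy_tfidf.transform(docs))
after = loaded_model.predict(loaded_tfidf.transform(docs))
print((before == after).all())  # whether the loaded objects predict the same labels
```

Both files are needed at prediction time, because the model only understands the columns produced by the vectorizer it was trained with. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.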