# 0. Inspirations for this code:

1. Splitting data: https://www.kaggle.com/code/adrianabukaa/fake-news-eda

2. Punctuation removal: https://www.analyticsvidhya.com/blog/2021/06/text-preprocessing-in-nlp-with-python-codes/

3. Lowering text: https://www.analyticsvidhya.com/blog/2021/06/text-preprocessing-in-nlp-with-python-codes/

4. Tokenization: Book natural language processing (section 3.3.1)

5. Stop word removal: https://www.analyticsvidhya.com/blog/2021/06/text-preprocessing-in-nlp-with-python-codes/

6. Lemmatization: https://www.analyticsvidhya.com/blog/2021/06/text-preprocessing-in-nlp-with-python-codes/

7. Dropping columns: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html

Installing NLTK: https://www.nltk.org/install.html

## 1. Splitting

In [37]:
import pandas as pd
from sklearn import model_selection

In [38]:
data = pd.read_csv('WELFake_Dataset.csv', index_col=0)

In [39]:
# Number of data points before removing missing values
print(len(data))

72134


In [40]:
# Number of data points after removing rows with missing values
data = data.dropna()
print(len(data))

71537
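
A frame returned by `dropna()` can trigger pandas' `SettingWithCopyWarning` once new columns are assigned to it, depending on the pandas version. A minimal sketch of one common way to avoid that, taking an explicit copy, is shown here on a small stand-in frame; the names `raw` and `clean` and the toy rows are assumptions for illustration, not part of the WELFake data.

```python
import pandas as pd

# Small stand-in frame (hypothetical rows, not the WELFake dataset).
raw = pd.DataFrame({
    "title": ["A", None, "C"],
    "text": ["x!", "y?", "z."],
    "label": [1, 0, 1],
})

# .copy() makes the filtered result an independent frame, so assigning
# new columns to it is unambiguous and leaves `raw` untouched.
clean = raw.dropna().copy()
clean["text_upper"] = clean["text"].str.upper()

print(len(raw), len(clean))
```

Applied to this notebook, the same idea would be `data = data.dropna().copy()`.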


In [41]:
# Splitting the dataset into X and y variables
y, X = data.loc[:,"label"], data.loc[:,data.columns != "label"]

In [42]:
# Splitting further into a training and test set
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size = 0.2, random_state = 42, stratify = y)

print(len(X_train))
print(len(X_test))
print(len(y_train))
print(len(y_test))

57229
14308
57229
14308
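
The `stratify = y` argument asks `train_test_split` to keep the label proportions roughly equal in both parts. The sketch below checks this on hypothetical toy data (100 items with a 60/40 label split, chosen only for illustration); the variable names are assumptions and do not appear in the notebook.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 items, 60 with label 0 and 40 with label 1.
X_toy = list(range(100))
y_toy = [0] * 60 + [1] * 40

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

# Share of label 1 in each part; both should be close to the overall 0.4.
train_share = Counter(y_tr)[1] / len(y_tr)
test_share = Counter(y_te)[1] / len(y_te)
print(train_share, test_share)
```

The same check on the real split would compare `y_train.mean()` and `y_test.mean()` with `y.mean()`.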


## 2. Punctuation removal

In [43]:
# string.punctuation lists the ASCII punctuation characters
import string

# Return the text with all ASCII punctuation characters removed
def remove_punctuation(text):
    punctuationfree="".join([i for i in text if i not in string.punctuation])
    return punctuationfree

# Store the punctuation-free text; column 1 is 'text'
data['clean_msg']= data.iloc[:,1].apply(lambda x:remove_punctuation(x))
data.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['clean_msg']= data.iloc[:,1].apply(lambda x:remove_punctuation(x))


Unnamed: 0,title,text,label,clean_msg
0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1,No comment is expected from Barack Obama Membe...
2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1,Now most of the demonstrators gathered last n...
3,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0,A dozen politically active pastors came here f...
4,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1,The RS28 Sarmat missile dubbed Satan 2 will re...
5,About Time! Christian Group Sues Amazon and SP...,All we can say on this one is it s about time ...,1,All we can say on this one is it s about time ...
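
A possible alternative to the character-by-character filter above is `str.translate`, which removes the same characters in a single pass. This is a minimal sketch; the helper name `remove_punctuation_translate` and the sample sentence are assumptions for illustration. It also makes one limitation visible: `string.punctuation` only covers ASCII, so typographic characters such as `’` are left in place.

```python
import string

# Translation table that deletes every ASCII punctuation character.
ASCII_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation_translate(text):
    """Remove ASCII punctuation in one pass (hypothetical helper name)."""
    return text.translate(ASCII_PUNCT_TABLE)

# Made-up sample sentence, loosely based on the rows shown above.
sample = "The RS-28 Sarmat missile, dubbed Satan 2, will replace Trump’s plan."
cleaned = remove_punctuation_translate(sample)
print(cleaned)
```

If typographic quotes should be removed as well, they would have to be added to the table explicitly.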


## 3. Lowering text

In [44]:
# Lowercase the punctuation-free text
data['msg_lower'] = data['clean_msg'].str.lower()


## 4. Tokenization

In [58]:
pip install --user -U nltk

Collecting nltk
  Downloading nltk-3.7-py3-none-any.whl (1.5 MB)
Collecting regex>=2021.8.3
  Downloading regex-2022.10.31-cp38-cp38-macosx_10_9_x86_64.whl (294 kB)
Installing collected packages: regex, nltk
Successfully installed nltk-3.7 regex-2022.10.31
Note: you may need to restart the kernel to use updated packages.


In [67]:
import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs the 'punkt' tokenizer models
nltk.download('punkt')

# Split each lowercased text into a list of word tokens
data['msg_tokenized'] = data['msg_lower'].apply(word_tokenize)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/pietervanbrakel/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
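
For comparison, here is a rough regex-based stand-in for tokenization that needs no NLTK data. It is a swapped-in technique, not what `word_tokenize` does internally, and the helper name `simple_tokenize` is an assumption. One visible difference: `word_tokenize` emits the typographic apostrophe as its own token (`trump`, `’`, `s` in row 72132), whereas this sketch drops it.

```python
import re

def simple_tokenize(text):
    """Rough stand-in for word_tokenize: return runs of word characters."""
    return re.findall(r"\w+", text)

# Sample fragment taken from the lowercased text of row 72132.
tokens = simple_tokenize("mexico city reuters donald trump’s combative")
print(tokens)
```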


## 5. Stop word removal

In [70]:
nltk.download('stopwords')

# A set gives constant-time membership tests when filtering tokens
stopwords = set(nltk.corpus.stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/pietervanbrakel/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [77]:
# Keep only the tokens that are not in the stop word list
def remove_stopwords(text):
    output= [i for i in text if i not in stopwords]
    return output

# Apply the function to the tokenized text
data['no_stopwords'] = data['msg_tokenized'].apply(remove_stopwords)


In [78]:
data

Unnamed: 0,title,text,label,clean_msg,msg_lower,msg_tokenized,no_stopwords
0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1,No comment is expected from Barack Obama Membe...,no comment is expected from barack obama membe...,"[no, comment, is, expected, from, barack, obam...","[comment, expected, barack, obama, members, fy..."
2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1,Now most of the demonstrators gathered last n...,now most of the demonstrators gathered last n...,"[now, most, of, the, demonstrators, gathered, ...","[demonstrators, gathered, last, night, exercis..."
3,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0,A dozen politically active pastors came here f...,a dozen politically active pastors came here f...,"[a, dozen, politically, active, pastors, came,...","[dozen, politically, active, pastors, came, pr..."
4,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1,The RS28 Sarmat missile dubbed Satan 2 will re...,the rs28 sarmat missile dubbed satan 2 will re...,"[the, rs28, sarmat, missile, dubbed, satan, 2,...","[rs28, sarmat, missile, dubbed, satan, 2, repl..."
5,About Time! Christian Group Sues Amazon and SP...,All we can say on this one is it s about time ...,1,All we can say on this one is it s about time ...,all we can say on this one is it s about time ...,"[all, we, can, say, on, this, one, is, it, s, ...","[say, one, time, someone, sued, southern, pove..."
...,...,...,...,...,...,...,...
72129,Russians steal research on Trump in hack of U....,WASHINGTON (Reuters) - Hackers believed to be ...,0,WASHINGTON Reuters Hackers believed to be wor...,washington reuters hackers believed to be wor...,"[washington, reuters, hackers, believed, to, b...","[washington, reuters, hackers, believed, worki..."
72130,WATCH: Giuliani Demands That Democrats Apolog...,"You know, because in fantasyland Republicans n...",1,You know because in fantasyland Republicans ne...,you know because in fantasyland republicans ne...,"[you, know, because, in, fantasyland, republic...","[know, fantasyland, republicans, never, questi..."
72131,Migrants Refuse To Leave Train At Refugee Camp...,Migrants Refuse To Leave Train At Refugee Camp...,0,Migrants Refuse To Leave Train At Refugee Camp...,migrants refuse to leave train at refugee camp...,"[migrants, refuse, to, leave, train, at, refug...","[migrants, refuse, leave, train, refugee, camp..."
72132,Trump tussle gives unpopular Mexican leader mu...,MEXICO CITY (Reuters) - Donald Trump’s combati...,0,MEXICO CITY Reuters Donald Trump’s combative ...,mexico city reuters donald trump’s combative ...,"[mexico, city, reuters, donald, trump, ’, s, c...","[mexico, city, reuters, donald, trump, ’, comb..."
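
The first row above shows that the stop word list also removes negations: `no, comment, is, expected` becomes `comment, expected`. If negations should be kept, one option is to subtract them from the stop word set before filtering. The sketch below uses a tiny hand-written stop list and a hypothetical helper `remove_stopwords_keep`; both are assumptions for illustration, and NLTK's English list is much longer.

```python
# Tiny illustrative stop list (an assumption, not NLTK's full English list).
STOP_WORDS = {"no", "is", "from", "the", "a", "of", "to"}

# Words that should survive filtering even though they are stop words.
KEEP = {"no"}

def remove_stopwords_keep(tokens, stop_words=STOP_WORDS, keep=KEEP):
    """Drop stop words, except those explicitly listed in `keep`."""
    effective = stop_words - keep
    return [t for t in tokens if t not in effective]

tokens = ["no", "comment", "is", "expected", "from", "barack", "obama"]
filtered = remove_stopwords_keep(tokens)
print(filtered)
```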


## 6. Lemmatization

In [83]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
#defining the object for Lemmatization
wordnet_lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/pietervanbrakel/nltk_data...


In [88]:
# Lemmatize each token; lemmatize() defaults to pos='n', so only noun forms
# are reduced (e.g. 'members' -> 'member') and verbs such as 'gathered' stay as-is
def lemmatizer(text):
    lemm_text = [wordnet_lemmatizer.lemmatize(word) for word in text]
    return lemm_text

data['msg_lemmatized'] = data['no_stopwords'].apply(lemmatizer)


## 7. Dropping columns

In [89]:
data

Unnamed: 0,title,text,label,clean_msg,msg_lower,msg_tokenized,no_stopwords,msg_lemmatized
0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1,No comment is expected from Barack Obama Membe...,no comment is expected from barack obama membe...,"[no, comment, is, expected, from, barack, obam...","[comment, expected, barack, obama, members, fy...","[comment, expected, barack, obama, member, fyf..."
2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1,Now most of the demonstrators gathered last n...,now most of the demonstrators gathered last n...,"[now, most, of, the, demonstrators, gathered, ...","[demonstrators, gathered, last, night, exercis...","[demonstrator, gathered, last, night, exercisi..."
3,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0,A dozen politically active pastors came here f...,a dozen politically active pastors came here f...,"[a, dozen, politically, active, pastors, came,...","[dozen, politically, active, pastors, came, pr...","[dozen, politically, active, pastor, came, pri..."
4,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1,The RS28 Sarmat missile dubbed Satan 2 will re...,the rs28 sarmat missile dubbed satan 2 will re...,"[the, rs28, sarmat, missile, dubbed, satan, 2,...","[rs28, sarmat, missile, dubbed, satan, 2, repl...","[rs28, sarmat, missile, dubbed, satan, 2, repl..."
5,About Time! Christian Group Sues Amazon and SP...,All we can say on this one is it s about time ...,1,All we can say on this one is it s about time ...,all we can say on this one is it s about time ...,"[all, we, can, say, on, this, one, is, it, s, ...","[say, one, time, someone, sued, southern, pove...","[say, one, time, someone, sued, southern, pove..."
...,...,...,...,...,...,...,...,...
72129,Russians steal research on Trump in hack of U....,WASHINGTON (Reuters) - Hackers believed to be ...,0,WASHINGTON Reuters Hackers believed to be wor...,washington reuters hackers believed to be wor...,"[washington, reuters, hackers, believed, to, b...","[washington, reuters, hackers, believed, worki...","[washington, reuters, hacker, believed, workin..."
72130,WATCH: Giuliani Demands That Democrats Apolog...,"You know, because in fantasyland Republicans n...",1,You know because in fantasyland Republicans ne...,you know because in fantasyland republicans ne...,"[you, know, because, in, fantasyland, republic...","[know, fantasyland, republicans, never, questi...","[know, fantasyland, republican, never, questio..."
72131,Migrants Refuse To Leave Train At Refugee Camp...,Migrants Refuse To Leave Train At Refugee Camp...,0,Migrants Refuse To Leave Train At Refugee Camp...,migrants refuse to leave train at refugee camp...,"[migrants, refuse, to, leave, train, at, refug...","[migrants, refuse, leave, train, refugee, camp...","[migrant, refuse, leave, train, refugee, camp,..."
72132,Trump tussle gives unpopular Mexican leader mu...,MEXICO CITY (Reuters) - Donald Trump’s combati...,0,MEXICO CITY Reuters Donald Trump’s combative ...,mexico city reuters donald trump’s combative ...,"[mexico, city, reuters, donald, trump, ’, s, c...","[mexico, city, reuters, donald, trump, ’, comb...","[mexico, city, reuters, donald, trump, ’, comb..."


In [91]:
# drop returns a new DataFrame, so assign the result to keep the change
data = data.drop(columns=["text", "clean_msg", "msg_lower", "msg_tokenized", "no_stopwords"])
data

Unnamed: 0,title,label,msg_lemmatized
0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,1,"[comment, expected, barack, obama, member, fyf..."
2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,1,"[demonstrator, gathered, last, night, exercisi..."
3,"Bobby Jindal, raised Hindu, uses story of Chri...",0,"[dozen, politically, active, pastor, came, pri..."
4,SATAN 2: Russia unvelis an image of its terrif...,1,"[rs28, sarmat, missile, dubbed, satan, 2, repl..."
5,About Time! Christian Group Sues Amazon and SP...,1,"[say, one, time, someone, sued, southern, pove..."
...,...,...,...
72129,Russians steal research on Trump in hack of U....,0,"[washington, reuters, hacker, believed, workin..."
72130,WATCH: Giuliani Demands That Democrats Apolog...,1,"[know, fantasyland, republican, never, questio..."
72131,Migrants Refuse To Leave Train At Refugee Camp...,0,"[migrant, refuse, leave, train, refugee, camp,..."
72132,Trump tussle gives unpopular Mexican leader mu...,0,"[mexico, city, reuters, donald, trump, ’, comb..."
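
Note that `DataFrame.drop` returns a new frame and leaves the original untouched unless the result is assigned. A minimal sketch on a hypothetical one-row frame (the names `frame` and `dropped` are assumptions):

```python
import pandas as pd

# Hypothetical one-row frame with the same column names as the dataset.
frame = pd.DataFrame({"title": ["a"], "text": ["b"], "label": [1]})

# drop builds a new DataFrame; `frame` still has all three columns.
dropped = frame.drop(columns=["text"])

print(list(frame.columns))
print(list(dropped.columns))
```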
