# Importing Data

In [223]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import re


In [224]:


fake_df = pd.read_csv('Fake.csv')
true_df = pd.read_csv('True.csv')

# Add a 'label' column to each DataFrame
fake_df['label'] = 'non-credible'
true_df['label'] = 'credible'

# Concatenate the DataFrames
df = pd.concat([fake_df, true_df], ignore_index=True)

# Display the first few rows of the combined DataFrame
df.head()

Unnamed: 0,title,text,subject,date,label
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",non-credible
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",non-credible
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",non-credible
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",non-credible
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",non-credible


- Imports pandas for data manipulation, along with seaborn and matplotlib for plotting and `re` for regular expressions.
- Loads two CSV files, 'Fake.csv' and 'True.csv', into separate DataFrames: `fake_df` and `true_df`.
- Adds a new column called `label` to each DataFrame to indicate whether the news is 'credible' or 'non-credible'.
- Concatenates the two DataFrames into a single DataFrame `df`, combining both credible and non-credible news articles.
- Displays the first few rows of the combined DataFrame to provide an overview of the merged dataset.
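
The steps above can be sketched on a tiny in-memory example. The two small DataFrames here (`fake_demo`, `true_demo`) are hypothetical stand-ins for the contents of the CSV files, used only to show how labelling and concatenation behave:

```python
import pandas as pd

# Hypothetical stand-ins for the rows loaded from Fake.csv and True.csv
fake_demo = pd.DataFrame({"title": ["a", "b"], "text": ["x", "y"]})
true_demo = pd.DataFrame({"title": ["c"], "text": ["z"]})

# Assigning a scalar broadcasts the same label to every row
fake_demo["label"] = "non-credible"
true_demo["label"] = "credible"

# ignore_index=True renumbers the rows 0..n-1 instead of keeping
# each source frame's own index (which would repeat 0, 1, 0)
combined = pd.concat([fake_demo, true_demo], ignore_index=True)
print(combined)
```

Without `ignore_index=True`, the combined frame would carry repeated index values from the two sources, which can make later positional lookups confusing.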

# EDA and Preprocessing

## Basic EDA / Dataset Overview

In [225]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44898 entries, 0 to 44897
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    44898 non-null  object
 1   text     44898 non-null  object
 2   subject  44898 non-null  object
 3   date     44898 non-null  object
 4   label    44898 non-null  object
dtypes: object(5)
memory usage: 1.7+ MB


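The `df.info()` method provides a concise summary of the DataFrame: the index range, each column's name, non-null count and dtype, and the approximate memory usage. Here, all 5 columns are of type `object` and none of them contain null values.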

In [226]:
df.shape

(44898, 5)

The `df.shape` attribute returns a tuple with the dimensions of the DataFrame as (rows, columns).

The output `(44898, 5)` indicates that the DataFrame `df` contains 44,898 rows and 5 columns.

## Dropping non-essential features

In [227]:
df = df.drop(columns=['subject', 'date'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44898 entries, 0 to 44897
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   44898 non-null  object
 1   text    44898 non-null  object
 2   label   44898 non-null  object
dtypes: object(3)
memory usage: 1.0+ MB


We dropped the `subject` and `date` columns because they are not essential for the initial analysis or modeling.
Removing them keeps the focus on the main textual content (`title`, `text`) and the target label (`label`).
It also simplifies the dataset and reduces noise.

## Combining Title and Text into one column

In [228]:
df['content'] = df['title'] + ' ' + df['text']
df = df.drop(columns=['title', 'text'])
df = df[['content', 'label']]
df.head()

Unnamed: 0,content,label
0,Donald Trump Sends Out Embarrassing New Year’...,non-credible
1,Drunk Bragging Trump Staffer Started Russian ...,non-credible
2,Sheriff David Clarke Becomes An Internet Joke...,non-credible
3,Trump Is So Obsessed He Even Has Obama’s Name...,non-credible
4,Pope Francis Just Called Out Donald Trump Dur...,non-credible


The code combines the "title" and "text" columns into a new "content" column, then removes the original "title" and "text" columns, leaving only "content" and "label" in the DataFrame. This simplifies the dataset by merging all relevant text into a single column for easier processing and analysis, and displays the first few rows of the updated DataFrame.

## Converting text to lowercase

In [229]:
df['content'] = df['content'].str.lower()
df.head()

Unnamed: 0,content,label
0,donald trump sends out embarrassing new year’...,non-credible
1,drunk bragging trump staffer started russian ...,non-credible
2,sheriff david clarke becomes an internet joke...,non-credible
3,trump is so obsessed he even has obama’s name...,non-credible
4,pope francis just called out donald trump dur...,non-credible


This code converts the `content` column of the DataFrame to lowercase with the `.str.lower()` method and assigns the result back to the same column.

## Missing Values

In [230]:
print(df.isnull().sum())

content    0
label      0
dtype: int64



This code prints the number of missing (null) values in each column of the DataFrame `df`.

The output shows that the dataset has no missing values.
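
For comparison, here is a minimal sketch of what the same check reports when values are missing. The `demo` frame is a made-up example, not part of the dataset:

```python
import pandas as pd

# Made-up frame with one missing value in each column
demo = pd.DataFrame({
    "content": ["some text", None, "more text"],
    "label": ["credible", "non-credible", None],
})

# isnull() gives a True/False frame; sum() counts the True values per column
missing = demo.isnull().sum()
print(missing)
```

A non-zero count in either column would mean those rows need to be dropped or filled before further processing.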

## Finding and Handling Duplicates

In [231]:
print(df['label'].value_counts())

label
non-credible    23481
credible        21417
Name: count, dtype: int64


The code takes the current DataFrame and uses the value_counts() method on the 'label' column to count the number of articles in each class (credible and non-credible).

In [232]:
duplicate_content = df[df.duplicated(subset='content')]
duplicate_content

Unnamed: 0,content,label
9942,hillary tweets message in defense of daca…oops...,non-credible
11446,former democrat warns young americans: “rioter...,non-credible
14925,[video] #blacklivesmatter terrorists storm dar...,non-credible
15892,house intel slaps subpoenas on mccain institut...,non-credible
15893,priceless! watch msnbc host’s shocked response...,non-credible
...,...,...
44709,france unveils labor reforms in first step to ...,credible
44744,guatemala top court sides with u.n. graft unit...,credible
44771,"europeans, africans agree renewed push to tack...",credible
44834,thailand's ousted pm yingluck has fled abroad:...,credible


The code creates a new DataFrame called `duplicate_content` containing the rows whose `content` value has already appeared earlier in the dataset. This is done with the `duplicated()` method, which by default (`keep='first'`) flags every occurrence of a duplicated value except the first, so the first copy of each article is not included. The output shows 5,793 such rows.
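
To make the "all but the first occurrence" behaviour concrete, here is a small sketch on a made-up frame (`demo` is hypothetical, not the dataset):

```python
import pandas as pd

# Made-up frame: "a" appears three times, "b" and "c" once each
demo = pd.DataFrame({
    "content": ["a", "b", "a", "a", "c"],
    "label": ["x", "x", "x", "y", "x"],
})

# Default keep='first': the first "a" is NOT flagged, later copies are
later_copies = demo.duplicated(subset="content")
print(later_copies.tolist())

# keep=False flags every row that belongs to a duplicated group
all_copies = demo.duplicated(subset="content", keep=False)
print(all_copies.tolist())
```

The default mask is the right one for counting how many rows a later `drop_duplicates(keep='first')` call would remove.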

In [233]:
print(duplicate_content['label'].value_counts())

label
non-credible    5573
credible         220
Name: count, dtype: int64


The code calls `.value_counts()` on the `label` column of `duplicate_content` to count the duplicated articles in each class. The output shows that most duplicates are "non-credible" articles (5,573), while only 220 "credible" articles are duplicated.

In [234]:
df = df.drop_duplicates(subset='content', keep='first').reset_index(drop=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39105 entries, 0 to 39104
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   content  39105 non-null  object
 1   label    39105 non-null  object
dtypes: object(2)
memory usage: 611.1+ KB


The `.drop_duplicates()` method removes rows with duplicate values in the `content` column; the `keep='first'` parameter keeps the first occurrence of each article in the dataset and drops the rest. `reset_index()` then renumbers the rows of the DataFrame after the drop operation, and the `drop=True` parameter discards the old index completely instead of adding it as a separate column.
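
The effect of `reset_index(drop=True)` is easiest to see on a small made-up frame (`demo` is a hypothetical example):

```python
import pandas as pd

# Made-up frame with duplicated content at positions 2 and 4
demo = pd.DataFrame({
    "content": ["a", "b", "a", "c", "b"],
    "label": ["x", "x", "y", "x", "x"],
})

# Dropping duplicates keeps the original index labels, leaving gaps
deduped = demo.drop_duplicates(subset="content", keep="first")
print(list(deduped.index))

# reset_index(drop=True) renumbers from 0 and discards the old labels
clean = deduped.reset_index(drop=True)
print(list(clean.index), list(clean.columns))
```

With `drop=False` (the default), the old labels would instead be kept as an extra `index` column.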

In [235]:
print(df['label'].value_counts())

label
credible        21197
non-credible    17908
Name: count, dtype: int64


Class distribution after dropping duplicate content: removing the 5,793 duplicates leaves 39,105 rows, and `credible` is now the larger class.

In [236]:
df['content'].iloc[476]

" the exact same texas lawmakers that voted against hurricane relief are now begging for help when hurricane sandy hit, affecting states all the way from north carolina to new england and particularly devastating new york and new jersey in 2012, texas lawmakers overwhelmingly voted against recovery assistance. in fact, john culberson, whose 7th congressional district includes parts of houston, was the only texas republican in congress in favor of the $50.7 billion relief effort.one of the loudest opponents of sandy aid was ted cruz, who was merely weeks away from becoming a texas senator. cruz s main concerns involved additional spending, which included funding for disaster preparedness and relief in other parts of the country as a means of gaining support for the hurricane sandy relief effort. hurricane sandy inflicted devastating damage on the east coast, and congress appropriately responded with hurricane relief,  cruz said in a statement at the time in an effort to justify his stan

In [237]:
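# Display 5 random rows that contain contractions in 'content'
# NOTE: `contraction_pattern` is not defined in the cells shown here; the pattern
# below is an assumed stand-in (word + apostrophe + common contraction suffix).
contraction_pattern = r"\b\w+['’](?:s|t|d|m|re|ve|ll)\b"
contraction_rows = df[df['content'].str.contains(contraction_pattern, regex=True)]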
random_examples = contraction_rows.sample(5, random_state=42)
for idx, row in random_examples.iterrows():
    print(f"Index: {idx}")
    print(row)
    print('-' * 40)


Index: 34057
content    imran khan's pti retains seat in by-election, ...
label                                               credible
Name: 34057, dtype: object
----------------------------------------
Index: 25513
content    clinton says confident new emails will not cha...
label                                               credible
Name: 25513, dtype: object
----------------------------------------
Index: 476
content     the exact same texas lawmakers that voted aga...
label                                           non-credible
Name: 476, dtype: object
----------------------------------------
Index: 22300
content    trump adviser from wall street backs u.s. bank...
label                                               credible
Name: 22300, dtype: object
----------------------------------------
Index: 38393
content    eu's barnier worried by uk's post-brexit plan ...
label                                               credible
Name: 38393, dtype: object
------------------------------

## Handling Contractions

In [238]:
import contractions
df['content'] = df['content'].apply(contractions.fix)

In [239]:
df['content'].iloc[476]

" the exact same texas lawmakers that voted against hurricane relief are now begging for help when hurricane sandy hit, affecting states all the way from north carolina to new england and particularly devastating new york and new jersey in 2012, texas lawmakers overwhelmingly voted against recovery assistance. in fact, john culberson, whose 7th congressional district includes parts of houston, was the only texas republican in congress in favor of the $50.7 billion relief effort.one of the loudest opponents of sandy aid was ted cruz, who was merely weeks away from becoming a texas senator. cruz s main concerns involved additional spending, which included funding for disaster preparedness and relief in other parts of the country as a means of gaining support for the hurricane sandy relief effort. hurricane sandy inflicted devastating damage on the east coast, and congress appropriately responded with hurricane relief,  cruz said in a statement at the time in an effort to justify his stan

## Finding and Handling usernames, hashtags, and emails

In [240]:
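def count_users_hash(dataframe):
    # Total number of email-like matches across all rows
    email_count = dataframe['content'].str.count(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+').sum()
    print(f"Total email addresses: {email_count}")

    # Total number of @handle matches (up to 15 characters, Twitter's handle limit).
    # This pattern also matches the "@domain" part of an email address.
    username_count = dataframe['content'].str.count(r'@[A-Za-z0-9_]{1,15}\b').sum()
    print(f"Total Twitter usernames: {username_count}")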


count_users_hash(df)

Total email addresses: 44
Total Twitter usernames: 24782


In [241]:

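def clean_text(text):
    # Remove email addresses first, so the username pattern below
    # does not strip only the "@domain" part and leave fragments behind
    text = re.sub(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+', '', text)

    # Remove Twitter-style @usernames
    text = re.sub(r'@[A-Za-z0-9_]{1,15}\b', '', text)

    return text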

df['content'] = df['content'].apply(clean_text)
count_users_hash(df)

Total email addresses: 0
Total Twitter usernames: 0
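
A small sketch of how the two patterns above behave on a single string. The sample sentence and the names `EMAIL_RE`, `USERNAME_RE`, and `sample` are made up for illustration; the patterns themselves are copied from the cells above:

```python
import re

# Patterns copied from the cells above
EMAIL_RE = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
USERNAME_RE = re.compile(r'@[A-Za-z0-9_]{1,15}\b')

# Made-up sample text containing one email and one @handle
sample = "contact jane.doe@example.com or follow @realuser for updates"

emails = EMAIL_RE.findall(sample)
# The username pattern also picks up the "@example" part of the email
handles = USERNAME_RE.findall(sample)

# Same order as clean_text: emails first, then usernames
cleaned = USERNAME_RE.sub('', EMAIL_RE.sub('', sample))

print(emails)
print(handles)
print(repr(cleaned))
```

Because the username pattern also matches inside email addresses, the username total reported before cleaning may include some email domains, and removing emails before usernames avoids leaving behind fragments such as `jane.doe.com`.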


## Finding and Handling HTML tags and URLs

In [242]:
df['content'].iloc[0]


' donald trump sends out embarrassing new year’s eve message; this is disturbing donald trump just couldn t wish all americans a happy new year and leave it at that. instead, he had to give a shout out to his enemies, haters and  the very dishonest fake news media.  the former reality show star had just one job to do and he couldn t do it. as our country rapidly grows stronger and smarter, i want to wish all of my friends, supporters, enemies, haters, and even the very dishonest fake news media, a happy and healthy new year,  president angry pants tweeted.  2018 will be a great year for america! as our country rapidly grows stronger and smarter, i want to wish all of my friends, supporters, enemies, haters, and even the very dishonest fake news media, a happy and healthy new year. 2018 will be a great year for america!  donald j. trump () december 31, 2017trump s tweet went down about as welll as you d expect.what kind of president sends a new year s greeting like this despicable, pett

In [243]:
def count_links(dataframe):
    # Use the `dataframe` argument rather than the global `df`.
    # Number of rows containing at least one HTML tag / URL
    html_tag_count = dataframe['content'].str.contains(r'<.*?>', regex=True).sum()
    url_count = dataframe['content'].str.contains(r'http\S+|www\.\S+', regex=True).sum()
    # Total number of domain-like matches such as abc.xyz.com
    dot_com_count = dataframe['content'].str.count(r'\b\w+\.\w+\.(com|org|net|gov|edu|info|io|co|us|uk|in|au|ca|de|fr|ru|jp|cn|br|za)\b').sum()

    print(f"Links matching (abc.xyz.com(others)) pattern: {dot_com_count}")
    print(f"Rows with HTML tags: {html_tag_count}")
    print(f"Rows with URLs: {url_count}")

count_links(df)
df.shape

Links matching (abc.xyz.com(others)) pattern: 5845
Rows with HTML tags: 68
Rows with URLs: 2589


(39105, 2)

In [244]:
import re

def clean_text(text):
    # Remove domain-like strings such as abc.xyz.com
    text = re.sub(r'\b\w+\.\w+\.(com|org|net|gov|edu|info|io|co|us|uk|in|au|ca|de|fr|ru|jp|cn|br|za)\b', '', text)
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # Remove URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)
    return text

df['content'] = df['content'].apply(clean_text)
df.head()

Unnamed: 0,content,label
0,donald trump sends out embarrassing new year’...,non-credible
1,drunk bragging trump staffer started russian ...,non-credible
2,sheriff david clarke becomes an internet joke...,non-credible
3,trump is so obsessed he even has obama’s name...,non-credible
4,pope francis just called out donald trump dur...,non-credible


In [245]:
count_links(df)
df.shape

Links matching (abc.xyz.com(others)) pattern: 0
Rows with HTML tags: 0
Rows with URLs: 0


(39105, 2)
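
The three substitutions can be traced on one made-up string. The helper name `strip_links` and the sample sentence are hypothetical; the patterns and their order follow the cleaning cell above:

```python
import re

# Patterns copied from the cleaning cell above
DOMAIN_RE = re.compile(
    r'\b\w+\.\w+\.(com|org|net|gov|edu|info|io|co|us|uk|in|au|ca|de|fr|ru|jp|cn|br|za)\b'
)
HTML_RE = re.compile(r'<.*?>')
URL_RE = re.compile(r'http\S+|www\.\S+')

def strip_links(text):
    """Apply the three removals in the same order as the notebook cell."""
    text = DOMAIN_RE.sub('', text)   # e.g. news.example.com
    text = HTML_RE.sub('', text)     # e.g. <b> and </b>
    text = URL_RE.sub('', text)      # e.g. https://example.com/story
    return text

# Made-up sample with one HTML tag pair, one URL and one bare domain
sample = "read <b>more</b> at https://example.com/story or news.example.com today"
result = strip_links(sample)
print(repr(result))
```

Each removal leaves the surrounding spaces in place, so the result contains runs of consecutive spaces that a later whitespace clean-up has to collapse.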

In [246]:
df['content'].iloc[0]

' donald trump sends out embarrassing new year’s eve message; this is disturbing donald trump just couldn t wish all americans a happy new year and leave it at that. instead, he had to give a shout out to his enemies, haters and  the very dishonest fake news media.  the former reality show star had just one job to do and he couldn t do it. as our country rapidly grows stronger and smarter, i want to wish all of my friends, supporters, enemies, haters, and even the very dishonest fake news media, a happy and healthy new year,  president angry pants tweeted.  2018 will be a great year for america! as our country rapidly grows stronger and smarter, i want to wish all of my friends, supporters, enemies, haters, and even the very dishonest fake news media, a happy and healthy new year. 2018 will be a great year for america!  donald j. trump () december 31, 2017trump s tweet went down about as welll as you d expect.what kind of president sends a new year s greeting like this despicable, pett

## Handling Special characters and Digits (non-word and non-whitespaces)

In [247]:
df = df.replace(to_replace=r'[^\w\s]', value='', regex=True)
df = df.replace(to_replace=r'\d', value='', regex=True)
df.head()

Unnamed: 0,content,label
0,donald trump sends out embarrassing new years...,noncredible
1,drunk bragging trump staffer started russian ...,noncredible
2,sheriff david clarke becomes an internet joke...,noncredible
3,trump is so obsessed he even has obamas name ...,noncredible
4,pope francis just called out donald trump dur...,noncredible


Here, `df.replace()` removes every character that is not (`^`) a word character (`\w`: letters, digits, and the underscore) or whitespace (`\s`) by replacing it with an empty string (`value=''`). This reduces noise and improves consistency: special characters often add little meaningful information for text classification and analysis tasks, and removing them standardizes the text, which makes later tokenization and processing easier.

The code also replaces all digits (`\d`) with an empty string.

Because `df.replace()` is applied to the whole DataFrame rather than only the `content` column, the hyphen in the labels is removed as well, so `non-credible` appears as `noncredible` in the output above.

Note: `[^\w\s]` = NOT (`^`) word characters or whitespace, while `\d` = IS a digit.
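
The two patterns can be tried on a plain string with the standard `re` module. The sample text below is made up; it deliberately includes a hyphenated label and an underscore to show what is and is not removed:

```python
import re

# Made-up sample containing punctuation, digits and an underscore
text = "non-credible: trump's 2018 tweet_1!"

# Step 1: drop everything that is neither a word character nor whitespace
no_punct = re.sub(r'[^\w\s]', '', text)

# Step 2: drop digits
cleaned = re.sub(r'\d', '', no_punct)

print(repr(no_punct))
print(repr(cleaned))
```

The underscore survives because it counts as a word character under `\w`, the hyphen in `non-credible` is stripped like any other punctuation, and removing the digits leaves a double space behind.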

In [248]:
# Count rows with extra (consecutive) white spaces in 'content'
extra_ws_mask = df['content'].str.contains(r'\s{2,}', regex=True)
extra_ws_count = extra_ws_mask.sum()
print(f"Rows with extra white spaces: {extra_ws_count}")

# Print an example row with extra white spaces, if any exist
if extra_ws_count > 0:
    example_row = df[extra_ws_mask].iloc[0]
    print("Example with extra white spaces:")
    print(example_row['content'])
else:
    print("No extra white spaces found in the dataset.")


Rows with extra white spaces: 38450
Example with extra white spaces:
 donald trump sends out embarrassing new years eve message this is disturbing donald trump just couldn t wish all americans a happy new year and leave it at that instead he had to give a shout out to his enemies haters and  the very dishonest fake news media  the former reality show star had just one job to do and he couldn t do it as our country rapidly grows stronger and smarter i want to wish all of my friends supporters enemies haters and even the very dishonest fake news media a happy and healthy new year  president angry pants tweeted   will be a great year for america as our country rapidly grows stronger and smarter i want to wish all of my friends supporters enemies haters and even the very dishonest fake news media a happy and healthy new year  will be a great year for america  donald j trump  december  trump s tweet went down about as welll as you d expectwhat kind of president sends a new year s greeting l

In [249]:
df['content'].iloc[0]

' donald trump sends out embarrassing new years eve message this is disturbing donald trump just couldn t wish all americans a happy new year and leave it at that instead he had to give a shout out to his enemies haters and  the very dishonest fake news media  the former reality show star had just one job to do and he couldn t do it as our country rapidly grows stronger and smarter i want to wish all of my friends supporters enemies haters and even the very dishonest fake news media a happy and healthy new year  president angry pants tweeted   will be a great year for america as our country rapidly grows stronger and smarter i want to wish all of my friends supporters enemies haters and even the very dishonest fake news media a happy and healthy new year  will be a great year for america  donald j trump  december  trump s tweet went down about as welll as you d expectwhat kind of president sends a new year s greeting like this despicable petty infantile gibberish only trump his lack of

## Handling extra whitespaces

In [250]:
# Remove extra (consecutive) white spaces from 'content'
df['content'] = df['content'].str.replace(r'\s+', ' ', regex=True).str.strip()
df.head()

Unnamed: 0,content,label
0,donald trump sends out embarrassing new years ...,noncredible
1,drunk bragging trump staffer started russian c...,noncredible
2,sheriff david clarke becomes an internet joke ...,noncredible
3,trump is so obsessed he even has obamas name c...,noncredible
4,pope francis just called out donald trump duri...,noncredible


In [251]:
df['content'].iloc[0]

'donald trump sends out embarrassing new years eve message this is disturbing donald trump just couldn t wish all americans a happy new year and leave it at that instead he had to give a shout out to his enemies haters and the very dishonest fake news media the former reality show star had just one job to do and he couldn t do it as our country rapidly grows stronger and smarter i want to wish all of my friends supporters enemies haters and even the very dishonest fake news media a happy and healthy new year president angry pants tweeted will be a great year for america as our country rapidly grows stronger and smarter i want to wish all of my friends supporters enemies haters and even the very dishonest fake news media a happy and healthy new year will be a great year for america donald j trump december trump s tweet went down about as welll as you d expectwhat kind of president sends a new year s greeting like this despicable petty infantile gibberish only trump his lack of decency w
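
The same whitespace clean-up can be sketched on a single made-up string with the standard `re` module:

```python
import re

# Made-up text with leading/trailing spaces, a newline, a tab and double spaces
text = "  happy new year \n  president\ttweeted  "

# \s+ matches any run of whitespace (spaces, tabs, newlines);
# each run becomes a single space, then strip() trims both ends
collapsed = re.sub(r'\s+', ' ', text).strip()
print(repr(collapsed))
```

Using `\s+` rather than a literal run of spaces means tabs and newlines are normalized to single spaces as well.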

## Handling missing whitespaces (Experimental, remove if bad)

Based on the approach described at `https://stackoverflow.com/questions/8870261/how-to-split-text-without-spaces-into-list-of-words`

In [252]:
import wordninja

df['content'] = df['content'].apply(lambda x: ' '.join(wordninja.split(x)))



In [253]:
df['content'].iloc[0]

'donald trump sends out embarrassing new years eve message this is disturbing donald trump just couldn t wish all americans a happy new year and leave it at that instead he had to give a shout out to his enemies haters and the very dishonest fake news media the former reality show star had just one job to do and he couldn t do it as our country rapidly grows stronger and smarter i want to wish all of my friends supporters enemies haters and even the very dishonest fake news media a happy and healthy new year president angry pants tweeted will be a great year for america as our country rapidly grows stronger and smarter i want to wish all of my friends supporters enemies haters and even the very dishonest fake news media a happy and healthy new year will be a great year for america donald j trump december trump s tweet went down about as well l as you d expect what kind of president sends a new year s greeting like this despicable petty infantile gibberish only trump his lack of decency

In [254]:
df

Unnamed: 0,content,label
0,donald trump sends out embarrassing new years ...,noncredible
1,drunk bragging trump staffer started russian c...,noncredible
2,sheriff david clarke becomes an internet joke ...,noncredible
3,trump is so obsessed he even has obama s name ...,noncredible
4,pope francis just called out donald trump duri...,noncredible
...,...,...
39100,fully committed nato backs new you s approach ...,credible
39101,lex isn ex is withdrew two products from chine...,credible
39102,minsk cultural hub becomes haven from authorit...,credible
39103,vatican upbeat on possibility of pope francis ...,credible


## Tokenization

In [255]:
from textblob import TextBlob, Word

import nltk
nltk.download('punkt_tab') # for tokenization
nltk.download('averaged_perceptron_tagger_eng') # for POS tagging
nltk.download('wordnet') # for lemmatization (WordNet lexical database)

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\eksudee\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\eksudee\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\eksudee\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [259]:
# Tokenize the 'content' column in place; each row becomes a TextBlob WordList of tokens
df['content'] = df['content'].apply(lambda x: TextBlob(x).words)
df.head()

KeyboardInterrupt: 

WordList(['donald', 'trump', 'sends', 'out', 'embarrassing', 'new', 'years', 'eve', 'message', 'this', 'is', 'disturbing', 'donald', 'trump', 'just', 'couldn', 't', 'wish', 'all', 'americans', 'a', 'happy', 'new', 'year', 'and', 'leave', 'it', 'at', 'that', 'instead', 'he', 'had', 'to', 'give', 'a', 'shout', 'out', 'to', 'his', 'enemies', 'haters', 'and', 'the', 'very', 'dishonest', 'fake', 'news', 'media', 'the', 'former', 'reality', 'show', 'star', 'had', 'just', 'one', 'job', 'to', 'do', 'and', 'he', 'couldn', 't', 'do', 'it', 'as', 'our', 'country', 'rapidly', 'grows', 'stronger', 'and', 'smarter', 'i', 'want', 'to', 'wish', 'all', 'of', 'my', 'friends', 'supporters', 'enemies', 'haters', 'and', 'even', 'the', 'very', 'dishonest', 'fake', 'news', 'media', 'a', 'happy', 'and', 'healthy', 'new', 'year', 'president', 'angry', 'pants', 'tweeted', 'will', 'be', 'a', 'great', 'year', 'for', 'america', 'as', 'our', 'country', 'rapidly', 'grows', 'stronger', 'and', 'smarter', 'i', 'want', 