# **Data Preprocessing**


---



Data preprocessing is a crucial step in the data analysis and machine learning pipeline. It transforms raw data into a clean, usable format, with two main objectives: improving the quality of the data and preparing it for analysis or model training. Common preprocessing steps include data cleaning, data transformation, text data processing, handling imbalanced data, dimensionality reduction, data integration, and data splitting.

**The data preprocessing techniques used in this section are:**

- Convert text to lower case
- Remove links
- Remove newlines (\n)
- Remove words containing numbers
- Remove extra spaces
- Remove special characters
- Remove stop words
- Lemmatization



In [3]:
# Import required libraries
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
import spacy
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from google.colab import drive

In [4]:
# Mount Google Drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
# Load dataset
df = pd.read_csv('/content/drive/MyDrive/InfosysSB/FakeNewsNet/FakeNewsNet.csv')

In [6]:
df.head()

| | title | news_url | source_domain | tweet_num | real |
|---|---|---|---|---|---|
| 0 | Kandi Burruss Explodes Over Rape Accusation on... | http://toofab.com/2017/05/08/real-housewives-a... | toofab.com | 42 | 1 |
| 1 | People's Choice Awards 2018: The best red carp... | https://www.today.com/style/see-people-s-choic... | www.today.com | 0 | 1 |
| 2 | Sophia Bush Sends Sweet Birthday Message to 'O... | https://www.etonline.com/news/220806_sophia_bu... | www.etonline.com | 63 | 1 |
| 3 | Colombian singer Maluma sparks rumours of inap... | https://www.dailymail.co.uk/news/article-33655... | www.dailymail.co.uk | 20 | 1 |
| 4 | Gossip Girl 10 Years Later: How Upper East Sid... | https://www.zerchoo.com/entertainment/gossip-g... | www.zerchoo.com | 38 | 1 |


In [7]:
# Overview of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23196 entries, 0 to 23195
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   title          23196 non-null  object
 1   news_url       22866 non-null  object
 2   source_domain  22866 non-null  object
 3   tweet_num      23196 non-null  int64 
 4   real           23196 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 906.2+ KB


In [8]:
# Check if any column contains null values
nan_columns = df.columns[df.isna().any()].tolist()
print("Columns with NaN values:", nan_columns)

Columns with NaN values: ['news_url', 'source_domain']


In [9]:
# Check the count of null values
print(df.isnull().sum())

title              0
news_url         330
source_domain    330
tweet_num          0
real               0
dtype: int64


In [10]:
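# Fill missing values with placeholder strings.
# The result is assigned back to the column: calling fillna(..., inplace=True)
# on a selected column triggers a chained-assignment warning in recent pandas versions.
df['news_url'] = df['news_url'].fillna('missing url')
df['source_domain'] = df['source_domain'].fillna('missing domain')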

In [11]:
# Confirm that no null values remain
print(df.isnull().sum())

title            0
news_url         0
source_domain    0
tweet_num        0
real             0
dtype: int64
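
The same fill can also be written as one call on the whole frame by passing a dictionary that maps each column to its placeholder. The sketch below uses made-up rows rather than the FakeNewsNet data; the names `sample`, `placeholders`, and `filled` are illustrative only.

```python
import pandas as pd

# Made-up rows that mimic the two columns with missing values.
sample = pd.DataFrame({
    "title": ["headline one", "headline two"],
    "news_url": ["http://example.com/a", None],
    "source_domain": ["example.com", None],
})

# One placeholder per column; fillna returns a new frame and leaves `sample` untouched.
placeholders = {"news_url": "missing url", "source_domain": "missing domain"}
filled = sample.fillna(placeholders)

print(filled.isna().sum().sum())  # 0
```

Columns that are not listed in the dictionary are left as they are, so numeric columns such as `tweet_num` would not be affected.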


In [12]:
# Initialize necessary tools
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
nlp = spacy.load('en_core_web_sm')

# Customize the stop-word list: the words below stay in the text instead of being removed.
# NLTK's English stop words are lowercase, so every entry here must be lowercase to match.
important_words_to_keep = {"not", "no", "never", "can", "could", "will", "would", "shall", "should", "may", "might", "must",
                           "but", "and", "or", "yet", "so",
                           "i", "you", "he", "she", "it", "we", "they",
                           "a", "an", "the", "this", "that", "these", "those",
                           "in", "on", "at", "by", "for", "with", "of", "to"}

custom_stop_words = stop_words - important_words_to_keep

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
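
Set difference on strings is case-sensitive, and the NLTK English stop-word list is lowercase, so an entry in the keep set only has an effect when it is lowercase too. A minimal sketch with a made-up, shortened stop-word list (`sample_stop_words` and the keep sets are hypothetical, not the real NLTK list):

```python
# Hypothetical, shortened stop-word list for illustration only.
sample_stop_words = {"i", "a", "the", "not", "over", "is"}

# A keep set that contains a capitalised entry.
keep_with_capital = {"not", "I", "the"}
# The same words, normalised to lower case before the set difference.
keep_lowercased = {word.lower() for word in keep_with_capital}

remaining_with_capital = sample_stop_words - keep_with_capital
remaining_lowercased = sample_stop_words - keep_lowercased

print(sorted(remaining_with_capital))  # ['a', 'i', 'is', 'over']
print(sorted(remaining_lowercased))    # ['a', 'is', 'over']
```

In the first result "i" is still treated as a stop word because "I" never matched it.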


In [13]:
def preprocess_text(text):
    # Lower case
    text = text.lower()

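    # Remove links (anything starting with http or www) and "name.com"-style domains.
    # "http\S+" already covers https, and no ^/$ anchors are used, so re.MULTILINE is not needed.
    text = re.sub(r'http\S+|www\S+|\s[A-Za-z]*\.com', '', text)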

    # Remove newlines, tabs, and extra spaces
    text = re.sub(r"(\\n|\n|\t|\s\s+)", ' ', text).strip()

    # Remove words containing numbers
    text = re.sub(r'\w*\d\w*', '', text)

    # Remove special characters except allowed ones
    allowed_chars = "£$"
    text = re.sub(r"[^a-zA-Z0-9\s" + re.escape(allowed_chars) + "]", '', text)

    # Tokenize text
    tokens = word_tokenize(text)

    # Remove stop words and single-character tokens (the text is already lower case)
    tokens = [token for token in tokens if token not in custom_stop_words and len(token) > 1]

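    # Lemmatization (WordNetLemmatizer treats each token as a noun by default,
    # so verb forms such as "explodes" are left unchanged)
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
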
    return ' '.join(tokens)
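
To see what each regex step contributes, the sketch below reruns the regex-only part of the pipeline on a made-up headline and records every intermediate string. It is an illustration under stated assumptions: `regex_stages` and `sample` are hypothetical names, tokenization, stop-word removal, and lemmatization are left out, and a plain whitespace split stands in for `word_tokenize` at the end.

```python
import re

def regex_stages(text):
    """Apply the regex-only cleaning steps in order and record each intermediate result."""
    stages = []
    text = text.lower()
    stages.append(("lower case", text))
    text = re.sub(r"http\S+|www\S+|\s[a-z]*\.com", "", text)
    stages.append(("links", text))
    text = re.sub(r"(\\n|\n|\t|\s\s+)", " ", text).strip()
    stages.append(("whitespace", text))
    text = re.sub(r"\w*\d\w*", "", text)
    stages.append(("words with numbers", text))
    text = re.sub(r"[^a-zA-Z0-9\s£$]", "", text)
    stages.append(("special characters", text))
    return stages

# Hypothetical headline, not taken from the dataset.
sample = "Breaking: Star's 2nd Album LEAKED!\nSee https://example.com/news now"
stages = regex_stages(sample)
for name, value in stages:
    print(f"{name:>20}: {value!r}")

final_text = stages[-1][1]
# Splitting on whitespace stands in for word_tokenize and absorbs the leftover gap.
print(" ".join(final_text.split()))  # breaking stars album leaked see now
```

Because whitespace is collapsed before words containing numbers are removed, deleting "2nd" leaves a double space behind. In the full function this is harmless, since the text is tokenized and re-joined with single spaces afterwards.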

In [14]:
# Apply preprocessing to the 'title' column
df['title'] = df['title'].apply(preprocess_text)

In [15]:
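# Handle titles that are missing or were left empty by preprocessing
# (an empty string would be read back from the CSV as NaN)
df['title'] = df['title'].fillna('Missing title').replace('', 'Missing title')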

In [16]:
# Save the preprocessed dataset
df.to_csv('/content/drive/MyDrive/InfosysSB/Preprocessed_Dataset/FakeNewsNet_Preprocessed.csv', index=False)

In [17]:
# Load and display the preprocessed dataset
df = pd.read_csv('/content/drive/MyDrive/InfosysSB/Preprocessed_Dataset/FakeNewsNet_Preprocessed.csv')
print(df['title'][0])

kandi burruss explodes rape accusation on real housewife of atlanta reunion video
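
One detail worth checking around the save and reload: a title that preprocessing reduces to an empty string is written as an empty CSV field, and `pd.read_csv` reads an empty field back as NaN by default. The sketch below shows this with made-up rows and an in-memory buffer instead of the Drive path.

```python
import io
import pandas as pd

# Made-up rows: the second title is an empty string, as preprocessing could produce.
sample = pd.DataFrame({"title": ["red carpet look", ""], "real": [1, 0]})

# Write to an in-memory buffer and read it back, mimicking the save/reload cells.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(sample["title"].isna().tolist())    # [False, False]
print(reloaded["title"].isna().tolist())  # [False, True]
```

Filling empty titles with a placeholder before saving, or reading with `keep_default_na=False`, are two possible ways to avoid the surprise; which one fits depends on how later steps treat missing titles.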


In [18]:
df.head()

| | title | news_url | source_domain | tweet_num | real |
|---|---|---|---|---|---|
| 0 | kandi burruss explodes rape accusation on real... | http://toofab.com/2017/05/08/real-housewives-a... | toofab.com | 42 | 1 |
| 1 | people choice award the best red carpet look | https://www.today.com/style/see-people-s-choic... | www.today.com | 0 | 1 |
| 2 | sophia bush sends sweet birthday message to on... | https://www.etonline.com/news/220806_sophia_bu... | www.etonline.com | 63 | 1 |
| 3 | colombian singer maluma spark rumour of inappr... | https://www.dailymail.co.uk/news/article-33655... | www.dailymail.co.uk | 20 | 1 |
| 4 | gossip girl year later upper east siders shock... | https://www.zerchoo.com/entertainment/gossip-g... | www.zerchoo.com | 38 | 1 |
