# A. Table of contents:

- [B. Instructions:](#B.-Instructions:)  
- [C. Import & Data Load:](#C.-Import-&-Data-Load:)  
- [1. Data wrangling:](#1.-Data-wrangling:)  
    - [1.1. Data cleaning:](#1.1.-Data-cleaning:)  
    - [1.2. Text normalization:](#1.2.-Text-normalization:)  
- [2. Exploratory data analysis, EDA:](#2.-Exploratory-data-analysis,-EDA:)  
- [3. Data save:](#3.-Data-save:)  

# B. Instructions:

In [4]:
from IPython.display import IFrame

In [5]:
instructions_path = r"..\references\2_3_DataWrangling_EDA_Instructions\1585013354_Capstone_Three_Steps_2_3___Data_Wrangling_and_EDA_-_Google_Docs.pdf"

In [6]:
IFrame(instructions_path, width=800, height=600)

# C. Import & Data Load:

In [8]:
import pandas as pd
import spacy


In [9]:
dataset_path = r"..\data\raw\fake_news_dataset.csv"

In [10]:
df = pd.read_csv(dataset_path)

# 1. Data wrangling:

## 1.1. Data cleaning:

### 1.1.1. Data inspection:

In [46]:
df.head()

,title,text,date,source,author,category,label
0,Foreign Democrat final.,more tax development both store agreement lawy...,2023-03-10,NY Times,Paula George,Politics,real
1,To offer down resource great point.,probably guess western behind likely next inve...,2022-05-25,Fox News,Joseph Hill,Politics,fake
2,Himself church myself carry.,them identify forward present success risk sev...,2022-09-01,CNN,Julia Robinson,Business,fake
3,You unit its should.,phone which item yard Republican safe where po...,2023-02-07,Reuters,Mr. David Foster DDS,Science,fake
4,Billion believe employee summer how.,wonder myself fact difficult course forget exa...,2023-04-03,CNN,Austin Walker,Technology,fake


In [67]:
df.shape

(20000, 7)

In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     20000 non-null  object
 1   text      20000 non-null  object
 2   date      20000 non-null  object
 3   source    19000 non-null  object
 4   author    19000 non-null  object
 5   category  20000 non-null  object
 6   label     20000 non-null  object
dtypes: object(7)
memory usage: 1.1+ MB


In [54]:
# Unique values:
for col in ["source", "author", "category"]:
    print(f"\nUnique values in {col}:")
    print(df[col].unique())


Unique values in source:
['NY Times' 'Fox News' 'CNN' 'Reuters' 'Daily News' 'Global Times'
 'The Guardian' 'BBC' nan]

Unique values in author:
['Paula George' 'Joseph Hill' 'Julia Robinson' ... 'Maria Mcbride'
 'Kristen Franklin' 'David Wise']

Unique values in category:
['Politics' 'Business' 'Science' 'Technology' 'Health' 'Sports'
 'Entertainment']


In [58]:
# Count how many unique values:
df[["source", "author", "category"]].nunique()

source          8
author      17051
category        7
dtype: int64

### 1.1.2. Data types:

In [48]:
df["date"] = pd.to_datetime(df["date"])
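
Before converting, it can help to check that every date string actually parses. The sketch below is only an illustration on a hypothetical toy frame (`toy` and its values are assumptions, not rows from the dataset): `errors="coerce"` turns unparseable strings into `NaT`, so they can be counted instead of raising an exception.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({"date": ["2023-03-10", "2022-05-25", "not a date", None]})

# errors="coerce" converts unparseable strings to NaT instead of raising an error.
parsed = pd.to_datetime(toy["date"], errors="coerce")

# Rows that had a value but could not be parsed as a date.
bad_rows = toy[parsed.isna() & toy["date"].notna()]
n_unparseable = len(bad_rows)

print(f"Unparseable dates: {n_unparseable}")
print(bad_rows)
```

If the count is zero on the real data, the plain `pd.to_datetime(df["date"])` call is safe as written.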

In [50]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   title     20000 non-null  object        
 1   text      20000 non-null  object        
 2   date      20000 non-null  datetime64[ns]
 3   source    19000 non-null  object        
 4   author    19000 non-null  object        
 5   category  20000 non-null  object        
 6   label     20000 non-null  object        
dtypes: datetime64[ns](1), object(6)
memory usage: 1.1+ MB


### 1.1.3. Missing values:

Two features, "source" and "author", each have **5% missing values**.  
For **"source"**, imputing with the mode is reasonable because the feature has only 8 unique values across 20,000 rows.  
For **"author"**, mode imputation does not make sense because there are 17,051 unique values. We can either drop the rows with a missing author, which account for 5% of the data, or fill them with a new "Unknown" category and keep every row.  
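
The trade-off between these options can be sketched on a small, made-up frame (`toy` and its values are assumptions used only for illustration): dropping rows shrinks the dataset, while a placeholder category keeps every row.

```python
import pandas as pd

# Hypothetical toy frame (illustrative values, not taken from the dataset).
toy = pd.DataFrame({
    "source": ["CNN", "BBC", None, "CNN"],
    "author": ["A. Writer", None, "B. Writer", None],
})

# Low-cardinality feature: fill missing values with the most frequent value.
source_mode = toy["source"].mode()[0]
source_filled = toy["source"].fillna(source_mode)

# High-cardinality feature, option 1: drop rows with a missing author (loses rows).
dropped = toy.dropna(subset=["author"])

# High-cardinality feature, option 2: keep every row with an explicit placeholder.
filled = toy.assign(author=toy["author"].fillna("Unknown"))

print(f"Rows: original={len(toy)}, after drop={len(dropped)}, after fill={len(filled)}")
```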

In [83]:
source_missing_percentage = (df["source"].isnull().sum() / df.shape[0]) * 100
author_missing_percentage = (df["author"].isnull().sum() / df.shape[0]) * 100

print(f"The 'source' and the 'author' features present {source_missing_percentage}% and {author_missing_percentage}% of missing values respectively.")

The 'source' and the 'author' features present 5.0% and 5.0% of missing values respectively.


In [90]:
source_mode = df["source"].mode()[0]
df["source"] = df["source"].fillna(source_mode)
df["author"] = df["author"].fillna("Unknown")

In [92]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   title     20000 non-null  object        
 1   text      20000 non-null  object        
 2   date      20000 non-null  datetime64[ns]
 3   source    20000 non-null  object        
 4   author    20000 non-null  object        
 5   category  20000 non-null  object        
 6   label     20000 non-null  object        
dtypes: datetime64[ns](1), object(6)
memory usage: 1.1+ MB


### 1.1.4. Duplicates:

In [98]:
title_dup = df["title"].duplicated().sum()
print(f"Duplicated titles: {title_dup}.")

article_dup = df["text"].duplicated().sum()
print(f"Duplicated articles: {article_dup}.")

title_article_dup = df.duplicated(subset=["title", "text"]).sum()
print(f"Duplicated title & article pairs: {title_article_dup}.")

Duplicated titles: 0.
Duplicated articles: 0.
Duplicated title & article pairs: 0.
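
The check above only catches exact matches. As a possible complement, the sketch below compares titles after lowercasing and collapsing whitespace, so that near-identical strings are also flagged. The toy titles are assumptions made up for illustration; whether such a looser check is needed for this dataset is not established here.

```python
import pandas as pd

# Hypothetical toy titles (illustrative only): the first two differ only by case and spacing.
toy = pd.DataFrame({"title": ["Breaking news.", "breaking  news. ", "Other story."]})

# Normalize: lowercase, collapse runs of whitespace, strip leading/trailing spaces.
normalized = (
    toy["title"]
    .str.lower()
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

exact_dups = toy["title"].duplicated().sum()
near_dups = normalized.duplicated().sum()

print(f"Exact duplicates: {exact_dups}, duplicates after normalization: {near_dups}")
```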


### 1.1.5. Target label check:

In [101]:
df["label"].unique()

array(['real', 'fake'], dtype=object)
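
Beyond listing the unique labels, it is usually worth looking at the class balance as well. A minimal sketch, using a made-up `labels` series (an assumption for illustration, not the dataset's actual distribution):

```python
import pandas as pd

# Hypothetical label column (illustrative values only).
labels = pd.Series(["real", "fake", "fake", "real", "fake"], name="label")

# Absolute and relative class frequencies.
counts = labels.value_counts()
shares = labels.value_counts(normalize=True)

# Any label outside the expected set would indicate a data quality problem.
unexpected = set(labels.unique()) - {"real", "fake"}

print(counts)
print(shares.round(2))
print(f"Unexpected labels: {unexpected}")
```

On the real data, the same calls applied to `df["label"]` would show whether the two classes are roughly balanced.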

## 1.2. Text normalization:

### 1.2.1. Tokenization:

In [107]:
nlp = spacy.load("en_core_web_lg")
nlp

<spacy.lang.en.English at 0x24c580739d0>

In [117]:
df["title_doc"] = list(nlp.pipe(df["title"], batch_size=100, n_process=4))

In [119]:
df.head()

,title,text,date,source,author,category,label,title_doc
0,Foreign Democrat final.,more tax development both store agreement lawy...,2023-03-10,NY Times,Paula George,Politics,real,"(Foreign, Democrat, final, .)"
1,To offer down resource great point.,probably guess western behind likely next inve...,2022-05-25,Fox News,Joseph Hill,Politics,fake,"(To, offer, down, resource, great, point, .)"
2,Himself church myself carry.,them identify forward present success risk sev...,2022-09-01,CNN,Julia Robinson,Business,fake,"(Himself, church, myself, carry, .)"
3,You unit its should.,phone which item yard Republican safe where po...,2023-02-07,Reuters,Mr. David Foster DDS,Science,fake,"(You, unit, its, should, .)"
4,Billion believe employee summer how.,wonder myself fact difficult course forget exa...,2023-04-03,CNN,Austin Walker,Technology,fake,"(Billion, believe, employee, summer, how, .)"


In [None]:
# Reuse the `nlp` pipeline and the `title_doc` column created in the cells above.

def clean_doc(doc):
    """Return the lowercased lemmas of a Doc, keeping only alphabetic, non-stop-word tokens."""
    return " ".join(
        token.lemma_.lower()
        for token in doc
        if token.is_alpha  # drops punctuation, digits and special characters
        and not token.is_stop
    )

df["title_clean"] = [clean_doc(doc) for doc in df["title_doc"]]

### 1.2.2. Remove punctuation, digits, special characters:

In [None]:
# clean_text = re.sub(r'[^a-zA-Z\s]', '', text)
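
The commented regular expression above can be wrapped in a small helper. This is a sketch only: the function name `strip_non_letters` and the example sentence are assumptions introduced for illustration, and the extra whitespace collapse is an addition to the original pattern.

```python
import re

# Matches anything that is not an ASCII letter or whitespace.
NON_LETTER = re.compile(r"[^a-zA-Z\s]")

def strip_non_letters(text: str) -> str:
    """Remove punctuation, digits and special characters, then collapse extra spaces."""
    cleaned = NON_LETTER.sub("", text)
    return re.sub(r"\s+", " ", cleaned).strip()

example = strip_non_letters("To offer 2 down: resource, great point!")
print(example)
```

Note that this pattern also removes accented letters, which may or may not be desirable depending on the text.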

### 1.2.3. Stop words:

### 1.2.4. Lemmatization & lower case:

In [None]:
# df["title_lemmas"] = [" ".join([token.lemma_ for token in doc]) for doc in nlp.pipe(df["title"])]

### 1.2.5. Frequency analysis:
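
One simple way to approach this step is to count tokens across the cleaned titles with `collections.Counter`. The sketch below assumes a list of already-cleaned, space-separated titles; `clean_titles` and its contents are made up for illustration and are not output from the notebook.

```python
from collections import Counter

# Hypothetical cleaned titles (illustrative only).
clean_titles = [
    "foreign democrat final",
    "offer resource great point",
    "great point final",
]

# Count every token across all titles.
token_counts = Counter(token for title in clean_titles for token in title.split())

# The three most frequent tokens with their counts.
top = token_counts.most_common(3)
print(top)
```

With the real data, the same counter could be built from `df["title_clean"]` once that column exists.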

# 2. Exploratory data analysis, EDA:

# 3. Data save: