# 1\. Data Loading and Preparation

Start by importing the essential Python libraries.

In [12]:
import pandas as pd
import numpy as np

Load the two CSV files from the data folder.

In [13]:
df_fake = pd.read_csv('../data/raw/Fake.csv')
df_real = pd.read_csv('../data/raw/True.csv')
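
Before labeling and stacking the two frames, it can help to confirm that both files expose the same columns. The sketch below illustrates the idea on two tiny in-memory CSVs; the names `toy_fake` and `toy_real` and their contents are made up for illustration and are not the real data.

```python
import io
import pandas as pd

# Toy stand-ins for Fake.csv and True.csv (hypothetical rows, not the real data).
fake_csv = io.StringIO("title,text,subject,date\nA,aaa,News,2017-01-01\n")
real_csv = io.StringIO("title,text,subject,date\nB,bbb,politicsNews,2017-01-02\n")

toy_fake = pd.read_csv(fake_csv)
toy_real = pd.read_csv(real_csv)

# Stacking the frames later only makes sense if the columns line up.
same_columns = list(toy_fake.columns) == list(toy_real.columns)
print(same_columns)
```

The same comparison could be run on `df_fake.columns` and `df_real.columns` right after loading.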

Create the Target Variable: Assign a clear label to each dataset so I can tell them apart. This is my target variable, which I will eventually predict.

Fake News: Assign the label 0.

Real News: Assign the label 1.

In [14]:
df_fake['label'] = 0
df_real['label'] = 1

Combine DataFrames: Stack the two data sets into a single DataFrame.

Initial Inspection: Check the structure, size, and data types.

In [15]:
df = pd.concat([df_fake, df_real], ignore_index=True)
print(df.shape)
df.info()  # info() prints its summary itself and returns None, so no print() is needed
print(df['label'].value_counts()) # Check for class imbalance!

(44898, 5)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44898 entries, 0 to 44897
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    44898 non-null  object
 1   text     44898 non-null  object
 2   subject  44898 non-null  object
 3   date     44898 non-null  object
 4   label    44898 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 1.7+ MB
label
0    23481
1    21417
Name: count, dtype: int64
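
To turn the raw counts into something easier to judge, the class shares can be computed directly. This is a minimal sketch that copies the two counts from the output above into a Series; the variable names `counts`, `shares`, and `imbalance_ratio` are my own.

```python
import pandas as pd

# Counts copied from the value_counts() output above (0 = fake, 1 = real).
counts = pd.Series({0: 23481, 1: 21417}, name="count")

# Share of each class in the combined dataset.
shares = counts / counts.sum()
print(shares.round(3))

# Ratio of the larger class to the smaller one; values near 1 mean balanced classes.
imbalance_ratio = counts.max() / counts.min()
print(round(imbalance_ratio, 3))
```

The split is roughly 52% fake to 48% real, so the classes are close to balanced. On the real DataFrame, `df['label'].value_counts(normalize=True)` gives the shares in one call.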


# 2\. Initial Data Cleaning 

Drop Unnecessary/Redundant Columns: The title and text are the primary predictive features. Drop subject and date for the initial model; they can be added back later.

In [16]:
df = df.drop(['subject', 'date'], axis=1)

Handle Duplicates: Since these are news articles, identical entries are likely mistakes. Remove all duplicates to ensure a clean dataset.

In [17]:
# Drop exact duplicate rows (keeping the first occurrence) and renumber the index
df = df.drop_duplicates().reset_index(drop=True)
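
It is worth counting duplicates before dropping them, so the size of the cleanup is known. The sketch below uses a small made-up frame (`toy`) to show the difference between rows that are identical in every column and rows that only share the same `text`; checking on `text` alone is an extra idea of mine, not something the notebook does.

```python
import pandas as pd

# Toy data (hypothetical): row 1 repeats row 0 exactly, row 3 reuses row 2's text.
toy = pd.DataFrame({
    "title": ["A", "A", "B", "C"],
    "text": ["same body", "same body", "other body", "other body"],
    "label": [0, 0, 1, 1],
})

# Rows identical to an earlier row in every column.
n_exact = int(toy.duplicated().sum())

# Rows whose text already appeared earlier, regardless of title or label.
n_text = int(toy.duplicated(subset=["text"]).sum())

# drop_duplicates() keeps the first occurrence of each duplicated row.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_exact, n_text, len(deduped))
```

Because subject and date were dropped earlier, duplicates in the real DataFrame are detected on title, text, and label only.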

Check Missing Values: Confirm that no key columns (title, text, label) have missing values (NaN).

In [18]:
print(df.isnull().sum())

title    0
text     0
label    0
dtype: int64
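
A zero count from `isnull()` does not rule out empty or whitespace-only strings, which are not NaN. The sketch below shows one way to look for them on a made-up frame (`toy`); whether the real data contains such rows is not shown here, so treat this as an optional extra check.

```python
import pandas as pd

# Toy data (hypothetical): two rows have text that is blank but not NaN.
toy = pd.DataFrame({
    "title": ["Headline", "Another headline", "Third"],
    "text": ["Body text", "   ", ""],
    "label": [1, 0, 0],
})

# isnull() only counts real missing values, so blank strings pass unnoticed.
n_null = int(toy.isnull().sum().sum())

# Strip whitespace and compare against the empty string to find blank entries.
blank_text = toy["text"].str.strip().eq("")
n_blank = int(blank_text.sum())
print(n_null, n_blank)

# Keep only rows that carry some text.
cleaned = toy[~blank_text].reset_index(drop=True)
```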


Combine Text Fields: For the classification model, combine the title and the full text into a single feature so the model can use information from both.

In [19]:
df['full_text'] = df['title'] + ' ' + df['text']
df = df.drop(['title', 'text'], axis=1) # Drop the originals
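
String concatenation with `+` propagates missing values: if either field is NaN, the combined value is NaN too. The missing-value check above reported none, so the step is safe here, but a more defensive version could fill gaps first. A minimal sketch on a made-up frame (`toy`):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): the second row has no title.
toy = pd.DataFrame({
    "title": ["Headline", np.nan],
    "text": ["Body text", "Body without a title"],
})

# Plain concatenation: a missing title makes the whole result missing.
naive = toy["title"] + " " + toy["text"]

# Fill missing pieces with empty strings, then trim the leftover space.
safe = (toy["title"].fillna("") + " " + toy["text"].fillna("")).str.strip()
print(naive.isna().sum(), safe.tolist())
```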