<a href="https://colab.research.google.com/github/JackMAlucard/Data-Scientist-Technical-Assessment/blob/main/ta_fake_real_news.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#1️⃣ Data Cleaning and Preprocessing

##💾 Loading the text dataset and inspecting its structure

###Setting up the Kaggle API to import the datasets directly from Kaggle

In [1]:
# Upload Kaggle API credentials (kaggle.json), downloaded from https://www.kaggle.com/settings
from google.colab import files
# Assign the result so the notebook does not echo the file contents,
# which include the private API key
uploaded = files.upload()

Saving kaggle.json to kaggle.json

In [2]:
# Move Kaggle API Credentials
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
# Set Permissions
!chmod 600 ~/.kaggle/kaggle.json

###Downloading dataset from Kaggle

In [4]:
# Download dataset directly using the Kaggle API
!kaggle datasets download -d clmentbisaillon/fake-and-real-news-dataset
!ls

fake-and-real-news-dataset.zip: Skipping, found more recently modified local copy (use --force to force download)
fake-and-real-news-dataset.zip	sample_data


###Extracting dataset from zip file into the ```/content``` directory.

In [6]:
# Import 'zipfile' module
import zipfile

# Specify the path to the zip file with the dataset
zip_file_path = 'fake-and-real-news-dataset.zip'

# Specify the directory where the files are to be extracted
extract_dir = '/content'

# Extract the files
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)

!ls

fake-and-real-news-dataset.zip	Fake.csv  sample_data  True.csv


Getting the names of the files inside the dataset to be used later

In [7]:
import os

# Specify the directory path
directory_path = '/content'

# Specify the desired file extension
extension = '.csv'

# Get a list of file names in the directory
file_names = os.listdir(directory_path)

# Filter and save filenames with the desired extension
filtered_files = [file_name for file_name in file_names if file_name.endswith(extension)]

# Print the list of file names
print("List of file names in the directory:")
for file_name in filtered_files:
    print(file_name)

List of file names in the directory:
Fake.csv
True.csv
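
Because ```os.listdir``` returns entries in arbitrary order, looking the files up by name can be safer than relying on their position in the list. The following is a minimal sketch of that idea; the helper name `find_csv_files` is hypothetical and not part of the original notebook, and the usage lines assume the archive was extracted to ```/content```.

```python
from pathlib import Path


def find_csv_files(directory):
    """Map each CSV file's stem (name without extension) to its path.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # sorted() makes the result independent of the file system's listing order
    return {path.stem: path for path in sorted(Path(directory).glob('*.csv'))}


# Example usage in Colab, assuming the archive was extracted to /content:
# csv_files = find_csv_files('/content')
# fake_path, real_path = csv_files['Fake'], csv_files['True']
```

Keying the paths by name makes it explicit which file holds the fake articles and which holds the real ones.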


###Loading and inspecting data

In [8]:
# Import pandas library
import pandas as pd

In [17]:
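# Load the dataset files into pandas DataFrames by explicit name,
# since os.listdir() does not guarantee the order of filtered_files
df1 = pd.read_csv(os.path.join(directory_path, 'Fake.csv'))  # Fake News
df2 = pd.read_csv(os.path.join(directory_path, 'True.csv'))  # Real News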

In [18]:
# Fake News Dataset (df1)
# Get basic information about df1
print("Fake News dataframe info:")
print(df1.info())
print("\n")

# Display the first few rows of df1
print("First few rows of Fake News DataFrame:")
print(df1.head())

Fake News dataframe info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23481 entries, 0 to 23480
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    23481 non-null  object
 1   text     23481 non-null  object
 2   subject  23481 non-null  object
 3   date     23481 non-null  object
dtypes: object(4)
memory usage: 733.9+ KB
None


First few rows of Fake News DataFrame:
                                               title  \
0   Donald Trump Sends Out Embarrassing New Year’...   
1   Drunk Bragging Trump Staffer Started Russian ...   
2   Sheriff David Clarke Becomes An Internet Joke...   
3   Trump Is So Obsessed He Even Has Obama’s Name...   
4   Pope Francis Just Called Out Donald Trump Dur...   

                                                text subject  \
0  Donald Trump just couldn t wish all Americans ...    News   
1  House Intelligence Committee Chairman Devin Nu...    News   
2  On Friday, it was reveal

In [19]:
# Real News Dataset (df2)
# Get basic information about df2
print("Real News dataframe info:")
print(df2.info())
print("\n")

# Display the first few rows of df2
print("First few rows of Real News DataFrame:")
print(df2.head())

Real News dataframe info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21417 entries, 0 to 21416
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    21417 non-null  object
 1   text     21417 non-null  object
 2   subject  21417 non-null  object
 3   date     21417 non-null  object
dtypes: object(4)
memory usage: 669.4+ KB
None


First few rows of Real News DataFrame:
                                               title  \
0  As U.S. budget fight looms, Republicans flip t...   
1  U.S. military to accept transgender recruits o...   
2  Senior U.S. Republican senator: 'Let Mr. Muell...   
3  FBI Russia probe helped by Australian diplomat...   
4  Trump wants Postal Service to charge 'much mor...   

                                                text       subject  \
0  WASHINGTON (Reuters) - The head of a conservat...  politicsNews   
1  WASHINGTON (Reuters) - Transgender people will...  politicsNews   
2  WASHIN

**Observations on the datasets based on their structure**

After loading the two datasets into DataFrames and inspecting their basic properties and first few rows, it can be determined that each dataset consists of:
- 4 data columns
- Text values in every column (the ```object``` dtype)
- The following information for each news article:
  1. title
  2. text
  3. subject
  4. date

So far, the text most relevant for classifying each news article appears to be contained in the *title* and *text* columns.

The *date* column *may* be irrelevant for classification if all the articles come from the same, relatively short time frame.

The *subject* column may contain similar or repeated subjects.

The Kaggle page for this dataset partly supports the last two points. The Fake News set spans 30-Mar-15 to 18-Feb-18, while the Real News set spans 12-Jan-16 to 30-Dec-17. The generic 'News' and 'Politics' subjects account for 68% of the Fake News set, while the Real News set consists entirely of the subjects 'politicsNews' and 'worldNews'.
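
The date ranges and subject shares quoted from the Kaggle page could also be checked directly on the DataFrames. The following is a rough sketch under stated assumptions: the helper `summarize_metadata` is hypothetical, and the toy data only stands in for `df1` and `df2`. If the real *date* column mixes several date formats, parsing may need extra care (for example `format='mixed'` in pandas 2.0 or later).

```python
import pandas as pd


def summarize_metadata(df, date_col='date', subject_col='subject'):
    """Return the parsed date range and the share of each subject.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # Unparseable entries become NaT instead of raising an error
    dates = pd.to_datetime(df[date_col].str.strip(), errors='coerce')
    # normalize=True returns fractions of the total instead of raw counts
    subject_share = df[subject_col].value_counts(normalize=True)
    return dates.min(), dates.max(), subject_share


# Toy data standing in for the real DataFrames (df1, df2)
toy = pd.DataFrame({
    'date': ['December 31, 2017', 'January 12, 2016 ', 'not a date'],
    'subject': ['News', 'News', 'politics'],
})
first, last, share = summarize_metadata(toy)
print(first.date(), last.date())
print(share)
```

On the real data, the same call would be `summarize_metadata(df1)` and `summarize_metadata(df2)`.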

The ```info``` method of pandas DataFrames reports no missing values in either set. The Kaggle page, however, indicates that 3% of the data in the Fake News set is empty.

##🚫 Handle missing values in text data if *present*

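Following the information on the Kaggle page, the cell below checks whether the *text* column contains values that are empty or made up only of whitespace, which ```info``` does not report as missing.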

In [28]:
# Find the most common value in the 'text' column of the Fake News dataset (df1)
most_common_value = df1['text'].mode().values[0]
print("Most Common Value:", most_common_value)
frequency = df1['text'].value_counts().max()
print("Frequency:", frequency)

# Count whitespace-only values in the 'text' column of the Real News dataset (df2)
empty_space_count = df2['text'].str.isspace().sum()

print(f"Number of values filled with empty spaces in 'text': {empty_space_count}")

Most Common Value: 1
Frequency: 626
Number of values filled with empty spaces in 'text': 1
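
Since ```info``` only counts NaN as missing, whitespace-only strings pass unnoticed. The following is a minimal sketch of one way to count blank entries per column in a single pass; the helper `count_blank` is hypothetical and the toy data only stands in for `df1` and `df2`.

```python
import pandas as pd


def count_blank(series):
    """Count entries that are NaN, empty, or whitespace-only.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # Missing values are treated as empty strings before stripping whitespace
    stripped = series.fillna('').astype(str).str.strip()
    return int((stripped == '').sum())


# Toy data standing in for df1 / df2
toy = pd.DataFrame({
    'title': ['A headline', 'Another headline', ' ', 'Last one'],
    'text': ['Some body text', '   ', '', None],
})
blank_counts = {col: count_blank(toy[col]) for col in toy.columns}
print(blank_counts)
```

Applied to the real data, the same dictionary comprehension could be run over `df1.columns` and `df2.columns` to compare both sets.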


Text

##🔄 Perform text preprocessing steps

**Considerations before preprocessing**

Before performing text preprocessing, the two sets will be unified into a single dataset with the following columns: title, text, label (0 fake, 1 real).

The subject column will be dropped under the working assumption that, because the subjects are so uniform within each set, this categorization of the news will not be significantly relevant to the classification task.

The date column will also be dropped: although the distribution of the dates could help classify this kind of news through pattern seeking, it is not relevant for classification at an NLP level.

###Merging datasets, dropping subject and date columns, adding label column

In [None]:
#Code
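
The merging step described above could look roughly like the following. This is only a sketch: the helper `build_labeled_dataset` is hypothetical, the toy frames stand in for `df1` (fake) and `df2` (real), and the label convention (0 fake, 1 real) is the one stated in the considerations.

```python
import pandas as pd


def build_labeled_dataset(fake_df, real_df):
    """Combine both sets into one DataFrame with title, text, and label.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # Selecting only title and text drops the subject and date columns
    fake = fake_df[['title', 'text']].assign(label=0)  # 0 = fake
    real = real_df[['title', 'text']].assign(label=1)  # 1 = real
    # ignore_index gives the combined frame a fresh 0..n-1 index
    return pd.concat([fake, real], ignore_index=True)


# Toy data standing in for df1 (fake) and df2 (real)
toy_fake = pd.DataFrame({'title': ['f1'], 'text': ['fake body'],
                         'subject': ['News'], 'date': ['December 31, 2017']})
toy_real = pd.DataFrame({'title': ['r1', 'r2'], 'text': ['real body', 'more'],
                         'subject': ['worldNews'] * 2,
                         'date': ['December 30, 2017'] * 2})
combined = build_labeled_dataset(toy_fake, toy_real)
print(combined)
```

With the real data, the call would be `build_labeled_dataset(df1, df2)`; shuffling the rows afterwards may be worth considering before any train/test split, since the concatenated frame keeps all fake articles first.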

##✨ Provide a clean, preprocessed text dataset with extensive documentation

In [None]:
#Code

#2️⃣ Text Analysis and Feature Extraction

Text

##📝 Conduct text analysis to gain insights from the text data

Text

In [None]:
#Code

##🛰️ Extract advanced features from text data

Text

In [None]:
#Code

##📊 Visualize and explore text data patterns using advanced NLP visualization

Text

In [None]:
#Code

#3️⃣ Predictive Modeling with NLP

Text

##✂️ Split the text dataset into training, validation, and testing sets

Text

In [None]:
#Code

##📈 Choose and implement an NLP model to handle the text data

Text

##🤖 Select an algorithm that best suits the needs of this assessment

Text

In [None]:
#Code

##🏃‍♂️ Train the model on the training set

Text

In [None]:
#Code

##📝 Evaluate the model on the validation set

Text

#4️⃣ Model Interpretation

Text

##👨‍💻 Interpret complex text model predictions

Text

In [None]:
#Code

##🧑‍🚀 Explore the model's sensitivity

Text

In [None]:
#Code

##🧑‍🏫 Discuss the model's robustness

Text

In [None]:
#Code