# Cyber Data Science – Full Pipeline

Author: Basel Shaer

---

## 1. Dataset Justification

I selected the **"Fake and Real News Dataset"** from Kaggle because it combines textual, categorical, and temporal data, which is enough to support a full end-to-end data science pipeline. It contains news articles labeled as either *fake* or *real*, with fields for the title, full text, subject, and publication date.

The dataset is also directly relevant to misinformation, a current global issue, which adds practical value to the project.

In [2]:
import pandas as pd

# Load fake and real datasets
fake_df = pd.read_csv("../data/fake_news_dataset/Fake.csv")
real_df = pd.read_csv("../data/fake_news_dataset/True.csv")

# Add a 'label' column to each: 0 = fake, 1 = real
fake_df["label"] = 0
real_df["label"] = 1

# Combine into one dataframe
df = pd.concat([fake_df, real_df], ignore_index=True)

# Show first 5 rows
df.head()


| | title | text | subject | date | label |
|---|---|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 | 0 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 | 0 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 | 0 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 | 0 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 | 0 |
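With both files combined, a quick look at the class balance can confirm that the labels were assigned as intended. The following is only a sketch: `summarize_labels` is a hypothetical helper, not part of the original notebook, and the small `sample` frame stands in for the real `df` so that the block runs on its own.

```python
import pandas as pd


def summarize_labels(frame: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    """Return the row count and share of each label value (hypothetical helper)."""
    counts = frame[label_col].value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / len(frame)})


# Stand-in data (assumption): three fake rows and one real row.
sample = pd.DataFrame({"title": ["a", "b", "c", "d"], "label": [0, 0, 0, 1]})

summary = summarize_labels(sample)
print(summary)
```

In the notebook itself, calling `summarize_labels(df)` on the combined frame would report how many rows carry label 0 (fake) and label 1 (real).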


## 2. System Stage

The dataset was downloaded from [Kaggle](https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset). It consists of two CSV files:

- `Fake.csv`: 23,481 fake news articles
- `True.csv`: 21,417 real news articles

Each file contains the following columns:
- `title`: Title of the news article
- `text`: Full body of the news article
- `subject`: Category or topic of the article
- `date`: Publish date
- `label`: Not present in the raw files; added in this notebook when loading (0 = Fake, 1 = Real)
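The column list above can be checked programmatically, and the `date` strings can be converted to real timestamps. The sketch below is one possible approach, not the notebook's own method. It assumes dates follow the `"December 31, 2017"` pattern seen in the preview; anything that does not match becomes `NaT` instead of raising an error. `check_schema`, `parse_dates`, and the `sample` frame are hypothetical names introduced for illustration.

```python
import pandas as pd

# Columns expected in each raw CSV file, as listed above.
EXPECTED_COLUMNS = ["title", "text", "subject", "date"]


def check_schema(frame: pd.DataFrame, expected=EXPECTED_COLUMNS) -> list:
    """Return the expected columns that are missing from the frame."""
    return [col for col in expected if col not in frame.columns]


def parse_dates(frame: pd.DataFrame, column: str = "date",
                fmt: str = "%B %d, %Y") -> pd.DataFrame:
    """Add a '<column>_parsed' datetime column; unparseable values become NaT."""
    out = frame.copy()
    cleaned = out[column].str.strip()  # tolerate stray surrounding whitespace
    out[column + "_parsed"] = pd.to_datetime(cleaned, format=fmt, errors="coerce")
    return out


# Stand-in data (assumption): two valid dates and one malformed value.
sample = pd.DataFrame({
    "title": ["t1", "t2", "t3"],
    "text": ["x", "y", "z"],
    "subject": ["News", "News", "News"],
    "date": ["December 31, 2017", " December 30, 2017 ", "not a date"],
})

missing = check_schema(sample)
parsed = parse_dates(sample)
print("Missing columns:", missing)
print("Unparseable dates:", parsed["date_parsed"].isna().sum())
```

Counting the `NaT` values afterwards gives a quick measure of how many rows carry a date in some other format and need separate handling.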

**File System Information:**
- `Format`: CSV (comma-separated values)
- `Protocol`: HTTPS download from Kaggle
- `Versions`: One main version (no updates or revisions at the time of use)

The files are stored locally under `data/fake_news_dataset/`.

Version control is managed via Git and GitHub in this project repository.

- `Repository`: [GitHub - cyber-data-pipeline](https://github.com/Basel6/Cyber-Data-Pipeline.git)
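Because the dataset is used as a single fixed version, it can help to record a fingerprint of each file so that a later re-download can be compared against it. The following is a minimal sketch using only the standard library. `file_fingerprint` is a hypothetical helper, and the temporary `demo.csv` file is a stand-in so the block runs without the real data.

```python
import hashlib
import tempfile
from pathlib import Path


def file_fingerprint(path, chunk_size: int = 1 << 20) -> dict:
    """Return the name, size in bytes, and SHA-256 hex digest of a file."""
    path = Path(path)
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in chunks so large CSV files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "name": path.name,
        "size_bytes": path.stat().st_size,
        "sha256": digest.hexdigest(),
    }


# Stand-in file (assumption): a tiny CSV written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.csv"
    demo.write_bytes(b"title,text\n")
    info = file_fingerprint(demo)

print(info["name"], info["size_bytes"], info["sha256"][:12])
```

In the notebook, the same function could be pointed at `../data/fake_news_dataset/Fake.csv` and `True.csv`, with the resulting digests stored alongside the code in the Git repository.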


In [3]:
import os

fake_path = "../data/fake_news_dataset/Fake.csv"
real_path = "../data/fake_news_dataset/True.csv"

print("Fake.csv size:", os.path.getsize(fake_path) / 1024, "KB")
print("True.csv size:", os.path.getsize(real_path) / 1024, "KB")


Fake.csv size: 61318.23828125 KB
True.csv size: 52327.08984375 KB


## 3. Metadata

Describe the purpose of **Metadata**, then provide any relevant code, tables, or visualizations below.

In [None]:
# Your code here

## 4. Statistics

Describe the purpose of **Statistics**, then provide any relevant code, tables, or visualizations below.

In [None]:
# Your code here

## 5. Abnormality Detection

Describe the purpose of **Abnormality Detection**, then provide any relevant code, tables, or visualizations below.

In [None]:
# Your code here

## 6. Clustering

Describe the purpose of **Clustering**, then provide any relevant code, tables, or visualizations below.

In [None]:
# Your code here

## 7. Segment Analysis

Describe the purpose of **Segment Analysis**, then provide any relevant code, tables, or visualizations below.

In [None]:
# Your code here

## 8. Natural Language Processing (NLP)

Describe the purpose of **Natural Language Processing (NLP)**, then provide any relevant code, tables, or visualizations below.

In [None]:
# Your code here

## 9. Graphs

Describe the purpose of **Graphs**, then provide any relevant code, tables, or visualizations below.

In [None]:
# Your code here

## 10. Modeling

Describe the purpose of **Modeling**, then provide any relevant code, tables, or visualizations below.

In [None]:
# Your code here

## 11. Reporting

Describe the purpose of **Reporting**, then provide any relevant code, tables, or visualizations below.

In [None]:
# Your code here

## 12. Improvements

Describe the purpose of **Improvements**, then provide any relevant code, tables, or visualizations below.

In [None]:
# Your code here