# 📊 ***Data Science, Phase2*** 📚

* **Member 1** : [Kasra Kashani, 810101490] 🆔
* **Member 2** : [Borna Foroohari, 810101480] 🆔

📄 **Subjects**: Databases, AI pipelines, CI/CD (Continuous Integration and Continuous Delivery), MLOps 

## 🔹**Imports**

Import required modules.

In [None]:
import pandas as pd
import numpy as np
import sqlite3
import os
import subprocess
import torch
from sqlalchemy import create_engine
from sklearn.preprocessing import StandardScaler
from transformers import BertTokenizer, BertModel
from tqdm import tqdm

## 📍 **Section 1: Database Implementation and Data Querying**

In this step, we first **clean** our dataset and then load it into a **structured database system**, here a **SQLite database**.

We then run several **queries** against that database.

### 💡1. *Choosing a suitable database based on dataset requirements*

For this project, the `SQLite` database was selected for the following reasons:

- It does not require separate installation, as it is natively supported in Python.

- It is lightweight and portable, making it ideal for small-scale and personal projects.

- It is well-suited for academic assignments and quick prototyping.

- It is fully compatible with Python libraries such as SQLAlchemy and Pandas, enabling seamless integration into the data pipeline.

### 💡2. *Database schema designing*

In our project, the dataset is stored in a single table named `news`, which contains the cleaned news articles along with various extracted linguistic and statistical features.

#### **Table: `news`**

| Column Name                  | Data Type | Description                                                               |
|-----------------------------|-----------|---------------------------------------------------------------------------|
| `title`                     | TEXT      | The headline or title of the news article.                                |
| `text`                      | TEXT      | The full content/body of the news article.                                |
| `subject`                   | TEXT      | The subject of the article.                                               |
| `date`                      | TEXT      | The publication date of the article (converted to standard format).       |
| `label`                     | TEXT      | Indicates whether the news is real (`"true"`) or fake (`"fake"`).         |
| `title_capital_word_count`  | INTEGER   | Number of all-uppercase words in the title.                               |
| `is_question_title`         | INTEGER   | Binary indicator (1 or 0) showing if the title is in question form.       |
| `title_emotional_word_count`| INTEGER   | Number of emotionally charged words in the title.                         |
| `title_word_count`          | INTEGER   | Total number of words in the title.                                       |
| `text_word_count`           | INTEGER   | Total number of words in the full article text.                           |
| `text_stopword_count`       | INTEGER   | Number of stopwords in the article text.                                  |
| `text_stopword_ratio`       | FLOAT     | Ratio of stopwords to total words in the text.                            |
| `text_sentence_count`       | INTEGER   | Number of sentences in the article text.                                  |
| `text_lexical_diversity`    | FLOAT     | Ratio of unique words to total words in the text.                         |
| `title_number_count`        | INTEGER   | Number of numerical terms found in the title.                             |
| `text_number_count`         | INTEGER   | Number of numerical terms found in the text.                              |
| `text_url_count`            | INTEGER   | Number of URLs or hyperlinks present in the text.                         |
| `general_category`            | TEXT      | The general category of the article.                                    |

#### **Relationships**

This project uses a **flat structure with a single table** (`news`), and therefore, no foreign key relationships exist between multiple tables.

#### **Primary and Foreign Keys**

In database schema design, **primary keys** are used to uniquely identify each record in a table, while **foreign keys** are used to establish relationships between multiple tables.

Since our project uses a flat, single-table structure, there is no need for foreign keys. All relevant information and extracted features are stored in the news table.

However, to ensure each news article entry can be uniquely identified, we rely on SQLite's implicit `rowid`, which SQLite assigns automatically to every row of an ordinary table. It plays the role of the primary key here, even though the schema declares no `PRIMARY KEY` column.
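The implicit `rowid` described above can be inspected directly. The following is a minimal sketch on an in-memory toy table; the table name `news_demo` and its three rows are made up for illustration and are not part of the project database.

```python
import sqlite3

# In-memory database with a toy table that declares no primary key
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE news_demo (title TEXT, label TEXT)")
conn_demo.executemany(
    "INSERT INTO news_demo (title, label) VALUES (?, ?)",
    [("Headline A", "fake"), ("Headline B", "true"), ("Headline C", "fake")],
)

# rowid is not returned by SELECT *, so it has to be requested by name
rows = conn_demo.execute(
    "SELECT rowid, title, label FROM news_demo ORDER BY rowid"
).fetchall()
for row in rows:
    print(row)

conn_demo.close()
```

The same `SELECT rowid, ...` pattern should work on the `news` table, since `to_sql` creates an ordinary table.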

### 💡3. *Importing cleaned dataset into the database*

We use Python libraries such as **Pandas** and **SQLAlchemy** to automate the data import process.

First, we run the preprocessing script, which merges our 2 datasets into the `news.csv` file and adds some extracted features.

Then we load this cleaned dataset into a Pandas dataframe.

In [2]:
# Build the path to the preprocessing script
base_path_data = os.path.abspath(os.path.join(os.getcwd(), "..", "scripts"))
data_file = os.path.join(base_path_data, "preprocess_featureExtract.py")

# Run the preprocessing script (check=True raises an error if the script fails)
subprocess.run(["python", data_file], check=True)

CompletedProcess(args=['python', 'c:\\Users\\Asus\\Desktop\\University\\term 6\\Foundations of Data Science\\Final Project\\Phase2\\Project_P2_810101490_810101480\\scripts\\preprocess_featureExtract.py'], returncode=0)

In [3]:
# Create the directory structure for the dataset file
base_path_data = os.path.abspath(os.path.join(os.getcwd(), "..",  "dataset"))
os.makedirs(base_path_data, exist_ok=True)
data_file = os.path.join(base_path_data, "news.csv")

# Read the CSV file into a dataframe
df = pd.read_csv(data_file, low_memory=False)

In [4]:
# Show the dataframe
df

Unnamed: 0,title,text,subject,date,label,title_capital_word_count,is_question_title,title_emotional_word_count,title_word_count,text_word_count,text_stopword_count,text_stopword_ratio,text_sentence_count,text_lexical_diversity,title_number_count,text_number_count,text_url_count,general_category
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,2017-12-31,fake,0,0,1,12,495,198,0.400000,28,0.527273,0,40,0,World-news
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,2017-12-31,fake,0,0,0,8,305,125,0.409836,11,0.645902,0,1,0,World-news
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,2017-12-30,fake,0,0,0,15,580,227,0.391379,25,0.544828,0,41,1,World-news
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,2017-12-29,fake,1,0,0,14,444,172,0.387387,15,0.583333,0,32,4,World-news
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,2017-12-25,fake,0,0,0,11,420,207,0.492857,19,0.550000,0,0,0,World-news
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44893,'Fully committed' NATO backs new U.S. approach...,BRUSSELS (Reuters) - NATO allies on Tuesday we...,worldnews,2017-08-22,true,2,0,0,9,466,189,0.405579,15,0.562232,0,10,0,World-news
44894,LexisNexis withdrew two products from Chinese ...,"LONDON (Reuters) - LexisNexis, a provider of l...",worldnews,2017-08-22,true,0,0,0,7,125,48,0.384000,6,0.640000,0,1,0,World-news
44895,Minsk cultural hub becomes haven from authorities,MINSK (Reuters) - In the shadow of disused Sov...,worldnews,2017-08-22,true,0,0,0,7,320,140,0.437500,16,0.656250,0,5,0,World-news
44896,Vatican upbeat on possibility of Pope Francis ...,MOSCOW (Reuters) - Vatican Secretary of State ...,worldnews,2017-08-22,true,0,0,0,9,205,85,0.414634,8,0.678049,0,3,0,World-news


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44898 entries, 0 to 44897
Data columns (total 18 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   title                       44898 non-null  object 
 1   text                        44898 non-null  object 
 2   subject                     44898 non-null  object 
 3   date                        33285 non-null  object 
 4   label                       44898 non-null  object 
 5   title_capital_word_count    44898 non-null  int64  
 6   is_question_title           44898 non-null  int64  
 7   title_emotional_word_count  44898 non-null  int64  
 8   title_word_count            44898 non-null  int64  
 9   text_word_count             44898 non-null  int64  
 10  text_stopword_count         44898 non-null  int64  
 11  text_stopword_ratio         44898 non-null  float64
 12  text_sentence_count         44898 non-null  int64  
 13  text_lexical_diversity      448

In [6]:
df.describe()

Unnamed: 0,title_capital_word_count,is_question_title,title_emotional_word_count,title_word_count,text_word_count,text_stopword_count,text_stopword_ratio,text_sentence_count,text_lexical_diversity,title_number_count,text_number_count,text_url_count
count,44898.0,44898.0,44898.0,44898.0,44898.0,44898.0,44898.0,44898.0,44898.0,44898.0,44898.0,44898.0
mean,1.815537,0.028086,0.163459,12.453472,405.282284,167.0439,0.399132,14.871821,0.612277,0.137445,6.730055,0.10486
std,2.564639,0.16522,0.414382,4.111476,351.265595,147.373016,0.074657,12.693722,0.128954,0.443749,11.009077,0.481299
min,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,10.0,203.0,80.25,0.372199,7.0,0.549622,0.0,1.0,0.0
50%,1.0,0.0,0.0,11.0,362.0,149.0,0.403433,13.0,0.597765,0.0,4.0,0.0
75%,3.0,0.0,0.0,14.0,513.0,216.0,0.439294,19.0,0.661538,0.0,9.0,0.0
max,24.0,1.0,4.0,42.0,8135.0,3161.0,0.749999,321.0,1.0,6.0,649.0,22.0


Now we connect to or create the database, using an **engine interface**. Then we load our dataframe into a SQL Table, named `news`.

In [7]:
# Create the directory structure for the database file
base_path_db = os.path.abspath(os.path.join(os.getcwd(), "..",  "database"))
os.makedirs(base_path_db, exist_ok=True)
db_file = os.path.join(base_path_db, "news_dataset.db")

# Connect to SQLite database (or create it if it doesn't exist)
engine = create_engine(f"sqlite:///{db_file}")

# Write the dataframe to a SQL table named 'news'
df_db = df.copy()
df_db.to_sql("news", con=engine, index=False, if_exists="replace")

44898
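After a load like the one above, it can be worth confirming that the row count and column layout in the database match the dataframe. The sketch below does this on a small made-up dataframe and an in-memory database; the names `toy`, `news_demo`, and `conn_demo` are illustrative assumptions, and a plain `sqlite3` connection is used instead of the SQLAlchemy engine to keep the example self-contained.

```python
import sqlite3
import pandas as pd

# Toy dataframe with one text, one integer, and one float column
toy = pd.DataFrame({
    "title": ["Headline A", "Headline B"],
    "title_word_count": [2, 2],
    "text_stopword_ratio": [0.40, 0.38],
})

# Write it to an in-memory SQLite database
conn_demo = sqlite3.connect(":memory:")
toy.to_sql("news_demo", con=conn_demo, index=False, if_exists="replace")

# Compare the stored row count with the dataframe length
row_count = conn_demo.execute("SELECT COUNT(*) FROM news_demo").fetchone()[0]

# PRAGMA table_info lists (cid, name, type, notnull, default, pk) per column
schema = conn_demo.execute("PRAGMA table_info(news_demo)").fetchall()
column_names = [col[1] for col in schema]
for col in schema:
    print(col[1], col[2])

print(row_count == len(toy))
conn_demo.close()
```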

### 💡4. *Database Queries and Explorations*

In this part, we execute and document **9 meaningful SQL queries** on our database.

First, we connect to the database we created, named `news_dataset`.

In [8]:
# Create the directory structure for the database file
base_path_query = os.path.abspath(os.path.join(os.getcwd(), "..",  "database"))
os.makedirs(base_path_query, exist_ok=True)
db_file = os.path.join(base_path_query, "news_dataset.db")

# Connect to the SQLite database
conn = sqlite3.connect(db_file)

Before the main queries, we preview the first rows as they are stored in the database.

In [9]:
query0 = """
    SELECT * 
    FROM news 
    LIMIT 5;
"""

# Execute the query and read the results into a dataframe
df0 = pd.read_sql_query(query0, conn)

# Show the dataframe
df0

Unnamed: 0,title,text,subject,date,label,title_capital_word_count,is_question_title,title_emotional_word_count,title_word_count,text_word_count,text_stopword_count,text_stopword_ratio,text_sentence_count,text_lexical_diversity,title_number_count,text_number_count,text_url_count,general_category
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,2017-12-31,fake,0,0,1,12,495,198,0.4,28,0.527273,0,40,0,World-news
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,2017-12-31,fake,0,0,0,8,305,125,0.409836,11,0.645902,0,1,0,World-news
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,2017-12-30,fake,0,0,0,15,580,227,0.391379,25,0.544828,0,41,1,World-news
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,2017-12-29,fake,1,0,0,14,444,172,0.387387,15,0.583333,0,32,4,World-news
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,2017-12-25,fake,0,0,0,11,420,207,0.492857,19,0.55,0,0,0,World-news


***❓ Are fake news titles more emotionally charged than true ones?***

In [10]:
query1 = """
    SELECT label, AVG(title_emotional_word_count) AS avg_emotion
    FROM news
    GROUP BY label;
"""

# Execute the query and read the results into a dataframe
df1 = pd.read_sql_query(query1, conn)

# Show the dataframe
df1

Unnamed: 0,label,avg_emotion
0,fake,0.228397
1,true,0.092263


📊 *Insight*:

On average, fake news titles contain more emotionally charged words than true headlines.

This suggests that fake news often uses emotional language to provoke stronger reactions and draw attention.

***❓ Are fake news titles more often written as questions?***

In [11]:
query2 = """
    SELECT label, AVG(is_question_title) AS question_ratio
    FROM news
    GROUP BY label;
"""

# Execute the query and read the results into a dataframe
df2 = pd.read_sql_query(query2, conn)

# Show the dataframe
df2

Unnamed: 0,label,question_ratio
0,fake,0.048805
1,true,0.00537


📊 *Insight*:

Approximately 4.9% of fake news titles are phrased as questions, compared to only 0.5% of true news.

This implies that fake news uses interrogative titles more often, possibly as a clickbait strategy to spark curiosity or uncertainty.

***❓ Do fake news articles contain more hyperlinks than true news?***

In [12]:
query3 = """
    SELECT label, AVG(text_url_count) AS avg_links
    FROM news
    GROUP BY label;
"""

# Execute the query and read the results into a dataframe
df3 = pd.read_sql_query(query3, conn)

# Show the dataframe
df3

Unnamed: 0,label,avg_links
0,fake,0.200503
1,true,0.0


📊 *Insight*:

Fake news articles average about 0.2 hyperlinks each, while the true articles in this dataset contain none.

This may indicate that fake content often attempts to appear credible or redirect users to suspicious external sources.

***❓ Do fake news articles show lower lexical diversity (simpler language)?***

In [13]:
query4 = """
    SELECT label, AVG(text_lexical_diversity) AS avg_diversity
    FROM news
    GROUP BY label;
"""

# Execute the query and read the results into a dataframe
df4 = pd.read_sql_query(query4, conn)

# Show the dataframe
df4

Unnamed: 0,label,avg_diversity
0,fake,0.596366
1,true,0.62972


📊 *Insight*:

Fake news has slightly lower lexical diversity than true news, suggesting simpler or more repetitive vocabulary.

This could reflect attempts to target wider, less critical audiences or replicate template-based content.

***❓ Are longer titles more likely to be fake?***

In [14]:
query5 = """
    SELECT label, COUNT(*) AS count
    FROM news
    WHERE title_word_count > 12
    GROUP BY label;
"""

# Execute the query and read the results into a dataframe
df5 = pd.read_sql_query(query5, conn)

# Show the dataframe
df5

Unnamed: 0,label,count
0,fake,16202
1,true,1489


📊 *Insight*:

Among titles longer than 12 words, 16,202 belong to fake news and only 1,489 to true news, a massive difference.

This supports the idea that fake news often uses overly detailed or sensationalized titles to attract attention.
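Raw counts depend on how many articles each label has, so a per-label share can complement the count above. The following sketch shows one way to express that in SQL, run here on a made-up in-memory table (`news_demo` and its eight rows are illustrative assumptions, not project data).

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE news_demo (label TEXT, title_word_count INTEGER)")
conn_demo.executemany(
    "INSERT INTO news_demo VALUES (?, ?)",
    [("fake", 15), ("fake", 14), ("fake", 9), ("fake", 13),
     ("true", 8), ("true", 13), ("true", 10), ("true", 9)],
)

# Share of titles longer than 12 words within each label
share_query = """
    SELECT label,
           COUNT(*) AS total,
           SUM(CASE WHEN title_word_count > 12 THEN 1 ELSE 0 END) AS long_titles,
           ROUND(SUM(CASE WHEN title_word_count > 12 THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2)
               AS long_title_percent
    FROM news_demo
    GROUP BY label
    ORDER BY label;
"""
shares = conn_demo.execute(share_query).fetchall()
for row in shares:
    print(row)

conn_demo.close()
```

Replacing `news_demo` with `news` would give the corresponding percentages for the real dataset.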

***❓ What percentage of fake vs. true news titles include numbers?***

In [15]:
query6 = """
  SELECT label, ROUND(SUM(CASE WHEN title_number_count > 0 THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS percent_with_number
  FROM news
  GROUP BY label;
"""

# Execute the query and read the results into a dataframe
df6 = pd.read_sql_query(query6, conn)

# Show the dataframe
df6

Unnamed: 0,label,percent_with_number
0,fake,12.81
1,true,8.4


📊 *Insight*:

We can see that 12.81% of fake news titles contain numbers, compared to 8.40% of true news titles.

This suggests that fake news may rely more on numeric terms (e.g. *"10 shocking facts"*, *"2020 warning"*) to boost engagement.

***❓ Do fake news articles use a higher ratio of stopwords in their text?***

In [16]:
query7 = """
    SELECT label, AVG(text_stopword_ratio) AS avg_stop_ratio
    FROM news
    GROUP BY label;
"""

# Execute the query and read the results into a dataframe
df7 = pd.read_sql_query(query7, conn)

# Show the dataframe
df7

Unnamed: 0,label,avg_stop_ratio
0,fake,0.416388
1,true,0.380213


📊 *Insight*:

The stopword ratio in fake news (about 0.416) is higher than in true news (about 0.380).

This may indicate that fake news includes more filler or less content-rich writing to appear longer or more detailed.

***❓ Which subject has the highest proportion of fake news?***

In [17]:
query8 = """
    SELECT general_category,
        COUNT(*) AS total_articles,
        SUM(CASE WHEN label = 'fake' THEN 1 ELSE 0 END) AS fake_count,
        ROUND(SUM(CASE WHEN label = 'fake' THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS fake_percentage
    FROM news
    GROUP BY general_category
    ORDER BY fake_percentage DESC
    LIMIT 5;
"""

# Execute the query and read the results into a dataframe
df8 = pd.read_sql_query(query8, conn)

# Show the dataframe
df8

Unnamed: 0,general_category,total_articles,fake_count,fake_percentage
0,World-news,26785,16640,62.12
1,Politics-news,18113,6841,37.77


📊 *Insight*:

The *World news* category shows the highest proportion of fake news (62.12%), followed by *Politics news* (37.77%).

This indicates that general or ambiguous topics may be more prone to misinformation and content manipulation.

***❓ On which days of the week is the publication of fake news more likely?***

In [18]:
query9 = """
    -- SQLite's %w returns 0 for Sunday through 6 for Saturday
    SELECT
        strftime('%w', date) AS weekday_number,
        CASE strftime('%w', date)
            WHEN '0' THEN 'Sunday'
            WHEN '1' THEN 'Monday'
            WHEN '2' THEN 'Tuesday'
            WHEN '3' THEN 'Wednesday'
            WHEN '4' THEN 'Thursday'
            WHEN '5' THEN 'Friday'
            WHEN '6' THEN 'Saturday'
        END AS weekday_name,
        label,
        COUNT(*) AS article_count
    FROM news
    WHERE date IS NOT NULL
    GROUP BY weekday_number, label
    ORDER BY weekday_number;
"""

# Execute the query and read the results into a dataframe
df9 = pd.read_sql_query(query9, conn)

# Show the dataframe
df9

Unnamed: 0,weekday_number,weekday_name,label,article_count
0,0,Sunday,fake,1558
1,0,Sunday,true,1427
2,1,Monday,fake,1620
3,1,Monday,true,3064
4,2,Tuesday,fake,1764
5,2,Tuesday,true,3749
6,3,Wednesday,fake,1829
7,3,Wednesday,true,4184
8,4,Thursday,fake,1860
9,4,Thursday,true,4106


📊 *Insight*:

In the rows shown, the number of fake articles stays fairly flat across days (1,558 to 1,860), while the number of true articles rises from 1,427 on *Sunday* to more than 4,000 on *Wednesday* and *Thursday*.

Fake news therefore makes up the largest share on *Sunday*, where it slightly outnumbers true news, and the smallest share midweek. This is consistent with legitimate sources publishing mostly on working days.
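Because the weekday labels depend on how SQLite numbers the days, it helps to check the `%w` convention against dates whose weekday is known. A minimal sketch, using three dates that also appear in the dataset preview (2017-12-31 was a Sunday, 2017-12-25 a Monday, 2017-12-30 a Saturday):

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")

# strftime('%w', ...) returns the weekday as text: '0' = Sunday ... '6' = Saturday
weekday_codes = conn_demo.execute(
    """
    SELECT strftime('%w', '2017-12-31'),
           strftime('%w', '2017-12-25'),
           strftime('%w', '2017-12-30');
    """
).fetchone()
print(weekday_codes)

conn_demo.close()
```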

## 📍 **Section 2: Advanced Feature Engineering, Data Preprocessing, and Preparation for Modeling**

In this section, we carry out detailed feature engineering and professional-level preprocessing, building on the work ***which we have already done previously.***

The work is also structured to match the **standardized project folder format** introduced previously.

We prepare our processed dataset based on insights from our exploratory analysis (the **EDA from Phase 1**), which used Power BI visualizations.

This dataset will be ready for direct usage in **modeling tasks**, such as classification.

### 💡1. *Reviewing initial insights (EDA)*

In this part, we revisit and document key insights gained from our *Phase 1* visualizations and exploratory data analysis, done in *Power BI*.

Also we clearly define the features we plan to engineer based on these insights.

#### ***📊 Key Insights from Phase 1: Exploratory Dashboard***

---

#### 1️⃣ **Fake News Seeks Attention Early in the Week**

- Fake news articles tend to peak on **Mondays and Tuesdays**, possibly because creators target the beginning of the week when online activity is higher and audiences are more engaged.

---

#### 2️⃣ **True News Drops on Weekends**

- In contrast, true news articles are mostly published during **weekdays**, with a sharp drop during weekends. This pattern reflects the **structured workflow of professional news agencies**.

---

#### 3️⃣ **Fake News Titles Are Short, Emotional & Loud**

- Fake headlines are generally **shorter**, include more **emotional and capitalized words**, and often resemble **clickbait**. Words like *BREAKING* or *WOW* are common.

---

#### 4️⃣ **True News Uses Richer, Balanced Language**

- True headlines tend to have **more structured phrasing**, use **neutral vocabulary**, and contain **more stop words**, reflecting **professional journalistic style**.

---

#### 5️⃣ **Topic-Wise Difference: Politics vs World News**

- Fake news dominates **World News**, likely due to broader, less verifiable topics. On the other hand, **Politics News** includes more true articles, possibly due to better verification and official sources.

---

#### 6️⃣ **Fake News Relies on Identity-Based Vocabulary**

- Words like _Trump_, _President_, and _U.S._ are more frequent in fake news, used as **emotional triggers** instead of factual context.

---

#### 7️⃣ **True News Surged After 2017**

- There is a significant **+254% growth in true news after 2017**, likely due to **platform-level interventions** and **fake news moderation policies**.

---

#### 8️⃣ **Fake News Titles Often Use Questions**

- Although rare overall, fake news shows a **slightly higher reliance** on question-style titles, a technique often used in **misinformation tactics**.

---

#### 9️⃣ **Shorter, Denser Articles in Fake News**

- Fake articles are often **shorter** and more **word-dense**, possibly to deliver fast, emotionally loaded content without much detail.

---

#### ***👷 Features Engineered Based on Exploratory Insights***

Based on insights from the SQL queries and the Phase 1 exploratory analysis (EDA), we identified several behavioral patterns in fake versus true news articles. We engineered the following features, previously in the `preprocess_featureExtract.py` script, to capture these differences more effectively:

| Feature Name                  | Description                                                         | Insight-Based Justification                                        |
|------------------------------|---------------------------------------------------------------------|---------------------------------------------------------------------|
| `title_word_count`           | Number of words in the title                                       | Fake news tends to have longer titles                              |
| `title_emotional_word_count` | Number of emotionally charged words in title                       | Fake news headlines use more emotional language                    |
| `is_question_title`          | Binary flag if the title is a question                             | Fake news often uses question-style headlines (clickbait pattern)  |
| `title_capital_word_count`   | Count of fully capitalized words in the title                      | Fake headlines use more emphasis via capital letters               |
| `title_number_count`         | Number of numeric terms in the title                               | Fake headlines use numbers to draw attention                       |
| `text_stopword_count`        | Total number of stopwords in article text                          | Fake news contains more stopwords (filler, less content-rich)      |
| `text_stopword_ratio`        | Ratio of stopwords to total words                                  | Higher ratio in fake news articles                                 |
| `text_lexical_diversity`     | Unique words / total words in text                                 | Fake news tends to use simpler, repetitive language                |
| `text_url_count`             | Number of URLs/hyperlinks in the article                           | Fake news includes more outbound links                             |
| `general_category`           | Categorical grouping of subjects (Politics news vs. World news)    | Simplifies subject categories for more generalizable patterns      |
| `has_question_mark`          | Binary flag if the title contains '?'                              | Fake news often hints uncertainty or curiosity                     |
| `title_avg_word_length`      | Average character length of words in the title                     | Longer words may indicate technicality or sophistication level     |
| `stopword_to_length_ratio`   | Stopword count divided by text length                              | A normalized indicator of low-information-density writing          |

### 💡2. *Performing some additional advanced feature engineering*

In this part, we implement some additional **feature engineering** methods on top of our previous features.

#### ➕ Creating new meaningful features

We derive 5 new features from our existing ones.

In [19]:
# Feature 1: Ratio of stopwords to total text word count (+1 avoids division by zero for empty texts)
df["stopword_to_length_ratio"] = df["text_stopword_count"] / (df["text_word_count"] + 1)

# Feature 2: Average word length (in characters) in the title
df["title_avg_word_length"] = df["title"].apply(lambda x: np.mean([len(w) for w in str(x).split()]) if pd.notnull(x) else 0)

# Feature 3: Word density = total words / sentences (+1 avoids division by zero)
df["word_density"] = df["text_word_count"] / (df["text_sentence_count"] + 1)

# Feature 4: Emotional word density in title (+1 avoids division by zero)
df["emotional_density"] = df["title_emotional_word_count"] / (df["title_word_count"] + 1)

# Feature 5: Interaction between question format and emotional intensity
df["question_emotion_interaction"] = df["is_question_title"] * df["title_emotional_word_count"]

In [20]:
# Show the updated dataframe
df

Unnamed: 0,title,text,subject,date,label,title_capital_word_count,is_question_title,title_emotional_word_count,title_word_count,text_word_count,...,text_lexical_diversity,title_number_count,text_number_count,text_url_count,general_category,stopword_to_length_ratio,title_avg_word_length,word_density,emotional_density,question_emotion_interaction
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,2017-12-31,fake,0,0,1,12,495,...,0.527273,0,40,0,World-news,0.399194,5.583333,17.068966,0.076923,0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,2017-12-31,fake,0,0,0,8,305,...,0.645902,0,1,0,World-news,0.408497,7.625000,25.416667,0.000000,0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,2017-12-30,fake,0,0,0,15,580,...,0.544828,0,41,1,World-news,0.390706,5.000000,22.307692,0.000000,0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,2017-12-29,fake,1,0,0,14,444,...,0.583333,0,32,4,World-news,0.386517,4.571429,27.750000,0.000000,0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,2017-12-25,fake,0,0,0,11,420,...,0.550000,0,0,0,World-news,0.491686,5.363636,21.000000,0.000000,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44893,'Fully committed' NATO backs new U.S. approach...,BRUSSELS (Reuters) - NATO allies on Tuesday we...,worldnews,2017-08-22,true,2,0,0,9,466,...,0.562232,0,10,0,World-news,0.404711,5.888889,29.125000,0.000000,0
44894,LexisNexis withdrew two products from Chinese ...,"LONDON (Reuters) - LexisNexis, a provider of l...",worldnews,2017-08-22,true,0,0,0,7,125,...,0.640000,0,1,0,World-news,0.380952,6.571429,17.857143,0.000000,0
44895,Minsk cultural hub becomes haven from authorities,MINSK (Reuters) - In the shadow of disused Sov...,worldnews,2017-08-22,true,0,0,0,7,320,...,0.656250,0,5,0,World-news,0.436137,6.142857,18.823529,0.000000,0
44896,Vatican upbeat on possibility of Pope Francis ...,MOSCOW (Reuters) - Vatican Secretary of State ...,worldnews,2017-08-22,true,0,0,0,9,205,...,0.678049,0,3,0,World-news,0.412621,5.888889,22.777778,0.000000,0


##### 🔧 Engineered Features: Type, Sources, and Rationale

The table below summarizes the engineered features, including their data type, the original features they were based on, the reason those original features were considered important, and the specific insight or signal each new feature is designed to capture.

| Engineered Feature             | Type      | Based on Features Used                             | Why Those Features Were Important                                     | Purpose / Insight Captured                                                |
|-------------------------------|-----------|----------------------------------------------------|------------------------------------------------------------------------|----------------------------------------------------------------------------|
| `stopword_to_length_ratio`    | Ratio     | `text_stopword_count`, `text_word_count`           | Fake news tends to use more filler words and have lower text richness.| Measures text density and informational content.                          |
| `title_avg_word_length`       | Average   | `title`                                            | Simpler/shorter words are more common in fake headlines.              | Estimates language sophistication in the headline.                        |
| `word_density`                | Ratio     | `text_word_count`, `text_sentence_count`           | Fake content may have excessive verbosity or over-simplified structure.| Captures average sentence length (text compactness).                     |
| `emotional_density`           | Ratio     | `title_emotional_word_count`, `title_word_count`   | Fake headlines use emotional words heavily, often in short titles.    | Measures intensity of emotional language in the title.                    |
| `question_emotion_interaction`| Interaction | `is_question_title`, `title_emotional_word_count` | Fake news often combines emotional tone with question-style headlines.| Detects emotionally charged clickbait phrasing.                          |
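The ratio features above add 1 to the denominator so that empty texts do not cause a division by zero, which also shrinks every ratio slightly (for example, 198 / 496 ≈ 0.3992 instead of 198 / 495 = 0.4). As a hedged alternative, the sketch below guards only the zero case; the toy dataframe and the names `smoothed` and `guarded` are illustrative assumptions, not part of the project code.

```python
import pandas as pd

# Two toy rows: a normal article and an empty one
toy = pd.DataFrame({
    "text_stopword_count": [198, 0],
    "text_word_count": [495, 0],
})

# Approach used in the notebook: +1 in the denominator
smoothed = toy["text_stopword_count"] / (toy["text_word_count"] + 1)

# Alternative: replace zero word counts with NaN, divide, then map NaN back to 0
words = toy["text_word_count"]
guarded = (toy["text_stopword_count"] / words.where(words > 0)).fillna(0.0)

print(smoothed.tolist())
print(guarded.tolist())
```

Either choice is defensible; the guarded version keeps the exact ratio for non-empty texts, while the `+1` version is simpler to write.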


#### 🔤 Text vectorization (TF-IDF, embeddings)

Since we are working with textual data, text vectorization is required. Here, we vectorize the `title` and `text` columns using **BERT** embeddings.

In [21]:
# Load BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

# Use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Ensure title and text are strings
df["title"] = df["title"].astype(str)
df["text"] = df["text"].astype(str)

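# Define embedding function: returns one 768-dimensional vector per input string
def get_bert_embedding(txt):
    # Inputs longer than 512 tokens are truncated, so long articles are only partially encoded
    inputs = tokenizer(txt, return_tensors="pt", truncation=True, padding=True, max_length=512)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)

    # Mean-pool the token vectors of the last hidden layer
    return outputs.last_hidden_state.mean(dim=1).squeeze().cpu().numpy()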

# Apply to title
tqdm.pandas()
df["title_embedding"] = df["title"].progress_apply(get_bert_embedding)

# Apply to text
df["text_embedding"] = df["text"].progress_apply(get_bert_embedding)

100%|██████████| 44898/44898 [29:49<00:00, 25.09it/s]    
100%|██████████| 44898/44898 [3:11:58<00:00,  3.90it/s]  


In [22]:
# Show the updated dataframe
df

Unnamed: 0,title,text,subject,date,label,title_capital_word_count,is_question_title,title_emotional_word_count,title_word_count,text_word_count,...,text_number_count,text_url_count,general_category,stopword_to_length_ratio,title_avg_word_length,word_density,emotional_density,question_emotion_interaction,title_embedding,text_embedding
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,2017-12-31,fake,0,0,1,12,495,...,40,0,World-news,0.399194,5.583333,17.068966,0.076923,0,"[-0.087652735, 0.3052924, 0.4860082, -0.215476...","[-0.110753626, 0.08251098, 0.48530617, -0.0257..."
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,2017-12-31,fake,0,0,0,8,305,...,1,0,World-news,0.408497,7.625000,25.416667,0.000000,0,"[0.07518646, -0.43053955, 0.2936775, -0.136392...","[-0.20508027, -0.095989324, 0.056940734, 0.127..."
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,2017-12-30,fake,0,0,0,15,580,...,41,1,World-news,0.390706,5.000000,22.307692,0.000000,0,"[0.0004514205, 0.22627294, 0.02068508, -0.2220...","[-0.040056203, 0.017511971, 0.35744786, -0.052..."
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,2017-12-29,fake,1,0,0,14,444,...,32,4,World-news,0.386517,4.571429,27.750000,0.000000,0,"[0.11892752, 0.03887261, 0.16360746, -0.031435...","[-0.12169886, -0.06291097, 0.56233454, -0.0505..."
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,2017-12-25,fake,0,0,0,11,420,...,0,0,World-news,0.491686,5.363636,21.000000,0.000000,0,"[0.18226019, 0.020588301, 0.45404932, -0.16698...","[-0.14068584, 0.15590447, 0.2628155, -0.124266..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44893,'Fully committed' NATO backs new U.S. approach...,BRUSSELS (Reuters) - NATO allies on Tuesday we...,worldnews,2017-08-22,true,2,0,0,9,466,...,10,0,World-news,0.404711,5.888889,29.125000,0.000000,0,"[-0.22016054, -0.2646486, -0.28054425, -0.0202...","[-0.29300874, -0.2690835, 0.20311326, -0.01129..."
44894,LexisNexis withdrew two products from Chinese ...,"LONDON (Reuters) - LexisNexis, a provider of l...",worldnews,2017-08-22,true,0,0,0,7,125,...,1,0,World-news,0.380952,6.571429,17.857143,0.000000,0,"[-0.23533003, 0.030441722, -0.2116002, 0.14849...","[-0.35906503, 0.024558794, 0.24344209, -0.0435..."
44895,Minsk cultural hub becomes haven from authorities,MINSK (Reuters) - In the shadow of disused Sov...,worldnews,2017-08-22,true,0,0,0,7,320,...,5,0,World-news,0.436137,6.142857,18.823529,0.000000,0,"[-0.16499297, -0.22954267, 0.22405392, -0.0613...","[-0.049460813, 0.023484496, 0.32261994, -0.105..."
44896,Vatican upbeat on possibility of Pope Francis ...,MOSCOW (Reuters) - Vatican Secretary of State ...,worldnews,2017-08-22,true,0,0,0,9,205,...,3,0,World-news,0.412621,5.888889,22.777778,0.000000,0,"[-0.12216297, -0.21545862, 0.27457115, -0.2240...","[-0.40189856, -0.1370452, 0.28224158, -0.14327..."


In this project, the goal is to analyze the content of news articles (titles and bodies) to detect fake news. To achieve a deeper understanding of the meaning and semantics of each article, we used **BERT** (Bidirectional Encoder Representations from Transformers) for these reasons:

1. Unlike traditional approaches like TF-IDF or CountVectorizer that only consider word frequency, BERT understands the **contextual meaning of each word**.

2. BERT is **trained on massive text corpora** (e.g., Wikipedia, BooksCorpus) and brings general language understanding into our task, eliminating the need for training from scratch.

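3. BERT produces a **768-dimensional** vector for every token. Averaging these token vectors (mean pooling) gives one dense 768-dimensional embedding per text, capturing semantic and syntactic information. Inputs longer than 512 tokens are truncated, so only the beginning of a long article is embedded. These embeddings can be fed directly into machine learning models.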

As a result, we applied BERT to generate embeddings for both `title` (short text) and `text` (long text) columns separately, allowing our models to learn from the unique semantic signals each contains.
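The mean-pooling step can be illustrated on a toy array. The notebook embeds one text per call, so no padding tokens are present and a plain mean is enough. If texts were batched instead, padded positions would have to be excluded from the average. The sketch below shows that masked variant with NumPy; `masked_mean_pool` is a hypothetical helper, and the 4-dimensional vectors stand in for BERT's 768-dimensional ones.

```python
import numpy as np

def masked_mean_pool(hidden, mask):
    """Average token vectors per text, ignoring padded positions.

    hidden: (batch, tokens, dim) array of token vectors
    mask:   (batch, tokens) array, 1 for real tokens and 0 for padding
    """
    weights = mask[:, :, None].astype(hidden.dtype)      # broadcast over dim
    summed = (hidden * weights).sum(axis=1)               # sum of real tokens
    counts = np.clip(weights.sum(axis=1), 1.0, None)      # avoid division by 0
    return summed / counts

# Toy batch: 2 texts, 3 token slots, 4 dimensions (hypothetical values)
hidden = np.arange(24, dtype=float).reshape(2, 3, 4)
mask = np.array([[1, 1, 1],
                 [1, 1, 0]])   # the second text has one padded slot
pooled = masked_mean_pool(hidden, mask)
print(pooled.shape)
```

Each row of `pooled` is one fixed-length vector per text, which is the shape stored in `title_embedding` and `text_embedding`.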

#### 🧠 Encoding categorical variables appropriately

We encode the categorical variables: the `label` column is mapped to a binary value, and the `general_category` and `subject` columns are **one-hot encoded**.

In [23]:
# Label encoding
df["label"] = df["label"].map({"fake": 0, "true": 1})

# One-hot encoding for general_category and subject
df = pd.get_dummies(df, columns=["general_category", "subject"], drop_first=True)
df[df.select_dtypes("bool").columns] = df.select_dtypes("bool").astype(int)

In [24]:
# Show the updated dataframe
df

Unnamed: 0,title,text,date,label,title_capital_word_count,is_question_title,title_emotional_word_count,title_word_count,text_word_count,text_stopword_count,...,title_embedding,text_embedding,general_category_World-news,subject_Middle-east,subject_News,subject_US_News,subject_left-news,subject_politics,subject_politicsNews,subject_worldnews
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,2017-12-31,0,0,0,1,12,495,198,...,"[-0.087652735, 0.3052924, 0.4860082, -0.215476...","[-0.110753626, 0.08251098, 0.48530617, -0.0257...",1,0,1,0,0,0,0,0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,2017-12-31,0,0,0,0,8,305,125,...,"[0.07518646, -0.43053955, 0.2936775, -0.136392...","[-0.20508027, -0.095989324, 0.056940734, 0.127...",1,0,1,0,0,0,0,0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",2017-12-30,0,0,0,0,15,580,227,...,"[0.0004514205, 0.22627294, 0.02068508, -0.2220...","[-0.040056203, 0.017511971, 0.35744786, -0.052...",1,0,1,0,0,0,0,0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",2017-12-29,0,1,0,0,14,444,172,...,"[0.11892752, 0.03887261, 0.16360746, -0.031435...","[-0.12169886, -0.06291097, 0.56233454, -0.0505...",1,0,1,0,0,0,0,0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,2017-12-25,0,0,0,0,11,420,207,...,"[0.18226019, 0.020588301, 0.45404932, -0.16698...","[-0.14068584, 0.15590447, 0.2628155, -0.124266...",1,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44893,'Fully committed' NATO backs new U.S. approach...,BRUSSELS (Reuters) - NATO allies on Tuesday we...,2017-08-22,1,2,0,0,9,466,189,...,"[-0.22016054, -0.2646486, -0.28054425, -0.0202...","[-0.29300874, -0.2690835, 0.20311326, -0.01129...",1,0,0,0,0,0,0,1
44894,LexisNexis withdrew two products from Chinese ...,"LONDON (Reuters) - LexisNexis, a provider of l...",2017-08-22,1,0,0,0,7,125,48,...,"[-0.23533003, 0.030441722, -0.2116002, 0.14849...","[-0.35906503, 0.024558794, 0.24344209, -0.0435...",1,0,0,0,0,0,0,1
44895,Minsk cultural hub becomes haven from authorities,MINSK (Reuters) - In the shadow of disused Sov...,2017-08-22,1,0,0,0,7,320,140,...,"[-0.16499297, -0.22954267, 0.22405392, -0.0613...","[-0.049460813, 0.023484496, 0.32261994, -0.105...",1,0,0,0,0,0,0,1
44896,Vatican upbeat on possibility of Pope Francis ...,MOSCOW (Reuters) - Vatican Secretary of State ...,2017-08-22,1,0,0,0,9,205,85,...,"[-0.12216297, -0.21545862, 0.27457115, -0.2240...","[-0.40189856, -0.1370452, 0.28224158, -0.14327...",1,0,0,0,0,0,0,1


To prepare our dataset for machine learning models, we encoded all categorical features into numerical format as follows:

`label` column (target variable) was binary encoded, mapping *fake* to 0 and *true* to 1. This allows classification algorithms to interpret the target as a binary numeric label.

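`general_category` and `subject` columns were one-hot encoded. Since these are nominal variables (unordered categories), one-hot encoding creates a binary column per category. Because we pass `drop_first=True`, the first category of each variable is dropped and acts as the implicit baseline (all of its dummy columns are 0).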

Our dataset doesn't have any ordinal features, so we didn't use ordinal encoding.

This transformation ensures that the categorical features passed into the model are numeric and machine-readable.
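As a small illustration of the encoding described above, the sketch below applies the same two steps to a toy frame. The three rows are made up for the example; only the column names and the `pd.get_dummies` call mirror the notebook.

```python
import pandas as pd

# Toy data (hypothetical rows) with the same column names as the notebook
toy = pd.DataFrame({
    "label": ["fake", "true", "fake"],
    "subject": ["News", "politics", "worldnews"],
})

# Binary mapping for the target
toy["label"] = toy["label"].map({"fake": 0, "true": 1})

# One-hot encoding; drop_first=True removes the first category ("News"),
# which is then represented by all-zero dummy columns
encoded = pd.get_dummies(toy, columns=["subject"], drop_first=True, dtype=int)
print(encoded)
```

Passing `dtype=int` yields 0/1 columns directly, so a separate boolean-to-integer conversion is not needed in this sketch.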

#### 📈 Handling time series

The dataset has one **time-related** feature, the `date` column. We convert it to a DateTime type and extract four numeric columns from it: the **day of the week**, the **month**, a **weekend flag**, and the **season**.

In [25]:
# Convert the date column to datetime format
df["date"] = pd.to_datetime(df["date"], errors="coerce")

# Extract day from the date
df["publish_dayofweek"] = df["date"].dt.dayofweek # 0 = Monday, ..., 6 = Sunday

# Extract month from the date
df["publish_month"] = df["date"].dt.month

# Specify the news that was published on weekends 
df["is_weekend"] = df["publish_dayofweek"].apply(lambda x: 1 if x >= 5 else 0)

# Extract season from the date
def get_season(month):  # 1 = Winter, 2 = Spring, 3 = Summer, 4 = Autumn, 0 = unknown (missing month)
    if month in [12, 1, 2]:
        return 1
    elif month in [3, 4, 5]:
        return 2
    elif month in [6, 7, 8]:
        return 3
    elif month in [9, 10, 11]:
        return 4
    else:
        return 0

df["season"] = df["publish_month"].apply(get_season)

In [26]:
# Show the updated dataframe
df

Unnamed: 0,title,text,date,label,title_capital_word_count,is_question_title,title_emotional_word_count,title_word_count,text_word_count,text_stopword_count,...,subject_News,subject_US_News,subject_left-news,subject_politics,subject_politicsNews,subject_worldnews,publish_dayofweek,publish_month,is_weekend,season
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,2017-12-31,0,0,0,1,12,495,198,...,1,0,0,0,0,0,6.0,12.0,1,1
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,2017-12-31,0,0,0,0,8,305,125,...,1,0,0,0,0,0,6.0,12.0,1,1
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",2017-12-30,0,0,0,0,15,580,227,...,1,0,0,0,0,0,5.0,12.0,1,1
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",2017-12-29,0,1,0,0,14,444,172,...,1,0,0,0,0,0,4.0,12.0,0,1
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,2017-12-25,0,0,0,0,11,420,207,...,1,0,0,0,0,0,0.0,12.0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44893,'Fully committed' NATO backs new U.S. approach...,BRUSSELS (Reuters) - NATO allies on Tuesday we...,2017-08-22,1,2,0,0,9,466,189,...,0,0,0,0,0,1,1.0,8.0,0,3
44894,LexisNexis withdrew two products from Chinese ...,"LONDON (Reuters) - LexisNexis, a provider of l...",2017-08-22,1,0,0,0,7,125,48,...,0,0,0,0,0,1,1.0,8.0,0,3
44895,Minsk cultural hub becomes haven from authorities,MINSK (Reuters) - In the shadow of disused Sov...,2017-08-22,1,0,0,0,7,320,140,...,0,0,0,0,0,1,1.0,8.0,0,3
44896,Vatican upbeat on possibility of Pope Francis ...,MOSCOW (Reuters) - Vatican Secretary of State ...,2017-08-22,1,0,0,0,9,205,85,...,0,0,0,0,0,1,1.0,8.0,0,3


From the `date` column, we extracted several informative features to capture temporal patterns in news publishing behavior:

- **`publish_dayofweek`**: Identifies the day of the week (0 = Monday to 6 = Sunday).
- **`publish_month`**: Captures the month of publication to detect seasonal trends or topic surges.
- **`is_weekend`**: A binary feature indicating if the article was published on a weekend (Saturday or Sunday).
- **`season`**: Maps the publication month to a season (1 = Winter, 2 = Spring, 3 = Summer, 4 = Autumn; 0 when the month is missing).

These features allow the model to consider when a news article was released, as fake and true news may exhibit different temporal behaviors.
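The behavior of these derived features, including what happens when a date cannot be parsed, can be checked on a toy frame. The three date strings below are made up for the example; the feature logic follows the notebook cell, with the weekend flag written as a vectorized comparison.

```python
import pandas as pd

def get_season(month):
    # Same mapping as the notebook: 1 = Winter, 2 = Spring, 3 = Summer,
    # 4 = Autumn, 0 = unknown (for example a missing month)
    if month in [12, 1, 2]:
        return 1
    if month in [3, 4, 5]:
        return 2
    if month in [6, 7, 8]:
        return 3
    if month in [9, 10, 11]:
        return 4
    return 0

# Toy input (hypothetical): two valid dates and one unparseable string
toy = pd.DataFrame({"date": ["2017-12-31", "2017-08-22", "not a date"]})
toy["date"] = pd.to_datetime(toy["date"], errors="coerce")  # invalid -> NaT

toy["publish_dayofweek"] = toy["date"].dt.dayofweek   # NaN where date is NaT
toy["publish_month"] = toy["date"].dt.month
toy["is_weekend"] = (toy["publish_dayofweek"] >= 5).astype(int)  # NaN >= 5 is False
toy["season"] = toy["publish_month"].apply(get_season)
print(toy)
```

For the unparseable row, `publish_dayofweek` and `publish_month` are missing, while `is_weekend` and `season` silently fall back to 0.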

#### 🖼️ Image data

❌ Not applicable ❌

Our project does not involve any image-based data. Therefore, no image preprocessing or visual feature extraction was necessary for this pipeline.

### 💡3. *Comprehensive data preprocessing*

In this part, we will handle **missing data and null values**, **normalize and standardize** numeric features, and remove **irrelevant and noisy** features.

#### ❌ Handle missing data professionally

First we find the columns containing null values, and then handle each one as explained below.

In [27]:
# Find missing values in the dataframe
missing = df.isnull().sum()
missing[missing > 0].sort_values(ascending=False)

date                 11613
publish_dayofweek    11613
publish_month        11613
dtype: int64

In [28]:
# Fill NaN values in the date column with a default date
df["date"] = df["date"].fillna(pd.to_datetime("2000-01-01"))

In [29]:
# Find NaN indices in day column
nan_indices_day = df[df["publish_dayofweek"].isna()].index

# Total number of NaNs
n_missing_day = len(nan_indices_day)

# Days of week (0 to 6)
days = list(range(7))

# Repeat days enough times and slice exactly n_missing elements
repeated_days = (days * ((n_missing_day // 7) + 1))[:n_missing_day]

# Shuffle for better distribution
np.random.shuffle(repeated_days)

# Fill NaNs with the distributed days
df.loc[nan_indices_day, "publish_dayofweek"] = repeated_days

In [30]:
# Find NaN indices in month column
nan_indices_month = df[df["publish_month"].isna()].index

# Total number of NaNs
n_missing_month = len(nan_indices_month)

# Months (1 to 12)
months = list(range(1, 13))

# Repeat months enough times and slice exactly n_missing elements
repeated_months = (months * ((n_missing_month // 12) + 1))[:n_missing_month]

# Shuffle for better distribution
np.random.shuffle(repeated_months)

# Fill NaNs with the distributed months
df.loc[nan_indices_month, "publish_month"] = repeated_months

In [31]:
# Check missing values in the dataframe again
missing = df.isnull().sum()
missing[missing > 0].sort_values(ascending=False)

Series([], dtype: int64)

To handle missing values in time-based features, we applied the following strategies:

1. `date` → Missing dates were replaced with a fixed placeholder date, `2000-01-01`, indicating that the exact publication time is unknown.

2. `publish_dayofweek` → Missing values were filled by distributing them evenly, in random order, across all 7 days (0 to 6). This keeps the day distribution balanced instead of assigning every missing row to a single mode or mean.

3. `publish_month` → Filled the same way, by distributing the missing entries evenly across all 12 months (1 to 12).

Note that `is_weekend` and `season` were computed before this imputation, so for rows with a missing date they keep the values derived from the missing day and month (`is_weekend = 0`, `season = 0`) and do not reflect the randomly assigned day or month. The imputed values carry no real temporal information; these decisions allowed us to preserve all rows and keep the time-derived features usable in downstream modeling tasks.
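The notebook shuffles with `np.random.shuffle` and no fixed seed, so the imputed values differ between runs. The sketch below shows one way the same even-distribution idea could be made reproducible with a seeded NumPy generator. `fill_evenly` is a hypothetical helper, not part of the notebook, and the toy series is made up.

```python
import numpy as np
import pandas as pd

def fill_evenly(series, values, seed=0):
    """Fill NaNs by cycling through `values`, in a seeded random order."""
    rng = np.random.default_rng(seed)
    out = series.copy()
    missing = out.isna()
    n_missing = int(missing.sum())
    # np.resize repeats the values cyclically until n_missing entries exist
    fill = np.resize(np.asarray(values, dtype=float), n_missing)
    out.loc[missing] = rng.permutation(fill)
    return out

# Toy column (hypothetical): two known days and eight missing ones
days = pd.Series([0.0, np.nan, 3.0] + [np.nan] * 7)
filled = fill_evenly(days, range(7), seed=42)
print(filled.tolist())
```

With a fixed `seed`, repeated runs give the same imputed column, which makes downstream results easier to compare.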

In [32]:
# Show the updated dataframe
df

Unnamed: 0,title,text,date,label,title_capital_word_count,is_question_title,title_emotional_word_count,title_word_count,text_word_count,text_stopword_count,...,subject_News,subject_US_News,subject_left-news,subject_politics,subject_politicsNews,subject_worldnews,publish_dayofweek,publish_month,is_weekend,season
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,2017-12-31,0,0,0,1,12,495,198,...,1,0,0,0,0,0,6.0,12.0,1,1
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,2017-12-31,0,0,0,0,8,305,125,...,1,0,0,0,0,0,6.0,12.0,1,1
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",2017-12-30,0,0,0,0,15,580,227,...,1,0,0,0,0,0,5.0,12.0,1,1
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",2017-12-29,0,1,0,0,14,444,172,...,1,0,0,0,0,0,4.0,12.0,0,1
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,2017-12-25,0,0,0,0,11,420,207,...,1,0,0,0,0,0,0.0,12.0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44893,'Fully committed' NATO backs new U.S. approach...,BRUSSELS (Reuters) - NATO allies on Tuesday we...,2017-08-22,1,2,0,0,9,466,189,...,0,0,0,0,0,1,1.0,8.0,0,3
44894,LexisNexis withdrew two products from Chinese ...,"LONDON (Reuters) - LexisNexis, a provider of l...",2017-08-22,1,0,0,0,7,125,48,...,0,0,0,0,0,1,1.0,8.0,0,3
44895,Minsk cultural hub becomes haven from authorities,MINSK (Reuters) - In the shadow of disused Sov...,2017-08-22,1,0,0,0,7,320,140,...,0,0,0,0,0,1,1.0,8.0,0,3
44896,Vatican upbeat on possibility of Pope Francis ...,MOSCOW (Reuters) - Vatican Secretary of State ...,2017-08-22,1,0,0,0,9,205,85,...,0,0,0,0,0,1,1.0,8.0,0,3


#### ⚖️ Normalize or standardize numeric features appropriately

**Standardization** transforms data to have a **mean of 0** and a **standard deviation of 1**, preserving the shape of the distribution.

- **Standardization** is a good default when features are roughly **Gaussian (normal-like)**, or when we use models that assume centered features on **comparable scales**, such as linear regression, logistic regression, or SVM. It suits features with different units or scales and, because it has no fixed bounds, it tends to be less distorted by **outliers** than min-max normalization.

**Normalization** rescales data to a **fixed range**, usually [0, 1], affecting the scale but not the distribution shape.

- **Normalization** is ideal when we need **bounded data**, such as for neural networks or distance-based models (e.g., k-NN, KMeans), especially when the algorithm is sensitive to the **absolute scale**. It also fits cases where the distribution is not important and negative values are unwanted. It works best when the data **doesn't contain extreme outliers**.
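The difference between the two transformations can be seen on a small made-up column with one outlier. The sketch below computes both by hand with NumPy; it uses the population standard deviation (`ddof=0`), which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

# Toy feature (hypothetical values) with one extreme outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Standardization (z-score): subtract the mean, divide by the standard deviation
z = (x - x.mean()) / x.std()            # np.std defaults to ddof=0

# Min-max normalization: rescale to the range [0, 1]
mm = (x - x.min()) / (x.max() - x.min())

print("standardized:", np.round(z, 3))
print("normalized:  ", np.round(mm, 3))
```

After min-max scaling, the four small values are squeezed close to 0 because the outlier defines the upper bound. The standardized values are not confined to a fixed range.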

In our dataset, we use only **standardization**, because we have many **count-based and ratio-based numeric features** on **very different scales**. Standardization (mean = 0, std = 1) makes these features **comparable** and **prevents bias** in distance- or weight-sensitive models.

As a result, to ensure all numerical features contribute fairly during model training, we will apply **StandardScaler**, which transforms features to have a mean of 0 and a standard deviation of 1.

This step prevents features with large numerical ranges (e.g., `text_word_count`) from dominating the learning process.

In [33]:
# Specify the features to scale
features_to_scale = [
    "title_capital_word_count",
    "title_emotional_word_count",
    "title_word_count",
    "text_word_count",
    "text_stopword_count",
    "text_stopword_ratio",
    "text_sentence_count",
    "text_lexical_diversity",
    "title_number_count",
    "text_number_count",
    "text_url_count",
    "stopword_to_length_ratio",
    "title_avg_word_length",
    "word_density",
    "emotional_density",
    "question_emotion_interaction"
]

# Scale and standardize the features
scaler = StandardScaler()
df[features_to_scale] = scaler.fit_transform(df[features_to_scale])

##### ✅ Features that were scaled and why:

| Feature Name | Why it was scaled | Benefit to the model |
|--------------|-------------------|------------------------|
| `title_capital_word_count` | May have large values | Helps detect emphasis or shouting in titles |
| `title_emotional_word_count` | Varies across titles | Highlights emotional tone |
| `title_word_count` | Varies widely | Indicates complexity of headlines |
| `text_word_count` | Can be very large | Prevents it from overpowering other features |
| `text_stopword_count` | Tied to text length | Adjusts its weight proportionally |
| `text_stopword_ratio` | Ratio (0–1) | Standardized for better comparison |
| `text_sentence_count` | Differs across articles | Avoids scale dominance |
| `text_lexical_diversity` | Ratio (0–1) | Supports fair modeling with other ratios |
| `title_number_count` | Numeric content in titles | Prevents underweight or overweight |
| `text_number_count` | May range widely | Helps in detecting factual density |
| `text_url_count` | Usually small values | Scaling improves model sensitivity |
| `stopword_to_length_ratio` | Normalized measure | Aligned with other scaled ratios |
| `title_avg_word_length` | Language complexity | Indicates linguistic level |
| `word_density` | Words per sentence | Helps understand content compactness |
| `emotional_density` | Emotional intensity in title | Crucial for sentiment-driven analysis |
| `question_emotion_interaction` | Interaction term with wide range | Scaling controls for excessive impact |

---

##### ❌ Features that were not scaled and why:

| Feature Name | Why it was not scaled |
|--------------|--------|
| `label` | Target binary variable – should not be transformed |
| `title`, `text`, `date` | Raw data – not directly used in model |
| `publish_dayofweek`, `publish_month`, `season` | Represent ordered categories, not true numeric values |
| `is_weekend`, `is_question_title` | Binary – no scaling needed |
| `subject_*`, `general_category_*` | One-hot encoded – already in 0/1 format |
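| `title_embedding`, `text_embedding` | BERT embedding arrays, not scalar columns; `StandardScaler` was not applied to them |
| `tfidf_title_*` | Already normalized by TF-IDF algorithm |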

---

*By applying this targeted standardization, we ensured all continuous features are on a comparable scale, enabling the model to learn more effectively and fairly.*

We also convert the `publish_dayofweek` and `publish_month` columns back to integers; they were stored as floats because they contained missing values before imputation.

In [34]:
# Convert these 2 columns types to integer
df["publish_dayofweek"] = df["publish_dayofweek"].astype(int)
df["publish_month"] = df["publish_month"].astype(int)

In [35]:
# Show the updated dataframe
df

Unnamed: 0,title,text,date,label,title_capital_word_count,is_question_title,title_emotional_word_count,title_word_count,text_word_count,text_stopword_count,...,subject_News,subject_US_News,subject_left-news,subject_politics,subject_politicsNews,subject_worldnews,publish_dayofweek,publish_month,is_weekend,season
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,2017-12-31,0,-0.707919,0,2.018789,-0.110296,0.255416,0.210055,...,1,0,0,0,0,0,6,12,1,1
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,2017-12-31,0,-0.707919,0,-0.394470,-1.083193,-0.285492,-0.285292,...,1,0,0,0,0,0,6,12,1,1
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",2017-12-30,0,-0.707919,0,-0.394470,0.619378,0.497400,0.406837,...,1,0,0,0,0,0,5,12,1,1
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",2017-12-29,0,-0.317997,0,-0.394470,0.376153,0.110225,0.033630,...,1,0,0,0,0,0,4,12,0,1
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,2017-12-25,0,-0.707919,0,-0.394470,-0.353520,0.041900,0.271125,...,1,0,0,0,0,0,0,12,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44893,'Fully committed' NATO backs new U.S. approach...,BRUSSELS (Reuters) - NATO allies on Tuesday we...,2017-08-22,1,0.071926,0,-0.394470,-0.839969,0.172856,0.148985,...,0,0,0,0,0,1,1,8,0,3
44894,LexisNexis withdrew two products from Chinese ...,"LONDON (Reuters) - LexisNexis, a provider of l...",2017-08-22,1,-0.707919,0,-0.394470,-1.326417,-0.797930,-0.807782,...,0,0,0,0,0,1,1,8,0,3
44895,Minsk cultural hub becomes haven from authorities,MINSK (Reuters) - In the shadow of disused Sov...,2017-08-22,1,-0.707919,0,-0.394470,-1.326417,-0.242788,-0.183508,...,0,0,0,0,0,1,1,8,0,3
44896,Vatican upbeat on possibility of Pope Francis ...,MOSCOW (Reuters) - Vatican Secretary of State ...,2017-08-22,1,-0.707919,0,-0.394470,-0.839969,-0.570180,-0.556715,...,0,0,0,0,0,1,1,8,0,3


#### 🧹 Remove irrelevant, noisy, or highly correlated features based on statistical reasoning

Features that are not useful for modeling, contain noisy or incorrect information, or are highly correlated with other features should be identified and removed using statistical reasoning.

- **Irrelevant Features** -> These are features that have no meaningful relationship with the target variable. For example, an ID column or a completely random field.

- **Noisy Features**  -> These are features with inconsistent or error-prone data that can mislead the model and reduce its accuracy. As part of noise reduction and dimensionality cleanup, we should identify and remove features that lack meaningful variation or are overly sparse.

- **Highly Correlated Features** -> When two or more features carry almost the same information (e.g., correlation > 0.95), keeping all of them can lead to overfitting. In such cases, keeping just one is sufficient.
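The correlation criterion above could be sketched as follows. This is a minimal illustration on a toy frame, not the notebook's own procedure: `correlated_columns` is a hypothetical helper, and the 0.95 threshold is the example value mentioned in the text.

```python
import numpy as np
import pandas as pd

def correlated_columns(frame, threshold=0.95):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Toy data (hypothetical): "b" is almost a multiple of "a", "c" is unrelated
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
to_drop = correlated_columns(toy)
print(to_drop)
```

Only the later column of each correlated pair is reported, so one representative of the pair is kept. The helper assumes scalar numeric columns, so array-valued columns such as the embeddings would have to be excluded first.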

First, we ***remove irrelevant raw features*** that are no longer needed after feature engineering:

- `title`: Replaced by its BERT embedding (`title_embedding`) and structural features (e.g., `title_word_count`, `title_emotional_word_count`).
- `text`: Replaced by its BERT embedding (`text_embedding`) and aggregate and density-based metrics (e.g., `word_density`, `text_lexical_diversity`).
- `date`: Only used to derive the temporal features `publish_dayofweek`, `publish_month`, `is_weekend`, and `season`.
- `subject_*`: Replaced by the more general `general_category` feature.

Removing these features helps reduce memory usage and prevents noise in modeling.

In [36]:
# Remove irrelevant columns
df.drop(columns=["title", "text", "date", "subject_Middle-east", "subject_News", "subject_US_News",
                 "subject_left-news", "subject_politics", "subject_politicsNews", "subject_worldnews"], inplace=True)

In [37]:
# Show the updated dataframe
df

Unnamed: 0,label,title_capital_word_count,is_question_title,title_emotional_word_count,title_word_count,text_word_count,text_stopword_count,text_stopword_ratio,text_sentence_count,text_lexical_diversity,...,word_density,emotional_density,question_emotion_interaction,title_embedding,text_embedding,general_category_World-news,publish_dayofweek,publish_month,is_weekend,season
0,0,-0.707919,0,2.018789,-0.110296,0.255416,0.210055,0.011621,1.034238,-0.659185,...,-0.836768,2.138704,-0.069192,"[-0.087652735, 0.3052924, 0.4860082, -0.215476...","[-0.110753626, 0.08251098, 0.48530617, -0.0257...",1,6,12,1,1
1,0,-0.707919,0,-0.394470,-1.083193,-0.285492,-0.285292,0.143371,-0.305022,0.260754,...,-0.013094,-0.386341,-0.069192,"[0.07518646, -0.43053955, 0.2936775, -0.136392...","[-0.20508027, -0.095989324, 0.056940734, 0.127...",1,6,12,1,1
2,0,-0.707919,0,-0.394470,0.619378,0.497400,0.406837,-0.103851,0.797898,-0.523051,...,-0.319859,-0.386341,-0.069192,"[0.0004514205, 0.22627294, 0.02068508, -0.2220...","[-0.040056203, 0.017511971, 0.35744786, -0.052...",1,5,12,1,1
3,0,-0.317997,0,-0.394470,0.376153,0.110225,0.033630,-0.157321,0.010098,-0.224448,...,0.217138,-0.386341,-0.069192,"[0.11892752, 0.03887261, 0.16360746, -0.031435...","[-0.12169886, -0.06291097, 0.56233454, -0.0505...",1,4,12,0,1
4,0,-0.707919,0,-0.394470,-0.353520,0.041900,0.271125,1.255410,0.325218,-0.482940,...,-0.448890,-0.386341,-0.069192,"[0.18226019, 0.020588301, 0.45404932, -0.16698...","[-0.14068584, 0.15590447, 0.2628155, -0.124266...",1,0,12,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44893,1,0.071926,0,-0.394470,-0.839969,0.172856,0.148985,0.086355,0.010098,-0.388086,...,0.352811,-0.386341,-0.069192,"[-0.22016054, -0.2646486, -0.28054425, -0.0202...","[-0.29300874, -0.2690835, 0.20311326, -0.01129...",1,1,8,0,3
44894,1,-0.707919,0,-0.394470,-1.326417,-0.797930,-0.807782,-0.202694,-0.698922,0.214988,...,-0.758998,-0.386341,-0.069192,"[-0.23533003, 0.030441722, -0.2116002, 0.14849...","[-0.35906503, 0.024558794, 0.24344209, -0.0435...",1,1,8,0,3
44895,1,-0.707919,0,-0.394470,-1.326417,-0.242788,-0.183508,0.513920,0.088878,0.341003,...,-0.663644,-0.386341,-0.069192,"[-0.16499297, -0.22954267, 0.22405392, -0.0613...","[-0.049460813, 0.023484496, 0.32261994, -0.105...",1,1,8,0,3
44896,1,-0.707919,0,-0.394470,-0.839969,-0.570180,-0.556715,0.207640,-0.541362,0.510048,...,-0.273475,-0.386341,-0.069192,"[-0.12216297, -0.21545862, 0.27457115, -0.2240...","[-0.40189856, -0.1370452, 0.28224158, -0.14327...",1,1,8,0,3


Then, we delete **noisy features**, which fall into three groups in our dataset:

- **Duplicate features**: During feature generation, several TF-IDF features were unintentionally added twice (e.g., `tfidf_title_trump`, `tfidf_title_vote`, etc.). To prevent over-representation of these words, ensure a fair learning process, and reduce the risk of overfitting, all duplicate columns should be removed.

- **Low-variance features**: We detect features that have only one unique value across all rows, meaning they provide no useful information for classification or prediction. These columns are constant and therefore irrelevant for any statistical or machine learning model.

- **Sparse features (mostly zeros)**: We also analyze all TF-IDF columns and identify those where more than 98% of the values are zeros, excluding binary columns such as `label`, `is_weekend`, and `is_question_title`. These sparse features are considered uninformative due to their extremely limited activation. Removing them also helps reduce dimensionality and prevents model overfitting caused by noisy or rarely used features.
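The sparsity rule described above could be sketched like this. `sparse_columns` is a hypothetical helper and the toy columns are made up; the 98% threshold and the exclusion of binary columns follow the text. A standardized column no longer stores its zeros as 0, so a check like this only makes sense on unscaled values.

```python
import pandas as pd

def sparse_columns(frame, threshold=0.98, exclude=()):
    """Return columns whose share of zero values exceeds threshold."""
    zero_share = (frame == 0).mean()   # fraction of zeros per column
    return [col for col in frame.columns
            if col not in exclude and zero_share[col] > threshold]

# Toy data (hypothetical): 100 rows
toy = pd.DataFrame({
    "rare_term": [0.0] * 99 + [0.7],    # 99% zeros -> sparse
    "common_term": [0.0, 0.3] * 50,     # 50% zeros -> kept
    "is_weekend": [0] * 99 + [1],       # binary flag, excluded from the check
})
flagged = sparse_columns(toy, exclude=("is_weekend",))
print(flagged)
```

The flagged names can then be passed to `DataFrame.drop(columns=...)`.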

In [38]:
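# Remove columns with duplicated names, keeping the first occurrence
# (copy so that later in-place drops act on an independent dataframe)
df = df.loc[:, ~df.columns.duplicated()].copy()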

In [39]:
# Show the updated dataframe
df

Unnamed: 0,label,title_capital_word_count,is_question_title,title_emotional_word_count,title_word_count,text_word_count,text_stopword_count,text_stopword_ratio,text_sentence_count,text_lexical_diversity,...,word_density,emotional_density,question_emotion_interaction,title_embedding,text_embedding,general_category_World-news,publish_dayofweek,publish_month,is_weekend,season
0,0,-0.707919,0,2.018789,-0.110296,0.255416,0.210055,0.011621,1.034238,-0.659185,...,-0.836768,2.138704,-0.069192,"[-0.087652735, 0.3052924, 0.4860082, -0.215476...","[-0.110753626, 0.08251098, 0.48530617, -0.0257...",1,6,12,1,1
1,0,-0.707919,0,-0.394470,-1.083193,-0.285492,-0.285292,0.143371,-0.305022,0.260754,...,-0.013094,-0.386341,-0.069192,"[0.07518646, -0.43053955, 0.2936775, -0.136392...","[-0.20508027, -0.095989324, 0.056940734, 0.127...",1,6,12,1,1
2,0,-0.707919,0,-0.394470,0.619378,0.497400,0.406837,-0.103851,0.797898,-0.523051,...,-0.319859,-0.386341,-0.069192,"[0.0004514205, 0.22627294, 0.02068508, -0.2220...","[-0.040056203, 0.017511971, 0.35744786, -0.052...",1,5,12,1,1
3,0,-0.317997,0,-0.394470,0.376153,0.110225,0.033630,-0.157321,0.010098,-0.224448,...,0.217138,-0.386341,-0.069192,"[0.11892752, 0.03887261, 0.16360746, -0.031435...","[-0.12169886, -0.06291097, 0.56233454, -0.0505...",1,4,12,0,1
4,0,-0.707919,0,-0.394470,-0.353520,0.041900,0.271125,1.255410,0.325218,-0.482940,...,-0.448890,-0.386341,-0.069192,"[0.18226019, 0.020588301, 0.45404932, -0.16698...","[-0.14068584, 0.15590447, 0.2628155, -0.124266...",1,0,12,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44893,1,0.071926,0,-0.394470,-0.839969,0.172856,0.148985,0.086355,0.010098,-0.388086,...,0.352811,-0.386341,-0.069192,"[-0.22016054, -0.2646486, -0.28054425, -0.0202...","[-0.29300874, -0.2690835, 0.20311326, -0.01129...",1,1,8,0,3
44894,1,-0.707919,0,-0.394470,-1.326417,-0.797930,-0.807782,-0.202694,-0.698922,0.214988,...,-0.758998,-0.386341,-0.069192,"[-0.23533003, 0.030441722, -0.2116002, 0.14849...","[-0.35906503, 0.024558794, 0.24344209, -0.0435...",1,1,8,0,3
44895,1,-0.707919,0,-0.394470,-1.326417,-0.242788,-0.183508,0.513920,0.088878,0.341003,...,-0.663644,-0.386341,-0.069192,"[-0.16499297, -0.22954267, 0.22405392, -0.0613...","[-0.049460813, 0.023484496, 0.32261994, -0.105...",1,1,8,0,3
44896,1,-0.707919,0,-0.394470,-0.839969,-0.570180,-0.556715,0.207640,-0.541362,0.510048,...,-0.273475,-0.386341,-0.069192,"[-0.12216297, -0.21545862, 0.27457115, -0.2240...","[-0.40189856, -0.1370452, 0.28224158, -0.14327...",1,1,8,0,3


In [41]:
# Remove columns with low variance (only one unique value)
low_variance = []

for col in df.columns:
    if not isinstance(df[col].iloc[0], (np.ndarray, list)):
        if df[col].nunique() == 1:
            low_variance.append(col)

df.drop(columns=low_variance, inplace=True)


In [42]:
# Show the updated dataframe
df

Unnamed: 0,label,title_capital_word_count,is_question_title,title_emotional_word_count,title_word_count,text_word_count,text_stopword_count,text_stopword_ratio,text_sentence_count,text_lexical_diversity,...,word_density,emotional_density,question_emotion_interaction,title_embedding,text_embedding,general_category_World-news,publish_dayofweek,publish_month,is_weekend,season
0,0,-0.707919,0,2.018789,-0.110296,0.255416,0.210055,0.011621,1.034238,-0.659185,...,-0.836768,2.138704,-0.069192,"[-0.087652735, 0.3052924, 0.4860082, -0.215476...","[-0.110753626, 0.08251098, 0.48530617, -0.0257...",1,6,12,1,1
1,0,-0.707919,0,-0.394470,-1.083193,-0.285492,-0.285292,0.143371,-0.305022,0.260754,...,-0.013094,-0.386341,-0.069192,"[0.07518646, -0.43053955, 0.2936775, -0.136392...","[-0.20508027, -0.095989324, 0.056940734, 0.127...",1,6,12,1,1
2,0,-0.707919,0,-0.394470,0.619378,0.497400,0.406837,-0.103851,0.797898,-0.523051,...,-0.319859,-0.386341,-0.069192,"[0.0004514205, 0.22627294, 0.02068508, -0.2220...","[-0.040056203, 0.017511971, 0.35744786, -0.052...",1,5,12,1,1
3,0,-0.317997,0,-0.394470,0.376153,0.110225,0.033630,-0.157321,0.010098,-0.224448,...,0.217138,-0.386341,-0.069192,"[0.11892752, 0.03887261, 0.16360746, -0.031435...","[-0.12169886, -0.06291097, 0.56233454, -0.0505...",1,4,12,0,1
4,0,-0.707919,0,-0.394470,-0.353520,0.041900,0.271125,1.255410,0.325218,-0.482940,...,-0.448890,-0.386341,-0.069192,"[0.18226019, 0.020588301, 0.45404932, -0.16698...","[-0.14068584, 0.15590447, 0.2628155, -0.124266...",1,0,12,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44893,1,0.071926,0,-0.394470,-0.839969,0.172856,0.148985,0.086355,0.010098,-0.388086,...,0.352811,-0.386341,-0.069192,"[-0.22016054, -0.2646486, -0.28054425, -0.0202...","[-0.29300874, -0.2690835, 0.20311326, -0.01129...",1,1,8,0,3
44894,1,-0.707919,0,-0.394470,-1.326417,-0.797930,-0.807782,-0.202694,-0.698922,0.214988,...,-0.758998,-0.386341,-0.069192,"[-0.23533003, 0.030441722, -0.2116002, 0.14849...","[-0.35906503, 0.024558794, 0.24344209, -0.0435...",1,1,8,0,3
44895,1,-0.707919,0,-0.394470,-1.326417,-0.242788,-0.183508,0.513920,0.088878,0.341003,...,-0.663644,-0.386341,-0.069192,"[-0.16499297, -0.22954267, 0.22405392, -0.0613...","[-0.049460813, 0.023484496, 0.32261994, -0.105...",1,1,8,0,3
44896,1,-0.707919,0,-0.394470,-0.839969,-0.570180,-0.556715,0.207640,-0.541362,0.510048,...,-0.273475,-0.386341,-0.069192,"[-0.12216297, -0.21545862, 0.27457115, -0.2240...","[-0.40189856, -0.1370452, 0.28224158, -0.14327...",1,1,8,0,3


In [44]:
# Remove columns with high sparsity (more than 98% zeros) except binary columns
sparse_cols = []

for col in df.columns:
    if np.issubdtype(df[col].dtype, np.number):
        if (df[col] == 0).sum() / len(df) > 0.98 and col not in ["label", "is_weekend", "is_question_title"]:
            sparse_cols.append(col)

df.drop(columns=sparse_cols, inplace=True)
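
The same sparsity filter can be sketched in vectorized form on a small toy dataframe. The column values below and the names `toy`, `protected`, and `zero_fraction` are made up for illustration and are not part of the project's dataset; only the 98% threshold and the protected column names come from the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): 100 rows, one mostly-zero feature and one binary flag
toy = pd.DataFrame({
    "dense_feature": np.arange(1, 101, dtype=float),  # contains no zeros
    "rare_feature": [1.0] + [0.0] * 99,               # 99% zeros, should be dropped
    "is_weekend": [1] + [0] * 99,                     # 99% zeros, but protected
})

# Binary columns that must survive even when they are mostly zero
protected = {"label", "is_weekend", "is_question_title"}

# Fraction of zeros per numeric column, computed in a single pass
zero_fraction = (toy.select_dtypes(include=[np.number]) == 0).mean()

# Collect columns above the 98% sparsity threshold, skipping protected ones
sparse_cols = [
    col for col, frac in zero_fraction.items()
    if frac > 0.98 and col not in protected
]

toy_reduced = toy.drop(columns=sparse_cols)
print(sparse_cols)
```

Because `is_weekend` is in the protected set, it stays in the frame even though it is as sparse as `rare_feature`.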

In [45]:
# Show the updated dataframe
df

Unnamed: 0,label,title_capital_word_count,is_question_title,title_emotional_word_count,title_word_count,text_word_count,text_stopword_count,text_stopword_ratio,text_sentence_count,text_lexical_diversity,...,word_density,emotional_density,question_emotion_interaction,title_embedding,text_embedding,general_category_World-news,publish_dayofweek,publish_month,is_weekend,season
0,0,-0.707919,0,2.018789,-0.110296,0.255416,0.210055,0.011621,1.034238,-0.659185,...,-0.836768,2.138704,-0.069192,"[-0.087652735, 0.3052924, 0.4860082, -0.215476...","[-0.110753626, 0.08251098, 0.48530617, -0.0257...",1,6,12,1,1
1,0,-0.707919,0,-0.394470,-1.083193,-0.285492,-0.285292,0.143371,-0.305022,0.260754,...,-0.013094,-0.386341,-0.069192,"[0.07518646, -0.43053955, 0.2936775, -0.136392...","[-0.20508027, -0.095989324, 0.056940734, 0.127...",1,6,12,1,1
2,0,-0.707919,0,-0.394470,0.619378,0.497400,0.406837,-0.103851,0.797898,-0.523051,...,-0.319859,-0.386341,-0.069192,"[0.0004514205, 0.22627294, 0.02068508, -0.2220...","[-0.040056203, 0.017511971, 0.35744786, -0.052...",1,5,12,1,1
3,0,-0.317997,0,-0.394470,0.376153,0.110225,0.033630,-0.157321,0.010098,-0.224448,...,0.217138,-0.386341,-0.069192,"[0.11892752, 0.03887261, 0.16360746, -0.031435...","[-0.12169886, -0.06291097, 0.56233454, -0.0505...",1,4,12,0,1
4,0,-0.707919,0,-0.394470,-0.353520,0.041900,0.271125,1.255410,0.325218,-0.482940,...,-0.448890,-0.386341,-0.069192,"[0.18226019, 0.020588301, 0.45404932, -0.16698...","[-0.14068584, 0.15590447, 0.2628155, -0.124266...",1,0,12,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44893,1,0.071926,0,-0.394470,-0.839969,0.172856,0.148985,0.086355,0.010098,-0.388086,...,0.352811,-0.386341,-0.069192,"[-0.22016054, -0.2646486, -0.28054425, -0.0202...","[-0.29300874, -0.2690835, 0.20311326, -0.01129...",1,1,8,0,3
44894,1,-0.707919,0,-0.394470,-1.326417,-0.797930,-0.807782,-0.202694,-0.698922,0.214988,...,-0.758998,-0.386341,-0.069192,"[-0.23533003, 0.030441722, -0.2116002, 0.14849...","[-0.35906503, 0.024558794, 0.24344209, -0.0435...",1,1,8,0,3
44895,1,-0.707919,0,-0.394470,-1.326417,-0.242788,-0.183508,0.513920,0.088878,0.341003,...,-0.663644,-0.386341,-0.069192,"[-0.16499297, -0.22954267, 0.22405392, -0.0613...","[-0.049460813, 0.023484496, 0.32261994, -0.105...",1,1,8,0,3
44896,1,-0.707919,0,-0.394470,-0.839969,-0.570180,-0.556715,0.207640,-0.541362,0.510048,...,-0.273475,-0.386341,-0.069192,"[-0.12216297, -0.21545862, 0.27457115, -0.2240...","[-0.40189856, -0.1370452, 0.28224158, -0.14327...",1,1,8,0,3


Finally, to remove **highly correlated features**, we keep the first column and drop the second for each pair of columns whose absolute correlation exceeds 0.90.

In [51]:
# Select the numeric columns for correlation analysis
numeric_df = df.select_dtypes(include=[np.number])

# Compute the correlation matrix
corr_matrix = numeric_df.corr().abs()

# Set the threshold for correlation
threshold = 0.90

# Keep track of columns to drop
to_drop = set()

# List of columns in the correlation matrix
columns = corr_matrix.columns

# For each highly correlated pair, keep the first column and mark the second one for removal
for i in range(len(columns)):
    if columns[i] in to_drop:
        continue
    for j in range(i + 1, len(columns)):
        if columns[j] in to_drop:
            continue
        if corr_matrix.iloc[i, j] > threshold:
            print(f"Columns {columns[i]} and {columns[j]} are highly correlated with a correlation of {corr_matrix.iloc[i, j]:.2f}")
            to_drop.add(columns[j])

# Drop the selected columns from the dataframe
df = df.drop(columns=list(to_drop))

Columns title_emotional_word_count and emotional_density are highly correlated with a correlation of 0.95
Columns text_word_count and text_stopword_count are highly correlated with a correlation of 0.99
Columns text_stopword_ratio and stopword_to_length_ratio are highly correlated with a correlation of 1.00
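
To make the greedy rule easier to check, here is a minimal sketch that wraps the same logic in a function and runs it on a tiny toy dataframe. The function name `drop_correlated` and the toy columns `a`, `b`, `c` are hypothetical and chosen only for illustration; the 0.90 threshold matches the cell above.

```python
import numpy as np
import pandas as pd

def drop_correlated(frame, threshold=0.90):
    """Greedy filter: keep the earlier column of each highly correlated pair."""
    corr = frame.select_dtypes(include=[np.number]).corr().abs()
    cols = list(corr.columns)
    to_drop = set()
    for i, first in enumerate(cols):
        # A column that is already dropped cannot cause further removals
        if first in to_drop:
            continue
        for second in cols[i + 1:]:
            if second not in to_drop and corr.loc[first, second] > threshold:
                to_drop.add(second)
    return frame.drop(columns=list(to_drop)), to_drop

# Toy data (hypothetical): b is an exact multiple of a, c is only weakly related
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [3.0, 1.0, 4.0, 1.0, 5.0],
})

reduced, dropped = drop_correlated(toy)
print(dropped)
```

Here `b` is removed because it is perfectly correlated with `a`, while `c` stays because its correlation with `a` is well below the threshold.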


In [52]:
# Show the updated dataframe
df

Unnamed: 0,label,title_capital_word_count,is_question_title,title_emotional_word_count,title_word_count,text_word_count,text_stopword_ratio,text_sentence_count,text_lexical_diversity,title_number_count,...,title_avg_word_length,word_density,question_emotion_interaction,title_embedding,text_embedding,general_category_World-news,publish_dayofweek,publish_month,is_weekend,season
0,0,-0.707919,0,2.018789,-0.110296,0.255416,0.011621,1.034238,-0.659185,-0.309739,...,-0.007276,-0.836768,-0.069192,"[-0.087652735, 0.3052924, 0.4860082, -0.215476...","[-0.110753626, 0.08251098, 0.48530617, -0.0257...",1,6,12,1,1
1,0,-0.707919,0,-0.394470,-1.083193,-0.285492,0.143371,-0.305022,0.260754,-0.309739,...,1.365580,-0.013094,-0.069192,"[0.07518646, -0.43053955, 0.2936775, -0.136392...","[-0.20508027, -0.095989324, 0.056940734, 0.127...",1,6,12,1,1
2,0,-0.707919,0,-0.394470,0.619378,0.497400,-0.103851,0.797898,-0.523051,-0.309739,...,-0.399521,-0.319859,-0.069192,"[0.0004514205, 0.22627294, 0.02068508, -0.2220...","[-0.040056203, 0.017511971, 0.35744786, -0.052...",1,5,12,1,1
3,0,-0.317997,0,-0.394470,0.376153,0.110225,-0.157321,0.010098,-0.224448,-0.309739,...,-0.687701,0.217138,-0.069192,"[0.11892752, 0.03887261, 0.16360746, -0.031435...","[-0.12169886, -0.06291097, 0.56233454, -0.0505...",1,4,12,0,1
4,0,-0.707919,0,-0.394470,-0.353520,0.041900,1.255410,0.325218,-0.482940,-0.309739,...,-0.155005,-0.448890,-0.069192,"[0.18226019, 0.020588301, 0.45404932, -0.16698...","[-0.14068584, 0.15590447, 0.2628155, -0.124266...",1,0,12,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44893,1,0.071926,0,-0.394470,-0.839969,0.172856,0.086355,0.010098,-0.388086,-0.309739,...,0.198185,0.352811,-0.069192,"[-0.22016054, -0.2646486, -0.28054425, -0.0202...","[-0.29300874, -0.2690835, 0.20311326, -0.01129...",1,1,8,0,3
44894,1,-0.707919,0,-0.394470,-1.326417,-0.797930,-0.202694,-0.698922,0.214988,-0.309739,...,0.657138,-0.758998,-0.069192,"[-0.23533003, 0.030441722, -0.2116002, 0.14849...","[-0.35906503, 0.024558794, 0.24344209, -0.0435...",1,1,8,0,3
44895,1,-0.707919,0,-0.394470,-1.326417,-0.242788,0.513920,0.088878,0.341003,-0.309739,...,0.368958,-0.663644,-0.069192,"[-0.16499297, -0.22954267, 0.22405392, -0.0613...","[-0.049460813, 0.023484496, 0.32261994, -0.105...",1,1,8,0,3
44896,1,-0.707919,0,-0.394470,-0.839969,-0.570180,0.207640,-0.541362,0.510048,-0.309739,...,0.198185,-0.273475,-0.069192,"[-0.12216297, -0.21545862, 0.27457115, -0.2240...","[-0.40189856, -0.1370452, 0.28224158, -0.14327...",1,1,8,0,3


## *At the end of this section, we save the final dataframe to the `final_news.csv` file and to the `final_news` table of the `news_dataset.db` database.*

In [53]:
# Create the directory structure for the dataset file
base_path_data = os.path.abspath(os.path.join(os.getcwd(), "..",  "dataset"))
os.makedirs(base_path_data, exist_ok=True)
data_file = os.path.join(base_path_data, "final_news.csv")

# Save the updated dataframe back to the CSV file
df.to_csv(data_file, index=False)

In [54]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44898 entries, 0 to 44897
Data columns (total 22 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   label                         44898 non-null  int64  
 1   title_capital_word_count      44898 non-null  float64
 2   is_question_title             44898 non-null  int64  
 3   title_emotional_word_count    44898 non-null  float64
 4   title_word_count              44898 non-null  float64
 5   text_word_count               44898 non-null  float64
 6   text_stopword_ratio           44898 non-null  float64
 7   text_sentence_count           44898 non-null  float64
 8   text_lexical_diversity        44898 non-null  float64
 9   title_number_count            44898 non-null  float64
 10  text_number_count             44898 non-null  float64
 11  text_url_count                44898 non-null  float64
 12  title_avg_word_length         44898 non-null  float64
 13  w

In [55]:
df.describe()

Unnamed: 0,label,title_capital_word_count,is_question_title,title_emotional_word_count,title_word_count,text_word_count,text_stopword_ratio,text_sentence_count,text_lexical_diversity,title_number_count,text_number_count,text_url_count,title_avg_word_length,word_density,question_emotion_interaction,general_category_World-news,publish_dayofweek,publish_month,is_weekend,season
count,44898.0,44898.0,44898.0,44898.0,44898.0,44898.0,44898.0,44898.0,44898.0,44898.0,44898.0,44898.0,44898.0,44898.0,44898.0,44898.0,44898.0,44898.0,44898.0,44898.0
mean,0.477015,-7.089918e-17,0.028086,-9.115609000000001e-17,-5.0642270000000006e-17,5.950467e-17,-2.532114e-16,2.991059e-17,-1.645874e-16,-9.115609000000001e-17,-1.139451e-17,2.785325e-17,2.709362e-16,-3.468996e-16,-2.595416e-17,0.596574,2.746982,7.094993,0.125551,1.988218
std,0.499477,1.000011,0.16522,1.000011,1.000011,1.000011,1.000011,1.000011,1.000011,1.000011,1.000011,1.000011,1.000011,1.000011,1.000011,0.49059,1.868471,3.536477,0.331347,1.569661
min,0.0,-0.7079194,0.0,-0.3944699,-2.785764,-1.15379,-5.346243,-1.171602,-4.748061,-0.3097388,-0.6113255,-0.2178711,-1.74436,-2.520978,-0.06919161,0.0,0.0,1.0,0.0,0.0
25%,0.0,-0.7079194,0.0,-0.3944699,-0.5967443,-0.5758735,-0.3607693,-0.6201419,-0.4858731,-0.3097388,-0.5204904,-0.2178711,-0.3995212,-0.46605,-0.06919161,0.0,1.0,4.0,0.0,0.0
50%,0.0,-0.3179966,0.0,-0.3944699,-0.3535199,-0.1232195,0.05761068,-0.147462,-0.1125313,-0.3097388,-0.2479849,-0.2178711,-0.06331141,-0.1035418,-0.06919161,1.0,3.0,8.0,0.0,2.0
75%,1.0,0.461849,0.0,-0.3944699,0.3761532,0.3066595,0.5379449,0.3252179,0.382014,-0.3097388,0.2061908,-0.2178711,0.310255,0.3207429,-0.06919161,1.0,4.0,10.0,0.0,4.0
max,1.0,8.650228,1.0,9.258568,7.186436,22.00558,4.699739,24.11677,3.006704,13.21156,58.34069,45.4923,96.42889,25.0822,35.77579,1.0,6.0,12.0,1.0,4.0


In [56]:
# Create the directory structure for the database file
base_path_query = os.path.abspath(os.path.join(os.getcwd(), "..",  "database"))
os.makedirs(base_path_query, exist_ok=True)
db_file = os.path.join(base_path_query, "news_dataset.db")

# Connect to the SQLite database
conn = sqlite3.connect(db_file)

# Write the final dataframe to a SQL table named 'final_news'
df.to_sql("final_news", conn, index=False, if_exists="replace")

44898
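
A quick way to confirm that the table was written correctly is to query it back. The sketch below is illustrative only: it uses an in-memory SQLite database and a three-row stand-in dataframe (`toy`) instead of the real `news_dataset.db` file, so the numbers it produces say nothing about the actual dataset.

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for the final dataframe
toy = pd.DataFrame({
    "label": [0, 1, 1],
    "title_word_count": [-0.11, 0.62, 0.38],
})

# In-memory database so the example leaves no file behind
conn = sqlite3.connect(":memory:")
try:
    # Same call pattern as in the notebook cell above
    toy.to_sql("final_news", conn, index=False, if_exists="replace")

    # Row count should match the length of the dataframe that was written
    row_count = pd.read_sql_query(
        "SELECT COUNT(*) AS n FROM final_news", conn
    )["n"].iloc[0]

    # Class balance computed directly in SQL
    label_counts = pd.read_sql_query(
        "SELECT label, COUNT(*) AS n FROM final_news GROUP BY label ORDER BY label",
        conn,
    )
finally:
    # Always release the connection, even if a query fails
    conn.close()

print(row_count)
print(label_counts)
```

Running the same two queries against the real database file should return a row count equal to the value printed by `to_sql`.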