![logo_ironhack_blue 7](https://user-images.githubusercontent.com/23629340/40541063-a07a0a8a-601a-11e8-91b5-2f13e4e6b441.png)

# PROJECT | Natural Language Processing Challenge

## Introduction

Processing text is a core skill for data scientists and AI engineers.

In this project, you will put that skill into practice by classifying news headlines as real or fake.

## Project Overview

In the file `dataset/data.csv`, you will find a dataset containing news articles with the following columns:

- **`label`**: 0 if the news is fake, 1 if the news is real.
- **`title`**: The headline of the news article.
- **`text`**: The full content of the article.
- **`subject`**: The category or topic of the news.
- **`date`**: The publication date of the article.

Your goal is to build a classifier that can distinguish fake news from real news.

Once you have built a classifier, use it to predict the labels for `dataset/validation_data.csv`. Generate a new file
in which the placeholder label `2` is replaced by `0` (fake) or `1` (real) according to your model. Keep the original file format:
do not add extra columns, and keep the same column separator.
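
A minimal sketch of that last step, assuming pandas. The helper name `write_predictions` is hypothetical, and the separator is passed in explicitly because the brief does not say which one `validation_data.csv` uses. The in-memory sample only stands in for the real file.

```python
import io

import pandas as pd


def write_predictions(src, dst, predictions, sep=","):
    """Copy the validation file, replacing the placeholder label with predictions.

    `write_predictions` and its arguments are illustrative names, not part of the brief.
    """
    df = pd.read_csv(src, sep=sep)
    if len(predictions) != len(df):
        raise ValueError("need exactly one prediction per row")
    df["label"] = [int(p) for p in predictions]
    # index=False keeps the column set identical to the input file
    df.to_csv(dst, sep=sep, index=False)
    return df


# Tiny stand-in for validation_data.csv (the real separator may differ)
sample = io.StringIO("label,title\n2,Some headline\n2,Another headline\n")
out = io.StringIO()
result = write_predictions(sample, out, [0, 1])
print(out.getvalue())
```

Reading and writing with the same `sep` and with `index=False` is what keeps the output layout aligned with the input.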

Split `data.csv` into **training** and **test** datasets before using it for model training or evaluation.
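
One common way to do this is a stratified random split, sketched here on a toy frame with scikit-learn's `train_test_split`. The `test_size` and `random_state` values are arbitrary choices, and the toy data is invented for illustration. The notebook further down uses a time-ordered split instead, which is another valid option.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for data.csv with a balanced label column
toy = pd.DataFrame({
    "title": [f"headline {i}" for i in range(10)],
    "label": [0, 1] * 5,
})

# stratify keeps the fake/real ratio the same in both parts;
# test_size=0.2 and random_state=42 are arbitrary choices
train_df, test_df = train_test_split(
    toy, test_size=0.2, stratify=toy["label"], random_state=42
)
print(len(train_df), len(test_df))
```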

## Guidance

As in a real-life scenario, you are free to make your own choices about how to treat the text.
Use the techniques you have learned and common packages to process the data and classify the text.

## Deliverables

1. **Python Code:** Provide well-documented Python code that conducts the analysis.
2. **Predictions:** A CSV file in the same format as `validation_data.csv` but with the predicted labels (0 or 1).
3. **Accuracy estimation:** Provide the teacher with your estimation of how your model will perform.
4. **Presentation:** You will present your model in a 10-minute presentation. Your teacher will provide further instructions.
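
For the accuracy estimate in deliverable 3, one possible approach is to report held-out accuracy together with a rough uncertainty range. The sketch below uses the normal approximation to a binomial proportion. The helper name `accuracy_interval` and the counts are hypothetical, chosen only for illustration.

```python
import math


def accuracy_interval(n_correct, n_total, z=1.96):
    """Normal-approximation interval (z=1.96 for roughly 95%) for held-out accuracy."""
    acc = n_correct / n_total
    half_width = z * math.sqrt(acc * (1 - acc) / n_total)
    return acc, max(0.0, acc - half_width), min(1.0, acc + half_width)


# Hypothetical numbers: 930 of 1000 held-out headlines classified correctly
acc, low, high = accuracy_interval(930, 1000)
print(f"accuracy {acc:.3f}, interval [{low:.3f}, {high:.3f}]")
```

The interval only reflects sampling noise on the test set; it says nothing about how well the test set resembles the validation file.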

# Set Up the Environment

In [None]:
import pandas as pd
import os
from sklearn.model_selection import TimeSeriesSplit
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch

In [None]:


# IF WORKING ON GOOGLE COLAB

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Download the file from Google Drive (replace the ID below with your own file ID)
!gdown --id 1P5SugC6aQSSV_DXviZ7S6W2XhVA8io0Y -O train_data.csv

import os
print(f"Current working directory: {os.getcwd()}")

# Check if the file was downloaded
if os.path.exists('train_data.csv'):
    print("File 'train_data.csv' found. Attempting to read with pandas.")
    data = pd.read_csv("train_data.csv", encoding='latin-1')
    print(data.shape)
    print(data.head())
else:
    print("Error: File 'train_data.csv' not found after gdown attempt.")
    print("Please double-check the file ID and ensure the file exists and is shared correctly in Google Drive.")

Mounted at /content/drive
Downloading...
From (original): https://drive.google.com/uc?id=1P5SugC6aQSSV_DXviZ7S6W2XhVA8io0Y
From (redirected): https://drive.google.com/uc?id=1P5SugC6aQSSV_DXviZ7S6W2XhVA8io0Y&confirm=t&uuid=10775f2d-7311-44cd-990f-bc06dc6a168d
To: /content/train_data.csv
100% 105M/105M [00:01<00:00, 56.3MB/s] 
Current working directory: /content
File 'train_data.csv' found. Attempting to read with pandas.
(40399, 5)
   label                                              title  \
0      0  HILLARY RODHAM NIXON: A CANDIDATE WITH MORE BA...   
1      0  WATCH DIRTY HARRY REID ON HIS LIE ABOUT ROMNEY...   
2      0  HILLARY RODHAM NIXON: A CANDIDATE WITH MORE BA...   
3      0  FLASHBACK: KING OBAMA COMMUTES SENTENCES OF 22...   
4      0  BENGHAZI PANEL CALLS HILLARY TO TESTIFY UNDER ...   

                                                text    subject        date  
0  The irony here isn t lost on us. Hillary is be...   politics  2015-03-31  
1  In case you missed it Sen. 

In [None]:
# IF WORKING LOCALLY

# Read the data (uncomment the next line when running outside Colab)
# data = pd.read_csv("train_data.csv", encoding='latin-1')

print(data.shape)

print(data.head())

(40399, 5)
   label                                              title  \
0      0  HILLARY RODHAM NIXON: A CANDIDATE WITH MORE BA...   
1      0  WATCH DIRTY HARRY REID ON HIS LIE ABOUT ROMNEY...   
2      0  HILLARY RODHAM NIXON: A CANDIDATE WITH MORE BA...   
3      0  FLASHBACK: KING OBAMA COMMUTES SENTENCES OF 22...   
4      0  BENGHAZI PANEL CALLS HILLARY TO TESTIFY UNDER ...   

                                                text    subject        date  
0  The irony here isn t lost on us. Hillary is be...   politics  2015-03-31  
1  In case you missed it Sen. Harry Reid (R-NV), ...  left-news  2015-03-31  
2  The irony here isn t lost on us. Hillary is be...  left-news  2015-03-31  
3  Just making room for Hillary President Obama t...   politics  2015-03-31  
4  Does anyone really think Hillary Clinton will ...   politics  2015-03-31  


# Transformer Model (Embedding)

In [None]:
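# Keep only the headline and the label. `date` is dropped too, so the
# time-based split later on relies on the existing row order.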
data = data[['title', 'label']]

# Display the first few rows to verify the change
display(data.head())

|   | title | label |
|---|-------|-------|
| 0 | HILLARY RODHAM NIXON: A CANDIDATE WITH MORE BA... | 0 |
| 1 | WATCH DIRTY HARRY REID ON HIS LIE ABOUT ROMNEY... | 0 |
| 2 | HILLARY RODHAM NIXON: A CANDIDATE WITH MORE BA... | 0 |
| 3 | FLASHBACK: KING OBAMA COMMUTES SENTENCES OF 22... | 0 |
| 4 | BENGHAZI PANEL CALLS HILLARY TO TESTIFY UNDER ... | 0 |


In [None]:
from transformers import AutoTokenizer, AutoModel

# Choose model for embedding
embedding_model_name = "distilbert-base-uncased"

# Load tokenizer and model for embeddings
tokenizer = AutoTokenizer.from_pretrained(embedding_model_name)
model = AutoModel.from_pretrained(embedding_model_name)

# Function to get embeddings
def get_embeddings(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the last hidden state of the first token ([CLS]) as the sentence embedding.
    # squeeze() drops the batch dimension when a single string is passed in.
    embeddings = outputs.last_hidden_state[:, 0, :].squeeze()
    # Alternative: average the token embeddings (mean pooling)
    # embeddings = torch.mean(outputs.last_hidden_state, dim=1).squeeze()
    return embeddings

# Example usage (optional)
sample_text = "This is a sample sentence for embedding."
sample_embedding = get_embeddings(sample_text)

print(sample_embedding.shape)

torch.Size([768])
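
The commented-out mean-pooling line averages over every token position. That is fine for a single sentence, but it would also average padding positions if several titles were tokenized together with `padding=True`. Below is a sketch of mask-aware mean pooling, assuming the usual tensor layout (`last_hidden_state` as batch × tokens × hidden). The helper name `masked_mean_pool` is hypothetical, and the tensors are hand-made stand-ins rather than model outputs.

```python
import torch


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sequence, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, tokens, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1)                            # avoid division by zero
    return summed / counts


# Two fake sequences of 3 tokens with hidden size 2; the second has one padded position
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
                       [[2.0, 2.0], [4.0, 4.0], [9.0, 9.0]]])
mask = torch.tensor([[1, 1, 1], [1, 1, 0]])
pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

With the cell above, the arguments would be `outputs.last_hidden_state` and `inputs["attention_mask"]`.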


## Split the data

In [None]:
from sklearn.model_selection import TimeSeriesSplit

# Features and target
X = data.drop(columns=["label"])
y = data["label"]

# Build time series folds
tscv = TimeSeriesSplit(n_splits=3)

folds = []  # store indices and the sliced data for later use
for fold_id, (tr_idx, va_idx) in enumerate(tscv.split(X), start=1):
    fold = {
        "train_idx": tr_idx,
        "val_idx": va_idx,
        "X_train": X.iloc[tr_idx],
        "X_val": X.iloc[va_idx],
        "y_train": y.iloc[tr_idx],
        "y_val": y.iloc[va_idx],
    }
    folds.append(fold)
    print(f"Fold {fold_id}  train rows {len(tr_idx)}  val rows {len(va_idx)}")

# Example of how to access one fold later
# X_train = folds[0]["X_train"]
# y_train = folds[0]["y_train"]
# X_val   = folds[0]["X_val"]
# y_val   = folds[0]["y_val"]

Fold 1  train rows 10102  val rows 10099
Fold 2  train rows 20201  val rows 10099
Fold 3  train rows 30300  val rows 10099
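
The fold sizes printed above follow from how `TimeSeriesSplit` carves up rows with its default settings: each validation block has `n_samples // (n_splits + 1)` rows, and each training set is every row that comes before its validation block. The split works purely on row order, so it only behaves as a time-based split if the rows are already in chronological order. The small helper below (a hypothetical name, plain Python) reproduces the arithmetic.

```python
def time_series_fold_sizes(n_samples, n_splits):
    """Reproduce the (train, validation) sizes of TimeSeriesSplit with default settings."""
    test_size = n_samples // (n_splits + 1)
    sizes = []
    for k in range(n_splits):
        # Validation block k starts after everything earlier in row order
        val_start = n_samples - (n_splits - k) * test_size
        sizes.append((val_start, test_size))
    return sizes


for fold_id, (n_train, n_val) in enumerate(time_series_fold_sizes(40399, 3), start=1):
    print(f"Fold {fold_id}  train rows {n_train}  val rows {n_val}")
```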


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# Initialize lists to store predictions and labels across all folds
all_preds = []
all_labels = []

# Iterate through the time series folds
for i, fold in enumerate(folds, start=1):
    print(f"Processing Fold {i} for Logistic Regression...")

    # Get the training and validation data for the current fold.
    # Only the headline ('title') is used; the article text was dropped earlier.
    X_train_text = fold["X_train"]["title"].tolist()
    y_train = fold["y_train"].tolist()
    X_val_text = fold["X_val"]["title"].tolist()
    y_val = fold["y_val"].tolist()

    # Generate embeddings for training data
    print("Generating embeddings for training data...")
    X_train_embeddings = torch.stack([get_embeddings(text) for text in X_train_text]).numpy()

    # Generate embeddings for validation data
    print("Generating embeddings for validation data...")
    X_val_embeddings = torch.stack([get_embeddings(text) for text in X_val_text]).numpy()


    # Initialize and train the Logistic Regression model
    print("Training Logistic Regression model...")
    log_reg_model = LogisticRegression(max_iter=1000) # Increased max_iter for convergence
    log_reg_model.fit(X_train_embeddings, y_train)

    # Make predictions on the validation data
    print("Making predictions on validation data...")
    fold_preds = log_reg_model.predict(X_val_embeddings)

    # Store predictions and labels
    all_preds.extend(fold_preds)
    all_labels.extend(y_val)

# Calculate and print overall accuracy
accuracy = accuracy_score(all_labels, all_preds)
print(f"\nOverall Accuracy on all validation folds (Logistic Regression with Embeddings): {accuracy}")

Processing Fold 1 for Logistic Regression...
Generating embeddings for training data...
Generating embeddings for validation data...
Training Logistic Regression model...
Making predictions on validation data...
Processing Fold 2 for Logistic Regression...
Generating embeddings for training data...
Generating embeddings for validation data...
Training Logistic Regression model...
Making predictions on validation data...
Processing Fold 3 for Logistic Regression...
Generating embeddings for training data...
Generating embeddings for validation data...
Training Logistic Regression model...
Making predictions on validation data...

Overall Accuracy on all validation folds (Logistic Regression with Embeddings): 0.9264943723801037
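
The loop above re-embeds the same titles in every fold, because each training set contains the previous one. A possible alternative is to embed every row once and then slice the resulting matrix by fold indices. The sketch below shows that pattern with random stand-in features so it runs without a model download; the array names and sizes are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)

# Stand-in data: in the notebook this matrix would be built once from the
# title embeddings; here it is random so the sketch runs on its own.
labels = np.array([0, 1] * 100)
all_embeddings = rng.normal(size=(200, 8)) + labels[:, None]  # shift class 1 so it is learnable

fold_scores = []
for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(all_embeddings):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(all_embeddings[train_idx], labels[train_idx])  # slice, do not re-embed
    preds = clf.predict(all_embeddings[val_idx])
    fold_scores.append(accuracy_score(labels[val_idx], preds))

print(fold_scores)
```

Keeping one score per fold, rather than only the pooled accuracy, also shows whether performance drifts as the training window grows.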


In [None]:
import joblib

# Save the logistic regression model from the last fold
# (it was fit on that fold's training rows only, not on the full dataset)
joblib.dump(log_reg_model, 'model_lr_transformer.pkl')
print("Logistic regression model saved successfully!")

Logistic regression model saved successfully!
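
A saved model can later be restored with `joblib.load` and used for prediction, provided the new inputs are turned into features the same way as the training data. The round trip is sketched below with stand-in features and a hypothetical file name, so it runs independently of the notebook state.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in features and labels; in the notebook these would be title embeddings
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    joblib.dump(clf, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly
same = np.array_equal(clf.predict(X_demo), restored.predict(X_demo))
print(same)
```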
