

# TweetEval Preprocessing Pipeline (TF‑IDF → L1 Logistic Regression)

**Scope:** This notebook focuses on the preprocessing and training workflow:
- Remove stopwords (via `TfidfVectorizer(stop_words='english')`)
- Build a scikit‑learn `Pipeline`
- Fit a **Logistic Regression** model with **L1** regularization
- Run **GridSearchCV on the *training* set only**
- Generate **predictions on the test set**
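
As a rough illustration of the stopword step, the sketch below fits `TfidfVectorizer` on two made-up sentences (not TweetEval data), once with default settings and once with `stop_words='english'`, and compares the vocabularies. The variable names are illustrative only. Note that scikit-learn's built-in English list also drops negations such as "not", which may matter for sentiment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy sentences for illustration only (not taken from TweetEval)
docs = ["the movie was not good", "the movie was great"]

# Vocabulary learned with and without the built-in English stopword list
vocab_all = sorted(TfidfVectorizer().fit(docs).vocabulary_)
vocab_filtered = sorted(TfidfVectorizer(stop_words='english').fit(docs).vocabulary_)

print("without stopword removal: ", vocab_all)
print("with stop_words='english':", vocab_filtered)
```

Comparing the two printed lists shows which tokens the filter discards before any TF-IDF weight is computed.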




## 1) Setup

In [1]:

# Minimal, easy-to-run imports (Colab has these by default)
import os
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score
from joblib import dump

SEED = 42
np.random.seed(SEED)


## 2) Load data (TweetEval Sentiment)

In [2]:

# This cell tries to load TweetEval (sentiment) directly.
# If the internet is disabled or the datasets package is unavailable,
# we'll fall back to loading from a local CSV with columns: ['text','label']
#
# Local CSV fallback path (edit if needed):
LOCAL_CSV = '/content/tweeteval_sentiment.csv'  # expects columns: text,label (label as int 0/1/2)

def load_data():
    try:
        # Try HuggingFace datasets (works in Colab with internet)
        from datasets import load_dataset
        ds = load_dataset('tweet_eval', 'sentiment')
        # Convert to Pandas
        train_df = pd.DataFrame({'text': ds['train']['text'], 'label': ds['train']['label']})
        test_df  = pd.DataFrame({'text': ds['test']['text'],  'label': ds['test']['label']})
        # Combine into one frame; the 'split' column records the official train/test split
        df = pd.concat([train_df.assign(split='train'), test_df.assign(split='test')], ignore_index=True)
        source = "huggingface_datasets"
    except Exception as e:
        # Fall back: local CSV with 'text','label'
        if not os.path.exists(LOCAL_CSV):
            raise FileNotFoundError(
                f"Could not load TweetEval via datasets and fallback CSV not found at: {LOCAL_CSV}.\n"
                "Please upload a CSV with columns ['text','label'] to the Colab /content/ path "
                "or enable internet & rerun this cell."
            ) from e  # keep the original load error in the traceback
        df = pd.read_csv(LOCAL_CSV)
        if not {'text', 'label'}.issubset(df.columns):
            raise ValueError("CSV must have 'text' and 'label' columns.")
        # We'll create a fresh split below
        source = "local_csv"
    return df, source

df, source = load_data()
print(f"Loaded {len(df):,} rows from: {source}")
df.head()



Loaded 57,899 rows from: huggingface_datasets


| | text | label | split |
|---|---|---|---|
| 0 | "QT @user In the original draft of the 7th boo... | 2 | train |
| 1 | "Ben Smith / Smith (concussion) remains out of... | 1 | train |
| 2 | Sorry bout the stream last night I crashed out... | 1 | train |
| 3 | Chase Headley's RBI double in the 8th inning o... | 1 | train |
| 4 | @user Alciato: Bee will invest 150 million in ... | 2 | train |


## 3) Train/Test Split (Stratified)

In [3]:

# If the data already contains a 'split' column from HF datasets, we'll honor it.
if 'split' in df.columns:
    train_df = df[df['split']=='train'].copy()
    test_df  = df[df['split']=='test'].copy()
    X_train, y_train = train_df['text'].values, train_df['label'].values
    X_test,  y_test  = test_df['text'].values,  test_df['label'].values
else:
    # Otherwise, stratified split from a single CSV
    X_train, X_test, y_train, y_test = train_test_split(
        df['text'].values, df['label'].values,
        test_size=0.2, random_state=SEED, stratify=df['label'].values
    )

print(f"Train size: {len(X_train):,}, Test size: {len(X_test):,}")


Train size: 45,615, Test size: 12,284
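
For the CSV fallback branch, `stratify=` is what keeps class proportions the same in both halves of the split. A minimal sketch with hypothetical, deliberately imbalanced labels (the 60/30/10 counts and the `tweet i` strings are made up for illustration):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 / 30 / 10 examples of classes 0 / 1 / 2
labels = [0] * 60 + [1] * 30 + [2] * 10
texts = [f"tweet {i}" for i in range(len(labels))]

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Count how many examples of each class landed in each split
train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Both splits should show the same 6:3:1 class ratio as the full label list.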


## 4) Build Preprocessing → Model Pipeline

In [4]:

# Preprocessing (stopword removal & TF-IDF) + L1 Logistic Regression
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(
        stop_words='english',  # <-- stopwords removal
        ngram_range=(1,2),     # unigrams + bigrams; use (1,1) for unigrams only
        min_df=2,              # drop terms that appear in fewer than 2 documents
        max_df=0.95            # drop terms that appear in more than 95% of documents
    )),
    ('logreg', LogisticRegression(
        penalty='l1',          # <-- L1 regularization
        solver='liblinear',    # supports L1; fits multiclass one-vs-rest
        max_iter=1000,
        random_state=SEED
    ))
])
pipe
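
To make the effect of the L1 penalty concrete, the sketch below fits the same kind of TF-IDF + L1 logistic pipeline on a tiny made-up binary dataset and counts nonzero coefficients for a very small and a very large `C`. The data, the helper `count_nonzero_coefs`, and the two `C` values are assumptions chosen for illustration. They are not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy binary data for illustration only (not TweetEval)
texts = ["good great fun", "great happy day", "happy good vibes", "fun great win",
         "bad awful day", "awful sad loss", "sad bad vibes", "terrible bad loss"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

def count_nonzero_coefs(C):
    """Fit a TF-IDF + L1 logistic pipeline and count its nonzero weights."""
    model = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('logreg', LogisticRegression(penalty='l1', solver='liblinear', C=C)),
    ])
    model.fit(texts, labels)
    return int(np.count_nonzero(model.named_steps['logreg'].coef_))

nnz_strong_penalty = count_nonzero_coefs(0.001)  # small C = strong regularization
nnz_weak_penalty = count_nonzero_coefs(100.0)    # large C = weak regularization

print("nonzero weights at C=0.001:", nnz_strong_penalty)
print("nonzero weights at C=100:  ", nnz_weak_penalty)
```

With a strong penalty the model zeroes out its feature weights, which is why `C` is the natural parameter to tune for an L1 model.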


## 5) Grid Search **on the training set only**

In [5]:

# Parameter grid: tune only C (inverse regularization strength; smaller C = stronger L1 penalty)
param_grid = {
    'logreg__C': [0.001, 0.01, 0.1, 1, 10]
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)

grid = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    scoring='accuracy',
    cv=cv,
    n_jobs=-1,
    refit=True,       # refit best model on the FULL training set
    verbose=1
)

grid.fit(X_train, y_train)  # <-- TRAINING SET ONLY
print("Best params:", grid.best_params_)
print("Best CV mean accuracy:", round(grid.best_score_, 4))
best_model = grid.best_estimator_


Fitting 5 folds for each of 5 candidates, totalling 25 fits
Best params: {'logreg__C': 1}
Best CV mean accuracy: 0.657
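
Beyond the single best score, `GridSearchCV` keeps per-candidate results in `cv_results_`. The sketch below shows one way to tabulate them. It uses a small made-up dataset and a 2-fold split so it runs on its own, and the names `toy_pipe` and `toy_grid` are hypothetical stand-ins for `pipe` and `grid`.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Toy binary data for illustration only (not TweetEval)
texts = ["good great fun", "great happy day", "happy good vibes", "fun great win",
         "bad awful day", "awful sad loss", "sad bad vibes", "terrible bad loss"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('logreg', LogisticRegression(penalty='l1', solver='liblinear')),
])
toy_grid = GridSearchCV(
    toy_pipe,
    {'logreg__C': [0.01, 1, 100]},
    scoring='accuracy',
    cv=StratifiedKFold(n_splits=2, shuffle=True, random_state=42),
)
toy_grid.fit(texts, labels)

# One row per candidate C, with mean/std accuracy across folds and its rank
columns = ['param_logreg__C', 'mean_test_score', 'std_test_score', 'rank_test_score']
results = pd.DataFrame(toy_grid.cv_results_)[columns]
print(results.sort_values('rank_test_score'))
```

In the notebook itself, the same table could be built from `grid.cv_results_` to see how close the other `C` values came to the winner.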


## 6) Predictions on the Test Set

In [6]:

y_pred = best_model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {acc:.4f}\n")
print("Classification Report:")
print(classification_report(y_test, y_pred, digits=4))


Test Accuracy: 0.5822

Classification Report:
              precision    recall  f1-score   support

           0     0.7189    0.2737    0.3964      3972
           1     0.5729    0.8082    0.6705      5937
           2     0.5286    0.5335    0.5310      2375

    accuracy                         0.5822     12284
   macro avg     0.6068    0.5384    0.5326     12284
weighted avg     0.6115    0.5822    0.5549     12284
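
The `macro avg` and `weighted avg` rows can be recomputed from the per-class rows, which helps when reading the report. The sketch below uses the F1, recall, and support values copied from the report above. Because those inputs are already rounded to four decimals, the results should match the report only up to rounding.

```python
# Per-class values copied from the classification report above
f1 = {0: 0.3964, 1: 0.6705, 2: 0.5310}
recall = {0: 0.2737, 1: 0.8082, 2: 0.5335}
support = {0: 3972, 1: 5937, 2: 2375}

total = sum(support.values())

# Macro average: every class counts equally
macro_f1 = sum(f1.values()) / len(f1)
# Weighted average: each class counts in proportion to its support
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total
# Accuracy equals the support-weighted average of per-class recall
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(f"macro F1:        {macro_f1:.4f}")
print(f"weighted F1:     {weighted_f1:.4f}")
print(f"weighted recall: {weighted_recall:.4f}")
```

The macro average is lower than the weighted one here because class 0, which has the weakest F1, counts as much as the much larger class 1.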



## 7) Save Artifacts (Optional)

In [7]:

# Save the best pipeline (vectorizer + model) and predictions
os.makedirs('artifacts', exist_ok=True)
dump(best_model, 'artifacts/tfidf_l1_logreg_pipeline.joblib')
pd.DataFrame({'text': X_test, 'y_true': y_test, 'y_pred': y_pred}).to_csv('artifacts/test_predictions.csv', index=False)

print("Saved: artifacts/tfidf_l1_logreg_pipeline.joblib")
print("Saved: artifacts/test_predictions.csv")


Saved: artifacts/tfidf_l1_logreg_pipeline.joblib
Saved: artifacts/test_predictions.csv
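
A saved pipeline can be restored later with `joblib.load` and used directly on raw text, since the vectorizer is stored together with the classifier. The sketch below demonstrates the round trip on a small made-up pipeline written to a temporary directory. The data and names such as `toy_model` are hypothetical. In the notebook, the path would be `artifacts/tfidf_l1_logreg_pipeline.joblib`.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy binary data for illustration only (not TweetEval)
texts = ["good great fun", "great happy day", "happy good vibes", "fun great win",
         "bad awful day", "awful sad loss", "sad bad vibes", "terrible bad loss"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('logreg', LogisticRegression(penalty='l1', solver='liblinear', C=10)),
]).fit(texts, labels)

# Save to a temporary location, then load the pipeline back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_pipeline.joblib')
    dump(toy_model, path)
    restored = load(path)

# The restored pipeline accepts raw strings, exactly like the original
new_texts = ["great fun", "awful loss"]
original_preds = toy_model.predict(new_texts).tolist()
restored_preds = restored.predict(new_texts).tolist()
print("predictions match:", original_preds == restored_preds)
```

Joblib files are pickle-based, so it is safest to load them only from trusted sources and with the same scikit-learn version that wrote them.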
