# Text Classification with Dimensionality Reduction  
### *Airline Tweet Sentiment Analysis using TF-IDF, Naive Bayes, SVD, PCA & Logistic Regression*

---

## Group 2 Members
**Albright Maduka Ifechukwude – 9053136**  
**Abdullahi Abdirizak Mohamed – 9082466**  
**Kamamo Lesley Wanjiku – 8984971**

---

## Introduction

This project explores the task of **binary text classification** using real-world airline customer tweets.  
Building on the techniques outlined in our course materials, we implement a complete Natural Language Processing (NLP) workflow that transforms raw textual data into meaningful numerical features and analyzes how different models perform on sentiment classification.

We focus on:
- Converting tweets into TF-IDF feature vectors  
- Reducing feature dimensionality using **SVD (TruncatedSVD)** and **PCA**  
- Comparing baseline and advanced machine learning models  
- Evaluating performance using confusion matrices and standard classification metrics  

To align with the project requirements, we:
- Select **two airlines** (United Airlines and Delta Airlines)  
- Reduce the sentiment labels to **positive vs negative** (binary classification)  
- Apply normalization and text preprocessing to clean noisy tweet data  
- Train three models:
  1. **Naive Bayes using TF-IDF** (baseline)  
  2. **Logistic Regression using SVD-reduced features**  
  3. **Logistic Regression using PCA-reduced features**  

This introduction serves as the foundation for the detailed analysis, modeling, and evaluation presented in the following sections of the notebook.

---


## 1. Importing Libraries & Loading the Dataset

In this section, we import all the necessary Python libraries for data processing, 
visualization, feature extraction, dimensionality reduction, and machine learning.

We then load the `Tweets.csv` dataset, which contains airline customer tweets along 
with sentiment labels.  

This dataset will be filtered later to match the binary classification requirement 
of the project (positive vs negative).


In [4]:
# 0. Imports
import numpy as np
import pandas as pd
import re

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.decomposition import TruncatedSVD, PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    confusion_matrix, accuracy_score, precision_score,
    recall_score, f1_score, classification_report
)

import matplotlib.pyplot as plt
import seaborn as sns

# 1. Load CSV
df = pd.read_csv("data/Tweets.csv")  # forward slash works on every OS
df.head()

|   | tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 570306133677760513 | neutral | 1.0 |  |  | Virgin America |  | cairdin |  | 0 | @VirginAmerica What @dhepburn said. |  | 2015-02-24 11:35:52 -0800 |  | Eastern Time (US & Canada) |
| 1 | 570301130888122368 | positive | 0.3486 |  | 0.0 | Virgin America |  | jnardino |  | 0 | @VirginAmerica plus you've added commercials t... |  | 2015-02-24 11:15:59 -0800 |  | Pacific Time (US & Canada) |
| 2 | 570301083672813571 | neutral | 0.6837 |  |  | Virgin America |  | yvonnalynn |  | 0 | @VirginAmerica I didn't today... Must mean I n... |  | 2015-02-24 11:15:48 -0800 | Lets Play | Central Time (US & Canada) |
| 3 | 570301031407624196 | negative | 1.0 | Bad Flight | 0.7033 | Virgin America |  | jnardino |  | 0 | @VirginAmerica it's really aggressive to blast... |  | 2015-02-24 11:15:36 -0800 |  | Pacific Time (US & Canada) |
| 4 | 570300817074462722 | negative | 1.0 | Can't Tell | 1.0 | Virgin America |  | jnardino |  | 0 | @VirginAmerica and it's a really big bad thing... |  | 2015-02-24 11:14:45 -0800 |  | Pacific Time (US & Canada) |


## 2. Dataset Filtering: Selecting Airlines & Creating Multi-Class Labels

Following the project requirement for sentiment classification, we:

- Select **two airlines**: United and Delta.
- Keep **positive**, **negative**, and **neutral** tweets.
- Convert sentiment into a three-class label:
  - **0 = Negative**
  - **1 = Positive**
  - **2 = Neutral**

We also balance the dataset to approximately 2,000 total tweets, ensuring that all 
three classes are equally represented.  
Balancing reduces bias toward the majority class and makes the evaluation metrics easier to interpret.
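
The filtering and balancing steps above could be sketched as follows. This is a minimal illustration on a made-up toy DataFrame, not the notebook's actual code: `filter_and_balance`, `LABEL_MAP`, and the `per_class` parameter are hypothetical names, while the column names `airline`, `airline_sentiment`, and `text` come from the `df.head()` output in Section 1.

```python
import pandas as pd

# Toy stand-in for Tweets.csv (made-up rows; the real file is much larger).
toy = pd.DataFrame({
    "airline": ["United", "Delta", "United", "Delta",
                "Virgin America", "United", "Delta", "United"],
    "airline_sentiment": ["negative", "positive", "neutral", "negative",
                          "positive", "negative", "neutral", "positive"],
    "text": [f"tweet {i}" for i in range(8)],
})

# Three-class encoding listed in this section.
LABEL_MAP = {"negative": 0, "positive": 1, "neutral": 2}


def filter_and_balance(df, airlines=("United", "Delta"), per_class=None, seed=42):
    """Keep the chosen airlines, encode sentiment, and sample each class equally."""
    subset = df[df["airline"].isin(airlines)].copy()
    subset["label"] = subset["airline_sentiment"].map(LABEL_MAP)
    subset = subset.dropna(subset=["label"])
    subset["label"] = subset["label"].astype(int)

    # Every class is capped at the size of the smallest class
    # (or at per_class, if that is smaller).
    smallest = int(subset["label"].value_counts().min())
    n = smallest if per_class is None else min(per_class, smallest)

    balanced = subset.groupby("label").sample(n=n, random_state=seed)
    # Shuffle so the classes are not stored in contiguous blocks.
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)


balanced = filter_and_balance(toy)
print(balanced["label"].value_counts())
```

On the full dataset, a call such as `filter_and_balance(df, per_class=667)` would give roughly 2,000 tweets across the three classes, assuming each class has at least that many United and Delta tweets.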

## 3. Text Preprocessing & Normalization

Raw tweets contain noise such as:
- URLs  
- Mentions (@username)  
- Special characters & punctuation  
- Random spacing  
- Mixed casing  

To prepare the data for TF-IDF and machine learning, we clean each tweet using a 
custom normalization function that:
- Converts text to lowercase
- Removes URLs and mentions
- Removes non-alphabetical characters
- Collapses extra spaces

The resulting `clean_text` column is used for all subsequent modeling steps.
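
A normalization function covering the four steps listed above might look like the sketch below. The function name `normalize_tweet` and the exact regular expressions are assumptions made for illustration; the notebook's own function may differ in detail.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")
SPACES_RE = re.compile(r"\s+")


def normalize_tweet(text):
    """Lowercase, strip URLs and mentions, keep letters only, collapse spaces."""
    text = text.lower()
    text = URL_RE.sub(" ", text)        # remove URLs
    text = MENTION_RE.sub(" ", text)    # remove @mentions
    text = NON_ALPHA_RE.sub(" ", text)  # drop digits, punctuation, emoji
    return SPACES_RE.sub(" ", text).strip()


example = "@united Flight DELAYED 3 hrs!!  http://t.co/abc123 :("
print(normalize_tweet(example))  # -> "flight delayed hrs"
```

With a DataFrame, the column could then be built with `df["clean_text"] = df["text"].apply(normalize_tweet)`. One side effect of dropping all non-alphabetical characters is that contractions are split: `"can't"` becomes `"can t"`.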


## 4. TF-IDF Feature Extraction

TF-IDF (Term Frequency–Inverse Document Frequency) converts text into numerical vectors.  
This representation emphasizes:
- Words that appear frequently in an individual tweet, and  
- Words that are rare across the entire dataset.

We use:
- Maximum vocabulary size: 5000 terms  
- Unigrams + bigrams (1–2 word phrases)  
- English stopword removal

TF-IDF produces a **high-dimensional sparse matrix**, essential for the baseline Naive 
Bayes model and for dimensionality reduction (SVD & PCA).
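
The settings listed above map onto scikit-learn's `TfidfVectorizer` roughly as shown here. The four documents are made-up examples, and the variable names are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up cleaned tweets, for illustration only.
docs = [
    "the flight was delayed",
    "great flight and great crew",
    "they lost my bag",
    "great service today",
]

vectorizer = TfidfVectorizer(
    max_features=5000,     # keep at most 5000 terms
    ngram_range=(1, 2),    # unigrams and bigrams
    stop_words="english",  # built-in English stopword list
)
X = vectorizer.fit_transform(docs)  # SciPy sparse matrix, one row per document

print(X.shape)
print(sorted(vectorizer.vocabulary_)[:5])
```

Bigrams are formed after stopword removal, so the first document contributes the bigram `"flight delayed"`. In a train/test setting, fitting the vectorizer on the training split only and calling `transform` on the test split avoids leaking test vocabulary into the features.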

## 5. Model 1 — Naive Bayes (Baseline with TF-IDF)

Naive Bayes is a common and effective baseline model for text classification.  
It pairs well with TF-IDF features because:
- Its conditional-independence assumption matches the bag-of-words representation, in which word order is ignored
- It handles high-dimensional sparse features efficiently
- It often performs well on short texts such as tweets

In this section, we:
- Train a Multinomial Naive Bayes classifier
- Predict sentiment for the test set
- Generate a confusion matrix and evaluation metrics

This provides a foundation for comparing models with dimensionality reduction.
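
A minimal sketch of the baseline on a tiny made-up corpus is shown below. The texts, the two-label setup, and the split parameters are assumptions chosen to keep the example short; macro-averaged metrics are used so the same code also applies with more than two classes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up cleaned tweets: label 0 = negative, 1 = positive.
texts = [
    "flight delayed three hours", "lost my bag again", "worst service ever",
    "rude staff at the gate", "cancelled flight no refund",
    "terrible delay and rude crew",
    "great crew and smooth flight", "thanks for the quick help",
    "love the new seats", "awesome service today",
    "friendly staff great flight", "smooth boarding thanks",
]
labels = [0] * 6 + [1] * 6

X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, stratify=labels, random_state=42
)

# Fit TF-IDF on the training split only, then reuse it for the test split.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X_train = vectorizer.fit_transform(X_train_txt)
X_test = vectorizer.transform(X_test_txt)

nb = MultinomialNB()  # needs non-negative features, which TF-IDF provides
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)

cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
acc = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average="macro", zero_division=0)
print(cm, acc, macro_f1)
```

With a corpus this small the scores are not meaningful; the point is only the order of operations (split, fit the vectorizer, fit the classifier, evaluate).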

## 6. Dimensionality Reduction using SVD (TruncatedSVD)

TF-IDF generates thousands of features, many of which are redundant or noisy.  
We apply **TruncatedSVD**, also known as Latent Semantic Analysis (LSA), to reduce 
the dimensionality to about 100 components.

Why SVD?
- Produces **dense semantic features**
- Captures latent topics in the text
- Can speed up training and reduce noise, although it does not guarantee higher accuracy
- Works directly on sparse TF-IDF matrices

We also visualize the **explained variance curve** to show how much information 
each SVD component retains.
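
The reduction step and the values behind an explained variance curve could be computed as in this sketch. The documents are made up, and `n_components=3` is used only because the toy vocabulary is tiny; the text above uses about 100 components on the real TF-IDF matrix.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up cleaned tweets, for illustration only.
docs = [
    "flight delayed three hours", "lost my bag again",
    "worst service ever", "rude staff at the gate",
    "great crew and smooth flight", "thanks for the quick help",
    "love the new seats", "awesome service today",
]
X = TfidfVectorizer().fit_transform(docs)  # sparse input is fine for TruncatedSVD

svd = TruncatedSVD(n_components=3, random_state=42)
X_svd = svd.fit_transform(X)  # dense array of shape (n_docs, 3)

# Cumulative share of variance captured by the first k components.
cumulative = np.cumsum(svd.explained_variance_ratio_)
print(X_svd.shape)
print(cumulative)
```

Plotting `cumulative` against the component index (for example with `plt.plot`) gives the curve described above. Because TruncatedSVD does not center the data, the individual ratios are not guaranteed to be in decreasing order, unlike PCA.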

## 7. Model 2 — Logistic Regression with SVD-Reduced Features

After reducing TF-IDF using SVD, we train a **Logistic Regression** model on the 
dense, lower-dimensional feature set.

Logistic Regression is:
- Robust  
- Interpretable  
- Effective for binary classification  

We evaluate the model using:
- Confusion matrix  
- Accuracy, precision, recall, F1-score  

We later compare this model directly to the Naive Bayes baseline and PCA-reduced model.
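
One way to wire these steps together is a scikit-learn pipeline, sketched below on made-up data. The pipeline layout, `n_components=5`, and `max_iter=1000` are assumptions for illustration, not the notebook's actual settings.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up cleaned tweets: label 0 = negative, 1 = positive.
texts = [
    "flight delayed three hours", "lost my bag again", "worst service ever",
    "rude staff at the gate", "cancelled flight no refund",
    "terrible delay and rude crew",
    "great crew and smooth flight", "thanks for the quick help",
    "love the new seats", "awesome service today",
    "friendly staff great flight", "smooth boarding thanks",
]
labels = [0] * 6 + [1] * 6

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, stratify=labels, random_state=42
)

svd_clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), stop_words="english"),
    TruncatedSVD(n_components=5, random_state=42),  # ~100 on the real data
    LogisticRegression(max_iter=1000),
)
svd_clf.fit(X_train, y_train)  # every step is fitted on training data only
svd_pred = svd_clf.predict(X_test)

print(classification_report(y_test, svd_pred, zero_division=0))
```

Keeping the vectorizer, the SVD, and the classifier in one pipeline ensures that the test split is only ever transformed, never used for fitting.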


## 8. Dimensionality Reduction using PCA

Unlike TruncatedSVD, which works directly on sparse matrices, **PCA centers the data and is applied here to dense, standardized features**, so we first convert TF-IDF 
to a dense array and apply standardization.

We then reduce to the **same number of components as SVD** to ensure a fair 
comparison between the two dimensionality reduction techniques.

We also visualize PCA's **explained variance curve**, which shows how much of the 
data's variance is preserved across components.
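
The densify, standardize, and reduce sequence could look like the sketch below. The documents are made up and `n_components=3` is chosen only to fit the toy data; on the real matrix it would match the SVD setting.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Made-up cleaned tweets, for illustration only.
docs = [
    "flight delayed three hours", "lost my bag again",
    "worst service ever", "rude staff at the gate",
    "great crew and smooth flight", "thanks for the quick help",
    "love the new seats", "awesome service today",
]
X_sparse = TfidfVectorizer().fit_transform(docs)

X_dense = X_sparse.toarray()                       # memory grows with rows x terms
X_scaled = StandardScaler().fit_transform(X_dense)  # zero mean, unit variance per term

pca = PCA(n_components=3, random_state=42)
X_pca = pca.fit_transform(X_scaled)

ratios = pca.explained_variance_ratio_  # sorted from largest to smallest
print(X_pca.shape)
print(np.cumsum(ratios))
```

The dense conversion is the main cost of this route: with 5,000 terms, every tweet becomes a row of 5,000 floats, which is manageable for a few thousand tweets but scales poorly.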

## 9. Model 3 — Logistic Regression with PCA-Reduced Features

We train another Logistic Regression model, this time using PCA-transformed features.

This allows us to compare:
- SVD vs PCA performance  
- The effect of dimensionality reduction on classification accuracy  
- Differences in semantic vs variance-based transformations of TF-IDF  

As before, we evaluate using confusion matrices and standard metrics.
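
A hedged sketch of this model as a single pipeline follows. The helper `to_dense`, the toy data, and the parameter values are assumptions for illustration; `FunctionTransformer` is used only to convert the sparse TF-IDF output to a dense array before scaling.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def to_dense(X):
    """Convert a SciPy sparse matrix to a dense NumPy array."""
    return X.toarray()


# Made-up cleaned tweets: label 0 = negative, 1 = positive.
texts = [
    "flight delayed three hours", "lost my bag again", "worst service ever",
    "rude staff at the gate", "cancelled flight no refund",
    "terrible delay and rude crew",
    "great crew and smooth flight", "thanks for the quick help",
    "love the new seats", "awesome service today",
    "friendly staff great flight", "smooth boarding thanks",
]
labels = [0] * 6 + [1] * 6

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, stratify=labels, random_state=42
)

pca_clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), stop_words="english"),
    FunctionTransformer(to_dense, accept_sparse=True),
    StandardScaler(),
    PCA(n_components=3, random_state=42),  # match the SVD setting on real data
    LogisticRegression(max_iter=1000),
)
pca_clf.fit(X_train, y_train)
pca_pred = pca_clf.predict(X_test)

print(accuracy_score(y_test, pca_pred))
```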


## 10. Final Comparison of All Models

We summarize and compare the performance of all three models:

1. **Naive Bayes + TF-IDF** (baseline)  
2. **Logistic Regression + SVD**  
3. **Logistic Regression + PCA**

We present:
- A combined performance table  
- Accuracy, precision, recall, and F1-scores  
- A discussion of which model performs best  
- Error analysis (FP/FN patterns)  
- Insights on dimensionality reduction effectiveness

This section forms the core of the presentation and final report.
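
A combined table can be assembled from each model's test predictions, as in the sketch below. The helper `compare_models` is a hypothetical name, and the labels and predictions are invented placeholders rather than results from this project.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def compare_models(y_true, predictions):
    """Build one row of macro-averaged metrics per model."""
    rows = []
    for name, y_pred in predictions.items():
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average="macro", zero_division=0
        )
        rows.append({
            "model": name,
            "accuracy": accuracy_score(y_true, y_pred),
            "precision": precision,
            "recall": recall,
            "f1": f1,
        })
    return pd.DataFrame(rows).set_index("model")


# Invented placeholder data, not actual project results.
y_true = [0, 0, 1, 1]
predictions = {
    "model_a": [0, 0, 1, 1],  # every label correct
    "model_b": [0, 0, 0, 0],  # always predicts class 0
}
table = compare_models(y_true, predictions)
print(table.round(3))

# Indices of misclassified examples, a starting point for FP/FN analysis.
errors_b = [i for i, (t, p) in enumerate(zip(y_true, predictions["model_b"])) if t != p]
print(errors_b)  # -> [2, 3]
```

In the notebook, the dictionary keys would be the three model names and the values their predictions on the shared test split, so all rows are computed against the same `y_true`.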

## 11. Conclusion

We reflect on the overall performance of the three models and highlight:

- Which approach yields the highest accuracy  
- Whether dimensionality reduction helps or hurts performance  
- Which method (SVD or PCA) is more suitable for text data  
- Strengths and weaknesses of each model  
- Observations from Delta vs United sentiment trends  
- Suggestions for future improvements (deep learning, more features, larger dataset)

This conclusion ties together all analysis and supports the final presentation.