
# AI534 IA3: Text Classification with BoW, Linear SVM and Naive Bayes



## Overview
You will use the sklearn package to implement and compare classic text classification models, practicing principled evaluation, tuning, and interpretation. Specifically, your work will include:
* A clean preprocessing pipeline that produces BoW and TF-IDF features
* Linear/RBF SVM and Multinomial Naive Bayes classifiers
* Hyperparameter sweeps and plots
* Error analysis, feature interpretation, and a short reflection

## Data
The data for this assignment consists of a natural language sentiment dataset sourced from Twitter. The first column indicates sentiment of the tweets (zero representing negative sentiment and one for positive sentiment) and the second column contains the text of the tweets.

Two datasets are provided on Canvas: a training set named 'IA3-training.csv' and a validation set named 'IA3-dev.csv'. You will use the training set to build your models and the validation set to tune the parameters and observe their impacts.


## What you need to submit
1. Your completed notebook in ipynb format
2. A PDF report that includes all code outputs and figures. You can use the code block at the end of the notebook to generate a PDF export of the notebook with the outputs for your report. However, if any figures or outputs are missing, you must either:
* Manually add the missing figures to the PDF using a PDF editor or
* Copy your notebook contents into a Word or Google Doc, insert the missing outputs there, and export that document as a PDF.

First let's import the packages needed for this assignment.

In [None]:
!pip install nbconvert > /dev/null 2>&1
!pip install pdfkit > /dev/null 2>&1
!apt-get install -y wkhtmltopdf > /dev/null 2>&1
import os
import pdfkit
import contextlib
import sys
from google.colab import files
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn import svm
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
import seaborn as sns


Let's load the data.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

train_path = '/content/gdrive/My Drive/AI534/IA3-train.csv' # DO NOT MODIFY THIS. Please make sure your data has this exact path
val_path = '/content/gdrive/My Drive/AI534/IA3-dev.csv' # DO NOT MODIFY THIS. Please make sure your data has this exact path
train_data = pd.read_csv(train_path)
val_data = pd.read_csv(val_path)

Mounted at /content/gdrive


# Part 0: (10 pts) Preprocessing and Initial Analysis
In this part, you will take the text of the tweets and convert it to the bag-of-words (BoW) representation that can be processed by the model you will train.



## 📘 Detailed instructions
First, you will build your BoW vocabulary using the training set and represent your training set using *tf-idf*.

Here ***tf*** stands for term frequency and is defined as follows for a document $d$ and term $t$:

$TF(t,d) = \frac{\mbox{Num. of times } t \mbox{ appears in } d}{\mbox{Total number of terms in }d}$

and ***idf*** stands for inverse document frequency and is defined as follows for a term $t$ and a document collection $D$

$IDF(t, D) = \log(\frac{N}{|\{d\in D: t\in d\}|}) $


where $N$ is the total number of documents in $D$, and $|\{d\in D: t\in d\}|$ is the number of documents that contain term $t$.
Specifically, you will use the **TfidfVectorizer** class from the sklearn.feature_extraction.text package (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#). For this assignment, set `stop_words='english'`; all other parameters can remain at their default values.
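
To make the two definitions concrete, here is a minimal sketch that evaluates them by hand on a made-up three-document corpus (the toy tweets and the helper names `tf` and `idf` are illustrative assumptions, not part of the assignment). It follows the textbook formulas above. Note that `TfidfVectorizer` with default settings uses raw term counts, a smoothed IDF of the form $\ln\frac{1+N}{1+df}+1$, and L2 row normalization, so its numbers will not match these hand-computed values exactly.

```python
import math
from collections import Counter

# Toy corpus (hypothetical tweets), tokenized by whitespace.
docs = [
    "great flight great crew",
    "flight delayed again",
    "great service",
]
tokenized = [d.split() for d in docs]


def tf(term, tokens):
    """Term frequency: occurrences of `term` divided by document length."""
    return Counter(tokens)[term] / len(tokens)


def idf(term, corpus):
    """Inverse document frequency: log(N / number of documents containing `term`)."""
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / df)


tf_great = tf("great", tokenized[0])      # 2 occurrences out of 4 terms
idf_great = idf("great", tokenized)       # appears in 2 of 3 documents
idf_delayed = idf("delayed", tokenized)   # appears in 1 of 3 documents
tfidf_great = tf_great * idf_great        # TF-IDF taken as the product TF * IDF

print(tf_great, idf_great, idf_delayed, tfidf_great)
```

The rarer term ("delayed") receives the larger IDF, which is the intended effect: words that occur in few documents are treated as more informative.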

The specific methods you will need to use include:
- **`fit_transform()`**: learns the vocabulary and inverse document frequencies from the input documents (fit), and produces their TF-IDF representation (transform).
- **`transform()`**: applies the learned vocabulary and IDF values to new documents to produce their TF-IDF representation.

Both methods return a sparse matrix representation of the documents, which can be directly used by the SVM package and the Naive Bayes Package for learning. Using a sparse matrix representation is computationally efficient, especially for SVM models.

## 🚧 Task: Build TF-IDF Features & Build Initial Word Insights
**Your tasks:**

1. Use class `TfidfVectorizer(stop_words='english')` and apply:
   - `fit_transform()` to the **training** tweets to learn the vocabulary and IDF weights and produce the TF-IDF representation for training data
   - `transform()` to the **validation** tweets using the learned vocabulary and IDF values and produce the TF-IDF representation of the validation data

2. Print the shapes of the resulting TF-IDF matrices for the training and validation sets.

3. On the **training data only**, for each sentiment class (**positive** and **negative**):
   - Compute a **cumulative TF-IDF score** for each word:
   `CumulativeTFIDF(t, C)` = $\sum_{d \in C} \text{TFIDF}(t, d)$

     
   - Rank all words by this score and list the **top 20 words** in each class

4. Compare the two lists and report:
   - Words that appear in **both** lists
   - Words **distinctive to positive** sentiment
   - Words **distinctive to negative** sentiment

> ✅ Your output should include matrix shapes, the two ranked word lists (computed from training data only), and the comparison.



In [None]:
# Your code goes here

## ✍️ Questions

1. Consider the three sets of words you identified:  
   **(a)** common to both classes, **(b)** positive-only, and **(c)** negative-only.  
   What patterns do you observe in each group? (e.g., emotional tone, topic, intensity)

2. Suppose you train a linear sentiment classifier (e.g., linear SVM).  
   How do you expect it to use these words?  
   Discuss the likely **sign** and **magnitude** of the weights assigned to
   - positive-only words  
   - negative-only words  
   - words appearing in both classes

**Your answer goes here.**

# Part 1. (30 pts) Linear SVM with C sweep

You will train **linear SVM** models on the TF-IDF features and tune the regularization parameter **C** over
$C \in \{10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}, 10^{3}\}.$

Use `sklearn.svm.SVC` with `kernel='linear'` so that you can access the **number of support vectors** (this is not available in `LinearSVC`). The linear SVM can consume sparse TF-IDF matrices directly.

### 🚧 Task: Linear SVM and parameter sweep
Complete the following:

1. Train a `SVC(kernel='linear')` model for each  
   
   $C \in \{10^{-2},\,10^{-1},\,1,\,10,\,10^2,\,10^3\}$

2. For each $C$:
   - Train on the **training** set
   - Compute **AUROC** on **training** and **validation** sets  
   - Record the **total number of support vectors**

3. Create two plots (log-scale on the $C$-axis):
   - **Training vs validation AUROC** as a function of $C$
   - **Total number of support vectors** as a function of $C$

4. Select and report the **best $C$** based on **validation AUROC**
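
The loop below is a minimal sketch of such a sweep. It runs on a small synthetic sparse dataset generated with `make_classification` (an assumption made so the example is self-contained; in your notebook the inputs would be the TF-IDF matrices and labels). AUROC is computed from `decision_function` scores rather than hard predictions, and the support-vector total is read from `n_support_`.

```python
from scipy.sparse import csr_matrix
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

# Synthetic stand-in for the TF-IDF features (hypothetical data, not the tweets).
X_dense, y = make_classification(n_samples=120, n_features=20, random_state=0)
X = csr_matrix(X_dense)
X_tr, y_tr = X[:90], y[:90]
X_va, y_va = X[90:], y[90:]

C_grid = [1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3]
results = []  # one tuple per C: (C, train AUROC, validation AUROC, number of support vectors)

for C in C_grid:
    clf = SVC(kernel="linear", C=C).fit(X_tr, y_tr)
    # AUROC needs continuous scores, so use the signed distance to the hyperplane.
    train_auc = roc_auc_score(y_tr, clf.decision_function(X_tr))
    val_auc = roc_auc_score(y_va, clf.decision_function(X_va))
    n_sv = int(clf.n_support_.sum())  # n_support_ holds the per-class counts
    results.append((C, train_auc, val_auc, n_sv))

best_C = max(results, key=lambda r: r[2])[0]

for C, train_auc, val_auc, n_sv in results:
    print(f"C={C:g}  train AUROC={train_auc:.3f}  val AUROC={val_auc:.3f}  #SV={n_sv}")
print("best C by validation AUROC:", best_C)
```

For the plots, the collected columns can be passed to matplotlib with `plt.xscale("log")` so that the grid points are evenly spaced along the $C$-axis.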

In [None]:
# Your code goes here.

## ✍️ Questions
1. As the regularization parameter $C$ increases, what trends do you expect in (a) training performance and
  (b) validation performance?
  Explain why in terms of underfitting vs. overfitting.  
  Then compare your theoretical expectation to your observed results.  
  If they differ, provide a plausible explanation.
2. If different $C$ values produce very similar AUROC values, what principle would you use to choose between them, and why?

3. How do you expect the number of support vectors to change as $C$ increases? Explain why based on the SVM objective and margin behavior.  
Then compare your expectation to your observed results and discuss any discrepancies.



**Your answer goes here**

## 🚧 Task: Further Exploration of the SVM Parameter $C$

After completing your initial coarse grid search for $C$, refine your search to more precisely identify an effective value using validation AUROC as the evaluation criterion.

Use the following guidelines:

- **Boundary Expansion**  
  If the best-performing $C$ value lies at the **edge** of your grid (e.g., the smallest or largest value tested), **extend the search range outward** to explore more extreme values.  
  This checks whether better performance may exist beyond your initial limits.

- **Local Refinement**  
  If the best-performing $C$ value lies **within the interior** of your grid (not at an edge), **refine the search locally** by testing intermediate values near the current best.  
  For example, if $10^1$ performs best in your initial grid, consider trying nearby values such as $10^{0.5}$ and $10^{1.5}$.

- **Iterative Adjustment**  
  Apply expansion or refinement **iteratively**.  
  If a newly tested value becomes the best, repeat the decision process:
  - If the new best lies at a boundary → expand further  
  - If the new best lies inside the range → refine locally again  

Continue exploring until additional adjustments **do not meaningfully improve** validation AUROC.
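
The decision rule above can be sketched as a small helper that proposes the next values to try. The function name `next_candidates`, the log10 step size, and the AUROC numbers in the example are all hypothetical; they only illustrate the expand-or-refine logic and are not results from this dataset.

```python
import math


def next_candidates(scores, step=1.0):
    """Suggest new values to try, given a dict {tried value: validation AUROC}.

    Boundary best: move one log10 `step` further outward.
    Interior best: try half a log10 `step` on either side of the best value.
    """
    tried = sorted(scores)
    best = max(tried, key=lambda v: scores[v])
    log_best = math.log10(best)
    if best == tried[0]:
        return [10 ** (log_best - step)]       # boundary expansion (low end)
    if best == tried[-1]:
        return [10 ** (log_best + step)]       # boundary expansion (high end)
    return [10 ** (log_best - step / 2), 10 ** (log_best + step / 2)]  # local refinement


# Made-up scores for illustration: the best value (10) is in the interior.
interior = next_candidates({1: 0.85, 10: 0.90, 100: 0.88})

# Made-up scores where the best value sits at the upper edge of the grid.
edge = next_candidates({1: 0.85, 10: 0.88, 100: 0.90})

print(interior, edge)
```

In the interior case the helper returns $10^{0.5}$ and $10^{1.5}$, matching the example in the guidelines. On later rounds you would pass a smaller `step` and stop once the new values no longer change validation AUROC meaningfully.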

**Report**
- A small table of tried values and **validation AUROC** (sorted best-to-worst).
- Final chosen value for $C$
- For the best performing linear SVM model, extract the learned feature weights and report the top 10 positive-weight and top 10 negative-weight words

In [None]:
# Your code goes here

## ✍️ Questions: Comparing SVM Feature Weights to TF-IDF Words

Compare the top positive and negative SVM-weighted words to the three TF-IDF groups from Part 0: top words for the positive class only (positive-only), top words for the negative class only (negative-only), and top words for both classes (common):

1. Discuss the overlap between SVM-important words and the three lists. Did any **common** TF-IDF words appear among the most important SVM features?  Why might this happen (or not happen)?

2. Identify one or two words that were ranked highly in Part 0 for positive or negative sentiment, but that **did not** appear among the top SVM features.  Provide one plausible explanation.

3. Identify one or two **new** SVM-important words that did *not* appear prominently in the Part 0 lists. Why might these words be especially helpful for classification?


**Your answer goes here**

# Part 2: (30 pts) Naive Bayes Classifier

In this part, you will use the same TF-IDF representation of the tweets and experiment with a Multinomial Naive Bayes classifier.

Multinomial Naive Bayes estimates class-conditional word probabilities from word counts. However, if a word never appears in a class in the training data, its estimated probability becomes zero, forcing the entire document probability to zero whenever that word appears.

To avoid this, Naive Bayes uses a smoothing parameter $\alpha$, which controls how many "fake counts" we add to each word when estimating probabilities. A larger $\alpha$ value means more "fake counts" and heavier smoothing.
You will study how different levels of smoothing affect model performance.

## 🚧 Task: Naive Bayes and smoothing parameters
1. Using your TF-IDF features, train a MultinomialNB model for each

   $\alpha \in \{10,1,0.5,0.1,0.05\}$

2. For each $\alpha$:
* Train on the training set
* Compute the AUROC on the training and validation sets
3. Create a table that reports the training/validation AUROC for different $\alpha$ values.

4. Select the best $\alpha$ based on validation AUROC.
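
A sweep over the smoothing parameter can follow the same pattern as the SVM sweep. The sketch below uses synthetic non-negative count features drawn from a Poisson distribution (an assumption so the example runs on its own; `MultinomialNB` only requires non-negative inputs, so TF-IDF matrices work the same way). Scores for AUROC come from the positive-class column of `predict_proba`.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB

# Synthetic stand-in for the TF-IDF features (hypothetical data, not the tweets).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
# Each class gets a different expected count profile over 30 features.
rates = np.where(y[:, None] == 1, np.linspace(0.5, 1.5, 30), np.linspace(1.5, 0.5, 30))
X = csr_matrix(rng.poisson(rates).astype(float))
X_tr, y_tr = X[:200], y[:200]
X_va, y_va = X[200:], y[200:]

alphas = [10, 1, 0.5, 0.1, 0.05]
rows = []  # one tuple per alpha: (alpha, train AUROC, validation AUROC)

for alpha in alphas:
    nb = MultinomialNB(alpha=alpha).fit(X_tr, y_tr)
    # Column 1 of predict_proba is P(y = 1 | x) because classes_ is sorted as [0, 1].
    train_auc = roc_auc_score(y_tr, nb.predict_proba(X_tr)[:, 1])
    val_auc = roc_auc_score(y_va, nb.predict_proba(X_va)[:, 1])
    rows.append((alpha, train_auc, val_auc))

best_alpha = max(rows, key=lambda r: r[2])[0]

print("alpha   train AUROC   val AUROC")
for alpha, train_auc, val_auc in rows:
    print(f"{alpha:<7g} {train_auc:<13.3f} {val_auc:.3f}")
print("best alpha by validation AUROC:", best_alpha)
```

Passing `rows` to `pd.DataFrame` with named columns is one convenient way to produce the requested table in the notebook.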


In [None]:
# Your code goes here.

## ✍️ Questions
1. Based on the role of the smoothing parameter $\alpha$ (i.e., adding "fake counts"), describe how you expect model performance to change as
$\alpha$ varies from very small to large values.
Justify your expectation in terms of model behavior, not just the plot.

2. Compare your expectation with the observed validation AUROC across
$\alpha$. If the pattern does not perfectly match your prediction, suggest one plausible reason grounded in properties of text data or Naive Bayes assumptions.

3. Suppose two values of $\alpha$ produce very similar validation AUROC.
In such a situation, what principle would you use to choose between them, and why?

**Your answer goes here**

## 🚧 Task: Refining the Naive Bayes Smoothing Parameter

1. Expand your search for the Naive Bayes smoothing parameter $\alpha$ using the same boundary-expansion / local-refinement strategy from Part 1. Select the $\alpha$ that yields the highest **validation AUROC**.

2. Extract the weight coefficients of the linear classifier produced by Naive Bayes for different words. Specifically, for each word $w_i$, compute its weight as:

   $\log P(w_i \mid y=1) - \log P(w_i \mid y=0)$

   In sklearn, these values can be obtained via `model.feature_log_prob_`.

3. Compare two Naive Bayes models:
   - Your **best $\alpha$** based on validation AUROC
   - A **large-smoothing model** that uses a much larger smoothing parameter (e.g., $\alpha = 10$)

   For each model, report the **top 10 positive-weight words** and **top 10 negative-weight words** with their corresponding weight values.

In [None]:
# Your code goes here

## ✍️ Questions
1. Compare the most important words learned by the two Naive Bayes models
(large $\alpha$ vs. best $\alpha$):

- What changes do you observe in the top positive/negative words and their weights when smoothing increases or decreases? Why does smoothing affect which words are emphasized?

2. Now compare the most influential words from your best Naive Bayes model with those from the best linear SVM in Part 1:
- Do you see any noticeable differences between the influential words from the two models? How might these differences influence generalization and robustness?


Your answer goes here.

# Part 3 (20 pts) Exploring bigram features
So far you have trained models using unigram features only (single words).
In this part, you will investigate whether including bigrams (word pairs) improves sentiment classification performance.
Including bigrams enables the model to capture short phrases and word combinations like:

* not good

* very happy

* delayed flight

that single words alone may not fully express.

## 🚧 Task: Bigram Feature Exploration

1. Construct and evaluate two TF-IDF feature representations:

- Using both Unigrams + Bigrams by setting `ngram_range = (1, 2)`

- Using Bigrams only by setting `ngram_range = (2, 2)`

2. For each representation,
- Train both linear SVM and Multinomial Naive Bayes, using the selected parameters from Parts 1 and 2 (you are not asked to retune $C$ or $\alpha$ here, because the goal is to isolate the effect of changing the feature space).
- Compute and report training and validation AUROC for each representation

- Extract and report the top positive and negative features (similar to Parts 1 & 2)


In [None]:
# Your code goes here.

## ✍️ Questions
1. Did including bigrams (in addition to unigrams) improve performance compared to using unigrams alone? Why might bigrams help in sentiment classification?

2. How did the bigrams-only model perform relative to unigrams+bigrams or unigram-only? Provide an explanation for your observed differences.

3. Inspect the most influential bigram features. Provide one example where the bigram carries more sentiment meaning than either unigram alone.

# Part 4. (10 pts) In-class competition
We will host an in-class competition using the IA3 data. To participate in this competition, use the following link: https://www.kaggle.com/t/a6382751cf574a7a85b9e9adb8384777

**Model restriction.** For this competition, you are required to use SVM and Naive Bayes models.

**Exploration encouraged**. Here are some ideas you are welcome to explore:
- **Model variants**  
  Try alternative settings for SVM or Naive Bayes, or other simple linear models.

- **Feature engineering**  
  Add, remove, or transform text features (e.g., character n-grams, stopword decisions, emoji handling).

- **Data balancing strategies**  
  Try upsampling minority class examples or downsampling the majority class.

- **Additional preprocessing**  
  Experiment with handling punctuation, URLs, user handles, emojis, or casing.

- **Additional hyperparameter tuning**  
  If useful, continue refining $C$ or $\alpha$, or explore related knobs.

**Team work.** You should continue working in the same team for this competition. The training and validation data provided on the Kaggle site are the same as in the IA3 assignment.

**Evaluation.** To participate, you will apply your trained/tuned model to the test data provided on Kaggle (which does not contain the label column) and generate a prediction score for each example. You can consult the sample submission file on Kaggle for the correct submission format. The metric used for this competition is AUROC because the class distribution is imbalanced.

There are two parts to the score you will see on Kaggle: the performance reported on the public leaderboard and a score reported on the private leaderboard. The public leaderboard scores are visible throughout the competition, and you can use them as an external validation to help you refine your model design and tune the model. The private leaderboard scores are evaluated on a separate set of test data as the final performance evaluation and will be released only after the competition is closed.

**Points and bonus points.** You will get the full 10 points if you complete all of the following:

- Submit predictions to the competition (at least one successful submission)

- Achieve non-trivial performance (i.e., outperform a simple baseline; a high ranking is not required)

- Complete a brief write-up describing:

  - what you tried

  - which change(s) had the largest effect on performance (positive or negative)

  - what you learned from the process

You will get **3 bonus points** if your team **scores in the top 3** on the private leaderboard or enters **the largest number of unique submissions** (unique scores).

No late submissions. The competition closes at 11:59 pm on the due date.

## ✍️ Kaggle write-up

**Team name**:

Your report goes here.

In [None]:
# Running this code block converts this notebook and its outputs into a PDF report.
!jupyter nbconvert --to html /content/gdrive/MyDrive/Colab\ Notebooks/IA3-2024.ipynb  # you might need to change this path to the location of your copy of the IA3 notebook

input_html = '/content/gdrive/MyDrive/Colab Notebooks/IA3-2024.html' #you might need to change this path accordingly
output_pdf = '/content/gdrive/MyDrive/Colab Notebooks/IA3output.pdf' #you might need to change this path or name accordingly

# Convert HTML to PDF
pdfkit.from_file(input_html, output_pdf)

# Download the generated PDF
files.download(output_pdf)