# NLP Assignment Exercises
This notebook contains solutions to all five exercises from the NLP preprocessing course.

## Exercises
1. Compare the performance of CountVectorizer, TfidfVectorizer, and averaged Word2Vec embeddings on a 10k-sample dataset
2. Try ngram_range=(1,3) and observe overfitting/feature explosion
3. Use GridSearchCV to tune C for Logistic Regression and alpha for MultinomialNB
4. Create an inference API using FastAPI that loads best_text_pipeline.joblib and exposes POST /predict
5. (Advanced) Fine-tune a small transformer (e.g., DistilBERT) for sentiment classification using Hugging Face transformers
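
The feature explosion mentioned in Exercise 2 can be previewed without any library. The sketch below counts distinct word n-grams on a toy corpus; the helper `ngram_vocab_size` and the three toy sentences are illustrative assumptions, and whitespace tokenization only approximates what `CountVectorizer` does (its default tokenizer also drops punctuation and single-character tokens).

```python
def ngram_vocab_size(docs, ngram_range):
    """Count distinct word n-grams across docs for every n in [lo, hi]."""
    lo, hi = ngram_range
    vocab = set()
    for doc in docs:
        tokens = doc.lower().split()  # naive whitespace tokenization
        for n in range(lo, hi + 1):
            # Slide a window of width n over the token list
            for i in range(len(tokens) - n + 1):
                vocab.add(tuple(tokens[i:i + n]))
    return len(vocab)


# Hypothetical toy corpus, not drawn from the IMDb data
toy_docs = [
    "the movie was great",
    "the movie was terrible",
    "great acting but terrible plot",
]

sizes = {r: ngram_vocab_size(toy_docs, r) for r in [(1, 1), (1, 2), (1, 3)]}
for r, size in sizes.items():
    print(f"ngram_range={r}: {size} features")
```

Even on three short sentences the vocabulary grows with each wider range. On thousands of long reviews, most of the added bigrams and trigrams occur only once or twice, which is what makes overfitting likely.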

## Setup and Imports

In [1]:
# Install required packages
!pip install -q scikit-learn pandas numpy matplotlib seaborn gensim datasets

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time
import warnings

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline

from gensim.models import Word2Vec
from datasets import load_dataset

# Set random seed for reproducibility
np.random.seed(42)

# Set plot style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## Load Dataset (10k samples)
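We'll use a 10,000-review random sample from the training split of the IMDb movie review dataset for binary sentiment classification. Labels are 0 (negative) and 1 (positive).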

In [3]:
# Load IMDb dataset
print("Loading IMDb dataset...")
dataset = load_dataset("imdb", split="train")

# Shuffle with a fixed seed, then keep the first 10,000 reviews
dataset = dataset.shuffle(seed=42).select(range(10000))

# Convert to pandas DataFrame
df = pd.DataFrame({
    'text': dataset['text'],
    'label': dataset['label']
})

print(f"Dataset shape: {df.shape}")
print(f"\nLabel distribution:\n{df['label'].value_counts()}")
print(f"\nSample review:\n{df['text'].iloc[0][:200]}...")

Loading IMDb dataset...
Dataset shape: (10000, 2)

Label distribution:
label
0    5004
1    4996
Name: count, dtype: int64

Sample review:
There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. F...
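
The sample is close to a 50/50 label balance, and it is usually worth preserving that balance when the data is split for training and evaluation. A minimal sketch, assuming a 75/25 split and using placeholder texts and labels (both are illustrative assumptions, not values from this notebook), of how `train_test_split` with `stratify` keeps the class proportions in each part:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Placeholder data standing in for df['text'] and df['label']
texts = [f"review {i}" for i in range(100)]
labels = [0] * 52 + [1] * 48

# stratify=labels keeps the class ratio (roughly) equal in both parts;
# random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

print("train:", Counter(y_train))
print("test: ", Counter(y_test))
```

With the real DataFrame, the same call would take `df['text']` and `df['label']` in place of the placeholder lists.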
