# Text Classification 📊
Using the [IMDb dataset](https://ai.stanford.edu/~amaas/data/sentiment/), we will train text classification models to predict whether a movie review is positive or negative. We will compare Naive Bayes and logistic regression classifiers on Bag-of-Words and TF-IDF features.

In [4]:
import pandas as pd
from datasets import load_dataset
from sklearn.model_selection import train_test_split

# Load IMDb dataset from the datasets library
dataset = load_dataset('imdb')

# Convert to DataFrame
df = pd.DataFrame(dataset['train'])

# Display the first few rows
print("Sample Data:")
display(df.head())

# Split dataset into features and labels
X = df['text']
y = df['label']

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set size:", len(X_train))
print("Test set size:", len(X_test))

Sample Data:


| | text | label |
|---|---|---|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 |
| 2 | If only to avoid making this type of film in t... | 0 |
| 3 | This film was probably inspired by Godard's Ma... | 0 |
| 4 | Oh, brother...after hearing about this ridicul... | 0 |


Training set size: 20000
Test set size: 5000


## 1. Text Preprocessing 🧹
Convert the raw review text into numerical features that a model can be trained on.

Techniques used:

**CountVectorizer (Bag-of-Words):** This technique converts text data into numerical vectors by counting the frequency of each word in the text. It creates a vocabulary of all unique words in the dataset and represents each document as a vector of word counts. While simple, it treats all words equally, which might not capture the importance of different words in the text.
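
To make the counting step concrete, here is a minimal pure-Python sketch of a Bag-of-Words representation on two made-up sentences. The `tokenize` helper and the toy `docs` list are illustrative assumptions, not part of the notebook. The regular expression mirrors `CountVectorizer`'s default token pattern, which drops single-character tokens such as "a".

```python
import re
from collections import Counter

# Two made-up reviews, used only for illustration.
docs = [
    "a great movie with a great cast",
    "a boring movie",
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every unique token, sorted so the column order is stable.
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})

# One count vector per document, one column per vocabulary word.
bow_vectors = []
for doc in docs:
    counts = Counter(tokenize(doc))
    bow_vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in bow_vectors:
    print(vector)
```

Here the vocabulary is `['boring', 'cast', 'great', 'movie', 'with']`, and the first sentence becomes `[0, 1, 2, 1, 1]` because "great" occurs twice. Unlike the notebook's vectorizer, this sketch does not remove stop words, so "with" keeps its own column.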

**TfidfVectorizer (TF-IDF):** TF-IDF stands for Term Frequency-Inverse Document Frequency. This technique also converts text into numerical vectors, but it scales each word's count by how rare the word is across the dataset. Words that appear in many documents are down-weighted, so distinctive words contribute more to each document's vector.
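
The weighting can be sketched in a few lines of pure Python. This is an illustration on a made-up, pre-tokenized corpus, not the library's implementation. It assumes the smoothed formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which matches the default settings of `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`). The helper names `idf` and `tfidf` are hypothetical.

```python
import math
from collections import Counter

# Made-up, already tokenized documents.
docs = [
    ["great", "movie", "great", "cast"],
    ["boring", "movie"],
    ["great", "plot"],
]
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in docs for word in set(doc))

def idf(word):
    # Smoothed inverse document frequency.
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf(doc):
    counts = Counter(doc)
    raw = {word: count * idf(word) for word, count in counts.items()}
    # L2-normalize so that document length does not dominate the weights.
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}

weights = tfidf(docs[1])  # the "boring movie" document
print(weights)
```

Both words occur once in that document, yet "boring" ends up with the larger weight because it appears in only one document while "movie" appears in two.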

**Word Embeddings:** Unlike the previous techniques, word embeddings represent words in continuous vector space, capturing semantic relationships between words. Techniques like Word2Vec and GloVe learn these representations by analyzing the context in which words appear, allowing words with similar meanings to have similar vector representations. Embeddings are more powerful for capturing the meaning and context of words and will be explored further in the [intermediate](../../../2.%20Intermediate) projects section.
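
As a rough illustration of what "similar vector representations" means, the sketch below compares hand-written toy vectors with cosine similarity. The three-dimensional values in `toy_embeddings` are invented for this example; learned Word2Vec or GloVe vectors are much larger and come from training on a corpus.

```python
import math

# Hypothetical embeddings, made up for illustration only.
toy_embeddings = {
    "good": [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "terrible": [-0.7, 0.1, 0.2],
}

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: 1 means same direction.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_close = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
sim_far = cosine_similarity(toy_embeddings["good"], toy_embeddings["terrible"])

print("good vs great:", round(sim_close, 3))
print("good vs terrible:", round(sim_far, 3))
```

With these toy values, "good" and "great" point in nearly the same direction, while "good" and "terrible" do not. Bag-of-Words and TF-IDF cannot express this kind of closeness, because every word gets its own independent column.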

In [7]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Initialize a CountVectorizer for Bag-of-Words representation
bow_vectorizer = CountVectorizer(stop_words='english', max_features=10_000)

# Initialize a TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=10_000)

# Preprocess and vectorize text data (Bag-of-Words)
X_train_bow = bow_vectorizer.fit_transform(X_train)
X_test_bow = bow_vectorizer.transform(X_test)

# Preprocess and vectorize text data (TF-IDF)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Slice the sparse matrix before densifying to avoid allocating the full dense array
print("Sample TF-IDF Vectorized Feature:\n", X_train_tfidf[:3].toarray())

Sample TF-IDF Vectorized Feature:
 [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


## 2. Building and Training the Model 🛠️
Build and train two simple classifiers, Multinomial Naive Bayes and logistic regression, on both the Bag-of-Words and the TF-IDF features.

In [8]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

# Naive Bayes Classifier with Bag-of-Words
nb_classifier_bow = MultinomialNB()
nb_classifier_bow.fit(X_train_bow, y_train)

# Naive Bayes Classifier with TF-IDF
nb_classifier_tfidf = MultinomialNB()
nb_classifier_tfidf.fit(X_train_tfidf, y_train)

# Logistic Regression with Bag-of-Words
lr_classifier_bow = LogisticRegression(max_iter=1000, random_state=42)
lr_classifier_bow.fit(X_train_bow, y_train)

# Logistic Regression with TF-IDF
lr_classifier_tfidf = LogisticRegression(max_iter=1000, random_state=42)
lr_classifier_tfidf.fit(X_train_tfidf, y_train)
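
For intuition about what the Naive Bayes classifier learns from word counts, here is a minimal from-scratch sketch on a tiny made-up training set. It is not scikit-learn's implementation. It assumes Laplace smoothing with `alpha=1.0` (the default value of `alpha` in `MultinomialNB`), and the names `log_posterior` and `predict` are hypothetical helpers.

```python
import math
from collections import Counter, defaultdict

# Made-up training data: (tokens, label), where 1 = positive, 0 = negative.
train = [
    (["great", "movie"], 1),
    (["great", "cast"], 1),
    (["boring", "movie"], 0),
    (["boring", "plot"], 0),
]

vocabulary = {word for tokens, _ in train for word in tokens}
class_doc_counts = Counter(label for _, label in train)

# word_counts[label][word] = how often the word occurs in that class.
word_counts = defaultdict(Counter)
for tokens, label in train:
    word_counts[label].update(tokens)

def log_posterior(tokens, label, alpha=1.0):
    # log P(label) plus the sum of log P(word | label).
    score = math.log(class_doc_counts[label] / len(train))
    total = sum(word_counts[label].values())
    for word in tokens:
        if word not in vocabulary:
            continue  # words never seen in training are ignored
        count = word_counts[label][word]
        # Smoothing keeps unseen (word, class) pairs from zeroing the product.
        score += math.log((count + alpha) / (total + alpha * len(vocabulary)))
    return score

def predict(tokens):
    # Pick the class with the highest (unnormalized) log posterior.
    return max(class_doc_counts, key=lambda label: log_posterior(tokens, label))

print(predict(["great", "plot"]))
print(predict(["boring", "cast"]))
```

Logistic regression uses the same feature vectors but learns one weight per word by optimization rather than by counting, which is one reason the two models can rank the same features differently.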

## 3. Model Evaluation 📊
Evaluate each model on the test set using accuracy, precision, recall, and F1 score.
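
As a reminder of how these four metrics relate, the sketch below computes them by hand from a small set of made-up labels, treating class 1 as the positive class. The `y_true` and `y_pred` lists here are toy values, unrelated to the notebook's results.

```python
# Made-up ground truth and predictions for eight reviews.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 0, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)   # share of all predictions that are correct
precision = tp / (tp + fp)          # share of predicted positives that are right
recall = tp / (tp + fn)             # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

`classification_report` prints precision, recall, and F1 once per class, which is why each report in the output has a row for label 0 and a row for label 1.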

In [9]:
from sklearn.metrics import accuracy_score, classification_report

# Evaluate Naive Bayes with Bag-of-Words
y_pred_nb_bow = nb_classifier_bow.predict(X_test_bow)
print("Naive Bayes with Bag-of-Words Accuracy:", accuracy_score(y_test, y_pred_nb_bow))
print("Classification Report:\n", classification_report(y_test, y_pred_nb_bow))

# Evaluate Naive Bayes with TF-IDF
y_pred_nb_tfidf = nb_classifier_tfidf.predict(X_test_tfidf)
print("Naive Bayes with TF-IDF Accuracy:", accuracy_score(y_test, y_pred_nb_tfidf))
print("Classification Report:\n", classification_report(y_test, y_pred_nb_tfidf))

# Evaluate Logistic Regression with Bag-of-Words
y_pred_lr_bow = lr_classifier_bow.predict(X_test_bow)
print("Logistic Regression with Bag-of-Words Accuracy:", accuracy_score(y_test, y_pred_lr_bow))
print("Classification Report:\n", classification_report(y_test, y_pred_lr_bow))

# Evaluate Logistic Regression with TF-IDF
y_pred_lr_tfidf = lr_classifier_tfidf.predict(X_test_tfidf)
print("Logistic Regression with TF-IDF Accuracy:", accuracy_score(y_test, y_pred_lr_tfidf))
print("Classification Report:\n", classification_report(y_test, y_pred_lr_tfidf))

Naive Bayes with Bag-of-Words Accuracy: 0.8482
Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.86      0.85      2515
           1       0.85      0.84      0.85      2485

    accuracy                           0.85      5000
   macro avg       0.85      0.85      0.85      5000
weighted avg       0.85      0.85      0.85      5000

Naive Bayes with TF-IDF Accuracy: 0.8586
Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.86      0.86      2515
           1       0.86      0.86      0.86      2485

    accuracy                           0.86      5000
   macro avg       0.86      0.86      0.86      5000
weighted avg       0.86      0.86      0.86      5000

Logistic Regression with Bag-of-Words Accuracy: 0.8708
Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.86      0.87      2515
           1       0.8

## Summary
- This notebook demonstrated the basics of text classification using the IMDb dataset. We used Bag-of-Words and TF-IDF for feature extraction, trained Naive Bayes and Logistic Regression classifiers on each representation, and compared their performance on a held-out test set.

- Logistic Regression with Bag-of-Words achieved the best performance among the models, with an accuracy of 0.87 and an F1 score of 0.87 on the test set.

- The results show that simple models can achieve good performance on text classification tasks, and feature extraction techniques like Bag-of-Words and TF-IDF are effective in capturing the information needed for classification for this dataset.