# Text Classification Assessment (Key)

## Objective
Classify customer reviews into positive or negative sentiment.

## 1. Data Loading and Preprocessing

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Load the dataset
df = pd.read_csv(
    "../ml-assessments/datasets/customer-reviews-dataset.csv"
)

# Create binary target: ratings of 4 or 5 are positive (1); ratings of 1-3,
# including neutral 3-star reviews, are negative (0)
df["sentiment"] = (df["rating"] >= 4).astype(int)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    df["review_text"], df["sentiment"], test_size=0.2, random_state=42
)
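
When the dataset is small, a random 20% hold-out can end up with a different class mix than the full data. One option, not used in this key, is to pass `stratify` to `train_test_split`. The sketch below uses hypothetical toy data (`texts`, `labels`) invented for illustration, not the customer reviews file.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data, invented for illustration: 10 negative, 10 positive
texts = [f"review {i}" for i in range(20)]
labels = [0] * 10 + [1] * 10

# stratify=labels keeps the class ratio the same in both splits
tr_texts, te_texts, tr_labels, te_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print("train:", Counter(tr_labels))
print("test: ", Counter(te_labels))
```

Here both splits keep the 50/50 balance of the toy labels, so the test set contains two reviews of each class.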

## 2. Exploratory Data Analysis

In [11]:
# Display the first few rows of the dataset
print(df.head())

# Show the distribution of positive and negative reviews
print(df["sentiment"].value_counts(normalize=True))

# Calculate and display the average length of reviews for each class
df["review_length"] = df["review_text"].str.len()
print(df.groupby("sentiment")["review_length"].mean())

   review_id                                        review_text  rating  \
0          1  This product exceeded my expectations. Great v...       5   
1          2  Disappointed with the quality. Not worth the p...       2   
2          3  Average product, nothing special but does the ...       3   
3          4  Absolutely love it! Would highly recommend to ...       5   
4          5  Terrible customer service and the product arri...       1   

   sentiment  review_length  
0          1             61  
1          0             51  
2          0             50  
3          1             55  
4          0             58  
sentiment
0    0.533333
1    0.466667
Name: proportion, dtype: float64
sentiment
0    53.5
1    53.0
Name: review_length, dtype: float64


## 3. Feature Engineering

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

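# Vectorize the reviews with TF-IDF: fit on the training set only, then
# reuse the fitted vocabulary and weights to transform the test set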
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Explanation:
# I chose TfidfVectorizer over CountVectorizer because it scales each word's
# count in a review by its inverse document frequency across the corpus.
# Words that appear in many reviews therefore get less weight, since they
# tend to be less informative for classification.
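
To make the weighting concrete, here is a minimal hand calculation of TF-IDF scores. It assumes the smoothed IDF formula that scikit-learn documents as its default, `ln((1 + n) / (1 + df)) + 1`, and leaves out the L2 normalization that `TfidfVectorizer` also applies by default. The corpus and the helper names (`document_frequency`, `smoothed_idf`, `tfidf`) are hypothetical, invented for illustration.

```python
import math

# Hypothetical mini-corpus, invented for illustration only
corpus = [
    "great product great value",
    "great product poor packaging",
    "poor product",
]
docs = [text.split() for text in corpus]
n_docs = len(docs)


def document_frequency(term):
    """Number of documents that contain the term at least once."""
    return sum(term in doc for doc in docs)


def smoothed_idf(term):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + document_frequency(term))) + 1


def tfidf(term, doc):
    """Raw term count in one document, scaled by the term's IDF."""
    return doc.count(term) * smoothed_idf(term)


first = docs[0]
for term in ["product", "great", "value"]:
    print(f"{term:>8s}  count={first.count(term)}  tfidf={tfidf(term, first):.3f}")
```

In the first toy review, "product" and "value" each occur once, but "product" appears in every document while "value" appears in only one, so "value" receives the higher score.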

## 4. Model Selection and Training

In [13]:
from sklearn.linear_model import LogisticRegression

# Choose and train the model
model = LogisticRegression(random_state=42)
model.fit(X_train_vectorized, y_train)

# Explanation:
# I chose Logistic Regression because:
# 1. It suits binary classification problems such as sentiment analysis.
# 2. It is relatively simple, and its per-word coefficients are interpretable.
# 3. It often performs well on text classification, where features are
#    high-dimensional and sparse (for example, TF-IDF vectors).
# 4. It is computationally efficient for both training and prediction.

## 5. Making Predictions

In [14]:
# Reset indices so X_test and y_test rows line up by position with y_pred
X_test = X_test.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

In [15]:
from sklearn.metrics import accuracy_score

# Make predictions on the test set
y_pred = model.predict(X_test_vectorized)

# Calculate and display the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Show at most 5 examples; the test set may contain fewer than 5 rows
n = min(5, len(X_test))

for i in range(n):
    print(f"Review: {X_test.iloc[i]}")
    print(f"Actual sentiment: {y_test.iloc[i]}")
    print(f"Predicted sentiment: {y_pred[i]}\n")

Accuracy: 0.33
Review: Fantastic! Exactly what I was looking for.
Actual sentiment: 1
Predicted sentiment: 0

Review: Decent product, but the instructions were confusing.
Actual sentiment: 0
Predicted sentiment: 0

Review: This product exceeded my expectations. Great value for money!
Actual sentiment: 1
Predicted sentiment: 0
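
The accuracy of 0.33 rests on only three test reviews, all of which were predicted negative. A per-class breakdown makes that pattern easier to see than the single accuracy figure. The sketch below recomputes the counts by hand from the labels printed above; the names `y_true` and `y_hat` are chosen for illustration.

```python
# Labels copied from the three test rows printed above
y_true = [1, 0, 1]
y_hat = [0, 0, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

acc = (tp + tn) / len(pairs)
# Recall for the positive class; guard against division by zero
recall_pos = tp / (tp + fn) if (tp + fn) else 0.0

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"Accuracy: {acc:.2f}")
print(f"Positive recall: {recall_pos:.2f}")
```

Both positive reviews were missed (FN=2), so the single correct prediction is the one negative review. With a test set this small, each review shifts accuracy by a third, so the score should be read with caution.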

