# Mini Project 2: Text Classification

This notebook builds a text classifier that predicts whether a product review is positive or negative.
The project follows these steps:
1. **Data Loading**: Load positive and negative reviews along with lists of positive and negative words.
2. **Preprocessing**: Tokenize and clean the reviews.
3. **Feature Engineering**: Extract features such as positive/negative word counts, the presence of specific words (for example, "no"), pronoun counts, and the log of the review length.
4. **Model Implementation**: Train Logistic Regression, Random Forest, and SVM classifiers.
5. **Evaluation & Comparison**: Compare the test-set accuracy of the three models.
6. **Conclusion**: Summarize findings and suggest improvements.

---


In [6]:
# Install missing packages
%pip install pandas numpy scikit-learn

# Import libraries
import pandas as pd
import numpy as np
import math
import re
from sklearn.model_selection import train_test_split

# Load the data
with open('positive-reviews.txt', 'r', encoding='utf-8') as f:
    positive_reviews = f.readlines()

with open('negative-reviews.txt', 'r', encoding='utf-8') as f:
    negative_reviews = f.readlines()

with open('positive-words.txt', 'r', encoding='utf-8') as f:
    positive_words = set(f.read().splitlines())

with open('negative-words.txt', 'r', encoding='utf-8') as f:
    negative_words = set(f.read().splitlines())

# Split data into train and test sets (80/20 split)
positive_train, positive_test = train_test_split(positive_reviews, test_size=0.2, random_state=42)
negative_train, negative_test = train_test_split(negative_reviews, test_size=0.2, random_state=42)

# Combine and label data
train_data = [(review, 1) for review in positive_train] + [(review, 0) for review in negative_train]
test_data = [(review, 1) for review in positive_test] + [(review, 0) for review in negative_test]

# Shuffle the data
np.random.shuffle(train_data)
np.random.shuffle(test_data)

negative_words







{'undermined',
 'exaggeration',
 'calumnies',
 'conceited',
 'dawdle',
 'hater',
 'limit',
 'fawningly',
 'oppress',
 'backaching',
 'unfit',
 'interferes',
 'orphan',
 'curses',
 'failures',
 'absurdly',
 'desititute',
 'lawbreaking',
 'liable',
 'acerbically',
 'scolding',
 'superficially',
 'obese',
 'limitation',
 'perverse',
 'suspect',
 'painfull',
 'vibration',
 'aching',
 'sicken',
 'blow',
 'gibberish',
 'infuriating',
 'gawk',
 'obscured',
 'overweight',
 'steeply',
 'reluctantly',
 'lengthy',
 'unyielding',
 'anti-us',
 'dangerous',
 'rigidness',
 'drain',
 'irreformable',
 'neglected',
 'fabrication',
 'gloatingly',
 'impudent',
 'indiscreetly',
 'severe',
 'ire',
 'traitor',
 'teasingly',
 'drastically',
 'morbidly',
 'disobedient',
 'scams',
 'inappropriate',
 'cocky',
 'erode',
 'slave',
 'strictly',
 'intimidate',
 'aghast',
 'cuplrit',
 'extravagance',
 'discompose',
 'frail',
 'defects',
 'bluring',
 'offend',
 'irrecoverablenesses',
 'gibe',
 'wild',
 'racists',
 'fr

In [7]:
# Define preprocessing function
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove special characters and numbers
    text = re.sub(r'[^a-z\s]', '', text)
    # Tokenize
    return text.split()

# Apply preprocessing to training and test data
train_data = [(preprocess_text(review), label) for review, label in train_data]
test_data = [(preprocess_text(review), label) for review, label in test_data]
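
To make the effect of the cleaning step concrete, the sketch below runs the same `preprocess_text` logic on a made-up review. The sample sentence and the `raw_has_exclamation` helper are illustrative assumptions, not part of the original notebook; the helper shows one possible way to record punctuation before the regex discards it.

```python
import re

def preprocess_text(text):
    # Same cleaning as the cell above: lowercase, keep only letters and whitespace
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    return text.split()

def raw_has_exclamation(text):
    # Hypothetical helper: inspect the raw string before punctuation is stripped
    return 1 if '!' in text else 0

# Made-up review used only for illustration
sample = "Great product!!! 10/10, wouldn't buy anything else."
tokens = preprocess_text(sample)

print(tokens)
print(raw_has_exclamation(sample))
```

Digits, punctuation, and apostrophes all disappear, so `wouldn't` becomes `wouldnt` and no token can contain `!`. Any feature that depends on punctuation therefore has to be computed from the raw text, as the hypothetical helper does.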


In [8]:
# Define feature extraction function
def extract_features(review, positive_words, negative_words):
    features = {}
    # Count positive words
    features['positive_word_count'] = sum(1 for word in review if word in positive_words)
    # Count negative words
    features['negative_word_count'] = sum(1 for word in review if word in negative_words)
    # Binary indicator for 'no'
    features['contains_no'] = 1 if 'no' in review else 0
    # Count pronouns
    pronouns = {'i', 'me', 'my', 'you', 'your'}
    features['pronoun_count'] = sum(1 for word in review if word in pronouns)
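    # Binary indicator for '!'
    # Note: preprocess_text() strips every non-letter character, so '!' never
    # survives tokenization and this feature is always 0 as written.
    features['contains_exclamation'] = 1 if '!' in ''.join(review) else 0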
    # Log of review length
    features['log_length'] = math.log(len(review)) if len(review) > 0 else 0
    return features

# Extract features from training and test data
X_train = [extract_features(review, positive_words, negative_words) for review, _ in train_data]
y_train = [label for _, label in train_data]

X_test = [extract_features(review, positive_words, negative_words) for review, _ in test_data]
y_test = [label for _, label in test_data]

# Convert feature dictionaries to DataFrames
X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)
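
As a worked example of the feature extraction above, the sketch below applies the same function to one hand-tokenized review. The two tiny word sets are hypothetical stand-ins for `positive-words.txt` and `negative-words.txt`, and the review itself is invented for illustration.

```python
import math

def extract_features(review, positive_words, negative_words):
    # Copy of the notebook's feature function, operating on a list of tokens
    features = {}
    features['positive_word_count'] = sum(1 for word in review if word in positive_words)
    features['negative_word_count'] = sum(1 for word in review if word in negative_words)
    features['contains_no'] = 1 if 'no' in review else 0
    pronouns = {'i', 'me', 'my', 'you', 'your'}
    features['pronoun_count'] = sum(1 for word in review if word in pronouns)
    features['contains_exclamation'] = 1 if '!' in ''.join(review) else 0
    features['log_length'] = math.log(len(review)) if len(review) > 0 else 0
    return features

# Hypothetical mini-lexicons and an invented, already-tokenized review
toy_positive = {'love', 'great'}
toy_negative = {'bad', 'broken'}
review = ['i', 'love', 'this', 'no', 'bad', 'parts']

example = extract_features(review, toy_positive, toy_negative)
for name, value in example.items():
    print(f"{name}: {value}")
```

The six-token review yields one positive word, one negative word, one pronoun, the `no` flag set, and a `log_length` of ln(6), about 1.79.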


In [9]:
X_train.shape  # not displayed: a notebook cell echoes only its last expression
X_train.tail()

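|       | positive_word_count | negative_word_count | contains_no | pronoun_count | contains_exclamation | log_length |
|-------|---------------------|---------------------|-------------|---------------|----------------------|------------|
| 31995 | 1                   | 0                   | 0           | 0             | 0                    | 2.484907   |
| 31996 | 2                   | 0                   | 0           | 0             | 0                    | 1.386294   |
| 31997 | 0                   | 1                   | 1           | 0             | 0                    | 1.791759   |
| 31998 | 2                   | 0                   | 0           | 0             | 0                    | 2.079442   |
| 31999 | 0                   | 0                   | 0           | 0             | 0                    | 1.609438   |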


In [10]:
# Import models and evaluation metric
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Train Logistic Regression
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)
lr_acc = accuracy_score(y_test, lr_pred)

# Train Random Forest
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
rf_acc = accuracy_score(y_test, rf_pred)

# Train SVM
svm = SVC()
svm.fit(X_train, y_train)
svm_pred = svm.predict(X_test)
svm_acc = accuracy_score(y_test, svm_pred)


In [11]:
# Print accuracy results
print("Logistic Regression Accuracy:", lr_acc)
print("Random Forest Accuracy:", rf_acc)
print("SVM Accuracy:", svm_acc)

Logistic Regression Accuracy: 0.817625
Random Forest Accuracy: 0.818125
SVM Accuracy: 0.818375
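
The three accuracies differ by less than 0.001. A rough way to judge whether a gap that small means anything is the binomial standard error of an accuracy estimate, sqrt(p(1 - p) / n). The sketch below assumes a test set of n = 8000 reviews; that figure is inferred from the 32,000 training rows and the 80/20 split and is not stated in the notebook. `accuracy_standard_error` is a hypothetical helper name.

```python
import math

def accuracy_standard_error(acc, n):
    """Binomial standard error of an accuracy measured on n examples."""
    return math.sqrt(acc * (1 - acc) / n)

# Accuracies reported in the output above
accuracies = {
    "Logistic Regression": 0.817625,
    "Random Forest": 0.818125,
    "SVM": 0.818375,
}
n_test = 8000  # assumption: inferred from 32,000 training rows at an 80/20 split

spread = max(accuracies.values()) - min(accuracies.values())
for name, acc in accuracies.items():
    se = accuracy_standard_error(acc, n_test)
    print(f"{name}: {acc:.4f} +/- {se:.4f}")
print(f"Spread between best and worst: {spread:.5f}")
```

Under that assumption the standard error is roughly 0.004, several times larger than the 0.00075 spread, so the ordering of the three models should be read as a tie rather than a ranking. This is only a rough check, since all three models were scored on the same test reviews.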


## Conclusion

Three models were evaluated on the held-out test set, and they performed almost identically (within 0.001 of each other):
- Logistic Regression achieved an accuracy of 0.817625
- Random Forest achieved an accuracy of 0.818125
- SVM achieved an accuracy of 0.818375

### Next Steps
1. Experiment with additional features such as TF-IDF or sentiment polarity.
2. Perform hyperparameter tuning to optimize model performance.
3. Explore deep learning approaches for further improvements.
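
To make the TF-IDF suggestion concrete, the sketch below computes TF-IDF weights by hand for three invented, already-tokenized reviews. It uses length-normalized term frequency and one common smoothed form of inverse document frequency, ln((1 + N) / (1 + df)) + 1; the function name `tfidf` and the toy documents are assumptions made for illustration, not something the notebook defines.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # term frequency (share of the document) times smoothed idf
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return weights

# Invented toy corpus, tokenized the same way preprocess_text would
docs = [
    ["great", "product", "great", "price"],
    ["bad", "product"],
    ["great", "service"],
]
weights = tfidf(docs)
for doc_weights in weights:
    print({term: round(w, 3) for term, w in doc_weights.items()})
```

In the first review, `price` and `product` each occur once, but `price` appears in fewer documents and so receives the larger weight. In practice scikit-learn's `TfidfVectorizer` provides this kind of weighting directly, though its defaults differ in detail (for example, raw counts for term frequency and L2 normalization of each row).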
