## 1. Dataset overview

### 1.1 Brief overview of the [Amazon product review](https://github.com/rashakil-ds/Public-Datasets/blob/main/amazon.csv) dataset.

#### The dataset contains the following columns:
1. `reviewText`: The review text.
2. `Positive`: The binary target variable: 1 for a positive review, 0 for a negative one.

#### The dataset contains 20000 rows and 2 columns.

### 1.2 Load the data and inspect its columns

In [None]:
import pandas as pd
df = pd.read_csv('amazon.csv')
df.head()

In [None]:
df.shape

## 2. Data Preprocessing

### 2.1 Check for missing values

In [None]:
df.isnull().sum()

### 2.2 Perform text preprocessing (lowercasing, removing stop words, punctuation, etc.) on the `reviewText` column.

#### 2.2.1 Lowercasing

In [None]:
df['reviewText'] = df['reviewText'].str.lower()
df.head()

#### 2.2.2 Remove stop words

In [None]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stop-word corpus

stop_words = set(stopwords.words('english'))

def remove_stop_words(text):
    return ' '.join([word for word in text.split() if word not in stop_words])

df['reviewText'] = df['reviewText'].apply(remove_stop_words)
df.head()

#### 2.2.3 Removing punctuation

In [None]:
import string

def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

df['reviewText'] = df['reviewText'].apply(remove_punctuation)
df.head()

#### 2.2.4 Lemmatization

In [None]:
import nltk
nltk.download('wordnet')

In [None]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemmatize(text):
    # lemmatize() defaults to pos='n', so every word is treated as a noun
    return ' '.join([lemmatizer.lemmatize(word) for word in text.split()])

df['reviewText'] = df['reviewText'].apply(lemmatize)
df.head()

#### 2.2.5 Diacritics (accent) removal

In [None]:
import unidecode

def remove_diacritics(text):
    return unidecode.unidecode(text)

df['reviewText'] = df['reviewText'].apply(remove_diacritics)
df.head()
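
Accent stripping can also be approximated with the standard library alone. The sketch below is an illustrative alternative, not a description of what `unidecode` does internally: it decomposes characters with `unicodedata.normalize('NFKD', ...)` and drops the combining marks. The helper name `strip_accents` is hypothetical. Unlike `unidecode`, this approach leaves characters without a decomposition (for example `ß` or `ø`) untouched.

```python
import unicodedata

def strip_accents(text):
    """Remove combining accent marks (hypothetical stdlib-only helper)."""
    # NFKD splits 'é' into 'e' plus a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep only characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("café crème, naïve"))  # cafe creme, naive
```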

#### 2.2.6 Expand contractions

In [None]:
import contractions

def expand_contractions(text):
    return contractions.fix(text)

df['reviewText'] = df['reviewText'].apply(expand_contractions)
df.head()
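
The preprocessing steps above are order-sensitive: contractions are expanded last, after stop words and punctuation have already been removed, so any words the expansion produces never pass through the stop-word filter. The toy sketch below illustrates the effect with standard-library code only. `CONTRACTIONS` and `STOP_WORDS` are hypothetical miniature stand-ins for `contractions.fix` and the NLTK list (which also contains negations such as "not" that this toy list deliberately omits). The toy lookup only knows apostrophe forms, and `contractions.fix` may recognise more, so treat the output as an illustration of order sensitivity rather than a prediction of this notebook's results.

```python
import string

# Hypothetical miniature stand-ins for contractions.fix and the NLTK stop-word list.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "wasn't": "was not"}
STOP_WORDS = {"it", "is", "the", "a", "do", "was"}

def expand(text):
    return " ".join(CONTRACTIONS.get(word, word) for word in text.split())

def strip_punct(text):
    return text.translate(str.maketrans("", "", string.punctuation))

def drop_stop_words(text):
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

review = "it's great, don't hesitate."

# Order used in this notebook: stop words -> punctuation -> contractions
notebook_order = expand(strip_punct(drop_stop_words(review)))

# Alternative order: contractions -> punctuation -> stop words
alternative_order = drop_stop_words(strip_punct(expand(review)))

print(notebook_order)     # its great dont hesitate
print(alternative_order)  # great not hesitate
```

In the toy run, the notebook's order leaves the unexpanded fragments "its" and "dont" in the text, while the alternative order keeps the negation as a separate token.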

### 2.3 Split the dataset into training and testing sets
#### Using 80% of the data for training and 20% for testing

In [None]:
from sklearn.model_selection import train_test_split

X = df['reviewText']
y = df['Positive']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, X_test.shape
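
The labels are imbalanced (the test split reported later in this notebook holds 958 negatives and 3,042 positives), so a stratified split, which keeps the class ratio equal in both sets, may be worth considering. In scikit-learn this is requested by passing `stratify=y` to `train_test_split`. The standard-library sketch below only illustrates the idea on made-up labels; `stratified_split` is a hypothetical helper, not part of the notebook.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Made-up labels with roughly the imbalance seen in this dataset (76% positive).
labels = [1] * 76 + [0] * 24
train_idx, test_idx = stratified_split(labels)
test_positives = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), test_positives)  # 80 20 15
```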

## 3. Model Selection

### 3.1 Logistic Regression

#### 3.1.1 Necessary imports

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

#### 3.1.2 Create a pipeline

In [None]:
pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression())
])

#### 3.1.3 Train the model

In [None]:
pipe.fit(X_train, y_train)

#### 3.1.4 Make predictions

In [None]:
y_pred = pipe.predict(X_test)

#### 3.1.5 Formal evaluation
1. Accuracy
2. Precision
3. Recall
4. F1-score
5. Confusion matrix
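
For reference, the first four metrics are all derived from the confusion matrix. For binary labels 0 and 1, scikit-learn's `confusion_matrix` lays the counts out as `[[TN, FP], [FN, TP]]` (rows are true labels, columns are predictions), and with class 1 treated as the positive class:

$$
\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall} &= \frac{TP}{TP + FN} \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{aligned}
$$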

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print('-'*60)
print(classification_report(y_test, y_pred))
print('-'*60)
print(confusion_matrix(y_test, y_pred))

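#### 3.1.6 Hyperparameter tuning using GridSearchCV
#### Why GridSearchCV?
1. The dataset is small (20000 reviews), so each individual fit is cheap.
2. The search is exhaustive: every combination in the grid is evaluated.
3. Only a few hyperparameters are tuned, so the grid stays manageable.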

#### 3.1.6.1 Necessary imports

In [None]:
from sklearn.model_selection import GridSearchCV

#### 3.1.6.2 Define hyperparameters

In [None]:
param_grid = {
    'tfidf__max_df': [0.7, 0.8, 0.9],                     # Maximum document frequency
    'tfidf__min_df': [1, 2, 5],                           # Minimum document frequency
    'tfidf__ngram_range': [(1, 1), (1, 2)],               # N-gram range (unigrams, bigrams)
    'tfidf__max_features': [5000, 10000],                 # Maximum number of features
    'tfidf__use_idf': [True, False],                      # Whether to use IDF weighting
    'clf__C': [0.01, 0.1, 1, 10],                        # Inverse of regularization strength (smaller = stronger)
    'clf__penalty': ['l2'],                              # Type of regularization (e.g., l2)
    'clf__solver': ['liblinear'],                        # Solver algorithm
}

#### 3.1.6.3 Fit the model

In [None]:
grid = GridSearchCV(pipe, param_grid, cv=3, scoring='accuracy', verbose=2)
grid.fit(X_train, y_train)
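
To get a feel for the cost of this search before running it, the number of candidates can be counted directly from the grid. The snippet below repeats the grid from section 3.1.6.2 so that it runs on its own; `cv_folds` is a name introduced here for the `cv=3` argument used above.

```python
from math import prod

# Same grid as in section 3.1.6.2, repeated so the snippet is self-contained.
param_grid = {
    'tfidf__max_df': [0.7, 0.8, 0.9],
    'tfidf__min_df': [1, 2, 5],
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__max_features': [5000, 10000],
    'tfidf__use_idf': [True, False],
    'clf__C': [0.01, 0.1, 1, 10],
    'clf__penalty': ['l2'],
    'clf__solver': ['liblinear'],
}
cv_folds = 3

# A grid search tries the Cartesian product of all value lists.
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)  # 288 864
```

With the default `refit=True`, one further fit on the full training set follows the search.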

#### 3.1.6.4 Best hyperparameters

In [None]:
grid.best_params_

In [None]:
grid.best_score_

### 3.2 Support Vector Machine (SVM)

#### 3.2.1 Necessary imports

In [None]:
from sklearn.svm import SVC

#### 3.2.2 Create a pipeline

In [None]:
pipe_svm = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', SVC(verbose=True))
])

#### 3.2.3 Train the model

In [None]:
pipe_svm.fit(X_train, y_train)

#### 3.2.4 Make predictions

In [None]:
y_pred_svm = pipe_svm.predict(X_test)

#### 3.2.5 Formal evaluation

In [None]:
print(f'Accuracy: {accuracy_score(y_test, y_pred_svm)}')
print('-'*60)
print(classification_report(y_test, y_pred_svm))
print('-'*60)
print(confusion_matrix(y_test, y_pred_svm))

#### 3.2.6 Hyperparameter tuning using GridSearchCV

#### 3.2.6.1 Define hyperparameters

#### ⚠️ Full grid: this takes a long time to run

In [None]:
param_grid = {
    'tfidf__max_df': [0.7, 0.8, 0.9],
    'tfidf__min_df': [1, 2, 5],
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],  # unigrams, bigrams, trigrams
    'tfidf__max_features': [10000, 20000],  # Limit number of features
    'tfidf__use_idf': [True, False],  # Use IDF weighting or not
    'tfidf__sublinear_tf': [True, False],  # Apply sublinear scaling
    
    'clf__C': [0.1, 1, 10, 100],  # Regularization parameter
    'clf__kernel': ['linear', 'rbf'],  # Kernel type
    'clf__gamma': ['scale', 'auto', 0.01, 0.001],  # Kernel coefficient (ignored by the linear kernel)
    'clf__class_weight': [None, 'balanced'],  # Handling class imbalance
}

#### Reduced grid: less exhaustive (this cell overwrites `param_grid` from the previous cell)

In [None]:
param_grid = {
    'tfidf__max_df': [0.8, 0.9],
    'tfidf__ngram_range': [(1, 1), (1, 2)],  # Focus only on unigrams and bigrams
    'clf__C': [1, 10],  # Start with fewer values for regularization
    'clf__kernel': ['linear', 'rbf'],  # Stick to common kernels
    'clf__gamma': ['scale', 'auto'],  # Fewer gamma values (gamma is ignored by the linear kernel)
}
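
Because `SVC` ignores `gamma` when `kernel='linear'`, a flat grid that crosses both kernels with several `gamma` values repeats identical linear fits. `GridSearchCV` also accepts a list of dictionaries, one sub-grid per case. The sketch below is a possible alternative to the reduced grid, not the configuration used in this notebook; `shared`, `param_grid_by_kernel`, and `count_candidates` are names introduced here for illustration.

```python
from math import prod

# Settings shared by both kernels (taken from the reduced grid above).
shared = {
    'tfidf__max_df': [0.8, 0.9],
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'clf__C': [1, 10],
}

# Hypothetical alternative: one sub-grid per kernel, so gamma is only
# varied where the kernel actually uses it.
param_grid_by_kernel = [
    {**shared, 'clf__kernel': ['linear']},
    {**shared, 'clf__kernel': ['rbf'], 'clf__gamma': ['scale', 'auto']},
]

# The reduced grid as written in the notebook, for comparison.
flat_grid = {**shared, 'clf__kernel': ['linear', 'rbf'], 'clf__gamma': ['scale', 'auto']}

def count_candidates(grid):
    """Count parameter combinations in a dict grid or a list of dict grids."""
    grids = grid if isinstance(grid, list) else [grid]
    return sum(prod(len(values) for values in g.values()) for g in grids)

print(count_candidates(flat_grid), count_candidates(param_grid_by_kernel))  # 32 24
```

Passing `param_grid_by_kernel` in place of `param_grid` would cut the search from 32 to 24 candidates without dropping any distinct model.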

#### 3.2.6.2 Fit the model

In [None]:
grid_svm = GridSearchCV(pipe_svm, param_grid, cv=3, scoring='accuracy', verbose=3)
grid_svm.fit(X_train, y_train)

### 3.3 Random Forest

In [106]:
from sklearn.ensemble import RandomForestClassifier

pipe_rf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', RandomForestClassifier())
])

pipe_rf.fit(X_train, y_train)

y_pred_rf = pipe_rf.predict(X_test)

print(f'Accuracy: {accuracy_score(y_test, y_pred_rf)}')
print('-'*60)
print(classification_report(y_test, y_pred_rf))
print('-'*60)
print(confusion_matrix(y_test, y_pred_rf))

Accuracy: 0.86525
------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.86      0.52      0.65       958
           1       0.87      0.97      0.92      3042

    accuracy                           0.87      4000
   macro avg       0.86      0.75      0.78      4000
weighted avg       0.87      0.87      0.85      4000

------------------------------------------------------------
[[ 498  460]
 [  79 2963]]
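
The headline accuracy hides how differently the two classes are treated. As a worked check, the figures in the report can be recomputed from the confusion matrix printed above, alongside two reference points that the report does not show: the accuracy of always predicting the majority class, and the mean of the two per-class recalls (balanced accuracy). The variable names are introduced here for illustration.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions.
cm = [[498, 460],
      [79, 2963]]
(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
recall_neg = tn / (tn + fp)        # share of true negatives that were found
recall_pos = tp / (tp + fn)        # share of true positives that were found
precision_neg = tn / (tn + fn)
precision_pos = tp / (tp + fp)

# Reference points: always predicting class 1, and the mean per-class recall.
majority_baseline = (fn + tp) / total
balanced_accuracy = (recall_neg + recall_pos) / 2

print(f"accuracy          {accuracy:.5f}")
print(f"recall (0, 1)     {recall_neg:.2f}, {recall_pos:.2f}")
print(f"precision (0, 1)  {precision_neg:.2f}, {precision_pos:.2f}")
print(f"majority baseline {majority_baseline:.4f}")
print(f"balanced accuracy {balanced_accuracy:.4f}")
```

The model finds almost all positive reviews but only about half of the negative ones, so its accuracy of 0.86525 should be read against the 0.7605 obtained by predicting "positive" for every review.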
