<a href="https://colab.research.google.com/github/Shankar-Hadimani/PythonforDataScience/blob/master/3.%20Classification%20-%20Sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem statement: Classification model to analyze Amazon product reviews

The objective is to create a classification model that will analyze Amazon product reviews to classify sentiments as positive or negative. Here's a breakdown of the steps involved in this workflow:

- Step 1: Load the Dataset
- Step 2: Data Pre-processing
- Step 3: Feature Selection
- Step 4: Model Selection
- Step 5: Training the Model
- Step 6: Model Evaluation
- Step 7: Hyperparameter Tuning
- Step 8: Cross Validation

The notebook contains 7 exercises in total:

* [Exercise 1](#ex_1)
* [Exercise 2](#ex_2)
* [Exercise 3](#ex_3)
* [Exercise 4](#ex_4)
* [Exercise 5](#ex_5)
* [Exercise 6](#ex_6)
* [Exercise 7](#ex_7)

## Step 1: Load the dataset
First, let's load the dataset from Google Drive. You need to upload the dataset and then read the CSV file into a pandas DataFrame.

In [2]:
from google.colab import files
uploaded = files.upload()

Saving amazon-product-review-data.csv to amazon-product-review-data.csv


In [11]:
# Import necessary libraries
import pandas as pd

# Load the dataset into a DataFrame
df = pd.read_csv('amazon-product-review-data.csv')

# Display the first few rows to check if the data is loaded correctly
df.head()

Unnamed: 0,market_place,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date,sentiments
0,"""US""","""42521656""","""R26MV8D0KG6QI6""","""B000SAQCWC""","""159713740""","""The Cravings Place Chocolate Chunk Cookie Mix...","""Grocery""",1,0,0,0 \t(N),1 \t(Y),"""Using these for years - love them.""","""As a family allergic to wheat, dairy, eggs, n...",2015-08-31,positive
1,"""US""","""12049833""","""R1OF8GP57AQ1A0""","""B00509LVIQ""","""138680402""","""Mauna Loa Macadamias, 11 Ounce Packages""","""Grocery""",1,0,0,0 \t(N),1 \t(Y),"""Wonderful""","""My favorite nut. Creamy, crunchy, salty, and ...",2015-08-31,positive
2,"""US""","""107642""","""R3VDC1QB6MC4ZZ""","""B00KHXESLC""","""252021703""","""Organic Matcha Green Tea Powder - 100% Pure M...","""Grocery""",1,0,0,0 \t(N),0 \t(N),"""Five Stars""","""This green tea tastes so good! My girlfriend ...",2015-08-31,positive
3,"""US""","""6042304""","""R12FA3DCF8F9ER""","""B000F8JIIC""","""752728342""","""15oz Raspberry Lyons Designer Dessert Syrup S...","""Grocery""",1,0,0,0 \t(N),1 \t(Y),"""Five Stars""","""I love Melissa's brand but this is a great se...",2015-08-31,positive
4,"""US""","""18123821""","""RTWHVNV6X4CNJ""","""B004ZWR9RQ""","""552138758""","""Stride Spark Kinetic Fruit Sugar Free Gum, 14...","""Grocery""",1,0,0,0 \t(N),1 \t(Y),"""Five Stars""","""good""",2015-08-31,positive


## Step 2: Data Pre-processing





In [12]:
# Import necessary libraries for data pre-processing
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Remove any rows with missing values
df.dropna(inplace=True)

# Encode the 'sentiments' column (positive/negative) to numerical values (0/1)
le = LabelEncoder()
df['sentiments'] = le.fit_transform(df['sentiments'])

# Text data preprocessing using TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X = tfidf_vectorizer.fit_transform(df['review_body']).toarray()
y = df['sentiments'].values

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting data
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)


X_train shape: (400, 3466)
X_test shape: (100, 3466)
y_train shape: (400,)
y_test shape: (100,)


<a name="ex_1"></a>
## Exercise 1

- Use the train_test_split function and change the test_size to 0.3

This way the training set (X and y) should be 70% and the testing set(X and y) should be 30%

In [13]:
# Split the data into training and testing sets (70% train, 30% test)
X_train_30, X_test_30, y_train_30, y_test_30 = train_test_split(X, y, test_size=0.3, random_state=42)

# Display the shapes of the resulting data
print("X_train shape:", X_train_30.shape)
print("X_test shape:", X_test_30.shape)
print("y_train shape:", y_train_30.shape)
print("y_test shape:", y_test_30.shape)

X_train shape: (350, 3466)
X_test shape: (150, 3466)
y_train shape: (350,)
y_test shape: (150,)


## Step 3: Feature Selection

In this step, we'll perform feature selection to reduce the dimensionality of the TF-IDF vectorized data and potentially improve the model's performance. We'll use feature selection techniques like chi-squared (chi2) or mutual information to select the most important features.

In [14]:
from sklearn.feature_selection import SelectKBest, chi2

# Apply feature selection using chi-squared (chi2) test
# You can adjust the number of features (k) as needed
k = 1000
selector = SelectKBest(chi2, k=k)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

# Display the shapes of the selected feature sets
print("X_train_selected shape:", X_train_selected.shape)
print("X_test_selected shape:", X_test_selected.shape)

X_train_selected shape: (400, 1000)
X_test_selected shape: (100, 1000)


<a name="ex_2"></a>
## Exercise 2

- Compare the X_train_selected shape and X_test_selected shape with the new test_size=0.3

In [15]:
from sklearn.feature_selection import SelectKBest, chi2

# Apply feature selection using chi-squared (chi2) test
# You can adjust the number of features (k) as needed
k = 1000
selector = SelectKBest(chi2, k=k)
X_train_selected_30 = selector.fit_transform(X_train_30, y_train_30)
X_test_selected_30 = selector.transform(X_test_30)

# Display the shapes of the selected feature sets
print("X_train_selected shape:", X_train_selected_30.shape)
print("X_test_selected shape:", X_test_selected_30.shape)

X_train_selected shape: (350, 1000)
X_test_selected shape: (150, 1000)


We have successfully performed feature selection, reducing the dimensionality of the data while retaining the most important features.


## Step 4: Model Selection
For sentiment analysis, you can use various machine learning algorithms like Logistic Regression, Naive Bayes, Support Vector Machines, or even deep learning models like LSTM or BERT. Since you're a beginner, let's start with a simple model like Logistic Regression.

In [16]:
from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model
model = LogisticRegression(random_state=42)


<a name="ex_3"></a>
## Exercise 3

What does the random_state (parameter of the LogisticRegression) represent?

**Answer**:

**Ensures Reproducibility:** Running the same code multiple times will yield identical results.

**Facilitates Debugging:** Different runs won't produce different outcomes, making it easier to track issues.


The random_state parameter (from sklearn.linear_model) is used to control the randomness involved in certain aspects of the model training process. Setting a specific value (like 42) ensures that the results are reproducible across multiple runs.


## Step 5: Training the Model

Now that we have initialized our Logistic Regression model, it's time to train it on the selected features from the training dataset.



In [17]:

# Train the Logistic Regression model on the selected features
model.fit(X_train_selected, y_train)

# We can now proceed to Step 7: Model Evaluation

## Step 6: Model Evaluation

In this step, we'll evaluate the performance of the trained Logistic Regression model using the testing data.

- We import necessary metrics from `sklearn.metrics` such as `accuracy_score`, `classification_report`, and `confusion_matrix`.
- We use the trained model to predict sentiment labels (`y_pred`) for the test data (`X_test_selected`).
- We calculate the accuracy of the model by comparing the predicted labels to the true labels.
- We display a classification report that includes precision, recall, F1-score, and support for both positive and negative sentiment classes.
- We display a confusion matrix to visualize the true positive, true negative, false positive, and false negative predictions.



In [18]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Predict sentiment labels for the test data
y_pred = model.predict(X_test_selected)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Display a classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Display a confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.86

Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        14
           1       0.86      1.00      0.92        86

    accuracy                           0.86       100
   macro avg       0.43      0.50      0.46       100
weighted avg       0.74      0.86      0.80       100


Confusion Matrix:
[[ 0 14]
 [ 0 86]]


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


<a name="ex_4"></a>
## Exercise 4

- Compare the Results with the new data split with the results of the actual split.

In [19]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Predict sentiment labels for the test data
y_pred_30 = model.predict(X_test_selected_30)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test_30, y_pred_30)
print("Accuracy:", accuracy)

# Display a classification report
print("\nClassification Report:")
print(classification_report(y_test_30, y_pred_30))

# Display a confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test_30, y_pred_30))

Accuracy: 0.8466666666666667

Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        23
           1       0.85      1.00      0.92       127

    accuracy                           0.85       150
   macro avg       0.42      0.50      0.46       150
weighted avg       0.72      0.85      0.78       150


Confusion Matrix:
[[  0  23]
 [  0 127]]


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


<a name="ex_5"></a>
## Exercise 5

Do different training and testing sizes impact the model's learning and response to new data?

**Answer**:


Impact of Different Training and Testing Sizes on Model Learning and Performance
Yes, different training and testing sizes can significantly impact how a model learns and generalizes to new data. The key reasons include:

1. **Training Data Size:** A larger training set generally helps the model learn better representations of the data, reducing variance and improving generalization. Conversely, a smaller training set can lead to underfitting if it does not capture the full distribution of data.

2. **Testing Data Size:** A larger test set gives a more reliable estimate of real-world performance, whereas a smaller test set might introduce higher variance in performance metrics due to random chance.

3. **Class Imbalance Impact:** The number of instances in each class influences precision, recall, and F1-score. If one class is underrepresented in training data, the model may struggle to learn meaningful patterns for that class.

**Observations from the Given Models**

Accuracy is Nearly the Same:

**Old Model:** 86%

**New Model:** 84.67%

This suggests that the model performs consistently, despite different dataset sizes.
**Class Imbalance is Present in Both Models:**

In both cases, the model fails to classify any instances of Class 0 (negative class).
The recall for Class 1 is 100%, meaning it predicts every instance as Class 1, leading to zero true negatives.
Confusion Matrix Shows No Improvement in Identifying Class 0:

The old model had 14 false positives out of 14 total instances of Class 0.
The new model had 23 false positives out of 23 instances.
This suggests that increasing dataset size did not improve the model’s ability to recognize Class 0.

**Macro vs. Weighted Metrics:**

Macro Avg (average across both classes) remains low  because Class 1 dominates the dataset.

## Step 7: Hyperparameter Tuning

In this step, we'll perform hyperparameter tuning to optimize the Logistic Regression model's performance. We can search for the best hyperparameters using techniques like Grid Search or Random Search.

- We import `GridSearchCV` from `sklearn.model_selection`.
- We define a grid of hyperparameters to search, including 'C' (regularization parameter) and 'max_iter' (maximum iterations).
- We initialize Grid Search with cross-validation (5-fold) to find the best hyperparameters.
- The best hyperparameters are extracted using `grid_search.best_params_`.
- We fit the tuned model with the best hyperparameters to the training data.
- Finally, we evaluate the tuned model's accuracy on the test data.

In [20]:
from sklearn.model_selection import GridSearchCV

# Define hyperparameters to search
param_grid = {
    'C': [0.1, 1, 10, 100],  # Regularization parameters
    'max_iter': [100, 200, 300]  # Maximum number of iterations
}

# Initialize Grid Search with cross-validation (5-fold)
grid_search = GridSearchCV(LogisticRegression(random_state=42), param_grid, cv=5, verbose=1, n_jobs=-1)

# Fit the Grid Search to the data
grid_search.fit(X_train_selected, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Evaluate the model with the best hyperparameters
best_model = grid_search.best_estimator_
y_pred_tuned = best_model.predict(X_test_selected)

# Calculate the accuracy of the tuned model
accuracy_tuned = accuracy_score(y_test, y_pred_tuned)
print("Tuned Model Accuracy:", accuracy_tuned)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best Hyperparameters: {'C': 100, 'max_iter': 100}
Tuned Model Accuracy: 0.86


<a name="ex_6"></a>
## Exercise 6

- What is GridSearchCV used for?
- What are hyperparameters?
- Does the model give better results after hyperparameters ?

**Answer**:


**1. What is GridSearchCV used for?**

GridSearchCV is used for hyperparameter tuning in machine learning models. It systematically searches for the best combination of hyperparameters by performing exhaustive cross-validation (in this case, 5-fold cross-validation). The key benefits include:

- **Finding the optimal hyperparameters** for better model performance.

- **Ensuring robustness by evaluating** different parameter combinations using multiple training-validation splits.

- **Preventing overfitting** by select-ing hyperparameters that generalize well to unseen data.
In this case, GridSearchCV explores different values of C (regularization strength) and max_iter (maximum iterations for optimization) to find the best combination.



**2. What are hyperparameters?**

Hyperparameters are model-specific settings that are not learned from data but set before training to control the learning process. They affect model performance and need to be optimized.

In the given code, the hyperparameters being tuned are:

- **C: The inverse of regularization strength.** Higher values reduce regularization, allowing more flexibility in the model but increasing the risk of overfitting.

- **max_iter:** The maximum number of iterations allowed for the solver to converge. A higher value ensures the model has enough time to find optimal weights.
Unlike parameters (e.g., model weights in logistic regression), hyperparameters are set manually or tuned using methods like GridSearchCV.

**3. Does the model give better results after hyperparameter tuning?**

**Before tuning:** The model had an accuracy of 84.67% (from your previous model comparison).

**After tuning:** The accuracy improved to 86%.

**Best hyperparameters:** {'C': 100, 'max_iter': 100} were selected, meaning a higher regularization parameter and fewer iterations provided better performance.

***Thus, hyperparameter tuning did improve the model performance, but the improvement is relatively small.*** The gain in accuracy suggests that regularization and convergence settings played a role in fine-tuning the decision boundary. However, further improvements could be explored using:

- Feature selection or engineering.
- Handling class imbalance (since Class 0 was misclassified completely).
- Trying different solvers (liblinear, saga).
- Testing more hyperparameters.

It appears that the hyperparameter tuning did not significantly improve the model's accuracy in this case. The accuracy remains at 0.86.

## Step 8: Cross Validation

We'll use cross-validation to estimate how well the model will perform on unseen data and check if the model's performance is consistent across different folds of the data.

- We import `cross_val_score` from `sklearn.model_selection`.
- We perform 5-fold cross-validation on the tuned model (`best_model`) using the training data (`X_train_selected` and `y_train`).
- We calculate the mean cross-validation accuracy to get a more robust estimate of the model's performance.

In [22]:
from sklearn.model_selection import cross_val_score

# Perform 5-fold cross-validation on the tuned model
cv_scores = cross_val_score(best_model, X_train_selected, y_train, cv=5)

# Calculate and display the mean cross-validation accuracy
mean_cv_accuracy = np.mean(cv_scores)
print("Mean Cross-Validation Accuracy:", mean_cv_accuracy)

Mean Cross-Validation Accuracy: 0.7925000000000001


<a name="ex_7"></a>
## Exercise 7

- What is Cross Validation used for?
- Compare the new Validation score (with the new training and testing size)
- What do you conclude ?

In [24]:
from sklearn.model_selection import GridSearchCV

# Define hyperparameters to search
param_grid = {
    'C': [0.1, 1, 10, 100],  # Regularization parameters
    'max_iter': [100, 200, 300]  # Maximum number of iterations
}

# Initialize Grid Search with cross-validation (5-fold)
grid_search_30 = GridSearchCV(LogisticRegression(random_state=42), param_grid, cv=5, verbose=1, n_jobs=-1)

# Fit the Grid Search to the data
grid_search_30.fit(X_train_selected_30, y_train_30)

# Get the best hyperparameters
best_params_30 = grid_search.best_params_
print("Best Hyperparameters:", best_params_30)

# Evaluate the model with the best hyperparameters
best_model_30 = grid_search.best_estimator_
y_pred_tuned_30 = best_model.predict(X_test_selected_30)

# Calculate the accuracy of the tuned model
accuracy_tuned_30 = accuracy_score(y_test_30, y_pred_tuned_30)
print("Tuned Model Accuracy:", accuracy_tuned_30)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best Hyperparameters: {'C': 100, 'max_iter': 100}
Tuned Model Accuracy: 0.78


In [25]:
from sklearn.model_selection import cross_val_score

# Perform 5-fold cross-validation on the tuned model
cv_scores_30 = cross_val_score(best_model_30, X_train_selected_30, y_train_30, cv=5)

# Calculate and display the mean cross-validation accuracy
mean_cv_accuracy_30 = np.mean(cv_scores)
print("Mean Cross-Validation Accuracy:", mean_cv_accuracy_30)

Mean Cross-Validation Accuracy: 0.7925000000000001


**Answer**:


**1. What is Cross-Validation used for?**
Cross-validation is used to assess a model’s performance by splitting the dataset into multiple subsets (folds) and training the model on different subsets while validating on the remaining data. The key benefits of cross-validation are:

- a. **More reliable model evaluation** by reducing dependency on a single train-test split.
- b. **Prevention of overfitting** by ensuring the model generalizes well to unseen data.
- c. **Providing a stable performance** metric by averaging results across different data splits.

In this case, 5-fold cross-validation is used, meaning the dataset is split into 5 parts, the model is trained on 4 parts, and validated on the remaining part in 5 different iterations. The average accuracy across these iterations gives the mean cross-validation accuracy.

**2. Comparing the New Validation Score with the New Training and Testing Size**

We have two model evaluations:

**First Model (Original Dataset Size)**
- **Tuned Model Accuracy** (on test set): 0.86 (86%)
- **Mean Cross-Validation Accuracy** (on train set): 0.7925 (79.25%)


**Second Model (New Dataset Size)**

- **Tuned Model Accuracy** (on test set): 0.78 (78%)
- **Mean Cross-Validation Accuracy** (on train set): 0.7925 (79.25%)

### **3. What Do We Conclude?**

1. **Test Accuracy Dropped in the New Model (86% → 78%)**  
   - The second model performed worse on the test set, possibly due to **changes in training/testing size or class distribution** in the new dataset.  

2. **Cross-Validation Accuracy Remains the Same (79.25%)**  
   - Interestingly, the mean cross-validation accuracy on the training set **remained constant** despite the change in dataset size.  
   - This suggests that the model's ability to generalize **within the training set** was not impacted significantly.

3. **Potential Reasons for the Drop in Test Accuracy**  
   - The **test set distribution** might be different in the new dataset, making it harder for the model to generalize.  
   - The **training set may have less representative data** after the dataset change, affecting its real-world performance.  
   - The model might **overfit slightly** to the training data, causing a bigger drop when tested on unseen data.

---

### **Final  Observation/Takeaway**  
- **Cross-validation provides a more stable estimate of model performance,** which remains the same (~79%) across dataset sizes.  
- However, **the test accuracy drop suggests that the new dataset might be less representative or harder to learn from.**  
- Further improvements can be made by:
  - Ensuring a well-balanced training dataset.  
  - Expanding feature selection.  
  - Trying alternative regularization settings (e.g., different `C` values).  
