
*   Install **transformers** to import and use the DistilBERT model later on.
*   Install **joblib** to save the trained model.



In [None]:
!pip install transformers joblib  # joblib is used later to save the trained model
!pip install matplotlib seaborn



Import all the necessary libraries.

In [None]:
import re
import warnings
warnings.filterwarnings('ignore')  # silence library warnings in the notebook output

import joblib
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import torch
from google.colab import drive
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from textblob import TextBlob
from transformers import DistilBertTokenizer, DistilBertModel

nltk.download('vader_lexicon')  # lexicon required by SentimentIntensityAnalyzer

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!




*   The **dataset** resides in **Google Drive**, so mount Google Drive to access its contents.
*   Then load the dataset by providing its path in the drive.


In [None]:
drive.mount("/content/drive")
df = pd.read_csv("/content/drive/MyDrive/Wajid Ali/amazon.csv")
df.head()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Unnamed: 0.1,Unnamed: 0,reviewerName,overall,reviewText,reviewTime,day_diff,helpful_yes,helpful_no,total_vote,score_pos_neg_diff,score_average_rating,wilson_lower_bound
0,0,,4,No issues.,23-07-2014,138,0,0,0,0,0.0,0.0
1,1,0mie,5,"Purchased this for my device, it worked as adv...",25-10-2013,409,0,0,0,0,0.0,0.0
2,2,1K3,4,it works as expected. I should have sprung for...,23-12-2012,715,0,0,0,0,0.0,0.0
3,3,1m2,5,This think has worked out great.Had a diff. br...,21-11-2013,382,0,0,0,0,0.0,0.0
4,4,2&amp;1/2Men,5,"Bought it with Retail Packaging, arrived legit...",13-07-2013,513,0,0,0,0,0.0,0.0




*   After loading the dataset, keep the first 2,000 rows and sort them by the column **wilson_lower_bound** in descending order.
*   Drop the unwanted column named **Unnamed: 0**.



In [None]:
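# Optional filter (disabled): keep only reviews of at most 450 characters
# df = df[df['reviewText'].str.len() <= 450]
df = df.iloc[:2000, :]  # keep the first 2,000 rows
df = df.sort_values(by="wilson_lower_bound", ascending=False)
df = df.drop(columns=["Unnamed: 0"])  # drop the leftover index column
df.shape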

(2000, 11)
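
The **wilson_lower_bound** column is already present in the dataset, and this notebook does not show how it was computed. As a rough sketch, assuming it follows the standard Wilson score interval over the helpful_yes and helpful_no votes at about 95% confidence (z = 1.96 is an assumption), the score could be reproduced as follows. The function name is hypothetical.

```python
import math

def wilson_lower_bound(helpful_yes: int, helpful_no: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for the share of helpful votes.

    z = 1.96 corresponds to roughly 95% confidence (an assumption, not taken
    from the dataset).
    """
    n = helpful_yes + helpful_no
    if n == 0:
        return 0.0  # no votes, so no evidence of helpfulness
    phat = helpful_yes / n
    centre = phat + z * z / (2 * n)
    margin = z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)
    return (centre - margin) / (1 + z * z / n)

# Both reviews below have a 100% helpful ratio, but the one backed by
# more votes receives the higher score.
print(wilson_lower_bound(1, 0))
print(wilson_lower_bound(50, 0))
```

Sorting by this score in descending order puts reviews with many helpful votes first, while reviews without any votes score 0.0, as in the rows shown above.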

# **Text Processing and Sentiment Analysis Steps**
**Text Normalization:** The code cleans the reviewText column by replacing every non-alphabetic character with a space and converting all text to lowercase, ensuring uniformity for analysis.

**Sentiment Score Calculation:** Implements NLTK's SentimentIntensityAnalyzer to evaluate and assign a compound sentiment score to each review, which quantifies sentiment on a scale from -1 (negative) to 1 (positive).

**Binary Classification:** Transforms the compound sentiment scores into a binary format, where scores of 0 or above are labeled as positive (1) and scores below 0 as negative (-1), facilitating easier categorization and analysis.

In [None]:
# Replace every non-letter character with a space and lowercase the text
df['reviewText'] = df['reviewText'].apply(lambda review: re.sub(r"[^a-zA-Z]", ' ', str(review)).lower())
sid = SentimentIntensityAnalyzer()
df['compound'] = df['reviewText'].apply(lambda x: sid.polarity_scores(x)['compound'])
# Binary label: 1 for compound >= 0 (neutral counts as positive), -1 otherwise
df['result'] = df['compound'].apply(lambda x: 1 if x >= 0 else -1)
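
The normalization and thresholding steps can be tried in isolation, without pandas or NLTK. The sketch below mirrors the two lambdas from the cell above. The function names are hypothetical, and the compound scores are made-up inputs rather than VADER output.

```python
import re

def clean_review(text) -> str:
    """Replace every non-letter with a space, then lowercase."""
    return re.sub(r"[^a-zA-Z]", " ", str(text)).lower()

def label_from_compound(compound: float) -> int:
    """Map a compound score to 1 (score >= 0) or -1 (score < 0)."""
    return 1 if compound >= 0 else -1

print(repr(clean_review("No issues.")))  # 'no issues '
print(label_from_compound(0.0))          # 1, a neutral score counts as positive
print(label_from_compound(-0.25))        # -1
```

Because of the `str(...)` call, a missing review (NaN) would be turned into the literal text "nan" rather than dropped, which may be worth checking before labeling.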

# **Loading DistilBERT Components**
**Tokenizer Initialization:** Loads the DistilBertTokenizer from Hugging Face's transformers library for the '**distilbert-base-uncased**' checkpoint, so that text is tokenized with the vocabulary the model was trained on.

**Model Setup:** Instantiates the DistilBertModel from the same '**distilbert-base-uncased**' checkpoint. DistilBERT is a distilled version of BERT, smaller and faster than the original while retaining most of its language-understanding accuracy.


In [None]:

# Load DistilBERT model and tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained('distilbert-base-uncased')

# **Tokenization and Feature Extraction Process**
**Tokenization:** Converts each review in reviewText into a sequence of token IDs using the DistilBertTokenizer, truncated to at most 512 tokens. It adds the special tokens [CLS] and [SEP], which mark the start and end of the sequence.

**Padding:** Appends zeros to shorter sequences so that every sequence matches the length of the longest one in the dataset. This uniform length is required to stack the sequences into a single tensor for batch processing.

**Attention Mask Creation:** Generates an attention mask to inform the model which tokens should be attended to and which are just padding. This mask helps the model focus on meaningful content.


In [None]:
# Ensure each text sequence is truncated to a maximum length of 512 tokens
tokenized = df['reviewText'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True, max_length=512, truncation=True))

# Find the longest tokenized sequence (already capped at 512 by truncation)
max_len = max(len(i) for i in tokenized.values)

# Pad sequences so that they all have the same length
padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized.values])

# Create attention masks and convert to tensors
attention_mask = np.where(padded != 0, 1, 0)
input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)
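
To make the padding and attention-mask logic concrete, here is a small sketch on two toy sequences, using plain Python lists instead of NumPy arrays. The ids 101, 102, and 0 are the [CLS], [SEP], and [PAD] ids of the distilbert-base-uncased vocabulary. The remaining ids are arbitrary placeholders.

```python
# Toy token-id sequences of different lengths.
sequences = [
    [101, 2307, 4031, 102],  # [CLS] tok tok [SEP]
    [101, 2919, 102],        # [CLS] tok [SEP]
]
pad_id = 0

# Pad every sequence with zeros up to the length of the longest one.
max_len = max(len(ids) for ids in sequences)
padded = [ids + [pad_id] * (max_len - len(ids)) for ids in sequences]

# The mask is 1 for real tokens and 0 for padding positions.
attention_mask = [[1 if tok != pad_id else 0 for tok in row] for row in padded]

print(padded)          # [[101, 2307, 4031, 102], [101, 2919, 102, 0]]
print(attention_mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Deriving the mask from `!= 0` works here because 0 is reserved for padding and never appears as a real token id.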

# **Feature Extraction and Data Preparation**
**Feature Extraction:** Runs the DistilBERT model in a no-gradient context (**torch.no_grad()**) to generate embeddings for each input, which reduces memory usage and computation time. From the last hidden state, the embedding at position 0 (the [CLS] token) of each review is used as its feature vector for classification.

**Train-Test Split:** Splits the extracted features and their corresponding labels into training and testing sets, with 20% of the data reserved for testing. This separation is crucial for training the model on one set of data and validating its performance on an unseen set.

In [None]:
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

# Keep the [CLS] embedding (position 0) of each sequence as its feature vector
features = last_hidden_states[0][:,0,:].numpy()
labels = df['result'].values

# Train test split
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.2)
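
The cell above passes all 2,000 padded reviews through the model in a single call, which may exhaust memory on smaller machines. One possible alternative is to process the inputs in fixed-size batches and concatenate the results. The sketch below shows only the batching pattern. `iter_batches`, `embed_in_batches`, and `toy_embed` are hypothetical names, and `toy_embed` is a stand-in for the real model call.

```python
from typing import Callable, Iterator, List, Sequence

def iter_batches(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(items: Sequence, batch_size: int,
                     embed_fn: Callable[[Sequence], List]) -> List:
    """Apply embed_fn to each batch and collect the per-item outputs in order."""
    features = []
    for batch in iter_batches(items, batch_size):
        features.extend(embed_fn(batch))
    return features

def toy_embed(batch: Sequence) -> List:
    """Stand-in embedder: one 1-dimensional 'feature' per row (its length)."""
    return [[float(len(row))] for row in batch]

rows = [[101, 102], [101, 5, 102], [101, 5, 6, 102], [101, 102], [101, 7, 102]]
print(embed_in_batches(rows, batch_size=2, embed_fn=toy_embed))
# [[2.0], [3.0], [4.0], [2.0], [3.0]]
```

In the notebook, `embed_fn` would wrap the model call on a slice of `input_ids` and `attention_mask` inside `torch.no_grad()` and return the `[:, 0, :]` rows for that slice.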

# **Training and Evaluating a Logistic Regression Classifier**
**Model Training:** Initializes and trains a logistic regression classifier using the extracted features and labels from the training set, learning to predict sentiment based on text features derived from DistilBERT.

**Model Evaluation:** Evaluates the trained logistic regression model on the test dataset to determine its accuracy, providing a measure of how well the model generalizes to new, unseen data.

In [None]:
# Train a logistic regression classifier on the DistilBERT features
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)
# Report mean accuracy on the held-out test set
print("Logistic Regression score on test data: ", lr_clf.score(test_features, test_labels))
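
Accuracy is easier to interpret next to a trivial baseline. Since every review with a compound score of 0 or above is labeled positive, the labels may be skewed toward one class. As a hedged sketch, the accuracy of always predicting the most frequent label can be computed like this. The helper name and the toy labels are illustrative only.

```python
from collections import Counter

def majority_baseline_accuracy(labels) -> float:
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

toy_labels = [1, 1, 1, 1, -1]  # made-up labels, not from the dataset
print(majority_baseline_accuracy(toy_labels))  # 0.8
```

In the notebook this could be called on `test_labels`. A classifier score close to that value would suggest the model has learned little beyond the class distribution.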

# **Generating Classification Report for Logistic Regression**
**Prediction and Evaluation:** The logistic regression classifier makes predictions on the test features. These predictions are then compared to the actual labels to assess the model's performance.

**Detailed Report:** Outputs a detailed classification report that includes metrics such as precision, recall, and F1-score for each class, providing a comprehensive view of the model’s predictive accuracy and handling of each sentiment class.

In [None]:
# Predict on the test set and print per-class precision, recall, and F1-score
test_predictions = lr_clf.predict(test_features)
print("Logistic Regression Classification Report:")
print(classification_report(test_labels, test_predictions))
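
For readers unfamiliar with the metrics in the report, the sketch below computes precision, recall, and F1-score for one class by hand. It is a simplified illustration on made-up labels, not a replacement for `classification_report`, and the function name is hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, -1, -1]  # made-up ground truth
y_pred = [1, 1, -1, -1, 1]  # made-up predictions

print(precision_recall_f1(y_true, y_pred, positive=1))
print(precision_recall_f1(y_true, y_pred, positive=-1))  # (0.5, 0.5, 0.5)
```

Each row of the classification report corresponds to one such call, with that row's class treated as the positive one.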

# **Saving the Trained Logistic Regression Model**
**Model Serialization:** Uses joblib to serialize the trained logistic regression model, enabling it to be saved to disk. This process converts the model into a format that can be efficiently written to a file.

**Save and Confirm:** The model is saved to a specified path on Google Drive. A confirmation message is printed, verifying the model's successful storage at the given location.

In [None]:
# Save the trained model to Google Drive with joblib
model_path = "/content/drive/MyDrive/Wajid Ali/sentiment_model.pkl"
joblib.dump(lr_clf, model_path)
print(f"Model saved to {model_path}")

Model saved to /content/drive/MyDrive/Wajid Ali/sentiment_model.pkl


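# **Loading and Using a Saved Model**
**Load Model:** Uses joblib to load the logistic regression model from Google Drive.

**Predict:** Applies the loaded model to new data for predictions. Because the classifier was trained on DistilBERT embeddings, new text must first be converted to features with the same tokenizer and model.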

In [None]:
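# Optionally load the model later. The classifier expects DistilBERT [CLS]
# embeddings rather than raw text, so embed the new review first:
# loaded_model = joblib.load(model_path)
# encoded = tokenizer("this product was very nice", return_tensors="pt", truncation=True, max_length=512)
# with torch.no_grad():
#     new_features = model(**encoded)[0][:, 0, :].numpy()
# loaded_model.predict(new_features)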