# Task 3: Predictive Modelling for Sentiment Classification

Build a machine learning model to predict sentiment (positive or negative) based on review text.


## Sentiment Analysis with Logistic Regression

Overview:
------------
In this notebook, we perform sentiment analysis on an Amazon review sentiment dataset using a Logistic Regression model. The goal is to predict whether a given review is positive or negative from its text. The dataset contains reviews labeled with their sentiment.

Workflow:
------------
1. **Data Loading:** We load the sentiment dataset, which includes columns for the review class, review title, and review text.

2. **Data Preprocessing:** The review title and text are combined, and the class labels are mapped to binary values (positive: 1, negative: 0). The text is then cleaned by removing URLs, HTML tags, digits, and special characters. Stopwords are dropped and the remaining words are stemmed to their root form.

3. **Train-Test Split:** The dataset is split into training and testing sets to train and evaluate the machine learning model.

4. **Text Vectorization:** The cleaned review text is converted into numerical features using TF-IDF (Term Frequency-Inverse Document Frequency) vectorization, because the model needs numeric input rather than raw text. The vectorizer is fitted on the training split only and then applied to the other data.

5. **Model Training:** A Logistic Regression model is chosen for its simplicity and effectiveness in binary classification tasks. The model is trained on the training set.

6. **Model Evaluation:** The trained model is evaluated on the testing set using accuracy as the performance metric. Additionally, a confusion matrix is generated to analyze the model's performance in terms of true positives, true negatives, false positives, and false negatives.

------------


In [1]:
# Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import re
from collections import defaultdict
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud,STOPWORDS
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer


from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

import warnings
warnings.filterwarnings('ignore')

Because the dataset is large, I use only the first 10,000 rows of each file to reduce computation time.

In [2]:
# Loading Data
col_names = ['class', 'review_title', 'review_text']
train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Prembly/Datasets/train.csv', names=col_names)
test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Prembly/Datasets/test.csv', names=col_names)

# Take explicit copies so the column assignments below do not write to a slice view
train_1 = train.iloc[:10000].copy()
test_1 = test.iloc[:10000].copy()

train_1.shape, test_1.shape

((10000, 3), (10000, 3))

In [3]:
# combine 'review_title' and 'review_text'
train_1['full_review'] = train_1['review_title'] +  ' ' + train_1['review_text']
test_1['full_review'] = test_1['review_title'] +  ' ' + test_1['review_text']

# Remap class labels: positive (2) -> 1, negative (1) -> 0
train_1['class'] = train_1['class'].map({2: 1, 1: 0})
test_1['class'] = test_1['class'].map({2: 1, 1: 0})


# Drop columns
del train_1['review_title'], train_1['review_text']
del test_1['review_title'], test_1['review_text']

# Drop rows with missing values (a missing title or text makes full_review NaN)
train_1.dropna(inplace=True)
test_1.dropna(inplace=True)

In [4]:
stemmer = SnowballStemmer("english")

def preprocess_data(data):

    # removal of URLs
    text = re.sub(r'https?://\S+|www\.\S+', ' ', data)

    # Decontraction. The patterns are case-sensitive and run before lowercasing,
    # so capitalised forms such as "Won't" fall through to the generic n't rule.
    # Longer contractions come first so the shorter rules do not pre-empt them.
    text = re.sub(r"won\'t've", " will not have", text)
    text = re.sub(r"won\'t", " will not", text)
    text = re.sub(r"can\'t've", " can not have", text)
    text = re.sub(r"can\'t", " can not", text)
    text = re.sub(r"don\'t", " do not", text)
    text = re.sub(r"ma\'am", " madam", text)
    text = re.sub(r"let\'s", " let us", text)
    text = re.sub(r"ain\'t", " am not", text)
    text = re.sub(r"sha\'n\'t", " shall not", text)
    text = re.sub(r"shan\'t", " shall not", text)
    text = re.sub(r"o\'clock", " of the clock", text)
    text = re.sub(r"y\'all", " you all", text)
    text = re.sub(r"n\'t've", " not have", text)
    text = re.sub(r"n\'t", " not", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'s", " is", text)
    text = re.sub(r"\'d've", " would have", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"\'ll've", " will have", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'t", " not", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'m", " am", text)

    #removal of html tags
    text = re.sub(r'<.*?>',' ',text)

    # Match all digits in the string and replace them by empty string
    text = re.sub(r'[0-9]', '', text)
    text = re.sub("["
                           u"\U0001F600-\U0001F64F"  # removal of emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+",' ',text)

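    # Keep letters only: every other character becomes a space.
    text = re.sub('[^a-zA-Z]', ' ', text)

    # Note: the three substitutions below can no longer match anything, because
    # the letters-only filter above has already replaced parentheses, '@', and
    # punctuation with spaces. They would need to run before that filter to
    # have an effect.
    text = re.sub(r"\([^()]*\)", "", text)

    # remove mentions
    text = re.sub(r'@\S+', '', text)

    # remove punctuation
    text = re.sub('[%s]' % re.escape(r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""), '', text)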


    # Lowering all the words in text
    text = text.lower()
    text = text.split()

    # Drop stopwords, then stem the remaining words
    text = [stemmer.stem(word) for word in text if word not in STOPWORDS]

    # Removal of words with length <= 2
    text = [i for i in text if len(i) > 2]
    text = ' '.join(text)
    return text

train_1["Cleaned_review"] = train_1["full_review"].apply(preprocess_data)
test_1["Cleaned_review"] = test_1["full_review"].apply(preprocess_data)

In [5]:
train_1.iloc[9998]

class                                                             0
full_review       Don't buy The box looked used and it is obviou...
Cleaned_review    buy box look use obvious new tri contact email...
Name: 9998, dtype: object

After cleaning, the word "Don't" is gone from the review: the contraction is expanded to "do not", and both words are then dropped as stopwords. Negations like this would be useful for identifying negative reviews, since they are more likely to appear in them.
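One possible way to keep negations, sketched under assumptions: take the negation words out of the stopword set before filtering. The small `BASE_STOPWORDS` set below is only a stand-in for wordcloud's `STOPWORDS`, and `NEGATIONS` and `remove_stopwords` are hypothetical names. Stemming is left out to keep the sketch dependency-free.

```python
# Stand-in for wordcloud's STOPWORDS (assumption: a tiny subset for illustration).
BASE_STOPWORDS = {"do", "not", "no", "nor", "the", "it", "is", "and", "this"}

# Hypothetical list of negation words worth keeping for sentiment.
NEGATIONS = {"not", "no", "nor", "never"}

# Stopword set with the negations taken out.
SENTIMENT_STOPWORDS = BASE_STOPWORDS - NEGATIONS


def remove_stopwords(text, stopwords):
    """Lowercase, split on whitespace, and drop any token found in `stopwords`."""
    return [word for word in text.lower().split() if word not in stopwords]


review = "Do not buy this"
with_default = remove_stopwords(review, BASE_STOPWORDS)
with_negations = remove_stopwords(review, SENTIMENT_STOPWORDS)
print(with_default)    # ['buy']
print(with_negations)  # ['not', 'buy']
```

In the notebook's own pipeline, the later length filter (`len(i) > 2`) would still drop "no", so that filter would need adjusting as well. A single "not" token also loses track of what it negates; passing `ngram_range=(1, 2)` to `TfidfVectorizer` is one option that might capture pairs such as "not good".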

## Logistic Regression

In [6]:
X_train, X_test, y_train, y_test = train_test_split(train_1['Cleaned_review'], train_1['class'], test_size=0.2, random_state=0, stratify=train_1['class'])
y_train = np.array(y_train)
y_test = np.array(y_test)

In [7]:
# Convert text data to numeric TF-IDF features (fit on the training split only)
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
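
To make the TF-IDF step concrete, here is a minimal pure-Python sketch of the weighting that scikit-learn documents for `TfidfVectorizer` with default settings: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation of each row. The toy corpus and the helper name `tfidf_rows` are made up for illustration, and the tokenisation is a plain whitespace split, which is simpler than the vectorizer's own tokenizer.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows), where each row is an L2-normalised TF-IDF vector."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)

    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenised for tok in set(toks))

    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocab, rows


# Toy corpus (hypothetical, not taken from the dataset).
docs = ["great product", "bad product", "great great value"]
vocab, rows = tfidf_rows(docs)
print(vocab)
for doc, row in zip(docs, rows):
    print(doc, [round(w, 3) for w in row])
```

In the second toy document, "bad" receives a higher weight than "product" because "product" appears in more documents. Fitting on the training split only means the IDF values come from training reviews alone, and words that appear only in the test reviews are ignored at transform time.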

In [8]:
# Model Training
lr = LogisticRegression()
lr.fit(X_train_tfidf, y_train)

# Model Evaluation
y_pred = lr.predict(X_test_tfidf)
accuracy_score(y_test, y_pred)

0.8615
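
As a rough illustration of what the fitted model does at prediction time: binary logistic regression scores a feature vector with a weighted sum, passes the score through the sigmoid, and predicts the positive class when the probability exceeds 0.5. The weights, bias, and feature values below are invented for the example and are not taken from the trained `lr` model.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba_positive(features, weights, bias):
    """Probability of the positive class for one feature vector."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)


# Hypothetical weights for three terms: "great", "bad", "product".
weights = [2.0, -2.5, 0.1]
bias = 0.0

# Made-up TF-IDF-like feature values for two reviews.
positive_review = [0.8, 0.0, 0.6]
negative_review = [0.0, 0.8, 0.6]

for name, x in [("positive", positive_review), ("negative", negative_review)]:
    p = predict_proba_positive(x, weights, bias)
    label = 1 if p > 0.5 else 0
    print(name, round(p, 3), label)
```

In scikit-learn the learned values are exposed as `lr.coef_` and `lr.intercept_`, so the most influential stems could be inspected by pairing the coefficients with the vectorizer's vocabulary.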

In [9]:
print(confusion_matrix(y_test, y_pred))

[[864 155]
 [122 859]]
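
The confusion matrix can be unpacked into per-class metrics. The sketch below reuses the numbers printed above and assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, ordered 0 then 1. The variable names are illustrative.

```python
# Hold-out confusion matrix copied from the notebook output.
cm = [[864, 155],
      [122, 859]]

# Row 0 holds the truly negative reviews, row 1 the truly positive ones.
(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)    # of reviews predicted positive, share that are positive
recall = tp / (tp + fn)       # of truly positive reviews, share that were found
specificity = tn / (tn + fp)  # of truly negative reviews, share that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} specificity={specificity:.4f} f1={f1:.4f}")
```

The accuracy computed this way is 0.8615, matching the `accuracy_score` output. The model misses slightly fewer positive reviews (122) than negative ones (155), so recall on the positive class comes out a little higher than specificity.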


## Test data

In [10]:
# Convert test text to TF-IDF features using the vectorizer fitted on the training data
X_valid = tfidf_vectorizer.transform(test_1['Cleaned_review'])
y_valid = np.array(test_1['class'])

# Model Evaluation
y_valid_pred = lr.predict(X_valid)
accuracy_score(y_valid, y_valid_pred)

0.8448844884488449

In [11]:
print(confusion_matrix(y_valid, y_valid_pred))

[[4039  835]
 [ 716 4409]]


References:

*   Regex code: [Link](https://www.kaggle.com/code/mohitnirgulkar/disaster-tweets-classification-using-ml)
*   Text enhancement with ChatGPT


