# Business Problem
A company wants to understand customer feedback on their newly launched product. They have
collected hundreds of customer reviews from e-commerce platforms and social media but need a
way to automatically analyze the sentiment behind the reviews.

By performing sentiment analysis on these reviews, the company can:
* Identify customer satisfaction levels.
* Detect common pain points and areas for improvement.
* Monitor brand reputation over time.

# Data Collection
For this case study, we’ll use a public dataset of customer reviews. The dataset typically includes:
* Text of the review: Customer's written feedback.
* Sentiment Label: The sentiment assigned to the review (either Positive, Negative, or Neutral).

In [1]:


# Example dataset of reviews
reviews = ["I love this product, it works great!",
"Worst purchase I ever made. Totally disappointed.",
"The product is okay, nothing special",
"I love how it works. Definitely recommend it!",
"The quality is poor, and it broke after one use."
]
y = ["positive", "negative", "neutral", "positive", "negative"]

In [2]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer


In [3]:
# Download NLTK resources (you only need to do this once)
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

# Stop Words
Stop words are very common words that carry little meaning on their own and are usually filtered out before text processing. Examples of stop words in English include "a", "the", "is", "and", "but", "or", "in", "on", "at", "with", "he", "she", "it", "they", "was", "were", "be", and "being".
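
The filtering step can be sketched in plain Python. This is a minimal illustration only: `STOP_WORDS` and `remove_stop_words` are hypothetical names, and the stop list is just the examples from the paragraph above, whereas NLTK's English list is considerably longer.

```python
# Hypothetical miniature stop list, copied from the examples in the text.
# NLTK's real English list is considerably longer.
STOP_WORDS = {"a", "the", "is", "and", "but", "or", "in", "on", "at", "with",
              "he", "she", "it", "they", "was", "were", "be", "being"}

def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

filtered = remove_stop_words("The quality is poor and it broke")
print(filtered)  # ['quality', 'poor', 'broke']
```

Using a `set` keeps each membership check constant-time, which matters once the list grows to a few hundred words.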

# WordNet
WordNet is a large electronic dictionary that groups words into sets of synonyms, called synsets, and connects them with semantic and lexical relationships.

You can use WordNet to find synonyms, antonyms, and other semantic relationships between words, and to look up word definitions and usage examples. In this notebook it backs the `WordNetLemmatizer`.

In [4]:
# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()


In [5]:
# Function to preprocess text
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()

    # Remove punctuation and non-alphabetic characters
    text = re.sub(r'[^a-z\s]', '', text)

    # Tokenize and remove stopwords (build the set once for fast lookups)
    tokens = text.split()
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatization
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    # Join tokens back into a string
    return ' '.join(tokens)


In [6]:
sample_review = "I love this product, it works great!"
preprocessed_review = preprocess_text(sample_review)
print(preprocessed_review)


love product work great
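
The cleaning step in `preprocess_text` is worth looking at in isolation. The sketch below repeats only the lowercase, regex, and split steps (no NLTK involved); `clean_and_split` is a hypothetical helper name and the input sentence is made up for illustration. It shows that the pattern deletes digits and apostrophes outright, so "don't" becomes "dont", a token that a stop list keyed on "don't" may no longer match.

```python
import re

def clean_and_split(text):
    """Mirror the first steps of preprocess_text: lowercase, strip, split."""
    text = text.lower()
    # Delete every character that is not a lowercase letter or whitespace.
    text = re.sub(r'[^a-z\s]', '', text)
    return text.split()

tokens = clean_and_split("I don't like it, 2 stars!")
print(tokens)  # ['i', 'dont', 'like', 'it', 'stars']
```

If ratings such as "2 stars" or negations matter for the task, a gentler pattern may be worth considering.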


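Lemmatization reduces words to their dictionary form, or lemma, so that different inflections of the same word are counted as one feature.

Here's an example of lemmatization: "builds", "building", and "built" all share the lemma "build". Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a part-of-speech tag is passed (for example `pos='v'`), so with the call used above plural nouns such as "works" are reduced to "work", while verb forms such as "built" are left unchanged.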
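
To see why this helps, here is a toy sketch that uses a hand-written lookup table in place of WordNet. `LEMMAS` and `toy_lemmatize` are hypothetical and cover only the example words; the point is that three separate surface forms collapse into a single countable feature.

```python
from collections import Counter

# Hypothetical hand-written lookup table; WordNetLemmatizer consults WordNet instead.
LEMMAS = {"builds": "build", "building": "build", "built": "build"}

def toy_lemmatize(word):
    """Return the lemma if the word is in the table, else the word unchanged."""
    return LEMMAS.get(word, word)

tokens = ["she", "builds", "he", "built", "they", "are", "building"]
before = Counter(tokens)
after = Counter(toy_lemmatize(t) for t in tokens)
print(before["build"], after["build"])  # 0 3
```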


In [8]:
# Preprocess all reviews
preprocessed_reviews = [preprocess_text(review) for review in reviews]

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Transform reviews into TF-IDF features
vectorizer.fit(preprocessed_reviews)
X = vectorizer.transform(preprocessed_reviews)
# Show feature names
print("Feature Names: ", vectorizer.get_feature_names_out())
print("TF-IDF Matrix: \n", X.toarray())


Feature Names:  ['broke' 'definitely' 'disappointed' 'ever' 'great' 'love' 'made'
 'nothing' 'okay' 'one' 'poor' 'product' 'purchase' 'quality' 'recommend'
 'special' 'totally' 'use' 'work' 'worst']
TF-IDF Matrix: 
 [[0.         0.         0.         0.         0.5819515  0.4695148
  0.         0.         0.         0.         0.         0.4695148
  0.         0.         0.         0.         0.         0.
  0.4695148  0.        ]
 [0.         0.         0.40824829 0.40824829 0.         0.
  0.40824829 0.         0.         0.         0.         0.
  0.40824829 0.         0.         0.         0.40824829 0.
  0.         0.40824829]
 [0.         0.         0.         0.         0.         0.
  0.         0.52335825 0.52335825 0.         0.         0.42224214
  0.         0.         0.         0.52335825 0.         0.
  0.         0.        ]
 [0.         0.55032913 0.         0.         0.         0.44400208
  0.         0.         0.         0.         0.         0.
  0.         0.    

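TF-IDF (term frequency-inverse document frequency) is one way to convert sentences into numeric values. Each sentence becomes a vector of floats with one entry per vocabulary word; words that appear in few reviews receive higher weights than words that appear in many.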
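
The numbers in the matrix can be reproduced by hand. The sketch below is a plain-Python approximation, assuming scikit-learn's documented defaults for `TfidfVectorizer` (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2-normalised rows). The `docs` list is the five toy reviews as they look after `preprocess_text`; the function name `tfidf` is hypothetical.

```python
import math
from collections import Counter

# The five toy reviews after preprocessing (stop words removed, lemmatized).
docs = [
    "love product work great",
    "worst purchase ever made totally disappointed",
    "product okay nothing special",
    "love work definitely recommend",
    "quality poor broke one use",
]

def tfidf(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalised rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm if norm else 0.0 for v in raw])
    return vocab, rows

vocab, rows = tfidf(docs)
first = dict(zip(vocab, rows[0]))
print(round(first["great"], 4), round(first["love"], 4))  # 0.582 0.4695
```

In the first review, "great" occurs in one document and "love" in two, which is why "great" ends up with the larger weight (0.5819515 versus 0.4695148 in the matrix above).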

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


In [15]:

# Initialize the Logistic Regression model
model = LogisticRegression(max_iter=200)

# Train the model
model.fit(X, y)



In [16]:

# Make predictions on the training reviews of the sample dataset
y_pred = model.predict(X)
print(y_pred)
# Evaluate the model (scores measured on training data are optimistic)
accuracy = accuracy_score(y, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("Confusion Matrix:")
print(confusion_matrix(y, y_pred))
print("Classification Report:")
print(classification_report(y, y_pred))

['positive' 'negative' 'neutral' 'positive' 'negative']
Accuracy: 1.0000
Confusion Matrix:
[[2 0 0]
 [0 1 0]
 [0 0 2]]
Classification Report:
              precision    recall  f1-score   support

    negative       1.00      1.00      1.00         2
     neutral       1.00      1.00      1.00         1
    positive       1.00      1.00      1.00         2

    accuracy                           1.00         5
   macro avg       1.00      1.00      1.00         5
weighted avg       1.00      1.00      1.00         5



Next, we create a small test set to check how the model performs on reviews it has not seen.


In [22]:
# Create a sample test dataset
test_review = ["It is a great product", "I love this product, it works amazing", "I dislike this product", "I am disappointed"]
preprocessed_reviews = [preprocess_text(review) for review in test_review]
print(preprocessed_reviews)
# Transform reviews into TF-IDF features using the already fitted vectorizer
X_test = vectorizer.transform(preprocessed_reviews)
print(X_test.toarray())
y_pred = model.predict(X_test)
print(y_pred)

['great product', 'love product work amazing', 'dislike product', 'disappointed']
[[0.         0.         0.         0.         0.77828292 0.
  0.         0.         0.         0.         0.         0.62791376
  0.         0.         0.         0.         0.         0.
  0.         0.        ]
 [0.         0.         0.         0.         0.         0.57735027
  0.         0.         0.         0.         0.         0.57735027
  0.         0.         0.         0.         0.         0.
  0.57735027 0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         1.
  0.         0.         0.         0.         0.         0.
  0.         0.        ]
 [0.         0.         1.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.        ]]
['positive' 'positive' 'positive' 'negative']


The prediction is correct for "I am disappointed" but wrong for "I dislike this product", which the model labels as positive.

The reason is visible in the training data: "disappointed" appears in a negative training review, so the model learned to associate it with negative sentiment. "dislike" never appears in the training data, so the vectorizer ignores it and the model sees only the word "product".
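
A short sketch makes the vocabulary gap explicit. It uses the feature names printed earlier as the vocabulary; `known_and_unknown` is a hypothetical helper that separates the tokens the fitted vectorizer can encode from those it silently drops.

```python
# Vocabulary learned from the five training reviews (the feature names printed earlier).
vocabulary = ['broke', 'definitely', 'disappointed', 'ever', 'great', 'love', 'made',
              'nothing', 'okay', 'one', 'poor', 'product', 'purchase', 'quality',
              'recommend', 'special', 'totally', 'use', 'work', 'worst']

def known_and_unknown(preprocessed_text):
    """Split tokens into those the vectorizer can encode and those it drops."""
    tokens = preprocessed_text.split()
    known = [t for t in tokens if t in vocabulary]
    unknown = [t for t in tokens if t not in vocabulary]
    return known, unknown

for text in ['dislike product', 'disappointed', 'love product work amazing']:
    print(text, '->', known_and_unknown(text))
```

For "dislike product" the only surviving token is "product", which occurs in one positive and one neutral training review and never in a negative one, so the model has no negative evidence to work with.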

# Training the model on a public dataset

In [23]:
import pandas as pd
import numpy as np

In [24]:
from google.colab import drive
drive.mount("/content/drive/")

Mounted at /content/drive/


In [25]:
path='/content/drive/My Drive/Sentiment dataset/yelp_labelled.txt'

In [26]:
df = pd.read_csv(path, names=['sentence', 'label'], delimiter='\t',header=None)
df.head()

                                            sentence  label
0                           Wow... Loved this place.      1
1                                 Crust is not good.      0
2          Not tasty and the texture was just nasty.      0
3  Stopped by during the late May bank holiday of...      1
4  The selection on the menu was great and so wer...      1


# Splitting the dataset into training and testing sets


In [27]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['sentence'], df['label'], test_size=0.2, random_state=42)
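
The idea behind a hold-out split can be sketched without scikit-learn. `holdout_split` below is a hypothetical stand-in, not a reimplementation of `train_test_split`: it shuffles with Python's own random generator, so it will not select the same rows even with the same seed. The placeholder `rows` list assumes 1,000 reviews, which matches the 800 and 200 support counts in the reports that follow.

```python
import random

def holdout_split(items, test_size=0.2, seed=42):
    """Shuffle indices with a fixed seed and reserve the first test_size fraction for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(items) * test_size))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

rows = [f"review {i}" for i in range(1000)]  # placeholder data
train, test = holdout_split(rows)
print(len(train), len(test))  # 800 200
```

Fixing the seed makes the split repeatable, so accuracy figures can be compared across runs.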

In [29]:
# Preprocess all reviews
preprocessed_reviews = [preprocess_text(review) for review in X_train]

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit the vectorizer on the training reviews only, then transform them into TF-IDF features
vectorizer.fit(preprocessed_reviews)
X_train = vectorizer.transform(preprocessed_reviews)
# Show feature names
print("Feature Names: ", vectorizer.get_feature_names_out())
print("TF-IDF Matrix: \n", X_train.toarray())

Feature Names:  ['absolute' 'absolutely' 'absolutley' ... 'yum' 'yummy' 'zero']
TF-IDF Matrix: 
 [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [30]:
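# Train the model (reuses the LogisticRegression instance created earlier;
# fit() discards the coefficients learned from the toy reviews)
model.fit(X_train, y_train)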

In [34]:
# Make predictions on the train set
y_pred = model.predict(X_train)

# Evaluate the model
accuracy = accuracy_score(y_train, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("Confusion Matrix:")
print(confusion_matrix(y_train, y_pred))
print("Classification Report:")
print(classification_report(y_train, y_pred))

Accuracy: 0.9537
Confusion Matrix:
[[394  10]
 [ 27 369]]
Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.98      0.96       404
           1       0.97      0.93      0.95       396

    accuracy                           0.95       800
   macro avg       0.95      0.95      0.95       800
weighted avg       0.95      0.95      0.95       800



# Making predictions on the test dataset


In [35]:
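# Preprocess the test reviews and encode them with the vectorizer fitted on the training set
preprocessed_reviews = [preprocess_text(review) for review in X_test]
X_test = vectorizer.transform(preprocessed_reviews)
y_pred = model.predict(X_test)
# Evaluate the model on unseen data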
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 0.7750
Confusion Matrix:
[[79 17]
 [28 76]]
Classification Report:
              precision    recall  f1-score   support

           0       0.74      0.82      0.78        96
           1       0.82      0.73      0.77       104

    accuracy                           0.78       200
   macro avg       0.78      0.78      0.77       200
weighted avg       0.78      0.78      0.77       200
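
The figures in the report follow directly from the confusion matrix, where rows are true labels and columns are predicted labels. The sketch below recomputes them for the test set; the helper names `precision` and `recall` are hypothetical, and the matrix is copied from the output above.

```python
# Test-set confusion matrix from the output above:
# rows are true labels, columns are predicted labels.
cm = [[79, 17],
      [28, 76]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

def precision(cls):
    """Correct predictions of cls divided by all predictions of cls (column sum)."""
    return cm[cls][cls] / sum(row[cls] for row in cm)

def recall(cls):
    """Correct predictions of cls divided by all true members of cls (row sum)."""
    return cm[cls][cls] / sum(cm[cls])

print(f"accuracy: {accuracy:.4f}")
for cls in (0, 1):
    print(f"class {cls}: precision {precision(cls):.2f}, recall {recall(cls):.2f}")
```

Accuracy drops from 0.9537 on the training set to 0.7750 on the test set. A gap of that size suggests the model has partly memorised training vocabulary, much like the "dislike" example on the toy data.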

