# 🎓 Building a Sentiment Classifier with Logistic Regression

In this notebook, we will build a **Logistic Regression** classifier to predict the sentiment (positive or negative) of customer reviews for women's clothing from an e-commerce website. 

For this task, we will use the **LogisticRegression** class from the scikit-learn library.


In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [8]:
# --- Step 1: Load the Dataset ---
file_path = 'womens_clothing_ecommerce_reviews.csv'
df = pd.read_csv(file_path)
print("✅ Successfully loaded the dataset.")
print("Dataset preview:")
print(df.head())

✅ Successfully loaded the dataset.
Dataset preview:
                                         Review Text  sentiment
0  Absolutely wonderful - silky and sexy and comf...          1
1  Love this dress!  it's sooo pretty.  i happene...          1
2  I love, love, love this jumpsuit. it's fun, fl...          1
3  This shirt is very flattering to all due to th...          1
4  I love tracy reese dresses, but this one is no...         -1


In [9]:
# Get some basic information about the dataset
print("\nDataset Information:")
df.info()


Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19818 entries, 0 to 19817
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Review Text  19818 non-null  object
 1   sentiment    19818 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 309.8+ KB
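
`df.info()` confirms there are no missing values, but it does not show how the two sentiment labels are distributed. A minimal sketch of a class-balance check follows, using a made-up four-row DataFrame (an assumption for illustration, not the real dataset):

```python
import pandas as pd

# Hypothetical mini DataFrame with the same two columns as the dataset.
toy_df = pd.DataFrame({
    "Review Text": ["love it", "great fit", "nice fabric", "too small"],
    "sentiment": [1, 1, 1, -1],
})

# Share of each label; normalize=True returns fractions instead of counts.
toy_balance = toy_df["sentiment"].value_counts(normalize=True)
print(toy_balance)
```

Running the same `value_counts` call on the real `df['sentiment']` column would show whether one class dominates, which matters for both the split and the evaluation.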


In [10]:
# --- Step 2: Split Data into Training and Testing Sets ---
# It's crucial to test our model on data it has never seen before.
# We'll use 80% of the data for training and 20% for testing.
X = df['Review Text']
y = df['sentiment']

# stratify=y keeps the proportion of positive and negative reviews the same in the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"\nData split into {len(X_train)} training samples and {len(X_test)} testing samples.")


Data split into 15854 training samples and 3964 testing samples.
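
To see what `stratify=y` does, here is a minimal sketch on a made-up, imbalanced label vector (the toy arrays are assumptions for illustration). It compares the share of positive labels overall with the share in each split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 80 positive (1) and 20 negative (-1) examples.
toy_y = np.array([1] * 80 + [-1] * 20)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)  # placeholder features

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# With stratification, each split keeps the original 80/20 class ratio.
overall_share = (toy_y == 1).mean()
train_share = (toy_y_train == 1).mean()
test_share = (toy_y_test == 1).mean()
print(f"positive share: overall={overall_share:.2f}, "
      f"train={train_share:.2f}, test={test_share:.2f}")
```

Without `stratify`, the shares in the two splits would generally drift away from the overall share, especially for a small minority class.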


In [11]:
# --- Step 3: Feature Engineering with Bag-of-Words ---
# Here, we convert the text reviews into numerical feature vectors.
# Each feature is a count of how many times a word appears in a review.
print("\nConverting text to numerical features using Bag-of-Words...")

# Initialize the vectorizer. `stop_words='english'` removes common
# English words like 'the', 'a', 'is', which don't carry much sentiment.
vectorizer = CountVectorizer(stop_words='english')

# Fit the vectorizer on the TRAINING data and transform it into a matrix
X_train_bow = vectorizer.fit_transform(X_train)

# ONLY transform the TESTING data using the already-fitted vectorizer
X_test_bow = vectorizer.transform(X_test)

print("✅ Text successfully converted to feature vectors.")


Converting text to numerical features using Bag-of-Words...
✅ Text successfully converted to feature vectors.


In [12]:
# --- Step 4: Train the Linear Classifier ---
# We'll use Logistic Regression, a reliable linear model for classification.
print("\nTraining the Logistic Regression model...")

# Initialize the model.
# max_iter is raised above the default of 100 so the solver has enough iterations to converge.
model = LogisticRegression(max_iter=2000)

# Train the model on our Bag-of-Words training data
model.fit(X_train_bow, y_train)

print("✅ Model training complete!")


Training the Logistic Regression model...
✅ Model training complete!


In [13]:
# --- Step 5: Evaluate the Model's Performance ---
# Let's see how accurately our model predicts sentiment on the unseen test data.
print("\nEvaluating model performance on the test set...")

# Make predictions on the test data
y_pred = model.predict(X_test_bow)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"📈 Model Accuracy: {accuracy:.4f} ({accuracy:.2%})")


Evaluating model performance on the test set...
📈 Model Accuracy: 0.9294 (92.94%)
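
Accuracy alone can look good even when one class is handled poorly, which is a risk if the labels are imbalanced. The sketch below uses made-up labels and a deliberately lazy predictor (both assumptions for illustration) to show how a confusion matrix and per-class recall expose this.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 8 positive (1) and 2 negative (-1) reviews.
toy_true = [1, 1, 1, 1, 1, 1, 1, 1, -1, -1]
# A lazy "model" that always predicts positive.
toy_pred = [1] * 10

toy_accuracy = accuracy_score(toy_true, toy_pred)
# Recall on the negative class: share of truly negative reviews that were caught.
toy_negative_recall = recall_score(toy_true, toy_pred, pos_label=-1)
# Rows are true classes, columns are predicted classes, in the order [-1, 1].
toy_cm = confusion_matrix(toy_true, toy_pred, labels=[-1, 1])

print(f"accuracy={toy_accuracy:.2f}, negative recall={toy_negative_recall:.2f}")
print(toy_cm)
```

Here the accuracy is 0.80 although no negative review is detected at all. Calling the same functions on `y_test` and `y_pred` would show how the real model does on each class.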


In [14]:
# --- Step 6: Predict Sentiment on New Reviews ---
# This is the fun part! Let's use our trained model on brand new text.
print("\n--- Making Predictions on New Reviews ---")

new_reviews = [
    "This dress is absolutely beautiful and fits perfectly!",
    "The material felt cheap and it was not what I expected.",
    "It's an okay product, not great but not terrible either.",
    "I am so disappointed with this purchase, I will be returning it."
]

# 1. Transform the new reviews into the Bag-of-Words format
new_reviews_bow = vectorizer.transform(new_reviews)

# 2. Predict using our trained model
new_predictions = model.predict(new_reviews_bow)


for review, prediction in zip(new_reviews, new_predictions):
    print(f"Review: {review}")
    print(f"Predicted Sentiment: {prediction}\n")


--- Making Predictions on New Reviews ---
Review: This dress is absolutely beautiful and fits perfectly!
Predicted Sentiment: 1

Review: The material felt cheap and it was not what I expected.
Predicted Sentiment: -1

Review: It's an okay product, not great but not terrible either.
Predicted Sentiment: -1

Review: I am so disappointed with this purchase, I will be returning it.
Predicted Sentiment: -1
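
`predict` returns only a hard label, so a mixed review such as the third one above still gets forced into one class. A minimal sketch of using `predict_proba` to see how confident the model is follows, on a made-up toy model (all `toy_*` names and data are assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training set (1 = positive, -1 = negative).
toy_texts = ["great dress", "great fit", "terrible dress", "terrible fit"]
toy_labels = [1, 1, -1, -1]

toy_vec = CountVectorizer()
toy_clf = LogisticRegression(max_iter=2000)
toy_clf.fit(toy_vec.fit_transform(toy_texts), toy_labels)

# Columns of predict_proba follow the order of toy_clf.classes_, here [-1, 1].
toy_new = ["great dress", "great terrible dress"]
toy_proba = toy_clf.predict_proba(toy_vec.transform(toy_new))
for text, (p_neg, p_pos) in zip(toy_new, toy_proba):
    print(f"{text!r}: P(negative)={p_neg:.2f}, P(positive)={p_pos:.2f}")
```

A probability near 0.5 signals a review the model is unsure about, which a hard label alone would hide.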



In [15]:
import joblib

# Save the trained model to disk. Note that the fitted vectorizer is not saved here.
joblib.dump(model, 'logistic_regression_model.joblib')

['logistic_regression_model.joblib']
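
The saved file contains only the classifier, while new text must still pass through the same fitted vectorizer. One possible alternative (not what this notebook does) is to bundle both steps in a scikit-learn `Pipeline` and save that. The sketch below uses toy data and an arbitrary file name, both assumptions for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the review dataset.
toy_texts = ["great dress", "great fit", "terrible dress", "terrible fit"]
toy_labels = [1, 1, -1, -1]

# A pipeline keeps the fitted vectorizer and the classifier together.
toy_pipeline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=2000))
toy_pipeline.fit(toy_texts, toy_labels)

# Save to a temporary directory; the file name is an arbitrary choice.
toy_path = os.path.join(tempfile.mkdtemp(), "toy_sentiment_pipeline.joblib")
joblib.dump(toy_pipeline, toy_path)

# After reloading, raw strings can be passed straight to predict().
toy_restored = joblib.load(toy_path)
toy_restored_pred = toy_restored.predict(["great fit", "terrible dress"])
print(toy_restored_pred)
```

Whether a given serving environment accepts such a pipeline with raw strings as input depends on that environment, so this is only a sketch of the local save-and-load round trip.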

In [1]:
!pip3 install google-cloud-aiplatform






In [16]:
from google.cloud import aiplatform

In [17]:
aiplatform.init(project="sincere-loader-470605-j1", location="us-central1")
# Attach to an existing, already-deployed endpoint by its ID (this does not create a new one).
endpoint = aiplatform.Endpoint(
    endpoint_name='496349235592036352'
)
print("Connected to endpoint " + endpoint.resource_name)

Connected to endpoint projects/370195256686/locations/us-central1/endpoints/496349235592036352


In [18]:
new_review = ["This dress is absolutely ugly and  very bad!"]
print("New review to classify: " + new_review[0])

New review to classify: This dress is absolutely ugly and  very bad!


In [19]:
print("\nStep-01: Converting the review to a sparse matrix ..")
sparse_matrix = vectorizer.transform(new_review)

print("\nStep-02: Converting to a dense numpy array ..")
numpy_array = sparse_matrix.toarray()

print("\nStep-03: Converting the numpy array to a standard Python list ..")
processed_review = numpy_array.tolist()

print("Final ready for prediction :", processed_review)

print("\nSending prediction request to the Vertex AI Endpoint ..")
# Make the prediction call
response = endpoint.predict(instances=processed_review)

print("\nPrediction response:", response)
print("\nType of the response: ", type(response))

# The prediction is inside the `predictions` attribute of the response object
prediction_result = response.predictions[0]

# The model was trained on labels 1 (positive) and -1 (negative)
sentiment = "Positive" if prediction_result == 1 else "Negative"
print(f"Prediction Result :  '{sentiment}'")



Step-01: Converting the review to a sparse matrix ..

Step-02: Converting to a dense numpy array ..

Step-03: Converting the numpy array to a standard Python list ..
Final ready for prediction : [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
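
The three conversion steps and the label mapping can be wrapped in two small helpers. This is a minimal sketch: `to_instances` and `label_to_text` are hypothetical names, the toy vectorizer stands in for the one fitted earlier, and no request is sent to any endpoint.

```python
from sklearn.feature_extraction.text import CountVectorizer


def to_instances(fitted_vectorizer, texts):
    """Turn raw texts into a JSON-serializable list of dense count vectors."""
    return fitted_vectorizer.transform(texts).toarray().tolist()


def label_to_text(prediction):
    """Map the dataset's labels (1 / -1) to readable names."""
    return "Positive" if prediction == 1 else "Negative"


# Hypothetical tiny vectorizer standing in for the one fitted earlier.
toy_vec = CountVectorizer().fit(["great dress", "terrible fit"])
toy_instances = to_instances(toy_vec, ["terrible dress", "great great fit"])
print(toy_instances)
print(label_to_text(1), label_to_text(-1))
```

With the real objects, the list returned by `to_instances(vectorizer, new_review)` has the same shape as `processed_review`, so it could be passed as `instances` in the same way.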

In [None]:
# Sample clothing-shop reviews to classify in a single request
cloth_shop_reviews = [
    "I loved the variety of outfits available here! The fabrics feel premium, and the staff was really helpful in suggesting the right sizes. Definitely coming back for the new arrivals.",
    "Trendy designs at reasonable prices. I bought a kurta set and a pair of jeans — both fit perfectly. The store could add more trial rooms though, it gets crowded on weekends.",
    "The collection is fresh and stylish. I appreciate that they keep both casual and formal options. Online ordering was smooth, and delivery was on time.",
    "Quality of clothes is really good, especially compared to other shops in this price range. However, I feel they should expand the men’s collection a bit more.",
    "Excellent customer service! The staff helped me mix and match to create a full outfit. The store ambiance is nice and welcoming. Highly recommend!",
    "this is shit, so bad",
    "this is the baddest shop ever"
]

print("\nStep-01: Converting the reviews to a sparse matrix ..")
sparse_matrix = vectorizer.transform(cloth_shop_reviews)

print("\nStep-02: Converting to a dense numpy array ..")
numpy_array = sparse_matrix.toarray()

print("\nStep-03: Converting the numpy array to a standard Python list ..")
processed_reviews = numpy_array.tolist()

print("Final ready for prediction :", processed_reviews)

print("\nSending prediction request to the Vertex AI Endpoint ..")
# Make the prediction call: one online request carrying all reviews
response = endpoint.predict(instances=processed_reviews)

print("\nPrediction response:", response)
print("\nType of the response: ", type(response))

# The predictions will be a list (one result per review)
predictions = response.predictions

print("\n--- Final Results ---")
for idx, prediction in enumerate(predictions, start=1):
    sentiment = "Positive" if prediction == 1 else "Negative"
    print(f"REVIEW {idx:02d} : {sentiment}")



Step-01: Converting the reviews to a sparse matrix ..

Step-02: Converting to a dense numpy array ..

Step-03: Converting the numpy array to a standard Python list ..
Final ready for prediction : [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...