<a href="https://colab.research.google.com/github/Faezeh-Maleki/sentiment-analysis/blob/main/Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# google.colab is preinstalled in the Colab runtime, so no pip install is needed
from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

'''
Goal: build a sentiment classifier for the amazon_baby dataset. Sentiment analysis is treated here as a text classification task, with labels derived from the star rating:

- Positive: the user gave a rating of 4 or 5 stars.
- Negative: the user gave a rating of 1 or 2 stars.
- Neutral: the user gave a rating of 3 stars.

Using these labels, classify the reviews with one of the standard algorithms, such as Logistic Regression, k-Nearest Neighbors (kNN), or an Artificial Neural Network (ANN). After classification, analyze the features of positive and negative reviews.
'''


Mounted at /content/drive


In [None]:
# Load data
file_path = '/content/drive/MyDrive/amazon_baby.csv'

# Load the dataset into a DataFrame
df = pd.read_csv(file_path)

# Display a few samples of the data
print(df.head())

# Remove rows with missing values in the 'review' column
df.dropna(subset=['review'], inplace=True)

# Convert reviews to lowercase and remove non-alphabetic characters
df['review_clean'] = df['review'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', x.lower()))

# Map a star rating to a sentiment label
def classify_sentiment(rating):
    if rating in [4, 5]:
        return 'Positive'
    elif rating in [1, 2]:
        return 'Negative'
    elif rating == 3:
        return 'Neutral'
    # Any other value (e.g. a missing rating) falls through and yields None

# Apply the classify_sentiment function to the 'rating' column to create a new 'Sentiment' column
df['Sentiment'] = df['rating'].apply(classify_sentiment)

# Initialize the Count Vectorizer to convert text data into numerical features
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['review_clean'])  # Transform text data to numerical features

# Split features and labels into training (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, df['Sentiment'], test_size=0.3, random_state=42)

# Train a Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)  # Train the model on the training data
y_pred = model.predict(X_test)  # Predict sentiments on the test data

# Print classification report and accuracy score for the Logistic Regression model
print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred))
print("Logistic Regression Accuracy Score:", accuracy_score(y_test, y_pred))

# Additional analysis can be done here, such as exploring feature importance or visualizing results

                                                name  \
0                           Planetwise Flannel Wipes   
1                              Planetwise Wipe Pouch   
2                Annas Dream Full Quilt with 2 Shams   
3  Stop Pacifier Sucking without tears with Thumb...   
4  Stop Pacifier Sucking without tears with Thumb...   

                                              review  rating  
0  These flannel wipes are OK, but in my opinion ...       3  
1  it came early and was not disappointed. i love...       5  
2  Very soft and comfortable and warmer than it l...       5  
3  This is a product well worth the purchase.  I ...       5  
4  All of my kids have cried non-stop when I trie...       5  


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
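
The warning above means the default lbfgs solver stopped at its iteration cap (max_iter, which defaults to 100) before it converged, so the scores that follow come from a model that was not fully fitted. Below is a minimal sketch of one common remedy, raising max_iter. It runs on a small made-up corpus, not on amazon_baby; the names toy_reviews, toy_labels, toy_vectorizer and toy_model are hypothetical, and the value 1000 is an assumption that may need tuning on the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus standing in for df['review_clean'] and df['Sentiment']
toy_reviews = [
    "love this product great quality",
    "great value love it",
    "excellent and great",
    "terrible product broke quickly",
    "awful quality terrible waste",
    "broke after one day awful",
    "okay product nothing special",
    "average quality okay value",
    "okay nothing special average",
]
toy_labels = ["Positive"] * 3 + ["Negative"] * 3 + ["Neutral"] * 3

toy_vectorizer = CountVectorizer()
X_toy = toy_vectorizer.fit_transform(toy_reviews)

# Give the solver more iterations than the default of 100
toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(X_toy, toy_labels)

# n_iter_ holds the iterations actually used; staying below max_iter
# indicates the solver stopped because it converged, not because it hit the cap
converged = bool((toy_model.n_iter_ < toy_model.max_iter).all())
print("Iterations used:", toy_model.n_iter_, "| converged:", converged)
```

On the full dataset, the same check on n_iter_ shows whether the chosen max_iter was enough. The scikit-learn pages linked in the warning also describe alternative solvers and feature scaling as other options.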


Logistic Regression Classification Report:
              precision    recall  f1-score   support

    Negative       0.72      0.66      0.69      7937
     Neutral       0.41      0.20      0.26      4895
    Positive       0.89      0.96      0.92     41979

    accuracy                           0.85     54811
   macro avg       0.67      0.61      0.63     54811
weighted avg       0.82      0.85      0.83     54811

Logistic Regression Accuracy Score: 0.8482421411760414
