# Sentiment Analysis Model for Classifying Customer Reviews as Positive or Negative
This notebook implements a sentiment analysis model using Logistic Regression on the IMDB reviews dataset.

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
import joblib

## Step 1: Load the Dataset

In [None]:
# Load the dataset into a pandas DataFrame
data = pd.read_csv('/content/IMDB.csv')  # adjust the path to your copy of the CSV

# Convert sentiment to binary labels (1 for positive, 0 for negative)
data['label'] = (data['sentiment'] == 'positive').astype(int)

# Display the first few rows of the dataset
data.head()

| | review | sentiment | label |
|---|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive | 1 |
| 1 | A wonderful little production. <br /><br />The... | positive | 1 |
| 2 | I thought this was a wonderful way to spend ti... | positive | 1 |
| 3 | Basically there's a family where a little boy ... | negative | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive | 1 |


## Step 2: Data Preprocessing

In [None]:
# Small hand-picked stop-word list, built once instead of on every call
STOP_WORDS = {'the', 'and', 'is', 'in', 'it', 'to', 'of', 'for', 'on', 'with'}

def preprocess_text(text):
    """Preprocess the text by:
    1. Converting it to lowercase.
    2. Keeping only alphabetic characters and spaces (drops punctuation, digits, and symbols).
    3. Removing a small set of common stop words.

    Note: HTML tags are not parsed, so '<br />' survives as the token 'br'.
    """
    text = text.lower()
    text = ''.join(char for char in text if char.isalpha() or char == ' ')
    return ' '.join(word for word in text.split() if word not in STOP_WORDS)

# Apply preprocessing to the 'review' column
data['cleaned_text'] = data['review'].apply(preprocess_text)

# Display first few preprocessed reviews
data[['review', 'cleaned_text']].head()

| | review | cleaned_text |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | one other reviewers has mentioned that after w... |
| 1 | A wonderful little production. <br /><br />The... | a wonderful little production br br filming te... |
| 2 | I thought this was a wonderful way to spend ti... | i thought this was a wonderful way spend time ... |
| 3 | Basically there's a family where a little boy ... | basically theres a family where a little boy j... |
| 4 | Petter Mattei's "Love in the Time of Money" is... | petter matteis love time money a visually stun... |
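
The cleaned text for review 1 still contains `br` tokens, because the character filter removes the angle brackets of `<br />` but keeps the letters. One possible fix, sketched here under the assumption that the reviews only contain simple HTML tags, is to strip tags before the rest of the preprocessing runs. The helper name `strip_html_tags` and the sample string are hypothetical and not part of the original notebook.

```python
import re

# Hypothetical helper: matches anything that looks like an HTML tag, e.g. <br />
TAG_PATTERN = re.compile(r"<[^>]+>")

def strip_html_tags(text):
    """Replace HTML-like tags with a space so neighbouring words stay separated."""
    return TAG_PATTERN.sub(" ", text)

# Made-up sample in the style of the IMDB reviews
sample = "Great movie.<br /><br />Loved it"
without_tags = " ".join(strip_html_tags(sample).split())  # collapse repeated spaces
print(without_tags)
```

Calling `strip_html_tags` at the start of `preprocess_text` would keep `br` out of the TF-IDF vocabulary. The metrics reported later in this notebook were produced without this step.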


## Step 3: Feature Extraction

In [None]:
# Initialize the TF-IDF vectorizer
tfidf = TfidfVectorizer(max_features=5000)

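# Fit and transform the cleaned text. The result stays a SciPy sparse matrix,
# which LogisticRegression accepts directly and which avoids a large dense array.
X = tfidf.fit_transform(data['cleaned_text'])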

# Extract labels
y = data['label']
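
To make the effect of `max_features` more concrete, the following toy sketch fits a vectorizer on three made-up reviews. It is only an illustration: the corpus and the `toy_` names are assumptions, and `max_features=2` is chosen so the cut-off is visible on such a small vocabulary.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "great" appears 3 times, "film" twice, the other words once
toy_reviews = ["great film great cast", "dull film", "great fun"]

# max_features keeps only the terms with the highest corpus-wide frequency
toy_tfidf = TfidfVectorizer(max_features=2)
toy_matrix = toy_tfidf.fit_transform(toy_reviews)

toy_vocab = sorted(toy_tfidf.vocabulary_)
print(toy_matrix.shape)   # one row per review, one column per kept term
print(toy_vocab)
print(issparse(toy_matrix))
```

The same mechanism applies in the cell above, where only the 5000 most frequent terms become columns of `X`.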

## Step 4: Train-Test Split

In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
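
Note that the vectorizer in Step 3 is fit before this split, so it sees the test reviews. A common alternative, shown here as a sketch on made-up data rather than as the method used in this notebook, is to split the raw text first and let a scikit-learn pipeline fit TF-IDF on the training portion only. The toy texts and the `toy_` names are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in for data['cleaned_text'] and data['label'] (illustrative only)
toy_texts = [
    "loved this film", "great acting and story", "wonderful and moving", "real delight",
    "boring and slow", "terrible plot", "awful waste of time", "dull and lifeless",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first; stratify keeps the class ratio the same in both splits
train_texts, test_texts, train_labels, test_labels = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The pipeline fits TF-IDF on the training texts only, then trains the classifier
toy_pipeline = make_pipeline(TfidfVectorizer(max_features=5000), LogisticRegression())
toy_pipeline.fit(train_texts, train_labels)
toy_predictions = toy_pipeline.predict(test_texts)
print(len(train_texts), len(test_texts))
```

A side benefit of this design is that the whole pipeline can be saved as a single object, so the vectorizer and the model cannot get out of sync.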

## Step 5: Model Training

In [None]:
# Initialize and train Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

## Step 6: Model Evaluation

In [None]:
# Predict on the test set
y_pred = model.predict(X_test)

# Compute evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print("Model Evaluation Metrics:")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")

# Display confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

Model Evaluation Metrics:
Accuracy: 0.89
Precision: 0.88
Recall: 0.90
Confusion Matrix:
[[4365  596]
 [ 483 4556]]
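
As a cross-check, the three metrics can be recomputed by hand from the confusion matrix printed above. scikit-learn lays the matrix out with true labels as rows and predicted labels as columns, so for labels 0 and 1 it reads `[[TN, FP], [FN, TP]]`. The variable names below are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above
conf = np.array([[4365, 596],
                 [483, 4556]])
tn, fp, fn, tp = conf.ravel()

manual_accuracy = (tp + tn) / conf.sum()   # share of all reviews classified correctly
manual_precision = tp / (tp + fp)          # share of predicted positives that are positive
manual_recall = tp / (tp + fn)             # share of actual positives that were found

print(f"Accuracy: {manual_accuracy:.2f}")
print(f"Precision: {manual_precision:.2f}")
print(f"Recall: {manual_recall:.2f}")
```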


## Step 7: Save the Model

In [None]:
# Save trained model and TF-IDF vectorizer
joblib.dump(model, 'sentiment_model.pkl')
joblib.dump(tfidf, 'tfidf_vectorizer.pkl')

print("Model and vectorizer saved successfully.")

Model and vectorizer saved successfully.
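
The saved files are only useful together: new text has to go through the same fitted vectorizer before the model can score it. The sketch below shows that round trip. So that it runs on its own, it trains a tiny stand-in model on made-up reviews and writes to a temporary directory, which are assumptions for illustration. With the real artifacts you would load `sentiment_model.pkl` and `tfidf_vectorizer.pkl` instead and apply `preprocess_text` to new reviews first.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model so the sketch runs without the real .pkl files
toy_texts = ["loved this film", "wonderful story", "terrible plot", "boring and slow"]
toy_labels = [1, 1, 0, 0]
toy_tfidf = TfidfVectorizer()
toy_model = LogisticRegression().fit(toy_tfidf.fit_transform(toy_texts), toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "sentiment_model.pkl")
    vectorizer_path = os.path.join(tmp_dir, "tfidf_vectorizer.pkl")
    joblib.dump(toy_model, model_path)
    joblib.dump(toy_tfidf, vectorizer_path)

    # Reload both artifacts, as a separate scoring script would
    loaded_model = joblib.load(model_path)
    loaded_tfidf = joblib.load(vectorizer_path)

# New text must pass through the same vectorizer before prediction
new_reviews = ["wonderful film", "boring plot"]
loaded_predictions = loaded_model.predict(loaded_tfidf.transform(new_reviews))
original_predictions = toy_model.predict(toy_tfidf.transform(new_reviews))
print(loaded_predictions)
```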


## Step 8: Insights and Challenges

Insights:
- The Logistic Regression model reached an accuracy of 0.89 on the 10,000-review test set, which is a good starting point for sentiment analysis.
- For the positive class, precision was 0.88 and recall was 0.90. The confusion matrix shows 596 negative reviews predicted as positive and 483 positive reviews predicted as negative.
- TF-IDF features capped at 5,000 terms were enough for a linear model to reach this accuracy.

Challenges:
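- The test split is nearly balanced (4,961 negative and 5,039 positive reviews), so class imbalance was not a concern here. On imbalanced data, techniques like oversampling or class weighting could be used.
- Selecting the number of TF-IDF features is a trade-off: too many features can lead to overfitting, while too few can result in underfitting. This notebook fixes `max_features=5000` without comparing alternatives.
- The vectorizer is fit on the full dataset before the train-test split, so its vocabulary and IDF weights are influenced by test reviews. Fitting it on the training split only would give a cleaner estimate.
- No hyperparameters were tuned. The model uses scikit-learn's defaults, so tuning could further improve its performance.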
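
The feature-count and tuning points above could be explored with a cross-validated grid search. The following is a minimal sketch on made-up data, not a tuned configuration: the toy texts, the grid values, `cv=2`, and the use of `class_weight="balanced"` are all assumptions chosen to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in for the cleaned reviews and labels (illustrative only)
toy_texts = [
    "loved this film", "great acting and story", "wonderful and moving", "real delight",
    "boring and slow", "terrible plot", "awful waste of time", "dull and lifeless",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

search_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # class_weight="balanced" reweights classes; it matters only on imbalanced data
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Step names prefix the parameter names: <step>__<parameter>
param_grid = {
    "tfidf__max_features": [5, 20],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(search_pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(toy_texts, toy_labels)
print(search.best_params_)
```

On the full dataset, a wider grid (for example several `max_features` values around 5000) and more folds would be more informative, at the cost of longer run time.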

Future Improvements:
- Experiment with other algorithms like Naive Bayes, SVM, or even deep learning models like BERT.
- Perform more extensive hyperparameter tuning to optimize the model.
- Use word embeddings (e.g., Word2Vec, GloVe) for feature extraction to capture semantic meaning.
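
Trying the first of these ideas requires little code, because scikit-learn classifiers share the same `fit`/`predict` interface. The sketch below swaps in Multinomial Naive Bayes and a linear SVM on made-up data. The toy texts and the names `candidates` and `alt_predictions` are illustrative assumptions, and no claim is made about which model would perform best on the IMDB reviews.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the cleaned reviews and labels (illustrative only)
toy_texts = [
    "loved this film", "great acting and story", "wonderful and moving", "real delight",
    "boring and slow", "terrible plot", "awful waste of time", "dull and lifeless",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Each candidate reuses the same TF-IDF step with a different classifier
candidates = {
    "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "linear_svm": make_pipeline(TfidfVectorizer(), LinearSVC()),
}

alt_predictions = {}
for name, candidate in candidates.items():
    candidate.fit(toy_texts, toy_labels)
    alt_predictions[name] = candidate.predict(["wonderful film", "boring plot"])
    print(name, alt_predictions[name])
```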