# Sentiment Analysis on Customer Reviews

This notebook performs sentiment analysis on customer reviews using TF-IDF vectorization and Logistic Regression. We will preprocess the text data, train a model, and evaluate its performance.

## 1. Import Libraries

First, we import the necessary libraries for data processing, vectorization, modeling, and evaluation.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
import warnings
warnings.filterwarnings('ignore')

# Download required NLTK data
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer tables that word_tokenize needs in recent NLTK releases
nltk.download('stopwords')
nltk.download('wordnet')

## 2. Create Sample Dataset

For demonstration, we'll create a sample dataset of customer reviews with binary sentiments (positive=1, negative=0). In a real scenario, you would load your dataset here.

In [None]:
# Sample dataset
data = {
    'review': [
        'The product is amazing and works perfectly!',
        'Terrible experience, the item broke after one use.',
        'I love this product, highly recommend it!',
        'Very disappointing, poor quality and bad service.',
        'Fantastic purchase, exceeded my expectations!',
        'Not worth the money, stopped working quickly.',
        'Great customer service and fast delivery!',
        'Horrible product, complete waste of money.',
        'Really happy with my purchase, will buy again.',
        'The worst experience I’ve ever had with a product.'
    ],
    'sentiment': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
}

df = pd.DataFrame(data)
df.head()
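
For a real dataset, the same DataFrame could be loaded from a file instead. The sketch below is a minimal example, assuming a CSV with `review` and `sentiment` columns like the sample above. The inline CSV text and its three rows are hypothetical placeholders standing in for an actual file path.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file of exported reviews
csv_text = """review,sentiment
"Great value, arrived early.",1
"Stopped working after a week.",0
"Exactly as described.",1
"""

# pd.read_csv accepts a path or a file-like object; StringIO keeps this cell self-contained
reviews_df = pd.read_csv(io.StringIO(csv_text))

# Drop rows with a missing review or label, and make sure labels are integers
reviews_df = reviews_df.dropna(subset=['review', 'sentiment'])
reviews_df['sentiment'] = reviews_df['sentiment'].astype(int)

# Check the class balance before training
class_counts = reviews_df['sentiment'].value_counts().to_dict()
print(class_counts)
```

Checking the class counts early helps decide whether a stratified split or class weighting is worth considering.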

## 3. Text Preprocessing

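We preprocess each review by lowercasing it, removing non-letter characters, tokenizing, removing stopwords, and lemmatizing. Note that NLTK's English stopword list includes negations such as "not" and "no", so this step also strips words that can flip a review's sentiment.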

In [None]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords and lemmatize
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return ' '.join(tokens)

# Apply preprocessing
df['cleaned_review'] = df['review'].apply(preprocess_text)
df[['review', 'cleaned_review']].head()
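
Because stopword removal can discard negations, it may help to see the effect on a single review. The sketch below is illustrative only: `STOP_WORDS` is a small hand-written list (an assumption, not NLTK's list), `filter_tokens` is a hypothetical helper, and tokenization is a plain whitespace split so the cell runs without any downloads.

```python
import re

# Small illustrative stopword list (an assumption; the notebook itself uses NLTK's list)
STOP_WORDS = {'the', 'is', 'a', 'it', 'this', 'and', 'not', 'no', 'never', 'i'}
# Words that can flip sentiment and may be worth keeping
NEGATIONS = {'not', 'no', 'never'}

def filter_tokens(text, keep_negations=True):
    """Lowercase, keep letters only, split on whitespace, and drop stopwords."""
    text = re.sub(r'[^a-z\s]', '', text.lower())
    stop = STOP_WORDS - NEGATIONS if keep_negations else STOP_WORDS
    return [tok for tok in text.split() if tok not in stop]

print(filter_tokens('Not worth the money', keep_negations=False))  # ['worth', 'money']
print(filter_tokens('Not worth the money', keep_negations=True))   # ['not', 'worth', 'money']
```

With negations removed, "Not worth the money" reduces to the same tokens as "worth the money", which is why keeping them is often a reasonable choice for sentiment tasks.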

## 4. TF-IDF Vectorization

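We split the data first, then fit the TF-IDF vectorizer on the training reviews only, so that the vocabulary and IDF weights are not learned from the test set.

In [None]:
# Split the cleaned reviews before vectorizing; stratify keeps both classes in each split
reviews_train, reviews_test, y_train, y_test = train_test_split(
    df['cleaned_review'], df['sentiment'],
    test_size=0.2, random_state=42, stratify=df['sentiment']
)

# Initialize TF-IDF Vectorizer
tfidf = TfidfVectorizer(max_features=1000)

# Fit on the training reviews only, then apply the same transform to the test reviews
X_train = tfidf.fit_transform(reviews_train)
X_test = tfidf.transform(reviews_test)

print(f'Training set shape: {X_train.shape}')
print(f'Testing set shape: {X_test.shape}')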
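
To make the TF-IDF weights less of a black box, the sketch below computes them by hand in plain Python. It follows what we understand to be scikit-learn's documented defaults (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization). It differs in tokenization: it splits on whitespace, an assumption made for brevity. The function name `tfidf_vectors` and the three toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute L2-normalized TF-IDF vectors for whitespace-tokenized documents."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        # L2-normalize so every non-empty document has unit length
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vocab, vectors

docs = ['great product', 'bad product', 'great service']
vocab, vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(doc, [round(w, 3) for w in vec])
```

In the second document, "bad" appears in only one document while "product" appears in two, so "bad" receives the larger weight.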

## 5. Train Logistic Regression Model

We train a Logistic Regression model on the TF-IDF features.

In [None]:
# Initialize and train the model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)
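
One way to keep the vectorizer from ever seeing held-out text is to wrap it and the classifier in a scikit-learn `Pipeline`. With only ten reviews, cross-validation also gives a somewhat less noisy estimate than a single small test set. The sketch below is one possible arrangement, not the notebook's method: it passes the raw reviews straight to `TfidfVectorizer` (skipping the NLTK preprocessing so that it runs without downloads), and the choice of five stratified folds is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# The ten sample reviews and labels from section 2, repeated so this cell runs on its own
reviews = [
    'The product is amazing and works perfectly!',
    'Terrible experience, the item broke after one use.',
    'I love this product, highly recommend it!',
    'Very disappointing, poor quality and bad service.',
    'Fantastic purchase, exceeded my expectations!',
    'Not worth the money, stopped working quickly.',
    'Great customer service and fast delivery!',
    'Horrible product, complete waste of money.',
    'Really happy with my purchase, will buy again.',
    'The worst experience I’ve ever had with a product.',
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# The vectorizer is refit inside each fold, so held-out reviews never influence it
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(random_state=42))

# Stratified folds keep one positive and one negative review in each held-out pair
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, reviews, labels, cv=cv, scoring='accuracy')

print('Fold accuracies:', scores)
print('Mean accuracy:', scores.mean())
```

Each fold holds out only two reviews, so individual fold accuracies can only be 0.0, 0.5, or 1.0. The mean is still a rough indication at best.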

## 6. Evaluate the Model

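We evaluate the model using a classification report and confusion matrix. With a test set of only two reviews, these numbers illustrate the API rather than measure real performance.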

In [None]:
# Print classification report
print('Classification Report:')
print(classification_report(y_test, y_pred))

# Print confusion matrix
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))
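
The figures in the classification report all derive from the four cells of the confusion matrix. For binary labels 0 and 1, scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]`, with rows for true labels and columns for predictions. The sketch below recomputes the main metrics in plain Python. The label lists are made-up values for illustration only, not results from this notebook, and `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tn, fp, fn, tp) for a binary classification problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tn, fp, fn, tp

# Made-up labels for illustration only
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true_demo, y_pred_demo)

# Guard against division by zero when a class is never predicted or never present
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
accuracy = (tp + tn) / len(y_true_demo)

print(f'TN={tn} FP={fp} FN={fn} TP={tp}')
print(f'precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}')
```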

## 7. Predict Sentiment for New Reviews

We demonstrate how to predict sentiment for new reviews.

In [None]:
# Example new reviews
new_reviews = [
    'This product is fantastic and works great!',
    'Awful experience, never buying again.'
]

# Preprocess new reviews
new_reviews_cleaned = [preprocess_text(review) for review in new_reviews]

# Transform using TF-IDF
new_reviews_tfidf = tfidf.transform(new_reviews_cleaned)

# Predict sentiment
predictions = model.predict(new_reviews_tfidf)

# Display results
for review, sentiment in zip(new_reviews, predictions):
    print(f'Review: {review}')
    print(f'Predicted Sentiment: {"Positive" if sentiment == 1 else "Negative"}\n')
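
A hard label hides how confident the model is, and `transform` silently ignores any word that was not in the training vocabulary. The sketch below shows both effects on a tiny, hypothetical training set (the four short texts and the names `vectorizer` and `clf` are assumptions for illustration). It uses `predict_proba` for class probabilities and counts how many known terms each new review contains.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny hypothetical training set, for illustration only
train_texts = ['amazing product', 'terrible product', 'love it', 'poor quality']
train_labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer()
clf = LogisticRegression(random_state=42)
clf.fit(vectorizer.fit_transform(train_texts), train_labels)

# The second review contains no word seen during training
new_texts = ['amazing quality', 'awful experience']
features = vectorizer.transform(new_texts)

# Number of in-vocabulary terms per review; zero means the feature row is all zeros
known_counts = features.getnnz(axis=1)

# Columns follow clf.classes_, so column 1 is the probability of the positive class
probabilities = clf.predict_proba(features)

for text, n_known, probs in zip(new_texts, known_counts, probabilities):
    print(f'{text!r}: known terms={n_known}, P(positive)={probs[1]:.2f}')
```

When a review has no known terms, the prediction comes from the model's intercept alone, so it is worth flagging such reviews instead of trusting the label.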

## 8. Conclusion

This notebook demonstrated sentiment analysis using TF-IDF vectorization and Logistic Regression. The preprocessing steps cleaned the text data, the TF-IDF vectorizer converted text to numerical features, and the Logistic Regression model classified sentiments. The model was evaluated using standard metrics, and we showed how to predict sentiments for new reviews.