# Sentiment Analysis with Logistic Regression

This notebook implements a sentiment analysis model using Logistic Regression and TF-IDF vectorization. The goal is to classify text as positive, negative, or neutral. On a small 45-sample dataset, the current model achieves a test accuracy of 0.666; planned improvements include larger datasets and more advanced models.

In [3]:
# Import required libraries
import pandas as pd
import nltk
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Download necessary NLTK data
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Ati\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Ati\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Ati\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Dataset and Preprocessing

The dataset consists of 45 text samples, balanced across three labels (15 each of positive, negative, and neutral). Text preprocessing applies the following steps, in order:
- Converting to lowercase
- Removing punctuation
- Removing stop words
- Lemmatization

In [5]:
# Define preprocessing function
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

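def preprocess_text(text):
    """Lowercase, strip punctuation, drop stop words, and lemmatize."""
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation (apostrophes too: "it's" -> "its")
    # Note: NLTK's English stop-word list contains negations such as "not" and "no",
    # so "not great" is reduced to "great" by this filter.
    tokens = [word for word in text.split() if word not in stop_words]
    return ' '.join(lemmatizer.lemmatize(word) for word in tokens)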

# Load and preprocess dataset
data = {
    'text': [
        "This product is amazing and I love it!", "Terrible service, very disappointed.",
        "It's okay, nothing special.", "Fantastic experience, highly recommend!",
        "Worst purchase ever, waste of money.", "Really happy with my purchase!",
        "Not bad, could be better.", "Horrible quality, broke after one use.",
        "I’m neutral about this product.", "Superb quality, worth every penny!",
        "Service was okay, not great.", "Very bad experience, never again.",
        "I love this product so much!", "Quality is decent, nothing extraordinary.",
        "This item is fantastic, I’m thrilled!", "Really disappointed, poor service.",
        "It works fine, nothing to complain about.", "Awful product, total waste.",
        "Great product, very satisfied!", "Neutral opinion, it’s just okay.",
        "Bad quality, not worth it.", "Amazing service, will buy again!",
        "Mediocre experience, not impressed.", "Terrible product, broke quickly.",
        "Love the quality, highly recommend!", "Poor service, very upset.",
        "It’s fine, no strong opinion.", "Awesome product, super happy!",
        "Really bad, not recommended.", "Just okay, nothing exciting.",
        "Incredible product, highly satisfied!", "Very poor quality, total failure.",
        "Neutral feedback, it’s average.", "Best purchase ever, love it!",
        "Terrible experience, not worth it.", "Nothing special, just fine.",
        "Wonderful product, very pleased!", "Disappointing quality, not great.",
        "It’s alright, nothing to write home about.", "Top-notch service, love it!",
        "Complete waste, very unhappy.", "Pretty average, no complaints.",
        "Excellent quality, highly recommend!", "Bad service, very frustrating.",
        "Neutral, it’s just okay."
    ],
    'sentiment': ['positive', 'negative', 'neutral', 'positive', 'negative', 
                  'positive', 'neutral', 'negative', 'neutral', 'positive',
                  'neutral', 'negative', 'positive', 'neutral', 'positive', 
                  'negative', 'neutral', 'negative', 'positive', 'neutral',
                  'negative', 'positive', 'neutral', 'negative', 'positive', 
                  'negative', 'neutral', 'positive', 'negative', 'neutral',
                  'positive', 'negative', 'neutral', 'positive', 'negative', 
                  'neutral', 'positive', 'negative', 'neutral', 'positive', 
                  'negative', 'neutral', 'positive', 'negative', 'neutral']
}

df = pd.DataFrame(data)
df['text'] = df['text'].apply(preprocess_text)

# Prepare features and target
X = df['text']
y = df['sentiment']
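
Stop-word removal can discard negations, which matter for sentiment: "Not bad" and "bad" end up looking the same. The sketch below shows one possible way to keep them by subtracting a set of negation words from the stop-word set. `BASE_STOP_WORDS`, `NEGATIONS`, and `preprocess_keep_negation` are hypothetical names, and the tiny stop-word set is an illustrative stand-in so the example runs without NLTK data; in the notebook, the same subtraction could be applied to `stop_words`.

```python
import re

# Illustrative stand-in for a real stop-word list (assumption: kept tiny so the
# sketch runs without downloading any corpus).
BASE_STOP_WORDS = {"i", "it", "is", "this", "the", "a", "and", "be",
                   "could", "very", "of", "not", "no", "nor"}

# Negation words to keep because they flip the sentiment of what follows.
NEGATIONS = {"not", "no", "nor"}

STOP_WORDS_KEEP_NEGATION = BASE_STOP_WORDS - NEGATIONS


def preprocess_keep_negation(text):
    """Lowercase, strip punctuation, drop stop words except negations."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(
        word for word in text.split() if word not in STOP_WORDS_KEEP_NEGATION
    )


print(preprocess_keep_negation("Not bad, could be better."))  # -> not bad better
```

With negations preserved, bigrams such as "not bad" become available to the TF-IDF step, which may help separate neutral from negative examples. Whether it improves accuracy on this dataset is untested here.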

## Model Training

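The model uses a pipeline combining TF-IDF vectorization (unigrams through trigrams) and Logistic Regression with balanced class weights. The dataset is split into 60% training and 40% testing (27 and 18 samples). The split is random with a fixed seed and is not stratified by label.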
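
To get a feel for what the n-gram range does on data this small, the following sketch counts the features produced for three preprocessed strings taken from the training output. It is only an illustration: it uses default `TfidfVectorizer` settings apart from `ngram_range`, not the full configuration of the pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three already-preprocessed samples, as they appear in the X_train output.
docs = [
    "terrible service disappointed",
    "amazing service buy",
    "okay nothing special",
]

shapes = {}
for ngram_range in [(1, 1), (1, 3)]:
    vec = TfidfVectorizer(ngram_range=ngram_range)
    matrix = vec.fit_transform(docs)
    shapes[ngram_range] = matrix.shape
    print(ngram_range, matrix.shape)

# (1, 1) -> (3, 8):  eight distinct words ("service" is shared by two documents)
# (1, 3) -> (3, 17): the same 8 words plus 6 bigrams and 3 trigrams
```

Most bigrams and trigrams occur in a single document, so on a few dozen samples they add many columns the classifier rarely sees again at test time.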

In [7]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Build and train model
model = make_pipeline(
    TfidfVectorizer(stop_words='english', ngram_range=(1, 3), min_df=1, max_df=0.9),
    LogisticRegression(max_iter=200, class_weight='balanced', C=0.5)
)
model.fit(X_train, y_train)
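
With only 45 samples, a purely random split can leave the classes unevenly represented in the training and test sets. A minimal sketch of a stratified split follows, assuming the same 45 balanced labels; the placeholder `texts` and the variable names are hypothetical and stand in for `X` and `y`.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# 45 balanced labels, mirroring the notebook's class distribution.
labels = ["positive", "negative", "neutral"] * 15
# Placeholder texts (assumption: real code would pass the preprocessed column).
texts = [f"sample {i}" for i in range(len(labels))]

# stratify=labels keeps the class proportions equal in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.4, random_state=42, stratify=labels
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print("train:", dict(train_counts))  # 9 samples per class
print("test:", dict(test_counts))    # 6 samples per class
```

In the notebook this would amount to adding `stratify=y` to the existing `train_test_split` call. Doing so changes which samples land in the test set, so the reported accuracy would change as well.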

## Evaluation and Predictions

The model is evaluated with test-set accuracy and 3-fold cross-validation over the full dataset. Predictions are made on the test set and on two new texts. The current test accuracy is 0.666, indicating room for improvement.

In [9]:
# Make predictions
predictions = model.predict(X_test)

# Print detailed results
print("X_train:", X_train.values)
print("y_train:", y_train.values)
print("X_test:", X_test.values)
print("y_test:", y_test.values)
print("Predictions:", predictions)
for text, true_label, pred_label in zip(X_test, y_test, predictions):
    print(f"Text: {text}, True: {true_label}, Predicted: {pred_label}")
print("Accuracy:", accuracy_score(y_test, predictions))

# Cross-validation
scores = cross_val_score(model, X, y, cv=3, scoring='accuracy')
print("Cross-validation accuracy:", scores.mean())

# Predict new texts
new_texts = ["I really enjoyed using this!", "This is awful, never again."]
new_texts = [preprocess_text(text) for text in new_texts]
new_predictions = model.predict(new_texts)
print("New text predictions:", new_predictions)

# Debug TF-IDF features (the vocabulary is sorted alphabetically, not by weight)
vectorizer = model.named_steps['tfidfvectorizer']
print("First 20 features (alphabetical):", vectorizer.get_feature_names_out()[:20])

X_train: ['superb quality worth every penny' 'work fine nothing complain'
 'disappointing quality great' 'poor quality total failure'
 'awesome product super happy' 'product amazing love'
 'incredible product highly satisfied' 'okay nothing exciting'
 'really happy purchase' 'bad experience never' 'best purchase ever love'
 'terrible service disappointed' 'complete waste unhappy'
 'amazing service buy' 'okay nothing special' 'terrible experience worth'
 'terrible product broke quickly' 'wonderful product pleased'
 'service okay great' 'mediocre experience impressed'
 'great product satisfied' 'neutral okay' 'bad quality worth'
 'horrible quality broke one use' 'item fantastic im thrilled'
 'really bad recommended' 'alright nothing write home']
y_train: ['positive' 'neutral' 'negative' 'negative' 'positive' 'positive'
 'positive' 'neutral' 'positive' 'negative' 'positive' 'negative'
 'negative' 'positive' 'neutral' 'negative' 'negative' 'positive'
 'neutral' 'neutral' 'positive' 'neutra
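
A single accuracy figure does not show which classes are being confused. The sketch below illustrates a per-class breakdown with scikit-learn's `confusion_matrix` and `classification_report`. The `y_true` and `y_pred` lists are hypothetical values made up for the example; they are not this notebook's results.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels for illustration only (NOT the notebook's predictions).
labels = ["negative", "neutral", "positive"]
y_true = ["positive", "negative", "neutral", "positive", "negative", "neutral"]
y_pred = ["positive", "negative", "positive", "positive", "neutral", "neutral"]

# Rows are true classes and columns are predicted classes, ordered as `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Precision, recall, and F1 per class; zero_division=0 avoids warnings when a
# class is never predicted.
report = classification_report(y_true, y_pred, labels=labels, zero_division=0)
print(report)
```

In the notebook, `y_test` and `predictions` would take the place of the hypothetical lists.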

## Conclusion

This notebook implements a sentiment analysis model with a test accuracy of 0.666. The low accuracy is most likely due to the small dataset (45 samples, of which only 27 are used for training). Future improvements include:
- Using a larger dataset like Twitter Sentiment.
- Applying advanced models like BERT.
- Enhancing text preprocessing (e.g., removing URLs or emojis).
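
As a rough sketch of the preprocessing enhancement listed above, URLs and emojis could be stripped with regular expressions before punctuation is removed. Order matters here: once punctuation is stripped, a URL collapses into a single meaningless token. The patterns and the name `strip_urls_and_emojis` are assumptions for illustration, and the emoji ranges are approximate rather than exhaustive.

```python
import re

# Matches http(s) links and bare "www." links up to the next whitespace.
URL_RE = re.compile(r"https?://\S+|www\.\S+")

# Approximate emoji coverage (assumption): main pictograph blocks plus
# miscellaneous symbols and dingbats. Not a complete emoji definition.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def strip_urls_and_emojis(text):
    """Replace URLs and emojis with spaces, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    return " ".join(text.split())


sample = "Great product \U0001F60D see https://example.com/review now"
print(strip_urls_and_emojis(sample))  # -> Great product see now
```

A function like this could be called at the start of `preprocess_text`, before lowercasing and punctuation removal.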

Next steps involve testing the model with real-world data and deploying it as a web app using Streamlit.