# Sentiment Analysis using Logistic Regression

# Logistic Regression Theory

**Logistic Regression** is well suited to binary classification tasks such as labeling a tweet's sentiment as positive or negative.

## How it works:
1. **Linear Combination**: z = θ₀ + θ₁x₁ + θ₂x₂ + ... + θₙxₙ
2. **Sigmoid Function**: σ(z) = 1 / (1 + e^(-z)) → maps z to a probability in (0, 1)
3. **Decision**: If σ(z) ≥ 0.5 → Positive, else → Negative
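
The three steps above can be sketched in plain Python. The weights in `theta` and the feature values are made-up numbers chosen only to show the mechanics; they are not the weights learned later in this notebook, and `predict` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x, threshold=0.5):
    """Return (label, probability) for one feature vector x.

    theta[0] is the intercept; theta[1:] pair up with the values in x.
    """
    # Step 1: linear combination
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    # Step 2: squash the score into a probability
    p = sigmoid(z)
    # Step 3: threshold the probability
    return (1 if p >= threshold else 0), p

# Hypothetical weights and features, for illustration only.
theta = [0.0, 0.001, -0.001]
label, p = predict(theta, [3133.0, 61.0])
print(label, round(p, 3))
```

A score of z = 0 gives σ(z) = 0.5, so the decision boundary is the set of inputs where the linear combination is exactly zero.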

## Our Features:
- x₀: Bias term (always 1), paired with θ₀
- x₁: Sum of positive word frequencies
- x₂: Sum of negative word frequencies
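
As a rough illustration of how the three features listed above could be computed, here is a minimal sketch. It assumes a frequency table that maps `(word, label)` pairs to counts, with label `1.0` for positive and `0.0` for negative tweets. The function name `extract_features_sketch` and the toy table are hypothetical; the `extract_features` helper imported from `utils` is not shown here and may differ.

```python
def extract_features_sketch(tokens, freqs):
    """Build [bias, positive_sum, negative_sum] for one tokenized tweet.

    freqs is assumed to map (word, label) -> count.
    Words missing from the table contribute 0 to both sums.
    """
    pos_sum = sum(freqs.get((word, 1.0), 0) for word in tokens)
    neg_sum = sum(freqs.get((word, 0.0), 0) for word in tokens)
    return [1.0, float(pos_sum), float(neg_sum)]

# Tiny made-up frequency table, for illustration only.
toy_freqs = {("happi", 1.0): 3, ("happi", 0.0): 1, ("sad", 0.0): 4}
toy_features = extract_features_sketch(["happi", "sad", "unseen"], toy_freqs)
print(toy_features)  # [1.0, 3.0, 5.0]
```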


In [8]:
import nltk
import numpy as np
from nltk.corpus import twitter_samples  # requires nltk.download('twitter_samples') once
from utils import process_tweet, build_freqs, extract_features  # helper module, not shown here


Train Test Split (80/20)

In [4]:
pos_tweets = twitter_samples.strings('positive_tweets.json')
neg_tweets = twitter_samples.strings('negative_tweets.json')

test_pos = pos_tweets[4000:]
train_pos = pos_tweets[:4000]
test_neg = neg_tweets[4000:]
train_neg = neg_tweets[:4000]

train_x = train_pos + train_neg
test_x = test_pos + test_neg

train_y = np.append(np.ones((len(train_pos),1)),np.zeros((len(train_neg),1)),axis=0)
test_y = np.append(np.ones((len(test_pos),1)),np.zeros((len(test_neg),1)),axis=0)

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

model = LogisticRegression(
    C=1.0,             # inverse of regularization strength (smaller C = stronger regularization)
    max_iter=1000,     # upper limit on solver iterations
    random_state=42,   # seed (only used by some solvers, e.g. 'liblinear')
    solver='lbfgs'     # general-purpose solver ('liblinear' is an alternative)
)

print("LogisticRegression model created")
print(f"Model parameters: {model.get_params()}")


LogisticRegression model created
Model parameters: {'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 1000, 'multi_class': 'auto', 'n_jobs': None, 'penalty': 'l2', 'random_state': 42, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}


Feature Preparation

In [7]:
freqs = build_freqs(train_x, train_y)
print(f"Frequency dictionary built with {len(freqs)} word-sentiment pairs")

Frequency dictionary built with 11397 word-sentiment pairs
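
The `build_freqs` helper comes from `utils` and is not shown in this notebook. A minimal sketch of what such a function might do, assuming the tweets are already tokenized and the labels are a plain list, is given below. The name `build_freqs_sketch` and the toy data are hypothetical.

```python
from collections import Counter

def build_freqs_sketch(tokenized_tweets, labels):
    """Count how often each token appears in tweets of each label.

    Returns a dict mapping (word, label) -> count, with labels stored as floats.
    """
    freqs = Counter()
    for tokens, label in zip(tokenized_tweets, labels):
        for word in tokens:
            freqs[(word, float(label))] += 1
    return dict(freqs)

# Made-up tokenized tweets: 1 = positive, 0 = negative.
toy_tweets = [["love", "movi"], ["hate", "movi"], ["love", "day"]]
toy_labels = [1, 0, 1]
toy = build_freqs_sketch(toy_tweets, toy_labels)
print(toy[("love", 1.0)], toy[("movi", 0.0)], len(toy))  # 2 1 5
```

Each key is a word-sentiment pair, which is why the count printed above is reported as pairs rather than as unique words.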


Feature Extraction

In [9]:
print("Extracting features for training data")
X_train = np.zeros((len(train_x),3))  #[bias,pos_freq,neg_freq]

for i,tweet in enumerate(train_x):
    processed_tweet = process_tweet(tweet)
    features = extract_features(processed_tweet,freqs)
    X_train[i,:] = features

print(f"Training features extracted: {X_train.shape}")
print(f"Features -> [bias, positive_freq, negative_freq]")
print(f"example tweet features: {X_train[0]}")



Extracting features for training data
Training features extracted: (8000, 3)
Features -> [bias, positive_freq, negative_freq]
example tweet features: [1.000e+00 3.133e+03 6.100e+01]


In [10]:
print("Extracting features for test data")
X_test = np.zeros((len(test_x), 3))

for i,tweet in enumerate(test_x):
    processed_tweet = process_tweet(tweet)
    features = extract_features(processed_tweet,freqs)
    X_test[i,:] = features

print(f"Test features extracted: {X_test.shape}")


Extracting features for test data
Test features extracted: (2000, 3)


Flattening Labels for scikit-learn

In [11]:
y_train = train_y.flatten()
y_test = test_y.flatten()
print(f"Labels prepared -> Train: {y_train.shape},Test: {y_test.shape}")

Labels prepared -> Train: (8000,),Test: (2000,)


Model Training

In [12]:
model.fit(X_train, y_train)
print("Model training completed")

Model training completed


Accuracy Score

In [13]:
y_train_pred = model.predict(X_train)
train_accuracy = accuracy_score(y_train,y_train_pred)
y_test_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)

print(f"Training Accuracy: {train_accuracy:.4f} ({train_accuracy*100:.2f}%)")
print(f"Test Accuracy: {test_accuracy:.4f} ({test_accuracy*100:.2f}%)")


Training Accuracy: 0.9942 (99.42%)
Test Accuracy: 0.9950 (99.50%)
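
Accuracy alone does not show which class the errors fall in. The sketch below counts true and false positives and negatives and derives precision and recall from them. It uses made-up toy labels, not this notebook's predictions; in the notebook one would pass `y_test` and `y_test_pred` instead, or call the already imported `classification_report(y_test, y_test_pred)`. The helper name `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = positive."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == 1 and t == 1:
            tp += 1
        elif p == 1 and t == 0:
            fp += 1
        elif p == 0 and t == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

# Toy labels for illustration only.
y_true_toy = [1, 1, 1, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 0, 1]

tp, fp, tn, fn = confusion_counts(y_true_toy, y_pred_toy)
precision = tp / (tp + fp)            # share of predicted positives that are correct
recall = tp / (tp + fn)               # share of actual positives that are found
accuracy = (tp + tn) / len(y_true_toy)
print(tp, fp, tn, fn)
print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
```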


Making Predictions on New Tweets

In [15]:
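def predict_sentiment(tweet, model, freqs):
    """Return (predicted label, P(positive), feature row) for a raw tweet."""
    processed_tweet = process_tweet(tweet)
    features = extract_features(processed_tweet, freqs)  # assumed to be a 2D row of shape (1, 3)
    prediction = model.predict(features)[0]
    # Column 1 of predict_proba is the probability of the positive class (label 1)
    probability = model.predict_proba(features)[0][1]
    return prediction, probability, features[0]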


test_tweets = [
    "I love this movie! It's absolutely amazing! #movielife #happy",
    "This is the worst day ever #hate . I hate everything ",
    "The weather is okay today",
    "Happy birthday! Hope you have a wonderful day!",
    "I'm so disappointed with this product. Terrible quality. #disappointed https://t.co/1234567890"
]

for tweet in test_tweets:
    prediction,probability,features = predict_sentiment(tweet, model, freqs)
    sentiment = "positive" if prediction == 1 else "negative"
    
    
    print(f"Tweet: '{tweet}'")
    print(f"Sentiment: {sentiment} (Probability: {probability:.3f})")


Tweet: 'I love this movie! It's absolutely amazing! #movielife #happy'
Sentiment: positive (Probability: 0.968)
Tweet: 'This is the worst day ever #hate . I hate everything '
Sentiment: negative (Probability: 0.496)
Tweet: 'The weather is okay today'
Sentiment: positive (Probability: 0.543)
Tweet: 'Happy birthday! Hope you have a wonderful day!'
Sentiment: positive (Probability: 0.929)
Tweet: 'I'm so disappointed with this product. Terrible quality. #disappointed https://t.co/1234567890'
Sentiment: positive (Probability: 0.567)
