# Sentiment Analysis using NLTK and Machine Learning
## Technical Report - Jupyter Notebook

**Author:** Data Analytics Student  
**Date:** October 2025  
**Project:** Text Sentiment Analysis using Natural Language Processing

---

## Instructions
- Run cells in order with **Shift+Enter**
- Or click **Cell → Run All** to execute everything
- Estimated time: 5-10 minutes

## 1. Problem Statement

### Objectives:
1. Analyze sentiment (positive, negative, neutral) in text data
2. Build a machine learning model to classify sentiment
3. Evaluate model performance

### Technical Goal:
Develop a classification model using TF-IDF vectorization and logistic regression.
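
One way to picture the technical goal is a two-step scikit-learn `Pipeline` that chains the vectorizer and the classifier. The sketch below is a minimal illustration under stated assumptions: the toy sentences, the labels, and the name `sentiment_pipeline` are hypothetical and are not the notebook's dataset or its final model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Illustrative toy data (hypothetical, not the notebook's dataset)
toy_texts = [
    "great product love it",
    "excellent quality highly recommend",
    "terrible waste of money",
    "awful broke after one use",
    "okay nothing special",
    "average met basic expectations",
]
toy_labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

# Step 1 turns text into TF-IDF features, step 2 classifies those features
sentiment_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
sentiment_pipeline.fit(toy_texts, toy_labels)

prediction = sentiment_pipeline.predict(["love this excellent product"])[0]
print(prediction)
```

Wrapping both steps in one object means the vectorizer is always fitted on exactly the data passed to `fit`, which helps keep test data out of the vocabulary.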

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-v0_8-darkgrid')

# Text processing
from bs4 import BeautifulSoup
import re
import string

# NLTK
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Download NLTK data
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # needed by word_tokenize in recent NLTK releases
nltk.download('wordnet', quiet=True)

print('✅ Libraries imported successfully!')

In [None]:
# Create sample dataset
np.random.seed(42)

# Positive reviews
positive_reviews = [
    "This product is amazing! Exceeded all my expectations.",
    "Absolutely love it! Best purchase I've made this year.",
    "Excellent quality and fast shipping. Highly recommend!",
    "Perfect! Exactly what I was looking for.",
    "Outstanding product. Worth every penny!",
] * 100

# Negative reviews
negative_reviews = [
    "Terrible product. Complete waste of money.",
    "Very disappointed. Does not work as advertised.",
    "Poor quality. Broke after one use.",
    "Awful! Do not buy this product.",
    "Worst purchase ever. Requesting a refund.",
] * 100

# Neutral reviews
neutral_reviews = [
    "It's okay. Nothing special but does the job.",
    "Average product. Met basic expectations.",
    "It's fine. Not great, not terrible.",
] * 100

# Create DataFrame
reviews = positive_reviews + negative_reviews + neutral_reviews
sentiments = (['positive'] * len(positive_reviews) + 
              ['negative'] * len(negative_reviews) + 
              ['neutral'] * len(neutral_reviews))

df = pd.DataFrame({
    'text': reviews,
    'sentiment': sentiments,
    'rating': np.random.choice([1, 2, 3, 4, 5], size=len(reviews))
})

# Simulate missing values in 5% of the text rows
missing_indices = np.random.choice(df.index, size=int(0.05 * len(df)), replace=False)
df.loc[missing_indices, 'text'] = np.nan

# Shuffle
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

print(f'✅ Dataset created: {df.shape}')
df.head(10)
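
Because each review list above is multiplied by 100, the 1,300 rows contain only 13 distinct sentences. The sketch below shows one way to measure that kind of repetition with pandas. It uses a shrunken stand-in frame; `templates`, `toy_df`, and `deduplicated` are hypothetical names, not variables from the notebook.

```python
import pandas as pd

# Same construction pattern as the sample dataset, shrunk for illustration
templates = ["Great product.", "Terrible product.", "It's okay."]
toy_df = pd.DataFrame({"text": templates * 4})

n_rows = len(toy_df)
n_unique = toy_df["text"].nunique()            # distinct sentences
n_repeats = int(toy_df["text"].duplicated().sum())  # rows that copy an earlier row

print(f"{n_rows} rows, {n_unique} distinct texts, {n_repeats} repeated rows")

# One row per distinct text, useful for checking how much real variety exists
deduplicated = toy_df.drop_duplicates(subset="text").reset_index(drop=True)
print(deduplicated)
```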

In [None]:
# Exploratory Data Analysis
print('='*80)
print('EXPLORATORY DATA ANALYSIS')
print('='*80)

print('\nDataset Info:')
df.info()  # info() prints its own summary; wrapping it in print() also prints 'None'

print('\nMissing Values:')
print(df.isnull().sum())

print('\nSentiment Distribution:')
print(df['sentiment'].value_counts())

# Handle missing values
df_clean = df.dropna(subset=['text']).copy()
print(f'\n✅ Clean dataset: {len(df_clean)} rows')

In [None]:
# Visualize sentiment distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

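# Fix the category order so each sentiment always gets its intended color
sentiment_order = ['positive', 'negative', 'neutral']
sentiment_colors = ['green', 'red', 'gray']
sentiment_counts = df_clean['sentiment'].value_counts().reindex(sentiment_order)

# Bar plot
sentiment_counts.plot(kind='bar', ax=axes[0], color=sentiment_colors)
axes[0].set_title('Sentiment Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Sentiment')
axes[0].set_ylabel('Count')

# Pie chart
sentiment_counts.plot(kind='pie', ax=axes[1], autopct='%1.1f%%', colors=sentiment_colors)
axes[1].set_title('Sentiment Proportion', fontsize=14, fontweight='bold')
axes[1].set_ylabel('')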

plt.tight_layout()
plt.show()

In [None]:
# Text preprocessing function
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    if pd.isna(text):
        return ''
    
    # Remove HTML tags
    text = BeautifulSoup(text, 'html.parser').get_text()
    
    # Lowercase
    text = text.lower()
    
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)  # 'http\S+' already matches https links
    
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    
    # Tokenize
    tokens = word_tokenize(text)
    
    # Remove stop words and lemmatize
    tokens = [lemmatizer.lemmatize(word) for word in tokens 
              if word not in stop_words and len(word) > 2]
    
    return ' '.join(tokens)

# Apply preprocessing
print('Preprocessing text...')
df_clean['cleaned_text'] = df_clean['text'].apply(preprocess_text)

# Remove empty texts
df_clean = df_clean[df_clean['cleaned_text'].str.len() > 0].copy()

print(f'✅ Preprocessing complete: {len(df_clean)} rows')
print('\nSample:')
print(df_clean[['text', 'cleaned_text', 'sentiment']].head(3))
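
To see what the regex and string steps do on their own, the sketch below applies only the lowercasing, URL removal, punctuation removal, and digit removal to one sentence. It deliberately leaves out the NLTK tokenizer, stop word filter, and lemmatizer, so it is not a replacement for `preprocess_text`. The helper name `basic_clean` and the example sentence are hypothetical.

```python
import re
import string

def basic_clean(text):
    """Hypothetical helper: the regex/string cleaning steps only, without NLTK."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)  # drop URLs
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    text = re.sub(r"\d+", "", text)  # drop digits
    return " ".join(text.split())  # collapse repeated whitespace

cleaned_example = basic_clean("Check https://example.com NOW!!! 5 stars, 100% great.")
print(cleaned_example)  # check now stars great
```

The order matters: URLs are removed before punctuation, because stripping punctuation first would break `https://example.com` into fragments that the URL pattern no longer matches completely.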

In [None]:
# Split first, then vectorize: the vectorizer must only see training text
X = df_clean['cleaned_text']
y = df_clean['sentiment']

X_train_text, X_test_text, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

tfidf_vectorizer = TfidfVectorizer(
    max_features=5000,
    min_df=2,
    max_df=0.8,
    ngram_range=(1, 2)
)

# Learn vocabulary and IDF weights from the training split only (avoids leakage)
X_train = tfidf_vectorizer.fit_transform(X_train_text)
X_test = tfidf_vectorizer.transform(X_test_text)

print(f'✅ TF-IDF Matrix Shape (train): {X_train.shape}')
print(f'Number of features: {len(tfidf_vectorizer.get_feature_names_out())}')

In [None]:
# Report split sizes

print(f'Training set: {X_train.shape[0]} samples')
print(f'Testing set: {X_test.shape[0]} samples')

# Train model
print('\nTraining Logistic Regression model...')
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)

# Evaluate
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(f'\n✅ Training Accuracy: {train_accuracy:.2%}')
print(f'✅ Testing Accuracy: {test_accuracy:.2%}')
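
Since the sample dataset repeats each sentence 100 times, a random split places copies of the same sentence in both the training and testing sets, so the testing accuracy above mostly measures memorization. One possible remedy, sketched below, is a group-aware split with scikit-learn's `GroupShuffleSplit`, using the sentence itself as the group id. The toy sentences and variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 4 distinct sentences, each repeated 5 times (illustrative only)
templates = np.array(["love it", "hate it", "works fine", "broke fast"])
texts = np.repeat(templates, 5)

# Use the sentence itself as the group id so copies never straddle the split
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(texts, groups=texts))

train_sentences = set(texts[train_idx])
test_sentences = set(texts[test_idx])
print("Sentences shared by train and test:", train_sentences & test_sentences)
```

With only a handful of distinct sentences per class, a split like this leaves very little to learn from, which is itself a sign that the dataset needs more variety before the accuracy figure means much.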

In [None]:
# Test predictions
test_sentences = [
    "This is absolutely wonderful! I love it so much!",
    "Terrible experience. Would not recommend.",
    "It's okay, nothing special.",
]

print('Making predictions on new text:\n')
for i, sentence in enumerate(test_sentences, 1):
    cleaned = preprocess_text(sentence)
    transformed = tfidf_vectorizer.transform([cleaned])
    prediction = model.predict(transformed)[0]
    probabilities = model.predict_proba(transformed)[0]
    
    print(f'{i}. Text: {sentence}')
    print(f'   Predicted: {prediction.upper()}')
    print(f'   Confidence: {probabilities.max():.2%}\n')
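
The "confidence" printed above is the largest entry of `predict_proba`. The sketch below shows how those entries line up with class names: the columns follow `classes_`, which scikit-learn stores in sorted order. The toy data and the names `clf`, `probs`, and `per_class` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Illustrative toy data (hypothetical)
toy_texts = ["love it great", "hate it awful", "okay average fine"] * 3
toy_labels = ["positive", "negative", "neutral"] * 3

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(toy_texts)
clf = LogisticRegression(max_iter=1000).fit(features, toy_labels)

probs = clf.predict_proba(vectorizer.transform(["great but awful"]))[0]

# classes_ is sorted, and the predict_proba columns follow the same order
per_class = dict(zip(clf.classes_, probs))
for label, p in sorted(per_class.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:>8}: {p:.2%}")
```

Printing all class probabilities rather than only the maximum makes mixed inputs easier to spot, since a near tie between two classes looks very different from a clear winner.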

In [None]:
# Confusion matrix
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=model.classes_,
            yticklabels=model.classes_)
plt.title('Confusion Matrix', fontsize=14, fontweight='bold')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()

# Classification report
print('\nClassification Report:')
print(classification_report(y_test, y_pred))
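
The heatmap and the classification report are two views of the same counts. As a small worked example with made-up labels, the sketch below builds a confusion matrix with an explicit label order and derives per-class recall and precision from it. The `_toy` arrays are hypothetical and unrelated to the model's actual predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["negative", "neutral", "positive"]
y_true_toy = ["positive", "positive", "negative", "neutral", "negative", "neutral"]
y_pred_toy = ["positive", "neutral", "negative", "neutral", "negative", "positive"]

# Rows are true labels, columns are predicted labels, both in `labels` order
cm_toy = confusion_matrix(y_true_toy, y_pred_toy, labels=labels)
print(cm_toy)

# Recall: correct predictions divided by the number of true items per class (row sums)
recall = cm_toy.diagonal() / cm_toy.sum(axis=1)
# Precision: correct predictions divided by the number of predictions per class (column sums)
precision = cm_toy.diagonal() / cm_toy.sum(axis=0)

for label, r, p in zip(labels, recall, precision):
    print(f"{label:>8}: recall={r:.2f} precision={p:.2f}")
```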

## Analysis of Findings

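### Key Results:
- The model separates the three sentiment classes on this sample dataset, but the reported accuracy is optimistic: the data consists of 13 template sentences repeated 100 times each, so the test split contains the same sentences the model was trained on
- TF-IDF gives more weight to words that appear in few reviews (such as "amazing" or "terrible") than to words shared across many reviews
- Logistic regression learns one coefficient per term and class, so the contribution of each word to a prediction can be inspected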

### Strengths:
- Fast training and prediction
- Interpretable feature importance
- Good performance on clear sentiment

### Limitations:
- May struggle with sarcasm
- Limited context understanding
- Depends on training data quality
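
One concrete form of limited context is negation. NLTK's English stop word list includes "not", so removing stop words erases the difference between "good" and "not good". The sketch below reproduces the effect with a small stand-in list. `mini_stop_words`, `negations`, and `drop_stop_words` are hypothetical names, and the stand-in list is far shorter than NLTK's.

```python
# Hypothetical mini stop list; like NLTK's English list, it contains "not"
mini_stop_words = {"the", "is", "this", "it", "not", "a"}

def drop_stop_words(text, stop_words):
    """Lowercase, split on whitespace, and drop any word in stop_words."""
    return " ".join(w for w in text.lower().split() if w not in stop_words)

plain = drop_stop_words("This is good", mini_stop_words)
negated = drop_stop_words("This is not good", mini_stop_words)
print(repr(plain), repr(negated))  # both become 'good'

# Keeping negation words out of the stop list preserves the distinction
negations = {"not", "no", "never"}
kept_negation = drop_stop_words("This is not good", mini_stop_words - negations)
print(repr(kept_negation))  # 'not good'
```

Combined with bigrams (`ngram_range=(1, 2)`), a retained "not" lets the vectorizer form features such as "not good", which a linear model can then weight separately from "good".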

### Recommendations:
1. Increase dataset size and diversity
2. Try ensemble methods
3. Experiment with word embeddings
4. Consider deep learning for complex cases
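
As a starting point for the ensemble recommendation, the sketch below combines logistic regression with multinomial naive Bayes through scikit-learn's `VotingClassifier`. The choice of naive Bayes as the second model, the soft voting, and the toy data are assumptions made for illustration, not results from this notebook.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Illustrative toy data (hypothetical)
toy_texts = ["great amazing love", "terrible awful waste", "okay average fine"] * 5
toy_labels = ["positive", "negative", "neutral"] * 5

ensemble = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        estimators=[
            ("logreg", LogisticRegression(max_iter=1000)),
            ("nb", MultinomialNB()),
        ],
        voting="soft",  # average the predicted class probabilities
    ),
)
ensemble.fit(toy_texts, toy_labels)

ensemble_prediction = ensemble.predict(["awful waste"])[0]
print(ensemble_prediction)
```

Whether an ensemble actually helps should be checked on data with real variety, since any reasonable model will score well on a dataset of repeated template sentences.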
