# Public Service Feedback Analyzer - Sentiment Analysis with Naive Bayes

This notebook demonstrates how to use Naive Bayes for sentiment analysis of public service feedback. We'll cover:

1. Loading the data
2. Text preprocessing
3. Feature extraction
4. Training a Naive Bayes classifier
5. Evaluating the model
6. Inspecting the top features per class
7. Classifying new feedback
8. Saving the model

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download necessary NLTK data
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data that recent NLTK releases look up instead of 'punkt'

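## 1. Loading the Data

For this demonstration, we load public service feedback from `sentiment_analysis.csv`. The file is expected to contain a `feedback` text column and a numeric `Sentiment` column with values 0 to 3.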

In [None]:
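# Step 1: Load the CSV
df = pd.read_csv('sentiment_analysis.csv')

# Step 2: Keep only required columns and drop rows with missing values,
# which would otherwise break preprocessing and the stratified split
df = df[['feedback', 'Sentiment']].dropna()

# Step 3: Rename 'Sentiment' to 'sentiment'
# (reassigning instead of inplace=True avoids modifying a slice of the original frame)
df = df.rename(columns={'Sentiment': 'sentiment'})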

# Step 4: Map sentiment numbers to text labels
df['sentiment_label'] = df['sentiment'].map({0: 'negative', 1: 'neutral', 2: 'positive', 3: 'urgent'})

# (Optional) Check if it worked
print(df.head())


In [None]:
# Check class distribution
plt.figure(figsize=(10, 6))
sns.countplot(x='sentiment_label', data=df)
plt.title('Distribution of Sentiment Classes')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.show()

## 2. Text Preprocessing

Before training our model, we need to preprocess the text data. This involves:
- Converting to lowercase
- Removing special characters
- Removing stopwords
- Lemmatization
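
To make these steps concrete, here is a minimal sketch that applies the first three of them to a single sentence. It is an illustration only: `TOY_STOP_WORDS` and `toy_preprocess` are hypothetical names, the stopword set is a tiny stand-in for NLTK's full English list, and lemmatization is left out so the snippet runs without any downloads.

```python
import re

# Hypothetical mini stopword list; the notebook itself uses NLTK's English list.
TOY_STOP_WORDS = {"the", "is", "and", "to", "a", "of", "on", "are", "been"}

def toy_preprocess(text):
    # Lowercase, then drop everything that is not a letter or whitespace
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)
    # Whitespace tokenization stands in for nltk.word_tokenize here
    tokens = text.split()
    # Drop stopwords and very short tokens (such as the "th" left over from "5th")
    kept = [t for t in tokens if t not in TOY_STOP_WORDS and len(t) > 2]
    return " ".join(kept)

example = "The streetlights on 5th Avenue are STILL broken!"
cleaned = toy_preprocess(example)
print(cleaned)  # streetlights avenue still broken
```

With lemmatization added, as in the full function below, plural forms such as "streetlights" would additionally be reduced to their base form.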

In [None]:
# Initialize lemmatizer and stopwords
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    
    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Tokenize
    tokens = nltk.word_tokenize(text)
    
    # Remove stopwords and lemmatize
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words and len(word) > 2]
    
    # Join tokens back into a string
    return ' '.join(tokens)

# Apply preprocessing to the feedback column
df['processed_feedback'] = df['feedback'].apply(preprocess_text)

# Display examples of processed text
pd.set_option('display.max_colwidth', None)
df[['feedback', 'processed_feedback', 'sentiment_label']].head()

## 3. Feature Extraction

Next, we'll convert the text data into numerical features using the Bag of Words model and TF-IDF.

In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    df['processed_feedback'], 
    df['sentiment'], 
    test_size=0.2, 
    random_state=42,
    stratify=df['sentiment']  # Preserve class proportions in both train and test sets
)

# Initialize the vectorizers
count_vectorizer = CountVectorizer(max_features=1000)  # Keep only the 1,000 most frequent terms
tfidf_vectorizer = TfidfVectorizer(max_features=1000)

# Transform the data
X_train_counts = count_vectorizer.fit_transform(X_train)
X_test_counts = count_vectorizer.transform(X_test)

X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Display feature information
print(f"Number of features (Count Vectorizer): {X_train_counts.shape[1]}")
print(f"Number of features (TF-IDF Vectorizer): {X_train_tfidf.shape[1]}")

# Get feature names for later analysis
feature_names = count_vectorizer.get_feature_names_out()

## 4. Training the Naive Bayes Classifier

We'll train two separate models - one using Bag of Words (Count Vectorizer) and one using TF-IDF features.
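
For intuition about what a multinomial Naive Bayes model computes, here is a small from-scratch sketch: each class score is the log prior plus the sum of log word likelihoods, with add-one (Laplace) smoothing. The training sentences and the `train_toy_nb` and `predict_toy_nb` helpers are hypothetical and only mirror the idea; they are not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict

# Invented training examples (already preprocessed)
toy_train = [
    ("great service friendly staff", "positive"),
    ("great park clean", "positive"),
    ("long wait rude staff", "negative"),
    ("broken light long delay", "negative"),
]

def train_toy_nb(examples, alpha=1.0):
    # Count documents per class and word occurrences per class
    class_doc_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for text, label in examples:
        word_counts[label].update(text.split())
    vocab = sorted({w for counts in word_counts.values() for w in counts})

    # log P(class) and smoothed log P(word | class)
    log_prior = {c: math.log(n / len(examples)) for c, n in class_doc_counts.items()}
    log_likelihood = {}
    for c in class_doc_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def predict_toy_nb(text, log_prior, log_likelihood):
    # Words outside the training vocabulary are ignored
    scores = {
        c: log_prior[c]
        + sum(log_likelihood[c][w] for w in text.split() if w in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get), scores

toy_prior, toy_likelihood = train_toy_nb(toy_train)
toy_label, toy_scores = predict_toy_nb("great staff", toy_prior, toy_likelihood)
print(toy_label)  # positive
```

"staff" appears once in each class, so the decision here is driven by "great", which occurs only in the positive examples.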

In [None]:
# Initialize Multinomial Naive Bayes classifiers
nb_counts = MultinomialNB()
nb_tfidf = MultinomialNB()

# Train models
nb_counts.fit(X_train_counts, y_train)
nb_tfidf.fit(X_train_tfidf, y_train)

# Make predictions
y_pred_counts = nb_counts.predict(X_test_counts)
y_pred_tfidf = nb_tfidf.predict(X_test_tfidf)

# Calculate probability predictions for later analysis
y_prob_counts = nb_counts.predict_proba(X_test_counts)
y_prob_tfidf = nb_tfidf.predict_proba(X_test_tfidf)

## 5. Model Evaluation

Let's evaluate the performance of our models.
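
The precision, recall, and F1 values in a classification report can all be derived from the confusion matrix. The sketch below shows the arithmetic on an invented three-class matrix; the numbers are made up for illustration and are not results from this notebook, and `per_class_metrics` is a hypothetical helper.

```python
# Rows are actual classes, columns are predicted classes (invented counts)
toy_labels = ["negative", "neutral", "positive"]
toy_cm = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]

def per_class_metrics(cm):
    n = len(cm)
    metrics = []
    for k in range(n):
        true_pos = cm[k][k]
        predicted_k = sum(cm[row][k] for row in range(n))  # column sum
        actual_k = sum(cm[k])                              # row sum
        precision = true_pos / predicted_k if predicted_k else 0.0
        recall = true_pos / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics

toy_metrics = per_class_metrics(toy_cm)
toy_accuracy = sum(toy_cm[k][k] for k in range(len(toy_cm))) / sum(map(sum, toy_cm))

for label, (precision, recall, f1) in zip(toy_labels, toy_metrics):
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={toy_accuracy:.4f}")
```

Precision reads down a column (of everything predicted as a class, how much was right), while recall reads across a row (of everything that truly belongs to a class, how much was found).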

In [None]:
# Evaluate Count Vectorizer model
print("Model: Naive Bayes with Bag of Words")
print(f"Accuracy: {accuracy_score(y_test, y_pred_counts):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_counts, 
                           target_names=['Negative', 'Neutral', 'Positive', 'Urgent']))

# Confusion Matrix for Count Vectorizer model
cm_counts = confusion_matrix(y_test, y_pred_counts)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_counts, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Negative', 'Neutral', 'Positive', 'Urgent'],
            yticklabels=['Negative', 'Neutral', 'Positive', 'Urgent'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix - Naive Bayes with Bag of Words')
plt.show()

# Evaluate TF-IDF model
print("\nModel: Naive Bayes with TF-IDF")
print(f"Accuracy: {accuracy_score(y_test, y_pred_tfidf):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_tfidf, 
                           target_names=['Negative', 'Neutral', 'Positive', 'Urgent']))

# Confusion Matrix for TF-IDF model
cm_tfidf = confusion_matrix(y_test, y_pred_tfidf)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_tfidf, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Negative', 'Neutral', 'Positive', 'Urgent'],
            yticklabels=['Negative', 'Neutral', 'Positive', 'Urgent'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix - Naive Bayes with TF-IDF')
plt.show()

## 6. Feature Importance Analysis

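Let's look at which words carry the highest probability within each sentiment class. Note that `feature_log_prob_` holds log P(word | class), so words that are frequent in every class can top several lists; treat the output as a rough guide rather than a ranking of the most discriminative words.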
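
One possible refinement, sketched here as an assumption rather than as part of the original analysis, is to rank words by how much more likely they are in one class than in the others. The hypothetical `top_distinctive_features` helper takes an array shaped like a fitted model's `feature_log_prob_`; the toy probabilities are invented.

```python
import numpy as np

def top_distinctive_features(feature_log_prob, feature_names, class_index, n=3):
    # Contrast this class's log-probabilities with the mean over the other classes
    others = np.delete(feature_log_prob, class_index, axis=0)
    contrast = feature_log_prob[class_index] - others.mean(axis=0)
    order = np.argsort(contrast)[::-1][:n]
    return [(str(feature_names[i]), float(contrast[i])) for i in order]

# Invented P(word | class) values for two classes and three words
toy_names = np.array(["service", "pothole", "thank"])
toy_log_prob = np.log(np.array([
    [0.5, 0.4, 0.1],  # class 0
    [0.5, 0.1, 0.4],  # class 1
]))

# Raw probability ranks "service" first, although it is equally likely in both classes
most_probable = str(toy_names[np.argmax(toy_log_prob[0])])
# The contrast score ranks "pothole" first, the word that separates class 0 from class 1
most_distinctive = top_distinctive_features(toy_log_prob, toy_names, class_index=0, n=1)

print(most_probable)           # service
print(most_distinctive[0][0])  # pothole
```

With a fitted model, the same helper could be called with `nb_tfidf.feature_log_prob_` and the vectorizer's feature names.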

In [None]:
def show_top_features(classifier, feature_names, class_labels, n=10):
    """Display top n features for each class"""
    for i, class_label in enumerate(class_labels):
        print(f"\nTop {n} features for class: {class_label}")
        
        # Get feature log probabilities for the class
        feature_log_probs = classifier.feature_log_prob_[i]
        
        # Get indices of top features
        top_indices = np.argsort(feature_log_probs)[::-1][:n]
        
        # Print top features
        for idx in top_indices:
            print(f"{feature_names[idx]}: {np.exp(feature_log_probs[idx]):.4f}")

# Class labels
class_labels = ['Negative', 'Neutral', 'Positive', 'Urgent']

# Display top features for the TF-IDF model
print("Top features by class (TF-IDF Model):")
show_top_features(nb_tfidf, tfidf_vectorizer.get_feature_names_out(), class_labels)

## 7. Building a Complete Sentiment Analysis Function

Now that we've trained and evaluated our model, let's create a function to analyze new feedback.

In [None]:
def analyze_sentiment(text, model=nb_tfidf, vectorizer=tfidf_vectorizer):
    """Analyze the sentiment of a text input"""
    # Preprocess the text
    processed_text = preprocess_text(text)
    
    # Vectorize the text
    text_vector = vectorizer.transform([processed_text])
    
    # Predict sentiment class
    sentiment_class = model.predict(text_vector)[0]
    
    # Get probability scores (columns follow the order of model.classes_)
    proba = model.predict_proba(text_vector)[0]
    
    # Map numeric label to text label
    sentiment_map = {0: 'negative', 1: 'neutral', 2: 'positive', 3: 'urgent'}
    sentiment = sentiment_map[sentiment_class]
    
    # Calculate confidence score (max probability)
    confidence = float(proba.max())
    
    return {
        'text': text,
        'sentiment': sentiment,
        'confidence': confidence,
        # Pair each probability with its class via model.classes_ instead of assuming
        # column i is class i, which breaks if a class is missing from the training data
        'probabilities': {
            sentiment_map[cls]: float(p) for cls, p in zip(model.classes_, proba)
        }
    }

# Test the function with some sample feedback
sample_feedback = [
    "The city's new recycling program is well organized and easy to use.",
    "I've been trying to get a building permit for months and keep getting the runaround.",
    "I'd like to know when the next city council meeting is scheduled.",
    "There's a large pothole on Oak Street that's causing cars to swerve dangerously into oncoming traffic!"
]

for feedback in sample_feedback:
    result = analyze_sentiment(feedback)
    print(f"\nText: {result['text']}")
    print(f"Sentiment: {result['sentiment']}")
    print(f"Confidence: {result['confidence']:.4f}")
    print("Probability Distribution:")
    for sentiment, prob in result['probabilities'].items():
        print(f"  {sentiment}: {prob:.4f}")

## 8. Saving the Model

Finally, let's save our trained model and vectorizer for use in the application.
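
For reference, loading the saved objects back works the same way in reverse. The sketch below runs a full save-and-load round trip on a tiny stand-in model so that it is self-contained; the toy texts, labels, and file names are invented. Only unpickle files you trust, and preferably with the same scikit-learn version that produced them.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny stand-in model so the round trip can run on its own
toy_texts = ["great service", "rude staff", "great staff", "rude service delay"]
toy_labels = [2, 0, 2, 0]
toy_vectorizer = TfidfVectorizer()
toy_model = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.pkl")
    vectorizer_path = os.path.join(tmp_dir, "toy_vectorizer.pkl")

    # Save both objects: the model is useless without the matching vectorizer
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)
    with open(vectorizer_path, "wb") as f:
        pickle.dump(toy_vectorizer, f)

    # Load them back, as the application would at startup
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(vectorizer_path, "rb") as f:
        loaded_vectorizer = pickle.load(f)

new_text = ["great delay"]
original_pred = toy_model.predict(toy_vectorizer.transform(new_text))
restored_pred = loaded_model.predict(loaded_vectorizer.transform(new_text))
print(list(original_pred) == list(restored_pred))  # True
```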

In [None]:
import pickle

# Save the model and vectorizer
with open('sentiment_model.pkl', 'wb') as f:
    pickle.dump(nb_tfidf, f)
    
with open('tfidf_vectorizer.pkl', 'wb') as f:
    pickle.dump(tfidf_vectorizer, f)

print("Model and vectorizer saved successfully!")

## 9. Conclusion

In this notebook, we've built a Naive Bayes classifier for sentiment analysis of public service feedback. The model can categorize feedback into four classes: positive, negative, neutral, and urgent.

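Key takeaways:
1. Naive Bayes is a fast, effective baseline for text classification tasks
2. In our run, TF-IDF vectorization slightly outperformed the simpler Bag of Words approach; compare the two accuracy figures in Section 5, since the gap depends on the data
3. The confusion matrices in Section 5 show which sentiment classes the model separates well and which it confuses
4. Urgent feedback, which requires immediate attention, is handled as its own class; its recall in the classification report shows how much of it the model catches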

For a production system, you would want to:
1. Train on a much larger dataset
2. Consider more advanced models such as BERT or other transformer-based models
3. Implement additional features like entity recognition to identify specific issues
4. Continuously update the model as new feedback is collected
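
As one possible way to approach the last point, scikit-learn's `MultinomialNB` supports incremental training through `partial_fit`. The sketch below is an assumption about how that could look, not part of this notebook's pipeline: it swaps the TF-IDF vectorizer for a stateless `HashingVectorizer` (so new words need no refitting), and the batches are invented examples.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# All labels must be declared up front, because a batch may not contain every class
ALL_CLASSES = [0, 1, 2, 3]  # negative, neutral, positive, urgent

# alternate_sign=False keeps feature values non-negative, as MultinomialNB requires
hashing_vectorizer = HashingVectorizer(n_features=2**12, alternate_sign=False)
incremental_model = MultinomialNB()

# Invented batches standing in for feedback that arrives over time
first_batch = (["great service", "long wait rude staff"], [2, 0])
second_batch = (["when is the next council meeting", "gas leak on main street"], [1, 3])

for texts, labels in (first_batch, second_batch):
    X_batch = hashing_vectorizer.transform(texts)
    incremental_model.partial_fit(X_batch, labels, classes=ALL_CLASSES)

prediction = incremental_model.predict(
    hashing_vectorizer.transform(["gas leak near school"])
)[0]
print(prediction)
```

The trade-off is that hashed features cannot be mapped back to words, so the per-class word lists from Section 6 are no longer available.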

This model provides a solid foundation for a public service feedback analyzer system.