# Maandamano Mondays Sentiment Analysis Report

This notebook provides a comprehensive analysis of sentiment data from the Maandamano Mondays protests in Kenya, including exploratory data analysis (EDA), visualizations, and insights into public sentiment and economic implications.

## Objectives
- Analyze sentiment distribution across protest-related tweets
- Perform exploratory data analysis to gain insights into public sentiment
- Visualize results and provide clear, concise reporting
- Assess economic implications of public sentiment

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter, defaultdict
import ast
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

## 1. Data Loading and Preprocessing

In [None]:
# Load the labeled sentiment data
df = pd.read_csv('data/labeled_tweets.csv')
print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
df.head()

In [None]:
# Function to parse sentiment labels
def parse_sentiment_labels(label_string):
    """Parse sentiment label string into probabilities."""
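    try:
        # Remove brackets, tolerate comma separators, then split on whitespace
        values = label_string.strip('[]').replace(',', ' ').split()
        return [float(val) for val in values]
    except (AttributeError, ValueError):
        # Non-string input (e.g. NaN) or a non-numeric token.
        # Note: get_sentiment_class maps this all-zero fallback to "negative" (index 0).
        return [0.0, 0.0, 0.0]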

def get_sentiment_class(probabilities):
    """Get sentiment class from probabilities [negative, neutral, positive]."""
    if not probabilities or len(probabilities) != 3:
        return "unknown"
    
    max_idx = probabilities.index(max(probabilities))
    classes = ["negative", "neutral", "positive"]
    return classes[max_idx]

# Parse sentiment labels and add columns
df['label_probabilities'] = df['labels'].apply(parse_sentiment_labels)
df['sentiment'] = df['label_probabilities'].apply(get_sentiment_class)
df['confidence'] = df['label_probabilities'].apply(lambda x: max(x) if x else 0)
df['negative_score'] = df['label_probabilities'].apply(lambda x: x[0] if len(x) > 0 else 0)
df['neutral_score'] = df['label_probabilities'].apply(lambda x: x[1] if len(x) > 1 else 0)
df['positive_score'] = df['label_probabilities'].apply(lambda x: x[2] if len(x) > 2 else 0)

print("\nSentiment distribution:")
print(df['sentiment'].value_counts())
print(f"\nAverage confidence score: {df['confidence'].mean():.3f}")
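
As a quick sanity check of the parsing step above, the sketch below walks one label string through the same logic. The sample string is invented for illustration; it assumes the `labels` column stores a bracketed, whitespace-separated triple ordered `[negative, neutral, positive]`, as the docstrings above state.

```python
# Standalone check of the assumed label format. The sample value is made up.
sample_label = "[0.12 0.30 0.58]"

# Same steps as parse_sentiment_labels: drop brackets, split, convert to float
probs = [float(v) for v in sample_label.strip("[]").split()]

# Same steps as get_sentiment_class: pick the class with the highest probability
classes = ["negative", "neutral", "positive"]
predicted = classes[probs.index(max(probs))]

print(probs)       # [0.12, 0.3, 0.58]
print(predicted)   # positive
print(max(probs))  # 0.58 (the value stored in the confidence column)
```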

## 2. Exploratory Data Analysis (EDA)

In [None]:
# Basic statistics
print("Dataset Overview:")
print(f"Total tweets analyzed: {len(df)}")
print(f"Unique users: {df['username'].nunique()}")
# df.index is the default row index from read_csv, not a timestamp
print(f"Row index range: {df.index.min() if len(df) > 0 else 'N/A'} to {df.index.max() if len(df) > 0 else 'N/A'}")
print(f"Average confidence score: {df['confidence'].mean():.3f}")

df.describe()

## 3. Sentiment Distribution Analysis

In [None]:
# Sentiment distribution visualization
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Pie chart of sentiment distribution
sentiment_counts = df['sentiment'].value_counts()
axes[0, 0].pie(sentiment_counts.values, labels=sentiment_counts.index, autopct='%1.1f%%', startangle=90)
axes[0, 0].set_title('Sentiment Distribution', fontsize=14, fontweight='bold')

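# Bar chart of sentiment distribution
# value_counts orders by frequency, so key the colors by label rather than position
sentiment_colors = {'negative': 'red', 'neutral': 'gray', 'positive': 'green'}
bar_colors = [sentiment_colors.get(label, 'blue') for label in sentiment_counts.index]
sentiment_counts.plot(kind='bar', ax=axes[0, 1], color=bar_colors)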
axes[0, 1].set_title('Sentiment Counts', fontsize=14, fontweight='bold')
axes[0, 1].set_xlabel('Sentiment')
axes[0, 1].set_ylabel('Count')
axes[0, 1].tick_params(axis='x', rotation=45)

# Confidence score distribution
axes[1, 0].hist(df['confidence'], bins=30, alpha=0.7, color='blue', edgecolor='black')
axes[1, 0].set_title('Confidence Score Distribution', fontsize=14, fontweight='bold')
axes[1, 0].set_xlabel('Confidence Score')
axes[1, 0].set_ylabel('Frequency')

# Sentiment scores comparison
sentiment_scores = df[['negative_score', 'neutral_score', 'positive_score']]
sentiment_scores.boxplot(ax=axes[1, 1])
axes[1, 1].set_title('Sentiment Score Distributions', fontsize=14, fontweight='bold')
axes[1, 1].set_ylabel('Score')
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

# Print key statistics
print("\nKey Findings:")
total_tweets = len(df)
neg_pct = (sentiment_counts.get('negative', 0) / total_tweets) * 100
neu_pct = (sentiment_counts.get('neutral', 0) / total_tweets) * 100
pos_pct = (sentiment_counts.get('positive', 0) / total_tweets) * 100

print(f"• Negative sentiment: {neg_pct:.1f}% ({sentiment_counts.get('negative', 0)} tweets)")
print(f"• Neutral sentiment: {neu_pct:.1f}% ({sentiment_counts.get('neutral', 0)} tweets)")
print(f"• Positive sentiment: {pos_pct:.1f}% ({sentiment_counts.get('positive', 0)} tweets)")

if neg_pct > 50:
    print(f"\n⚠️  ALERT: Negative sentiment dominates ({neg_pct:.1f}%), indicating HIGH public concern")
elif pos_pct > 50:
    print(f"\n✅ Positive sentiment dominates ({pos_pct:.1f}%), indicating favorable public opinion")
else:
    print(f"\n📊 Mixed sentiment distribution suggests divided public opinion")

## 4. Hashtag Analysis

In [None]:
# Hashtag sentiment analysis
hashtag_sentiment = defaultdict(lambda: {"negative": 0, "neutral": 0, "positive": 0})

for idx, row in df.iterrows():
    sentiment = row['sentiment']
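    hashtags_text = row.get('extract_hashtags', '')
    # Missing values load as NaN (a float), so only split real strings
    hashtags = hashtags_text.split() if isinstance(hashtags_text, str) else []
    
    for hashtag in hashtags:
        counts = hashtag_sentiment[hashtag]
        # Rows labeled "unknown" have no counter key; skip them instead of raising KeyError
        if sentiment in counts:
            counts[sentiment] += 1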

# Convert to DataFrame for easier analysis
hashtag_data = []
for hashtag, sentiments in hashtag_sentiment.items():
    total = sum(sentiments.values())
    if total >= 5:  # Filter hashtags with at least 5 mentions
        hashtag_data.append({
            'hashtag': hashtag,
            'total': total,
            'negative': sentiments['negative'],
            'neutral': sentiments['neutral'],
            'positive': sentiments['positive'],
            'neg_pct': (sentiments['negative'] / total) * 100,
            'neu_pct': (sentiments['neutral'] / total) * 100,
            'pos_pct': (sentiments['positive'] / total) * 100
        })

hashtag_df = pd.DataFrame(hashtag_data).sort_values('total', ascending=False)

# Visualize top hashtags
fig, axes = plt.subplots(2, 1, figsize=(15, 12))

# Top hashtags by volume
top_hashtags = hashtag_df.head(10)
axes[0].bar(range(len(top_hashtags)), top_hashtags['total'], color='skyblue')
axes[0].set_xticks(range(len(top_hashtags)))
axes[0].set_xticklabels([f"#{tag}" for tag in top_hashtags['hashtag']], rotation=45, ha='right')
axes[0].set_title('Top 10 Hashtags by Volume', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Number of Mentions')

# Sentiment breakdown for top hashtags
x = range(len(top_hashtags))
width = 0.25

axes[1].bar([i - width for i in x], top_hashtags['negative'], width, label='Negative', color='red', alpha=0.7)
axes[1].bar(x, top_hashtags['neutral'], width, label='Neutral', color='gray', alpha=0.7)
axes[1].bar([i + width for i in x], top_hashtags['positive'], width, label='Positive', color='green', alpha=0.7)

axes[1].set_xticks(x)
axes[1].set_xticklabels([f"#{tag}" for tag in top_hashtags['hashtag']], rotation=45, ha='right')
axes[1].set_title('Sentiment Breakdown for Top Hashtags', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Count')
axes[1].legend()

plt.tight_layout()
plt.show()

print("\nTop 10 Hashtags with Sentiment Analysis:")
for _, row in top_hashtags.iterrows():
    print(f"#{row['hashtag']}: Neg:{row['neg_pct']:.1f}% Neu:{row['neu_pct']:.1f}% Pos:{row['pos_pct']:.1f}% (n={row['total']})")
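
The row-by-row loop above can also be expressed with pandas reshaping. The following is a minimal sketch on toy data, assuming (as the loop does) that `extract_hashtags` holds space-separated hashtags; the rows are invented and the variable names `toy`, `exploded`, and `hashtag_table` are illustrative only.

```python
import pandas as pd

# Toy rows standing in for the notebook's df; column names match the notebook.
toy = pd.DataFrame({
    "sentiment": ["negative", "negative", "positive", "neutral"],
    "extract_hashtags": ["maandamano kenya", "maandamano", None, "kenya"],
})

# Split each hashtag string into a list, then emit one row per hashtag.
# Tweets without hashtags become NaN after explode and are dropped.
exploded = (
    toy.assign(hashtag=toy["extract_hashtags"].fillna("").str.split())
       .explode("hashtag")
       .dropna(subset=["hashtag"])
       .reset_index(drop=True)
)

# Cross-tabulate hashtag against sentiment to get per-class counts.
hashtag_table = pd.crosstab(exploded["hashtag"], exploded["sentiment"])
hashtag_table["total"] = hashtag_table.sum(axis=1)
print(hashtag_table.sort_values("total", ascending=False))
```

A class that never co-occurs with a hashtag (here `positive`) does not appear as a column, so downstream code should not assume all three columns exist.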

## 5. User Engagement Analysis

In [None]:
# User engagement patterns
user_stats = df.groupby('username').agg({
    'sentiment': ['count', lambda x: (x == 'negative').sum(), 
                  lambda x: (x == 'neutral').sum(), 
                  lambda x: (x == 'positive').sum()],
    'confidence': 'mean'
}).round(3)

user_stats.columns = ['total_tweets', 'negative_tweets', 'neutral_tweets', 'positive_tweets', 'avg_confidence']
user_stats = user_stats.reset_index()
user_stats['engagement_score'] = user_stats['total_tweets'] * user_stats['avg_confidence']

# Most active users
top_users = user_stats.nlargest(10, 'total_tweets')

fig, axes = plt.subplots(2, 1, figsize=(15, 12))

# Most active users
axes[0].bar(range(len(top_users)), top_users['total_tweets'], color='lightcoral')
axes[0].set_xticks(range(len(top_users)))
axes[0].set_xticklabels([f"@{user}" for user in top_users['username']], rotation=45, ha='right')
axes[0].set_title('Most Active Users (by Tweet Count)', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Number of Tweets')

# User sentiment patterns
x = range(len(top_users))
width = 0.25

axes[1].bar([i - width for i in x], top_users['negative_tweets'], width, label='Negative', color='red', alpha=0.7)
axes[1].bar(x, top_users['neutral_tweets'], width, label='Neutral', color='gray', alpha=0.7)
axes[1].bar([i + width for i in x], top_users['positive_tweets'], width, label='Positive', color='green', alpha=0.7)

axes[1].set_xticks(x)
axes[1].set_xticklabels([f"@{user}" for user in top_users['username']], rotation=45, ha='right')
axes[1].set_title('Sentiment Distribution for Most Active Users', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Tweet Count')
axes[1].legend()

plt.tight_layout()
plt.show()

print("\nUser Engagement Statistics:")
print(f"Total unique users: {len(user_stats)}")
print(f"Users with multiple tweets: {len(user_stats[user_stats['total_tweets'] > 1])}")
print(f"Average tweets per user: {user_stats['total_tweets'].mean():.2f}")
print(f"Most active user: @{top_users.iloc[0]['username']} ({top_users.iloc[0]['total_tweets']} tweets)")
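
The aggregation above relies on the order of unnamed lambdas when the columns are renamed. One possible alternative, sketched here on invented toy data, uses indicator columns and pandas named aggregation so that each output column is tied to its source explicitly. The names `toy`, `flags`, and `user_summary` are hypothetical.

```python
import pandas as pd

# Toy rows standing in for the notebook's df (values are made up).
toy = pd.DataFrame({
    "username": ["a", "a", "b", "b", "b"],
    "sentiment": ["negative", "positive", "negative", "negative", "neutral"],
    "confidence": [0.9, 0.6, 0.8, 0.7, 0.5],
})

# One 0/1 indicator column per class; reindex guarantees all three exist.
flags = (
    pd.get_dummies(toy["sentiment"])
      .reindex(columns=["negative", "neutral", "positive"], fill_value=0)
      .astype(int)
)

# Named aggregation: output_name=(source_column, function)
user_summary = (
    pd.concat([toy[["username", "confidence"]], flags], axis=1)
      .groupby("username")
      .agg(
          total_tweets=("confidence", "count"),
          negative_tweets=("negative", "sum"),
          neutral_tweets=("neutral", "sum"),
          positive_tweets=("positive", "sum"),
          avg_confidence=("confidence", "mean"),
      )
      .reset_index()
)
print(user_summary)
```

Note that `count` skips missing confidence values, so this matches the tweet count only when `confidence` has no NaN entries.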

## 6. Text Analysis and Key Terms

In [None]:
# Text pattern analysis by sentiment
from collections import Counter

sentiment_texts = df.groupby('sentiment')['lemmatized_text'].apply(list).to_dict()

# Extract key terms for each sentiment
sentiment_words = {}
for sentiment, texts in sentiment_texts.items():
    word_counts = Counter()
    for text in texts:
        if pd.notna(text):
            words = str(text).lower().split()
            for word in words:
                if len(word) > 3 and word.isalpha():  # Filter short words and non-alphabetic
                    word_counts[word] += 1
    sentiment_words[sentiment] = word_counts

# Visualize top words by sentiment
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
sentiments = ['negative', 'neutral', 'positive']
colors = ['red', 'gray', 'green']

for i, (sentiment, color) in enumerate(zip(sentiments, colors)):
    if sentiment in sentiment_words:
        top_words = dict(sentiment_words[sentiment].most_common(10))
        words = list(top_words.keys())
        counts = list(top_words.values())
        
        axes[i].barh(words, counts, color=color, alpha=0.7)
        axes[i].set_title(f'Top Words - {sentiment.capitalize()}', fontsize=12, fontweight='bold')
        axes[i].set_xlabel('Frequency')
        axes[i].invert_yaxis()

plt.tight_layout()
plt.show()

# Print key terms analysis
print("\nKey Terms Analysis:")
for sentiment in sentiments:
    if sentiment in sentiment_words:
        print(f"\n{sentiment.upper()} sentiment top words:")
        for word, count in sentiment_words[sentiment].most_common(10):
            print(f"  {word}: {count}")

## 7. Economic Impact Analysis

In [None]:
# Economic impact keywords analysis
business_keywords = ['business', 'shop', 'duka', 'economy', 'money', 'work', 'job', 'income', 'trade', 'market']
economic_keywords = ['cost', 'price', 'expensive', 'cheap', 'afford', 'salary', 'pay', 'buy', 'sell']
protest_impact_keywords = ['close', 'closed', 'shutdown', 'block', 'blocked', 'stop', 'stopped', 'cancel']

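def count_keywords(text, keywords):
    """Count how many distinct keywords appear as whole tokens in text."""
    if pd.isna(text):
        return 0
    # Match whole tokens so 'work' does not match 'network' and
    # 'closed' is not counted under both 'close' and 'closed'
    tokens = set(str(text).lower().split())
    return sum(1 for keyword in keywords if keyword in tokens)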

# Add economic impact columns
df['business_mentions'] = df['lemmatized_text'].apply(lambda x: count_keywords(x, business_keywords))
df['economic_mentions'] = df['lemmatized_text'].apply(lambda x: count_keywords(x, economic_keywords))
df['protest_impact_mentions'] = df['lemmatized_text'].apply(lambda x: count_keywords(x, protest_impact_keywords))
df['total_economic_mentions'] = df['business_mentions'] + df['economic_mentions'] + df['protest_impact_mentions']

# Economic impact by sentiment
economic_sentiment = df[df['total_economic_mentions'] > 0].groupby('sentiment').agg({
    'total_economic_mentions': ['count', 'sum'],
    'business_mentions': 'sum',
    'economic_mentions': 'sum',
    'protest_impact_mentions': 'sum'
}).round(2)

economic_sentiment.columns = ['economic_tweets', 'total_mentions', 'business_mentions', 'economic_mentions', 'protest_impact_mentions']

# Visualize economic impact
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Economic mentions by sentiment
if not economic_sentiment.empty:
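    # Key bar colors by sentiment label so a missing class cannot shift them
    econ_colors = [{'negative': 'red', 'neutral': 'gray', 'positive': 'green'}.get(s, 'blue') for s in economic_sentiment.index]
    economic_sentiment['economic_tweets'].plot(kind='bar', ax=axes[0, 0], color=econ_colors)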
    axes[0, 0].set_title('Economic-Related Tweets by Sentiment', fontsize=12, fontweight='bold')
    axes[0, 0].set_ylabel('Number of Tweets')
    axes[0, 0].tick_params(axis='x', rotation=45)

# Business impact distribution
business_tweets = df[df['business_mentions'] > 0]
if not business_tweets.empty:
    business_tweets['sentiment'].value_counts().plot(kind='pie', ax=axes[0, 1], autopct='%1.1f%%')
    axes[0, 1].set_title('Business-Related Tweets Sentiment', fontsize=12, fontweight='bold')

# Economic keyword frequency
keyword_data = {
    'Business': df['business_mentions'].sum(),
    'Economic': df['economic_mentions'].sum(),
    'Protest Impact': df['protest_impact_mentions'].sum()
}

axes[1, 0].bar(keyword_data.keys(), keyword_data.values(), color=['blue', 'orange', 'purple'], alpha=0.7)
axes[1, 0].set_title('Economic Keyword Categories', fontsize=12, fontweight='bold')
axes[1, 0].set_ylabel('Total Mentions')

# Sentiment vs Economic Impact
sentiment_economic = df.groupby('sentiment')['total_economic_mentions'].sum()
if not sentiment_economic.empty:
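    # Key bar colors by sentiment label so a missing class cannot shift them
    mention_colors = [{'negative': 'red', 'neutral': 'gray', 'positive': 'green'}.get(s, 'blue') for s in sentiment_economic.index]
    sentiment_economic.plot(kind='bar', ax=axes[1, 1], color=mention_colors)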
    axes[1, 1].set_title('Total Economic Mentions by Sentiment', fontsize=12, fontweight='bold')
    axes[1, 1].set_ylabel('Total Mentions')
    axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

# Economic impact statistics
total_tweets_with_economic = len(df[df['total_economic_mentions'] > 0])
economic_percentage = (total_tweets_with_economic / len(df)) * 100

print("\nEconomic Impact Analysis:")
print(f"Tweets mentioning economic terms: {total_tweets_with_economic} ({economic_percentage:.1f}%)")
print(f"Business-related mentions: {df['business_mentions'].sum()}")
print(f"Economic-related mentions: {df['economic_mentions'].sum()}")
print(f"Protest impact mentions: {df['protest_impact_mentions'].sum()}")

if not economic_sentiment.empty:
    print("\nEconomic sentiment breakdown:")
    for sentiment, row in economic_sentiment.iterrows():
        print(f"{sentiment.capitalize()}: {row['economic_tweets']} tweets")

## 8. Key Findings and Recommendations

In [None]:
# Generate comprehensive report
print("\n" + "="*60)
print("    MAANDAMANO MONDAYS SENTIMENT ANALYSIS REPORT")
print("="*60)

# Dataset overview
print("\n📊 DATASET OVERVIEW:")
print(f"• Total analyzed tweets: {len(df):,}")
print(f"• Unique users: {df['username'].nunique():,}")
print(f"• Unique hashtags: {len(hashtag_sentiment):,}")
print(f"• Average confidence score: {df['confidence'].mean():.3f}")

# Sentiment summary
sentiment_counts = df['sentiment'].value_counts()
total = len(df)

print("\n😊 SENTIMENT SUMMARY:")
for sentiment, count in sentiment_counts.items():
    percentage = (count / total) * 100
    print(f"• {sentiment.capitalize()} sentiment: {percentage:.1f}% ({count:,} tweets)")

# Determine dominant sentiment and concern level
neg_pct = (sentiment_counts.get('negative', 0) / total) * 100
pos_pct = (sentiment_counts.get('positive', 0) / total) * 100

if neg_pct > 50:
    dominant_sentiment = "NEGATIVE"
    concern_level = "HIGH CONCERN"
elif pos_pct > 50:
    dominant_sentiment = "POSITIVE"
    concern_level = "LOW CONCERN"
else:
    dominant_sentiment = "MIXED"
    concern_level = "MODERATE CONCERN"

print("\n🎯 KEY FINDINGS:")
print(f"• Dominant sentiment: {dominant_sentiment}")
print(f"• Public concern level: {concern_level}")
print(f"• Most mentioned hashtag: #{hashtag_df.iloc[0]['hashtag']} ({hashtag_df.iloc[0]['total']} mentions)")
print(f"• Economic-related tweets: {total_tweets_with_economic} ({economic_percentage:.1f}%)")

# Most active user
most_active = user_stats.loc[user_stats['total_tweets'].idxmax()]
print(f"• Most active user: @{most_active['username']} ({most_active['total_tweets']} tweets)")

print("\n💡 RECOMMENDATIONS:")
if neg_pct > 50:
    print("• HIGH PRIORITY: Address underlying issues causing negative sentiment")
    print("• Monitor economic impact on businesses, especially in protest areas")
    print("• Engage with public concerns through dialogue and policy adjustments")
    print("• Implement measures to minimize business disruption during protests")
elif pos_pct > 50:  # same threshold as the dominant-sentiment check above
    print("• Sentiment appears positive - continue current approach")
    print("• Maintain open communication channels with the public")
else:
    print("• Mixed sentiment requires balanced approach")
    print("• Focus on addressing specific concerns while maintaining stability")
    print("• Monitor sentiment trends for early warning signs")

print("\n📈 ECONOMIC IMPLICATIONS:")
print("• Business closures during protest days may impact the local economy")
print("• Negative sentiment may reduce consumer confidence and spending")
print("• Tourism and investment may be affected by prolonged unrest")
print("• Supply chain disruptions possible in affected areas")
print("• Government should consider economic support for affected businesses")

print("\n🔍 METHODOLOGY:")
print("• Sentiment analysis using Cardiff NLP Twitter XLM-RoBERTa model")
print("• Three-class classification: Negative, Neutral, Positive")
print("• Analysis includes hashtag patterns, user engagement, and economic terms")
print("• Data sourced from Twitter using keywords related to Maandamano protests")

print("\n" + "="*60)
print("               END OF REPORT")
print("="*60)

## 9. Next Steps: Processing Remaining Tweets

To complete the analysis, the remaining tweets in the dataset should be processed with the sentiment model. Use the following code to extend the analysis:

In [None]:
# Code to process remaining tweets (requires transformer libraries)
# This code shows how to extend the analysis to all tweets

print("To process remaining tweets, run the following:")
print("1. Ensure all tweets are cleaned and preprocessed")
print("2. Run the sentiment labeling model on unlabeled tweets")
print("3. Combine with existing labeled data for complete analysis")
print("4. Update visualizations with complete dataset")

# Load raw tweet data to check what's remaining
try:
    raw_tweets = pd.read_csv('data/tweets.csv')
    print(f"\nRaw tweets available: {len(raw_tweets)}")
    print(f"Currently labeled: {len(df)}")
    print(f"Remaining to process: {len(raw_tweets) - len(df)}")
except FileNotFoundError:
    print("Raw tweets file not found")

# Sample code for processing (requires transformers library)
sample_processing_code = '''
# Sample code for processing remaining tweets
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from scipy.special import softmax

# Load model
model_name = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Process remaining tweets
# ... (implement processing logic)
'''

print("\nSample processing code (requires transformer libraries):")
print(sample_processing_code)
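
The elided processing step could be structured along the following lines. This is a minimal sketch, not the author's pipeline: `softmax_rows`, `label_in_batches`, and `fake_scorer` are hypothetical helpers, and the class order `[negative, neutral, positive]` is taken from `get_sentiment_class` above. A stand-in scorer replaces the real model so the sketch runs without downloading anything; with an actual model, the scorer would tokenize each batch and return the model's logits, and the label order should be read from the model's configuration rather than assumed.

```python
import numpy as np

def softmax_rows(logits):
    """Convert each row of raw scores into probabilities that sum to 1."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def label_in_batches(texts, score_batch, batch_size=32):
    """Score texts batch by batch.

    score_batch is a hypothetical callable mapping a list of strings
    to an (n, 3) array of logits ordered [negative, neutral, positive].
    """
    labels = ["negative", "neutral", "positive"]
    results = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        probs = softmax_rows(score_batch(batch))
        for text, row in zip(batch, probs):
            results.append((text, labels[int(row.argmax())], float(row.max())))
    return results

# Stand-in scorer (assumption): returns fixed logits based on a keyword.
def fake_scorer(batch):
    return [[2.0, 0.0, 0.0] if "bad" in t else [0.0, 0.0, 2.0] for t in batch]

demo = label_in_batches(["bad day", "good day", "bad road"], fake_scorer, batch_size=2)
for text, label, confidence in demo:
    print(f"{text!r}: {label} ({confidence:.3f})")
```

Batching keeps memory use bounded when the unlabeled set is large; the batch size of 32 is an arbitrary default.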

## Conclusion

This sentiment analysis of Maandamano Mondays tweets shows a concerning level of negative sentiment, which suggests high public concern about the issues driving the protests, and it highlights potential economic impacts raised in the tweets.

### Key Takeaways:
1. **High Negative Sentiment**: Over 50% negative sentiment indicates significant public dissatisfaction
2. **Economic Concerns**: Tweets mention business, cost, and disruption keywords (see Section 7 for counts)
3. **Widespread Engagement**: Tweets come from many distinct users, measured by tweet counts per user in Section 5
4. **Need for Action**: Results suggest urgent need to address underlying issues

### Recommendations for Stakeholders:
- **Government**: Address root causes of public concern through policy reforms
- **Businesses**: Prepare contingency plans for protest periods
- **Civil Society**: Facilitate constructive dialogue between parties
- **Media**: Provide balanced coverage to avoid escalating tensions

This analysis provides a data-driven foundation for understanding public sentiment and informing decision-making processes.