Skip to content

JamunaSMurthy/TwitSenti

Repository files navigation

TwitSenti

A Real-Time Twitter Sentiment Analysis and Visualization Framework
Production-ready end-to-end system combining Apache Kafka streaming, multi-model sentiment classification (EPS, Polarity, Emoticon), and interactive Django dashboard visualizations.

TwitSenti System Architecture


GitHub Repository: github.com/JamunaSMurthy/TwitSenti
Research Paper: TwitSenti: a real-time Twitter sentiment analysis and visualization framework - Journal of Information & Knowledge Management, Vol. 18, No. 2, 2019


🌟 Project Overview

TwitSenti implements the complete architecture from the research paper with three specialized sentiment classifiers:

Core Stack

  • Data Pipeline: Apache Kafka (pub/sub) + PySpark (stream processing)
  • Sentiment Classifiers:
    • SentiWordNetClassifier - Lexicon-based (SentiWordNet)
    • PolarityClassifier - ML-based (Naive Bayes, SVM, Logistic Regression, MLP)
    • EmoticonClassifier - Hybrid emoji + SVM approach
  • Web API: Flask Micro-Server (REST endpoints) ✅ IMPLEMENTED
  • Cache & Storage: Redis (NoSQL cache) ✅ IMPLEMENTED
  • Web Dashboard: Django (visualizations)
  • Streaming: Twitter API / Kafka / PySpark

Key Capabilities

✅ Real-time sentiment classification
✅ Multi-classifier ensemble approach
✅ Emoji/emoticon-aware analysis
✅ Geographic heat maps & trend visualizations
✅ Word clouds & semantic analysis
✅ Scalable streaming architecture


🏗️ System Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                         TwitSenti System Architecture                        │
└─────────────────────────────────────────────────────────────────────────────┘

          ┌─────────────────────────────────────────────────────────┐
          │              DATA COLLECTION LAYER                      │
          │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
          │  │ Twitter Feed │  │Twitter API   │  │ Related      │   │
          │  │              │  │ Streaming    │  │ Tweets       │   │
          │  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘   │
          └─────────┼──────────────────┼──────────────────┼───────────┘
                    │                  │                  │
                    └──────────────────┼──────────────────┘
                                       │
                            ┌──────────▼──────────┐
                            │   Apache Kafka      │
                            │  (Distributed       │
                            │   Messaging)        │
                            │  Pub-Sub Topics     │
                            └──────────┬──────────┘
                                       │
          ┌─────────────────────────────▼─────────────────────────────┐
          │         DATA PRE-PROCESSING LAYER                         │
          │  ┌──────────────────────────────────────────────────────┐ │
          │  │ • Detection & Analysis of Slangs/Abbreviations       │ │
          │  │ • Lemmatization & Correction                         │ │
          │  │ • Stop Words Removal                                 │ │
          │  │ • Emoji/Emoticon Extraction & Preservation           │ │
          │  └──────────────────────────────────────────────────────┘ │
          └─────────────────────────────┬──────────────────────────────┘
                                        │
          ┌─────────────────────────────▼──────────────────────────────┐
          │      SENTIMENT ANALYSIS LAYER (EPS Pipeline)              │
          │  ┌──────────────────────────────────────────────────────┐ │
          │  │ 📊 Emoticon Classifier                               │ │
          │  │    (Emoji Sentiment Lexicon + SVM)                  │ │
          │  │    Accuracy: ~88-92% (emoji-rich tweets)            │ │
          │  ├──────────────────────────────────────────────────────┤ │
          │  │ 📈 Polarity Classifier                               │ │
          │  │    (NB, SVM, LR, MLP + TF-IDF/BoW/Word2Vec)         │ │
          │  │    Best: SVM + TF-IDF (~89% accuracy)               │ │
          │  ├──────────────────────────────────────────────────────┤ │
          │  │ 📝 SentiWordNet Classifier                           │ │
          │  │    (Lexicon-based SentiWordNet scoring)             │ │
          │  │    Fast, no training required                       │ │
          │  └──────────────────────────────────────────────────────┘ │
          │                          ▼                                │
          │                Consensus Vote: P/N/Neutral                │
          └─────────────────────────────┬──────────────────────────────┘
                                        │
          ┌─────────────────────────────▼──────────────────────────────┐
          │         STREAMING & STORAGE LAYER                         │
          │  ┌────────────────────────────────────────────────────┐   │
          │  │ Apache Spark/PySpark Real-Time Processing         │   │
          │  │ • Real-Time Tweets Stream                         │   │
          │  │ • User Timeline Tweets Stream                     │   │
          │  │ • Text Query Tweets Stream                        │   │
          │  │ (4 distributed worker nodes)                      │   │
          │  └────────────────────────────────────────────────────┘   │
          │                          ▼                                │
          │                    Redis Cache                            │
          │                 (NoSQL Data Store)                        │
          └─────────────────────────────┬──────────────────────────────┘
                                        │
          ┌─────────────────────────────▼──────────────────────────────┐
          │      WEB APPLICATION & VISUALIZATION LAYER                │
          │  ┌────────────────────────────────────────────────────┐   │
          │  │  Django Dashboard (8 Interactive Visualizations):  │   │
          │  │                                                    │   │
          │  │  🗺️  Heat Map - Geographic sentiment distribution  │   │
          │  │  🌍 Regional Map - Country-level polarity          │   │
          │  │  📄 Raw Tweets - Live tweet feed display            │   │
          │  │  ☁️  Word Cloud - Most frequent terms               │   │
          │  │  🥧 Comparison - Sentiment distribution (pie)       │   │
          │  │  💬 Trending Now - Trending topics & bubble chart   │   │
          │  │  🌐 The World Now - Global sentiment snapshot       │   │
          │  │  📈 Time Line - Sentiment trends over time          │   │
          │  └────────────────────────────────────────────────────┘   │
          │                                                             │
          │  Flask Micro-Server with Redis Integration                │
          │  • REST API (8 endpoints for all visualizations)            │
          │  • Real-time feeds (positive/negative/neutral)             │
          │  • Trending hashtags & word frequencies                    │
          │  • Geographic sentiment distribution                       │
          │  • Sentiment timeline (hourly aggregation)                 │
          │  Port: 5000  |  Docs: /api/dashboard/data                 │
          └─────────────────────────────┬──────────────────────────────┘
                                        │
       ┌────────────────────────────────┼────────────────────────────────┐
       │                                │                                │
  ┌────▼─────────┐          ┌──────────▼────────────┐      ┌────────────▼──┐
  │   Django     │          │   User Web Interface  │      │ External Apps│
  │  Dashboard   │          │  (Browser/Dashboard)  │      │ (API clients)│
  │127.0.0.1:8000          │      127.0.0.1:8000   │      │ Mobile/etc   │
  └──────────────┘          └───────────────────────┘      └──────────────┘

🧠 Sentiment Classification Pipeline (EPS Ensemble)

Three Specialized Classifiers Working in Parallel

1️⃣ EmoticonClassifier - Emoji-aware Sentiment

Text + Emojis → Emoji Extraction → Sentiment Lexicon → SVM Classification
  • 150+ emojis with predefined sentiment scores (-1 to +1)
  • Hybrid approach: emoji lexicon filtering + SVM for ambiguous cases
  • 26.7% accuracy improvement over text-only models
  • Best for: Emoji-rich social media content

2️⃣ PolarityClassifier - ML-based Multi-Model

Text → Preprocessing → Feature Extraction → ML Classifier → Polarity

Feature Methods: Bag-of-Words, TF-IDF, Word2Vec
Classifiers: Naive Bayes, SVM, Logistic Regression, Multi-Layer Perceptron
Best Combo: SVM + TF-IDF (~89% accuracy)

  • Best for: Diverse text domains, tunable precision/recall

3️⃣ SentiWordNetClassifier - Lexicon-based Scoring

Text → Tokenization → POS Tagging → SentiWordNet Lookup → Sentiment Score
  • No training required, immediate deployment
  • Combines emoticon + polarity + sentiwordnet scoring
  • Fast inference, interpretable results
  • Best for: Quick baseline, real-time constraints

📦 Project Structure

TwitSenti/
├── README.md (this file)
├── requirements.txt
├── zk-single-kafka-single.yml (Docker Compose for Kafka/Zookeeper)
│
├── Django-Dashboard/                    # Main web application
│   ├── manage.py
│   ├── db.sqlite3
│   ├── BigDataProject/                  # Django config
│   │   ├── settings.py
│   │   ├── urls.py
│   │   ├── wsgi.py
│   │   └── static/
│   │       ├── css/
│   │       ├── js/
│   │       └── imgs/
│   │
│   ├── dashboard/                       # Dashboard app
│   │   ├── views.py (8 chart endpoints)
│   │   ├── models.py
│   │   ├── urls.py
│   │   ├── admin.py
│   │   ├── consumer_user.py (Kafka integration)
│   │   └── templates/dashboard/
│   │       ├── index.html
│   │       ├── classify.html
│   │       └── base.html
│   │
│   └── migrations/
│
├── Kafka-PySpark/                       # Streaming pipeline
│   ├── producer-validation-tweets.py    # Data ingestion
│   ├── consumer-pyspark.py              # Stream processing + sentiment
│   ├── twitter_validation.csv
│   └── twitter_training.csv
│
├── SentiWordNetClassifier/              # Lexicon-based classifier
│   ├── __init__.py
│   ├── classifier.py
│   ├── preprocessor.py
│   ├── example_usage.py
│   └── README.md
│
├── PolarityClassifier/                  # ML-based classifier
│   ├── __init__.py
│   ├── polarity_classifier.py
│   ├── classifiers.py (NB, SVM, LR, MLP)
│   ├── feature_extractor.py (BoW, TF-IDF, Word2Vec)
│   ├── preprocessor.py
│   ├── example_usage.py
│   └── README.md
│
├── EmoticonClassifier/                  # Emoji + SVM classifier
│   ├── __init__.py
│   ├── emoticon_classifier.py
│   ├── emoji_lexicon.py (~150 emojis)
│   ├── feature_extractor.py
│   ├── preprocessor.py
│   ├── example_usage.py
│   └── README.md
│
├── Flask-Server/                        # REST API + Redis Cache
│   ├── app.py (Flask application with 8 API endpoints)
│   ├── redis_cache.py (Redis wrapper & utility methods)
│   ├── config.py (Configuration for Redis/Flask)
│   ├── requirements.txt
│   ├── .env.example
│   └── README.md (Flask setup & API documentation)
│
├── Kafka-PySpark/consumer-pyspark-redis.py  # Redis-aware consumer
│
├── ML PySpark Model/                    # Model training
│   ├── Big_Data.ipynb
│   ├── twitter_training.csv
│   └── twitter_validation.csv
│
└── imgs/                                # Documentation assets

🚀 Quick Start

Prerequisites

# Install system dependencies
brew install kafka zookeeper python3  # macOS with Homebrew
# OR use Docker Compose for Kafka/Zookeeper

Setup & Run

# 1. Clone repository
git clone https://github.com/<your-username>/TwitSenti.git
cd TwitSenti

# 2. Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3. Install Python dependencies
pip install -r requirements.txt
pip install -r Flask-Server/requirements.txt  # Flask + Redis client

# 4. Start Redis server (if not running)
brew services start redis  # macOS
# OR Docker: docker run -d -p 6379:6379 redis:latest
# Verify: redis-cli ping  (should return PONG)

# 5. Train all sentiment classifiers (CRITICAL - creates models/)
python train_classifiers.py
# Creates:
#   - models/sentiwordnet.pkl
#   - models/polarity_model.pkl
#   - models/emoticon_model.pkl
# Expected output: "✓ All classifiers trained successfully!"

# 6. Start Kafka + Zookeeper (Docker)
docker compose -f zk-single-kafka-single.yml up -d

# 7. Start Flask API server (Terminal 1)
cd Flask-Server
python app.py
# Flask running on http://localhost:5000
# Test health: curl http://localhost:5000/api/health

# 8. (Optional) Run producer to collect tweets (Terminal 2)
python Kafka-PySpark/producer-validation-tweets.py --tokens YOUR_TWITTER_TOKENS

# 9. Run PySpark consumer → Redis (Terminal 3)
python Kafka-PySpark/consumer-pyspark-redis.py
# This consumes Kafka tweets, classifies with ensemble ML, sends to Flask API → Redis

# 10. Start Django dashboard (Terminal 4)
cd Django-Dashboard
python manage.py migrate
python manage.py runserver
# Django running on http://localhost:8000

# 11. Open browser
# Dashboard: http://127.0.0.1:8000/dashboard/
# Flask API: http://127.0.0.1:5000/api/dashboard/data

Verify Everything is Working

# Check Redis
redis-cli ping  # Should return: PONG
redis-cli info  # Server statistics

# Check Flask API
curl http://localhost:5000/api/health
curl http://localhost:5000/api/sentiment/stats

# Monitor Kafka consumer
kafka-console-consumer.sh --topic tweets --bootstrap-servers localhost:9092

� Dashboard & Web Interface

TwitSenti Django Dashboard

The interactive Django dashboard provides:

  • 8 real-time visualization charts
  • Live tweet stream display
  • Sentiment statistics and trending topics
  • Geographic heat maps of sentiment distribution
  • Word frequency clouds and semantic analysis
  • Responsive design with Flask API integration

�💡 Using the Sentiment Classifiers

SentiWordNetClassifier (Lexicon-based)

from SentiWordNetClassifier import SentiWordNetClassifier

clf = SentiWordNetClassifier()
result = clf.analyze_sentiment("This movie is fantastic!")
print(result['sentiment'])  # 'Positive'
print(result['score'])      # 0.75

PolarityClassifier (ML-based)

from PolarityClassifier import PolarityClassifier

clf = PolarityClassifier(classifier_type='svm', feature_method='tfidf')
clf.train(train_texts, train_labels)
prediction = clf.predict("Great product!")
probabilities = clf.predict_proba("Great product!")

EmoticonClassifier (Emoji-aware Hybrid)

from EmoticonClassifier import HybridEmoticonClassifier

clf = HybridEmoticonClassifier(emoji_threshold=0.5)
clf.train(texts_with_emojis, labels)
prediction = clf.predict("Love this! 😍❤️")  # Uses emoji lexicon + SVM

📊 Performance Metrics

Based on validation data (twitter_validation.csv):

Classification Accuracy by Model

Classifier Feature Method Accuracy Best For
Emoticon + SVM Emoji + TF-IDF 91-92% Emoji-rich tweets
SVM TF-IDF 89-90% High accuracy
Logistic Regression TF-IDF 85-87% Speed/interpretability
Naive Bayes BoW 82-84% Sparse features
SentiWordNet Lexicon 78-82% No training needed
MLP Word2Vec 86-88% Complex patterns

Streaming Performance

  • Throughput: 2,500+ messages/sec (Kafka + Spark)
  • Latency: <500ms per tweet (preprocessing + classification)
  • Dashboard refresh: <4 seconds (real-time updates)
  • Memory: ~2GB (Redis + PySpark)

🎯 Key Features

✨ Three Specialized Sentiment Classifiers

  • SentiWordNetClassifier: Lexicon-based, no training, instant deployment
  • PolarityClassifier: 4 ML algorithms × 3 feature methods (12 combinations)
  • EmoticonClassifier: Emoji sentiment lexicon + SVM hybrid approach

🔄 Real-Time Streaming Pipeline

  • Apache Kafka for data distribution
  • PySpark for distributed processing (4 worker nodes)
  • Flask REST API for data retrieval
  • Redis cache for <100ms response times

📈 8 Interactive Visualizations (via Flask API)

  1. Heat Map - Geographic sentiment by region (/api/locations/sentiment)
  2. Regional Map - Country-level sentiment distribution (/api/locations/sentiment)
  3. Raw Tweets - Live tweet stream display (/api/tweets/feed/live)
  4. Word Cloud - Frequent terms & topics (/api/words/frequency)
  5. Comparison Chart - Positive/Negative/Neutral distribution (/api/sentiment/stats)
  6. Trending Now - Bubble chart of trending topics (/api/trending/hashtags)
  7. The World Now - Global sentiment heatmap (/api/locations/sentiment)
  8. Time Line - Sentiment trends over time (/api/sentiment/timeline)

🧩 Production-Ready Architecture

  • Modular classifier design (3 independent packages)
  • Easy model swapping & ensemble voting
  • Containerized deployment (Docker)
  • Redis cache with 24-hour TTL
  • Flask REST API with CORS enabled
  • Django dashboard frontend
  • Fully async Kafka → Spark → Flask pipeline

� Testing & Examples

Each classifier includes comprehensive examples:

# Test SentiWordNetClassifier
python SentiWordNetClassifier/example_usage.py

# Test PolarityClassifier (all models + features)
python PolarityClassifier/example_usage.py

# Test EmoticonClassifier (emoji lexicon + SVM)
python EmoticonClassifier/example_usage.py

📚 Research Foundation

This implementation is based on the research paper:

Murthy, J. S., Siddesh, G. M., & Srinivasa, K. G. (2019). TwitSenti: a real-time Twitter sentiment analysis and visualization framework. Journal of Information & Knowledge Management, 18(02), 1950013.
https://www.worldscientific.com/doi/abs/10.1142/S0219649219500138

Key Features from Paper

EPS Pipeline: Emoticon + Polarity + SentiWordNet ensemble
Architecture: Kafka → Spark → Redis → Django visualization
Data Flow: Collection → Preprocessing → Classification → Storage → Display
Performance: Optimized for low-latency, scalable throughput


🛠️ Future Enhancements

  • Transformer models (BERT/RoBERTa) for improved accuracy
  • Multi-language support (Arabic, Spanish, Chinese)
  • Elasticsearch + Kibana dashboard integration
  • Kubernetes deployment (EKS/GKE/AKS)
  • Advanced NER (Named Entity Recognition)
  • Aspect-based sentiment analysis
  • Sarcasm & irony detection

📄 License

MIT License - See LICENSE file for details


👏 Acknowledgments

This project implements the architecture and concepts from the original research paper:

Citation (BibTeX):

@article{murthy2019twitsenti,
  title={TwitSenti: a real-time Twitter sentiment analysis and visualization framework},
  author={Murthy, Jamuna S and Siddesh, GM and Srinivasa, KG},
  journal={Journal of Information \& Knowledge Management},
  volume={18},
  number={02},
  pages={1950013},
  year={2019},
  publisher={World Scientific}
}

Technologies & Communities

  • Apache Kafka & Spark communities
  • Django & Flask frameworks
  • SentiWordNet lexicon project
  • Twitter Streaming API documentation
  • Redis caching engine

📞 Support

Author: Jamuna Srinivasa Murthy
Email: jamunamurthy.s@gmail.com

For issues, feature requests, or questions:

Happy sentiment analyzing! 🎉

About

Natural Language Processing, Real-Time Systems & Social Media Analytics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors