# Twitter Sentiment Analysis

This notebook walks through a complete Twitter sentiment analysis pipeline: collecting tweets from the live API, cleaning the text, labeling it with keyword rules, training and evaluating a classifier, and predicting on new text.

**Technologies used:** Tweepy, python-dotenv, Pandas, Scikit-learn, Joblib, Matplotlib, Seaborn

## Setup and Imports

Install the required libraries if you haven't already, then import them and load the environment variables from `.env`.

In [None]:
# Uncomment to install packages if running in a fresh environment
# %pip install tweepy python-dotenv pandas scikit-learn joblib matplotlib seaborn

# Standard library
import os
import re

# Third-party
import joblib
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import tweepy
from dotenv import load_dotenv
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Load API credentials from the .env file into the environment
load_dotenv()


## Fetch Tweets

Fetch tweets from the Twitter API using Tweepy. Make sure your `.env` file defines `API_KEY`, `API_SECRET`, `ACCESS_TOKEN`, and `ACCESS_SECRET`.

In [None]:
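# Read the credentials that load_dotenv() placed in the environment
API_KEY = os.getenv("API_KEY")
API_SECRET = os.getenv("API_SECRET")
ACCESS_TOKEN = os.getenv("ACCESS_TOKEN")
ACCESS_SECRET = os.getenv("ACCESS_SECRET")

auth = tweepy.OAuth1UserHandler(API_KEY, API_SECRET, ACCESS_TOKEN, ACCESS_SECRET)
api = tweepy.API(auth)

def fetch_tweets(query, count=100):
    """Fetch up to `count` English tweets matching `query` and save them to data/tweets.csv."""
    tweets = tweepy.Cursor(api.search_tweets, q=query, lang="en", tweet_mode='extended').items(count)
    # tweet_mode='extended' exposes the untruncated text as `full_text`
    tweet_data = [tweet.full_text for tweet in tweets]
    df = pd.DataFrame(tweet_data, columns=['text'])
    # Create the output directory if it does not exist yet
    os.makedirs('data', exist_ok=True)
    df.to_csv('data/tweets.csv', index=False)
    print(f"Saved {len(tweet_data)} tweets to data/tweets.csv")
    return df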

# Uncomment to fetch tweets live
# fetch_tweets("OpenAI", 100)
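
Before calling the API, it can help to confirm that all four credentials were actually loaded, since `os.getenv` silently returns `None` for a missing variable. The following is a minimal sketch and not part of the original notebook: the helper name `missing_credentials` is hypothetical, and it assumes the same four variable names used in the cell above.

```python
import os

# Variable names taken from the cell above
REQUIRED_VARS = ["API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET"]

def missing_credentials(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration; this is not a Tweepy function.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Example with an explicit dict instead of the real environment
example_env = {"API_KEY": "abc", "API_SECRET": "", "ACCESS_TOKEN": "xyz"}
print(missing_credentials(example_env))  # ['API_SECRET', 'ACCESS_SECRET']
```

Calling `missing_credentials()` with no arguments checks the real environment, so it could be run right after `load_dotenv()`.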


## Data Cleaning

Clean tweet text by removing URLs, mentions, the `#` symbol (the hashtag word itself is kept), and every remaining non-letter character, then lowercasing and collapsing whitespace.

In [None]:
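def clean_text(text):
    """Normalize a raw tweet into lowercase, letters-only text."""
    text = re.sub(r'http\S+', '', text)        # remove URLs
    text = re.sub(r'@\w+', '', text)           # remove @mentions
    text = re.sub(r'#', '', text)              # strip '#' but keep the hashtag word
    text = re.sub(r'[^A-Za-z\s]', '', text)    # drop digits, punctuation, emoji
    text = text.lower()
    text = re.sub(r'\s+', ' ', text).strip()   # collapse repeated whitespace
    return text

df = pd.read_csv('data/tweets.csv')
# Treat missing values as empty strings so clean_text always receives a str
df['cleaned'] = df['text'].fillna('').apply(clean_text)
df.to_csv('data/cleaned_tweets.csv', index=False)
df.head()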
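
To see what each substitution contributes, the sketch below applies the same regular expressions one step at a time to a made-up tweet. The helper `trace_cleaning` and the sample text are illustrative assumptions, not part of the original notebook.

```python
import re

# Same substitutions as clean_text above, applied one at a time
STEPS = [
    ("remove URLs", lambda t: re.sub(r'http\S+', '', t)),
    ("remove mentions", lambda t: re.sub(r'@\w+', '', t)),
    ("strip '#'", lambda t: re.sub(r'#', '', t)),
    ("keep letters and whitespace", lambda t: re.sub(r'[^A-Za-z\s]', '', t)),
    ("lowercase", str.lower),
    ("collapse whitespace", lambda t: re.sub(r'\s+', ' ', t).strip()),
]

def trace_cleaning(text):
    """Return (step name, text after that step) for every cleaning step."""
    results = []
    for name, step in STEPS:
        text = step(text)
        results.append((name, text))
    return results

sample = "Check https://t.co/abc @user #OpenAI is GREAT!!! 100%"  # made-up tweet
for name, text in trace_cleaning(sample):
    print(f"{name:28} -> {text!r}")

final_text = trace_cleaning(sample)[-1][1]
print(final_text)  # check openai is great
```

Note that the hashtag word `OpenAI` survives while the number `100%` is removed entirely, because only letters and whitespace are kept.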


## Exploratory Data Analysis (EDA)

Plot the 20 most frequent words in the cleaned tweets.

In [None]:
from collections import Counter
all_words = ' '.join(df['cleaned']).split()
word_freq = Counter(all_words)
common_words = word_freq.most_common(20)

words, counts = zip(*common_words)
plt.figure(figsize=(10,5))
sns.barplot(x=list(words), y=list(counts))
plt.title("Top 20 Most Common Words")
plt.xticks(rotation=45)
plt.show()
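
A raw word count like this tends to be dominated by function words such as "the" and "to". One possible refinement, sketched here under assumptions, is to skip stop words before counting. The `STOP_WORDS` set is a tiny hand-written placeholder and `top_words` is a hypothetical helper; the original notebook does no such filtering.

```python
from collections import Counter

# Tiny illustrative stop-word set; a real list would be much longer
STOP_WORDS = {"the", "a", "an", "to", "is", "and", "of", "in", "it", "i"}

def top_words(texts, n=20, stop_words=STOP_WORDS):
    """Count words across cleaned texts, skipping stop words."""
    counts = Counter(
        word
        for text in texts
        for word in text.split()
        if word not in stop_words
    )
    return counts.most_common(n)

# Made-up cleaned tweets
cleaned = [
    "i love the new model",
    "the model is great",
    "the update broke the app",
]
print(top_words(cleaned, n=1))  # [('model', 2)]
```

With the notebook's data, `top_words(df['cleaned'])` would return pairs that can be unpacked and plotted the same way as `common_words`.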


## Train Sentiment Model

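Label tweets with simple keyword rules, drop the tweets that match no keyword, and train a Logistic Regression classifier on the rest. Because the labels are derived from keywords in the same text the model sees, the reported accuracy mostly measures how well the classifier recovers those rules.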

In [None]:
def label_sentiment(text):
    """Rule-based label: 1 = positive, 0 = negative, 2 = neutral/unknown."""
    positive_keywords = ['love', 'amazing', 'wonderful', 'fantastic', 'helpful', 'great']
    negative_keywords = ['terrible', 'hate', 'worst', 'disappointed', 'bug', 'crash', 'bad']

    text_lower = text.lower()
    # Substring match; positive keywords take precedence when both kinds appear
    if any(word in text_lower for word in positive_keywords):
        return 1
    elif any(word in text_lower for word in negative_keywords):
        return 0
    else:
        return 2  # Neutral/unknown

df['sentiment'] = df['cleaned'].apply(label_sentiment)
# Keep only tweets that matched a keyword; copy so later edits don't touch a view
df = df[df['sentiment'] != 2].copy()

X = df['cleaned']
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LogisticRegression(solver='liblinear'))
])

pipeline.fit(X_train, y_train)
print(f"Training accuracy: {pipeline.score(X_train, y_train):.2f}")
print(f"Test accuracy: {pipeline.score(X_test, y_test):.2f}")

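# Create the output directory if needed, then persist the fitted pipeline
os.makedirs('models', exist_ok=True)
joblib.dump(pipeline, 'models/sentiment_model.pkl')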
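
The labeling rule uses substring tests, so a keyword such as `bug` also matches inside `debugging`, and `bad` matches inside `badge`. The sketch below contrasts that behavior with whole-word matching using `\b` word boundaries. Both helper functions and the sample text are illustrative assumptions rather than the notebook's method.

```python
import re

NEGATIVE_KEYWORDS = ['terrible', 'hate', 'worst', 'disappointed', 'bug', 'crash', 'bad']

def has_keyword_substring(text, keywords):
    """Substring test, as used by label_sentiment above."""
    text_lower = text.lower()
    return any(word in text_lower for word in keywords)

def has_keyword_whole_word(text, keywords):
    """Match keywords only as whole words, using regex word boundaries."""
    pattern = r'\b(?:' + '|'.join(map(re.escape, keywords)) + r')\b'
    return re.search(pattern, text.lower()) is not None

sample = "debugging with this badge reader was fun"  # made-up text
print(has_keyword_substring(sample, NEGATIVE_KEYWORDS))   # True ('bug' inside 'debugging')
print(has_keyword_whole_word(sample, NEGATIVE_KEYWORDS))  # False
```

Whole-word matching is stricter in both directions: it avoids false hits like the one above, but it also stops matching inflected forms such as "crashes" or "hated", so the keyword lists would need to include them explicitly.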


## Model Evaluation

Plot the confusion matrix for the predictions on the held-out test set.

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_pred = pipeline.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=[1,0])
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["Positive", "Negative"])
disp.plot(cmap='Blues')
plt.show()
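
Precision and recall for the positive class can be read directly off this matrix. With `labels=[1,0]`, rows are true labels and columns are predictions, with the positive class first. The sketch below works through the arithmetic on made-up counts; `precision_recall` is a hypothetical helper and the numbers are not results from this notebook.

```python
def precision_recall(cm):
    """Precision and recall for the positive class.

    Expects a 2x2 matrix laid out as with labels=[1, 0]:
    rows are true labels, columns are predictions, positive first.
    """
    tp, fn = cm[0]  # true positives, false negatives
    fp, tn = cm[1]  # false positives, true negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

example_cm = [[8, 2], [1, 9]]  # made-up counts
precision, recall = precision_recall(example_cm)
print(f"precision={precision:.2f}, recall={recall:.2f}")  # precision=0.89, recall=0.80
```

The same function accepts the NumPy array returned by `confusion_matrix`, since each row unpacks into two values.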


## Predict on New Text

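Pass new text to `predict_sentiment` to get a prediction from the saved model. The text is cleaned with `clean_text` first so that it matches the format the model was trained on.

In [None]:
def predict_sentiment(text, model_path='models/sentiment_model.pkl'):
    """Load the saved pipeline and classify a single piece of text."""
    model = joblib.load(model_path)
    # Apply the same cleaning used at training time
    prediction = model.predict([clean_text(text)])[0]
    sentiment_map = {0: "Negative", 1: "Positive"}
    return sentiment_map.get(prediction, "Neutral/Unknown")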

sample_text = "I love using OpenAI's tools!"
print(f"Text: {sample_text}")
print(f"Predicted Sentiment: {predict_sentiment(sample_text)}")


## Conclusion

You now have a full Twitter sentiment analysis pipeline from data collection to prediction.

Feel free to improve the labeling, add more data, or try different models!