# ChatGPT Tweets Sentiment Analysis - Complete Beginner's Guide

Welcome to the comprehensive analysis of ChatGPT tweets! In this notebook, we'll work with the `file.csv` dataset containing tweets about ChatGPT and their sentiment labels.

## What you'll learn:
- 📊 Data exploration and visualization
- 🧹 Text preprocessing for NLP
- 🤖 Multiple machine learning approaches
- 📈 Model comparison and evaluation
- 🔤 Word embeddings with Word2Vec

Let's start from the very beginning!

## 1. Import Required Libraries

First, let's import all the libraries we'll need for our analysis.

In [None]:
# Essential libraries for data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Text processing libraries
import re
import warnings
warnings.filterwarnings('ignore')

# Natural Language Processing
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Machine Learning libraries
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Advanced NLP (optional)
try:
    from gensim.models import Word2Vec
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    print("✓ Advanced NLP libraries loaded (Word2Vec available)")
except ImportError:
    print("⚠️ Gensim not available. Install with: pip install gensim")

# Download NLTK data
print("Downloading NLTK data...")
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('wordnet', quiet=True)
print("✓ NLTK data downloaded")

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")

print("🎉 All libraries imported successfully!")

## 2. Load the CSV File

Let's load our dataset and take a first look at what we're working with.

In [None]:
# Load the dataset
try:
    df = pd.read_csv('file.csv')
except FileNotFoundError:
    print("❌ file.csv not found. Please ensure the file is in the current directory.")
    raise  # Stop here: every later cell depends on df

print("✅ Dataset loaded successfully!")
print(f"📊 Dataset shape: {df.shape}")
print(f"📋 Columns: {df.columns.tolist()}")

# Display first few rows
print("\n🔍 First 5 rows of the dataset:")
df.head()

In [None]:
# Let's see what our data looks like
print("📝 Sample tweets and their labels:")
for i in range(3):
    print(f"\nTweet {i+1}:")
    # str() guards against missing (NaN) tweets, which are only dropped in Section 4
    print(f"Text: {str(df['tweets'].iloc[i])[:100]}...")
    print(f"Label: {df['labels'].iloc[i]}")

## 3. Explore the Dataset Structure

Now let's get a comprehensive understanding of our dataset.

In [None]:
# Basic dataset information
print("📊 DATASET OVERVIEW")
print("=" * 50)
print(f"Number of rows: {df.shape[0]:,}")
print(f"Number of columns: {df.shape[1]}")
print("\nColumn information:")
df.info()  # info() prints directly and returns None, so it should not be wrapped in print()

In [None]:
# Check data types and basic statistics
print("📈 BASIC STATISTICS")
print("=" * 50)
print(df.describe(include='all'))

In [None]:
# Analyze the sentiment labels
print("😊😐😞 SENTIMENT ANALYSIS")
print("=" * 50)

sentiment_counts = df['labels'].value_counts()
sentiment_percentages = df['labels'].value_counts(normalize=True) * 100

print("Sentiment distribution:")
for label, count in sentiment_counts.items():
    percentage = sentiment_percentages[label]
    print(f"{label}: {count:,} tweets ({percentage:.2f}%)")

# Let's see if the dataset is balanced
print("\nDataset balance:")
if sentiment_percentages.max() - sentiment_percentages.min() < 20:
    print("✅ Dataset is relatively balanced")
else:
    print("⚠️ Dataset is imbalanced - consider this in model training")
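
The balance check matters because, on an imbalanced dataset, a model that always predicts the most frequent label can already reach a high accuracy. Below is a minimal sketch of that majority-class baseline. The helper name `majority_baseline` and the toy labels are hypothetical and are not taken from `file.csv`.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    top_label, top_count = counts.most_common(1)[0]
    return top_label, top_count / len(labels)

# Toy labels for illustration only (assumption: three sentiment classes)
toy_labels = ["bad"] * 6 + ["good"] * 3 + ["neutral"] * 1

baseline_label, baseline_acc = majority_baseline(toy_labels)
print(f"Always predicting '{baseline_label}' gives accuracy {baseline_acc:.2f}")
```

Any model trained later should beat this baseline to be considered useful.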

## 4. Handle Missing Data

Let's check for any missing values and handle them appropriately.

In [None]:
# Check for missing values
print("🔍 MISSING DATA ANALYSIS")
print("=" * 50)

missing_data = df.isnull().sum()
print("Missing values per column:")
print(missing_data)

if missing_data.sum() > 0:
    print(f"\n⚠️ Total missing values: {missing_data.sum()}")
    print("We'll need to handle these missing values.")
else:
    print("\n✅ No missing values found!")

# Check for empty strings or very short tweets
empty_tweets = df['tweets'].str.len() < 10
print(f"\nTweets with less than 10 characters: {empty_tweets.sum()}")

In [None]:
# Clean the data
print("🧹 DATA CLEANING")
print("=" * 50)

# Remove rows with missing tweets or labels
initial_size = len(df)
df = df.dropna(subset=['tweets', 'labels'])
after_dropna = len(df)

# Remove very short tweets (less than 10 characters)
df = df[df['tweets'].str.len() >= 10].reset_index(drop=True)
final_size = len(df)

print(f"Initial dataset size: {initial_size:,}")
print(f"After removing missing values: {after_dropna:,}")
print(f"After removing short tweets: {final_size:,}")
print(f"Removed {initial_size - final_size:,} rows ({((initial_size - final_size)/initial_size)*100:.2f}%)")

## 5. Basic Data Analysis

Let's analyze the characteristics of our tweets.

In [None]:
# Text length analysis
df['tweet_length'] = df['tweets'].str.len()
df['word_count'] = df['tweets'].str.split().str.len()

print("📏 TEXT LENGTH ANALYSIS")
print("=" * 50)
print("Tweet length statistics (characters):")
print(df['tweet_length'].describe())

print("\nWord count statistics:")
print(df['word_count'].describe())

# Analysis by sentiment
print("\n📊 ANALYSIS BY SENTIMENT")
print("=" * 50)
for sentiment in df['labels'].unique():
    subset = df[df['labels'] == sentiment]
    print(f"\n{sentiment.upper()} tweets:")
    print(f"  Average length: {subset['tweet_length'].mean():.1f} characters")
    print(f"  Average words: {subset['word_count'].mean():.1f} words")

In [None]:
# Find most common words (simple analysis)
print("🔤 MOST COMMON WORDS")
print("=" * 50)

# Combine all tweets and count words
all_text = ' '.join(df['tweets'].str.lower())
words = re.findall(r'\b\w+\b', all_text)
word_freq = pd.Series(words).value_counts()

print("Top 20 most common words:")
for i, (word, count) in enumerate(word_freq.head(20).items(), 1):
    print(f"{i:2d}. {word}: {count:,}")

## 6. Data Visualization

Now let's create visualizations to better understand our data.

In [None]:
# Create comprehensive visualizations
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# 1. Sentiment distribution (bar chart)
sentiment_counts.plot(kind='bar', ax=axes[0,0], color=['lightgreen', 'lightcoral', 'lightblue'])
axes[0,0].set_title('Sentiment Distribution', fontsize=14, fontweight='bold')
axes[0,0].set_xlabel('Sentiment')
axes[0,0].set_ylabel('Number of Tweets')
axes[0,0].tick_params(axis='x', rotation=45)

# 2. Sentiment distribution (pie chart)
axes[0,1].pie(sentiment_counts.values, labels=sentiment_counts.index, autopct='%1.1f%%', 
              colors=['lightgreen', 'lightcoral', 'lightblue'])
axes[0,1].set_title('Sentiment Distribution (%)', fontsize=14, fontweight='bold')

# 3. Tweet length by sentiment
df.boxplot(column='tweet_length', by='labels', ax=axes[0,2])
axes[0,2].set_title('Tweet Length by Sentiment', fontsize=14, fontweight='bold')
axes[0,2].set_xlabel('Sentiment')
axes[0,2].set_ylabel('Tweet Length (characters)')

# 4. Tweet length distribution
axes[1,0].hist(df['tweet_length'], bins=50, alpha=0.7, color='skyblue', edgecolor='black')
axes[1,0].set_title('Distribution of Tweet Lengths', fontsize=14, fontweight='bold')
axes[1,0].set_xlabel('Tweet Length (characters)')
axes[1,0].set_ylabel('Frequency')

# 5. Word count by sentiment
df.boxplot(column='word_count', by='labels', ax=axes[1,1])
axes[1,1].set_title('Word Count by Sentiment', fontsize=14, fontweight='bold')
axes[1,1].set_xlabel('Sentiment')
axes[1,1].set_ylabel('Word Count')

# 6. Word count distribution
axes[1,2].hist(df['word_count'], bins=50, alpha=0.7, color='lightgreen', edgecolor='black')
axes[1,2].set_title('Distribution of Word Counts', fontsize=14, fontweight='bold')
axes[1,2].set_xlabel('Word Count')
axes[1,2].set_ylabel('Frequency')

fig.suptitle('')  # Clear the automatic "Boxplot grouped by labels" title that pandas adds
plt.tight_layout()
plt.savefig('chatgpt_tweets_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Additional analysis: Sentiment by word count ranges
print("📊 SENTIMENT ANALYSIS BY WORD COUNT RANGES")
print("=" * 50)

# Create word count bins (right-inclusive intervals; np.inf keeps very long tweets in the last bin)
df['word_range'] = pd.cut(df['word_count'], bins=[0, 10, 20, 30, np.inf],
                         labels=['1-10', '11-20', '21-30', '31+'])

# Cross-tabulation
sentiment_by_words = pd.crosstab(df['word_range'], df['labels'], normalize='index') * 100

print("Sentiment distribution by word count ranges (%):")
print(sentiment_by_words.round(2))

# Visualize this (pass ax so pandas draws on this figure instead of opening a second, empty one)
fig, ax = plt.subplots(figsize=(10, 6))
sentiment_by_words.plot(kind='bar', stacked=True, ax=ax)
plt.title('Sentiment Distribution by Word Count Ranges', fontsize=14, fontweight='bold')
plt.xlabel('Word Count Range')
plt.ylabel('Percentage')
plt.legend(title='Sentiment')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

## 7. Text Preprocessing

Before we can apply machine learning, we need to clean and preprocess our text data.

In [None]:
# Initialize preprocessing tools
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

def remove_special_characters(text):
    """Remove URLs, mentions, hashtags, and special characters"""
    if pd.isna(text):
        return ""
    
    text = str(text)
    # Remove URLs ('http\S+' already covers https links)
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove mentions (@username)
    text = re.sub(r'@\w+', '', text)
    # Remove hashtags (#hashtag)
    text = re.sub(r'#\w+', '', text)
    # Remove special characters and numbers
    text = re.sub(r'[^A-Za-z\s]', '', text)
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

def preprocess_text(text):
    """Complete text preprocessing pipeline"""
    if pd.isna(text):
        return ""
    
    # Convert to lowercase
    text = str(text).lower()
    
    # Remove special characters, URLs, mentions, hashtags
    text = remove_special_characters(text)
    
    # Tokenize and remove stopwords
    words = text.split()
    words = [word for word in words if word not in stop_words and len(word) > 2]
    
    # Apply stemming
    words = [ps.stem(word) for word in words]
    
    return ' '.join(words)

print("🧹 TEXT PREPROCESSING")
print("=" * 50)
print("Preprocessing tweets... This may take a moment.")

# Apply preprocessing
df['cleaned_tweets'] = df['tweets'].apply(preprocess_text)

# Remove tweets that became empty after preprocessing
df = df[df['cleaned_tweets'].str.len() > 0].reset_index(drop=True)

print("✅ Preprocessing complete!")
print(f"📊 Final dataset size: {len(df):,} tweets")

In [None]:
# Show preprocessing examples
print("📝 PREPROCESSING EXAMPLES")
print("=" * 50)

for i in range(3):
    print(f"\nExample {i+1}:")
    print(f"Original:  {df['tweets'].iloc[i][:100]}...")
    print(f"Cleaned:   {df['cleaned_tweets'].iloc[i][:100]}...")
    print(f"Label:     {df['labels'].iloc[i]}")
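
To see what each cleaning step contributes, the sketch below traces one made-up tweet through the same kinds of patterns used in `remove_special_characters`. It is a simplified illustration: `TOY_STOP_WORDS` is a tiny hand-written stand-in for NLTK's English stop-word list, the helper `trace_cleaning` is hypothetical, and stemming is left out so the example runs without any downloads.

```python
import re

# Tiny stand-in for NLTK's English stop-word list (assumption for illustration)
TOY_STOP_WORDS = {"the", "is", "and", "for", "this", "with", "just"}

def trace_cleaning(text):
    """Apply the cleaning steps one at a time and record each intermediate result."""
    steps = []
    text = text.lower()
    steps.append(("lowercase", text))
    text = re.sub(r'http\S+|www\S+', '', text)
    steps.append(("remove URLs", text))
    text = re.sub(r'@\w+', '', text)
    steps.append(("remove mentions", text))
    text = re.sub(r'#\w+', '', text)
    steps.append(("remove hashtags", text))
    text = re.sub(r'[^A-Za-z\s]', '', text)
    steps.append(("keep letters only", text))
    words = [w for w in text.split() if w not in TOY_STOP_WORDS and len(w) > 2]
    steps.append(("drop stop words and short words", ' '.join(words)))
    return steps

sample_tweet = "Just tried #ChatGPT and it is AMAZING!!! Thanks @OpenAI https://example.com 100%"
cleaning_steps = trace_cleaning(sample_tweet)
for step_name, value in cleaning_steps:
    print(f"{step_name:32s} -> {value!r}")

toy_cleaned = cleaning_steps[-1][1]
```

Note that the hashtag pattern removes the whole tag, so `#ChatGPT` disappears from the cleaned text rather than being kept as the word `chatgpt`.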

## 8. Machine Learning Models

Now let's build and compare different machine learning models for sentiment classification.

In [None]:
# Prepare data for machine learning
print("🤖 MACHINE LEARNING SETUP")
print("=" * 50)

X = df['cleaned_tweets']
y = df['labels']

# Split the data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {len(X_train):,}")
print(f"Test set size: {len(X_test):,}")

# Check class distribution in splits
print("\nClass distribution in training set:")
print(y_train.value_counts(normalize=True) * 100)
print("\nClass distribution in test set:")
print(y_test.value_counts(normalize=True) * 100)
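
The `stratify=y` argument keeps the label proportions roughly equal in the training and test sets. The following standard-library sketch shows the idea only; it is not scikit-learn's actual implementation, and `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Split row indices so each label keeps about the same share in train and test."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)                       # shuffle within each class
        n_test = round(len(indices) * test_size)   # take the same fraction per class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with a 50/30/20 class mix (assumption for illustration)
toy_labels = ["bad"] * 50 + ["good"] * 30 + ["neutral"] * 20
toy_train_idx, toy_test_idx = stratified_split(toy_labels)

toy_train_counts = Counter(toy_labels[i] for i in toy_train_idx)
toy_test_counts = Counter(toy_labels[i] for i in toy_test_idx)
print("train:", dict(toy_train_counts))
print("test: ", dict(toy_test_counts))
```

Both splits keep the 50/30/20 mix, which is what the class distribution printout above is meant to confirm for the real data.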

In [None]:
# Method 1: Bag of Words (CountVectorizer)
print("📊 METHOD 1: BAG OF WORDS")
print("=" * 50)

# Create Bag of Words features
cv = CountVectorizer(max_features=5000, ngram_range=(1, 2))
X_train_bow = cv.fit_transform(X_train)
X_test_bow = cv.transform(X_test)

print(f"Feature matrix shape: {X_train_bow.shape}")
print(f"Feature names example: {cv.get_feature_names_out()[:10]}")

# Train Random Forest with Bag of Words
rf_bow = RandomForestClassifier(n_estimators=100, random_state=42)
rf_bow.fit(X_train_bow, y_train)

# Make predictions
y_pred_bow = rf_bow.predict(X_test_bow)
accuracy_bow = accuracy_score(y_test, y_pred_bow)

print("\n🎯 Bag of Words + Random Forest Results:")
print(f"Accuracy: {accuracy_bow:.4f} ({accuracy_bow*100:.2f}%)")
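
With `ngram_range=(1, 2)`, `CountVectorizer` counts single words as well as adjacent word pairs. Here is a simplified pure-Python sketch of that counting on two made-up cleaned tweets. It ignores `max_features` and scikit-learn's own tokenizer, and the helper `extract_ngrams` is hypothetical, so treat it as an illustration only.

```python
from collections import Counter

def extract_ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams from n_min to n_max words long, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

# Two made-up cleaned tweets (assumption for illustration)
toy_docs = ["chatgpt write code", "chatgpt write essay"]

# Vocabulary: every distinct n-gram, sorted alphabetically
toy_doc_grams = [extract_ngrams(doc.split()) for doc in toy_docs]
toy_vocabulary = sorted({g for grams in toy_doc_grams for g in grams})

# One row per document, one column per vocabulary entry
toy_bow = []
for grams in toy_doc_grams:
    counts = Counter(grams)
    toy_bow.append([counts[term] for term in toy_vocabulary])

print(toy_vocabulary)
for row in toy_bow:
    print(row)
```

The two rows differ only in the columns for `code`, `essay`, `write code`, and `write essay`, which is the signal a classifier can use to tell the documents apart.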

In [None]:
# Method 2: TF-IDF with multiple algorithms
print("📊 METHOD 2: TF-IDF")
print("=" * 50)

# Create TF-IDF features
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

print(f"TF-IDF feature matrix shape: {X_train_tfidf.shape}")

# Define models to test
models = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'SVM': SVC(random_state=42)
}

# Train and evaluate each model
results = {}
predictions = {}

for name, model in models.items():
    print(f"\nTraining {name}...")
    
    # Train model
    model.fit(X_train_tfidf, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test_tfidf)
    accuracy = accuracy_score(y_test, y_pred)
    
    # Store results
    results[name] = accuracy
    predictions[name] = y_pred
    
    print(f"✅ {name} Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
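
TF-IDF down-weights words that appear in many tweets and up-weights words that are specific to a few. The sketch below computes it by hand, using the smoothed formula documented for scikit-learn's default settings, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. It covers unigrams only, and the helper `tfidf_rows` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Compute smoothed, L2-normalized TF-IDF weights for whitespace-tokenized docs."""
    tokenized = [doc.split() for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: in how many documents each term appears
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0   # avoid dividing by zero
        rows.append([v / norm for v in raw])
    return vocabulary, rows

# Made-up cleaned tweets (assumption for illustration)
toy_docs = ["chatgpt great", "chatgpt bad", "chatgpt great great"]
toy_vocab, toy_rows = tfidf_rows(toy_docs)
for doc, row in zip(toy_docs, toy_rows):
    weights = {t: round(w, 3) for t, w in zip(toy_vocab, row) if w > 0}
    print(f"{doc!r}: {weights}")
```

Because `chatgpt` appears in every toy document, it receives the lowest IDF, so `bad` outweighs it in the second document even though both words occur once.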

In [None]:
# Compare all models
print("🏆 MODEL COMPARISON")
print("=" * 50)

all_results = {
    'Bag of Words + Random Forest': accuracy_bow,
    **{f'TF-IDF + {name}': acc for name, acc in results.items()}
}

# Sort by accuracy
sorted_results = sorted(all_results.items(), key=lambda x: x[1], reverse=True)

print("Model Performance Ranking:")
for i, (model_name, accuracy) in enumerate(sorted_results, 1):
    print(f"{i}. {model_name}: {accuracy:.4f} ({accuracy*100:.2f}%)")

# Find best model
best_model_name = sorted_results[0][0]
best_accuracy = sorted_results[0][1]

print(f"\n🥇 Best Model: {best_model_name}")
print(f"🎯 Best Accuracy: {best_accuracy:.4f} ({best_accuracy*100:.2f}%)")

In [None]:
# Detailed analysis of the best TF-IDF model
best_tfidf_model = max(results, key=results.get)
best_model = models[best_tfidf_model]
best_predictions = predictions[best_tfidf_model]

print(f"📊 DETAILED ANALYSIS: {best_tfidf_model}")
print("=" * 50)

# Classification report
print("Classification Report:")
print(classification_report(y_test, best_predictions))

# Confusion matrix
cm = confusion_matrix(y_test, best_predictions)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=best_model.classes_, 
            yticklabels=best_model.classes_)
plt.title(f'Confusion Matrix - TF-IDF + {best_tfidf_model}', fontsize=14, fontweight='bold')
plt.xlabel('Predicted Sentiment')
plt.ylabel('Actual Sentiment')
plt.tight_layout()
plt.savefig('confusion_matrix.png', dpi=300, bbox_inches='tight')
plt.show()
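
The classification report and the confusion matrix are two views of the same counts. As a rough sketch on made-up predictions, the code below builds a confusion matrix by hand and derives per-class precision and recall from it. The helpers `confusion_counts` and `precision_recall` and the toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, labels):
    """Build a confusion matrix: rows are actual labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix

def precision_recall(matrix, k):
    """Precision and recall for the class at position k."""
    true_pos = matrix[k][k]
    predicted_k = sum(row[k] for row in matrix)   # column total: predicted as k
    actual_k = sum(matrix[k])                     # row total: actually k
    precision = true_pos / predicted_k if predicted_k else 0.0
    recall = true_pos / actual_k if actual_k else 0.0
    return precision, recall

# Made-up labels and predictions (assumption for illustration)
toy_classes = ["bad", "good", "neutral"]
toy_true = ["bad", "bad", "bad", "good", "good", "neutral", "neutral", "neutral"]
toy_pred = ["bad", "bad", "good", "good", "neutral", "neutral", "neutral", "bad"]

toy_cm = confusion_counts(toy_true, toy_pred, toy_classes)
for k, name in enumerate(toy_classes):
    p, r = precision_recall(toy_cm, k)
    print(f"{name:8s} precision={p:.2f} recall={r:.2f}")
```

Precision is read down a column (of everything predicted as a class, how much was right) and recall along a row (of everything that truly belongs to a class, how much was found).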

## 9. Word2Vec Analysis (Advanced)

Let's explore Word2Vec embeddings for deeper text understanding.

In [None]:
# Word2Vec Analysis
print("🔤 WORD2VEC ANALYSIS")
print("=" * 50)

try:
    # Prepare sentences for Word2Vec (list of word lists)
    sentences = [text.split() for text in df['cleaned_tweets'] if text.strip()]
    
    print(f"Preparing {len(sentences):,} sentences for Word2Vec training...")
    
    # Train Word2Vec model
    w2v_model = Word2Vec(
        sentences=sentences,
        vector_size=100,      # Dimensionality of word vectors
        window=5,             # Context window size
        min_count=5,          # Ignore words with frequency less than this
        workers=4,            # Number of worker threads
        epochs=10             # Number of training epochs
    )
    
    vocab_size = len(w2v_model.wv.key_to_index)
    print("✅ Word2Vec model trained successfully!")
    print(f"📚 Vocabulary size: {vocab_size:,} words")
    
    # Save the model
    w2v_model.save('chatgpt_word2vec.model')
    print("💾 Model saved as 'chatgpt_word2vec.model'")
    
except Exception as e:
    print(f"❌ Error training Word2Vec: {e}")
    print("Skipping Word2Vec analysis...")

In [None]:
# Explore word similarities
if 'w2v_model' in locals():
    print("🔍 WORD SIMILARITY ANALYSIS")
    print("=" * 50)
    
    # Test words related to our domain
    test_words = ['chatgpt', 'openai', 'good', 'bad', 'great', 'terrible', 'amazing', 'awful']
    available_words = [word for word in test_words if word in w2v_model.wv.key_to_index]
    
    print(f"Available test words: {available_words}")
    
    # Find similar words
    for word in available_words[:5]:  # Limit to first 5 to save space
        try:
            similar_words = w2v_model.wv.most_similar(word, topn=5)
            print(f"\n🔤 Words similar to '{word}':")
            for similar_word, similarity in similar_words:
                print(f"   {similar_word}: {similarity:.3f}")
        except Exception as e:
            print(f"   Error finding similarities for '{word}': {e}")
    
    # Most common words in vocabulary
    print("\n📊 Most common words in vocabulary:")
    word_counts = {}
    for sentence in sentences:
        for word in sentence:
            if word in w2v_model.wv.key_to_index:
                word_counts[word] = word_counts.get(word, 0) + 1
    
    top_words = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)[:10]
    for i, (word, count) in enumerate(top_words, 1):
        print(f"   {i:2d}. {word}: {count:,} occurrences")
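
The similarity scores printed by `most_similar` are cosine similarities between word vectors. A minimal sketch of that calculation follows, using made-up 3-dimensional vectors in place of the trained 100-dimensional ones. The names `cosine_similarity`, `toy_most_similar`, and `toy_vectors` are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration; real values come from the trained model
toy_vectors = {
    "great":    [0.9, 0.1, 0.3],
    "amazing":  [0.8, 0.2, 0.4],
    "terrible": [-0.7, 0.6, 0.1],
}

def toy_most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to the query word."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

for other, score in toy_most_similar("great", toy_vectors):
    print(f"{other}: {score:.3f}")
```

Words used in similar contexts end up with vectors pointing in similar directions, which is why their cosine similarity is close to 1.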

## 10. Results Summary and Next Steps

Let's summarize our findings and suggest next steps.

In [None]:
# Final summary
print("📋 ANALYSIS SUMMARY")
print("=" * 60)

print("📊 Dataset Overview:")
print(f"   • Total tweets analyzed: {len(df):,}")
# map(str, ...) keeps the join working even if the labels are not stored as strings
print(f"   • Sentiment classes: {df['labels'].nunique()} ({', '.join(map(str, df['labels'].unique()))})")
print(f"   • Average tweet length: {df['tweet_length'].mean():.1f} characters")
print(f"   • Average word count: {df['word_count'].mean():.1f} words")

print("\n🎯 Model Performance:")
for model_name, accuracy in sorted_results:
    print(f"   • {model_name}: {accuracy:.4f} ({accuracy*100:.2f}%)")

print("\n🥇 Best Performing Model:")
print(f"   • {best_model_name}")
print(f"   • Accuracy: {best_accuracy:.4f} ({best_accuracy*100:.2f}%)")

if 'vocab_size' in locals():
    print("\n🔤 Word2Vec Model:")
    print(f"   • Vocabulary size: {vocab_size:,} words")
    print(f"   • Vector dimensions: {w2v_model.wv.vector_size}")
    print("   • Model saved as: chatgpt_word2vec.model")

print("\n💾 Generated Files:")
print("   • chatgpt_tweets_analysis.png (data visualizations)")
print("   • confusion_matrix.png (model evaluation)")
if 'w2v_model' in locals():
    print("   • chatgpt_word2vec.model (Word2Vec embeddings)")

In [None]:
# Suggestions for next steps
print("🚀 NEXT STEPS & IMPROVEMENTS")
print("=" * 60)

print("1. 📈 Model Improvements:")
print("   • Try deep learning models (LSTM, BERT)")
print("   • Experiment with different preprocessing techniques")
print("   • Use cross-validation for more robust evaluation")
print("   • Handle class imbalance if present")

print("\n2. 🔍 Feature Engineering:")
print("   • Add sentiment-specific features (emoji, punctuation)")
print("   • Try different n-gram ranges")
print("   • Experiment with different vectorization parameters")

print("\n3. 📊 Advanced Analysis:")
print("   • Topic modeling with LDA")
print("   • Temporal analysis (if timestamps available)")
print("   • User behavior analysis")
print("   • Hashtag and mention analysis")

print("\n4. 🔤 Word2Vec Applications:")
print("   • Document clustering using embeddings")
print("   • Visualize word relationships with t-SNE")
print("   • Create custom sentiment lexicons")

print("\n✅ Analysis Complete!")
print("You now have a comprehensive understanding of your ChatGPT tweets dataset!")
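
As a starting point for the cross-validation suggestion above, here is a rough sketch of how k-fold splitting divides the rows so that every row is used for testing exactly once. It is deliberately simplified (contiguous folds, no shuffling, no stratification), and the helper `k_fold_indices` is hypothetical. In practice, scikit-learn's `StratifiedKFold` or `cross_val_score` would be the usual tools.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

toy_folds = list(k_fold_indices(10, k=3))
for fold_number, (train, test) in enumerate(toy_folds, 1):
    print(f"Fold {fold_number}: train={train} test={test}")
```

Averaging a model's accuracy over the folds gives a steadier estimate than the single 80/20 split used in this notebook.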