# TWITTER SENTIMENT ANALYSIS

## Business Understanding
#### Business Overview

In today’s digital age, social media platforms such as Twitter have become powerful spaces where consumers freely express their opinions about brands, products, and user experiences.

For technology giants like Apple and Google, tweets represent an unfiltered stream of public perception that can influence reputation, marketing strategies, and customer loyalty. The abundance of this user-generated content provides a valuable opportunity for organizations to leverage **Natural Language Processing (NLP)** techniques to uncover actionable insights from unstructured text data.

This project uses a labeled dataset of tweets sourced from **CrowdFlower (via data.world)**, containing human-rated sentiments toward Apple and Google products. By analyzing this data, the project aims to build an automated system capable of classifying tweet sentiment, an essential step toward understanding customer emotions and brand perception.

#### Business Objectives

The overall business objective of this project is to build an automated system that can analyze and classify sentiments expressed in tweets related to Apple and Google products. The system will provide insights into how consumers perceive these brands and their respective products, allowing decision-makers to understand market sentiment trends and react accordingly.

Specifically, the project aims to:

* Identify and categorize public sentiments toward Apple and Google products as positive, negative, or neutral.

* Develop an NLP model that demonstrates the feasibility of automated sentiment analysis.

* Provide data-driven insights that can guide brand management, product improvement and customer engagement strategies.

* Enable future scalability where the approach can be extended to other brands or social media platforms.

#### Business Problem

Organizations such as Apple and Google receive continuous feedback from millions of users daily on social media. Manually analyzing this information to identify consumer attitudes is inefficient and impractical.

Tweets often contain informal language, slang and abbreviations, making traditional text analysis approaches insufficient.
Without an automated solution, it becomes challenging for companies to:

* Detect sudden shifts in customer sentiment,

* Identify negative feedback early enough to take corrective action, and

* Understand product-related discussions that could inform business strategy.

This project addresses the need for an automated sentiment classification model that can efficiently process text data and provide reliable, timely insights into customer sentiment.

#### Success Criteria

* Achieve acceptable model performance metrics (≥ 80% accuracy or balanced F1-score for binary classification).

* Produce interpretable and reproducible results.

* Ensure a well-documented and modular workflow following the CRISP-DM process: Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment.

* Deliver clear visualizations and concise insights to support decision-making.
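
To make the accuracy and F1 targets concrete, here is a minimal, dependency-free sketch of how the two metrics are computed for a binary task. The labels are invented purely for illustration, and `binary_metrics` is a hypothetical helper rather than part of this project's code; in practice a library such as scikit-learn would normally be used.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for two lists of binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


# Made-up labels: 1 = positive, 0 = negative
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1, 1, 0]

accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, f1={f1:.2f}")
```

Because F1 depends only on the positive class's precision and recall, it can fall well below accuracy when one class is rare, which is why both are listed in the criterion.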


## Data Understanding


This dataset contains Twitter/X posts (tweets) from the SXSW conference with sentiment analysis labels. The data has 3 columns:
1. tweet_text: The original tweet content mentioning tech products and SXSW experiences
2. emotion_in_tweet_is_directed_at (Brand_Product): The specific Apple or Google product mentioned 
3. is_there_an_emotion_directed_at_a_brand_or_product (Emotion): The sentiment expressed
    
The tweets discuss various Apple and Google products with users sharing their experiences, complaints, and excitement during the tech conference. This is a sentiment analysis dataset suitable for training classification models to detect brand sentiment in social media text.

In [2]:
!pip install pyspellchecker

Collecting pyspellchecker
  Downloading pyspellchecker-0.8.3-py3-none-any.whl (7.2 MB)
Installing collected packages: pyspellchecker
Successfully installed pyspellchecker-0.8.3


In [3]:
!pip install wordcloud

Collecting wordcloud
  Downloading wordcloud-1.9.4-cp38-cp38-win_amd64.whl (300 kB)
Installing collected packages: wordcloud
Successfully installed wordcloud-1.9.4


In [4]:
!pip install --upgrade wordcloud pillow

Requirement already up-to-date: wordcloud in c:\users\user\anaconda3\envs\learn-env\lib\site-packages (1.9.4)
Collecting pillow
  Downloading pillow-10.4.0-cp38-cp38-win_amd64.whl (2.6 MB)
Installing collected packages: pillow
  Attempting uninstall: pillow
    Found existing installation: Pillow 8.0.0
    Uninstalling Pillow-8.0.0:
      Successfully uninstalled Pillow-8.0.0
Successfully installed pillow-10.4.0


ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

fbprophet 0.7.1 requires cmdstanpy==0.9.5, which is not installed.


In [6]:
pip install --upgrade pillow


Requirement already up-to-date: pillow in c:\users\user\anaconda3\envs\learn-env\lib\site-packages (10.4.0)
Note: you may need to restart the kernel to use updated packages.


In [None]:
#Load the libraries
import re
import string
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from spellchecker import SpellChecker
from collections import Counter
from wordcloud import WordCloud

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords,wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer

import warnings
warnings.filterwarnings('ignore')


In [None]:
# Load dataset
data_df = pd.read_csv('judge-1377884607_tweet_product_company.csv', encoding='latin1')
data_df.head(10)

In [None]:
data_df.tail()

🔹 Observations

Dataset loaded successfully with 3 main columns.

Data represents tweets related to Apple and Google products.

Encoding changed to 'latin1' to handle special characters

In [None]:
data_df.info()

We have 9093 observations and 3 variables.
 - All 3 variables hold text, which pandas reports as the `object` dtype.

In [None]:
#checking the shape
print(f"Dataset contains {data_df.shape[0]} rows and {data_df.shape[1]} columns.")

In [None]:
#checking the columns
data_df.columns

In [None]:
#checking for null values
data_df.isna().sum()

In [None]:
#drop rows with two or more null values (thresh keeps rows that have at least 2 non-null columns)
data_df = data_df.dropna(thresh=data_df.shape[1] - 1).reset_index(drop=True)

In [None]:
data_df.shape

Only 1 row was dropped.

In [None]:
#checking duplicates
len(data_df[data_df.duplicated()])

In [None]:
#drop duplicates
data_df.drop_duplicates(keep = 'first', inplace = True)
data_df.shape

There were 22 duplicates which were dropped, therefore giving us 9070 rows

In [None]:
# Rename to simple names for easy use
df = data_df.rename(columns={
    data_df.columns[0]: 'tweet',
    data_df.columns[1]: 'product',
    data_df.columns[2]: 'sentiment'
})

# Check first few rows
df.head()

# Tweet Column

#### 1. Remove Twitter handles and hashtag symbols

In [None]:
# Create a copy
df['cleaned_tweet'] = (df['tweet'].astype(str))

In [None]:
# remove handles and strip the '#' from hashtags, keep the hashtag words
df['cleaned_tweet'] = (df['cleaned_tweet'].astype(str)
                                .str.replace(r'@\w+', '', regex=True)   
                                .str.replace(r'#', '', regex=True)      
                                .str.strip())


df[['tweet', 'cleaned_tweet']].head(10)

#### 2. Remove URLs

In [None]:
#remove URLs
df['cleaned_tweet'] = (
    df['cleaned_tweet']
      .str.replace(r'http\S+|www\S+', '', regex=True)  
      .str.replace(r'\s+', ' ', regex=True)            
      .str.strip()
)

# Show result so we can compare
df[['tweet', 'cleaned_tweet']].head(10)

#### 3. Remove punctuation, numbers and special characters

In [None]:
#remove everything except letters and spaces (punctuation such as ! and ? is removed too)
df['cleaned_tweet'] = (
    df['cleaned_tweet']
      .str.replace(r'[^a-zA-Z ]', '', regex=True)   
      .str.replace(r'\s+', ' ', regex=True)           
      .str.strip()
)

df[['tweet', 'cleaned_tweet']].tail(10)
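
Steps 1 to 3 can be traced on a single string. The sketch below applies the same regular expressions in the same order using only the standard library; `clean_text` and the sample tweet are hypothetical and exist only for illustration.

```python
import re


def clean_text(text):
    """Apply the handle, hashtag-symbol, URL and non-letter removal steps in order."""
    text = re.sub(r'@\w+', '', text)             # drop @handles
    text = text.replace('#', '')                 # drop '#', keep the hashtag word
    text = re.sub(r'http\S+|www\S+', '', text)   # drop URLs
    text = re.sub(r'[^a-zA-Z ]', '', text)       # keep only letters and spaces
    return re.sub(r'\s+', ' ', text).strip()     # collapse repeated whitespace


sample = "@mention Loving the new #iPad 2!! Details: http://example.com :)"
cleaned = clean_text(sample)
print(cleaned)
```

Note that the last pattern also removes apostrophes, so a contraction such as "don't" comes out as "dont" rather than being expanded.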

#### 4. Tokenization

In [None]:
# Tokenize and lowercasing
#df['cleaned_tweet'] = df['cleaned_tweet'].apply(lambda x: word_tokenize(x.lower()))

In [None]:
df['cleaned_tweet'] = df['cleaned_tweet'].apply(
    lambda x: word_tokenize(x.lower()) if isinstance(x, str) else x
)

In [None]:
df['cleaned_tweet'].head(10)

#### 5. Reduce elongated words

In [None]:
# Function to reduce elongation (e.g., "soooo" → "soo")
def reduce_elongation(word):
    return re.sub(r'(.)\1{2,}', r'\1\1', word)

# Apply elongation reduction to each token
df['cleaned_tweet'] = df['cleaned_tweet'].apply(lambda tokens: [reduce_elongation(w) for w in tokens])

# Preview results
df[['tweet', 'cleaned_tweet']].head(10)
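
A quick check of the elongation rule on a few made-up words shows what it does and does not do. The function is the same one used above; the word list is an assumption for illustration.

```python
import re


def reduce_elongation(word):
    # Any character repeated three or more times is cut down to two
    return re.sub(r'(.)\1{2,}', r'\1\1', word)


examples = ["soooo", "coooool", "good", "loooove", "www"]
reduced = [reduce_elongation(w) for w in examples]

for before, after in zip(examples, reduced):
    print(f"{before} -> {after}")
```

The rule caps repeats at two characters, so "coooool" becomes the real word "cool" while "loooove" becomes "loove" rather than "love". Words that legitimately contain a double letter, such as "good", are left unchanged.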


#### 6. Remove single characters

In [None]:
#remove single-character tokens (the column holds token lists at this point)
df['cleaned_tweet'] = df['cleaned_tweet'].apply(
    lambda tokens: [w for w in tokens if len(w) > 1]
)
# Check result
df[['tweet', 'cleaned_tweet']].head(10)

#### 7. Remove Stopwords

In [None]:
# Define stopwords and exceptions
stop_words = set(stopwords.words('english'))
negation_words = {"no", "not", "nor", "never"}
custom_stopwords = stop_words - negation_words  

# Remove stopwords directly from tokenized lists
df['cleaned_tweet'] = df['cleaned_tweet'].apply(
    lambda tokens: [word for word in tokens if word not in custom_stopwords]
)

# Check result
df[['tweet', 'cleaned_tweet']].head(10)
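
The reason for keeping negation words is easiest to see on one sentence. The sketch below uses a tiny hand-written stopword set as a stand-in for NLTK's English list (an assumption made so the example runs without downloads).

```python
# Tiny illustrative stopword set; the notebook itself uses NLTK's English list
stop_words = {"the", "is", "a", "this", "not", "no", "and", "do"}
negation_words = {"no", "not", "nor", "never"}
custom_stopwords = stop_words - negation_words   # negations are no longer stopwords

tokens = ["this", "phone", "is", "not", "good"]

all_removed = [w for w in tokens if w not in stop_words]
negation_kept = [w for w in tokens if w not in custom_stopwords]

print(all_removed)    # negation lost
print(negation_kept)  # negation preserved
```

With the full stopword list the sentence collapses to "phone good", which reads as positive; keeping "not" preserves the original negative meaning.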


In [None]:
# Check top 20 most frequent words
word_count = Counter()

for tokens in df['cleaned_tweet']:
    for word in tokens:
        word_count[word] += 1

# Display top 20 most frequent words
word_count.most_common(20)


#### Removing reject words

In [None]:
# Define reject words that add no sentiment meaning
reject_words = { 'link', 'sxsw', }

# Apply removal to your tokenized column
df['cleaned_tweet'] = df['cleaned_tweet'].apply(
    lambda tokens: [w for w in tokens if w not in reject_words]
)
# Preview
df[['tweet', 'cleaned_tweet']].head(10)

#### 8. Lemmatization

In [None]:

lemmatizer = WordNetLemmatizer()

# Helper to map POS tags to WordNet format
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # default to noun

# Apply POS tagging + lemmatization
df['cleaned_tweet'] = df['cleaned_tweet'].apply(
    lambda tokens: [
        lemmatizer.lemmatize(word, get_wordnet_pos(pos))
        for word, pos in pos_tag(tokens)
    ]
)

# Check results
df[['tweet', 'cleaned_tweet']].sample(10)
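
The POS-mapping helper can be checked in isolation. In the sketch below the WordNet constants are written out as plain one-letter strings, which is assumed to match `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV`; the tagged words are hand-written Penn Treebank examples, not output from `pos_tag`.

```python
# Assumed values of the NLTK WordNet POS constants
ADJ, VERB, NOUN, ADV = "a", "v", "n", "r"


def get_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS, defaulting to noun."""
    prefix_map = {"J": ADJ, "V": VERB, "N": NOUN, "R": ADV}
    return prefix_map.get(tag[:1], NOUN)


# Hand-written (word, tag) pairs for illustration
tagged = [("running", "VBG"), ("better", "JJR"), ("stores", "NNS"),
          ("quickly", "RB"), ("the", "DT")]
mapped = [(word, get_wordnet_pos(tag)) for word, tag in tagged]
print(mapped)
```

Passing the POS matters because the lemmatizer treats every word as a noun by default, so a verb form such as "running" would otherwise be left as is.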


#### 9. Spell Checking

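In [None]:
# Spell checking is currently disabled: the code below is wrapped in a string literal and does not run
"""# Initialize spellchecker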
spell = SpellChecker()

# Function to correct spelling for tokenized data
def correct_spelling(tokens):
    # Find misspelled words in the token list
    misspelled = spell.unknown(tokens)

    corrected_tokens = []
    for word in tokens:
        if word in misspelled:
            corrected_tokens.append(spell.correction(word))  # replace with corrected word
        else:
            corrected_tokens.append(word)  # keep as is

    return corrected_tokens

# Apply to your dataframe (tokenized column)
df['cleaned_tweet'] = df['cleaned_tweet'].apply(correct_spelling)

# Preview
df[['tweet', 'cleaned_tweet']].head(10)"""

In [None]:
# Join back into text for TF-IDF and model training
df['cleaned_tweet'] = df['cleaned_tweet'].apply(lambda tokens: ' '.join(tokens))
df[['tweet', 'cleaned_tweet']].head(10)

Overall Observations on the Tweet Column

* Tweets were cleaned and normalized by removing URLs, @mentions, the "#" symbol (hashtag words are kept), numbers, punctuation and emojis.

* Text was converted to lowercase for consistency across the dataset.

* Contractions were not expanded; apostrophes were removed with the other punctuation (e.g. "don't" → "dont").

* Extra spaces were collapsed and elongated words (e.g. "soooo") were shortened.

* Single-character tokens such as "i" and "g" were removed.

* Tokenization, stopword removal (keeping negation words) and lemmatization were applied, reducing words to their base forms.

* The words "link" and "sxsw" were removed because they add no sentiment meaning.

* Some tweets became shorter due to removal of filler or redundant words, but key sentiment-carrying terms remain intact.

* A spell-checking step is included but left disabled, so spelling errors were not corrected.

* The resulting text is considerably cleaner and more uniform.

In [None]:

 #Combine all tweets
all_words = ' '.join(df['cleaned_tweet'])
word_freq = Counter(all_words.split())

# WordCloud
plt.figure(figsize=(10,6))
plt.imshow(WordCloud(width=800, height=400, background_color='white').generate(all_words))
plt.axis('off')
plt.show()

In [None]:

# Build one long string; works whether rows are already-joined strings or token lists
all_words = ' '.join(
    t if isinstance(t, str) else ' '.join(t) for t in df['cleaned_tweet']
)

# Optional: get frequency counts
word_freq = Counter(all_words.split())

# Generate the WordCloud
plt.figure(figsize=(10,6))
plt.imshow(
    WordCloud(
        width=800,
        height=400,
        background_color='white'
    ).generate(all_words)
)
plt.axis('off')
plt.show()


Observations from Word Cloud

Most Dominant Terms:

* "sxsw" and "link" are overwhelmingly the most frequent words, indicating heavy use of the conference hashtag and URL sharing.

* "ipad" is the most mentioned product, appearing larger than "iphone," suggesting it was a hot topic (likely due to iPad 2 launch timing).

* "google" and "apple" are both prominently featured, confirming the Apple vs. Google product focus.

Context & Activity Words:

* "store," "popup," "launch," "opening" suggest discussion about physical retail events and product launches at SXSW.

* "social," "network," "app" reflect the social media and app-centric nature of conversations.

* "austin" appears frequently as the conference location.

Communication Patterns:

* "rt" (retweet) indicates significant content sharing and viral discussions.

* The prevalence of "link" suggests users were sharing articles, apps, and resources rather than just opinions.

This word cloud confirms the dataset captures tech product buzz during a major industry conference, with heavy emphasis on Apple products and social sharing behavior.


# Product Column

In [None]:
df['product'].value_counts(dropna=False).head(20)

In [None]:
# rename 'iPad or iPhone App' to 'app' and fold the "Other ..." categories into their parent brand
valuechange_map = {
    'iPad or iPhone App': 'app',
    'Other Apple product or service': 'apple',
    'Other Google product or service': 'google'
}

# Use replace (not map) so that values missing from the dictionary are left unchanged
df['product'] = df['product'].replace(valuechange_map)

In [None]:
df['product'] = df['product'].fillna('no_data')
def brand(row):
    """
    Assign a brand/product label to a tweet based on keywords in the 'tweet' column.

    Parameters:
    - row (pd.Series): A row of a Pandas DataFrame representing a tweet.

    Returns:
    - str: Brand category ('app', 'ipad', 'iphone', 'apple', 'google', 'android', 'pixel', 'playstore')
           or the original 'product' value if no keyword matches.
    """
    tweet = str(row['tweet']).lower()  # make it case-insensitive
    # Match 'app'/'apps' as a whole word so that 'apple' is not mistaken for 'app'
    has_app = re.search(r'\bapps?\b', tweet) is not None

    # Apple-related keywords
    if 'ipad' in tweet and has_app:
        return 'app'
    elif 'iphone' in tweet and has_app:
        return 'app'
    elif 'itunes' in tweet:
        return 'app'
    elif 'ipad' in tweet:
        return 'ipad'
    elif 'iphone' in tweet:
        return 'iphone'
    elif 'apple' in tweet:
        return 'apple'

    # Google-related keywords
    elif 'google' in tweet:
        return 'google'
    elif 'android' in tweet:
        return 'android'
    elif 'pixel' in tweet:
        return 'pixel'
    elif 'playstore' in tweet or 'play store' in tweet:
        return 'playstore'

    # If no match found, keep the existing product value
    else:
        return row['product']

# Apply the brand function to create a new 'product_updated' column
df['product_updated'] = df.apply(brand, axis=1)
df['product_updated'] = df['product_updated'].str.lower()


In [None]:
df["product_updated"].value_counts()

In [None]:
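# Drop rows that still have no product; copy so later column assignments act on an independent frame
df = df[df['product_updated'] != 'no_data'].copy()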

Observation

In this step, any missing values in the product column were filled with the placeholder "no_data" so that there are no blank entries in the data.

Next, a function called brand() was created to look at each tweet and work out which brand or product it is talking about.
The function searches for keywords such as "ipad", "iphone", "itunes", "apple", "google", "android", "pixel" and "playstore". Tweets that mention an iPad or iPhone together with "app", or that mention iTunes, are labeled "app".

Based on what it finds, the tweet is labeled with the matching brand name. If no keyword is found, the original value in the product column is kept.

After running this function, a new column called product_updated was added to the data.
This helped organize the brand information better and reduced the number of rows marked as "no_data"; the rows still marked "no_data" were then dropped.
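
Keyword lookups of this kind are sensitive to substrings: a plain `in` check for "app" also succeeds on any tweet containing "apple". The sketch below contrasts a substring test with a whole-word test; `has_word` and the sample tweets are hypothetical and only illustrate the difference.

```python
import re


def has_word(text, word):
    """True if `word` (or its plural) appears as a whole word in `text`."""
    pattern = rf'\b{re.escape(word)}s?\b'
    return re.search(pattern, text.lower()) is not None


tweet = "Picked up the new iPad at the Apple store"

substring_hit = 'app' in tweet.lower()    # matches inside 'apple'
whole_word_hit = has_word(tweet, 'app')   # requires a standalone 'app' or 'apps'

print(substring_hit, whole_word_hit)
print(has_word("Great iPhone apps at the conference", 'app'))
```

Which behaviour is appropriate depends on how strictly the "app" category should be defined, so it is worth checking a sample of labeled rows after applying the function.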

# Sentiment Column

In [None]:
#Check the sentiment value counts
df['sentiment'].value_counts(dropna=False)

In [None]:
#clean and standardize
df['sentiment_cleaned'] = (
    df['sentiment']
      .astype(str)
      .str.lower()
      .str.strip()
      .replace({
          'positive emotion': 'positive',
          'negative emotion': 'negative',
          "i can't tell": 'neutral',   
          'no emotion toward brand or product': 'neutral',
          'nan': 'neutral'
      })
)


df[['sentiment', 'sentiment_cleaned']].head(10)

In [None]:
#recheck value counts
df['sentiment_cleaned'].value_counts()

In [None]:
#map the value counts
df['sentiment_label'] = df['sentiment_cleaned'].map({'positive': 1,
                                                     'negative': 0,
                                                     'neutral':2 })

In [None]:
#the values of the cleaned sentiments
df[['sentiment_cleaned', 'sentiment_label']].head()

Observation

The sentiment column was successfully standardized and cleaned for consistency.

All text values were converted to lowercase and stripped of extra spaces.

Original long-form labels (e.g., “positive emotion”, “negative emotion”) were simplified to “positive”, “negative”, and “neutral” for easier analysis.

Ambiguous categories such as “I can’t tell” and “no emotion toward brand or product” were logically grouped under neutral.

Created a numerical mapping:

1 → positive

0 → negative

2 → neutral

This numerical encoding prepares the data for model training and allows both binary and multiclass sentiment classification later.

The value counts above show how the tweets are distributed across the three sentiment categories; any class imbalance should be taken into account during modeling.
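
As a small illustration of how this encoding supports both setups, the sketch below encodes a handful of made-up sentiment strings and then derives a binary subset by dropping the neutral class. The sample values are assumptions, not rows from the dataset.

```python
from collections import Counter

label_map = {'positive': 1, 'negative': 0, 'neutral': 2}

# Made-up cleaned sentiment values
sentiments = ['positive', 'neutral', 'negative', 'neutral', 'positive']

# Multiclass target: all three labels
labels = [label_map[s] for s in sentiments]

# Binary target: keep only positive (1) and negative (0)
binary_labels = [y for y in labels if y != 2]

print(labels)
print(binary_labels)
print(Counter(labels))  # class counts, useful for spotting imbalance
```

Filtering on the label value keeps the binary task's labels as 0 and 1 without any re-encoding.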

# Merged all cleaned columns

In [None]:
# Keep only necessary cleaned columns
dataset = df[['cleaned_tweet', 'product_updated', 'sentiment_label']].copy()

# Rename columns for clarity ('product_updated' keeps its name because later cells refer to it)
dataset = dataset.rename(columns={
    'cleaned_tweet': 'clean_tweet',
    'sentiment_label': 'clean_sentiment'
})

# Preview the cleaned dataset
dataset.head(10)

In [None]:
# Check missing values
dataset.isna().sum()

In [None]:
# Check for duplicate rows, then drop them
print(f"Duplicate rows: {dataset.duplicated().sum()}")
dataset = dataset.drop_duplicates().reset_index(drop=True)

In [None]:
dataset.head(10)

In [None]:
# cleaned dataset

dataset.to_csv('cleaned_twitter_dataset.csv', index=False)

# EDA
### Univariate analysis

In [None]:
# Reverse map the numeric labels back to words
sentiment_map = {
    1: 'Positive',
    0: 'Negative',
    2: 'Neutral'
}

# Create a readable sentiment column
dataset['sentiment_text'] = dataset['clean_sentiment'].map(sentiment_map)

# Check if mapping worked
dataset[['clean_tweet', 'sentiment_text']].head(10)


In [None]:
# Plot sentiment distribution as a donut (pie) chart
plt.figure(figsize=(8, 6))

# Extract sentiment counts and labels
sentiment_counts = dataset['sentiment_text'].value_counts()
labels = sentiment_counts.index
colors = sns.color_palette('muted')

# Plot donut-style pie chart
plt.pie(
    sentiment_counts,
    labels=labels,
    autopct="%.2f%%",
    startangle=90,
    wedgeprops=dict(width=0.5),
    colors=colors
)

plt.title("Sentiment Distribution of Tweets", fontsize=16, fontweight='bold')
plt.legend(labels=labels, title="Sentiment", loc="center left", bbox_to_anchor=(1, 0, 0.5, 1))
plt.tight_layout()
plt.show()


In [None]:
# plot brand frequencies in descending order (value_counts sorts from most to least frequent)

plt.figure(figsize=(12,8))
dataset['product_updated'].value_counts().plot(kind='bar')
plt.title("Mentioned Brands/Products in descending order")
plt.xlabel("Product / Brand")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.show()


#### Observation
The most tweeted-about brand is google, followed by app and then ipad, while the least mentioned is other google product or service.

In [None]:
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import seaborn as sns

# Split dataset by sentiment
sentiments = dataset['sentiment_text'].unique()

# Set up subplots for each sentiment
plt.figure(figsize=(14, 10))

for i, sentiment in enumerate(sentiments, 1):
    plt.subplot(3, 1, i)
    
    # Filter tweets for this sentiment
    words = ' '.join(dataset[dataset['sentiment_text'] == sentiment]['clean_tweet'])
    word_freq = Counter(words.split()).most_common(15)  # top 15 frequent words
    
    # Convert to DataFrame for easy plotting
    freq_df = pd.DataFrame(word_freq, columns=['word', 'count'])
    
    # Barplot
    sns.barplot(x='count', y='word', data=freq_df, palette='muted')
    plt.title(f"Top Words in {sentiment.capitalize()} Tweets", fontsize=14, fontweight='bold')
    plt.xlabel("Frequency")
    plt.ylabel("Word")

plt.tight_layout()
plt.show()


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Create the bigram vectorizer
vectorizer = CountVectorizer(ngram_range=(2,2), stop_words='english')
bigrams = vectorizer.fit_transform(dataset['clean_tweet'])

# Sum up bigram occurrences
bigram_sum = bigrams.sum(axis=0)
bigram_freq = [(word, bigram_sum[0, idx]) for word, idx in vectorizer.vocabulary_.items()]

# Convert to DataFrame and sort
bigram_df = pd.DataFrame(bigram_freq, columns=['Bigram', 'Frequency']).sort_values(by='Frequency', ascending=False)

# Select top 20
top_bigrams = bigram_df.head(20)

# Plot
plt.figure(figsize=(12,8))
sns.barplot(y='Bigram', x='Frequency', data=top_bigrams, palette='coolwarm')
plt.title('Top 20 Most Frequent Bigrams', fontsize=14, fontweight='bold')
plt.xlabel('Frequency')
plt.ylabel('Bigrams')
plt.tight_layout()
plt.show()


#### Observation
The most frequent bigram is "sxsw link", followed by "link sxsw" and then "apple store".

In [None]:
# Create trigram vectorizer
vectorizer = CountVectorizer(ngram_range=(3,3), stop_words='english')
trigrams = vectorizer.fit_transform(dataset['clean_tweet'])

# Sum up trigram occurrences
trigram_sum = trigrams.sum(axis=0)
trigram_freq = [(word, trigram_sum[0, idx]) for word, idx in vectorizer.vocabulary_.items()]

# Convert to DataFrame and sort
trigram_df = pd.DataFrame(trigram_freq, columns=['Trigram', 'Frequency']).sort_values(by='Frequency', ascending=False)

# Select top 20
top_trigrams = trigram_df.head(20)

# Plot
plt.figure(figsize=(10,6))
sns.barplot(y='Trigram', x='Frequency', data=top_trigrams, palette='coolwarm')
plt.title('Top 20 Most Frequent Trigrams', fontsize=14, fontweight='bold')
plt.xlabel('Frequency')
plt.ylabel('Trigrams')
plt.tight_layout()
plt.show()


#### Observation
The most frequent trigram in the dataset is 'new social network', closely followed by 'social network circle' and 'major new social'.
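
For intuition about what the bigram and trigram counts measure, the sketch below counts n-grams by hand with the standard library. It is a simplified stand-in for `CountVectorizer` (no stop-word filtering or tokenization rules), and the three short documents are invented for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up cleaned tweets
docs = [
    "new social network launch",
    "google new social network",
    "apple store popup",
]

bigram_counts = Counter(g for d in docs for g in ngrams(d.split(), 2))
trigram_counts = Counter(g for d in docs for g in ngrams(d.split(), 3))

print(bigram_counts.most_common(3))
print(trigram_counts.most_common(1))
```

Overlapping trigrams such as "new social network" and "social network circle" usually come from the same repeated phrase, so adjacent entries in the top-20 list are often not independent.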

### Bivariate analysis

In [None]:
# Use the same muted color palette from the pie chart
colors = sns.color_palette('muted')

# Plot sentiment distribution across brands
plt.figure(figsize=(12, 8))
sns.countplot(
    data=dataset,
    x='product_updated',
    hue='sentiment_text',       # use text labels for clearer legend
    palette=colors
)

# Titles and labels
plt.title("Sentiment Distribution Across Brands", fontsize=16, fontweight='bold')
plt.xlabel("Brand / Product", fontsize=12)
plt.ylabel("Tweet Count", fontsize=12)
plt.xticks(rotation=45)
plt.legend(title="Sentiment", loc='upper right')

plt.tight_layout()
plt.show()


In [None]:
# Separate tweets by sentiment
positive_tweets = ' '.join(dataset[dataset['clean_sentiment'] == 1]['clean_tweet'])
negative_tweets = ' '.join(dataset[dataset['clean_sentiment'] == 0]['clean_tweet'])
neutral_tweets  = ' '.join(dataset[dataset['clean_sentiment'] == 2]['clean_tweet'])

# Generate Word Clouds
fig, axes = plt.subplots(1, 3, figsize=(18,6))

for ax, text, title in zip(
    axes,
    [positive_tweets, negative_tweets, neutral_tweets],
    ['Positive', 'Negative', 'Neutral']
):
    wc = WordCloud(width=600, height=400, background_color='white', colormap='viridis').generate(text)
    ax.imshow(wc, interpolation='bilinear')
    ax.set_title(f"Top Words in {title} Tweets", fontsize=14, fontweight='bold')
    ax.axis('off')

plt.tight_layout()
plt.show()


In [None]:
sentiment_brand = dataset.groupby(['product_updated', 'clean_sentiment']).size().unstack(fill_value=0)
sentiment_brand.plot(kind='bar', stacked=True, figsize=(12,6), colormap='coolwarm')
plt.title("Sentiment Composition by Brand/Product")
plt.xlabel("Brand/Product")
plt.ylabel("Tweet Count")
plt.xticks(rotation=45)
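# Columns are sorted by label value (0, 1, 2), so the legend must follow that order
plt.legend(['Negative', 'Positive', 'Neutral'], title='Sentiment')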
plt.show()


In [None]:
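# tweet_length must exist before plotting: number of words in each cleaned tweet
dataset['tweet_length'] = dataset['clean_tweet'].apply(lambda x: len(str(x).split()))

plt.figure(figsize=(12,8))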
sns.boxplot(
    data=dataset,
    x='sentiment_text',
    y='tweet_length',
    hue='product_updated',
    palette='muted'   # or 'Set2', 'coolwarm', 'husl', etc.
)
plt.title("Tweet Length Distribution by Product and Sentiment", fontsize=14, fontweight='bold')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Step 1: Compute tweet length (number of words in each cleaned_tweet)
dataset['tweet_length'] = dataset['clean_tweet'].apply(lambda x: len(str(x).split()))

# Step 2: Define sentiment mapping (for labels)
sentiment_map = {0: 'Negative', 1: 'Positive', 2: 'Neutral'}
dataset['sentiment_text'] = dataset['clean_sentiment'].map(sentiment_map)

# Step 3: Define color palette consistent with previous plots
custom_palette = {
    'Positive': 'green',
    'Negative': 'red',
    'Neutral': 'gray'
}

# Step 4: Plot the boxplot
plt.figure(figsize=(10,6))
sns.boxplot(
    data=dataset,
    x='sentiment_text',
    y='tweet_length',
    palette=custom_palette
)

# Step 5: Customize appearance
plt.title("Tweet Length vs Sentiment", fontsize=14, fontweight='bold')
plt.xlabel("Sentiment")
plt.ylabel("Tweet Length (Word Count)")
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()


In [None]:
plt.figure(figsize=(14,8))
sns.boxplot(
    data=dataset,
    x='sentiment_text',
    y='tweet_length',
    hue='product_updated',
    palette='Set3'  # automatically assigns distinct colors per product
)
plt.title("Tweet Length Distribution by Product and Sentiment", fontsize=14, fontweight='bold')
plt.xlabel("Sentiment")
plt.ylabel("Tweet Length (Word Count)")
plt.legend(title='Product', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
