# Sentiment Analysis of Tweets about Apple and Google Products

### Problem statement


This notebook builds an NLP model to classify sentiment in tweets directed at Apple and Google products.  


### Libraries

In [82]:
# pandas: for data handling
import pandas as pd

# re: Python's built-in library for regular expressions (used for text cleaning)
import re

# nltk: Natural Language Toolkit, useful for tokenization, stopword removal, and lemmatization
import nltk

nltk.download("punkt")        # Punkt tokenizer models used by word_tokenize

nltk.download("punkt_tab")    # tabular Punkt data that newer NLTK releases load instead of "punkt"

nltk.download("wordnet")      # lexical database for lemmatization 

nltk.download("omw-1.4")      # WordNet data for multiple languages

nltk.download("stopwords")    # common words to filter out (e.g., "the", "is")

# Import stopwords list from nltk (words to ignore during analysis)
from nltk.corpus import stopwords

# Import tokenizer to split text into individual words
from nltk.tokenize import word_tokenize

# Import lemmatizer to reduce words to their dictionary form (e.g., "days" → "day"; verb forms such as "running" need pos="v")
from nltk.stem import WordNetLemmatizer

# TfidfVectorizer: convert text data into numerical features using TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

# train_test_split: split data into training and testing sets for model evaluation
from sklearn.model_selection import train_test_split


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\A808865\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\A808865\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\A808865\AppData\Roaming\nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\A808865\AppData\Roaming\nltk_data...
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\A808865\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Loading Data

In [83]:
# Read the CSV file using Latin-1 encoding
# (forward slash in the path: "\j" inside a normal string literal is an invalid escape sequence)
df = pd.read_csv('Data/judge-1377884607_tweet_product_company.csv', encoding='Latin-1')

# Displaying the first 5 rows of the dataset
df.head()

|   | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


## Exploratory Data Analysis (EDA)

- To better understand the dataset and prepare it for sentiment analysis, we focus on the following checks:
    - Preview the data: inspect the first few rows to grasp the dataset’s structure.
    - Detect missing values that could introduce bias or cause errors during preprocessing and modeling.
    - Identify and remove duplicate tweets so that repeated entries are not overrepresented and do not distort the sentiment model.
    - Review the balance of sentiment categories, since skewed classes can yield models that favor the majority class and perform poorly on minority classes.
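
The duplicate and class-balance checks listed above can be sketched on a small stand-in DataFrame. The `sample` frame, its `sentiment` column, and its rows are made up for illustration and are not taken from the dataset; on the real data the same calls would be applied to `df` and its label column.

```python
import pandas as pd

# Hypothetical stand-in for the tweet DataFrame (not the real dataset).
sample = pd.DataFrame({
    "tweet_text": [
        "Love my new iPad",
        "Love my new iPad",          # exact duplicate of the row above
        "Android update is slow",
        None,                        # missing tweet text
        "Google Maps saved my trip",
    ],
    "sentiment": ["Positive", "Positive", "Negative", "Neutral", "Positive"],
})

# Missing values per column.
missing_counts = sample.isna().sum()

# Number of rows whose tweet_text repeats an earlier row.
n_duplicates = int(sample.duplicated(subset=["tweet_text"]).sum())

# Drop rows without text, then repeated tweets, keeping the first occurrence.
deduped = (
    sample.dropna(subset=["tweet_text"])
          .drop_duplicates(subset=["tweet_text"])
          .reset_index(drop=True)
)

# Share of each sentiment label after deduplication.
class_share = deduped["sentiment"].value_counts(normalize=True)

print(missing_counts)
print("duplicates:", n_duplicates)
print(class_share)
```

Passing `normalize=True` reports proportions rather than counts, which makes an imbalance easier to spot at a glance.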

In [84]:
# Shape of the dataset (rows, columns)
df.shape

(9093, 3)

In [85]:
# Basic information about the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


#### Handling Missing Values

In [86]:
# Drop rows where the tweet text is missing: a tweet with no text cannot be used for sentiment analysis.
df.dropna(subset=["tweet_text"], inplace=True)

# For emotion_in_tweet_is_directed_at, replace missing values with "Unknown" instead of dropping the rows.
# This keeps the tweet and records that the target of the emotion is not specified.
df.fillna({'emotion_in_tweet_is_directed_at': 'Unknown'}, inplace=True)

# Reset index
df.reset_index(drop=True, inplace=True)


In [87]:
df.isna().sum()

tweet_text                                            0
emotion_in_tweet_is_directed_at                       0
is_there_an_emotion_directed_at_a_brand_or_product    0
dtype: int64

### Step 3: Basic Text Cleaning

In [88]:
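def clean_tweet_text(text):
    """Strip URLs, mentions, hashtags, and non-letter characters from a tweet."""
    # Remove URLs ("http\S+" also matches https links)
    text = re.sub(r"http\S+|www\S+", "", text)
    # Remove user mentions (e.g., @wesley83)
    text = re.sub(r"@\w+", "", text)
    # Remove hashtags, including the tag word itself (e.g., #SXSW)
    text = re.sub(r"#\w+", "", text)
    # Remove everything except letters and whitespace (digits, punctuation, symbols)
    text = re.sub(r"[^A-Za-z\s]", "", text)
    return text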

df["clean_text"] = df["tweet_text"].apply(clean_tweet_text)
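
Because the last substitution deletes characters outright, words separated only by punctuation are fused: "iPad/iPhone" turns into "iPadiPhone", as seen in the processed output later in the notebook. The sketch below is one possible variant, not the notebook's method: `clean_tweet_text_spaced` is a hypothetical name, and it substitutes a space for each removed span before collapsing repeated whitespace.

```python
import re

def clean_tweet_text_spaced(text):
    """Hypothetical variant of clean_tweet_text that keeps word boundaries."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"@\w+", " ", text)                    # user mentions
    text = re.sub(r"#\w+", " ", text)                    # hashtags (the tag word is dropped too)
    text = re.sub(r"[^A-Za-z\s]", " ", text)             # digits and punctuation become spaces
    return re.sub(r"\s+", " ", text).strip()             # collapse repeated whitespace

example = "@jessedee Awesome iPad/iPhone app at #SXSW http://example.com"
cleaned = clean_tweet_text_spaced(example)
print(cleaned)  # Awesome iPad iPhone app at
```

This is a trade-off rather than a strict improvement: contractions are split as well ("isn't" becomes "isn t"), so the choice depends on how apostrophes should be handled downstream.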


### Step 4: Tokenization

In [89]:
df["tokens"] = df["clean_text"].apply(word_tokenize)

### Step 5: Stopword Removal

In [90]:
stop_words = set(stopwords.words("english"))

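# Note: the NLTK stop-word list is lowercase and the tokens are not lowercased here,
# so capitalized stop words such as "I" or "After" are kept.
df["tokens"] = df["tokens"].apply(lambda x: [word for word in x if word not in stop_words])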
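
A case-insensitive filter is one way to also drop capitalized stop words. The sketch below is an illustration under assumptions: `remove_stop_words` is a hypothetical helper, and `demo_stop_words` is a tiny hand-written set used so that the example runs without downloading NLTK data. In the notebook, the set would come from `stopwords.words("english")`.

```python
# Tiny hypothetical stop-word set so the example runs without NLTK data;
# in the notebook this would be set(stopwords.words("english")).
demo_stop_words = {"i", "a", "the", "after", "can", "they", "have"}

def remove_stop_words(tokens, stop_words):
    """Drop tokens whose lowercase form is a stop word; keep the casing of the rest."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["I", "have", "a", "G", "iPhone", "After", "hrs", "tweeting"]

# Case-sensitive comparison, as in the cell above.
case_sensitive = [tok for tok in tokens if tok not in demo_stop_words]
# Case-insensitive comparison.
case_insensitive = remove_stop_words(tokens, demo_stop_words)

print(case_sensitive)    # ['I', 'G', 'iPhone', 'After', 'hrs', 'tweeting']
print(case_insensitive)  # ['G', 'iPhone', 'hrs', 'tweeting']
```

Comparing on `tok.lower()` leaves product names such as "iPhone" in their original casing, which can be convenient when inspecting the tokens by eye.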

### Step 6: Lemmatization

In [91]:
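lemmatizer = WordNetLemmatizer()
# lemmatize() treats every word as a noun by default (pos="n"), so verb forms such as "tweeting" are left unchanged
df["tokens"] = df["tokens"].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])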

### Step 7: Join Tokens Back

In [92]:
df["processed_text"] = df["tokens"].apply(lambda x: " ".join(x))

### Step 8: Vectorization (TF-IDF Example)

In [93]:
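# Keep at most the 5,000 most frequent terms as features; TfidfVectorizer lowercases text by default
vectorizer = TfidfVectorizer(max_features=5000)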
X = vectorizer.fit_transform(df["processed_text"])

print('TF - IDF shape', X.shape)

TF - IDF shape (9092, 5000)


### Step 9: Train-Test Split

In [94]:
y = df["is_there_an_emotion_directed_at_a_brand_or_product"]  # sentiment label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Train size:", X_train.shape)
print("Test size:", X_test.shape)

Train size: (7273, 5000)
Test size: (1819, 5000)
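
In the cells above, the vectorizer is fitted on all tweets before the split, so the vocabulary and IDF weights are computed partly from tweets that later land in the test set. A common alternative, sketched below under assumptions, is to split the text first and fit the vectorizer on the training part only. The `texts` and `labels` lists are toy data, and `stratify` is an optional addition that keeps the label proportions similar in both parts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for processed_text and the sentiment column.
texts = [
    "love ipad", "great iphone app", "awesome google map", "nice android phone",
    "good apple store", "iphone battery dead", "app crashy", "hate android update",
    "google map wrong", "ipad screen broken",
]
labels = ["Positive"] * 5 + ["Negative"] * 5

# Split the text first; stratify keeps the label proportions in both parts.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Learn the vocabulary and IDF weights from the training texts only...
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(text_train)
# ...and reuse them, unchanged, on the held-out texts.
X_test = vectorizer.transform(text_test)

print("Train size:", X_train.shape)
print("Test size:", X_test.shape)
```

Both matrices share the same columns because `transform` reuses the fitted vocabulary; words that appear only in the test texts are ignored.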


## Refactoring the steps above into a Pipeline
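
Before the full refactor, here is a compact sketch of the general idea: each step becomes a named stage, and the pipeline runs the stages in order. It is an illustration under assumptions, not the notebook's implementation. `clean_series` and `text_pipeline` are hypothetical names, the two tweets are made up, and the stages operate on a plain list of strings so that the cleaning output can feed `TfidfVectorizer` directly.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def clean_series(texts):
    """Apply the notebook's regex cleaning to every string in an iterable."""
    def clean(text):
        text = re.sub(r"http\S+|www\S+", "", text)   # URLs
        text = re.sub(r"@\w+", "", text)             # user mentions
        text = re.sub(r"#\w+", "", text)             # hashtags
        text = re.sub(r"[^A-Za-z\s]", "", text)      # digits and punctuation
        return text.strip()
    return [clean(t) for t in texts]


# Hypothetical pipeline: a cleaning stage followed by TF-IDF features.
text_pipeline = Pipeline([
    ("clean", FunctionTransformer(clean_series)),
    ("tfidf", TfidfVectorizer(max_features=5000)),
])

tweets = [
    "@sxsw great #iPad app http://example.com",
    "iPhone battery dead after 3 hrs",
]
features = text_pipeline.fit_transform(tweets)
print(features.shape)
```

The design difference from the class in the next cell is the data type passed between stages: that class takes and returns a DataFrame, whereas this sketch passes strings so that a vectorizer can be the next stage.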

In [95]:
import pandas as pd
import re
import nltk
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

# Download NLTK resources (leave commented if already downloaded)
# nltk.download("punkt")
# nltk.download("stopwords")
# nltk.download("wordnet")
# nltk.download("omw-1.4")

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

# -------------------------------
# Custom Preprocessor
# -------------------------------

class TextPreprocessor(BaseEstimator, TransformerMixin):
    """Clean, tokenize, and lemmatize one text column of a DataFrame."""

    def __init__(self, text_column):
        self.text_column = text_column

    def clean_text(self, text):
        text = re.sub(r"http\S+|www\S+", "", text)  # remove URLs ("http\S+" also matches https)
        text = re.sub(r"@\w+", "", text)  # remove mentions
        text = re.sub(r"#\w+", "", text)  # remove hashtags, including the tag word
        text = re.sub(r"[^A-Za-z\s]", "", text)  # remove everything except letters and whitespace
        return text.strip()

    def tokenize_lemmatize(self, text):
        tokens = word_tokenize(text)
        # Case-sensitive match against the lowercase stop-word list
        tokens = [t for t in tokens if t not in stop_words]
        # Default noun lemmatization (pos="n")
        tokens = [lemmatizer.lemmatize(t) for t in tokens]
        return " ".join(tokens)

    def fit(self, X, y=None):
        # Stateless transformer: nothing is learned from the data
        return self

    def transform(self, X, y=None):
        X_filled = X.copy()

        # Fill missing values in the second column, selected by position
        # (emotion_in_tweet_is_directed_at in this dataset)
        second_col = X_filled.columns[1]
        X_filled[second_col] = X_filled[second_col].fillna("Unknown")

        # Process the text column
        X_filled[self.text_column] = X_filled[self.text_column].apply(
            lambda t: self.tokenize_lemmatize(self.clean_text(t))
        )

        return X_filled  # return as DataFrame

    
# -------------------------------
# Load Dataset
# -------------------------------
    
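# Forward slash in the path: "\j" inside a normal string literal is an invalid escape sequence
df = pd.read_csv("Data/judge-1377884607_tweet_product_company.csv", encoding='Latin-1')

# Drop rows with missing tweet text. The index is not reset here,
# so the dropped row leaves a gap in the index labels of the output (label 6 is absent).
df.dropna(subset=["tweet_text"], inplace=True)

# Raw tweet text (not used by the pipeline below, which receives the full DataFrame)
X = df["tweet_text"]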

# -------------------------------
# Build Preprocessing Pipeline
# -------------------------------

preprocessing_pipeline = Pipeline([
    ("text_preprocessor", TextPreprocessor(text_column="tweet_text"))
])

# Apply Pipeline
df_preprocessed = preprocessing_pipeline.fit_transform(df)

# View the first 10 processed rows
print(df_preprocessed.head(10))


                                           tweet_text  \
0   I G iPhone After hr tweeting dead I need upgra...   
1   Know Awesome iPadiPhone app youll likely appre...   
2                             Can wait also They sale   
3    I hope year festival isnt crashy year iPhone app   
4   great stuff Fri Marissa Mayer Google Tim OReil...   
5   New iPad Apps For And Communication Are Showca...   
7   starting around corner hop skip jump good time...   
8     Beautifully smart simple idea RT wrote iPad app   
9   Counting day plus strong Canadian dollar mean ...   
10  Excited meet I show Sprint Galaxy S still runn...   

   emotion_in_tweet_is_directed_at  \
0                           iPhone   
1               iPad or iPhone App   
2                             iPad   
3               iPad or iPhone App   
4                           Google   
5                          Unknown   
7                          Android   
8               iPad or iPhone App   
9                            A