# Twitter Sentiment Analysis

---

## 1. Business Understanding

### 1.1 Overview

In the modern digital era, social media platforms like Twitter (now X) have become powerful channels for consumers to express opinions, experiences, and emotions about brands, products, and services. These views can significantly influence purchasing decisions, brand reputation, and marketing strategies.

Manually tracking and interpreting this vast, unstructured feedback is impractical for companies. As a result, organizations increasingly turn to **Natural Language Processing (NLP)** and **machine learning models** to automatically analyze and interpret tweet sentiments.

![NLP Diagram](nlp.image.png)
*Figure 1: Illustration of how Natural Language Processing (NLP) processes and classifies text to produce outputs.*

---

### 1.2 Business Problem

Businesses need to understand **how customers feel** about their products and brands in real time. However, the sheer volume and unstructured nature of tweets make manual analysis impossible.

The core challenge is to **automatically classify each tweet** as **positive**, **negative**, or **neutral**. This provides actionable insights to:

- Identify emerging trends in customer satisfaction or dissatisfaction.
- Track public reactions to product launches or campaigns.
- Inform data-driven marketing and customer engagement decisions.

---

### 1.3 Project Objective

**Main Objective:**  
To **develop an automated sentiment classification model** that accurately analyzes and categorizes sentiments expressed in posts on **X (formerly Twitter)** as **positive, negative, or neutral**, enabling real-time insights into customer perceptions of a brand to support **data-driven marketing** and **brand management decisions**.

**Specific Objectives:**  

1. **Build a Binary Classification Model:**  
   Develop and train a machine learning model to accurately distinguish between **positive** and **negative** sentiments in X posts.  

2. **Extend to Multiclass Classification:**  
   Enhance the model to classify posts into **three categories**:  
   - No emotion toward brand or product (Neutral)
   - Positive emotion  
   - Negative emotion  
   This should be done **while maintaining or improving overall classification performance**.

3. **Support Business Decision-Making:**  
   Deliver **interpretable sentiment insights** to marketing teams and brand managers to:  
   - Optimize campaigns  
   - Address customer concerns  
   - Enhance brand reputation  

---

### 1.4 Business Value

An accurate sentiment analysis system delivers substantial value to decision-makers by enabling:

- **Brand Monitoring**: Track customer feelings about the brand over time.  
- **Marketing Optimization**: Pinpoint campaigns that drive positive engagement or negative feedback.  
- **Customer Insights**: Uncover pain points or drivers of satisfaction.  
- **Faster Decision-Making**: Provide near real-time feedback analysis.  

---

### 1.5 Research Questions

1. **How do customers feel about the company’s products or services**, based on sentiments expressed on Twitter?  
2. **What key factors or topics drive positive and negative sentiments** toward the brand on Twitter?  
3. **How can Twitter sentiment insights support business decisions**, such as marketing strategies, customer engagement, and brand reputation management?

---

### 1.6 Success Criteria

The project's success will be measured by:

1. **Actionable Insights**: The system delivers meaningful customer opinion trends on Twitter, supporting data-driven decisions.  
2. **Brand Reputation Tracking**: Enables real-time monitoring of public sentiment, allowing timely responses to issues.  
3. **Marketing and Engagement Impact**: Insights improve strategies, engagement, and brand perception based on identified trends.


## 2. Data Understanding

### 2.1 Data Source

* The dataset used for this analysis is text data from CrowdFlower, hosted on data.world (https://data.world/crowdflower/brands-and-product-emotions).
* It contains 9,093 records, each corresponding to one tweet.

### 2.2 Dataset Description

* Each record (row) in the dataset represents a single tweet about products from companies such as Apple and Google.
* Before any preprocessing, all columns in the dataset have the `object` data type.
* The key columns in the dataset are:
  1. "tweet_text": the actual text of the tweet.
  2. "emotion_in_tweet_is_directed_at": the brand or product the emotion is directed at (e.g., iPhone, Google, iPad).
  3. "is_there_an_emotion_directed_at_a_brand_or_product": the sentiment label, indicating whether the tweet expresses a positive emotion, a negative emotion, no emotion, or an unclear one ("I can't tell").

### 2.3 Data Quality

* There is 1 missing value in the "tweet_text" column.
* There are 5,802 missing values in the "emotion_in_tweet_is_directed_at" column.
* The dataset is text-heavy, so preprocessing steps such as text cleaning, tokenization, and vectorization (e.g., TF-IDF) are required before modeling.
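
To make the vectorization step more concrete, here is a minimal, standard-library sketch of the textbook TF-IDF weighting, `tf * log(N / df)`. It is an illustration only, not the notebook's implementation: the helper names `tokenize` and `tfidf` and the sample tweets are hypothetical, and library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed, normalized variant by default, so their numbers will differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Made-up example tweets, not rows from the dataset.
sample = ["I love my new iPad", "My iPad battery died", "Google maps is great"]
vectors = tfidf(sample)
print(vectors[0])
```

Terms that appear in fewer documents (such as "love" here) receive a higher weight than terms shared across documents (such as "ipad"), which is the property that makes TF-IDF useful for separating tweets by content.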

### 2.4 Features & Target

* Features: The main feature is the "tweet_text" column, which will be transformed into numerical form using a text vectorization technique.
* Target variable: "is_there_an_emotion_directed_at_a_brand_or_product" is the target column, representing the tweet’s sentiment (positive, negative, or neutral). The raw column also contains an "I can't tell" label (156 rows), which does not correspond to any of these three classes.


**We aim to predict how people feel (positive, negative, or neutral) towards Apple and Google products based on the tweet content.**

In [17]:
# Load the data
import pandas as pd
df = pd.read_csv("judge-1377884607_tweet_product_company.csv", encoding="ISO-8859-1")
df.head(7)

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |
| 5 | @teachntech00 New iPad Apps For #SpeechTherapy... | | No emotion toward brand or product |
| 6 | | | No emotion toward brand or product |


In [18]:
# basic information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


In [19]:
# check the shape of dataset
df.shape

(9093, 3)

In [20]:
# check for null values
df.isna().sum()

tweet_text                                               1
emotion_in_tweet_is_directed_at                       5802
is_there_an_emotion_directed_at_a_brand_or_product       0
dtype: int64

In [21]:
# check the unique values in "emotion_in_tweet_is_directed_at" column
df["emotion_in_tweet_is_directed_at"].value_counts()

emotion_in_tweet_is_directed_at
iPad                               946
Apple                              661
iPad or iPhone App                 470
Google                             430
iPhone                             297
Other Google product or service    293
Android App                         81
Android                             78
Other Apple product or service      35
Name: count, dtype: int64

In [22]:
# check the unique values in "is_there_an_emotion_directed_at_a_brand_or_product" column
df["is_there_an_emotion_directed_at_a_brand_or_product"].value_counts()

is_there_an_emotion_directed_at_a_brand_or_product
No emotion toward brand or product    5389
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: count, dtype: int64
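
The counts above show four raw labels, while the stated target has three classes, and the classes are far from balanced. The sketch below shows one possible way to map the raw labels to `positive`, `negative`, and `neutral` and to quantify the imbalance. The `LABEL_MAP` dictionary, the `map_labels` helper, and the decision to drop "I can't tell" rows are assumptions made for illustration, not steps taken in this notebook.

```python
import pandas as pd

# Raw label counts copied from the value_counts() output above.
raw_counts = pd.Series({
    "No emotion toward brand or product": 5389,
    "Positive emotion": 2978,
    "Negative emotion": 570,
    "I can't tell": 156,
})

# Hypothetical mapping from raw labels to the three target classes.
# Leaving "I can't tell" unmapped (and therefore dropped) is an assumption.
LABEL_MAP = {
    "No emotion toward brand or product": "neutral",
    "Positive emotion": "positive",
    "Negative emotion": "negative",
}


def map_labels(frame, column="is_there_an_emotion_directed_at_a_brand_or_product"):
    """Return a copy with a 'sentiment' column; rows with unmapped labels are dropped."""
    out = frame.copy()
    out["sentiment"] = out[column].map(LABEL_MAP)
    return out.dropna(subset=["sentiment"])


# Share of each raw label, to make the class imbalance explicit.
shares = (raw_counts / raw_counts.sum()).round(3)
print(shares)
```

Negative tweets make up only about 6% of the rows, so accuracy alone may be a misleading metric; per-class precision and recall are worth reporting alongside it.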

In [23]:
# drop the row with the missing value in the "tweet_text" column
df = df.dropna(subset=["tweet_text"])

In [24]:
# Drop "emotion_in_tweet_is_directed_at" column
df = df.drop(columns=["emotion_in_tweet_is_directed_at"])

In [25]:
# confirm that there are no NaNs
df.isna().sum()

tweet_text                                            0
is_there_an_emotion_directed_at_a_brand_or_product    0
dtype: int64

In [26]:
# check for duplicates
df.duplicated().sum()

22

* There are 22 duplicates in the dataset. These duplicates are dropped to avoid bias and to ensure each tweet contributes unique information to the model.

In [27]:
# dropping all duplicates
df=df.drop_duplicates()


In [28]:
# confirm that there are 0 duplicates
df.duplicated().sum()

0
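
As a side note on what the duplicate check covers: `duplicated()` and `drop_duplicates()` only match rows that are identical in every column. The toy example below, with made-up rows and a shortened `label` column name, sketches how tweets that differ only in case or whitespace would not be counted unless the text is normalized first. Whether such near-duplicates exist in this dataset is not established here.

```python
import pandas as pd

# Toy frame with made-up rows; "label" is a shortened, hypothetical column name.
toy = pd.DataFrame({
    "tweet_text": ["Love the new iPad", "Love the new iPad", "love the new ipad "],
    "label": ["Positive emotion", "Positive emotion", "Positive emotion"],
})

# duplicated() flags the second and later copies of a fully identical row.
exact_dupes = toy.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each identical row.
deduped = toy.drop_duplicates()

# Rows that differ only by case or surrounding whitespace are not exact
# duplicates, so they are only detected after normalizing the text.
normalized = toy["tweet_text"].str.lower().str.strip()
near_dupes = normalized.duplicated().sum()

print(exact_dupes, len(deduped), near_dupes)
```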

* Next, perform data cleaning and exploratory data analysis with nltk to remove unnecessary characters and noise, so that the model learns only from meaningful information.
* Common data cleaning tasks to consider are standardizing case and tokenizing.
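
The two tasks named above, standardizing case and tokenizing, can be sketched as follows. This version uses only Python's `re` module as a stand-in for nltk so that it runs without extra downloads. The function name `clean_and_tokenize`, the regular expressions, and the choice to strip URLs and @mentions are illustrative assumptions rather than the notebook's final cleaning rules.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# A word, optionally followed by an apostrophe suffix (e.g. "isn't", "year's").
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def clean_and_tokenize(tweet):
    """Lowercase a tweet, strip URLs and @mentions, and return word tokens."""
    text = tweet.lower()               # standardize case
    text = URL_RE.sub(" ", text)       # remove links
    text = MENTION_RE.sub(" ", text)   # remove user handles
    text = text.replace("#", " ")      # keep the hashtag word, drop the symbol
    return TOKEN_RE.findall(text)      # tokenize


# Made-up example in the style of the dataset's tweets.
print(clean_and_tokenize("@sxsw Can not wait for #iPad 2! http://example.com"))
# → ['can', 'not', 'wait', 'for', 'ipad', '2']
```

Keeping the hashtag word while dropping the `#` symbol preserves product names such as "ipad", which are likely to carry signal for brand-directed sentiment.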