# Apple Twitter Sentiment Classification Using Machine Learning

### Project Summary

This project aims to build a sentiment classification model that can automatically determine whether a tweet about Apple expresses a **positive**, **neutral**, or **negative** sentiment. The dataset, sourced from [CrowdFlower via data.world](https://data.world/crowdflower/brands-and-product-emotions), consists of about 9,000 tweets labeled by human annotators. It includes the tweet text, the brand or product the emotion is directed at, and a sentiment label (`Positive emotion`, `Negative emotion`, `No emotion toward brand or product`, or `I can't tell`). This dataset is well suited to natural language processing (NLP) tasks because it combines real-world, user-generated content with a labeled target variable.


### **Data Preparation**

To prepare the data, we focused on cleaning and preprocessing the tweet text (the `tweet_text` column, renamed to `tweet` in the notebook). Key steps included:

- Converting text to lowercase  
- Removing URLs, punctuation, and stopwords  
- Tokenizing and normalizing text  

These steps are essential in NLP to reduce noise and ensure the model focuses on the most meaningful features. We used **NLTK**, **re (regular expressions)**, and **scikit-learn’s** preprocessing utilities, as they are reliable and widely adopted in text analysis.


### **Modeling**

For modeling, we employed **Logistic Regression** and **Multinomial Naive Bayes** from **scikit-learn**, both of which are effective for text classification with Bag-of-Words and TF-IDF features. Hyperparameters were tuned with **GridSearchCV**. We used an **80/20 stratified train-test split** so that the training and test sets keep the same class proportions as the full dataset.
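
The setup described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual training code: the toy `texts` and `labels`, the parameter grid, `cv=3`, `random_state=42`, and the `f1_macro` scoring choice are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Toy stand-in corpus (hypothetical); the project would use the cleaned tweets.
texts = [
    "love my new ipad", "great iphone app", "awesome apple store",
    "the ipad is amazing", "best phone ever",
    "iphone battery is terrible", "hate this app", "the store line was awful",
    "worst update ever", "my ipad keeps crashing",
    "apple opens a popup store", "ipad launch is today", "new app released",
    "the keynote starts at noon", "iphone goes on sale friday",
]
labels = ["positive"] * 5 + ["negative"] * 5 + ["neutral"] * 5

# 80/20 split; stratify keeps the class proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# TF-IDF features feeding a Logistic Regression classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid (assumption): unigrams vs. bigrams, regularization strength.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1_macro")
search.fit(X_train, y_train)

test_f1 = search.score(X_test, y_test)  # macro F1 on the held-out 20%
print(search.best_params_)
print(test_f1)
```

`MultinomialNB` from `sklearn.naive_bayes` could be swapped in as the `clf` step in the same way, with its own grid (for example over `alpha`). Putting the vectorizer inside the pipeline means it is refit on each cross-validation fold, so no vocabulary information leaks from the validation fold.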


### **Evaluation**

Model performance was assessed using the following metrics:

- **Accuracy**
- **F1-Score**
- **Confusion Matrix**

The best-performing model achieved an **F1-score above 80%**, demonstrating strong performance in correctly identifying sentiment in Apple-related tweets. Because the score is computed on the held-out test split, it estimates how well the model generalizes to unseen data.
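
To make the metrics concrete, here is a small worked sketch in plain Python. The labels are made up for illustration, and macro averaging of the per-class F1 is an assumption, since the summary does not say which averaging was used. In the notebook itself, `accuracy_score`, `f1_score`, and `confusion_matrix` from `sklearn.metrics` would be the usual tools.

```python
from collections import Counter

# Made-up true and predicted labels (hypothetical example data).
y_true = ["neg", "neg", "neu", "neu", "neu", "pos", "pos", "pos"]
y_pred = ["neg", "neu", "neu", "neu", "pos", "pos", "pos", "neg"]

# Confusion matrix as counts of (true label, predicted label) pairs.
confusion = Counter(zip(y_true, y_pred))

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for(label):
    """F1 for one class: 2*TP / (2*TP + FP + FN)."""
    tp = confusion[(label, label)]
    fp = sum(n for (t, p), n in confusion.items() if p == label and t != label)
    fn = sum(n for (t, p), n in confusion.items() if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


classes = sorted(set(y_true))
macro_f1 = sum(f1_for(c) for c in classes) / len(classes)

print(accuracy)   # 0.625
print(macro_f1)   # 11/18, roughly 0.611
```

Macro F1 weights every class equally, which matters when one class is much rarer than the others.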



In the cell below, import the required libraries.

In [70]:
# import libraries
import pandas as pd
import numpy as np


In [71]:
# load data

apple_df = pd.read_csv('./data/judge-1377884607_tweet_product_company.csv',encoding = 'unicode_escape')
apple_df.head()

|   | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


In [72]:
print(apple_df.columns)

Index(['tweet_text', 'emotion_in_tweet_is_directed_at',
       'is_there_an_emotion_directed_at_a_brand_or_product'],
      dtype='object')


In [73]:
# rename columns for readability

apple_df = apple_df.rename(columns={
    "tweet_text" : "tweet",
    "emotion_in_tweet_is_directed_at" : "product",
    "is_there_an_emotion_directed_at_a_brand_or_product" : "sentiment"
})
apple_df.head()

|   | tweet | product | sentiment |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


In [74]:
# check metadata summary

def meta_num_summary(df):
    """Print column names, non-null counts, and dtypes for df."""
    print("-----info()-----")
    df.info()

In [75]:
meta_num_summary(apple_df)

-----info()-----
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   tweet      9092 non-null   object
 1   product    3291 non-null   object
 2   sentiment  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


The output above shows that `apple_df` contains three features, all of object dtype:
- `tweet` has 1 missing value
- `product` has many missing values (only 3,291 of 9,093 rows are non-null)
- `sentiment` has no missing values

In [76]:
# shape of dataset

apple_df.shape

(9093, 3)

The cell above shows the shape of the full dataset: **9093 entries and 3 features**.

In [77]:
# sentiment class data balance

def data_bal(df, column):
    """Return the number of rows for each unique value in `column`."""
    return df[column].value_counts()


In [78]:
# check for class imbalance
data_bal(apple_df, 'sentiment')

sentiment
No emotion toward brand or product    5389
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: count, dtype: int64
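
To gauge how imbalanced the classes are, the counts above can be turned into shares. The sketch below hard-codes the counts from the output so it runs on its own. In the notebook, `apple_df['sentiment'].value_counts(normalize=True)` should give the same proportions directly.

```python
# Counts copied from the value_counts() output above.
counts = {
    "No emotion toward brand or product": 5389,
    "Positive emotion": 2978,
    "Negative emotion": 570,
    "I can't tell": 156,
}

total = sum(counts.values())  # 9093 rows in the raw dataset

# Share of each class in the full dataset.
shares = {label: n / total for label, n in counts.items()}
for label, share in shares.items():
    print(f"{label:<36}{share:6.1%}")

# How many positive tweets there are for every negative one.
pos_to_neg = counts["Positive emotion"] / counts["Negative emotion"]
print(round(pos_to_neg, 1))  # 5.2
```

Negative tweets make up only about 6% of the data, which is why a stratified split and a class-aware metric such as F1 are useful here.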

In [79]:
#check tweet per product
data_bal(apple_df, 'product')

product
iPad                               946
Apple                              661
iPad or iPhone App                 470
Google                             430
iPhone                             297
Other Google product or service    293
Android App                         81
Android                             78
Other Apple product or service      35
Name: count, dtype: int64

In the cell below, I drop the row whose `tweet` value is missing (**NaN**).

In [87]:
apple_df.dropna(subset=['tweet'], inplace=True)

In [81]:
# remove duplicates

apple_df.drop_duplicates(inplace=True)
apple_df.duplicated().sum()


np.int64(0)

In the cell below, I check for missing values in the dataset's features; any that are found are imputed with **undefined**.

In [88]:
apple_df.isna().sum()

tweet        0
product      0
sentiment    0
dtype: int64

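The **product** feature was missing for most rows (only 3,291 of the original 9,093 rows had a value, per the `info()` output earlier), so those gaps are filled with the placeholder **undefined**.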

In [89]:
apple_df['product'] = apple_df['product'].fillna("undefined")
apple_df.isna().sum()

tweet        0
product      0
sentiment    0
dtype: int64

In [90]:
apple_df.shape

(9070, 3)

### Basic Text Cleaning and Tokenization

Before training a sentiment analysis model, it's essential to clean and preprocess the raw tweet text to reduce noise and ensure consistent interpretation of language by the model.

In this project, basic text cleaning will involve:

- **Converting all text to lowercase** to treat "Apple" and "apple" as the same word.
- **Removing punctuation and special characters**, which can affect token matching.
- **Eliminating URLs, mentions, and hashtags** commonly found in tweets but not useful for sentiment detection.
- **Removing stopwords** (such as "and", "the", "is") that do not add meaningful value to sentiment classification.
- **Tokenizing** each sentence into a list of individual words (tokens) for further analysis.

These steps ensure that words with similar meaning or usage are treated consistently. For example, without cleaning, words like "stock" and "stock." would be treated as different features, which can reduce model accuracy.

We will use standard Python libraries such as **NLTK**, **re (regular expressions)**, and **scikit-learn’s text preprocessing tools** to carry out these steps efficiently.

By the end of this stage, each tweet will be transformed into a clean, tokenized version of its original text, ready for vectorization and modeling.
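
The cleaning steps listed above could look roughly like the sketch below. It uses only `re` so that it runs without downloads. The small `STOPWORDS` set, the regular expressions, and the `clean_tweet` name are assumptions for illustration; the project itself would more likely use NLTK's English stopword list, which is much longer.

```python
import re

# Small illustrative stopword set (assumption); NLTK's English list is longer.
STOPWORDS = {"a", "an", "and", "the", "is", "i", "to", "of", "for",
             "at", "my", "it", "in", "on", "have"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # links
MENTION_HASHTAG_RE = re.compile(r"[@#]\w+")     # @mentions and #hashtags
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")       # punctuation, special chars


def clean_tweet(text):
    """Lowercase, strip URLs/mentions/hashtags/punctuation, drop stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_HASHTAG_RE.sub(" ", text)
    text = NON_ALNUM_RE.sub(" ", text)
    tokens = text.split()  # whitespace tokenization
    return [tok for tok in tokens if tok not in STOPWORDS]


print(clean_tweet(".@wesley83 I have a 3G iPhone. Check http://example.com #SXSW"))
# ['3g', 'iphone', 'check']
print(clean_tweet("Stock and stock."))
# ['stock', 'stock']
```

In the notebook this could be applied with `apple_df['tweet'].apply(clean_tweet)`. URLs are removed before punctuation so that a link is dropped as a whole instead of being split into stray word fragments.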


