# APEX Analytics Sentiment Analysis on Apple and Google Products
### Authors: Audrey, Petronilla, George, Stella, Sylvia, Job

## Business Understanding

In this project, we build a model that analyzes tweets about Apple and Google products and identifies the sentiment behind them, i.e., positive, negative, or neutral. A tool like this could help companies such as Apple or Google track how people feel about their products without manually reading thousands of tweets a day.

## Problem Statement
We want to create a machine learning model that takes a tweet and predicts its sentiment. Specifically, each tweet should be classified as Positive, Negative, or Neutral. The dataset contains over 9,000 tweets labeled by humans. Our task is to train a model on those labeled examples and apply the same logic to new, unseen tweets.

## Objectives
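1. Build a model that can tell whether a tweet is positive, negative, or neutral.

2. Apply NLP steps such as cleaning the text, splitting it into words (tokenization), and turning it into numbers (vectorization).

3. Train and compare several models: logistic regression, random forest, and a neural network.

4. Evaluate the models using accuracy, precision, recall, and F1-score.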
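
The NLP steps named in the objectives (clean, split into words, turn into numbers) can be pictured with a minimal, standard-library-only sketch. The two tweets, the tiny stop-word list, and the helper names `clean`, `tokenize`, and `vectorize` are made up for illustration; the notebook itself relies on NLTK and scikit-learn's `TfidfVectorizer` for these steps, and plain word counts stand in for TF-IDF here.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "is", "a", "my", "i"}

def clean(text):
    """Lowercase the text and replace everything except letters with spaces."""
    return re.sub(r"[^a-z\s]", " ", text.lower())

def tokenize(text):
    """Split cleaned text into words and drop stop words."""
    return [w for w in clean(text).split() if w not in STOP_WORDS]

def vectorize(tokens, vocabulary):
    """Turn a token list into a vector of word counts over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Made-up example tweets
tweets = ["I love my new iPad!", "The iPhone battery is dead, dead, dead."]
token_lists = [tokenize(t) for t in tweets]
vocabulary = sorted({w for toks in token_lists for w in toks})
vectors = [vectorize(toks, vocabulary) for toks in token_lists]

print(vocabulary)
print(vectors)
```

Each tweet becomes one row of numbers with one column per vocabulary word, which is the shape of input a classifier expects.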

## Research Questions


1. Can we accurately predict sentiment just from the text of a tweet?

2. What kinds of words or phrases are most common in positive, negative, or neutral tweets?

3. Which model performs best for this type of text classification task?

4. Which evaluation metrics are most appropriate for this task?

## Limitations

1. Tweets are short and informal, so their wording and structure vary widely.

2. Neutral tweets are tricky. Sometimes it is hard to tell whether someone is being neutral or just unclear.

3. The tweets were labeled by people, and not everyone agrees on what is positive or negative, so there may be some inconsistent labels in the data.

4. The data is somewhat old, so the language in these tweets might not fully match how people tweet today.

## Data Understanding

In [1]:
# Run this cell without changes
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay, silhouette_score

import nltk
import re
nltk.download('punkt')
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

from sklearn.cluster import KMeans

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [2]:
#load the data

df = pd.read_csv("judge-1377884607_tweet_product_company.csv",encoding='latin-1')
df.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


In [3]:
#shape of the data set
df.shape

(9093, 3)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


In [5]:
#checking for missing values
df.isnull().sum()

| | 0 |
|---|---|
| tweet_text | 1 |
| emotion_in_tweet_is_directed_at | 5802 |
| is_there_an_emotion_directed_at_a_brand_or_product | 0 |


In [6]:
# We drop the missing row
df = df[df["tweet_text"].notnull()].reset_index(drop=True)

In [7]:
# Check for duplicates
print(df.duplicated().sum())

22


In [8]:
df.drop_duplicates(inplace=True)

In [9]:
df.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


In [10]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [11]:
def tokenize_and_preprocess(reviews):
    """Lowercase, strip digits and stop words, tokenize, filter, and stem a Series of tweets."""

    stop_words = stopwords.words('english')
    # Matches a stop word followed by whitespace
    patt = re.compile(r'\b(' + r'|'.join(stop_words) + r')\b\s+')

    # Work on the `reviews` argument rather than the global df
    preproc_step1 = reviews.str.lower().str.replace(
        r'[0-9]+', '', regex=True).str.replace(patt, '', regex=True)

    # Tokenize. The result is a pandas Series of documents represented as lists of tokens
    preproc1_tokenized = preproc_step1.apply(word_tokenize)

    # Create the stemmer once instead of once per document
    stemmer = SnowballStemmer('english')

    # Inner function. Takes a single document as a token list and
    # processes it further by filtering tokens and stemming
    def remove_punct_and_stem(doc_tokenized):
        # Remove URL-like tokens
        doc_tokenized = [word for word in doc_tokenized if not word.startswith(('http', 'www'))]
        # Remove tokens containing '@' or '#'. word_tokenize has already split
        # '@user' into '@' and 'user', so the user names themselves are kept
        doc_tokenized = [word for word in doc_tokenized if not word.startswith(('@', '#'))]
        doc_tokenized = [word for word in doc_tokenized if '@' not in word and '#' not in word]
        # Keep alphabetic tokens only
        doc_tokenized = [word for word in doc_tokenized if word.isalpha()]
        # Apply stemming
        filtered_stemmed_tok = [stemmer.stem(tok) for tok in doc_tokenized]
        return " ".join(filtered_stemmed_tok)

    preprocessed = preproc1_tokenized.apply(remove_punct_and_stem)

    return preprocessed

df['preprocessed_text'] = tokenize_and_preprocess(df.tweet_text)
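
The stop-word pattern in the cell above only matches a stop word that is followed by whitespace, so a stop word at the very end of a tweet is left in place. The sketch below shows this with a hypothetical three-word stop list (the notebook uses NLTK's English list) and, as an assumed alternative rather than the notebook's method, a variant with optional trailing whitespace. The names `patt_trailing_space` and `patt_optional_space` are illustrative only.

```python
import re

# Hypothetical three-word stop list; the notebook uses NLTK's English list.
stop_words = ["the", "is", "a"]

# Same shape as the notebook's pattern: a stop word followed by whitespace.
patt_trailing_space = re.compile(r"\b(" + "|".join(stop_words) + r")\b\s+")

# Assumed variant: escape each word and make the trailing whitespace optional.
patt_optional_space = re.compile(
    r"\b(?:" + "|".join(map(re.escape, stop_words)) + r")\b\s*"
)

text = "the phone is a"
print(repr(patt_trailing_space.sub("", text)))          # 'phone a'
print(repr(patt_optional_space.sub("", text).strip()))  # 'phone'
```

Whether the leftover word matters depends on the later steps; in this notebook a trailing stop word would still pass the `isalpha()` filter and be stemmed.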

In [12]:
df.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product | preprocessed_text |
|---|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion | wesley g iphon hrs tweet dead need upgrad plug... |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion | jessede know fludapp awesom app like appreci d... |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion | swonderlin wait ipad also sale sxsw |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion | sxsw hope crashi app sxsw |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion | sxtxstate great stuff fri sxsw marissa mayer g... |


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9070 entries, 0 to 9091
Data columns (total 4 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9070 non-null   object
 1   emotion_in_tweet_is_directed_at                     3282 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9070 non-null   object
 3   preprocessed_text                                   9070 non-null   object
dtypes: object(4)
memory usage: 354.3+ KB


In [14]:
df['is_there_an_emotion_directed_at_a_brand_or_product'].unique()

array(['Negative emotion', 'Positive emotion',
       'No emotion toward brand or product', "I can't tell"], dtype=object)

In [15]:
df['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

| is_there_an_emotion_directed_at_a_brand_or_product | count |
|---|---|
| No emotion toward brand or product | 5375 |
| Positive emotion | 2970 |
| Negative emotion | 569 |
| I can't tell | 156 |


In [16]:
# Define the mapping  to numerical values
emotions_map = {
    'Positive emotion': 1,
    'Negative emotion': -1,
    'No emotion toward brand or product': 0,
    "I can't tell": 0
}

# Apply the mapping to create a new column 'emotions'
df['emotions'] = df['is_there_an_emotion_directed_at_a_brand_or_product'].map(emotions_map)

# Optional: Display the distribution of the new labels
print(df['emotions'].value_counts())

df.head()

emotions
 0    5531
 1    2970
-1     569
Name: count, dtype: int64


| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product | preprocessed_text | emotions |
|---|---|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion | wesley g iphon hrs tweet dead need upgrad plug... | -1 |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion | jessede know fludapp awesom app like appreci d... | 1 |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion | swonderlin wait ipad also sale sxsw | 1 |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion | sxsw hope crashi app sxsw | -1 |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion | sxtxstate great stuff fri sxsw marissa mayer g... | 1 |
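
The label distribution printed above is heavily skewed toward class 0, which affects how accuracy should be read later on. As a rough sketch using only the counts from that output, the accuracy of a baseline that always predicts the majority class can be worked out directly (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above.
counts = {0: 5531, 1: 2970, -1: 569}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a baseline that always predicts the majority class.
baseline_accuracy = counts[majority_class] / total

print(f"total tweets: {total}")
print(f"majority class: {majority_class}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
for label, n in counts.items():
    print(f"class {label:>2}: {n / total:.1%} of tweets")
```

On the full dataset this baseline reaches about 0.61 accuracy without learning anything, so per-class precision, recall, and F1 are more informative than accuracy alone.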


In [17]:
df=df.drop(columns=['tweet_text','emotion_in_tweet_is_directed_at','is_there_an_emotion_directed_at_a_brand_or_product'])
df.head()

| | preprocessed_text | emotions |
|---|---|---|
| 0 | wesley g iphon hrs tweet dead need upgrad plug... | -1 |
| 1 | jessede know fludapp awesom app like appreci d... | 1 |
| 2 | swonderlin wait ipad also sale sxsw | 1 |
| 3 | sxsw hope crashi app sxsw | -1 |
| 4 | sxtxstate great stuff fri sxsw marissa mayer g... | 1 |


In [18]:
# Due to the class imbalance, compute 'balanced' class weights (inversely proportional to class frequency)

from sklearn.utils import class_weight
import numpy as np


class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(df['emotions']),
    y=df['emotions']
)

# Create a dictionary mapping class labels to their corresponding weights
class_weights_dict = dict(zip(np.unique(df['emotions']), class_weights))

print(class_weights_dict)

{np.int64(-1): np.float64(5.3134153485647335), np.int64(0): np.float64(0.5466160429096607), np.int64(1): np.float64(1.0179573512906845)}
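
The printed weights can be reproduced by hand. The `'balanced'` heuristic assigns each class the weight `n_samples / (n_classes * samples_in_class)`; the sketch below applies that formula to the class counts shown earlier, using plain Python and illustrative variable names.

```python
# Class counts copied from the value_counts() output earlier in the notebook.
counts = {-1: 569, 0: 5531, 1: 2970}

n_samples = sum(counts.values())
n_classes = len(counts)

# 'balanced' heuristic: n_samples / (n_classes * samples_in_class)
weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}

for label, w in weights.items():
    print(f"class {label:>2}: weight {w:.4f}")
```

The weights only take effect when they are passed to a model, for example through the `class_weight` parameter of `LogisticRegression` or `RandomForestClassifier`. The models in the following cells are fit without that argument, and the dictionary keys would need to match the label values the model is trained on.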


## Modelling

In [19]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
import numpy as np

In [20]:
# X holds the preprocessed tweet text and y the numeric sentiment labels
X = df['preprocessed_text']
y = df['emotions']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Encode the target labels (-1, 0, 1) as 0, 1, 2
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)
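
To make the split and the encoding concrete, here is a small standard-library sketch. It assumes the cleaned row count of 9,070 from the `df.info()` output and mirrors two behaviours: `LabelEncoder` assigns codes in sorted order of the class values, and an 80/20 split of 9,070 rows yields 1,814 test rows (the support total in the classification reports). Variable names are illustrative.

```python
import math

# Label values used in the 'emotions' column.
labels = [-1, 0, 1]

# LabelEncoder assigns integer codes in sorted order of the class values.
encoding = {label: code for code, label in enumerate(sorted(labels))}
print(encoding)  # {-1: 0, 0: 1, 1: 2}

# Row count after cleaning, taken from the df.info() output above.
n_samples = 9070
test_size = 0.2

# The test share is rounded up; for these numbers the remainder goes to training.
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test
print(n_train, n_test)
```

Because of this encoding, the rows labeled `-1`, `0`, and `1` in the reports correspond to encoded classes 0, 1, and 2.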

### Logistic Regression

In [21]:
# Create a TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Fit the vectorizer to the training data and transform it
X_train_vec = vectorizer.fit_transform(X_train)

# Transform the test data using the fitted vectorizer
X_test_vec = vectorizer.transform(X_test)

# Now use the vectorized data for training the model
log_model = LogisticRegression(max_iter=1000)
log_model.fit(X_train_vec, y_train_encoded)  # Use vectorized training data
log_preds = log_model.predict(X_test_vec)  # Use vectorized test data

print("Logistic Regression Classification Report:")
#Convert label_encoder.classes_ to a list of strings
target_names = [str(cls) for cls in label_encoder.classes_]
print(classification_report(y_test_encoded, log_preds, target_names=target_names))

Logistic Regression Classification Report:
              precision    recall  f1-score   support

          -1       0.82      0.08      0.14       119
           0       0.73      0.87      0.79      1139
           1       0.61      0.49      0.54       556

    accuracy                           0.70      1814
   macro avg       0.72      0.48      0.49      1814
weighted avg       0.70      0.70      0.67      1814
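
The two average rows in the report can be recomputed from the per-class rows, which shows why they differ so much. The sketch below uses only the rounded F1 scores and supports printed above, so its results match the report to two decimals rather than exactly.

```python
# Per-class F1 scores and supports copied from the report above.
f1 = {-1: 0.14, 0: 0.79, 1: 0.54}
support = {-1: 119, 0: 1139, 1: 556}

total = sum(support.values())

# Macro average: every class counts equally.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

# Recall 0.08 on 119 negative tweets is roughly this many correct detections.
negatives_found = 0.08 * support[-1]

print(f"macro F1:    {macro_f1:.2f}")
print(f"weighted F1: {weighted_f1:.2f}")
print(f"negative tweets correctly flagged: about {negatives_found:.1f} of {support[-1]}")
```

The weighted average (0.67) is dominated by the large neutral class, while the macro average (0.49) exposes the weak performance on negative tweets, of which only about 9 or 10 out of 119 are caught.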



In [22]:
#Model Tuning with GridSearchCV
param_grid = {
    'C': [0.1, 1, 10],
    'solver': ['liblinear', 'lbfgs'],  # Keep both solvers
    'penalty': ['l2']           # Use only 'l2'
}
grid_search = GridSearchCV(LogisticRegression(max_iter=2000), param_grid, cv=5, scoring='f1_weighted')

grid_search.fit(X_train_vec, y_train_encoded)
best_model = grid_search.best_estimator_
best_params = grid_search.best_params_

print("Best Hyperparameters:", best_params)

Best Hyperparameters: {'C': 10, 'penalty': 'l2', 'solver': 'liblinear'}


### Random Forest Classifier

In [23]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_vec, y_train_encoded)  # Use vectorized training data
rf_preds = rf_model.predict(X_test_vec)  # Use vectorized test data

print("Random Forest Classification Report:")
# Convert label_encoder.classes_ to a list of strings
target_names = [str(cls) for cls in label_encoder.classes_]
print(classification_report(y_test_encoded, rf_preds, target_names=target_names))

Random Forest Classification Report:
              precision    recall  f1-score   support

          -1       0.71      0.20      0.31       119
           0       0.72      0.88      0.79      1139
           1       0.62      0.44      0.52       556

    accuracy                           0.70      1814
   macro avg       0.68      0.51      0.54      1814
weighted avg       0.69      0.70      0.67      1814



In [27]:
param_grid_rf = {
    'n_estimators': [50,100],
    'max_depth': [None, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1],
    'max_features': ['sqrt'],
    'bootstrap': [True],
    'criterion': ['gini']
}

grid_search_rf = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid_rf,
    cv=2,
    scoring='f1_weighted',
    n_jobs=-1,
    verbose=2
)

grid_search_rf.fit(X_train_vec, y_train_encoded)

best_rf_model = grid_search_rf.best_estimator_
best_rf_params = grid_search_rf.best_params_

print("Best Random Forest Parameters:", best_rf_params)

Fitting 2 folds for each of 8 candidates, totalling 16 fits
Best Random Forest Parameters: {'bootstrap': True, 'criterion': 'gini', 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100}
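
The "8 candidates, totalling 16 fits" line in the log follows from the grid size. As a sketch, the candidate combinations can be enumerated with `itertools.product`; this mirrors the idea of an exhaustive grid and is not scikit-learn's own code.

```python
from itertools import product

# The random forest grid from the cell above.
param_grid_rf = {
    'n_estimators': [50, 100],
    'max_depth': [None, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1],
    'max_features': ['sqrt'],
    'bootstrap': [True],
    'criterion': ['gini'],
}
cv_folds = 2

# One candidate per combination of values.
candidates = [dict(zip(param_grid_rf, values))
              for values in product(*param_grid_rf.values())]

n_candidates = len(candidates)
n_fits = n_candidates * cv_folds
print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

By the same arithmetic, the logistic regression grid (3 values of `C`, 2 solvers, 1 penalty) has 6 candidates, which gives 30 fits with `cv=5`.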


### Neural Network

In [25]:
y_train_categorical = to_categorical(y_train_encoded)
y_test_categorical = to_categorical(y_test_encoded)

# Create a TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Fit the vectorizer to the training data and transform it
X_train_vec = vectorizer.fit_transform(X_train)

# Transform the test data using the fitted vectorizer
X_test_vec = vectorizer.transform(X_test)

nn_model = Sequential([
    Input(shape=(X_train_vec.shape[1],)),
    Dense(128, activation='relu'),  # Use X_train_vec.shape[1]
    Dense(64, activation='relu'),
    Dense(len(label_encoder.classes_), activation='softmax')
])

nn_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
nn_model.fit(X_train_vec.toarray(), y_train_categorical, epochs=5, batch_size=32, validation_split=0.1)  # Use X_train_vec.toarray()

nn_preds = nn_model.predict(X_test_vec.toarray())  # Use X_test_vec.toarray()
nn_preds_labels = nn_preds.argmax(axis=1)

print("Neural Network Classification Report:")
# Convert label_encoder.classes_ to a list of strings
target_names = [str(cls) for cls in label_encoder.classes_]
print(classification_report(y_test_encoded, nn_preds_labels, target_names=target_names))

Epoch 1/5
205/205 ━━━━━━━━━━━━━━━━━━━━ 5s 18ms/step - accuracy: 0.5811 - loss: 0.8939 - val_accuracy: 0.6639 - val_loss: 0.7151
Epoch 2/5
205/205 ━━━━━━━━━━━━━━━━━━━━ 2s 12ms/step - accuracy: 0.7590 - loss: 0.5663 - val_accuracy: 0.6708 - val_loss: 0.7363
Epoch 3/5
205/205 ━━━━━━━━━━━━━━━━━━━━ 2s 12ms/step - accuracy: 0.8649 - loss: 0.3505 - val_accuracy: 0.6570 - val_loss: 0.8419
Epoch 4/5
205/205 ━━━━━━━━━━━━━━━━━━━━ 3s 12ms/step - accuracy: 0.9033 - loss: 0.2427 - val_accuracy: 0.6667 - val_loss: 0.9520
Epoch 5/5
205/205 ━━━━━━━━━━━━━━━━━━━━ 3s 12ms/step - accuracy: 0.9209 - loss: 0.1902 - val_accuracy: 0.6639 - val_loss: 1.0697
57/57 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
Neural Network Classification Report:
              precision    recall  f1-score   support

          -1       0.41 
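
In the training log above, training accuracy climbs from 0.58 to 0.92 while validation loss rises after the first epoch (0.7151 to 1.0697), which is a typical sign of overfitting. The sketch below replays those logged validation losses through a simple patience rule to show where a patience-based early stop would act. The function `simulate_early_stopping` is a simplified, hypothetical stand-in and not Keras's `EarlyStopping` implementation.

```python
# Validation losses copied from the five-epoch training log above.
val_losses = [0.7151, 0.7363, 0.8419, 0.9520, 1.0697]

def simulate_early_stopping(losses, patience):
    """Return (best_epoch, stop_epoch), both 1-based, for a simple patience rule."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            # No improvement: stop once `patience` such epochs have passed.
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, len(losses)

best_epoch, stop_epoch = simulate_early_stopping(val_losses, patience=3)
print(f"lowest validation loss at epoch {best_epoch}; "
      f"training would stop after epoch {stop_epoch}")
```

Under this rule with a patience of 3, the weights from epoch 1 would be the ones worth keeping, and the later epochs mostly fit the training data more closely.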

In [26]:
def create_nn_model(units1, units2, dropout_rate, optimizer='adam'):
    model = Sequential([
        Input(shape=(X_train_vec.shape[1],)),
        Dense(units1, activation='relu'),
        Dropout(dropout_rate),
        Dense(units2, activation='relu'),
        Dense(len(label_encoder.classes_), activation='softmax')
    ])
    model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
    return model

# Hyperparameter grid
param_grid_nn = {
    'units1': [64, 128],
    'units2': [32, 64],
    'dropout_rate': [0.2, 0.5],
    'optimizer': ['adam', 'rmsprop'],
    'epochs': [30],
    'batch_size': [64]
}

best_accuracy = 0.0
best_params = {}
best_model = None

for units1 in param_grid_nn['units1']:
    for units2 in param_grid_nn['units2']:
        for dropout_rate in param_grid_nn['dropout_rate']:
            for optimizer in param_grid_nn['optimizer']:
                for epochs in param_grid_nn['epochs']:
                    for batch_size in param_grid_nn['batch_size']:
                        print(f"Trying: units1={units1}, units2={units2}, dropout={dropout_rate}, optimizer={optimizer}, epochs={epochs}, batch_size={batch_size}")

                        model = create_nn_model(units1, units2, dropout_rate, optimizer)
                        es = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True, verbose=0)

                        history = model.fit(
                            X_train_vec.toarray(), y_train_categorical,
                            epochs=epochs,
                            batch_size=batch_size,
                            validation_split=0.1,
                            verbose=0,
                            callbacks=[es]
                        )

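                        # Score each configuration on the held-out validation split, at the
                        # epoch whose weights EarlyStopping restores, rather than on the
                        # training data, so that overfit models are not favoured
                        best_epoch = int(np.argmin(history.history['val_loss']))
                        accuracy = history.history['val_accuracy'][best_epoch]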

                        if accuracy > best_accuracy:
                            best_accuracy = accuracy
                            best_params = {
                                'units1': units1,
                                'units2': units2,
                                'dropout_rate': dropout_rate,
                                'optimizer': optimizer,
                                'epochs': epochs,
                                'batch_size': batch_size
                            }
                            best_model = model

print("\nBest Neural Network Parameters:", best_params)

Trying: units1=64, units2=32, dropout=0.2, optimizer=adam, epochs=30, batch_size=64
Trying: units1=64, units2=32, dropout=0.2, optimizer=rmsprop, epochs=30, batch_size=64
Trying: units1=64, units2=32, dropout=0.5, optimizer=adam, epochs=30, batch_size=64
Trying: units1=64, units2=32, dropout=0.5, optimizer=rmsprop, epochs=30, batch_size=64
Trying: units1=64, units2=64, dropout=0.2, optimizer=adam, epochs=30, batch_size=64
Trying: units1=64, units2=64, dropout=0.2, optimizer=rmsprop, epochs=30, batch_size=64
Trying: units1=64, units2=64, dropout=0.5, optimizer=adam, epochs=30, batch_size=64
Trying: units1=64, units2=64, dropout=0.5, optimizer=rmsprop, epochs=30, batch_size=64
Trying: units1=128, units2=32, dropout=0.2, optimizer=adam, epochs=30, batch_size=64
Trying: units1=128, units2=32, dropout=0.2, optimizer=rmsprop, epochs=30, batch_size=64
Trying: units1=128, units2=32, dropout=0.5, optimizer=adam, epochs=30, batch_size=64
Trying: units1=128, units2=32, dropout=0.5, optimizer=rmsp