# Sentiment Analysis

### **Author:** ***Martin Waweru***


## The project follows the CRISP-DM methodology
Business understanding  
Data Understanding  
Data preparation  
Modeling  
Evaluation  
Deployment 

## Business Understanding

### Business Problem
Businesses face the challenge of analyzing large volumes of unstructured text data, such as customer reviews and social media posts, to understand sentiment. Manual analysis is time-consuming and inefficient, creating a need for an automated solution to classify text into positive, negative, or neutral sentiments. This will help businesses make data-driven decisions and improve customer satisfaction.

### Business Overview
Sentiment analysis is vital across industries like retail, hospitality, and finance. It helps monitor brand reputation, identify customer pain points, and tailor marketing strategies. For example, analyzing product reviews or social media feedback enables companies to enhance customer experiences and address issues promptly, driving growth and improving brand loyalty.

### Objective of the Project
The project aims to build a sentiment analysis model to classify text into positive, negative, or neutral sentiments. It involves preprocessing text data, extracting features, training machine learning or deep learning models, and evaluating performance. The final goal is to create a tool that automates sentiment analysis, helping businesses analyze text data efficiently and make informed decisions.

## Data Understanding

### Data repository
The dataset, known as "Tweet Sentiment Analysis", was downloaded from Kaggle. It contains text data from tweets, where each tweet is labeled with a sentiment: positive, negative, or neutral. The dataset can be [downloaded here](https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset).

### Data overview

The dataset provided contains text samples with four columns: textID, text, selected_text, and sentiment. Each row represents a unique text entry, where:

1. textID is a unique identifier for each text.
2. text contains the full sentence or phrase.
3. selected_text highlights the specific part of the text that reflects the sentiment.
4. sentiment labels the text as positive, negative, or neutral.

## Data Preparation

In [113]:
# import libraries
import pandas as pd
import numpy as np

import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

from sklearn.model_selection import train_test_split

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense


# Download the WordNet and stopword corpora
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\PC\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\PC\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [114]:
# load the dataset
df = pd.read_csv("Tweets.csv")
df.head()

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


In [115]:
# preview the dataset
display(df.head(10))
display(df.tail(10))

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative
5,28b57f3990,http://www.dothebouncy.com/smf - some shameles...,http://www.dothebouncy.com/smf - some shameles...,neutral
6,6e0c6d75b1,2am feedings for the baby are fun when he is a...,fun,positive
7,50e14c0bb8,Soooo high,Soooo high,neutral
8,e050245fbd,Both of you,Both of you,neutral
9,fc2cbefa9d,Journey!? Wow... u just became cooler. hehe....,Wow... u just became cooler.,positive


Unnamed: 0,textID,text,selected_text,sentiment
27471,15bb120f57,"i`m defying gravity. and nobody in alll of oz,...","i`m defying gravity. and nobody in alll of oz,...",neutral
27472,8f5adc47ec,http://twitpic.com/663vr - Wanted to visit the...,were too late,negative
27473,a208770a32,in spoke to you yesterday and u didnt respond...,in spoke to you yesterday and u didnt respond ...,neutral
27474,8f14bb2715,So I get up early and I feel good about the da...,I feel good ab,positive
27475,b78ec00df5,enjoy ur night,enjoy,positive
27476,4eac33d1c0,wish we could come see u on Denver husband l...,d lost,negative
27477,4f4c4fc327,I`ve wondered about rake to. The client has ...,", don`t force",negative
27478,f67aae2310,Yay good for both of you. Enjoy the break - y...,Yay good for both of you.,positive
27479,ed167662a5,But it was worth it ****.,But it was worth it ****.,positive
27480,6f7127d9d7,All this flirting going on - The ATG smiles...,All this flirting going on - The ATG smiles. Y...,neutral


In [116]:
# check info of the data
print(f"The shape indicates that the dataset has {df.shape[0]} rows and {df.shape[1]} Columns")

The shape indicates that the dataset has 27481 rows and 4 Columns


In [117]:
#Check more info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27481 entries, 0 to 27480
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   textID         27481 non-null  object
 1   text           27480 non-null  object
 2   selected_text  27480 non-null  object
 3   sentiment      27481 non-null  object
dtypes: object(4)
memory usage: 858.9+ KB


Checking for missing values

In [118]:
# checking missing values
df.isnull().sum()

textID           0
text             1
selected_text    1
sentiment        0
dtype: int64

In [119]:
# Drop the rows with missing values
df.dropna(inplace=True)
print(f"Now the dataset has {df.isnull().sum().sum()} missing texts or values")

Now the dataset has 0 missing texts or values


In [120]:
# Check for duplicated rows
print(f"This dataset contains {df.duplicated().sum()} duplicated rows")

This dataset contains 0 duplicated rows


Working on columns

In [121]:
df.columns

Index(['textID', 'text', 'selected_text', 'sentiment'], dtype='object')

The dataset contains 4 columns. Let's review the importance of each column, as summarized in the table below.

|Column name| Description | Status |
|--------------------|-------------------------------------------|--------|
|**textID**          | A unique identifier for each tweet.       | Drop   |
|**text**            | The full original text of the tweet.      | Keep   |
|**selected_text**   | Text extract that shows the sentiment.    | Drop   |
|**sentiment**       | The sentiment label                       | Keep   |

In [122]:
# drop columns
df.drop(columns=['textID', 'selected_text'], inplace=True)
#print columns
df.columns

Index(['text', 'sentiment'], dtype='object')

### Text processing 
This step standardizes the text by:
1. Converting to lowercase
2. Removing special characters, numbers, and punctuation
3. Removing stopwords
4. Lemmatizing, that is, reducing words to their root form

In [123]:
# Initialize lemmatizer and stopwords
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

In [124]:
"""
Stopword removal can eliminate important words such as "not", which can flip the apparent sentiment.
To counter this, we define a set of common negation words to preserve and to use for detecting sentiment reversals.
This includes standard negations and contractions such as "not", "no", "never", "n't", and "can't".
"""
negation_words = {"not", "no", "never", "n't", "can't", "won't", "shouldn't", "isn't", "wasn't", "couldn't"}
# Define the preprocessing function
def preprocess_text(text):
    # Standardize text to lowercase
    text = text.lower()
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text) 
    # Remove everything except lowercase letters and whitespace.
    # Note: this also strips apostrophes, so "can't" becomes "cant" and
    # no longer matches the contractions listed in negation_words.
    text = re.sub(r'[^a-z\s]', '', text) 
    
    words = text.split()
    
    # Negation handling
    processed_words = []
    negate = False
    
    for word in words:
        if word in negation_words:
            negate = True
            processed_words.append(word)
        elif negate:
            processed_words.append(f"not_{word}")
            negate = False
        else:
            processed_words.append(word)
    
    # Remove stopwords but keep negation words
    processed_words = [word for word in processed_words if word not in stop_words or word in negation_words]
    # Lemmatization
    processed_words = [lemmatizer.lemmatize(word) for word in processed_words]
    return " ".join(processed_words)
# Apply preprocessing
df['cleaned_text'] = df['text'].apply(preprocess_text)
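
The sample output further down shows `couldnt` surviving without a `not_` prefix on the next word, which suggests that contractions are not matched once punctuation has been stripped. The sketch below reproduces that behaviour using only the standard library and compares it with one possible adjustment. The helper names `strip_non_letters`, `mark_negations`, and `normalized_negations` are hypothetical and are not part of the notebook.

```python
import re

# Same negation set as in the preprocessing cell.
negation_words = {"not", "no", "never", "n't", "can't", "won't",
                  "shouldn't", "isn't", "wasn't", "couldn't"}

def strip_non_letters(text):
    """Lowercase and drop everything except a-z and whitespace."""
    return re.sub(r"[^a-z\s]", "", text.lower())

# Hypothetical variant: clean the negation words the same way as the text,
# so "can't" is stored as "cant" and can match the cleaned tokens.
normalized_negations = {strip_non_letters(w) for w in negation_words}

def mark_negations(text, negations):
    """Prefix the single word that follows a negation with 'not_'."""
    out, negate = [], False
    for word in strip_non_letters(text).split():
        if word in negations:
            negate = True
            out.append(word)
        elif negate:
            out.append(f"not_{word}")
            negate = False
        else:
            out.append(word)
    return out

sample = "I can't stand this"
print(mark_negations(sample, negation_words))        # contraction not matched
print(mark_negations(sample, normalized_negations))  # contraction matched
```

A related caveat, stated as an assumption about the default settings: the Keras `Tokenizer` default `filters` string includes the underscore, so a token such as `not_stand` is likely split back into `not` and `stand` unless `filters` is adjusted.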

In [125]:
# Preview the data before and after text processing
print(f"This is the original text before text processing: {df[['text']].head()}")
print("_"*100)
# After text processing
print(f"This is the new text after text processing: {df[['cleaned_text']].head()}")


This is the original text before text processing:                                                 text
0                I`d have responded, if I were going
1      Sooo SAD I will miss you here in San Diego!!!
2                          my boss is bullying me...
3                     what interview! leave me alone
4   Sons of ****, why couldn`t they put them on t...
____________________________________________________________________________________________________
This is the new text after text processing:                              cleaned_text
0                      id responded going
1                 sooo sad miss san diego
2                            bos bullying
3                   interview leave alone
4  son couldnt put release already bought


## Modelling
### Tokenization and padding

The process of tokenization and padding involves converting text into numerical format for machine learning models. Tokenization breaks text into words or subwords and maps them to unique numerical indices. Padding ensures that all sequences have the same length by adding zeros or placeholders to shorter sequences. This standardization allows models to process text efficiently.
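
To make the two steps concrete, here is a minimal pure-Python sketch of word-level tokenization and zero padding. It is an illustration rather than the Keras implementation: the helpers `build_word_index`, `texts_to_sequences`, and `pad_left` are hypothetical, ties in word frequency are broken alphabetically for determinism, and padding and truncation are applied at the front of each sequence, which is assumed to match the `pad_sequences` defaults.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an index starting at 1, most frequent first.

    Index 0 is reserved for padding.
    """
    counts = Counter(word for text in texts for word in text.split())
    # Descending frequency, then alphabetical order to keep ties deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ordered, start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word with its index; unknown words are skipped."""
    return [[word_index[w] for w in text.split() if w in word_index]
            for text in texts]

def pad_left(sequences, max_len):
    """Left-pad with zeros; keep only the last max_len tokens of long sequences."""
    padded = []
    for seq in sequences:
        seq = seq[-max_len:]
        padded.append([0] * (max_len - len(seq)) + seq)
    return padded

texts = ["sooo sad miss san diego", "interview leave alone", "sad"]
word_index = build_word_index(texts)
sequences = texts_to_sequences(texts, word_index)
print(pad_left(sequences, max_len=5))
# [[8, 1, 6, 7, 3], [0, 0, 4, 5, 2], [0, 0, 0, 0, 1]]
```

Every row has the same length, so the batch can be fed to a model as a rectangular array.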

In [126]:
# Tokenize the text
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(df['cleaned_text'])
X = tokenizer.texts_to_sequences(df['cleaned_text'])
# Pad sequences to a fixed length.
# A tweet is assumed to average roughly 25 words,
# so twice that is used as the maximum length.
max_len = 50
X = pad_sequences(X, maxlen=max_len)

In [127]:
# Map sentiment labels to numerical values
y = df['sentiment'].map({'negative': 0, 'neutral': 1, 'positive': 2})

In [128]:
# Initialize the sequential model
model = Sequential()
# Embedding layer (no input_length argument; the sequence length is inferred from the data)
model.add(Embedding(input_dim=5001, output_dim=128))  
# 1D Convolutional layer
model.add(Conv1D(filters=128, kernel_size=3, activation='relu'))
# Global Max Pooling
model.add(GlobalMaxPooling1D())
# Fully connected layers
model.add(Dense(10, activation='relu'))
model.add(Dense(3, activation='softmax'))  # 3 classes: negative, neutral, positive
# Compile the model
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
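
As a sanity check on the architecture, the trainable parameter counts can be worked out by hand. The sketch below does the arithmetic in plain Python, assuming an input length of 50 (the `max_len` used for padding) and the Keras defaults of `'valid'` padding and stride 1 for `Conv1D`. The figures are derived from the layer sizes above, not copied from a `model.summary()` call.

```python
# Layer sizes taken from the model definition.
vocab_size, embed_dim = 5001, 128
filters, kernel_size = 128, 3
dense_units, n_classes = 10, 3
seq_len = 50  # assumption: sequences padded to max_len = 50

# With 'valid' padding and stride 1, the convolution shortens the sequence.
conv_steps = seq_len - kernel_size + 1

embedding_params = vocab_size * embed_dim                    # one vector per index
conv_params = kernel_size * embed_dim * filters + filters    # weights + biases
dense_params = filters * dense_units + dense_units           # pooled vector -> Dense(10)
output_params = dense_units * n_classes + n_classes          # Dense(10) -> Dense(3)

total_params = embedding_params + conv_params + dense_params + output_params

print(f"Conv1D output steps: {conv_steps}")       # 48
print(f"Embedding parameters: {embedding_params}")  # 640128
print(f"Conv1D parameters: {conv_params}")          # 49280
print(f"Total parameters: {total_params}")          # 690731
```

Most of the parameters sit in the embedding layer, which is worth keeping in mind when the training set has only about 22,000 tweets.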

In [129]:
# Check the class distribution before splitting the data
print("Original class distribution:\n", y.value_counts())

Original class distribution:
 sentiment
1    11117
2     8582
0     7781
Name: count, dtype: int64


In [130]:
# Split the data
# using a test size of 20%, random_state=42, and stratify=y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [131]:
# Check the class distribution after splitting the data
print("Training set class distribution:\n", y_train.value_counts())
print("Test set class distribution:\n", y_test.value_counts())
print("The dataset contains class imbalances, but the difference between them is not large, so there is no need to balance them using SMOTE")

Training set class distribution:
 sentiment
1    8894
2    6865
0    6225
Name: count, dtype: int64
Test set class distribution:
 sentiment
1    2223
2    1717
0    1556
Name: count, dtype: int64
The dataset contains class imbalances, but the difference between them is not large, so there is no need to balance them using SMOTE
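
The effect of `stratify=y` can be checked directly from the counts printed above. This short sketch recomputes the class shares for the full, training, and test sets and the ratio between the largest and smallest class. The dictionary names and the `shares` helper are illustrative only.

```python
# Counts copied from the outputs above (0 = negative, 1 = neutral, 2 = positive).
full = {"neutral": 11117, "positive": 8582, "negative": 7781}
train = {"neutral": 8894, "positive": 6865, "negative": 6225}
test = {"neutral": 2223, "positive": 1717, "negative": 1556}

def shares(counts):
    """Return each class as a fraction of the total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

for name, counts in [("full", full), ("train", train), ("test", test)]:
    print(name, {label: round(share, 3) for label, share in shares(counts).items()})

# Ratio between the most and least frequent class in the full dataset.
imbalance_ratio = max(full.values()) / min(full.values())
print(f"Largest/smallest class ratio: {imbalance_ratio:.2f}")  # 1.43
```

The three splits share nearly identical proportions, and the largest class is about 1.4 times the smallest, which is consistent with the decision not to resample.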


In [132]:
# Train the sentiment analysis model for 5 epochs with a batch size of 64, using training data.
# Validate performance on the test set after each epoch.
sent_model = model.fit(X_train, y_train, epochs=5, batch_size=64, validation_data=(X_test, y_test))

Epoch 1/5
344/344 ━━━━━━━━━━━━━━━━━━━━ 13s 28ms/step - accuracy: 0.5502 - loss: 0.9184 - val_accuracy: 0.7080 - val_loss: 0.6868
Epoch 2/5
344/344 ━━━━━━━━━━━━━━━━━━━━ 11s 31ms/step - accuracy: 0.7610 - loss: 0.5865 - val_accuracy: 0.7069 - val_loss: 0.6985
Epoch 3/5
344/344 ━━━━━━━━━━━━━━━━━━━━ 11s 33ms/step - accuracy: 0.8373 - loss: 0.4413 - val_accuracy: 0.6936 - val_loss: 0.7486
Epoch 4/5
344/344 ━━━━━━━━━━━━━━━━━━━━ 11s 32ms/step - accuracy: 0.8957 - loss: 0.3088 - val_accuracy: 0.6843 - val_loss: 0.8677
Epoch 5/5
344/344 ━━━━━━━━━━━━━━━━━━━━ 10s 30ms/step - accuracy: 0.9346 - loss: 0.2042 - val_accuracy: 0.6696 - val_loss: 1.0383


In [133]:
# Retrieve the training history, including accuracy and loss for both training and validation sets.
sent_model.history

{'accuracy': [0.639692485332489,
  0.7633733749389648,
  0.8296488523483276,
  0.8887372612953186,
  0.928993821144104],
 'loss': [0.8004828095436096,
  0.5861129760742188,
  0.44870051741600037,
  0.3200085759162903,
  0.21639761328697205],
 'val_accuracy': [0.7079694271087646,
  0.7068777084350586,
  0.6935953497886658,
  0.6843158602714539,
  0.6695778965950012],
 'val_loss': [0.6867584586143494,
  0.6985465884208679,
  0.7485771775245667,
  0.8676957488059998,
  1.038257360458374]}
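
The history above can be summarized with a few lines of plain Python. The sketch below uses the values from the output, rounded to four decimals, to find the epoch with the lowest validation loss and the gap between training and validation accuracy. The variable names are illustrative and are not part of the notebook.

```python
# Values rounded from the training history shown above.
train_accuracy = [0.6397, 0.7634, 0.8296, 0.8887, 0.9290]
val_accuracy = [0.7080, 0.7069, 0.6936, 0.6843, 0.6696]
val_loss = [0.6868, 0.6985, 0.7486, 0.8677, 1.0383]

# Epoch (1-based) with the lowest validation loss.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

# Difference between training and validation accuracy for each epoch.
accuracy_gap = [round(t - v, 4) for t, v in zip(train_accuracy, val_accuracy)]

print(f"Best epoch by validation loss: {best_epoch}")
print(f"Train minus validation accuracy per epoch: {accuracy_gap}")
```

Validation loss is lowest at the first epoch and rises afterwards while training accuracy keeps climbing, a pattern that usually indicates overfitting. Training for fewer epochs or adding regularization are possible responses, though neither was tried in this notebook.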

In [134]:
# Evaluate the trained model on the test set
test_loss, test_accuracy = model.evaluate(X_test, y_test)
# Report its performance
print(f"Test Accuracy: {test_accuracy:.4f}")
print(f"Test Loss: {test_loss:.4f}")

172/172 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.6593 - loss: 1.0567
Test Accuracy: 0.6696
Test Loss: 1.0383


### Create function to predict sentiments

In [135]:
# Mapping labels
sentiment_labels = {0: "Negative", 1: "Neutral", 2: "Positive"}
# Define a function that reads a tweet from the user and prints the predicted sentiment
def predict_sentiment():
    user_text = input("Enter a tweet: ")
    # Preprocess the input text
    processed_text = preprocess_text(user_text)
    # Tokenize and pad the sequence
    sequence = tokenizer.texts_to_sequences([processed_text])
    padded_sequence = pad_sequences(sequence, maxlen=max_len)
    # Predict sentiment
    prediction = model.predict(padded_sequence)
    predicted_class = prediction.argmax(axis=1)[0]  # Get class with highest probability
    
    print(f"\nTweet: {user_text}")
    print(f"Predicted Sentiment: {sentiment_labels[predicted_class]}\n")
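
The final step of the function, picking the class with the highest probability, can be illustrated without the model. The sketch below assumes a softmax output ordered as negative, neutral, positive, as in the label mapping above. The helper `label_from_probabilities` and the example probabilities are hypothetical.

```python
sentiment_labels = {0: "Negative", 1: "Neutral", 2: "Positive"}

def label_from_probabilities(probabilities):
    """Return (label, probability) for the highest-scoring class."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return sentiment_labels[best], probabilities[best]

# Hypothetical softmax output for one tweet: [negative, neutral, positive]
example = [0.10, 0.25, 0.65]
print(label_from_probabilities(example))  # ('Positive', 0.65)
```

Returning the probability alongside the label makes it possible to flag low-confidence predictions instead of reporting every result with equal certainty.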


### Some examples of predicted sentiments
The user enters a tweet, and the model predicts whether it is positive, neutral, or negative.

In [104]:
predict_sentiment()

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 76ms/step

Tweet: William will not win the election in 2027, mark this tweet, will be open to debate on 2027
Predicted Sentiment: Negative



In [106]:
predict_sentiment()

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 67ms/step

Tweet: I really hate you, i dont want to see you again
Predicted Sentiment: Negative



In [108]:
predict_sentiment()

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 51ms/step

Tweet: congratulations team, we won, i knew this team was strong
Predicted Sentiment: Positive



In [109]:
predict_sentiment()

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 63ms/step

Tweet: I really loved the way we spent time together, it was nice
Predicted Sentiment: Positive



In [111]:
predict_sentiment()

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 68ms/step

Tweet: allow me not to comment on this issue
Predicted Sentiment: Neutral



In [112]:
predict_sentiment()

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 71ms/step

Tweet: I'm not sure if i will make it to come 
Predicted Sentiment: Neutral

