## <span style="font-family:Georgia, serif;">**Twitter Sentiment Analysis**: Understanding Emotions in Tweets About Apple and Google Products</span>

![alt text](twits.jpg "Title")

## Overview

## Business Understanding

**Business Problem**: Using Sentiment Analysis to Improve Apple and Google Product Marketing Strategies 

The introduction of social media has completely changed how businesses interact with their consumers and the general public in today's connected society. While the digital age offers limitless possibilities for marketing and brand development, it also brings its own set of difficulties. One of these difficulties is the inability of enterprises to precisely gauge public opinion and feelings towards their goods or services.

In the age of social media, organizations are acutely aware of the need to harness the wealth of sentiment and emotion data available on these platforms. However, they often struggle to do so effectively, given the unprecedented speed, diversity, and complexity of social media communication. The dynamic nature of the medium, the diverse and contextual language used, the rapid increase of emojis and visual content, the volume of noise, and ethical concerns all contribute to the challenge of gauging public sentiment and emotions. 

To overcome these challenges, organizations must invest in advanced sentiment analysis tools and technologies, develop cultural and linguistic expertise, and strike a balance between data-driven insights and ethical considerations. By doing so, they can unlock the valuable insights hidden within the social media storm and use them to inform strategic decisions, enhance products and services, and build stronger connections with their audience in this rapidly evolving digital landscape.




## Data Understanding

**Dataset Overview:**
This dataset comprises 8,721 entries organized into three distinct columns: tweet_text, emotion_in_tweet_is_directed_at, and is_there_an_emotion_directed_at_a_brand_or_product. Each entry in the dataset represents a tweet along with associated metadata regarding the product or brand it's directed at and the emotional sentiment conveyed within the tweet.

**Column Descriptions:**

`tweet_text:`

The tweet_text column contains the textual content of individual tweets. These tweets are typically concise, informal expressions shared by users on a social media platform, such as Twitter. Each tweet serves as a snapshot of a user's thoughts, opinions, or experiences related to a particular product or brand.
The textual data within this column can vary in length, language, and complexity. It may include hashtags, mentions of other users, URLs, and a wide range of linguistic elements.
Analysis of the tweet text can provide valuable insights into the sentiments, opinions, or feedback expressed by users regarding the product or brand.

`emotion_in_tweet_is_directed_at:`

The emotion_in_tweet_is_directed_at column provides information about the specific product or brand mentioned or targeted by each tweet. This column serves as a categorical label indicating the entity towards which the emotion or sentiment expressed in the tweet is directed.
Entries in this column may include the names or identifiers of various products or brands, allowing for the categorization of tweets based on the entity they reference.
Understanding which products or brands are most frequently mentioned in tweets can help identify consumer preferences and the areas where sentiment analysis may be most relevant.

`is_there_an_emotion_directed_at_a_brand_or_product:`

The is_there_an_emotion_directed_at_a_brand_or_product column characterizes the emotional sentiment or tone conveyed within each tweet directed at a product or brand.
This column serves as a crucial indicator of the emotional context of the tweets and can be categorized into several classes, including:

* Positive: Tweets expressing favorable sentiments, such as satisfaction, excitement, or endorsement, towards the product or brand.

* Negative: Tweets containing unfavorable sentiments, such as criticism, frustration, or dissatisfaction, directed at the product or brand.

* No Emotion: Tweets that do not convey any discernible emotional sentiment. These tweets may provide neutral or factual information.

* Not Clear: Tweets where the emotional tone is ambiguous or unclear, making it challenging to determine the sentiment.

Analyzing this column allows for sentiment classification and provides valuable insights into how consumers perceive and react to products or brands in the context of social media.

**Dataset Size:**

The dataset contains a total of 8,721 entries, each representing a unique tweet. This dataset size is substantial and provides a rich source of data for sentiment analysis and brand/product perception studies.

**Data Exploration and Analysis:**

To gain a deeper understanding of the dataset and its implications, exploratory data analysis (EDA) techniques, natural language processing (NLP) methods, and sentiment analysis tools can be applied.
EDA involves a series of techniques and methods to gain insights into the structure, content, and patterns within the textual information. Here's a step-by-step guide on how to apply EDA to text data:

## Data Preparation

**Data Loading and Overview:** We begin by loading the text data into our analysis environment (Python with pandas). The raw file contains characters outside the ASCII range, so we first define a function that strips them before the CSV is parsed. Despite its name, `remove_non_utf8` removes every non-ASCII character, which includes accented letters and emoji.

In [61]:
import pandas as pd
import re
import io
import warnings

# Suppress all warnings
warnings.filterwarnings('ignore')

def remove_non_utf8(text):
    return re.sub(r'[^\x00-\x7F]+', '', text)

with open('data/judge_1377884607_tweet_product_company.csv', 'r', encoding='utf-8') as file:
    cleaned_text = remove_non_utf8(file.read())

df = pd.read_csv(io.StringIO(cleaned_text))
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion
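
To make the effect of the cleaning step concrete, here is a minimal sketch that applies the same regular expression to a made-up string. The sample text and the name `strip_non_ascii` are illustrative assumptions, not taken from the dataset.

```python
import re

def strip_non_ascii(text):
    # Same pattern as remove_non_utf8 above: delete runs of non-ASCII characters
    return re.sub(r'[^\x00-\x7F]+', '', text)

# Made-up example text containing a heart symbol and an accented letter
sample = "Love my new iPad \u2764 at the caf\u00e9"
cleaned = strip_non_ascii(sample)
print(repr(cleaned))
```

The heart and the accented letter are both dropped, so the approach is lossy for any tweet that relies on non-English characters or emoji.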


In [62]:
#renaming the column 'is_there_an_emotion_directed_at_a_brand_or_product' to 'Sentiment'.

column_name_mapping = {'is_there_an_emotion_directed_at_a_brand_or_product': 'Sentiment'}
df.rename(columns=column_name_mapping, inplace=True)

In [63]:
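#filling in missing (NaN) values with the string 'N/A'
#assigning the result back avoids the chained inplace fillna that newer pandas versions warn about

df['emotion_in_tweet_is_directed_at'] = df['emotion_in_tweet_is_directed_at'].fillna('N/A')
df['tweet_text'] = df['tweet_text'].fillna('N/A')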

In [64]:

def assign_brand(phrase):
    """ 
    Takes in a phrase as input and returns
    a brand label based on certain keywords found in the phrase.
    """
    # Substring checks cover the longer labels too, e.g. 'iPad or iPhone App'
    # and 'Other Apple product or service' both match the first branch
    if 'iPad' in phrase or 'iPhone' in phrase or 'Apple' in phrase:
        return 'Apple'
    elif 'Google' in phrase:
        return 'Google'
    elif 'Android' in phrase:
        return 'Android'
    else:
        return 'N/A'


#creating a new column called 'brand' that contains the assigned brand labels

df['brand'] = df['emotion_in_tweet_is_directed_at'].apply(assign_brand)

**Text Preprocessing:** Before conducting EDA, it's crucial to preprocess the text data. Common preprocessing steps include:
* Lowercasing: Convert all text to lowercase to ensure consistency.
* Tokenization: Split text into individual words or tokens.
* Stop Word Removal: Eliminate common and uninformative words like "the," "and," and "in."
* Punctuation Removal: Remove special characters, punctuation marks, and symbols.
* Lemmatization or Stemming: Reduce words to their root form for better analysis.

In the cell below we create a function that integrates the above steps, using stemming rather than lemmatization.

In [65]:
import nltk
import re
from nltk.tokenize import word_tokenize,TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
# nltk.download('punkt')
# nltk.download('stopwords')


def clean_and_preprocess_text(text):
    tokenizer = TweetTokenizer()
    tokens = tokenizer.tokenize(text)
    # Convert tokens to lowercase
    tokens = [token.lower() for token in tokens]
    # Remove mentions (words starting with '@') and URLs
    tokens = [token for token in tokens if not token.startswith('@') and not token.startswith('http')]
    # Remove punctuation and numbers using regular expressions
    tokens = [re.sub(r'[^a-zA-Z]', '', token) for token in tokens]
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # Apply stemming using the Porter Stemmer
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]

    cleaned_text = ' '.join(stemmed_tokens) 
    return cleaned_text

In [66]:
#Testing the function
df['tweet_text'][4]

"@sxtxstate great stuff on Fri #SXSW: Marissa Mayer (Google), Tim O'Reilly (tech books/conferences) &amp; Matt Mullenweg (Wordpress)"

In [67]:
clean_and_preprocess_text(df['tweet_text'][4])

'great stuff fri sxsw  marissa mayer  googl   tim oreilli  tech book  confer   matt mullenweg  wordpress '
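
The doubled spaces in the output above come from tokens such as `(` or `&` that become empty strings once non-letters are stripped and are then joined back into the text. One possible refinement, shown here as a standalone sketch with a hypothetical helper name, is to drop empty tokens before joining. This is not what the function above does, and the extra whitespace should not matter to a tokenizing vectorizer either way.

```python
import re

def strip_non_letters(tokens):
    """Remove non-letter characters and discard tokens left empty."""
    cleaned = [re.sub(r'[^a-zA-Z]', '', token) for token in tokens]
    return [token for token in cleaned if token]

# Made-up token list resembling the tokenizer output for the tweet above
tokens = ['great', 'stuff', '(', 'google', ')', '&', '#sxsw', ':']
print(' '.join(strip_non_letters(tokens)))
```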

In [68]:
#creating a new column that contains processed text
df['processed_text'] = df['tweet_text'].map(clean_and_preprocess_text)

Next we define a mapping from the sentiment labels to numeric codes, because the models used later require numeric targets. 'Positive emotion' is mapped to 1.0, 'Negative emotion' to 0.0, and both 'No emotion toward brand or product' and 'I can't tell' to 2.0, so ambiguous tweets are grouped with neutral ones.

In [69]:
# Define a mapping dictionary
sentiment_mapping = {'No emotion toward brand or product': 2.0,
                  'Positive emotion': 1.0, 
                  'Negative emotion': 0.0,
                  'I can\'t tell': 2.0}

# Use the .map() method to map values in column 'Sentiment' to new values
df['Sentiment'] = df['Sentiment'].map(sentiment_mapping)
df.sample(5)

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,Sentiment,brand,processed_text
1627,Google's #geosocial Offers platform goes live ...,,2.0,,googl geosoci offer platform goe live sxsw link
3461,Still a big line outside of Apple's pop-up sho...,,2.0,,still big line outsid appl popup shop day ip...
1461,omarg: It's not a rumor: Apple is opening up a...,,2.0,,omarg rumor appl open temporari store downto...
3280,Here is the video I took with my iPhone of @me...,,2.0,,video took iphon super sxsw link
4690,Andrew K of PRX equates the homogeneity of the...,,0.0,,andrew k prx equat homogen appl ecosystem w pr...


In the next step we build a DataFrame that contains only the tweets whose sentiment is positive (1.0) or negative (0.0). This gives a binary target (0 or 1) for training the baseline model and assessing its performance. We then split the data into training and test sets so the model can be validated on unseen tweets.

In [70]:
from sklearn.model_selection import train_test_split
#creating new df where sentiment is either positive or negative
bi_tar = df[(df['Sentiment'] == 0)| (df['Sentiment'] == 1)]

X = bi_tar['processed_text']
y = bi_tar['Sentiment']

X_train_bi, X_test_bi, y_train_bi, y_test_bi = train_test_split(X, y, test_size=0.2, random_state=42)


In [71]:
#creating data for the iterated model that has multiple classes(targets)
multi_tar = df.copy()
X = multi_tar['processed_text']
y = multi_tar['Sentiment']
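#One-hot encoding to ensure that each class label is treated as a separate, independent category.
#dtype=float keeps the targets numeric (newer pandas versions default to boolean dummies)
y_dummies = pd.get_dummies(y, dtype=float)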
X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(X, y_dummies, test_size=0.2, random_state=42)

Machine learning algorithms operate on numbers, so the text must be converted into numerical vectors before modeling. Options include GloVe (Global Vectors for Word Representation), Word2Vec, TF-IDF (Term Frequency-Inverse Document Frequency), and BERT (Bidirectional Encoder Representations from Transformers). Here we use TF-IDF, wrapped in a function that fits the vectorizer on the training text and applies the same vocabulary to the test text. The function also casts both results to Compressed Sparse Row (CSR) matrices; `TfidfVectorizer` already returns SciPy sparse output, so this step mainly makes the storage format explicit. Note that the single `vectorizer` instance is refit on every call, so it always reflects the most recent training set.

In [72]:
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix

vectorizer = TfidfVectorizer()

def csr_Tfid_vect(X_train,X_test):
    tf_idf_train = vectorizer.fit_transform(X_train)
    tf_idf_test = vectorizer.transform(X_test)

    tf_idf_train = csr_matrix(tf_idf_train)
    tf_idf_test = csr_matrix(tf_idf_test)

    return tf_idf_train,tf_idf_test

X_tf_idf_train_bi,X_tf_idf_test_bi = csr_Tfid_vect(X_train_bi,X_test_bi)
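
To illustrate what the vectorizer computes, the sketch below reimplements TF-IDF in plain Python on a toy corpus. It assumes scikit-learn's default settings as I understand them: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The corpus and the function name are made up for the example.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (assumed scikit-learn default)
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return sorted(idf), rows

# Toy corpus of already-stemmed text, invented for illustration
docs = ["ipad great sxsw", "iphone batteri dead sxsw", "googl map great"]
vocab, rows = tfidf_rows(docs)
print(rows[0])
```

In the first toy document, `ipad` receives a larger weight than `sxsw` because it appears in fewer documents, which is the behaviour that makes TF-IDF useful for down-weighting ubiquitous tokens.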

In [73]:
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense
from keras.layers import Dropout
from keras.models import Sequential
from keras.preprocessing import text, sequence


With the provided data and functions, we are prepared to embark on exploratory data analysis (EDA) and commence the modeling process.

## Text Analysis

First we calculate word frequencies and print the 10 most frequent words along with their normalized frequencies.

In [74]:
import nltk
from nltk.corpus import stopwords
from nltk.probability import FreqDist
import string

# Concatenate all tweet text into a single string
big_sentence = ' '.join(df['tweet_text'])

# Tokenization pattern
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"

# Tokenize and convert to lowercase
tweets_raw = nltk.regexp_tokenize(big_sentence, pattern)
tweets_raw = [word.lower() for word in tweets_raw]

# Create a list of stopwords and add punctuation
stopwords_list = stopwords.words('english')
stopwords_list += list(string.punctuation)

# Remove stopwords and punctuation
tweets_raw_stopped = [word for word in tweets_raw if word not in stopwords_list]

# Calculate word frequency distribution
tweets_freqdist = FreqDist(tweets_raw_stopped)

# Calculate total word count
total_word_count = sum(tweets_freqdist.values())

# Get and print the top 10 most frequent words with normalized frequencies
tweets_freqdist_top_10 = tweets_freqdist.most_common(10)
print(f'{"Word":<10} {"Normalized Frequency":<20}')
for word in tweets_freqdist_top_10:
    normalized_frequency = word[1] / total_word_count
    print(f'{word[0]:<10} {normalized_frequency:^20.4}')


Word       Normalized Frequency
sxsw              0.0858       
mention          0.06442       
link             0.03821       
rt               0.02743       
ipad             0.02671       
google            0.022        
apple            0.01969       
quot             0.01503       
iphone           0.01414       
store            0.01316       


We can then use NLTK to find and score the top bigram (two-word combinations) collocations in the tweets_raw_stopped text data.

In [75]:
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()
tweets_finder = BigramCollocationFinder.from_words(tweets_raw_stopped)
tweets_scored = tweets_finder.score_ngrams(bigram_measures.raw_freq)
tweets_scored[:15]

[(('rt', 'mention'), 0.02666802617674867),
 (('sxsw', 'link'), 0.008527835968929014),
 (('link', 'sxsw'), 0.007656513598190615),
 (('sxsw', 'rt'), 0.006275374946701025),
 (('mention', 'mention'), 0.005728481118258838),
 (('mention', 'sxsw'), 0.005478207671344618),
 (('apple', 'store'), 0.005348436254426132),
 (('sxsw', 'mention'), 0.004755195491370201),
 (('link', 'rt'), 0.004718117943679205),
 (('mention', 'google'), 0.004356611853691997),
 (('social', 'network'), 0.0040970690198550265),
 (('new', 'social'), 0.003781909864481563),
 (('mention', 'rt'), 0.0031886691014256317),
 (('network', 'called'), 0.003021820136816151),
 (('store', 'sxsw'), 0.003021820136816151)]
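
For intuition, the raw-frequency score can be approximated by counting adjacent word pairs and dividing by the total number of tokens. The sketch below does this with the standard library on a made-up token list. It illustrates the idea under that assumption about the normalization and is not a replacement for NLTK's finder.

```python
from collections import Counter

def bigram_raw_freq(tokens):
    """Score each adjacent word pair by its count divided by the token count."""
    pairs = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    # Sort by descending score, breaking ties alphabetically
    return sorted(((pair, count / total) for pair, count in pairs.items()),
                  key=lambda item: (-item[1], item[0]))

# Made-up tokens echoing the patterns seen in the real output
tokens = ['rt', 'mention', 'apple', 'store', 'sxsw',
          'rt', 'mention', 'apple', 'store']
for pair, score in bigram_raw_freq(tokens)[:3]:
    print(pair, round(score, 3))
```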

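SXSW (South by Southwest) is an annual set of conferences and festivals covering technology, film, music, education, and culture, which explains why it dominates the corpus.
In this context `rt` is the Twitter abbreviation for "retweet". The tokens `mention` and `link` appear to be placeholders left over from anonymized @-mentions and URLs, and `quot` is a remnant of the HTML entity `&quot;`. None of these carry sentiment, so they are candidates for the stop word list.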

Next we train a Word2Vec model on the tokenized text from the 'processed_text' column of our DataFrame. We can then retrieve the words most similar to a given word from the trained model.

In [76]:
from gensim.models import Word2Vec
from nltk import word_tokenize

data = df['processed_text'].map(word_tokenize)
model = Word2Vec(data, window=5, min_count=1, workers=4)
model.train(data, total_examples=model.corpus_count, epochs=10)

(708737, 978660)

In [77]:
wv = model.wv
wv.most_similar('sxsw')

[('rt', 0.683062732219696),
 ('tnw', 0.6658665537834167),
 ('conveni', 0.6573855876922607),
 ('unit', 0.6483298540115356),
 ('ncbshow', 0.6436857581138611),
 ('cnt', 0.6349107623100281),
 ('pun', 0.63333660364151),
 ('steam', 0.6303820610046387),
 ('sink', 0.6300735473632812),
 ('motiv', 0.6265491843223572)]

In [78]:
wv.most_similar(negative=['sxsw'])

[('cosbi', 0.4372282028198242),
 ('dpe', 0.23546390235424042),
 ('shortcut', 0.23334373533725739),
 ('linney', 0.22220098972320557),
 ('laura', 0.21703214943408966),
 ('hootstat', 0.19113896787166595),
 ('ure', 0.17929907143115997),
 ('pah', 0.17042851448059082),
 ('comedi', 0.1692882627248764),
 ('sketchi', 0.1489688605070114)]
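
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that measure, using made-up 3-dimensional vectors rather than the trained embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, chosen only to show the range of the score
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]    # same direction as vec_a
vec_c = [-1.0, -2.0, 0.0]  # opposite direction
vec_d = [0.0, 0.0, 3.0]    # orthogonal to vec_a
print(cosine_similarity(vec_a, vec_b),
      cosine_similarity(vec_a, vec_c),
      cosine_similarity(vec_a, vec_d))
```

With a corpus of this size and `min_count=1`, many words are seen only a handful of times, so the nearest neighbours listed above may change noticeably between training runs.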

## Modeling & Evaluation

**Baseline Model**: Sentiment is either positive (1) or negative (0)


Here we set up and evaluate three machine learning models (Random Forest, Support Vector Machine, and Logistic Regression) using cross-validation. The output shows how well each model performs on the training data under the chosen scheme (in our case, 2-fold cross-validation). We then use the mean scores to decide which model is most suitable for this classification task.

In [79]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

rf =  Pipeline([('Random Forest', RandomForestClassifier(n_estimators=100, verbose=True))])
svc = Pipeline([('Support Vector Machine', SVC())])
lr = Pipeline([('Logistic Regression', LogisticRegression())])

models = [('Random Forest', rf),
          ('Support Vector Machine', svc),
          ('Logistic Regression', lr)]

scores = [(name, cross_val_score(model, X_tf_idf_train_bi, y_train_bi, cv=2).mean()) for name, model in models]
scores 

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    7.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    6.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.0s finished


[('Random Forest', 0.8637869451193023),
 ('Support Vector Machine', 0.8582943167130577),
 ('Logistic Regression', 0.8447452254919312)]

Based on these scores, the Random Forest model achieved the highest mean accuracy, followed by the Support Vector Machine and Logistic Regression models. We can use the Random Forest model as our final baseline model.
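
When classes are imbalanced, accuracy is easier to interpret next to the score of a trivial classifier that always predicts the most common class. The helper below is a hypothetical sketch of that check using toy labels. In the notebook it could be applied to `y_train_bi`, whose class balance is not shown here.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Made-up labels: four positive tweets and one negative tweet
toy_labels = [1.0, 1.0, 1.0, 1.0, 0.0]
print(majority_baseline_accuracy(toy_labels))
```

If the cross-validated accuracies are only slightly above this value, the models may be learning little beyond the class prior, and metrics such as per-class recall would be worth reporting alongside accuracy.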

Next we implement a neural network for the classification task. Deep learning libraries such as TensorFlow or PyTorch could be used; here we use Keras, which was originally developed as a separate library and has since been tightly integrated into TensorFlow. We use a `Sequential` model, a Keras architecture for building feedforward neural networks in which layers are stacked one after another.


In [80]:
X_tf_idf_train_bi.shape

(2731, 3978)

In [81]:
zty = X_tf_idf_train_bi.toarray()

In [82]:
model_1 = Sequential()

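# Input width is taken from the TF-IDF matrix (3978 features here) instead of being hard-coded.
# No activation is specified, so this first layer is linear.
model_1.add(Dense(units=64, input_shape=(zty.shape[1],)))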
model_1.add(Dropout(0.5))

model_1.add(Dense(32, activation='relu'))
model_1.add(Dropout(0.5))

model_1.add(Dense(1, activation='sigmoid'))

model_1.compile(
    optimizer="adam",
    loss='binary_crossentropy',
    metrics=["accuracy"]
)

model_1.summary()

Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_16 (Dense)            (None, 64)                254656    
                                                                 
 dropout_9 (Dropout)         (None, 64)                0         
                                                                 
 dense_17 (Dense)            (None, 32)                2080      
                                                                 
 dropout_10 (Dropout)        (None, 32)                0         
                                                                 
 dense_18 (Dense)            (None, 1)                 33        
                                                                 
Total params: 256769 (1003.00 KB)
Trainable params: 256769 (1003.00 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [83]:
model_1.fit(zty, y_train_bi, epochs=5, batch_size=32, validation_split=0.1)

Epoch 1/5


Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x1dbaaea15a0>

**Iterated Model**: Sentiment is positive (1), negative (0), or neutral (2), where class 2 combines 'No emotion toward brand or product' and 'I can't tell' as defined in the sentiment mapping

In [84]:
X_tf_idf_train_multi ,X_tf_idf_test_multi = csr_Tfid_vect(X_train_multi,X_test_multi)

In [85]:
X_tf_idf_train_multi.shape

(6976, 6486)

In [86]:
pty = X_tf_idf_train_multi.toarray()

In [87]:
model_2 = Sequential()

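# Input width is taken from the TF-IDF matrix (6486 features here) instead of being hard-coded
model_2.add(Dense(64, activation='relu', input_shape=(pty.shape[1],)))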
model_2.add(Dropout(0.5))

model_2.add(Dense(3, activation='softmax'))

model_2.compile(
    optimizer="adam",
    loss='categorical_crossentropy',
    metrics=["accuracy"]
)

model_2.fit(pty, y_train_multi, epochs=5, batch_size=32)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x1dbb6220c40>

In [88]:
model_2.predict(X_tf_idf_test_multi.toarray())



array([[0.002417  , 0.01918484, 0.9783982 ],
       [0.0041768 , 0.09200642, 0.90381676],
       [0.21975414, 0.32695657, 0.4532893 ],
       ...,
       [0.12181512, 0.4919104 , 0.3862745 ],
       [0.02306609, 0.0343857 , 0.9425483 ],
       [0.02364898, 0.6384658 , 0.33788523]], dtype=float32)

In [98]:
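import numpy as np  # needed for np.round below

def sentiment_predict(text):
    """Clean a raw tweet, vectorize it, and return the predicted sentiment label."""
    cl_txt = clean_and_preprocess_text(text)
    # vectorizer was last fitted on the multi-class training text, matching model_2
    tfidf_vector_single = vectorizer.transform([cl_txt])
    csr_mat = csr_matrix(tfidf_vector_single)
    csr_array = csr_mat.toarray()
    pred = model_2.predict(csr_array)
    rounded_arr = np.round(pred)

    # Extract the rounded values (column order: 0.0 negative, 1.0 positive, 2.0 neutral)
    a, b, c = rounded_arr[0]

    # Determine the sentiment label based on the rounded values
    if a == 1:
        return "Negative"
    elif b == 1:
        return "Positive"
    elif c == 1:
        return "Neutral"
    else:
        # No class probability exceeded 0.5
        return "Unknown"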


In [102]:
sentiment_predict('I love this product')



'Positive'
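
One consequence of rounding the probabilities is that a tweet gets the label "Unknown" whenever no class exceeds 0.5, as with the third row of the prediction array shown earlier. An alternative, sketched below in plain Python with hypothetical function names, is to pick the class with the highest probability. The class order (negative, positive, neutral) follows the column order assumed in `sentiment_predict`.

```python
LABELS = ["Negative", "Positive", "Neutral"]

def label_by_rounding(probs):
    """Mimic the rounding approach: a class wins only if its probability rounds to 1."""
    for label, p in zip(LABELS, probs):
        if round(p) == 1:
            return label
    return "Unknown"

def label_by_argmax(probs):
    """Pick the class with the largest probability."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return LABELS[best]

# Third row of the prediction output shown earlier in the notebook
row = [0.21975414, 0.32695657, 0.4532893]
print(label_by_rounding(row), label_by_argmax(row))
```

Whether an explicit "Unknown" outcome is desirable depends on the application; rounding acts as a confidence threshold, while the arg-max rule always commits to a class.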

In [103]:
import pickle

# with open('sent_model.pkl', 'wb') as file:
#     pickle.dump(model_2, file)

# file.close()

In [104]:
# with open('tfidf_vectorizer.pkl', 'wb') as file:
#     pickle.dump(vectorizer, file)

# file.close()    

## Deployment

In [106]:
import requests


tweet = 'I am enjoying the new apple product'

url = 'http://localhost:5000/predict'
data = {'x': tweet}  # Input data for prediction
response = requests.post(url, json=data)
result = response.json()

result

{'prediction': 'Positive'}
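
The request above assumes a service listening on `/predict` that reads the tweet from the `x` field and returns a JSON object with a `prediction` key. The server code is not part of this notebook, so the following is only a self-contained sketch of such an endpoint using Python's standard library, with a stub in place of the trained model. The handler, the stub, and the keyword rule inside it are all hypothetical.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


def predict_stub(text):
    """Hypothetical stand-in for sentiment_predict; not a real model."""
    return "Positive" if "enjoy" in text.lower() else "Neutral"


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        # Read the JSON body and look up the tweet under the 'x' key
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps({"prediction": predict_stub(payload.get("x", ""))}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def post_json(url, data):
    """POST a dict as JSON and return the decoded JSON response."""
    req = request.Request(url, data=json.dumps(data).encode(),
                          headers={"Content-Type": "application/json"})
    opener = request.build_opener(request.ProxyHandler({}))  # bypass any proxy settings
    with opener.open(req) as resp:
        return json.loads(resp.read())


def demo():
    server = HTTPServer(("127.0.0.1", 0), PredictHandler)  # port 0 picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/predict"
        return post_json(url, {"x": "I am enjoying the new apple product"})
    finally:
        server.shutdown()
        server.server_close()


print(demo())
```

In a real deployment the stub would be replaced by a call that loads the saved vectorizer and model once at startup and runs the same preprocessing used during training.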