## <span style="font-family:Georgia, serif;">**Twitter Sentiment Analysis**: Understanding Emotions in Tweets about Apple and Google Products</span>

![alt text](twits.jpg "Title")

## Overview

## Business Understanding

**Business Problem**: Using Sentiment Analysis to Improve Apple and Google Product Marketing Strategies 

Social media has changed how businesses interact with consumers and the general public. It offers wide-ranging opportunities for marketing and brand development, but it also brings difficulties of its own. One of them is that companies struggle to gauge precisely how the public feels about their products and services.

Organizations are well aware of the sentiment and emotion data available on these platforms, yet they often struggle to use it effectively. The speed and variety of social media communication, its informal and context-dependent language, the growing use of emojis and visual content, the volume of noise, and ethical concerns all make public sentiment hard to measure.

To overcome these challenges, organizations need to invest in sentiment analysis tools, develop cultural and linguistic expertise, and balance data-driven insights against ethical considerations. Doing so lets them use social media data to inform strategic decisions, improve products and services, and build stronger connections with their audience.




## Data Understanding

## Data Preparation

In [2]:
import pandas as pd
import re
import io
import warnings

# Suppress all warnings
warnings.filterwarnings('ignore')

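def remove_non_utf8(text):
    # Despite its name, this strips every non-ASCII character (code points above 0x7F),
    # including valid UTF-8 such as accented letters and emoji.
    return re.sub(r'[^\x00-\x7F]+', '', text)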

with open('data/judge_1377884607_tweet_product_company.csv', 'r', encoding='utf-8') as file:
    cleaned_text = remove_non_utf8(file.read())

df = pd.read_csv(io.StringIO(cleaned_text))
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion
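The regular expression used in `remove_non_utf8` can be checked on a short string. This is a minimal sketch with a made-up sample (not taken from the dataset); `strip_non_ascii` is a name introduced here for illustration. It shows that the pattern removes every character outside the ASCII range, not only invalid UTF-8.

```python
import re

def strip_non_ascii(text):
    # Same pattern as remove_non_utf8: drop runs of characters above 0x7F
    return re.sub(r'[^\x00-\x7F]+', '', text)

# Made-up sample: "cafe" with an accented e, a coffee-cup emoji, then "iPad"
sample = "caf\u00e9 \u2615 iPad"
print(repr(strip_non_ascii(sample)))
```

The accented letter and the emoji are both dropped, so any sentiment signal carried by emoji is lost at this step.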


In [3]:
column_name_mapping = {'is_there_an_emotion_directed_at_a_brand_or_product': 'Sentiment'}
# Rename the columns using the .rename() method
df.rename(columns=column_name_mapping, inplace=True)


In [4]:
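# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to df.
df['emotion_in_tweet_is_directed_at'] = df['emotion_in_tweet_is_directed_at'].fillna('N/A')
df['tweet_text'] = df['tweet_text'].fillna('N/A')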

In [5]:
def assign_brand(phrase):
    # Substring checks run in order. 'iPad or iPhone App' and
    # 'Other Apple product or service' already contain 'iPad', 'iPhone' or 'Apple',
    # so three keywords cover every Apple label.
    if 'iPad' in phrase or 'iPhone' in phrase or 'Apple' in phrase:
        return 'Apple'
    # 'Other Google product or service' contains 'Google'
    elif 'Google' in phrase:
        return 'Google'
    # 'Android App' contains 'Android'; Android is kept as its own brand here
    elif 'Android' in phrase:
        return 'Android'
    else:
        return 'N/A'

df['brand'] = df['emotion_in_tweet_is_directed_at'].apply(assign_brand)
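Because `assign_brand` relies on substring checks applied in a fixed order, the same logic can also be expressed as an ordered keyword table. The sketch below is one possible restructuring rather than the notebook's code; `BRAND_KEYWORDS` and `assign_brand_from_table` are hypothetical names introduced for illustration.

```python
# Ordered (brand, keywords) pairs; the first brand with a matching keyword wins.
BRAND_KEYWORDS = [
    ("Apple", ("iPad", "iPhone", "Apple")),
    ("Google", ("Google",)),
    ("Android", ("Android",)),
]

def assign_brand_from_table(phrase):
    for brand, keywords in BRAND_KEYWORDS:
        if any(keyword in phrase for keyword in keywords):
            return brand
    return "N/A"

labels = ["iPad or iPhone App", "Other Google product or service", "Android App", "N/A"]
print([assign_brand_from_table(label) for label in labels])
```

Keeping the keywords in one table makes it easier to see which labels map to which brand, for example if Android were later to be grouped under Google.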

In [6]:
import nltk
import re
from nltk.tokenize import word_tokenize,TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
# nltk.download('punkt')
# nltk.download('stopwords')


def clean_and_preprocess_text(text):
    tokenizer = TweetTokenizer()
    tokens = tokenizer.tokenize(text)
    # Convert tokens to lowercase
    tokens = [token.lower() for token in tokens]
    # Remove mentions (words starting with '@') and URLs
    tokens = [token for token in tokens if not token.startswith('@') and not token.startswith('http')]
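    # Remove punctuation and numbers using regular expressions.
    # Tokens made up only of such characters become empty strings, which are not
    # filtered out below, so the joined text can contain runs of spaces.
    tokens = [re.sub(r'[^a-zA-Z]', '', token) for token in tokens]
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token not in stop_words]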
    # # Apply stemming using the Porter Stemmer
    # stemmer = PorterStemmer()
    # stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]

    cleaned_text = ' '.join(filtered_tokens) 
    return cleaned_text

In [7]:
df['tweet_text'][4]

"@sxtxstate great stuff on Fri #SXSW: Marissa Mayer (Google), Tim O'Reilly (tech books/conferences) &amp; Matt Mullenweg (Wordpress)"

In [8]:
clean_and_preprocess_text(df['tweet_text'][4])

'great stuff fri sxsw  marissa mayer  google   tim oreilly  tech books  conferences   matt mullenweg  wordpress '
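The runs of spaces in the cleaned tweet come from tokens such as `:` or `(` that are reduced to empty strings and then joined. The sketch below shows one way to drop them. It is dependency-free on purpose: the token list is typed in by hand instead of coming from `TweetTokenizer`, and `STOP_WORDS` is a tiny hypothetical stand-in for NLTK's English stop-word list.

```python
import re

# Hypothetical, tiny stop-word set for illustration; the notebook uses NLTK's English list.
STOP_WORDS = {"on", "the", "a", "for", "and"}

def clean_tokens(tokens):
    lowered = [t.lower() for t in tokens]
    # Drop mentions and URLs
    kept = [t for t in lowered if not t.startswith("@") and not t.startswith("http")]
    # Keep letters only; punctuation-only tokens become ''
    letters_only = [re.sub(r"[^a-z]", "", t) for t in kept]
    # Filter out empty strings as well as stop words
    return [t for t in letters_only if t and t not in STOP_WORDS]

tokens = ["@sxtxstate", "great", "stuff", "on", "Fri", "#SXSW", ":",
          "Marissa", "Mayer", "(", "Google", ")"]
print(" ".join(clean_tokens(tokens)))
```

The extra spaces do not affect whitespace-splitting steps such as TF-IDF vectorization, so this is mainly a readability improvement for `processed_text`.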

In [9]:
df['processed_text'] = df['tweet_text'].map(clean_and_preprocess_text)

In [10]:
# Map the text labels to numeric codes; "I can't tell" is grouped with
# "No emotion toward brand or product" as class 2.0
sentiment_mapping = {'No emotion toward brand or product': 2.0,
                     'Positive emotion': 1.0,
                     'Negative emotion': 0.0,
                     "I can't tell": 2.0}

# Use the .map() method to replace the labels in 'Sentiment' with these codes
df['Sentiment'] = df['Sentiment'].map(sentiment_mapping)
df.sample(5)

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,Sentiment,brand,processed_text
7695,@mention several people/events/sessions on iPh...,,2.0,,several people events sessions iphone apps s...
5968,RT @mention Just a friendly reminder for #SXSW...,,2.0,,rt friendly reminder sxsw attendees talk text...
7726,Cause we so need another one RT @mention Googl...,,2.0,,cause need another one rt google launch major ...
7414,Thanks girl! RT @mention Congrats to @mention ...,Other Apple product or service,1.0,Apple,thanks girl rt congrats winning last ipad cas...
2113,Marissa Mayer #sxsw Google &quot;making smart ...,Google,1.0,Google,marissa mayer sxsw google making smart phones...


In [11]:
from sklearn.model_selection import train_test_split
# Keep only the rows labelled positive (1) or negative (0) for the binary task
bi_tar = df[(df['Sentiment'] == 0) | (df['Sentiment'] == 1)]

X = bi_tar['processed_text']
y = bi_tar['Sentiment']

X_train_bi, X_test_bi, y_train_bi, y_test_bi = train_test_split(X, y, test_size=0.2, random_state=42)


In [12]:
multi_tar = df.copy()
X = multi_tar['processed_text']
y = multi_tar['Sentiment']

y_dummies = pd.get_dummies(y)
X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(X, y_dummies, test_size=0.2, random_state=42)

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix

def csr_Tfid_vect(X_train,X_test):
    vectorizer = TfidfVectorizer()
    tf_idf_train = vectorizer.fit_transform(X_train)
    tf_idf_test = vectorizer.transform(X_test)

    tf_idf_train = csr_matrix(tf_idf_train)
    tf_idf_test = csr_matrix(tf_idf_test)

    return tf_idf_train,tf_idf_test

X_tf_idf_train_bi,X_tf_idf_test_bi = csr_Tfid_vect(X_train_bi,X_test_bi)
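To make the TF-IDF step less of a black box, the weighting can be reproduced by hand on toy data. This sketch follows the default scheme documented for scikit-learn's `TfidfVectorizer` (raw term counts, smoothed idf, L2 normalization) but uses a plain whitespace split, so it is an approximation under those assumptions and not a drop-in replacement. `tfidf_rows` and the three documents are invented for illustration.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Smoothed idf: ln((1 + n) / (1 + df)) + 1, multiplied by the raw count
        weights = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
                   for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = ["great ipad launch", "ipad battery dead", "google maps great"]  # toy data
rows = tfidf_rows(docs)
print(rows[0])
```

Terms that occur in fewer documents (`launch`) receive a larger weight than terms shared across documents (`ipad`, `great`). In `csr_Tfid_vect` the vocabulary and idf values are learned from the training split only and then reused for the test split, which avoids leaking test information into the features.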

In [14]:
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense
from keras.layers import Dropout
from keras.models import Sequential
from keras.preprocessing import text, sequence

def token_seq(feature):
    tokenizer = text.Tokenizer(num_words=20000)
    tokenizer.fit_on_texts(list(feature))
    list_tokenized = tokenizer.texts_to_sequences(feature)
    seq = sequence.pad_sequences(list_tokenized, maxlen=100)
    return seq
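`token_seq` turns each tweet into a fixed-length vector of 100 word indices. The padding rule can be sketched in plain Python, assuming the Keras defaults for `pad_sequences` (pad and truncate at the front, fill with 0); `pad_pre` is a hypothetical helper and not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    # Keep the last maxlen items, then left-pad with `value` up to maxlen
    trimmed = seq[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

print(pad_pre([4, 9, 2], maxlen=5))
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))
```

Note that `token_seq` fits a new tokenizer on every call, so calling it separately on training and test text would assign different indices to the same word. Reusing one fitted tokenizer for both splits avoids that mismatch.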

## Text Analysis

In [15]:
from  nltk import FreqDist
import string

big_sentence = ' '.join(df['tweet_text'])
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
tweets_raw = nltk.regexp_tokenize(big_sentence, pattern)
tweets_raw = [word.lower() for word in tweets_raw]
stopwords_list = stopwords.words('english')
stopwords_list += list(string.punctuation)
tweets_raw_stopped = [word for word in tweets_raw if word not in stopwords_list]
tweets_freqdist = FreqDist(tweets_raw_stopped)
total_word_count = sum(tweets_freqdist.values())
tweets_freqdist_top_10 = tweets_freqdist.most_common(10)
print(f'{"Word":10} Normalized Frequency')
for word in tweets_freqdist_top_10:
    normalized_frequency = word[1] / total_word_count
    print(f'{word[0]:10} {normalized_frequency:^20.4}')

Word       Normalized Frequency
sxsw              0.0858       
mention          0.06442       
link             0.03821       
rt               0.02743       
ipad             0.02671       
google            0.022        
apple            0.01969       
quot             0.01503       
iphone           0.01414       
store            0.01316       
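The normalized frequency in the table is simply each word's count divided by the total number of tokens. A minimal standard-library sketch on made-up tokens:

```python
from collections import Counter

tokens = ["sxsw", "ipad", "sxsw", "google", "ipad", "sxsw"]  # toy tokens
counts = Counter(tokens)
total = sum(counts.values())
# Share of all tokens taken by each of the two most common words
normalized = {word: count / total for word, count in counts.most_common(2)}
print(normalized)
```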


In [16]:
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()
tweets_finder = BigramCollocationFinder.from_words(tweets_raw_stopped)
tweets_scored = tweets_finder.score_ngrams(bigram_measures.raw_freq)
tweets_scored[:15]

[(('rt', 'mention'), 0.02666802617674867),
 (('sxsw', 'link'), 0.008527835968929014),
 (('link', 'sxsw'), 0.007656513598190615),
 (('sxsw', 'rt'), 0.006275374946701025),
 (('mention', 'mention'), 0.005728481118258838),
 (('mention', 'sxsw'), 0.005478207671344618),
 (('apple', 'store'), 0.005348436254426132),
 (('sxsw', 'mention'), 0.004755195491370201),
 (('link', 'rt'), 0.004718117943679205),
 (('mention', 'google'), 0.004356611853691997),
 (('social', 'network'), 0.0040970690198550265),
 (('new', 'social'), 0.003781909864481563),
 (('mention', 'rt'), 0.0031886691014256317),
 (('network', 'called'), 0.003021820136816151),
 (('store', 'sxsw'), 0.003021820136816151)]
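The bigram scores can be approximated the same way: count adjacent pairs and divide by a total. The sketch below divides by the number of tokens, which is an assumption about how NLTK's `raw_freq` normalizes and should be checked against the NLTK documentation; the token list is made up.

```python
from collections import Counter

tokens = ["rt", "mention", "apple", "store", "rt", "mention"]  # toy tokens
# Adjacent pairs: (tokens[0], tokens[1]), (tokens[1], tokens[2]), ...
bigram_counts = Counter(zip(tokens, tokens[1:]))
# Assumption: normalize by the number of tokens
scores = {pair: n / len(tokens) for pair, n in bigram_counts.items()}
top = sorted(scores.items(), key=lambda item: -item[1])
print(top[0])
```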

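SXSW (South by Southwest) is a conference and festival series covering tech, film, music, education, and culture, which is why `sxsw` appears in so many of these tweets.
In this dataset, `rt` is the retweet marker (as in "RT @mention ..."), not the news channel of the same name, so the top bigram ('rt', 'mention') reflects retweets rather than content. `mention`, `link`, and `quot` likewise look like artifacts (anonymized handles, link placeholders, and the HTML entity `&quot;`) rather than topical words.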

In [17]:
from gensim.models import Word2Vec
from nltk import word_tokenize

data = df['processed_text'].map(word_tokenize)
model = Word2Vec(data, window=5, min_count=1, workers=4)
model.train(data, total_examples=model.corpus_count, epochs=10)

(724216, 978660)

In [18]:
wv = model.wv
wv.most_similar('sxsw')

[('rt', 0.6841039657592773),
 ('insertion', 0.6667062640190125),
 ('closing', 0.6543408036231995),
 ('amiss', 0.6521309018135071),
 ('motivator', 0.651293158531189),
 ('fascinated', 0.6509873867034912),
 ('south', 0.6475978493690491),
 ('asd', 0.6465953588485718),
 ('bart', 0.642691433429718),
 ('gtd', 0.6417152285575867)]

In [19]:
wv.most_similar(negative='sxsw')

[('ringing', 0.3468869626522064),
 ('rows', 0.3398160934448242),
 ('captures', 0.32452720403671265),
 ('ure', 0.3209640681743622),
 ('snazzy', 0.3176901042461395),
 ('missoni', 0.2706739902496338),
 ('bt', 0.2581568956375122),
 ('cosby', 0.24320140480995178),
 ('disgusted', 0.18615488708019257),
 ('overthere', 0.18594567477703094)]
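`most_similar` ranks words by the cosine similarity between their vectors. The sketch below reproduces that ranking on made-up three-dimensional vectors; real Word2Vec vectors have many more dimensions (100 by default in recent gensim versions), and the words and values here are purely illustrative.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Made-up vectors, not taken from the trained model
vectors = {
    "sxsw": [1.0, 0.0, 1.0],
    "austin": [0.9, 0.1, 0.8],
    "battery": [-0.2, 1.0, 0.0],
}
# Rank every other word by its similarity to "sxsw", most similar first
ranked = sorted((w for w in vectors if w != "sxsw"),
                key=lambda w: cosine(vectors["sxsw"], vectors[w]), reverse=True)
print(ranked)
```

Several of the neighbours returned for `sxsw` above are not obviously related to the festival, which is a common outcome when embeddings are trained on a small corpus.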

## Modeling & Evaluation

**Baseline Model**: Sentiment is either positive (1) or negative (0)


In [20]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

rf =  Pipeline([('Random Forest', RandomForestClassifier(n_estimators=100, verbose=True))])
svc = Pipeline([('Support Vector Machine', SVC())])
lr = Pipeline([('Logistic Regression', LogisticRegression())])

models = [('Random Forest', rf),
          ('Support Vector Machine', svc),
          ('Logistic Regression', lr)]

scores = [(name, cross_val_score(model, X_tf_idf_train_bi, y_train_bi, cv=2).mean()) for name, model in models]
scores 

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:   12.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:   10.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.2s finished


[('Random Forest', 0.8637869451193023),
 ('Support Vector Machine', 0.8568296515587877),
 ('Logistic Regression', 0.8443789251256308)]
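Accuracy scores like these are easier to interpret next to the accuracy of always predicting the most frequent class. The sketch below computes that reference value on made-up labels; the real class balance is not shown in this notebook, and `y_train_bi.value_counts(normalize=True)` would give it for the actual data.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    # Accuracy obtained by always predicting the most frequent class
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

toy_labels = [1, 1, 1, 1, 0]  # made-up labels, not the dataset's distribution
print(majority_baseline_accuracy(toy_labels))
```

If positive tweets heavily outnumber negative ones, a model needs to clear this value by a meaningful margin before its accuracy says much about its ability to detect negative sentiment.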

In [21]:
X_t = token_seq(X_train_bi)

In [22]:
model_1 = Sequential()

model_1.add(Dense(units=64, input_shape=(100,)))
model_1.add(Dropout(0.5))

model_1.add(Dense(32, activation='relu'))
model_1.add(Dropout(0.5))

model_1.add(Dense(1, activation='sigmoid'))

model_1.compile(
    optimizer="adam",
    loss='binary_crossentropy',
    metrics=["accuracy"]
)

model_1.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 64)                6464      
                                                                 
 dropout (Dropout)           (None, 64)                0         
                                                                 
 dense_1 (Dense)             (None, 32)                2080      
                                                                 
 dropout_1 (Dropout)         (None, 32)                0         
                                                                 
 dense_2 (Dense)             (None, 1)                 33        
                                                                 
Total params: 8577 (33.50 KB)
Trainable params: 8577 (33.50 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
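The parameter counts in the summary follow from the layer widths: a Dense layer has one weight per input-unit pair plus one bias per unit. The short check below recomputes them; `dense_params` is a helper name introduced here.

```python
def dense_params(n_inputs, n_units):
    # One weight per input-unit pair plus one bias per unit
    return n_inputs * n_units + n_units

layer_sizes = [100, 64, 32, 1]  # input width followed by the three Dense layers
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(per_layer, sum(per_layer))
```

The Dropout layers contribute no parameters, which matches the zeros in the summary.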


In [23]:
model_1.fit(X_t, y_train_bi, epochs=15, batch_size=32, validation_split=0.1)



<keras.src.callbacks.History at 0x1c44728d2a0>

**Iterated Model**: Sentiment is positive (1), negative (0), or neutral (2), where neutral covers both "No emotion toward brand or product" and "I can't tell"

In [24]:
X_t_multi = token_seq(X_train_multi)

In [25]:
model_2 = Sequential()

model_2.add(Dense(64, activation='relu', input_shape=(100,)))
model_2.add(Dropout(0.5))

model_2.add(Dense(3, activation='softmax'))

model_2.compile(
    optimizer="adam",
    loss='categorical_crossentropy',
    metrics=["accuracy"]
)

model_2.fit(X_t_multi, y_train_multi, epochs=15, batch_size=32, validation_split=0.1)



<keras.src.callbacks.History at 0x1c4470df3a0>
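`model_2` ends in a three-unit softmax layer and is trained with categorical cross-entropy against the one-hot targets produced by `pd.get_dummies`. The sketch below works through that computation for a single example in plain Python; the scores are made up and the function names are introduced here for illustration.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    shifted = [x - max(logits) for x in logits]
    exps = [math.exp(x) for x in shifted]
    return [e / sum(exps) for e in exps]

def categorical_crossentropy(one_hot, probs):
    # Only the probability assigned to the true class contributes to the loss
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

probs = softmax([2.0, 1.0, 0.1])                   # made-up scores for classes 0, 1, 2
loss = categorical_crossentropy([1, 0, 0], probs)  # true class is 0
print(probs, loss)
```

The probabilities sum to 1, and the loss shrinks as the probability of the true class approaches 1.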

In [26]:
c = df['Sentiment'].tolist()  # target labels (0.0, 1.0, 2.0) for the three-class model


In [37]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Note: despite the name r_f, this pipeline is TF-IDF features + logistic regression,
# not a random forest.
r_f = Pipeline([('tfidf', TfidfVectorizer(max_features=1000)),
                ('classifier', LogisticRegression())])

# Mean accuracy over 5 cross-validation folds
scores = cross_val_score(r_f, df['processed_text'], c, cv=5).mean()
scores

0.6728576404405773

In [42]:
r_f.fit(df['processed_text'],c)

In [43]:
def map_to_new_column(value):
    if value == 0.0:
        return 'negative'
    elif value == 1.0:
        return 'positive'
    elif value == 2.0:
        return 'neutral'
    else:
        return 'unknown'  

In [46]:
def sentiment_predict(filepath, tweet_column, dir_at, model):
    # Load the CSV and strip non-ASCII characters, as in the data preparation step
    with open(filepath, 'r', encoding='utf-8') as file:
        cleaned_text = remove_non_utf8(file.read())
    df = pd.read_csv(io.StringIO(cleaned_text))

    # Fill missing values by assignment rather than inplace on a column selection
    df[tweet_column] = df[tweet_column].fillna('N/A')
    df['processed_text'] = df[tweet_column].map(clean_and_preprocess_text)

    df[dir_at] = df[dir_at].fillna('N/A')
    df['brand'] = df[dir_at].apply(assign_brand)

    # Predict numeric classes, then attach readable labels
    df['pred'] = model.predict(df['processed_text'])
    df['prediction'] = df['pred'].apply(map_to_new_column)

    return df


In [47]:
# This is the same file r_f was fitted on, so the predictions below are in-sample
filepath = 'data/judge_1377884607_tweet_product_company.csv'
tweet_column = 'tweet_text'
dir_at = 'emotion_in_tweet_is_directed_at'
model = r_f

pred_df = sentiment_predict(filepath,tweet_column,dir_at,model)
pred_df.sample(5)

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product,processed_text,brand,pred,prediction
2823,#sxsw: @mention We think we control our identi...,,I can't tell,sxsw think control identities facebook googl...,,2.0,neutral
4715,"In the Google keynote, Marissa Meyer and some ...",,No emotion toward brand or product,google keynote marissa meyer guy demoing new ...,,2.0,neutral
1055,Talked to some great developers at the Android...,Android,Positive emotion,talked great developers android meetup lookin...,Android,1.0,positive
3300,"@mention Hey Mark, no sleep for you at #sxsw! ...",iPad,Positive emotion,hey mark sleep sxsw bring home shiny new ipa...,Apple,1.0,positive
7744,Spending some time this morning resetting my a...,Android,Negative emotion,spending time morning resetting android phone ...,Android,2.0,neutral


In [48]:
#positive apple tweets
apple_pos = pred_df[(pred_df['brand'] == 'Apple') & (pred_df['prediction'] == 'positive')]
#negative apple tweets
apple_neg = pred_df[(pred_df['brand'] == 'Apple') & (pred_df['prediction'] == 'negative')]
apple_neg.emotion_in_tweet_is_directed_at.value_counts()

iPad                  26
iPhone                20
Apple                 19
iPad or iPhone App    15
Name: emotion_in_tweet_is_directed_at, dtype: int64

In [49]:
#positive Google tweets
Google_pos = pred_df[(pred_df['brand'] == 'Google') & (pred_df['prediction'] == 'positive')]
#negative google tweets
Google_neg = pred_df[(pred_df['brand'] == 'Google') & (pred_df['prediction'] == 'negative')]
Google_neg.emotion_in_tweet_is_directed_at.value_counts()

Google                             10
Other Google product or service     2
Name: emotion_in_tweet_is_directed_at, dtype: int64

In [None]:
import pickle

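# The with block closes the file on exit, so no explicit close() is needed.
# Note that r_f is the logistic regression pipeline, despite the file name.
with open('model_rf.pkl', 'wb') as file:
    pickle.dump(r_f, file)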
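For deployment, the saved pipeline has to be loaded back before it can predict. The sketch below shows the save and load round trip with a placeholder dictionary standing in for the fitted pipeline and a temporary, hypothetical file name, so it runs without scikit-learn.

```python
import os
import pickle
import tempfile

# Placeholder object standing in for the fitted pipeline
stand_in_model = {"name": "tfidf + logistic regression", "classes": [0.0, 1.0, 2.0]}

# Hypothetical file name inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "model_demo.pkl")
with open(path, "wb") as file:
    pickle.dump(stand_in_model, file)

with open(path, "rb") as file:
    restored = pickle.load(file)

print(restored == stand_in_model)
```

With the real pipeline, the loaded object exposes the same `predict` method as `r_f`. Pickle files should only be loaded from trusted sources, and a pickled scikit-learn model is safest to load with the library version that created it.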

## Deployment