# Leveraging Sentiment Analysis for Enhanced Brand Management

## Problem Statement
---
In the rapidly evolving technology market, understanding customer sentiment towards products is crucial for companies like Google and Apple. These insights can guide product development, marketing strategies, customer service, and more. However, manually analysing customer sentiment is a time-consuming and labour-intensive process. Given the vast amount of customer feedback available on platforms like Twitter, it is virtually impossible for humans to process all the data in a timely manner.
Moreover, human analysis is subject to bias and inconsistency, and the quality of analysis can vary greatly depending on the individual’s understanding and interpretation. This makes it difficult to scale and standardize the sentiment analysis process.

Therefore, there is a need for an automated, efficient, and reliable solution to analyse customer sentiment towards Google and Apple products. Machine Learning, with its ability to learn patterns from large datasets and make predictions, offers a promising solution to this problem.
By applying Machine Learning techniques for sentiment analysis, we can process vast amounts of data in a fraction of the time it would take a human. This not only saves time and resources but also gives more consistent results than manual review, although a model can still inherit bias from the labels it is trained on. Furthermore, Machine Learning models can be retrained over time to adapt to new trends and nuances in customer sentiment.

## Business Understanding
---
Twitter is a platform where users often share their experiences and opinions about products. Analysing these sentiments can provide valuable feedback on what users like or dislike about a product, which can guide improvements and new features. Sentiment analysis can help understand how the brand is perceived in the market. Positive sentiment is usually associated with a strong brand image, while negative sentiment can indicate potential issues that need to be addressed.  By analysing sentiment, Apple and Google can identify trends in consumer behaviour and preferences. This can inform strategic decisions, such as the timing of product releases or marketing campaigns.

Comparing sentiment towards different products can provide insights into competitive positioning. For example, if sentiment towards an Apple product is more positive than a similar Google product, it might indicate a competitive advantage for Apple. Negative tweets can be a signal of customer service issues that need to be addressed. Apple and Google can use sentiment analysis to proactively identify and resolve these issues. This data-driven approach fosters brand loyalty, increases customer satisfaction, and ultimately drives sales growth.

### Key Objectives
$i.$ Utilize Natural Language Processing techniques to construct a machine learning model for automated sentiment analysis of tweets related to Google and Apple products.<br>
$ii.$ Evaluate and select the most suitable machine learning model for sentiment analysis based on its performance metrics.<br>
$iii.$ Analyse frequency of the sentiments expressed in tweets about Google and Apple products.

## Data Understanding
---
The dataset, sourced from CrowdFlower via [data.world](https://data.world/crowdflower/brands-and-product-emotions), comprises over 9,000 tweets with sentiment ratings labeled as positive, negative, or neutral by human raters.

The tweets were posted during the South by Southwest conference, primarily discussing Google and Apple products.
The crowd was asked if the tweet expressed positive, negative, or no emotion towards a brand and/or product. If some emotion was expressed they were also asked to say which brand or product was the target of that emotion. The data was compiled in 2013 by Kent Cavender-Bares.

Tweets, being succinct and emotionally charged, serve as effective indicators of consumer sentiment. South by Southwest serves as a platform for showcasing the latest technology, enabling consumers to compare products from major tech companies directly and potentially mitigating biases to some extent.

The target variable was engineered into two classes: tweets with positive sentiment and those without positive sentiment, encompassing neutral, negative, and indistinguishable sentiments. Emphasis is placed solely on identifying positive tweets, as positive emotions are known to drive sales and contribute to return on investment.
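The two-class target described above can be sketched as follows. This is a minimal illustration on a handful of made-up labels, not rows from the dataset; the variable name `is_positive` is an assumption introduced here.

```python
import pandas as pd

# Toy labels using the dataset's original category names (illustrative only)
sentiment = pd.Series([
    "Positive emotion",
    "No emotion toward brand or product",
    "Negative emotion",
    "I can't tell",
    "Positive emotion",
])

# 1 = positive sentiment, 0 = everything else (neutral, negative, unclear)
is_positive = (sentiment == "Positive emotion").astype(int)

print(is_positive.tolist())
```

Collapsing everything that is not positive into a single class keeps the task focused on detecting positive tweets, at the cost of no longer distinguishing negative from neutral ones.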

## Exploratory Data Analysis
---

In [80]:
#importing the needed libraries
import pandas as pd
import re
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#nlp
import nltk
from nltk.corpus import stopwords,wordnet
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer,TweetTokenizer
from nltk.stem import WordNetLemmatizer
from string import punctuation
#modeling
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score,precision_score,f1_score,confusion_matrix
from sklearn.pipeline import Pipeline


In [50]:
#importing the data
data = pd.read_csv('dataset.csv', encoding= 'unicode_escape')

data.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


In [51]:
#changing the column names for easier readability
data.columns = ['tweet_text', 'product', 'sentiment']

In [52]:
#Viewing the first few rows
data.head()

| | tweet_text | product | sentiment |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


In [53]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   tweet_text  9092 non-null   object
 1   product     3291 non-null   object
 2   sentiment   9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


## Data Cleaning and Preprocessing
---

In [54]:
#Checking for null values
data.isna().sum()

tweet_text       1
product       5802
sentiment        0
dtype: int64

In [56]:
#we drop the one row that has missing tweet text

data.dropna(subset=['tweet_text'], inplace=True)


In [57]:
data.loc[data['sentiment'] == 'No emotion toward brand or product'].sample(10)

| | tweet_text | product | sentiment |
|---|---|---|---|
| 7506 | Audience Q: What prototyping tools do you use?... | NaN | No emotion toward brand or product |
| 6211 | RT @mention Just met three girls from the Appl... | NaN | No emotion toward brand or product |
| 2035 | #Android App Review: #SXSW Go. \| {link} pls ª¼ | NaN | No emotion toward brand or product |
| 6247 | RT @mention Less than half an hour and we'll t... | NaN | No emotion toward brand or product |
| 8952 | {link} #sxsw Google going Social again? | NaN | No emotion toward brand or product |
| 6151 | RT @mention iPad 2's stacked up like pizza box... | NaN | No emotion toward brand or product |
| 5770 | RT @mention Get a look at #SXSW's rumored #App... | NaN | No emotion toward brand or product |
| 2561 | @mention Download the @mention for iPhone to r... | NaN | No emotion toward brand or product |
| 1761 | #google 80's party @mention maggie's! Hair did... | NaN | No emotion toward brand or product |
| 376 | HootSuite blog ÛÒ Social Media Dashboard åÈ H... | NaN | No emotion toward brand or product |


In the `product` column, about 64% of the values are missing (5,802 of 9,093 rows). Further inspection reveals that most of these rows are tweets that express no emotion toward a Google or Apple product.

Dropping such a large share of the data is not an option. Therefore, these records will be assigned a new product category called `undefined` for the analysis in this project.
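The inspection described above can be reproduced with a couple of pandas calls. The sketch below uses a small made-up frame (`toy`) with the same column names, so the numbers it prints are illustrative and not the dataset's.

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names as the project data (values are made up)
toy = pd.DataFrame({
    "product": ["iPad", np.nan, np.nan, "Google", np.nan],
    "sentiment": [
        "Positive emotion",
        "No emotion toward brand or product",
        "No emotion toward brand or product",
        "Negative emotion",
        "I can't tell",
    ],
})

# Share of rows with no product recorded
missing_share = toy["product"].isna().mean()

# Share of missing products within each sentiment label
by_sentiment = toy["product"].isna().groupby(toy["sentiment"]).mean()

print(missing_share)
print(by_sentiment)
```

Running the same two expressions on `data` shows how strongly missing products are concentrated in the "no emotion" rows.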

In [58]:
#replacing null values with 'undefined'
#(assigning back avoids the chained in-place fillna that newer pandas versions warn about)
data['product'] = data['product'].fillna('undefined')

In [60]:
#sanity check
data.isna().sum()

tweet_text    0
product       0
sentiment     0
dtype: int64

In [61]:
#viewing distribution of products

data['product'].value_counts()

product
undefined                          5801
iPad                                946
Apple                               661
iPad or iPhone App                  470
Google                              430
iPhone                              297
Other Google product or service     293
Android App                          81
Android                              78
Other Apple product or service       35
Name: count, dtype: int64

The spread across products is highly imbalanced, with nearly two thirds of the tweets (5,801 of 9,092) not tied to a specific product. To address this, a new column called `brand` will be added, indicating the brand the tweet is about based on the information in the `product` column.

First, all entries in the `product` column are checked. Then a function assigns the appropriate brand to the new column based on the product value. If the product is undefined, the function searches the text of the tweet for brand or product keywords. If none are found, the brand remains undefined. If keywords for both brands are found, the brand is set to "Both". This approach aims to create more balanced classes within the new feature.

In [62]:
#creating a function to identify the brand

def find_brand(product, tweet):
    #Labeling brand as Undefined by default
    brand = 'Undefined'
    product = product.lower()

    #Labeling Google based on product column
    if 'google' in product or 'android' in product:
        brand = 'Google'

    #Labeling Apple based on product column ('ip' covers iPad and iPhone)
    elif 'apple' in product or 'ip' in product:
        brand = 'Apple'

    if brand == 'Undefined':
        #Making tweet lowercase
        lower_tweet = tweet.lower()

        #flagging google if the tweet mentions google or android
        is_google = 'google' in lower_tweet or 'android' in lower_tweet

        #flagging apple based on the tweet; note that the substring 'ip'
        #also matches unrelated words such as 'tip' or 'trip'
        is_apple = 'apple' in lower_tweet or 'ip' in lower_tweet

        #if it has both identifiers in the tweet
        if is_google and is_apple:
            brand = 'Both'
        elif is_google:
            brand = 'Google'
        elif is_apple:
            brand = 'Apple'

    return brand

#Applying function to the product and tweet_text columns

data['brand'] = data.apply(lambda x: find_brand(x['product'], x['tweet_text']), axis = 1)
data['brand'].value_counts(normalize= True) 

brand
Apple        0.590189
Google       0.304883
Undefined    0.081500
Both         0.023427
Name: proportion, dtype: float64

The distribution has shifted notably. Class imbalance persists, but in a different form: about 59% of the tweets are attributed to Apple and about 30% to Google. Tweets with an undefined brand make up about 8% of the data, and tweets mentioning both brands only around 2%. This is an improvement over the product column, where nearly two thirds of the tweets had no product, but the remaining imbalance is still not ideal.
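Because the check for Apple relies on the substring `'ip'`, words such as "tip" or "trip" also count as Apple mentions. One possible refinement, shown here as a sketch and not as the method used in this project, is to match whole keywords at word boundaries. The keyword lists and the function name `brand_from_text` are assumptions introduced for illustration.

```python
import re

# Hypothetical keyword lists; extend them to suit the data
GOOGLE_RE = re.compile(r"\b(?:google|android)", re.IGNORECASE)
APPLE_RE = re.compile(r"\b(?:apple|ipad|iphone|itunes)", re.IGNORECASE)


def brand_from_text(tweet: str) -> str:
    """Return 'Google', 'Apple', 'Both' or 'Undefined' based on keywords in the tweet."""
    is_google = bool(GOOGLE_RE.search(tweet))
    is_apple = bool(APPLE_RE.search(tweet))
    if is_google and is_apple:
        return "Both"
    if is_google:
        return "Google"
    if is_apple:
        return "Apple"
    return "Undefined"


for text in ["Great tip for the road trip", "New #iPad2 line", "Android vs iPhone"]:
    print(text, "->", brand_from_text(text))
```

A stricter matcher like this would move some tweets from Apple to Undefined, so the brand proportions reported above would change if it were adopted.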

In [63]:
data['sentiment'].value_counts()

sentiment
No emotion toward brand or product    5388
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: count, dtype: int64

These classes are significantly imbalanced: about 59% of the tweets are labelled as having no emotion, and roughly a third (about 33%) are rated positive. Fewer than 600 tweets were classified as negative, and fewer than 200 were labelled "I can't tell", meaning the rater could not determine the sentiment. Overall, about 39% of the tweets were assigned positive or negative sentiment.
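The class shares can be recomputed directly from the counts printed above. The snippet below is a small check that hard-codes those four counts; on the real frame, `data['sentiment'].value_counts(normalize=True)` gives the same proportions.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({
    "No emotion toward brand or product": 5388,
    "Positive emotion": 2978,
    "Negative emotion": 570,
    "I can't tell": 156,
})

# Proportion of each label
shares = counts / counts.sum()

# Share of tweets carrying an explicit positive or negative label
polar_share = shares["Positive emotion"] + shares["Negative emotion"]

print(shares.round(3))
print(round(polar_share, 3))
```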

For the analysis in this project, "No emotion toward brand or product" and "I can't tell" will be grouped together and relabeled as "Neutral". In addition, the labels in the `sentiment` column will be simplified, changing "Positive emotion" to "Positive" and "Negative emotion" to "Negative". This makes the labels clearer and more consistent for the subsequent analysis.

In [65]:
#creating a function to modify the sentiments

def clean_sentiments(data, column):
    #Mapping the original labels to the simplified names
    sentiment_map = {
        "No emotion toward brand or product": 'Neutral',
        #"I can't tell" is grouped with `Neutral` too
        "I can't tell": 'Neutral',
        "Positive emotion": 'Positive',
        "Negative emotion": 'Negative',
    }

    #Assigning the relabeled values to the DataFrame's column
    #(any label missing from the mapping would become NaN)
    data[column] = data[column].map(sentiment_map)
    return data

#applying the function to the dataset

data = clean_sentiments(data, 'sentiment') 
data['sentiment'].value_counts() 

sentiment
Neutral     5544
Positive    2978
Negative     570
Name: count, dtype: int64

In [66]:
#previewing the cleaned data
data.head()

| | tweet_text | product | sentiment | brand |
|---|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative | Apple |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive | Apple |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive | Apple |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative | Apple |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive | Google |


## Preprocessing

In [67]:
#previewing a sample tweet
data['tweet_text'].loc[1133]

'Check out the @mention Route {link} ; RSVP here -&gt; https://www.facebook.com/event.php?eid=141164002609303 #sxswi #sxsw'

**Cleaning the Tweets**<br>

Remove:<br>
$i.$  "{link}" and "[video]", because they are just placeholders for external content<br>
$ii.$ Twitter handles, because they carry no useful information about sentiment<br>
$iii.$  stopwords<br>
$iv.$  punctuation<br>
$v.$ all forms of "SXSW": the tweets were collected during the South by Southwest conference, so the tag appears in a large share of them and does not help distinguish sentiment<br>
$vi.$  websites and HTML formatting


Lowercase every word in the corpus<br>
Tokenize<br>
Lemmatize


In [68]:
#downloading stopwords
nltk.download('stopwords', quiet= True)
stopword_list = stopwords.words('english')

nltk.download('wordnet', quiet= True)

#instantiating RegexpTokenizer with a pattern that keeps only words of 3 or more characters
tokenizer = RegexpTokenizer(r"(?u)\w{3,}")

#add 'sxsw' to the stopword list
stopword_list.append('sxsw')

#add 'link' to the stopword list
stopword_list.append('link')

#add punctuation characters to the stopword list
stopword_list += list(punctuation)

#instantiating lemmatizer
lemma = WordNetLemmatizer()

#Instantiating tweet tokenizer

tweet_tokenize = TweetTokenizer(strip_handles= True)

In [69]:
#defining the function to clean and tokenize the tweets
def clean_tweets(text):
    """
    This function takes a tweet and preprocesses it in readiness for modelling
    """
    #Use TweetTokenizer object to remove the handles from the Tweet
    no_handle = tweet_tokenize.tokenize(text)

    #Join the list of non-handle tokens back together
    tweet = " ".join(no_handle) 

    #remove http websites, the hashtag sign (the tagged word is kept),
        #placeholders in curly brackets such as {link}, [video] markers,
        #html entities such as &amp;, www and .com websites,
        #and non-ASCII characters
    #the pattern is built from adjacent raw strings so that no indentation
        #whitespace ends up inside the regular expression
    clean = re.sub(r"https?://\S+"
                   r"|www\.\S+|\b[a-z0-9-]+\.com\b"
                   r"|\{[^}]*\}|\[\s*video\s*\]"
                   r"|&[a-z]+;"
                   r"|#"
                   r"|[^\x00-\x7F]+",
                   " ", tweet, flags=re.IGNORECASE)
    
    #Turn all the tokens lowercase
    lower_tweet = clean.lower()
    #Only include words with 3 or more characters
    token_list = tokenizer.tokenize(lower_tweet)

    #Remove stopwords
    stopwords_removed=[token for token in token_list if token not in stopword_list]

    #Lemmatize the remaining word tokens
    lemma_tokens = [lemma.lemmatize(token) for token in stopwords_removed]


    
    return lemma_tokens

In [70]:
#sample tweet before cleaning
data['tweet_text'].iloc[100]

'Headline: &quot;#iPad 2 is the Must-Have Gadget at #SXSW&quot; Hmm... I could have seen that one coming! {link} #gadget'

In [71]:
#sanity check for the cleaning function
clean_tweets(data['tweet_text'].iloc[100])

['headline',
 'ipad',
 'must',
 'gadget',
 'hmm',
 'could',
 'seen',
 'one',
 'coming',
 'gadget']
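When a long regular expression is split across several lines, the way it is split affects what it matches. The sketch below uses only the standard `re` module and a made-up sentence to contrast a backslash continuation inside a string literal, which keeps the indentation of the next line as part of the pattern, with adjacent raw strings, which are joined without extra whitespace. The names `leaky` and `tight` are illustrative.

```python
import re

text = "see http://example.com now #tag"

# Backslash continuation inside a plain string: the spaces before the "|" on the
# next line become part of the first alternative, so it only matches a URL that
# is followed by that exact run of spaces.
leaky = "(https?://\\S+) \
        |(#\\w+)"

# Adjacent raw strings are concatenated with no extra whitespace.
tight = (r"(https?://\S+)"
         r"|(#\w+)")

print(repr(leaky))
print(repr(re.sub(leaky, " ", text)))  # the URL survives
print(repr(re.sub(tight, " ", text)))  # URL and hashtag are both removed
```

Printing `repr()` of a pattern is a quick way to spot stray whitespace before relying on it for cleaning.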

In [72]:
data['clean_tweet_token'] = data['tweet_text'].apply(clean_tweets)

data.head()

| | tweet_text | product | sentiment | brand | clean_tweet_token |
|---|---|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative | Apple | [iphone, hr, tweeting, rise_austin, dead, need... |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive | Apple | [know, awesome, ipad, iphone, app, likely, app... |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive | Apple | [wait, ipad, also, sale] |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative | Apple | [hope, year, festival, crashy, year, iphone, app] |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive | Google | [great, stuff, fri, marissa, mayer, google, ti... |


## Modelling
---

In [73]:
#data check

data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9092 entries, 0 to 9092
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   tweet_text         9092 non-null   object
 1   product            9092 non-null   object
 2   sentiment          9092 non-null   object
 3   brand              9092 non-null   object
 4   clean_tweet_token  9092 non-null   object
dtypes: object(5)
memory usage: 684.2+ KB


A new DataFrame will be created containing only the cleaned tokens column and the `sentiment` column.

Before modeling the data, a train-test split will be performed to divide the data into training and test sets, thereby avoiding data leakage.

In [75]:
#subsetting the columns for modeling

df = data[['clean_tweet_token', 'sentiment']]

df.head()

| | clean_tweet_token | sentiment |
|---|---|---|
| 0 | [iphone, hr, tweeting, rise_austin, dead, need... | Negative |
| 1 | [know, awesome, ipad, iphone, app, likely, app... | Positive |
| 2 | [wait, ipad, also, sale] | Positive |
| 3 | [hope, year, festival, crashy, year, iphone, app] | Negative |
| 4 | [great, stuff, fri, marissa, mayer, google, ti... | Positive |


In [92]:
#defining the feature and the target
X = df['clean_tweet_token']

y = df['sentiment']

#performing train test split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42 )

#checking the shapes of both the train and test sets
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(6819,)
(6819,)
(2273,)
(2273,)
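Since the sentiment classes are imbalanced, one option worth considering is a stratified split, which keeps the class proportions the same in the training and test sets. The sketch below is not the split used in this project; it runs on made-up labels with a similar imbalance to show the effect of the `stratify` argument of `train_test_split`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (illustrative counts, not the real data)
y_toy = pd.Series(["Neutral"] * 60 + ["Positive"] * 32 + ["Negative"] * 8)

# Placeholder features: one token list per label
X_toy = pd.Series([["token"]] * len(y_toy))

# stratify=y_toy keeps the class proportions equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=42, stratify=y_toy
)

print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

With the default 75/25 split, each class contributes a quarter of its rows to the test set, so the smallest class is still represented there.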


### Baseline Model
**Logistic regression**



In [93]:
#Identity function so the vectorizer uses the already-tokenized tweets as they are
def passthrough(doc): 
    return doc 

In [94]:
#instantiating label encoder
le = LabelEncoder()
#transforming both train and test sets
y_train_enc = le.fit_transform(y_train)

y_test_enc = le.transform(y_test)

#building modeling pipeline using Count Vectorizer

LRpipelineCV = Pipeline([
    ('BOW', CountVectorizer(preprocessor = passthrough, tokenizer = passthrough)), 
    ('Classifier', LogisticRegression()), 
]) 

#Fitting to pipeline
LRpipelineCV.fit(X_train, y_train_enc) 
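Once a pipeline of this kind is fitted, it can be scored with the metrics imported at the top of the notebook. The following sketch is self-contained and runs on a few made-up token lists with binary labels, so its numbers say nothing about the real model; the names `pipe`, `train_docs` and `test_docs` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.pipeline import Pipeline


def passthrough(doc):
    # Tokens are already cleaned, so hand them to the vectorizer unchanged
    return doc


# Made-up token lists and labels (1 = positive, 0 = not positive)
train_docs = [["love", "ipad"], ["great", "app"], ["battery", "dead"], ["crashy", "app"]]
train_y = [1, 1, 0, 0]
test_docs = [["love", "app"], ["dead", "battery"]]
test_y = [1, 0]

pipe = Pipeline([
    ("BOW", CountVectorizer(preprocessor=passthrough, tokenizer=passthrough)),
    ("Classifier", LogisticRegression(max_iter=1000)),
])
pipe.fit(train_docs, train_y)

# Predict on held-out documents and compute the usual metrics
pred = pipe.predict(test_docs)
acc = accuracy_score(test_y, pred)
f1 = f1_score(test_y, pred)
cm = confusion_matrix(test_y, pred, labels=[0, 1])

print(acc, f1)
print(cm)
```

For the project pipeline the equivalent calls would be `LRpipelineCV.predict(X_test)` followed by the same metric functions on `y_test_enc`; with three sentiment classes, `f1_score` needs an explicit `average` argument such as `'macro'`.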

