# From Sentiments to Strategies: Building an NLP Model for Brand Engagement

`
from IPython.display import Image
Image(url="https://storage.ning.com/topology/rest/1.0/file/get/3780584426?profile=original", width=1000, height=800)
`


# Data understanding

The dataset comes from data.world, where it was provided by CrowdFlower. It contains tweets about Apple and Google from the South by Southwest (SXSW) conference. The labels were crowdsourced: for each tweet, contributors recorded which emotion it conveys and which product, service, or company that emotion is directed at, judging from the tweet content.

There are 9093 records and 3 features in this data.

Associated columns in the dataset are:
- `tweet_text`: Contains the text of the tweets.

- `emotion_in_tweet_is_directed_at`: Indicates the brand or product mentioned in the tweet (many missing values).

- `is_there_an_emotion_directed_at_a_brand_or_product`: Categorizes the sentiment as "Positive emotion", "Negative emotion", "No emotion toward brand or product", or "I can't tell".

These long column names are shortened to more manageable ones in the data preparation steps.

# Data Preparation

In [97]:
# Data Exploration Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

# Preprocessing and Feature Extraction
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

# Handling Imbalanced Data
from imblearn.over_sampling import SMOTE

# Model Selection and Evaluation
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import recall_score, accuracy_score, precision_score, f1_score, confusion_matrix, classification_report

# Models
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier 


In [98]:
# Ensure necessary NLTK resources are downloaded
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [99]:
class DataOverview():
    """
    This class takes a dataframe and returns basic information.
    """
    def __init__(self, data):
        self.data = data

    def read_head(self):
        """Returns the first 5 rows"""
        return self.data.head()

    def read_columns(self):
        """Returns the columns of the DataFrame"""
        return self.data.columns

    def read_info(self):
        """Prints the features, datatypes and non-null count"""
        return self.data.info()

    def read_describe(self):
        """Returns the statistical summary of the dataset"""
        return self.data.describe()

    def read_shape(self):
        """Returns the number of rows and columns"""
        return self.data.shape

    def read_unique(self, column_name):
        """Returns unique values for a specific column"""
        if column_name in self.data.columns:
            return self.data[column_name].unique()
        else:
            raise ValueError(f"Column '{column_name}' does not exist in the DataFrame.")

    def read_corr(self):
        """Returns a correlation dataframe"""
        return self.data.corr()

    def read_corr_wrt_target(self, target):
        """Returns a Series containing the correlation of features with respect to target"""
        return self.data.corr()[target].sort_values(ascending=False)

    def read_multicollinearity(self, target):
        """Returns a correlation dataframe without the target"""
        # Drop the target by name instead of assuming it is the last column
        return self.data.corr().drop(index=target, columns=target)

    def read_na(self):
        """Returns the sum of all null values per feature"""
        return self.data.isna().sum()

    def read_duplicated(self):
        """Returns the sum of all duplicated records"""
        return self.data.duplicated().sum()

In [100]:
# The data
filepath='../data/tweet_product_company.csv'
df = pd.read_csv(filepath,encoding='iso-8859-1')

In [101]:
# Instantiate datapreparation object
dprep = DataOverview(data=df)

# First 5 lines of the DataFrame
dprep.read_head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


The original column names are very long, so we rename them for easier work and readability.

In [102]:
# Renaming columns for ease of work
df.rename(columns = {'tweet_text': 'Text', 'emotion_in_tweet_is_directed_at': 'Product/Brand', 
                     'is_there_an_emotion_directed_at_a_brand_or_product':'Emotion'}, inplace = True)


Exploring features and their datatypes

In [103]:
dprep.read_info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Text           9092 non-null   object
 1   Product/Brand  3291 non-null   object
 2   Emotion        9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB
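
The renamed columns show up in `dprep.read_info()` even though `dprep` was created before the rename. This is because the object keeps a reference to the same DataFrame, and `rename(..., inplace=True)` modifies that DataFrame rather than creating a new one. A minimal sketch of the behaviour, using a toy frame and a hypothetical `Holder` class as a stand-in for `DataOverview`:

```python
import pandas as pd


class Holder:
    """Hypothetical stand-in for DataOverview: it stores the frame it is given."""

    def __init__(self, data):
        self.data = data  # a reference to the same object, not a copy


toy = pd.DataFrame({"tweet_text": ["a", "b"]})
holder = Holder(toy)

# An in-place rename mutates the one DataFrame object both names point to.
toy.rename(columns={"tweet_text": "Text"}, inplace=True)
print(list(holder.data.columns))

# A non-in-place rename returns a new object that the holder never sees.
renamed = toy.rename(columns={"Text": "Body"})
print(holder.data is toy, holder.data is renamed)
```

If the rename were done without `inplace=True` and assigned to a new variable, the overview object would have to be re-created to reflect it.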


In [104]:
# Exploring feature null values
dprep.read_na()

Text                1
Product/Brand    5802
Emotion             0
dtype: int64

`Product/Brand` has 5,802 null values, i.e. about 64% of all records in the dataset, while `Text` has 1 missing record.
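
The share of missing values per column can be computed directly rather than by hand. The sketch below uses a small made-up frame (the `toy` data is an assumption for illustration, not the real dataset) and then repeats the arithmetic with the counts reported above.

```python
import pandas as pd

# Toy frame for illustration only; the real data has 9093 rows.
toy = pd.DataFrame({
    "Text": ["tweet one", None, "tweet three", "tweet four"],
    "Product/Brand": ["iPad", None, None, "Google"],
    "Emotion": ["Positive", "Neutral", "Neutral", "Negative"],
})

# isna().mean() is the fraction of missing values in each column.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
})
print(missing)

# The same arithmetic with the counts from this notebook.
pct_product_missing = round(5802 / 9093 * 100, 1)
print(pct_product_missing)
```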

We take a look at the Product/Brand and Emotion columns to check their unique values.

In [105]:
dprep.read_unique('Product/Brand')

array(['iPhone', 'iPad or iPhone App', 'iPad', 'Google', nan, 'Android',
       'Apple', 'Android App', 'Other Google product or service',
       'Other Apple product or service'], dtype=object)

In [106]:
dprep.read_unique('Emotion')

array(['Negative emotion', 'Positive emotion',
       'No emotion toward brand or product', "I can't tell"], dtype=object)

- Product/Brand: We observe a lot of detail on the different products and services of each brand. There are also NaN values.
- Emotion: We observe an "I can't tell" label, which is ambiguous and will need handling.

## Addressing Missing Values

Recall that the text is missing for 1 tweet and the product/brand tag is missing for 5,802 tweets. Let's start by looking at the missing tweet.

In [107]:
# create a copy for cleaning purposes
df_1 = df.copy()

In [108]:
df_1[df_1['Text'].isna()]

Unnamed: 0,Text,Product/Brand,Emotion
6,,,No emotion toward brand or product


This record has no tweet text, so it carries no customer sentiment. We can drop it.

In [109]:
df_1 = df_1[df_1['Text'].notna()]
df_1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9092 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Text           9092 non-null   object
 1   Product/Brand  3291 non-null   object
 2   Emotion        9092 non-null   object
dtypes: object(3)
memory usage: 284.1+ KB


Next, we look at `Product/Brand`, which now has a total of 5,801 missing tags.

In [110]:
df_1[df_1['Product/Brand'].isna()].head(10)

Unnamed: 0,Text,Product/Brand,Emotion
5,@teachntech00 New iPad Apps For #SpeechTherapy...,,No emotion toward brand or product
16,Holler Gram for iPad on the iTunes App Store -...,,No emotion toward brand or product
32,"Attn: All #SXSW frineds, @mention Register fo...",,No emotion toward brand or product
33,Anyone at #sxsw want to sell their old iPad?,,No emotion toward brand or product
34,Anyone at #SXSW who bought the new iPad want ...,,No emotion toward brand or product
35,At #sxsw. Oooh. RT @mention Google to Launch ...,,No emotion toward brand or product
37,SPIN Play - a new concept in music discovery f...,,No emotion toward brand or product
39,VatorNews - Google And Apple Force Print Media...,,No emotion toward brand or product
41,HootSuite - HootSuite Mobile for #SXSW ~ Updat...,,No emotion toward brand or product
42,Hey #SXSW - How long do you think it takes us ...,,No emotion toward brand or product


In this sample, the untagged tweets carry no emotion directed at a specific brand or product, so we use 'Unknown' as a placeholder.

In [111]:
# Assign the result back rather than calling fillna(inplace=True) on a column selection
df_1['Product/Brand'] = df_1['Product/Brand'].fillna('Unknown')

In [112]:
# Verifying the null values are handled
df_1.isna().sum()

Text             0
Product/Brand    0
Emotion          0
dtype: int64
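
One way to check how the 'Unknown' placeholder relates to the emotion labels is a cross-tabulation. The sketch below uses a small made-up frame rather than the real data, so the counts are illustrative only. The `tag_status` name is an assumption introduced for this example.

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    "Product/Brand": ["iPad", None, None, "Google", None],
    "Emotion": [
        "Positive emotion",
        "No emotion toward brand or product",
        "No emotion toward brand or product",
        "Negative emotion",
        "Positive emotion",
    ],
})
toy["Product/Brand"] = toy["Product/Brand"].fillna("Unknown")

# Label each row as tagged or untagged, then count emotions within each group.
tag_status = (
    (toy["Product/Brand"] == "Unknown")
    .map({True: "untagged", False: "tagged"})
    .rename("tag_status")
)
table = pd.crosstab(tag_status, toy["Emotion"])
print(table)
```

On the real data, a table like this would show whether untagged tweets are concentrated in the neutral class or spread across all labels.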

## Handling the Emotion Column

In [113]:
df_1['Emotion'].value_counts()

No emotion toward brand or product    5388
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: Emotion, dtype: int64

For easier interpretation, we shorten the emotion labels.

In [114]:
df_1['Emotion'] = df_1['Emotion'].replace({
    'Positive emotion': 'Positive',
    'Negative emotion': 'Negative',
    'No emotion toward brand or product': 'Neutral',
    "I can't tell": 'Unknown'
})

In [116]:
# verifying emotion values
df_1['Emotion'].value_counts()

Neutral     5388
Positive    2978
Negative     570
Unknown      156
Name: Emotion, dtype: int64
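
`replace` leaves any label that is not in the dictionary unchanged, so a misspelled or unexpected label would pass through silently. A hedged alternative for checking coverage is `Series.map`, which turns unmapped labels into NaN so they are easy to spot. The `raw` series below is made up for illustration, including one deliberately malformed label.

```python
import pandas as pd

label_map = {
    "Positive emotion": "Positive",
    "Negative emotion": "Negative",
    "No emotion toward brand or product": "Neutral",
    "I can't tell": "Unknown",
}

# Toy labels; the third entry has a trailing space on purpose.
raw = pd.Series([
    "Positive emotion",
    "I can't tell",
    "Negative emotion ",
    "No emotion toward brand or product",
])

mapped = raw.map(label_map)  # labels missing from the dictionary become NaN
unmapped = raw[mapped.isna()].unique().tolist()
print(unmapped)
```

An empty `unmapped` list would confirm that every label in the column was covered by the dictionary.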

We take a look at the tweets with the 'Unknown' emotion label to check whether any patterns stand out, or whether they can easily be assigned to the Neutral, Positive, or Negative categories.

In [117]:
pd.set_option("display.max_colwidth", 300)
df_1[df_1['Emotion']=='Unknown']

Unnamed: 0,Text,Product/Brand,Emotion
90,Thanks to @mention for publishing the news of @mention new medical Apps at the #sxswi conf. blog {link} #sxsw #sxswh,Unknown,Unknown
102,ÛÏ@mention &quot;Apple has opened a pop-up store in Austin so the nerds in town for #SXSW can get their new iPads. {link} #wow,Unknown,Unknown
237,"Just what America needs. RT @mention Google to Launch Major New Social Network Called Circles, Possibly Today {link} #sxsw",Unknown,Unknown
341,The queue at the Apple Store in Austin is FOUR blocks long. Crazy stuff! #sxsw,Unknown,Unknown
368,Hope it's better than wave RT @mention Buzz is: Google's previewing a social networking platform at #SXSW: {link},Unknown,Unknown
...,...,...,...
9020,It's funny watching a room full of people hold their iPad in the air to take a photo. Like a room full of tablets staring you down. #SXSW,Unknown,Unknown
9032,"@mention yeah, we have @mention , Google has nothing on us :) #SXSW",Unknown,Unknown
9037,"@mention Yes, the Google presentation was not exactly what I was expecting. #sxsw",Unknown,Unknown
9058,&quot;Do you know what Apple is really good at? Making you feel bad about your Xmas present!&quot; - Seth Meyers on iPad2 #sxsw #doyoureallyneedthat?,Unknown,Unknown


From observation, these tweets are difficult to classify without further context. Some may be genuine and some sarcastic, so they are of little use to models that need clear labels. The solution is to drop them, since they make up only 1.7% of the dataset.

In [120]:
# dropping the 'Unknown' in the Emotion column
df_1 = df_1.loc[df_1['Emotion'] != 'Unknown']
# Verifying they are dropped
df_1['Emotion'].value_counts()

Neutral     5388
Positive    2978
Negative     570
Name: Emotion, dtype: int64
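
The 1.7% figure and the resulting class balance can be reproduced from the label counts shown in the outputs above. This is a small sketch that re-enters those counts by hand instead of reading the data.

```python
import pandas as pd

# Counts copied from the value_counts() output before dropping 'Unknown'.
counts = pd.Series({"Neutral": 5388, "Positive": 2978, "Negative": 570, "Unknown": 156})

# Share of each label, in percent, before dropping 'Unknown'.
share_pct = (counts / counts.sum() * 100).round(1)
print(share_pct)

# Share of each remaining label after dropping 'Unknown'.
remaining = counts.drop("Unknown")
remaining_pct = (remaining / remaining.sum() * 100).round(1)
print(remaining_pct)
```

The remaining classes are clearly imbalanced, with the Negative class being by far the smallest.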

## Handling Duplicates

With the null values addressed and the labels cleaned up, the next step is to check for duplicates, i.e. tweets that are repeated multiple times.

In [122]:
df_1.duplicated().sum()

22

We have 22 duplicated tweets. Let's take a closer look.

In [123]:
df_1[df_1.duplicated()]

Unnamed: 0,Text,Product/Brand,Emotion
468,"Before It Even Begins, Apple Wins #SXSW {link}",Apple,Positive
776,"Google to Launch Major New Social Network Called Circles, Possibly Today {link} #sxsw",Unknown,Neutral
2232,Marissa Mayer: Google Will Connect the Digital &amp; Physical Worlds Through Mobile - {link} #sxsw,Unknown,Neutral
2559,Counting down the days to #sxsw plus strong Canadian dollar means stock up on Apple gear,Apple,Positive
3950,Really enjoying the changes in Gowalla 3.0 for Android! Looking forward to seeing what else they &amp; Foursquare have up their sleeves at #SXSW,Android App,Positive
3962,"#SXSW is just starting, #CTIA is around the corner and #googleio is only a hop skip and a jump from there, good time to be an #android fan",Android,Positive
4897,"Oh. My. God. The #SXSW app for iPad is pure, unadulterated awesome. It's easier to browse events on iPad than on the website!!!",iPad or iPhone App,Positive
5338,RT @mention ÷¼ GO BEYOND BORDERS! ÷_ {link} ã_ #edchat #musedchat #sxsw #sxswi #classical #newTwitter,Unknown,Neutral
5341,"RT @mention ÷¼ Happy Woman's Day! Make love, not fuss! ÷_ {link} ã_ #edchat #musedchat #sxsw #sxswi #classical #newTwitter",Unknown,Neutral
5881,"RT @mention Google to Launch Major New Social Network Called Circles, Possibly Today {link} #sxsw",Unknown,Neutral
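
By default, `duplicated()` flags only the later copies of a row, so the first occurrence of each duplicated tweet is not in the table above. To see every copy side by side, `keep=False` can be used. The sketch below uses a made-up frame for illustration; note that it only catches exact matches, so a retweet with an added prefix is not treated as a duplicate.

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    "Text": [
        "Apple wins #SXSW",
        "Love my iPad",
        "Apple wins #SXSW",
        "RT Apple wins #SXSW",
    ],
    "Emotion": ["Positive", "Positive", "Positive", "Positive"],
})

# Default keep='first': only the second and later copies are flagged.
later_copies = toy[toy.duplicated()]

# keep=False: every row that has an exact twin is flagged.
all_copies = toy[toy.duplicated(keep=False)].sort_values("Text")
print(all_copies)

# Dropping duplicates keeps the first occurrence of each repeated row.
deduped = toy.drop_duplicates(keep="first")
print(len(toy), len(deduped))
```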


In [124]:
df_1 = df_1.drop_duplicates(keep='first')
df_1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8914 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Text           8914 non-null   object
 1   Product/Brand  8914 non-null   object
 2   Emotion        8914 non-null   object
dtypes: object(3)
memory usage: 278.6+ KB


With the data cleaning complete, we move on to EDA.