Build a model that can rate the sentiment of a Tweet based on its content.

You'll build an NLP model to analyze Twitter sentiment about Apple and Google products. The dataset comes from CrowdFlower via data.world. Human raters rated the sentiment in over 9,000 Tweets as positive, negative, or neither.

Aim for a Proof of Concept
There are many approaches to NLP problems, so start with something simple and iterate from there. For example, you could start by limiting your analysis to positive and negative Tweets only, which lets you build a binary classifier. Then you could add the neutral Tweets to build a multiclass classifier. You may also consider using some of the more advanced NLP methods in the Mod 4 Appendix.
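As a rough illustration of the binary-first idea, the sketch below filters a small made-up DataFrame down to the two polar classes. The toy rows and the `label` column are assumptions for illustration only; the label strings are the ones used in this dataset.

```python
import pandas as pd

# Toy rows standing in for the real data (illustrative only)
toy = pd.DataFrame({
    "Tweet": ["love my ipad", "battery died again", "at sxsw today", "great app"],
    "Sentiment": ["Positive emotion", "Negative emotion",
                  "No emotion toward brand or product", "Positive emotion"],
})

# Keep only the positive and negative rows for a first binary classifier
polar = ["Positive emotion", "Negative emotion"]
binary = toy[toy["Sentiment"].isin(polar)].copy()

# Encode negative as 0 and positive as 1 (hypothetical column name)
binary["label"] = (binary["Sentiment"] == "Positive emotion").astype(int)
print(binary[["Tweet", "label"]])
```

Adding the neutral rows back later only requires dropping the filter and switching to a three-value encoding.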

Evaluation
Evaluating multiclass classifiers can be trickier than binary classifiers because there are multiple ways to mis-classify an observation, and some errors are more problematic than others. Use the business problem that your NLP project sets out to solve to inform your choice of evaluation metrics.
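To make the point about multiple kinds of error concrete, here is a small sketch using scikit-learn's metrics on hypothetical labels (0 = negative, 1 = neutral, 2 = positive). The predictions are invented; the point is only that a confusion matrix and per-class scores show which errors occur, which a single accuracy number hides.

```python
from sklearn.metrics import classification_report, confusion_matrix, f1_score

# Hypothetical true and predicted labels: 0 = negative, 1 = neutral, 2 = positive
y_true = [0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [1, 0, 1, 1, 1, 2, 2, 2, 1, 2]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)

# Macro F1 weights every class equally, so a rare class counts as much as a common one
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(f"macro F1: {macro_f1:.3f}")

# Per-class precision and recall
print(classification_report(y_true, y_pred, labels=[0, 1, 2],
                            target_names=["negative", "neutral", "positive"]))
```

Whether recall on the negative class or precision on the positive class matters more depends on the business problem, as noted above.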

Data: https://data.world/crowdflower/brands-and-product-emotions

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('../../data/judge-1377884607_tweet_product_company.csv', encoding='unicode_escape')
df

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion
...,...,...,...
9088,Ipad everywhere. #SXSW {link},iPad,Positive emotion
9089,"Wave, buzz... RT @mention We interrupt your re...",,No emotion toward brand or product
9090,"Google's Zeiger, a physician never reported po...",,No emotion toward brand or product
9091,Some Verizon iPhone customers complained their...,,No emotion toward brand or product


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


In [4]:
# rename columns
df = df.rename(columns = {'tweet_text': 'Tweet', 
                         'emotion_in_tweet_is_directed_at': 'Product', 
                         'is_there_an_emotion_directed_at_a_brand_or_product': 'Sentiment'})
df.head() #Sanity Check

Unnamed: 0,Tweet,Product,Sentiment
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [5]:
df_na_c2=df[df['Product'].isna()]
df_na_c2

Unnamed: 0,Tweet,Product,Sentiment
5,@teachntech00 New iPad Apps For #SpeechTherapy...,,No emotion toward brand or product
6,,,No emotion toward brand or product
16,Holler Gram for iPad on the iTunes App Store -...,,No emotion toward brand or product
32,"Attn: All #SXSW frineds, @mention Register fo...",,No emotion toward brand or product
33,Anyone at #sxsw want to sell their old iPad?,,No emotion toward brand or product
...,...,...,...
9087,"@mention Yup, but I don't have a third app yet...",,No emotion toward brand or product
9089,"Wave, buzz... RT @mention We interrupt your re...",,No emotion toward brand or product
9090,"Google's Zeiger, a physician never reported po...",,No emotion toward brand or product
9091,Some Verizon iPhone customers complained their...,,No emotion toward brand or product


In [6]:
df_na_c2['Sentiment'].value_counts()

No emotion toward brand or product    5298
Positive emotion                       306
I can't tell                           147
Negative emotion                        51
Name: Sentiment, dtype: int64

In [7]:
df_nona_c2 = df[df['Product'].notna()]
df_nona_c2

Unnamed: 0,Tweet,Product,Sentiment
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion
...,...,...,...
9077,@mention your PR guy just convinced me to swit...,iPhone,Positive emotion
9079,&quot;papyrus...sort of like the ipad&quot; - ...,iPad,Positive emotion
9080,Diller says Google TV &quot;might be run over ...,Other Google product or service,Negative emotion
9085,I've always used Camera+ for my iPhone b/c it ...,iPad or iPhone App,Positive emotion


In [8]:
df_nona_c2['Product'].value_counts()

iPad                               946
Apple                              661
iPad or iPhone App                 470
Google                             430
iPhone                             297
Other Google product or service    293
Android App                         81
Android                             78
Other Apple product or service      35
Name: Product, dtype: int64

In [9]:
df_nona_c2['Sentiment'].value_counts()

Positive emotion                      2672
Negative emotion                       519
No emotion toward brand or product      91
I can't tell                             9
Name: Sentiment, dtype: int64

In [10]:
df_nona_c2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3291 entries, 0 to 9088
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Tweet      3291 non-null   object
 1   Product    3291 non-null   object
 2   Sentiment  3291 non-null   object
dtypes: object(3)
memory usage: 102.8+ KB


In [11]:
df['Sentiment'].value_counts()

No emotion toward brand or product    5389
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: Sentiment, dtype: int64

- Restricting the data to Tweets directed at a specific product leaves only 3,291 rows, which is not enough data, so the analysis below keeps all Tweets.

In [12]:
# nltk related imports
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

In [13]:
# set up tokenizer
tokenizer_tweet= TweetTokenizer(strip_handles=True)
# create stemmer object
stemmer = SnowballStemmer('english')

In [14]:
stopwords_list= stopwords.words('english')
stopwords_list

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [15]:
def preprocessing(text, tokenizer, stopwords_list, stemmer):
    """Tokenize a string, drop stop words, and stem what remains."""
    tokens = tokenizer.tokenize(text)
    tokens = [word for word in tokens if word not in stopwords_list]
    tokens = [stemmer.stem(word) for word in tokens]
    return tokens

In [16]:
# tokenize a sample Tweet, reusing tokenizer_tweet from the setup cell above
sample_tweet=tokenizer_tweet.tokenize(df['Tweet'][5])
sample_tweet

['New',
 'iPad',
 'Apps',
 'For',
 '#SpeechTherapy',
 'And',
 'Communication',
 'Are',
 'Showcased',
 'At',
 'The',
 '#SXSW',
 'Conference',
 'http://ht.ly/49n4M',
 '#iear',
 '#edchat',
 '#asd']

In [17]:
# drop the one row with a missing Tweet; copy so that the column
# assignments below do not trigger SettingWithCopyWarning
df = df[df['Tweet'].notna()].copy()

In [18]:
# import regular expression python library
import re
# add a hashtag column
df['hashtags'] = df['Tweet'].apply(lambda x: re.findall(r'\B#\w*[a-zA-Z]+\w*', x))
df

Unnamed: 0,Tweet,Product,Sentiment,hashtags
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion,"[#RISE_Austin, #SXSW]"
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion,[#SXSW]
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion,"[#iPad, #SXSW]"
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion,[#sxsw]
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion,[#SXSW]
...,...,...,...,...
9088,Ipad everywhere. #SXSW {link},iPad,Positive emotion,[#SXSW]
9089,"Wave, buzz... RT @mention We interrupt your re...",,No emotion toward brand or product,"[#sxsw, #google, #circles]"
9090,"Google's Zeiger, a physician never reported po...",,No emotion toward brand or product,"[#sxsw, #health2dev]"
9091,Some Verizon iPhone customers complained their...,,No emotion toward brand or product,[#SXSW]


In [19]:
import string

In [20]:
# original code from https://github.com/srobz/Classifying-a-Tweet-s-Sentiment-Based-on-its-Content/blob/main/Phase%204%20Project%20-%201%20-%20Data%20Cleaning.ipynb
df['clean'] = df['Tweet'] 

df['clean'] = df['clean'].str.lower() #Making everything lowercase

df['clean'] = df['clean'].apply(lambda x: re.sub(r'https?:\/\/\S+', '', x)) #Removing URLs with http/s

df['clean'] = df['clean'].apply(lambda x: re.sub(r"www\.[a-z]?\.?(com)+|[a-z]+\.(com)", '', x)) #Removing URLs with www

df['clean'] = df['clean'].apply(lambda x: re.sub(r'{link}', '', x)) #Removing {link} from tweets

df['clean'] = df['clean'].apply(lambda x: re.sub(r"\[video\]", '', x)) #Removing [video] from tweets

df['clean'] = df['clean'].apply(lambda x: re.sub(r'&[a-z]+;', '', x)) #Removing HTML reference characters

df['clean'] = df['clean'].apply(lambda x: re.sub(r"@[A-Za-z0-9]+", '', x)) #Removing all twitter handles from tweets

df['clean'] = df['clean'].apply(lambda x: re.sub(r"[^\x00-\x7F]+\ *(?:[^\x00-\x7F]| )*", '', x)) #Removing other characters

def remove_punctuation(text): #Function to remove punctuation from tweet
    punctuationfree = "".join([i for i in text if i not in string.punctuation]) #Removing punctuation from tweet
    return punctuationfree #Returning punctuation free tweet

df['clean'] = df['clean'].apply(lambda x: remove_punctuation(x)) #Applying function to tweets

df['clean'] = df['clean'].apply(lambda x: re.sub(r"[ ]{2,}", ' ', x)) #Removing extra spaces
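For checking the cleaning logic on a single string, the steps can also be collected into one function. This is a sketch under assumptions, not a replacement for the cell above: `clean_tweet` is a hypothetical helper, it covers only a subset of the steps (it skips the `www`, `[video]`, and non-ASCII rules), and it strips leading and trailing spaces, which the cell above does not.

```python
import re
import string

def clean_tweet(text):
    """Apply a subset of the cleaning steps to one string (illustrative helper)."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)   # URLs starting with http/https
    text = re.sub(r"\{link\}", "", text)       # literal {link} placeholders
    text = re.sub(r"&[a-z]+;", "", text)       # HTML entities such as &quot;
    text = re.sub(r"@[a-z0-9]+", "", text)     # Twitter handles
    # Remove punctuation in one pass instead of character by character
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r" {2,}", " ", text)         # collapse runs of spaces
    return text.strip()

print(clean_tweet("Ipad everywhere. #SXSW {link}"))
```

Testing on one string at a time makes it easier to see which rule is responsible when a Tweet comes out wrong.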

In [21]:
df['preprocessed_text']=df['clean'].apply(lambda x:preprocessing(x, tokenizer_tweet,stopwords_list, stemmer))

In [22]:
df

Unnamed: 0,Tweet,Product,Sentiment,hashtags,clean,preprocessed_text
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion,"[#RISE_Austin, #SXSW]",i have a 3g iphone after 3 hrs tweeting at ri...,"[3g, iphon, 3, hrs, tweet, riseaustin, dead, n..."
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion,[#SXSW],know about awesome ipadiphone app that youll ...,"[know, awesom, ipadiphon, app, youll, like, ap..."
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion,"[#iPad, #SXSW]",can not wait for ipad 2 also they should sale...,"[wait, ipad, 2, also, sale, sxsw]"
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion,[#sxsw],i hope this years festival isnt as crashy as ...,"[hope, year, festiv, isnt, crashi, year, iphon..."
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion,[#SXSW],great stuff on fri sxsw marissa mayer google ...,"[great, stuff, fri, sxsw, marissa, mayer, goog..."
...,...,...,...,...,...,...
9088,Ipad everywhere. #SXSW {link},iPad,Positive emotion,[#SXSW],ipad everywhere sxsw,"[ipad, everywher, sxsw]"
9089,"Wave, buzz... RT @mention We interrupt your re...",,No emotion toward brand or product,"[#sxsw, #google, #circles]",wave buzz rt we interrupt your regularly sched...,"[wave, buzz, rt, interrupt, regular, schedul, ..."
9090,"Google's Zeiger, a physician never reported po...",,No emotion toward brand or product,"[#sxsw, #health2dev]",googles zeiger a physician never reported pote...,"[googl, zeiger, physician, never, report, pote..."
9091,Some Verizon iPhone customers complained their...,,No emotion toward brand or product,[#SXSW],some verizon iphone customers complained their...,"[verizon, iphon, custom, complain, time, fell,..."


In [23]:
# mapping sentiment column
df['Sentiment'].value_counts()

No emotion toward brand or product    5388
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: Sentiment, dtype: int64

In [24]:
# negative = 0, neutral/I can't tell = 1, positive =2 
dict_sent = {'No emotion toward brand or product':1, 
             'Positive emotion':2,
             'Negative emotion':0,
             "I can't tell": 1}
df['Sentiment'] = df['Sentiment'].map(dict_sent)

In [25]:
df['Sentiment']

0       0
1       2
2       2
3       0
4       2
       ..
9088    2
9089    1
9090    1
9091    1
9092    1
Name: Sentiment, Length: 9092, dtype: int64

In [26]:
# drop the product column because we want general sentiment to make it applicable to all tech products
df = df.drop('Product', axis=1)

In [27]:
df["preprocessed_text"] = df["preprocessed_text"].str.join(" ")

In [28]:
y = df['Sentiment']
X = df.drop(['Tweet','Sentiment', 'clean', 'hashtags'], axis=1)

In [29]:
X

Unnamed: 0,preprocessed_text
0,3g iphon 3 hrs tweet riseaustin dead need upgr...
1,know awesom ipadiphon app youll like appreci d...
2,wait ipad 2 also sale sxsw
3,hope year festiv isnt crashi year iphon app sxsw
4,great stuff fri sxsw marissa mayer googl tim o...
...,...
9088,ipad everywher sxsw
9089,wave buzz rt interrupt regular schedul sxsw ge...
9090,googl zeiger physician never report potenti ae...
9091,verizon iphon custom complain time fell back h...


In [30]:
# train_test_split
from sklearn.model_selection import train_test_split
X_train_int, X_test, y_train_int, y_test = train_test_split(X, y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_int, y_train_int, random_state=42)
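The classes here are imbalanced (570 negative, 5,544 neutral, and 2,978 positive Tweets after the mapping), so a plain random split can leave the validation set with very few negative examples. One option, shown below on toy data, is the `stratify` argument of `train_test_split`. This is a sketch of that option, not what the cell above does; `X_toy` and `y_toy` are made up.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data with a rare class 0, loosely mimicking the imbalance in this dataset
X_toy = [f"tweet {i}" for i in range(100)]
y_toy = [0] * 8 + [1] * 60 + [2] * 32

# stratify keeps the class proportions the same in both parts of the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

print(Counter(y_tr))
print(Counter(y_te))
```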

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [32]:
df['preprocessed_text']

0       3g iphon 3 hrs tweet riseaustin dead need upgr...
1       know awesom ipadiphon app youll like appreci d...
2                              wait ipad 2 also sale sxsw
3        hope year festiv isnt crashi year iphon app sxsw
4       great stuff fri sxsw marissa mayer googl tim o...
                              ...                        
9088                                  ipad everywher sxsw
9089    wave buzz rt interrupt regular schedul sxsw ge...
9090    googl zeiger physician never report potenti ae...
9091    verizon iphon custom complain time fell back h...
9092                  rt googl test checkin offersat sxsw
Name: preprocessed_text, Length: 9092, dtype: object

In [36]:
tf_idf = TfidfVectorizer()
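One pitfall worth checking before fitting: `X` was built with `df.drop(...)`, so `X_train` is a one-column DataFrame rather than a Series. Iterating over a DataFrame yields its column names, so a vectorizer given the DataFrame sees a single "document" consisting of the column name. The sketch below shows the difference on two made-up rows.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up rows in the same shape as X_train (a one-column DataFrame)
docs = pd.DataFrame({"preprocessed_text": ["ipad everywher sxsw",
                                           "wait ipad 2 also sale sxsw"]})

# Passing the DataFrame: the vectorizer iterates over column names
wrong = TfidfVectorizer().fit_transform(docs)
print(wrong.shape)

# Passing the text column (a Series): one row per Tweet, one column per term
right = TfidfVectorizer().fit_transform(docs["preprocessed_text"])
print(right.shape)
```

A quick look at `.shape` after transforming is usually enough to catch this.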


In [37]:
# X_train is a one-column DataFrame, so pass the text column itself;
# fitting on the DataFrame would vectorize the column name, not the Tweets
tf_idf.fit(X_train['preprocessed_text'])

TfidfVectorizer()

In [38]:
X_train_vec=tf_idf.transform(X_train['preprocessed_text'])
X_val_vec=tf_idf.transform(X_val['preprocessed_text'])

In [39]:
X_train_vec

In [None]:
print(X_train_vec)

In [None]:
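# use the vectorized matrix, not the raw text; get_feature_names_out()
# replaces get_feature_names() in scikit-learn 1.0 and later
tfidf_train_df=pd.DataFrame(X_train_vec.toarray(), 
                              columns=tf_idf.get_feature_names_out())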

In [None]:
tfidf_train_df

In [None]:
X_train

### Modeling

In [None]:
# import stuff used for modeling
from sklearn.model_selection import GridSearchCV
# plot_confusion_matrix and plot_roc_curve were removed in scikit-learn 1.2;
# the Display classes below are their replacements (available from 1.0)
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay, confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

In [None]:
# build an evaluation function 
def evaluate(model, X_tr, y_tr, X_te, y_te):
    print('Accuracy Score:')
    print(f'Train - {accuracy_score(y_tr, model.predict(X_tr))}')
    print(f'Test - {accuracy_score(y_te, model.predict(X_te))}')
    print('  ')
    print('Confusion matrix for test data')
    return ConfusionMatrixDisplay.from_estimator(model, X_te, y_te, include_values=True, cmap=plt.cm.Blues)

Model 1: decision tree


In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
dt = DecisionTreeClassifier()

In [None]:
dt.get_params().keys()

In [None]:
dt_grid_params = {'max_depth':[1,5,10], 'min_samples_split':[2,10,100]}
dt_grid = GridSearchCV(dt, dt_grid_params)
# fit on the TF-IDF matrices, not the raw text
dt_output = dt_grid.fit(X_train_vec, y_train)
print(dt_output.best_params_)
dt_best_model = dt_output.best_estimator_
print(cross_validate(dt_best_model, X_train_vec, y_train, return_train_score=True))
evaluate(dt_best_model, X_train_vec, y_train, X_val_vec, y_val)

Model 2: random forest

Model 3: naive bayes
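This model is not implemented in the notebook yet. As one possible starting point, and not the author's implementation, the sketch below chains `TfidfVectorizer` and `MultinomialNB` in a scikit-learn pipeline. The texts and labels are made up, using the same 0/1/2 encoding as above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up, already-stemmed texts: 0 = negative, 1 = neutral, 2 = positive
texts = ["love new ipad", "great app awesom",
         "iphon batteri dead", "app crashi hate",
         "sxsw panel today", "googl talk sxsw"]
labels = [2, 2, 0, 0, 1, 1]

# The pipeline fits the vectorizer and the classifier together, so the
# vocabulary is learned only from the data passed to fit()
nb = make_pipeline(TfidfVectorizer(), MultinomialNB())
nb.fit(texts, labels)

preds = nb.predict(["awesom ipad", "hate batteri"])
print(preds)
```

Because the pipeline accepts raw text, it can be passed directly to `GridSearchCV` or `cross_validate` without vectorizing the data beforehand.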

Model 4: neural network