Build a model that can rate the sentiment of a Tweet based on its content.

You'll build an NLP model to analyze Twitter sentiment about Apple and Google products. The dataset comes from CrowdFlower via data.world. Human raters labeled the sentiment in over 9,000 Tweets as positive, negative, or neither.

Aim for a Proof of Concept
There are many approaches to NLP problems, so start with something simple and iterate from there. For example, you could start by limiting your analysis to positive and negative Tweets only, which lets you build a binary classifier. Then you could add the neutral Tweets to build a multiclass classifier. You may also consider using some of the more advanced NLP methods in the Mod 4 Appendix.
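As a rough illustration of the binary-first approach, the sketch below keeps only positive and negative rows and maps them to 1 and 0. The label strings match those used in the dataset; the `BINARY_LABELS` mapping, the `to_binary` helper, and the toy rows are assumptions made for this example.

```python
# Hypothetical mapping for a binary proof of concept: 1 = positive, 0 = negative.
BINARY_LABELS = {"Positive emotion": 1, "Negative emotion": 0}


def to_binary(rows):
    """Keep only positive/negative (text, label) pairs and encode the label as 1/0."""
    return [(text, BINARY_LABELS[label]) for text, label in rows if label in BINARY_LABELS]


# Toy rows, not taken from the project data.
toy_rows = [
    ("great app", "Positive emotion"),
    ("battery died", "Negative emotion"),
    ("at the conference", "No emotion toward brand or product"),
    ("hmm", "I can't tell"),
]
binary_rows = to_binary(toy_rows)
```

Adding the neutral class later only requires extending the mapping, so the rest of the pipeline can stay unchanged.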

Evaluation
Evaluating multiclass classifiers can be trickier than evaluating binary classifiers because there are multiple ways to misclassify an observation, and some errors are more problematic than others. Use the business problem that your NLP project sets out to solve to inform your choice of evaluation metrics.
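To make the point concrete, here is a small sketch that compares overall accuracy with per-class recall on made-up labels. The `per_class_recall` helper and the toy predictions are assumptions for illustration only; they are not results from this project.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Return {label: fraction of that label's true examples predicted correctly}."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Made-up labels: the minority "neg" class is the one the model handles worst.
y_true = ["pos", "pos", "pos", "pos", "neg", "neg", "neu", "neu", "neu", "neu"]
y_pred = ["pos", "pos", "pos", "neu", "pos", "neg", "neu", "neu", "neu", "pos"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recalls = per_class_recall(y_true, y_pred)
macro_recall = sum(recalls.values()) / len(recalls)
```

Here accuracy is 0.7 while recall on the negative class is only 0.5, which is the kind of gap a single aggregate number can hide. If missing negative Tweets is the costly error for the business problem, per-class recall may be the more informative metric.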

Data: https://data.world/crowdflower/brands-and-product-emotions

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('../../data/judge-1377884607_tweet_product_company.csv', encoding='unicode_escape')
df

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion
...,...,...,...
9088,Ipad everywhere. #SXSW {link},iPad,Positive emotion
9089,"Wave, buzz... RT @mention We interrupt your re...",,No emotion toward brand or product
9090,"Google's Zeiger, a physician never reported po...",,No emotion toward brand or product
9091,Some Verizon iPhone customers complained their...,,No emotion toward brand or product


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


In [4]:
# rows where no product or brand was labeled as the target of the emotion
df_na_c2 = df[df['emotion_in_tweet_is_directed_at'].isna()]
df_na_c2

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
5,@teachntech00 New iPad Apps For #SpeechTherapy...,,No emotion toward brand or product
6,,,No emotion toward brand or product
16,Holler Gram for iPad on the iTunes App Store -...,,No emotion toward brand or product
32,"Attn: All #SXSW frineds, @mention Register fo...",,No emotion toward brand or product
33,Anyone at #sxsw want to sell their old iPad?,,No emotion toward brand or product
...,...,...,...
9087,"@mention Yup, but I don't have a third app yet...",,No emotion toward brand or product
9089,"Wave, buzz... RT @mention We interrupt your re...",,No emotion toward brand or product
9090,"Google's Zeiger, a physician never reported po...",,No emotion toward brand or product
9091,Some Verizon iPhone customers complained their...,,No emotion toward brand or product


In [5]:
df_na_c2['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

No emotion toward brand or product    5298
Positive emotion                       306
I can't tell                           147
Negative emotion                        51
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

In [6]:
# rows where a product or brand was labeled as the target of the emotion
df_nona_c2 = df[df['emotion_in_tweet_is_directed_at'].notna()]
df_nona_c2

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion
...,...,...,...
9077,@mention your PR guy just convinced me to swit...,iPhone,Positive emotion
9079,&quot;papyrus...sort of like the ipad&quot; - ...,iPad,Positive emotion
9080,Diller says Google TV &quot;might be run over ...,Other Google product or service,Negative emotion
9085,I've always used Camera+ for my iPhone b/c it ...,iPad or iPhone App,Positive emotion


In [7]:
df_nona_c2['emotion_in_tweet_is_directed_at'].value_counts()

iPad                               946
Apple                              661
iPad or iPhone App                 470
Google                             430
iPhone                             297
Other Google product or service    293
Android App                         81
Android                             78
Other Apple product or service      35
Name: emotion_in_tweet_is_directed_at, dtype: int64

In [8]:
df_nona_c2['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

Positive emotion                      2672
Negative emotion                       519
No emotion toward brand or product      91
I can't tell                             9
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

In [9]:
df_nona_c2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3291 entries, 0 to 9088
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          3291 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  3291 non-null   object
dtypes: object(3)
memory usage: 102.8+ KB


In [10]:
df['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

No emotion toward brand or product    5389
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

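- Restricting the analysis to Tweets with a labeled product or brand would leave only 3,291 of the 9,093 rows, which is likely too little data, so the full dataset is used.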
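The counts printed above can be turned into shares with a few lines of arithmetic. This sketch only re-uses the numbers already shown in the outputs; the variable names are assumptions for the example.

```python
# Label counts copied from the value_counts() output for the full dataset.
label_counts = {
    "No emotion toward brand or product": 5389,
    "Positive emotion": 2978,
    "Negative emotion": 570,
    "I can't tell": 156,
}
total_rows = sum(label_counts.values())

# Rows that name a product or brand, from df_nona_c2.info().
rows_with_target = 3291
share_with_target = rows_with_target / total_rows

# Share of each label in the full dataset, useful for spotting class imbalance.
label_shares = {label: count / total_rows for label, count in label_counts.items()}
```

Roughly 36% of rows name a product, and negative Tweets make up about 6% of the full dataset, so class imbalance is worth keeping in mind when choosing metrics.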

In [11]:
# nltk related imports
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

In [12]:
# set up tokenizer
tokenizer_tweet= TweetTokenizer(strip_handles=True)
# create stemmer object
stemmer = SnowballStemmer('english')

In [13]:
stopwords_list= stopwords.words('english')
stopwords_list

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [14]:
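def preprocessing(text, tokenizer, stopwords_list, stemmer):
    """Tokenize text, drop stopwords, and stem the remaining tokens."""
    # split the text into tokens
    text = tokenizer.tokenize(text)
    # drop stopwords (case-sensitive, so lowercase the text beforehand)
    text = [word for word in text if word not in stopwords_list]
    # reduce each remaining token to its stem
    text = [stemmer.stem(word) for word in text]
    return text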
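The function only needs an object with a `tokenize` method and an object with a `stem` method, so its behavior can be checked without NLTK. In the sketch below, `WhitespaceTokenizer`, `SuffixStemmer`, and `toy_stopwords` are hypothetical stand-ins written for this example; they are far cruder than `TweetTokenizer` and `SnowballStemmer`.

```python
def preprocessing(text, tokenizer, stopwords_list, stemmer):
    """Tokenize text, drop stopwords, and stem the remaining tokens."""
    tokens = tokenizer.tokenize(text)
    tokens = [word for word in tokens if word not in stopwords_list]
    return [stemmer.stem(word) for word in tokens]


class WhitespaceTokenizer:
    """Hypothetical stand-in for TweetTokenizer: splits on whitespace only."""

    def tokenize(self, text):
        return text.split()


class SuffixStemmer:
    """Hypothetical stand-in for SnowballStemmer: strips a trailing 'ing' or 's'."""

    def stem(self, word):
        for suffix in ("ing", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word


toy_stopwords = ["i", "have", "a", "the", "at"]
tokenizer, stemmer = WhitespaceTokenizer(), SuffixStemmer()

lowercased = preprocessing("i have a dead iphone after tweeting", tokenizer, toy_stopwords, stemmer)
mixed_case = preprocessing("The iPad", tokenizer, toy_stopwords, stemmer)
```

In the second call "The" survives the stopword filter because the comparison is case-sensitive, which is one reason to lowercase the text before calling the function.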

In [15]:
# tokenize a sample Tweet with the tokenizer set up above
sample_tweet = tokenizer_tweet.tokenize(df['tweet_text'][5])
sample_tweet

['New',
 'iPad',
 'Apps',
 'For',
 '#SpeechTherapy',
 'And',
 'Communication',
 'Are',
 'Showcased',
 'At',
 'The',
 '#SXSW',
 'Conference',
 'http://ht.ly/49n4M',
 '#iear',
 '#edchat',
 '#asd']

In [16]:
# drop the row whose tweet_text is null
df_nona_c1 = df[df['tweet_text'].notna()]
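Assigning new columns to a frame obtained by boolean filtering can trigger pandas' SettingWithCopyWarning. One common way to avoid it is to take an explicit copy right after filtering. The sketch below assumes a toy three-row frame rather than the project data, and `df_text` is a name chosen for this example.

```python
import re

import pandas as pd

# Toy frame with one missing Tweet, standing in for the real data.
toy_df = pd.DataFrame({"tweet_text": ["Love the #iPad at #SXSW", None, "No tags here"]})

# .copy() makes the filtered frame independent of toy_df,
# so adding a column below modifies only df_text.
df_text = toy_df[toy_df["tweet_text"].notna()].copy()
df_text["hashtags"] = df_text["tweet_text"].apply(
    lambda x: re.findall(r"\B#\w*[a-zA-Z]+\w*", x)
)
```

The original frame is left untouched, and the new column lives only on the copy.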

In [17]:
# import regular expression python library
import re
# add a hashtag column
df_nona_c1['hashtags'] = df_nona_c1['tweet_text'].apply(lambda x: re.findall(r'\B#\w*[a-zA-Z]+\w*', x))
df_nona_c1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_nona_c1['hashtags'] = df_nona_c1['tweet_text'].apply(lambda x: re.findall(r'\B#\w*[a-zA-Z]+\w*', x))


Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product,hashtags
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion,"[#RISE_Austin, #SXSW]"
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion,[#SXSW]
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion,"[#iPad, #SXSW]"
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion,[#sxsw]
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion,[#SXSW]
...,...,...,...,...
9088,Ipad everywhere. #SXSW {link},iPad,Positive emotion,[#SXSW]
9089,"Wave, buzz... RT @mention We interrupt your re...",,No emotion toward brand or product,"[#sxsw, #google, #circles]"
9090,"Google's Zeiger, a physician never reported po...",,No emotion toward brand or product,"[#sxsw, #health2dev]"
9091,Some Verizon iPhone customers complained their...,,No emotion toward brand or product,[#SXSW]
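The hashtag pattern used above requires at least one letter after the `#` and refuses a `#` that directly follows a word character. A small standard-library sketch of that behavior, where `extract_hashtags` and the sample strings are assumptions for this example:

```python
import re

# Same pattern as in the cell above, compiled once for reuse.
HASHTAG_PATTERN = re.compile(r"\B#\w*[a-zA-Z]+\w*")


def extract_hashtags(text):
    """Return every hashtag in text that contains at least one letter."""
    return HASHTAG_PATTERN.findall(text)


from_dataset = extract_hashtags("Ipad everywhere. #SXSW {link}")
# "#1" has no letter and "a#b" has a word character before '#', so both are skipped.
edge_cases = extract_hashtags("item #1 at #sxsw2011 email a#b")
```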


In [18]:
import string

In [19]:
# original code from https://github.com/srobz/Classifying-a-Tweet-s-Sentiment-Based-on-its-Content/blob/main/Phase%204%20Project%20-%201%20-%20Data%20Cleaning.ipynb
df_nona_c1['clean'] = df_nona_c1['tweet_text']

# make everything lowercase
df_nona_c1['clean'] = df_nona_c1['clean'].str.lower()
# remove URLs that start with http:// or https://
df_nona_c1['clean'] = df_nona_c1['clean'].apply(lambda x: re.sub(r'https?:\/\/\S+', '', x))
# remove www-style and .com URL fragments
df_nona_c1['clean'] = df_nona_c1['clean'].apply(lambda x: re.sub(r"www\.[a-z]?\.?(com)+|[a-z]+\.(com)", '', x))
# remove the {link} placeholder
df_nona_c1['clean'] = df_nona_c1['clean'].apply(lambda x: re.sub(r'{link}', '', x))
# remove the [video] placeholder
df_nona_c1['clean'] = df_nona_c1['clean'].apply(lambda x: re.sub(r"\[video\]", '', x))
# remove HTML character references such as &quot;
df_nona_c1['clean'] = df_nona_c1['clean'].apply(lambda x: re.sub(r'&[a-z]+;', '', x))
# remove Twitter handles
df_nona_c1['clean'] = df_nona_c1['clean'].apply(lambda x: re.sub(r"@[A-Za-z0-9]+", '', x))
# remove non-ASCII characters
df_nona_c1['clean'] = df_nona_c1['clean'].apply(lambda x: re.sub(r"[^\x00-\x7F]+\ *(?:[^\x00-\x7F]| )*", '', x))

def remove_punctuation(text):
    """Return text with every string.punctuation character removed."""
    return "".join(ch for ch in text if ch not in string.punctuation)

# strip punctuation from each Tweet
df_nona_c1['clean'] = df_nona_c1['clean'].apply(remove_punctuation)
# collapse runs of spaces into a single space
df_nona_c1['clean'] = df_nona_c1['clean'].apply(lambda x: re.sub(r"[ ]{2,}", ' ', x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_nona_c1['clean'] = df_nona_c1['tweet_text']

In [20]:
df_nona_c1['preprocessed_text']=df_nona_c1['clean'].apply(lambda x:preprocessing(x, tokenizer_tweet,stopwords_list, stemmer))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_nona_c1['preprocessed_text']=df_nona_c1['clean'].apply(lambda x:preprocessing(x, tokenizer_tweet,stopwords_list, stemmer))


In [21]:
df_nona_c1

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product,hashtags,clean,preprocessed_text
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion,"[#RISE_Austin, #SXSW]",i have a 3g iphone after 3 hrs tweeting at ri...,"[3g, iphon, 3, hrs, tweet, riseaustin, dead, n..."
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion,[#SXSW],know about awesome ipadiphone app that youll ...,"[know, awesom, ipadiphon, app, youll, like, ap..."
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion,"[#iPad, #SXSW]",can not wait for ipad 2 also they should sale...,"[wait, ipad, 2, also, sale, sxsw]"
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion,[#sxsw],i hope this years festival isnt as crashy as ...,"[hope, year, festiv, isnt, crashi, year, iphon..."
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion,[#SXSW],great stuff on fri sxsw marissa mayer google ...,"[great, stuff, fri, sxsw, marissa, mayer, goog..."
...,...,...,...,...,...,...
9088,Ipad everywhere. #SXSW {link},iPad,Positive emotion,[#SXSW],ipad everywhere sxsw,"[ipad, everywher, sxsw]"
9089,"Wave, buzz... RT @mention We interrupt your re...",,No emotion toward brand or product,"[#sxsw, #google, #circles]",wave buzz rt we interrupt your regularly sched...,"[wave, buzz, rt, interrupt, regular, schedul, ..."
9090,"Google's Zeiger, a physician never reported po...",,No emotion toward brand or product,"[#sxsw, #health2dev]",googles zeiger a physician never reported pote...,"[googl, zeiger, physician, never, report, pote..."
9091,Some Verizon iPhone customers complained their...,,No emotion toward brand or product,[#SXSW],some verizon iphone customers complained their...,"[verizon, iphon, custom, complain, time, fell,..."
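The `clean` column shown above comes from a chain of regex substitutions applied column by column. As a rough sketch, the same steps can be collected into one function that works on a plain string, which makes them easier to try on a single Tweet. `clean_tweet` and the sample text are assumptions for this example; the patterns are the ones used in the cleaning cell.

```python
import re
import string


def clean_tweet(text):
    """Apply the cleaning steps to one Tweet, in the same order as the cleaning cell."""
    text = text.lower()
    text = re.sub(r"https?:\/\/\S+", "", text)  # URLs with http/https
    text = re.sub(r"www\.[a-z]?\.?(com)+|[a-z]+\.(com)", "", text)  # www/.com fragments
    text = re.sub(r"\{link\}", "", text)  # {link} placeholder
    text = re.sub(r"\[video\]", "", text)  # [video] placeholder
    text = re.sub(r"&[a-z]+;", "", text)  # HTML character references
    text = re.sub(r"@[A-Za-z0-9]+", "", text)  # Twitter handles
    text = re.sub(r"[^\x00-\x7F]+\ *(?:[^\x00-\x7F]| )*", "", text)  # non-ASCII
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return re.sub(r"[ ]{2,}", " ", text)  # collapse repeated spaces


# Made-up Tweet that exercises several of the rules at once.
sample = '@mention Love the #iPad! &quot;Wow&quot; http://t.co/x {link}'
cleaned = clean_tweet(sample)
```

The function collapses repeated spaces but does not trim the ends, so a leading or trailing space can remain, as in the first rows of the table above. Calling `.strip()` on the result is one option if that matters downstream.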


In [24]:
y = df_nona_c1['is_there_an_emotion_directed_at_a_brand_or_product']
X = df_nona_c1.drop(['tweet_text','is_there_an_emotion_directed_at_a_brand_or_product', 'clean'], axis=1)

In [26]:
# split into training and test sets (train_test_split holds out 25% for testing by default)
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, y, random_state=42)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
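For intuition about what `TfidfVectorizer` computes, the sketch below reproduces the weighting it applies under its default settings (raw term counts, smoothed IDF, L2 normalization) in plain Python. It works on token lists like those in `preprocessed_text`; the `tfidf_vectors` helper and the toy documents are assumptions for this example, and it is not a replacement for the scikit-learn class.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs is a list of token lists; return one {term: weight} dict per document."""
    n_docs = len(docs)
    # number of documents each term appears in
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    vectors = []
    for doc in docs:
        weights = {term: count * idf[term] for term, count in Counter(doc).items()}
        # scale each document vector to unit Euclidean length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


# Toy stemmed documents: "sxsw" appears in both, so it carries less weight.
toy_docs = [["ipad", "sxsw"], ["iphon", "sxsw"]]
vectors = tfidf_vectors(toy_docs)
```

Note that `TfidfVectorizer` expects raw strings by default, so token lists would need to be joined back into strings or passed with a custom analyzer before fitting.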

In [None]:
tf_idf = TfidfVectorizer()
tf_idf.fit_tran