# Phase 4 project

Do we remove stop words or not?
Do we stem or lemmatize our text data, or leave the words as is?
Is basic tokenization enough, or do we need to support special edge cases through the use of regex?
Do we use the entire vocabulary, or just limit the model to a subset of the most frequently used words? If so, how many?
Do we engineer other features, such as bigrams, or POS tags, or Mutual Information Scores?
What sort of vectorization should we use in our model? Boolean Vectorization? Count Vectorization? TF-IDF? More advanced vectorization strategies such as Word2Vec?
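
To make the vectorization question concrete, here is a minimal standard-library sketch of the first three options on toy documents. The documents are made up, and the TF-IDF shown is the plain textbook formula (count times log of N over document frequency); scikit-learn's `TfidfVectorizer` uses a smoothed, normalized variant by default, so its numbers would differ.

```python
import math
from collections import Counter

# Three toy documents (hypothetical, not rows from the dataset)
docs = [
    "ipad line is long",
    "ipad app is great",
    "google maps is great",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})

# Number of documents each vocabulary word appears in
doc_freq = {word: sum(word in tokens for tokens in tokenized) for word in vocab}


def boolean_vector(tokens):
    """1 if the word occurs in the document, else 0."""
    present = set(tokens)
    return [int(word in present) for word in vocab]


def count_vector(tokens):
    """Raw number of occurrences of each vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


def tfidf_vector(tokens):
    """Textbook TF-IDF: term count times log(N / document frequency)."""
    counts = Counter(tokens)
    n_docs = len(tokenized)
    return [counts[word] * math.log(n_docs / doc_freq[word]) for word in vocab]


print(vocab)
print(boolean_vector(tokenized[0]))
print([round(x, 2) for x in tfidf_vector(tokenized[0])])
```

A word like "is" that appears in every document gets a TF-IDF weight of zero, which is the same intuition behind removing stop words.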

Stop words in which languages? Handle the imbalanced dataset with SMOTE?

## Exploratory phase

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
import nltk
from nltk import FreqDist
from nltk.corpus import stopwords
import string
import re
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_csv("BrandsAndProductEmotions.csv", encoding='unicode_escape')
df

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion
...,...,...,...
9088,Ipad everywhere. #SXSW {link},iPad,Positive emotion
9089,"Wave, buzz... RT @mention We interrupt your re...",,No emotion toward brand or product
9090,"Google's Zeiger, a physician never reported po...",,No emotion toward brand or product
9091,Some Verizon iPhone customers complained their...,,No emotion toward brand or product


The names following the @ symbol are probably useless, but since they do include terms that might be seen elsewhere (e.g. sxsw) I may leave them in. Any of the ones that only show up once or twice should be dropped later anyway once I limit the vocabulary to frequent terms, since tokenization alone will not remove them.

Those column names are a mouthful. And the last row is definitely not made for human eyes so I'll just cut it right away.

In [3]:
df.columns = ["text", "product", "emotion"]
df.drop(9092, inplace = True)

Now let's see if there's anything interesting from a macro level.

In [4]:
#Value counts for the two categorical columns
print("Emotion")
print(df.emotion.value_counts())
print("\nProducts")
print(df["product"].value_counts())

Emotion
No emotion toward brand or product    5388
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: emotion, dtype: int64

Products
iPad                               946
Apple                              661
iPad or iPhone App                 470
Google                             430
iPhone                             297
Other Google product or service    293
Android App                         81
Android                             78
Other Apple product or service      35
Name: product, dtype: int64


The 570 negative tweets are a little over 6% of the whole dataset. I may need to address this class imbalance later when building my model.
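
As a quick check on that percentage, the sketch below recomputes each class's share from the counts printed above. It also computes "balanced" class weights, total / (number of classes × class count), which is the same heuristic scikit-learn applies for `class_weight="balanced"`. Class weighting is only one possible way to handle the imbalance, not something this notebook has settled on.

```python
# Class counts copied from the value_counts() output above
counts = {
    "No emotion toward brand or product": 5388,
    "Positive emotion": 2978,
    "Negative emotion": 570,
    "I can't tell": 156,
}
total = sum(counts.values())
n_classes = len(counts)

# Share of the dataset held by each class, in percent
shares = {label: 100 * n / total for label, n in counts.items()}

# Balanced weights: rarer classes get proportionally larger weights
weights = {label: total / (n_classes * n) for label, n in counts.items()}

for label in counts:
    print(f"{label:<36} {shares[label]:5.1f}%  weight {weights[label]:.2f}")
```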

In [5]:
print(df.shape)
print("\n")
print(df.info())

(9092, 3)


<class 'pandas.core.frame.DataFrame'>
Int64Index: 9092 entries, 0 to 9091
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   text     9091 non-null   object
 1   product  3291 non-null   object
 2   emotion  9092 non-null   object
dtypes: object(3)
memory usage: 284.1+ KB
None


Somehow, one of the text entries is null, and nearly two-thirds of the entries are missing a product label.

In [6]:
#Before I forget, let me look at that null entry
df[df["text"].isnull()]

Unnamed: 0,text,product,emotion
6,,,No emotion toward brand or product


Predictably, this was useless.

In [7]:
df.drop(6, inplace = True)
df.reset_index(inplace = True)

Now let's look at one of those tweets that's missing a product label.

In [8]:
df.iloc[9090]["text"]

'Some Verizon iPhone customers complained their time fell back an hour this weekend.  Of course they were the New Yorkers who attended #SXSW.'

Yeah, this is just a generic tweet. How did this end up here? It also was labeled as lacking an emotional quality. Is that what all of the null entries in the product column are like?

In [9]:
df[df["product"].isnull()]["emotion"].value_counts()

No emotion toward brand or product    5296
Positive emotion                       306
I can't tell                           147
Negative emotion                        51
Name: emotion, dtype: int64

No, there are emotional statements in here. This seems to be only *somewhat* similar to the rest of the dataset: positive tweets drop from about 33% of all rows to about 5% here, and negative tweets from about 6% to under 1%.
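
To put numbers on that comparison, here is a small sketch that turns the two sets of counts into percentages. The overall counts are the `In [4]` output minus the one null-text row dropped in `In [7]`; the second set is the `In [9]` output. The helper name `shares` is just for illustration.

```python
# Emotion counts for the whole dataset (after dropping the null-text row)
overall = {
    "No emotion toward brand or product": 5387,
    "Positive emotion": 2978,
    "Negative emotion": 570,
    "I can't tell": 156,
}
# Emotion counts for rows with no product label
null_product = {
    "No emotion toward brand or product": 5296,
    "Positive emotion": 306,
    "I can't tell": 147,
    "Negative emotion": 51,
}


def shares(counts):
    """Convert raw counts to percentages rounded to one decimal place."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


overall_pct = shares(overall)
null_pct = shares(null_product)
for label in overall:
    print(f"{label:<36} overall {overall_pct[label]:5.1f}%   no product {null_pct[label]:5.1f}%")
```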

Anyway, I have two tasks to start with: seeing if the NaN entries in the "product" column are worth keeping, and investigating the "I can't tell"s.

# Investigating Entries

### NaNs

In [10]:
null_prod_s = df[df["product"].isnull()].sample(n=10, random_state=2)
for i in range(10):
    print(null_prod_s["text"].iloc[i])
    print("\n")

RT Hiring marketers, designers, creatives, social media pros... Come see #Aquent booth 1415 #SXSW trade show. Might win iPad 2


RT @mention Full #SXSW #touchingstories presentation: {link}


RT @mention #SXSW Interactive Award: Music Category goes to &quot;Wilderness Downtown&quot;. Congrats shared with @mention @mention @mention #winning


One guy stakes out the Austin Apple popup shop at #SXSW for his #iPad 2 {link} #SXSWi


RT @mention RT @mention Google set to launch new social network #Circles today at #sxsw


hey @mention heard you're at #sxsw. Come by to the @mention grille and make your comic into a iPhone case? What do you say? :]


Too bad I don't have a _ button!
RT @mention I know its #SXSW time when I have an abnormal amount of app updates on my iPhone.


RT @mention &quot;my kids will not grow up thinking the New York Times and Google are in separate industries&quot; @mention #bvj #SXSW


packing for #sxsw = iPad, iPhone, BlackBerry, laptop, and video camera. Need a st

Oh, I get it now: the collection is just the result of scraping any tweets that contain tech-company keywords. Since that is a realistic version of what a model would face in the real world, I think I should leave these in and hope my model can accurately toss them aside.

### "I can't tell"

In [11]:
ict_emo_s = df[df["emotion"] == "I can't tell"].sample(n=10, random_state =2)
for i in range(10):
    print(ict_emo_s["text"].iloc[i])
    print("\n")

Comprando mi iPad 2 en el #SXSW (@mention Apple Store, SXSW w/ 62 others) {link}


The queue at the Apple Store in Austin is FOUR blocks long. Crazy stuff! #sxsw


RT @mention Demo of Google Hotpot at #bettersearch panel: still pull search, but personalized. Not yet serendipitous? #SXSW


Why Barry Diller thinks iPad only content is nuts @mention #SXSW {link}


Like @mention I've now seen most of Austin in Google Streetview checking out apartments for #sxsw. Austin is not easy on the click.


Anyone know status of iPad 2s in Austin pop-up store? Sold out? Getting more? #ipad2 #sxsw


Reports of @mention introducing a new social media platform at #SXSW were premature, but hopefully not overly optimistic {link}


DANG RT @mention Confirmed! Apple store 2 week popup in Austin for #SXSW {link} (via @mention who gave us no credit! )


At the Team Android party. Can't find it on Gowalla or Foursquare, so um, there you go. #sxsw


Line for Source Code is even longer than for iPad 2. Take that

Oh hey, Spanish. This possibility didn't cross my mind at all. Looking at sample tweets by hand I didn't see any other languages, but I think I can find a way to filter out non-English tweets a little later.

Many of the others in this sample are unemotional, but a couple... could be? I'll try a few more.

In [12]:
df[df['text'].str.contains('Compr')]

Unnamed: 0,index,text,product,emotion
881,882,Comprando mi iPad 2 en el #SXSW (@mention Appl...,,I can't tell


In [13]:
ict_emo_s2 = df[df["emotion"] == "I can't tell"].sample(n=12, random_state =3)
for i in range(10):
    print(ict_emo_s2["text"].iloc[i])
    print("\n")

I really think that most of the iPad 2 stock went down to #SXSW.


The queue at the Apple Store in Austin is FOUR blocks long. Crazy stuff! #sxsw


I'd take #sxsw hashtags over scheen, ipad/ipad2, tv or sports tweets any day, but for those needing a filterÛ_ {link}


TR @mention Google (tries again) to launch a new social network called Circles: {link} #sxsw {link}


Walked by the mobile Apple store in austin.  Line was insane. #sxsw


#sxsw: @mention We think we control our identities on Facebook, but as Google becomes an AI our profile will be built of what we do


RT never use mine on the go RT @mention &quot;You're probably using your iPad on the go.&quot; #disagree #SXSW #uxdes


Why Barry Diller thinks iPad only content is nuts @mention #SXSW {link}


Next up: designing interfaces for iPad. Or should I hit up &quot;Your Brand is Obsolete&quot;? #SXSW


@mention is biyt.ly for email, like google voice for email #loveit #sxsw #startupbus




There's a lot of ambiguity in some of these, and the dabbling in emotionally charged language is sure to make things a little difficult in the future. My original plan was to include these in the "no emotion" pile, but after thinking more about it I think that'd be a mistake. Thankfully, these are under 2% of the entries (156 of about 9,100), so I'll remove them.

In [14]:
df = df.drop(df["emotion"].loc[df["emotion"]=="I can't tell"].index)
df.reset_index(inplace = True)
#https://stackoverflow.com/questions/53182464/pandas-delete-a-row-in-a-dataframe-based-on-a-value

In [15]:
df.emotion.value_counts()

No emotion toward brand or product    5387
Positive emotion                      2978
Negative emotion                       570
Name: emotion, dtype: int64

# Preprocessing

In addition to the Flatiron Approved preprocessing steps (lower case, remove punctuation, lemmatize), I'll also expand the contractions.

In [16]:
import contractions

In [17]:
# Expand contractions word by word (e.g. "isn't" -> "is not").
# Building the whole column in one assignment avoids the chained-indexing
# SettingWithCopyWarning that df["processed_text"].iloc[i] = ... triggers.
df["processed_text"] = df["text"].apply(
    lambda text: " ".join(contractions.fix(word) for word in text.split())
)

df.drop("text", axis = 1, inplace = True)


In [18]:
df.processed_text[1]

'@jessedee Know about @fludapp ? Awesome iPad/iPhone app that you will likely appreciate for its design. Also, they are giving free Ts at #SXSW'

In [19]:
from nltk.tokenize import RegexpTokenizer

basic_token_pattern = r"(?u)\b\w\w+\b"

tokenizer = RegexpTokenizer(basic_token_pattern)
# Assign the tokenized column in one step instead of writing cell by cell
df["processed_text"] = df["processed_text"].apply(
    lambda text: " ".join(tokenizer.tokenize(text))
)

In [20]:
df["processed_text"] = df["processed_text"].str.lower()

In [21]:
df.head()

Unnamed: 0,level_0,index,product,emotion,processed_text
0,0,0,iPhone,Negative emotion,wesley83 have 3g iphone after hrs tweeting at ...
1,1,1,iPad or iPhone App,Positive emotion,jessedee know about fludapp awesome ipad iphon...
2,2,2,iPad,Positive emotion,swonderlin can not wait for ipad also they sho...
3,3,3,iPad or iPhone App,Negative emotion,sxsw hope this year festival is not as crashy ...
4,4,4,Google,Positive emotion,sxtxstate great stuff on fri sxsw marissa maye...
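
To see why tokens like "I", "a" and "3" disappear from the first row, here is a quick standard-library check of the token pattern. It uses the visible start of tweet 0 from the first table; the helper name `tokenize` is only for illustration and stands in for the NLTK tokenizer plus the lowercasing step.

```python
import re

# Same pattern as basic_token_pattern: keep runs of two or more word characters
token_pattern = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Extract tokens of length >= 2 and lowercase them."""
    return [token.lower() for token in token_pattern.findall(text)]


sample = ".@wesley83 I have a 3G iPhone. After 3 hrs"
print(tokenize(sample))

# Without contraction expansion, the apostrophe splits the word and the
# single-letter remainder is dropped
print(tokenize("isn't"))
```

The second call shows why expanding contractions first matters: the pattern would otherwise turn "isn't" into the fragment "isn" and lose the negation.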


I don't know how much it will help, but I figured I may as well create a column which identifies the company each tweet refers to.

In [22]:
df["product"].value_counts()

iPad                               942
Apple                              659
iPad or iPhone App                 470
Google                             429
iPhone                             296
Other Google product or service    292
Android App                         81
Android                             78
Other Apple product or service      35
Name: product, dtype: int64

In [23]:
# Default to Apple, then overwrite rows whose product mentions Android or Google.
# Note: rows with no product label (NaN) also keep the "Apple" default.
df["company"] = "Apple"
product = df["product"].astype(str)
df.loc[product.str.contains("Android"), "company"] = "Android"
df.loc[product.str.contains("Google"), "company"] = "Google"
df.head()


Unnamed: 0,level_0,index,product,emotion,processed_text,company
0,0,0,iPhone,Negative emotion,wesley83 have 3g iphone after hrs tweeting at ...,Apple
1,1,1,iPad or iPhone App,Positive emotion,jessedee know about fludapp awesome ipad iphon...,Apple
2,2,2,iPad,Positive emotion,swonderlin can not wait for ipad also they sho...,Apple
3,3,3,iPad or iPhone App,Negative emotion,sxsw hope this year festival is not as crashy ...,Apple
4,4,4,Google,Positive emotion,sxtxstate great stuff on fri sxsw marissa maye...,Google


In [24]:
df["company"].value_counts()

Apple      8055
Google      721
Android     159
Name: company, dtype: int64

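WOW, that is an imbalanced feature. Google and Android combined are just under 10% of the data. Part of that is an artifact: every tweet without a product label (about 5,650 of them) falls into the Apple bucket by default. As far as I know, SMOTE is meant for rebalancing the target classes rather than a feature like this one, so I'm not sure it applies here without something breaking...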

## Foreign language identification

I think I can safely assume that nearly all of the tweets are in English, but only nearly. The model has basically no way to learn from a handful of non-English tweets, so they just inject noise into my dataset. Can I remove them?

My plan to identify those messages is to collect and inspect the ones that don't include at least one word in the English stopwords list. Of course, languages share spellings (e.g. "me" in Spanish/English) so I will exclude the stop words that English shares with the two most spoken languages that use our alphabet: Spanish and... Portuguese? Really? Learn something new every day.

Then, I'll look at tweets filtered in the opposite way: just those that include Spanish/Portuguese stopwords outside of the stopwords shared with English.
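
The two filters described above can be sketched on toy data before running them on the real lists. The stop word sets below are tiny hand-picked stand-ins for the NLTK lists (an assumption for illustration only), and the function names are hypothetical.

```python
# Tiny stand-ins for the full NLTK stop word lists (assumption: illustration only)
english_stop = {"the", "is", "at", "of", "me", "no", "a"}
spanish_stop = {"el", "en", "mi", "me", "no", "a"}

# Words spelled the same in both lists cannot tell the languages apart
shared = english_stop & spanish_stop
english_only = english_stop - shared
spanish_only = spanish_stop - shared


def lacks_english(text):
    """First filter: True when the text has no English-only stop word."""
    return not any(word in english_only for word in text.split())


def has_spanish(text):
    """Second filter: True when the text has at least one Spanish-only stop word."""
    return any(word in spanish_only for word in text.split())


tweets = [
    "comprando mi ipad 2 en el sxsw",
    "the queue at the apple store is four blocks long",
    "essential sxsw tools link",
]
for tweet in tweets:
    print(lacks_english(tweet), has_spanish(tweet), tweet)
```

The third toy tweet is flagged by the first filter even though it is English, simply because it is short and has no stop words at all. That kind of false positive is why both filters still need a manual look.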

In [25]:
english_stop = stopwords.words("english")
spanish_stop = stopwords.words("spanish")
port_stop = stopwords.words("portuguese")

In [29]:
s_bilinguals = [word for word in english_stop if word in spanish_stop]
s_bilinguals

['me', 'he', 'has', 'a', 'no', 'o', 'y']

In [30]:
p_bilinguals = [word for word in english_stop if word in port_stop]
p_bilinguals

['me', 'do', 'a', 'as', 'for', 'no', 'o']

In [31]:
shared_words = list(set(p_bilinguals+s_bilinguals))
english_filter = [word for word in english_stop if word not in shared_words]

In [32]:
num_filtered = 0
for i in range(len(df)):
    if not any(x in english_filter for x in df["processed_text"][i].split()):
        num_filtered += 1
        print(df["processed_text"][i])
        print(i) #including the row number so I can perhaps remove them

wooooo ûï mention apple store downtown austin open til midnight sxsw
48
attending mention ipad design headaches sxsw link
66
worship mention link sxsw
76
stay tune mention showcase h4ckers link sxsw
84
unboxing apple sxsw mention apple store sxsw link
110
samsung sony follow apple hp lead mention link austin atx sxsw
130
samsung sony follow apple hp lead mention link austin atx sxsw via mention rg
131
beautiful sxsw mention apple store sxsw pic link
133
essential sxsw tools link
159
arduino android flaming skulls link refrigerator speaks salon 30pm mention mention sxsw smartthings
248
breaking apple announces partnership porn industry for new ipad2 video chat app called quot facetime quot sxsw
263
temporary apple store opens mention 6th amp congress tomorrow ipad2 sxsw
295
base camp apple sxsw link
299
hootsuite blog ûò social media dashboard åè hootsuite mobile for sxsw updates for iphone blackberry amp android link
370
hootsuite blog ûò social media dashboard åè hootsuite mobile for 

covet new ipad link sxsw
4497
downloading sxsw featured artists link itunes
4516
z16 saving grace link codes valid 00 59 59p 03 12 11 infektd sxsw zlf
4524
child using ipad for first time link uxdes sxsw
4527
interesting rt mention google launching secret new social network called quot circles quot link sxsw
4553
sxsw keynote marissa mayers 12 billion miles driven google maps navigation yr route around traffic saving users yrs total wow
4623
google announces check ins sxsw location based geo fencing applications link ireport wssxsw sxsw scrm sm marketing
4628
sxsw mint talks mobile app development challenges teases new ipad app link
4658
quot mention hoot new blog post hootsuite mobile for sxsw updates for iphone bberry android link mention
4691
sxsw mention talk demo ing google places hotpot integrated rating recommendation system for android iphone cool stuff
4713
team android choice awards finalists announced link sxsw
4890
two year old shows us howmto use ipad usdes sxsw
4910
googl

In [None]:
num_filtered

303 tweets to look through is a bit daunting. But scrolling through the first couple dozen, I don't see any non-English ones yet.

Let's try the other filter.

In [33]:
p_s_stop = list(set(port_stop+spanish_stop))
p_s_filter = [palabra for palabra in p_s_stop if palabra not in shared_words]

In [34]:
num_filtered = 0
for i in range(len(df)):
    if any(x in p_s_filter for x in df["processed_text"][i].split()):
        num_filtered += 1
        print(df["processed_text"][i])
        print(i)

great sxsw ipad app from madebymany http tinyurl com 4nqv92l
13
photo just installed the sxsw iphone app which is really nice http tumblr com x6t1pi6av7
22
you must have this app for your ipad if you are going to sxsw http itunes apple com us app holler gram id420666439 mt hollergram
30
the best rt mention ha first in line for ipad2 at sxsw quot pop up quot apple store was an event planner eventprofs pcma engage365
35
spin play new concept in music discovery for your ipad from mention amp spin com link itunes sxsw mention
36
ha ha rt mention sxsw news yahoo com is loosing search traffic to new site google com doubt it will last though that weird name
73
the sxsw app on the iphone is live rsvp for events from your phone amp check out sundayswagger eventbrite com 20 scoremore
258
special apple store opening at 6th and congress for sxsw amp ipad launch www apple com retail thedomain apple ipad2 sxsw fb
289
10 austin classics not to be missed at sxsw m4blog com twitter for iphone 12 11 12 

most valuable apple ipad apps top critical tasks managed uber la lt all ipad all day sxsw link
4159
mr heavenly aka the band michael cera ready to rock the bat bar at sxsw iphone cameras de rigeur
4164
hey pick me up one too will ya rt mention omg still in line for the new ipad dieing of hunger sxsw who else is in line
4215
zomg rt mention special apple store at sxsw just for ipad line is block long plixi com 83323324 plixi com 83323414
4256
doodle competition invite kids to create own doodles winners are put on google com to celebrate their awesomeness googledoodle sxsw
4289
mike tyson appears at sxsw to promote iphone ipad game link seo sem games miketyson topnews
4329
thewildernessdowntown com best of show at sxsw google thomas gayno quot it is like choreography of browser windows quot sxsw
4487
interesting post re seo rt mention how to improve website rankings advice from google amp bing at sxsw link se
4530
fascinating talk on health data from gov to private on the verge of someth

think apple us winning sxsw just sea of silver white and glass
7329
how to improve website rankings google bing advice at sxsw link get com or pk but be content rich deep
7345
come by aquent booth 1413 at sxsw to learn about internetonlinewebsite com meet us amp enter to win an ipad
7379
funniest question at google marissa mayer talk at sxsw quot hi marissa um are you guys hiring quot
7380
popup store de gadgets da apple no sxsw link
7486
miami horror tacos and bloody marys am there https sites google com site frontgatesxsw11 sxsw
7591
happily drowning in sea of apple macs and monster energy drink sxsw
7601
google coming location dominance at sxsw pcmag com link
7716
ha that is one way to track your contacts rt mention mention and john android sxsw mention link
7759
sxsw news yahoo com is loosing search traffic to new site google com doubt it will last though that weird name
7804
here it is join actsofsharing com and starting tonight the person with the most friends in their city by en

In [35]:
num_filtered

184

Scrolling through these, I see a couple of non-English tweets, but not many. There may not even be 10 in the nearly 9,000. This might not be as big a deal as I was worried about.

On the other hand, there's a new problem to consider: a LOT of these tweets are near-duplicates of one another. Perhaps this would be a familiar phenomenon to someone who uses Twitter, but that wasn't me. So that's the next thing to fix.

## Duplicates

In [36]:
!pip install thefuzz



In [37]:
pip install rapidfuzz

Note: you may need to restart the kernel to use updated packages.


In [39]:
pip install -r requirements.txt

ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'
Note: you may need to restart the kernel to use updated packages.


In [None]:
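from thefuzz import fuzz

def is_retweet(cell_1, cell_2):
    # Treat two tweets as near-duplicates when their partial match score
    # (0 to 100) is at least 60
    return fuzz.partial_ratio(cell_1, cell_2) >= 60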
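
The same thresholded-similarity idea can be tried without installing anything, using the standard library's `difflib` as a stand-in for `thefuzz`. This is a sketch under assumptions: `SequenceMatcher.ratio()` returns a value from 0 to 1 and is not the same measure as `partial_ratio`, and the 0.85 threshold and the function names are illustrative choices, not tuned values. The first two tweets are rows 130 and 131 from the filter output above.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


def near_duplicates(texts, threshold=0.85):
    """Return index pairs whose similarity is at or above the threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(texts), 2)
        if similarity(a, b) >= threshold
    ]


tweets = [
    "samsung sony follow apple hp lead mention link austin atx sxsw",
    "samsung sony follow apple hp lead mention link austin atx sxsw via mention rg",
    "essential sxsw tools link",
]
print(near_duplicates(tweets))
```

Comparing every pair is quadratic: with roughly 8,900 tweets that is on the order of 40 million comparisons, so the full dataset may need a faster library or some form of blocking before this is practical.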


In [None]:
# Features to build the model
features = ['product', "company", 'processed_text']

X = df[features]
y = df[['emotion']]

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)