# Malcolm's Scratch Notebook

Import pandas and the `re` module into the notebook.

In [1]:
import pandas as pd
import re

Import the data into a pandas dataframe.

In [2]:
df = pd.read_csv('data/crowdflower-brands-and-product-emotions/data/judge_1377884607_tweet_product_company.csv')

Check the first 5 observations of the data.

In [3]:
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


There are three columns in the dataframe: the text of the tweet, the brand or product the tweet is about, and whether the tweet expresses a positive or negative emotion toward that brand or product. Next we check the number of observations.

In [4]:
df.shape

(8721, 3)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8721 entries, 0 to 8720
Data columns (total 3 columns):
tweet_text                                            8720 non-null object
emotion_in_tweet_is_directed_at                       3169 non-null object
is_there_an_emotion_directed_at_a_brand_or_product    8721 non-null object
dtypes: object(3)
memory usage: 204.5+ KB


There are 8,721 observations in the data set. One observation has a null tweet_text, and the majority of observations (5,552 of 8,721) have a null value for the object of the tweet. There are no null values for the emotion of the tweet.
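The null counts read off `df.info()` can also be computed directly. The sketch below uses a small made-up frame (`toy`, with illustrative values only, not the real data) to show the pattern; on the notebook's dataframe the same two calls would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for the real dataframe; the values are illustrative only.
toy = pd.DataFrame({
    "tweet_text": ["first tweet", None, "third tweet"],
    "emotion_in_tweet_is_directed_at": ["iPad", None, None],
})

# Number of missing values in each column.
null_counts = toy.isna().sum()

# Fraction of missing values in each column (mean of a boolean mask).
null_share = toy.isna().mean()

print(null_counts)
print(null_share.round(2))
```

Taking the mean of the boolean mask gives the share of missing values per column, which is easier to compare across columns than the raw counts.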

In [6]:
df.describe()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
count,8720,3169,8721
unique,8693,9,4
top,RT @mention Marissa Mayer: Google Will Connect...,iPad,No emotion toward brand or product
freq,5,910,5156


There are 8,693 unique tweet_text values among the 8,720 non-null tweets. This indicates that there are duplicate observations in the dataframe that will need to be cleaned or removed. There are 9 different subjects that tweets are about, and the most common non-null subject is the iPad. The majority of tweets express neither a positive nor a negative emotion toward the subject matter of the tweet.

Below are the four different types of emotions: positive, negative, no emotion, and can't tell.

In [7]:
df.is_there_an_emotion_directed_at_a_brand_or_product.unique()

array(['Negative emotion', 'Positive emotion',
       'No emotion toward brand or product', "I can't tell"], dtype=object)

Next we check the specific numbers for each type of emotion.

In [8]:
df.is_there_an_emotion_directed_at_a_brand_or_product.value_counts()

No emotion toward brand or product    5156
Positive emotion                      2869
Negative emotion                       545
I can't tell                           151
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

As seen before, the most common label is no emotion toward the brand or product. The next most common label is a positive emotion.
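To put the counts above on a common scale, they can be converted to shares of all tweets. The sketch below rebuilds the counts by hand from the output of `value_counts()`; on the dataframe itself, `value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
emotion_counts = pd.Series({
    "No emotion toward brand or product": 5156,
    "Positive emotion": 2869,
    "Negative emotion": 545,
    "I can't tell": 151,
})

# Share of each label among all labelled tweets.
emotion_shares = emotion_counts / emotion_counts.sum()

print(emotion_shares.round(3))
```

The shares make the class imbalance explicit: the negative class is a small fraction of the data, which matters for any classifier trained on these labels.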

The next step is to check the specific number of tweets toward a certain brand or product.

In [9]:
df.emotion_in_tweet_is_directed_at.value_counts()

iPad                               910
Apple                              640
iPad or iPhone App                 451
Google                             412
iPhone                             288
Other Google product or service    282
Android App                         78
Android                             74
Other Apple product or service      34
Name: emotion_in_tweet_is_directed_at, dtype: int64

The most common subject is the iPad, followed by the Apple brand itself. The Google brand is the fourth most common subject. Overall, most tweets with a labelled subject concern Apple rather than Google.
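The Apple versus Google comparison can be checked by rolling the nine subjects up into two brands. The mapping `brand_of` below is an assumption made for illustration (it is not part of the data set); the counts are copied from the output above.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
product_counts = pd.Series({
    "iPad": 910,
    "Apple": 640,
    "iPad or iPhone App": 451,
    "Google": 412,
    "iPhone": 288,
    "Other Google product or service": 282,
    "Android App": 78,
    "Android": 74,
    "Other Apple product or service": 34,
})

# Hypothetical grouping of each subject into a parent brand.
brand_of = {
    "iPad": "Apple",
    "Apple": "Apple",
    "iPad or iPhone App": "Apple",
    "iPhone": "Apple",
    "Other Apple product or service": "Apple",
    "Google": "Google",
    "Other Google product or service": "Google",
    "Android App": "Google",
    "Android": "Google",
}

# Group the index labels by brand and sum the counts within each brand.
brand_counts = product_counts.groupby(brand_of).sum()

print(brand_counts)
```

Under this grouping, Apple-related subjects account for 2,323 of the 3,169 labelled tweets and Google-related subjects for 846.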

In [10]:
df[df.duplicated()]

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
457,"Before It Even Begins, Apple Wins #SXSW {link}",Apple,Positive emotion
752,Google to Launch Major New Social Network Call...,,No emotion toward brand or product
2138,Marissa Mayer: Google Will Connect the Digital...,,No emotion toward brand or product
2437,Counting down the days to #sxsw plus strong Ca...,Apple,Positive emotion
3759,Really enjoying the changes in Gowalla 3.0 for...,Android App,Positive emotion
3771,"#SXSW is just starting, #CTIA is around the co...",Android,Positive emotion
4669,"Oh. My. God. The #SXSW app for iPad is pure, u...",iPad or iPhone App,Positive emotion
5107,RT @mention ��� GO BEYOND BORDERS! ��_ {link} ...,,No emotion toward brand or product
5110,"RT @mention ��� Happy Woman's Day! Make love, ...",,No emotion toward brand or product
5650,RT @mention Google to Launch Major New Social ...,,No emotion toward brand or product


There are 22 duplicate rows in the dataframe; only some of them are shown above.

## Data Prep and Cleaning

First we want to drop the duplicate rows.

In [11]:
df = df.drop_duplicates()

Check that no duplicates remain.

In [12]:
df[df.duplicated()]

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
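Note that `duplicated()` and `drop_duplicates()` with no arguments compare whole rows, not just the tweet text. Earlier, `describe()` reported 8,720 non-null tweets but only 8,693 unique texts, a gap of 27, while 22 rows were flagged as exact duplicates. That suggests a few tweets repeat the same text with a different label, although this notebook does not check that directly. The sketch below uses a made-up frame (`toy`, with a hypothetical `label` column) to show the difference between the two kinds of duplicate.

```python
import pandas as pd

# Toy frame: three rows share the same text, but only two share the same label.
toy = pd.DataFrame({
    "tweet_text": ["same text", "same text", "same text", "other text"],
    "label": ["Positive emotion", "Positive emotion",
              "No emotion", "Negative emotion"],
})

# Rows that repeat an earlier row in every column.
full_row_dupes = toy.duplicated().sum()

# Rows that repeat an earlier row's text, whatever the label.
text_dupes = toy.duplicated(subset=["tweet_text"]).sum()

# Dropping on whole rows keeps the conflicting-label copy ...
deduped_rows = toy.drop_duplicates()

# ... while dropping on the text column keeps only the first occurrence.
deduped_text = toy.drop_duplicates(subset=["tweet_text"])

print(full_row_dupes, text_dupes, len(deduped_rows), len(deduped_text))
```

Whether to deduplicate on the text alone is a modelling choice: it removes conflicting labels for the same tweet, but it also silently keeps whichever label happens to come first.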


Next we check for missing tweet text, drop the row where it is missing, and create a list of tweet texts that we can start cleaning.

In [13]:
df.tweet_text.isna().value_counts()

False    8698
True        1
Name: tweet_text, dtype: int64

In [14]:
df = df.dropna(subset=['tweet_text'])

In [15]:
tweet_text = list(df.tweet_text)

In [16]:
tweet_text

['.@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead!  I need to upgrade. Plugin stations at #SXSW.',
 "@jessedee Know about @fludapp ? Awesome iPad/iPhone app that you'll likely appreciate for its design. Also, they're giving free Ts at #SXSW",
 '@swonderlin Can not wait for #iPad 2 also. They should sale them down at #SXSW.',
 "@sxsw I hope this year's festival isn't as crashy as this year's iPhone app. #sxsw",
 "@sxtxstate great stuff on Fri #SXSW: Marissa Mayer (Google), Tim O'Reilly (tech books/conferences) &amp; Matt Mullenweg (Wordpress)",
 '@teachntech00 New iPad Apps For #SpeechTherapy And Communication Are Showcased At The #SXSW Conference http://ht.ly/49n4M #iear #edchat #asd',
 '#SXSW is just starting, #CTIA is around the corner and #googleio is only a hop skip and a jump from there, good time to be an #android fan',
 'Beautifully smart and simple idea RT @madebymany @thenextweb wrote about our #hollergram iPad app for #sxsw! http://bit.ly/ieaV

The first step is to remove noise, such as punctuation and non-ASCII characters.

In [17]:
text_no_noise = [re.sub(r'[.,:;!?@#]', '', text) for text in tweet_text]

In [18]:
text_no_noise = [re.sub(r'[^\x00-\x7F]+',' ', text) for text in text_no_noise]
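One side effect of stripping `.` and `:` everywhere is that URLs and HTML entities such as `&amp;` are mangled rather than removed (for example, `http://ht.ly/49n4M` loses its colon and dot but stays in the text). A possible alternative, which is not what this notebook does, is to decode entities and remove URLs before stripping punctuation. The helper `clean_tweet` and the URL pattern below are assumptions made for illustration; unlike the cells above, this sketch also collapses repeated whitespace.

```python
import html
import re

# Simple URL pattern: http:// or https:// followed by non-space characters.
URL_RE = re.compile(r'https?://\S+')
# Same punctuation set as the cell above.
PUNCT_RE = re.compile(r'[.,:;!?@#]')
# Runs of non-ASCII characters.
NON_ASCII_RE = re.compile(r'[^\x00-\x7F]+')


def clean_tweet(text: str) -> str:
    """Hypothetical cleaning helper: entities, URLs, punctuation, non-ASCII."""
    text = html.unescape(text)          # e.g. '&amp;' becomes '&'
    text = URL_RE.sub('', text)         # drop URLs while they are still intact
    text = PUNCT_RE.sub('', text)       # strip the selected punctuation
    text = NON_ASCII_RE.sub(' ', text)  # replace non-ASCII runs with a space
    return ' '.join(text.split())       # collapse repeated whitespace


sample = ('@teachntech00 New iPad Apps For #SpeechTherapy And Communication '
          'Are Showcased At The #SXSW Conference http://ht.ly/49n4M '
          '#iear #edchat #asd')
cleaned = clean_tweet(sample)
print(cleaned)
```

The order of the steps matters here: the URL has to be removed before the punctuation pass, otherwise the pattern no longer matches it.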

In [20]:
text_no_noise

['wesley83 I have a 3G iPhone After 3 hrs tweeting at RISE_Austin it was dead  I need to upgrade Plugin stations at SXSW',
 "jessedee Know about fludapp  Awesome iPad/iPhone app that you'll likely appreciate for its design Also they're giving free Ts at SXSW",
 'swonderlin Can not wait for iPad 2 also They should sale them down at SXSW',
 "sxsw I hope this year's festival isn't as crashy as this year's iPhone app sxsw",
 "sxtxstate great stuff on Fri SXSW Marissa Mayer (Google) Tim O'Reilly (tech books/conferences) &amp Matt Mullenweg (Wordpress)",
 'teachntech00 New iPad Apps For SpeechTherapy And Communication Are Showcased At The SXSW Conference http//htly/49n4M iear edchat asd',
 'SXSW is just starting CTIA is around the corner and googleio is only a hop skip and a jump from there good time to be an android fan',
 'Beautifully smart and simple idea RT madebymany thenextweb wrote about our hollergram iPad app for sxsw http//bitly/ieaVOB',
 'Counting down the days to sxsw plus strong