#  Pandas string functions
You might wonder why we should bother with the string functions from pandas instead of just using Python's built-in ones. The reason is that Python's string methods work on individual string objects, while the pandas functions work on an entire Series, such as a column of a DataFrame, at once. You can think of the pandas string functions as an extension that lets us operate on a whole Series of strings in a single call. Since the text data we work with will usually already be stored in a Series or a DataFrame, using the pandas functions will make our life a lot easier.
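As a rough illustration of the difference, the sketch below lowercases the same values twice: once with plain Python string methods and once with the pandas .str accessor. The Series names and its values are made up for this example.

```python
import numpy as np
import pandas as pd

names = pd.Series(['John Wood', np.nan, 'HELLO WORLD'])

# Plain Python: str methods exist only on str objects, so a missing value
# (a float NaN) has to be skipped explicitly.
lowered_python = [x.lower() if isinstance(x, str) else x for x in names]

# pandas: the .str accessor applies the method element-wise and
# propagates missing values on its own.
lowered_pandas = names.str.lower()
```

The pandas version needs no special handling for the missing entry, which is one reason it is convenient for real datasets.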

In [1]:
import pandas as pd
import numpy as np

s=pd.Series(['0', 'John Wood', 'Colin Welsh', 'my list', '02456', np.nan, 'HELLO WORLD', 'water%'])
s

0              0
1      John Wood
2    Colin Welsh
3        my list
4          02456
5            NaN
6    HELLO WORLD
7         water%
dtype: object

In [2]:
s.str.lower()

0              0
1      john wood
2    colin welsh
3        my list
4          02456
5            NaN
6    hello world
7         water%
dtype: object

In [3]:
s.str.upper()

0              0
1      JOHN WOOD
2    COLIN WELSH
3        MY LIST
4          02456
5            NaN
6    HELLO WORLD
7         WATER%
dtype: object

In [4]:
s.str.len()

0     1.0
1     9.0
2    11.0
3     7.0
4     5.0
5     NaN
6    11.0
7     6.0
dtype: float64

In [5]:
s.str.split(' ')

0               [0]
1      [John, Wood]
2    [Colin, Welsh]
3        [my, list]
4           [02456]
5               NaN
6    [HELLO, WORLD]
7          [water%]
dtype: object

In [6]:
substrings = s.str.split(' ', expand=True)
substrings

        0      1
0       0   None
1    John   Wood
2   Colin  Welsh
3      my   list
4   02456   None
5     NaN    NaN
6   HELLO  WORLD
7  water%   None


In [7]:
substrings[1]

0     None
1     Wood
2    Welsh
3     list
4     None
5      NaN
6    WORLD
7     None
Name: 1, dtype: object

In [8]:
s.str.replace('strA','strB')

0              0
1      John Wood
2    Colin Welsh
3        my list
4          02456
5            NaN
6    HELLO WORLD
7         water%
dtype: object

In [9]:
s.str.replace('%',' percent ')

0                 0
1         John Wood
2       Colin Welsh
3           my list
4             02456
5               NaN
6       HELLO WORLD
7    water percent 
dtype: object

In [10]:
s.str.replace('%','')

0              0
1      John Wood
2    Colin Welsh
3        my list
4          02456
5            NaN
6    HELLO WORLD
7          water
dtype: object

In [11]:
s.str[0:2]

0      0
1     Jo
2     Co
3     my
4     02
5    NaN
6     HE
7     wa
dtype: object

In [12]:
s.str.slice(0,2)

0      0
1     Jo
2     Co
3     my
4     02
5    NaN
6     HE
7     wa
dtype: object

In [13]:
# str.slice_replace(i,j,'str')

s.str.slice_replace(0,2, '___')

0             ___
1      ___hn Wood
2    ___lin Welsh
3        ___ list
4          ___456
5             NaN
6    ___LLO WORLD
7         ___ter%
dtype: object

In [14]:
flag = s.str.contains('0')
flag

0     True
1    False
2    False
3    False
4     True
5      NaN
6    False
7    False
dtype: object

In [15]:
flag = s.str.contains('0', na=False)
flag

0     True
1    False
2    False
3    False
4     True
5    False
6    False
7    False
dtype: bool

In [16]:
s[flag]

0        0
4    02456
dtype: object

# Cleaning up the movies dataset

In [17]:
import pandas as pd
import numpy as np

movies = pd.read_csv('tmdb_5000_movies.csv')
movies.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [18]:
genres=movies['genres']

In [19]:
genres[0]

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

We would like to replace this entry with just the names of the genres separated by a comma such as

 'Action, Adventure, Fantasy, Science Fiction' 

How can we go about this? Since each entry is a JSON string, we could use the json module:

In [20]:
import json

json_obj = json.loads(genres[0]) # Load json string
names = [x['name'] for x in json_obj] # ['Action', 'Adventure', 'Fantasy', 'Science Fiction']
', '.join(names) # 'Action, Adventure, Fantasy, Science Fiction'

'Action, Adventure, Fantasy, Science Fiction'
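The same idea can be applied to a whole column by wrapping it in a function and passing that to apply. The following is a minimal sketch, not part of the original notebook: the helper name names_from_json and the two sample entries are assumptions chosen to mimic the shape of the genres column.

```python
import json
import pandas as pd

def names_from_json(entry):
    """Return the 'name' fields of a JSON list of objects as one comma-separated string."""
    return ', '.join(item['name'] for item in json.loads(entry))

# Two hand-written entries in the same shape as the genres column
sample_genres = pd.Series([
    '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
    '[]',
])

parsed = sample_genres.apply(names_from_json)
```

Because the JSON parser understands the structure of each entry, this approach does not disturb digits or commas that appear inside a name.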

In [21]:
def transform(s):
    # Remove the enclosing square brackets from every entry
    s = s.str.strip('[]')
    return s

In [22]:
genres= transform(genres)
genres[0]

'{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}'

In [23]:
def transform(s):
    # Strip the enclosing brackets, then delete the JSON punctuation and keys
    s = s.str.strip('[]')
    for token in ['{', '}', ',', '"id":', '"name":', '"']:
        s = s.str.replace(token, '', regex=False)
    # Delete every digit (the ids); note that this also removes digits inside names
    s = s.str.replace('[0-9]', '', regex=True)
    # Names are now separated by four spaces: turn those into ', ',
    # then drop the three spaces left at the start
    s = s.str.replace('    ', ', ', regex=False)
    s = s.str.replace('   ', '', regex=False)
    return s

In [24]:
genres= transform(genres)
genres[0]

'Action, Adventure, Fantasy, Science Fiction'

In [25]:
movies['genres']=genres

In [26]:
movies.loc[:,['title','genres']].head(10)

Unnamed: 0,title,genres
0,Avatar,"Action, Adventure, Fantasy, Science Fiction"
1,Pirates of the Caribbean: At World's End,"Adventure, Fantasy, Action"
2,Spectre,"Action, Adventure, Crime"
3,The Dark Knight Rises,"Action, Crime, Drama, Thriller"
4,John Carter,"Action, Adventure, Science Fiction"
5,Spider-Man 3,"Fantasy, Action, Adventure"
6,Tangled,"Animation, Family"
7,Avengers: Age of Ultron,"Action, Adventure, Science Fiction"
8,Harry Potter and the Half-Blood Prince,"Adventure, Fantasy, Family"
9,Batman v Superman: Dawn of Justice,"Action, Adventure, Fantasy"


# Further practice with the movies dataset

Task: transform the entries of the keywords column so that each contains only the first three keywords, separated by commas. For example, the entry

In [27]:
movies.keywords[0]

'[{"id": 1463, "name": "culture clash"}, {"id": 2964, "name": "future"}, {"id": 3386, "name": "space war"}, {"id": 3388, "name": "space colony"}, {"id": 3679, "name": "society"}, {"id": 3801, "name": "space travel"}, {"id": 9685, "name": "futuristic"}, {"id": 9840, "name": "romance"}, {"id": 9882, "name": "space"}, {"id": 9951, "name": "alien"}, {"id": 10148, "name": "tribe"}, {"id": 10158, "name": "alien planet"}, {"id": 10987, "name": "cgi"}, {"id": 11399, "name": "marine"}, {"id": 13065, "name": "soldier"}, {"id": 14643, "name": "battle"}, {"id": 14720, "name": "love affair"}, {"id": 165431, "name": "anti war"}, {"id": 193554, "name": "power relations"}, {"id": 206690, "name": "mind and soul"}, {"id": 209714, "name": "3d"}]'

**should become 'culture clash, future, space war'.**

In [28]:
keywords = movies['keywords']


In [29]:
keywords = transform(keywords)
keywords.head()

0    culture clash, future, space war, space colony...
1    ocean, drug abuse, exotic island, east india t...
2    spy, based on novel, secret agent, sequel, mi,...
3    dc comics, crime fighter, terrorist, secret id...
4    based on novel, mars, medallion, space travel,...
Name: keywords, dtype: object

In [30]:
keywords_df = keywords.str.split(', ', expand=True)  # split on ', ' so the pieces carry no leading space
keywords_df[0:3]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,87,88,89,90,91,92,93,94,95,96
0,culture clash,future,space war,space colony,society,space travel,futuristic,romance,space,alien,...,,,,,,,,,,
1,ocean,drug abuse,exotic island,east india trading company,love of one's life,traitor,shipwreck,strong woman,ship,alliance,...,,,,,,,,,,
2,spy,based on novel,secret agent,sequel,mi,british secret service,united kingdom,,,,...,,,,,,,,,,


In [31]:
movies['keywords'] = keywords_df[0] + ', ' + keywords_df[1] + ', ' + keywords_df[2]
movies.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"Action, Adventure, Fantasy, Science Fiction",http://www.avatarmovie.com/,19995,"culture clash, future, space war",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"Adventure, Fantasy, Action",http://disney.go.com/disneypictures/pirates/,285,"ocean, drug abuse, exotic island",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"Action, Adventure, Crime",http://www.sonypictures.com/movies/spectre/,206647,"spy, based on novel, secret agent",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"Action, Crime, Drama, Thriller",http://www.thedarkknightrises.com/,49026,"dc comics, crime fighter, terrorist",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"Action, Adventure, Science Fiction",http://movies.disney.com/john-carter,49529,"based on novel, mars, medallion",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124
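Adding columns together produces a missing value for any movie with fewer than three keywords, because at least one of the columns is empty for that row. A possible alternative, sketched here on a made-up Series called cleaned, keeps the keywords as lists, slices the first three and joins them again.

```python
import pandas as pd

# Stand-in for the cleaned keywords column (values invented for illustration)
cleaned = pd.Series([
    'culture clash, future, space war, space colony',
    'spy, sequel',
])

# Split into lists, keep at most three items per list, then join them back
first_three = cleaned.str.split(', ').str[:3].str.join(', ')
```

The second entry has only two keywords and is kept as it is instead of becoming a missing value.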


# Regular expressions

In [32]:
s=pd.Series(['0', 'John Wood', 'Colin Welsh', 'my list', '02456', np.nan, 'HELLO WORLD', 'water%'])

In [33]:
s.str.contains('John')

0    False
1     True
2    False
3    False
4    False
5      NaN
6    False
7    False
dtype: object

In [34]:
s.str.contains('John') | s.str.contains('Colin')

0    False
1     True
2     True
3    False
4    False
5    False
6    False
7    False
dtype: bool

In [35]:
s.str.contains('John|Colin')

0    False
1     True
2     True
3    False
4    False
5      NaN
6    False
7    False
dtype: object

In [36]:
s2 = pd.Series(['bar', 'sugar', 'cartoon', 'argon'])

In [37]:
s2.str.contains('.ar')

0     True
1     True
2     True
3    False
dtype: bool

In [38]:
s2.str.contains('[bc]ar')

0     True
1    False
2     True
3    False
dtype: bool

We can also specify inside the square brackets what kind of characters we want to match as follows:

- [a-z] - match any lowercase letter
- [A-Z] - match any uppercase letter
- [0-9] - match any digit
- [a-zA-Z0-9] - match any letter or digit

In [39]:
s[s.str.contains('[0-9]', na=False)]

0        0
4    02456
dtype: object

- [^a-z] - match any character that is not a lowercase letter
- [^A-Z] - match any character that is not an uppercase letter
- [^0-9] - match any character that is not a digit
- [^a-zA-Z0-9] - match any character that is not a letter or digit

- \d - match any digit
- \D - match any non digit
- \w - match a word character
- \W - match a non-word character
- \s - match whitespace (spaces, tabs, newlines, etc.)
- \S - match non-whitespace
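The shorthand classes in the list above can be tried out on a shortened copy of our Series. This is only a sketch; the variable names are chosen for this example.

```python
import numpy as np
import pandas as pd

s = pd.Series(['0', 'John Wood', '02456', np.nan, 'water%'])

# \s: strings that contain at least one whitespace character
has_whitespace = s.str.contains(r'\s', na=False)

# \W: strings that contain at least one non-word character
# (the space in 'John Wood' and the '%' in 'water%' both count)
has_non_word = s.str.contains(r'\W', na=False)

# \D: strings that contain at least one character that is not a digit
has_non_digit = s.str.contains(r'\D', na=False)
```

The patterns are written as raw strings (with the r prefix) so that Python passes the backslash through to the regular expression engine unchanged.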

In [40]:
s[s.str.contains(r'[\d]', na=False)]

0        0
4    02456
dtype: object

#### Matching at the start and end of strings
We can also anchor a pattern to a particular position in the string by using

- ^ - match at the beginning of a string
- $ - match at the end of a string

In [41]:
s2[s2.str.contains('^[bc]', na=False)]

0        bar
2    cartoon
dtype: object

In [42]:
s2[s2.str.contains('ar$', na=False)]

0      bar
1    sugar
dtype: object

#### Matching preceding characters
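Often we want to name a character and then match a certain number of copies of it. We can do this with the following metacharacters, called quantifiers, which apply to the character (or group) immediately before them:

- + - match the preceding character 1 or more times
- ? - match the preceding character 0 or 1 times
- * - match the preceding character 0 or more times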

In [43]:
s3 = pd.Series(['forest', 'o', 'ff', 'foo', 'fof'])
# Search for strings that contain one or more f's, then an optional o,
# then one or more f's. Only the third and fifth strings satisfy this.
s3.str.contains('f+o?f+')

0    False
1    False
2     True
3    False
4     True
dtype: bool

An important thing to know is that the backslash character \ lets us escape metacharacters in situations where we want to match the metacharacter itself. For example, to match a period we cannot just use . since, as mentioned before, it matches any character. We must instead use \. (a backslash followed by a period).
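To make the effect of escaping concrete, here is a small sketch. The Series files and its values are invented for this example.

```python
import re
import pandas as pd

files = pd.Series(['data.csv', 'datacsv', 'notes.txt'])

# Unescaped: '.' matches any character, so 'datacsv' matches as well
unescaped = files.str.contains('.csv')

# Escaped: '\.' matches only a literal period
escaped = files.str.contains(r'\.csv')

# Alternative: switch regular expressions off and search for the plain text
literal = files.str.contains('.csv', regex=False)

# re.escape adds the backslashes for us
escaped_pattern = re.escape('.csv')
```

Using regex=False or re.escape is handy when the search text comes from user input and may contain metacharacters.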

#### Grouping
We can place parentheses around a regular expression to group the results, so that we can extract each component separately instead of the full match. This is especially useful with the str.extract() method, since in this case the matches must be grouped so that they can be extracted into a new DataFrame:

In [44]:
s4= pd.Series(['Monday5km', 'Wednesday10km', 'Saturday25km'])

In [45]:
s4.str.extract(r'(\w+day)', expand=True)

           0
0     Monday
1  Wednesday
2   Saturday


Let's break the regular expression \w+day down:

- \w: matches a word character once (for ASCII text it is equivalent to [a-zA-Z0-9_]).
- If you add the + quantifier, this will match the preceding character 1 or more times. So, \w+ will match word characters 1 or more times.
- day: matches the characters "day" literally (case sensitive).

Altogether, \w+day will match any word characters preceding the string "day", and then the string "day". It won't match anything after the string "day".

I hope it is now clear why (\w) will match only one word character.

The regular expression pattern \w+y will match any word characters preceding the string "y", and then the string "y". In practice, this will match the whole day name, without matching any characters after the string "y".

Note that the command would not have worked had we not used parentheses to indicate that we want to group the matches. That is, every time we use the str.extract() function, the pattern must contain at least one group in parentheses.
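A pattern may contain several groups, and each group becomes its own column. The sketch below uses named groups, written (?P<name>...), so that the columns get readable labels. The group names day and distance and the variable names are choices made for this illustration.

```python
import pandas as pd

runs = pd.Series(['Monday5km', 'Wednesday10km', 'Saturday25km'])

# Two named groups: the day name, and the digits in front of 'km'
parts = runs.str.extract(r'(?P<day>\w+day)(?P<distance>\d+)km')

# Extracted values are strings; convert the distance to integers
parts['distance'] = parts['distance'].astype(int)
```

Each row of parts now holds the day name and the distance as separate values that can be filtered or aggregated.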

Grouping the results also means that we can refer to them. Let's look at an example where we take each match of the previous regular expression '(\w+day)' and replace it with its first three letters, so that we get the abbreviated names 'Mon', 'Wed' and 'Sat'. For this we can use the str.replace() function. Normally we would provide a fixed string with which to replace every match. However, str.replace() also accepts a function as the replacement: it is called with each match object and returns the replacement string for that match. Because our matches are grouped with parentheses, the function can pick out the group and return its first three characters. We define the function as follows:

In [46]:
def f(x):
    return x.groups()[0][:3]

**The groups() method of the match object returns the captured groups as a tuple. We index the first (and here only) group and return its first three characters for each match.**

In [47]:
s4.str.replace(r'(\w+day)', f, regex=True)  # a function as replacement requires regex=True

0     Mon5km
1    Wed10km
2    Sat25km
dtype: object
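The same abbreviation can also be written without a helper function by using a backreference in the replacement string, where \1 stands for the text captured by the first group. This is an alternative sketch, not the method used above; the pattern is adjusted so that the group captures only the first three letters.

```python
import pandas as pd

s4 = pd.Series(['Monday5km', 'Wednesday10km', 'Saturday25km'])

# (\w{3}) captures the first three letters, \w*day consumes the rest of
# the day name, and \1 puts the captured letters back
abbreviated = s4.str.replace(r'(\w{3})\w*day', r'\1', regex=True)
```

A function is still the more flexible choice when the replacement depends on the match in a way that a fixed template cannot express.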

# Exercise: using regular expressions in pandas

In [48]:
meal_plan = ['Monday: 9:12am – Omelet,  3:30pm– Apple slices with almond butter', 
             'Tuesday: 9:35am – Banana bread, 11:00am –Sauteed veggies, 7:02pm– Taco pie',
             'Wednesday: 9:00am – Banana pancakes',  
             'Thursday: 7:23pm– Slow cooker pulled pork', 'Friday: 3:30pm – Can of tuna', 
             'Saturday: 9:11am: Eggs and sweet potato hash browns, 3:22pm: Almonds', 
             'Sunday: 11:00am: Meat and veggie stir fry'] 

In [49]:
df = pd.DataFrame(meal_plan, columns=['text'])
df

Unnamed: 0,text
0,"Monday: 9:12am – Omelet, 3:30pm– Apple slices..."
1,"Tuesday: 9:35am – Banana bread, 11:00am –Saute..."
2,Wednesday: 9:00am – Banana pancakes
3,Thursday: 7:23pm– Slow cooker pulled pork
4,Friday: 3:30pm – Can of tuna
5,Saturday: 9:11am: Eggs and sweet potato hash b...
6,Sunday: 11:00am: Meat and veggie stir fry


In [50]:
sol = df['text'].str.extractall(r'(\d?\d):(\d\d) ?([ap]m)')
sol

          0   1   2
  match
0 0       9  12  am
  1       3  30  pm
1 0       9  35  am
  1      11  00  am
  2       7  02  pm
2 0       9  00  am
3 0       7  23  pm
4 0       3  30  pm
5 0       9  11  am
  1       3  22  pm

In [51]:
days=['Mon','Tue','Wed','Thu','Fri','Sat','Sun']
meals = ['breakfast','lunch','dinner']

In [52]:
# set_levels and set_names return a new index, so assign the result back
sol.index = sol.index.set_levels([days, meals])
sol.index = sol.index.set_names(['Day', 'Meal'])

In [53]:
sol

                0   1   2
Day Meal
Mon breakfast   9  12  am
    lunch       3  30  pm
Tue breakfast   9  35  am
    lunch      11  00  am
    dinner      7  02  pm
Wed breakfast   9  00  am
Thu breakfast   7  23  pm
Fri breakfast   3  30  pm
Sat breakfast   9  11  am
    lunch       3  22  pm

In [54]:
sol.columns=['Hour','Minutes','Period']

In [55]:
sol

               Hour  Minutes  Period
Day Meal
Mon breakfast     9       12      am
    lunch         3       30      pm
Tue breakfast     9       35      am
    lunch        11       00      am
    dinner        7       02      pm
Wed breakfast     9       00      am
Thu breakfast     7       23      pm
Fri breakfast     3       30      pm
Sat breakfast     9       11      am
    lunch         3       22      pm

# Sentiment Analysis

In this unit we showcase the workflow of a real-life data science example: sentiment analysis for tweets. In the process we will come across our **first example of machine learning.** Using a package called nltk, we will walk you through the text-processing steps needed to transform the text of our tweets into a numerical representation for our machine learning model (a Naive Bayes classifier). Try to understand the main ideas, but don't worry about the precise implementation.

In [56]:
import pandas as pd
df = pd.read_csv('tweets.csv', header=None)
df.columns = ['sentiment','text']
df.head()

Unnamed: 0,sentiment,text
0,4,@stellargirl I loooooooovvvvvveee my Kindle2. ...
1,4,Reading my kindle2... Love it... Lee childs i...
2,4,"Ok, first assesment of the #kindle2 ...it fuck..."
3,4,@kenburbary You'll love your Kindle2. I've had...
4,4,@mikefish Fair enough. But i have the Kindle2...


In [57]:
#total number of tweets
df.shape[0]

498

In [58]:
#number of positive tweets
df[df['sentiment']==4].shape[0]

182

In [59]:
#number of neutral tweets
df[df['sentiment']==2].shape[0]

139

In [60]:
#number of negative tweets
df[df['sentiment']==0].shape[0]


177

In [63]:
pos_tweets = df.loc[df['sentiment']==4,'text']
neg_tweets = df.loc[df['sentiment']==0,'text']

We will now clean up our tweets. Our goal here is to reduce each tweet to a list of essential words. These words should not contain any symbols and they should not be stopwords. **The step of splitting text into a list of words is known as tokenization**. Luckily, in Python, we have the choice of many useful text processing libraries that can help us with such tasks as cleanup and tokenization. In this exercise, we will use **the NLTK library**. We will also download its list of English stopwords so that we do not have to define our own. There is even a special TweetTokenizer class, designed for tweets, that can lowercase the text, strip @handles and shorten exaggerated character repetitions while splitting a tweet into tokens. We need to add the following import commands to the top of our notebook.

In [65]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
import string
import re

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Utilisateur\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [66]:
stopwords_english = stopwords.words('english')
# Happy Emoticons
emoticons_happy = set([
    ':-)', ':)', ';)', ':o)', ':]', ':3', ':c)', ':>', '=]', '8)', '=)', ':}',
    ':^)', ':-D', ':D', '8-D', '8D', 'x-D', 'xD', 'X-D', 'XD', '=-D', '=D',
    '=-3', '=3', ':-))', ":'-)", ":')", ':*', ':^*', '>:P', ':-P', ':P', 'X-P',
    'x-p', 'xp', 'XP', ':-p', ':p', '=p', ':-b', ':b', '>:)', '>;)', '>:-)',
    '<3'
    ])

In [67]:
emoticons_sad = set([
    ':L', ':-/', '>:/', ':S', '>:[', ':@', ':-(', ':[', ':-||', '=L', ':<',
    ':-[', ':-<', '=\\', '=/', '>:(', ':(', '>.<', ":'-(", ":'(", ':\\', ':-c',
    ':c', ':{', '>:\\', ';('
    ])

In [68]:
emoticons = emoticons_happy.union(emoticons_sad)

In [71]:
def clean_tweets(tweet):

    # remove hyperlinks
    tweet = re.sub(r'https?:\/\/[^\s]+', '', tweet)

    # remove the '#' sign from hashtags, keeping the word itself
    tweet = re.sub(r'#', '', tweet)
    
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []    
    for word in tweet_tokens:
        if (word not in stopwords_english and # remove stopwords
            word not in emoticons and # remove emoticons
            word not in string.punctuation): # remove punctuation
            tweets_clean.append(word)   
    return tweets_clean

In [72]:
sample = pos_tweets.iloc[4]
sample

"@mikefish  Fair enough. But i have the Kindle2 and I think it's perfect  :)"

In [73]:
clean_tweets(sample)

['fair', 'enough', 'kindle', '2', 'think', 'perfect']

We have turned our original tweet into a list of meaningful words. The next step is to turn this list into a feature vector, by **counting the frequency of each word**. This is known in the natural language processing literature as the bag-of-words model, in which a piece of text is represented only by the words it contains and their frequencies, ignoring word order. Let's now define our feature extractor:

In [74]:
def bag_of_words(tweet):
    words = clean_tweets(tweet)
    words_dictionary = dict([word, True] for word in words)    
    return words_dictionary

This routine simply calls the previous cleaning routine on the tweet and creates a dictionary with a True value for each word that appears, so it records the presence of a word rather than its count. Here is a demonstration on our previous sample tweet:

In [75]:
bag_of_words(sample)

{'fair': True,
 'enough': True,
 'kindle': True,
 '2': True,
 'think': True,
 'perfect': True}
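To see how a presence dictionary differs from a true frequency count, compare the two on a token list with a repeated word. The list tokens is invented for this sketch, and collections.Counter is used here as one possible way to count; it is not what the notebook's bag_of_words function does.

```python
from collections import Counter

tokens = ['love', 'kindle', 'love', 'perfect']

# Presence features, as produced by bag_of_words: each word maps to True
presence = {word: True for word in tokens}

# Frequency features: each word maps to the number of times it occurs
counts = Counter(tokens)
```

For short texts such as tweets, words rarely repeat, so the two representations carry almost the same information.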

In [76]:
# positive tweets feature set
pos_tweets_set = []
for tweet in pos_tweets:
    pos_tweets_set.append((bag_of_words(tweet), 'pos'))    

#negative tweets feature set
neg_tweets_set = []
for tweet in neg_tweets:
    neg_tweets_set.append((bag_of_words(tweet), 'neg'))

tweets = pos_tweets_set + neg_tweets_set  

#### The Naive Bayes classifier
The Naive Bayes classifier is a probabilistic classifier based on Bayes' theorem. Given a feature vector $x = (x_1, \cdots, x_n)$, the algorithm computes the conditional probability $P(C_k \mid x)$ for each possible class $C_k$. The probability is computed using Bayes' theorem as follows:

$$P(C_k \mid x) = \frac{P(C_k)\,P(x \mid C_k)}{P(x)}$$

It then assigns the tweet to the class achieving the highest conditional probability. Since the denominator is constant for each possible class, the problem is reduced to computing the numerator, which is equal to the joint probability

$$P(C_k)\,P(x \mid C_k) = P(C_k, x_1, \cdots, x_n)$$

Using the chain rule of probability, this can be rewritten as


$$P(C_k, x_1, \cdots, x_n) = P(x_1 \mid x_2, \cdots, x_n, C_k) \cdots P(x_{n-1} \mid x_n, C_k)\,P(x_n \mid C_k)\,P(C_k)$$

The algorithm now makes the simplifying assumption that each feature $x_i$ is independent of any other feature $x_j$ conditional on the class $C_k$. This is where the name “Naive” comes from. This means that the probability above can be simply computed as
 
$$P(C_k, x_1, \cdots, x_n) = P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)$$

These probabilities are now computed using the training data, which consists of a set of prelabelled tweets. The prior probability $P(C_k)$ of each class is computed from the frequency of that class in the training data. The probability $P(x_i \mid C_k)$ is computed from how frequently the word $x_i$ occurs in tweets labeled as class $C_k$ in the training data. Finally, each unlabelled tweet is assigned the class with the highest probability.
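The computation described above can be sketched in a few lines of plain Python. This is a toy illustration and not NLTK's implementation: the training examples are invented, only the words present in a tweet are multiplied in, and add-one smoothing is an assumption made here so that an unseen word does not force the product to zero.

```python
from collections import Counter

# Invented, prelabelled training examples: (list of words, class)
train = [
    (['love', 'kindle'], 'pos'),
    (['love', 'best'], 'pos'),
    (['hate', 'phone'], 'neg'),
    (['hate', 'fail'], 'neg'),
]

# Number of tweets per class, used for the prior P(C_k)
labels = Counter(label for _, label in train)

# Number of tweets of each class that contain a given word
word_counts = {label: Counter() for label in labels}
for words, label in train:
    word_counts[label].update(set(words))

vocabulary = {word for words, _ in train for word in words}

def joint_probability(words, label):
    """P(C_k) times the product of P(x_i | C_k), with add-one smoothing."""
    prob = labels[label] / len(train)
    for word in words:
        if word in vocabulary:  # words never seen in training are ignored
            prob *= (word_counts[label][word] + 1) / (labels[label] + 2)
    return prob

def classify(words):
    """Return the class with the highest joint probability."""
    return max(labels, key=lambda label: joint_probability(words, label))
```

For example, classify(['love', 'phone']) compares 0.5 * 3/4 * 1/4 for the positive class with 0.5 * 1/4 * 2/4 for the negative class and therefore picks the positive class.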

#### Implementation
We will now use our feature sets to train the Naive Bayes classifier from the NLTK library. First, we randomly split our data into a test set of about 20% and a training set of about 80%.

In [90]:
from random import shuffle 
shuffle(pos_tweets_set)
shuffle(neg_tweets_set)

test_set = pos_tweets_set[:36] + neg_tweets_set[:36]
train_set = pos_tweets_set[36:] + neg_tweets_set[36:]
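Because shuffle is random, the split, and with it the accuracy reported later, changes from run to run. If reproducible results are wanted, one option is to shuffle with a seeded random generator. The helper split_examples, its seed parameter and the toy data are assumptions made for this sketch.

```python
import random

def split_examples(examples, n_test, seed=0):
    """Shuffle a copy of the examples with a fixed seed, then cut off a test set."""
    rng = random.Random(seed)
    shuffled = examples[:]      # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    return shuffled[:n_test], shuffled[n_test:]

# Toy feature sets in the same (features, label) shape as pos_tweets_set
toy = [({'word%d' % i: True}, 'pos') for i in range(10)]

test_part, train_part = split_examples(toy, 2)
```

Calling the function again with the same seed returns the same split, which makes it easier to compare changes to the cleaning or feature extraction steps.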

We now feed the training set directly to the Naive Bayes classifier. In order to use this classifier we need to add the following import statements at the top of our notebook:

In [91]:
from nltk import classify
from nltk import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_set)

In [92]:
accuracy = classify.accuracy(classifier, test_set)
accuracy

0.8333333333333334

In [93]:
classifier.show_most_informative_features(10) 

Most Informative Features
                       2 = True              pos : neg    =     11.9 : 1.0
                    hate = True              neg : pos    =      9.3 : 1.0
                    time = True              neg : pos    =      9.0 : 1.0
                  kindle = True              pos : neg    =      8.0 : 1.0
                    love = True              pos : neg    =      6.8 : 1.0
                    best = True              pos : neg    =      5.5 : 1.0
                   still = True              neg : pos    =      5.2 : 1.0
                    want = True              pos : neg    =      4.8 : 1.0
                   phone = True              neg : pos    =      3.8 : 1.0
                    fail = True              neg : pos    =      3.8 : 1.0


How should we interpret this chart? Each row compares how often a word occurs in the two classes. For example, the row for love means that the word love is 6.8 times more likely to appear in a positive tweet than in a negative one. On the other hand, the word hate is 9.3 times more likely to appear in a negative tweet than in a positive one. Because the data are shuffled randomly before the split, the exact words and ratios will vary from run to run.

Another useful property to look at is what is called the confusion matrix. This tells us exactly how many positive and negative tweets we are labeling correctly and incorrectly. Here is how to obtain it:

In [94]:
from collections import defaultdict
from nltk.metrics import ConfusionMatrix

actual_set = defaultdict(set)
predicted_set = defaultdict(set)

actual_set_cm = []
predicted_set_cm = []

for index, (feature, actual_label) in enumerate(test_set):
    actual_set[actual_label].add(index)
    actual_set_cm.append(actual_label)

    predicted_label = classifier.classify(feature)

    predicted_set[predicted_label].add(index)
    predicted_set_cm.append(predicted_label)

print(ConfusionMatrix(actual_set_cm, predicted_set_cm)) 

    |  n  p |
    |  e  o |
    |  g  s |
----+-------+
neg |<30> 6 |
pos |  6<30>|
----+-------+
(row = reference; col = test)
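Several common metrics can be read off the confusion matrix. The sketch below copies the four counts from the output above by hand and treats pos as the positive class; the variable names are chosen for this example.

```python
# Counts copied from the confusion matrix (rows = actual, columns = predicted)
true_neg, false_pos = 30, 6    # actual neg: predicted neg, predicted pos
false_neg, true_pos = 6, 30    # actual pos: predicted neg, predicted pos

total = true_neg + false_pos + false_neg + true_pos

# Share of all test tweets that were labelled correctly
accuracy = (true_pos + true_neg) / total

# Of the tweets predicted positive, the share that really are positive
precision = true_pos / (true_pos + false_pos)

# Of the tweets that really are positive, the share that was found
recall = true_pos / (true_pos + false_neg)
```

The accuracy computed this way is 60/72, which agrees with the value returned by classify.accuracy earlier.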



https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/