In [1]:
import nltk

In [2]:
# nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to C:\Users\Sabam-
[nltk_data]     Mr\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In [3]:
from nltk.corpus import twitter_samples

This imports three datasets from NLTK that contain tweets for training and testing the model:

- negative_tweets.json: 5,000 tweets with negative sentiment
- positive_tweets.json: 5,000 tweets with positive sentiment
- tweets.20150430-223406.json: 20,000 tweets with no sentiment labels

In [4]:
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
text = twitter_samples.strings('tweets.20150430-223406.json')

In [5]:
twitter_samples

<TwitterCorpusReader in 'C:\\Users\\Sabam-Mr\\AppData\\Roaming\\nltk_data\\corpora\\twitter_samples'>

Before using a tokenizer in NLTK, you need to download an additional resource, punkt. The punkt module is a pre-trained model that helps you tokenize words and sentences. For instance, the model knows that a name may contain a period (like “S. Daityari”) and that such a period does not necessarily end the sentence.
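To see why a trained sentence tokenizer is useful, here is a minimal sketch of the naive alternative, using only the standard library. The sample sentence and the splitting rule are assumptions made for illustration; this is not how punkt works internally.

```python
import re

# Hypothetical sample text containing an abbreviated name.
text = "S. Daityari wrote this tutorial. It uses NLTK."

# Naive rule: a sentence ends at every period that is followed by whitespace.
naive_sentences = re.split(r'(?<=\.)\s+', text)

# The initial "S." is wrongly treated as a sentence of its own.
print(naive_sentences)
```

The naive rule breaks the text after the initial "S.", which is the kind of mistake a pre-trained model is meant to avoid.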

In [6]:
# nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\Sabam-
[nltk_data]     Mr\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [7]:
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')

In [8]:
tweet_tokens

[['#FollowFriday',
  '@France_Inte',
  '@PKuchly57',
  '@Milipol_Paris',
  'for',
  'being',
  'top',
  'engaged',
  'members',
  'in',
  'my',
  'community',
  'this',
  'week',
  ':)'],
 ['@Lamb2ja',
  'Hey',
  'James',
  '!',
  'How',
  'odd',
  ':/',
  'Please',
  'call',
  'our',
  'Contact',
  'Centre',
  'on',
  '02392441234',
  'and',
  'we',
  'will',
  'be',
  'able',
  'to',
  'assist',
  'you',
  ':)',
  'Many',
  'thanks',
  '!'],
 ['@DespiteOfficial',
  'we',
  'had',
  'a',
  'listen',
  'last',
  'night',
  ':)',
  'As',
  'You',
  'Bleed',
  'is',
  'an',
  'amazing',
  'track',
  '.',
  'When',
  'are',
  'you',
  'in',
  'Scotland',
  '?',
  '!'],
 ['@97sides', 'CONGRATS', ':)'],
 ['yeaaaah',
  'yippppy',
  '!',
  '!',
  '!',
  'my',
  'accnt',
  'verified',
  'rqst',
  'has',
  'succeed',
  'got',
  'a',
  'blue',
  'tick',
  'mark',
  'on',
  'my',
  'fb',
  'profile',
  ':)',
  'in',
  '15',
  'days'],
 ['@BhaktisBanter',
  '@PallaviRuhail',
  'This',
  'one',
  '

In [9]:
tweet1_tokens = twitter_samples.tokenized('positive_tweets.json')[0]
print(tweet1_tokens[0])

#FollowFriday


In [10]:
tweet1_tokens

['#FollowFriday',
 '@France_Inte',
 '@PKuchly57',
 '@Milipol_Paris',
 'for',
 'being',
 'top',
 'engaged',
 'members',
 'in',
 'my',
 'community',
 'this',
 'week',
 ':)']

## Normalizing the Data
Words have different forms—for instance, “ran”, “runs”, and “running” are various forms of the same verb, “run”. Depending on the requirement of your analysis, all of these versions may need to be converted to the same form, “run”. Normalization in NLP is the process of converting a word to its canonical form.

Normalization helps group together words with the same meaning but different forms. Without normalization, “ran”, “runs”, and “running” would be treated as different words, even though you may want them to be treated as the same word. In this section, you explore stemming and lemmatization, which are two popular techniques of normalization.

Stemming is a heuristic process that removes affixes from the ends of words; it handles only simple verb forms.

In this tutorial you will use lemmatization, which normalizes a word using vocabulary and morphological analysis of the words in the text. The lemmatization algorithm analyzes the structure of the word and its context to convert it to a normalized form, so it is slower than stemming. Choosing between stemming and lemmatization ultimately comes down to a trade-off between speed and accuracy.
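The difference between the two approaches can be sketched with a toy example. Both functions below are hypothetical simplifications written for this illustration: `toy_stem` is a crude suffix stripper (not NLTK's stemmer), and `toy_lemmatize` is a three-entry lookup table standing in for a real vocabulary such as WordNet.

```python
# Suffixes the toy stemmer strips, checked in this order.
SUFFIXES = ("ing", "ed", "s")

# Tiny hand-written vocabulary standing in for a real lexical database.
LEMMAS = {"ran": "run", "runs": "run", "running": "run"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the vocabulary; fall back to the word itself."""
    return LEMMAS.get(word, word)


words = ["ran", "runs", "running"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)
print(lemmas)
```

The suffix stripper is fast but leaves "ran" untouched and turns "running" into "runn", while the vocabulary lookup maps all three forms to "run".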

In [11]:
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to C:\Users\Sabam-
[nltk_data]     Mr\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Sabam-Mr\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

wordnet is a lexical database for the English language that helps the script determine the base word. You need the averaged_perceptron_tagger resource to determine the context of a word in a sentence.

In [12]:
from nltk.tag import pos_tag

In [13]:
print(pos_tag(tweet_tokens[0]))

[('#FollowFriday', 'JJ'), ('@France_Inte', 'NNP'), ('@PKuchly57', 'NNP'), ('@Milipol_Paris', 'NNP'), ('for', 'IN'), ('being', 'VBG'), ('top', 'JJ'), ('engaged', 'VBN'), ('members', 'NNS'), ('in', 'IN'), ('my', 'PRP$'), ('community', 'NN'), ('this', 'DT'), ('week', 'NN'), (':)', 'NN')]


POS tag descriptions:
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [14]:
from nltk.stem.wordnet import WordNetLemmatizer

In [15]:
def lemmatize_sentence(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentence = []
    for word, tag in pos_tag(tokens):
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
    return lemmatized_sentence

print(lemmatize_sentence(tweet_tokens[0]))

['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'be', 'top', 'engage', 'member', 'in', 'my', 'community', 'this', 'week', ':)']


You will notice that the verb being changes to its root form, be, and the noun members changes to member.
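The tag handling inside lemmatize_sentence() can be isolated as a small helper. The function name `penn_to_wordnet` is an assumption introduced here; the mapping itself mirrors the if/elif/else in the cell above, and the sample tags are copied from the pos_tag output shown earlier.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the single-letter POS used by the lemmatizer."""
    if tag.startswith('NN'):
        return 'n'  # nouns: NN, NNS, NNP, NNPS
    if tag.startswith('VB'):
        return 'v'  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    return 'a'      # everything else is treated as an adjective


# (word, tag) pairs taken from the pos_tag output above.
tagged = [('being', 'VBG'), ('members', 'NNS'), ('top', 'JJ'), ('for', 'IN')]
mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)
```

Note that the fallback sends prepositions, determiners, and adverbs to 'a' as well, which is a simplification the tutorial code makes.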

## Hint: Regex

In [16]:
import re
string = "at what time?"
match = re.findall('at',string)
print(match)

['at', 'at']


In [17]:
match = re.search('at',string)
if (match):
    print("String found at: ",match.start())
else:
    print("String not found!")

String found at:  0


In [18]:
match = re.split('a',string)
print(match)

['', 't wh', 't time?']


### re.sub()

The re.sub() function replaces every match of a pattern in a string with another sub-string.

This function takes the following arguments, in order:

    1. The pattern to search for
    2. The sub-string to replace each match with
    3. The string to search in


In [19]:
match = re.sub(r"\s", "!!!", string)  # raw string avoids an invalid-escape warning for \s
print(match)

at!!!what!!!time?


## Removing Noise from the Data
In this step, you will remove noise from the dataset. Noise is any part of the text that does not add meaning or information to data.

Noise is specific to each project, so what constitutes noise in one project may not be in a different project. For instance, the most common words in a language are called stop words. Some examples of stop words are “is”, “the”, and “a”. They are generally irrelevant when processing language, unless a specific use case warrants their inclusion.

In this tutorial, you will use regular expressions in Python to search for and remove these items:

- Hyperlinks - All hyperlinks in Twitter are converted to the URL shortener t.co. Therefore, keeping them in the text processing would not add any value to the analysis.
- Twitter handles in replies - These Twitter usernames are preceded by a @ symbol, which does not convey any meaning.
- Punctuation and special characters - While these often provide context to textual data, this context is often difficult to process. For simplicity, you will remove all punctuation and special characters from tweets.

To remove hyperlinks, you need to first search for a substring that matches a URL starting with http:// or https://, followed by letters, numbers, or special characters. Once a pattern is matched, the .sub() method replaces it with an empty string.
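As a minimal sketch of that substitution, the snippet below applies the URL pattern used in this notebook, plus a pattern for @ handles, to a few tokens taken from one of the sample tweets. The names `URL_PATTERN`, `HANDLE_PATTERN`, and `strip_links_and_handles` are assumptions introduced for this illustration.

```python
import re

# Same URL pattern as in the notebook, written as raw strings and compiled once.
URL_PATTERN = re.compile(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'
    r'(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)
# An @ symbol followed by letters, digits, or underscores.
HANDLE_PATTERN = re.compile(r'@[A-Za-z0-9_]+')


def strip_links_and_handles(token):
    """Replace any URL or @ handle inside the token with an empty string."""
    token = URL_PATTERN.sub('', token)
    return HANDLE_PATTERN.sub('', token)


tokens = ['Dang', '@AbzuGame', '#fanart', 'https://t.co/bI8k8tb9ht']
cleaned = [strip_links_and_handles(t) for t in tokens]
non_empty = [t for t in cleaned if t]
print(cleaned)
print(non_empty)
```

Tokens that consisted only of a URL or a handle become empty strings, so a later length check is needed to drop them.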

Since word forms are normalized within the remove_noise() function, you can comment out the lemmatize_sentence() function from the script.

In [20]:
# tok = pos_tag(tweet_tokens)
# wow = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'\
#              '(?:%[0-9a-fA-F][0-9a-fA-F]))+','',token)

In [21]:
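import string  # note: rebinds the name `string` used as a variable in the regex hint cells

def remove_noise(tweet_tokens, stop_words=()):
    # Create the lemmatizer once instead of once per token.
    lemmatizer = WordNetLemmatizer()
    cleaned_tokens = []

    for token, tag in pos_tag(tweet_tokens):
        # Strip hyperlinks, then @ mentions.
        token = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'
                       r'(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', token)
        token = re.sub(r"(@[A-Za-z0-9_]+)", "", token)

        # Map the Penn Treebank tag to a WordNet part of speech.
        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        token = lemmatizer.lemmatize(token, pos)

        # Drop empty tokens, punctuation, and stop words; lowercase the rest.
        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())

    return cleaned_tokens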
                

This code creates a remove_noise() function that removes noise and incorporates the normalization and lemmatization mentioned in the previous section. The code takes two arguments: the tweet tokens and the tuple of stop words.

The code then uses a loop to remove the noise from the dataset. To remove hyperlinks, the code first searches for a substring that matches a URL starting with http:// or https://, followed by letters, numbers, or special characters. Once a pattern is matched, the .sub() method replaces it with an empty string, or ''.

Similarly, to remove @ mentions, the code substitutes the relevant part of text using regular expressions. The code uses the re library to search @ symbols, followed by numbers, letters, or _, and replaces them with an empty string.

Finally, you can remove punctuation using the library string.

In addition, you will remove stop words using NLTK's built-in stop word list, which must be downloaded separately with nltk.download('stopwords').
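The punctuation and stop word filter can be tried on its own, without NLTK. In this sketch, `STOP_WORDS` is a tiny hypothetical sample rather than NLTK's English list, and the tokens come from one of the sample tweets with the handle and link already removed.

```python
import string

# Hypothetical mini stop word list; NLTK's English list is much longer.
STOP_WORDS = {"is", "the", "a", "that", "some"}

tokens = ['Dang', 'that', 'is', 'some', 'rad', '#fanart', '!', ':D']

# Keep a token only if it is not punctuation and not a stop word, then lowercase it.
kept = [
    token.lower()
    for token in tokens
    if token not in string.punctuation and token.lower() not in STOP_WORDS
]
print(kept)
```

Because string.punctuation is a plain string, `token not in string.punctuation` is a substring test. Single characters such as '!' are removed, but multi-character tokens such as ':D' or '...' are kept because they do not occur as a substring of it.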

In [22]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [23]:
tweet_tokens[0]

['#FollowFriday',
 '@France_Inte',
 '@PKuchly57',
 '@Milipol_Paris',
 'for',
 'being',
 'top',
 'engaged',
 'members',
 'in',
 'my',
 'community',
 'this',
 'week',
 ':)']

In [24]:
print(remove_noise(tweet_tokens[0], stop_words))

['#followfriday', 'top', 'engage', 'member', 'community', 'week', ':)']


Notice that the function removes all @ mentions and stop words and converts the words to lowercase.

Use the remove_noise() function to clean the positive and negative tweets.

In [25]:
positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')

positive_cleaned_tokens_list = []
negative_cleaned_tokens_list = []

for tokens in positive_tweet_tokens:
    positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))
    

for tokens in negative_tweet_tokens:
    negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

In [26]:
print(positive_tweet_tokens[500])
print(positive_cleaned_tokens_list[500])

['Dang', 'that', 'is', 'some', 'rad', '@AbzuGame', '#fanart', '!', ':D', 'https://t.co/bI8k8tb9ht']
['dang', 'rad', '#fanart', ':d']


## Determining Word Density

In this step you removed noise from the data to make the analysis more effective. In the next step you will analyze the data to find the most common words in your sample dataset.

The most basic form of analysis on textual data is counting word frequency. A single tweet is too small to show a meaningful distribution of words, so the frequency analysis is done over all positive tweets.

The following snippet defines a generator function named get_all_words that takes a list of cleaned tweet token lists as an argument and yields every word from every tweet, one at a time.

In [27]:
def get_all_words(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token

In [28]:
all_pos_words = get_all_words(positive_cleaned_tokens_list)

In [29]:
all_pos_words

<generator object get_all_words at 0x00000281CCB061C8>
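A generator object like the one above produces its values lazily and can be iterated only once. The sketch below repeats the get_all_words definition on a small hypothetical sample so that it runs on its own.

```python
def get_all_words(cleaned_tokens_list):
    """Yield every token from every tweet, one at a time."""
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token


# Hypothetical sample standing in for positive_cleaned_tokens_list.
sample = [['top', 'week', ':)'], ['congrats', ':)']]

words = get_all_words(sample)
first_pass = list(words)   # consumes the generator
second_pass = list(words)  # nothing is left to yield
print(first_pass)
print(second_pass)
```

If the words are needed more than once, either call get_all_words() again or store the result of list() in a variable.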

In [30]:
for i in all_pos_words:
    print(i)

#followfriday
top
engage
member
community
week
:)
hey
james
odd
:/
please
call
contact
centre
02392441234
able
assist
:)
many
thanks
[output truncated]

In [31]:
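from nltk import FreqDist

freq_dist_pos = FreqDist(all_pos_words)
print(freq_dist_pos)

<FreqDist with 0 samples and 0 outcomes>

The distribution is empty because the loop in the previous cell already consumed the generator all_pos_words; a generator can be iterated only once. Call get_all_words(positive_cleaned_tokens_list) again to get a fresh generator before passing it to FreqDist.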
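A frequency count needs an unconsumed generator. The sketch below uses collections.Counter from the standard library as a stand-in for FreqDist (which, to the best of my knowledge, subclasses Counter) and a small hypothetical sample in place of the cleaned tweets.

```python
from collections import Counter


def get_all_words(cleaned_tokens_list):
    """Yield every token from every tweet, one at a time."""
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token


# Hypothetical sample standing in for positive_cleaned_tokens_list.
sample = [['top', 'week', ':)'], ['congrats', ':)'], ['love', 'week', ':)']]

# Build the generator right where it is consumed so it is always fresh.
freq = Counter(get_all_words(sample))
top_two = freq.most_common(2)
print(top_two)
```

With the real data, the same pattern would be FreqDist(get_all_words(positive_cleaned_tokens_list)) followed by most_common(10).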


In [32]:
def get_tweets_for_model(cleaned_tokens_list):
    for tweet_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tweet_tokens)

positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)

In [33]:
positive_tokens_for_model

<generator object get_tweets_for_model at 0x00000281CCB06648>
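Each tweet is turned into a dictionary whose keys are its tokens and whose values are all True. A minimal sketch of that conversion, using a dict comprehension and a hypothetical helper name `tokens_to_features`:

```python
def tokens_to_features(tokens):
    """Return a presence-only feature dict: every token maps to True."""
    return {token: True for token in tokens}


features = tokens_to_features(['love', 'love', 'week', ':)'])
print(features)
```

Repeated tokens collapse into a single key, so these features record whether a word is present, not how often it occurs.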

### Splitting the Dataset for Training and Testing the Model

In [34]:
import random

positive_dataset = [(tweet_dict, "Positive")
                   for tweet_dict in positive_tokens_for_model]
negative_dataset = [(tweet_dict, "Negative")
                   for tweet_dict in negative_tokens_for_model]

dataset = positive_dataset + negative_dataset

In [36]:
positive_dataset

[({'#followfriday': True,
   'top': True,
   'engage': True,
   'member': True,
   'community': True,
   'week': True,
   ':)': True},
  'Positive'),
 ({'hey': True,
   'james': True,
   'odd': True,
   ':/': True,
   'please': True,
   'call': True,
   'contact': True,
   'centre': True,
   '02392441234': True,
   'able': True,
   'assist': True,
   ':)': True,
   'many': True,
   'thanks': True},
  'Positive'),
 ({'listen': True,
   'last': True,
   'night': True,
   ':)': True,
   'bleed': True,
   'amazing': True,
   'track': True,
   'scotland': True},
  'Positive'),
 ({'congrats': True, ':)': True}, 'Positive'),
 ({'yeaaaah': True,
   'yippppy': True,
   'accnt': True,
   'verify': True,
   'rqst': True,
   'succeed': True,
   'get': True,
   'blue': True,
   'tick': True,
   'mark': True,
   'fb': True,
   'profile': True,
   ':)': True,
   '15': True,
   'day': True},
  'Positive'),
 ({'one': True,
   'irresistible': True,
   ':)': True,
   '#flipkartfashionfriday': True},
  'P

In [37]:
dataset

[({'#followfriday': True,
   'top': True,
   'engage': True,
   'member': True,
   'community': True,
   'week': True,
   ':)': True},
  'Positive'),
 ({'hey': True,
   'james': True,
   'odd': True,
   ':/': True,
   'please': True,
   'call': True,
   'contact': True,
   'centre': True,
   '02392441234': True,
   'able': True,
   'assist': True,
   ':)': True,
   'many': True,
   'thanks': True},
  'Positive'),
 ({'listen': True,
   'last': True,
   'night': True,
   ':)': True,
   'bleed': True,
   'amazing': True,
   'track': True,
   'scotland': True},
  'Positive'),
 ({'congrats': True, ':)': True}, 'Positive'),
 ({'yeaaaah': True,
   'yippppy': True,
   'accnt': True,
   'verify': True,
   'rqst': True,
   'succeed': True,
   'get': True,
   'blue': True,
   'tick': True,
   'mark': True,
   'fb': True,
   'profile': True,
   ':)': True,
   '15': True,
   'day': True},
  'Positive'),
 ({'one': True,
   'irresistible': True,
   ':)': True,
   '#flipkartfashionfriday': True},
  'P

In [38]:
random.shuffle(dataset)

In [39]:
train_data = dataset[:7000]
test_data = dataset[7000:]

In [43]:
len(train_data)

7000
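
Shuffling before slicing matters here because `dataset` is built as all positive examples followed by all negative ones; without it, the slice at 7000 would leave only negative tweets in the test set. The sketch below repeats the shuffle-then-slice pattern on hypothetical toy data. The seeded `random.Random(42)` is an assumption added for repeatability and is not used in the notebook.

```python
import random

# Hypothetical stand-in for the real dataset: 5 positive, then 5 negative examples.
toy_dataset = (
    [({"w%d" % i: True}, "Positive") for i in range(5)]
    + [({"w%d" % i: True}, "Negative") for i in range(5, 10)]
)

rng = random.Random(42)   # seeded generator (assumption) so the split is repeatable
rng.shuffle(toy_dataset)  # mix the labels before slicing

split = len(toy_dataset) * 7 // 10  # 70% for training, the rest for testing
toy_train, toy_test = toy_dataset[:split], toy_dataset[split:]

print(len(toy_train), len(toy_test))
```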

## Building and Testing the Model

In [45]:
from nltk import classify
from nltk import NaiveBayesClassifier

# Train on the 7,000 shuffled examples, then score on the held-out remainder.
classifier = NaiveBayesClassifier.train(train_data)

print("Accuracy is:", classify.accuracy(classifier, test_data))

# show_most_informative_features prints its own table and returns None,
# so it should not be wrapped in print().
classifier.show_most_informative_features(10)

Accuracy is: 0.9966666666666667
Most Informative Features
                      :) = True           Positi : Negati =   1008.1 : 1.0
                follower = True           Positi : Negati =     37.4 : 1.0
                     sad = True           Negati : Positi =     37.0 : 1.0
                    glad = True           Positi : Negati =     23.2 : 1.0
                     bam = True           Positi : Negati =     21.9 : 1.0
                  arrive = True           Positi : Negati =     18.0 : 1.0
                    blog = True           Positi : Negati =     14.5 : 1.0
               goodnight = True           Positi : Negati =     13.1 : 1.0
                  ignore = True           Negati : Positi =     10.9 : 1.0
                    huhu = True           Negati : Positi =     10.2 : 1.0


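Accuracy is the fraction of tweets in the test set for which the model predicted the sentiment correctly. The model scores about 99.7% on the test set, which is very high. The most informative features show why: the `:)` emoticon alone is roughly 1,008 times more likely to appear in a positive tweet than in a negative one, so much of this accuracy comes from emoticons.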
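
To make the definition concrete, accuracy can be computed by hand as the number of correct predictions divided by the number of test examples. The sketch below uses a hypothetical rule-based stand-in (`toy_classify`) instead of the trained classifier, and made-up test examples:

```python
def toy_classify(features):
    # Hypothetical stand-in for classifier.classify: positive if ':)' is present.
    return "Positive" if ":)" in features else "Negative"

# Made-up (features, label) pairs in the same shape as the notebook's test_data.
toy_test = [
    ({"great": True, ":)": True}, "Positive"),
    ({"sad": True, ":(": True}, "Negative"),
    ({"love": True}, "Positive"),   # no emoticon, so the toy rule gets this wrong
    ({"ignore": True}, "Negative"),
]

correct = sum(1 for features, label in toy_test if toy_classify(features) == label)
accuracy = correct / len(toy_test)
print(accuracy)  # 0.75
```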

Next, check how the model performs on custom tweets that are not part of the dataset:

In [46]:
from nltk.tokenize import word_tokenize
custom_tweet = "I ordered just once from TerribleCo, they screwed up, never used the app again."

custom_tokens = remove_noise(word_tokenize(custom_tweet))

print(classifier.classify(dict([token, True] for token in custom_tokens)))

Negative


## Hint: nltk.tokenize

In [47]:
s = '''Good muffins cost $3.88\nin New York. Please buy me two of them.\n\nThanks.'''
word_tokenize(s)

['Good',
 'muffins',
 'cost',
 '$',
 '3.88',
 'in',
 'New',
 'York',
 '.',
 'Please',
 'buy',
 'me',
 'two',
 'of',
 'them',
 '.',
 'Thanks',
 '.']

In [48]:
from nltk.tokenize import wordpunct_tokenize
wordpunct_tokenize(s)

['Good',
 'muffins',
 'cost',
 '$',
 '3',
 '.',
 '88',
 'in',
 'New',
 'York',
 '.',
 'Please',
 'buy',
 'me',
 'two',
 'of',
 'them',
 '.',
 'Thanks',
 '.']
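
The two outputs differ at `3.88`: `word_tokenize` keeps it as one token, while `wordpunct_tokenize` splits at every run of punctuation. NLTK documents `WordPunctTokenizer` as using the regular expression `\w+|[^\w\s]+`, so the same split can be reproduced with the standard `re` module. This is a sketch for illustration, not a replacement for the NLTK call:

```python
import re

s = '''Good muffins cost $3.88\nin New York. Please buy me two of them.\n\nThanks.'''

# Match runs of word characters, or runs of characters that are
# neither word characters nor whitespace (i.e. punctuation).
tokens = re.findall(r"\w+|[^\w\s]+", s)
print(tokens[:7])  # ['Good', 'muffins', 'cost', '$', '3', '.', '88']
```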

We can also operate at the level of sentences, using the sentence
tokenizer directly as follows:

In [50]:
from nltk.tokenize import sent_tokenize
print(sent_tokenize(s))
[word_tokenize(t) for t in sent_tokenize(s)]

['Good muffins cost $3.88\nin New York.', 'Please buy me two of them.', 'Thanks.']


[['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.'],
 ['Please', 'buy', 'me', 'two', 'of', 'them', '.'],
 ['Thanks', '.']]

NLTK tokenizers can produce token-spans, represented as tuples of integers
having the same semantics as string slices, to support efficient comparison
of tokenizers.  (These methods are implemented as generators.)

In [51]:
from nltk.tokenize import WhitespaceTokenizer
list(WhitespaceTokenizer().span_tokenize(s))

[(0, 4),
 (5, 12),
 (13, 17),
 (18, 23),
 (24, 26),
 (27, 30),
 (31, 36),
 (37, 43),
 (44, 47),
 (48, 50),
 (51, 54),
 (55, 57),
 (58, 63),
 (65, 72)]
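
Each `(start, end)` pair can be passed straight to a slice to recover the token text. As a sketch, the same whitespace spans can be derived with `re.finditer` from the standard library; this mirrors the output above but is not NLTK's implementation:

```python
import re

s = '''Good muffins cost $3.88\nin New York. Please buy me two of them.\n\nThanks.'''

# Each match of one-or-more non-whitespace characters yields a (start, end) span.
spans = [m.span() for m in re.finditer(r"\S+", s)]

# Spans follow slice semantics, so s[start:end] is the token itself.
tokens = [s[start:end] for start, end in spans]

print(spans[:3])   # [(0, 4), (5, 12), (13, 17)]
print(tokens[:3])  # ['Good', 'muffins', 'cost']
```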

## Testing the Model on New Data

In [52]:
custom_tweet2 = 'Congrats #SportStar on your 7th best goal from last season winning goal of the year :) #Baller #Topbin #oneofmanyworlddies'
custom_tokens2 = remove_noise(word_tokenize(custom_tweet2))

print(classifier.classify(dict([token, True] for token in custom_tokens2)))

Positive


In [53]:
custom_tweet3 = 'Thank you for sending my baggage to CityX and flying me to CityY at the same time. Brilliant service. #thanksGenericAirline'

custom_tokens3 = remove_noise(word_tokenize(custom_tweet3))

print(classifier.classify(dict([token, True] for token in custom_tokens3)))

Positive


In [55]:
dict([token, True] for token in custom_tokens3)

{'thank': True,
 'you': True,
 'for': True,
 'send': True,
 'my': True,
 'baggage': True,
 'to': True,
 'cityx': True,
 'and': True,
 'fly': True,
 'me': True,
 'cityy': True,
 'at': True,
 'the': True,
 'same': True,
 'time': True,
 'brilliant': True,
 'service': True,
 'thanksgenericairline': True}

The model classified this example as positive because the training data was not comprehensive enough for it to recognize sarcastic tweets as negative. If you want the model to detect sarcasm, you need to train it on a sufficient amount of data that includes sarcastic examples.

In this step, you built and tested the model. You also explored some of its limitations, such as failing to detect sarcasm in particular examples. Your completed code still has artifacts left over from following the tutorial, so the next step will guide you through aligning the code with Python's best practices.