To start practicing, make a copy of this notebook: go to File > Save a Copy in Drive, then work in the copy that opens in a new tab.


# AfterWork Data Science: Getting Started with NLP Project

### Prerequisites

In [1]:
# Importing the required libraries
# ---
# 
import pandas as pd # library for data manipulation
import numpy as np  # library for scientific computations
import re           # regex library to perform text preprocessing
import string       # library to work with strings
import nltk         # library for natural language processing
import scipy.sparse # scientific computing (sparse matrices, used to stack our features)

### 1. Importing our Data

In [35]:
# Question: Given new tweets, create a sentiment analysis model that will
# predict whether a tweet contains positive or negative sentiment.
# ---
# Dataset url = https://bit.ly/31kqByD 
# ---
#
df = pd.read_csv('https://bit.ly/31kqByD', encoding='latin-1')
df.head()

Unnamed: 0.1,Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,346508,0,2016177685,Wed Jun 03 06:18:50 PDT 2009,NO_QUERY,UriGrey,Obama forges his Muslim alliance against the c...
1,883537,4,1686152287,Sun May 03 04:02:08 PDT 2009,NO_QUERY,MariesolW,Had the most spectacular prom ever but now my...
2,764173,0,2298725623,Tue Jun 23 12:02:12 PDT 2009,NO_QUERY,ColleenBurns,I am overwhelmed today taking a moment to eat...
3,638701,0,2234530495,Thu Jun 18 23:13:54 PDT 2009,NO_QUERY,queenarchy,@lindork Tres sad. I was totally a Max fan. #...
4,664821,0,2244623416,Fri Jun 19 14:59:46 PDT 2009,NO_QUERY,reinventingjess,"Crap, I was counting down the hours until my d..."


### 2. Data Exploration

In [36]:
# We can determine the size of our dataset
# ---
#
df.shape

(10000, 7)

It seems this dataset needs some cleaning: the columns lack meaningful names, and several of them are not needed to create our model. We will rename the columns and drop the ones we don't need.

### 3. Data Preparation

#### Basic Data Cleaning Techniques

In [37]:
# We rename the columns for ease of referencing our columns later on
# ---
#
df.columns = ['id', 'target', 't_id', 'created_at', 'query', 'user', 'text']
df.head()

Unnamed: 0,id,target,t_id,created_at,query,user,text
0,346508,0,2016177685,Wed Jun 03 06:18:50 PDT 2009,NO_QUERY,UriGrey,Obama forges his Muslim alliance against the c...
1,883537,4,1686152287,Sun May 03 04:02:08 PDT 2009,NO_QUERY,MariesolW,Had the most spectacular prom ever but now my...
2,764173,0,2298725623,Tue Jun 23 12:02:12 PDT 2009,NO_QUERY,ColleenBurns,I am overwhelmed today taking a moment to eat...
3,638701,0,2234530495,Thu Jun 18 23:13:54 PDT 2009,NO_QUERY,queenarchy,@lindork Tres sad. I was totally a Max fan. #...
4,664821,0,2244623416,Fri Jun 19 14:59:46 PDT 2009,NO_QUERY,reinventingjess,"Crap, I was counting down the hours until my d..."


In [38]:
# We retain the relevant columns by dropping the columns we don't need 
# for creating a sentiment analysis model. 
# ---
#
df = df.drop(['id', 't_id', 'created_at', 'query', 'user'], axis = 1)
df.head()

Unnamed: 0,target,text
0,0,Obama forges his Muslim alliance against the c...
1,4,Had the most spectacular prom ever but now my...
2,0,I am overwhelmed today taking a moment to eat...
3,0,@lindork Tres sad. I was totally a Max fan. #...
4,0,"Crap, I was counting down the hours until my d..."


In [24]:
# Understanding the distribution of target
# ---
#
df.target.value_counts() 

0    5067
4    4933
Name: target, dtype: int64

In [9]:
# Let's determine whether our columns have the right data types
# ---
#
df.dtypes

target     int64
text      object
dtype: object

In [39]:
# What values are in our target variable?
# ---
#
df.target.unique()

array([0, 4])

These are the two classes to which each document (text) belongs. The target value 0 means a text with a negative sentiment, while that of 4 means a text with a positive sentiment. 
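
As a small illustration of this encoding, the sketch below decodes the raw target codes into readable names and into a conventional 0/1 label. `LABEL_NAMES` and `to_binary` are hypothetical helpers introduced here for illustration; they are not part of the notebook.

```python
# Hypothetical helpers (not part of the original notebook) that decode the
# target codes described above: 0 = negative, 4 = positive.
LABEL_NAMES = {0: "negative", 4: "positive"}

def to_binary(target):
    """Map the dataset's 0/4 coding to a conventional 0/1 label."""
    if target not in LABEL_NAMES:
        raise ValueError(f"unexpected target value: {target!r}")
    return 1 if target == 4 else 0

sample_targets = [0, 4, 0, 0, 0]  # targets of the first five rows shown earlier
decoded = [LABEL_NAMES[t] for t in sample_targets]
binary = [to_binary(t) for t in sample_targets]
print(decoded)
print(binary)
```

On the data frame itself, the same dictionary could be applied with `df['target'].map(LABEL_NAMES)`.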

In [11]:
# Let's check for missing values 
# ---
# 
df.isnull().sum()

target    0
text      0
dtype: int64

We don't have any missing values, so we are good to go.

#### Text Processing

In [40]:
# Text Cleaning: Removing all urls/links
# ---
# 
df['text'] =  df['text'].apply(lambda x: re.sub(r'http\S+|www\S+|https\S+','', str(x)))
df[['text']].head()

Unnamed: 0,text
0,Obama forges his Muslim alliance against the c...
1,Had the most spectacular prom ever but now my...
2,I am overwhelmed today taking a moment to eat...
3,@lindork Tres sad. I was totally a Max fan. #...
4,"Crap, I was counting down the hours until my d..."


In [41]:
# Text Cleaning: Removing @ and # characters or replace them with space
# ---
# YOUR CODE GOES BELOW
#
df['text'] = df.text.str.replace('#',' ')
df['text'] = df.text.str.replace('@',' ')
df[['text']].head()


Unnamed: 0,text
0,Obama forges his Muslim alliance against the c...
1,Had the most spectacular prom ever but now my...
2,I am overwhelmed today taking a moment to eat...
3,lindork Tres sad. I was totally a Max fan. ...
4,"Crap, I was counting down the hours until my d..."


In [42]:
# Text Cleaning: Conversion to lowercase
# ---
# YOUR CODE GOES BELOW
#
df['text'] = df.text.apply(lambda x: " ".join(word.lower() for word in x.split()))

df[['text']].head()

Unnamed: 0,text
0,obama forges his muslim alliance against the c...
1,had the most spectacular prom ever but now my ...
2,i am overwhelmed today taking a moment to eat ...
3,lindork tres sad. i was totally a max fan. sytycd
4,"crap, i was counting down the hours until my d..."


In [43]:
# Text Cleaning: Splitting concatenated words
# ---
# Performing this step will take a few minutes...
# ---
# YOUR CODE GOES BELOW
# 

# Installing wordninja and textblob
# ---
#

!pip3 install wordninja
!pip3 install textblob

# Importing those libraries
# ---
#
import wordninja 
from textblob import TextBlob

Collecting wordninja
  Downloading wordninja-2.0.0.tar.gz (541 kB)
[?25l[K     |▋                               | 10 kB 22.7 MB/s eta 0:00:01[K     |█▏                              | 20 kB 9.2 MB/s eta 0:00:01[K     |█▉                              | 30 kB 7.8 MB/s eta 0:00:01[K     |██▍                             | 40 kB 7.0 MB/s eta 0:00:01[K     |███                             | 51 kB 4.2 MB/s eta 0:00:01[K     |███▋                            | 61 kB 4.3 MB/s eta 0:00:01[K     |████▎                           | 71 kB 4.3 MB/s eta 0:00:01[K     |████▉                           | 81 kB 4.8 MB/s eta 0:00:01[K     |█████▌                          | 92 kB 3.7 MB/s eta 0:00:01[K     |██████                          | 102 kB 4.1 MB/s eta 0:00:01[K     |██████▋                         | 112 kB 4.1 MB/s eta 0:00:01[K     |███████▎                        | 122 kB 4.1 MB/s eta 0:00:01[K     |███████▉                        | 133 kB 4.1 MB/s eta 0:00:01[K     |██

In [45]:
# Performing the split
# ---
#
df['text'] = df.text.apply(lambda x: wordninja.split(str(TextBlob(x))))  
df['text'] = df.text.str.join(' ')
df[['text']].sample(10) 

Unnamed: 0,text
4899,tow mee omg what did i do to deserve this mis ...
5764,bleu u and even worse my parents and sister ar...
36,i hope i get all this homework done
2195,myspace is offering its mumbai amp delhi users...
9767,im off to bed im tweet buddies ill be on later...
7203,she s electric i was gona have a go later on o...
3534,i'm so sick and tired of being sick and tired ...
9241,i woke up to i dream a dream blaring and for s...
5664,thursday but i still have 2 more days to work
9917,miley cyrus can you and mandy pl ll eeee a a a...


In [47]:
# Text Cleaning: Removing punctuation characters
# ---
# YOUR CODE GOES BELOW
#
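# regex=True is required in pandas 2.0+, where str.replace treats the pattern literally by default
df['text'] = df.text.str.replace(r'[^\w\s]', '', regex=True)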
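
The character class `[^\w\s]` can be tried out on a single string with the standard `re` module. This is a minimal sketch on a made-up sentence; `PUNCT_PATTERN` is a name introduced here for illustration.

```python
import re

# `\w` matches letters, digits and the underscore; `\s` matches whitespace.
# `[^\w\s]` therefore matches everything else, i.e. punctuation and symbols.
PUNCT_PATTERN = re.compile(r'[^\w\s]')

sample = "crap, i'm so_tired... :("   # made-up example text
stripped = PUNCT_PATTERN.sub('', sample)
print(repr(stripped))
```

Note that the underscore survives because it counts as a word character, and that apostrophes are removed, so contractions such as "i'm" become "im".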

In [50]:
# Text Cleaning: Removing stop words
# ---
# YOUR CODE GOES BELOW
# 
nltk.download('stopwords')
from nltk.corpus import stopwords

# A set gives fast membership tests; `stop` is reused in the next cell
stop = set(stopwords.words('english'))
df['text'] = df.text.apply(lambda x: " ".join(word for word in x.split() if word not in stop))
df[['text']].sample(5)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,text
2949,v ikki chaos ugh many bad effects right
1659,offending noise nik
2607,shut cops
3165,matt g good morning
3956,last night fantastic love friends shopping aunt


In [52]:
# Sanity check: count the stop words left in each tweet (0 means none remain)
df['no_of_stopwords'] = df.text.apply(lambda x: len([word for word in x.split() if word in stop]))
df[['text','no_of_stopwords']].sample(5)

Unnamed: 0,text,no_of_stopwords
3989,chris clarke take back mm mm got possums flowe...,0
6491,meat stack,0
614,tired day im feel sleepy want eat err rrr,0
6270,teff 95 youre welcome sweetie,0
8648,stars mile e omg thats sad love babies happen sad,0


In [57]:
# Text Cleaning: Lemmatization
# ---
# YOUR CODE GOES BELOW
#

# For lemmatization, we will need to download wordnet
#
nltk.download('wordnet')
from textblob import Word

# Lemmatizing our text
# ---
#
df['text'] = df.text.apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()])) 
df[['text']].sample(10)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,text
3108,james buckley believe digging stubble u gettin...
8737,beck eeee h close p robs already need somethin...
1397,ok really need go workout tonight officially r...
7505,end 1 excited plan tonight cant believe leavin...
4646,imran anwar nice work going compare clip lol b...
2039,came home mall looking dress best friend turn ...
7631,hell z yea yeah darling im struggling juggle e...
4070,global final today wish luck
3934,lost 9 75 big loss wood still nice day weather...
8959,miss girlfriend close


We won't remove numerics because the text could lose some of its meaning without them. We could also prepare our text further by performing spelling correction, but this is a resource-intensive process that we will skip for now.

#### Feature Engineering Techniques 

In [58]:
# Feature Construction: Length of tweet
# ---
# YOUR CODE GOES BELOW
#
df['length_of_tweet'] = df.text.str.len()

In [59]:
# Feature Construction: Word count 
# ---
# YOUR CODE GOES BELOW
# 
df['word_count'] = df.text.apply(lambda x: len(str(x).split(" ")))
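
One detail of the word count is worth checking: `split(" ")` and `split()` disagree on empty strings and on repeated spaces. The sketch below compares the two on made-up strings. `count_by_space` mirrors the cell above, and `count_by_whitespace` is a hypothetical alternative, not something the notebook uses.

```python
def count_by_space(text):
    """Word count as computed in the cell above."""
    return len(str(text).split(" "))

def count_by_whitespace(text):
    """Alternative: split on any run of whitespace."""
    return len(str(text).split())

# Made-up strings: a normal tweet, one with a double space, and an empty one
for sample in ["good morning", "good  morning", ""]:
    print(repr(sample), count_by_space(sample), count_by_whitespace(sample))
```

A tweet that ends up empty after cleaning therefore gets a word count of 1 with `split(" ")`, which may or may not matter for the model.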

In [61]:
# Feature Construction: Average word length (characters per word in a tweet)
# ---
# YOUR CODE GOES BELOW
#

# Create a custom avg_word function that averages the word lengths of each record,
# then apply the function to the df column.
def avg_word(sentence):
  words = sentence.split()
  try:
    z = (sum(len(word) for word in words)/len(words))
  except ZeroDivisionError:
    z = 0  # tweets left empty after cleaning
  return z
df['avg_word_length'] = df.text.apply(avg_word)

In [63]:
# Feature Construction: Noun count
# ---
# YOUR CODE GOES BELOW
#
# First, we download punkt and the averaged_perceptron_tagger into our notebook environment,
# which will allow us to find the part-of-speech tags.
# ---
#
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# We create a function that counts the words in a given sentence that carry a given part-of-speech tag
pos_dic = {
    'noun' : ['NN','NNS','NNP','NNPS'],
    'pron' : ['PRP','PRP$','WP','WP$'],
    'verb' : ['VB','VBD','VBG','VBN','VBP','VBZ'],
    'adj' :  ['JJ','JJR','JJS'],
    'adv' : ['RB','RBR','RBS','WRB']
}

def pos_check(x, flag):
    cnt = 0
    try:
        wiki = TextBlob(x)
        for tup in wiki.tags:
            ppo = list(tup)[1]
            if ppo in pos_dic[flag]:
                cnt += 1
    except Exception:
        pass  # best effort: if tagging fails, return the count so far
    return cnt

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
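
The counting logic of `pos_check` can be checked in isolation, without the tagger, by feeding it tokens that are already tagged. In this minimal sketch the `(word, tag)` pairs are assigned by hand with Penn Treebank tags and stand in for `TextBlob(x).tags`; `POS_GROUPS` and `count_pos` are names introduced here for illustration.

```python
# Same tag groups as pos_dic in the cell above
POS_GROUPS = {
    'noun': ['NN', 'NNS', 'NNP', 'NNPS'],
    'pron': ['PRP', 'PRP$', 'WP', 'WP$'],
    'verb': ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'],
    'adj':  ['JJ', 'JJR', 'JJS'],
    'adv':  ['RB', 'RBR', 'RBS', 'WRB'],
}

def count_pos(tagged_tokens, group):
    """Count the tokens whose tag belongs to the given group."""
    tags = POS_GROUPS[group]
    return sum(1 for _, tag in tagged_tokens if tag in tags)

# Hand-tagged example sentence (an assumption for illustration only)
tagged = [("she", "PRP"), ("quickly", "RB"), ("ate", "VBD"),
          ("two", "CD"), ("red", "JJ"), ("apples", "NNS")]

counts = {group: count_pos(tagged, group) for group in POS_GROUPS}
print(counts)
```

The cardinal number "two" (tag `CD`) falls into none of the five groups, so it is not counted anywhere.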


In [64]:
# Noun Count
# ---
# YOUR CODE GOES BELOW
#
df['noun_count'] = df.text.apply(lambda x: pos_check(x, 'noun'))
df[['text','noun_count']].sample(5)

Unnamed: 0,text,noun_count
2507,katie uk know thinking im getting old festival...,4
2053,crab lin 1 win think well fine,3
585,movie earth come wednesday incidentally earth ...,4
960,love earth beyonc ditched limo cycle around du...,4
3418,bah spent lunch sorting bike find replacement ...,7


In [65]:
# Feature Construction: Verb count
# ---
# YOUR CODE GOES BELOW
#
df['verb_count'] = df.text.apply(lambda x: pos_check(x, 'verb'))
df[['text','verb_count']].sample(10)

Unnamed: 0,text,verb_count
525,friend fail want head bar sf 4 event chicago,1
4110,pat sam wish could help ya,3
3536,cold cold cold morning sun freezing,0
5712,im cold weather crap,1
3268,done working night time relaxing probably fall...,4
7249,hey p shed fart nice,1
7807,still headache cant absent today there sum thi...,1
8372,mastering sound hah yep,1
3806,tier sore thumb,0
5270,implies voter turnout large number win defeat ...,4


In [66]:
# Feature Construction: Adjective count / Tweet
# ---
# YOUR CODE GOES BELOW
#
df['adj_count'] = df.text.apply(lambda x: pos_check(x, 'adj'))
df[['text','adj_count']].sample(10)

Unnamed: 0,text,adj_count
1641,damn gonna wear skirt super cute,2
3069,kal yan varma also need watch 3 year st 7 st t...,1
9600,irish 1974 yeah 1 band dude said something int...,1
562,shoe lover 79 one want meet plus mum well atm ...,2
9261,feel like home,0
9137,sk devi tt nah oct ot u,1
7867,watching come dine get gamer,0
5969,sparkle en wow thats good,1
4815,kinda lazy e working th ng surfin net much fun...,5
6864,psychic g god awful brings back painful memory,3


In [67]:
# Feature Construction: Adverb count / Tweet
# ---
# YOUR CODE GOES BELOW
#
df['adv_count'] = df.text.apply(lambda x: pos_check(x, 'adv'))
df[['text','adv_count']].sample(10)

Unnamed: 0,text,adv_count
7029,saw b wb tonight oyster creek love summer time,0
2999,upset kicked peyton lucas one tree hill mik call,0
4176,bad rush net,0
9318,summer kick suite freedom tonight eh still sha...,1
3508,nicky tv f tell go,0
4578,going lumpys diner lunch e w family,0
4178,want evening sun garden,0
3900,miss family much hh hh miss 23 year old buddy ...,3
8365,e screwed leg properly last night bad leg oh w...,2
5038,good twit make great day,0


In [68]:
# Feature Construction: Pronoun 
# ---
# YOUR CODE GOES BELOW
#
df['pron_count'] = df.text.apply(lambda x: pos_check(x, 'pron'))
df[['text','pron_count']].sample(10)

Unnamed: 0,text,pron_count
515,nine ace picture blank hot riding always feel ...,0
2783,leaving work go home work,0
8827,proofing annual report fun find whether compan...,0
9955,link enjoy dark chocolate within post,0
2138,missing cup final plane,0
7180,go class today go class today go class today,0
4439,web c u canada ever note eye fi,0
5760,nice day one spend,0
5005,great weekend netherlands beating england cric...,0
5580,trying upload picture reason cant resize iphoto,0


In [69]:
# Feature Construction: Subjectivity
# ---
# YOUR CODE GOES BELOW
# 
# Function to get subjectivity of text using the module textblob
def get_subjectivity(text):
    textblob = TextBlob(text)
    subj = textblob.sentiment.subjectivity
    return subj

df['subjectivity'] = df.text.apply(get_subjectivity)
df[['text', 'subjectivity']].sample(10)

Unnamed: 0,text,subjectivity
3662,inter bike hopefully never love greenville,0.6
9171,n santosh man movie u watch b lore fu kin sh m...,0.0
3144,damm amanda tapping cancelled appearance colle...,1.0
1929,could day take longer stuck small business man...,0.4
8858,option get follower twitter free,0.8
5096,x aviv didnt even end going laid low brother c...,0.3
1238,la ur ann game tonight dear theyre til tomorrow,0.4
5599,v daniel thanks por que le sea le term e pront...,0.2
2602,tik shi right live music everywhere back miss,0.345238
6739,human doodad think made worse,0.35


In [71]:
# Feature Construction: Polarity
# ---
# YOUR CODE GOES BELOW
# 
from textblob import TextBlob

# Function to get polarity of text using the module textblob
def get_polarity(text):
    textblob = TextBlob(text)
    pol = textblob.sentiment.polarity
    return pol

df['polarity'] = df.text.apply(get_polarity)
df[['text', 'polarity']].sample(10)


Unnamed: 0,text,polarity
4017,ache clue,0.0
8364,g gr rrr teeth still hurt lame,-0.5
2093,britney spear ive seen agree,0.0
734,good afternoon welcome mama mia pizza would li...,0.733333
7994,band good kris watching near atm trash come sa...,0.4
7124,melody wow thats awesome lunch dessert woohoo,0.55
7043,ear soar,0.0
4401,home fr w rk pjs time watch movie go sleep see...,0.0
4587,wrestler springsteen kill lot one fa v,0.0
4166,morning everybody church morning cant wait,0.0


In [79]:
# Feature Construction: Word Level N-Gram TF-IDF Feature 
# ---
# YOUR CODE GOES BELOW
#
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word', ngram_range=(1,3),  stop_words= 'english')

df_word_vect = tfidf.fit_transform(df['text'])

# Previewing the dense form of the created sparse feature matrix
#
df_word_vect.toarray()

# Listing the tf-idf feature names (the selected n-grams)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
# ---
#
tfidf.get_feature_names_out()

# Creating a data frame to view our matrix
# ---
#
pd.DataFrame(df_word_vect.toarray(), columns=tfidf.get_feature_names_out())



Unnamed: 0,00,09,10,100,11,12,13,15,16,19,20,2009,21,22,24,30,40,50,aaa,aaaa,aaaaaa,able,account,ache,actually,ad,adam,add,added,afternoon,age,ago,agree,ah,aha,ahh,ahh hh,air,airport,al,...,world,worried,worry,worse,worst,worth,wouldnt,wow,write,wrong,wu,ww,www,xd,xo,xx,xxx,ya,yah,yay,ye,yea,yeah,year,yep,yes,yesterday,yo,youll,young,youre,youtube,youve,yr,yummy,yy,yyyy,zy,zz,zzz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.413509,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
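
To make the word n-gram TF-IDF features more concrete, here is a minimal pure-Python sketch on a made-up three-document corpus. It assumes scikit-learn's documented defaults for the weighting (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2-normalized rows) and ignores the vectorizer's stop-word filtering, `max_features` cut-off and token pattern. The helper names are hypothetical.

```python
import math
from collections import Counter

def word_ngrams(text, n_min=1, n_max=3):
    """Return all word n-grams with n_min <= n <= n_max, shortest first."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

docs = ["good morning", "good night", "bad morning"]  # made-up corpus
doc_grams = [word_ngrams(d) for d in docs]

# Document frequency: in how many documents does each n-gram occur?
df_counts = Counter(g for grams in doc_grams for g in set(grams))
n_docs = len(docs)

def smoothed_idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + df_counts[term])) + 1

def tfidf_vector(grams):
    """TF-IDF weights for one document, scaled to unit L2 norm."""
    tf = Counter(grams)
    weights = {g: tf[g] * smoothed_idf(g) for g in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {g: w / norm for g, w in weights.items()}

vec = tfidf_vector(doc_grams[0])
for gram, weight in vec.items():
    print(f"{gram!r}: {weight:.3f}")
```

The bigram "good morning" occurs in only one document, so it receives a larger weight than the unigrams "good" and "morning", which each occur in two.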


In [80]:
# Feature Construction: Character Level N-Gram TF-IDF Feature
# ---
# YOUR CODE GOES BELOW
# 
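# stop_words only applies when analyzer='word', so it is omitted here; a separate
# name keeps the word-level vectorizer above available
tfidf_char = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='char', ngram_range=(1,3))
df_char_vect = tfidf_char.fit_transform(df['text'])
df_char_vect.toarray()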

array([[0.24420187, 0.        , 0.        , ..., 0.        , 0.08323055,
        0.        ],
       [0.23112721, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.17551168, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.21991513, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.18421581, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.28106415, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [77]:
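# Let's prepare the constructed features for modeling
# ---
# Columns 2 to 11 run from no_of_stopwords to subjectivity. Polarity (column 12)
# is left out by this slice; it can be negative, and MultinomialNB needs non-negative features.
#
X_metadata = np.array(df.iloc[:, 2:12])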
X_metadata

array([[ 0.        , 67.        , 11.        , ...,  2.        ,
         0.        ,  0.9       ],
       [ 0.        , 81.        , 12.        , ...,  1.        ,
         0.        ,  0.85      ],
       [ 0.        , 40.        ,  6.        , ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 0.        , 45.        ,  8.        , ...,  2.        ,
         0.        ,  0.83333333],
       [ 0.        , 34.        ,  6.        , ...,  1.        ,
         0.        ,  0.56785714],
       [ 0.        , 44.        , 10.        , ...,  0.        ,
         0.        ,  0.6       ]])

In [81]:
# We combine our two tfidf (sparse) matrices and X_metadata
# ---
#
X = scipy.sparse.hstack([df_word_vect, df_char_vect,  X_metadata])
X

<10000x2010 sparse matrix of type '<class 'numpy.float64'>'
	with 944959 stored elements in COOrdinate format>

In [82]:
# Getting our response variable
# ---
#
y = np.array(df.iloc[:, 0])
y

array([0, 4, 0, ..., 0, 4, 0])

### 4. Data Modelling

During this step, we will use machine learning algorithms to train and test our sentiment analysis models.

In [83]:
# Splitting our data
# ---
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [84]:
# Fitting our model
# ---
#

# Importing the algorithms
from sklearn.naive_bayes import MultinomialNB 
from sklearn.linear_model import LogisticRegression

nb_classifier = MultinomialNB() 
lr_classifier = LogisticRegression(max_iter=1000) 

# Training our model
nb_classifier.fit(X_train, y_train) 
lr_classifier.fit(X_train, y_train)

LogisticRegression(max_iter=1000)

In [85]:
# Making predictions
# ---
#
y_predict_nb = nb_classifier.predict(X_test) 
y_predict_lr = lr_classifier.predict(X_test)

In [86]:
# Evaluating the Models
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Accuracy scores
# ---
#
print("Naive Bayes Classifier:\n", accuracy_score(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", accuracy_score(y_test, y_predict_lr))

Naive Bayes Classifier:
 0.725
Logistic Regression Classifier: 
 0.733


In [87]:
# Confusion matrices
# ---
# 
print("Naive Bayes Classifier: \n", confusion_matrix(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", confusion_matrix(y_test, y_predict_lr))

Naive Bayes Classifier: 
 [[760 290]
 [260 690]]
Logistic Regression Classifier: 
 [[763 287]
 [247 703]]


In [88]:
# Classification Reports
# ---
#
print("Naive Bayes Classifier: \n", classification_report(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", classification_report(y_test, y_predict_lr))

Naive Bayes Classifier: 
               precision    recall  f1-score   support

           0       0.75      0.72      0.73      1050
           4       0.70      0.73      0.72       950

    accuracy                           0.73      2000
   macro avg       0.72      0.73      0.72      2000
weighted avg       0.73      0.72      0.73      2000

Logistic Regression Classifier: 
               precision    recall  f1-score   support

           0       0.76      0.73      0.74      1050
           4       0.71      0.74      0.72       950

    accuracy                           0.73      2000
   macro avg       0.73      0.73      0.73      2000
weighted avg       0.73      0.73      0.73      2000



**Evaluating our Models**

* **Accuracy:** the percentage of texts that were assigned the correct class.
* **Precision:** for each class, the percentage of texts predicted as that class that actually belong to it.
* **Recall:** for each class, the percentage of texts belonging to that class that the model predicted correctly.
* **F1 Score:** the harmonic mean of precision and recall.
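
These metrics can be recomputed by hand from the confusion matrix reported above for logistic regression. The sketch below is a minimal pure-Python check, assuming scikit-learn's layout (rows are true classes, columns are predicted classes, in the order 0, 4). `per_class_metrics` is a hypothetical helper.

```python
# Confusion matrix reported above for the logistic regression classifier
# (rows = true class, columns = predicted class; class order 0, 4).
cm = [[763, 287],
      [247, 703]]

def per_class_metrics(cm, k):
    """Return (precision, recall, f1) for the class at index k."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything that truly is k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

total = sum(sum(row) for row in cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / total
print(f"accuracy: {accuracy:.3f}")

for k, label in enumerate([0, 4]):
    p, r, f = per_class_metrics(cm, k)
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Rounded to two decimals, the results agree with the classification report: 0.76 / 0.73 / 0.74 for class 0 and 0.71 / 0.74 / 0.72 for class 4.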

To improve our model, we can try other text processing techniques that would better prepare our data for modeling. We can also use different vectorizing techniques, implement other machine learning models and perform hyperparameter tuning.

### 5. Recommendations


Our best model, logistic regression, reached an accuracy of 73.3% on the test set, so we would use it to classify new tweets. We can improve this performance further through hyperparameter tuning and additional feature engineering.