<a href="https://colab.research.google.com/github/Oughty-Otieno/Getting-Started-with-Text-Analysis/blob/main/Copy_of_%5BProject_Guiding_Notebook%5D_AfterWork_Data_Science_Getting_Started_with_Text_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<font color="#4b76b7">To start practicing, you will need to make a copy of it. Go to File > Save a Copy in Drive. You can then use the new copy that will appear in the new tab.</font>


# AfterWork Data Science: Getting Started with NLP Project

### Prerequisites

In [1]:
# Importing the required libraries
# ---
# 
import pandas as pd # library for data manipulation
import numpy as np  # library for scientific computations
import re           # regex library to perform text preprocessing
import string       # library to work with strings
import nltk         # library for natural language processing
import scipy        # library for scientific computing

### 1. Importing our Data

In [2]:
# Question: Given new tweets, create a sentiment analysis model that will 
# predict whether a tweet contains positive or negative sentiment.
# ---
# Dataset url = https://bit.ly/31kqByD 
# ---
#
df = pd.read_csv('https://bit.ly/31kqByD', encoding='latin-1')
df.head()

Unnamed: 0.1,Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,346508,0,2016177685,Wed Jun 03 06:18:50 PDT 2009,NO_QUERY,UriGrey,Obama forges his Muslim alliance against the c...
1,883537,4,1686152287,Sun May 03 04:02:08 PDT 2009,NO_QUERY,MariesolW,Had the most spectacular prom ever but now my...
2,764173,0,2298725623,Tue Jun 23 12:02:12 PDT 2009,NO_QUERY,ColleenBurns,I am overwhelmed today taking a moment to eat...
3,638701,0,2234530495,Thu Jun 18 23:13:54 PDT 2009,NO_QUERY,queenarchy,@lindork Tres sad. I was totally a Max fan. #...
4,664821,0,2244623416,Fri Jun 19 14:59:46 PDT 2009,NO_QUERY,reinventingjess,"Crap, I was counting down the hours until my d..."


### 2. Data Exploration

In [3]:
# We can determine the size of our dataset
# ---
#
df.shape

(10000, 7)

This dataset needs some cleaning: the columns have no meaningful names, and we don't need all of them to create our model. We will rename the columns and drop the ones we don't need.

### 3. Data Preparation

#### Basic Data Cleaning Techniques

In [4]:
# We rename the columns for ease of referencing our columns later on
# ---
#
df.columns = ['id', 'target', 't_id', 'created_at', 'query', 'user', 'text']
df.head()

Unnamed: 0,id,target,t_id,created_at,query,user,text
0,346508,0,2016177685,Wed Jun 03 06:18:50 PDT 2009,NO_QUERY,UriGrey,Obama forges his Muslim alliance against the c...
1,883537,4,1686152287,Sun May 03 04:02:08 PDT 2009,NO_QUERY,MariesolW,Had the most spectacular prom ever but now my...
2,764173,0,2298725623,Tue Jun 23 12:02:12 PDT 2009,NO_QUERY,ColleenBurns,I am overwhelmed today taking a moment to eat...
3,638701,0,2234530495,Thu Jun 18 23:13:54 PDT 2009,NO_QUERY,queenarchy,@lindork Tres sad. I was totally a Max fan. #...
4,664821,0,2244623416,Fri Jun 19 14:59:46 PDT 2009,NO_QUERY,reinventingjess,"Crap, I was counting down the hours until my d..."


In [5]:
# We retain the relevant columns by dropping the columns we don't need 
# for creating a sentiment analysis model. 
# ---
#
df = df.drop(['id', 't_id', 'created_at', 'query', 'user'], axis = 1)
df.head()

Unnamed: 0,target,text
0,0,Obama forges his Muslim alliance against the c...
1,4,Had the most spectacular prom ever but now my...
2,0,I am overwhelmed today taking a moment to eat...
3,0,@lindork Tres sad. I was totally a Max fan. #...
4,0,"Crap, I was counting down the hours until my d..."


In [6]:
# Understanding the distribution of target
# ---
#
df.target.value_counts() 

0    5067
4    4933
Name: target, dtype: int64

In [7]:
# Let's determine whether our columns have the right data types
# ---
#
df.dtypes

target     int64
text      object
dtype: object

In [8]:
# What values are in our target variable?
# ---
#
df.target.unique()

array([0, 4])

These are the two classes to which each document (text) belongs. The target value 0 means a text with a negative sentiment, while that of 4 means a text with a positive sentiment. 
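Because the labels are the codes 0 and 4, it can help to map them to readable names when inspecting results. Below is a minimal sketch on a small made-up frame; the `sentiment` column name and the sample rows are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Toy frame standing in for df; the rows are made up for illustration
sample = pd.DataFrame({
    'target': [0, 4, 0, 4],
    'text': ['so sad today', 'best day ever', 'i miss you', 'love this song'],
})

# Map the numeric codes to readable labels (0 = negative, 4 = positive)
label_names = {0: 'negative', 4: 'positive'}
sample['sentiment'] = sample['target'].map(label_names)

print(sample[['target', 'sentiment']])
```

The original `target` column is left untouched, so the models further down can still be trained on the 0/4 codes.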

In [9]:
# Let's check for missing values 
# ---
# 
df.isnull().sum()

target    0
text      0
dtype: int64

We don't have any missing values, so we are good to go.

#### Text Processing

In [10]:
# Text Cleaning: Removing all urls/links
# ---
# 
df['text'] =  df['text'].apply(lambda x: re.sub(r'http\S+|www\S+|https\S+','', str(x)))
df[['text']].head()

Unnamed: 0,text
0,Obama forges his Muslim alliance against the c...
1,Had the most spectacular prom ever but now my...
2,I am overwhelmed today taking a moment to eat...
3,@lindork Tres sad. I was totally a Max fan. #...
4,"Crap, I was counting down the hours until my d..."


In [11]:
# Text Cleaning: Removing @ and # characters or replace them with space
# ---
# YOUR CODE GOES BELOW
#

df['text'] = df.text.str.replace('#','')
df['text'] = df.text.str.replace('@','')

df[['text']].head()


Unnamed: 0,text
0,Obama forges his Muslim alliance against the c...
1,Had the most spectacular prom ever but now my...
2,I am overwhelmed today taking a moment to eat...
3,lindork Tres sad. I was totally a Max fan. SY...
4,"Crap, I was counting down the hours until my d..."


In [12]:
# Text Cleaning: Conversion to lowercase
# ---
# YOUR CODE GOES BELOW
#
df['text'] = df.text.apply(lambda x: " ".join(x.lower() for x in x.split()))

df.head()

Unnamed: 0,target,text
0,0,obama forges his muslim alliance against the c...
1,4,had the most spectacular prom ever but now my ...
2,0,i am overwhelmed today taking a moment to eat ...
3,0,lindork tres sad. i was totally a max fan. sytycd
4,0,"crap, i was counting down the hours until my d..."


In [13]:
# Text Cleaning: Splitting concatenated words
# ---
# Performing this step will take a few minutes...
# ---
# YOUR CODE GOES BELOW
# 

# Installing wordninja and textblob
# ---
#
!pip3 install wordninja
!pip3 install textblob

# Importing those libraries
# ---
#
import wordninja 
from textblob import TextBlob

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wordninja
  Downloading wordninja-2.0.0.tar.gz (541 kB)
     |████████████████████████████████| 541 kB 5.3 MB/s 
Building wheels for collected packages: wordninja
  Building wheel for wordninja (setup.py) ... done
  Created wheel for wordninja: filename=wordninja-2.0.0-py3-none-any.whl size=541551 sha256=e3ea9509770f5ce45a665ba253d3f2dd0d9e9454699e6f9ef0e761f8946f0fbf
  Stored in directory: /root/.cache/pip/wheels/dd/3f/eb/a2692e3d2b9deb1487b09ba4967dd6920bd5032bfd9ff7acfc
Successfully built wordninja
Installing collected packages: wordninja
Successfully installed wordninja-2.0.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [14]:
# Performing the split
# ---
#
df['text'] = df.text.apply(lambda x: wordninja.split(str(TextBlob(x))))  
df['text'] = df.text.str.join(' ')

In [15]:
# Text Cleaning: Removing punctuation characters
# ---
# YOUR CODE GOES BELOW
#
df['text'] = df.text.str.replace(r'[^\w\s]', '', regex=True)

df.head()


Unnamed: 0,target,text
0,0,obama forges his muslim alliance against the c...
1,4,had the most spectacular prom ever but now my ...
2,0,i am overwhelmed today taking a moment to eat ...
3,0,lin dork tres sad i was totally a max fan sytycd
4,0,crap i was counting down the hours until my da...


#### Spelling Correction


In [19]:
df['corrected_tweet'] = df['text'].apply(lambda x: str(TextBlob(x).correct()))
df[['text', 'corrected_tweet']].head(20)

Unnamed: 0,text,corrected_tweet
0,obama forges muslim alliance civilized world d...,drama forges muslin alliance civilized world d...
1,spectacular prom ever bed serenading must answ...,spectacular from ever bed spreading must answe...
2,overwhelmed today taking moment eat pray,overwhelmed today taking moment eat pray
3,lin dork tres sad totally max fan sytycd,in work tres sad totally max fan sytycd
4,crap counting hours dad could come home amp he...,cap counting hours dad could come home amp hel...
5,dc b tv dc b tv go check things buy others loo...,do b to do b to go check things buy others loo...
6,mr ke never gmail anymore,mr ke never email anymore
7,alex jeffrey id loved come couple unfortunate ...,flex geoffrey id loved come couple unfortunate...
8,br rrr heading work chilly today,br err heading work chilly today
9,ga bri iii ella nee ed talk u good new sss,ga brim iii elba nee ed talk u good new ses


In [21]:
# Text Cleaning: Removing stop words
# ---
# YOUR CODE GOES BELOW
# 
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
stop = stopwords.words('english')

df['corrected_tweet'] = df.corrected_tweet.apply(lambda x: " ".join(x for x in x.split() if x not in stop))

df.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,target,text,lemmatized,corrected_tweet
0,0,obama forges muslim alliance civilized world d...,obama forge muslim alliance civilized world di...,obama forges muslim alliance civilized world d...
1,4,spectacular prom ever bed serenading must answ...,spectacular prom ever bed serenading must answ...,spectacular prom ever bed serenading must answ...
2,0,overwhelmed today taking moment eat pray,overwhelmed today taking moment eat pray,overwhelmed today taking moment eat pray
3,0,lin dork tres sad totally max fan sytycd,lin dork tres sad totally max fan sytycd,lin dork tres sad totally max fan sytycd
4,0,crap counting hours dad could come home amp he...,crap counting hour dad could come home amp hel...,crap counting hours dad could come home amp he...


In [22]:
# Text Cleaning: Lemmatization
# ---
# YOUR CODE GOES BELOW
#

# For lemmatization, we will need to download wordnet
#
nltk.download('wordnet')
from textblob import Word

# Lemmatizing our text
# ---
#
df['lemmatized'] = df.corrected_tweet.apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()])) 
df[['corrected_tweet', 'lemmatized']].head(10)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,corrected_tweet,lemmatized
0,obama forges muslim alliance civilized world d...,obama forge muslim alliance civilized world di...
1,spectacular prom ever bed serenading must answ...,spectacular prom ever bed serenading must answ...
2,overwhelmed today taking moment eat pray,overwhelmed today taking moment eat pray
3,lin dork tres sad totally max fan sytycd,lin dork tres sad totally max fan sytycd
4,crap counting hours dad could come home amp he...,crap counting hour dad could come home amp hel...
5,dc b tv dc b tv go check things buy others loo...,dc b tv dc b tv go check thing buy others look...
6,mr ke never gmail anymore,mr ke never gmail anymore
7,alex jeffrey id loved come couple unfortunate ...,alex jeffrey id loved come couple unfortunate ...
8,br rrr heading work chilly today,br rrr heading work chilly today
9,ga bri iii ella nee ed talk u good new sss,ga bri iii ella nee ed talk u good new ss


We won't remove numerics because our text could lose meaning without them. Spelling correction is a resource-intensive step and, as the sample output above shows, it can also miscorrect names and slang (for example, `obama` becomes `drama`), so it should be applied with care.
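The basic cleaning steps in this section can be bundled into one helper so that they are applied in the same order to any new text. The following is a rough sketch using only the standard library. `clean_tweet` is a hypothetical name, the sample string is made up, and the helper leaves out the wordninja splitting, stop word removal, and lemmatization steps.

```python
import re

def clean_tweet(text):
    """Apply the basic cleaning steps: links, @/#, lowercase, punctuation."""
    text = re.sub(r'http\S+|www\S+', '', str(text))  # drop links
    text = text.replace('@', '').replace('#', '')     # drop @ and # characters
    text = text.lower()                               # normalize case
    text = re.sub(r'[^\w\s]', '', text)               # drop punctuation, keep digits
    return ' '.join(text.split())                     # collapse extra whitespace

print(clean_tweet("@user Check THIS out!!! #NLP http://example.com/page 10 times"))
# user check this out nlp 10 times
```

Digits survive the punctuation step because `\w` matches them, which is consistent with the decision to keep numerics.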

#### Feature Engineering Techniques 

In [23]:
# Feature Construction: Length of tweet
# ---
# YOUR CODE GOES BELOW
#
df['length_of_text'] = df.lemmatized.str.len()
df[['lemmatized','length_of_text']].sample(5)

Unnamed: 0,lemmatized,length_of_text
6808,chicken biscuit smell hand still lingering han...,82
6881,cant believe tomorrow sunday already,36
7799,anti gay protester dw n tw n make moon che ez ...,107
7975,rosie bunny got wool hat wrist thing y saw pic...,73
1767,couch smell lie k smelly workout ppl ewe wie,44


In [24]:
# Feature Construction: Word count 
# ---
# YOUR CODE GOES BELOW
# 
df['word_count'] = df.lemmatized.apply(lambda x: len(str(x).split(" ")))
df[['lemmatized', 'word_count']].sample(5)

Unnamed: 0,lemmatized,word_count
6133,still cant believe summer im programed correctly,7
9805,skin thi watt im great thx r u,8
3946,omg today bad day nothing went right hopefully...,15
1460,baby girl paris milwaukee curiosity,5
6379,bed got early day moro est london snatch car b...,11


In [25]:
# Feature Construction: Average word length (average no. of characters / word)
# ---
# YOUR CODE GOES BELOW
#
def avg_word(sentence):
  words = sentence.split()
  try:
    z = (sum(len(word) for word in words)/len(words))
  except ZeroDivisionError:
    z = 0 
  return z



In [26]:
df['avg_word_length'] = df.lemmatized.apply(lambda x: avg_word(x))
df[['lemmatized','avg_word_length']].sample(5)

Unnamed: 0,lemmatized,avg_word_length
7326,jer cx apparently window left open amp portion...,4.428571
5977,tweeting apparently bother people dont boring ...,5.545455
414,li shy b unfortunately wanna shop,4.666667
4558,nigel net al furlong kaz zy lady j ever feel t...,3.923077
9132,maria p 91 pretty good thanks oh h pity wake e...,4.25


In [27]:
# Feature Construction: Noun count
# ---
# YOUR CODE GOES BELOW
#
# First, we download punkt and the averaged_perceptron_tagger into our notebook environment,
# which will allow us to find the part of speech tags.
# ---
#
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# We create a function that counts the words in a sentence carrying a given part of speech tag

pos_dic = {
    'noun' : ['NN','NNS','NNP','NNPS'],
    'pron' : ['PRP','PRP$','WP','WP$'],
    'verb' : ['VB','VBD','VBG','VBN','VBP','VBZ'],
    'adj' :  ['JJ','JJR','JJS'],
    'adv' : ['RB','RBR','RBS','WRB']
}

def pos_check(x, flag):
    cnt = 0
    try:
        wiki = TextBlob(x)
        for tup in wiki.tags:
            ppo = list(tup)[1]
            if ppo in pos_dic[flag]:
                cnt += 1
    except Exception:
        # If tagging fails (e.g. non-string input), fall back to the count so far
        pass
    return cnt


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [28]:
# Noun Count
# ---
# YOUR CODE GOES BELOW
#
df['noun_count'] = df.lemmatized.apply(lambda x: pos_check(x, 'noun'))
df[['lemmatized','noun_count']].sample(10)

Unnamed: 0,lemmatized,noun_count
2376,eating chinese food leslie j rick,4
9386,disappointed discover flash 10 0 2 update fix ...,4
6293,tired last thing want finish cleaning house,3
5176,awesome lost one diamond ea ring,3
270,got britney vma 2008 hair today,4
675,emmy 56 sorry still whole online class nt even...,3
1112,link idd lol wish,4
1446,internet cant slow,2
4113,b leh really bad dream,3
153,postal gu la thing going lately read u really ...,4


In [29]:
# Feature Construction: Verb count
# ---
# YOUR CODE GOES BELOW
#
df['verb_count'] = df.lemmatized.apply(lambda x: pos_check(x, 'verb'))
df[['lemmatized','verb_count']].sample(10)

Unnamed: 0,lemmatized,verb_count
7865,work taking hour lunch,1
6131,adding new sound system c ruiz er,1
6951,ooo love bank holiday x,2
3812,young q glad made safe course love,1
3316,tired love work lt 3,1
724,back college watching vids back dont wanna fac...,0
1730,2 1 2 hour im still smarter,0
1633,got bed minute ago 2,2
5566,early saturday way coffee bean tea latte,0
7102,never know wear,1


In [30]:
# Feature Construction: Adjective count / Tweet
# ---
# YOUR CODE GOES BELOW
#
df['adjective_count'] = df.lemmatized.apply(lambda x: pos_check(x, 'adj'))
df[['lemmatized','adjective_count']].sample(10)

Unnamed: 0,lemmatized,adjective_count
8224,dusting napping,0
6246,hey cheryl c uz steam cleaned carpet gross fee...,2
3515,im upset inside big brother net think ree going,2
2219,rob dyer 4 c didnt get see 3 still really like...,3
5882,ate jan lous comment made smile said quo ang g...,4
5975,lisp,0
9648,emma 300 ben scared german shepherd never used...,4
6329,love people get mad hurt,1
4372,loved new hannah montana episode guy go prom a...,2
2747,sad say good bye lauren watching v 5 year hill...,2


In [31]:
# Feature Construction: Adverb count / Tweet
# ---
# YOUR CODE GOES BELOW
#

df['adv_count'] = df.lemmatized.apply(lambda x: pos_check(x, 'adv'))
df[['lemmatized','adv_count']].sample(10)


Unnamed: 0,lemmatized,adv_count
3944,yes miss jon wish,1
7532,im drinking first soda like whole year today s...,1
674,c jemison 8350 har rahs club without,0
1378,sitting fo wi fi phone doesnt tether,0
6138,another morning one cook breakfast breakfast s...,0
4481,yay map done time proof amp get approval redes...,0
2819,glad didnt take metro today,0
714,beautiful kitty,0
7064,didnt get king pacific job,0
1092,ra gun zoo beloved family fun city woman like ...,1


In [32]:
# Feature Construction: Pronoun 
# ---
# YOUR CODE GOES BELOW
#
df['pron_count'] = df.lemmatized.apply(lambda x: pos_check(x, 'pron'))
df[['lemmatized','pron_count']].sample(10)

Unnamed: 0,lemmatized,pron_count
9252,didnt open tweet deck 2 pm losing power,0
907,rachael 6148 kano he quality,1
6417,bored like crazy nothing tv,0
1715,eric amer r im jealous,0
7920,abraham lloyd boy kid got class,0
5763,azar nous h oh hang,0
1790,kata 159 oh h noo itll youtube tomorrow thank ...,0
8388,sen ize added,0
5076,platypus factory also hope everything going we...,0
478,mom birthday afk,0


In [33]:
# Feature Construction: Subjectivity
# ---
# YOUR CODE GOES BELOW
# 
# Function to get subjectivity of text using the module textblob
def get_subjectivity(text):
    textblob = TextBlob(text)
    subj = textblob.sentiment.subjectivity
    return subj

df['subjectivity'] = df.lemmatized.apply(get_subjectivity)
df[['lemmatized', 'subjectivity']].sample(10)

Unnamed: 0,lemmatized,subjectivity
8235,jasmin love sure friend whats stupid friend te...,0.82963
1948,,0.0
7933,icarus malfoy hope find cat soon gone spc post...,0.5
7537,finished watching oc feel lost,0.0
6357,fyi dont fuck lie try take fool im one upper h...,0.4
3840,ball bb working read text message holl dont text,0.0
9915,mel watson oh h look delicious send south,1.0
5025,always burn k nu ckel,0.0
7104,ne x 007 sure manage odd p 3 practice session ...,0.712963
79,unn allman yeah look like quo busy quo fucking...,0.55


In [34]:
# Feature Construction: Polarity
# ---
# YOUR CODE GOES BELOW
# 
def get_polarity(text):
    textblob = TextBlob(text)
    pol = textblob.sentiment.polarity
    return pol

df['polarity'] = df.lemmatized.apply(get_polarity)
df[['lemmatized', 'polarity']].sample(50)

Unnamed: 0,lemmatized,polarity
3453,tired still playing dying hot,-0.075
1963,good morning twitter community got finished ea...,0.677778
2667,bad cramp,-0.7
1571,wow monday feeling slow today organizing image...,0.1
6176,sad leaving arizona today,-0.5
9332,murray steinman glad see twitter cant wait see...,0.5
9510,whole body hurt,0.2
3447,quo never meant baby happened quo knock,0.0
7373,morning work manchester meet afternoon plus su...,0.0
1419,rvi nd rock lol amazing isnt weekend plan chal...,0.633333


In [35]:
# Feature Construction: Word Level N-Gram TF-IDF Feature 
# ---
# YOUR CODE GOES BELOW
#

from nltk import word_tokenize, ngrams

# Word ngrams
# ---
#
list(ngrams(word_tokenize(df['lemmatized'][0]), 2)) 

[('obama', 'forge'),
 ('forge', 'muslim'),
 ('muslim', 'alliance'),
 ('alliance', 'civilized'),
 ('civilized', 'world'),
 ('world', 'didnt'),
 ('didnt', 'even'),
 ('even', 'drop'),
 ('drop', 'cup'),
 ('cup', 'tea')]
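For intuition, an n-gram is a sliding window of n consecutive tokens. Here is a minimal pure-Python sketch of the same idea; the helper name `simple_ngrams` is made up, and this is not how nltk implements `ngrams` internally.

```python
def simple_ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a sequence of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "obama forge muslim alliance".split()
print(simple_ngrams(words, 2))
# [('obama', 'forge'), ('forge', 'muslim'), ('muslim', 'alliance')]

# Passing a string instead of a list yields character n-grams
print(simple_ngrams("tea", 2))
# [('t', 'e'), ('e', 'a')]
```

A sequence of k tokens yields k - n + 1 n-grams, which is why the ten-word tweet above produces nine bigrams plus one more only if an eleventh token is present.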

In [36]:
# Feature Construction: Character Level N-Gram TF-IDF Feature
# ---
# YOUR CODE GOES BELOW
# 
list(ngrams(df['lemmatized'][0], 2))

[('o', 'b'),
 ('b', 'a'),
 ('a', 'm'),
 ('m', 'a'),
 ('a', ' '),
 (' ', 'f'),
 ('f', 'o'),
 ('o', 'r'),
 ('r', 'g'),
 ('g', 'e'),
 ('e', ' '),
 (' ', 'm'),
 ('m', 'u'),
 ('u', 's'),
 ('s', 'l'),
 ('l', 'i'),
 ('i', 'm'),
 ('m', ' '),
 (' ', 'a'),
 ('a', 'l'),
 ('l', 'l'),
 ('l', 'i'),
 ('i', 'a'),
 ('a', 'n'),
 ('n', 'c'),
 ('c', 'e'),
 ('e', ' '),
 (' ', 'c'),
 ('c', 'i'),
 ('i', 'v'),
 ('v', 'i'),
 ('i', 'l'),
 ('l', 'i'),
 ('i', 'z'),
 ('z', 'e'),
 ('e', 'd'),
 ('d', ' '),
 (' ', 'w'),
 ('w', 'o'),
 ('o', 'r'),
 ('r', 'l'),
 ('l', 'd'),
 ('d', ' '),
 (' ', 'd'),
 ('d', 'i'),
 ('i', 'd'),
 ('d', 'n'),
 ('n', 't'),
 ('t', ' '),
 (' ', 'e'),
 ('e', 'v'),
 ('v', 'e'),
 ('e', 'n'),
 ('n', ' '),
 (' ', 'd'),
 ('d', 'r'),
 ('r', 'o'),
 ('o', 'p'),
 ('p', ' '),
 (' ', 'c'),
 ('c', 'u'),
 ('u', 'p'),
 ('p', ' '),
 (' ', 't'),
 ('t', 'e'),
 ('e', 'a')]

In [63]:
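# Let's prepare the constructed features for modeling
# ---
# Note: with the current column order, this positional slice also picks up
# the text columns, as the output below shows.
#
X_metadata = np.array(df.iloc[:, 2:12])

# Creating a word level tf-idf feature
# ---
#
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word', ngram_range=(1, 3), stop_words='english')
df_tweets_vect = tfidf.fit_transform(df['lemmatized'])

# Creating our dataframe
# (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
pd.DataFrame(df_tweets_vect.toarray(), columns=tfidf.get_feature_names_out())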

X_metadata



array([['obama forge muslim alliance civilized world didnt even drop cup tea',
        'obama forges muslim alliance civilized world didnt even drop cup tea',
        67, ..., 1, 2, 0],
       ['spectacular prom ever bed serenading must answer sweet dream friend wonderful day',
        'spectacular prom ever bed serenading must answer sweet dreams friends wonderful day',
        81, ..., 2, 1, 0],
       ['overwhelmed today taking moment eat pray',
        'overwhelmed today taking moment eat pray', 40, ..., 0, 0, 0],
       ...,
       ['hah linas hyper already well lucky im college',
        'hah linas hyper already well lucky im college', 45, ..., 2, 2,
        0],
       ['omg really good day happened right',
        'omg really good day happened right', 34, ..., 2, 1, 0],
       ['love 2 cook pie saw division 68 th didnt see',
        'love 2 cook pie saw division 68 th didnt see', 44, ..., 0, 0, 0]],
      dtype=object)

In [60]:
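# Creating a character level tf-idf feature
# ---
# We store the result in df_tweets_char so that it does not overwrite the
# word level matrix (df_tweets_vect). stop_words is dropped because it only
# applies when analyzer='word'.
#
tfidf_char = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='char', ngram_range=(1, 3), dtype=float)
df_tweets_char = tfidf_char.fit_transform(df['lemmatized'])

# Creating our dataframe
pd.DataFrame(df_tweets_char.toarray(), columns=tfidf_char.get_feature_names_out())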



Unnamed: 0,Unnamed: 1,1,1.1,2,2.1,3,3.1,4,4.1,a,...,ye,yea,yes,yi,yin,yo,you,ys,z,z.1
0,0.244202,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.054803,...,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.083231,0.0
1,0.231131,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.047154,...,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0
2,0.175511,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0
3,0.244659,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0
4,0.275581,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.044175,...,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,0.169383,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.127164,0.128659,0.0,0.0,0.0,0.000000,0.0
9996,0.332136,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0
9997,0.219923,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.070507,...,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0
9998,0.184216,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0


In [64]:
# We combine our two tf-idf (sparse) matrices. X_metadata is left out here,
# so X has 1000 word level + 1000 character level columns.
# ---
#

X = scipy.sparse.hstack([df_tweets_vect, df_tweets_char])
X

<10000x2000 sparse matrix of type '<class 'numpy.float64'>'
	with 878525 stored elements in COOrdinate format>

In [58]:
# Getting our response variable
# ---
#
y = np.array(df.iloc[:, 0])
y

array([0, 4, 0, ..., 0, 4, 0])

### 4. Data Modelling

During this step, we will use machine learning algorithms to train and test our sentiment analysis models.

In [65]:
# Splitting our data
# ---
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [66]:
# Fitting our model
# ---
#

# Importing the algorithms
from sklearn.naive_bayes import MultinomialNB 
from sklearn.linear_model import LogisticRegression

nb_classifier = MultinomialNB() 
lr_classifier = LogisticRegression(max_iter=1000) 

# Training our model
nb_classifier.fit(X_train, y_train) 
lr_classifier.fit(X_train, y_train)

LogisticRegression(max_iter=1000)

In [67]:
# Making predictions
# ---
#
y_predict_nb = nb_classifier.predict(X_test) 
y_predict_lr = lr_classifier.predict(X_test)

In [68]:
# Evaluating the Models
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Accuracy scores
# ---
#
print("Naive Bayes Classifier:\n", accuracy_score(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", accuracy_score(y_test, y_predict_lr))

Naive Bayes Classifier:
 0.7315
Logistic Regression Classifier: 
 0.731


In [69]:
# Confusion matrices
# ---
# 
print("Naive Bayes Classifier: \n", confusion_matrix(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", confusion_matrix(y_test, y_predict_lr))

Naive Bayes Classifier: 
 [[764 286]
 [251 699]]
Logistic Regression Classifier: 
 [[762 288]
 [250 700]]


In [70]:
# Classification Reports
# ---
#
print("Naive Bayes Classifier: \n", classification_report(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", classification_report(y_test, y_predict_lr))

Naive Bayes Classifier: 
               precision    recall  f1-score   support

           0       0.75      0.73      0.74      1050
           4       0.71      0.74      0.72       950

    accuracy                           0.73      2000
   macro avg       0.73      0.73      0.73      2000
weighted avg       0.73      0.73      0.73      2000

Logistic Regression Classifier: 
               precision    recall  f1-score   support

           0       0.75      0.73      0.74      1050
           4       0.71      0.74      0.72       950

    accuracy                           0.73      2000
   macro avg       0.73      0.73      0.73      2000
weighted avg       0.73      0.73      0.73      2000



**Evaluating our Models**

* **Accuracy:** the percentage of texts that were assigned the correct class (sentiment).
* **Precision:** for each class, the percentage of texts predicted as that class that actually belong to it.
* **Recall:** for each class, the percentage of texts that actually belong to that class that the model predicted as that class.
* **F1 Score:** the harmonic mean of precision and recall.
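As a worked check, these metrics can be recomputed by hand from the Naive Bayes confusion matrix printed earlier, where rows are true classes and columns are predicted classes, in the order 0 then 4. The variable names below are illustrative.

```python
# Confusion matrix for the Naive Bayes classifier, copied from the output above
#      predicted 0, predicted 4
cm = [[764, 286],   # true 0 (negative)
      [251, 699]]   # true 4 (positive)

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# Metrics for class 0 (negative)
precision_0 = cm[0][0] / (cm[0][0] + cm[1][0])  # correct 0s / all predicted 0
recall_0 = cm[0][0] / (cm[0][0] + cm[0][1])     # correct 0s / all true 0
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)  # harmonic mean

print(round(accuracy, 4), round(precision_0, 2), round(recall_0, 2), round(f1_0, 2))
# 0.7315 0.75 0.73 0.74
```

The rounded values agree with the class 0 row of the Naive Bayes classification report.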

To improve our model, we can try performing other text processing techniques that would better prepare our data for fitting. We can also use different vectorizing techniques, implement other machine learning models, and perform hyperparameter tuning.
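One way to try different vectorizer settings and hyperparameters together is a scikit-learn `Pipeline` searched with `GridSearchCV`. The sketch below runs on a tiny made-up corpus, and the parameter grid is an arbitrary assumption chosen for illustration. On the real data you would pass `df['lemmatized']` and `df['target']` instead, with a larger `cv` value.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus standing in for the tweets (0 = negative, 4 = positive)
texts = ['sad bad day', 'hate this rain', 'so tired and sad', 'bad bad cramp',
         'love this song', 'great happy day', 'so glad and happy', 'good good morning']
labels = [0, 0, 0, 0, 4, 4, 4, 4]

# Vectorizing and classifying in one pipeline keeps both steps inside each CV fold
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])

# Illustrative grid: n-gram range for the vectorizer, smoothing for Naive Bayes
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'clf__alpha': [0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring='accuracy')
search.fit(texts, labels)

print(search.best_params_)
print(search.predict(['happy good day']))
```

Because the fitted search object wraps the whole pipeline, it can be called on raw text directly, which is also how new tweets would be classified.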

### 5. Recommendations


Our best model, the Naive Bayes classifier, had an accuracy of 73.15%, and we can use it for classifying new tweets. We can improve this performance through hyperparameter tuning and further feature engineering.