<a href="https://colab.research.google.com/github/MbogoriL/text-analysis/blob/main/Text_Analysis_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<font color="#4b76b7">To start practicing, you will need to make a copy of it. Go to File > Save a Copy in Drive. You can then use the new copy that will appear in the new tab.</font>


# AfterWork Data Science: Getting Started with NLP Project

### Prerequisites

In [None]:
# Importing the required libraries
# ---
# 
import pandas as pd # library for data manipulation
import numpy as np  # library for scientific computations
import re           # regex library to perform text preprocessing
import string       # library to work with strings
import nltk         # library for natural language processing
import scipy.sparse # scientific computing (sparse matrices, used later to stack features)

### 1. Importing our Data

In [None]:
# Question: Given new tweets, create a sentiment analysis model that will 
# predict whether a tweet contains positive or negative sentiment.
# ---
# Dataset url = https://bit.ly/31kqByD 
# ---
#
df = pd.read_csv('https://bit.ly/31kqByD', encoding='latin-1', header=None)
df.head()

Unnamed: 0,0,1,2,3,4,5,6
0,,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,346508.0,0,2016177685,Wed Jun 03 06:18:50 PDT 2009,NO_QUERY,UriGrey,Obama forges his Muslim alliance against the c...
2,883537.0,4,1686152287,Sun May 03 04:02:08 PDT 2009,NO_QUERY,MariesolW,Had the most spectacular prom ever but now my...
3,764173.0,0,2298725623,Tue Jun 23 12:02:12 PDT 2009,NO_QUERY,ColleenBurns,I am overwhelmed today taking a moment to eat...
4,638701.0,0,2234530495,Thu Jun 18 23:13:54 PDT 2009,NO_QUERY,queenarchy,@lindork Tres sad. I was totally a Max fan. #...


### 2. Data Exploration

In [None]:
# We can determine the size of our dataset
# ---
#
df.shape

(10001, 7)

This dataset needs some cleaning: the columns have no names, and several of them are not needed to create our model. We will name the columns and drop the ones we don't need.

### 3. Data Preparation

#### Basic Data Cleaning Techniques

In [None]:
# We rename the columns for ease of referencing our columns later on
# ---
#
df.columns = ['id', 'target', 't_id', 'created_at', 'query', 'user', 'text']
df.head()

Unnamed: 0,id,target,t_id,created_at,query,user,text
0,,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,346508.0,0,2016177685,Wed Jun 03 06:18:50 PDT 2009,NO_QUERY,UriGrey,Obama forges his Muslim alliance against the c...
2,883537.0,4,1686152287,Sun May 03 04:02:08 PDT 2009,NO_QUERY,MariesolW,Had the most spectacular prom ever but now my...
3,764173.0,0,2298725623,Tue Jun 23 12:02:12 PDT 2009,NO_QUERY,ColleenBurns,I am overwhelmed today taking a moment to eat...
4,638701.0,0,2234530495,Thu Jun 18 23:13:54 PDT 2009,NO_QUERY,queenarchy,@lindork Tres sad. I was totally a Max fan. #...


In [None]:
# We retain the relevant columns by dropping the columns we don't need 
# for creating a sentiment analysis model. 
# ---
#
df = df.drop(['id', 't_id', 'created_at', 'query', 'user'], axis = 1)
df.head()

Unnamed: 0,target,text
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,Obama forges his Muslim alliance against the c...
2,4,Had the most spectacular prom ever but now my...
3,0,I am overwhelmed today taking a moment to eat...
4,0,@lindork Tres sad. I was totally a Max fan. #...


In [None]:
# Understanding the distribution of target
# ---
#
df.target.value_counts() 

0    5068
4    4933
Name: target, dtype: int64

In [None]:
# Let's determine whether our columns have the right data types
# ---
#
df.dtypes

target     int64
text      object
dtype: object

In [None]:
# What values are in our target variable?
# ---
#
df.target.unique()

array([0, 4])

These are the two classes to which each document (text) belongs. The target value 0 means a text with a negative sentiment, while that of 4 means a text with a positive sentiment. 
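The 0/4 coding can be awkward to read, and some tools expect the positive class to be labelled 1. A minimal sketch of one way to translate it; `LABEL_NAMES` and `to_binary` are illustrative names that do not appear elsewhere in this notebook.

```python
# Hypothetical helpers (not part of the original notebook) for the 0/4 coding.
LABEL_NAMES = {0: "negative", 4: "positive"}

def to_binary(target):
    """Map the dataset's 0/4 coding to 0 (negative) / 1 (positive)."""
    if target not in LABEL_NAMES:
        raise ValueError(f"unexpected target value: {target!r}")
    return 1 if target == 4 else 0

sample_targets = [0, 0, 4, 0, 0]  # the first five targets shown by df.head()
print([LABEL_NAMES[t] for t in sample_targets])
print([to_binary(t) for t in sample_targets])
```

On the dataframe itself, the same recoding could be written as `df['target'].map({0: 0, 4: 1})`. The rest of this notebook keeps the original 0/4 values.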

In [None]:
# Let's check for missing values 
# ---
# 
df.isnull().sum()

target    0
text      0
dtype: int64

We don't have any missing values, so we are good to go.

#### Text Processing

In [None]:
# Text Cleaning: Removing all urls/links
# ---
# 
df['text'] =  df['text'].apply(lambda x: re.sub(r'http\S+|www\S+|https\S+','', str(x)))
df[['text']].head()

Unnamed: 0,text
0,@switchfoot - A that's a bummer. You shoulda...
1,Obama forges his Muslim alliance against the c...
2,Had the most spectacular prom ever but now my...
3,I am overwhelmed today taking a moment to eat...
4,@lindork Tres sad. I was totally a Max fan. #...


In [None]:
# Text Cleaning: Removing @ and # characters or replace them with space
# ---
# YOUR CODE GOES BELOW
#
df['text']= df.text.str.replace('#',' ')
df['text'] = df.text.str.replace('@', ' ')
df.head()

Unnamed: 0,target,text
0,0,switchfoot - A that's a bummer. You shoulda...
1,0,Obama forges his Muslim alliance against the c...
2,4,Had the most spectacular prom ever but now my...
3,0,I am overwhelmed today taking a moment to eat...
4,0,lindork Tres sad. I was totally a Max fan. ...


In [None]:
# Text Cleaning: Conversion to lowercase
# ---
# YOUR CODE GOES BELOW
#
df['text'] = df.text.str.lower()
df.head()

Unnamed: 0,target,text
0,0,switchfoot - a that's a bummer. you shoulda...
1,0,obama forges his muslim alliance against the c...
2,4,had the most spectacular prom ever but now my...
3,0,i am overwhelmed today taking a moment to eat...
4,0,lindork tres sad. i was totally a max fan. ...


In [None]:
# Text Cleaning: Splitting concatenated words
# ---
# Performing this step will take a few minutes...
# ---
# YOUR CODE GOES BELOW
# 

# Installing wordninja and textblob
# ---
#
!pip3 install wordninja
!pip3 install textblob

# Importing those libraries
# ---
#
import wordninja 
from textblob import TextBlob



In [None]:
# Performing the split
# ---
#
df['text'] = df.text.apply(lambda x: wordninja.split(str(TextBlob(x))))  
df['text'] = df.text.str.join(' ')
df[['text']].sample(10) 

Unnamed: 0,text
3091,thanks s abby grey es
427,so i officially need a new computer all 3 of o...
5177,awesome i lost one of my diamond ea rings
3343,she wish the dream is true n will happy foreve...
8413,lakers so much for the star spangled magic und...
3504,with hometown buds in quincy ordering chinese ...
2215,miss go one tte 09 hmmm m i wanted to be on th...
8164,damn humidity today makes its feels like its 110
3411,work on the weekend is an evil thing
2284,something is wrong with my twitter app


In [None]:
# Text Cleaning: Removing punctuation characters
# ---
# YOUR CODE GOES BELOW
#
df['text'] =  df['text'].apply(lambda x: re.sub(r'[^\w\s]','', str(x)))
df.sample(10)

Unnamed: 0,target,text
3776,0,s agt so much for being a pas away now im suff...
8608,4,songz yuu up he yyyy yyyy yyyy tre yyyy
7433,4,wants some sunshine in her heart
1642,0,damn it what am i gonna wear now my skirt is s...
6385,4,wolverine was really good
9739,4,has map u mental access may have to wait until...
9047,4,geo kitten 78 a tad stressed out but it is gor...
505,4,why are you mad i miss you too baby
5908,0,donna mir field wh hh a a at i am tres jealous...
667,0,i hope i get that job in me dow hall i need a ...


In [None]:
# Text Cleaning: Removing stop words
# ---
# YOUR CODE GOES BELOW
# 
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
stop = stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
stop_words = set(stop)  # set membership tests are faster than scanning a list
df['text'] = df.text.apply(lambda x: " ".join(w for w in x.split() if w not in stop_words))
df[['text']].sample(5)

Unnamed: 0,text
2241,wow fair amount bright sunshine predicted rain...
5368,woo pp im women ess tim tams nice
9580,charley farley im kind love hathaway mum under...
4588,wrestler springsteen kill lot one fa vs
9291,nice carefree best ie laugh love


In [None]:
# Text Cleaning: Lemmatization
# ---
# YOUR CODE GOES BELOW
#

# For lemmatization, we will need to download wordnet
#
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
# Lemmatizing our text
# ---
from textblob import Word
df['text'] = df.text.apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()])) 
df.sample(10)

Unnamed: 0,target,text
4160,4,snorkel pop hey talking eeee
6973,4,preparing go help girl friend unpack
7951,0,jungle mag isnt
4157,4,chris misty belle im working summer camp june
7801,0,missed da show
6660,4,little white kitten currently sleeping inside ...
9143,0,exciting evening bh oh well
8273,4,sometimes hate thats normal dont worry bout
4597,0,lay
8889,4,song today black gold adeles cover sam spar ro...


We won't remove numerics because the text could lose meaning without them. We could also prepare our text further by performing spelling correction, but this is a resource-intensive process that we will skip for now.
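To illustrate what removing numerics would cost, the sketch below strips digits from one cleaned tweet that appears in this notebook's sample output. `strip_digits` is a hypothetical helper written only for this illustration; it is not applied to the dataset.

```python
import re

def strip_digits(text):
    """Remove digit runs and collapse the leftover whitespace."""
    return " ".join(re.sub(r"\d+", "", text).split())

tweet = "blink 182 come argentina please"  # cleaned tweet from this notebook's samples
print(strip_digits(tweet))  # blink come argentina please
```

The band name "blink 182" becomes the ordinary word "blink", which is the kind of information loss this step avoids.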

#### Feature Engineering Techniques 

In [None]:
df.sample(2)

Unnamed: 0,target,text,length_of_text
9970,0,leigh sh hi friend good slept lot running arou...,83
3455,4,going take shower ill get ready birthday tomor...,73




In [None]:
# Feature Construction: Length of tweet
# ---
# YOUR CODE GOES BELOW

df['length_of_text'] = df.text.str.len()
df[['text', 'length_of_text']].sample(5)

In [None]:
# Feature Construction: Word count 
# ---
# YOUR CODE GOES BELOW
# 
df['word_count'] = df.text.apply(lambda x: len(str(x).split(" ")))
df[['text', 'word_count']].sample(5)

Unnamed: 0,text,word_count
5715,alia liao thx,3
8616,megas urus x last 2 brisbane wanted go one las...,15
5103,unlocked song guitar hero metallica tried war ...,11
3258,really unfortunate ill probably never renaissa...,7
1868,dd lovato fan 101 could follow please would re...,10


In [None]:
# Feature Construction: Average word length (characters per word)
# ---
# YOUR CODE GOES BELOW
#
def avg_word(sentence):
  words = sentence.split()
  try:
    z = (sum(len(word) for word in words)/len(words))
  except ZeroDivisionError:
    z = 0 
  return z

df['avg_word_length'] = df.text.apply(lambda x: avg_word(x))
df[['text','avg_word_length']].sample(5)


Unnamed: 0,text,avg_word_length
8214,blink 182 come argentina please,5.4
327,think taylor swift 13 best person ever even th...,4.4375
8944,karen 230683 thought one admire iphone doesnt ...,5.0
5872,paul saunders well nice meet quo turning passi...,5.0
9846,hannah tanna h 12 im going see,3.428571


In [None]:
# Feature Construction: Noun count
# ---
# YOUR CODE GOES BELOW
#
# First, we download punkt and the averaged_perceptron_tagger into our notebook environment,
# which allows us to find the part-of-speech tags.
# ---
#
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# We create a function that counts the words in a sentence carrying a given part-of-speech tag
pos_dic = {
    'noun' : ['NN','NNS','NNP','NNPS'],
    'pron' : ['PRP','PRP$','WP','WP$'],
    'verb' : ['VB','VBD','VBG','VBN','VBP','VBZ'],
    'adj' :  ['JJ','JJR','JJS'],
    'adv' : ['RB','RBR','RBS','WRB']
}

def pos_check(x, flag):
    cnt = 0
    try:
        wiki = TextBlob(x)
        for tup in wiki.tags:
            ppo = list(tup)[1]
            if ppo in pos_dic[flag]:
                cnt += 1
    except Exception:
        # Tagging failed for this text; count it as having no matching tags
        pass
    return cnt

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [None]:
# Noun Count
# ---
# YOUR CODE GOES BELOW
#
df['noun_count'] = df.text.apply(lambda x: pos_check(x, 'noun'))
df[['text','noun_count']].sample(10)

Unnamed: 0,text,noun_count
6144,jesse mccartney hey coming uk time soon really...,6
7929,anticipating,0
9524,teeth insist falling apart,1
7394,officially bored mind 1 outside guess ill go b...,7
7748,rite look forward final,1
4755,tickle joey love relaxing watch movie scary x ox,6
3091,thanks abby grey e,3
531,beautiful day sunshine amp day yay oh amp ice ...,7
7238,go ood morning ever morning lol want know like...,7
2662,conversation kill bei basically ended youre go...,4


In [None]:
# Feature Construction: Verb count
# ---
# YOUR CODE GOES BELOW
#
df['verb_count'] = df.text.apply(lambda x: pos_check(x, 'verb'))
df[['text','verb_count']].sample(10)

Unnamed: 0,text,verb_count
3162,playing sorority life facebook,1
1023,teri la thanks ill add list along rerun mary t...,1
8506,ooo oo eeeeee ely dont want go skool mor ra,2
4213,searching agency know agency tell,3
2043,im sorry cant keep way used update come trio n...,2
5941,love boyfriend quo class quo quo shit quo quo ...,3
8972,ginger cm 2 prize left first party second one ...,4
7681,em praying hope thing get better,2
2682,exam lame nnnn psych tomorrow,0
9437,19 f chi 75 dunno u wanna know eat sour drop,1


In [None]:
# Feature Construction: Adjective count / Tweet
# ---
# YOUR CODE GOES BELOW
df['adj_count'] = df.text.apply(lambda x: pos_check(x, 'adj'))
df[['text','adj_count']].sample(10)

Unnamed: 0,text,adj_count
5207,www dodger alright still love,0
2147,summer kid wrote massive reply fb lost write hah,2
6740,human doodad think made worse,2
2095,summer cottage yesterday fun,0
8947,recognizing best friend 8 th grade bad,2
1379,sitting fo wi fi phone doesnt tether,1
5766,ad iq shun cant see photo presenter admin dani...,2
8433,miracle believe,0
8685,e sam anna promotion home bar get try diff bee...,3
7470,eagle powder,0


In [None]:
# Feature Construction: Adverb count / Tweet
# ---
# YOUR CODE GOES BELOW
#
df['adv_count'] = df.text.apply(lambda x: pos_check(x, 'adv'))
df[['text','adv_count']].sample(10)

Unnamed: 0,text,adv_count
7355,rican barbie e love name jasmine named daughter,0
342,jbs live chat great,1
6348,detail freak med student yay isnt wonderful,0
4931,bright side maybe infomercial put back sleep h...,1
3978,testing think panda alpha 2 much better,1
8049,kl,0
4716,sky blue design 2 talk bt yummy way find saw t...,0
3562,splendid day w rk home w gpa,0
6230,cca xvi px miss looking forward chat today,1
8097,sorry decided follow friday week many going an...,2


In [None]:
# Feature Construction: Pronoun 
# ---
# YOUR CODE GOES BELOW
#
df['pron_count'] = df.text.apply(lambda x: pos_check(x, 'pron'))
df[['text','pron_count']].sample(10)

Unnamed: 0,text,pron_count
3208,killing plata boy aim bae kev rosa already left,0
5439,im heart good party inc glas,0
1381,jason me nick know kish ka thats fave kish ka ...,1
8305,matt costa music hawaii matt costa doubt could...,0
1998,singapore shall go kot,0
1378,grandparent wishing pool,0
6491,thanks follow friday love here another guy fol...,0
9322,loving weather back leeds tomorrow cant wait,1
7435,mano mio bad basic removed darn apple afraid c...,0
9674,got 500 quid back tax man nice,0


In [None]:
# Feature Construction: Subjectivity
# ---
# YOUR CODE GOES BELOW
# 
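def get_subjectivity(text):
    # `unicode()` exists only in Python 2; on Python 3 it raises NameError,
    # which a bare `except` silently turns into 0.0 for every row.
    # Python 3 strings are already Unicode, so we pass the text straight in.
    try:
        subj = TextBlob(str(text)).sentiment.subjectivity
    except Exception:
        subj = 0.0
    return subj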

df['subjectivity'] = df.text.apply(get_subjectivity)
df[['text', 'subjectivity']].sample(10)

Unnamed: 0,text,subjectivity
5587,checking bing v google result similar ranking ...,0.0
5055,havent tweeted 3 day lt got back long weekend ...,0.0
8520,good morning,0.0
1696,kirby good luck website launch today,0.0
9256,rufus nod uf u mountain work better work suck ...,0.0
3089,j ric z love song one fave colla b soon hah,0.0
5957,ao scott pan whatever work letdown larry funny...,0.0
5151,ich v erste exactly reason im using wordpress ...,0.0
2100,song heart excellent news delighted,0.0
1338,brando ly nicole ahh hh h wish could,0.0


In [None]:
# Feature Construction: Polarity
# ---
# YOUR CODE GOES BELOW
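def get_polarity(text):
    # `unicode()` is Python 2 only (NameError on Python 3), so we use `str`.
    # TextBlob polarity lies in [-1, 1]; it is shifted to [0, 1] here because
    # MultinomialNB, fitted later on these features, rejects negative values.
    try:
        pol = (TextBlob(str(text)).sentiment.polarity + 1) / 2
    except Exception:
        pol = 0.5  # neutral on the shifted scale
    return pol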

df['polarity'] = df.text.apply(get_polarity)
df[['text', 'polarity']].sample(10)


Unnamed: 0,text,polarity
1901,back durham greensboro cook closed hmmm,0.0
2207,surgery tom morrow,0.0
6718,way home anger outlet jane eyre,0.0
3998,amanda 2610 miss next couple tutorial going ab...,0.0
4561,meet classmate classmate last school year voti...,0.0
7505,homemade yogurt like homemade sour milk time g...,0.0
6764,salonen lol chill en,0.0
3216,jt harris 3 found tv stream live vid canadian ...,0.0
1822,going g sunset tt shop p ping shit miss b,0.0
1111,shin tv yes always camera nothing exciting hap...,0.0


In [None]:
# Feature Construction: Word Level N-Gram TF-IDF Feature  
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word', ngram_range=(1,3),  stop_words= 'english')
df_word_vect = tfidf.fit_transform(df['text']) 
df_word_vect.toarray()                                 

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [None]:
# Feature Construction: Character Level N-Gram TF-IDF 
# (stop_words is omitted: scikit-learn applies it only when analyzer='word')
tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='char', ngram_range=(1,3))
df_char_vect = tfidf.fit_transform(df['text'])
df_char_vect.toarray()

array([[0.22513591, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.24420445, 0.        , 0.        , ..., 0.        , 0.08323393,
        0.        ],
       [0.23114214, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.21991948, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.18421819, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.28106448, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [None]:
# Let's prepare the constructed features for modeling
# ---
#
X_metadata = np.array(df.iloc[:, 2:12])
X_metadata

array([[49.        ,  9.        ,  4.55555556, ...,  0.        ,
         0.        ,  0.        ],
       [67.        , 11.        ,  5.18181818, ...,  0.        ,
         0.        ,  0.        ],
       [81.        , 12.        ,  5.83333333, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [45.        ,  8.        ,  4.75      , ...,  0.        ,
         0.        ,  0.        ],
       [34.        ,  6.        ,  4.83333333, ...,  0.        ,
         0.        ,  0.        ],
       [44.        , 10.        ,  3.5       , ...,  0.        ,
         0.        ,  0.        ]])

In [None]:
# We combine our two tfidf (sparse) matrices and X_metadata
# ---
#
X = scipy.sparse.hstack([df_word_vect, df_char_vect,  X_metadata])
X

<10001x2010 sparse matrix of type '<class 'numpy.float64'>'
	with 938213 stored elements in COOrdinate format>

In [None]:
# Getting our response variable
# ---
#
y = np.array(df.iloc[:, 0])
y

array([0, 0, 4, ..., 0, 4, 0])

### 4. Data Modelling

During this step, we will use machine learning algorithms to train and test our sentiment analysis models.

In [None]:
# Splitting our data
# ---
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Fitting our model
# ---
#

# Importing the algorithms
from sklearn.naive_bayes import MultinomialNB 
from sklearn.linear_model import LogisticRegression

nb_classifier = MultinomialNB() 
lr_classifier = LogisticRegression(max_iter=1000) 

# Training our model
nb_classifier.fit(X_train, y_train) 
lr_classifier.fit(X_train, y_train)

LogisticRegression(max_iter=1000)

In [None]:
# Making predictions
# ---
#
y_predict_nb = nb_classifier.predict(X_test) 
y_predict_lr = lr_classifier.predict(X_test)

In [None]:
# Evaluating the Models
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Accuracy scores
# ---
#
print("Naive Bayes Classifier:\n", accuracy_score(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", accuracy_score(y_test, y_predict_lr))

Naive Bayes Classifier:
 0.7006496751624188
Logistic Regression Classifier: 
 0.7136431784107946


In [None]:
# Confusion matrices
# ---
# 
print("Naive Bayes Classifier: \n", confusion_matrix(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", confusion_matrix(y_test, y_predict_lr))

Naive Bayes Classifier: 
 [[713 263]
 [336 689]]
Logistic Regression Classifier: 
 [[718 258]
 [315 710]]


In [None]:
# Classification Reports
# ---
#
print("Naive Bayes Classifier: \n", classification_report(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", classification_report(y_test, y_predict_lr))

Naive Bayes Classifier: 
               precision    recall  f1-score   support

           0       0.68      0.73      0.70       976
           4       0.72      0.67      0.70      1025

    accuracy                           0.70      2001
   macro avg       0.70      0.70      0.70      2001
weighted avg       0.70      0.70      0.70      2001

Logistic Regression Classifier: 
               precision    recall  f1-score   support

           0       0.70      0.74      0.71       976
           4       0.73      0.69      0.71      1025

    accuracy                           0.71      2001
   macro avg       0.71      0.71      0.71      2001
weighted avg       0.71      0.71      0.71      2001



**Evaluating our Models**

* **Accuracy:** the percentage of all tweets that were assigned the correct class.
* **Precision:** of the tweets the classifier assigned to a class, the percentage that truly belong to that class.
* **Recall:** of the tweets that truly belong to a class, the percentage the classifier assigned to that class.
* **F1 Score:** the harmonic mean of precision and recall.
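As a worked check of these definitions, the sketch below recomputes the metrics by hand from the logistic regression confusion matrix reported above. It assumes scikit-learn's convention that rows are true classes and columns are predicted classes; `class_metrics` is an illustrative helper, not part of the notebook.

```python
# Confusion matrix reported above for logistic regression:
# rows are true classes (0, 4), columns are predicted classes (0, 4).
cm = [[718, 258],
      [315, 710]]

def class_metrics(cm, k):
    """Return (precision, recall, f1) for the class at index k."""
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(len(cm))) - tp  # predicted k, truly another class
    fn = sum(cm[k]) - tp                             # truly k, predicted another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total
print(f"accuracy: {accuracy:.2f}")
for k, label in enumerate([0, 4]):
    p, r, f = class_metrics(cm, k)
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The printed values agree with the logistic regression classification report: 0.70 / 0.74 / 0.71 for class 0, 0.73 / 0.69 / 0.71 for class 4, and an accuracy of 0.71.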

To improve our model, we can try other text processing techniques that would better prepare our data for fitting. We can also use different vectorizing techniques, implement other machine learning models, and perform hyperparameter tuning.
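One possible shape for the hyperparameter tuning step is a grid search over a vectorizer-plus-classifier pipeline. This is only a sketch under stated assumptions: the toy `texts` and `labels` are made up for illustration (they are not taken from the tweets dataset), and the parameter grid is an arbitrary starting point rather than a recommendation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data for illustration only (not taken from the tweets dataset).
texts = [
    "love this sunny day", "great fun with friends",
    "best prom ever", "so happy today",
    "miss my old phone", "sad and tired today",
    "hate this rain", "worst day ever",
]
labels = [4, 4, 4, 4, 0, 0, 0, 0]

# Vectorizing inside the pipeline means each CV fold fits TF-IDF on its own training part.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Assumed search space: n-gram range of the vectorizer and regularization strength C.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_)
print(search.best_score_)
```

On the real data, `texts` would be the cleaned `df['text']` column and `labels` the `target` column, with a larger `cv` value.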

### 5. Recommendations


Our best model, logistic regression, had an accuracy of about 71%, and we can use it to classify new tweets. We can improve this performance through hyperparameter tuning and further feature engineering.