<font color="#4b76b7">To start practicing, you will need to make a copy of it. Go to File > Save a Copy in Drive. You can then use the new copy that will appear in the new tab.</font>


# AfterWork Data Science: Getting Started with NLP Project

### Prerequisites

In [1]:
# Importing the required libraries
# ---
# 
import pandas as pd # library for data manipulation
import numpy as np  # librariy for scientific computations
import re           # regex library to perform text preprocessing
import string       # library to work with strings
import nltk         # library for natural language processing
import scipy        # scientific conputing 

### 1. Importing our Data

In [2]:
# Question: Given a new tweets, create a sentiment analysis model that will 
# predict whether a tweet will contain positive or negative sentiment.
# ---
# Dataset url = https://bit.ly/31kqByD 
# ---
#
df = pd.read_csv('https://bit.ly/31kqByD', encoding='latin-1')
df.head()

Unnamed: 0.1,Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,346508,0,2016177685,Wed Jun 03 06:18:50 PDT 2009,NO_QUERY,UriGrey,Obama forges his Muslim alliance against the c...
1,883537,4,1686152287,Sun May 03 04:02:08 PDT 2009,NO_QUERY,MariesolW,Had the most spectacular prom ever but now my...
2,764173,0,2298725623,Tue Jun 23 12:02:12 PDT 2009,NO_QUERY,ColleenBurns,I am overwhelmed today taking a moment to eat...
3,638701,0,2234530495,Thu Jun 18 23:13:54 PDT 2009,NO_QUERY,queenarchy,@lindork Tres sad. I was totally a Max fan. #...
4,664821,0,2244623416,Fri Jun 19 14:59:46 PDT 2009,NO_QUERY,reinventingjess,"Crap, I was counting down the hours until my d..."


### 2. Data Exploration

In [3]:
# We can determine the size of our dataset
# ---
#
df.shape

(10000, 7)

Seems this dataset will need some data cleaning i.e. columns. We also don't need some columns to perform create our model. We will drop those columns.

### 3. Data Preparation

#### Basic Data Cleaning Techniques

In [4]:
# We rename the columns for ease of referencing our columns later on
# ---
#
df.columns = ['id', 'target', 't_id', 'created_at', 'query', 'user', 'text']
df.head()

Unnamed: 0,id,target,t_id,created_at,query,user,text
0,346508,0,2016177685,Wed Jun 03 06:18:50 PDT 2009,NO_QUERY,UriGrey,Obama forges his Muslim alliance against the c...
1,883537,4,1686152287,Sun May 03 04:02:08 PDT 2009,NO_QUERY,MariesolW,Had the most spectacular prom ever but now my...
2,764173,0,2298725623,Tue Jun 23 12:02:12 PDT 2009,NO_QUERY,ColleenBurns,I am overwhelmed today taking a moment to eat...
3,638701,0,2234530495,Thu Jun 18 23:13:54 PDT 2009,NO_QUERY,queenarchy,@lindork Tres sad. I was totally a Max fan. #...
4,664821,0,2244623416,Fri Jun 19 14:59:46 PDT 2009,NO_QUERY,reinventingjess,"Crap, I was counting down the hours until my d..."


In [5]:
# We retain the relevant columns by dropping the columns we don't need 
# for creating a sentiment analysis model. 
# ---
#
df = df.drop(['id', 't_id', 'created_at', 'query', 'user'], axis = 1)
df.head()

Unnamed: 0,target,text
0,0,Obama forges his Muslim alliance against the c...
1,4,Had the most spectacular prom ever but now my...
2,0,I am overwhelmed today taking a moment to eat...
3,0,@lindork Tres sad. I was totally a Max fan. #...
4,0,"Crap, I was counting down the hours until my d..."


In [6]:
# Understanding the distribution of target
# ---
#
df.target.value_counts() 

0    5067
4    4933
Name: target, dtype: int64

In [7]:
# Let's determine whether our columns have the right data types
# ---
#
df.dtypes

target     int64
text      object
dtype: object

In [8]:
# What values are in our target variable?
# ---
#
df.target.unique()

array([0, 4])

These are the two classes to which each document (text) belongs. The target value 0 means a text with a negative sentiment, while that of 4 means a text with a positive sentiment. 

In [9]:
# Let's check for missing values 
# ---
# 
df.isnull().sum()

target    0
text      0
dtype: int64

We don't have any missing values, so we are good to go.

#### Text Processing

In [10]:
# Text Cleaning: Removing all urls/links
# ---
# 
df['text'] =  df['text'].apply(lambda x: re.sub(r'http\S+|www\S+|https\S+','', str(x)))
df[['text']].head()

Unnamed: 0,text
0,Obama forges his Muslim alliance against the c...
1,Had the most spectacular prom ever but now my...
2,I am overwhelmed today taking a moment to eat...
3,@lindork Tres sad. I was totally a Max fan. #...
4,"Crap, I was counting down the hours until my d..."


In [11]:
# Text Cleaning: Removing @ and # characters or replace them with space
# ---
# YOUR CODE GOES BELOW
#
df['text'] = df.text.str.replace('#',' ')
df['text'] = df.text.str.replace('@',' ')

In [14]:
# Text Cleaning: Conversion to lowercase
# ---
# YOUR CODE GOES BELOW
#
df['text'] = df.text.apply(lambda x: " ".join(x.lower() for x in x.split()))

In [21]:
# Text Cleaning: Splitting concatenated words
# ---
# Performing this step will take few minutes...
# ---
# YOUR CODE GOES BELOW
# 

# Installing wordnija and textblob
!pip install wordninja
!pip install textblob

# Importing those libraries
from textblob import TextBlob
import wordninja


In [22]:
# Performing the split
# ---
#
for wordstring in df['text']:
    split = wordninja.split(wordstring)

In [24]:
# Text Cleaning: Removing punctuation characters
# ---
# YOUR CODE GOES BELOW
#
df['text'] = df.text.str.replace('[^\w\s]','')

In [26]:
# Text Cleaning: Removing stop words
# ---
# YOUR CODE GOES BELOW
# 

nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')
from textblob import Word
Word = Word

df['text'] = df.text.apply(lambda x: " ".join(x for x in x.split() if x not in stop))


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [28]:
# Text Cleaning: Lemmatization
# ---
# YOUR CODE GOES BELOW
#

# For lemmatization, we will need to download wordnet
nltk.download('wordnet')


# Lemmatizing our text
# ---
#
df['text'] = df.text.apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()])) 

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


We won't remove numerics because we could loose meaning of our text if we lost the numerics. We could also further prepare our text by performing spelling correction but this is a resource intensive process that we will skip for now.

#### Feature Engineering Techniques 

In [29]:
# Feature Construction: Length of tweet
# ---
# YOUR CODE GOES BELOW
#
def features(text):
    df['length_of_tweet'] = df.text.str.len()
df.text.apply(features)
df.sample(5)    

Unnamed: 0,target,text,length_of_tweet
3698,0,deliciatan ok im depressed sophia ellie going ...,80
1804,4,need look old ballet stuff hahah im excited co...,54
8840,0,work depressing standing checkout near window ...,89
1117,0,veggin unrelaxing saturday,26
1208,4,ajsouthern delicious thanks,27


In [30]:
# Feature Construction: Word count 
# ---
# YOUR CODE GOES BELOW
# 
def features(text):
    df['word_count '] = df.text.apply(lambda x: len(str(x).split(" ")))
df.text.apply(features)
df.sample(5)  

Unnamed: 0,target,text,length_of_tweet,word_count
3599,0,paddington train waiting move back bristol gon...,85,15
7973,0,chioma_ haha kick thats nice lol attemting im ...,91,15
3271,0,stupid phone volume couldnt hear ring hubby ca...,50,8
5253,0,na okget,8,2
2530,0,wondering keep receiving status change note al...,94,14


In [32]:
# Feature Construction: Word density (Average no. of words / tweet)
# ---
# YOUR CODE GOES BELOW
#
def avg_word(sentence):
  words = sentence.split()
  try:
    z = (sum(len(word) for word in words)/len(words))
  except ZeroDivisionError:
    z = 0 
  return z
def features(text):
    df['word_density'] = df.text.apply(lambda x: avg_word(x)) 
df.text.apply(features)
df.sample(5) 

Unnamed: 0,target,text,length_of_tweet,word_count,word_density
2358,0,empath_eia know share sentiment entirely need ...,67,10,5.8
2844,0,noticed cut kevins guitar solo quotfly mequot ...,51,8,5.5
8618,0,doingnothing real cool mom say cant anything,44,7,5.428571
7926,4,getting ready watch mtv award new moon trailor,46,8,4.875
1973,4,mcscouser oh wont stop sarcastic thats ill car...,63,10,5.4


In [35]:
# Feature Construction: Noun count
# ---
# YOUR CODE GOES BELOW
#
# First, we will download the punkt and the averaged_perceptron_tagger into our notebook environment. 
# which will allow us to find the part of speech tags.
# ---
#
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

# We create the function to check and get the part of speech tag count of words in a given sentence
def part_of_speech(x, flag):
    cnt = 0
    try:
        wiki = TextBlob(x)
        for tup in wiki.tags:
            ppo = list(tup)[1]
            if ppo in pos_dic[flag]:
                cnt += 1
    except:
        pass
    return cnt

In [37]:
# Noun Count
# ---
# YOUR CODE GOES BELOW
#
# def features(text):
df['noun_count'] = df.text.apply(lambda x: part_of_speech(x, 'noun'))
df.text.apply(features)
df.sample(5) 

Unnamed: 0,target,text,length_of_tweet,word_count,word_density,noun_count
7225,4,ph mengwith ph accept,21,4,4.5,0
5943,4,cant believe 8 hour work ive 4 already im happ...,68,13,4.307692,0
1683,0,need anbesol,12,2,5.5,0
2957,4,doctorfollowill jeez thank u countrypunkgarage...,92,15,5.2,0
1758,0,finish hannah montana season 3 today bring sea...,89,15,5.0,0


In [38]:
# Feature Construction: Verb count
# ---
# YOUR CODE GOES BELOW
df['verb_count'] = df.text.apply(lambda x: part_of_speech(x, 'verb'))
df.text.apply(features)
df.sample(5) 

Unnamed: 0,target,text,length_of_tweet,word_count,word_density,noun_count,verb_count
4244,0,eatn blt wishn tha parade would come thru oakland,49,9,4.555556,0,0
1140,4,encision wish lived beach im sayin,34,6,4.833333,0,0
6537,0,slicc3081 wth u go shawnee one night next ur c...,66,12,4.583333,0,0
7745,0,ok officially hate home alone im hearing clock...,103,17,5.117647,0,0
2777,0,im looking forward going home tomorrow really ...,67,10,5.8,0,0


In [39]:
# Feature Construction: Adjective count / Tweet
# ---
# YOUR CODE GOES BELOW
df['adjective_count'] = df.text.apply(lambda x: part_of_speech(x, 'adj'))
df.text.apply(features)
df.sample(5) 


Unnamed: 0,target,text,length_of_tweet,word_count,word_density,noun_count,verb_count,adjective_count
9536,4,nycjill twilight quiz nerd,26,4,5.75,0,0,0
254,4,tweetingsfromua painted desert fave well em dr...,79,12,5.666667,0,0,0
8688,4,kelly_straycat knowthanks twitterbut thank nev...,55,5,10.2,0,0,0
418,4,ajschokora think alike better watch,35,5,6.2,0,0,0
1107,0,dad tv loud get creeped cuz im dodgy neighbour...,63,11,4.818182,0,0,0


In [40]:
# Feature Construction: Adverb count / Tweet
# ---
# YOUR CODE GOES BELOW
#
df['adverb_count'] = df.text.apply(lambda x: part_of_speech(x, 'adv'))
df.text.apply(features)
df.sample(5) 

Unnamed: 0,target,text,length_of_tweet,word_count,word_density,noun_count,verb_count,adjective_count,adverb_count
9975,4,lihis gonna bring white guitar sign might let ...,60,10,5.1,0,0,0,0
2277,0,knocked oil burner amp cracked sad,34,6,4.833333,0,0,0,0
6962,4,sinasabi good evening,21,3,6.333333,0,0,0,0
4130,4,krazymary lol daughter turned 2 drive crazy wo...,72,12,5.083333,0,0,0,0
4489,0,mac n cheese hate life,22,5,3.6,0,0,0,0


In [41]:
# Feature Construction: Pronoun 
# ---
# YOUR CODE GOES BELOW
#
df['pronoun_count'] = df.text.apply(lambda x: part_of_speech(x, 'pron'))
df.text.apply(features)
df.sample(5) 

Unnamed: 0,target,text,length_of_tweet,word_count,word_density,noun_count,verb_count,adjective_count,adverb_count,pronoun_count
8661,0,slimmduddy might dont even know yet know birth...,54,9,5.111111,0,0,0,0,0
2914,0,iblvtoo lol right bad mirror,28,5,4.8,0,0,0,0,0
974,4,getting organized big week networking marketin...,96,12,7.083333,0,0,0,0,0
109,0,exsmaps kusanagi apologizing tv right could fi...,89,13,5.923077,0,0,0,0,0
5144,0,ug woke 4 hour nap cant sleep,29,7,3.285714,0,0,0,0,0


In [42]:
# Feature Construction: Subjectivity
# ---
# YOUR CODE GOES BELOW
# 
def subjectivity(text):
    try:
        textblob = TextBlob(unicode(text, 'utf-8'))
        subj = textblob.sentiment.subjectivity
    except:
        subj = 0.0
    return subj

df['subjectivity'] = df.text.apply(subjectivity)
df.text.apply(features)
df.sample(5)     

Unnamed: 0,target,text,length_of_tweet,word_count,word_density,noun_count,verb_count,adjective_count,adverb_count,pronoun_count,subjectivity
234,4,probably shouldnt watching titanic tnt cruise ...,82,12,5.916667,0,0,0,0,0,0.0
7190,0,hate u miss someone want talk u end checkin ur...,68,15,3.6,0,0,0,0,0,0.0
2213,4,got new phone bothering look twitter mobile nu...,91,14,5.571429,0,0,0,0,0,0.0
5312,4,watching grey garden,20,3,6.0,0,0,0,0,0,0.0
2168,4,tedx novel ideal coming chennai expecting kiru...,67,10,5.8,0,0,0,0,0,0.0


In [43]:
# Feature Construction: Polarity
# ---
# YOUR CODE GOES BELOW
# 
def get_polarity(text):
    try:
        textblob = TextBlob(unicode(text, 'utf-8'))
        pol = textblob.sentiment.polarity
    except:
        pol = 0.0
    return pol
df['polarity'] = df.text.apply(get_polarity)
df.text.apply(features)
df.sample(5)  

Unnamed: 0,target,text,length_of_tweet,word_count,word_density,noun_count,verb_count,adjective_count,adverb_count,pronoun_count,subjectivity,polarity
883,0,crystalchappell im stuck spending night servin...,60,8,6.625,0,0,0,0,0,0.0,0.0
5592,0,suddenly got feeling missing friend im emo hahaha,49,8,5.25,0,0,0,0,0,0.0,0.0
2369,4,pandamayhem wonderful rain free nkotb day im g...,80,14,4.785714,0,0,0,0,0,0.0,0.0
7427,4,tffdavid well met office forecasting warmer av...,105,18,4.888889,0,0,0,0,0,0.0,0.0
9253,4,bl4ckw0lf wonder would come name really palind...,91,13,6.076923,0,0,0,0,0,0.0,0.0


In [44]:
# Feature Construction: Word Level N-Gram TF-IDF Feature 
# ---
# YOUR CODE GOES BELOW
#
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word', ngram_range=(1,3),  stop_words= 'english')
df_word_vect = tfidf.fit_transform(df.text) 


In [45]:
# Feature Construction: Character Level N-Gram TF-IDF Feature
# ---
# YOUR CODE GOES BELOW
# 
tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='char', ngram_range=(1,3),  stop_words= 'english')
df_char_vect = tfidf.fit_transform(df.text)

In [46]:
# Let's prepare the constructed features for modeling
# ---
#
X_metadata = np.array(df.iloc[:, 2:12])
X_metadata

array([[67.        , 11.        ,  5.18181818, ...,  0.        ,
         0.        ,  0.        ],
       [81.        , 12.        ,  5.83333333, ...,  0.        ,
         0.        ,  0.        ],
       [40.        ,  6.        ,  5.83333333, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [46.        ,  8.        ,  4.875     , ...,  0.        ,
         0.        ,  0.        ],
       [34.        ,  6.        ,  4.83333333, ...,  0.        ,
         0.        ,  0.        ],
       [40.        ,  6.        ,  5.83333333, ...,  0.        ,
         0.        ,  0.        ]])

In [47]:
# We combine our two tfidf (sparse) matrices and X_metadata
# ---
#
X = scipy.sparse.hstack([df_word_vect, df_char_vect,  X_metadata])
X

<10000x2010 sparse matrix of type '<class 'numpy.float64'>'
	with 892312 stored elements in COOrdinate format>

In [48]:
# Getting our response variable
# ---
#
y = np.array(df.iloc[:, 0])
y

array([0, 4, 0, ..., 0, 4, 0])

### 4. Data Modelling

During this step, we will use machine learning algorithms to train and test our sentiment analysis models.

In [49]:
# Splitting our data
# ---
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [50]:
# Fitting our model
# ---
#

# Importing the algorithms
from sklearn.naive_bayes import MultinomialNB 
from sklearn.linear_model import LogisticRegression

nb_classifier = MultinomialNB() 
lr_classifier = LogisticRegression(max_iter=1000) 

# Training our model
nb_classifier.fit(X_train, y_train) 
lr_classifier.fit(X_train, y_train)

LogisticRegression(max_iter=1000)

In [51]:
# Making predictions
# ---
#
y_predict_nb = nb_classifier.predict(X_test) 
y_predict_lr = lr_classifier.predict(X_test)

In [52]:
# Evaluating the Models
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Accuracy scores
# ---
#
print("Naive Bayes Classifier:\n", accuracy_score(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", accuracy_score(y_test, y_predict_lr))

Naive Bayes Classifier:
 0.726
Logistic Regression Classifier: 
 0.7285


In [53]:
# Confusion matrices
# ---
# 
print("Naive Bayes Classifier: \n", confusion_matrix(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", confusion_matrix(y_test, y_predict_lr))

Naive Bayes Classifier: 
 [[754 296]
 [252 698]]
Logistic Regression Classifier: 
 [[758 292]
 [251 699]]


In [54]:
# Classification Reports
# ---
#
print("Naive Bayes Classifier: \n", classification_report(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", classification_report(y_test, y_predict_lr))

Naive Bayes Classifier: 
               precision    recall  f1-score   support

           0       0.75      0.72      0.73      1050
           4       0.70      0.73      0.72       950

    accuracy                           0.73      2000
   macro avg       0.73      0.73      0.73      2000
weighted avg       0.73      0.73      0.73      2000

Logistic Regression Classifier: 
               precision    recall  f1-score   support

           0       0.75      0.72      0.74      1050
           4       0.71      0.74      0.72       950

    accuracy                           0.73      2000
   macro avg       0.73      0.73      0.73      2000
weighted avg       0.73      0.73      0.73      2000



**Evaluation our Models**

* **Accuracy:** the percentage of texts that were assigned the correct topic.
* **Precision:** the percentage of texts the classifier classified correctly out of the total number of texts it predicted for each topic
* **Recall:** the percentage of texts the model predicted for each topic out of the total number of texts it should have predicted for that topic.
* **F1 Score:** the average of both precision and recall.

To improve our model, we can try perfoming other text processing techniques that would better prepare our data for fitting our model. We can also use different vectorizing techniques, implement other machine learning models and perform hyperparameter tuning.

### 5. Recommendations


Our best model had an accuracy of 72.85% and use it for classifying newer tweets. We can improve this performance by performing hyperparameter tuning and feature engineering methods. 