<a href="https://colab.research.google.com/github/Ckiteme/CKiteme-Asignment-Getting-Started-with-Text-Analysis/blob/main/CKiteme_Asignment_Getting_Started_with_Text_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<font color="#4b76b7">To start practicing, you will need to make a copy of it. Go to File > Save a Copy in Drive. You can then use the new copy that will appear in the new tab.</font>


# AfterWork Data Science: Getting Started with NLP Project

**Background Information**
The management of a certain Marketing Firm would like to track the sentiments of their
customers. This would help in shortening the amount of time that it takes to act on
feedback.

**Problem Statement**
Your task for this project will be to create a model that can predict whether the sentiment
of a tweet is positive or negative. The desired accuracy of your model is 70%.

**Below are the text processing steps that you will be required to perform in this project:**

**● Text Cleaning/ Text Processing**
○ Removing all URLs/links
○ Replacing @ and # Characters
○ Feature Construction (No. of Punctuation Characters)
○ Removing Punctuation Characters
○ Feature Construction (Lowercase, Uppercase and Proper case words)
○ Conversion to Lowercase
○ Splitting Concatenated words
○ Spelling Correction
○ Feature Construction (Counting the no. of stop words/tweet)
○ Removing Stop words
○ Lemmatization
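
Some steps in the list above are order-sensitive: punctuation and case counts must be taken before punctuation is stripped and the text is lowercased, and the stop-word count before stop words are removed. The following is a minimal sketch of those counts using only the standard library. The function name `count_features` and the tiny `STOP_WORDS` set are illustrative assumptions; the notebook itself relies on NLTK's English stop-word list.

```python
import string

# Tiny illustrative stop-word set (assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"i", "a", "the", "to", "is", "my", "and"}

def count_features(tweet):
    """Count punctuation, case-based and stop-word features of a raw tweet."""
    words = tweet.split()
    return {
        "punct_count": sum(ch in string.punctuation for ch in tweet),
        "upper_words": sum(w.isupper() for w in words),
        "lower_words": sum(w.islower() for w in words),
        "title_words": sum(w.istitle() for w in words),
        "stop_words": sum(w.lower() in STOP_WORDS for w in words),
    }

features = count_features("OMG I love the new Album, it is GREAT!")
print(features)
```

With a DataFrame, the same function could be applied to the raw `text` column before any cleaning step modifies it.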

● **Text Feature Engineering Techniques**
○ Length of text
○ Word Count
○ Word density (Average no. of Words / Tweet)
○ Noun Count
○ Verb Count
○ Adjective Count
○ Adverb Count
○ Pronoun Count
○ Polarity
○ Subjectivity
○ Word Level N-Gram TF-IDF tweet_word_tfidf
○ Character Level N-Gram TF-IDF tweet_character_tfidf

**Dataset**

● Datasets for this project can be found here: [https://bit.ly/31kqByD].
● You can load the dataset from the URL.


### Prerequisites

In [1]:
# Importing the required libraries
# ---
# 
import pandas as pd # library for data manipulation
import numpy as np  # library for scientific computations
import re           # regex library to perform text preprocessing
import string       # library to work with strings
import nltk         # library for natural language processing
import scipy.sparse # scientific computing (sparse matrices are used later on)

### 1. Importing our Data

In [2]:
# Question: Given new tweets, create a sentiment analysis model that will 
# predict whether a tweet will contain positive or negative sentiment.
# ---
# Dataset url = https://bit.ly/31kqByD 
# ---
#
df = pd.read_csv('https://bit.ly/31kqByD', encoding='latin-1')
df.head()

Unnamed: 0.1,Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,346508,0,2016177685,Wed Jun 03 06:18:50 PDT 2009,NO_QUERY,UriGrey,Obama forges his Muslim alliance against the c...
1,883537,4,1686152287,Sun May 03 04:02:08 PDT 2009,NO_QUERY,MariesolW,Had the most spectacular prom ever but now my...
2,764173,0,2298725623,Tue Jun 23 12:02:12 PDT 2009,NO_QUERY,ColleenBurns,I am overwhelmed today taking a moment to eat...
3,638701,0,2234530495,Thu Jun 18 23:13:54 PDT 2009,NO_QUERY,queenarchy,@lindork Tres sad. I was totally a Max fan. #...
4,664821,0,2244623416,Fri Jun 19 14:59:46 PDT 2009,NO_QUERY,reinventingjess,"Crap, I was counting down the hours until my d..."


### 2. Data Exploration

In [3]:
# We can determine the size of our dataset
# ---
#
df.shape

(10000, 7)

This dataset needs some cleaning: the file has no header row, so the first record was read as the column names. We also don't need every column to create our model, so we will drop the ones we don't use.

### 3. Data Preparation

#### Basic Data Cleaning Techniques

In [4]:
# We rename the columns for ease of referencing our columns later on
# ---
#
df.columns = ['id', 'target', 't_id', 'created_at', 'query', 'user', 'text']
df.head()

Unnamed: 0,id,target,t_id,created_at,query,user,text
0,346508,0,2016177685,Wed Jun 03 06:18:50 PDT 2009,NO_QUERY,UriGrey,Obama forges his Muslim alliance against the c...
1,883537,4,1686152287,Sun May 03 04:02:08 PDT 2009,NO_QUERY,MariesolW,Had the most spectacular prom ever but now my...
2,764173,0,2298725623,Tue Jun 23 12:02:12 PDT 2009,NO_QUERY,ColleenBurns,I am overwhelmed today taking a moment to eat...
3,638701,0,2234530495,Thu Jun 18 23:13:54 PDT 2009,NO_QUERY,queenarchy,@lindork Tres sad. I was totally a Max fan. #...
4,664821,0,2244623416,Fri Jun 19 14:59:46 PDT 2009,NO_QUERY,reinventingjess,"Crap, I was counting down the hours until my d..."


In [5]:
# We retain the relevant columns by dropping the columns we don't need 
# for creating a sentiment analysis model. 
# ---
#
df = df.drop(['id', 't_id', 'created_at', 'query', 'user'], axis = 1)
df.head()

Unnamed: 0,target,text
0,0,Obama forges his Muslim alliance against the c...
1,4,Had the most spectacular prom ever but now my...
2,0,I am overwhelmed today taking a moment to eat...
3,0,@lindork Tres sad. I was totally a Max fan. #...
4,0,"Crap, I was counting down the hours until my d..."


In [6]:
# Understanding the distribution of target
# ---
#
df.target.value_counts() 

0    5067
4    4933
Name: target, dtype: int64

In [7]:
# Let's determine whether our columns have the right data types
# ---
#
df.dtypes

target     int64
text      object
dtype: object

In [8]:
# What values are in our target variable?
# ---
#
df.target.unique()

array([0, 4])

These are the two classes to which each document (text) belongs. The target value 0 means a text with a negative sentiment, while that of 4 means a text with a positive sentiment. 
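
Because the labels are 0 and 4, it can be convenient to remap them to 0 and 1, or to readable names, before modelling or reporting. A small sketch follows; the two mapping dictionaries are illustrative and not part of the original notebook.

```python
# Map the dataset's sentiment codes to binary labels and readable names.
TARGET_TO_BINARY = {0: 0, 4: 1}
TARGET_TO_NAME = {0: "negative", 4: "positive"}

targets = [0, 4, 0, 0, 0]  # the first five target values shown above
binary = [TARGET_TO_BINARY[t] for t in targets]
names = [TARGET_TO_NAME[t] for t in targets]
print(binary)
print(names)
```

On the DataFrame itself, `df['target'].map(TARGET_TO_BINARY)` would apply the same mapping.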

In [9]:
# Let's check for missing values 
# ---
# 
df.isnull().sum()

target    0
text      0
dtype: int64

We don't have any missing values, so we are good to go.

#### Text Processing

In [10]:
# Text Cleaning: Removing all urls/links
# ---
# 
df['text'] =  df['text'].apply(lambda x: re.sub(r'http\S+|www\S+|https\S+','', str(x)))
df[['text']].head()

Unnamed: 0,text
0,Obama forges his Muslim alliance against the c...
1,Had the most spectacular prom ever but now my...
2,I am overwhelmed today taking a moment to eat...
3,@lindork Tres sad. I was totally a Max fan. #...
4,"Crap, I was counting down the hours until my d..."


In [12]:
# Text Cleaning: Removing @ and # characters or replace them with space
# ---
# YOUR CODE GOES BELOW
df['text'] = df.text.str.replace('#','')
df['text'] = df.text.str.replace('@','')
df[['text']].sample(5)


Unnamed: 0,text
4424,mikeneumann I need Chinese food. Serious. Cr...
3784,"good mornin tweepz... aha,, juz got up,, :p sm..."
2080,"zoernert iod2009 Yes, got my poken but is not ..."
6680,_ryssa with your choice of icing. There's Nea...
3863,i really dont want to leave him


In [13]:
# Text Cleaning: Conversion to lowercase
# ---
# YOUR CODE GOES BELOW
df['text'] = df.text.apply(lambda x: " ".join(x.lower() for x in x.split()))
df['text'].sample(5)


9117    i am so tired i feel almost comatose. didnt sl...
1965       lyssabrooke omg! i love that song! and i got u
3414    thank you universe for the great conversation ...
8298                      thornandes feel better soon luv
9662                         i dun want to go to work tmr
Name: text, dtype: object

In [14]:
# Text Cleaning: Splitting concatenated words
# ---
# Performing this step will take a few minutes...
# ---
# YOUR CODE GOES BELOW
# 

# Installing wordninja and textblob
# ---
!pip3 install wordninja
!pip3 install textblob



# Importing those libraries
# ---
import wordninja 
from textblob import TextBlob




In [15]:
# Performing the split
# ---
df['text'] = df.text.apply(lambda x: wordninja.split(str(TextBlob(x))))  
df['text'] = df.text.str.join(' ')
df[['text']].sample(10) 

Unnamed: 0,text
1410,ooo haven't checked keri o for about a day and...
5399,allotment ali hey ali
2719,the movers will be here in 3 weeks
3667,bk ii sorry buddy wish you were coming too
3391,why are zip i a bags so fc king expensive
1333,swizz le squeak to mil teasing l about doting ...
7338,going to sleep for the day night whatever i re...
2313,trial chem exams and solutions
9177,i'm a grandma baby girl born 2 17 am 7 0 19 in...
806,shaun divine y lol z a girl said that if she g...


In [16]:
# Text Cleaning: Removing punctuation characters
# ---
# YOUR CODE GOES BELOW
# regex=True is needed: recent pandas versions treat the pattern as a literal string by default
df['text'] = df.text.str.replace(r'[^\w\s]', '', regex=True)


In [17]:
# Text Cleaning: Removing stop words
# ---
# YOUR CODE GOES BELOW
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')

df['text'] = df.text.apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df[['text']].sample(5)



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Unnamed: 0,text
8464,jb larry king interview broke heart dont know
5705,cheers 01 deluxe version dont know h ever list...
3663,hungry
894,like disease raining vancouver heard might get...
6964,lee n kwan dont feel sad hug zzz think happily...


In [18]:
# Text Cleaning: Lemmatization
# ---
# YOUR CODE GOES BELOW
#

# For lemmatization, we will need to download wordnet
nltk.download('wordnet')
from textblob import Word


# Lemmatizing our text
# ---
df['lemmatization'] = df.text.apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()])) 
df[['text', 'lemmatization']].sample(10)


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


Unnamed: 0,text,lemmatization
5111,really really really hope justine wins tonight...,really really really hope justine win tonight ...
2649,browser isnt working,browser isnt working
1867,dd lovato fans 101 could follow please would r...,dd lovato fan 101 could follow please would re...
5278,ah bored goes home feel like sleeping yes shal...,ah bored go home feel like sleeping yes shall ...
4016,ok ive filtered tw able tweet deck bb clogged ...,ok ive filtered tw able tweet deck bb clogged ...
3693,la rz zzz im sorry b day partying tomorrow,la rz zzz im sorry b day partying tomorrow
9065,left que es feeling much fru ste ration anger ...,left que e feeling much fru ste ration anger team
4133,ellen moore 08 pretty good thanks,ellen moore 08 pretty good thanks
7976,wow would want back,wow would want back
610,yesterday exciting hah today wasnt,yesterday exciting hah today wasnt


We won't remove numerics because the text could lose meaning without them. We could also further prepare our text by performing spelling correction, but this is a resource-intensive process that we will skip for now.
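
To give a feel for what spelling correction involves, here is a toy sketch that swaps in the standard library's `difflib` for a real spell checker (TextBlob, for instance, exposes a `correct()` method). The vocabulary, the 0.8 similarity cutoff and the function names are all assumptions made for illustration; a real corrector needs a full dictionary and is much slower, which is why the step is skipped here.

```python
import difflib

# Toy vocabulary (assumption); a real corrector needs a full dictionary.
VOCABULARY = ["tomorrow", "really", "tired", "happy", "work"]

def correct_word(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the word itself if none is close enough."""
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

def correct_text(text):
    """Correct each whitespace-separated word independently."""
    return " ".join(correct_word(w) for w in text.split())

print(correct_text("realy tired tommorow"))
```

Every word is compared against the whole vocabulary, so the cost grows with both the number of tweets and the dictionary size.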

#### Feature Engineering Techniques 

In [21]:

# Custom Functions
# Average word length (characters per word); returns 0 for an empty tweet
def avg_word(sentence):
  words = sentence.split()
  try:
    z = (sum(len(word) for word in words)/len(words))
  except ZeroDivisionError:
    z = 0 
  return z




# Subjectivity 
def get_subjectivity(tweet):
    try:
        # str() replaces the Python 2-only unicode() call, which raised a
        # NameError on Python 3 and made every score fall back to 0.0
        subj = TextBlob(str(tweet)).sentiment.subjectivity
    except Exception:
        subj = 0.0
    return subj

# Polarity
def get_polarity(tweet):
    try:
        # Same fix as above: str() instead of unicode()
        pol = TextBlob(str(tweet)).sentiment.polarity
    except Exception:
        pol = 0.0
    return pol



In [22]:
# Feature Construction: Length of tweet
# ---
# YOUR CODE GOES BELOW
#
df['length_of_tweet'] = df.text.str.len()

In [23]:
# Feature Construction: Word count 
# ---
# YOUR CODE GOES BELOW
# split() without arguments ignores repeated spaces and returns 0 words for an empty tweet
df['word_count'] = df.text.apply(lambda x: len(str(x).split()))


In [24]:
# Feature Construction: Word density (Average no. of words / tweet)
# ---
# YOUR CODE GOES BELOW
#
df['avg_word_length'] = df.text.apply(lambda x: avg_word(x))

In [25]:
# Feature Construction: Noun count
# ---
# YOUR CODE GOES BELOW
#
# First, we will download punkt and the averaged_perceptron_tagger into our notebook environment,
# which will allow us to find the part of speech tags.
# ---
#
# Library for Noun count
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')


# We create a function that counts the words in a given sentence whose part of speech tag
# belongs to a given group (noun, pronoun, verb, adjective or adverb)
pos_dic = {
    'noun' : ['NN','NNS','NNP','NNPS'],
    'pron' : ['PRP','PRP$','WP','WP$'],
    'verb' : ['VB','VBD','VBG','VBN','VBP','VBZ'],
    'adj' :  ['JJ','JJR','JJS'],
    'adv' : ['RB','RBR','RBS','WRB']
}

def pos_check(x, flag):
    cnt = 0
    try:
        wiki = TextBlob(x)
        for tup in wiki.tags:
            ppo = list(tup)[1]
            if ppo in pos_dic[flag]:
                cnt += 1
    except:
        pass
    return cnt

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [26]:
# Noun Count
# ---
# YOUR CODE GOES BELOW
#
df['noun_count'] = df.text.apply(lambda x: pos_check(x, 'noun'))

In [27]:
# Feature Construction: Verb count
# ---
# YOUR CODE GOES BELOW
df['verb_count'] = df.text.apply(lambda x: pos_check(x, 'verb'))

In [28]:
# Feature Construction: Adjective count / Tweet
# ---
# YOUR CODE GOES BELOW
#
df['adj_count'] = df.text.apply(lambda x: pos_check(x, 'adj'))

In [30]:
# Feature Construction: Adverb count / Tweet
# ---
# YOUR CODE GOES BELOW
df['adv_count'] = df.text.apply(lambda x: pos_check(x, 'adv'))


In [31]:
# Feature Construction: Pronoun 
# ---
# YOUR CODE GOES BELOW
df['pron_count'] = df.text.apply(lambda x: pos_check(x, 'pron'))


In [32]:
# Feature Construction: Subjectivity
# ---
# YOUR CODE GOES BELOW
# 
df['subjectivity'] = df.text.apply(get_subjectivity)

In [33]:
# Feature Construction: Polarity
# ---
# YOUR CODE GOES BELOW
# 
df['polarity'] = df.text.apply(get_polarity)

In [34]:
# Feature Construction: Word Level N-Gram TF-IDF Feature 
# ---
# YOUR CODE GOES BELOW
#
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word', ngram_range=(1,3),  stop_words= 'english')
df_word_vect = tfidf.fit_transform(df.text)


In [36]:
# Feature Construction: Character Level N-Gram TF-IDF Feature
# ---
# YOUR CODE GOES BELOW
# stop_words is dropped here because it only applies to the word analyzer
tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='char', ngram_range=(1,3))
df_char_vect = tfidf.fit_transform(df.text)
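
To make `ngram_range=(1,3)` concrete, the sketch below lists the word-level and character-level n-grams that would be generated for a short text. It is a simplified illustration: it splits on whitespace, whereas `TfidfVectorizer` applies its own token pattern (which by default ignores single-character tokens) and then weights each n-gram by TF-IDF. The function names are hypothetical.

```python
def word_ngrams(text, n_min=1, n_max=3):
    """All word n-grams with n_min <= n <= n_max, in order of increasing n."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n_min=1, n_max=3):
    """All character n-grams (spaces included) with n_min <= n <= n_max."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

print(word_ngrams("feel better soon"))
print(char_ngrams("luv", 2, 3))
```

Character n-grams are more tolerant of misspellings and slang such as "luv", which is one reason to combine them with word-level features on tweets.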


In [37]:
# Let's prepare the constructed features for modeling
# ---
# We pick the numeric features by name (the text columns are left out) and
# scale them to [0, 1], since MultinomialNB rejects negative values such as polarity.
#
from sklearn.preprocessing import MinMaxScaler
feature_cols = ['length_of_tweet', 'word_count', 'avg_word_length', 'noun_count', 'verb_count',
                'adj_count', 'adv_count', 'pron_count', 'subjectivity', 'polarity']
X_metadata = MinMaxScaler().fit_transform(df[feature_cols])
X_metadata

In [57]:
# We combine our two tfidf (sparse) matrices and X_metadata
# ---
#
X = scipy.sparse.hstack([df_word_vect, df_char_vect, scipy.sparse.csr_matrix(X_metadata)], format='csr')
X

In [58]:
# Getting our response variable
# ---
#
y = np.array(df.iloc[:, 0])
y

array([0, 4, 0, ..., 0, 4, 0])

### 4. Data Modelling

During this step, we will use machine learning algorithms to train and test our sentiment analysis models.

In [60]:
# Splitting our data
# ---
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Fitting our model
# ---
#

# Importing the algorithms
from sklearn.naive_bayes import MultinomialNB 
from sklearn.linear_model import LogisticRegression

nb_classifier = MultinomialNB() 
lr_classifier = LogisticRegression(max_iter=1000) 

# Training our model
nb_classifier.fit(X_train, y_train) 
lr_classifier.fit(X_train, y_train)

In [None]:
# Making predictions
# ---
#
y_predict_nb = nb_classifier.predict(X_test) 
y_predict_lr = lr_classifier.predict(X_test)

In [None]:
# Evaluating the Models
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Accuracy scores
# ---
#
print("Naive Bayes Classifier:\n", accuracy_score(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", accuracy_score(y_test, y_predict_lr))

In [None]:
# Confusion matrices
# ---
# 
print("Naive Bayes Classifier: \n", confusion_matrix(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", confusion_matrix(y_test, y_predict_lr))

In [None]:
# Classification Reports
# ---
#
print("Naive Bayes Classifier: \n", classification_report(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", classification_report(y_test, y_predict_lr))

**Evaluating our Models**

* **Accuracy:** the percentage of tweets that were assigned the correct sentiment.
* **Precision:** of the tweets the classifier assigned to a given sentiment, the percentage that actually belong to it.
* **Recall:** of the tweets that actually belong to a given sentiment, the percentage the classifier correctly assigned to it.
* **F1 Score:** the harmonic mean of precision and recall.
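
The four metrics above can be computed by hand from the true and predicted labels. The sketch below does so for the positive class (label 4) in plain Python; the function name `binary_metrics` is hypothetical and the label lists are made up for illustration, not results from this notebook.

```python
def binary_metrics(y_true, y_pred, positive=4):
    """Accuracy, precision, recall and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels for illustration only (not results from this notebook).
y_true = [4, 4, 4, 0, 0, 0, 0, 4]
y_pred = [4, 4, 0, 0, 0, 4, 0, 4]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

`classification_report` reports the same per-class quantities for both labels, along with their averages.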

To improve our model, we can try performing other text processing techniques that would better prepare our data for fitting our model. We can also use different vectorizing techniques, implement other machine learning models and perform hyperparameter tuning.

### 5. Recommendations


Our best model had an accuracy of 73.25%, which meets the 70% target, so we can use it to classify new tweets. We can improve this performance further through hyperparameter tuning and additional feature engineering.