# <font color='#2F4F4F'>Getting Started with Text Analysis</font>

# <font color='#2F4F4F'>1. Define the Research Question</font>

### Background Information
The management of a marketing firm would like to track the sentiment of its customers. Doing so would shorten the time it takes to act on feedback.


### Problem Statement
You have been tasked with creating a model that predicts whether the sentiment of a tweet is positive or negative.

### Metric for Success

The model should reach an accuracy of at least 70%.

### The Experimental Design

* Define the Research Question
* Import Libraries
* Import & Explore Data
* Data Preparation
* Data Modelling & Evaluation
* Recommendations
* Challenging the Solution

# <font color='#2F4F4F'>2. Import Libraries</font>

In [1]:
import pandas as pd # library for data manipulation
import numpy as np  # library for scientific computations
import re           # regex library to perform text preprocessing
import string       # library to work with strings
import nltk         # library for natural language processing
import scipy        # scientific computing
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB, BernoulliNB 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler
pd.set_option('display.max_columns', None)  # display all columns of the dataframe

# <font color='#2F4F4F'>3. Import & Explore Data</font>

In [2]:
df = pd.read_csv('https://bit.ly/31kqByD', encoding = 'latin-1', header = None)
df.head()

Unnamed: 0,0,1,2,3,4,5,6
0,,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,346508.0,0,2016177685,Wed Jun 03 06:18:50 PDT 2009,NO_QUERY,UriGrey,Obama forges his Muslim alliance against the c...
2,883537.0,4,1686152287,Sun May 03 04:02:08 PDT 2009,NO_QUERY,MariesolW,Had the most spectacular prom ever but now my...
3,764173.0,0,2298725623,Tue Jun 23 12:02:12 PDT 2009,NO_QUERY,ColleenBurns,I am overwhelmed today taking a moment to eat...
4,638701.0,0,2234530495,Thu Jun 18 23:13:54 PDT 2009,NO_QUERY,queenarchy,@lindork Tres sad. I was totally a Max fan. #...


> We don't need all of these columns; we will drop the unused ones later.

In [3]:
# check dataset shape
df.shape

(10001, 7)

# <font color='#2F4F4F'>4. Data Preparation</font>

#### Basic Data Cleaning Techniques

In [4]:
# Let's rename the columns for ease of referencing later on
df.columns = ['id', 'target', 't_id', 'created_at', 'query', 'user', 'text']
df.head()

Unnamed: 0,id,target,t_id,created_at,query,user,text
0,,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,346508.0,0,2016177685,Wed Jun 03 06:18:50 PDT 2009,NO_QUERY,UriGrey,Obama forges his Muslim alliance against the c...
2,883537.0,4,1686152287,Sun May 03 04:02:08 PDT 2009,NO_QUERY,MariesolW,Had the most spectacular prom ever but now my...
3,764173.0,0,2298725623,Tue Jun 23 12:02:12 PDT 2009,NO_QUERY,ColleenBurns,I am overwhelmed today taking a moment to eat...
4,638701.0,0,2234530495,Thu Jun 18 23:13:54 PDT 2009,NO_QUERY,queenarchy,@lindork Tres sad. I was totally a Max fan. #...


In [5]:
# drop the columns we don't need 
new_df = df.drop(['id', 't_id', 'created_at', 'query', 'user'], axis = 1)
new_df.head()

Unnamed: 0,target,text
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,Obama forges his Muslim alliance against the c...
2,4,Had the most spectacular prom ever but now my...
3,0,I am overwhelmed today taking a moment to eat...
4,0,@lindork Tres sad. I was totally a Max fan. #...


In [6]:
# check the distribution of target
new_df.target.value_counts() 

0    5068
4    4933
Name: target, dtype: int64

In [8]:
# check data types
new_df.dtypes

target     int64
text      object
dtype: object

In [9]:
# check unique values in target variable
new_df.target.unique()

array([0, 4])

These are the two classes to which each text belongs. The target value 0 means a text with a negative sentiment, while that of 4 means a text with a positive sentiment. 
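
Because 0 and 4 are arbitrary codes, it can help to map them to readable names when inspecting the data. The following is a minimal sketch on a toy series; the `label_names` mapping and the `sentiment` variable are illustrative assumptions and are not used elsewhere in this notebook.

```python
import pandas as pd

# hypothetical mapping from the dataset's numeric codes to readable labels
label_names = {0: 'negative', 4: 'positive'}

# toy target column mirroring the first five rows shown earlier
target = pd.Series([0, 0, 4, 0, 0], name='target')

# map each code to its label; any unmapped code would become NaN
sentiment = target.map(label_names)
print(sentiment.tolist())
```

Keeping the numeric codes for modelling and using the names only for display avoids changing anything downstream.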

In [10]:
# check for missing values 
new_df.isnull().sum()

target    0
text      0
dtype: int64

#### Text Processing

In [11]:
# text cleaning: remove all urls/links
# ---
# 
new_df['text'] =  new_df['text'].apply(lambda x: re.sub(r'http\S+|www\S+|https\S+','', str(x)))
new_df[['text']].head()

Unnamed: 0,text
0,@switchfoot - A that's a bummer. You shoulda...
1,Obama forges his Muslim alliance against the c...
2,Had the most spectacular prom ever but now my...
3,I am overwhelmed today taking a moment to eat...
4,@lindork Tres sad. I was totally a Max fan. #...
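
To see what the URL pattern does in isolation, here is a small sketch that applies the same regular expression to a made-up string. The `strip_urls` helper and the sample text are assumptions introduced only for illustration.

```python
import re

# same pattern as in the cell above: tokens starting with http, www or https
URL_PATTERN = re.compile(r'http\S+|www\S+|https\S+')

def strip_urls(text):
    """Remove URL-like tokens; surrounding whitespace is left untouched."""
    return URL_PATTERN.sub('', str(text))

# hypothetical text, not a row from the dataset
sample = "check this out http://example.com/page and www.example.org now"
cleaned = strip_urls(sample)
print(repr(cleaned))
```

The removed links leave double spaces behind. That is harmless here, because later steps split the text on whitespace and rejoin it with single spaces.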


In [12]:
# text cleaning: replace @ and # characters with a space
new_df.text = new_df.text.str.replace("@", " ")
new_df.text = new_df.text.str.replace("#", " ")
new_df.head()

Unnamed: 0,target,text
0,0,switchfoot - A that's a bummer. You shoulda...
1,0,Obama forges his Muslim alliance against the c...
2,4,Had the most spectacular prom ever but now my...
3,0,I am overwhelmed today taking a moment to eat...
4,0,lindork Tres sad. I was totally a Max fan. ...


In [13]:
# text cleaning: conversion to lowercase
new_df.text = new_df.text.apply(lambda x: x.lower())
new_df.head()

Unnamed: 0,target,text
0,0,switchfoot - a that's a bummer. you shoulda...
1,0,obama forges his muslim alliance against the c...
2,4,had the most spectacular prom ever but now my...
3,0,i am overwhelmed today taking a moment to eat...
4,0,lindork tres sad. i was totally a max fan. ...


In [14]:
# text cleaning: split concatenated words
# Install wordninja and textblob
!pip3 install wordninja
!pip3 install textblob

# Import libraries
import wordninja 
from textblob import TextBlob

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [15]:
# perform the split (wordninja.split takes a plain string, so no TextBlob wrapper is needed)
new_df.text = new_df.text.apply(wordninja.split)

In [16]:
new_df.text = new_df.text.str.join(' ')
new_df.head(50)

Unnamed: 0,target,text
0,0,switch foot a that's a bummer you should a got...
1,0,obama forges his muslim alliance against the c...
2,4,had the most spectacular prom ever but now my ...
3,0,i am overwhelmed today taking a moment to eat ...
4,0,lin dork tres sad i was totally a max fan sytycd
5,0,crap i was counting down the hours until my da...
6,4,dc b tv dc b tv i had to go check some things ...
7,0,s mr or ke why are you never on gmail anymore
8,0,alex jeffrey s i'd have loved to have come jus...
9,0,br rrr heading to work chilly today


In [17]:
# text cleaning: remove punctuation characters
new_df.text = new_df.text.str.replace(r'[^\w\s]', '', regex=True)
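
The pattern `[^\w\s]` matches every character that is neither a word character (letter, digit or underscore) nor whitespace. The sketch below shows its effect on a made-up series. Recent pandas releases treat the pattern as a literal string unless `regex=True` is passed, so the flag is spelled out here; the sample strings are assumptions.

```python
import pandas as pd

# hypothetical examples, not rows from the dataset
texts = pd.Series(["that's a bummer!!!", "so_happy :) 100%"])

# drop everything that is not a word character or whitespace
no_punct = texts.str.replace(r'[^\w\s]', '', regex=True)
print(no_punct.tolist())
```

Note that the underscore survives because it counts as a word character, and that the apostrophe in a contraction is removed along with the other punctuation.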

  


In [18]:
# text cleaning: remove stop words
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

new_df.text = new_df.text.apply(lambda x: " ".join(word for word in x.split() if word not in stop_words))
new_df.head(10)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,target,text
0,0,switch foot thats bummer got david carr third day
1,0,obama forges muslim alliance civilized world d...
2,4,spectacular prom ever bed serenading must answ...
3,0,overwhelmed today taking moment eat pray
4,0,lin dork tres sad totally max fan sytycd
5,0,crap counting hours dad could come home amp he...
6,4,dc b tv dc b tv go check things buy others loo...
7,0,mr ke never gmail anymore
8,0,alex jeffrey id loved come couple unfortunate ...
9,0,br rrr heading work chilly today
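
Because punctuation was stripped before this step, contractions have already lost their apostrophes and may no longer match any entry in a stop-word list. This is one likely reason tokens such as `thats` and `id` survive in the output above. The sketch below illustrates the effect of step order with a tiny, hypothetical stop-word set (not the NLTK list) and hypothetical helper names.

```python
import re

# hypothetical mini stop-word set for illustration (not the NLTK list)
stop_words = {"a", "that", "that's", "you", "you're"}

def remove_stop_words(text):
    return " ".join(w for w in text.split() if w not in stop_words)

def remove_punctuation(text):
    return re.sub(r'[^\w\s]', '', text)

tweet = "that's a bummer you're late"

# order used in this notebook: punctuation first, then stop words
punct_first = remove_stop_words(remove_punctuation(tweet))

# alternative order: stop words first, then punctuation
stop_first = remove_punctuation(remove_stop_words(tweet))

print(punct_first)
print(stop_first)
```

Whether the leftover tokens matter depends on the model; swapping the two steps is a cheap experiment if they turn out to add noise.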


In [19]:
# text cleaning: lemmatization
nltk.download('wordnet')
nltk.download('omw-1.4')
from textblob import Word

# lemmatize our text
new_df.text = new_df.text.apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
new_df.head(10)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


Unnamed: 0,target,text
0,0,switch foot thats bummer got david carr third day
1,0,obama forge muslim alliance civilized world di...
2,4,spectacular prom ever bed serenading must answ...
3,0,overwhelmed today taking moment eat pray
4,0,lin dork tres sad totally max fan sytycd
5,0,crap counting hour dad could come home amp hel...
6,4,dc b tv dc b tv go check thing buy others look...
7,0,mr ke never gmail anymore
8,0,alex jeffrey id loved come couple unfortunate ...
9,0,br rrr heading work chilly today


We won't remove numbers, because the text could lose meaning without them. We could also prepare the text further by correcting spelling, but this is a resource-intensive process that we will skip for now.

#### Feature Engineering Techniques 

In [20]:
# feature construction: length of text
new_df['text_len'] = new_df.text.str.len()

In [21]:
# feature construction: word count 
new_df['word_count'] = new_df.text.apply(lambda x: len(str(x).split()))

In [22]:
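# feature construction: word density (average word length per tweet)
def avg_word_len(sentence):
    words = sentence.split()
    # guard against empty tweets to avoid division by zero
    if not words:
        return 0
    # use the built-in sum rather than shadowing it with a local variable
    return sum(len(word) for word in words) / len(words)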

In [23]:
new_df['word_density'] = new_df.text.apply(lambda x: avg_word_len(x))
new_df.head()

Unnamed: 0,target,text,text_len,word_count,word_density
0,0,switch foot thats bummer got david carr third day,49,9,4.555556
1,0,obama forge muslim alliance civilized world di...,67,11,5.181818
2,4,spectacular prom ever bed serenading must answ...,81,12,5.833333
3,0,overwhelmed today taking moment eat pray,40,6,5.833333
4,0,lin dork tres sad totally max fan sytycd,40,8,4.125
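
As a sanity check, the three length-based features can be recomputed by hand for the first row of the table above. This is a plain-Python sketch of the same arithmetic; the variable names are chosen for illustration.

```python
# cleaned text of row 0, copied from the output above
text = "switch foot thats bummer got david carr third day"

words = text.split()
text_len = len(text)                      # characters, including spaces
word_count = len(words)                   # whitespace-separated tokens
word_density = sum(len(w) for w in words) / word_count if words else 0

print(text_len, word_count, round(word_density, 6))  # 49 9 4.555556
```

The 41 letters spread over 9 words give the 4.555556 shown in the `word_density` column, and the 8 separating spaces account for the difference between 41 and `text_len`.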


In [24]:
# feature construction: noun count
# download the punkt and the averaged_perceptron_tagger
# which will allow us to find the part of speech tags
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# map each part-of-speech group to its Penn Treebank tags, then define a function
# that counts how many words in a sentence carry a tag from a given group
pos_dict = {
    'noun' : ['NN','NNS','NNP','NNPS'],
    'pron' : ['PRP','PRP$','WP','WP$'],
    'verb' : ['VB','VBD','VBG','VBN','VBP','VBZ'],
    'adj' :  ['JJ','JJR','JJS'],
    'adv' : ['RB','RBR','RBS','WRB']
}

def pos_check(x, flag):
    cnt = 0
    try:
        wiki = TextBlob(x)
        # wiki.tags yields (word, tag) pairs
        for _, tag in wiki.tags:
            if tag in pos_dict[flag]:
                cnt += 1
    except Exception:
        # if tagging fails, fall back to the count gathered so far
        pass
    return cnt

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [25]:
# noun count
new_df['noun_count'] = new_df.text.apply(lambda x: pos_check(x, 'noun'))

In [26]:
# feature construction: verb count
new_df['verb_count'] = new_df.text.apply(lambda x: pos_check(x, 'verb'))

In [27]:
# feature construction: adjective count
new_df['adj_count'] = new_df.text.apply(lambda x: pos_check(x, 'adj'))

In [28]:
# feature construction: adverb count
new_df['adv_count'] = new_df.text.apply(lambda x: pos_check(x, 'adv'))

In [29]:
# feature construction: pronoun count
new_df['pron_count'] = new_df.text.apply(lambda x: pos_check(x, 'pron'))

In [30]:
# feature construction: subjectivity
def get_subjectivity(text):
  textblob = TextBlob(text)
  return textblob.sentiment.subjectivity

new_df['subjectivity'] = new_df.text.apply(get_subjectivity)

In [31]:
# feature construction: polarity
def get_polarity(text):
  textblob = TextBlob(text)
  return textblob.sentiment.polarity

new_df['polarity'] = new_df.text.apply(get_polarity)

In [32]:
# feature construction: word level N-Gram TF-IDF feature
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_w = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word', ngram_range=(1,3),  stop_words= 'english')
word_vect = tfidf_w.fit_transform(new_df.text)

In [33]:
# feature construction: character level N-Gram TF-IDF feature
# (stop_words is omitted because scikit-learn only applies it when analyzer='word')
tfidf_c = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='char', ngram_range=(1,3))
char_vect = tfidf_c.fit_transform(new_df.text)
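
To make the difference between the two analyzers concrete, the sketch below fits both kinds of vectorizer on a two-document toy corpus and prints the vocabularies they learn. The corpus and the smaller n-gram ranges are assumptions chosen to keep the output readable; they are not the settings used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical two-document corpus
corpus = ["good day", "bad day"]

# word-level unigrams and bigrams
word_tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2))
word_matrix = word_tfidf.fit_transform(corpus)
word_vocab = sorted(word_tfidf.vocabulary_)

# character-level unigrams only, to keep the vocabulary small
char_tfidf = TfidfVectorizer(analyzer='char', ngram_range=(1, 1))
char_matrix = char_tfidf.fit_transform(corpus)
char_vocab = sorted(char_tfidf.vocabulary_)

print(word_vocab)
print(char_vocab)
print(word_matrix.shape, char_matrix.shape)
```

Each matrix has one row per document and one column per n-gram. The character analyzer also treats the space as a feature, which is part of why character n-grams can capture word boundaries.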

In [34]:
new_df.shape

(10001, 12)

In [35]:
# prepare the constructed features for modeling

X_metadata = np.array(new_df.iloc[:, 2:12])
X_metadata

array([[49.        ,  9.        ,  4.55555556, ...,  0.        ,
         0.        ,  0.        ],
       [67.        , 11.        ,  5.18181818, ...,  0.        ,
         0.9       ,  0.4       ],
       [81.        , 12.        ,  5.83333333, ...,  0.        ,
         0.85      ,  0.65      ],
       ...,
       [45.        ,  8.        ,  4.75      , ...,  0.        ,
         0.83333333,  0.33333333],
       [34.        ,  6.        ,  4.83333333, ...,  0.        ,
         0.56785714,  0.49285714],
       [44.        , 10.        ,  3.5       , ...,  0.        ,
         0.6       ,  0.5       ]])

In [None]:
new_df['polarity'].value_counts(ascending=True)

In [41]:
# combine our two tfidf (sparse) matrices and X_metadata

X = scipy.sparse.hstack([word_vect, char_vect,  X_metadata])
X

<10001x2010 sparse matrix of type '<class 'numpy.float64'>'
	with 951523 stored elements in COOrdinate format>
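
The 2010 columns are the 1000 word-level features, the 1000 character-level features and the 10 metadata features placed side by side. The following sketch reproduces the stacking on tiny made-up blocks; the block names and values are assumptions, and the dense block is converted to a sparse matrix explicitly before stacking.

```python
import numpy as np
from scipy import sparse

# hypothetical stand-ins for the two TF-IDF matrices (2 rows each)
word_part = sparse.csr_matrix(np.array([[0.0, 0.5], [0.7, 0.0]]))
char_part = sparse.csr_matrix(np.array([[0.1, 0.0, 0.3], [0.0, 0.2, 0.0]]))

# hypothetical stand-in for the dense metadata features
meta_part = sparse.csr_matrix(np.array([[49.0, 9.0], [67.0, 11.0]]))

# stack the blocks column-wise; convert to CSR so rows can be indexed
combined = sparse.hstack([word_part, char_part, meta_part]).tocsr()

print(combined.shape)
print(combined.toarray())
```

All blocks must have the same number of rows, and the column order of the result follows the order of the list passed to `hstack`.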

In [42]:
# set the target variable

Y = np.array(new_df.iloc[:, 0])
Y

array([0, 0, 4, ..., 0, 4, 0])

# <font color='#2F4F4F'>5. Data Modelling & Evaluation</font>

In [58]:
# split data

features_train, features_test, target_train, target_test = train_test_split(X, Y, test_size=0.2, random_state=42)

In [59]:
# scale each feature by its maximum absolute value, learned from the training set only
max_abs_scaler = MaxAbsScaler().fit(features_train)

features_train_scaled = max_abs_scaler.transform(features_train) 
features_test_scaled = max_abs_scaler.transform(features_test)
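
`MaxAbsScaler` divides each column by the largest absolute value seen during `fit`, which keeps zeros at zero and therefore preserves sparsity. A minimal sketch on made-up dense arrays shows the behaviour; the numbers are assumptions chosen so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# hypothetical training and test features (two columns)
train = np.array([[49.0, -0.5],
                  [81.0, 0.25]])
test = np.array([[40.0, 1.0]])

# learn the per-column maximum absolute value from the training data only
scaler = MaxAbsScaler().fit(train)

train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print(scaler.max_abs_)
print(train_scaled)
print(test_scaled)
```

Training values end up in the range [-1, 1], while test values can fall outside it when they exceed anything seen during fitting, as the second test column does here.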

In [52]:
# instantiate our models
bnb_classifier = BernoulliNB()
lr_classifier = LogisticRegression(max_iter=1000)

# fit the models
bnb_classifier.fit(features_train_scaled, target_train) 
lr_classifier.fit(features_train_scaled, target_train)

LogisticRegression(max_iter=1000)

In [53]:
bnb_predictions = bnb_classifier.predict(features_test_scaled)

In [54]:
lr_predictions = lr_classifier.predict(features_test_scaled)

In [55]:
# evaluating the models

print("Accuracy Score - Bernoulli NB Classifier:", accuracy_score(target_test, bnb_predictions))
print("Accuracy Score - Logistic Regression Classifier:", accuracy_score(target_test, lr_predictions))

Accuracy Score - Bernoulli NB Classifier: 0.697151424287856
Accuracy Score - Logistic Regression Classifier: 0.7011494252873564


In [63]:
# confusion matrices

print("Bernoulli NB: \n", confusion_matrix(target_test, bnb_predictions))
print("Logistic Regression: \n", confusion_matrix(target_test, lr_predictions))

Bernoulli NB: 
 [[710 266]
 [340 685]]
Logistic Regression: 
 [[708 268]
 [330 695]]
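
The accuracy scores reported earlier can be recovered directly from these matrices: the diagonal holds the correctly classified tweets, and the sum of all cells is the size of the test set. The sketch below redoes that arithmetic in plain Python; the helper name is an assumption.

```python
# confusion matrices copied from the output above
# rows are true classes (0, 4); columns are predicted classes (0, 4)
matrices = {
    "Bernoulli NB": [[710, 266], [340, 685]],
    "Logistic Regression": [[708, 268], [330, 695]],
}

def accuracy_from_confusion(cm):
    """Share of samples on the diagonal of a square confusion matrix."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

accuracies = {name: accuracy_from_confusion(cm) for name, cm in matrices.items()}
for name, acc in accuracies.items():
    print(f"{name}: {acc:.4f}")
```

The two models differ by only 8 correctly classified tweets out of 2001, so the gap between them is small.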


In [65]:
# Classification Reports

print("Bernoulli NB: \n", classification_report(target_test, bnb_predictions))
print("Logistic Regression: \n", classification_report(target_test, lr_predictions))

Bernoulli NB: 
               precision    recall  f1-score   support

           0       0.68      0.73      0.70       976
           4       0.72      0.67      0.69      1025

    accuracy                           0.70      2001
   macro avg       0.70      0.70      0.70      2001
weighted avg       0.70      0.70      0.70      2001

Logistic Regression: 
               precision    recall  f1-score   support

           0       0.68      0.73      0.70       976
           4       0.72      0.68      0.70      1025

    accuracy                           0.70      2001
   macro avg       0.70      0.70      0.70      2001
weighted avg       0.70      0.70      0.70      2001



**Evaluating Our Models**

* **Accuracy:** the percentage of tweets that were assigned the correct sentiment class.
* **Precision:** of the tweets the classifier assigned to a class, the percentage that actually belong to that class.
* **Recall:** of the tweets that actually belong to a class, the percentage that the classifier assigned to that class.
* **F1 Score:** the harmonic mean of precision and recall.
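
These definitions can be checked against the logistic regression confusion matrix reported earlier. The sketch below computes the per-class values by hand; the helper name `per_class_metrics` is an assumption, and the matrix is copied from the output above.

```python
# logistic regression confusion matrix: rows = true class, columns = predicted class
cm = [[708, 268], [330, 695]]
classes = [0, 4]

def per_class_metrics(cm, idx):
    """Precision, recall and F1 for the class at position idx."""
    tp = cm[idx][idx]
    predicted = sum(row[idx] for row in cm)  # column total: predicted as this class
    actual = sum(cm[idx])                    # row total: truly in this class
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

metrics = {cls: per_class_metrics(cm, i) for i, cls in enumerate(classes)}
for cls, (p, r, f1) in metrics.items():
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

The rounded values agree with the logistic regression classification report: 0.68, 0.73 and 0.70 for class 0, and 0.72, 0.68 and 0.70 for class 4.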

# <font color='#2F4F4F'>6. Recommendations</font>


Our best model, logistic regression, reached an accuracy of 70.11%, which just meets the 70% target (Bernoulli Naive Bayes reached 69.72%). We recommend it for classifying new tweets.

# <font color='#2F4F4F'>7. Challenging the Solution</font>

Did we have the right question? 
>Yes.

Did we have the right data? 
>Yes.

What can be done to improve the solution?

> To improve our model, we can try other text processing techniques that would better prepare the data for modelling. 
> We can also use different vectorizing techniques, implement other machine learning models and perform hyperparameter tuning.
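
As one possible way to explore hyperparameter tuning, the sketch below wraps a TF-IDF vectorizer and a logistic regression in a scikit-learn `Pipeline` and searches over the regularization strength `C` with `GridSearchCV`. The toy texts, the step names `tfidf` and `clf`, and the grid values are all assumptions for illustration; this is not the setup used in this notebook, and results on real tweets would differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# hypothetical toy data: 4 = positive, 0 = negative
texts = ["love this great day", "great happy fun", "happy love fun", "great love day",
         "sad awful day", "hate this awful", "sad hate bad", "awful bad day"]
labels = [4, 4, 4, 4, 0, 0, 0, 0]

# the vectorizer lives inside the pipeline, so it is refitted on each training fold
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# hypothetical grid over the inverse regularization strength
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

prediction = search.predict(["happy great fun"])
print(search.best_params_)
print(prediction)
```

Keeping the vectorizer inside the pipeline means its vocabulary is learned only from the training portion of each fold, which gives a more honest estimate than fitting it on the full dataset first.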