<a href="https://colab.research.google.com/github/BrianKipngeno/Text-classification-with-Python/blob/main/Text_Classification_with_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Prerequisites

In [None]:
# Importing the required libraries
# ---
#
import pandas as pd # library for data manipulation
import numpy as np  # library for scientific compuations
import re           # library to perform text preprocessing
import string       # library to work with strings
import nltk         # library for natural language processing
import scipy        # library for scientific computing

In [None]:
# Library for Stop words
!pip3 install wordninja
!pip3 install textblob
import wordninja
from textblob import TextBlob

nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')

# Library for Lemmatization
nltk.download('wordnet')
from textblob import Word

# Library for Noun count
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Library for TD-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

Collecting wordninja
  Downloading wordninja-2.0.0.tar.gz (541 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m541.6/541.6 kB[0m [31m7.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: wordninja
  Building wheel for wordninja (setup.py) ... [?25l[?25hdone
  Created wheel for wordninja: filename=wordninja-2.0.0-py3-none-any.whl size=541530 sha256=8a1866ae890adeb82367bae9278cb4330a638ebabcf00557516a26c7ee837b40
  Stored in directory: /root/.cache/pip/wheels/aa/44/3a/f2a5c1859b8b541ded969b4cd12d0a58897f12408f4f51e084
Successfully built wordninja
Installing collected packages: wordninja
Successfully installed wordninja-2.0.0


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [None]:
# Custom Functions
# ---
#

# Avg. words
def avg_word(sentence):
  words = sentence.split()
  try:
    z = (sum(len(word) for word in words)/len(words))
  except ZeroDivisionError:
    z = 0
  return z

# Noun count
pos_dic = {
    'noun' : ['NN','NNS','NNP','NNPS'],
    'pron' : ['PRP','PRP$','WP','WP$'],
    'verb' : ['VB','VBD','VBG','VBN','VBP','VBZ'],
    'adj' :  ['JJ','JJR','JJS'],
    'adv' : ['RB','RBR','RBS','WRB']
}

def pos_check(x, flag):
    cnt = 0
    try:
        wiki = TextBlob(x)
        for tup in wiki.tags:
            ppo = list(tup)[1]
            if ppo in pos_dic[flag]:
                cnt += 1
    except:
        pass
    return cnt

# Subjectivity
def get_subjectivity(tweet):
    try:
        textblob = TextBlob(unicode(tweet, 'utf-8'))
        subj = textblob.sentiment.subjectivity
    except:
        subj = 0.0
    return subj

# Polarity
def get_polarity(tweet):
    try:
        textblob = TextBlob(unicode(tweet, 'utf-8'))
        pol = textblob.sentiment.polarity
    except:
        pol = 0.0
    return pol

## Example

### Importing Data

In [None]:
# Question: Create a classification model to classify new tweets
# with different sentiments given the following dataset.
# ---
# Dataset URL = http://bit.ly/VaccinationsDS
# ---
#
df = pd.read_csv('http://bit.ly/VaccinationsDS')
df.head()

Unnamed: 0,tweet,retweets,likes,sentiment
0,Mother coming from #RoutineImmunization sessio...,9.0,11.0,neutral
1,Odisha vaccinates over 1 crore children in 19 ...,14.0,53.0,neutral
2,India is at the forefront of vaccine developme...,16.0,15.0,neutral
3,India is at the forefront of vaccine developme...,16.0,15.0,neutral
4,The mobile-based application “Kilkari” aims to...,37.0,80.0,neutral


### Data Exploration

In [None]:
# We can determine the size of our dataset
# ---
#
df.shape

(1801, 4)

In [None]:
# To get an understanding of our dataset lets sample 10 records
# ---
#
df.sample(5)

Unnamed: 0,tweet,retweets,likes,sentiment
1028,525 children got vaccinated with measles-rubel...,0.0,2.0,neutral
1467,തൃഷ മലയാളത്തിൽ സംസാരിക്കുന്നത് കണ്ടിട്ടുണ്ടോ.....,0.0,1.0,neutral
1785,#Vaccines are the best defence against #diseas...,4.0,1.0,positive
1089,Pregnant mothers and children are getting vacc...,13.0,18.0,neutral
1181,RT MoHFW_INDIA: #Immunization of the child is ...,1.0,0.0,negative


This dataset will need some data cleaning i.e. removal of links, hashtags, etc.

In [None]:
# sampling tweets with neutral sentiments
df_neutral = df[df["sentiment"] == 'neutral']
df_neutral = df_neutral.sample(50)

# sampling tweets with negative sentiments
df_negative = df[df["sentiment"] == 'negative']
df_negative = df_negative.sample(50)

# sampling tweets positive
df_positive = df[df["sentiment"] == 'positive']
df_positive = df_positive.sample(50)

# combining our dataframes
df = pd.concat([df_neutral, df_negative, df_positive])
df.head()

Unnamed: 0,tweet,retweets,likes,sentiment
1490,ഇന്നത്തെ ചോദ്യം... #MRCampaign #Keralapic.twit...,1.0,2.0,neutral
1247,#vaccines are safe of high quality & given by ...,1.0,0.0,neutral
1338,Intensified #MissionIndradhanush immunization ...,14.0,42.0,neutral
252,RT MoHFW_INDIA: Pulse Polio Day being observed...,0.0,0.0,neutral
406,#MissionIndradhanush has led to increase in an...,17.0,53.0,neutral


We now have a balanced dataset.

### Data Preparation

#### Basic Data Cleaning

In [None]:
# Let's determine whether our columns have the right data types
# ---
#
df.dtypes

Unnamed: 0,0
tweet,object
retweets,float64
likes,float64
sentiment,object


In [None]:
# What values are in our target variable?
# ---
#
df.sentiment.unique()

array(['neutral', 'negative', 'positive'], dtype=object)

In [None]:
# Let's check for missing values
# ---
#
df.isnull().sum()

Unnamed: 0,0
tweet,0
retweets,0
likes,0
sentiment,0


We don't have any missing values, so we are good to go.

#### Text Processing

In [None]:
# We will create a custom function that will contain all the text cleaning
# techniques. We will then reuse the same function for cleaning new data.
# ---
#
def text_cleaning(tweet):
  # Removing url/links
  df['tweet'] = df.tweet.apply(lambda x: re.sub(r'http\S+|www\S+|https\S+','', str(x)))

  # Removing @ and # characters and replacing them with space
  df['tweet'] = df.tweet.str.replace('#',' ')
  df['tweet'] = df.tweet.str.replace('@',' ')

  # Conversion to lowercase
  df['tweet'] = df.tweet.apply(lambda x: " ".join(x.lower() for x in x.split()))

  # Removing punctuation characters
  df['tweet'] = df.tweet.str.replace('[^\w\s]','')

  # Removing stop words
  df['tweet'] = df.tweet.apply(lambda x: " ".join(x for x in x.split() if x not in stop))

  # Lemmatization
  df['tweet'] = df.tweet.apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))

In [None]:
# Applying the text_cleaning function to our dataframe.
# ---
# NB: This process may take 2-5 min.
# ---
#
df['tweet'].apply(text_cleaning)
df.sample(5)

Unnamed: 0,tweet,retweets,likes,sentiment
317,today pulse polio day. ensure child get two dr...,0.0,1.0,positive
1750,thank teacher supporting mrcampaign. wishing h...,1.0,10.0,positive
762,successfully vaccinating 3.3cr child first pha...,68.0,213.0,neutral
868,narrowing equity gap among poor marginalized u...,7.0,27.0,negative
821,routine immunization protects child many disea...,23.0,30.0,negative


#### Feature Engineering

In [None]:
# We will create a custom function that will contain all the
# feature engineering techniques. We can then use the function for cleaning new data.
# ---
#
def feature_engineering(tweet):
  # Length of tweet
  df['length_of_tweet'] = df.tweet.str.len()

  # Word count
  df['word_count'] = df.tweet.apply(lambda x: len(str(x).split(" ")))

  # Word density (Average no. of words / tweet)
  df['avg_word_length'] = df.tweet.apply(lambda x: avg_word(x))

  # Noun Count
  df['noun_count'] = df.tweet.apply(lambda x: pos_check(x, 'noun'))

  # Verb Count
  df['verb_count'] = df.tweet.apply(lambda x: pos_check(x, 'verb'))

  # Adjective Count / Tweet
  df['adj_count'] = df.tweet.apply(lambda x: pos_check(x, 'adj'))

  # Adverb Count / Tweet
  df['adv_count'] = df.tweet.apply(lambda x: pos_check(x, 'adv'))

  # Pronoun
  df['pron_count'] = df.tweet.apply(lambda x: pos_check(x, 'pron'))

  # Subjectivity
  df['subjectivity'] = df.tweet.apply(get_subjectivity)

  # Polarity
  df['polarity'] = df.tweet.apply(get_polarity)

In [None]:
# Applying the custom feature engineering function to our dataframe.
# This process may take 2-5 min.
# ---
#
df.tweet.apply(feature_engineering)
df.sample(5)

Unnamed: 0,tweet,retweets,likes,sentiment,length_of_tweet,word_count,avg_word_length,noun_count,verb_count,adj_count,adv_count,pron_count,subjectivity,polarity
992,pertemuan teknis rencana pelaksanaan kampanye ...,0.0,0.0,positive,96,12,7.083333,9,1,1,0,0,0.0,0.0
439,vaccine never tested safe pregnant women. read...,0.0,5.0,negative,215,27,7.0,18,3,5,1,0,0.0,0.0
1062,get access dengue vaccine india worldimmunizat...,0.0,0.0,negative,93,11,7.545455,9,1,1,0,0,0.0,0.0
268,949 child immunized polio vaccine shargole rol...,0.0,2.0,negative,257,28,8.214286,13,4,7,1,0,0.0,0.0
993,mimzii’s post vaccination recovery! karogonsal...,0.0,0.0,positive,58,7,7.428571,7,2,0,0,0,0.0,0.0


In [None]:
# Performing further feature engineering techniques
# ---
#

# Feature Construction: Word Level N-Gram TF-IDF Feature
tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word', ngram_range=(1,3),  stop_words= 'english')
df_word_vect = tfidf.fit_transform(df.tweet)

# Feature Construction: Character Level N-Gram TF-IDF
tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='char', ngram_range=(1,3),  stop_words= 'english')
df_char_vect = tfidf.fit_transform(df.tweet)



In [None]:
# Label Preparation i.e. replacing categorial values with numerical ones
# ---
#
y = np.array(df['sentiment'].replace(['neutral', 'positive', 'negative'], ['0','1','2']))
y

array(['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',
       '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',
       '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',
       '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '2', '2',
       '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2',
       '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2',
       '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2',
       '2', '2', '2', '2', '2', '2', '2', '2', '2', '1', '1', '1', '1',
       '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
       '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
       '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
       '1', '1', '1', '1', '1', '1', '1'], dtype=object)

In [None]:
# Let's prepare the constructed features for modeling
# ---
# We will select all columns but the sentiment (which is the label) and tweet columns
# ---
#
X_metadata = np.array(df[df.columns.difference(['sentiment', 'tweet'])])
X_metadata

array([[ 0.        ,  0.        , 14.75      , ...,  0.        ,
         0.        ,  4.        ],
       [ 4.        ,  0.        ,  7.42857143, ...,  0.        ,
         1.        , 14.        ],
       [ 3.        ,  0.        ,  7.53846154, ...,  0.        ,
         1.        , 13.        ],
       ...,
       [ 2.        ,  0.        , 12.3       , ...,  0.        ,
         1.        , 10.        ],
       [ 3.        ,  1.        ,  9.52631579, ...,  0.        ,
         3.        , 19.        ],
       [ 2.        ,  1.        , 12.33333333, ...,  0.        ,
         1.        , 15.        ]])

In [None]:
# We combine our two tfidf (sparse) matrices and X_metadata
# ---
#
X = scipy.sparse.hstack([df_word_vect, df_char_vect, X_metadata])
X

<150x2012 sparse matrix of type '<class 'numpy.float64'>'
	with 37641 stored elements in COOrdinate format>

### Data Modelling

In this step we use machine learning algorithms to train and test our sentiment analysis models.

In [None]:
# Splitting our data
# ---
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:

# Install and import xgboost
!pip install xgboost
from xgboost import XGBClassifier



In [None]:
# Fitting our model
# ---
#

# Importing the algorithms
# ---
#
from sklearn.linear_model import LogisticRegression      # Logistic Regression Classifier
from sklearn.tree import DecisionTreeClassifier          # Decision Tree Classifier
from sklearn.svm import SVC                              # SVM Classifier
from sklearn.naive_bayes import MultinomialNB            # Naive Bayes Classifier
from sklearn.neighbors import KNeighborsClassifier       # KNN Classifier

# Ensemble classifiers
from sklearn.ensemble import BaggingClassifier           # Bagging Meta-Estimator Classifier
from sklearn.ensemble import RandomForestClassifier      # RandomForest Classifier
from sklearn.ensemble import AdaBoostClassifier          # AdaBoost Classifier
from sklearn.ensemble import GradientBoostingClassifier  # AdaBoost GradientBoostingClassifier

# Install and import xgboost
!pip install xgboost
from xgboost import XGBClassifier

# Instantiating our models
# ---
#
logistic_classifier = LogisticRegression(solver='saga', max_iter=800, multi_class='multinomial') # solver works well with a large dataset like ours
decision_classifier = DecisionTreeClassifier(random_state=42)
svm_classifier = SVC()
knn_classifier = KNeighborsClassifier()
naive_classifier = MultinomialNB()

bagging_meta_classifier = BaggingClassifier()
random_forest_classifier = RandomForestClassifier()
ada_boost_classifier = AdaBoostClassifier(random_state=42)
gbm_classifier = GradientBoostingClassifier(random_state=42)
xg_boost_classifier = XGBClassifier() # Now you can use XGBClassifier without the xgb prefix

# Training our models
# ---
#
logistic_classifier.fit(X_train, y_train)
decision_classifier.fit(X_train, y_train)
svm_classifier.fit(X_train, y_train)
knn_classifier.fit(X_train, y_train)
naive_classifier.fit(X_train, y_train)

bagging_meta_classifier.fit(X_train, y_train)
random_forest_classifier.fit(X_train, y_train)
ada_boost_classifier.fit(X_train, y_train)
gbm_classifier.fit(X_train, y_train)





In [None]:
# Making predictions
# ---
#
logistic_y_prediction = logistic_classifier.predict(X_test)
decision_y_prediction = decision_classifier.predict(X_test)
svm_y_prediction = svm_classifier.predict(X_test)
knn_y_prediction = knn_classifier.predict(X_test)
naive_y_prediction = naive_classifier.predict(X_test)

bagging_y_classifier = bagging_meta_classifier.predict(X_test)
random_forest_y_classifier = random_forest_classifier.predict(X_test)
ada_boost_y_classifier = ada_boost_classifier.predict(X_test)
gbm_y_classifier = gbm_classifier.predict(X_test)

In [None]:
# Evaluating the Models
# ---
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Accuracy scores
#
print("Logistic Regression Classifier", accuracy_score(logistic_y_prediction, y_test))
print("Decision Trees Classifier", accuracy_score(decision_y_prediction, y_test))
print("SVN Classifier", accuracy_score(svm_y_prediction, y_test))
print("KNN Classifier", accuracy_score(knn_y_prediction, y_test))
print("Naive Bayes Classifier", accuracy_score(naive_y_prediction, y_test))

print("Bagging Classifier", accuracy_score(bagging_y_classifier, y_test))
print("Random Forest Classifier", accuracy_score(random_forest_y_classifier, y_test))
print("Ada Boost Classifier", accuracy_score(ada_boost_y_classifier, y_test))
print("GBM Classifier", accuracy_score(gbm_y_classifier, y_test))

Logistic Regression Classifier 0.43333333333333335
Decision Trees Classifier 0.5
SVN Classifier 0.43333333333333335
KNN Classifier 0.4666666666666667
Naive Bayes Classifier 0.26666666666666666
Bagging Classifier 0.7333333333333333
Random Forest Classifier 0.7
Ada Boost Classifier 0.43333333333333335
GBM Classifier 0.6


We could use the accuracy as a reliable metric because our dataset was balanced.

In [None]:
# Confusion matrix
# ---
# Regardless of the size of the confusion matrix, the method for intepretation is the same.
# The left-hand side contains the predicted values and the actual class labels run across the top.
# The instances that the classifier has correctly predicted run diagonally from the top-left
# to the bottom-right.
# ---
#
print('Logistic Regression Classifier:')
print(confusion_matrix(logistic_y_prediction, y_test))

print('Decision Trees Classifier:')
print(confusion_matrix(decision_y_prediction, y_test))

print('SVN Classifier:')
print(confusion_matrix(svm_y_prediction, y_test))

print('KNN Classifier:')
print(confusion_matrix(knn_y_prediction, y_test))

print('Naive Bayes Classifier:')
print(confusion_matrix(naive_y_prediction, y_test))

print('Bagging Classifier:')
print(confusion_matrix(bagging_y_classifier, y_test))

print('Random Forest Classifier:')
print(confusion_matrix(random_forest_y_classifier, y_test))

print('Ada Boost Classifier:')
print(confusion_matrix(ada_boost_y_classifier, y_test))

print('GBM Classifier:')
print(confusion_matrix(gbm_y_classifier, y_test))

Logistic Regression Classifier:
[[7 4 2]
 [1 2 3]
 [2 5 4]]
Decision Trees Classifier:
[[7 3 2]
 [1 4 3]
 [2 4 4]]
SVN Classifier:
[[8 6 2]
 [0 2 4]
 [2 3 3]]
KNN Classifier:
[[6 5 4]
 [3 3 0]
 [1 3 5]]
Naive Bayes Classifier:
[[0 0 0]
 [2 4 5]
 [8 7 4]]
Bagging Classifier:
[[9 4 1]
 [1 7 2]
 [0 0 6]]
Random Forest Classifier:
[[8 3 0]
 [0 8 4]
 [2 0 5]]
Ada Boost Classifier:
[[5 5 1]
 [5 4 4]
 [0 2 4]]
GBM Classifier:
[[7 4 1]
 [2 5 2]
 [1 2 6]]


**3x3 Matrix Intepretation: Logistic Regression**

Looking at Logistic classification matrix, the first rows are actually 0's, second row 1's and third row 2's. The model predicted 3 of 0's correctly, and incorrectly predicted 2 of the 1's to be 1 and 5 of the 0's to be 2's.




In [None]:
# Classification Reports
# ---
#
print("Logistic Regression Classifier", classification_report(logistic_y_prediction, y_test))
print("Decision Trees Classifier", classification_report(decision_y_prediction, y_test))
print("SVM Classifier", classification_report(svm_y_prediction, y_test))
print("KNN Classifier", classification_report(knn_y_prediction, y_test))
print("Naive Bayes Classifier", classification_report(naive_y_prediction, y_test))

print("Bagging Classifier", classification_report(bagging_y_classifier, y_test))
print("Random Forest Classifier", classification_report(random_forest_y_classifier, y_test))
print("Ada Boost Classifier", classification_report(ada_boost_y_classifier, y_test))
print("GBM Classifier", classification_report(gbm_y_classifier, y_test))

Logistic Regression Classifier               precision    recall  f1-score   support

           0       0.70      0.54      0.61        13
           1       0.18      0.33      0.24         6
           2       0.44      0.36      0.40        11

    accuracy                           0.43        30
   macro avg       0.44      0.41      0.41        30
weighted avg       0.50      0.43      0.46        30

Decision Trees Classifier               precision    recall  f1-score   support

           0       0.70      0.58      0.64        12
           1       0.36      0.50      0.42         8
           2       0.44      0.40      0.42        10

    accuracy                           0.50        30
   macro avg       0.50      0.49      0.49        30
weighted avg       0.53      0.50      0.51        30

SVM Classifier               precision    recall  f1-score   support

           0       0.80      0.50      0.62        16
           1       0.18      0.33      0.24         6
   

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


**Evaluation our Models**

* **Accuracy:** the percentage of texts that were assigned the correct topic.
* **Precision:** the percentage of texts the classifier classified correctly out of the total number of texts it predicted for each topic
* **Recall:** the percentage of texts the model predicted for each topic out of the total number of texts it should have predicted for that topic.
* **F1 Score:** the average of both precision and recall.

### Recommendation

Our best performing models were bagging and Random forest. To improve our
model, we can try perfoming other text processing techniques that would better prepare our data for fitting our model. We can also use different vectorizing techniques, implement other machine learning models, perform hyperparameter tuning and sample a balanced dataset.