Week 11: Day 4 – What is Text Classification in NLP

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import nltk
from nltk import word_tokenize

### The Bag of words Approach
In simple terms, the bag-of-words approach builds a list of all the unique words that appear across the documents and then counts how often each of those words occurs in each document.

In [None]:
review_1 = 'The movie was good and we really like it'
review_2 = 'the movie was good but the ending was boring'
review_3 ='we did not like the movie as it was too lenghty'

We'll now tokenize the three reviews and then take the union of their token sets to get all the unique words across the reviews:

- The goal is to classify the movie reviews as positive or negative

In [None]:
# split each review into word tokens
review_1_tokens = word_tokenize(review_1)
print(review_1_tokens)
review_2_tokens = word_tokenize(review_2)
print(review_2_tokens)
review_3_tokens = word_tokenize(review_3)
print(review_3_tokens)


['The', 'movie', 'was', 'good', 'and', 'we', 'really', 'like', 'it']
['the', 'movie', 'was', 'good', 'but', 'the', 'ending', 'was', 'boring']
['we', 'did', 'not', 'like', 'the', 'movie', 'as', 'it', 'was', 'too', 'lenghty']


In [None]:
# Union of tokenized words
review_tokens = set(review_1_tokens).union(set(review_2_tokens)).union(set(review_3_tokens))
print(review_tokens)

{'boring', 'ending', 'movie', 'and', 'we', 'but', 'The', 'the', 'lenghty', 'it', 'as', 'too', 'was', 'not', 'did', 'good', 'really', 'like'}


In [None]:
# number of unique tokens
len(review_tokens)

18

In [None]:
review_tokens

{'The',
 'and',
 'as',
 'boring',
 'but',
 'did',
 'ending',
 'good',
 'it',
 'lenghty',
 'like',
 'movie',
 'not',
 'really',
 'the',
 'too',
 'was',
 'we'}
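Note that the set holds both 'The' and 'the', because tokens are case-sensitive. The sketch below shows one way to merge them by lowercasing first. It uses `str.split()` in place of `word_tokenize`, an assumption that only holds here because these reviews contain no punctuation.

```python
# Sketch: lowercase each review before splitting so 'The' and 'the' collapse.
# str.split() stands in for word_tokenize (assumption: no punctuation).
reviews = [
    'The movie was good and we really like it',
    'the movie was good but the ending was boring',
    'we did not like the movie as it was too lenghty',
]

vocab_cased = set()
vocab_lower = set()
for review in reviews:
    vocab_cased.update(review.split())          # case-sensitive tokens
    vocab_lower.update(review.lower().split())  # case-folded tokens

print(len(vocab_cased))           # 18
print(len(vocab_lower))           # 17
print(vocab_cased - vocab_lower)  # {'The'}
```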

### Processing Tokens

We'll now create a dictionary for each review, where the keys are the 18 tokens and the value of each token starts at 0:

In [None]:
review1_dict = dict.fromkeys(review_tokens,0)
review2_dict = dict.fromkeys(review_tokens,0)
review3_dict = dict.fromkeys(review_tokens,0)

In [None]:
for token in review_1_tokens:
    review1_dict[token]+=1

In [None]:
review1_dict

{'boring': 0,
 'ending': 0,
 'movie': 1,
 'and': 1,
 'we': 1,
 'but': 0,
 'The': 1,
 'the': 0,
 'lenghty': 0,
 'it': 1,
 'as': 0,
 'too': 0,
 'was': 1,
 'not': 0,
 'did': 0,
 'good': 1,
 'really': 1,
 'like': 1}

In [None]:
for token in review_2_tokens:
    review2_dict[token]+=1
    
for token in review_3_tokens:
    review3_dict[token]+=1

In [None]:
# convert to a data frame
reviews_Dict_DF = pd.DataFrame([review1_dict,review2_dict,review3_dict])

In [None]:
reviews_Dict_DF
# at this stage it is easy to apply any machine learning algorithm

   boring  ending  movie  and  we  but  The  the  lenghty  it  as  too  was  not  did  good  really  like
0       0       0      1    1   1    0    1    0        0   1   0    0    1    0    0     1       1     1
1       1       1      1    0   0    1    0    2        0   0   0    0    2    0    0     1       0     0
2       0       0      1    0   1    0    0    1        1   1   1    1    1    1    1     0       0     1
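The dictionary-and-loop counting above can also be written with `collections.Counter`. This is only a sketch: it uses `str.split()` instead of `word_tokenize` (an assumption that works because the reviews have no punctuation), and it sorts the vocabulary so the column order is reproducible.

```python
from collections import Counter

reviews = [
    'The movie was good and we really like it',
    'the movie was good but the ending was boring',
    'we did not like the movie as it was too lenghty',
]

# Tokenize each review (str.split() is a stand-in for word_tokenize).
tokenized = [review.split() for review in reviews]

# Sorted vocabulary of unique, case-sensitive tokens.
vocab = sorted({token for tokens in tokenized for token in tokens})

# One row of counts per review; Counter returns 0 for absent tokens.
rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    rows.append([counts[word] for word in vocab])

print(vocab)
for row in rows:
    print(row)
```

Each row has one entry per vocabulary word, so the rows can be passed to `pd.DataFrame(rows, columns=vocab)` to get the same kind of table as above.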


<!-- ## Adding Counts To The Tokens
Create a for loop which for each of the tokens in the review will add 1 to the value of that token in the dictionary:
     -->

## Count Vectorization in Scikit-Learn

Let's start by importing the necessary libraries as shown below:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

Using the same three reviews, create a list:

In [None]:
review_list = [review_1,review_2,review_3]
review_list

['The movie was good and we really like it',
 'the movie was good but the ending was boring',
 'we did not like the movie as it was too lenghty']

Now, instantiate the count vectorizer, then fit it on the list and transform the list in one step:

In [None]:
count_vect = CountVectorizer()
count_vect

CountVectorizer()

In [None]:
X_counts = count_vect.fit_transform(review_list)
X_counts.toarray()

array([[1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
       [0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 2, 0, 2, 0],
       [0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1]], dtype=int64)

Let's check the type of count vectorized matrix:

In [None]:
type(X_counts)

scipy.sparse.csr.csr_matrix

In [None]:
# returns the list of unique feature names found by the count vectorizer
# (in scikit-learn 1.0 and later, use get_feature_names_out() instead)
X_names = count_vect.get_feature_names()
X_names

['and',
 'as',
 'boring',
 'but',
 'did',
 'ending',
 'good',
 'it',
 'lenghty',
 'like',
 'movie',
 'not',
 'really',
 'the',
 'too',
 'was',
 'we']
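Once fitted, the vectorizer's vocabulary is fixed. The sketch below shows what happens when a new review is passed to `transform`: words that were not seen during fitting are simply ignored. The review text `'the movie was great'` is a made-up example, not part of the original data.

```python
from sklearn.feature_extraction.text import CountVectorizer

review_list = [
    'The movie was good and we really like it',
    'the movie was good but the ending was boring',
    'we did not like the movie as it was too lenghty',
]

count_vect = CountVectorizer()
count_vect.fit(review_list)

# Hypothetical unseen review; 'great' is not in the fitted vocabulary.
new_review = ['the movie was great']
new_counts = count_vect.transform(new_review).toarray()[0]

# vocabulary_ maps each token to its column index.
vocab = count_vect.vocabulary_
print('great' in vocab)  # False
print({tok: int(new_counts[idx]) for tok, idx in vocab.items() if new_counts[idx] > 0})
```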

## Working On the vectorizer Matrix

You will now create a new pandas DataFrame from the SciPy CSR matrix, using the vectorizer's feature names as the column names:

In [None]:
# Convert the sparse matrix to a dense array, then to a DataFrame
a = pd.DataFrame(X_counts.toarray(), columns=X_names)
a
# Note that CountVectorizer lowercases the text and, by default, drops tokens shorter than 2 characters

   and  as  boring  but  did  ending  good  it  lenghty  like  movie  not  really  the  too  was  we
0    1   0       0    0    0       0     1   1        0     1      1    0       1    1    0    1   1
1    0   0       1    1    0       1     1   0        0     0      1    0       0    2    0    2   0
2    0   1       0    0    1       0     0   1        1     1      1    1       0    1    1    1   1
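With the reviews turned into count vectors, a classifier can be trained on them. The following is a minimal sketch, not a realistic model: the labels are hypothetical (the notebook does not provide any), three reviews are far too few to learn from, and `MultinomialNB` is just one possible choice of algorithm.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

review_list = [
    'The movie was good and we really like it',
    'the movie was good but the ending was boring',
    'we did not like the movie as it was too lenghty',
]
# Hypothetical labels for illustration: 1 = positive, 0 = negative.
labels = [1, 1, 0]

# Turn the reviews into a sparse matrix of token counts.
count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(review_list)

# Fit a multinomial naive Bayes classifier on the counts.
clf = MultinomialNB()
clf.fit(X_counts, labels)

# Classify a made-up review using the same fitted vocabulary.
new_review = ['the ending was too lenghty']
pred = clf.predict(count_vect.transform(new_review))
print(pred)
```

The new review must go through `count_vect.transform` (not `fit_transform`) so that its columns line up with the ones the classifier was trained on.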


conda install -c anaconda ipython

# the PyPI package is named scikit-learn; it is imported as sklearn
!pip install scikit-learn
import sklearn

import sys
print(sys.path)

## TF-IDF in Scikit-Learn
TF-IDF is a statistic that shows how important a word is to a document in a collection of documents.

TF(t,d): The number of occurrences of word t in document d

IDF(t): log(total number of documents / number of documents containing t)

TF-IDF score: TFIDF(t,d) = TF(t,d) * IDF(t)
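The definitions above can be checked by hand on the three reviews. The sketch below applies the plain formulas directly; it lowercases and splits on whitespace (an assumption that works for these punctuation-free reviews), and the helper names `tf`, `idf` and `tf_idf` are made up for illustration. It also assumes the term occurs in at least one document, otherwise the division would fail.

```python
import math

reviews = [
    'The movie was good and we really like it',
    'the movie was good but the ending was boring',
    'we did not like the movie as it was too lenghty',
]
docs = [review.lower().split() for review in reviews]

def tf(term, doc):
    # Number of occurrences of term in one document.
    return doc.count(term)

def idf(term, docs):
    # log(total documents / documents containing term).
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# 'movie' appears in every review, so its IDF (and TF-IDF) is 0.
print(tf_idf('movie', docs[1], docs))   # 0.0
# 'boring' appears only in the second review, so it scores log(3).
print(tf_idf('boring', docs[1], docs))  # about 1.0986
```

A word that appears in every document carries no information for telling the documents apart, which is why its score drops to zero under this formula.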

In [None]:
# Let us start with importing the library for TF-IDF:
from sklearn.feature_extraction.text import TfidfVectorizer


In [None]:
import sklearn
sklearn.__version__

'0.23.2'

In [None]:
# Instantiate the TF-IDF vectorizer with the following parameters:
tf_vect = TfidfVectorizer(min_df=1,  # ignore words whose document frequency is below this number
                          lowercase=True,  # convert all the text to lowercase before tokenizing
                          stop_words='english')  # remove common English stop words

Once we have instantiated the vectorizer, we can pass it the list of all three reviews:

In [None]:
# fit and transform data into matrix
tf_matrix = tf_vect.fit_transform(review_list)

The above results in a SciPy CSR matrix with 3 rows and 8 columns:

In [None]:
type(tf_matrix)

scipy.sparse.csr.csr_matrix

In [None]:
# returns row and columns
tf_matrix.shape

(3, 8)

You can check the tokens that the vectorizer has kept with:


In [None]:
# you get back a list with 8 tokens
# (in scikit-learn 1.0 and later, use get_feature_names_out() instead)
tf_names = tf_vect.get_feature_names()

In [None]:
tf_names

['boring', 'did', 'ending', 'good', 'lenghty', 'like', 'movie', 'really']

## CSR Matrix To Pandas DF
You can now create a pandas DataFrame by passing the SciPy CSR matrix (converted to a dense array) as the values and the list of tokens you got above as the column names:

In [None]:
# The dataframe has tf-idf values for all the 8 tokens in floats across all the three reviews
tf_df = pd.DataFrame(tf_matrix.toarray(),columns=tf_names)

In [None]:
# tf-idf weight of each token in each review (a weight, not a probability)
tf_df

     boring       did    ending      good   lenghty      like     movie    really
0  0.000000  0.000000  0.000000  0.480458  0.000000  0.480458  0.373119  0.631745
1  0.584483  0.000000  0.584483  0.444514  0.000000  0.000000  0.345205  0.000000
2  0.000000  0.584483  0.000000  0.000000  0.584483  0.444514  0.345205  0.000000
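These values do not match the plain TF * IDF formula, because `TfidfVectorizer` with its default settings uses a smoothed IDF, ln((1 + n) / (1 + df(t))) + 1, and then scales each row to unit Euclidean (L2) length. The sketch below reproduces the table by hand under those defaults. The token lists are hard-coded from the non-zero columns of the table rather than computed, and the helper name `smooth_idf` is made up for illustration.

```python
import math

# Tokens left in each review after lowercasing and stop-word removal,
# read off the non-zero columns of the table above (hard-coded here).
docs = [
    ['good', 'like', 'movie', 'really'],
    ['boring', 'ending', 'good', 'movie'],
    ['did', 'lenghty', 'like', 'movie'],
]
vocab = sorted({token for doc in docs for token in doc})
n_docs = len(docs)

def smooth_idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1

rows = []
for doc in docs:
    raw = [doc.count(term) * smooth_idf(term) for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))  # L2 norm of the row
    rows.append([round(value / norm, 6) for value in raw])

print(vocab)
for row in rows:
    print(row)
```

Because of the "+ 1" in the smoothed IDF, a word such as 'movie' that appears in every review still gets a non-zero weight, although it is the smallest weight in each row.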
