# <font color = 'pickle'> Import/install Libraries

In [1]:
# install nltk
!pip install nltk -qq
# install spacy
!pip install -U spacy -qq
# download spacy model
!python -m spacy download en_core_web_sm -qq

2022-09-12 13:25:10.403564: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [2]:
# mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
from pathlib import Path

import re
import sys
import textwrap as tw
import pandas as pd
import numpy as np
from collections import OrderedDict as odict

import spacy

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Import TweetTokenizer from nltk.tokenize module
from nltk.tokenize import TweetTokenizer
# import vectorizers - the main topic of this lecture
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [4]:
spacy.__version__

'3.4.1'

In [5]:
base_folder = Path('/content/drive/MyDrive/data')
data_folder = base_folder/'datasets'

In [6]:
# load spacy model
nlp = spacy.load('en_core_web_sm')

# <font color = 'pickle'> Bag of Words (Sparse Embeddings)


## <font color = 'pickle'>**What is Bag of Words (BoW)?**

A **bag-of-words** is a representation of text that describes the occurrence of words within a document, disregarding grammar and word order. It involves two steps:

1. Create a vocabulary. Each word in the vocabulary becomes a feature (independent variable) used to represent a document.
2. Score the words (for example, by frequency) to create vectors.
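
The two steps can be sketched in plain Python before turning to sklearn. This is only an illustration: the two-sentence `docs` list and the helper `to_count_vector` are made up for this sketch, and whitespace splitting stands in for real tokenization.

```python
from collections import Counter

# Hypothetical mini-corpus, used only for this illustration
docs = ["the cat sat on the mat", "the dog sat"]

# Step 1: create the vocabulary (one feature per unique word)
vocab = sorted({word for doc in docs for word in doc.split()})

# Step 2: score words by frequency to create one vector per document
def to_count_vector(doc):
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocab]

vectors = [to_count_vector(doc) for doc in docs]

print(vocab)
for vector in vectors:
    print(vector)
```

Each vector has one entry per vocabulary word, so word order within a document is lost, which is exactly the "bag" in bag-of-words.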

## <font color = 'pickle'> **Why do you need to learn Bag of Words?**

- So far we have learned how to pre-process, i.e., clean, text data.
- Our final goal is to use text data in Machine Learning (ML) models. For example, we may want to predict whether an e-mail is spam based on its text.
- But ML models work only with numbers, so we need to convert text to vectors of numbers.
- A simple method for converting text to numbers is the 'Bag of Words' approach.






## <font color = 'pickle'>**Learning Outcome** </font>
After completing this tutorial, you will know:

1. What the bag-of-words approach is and how you can use it to represent text data.
2. The different techniques for preparing a vocabulary and scoring words.
3. How to implement the 'Bag-of-words' approach in Python using sklearn.

# <font color = 'pickle'> **Tutorial Overview**
 - Generating Vocab
 - Generating vectors using Vocab
     - Binary Vectorizer
     - Count Vectorizer
     - tfidf Vectorizer
 
 - Modifying Vocab
 - Example - IMDB Dataset


## <font color = 'pickle'> **Generating Vocab**

###  <font color = 'pickle'> **Dummy Corpus**

In [7]:
# Dummy corpus
Corpus = ["Count Vectorizer - for this vectorizer, scoring is done based on frequency. For this vectorizer frequency is key. @vectorizer #frequency @frequency",
          "tfidf vectorizer - for this vectorizer, scoring is done based on tfidf,  higher tfidf higher score #tfidf @vectorizer "  ,
          "Binary vectorizer -for this vectorizer, scoring is done based on presence of word. For this vectorizer, dummy is key #dummy @dummy @vectorizer "]
        

### <font color = 'pickle'>**Create an instance of Vectorizer**

In [8]:
vectorizer = CountVectorizer()

In [9]:
CountVectorizer??

### <font color = 'pickle'>**Fit Vectorizer on corpus to generate vocab**

In [10]:
# Fit the vectorizer on corpus
vectorizer.fit(Corpus)

CountVectorizer()

<font color = 'indianred'>**Vectorizer().fit() does the following**:
* decodes the input as utf-8 (the default encoding)
* lowercases the text (lowercase=True by default)
* performs tokenization (converts raw text to smaller units of text)
* uses word-level tokenization (each word is treated as a separate token) and ignores single-character tokens (words like ‘a’ and ‘I’ are dropped)
* By default, the regular expression used to extract tokens is "(?u)\b\w\w+\b". It finds all sequences of at least two word characters (\w: letters, digits, or underscore) delimited by word boundaries (\b). It therefore skips single-letter words, splits contractions such as “doesn’t” and strings such as “bit.ly” at the punctuation, and matches “h8ter” as a single word. Because the text is lowercased before tokenization, “soon”, “Soon”, and “sOon” all correspond to the same token (and therefore the same feature).
* It then creates a dictionary that maps each unique token to a column index.
* The set of unique tokens is used as the features of the CountVectorizer.
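
The effect of the default token pattern can be reproduced with Python's re module. This is a minimal sketch, not sklearn's own code path: the helper name `default_like_tokenize` and the sample sentence are made up for illustration, and only lowercasing plus the regular expression are imitated.

```python
import re

# Same regular expression as CountVectorizer's default token_pattern
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def default_like_tokenize(text):
    # Lowercase first, then keep runs of two or more word characters
    return re.findall(TOKEN_PATTERN, text.lower())

sample = "He doesn't care, a h8ter shared bit.ly Soon"
tokens = default_like_tokenize(sample)
print(tokens)

# Mentions and hashtags lose their leading symbol
print(default_like_tokenize("key. @vectorizer #frequency"))
```

Because '@' and '#' are not word characters, '@vectorizer' and '#frequency' count as additional occurrences of 'vectorizer' and 'frequency' under the default settings.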

In [11]:
# Let us see the dictionary created 
vectorizer.vocabulary_

{'count': 2,
 'vectorizer': 17,
 'for': 5,
 'this': 16,
 'scoring': 14,
 'is': 8,
 'done': 3,
 'based': 0,
 'on': 11,
 'frequency': 6,
 'key': 9,
 'tfidf': 15,
 'higher': 7,
 'score': 13,
 'binary': 1,
 'presence': 12,
 'of': 10,
 'word': 18,
 'dummy': 4}

In [12]:
# The set of unique words is used as features in the CountVectorizer
features = vectorizer.get_feature_names_out()
print(features)
print(len(features))

['based' 'binary' 'count' 'done' 'dummy' 'for' 'frequency' 'higher' 'is'
 'key' 'of' 'on' 'presence' 'score' 'scoring' 'tfidf' 'this' 'vectorizer'
 'word']
19


## <font color = 'pickle'>**Generate Vectors using Vocab**

### <font color = 'pickle'>**Binary Vectorizer**

In [13]:
binary_vectorizer = CountVectorizer(binary=True)
binary_vectorizer.fit(Corpus)

CountVectorizer(binary=True)

- We can now call the transform() method to convert the sentences in our corpus to vectors.
- Each sentence in the corpus is represented by a vector of length len(vocab).
- The vectors are stored in the form of a sparse matrix.
- We can use the toarray() method to get the complete (dense) matrix.
- The number of columns equals the number of features (len(vocab)).
- The number of rows equals the number of sentences in the corpus.
- For each row, the values are 0 or 1, indicating the absence or presence of a word in the sentence.

In [14]:
binary_vectors = binary_vectorizer.transform(Corpus)

In [15]:
print(f'vectors in sparse format') 
print(binary_vectors)

vectors in sparse format
  (0, 0)	1
  (0, 2)	1
  (0, 3)	1
  (0, 5)	1
  (0, 6)	1
  (0, 8)	1
  (0, 9)	1
  (0, 11)	1
  (0, 14)	1
  (0, 16)	1
  (0, 17)	1
  (1, 0)	1
  (1, 3)	1
  (1, 5)	1
  (1, 7)	1
  (1, 8)	1
  (1, 11)	1
  (1, 13)	1
  (1, 14)	1
  (1, 15)	1
  (1, 16)	1
  (1, 17)	1
  (2, 0)	1
  (2, 1)	1
  (2, 3)	1
  (2, 4)	1
  (2, 5)	1
  (2, 8)	1
  (2, 9)	1
  (2, 10)	1
  (2, 11)	1
  (2, 12)	1
  (2, 14)	1
  (2, 16)	1
  (2, 17)	1
  (2, 18)	1


In [16]:
print(f'\nbinary vectors in array(dense) format') 
print(binary_vectors.toarray())
print(f'\nThe shape of the binary vectors is : {binary_vectors.toarray().shape}')


binary vectors in array(dense) format
[[1 0 1 1 0 1 1 0 1 1 0 1 0 0 1 0 1 1 0]
 [1 0 0 1 0 1 0 1 1 0 0 1 0 1 1 1 1 1 0]
 [1 1 0 1 1 1 0 0 1 1 1 1 1 0 1 0 1 1 1]]

The shape of the binary vectors is : (3, 19)


In [17]:
# create dataframe for better visualization
df_binary = pd.DataFrame(binary_vectors.toarray(), columns = features)
df_binary

Unnamed: 0,based,binary,count,done,dummy,for,frequency,higher,is,key,of,on,presence,score,scoring,tfidf,this,vectorizer,word
0,1,0,1,1,0,1,1,0,1,1,0,1,0,0,1,0,1,1,0
1,1,0,0,1,0,1,0,1,1,0,0,1,0,1,1,1,1,1,0
2,1,1,0,1,1,1,0,0,1,1,1,1,1,0,1,0,1,1,1


### <font color = 'pickle'>**Count Vectorizer**
- The vectors are stored in the form of a sparse matrix.
- The number of columns equals the number of features (len(vocab)).
- The number of rows equals the number of sentences in the corpus.
- Thus, each sentence is represented by a vector whose length equals the size of the vocab.
- For each row, the values are the number of times each word occurs in the sentence.

In [18]:
term_freq_vectorizer = CountVectorizer(binary=False)
# we can combine fit and transform steps into a single step using fit_transform()
count_vectors = term_freq_vectorizer.fit_transform(Corpus)
print(f'count vectors in array (dense) format\n') 
print(count_vectors.toarray())
print(f'\nThe shape of the count vectors is : {count_vectors.toarray().shape}')

count vectors in array (dense) format

[[1 0 1 1 0 2 4 0 2 1 0 1 0 0 1 0 2 4 0]
 [1 0 0 1 0 1 0 2 1 0 0 1 0 1 1 4 1 3 0]
 [1 1 0 1 3 2 0 0 2 1 1 1 1 0 1 0 2 4 1]]

The shape of the count vectors is : (3, 19)


In [19]:
# create dataframe for better visualization
df_count = pd.DataFrame(count_vectors.toarray(), columns = term_freq_vectorizer.get_feature_names_out())
df_count

Unnamed: 0,based,binary,count,done,dummy,for,frequency,higher,is,key,of,on,presence,score,scoring,tfidf,this,vectorizer,word
0,1,0,1,1,0,2,4,0,2,1,0,1,0,0,1,0,2,4,0
1,1,0,0,1,0,1,0,2,1,0,0,1,0,1,1,4,1,3,0
2,1,1,0,1,3,2,0,0,2,1,1,1,1,0,1,0,2,4,1


### <font color = 'pickle'>**tf-idf Vectorizer**

- One measure of a word's importance is its term frequency (tf): how frequently the word occurs in a document. We examined term frequency in the previous sections, where we used CountVectorizer to get the frequency of each word.
- But some words occur many times in a document and also occur in all the other documents.
- Such a word might therefore not be a good representation of the document.
- We can account for this by giving more importance to words that occur in fewer documents, using the inverse document frequency (idf). In its basic form, idf is the logarithm of (number of documents) / (number of documents containing the word).
- This can be combined with term frequency to calculate a term’s tf-idf (the two quantities multiplied together): the frequency of a term adjusted for how rarely it is used.
- The idea of tf-idf is to find the important words for the content of each document by decreasing the weight of commonly used words and increasing the weight of words that are rarely used in a collection or corpus of documents.
- tf-idf gives more weight to the words that occur frequently in a given document but rarely in other documents.

In [20]:
tfidf_vectorizer = TfidfVectorizer()
# we can combine fit and transform steps into a single step using fit_transform()
tfidf_vectors = tfidf_vectorizer.fit_transform(Corpus)
print(f'tfidf vectors in array (dense) format\n') 
print(tfidf_vectors.toarray())
print(f'\nThe shape of the tfidf vectors is : {tfidf_vectors.toarray().shape}')

tfidf vectors in array (dense) format

[[0.11016796 0.         0.18653056 0.11016796 0.         0.22033591
  0.74612225 0.         0.22033591 0.1418613  0.         0.11016796
  0.         0.         0.11016796 0.         0.22033591 0.44067182
  0.        ]
 [0.11455596 0.         0.         0.11455596 0.         0.11455596
  0.         0.3879202  0.11455596 0.         0.         0.11455596
  0.         0.1939601  0.11455596 0.77584039 0.11455596 0.34366788
  0.        ]
 [0.11874019 0.20104462 0.         0.11874019 0.60313387 0.23748039
  0.         0.         0.23748039 0.15289962 0.20104462 0.11874019
  0.20104462 0.         0.11874019 0.         0.23748039 0.47496077
  0.20104462]]

The shape of the tfidf vectors is : (3, 19)


In [21]:
# create dataframe for better visualization
df_tfidf = pd.DataFrame(tfidf_vectors.toarray(), columns = tfidf_vectorizer.get_feature_names_out())
df_tfidf.round(4)

Unnamed: 0,based,binary,count,done,dummy,for,frequency,higher,is,key,of,on,presence,score,scoring,tfidf,this,vectorizer,word
0,0.1102,0.0,0.1865,0.1102,0.0,0.2203,0.7461,0.0,0.2203,0.1419,0.0,0.1102,0.0,0.0,0.1102,0.0,0.2203,0.4407,0.0
1,0.1146,0.0,0.0,0.1146,0.0,0.1146,0.0,0.3879,0.1146,0.0,0.0,0.1146,0.0,0.194,0.1146,0.7758,0.1146,0.3437,0.0
2,0.1187,0.201,0.0,0.1187,0.6031,0.2375,0.0,0.0,0.2375,0.1529,0.201,0.1187,0.201,0.0,0.1187,0.0,0.2375,0.475,0.201


### <font color = 'pickle'>**Understanding tfidf calculations**

By default <br> 
$\text{tfidf}(w, d) = \text{tf}(w, d) \times \text{idf}(w)$
<br>
$\text{idf}(w) = \log\big(\frac{N + 1}{N_w + 1}\big) + 1$
<br><br>
if smooth_idf = False (default is True):
<br>
$\text{idf}(w) = \log\big(\frac{N}{N_w}\big) + 1$
<br><br>
if sublinear_tf = True (default is False), $\text{tf}(w, d)$ is replaced by:
<br>
$1 + \log(\text{tf}(w, d))$

Here:<br>
- $\text{tf}(w, d)$ is the number of times word $w$ appears in document $d$ 
<br>
- $\text{idf}(w)$ is the inverse document frequency of word $w$
- $N$ is the total number of documents
- $N_w$ is the number of documents that contain word $w$
- $\log$ is the natural logarithm
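
The idf formulas above are easy to check by hand. The sketch below is not sklearn code: the helpers `idf` and `sublinear_tf` are hypothetical names that simply evaluate the formulas with the natural logarithm, for a corpus of three documents (the size of the dummy corpus).

```python
import math

def idf(n_docs, doc_freq, smooth=True):
    # Smoothed variant adds one to numerator and denominator
    if smooth:
        return math.log((n_docs + 1) / (doc_freq + 1)) + 1
    return math.log(n_docs / doc_freq) + 1

def sublinear_tf(tf):
    # Dampened term frequency; a zero count stays zero
    return 1 + math.log(tf) if tf > 0 else 0.0

n_docs = 3
for doc_freq in (1, 2, 3):
    smoothed = round(idf(n_docs, doc_freq), 4)
    unsmoothed = round(idf(n_docs, doc_freq, smooth=False), 4)
    print(f"N_w={doc_freq}: smooth idf={smoothed}, plain idf={unsmoothed}")

print(round(sublinear_tf(4), 4))
```

A word that appears in every document gets idf 1 under both variants, so it is never weighted below its raw term frequency.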

In [22]:
# Calculate inverse document frequency for each feature (word)
term_idf = tfidf_vectorizer.idf_
term_idf

array([1.        , 1.69314718, 1.69314718, 1.        , 1.69314718,
       1.        , 1.69314718, 1.69314718, 1.        , 1.28768207,
       1.69314718, 1.        , 1.69314718, 1.69314718, 1.        ,
       1.69314718, 1.        , 1.        , 1.69314718])

In [23]:
# create dataframe for better visualization
df_idf = pd.DataFrame(term_idf, index = tfidf_vectorizer.get_feature_names_out())
df_idf.round(4).T

Unnamed: 0,based,binary,count,done,dummy,for,frequency,higher,is,key,of,on,presence,score,scoring,tfidf,this,vectorizer,word
0,1.0,1.6931,1.6931,1.0,1.6931,1.0,1.6931,1.6931,1.0,1.2877,1.6931,1.0,1.6931,1.6931,1.0,1.6931,1.0,1.0,1.6931


In [24]:
# create dataframe for tf vectors for the first document

first_document_tf = count_vectors[0].toarray().ravel()
feature_names_tf = term_freq_vectorizer.get_feature_names_out()
df_tf = pd.DataFrame({'features':feature_names_tf, 'tf':first_document_tf})

# create dataframe for tfidf vectors for the first document

first_document_tfidf = tfidf_vectors[0].toarray().ravel()
feature_names_tfidf = tfidf_vectorizer.get_feature_names_out()
df_tfidf = pd.DataFrame({'features':feature_names_tfidf , 'idf':term_idf, 'norm_tfidf':first_document_tfidf})

# combine dataframes

df = pd.merge(left = df_tf, right = df_tfidf)
df.sort_values(by=["norm_tfidf"],ascending=False, inplace = True)
df

Unnamed: 0,features,tf,idf,norm_tfidf
6,frequency,4,1.693147,0.746122
17,vectorizer,4,1.0,0.440672
16,this,2,1.0,0.220336
5,for,2,1.0,0.220336
8,is,2,1.0,0.220336
2,count,1,1.693147,0.186531
9,key,1,1.287682,0.141861
11,on,1,1.0,0.110168
14,scoring,1,1.0,0.110168
0,based,1,1.0,0.110168


**Observations from above results**
- The words 'frequency' and 'vectorizer' each occur 4 times in the document, so their term frequency is 4.
- The word 'vectorizer' occurs in every document, so its idf is 1 (log(1) + 1).
- norm_tfidf gives a higher score to 'frequency' than to 'vectorizer'.
- norm_tfidf is not equal to idf * tf, because by default (norm='l2') TfidfVectorizer scales each document vector to unit Euclidean length.

Let us now understand how norm_tfidf is calculated:

In [25]:
# calculate tfidf (without any normalization)
df['tfidf'] = df.eval('tf*idf')

In [26]:
# calculate tfidf - normalized
df['sq_tfidf'] = df.eval('tfidf**2')
df['norm_tfidf_manually'] = df['tfidf']/np.sqrt(df['sq_tfidf'].sum())
df

Unnamed: 0,features,tf,idf,norm_tfidf,tfidf,sq_tfidf,norm_tfidf_manually
6,frequency,4,1.693147,0.746122,6.772589,45.867958,0.746122
17,vectorizer,4,1.0,0.440672,4.0,16.0,0.440672
16,this,2,1.0,0.220336,2.0,4.0,0.220336
5,for,2,1.0,0.220336,2.0,4.0,0.220336
8,is,2,1.0,0.220336,2.0,4.0,0.220336
2,count,1,1.693147,0.186531,1.693147,2.866747,0.186531
9,key,1,1.287682,0.141861,1.287682,1.658125,0.141861
11,on,1,1.0,0.110168,1.0,1.0,0.110168
14,scoring,1,1.0,0.110168,1.0,1.0,0.110168
0,based,1,1.0,0.110168,1.0,1.0,0.110168


## <font color = 'pickle'>**Modifying Vocab**

### <font color = 'pickle'>**Case sensitive**

In [27]:
# lowercase=False keeps the original case, so upper- and lower-case variants become separate features
vectorizer = CountVectorizer(lowercase=False)

# we can use fit_transform to use fit() and transform() in one step
vectors = vectorizer.fit_transform(Corpus)
vectorizer.vocabulary_

{'Count': 1,
 'Vectorizer': 3,
 'for': 7,
 'this': 18,
 'vectorizer': 19,
 'scoring': 16,
 'is': 10,
 'done': 5,
 'based': 4,
 'on': 13,
 'frequency': 8,
 'For': 2,
 'key': 11,
 'tfidf': 17,
 'higher': 9,
 'score': 15,
 'Binary': 0,
 'presence': 14,
 'of': 12,
 'word': 20,
 'dummy': 6}

### <font color = 'pickle'>**Filtering words based on frequency**

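- max_df - Ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). An integer is an absolute document count; a float in [0.0, 1.0] is a proportion of documents.
- min_df - Ignore terms that have a document frequency strictly lower than the given threshold (known as cut-off in the literature). An integer is an absolute document count; a float in [0.0, 1.0] is a proportion of documents.

- max_features - Build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.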
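
The three filters can be sketched in plain Python to make the thresholds concrete. This is an illustrative approximation, not sklearn's implementation: `filter_vocab` and the toy token lists are made up, and ties under max_features are broken alphabetically here, which is an assumption of this sketch.

```python
from collections import Counter

def filter_vocab(tokenized_docs, min_df=1, max_df=1.0, max_features=None):
    n_docs = len(tokenized_docs)
    doc_freq, term_freq = Counter(), Counter()
    for tokens in tokenized_docs:
        term_freq.update(tokens)       # total occurrences across the corpus
        doc_freq.update(set(tokens))   # number of documents containing the term
    # Integers are absolute counts; floats are proportions of documents
    min_count = min_df if isinstance(min_df, int) else min_df * n_docs
    max_count = max_df if isinstance(max_df, int) else max_df * n_docs
    kept = [t for t in doc_freq if min_count <= doc_freq[t] <= max_count]
    if max_features is not None:
        # Keep the most frequent terms (alphabetical tie-break: sketch assumption)
        kept = sorted(kept, key=lambda t: (-term_freq[t], t))[:max_features]
    return sorted(kept)

# Hypothetical pre-tokenized documents
toy_docs = [["red", "apple", "red"], ["green", "apple"], ["apple", "pie", "pie", "pie"]]

print(filter_vocab(toy_docs, min_df=2))        # only terms in at least 2 documents
print(filter_vocab(toy_docs, max_df=2))        # drops terms in more than 2 documents
print(filter_vocab(toy_docs, max_df=0.5))      # drops terms in more than 50% of documents
print(filter_vocab(toy_docs, max_features=2))  # two most frequent terms
```

Note that min_df and max_df look at how many documents contain a term, while max_features ranks terms by their total count across the corpus.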


In [28]:
# remove rare words - remove words which appear in fewer than 2 documents
vectorizer = CountVectorizer(min_df=2)
vectorizer.fit(Corpus)
vectorizer.vocabulary_

{'vectorizer': 8,
 'for': 2,
 'this': 7,
 'scoring': 6,
 'is': 3,
 'done': 1,
 'based': 0,
 'on': 5,
 'key': 4}

In [31]:
# remove words which appear in more than 2 documents
vectorizer = CountVectorizer(max_df=2)
vectorizer.fit(Corpus)
vectorizer.vocabulary_

{'count': 1,
 'frequency': 3,
 'key': 5,
 'tfidf': 9,
 'higher': 4,
 'score': 8,
 'binary': 0,
 'presence': 7,
 'of': 6,
 'word': 10,
 'dummy': 2}

In [32]:
# retain most frequent words only - retain top n words based on term frequency across corpus
vectorizer = CountVectorizer(max_features=5)
vectorizer.fit(Corpus)
vectorizer.vocabulary_

{'vectorizer': 4, 'for': 0, 'this': 3, 'is': 1, 'tfidf': 2}

### <font color = 'pickle'>**Stop Words**

In [33]:
# We can also pass a list of stop words to CountVectorizer to exclude them from the features

# Load the English stop-word list from nltk
nltk_stop_words = stopwords.words('english')

vectorizer = CountVectorizer(max_features=5, stop_words= nltk_stop_words)
vectorizer.fit(Corpus)
vectorizer.vocabulary_

{'vectorizer': 4, 'done': 1, 'based': 0, 'frequency': 2, 'tfidf': 3}

### <font color = 'pickle'>**Custom Tokenizer and Preprocessor**

#### <font color = 'pickle'>**nltk tokenizer**

In [34]:
# We can use a custom tokenizer, e.g. the nltk tweet tokenizer, so that each of its tokens becomes a feature

# Create an object of TweetTokenizer class
tweet_tokenizer = TweetTokenizer()

# Pass the tokenizer object to tokenizer argument in CountVectorizer
# only works if analyzer = 'word'
vectorizer = CountVectorizer(analyzer='word', tokenizer=tweet_tokenizer.tokenize)

vectorizer.fit_transform(Corpus)

vectorizer.vocabulary_

{'count': 11,
 'vectorizer': 26,
 '-': 4,
 'for': 14,
 'this': 25,
 ',': 3,
 'scoring': 23,
 'is': 17,
 'done': 12,
 'based': 9,
 'on': 20,
 'frequency': 15,
 '.': 5,
 'key': 18,
 '@vectorizer': 8,
 '#frequency': 1,
 '@frequency': 7,
 'tfidf': 24,
 'higher': 16,
 'score': 22,
 '#tfidf': 2,
 'binary': 10,
 'presence': 21,
 'of': 19,
 'word': 27,
 'dummy': 13,
 '#dummy': 0,
 '@dummy': 6}

#### <font color = 'pickle'>**spacy pre-processor and tokenizer**

In [35]:
def spacy_preprocessor(text):    

    # Create spacy object
    doc = nlp(text)

    # remove punctuation, apply lemmatization
    filtered_text = [token.lemma_ for token in doc if not token.is_punct]
  
    return " ".join(filtered_text)

In [36]:
# Spacy Tokenizer
def spacy_tokenizer(data):
  doc=nlp(data)
  return [token.text for token in doc]

In [37]:
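# custom preprocessor and spacy tokenizer
# note: a custom preprocessor replaces the built-in preprocessing step, so lowercase=True no longer applies
vectorizer = CountVectorizer(analyzer='word', preprocessor=spacy_preprocessor , tokenizer=spacy_tokenizer, token_pattern=None)
# fit() returns the fitted vectorizer itself (not vectors), so the result is not assigned
vectorizer.fit(Corpus)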
vectorizer.vocabulary_

{'Count': 5,
 'Vectorizer': 6,
 'for': 12,
 'this': 22,
 'vectorizer': 23,
 'scoring': 20,
 'be': 8,
 'done': 10,
 'base': 7,
 'on': 17,
 'frequency': 13,
 'key': 15,
 '@vectorizer': 4,
 '@frequency': 3,
 'tfidf': 21,
 '  ': 0,
 'high': 14,
 'score': 19,
 'binary': 9,
 '-for': 1,
 'presence': 18,
 'of': 16,
 'word': 24,
 'dummy': 11,
 '@dummy': 2}

#### <font color = 'pickle'>**token patterns with regular expressions**

In [38]:
# We can pass a regex to the token_pattern argument to extract tokens matching a required pattern
# whitespace tokenizer: every run of non-whitespace characters is a token
# This can be very useful if we have already cleaned the text
vectorizer = CountVectorizer(analyzer='word', token_pattern=r"[\S]+")

# Assign the encoded(transformed) vectors to a variable
vectors = vectorizer.fit_transform(Corpus)

vectorizer.vocabulary_

{'count': 10,
 'vectorizer': 28,
 '-': 3,
 'for': 13,
 'this': 27,
 'vectorizer,': 29,
 'scoring': 24,
 'is': 17,
 'done': 11,
 'based': 8,
 'on': 21,
 'frequency.': 15,
 'frequency': 14,
 'key.': 19,
 '@vectorizer': 7,
 '#frequency': 1,
 '@frequency': 6,
 'tfidf': 25,
 'tfidf,': 26,
 'higher': 16,
 'score': 23,
 '#tfidf': 2,
 'binary': 9,
 '-for': 4,
 'presence': 22,
 'of': 20,
 'word.': 30,
 'dummy': 12,
 'key': 18,
 '#dummy': 0,
 '@dummy': 5}

### <font color = 'pickle'>**ngrams**

- So far each of our features has been a single token. However, in some cases we may want to use sequences of tokens as features.
- Consider the following corpus:
 1. This item is good
 2. This item is not good
- Both documents have the feature 'good'; 'not' is just one additional feature in document 2.
- For applications like sentiment analysis, it might be a good idea to treat 'not good' as a single token.

- We can use ngram_range=(min_n, max_n) in CountVectorizer to create features that consist of sequences of words.

- We get n-grams of every length from min_n to max_n (inclusive).

- If we specify min_n = 2 and max_n = 3, we get bigrams and trigrams as features.

In [39]:
min_n = 2
max_n = 3

# with analyzer = 'word', ngram_range builds n-grams of words (not characters)
vectorizer1 = CountVectorizer(analyzer = 'word', ngram_range=(min_n, max_n))
vectorizer2 = CountVectorizer(analyzer = 'word', ngram_range=(min_n, max_n))

text1= ["This item is not good"]
text2 = ["This item is good"]

# Fit the vectorizers to text
vectorizer1.fit_transform(text1)
vectorizer2.fit_transform(text2)

features1 = vectorizer1.get_feature_names_out()
features2 = vectorizer2.get_feature_names_out()

print('Features for text 1\n')
for feature in features1:
  print(feature)

print(f'\nFeatures for text 2\n')
for feature in features2:
  print(feature)

Features for text 1

is not
is not good
item is
item is not
not good
this item
this item is

Features for text 2

is good
item is
item is good
this item
this item is
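
How word n-grams are formed from a token list can be sketched in a few lines of plain Python. The helper `word_ngrams` is a hypothetical name for this illustration, and whitespace splitting on an already lowercased string stands in for the vectorizer's tokenization.

```python
def word_ngrams(tokens, min_n, max_n):
    # Slide a window of each size n over the token list
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = "this item is not good".split()

# Bigrams and trigrams, sorted like a vectorizer's feature names
bi_tri = sorted(word_ngrams(tokens, 2, 3))
print(bi_tri)

# Unigrams plus bigrams: keeps single words and adds 'not good'
uni_bi = word_ngrams(tokens, 1, 2)
print(uni_bi)
```

A range that starts at 1 keeps the single-word features and adds the longer sequences, which is often a reasonable compromise when the vocabulary size matters.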


## <font color = 'pickle'>**Example : IMDB Data set** 

### <font color = 'pickle'>**Import Data**

In [40]:
# Use train.csv of IMDB movie review data (we downloaded this in the last lecture)
train_data = data_folder / 'aclImdb'/'train.csv'
test_data = data_folder / 'aclImdb'/'test.csv'

In [41]:
!head -n 2 {str(train_data)}

,Reviews,Labels
0,"Ever wanted to know just how much Hollywood could get away with before the Hayes Code was officially put into effect? Well, unfortunately ""Convention City"" is lost, so well just have to watch ""Tarzan and His Mate"" to find out. For 1934, there is a remarkable amount of sexual innuendo and even exposed flesh. Just look at Jane's nude swim. While Tarzan is often thought of as b-adventure films made for young boys and no one else, this picture proves that the series was originally very adult. Over seventy years later, it is still as sexy as it was when it came out.<br /><br />In addition to the envelope pushing taboo nature, it is a superb and exciting adventure story. I've always enjoyed the jungle films that Hollywood churned out in the 30s and the 40s, but there are few from the genre I'd call great films. ""Tarzan and His Mate"" is by far the best film from this long gone subgenre. The sequences of the attacks on the safari by either apes or natives still manage 

In [42]:
# Reading data
train_df = pd.read_csv(train_data, index_col=0)
test_df = pd.read_csv(test_data, index_col=0)
print(f'Shape of Training data set is : {train_df.shape}')
print(f'Shape of Test data set is : {test_df.shape}')
print(f'\nTop five rows of Training data set:\n')
train_df.head()

Shape of Training data set is : (25000, 2)
Shape of Test data set is : (25000, 2)

Top five rows of Training data set:



Unnamed: 0,Reviews,Labels
0,Ever wanted to know just how much Hollywood co...,1
1,The movie itself was ok for the kids. But I go...,1
2,You could stage a version of Charles Dickens' ...,1
3,this was a fantastic episode. i saw a clip fro...,1
4,and laugh out loud funny in many scenes.<br />...,1


### <font color = 'pickle'>**Generating Vocab**
- <font color = 'indianred'>**Vocab should be created only based on training dataset**</font>
- We will generate vocab using CountVectorizer
- <font color = 'indianred'>**Use fit_transform() on Training data set**. 
- **Use only transform() on Test dataset**. This ensures that we generate vocab only based on training dataset.
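
The consequence of fitting only on the training data can be seen in a small plain-Python sketch. The helpers `fit_vocab` and `transform` and the two toy datasets are made up for illustration; they imitate the fit/transform split rather than reproduce sklearn's implementation.

```python
def fit_vocab(train_docs):
    # The vocabulary is learned from the training documents only
    return sorted({word for doc in train_docs for word in doc.lower().split()})

def transform(docs, vocab):
    index = {word: i for i, word in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for word in doc.lower().split():
            if word in index:          # words unseen during fit are ignored
                row[index[word]] += 1
        rows.append(row)
    return rows

# Hypothetical toy data
train = ["great movie", "boring movie"]
test = ["great plot twist"]

vocab = fit_vocab(train)
train_vectors = transform(train, vocab)
test_vectors = transform(test, vocab)

print(vocab)
print(train_vectors)
print(test_vectors)
```

The test vector has the same number of columns as the training vectors, and the words 'plot' and 'twist' contribute nothing because they were not in the training vocabulary.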

In [43]:
# Initialize vectorizer
nltk_stop_words = stopwords.words('english')
bag_of_word = CountVectorizer(stop_words= nltk_stop_words)

# Fit on training data
bag_of_word.fit(train_df['Reviews'].values)

CountVectorizer(stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
                            'ourselves', 'you', "you're", "you've", "you'll",
                            "you'd", 'your', 'yours', 'yourself', 'yourselves',
                            'he', 'him', 'his', 'himself', 'she', "she's",
                            'her', 'hers', 'herself', 'it', "it's", 'its',
                            'itself', ...])

In [44]:
# get feature names
features = bag_of_word.get_feature_names_out()

In [45]:
# check the length of the vocab
len(features)

74704

### <font color = 'pickle'>**Create vectors for reviews**

In [46]:
# Transform the training and test dataset 
bow_vector_train = bag_of_word.transform(train_df['Reviews'].values)
bow_vector_test = bag_of_word.transform(test_df['Reviews'].values)

In [47]:
# Shape of the matrix for train dataset
bow_vector_train

<25000x74704 sparse matrix of type '<class 'numpy.int64'>'
	with 2479678 stored elements in Compressed Sparse Row format>

In [48]:
# Shape of the matrix for test dataset
bow_vector_test

<25000x74704 sparse matrix of type '<class 'numpy.int64'>'
	with 2385031 stored elements in Compressed Sparse Row format>

### <font color = 'pickle'>**Limit vocab using max_features**
We got 25k rows with 74k+ features, but what if we want only the top 5k features?
We can do this by providing the max_features parameter.

In [49]:
# Limit Vocab size using Max features
spacy_stop_words = nlp.Defaults.stop_words
bag_of_word = CountVectorizer(max_features=5000, stop_words= spacy_stop_words)  # Max features

# Fit on training data
bag_of_word.fit(train_df['Reviews'].values)



CountVectorizer(max_features=5000,
                stop_words={"'d", "'ll", "'m", "'re", "'s", "'ve", 'a', 'about',
                            'above', 'across', 'after', 'afterwards', 'again',
                            'against', 'all', 'almost', 'alone', 'along',
                            'already', 'also', 'although', 'always', 'am',
                            'among', 'amongst', 'amount', 'an', 'and',
                            'another', 'any', ...})

In [50]:
# Transform the training and test dataset 
bow_vector_train = bag_of_word.transform(train_df['Reviews'].values)
bow_vector_test = bag_of_word.transform(test_df['Reviews'].values)

In [51]:
# Document representation
vocab = bag_of_word.get_feature_names_out()
pd.DataFrame(bow_vector_train.toarray(), columns=vocab)

Unnamed: 0,00,000,10,100,11,12,13,13th,14,15,...,yesterday,york,young,younger,youth,zero,zizek,zombie,zombies,zone
0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
24996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
24997,0,0,0,0,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,0,0
24998,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


---