# Bag of words tokenizer

This script illustrates how to create a bag-of-words vector for analyzing review texts. 

It is based on the tutorial from here: https://www.analyticsvidhya.com/blog/2021/08/a-friendly-guide-to-nlp-bag-of-words-with-python-example/

Documentation for the sklearn CountVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [None]:
# importing the libraries to vectorize text
# and to manipulate dataframes
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

In [None]:
# import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

In [None]:
# create the feature extractor, i.e., BOW vectorizer
# here we use the default settings, so every token found in the input
# becomes a feature; later we limit the number of features with the
# max_features argument to illustrate the problems (e.g. noise)
# that we can get when using too few features
vectorizer = CountVectorizer()

In [None]:
# simple input data - two sentences
sentence1 = 'printf("Hello world!");'
sentence2 = 'return 1'

In [None]:
# creating the feature vectors for the input data
X = vectorizer.fit_transform([sentence1, sentence2])

# creating the data frame based on the vectorized data
df_bow_sklearn = pd.DataFrame(X.toarray(),
                              columns=vectorizer.get_feature_names_out(),
                              index=[sentence1, sentence2])

# take a peek at the featurized data
df_bow_sklearn.head()

Unnamed: 0,hello,printf,return,world
"printf(""Hello world!"");",1,1,0,1
return 1,0,0,1,0


In [None]:
# we can print the features - column names in our dataframe
print(vectorizer.get_feature_names_out().tolist())

['hello', 'printf', 'return', 'world']
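
To see where these four features come from, and why the `1` in `return 1` is not among them, here is a minimal pure-Python sketch of the same idea. It assumes CountVectorizer's documented defaults (lowercasing and the token pattern `(?u)\b\w\w+\b`, which only keeps tokens of two or more characters); the helper names `tokenize` and `bag_of_words` are made up for this illustration and are not part of sklearn.

```python
import re
from collections import Counter

# Assumption: this mirrors CountVectorizer's documented defaults
# (lowercase=True, token_pattern=r"(?u)\b\w\w+\b"); it is not sklearn's code.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())


def bag_of_words(documents):
    """Return the sorted vocabulary and one count vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # a Counter returns 0 for tokens that do not occur in the document
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = bag_of_words(['printf("Hello world!");', 'return 1'])
print(vocabulary)
print(vectors)
```

The single-character token `1` does not match the pattern, so it never becomes a feature. For source code, where short identifiers and operators can matter, a different `token_pattern` may be a better choice.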


## Review example

Now that we know how feature extraction works, let us work with a slightly more advanced example. Let's take the entire review dataset and extract the bag-of-words features.

In [None]:
# access google drive with code
# we import the library that helps us to connect to Google Drive
from google.colab import drive

# we connect to the google drive
drive.mount('/content/gdrive/')

# and we enter the folder where I stored the data
%cd '/content/gdrive/My Drive/ds/'

Mounted at /content/gdrive/
/content/gdrive/My Drive/ds


In [None]:
# read the file with gerrit code reviews
dfReviews = pd.read_csv('./gerrit_reviews.csv', sep=';')

# just checking that we have the right columns
# and the right data
dfReviews.head()

Unnamed: 0,change_id,revision-id,filename,line,start_line,end_line,LOC,message
0,cps~master~Ia67db468ece4a7ab694d95cb63a954f24d...,eee4b4538e74468dd70ffef68164ad9353d70616,cps-ncmp-service/src/test/groovy/org/onap/cps/...,259.0,249,259,def dmiServiceName = 'some service name',a lof of (brittle) code just for a stricter ch...
1,cps~master~Ia67db468ece4a7ab694d95cb63a954f24d...,eee4b4538e74468dd70ffef68164ad9353d70616,cps-ncmp-service/src/test/groovy/org/onap/cps/...,259.0,249,259,def compositeState = new CompositeStat...,a lof of (brittle) code just for a stricter ch...
2,cps~master~Ia67db468ece4a7ab694d95cb63a954f24d...,eee4b4538e74468dd70ffef68164ad9353d70616,cps-ncmp-service/src/test/groovy/org/onap/cps/...,259.0,249,259,lockReason: CompositeState.Loc...,a lof of (brittle) code just for a stricter ch...
3,cps~master~Ia67db468ece4a7ab694d95cb63a954f24d...,eee4b4538e74468dd70ffef68164ad9353d70616,cps-ncmp-service/src/test/groovy/org/onap/cps/...,259.0,249,259,"lastUpdateTime: 'some-timestamp',",a lof of (brittle) code just for a stricter ch...
4,cps~master~Ia67db468ece4a7ab694d95cb63a954f24d...,eee4b4538e74468dd70ffef68164ad9353d70616,cps-ncmp-service/src/test/groovy/org/onap/cps/...,259.0,249,259,"dataSyncEnabled: false,",a lof of (brittle) code just for a stricter ch...


In [None]:
import numpy as np
# before we use the feature extractor, let's check if the data contains NANs
print(f'The data contains {dfReviews.LOC.isnull().sum()} empty rows')

# remove the rows that contain a NaN in any column
dfReviews.dropna(inplace=True)

# checking again, to make sure that it does not contain them
print(f'The data contains {dfReviews.LOC.isnull().sum()} empty rows')


The data contains 218 empty rows
The data contains 0 empty rows


In [None]:
# now, let's convert the code (LOC) column to the vector of features
# using BOW from the example above, limited to the 20 most frequent tokens
vectorizer = CountVectorizer(max_features=20)

dfFeatures = vectorizer.fit_transform(dfReviews.LOC)

# creating the data frame based on the vectorized data
df_bow_sklearn = pd.DataFrame(dfFeatures.toarray(),
                              columns=vectorizer.get_feature_names_out(),
                              index=dfReviews.LOC)

# take a peek at the featurized data
df_bow_sklearn.head()

Unnamed: 0_level_0,and,components,data,def,exception,final,fragmententity,if,is,log,new,public,return,state,string,the,this,to,xml,xpath
LOC,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
def dmiServiceName = 'some service name',0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
"def compositeState = new CompositeState(cmHandleState: CmHandleState.ADVISED,",0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
"lockReason: CompositeState.LockReason.builder().lockReasonCategory(LockReasonCategory.LOCKED_MODULE_SYNC_FAILED).details(""lock details"").build(),",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
"lastUpdateTime: 'some-timestamp',",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
"dataSyncEnabled: false,",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


What we can see in the table above is that there are a lot of 0s in the feature vectors. That is because we only use 20 features, so many lines of code contain none of them.

Let's make this a bit more reliable - let's put constraints on the frequency of features, rather than on their number and see what happens.

In [None]:
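# now, let's convert the code (LOC) column to the vector of features again
# this time we keep only the tokens that appear in at least 2 lines (min_df)
# and in at most 10 lines (max_df); integer values are absolute counts
vectorizer = CountVectorizer(min_df=2,
                             max_df=10)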

dfFeatures = vectorizer.fit_transform(dfReviews.LOC)

# creating the data frame based on the vectorized data
df_bow_sklearn = pd.DataFrame(dfFeatures.toarray(),
                              columns=vectorizer.get_feature_names_out(),
                              index=dfReviews.LOC)

# take a peek at the featurized data
df_bow_sklearn.head()

Unnamed: 0_level_0,11,2017,2019,2020,2021,2022,204,2c,2f,400,...,xpathsdescendant,yangcontainername,yangcontainername_,yangmodelcmhandle,yangresourcenametocontent,yangresourcesnametocontentmap,yangtextschemasourceset,yangtextschemasourcesetbuilder,yangtextschemasourcesetcache,yangutils
LOC,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
def dmiServiceName = 'some service name',0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"def compositeState = new CompositeState(cmHandleState: CmHandleState.ADVISED,",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"lockReason: CompositeState.LockReason.builder().lockReasonCategory(LockReasonCategory.LOCKED_MODULE_SYNC_FAILED).details(""lock details"").build(),",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"lastUpdateTime: 'some-timestamp',",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"dataSyncEnabled: false,",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# we can also check how many features we actually have
# which is a lot in this case
len(df_bow_sklearn.columns)

662
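
The effect of the frequency constraints is easier to follow on a tiny corpus. The sketch below uses a made-up set of four documents (an assumption for illustration, not the review data) and keeps only the tokens that occur in exactly two documents. When `min_df` and `max_df` are integers, sklearn interprets them as absolute document counts.

```python
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical toy corpus, not taken from the review dataset
docs = [
    "return value",
    "return result",
    "return value now",
    "print result",
]
# document frequencies: return=3, value=2, result=2, now=1, print=1

# keep tokens that occur in at least 2 and at most 2 documents
toy_vectorizer = CountVectorizer(min_df=2, max_df=2)
toy_vectorizer.fit(docs)

# vocabulary_ maps every kept token to its column index
kept = sorted(toy_vectorizer.vocabulary_)
print(kept)
```

Here `return` is dropped because it is too frequent, while `now` and `print` are dropped because they are too rare. The same mechanism is what leaves 662 features in the review data.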

## Seeding empty rows

In order to illustrate the problems with empty data points, let's introduce empty data points to the data set: for every feature, we draw a random sample of 10% of the rows and replace the values with NaN.


In [None]:
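# for each feature (column), draw a random 10% sample of the rows
# and replace the values of that feature with NaN
# note: .loc selects by index label, so rows that share the same
# LOC text are all set to NaN together
for col in df_bow_sklearn.columns:
  df_bow_sklearn.loc[df_bow_sklearn.sample(frac=0.1).index, col] = np.nan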

In [None]:
df_bow_sklearn.head()

Unnamed: 0_level_0,11,2017,2019,2020,2021,2022,204,2c,2f,400,...,xpathsdescendant,yangcontainername,yangcontainername_,yangmodelcmhandle,yangresourcenametocontent,yangresourcesnametocontentmap,yangtextschemasourceset,yangtextschemasourcesetbuilder,yangtextschemasourcesetcache,yangutils
LOC,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
def dmiServiceName = 'some service name',0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,0.0,0.0,,0.0,,0.0,0.0,,0.0
"def compositeState = new CompositeState(cmHandleState: CmHandleState.ADVISED,",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"lockReason: CompositeState.LockReason.builder().lockReasonCategory(LockReasonCategory.LOCKED_MODULE_SYNC_FAILED).details(""lock details"").build(),",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,,0.0,,0.0,0.0,,0.0,0.0,0.0
"lastUpdateTime: 'some-timestamp',",0.0,0.0,0.0,0.0,0.0,0.0,,0.0,,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"dataSyncEnabled: false,",0.0,,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
df_bow_sklearn.to_csv('./gerrit_reviews_nan.csv', sep='$')
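
The seeding step can be checked on a small table before running it on the full feature set. This is a minimal sketch with a hypothetical 20-row DataFrame named `toy` and a fixed `random_state` per column, both added here only to make the result reproducible; the notebook itself samples without a seed.

```python
import numpy as np
import pandas as pd

# hypothetical toy feature table standing in for df_bow_sklearn
toy = pd.DataFrame(np.zeros((20, 3)), columns=["def", "new", "return"])

for i, col in enumerate(toy.columns):
    # a different but fixed seed per column keeps the example reproducible
    rows = toy.sample(frac=0.1, random_state=i).index
    toy.loc[rows, col] = np.nan

# number of NaN values per column
print(toy.isnull().sum())
```

Because the toy index is unique, each column gets exactly 10% NaNs (2 of 20 rows). With an index that contains duplicate labels, such as the LOC text, more rows can be affected.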

## Imputing missing values

In this part of the demonstration, we learn how to use the IterativeImputer to fill in the missing data in the dataset.

Please remember that filling in the missing data (imputation) is not without problems. The imputer does not really know what the data should be, but it learns the patterns in the data and repeats them.

So, the imputed data can contain noise and is definitely less trustworthy compared to the clean data. 

In [None]:
import numpy as np
dfNaNs = pd.read_csv('./gerrit_reviews_nan.csv', sep='$')

# before we impute the data, let's check how many NaNs each column contains
print(f'The data contains {dfNaNs.isnull().sum()} NaN values')

The data contains LOC                                 0
11                                222
2017                              186
2019                              208
2020                              194
                                 ... 
yangresourcesnametocontentmap     213
yangtextschemasourceset           205
yangtextschemasourcesetbuilder    208
yangtextschemasourcesetcache      207
yangutils                         185
Length: 663, dtype: int64 NaN values


In [None]:
dfNaNs.shape

(939, 663)

In [None]:
# remove the empty rows
dfNaNs.dropna(inplace=True)

dfNaNs.shape

(0, 663)

After we removed the rows with missing values, there are no data points left!

So, we cannot use the best strategy - removal. We need to adopt another strategy - imputing the data. 
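
The empty result is not a coincidence. As a rough back-of-the-envelope sketch, assume that each of the 662 feature columns independently has about 10% of its values set to NaN (the per-column NaN counts printed above are actually higher, so this is an optimistic assumption). The chance that a row has no NaN at all is then tiny:

```python
# Assumption: each of the 662 feature columns independently has about
# 10% of its values replaced by NaN, as intended by the seeding step.
n_rows = 939
n_columns = 662
nan_fraction = 0.1

# probability that a single row has a value in every column
p_row_complete = (1 - nan_fraction) ** n_columns

# expected number of rows that survive dropna()
expected_complete_rows = n_rows * p_row_complete

print(f"P(row has no NaN) = {p_row_complete:.2e}")
print(f"Expected complete rows = {expected_complete_rows:.2e}")
```

With this many columns, row-wise removal discards practically every row even when each individual column is mostly complete.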

In this example, we use the IterativeImputer method from the Python sklearn library: https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer 

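IterativeImputer is one of the newest imputers in the library and is still marked as experimental. It is based on an iterative algorithm, which chooses one feature as y and trains a regressor to predict it based on the other features. It does this for every feature with missing values, in a round-robin fashion, and repeats the rounds until the imputed values stop changing or the maximum number of iterations (max_iter) is reached.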
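
Before running the imputer on the full dataset, which takes several minutes, it can help to try it on a tiny example. The sketch below uses a made-up two-column array in which the second column is twice the first; the data and the variable names are assumptions for illustration only.

```python
import numpy as np
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# hypothetical toy data: the second column is twice the first
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
    [np.nan, 10.0],
])

imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)

# observed values are kept as they are, only the NaNs are replaced
print(X_filled)
```

The imputed values are estimates from a regression model, so they follow the pattern in the data only approximately.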

In [None]:
# since we removed NaNs, we get no data
# so we need to read the dataset again in order to perform
# data imputation
dfNaNs = pd.read_csv('./gerrit_reviews_nan.csv', sep='$')

In [None]:
# in order to use the imputer, we need to remove the index from the data
# we remove the index by first re-setting it (so that it becomes a regular column)
# and then by removing this column. 
dfNaNs_features = dfNaNs.reset_index()
dfNaNs_features.drop(['LOC', 'index'], axis=1, inplace=True)
dfNaNs_features.head()

Unnamed: 0,11,2017,2019,2020,2021,2022,204,2c,2f,400,...,xpathsdescendant,yangcontainername,yangcontainername_,yangmodelcmhandle,yangresourcenametocontent,yangresourcesnametocontentmap,yangtextschemasourceset,yangtextschemasourcesetbuilder,yangtextschemasourcesetcache,yangutils
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,0.0,0.0,,0.0,,0.0,0.0,,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,,0.0,,0.0,0.0,,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# let's use the iterative imputer to impute data to the dataframe
# IterativeImputer is experimental, so it has to be enabled explicitly first
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# create the instance of the imputer
imp = IterativeImputer(max_iter=3, 
                       random_state=42,
                       verbose = 2)

# train the imputer on the features in the dataset
imp.fit(dfNaNs_features)

[IterativeImputer] Completing matrix with shape (939, 662)
[IterativeImputer] Ending imputation round 1/3, elapsed time 296.99
[IterativeImputer] Change: 63.10345024384126, scaled tolerance: 0.004 
[IterativeImputer] Ending imputation round 2/3, elapsed time 601.36
[IterativeImputer] Change: 38.602742334401924, scaled tolerance: 0.004 
[IterativeImputer] Ending imputation round 3/3, elapsed time 883.77
[IterativeImputer] Change: 21.298713137313726, scaled tolerance: 0.004 




IterativeImputer(max_iter=3, random_state=42, verbose=2)

In [None]:
# now, we fill in the NaNs in the original dataset
npNoNaNs = imp.transform(dfNaNs_features)
dfNoNaNs = pd.DataFrame(npNoNaNs)

[IterativeImputer] Completing matrix with shape (939, 662)
[IterativeImputer] Ending imputation round 1/3, elapsed time 1.70
[IterativeImputer] Ending imputation round 2/3, elapsed time 3.05
[IterativeImputer] Ending imputation round 3/3, elapsed time 6.34


In [None]:
dfNoNaNs.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,652,653,654,655,656,657,658,659,660,661
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.024886,0.0,0.0,0.59267,0.0,0.027994,0.0,0.0,-0.015006,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.031049,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,-0.176479,0.0,0.038797,0.0,0.0,-0.020543,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00571,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.026949,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
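
Two things stand out in the table above: the column names are gone (the imputer returns a plain NumPy array), and some imputed values are fractional or even negative, which cannot happen for word counts. The following sketch shows one possible way to clean this up on a small made-up array; the feature names and values are hypothetical, and whether clipping and rounding is appropriate depends on what the data is used for afterwards.

```python
import numpy as np
import pandas as pd

# hypothetical stand-ins for the feature names and the imputed array
feature_names = ["xpath", "yangutils", "yangmodelcmhandle"]
imputed = np.array([
    [0.0, 0.59267, -0.015006],
    [1.0, 0.0, 0.027994],
])

# restore the column names that the imputer dropped
df_imputed = pd.DataFrame(imputed, columns=feature_names)

# counts cannot be negative or fractional: clip at 0, then round
df_counts = df_imputed.clip(lower=0).round().astype(int)
print(df_counts)
```

In the notebook, the same idea would amount to passing `columns=dfNaNs_features.columns` when creating `dfNoNaNs`.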
