# <center> NATURAL LANGUAGE PROCESSING (NLP) </center>
### <center> K NIDHI SHARMA </center>
### <center> VECTORIZATION </center>

Vectorization maps words or phrases from a vocabulary to corresponding vectors of real numbers, which can then be used for word prediction and for measuring word similarity or semantics.

In short, vectorization is the process of converting words into numbers.
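As a minimal sketch of this idea, the snippet below gives each distinct word in a tiny, made-up sentence an integer index and represents every word as a one-hot vector. The sentence and the helper name `one_hot` are illustrative assumptions and are not part of the corpus used in this notebook.

```python
# Toy sentence (hypothetical example data, not from the notebook's corpus).
sentence = "natural language processing makes language useful"
tokens = sentence.split()

# Vocabulary: each distinct word gets a fixed integer index (alphabetical order).
vocabulary = {word: idx for idx, word in enumerate(sorted(set(tokens)))}

def one_hot(word):
    """Return a vector with 1.0 at the word's index and 0.0 elsewhere."""
    vector = [0.0] * len(vocabulary)
    vector[vocabulary[word]] = 1.0
    return vector

print(vocabulary)
print(one_hot("language"))
```

One-hot vectors are the simplest possible mapping; the count and TF-IDF representations used in the rest of this notebook build on the same vocabulary-to-index idea.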

In [1]:
import nltk
import warnings
warnings.filterwarnings('ignore')

import pandas as pd

In [2]:
from nltk.corpus import PlaintextCorpusReader
corpus_root = "E:\\PG_sem3\\nlp\\corpus"

wordlists = PlaintextCorpusReader(corpus_root,'.*')
lst = wordlists.fileids()
lst

['MDS373A_NLP_syllabus.txt',
 'da.txt',
 'ds.txt',
 'mca.txt',
 'newsemi.txt',
 'stat.txt']

In [10]:
lst[0]

'MDS373A_NLP_syllabus.txt'

In [4]:
w=wordlists.words()
w

['MDS373A', ':', 'Natural', 'Language', 'Processing', ...]

### Count vectorizer

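CountVectorizer converts a collection of text documents into a matrix of token counts: each row is a document, each column is a vocabulary term, and each cell holds the number of times that term occurs in that document.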
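To make the counting step concrete, here is a rough pure-Python sketch of what a count vectorizer does. It assumes two made-up documents and mimics scikit-learn's default behaviour of lowercasing the text and keeping tokens of two or more word characters; it is an illustration, not the library's implementation.

```python
import re
from collections import Counter

# Hypothetical toy documents; each string is one document (one row of the matrix).
docs = [
    "NLP: Natural Language Processing",
    "Language models process natural language",
]

# Same default token pattern as CountVectorizer: words of two or more characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and extract tokens of at least two word characters."""
    return TOKEN_PATTERN.findall(text.lower())

# Vocabulary sorted alphabetically, one column per distinct token.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# One row of counts per document.
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[tok] for tok in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Note that punctuation such as `:` never becomes a feature and that casing is lost, which is why the feature names printed in this notebook are all lowercase.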

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

In [6]:
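# Each element of `w` is a single token, so every token is treated as its own document.
vectorizer = CountVectorizer()

# Learn the vocabulary and build the sparse document-term matrix.
X = vectorizer.fit_transform(w)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
print(vectorizer.get_feature_names_out().tolist())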

['05', '10', '106101007', '12', '150', '1994', '1999', '2009', '2013', '2014', '2022', '2nd', '30', '50', '90', 'above', 'abroad', 'ac', 'achieve', 'additional', 'advantage', 'aggregate', 'aiu', 'algorithms', 'all', 'also', 'alternative', 'always', 'ambiguity', 'among', 'amongst', 'an', 'analysis', 'analytics', 'analyze', 'and', 'antonyms', 'any', 'appearing', 'applicants', 'applications', 'apply', 'approach', 'approaches', 'architecture', 'are', 'art', 'as', 'at', 'audio', 'available', 'avoid', 'bachelor', 'background', 'based', 'bayes', 'bba', 'bca', 'bcom', 'be', 'become', 'been', 'being', 'below', 'between', 'bird', 'bottlenecks', 'bsc', 'business', 'by', 'cambridge', 'came', 'can', 'candidate', 'candidates', 'case', 'characteristics', 'check', 'chemistry', 'chromosomes', 'circle', 'class', 'classes', 'classical', 'classification', 'classifications', 'co1', 'co2', 'co3', 'coherence', 'cohesion', 'communication', 'comparable', 'computational', 'compute', 'computer', 'concepts', 'con

In [7]:
## Stopwords: drop common English words (e.g. 'and', 'are', 'all') from the vocabulary

vectorizer = CountVectorizer(stop_words='english')

X = vectorizer.fit_transform(w)
print(vectorizer.get_feature_names_out().tolist())

['05', '10', '106101007', '12', '150', '1994', '1999', '2009', '2013', '2014', '2022', '2nd', '30', '50', '90', 'abroad', 'ac', 'achieve', 'additional', 'advantage', 'aggregate', 'aiu', 'algorithms', 'alternative', 'ambiguity', 'analysis', 'analytics', 'analyze', 'antonyms', 'appearing', 'applicants', 'applications', 'apply', 'approach', 'approaches', 'architecture', 'art', 'audio', 'available', 'avoid', 'bachelor', 'background', 'based', 'bayes', 'bba', 'bca', 'bcom', 'bird', 'bottlenecks', 'bsc', 'business', 'cambridge', 'came', 'candidate', 'candidates', 'case', 'characteristics', 'check', 'chemistry', 'chromosomes', 'circle', 'class', 'classes', 'classical', 'classification', 'classifications', 'co1', 'co2', 'co3', 'coherence', 'cohesion', 'communication', 'comparable', 'computational', 'compute', 'computer', 'concepts', 'conducted', 'considered', 'containing', 'contrastive', 'correction', 'count', 'counterfeit', 'course', 'courses', 'covers', 'credits', 'criteria', 'daniel', 'data
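The effect of `stop_words='english'` can be sketched in plain Python as a simple filter. The stop-word set and the token list below are small hypothetical examples; scikit-learn's built-in English list is considerably longer.

```python
# A tiny, hypothetical stop-word list; scikit-learn's built-in English list is much longer.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to"}

# Made-up tokens for illustration.
tokens = ["the", "goal", "is", "to", "analyze", "the", "syntax", "of", "language"]

# Keep only tokens that are not stop words, preserving their order.
filtered = [tok for tok in tokens if tok not in STOP_WORDS]
print(filtered)
```

Removing such words shrinks the vocabulary, which is why entries like 'above', 'all', and 'also' no longer appear in the second feature list.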

X is stored as a sparse matrix; X.toarray() converts it to a dense NumPy array, which can then be displayed as a DataFrame.

In [8]:
Doc_Term_Matrix = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
Doc_Term_Matrix

Unnamed: 0,05,10,106101007,12,150,1994,1999,2009,2013,2014,...,word,word2vec,wordnet,words,working,works,write,www,year,years
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1425,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1426,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1427,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1428,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### TF-IDF

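TF-IDF (term frequency-inverse document frequency) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. A word's weight rises with its frequency in the document and falls with the number of documents that contain it. It is often used as a weighting factor in information retrieval, text mining, and user modeling.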
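The weighting can be worked through by hand. The sketch below assumes a made-up three-document corpus and uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation of each row, which is the default configuration documented for scikit-learn's TfidfVectorizer. The helper name `tfidf_row` is hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each inner list is one tokenized document.
docs = [
    ["language", "models", "language"],
    ["language", "syntax"],
    ["syntax", "semantics"],
]
vocabulary = sorted({tok for doc in docs for tok in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
df = {term: sum(term in doc for doc in docs) for term in vocabulary}

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

def tfidf_row(doc):
    """Weight raw term counts by IDF, then scale the row to unit L2 length."""
    counts = Counter(doc)
    weights = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]

print(vocabulary)
for doc in docs:
    print([round(x, 3) for x in tfidf_row(doc)])
```

Terms that occur in fewer documents ("models", "semantics") receive a higher IDF than terms shared across documents ("language", "syntax").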

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [11]:
# create the transform
tfidfvectorizer = TfidfVectorizer(analyzer='word',stop_words= 'english')

# tokenize and build vocab
tfidfvectorizer.fit(w)
print(tfidfvectorizer.vocabulary_)

{'mds373a': 213, 'natural': 227, 'language': 192, 'processing': 262, 'total': 345, 'teaching': 334, 'hours': 162, 'semester': 295, '90': 14, 'max': 212, 'marks': 207, '150': 4, 'credits': 87, '05': 0, 'course': 84, 'objectives': 239, 'goal': 150, 'make': 205, 'familiar': 132, 'concepts': 76, 'study': 323, 'human': 165, 'computational': 73, 'perspective': 250, 'covers': 86, 'syntactic': 329, 'semantic': 293, 'discourse': 104, 'models': 225, 'emphasizing': 117, 'machine': 202, 'learning': 194, 'outcomes': 242, 'co1': 66, 'understand': 352, 'various': 361, 'approaches': 34, 'syntax': 330, 'semantics': 294, 'nlp': 234, 'co2': 67, 'apply': 32, 'methods': 218, 'generation': 148, 'dialogue': 100, 'summarization': 326, 'using': 359, 'co3': 68, 'analyze': 27, 'methodologies': 217, 'used': 358, 'translation': 347, 'techniques': 337, 'including': 172, 'unsupervised': 356, 'real': 270, 'time': 340, 'applications': 31, 'unit': 354, '12': 3, 'introduction': 180, 'background': 41, 'overview': 244, 'h

In [12]:
tfidf_wm = tfidfvectorizer.fit_transform(w)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
tfidf_tokens = tfidfvectorizer.get_feature_names_out()

df_tfidfvect = pd.DataFrame(data = tfidf_wm.toarray(),columns = tfidf_tokens)
df_tfidfvect

Unnamed: 0,05,10,106101007,12,150,1994,1999,2009,2013,2014,...,word,word2vec,wordnet,words,working,works,write,www,year,years
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1425,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1426,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1427,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1428,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Comparing TF-IDF & CountVectorizer

TF-IDF is often preferable to raw counts because it reflects not only how frequently a word occurs but also how informative it is: words that appear in most documents receive lower weights. Words with low weights can then be removed before analysis, which reduces the input dimensionality and simplifies model building.
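The following sketch illustrates that contrast on a made-up corpus. The raw counts rank "course" highest simply because it occurs everywhere, while its smoothed IDF is the minimum possible value, so it is the first candidate for removal. The cut-off used here (drop terms present in every document) is a hypothetical choice for illustration only.

```python
import math

# Hypothetical toy corpus of tokenized documents.
docs = [
    ["course", "covers", "syntax"],
    ["course", "covers", "semantics"],
    ["course", "translation"],
]
n_docs = len(docs)
vocabulary = sorted({tok for doc in docs for tok in doc})

# Raw corpus-wide counts: the only signal a pure count-based view provides.
total_counts = {t: sum(doc.count(t) for doc in docs) for t in vocabulary}

# Smoothed IDF: terms found in every document get the minimum value of 1.0.
df = {t: sum(t in doc for doc in docs) for t in vocabulary}
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

# Hypothetical cut-off: drop terms that occur in every document.
kept = [t for t in vocabulary if idf[t] > 1.0]

print(total_counts)
print({t: round(v, 3) for t, v in idf.items()})
print(kept)
```

In practice the threshold would be tuned to the corpus; scikit-learn exposes related controls such as the `max_df` and `min_df` parameters of its vectorizers.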