### Vectorization

To use textual data for predictive modelling, the text must first be split into smaller units such as words, optionally dropping unwanted ones (for example punctuation or stop words). This is called tokenization.
These tokens are then encoded as integers or floats so they can be used as inputs to the ML model. This is called vectorization.
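
The two steps can be sketched in plain Python. This is only an illustration, not how scikit-learn is implemented: the helper names `tokenize`, `build_vocabulary` and `vectorize` and the sample sentences are made up for this example, and the regular expression is chosen to resemble scikit-learn's default behaviour of keeping tokens with two or more word characters.

```python
import re

def tokenize(text):
    # Lowercase the text and keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def build_vocabulary(documents):
    # Give every distinct token an integer index, in alphabetical order.
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def vectorize(text, vocabulary):
    # Count how often each vocabulary token occurs in the text.
    counts = [0] * len(vocabulary)
    for tok in tokenize(text):
        if tok in vocabulary:
            counts[vocabulary[tok]] += 1
    return counts

sample_docs = ["The cat sat", "The cat saw the dog"]
vocab = build_vocabulary(sample_docs)
vectors = [vectorize(doc, vocab) for doc in sample_docs]
print(vocab)    # {'cat': 0, 'dog': 1, 'sat': 2, 'saw': 3, 'the': 4}
print(vectors)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each document becomes a fixed-length list of numbers, which is the form a model can consume.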


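### CountVectorizer

It converts a collection of text documents into a matrix of token counts, with one row per document and one column per vocabulary term.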

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
docs = [
    "Simon Baker and Gabriel Macht are amazing actors.",
    "Suits and Mentalist are the best shows I have seen",
    "Harvey Specter and Patrick Jane",
]
cv = CountVectorizer()
# Learn the vocabulary and build the sparse document-term count matrix
X = cv.fit_transform(docs)
print(X.todense())     # one row per document, one column per token
print(cv.vocabulary_)  # token -> column index

[[1 1 1 1 1 0 1 0 0 0 1 0 0 0 0 1 0 0 0]
 [0 0 1 1 0 1 0 0 1 0 0 1 0 1 1 0 0 1 1]
 [0 0 1 0 0 0 0 1 0 1 0 0 1 0 0 0 1 0 0]]
{'simon': 15, 'baker': 4, 'and': 2, 'gabriel': 6, 'macht': 10, 'are': 3, 'amazing': 1, 'actors': 0, 'suits': 17, 'mentalist': 11, 'the': 18, 'best': 5, 'shows': 14, 'have': 8, 'seen': 13, 'harvey': 7, 'specter': 16, 'patrick': 12, 'jane': 9}
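
Note that the word "I" from the second document does not appear in the vocabulary above. With default settings, `CountVectorizer` lowercases the text and only keeps tokens of two or more characters. The sketch below shows one way to keep single-character tokens by passing a custom `token_pattern`; the variable names are chosen for this example, and it assumes scikit-learn is installed.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = ["Suits and Mentalist are the best shows I have seen"]

# Default settings: single-character tokens such as "i" are dropped.
default_cv = CountVectorizer()
default_cv.fit(sentence)

# A pattern that accepts tokens of one or more word characters.
single_char_cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
single_char_cv.fit(sentence)

print("i" in default_cv.vocabulary_)      # False
print("i" in single_char_cv.vocabulary_)  # True
print(len(default_cv.vocabulary_), len(single_char_cv.vocabulary_))  # 9 10
```

Whether short tokens are worth keeping depends on the task; for many English corpora they add little signal.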


### DictVectorizer

It converts a list of feature-name-to-value mappings (Python dicts) into vectors, with one column per distinct feature name.

In [6]:
from sklearn.feature_extraction import DictVectorizer
docs = [{"Harvey Specter": 1, "is": 1, "awesome": 2}, {"Life": 1, "is": 1, "like ": 2, "this": 3, "and": 1, "I": 2, "like it": 3}]
dv = DictVectorizer()
# Each distinct key becomes a column; columns follow sorted key order
X = dv.fit_transform(docs)
print(X.todense())

[[1. 0. 0. 0. 2. 1. 0. 0. 0.]
 [0. 2. 1. 1. 0. 1. 2. 3. 3.]]
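
To see where each column in the output above comes from, the mapping can be reproduced by hand. This is a rough sketch for numeric values only, not scikit-learn's implementation (the real `DictVectorizer` also one-hot encodes string values and returns a sparse matrix); the function name `dict_vectorize` is made up for this example.

```python
def dict_vectorize(records):
    # Collect every key seen in any record and sort them, which matches
    # the default column ordering of DictVectorizer.
    feature_names = sorted({key for record in records for key in record})
    index = {name: i for i, name in enumerate(feature_names)}
    rows = []
    for record in records:
        row = [0.0] * len(feature_names)  # missing keys stay at zero
        for key, value in record.items():
            row[index[key]] = float(value)
        rows.append(row)
    return feature_names, rows

records = [
    {"Harvey Specter": 1, "is": 1, "awesome": 2},
    {"Life": 1, "is": 1, "like ": 2, "this": 3, "and": 1, "I": 2, "like it": 3},
]
feature_names, rows = dict_vectorize(records)
print(feature_names)
for row in rows:
    print(row)
```

Uppercase keys sort before lowercase ones, which is why "Harvey Specter", "I" and "Life" occupy the first three columns.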


### TfidfVectorizer

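It addresses the problem of words that recur across many documents: such words carry little information, so they are given less weight. Tf refers to term frequency (how often a term occurs in a document) and Idf refers to inverse document frequency (a factor that shrinks as the term appears in more documents).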

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

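tfidf_vectorizer = TfidfVectorizer()
cv_vectorizer = CountVectorizer()
docs = ["Mayur is a Guitarist", "Mayur is Musician", "Mayur is also a programmer"]
# Fit both vectorizers on the same documents to compare weights with raw counts
X_idf = tfidf_vectorizer.fit_transform(docs)
X_cv = cv_vectorizer.fit_transform(docs)
print(X_idf.todense())               # TF-IDF weights, one row per document
print(tfidf_vectorizer.vocabulary_)  # token -> column index
print(X_cv.todense())                # raw counts for comparison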

[[0.         0.76749457 0.45329466 0.45329466 0.         0.        ]
 [0.         0.         0.45329466 0.45329466 0.76749457 0.        ]
 [0.6088451  0.         0.35959372 0.35959372 0.         0.6088451 ]]
{'mayur': 3, 'is': 2, 'guitarist': 1, 'musician': 4, 'also': 0, 'programmer': 5}
[[0 1 1 1 0 0]
 [0 0 1 1 1 0]
 [1 0 1 1 0 1]]
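
The weights printed above can be reproduced by hand. The sketch below assumes scikit-learn's default settings: a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term, followed by scaling each row to unit Euclidean length. The token lists are written out manually, without "a", because the default tokenizer drops single-character tokens; the helper `tfidf_row` is a name made up for this example.

```python
import math

docs_tokens = [
    ["mayur", "is", "guitarist"],
    ["mayur", "is", "musician"],
    ["mayur", "is", "also", "programmer"],
]
vocab = sorted({tok for doc in docs_tokens for tok in doc})
n_docs = len(docs_tokens)

# Document frequency: in how many documents each term appears.
df = {tok: sum(tok in doc for doc in docs_tokens) for tok in vocab}
# Smoothed inverse document frequency.
idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

def tfidf_row(tokens):
    # Term count times IDF, then L2-normalise the row.
    raw = [tokens.count(tok) * idf[tok] for tok in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(doc) for doc in docs_tokens]
for row in rows:
    print([round(v, 4) for v in row])
```

"mayur" and "is" occur in all three documents, so their IDF is exactly 1, while terms unique to one document such as "guitarist" get an IDF of about 1.69. That is why the rarer words end up with the larger weights even though every raw count is 1.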
