
In [1]:
# Count Vectorization (sometimes loosely called One-Hot Encoding)

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

In [3]:
# CountVectorizer

# To create a CountVectorizer, we simply need to instantiate one.
# There are optional parameters we can set when making the vectorizer, but
# for the most basic example, they are not needed.
vectorizer = CountVectorizer()
vectorizer

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)
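
The repr above lists the constructor parameters. As a rough sketch of what one of them changes, the snippet below sets `ngram_range=(1, 2)` so that pairs of adjacent words are indexed alongside single words. The three-word sentence and the names `toy_text` and `bigram_vectorizer` are made up for illustration and are not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy sentence, chosen only to keep the vocabulary small.
toy_text = ["the cat sat"]

# ngram_range=(1, 2) asks for single words (unigrams) and word pairs (bigrams).
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
bigram_vectorizer.fit(toy_text)

# vocabulary_ maps each term to a column index; sorting the keys lists the terms.
terms = sorted(bigram_vectorizer.vocabulary_)
print(terms)
# ['cat', 'cat sat', 'sat', 'the', 'the cat']
```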

In [4]:
# For our sample text, we will use a single sentence

sample_text = ["One of the most basic ways we can numerically represent words "
               "is through the one-hot encoding method (also sometimes called "
               "count vectorizing)."]
sample_text

['One of the most basic ways we can numerically represent words is through the one-hot encoding method (also sometimes called count vectorizing).']

In [5]:
# To build the vocabulary, we simply need to call fit on the text
# data that we wish to vectorize
vectorizer.fit(sample_text)

# Now, we can inspect how our vectorizer indexed the text
# This will print a dict mapping each word to its column index in the vectors
print('Vocabulary Count: ')
print(vectorizer.vocabulary_)

Vocabulary Count: 
{'one': 12, 'of': 11, 'the': 15, 'most': 9, 'basic': 1, 'ways': 18, 'we': 19, 'can': 3, 'numerically': 10, 'represent': 13, 'words': 20, 'is': 7, 'through': 16, 'hot': 6, 'encoding': 5, 'method': 8, 'also': 0, 'sometimes': 14, 'called': 2, 'count': 4, 'vectorizing': 17}
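
To see where these indices come from, here is a minimal standard-library sketch of the vocabulary step. It assumes the default settings shown in the repr earlier (lowercasing plus the `(?u)\b\w\w+\b` token pattern) and is only an approximation of what `fit` does, not scikit-learn's actual implementation. `build_vocabulary` is a hypothetical helper name.

```python
import re

# Default token pattern from the repr: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def build_vocabulary(documents):
    """Map each distinct lowercase token to its position in sorted order."""
    tokens = set()
    for doc in documents:
        tokens.update(TOKEN_PATTERN.findall(doc.lower()))
    return {token: index for index, token in enumerate(sorted(tokens))}

sample_text = ["One of the most basic ways we can numerically represent words "
               "is through the one-hot encoding method (also sometimes called "
               "count vectorizing)."]

vocabulary = build_vocabulary(sample_text)
print(vocabulary["hot"], vocabulary["one"], len(vocabulary))
# 6 12 21
```

The hyphen in "one-hot" is not a word character, so the pattern splits it into "one" and "hot", and the indices are simply alphabetical positions.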


In [6]:
# If we would like to actually create a vector, we can do so by passing the
# text into the vectorizer to get back counts
vector = vectorizer.transform(sample_text)

# Our final vector:
print('Full vector: ')
print(vector.toarray())

Full vector: 
[[1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 1 1 1]]
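
The two 2s in this vector are what separate a count vector from a strict one-hot (presence/absence) vector. The sketch below reproduces the counting step with the standard library and adds a `binary` flag that mimics the effect of the `binary` parameter listed in the repr. `tokenize` and `count_vector` are hypothetical helpers written for this illustration, under the assumption of default tokenization.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())

def count_vector(text, vocabulary, binary=False):
    """Return one count per vocabulary column; words outside it are skipped."""
    vector = [0] * len(vocabulary)
    for token, count in Counter(tokenize(text)).items():
        index = vocabulary.get(token)
        if index is None:
            continue  # token was never seen when the vocabulary was built
        vector[index] = 1 if binary else count
    return vector

sentence = ("One of the most basic ways we can numerically represent words "
            "is through the one-hot encoding method (also sometimes called "
            "count vectorizing).")
vocabulary = {t: i for i, t in enumerate(sorted(set(tokenize(sentence))))}

counts = count_vector(sentence, vocabulary)
presence = count_vector(sentence, vocabulary, binary=True)
print(counts)    # 'one' (index 12) and 'the' (index 15) appear twice
print(presence)  # every entry is 1
```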


In [7]:
# Or if we wanted to get the vector for one word:
print('Hot vector: ')
print(vectorizer.transform(['hot']).toarray())

Hot vector: 
[[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
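
One detail worth checking here: a fitted vectorizer only knows the words it saw during `fit`. The short sketch below refits on the same sentence and transforms a phrase containing "dog", a word chosen for illustration that is not in the vocabulary; it is dropped rather than raising an error.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(["One of the most basic ways we can numerically represent words "
                "is through the one-hot encoding method (also sometimes called "
                "count vectorizing)."])

# "dog" never appeared during fit, so only "hot" contributes to the vector.
row = vectorizer.transform(["hot dog"]).toarray()[0]
print(row.sum())
print(row[vectorizer.vocabulary_["hot"]])
```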


In [8]:
# Or if we wanted to get multiple vectors at once to build matrices
print('Hot and one: ')
print(vectorizer.transform(['hot', 'one']).toarray())

Hot and one: 
[[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]]
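
Note that `transform` returns a SciPy sparse matrix, and `.toarray()` is what turns it into the dense arrays printed above. A small sketch of inspecting the sparse result directly, which matters once the vocabulary gets large:

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(["One of the most basic ways we can numerically represent words "
                "is through the one-hot encoding method (also sometimes called "
                "count vectorizing)."])

matrix = vectorizer.transform(["hot", "one"])

print(sparse.issparse(matrix))  # the result is stored sparsely
print(matrix.shape)             # one row per document, one column per word
print(matrix.nnz)               # number of stored non-zero entries
```

Only the non-zero entries are stored, so for a two-row matrix with one known word per row there are just two stored values, regardless of vocabulary size.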


In [9]:
# We could also do the whole thing at once with the fit_transform method:
print('One swoop:')
new_text = ['Today is the day that I do the thing today, today']
new_vectorizer = CountVectorizer()
print(new_vectorizer.fit_transform(new_text).toarray())

One swoop:
[[1 1 1 1 2 1 3]]
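
The output has seven columns even though the sentence uses eight distinct words. A quick standard-library check, assuming the default token pattern, pairs each count with its word and shows that the single-character "I" is dropped:

```python
import re
from collections import Counter

new_text = "Today is the day that I do the thing today, today"

# Default-style tokenization: lowercase, tokens of two or more word characters.
tokens = re.findall(r"(?u)\b\w\w+\b", new_text.lower())

# Sorting by word matches the alphabetical column order of the vectorizer.
labelled = sorted(Counter(tokens).items())
print(labelled)
# [('day', 1), ('do', 1), ('is', 1), ('that', 1), ('the', 2), ('thing', 1), ('today', 3)]
```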


In [10]:
# Using It on Real Data:

# So let’s use it on some real data!
# We will check out the 20 Newsgroups dataset, which scikit-learn can download for us.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

import numpy as np

# Create our vectorizer
vectorizer = CountVectorizer()

# Let's fetch the text data (by default, this is the training subset)
newsgroups_data = fetch_20newsgroups()

# Why not inspect a sample of the text data?
print('Sample 0: ')
print(newsgroups_data.data[0])
print()

# Fit the vectorizer to build the vocabulary
vectorizer.fit(newsgroups_data.data)

# Let's look at the vocabulary:
print('Vocabulary: ')
print(vectorizer.vocabulary_)
print()

# Converting our first sample into a vector
v0 = vectorizer.transform([newsgroups_data.data[0]]).toarray()[0]
print('Sample 0 (vectorized): ')
print(v0)
print()

# It's too big to even see...
# What's the length?
print('Sample 0 (vectorized) length: ')
print(len(v0))
print()

# How many word occurrences does it contain?
print('Sample 0 (vectorized) sum: ')
print(np.sum(v0))
print()

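# What if we wanted to go back to the words? (order and repeats are lost)
# inverse_transform expects a 2D array with one row per document
print('To the source:')
print(vectorizer.inverse_transform(v0.reshape(1, -1)))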
print()

# The raw posts include headers, footers, and quoted replies... Why not strip them away?
newsgroups_data = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))

# Why not inspect a sample of the text data?
print('Sample 0: ')
print(newsgroups_data.data[0])
print()

# Refit the vectorizer on the cleaned text
vectorizer.fit(newsgroups_data.data)

# Let's look at the vocabulary:
print('Vocabulary: ')
print(vectorizer.vocabulary_)
print()

# Converting our first sample into a vector
v0 = vectorizer.transform([newsgroups_data.data[0]]).toarray()[0]
print('Sample 0 (vectorized): ')
print(v0)
print()

# It's too big to even see...
# What's the length?
print('Sample 0 (vectorized) length: ')
print(len(v0))
print()

# How many word occurrences does it contain?
print('Sample 0 (vectorized) sum: ')
print(np.sum(v0))
print()

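# What if we wanted to go back to the words? (order and repeats are lost)
# inverse_transform expects a 2D array with one row per document
print('To the source:')
print(vectorizer.inverse_transform(v0.reshape(1, -1)))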
print()

Sample 0: 
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----






Vocabulary: 

Sample 0 (vectorized): 
[0 0 0 ... 0 0 0]

Sample 0 (vectorized) length: 
130107

Sample 0 (vectorized) sum: 
122

To the source:
[array(['15', '60s', '70s', 'addition', 'all', 'anyone', 'be', 'body',
       'bricklin', 'brought', 'bumper', 'by', 'call