# Cleaning Text can be Fun. And, painful ;)

It's kind of an EDA, but kind of not an EDA. NDA :)
Sometimes you just have to play around with the data to see what surprises might pop up.

Get your Bag on! Next we are going to work with the bag-of-words method using scikit-learn.

# Bag-of-Words!

The model is simple: it throws away all of the word-order information and focuses only on which words occur in a document. This is done by assigning each word a unique number, so that a document can be encoded as a vector of word counts.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [None]:
# Load only the columns we need: id, excerpt, target (the test set has no target)
df = pd.read_csv("../input/commonlitreadabilityprize/train.csv", usecols=["id", "excerpt", "target"])
test_df = pd.read_csv("../input/commonlitreadabilityprize/test.csv", usecols=["id", "excerpt"])
print("train shape", df.shape)
df.head()

In the bag-of-words model we only care about encoding schemes that represent which words are present in a document, or how often they are present, without any information about their order.

Steps:
* Create an instance of the CountVectorizer class
* Call the fit() function in order to learn a vocabulary from one or more documents.
* Call the transform() function on one or more documents as needed to encode each as a vector.
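
Here is a minimal sketch of those three steps on a tiny made-up corpus (the `toy_docs` sentences and the `toy_` names are hypothetical and not part of the competition data). It is only meant to show what `fit()` learns and what `transform()` returns before we try it on a real excerpt.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration
toy_docs = ["the cat sat on the mat", "the dog sat"]

toy_vectorizer = CountVectorizer()                # step 1: create the vectorizer
toy_vectorizer.fit(toy_docs)                      # step 2: learn the vocabulary
toy_counts = toy_vectorizer.transform(toy_docs)   # step 3: encode each document

# vocabulary_ maps each word to its column index (assigned alphabetically)
print(toy_vectorizer.vocabulary_)
# one row per document, one column per vocabulary word
print(toy_counts.toarray())
```

Each row is one document and each column is one word, so "the" gets a count of 2 in the first row because it appears twice in the first sentence.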

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

So what is all this text about anyway? Do we have sentences? Punctuation? Capital letters? Let's look at one excerpt to see.

And then on to fun!

In [None]:
df.loc[0,'excerpt']

So. Well. We have some work to do here. 
1. We have capital letters.
2. We have punctuation (which we may want or need to keep in order to gauge complexity).
3. We have `\n` for new lines.
4. And finally, we have sentences. 

Maybe a fun exercise will be to clean all of it up and test it in various ways! 
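
One possible cleanup is sketched below. The `clean_excerpt` helper and the `sample` string are hypothetical, and stripping punctuation is only one choice among several, since punctuation may turn out to matter for readability. Note also that `CountVectorizer` already lowercases text and ignores punctuation by default, so a step like this matters most when you want to control the cleaning yourself.

```python
import re
import string

def clean_excerpt(text):
    """Lowercase, turn newlines into spaces, strip punctuation, collapse whitespace."""
    text = text.lower()
    text = text.replace("\n", " ")
    # remove every ASCII punctuation character
    text = text.translate(str.maketrans("", "", string.punctuation))
    # collapse any runs of whitespace left behind
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical sample, not taken from the dataset
sample = "Hello, World!\nThis is a TEST."
print(clean_excerpt(sample))  # hello world this is a test
```

On the real data this could be applied with something like `df["excerpt"].apply(clean_excerpt)`, and the cleaned and raw versions compared to see which works better.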

In [None]:
vectorizer = CountVectorizer()

In [None]:
words_excerpt = df.loc[0,'excerpt']
words_excerpt

In [None]:
# Initially tried passing the bare string (words_excerpt) and ran into an error:
# fit() expects an iterable of documents, not a single string.
# Wrapping the excerpt in a list fixes it.
vector = vectorizer.fit([words_excerpt])
vocab = vector.vocabulary_
# show the vocabulary (word -> column index) that was just built
print(vocab)

In [None]:
# encode your excerpt
vect_enc = vectorizer.transform([words_excerpt])
print(vect_enc.shape)
print(type(vect_enc))
print(vect_enc.toarray())

There are 104 words in the vocab and encoded vectors have a length of 104. 

The encoded vector is a sparse matrix. 

We output an array version of the encoded vector. It shows a count of 1 for words that occur once and higher counts for words that occur more often. One word shows up 19 times!
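
To see which word is behind a given count, the vocabulary can be lined up with the count array. The sketch below does this on a made-up sentence (`toy_text` and the other names are hypothetical); the same idea should carry over to the excerpt above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sentence, used only for illustration
toy_text = "the snow fell and the wind blew and the trees shook"

cv = CountVectorizer()
counts = cv.fit_transform([toy_text]).toarray()[0]

# vocabulary_ maps word -> column index, so sorting the words by index
# puts them in the same order as the columns of the count array
words = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
word_counts = dict(zip(words, counts.tolist()))

# the most frequent words first
top = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top)
```

Here "the" comes out on top with 3 occurrences, followed by "and" with 2.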

# Let's try TF-IDF

TF-IDF is an acronym for Term Frequency - Inverse Document Frequency, the two components of the score assigned to each word.

* Term Frequency: This summarizes how often a given word appears within a document.
* Inverse Document Frequency: This downscales words that appear a lot across documents.

TF-IDF scores are word frequency scores that try to highlight the more interesting words, e.g. words that are frequent in one document but not across documents.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# create the transform
vectorizer = TfidfVectorizer()
#tokenize and build vocab
vectorizer.fit([words_excerpt])
# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)


In [None]:
# encode document
vect_tf = vectorizer.transform([words_excerpt])
# summarize encoded vector
print(vect_tf.shape)
print(vect_tf.toarray())

We now have scores normalized to the range 0 to 1, and the encoded document vectors can be used directly with most machine learning algorithms.
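
One thing worth noticing: when the vectorizer is fit on a single document, every word appears in "all" documents, so every IDF value comes out as 1 and the scores are just normalized counts. The sketch below uses a small made-up corpus (`toy_docs` is hypothetical) to show how IDF starts to separate common words from rare ones once there is more than one document. It assumes scikit-learn's default settings (smoothed IDF, L2 normalization).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration
toy_docs = ["the cat sat", "the dog sat", "the bird flew"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(toy_docs)

# line the words up with their IDF values (column order)
words = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)
idf_by_word = dict(zip(words, tfidf.idf_))
for word, idf in idf_by_word.items():
    print(f"{word:5s} idf={idf:.3f}")

# each encoded row has length 1 under the default L2 normalization
row_norms = np.linalg.norm(X.toarray(), axis=1)
print(row_norms)

# with a single document, every IDF value is 1
single_idf = TfidfVectorizer().fit(["the cat sat"]).idf_
print(single_idf)
```

"the" appears in every toy document and gets the lowest IDF, "sat" appears in two and lands in the middle, and words that appear in only one document get the highest.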

# Hashing

Why Hash? The counts and frequencies above can be useful, but the vocabulary can become very large. A large vocabulary requires large vectors for encoding documents, which increases memory requirements and slows down algorithms.

We can use a one-way hash of words to convert them to integers. The clever part is that no vocabulary is required and you can choose a fixed-length vector of any size you like. A downside is that the hash is a one-way function, so there is no way to convert the encoding back to a word.
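
To make the idea concrete, here is a rough sketch of the hashing trick in plain Python. It is not how scikit-learn implements it: scikit-learn uses MurmurHash3, while this sketch uses `md5` from the standard library purely as a stand-in, and the `hash_bucket` and `hashed_counts` helpers are hypothetical.

```python
import hashlib

def hash_bucket(word, n_buckets=20):
    """Map a word to a bucket index with a deterministic hash (md5 as a stand-in)."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def hashed_counts(text, n_buckets=20):
    """Count words per bucket; no vocabulary is stored anywhere."""
    vec = [0] * n_buckets
    for word in text.lower().split():
        vec[hash_bucket(word, n_buckets)] += 1
    return vec

# Hypothetical sentence, used only for illustration
toy_text = "the cat sat on the mat"
toy_vec = hashed_counts(toy_text)
print(toy_vec)
```

The vector always has 20 slots no matter how many distinct words show up. If two different words land in the same slot their counts are simply added together, which is what a hash collision looks like in practice.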

The example below demonstrates the HashingVectorizer for encoding a single document. An arbitrary fixed-length vector size of 20 was chosen. This corresponds to the range of the hash function, where small values (like 20) may result in hash collisions.

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=20)
# encode text
vect_hash = vectorizer.transform([words_excerpt])
# summarize
print(vect_hash.shape)
print(vect_hash.toarray())

Running the example encodes the sample document as a sparse vector with 20 elements. By default the values of the encoded document are normalized word counts in the range of -1 to 1, but they can be made simple integer counts by changing the default configuration.
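
As a sketch of that configuration change, the example below compares the defaults with `norm=None` and `alternate_sign=False`, which should give plain non-negative counts. The `toy_text` sentence is made up for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical sentence, used only for illustration
toy_text = "the cat sat on the mat"

# defaults: L2-normalized values with alternating signs
default_hv = HashingVectorizer(n_features=20)
default_vec = default_hv.transform([toy_text]).toarray()

# no normalization and no sign flipping: simple counts per hash bucket
count_hv = HashingVectorizer(n_features=20, norm=None, alternate_sign=False)
count_vec = count_hv.transform([toy_text]).toarray()

print(default_vec)
print(count_vec)
```

With the second configuration the values add up to the number of tokens in the text (6 here), even if some words collide in the same bucket.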