## Review: What preprocessing steps are needed to apply a machine learning algorithm to text data?

1. The text must be parsed into words, a step called tokenization

2. Then the words need to be encoded as integers or floating-point values

3. The scikit-learn library offers easy-to-use tools to perform both tokenization and feature extraction on text data

## What is TF-IDF Vectorizer?

- Word counts are a good starting point, but are very basic

An alternative is to calculate word frequencies; by far the most popular method is TF-IDF (term frequency-inverse document frequency).

**Term Frequency**: This summarizes how often a given word appears within a document

**Inverse Document Frequency**: This downscales words that appear a lot across documents

## Intuitive idea behind TF-IDF:
    
- If a word appears frequently in a document, it's important. Give the word a high score

- But if a word appears in many documents, it's not a unique identifier. Give the word a low score

<img src="../Notebooks/Images/tfidf_slide.png" width="700" height="700">
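To make the intuition concrete, here is a minimal hand computation of one row of TF-IDF scores. It is a sketch under stated assumptions: the documents are already tokenized with stop words removed, and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization is assumed because it matches the default settings of scikit-learn's `TfidfVectorizer`. The helper name `tfidf_row` is hypothetical.

```python
import math

# Tokenized toy documents with stop words already removed (assumed preprocessing)
docs = [["sky", "blue"],
        ["sun", "bright"],
        ["sun", "sky", "bright"],
        ["shining", "sun", "bright", "sun"]]

def tfidf_row(doc, docs):
    """Return the L2-normalized TF-IDF weight of every word in one document."""
    n = len(docs)
    weights = {}
    for word in set(doc):
        tf = doc.count(word)                     # term frequency in this document
        df = sum(1 for d in docs if word in d)   # number of documents containing the word
        idf = math.log((1 + n) / (1 + df)) + 1   # smoothed IDF (assumed variant)
        weights[word] = tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {word: w / norm for word, w in weights.items()}

row = tfidf_row(docs[0], docs)
print({w: round(s, 4) for w, s in sorted(row.items())})  # {'blue': 0.7853, 'sky': 0.6191}
```

"blue" appears in only one document while "sky" appears in two, so "blue" gets the higher score even though both words occur once in the first document.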

## Activity: Obtain the keywords from TF-IDF

1. First obtain the TF-IDF matrix for the given corpus

2. Sum each column of the matrix (one column per word)

3. Sort the summed scores from highest to lowest

4. Return the words associated with the top scores from step 3

In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

documents = ['The sky is blue', 'The sun is bright', 'The sun in the sky is bright', 'we can see the shining sun, the bright sun']

# TfidfVectorizer (like CountVectorizer) is a bag-of-words model: word order is ignored
vectorizer = TfidfVectorizer(stop_words='english')
tf_idf_matrix = vectorizer.fit_transform(documents)  # rows = documents, columns = words
print(tf_idf_matrix.toarray())

# get_feature_names_out() replaces get_feature_names(), which newer scikit-learn versions removed
feature_names = list(vectorizer.get_feature_names_out())
print(feature_names)  # column order: the first column is 'blue'

# Step 2: column-wise sum (axis=0) gives one total score per word
column_scores = np.asarray(tf_idf_matrix.sum(axis=0)).ravel()

# Steps 3 and 4: sort (word, score) pairs together so each word keeps its own score
ranked = sorted(zip(feature_names, column_scores), key=lambda x: x[1], reverse=True)
# print the top 3 words
print([(word, round(float(score), 3)) for word, score in ranked[:3]])

[[0.78528828 0.         0.         0.6191303  0.        ]
 [0.         0.70710678 0.         0.         0.70710678]
 [0.         0.53256952 0.         0.65782931 0.53256952]
 [0.         0.36626037 0.57381765 0.         0.73252075]]
['blue', 'bright', 'shining', 'sky', 'sun']
[('sun', 1.972), ('bright', 1.606), ('sky', 1.277)]


Note: scikit-learn, Gensim, and Keras all have built-in TF-IDF scoring.

## Word2Vec

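- Data scientists have assigned a vector to each English word

- This process of assigning a vector to each word is called word embedding; Word2Vec is one popular method for doing it

- In DS 2.4, we will learn how the Word2Vec task is accomplished

- Download this huge file of pretrained word vectors (these are GloVe vectors, which serve the same purpose as Word2Vec vectors): https://nlp.stanford.edu/projects/glove/

- Do not open the extracted file directly; it is very large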
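Instead of opening the file by hand, it can be read line by line in code. The sketch below assumes the plain-text layout used by the GloVe downloads, where each line holds a word followed by its space-separated vector components. The function name `load_word_vectors`, the sample rows, and the file name in the trailing comment are assumptions for illustration.

```python
import numpy as np

def load_word_vectors(lines):
    """Parse lines of the form 'word v1 v2 ... vd' into a dict of NumPy arrays."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = np.asarray([float(x) for x in parts[1:]], dtype=np.float32)
    return vectors

# Tiny made-up sample in the same layout; the values are not real embeddings
sample = ["pizza 0.1 0.9 0.3", "food 0.2 0.8 0.4"]
vectors = load_word_vectors(sample)
print(vectors["pizza"].shape)  # (3,)

# With a real file (the file name is an assumption; use whichever file you extracted):
# with open("glove.6B.50d.txt", encoding="utf-8") as f:
#     vectors = load_word_vectors(f)
```

Reading the file as a stream keeps memory use predictable and avoids loading it into a text editor.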

## What property do the vectors associated with each word in Word2Vec have?

- Words with similar meanings are closer to each other in Euclidean space

- For example, if $V_{pizza}$, $V_{food}$ and $V_{sport}$ represent the vectors associated with pizza, food and sport, then:

$\| V_{pizza} - V_{food} \| < \| V_{pizza} - V_{sport} \|$
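The inequality can be checked numerically with `np.linalg.norm`, which computes the Euclidean length of a vector. The three-dimensional vectors below are hand-picked, hypothetical values chosen only to illustrate the comparison; real word vectors have many more dimensions and come from a trained model.

```python
import numpy as np

# Hypothetical 3-dimensional vectors, chosen by hand for illustration only
v_pizza = np.array([0.9, 0.8, 0.1])
v_food = np.array([0.8, 0.9, 0.2])
v_sport = np.array([0.1, 0.2, 0.9])

dist_food = np.linalg.norm(v_pizza - v_food)    # Euclidean distance pizza <-> food
dist_sport = np.linalg.norm(v_pizza - v_sport)  # Euclidean distance pizza <-> sport

print(dist_food < dist_sport)  # True for these toy vectors
```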

In [58]:
hashtable = {'h':2, 'a':5, 'b':1, 'c':15}

           
a = sorted(hashtable.items(), key=lambda x: x[1]) 
a

[('b', 1), ('h', 2), ('a', 5), ('c', 15)]

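In [67]:
# sorted() returns a list of (key, value) tuples, not a dict. Reassigning
# hashtable = sorted(hashtable.items()) would therefore make a later
# hashtable.items() call fail with:
#   AttributeError: 'list' object has no attribute 'items'
# Iterate over the sorted list directly instead.
for key, value in sorted(hashtable.items(), key=lambda x: x[1]):
    print(key, value)

b 1
h 2
a 5
c 15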