<a href="https://colab.research.google.com/github/Maheenms/GoogleCoLab/blob/main/nlp_hashingTF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import os
# Find the latest version of spark 3.2  from http://www.apache.org/dist/spark/ and enter as the spark version
# For example:
# spark_version = 'spark-3.2.2'
spark_version = 'spark-3.2.2'
os.environ['SPARK_VERSION']=spark_version

# Install Spark and Java
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://www.apache.org/dist/spark/$SPARK_VERSION/$SPARK_VERSION-bin-hadoop2.7.tgz
!tar xf $SPARK_VERSION-bin-hadoop2.7.tgz
!pip install -q findspark

# Set Environment Variables
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = f"/content/{spark_version}-bin-hadoop2.7"

# Locate the Spark installation so that pyspark can be imported
import findspark
findspark.init()


In [2]:
# Start Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Hashing").getOrCreate()

In [4]:
# HashingTF: term frequency, i.e. a count of how often each word occurs in a document
# IDF: inverse document frequency, which highlights the most important words across the phrases
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

TF-IDF (term frequency-inverse document frequency) is an information retrieval technique that helps find the most relevant documents corresponding to a given query.

TF measures how often a term appears in a document, and IDF measures how rare, and therefore how informative, that term is across the whole collection of documents. Multiplying the two scores gives the TF-IDF score.

Google has reportedly used TF-IDF (also written TF IDF, TF*IDF, TFIDF, or TF.IDF) as one signal for ranking content for a long time. It seems that Google weighs how frequently a term occurs relative to other documents rather than simply counting keywords.

Search engines can use TF-IDF to better understand what a piece of content is about. For example, when you search for “Coke” on Google, Google may use TF-IDF to figure out if a page titled “COKE” is about:

1. Coca-Cola.
2. Drugs.
3. A solid, carbon-rich residue derived from the distillation of crude oil.
4. A county in Texas.


The TF-IDF algorithm weighs a keyword in a piece of content and assigns importance to that keyword based on the number of times it appears in the document. More importantly, it checks how common the keyword is across the whole collection of documents (for a search engine, the web), which is referred to as the corpus.

For a term $t$ in document $d$, the weight $W_{t,d}$ of term $t$ in document $d$ is given by:

$$W_{t,d} = TF_{t,d} \cdot \log\left(\frac{N}{DF_t}\right)$$

Where:

* $TF_{t,d}$ is the number of occurrences of $t$ in document $d$.

* $DF_t$ is the number of documents containing the term $t$.

* $N$ is the total number of documents in the corpus (the number of rows of data).
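As a rough illustration of the formula above, the following sketch computes the weight in plain Python for a toy corpus. The function name `tf_idf_weight` and the corpus are hypothetical, chosen only for this example, and the natural logarithm is an assumption because the formula does not fix a base.

```python
import math

def tf_idf_weight(term, doc, corpus):
    """W_{t,d} = TF_{t,d} * log(N / DF_t), using raw counts for TF."""
    tf = doc.count(term)                      # occurrences of the term in this document
    df = sum(1 for d in corpus if term in d)  # number of documents containing the term
    if df == 0:
        return 0.0                            # term never seen in the corpus
    return tf * math.log(len(corpus) / df)    # natural log (an assumption)

# Hypothetical toy corpus: each document is a list of tokens
corpus = [
    ["PYTHON", "HIVE", "HIVE"],
    ["JAVA", "JAVA", "SQL"],
]

w_hive = tf_idf_weight("HIVE", corpus[0], corpus)
w_sql = tf_idf_weight("SQL", corpus[0], corpus)
print(w_hive)  # 2 * log(2 / 1)
print(w_sql)   # zero, because SQL does not occur in document 0
```

Note that a term occurring in every document would get log(N/N) = 0 regardless of how often it appears.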

**How is the TF-IDF score calculated?**

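With the formula above, a TF-IDF weight is never negative, but it is not capped at 1: a large term count can push it higher. For a fixed term frequency, the higher the weight, the rarer the term across the corpus; the smaller the weight, the more common the term. A term that appears in every document gets a weight of 0, because log(N/N) = 0.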

**TF (term frequency) example**

The TF (term frequency) of a word is how often it appears in a document. It can be kept as a raw count, as in the formula above, or divided by the document length, as in this example. When you know TF, you’re able to see if you’re using a term too much or too little.

When a 100-word document contains the term “cat” 12 times, the TF for the word “cat” is

TFcat = 12/100 = 0.12

**IDF (inverse document frequency) example**

The IDF (inverse document frequency) of a word is a measure of how significant that term is in the whole corpus (a body of documents).

Let’s say the size of the corpus is 10,000,000 (10 million) documents. If we assume there are 300,000 (0.3 million) documents that contain the term “cat”, then the IDF, i.e. log(N/DF), is the base-10 logarithm of the total number of documents (10,000,000) divided by the number of documents containing the term “cat” (300,000).

IDF(cat) = log10(10,000,000/300,000) = 1.52

**TF-IDF calculation**

Put the TF and IDF calculations together to get a TF-IDF score.

∴ Wcat = (TF*IDF)cat = 0.12 * 1.52 = 0.182

A TF-IDF score of 0.182 is low. This suggests that “cat” is a fairly common term in this corpus and carries little weight.
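The arithmetic of the “cat” example can be checked with a few lines of plain Python. This is only a sketch of the worked example: the variable names are made up for illustration, and the base-10 logarithm is assumed because it is the base that reproduces the 1.52 figure.

```python
import math

tf_cat = 12 / 100                            # "cat" appears 12 times in a 100-word document
idf_cat = math.log10(10_000_000 / 300_000)   # base-10 log of N / DF
w_cat = tf_cat * idf_cat                     # TF-IDF weight for "cat"

print(round(idf_cat, 2))  # 1.52
print(round(w_cat, 2))    # 0.18
```

Carrying full precision gives roughly 0.183 for the product; rounding the IDF to 1.52 first gives 0.182. The difference is only rounding.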

Now that you have this figured out (right?), let’s look at how this can benefit you.

In [5]:
# Input data: Each row is a bag of words with an ID
df = spark.createDataFrame([
    (0, "PYTHON HIVE HIVE".split(" ")),
    (1, "JAVA JAVA SQL".split(" "))
], ["id", "words"])
df.show(truncate = False)

+---+--------------------+
|id |words               |
+---+--------------------+
|0  |[PYTHON, HIVE, HIVE]|
|1  |[JAVA, JAVA, SQL]   |
+---+--------------------+



First, CountVectorizer generates a vocabulary when an a priori vocabulary is not available. In this example, CountVectorizer creates a vocabulary of size 4 containing the terms PYTHON, HIVE, JAVA and SQL. This happens while the CountVectorizer model is fitted: during fitting, CountVectorizer selects the top vocabSize words ordered by term frequency across the corpus. The fitted model then produces a sparse vector of term counts for each document, which can be fed into other algorithms.

In [20]:
# Fit a CountVectorizerModel from the corpus
from pyspark.ml.feature import CountVectorizer
cv = CountVectorizer(inputCol="words", outputCol="features")
model = cv.fit(df)
result = model.transform(df)
result.show(truncate=False)

+---+--------------------+-------------------+
|id |words               |features           |
+---+--------------------+-------------------+
|0  |[PYTHON, HIVE, HIVE]|(4,[0,3],[2.0,1.0])|
|1  |[JAVA, JAVA, SQL]   |(4,[1,2],[2.0,1.0])|
+---+--------------------+-------------------+




<img src='https://miro.medium.com/max/1400/1*cE0fIDQbIWoHuFLcvujqmw.png'>


For the purpose of understanding, the feature vector can be divided into 3 parts:

<img src='https://miro.medium.com/max/828/1*LvsokJXq_I77pgKb9ZckUg.png'>

* The leading number represents the size of the vector. Here, it is 4.
* The first list of numbers represents the vector indices.

<img src='https://miro.medium.com/max/640/1*BXTQ7nnRNtc9DLadnYeBAQ.png'>

For instance, the term ‘JAVA’ has a higher frequency (2) than the term ‘SQL’ (1). Therefore, ‘JAVA’ has index 1 whereas ‘SQL’ has index 2.

* The second list of numbers represents the values corresponding to these indices.
In the second document (id 1), ‘JAVA’ with index 1 has value 2 and ‘SQL’ with index 2 has value 1.

However, note that since ‘HIVE’ and ‘JAVA’ have the same frequency, their indices are interchangeable.

<img src='https://miro.medium.com/max/640/1*PdzU9Mp3-pC4baRyjK09Tw.png'>

Here, ‘HIVE’ has index 0 and ‘JAVA’ has index 1.

<img src='https://miro.medium.com/max/640/1*QuUdQrgdEv1fhKWMmDW9QA.png'>

Here, ‘HIVE’ has index 1 and ‘JAVA’ has index 0.
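To make the vocabulary and the three-part sparse vector more concrete, here is a minimal plain-Python sketch of the same idea. It is not Spark's implementation: the helper `to_sparse` is hypothetical, and ties between equally frequent terms are broken by first appearance here, which Spark does not guarantee.

```python
from collections import Counter

docs = [
    ["PYTHON", "HIVE", "HIVE"],
    ["JAVA", "JAVA", "SQL"],
]

# Count every term across the corpus and order the vocabulary by frequency.
# Ties keep first-seen order in this sketch; Spark makes no such promise.
corpus_counts = Counter(term for doc in docs for term in doc)
vocabulary = [term for term, _ in corpus_counts.most_common()]
index_of = {term: i for i, term in enumerate(vocabulary)}

def to_sparse(doc):
    """Return (size, indices, values), mirroring how the sparse vector is printed."""
    counts = Counter(index_of[term] for term in doc)
    indices = sorted(counts)
    return len(vocabulary), indices, [float(counts[i]) for i in indices]

print(vocabulary)          # ['HIVE', 'JAVA', 'PYTHON', 'SQL']
print(to_sparse(docs[0]))  # (4, [0, 2], [2.0, 1.0])
print(to_sparse(docs[1]))  # (4, [1, 3], [2.0, 1.0])
```

The indices of ‘PYTHON’ and ‘SQL’ come out swapped relative to the Spark output above, which is the same tie-breaking effect described for ‘HIVE’ and ‘JAVA’.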


**HashingTF**

HashingTF converts documents to vectors of fixed size. The default feature dimension is 262,144 (2^18). Terms are mapped to indices using a hash function, MurmurHash 3, and the term frequencies are then computed with respect to the mapped indices. Unlike CountVectorizer, no vocabulary has to be fitted first.

In [9]:
# Get term frequency vector through HashingTF
from pyspark.ml.feature import HashingTF
ht = HashingTF(inputCol="words", outputCol="features")
result = ht.transform(df)
result.show(truncate=False)

+---+--------------------+----------------------------------+
|id |words               |features                          |
+---+--------------------+----------------------------------+
|0  |[PYTHON, HIVE, HIVE]|(262144,[129668,191247],[2.0,1.0])|
|1  |[JAVA, JAVA, SQL]   |(262144,[53343,256570],[2.0,1.0]) |
+---+--------------------+----------------------------------+



<img src='https://miro.medium.com/max/828/1*ewi6TzvNlFGZgxIyY6U9BQ.png'>

It can be seen in the above output that the dimension of the vector is set to the default, i.e. 262,144. Also, the term ‘PYTHON’ is mapped to index 191247 by the hashing function and has a frequency of 1, while ‘HIVE’ is mapped to index 129668 and has a frequency of 2. Similar insights can be gained for the other terms.
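The hashing step itself can be sketched in plain Python. This is an illustration only: `hashing_tf` is a hypothetical helper, and `zlib.crc32` is used as a stand-in hash because MurmurHash 3 is not in the standard library, so the indices it produces will not match Spark's.

```python
import zlib
from collections import Counter

def hashing_tf(tokens, num_features=262144):
    """Map each token to a bucket and count the tokens that land in each bucket.

    zlib.crc32 is only a stand-in for this sketch; Spark's HashingTF uses
    MurmurHash 3, so real Spark indices will differ from the ones printed here.
    """
    counts = Counter(zlib.crc32(tok.encode("utf-8")) % num_features for tok in tokens)
    indices = sorted(counts)
    return num_features, indices, [float(counts[i]) for i in indices]

size, indices, values = hashing_tf(["PYTHON", "HIVE", "HIVE"])
print(size, indices, values)
```

Because the index depends only on the term and the number of buckets, the same term always lands at the same position, in any document and on any machine.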

In [7]:
ht2 = HashingTF(inputCol='words', outputCol='features', numFeatures=32)  # 32 buckets = 2^5
result = ht2.transform(df)
result.show(truncate = False)

+---+--------------------+----------------------+
|id |words               |features              |
+---+--------------------+----------------------+
|0  |[PYTHON, HIVE, HIVE]|(32,[4,15],[2.0,1.0]) |
|1  |[JAVA, JAVA, SQL]   |(32,[26,31],[1.0,2.0])|
+---+--------------------+----------------------+



In [10]:
# Fit the IDF model on the dataset

idf = IDF(inputCol='features', outputCol='features 2')
result = ht.transform(df)  # use the default 262,144-feature vectors, as in the output below
idfModel = idf.fit(result)  # fit the model on the term-frequency DataFrame
rescaledData = idfModel.transform(result)

# Show the resulting data
rescaledData.show(truncate=False)

+---+--------------------+----------------------------------+----------------------------------------------------------------+
|id |words               |features                          |features 2                                                      |
+---+--------------------+----------------------------------+----------------------------------------------------------------+
|0  |[PYTHON, HIVE, HIVE]|(262144,[129668,191247],[2.0,1.0])|(262144,[129668,191247],[0.8109302162163288,0.4054651081081644])|
|1  |[JAVA, JAVA, SQL]   |(262144,[53343,256570],[2.0,1.0]) |(262144,[53343,256570],[0.8109302162163288,0.4054651081081644]) |
+---+--------------------+----------------------------------+----------------------------------------------------------------+
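The values in the `features 2` column are consistent with the smoothed IDF that the Spark documentation describes, log((N + 1) / (DF + 1)) with the natural logarithm, rather than the plain log(N/DF) shown earlier. A small sketch of that arithmetic, with a hypothetical helper name `smoothed_idf`:

```python
import math

def smoothed_idf(num_docs, doc_freq):
    """IDF with add-one smoothing: log((N + 1) / (DF + 1)), natural log."""
    return math.log((num_docs + 1) / (doc_freq + 1))

# Two documents; every hashed index above occurs in exactly one of them.
idf_value = smoothed_idf(num_docs=2, doc_freq=1)

print(1.0 * idf_value)  # weight for a term count of 1.0, about 0.405
print(2.0 * idf_value)  # weight for a term count of 2.0, about 0.811
```

Both rows end up with the same weights because each term here appears in exactly one of the two documents, so every index shares the same IDF.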



**Example 2**

In [11]:
# Sample DataFrame with repeating words
dataframe = spark.createDataFrame([
    (0, "The cow cow jumped and jumped cow"),
    (1, "then the cow said"),
    (2, "I am a cow that jumped")
],["id", "words"])

dataframe.show(truncate=False)

+---+---------------------------------+
|id |words                            |
+---+---------------------------------+
|0  |The cow cow jumped and jumped cow|
|1  |then the cow said                |
|2  |I am a cow that jumped           |
+---+---------------------------------+



In [12]:
# Tokenize the words
tokenizer = Tokenizer(inputCol="words", outputCol="tokens")
wordsData = tokenizer.transform(dataframe)
wordsData.show(truncate=False)

+---+---------------------------------+-----------------------------------------+
|id |words                            |tokens                                   |
+---+---------------------------------+-----------------------------------------+
|0  |The cow cow jumped and jumped cow|[the, cow, cow, jumped, and, jumped, cow]|
|1  |then the cow said                |[then, the, cow, said]                   |
|2  |I am a cow that jumped           |[i, am, a, cow, that, jumped]            |
+---+---------------------------------+-----------------------------------------+
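Judging from the output above, `Tokenizer` lowercases each string and splits it on whitespace. A plain-Python approximation of that behavior (the helper name `simple_tokenize` is hypothetical, and this is not Spark's implementation) could look like this:

```python
def simple_tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

tokens = simple_tokenize("The cow cow jumped and jumped cow")
print(tokens)  # ['the', 'cow', 'cow', 'jumped', 'and', 'jumped', 'cow']
```

Lowercasing matters for the later steps: “The” and “the” become the same token and are therefore counted together.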



HashingTF hashes each term to an index in a fixed-length vector and returns a vector of term counts.

* Note that HashingTF takes a numFeatures parameter that specifies the number of buckets into which the words will be split. This number should be comfortably higher than the number of unique words; otherwise different words are likely to collide in the same bucket.

* By default, this value is 2^18, or 262,144. A power of 2 is advisable so that terms are mapped evenly to the indices.
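The effect of the bucket count on collisions can be explored with a small sketch. As an assumption for illustration, `zlib.crc32` stands in for Spark's MurmurHash 3, so the exact counts printed here say nothing about which words collide in Spark; only the trend is meaningful.

```python
import zlib

# The ten unique tokens of the cow example
tokens = ["the", "cow", "jumped", "and", "then", "said", "i", "am", "a", "that"]

def bucket(token, num_features):
    # zlib.crc32 is a stand-in hash for illustration; Spark uses MurmurHash 3
    return zlib.crc32(token.encode("utf-8")) % num_features

for num_features in (4, 16, 2 ** 18):
    used = {bucket(tok, num_features) for tok in tokens}
    collisions = len(tokens) - len(used)
    print(f"numFeatures={num_features}: {len(used)} buckets used, {collisions} collisions")
```

With only 4 buckets, at least 6 of the 10 tokens must share a bucket with another token. Larger bucket counts make collisions progressively less likely, at the cost of longer (but still sparse) vectors.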

In [22]:
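# For comparison, build term-count vectors with CountVectorizer first
from pyspark.ml.feature import CountVectorizer
cV2 = CountVectorizer(inputCol="tokens", outputCol="features")
model = cV2.fit(wordsData)  # learns a 10-word vocabulary from the tokens
result2 = model.transform(wordsData)
result2.show(truncate=False)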


+---+---------------------------------+-----------------------------------------+--------------------------------------------+
|id |words                            |tokens                                   |features                                    |
+---+---------------------------------+-----------------------------------------+--------------------------------------------+
|0  |The cow cow jumped and jumped cow|[the, cow, cow, jumped, and, jumped, cow]|(10,[0,1,2,9],[3.0,2.0,1.0,1.0])            |
|1  |then the cow said                |[then, the, cow, said]                   |(10,[0,2,7,8],[1.0,1.0,1.0,1.0])            |
|2  |I am a cow that jumped           |[i, am, a, cow, that, jumped]            |(10,[0,1,3,4,5,6],[1.0,1.0,1.0,1.0,1.0,1.0])|
+---+---------------------------------+-----------------------------------------+--------------------------------------------+



In [23]:
# Run the hashing term frequency
hashing = HashingTF(inputCol="tokens", outputCol="hashedValues", numFeatures=pow(2,4))

# Transform into a DF
hashed_df = hashing.transform(wordsData)

In [24]:
# Display new DataFrame
hashed_df.show(truncate=False)

+---+---------------------------------+-----------------------------------------+----------------------------------------+
|id |words                            |tokens                                   |hashedValues                            |
+---+---------------------------------+-----------------------------------------+----------------------------------------+
|0  |The cow cow jumped and jumped cow|[the, cow, cow, jumped, and, jumped, cow]|(16,[1,3,6,11],[1.0,3.0,2.0,1.0])       |
|1  |then the cow said                |[then, the, cow, said]                   |(16,[0,1,3,13],[1.0,1.0,1.0,1.0])       |
|2  |I am a cow that jumped           |[i, am, a, cow, that, jumped]            |(16,[0,3,6,12,13],[1.0,2.0,1.0,1.0,1.0])|
+---+---------------------------------+-----------------------------------------+----------------------------------------+



In [25]:
# Fit the IDF on the data set 
idf = IDF(inputCol="hashedValues", outputCol="features")
idfModel = idf.fit(hashed_df)
rescaledData = idfModel.transform(hashed_df)

In [26]:
# Display the DataFrame
rescaledData.select("tokens", "features").show(truncate=False)

+-----------------------------------------+-------------------------------------------------------------------------------------------------------+
|tokens                                   |features                                                                                               |
+-----------------------------------------+-------------------------------------------------------------------------------------------------------+
|[the, cow, cow, jumped, and, jumped, cow]|(16,[1,3,6,11],[0.28768207245178085,0.0,0.5753641449035617,0.6931471805599453])                        |
|[then, the, cow, said]                   |(16,[0,1,3,13],[0.28768207245178085,0.28768207245178085,0.0,0.28768207245178085])                      |
|[i, am, a, cow, that, jumped]            |(16,[0,3,6,12,13],[0.28768207245178085,0.0,0.28768207245178085,0.6931471805599453,0.28768207245178085])|
+-----------------------------------------+---------------------------------------------------------------------
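The weights in the `features` column can be reproduced by hand from the `hashedValues` vectors shown earlier. The sketch below assumes the smoothed IDF described in the Spark documentation, log((N + 1) / (DF + 1)) with the natural logarithm; the dictionary layout and variable names are made up for illustration.

```python
import math

# Sparse term-count vectors copied from the hashedValues column: {index: count}
hashed = [
    {1: 1.0, 3: 3.0, 6: 2.0, 11: 1.0},
    {0: 1.0, 1: 1.0, 3: 1.0, 13: 1.0},
    {0: 1.0, 3: 2.0, 6: 1.0, 12: 1.0, 13: 1.0},
]

n_docs = len(hashed)

# Document frequency per bucket: in how many documents the index appears
doc_freq = {}
for vec in hashed:
    for idx in vec:
        doc_freq[idx] = doc_freq.get(idx, 0) + 1

# Smoothed IDF per bucket, then scale every count by the IDF of its bucket
idf = {idx: math.log((n_docs + 1) / (df + 1)) for idx, df in doc_freq.items()}
tfidf = [{idx: count * idf[idx] for idx, count in vec.items()} for vec in hashed]

for row in tfidf:
    print({idx: round(w, 4) for idx, w in row.items()})
```

Index 3 appears in all three documents, so its weight is 0 everywhere. Note also that index 3 has a count of 2.0 in the third document although “cow” occurs there only once, which suggests that another token was hashed into the same bucket when only 16 buckets were used.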