<a href="https://colab.research.google.com/github/GloriaMoraaRiechi/Spring-2025/blob/main/nlpFeatureExtractionHashing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Feature Extraction
a) Apply HashingTF and IDF in Google Colab using PySpark on the shakespeare.txt dataset (calculate DF, IDF, and TF-IDF, and search for a specific keyword in the documents).

b) Apply Word2Vec in Google Colab using PySpark on the shakespeare.txt dataset (get word vectors and find similarities).

# HashingTF and IDF

HashingTF and IDF are the two stages of the TF-IDF (Term Frequency-Inverse Document Frequency) weighting scheme, a common technique for text analysis in natural language processing and machine learning. Both are available in the pyspark.ml.feature module.

**HashingTF**

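A feature transformer that converts a list of tokens into a fixed-length term-frequency vector using the hashing trick: each token is hashed to an index in the vector, and the value at that index counts how often the token occurs.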
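The hashing trick can be illustrated in plain Python. This is only a rough sketch: the function name `hashing_tf` is made up for illustration, and `zlib.crc32` is used as a stand-in hash, so the indices will not match the ones Spark produces with its own hash function.

```python
import zlib
from collections import Counter

def hashing_tf(tokens, num_features=1000):
    """Map tokens to a sparse {index: count} term-frequency vector."""
    counts = Counter()
    for token in tokens:
        # Stand-in hash (assumption): Spark uses a different hash function,
        # so the bucket indices here will differ from Spark's output.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1.0
    return dict(sorted(counts.items()))

tokens = ["to", "be", "or", "not", "to", "be"]
vector = hashing_tf(tokens, num_features=16)
print(vector)
```

Because the vector length is fixed, two different tokens can land in the same bucket. A larger `num_features` makes such collisions less likely at the cost of a wider vector.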

**IDF (Inverse Document Frequency)**

Rescales term frequencies by down-weighting terms that occur in many documents and up-weighting rare terms. It is applied after HashingTF to turn the raw term counts into TF-IDF weights.
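As a rough illustration of DF, IDF, and TF-IDF, the sketch below works through three made-up documents in plain Python. It assumes the smoothed formula given in the Spark MLlib documentation, idf = log((m + 1) / (df + 1)), where m is the number of documents. It works on the words themselves, whereas the notebook's pipeline works on hashed bucket indices.

```python
import math
from collections import Counter

# Three tiny made-up documents (already tokenized).
docs = [
    ["my", "grief", "is", "great"],
    ["my", "love", "is", "true"],
    ["grief", "and", "love"],
]
m = len(docs)

# Document frequency: number of documents containing each term at least once.
df = Counter(term for doc in docs for term in set(doc))

# Smoothed IDF (assumed to follow the Spark MLlib formula).
idf = {term: math.log((m + 1) / (count + 1)) for term, count in df.items()}

def tf_idf(doc):
    """Multiply each term's count in the document by its IDF."""
    tf = Counter(doc)
    return {term: tf[term] * idf[term] for term in tf}

weights = tf_idf(docs[0])
for term, weight in sorted(weights.items()):
    print(f"{term}: df={df[term]}, idf={idf[term]:.4f}, tf-idf={weight:.4f}")
```

In the first document, "great" appears in only one document and therefore gets a higher weight than "my", which appears in two.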

In [None]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.ml import Pipeline
from pyspark.sql.functions import array_contains, col

Initialize the Spark session

In [None]:
spark = SparkSession.builder.appName("NLPFeatureExtractionHashingTF").getOrCreate()


Load the data

In [None]:
from google.colab import files

# Upload the file
uploaded = files.upload()

# Get the uploaded file name
file_name = list(uploaded.keys())[0]

# Read the file
data = spark.read.csv(file_name, header=False, inferSchema=True, sep="\t")


Saving shakespeare.txt to shakespeare.txt


In [None]:
data = data.withColumnRenamed("_c0", "value")
data.show(10, truncate=False)
data.printSchema()

+-------------------------------------------------------------------+
|value                                                              |
+-------------------------------------------------------------------+
|This is the 100th Etext file presented by Project Gutenberg, and   |
|is presented in cooperation with World Library, Inc., from their   |
|Library of the Future and Shakespeare CDROMS.  Project Gutenberg   |
|often releases Etexts that are NOT placed in the Public Domain!!   |
|Shakespeare                                                        |
|*This Etext has certain copyright implications you should read!*   |
|<<THIS ELECTRONIC VERSION OF THE COMPLETE WORKS OF WILLIAM         |
|SHAKESPEARE IS COPYRIGHT 1990-1993 BY WORLD LIBRARY, INC., AND IS  |
|PROVIDED BY PROJECT GUTENBERG ETEXT OF ILLINOIS BENEDICTINE COLLEGE|
|WITH PERMISSION.  ELECTRONIC AND MACHINE READABLE COPIES MAY BE    |
+-------------------------------------------------------------------+
only showing top 10 rows

Split the sentences into individual words

In [None]:
# creates a new column "words"
tokenizer = Tokenizer(inputCol="value", outputCol="words")
tokenizedData = tokenizer.transform(data)
tokenizedData.select("value", "words").show(5, truncate=False)

+----------------------------------------------------------------+----------------------------------------------------------------------------+
|value                                                           |words                                                                       |
+----------------------------------------------------------------+----------------------------------------------------------------------------+
|This is the 100th Etext file presented by Project Gutenberg, and|[this, is, the, 100th, etext, file, presented, by, project, gutenberg,, and]|
|is presented in cooperation with World Library, Inc., from their|[is, presented, in, cooperation, with, world, library,, inc.,, from, their] |
|Library of the Future and Shakespeare CDROMS.  Project Gutenberg|[library, of, the, future, and, shakespeare, cdroms., , project, gutenberg] |
|often releases Etexts that are NOT placed in the Public Domain!!|[often, releases, etexts, that, are, not, placed, in, the, public, dom

Convert the words into a fixed-length numerical vector

In [None]:
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=1000)
hashedData = hashingTF.transform(tokenizedData)
hashedData.select("words", "rawFeatures").show(truncate=False)

+-----------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+
|words                                                                        |rawFeatures                                                                                              |
+-----------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+
|[this, is, the, 100th, etext, file, presented, by, project, gutenberg,, and] |(1000,[17,108,115,209,230,313,373,488,581,716,891],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])        |
|[is, presented, in, cooperation, with, world, library,, inc.,, from, their]  |(1000,[115,209,360,588,643,650,663,738,921,967],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])               |
|[library, of, the, future, and, shakespeare, cdroms., , project, gute

Apply inverse document frequency (IDF). Multiplying each raw term frequency by its IDF weight gives the TF-IDF score, which measures how important a term is to a line relative to the whole corpus.


In [None]:
idf = IDF(inputCol="rawFeatures", outputCol="features")

In [None]:
idfModel = idf.fit(hashedData) # fit the IDF model on the hashed data
idfData = idfModel.transform(hashedData)
idfData.select("words", "features").show(truncate=False)


+-----------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|words                                                                        |features                                                                                                                                                                                                                                                                           |
+-----------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [None]:
# Combine all the transformations into a single workflow
pipeline = Pipeline(stages=[tokenizer, hashingTF, idf])


In [None]:
# Fit and transform the data
model = pipeline.fit(data)
result = model.transform(data)
result.select("value", "features").show(5, truncate=False)

+----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                           |features                                                                                                                                                                                                                                                       |
+----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|This is the 100th Etext file p

In [None]:
# Search for "grief"
keyword = "grief"
filteredResult = result.filter(array_contains(col("words"), keyword))
filteredResult.show(n=filteredResult.count(), truncate=False)

+---------------------------------------------------------------------+-----------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                |words                                                                                    |rawFeatures                                                                                                                   |features                                                                      
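One caveat with this search: as the tokenizer output earlier shows ("gutenberg," keeps its comma), tokens retain their punctuation, and array_contains matches whole tokens only, so a line containing "grief," or "grief." would not be returned. The plain-Python sketch below illustrates the difference on a made-up line; `tokenize` and `strip_punct` are hypothetical helpers, not part of PySpark.

```python
import string

def tokenize(line):
    """Lowercase and split on whitespace, mirroring the tokenizer output above."""
    return line.lower().split()

def strip_punct(tokens):
    """Strip leading/trailing punctuation from each token and drop empty tokens."""
    cleaned = (t.strip(string.punctuation) for t in tokens)
    return [t for t in cleaned if t]

line = "Grief, my lord, fills the room"  # made-up example line
raw = tokenize(line)

print("grief" in raw)               # exact match against the token "grief,"
print("grief" in strip_punct(raw))  # match after punctuation is stripped
```

In the Spark pipeline, a comparable effect could probably be achieved by removing punctuation before tokenizing, or by using RegexTokenizer from pyspark.ml.feature instead of Tokenizer.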

In [None]:
spark.stop()

# Word2Vec

Word2Vec is a feature transformer that converts words into numerical vectors using a neural-network-based embedding model. It maps words to a continuous vector space in which words that appear in similar contexts end up close together, so the vectors can be used to measure similarity and capture relationships between words.


In [None]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, Word2Vec
from pyspark.sql.functions import col, regexp_replace, trim, lower

In [None]:
spark = SparkSession.builder.appName("NLPFeatureExtractionWord2Vec").getOrCreate()

In [None]:
from google.colab import files

# Upload the file
uploaded = files.upload()

# Get the uploaded file name
file_name = list(uploaded.keys())[0]

# Read the file
data = spark.read.text(file_name).withColumnRenamed("value", "raw_text")


Saving shakespeare.txt to shakespeare (3).txt


Text processing

In [None]:
pip install inflect



In [None]:
import inflect
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

engine = inflect.engine()

def replace_ordinal(text):
    """Spell out numeric ordinals such as 100th or 2nd as words."""
    words = text.split()
    for i, word in enumerate(words):
        # Remove surrounding punctuation (e.g., "100th," → "100th")
        clean_word = word.strip(".,;!?")
        # Check for ordinal patterns like 100th, 2nd, etc.
        if len(clean_word) >= 2 and clean_word[-2:] in ('th', 'st', 'nd', 'rd'):
            number_part = clean_word[:-2]
            if number_part.isdigit():
                try:
                    # number_to_words() has no `ordinal` keyword, so spell out
                    # the cardinal first and let ordinal() convert it.
                    cardinal_word = engine.number_to_words(int(number_part))
                    ordinal_word = engine.ordinal(cardinal_word)
                    # Replace the numeric ordinal, keeping any punctuation
                    words[i] = word.replace(clean_word, ordinal_word)
                except Exception:
                    pass  # Leave the token unchanged if conversion fails
    return ' '.join(words)

# Wrap the function as a UDF and apply it to every line
replace_ordinal_udf = udf(replace_ordinal, StringType())
data = data.withColumn("cleaned_text", replace_ordinal_udf(col("raw_text")))

In [None]:
data.show(5, truncate=False)

+----------------------------------------------------------------+----------------------------------------------------------------+
|raw_text                                                        |cleaned_text                                                    |
+----------------------------------------------------------------+----------------------------------------------------------------+
|This is the 100th Etext file presented by Project Gutenberg, and|This is the 100th Etext file presented by Project Gutenberg, and|
|is presented in cooperation with World Library, Inc., from their|is presented in cooperation with World Library, Inc., from their|
|Library of the Future and Shakespeare CDROMS.  Project Gutenberg|Library of the Future and Shakespeare CDROMS. Project Gutenberg |
|often releases Etexts that are NOT placed in the Public Domain!!|often releases Etexts that are NOT placed in the Public Domain!!|
|Shakespeare                                                     |Shakespear

Split the sentences into individual words

In [None]:
tokenizer = Tokenizer(inputCol="cleaned_text", outputCol="words")
wordsData = tokenizer.transform(data)
wordsData.select("cleaned_text", "words").show(5, truncate=False)

+----------------------------------------------------------------+----------------------------------------------------------------------------+
|cleaned_text                                                    |words                                                                       |
+----------------------------------------------------------------+----------------------------------------------------------------------------+
|This is the 100th Etext file presented by Project Gutenberg, and|[this, is, the, 100th, etext, file, presented, by, project, gutenberg,, and]|
|is presented in cooperation with World Library, Inc., from their|[is, presented, in, cooperation, with, world, library,, inc.,, from, their] |
|Library of the Future and Shakespeare CDROMS. Project Gutenberg |[library, of, the, future, and, shakespeare, cdroms., project, gutenberg]   |
|often releases Etexts that are NOT placed in the Public Domain!!|[often, releases, etexts, that, are, not, placed, in, the, public, dom

Convert the words into numerical vector representations


In [None]:
word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="words", outputCol="result")

Train the model

In [None]:
model = word2Vec.fit(wordsData)

Transform the data to get a vector for each line (each sentence is converted to a sentence-level embedding)
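According to the Spark documentation, the fitted model turns each document into a single vector by averaging the vectors of its words. The plain-Python sketch below shows that idea on made-up 3-dimensional vectors. The helper name `sentence_vector` is hypothetical, and skipping unknown words while still dividing by the total token count is an assumption about the exact behavior.

```python
def sentence_vector(tokens, word_vectors, size):
    """Average the word vectors of a token list into one vector."""
    total = [0.0] * size
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is not None:  # words missing from the vocabulary add nothing
            for i, x in enumerate(vec):
                total[i] += x
    n = len(tokens)
    # Assumption: divide by the total number of tokens in the sentence.
    return [x / n for x in total] if n else total

# Made-up vectors for illustration only.
word_vectors = {"sweet": [1.0, 0.0, 2.0], "sorrow": [3.0, 2.0, 0.0]}
print(sentence_vector(["sweet", "sorrow"], word_vectors, size=3))
```

The result has the same length as the individual word vectors (vectorSize), regardless of how many words the line contains.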


In [None]:
result = model.transform(wordsData)
result.select("words", "result").show(5, truncate=False)

+----------------------------------------------------------------------------+--------------------------------------------------------------+
|words                                                                       |result                                                        |
+----------------------------------------------------------------------------+--------------------------------------------------------------+
|[this, is, the, 100th, etext, file, presented, by, project, gutenberg,, and]|[-0.5995194738392126,-7.235522974621166E-4,0.7563755735754967]|
|[is, presented, in, cooperation, with, world, library,, inc.,, from, their] |[-0.927121011260897,0.07981600277125836,0.4507550247013569]   |
|[library, of, the, future, and, shakespeare, cdroms., project, gutenberg]   |[-1.0230527371168137,-0.00898773761259185,1.024926015900241]  |
|[often, releases, etexts, that, are, not, placed, in, the, public, domain!!]|[-0.17524426193399864,0.112800582989373,-0.04939867573028261] |
|[shak

Get the word vector for a specific word

In [None]:
word = "grief"
wordVector = model.getVectors().filter(col("word") == word)
wordVector.show(truncate=False)

+-----+------------------------------------------------------------+
|word |vector                                                      |
+-----+------------------------------------------------------------+
|grief|[-0.2706522047519684,0.20689503848552704,0.2844409942626953]|
+-----+------------------------------------------------------------+



Find similar words

In [None]:
synonyms = model.findSynonyms("love", 5)
synonyms.show(truncate=False)

+-------------+------------------+
|word         |similarity        |
+-------------+------------------+
|another?     |0.9999030828475952|
|fabian.      |0.9998683333396912|
|curate;      |0.9998637437820435|
|marshalship, |0.9997220635414124|
|well-govern'd|0.9997202157974243|
+-------------+------------------+
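The similarity column above is a cosine similarity between word vectors. The sketch below ranks neighbors the same way on a few made-up vectors; `cosine` and `most_similar` are hypothetical helpers written for this illustration, not PySpark functions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(word, vectors, k):
    """Return the k words whose vectors are closest to `word` by cosine similarity."""
    target = vectors[word]
    scored = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up 3-dimensional vectors for illustration only.
vectors = {
    "love": [1.0, 1.0, 0.0],
    "affection": [2.0, 2.0, 0.1],
    "grief": [-1.0, 0.5, 1.0],
    "sword": [0.0, -1.0, 1.0],
}
for other, score in most_similar("love", vectors, k=2):
    print(f"{other}: {score:.4f}")
```

With only three dimensions (vectorSize=3) and punctuation still attached to the tokens, many unrelated words can end up almost parallel to the query vector, which is a likely reason the neighbors of "love" above all score close to 1.0. A larger vectorSize, a higher minCount, and punctuation removal would probably give more meaningful neighbors.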

