# TF-IDF - Code to measure the importance of a word in a document

In [None]:
Import the Wikipedia subset stored in an S3 bucket

In [1]:
rawdata = spark.read.options(sep="\t").csv("s3://aws-emr-studio-904233096976-us-east-2/1757969012559/subset-small.tsv")
rawdata.show()


+---+--------------------+-------------------+--------------------+
|_c0|                 _c1|                _c2|                 _c3|
+---+--------------------+-------------------+--------------------+
| 12|           Anarchism|2008-12-30 06:23:05|Anarchism (someti...|
| 25|              Autism|2008-12-24 20:41:05|Autism is a brain...|
| 39|              Albedo|2008-12-29 18:19:09|The albedo of an ...|
|290|                   A|2008-12-27 04:33:16|The letter A is t...|
|303|             Alabama|2008-12-29 08:15:47|Alabama (formally...|
|305|            Achilles|2008-12-30 06:18:01|thumb\n\nIn Greek...|
|307|     Abraham Lincoln|2008-12-28 20:18:23|Abraham Lincoln (...|
|308|           Aristotle|2008-12-29 23:54:48|Aristotle (Greek:...|
|309|An American in Paris|2008-09-27 19:29:28|An American in Pa...|
|324|       Academy Award|2008-12-28 17:50:43|The Academy Award...|
|330|             Actrius|2008-05-23 15:24:32|Actrius (Actresse...|
|332|     Animalia (book)|2008-12-18 11:12:34|th

Since the dataset does not define column names, we assign them manually:

In [None]:
articles = rawdata.toDF("ID", "Title", "Time", "Document")
articles.show()

Next we need to "clean" our data. We know TF/IDF can't handle null documents, so first let's check for that.

In [None]:
articles.filter(articles.Document.isNull()).count()

Looks like there is one null document. As there is only one and it's clearly corrupt when we look into it, we can just remove it and call it a day.

In [None]:
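# Keep only rows whose Document column is populated.
cleanedArticles = articles.filter(articles.Document.isNotNull())
# Sanity check: this count should now be 0.
cleanedArticles.filter(cleanedArticles.Document.isNull()).count()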

TF/IDF wants numbers, not words. So now we need to pre-process our data before we can run any fun algorithms on it. We'll first tokenize the articles to split them up into words, then hash those words into a sparse vector that serves as a numeric representation of the words in each article.

In [None]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

tokenizer = Tokenizer(inputCol="Document", outputCol="words")
wordsData = tokenizer.transform(cleanedArticles)

In [None]:
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures")
featurizedData = hashingTF.transform(wordsData)
featurizedData.show()
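
To make the hashing step concrete, here is a minimal pure-Python sketch of the same idea, runnable without Spark. It is an illustration under assumptions, not Spark's implementation: `tokenize`, `bucket`, and `hashed_term_frequencies` are hypothetical helpers, CRC32 stands in for the MurmurHash3 function that HashingTF uses, and the bucket count is kept tiny for readability (HashingTF defaults to 2^18 buckets). The indices it prints will therefore not match Spark's.

```python
import zlib
from collections import Counter

NUM_FEATURES = 16  # tiny on purpose; Spark's HashingTF defaults to 2**18 buckets


def tokenize(text):
    """Lowercase and split on whitespace, a rough stand-in for Spark's Tokenizer."""
    return text.lower().split()


def bucket(word, num_features=NUM_FEATURES):
    """Map a word to a bucket index with a deterministic hash.

    CRC32 is an assumption made for this sketch; Spark uses MurmurHash3.
    """
    return zlib.crc32(word.encode("utf-8")) % num_features


def hashed_term_frequencies(text, num_features=NUM_FEATURES):
    """Return a sparse {bucket index: count} mapping for one document."""
    return dict(Counter(bucket(w, num_features) for w in tokenize(text)))


tf = hashed_term_frequencies("the battle of Gettysburg was the turning point")
print(tf)
```

Each key is a bucket index and each value is how many tokens landed in that bucket, which is the information the `rawFeatures` sparse vector carries for each article.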

That hashing operation basically computed term frequencies for us by storing how often each hashed word occurred in each article. So we have TF, but we want TF/IDF scores for every term in every document. We'll store these final scores in a new column called "features", a sparse vector holding the TF/IDF score for each hashed term.

In [None]:
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)

In [None]:
rescaledData.show()
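
As a worked illustration of what the IDF stage computes, the sketch below applies the smoothed formula idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents containing term t. This is the formula as I understand it from the Spark MLlib documentation; the toy corpus and the `smoothed_idf` helper are made up for this example.

```python
import math


def smoothed_idf(num_docs, doc_freq):
    """Smoothed inverse document frequency: log((m + 1) / (df + 1))."""
    return math.log((num_docs + 1) / (doc_freq + 1))


# Toy corpus (hypothetical): each document is reduced to its set of terms.
docs = [
    {"gettysburg", "battle", "the"},
    {"lincoln", "gettysburg", "the"},
    {"albedo", "the"},
    {"autism", "the"},
]

num_docs = len(docs)

# Count in how many documents each term appears.
doc_freq = {}
for terms in docs:
    for term in terms:
        doc_freq[term] = doc_freq.get(term, 0) + 1

idf = {term: smoothed_idf(num_docs, df) for term, df in doc_freq.items()}

for term in sorted(idf, key=idf.get):
    print(f"{term:12s} df={doc_freq[term]} idf={idf[term]:.4f}")
```

A term that appears in every document ("the") gets an IDF of 0, so its TF/IDF score vanishes no matter how often it occurs, while rarer terms are weighted up.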

So let's use this to do a search for the term "Gettysburg". Again, we need numbers, not words, so the first task is to get the hash value for "Gettysburg". The Tokenizer lowercased every word, so we hash the lowercase form "gettysburg".

In [None]:
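from pyspark.sql.types import ArrayType, StringType, StructField, StructType

schema = StructType([StructField("words", ArrayType(StringType()))])

# One row whose "words" column holds the single token we want to hash.
df = spark.createDataFrame([(["gettysburg"],)], schema)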
df.show()

gettysburg = hashingTF.transform(df)
gettysburg.show()

featureVec = gettysburg.select('rawFeatures').collect()
print(featureVec)

gettysburgID = int(featureVec[0].rawFeatures.indices[0])
print(gettysburgID)
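
One caveat worth keeping in mind: a bucket index is not guaranteed to be unique to a word, because hashing maps an unbounded vocabulary onto a fixed number of buckets. The sketch below forces the effect by using fewer buckets than words. It is a hypothetical illustration: CRC32 stands in for Spark's hash function and the word list is arbitrary.

```python
import zlib


def bucket(word, num_features):
    """Deterministic stand-in for the hashing step (CRC32, not Spark's MurmurHash3)."""
    return zlib.crc32(word.encode("utf-8")) % num_features


words = ["gettysburg", "lincoln", "alabama", "achilles", "aristotle"]
num_features = 4  # fewer buckets than words, so at least two words must share one

# Group words by the bucket they hash to.
buckets = {}
for word in words:
    buckets.setdefault(bucket(word, num_features), []).append(word)

# Buckets holding more than one word are collisions.
collisions = {idx: ws for idx, ws in buckets.items() if len(ws) > 1}
print(collisions)
```

With the default of 2^18 buckets collisions are much rarer, but when one does occur the score read from that index mixes the contributions of every word that hashed there.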

OK, we have the magic number that represents "Gettysburg". Now we can add another column - we'll call it "score" - that just extracts the TF/IDF value for Gettysburg for each document.

In [None]:
from pyspark.sql.types import FloatType
from pyspark.sql.functions import udf

termExtractor = udf(lambda x: float(x[gettysburgID]), FloatType())
gettysburgDF = rescaledData.withColumn('score', termExtractor(rescaledData.features))

gettysburgDF.show()
                                                        

Now all we have to do is sort our articles by score, and we'll have the most relevant articles for Gettysburg!

In [None]:
sortedResults = gettysburgDF.filter("score > 0").orderBy('score', ascending=False).select('ID', 'Title', 'Document', 'score')
sortedResults.show(truncate=100)
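
The extract, filter, and sort steps above can be summarized in a few lines of plain Python. This is only a sketch on made-up data: the titles are taken from the sample output, but the bucket index `term_id` and every score are hypothetical values chosen for illustration, not results from the dataset.

```python
# Hypothetical sparse TF/IDF vectors, one per article: {bucket index: score}.
articles = [
    {"Title": "Abraham Lincoln", "features": {7: 2.4, 3: 0.5}},
    {"Title": "Alabama", "features": {3: 1.1}},
    {"Title": "Achilles", "features": {7: 0.6, 9: 3.0}},
]

term_id = 7  # hypothetical bucket index for the search term


def term_score(features, index):
    """Read one entry from a sparse vector, treating missing entries as 0."""
    return float(features.get(index, 0.0))


# Attach a score to each article, drop zero scores, sort highest first.
scored = [(a["Title"], term_score(a["features"], term_id)) for a in articles]
ranked = sorted((p for p in scored if p[1] > 0), key=lambda p: p[1], reverse=True)

for title, score in ranked:
    print(f"{title}: {score}")
```

Articles that never mention the term score 0 and are dropped, which mirrors the `score > 0` filter in the Spark query.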