Note: This runs in Jupyter with an active Spark session (`spark`); the same code can also be run directly in the `pyspark` shell.


# TF-IDF
Ref: https://www.linkedin.com/learning/spark-for-machine-learning-ai/tokenize-text-data

In [47]:
sentences_df = spark.createDataFrame(
    [
        (1, "One thesis thesis."),
        (2, "The thesis My thesis includes methods to predict traffic"),
        (3, "It also contains supporting tools for future pipelines"),
        (4, "Short dot."),
    ],
    ["id", "sentence"]
)

In [48]:
sentences_df.show()

+---+--------------------+
| id|            sentence|
+---+--------------------+
|  1|  One thesis thesis.|
|  2|The thesis My the...|
|  3|It also contains ...|
|  4|          Short dot.|
+---+--------------------+



## tokenizer 
(split sentences into lower-case words)

**Note** `Tokenizer` only lower-cases the text and splits on whitespace, so the period (.) stays attached to the last word (e.g. 'dot.'), which is NOT wanted.

In [66]:
from pyspark.ml.feature import Tokenizer, RegexTokenizer

sent_token = Tokenizer(inputCol='sentence', outputCol='words')
sent_tokenized_df = sent_token.transform(sentences_df)
sent_tokenized_df.show()

+---+--------------------+--------------------+
| id|            sentence|               words|
+---+--------------------+--------------------+
|  1|  One thesis thesis.|[one, thesis, the...|
|  2|The thesis My the...|[the, thesis, my,...|
|  3|It also contains ...|[it, also, contai...|
|  4|          Short dot.|       [short, dot.]|
+---+--------------------+--------------------+
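
One way to drop the trailing period noted above is to split on non-word characters instead of whitespace. Spark's `RegexTokenizer` (imported in the cell above but not otherwise used here) takes a `pattern` parameter for this. The sketch below is plain Python, not Spark: the helpers `whitespace_tokens` and `regex_tokens` are hypothetical names that only mimic the two behaviours on the sample sentences, so the difference can be seen without a Spark session.

```python
import re

def whitespace_tokens(sentence):
    # Mimics Tokenizer: lower-case, then split on whitespace.
    return sentence.lower().split()

def regex_tokens(sentence, pattern=r"\W"):
    # Mimics RegexTokenizer with the pattern used as a delimiter:
    # lower-case, split on non-word characters, drop empty strings.
    return [tok for tok in re.split(pattern, sentence.lower()) if tok]

sentence = "One thesis thesis."
print(whitespace_tokens(sentence))  # ['one', 'thesis', 'thesis.']
print(regex_tokens(sentence))       # ['one', 'thesis', 'thesis']
```

In Spark itself, the corresponding call would be along the lines of `RegexTokenizer(inputCol='sentence', outputCol='words', pattern='\\W')`. Whether punctuation should be stripped at all depends on the documents being processed.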



## tf-idf

### *tf*:

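`HashingTF` maps each word to a vector index with a hash function and counts how many times each index is hit (the term frequency).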

In [50]:
from pyspark.ml.feature import HashingTF, IDF

sentences_df

DataFrame[id: bigint, sentence: string]

In [51]:
sentences_df.take(1)

[Row(id=1, sentence='One thesis thesis.')]

In [52]:
sent_tokenized_df.take(1)

[Row(id=1, sentence='One thesis thesis.', words=['one', 'thesis', 'thesis.'])]

In [67]:
hashingTF = HashingTF(numFeatures=10, inputCol='words', outputCol='rawFeatures')
sent_hashTF_df = hashingTF.transform(sent_tokenized_df)

In [61]:
sent_hashTF_df.take(2)

[Row(id=1, sentence='One thesis thesis.', words=['one', 'thesis', 'thesis.'], rawFeatures=SparseVector(10, {0: 1.0, 1: 1.0, 4: 1.0})),
 Row(id=2, sentence='The thesis My thesis includes methods to predict traffic', words=['the', 'thesis', 'my', 'thesis', 'includes', 'methods', 'to', 'predict', 'traffic'], rawFeatures=SparseVector(10, {0: 3.0, 2: 1.0, 3: 1.0, 4: 1.0, 6: 1.0, 8: 1.0, 9: 1.0}))]

**Explanation** In row 2, both occurrences of 'thesis' are hashed to index 0, and one other word collides at the same index, so index 0 has a count of 3. Such collisions are expected with only `numFeatures=10` buckets.

**Note** In row 1, the two terms 'thesis' and 'thesis.' are treated as different words because of the trailing '.', so check whether this is really what we want.
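
The collision in row 2 follows from the hashing trick: each token is hashed and reduced modulo `numFeatures`, so with only 10 buckets distinct words can easily share an index. The sketch below illustrates the idea in plain Python. It is a stand-in under stated assumptions, not Spark's implementation: `hashing_tf` is a hypothetical helper, and it uses `zlib.crc32` rather than the MurmurHash3 function Spark uses, so its indices will not match the output above.

```python
import zlib
from collections import Counter

def hashing_tf(tokens, num_features=10):
    # Hypothetical helper: bucket index = hash(token) mod num_features.
    # crc32 is used only because it is deterministic across runs;
    # Spark's HashingTF uses MurmurHash3, so real indices differ.
    counts = Counter()
    for tok in tokens:
        index = zlib.crc32(tok.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)

tokens = ['the', 'thesis', 'my', 'thesis', 'includes',
          'methods', 'to', 'predict', 'traffic']
tf = hashing_tf(tokens, num_features=10)

# Every token lands in exactly one bucket, so the counts sum to len(tokens).
print(sum(tf.values()) == len(tokens))
# Fewer buckets than distinct tokens means at least one collision.
print(len(tf), "buckets used for", len(set(tokens)), "distinct tokens")
```

Raising `numFeatures` makes collisions rarer at the cost of a longer (still sparse) vector; the default for `HashingTF` is 262144 (`2**18`).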

### *idf*:

In [58]:
idfModel = IDF(inputCol='rawFeatures').fit(sent_hashTF_df)
tfidf_df = idfModel.transform(sent_hashTF_df)

In [59]:
tfidf_df.take(2)

[Row(id=1, sentence='One thesis thesis.', words=['one', 'thesis', 'thesis.'], rawFeatures=SparseVector(10, {0: 1.0, 1: 1.0, 4: 1.0}), IDF_493fb1d2efd8f3d32d6b__output=SparseVector(10, {0: 0.2231, 1: 0.9163, 4: 0.2231})),
 Row(id=2, sentence='The thesis My thesis includes methods to predict traffic', words=['the', 'thesis', 'my', 'thesis', 'includes', 'methods', 'to', 'predict', 'traffic'], rawFeatures=SparseVector(10, {0: 3.0, 2: 1.0, 3: 1.0, 4: 1.0, 6: 1.0, 8: 1.0, 9: 1.0}), IDF_493fb1d2efd8f3d32d6b__output=SparseVector(10, {0: 0.6694, 2: 0.5108, 3: 0.5108, 4: 0.2231, 6: 0.5108, 8: 0.5108, 9: 0.9163}))]

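**Note** If `IDF()`'s `outputCol` param is not specified, the output column gets an auto-generated name (here `IDF_493fb1d2efd8f3d32d6b__output`). Passing e.g. `outputCol='features'` gives it a readable name.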
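
The IDF weights in the output above can be checked by hand. Spark's documentation gives the formula idf = ln((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents in which the index is non-zero. The document frequencies used below (3, 2 and 1) are not printed anywhere in this notebook; they are inferred from the weights themselves, so treat them as an assumption.

```python
import math

def idf(num_docs, doc_freq):
    # Smoothed inverse document frequency: ln((m + 1) / (df + 1)).
    return math.log((num_docs + 1) / (doc_freq + 1))

m = 4  # four sentences in sentences_df

# Document frequencies are assumed (inferred from the printed weights).
print(round(idf(m, 3), 4))      # 0.2231, e.g. index 0 or 4
print(round(idf(m, 2), 4))      # 0.5108, e.g. index 2
print(round(idf(m, 1), 4))      # 0.9163, e.g. index 1 (row 1) or 9 (row 2)
# TF-IDF is the raw count times the IDF: index 0 in row 2 has count 3.
print(round(3 * idf(m, 3), 4))  # 0.6694
```

An index that appears in most documents (df = 3 of 4) gets a small weight, while one that appears in a single document gets the largest weight.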

# Try The Real-World Documents

See file `TFIDF_Kmeans.py`