### Bag-of-words model


https://en.wikipedia.org/wiki/Bag-of-words_model

Bag of Words (BoW) is a text representation that counts how many times each word appears in a document, ignoring word order. It's a tally. Those word counts let us compare documents and gauge their similarity for applications such as search, document classification, and topic modeling. BoW is also a method for preparing text as input to a deep-learning network.
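
As a minimal sketch of the tally described above, the snippet below builds bag-of-words counts for two made-up listing titles with Python's `collections.Counter`. The `bag_of_words` helper and the sample strings are illustrative assumptions, not part of the Spark pipeline used later.

```python
from collections import Counter

def bag_of_words(text):
    """Return a word -> count mapping for a lowercased, whitespace-split text."""
    return Counter(text.lower().split())

doc_a = bag_of_words("Quiet room near quiet park")
doc_b = bag_of_words("Comfortable room near city centre")

# Words the two documents share, a crude similarity signal.
shared = sorted(set(doc_a) & set(doc_b))

print(doc_a["quiet"])  # 2
print(shared)          # ['near', 'room']
```

Word order is discarded: only the counts survive, which is what makes the representation cheap to compare across documents.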

### TF–IDF

https://en.wikipedia.org/wiki/Tf%E2%80%93idf

http://blog.christianperone.com/2011/09/machine-learning-text-feature-extraction-tf-idf-part-i/ <br/>
http://blog.christianperone.com/2011/10/machine-learning-text-feature-extraction-tf-idf-part-ii/

In [50]:
import pyspark
from itertools import islice

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# Use the DataFrame-based pyspark.ml transformers. Importing HashingTF and IDF
# from pyspark.mllib as well would only be shadowed by these names.
from pyspark.ml.feature import HashingTF, IDF, StopWordsRemover, Tokenizer
from pyspark.mllib.feature import Word2Vec

In [7]:
spark = SparkSession.builder.appName('ops').getOrCreate()

#### Load data

In [36]:
num_rows_to_show = 20
text_file = 'data/listings.csv'

In [17]:
df = spark.read.csv(text_file, inferSchema=True, header=True)

In [18]:
coprus = df.select("id", "name").dropna(subset="name")

In [37]:
coprus.show(num_rows_to_show, False)

+-----+--------------------------------------------------+
|id   |name                                              |
+-----+--------------------------------------------------+
|2818 |Quiet Garden View Room & Super Fast WiFi          |
|20168|100%Centre-Studio 1 Private Floor/Bathroom        |
|25428|Lovely apt in City Centre (Jordaan)               |
|27886|Romantic, stylish B&B houseboat in canal district |
|28658|Cosy guest room near city centre -1               |
|28871|Comfortable double room                           |
|29051|Comfortable single room                           |
|31080|2-story apartment + rooftop terrace               |
|38266|Nice and quiet place in the Jordaan               |
|41125|Amsterdam Center Entire Apartment                 |
|42970|Comfortable room@PERFECT location + 2 bikes       |
|43109|Oasis in the middle of Amsterdam                  |
|43980|View into park / museum district (long/short stay)|
|44129|Luxury design with canal view                    

In [38]:
number_of_docs = coprus.count()
number_of_docs

19439

#### Tokenization

In [47]:
tokenizer = Tokenizer(inputCol="name", outputCol="raw_words")
wordsData = tokenizer.transform(coprus)

In [48]:
wordsData.show(num_rows_to_show, False)

+-----+--------------------------------------------------+-----------------------------------------------------------+
|id   |name                                              |raw_words                                                  |
+-----+--------------------------------------------------+-----------------------------------------------------------+
|2818 |Quiet Garden View Room & Super Fast WiFi          |[quiet, garden, view, room, &, super, fast, wifi]          |
|20168|100%Centre-Studio 1 Private Floor/Bathroom        |[100%centre-studio, 1, private, floor/bathroom]            |
|25428|Lovely apt in City Centre (Jordaan)               |[lovely, apt, in, city, centre, (jordaan)]                 |
|27886|Romantic, stylish B&B houseboat in canal district |[romantic,, stylish, b&b, houseboat, in, canal, district]  |
|28658|Cosy guest room near city centre -1               |[cosy, guest, room, near, city, centre, -1]                |
|28871|Comfortable double room                  
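
The output above shows that `Tokenizer` lowercases the text and splits on whitespace only, so punctuation stays attached to tokens such as `romantic,` and `(jordaan)`. The plain-Python sketch below contrasts that behaviour with splitting on runs of non-word characters. The helper names and the `\W+` pattern are assumptions chosen for illustration; in Spark, `pyspark.ml.feature.RegexTokenizer` with its `pattern` parameter would be the place to try something similar.

```python
import re

def whitespace_tokens(text):
    """Lowercase and split on whitespace, similar to the output shown above."""
    return text.lower().split()

def regex_tokens(text, pattern=r"\W+"):
    """Lowercase and split on runs of non-word characters, dropping empty tokens."""
    return [tok for tok in re.split(pattern, text.lower()) if tok]

title = "Lovely apt in City Centre (Jordaan)"
print(whitespace_tokens(title))  # ['lovely', 'apt', 'in', 'city', 'centre', '(jordaan)']
print(regex_tokens(title))       # ['lovely', 'apt', 'in', 'city', 'centre', 'jordaan']
```

The trade-off is that a pattern like `\W+` also breaks up tokens such as `b&b`, so the right pattern depends on the corpus.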

#### Remove stop words

In [54]:
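# Pin the JVM default locale to en-US. StopWordsRemover takes its default
# `locale` from the JVM, and an unsupported default can make it fail.
locale = spark._jvm.java.util.Locale
locale.setDefault(locale.forLanguageTag("en-US"))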

In [57]:
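# Load Spark's built-in English stop-word list and pass it in explicitly
# (calling loadDefaultStopWords without using its result has no effect).
stop_words = StopWordsRemover.loadDefaultStopWords("english")

remover = StopWordsRemover(inputCol="raw_words", outputCol="words", stopWords=stop_words)
wordsData = remover.transform(wordsData)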

In [58]:
wordsData.show(num_rows_to_show, False)

+-----+--------------------------------------------------+-----------------------------------------------------------+-----------------------------------------------------+
|id   |name                                              |raw_words                                                  |words                                                |
+-----+--------------------------------------------------+-----------------------------------------------------------+-----------------------------------------------------+
|2818 |Quiet Garden View Room & Super Fast WiFi          |[quiet, garden, view, room, &, super, fast, wifi]          |[quiet, garden, view, room, &, super, fast, wifi]    |
|20168|100%Centre-Studio 1 Private Floor/Bathroom        |[100%centre-studio, 1, private, floor/bathroom]            |[100%centre-studio, 1, private, floor/bathroom]      |
|25428|Lovely apt in City Centre (Jordaan)               |[lovely, apt, in, city, centre, (jordaan)]                 |[lovely, apt, cit
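
Conceptually, stop-word removal is a filter of each token list against a fixed word list. The sketch below shows the idea in plain Python. `STOP_WORDS` is a hypothetical, heavily shortened list used only for illustration (Spark's built-in English list is much longer), and `remove_stop_words` is not a Spark function.

```python
# Hypothetical, heavily shortened stop-word list for illustration only.
STOP_WORDS = {"in", "the", "and", "of", "a"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep tokens that are not stop words, comparing case-insensitively."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["lovely", "apt", "in", "city", "centre", "(jordaan)"]
filtered = remove_stop_words(tokens)
print(filtered)  # ['lovely', 'apt', 'city', 'centre', '(jordaan)']
```

Tokens that are not in the list, including punctuation-only tokens such as `&`, pass through unchanged, which matches what the `words` column shows above.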

#### Hashing

In [63]:
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=10000)
featurizedData = hashingTF.transform(wordsData)

In [65]:
featurizedData.show(num_rows_to_show, False)

+-----+--------------------------------------------------+-----------------------------------------------------------+-----------------------------------------------------+----------------------------------------------------------------------------------+
|id   |name                                              |raw_words                                                  |words                                                |rawFeatures                                                                       |
+-----+--------------------------------------------------+-----------------------------------------------------------+-----------------------------------------------------+----------------------------------------------------------------------------------+
|2818 |Quiet Garden View Room & Super Fast WiFi          |[quiet, garden, view, room, &, super, fast, wifi]          |[quiet, garden, view, room, &, super, fast, wifi]    |(10000,[494,1692,1789,2659,7293,8048,8562,9263],[1.0,1.0,1.0
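
Each `rawFeatures` entry is a sparse vector of size `numFeatures`, where a token's position is obtained by hashing it rather than by looking it up in a vocabulary. The sketch below illustrates this hashing trick in plain Python. It uses `zlib.crc32` as a stand-in hash, which is an assumption made for portability: Spark's `HashingTF` uses MurmurHash3, so the indices produced here will not match the ones in the output above.

```python
import zlib
from collections import Counter

def hashing_tf(tokens, num_features=10000):
    """Map tokens to term counts keyed by hashed index (the hashing trick)."""
    counts = Counter()
    for tok in tokens:
        # crc32 is a stand-in hash; Spark's HashingTF uses MurmurHash3 instead.
        index = zlib.crc32(tok.encode("utf-8")) % num_features
        counts[index] += 1.0
    return dict(sorted(counts.items()))

vec = hashing_tf(["quiet", "room", "quiet", "view"])
print(vec)
```

Because distinct tokens can hash to the same index, a smaller `num_features` saves memory at the cost of more collisions.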

#### IDF

In [149]:
idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=1)
idfModel = idf.fit(featurizedData)

In [150]:
tfidf = idfModel.transform(featurizedData)

In [152]:
results = tfidf  # same DataFrame as the previous cell, so no second transform is needed

In [158]:
results.select("words", "features").show(num_rows_to_show, False)

+-----------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|words                                                |features                                                                                                                                                                                            |
+-----------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[quiet, garden, view, room, &, super, fast, wifi]    |(10000,[494,1692,1789,2659,7293,8048,8562,9263],[2.445567235227968,2.8094747144167127,3.6888794541139363,5.2799682278798405,3.247046701834897,5.983267779903803,2.46837734783679,8.0833286
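
The weights in the `features` column are the raw term counts multiplied by an inverse document frequency. Spark's documentation gives the smoothed form `log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` is the number of documents containing the term. The plain-Python sketch below applies that formula to a made-up three-document corpus. The helper names and the toy data are assumptions, and it works on words directly rather than on hashed indices.

```python
import math

def idf_weights(docs):
    """Compute smoothed IDF per term: log((m + 1) / (df + 1))."""
    m = len(docs)
    doc_freq = {}
    for tokens in docs:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log((m + 1) / (df + 1)) for term, df in doc_freq.items()}

def tf_idf(tokens, idf):
    """Multiply raw term counts by the IDF weight of each term."""
    return {term: tokens.count(term) * idf[term] for term in set(tokens)}

docs = [["quiet", "room"], ["comfortable", "room"], ["canal", "view", "room"]]
idf = idf_weights(docs)
print(idf["room"])  # 0.0, because "room" appears in every document
print(tf_idf(docs[0], idf))
```

A term that occurs in every document carries no discriminating information and gets weight zero, while rarer terms are weighted up.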

In [151]:
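# Score every listing by the TF-IDF weight of a single keyword.
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

keyword = "central"
keywordDF = spark.createDataFrame([(0, [keyword])], ["id", "words"])

# Hash the keyword with the same HashingTF to find its feature index.
keywordTF = hashingTF.transform(keywordDF)
keywordHashValue = int(keywordTF.first()["rawFeatures"].indices[0])

# Read that index out of each listing's TF-IDF vector (0.0 when absent).
relevance = F.udf(lambda v: float(v[keywordHashValue]), DoubleType())
keywordRelevance = tfidf.select("id", "name", relevance("features").alias("relevance"))
keywordRelevance.orderBy(F.desc("relevance")).show(num_rows_to_show, False)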