<h1 style="font-size:40px;"> MLlib: Machine Learning with Spark </h1>

In [1]:
import os
import pandas as pd
import numpy as np

from pyspark import SparkConf
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

from pyspark.ml.feature import VectorAssembler, VectorIndexer
from pyspark.ml.regression import DecisionTreeRegressor, RandomForestRegressor
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator

In [2]:
conf = (

    SparkConf()
    .setAppName(u"[ICAI] ML a fondo")
    .set("spark.executor.memory", "7g")
    .set("spark.executor.cores", "5")
    .set("spark.default.parallelism", 800)
    .set("spark.sql.shuffle.partitions", 800)

)

In [3]:
spark = (

    SparkSession.builder
    .config(conf=conf)
    .enableHiveSupport()
    .getOrCreate()

)

# Model tuning

![](img/rf.jpg)

As we saw in the previous exercise, the [Random Forest](https://es.wikipedia.org/wiki/Random_forest) was giving worse results than a single tree. Let's see how to fix this:

In [4]:
trainDF, testDF = (  # load the data and split it directly into 70-30

    spark.read
    .options(header=True,inferSchema=True)
    .csv('/datos/hour.csv')
    # NOTE: 'registerd' is misspelled; drop() ignores unknown names, so 'registered' stays in the data
    .drop('casual','registerd','instant','dteday')
    .randomSplit([0.7, 0.3], seed=1234)

)

In [5]:
trainDF.cache()
testDF.cache()
print("## Number of records in `trainDF`: {}".format(trainDF.count()))
print("## Number of records in `testDF`: {}".format(testDF.count()))

## Number of records in `trainDF`: 12150
## Number of records in `testDF`: 5229
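
The split above is random per row, which is why the sizes are only approximately 70-30 (12150 of 17379 rows is about 69.9%). The idea can be sketched in plain Python as follows. This is a conceptual illustration only: the function `random_split` is hypothetical and is not how Spark implements `randomSplit` internally.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by drawing a uniform random number."""
    total = sum(weights)
    # Cumulative boundaries, e.g. roughly [0.7, 1.0] for weights [0.7, 0.3]
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, b in enumerate(bounds):
            # The last split also catches any floating-point rounding leftovers
            if u < b or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(17379))  # same total as trainDF + testDF above
train, test = random_split(rows, [0.7, 0.3], seed=1234)
print(len(train), len(test))  # close to, but usually not exactly, 70/30
```

Fixing the seed makes the split reproducible, which matters when comparing models on the same test set.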


In [None]:
trainDF.printSchema()  # every variable we decide is categorical has to be encoded

In [None]:
trainDF.limit(20).toPandas()

In [6]:
featuresCols = trainDF.columns[:-1]  # drop the last column, which is the target variable (the count)
featuresCols

['season',
 'yr',
 'mnth',
 'hr',
 'holiday',
 'weekday',
 'workingday',
 'weathersit',
 'temp',
 'atemp',
 'hum',
 'windspeed',
 'registered']

#### First we build the tree again:

In [7]:
vectorAssembler = VectorAssembler(inputCols=featuresCols, outputCol="rawFeatures") 
# takes all the feature columns and concatenates them into a single vector

In [None]:
vectorAssembler.transform(trainDF)
# generates a column called rawFeatures

In [None]:
VectorIndexer? # decides which variables are categorical and encodes them

In [8]:
vectorIndexer = VectorIndexer(inputCol="rawFeatures", outputCol="features")
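
As a rough sketch of what `VectorIndexer` decides for each feature: a column with few distinct values is treated as categorical and re-indexed, while a column with many distinct values is left as continuous. The threshold below mirrors the `maxCategories` parameter (assumed here to default to 20); the helper `index_categorical` is hypothetical, and the exact index ordering Spark assigns may differ.

```python
def index_categorical(column, max_categories=20):
    """Return (is_categorical, encoded_column) for one feature column."""
    distinct = sorted(set(column))
    if len(distinct) > max_categories:
        # Too many distinct values: treat as continuous and leave unchanged
        return False, list(column)
    # Few distinct values: treat as categorical and map them to 0..k-1
    mapping = {value: float(i) for i, value in enumerate(distinct)}
    return True, [mapping[v] for v in column]

season = [1, 2, 3, 4, 1, 2]            # 4 distinct values -> categorical
temp = [i / 100 for i in range(50)]    # 50 distinct values -> continuous

print(index_categorical(season))
print(index_categorical(temp)[0])
```

Under this rule a column such as `hr`, with 24 distinct hours, would exceed a threshold of 20 and be treated as continuous unless `maxCategories` is raised.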

In [9]:
dt = DecisionTreeRegressor(labelCol='cnt')  # tree whose target variable is cnt (we could also pass featuresCol="features", which is the default)

In [10]:
pipeline = Pipeline(stages=[vectorAssembler, vectorIndexer, dt]) 
# create the pipeline that puts all the numbers into a vector, indexes them and trains the model
    # each stage of the pipeline takes the output of the previous stage as its input
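
The chaining behaviour of a pipeline can be illustrated with a toy version in plain Python. The classes `AddOne`, `MeanCenterer` and `Centered` and the two helper functions are hypothetical stand-ins, not part of Spark; they only show how estimators are fitted in order and how each stage consumes the previous stage's output.

```python
class AddOne:
    """Toy transformer: needs no fitting, it only transforms the data."""
    def transform(self, data):
        return [x + 1 for x in data]

class Centered:
    """Fitted transformer produced by MeanCenterer.fit()."""
    def __init__(self, mean):
        self.mean = mean
    def transform(self, data):
        return [x - self.mean for x in data]

class MeanCenterer:
    """Toy estimator: fit() learns the mean and returns a fitted transformer."""
    def fit(self, data):
        return Centered(sum(data) / len(data))

def fit_pipeline(stages, data):
    """Fit stages in order, feeding each one the output of the previous one."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):
            stage = stage.fit(data)    # estimator -> fitted transformer
        fitted.append(stage)
        data = stage.transform(data)   # output becomes the next stage's input
    return fitted

def transform_pipeline(fitted, data):
    """Apply every fitted stage in sequence."""
    for stage in fitted:
        data = stage.transform(data)
    return data

fitted_stages = fit_pipeline([AddOne(), MeanCenterer()], [1, 2, 3])
print(transform_pipeline(fitted_stages, [1, 2, 3]))  # [-1.0, 0.0, 1.0]
```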

In [11]:
model = pipeline.fit(trainDF)  # train the model

In [12]:
evaluator = RegressionEvaluator(labelCol="cnt")  # evaluation method (the default metric is RMSE)

In [13]:
rmse_train = evaluator.evaluate(model.transform(trainDF))
rmse_valid = evaluator.evaluate(model.transform(testDF))

In [14]:
print("## RMSE (Train): {:.3f}".format(rmse_train))
print("## RMSE (Valid): {:.3f}".format(rmse_valid))

## RMSE (Train): 32.659
## RMSE (Valid): 33.179
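
For reference, the RMSE reported above is the square root of the mean squared difference between labels and predictions. A minimal plain-Python sketch, using made-up numbers rather than the notebook's data:

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

labels = [10.0, 20.0, 30.0]        # illustrative values only
predictions = [12.0, 18.0, 33.0]
print(rmse(labels, predictions))   # about 2.38
```

Because the errors are squared before averaging, a few large misses weigh more than many small ones, and the result is in the same units as `cnt`.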


In [None]:
model.stages  # gives us the three fitted stages (we can reuse or replace them)

#### Now the forest:

In [15]:
rf = RandomForestRegressor(labelCol='cnt')  # define the random forest

In [16]:
pipeline2 = Pipeline(stages= model.stages[:-1] + [rf])  # reuse every fitted stage except the last one, swapping dt for rf

In [17]:
model2 = pipeline2.fit(trainDF)  # train

In [18]:
rmse_train = evaluator.evaluate(model2.transform(trainDF))
rmse_valid = evaluator.evaluate(model2.transform(testDF))

In [19]:
print("## RMSE (Train): {:.3f}".format(rmse_train))
print("## RMSE (Valid): {:.3f}".format(rmse_valid))
# the error is worse than with the single decision tree

## RMSE (Train): 41.687
## RMSE (Valid): 42.729
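
One plausible contributor to this gap, offered as an assumption rather than a diagnosis: with the default `featureSubsetStrategy` of `auto`, a regression forest is documented to consider about one third of the features at each split, whereas a single tree considers all of them. The sketch below only does the arithmetic for the 13 features used here; the rounding rules are assumed, not taken from this notebook.

```python
import math

num_features = 13  # length of featuresCols above

def features_per_split(strategy, n):
    """Number of candidate features per split (rounding is an assumption)."""
    if strategy == "all":
        return n
    if strategy == "onethird":
        return math.ceil(n / 3)
    if strategy == "sqrt":
        return math.ceil(math.sqrt(n))
    raise ValueError("unsupported strategy: " + strategy)

for strategy in ("all", "onethird", "sqrt"):
    print(strategy, features_per_split(strategy, num_features))
```

With few, shallow trees, restricting each split to a handful of features can hide the most informative ones, which is consistent with the need for tuning discussed next.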


More complex models have several hyperparameters that must be configured to get the best *performance*. This search for the optimal configuration is usually known as [*tuning*](https://en.wikipedia.org/wiki/Hyperparameter_optimization). Let's look at the parameters of the model in question:

In [20]:
print(RandomForestRegressor().explainParams())

cacheNodeIds: If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often should the cache be checkpointed or disable it by setting checkpointInterval. (default: False)
checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext. (default: 10)
featureSubsetStrategy: The number of features to consider for splits at each tree node. Supported options: auto, all, onethird, sqrt, log2, (0.0-1.0], [1-n]. (default: auto)
featuresCol: features column name. (default: features)
impurity: Criterion used for information gain calculation (case-insensitive). Supported options: variance (default: variance)
labelCol: label column name. (default: label)
maxBins: Max 

In [21]:
rf3 = RandomForestRegressor(labelCol='cnt', numTrees=200, maxDepth=10)

In [22]:
model3 = Pipeline(stages= model.stages[:-1] + [rf3]).fit(trainDF)

In [23]:
rmse_train = evaluator.evaluate(model3.transform(trainDF))
rmse_valid = evaluator.evaluate(model3.transform(testDF))

In [24]:
print("## RMSE (Train): {:.3f}".format(rmse_train))
print("## RMSE (Valid): {:.3f}".format(rmse_valid))

## RMSE (Train): 17.871
## RMSE (Valid): 22.209


We have now achieved better results!

### Cross-validation and grid search

![](img/cv.png)

Spark MLlib provides functions that make hyperparameter search and [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) easy.
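
As a reminder of what k-fold cross-validation does, here is a minimal plain-Python sketch: the rows are divided into `num_folds` parts, and each part is used once for validation while the rest is used for training. The helper `k_fold_indices` is hypothetical, and it assigns folds by striding for simplicity, whereas Spark assigns rows to folds randomly.

```python
def k_fold_indices(n_rows, num_folds):
    """Yield (train_idx, valid_idx) pairs, one per fold."""
    indices = list(range(n_rows))
    # Fold i holds rows i, i + num_folds, i + 2 * num_folds, ...
    folds = [indices[i::num_folds] for i in range(num_folds)]
    for i in range(num_folds):
        valid_idx = folds[i]
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, valid_idx

for train_idx, valid_idx in k_fold_indices(n_rows=9, num_folds=3):
    print(sorted(train_idx), valid_idx)
```

The metric for a given hyperparameter combination is then the average of the validation metrics over the folds.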


In [25]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

In [26]:
paramGrid = (

    ParamGridBuilder()
    .addGrid(rf.numTrees, [100, 200, 300])
    .addGrid(rf.maxDepth, [4, 10])
    .build()

)
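
The grid above is the Cartesian product of the candidate values, so the cost of the search grows quickly. A small sketch of the count, assuming 3 folds and one final refit of the best combination on the full training data:

```python
from itertools import product

num_trees_values = [100, 200, 300]
max_depth_values = [4, 10]
num_folds = 3  # assumed number of cross-validation folds

# Every combination of numTrees and maxDepth
grid = [
    {"numTrees": n, "maxDepth": d}
    for n, d in product(num_trees_values, max_depth_values)
]

# One fit per combination per fold, plus one final refit of the best model
fits = len(grid) * num_folds + 1

print(len(grid))  # 6
print(fits)       # 19
```

Each of those fits trains a forest with up to 300 trees, so adding values to the grid should be done with the total number of fits in mind.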

In [27]:
crossval = CrossValidator(

    estimator = pipeline2,
    estimatorParamMaps = paramGrid,
    evaluator = evaluator,
    numFolds = 3

)

**WARNING** This `fit` may take several minutes:

In [28]:
cvModel = crossval.fit(trainDF)

In [29]:
cvModel.avgMetrics

[54.16436455546874,
 23.147411883242945,
 51.926540068261154,
 22.823147824447453,
 52.40418431361557,
 22.793689821884286]

In [30]:
mejor = np.argsort(cvModel.avgMetrics)[0]

In [31]:
mejor

5

In [32]:
cvModel.avgMetrics[mejor]

22.793689821884286

In [33]:
cvModel.getEstimatorParamMaps()[mejor]

{Param(parent='RandomForestRegressor_41c4b53d06a9b79a06da', name='numTrees', doc='Number of trees to train (>= 1).'): 300,
 Param(parent='RandomForestRegressor_41c4b53d06a9b79a06da', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.'): 10}
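
To read the metrics more easily, they can be paired with the parameter combinations. The sketch below copies the values printed above and assumes the grid is ordered with `numTrees` varying slowest, which is consistent with index 5 mapping to 300 trees and depth 10.

```python
from itertools import product

# Average RMSE per combination, copied from cvModel.avgMetrics above
avg_metrics = [54.16436455546874, 23.147411883242945, 51.926540068261154,
               22.823147824447453, 52.40418431361557, 22.793689821884286]

# Assumed ordering: (numTrees, maxDepth) with numTrees varying slowest
grid = list(product([100, 200, 300], [4, 10]))

# RMSE is an error, so the best combination is the one with the lowest value
best_index = min(range(len(avg_metrics)), key=avg_metrics.__getitem__)

for (num_trees, max_depth), metric in zip(grid, avg_metrics):
    print("numTrees={:3d} maxDepth={:2d} RMSE={:.3f}".format(num_trees, max_depth, metric))
print("best:", grid[best_index])
```

Read this way, `maxDepth` accounts for almost all of the improvement (roughly 52-54 at depth 4 versus about 23 at depth 10), while going from 100 to 300 trees changes the RMSE only marginally.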

In [34]:
rmse_train = evaluator.evaluate(cvModel.transform(trainDF))
rmse_valid = evaluator.evaluate(cvModel.transform(testDF))

In [35]:
print("## RMSE (Train): {:.3f}".format(rmse_train))
print("## RMSE (Valid): {:.3f}".format(rmse_valid))

## RMSE (Train): 17.901
## RMSE (Valid): 22.400


# Clustering: K-Means

![](img/wikimedia.png)

Let's look at an example of an unsupervised model. We will use a dataset of Wikipedia articles, which can be found at: https://dumps.wikimedia.org/

In [36]:
wiki_df = spark.read.parquet('/datos/wiki.parquet').repartition(800).cache()

In [37]:
wiki_df.count()

111495

In [38]:
wiki_df.printSchema()

root
 |-- id: long (nullable = true)
 |-- title: string (nullable = true)
 |-- lastrev_pdt_time: timestamp (nullable = true)
 |-- revid: long (nullable = true)
 |-- comment: string (nullable = true)
 |-- contributorid: long (nullable = true)
 |-- contributorusername: string (nullable = true)
 |-- contributorip: string (nullable = true)
 |-- text: string (nullable = true)



In [39]:
wiki_df.limit(10).toPandas()

Unnamed: 0,id,title,lastrev_pdt_time,revid,comment,contributorid,contributorusername,contributorip,text
0,33235801,KIG60,2016-03-03 17:56:42,708165041,/* External links */merge cat per https://en.w...,3637572.0,SQL,,{{Infobox Radio station\n | name = KIG60 - Bur...
1,3484057,Chris Brown (album),2016-03-03 08:06:20,708085410,/* Track listing */,,,82.51.120.241,{{Infobox album <!-- See Wikipedia:WikiProject...
2,2872543,Humane Slaughter Act,2016-03-03 14:13:09,708135254,/* Content of the Humane Slaughter Act */,,,2601:282:8200:4DC6:5D1B:7B5B:CB61:981A,<!-- Deleted image removed: [[Image:CattleRest...
3,8100880,The Bicester School,2016-03-04 12:30:29,708293801,Reverted edits by [[Special:Contribs/94.119.64...,506179.0,Gilliam,,{{Use dmy dates|date=October 2014}}\n{{Infobox...
4,32693240,Siege of Nagykanizsa,2016-03-05 01:30:19,708387224,/* References */[[WP:CHECKWIKI]] error fixes u...,1862829.0,Magioladitis,,{{refimprove|date=September 2011}}\n{{Infobox ...
5,17074415,Andr??s Roemer,2016-03-04 11:48:53,708287447,Cleaning,7971374.0,Werther mx,,{{Use mdy dates|date=January 2015}}\n{{Infobox...
6,47835010,Karie,2016-03-04 06:27:44,708241814,"Removing ""Karie.jpg"", it has been deleted from...",2304267.0,CommonsDelinker,,{{Infobox film\n| name = Karie\n| image =\n| c...
7,6100355,Ois??n McConville,2016-03-04 19:12:20,708348063,migrating [[Wikipedia:Persondata|Persondata]] ...,24420788.0,KasparBot,,{{Infobox GAA player \n| image = Ois...
8,18799478,Remetea Mare,2016-03-04 08:50:15,708260679,Robot - Speedily moving category Communes in T...,1215485.0,Cydebot,,{{refimprove|date=July 2009}}\n{{Infobox settl...
9,722668,Greece national football team,2016-03-03 08:45:24,708090889,/* Recent call-ups */,,,91.140.24.95,{{About|the men's team|the women's team|Greece...


We start by preprocessing the text, as we have already seen:

In [40]:
for i in wiki_df.select('text').take(4):

    print(i.text)
    print('------------\n\n')

{{Infobox Radio station
 | name = KIG60 - Burlington All Hazards
 | image = [[Image:Noaa all hazards.svg|150px]]
 | city = [[Burlington, Vermont]]
 | area = [[Burlington, Vermont metropolitan area|Burlington Metro]]
 | branding = [[NOAA Weather Radio All Hazards|NOAA All Hazards Radio]]
 | slogan = The Voice Of The National Weather Service
 | airdate = 
 | language = [[American English|English]]
 | frequency = 162.400 [[Megahertz|MHz]]
 | format = [[Weather radio|Weather/Civil Emergency]]
 | power = 500 [[Watt]]s
 | erp = 
 | haat = 
 | class = C
 | callsign_meaning = 
 | former_callsigns = 
 | owner = [[National Oceanic and Atmospheric Administration|NOAA]]/[[National Weather Service]]
 | webcast = 
 | website = [http://www.erh.noaa.gov/btv www.erh.noaa.gov/btv]
 | affiliations =
}}
'''KIG60''' (sometimes referred to as '''Burlington All Hazards''') is a [[NOAA Weather Radio All Hazards|NOAA Weather Radio]] station that serves the [[Burlington, Vermont metropolitan area|Burlington met

In [41]:
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, HashingTF, IDF, Normalizer

##### Step 1: Natural Language Processing: RegexTokenizer: Convert the `text` column to a lowercase bag of words

In [42]:
tokenizer = (

    RegexTokenizer()
    .setInputCol("text")
    .setOutputCol("words")
    .setPattern(r'\W+')

)
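
What this tokenizer does can be approximated in plain Python: lowercase the text, split on runs of non-word characters, and discard empty tokens. The helper `regex_tokenize` is a hypothetical stand-in; `re.ASCII` is used on the assumption that the Java pattern `\W` only treats ASCII letters, digits and underscore as word characters.

```python
import re

def regex_tokenize(text, pattern=r"\W+"):
    """Lowercase, split on the pattern, and drop empty tokens."""
    tokens = re.split(pattern, text.lower(), flags=re.ASCII)
    return [t for t in tokens if t]

sample = "{{Infobox Radio station\n | name = KIG60 - Burlington All Hazards"
print(regex_tokenize(sample))
```

Note that wiki markup such as braces and pipes disappears, but markup keywords like `infobox` survive as ordinary tokens.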

In [43]:
wiki_words_df = tokenizer.transform(wiki_df)

In [44]:
wiki_words_df.printSchema()

root
 |-- id: long (nullable = true)
 |-- title: string (nullable = true)
 |-- lastrev_pdt_time: timestamp (nullable = true)
 |-- revid: long (nullable = true)
 |-- comment: string (nullable = true)
 |-- contributorid: long (nullable = true)
 |-- contributorusername: string (nullable = true)
 |-- contributorip: string (nullable = true)
 |-- text: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)



In [45]:
np.array(wiki_words_df.select("words").first()[0])

array(['infobox', 'radio', 'station', 'name', 'kig60', 'burlington',
       'all', 'hazards', 'image', 'image', 'noaa', 'all', 'hazards',
       'svg', '150px', 'city', 'burlington', 'vermont', 'area',
       'burlington', 'vermont', 'metropolitan', 'area', 'burlington',
       'metro', 'branding', 'noaa', 'weather', 'radio', 'all', 'hazards',
       'noaa', 'all', 'hazards', 'radio', 'slogan', 'the', 'voice', 'of',
       'the', 'national', 'weather', 'service', 'airdate', 'language',
       'american', 'english', 'english', 'frequency', '162', '400',
       'megahertz', 'mhz', 'format', 'weather', 'radio', 'weather',
       'civil', 'emergency', 'power', '500', 'watt', 's', 'erp', 'haat',
       'class', 'c', 'callsign_meaning', 'former_callsigns', 'owner',
       'national', 'oceanic', 'and', 'atmospheric', 'administration',
       'noaa', 'national', 'weather', 'service', 'webcast', 'website',
       'http', 'www', 'erh', 'noaa', 'gov', 'btv', 'www', 'erh', 'noaa',
       'gov', 'b

##### Step 2: Natural Language Processing: StopWordsRemover: Remove Stop words

In [46]:
remover = StopWordsRemover().setInputCol("words").setOutputCol("noStopWords")
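
Stop word removal is a simple filter over each token list. In this sketch `STOP_WORDS` is a small hypothetical subset chosen for illustration; the real default list used by `StopWordsRemover` is much longer.

```python
# Hypothetical subset of English stop words, for illustration only
STOP_WORDS = {"the", "of", "and", "all", "s"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word set."""
    return [t for t in tokens if t not in stop_words]

tokens = ["slogan", "the", "voice", "of", "the", "national", "weather", "service"]
print(remove_stop_words(tokens))
```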

In [47]:
no_stop_words_list_df = remover.transform(wiki_words_df)

In [48]:
no_stop_words_list_df.printSchema()

root
 |-- id: long (nullable = true)
 |-- title: string (nullable = true)
 |-- lastrev_pdt_time: timestamp (nullable = true)
 |-- revid: long (nullable = true)
 |-- comment: string (nullable = true)
 |-- contributorid: long (nullable = true)
 |-- contributorusername: string (nullable = true)
 |-- contributorip: string (nullable = true)
 |-- text: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- noStopWords: array (nullable = true)
 |    |-- element: string (containsNull = true)



In [49]:
no_stop_words_list_df.select("id", "title", "words", "noStopWords").show(15)

+--------+--------------------+--------------------+--------------------+
|      id|               title|               words|         noStopWords|
+--------+--------------------+--------------------+--------------------+
|33235801|               KIG60|[infobox, radio, ...|[infobox, radio, ...|
| 3484057| Chris Brown (album)|[infobox, album, ...|[infobox, album, ...|
| 2872543|Humane Slaughter Act|[deleted, image, ...|[deleted, image, ...|
| 8100880| The Bicester School|[use, dmy, dates,...|[use, dmy, dates,...|
|32693240|Siege of Nagykanizsa|[refimprove, date...|[refimprove, date...|
|17074415|      Andr??s Roemer|[use, mdy, dates,...|[use, mdy, dates,...|
|47835010|               Karie|[infobox, film, n...|[infobox, film, n...|
| 6100355|   Ois??n McConville|[infobox, gaa, pl...|[infobox, gaa, pl...|
|18799478|        Remetea Mare|[refimprove, date...|[refimprove, date...|
|  722668|Greece national f...|[about, the, men,...|[men, team, women...|
| 1300969|United States pre...|[main, 

In [50]:
no_stop_words_list_df.select(F.explode('noStopWords').alias('words')).select(F.countDistinct('words')).show()

+---------------------+
|count(DISTINCT words)|
+---------------------+
|              3720161|
+---------------------+



##### Step 3: HashingTF

![](img/tf-idf.png)

[*HashingTF*](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a technique that avoids computing the full *term frequency* matrix: each term is hashed to one of a fixed number of indices, which gives results very close to the exact counts while being much faster to compute and easy to parallelize. The trade-off is that distinct terms can collide in the same index.

In [51]:
hashing_tf = HashingTF().setInputCol("noStopWords").setOutputCol("hashingTF").setNumFeatures(20000)
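
The hashing trick itself fits in a few lines of plain Python. This sketch uses `zlib.crc32` as a stand-in hash function purely for illustration (Spark uses a different hash), so the indices will not match Spark's; the function `hashing_tf` below is hypothetical.

```python
import zlib

def hashing_tf(tokens, num_features=20000):
    """Map each token to a bucket by hashing and count occurrences per bucket."""
    counts = {}
    for token in tokens:
        # Stand-in hash; any deterministic hash works for the illustration
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] = counts.get(index, 0.0) + 1.0
    # Sparse representation: only the non-zero buckets are stored
    return dict(sorted(counts.items()))

doc = ["noaa", "weather", "radio", "noaa", "weather", "noaa"]
vec = hashing_tf(doc)
print(vec)
```

With about 3.7 million distinct tokens and 20,000 buckets, roughly 186 distinct tokens share each bucket on average, so collisions are unavoidable at this setting; raising `numFeatures` reduces them at the cost of wider vectors.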

In [52]:
featurized_data_df = hashing_tf.transform(no_stop_words_list_df)

In [53]:
featurized_data_df.printSchema()

root
 |-- id: long (nullable = true)
 |-- title: string (nullable = true)
 |-- lastrev_pdt_time: timestamp (nullable = true)
 |-- revid: long (nullable = true)
 |-- comment: string (nullable = true)
 |-- contributorid: long (nullable = true)
 |-- contributorusername: string (nullable = true)
 |-- contributorip: string (nullable = true)
 |-- text: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- noStopWords: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- hashingTF: vector (nullable = true)



In [54]:
featurized_data_df.select("id", "title", "noStopWords", "hashingTF").limit(10).toPandas()

Unnamed: 0,id,title,noStopWords,hashingTF
0,33235801,KIG60,"[infobox, radio, station, name, kig60, burling...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, ..."
1,3484057,Chris Brown (album),"[infobox, album, see, wikipedia, wikiproject, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,2872543,Humane Slaughter Act,"[deleted, image, removed, image, cattlerestrai...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
3,8100880,The Bicester School,"[use, dmy, dates, date, october, 2014, infobox...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,32693240,Siege of Nagykanizsa,"[refimprove, date, september, 2011, infobox, m...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
5,17074415,Andr??s Roemer,"[use, mdy, dates, date, january, 2015, infobox...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
6,47835010,Karie,"[infobox, film, name, karie, image, caption, w...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
7,6100355,Ois??n McConville,"[infobox, gaa, player, image, oisin, mcconvill...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
8,18799478,Remetea Mare,"[refimprove, date, july, 2009, infobox, settle...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
9,722668,Greece national football team,"[men, team, women, team, greece, women, nation...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


In [55]:
vector = featurized_data_df.select('hashingTF').first()[0]

In [56]:
type(vector)

pyspark.ml.linalg.SparseVector

In [57]:
vector.values

array([ 1.,  1.,  2.,  1.,  1.,  5.,  2.,  1.,  2.,  1.,  1.,  1.,  2.,
        1.,  2.,  1.,  1., 10.,  2.,  1.,  3.,  1.,  2.,  1.,  1.,  1.,
        2.,  2.,  1., 14.,  6.,  1.,  3.,  1.,  1.,  1.,  2.,  2.,  2.,
        4.,  1.,  1.,  3., 11.,  1.,  1.,  1.,  1., 11.,  1.,  1.,  1.,
        1.,  3.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  2.,  1.,  1.,  1.,
        1.,  1.,  2.,  1., 12.,  1.,  1.,  1.,  4.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  2.,  4.,  2.,  1.,  4.,  4.,  1.,  1.,  5.,  1.,
        1.,  1.,  1.,  3.,  2.,  1.,  1.,  1.,  2.,  4.,  3.,  2.,  2.,
        2.,  1., 10.,  1.,  1.,  1.,  1.,  1.,  1.])

In [58]:
vector.toArray()

array([0., 0., 0., ..., 0., 0., 0.])

In [59]:
aux = featurized_data_df.select("id", "title", "noStopWords", "hashingTF").first()

In [60]:
aux

Row(id=33235801, title='KIG60', noStopWords=['infobox', 'radio', 'station', 'name', 'kig60', 'burlington', 'hazards', 'image', 'image', 'noaa', 'hazards', 'svg', '150px', 'city', 'burlington', 'vermont', 'area', 'burlington', 'vermont', 'metropolitan', 'area', 'burlington', 'metro', 'branding', 'noaa', 'weather', 'radio', 'hazards', 'noaa', 'hazards', 'radio', 'slogan', 'voice', 'national', 'weather', 'service', 'airdate', 'language', 'american', 'english', 'english', 'frequency', '162', '400', 'megahertz', 'mhz', 'format', 'weather', 'radio', 'weather', 'civil', 'emergency', 'power', '500', 'watt', 'erp', 'haat', 'class', 'c', 'callsign_meaning', 'former_callsigns', 'owner', 'national', 'oceanic', 'atmospheric', 'administration', 'noaa', 'national', 'weather', 'service', 'webcast', 'website', 'http', 'www', 'erh', 'noaa', 'gov', 'btv', 'www', 'erh', 'noaa', 'gov', 'btv', 'affiliations', 'kig60', 'sometimes', 'referred', 'burlington', 'hazards', 'noaa', 'weather', 'radio', 'hazards', '

##### Step 4: IDF

We now compute the inverse document frequency (IDF) weights:

In [61]:
idf = IDF().setInputCol("hashingTF").setOutputCol("idf")
idf_model = idf.fit(featurized_data_df)
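
Fitting the IDF stage amounts to counting, for every index, how many documents contain it, and turning that count into a weight. The sketch below assumes the smoothed formula described in the MLlib documentation, `log((m + 1) / (df + 1))`, where `m` is the number of documents; the helper functions are hypothetical.

```python
import math

def idf_weights(doc_term_counts, num_features):
    """Compute one IDF weight per index from sparse term-count dicts."""
    num_docs = len(doc_term_counts)
    doc_freq = [0] * num_features
    for counts in doc_term_counts:
        for index in counts:
            doc_freq[index] += 1   # number of documents containing this index
    return [math.log((num_docs + 1) / (df + 1)) for df in doc_freq]

def tf_idf(counts, idf):
    """Scale each term frequency by the IDF weight of its index."""
    return {i: tf * idf[i] for i, tf in counts.items()}

docs = [{0: 2.0, 1: 1.0}, {0: 1.0}, {0: 3.0, 2: 1.0}]  # toy corpus
idf = idf_weights(docs, num_features=3)
print(idf)
print(tf_idf(docs[0], idf))
```

An index that appears in every document gets weight 0, so ubiquitous terms stop contributing to the distance between documents.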

In [62]:
idf_model

IDF_47fd8c036f5f17fcf6d9

In [63]:
idf_model.transform(featurized_data_df).select("text","hashingTF","idf").limit(10).toPandas()

Unnamed: 0,text,hashingTF,idf
0,{{Infobox Radio station\n | name = KIG60 - Bur...,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.488..."
1,{{Infobox album <!-- See Wikipedia:WikiProject...,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,<!-- Deleted image removed: [[Image:CattleRest...,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
3,{{Use dmy dates|date=October 2014}}\n{{Infobox...,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,{{refimprove|date=September 2011}}\n{{Infobox ...,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
5,{{Use mdy dates|date=January 2015}}\n{{Infobox...,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
6,{{Infobox film\n| name = Karie\n| image =\n| c...,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
7,{{Infobox GAA player \n| image = Ois...,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
8,{{refimprove|date=July 2009}}\n{{Infobox settl...,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
9,{{About|the men's team|the women's team|Greece...,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


##### Step 5: Normalizer

Since we want to use a distance-based method (K-Means), it is usually advisable to normalize the input vectors so that their magnitude does not distort the result of the algorithm. `Normalizer` rescales each document vector to unit norm:

In [64]:
normalizer = Normalizer().setInputCol("idf").setOutputCol("features")
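
A minimal sketch of that normalization, assuming the default L2 norm: every vector is divided by its Euclidean length, so long and short articles end up on the same scale.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]

print(l2_normalize([3.0, 4.0]))  # [0.6, 0.8]
```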

##### Step 6: k-means & tie it all together...

![](img/clustering.png)

In [65]:
from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans

In [66]:
kmeans = (
    
    KMeans()
    .setFeaturesCol("features")
    .setPredictionCol("prediction")
    .setK(100)
    .setSeed(1234)

)
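
For intuition, the core loop of k-means alternates two steps: assign every point to its nearest centroid, then move each centroid to the mean of its points. The sketch below is a toy plain-Python version with hand-picked starting centroids; Spark's implementation differs in initialization and runs distributed.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
        for p in points
    ]

def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_centroids.append(old)  # keep an empty cluster's centroid
            continue
        new_centroids.append(
            [sum(p[d] for p in members) / len(members) for d in range(len(old))]
        )
    return new_centroids

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids = [[0.0, 0.0], [10.0, 10.0]]  # hand-picked initial centroids
for _ in range(5):
    labels = assign(points, centroids)
    centroids = update(points, labels, centroids)

print(labels)     # [0, 0, 1, 1]
print(centroids)  # [[0.0, 0.5], [10.0, 10.5]]
```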

In [67]:
pipeline = Pipeline(stages=[tokenizer, remover, hashing_tf, idf, normalizer, kmeans])

**WARNING** This `fit` may take several minutes:

In [68]:
model = pipeline.fit(wiki_df)

In [69]:
raw_predictions_df = model.transform(wiki_df).cache()

In [70]:
raw_predictions_df.count()

111495

The `prediction` column records which cluster each article has been assigned to:

In [71]:
raw_predictions_df.select("prediction").limit(10).toPandas()

Unnamed: 0,prediction
0,83
1,24
2,92
3,25
4,80
5,64
6,64
7,70
8,61
9,45


How many clusters are there?

In [72]:
raw_predictions_df.select("prediction").distinct().count()

100

In [73]:
raw_predictions_df.groupBy("prediction").count().orderBy(F.desc("count")).show(20)

+----------+-----+
|prediction|count|
+----------+-----+
|        64|44460|
|        78| 7822|
|        24| 3933|
|        92| 3689|
|        26| 3106|
|        75| 3029|
|        30| 2662|
|        80| 2646|
|        93| 2468|
|        25| 2206|
|         1| 2083|
|        65| 1906|
|         2| 1812|
|        82| 1701|
|        49| 1528|
|        15| 1389|
|        17| 1332|
|        74| 1228|
|        18| 1081|
|         6| 1036|
+----------+-----+
only showing top 20 rows



In [74]:
raw_predictions_df.filter(F.lower(F.col('title')).like('%hadoop%')).toPandas()

Unnamed: 0,id,title,lastrev_pdt_time,revid,comment,contributorid,contributorusername,contributorip,text,words,noStopWords,hashingTF,idf,features,prediction
0,5919308,Apache Hadoop,2016-03-04 22:58:48,708371306,/* History */,,,71.84.15.41,{{multiple issues|\n{{advert|date=October 2013...,"[multiple, issues, advert, date, october, 2013...","[multiple, issues, advert, date, october, 2013...","(0.0, 7.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 14.299669926436321, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.010390351683305786, 0.0, 0.0, 0.0, 0.0...",64


In [75]:
raw_predictions_df.filter(F.lower(F.col('title')).like('%apache spark%')).toPandas()

Unnamed: 0,id,title,lastrev_pdt_time,revid,comment,contributorid,contributorusername,contributorip,text,words,noStopWords,hashingTF,idf,features,prediction
0,42164234,Apache Spark,2016-03-03 14:13:40,708135330,relegate details to footnotes,196471,Qwertyus,,{{Infobox Software\n| name =...,"[infobox, software, name, apache, spark, logo,...","[infobox, software, name, apache, spark, logo,...","(0.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 6.128429968472709, 0.0, 0.0, 0.0, 0.0, 0...","(0.0, 0.014094462407126097, 0.0, 0.0, 0.0, 0.0...",64


In [76]:
raw_predictions_df.filter('prediction = 24').select('title').limit(20).toPandas()

Unnamed: 0,title
0,Chris Brown (album)
1,Elis Paprika
2,Grammy Award for Best Rap Album
3,Maryland Deathfest
4,Crime Pays (Cam'ron album)
5,If You Leave (Daughter album)
6,Little One
7,Steve Scott (poet)
8,Elvis Costello
9,Min barndoms jul (Mia Marianne och Per Filip a...


The articles in this cluster appear to be mostly about music (albums, artists and awards).

In [77]:
spark.stop()