# ML Model with Spark

This notebook was developed on **Databricks** and demonstrates, step by step, how to classify e-mails as *spam* or *not spam* (ham) using **Natural Language Processing (NLP)** and **Machine Learning** techniques in a distributed environment with **Apache Spark**.

## Preparing the Environment and Loading the Spam Table

In [None]:
from pyspark.sql import SparkSession
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import Tokenizer, StringIndexer, Word2Vec
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("nlp").getOrCreate()

In [None]:
# Loading the table
spam = spark.sql("select * from spam_ham_dataset_csv")

In [None]:
# Inspecting the format
spam.show(5)

+-----+--------------------+
|label|                text|
+-----+--------------------+
|  ham|Subject: enron me...|
|  ham|Subject: hpl nom ...|
|  ham|Subject: neon ret...|
| spam|Subject: photosho...|
|  ham|Subject: re : ind...|
+-----+--------------------+
only showing top 5 rows



## Data Preparation

In [None]:
# Converting the string label into a numeric index
stringmodel = StringIndexer(inputCol="label", outputCol="label_num")
spam = stringmodel.fit(spam).transform(spam)
spam.show(5)

+-----+--------------------+---------+
|label|                text|label_num|
+-----+--------------------+---------+
|  ham|Subject: enron me...|      0.0|
|  ham|Subject: hpl nom ...|      0.0|
|  ham|Subject: neon ret...|      0.0|
| spam|Subject: photosho...|      1.0|
|  ham|Subject: re : ind...|      0.0|
+-----+--------------------+---------+
only showing top 5 rows
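
With its default settings, `StringIndexer` orders labels by descending frequency, which is why the more common class (`ham`) receives `0.0` and `spam` receives `1.0`. The plain-Python sketch below only illustrates that ordering rule; `index_labels` and `sample_labels` are hypothetical names, and this is not Spark's implementation.

```python
from collections import Counter

def index_labels(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency; break ties alphabetically for determinism.
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: float(i) for i, lab in enumerate(ordered)}

# Hypothetical sample mirroring the label column of the five rows shown above.
sample_labels = ["ham", "ham", "ham", "spam", "ham"]
mapping = index_labels(sample_labels)
print(mapping)
```

If the class balance were reversed, the same rule would assign `0.0` to `spam`, so it is worth checking the mapping before interpreting metrics.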



In [None]:
# Tokenizing the texts
tokens = Tokenizer(inputCol="text", outputCol="text_tokens")
spam = tokens.transform(spam)
spam.show(5)

+-----+--------------------+---------+--------------------+
|label|                text|label_num|         text_tokens|
+-----+--------------------+---------+--------------------+
|  ham|Subject: enron me...|      0.0|[subject:, enron,...|
|  ham|Subject: hpl nom ...|      0.0|[subject:, hpl, n...|
|  ham|Subject: neon ret...|      0.0|[subject:, neon, ...|
| spam|Subject: photosho...|      1.0|[subject:, photos...|
|  ham|Subject: re : ind...|      0.0|[subject:, re, :,...|
+-----+--------------------+---------+--------------------+
only showing top 5 rows
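
The output shows what `Tokenizer` does: it lowercases the text and splits it on whitespace, so punctuation stays attached to words (`subject:`) or becomes its own token (`:`). A rough plain-Python approximation, using a hypothetical helper `simple_tokenize` and a made-up input string, could look like this:

```python
import re

def simple_tokenize(text):
    """Lowercase the text and split it on single whitespace characters."""
    return re.split(r"\s", text.lower())

# Made-up example string, not taken from the dataset.
tokens_example = simple_tokenize("Subject: Meeting at 10 AM")
print(tokens_example)
```

Because punctuation is not removed, a regex-based tokenizer or a cleaning step may be worth considering if tokens like `subject:` turn out to be noisy.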



In [None]:
# Embedding
word2vec = Word2Vec(inputCol="text_tokens", outputCol="text_w2vec")
spam = word2vec.fit(spam).transform(spam)
spam.show(5)

+-----+--------------------+---------+--------------------+--------------------+
|label|                text|label_num|         text_tokens|          text_w2vec|
+-----+--------------------+---------+--------------------+--------------------+
|  ham|Subject: enron me...|      0.0|[subject:, enron,...|[-0.0071292772319...|
|  ham|Subject: hpl nom ...|      0.0|[subject:, hpl, n...|[0.05170821357285...|
|  ham|Subject: neon ret...|      0.0|[subject:, neon, ...|[-0.0362850259147...|
| spam|Subject: photosho...|      1.0|[subject:, photos...|[-0.0031904729834...|
|  ham|Subject: re : ind...|      0.0|[subject:, re, :,...|[0.00251259542671...|
+-----+--------------------+---------+--------------------+--------------------+
only showing top 5 rows
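
After fitting, the `Word2Vec` model turns each document into a single vector by averaging the vectors of its words, which is what the `text_w2vec` column holds. The sketch below illustrates that averaging in plain Python with toy two-dimensional vectors; `average_vectors` and `toy_vectors` are hypothetical, and skipping unknown tokens is a simplifying assumption rather than a description of Spark's exact behavior.

```python
def average_vectors(tokens, word_vectors):
    """Average the vectors of the tokens that appear in word_vectors."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return []
    dim = len(known[0])
    # Component-wise mean over the known token vectors.
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Toy embeddings, invented for illustration only.
toy_vectors = {"free": [1.0, 0.0], "offer": [0.0, 1.0]}
doc_vector = average_vectors(["free", "offer"], toy_vectors)
print(doc_vector)
```

Averaging discards word order, so two e-mails with the same words in a different order get the same feature vector.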



## Building and Training the Model

In [None]:
# Splitting into training and test sets
spam_treino, spam_teste = spam.randomSplit([.7, .3])
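
`randomSplit` assigns rows to each split at random, so the 70/30 proportions are approximate and the split changes between runs unless a `seed` is passed (the method accepts an optional `seed` argument). The plain-Python sketch below illustrates the idea with a hypothetical `random_split` helper; it is a conceptual analogy, not Spark's implementation.

```python
import random

def random_split(rows, train_fraction, seed):
    """Send each row to train with probability train_fraction, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))
train_a, test_a = random_split(rows, 0.7, seed=42)
train_b, test_b = random_split(rows, 0.7, seed=42)

print(len(train_a), len(test_a))   # sizes are close to 700/300, not exact
print(train_a == train_b)          # same seed, same split
```

Fixing the seed makes metrics comparable across runs of the notebook.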

In [None]:
# Creating and training the model
rf = RandomForestClassifier(labelCol="label_num", featuresCol="text_w2vec", numTrees=100)
modelo = rf.fit(spam_treino)

In [None]:
# Predictions
prev = modelo.transform(spam_teste)

## Evaluating the Result

In [None]:
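# Computing the AUC
# Note: rawPredictionCol points at the hard 0/1 "prediction" column here;
# the model's "rawPrediction" column is the usual input for a score-based ROC.
avaliar = BinaryClassificationEvaluator(rawPredictionCol="prediction", labelCol="label_num", metricName="areaUnderROC")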
AUC = avaliar.evaluate(prev)
print(AUC)

0.9532997912540813


The area under the ROC curve was approximately 0.95 (95%) on the test set, which indicates that the model discriminates very well between spam and ham.
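
One caveat when reading this number: the evaluator was given `rawPredictionCol="prediction"`, so the AUC is computed from hard 0/1 predictions rather than from continuous scores, and the two can differ. The plain-Python sketch below illustrates the difference on made-up data, using the pairwise definition of AUC (the fraction of positive/negative pairs ranked correctly, with ties counted as half); `pairwise_auc` is a hypothetical helper, not a Spark function.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    total = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                total += 1.0      # positive ranked above negative
            elif p == n:
                total += 0.5      # tie counts as half
    return total / (len(pos) * len(neg))

# Made-up labels and scores, for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.45, 0.8]
hard = [1.0 if s >= 0.5 else 0.0 for s in scores]  # thresholded predictions

auc_scores = pairwise_auc(labels, scores)
auc_hard = pairwise_auc(labels, hard)
print(auc_scores, auc_hard)  # 1.0 0.75
```

In this toy case the scores rank every positive above every negative, yet thresholding loses that information. Evaluating on the model's `rawPrediction` or `probability` column would measure ranking quality directly, and would likely give a different value than the one reported above.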