This project was built while following the course available at the link below:
https://www.udemy.com/course/spark-and-python-for-big-data-with-pyspark/

It uses the SMS Spam Collection dataset from the UCI repository: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

I plan to extend this project in the future. For now, the last semester of my Physics degree is taking up most of my time. As soon as I can, I will revisit this code.

# 1. Importing the data

In [0]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SpamClassifier').getOrCreate()

In [0]:
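# Assumes the dataset is already registered as a table named
# "smsspamcollection" in the Spark catalog.
df = spark.read.table("smsspamcollection")
df.show(n=2)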

+---+--------------------+
|_c0|                 _c1|
+---+--------------------+
|ham|Go until jurong p...|
|ham|Ok lar... Joking ...|
+---+--------------------+
only showing top 2 rows



In [0]:
df = df.withColumnRenamed('_c0','class').withColumnRenamed('_c1','text')
df.show(n=2)

+-----+--------------------+
|class|                text|
+-----+--------------------+
|  ham|Go until jurong p...|
|  ham|Ok lar... Joking ...|
+-----+--------------------+
only showing top 2 rows



In [0]:
df.printSchema()

root
 |-- class: string (nullable = true)
 |-- text: string (nullable = true)



# 2. Data preprocessing

### 2.1 Creating a feature called length

In [0]:
from pyspark.sql.functions import length
df = df.withColumn('length',length(df['text']))

In [0]:
df.show(n=2)

+-----+--------------------+------+
|class|                text|length|
+-----+--------------------+------+
|  ham|Go until jurong p...|   111|
|  ham|Ok lar... Joking ...|    29|
+-----+--------------------+------+
only showing top 2 rows



In [0]:
df.groupby('class').mean().show()

+-----+-----------------+
|class|      avg(length)|
+-----+-----------------+
|  ham| 71.4545266210897|
| spam|138.6706827309237|
+-----+-----------------+



### 2.2 Feature transformation

In [0]:
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF, StringIndexer

# Split each message into lowercase tokens
tokenizer = Tokenizer(inputCol="text", outputCol="token_text")
# Drop common stop words
stopremove = StopWordsRemover(inputCol='token_text', outputCol='stop_tokens')
# Count term occurrences per message
count_vec = CountVectorizer(inputCol='stop_tokens', outputCol='c_vec')
# Rescale the counts by inverse document frequency
idf = IDF(inputCol="c_vec", outputCol="tf_idf")
# Encode the string class (ham/spam) as a numeric label
ham_spam_to_num = StringIndexer(inputCol='class', outputCol='label')

from pyspark.ml.feature import VectorAssembler

In [0]:
clean_up = VectorAssembler(inputCols=['tf_idf','length'],outputCol='features')
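
To make the feature stages above more concrete, here is a minimal pure-Python sketch of roughly what the tokenizer, stop-word remover, `CountVectorizer` and `IDF` stages compute on a toy corpus. It is an illustration and not Spark's implementation: the `STOP_WORDS` set, the helper names and the sample messages are invented for the example, and the smoothed formula `log((m + 1) / (df + 1))` follows the description in the Spark MLlib documentation.

```python
import math
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only;
# Spark's StopWordsRemover ships a much longer default list.
STOP_WORDS = {"a", "the", "to", "you", "is"}

def tokenize(text):
    """Lowercase and split on whitespace (roughly what Tokenizer does)."""
    return text.lower().split()

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def tf_idf(docs):
    """Return one {term: tf-idf weight} dict per document."""
    token_lists = [remove_stop_words(tokenize(d)) for d in docs]
    m = len(token_lists)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    idf = {term: math.log((m + 1) / (df + 1)) for term, df in doc_freq.items()}
    # Term frequency is the raw count per document.
    return [{term: count * idf[term] for term, count in Counter(tokens).items()}
            for tokens in token_lists]

# Invented sample messages.
docs = ["Free prize call now", "Call me now", "You won a free free prize"]
weights = tf_idf(docs)
```

Terms that appear in fewer messages (such as `me` here) receive a larger IDF factor than terms shared by several messages (such as `free`), which is the rescaling the `IDF` stage applies to the count vectors.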

# 3. Building the model

I plan to improve this step in the future to make the model more accurate.

In [0]:
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes()

### 3.1 Pipeline

In [0]:
from pyspark.ml import Pipeline

In [0]:
data_prep_pipe = Pipeline(stages=[ham_spam_to_num,tokenizer,stopremove,count_vec,idf,clean_up])
cleaner = data_prep_pipe.fit(df)
clean_data = cleaner.transform(df)

### 3.2 Training the model

In [0]:
clean_data = clean_data.select(['label','features'])
clean_data.show(n=5)

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(13424,[7,11,31,6...|
|  0.0|(13424,[0,24,297,...|
|  1.0|(13424,[2,13,19,3...|
|  0.0|(13424,[0,70,80,1...|
|  0.0|(13424,[36,134,31...|
+-----+--------------------+
only showing top 5 rows
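
The `label` column above comes from `StringIndexer`, which by default orders labels by descending frequency, so the more common class (`ham`) becomes 0.0 and `spam` becomes 1.0. The sketch below mimics that ordering rule in pure Python. The function name `index_labels` and the sample class list are hypothetical, and the alphabetical tie-break is an assumption made for this sketch.

```python
from collections import Counter

def index_labels(values):
    """Map each distinct string to a float index, most frequent first.

    Mimics StringIndexer's default 'frequencyDesc' ordering; ties are
    broken alphabetically here (an assumption for this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Hypothetical class column in which ham is the majority class.
classes = ["ham", "ham", "spam", "ham", "spam", "ham"]
mapping = index_labels(classes)
labels = [mapping[c] for c in classes]
```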



In [0]:
(training,testing) = clean_data.randomSplit([0.7,0.3])
spam_predictor = nb.fit(training)

In [0]:
test_results = spam_predictor.transform(testing)
test_results.show(n=5)

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(13424,[0,1,2,41,...|[-1094.9572098231...|[0.99999999999971...|       0.0|
|  0.0|(13424,[0,1,14,31...|[-215.63146674005...|[1.0,2.4145304625...|       0.0|
|  0.0|(13424,[0,1,17,19...|[-806.84421716411...|[1.0,1.9786262592...|       0.0|
|  0.0|(13424,[0,1,24,31...|[-356.94849390029...|[1.0,2.3915155107...|       0.0|
|  0.0|(13424,[0,1,31,43...|[-341.30235143610...|[1.0,2.0115227827...|       0.0|
+-----+--------------------+--------------------+--------------------+----------+
only showing top 5 rows



### 3.3 Evaluation

In [0]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [0]:
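# MulticlassClassificationEvaluator defaults to metricName="f1", so the
# value below is a weighted F1 score rather than accuracy.
# Pass metricName="accuracy" to measure accuracy instead.
f1_eval = MulticlassClassificationEvaluator(metricName="f1")
f1 = f1_eval.evaluate(test_results)
print("Model F1 score: {}".format(f1))

Model F1 score: 0.9218262370742885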
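
Called with no arguments, `MulticlassClassificationEvaluator()` uses `metricName="f1"`, which is the per-class F1 averaged with weights proportional to each class's frequency. That is not the same quantity as accuracy. As a rough illustration of the difference, the sketch below computes both metrics in pure Python from (label, prediction) pairs. The helper names and the sample pairs are invented for the example and are not taken from the model's results.

```python
from collections import Counter

def accuracy(pairs):
    """Fraction of (label, prediction) pairs that match."""
    return sum(1 for y, p in pairs if y == p) / len(pairs)

def weighted_f1(pairs):
    """Per-class F1, averaged with weights equal to each class's share of the true labels."""
    support = Counter(y for y, _ in pairs)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for y, p in pairs if y == cls and p == cls)
        predicted = sum(1 for _, p in pairs if p == cls)
        precision = tp / predicted if predicted else 0.0
        recall = tp / n
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        total += f1 * n / len(pairs)
    return total

# Invented pairs: 8 ham (0.0) all predicted correctly,
# 2 spam (1.0) of which one is misclassified as ham.
pairs = [(0.0, 0.0)] * 8 + [(1.0, 1.0), (1.0, 0.0)]
acc = accuracy(pairs)
f1 = weighted_f1(pairs)
```

On this toy data accuracy is 0.9 while the weighted F1 is lower, because the missed spam message hurts the minority class's recall. In PySpark, accuracy can be requested with `MulticlassClassificationEvaluator(metricName="accuracy")`.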
