### Full Name:

Lettere Dragosavljevich Mathias Giuseppe

### Date:

19-09-2023

# **Data Preprocessing with PySpark**


## Google Colab Setup

If you use Google Colab instead of a Spark cluster, run the following cells to install Java and Apache Spark.

In [1]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [2]:
# If the following link doesn't work, update it to the latest version of Apache Spark
!wget -q https://dlcdn.apache.org/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
!tar xf spark-3.4.1-bin-hadoop3.tgz

In [3]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.4.1-bin-hadoop3"

## Setup


In [4]:
# Installing required packages
!pip install pyspark
!pip install findspark

Collecting pyspark
  Downloading pyspark-3.4.1.tar.gz (310.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 310.8/310.8 MB 2.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.4.1-py2.py3-none-any.whl size=311285387 sha256=d85ae4cf330bb9386b12066669246d4129f62cd491c03ec4b957f62f69c9c6b2
  Stored in directory: /root/.cache/pip/wheels/0d/77/a3/ff2f74cc9ab41f8f594dabf0579c2a7c6de920d584206e0834
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.4.1
Collecting findspark
  Downloading findspark-2.0.1-py2.py3-none-any.whl (4.4 kB)
Installing collected packages: findspark
Successfully installed findspark-2.0.1


In [5]:
import findspark
findspark.init()

In [6]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

#### Creating the Spark session and context


In [7]:
# Create a Spark context
sc = SparkContext()

# Create a Spark session
spark = SparkSession \
    .builder \
    .appName("Python Spark DataFrames basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

#### Initialize Spark session



In [8]:
spark

## Exercise 2 - Load the data and Spark dataframe


## Load the dataset into your Colab directory from your local system


In [10]:
from google.colab import files
files.upload()

Output hidden; open in https://colab.research.google.com to view.

In [11]:
dfsEmails = spark.read.csv("emails.csv", header=True, inferSchema=True, nullValue='NA')
dfsEmails.printSchema()  # printSchema() prints directly and returns None, so no print() wrapper

root
 |-- Email No.: string (nullable = true)
 |-- the: integer (nullable = true)
 |-- to: integer (nullable = true)
 |-- ect: integer (nullable = true)
 |-- and: integer (nullable = true)
 |-- for: integer (nullable = true)
 |-- of: integer (nullable = true)
 |-- a: integer (nullable = true)
 |-- you: integer (nullable = true)
 |-- hou: integer (nullable = true)
 |-- in: integer (nullable = true)
 |-- on: integer (nullable = true)
 |-- is: integer (nullable = true)
 |-- this: integer (nullable = true)
 |-- enron: integer (nullable = true)
 |-- i: integer (nullable = true)
 |-- be: integer (nullable = true)
 |-- that: integer (nullable = true)
 |-- will: integer (nullable = true)
 |-- have: integer (nullable = true)
 |-- with: integer (nullable = true)
 |-- your: integer (nullable = true)
 |-- at: integer (nullable = true)
 |-- we: integer (nullable = true)
 |-- s: integer (nullable = true)
 |-- are: integer (nullable = true)
 |-- it: integer (nullable = true)
 |-- by: integer (nullabl

## Preprocessing




In [17]:
from pyspark.sql.functions import col, isnan, when, count, isnull, max, min, mode, lit
from pyspark.sql import functions as F

In [25]:
dfsTemp = dfsEmails.withColumnRenamed("Email No.","Email ID")

In [26]:
dfsTemp.show()

(wide table output truncated)

In [28]:
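# Remove the literal "Email" text from the ID column
substr_to_remove = ["Email"]
regex = "|".join(substr_to_remove)  # regex alternation; more substrings can be added to the list
df_new = dfsTemp.withColumn("Email ID", F.regexp_replace("Email ID", regex, ""))
df_new.show()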

(wide table output truncated)
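
To make the effect of the `regexp_replace` step concrete, here is a plain-Python sketch that uses `re.sub` instead of Spark. The sample values such as `"Email 1"` are an assumption based on the column name. The point is that removing only the word `Email` may leave a leading space, which a trailing `strip()` (or `F.trim` in PySpark) would remove.

```python
import re

# Hypothetical sample values; the real column contents are not shown in full above.
raw_ids = ["Email 1", "Email 2", "Email 10"]

# Same pattern construction as in the notebook cell.
substr_to_remove = ["Email"]
regex = "|".join(substr_to_remove)

# Removing the substring alone keeps the space that followed it.
removed_only = [re.sub(regex, "", value) for value in raw_ids]

# Stripping whitespace afterwards yields the bare identifier.
cleaned = [value.strip() for value in removed_only]

print(removed_only)
print(cleaned)
```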

In [31]:
# Count, for every column, the cells that are null or contain a single space
res = [count(when((col(c) == ' ') | (col(c).isNull()), c)).alias(c) for c in df_new.columns]
df_new.select(*res).show()
#def check_for_null_or_nan(df):
    #null_or_nan = lambda x: isnan(x) | isnull(x)
    #func = lambda x: df.filter(null_or_nan(x)).count()
    #print(*[f'{i} has {func(i)} nans/nulls' for i in df.columns if func(i)!=0],sep='\n')

(wide table output truncated)

In [None]:
# This step is not necessary; the dataset is already quite clean and requires no additional preprocessing
#max_value = dfsEmails.select(max('delay')).collect()[0][0]
#min_value = dfsEmails.select(min('delay')).collect()[0][0]

#print("Maximum Value:", max_value)
#print("Minimum Value:", min_value)

Maximum Value: 1370
Minimum Value: -80


In [None]:
# Replace nulls with the mode
#dfsTemp = dfsEmails
#modDel = dfsTemp.agg(mode('delay')).collect()[0][0]


#dfsClean = dfsTemp.fillna({'delay': modDel})

In [None]:
#res = [count(when((col(c) == ' ')|( col(c).isNull()), c)).alias(c) for c in dfsClean.columns]
#dfsClean.select(*res).show()

+---+---+---+-------+------+---+----+------+--------+-----+
|mon|dom|dow|carrier|flight|org|mile|depart|duration|delay|
+---+---+---+-------+------+---+----+------+--------+-----+
|  0|  0|  0|      0|     0|  0|   0|     0|       0|    0|
+---+---+---+-------+------+---+----+------+--------+-----+



## Creating the *Label*


In [32]:
dfsEClean = df_new.withColumnRenamed("Prediction","Label")

dfsEClean.show()

(wide table output truncated)

## Consolidating Columns (Features)

In [33]:
from pyspark.ml.feature import VectorAssembler

In [36]:
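# Every column except the label and the ID is a feature.
# Note: a set difference does not preserve column order, so feature indices may differ between runs.
df_cols = list(set(dfsEClean.columns) - {'Label', 'Email ID'})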
print(df_cols)
print(len(df_cols))

3000
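
Because `set` does not guarantee any ordering, the position of each word inside the assembled feature vector can change from one session to the next. The sketch below, in plain Python with a shortened hypothetical column list, shows an order-preserving alternative based on a list comprehension. It is one possible approach, not the one used in this notebook.

```python
# Hypothetical, shortened column list standing in for dfsEClean.columns.
columns = ["Email ID", "the", "to", "ect", "and", "Label"]
excluded = {"Label", "Email ID"}

# Set difference: correct membership, but the resulting order is not guaranteed.
unordered_cols = list(set(columns) - excluded)

# List comprehension: same membership, original column order preserved.
ordered_cols = [c for c in columns if c not in excluded]

print(sorted(unordered_cols) == sorted(ordered_cols))
print(ordered_cols)
```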


In [51]:
assembler = VectorAssembler(inputCols=df_cols,
                            outputCol='FeaturesTree')
dfsEmailsClean = assembler.transform(dfsEClean)

dfsEmailsClean.show()

(wide table output truncated)

In [59]:
dfsReduced = dfsEmailsClean.select("Email ID", "Label", "FeaturesTree")
dfsReduced = dfsReduced.withColumnsRenamed({"Label": "label", "FeaturesTree": "features"})

dfsReduced.show()

+--------+-----+--------------------+
|Email ID|label|            features|
+--------+-----+--------------------+
|       1|    0|(3000,[65,250,572...|
|       2|    0|(3000,[28,54,58,6...|
|       3|    0|(3000,[62,65,148,...|
|       4|    0|(3000,[65,106,110...|
|       5|    0|(3000,[33,62,65,9...|
|       6|    1|(3000,[14,19,26,3...|
|       7|    0|(3000,[14,33,49,6...|
|       8|    1|(3000,[49,65,99,1...|
|       9|    0|(3000,[33,65,109,...|
|      10|    0|(3000,[14,19,37,4...|
|      11|    0|(3000,[2,33,39,62...|
|      12|    0|(3000,[14,24,26,3...|
|      13|    0|(3000,[14,24,26,3...|
|      14|    0|(3000,[3,5,14,24,...|
|      15|    0|(3000,[3,14,54,62...|
|      16|    0|(3000,[14,33,65,7...|
|      17|    1|(3000,[37,65,106,...|
|      18|    1|(3000,[0,1,24,26,...|
|      19|    0|(3000,[52,65,99,1...|
|      20|    0|(3000,[52,65,99,1...|
+--------+-----+--------------------+
only showing top 20 rows
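
The `features` column above is printed in Spark's sparse vector notation, `(size,[indices],[values])`, which stores only the non-zero entries. The following pure-Python sketch mimics that encoding. The helper name `to_sparse` and the eight-element row are made up for illustration.

```python
from typing import List, Tuple


def to_sparse(dense: List[float]) -> Tuple[int, List[int], List[float]]:
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


# Made-up word-count row; the real rows in this notebook have 3000 entries.
row = [0, 0, 3, 0, 1, 0, 0, 2]
size, indices, values = to_sparse(row)
print((size, indices, values))
```

Since most emails use only a small fraction of the 3000 words, this representation is far more compact than a dense array.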



## Training and Testing


In [68]:
eTrain, eTest = dfsReduced.randomSplit([0.70, 0.30], seed=23)

[eTest.count(), eTrain.count()]

[1555, 3617]
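
The split is random, so the weights passed to `randomSplit` act as targets rather than exact proportions. A quick plain-Python check of the realized split, using only the counts printed above:

```python
# Row counts printed by the cell above: [test, train].
test_count, train_count = 1555, 3617

total = test_count + train_count
test_fraction = test_count / total
train_fraction = train_count / total

print(f"total rows: {total}")
print(f"train: {train_fraction:.4f}, test: {test_fraction:.4f}")
```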

In [69]:
eTrain.show()

+--------+-----+--------------------+
|Email ID|label|            features|
+--------+-----+--------------------+
|      10|    0|(3000,[14,19,37,4...|
|     100|    1|(3000,[33,54,62,6...|
|    1000|    0|(3000,[65,103,148...|
|    1002|    0|(3000,[14,33,49,6...|
|    1003|    0|(3000,[65,104,106...|
|    1006|    0|(3000,[65,148,153...|
|    1007|    0|(3000,[7,26,33,65...|
|     101|    0|(3000,[33,62,65,1...|
|    1010|    0|(3000,[33,62,65,7...|
|    1011|    1|(3000,[1,26,33,37...|
|    1012|    0|(3000,[33,62,65,7...|
|    1013|    0|(3000,[65,106,110...|
|    1015|    0|(3000,[14,33,37,3...|
|    1016|    0|(3000,[14,33,37,6...|
|    1017|    0|(3000,[33,43,62,6...|
|    1018|    1|(3000,[9,26,33,37...|
|     102|    1|(3000,[42,62,65,7...|
|    1020|    0|(3000,[14,21,22,2...|
|    1021|    0|(3000,[65,106,113...|
|    1025|    0|(3000,[21,49,60,6...|
+--------+-----+--------------------+
only showing top 20 rows
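
The IDs above appear in the order 10, 100, 1000, 1002, ..., 101, which matches lexicographic ordering: `Email ID` is a string column. If numeric ordering is wanted, the column could be cast to an integer first. A plain-Python illustration of the difference:

```python
# A few of the IDs shown above, as strings (their type in the DataFrame).
ids = ["1000", "101", "10", "1002", "100"]

# Strings compare character by character, so "1002" sorts before "101".
as_strings = sorted(ids)

# Converting to int for the comparison gives the numeric order.
as_numbers = sorted(ids, key=int)

print(as_strings)
print(as_numbers)
```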



## Decision Tree


In [70]:
from pyspark.ml.classification import DecisionTreeClassifier


In [87]:

tree = DecisionTreeClassifier(maxDepth=8, minInstancesPerNode=2)
tree_model = tree.fit(eTrain)

In [88]:
prediction = tree_model.transform(eTest)

# Show the true label next to the predicted class and the class probabilities
prediction.select('label', 'prediction', 'probability').show()

+-----+----------+--------------------+
|label|prediction|         probability|
+-----+----------+--------------------+
|    0|       1.0|[0.48421052631578...|
|    0|       0.0|           [1.0,0.0]|
|    0|       0.0|           [1.0,0.0]|
|    1|       1.0|[0.04602510460251...|
|    1|       1.0|[0.04602510460251...|
|    0|       0.0|           [1.0,0.0]|
|    1|       1.0|[0.04602510460251...|
|    1|       1.0|[0.01729106628242...|
|    0|       0.0|           [1.0,0.0]|
|    1|       1.0|           [0.0,1.0]|
|    0|       0.0|[0.99790356394129...|
|    0|       0.0|           [1.0,0.0]|
|    1|       1.0|[0.04602510460251...|
|    0|       0.0|           [1.0,0.0]|
|    1|       1.0|           [0.0,1.0]|
|    0|       0.0|           [1.0,0.0]|
|    1|       1.0|[0.48421052631578...|
|    0|       0.0|           [1.0,0.0]|
|    0|       0.0|           [1.0,0.0]|
|    0|       0.0|[0.99790356394129...|
+-----+----------+--------------------+
only showing top 20 rows
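
In the table above, `probability` holds the estimated probabilities for label 0 and label 1, in that order. With default settings, the predicted class is the one with the larger estimate, which is why a row whose label-0 probability is about 0.484 is predicted as 1.0. The sketch below reproduces that rule in plain Python. The helper name is hypothetical, and the second value of the first row is an assumed complement because the printed output is truncated.

```python
def predict_from_probability(probability):
    """Return the index of the largest probability as a float label."""
    best_index = max(range(len(probability)), key=lambda i: probability[i])
    return float(best_index)


# Illustrative rows; the second value in the first row is an assumed complement.
rows = [[0.484, 0.516], [1.0, 0.0], [0.0, 1.0]]
predictions = [predict_from_probability(p) for p in rows]
print(predictions)
```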



## Confusion Matrix



In [89]:
prediction.groupBy('label', 'prediction').count().show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|    1|       0.0|   15|
|    0|       0.0|  942|
|    1|       1.0|  439|
|    0|       1.0|  159|
+-----+----------+-----+



In [90]:
TP=prediction.filter('prediction = 1 AND label = 1').count()
FP=prediction.filter('prediction = 1 AND label = 0').count()
FN=prediction.filter('prediction = 0 AND label = 1').count()
TN=prediction.filter('prediction = 0 AND label = 0').count()

print("True positives: ", TP)
print("False positives: ", FP)
print("False negatives: ", FN)
print("True negatives: ", TN)
print("Accuracy: ", (TN+TP)/(TP+FP+FN+TN))

True positives:  439
False positives:  159
False negatives:  15
True negatives:  942
Accuracy:  0.8881028938906752


## Model Analysis

The model is adequate. With an accuracy of 88.81%, it is considered satisfactory for predicting with reasonable certainty whether an email is spam.
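
Accuracy alone does not show how the errors are split between the two classes. As a complement, the sketch below derives precision, recall, F1 and a majority-class baseline from the confusion-matrix counts reported above. It is plain Python and treats label 1 as the positive (spam) class, which is the assumption implied by the analysis.

```python
# Confusion-matrix counts reported above.
TP, FP, FN, TN = 439, 159, 15, 942

total = TP + FP + FN + TN
accuracy = (TP + TN) / total
precision = TP / (TP + FP)  # share of predicted-positive emails that are truly positive
recall = TP / (TP + FN)     # share of truly positive emails that were caught
f1 = 2 * precision * recall / (precision + recall)

# Accuracy obtained by always predicting the majority class (label 0).
baseline = (TN + FP) / total

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
print(f"f1={f1:.4f} baseline={baseline:.4f}")
```

With these counts, recall is high (about 0.97) while precision is lower (about 0.73), so most of the model's errors are label-0 emails flagged as label 1.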

In [91]:
spark.stop()