### Full name:

Lettere Dragosavljevich Mathias Giuseppe

### Date:

26-09-2023

# **Data Preprocessing with PySpark**


## Google Colab Setup

If you are going to use Google Colab instead of a Spark Cluster, you will need to run the following code to install Apache Spark.

In [37]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [38]:
# If the following link doesn't work, update it to point at the latest Apache Spark release
!wget -q https://dlcdn.apache.org/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
!tar xf spark-3.4.1-bin-hadoop3.tgz

In [39]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.4.1-bin-hadoop3"
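Both paths above are specific to this Colab image and to the Spark release that was downloaded, so a quick check that they exist can save a confusing failure later in `findspark.init()`. A minimal sketch, where `missing_dirs` is a hypothetical helper and not part of Spark or findspark:

```python
import os

def missing_dirs(names, env=os.environ):
    """Return the variable names whose value is unset or is not an existing directory."""
    return [name for name in names if not os.path.isdir(env.get(name, ""))]

# An empty list means both variables point at real directories.
print(missing_dirs(["JAVA_HOME", "SPARK_HOME"]))
```

If either name is printed, the corresponding path probably needs updating to match the installed JDK or the downloaded Spark version.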

## Setup


In [40]:
# Installing required packages
!pip install pyspark
!pip install findspark



In [41]:
import findspark
findspark.init()

In [42]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

#### Creating the spark session and context


In [43]:
# Create a Spark context
sc = SparkContext()

# Create a Spark session (getOrCreate() reuses the context created above)
spark = SparkSession \
    .builder \
    .appName("Python Spark DataFrames basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

#### Inspect the Spark session



In [44]:
spark

## Exercise 2 - Load the data and Spark dataframe


## Load the dataset into your Colab directory from your local system


In [46]:
from google.colab import files
files.upload()

Saving flights-larger.csv to flights-larger (1).csv


In [50]:
dfsFlight = spark.read.csv("flights-larger.csv", header=True, inferSchema=True, nullValue='NA')
# printSchema() prints the schema itself and returns None, so it is not wrapped in print()
dfsFlight.printSchema()

root
 |-- mon: integer (nullable = true)
 |-- dom: integer (nullable = true)
 |-- dow: integer (nullable = true)
 |-- carrier: string (nullable = true)
 |-- flight: integer (nullable = true)
 |-- org: string (nullable = true)
 |-- mile: integer (nullable = true)
 |-- depart: double (nullable = true)
 |-- duration: integer (nullable = true)
 |-- delay: integer (nullable = true)


## Preprocessing




In [51]:
from pyspark.sql.functions import col, isnan, when, count, isnull, max, min, mode, lit

In [52]:
# Count, per column, the values that are null or a single blank space
res = [count(when((col(c) == ' ') | col(c).isNull(), c)).alias(c) for c in dfsFlight.columns]
dfsFlight.select(*res).show()

# Alternative check, left disabled:
# def check_for_null_or_nan(df):
#     null_or_nan = lambda x: isnan(x) | isnull(x)
#     func = lambda x: df.filter(null_or_nan(x)).count()
#     print(*[f'{i} has {func(i)} nans/nulls' for i in df.columns if func(i) != 0], sep='\n')

+---+---+---+-------+------+---+----+------+--------+-----+
|mon|dom|dow|carrier|flight|org|mile|depart|duration|delay|
+---+---+---+-------+------+---+----+------+--------+-----+
|  0|  0|  0|      0|     0|  0|   0|     0|       0|16711|
+---+---+---+-------+------+---+----+------+--------+-----+
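The raw count is easier to judge as a share of the dataset. A small sketch, assuming the total of 275,000 rows implied by the train and test sizes reported later in this notebook (219562 + 55438):

```python
# Counts copied from the notebook outputs; nothing here touches Spark.
null_delay = 16711
total_rows = 219562 + 55438  # train + test sizes reported in the split section

missing_share = null_delay / total_rows
print(f"{missing_share:.2%} of rows have no delay value")
```

Only `delay` has missing values, so it is the only column that needs imputation.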



In [53]:
max_value = dfsFlight.select(max('delay')).collect()[0][0]
min_value = dfsFlight.select(min('delay')).collect()[0][0]

print("Maximum Value:", max_value)
print("Minimum Value:", min_value)

Maximum Value: 1370
Minimum Value: -80


In [54]:
# Replace nulls with the mode
dfsTemp = dfsFlight
modDel = dfsTemp.agg(mode('delay')).collect()[0][0]


dfsClean = dfsTemp.fillna({'delay': modDel})
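As a plain-Python illustration of what the cell above does: find the most frequent non-null value and substitute it for every null. The sample values are made up and `fill_with_mode` is a hypothetical helper, not a Spark function.

```python
from collections import Counter

def fill_with_mode(values):
    """Replace None entries with the most frequent non-None value."""
    observed = [v for v in values if v is not None]
    most_common_value = Counter(observed).most_common(1)[0][0]
    return [most_common_value if v is None else v for v in values]

delays = [5, None, -3, 5, 12, None, 5]  # hypothetical sample, not from the dataset
print(fill_with_mode(delays))
```

Note that `mode` in `pyspark.sql.functions` appears to require Spark 3.4 or later, which is consistent with the Spark 3.4.1 download used in the Colab setup. Because `delay` is later turned into the label, dropping the affected rows with `dropna(subset=['delay'])` is an alternative worth comparing against imputation.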

In [55]:
res = [count(when((col(c) == ' ') | col(c).isNull(), c)).alias(c) for c in dfsClean.columns]
dfsClean.select(*res).show()

+---+---+---+-------+------+---+----+------+--------+-----+
|mon|dom|dow|carrier|flight|org|mile|depart|duration|delay|
+---+---+---+-------+------+---+----+------+--------+-----+
|  0|  0|  0|      0|     0|  0|   0|     0|       0|    0|
+---+---+---+-------+------+---+----+------+--------+-----+



### Miles to km


In [56]:
dfsClean = dfsClean.withColumn("km", col("mile") * lit(1.60934))

dfsClean = dfsClean.drop("mile")

dfsClean.show()

+---+---+---+-------+------+---+------+--------+-----+----------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|        km|
+---+---+---+-------+------+---+------+--------+-----+----------+
| 10| 10|  1|     OO|  5836|ORD|  8.18|      51|   27| 252.66638|
|  1|  4|  1|     OO|  5866|ORD|  15.5|     102|   -7| 749.95244|
| 11| 22|  1|     OO|  6016|ORD|  7.17|     127|  -19|1187.69292|
|  2| 14|  5|     B6|   199|JFK| 21.17|     365|   60|3617.79632|
|  5| 25|  3|     WN|  1675|SJC| 12.92|      85|   22| 621.20524|
|  3| 28|  1|     B6|   377|LGA| 13.33|     182|   70|1731.64984|
|  5| 28|  6|     B6|   904|ORD|  9.58|     130|   47| 1190.9116|
|  1| 19|  2|     UA|   820|SFO| 12.75|     123|  135|1092.74186|
|  8|  5|  5|     US|  2175|LGA|  13.0|      71|  -10| 344.39876|
|  5| 27|  5|     AA|  1240|ORD| 14.42|     195|  -11|1926.37998|
|  8| 20|  6|     B6|   119|JFK| 14.67|     198|   20|1902.23988|
|  2|  3|  1|     AA|  1881|JFK| 15.92|     200|   -9| 1754.1806|
|  8| 26| 
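The conversion can be sanity-checked by hand against the rows shown. In this sketch the mile values 157 and 466 are inferred by dividing the displayed `km` values by the factor; they are not read from the original `mile` column.

```python
import math

KM_PER_MILE = 1.60934  # same factor as in the cell above

def miles_to_km(miles):
    """Convert a distance in miles to kilometres."""
    return miles * KM_PER_MILE

# Inferred distances for the first two rows of the output above.
for miles, shown_km in [(157, 252.66638), (466, 749.95244)]:
    print(miles, math.isclose(miles_to_km(miles), shown_km, rel_tol=1e-9))
```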

## Creating the *label* column


In [57]:
dfsClean = dfsClean.withColumn('label', (dfsClean.delay >= 15).cast('integer'))

dfsClean.show()

+---+---+---+-------+------+---+------+--------+-----+----------+-----+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|        km|label|
+---+---+---+-------+------+---+------+--------+-----+----------+-----+
| 10| 10|  1|     OO|  5836|ORD|  8.18|      51|   27| 252.66638|    1|
|  1|  4|  1|     OO|  5866|ORD|  15.5|     102|   -7| 749.95244|    0|
| 11| 22|  1|     OO|  6016|ORD|  7.17|     127|  -19|1187.69292|    0|
|  2| 14|  5|     B6|   199|JFK| 21.17|     365|   60|3617.79632|    1|
|  5| 25|  3|     WN|  1675|SJC| 12.92|      85|   22| 621.20524|    1|
|  3| 28|  1|     B6|   377|LGA| 13.33|     182|   70|1731.64984|    1|
|  5| 28|  6|     B6|   904|ORD|  9.58|     130|   47| 1190.9116|    1|
|  1| 19|  2|     UA|   820|SFO| 12.75|     123|  135|1092.74186|    1|
|  8|  5|  5|     US|  2175|LGA|  13.0|      71|  -10| 344.39876|    0|
|  5| 27|  5|     AA|  1240|ORD| 14.42|     195|  -11|1926.37998|    0|
|  8| 20|  6|     B6|   119|JFK| 14.67|     198|   20|1902.23988
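The labelling rule is a simple threshold: a flight counts as delayed (label 1) when `delay` is 15 or more. A plain-Python sketch of the same rule, assuming `delay` is measured in minutes (the notebook does not state the unit):

```python
DELAY_THRESHOLD = 15  # same cutoff as in the cell above; unit assumed to be minutes

def to_label(delay):
    """Return 1 if the delay reaches the threshold, otherwise 0."""
    return int(delay >= DELAY_THRESHOLD)

# Delays copied from the first rows of the output above.
for delay in [27, -7, -19, 60, 22]:
    print(delay, to_label(delay))
```

The comparison is inclusive, so a delay of exactly 15 is labelled 1.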

## Indexing



In [58]:
from pyspark.ml.feature import StringIndexer

In [59]:
# Index the string columns "carrier" and "org" into "carrier_idx" and "org_idx"
indexer = StringIndexer(inputCols=['carrier', 'org'],
                        outputCols=['carrier_idx', 'org_idx'])
dfsClean = indexer.fit(dfsClean).transform(dfsClean)
dfsClean.show()

+---+---+---+-------+------+---+------+--------+-----+----------+-----+-----------+-------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|        km|label|carrier_idx|org_idx|
+---+---+---+-------+------+---+------+--------+-----+----------+-----+-----------+-------+
| 10| 10|  1|     OO|  5836|ORD|  8.18|      51|   27| 252.66638|    1|        2.0|    0.0|
|  1|  4|  1|     OO|  5866|ORD|  15.5|     102|   -7| 749.95244|    0|        2.0|    0.0|
| 11| 22|  1|     OO|  6016|ORD|  7.17|     127|  -19|1187.69292|    0|        2.0|    0.0|
|  2| 14|  5|     B6|   199|JFK| 21.17|     365|   60|3617.79632|    1|        4.0|    2.0|
|  5| 25|  3|     WN|  1675|SJC| 12.92|      85|   22| 621.20524|    1|        3.0|    5.0|
|  3| 28|  1|     B6|   377|LGA| 13.33|     182|   70|1731.64984|    1|        4.0|    3.0|
|  5| 28|  6|     B6|   904|ORD|  9.58|     130|   47| 1190.9116|    1|        4.0|    0.0|
|  1| 19|  2|     UA|   820|SFO| 12.75|     123|  135|1092.74186|    1|        0
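By default `StringIndexer` orders labels by descending frequency (`stringOrderType="frequencyDesc"`), so index 0.0 goes to the most common value; this is consistent with `ORD` receiving `org_idx` 0.0 above. The sketch below imitates that ordering in plain Python on made-up data. `frequency_index` is a hypothetical helper, and the alphabetical tie-break reflects my reading of the Spark documentation rather than anything shown in this notebook.

```python
from collections import Counter

def frequency_index(labels):
    """Map each distinct label to a float index, most frequent first.

    Labels with equal counts are ordered alphabetically.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

carriers = ["UA", "AA", "UA", "OO", "AA", "UA", "B6"]  # hypothetical sample
print(frequency_index(carriers))
```

The resulting indices are category codes, not quantities: a model that treats them as numeric (as logistic regression does) reads an ordering into them that only reflects frequency.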

## Assembling the feature columns (features)

In [61]:
from pyspark.ml.feature import VectorAssembler

In [62]:
assembler = VectorAssembler(inputCols=['mon', 'dom', 'dow', 'carrier_idx', 'org_idx', 'km', 'depart', 'duration'],
                            outputCol='features')
dfsFlightClean = assembler.transform(dfsClean)

dfsFlightClean.show()

+---+---+---+-------+------+---+------+--------+-----+----------+-----+-----------+-------+--------------------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|        km|label|carrier_idx|org_idx|            features|
+---+---+---+-------+------+---+------+--------+-----+----------+-----+-----------+-------+--------------------+
| 10| 10|  1|     OO|  5836|ORD|  8.18|      51|   27| 252.66638|    1|        2.0|    0.0|[10.0,10.0,1.0,2....|
|  1|  4|  1|     OO|  5866|ORD|  15.5|     102|   -7| 749.95244|    0|        2.0|    0.0|[1.0,4.0,1.0,2.0,...|
| 11| 22|  1|     OO|  6016|ORD|  7.17|     127|  -19|1187.69292|    0|        2.0|    0.0|[11.0,22.0,1.0,2....|
|  2| 14|  5|     B6|   199|JFK| 21.17|     365|   60|3617.79632|    1|        4.0|    2.0|[2.0,14.0,5.0,4.0...|
|  5| 25|  3|     WN|  1675|SJC| 12.92|      85|   22| 621.20524|    1|        3.0|    5.0|[5.0,25.0,3.0,3.0...|
|  3| 28|  1|     B6|   377|LGA| 13.33|     182|   70|1731.64984|    1|        4.0|    3.0|[3.0,
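Conceptually, `VectorAssembler` concatenates the listed columns, in the order given, into one numeric vector per row. A plain-Python sketch of that idea using the first row shown above (`assemble` is a hypothetical helper, and Spark stores the result as an ML vector rather than a Python list):

```python
FEATURE_COLS = ['mon', 'dom', 'dow', 'carrier_idx', 'org_idx', 'km', 'depart', 'duration']

def assemble(row, cols=FEATURE_COLS):
    """Collect the selected columns of one row into a list of floats, in order."""
    return [float(row[c]) for c in cols]

# Values copied from the first row of the output above.
first_row = {'mon': 10, 'dom': 10, 'dow': 1, 'carrier_idx': 2.0, 'org_idx': 0.0,
             'km': 252.66638, 'depart': 8.18, 'duration': 51}
print(assemble(first_row))
```

The leading entries match the truncated `features` value `[10.0,10.0,1.0,2....` displayed for that row. `delay` and `flight` are left out of the vector; leaving out `delay` matters because the label is derived from it.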

## Training and test sets


In [71]:
flyTrain, flyTest = dfsFlightClean.randomSplit([0.8, 0.2], seed=23)

[flyTest.count(), flyTrain.count()]

[55438, 219562]
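`randomSplit` assigns rows at random, so the resulting sizes only approximate the requested 80/20 weights. The realised fractions can be checked from the counts above (note that the list prints the test size first):

```python
# Counts copied from the output above.
train_rows, test_rows = 219562, 55438
total = train_rows + test_rows

print(f"train {train_rows / total:.4f}, test {test_rows / total:.4f}")
```

Fixing `seed=23` makes the split repeatable for the same input data and partitioning.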

## Logistic Regression


In [72]:
from pyspark.ml.classification import LogisticRegression


In [73]:
logi = LogisticRegression()
logi_model = logi.fit(flyTrain)

In [74]:
prediction = logi_model.transform(flyTest)

prediction.select('label', 'prediction', 'probability').show()

+-----+----------+--------------------+
|label|prediction|         probability|
+-----+----------+--------------------+
|    0|       1.0|[0.41219465937999...|
|    1|       1.0|[0.29744386637694...|
|    1|       1.0|[0.39281435528949...|
|    1|       0.0|[0.59197907885524...|
|    1|       1.0|[0.37397592573369...|
|    1|       1.0|[0.38433385868289...|
|    1|       1.0|[0.41566067333262...|
|    0|       0.0|[0.75084295539282...|
|    0|       1.0|[0.40751109239594...|
|    0|       0.0|[0.58836267158982...|
|    1|       1.0|[0.35645779888883...|
|    1|       1.0|[0.44735684723495...|
|    1|       0.0|[0.51378001993266...|
|    1|       1.0|[0.35847972562192...|
|    1|       1.0|[0.42615458709646...|
|    0|       1.0|[0.45932045572097...|
|    1|       1.0|[0.42900841067730...|
|    0|       1.0|[0.47391772798271...|
|    1|       1.0|[0.44366295011847...|
|    1|       1.0|[0.38943735846919...|
+-----+----------+--------------------+
only showing top 20 rows
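Each `probability` vector holds `[P(label=0), P(label=1)]`, and with the default threshold of 0.5 the model predicts 1.0 when the second entry exceeds the threshold. The sketch below reproduces that rule in plain Python; the first components are copied from the output above (truncated to four decimals) and the second is derived as one minus the first.

```python
def predict_from_probability(probability, threshold=0.5):
    """probability = [P(label=0), P(label=1)]; predict 1.0 when P(label=1) > threshold."""
    return 1.0 if probability[1] > threshold else 0.0

for p0 in [0.4121, 0.5919, 0.5137]:
    print(p0, predict_from_probability([p0, 1 - p0]))
```

Many of the displayed probabilities sit close to 0.5, which suggests the model is not very confident in its predictions.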



## Confusion Matrix



In [75]:
prediction.groupBy('label', 'prediction').count().show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|    1|       0.0|12644|
|    0|       0.0|19063|
|    1|       1.0|13837|
|    0|       1.0| 9894|
+-----+----------+-----+
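The grouped counts come back in no particular order. Arranging them as a 2x2 table, with actual labels as rows and predictions as columns, makes them easier to read. A small sketch using the counts copied from the output above:

```python
# (label, prediction) -> count, copied from the grouped output above
counts = {(1, 0): 12644, (0, 0): 19063, (1, 1): 13837, (0, 1): 9894}

# Rows are actual labels, columns are predicted labels.
matrix = [[counts[(actual, predicted)] for predicted in (0, 1)] for actual in (0, 1)]

print("          pred=0  pred=1")
for actual, row in zip((0, 1), matrix):
    print(f"label={actual} {row[0]:>7} {row[1]:>7}")
```

The four cells add up to 55438, the size of the test set.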



## Accuracy

In [76]:
TP = prediction.filter('prediction = 1 AND label = 1').count()
FP = prediction.filter('prediction = 1 AND label = 0').count()
FN = prediction.filter('prediction = 0 AND label = 1').count()
TN = prediction.filter('prediction = 0 AND label = 0').count()

print("True positives: ", TP)
print("False positives: ", FP)
print("False negatives: ", FN)
print("True negatives: ", TN)
print("Accuracy: ", (TN + TP) / (TP + FP + FN + TN))

True positives:  13837
False positives:  9894
False negatives:  12644
True negatives:  19063
Accuracy:  0.5934557523720192


## Precision

In [77]:
TP/(TP+FP)

0.5830769879061144

## Recall

In [78]:
TP/(TP+FN)

0.5225255843812545
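The three metrics above can be computed together from the four confusion-matrix counts. The sketch below also adds two figures that are not in the original notebook: the F1 score and the accuracy of a baseline that always predicts the more common class. `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute common binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        # Accuracy obtained by always predicting the more common actual class.
        "majority_baseline": max(tp + fn, tn + fp) / total,
    }

# Counts copied from the outputs above.
metrics = binary_metrics(tp=13837, fp=9894, fn=12644, tn=19063)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

On these counts the majority-class baseline is about 0.52, so an accuracy of about 0.59 is only a modest improvement over predicting "not delayed" for every flight.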

## Model analysis

Unfortunately, the model is not adequate: its accuracy is only 59.35%. The Decision Tree model reached an accuracy of 61.54%, which is not much better. This suggests that both models need fine-tuning before they can be evaluated properly. Note also that the comparison is not like for like: the previous model was trained with a 75%/25% split, whereas this one uses 80%/20%.

In [79]:
spark.stop()