### Full Name:

Lettere Dragosavljevich Mathias Giuseppe

### Date:

14-09-2023

# **Data Preprocessing with PySpark**


## Google Colab Setup

If you are going to use Google Colab instead of a Spark Cluster, you will need to run the following code to install Apache Spark.

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [None]:
# If the following link doesn't work, update it (and SPARK_HOME below) to the latest Apache Spark release
!wget -q https://dlcdn.apache.org/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
!tar xf spark-3.4.1-bin-hadoop3.tgz

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.4.1-bin-hadoop3"

## Setup


In [None]:
# Installing required packages
!pip install pyspark
!pip install findspark

Collecting pyspark
  Using cached pyspark-3.4.1.tar.gz (310.8 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.4.1-py2.py3-none-any.whl size=311285387 sha256=f3f9a2be66cd78d8111d90306cd7b11173bea65f3dba14f4deda4833882c449e
  Stored in directory: /root/.cache/pip/wheels/0d/77/a3/ff2f74cc9ab41f8f594dabf0579c2a7c6de920d584206e0834
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.4.1
Collecting findspark
  Using cached findspark-2.0.1-py2.py3-none-any.whl (4.4 kB)
Installing collected packages: findspark
Successfully installed findspark-2.0.1


In [None]:
import findspark
findspark.init()

In [None]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

#### Creating the spark session and context


In [None]:
# Creating a spark context class
sc = SparkContext()

# Creating a spark session
spark = SparkSession \
    .builder \
    .appName("Python Spark DataFrames basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

#### Initialize Spark session



In [None]:
spark

## Exercise 2 - Load the data and Spark dataframe


## Load the dataset into your Colab directory from your local system


In [None]:
from google.colab import files
files.upload()


In [None]:
dfsFlight = spark.read.csv("flights-larger.csv", header=True, inferSchema=True, nullValue='NA')
# printSchema() prints the schema itself and returns None, so it is not wrapped in print()
dfsFlight.printSchema()

root
 |-- mon: integer (nullable = true)
 |-- dom: integer (nullable = true)
 |-- dow: integer (nullable = true)
 |-- carrier: string (nullable = true)
 |-- flight: integer (nullable = true)
 |-- org: string (nullable = true)
 |-- mile: integer (nullable = true)
 |-- depart: double (nullable = true)
 |-- duration: integer (nullable = true)
 |-- delay: integer (nullable = true)


## Preprocessing




In [None]:
from pyspark.sql.functions import col, isnan, when, count, isnull, max, min, mode, lit

In [None]:
# Count, per column, the rows whose value is a single space or null
res = [count(when((col(c) == ' ')|( col(c).isNull()), c)).alias(c) for c in dfsFlight.columns]
dfsFlight.select(*res).show()
#def check_for_null_or_nan(df):
    #null_or_nan = lambda x: isnan(x) | isnull(x)
    #func = lambda x: df.filter(null_or_nan(x)).count()
   # print(*[f'{i} has {func(i)} nans/nulls' for i in df.columns if func(i)!=0],sep='\n')

+---+---+---+-------+------+---+----+------+--------+-----+
|mon|dom|dow|carrier|flight|org|mile|depart|duration|delay|
+---+---+---+-------+------+---+----+------+--------+-----+
|  0|  0|  0|      0|     0|  0|   0|     0|       0|16711|
+---+---+---+-------+------+---+----+------+--------+-----+



In [None]:
max_value = dfsFlight.select(max('delay')).collect()[0][0]
min_value = dfsFlight.select(min('delay')).collect()[0][0]

print("Maximum Value:", max_value)
print("Minimum Value:", min_value)

Maximum Value: 1370
Minimum Value: -80


In [None]:
# Replace nulls with the mode
dfsTemp = dfsFlight
modDel = dfsTemp.agg(mode('delay')).collect()[0][0]


dfsClean = dfsTemp.fillna({'delay': modDel})

In [None]:
res = [count(when((col(c) == ' ')|( col(c).isNull()), c)).alias(c) for c in dfsClean.columns]
dfsClean.select(*res).show()

+---+---+---+-------+------+---+----+------+--------+-----+
|mon|dom|dow|carrier|flight|org|mile|depart|duration|delay|
+---+---+---+-------+------+---+----+------+--------+-----+
|  0|  0|  0|      0|     0|  0|   0|     0|       0|    0|
+---+---+---+-------+------+---+----+------+--------+-----+
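
The imputation step above fills every missing `delay` with the most frequent observed value. The same idea can be sketched in plain Python on a small list; the sample values and the helper name `fill_with_mode` are hypothetical and only illustrate the logic, not the Spark implementation.

```python
from collections import Counter


def fill_with_mode(values):
    """Replace None entries with the most frequent non-null value."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to impute from
    mode_value = Counter(observed).most_common(1)[0][0]
    return [mode_value if v is None else v for v in values]


# Hypothetical sample, not taken from the dataset
delays = [27, -7, None, -7, 60, None]
filled = fill_with_mode(delays)
print(filled)
```

Because all 16711 missing delays receive the same value, this kind of imputation concentrates many rows at a single point; dropping those rows instead is a common alternative worth comparing.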



### Miles to km


In [None]:
dfsClean = dfsClean.withColumn("km", col("mile") * lit(1.60934))

dfsClean = dfsClean.drop("mile")

dfsClean.show()

+---+---+---+-------+------+---+------+--------+-----+----------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|        km|
+---+---+---+-------+------+---+------+--------+-----+----------+
| 10| 10|  1|     OO|  5836|ORD|  8.18|      51|   27| 252.66638|
|  1|  4|  1|     OO|  5866|ORD|  15.5|     102|   -7| 749.95244|
| 11| 22|  1|     OO|  6016|ORD|  7.17|     127|  -19|1187.69292|
|  2| 14|  5|     B6|   199|JFK| 21.17|     365|   60|3617.79632|
|  5| 25|  3|     WN|  1675|SJC| 12.92|      85|   22| 621.20524|
|  3| 28|  1|     B6|   377|LGA| 13.33|     182|   70|1731.64984|
|  5| 28|  6|     B6|   904|ORD|  9.58|     130|   47| 1190.9116|
|  1| 19|  2|     UA|   820|SFO| 12.75|     123|  135|1092.74186|
|  8|  5|  5|     US|  2175|LGA|  13.0|      71|  -10| 344.39876|
|  5| 27|  5|     AA|  1240|ORD| 14.42|     195|  -11|1926.37998|
|  8| 20|  6|     B6|   119|JFK| 14.67|     198|   20|1902.23988|
|  2|  3|  1|     AA|  1881|JFK| 15.92|     200|   -9| 1754.1806|
|  8| 26| 
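
As a quick sanity check on the conversion, the `km` values can be turned back into miles with the same factor. This is a minimal sketch in plain Python; the helper names are hypothetical, and the two `km` values are the first two rows of the output above.

```python
KM_PER_MILE = 1.60934  # same factor used in the conversion cell


def miles_to_km(miles):
    """Convert a distance in miles to kilometres."""
    return miles * KM_PER_MILE


def km_to_miles(km):
    """Convert a distance in kilometres back to miles."""
    return km / KM_PER_MILE


# km values from the first two rows shown above
first_row_miles = round(km_to_miles(252.66638))
second_row_miles = round(km_to_miles(749.95244))
print(first_row_miles, second_row_miles)
```

Both values map back to whole numbers of miles, which is consistent with `mile` being an integer column in the schema.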

## Indexing



In [None]:
from pyspark.ml.feature import StringIndexer
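
`StringIndexer` maps each distinct string to a numeric index; with its default ordering (`frequencyDesc`) the most frequent value receives index 0.0. The plain-Python sketch below mimics that behaviour on a hypothetical list of origins. The helper name `frequency_index` and the alphabetical tie-break are assumptions made for illustration, not the Spark implementation.

```python
from collections import Counter


def frequency_index(labels):
    """Map each distinct label to a float index, most frequent first."""
    counts = Counter(labels)
    # Ties are broken alphabetically here (an assumption, for determinism)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical sample, not taken from the dataset
origins = ["ORD", "ORD", "ORD", "JFK", "SFO", "JFK"]
mapping = frequency_index(origins)
indexed = [mapping[o] for o in origins]
print(mapping)
print(indexed)
```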

In [None]:
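# Index "carrier" and "org" into "carrier_idx" and "org_idx"
indexer = StringIndexer(inputCols=['carrier', 'org'],
                        outputCols=['carrier_idx', 'org_idx'])
# Reassign dfsClean so the indexed columns are available to the later cells.
# Re-running this cell raises IllegalArgumentException because the output columns already exist.
dfsClean = indexer.fit(dfsClean).transform(dfsClean)
dfsClean.show()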
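
The assembled output in the next section contains a `label` column that none of the cells shown in this notebook creates. A minimal sketch of how such a label could be derived, assuming it flags flights delayed by 15 minutes or more: the threshold and the helper name `delay_to_label` are assumptions, chosen only because they agree with the six sample rows visible in that output.

```python
DELAY_THRESHOLD_MIN = 15  # assumed threshold; not stated anywhere in the notebook


def delay_to_label(delay):
    """Return 1 for a delayed flight, 0 otherwise (under the assumed threshold)."""
    return 1 if delay >= DELAY_THRESHOLD_MIN else 0


# Delays of the six rows visible in the assembled output (labels there: 1, 0, 0, 1, 1, 1)
sample_delays = [27, -7, -19, 60, 22, 70]
labels = [delay_to_label(d) for d in sample_delays]
print(labels)

# Under the same assumption, an equivalent PySpark expression would be:
# dfsClean = dfsClean.withColumn("label", (col("delay") >= 15).cast("integer"))
```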

## Assembling columns (features)

In [None]:
from pyspark.ml.feature import VectorAssembler

In [None]:
assembler = VectorAssembler(inputCols=['mon', 'dom', 'dow', 'carrier_idx', 'org_idx', 'km', 'depart', 'duration'],
                            outputCol='features')
dfsFlightClean = assembler.transform(dfsClean)

dfsFlightClean.show()

+---+---+---+-------+------+---+------+--------+-----+----------+-----+-----------+-------+--------------------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|        km|label|carrier_idx|org_idx|            features|
+---+---+---+-------+------+---+------+--------+-----+----------+-----+-----------+-------+--------------------+
| 10| 10|  1|     OO|  5836|ORD|  8.18|      51|   27| 252.66638|    1|        2.0|    0.0|[10.0,10.0,1.0,2....|
|  1|  4|  1|     OO|  5866|ORD|  15.5|     102|   -7| 749.95244|    0|        2.0|    0.0|[1.0,4.0,1.0,2.0,...|
| 11| 22|  1|     OO|  6016|ORD|  7.17|     127|  -19|1187.69292|    0|        2.0|    0.0|[11.0,22.0,1.0,2....|
|  2| 14|  5|     B6|   199|JFK| 21.17|     365|   60|3617.79632|    1|        4.0|    2.0|[2.0,14.0,5.0,4.0...|
|  5| 25|  3|     WN|  1675|SJC| 12.92|      85|   22| 621.20524|    1|        3.0|    5.0|[5.0,25.0,3.0,3.0...|
|  3| 28|  1|     B6|   377|LGA| 13.33|     182|   70|1731.64984|    1|        4.0|    3.0|[3.0,

## Training and Testing


In [None]:
flyTrain, flyTest = dfsFlightClean.randomSplit([0.75, 0.25], seed=23)

[flyTest.count(), flyTrain.count()]

[69159, 205841]
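
`randomSplit` assigns rows at random, so the requested 75/25 weights are only approximated. The short check below, a plain-Python sketch using the two counts printed above, shows how close the actual split came.

```python
# Counts printed above, in [test, train] order
test_rows, train_rows = 69159, 205841

total_rows = test_rows + train_rows
train_fraction = train_rows / total_rows
test_fraction = test_rows / total_rows

print(f"Total rows: {total_rows}")
print(f"Train fraction: {train_fraction:.4f}")
print(f"Test fraction: {test_fraction:.4f}")
```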

## Decision Tree


In [None]:
from pyspark.ml.classification import DecisionTreeClassifier


In [None]:
tree = DecisionTreeClassifier()
tree_model = tree.fit(flyTrain)

In [None]:
prediction = tree_model.transform(flyTest)

prediction.select('label', 'prediction', 'probability').show()

+-----+----------+--------------------+
|label|prediction|         probability|
+-----+----------+--------------------+
|    0|       1.0|[0.49632415797572...|
|    1|       1.0|[0.38037310609831...|
|    1|       1.0|[0.38037310609831...|
|    1|       0.0|[0.64395862717442...|
|    1|       1.0|[0.45002331002331...|
|    1|       1.0|[0.38037310609831...|
|    1|       1.0|[0.38037310609831...|
|    1|       1.0|[0.38037310609831...|
|    1|       1.0|[0.38037310609831...|
|    0|       0.0|[0.84286470743976...|
|    0|       1.0|[0.38037310609831...|
|    1|       1.0|[0.38037310609831...|
|    0|       0.0|[0.58645003065603...|
|    1|       1.0|[0.38037310609831...|
|    1|       1.0|[0.45002331002331...|
|    1|       1.0|[0.45002331002331...|
|    1|       0.0|[0.58645003065603...|
|    1|       1.0|[0.38037310609831...|
|    1|       1.0|[0.49632415797572...|
|    0|       1.0|[0.45002331002331...|
+-----+----------+--------------------+
only showing top 20 rows



## Confusion Matrix



In [None]:
prediction.groupBy('label', 'prediction').count().show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|    1|       0.0|11533|
|    0|       0.0|21154|
|    1|       1.0|21403|
|    0|       1.0|15069|
+-----+----------+-----+



In [None]:
TP=prediction.filter('prediction = 1 AND label = 1').count()
FP=prediction.filter('prediction = 1 AND label = 0').count()
FN=prediction.filter('prediction = 0 AND label = 1').count()
TN=prediction.filter('prediction = 0 AND label = 0').count()

print("True positives: ", TP)
print("False positives: ", FP)
print("False negatives: ", FN)
print("True negatives: ", TN)
print("Accuracy: ", (TN+TP)/(TP+FP+FN+TN))

True positives:  21403
False positives:  15069
False negatives:  11533
True negatives:  21154
Accuracy:  0.6153501351957085
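
Accuracy alone hides how the errors are distributed between the two classes. As a complement, precision, recall, and F1 can be computed from the same four counts. This is a minimal sketch in plain Python that reuses the numbers printed above; it is an addition for illustration, not part of the original run.

```python
# Counts taken from the confusion matrix above
TP, FP, FN, TN = 21403, 15069, 11533, 21154

# Share of rows predicted as 1 whose label is actually 1
precision = TP / (TP + FP)
# Share of rows labelled 1 that the model predicted as 1
recall = TP / (TP + FN)
# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 score:  {f1:.4f}")
```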


## Model Analysis

Unfortunately, the model is not adequate. With an accuracy of only 61.54%, it needs further refinement before it reaches a useful level of predictive power. As a rule of thumb in cases like this, one should aim for at least 80%, so more experiments are needed, including different train/test splits.
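
One way to put the 61.54% figure in context is to compare it with a baseline that always predicts the most common class in the test set. The sketch below derives that baseline from the confusion-matrix counts reported earlier; it is a plain-Python illustration added here, not part of the original run.

```python
# Counts taken from the confusion matrix reported earlier
TP, FP, FN, TN = 21403, 15069, 11533, 21154

total = TP + FP + FN + TN
label_1_rows = TP + FN  # rows whose true label is 1
label_0_rows = TN + FP  # rows whose true label is 0

# Accuracy of a model that always predicts the majority class
majority_baseline = max(label_0_rows, label_1_rows) / total
accuracy = (TP + TN) / total

print(f"Majority-class baseline: {majority_baseline:.4f}")
print(f"Decision tree accuracy:  {accuracy:.4f}")
print(f"Improvement over baseline: {accuracy - majority_baseline:.4f}")
```

The tree does beat the baseline, but only by a modest margin, which supports the conclusion that the model needs further work.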

In [None]:
spark.stop()