### Apellidos y Nombres:

Lettere Dragosavljevich Mathias Giuseppe

### Fecha:

10-10-2023

# **Preprocesamiento de datos con Pyspark**


## Google Colab Setup

If you are going to use Google Colab instead of a Spark Cluster, you will need to run the following code to install Apache Spark.

In [2]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [3]:
#If the following links don't work, you will have to update them with the last versions of Apache Spark
!wget -q https://dlcdn.apache.org/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
!tar xf spark-3.4.1-bin-hadoop3.tgz

In [4]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.4.1-bin-hadoop3"

## Setup


In [5]:
# Installing required packages
!pip install pyspark
!pip install findspark

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m316.9/316.9 MB[0m [31m2.4 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... [?25l[?25hdone
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425344 sha256=e5aec81d0f2058152a0d12b855e295dc16001deb5c3ccb9386bab4b8a886feab
  Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9ac9241e5e44a01940da8fbb17fc
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.0
Collecting findspark
  Downloading findspark-2.0.1-py2.py3-none-any.whl (4.4 kB)
Installing collected packages: findspark
Successfully installed findspark-2.0.1


In [6]:
import findspark
findspark.init()

In [7]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

#### Creating the spark session and context


In [8]:
# Creating a spark context class
sc = SparkContext()

# Creating a spark session
spark = SparkSession \
    .builder \
    .appName("Python Spark DataFrames basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

#### Initialize Spark session



In [9]:
spark

## Exercise 2 - Load the data and Spark dataframe


## Load the dataset into your Colab directory from your local system


In [10]:
from google.colab import files
files.upload()

Output hidden; open in https://colab.research.google.com to view.

In [11]:
dfsFlight = spark.read.csv("flights-larger.csv", header=True, inferSchema=True, nullValue= 'NA')
print(dfsFlight.printSchema())

root
 |-- mon: integer (nullable = true)
 |-- dom: integer (nullable = true)
 |-- dow: integer (nullable = true)
 |-- carrier: string (nullable = true)
 |-- flight: integer (nullable = true)
 |-- org: string (nullable = true)
 |-- mile: integer (nullable = true)
 |-- depart: double (nullable = true)
 |-- duration: integer (nullable = true)
 |-- delay: integer (nullable = true)

None


## Preprocesamiento




In [12]:
from pyspark.sql.functions import col, isnan, when, count, isnull, max, min, mode, lit

In [13]:
res = [count(when((col(c) == ' ')|( col(c).isNull()), c)).alias(c) for c in dfsFlight.columns]
dfsFlight.select(*res).show()
#def check_for_null_or_nan(df):
    #null_or_nan = lambda x: isnan(x) | isnull(x)
    #func = lambda x: df.filter(null_or_nan(x)).count()
   # print(*[f'{i} has {func(i)} nans/nulls' for i in df.columns if func(i)!=0],sep='\n')

+---+---+---+-------+------+---+----+------+--------+-----+
|mon|dom|dow|carrier|flight|org|mile|depart|duration|delay|
+---+---+---+-------+------+---+----+------+--------+-----+
|  0|  0|  0|      0|     0|  0|   0|     0|       0|16711|
+---+---+---+-------+------+---+----+------+--------+-----+



In [14]:
max_value = dfsFlight.select(max('delay')).collect()[0][0]
min_value = dfsFlight.select(min('delay')).collect()[0][0]

print("Maximum Value:", max_value)
print("Minimum Value:", min_value)

Maximum Value: 1370
Minimum Value: -80


In [15]:
# Reemplazar nulos con la moda
dfsTemp = dfsFlight
modDel = dfsTemp.agg(mode('delay')).collect()[0][0]


dfsClean = dfsTemp.fillna({'delay': modDel})

In [16]:
res = [count(when((col(c) == ' ')|( col(c).isNull()), c)).alias(c) for c in dfsClean.columns]
dfsClean.select(*res).show()

+---+---+---+-------+------+---+----+------+--------+-----+
|mon|dom|dow|carrier|flight|org|mile|depart|duration|delay|
+---+---+---+-------+------+---+----+------+--------+-----+
|  0|  0|  0|      0|     0|  0|   0|     0|       0|    0|
+---+---+---+-------+------+---+----+------+--------+-----+



### Miles a KM


In [17]:
dfsClean = dfsClean.withColumn("km", col("mile") * lit(1.60934))

dfsClean = dfsClean.drop("mile")

dfsClean.show()

+---+---+---+-------+------+---+------+--------+-----+----------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|        km|
+---+---+---+-------+------+---+------+--------+-----+----------+
| 10| 10|  1|     OO|  5836|ORD|  8.18|      51|   27| 252.66638|
|  1|  4|  1|     OO|  5866|ORD|  15.5|     102|   -7| 749.95244|
| 11| 22|  1|     OO|  6016|ORD|  7.17|     127|  -19|1187.69292|
|  2| 14|  5|     B6|   199|JFK| 21.17|     365|   60|3617.79632|
|  5| 25|  3|     WN|  1675|SJC| 12.92|      85|   22| 621.20524|
|  3| 28|  1|     B6|   377|LGA| 13.33|     182|   70|1731.64984|
|  5| 28|  6|     B6|   904|ORD|  9.58|     130|   47| 1190.9116|
|  1| 19|  2|     UA|   820|SFO| 12.75|     123|  135|1092.74186|
|  8|  5|  5|     US|  2175|LGA|  13.0|      71|  -10| 344.39876|
|  5| 27|  5|     AA|  1240|ORD| 14.42|     195|  -11|1926.37998|
|  8| 20|  6|     B6|   119|JFK| 14.67|     198|   20|1902.23988|
|  2|  3|  1|     AA|  1881|JFK| 15.92|     200|   -9| 1754.1806|
|  8| 26| 

## Indexación



In [18]:
from pyspark.ml.feature import StringIndexer

In [19]:
# “carrier_idx” y “org_idx”
indexer = StringIndexer(inputCols=['carrier', 'org'],
                        outputCols=['carrier_idx', 'org_idx']).fit(dfsClean).transform(dfsClean)
dfsClean.show()

+---+---+---+-------+------+---+------+--------+-----+----------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|        km|
+---+---+---+-------+------+---+------+--------+-----+----------+
| 10| 10|  1|     OO|  5836|ORD|  8.18|      51|   27| 252.66638|
|  1|  4|  1|     OO|  5866|ORD|  15.5|     102|   -7| 749.95244|
| 11| 22|  1|     OO|  6016|ORD|  7.17|     127|  -19|1187.69292|
|  2| 14|  5|     B6|   199|JFK| 21.17|     365|   60|3617.79632|
|  5| 25|  3|     WN|  1675|SJC| 12.92|      85|   22| 621.20524|
|  3| 28|  1|     B6|   377|LGA| 13.33|     182|   70|1731.64984|
|  5| 28|  6|     B6|   904|ORD|  9.58|     130|   47| 1190.9116|
|  1| 19|  2|     UA|   820|SFO| 12.75|     123|  135|1092.74186|
|  8|  5|  5|     US|  2175|LGA|  13.0|      71|  -10| 344.39876|
|  5| 27|  5|     AA|  1240|ORD| 14.42|     195|  -11|1926.37998|
|  8| 20|  6|     B6|   119|JFK| 14.67|     198|   20|1902.23988|
|  2|  3|  1|     AA|  1881|JFK| 15.92|     200|   -9| 1754.1806|
|  8| 26| 

In [20]:
dfsClean = indexer
dfsClean.show()

+---+---+---+-------+------+---+------+--------+-----+----------+-----------+-------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|        km|carrier_idx|org_idx|
+---+---+---+-------+------+---+------+--------+-----+----------+-----------+-------+
| 10| 10|  1|     OO|  5836|ORD|  8.18|      51|   27| 252.66638|        2.0|    0.0|
|  1|  4|  1|     OO|  5866|ORD|  15.5|     102|   -7| 749.95244|        2.0|    0.0|
| 11| 22|  1|     OO|  6016|ORD|  7.17|     127|  -19|1187.69292|        2.0|    0.0|
|  2| 14|  5|     B6|   199|JFK| 21.17|     365|   60|3617.79632|        4.0|    2.0|
|  5| 25|  3|     WN|  1675|SJC| 12.92|      85|   22| 621.20524|        3.0|    5.0|
|  3| 28|  1|     B6|   377|LGA| 13.33|     182|   70|1731.64984|        4.0|    3.0|
|  5| 28|  6|     B6|   904|ORD|  9.58|     130|   47| 1190.9116|        4.0|    0.0|
|  1| 19|  2|     UA|   820|SFO| 12.75|     123|  135|1092.74186|        0.0|    1.0|
|  8|  5|  5|     US|  2175|LGA|  13.0|      71|  -10|

## Bucketing de la variable departure time

In [21]:
from pyspark.ml.feature import Bucketizer

In [22]:
buckets = Bucketizer(splits=[0,3,6,9,12,15,18,21,24],
                     inputCol="depart",
                     outputCol="depart_bucket")

dfsClean = buckets.transform(dfsClean)
dfsClean.select('depart', 'depart_bucket').show(5)

+------+-------------+
|depart|depart_bucket|
+------+-------------+
|  8.18|          2.0|
|  15.5|          5.0|
|  7.17|          2.0|
| 21.17|          7.0|
| 12.92|          4.0|
+------+-------------+
only showing top 5 rows



In [23]:
dfsClean.groupby('depart_bucket').count().show()

+-------------+-----+
|depart_bucket|count|
+-------------+-----+
|          0.0|  206|
|          7.0|19128|
|          1.0|  705|
|          4.0|51955|
|          3.0|50866|
|          2.0|47684|
|          6.0|51940|
|          5.0|52516|
+-------------+-----+



## One-Hot Encoding y Consolidacion

In [24]:
from pyspark.ml.feature import OneHotEncoder

In [25]:
onehotDep = OneHotEncoder(inputCols=['depart_bucket'], outputCols=['depart_dummy'])
onehotOrg = OneHotEncoder(inputCols=['org_idx'], outputCols=['org_dummy'])
onehotDow = OneHotEncoder(inputCols=['dow'], outputCols=['dow_dummy'])
onehotMon = OneHotEncoder(inputCols=['mon'], outputCols=['mon_dummy'])

In [26]:
onehotDep = onehotDep.fit(dfsClean)
onehotOrg = onehotOrg.fit(dfsClean)
onehotDow = onehotDow.fit(dfsClean)
onehotMon = onehotMon.fit(dfsClean)

print("Categorias de Departure: ", onehotDep.categorySizes)
print("Categorias de Org: ", onehotOrg.categorySizes)
print("Categorias de Dow: ", onehotDow.categorySizes)
print("Categorias de Mon: ", onehotMon.categorySizes)

Categorias de Departure:  [8]
Categorias de Org:  [8]
Categorias de Dow:  [7]
Categorias de Mon:  [12]


In [27]:
dfsClean = onehotDep.transform(dfsClean)
dfsClean.select("depart", "depart_bucket", "depart_dummy").distinct().sort("depart_bucket").show()

+------+-------------+-------------+
|depart|depart_bucket| depart_dummy|
+------+-------------+-------------+
|  1.08|          0.0|(7,[0],[1.0])|
|   1.0|          0.0|(7,[0],[1.0])|
|  1.85|          0.0|(7,[0],[1.0])|
|  0.83|          0.0|(7,[0],[1.0])|
|  0.75|          0.0|(7,[0],[1.0])|
|  0.25|          0.0|(7,[0],[1.0])|
|  0.67|          0.0|(7,[0],[1.0])|
|  0.12|          0.0|(7,[0],[1.0])|
|  0.42|          0.0|(7,[0],[1.0])|
|  1.42|          0.0|(7,[0],[1.0])|
|  5.28|          1.0|(7,[1],[1.0])|
|  5.58|          1.0|(7,[1],[1.0])|
|  5.67|          1.0|(7,[1],[1.0])|
|   5.0|          1.0|(7,[1],[1.0])|
|  5.92|          1.0|(7,[1],[1.0])|
|  4.42|          1.0|(7,[1],[1.0])|
|  5.83|          1.0|(7,[1],[1.0])|
|  5.75|          1.0|(7,[1],[1.0])|
|   5.5|          1.0|(7,[1],[1.0])|
|  5.17|          1.0|(7,[1],[1.0])|
+------+-------------+-------------+
only showing top 20 rows



In [28]:
dfsClean = onehotOrg.transform(dfsClean)
dfsClean.select("org", "org_idx", "org_dummy").distinct().sort("org_idx").show()

+---+-------+-------------+
|org|org_idx|    org_dummy|
+---+-------+-------------+
|ORD|    0.0|(7,[0],[1.0])|
|SFO|    1.0|(7,[1],[1.0])|
|JFK|    2.0|(7,[2],[1.0])|
|LGA|    3.0|(7,[3],[1.0])|
|SMF|    4.0|(7,[4],[1.0])|
|SJC|    5.0|(7,[5],[1.0])|
|TUS|    6.0|(7,[6],[1.0])|
|OGG|    7.0|    (7,[],[])|
+---+-------+-------------+



In [30]:
dfsClean = onehotDow.transform(dfsClean)
dfsClean.select("dow", "dow_dummy").distinct().sort("dow").show()

IllegalArgumentException: ignored

In [31]:
dfsClean.select("dow", "dow_dummy").distinct().sort("dow").show()

+---+-------------+
|dow|    dow_dummy|
+---+-------------+
|  0|(6,[0],[1.0])|
|  1|(6,[1],[1.0])|
|  2|(6,[2],[1.0])|
|  3|(6,[3],[1.0])|
|  4|(6,[4],[1.0])|
|  5|(6,[5],[1.0])|
|  6|    (6,[],[])|
+---+-------------+



In [32]:
dfsClean = onehotMon.transform(dfsClean)
dfsClean.select("mon", "mon_dummy").distinct().sort("mon").show()

+---+---------------+
|mon|      mon_dummy|
+---+---------------+
|  0| (11,[0],[1.0])|
|  1| (11,[1],[1.0])|
|  2| (11,[2],[1.0])|
|  3| (11,[3],[1.0])|
|  4| (11,[4],[1.0])|
|  5| (11,[5],[1.0])|
|  6| (11,[6],[1.0])|
|  7| (11,[7],[1.0])|
|  8| (11,[8],[1.0])|
|  9| (11,[9],[1.0])|
| 10|(11,[10],[1.0])|
| 11|     (11,[],[])|
+---+---------------+



## Consolidar columnas (features)

In [33]:
from pyspark.ml.feature import VectorAssembler

In [34]:
assembler = VectorAssembler(inputCols=['km', 'org_dummy', 'depart_dummy', 'dow_dummy', 'mon_dummy'],
                            outputCol='features')
dfsFlightClean = assembler.transform(dfsClean)

dfsFlightClean.show()

+---+---+---+-------+------+---+------+--------+-----+----------+-----------+-------+-------------+-------------+-------------+-------------+---------------+--------------------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|        km|carrier_idx|org_idx|depart_bucket| depart_dummy|    org_dummy|    dow_dummy|      mon_dummy|            features|
+---+---+---+-------+------+---+------+--------+-----+----------+-----------+-------+-------------+-------------+-------------+-------------+---------------+--------------------+
| 10| 10|  1|     OO|  5836|ORD|  8.18|      51|   27| 252.66638|        2.0|    0.0|          2.0|(7,[2],[1.0])|(7,[0],[1.0])|(6,[1],[1.0])|(11,[10],[1.0])|(32,[0,1,10,16,31...|
|  1|  4|  1|     OO|  5866|ORD|  15.5|     102|   -7| 749.95244|        2.0|    0.0|          5.0|(7,[5],[1.0])|(7,[0],[1.0])|(6,[1],[1.0])| (11,[1],[1.0])|(32,[0,1,13,16,22...|
| 11| 22|  1|     OO|  6016|ORD|  7.17|     127|  -19|1187.69292|        2.0|    0.0|          2.0|(7,[2]

## Entrenamiento y Prueba


In [35]:
flyTrain, flyTest = dfsFlightClean.randomSplit([0.8, 0.2], seed=23)

[flyTest.count(), flyTrain.count()]

[55438, 219562]

## Regresion Lineal


In [36]:
from pyspark.ml.regression import LinearRegression


In [37]:
regr = LinearRegression(labelCol="duration")
regr = regr.fit(flyTrain)

In [38]:
predictions = regr.transform(flyTest)

predictions['duration', 'prediction'].show()
#['label', 'prediction', 'probability']

+--------+------------------+
|duration|        prediction|
+--------+------------------+
|     385|  366.561868636613|
|     325|355.08390473828297|
|     135|150.90142205430467|
|     310| 317.6808116176087|
|     130| 134.8531856229321|
|     150|153.98936519130487|
|     135|130.71573679202692|
|     125|125.86826355723642|
|     285| 266.2809715601529|
|     205|210.50730082305492|
|     104| 120.5234827606003|
|      80| 80.36443654659824|
|     250|227.56226093179066|
|     255| 232.3922508206837|
|     280|264.24500126662053|
|      70| 73.82833153156453|
|     205| 194.7824853696349|
|     125| 129.0390553738238|
|      70| 77.96578036246969|
|      70| 78.12739547175045|
+--------+------------------+
only showing top 20 rows



## Metricas de Evaluacion



In [39]:
from pyspark.ml.evaluation import RegressionEvaluator

### RMSE

In [40]:
RegressionEvaluator(labelCol="duration").evaluate(predictions)

10.906066601140216

### R^2

In [41]:
RegressionEvaluator(labelCol="duration", metricName= "r2").evaluate(predictions)

0.9843545897654332

Coeficientes

In [42]:
regr.coefficients

DenseVector([0.0744, 27.1168, 20.0542, 52.0668, 46.0914, 15.1708, 17.3132, 17.5607, -15.0363, -0.2504, 3.9295, 6.8911, 4.622, 8.7595, 8.9211, -0.0565, -0.0965, -0.1466, -0.0775, -0.0253, -0.0367, -1.8358, -1.9439, -2.0399, -3.4252, -4.0351, -3.9022, -3.978, -3.9971, -3.8898, -2.5948, -0.4182])

## LASSO

In [43]:
lasso = LinearRegression(labelCol="duration", elasticNetParam= 1, regParam = 0.01)
lasso = lasso.fit(flyTrain)

In [44]:
predictionsLasso = lasso.transform(flyTest)

predictionsLasso['duration', 'prediction'].show()
#['label', 'prediction', 'probability']

+--------+------------------+
|duration|        prediction|
+--------+------------------+
|     385| 366.5511697973193|
|     325| 355.3035062141946|
|     135|  150.930811519023|
|     310|318.73897535119204|
|     130|134.88573294134096|
|     150|153.99532050154193|
|     135|130.74077325994116|
|     125|126.10128193938766|
|     285|266.29683951517836|
|     205|210.51585376203218|
|     104|120.56045851045715|
|      80| 80.38473444092459|
|     250| 227.5659939700428|
|     255|232.38116985078292|
|     280| 264.2612698448754|
|      70| 73.86456188382968|
|     205|  194.812422269834|
|     125| 129.0644217667505|
|      70| 78.00952156522946|
|      70| 78.17271361120774|
+--------+------------------+
only showing top 20 rows



RMSE

In [45]:
RegressionEvaluator(labelCol="duration").evaluate(predictionsLasso)

10.908131871740867

R^2

In [46]:
RegressionEvaluator(labelCol="duration", metricName= "r2").evaluate(predictionsLasso)

0.9843486636939319

Coeficientes

In [47]:
lasso.coefficients

DenseVector([0.0744, 26.0499, 18.9657, 50.9882, 44.9963, 14.0681, 16.2105, 16.4367, -14.8485, -0.246, 3.6892, 6.6455, 4.3594, 8.5044, 8.6676, 0.0, -0.0094, -0.0631, 0.0, 0.0, 0.0, -1.4444, -1.5514, -1.639, -3.0242, -3.6278, -3.5063, -3.5882, -3.602, -3.4874, -2.1959, -0.0205])

## RIDGE

In [48]:
ridge = LinearRegression(labelCol="duration", elasticNetParam = 0, regParam= 0.01)
ridge = ridge.fit(flyTrain)

In [49]:
predictionsRidge = regr.transform(flyTest)

predictionsRidge['duration', 'prediction'].show()
#['label', 'prediction', 'probability']

+--------+------------------+
|duration|        prediction|
+--------+------------------+
|     385|  366.561868636613|
|     325|355.08390473828297|
|     135|150.90142205430467|
|     310| 317.6808116176087|
|     130| 134.8531856229321|
|     150|153.98936519130487|
|     135|130.71573679202692|
|     125|125.86826355723642|
|     285| 266.2809715601529|
|     205|210.50730082305492|
|     104| 120.5234827606003|
|      80| 80.36443654659824|
|     250|227.56226093179066|
|     255| 232.3922508206837|
|     280|264.24500126662053|
|      70| 73.82833153156453|
|     205| 194.7824853696349|
|     125| 129.0390553738238|
|      70| 77.96578036246969|
|      70| 78.12739547175045|
+--------+------------------+
only showing top 20 rows



RMSE

In [50]:
RegressionEvaluator(labelCol="duration").evaluate(predictionsRidge)

10.906066601140216

R^2

In [51]:
RegressionEvaluator(labelCol="duration", metricName= "r2").evaluate(predictionsRidge)

0.9843545897654332

Coeficientes

In [52]:
ridge.coefficients

DenseVector([0.0744, 26.981, 19.9196, 51.9328, 45.9513, 15.0315, 17.1744, 17.4217, -15.0226, -0.251, 3.9285, 6.8891, 4.6165, 8.7546, 8.9169, -0.0557, -0.0955, -0.1457, -0.0766, -0.0245, -0.0363, -1.8322, -1.9404, -2.0349, -3.42, -4.0297, -3.8968, -3.9737, -3.9927, -3.8855, -2.5914, -0.4155])

## Análisis del modelo

Los tres modelos en si, hasta su cuarto lugar decimal no tienen diferencia en cuanto a R^2. RIDGE y el modelo estándar tienen prácticamente ninguna diferencia, con LASSO teniendo el menor R^2 con 0.984349 (los otros modelos tienen 0.984354). En cuanto al RMSE se observa resultados similares, con RIDGE y el modelo estandar con el mismo valor de 10.9061 y LASSO con un error mayor en 0.0019. El factor que diferencia a los tres modelos, es su complejidad, como RIDGE mantuvo todas las variables que estaban presentes en el modelo estándar, no cambio de la complejidad original, pero LASSO eliminó a 4 variables y rebajó el peso del resto de una manera mucho más notable que RIDGE.

Por lo cual, aunque el rendimiento de los tres modelos es bastante similar, se decide tomar el modelo ejecutado con LASSO por su menor complejidad.

In [53]:
spark.stop()