### Full Name:

Lettere Dragosavljevich Mathias Giuseppe

### Date:

03-10-2023

# **Data Preprocessing with PySpark**


## Google Colab Setup

If you are using Google Colab rather than a Spark cluster, run the following cells to install Java and Apache Spark.

In [1]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [2]:
# If the following link does not work, update it to point to the latest Apache Spark release
!wget -q https://dlcdn.apache.org/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
!tar xf spark-3.4.1-bin-hadoop3.tgz

In [3]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.4.1-bin-hadoop3"
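The two environment variables above only help if they point at directories that actually exist. A minimal sketch of a sanity check follows; `missing_paths` is a hypothetical helper name, not part of Spark or findspark, and it only tests that each variable is set and resolves to a directory.

```python
import os

def missing_paths(env_vars):
    """Return the names of variables that are unset or do not point to a directory."""
    missing = []
    for name in env_vars:
        path = os.environ.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing

# An empty list means both directories were found.
print(missing_paths(["JAVA_HOME", "SPARK_HOME"]))
```

If `SPARK_HOME` is reported as missing, the most likely cause is that the downloaded Spark version no longer matches the directory name set above.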

## Setup


In [4]:
# Installing required packages
!pip install pyspark
!pip install findspark

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425344 sha256=c53c2badbfc1993292c7be11bbffd2a99061000be2f919297af5a69b0b4948d5
  Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9ac9241e5e44a01940da8fbb17fc
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.0
Collecting findspark
  Downloading findspark-2.0.1-py2.py3-none-any.whl (4.4 kB)
Installing collected packages: findspark
Successfully installed findspark-2.0.1


In [5]:
import findspark
findspark.init()

In [6]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

#### Creating the spark session and context


In [7]:
# Creating a spark context class
sc = SparkContext()

# Creating a spark session
spark = SparkSession \
    .builder \
    .appName("Python Spark DataFrames basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

#### Initialize Spark session



In [8]:
spark

## Exercise 2 - Load the data and Spark dataframe


## Load the dataset into your Colab directory from your local system


In [9]:
from google.colab import files
files.upload()


In [10]:
dfsFlight = spark.read.csv("flights-larger.csv", header=True, inferSchema=True, nullValue="NA")
# printSchema() prints the schema itself and returns None, so it is not wrapped in print()
dfsFlight.printSchema()

root
 |-- mon: integer (nullable = true)
 |-- dom: integer (nullable = true)
 |-- dow: integer (nullable = true)
 |-- carrier: string (nullable = true)
 |-- flight: integer (nullable = true)
 |-- org: string (nullable = true)
 |-- mile: integer (nullable = true)
 |-- depart: double (nullable = true)
 |-- duration: integer (nullable = true)
 |-- delay: integer (nullable = true)


## Preprocessing




In [11]:
from pyspark.sql.functions import col, isnan, when, count, isnull, max, min, mode, lit

In [12]:
res = [count(when((col(c) == ' ')|( col(c).isNull()), c)).alias(c) for c in dfsFlight.columns]
dfsFlight.select(*res).show()
#def check_for_null_or_nan(df):
    #null_or_nan = lambda x: isnan(x) | isnull(x)
    #func = lambda x: df.filter(null_or_nan(x)).count()
   # print(*[f'{i} has {func(i)} nans/nulls' for i in df.columns if func(i)!=0],sep='\n')

+---+---+---+-------+------+---+----+------+--------+-----+
|mon|dom|dow|carrier|flight|org|mile|depart|duration|delay|
+---+---+---+-------+------+---+----+------+--------+-----+
|  0|  0|  0|      0|     0|  0|   0|     0|       0|16711|
+---+---+---+-------+------+---+----+------+--------+-----+
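The cell above counts, for each column, the rows whose value is null or a single blank space. The same logic can be sketched in plain Python, without Spark, on a small made-up sample; the rows and the `count_missing` helper below are illustrative assumptions, not data from `flights-larger.csv`.

```python
def count_missing(rows, columns):
    """Count values that are None or a single blank space, per column."""
    return {
        c: sum(1 for r in rows if r.get(c) is None or r.get(c) == " ")
        for c in columns
    }

# Hypothetical sample rows (not taken from the real dataset).
rows = [
    {"carrier": "OO", "delay": 27},
    {"carrier": "B6", "delay": None},
    {"carrier": " ", "delay": None},
]
print(count_missing(rows, ["carrier", "delay"]))
```

In the real data only `delay` has missing values (16711 rows), which is why the next step imputes that column alone.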



In [13]:
max_value = dfsFlight.select(max('delay')).collect()[0][0]
min_value = dfsFlight.select(min('delay')).collect()[0][0]

print("Maximum Value:", max_value)
print("Minimum Value:", min_value)

Maximum Value: 1370
Minimum Value: -80


In [14]:
# Replace nulls with the mode
dfsTemp = dfsFlight
modDel = dfsTemp.agg(mode('delay')).collect()[0][0]


dfsClean = dfsTemp.fillna({'delay': modDel})
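Mode imputation replaces every missing `delay` with the single most frequent observed value. A minimal plain-Python sketch of the idea is shown below; the sample values and the `fill_with_mode` name are hypothetical, and ties are resolved here by first occurrence, which may differ from how Spark resolves them.

```python
from collections import Counter

def fill_with_mode(values):
    """Replace None entries with the most frequent non-null value."""
    observed = [v for v in values if v is not None]
    mode_value = Counter(observed).most_common(1)[0][0]
    return [mode_value if v is None else v for v in values]

# Hypothetical delays; None marks a missing value.
delays = [0, -7, None, 0, 60, None]
print(fill_with_mode(delays))
```

Because every gap receives the same value, this approach increases the weight of that value in the distribution; the median is a common alternative for skewed columns such as delays.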

In [15]:
res = [count(when((col(c) == ' ')|( col(c).isNull()), c)).alias(c) for c in dfsClean.columns]
dfsClean.select(*res).show()

+---+---+---+-------+------+---+----+------+--------+-----+
|mon|dom|dow|carrier|flight|org|mile|depart|duration|delay|
+---+---+---+-------+------+---+----+------+--------+-----+
|  0|  0|  0|      0|     0|  0|   0|     0|       0|    0|
+---+---+---+-------+------+---+----+------+--------+-----+



### Miles to km


In [16]:
dfsClean = dfsClean.withColumn("km", col("mile") * lit(1.60934))

dfsClean = dfsClean.drop("mile")

dfsClean.show()

+---+---+---+-------+------+---+------+--------+-----+----------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|        km|
+---+---+---+-------+------+---+------+--------+-----+----------+
| 10| 10|  1|     OO|  5836|ORD|  8.18|      51|   27| 252.66638|
|  1|  4|  1|     OO|  5866|ORD|  15.5|     102|   -7| 749.95244|
| 11| 22|  1|     OO|  6016|ORD|  7.17|     127|  -19|1187.69292|
|  2| 14|  5|     B6|   199|JFK| 21.17|     365|   60|3617.79632|
|  5| 25|  3|     WN|  1675|SJC| 12.92|      85|   22| 621.20524|
|  3| 28|  1|     B6|   377|LGA| 13.33|     182|   70|1731.64984|
|  5| 28|  6|     B6|   904|ORD|  9.58|     130|   47| 1190.9116|
|  1| 19|  2|     UA|   820|SFO| 12.75|     123|  135|1092.74186|
|  8|  5|  5|     US|  2175|LGA|  13.0|      71|  -10| 344.39876|
|  5| 27|  5|     AA|  1240|ORD| 14.42|     195|  -11|1926.37998|
|  8| 20|  6|     B6|   119|JFK| 14.67|     198|   20|1902.23988|
|  2|  3|  1|     AA|  1881|JFK| 15.92|     200|   -9| 1754.1806|
|  8| 26| 
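As a quick arithmetic check of the conversion, the first row's 252.66638 km corresponds to 157 miles under the factor 1.60934 used above. The sketch below reproduces that and compares it with the exact definition of the international mile (1.609344 km); the function and constant names are illustrative.

```python
MILES_TO_KM_NOTEBOOK = 1.60934   # rounded factor used in the cell above
MILES_TO_KM_EXACT = 1.609344     # exact definition of the international mile

def miles_to_km(miles, factor=MILES_TO_KM_NOTEBOOK):
    """Convert a distance in miles to kilometres."""
    return miles * factor

print(miles_to_km(157))    # first row of the table above
print(miles_to_km(2248))   # the JFK row, the longest flight shown
# Difference introduced by the rounded factor on the longest flight shown:
print(miles_to_km(2248, MILES_TO_KM_EXACT) - miles_to_km(2248))
```

The rounding error stays below 10 metres even on the longest flight shown, so it should have no practical effect on the regression.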

## Indexing



In [17]:
from pyspark.ml.feature import StringIndexer

In [18]:
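# Index "carrier" and "org" into "carrier_idx" and "org_idx".
# Note: `indexer` holds the transformed DataFrame, not the fitted model;
# dfsClean itself is unchanged until it is reassigned in the next cell.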
indexer = StringIndexer(inputCols=['carrier', 'org'],
                        outputCols=['carrier_idx', 'org_idx']).fit(dfsClean).transform(dfsClean)
dfsClean.show()

+---+---+---+-------+------+---+------+--------+-----+----------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|        km|
+---+---+---+-------+------+---+------+--------+-----+----------+
| 10| 10|  1|     OO|  5836|ORD|  8.18|      51|   27| 252.66638|
|  1|  4|  1|     OO|  5866|ORD|  15.5|     102|   -7| 749.95244|
| 11| 22|  1|     OO|  6016|ORD|  7.17|     127|  -19|1187.69292|
|  2| 14|  5|     B6|   199|JFK| 21.17|     365|   60|3617.79632|
|  5| 25|  3|     WN|  1675|SJC| 12.92|      85|   22| 621.20524|
|  3| 28|  1|     B6|   377|LGA| 13.33|     182|   70|1731.64984|
|  5| 28|  6|     B6|   904|ORD|  9.58|     130|   47| 1190.9116|
|  1| 19|  2|     UA|   820|SFO| 12.75|     123|  135|1092.74186|
|  8|  5|  5|     US|  2175|LGA|  13.0|      71|  -10| 344.39876|
|  5| 27|  5|     AA|  1240|ORD| 14.42|     195|  -11|1926.37998|
|  8| 20|  6|     B6|   119|JFK| 14.67|     198|   20|1902.23988|
|  2|  3|  1|     AA|  1881|JFK| 15.92|     200|   -9| 1754.1806|
|  8| 26| 

In [19]:
dfsClean = indexer
dfsClean.show()

+---+---+---+-------+------+---+------+--------+-----+----------+-----------+-------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|        km|carrier_idx|org_idx|
+---+---+---+-------+------+---+------+--------+-----+----------+-----------+-------+
| 10| 10|  1|     OO|  5836|ORD|  8.18|      51|   27| 252.66638|        2.0|    0.0|
|  1|  4|  1|     OO|  5866|ORD|  15.5|     102|   -7| 749.95244|        2.0|    0.0|
| 11| 22|  1|     OO|  6016|ORD|  7.17|     127|  -19|1187.69292|        2.0|    0.0|
|  2| 14|  5|     B6|   199|JFK| 21.17|     365|   60|3617.79632|        4.0|    2.0|
|  5| 25|  3|     WN|  1675|SJC| 12.92|      85|   22| 621.20524|        3.0|    5.0|
|  3| 28|  1|     B6|   377|LGA| 13.33|     182|   70|1731.64984|        4.0|    3.0|
|  5| 28|  6|     B6|   904|ORD|  9.58|     130|   47| 1190.9116|        4.0|    0.0|
|  1| 19|  2|     UA|   820|SFO| 12.75|     123|  135|1092.74186|        0.0|    1.0|
|  8|  5|  5|     US|  2175|LGA|  13.0|      71|  -10|
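By default `StringIndexer` orders labels by descending frequency, so the most common carrier or origin receives index 0.0. The plain-Python sketch below imitates that ordering on a made-up list; `fit_string_index` is a hypothetical helper, and the alphabetical tie-break is an assumption based on the documented Spark 3 behaviour.

```python
from collections import Counter

def fit_string_index(values):
    """Map each label to a float index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending count, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical carrier column.
carriers = ["UA", "UA", "UA", "AA", "AA", "OO"]
print(fit_string_index(carriers))
```

This is why `UA` and `ORD` map to 0.0 in the output above: they are presumably the most frequent values in their columns.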

## One-Hot Encoding and Consolidation

In [20]:
from pyspark.ml.feature import OneHotEncoder

In [22]:
onehotCarrier = OneHotEncoder(inputCols=['carrier_idx'], outputCols=['carrier_dummy'])
onehotOrg = OneHotEncoder(inputCols=['org_idx'], outputCols=['org_dummy'])

In [23]:
onehotCarrier = onehotCarrier.fit(dfsClean)
onehotOrg = onehotOrg.fit(dfsClean)

print("Carrier categories: ", onehotCarrier.categorySizes)
print("Org categories: ", onehotOrg.categorySizes)

Carrier categories:  [9]
Org categories:  [8]


In [24]:
dfsClean = onehotCarrier.transform(dfsClean)
dfsClean.select("carrier", "carrier_idx", "carrier_dummy").distinct().sort("carrier_idx").show()

+-------+-----------+-------------+
|carrier|carrier_idx|carrier_dummy|
+-------+-----------+-------------+
|     UA|        0.0|(8,[0],[1.0])|
|     AA|        1.0|(8,[1],[1.0])|
|     OO|        2.0|(8,[2],[1.0])|
|     WN|        3.0|(8,[3],[1.0])|
|     B6|        4.0|(8,[4],[1.0])|
|     OH|        5.0|(8,[5],[1.0])|
|     US|        6.0|(8,[6],[1.0])|
|     HA|        7.0|(8,[7],[1.0])|
|     AQ|        8.0|    (8,[],[])|
+-------+-----------+-------------+



In [25]:
dfsClean = onehotOrg.transform(dfsClean)
dfsClean.select("org", "org_idx", "org_dummy").distinct().sort("org_idx").show()

+---+-------+-------------+
|org|org_idx|    org_dummy|
+---+-------+-------------+
|ORD|    0.0|(7,[0],[1.0])|
|SFO|    1.0|(7,[1],[1.0])|
|JFK|    2.0|(7,[2],[1.0])|
|LGA|    3.0|(7,[3],[1.0])|
|SMF|    4.0|(7,[4],[1.0])|
|SJC|    5.0|(7,[5],[1.0])|
|TUS|    6.0|(7,[6],[1.0])|
|OGG|    7.0|    (7,[],[])|
+---+-------+-------------+
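The encoder reports 9 carrier categories and 8 origin categories, yet the dummy vectors have length 8 and 7. That follows from `OneHotEncoder`'s default `dropLast=True`: the last category is encoded as an all-zero vector and acts as the reference level. A minimal sketch of that rule in plain Python, using a hypothetical `one_hot_drop_last` helper and dense lists instead of Spark's sparse vectors:

```python
def one_hot_drop_last(index, num_categories):
    """Encode an index as a vector of length num_categories - 1.

    The last category (index == num_categories - 1) becomes all zeros.
    """
    size = num_categories - 1
    vec = [0.0] * size
    if int(index) < size:
        vec[int(index)] = 1.0
    return vec

print(one_hot_drop_last(0.0, 8))  # ORD, org_idx 0.0
print(one_hot_drop_last(7.0, 8))  # OGG, the dropped (reference) category
```

Dropping one category avoids perfect collinearity between the dummy columns and the intercept in the linear model fitted later.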



## Assemble feature columns

In [26]:
from pyspark.ml.feature import VectorAssembler

In [27]:
assembler = VectorAssembler(inputCols=['km', 'org_dummy'],
                            outputCol='features')
dfsFlightClean = assembler.transform(dfsClean)

dfsFlightClean.show()

+---+---+---+-------+------+---+------+--------+-----+----------+-----------+-------+-------------+-------------+--------------------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|        km|carrier_idx|org_idx|carrier_dummy|    org_dummy|            features|
+---+---+---+-------+------+---+------+--------+-----+----------+-----------+-------+-------------+-------------+--------------------+
| 10| 10|  1|     OO|  5836|ORD|  8.18|      51|   27| 252.66638|        2.0|    0.0|(8,[2],[1.0])|(7,[0],[1.0])|(8,[0,1],[252.666...|
|  1|  4|  1|     OO|  5866|ORD|  15.5|     102|   -7| 749.95244|        2.0|    0.0|(8,[2],[1.0])|(7,[0],[1.0])|(8,[0,1],[749.952...|
| 11| 22|  1|     OO|  6016|ORD|  7.17|     127|  -19|1187.69292|        2.0|    0.0|(8,[2],[1.0])|(7,[0],[1.0])|(8,[0,1],[1187.69...|
|  2| 14|  5|     B6|   199|JFK| 21.17|     365|   60|3617.79632|        4.0|    2.0|(8,[4],[1.0])|(7,[2],[1.0])|(8,[0,3],[3617.79...|
|  5| 25|  3|     WN|  1675|SJC| 12.92|      85|   22| 
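The `features` column is printed in sparse notation `(size, [indices], [values])`. Slot 0 holds `km` and slots 1 to 7 hold the seven origin dummies, so a JFK flight (`org_idx` 2.0) shows non-zero entries at indices 0 and 3. The sketch below rebuilds that layout in plain Python; `assemble` and `to_sparse` are illustrative helpers, not Spark APIs.

```python
def assemble(km, org_dummy):
    """Concatenate km and the origin dummy vector, as VectorAssembler does."""
    return [km] + list(org_dummy)

def to_sparse(dense):
    """Return (size, indices, values) for the non-zero entries of a dense list."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return (len(dense), indices, values)

# JFK flight from the table above: org_idx 2.0 sets position 2 of the dummy vector.
jfk_dummy = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
print(to_sparse(assemble(3617.79632, jfk_dummy)))
```

Note that `carrier_dummy` is computed but not passed to the assembler, so the model below uses only distance and origin airport.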

## Training and Test Split


In [28]:
flyTrain, flyTest = dfsFlightClean.randomSplit([0.75, 0.25], seed=23)

[flyTest.count(), flyTrain.count()]

[69159, 205841]
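Note that the list above is ordered test first, then train. A short arithmetic check of the realised proportions, using only the two counts printed above:

```python
n_test, n_train = 69159, 205841

total = n_test + n_train
train_fraction = n_train / total
test_fraction = n_test / total

print(total, round(train_fraction, 4), round(test_fraction, 4))
```

The split is close to, but not exactly, 75/25 because `randomSplit` assigns rows by random sampling rather than by cutting at a fixed row count.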

## Linear Regression


In [29]:
from pyspark.ml.regression import LinearRegression


In [30]:
regr = LinearRegression(labelCol="duration")
regr = regr.fit(flyTrain)

In [37]:
predictions = regr.transform(flyTest)

# Show the true label next to the model's prediction
predictions.select('duration', 'prediction').show()

+--------+------------------+
|duration|        prediction|
+--------+------------------+
|     385|364.78158358648426|
|     325| 359.6585134575749|
|     135|147.86705291396194|
|     310| 313.4826267538208|
|      70| 75.02484126937298|
|     130| 131.8393741612938|
|     150|150.50286902742627|
|     135| 131.8393741612938|
|     130| 131.8393741612938|
|     125|131.20662964138904|
|     285|264.96479544278395|
|     260|252.88423160681762|
|     205|211.34268022392905|
|     251| 241.7605441142942|
|     104|117.36661946672027|
|      80| 76.54272534141387|
|     250|228.96672918974213|
|     255|228.96672918974213|
|     280| 262.9314332129678|
|      70| 75.02484126937298|
+--------+------------------+
only showing top 20 rows



## Evaluation Metrics



In [38]:
from pyspark.ml.evaluation import RegressionEvaluator

### RMSE

In [40]:
RegressionEvaluator(labelCol="duration").evaluate(predictions)

11.25370547371769

### MSE

In [45]:
RegressionEvaluator(labelCol="duration", metricName= "mse").evaluate(predictions)

126.64588688918346

### R^2

In [46]:
RegressionEvaluator(labelCol="duration", metricName= "r2").evaluate(predictions)

0.9833387075106257
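The three metrics above are related: RMSE is the square root of MSE, and R² compares the residual sum of squares with the variance of the label. A minimal plain-Python sketch of the formulas follows; the function names are illustrative, and the sample uses only the first five rows of the prediction table shown earlier (rounded), so its results will not match the full test-set figures.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(mse(y_true, y_pred))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# First five rows of the prediction table above (predictions rounded).
y_true = [385, 325, 135, 310, 70]
y_pred = [364.78, 359.66, 147.87, 313.48, 75.02]
print(mse(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred))

# Consistency check on the reported test-set metrics: sqrt(MSE) should equal RMSE.
print(math.sqrt(126.64588688918346))
```

The last line reproduces the reported RMSE of about 11.2537 from the reported MSE, confirming the two evaluator calls agree.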

### Intercept

In [41]:
regr.intercept

16.13330302894702

### Coefficients

In [61]:
import array

In [66]:
dfsFlightClean.select("org", "org_idx").distinct().sort("org_idx").createOrReplaceTempView("Coeficientes")
spark.sql(" select * from Coeficientes ").show()
regr.coefficients.values

+---+-------+
|org|org_idx|
+---+-------+
|ORD|    0.0|
|SFO|    1.0|
|JFK|    2.0|
|LGA|    3.0|
|SMF|    4.0|
|SJC|    5.0|
|TUS|    6.0|
|OGG|    7.0|
+---+-------+



array([ 0.07432211, 28.03227616, 20.10100634, 52.6146618 , 46.69577103,
       15.5442479 , 17.816735  , 17.83076821])
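The eight coefficients line up with the eight slots of the `features` vector: the first belongs to `km` and the remaining seven to the origins ORD through TUS, with OGG as the dropped reference level. The sketch below pairs them with names and rebuilds a prediction by hand. It assumes the slot order described above, and `predict_duration` is a hypothetical helper, not part of the fitted Spark model.

```python
intercept = 16.13330302894702
coefficients = [0.07432211, 28.03227616, 20.10100634, 52.6146618,
                46.69577103, 15.5442479, 17.816735, 17.83076821]

# Assumed slot order: km first, then org dummies in org_idx order (OGG dropped).
feature_names = ["km", "ORD", "SFO", "JFK", "LGA", "SMF", "SJC", "TUS"]
named = dict(zip(feature_names, coefficients))

def predict_duration(km, org):
    """Intercept + km effect + origin effect (0.0 for the reference level OGG)."""
    return intercept + named["km"] * km + named.get(org, 0.0)

for name, value in named.items():
    print(f"{name}: {value}")

# First row of the dataset shown earlier: ORD, 252.66638 km, actual duration 51.
print(predict_duration(252.66638, "ORD"))
```

Read this way, each extra kilometre adds about 0.074 to the predicted duration, and each origin coefficient is the offset relative to a flight departing from OGG.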

## Model Analysis

Unfortunately, the model is not adequate. With an accuracy of only 61.54%, it needs further refinement before it reaches a desirable predictive capability. In cases like this the usual target is at least 80%, so more experiments are needed, including different train/test splits. This conclusion should be weighed against the regression metrics computed above (RMSE ≈ 11.25, R² ≈ 0.983), since accuracy is a classification metric and is not computed in this notebook.

In [None]:
spark.stop()