### Full Name:

Lettere Dragosavljevich Mathias Giuseppe

### Date:

12-10-2023

# **Data Preprocessing with PySpark**


## Google Colab Setup

If you use Google Colab instead of a Spark cluster, run the following cells to install Java and download Apache Spark.

In [1]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [2]:
# If this link no longer works, update it (and the file name and SPARK_HOME below) to the latest Apache Spark release
!wget -q https://dlcdn.apache.org/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
!tar xf spark-3.4.1-bin-hadoop3.tgz

In [3]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.4.1-bin-hadoop3"

## Setup


In [4]:
# Installing required packages
!pip install pyspark
!pip install findspark

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425344 sha256=c56ac5f0a6c947be8c8888f87c26970bb12781628ec466418c037af1028d9617
  Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9ac9241e5e44a01940da8fbb17fc
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.0
Collecting findspark
  Downloading findspark-2.0.1-py2.py3-none-any.whl (4.4 kB)
Installing collected packages: findspark
Successfully installed findspark-2.0.1


In [5]:
import findspark
findspark.init()

In [6]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

#### Creating the spark session and context


In [7]:
# Creating a Spark context
sc = SparkContext()

# Creating a spark session
spark = SparkSession \
    .builder \
    .appName("Python Spark DataFrames basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

#### Inspect the Spark session



In [8]:
spark

## Exercise 2 - Load the data into a Spark DataFrame


## Load the dataset into your Colab directory from your local system


In [9]:
from google.colab import files
files.upload()


In [327]:
dfsFlight = spark.read.csv("flights-larger.csv", header=True, inferSchema=True, nullValue='NA')
# printSchema() prints the schema itself and returns None, so it is not wrapped in print()
dfsFlight.printSchema()

root
 |-- mon: integer (nullable = true)
 |-- dom: integer (nullable = true)
 |-- dow: integer (nullable = true)
 |-- carrier: string (nullable = true)
 |-- flight: integer (nullable = true)
 |-- org: string (nullable = true)
 |-- mile: integer (nullable = true)
 |-- depart: double (nullable = true)
 |-- duration: integer (nullable = true)
 |-- delay: integer (nullable = true)


## Preprocessing




In [328]:
# max and min are not imported here: they are unused and would shadow Python's built-ins
from pyspark.sql.functions import col, count, isnan, isnull, lit, mode, when

In [329]:
# Count the blank (' ') or null values in each column
res = [count(when((col(c) == ' ') | (col(c).isNull()), c)).alias(c) for c in dfsFlight.columns]
dfsFlight.select(*res).show()
#def check_for_null_or_nan(df):
    #null_or_nan = lambda x: isnan(x) | isnull(x)
    #func = lambda x: df.filter(null_or_nan(x)).count()
   # print(*[f'{i} has {func(i)} nans/nulls' for i in df.columns if func(i)!=0],sep='\n')

+---+---+---+-------+------+---+----+------+--------+-----+
|mon|dom|dow|carrier|flight|org|mile|depart|duration|delay|
+---+---+---+-------+------+---+----+------+--------+-----+
|  0|  0|  0|      0|     0|  0|   0|     0|       0|16711|
+---+---+---+-------+------+---+----+------+--------+-----+



In [330]:
# Replace nulls with the mode
dfsTemp = dfsFlight
modDel = dfsTemp.agg(mode('delay')).collect()[0][0]


dfsClean = dfsTemp.fillna({'delay': modDel})
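
To illustrate what the cell above does, here is a minimal plain-Python sketch of mode imputation. The sample list is hypothetical and this is not Spark's implementation; it only mirrors the idea of computing `mode('delay')` and passing the result to `fillna`.

```python
from collections import Counter

# Hypothetical sample of delays; None stands in for a null value.
delays = [27, -7, None, -7, 60, None]

# Compute the mode over the non-missing values only.
observed = [d for d in delays if d is not None]
mode_delay = Counter(observed).most_common(1)[0][0]

# Replace each missing value with the mode.
filled = [mode_delay if d is None else d for d in delays]
print(mode_delay, filled)
```

Filling with the mode keeps the column an integer, but it concentrates all imputed rows on a single value, which may matter if `delay` is later used as a feature.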

In [331]:
res = [count(when((col(c) == ' ')|( col(c).isNull()), c)).alias(c) for c in dfsClean.columns]
dfsClean.select(*res).show()

+---+---+---+-------+------+---+----+------+--------+-----+
|mon|dom|dow|carrier|flight|org|mile|depart|duration|delay|
+---+---+---+-------+------+---+----+------+--------+-----+
|  0|  0|  0|      0|     0|  0|   0|     0|       0|    0|
+---+---+---+-------+------+---+----+------+--------+-----+



### Miles to Kilometers


In [332]:
dfsClean = dfsClean.withColumn("km", col("mile") * lit(1.60934))

dfsClean = dfsClean.drop("mile")

dfsClean.show()

+---+---+---+-------+------+---+------+--------+-----+----------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|        km|
+---+---+---+-------+------+---+------+--------+-----+----------+
| 10| 10|  1|     OO|  5836|ORD|  8.18|      51|   27| 252.66638|
|  1|  4|  1|     OO|  5866|ORD|  15.5|     102|   -7| 749.95244|
| 11| 22|  1|     OO|  6016|ORD|  7.17|     127|  -19|1187.69292|
|  2| 14|  5|     B6|   199|JFK| 21.17|     365|   60|3617.79632|
|  5| 25|  3|     WN|  1675|SJC| 12.92|      85|   22| 621.20524|
|  3| 28|  1|     B6|   377|LGA| 13.33|     182|   70|1731.64984|
|  5| 28|  6|     B6|   904|ORD|  9.58|     130|   47| 1190.9116|
|  1| 19|  2|     UA|   820|SFO| 12.75|     123|  135|1092.74186|
|  8|  5|  5|     US|  2175|LGA|  13.0|      71|  -10| 344.39876|
|  5| 27|  5|     AA|  1240|ORD| 14.42|     195|  -11|1926.37998|
|  8| 20|  6|     B6|   119|JFK| 14.67|     198|   20|1902.23988|
|  2|  3|  1|     AA|  1881|JFK| 15.92|     200|   -9| 1754.1806|
|  8| 26| 

## Indexing



In [333]:
from pyspark.ml.feature import StringIndexer

In [334]:
# Index "carrier" and "org" into "carrier_idx" and "org_idx"
indexer = StringIndexer(inputCols=['carrier', 'org'],
                        outputCols=['carrier_idx', 'org_idx'])

## Bucketing the Departure Time Variable

In [335]:
from pyspark.ml.feature import Bucketizer

In [336]:
buckets = Bucketizer(splits=[0,3,6,9,12,15,18,21,24],
                     inputCol="depart",
                     outputCol="depart_bucket")
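
The nine split points above define eight three-hour buckets. As a rough sketch of how a value is mapped to a bucket index, the plain-Python function below (the name `bucket_index` is hypothetical, and this is not Spark's code) places a value in the half-open interval `[splits[i], splits[i+1])`, with the last bucket also including its upper bound, which is how `Bucketizer` is documented to behave.

```python
from bisect import bisect_right

splits = [0, 3, 6, 9, 12, 15, 18, 21, 24]

def bucket_index(value, splits):
    """Return i such that splits[i] <= value < splits[i + 1].

    The last bucket also accepts value == splits[-1].
    """
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} is outside the range covered by the splits")
    if value == splits[-1]:
        return len(splits) - 2
    return bisect_right(splits, value) - 1

# Departure times taken from the first rows of the table shown earlier.
for depart in (8.18, 15.5, 21.17):
    print(depart, bucket_index(depart, splits))
```

A departure at 8.18 therefore falls in bucket 2 (06:00 to 09:00), and one at 21.17 in bucket 7 (21:00 to 24:00).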

## One-Hot Encoding and Consolidation

In [227]:
from pyspark.ml.feature import OneHotEncoder

In [228]:
onehot = OneHotEncoder(inputCols=['depart_bucket', 'org_idx', 'dow', 'mon'],
                          outputCols=['depart_dummy', 'org_dummy', 'dow_dummy', 'mon_dummy'])
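
By default, Spark's `OneHotEncoder` drops the last category (`dropLast=True`), so a column with n categories becomes a vector of length n - 1 and the last category is encoded as all zeros. The sketch below mimics that behaviour with dense Python lists; the function name `one_hot_drop_last` is hypothetical, and Spark itself returns sparse vectors.

```python
def one_hot_drop_last(index, num_categories):
    """Encode a category index as a list of length num_categories - 1.

    The last category (num_categories - 1) maps to the all-zeros vector.
    """
    if not 0 <= index < num_categories:
        raise ValueError("index must be in [0, num_categories)")
    vector = [0.0] * (num_categories - 1)
    if index < num_categories - 1:
        vector[index] = 1.0
    return vector

# Eight departure buckets give a vector of length seven.
print(one_hot_drop_last(2, 8))
print(one_hot_drop_last(7, 8))
```

Dropping one category avoids perfectly collinear dummy columns, which is useful for a linear model with an intercept.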

## Consolidate Columns (Features)

In [337]:
from pyspark.ml.feature import VectorAssembler

In [338]:
assembler = VectorAssembler(inputCols=['km', 'org_dummy', 'depart_dummy', 'dow_dummy', 'mon_dummy'],
                            outputCol='features')

## Training and Testing


In [339]:
flyTrain, flyTest = dfsClean.randomSplit([0.8, 0.2], seed=23)

[flyTest.count(), flyTrain.count()]

[55438, 219562]
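
Note that the cell lists the test count first and the training count second. `randomSplit` assigns rows at random, so the weights are targets and not exact proportions. A quick check of the counts reported above:

```python
# Counts copied from the output above (test first, then train).
test_rows, train_rows = 55438, 219562

total_rows = test_rows + train_rows
train_share = train_rows / total_rows
test_share = test_rows / total_rows

print(total_rows, round(train_share, 4), round(test_share, 4))
```

The shares come out close to, but not exactly, the requested 0.8 and 0.2.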

## Linear Regression


In [340]:
from pyspark.ml.regression import LinearRegression


In [341]:
regr = LinearRegression(labelCol="duration")

## Evaluation Metrics



In [342]:
from pyspark.ml.evaluation import RegressionEvaluator

## Pipeline

In [343]:
from pyspark.ml import Pipeline

In [344]:
pipeline = Pipeline(stages=[indexer, buckets, onehot, assembler, regr])


In [345]:
# fit() returns a fitted PipelineModel; the name `pipeline` is rebound to that model
pipeline = pipeline.fit(flyTrain)


In [346]:
# Note: these predictions are made on the training set, not on flyTest
predictions = pipeline.transform(flyTrain)
predictions['duration', 'prediction'].show()

+--------+------------------+
|duration|        prediction|
+--------+------------------+
|     420| 464.5520508196609|
|     300| 308.9213571745261|
|     315|340.95187564199966|
|     320| 344.8813401961892|
|     386| 381.7239438337204|
|     375|368.43025544653875|
|     379|368.43025544653875|
|     325| 359.7059103504604|
|     310| 340.8321126835566|
|     180| 188.5548115305948|
|     170|151.95936327209827|
|     165|151.95936327209827|
|     210|215.33729071194796|
|     165|149.69030125111894|
|      70| 77.96578036246969|
|     115|126.09373117984948|
|     160| 153.8277500820241|
|     164| 153.8277500820241|
|     130|135.01480073221285|
|     130| 134.8531856229321|
+--------+------------------+
only showing top 20 rows
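
The `RegressionEvaluator` imported earlier is not applied in this notebook. As a minimal sketch of what its `rmse` and `mae` metrics measure, the plain-Python code below computes both by hand on the first three rows of the table above. Three training rows say very little about model quality; in Spark one would more likely call something like `RegressionEvaluator(labelCol="duration", metricName="rmse").evaluate(...)` on predictions for the held-out `flyTest` set.

```python
import math

# (duration, prediction) pairs copied from the first three rows above.
rows = [
    (420, 464.5520508196609),
    (300, 308.9213571745261),
    (315, 340.95187564199966),
]

errors = [prediction - duration for duration, prediction in rows]

# Root mean squared error: penalises large errors more heavily.
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

# Mean absolute error: the average size of the error.
mae = sum(abs(e) for e in errors) / len(errors)

print(f"RMSE = {rmse:.2f}, MAE = {mae:.2f}")
```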



In [348]:
# Model intercept (stage 4 of the fitted pipeline is the linear regression)
print(pipeline.stages[4].intercept)

13.17306203333961


In [349]:
# Coefficients
print(pipeline.stages[4].coefficients)

[0.07441743723705291,27.116840156040325,20.054222623165668,52.06683637180387,46.0914046151323,15.170785983543942,17.313151188150773,17.560729048242038,-15.036329315943679,-0.2503931902610769,3.929464554189561,6.891067633156792,4.622005612177438,8.759454443082602,8.921069552363369,-0.056486138422262515,-0.09647590994124813,-0.14662214101358348,-0.07745170257816011,-0.025257824620713836,-0.03673449706040399,-1.8357974072935823,-1.9439171577804848,-2.039857798022571,-3.425165112012754,-4.035108387425875,-3.902183706299486,-3.978029024587447,-3.9971450339263592,-3.8898461415022187,-2.5948469111861017,-0.4182014385818216]
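
Because `km` is the first input of the `VectorAssembler`, the first coefficient applies to distance. The sketch below shows how the intercept and that coefficient combine for a flight whose one-hot entries are all zero (the dropped reference categories). It assumes `duration` is measured in minutes, which the notebook does not state, and the function name `baseline_duration` is hypothetical.

```python
# Values copied from the two outputs above.
intercept = 13.17306203333961
km_coefficient = 0.07441743723705291  # assumed unit: minutes per km

def baseline_duration(km):
    """Predicted duration when every one-hot feature is zero."""
    return intercept + km_coefficient * km

# Average speed implied by the distance coefficient, under the minutes assumption.
implied_speed_kmh = 60 / km_coefficient

print(round(baseline_duration(1000), 2), round(implied_speed_kmh, 1))
```

Under that assumption the distance coefficient corresponds to a cruising speed of roughly 800 km/h, and the remaining coefficients shift the prediction by origin airport, departure bucket, day of week, and month.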


In [None]:
spark.stop()