### Full Name:

Lettere Dragosavljevich Mathias Giuseppe

### Date:

28-09-2023

# **Data Preprocessing with PySpark**


## Google Colab Setup

If you are going to use Google Colab instead of a Spark Cluster, you will need to run the following code to install Apache Spark.

In [1]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [2]:
# If the following link doesn't work, update it to the latest version of Apache Spark
!wget -q https://dlcdn.apache.org/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
!tar xf spark-3.4.1-bin-hadoop3.tgz

In [3]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.4.1-bin-hadoop3"

## Setup


In [4]:
# Installing required packages
!pip install pyspark
!pip install findspark

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425344 sha256=9b5704bc483e511deb629234fc2432bb9e917c0978f9c02f0a9bdcd291d90329
  Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9ac9241e5e44a01940da8fbb17fc
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.0
Collecting findspark
  Downloading findspark-2.0.1-py2.py3-none-any.whl (4.4 kB)
Installing collected packages: findspark
Successfully installed findspark-2.0.1


In [5]:
import findspark
findspark.init()

In [6]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

#### Creating the spark session and context


In [7]:
# Creating a spark context class
sc = SparkContext()

# Creating a spark session
spark = SparkSession \
    .builder \
    .appName("Python Spark DataFrames basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

#### Initialize Spark session



In [8]:
spark

## Exercise 2 - Load the data and Spark dataframe


## Load the dataset into your Colab directory from your local system


In [9]:
from google.colab import files
files.upload()

Output hidden; open in https://colab.research.google.com to view.

In [10]:
dfsCredit = spark.read.csv("train.csv", header=True, inferSchema=True, nullValue='NA')
# printSchema() prints the schema itself and returns None, so no print() wrapper is needed
dfsCredit.printSchema()

root
 |-- ID: string (nullable = true)
 |-- Customer_ID: string (nullable = true)
 |-- Month: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- SSN: string (nullable = true)
 |-- Occupation: string (nullable = true)
 |-- Annual_Income: string (nullable = true)
 |-- Monthly_Inhand_Salary: double (nullable = true)
 |-- Num_Bank_Accounts: integer (nullable = true)
 |-- Num_Credit_Card: integer (nullable = true)
 |-- Interest_Rate: integer (nullable = true)
 |-- Num_of_Loan: string (nullable = true)
 |-- Type_of_Loan: string (nullable = true)
 |-- Delay_from_due_date: integer (nullable = true)
 |-- Num_of_Delayed_Payment: string (nullable = true)
 |-- Changed_Credit_Limit: string (nullable = true)
 |-- Num_Credit_Inquiries: double (nullable = true)
 |-- Credit_Mix: string (nullable = true)
 |-- Outstanding_Debt: string (nullable = true)
 |-- Credit_Utilization_Ratio: double (nullable = true)
 |-- Credit_History_Age: string (nullable = true

## Preprocessing




In [11]:
from pyspark.sql.functions import col, isnan, when, count, isnull, max, min, mode, lit, median, mean

In [12]:
res = [count(when((col(c) == ' ')|( col(c).isNull()), c)).alias(c) for c in dfsCredit.columns]
dfsCredit.select(*res).show()
#def check_for_null_or_nan(df):
    #null_or_nan = lambda x: isnan(x) | isnull(x)
    #func = lambda x: df.filter(null_or_nan(x)).count()
   # print(*[f'{i} has {func(i)} nans/nulls' for i in df.columns if func(i)!=0],sep='\n')

+---+-----------+-----+----+---+---+----------+-------------+---------------------+-----------------+---------------+-------------+-----------+------------+-------------------+----------------------+--------------------+--------------------+----------+----------------+------------------------+------------------+---------------------+-------------------+-----------------------+-----------------+---------------+------------+
| ID|Customer_ID|Month|Name|Age|SSN|Occupation|Annual_Income|Monthly_Inhand_Salary|Num_Bank_Accounts|Num_Credit_Card|Interest_Rate|Num_of_Loan|Type_of_Loan|Delay_from_due_date|Num_of_Delayed_Payment|Changed_Credit_Limit|Num_Credit_Inquiries|Credit_Mix|Outstanding_Debt|Credit_Utilization_Ratio|Credit_History_Age|Payment_of_Min_Amount|Total_EMI_per_month|Amount_invested_monthly|Payment_Behaviour|Monthly_Balance|Credit_Score|
+---+-----------+-----+----+---+---+----------+-------------+---------------------+-----------------+---------------+-------------+-----------+---

In [13]:
dfsCreditReduce = dfsCredit.drop("Customer_ID","Month","Name","SSN","Credit_History_Age","Amount_invested_monthly")

In [14]:
res = [count(when((col(c) == ' ')|( col(c).isNull()), c)).alias(c) for c in dfsCreditReduce.columns]
dfsCreditReduce.select(*res).show()

+---+---+----------+-------------+---------------------+-----------------+---------------+-------------+-----------+------------+-------------------+----------------------+--------------------+--------------------+----------+----------------+------------------------+---------------------+-------------------+-----------------+---------------+------------+
| ID|Age|Occupation|Annual_Income|Monthly_Inhand_Salary|Num_Bank_Accounts|Num_Credit_Card|Interest_Rate|Num_of_Loan|Type_of_Loan|Delay_from_due_date|Num_of_Delayed_Payment|Changed_Credit_Limit|Num_Credit_Inquiries|Credit_Mix|Outstanding_Debt|Credit_Utilization_Ratio|Payment_of_Min_Amount|Total_EMI_per_month|Payment_Behaviour|Monthly_Balance|Credit_Score|
+---+---+----------+-------------+---------------------+-----------------+---------------+-------------+-----------+------------+-------------------+----------------------+--------------------+--------------------+----------+----------------+------------------------+-------------------

In [15]:
dfsCreditReduce.createOrReplaceTempView("CREDIT")

### Analysis of *Monthly_Inhand_Salary*

In [16]:
max_value = dfsCredit.select(max('Monthly_Inhand_Salary')).collect()[0][0]
min_value = dfsCredit.select(min('Monthly_Inhand_Salary')).collect()[0][0]

print("Maximum Value:", max_value)
print("Minimum Value:", min_value)

Maximum Value: 15204.633333333333
Minimum Value: 303.6454166666666


In [17]:
spark.sql( """
    SELECT
        skewness(`Monthly_Inhand_Salary`) AS Skewness_valor
    FROM
          CREDIT
""").show()

+------------------+
|    Skewness_valor|
+------------------+
|1.1272523762387807|
+------------------+



In [18]:
spark.sql( """
    SELECT
        kurtosis(`Monthly_Inhand_Salary`) AS Kurt_valor
    FROM
          CREDIT
""").show()

+------------------+
|        Kurt_valor|
+------------------+
|0.6129918350074992|
+------------------+



We decided to replace missing values with the mode.
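The skewness and kurtosis values above describe the shape of the distribution and inform the choice of replacement value. The sketch below shows one common way to compute them in plain Python, assuming population central moments and excess kurtosis (a normal distribution would score 0), which is my understanding of what Spark SQL's `skewness` and `kurtosis` report. The function name and the toy data are hypothetical.

```python
def population_moments(values):
    """Return (skewness, excess_kurtosis) from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis

# Hypothetical toy data: one symmetric sample, one with a long right tail.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_tailed = [1.0, 1.0, 2.0, 2.0, 10.0]

sym_skew, sym_kurt = population_moments(symmetric)
tail_skew, tail_kurt = population_moments(right_tailed)
print("symmetric:", sym_skew, sym_kurt)
print("right-tailed:", tail_skew, tail_kurt)
```

A positive skewness, like the value reported for *Monthly_Inhand_Salary*, indicates a longer right tail, which pulls the mean above the bulk of the data.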

### Analysis of *Type_of_Loan*

In [19]:
median_value = dfsCredit.select(median('Type_of_Loan')).collect()[0][0]
mode_value = dfsCredit.select(mode('Type_of_Loan')).collect()[0][0]

print("Mediana:", median_value)
print("Moda:", mode_value)

Mediana: None
Moda: Not Specified


We decided to replace missing values with the mode, since the median is undefined (`None`) for this string column.

### Analysis of *Num_of_Delayed_Payment*

In [20]:
max_value = dfsCredit.select(max('Num_of_Delayed_Payment')).collect()[0][0]
min_value = dfsCredit.select(min('Num_of_Delayed_Payment')).collect()[0][0]

print("Maximum Value:", max_value)
print("Minimum Value:", min_value)

Maximum Value: 9_
Minimum Value: -1


Replace outliers and nulls with 0.
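The maximum `9_` and minimum `-1` printed above are string results: the schema lists `Num_of_Delayed_Payment` as a string column, so values are compared character by character rather than numerically. The following plain-Python sketch illustrates the same effect on hypothetical values; the helper `to_int` and the sample list are assumptions made for illustration, not part of the notebook.

```python
# Hypothetical raw values, stored as strings like the column in the schema.
raw = ["7", "28", "9_", "-1", "10", "3_"]

# String comparison looks at the first differing character,
# so "9_" ranks above "28" and "-1" ranks below every digit.
string_max = max(raw)
string_min = min(raw)

def to_int(text):
    """Drop trailing underscores and convert to an integer."""
    return int(text.strip("_"))

# After cleaning and converting, the extremes are numeric.
numeric = [to_int(v) for v in raw]
numeric_max = max(numeric)
numeric_min = min(numeric)

print(string_max, string_min)
print(numeric_max, numeric_min)
```

Casting the column to a numeric type before calling `max` and `min` would give extremes that are meaningful for outlier detection.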

### Analysis of *Num_Credit_Inquiries*

In [21]:
max_value = dfsCredit.select(max('Num_Credit_Inquiries')).collect()[0][0]
min_value = dfsCredit.select(min('Num_Credit_Inquiries')).collect()[0][0]

print("Maximum Value:", max_value)
print("Minimum Value:", min_value)

Maximum Value: 2597.0
Minimum Value: 0.0


In [22]:
spark.sql( """
    SELECT
        skewness(`Num_Credit_Inquiries`) AS Skewness_valor
    FROM
          CREDIT
""").show()

+-----------------+
|   Skewness_valor|
+-----------------+
|9.786096010010883|
+-----------------+



In [23]:
spark.sql( """
    SELECT
        kurtosis(`Num_Credit_Inquiries`) AS Kurt_valor
    FROM
          CREDIT
""").show()

+------------------+
|        Kurt_valor|
+------------------+
|100.59205666470639|
+------------------+



We decided to replace missing values with 0.

### Analysis of *Credit_History_Age*

In [24]:
median_value = dfsCredit.select(median('Credit_History_Age')).collect()[0][0]
mode_value = dfsCredit.select(mode('Credit_History_Age')).collect()[0][0]

print("Mediana:", median_value)
print("Moda:", mode_value)

Mediana: None
Moda: 15 Years and 11 Months


We decided to drop the column.

### Analysis of *Amount_invested_monthly*

In [25]:
max_value = dfsCredit.select(max('Amount_invested_monthly')).collect()[0][0]
min_value = dfsCredit.select(min('Amount_invested_monthly')).collect()[0][0]
# Note: this median is computed on Credit_History_Age, not Amount_invested_monthly
median_value = dfsCredit.select(median('Credit_History_Age')).collect()[0][0]

print("Maximum Value:", max_value)
print("Minimum Value:", min_value)
print("Mediana:", median_value)

Maximum Value: __10000__
Minimum Value: 0.0
Mediana: None


We decided to drop the column.

### Analysis of *Monthly_Balance*

In [26]:
max_value = dfsCredit.select(max('Monthly_Balance')).collect()[0][0]
min_value = dfsCredit.select(min('Monthly_Balance')).collect()[0][0]
median_value = dfsCredit.select(median('Monthly_Balance')).collect()[0][0]

print("Maximum Value:", max_value)
print("Minimum Value:", min_value)
print("Mediana:", median_value)

Maximum Value: __-333333333333333333333333333__
Minimum Value: 0.007759664775335295
Mediana: 336.73122455696387


In [27]:
spark.sql( """
    SELECT
        skewness(`Monthly_Balance`) AS Skewness_valor
    FROM
          CREDIT
""").show()

+------------------+
|    Skewness_valor|
+------------------+
|1.5965121144020633|
+------------------+



In [28]:
spark.sql( """
    SELECT
        kurtosis(`Monthly_Balance`) AS Kurt_valor
    FROM
          CREDIT
""").show()

+------------------+
|        Kurt_valor|
+------------------+
|2.9550127042684506|
+------------------+



Replace nulls and outliers with the mean.

## Replacements and Cleaning

In [29]:
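dfsTemp = dfsCreditReduce
# Replacement values: mode for salary and loan type, mean for monthly balance
modDel = dfsTemp.agg(mode('Monthly_Inhand_Salary')).collect()[0][0]
medTL = dfsTemp.agg(mode('Type_of_Loan')).collect()[0][0]  # holds the mode, despite the variable name
medB = dfsTemp.agg(mean('Monthly_Balance')).collect()[0][0]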


dfsClean = dfsTemp.fillna({'Monthly_Inhand_Salary': modDel, "Type_of_Loan": medTL, "Num_Credit_Inquiries": 0, 'Monthly_Balance': medB, 'Num_of_Delayed_Payment': 0})

In [30]:
res = [count(when((col(c) == ' ')|( col(c).isNull()), c)).alias(c) for c in dfsClean.columns]
dfsClean.select(*res).show()

+---+---+----------+-------------+---------------------+-----------------+---------------+-------------+-----------+------------+-------------------+----------------------+--------------------+--------------------+----------+----------------+------------------------+---------------------+-------------------+-----------------+---------------+------------+
| ID|Age|Occupation|Annual_Income|Monthly_Inhand_Salary|Num_Bank_Accounts|Num_Credit_Card|Interest_Rate|Num_of_Loan|Type_of_Loan|Delay_from_due_date|Num_of_Delayed_Payment|Changed_Credit_Limit|Num_Credit_Inquiries|Credit_Mix|Outstanding_Debt|Credit_Utilization_Ratio|Payment_of_Min_Amount|Total_EMI_per_month|Payment_Behaviour|Monthly_Balance|Credit_Score|
+---+---+----------+-------------+---------------------+-----------------+---------------+-------------+-----------+------------+-------------------+----------------------+--------------------+--------------------+----------+----------------+------------------------+-------------------

## Additional Cleaning

In [31]:
import pyspark.sql.functions as F

In [32]:
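# Keep only digits in each column. Note: the pattern "[^0-9]" also removes
# decimal points and minus signs, not just stray characters such as "_".
dfsClean = dfsClean.withColumn("Age", F.regexp_replace("Age", "[^0-9]", ""))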
dfsClean = dfsClean.withColumn("Annual_Income", F.regexp_replace("Annual_Income", "[^0-9]", ""))
dfsClean = dfsClean.withColumn("Delay_from_due_date", F.regexp_replace("Delay_from_due_date", "[^0-9]", ""))
dfsClean = dfsClean.withColumn("Changed_Credit_Limit", F.regexp_replace("Changed_Credit_Limit", "[^0-9]", ""))
dfsClean = dfsClean.withColumn("Monthly_Balance", F.regexp_replace("Monthly_Balance", "[^0-9]", ""))
dfsClean = dfsClean.withColumn("Num_of_Delayed_Payment", F.regexp_replace("Num_of_Delayed_Payment", "[^0-9]", ""))

ageRep= dfsCredit.select(median('Age')).collect()[0][0]
cardRep= dfsCredit.select(median('Num_Credit_Card')).collect()[0][0]
balanceRep = dfsCredit.select(mean('Monthly_Balance')).collect()[0][0]

dfsClean = dfsClean.withColumn("Age", when(col("Age") > 120, ageRep).otherwise(col("Age")))
dfsClean = dfsClean.withColumn("Num_Credit_Card", when(col("Num_Credit_Card") > 12, cardRep).otherwise(col("Num_Credit_Card")))

dfsClean.show()

+------+----+----------+-------------+---------------------+-----------------+---------------+-------------+-----------+--------------------+-------------------+----------------------+--------------------+--------------------+----------+----------------+------------------------+---------------------+-------------------+--------------------+-----------------+------------+
|    ID| Age|Occupation|Annual_Income|Monthly_Inhand_Salary|Num_Bank_Accounts|Num_Credit_Card|Interest_Rate|Num_of_Loan|        Type_of_Loan|Delay_from_due_date|Num_of_Delayed_Payment|Changed_Credit_Limit|Num_Credit_Inquiries|Credit_Mix|Outstanding_Debt|Credit_Utilization_Ratio|Payment_of_Min_Amount|Total_EMI_per_month|   Payment_Behaviour|  Monthly_Balance|Credit_Score|
+------+----+----------+-------------+---------------------+-----------------+---------------+-------------+-----------+--------------------+-------------------+----------------------+--------------------+--------------------+----------+---------------
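The cleaning step above relies on the pattern `[^0-9]`, which removes every character that is not a digit. The sketch below uses Python's `re` module as a stand-in for Spark's `regexp_replace` (the two behave the same for simple character classes like these) to show the effect on hypothetical values. The second pattern, `[^0-9.\-]`, is an assumption about one possible alternative that keeps the decimal point and the sign; it is not what the notebook runs.

```python
import re

# Hypothetical raw values resembling the noisy string columns.
samples = ["19114.12", "-500", "28_", "312.49"]

# Pattern used in the notebook: keeps digits only.
digits_only = [re.sub(r"[^0-9]", "", s) for s in samples]

# Alternative pattern (assumption): also keeps "." and "-".
keep_sign_and_point = [re.sub(r"[^0-9.\-]", "", s) for s in samples]

for original, a, b in zip(samples, digits_only, keep_sign_and_point):
    print(f"{original!r:>12} -> digits only: {a!r:>10} | sign and point kept: {b!r}")
```

With the digits-only pattern, a value such as `19114.12` becomes `1911412`, so the later numeric cast changes its magnitude. Whether that matters depends on how the cleaned columns are used downstream.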

## Creating the *Label*


## Indexing



In [61]:
from pyspark.ml.feature import StringIndexer

In [62]:
indexer = StringIndexer(inputCols=['Occupation', 'Type_of_Loan', "Credit_Mix", "Payment_of_Min_Amount", "Payment_Behaviour", "Credit_Score"],
                        outputCols=['Occupation_idx', 'Type_of_Loan_idx', "Credit_Mix_idx", "Payment_of_Min_Amount_idx", "Payment_Behaviour_idx", "label"]).fit(dfsClean).transform(dfsClean)
dfsClean = indexer
dfsClean.show()

Py4JJavaError: ignored
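To interpret the numeric `label` values produced by the indexing step, it helps to know how `StringIndexer` assigns indices. As far as I know, its default ordering is by descending frequency, with ties broken alphabetically, so the most common class becomes `0.0`. The plain-Python sketch below mimics that rule; the function name, class names, and counts are hypothetical.

```python
from collections import Counter

def frequency_desc_index(values):
    """Map each distinct string to an index, most frequent first (ties alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical class names and counts, for illustration only.
scores = ["B"] * 5 + ["A"] * 3 + ["C"] * 2
mapping = frequency_desc_index(scores)
print(mapping)
```

Under this rule, label `0.0` in the later confusion matrices would correspond to the most frequent `Credit_Score` value, not to any inherent ordering of the classes.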

In [63]:
dfsClean =  dfsClean.drop('ID','Occupation', 'Type_of_Loan', "Credit_Mix", "Payment_of_Min_Amount", "Payment_Behaviour")
dfsClean.show()

+---+-------------+---------------------+-----------------+---------------+-------------+-----------+-------------------+----------------------+--------------------+--------------------+----------------+------------------------+-------------------+---------------+------------+--------------+----------------+--------------+-------------------------+---------------------+-----+
|Age|Annual_Income|Monthly_Inhand_Salary|Num_Bank_Accounts|Num_Credit_Card|Interest_Rate|Num_of_Loan|Delay_from_due_date|Num_of_Delayed_Payment|Changed_Credit_Limit|Num_Credit_Inquiries|Outstanding_Debt|Credit_Utilization_Ratio|Total_EMI_per_month|Monthly_Balance|Credit_Score|Occupation_idx|Type_of_Loan_idx|Credit_Mix_idx|Payment_of_Min_Amount_idx|Payment_Behaviour_idx|label|
+---+-------------+---------------------+-----------------+---------------+-------------+-----------+-------------------+----------------------+--------------------+--------------------+----------------+------------------------+--------------

## Casting to Numeric Types

In [64]:
new_data_types = {
    "Age": "integer",
    "Annual_Income": "float",
    "Num_of_Loan": "integer",
    "Delay_from_due_date": "integer",
    "Num_of_Delayed_Payment": "integer",
    "Changed_Credit_Limit": "float",
    "Outstanding_Debt": "float",
    "Monthly_Balance": "float"
}

for col_name, new_data_type in new_data_types.items():
    dfsClean = dfsClean.withColumn(col_name, col(col_name).cast(new_data_type))

dfsClean.printSchema()

root
 |-- Age: integer (nullable = true)
 |-- Annual_Income: float (nullable = true)
 |-- Monthly_Inhand_Salary: double (nullable = false)
 |-- Num_Bank_Accounts: integer (nullable = true)
 |-- Num_Credit_Card: double (nullable = true)
 |-- Interest_Rate: integer (nullable = true)
 |-- Num_of_Loan: integer (nullable = true)
 |-- Delay_from_due_date: integer (nullable = true)
 |-- Num_of_Delayed_Payment: integer (nullable = true)
 |-- Changed_Credit_Limit: float (nullable = true)
 |-- Num_Credit_Inquiries: double (nullable = false)
 |-- Outstanding_Debt: float (nullable = true)
 |-- Credit_Utilization_Ratio: double (nullable = true)
 |-- Total_EMI_per_month: double (nullable = true)
 |-- Monthly_Balance: float (nullable = true)
 |-- Credit_Score: string (nullable = true)
 |-- Occupation_idx: double (nullable = false)
 |-- Type_of_Loan_idx: double (nullable = false)
 |-- Credit_Mix_idx: double (nullable = false)
 |-- Payment_of_Min_Amount_idx: double (nullable = false)
 |-- Payment_Behav

In [65]:
res = [count(when((col(c) == ' ')|( col(c).isNull()), c)).alias(c) for c in dfsClean.columns]
dfsClean.select(*res).show()

+---+-------------+---------------------+-----------------+---------------+-------------+-----------+-------------------+----------------------+--------------------+--------------------+----------------+------------------------+-------------------+---------------+------------+--------------+----------------+--------------+-------------------------+---------------------+-----+
|Age|Annual_Income|Monthly_Inhand_Salary|Num_Bank_Accounts|Num_Credit_Card|Interest_Rate|Num_of_Loan|Delay_from_due_date|Num_of_Delayed_Payment|Changed_Credit_Limit|Num_Credit_Inquiries|Outstanding_Debt|Credit_Utilization_Ratio|Total_EMI_per_month|Monthly_Balance|Credit_Score|Occupation_idx|Type_of_Loan_idx|Credit_Mix_idx|Payment_of_Min_Amount_idx|Payment_Behaviour_idx|label|
+---+-------------+---------------------+-----------------+---------------+-------------+-----------+-------------------+----------------------+--------------------+--------------------+----------------+------------------------+--------------

In [66]:
dfsTemp = dfsClean

dfsClean = dfsTemp.fillna({'Num_of_Loan': 0, "Changed_Credit_Limit": 0, "Outstanding_Debt": 0})

In [67]:
dfsClean.show()

+---+-------------+---------------------+-----------------+---------------+-------------+-----------+-------------------+----------------------+--------------------+--------------------+----------------+------------------------+-------------------+---------------+------------+--------------+----------------+--------------+-------------------------+---------------------+-----+
|Age|Annual_Income|Monthly_Inhand_Salary|Num_Bank_Accounts|Num_Credit_Card|Interest_Rate|Num_of_Loan|Delay_from_due_date|Num_of_Delayed_Payment|Changed_Credit_Limit|Num_Credit_Inquiries|Outstanding_Debt|Credit_Utilization_Ratio|Total_EMI_per_month|Monthly_Balance|Credit_Score|Occupation_idx|Type_of_Loan_idx|Credit_Mix_idx|Payment_of_Min_Amount_idx|Payment_Behaviour_idx|label|
+---+-------------+---------------------+-----------------+---------------+-------------+-----------+-------------------+----------------------+--------------------+--------------------+----------------+------------------------+--------------

In [68]:
res = [count(when((col(c) == ' ')|( col(c).isNull()), c)).alias(c) for c in dfsClean.columns]
dfsClean.select(*res).show()

+---+-------------+---------------------+-----------------+---------------+-------------+-----------+-------------------+----------------------+--------------------+--------------------+----------------+------------------------+-------------------+---------------+------------+--------------+----------------+--------------+-------------------------+---------------------+-----+
|Age|Annual_Income|Monthly_Inhand_Salary|Num_Bank_Accounts|Num_Credit_Card|Interest_Rate|Num_of_Loan|Delay_from_due_date|Num_of_Delayed_Payment|Changed_Credit_Limit|Num_Credit_Inquiries|Outstanding_Debt|Credit_Utilization_Ratio|Total_EMI_per_month|Monthly_Balance|Credit_Score|Occupation_idx|Type_of_Loan_idx|Credit_Mix_idx|Payment_of_Min_Amount_idx|Payment_Behaviour_idx|label|
+---+-------------+---------------------+-----------------+---------------+-------------+-----------+-------------------+----------------------+--------------------+--------------------+----------------+------------------------+--------------

## Assembling the Feature Columns (features)

In [69]:
from pyspark.ml.feature import VectorAssembler

In [91]:
assembler = VectorAssembler(inputCols=['Age','Annual_Income','Monthly_Inhand_Salary','Num_Bank_Accounts','Num_Credit_Card','Interest_Rate','Num_of_Loan','Delay_from_due_date','Num_of_Delayed_Payment','Changed_Credit_Limit','Num_Credit_Inquiries','Outstanding_Debt','Monthly_Balance','Occupation_idx','Credit_Mix_idx','Payment_of_Min_Amount_idx','Payment_Behaviour_idx'],
                            outputCol='features')
dfsCreditClean = assembler.transform(dfsClean)
dfsCreditClean.show()

+---+-------------+---------------------+-----------------+---------------+-------------+-----------+-------------------+----------------------+--------------------+--------------------+----------------+------------------------+-------------------+---------------+------------+--------------+----------------+--------------+-------------------------+---------------------+-----+--------------------+
|Age|Annual_Income|Monthly_Inhand_Salary|Num_Bank_Accounts|Num_Credit_Card|Interest_Rate|Num_of_Loan|Delay_from_due_date|Num_of_Delayed_Payment|Changed_Credit_Limit|Num_Credit_Inquiries|Outstanding_Debt|Credit_Utilization_Ratio|Total_EMI_per_month|Monthly_Balance|Credit_Score|Occupation_idx|Type_of_Loan_idx|Credit_Mix_idx|Payment_of_Min_Amount_idx|Payment_Behaviour_idx|label|            features|
+---+-------------+---------------------+-----------------+---------------+-------------+-----------+-------------------+----------------------+--------------------+--------------------+--------------

## Training and Test Sets


In [92]:
creditTrain, creditTest = dfsCreditClean.randomSplit([0.8, 0.2], seed=12)

[creditTest.count(), creditTrain.count()]

[20027, 79973]

# Logistic Regression


In [93]:
from pyspark.ml.classification import LogisticRegression


In [107]:
logi = LogisticRegression(aggregationDepth= 4)
logi_model = logi.fit(creditTrain)

In [108]:
prediction = logi_model.transform(creditTest)

prediction['label', 'prediction', 'probability'].show()
#['label', 'prediction', 'probability']

+-----+----------+--------------------+
|label|prediction|         probability|
+-----+----------+--------------------+
|  1.0|       1.0|[0.22315080944821...|
|  1.0|       1.0|[0.36194105927979...|
|  1.0|       0.0|[0.69929236626403...|
|  2.0|       0.0|[0.69666587815508...|
|  1.0|       0.0|[0.40045916704559...|
|  0.0|       0.0|[0.48156455093233...|
|  1.0|       0.0|[0.62404788021839...|
|  1.0|       0.0|[0.66173438440746...|
|  1.0|       0.0|[0.69717645506200...|
|  1.0|       0.0|[0.69581742106382...|
|  1.0|       1.0|[0.26391388478044...|
|  1.0|       1.0|[0.43654831946508...|
|  1.0|       1.0|[0.42737918508566...|
|  0.0|       1.0|[0.41921049768462...|
|  0.0|       1.0|[0.22134818312449...|
|  1.0|       1.0|[0.23952965594227...|
|  0.0|       1.0|[0.25116686077878...|
|  1.0|       1.0|[0.31390758307452...|
|  1.0|       1.0|[0.22582966763711...|
|  0.0|       0.0|[0.53256423419095...|
+-----+----------+--------------------+
only showing top 20 rows



## Confusion Matrix



In [109]:
prediction.groupBy('label', 'prediction').count().show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|  2.0|       0.0| 2118|
|  1.0|       1.0| 2536|
|  0.0|       1.0| 1680|
|  1.0|       0.0| 3020|
|  2.0|       2.0| 1324|
|  2.0|       1.0|   69|
|  1.0|       2.0|  162|
|  0.0|       0.0| 8302|
|  0.0|       2.0|  816|
+-----+----------+-----+



## Accuracy

In [110]:
TP=prediction.filter('prediction = 1 AND label = 1').count()
FP=prediction.filter('prediction = 1 AND label = 0').count()
FN=prediction.filter('prediction = 0 AND label = 1').count()
TN=prediction.filter('prediction = 0 AND label = 0').count()

print("Verdaderos positivos: ", TP)
print("Falsos positivos: ", FP)
print("Falsos Negativos: ", FN)
print("Verdaderos Negativos: ", TN)
print("Accuracy: ", (TN+TP)/(TP+FP+FN+TN))

Verdaderos positivos:  2536
Falsos positivos:  1680
Falsos Negativos:  3020
Verdaderos Negativos:  8302
Accuracy:  0.6975157677950831


## Precision

In [111]:
TP/(TP+FP)

0.6015180265654649

## Recall

In [112]:
TP/(TP+FN)

0.4564434845212383
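The precision and recall above use only the cells for labels 0 and 1, although the confusion matrix has three classes. As a complementary view, the sketch below computes one-vs-rest precision and recall for any class from the full logistic regression confusion matrix printed earlier. The counts are copied from that output; the dictionary layout and function names are assumptions for illustration.

```python
# (label, prediction) -> count, copied from the logistic regression confusion matrix.
confusion = {
    (2, 0): 2118, (1, 1): 2536, (0, 1): 1680,
    (1, 0): 3020, (2, 2): 1324, (2, 1): 69,
    (1, 2): 162, (0, 0): 8302, (0, 2): 816,
}

def precision(cls):
    """Share of rows predicted as `cls` that truly belong to `cls`."""
    predicted = sum(c for (_, pred), c in confusion.items() if pred == cls)
    return confusion[(cls, cls)] / predicted

def recall(cls):
    """Share of rows that truly belong to `cls` and were predicted as `cls`."""
    actual = sum(c for (label, _), c in confusion.items() if label == cls)
    return confusion[(cls, cls)] / actual

for cls in (0, 1, 2):
    print(f"class {cls}: precision={precision(cls):.4f} recall={recall(cls):.4f}")
```

Including the class 2 cells lowers both figures for class 1 slightly, because rows with label 2 predicted as 1, and rows with label 1 predicted as 2, now count as errors.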

## BIC (Bayesian Information Criterion)

In [130]:
from numpy import log
from sklearn.metrics import mean_squared_error
from pyspark.ml.evaluation import RegressionEvaluator

In [131]:
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="mse")
mse = evaluator.evaluate(prediction)
print("MSE: ", mse)

MSE:  0.8322264942327856


In [143]:
n = prediction.count()
num_params = len(dfsCreditClean.columns) - 1

bic = n * (mse + num_params * log(n))
print('BIC: %.3f' % bic)

BIC: 4380678.595
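The expression used above, `n * (mse + num_params * log(n))`, differs from the usual textbook form of the criterion. For reference, the sketch below reproduces the notebook's value and sets it beside the form commonly derived under a Gaussian error assumption, `n * ln(MSE) + k * ln(n)`. The inputs come from the outputs above (20027 test rows, the printed MSE, and 22 parameters, which is `len(dfsCreditClean.columns) - 1` for the 23 columns shown after assembling the features). The Gaussian form is an assumption on my part; for a classifier, a log-likelihood-based BIC may be more appropriate.

```python
import math

n = 20027                  # number of test rows, from the split above
mse = 0.8322264942327856   # MSE printed above
k = 22                     # num_params: 23 DataFrame columns minus 1

# Expression used in the notebook.
notebook_bic = n * (mse + k * math.log(n))

# Common Gaussian-error form (assumption, not the notebook's method).
gaussian_bic = n * math.log(mse) + k * math.log(n)

print("Notebook expression: %.3f" % notebook_bic)
print("Gaussian-form BIC:   %.3f" % gaussian_bic)
```

In the Gaussian form the penalty term `k * ln(n)` is not multiplied by `n`, so the fit term and the complexity penalty stay on comparable scales.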


# Decision Tree

In [144]:
from pyspark.ml.classification import DecisionTreeClassifier

In [145]:
tree = DecisionTreeClassifier()
tree_model = tree.fit(creditTrain)

In [146]:
prediction = tree_model.transform(creditTest)

prediction['label', 'prediction', 'probability'].show()
#['label', 'prediction', 'probability']

+-----+----------+--------------------+
|label|prediction|         probability|
+-----+----------+--------------------+
|  1.0|       0.0|[0.56886726893676...|
|  1.0|       1.0|[0.25632658980494...|
|  1.0|       1.0|[0.25632658980494...|
|  2.0|       0.0|[0.85871482960349...|
|  1.0|       0.0|[0.66666666666666...|
|  0.0|       1.0|[0.22266560255387...|
|  1.0|       1.0|[0.25632658980494...|
|  1.0|       1.0|[0.25632658980494...|
|  1.0|       1.0|[0.25632658980494...|
|  1.0|       1.0|[0.25632658980494...|
|  1.0|       1.0|[0.25632658980494...|
|  1.0|       1.0|[0.25632658980494...|
|  1.0|       1.0|[0.25632658980494...|
|  0.0|       0.0|[0.56886726893676...|
|  0.0|       0.0|[0.56886726893676...|
|  1.0|       1.0|[0.25632658980494...|
|  0.0|       1.0|[0.25632658980494...|
|  1.0|       1.0|[0.25632658980494...|
|  1.0|       1.0|[0.25632658980494...|
|  0.0|       1.0|[0.25632658980494...|
+-----+----------+--------------------+
only showing top 20 rows



## Confusion Matrix

In [147]:
prediction.groupBy('label', 'prediction').count().show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|  2.0|       0.0| 1460|
|  1.0|       1.0| 3689|
|  0.0|       1.0| 1447|
|  1.0|       0.0| 1674|
|  2.0|       2.0| 1902|
|  2.0|       1.0|  149|
|  1.0|       2.0|  355|
|  0.0|       0.0| 8262|
|  0.0|       2.0| 1089|
+-----+----------+-----+



## Accuracy

In [148]:
TP=prediction.filter('prediction = 1 AND label = 1').count()
FP=prediction.filter('prediction = 1 AND label = 0').count()
FN=prediction.filter('prediction = 0 AND label = 1').count()
TN=prediction.filter('prediction = 0 AND label = 0').count()

print("Verdaderos positivos: ", TP)
print("Falsos positivos: ", FP)
print("Falsos Negativos: ", FN)
print("Verdaderos Negativos: ", TN)
print("Accuracy: ", (TN+TP)/(TP+FP+FN+TN))

Verdaderos positivos:  3689
Falsos positivos:  1447
Falsos Negativos:  1674
Verdaderos Negativos:  8262
Accuracy:  0.7929272823779193


## Precision

In [149]:
TP/(TP+FP)

0.7182632398753894

## Recall

In [150]:
TP/(TP+FN)

0.6878612716763006

## BIC

In [151]:
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="mse")
mse = evaluator.evaluate(prediction)

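# Intended node count. Note: this expression counts the rows of `prediction`
# (so it equals n), not the nodes of the tree; `tree_model.numNodes` holds the node count.
num_nodes = prediction.select('prediction').rdd.flatMap(lambda x: x).count()

# Number of parameters
num_params = num_nodes - 1

# BIC calculation
n = prediction.count()  # number of observations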
bic = n * (mse + num_params * log(n))
print('BIC: %.3f' % bic)

BIC: 3972454557.878


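## Model Analysis

Comparing the two models, the Decision Tree performs better in this case. Its accuracy and precision, as computed above, are 9.54 and 11.67 percentage points higher, respectively (0.793 vs. 0.698 and 0.718 vs. 0.602), and its recall is also higher (0.688 vs. 0.456). The BIC values as computed do not support the same conclusion: the value reported for the Decision Tree (3972454557.878) is larger than the one for Logistic Regression (4380678.595). However, the tree's parameter count was taken from the number of test rows rather than the number of tree nodes, so the two BIC values are not directly comparable. Based on accuracy, precision, and recall, the Decision Tree is the recommended model here.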
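As a cross-check on the comparison, overall accuracy can also be computed over all three classes, using every cell of the two confusion matrices printed earlier rather than only the cells for labels 0 and 1. The sketch below does this in plain Python; the counts are copied from those outputs, while the function name and dictionary layout are assumptions for illustration.

```python
def overall_accuracy(confusion):
    """Fraction of rows on the diagonal of a (label, prediction) -> count mapping."""
    total = sum(confusion.values())
    correct = sum(c for (label, pred), c in confusion.items() if label == pred)
    return correct / total

# Counts copied from the logistic regression confusion matrix.
logistic = {
    (2, 0): 2118, (1, 1): 2536, (0, 1): 1680,
    (1, 0): 3020, (2, 2): 1324, (2, 1): 69,
    (1, 2): 162, (0, 0): 8302, (0, 2): 816,
}

# Counts copied from the decision tree confusion matrix.
tree = {
    (2, 0): 1460, (1, 1): 3689, (0, 1): 1447,
    (1, 0): 1674, (2, 2): 1902, (2, 1): 149,
    (1, 2): 355, (0, 0): 8262, (0, 2): 1089,
}

acc_lr = overall_accuracy(logistic)
acc_tree = overall_accuracy(tree)
print("Logistic regression: %.4f" % acc_lr)
print("Decision tree:       %.4f" % acc_tree)
```

Both figures come out lower than the two-class values reported in the notebook, but the ordering of the two models is unchanged.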

In [152]:
spark.stop()