
# Glue Studio Notebook
You are now running a **Glue Studio** notebook; before you can start using your notebook you *must* start an interactive session.

## Available Magics
|          Magic              |   Type       |                                                                        Description                                                                        |
|-----------------------------|--------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| %%configure                 |  Dictionary  |  A json-formatted dictionary consisting of all configuration parameters for a session. Each parameter can be specified here or through individual magics. |
| %profile                    |  String      |  Specify a profile in your aws configuration to use as the credentials provider.                                                                          |
| %iam_role                   |  String      |  Specify an IAM role to execute your session with.                                                                                                        |
| %region                     |  String      |  Specify the AWS region in which to initialize a session.                                                                                                 |
| %session_id                 |  String      |  Returns the session ID for the running session.                                                                                                          |
| %connections                |  List        |  Specify a comma separated list of connections to use in the session.                                                                                     |
| %additional_python_modules  |  List        |  Comma separated list of pip packages, s3 paths or private pip arguments.                                                                                 |
| %extra_py_files             |  List        |  Comma separated list of additional Python files from S3.                                                                                                 |
| %extra_jars                 |  List        |  Comma separated list of additional Jars to include in the cluster.                                                                                       |
| %number_of_workers          |  Integer     |  The number of workers of a defined worker_type that are allocated when a job runs. worker_type must be set too.                                          |
| %glue_version               |  String      |  The version of Glue to be used by this session. Currently, the only valid options are 2.0 and 3.0 (eg: %glue_version 2.0).                               |
| %security_config            |  String      |  Define a security configuration to be used with this session.                                                                                            |
| %%sql                       |  String      |  Run SQL code. All lines after the initial %%sql magic will be passed as part of the SQL code.                                                            |
| %streaming                  |  String      |  Changes the session type to Glue Streaming.                                                                                                              |
| %etl                        |  String      |  Changes the session type to Glue ETL.                                                                                                                    |
| %status                     |              |  Returns the status of the current Glue session including its duration, configuration and executing user / role.                                          |
| %stop_session               |              |  Stops the current session.                                                                                                                               |
| %list_sessions              |              |  Lists all currently running sessions by name and ID.                                                                                                     |
| %worker_type                |  String      |  Standard, G.1X, *or* G.2X. number_of_workers must be set too. Default is G.1X.                                                                           |
| %spark_conf                 |  String      |  Specify custom spark configurations for your session. E.g. %spark_conf spark.serializer=org.apache.spark.serializer.KryoSerializer.                      |

In [1]:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Installed kernel version: 0.37.0 
Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::319474470582:role/service-role/AWSGlueServiceRole-Aghar
Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 5
Session ID: ed3903fe-e4d4-4f9a-b31d-0e491f499d57
Job Type: glueetl
Applying the following default arguments:
--glue_kernel_version 0.37.0
--enable-glue-datacatalog true
Waiting for session ed3903fe-e4d4-4f9a-b31d-0e491f499d57 to get into ready status...
Session ed3903fe-e4d4-4f9a-b31d-0e491f499d57 has been created.


In [2]:
df=glueContext.create_dynamic_frame.from_catalog(
                 database='bank_database-aghar',
                 table_name='cleaned_bankdata').toDF()

In [3]:
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler
from pyspark.ml.feature import OneHotEncoder

In [4]:
catCols = ['job', 'marital', 'education', 'default','housing', 'loan', 'contact', 'poutcome']

In [5]:
# Index the string values of each categorical column
indexers = [
    StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c))
    for c in catCols
]
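
To make the indexed values easier to read, here is a small pure-Python sketch of how `StringIndexer` assigns indices under its default `frequencyDesc` ordering: the most frequent value gets 0.0, the next 1.0, and so on. The helper name `string_index` and the sample values are illustrative assumptions, not taken from the dataset. Ties are broken alphabetically here, which is my understanding of the Spark 3.x behaviour.

```python
from collections import Counter

def string_index(values):
    """Map each distinct string to a float index, most frequent first (hypothetical helper)."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: float(i) for i, label in enumerate(ordered)}

# Made-up sample column; not the real values of any column in the table.
sample = ["no", "yes", "no", "no", "yes", "unknown"]
mapping = string_index(sample)
print(mapping)
```

For a binary column this means the majority value is indexed as 0.0 and the minority value as 1.0.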

In [6]:
# One-hot encode the indexed values of each categorical column
encoders = [OneHotEncoder(dropLast=False,inputCol=indexer.getOutputCol(),
            outputCol="{0}_encoded".format(indexer.getOutputCol())) 
    for indexer in indexers
]
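
The effect of `dropLast=False` can be sketched in plain Python. By default `OneHotEncoder` drops the last category and encodes it as an all-zeros vector; with `dropLast=False` every category keeps its own slot. The helper `one_hot` is hypothetical and returns a dense list, whereas Spark produces sparse vectors.

```python
def one_hot(index, num_categories, drop_last=False):
    """Return a dense one-hot list for a category index (hypothetical helper)."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    # With drop_last=True the last category is encoded as the all-zeros vector.
    if index < size:
        vec[int(index)] = 1.0
    return vec

kept = one_hot(2, 3)                     # a slot for every category
dropped = one_hot(2, 3, drop_last=True)  # last category becomes all zeros
print(kept, dropped)
```

Keeping every slot means the assembled vector has one position per distinct value across all encoded columns.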


In [7]:
# Vectorizing encoded values
# VectorAssembler combines all the one-hot encoded columns into a single vector column named 'rawFeatures'.
assembler = VectorAssembler(inputCols=[encoder.getOutputCol() for encoder in encoders], outputCol="rawFeatures")

# Numerical columns; defined here but not passed to the assembler above, so they are not part of 'rawFeatures'.
numericCols = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']
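
One thing worth checking at this step: the models later in this notebook use `loan_indexed` as the label, and `loan` is also in `catCols`, so `loan_indexed_encoded` ends up inside `rawFeatures`. That alone could account for perfect scores. The sketch below, assuming the goal is to predict `loan` from the other columns, builds an input list that leaves the label out and adds the numeric columns. The helper name `feature_columns` is hypothetical.

```python
def feature_columns(cat_cols, numeric_cols, label_col):
    """List assembler inputs: encoded categoricals (minus the label) plus numeric columns."""
    encoded = ["{0}_indexed_encoded".format(c) for c in cat_cols if c != label_col]
    return encoded + list(numeric_cols)

catCols = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'poutcome']
numericCols = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']

inputCols = feature_columns(catCols, numericCols, label_col='loan')
print(inputCols)
# The list could then be passed to the assembler, for example:
# VectorAssembler(inputCols=inputCols, outputCol="rawFeatures")
```

The label column still needs its `StringIndexer` stage so that `loan_indexed` exists; only its encoded form is kept out of the feature vector.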

In [8]:
# Pipeline that turns each categorical column into a one-hot encoded vector (StringIndexer, then OneHotEncoder) and assembles the results.
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=indexers + encoders + [assembler])
model=pipeline.fit(df)
transformed = model.transform(df)
transformed.show(5)

+---+-----------+-------+---------+-------+-------+-------+----+--------+--------+--------+-----+--------+--------+-------+-----------+---------------+-----------------+---------------+---------------+------------+---------------+----------------+-------------------+-----------------------+-------------------------+-----------------------+-----------------------+--------------------+-----------------------+------------------------+--------------------+
|age|        job|marital|education|default|balance|housing|loan| contact|duration|campaign|pdays|previous|poutcome|deposit|job_indexed|marital_indexed|education_indexed|default_indexed|housing_indexed|loan_indexed|contact_indexed|poutcome_indexed|job_indexed_encoded|marital_indexed_encoded|education_indexed_encoded|default_indexed_encoded|housing_indexed_encoded|loan_indexed_encoded|contact_indexed_encoded|poutcome_indexed_encoded|         rawFeatures|
+---+-----------+-------+---------+-------+-------+-------+----+--------+--------+----

In [9]:
transformed.select('rawFeatures').printSchema()

root
 |-- rawFeatures: vector (nullable = true)


In [10]:
transformed.printSchema()

root
 |-- age: long (nullable = true)
 |-- job: string (nullable = true)
 |-- marital: string (nullable = true)
 |-- education: string (nullable = true)
 |-- default: string (nullable = true)
 |-- balance: long (nullable = true)
 |-- housing: string (nullable = true)
 |-- loan: string (nullable = true)
 |-- contact: string (nullable = true)
 |-- duration: long (nullable = true)
 |-- campaign: long (nullable = true)
 |-- pdays: long (nullable = true)
 |-- previous: long (nullable = true)
 |-- poutcome: string (nullable = true)
 |-- deposit: string (nullable = true)
 |-- job_indexed: double (nullable = false)
 |-- marital_indexed: double (nullable = false)
 |-- education_indexed: double (nullable = false)
 |-- default_indexed: double (nullable = false)
 |-- housing_indexed: double (nullable = false)
 |-- loan_indexed: double (nullable = false)
 |-- contact_indexed: double (nullable = false)
 |-- poutcome_indexed: double (nullable = false)
 |-- job_indexed_encoded: vector (nullable = true

In [11]:
scaler = StandardScaler(inputCol='rawFeatures', outputCol='scaled_rawFeatures')
scaler_model = scaler.fit(transformed)

In [12]:
(trainingData, testData) = transformed.randomSplit([0.7, 0.3],seed = 11)

In [13]:
trainingData.show(5)

+---+-------+-------+---------+-------+-------+-------+----+---------+--------+--------+-----+--------+--------+-------+-----------+---------------+-----------------+---------------+---------------+------------+---------------+----------------+-------------------+-----------------------+-------------------------+-----------------------+-----------------------+--------------------+-----------------------+------------------------+--------------------+
|age|    job|marital|education|default|balance|housing|loan|  contact|duration|campaign|pdays|previous|poutcome|deposit|job_indexed|marital_indexed|education_indexed|default_indexed|housing_indexed|loan_indexed|contact_indexed|poutcome_indexed|job_indexed_encoded|marital_indexed_encoded|education_indexed_encoded|default_indexed_encoded|housing_indexed_encoded|loan_indexed_encoded|contact_indexed_encoded|poutcome_indexed_encoded|         rawFeatures|
+---+-------+-------+---------+-------+-------+-------+----+---------+--------+--------+----

In [14]:
testData.show(5)

+---+-----------+-------+---------+-------+-------+-------+----+---------+--------+--------+-----+--------+--------+-------+-----------+---------------+-----------------+---------------+---------------+------------+---------------+----------------+-------------------+-----------------------+-------------------------+-----------------------+-----------------------+--------------------+-----------------------+------------------------+--------------------+
|age|        job|marital|education|default|balance|housing|loan|  contact|duration|campaign|pdays|previous|poutcome|deposit|job_indexed|marital_indexed|education_indexed|default_indexed|housing_indexed|loan_indexed|contact_indexed|poutcome_indexed|job_indexed_encoded|marital_indexed_encoded|education_indexed_encoded|default_indexed_encoded|housing_indexed_encoded|loan_indexed_encoded|contact_indexed_encoded|poutcome_indexed_encoded|         rawFeatures|
+---+-----------+-------+---------+-------+-------+-------+----+---------+--------+-

In [15]:
trainingData.columns

['age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan', 'contact', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'deposit', 'job_indexed', 'marital_indexed', 'education_indexed', 'default_indexed', 'housing_indexed', 'loan_indexed', 'contact_indexed', 'poutcome_indexed', 'job_indexed_encoded', 'marital_indexed_encoded', 'education_indexed_encoded', 'default_indexed_encoded', 'housing_indexed_encoded', 'loan_indexed_encoded', 'contact_indexed_encoded', 'poutcome_indexed_encoded', 'rawFeatures']


In [16]:
trainingData.count()

7821


In [17]:
testData.count()

3341


In [21]:
trainingData.printSchema()

root
 |-- age: long (nullable = true)
 |-- job: string (nullable = true)
 |-- marital: string (nullable = true)
 |-- education: string (nullable = true)
 |-- default: string (nullable = true)
 |-- balance: long (nullable = true)
 |-- housing: string (nullable = true)
 |-- loan: string (nullable = true)
 |-- contact: string (nullable = true)
 |-- duration: long (nullable = true)
 |-- campaign: long (nullable = true)
 |-- pdays: long (nullable = true)
 |-- previous: long (nullable = true)
 |-- poutcome: string (nullable = true)
 |-- deposit: string (nullable = true)
 |-- job_indexed: double (nullable = false)
 |-- marital_indexed: double (nullable = false)
 |-- education_indexed: double (nullable = false)
 |-- default_indexed: double (nullable = false)
 |-- housing_indexed: double (nullable = false)
 |-- loan_indexed: double (nullable = false)
 |-- contact_indexed: double (nullable = false)
 |-- poutcome_indexed: double (nullable = false)
 |-- job_indexed_encoded: vector (nullable = true

In [170]:
trainingData.repartition(1).write.mode('overwrite').parquet('s3://aghar-awsglue-capstone/Bank_Data/processed_BankData/train-data/')

In [171]:
testData.repartition(1).write.mode('overwrite').parquet('s3://aghar-awsglue-capstone/Bank_Data/processed_BankData/test-data/')

# Modeling

The next step is modeling.

A few of the classification algorithms available in Spark ML:

LogisticRegression

DecisionTreeClassifier

RandomForestClassifier

Gradient-boosted tree classifier

NaiveBayes

Support Vector Machine

# LogisticRegression

In [172]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import QuantileDiscretizer


In [32]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol="loan_indexed", featuresCol="rawFeatures")
# Train the model on the training split, then predict on the test split
lrModel = lr.fit(trainingData)
lr_prediction = lrModel.transform(testData)
lr_prediction.select("prediction", "loan_indexed", "rawFeatures").show()
evaluator = MulticlassClassificationEvaluator(labelCol="loan_indexed", predictionCol="prediction", metricName="accuracy")

+----------+------------+--------------------+
|prediction|loan_indexed|         rawFeatures|
+----------+------------+--------------------+
|       0.0|         0.0|(32,[7,13,15,19,2...|
|       0.0|         0.0|(32,[7,13,15,19,2...|
|       0.0|         0.0|(32,[1,12,15,19,2...|
|       1.0|         1.0|(32,[1,13,17,19,2...|
|       0.0|         0.0|(32,[1,13,15,19,2...|
|       0.0|         0.0|(32,[0,13,16,19,2...|
|       0.0|         0.0|(32,[1,13,15,19,2...|
|       0.0|         0.0|(32,[1,13,15,19,2...|
|       0.0|         0.0|(32,[0,13,16,19,2...|
|       0.0|         0.0|(32,[1,12,15,19,2...|
|       0.0|         0.0|(32,[4,12,15,19,2...|
|       0.0|         0.0|(32,[4,13,15,19,2...|
|       0.0|         0.0|(32,[4,13,15,19,2...|
|       1.0|         1.0|(32,[3,13,15,19,2...|
|       1.0|         1.0|(32,[1,13,15,19,2...|
|       1.0|         1.0|(32,[0,12,16,20,2...|
|       0.0|         0.0|(32,[2,13,15,19,2...|
|       0.0|         0.0|(32,[2,13,18,19,2...|
|       0.0| 

# Evaluating accuracy of LogisticRegression.

In [94]:
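from pyspark.mllib.evaluation import BinaryClassificationMetrics
# BinaryClassificationMetrics expects an RDD of (score, label) pairs;
# here the hard 0/1 prediction is used as the score.
out = lrModel.transform(testData)\
 .select("prediction","loan_indexed")\
 .rdd.map(lambda x: (float(x[0]), float(x[1])))
lr_metrics = BinaryClassificationMetrics(out)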

In [130]:
print("areaUnderPR"+" : "+str(lr_metrics.areaUnderPR))

areaUnderPR : 1.0


In [131]:
print("areaUnderROC"+" : "+str(lr_metrics.areaUnderROC))

areaUnderROC : 1.0
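
For intuition about what `areaUnderROC` measures, here is a pure-Python sketch that computes it as the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counting as half. The helper name `roc_auc` and the sample pairs are made up for illustration. It also shows why passing hard 0/1 predictions as the score gives a coarser estimate than passing continuous scores.

```python
def roc_auc(score_and_labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in score_and_labels if y == 1.0]
    neg = [s for s, y in score_and_labels if y == 0.0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Illustrative pairs only: hard predictions vs. continuous scores for the same labels.
hard = [(1.0, 1.0), (0.0, 1.0), (0.0, 0.0), (0.0, 0.0)]
soft = [(0.9, 1.0), (0.4, 1.0), (0.3, 0.0), (0.1, 0.0)]
print(roc_auc(hard), roc_auc(soft))
```

In this made-up example the continuous scores rank every positive above every negative, while the hard predictions lose that ordering for one positive.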


In [33]:
# Use the MulticlassClassificationEvaluator to evaluate the model's accuracy
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol="loan_indexed", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(lr_prediction)
print("Accuracy:", accuracy)
# Select the "prediction" and "label" columns
predictions_df = lr_prediction.select(["prediction", "loan_indexed"])

# Convert the predictions and labels to Pandas dataframes for easier inspection
predictions_pd = predictions_df.toPandas()

# Print the first 10 predictions and their corresponding true labels
print(predictions_pd.head(10))
# Re-create the logistic regression estimator with default hyperparameters
lr = LogisticRegression(labelCol='loan_indexed', featuresCol='rawFeatures')

# Fit the model to the training data
lr_model = lr.fit(trainingData)

# Make predictions on the test data
predictions = lr_model.transform(testData)
# Save the model to a file
#lr_model.save("logistic_regression_model1")

# Load the saved model (a fitted model is loaded with LogisticRegressionModel, not LogisticRegression)
#loaded_model = LogisticRegressionModel.load("/content/logistic_regression_model1")

accuracy = evaluator.evaluate(predictions)
print("Accuracy:", accuracy)

Accuracy: 1.0
   prediction  loan_indexed
0         0.0           0.0
1         0.0           0.0
2         0.0           0.0
3         1.0           1.0
4         0.0           0.0
5         0.0           0.0
6         0.0           0.0
7         0.0           0.0
8         0.0           0.0
9         0.0           0.0
Accuracy: 1.0


In [34]:
lr_accuracy = evaluator.evaluate(lr_prediction)
print("Accuracy:", lr_accuracy)
print("Accuracy_LogisticRegression is = %g"% (lr_accuracy))
print("Test Error_LogisticRegression = %g " % (1.0 - lr_accuracy))


Accuracy: 1.0
Accuracy_LogisticRegression is = 1
Test Error_LogisticRegression = 0
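
Accuracy on its own can be hard to interpret when one class dominates, so it helps to compare it with the accuracy of always predicting the most common label. A minimal sketch in plain Python follows; the helper names and the sample pairs are hypothetical and are not the notebook's data.

```python
from collections import Counter

def accuracy(pairs):
    """Fraction of (prediction, label) pairs that agree."""
    return sum(1 for p, y in pairs if p == y) / len(pairs)

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

# Made-up example, not taken from lr_prediction.
pairs = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
labels = [y for _, y in pairs]
print(accuracy(pairs), majority_baseline(labels))
```

A model is only informative to the extent that its accuracy exceeds this baseline.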


# DecisionTreeClassifier

In [35]:
from pyspark.ml.classification import DecisionTreeClassifier
dt = DecisionTreeClassifier(labelCol="loan_indexed", featuresCol="rawFeatures")
dt_model = dt.fit(trainingData)
dt_prediction = dt_model.transform(testData)
dt_prediction.select("prediction", "loan_indexed", "rawFeatures").show()

+----------+------------+--------------------+
|prediction|loan_indexed|         rawFeatures|
+----------+------------+--------------------+
|       0.0|         0.0|(32,[7,13,15,19,2...|
|       0.0|         0.0|(32,[7,13,15,19,2...|
|       0.0|         0.0|(32,[1,12,15,19,2...|
|       1.0|         1.0|(32,[1,13,17,19,2...|
|       0.0|         0.0|(32,[1,13,15,19,2...|
|       0.0|         0.0|(32,[0,13,16,19,2...|
|       0.0|         0.0|(32,[1,13,15,19,2...|
|       0.0|         0.0|(32,[1,13,15,19,2...|
|       0.0|         0.0|(32,[0,13,16,19,2...|
|       0.0|         0.0|(32,[1,12,15,19,2...|
|       0.0|         0.0|(32,[4,12,15,19,2...|
|       0.0|         0.0|(32,[4,13,15,19,2...|
|       0.0|         0.0|(32,[4,13,15,19,2...|
|       1.0|         1.0|(32,[3,13,15,19,2...|
|       1.0|         1.0|(32,[1,13,15,19,2...|
|       1.0|         1.0|(32,[0,12,16,20,2...|
|       0.0|         0.0|(32,[2,13,15,19,2...|
|       0.0|         0.0|(32,[2,13,18,19,2...|
|       0.0| 

# Evaluating accuracy of DecisionTreeClassifier.

In [36]:
dt_accuracy = evaluator.evaluate(dt_prediction)
print("Accuracy of DecisionTreeClassifier is = %g"% (dt_accuracy))
print("Test Error of DecisionTreeClassifier = %g " % (1.0 - dt_accuracy))

Accuracy of DecisionTreeClassifier is = 1
Test Error of DecisionTreeClassifier = 0


In [97]:
from pyspark.mllib.evaluation import BinaryClassificationMetrics
out = dt_model.transform(testData)\
 .select("prediction","loan_indexed")\
 .rdd.map(lambda x: (float(x[0]), float(x[1])))
dt_metrics = BinaryClassificationMetrics(out)

In [128]:
print("areaUnderPR"+" : "+str(dt_metrics.areaUnderPR))

areaUnderPR : 1.0


In [129]:
print("areaUnderROC"+" : "+str(dt_metrics.areaUnderROC))

areaUnderROC : 1.0


# RandomForestClassifier

In [37]:
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(labelCol="loan_indexed", featuresCol="rawFeatures")
rf_model = rf.fit(trainingData)
rf_prediction = rf_model.transform(testData)
rf_prediction.select("prediction", "loan_indexed", "rawFeatures").show()

+----------+------------+--------------------+
|prediction|loan_indexed|         rawFeatures|
+----------+------------+--------------------+
|       0.0|         0.0|(32,[7,13,15,19,2...|
|       0.0|         0.0|(32,[7,13,15,19,2...|
|       0.0|         0.0|(32,[1,12,15,19,2...|
|       1.0|         1.0|(32,[1,13,17,19,2...|
|       0.0|         0.0|(32,[1,13,15,19,2...|
|       0.0|         0.0|(32,[0,13,16,19,2...|
|       0.0|         0.0|(32,[1,13,15,19,2...|
|       0.0|         0.0|(32,[1,13,15,19,2...|
|       0.0|         0.0|(32,[0,13,16,19,2...|
|       0.0|         0.0|(32,[1,12,15,19,2...|
|       0.0|         0.0|(32,[4,12,15,19,2...|
|       0.0|         0.0|(32,[4,13,15,19,2...|
|       0.0|         0.0|(32,[4,13,15,19,2...|
|       1.0|         1.0|(32,[3,13,15,19,2...|
|       1.0|         1.0|(32,[1,13,15,19,2...|
|       1.0|         1.0|(32,[0,12,16,20,2...|
|       0.0|         0.0|(32,[2,13,15,19,2...|
|       0.0|         0.0|(32,[2,13,18,19,2...|
|       0.0| 

# Evaluating accuracy of RandomForestClassifier.

In [38]:
accuracy = evaluator.evaluate(rf_prediction)
print("Accuracy:", accuracy)

Accuracy: 1.0


In [39]:
rf_accuracy = evaluator.evaluate(rf_prediction)
print("Accuracy of RandomForestClassifier is = %g"% (rf_accuracy))
print("Test Error of RandomForestClassifier  = %g " % (1.0 - rf_accuracy))

Accuracy of RandomForestClassifier is = 1
Test Error of RandomForestClassifier  = 0


In [100]:
from pyspark.mllib.evaluation import BinaryClassificationMetrics
out = rf_model.transform(testData)\
 .select("prediction","loan_indexed")\
 .rdd.map(lambda x: (float(x[0]), float(x[1])))
rf_metrics = BinaryClassificationMetrics(out)


In [126]:
print("areaUnderPR"+" : "+str(rf_metrics.areaUnderPR))

areaUnderPR : 1.0


In [127]:
print("areaUnderROC"+" : "+str(rf_metrics.areaUnderROC))

areaUnderROC : 1.0


# Gradient-boosted tree classifier

In [40]:
from pyspark.ml.classification import GBTClassifier
gbt = GBTClassifier(labelCol="loan_indexed", featuresCol="rawFeatures",maxIter=10)
gbt_model = gbt.fit(trainingData)
gbt_prediction = gbt_model.transform(testData)
gbt_prediction.select("prediction", "loan_indexed", "rawFeatures").show()

+----------+------------+--------------------+
|prediction|loan_indexed|         rawFeatures|
+----------+------------+--------------------+
|       0.0|         0.0|(32,[7,13,15,19,2...|
|       0.0|         0.0|(32,[7,13,15,19,2...|
|       0.0|         0.0|(32,[1,12,15,19,2...|
|       1.0|         1.0|(32,[1,13,17,19,2...|
|       0.0|         0.0|(32,[1,13,15,19,2...|
|       0.0|         0.0|(32,[0,13,16,19,2...|
|       0.0|         0.0|(32,[1,13,15,19,2...|
|       0.0|         0.0|(32,[1,13,15,19,2...|
|       0.0|         0.0|(32,[0,13,16,19,2...|
|       0.0|         0.0|(32,[1,12,15,19,2...|
|       0.0|         0.0|(32,[4,12,15,19,2...|
|       0.0|         0.0|(32,[4,13,15,19,2...|
|       0.0|         0.0|(32,[4,13,15,19,2...|
|       1.0|         1.0|(32,[3,13,15,19,2...|
|       1.0|         1.0|(32,[1,13,15,19,2...|
|       1.0|         1.0|(32,[0,12,16,20,2...|
|       0.0|         0.0|(32,[2,13,15,19,2...|
|       0.0|         0.0|(32,[2,13,18,19,2...|
|       0.0| 

# Evaluating accuracy of Gradient-boosted tree classifier.

In [41]:
gbt_accuracy = evaluator.evaluate(gbt_prediction)
print("Accuracy of Gradient-boosted tree classifier is = %g"% (gbt_accuracy))
print("Test Error of Gradient-boosted tree classifier %g"% (1.0 - gbt_accuracy))

Accuracy of Gradient-boosted tree classifier is = 1
Test Error of Gradient-boosted tree classifier 0


In [103]:
from pyspark.mllib.evaluation import BinaryClassificationMetrics
out = gbt_model.transform(testData)\
 .select("prediction","loan_indexed")\
 .rdd.map(lambda x: (float(x[0]), float(x[1])))
gbt_metrics = BinaryClassificationMetrics(out)


In [124]:
print("areaUnderPR"+" : "+str(gbt_metrics.areaUnderPR))

areaUnderPR : 1.0


In [125]:
print("areaUnderROC"+" : "+str(gbt_metrics.areaUnderROC))

areaUnderROC : 1.0


# NaiveBayes

In [42]:
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes(labelCol="loan_indexed", featuresCol="rawFeatures")
nb_model = nb.fit(trainingData)
nb_prediction = nb_model.transform(testData)
nb_prediction.select("prediction", "loan_indexed", "rawFeatures").show()

+----------+------------+--------------------+
|prediction|loan_indexed|         rawFeatures|
+----------+------------+--------------------+
|       0.0|         0.0|(32,[7,13,15,19,2...|
|       0.0|         0.0|(32,[7,13,15,19,2...|
|       0.0|         0.0|(32,[1,12,15,19,2...|
|       1.0|         1.0|(32,[1,13,17,19,2...|
|       0.0|         0.0|(32,[1,13,15,19,2...|
|       0.0|         0.0|(32,[0,13,16,19,2...|
|       0.0|         0.0|(32,[1,13,15,19,2...|
|       0.0|         0.0|(32,[1,13,15,19,2...|
|       0.0|         0.0|(32,[0,13,16,19,2...|
|       0.0|         0.0|(32,[1,12,15,19,2...|
|       0.0|         0.0|(32,[4,12,15,19,2...|
|       0.0|         0.0|(32,[4,13,15,19,2...|
|       0.0|         0.0|(32,[4,13,15,19,2...|
|       1.0|         1.0|(32,[3,13,15,19,2...|
|       1.0|         1.0|(32,[1,13,15,19,2...|
|       1.0|         1.0|(32,[0,12,16,20,2...|
|       0.0|         0.0|(32,[2,13,15,19,2...|
|       0.0|         0.0|(32,[2,13,18,19,2...|
|       0.0| 

# Evaluating accuracy of NaiveBayes.

In [43]:
nb_accuracy = evaluator.evaluate(nb_prediction)
print("Accuracy of NaiveBayes is  = %g"% (nb_accuracy))
print("Test Error of NaiveBayes  = %g " % (1.0 - nb_accuracy))

Accuracy of NaiveBayes is  = 1
Test Error of NaiveBayes  = 0


In [106]:
from pyspark.mllib.evaluation import BinaryClassificationMetrics
out = nb_model.transform(testData)\
 .select("prediction","loan_indexed")\
 .rdd.map(lambda x: (float(x[0]), float(x[1])))
nb_metrics = BinaryClassificationMetrics(out)


In [122]:
print("areaUnderPR"+" : "+str(nb_metrics.areaUnderPR))

areaUnderPR : 1.0


In [123]:
print("areaUnderROC"+" : "+str(nb_metrics.areaUnderROC))

areaUnderROC : 1.0


# Support Vector Machine

In [44]:
from pyspark.ml.classification import LinearSVC
svm = LinearSVC(labelCol="loan_indexed", featuresCol="rawFeatures")
svm_model = svm.fit(trainingData)
svm_prediction = svm_model.transform(testData)
svm_prediction.select("prediction", "loan_indexed", "rawFeatures").show()

+----------+------------+--------------------+
|prediction|loan_indexed|         rawFeatures|
+----------+------------+--------------------+
|       0.0|         0.0|(32,[7,13,15,19,2...|
|       0.0|         0.0|(32,[7,13,15,19,2...|
|       0.0|         0.0|(32,[1,12,15,19,2...|
|       1.0|         1.0|(32,[1,13,17,19,2...|
|       0.0|         0.0|(32,[1,13,15,19,2...|
|       0.0|         0.0|(32,[0,13,16,19,2...|
|       0.0|         0.0|(32,[1,13,15,19,2...|
|       0.0|         0.0|(32,[1,13,15,19,2...|
|       0.0|         0.0|(32,[0,13,16,19,2...|
|       0.0|         0.0|(32,[1,12,15,19,2...|
|       0.0|         0.0|(32,[4,12,15,19,2...|
|       0.0|         0.0|(32,[4,13,15,19,2...|
|       0.0|         0.0|(32,[4,13,15,19,2...|
|       1.0|         1.0|(32,[3,13,15,19,2...|
|       1.0|         1.0|(32,[1,13,15,19,2...|
|       1.0|         1.0|(32,[0,12,16,20,2...|
|       0.0|         0.0|(32,[2,13,15,19,2...|
|       0.0|         0.0|(32,[2,13,18,19,2...|
|       0.0| 

# Evaluating accuracy of Support Vector Machine.

In [45]:
svm_accuracy = evaluator.evaluate(svm_prediction)
print("Accuracy of Support Vector Machine is = %g"% (svm_accuracy))
print("Test Error of Support Vector Machine = %g " % (1.0 - svm_accuracy))

Accuracy of Support Vector Machine is = 1
Test Error of Support Vector Machine = 0


In [109]:
from pyspark.mllib.evaluation import BinaryClassificationMetrics
out = svm_model.transform(testData)\
 .select("prediction","loan_indexed")\
 .rdd.map(lambda x: (float(x[0]), float(x[1])))
svm_metrics = BinaryClassificationMetrics(out)


In [120]:
print("areaUnderPR"+" : "+str(svm_metrics.areaUnderPR))

areaUnderPR : 1.0


In [121]:
print("areaUnderROC"+" : "+str(svm_metrics.areaUnderROC))

areaUnderROC : 1.0


# Hyperparameter tuning and CrossValidation

In [46]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Candidate hyperparameter grids; only the first one is used to build the grid below
hyperparameters = [
    {'regParam': [0.1, 0.01, 0.001], 'elasticNetParam': [0.0, 0.5, 1.0]},
    {'regParam': [0.1, 0.01, 0.001], 'elasticNetParam': [0.0, 0.5, 1.0], 'maxIter': [10, 50, 100]}
]

In [47]:
param_grid = ParamGridBuilder().addGrid(lr.regParam, hyperparameters[0]['regParam'])\
                               .addGrid(lr.elasticNetParam, hyperparameters[0]['elasticNetParam'])\
                               .build()
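
`ParamGridBuilder` expands the listed values into their Cartesian product, and `CrossValidator` fits one model per parameter map per fold before refitting the best setting on the full training data. The sketch below reproduces that expansion in plain Python to show how many fits the search implies. The helper `expand_grid` is hypothetical, and `num_folds` mirrors the `numFolds=2` used by the `CrossValidator` in this notebook.

```python
from itertools import product

def expand_grid(grid):
    """Return every combination of the listed values as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]

grid = {'regParam': [0.1, 0.01, 0.001], 'elasticNetParam': [0.0, 0.5, 1.0]}
combos = expand_grid(grid)
num_folds = 2  # same value as the CrossValidator in this notebook
print(len(combos), "parameter maps,", len(combos) * num_folds, "fits during cross-validation")
```

Adding `maxIter` with three values, as in the second dictionary, would triple both numbers.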

In [48]:
cv = CrossValidator(estimator=lr, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=2)

In [49]:
model = cv.fit(trainingData)

In [50]:
model.params

[Param(parent='CrossValidatorModel_3c0bffd389f8', name='estimator', doc='estimator to be cross-validated'), Param(parent='CrossValidatorModel_3c0bffd389f8', name='estimatorParamMaps', doc='estimator param maps'), Param(parent='CrossValidatorModel_3c0bffd389f8', name='evaluator', doc='evaluator used to select hyper-parameters that maximize the validator metric'), Param(parent='CrossValidatorModel_3c0bffd389f8', name='seed', doc='random seed.')]


In [51]:
model.bestModel

LogisticRegressionModel: uid = LogisticRegression_8f5adf87b2e3, numClasses = 2, numFeatures = 32


In [52]:
predictions = model.transform(testData)

accuracy = evaluator.evaluate(predictions)
print("Accuracy: ", accuracy)

Accuracy:  1.0


In [115]:
from pyspark.mllib.evaluation import BinaryClassificationMetrics
out = model.transform(testData)\
 .select("prediction","loan_indexed")\
 .rdd.map(lambda x: (float(x[0]), float(x[1])))
cv_metrics = BinaryClassificationMetrics(out)


In [118]:
print("areaUnderPR"+" : "+str(cv_metrics.areaUnderPR))

areaUnderPR : 1.0


In [119]:
print("areaUnderROC"+" : "+str(cv_metrics.areaUnderROC))

areaUnderROC : 1.0


In [151]:
metrics=[lr_metrics,dt_metrics,rf_metrics,gbt_metrics,nb_metrics,svm_metrics,cv_metrics]

In [152]:
models=["logistic regression","decisiontree classification","randomforest classification","GBT classification","NaiveBayes classification","SVM classification","CV_Logistic regression"]

In [153]:
from pyspark.sql.types import StructType,StructField,DoubleType,StringType
metrics_schema = StructType([
    StructField("Model", StringType(), True),
    StructField("areaUnderPR", DoubleType(), True),
    StructField("areaUnderROC", DoubleType(), True),
])

In [163]:
# Build one [model name, areaUnderPR, areaUnderROC] row per model
metrics_data = [
    [name, m.areaUnderPR, m.areaUnderROC]
    for name, m in zip(models, metrics)
]
print(metrics_data)

[['logistic regression', 1.0, 1.0], ['decisiontree classification', 1.0, 1.0], ['randomforest classification', 1.0, 1.0], ['GBT classification', 1.0, 1.0], ['NaiveBayes classification', 1.0, 1.0], ['SVM classification', 1.0, 1.0], ['CV_Logistic regression', 1.0, 1.0]]


In [164]:
metrics_df=spark.createDataFrame(data=metrics_data,schema=metrics_schema)

In [167]:
metrics_df.show(truncate=35)

+---------------------------+-----------+------------+
|                      Model|areaUnderPR|areaUnderROC|
+---------------------------+-----------+------------+
|        logistic regression|        1.0|         1.0|
|decisiontree classification|        1.0|         1.0|
|randomforest classification|        1.0|         1.0|
|         GBT classification|        1.0|         1.0|
|  NaiveBayes classification|        1.0|         1.0|
|         SVM classification|        1.0|         1.0|
|     CV_Logistic regression|        1.0|         1.0|
+---------------------------+-----------+------------+


In [169]:
metrics_df.repartition(1).write.mode('overwrite').parquet('s3://aghar-awsglue-capstone/Bank_Data/output_metrics/')