
# Tree Methods

In this lecture we will code along with some data and test out three different tree methods:

* A single decision tree
* A random forest
* A gradient boosted tree classifier
    
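We will use a college dataset to classify colleges as Private or Public based on the features below. In the CSV loaded later, the dots in these column names are replaced by underscores (for example, F.Undergrad appears as F_Undergrad), and there is an extra School column holding the college name: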

    Private A factor with levels No and Yes indicating private or public university
    Apps Number of applications received
    Accept Number of applications accepted
    Enroll Number of new students enrolled
    Top10perc Pct. new students from top 10% of H.S. class
    Top25perc Pct. new students from top 25% of H.S. class
    F.Undergrad Number of fulltime undergraduates
    P.Undergrad Number of parttime undergraduates
    Outstate Out-of-state tuition
    Room.Board Room and board costs
    Books Estimated book costs
    Personal Estimated personal spending
    PhD Pct. of faculty with Ph.D.’s
    Terminal Pct. of faculty with terminal degree
    S.F.Ratio Student/faculty ratio
    perc.alumni Pct. alumni who donate
    Expend Instructional expenditure per student
    Grad.Rate Graduation rate

In [56]:
pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [57]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Trees').getOrCreate()

In [58]:
df = spark.read.csv('/content/drive/MyDrive/Pyspark Practice/Python-and-Spark-for-Big-Data-master/Spark_for_Machine_Learning/Tree_Methods/College.csv',
                    inferSchema = True, header = True)

In [59]:
df.printSchema()

root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)



In [60]:
df.describe().show()

+-------+--------------------+-------+------------------+------------------+----------------+------------------+------------------+-----------------+-----------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+----------------+------------------+
|summary|              School|Private|              Apps|            Accept|          Enroll|         Top10perc|         Top25perc|      F_Undergrad|      P_Undergrad|          Outstate|        Room_Board|             Books|          Personal|               PhD|          Terminal|         S_F_Ratio|       perc_alumni|          Expend|         Grad_Rate|
+-------+--------------------+-------+------------------+------------------+----------------+------------------+------------------+-----------------+-----------------+------------------+------------------+------------------+------------------+------------------+------------------+-------

In [61]:
for i in df.take(3):
  print(i,'\n')

Row(School='Abilene Christian University', Private='Yes', Apps=1660, Accept=1232, Enroll=721, Top10perc=23, Top25perc=52, F_Undergrad=2885, P_Undergrad=537, Outstate=7440, Room_Board=3300, Books=450, Personal=2200, PhD=70, Terminal=78, S_F_Ratio=18.1, perc_alumni=12, Expend=7041, Grad_Rate=60) 

Row(School='Adelphi University', Private='Yes', Apps=2186, Accept=1924, Enroll=512, Top10perc=16, Top25perc=29, F_Undergrad=2683, P_Undergrad=1227, Outstate=12280, Room_Board=6450, Books=750, Personal=1500, PhD=29, Terminal=30, S_F_Ratio=12.2, perc_alumni=16, Expend=10527, Grad_Rate=56) 

Row(School='Adrian College', Private='Yes', Apps=1428, Accept=1097, Enroll=336, Top10perc=22, Top25perc=50, F_Undergrad=1036, P_Undergrad=99, Outstate=11250, Room_Board=3750, Books=400, Personal=1165, PhD=53, Terminal=66, S_F_Ratio=12.9, perc_alumni=30, Expend=8735, Grad_Rate=54) 



In [62]:
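from pyspark.sql.functions import isnull

# Count nulls column by column: the filter must use the loop variable i,
# otherwise every iteration re-checks the 'School' column.
for i in df.columns:
  nulls_count = df.filter(isnull(df[i])).count()
  print(f'Total Null values in the {i} columns:', nulls_count)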

Total Null values in the School columns: 0
Total Null values in the Private columns: 0
Total Null values in the Apps columns: 0
Total Null values in the Accept columns: 0
Total Null values in the Enroll columns: 0
Total Null values in the Top10perc columns: 0
Total Null values in the Top25perc columns: 0
Total Null values in the F_Undergrad columns: 0
Total Null values in the P_Undergrad columns: 0
Total Null values in the Outstate columns: 0
Total Null values in the Room_Board columns: 0
Total Null values in the Books columns: 0
Total Null values in the Personal columns: 0
Total Null values in the PhD columns: 0
Total Null values in the Terminal columns: 0
Total Null values in the S_F_Ratio columns: 0
Total Null values in the perc_alumni columns: 0
Total Null values in the Expend columns: 0
Total Null values in the Grad_Rate columns: 0


In [63]:
df.columns

['School',
 'Private',
 'Apps',
 'Accept',
 'Enroll',
 'Top10perc',
 'Top25perc',
 'F_Undergrad',
 'P_Undergrad',
 'Outstate',
 'Room_Board',
 'Books',
 'Personal',
 'PhD',
 'Terminal',
 'S_F_Ratio',
 'perc_alumni',
 'Expend',
 'Grad_Rate']

In [64]:
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(outputCol='features', inputCols = ['Apps',
 'Accept',
 'Enroll',
 'Top10perc',
 'Top25perc',
 'F_Undergrad',
 'P_Undergrad',
 'Outstate',
 'Room_Board',
 'Books',
 'Personal',
 'PhD',
 'Terminal',
 'S_F_Ratio',
 'perc_alumni',
 'Expend',
 'Grad_Rate'])

In [65]:
out = assembler.transform(df)
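
The hand-typed inputCols list above is every column except School (the college name) and Private (the label), so it could also be derived from df.columns. The plain-Python sketch below illustrates that derivation and what the assembler conceptually does to one row: it packs the selected values, in order, into a single vector of doubles. The column names and the first row are copied from the outputs shown earlier; feature_cols and assembled are illustrative names, and the code only mimics VectorAssembler rather than calling it.

```python
# Column names as printed by df.columns earlier in this notebook.
columns = ['School', 'Private', 'Apps', 'Accept', 'Enroll', 'Top10perc',
           'Top25perc', 'F_Undergrad', 'P_Undergrad', 'Outstate', 'Room_Board',
           'Books', 'Personal', 'PhD', 'Terminal', 'S_F_Ratio', 'perc_alumni',
           'Expend', 'Grad_Rate']

# Everything except the school name and the label is a numeric feature.
feature_cols = [c for c in columns if c not in ('School', 'Private')]

# First row of the dataset, as shown by df.take(3).
first_row = {
    'School': 'Abilene Christian University', 'Private': 'Yes',
    'Apps': 1660, 'Accept': 1232, 'Enroll': 721, 'Top10perc': 23,
    'Top25perc': 52, 'F_Undergrad': 2885, 'P_Undergrad': 537,
    'Outstate': 7440, 'Room_Board': 3300, 'Books': 450, 'Personal': 2200,
    'PhD': 70, 'Terminal': 78, 'S_F_Ratio': 18.1, 'perc_alumni': 12,
    'Expend': 7041, 'Grad_Rate': 60,
}

# Conceptually, assembling concatenates the selected values into one
# vector of doubles, preserving the order of feature_cols.
assembled = [float(first_row[c]) for c in feature_cols]
print(len(feature_cols), assembled)
```

In the notebook itself, the derived list could be passed directly as VectorAssembler(inputCols=feature_cols, outputCol='features').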

In [66]:
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol = 'Private', outputCol = 'private_index')
op = indexer.fit(out).transform(out)

In [67]:
op.printSchema()

root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)
 |-- features: vector (nullable = true)
 |-- private_index: double (nullable = false)



In [68]:
final_data = op.select('features', 'private_index')
final_data.show(truncate=False)

+----------------------------------------------------------------------------------------------------------+-------------+
|features                                                                                                  |private_index|
+----------------------------------------------------------------------------------------------------------+-------------+
|[1660.0,1232.0,721.0,23.0,52.0,2885.0,537.0,7440.0,3300.0,450.0,2200.0,70.0,78.0,18.1,12.0,7041.0,60.0]   |0.0          |
|[2186.0,1924.0,512.0,16.0,29.0,2683.0,1227.0,12280.0,6450.0,750.0,1500.0,29.0,30.0,12.2,16.0,10527.0,56.0]|0.0          |
|[1428.0,1097.0,336.0,22.0,50.0,1036.0,99.0,11250.0,3750.0,400.0,1165.0,53.0,66.0,12.9,30.0,8735.0,54.0]   |0.0          |
|[417.0,349.0,137.0,60.0,89.0,510.0,63.0,12960.0,5450.0,450.0,875.0,92.0,97.0,7.7,37.0,19016.0,59.0]       |0.0          |
|[193.0,146.0,55.0,16.0,44.0,249.0,869.0,7560.0,4120.0,800.0,1500.0,76.0,72.0,11.9,2.0,10922.0,15.0]       |0.0          |
|[587.0,479.0,15

In [69]:
final_data.groupBy('private_index').count().show()

+-------------+-----+
|private_index|count|
+-------------+-----+
|          0.0|  565|
|          1.0|  212|
+-------------+-----+
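
By default, StringIndexer gives index 0.0 to the most frequent label, which is why Private = 'Yes' maps to 0.0 in the rows above. The counts also give a useful baseline: a model that always predicts the majority class would already be right roughly 73% of the time, so the accuracies reported later should be read against that figure. A minimal plain-Python sketch of both points, using only the counts from the table above (it imitates the default frequency ordering rather than calling StringIndexer; label_to_index and baseline_accuracy are illustrative names):

```python
from collections import Counter

# Class counts taken from the groupBy output above:
# index 0.0 (Private = 'Yes') has 565 rows, index 1.0 ('No') has 212.
label_counts = Counter({'Yes': 565, 'No': 212})

# most_common() sorts by descending frequency, mirroring the default
# ordering in which the most frequent label receives index 0.0.
label_to_index = {
    label: float(i) for i, (label, _) in enumerate(label_counts.most_common())
}

total = sum(label_counts.values())
# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = max(label_counts.values()) / total

print(label_to_index)
print(f'Majority-class baseline accuracy: {baseline_accuracy:.2%}')
```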



Train-Test Split

In [70]:
train_data, test_data = final_data.randomSplit([0.7, 0.3])
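
Note that randomSplit assigns rows to the splits at random, so the weights are target proportions rather than exact sizes, and without a seed each run produces a different split (and slightly different metrics). randomSplit accepts an optional seed argument, for example final_data.randomSplit([0.7, 0.3], seed=42), to make the split repeatable. The sketch below is a simplified plain-Python illustration of the idea, not Spark's actual implementation; random_split and the seed value 42 are assumptions made for the example.

```python
import random

def random_split(rows, train_fraction, seed=None):
    """Assign each row independently to train or test (simplified illustration)."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row is placed by its own random draw, so the final
        # sizes only approximate the requested fraction.
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(777))  # stand-ins for the 777 rows of final_data

# The same seed reproduces the same split.
train_a, test_a = random_split(rows, 0.7, seed=42)
train_b, test_b = random_split(rows, 0.7, seed=42)

print(len(train_a), len(test_a))
print('Same split with same seed:', train_a == train_b)
```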

Importing Tree-Based Algorithms

In [71]:
from pyspark.ml.classification import DecisionTreeClassifier, GBTClassifier, RandomForestClassifier

In [81]:
dtc = DecisionTreeClassifier(featuresCol='features', labelCol='private_index')
rfc = RandomForestClassifier(numTrees=150, featuresCol='features', labelCol='private_index')
gbc = GBTClassifier(featuresCol='features', labelCol='private_index')

In [82]:
dtc_model = dtc.fit(train_data)
rfc_model = rfc.fit(train_data)
gbc_model = gbc.fit(train_data)

### Model Comparison

In [83]:
dtc_pred = dtc_model.transform(test_data)
rfc_pred = rfc_model.transform(test_data)
gbc_pred = gbc_model.transform(test_data)

Evaluation Metrics:

In [84]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [85]:
acc_eval = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='private_index', metricName='accuracy')
recall_eval = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='private_index', metricName='weightedRecall')
precision_eval = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='private_index', metricName='weightedPrecision')

In [86]:
dtc_acc = acc_eval.evaluate(dtc_pred)
dtc_recall = recall_eval.evaluate(dtc_pred)
dtc_precision = precision_eval.evaluate(dtc_pred)

rfc_acc = acc_eval.evaluate(rfc_pred)
rfc_recall = recall_eval.evaluate(rfc_pred)
rfc_precision = precision_eval.evaluate(rfc_pred)

gbc_acc = acc_eval.evaluate(gbc_pred)
gbc_recall = recall_eval.evaluate(gbc_pred)
gbc_precision = precision_eval.evaluate(gbc_pred)

In [87]:
print("Here are the results!")
print('-'*80)
print('A single decision tree had an accuracy of: {0:2.2f}%'.format(dtc_acc*100))
print('-'*80)
print('A random forest ensemble had an accuracy of: {0:2.2f}%'.format(rfc_acc*100))
print('-'*80)
print('An ensemble using GBT had an accuracy of: {0:2.2f}%'.format(gbc_acc*100))

Here are the results!
--------------------------------------------------------------------------------
A single decision tree had an accuracy of: 92.08%
--------------------------------------------------------------------------------
A random forest ensemble had an accuracy of: 95.42%
--------------------------------------------------------------------------------
An ensemble using GBT had an accuracy of: 93.33%


In [88]:
print("Here are the results!")
print('-'*80)
print('A single decision tree had a weighted recall of: {0:2.2f}%'.format(dtc_recall*100))
print('-'*80)
print('A random forest ensemble had a weighted recall of: {0:2.2f}%'.format(rfc_recall*100))
print('-'*80)
print('An ensemble using GBT had a weighted recall of: {0:2.2f}%'.format(gbc_recall*100))

Here are the results!
--------------------------------------------------------------------------------
A single decision tree had a weighted recall of: 92.08%
--------------------------------------------------------------------------------
A random forest ensemble had a weighted recall of: 95.42%
--------------------------------------------------------------------------------
An ensemble using GBT had a weighted recall of: 93.33%
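
The weighted recall values above are identical to the accuracy values, and that is expected rather than a coincidence. Weighting each class's recall by that class's share of the true labels cancels the per-class denominators, leaving (total correct predictions) / (total rows), which is accuracy. Weighted precision uses the same class weights but divides by predicted counts, so it can differ. The sketch below checks this on a made-up confusion matrix; the counts are hypothetical and are not results from this notebook.

```python
# Hypothetical confusion matrix: confusion[true_label][predicted_label] = count.
confusion = {
    0: {0: 150, 1: 20},
    1: {0: 10, 1: 60},
}
labels = sorted(confusion)
total = sum(sum(row.values()) for row in confusion.values())

# Accuracy: correctly classified rows over all rows.
accuracy = sum(confusion[c][c] for c in labels) / total

weighted_recall = 0.0
weighted_precision = 0.0
for c in labels:
    true_count = sum(confusion[c].values())                 # rows whose true label is c
    predicted_count = sum(confusion[t][c] for t in labels)  # rows predicted as c
    weight = true_count / total                             # class share of true labels
    weighted_recall += weight * confusion[c][c] / true_count
    weighted_precision += weight * confusion[c][c] / predicted_count

print(f'accuracy           = {accuracy:.4f}')
print(f'weighted recall    = {weighted_recall:.4f}')
print(f'weighted precision = {weighted_precision:.4f}')
```

Because of this identity, reporting weighted recall alongside accuracy adds no new information; per-class recall or precision would be more informative for an imbalanced label like this one.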


In [89]:
print("Here are the results!")
print('-'*80)
print('A single decision tree had a weighted precision of: {0:2.2f}%'.format(dtc_precision*100))
print('-'*80)
print('A random forest ensemble had a weighted precision of: {0:2.2f}%'.format(rfc_precision*100))
print('-'*80)
print('An ensemble using GBT had a weighted precision of: {0:2.2f}%'.format(gbc_precision*100))

Here are the results!
--------------------------------------------------------------------------------
A single decision tree had a weighted precision of: 92.99%
--------------------------------------------------------------------------------
A random forest ensemble had a weighted precision of: 95.45%
--------------------------------------------------------------------------------
An ensemble using GBT had a weighted precision of: 94.07%
