# Tree Methods - College Data

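Aim - Use a college dataset to classify colleges as Private or Public based on the columns below. `Private` is the label; the remaining columns are the features. (The CSV loaded below names its columns with underscores, e.g. F_Undergrad, rather than the dots shown here.)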

    Private A factor with levels No and Yes indicating private or public university
    Apps Number of applications received
    Accept Number of applications accepted
    Enroll Number of new students enrolled
    Top10perc Pct. new students from top 10% of H.S. class
    Top25perc Pct. new students from top 25% of H.S. class
    F.Undergrad Number of fulltime undergraduates
    P.Undergrad Number of parttime undergraduates
    Outstate Out-of-state tuition
    Room.Board Room and board costs
    Books Estimated book costs
    Personal Estimated personal spending
    PhD Pct. of faculty with Ph.D.’s
    Terminal Pct. of faculty with terminal degree
    S.F.Ratio Student/faculty ratio
    perc.alumni Pct. alumni who donate
    Expend Instructional expenditure per student
    Grad.Rate Graduation rate

Steps to follow:

1. Create a Spark Session and load data
2. Check for missing values (if yes, drop or fill them)
3. Check whether or not data is in the format - label, features (if not, assemble the features using an assembler)
4. Fix the issue of label ('Private') being of string type
5. Split data into training and testing set (7:3)
6. Import DecisionTreeClassifier, GBTClassifier and RandomForestClassifier (or their Regressor counterparts), depending on the aim (classification or regression)
7. Create their instances
8. Create their models by using the instances to train/fit data 
9. Obtain predictions by transforming the test data on the models created in the previous step
10. Import MulticlassClassificationEvaluator and create its instance
11. Use the evaluator instance to get accuracy

In [1]:
# Create spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('tree').getOrCreate()

In [2]:
# Load data
data = spark.read.csv('College.csv',inferSchema=True,header=True)

In [3]:
# Check for any missing values
from pyspark.sql.functions import isnan, isnull, when, count, col

data.select([count(when(isnan(c)| isnull(c), c)).alias(c) for c in data.columns]).show()

+------+-------+----+------+------+---------+---------+-----------+-----------+--------+----------+-----+--------+---+--------+---------+-----------+------+---------+
|School|Private|Apps|Accept|Enroll|Top10perc|Top25perc|F_Undergrad|P_Undergrad|Outstate|Room_Board|Books|Personal|PhD|Terminal|S_F_Ratio|perc_alumni|Expend|Grad_Rate|
+------+-------+----+------+------+---------+---------+-----------+-----------+--------+----------+-----+--------+---+--------+---------+-----------+------+---------+
|     0|      0|   0|     0|     0|        0|        0|          0|          0|       0|         0|    0|       0|  0|       0|        0|          0|     0|        0|
+------+-------+----+------+------+---------+---------+-----------+-----------+--------+----------+-----+--------+---+--------+---------+-----------+------+---------+
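
Every column above reports 0 missing values, so nothing has to be dropped or filled here. For a dataset that does have gaps, the check and the "drop" option from step 2 can be sketched in plain Python as follows. This is only an illustration of the logic, not Spark code; the helper names and the three sample rows are made up.

```python
import math

def is_missing(value):
    """True for None or a float NaN (the two cases the Spark check counts)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def missing_counts(rows):
    """Count missing values per column across a list of row dicts."""
    columns = rows[0].keys()
    return {c: sum(is_missing(r[c]) for r in rows) for c in columns}

def drop_missing(rows):
    """Keep only the rows that have no missing value in any column."""
    return [r for r in rows if not any(is_missing(v) for v in r.values())]

# Hypothetical sample rows, not taken from College.csv
rows = [
    {'Apps': 1660, 'S_F_Ratio': 18.1},
    {'Apps': None, 'S_F_Ratio': 12.2},
    {'Apps': 1428, 'S_F_Ratio': float('nan')},
]

counts = missing_counts(rows)
clean = drop_missing(rows)
print(counts)
print(len(clean))
```

On a Spark DataFrame the corresponding operations are `data.na.drop()` and `data.na.fill(...)`.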



In [4]:
# Check format
data.printSchema()

root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)



In [5]:
data.head()

Row(School='Abilene Christian University', Private='Yes', Apps=1660, Accept=1232, Enroll=721, Top10perc=23, Top25perc=52, F_Undergrad=2885, P_Undergrad=537, Outstate=7440, Room_Board=3300, Books=450, Personal=2200, PhD=70, Terminal=78, S_F_Ratio=18.1, perc_alumni=12, Expend=7041, Grad_Rate=60)

In [6]:
data.head().asDict()
# Note: the keys are displayed in alphabetical order below; this is display order only, the DataFrame keeps its original column order

{'Accept': 1232,
 'Apps': 1660,
 'Books': 450,
 'Enroll': 721,
 'Expend': 7041,
 'F_Undergrad': 2885,
 'Grad_Rate': 60,
 'Outstate': 7440,
 'P_Undergrad': 537,
 'Personal': 2200,
 'PhD': 70,
 'Private': 'Yes',
 'Room_Board': 3300,
 'S_F_Ratio': 18.1,
 'School': 'Abilene Christian University',
 'Terminal': 78,
 'Top10perc': 23,
 'Top25perc': 52,
 'perc_alumni': 12}

In [7]:
# Import VectorAssembler and Vectors
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [8]:
data.columns

['School',
 'Private',
 'Apps',
 'Accept',
 'Enroll',
 'Top10perc',
 'Top25perc',
 'F_Undergrad',
 'P_Undergrad',
 'Outstate',
 'Room_Board',
 'Books',
 'Personal',
 'PhD',
 'Terminal',
 'S_F_Ratio',
 'perc_alumni',
 'Expend',
 'Grad_Rate']

In [9]:
# We will be using everything except the label ('Private') and school name ('School') for features
assembler = VectorAssembler(
  inputCols=['Apps',
             'Accept',
             'Enroll',
             'Top10perc',
             'Top25perc',
             'F_Undergrad',
             'P_Undergrad',
             'Outstate',
             'Room_Board',
             'Books',
             'Personal',
             'PhD',
             'Terminal',
             'S_F_Ratio',
             'perc_alumni',
             'Expend',
             'Grad_Rate'],
  outputCol="features")

In [10]:
output = assembler.transform(data)
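
VectorAssembler concatenates the listed input columns, in the order given by `inputCols`, into a single vector column. A rough plain-Python picture of what happens to one row is below. The `assemble` helper is hypothetical and only three of the feature columns are used to keep it short; the values come from the first row shown earlier.

```python
def assemble(row, input_cols):
    """Collect the chosen columns of one row into a list of floats, in order."""
    return [float(row[c]) for c in input_cols]

# A few fields from the first row of the dataset (see data.head() above)
row = {
    'School': 'Abilene Christian University',
    'Private': 'Yes',
    'Apps': 1660,
    'Accept': 1232,
    'S_F_Ratio': 18.1,
}

# 'School' and 'Private' are left out, as in the assembler above
features = assemble(row, ['Apps', 'Accept', 'S_F_Ratio'])
print(features)
```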

The Private column is of type string ('Yes' or 'No'). We want to convert it to a numeric label (0 or 1).

In [11]:
from pyspark.ml.feature import StringIndexer

In [12]:
indexer = StringIndexer(inputCol="Private", outputCol="PrivateIndex")
output_fixed = indexer.fit(output).transform(output)
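
By default StringIndexer orders labels by descending frequency, so the most common string becomes 0.0 and the next one 1.0. The sketch below imitates that rule in plain Python to show which value ends up as which index. It is not Spark's implementation, the sample labels are made up, and the alphabetical tie-break is an assumption based on the behaviour documented for recent Spark versions.

```python
from collections import Counter

def frequency_index(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count; break ties alphabetically (assumed behaviour)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical sample of the 'Private' column
sample = ['Yes', 'Yes', 'No', 'Yes', 'No']
mapping = frequency_index(sample)
print(mapping)
```

In the notebook itself, the fitted indexer exposes the actual ordering through its `labels` attribute (for example `indexer.fit(output).labels`), which is worth checking before interpreting predictions of 0.0 or 1.0.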

In [13]:
output_fixed.printSchema()

root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)
 |-- features: vector (nullable = true)
 |-- PrivateIndex: double (nullable = false)



In [14]:
final_data = output_fixed.select("features",'PrivateIndex')

In [15]:
train_data,test_data = final_data.randomSplit([0.7,0.3])
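
Two things are worth knowing about this split. The weights are treated as probabilities, so the resulting sizes are only approximately 70% and 30%. And because no seed is passed, the split (and therefore the accuracies further down) will differ from run to run; `randomSplit` accepts an optional `seed` argument to make it repeatable. The plain-Python sketch below imitates the idea with a hypothetical `random_split` helper and 1000 dummy rows; it is not how Spark implements it internally.

```python
import random

def random_split(rows, train_weight=0.7, seed=None):
    """Assign each row independently to train (prob. train_weight) or test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_weight:
            train.append(row)
        else:
            test.append(row)
    return train, test

rows = list(range(1000))  # dummy row ids standing in for the real data

# Same seed, same split
train_a, test_a = random_split(rows, seed=42)
train_b, test_b = random_split(rows, seed=42)

print(len(train_a), len(test_a))  # close to 700 and 300, not exactly
print(train_a == train_b)
```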

In [16]:
from pyspark.ml.classification import (DecisionTreeClassifier,GBTClassifier,
                                       RandomForestClassifier)

Create all three instances:

In [17]:
# Used with default parameters to make a fair comparison between the 3 techniques

dtc = DecisionTreeClassifier(labelCol='PrivateIndex',featuresCol='features')
rfc = RandomForestClassifier(labelCol='PrivateIndex',featuresCol='features')
gbt = GBTClassifier(labelCol='PrivateIndex',featuresCol='features')

Train all three models:

In [18]:
# Train the models (there are three, so it might take some time)
dtc_model = dtc.fit(train_data)
rfc_model = rfc.fit(train_data)
gbt_model = gbt.fit(train_data)

In [19]:
dtc_predictions = dtc_model.transform(test_data)
rfc_predictions = rfc_model.transform(test_data)
gbt_predictions = gbt_model.transform(test_data)

In [20]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [21]:
acc_evaluator = MulticlassClassificationEvaluator(labelCol="PrivateIndex", predictionCol="prediction", metricName="accuracy")

In [22]:
dtc_acc = acc_evaluator.evaluate(dtc_predictions)
rfc_acc = acc_evaluator.evaluate(rfc_predictions)
gbt_acc = acc_evaluator.evaluate(gbt_predictions)
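
With `metricName="accuracy"`, the evaluator returns the fraction of test rows whose prediction equals the label. A minimal plain-Python version of that calculation, using made-up labels and predictions, looks like this:

```python
def accuracy(labels, predictions):
    """Fraction of positions where the prediction matches the label."""
    if len(labels) != len(predictions):
        raise ValueError('labels and predictions must have the same length')
    if not labels:
        raise ValueError('cannot compute accuracy of an empty set')
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

# Hypothetical values, not results from the models above
labels = [0.0, 1.0, 0.0, 0.0, 1.0]
predictions = [0.0, 1.0, 1.0, 0.0, 1.0]

acc = accuracy(labels, predictions)
print('Accuracy: {0:2.2f}%'.format(acc * 100))
```

Accuracy alone can look flattering when one class is much more common than the other, so it is worth reading it alongside the class balance of the test set.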

In [23]:
print('A single decision tree had an accuracy of: {0:2.2f}%'.format(dtc_acc*100))
print('A random forest ensemble had an accuracy of: {0:2.2f}%'.format(rfc_acc*100))
print('An ensemble using GBT had an accuracy of: {0:2.2f}%'.format(gbt_acc*100))

A single decision tree had an accuracy of: 90.87%
A random forest ensemble had an accuracy of: 92.17%
An ensemble using GBT had an accuracy of: 91.74%


-------------------------------------------------------------------------------