# College Classification

    
The task is to use a college dataset to classify colleges as private or public based on the following features (Private is the label to predict):

    Private A factor with levels No and Yes indicating whether the university is private
    Apps Number of applications received
    Accept Number of applications accepted
    Enroll Number of new students enrolled
    Top10perc Pct. new students from top 10% of H.S. class
    Top25perc Pct. new students from top 25% of H.S. class
    F.Undergrad Number of fulltime undergraduates
    P.Undergrad Number of parttime undergraduates
    Outstate Out-of-state tuition
    Room.Board Room and board costs
    Books Estimated book costs
    Personal Estimated personal spending
    PhD Pct. of faculty with Ph.D.’s
    Terminal Pct. of faculty with terminal degree
    S.F.Ratio Student/faculty ratio
    perc.alumni Pct. alumni who donate
    Expend Instructional expenditure per student
    Grad.Rate Graduation rate
    
Three tree-based methods are tested:

* A single decision tree
* A random forest
* A gradient boosted tree classifier

In [2]:
import findspark
findspark.init('/home/matt/spark-3.1.1-bin-hadoop2.7')

In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('college').getOrCreate()

In [4]:
# Load training data
df = spark.read.csv('College.csv',inferSchema=True,header=True)

In [5]:
df.printSchema()

root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)



In [6]:
df.show(5)

+--------------------+-------+----+------+------+---------+---------+-----------+-----------+--------+----------+-----+--------+---+--------+---------+-----------+------+---------+
|              School|Private|Apps|Accept|Enroll|Top10perc|Top25perc|F_Undergrad|P_Undergrad|Outstate|Room_Board|Books|Personal|PhD|Terminal|S_F_Ratio|perc_alumni|Expend|Grad_Rate|
+--------------------+-------+----+------+------+---------+---------+-----------+-----------+--------+----------+-----+--------+---+--------+---------+-----------+------+---------+
|Abilene Christian...|    Yes|1660|  1232|   721|       23|       52|       2885|        537|    7440|      3300|  450|    2200| 70|      78|     18.1|         12|  7041|       60|
|  Adelphi University|    Yes|2186|  1924|   512|       16|       29|       2683|       1227|   12280|      6450|  750|    1500| 29|      30|     12.2|         16| 10527|       56|
|      Adrian College|    Yes|1428|  1097|   336|       22|       50|       1036|         99|  

## Feature engineering

In [7]:
# Import VectorAssembler and Vectors
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [8]:
df.columns

['School',
 'Private',
 'Apps',
 'Accept',
 'Enroll',
 'Top10perc',
 'Top25perc',
 'F_Undergrad',
 'P_Undergrad',
 'Outstate',
 'Room_Board',
 'Books',
 'Personal',
 'PhD',
 'Terminal',
 'S_F_Ratio',
 'perc_alumni',
 'Expend',
 'Grad_Rate']

In [23]:
assembler = VectorAssembler(
  inputCols=['Apps',
             'Accept',
             'Enroll',
             'Top10perc',
             'Top25perc',
             'F_Undergrad',
             'P_Undergrad',
             'Outstate',
             'Room_Board',
             'Books',
             'Personal',
             'PhD',
             'Terminal',
             'S_F_Ratio',
             'perc_alumni',
             'Expend',
             'Grad_Rate'],
              outputCol="features")

In [24]:
output = assembler.transform(df)

In [25]:
# Encode the Private column (the labels) as a numeric index column
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol="Private", outputCol="PrivateIndex")
indexed = indexer.fit(output).transform(output)

In [26]:
final_data = indexed.select("features",'PrivateIndex')

### Scale feature data

In [27]:
#scale features
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=False)
# Compute summary statistics by fitting the StandardScaler
scalerModel = scaler.fit(final_data)
# Normalize each feature to have unit standard deviation.
scaled_final_data = scalerModel.transform(final_data)

In [28]:
scaled_final_data.show(5)

+--------------------+------------+--------------------+
|            features|PrivateIndex|      scaledFeatures|
+--------------------+------------+--------------------+
|[1660.0,1232.0,72...|         0.0|[0.42891823763594...|
|[2186.0,1924.0,51...|         0.0|[0.56482847438082...|
|[1428.0,1097.0,33...|         0.0|[0.36897303815911...|
|[417.0,349.0,137....|         0.0|[0.10774632836999...|
|[193.0,146.0,55.0...|         0.0|[0.04986820473719...|
+--------------------+------------+--------------------+
only showing top 5 rows
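
With `withMean=False` and `withStd=True`, the scaler divides each feature by its standard deviation without centering it. As a quick sanity check, the first two `Apps` values printed above should imply the same divisor. The sketch below is plain Python using the (truncated) values from the output; the variable names are illustrative.

```python
# Raw and scaled Apps values for the first two rows, copied from the output above.
# The scaled values are truncated in the display, so the check is approximate.
raw_apps = [1660.0, 2186.0]
scaled_apps = [0.42891823763594, 0.56482847438082]

# With withMean=False, scaled = raw / std, so raw / scaled recovers the std.
implied_std = [raw / scaled for raw, scaled in zip(raw_apps, scaled_apps)]
print(implied_std)  # the two entries should agree closely
```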



In [29]:
# Split the data: 70% training, 30% test
train_data,test_data = scaled_final_data.randomSplit([0.7,0.3])

## Train model(s) using pipelines

In [30]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier,GBTClassifier,RandomForestClassifier

Create all three models:

In [31]:
# Defaults are used except for numTrees (default 20) and maxIter (default 20)
dtc = DecisionTreeClassifier(labelCol='PrivateIndex',featuresCol='scaledFeatures')
rfc = RandomForestClassifier(labelCol='PrivateIndex',featuresCol='scaledFeatures',numTrees=75)
gbt = GBTClassifier(labelCol='PrivateIndex',featuresCol='scaledFeatures',maxIter=150)

Train all three models:

In [32]:
dtc_model = dtc.fit(train_data)
rfc_model = rfc.fit(train_data)
gbt_model = gbt.fit(train_data)

## Model Comparison

In [33]:
dtc_predictions = dtc_model.transform(test_data)
rfc_predictions = rfc_model.transform(test_data)
gbt_predictions = gbt_model.transform(test_data)

## Evaluate models

In [34]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [35]:
# Compare the prediction column with the true label and report accuracy (test error = 1 - accuracy)
acc_evaluator = MulticlassClassificationEvaluator(labelCol="PrivateIndex", predictionCol="prediction", metricName="accuracy")

In [36]:
dtc_acc = acc_evaluator.evaluate(dtc_predictions)
rfc_acc = acc_evaluator.evaluate(rfc_predictions)
gbt_acc = acc_evaluator.evaluate(gbt_predictions)
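
For reference, the accuracy metric is the fraction of test rows whose prediction equals the true label, and the test error is its complement. A plain-Python sketch with hypothetical (prediction, label) pairs:

```python
# Hypothetical (prediction, label) pairs standing in for rows of a predictions DataFrame.
pairs = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.0, 0.0), (1.0, 1.0)]

# Accuracy is the share of rows where the prediction matches the label.
accuracy = sum(pred == label for pred, label in pairs) / len(pairs)
test_error = 1.0 - accuracy
print(f"accuracy={accuracy:.2f}, test error={test_error:.2f}")
```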

In [37]:
print("Here are the results!")
print('-'*80)
print('A single decision tree had an accuracy of: {0:2.2f}%'.format(dtc_acc*100))
print('-'*80)
print('A random forest ensemble had an accuracy of: {0:2.2f}%'.format(rfc_acc*100))
print('-'*80)
print('An ensemble using GBT had an accuracy of: {0:2.2f}%'.format(gbt_acc*100))

Here are the results!
--------------------------------------------------------------------------------
A single decision tree had an accuracy of: 89.30%
--------------------------------------------------------------------------------
A random forest ensemble had an accuracy of: 94.24%
--------------------------------------------------------------------------------
An ensemble using GBT had an accuracy of: 87.65%


In [38]:
# check models to see if any feature is a good predictor of college type
dtc_model.featureImportances

SparseVector(17, {2: 0.0099, 3: 0.0086, 4: 0.0239, 5: 0.508, 6: 0.0411, 7: 0.3256, 10: 0.01, 11: 0.0266, 13: 0.027, 16: 0.0192})

In [39]:
rfc_model.featureImportances

SparseVector(17, {0: 0.0301, 1: 0.0656, 2: 0.1167, 3: 0.0141, 4: 0.0104, 5: 0.2012, 6: 0.1018, 7: 0.2114, 8: 0.0446, 9: 0.0061, 10: 0.0073, 11: 0.0211, 12: 0.0195, 13: 0.0631, 14: 0.0144, 15: 0.0357, 16: 0.0369})

In [40]:
gbt_model.featureImportances

SparseVector(17, {0: 0.0221, 1: 0.0222, 2: 0.0068, 3: 0.0238, 4: 0.0111, 5: 0.3818, 6: 0.0809, 7: 0.2437, 8: 0.025, 9: 0.0038, 10: 0.0067, 11: 0.0626, 12: 0.0176, 13: 0.0138, 14: 0.0159, 15: 0.0022, 16: 0.06})

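The number of full-time undergraduates (feature 5, F_Undergrad) and out-of-state tuition (feature 7, Outstate) carry the highest importance in all three models. F_Undergrad ranks first for the single decision tree and the GBT classifier, while Outstate is marginally ahead in the random forest. Feature indices follow the order of inputCols passed to the VectorAssembler, so Enroll is index 2, not 5.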
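
Importance indices follow the order of `inputCols` given to the `VectorAssembler`, so mapping them back to column names makes the vectors easier to read. The sketch below is plain Python and uses the decision tree importances copied from the output above; `ranked` is an illustrative name.

```python
# Column order passed to VectorAssembler's inputCols in the notebook.
feature_names = ['Apps', 'Accept', 'Enroll', 'Top10perc', 'Top25perc',
                 'F_Undergrad', 'P_Undergrad', 'Outstate', 'Room_Board',
                 'Books', 'Personal', 'PhD', 'Terminal', 'S_F_Ratio',
                 'perc_alumni', 'Expend', 'Grad_Rate']

# Decision tree importances copied from the SparseVector output (index -> weight).
dtc_importances = {2: 0.0099, 3: 0.0086, 4: 0.0239, 5: 0.508, 6: 0.0411,
                   7: 0.3256, 10: 0.01, 11: 0.0266, 13: 0.027, 16: 0.0192}

# Pair each weight with its column name and sort from most to least important.
ranked = sorted(((feature_names[i], w) for i, w in dtc_importances.items()),
                key=lambda pair: pair[1], reverse=True)
for name, weight in ranked[:3]:
    print(f"{name}: {weight:.4f}")
```

With a fitted model in hand, `model.featureImportances.toArray()` returns a dense array that can be zipped with `feature_names` in the same way.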