## Introduction

This article tackles a real-world problem for students: classifying a **university as Private or Public** based on the features we feed to the model. The models are trained with **various tree methods**, which we discuss later on. In a nutshell, we will be working with PySpark and its **MLlib** library (**the machine learning library of PySpark**).


## About the dataset

We are using the well-known **Private vs. Public Universities** dataset, which has **17 features** that serve as **the independent columns** and a target column named **Private** (a categorical column holding "**Yes**" for **Private** and "**No**" for **Public**). The **School** column, which holds the university name, is not used as a feature.

**Here is a brief description of all the columns:**

1. **Private** is our target column; its two values, **Yes/No**, mark the university as **private/public** respectively.
2. **Apps** is the number of applications received.
3. **Accept** is the number of applications accepted.
4. **Enroll** is the number of new students who enrolled.
5. **Top10perc** is the percentage of new students from the top 10% of their high school class.
6. **Top25perc** is the percentage of new students from the top 25% of their high school class.
7. **F_Undergrad** holds the number of full-time undergraduates.
8. **P_Undergrad** holds the number of part-time undergraduates.
9. **Outstate** holds the out-of-state tuition.
10. **Room_Board** is the room and board cost.
11. **Books** is the estimated cost of books.
12. **Personal** stores the estimated personal spending of students.
13. **PhD** holds the percentage of faculty with a PhD.
14. **Terminal** holds the percentage of faculty with a terminal degree.
15. **S_F_Ratio** is the student/faculty ratio.
16. **perc_alumni** holds the percentage of alumni who donate.
17. **Expend** is the instructional expenditure per student.
18. **Grad_Rate** holds the graduation rate.
    
**To build a good model for our problem statement, we will use the following tree methods:**

* A single decision tree
* A random forest
* A gradient boosted tree classifier

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)
Collecting py4j==0.10.9.3
  Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.3 pyspark-3.2.1


## Initializing the Spark Session

Before we can work with **PySpark**, we need to start a **Spark session**. This mandatory step **sets up the environment** and makes Spark's resources available to us.

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('treecode').getOrCreate()

**Inference:** To initialize the session we imported the **SparkSession** class from the **pyspark.sql** module. Then, to create the Apache Spark environment, we used the **builder** attribute and the **getOrCreate** function of **SparkSession**.

## Reading the dataset

Let's now read our dataset with PySpark's **read.csv** function, so that we can later predict from the given features whether a college is **Private** or **Public**.

**Note:** The first parameter is the path of the dataset. The second and third parameters, **inferSchema and header, are set to True**: header treats the first line as the column names, and inferSchema lets Spark detect each column's data type instead of reading everything as strings.

In [5]:
data = spark.read.csv('College.csv',inferSchema=True,header=True)
data.show()

+--------------------+-------+----+------+------+---------+---------+-----------+-----------+--------+----------+-----+--------+---+--------+---------+-----------+------+---------+
|              School|Private|Apps|Accept|Enroll|Top10perc|Top25perc|F_Undergrad|P_Undergrad|Outstate|Room_Board|Books|Personal|PhD|Terminal|S_F_Ratio|perc_alumni|Expend|Grad_Rate|
+--------------------+-------+----+------+------+---------+---------+-----------+-----------+--------+----------+-----+--------+---+--------+---------+-----------+------+---------+
|Abilene Christian...|    Yes|1660|  1232|   721|       23|       52|       2885|        537|    7440|      3300|  450|    2200| 70|      78|     18.1|         12|  7041|       60|
|  Adelphi University|    Yes|2186|  1924|   512|       16|       29|       2683|       1227|   12280|      6450|  750|    1500| 29|      30|     12.2|         16| 10527|       56|
|      Adrian College|    Yes|1428|  1097|   336|       22|       50|       1036|         99|  

**Inference:** The output shows the data in tabular form; by default, **show** displays the **top 20 rows**. Take note of our target column, **Private**.

In [8]:
data.describe().show()

+-------+--------------------+-------+------------------+------------------+----------------+------------------+------------------+-----------------+-----------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+----------------+------------------+
|summary|              School|Private|              Apps|            Accept|          Enroll|         Top10perc|         Top25perc|      F_Undergrad|      P_Undergrad|          Outstate|        Room_Board|             Books|          Personal|               PhD|          Terminal|         S_F_Ratio|       perc_alumni|          Expend|         Grad_Rate|
+-------+--------------------+-------+------------------+------------------+----------------+------------------+------------------+-----------------+-----------------+------------------+------------------+------------------+------------------+------------------+------------------+-------

**Inference:** The **describe** function of **PySpark** gives us **brief statistical information** about the dataset, which is quite informative. We can also see that the count of every column is the same, i.e. **777**, which indicates that **there are no null values**.
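The reasoning behind that check can be made explicit: the count reported by `describe` covers only non-null values, so a column's count matches the row count only when nothing is missing. The plain-Python sketch below mimics that comparison on three toy rows. The `non_null_counts` helper is a hypothetical stand-in rather than a PySpark function, and one `Accept` value has been blanked out on purpose to show the effect.

```python
# Toy rows standing in for a DataFrame; None marks a missing value.
# The second row's Accept value is deliberately set to None for illustration.
rows = [
    {"Apps": 1660, "Accept": 1232, "Enroll": 721},
    {"Apps": 2186, "Accept": None, "Enroll": 512},
    {"Apps": 1428, "Accept": 1097, "Enroll": 336},
]


def non_null_counts(rows):
    """Count the non-missing values in each column (hypothetical helper)."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (value is not None)
    return counts


counts = non_null_counts(rows)

# A column has missing values exactly when its count is below the row count.
columns_with_nulls = [c for c, n in counts.items() if n < len(rows)]

print(counts)
print(columns_with_nulls)
```

On the real dataset every count equals 777, so the corresponding list would be empty.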

In [None]:
data.printSchema()

root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)



**Inference:** **printSchema()** is yet another function of **PySpark**; it prints the complete **schema of the dataset**, returning the data type as well as the **nullable flag (true/false)** for each feature.
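The types listed in the schema are a result of reading the file with `inferSchema=True`. As a rough illustration of the idea, the plain-Python sketch below picks the narrowest type that every value in a column parses as. The `infer_type` helper is hypothetical and far simpler than Spark's actual inference; the sample values come from the first two rows of the dataset.

```python
import csv
import io

# Two rows and four columns taken from the first rows of the dataset.
raw = """School,Private,Apps,S_F_Ratio
Abilene Christian University,Yes,1660,18.1
Adelphi University,Yes,2186,12.2
"""


def infer_type(values):
    """Return 'integer', 'double' or 'string' for a list of raw CSV strings."""
    for name, caster in (("integer", int), ("double", float)):
        try:
            for value in values:
                caster(value)
            return name
        except ValueError:
            continue
    return "string"


# The header row supplies the column names, like header=True in read.csv.
reader = csv.DictReader(io.StringIO(raw))
rows = list(reader)

schema = {
    column: infer_type([row[column] for row in rows])
    for column in reader.fieldnames
}
print(schema)
```

Without inference, every column would simply stay a string, which is why the numeric features could not be used directly.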

In [None]:
data.head()

Row(School='Abilene Christian University', Private='Yes', Apps=1660, Accept=1232, Enroll=721, Top10perc=23, Top25perc=52, F_Undergrad=2885, P_Undergrad=537, Outstate=7440, Room_Board=3300, Books=450, Personal=2200, PhD=70, Terminal=78, S_F_Ratio=18.1, perc_alumni=12, Expend=7041, Grad_Rate=60)

**Inference:** The **head** method is yet another way to look at the data more closely: along with **the column names it also returns the value associated with each one**, which is handy when we want a better sense of **what data** a row contains.

## Formatting the dataset 

So far we have been investigating the data: **what type of data each feature holds, whether there are any null values, and so on**. Now it's time to format the dataset so that it can be **fed to a machine learning algorithm.**

There are a few mandatory things we need to **do so that Spark MLlib can accept our data**. It should end up with only two columns: the label column, which is the **target**, and the features column, which holds the **set of all features**.

In [9]:
# Import VectorAssembler and Vectors
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

**Inference:** Here we import two helpers for formatting the dataset the way PySpark expects: **Vectors** and **VectorAssembler**. Only **VectorAssembler** is actually used in the cells that follow.

In [11]:
data.columns

['School',
 'Private',
 'Apps',
 'Accept',
 'Enroll',
 'Top10perc',
 'Top25perc',
 'F_Undergrad',
 'P_Undergrad',
 'Outstate',
 'Room_Board',
 'Books',
 'Personal',
 'PhD',
 'Terminal',
 'S_F_Ratio',
 'perc_alumni',
 'Expend',
 'Grad_Rate']

**Inference:** Anyone following my PySpark series will have noticed that I always print the column names just before creating the assembler object. **This saves me time** when typing the names, so **take it either as a way to see the features and the target or as a tip for efficient coding.**

In [10]:
assembler = VectorAssembler(
  inputCols=['Apps',
             'Accept',
             'Enroll',
             'Top10perc',
             'Top25perc',
             'F_Undergrad',
             'P_Undergrad',
             'Outstate',
             'Room_Board',
             'Books',
             'Personal',
             'PhD',
             'Terminal',
             'S_F_Ratio',
             'perc_alumni',
             'Expend',
             'Grad_Rate'],
              outputCol="features")

output = assembler.transform(data)

**Inference:** Now that we know which columns to choose, we can create our **assembler object**, which combines all the features into a single vector column in the format MLlib expects.

Don't forget to call **transform**: it returns a new DataFrame (here **output**) that contains the added **features** column, while the original DataFrame is left unchanged.
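Conceptually, the assembler takes the values of the listed input columns, in order, and packs them into one numeric vector per row. Here is a plain-Python sketch of that idea, using the first row returned by `data.head()`. The `assemble` helper is an illustrative stand-in, not the PySpark API.

```python
# First row of the dataset, as returned by data.head().
row = {
    "School": "Abilene Christian University", "Private": "Yes",
    "Apps": 1660, "Accept": 1232, "Enroll": 721, "Top10perc": 23,
    "Top25perc": 52, "F_Undergrad": 2885, "P_Undergrad": 537,
    "Outstate": 7440, "Room_Board": 3300, "Books": 450, "Personal": 2200,
    "PhD": 70, "Terminal": 78, "S_F_Ratio": 18.1, "perc_alumni": 12,
    "Expend": 7041, "Grad_Rate": 60,
}

# Every column except the school name and the target becomes a feature.
input_cols = [name for name in row if name not in ("School", "Private")]


def assemble(row, input_cols):
    """Collect the chosen columns, in order, into one list of floats."""
    return [float(row[name]) for name in input_cols]


features = assemble(row, input_cols)
print(len(features))
print(features)
```

The order of `input_cols` matters: position 0 of every feature vector is always `Apps`, position 1 is always `Accept`, and so on.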

## Converting type of Target column

Previously we saw that the Private column, i.e. the **dependent (target) column**, has the values **Yes/No**. We can't feed this kind of **String** data to the model, so we need to convert it to **binary categorical values (0/1)**.

In [12]:
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="Private", outputCol="PrivateIndex")
output_fixed = indexer.fit(output).transform(output)

final_data = output_fixed.select("features",'PrivateIndex')

**Code breakdown:**

1. When we want to convert a **string column into numeric category indices**, **StringIndexer** comes into action.

2. A StringIndexer object is created with the **input column** parameter, which **takes the name of the column to convert**, and the output column parameter, which names the new column. We then **fit** it on the data and **transform** the data to get a DataFrame that includes the indexed column.

3. Finally we created a new **DataFrame (final_data)** that has the **features** column and **PrivateIndex (the converted target column)**.
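By default, `StringIndexer` orders labels by descending frequency, so the most common label receives index 0.0. The sketch below imitates that rule in plain Python on a made-up list of labels. `fit_string_index` is a hypothetical helper, and which of Yes/No ends up as 0.0 on the real data depends on the class counts.

```python
from collections import Counter


def fit_string_index(labels):
    """Map each distinct label to a float index, most frequent label first.

    Ties are broken alphabetically so the mapping is deterministic.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(index) for index, label in enumerate(ordered)}


# Made-up target column in which "Yes" appears more often than "No".
private = ["Yes", "No", "Yes", "Yes", "No"]

mapping = fit_string_index(private)
private_index = [mapping[label] for label in private]

print(mapping)
print(private_index)
```

This is worth keeping in mind when reading predictions later: the index says nothing about the meaning of the label, only about how often it occurs.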

## Splitting the dataset (Training/Testing)

This is an important step in any machine learning pipeline: after **splitting the whole dataset we have enough data for training as well as enough data for testing**, and validating the model is as important as training it.

In [13]:
train_data,test_data = final_data.randomSplit([0.7,0.3])

**Inference:** After executing the above cell we are left with roughly **70%** of the data for training and roughly **30%** for testing, which will eventually be used for validating the model.
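`randomSplit` assigns rows at random, so the 70/30 proportions are approximate and the split differs from run to run unless a seed is fixed (in PySpark, `randomSplit` accepts an optional `seed` argument). The plain-Python sketch below illustrates both points. `random_split` is a hypothetical helper that only mimics the behaviour, and the integers stand in for the dataset's rows.

```python
import random


def random_split(rows, train_fraction, seed):
    """Send each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(777))  # stand-in for the 777 rows of the dataset

train_a, test_a = random_split(rows, 0.7, seed=42)
train_b, test_b = random_split(rows, 0.7, seed=42)

# The sizes are close to 70/30 but usually not exactly that.
print(len(train_a), len(test_a))

# Reusing the seed reproduces the very same split.
print(train_a == train_b)
```

Fixing the seed makes the accuracy figures reported later repeatable from one run to the next.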

## Model development phase

We have now separated our dataset and are left with an independent training set. With this we will train our model with all the available tree methods: **Decision tree, Gradient Boosting classifier and Random forest classifier**.

In [14]:
from pyspark.ml.classification import DecisionTreeClassifier,GBTClassifier,RandomForestClassifier
from pyspark.ml import Pipeline

**Inference:** We import all the mentioned **tree classifiers** from the **ml.classification** module. The **Pipeline class** is imported as well, but it is completely optional and is not used below; I'd personally suggest using a pipeline only when you are **transforming the data multiple times and the steps need a specific order of execution.**

Now we have to create an object of each model, i.e. all three **tree models**, and store each in a variable so that later we can easily **fit (train)** it.

In [15]:
dtc = DecisionTreeClassifier(labelCol='PrivateIndex',featuresCol='features')
rfc = RandomForestClassifier(labelCol='PrivateIndex',featuresCol='features')
gbt = GBTClassifier(labelCol='PrivateIndex',featuresCol='features')

**Inference:** With the above **three lines of code**, each **tree model is created**. One thing they have in common is that each model takes the same parameters: the **label column** and the **features column**.

**Note:** Here we are using the **default parameters** to keep the models simple, though one can change them to **fine-tune** the model.

Now let's **train the models**, i.e. **fit them using the fit function** of MLlib.

In [16]:
dtc_model = dtc.fit(train_data)
rfc_model = rfc.fit(train_data)
gbt_model = gbt.fit(train_data)

**Note:** If you train all three models together (in one cell), be patient, as it will take some time **(depending on your system configuration)**.

## Evaluating and comparing the model

In this section we will **evaluate and compare the models** side by side to conclude which one has **performed the best**; that model would then be taken to the **deployment phase**.

In [17]:
dtc_predictions = dtc_model.transform(test_data)
rfc_predictions = rfc_model.transform(test_data)
gbt_predictions = gbt_model.transform(test_data)

**Inference:** To evaluate the models we call **transform on each fitted tree model** with the testing data, since the evaluation is based only on the predictions we get for the **testing data**.

PySpark offers several evaluators; we just have to figure out which one we need, e.g. the **Binary Classification Evaluator** or the **Multi class Evaluator**.

In [18]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

**Inference:** Here we chose the **Multi class classification Evaluator** because we want to see the accuracy of the model. The **Binary classification evaluator reports area-under-curve metrics (ROC and precision-recall) and does not expose accuracy, precision and similar metrics, whereas the Multi class evaluator does.**

In [19]:
# Compare the prediction column with the true label and compute accuracy
acc_evaluator = MulticlassClassificationEvaluator(labelCol="PrivateIndex", predictionCol="prediction", metricName="accuracy")

**Inference:** We built the **MulticlassClassificationEvaluator** object, passing the **label column**, the **prediction column** and the **name of the metric** that we want to see.

In [20]:
dtc_acc = acc_evaluator.evaluate(dtc_predictions)
rfc_acc = acc_evaluator.evaluate(rfc_predictions)
gbt_acc = acc_evaluator.evaluate(gbt_predictions)

**Inference:** To see the results we use the **evaluate** method, which takes the **predictions DataFrame of each model (e.g. dtc_predictions)**.
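Accuracy is simply the share of test rows whose predicted label equals the true label. Here is a plain-Python sketch of that calculation on made-up values; the `accuracy` helper and the two sample lists are illustrative, not PySpark code or real model output.

```python
def accuracy(predictions, labels):
    """Fraction of positions where the prediction matches the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up values in the same 0.0/1.0 encoding as PrivateIndex.
labels = [0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
predictions = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]

acc = accuracy(predictions, labels)
print("accuracy: {0:2.2f}%".format(acc * 100))
```

Six of the eight made-up predictions match their labels, so the sketch reports 75%.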

In [21]:
print("Here are the results!")
print('-'*50)
print('A single decision tree had an accuracy of: {0:2.2f}%'.format(dtc_acc*100))
print('-'*50)
print('A random forest ensemble had an accuracy of: {0:2.2f}%'.format(rfc_acc*100))
print('-'*50)
print('A ensemble using GBT had an accuracy of: {0:2.2f}%'.format(gbt_acc*100))

Here are the results!
--------------------------------------------------
A single decision tree had an accuracy of: 91.67%
--------------------------------------------------
A random forest ensemble had an accuracy of: 92.92%
--------------------------------------------------
A ensemble using GBT had an accuracy of: 91.67%


**Inference:** Finally we printed the **accuracy of each model** and found that the **random forest ensemble scored highest** on this particular split, although the margin over the other two models is small.
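To put the comparison in perspective, the short sketch below ranks the three reported accuracies and measures the gap between the winner and the runner-up. The numbers are the ones printed above; the variable names are only illustrative.

```python
# Test accuracies reported above, written as fractions.
results = {
    "single decision tree": 0.9167,
    "random forest": 0.9292,
    "gradient-boosted trees": 0.9167,
}

# Rank the models from best to worst accuracy.
ranking = sorted(results.items(), key=lambda item: item[1], reverse=True)
best_name, best_acc = ranking[0]
runner_up_acc = ranking[1][1]

# Gap between the winner and the runner-up, in percentage points.
margin_points = (best_acc - runner_up_acc) * 100

print(best_name)
print("{0:.2f} percentage points".format(margin_points))
```

Since the test set holds only about 30% of 777 rows, a gap of this size corresponds to a handful of rows, so a different random split could plausibly change the ranking.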

## Conclusion

In this last section of the article we look back at everything we did to **classify private and public universities**, from starting the **SparkSession** to evaluating the models and choosing the **best model**. This brief conclusion will help you understand the **flow of a machine learning pipeline with PySpark's MLlib**.

1. First, as usual, we set up a **hassle-free environment** and read the **college dataset** for analysis.
2. Then we looked at the data closely to understand what changes were needed in the **data preprocessing** step and updated the data accordingly.
3. Next came the model development phase, where we built various **tree models** so that we could later compare them and choose the best one.
4. In the last section we evaluated the models and concluded that the **random forest ensemble performed best on the testing data.**
