## Introduction

We continue the PySpark series, which so far has covered **data preprocessing techniques** and various **ML algorithms**, along with real-world **consulting projects**. This article works through another consulting project. Suppose we have been **hired by a dog food company**, and our task is to find out why their food is **spoiling well before its shelf life ends**. We will solve this problem using **PySpark's MLlib**.

## About the Problem statement

The introduction covered **what** needs to be done; this section digs into the **how** and **why** of the project.

**Why does the dog food company need us?**
- Over the last few supply chain cycles they have regularly faced **premature spoiling of the dog food**. They have traced the cause to **outdated machinery** that fails to mix the **four secret ingredients** properly. What they cannot work out is **which of those 4 chemicals is responsible, or has the strongest effect.**


**How are we going to approach this problem statement?**
- Our main task is to identify **which one of the 4 chemicals/preservatives has the strongest effect**. To do so we will not follow the usual **train-test split** methodology; instead we will use **feature importance**, because that is what tells us which ingredient is most responsible for **spoiling the dog food** before the end of its shelf life.



## About the dataset

This dataset holds **4 feature columns** labelled **A, B, C and D** and one **target** column labelled **"Spoiled"**, for a total of **5 columns**. Here is a short description of each feature column.

1. **Preservative_A:** Percentage of ingredient A in the mixture.
2. **Preservative_B:** Percentage of ingredient B in the mixture.
3. **Preservative_C:** Percentage of ingredient C in the mixture.
4. **Preservative_D:** Percentage of ingredient D in the mixture.


Here you can find the source of the [dataset](https://github.com/SkalskiP/pySpark_Tutorial/blob/master/Sekcja_13_Decision_Trees_and_Random_Forests/dog_food.csv).

**Note:** In this project **we will not follow the generalised machine learning pipeline (train-test split)**. Instead we will use a different method, which you will see as the article progresses and which gives you **another template for problems of this kind.**

**Installing PySpark:** For this analysis we need to install just one library, **PySpark**, which is the heart and soul of this project. It sets up the environment for the **MLlib library** and **establishes the connection with Apache Spark**.

In [1]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.0.tar.gz (281.3 MB)
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.0-py2.py3-none-any.whl size=281764026 sha256=3793c0cc5561bff60a517f42489e4bbcdc50d777a91475c885ceb2d13b97726e
  Stored in directory: /root/.cache/pip/wheels/7a/8e/1b/f73a52650d2e5f337708d9f6a1750d451a7349a867f928b885
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.0


## Spark Session

In this part of the article we start the **Spark Session**. This is a mandatory step: it sets up the environment for **Apache Spark** by creating a new session via **PySpark**.

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('dog_food_project').getOrCreate()
spark

**Inference:** First, the **SparkSession** class is imported from the **pyspark.sql** module.

Next, the **builder** attribute configures the session (including its name, **dog_food_project**), and **getOrCreate()** either returns an existing SparkSession or creates a new one.

Finally, **evaluating the spark object** in a notebook displays a short summary of the session: the Spark version, the master, the app name and a link to the **Spark UI**.

## Reading the dog food dataset

This is another compulsory step, because no data science project can proceed without the relevant dataset; it would be like "***trying to build a house without bricks***". The code below reads the dataset, which is in CSV format.

In [3]:
data_food = spark.read.csv('dog_food.csv',inferSchema=True,header=True)
data_food.show()

+---+---+----+---+-------+
|  A|  B|   C|  D|Spoiled|
+---+---+----+---+-------+
|  4|  2|12.0|  3|    1.0|
|  5|  6|12.0|  7|    1.0|
|  6|  2|13.0|  6|    1.0|
|  4|  2|12.0|  1|    1.0|
|  4|  2|12.0|  3|    1.0|
| 10|  3|13.0|  9|    1.0|
|  8|  5|14.0|  5|    1.0|
|  5|  8|12.0|  8|    1.0|
|  6|  5|12.0|  9|    1.0|
|  3|  3|12.0|  1|    1.0|
|  9|  8|11.0|  3|    1.0|
|  1| 10|12.0|  3|    1.0|
|  1|  5|13.0| 10|    1.0|
|  2| 10|12.0|  6|    1.0|
|  1| 10|11.0|  4|    1.0|
|  5|  3|12.0|  2|    1.0|
|  4|  9|11.0|  8|    1.0|
|  5|  1|11.0|  1|    1.0|
|  4|  9|12.0| 10|    1.0|
|  5|  8|10.0|  9|    1.0|
+---+---+----+---+-------+
only showing top 20 rows



**Inference:** The output confirms what we stated in the About the dataset section: there are **4 ingredients/chemicals** (**A, B, C, D**) and one target variable, **Spoiled**.

To read the file we used the **read.csv function** with **header=True**, so that the first row is used for the column names, and **inferSchema=True**, so that Spark infers each column's data type instead of reading everything as strings.

In [4]:
data_food.printSchema()

root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: double (nullable = true)
 |-- D: integer (nullable = true)
 |-- Spoiled: double (nullable = true)



**Inference:** Because we set **inferSchema** to **True** while reading the dataset, **printSchema** reports an inferred data type for each column. Features **A**, **B** and **D** are of **integer** type, while **C** and the target **Spoiled** are of **double** type.

In [5]:
data_food.head(10)

[Row(A=4, B=2, C=12.0, D=3, Spoiled=1.0),
 Row(A=5, B=6, C=12.0, D=7, Spoiled=1.0),
 Row(A=6, B=2, C=13.0, D=6, Spoiled=1.0),
 Row(A=4, B=2, C=12.0, D=1, Spoiled=1.0),
 Row(A=4, B=2, C=12.0, D=3, Spoiled=1.0),
 Row(A=10, B=3, C=13.0, D=9, Spoiled=1.0),
 Row(A=8, B=5, C=14.0, D=5, Spoiled=1.0),
 Row(A=5, B=8, C=12.0, D=8, Spoiled=1.0),
 Row(A=6, B=5, C=12.0, D=9, Spoiled=1.0),
 Row(A=3, B=3, C=12.0, D=1, Spoiled=1.0)]

**Inference:** Another way to look at the dataset is the traditional **head function**, which returns the first n rows as a list of **Row objects**, each holding the **column names** together with their values.

In [6]:
data_food.describe().show()

+-------+------------------+------------------+------------------+------------------+-------------------+
|summary|                 A|                 B|                 C|                 D|            Spoiled|
+-------+------------------+------------------+------------------+------------------+-------------------+
|  count|               490|               490|               490|               490|                490|
|   mean|  5.53469387755102| 5.504081632653061| 9.126530612244897| 5.579591836734694| 0.2857142857142857|
| stddev|2.9515204234399057|2.8537966089662063|2.0555451971054275|2.8548369309982857|0.45221563164613465|
|    min|                 1|                 1|               5.0|                 1|                0.0|
|    max|                10|                10|              14.0|                10|                1.0|
+-------+------------------+------------------+------------------+------------------+-------------------+



**Inference:** What if we want **statistical information** about the dataset? For that, **PySpark** provides the **describe() function**, which is called on the DataFrame. The output shows the **count**, **mean**, **standard deviation**, **minimum** and **maximum** of each feature as well as of the target column.
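A quick arithmetic check on this summary can be useful. The sketch below assumes that **Spoiled** only takes the values 0.0 and 1.0 (consistent with the min and max shown), so that its mean is the fraction of spoiled rows. The numbers are copied from the describe() output; the variable names are only illustrative.

```python
import math

# Values copied from the describe() output above.
count = 490
mean_spoiled = 0.2857142857142857
reported_stddev = 0.45221563164613465

# For a 0/1 column, the mean equals the share of rows with value 1.
spoiled_rows = round(count * mean_spoiled)
fresh_rows = count - spoiled_rows

# Sample standard deviation of a 0/1 column with share p of ones:
# sqrt(n / (n - 1) * p * (1 - p)).
p = spoiled_rows / count
expected_stddev = math.sqrt(count / (count - 1) * p * (1 - p))

print(spoiled_rows, fresh_rows)
# True when the reported stddev is consistent with a pure 0/1 column.
print(abs(expected_stddev - reported_stddev) < 1e-6)
```

Under that assumption 140 of the 490 rows are spoiled and 350 are not, so the classes are imbalanced but both are well represented.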

## Vector Assembler and Vectors in PySpark

Whenever we work with the **MLlib library** we need all the **features stacked together in one separate column**, with the target in another. PySpark provides the **VectorAssembler** transformer for this, so we do not have to do it by hand.

In [7]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In the above cell we imported **Vectors** and **VectorAssembler** from the **ml.linalg** and **ml.feature** modules respectively (only **VectorAssembler** is used below). Next we will see it in action.

In [17]:
assembler_data = VectorAssembler(inputCols=['A', 'B', 'C', 'D'],outputCol="features")
output = assembler_data.transform(data_food)
output.show()

+---+---+----+---+-------+-------------------+
|  A|  B|   C|  D|Spoiled|           features|
+---+---+----+---+-------+-------------------+
|  4|  2|12.0|  3|    1.0| [4.0,2.0,12.0,3.0]|
|  5|  6|12.0|  7|    1.0| [5.0,6.0,12.0,7.0]|
|  6|  2|13.0|  6|    1.0| [6.0,2.0,13.0,6.0]|
|  4|  2|12.0|  1|    1.0| [4.0,2.0,12.0,1.0]|
|  4|  2|12.0|  3|    1.0| [4.0,2.0,12.0,3.0]|
| 10|  3|13.0|  9|    1.0|[10.0,3.0,13.0,9.0]|
|  8|  5|14.0|  5|    1.0| [8.0,5.0,14.0,5.0]|
|  5|  8|12.0|  8|    1.0| [5.0,8.0,12.0,8.0]|
|  6|  5|12.0|  9|    1.0| [6.0,5.0,12.0,9.0]|
|  3|  3|12.0|  1|    1.0| [3.0,3.0,12.0,1.0]|
|  9|  8|11.0|  3|    1.0| [9.0,8.0,11.0,3.0]|
|  1| 10|12.0|  3|    1.0|[1.0,10.0,12.0,3.0]|
|  1|  5|13.0| 10|    1.0|[1.0,5.0,13.0,10.0]|
|  2| 10|12.0|  6|    1.0|[2.0,10.0,12.0,6.0]|
|  1| 10|11.0|  4|    1.0|[1.0,10.0,11.0,4.0]|
|  5|  3|12.0|  2|    1.0| [5.0,3.0,12.0,2.0]|
|  4|  9|11.0|  8|    1.0| [4.0,9.0,11.0,8.0]|
|  5|  1|11.0|  1|    1.0| [5.0,1.0,11.0,1.0]|
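|  4|  9|12.0| 10|    1.0|[4.0,9.0,12.0,10.0]|
|  5|  8|10.0|  9|    1.0| [5.0,8.0,10.0,9.0]|
+---+---+----+---+-------+-------------------+
only showing top 20 rows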

**Code breakdown:** 

1. Before using the **VectorAssembler** we create an instance of it, passing the **input columns** (the features) and the **output column** (the one that will hold them all).

2. We then call **transform** on the assembler; note that the **whole dataset is passed** as the argument.

3. Finally, the **show function** displays the **transformed** data; the last column in the output is the features column (all in one).
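To see what this transformation does to a single row, here is a plain-Python sketch of the idea. It is only a conceptual stand-in, not how Spark implements VectorAssembler; the helper name `assemble` is hypothetical, and the two rows are copied from the output above.

```python
def assemble(row, input_cols):
    """Pack the chosen columns of one row into a single list of floats."""
    return [float(row[col]) for col in input_cols]


# Two rows copied from the dataset output above.
rows = [
    {"A": 4, "B": 2, "C": 12.0, "D": 3, "Spoiled": 1.0},
    {"A": 10, "B": 3, "C": 13.0, "D": 9, "Spoiled": 1.0},
]
input_cols = ["A", "B", "C", "D"]

# Keep every original column and add the combined "features" column.
assembled = [{**row, "features": assemble(row, input_cols)} for row in rows]

for row in assembled:
    print(row["features"], row["Spoiled"])
```

The order of `input_cols` decides the position of each value in the vector, which matters later when the importances are reported by index.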

## Model building

Here comes the model building phase, where we will use a **tree-based method** to achieve the goal of this article. Note that this **model building phase differs from the traditional one**: we **do not need a train-test split**, because we are not evaluating predictions; we only want to find out which **feature has the most importance**.
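To make "feature importance" concrete: Spark's decision tree uses Gini impurity by default, and a feature's importance grows with how much the splits on that feature reduce impurity. The following is a minimal pure-Python sketch of that idea on a hypothetical six-row toy dataset (the unspoiled rows are invented for illustration). It is not Spark's implementation, and the helper names `gini` and `impurity_decrease` are assumptions.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - p ** 2 - (1.0 - p) ** 2


def impurity_decrease(rows, feature, threshold):
    """Drop in Gini impurity when splitting rows at feature <= threshold."""
    parent = [r["Spoiled"] for r in rows]
    left = [r["Spoiled"] for r in rows if r[feature] <= threshold]
    right = [r["Spoiled"] for r in rows if r[feature] > threshold]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
    return gini(parent) - weighted


# Hypothetical toy data: the spoiled rows have a high C, the others a low C.
toy_rows = [
    {"A": 4, "C": 12.0, "Spoiled": 1},
    {"A": 5, "C": 12.0, "Spoiled": 1},
    {"A": 6, "C": 13.0, "Spoiled": 1},
    {"A": 4, "C": 8.0, "Spoiled": 0},
    {"A": 5, "C": 7.0, "Spoiled": 0},
    {"A": 6, "C": 9.0, "Spoiled": 0},
]

decrease_c = impurity_decrease(toy_rows, "C", 10.0)  # separates the classes
decrease_a = impurity_decrease(toy_rows, "A", 4)     # separates nothing
print(decrease_c, decrease_a)
```

In this toy case splitting on C removes all impurity while splitting on A removes none, which is the kind of pattern a high importance score for one feature reflects.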



In [11]:
from pyspark.ml.classification import RandomForestClassifier,DecisionTreeClassifier

rfc = DecisionTreeClassifier(labelCol='Spoiled',featuresCol='features')

**Inference:** Before using the tree classifiers we import the **Random Forest classifier** and the **Decision Tree classifier** from the classification module. Only the decision tree is used in this article, even though the variable is named **rfc**.

We then **initialise the Decision Tree object**, passing the **label** column (**Spoiled**, the target) and the features column (**features**).

In [18]:
final_data = output.select('features','Spoiled')
final_data.show()

+-------------------+-------+
|           features|Spoiled|
+-------------------+-------+
| [4.0,2.0,12.0,3.0]|    1.0|
| [5.0,6.0,12.0,7.0]|    1.0|
| [6.0,2.0,13.0,6.0]|    1.0|
| [4.0,2.0,12.0,1.0]|    1.0|
| [4.0,2.0,12.0,3.0]|    1.0|
|[10.0,3.0,13.0,9.0]|    1.0|
| [8.0,5.0,14.0,5.0]|    1.0|
| [5.0,8.0,12.0,8.0]|    1.0|
| [6.0,5.0,12.0,9.0]|    1.0|
| [3.0,3.0,12.0,1.0]|    1.0|
| [9.0,8.0,11.0,3.0]|    1.0|
|[1.0,10.0,12.0,3.0]|    1.0|
|[1.0,5.0,13.0,10.0]|    1.0|
|[2.0,10.0,12.0,6.0]|    1.0|
|[1.0,10.0,11.0,4.0]|    1.0|
| [5.0,3.0,12.0,2.0]|    1.0|
| [4.0,9.0,11.0,8.0]|    1.0|
| [5.0,1.0,11.0,1.0]|    1.0|
|[4.0,9.0,12.0,10.0]|    1.0|
| [5.0,8.0,10.0,9.0]|    1.0|
+-------------------+-------+
only showing top 20 rows



We select only the **features** and **target** columns to get the **final data** that is passed to the **training phase**. The output confirms this.

In [15]:
rfc_model = rfc.fit(final_data)

**Inference:** Finally comes the **training phase**, for which the **fit** method is used. Note that we pass the **final data** that we selected above.

In [16]:
rfc_model.featureImportances

SparseVector(4, {1: 0.0019, 2: 0.9832, 3: 0.0149})

**Inference:** Take a close look at the output. The vector has 4 entries, one per feature in the order A, B, C, D; index 0 (A) is omitted because its importance is zero. Index 2 has by far the highest value (0.9832), i.e.

**Chemical C is the most important feature, which suggests that Chemical C is the main cause of the early spoilage of the dog food.**
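To read the sparse output more easily, the importances can be paired with the column names. The sketch below is plain Python and assumes the index order matches the `inputCols` order passed to VectorAssembler. The values are copied from the output above, and `feature_names` and `ranked` are illustrative names.

```python
# Size and non-zero entries copied from the SparseVector printed above.
size = 4
nonzero = {1: 0.0019, 2: 0.9832, 3: 0.0149}

# Assumed to be in the same order as inputCols in the VectorAssembler.
feature_names = ["A", "B", "C", "D"]

# Indices missing from a sparse vector have the value 0.0.
dense = [nonzero.get(i, 0.0) for i in range(size)]

# Sort features from most to least important.
ranked = sorted(zip(feature_names, dense), key=lambda pair: pair[1], reverse=True)

for name, importance in ranked:
    print(f"{name}: {importance:.4f}")
```

With the fitted model itself, `rfc_model.featureImportances.toArray()` should give the same dense values. The importances sum to 1, so C accounts for about 98% of the total.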

## Conclusion

We are in the endgame now 😧. This last part of the article **summarizes everything** we did to achieve the result for which we were hypothetically **hired by the dog food company: identifying the chemical that causes the early spoiling of the dog food.**

1. As in any Spark project, we first **set up the Spark Session and read the dataset**; both are mandatory steps.

2. We then moved on to the **feature analysis** phase and **feature engineering**, to make the dataset ready to be fed to a machine learning algorithm.

3. Finally we **built the model** (tree method) and, after training it, grabbed the **feature importance** and concluded that **Chemical C was mainly responsible for the early spoiling.**