## Experiment 2: Explore, visualize, transform, and summarize input datasets for building classification/regression/prediction models

## Software Used: Google Colaboratory

---


# **PyCaret for Classification**
---
- PyCaret is a low-code library that bundles many machine learning algorithms behind one interface.
- Only three lines of code are required to compare about 20 ML models.
- PyCaret is available for:
    - Classification
    - Regression
    - Clustering

---

### **Self Learning Resource**
1. Tutorial on PyCaret: <a href="https://pycaret.readthedocs.io/en/latest/tutorials.html">Click Here</a>

2. Documentation on PyCaret Classification: <a href="https://pycaret.org/Classification/">Click Here</a>

---

### **In this experiment we will learn:**

- Getting Data: How to import data from the PyCaret repository
- Setting up Environment: How to set up an experiment in PyCaret and get started with building classification models
- Create Model: How to create a model, perform cross-validation, and evaluate classification metrics
- Tune Model: How to automatically tune the hyperparameters of a classification model
- Plot Model: How to analyze model performance using various plots
- Finalize Model: How to finalize the best model at the end of the experiment
- Predict Model: How to make predictions on new / unseen data
- Save / Load Model: How to save / load a model for future use

---



#### **(a) Install Pycaret**

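Exclamation mark (`!`): It runs the rest of the line as a shell command rather than as Python code. This is not specific to pip; any shell command can be run this way from an IPython notebook. In computing, a shell is a program that exposes an operating system's services to a human user or to another program.

`&> /dev/null`: `/dev/null` is the null device. Anything written to it is discarded. The `&>` operator redirects both standard output and standard error.

Together they mean "throw away all output of the command, including any error messages". Because errors are hidden as well, the message printed on the following line does not by itself confirm that the installation succeeded.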
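The effect of the redirection can be reproduced from Python itself. The following is a minimal sketch, not part of the original notebook: it starts a small child process that writes one line to standard output and one to standard error, then discards both streams (like `&> /dev/null`) or only the error stream (like `2> /dev/null`). The variable names are illustrative.

```python
import subprocess
import sys

# Child process that writes one line to stdout and one line to stderr.
child = [
    sys.executable,
    "-c",
    "import sys; print('to stdout'); print('to stderr', file=sys.stderr)",
]

# Like `command &> /dev/null`: both streams are discarded.
silenced = subprocess.run(
    child, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
)

# Like `command 2> /dev/null`: only error messages are discarded,
# while normal output is captured so it can be inspected.
errors_only = subprocess.run(
    child, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, text=True
)

print("exit code:", silenced.returncode)
print("captured stdout:", errors_only.stdout.strip())
```

The exit code is still available after the output has been discarded, so it is one way to check whether a silenced command actually succeeded.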

In [None]:
!pip install pycaret &> /dev/null
print("Pycaret installed successfully!!")

Pycaret installed successfully!!


#### **(b) Get the PyCaret version**

In [None]:
# pycaret.utils is a collection of small helper functions and classes that make common patterns shorter and easier.
from pycaret.utils import version
version()

'2.3.2'

---
# **1. Classification: Basics**
---

### **1.1 Loading Dataset - Loading dataset from pycaret**

In [None]:
from pycaret.datasets import get_data

# No output

---
### **1.2 Get the list of datasets available in pycaret (55)**
---

In [None]:
# Internet connection is required
dataSets = get_data('index')

Unnamed: 0,Dataset,Data Types,Default Task,Target Variable 1,Target Variable 2,# Instances,# Attributes,Missing Values
0,anomaly,Multivariate,Anomaly Detection,,,1000,10,N
1,france,Multivariate,Association Rule Mining,InvoiceNo,Description,8557,8,N
2,germany,Multivariate,Association Rule Mining,InvoiceNo,Description,9495,8,N
3,bank,Multivariate,Classification (Binary),deposit,,45211,17,N
4,blood,Multivariate,Classification (Binary),Class,,748,5,N
5,cancer,Multivariate,Classification (Binary),Class,,683,10,N
6,credit,Multivariate,Classification (Binary),default,,24000,24,N
7,diabetes,Multivariate,Classification (Binary),Class variable,,768,9,N
8,electrical_grid,Multivariate,Classification (Binary),stabf,,10000,14,N
9,employee,Multivariate,Classification (Binary),left,,14999,10,N


---
### **1.3 Get diabetes dataset**
---

In [None]:
diabetesDataSet = get_data("diabetes")    # row 7 in the dataset index listed above
# This is a binary classification dataset: the "Class variable" column takes only two values (0 and 1).
print(type(diabetesDataSet))

Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


<class 'pandas.core.frame.DataFrame'>


Read data from file

In [None]:
#import pandas as pd
#diabetesDataSet = pd.read_csv("diabetes.csv")
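As a minimal sketch of reading the same kind of data from a file, the example below parses CSV text with `pd.read_csv`. To stay self-contained it uses an in-memory `io.StringIO` buffer as a stand-in for a `diabetes.csv` file on disk. The three rows are copied from the `get_data("diabetes")` output above; the shortened column names are an assumption made for readability.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; with a real file, pass its path instead.
csv_text = """Pregnancies,Glucose,BMI,Class variable
6,148,33.6,1
1,85,26.6,0
8,183,23.3,1
"""

# pd.read_csv accepts a file path, a URL, or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (rows, columns)
print(df.dtypes)  # column types inferred from the text
```

In Colab, the file first has to be uploaded to the session (or mounted from Google Drive) before its path can be passed to `pd.read_csv`.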


In [None]:
diabetesDataSet.columns

Index(['Number of times pregnant',
       'Plasma glucose concentration a 2 hours in an oral glucose tolerance test',
       'Diastolic blood pressure (mm Hg)', 'Triceps skin fold thickness (mm)',
       '2-Hour serum insulin (mu U/ml)',
       'Body mass index (weight in kg/(height in m)^2)',
       'Diabetes pedigree function', 'Age (years)', 'Class variable'],
      dtype='object')

In [None]:
#Get the statistical summary of the dataset
diabetesDataSet.describe()

Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In [None]:
print("type(diabetesDataSet)-->",type(diabetesDataSet))

type(diabetesDataSet)--> <class 'pandas.core.frame.DataFrame'>


In [None]:
## Get the dimensions of the dataset

In [None]:
print("diabetesDataSet.shape -->", diabetesDataSet.shape)
print("Rows     -->", diabetesDataSet.shape[0])   # axis 0: rows
print("Columns  -->", diabetesDataSet.shape[1])   # axis 1: columns

In [None]:
### Show top 5 rows of the dataset

In [None]:
diabetesDataSet.head()

In [None]:
## Accessing data from dataset - Part 1 (using loc - Column Names)

In [None]:
# Syntax --> loc[ ROW, COL_Names_in_List ]

#diabetesDataSet.loc[:, ['Diabetes pedigree function','Age (years)']]

# Also Try
#diabetesDataSet.loc[:10 , ['Diabetes pedigree function','Age (years)']]
diabetesDataSet.loc[10:100 , ['Diabetes pedigree function','Age (years)']]

In [None]:
## Accessing data from dataset - Part 2 (using iloc - Column position)

In [None]:
# Syntax --> iloc[ ROW, COL_Position]

#diabetesDataSet.iloc[0:10, 5:]

# Also Try
#diabetesDataSet.iloc[10:100, :-2]
diabetesDataSet.iloc[20:30, 1:5]

Unnamed: 0,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml)
20,126,88,41,235
21,99,84,0,0
22,196,90,0,0
23,119,80,35,0
24,143,94,33,146
25,125,70,26,115
26,147,76,0,0
27,97,66,15,140
28,145,82,19,110
29,117,92,0,0
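One difference between the two accessors is easy to miss: a `loc` slice selects by label and includes its end point, while an `iloc` slice selects by position and excludes it. The sketch below shows this on five rows copied from the `head()` output above; the shortened column names are an assumption made for readability.

```python
import pandas as pd

# Five rows from the diabetes dataset, with shortened column names.
df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "Age": [50, 31, 32, 21, 33],
})

# Label-based: rows labelled 1, 2 and 3 (the end label is included).
by_label = df.loc[1:3, ["Glucose", "Age"]]

# Position-based: rows at positions 1 and 2 (the end position is excluded).
by_position = df.iloc[1:3, 0:2]

print(by_label.shape)     # (3, 2)
print(by_position.shape)  # (2, 2)
```

This is why `loc[10:100, ...]` on the full dataset returns 91 rows, whereas `iloc[20:30, ...]` returns 10.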


In [None]:
## Get the mean of all the columns present in the dataset

In [None]:
diabetesDataSet.mean()
# What is the output?
#diabetesDataSet.head(50).mean()
#diabetesDataSet.tail(50).mean()

Number of times pregnant                                                      3.845052
Plasma glucose concentration a 2 hours in an oral glucose tolerance test    120.894531
Diastolic blood pressure (mm Hg)                                             69.105469
Triceps skin fold thickness (mm)                                             20.536458
2-Hour serum insulin (mu U/ml)                                               79.799479
Body mass index (weight in kg/(height in m)^2)                               31.992578
Diabetes pedigree function                                                    0.471876
Age (years)                                                                  33.240885
Class variable                                                                0.348958
dtype: float64

In [None]:
## Get the maximum of each column in the dataset
diabetesDataSet.max()

Number of times pregnant                                                     17.00
Plasma glucose concentration a 2 hours in an oral glucose tolerance test    199.00
Diastolic blood pressure (mm Hg)                                            122.00
Triceps skin fold thickness (mm)                                             99.00
2-Hour serum insulin (mu U/ml)                                              846.00
Body mass index (weight in kg/(height in m)^2)                               67.10
Diabetes pedigree function                                                    2.42
Age (years)                                                                  81.00
Class variable                                                                1.00
dtype: float64

In [None]:
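## Check for NA values, then drop them (delete rows)

In [None]:
diabetesDataSet.isnull().sum()   # number of missing values in each column
#diabetesDataSet.dropna()        # returns a copy without the rows that contain NA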

Number of times pregnant                                                    0
Plasma glucose concentration a 2 hours in an oral glucose tolerance test    0
Diastolic blood pressure (mm Hg)                                            0
Triceps skin fold thickness (mm)                                            0
2-Hour serum insulin (mu U/ml)                                              0
Body mass index (weight in kg/(height in m)^2)                              0
Diabetes pedigree function                                                  0
Age (years)                                                                 0
Class variable                                                              0
dtype: int64

In [None]:
## Drop the columns where there are null values

In [None]:
diabetesDataSet.dropna(axis = 'columns')

In [None]:
## Fill the null values with 0

In [None]:
diabetesDataSet.fillna(0)   # returns a new DataFrame; the original is unchanged
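The diabetes dataset contains no null values, so the three calls above return it unchanged. The sketch below applies the same calls to a small DataFrame where the effect is visible. The `NaN` entries are inserted purely for illustration and are not part of the dataset; the shortened column names are also an assumption.

```python
import numpy as np
import pandas as pd

# Small frame with two illustrative missing values.
df = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0, 89.0],
    "BMI": [33.6, 26.6, np.nan, 28.1],
    "Age": [50, 31, 32, 21],
})

print(df.isnull().sum())  # missing values per column

rows_dropped = df.dropna()                # drop every row that has a NaN
cols_dropped = df.dropna(axis="columns")  # drop every column that has a NaN
filled = df.fillna(0)                     # replace each NaN with 0

print(rows_dropped.shape)          # 2 rows remain
print(list(cols_dropped.columns))  # only "Age" remains
print(filled.isnull().sum().sum()) # no missing values left

# None of the calls modified df itself.
print(df.isnull().sum().sum())
```

Each method returns a new DataFrame, so the result must be assigned (for example `df = df.dropna()`) for the change to persist.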

In [None]:
## Bar graph

In [None]:
## Scatter plot

In [None]:
## Subplot

In [None]:
### TASK: Load a dataset from PyCaret for a regression or classification problem, then explore and summarize it with adequate visualizations.