# <span style='color:DarkBlue'> Day23 - PyCaret for </span> <span style='color:Red'>Clustering</span>

- ### An open source automated library for Machine Learning.
- ### <span style='color:Red'> Three Step Process</span> to build machine learning models for:
    - Classification
    - Regression
    - Clustering

### Self Learning Resource
1. Explore the PyCaret manual on Clustering: <a href="https://pycaret.org/clustering/"> Click Here </a>
2. Tutorials on PyCaret: <a href="https://pycaret.readthedocs.io/en/latest/tutorials.html"> Click Here</a> 




### <span style='color:DarkBlue'> Method 1</span>: To install `pycaret`
- Installing PyCaret in Local Jupyter Notebook, Google Colab or Azure Notebooks
    - Using conda: `!conda install pycaret`
    - Using pip: `!pip install pycaret`
- Installing PyCaret in Anaconda
    - Using conda: `conda install pycaret`
    - Using pip: `pip install pycaret`


### <span style='color:DarkBlue'> Method 2</span>: To install `pycaret` | Online manual to install pycaret <a href="https://pycaret.org/install/"> Click Here</a> 
-  <span style='color:DarkRed'> Step 1</span>: To Install pycaret (One Time)
    - Open Anaconda prompt
    - Create a conda environment: `conda create --name myenv python=3.6`
    - Activate environment: `conda activate myenv`
    - To install pycaret: `pip install pycaret`

-  <span style='color:DarkRed'> Step 2</span>: To use the pycaret environment through Jupyter Notebook (Always)
    - Open Anaconda prompt
    - Activate environment: `conda activate myenv`
    - Start Jupyter Notebook: `jupyter notebook`



### In this tutorial we will learn:

- Getting Data: How to import data from PyCaret repository
- Setting up Environment: How to set up an experiment in PyCaret and get started with building clustering models
- Create Model: How to create a clustering model and review its evaluation metrics
- Tune Model: How to automatically tune the hyperparameters of a clustering model
- Plot Model: How to analyze model performance using various plots
- Finalize Model: How to finalize the best model at the end of the experiment
- Predict Model: How to make prediction on new / unseen data
- Save / Load Model: How to save / load a model for future use


# <span style='color:Red'> 1. Clustering - Part 1 (K-Means Clustering)</span>

### <span style='color:DarkBlue'>1.1 K-Means Clustering </span>

#### Get the installed PyCaret version

In [None]:
from pycaret.utils import version
version()

#### Loading dataset from pycaret

In [None]:
from pycaret.datasets import get_data

#### Get the list of datasets available in pycaret

In [None]:
# Internet connection is required
dataSets = get_data('index')
dataSets

#### Get Jewellery dataset

In [None]:
jewellery_df = get_data('jewellery')

#### Get the dimensions of the dataset

In [None]:
print(jewellery_df.shape)

#### Remove duplicates

In [None]:
print(jewellery_df.shape)
# drop_duplicates() returns a new DataFrame, so assign the result back
jewellery_df = jewellery_df.drop_duplicates()
print(jewellery_df.shape)

### <span style='color:DarkBlue'>1.2 Parameter setting for all clustering models</span>
- Train/Test division
- Sampling
- Normalization
- Transformation
- PCA (Dimension Reduction)
- Handling of Outliers
- Feature Selection

#### Setup parameters for clustering models (defaults)

In [None]:
from pycaret.clustering import *
kMeanClusteringParameters = setup(jewellery_df)

### <span style='color:DarkBlue'>1.3 Build K-Means clustering model</span>

In [None]:
KMeanClusteringModel = create_model('kmeans', num_clusters=4)
KMeanClusteringModel
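
To make clear what `create_model('kmeans', ...)` fits, here is a minimal plain-Python sketch of Lloyd's algorithm, the iteration usually meant by K-Means. It is not PyCaret's or scikit-learn's implementation; the function name `lloyd_kmeans`, the toy points and the hand-picked starting centroids are assumptions made for illustration.

```python
import math

def lloyd_kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the labels stop changing."""
    centroids = list(centroids)  # work on a copy of the starting centroids
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Toy 2-D data with two obvious groups (made-up values).
toy_points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers = lloyd_kmeans(toy_points, centroids=[toy_points[0], toy_points[3]])
print(labels)
print(centers)
```

Real implementations add smarter initialisation and several restarts; the sketch only shows the two alternating steps.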

#### <span style='color:DarkBlue'>Other clustering models</span>

In [None]:
# Model IDs accepted by create_model():
# K-Means Clustering                 'kmeans'
# Affinity Propagation               'ap'
# Mean Shift Clustering              'meanshift'
# Spectral Clustering                'sc'
# Agglomerative Clustering           'hclust'
# Density-Based Spatial Clustering   'dbscan'
# OPTICS Clustering                  'optics'
# Birch Clustering                   'birch'
# K-Modes Clustering                 'kmodes'

### <span style='color:DarkBlue'>1.4 Assign Model - Assign the labels to the dataset</span>



In [None]:
kmeans_df = assign_model(KMeanClusteringModel)
kmeans_df

### <span style='color:DarkBlue'>1.5 Saving the result</span>

In [None]:
kmeans_df.to_csv("KMeanResult.csv")

### <span style='color:DarkBlue'>1.6 Plot Clustering Model</span>

In [None]:
plot_model(KMeanClusteringModel)

### <span style='color:DarkBlue'>1.7 Save the trained model </span>

In [None]:
save_model(KMeanClusteringModel, 'KMeanClusteringModel')

### <span style='color:DarkBlue'>1.8 Load the model </span>

In [None]:
KMeanModel = load_model('KMeanClusteringModel')

### <span style='color:DarkBlue'>1.9 Make prediction on new dataset</span>

#### Read New Data

In [None]:
data = get_data("jewellery")

#### Select some data

In [None]:
# Select top 10 rows
new_data = data.iloc[:10]
new_data

#### Make prediction on new dataset

In [None]:
newPredictions = predict_model(KMeanModel, data = new_data)
newPredictions
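
For a K-Means model, predicting on new rows essentially means assigning each row to the closest fitted centroid, after the same preprocessing has been applied. The sketch below illustrates that idea in plain Python; `nearest_centroid` and the centroid values are hypothetical and are not read from the saved model.

```python
import math

def nearest_centroid(row, centroids):
    """Return the index of the centroid closest to row (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(row, centroids[c]))

# Hypothetical centroids, as if taken from a fitted two-cluster model.
fitted_centroids = [(1.0, 1.5), (8.5, 9.0)]
new_rows = [(0.5, 2.0), (9.0, 8.0), (4.0, 4.0)]

# Label each new row in the "Cluster <n>" style used in the output above.
predicted = [f"Cluster {nearest_centroid(r, fitted_centroids)}" for r in new_rows]
print(predicted)
```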

### <span style='color:DarkBlue'>1.10 Save prediction results to csv</span>

In [None]:
newPredictions.to_csv("NewPredictions.csv")

### <span style='color:DarkBlue'>1.11 Plotting the model</span>

In [None]:
# Plot types accepted by plot_model(plot=...):
# Cluster PCA Plot (2d)          'cluster'
# Cluster t-SNE (3d)             'tsne'
# Elbow Plot                     'elbow'
# Silhouette Plot                'silhouette'
# Distance Plot                  'distance'
# Distribution Plot              'distribution'

#### Evaluate Cluster

In [None]:
evaluate_model(KMeanClusteringModel)

#### Cluster PCA Plot

In [None]:
plot_model(KMeanClusteringModel , plot='cluster')

#### Cluster t-SNE Plot (3d)

In [None]:
plot_model(KMeanClusteringModel,plot = 'tsne')

#### Elbow Plot

In [None]:
plot_model(KMeanClusteringModel, plot = 'elbow')
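
An elbow plot charts a distortion score against the number of clusters; a common score is the within-cluster sum of squares (inertia). The sketch below computes that score for two hand-made partitions of a toy dataset. The function name `inertia` and the data are assumptions for illustration, and the score PyCaret plots may be configured differently.

```python
def inertia(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    total = 0.0
    for cluster in set(labels):
        members = [p for p, l in zip(points, labels) if l == cluster]
        center = [sum(dim) / len(members) for dim in zip(*members)]
        total += sum(
            sum((x - c) ** 2 for x, c in zip(p, center)) for p in members
        )
    return total

# Toy data with two tight groups (made-up values).
toy_points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (8.0, 9.0)]
one_cluster = inertia(toy_points, [0, 0, 0, 0])
two_clusters = inertia(toy_points, [0, 0, 1, 1])
print(one_cluster, two_clusters)
```

The score drops sharply once the number of clusters matches the real grouping and flattens afterwards; that bend is the "elbow".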

#### Silhouette Plot

In [None]:
plot_model(KMeanClusteringModel, plot = 'silhouette')
# Error!! Plot Type not supported for this model
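
A silhouette plot is built from a per-point score s = (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the smallest mean distance to the points of any other cluster. The sketch below computes those scores by hand; the function name `silhouette_scores` and the toy data are assumptions, and it expects at least two clusters.

```python
import math

def silhouette_scores(points, labels):
    """Silhouette coefficient s = (b - a) / max(a, b) for every point."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the point's own cluster.
        same = [math.dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
                if l == own and j != i]
        if not same:  # convention: a singleton cluster scores 0
            scores.append(0.0)
            continue
        a = sum(same) / len(same)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != own
        )
        scores.append((b - a) / max(a, b))
    return scores

# Two well-separated toy clusters (made-up values).
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
toy_labels = [0, 0, 1, 1]
scores = silhouette_scores(toy_points, toy_labels)
print(scores)
```

Scores close to 1 indicate points that sit well inside their cluster; scores near 0 or below suggest overlapping or misassigned points.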

#### Distribution Plot

In [None]:
plot_model(KMeanClusteringModel,plot = 'distribution')

#### Distribution Plot (for a selected feature)

In [None]:
plot_model(KMeanClusteringModel,plot = 'distribution', feature='Income')

#### Distance Plot

In [None]:
plot_model(KMeanClusteringModel,plot = 'distance')
# Error!! Plot Type not supported for this model

# <span style='color:Red'> 2. Clustering - Part 2 (Apply Data-Preprocessing)</span>

##### <span style='color:DarkBlue'>Read Dataset </span>

In [None]:
from pycaret.clustering import *
from pycaret.datasets import get_data

jewellery_df = get_data('jewellery')

### <span style='color:DarkBlue'>2.1 Model Performance using Data Normalization</span>

In [None]:
setup(data = jewellery_df, normalize = True, normalize_method = 'zscore')
create_model('kmeans', num_clusters = 4)
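
Z-score normalization rescales each feature to mean 0 and standard deviation 1, so features measured on large scales do not dominate the distance calculation. The plain-Python sketch below shows the arithmetic only; the function name `zscore` and the sample values are made up, and the population standard deviation (ddof = 0) is an assumption about the convention used.

```python
import statistics

def zscore(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (ddof = 0)
    return [(v - mean) / std for v in values]

# Two made-up features on very different scales.
incomes = [20000.0, 40000.0, 60000.0, 80000.0]
ages = [20.0, 40.0, 60.0, 80.0]

scaled_incomes = zscore(incomes)
scaled_ages = zscore(ages)
print(scaled_incomes)
print(scaled_ages)
```

After scaling, both features cover the same range, so each contributes equally to Euclidean distances.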

### <span style='color:DarkBlue'>2.2 Model Performance using Transformation</span>

In [None]:
setup(data = jewellery_df, transformation = True, transformation_method = 'yeo-johnson')
create_model('kmeans', num_clusters = 4)
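
The Yeo-Johnson transformation is a power transform that makes skewed features more Gaussian-like and, unlike Box-Cox, also accepts zero and negative values. The sketch below applies the formula for a fixed, hand-picked lambda; in practice lambda is estimated from the data, which this sketch does not attempt. The function name `yeo_johnson` and the sample values are assumptions.

```python
import math

def yeo_johnson(y, lam):
    """Yeo-Johnson power transform of a single value for a fixed lambda."""
    if y >= 0:
        if lam == 0:
            return math.log1p(y)               # log(y + 1)
        return ((y + 1) ** lam - 1) / lam
    if lam == 2:
        return -math.log1p(-y)                 # -log(-y + 1)
    return -(((-y + 1) ** (2 - lam)) - 1) / (2 - lam)

# A right-skewed, made-up feature; lam = 0 compresses the large values.
skewed = [0.0, 1.0, 10.0, 100.0, 1000.0]
transformed = [yeo_johnson(v, 0) for v in skewed]
print(transformed)
```

With lambda = 1 the transform leaves values unchanged, which is a handy sanity check.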

### <span style='color:DarkBlue'>2.3 Model Performance using Transformation + Normalization</span>

In [None]:
setup(data = jewellery_df, transformation = True, normalize = True, normalize_method = 'zscore', 
      transformation_method = 'yeo-johnson')
create_model('kmeans', num_clusters = 4)

### <span style='color:DarkBlue'>2.4 Model Performance using PCA</span>

In [None]:
setup(data = jewellery_df, pca = True, pca_method = 'linear')
create_model('kmeans', num_clusters = 4)

### <span style='color:DarkBlue'>2.5 Model Performance using Remove Multicollinearity</span>

In [None]:
setup(data = jewellery_df, remove_multicollinearity = True, multicollinearity_threshold = 0.8)
create_model('kmeans', num_clusters = 4)
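
Removing multicollinearity means dropping features that are strongly correlated with one another, since they carry nearly the same information. The sketch below keeps a column only if its absolute Pearson correlation with every column kept so far is at most the threshold. This keep-the-first rule, the helper names and the `IncomeInThousands` column are assumptions for illustration; PyCaret's own rule for choosing which feature to drop may differ.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def drop_collinear(columns, threshold=0.8):
    """Keep a column only if |r| with every already-kept column is <= threshold."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, k)) <= threshold for k in kept.values()):
            kept[name] = values
    return list(kept)

# Made-up data: the second column is just the first one rescaled.
toy = {
    "Income": [10.0, 20.0, 30.0, 40.0, 50.0],
    "IncomeInThousands": [0.01, 0.02, 0.03, 0.04, 0.05],
    "Age": [30.0, 20.0, 50.0, 25.0, 40.0],
}
remaining = drop_collinear(toy)
print(remaining)
```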

# <span style='color:Red'> 3. Clustering - Part 3 (Other Clustering Techniques)</span>

#### <span style='color:DarkBlue'>Other clustering models</span>

In [None]:
# Model IDs accepted by create_model():
# K-Means Clustering                 'kmeans'
# Affinity Propagation               'ap'
# Mean Shift Clustering              'meanshift'
# Spectral Clustering                'sc'
# Agglomerative Clustering           'hclust'
# Density-Based Spatial Clustering   'dbscan'
# OPTICS Clustering                  'optics'
# Birch Clustering                   'birch'
# K-Modes Clustering                 'kmodes'

### <span style='color:DarkBlue'>3.1 Agglomerative (Hierarchical) Clustering </span>

#### Step 1: Loading dataset from pycaret

In [None]:
from pycaret.datasets import get_data

#### Step 2: Get Jewellery dataset

In [None]:
jewellery_df = get_data('jewellery')

#### Step 3: Get the dimensions of the dataset

In [None]:
jewellery_df.shape

#### Step 4: Setup

In [None]:
from pycaret.clustering import *
hierarchicalParameters = setup(jewellery_df)

#### Step 5: Create Hierarchical Clustering Model

In [None]:
hierarchicalModel = create_model('hclust', num_clusters=6)
hierarchicalModel

#### Step 6: Assign Model - Assign the labels to the dataframe

In [None]:
hierarchical_df = assign_model(hierarchicalModel)
hierarchical_df

#### Step 7: Saving to file

In [None]:
hierarchical_df.to_csv("HierarchicalResult.csv")

#### Step 8: Evaluate Model

In [None]:
# Evaluate the hierarchical model built in Step 5 (not the earlier K-Means model)
evaluate_model(hierarchicalModel)

### <span style='color:DarkBlue'>3.2 Density-Based Spatial Clustering </span>

#### Step 1: Loading dataset from pycaret

In [None]:
from pycaret.datasets import get_data

#### Step 2: Get Jewellery dataset

In [None]:
jewellery_df = get_data('jewellery')

#### Step 3: Get the dimensions of the dataset

In [None]:
jewellery_df.shape

#### Step 4: Setup

In [None]:
from pycaret.clustering import *
dbscanParameters = setup(jewellery_df)

#### Step 5: Create Clustering Model

In [None]:
dbscanModel = create_model('dbscan')
dbscanModel

#### Step 6: Assign Model - Assign the labels to the dataframe

In [None]:
dbscan_df = assign_model(dbscanModel)
dbscan_df

# Noisy samples are given the label -1 i.e. 'Cluster -1'
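
To show where the -1 label comes from, here is a minimal plain-Python sketch of the DBSCAN idea: points with at least `min_samples` neighbours within `eps` are core points, clusters grow outward from core points, and anything unreachable is noise. It is not the implementation PyCaret calls; the function name `dbscan`, the toy data and the parameter values are assumptions.

```python
import math

def dbscan(points, eps, min_samples):
    """Label each point with a cluster id, or -1 for noise."""
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE          # may be relabelled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_samples:   # j is a core point: keep expanding
                queue.extend(nbrs)
    return labels

# Two dense toy groups plus one isolated point (made-up values).
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1),
              (10, 10), (10, 11), (11, 10), (11, 11),
              (50, 50)]
toy_labels = dbscan(toy_points, eps=1.5, min_samples=3)
print(toy_labels)
```

The isolated point has too few neighbours to join any cluster, so it keeps the noise label -1.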

#### Step 7: Saving to file

In [None]:
dbscan_df.to_csv("DBScanResult.csv")

#### Step 8: Evaluate Model

In [None]:
# Evaluate the DBSCAN model built in Step 5 (not the earlier K-Means model)
evaluate_model(dbscanModel)

### <span style='color:Red'>Key Points</span>

- `num_clusters` is not required for some clustering algorithms: Affinity Propagation ('ap'), Mean shift
  clustering ('meanshift'), Density-Based Spatial Clustering ('dbscan') and OPTICS Clustering ('optics').
- For these models the number of clusters is determined automatically.

- When the fit does not converge in the Affinity Propagation ('ap') model, all data points are labelled -1.

- Noisy samples are given the label -1 when using Density-Based Spatial Clustering ('dbscan') or OPTICS Clustering ('optics').

- OPTICS ('optics') clustering may take longer to train on large datasets.



### Self Learning Resource
1. Explore the PyCaret manual on Clustering: <a href="https://pycaret.org/clustering/"> Click Here </a>
2. Tutorials on PyCaret: <a href="https://pycaret.readthedocs.io/en/latest/tutorials.html"> Click Here</a> 