# Regression

## Objectives

*   Fit and evaluate a regression model to predict tomorrow's rainfall levels, in mm.


## Inputs

* content/WalkthroughProject/outputs/datasets/collection/WeatherAustralia.csv
* Instructions on which variables to use for data cleaning and feature engineering. They are found in their respective notebooks.

## Outputs

* Regression model

## Additional Comments | Insights | Conclusions


---

# Install and Import packages

* You may need to restart the runtime after installing packages; check the cell output when installing a package.

In [None]:
! pip install feature-engine==1.0.2
! pip install scikit-learn==0.23.2
! pip install yellowbrick==1.2
! pip install lazypredict==0.2.9


# Restart the runtime, which restarts the Colab session
# This is good practice after installing a package in a Colab session
import os
os.kill(os.getpid(), 9)

---

# Setup GPU

* Go to Edit → Notebook Settings
* In the Hardware accelerator menu, select GPU
* Note: selecting an option (GPU, TPU or None) switches to a different kernel/session

---
* How do I know if I am using the GPU?
  * Run the code below. If it returns a device name such as '/device:GPU:0', this session is using the GPU; an empty string means no GPU is available.


In [None]:
import tensorflow as tf
tf.test.gpu_device_name()

# **Connection between: Colab Session and your GitHub Repo**

### Insert your **credentials**

* The variable's content will exist only while the session exists. Once this session terminates, the variable's content will be erased permanently.

In [1]:
from getpass import getpass
import os
from IPython.display import clear_output 

print("=== Insert your credentials === \nType in and hit Enter")
os.environ['UserName'] = getpass('GitHub User Name: ')
os.environ['UserEmail'] = getpass('GitHub User E-mail: ')
os.environ['RepoName'] = getpass('GitHub Repository Name: ')
os.environ['UserPwd'] = getpass('GitHub Account Password: ')
clear_output()
print("* Thanks for inserting your credentials!")
print(f"* You may now Clone your Repo to this Session, "
      f"then Connect this Session to your Repo.")

* Thanks for inserting your credentials!
* You may now Clone your Repo to this Session, then Connect this Session to your Repo.


* **Credentials format disclaimer**: when opening Jupyter notebooks hosted on GitHub in Colab, please avoid special characters in your **password**, such as @ ! " # $ % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ ` { } | ~
  * Otherwise the git push command will not work properly, because the credentials are concatenated into the remote URL (username:password@github.com/username/repo) and special characters break it.
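
If special characters cannot be avoided, one possible workaround (an assumption, not part of the original workflow) is to percent-encode the password before it is embedded in the remote URL. The sketch below uses `urllib.parse.quote` from the Python standard library; the password value is made up for illustration.

```python
from urllib.parse import quote

# Hypothetical password containing characters that would break the URL
raw_password = "p@ss:w/rd!"

# safe="" forces every reserved character, including "/", to be percent-encoded
encoded_password = quote(raw_password, safe="")

print(encoded_password)  # p%40ss%3Aw%2Frd%21
```

The encoded value could then stand in for the raw password in the `username:password@github.com` part of the remote URL.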

---

### **Clone** your GitHub Repo to your current Colab session

* So you can have access to your project's files

In [2]:
! git clone https://github.com/{os.environ['UserName']}/{os.environ['RepoName']}.git
! rm -rf sample_data   # remove content/sample_data folder, since we don't need it for this project

import os
if os.path.isdir(os.environ['RepoName']):
  print("\n")
  %cd /content/{os.environ['RepoName']}
  print(f"\n\n* Current session directory is:{os.getcwd()}")
  print(f"* You may refresh the session folder to access {os.environ['RepoName']} folder.")
else:
  print(f"\n* The Repo {os.environ['UserName']}/{os.environ['RepoName']} was not cloned."
        f" Please check your Credentials: UserName and RepoName")

fatal: destination path 'WalkthroughProject' already exists and is not an empty directory.


/content/WalkthroughProject


* Current session directory is:/content/WalkthroughProject
* You may refresh the session folder to access WalkthroughProject folder.


---

### **Connect** this Colab session to your GitHub Repo

* So that, if needed, you can push files generated in this session to your Repo.

In [3]:
! git config --global user.email {os.environ['UserEmail']}
! git config --global user.name {os.environ['UserName']}
! git remote rm origin
! git remote add origin https://{os.environ['UserName']}:{os.environ['UserPwd']}@github.com/{os.environ['UserName']}/{os.environ['RepoName']}.git

# The logic is: create a temporary file in the session and push it to the repo; then delete the file and push again.
# If both pushes work, it is a sign that the session is connected to the repo.
# import uuid
# file_name = "session_connection_test_" + str(uuid.uuid4()) # generates a unique file name
# with open(f"{file_name}.txt", "w") as file: file.write("text")
# print("=== Testing Session Connectivity to the Repo === \n")
# ! git add . ; ! git commit -m {file_name + "_added_file"} ; ! git push origin main 
# print("\n\n")
# os.remove(f"{file_name}.txt")
# ! git add . ; ! git commit -m {file_name + "_removed_file"}; ! git push origin main

# delete your Credentials (username and password)
os.environ['UserName'] = os.environ['UserPwd'] = os.environ['UserEmail'] = ""

* If the output above indicates a **failure in the authentication**, please insert your credentials again.

---

### **Push** generated/new files from this Session to GitHub repo

* Git status

In [None]:
! git status

* Git commit

In [None]:
CommitMsg = "update"
!git add .
!git commit -m {CommitMsg}

* Git Push

In [None]:
!git push origin main


---

### **Delete** Cloned Repo from current Session

* Delete cloned repo and move current directory to /content

In [None]:
%cd /content
import os
!rm -rf {os.environ['RepoName']}

print(f"\n * Please refresh session folder to validate that {os.environ['RepoName']} folder was removed from this session.")
print(f"\n\n* Current session directory is:  {os.getcwd()}")

---

# Load your data

In [4]:
import pandas as pd
df = (pd.read_csv("/content/WalkthroughProject/outputs/datasets/collection/WeatherAustralia.csv")
      .query("RainTomorrow == 'Yes'")  # subset RainTomorrow as Yes
      .drop(labels=['RainTomorrow'],axis=1)
      .dropna(subset=['RainfallTomorrow'])   # drop missing data from target RainfallTomorrow
  )


# rows where RainTomorrow is 'Yes'; label: RainfallTomorrow; features: all other variables
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31838 entries, 8 to 145393
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Date              31838 non-null  object 
 1   Location          31838 non-null  object 
 2   MinTemp           31663 non-null  float64
 3   MaxTemp           31783 non-null  float64
 4   RainfallToday     31162 non-null  float64
 5   Evaporation       17836 non-null  float64
 6   Sunshine          16801 non-null  float64
 7   WindGustDir       29374 non-null  object 
 8   WindGustSpeed     29399 non-null  float64
 9   WindDir9am        29921 non-null  object 
 10  WindDir3pm        30786 non-null  object 
 11  WindSpeed9am      31498 non-null  float64
 12  WindSpeed3pm      31155 non-null  float64
 13  Humidity9am       31304 non-null  float64
 14  Humidity3pm       30874 non-null  float64
 15  Pressure9am       28740 non-null  float64
 16  Pressure3pm       28730 non-null  float

# Regressor Pipeline

## Custom transformer


  * Convert ['Cloud9am','Cloud3pm'] to categorical
  * Get Day, Month, Year, Weekday, IsWeekend from Date
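
As a quick illustration of what these two transformations produce, here is a minimal sketch on toy data using plain pandas. The toy values are invented; only the column names follow the notebook.

```python
import pandas as pd

# Toy frame with a Date column (2020-01-03 is a Friday, 2020-01-04 a Saturday)
toy = pd.DataFrame({"Date": ["2020-01-03", "2020-01-04"]})
toy["Date"] = pd.to_datetime(toy["Date"])

# Date-derived features
toy["Day"] = toy["Date"].dt.day
toy["Month"] = toy["Date"].dt.month
toy["Year"] = toy["Date"].dt.year
toy["WeekDay"] = toy["Date"].dt.weekday            # Monday=0 ... Sunday=6
toy["IsWeekend"] = (toy["WeekDay"] >= 5).astype(int)

# Cloud cover is stored as a number but is treated as a category
cloud = pd.Series([1.0, 8.0, 3.0]).astype("object")

print(toy)
print(cloud.dtype)
```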

In [6]:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# Convert ['Cloud9am','Cloud3pm'] to categorical
class ConvertToCategorical(BaseEstimator, TransformerMixin):

  def __init__(self, variables=None):
      if not isinstance(variables, list):
          self.variables = [variables]
      else:
          self.variables = variables

  def fit(self, X, y=None):
      return self

  def transform(self, X):
      X = X.copy()
      for feature in self.variables:
          X[feature] = X[feature].astype('object')

      return X


# Get Day, Month, Year, Weekday, IsWeekend from Date
class GetFeaturesFromDate(BaseEstimator, TransformerMixin):

  def __init__(self, variable=None):
      self.variable = variable

  def fit(self, X, y=None):
      return self

  def transform(self, X):
      X = X.copy()
      X[self.variable] = pd.to_datetime(X[self.variable])
      X['Day'] = X[self.variable].dt.day
      X['Month'] = X[self.variable].dt.month
      X['Year'] = X[self.variable].dt.year
      X['WeekDay']= X[self.variable].dt.weekday
      X['IsWeekend'] = X['WeekDay'].apply(lambda x: 1 if x >= 5 else 0)

      return X


## ML Pipeline: DataCleaningFeatEng, and Regressor

In [55]:
from config import config
from sklearn.pipeline import Pipeline

### Data Cleaning
from feature_engine.imputation import AddMissingIndicator
from feature_engine.selection import DropFeatures
from feature_engine.imputation import DropMissingData
from feature_engine.imputation import CategoricalImputer
from feature_engine.imputation import MeanMedianImputer

### Feature Engineering
from feature_engine.outliers import Winsorizer
from feature_engine.transformation import (LogTransformer,
                                           ReciprocalTransformer,
                                           PowerTransformer,
                                           BoxCoxTransformer,
                                           YeoJohnsonTransformer)
from feature_engine.discretisation import EqualFrequencyDiscretiser
from feature_engine.encoding import RareLabelEncoder
from feature_engine.encoding import CountFrequencyEncoder


### Feat Selection
from sklearn.feature_selection import SelectFromModel

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### ML algorithms 
from sklearn.tree import DecisionTreeRegressor




def PipelineDataCleaningAndFeatureEngineering():

  pipeline_base = Pipeline(
      [
      ### Data Cleaning
      ("ConvertToCategorical",ConvertToCategorical(variables = ['Cloud9am','Cloud3pm'])
      ),

      ("GetFeaturesFromDate",GetFeaturesFromDate(variable= 'Date')
      ),
       
      ("AddMissingIndicator",AddMissingIndicator(variables= ['Sunshine', 'Evaporation', 'Cloud3pm',
                                                             'Cloud9am', 'Pressure9am', 'Pressure3pm',
                                                             'WindDir9am', 'WindGustDir', 'WindGustSpeed',
                                                             'Humidity3pm', 'WindDir3pm', 'Temp3pm',
                                                             'RainfallToday', 'RainToday',
                                                             'WindSpeed3pm', 'Humidity9am','Temp9am',
                                                             'WindSpeed9am', 'MinTemp','MaxTemp'])
      ),

      ("DropFeatures",DropFeatures(features_to_drop = ['Sunshine','Evaporation','Cloud9am','Date'])
      ),                                         ##########dont drop sunshine

      ("DropMissingData",DropMissingData(variables =['RainfallToday', 'RainToday'])
      ),

      ("CategoricalImputer",CategoricalImputer(variables=['WindDir9am', 'WindGustDir', 'WindDir3pm','Cloud3pm'],
                                                imputation_method='missing',fill_value='Missing')
      ),

      ("MedianImputer",MeanMedianImputer(imputation_method='median',
                                          variables=['Pressure3pm', 'Pressure9am','WindGustSpeed',
                                                    'Humidity3pm', 'Temp3pm', 'WindSpeed3pm', 'Humidity9am',
                                                    'WindSpeed9am','Temp9am','MaxTemp',
                                                     'RainfallToday']
                                          )
      ),

      ("MeanImputer",MeanMedianImputer(imputation_method='mean',variables=['MinTemp'])
      ),

      ### Feature Engineering

      ("Winsorizer_iqr",Winsorizer(capping_method='iqr',tail='both', fold=3,variables = ['RainfallToday'])
      ),


      ("PowerTransformer",PowerTransformer(variables = ['WindSpeed3pm','Humidity3pm'])
      ),

      ("YeoJohnsonTransformer",YeoJohnsonTransformer(variables=['RainfallToday','WindGustSpeed',
                                                                'WindSpeed9am','Humidity9am'])
      ),

      ("EqualFrequencyDiscretiser",EqualFrequencyDiscretiser(q=5,variables = ['Latitude','Longitude' ])
      ),

      ("RareLabelEncoder_tol5",RareLabelEncoder(tol=0.05, n_categories=2, variables=['WindDir3pm'])
      ),

      ("RareLabelEncoder_tol7",RareLabelEncoder(tol=0.06, n_categories=2, variables=['State'])
      ),

      ("CountEncoder",CountFrequencyEncoder(encoding_method='count',
                                            variables = ['Location','WindGustDir','WindDir9am',
                                                          'WindDir3pm','State','Cloud3pm',
                                                          'RainToday'])
      )

    ]
  )
  return pipeline_base


def PipelineRegressor():
  pipe = PipelineDataCleaningAndFeatureEngineering()
 
  # Pipeline steps are (name, estimator) tuples
  pipe.steps.append(("scaler", StandardScaler()))

  pipe.steps.append(("model", DecisionTreeRegressor(random_state=config.RANDOM_STATE)))
  return pipe



PipelineRegressor()

Pipeline(steps=[('ConvertToCategorical',
                 ConvertToCategorical(variables=['Cloud9am', 'Cloud3pm'])),
                ('GetFeaturesFromDate', GetFeaturesFromDate(variable='Date')),
                ('AddMissingIndicator',
                 AddMissingIndicator(variables=['Sunshine', 'Evaporation',
                                                'Cloud3pm', 'Cloud9am',
                                                'Pressure9am', 'Pressure3pm',
                                                'WindDir9am', 'WindGustDir',
                                                'WindGustSpeed', 'Humidity3pm',
                                                'WindD...
                 RareLabelEncoder(n_categories=2, variables=['WindDir3pm'])),
                ('RareLabelEncoder_tol7',
                 RareLabelEncoder(n_categories=2, tol=0.06,
                                  variables=['State'])),
                ('CountEncoder',
                 CountFrequencyEncoder(variables=['Loc

# Lazy Predict

* Transform the data with every pipeline step except the final model step
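
The idea of running a pipeline without its final estimator can be sketched on toy data as follows. This is a simplified stand-in (a scaler plus a tree regressor), not the notebook's full pipeline. It also shows why column names have to be re-attached: `StandardScaler` returns a NumPy array by default.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Toy features, invented for illustration
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

full_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", DecisionTreeRegressor(random_state=0)),
])

# Keep every step except the final estimator
preprocessing = Pipeline(full_pipe.steps[:-1])

# The scaler output is a NumPy array, so the column names are restored by hand
scaled = pd.DataFrame(preprocessing.fit_transform(toy), columns=toy.columns)
print(scaled.round(2))
```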

In [59]:
from sklearn.pipeline import Pipeline

pipeline_lazy = Pipeline(PipelineRegressor().steps[:-2])
columns_after_data_cleaning_feat_eng = pipeline_lazy.fit_transform(df).columns
columns_after_data_cleaning_feat_eng

pipeline_lazy = Pipeline(PipelineRegressor().steps[:-1])
df_lazy = pipeline_lazy.fit_transform(df)
df_lazy = pd.DataFrame(data = df_lazy,
                       columns = columns_after_data_cleaning_feat_eng)

df_lazy.shape

(31162, 47)

* Split Train and Test Set

In [60]:
from sklearn.model_selection import train_test_split

X_train, X_test,y_train, y_test = train_test_split(
                                    df_lazy.drop(['RainfallTomorrow'],axis=1),
                                    df_lazy['RainfallTomorrow'],
                                    test_size=config.TEST_SIZE,
                                    random_state=config.RANDOM_STATE
                                    )

print(X_train.shape, X_test.shape)

(24929, 46) (6233, 46)


In [63]:
X_train.shape, y_train.shape, X_test.shape,y_test.shape

((24929, 46), (24929,), (6233, 46), (6233,))

In [2]:
from lazypredict.Supervised import LazyRegressor
reg = LazyRegressor(ignore_warnings=False, predictions=True, random_state=config.RANDOM_STATE)
models, predictions = reg.fit(X_train, X_test, y_train, y_test)



NameError: ignored

In [1]:
models

NameError: ignored

# Modeling - Regression

* Quick recap of our raw dataset

In [None]:
print(df.shape)
df.head(3)

* Split Train and Test Set

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test,y_train, y_test = train_test_split(
                                    df.drop(['RainfallTomorrow'],axis=1),
                                    df['RainfallTomorrow'],
                                    test_size=config.TEST_SIZE,
                                    random_state=config.RANDOM_STATE
                                    )

print(X_train.shape, X_test.shape)

* Use lazy-predict

* Create Pipeline

In [None]:
pipeline_regressor = PipelineRegressor()
pipeline_regressor

* Fit Cluster pipeline

In [None]:
X = df.copy()

pipeline_cluster = PipelineCluster()
pipeline_cluster.fit(X)

* The cluster model output is an array with cluster labels

In [None]:
pipeline_cluster['model'].labels_

In [None]:
pipeline_cluster['model'].labels_.shape

* The goal is to merge cluster labels to our data.
  * However, the pipeline dropped rows with missing values in ['RainfallToday', 'RainToday'] and added missing-indicator columns (AddMissingIndicator) along the way
  * Before merging, we need to apply the same steps to the data so that rows and labels line up
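
A minimal sketch of the alignment problem, on invented data: once rows are dropped, the labels array has one entry per surviving row, so the same filter has to be applied to the data before the labels are attached. The label values here are hypothetical.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "RainfallToday": [1.2, np.nan, 0.4, 3.1],
    "Humidity3pm": [55.0, 60.0, np.nan, 70.0],
})

# Same row filter that was applied before fitting the model
kept = raw.dropna(subset=["RainfallToday"]).copy()

# Hypothetical labels: one per row that survived the filter
labels = np.array([0, 1, 0])

# Lengths must match before the labels can be attached
assert len(labels) == len(kept)
kept["Clusters"] = labels
print(kept)
```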

In [None]:
drop_imputer = DropMissingData(variables =['RainfallToday' , 'RainToday'])  #,'RainfallTomorrow','RainTomorrow'])
X = drop_imputer.fit_transform(X)

na_imputer =  AddMissingIndicator(variables= ['Sunshine', 'Evaporation', 'Cloud3pm',
                                           'Cloud9am', 'Pressure9am', 'Pressure3pm',
                                            'WindDir9am', 'WindGustDir', 'WindGustSpeed',
                                            'Humidity3pm', 'WindDir3pm', 'Temp3pm',
                                            #  'RainfallTomorrow','RainTomorrow',  ##########
                                            'RainfallToday', 'RainToday',
                                            'WindSpeed3pm', 'Humidity9am','Temp9am',
                                            'WindSpeed9am', 'MinTemp','MaxTemp'])
X = na_imputer.fit_transform(X)
X.shape

* We add a column "Clusters" to the data and check the cluster distribution

In [None]:
X['Clusters'] = pipeline_cluster['model'].labels_
X['Clusters'] = X['Clusters'].astype('object')

print(f"* Clusters frequencies \n{ X['Clusters'].value_counts(normalize=True)} \n\n")
X['Clusters'].value_counts().sort_values().plot(kind='bar');

* The clusters do not look imbalanced
* This is what our data looks like from now on
  * Check the last column: Clusters
  * Quick reminder: the data is unprocessed (no data cleaning or feature engineering), except for DropMissingData(variables =['RainfallToday', 'RainToday'])

In [None]:
print(X.shape)
X.head(3)

# Regressor Evaluation

* To evaluate the clusters' silhouette we need:
  * the transformed data (transform the data with the pipeline, without the model step)
  * the array of cluster labels
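
The two ingredients above can be combined as in the sketch below. It uses scikit-learn's `silhouette_score` on toy data; the `KMeans` model and the data points are stand-ins chosen for illustration, not the notebook's cluster pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy stand-in for the transformed data: two well-separated groups
X_transformed = np.array([
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
    [10.0, 10.1], [10.2, 10.0], [10.1, 10.2],
])

# Toy stand-in for the cluster model; labels has one entry per row
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_transformed)

# Mean silhouette coefficient: close to 1 means compact, well-separated clusters
score = silhouette_score(X_transformed, labels)
print(round(score, 3))
```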

# Classifier to explain cluster

* We need to find the most relevant variables, to define each cluster in terms of each relevant variable

In [None]:
df_clf = X.copy() #.sample(frac=0.051, random_state=config.RANDOM_STATE)
df_clf['Clusters'] = df_clf['Clusters'].astype('int32')
print(df_clf.shape)
df_clf.head(3)

* Split Train and Test sets

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test,y_train, y_test = train_test_split(
                                    df_clf.drop(['Clusters'],axis=1),
                                    df_clf['Clusters'],
                                    test_size=config.TEST_SIZE,
                                    random_state=config.RANDOM_STATE,
                                    stratify=df_clf['Clusters']
                                    )

print(X_train.shape, X_test.shape)

* Create pipeline

In [None]:
pipeline_clf_cluster = PipelineClf2ExplainClusters()
pipeline_clf_cluster

In [None]:
pipeline_clf_cluster['MedianImputer'].imputer_dict_ 

* Fit pipeline

In [None]:
pipeline_clf_cluster.fit(X_train,y_train)

# do GridCV after

 * Evaluate model performance on Train and Test sets

In [None]:
from sklearn.metrics import classification_report
print(
      classification_report(y_train, pipeline_clf_cluster.predict(X_train))
  )

In [None]:
print(
      classification_report(y_test, pipeline_clf_cluster.predict(X_test))
  )

* Check main features importance

In [None]:
df_feature_importance = pd.DataFrame(data={
    'Attribute': df_clf.columns[pipeline_clf_cluster['feat_selection'].get_support()],
    'Importance': pipeline_clf_cluster['model'].feature_importances_
  })

df_feature_importance.sort_values(by='Importance', ascending=False).plot(kind='bar',x='Attribute',y='Importance');

In [None]:
from sklearn.model_selection import GridSearchCV
_parameters = {
    'model__n_estimators':[50], # [100,200,50],
    'model__max_depth': [3] # [None,3,10]
}


_pipe = GridSearchCV(
    estimator=pipeline_clf_cluster,
    param_grid=_parameters,
    cv=2, n_jobs=-2, verbose=2)
_pipe.fit(X_train, y_train)

In [None]:
PipelineToDeploy = _pipe.best_estimator_
PipelineToDeploy

In [None]:
_pipe.best_params_

In [None]:
X_train.columns[PipelineToDeploy['feat_selection'].get_support()].to_list()

In [None]:
from sklearn.metrics import classification_report
print( classification_report(y_test, PipelineToDeploy.predict(X_test)) )

# Clusters Profile

* Main variables that define a cluster

1.   Using the main features from the previous classifier
2.   And the variables we are interested in (business acumen)



In [None]:
main_clusters_variables =  df_feature_importance['Attribute'].to_list() + ['State','RainToday','Cloud3pm']
main_clusters_variables

In [None]:
df_cluster_profile = X.copy()
for col in ['Cloud9am','Cloud3pm']:
  df_cluster_profile[col] =df_cluster_profile[col].astype('object')


df_cluster_profile = df_cluster_profile.filter(items=main_clusters_variables+['Clusters'],axis=1)

num_var = df_cluster_profile.filter(main_clusters_variables,axis=1).select_dtypes(include=['number']).columns.to_list()
categorical_var = df_cluster_profile.filter(main_clusters_variables,axis=1).select_dtypes(exclude=['number']).columns.to_list()


In [None]:
df_cluster_profile.info()

## Custom Functions for Cluster Analysis

* Distribution profile for all clusters

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style("darkgrid")

def PlotClustersDistribution(df,num_var,categorical_var):
  for col in num_var:
    print(f"* {col} distribution per cluster")
    plt.figure(figsize=(10,5));
    sns.kdeplot(data=df, x=col, hue="Clusters",palette='Set2')
    plt.show()
    print("\n")

  for col in categorical_var:
    print(f"* {col} distribution per cluster")
    plt.figure(figsize=(15,5));
    sns.countplot(data=df.sort_values(by=col), hue=col, x="Clusters",palette='Set2')
    plt.legend(loc='upper right')
    plt.show()
    print("\n")


* Individual Cluster Analysis

In [None]:
# def IndividualClusterAnalysis(df_cluster_profile):

#   sns.set_style("darkgrid")
#   for cluster in df_cluster_profile.sort_values(by='Clusters')['Clusters'].unique():

#     df_cluster = df_cluster_profile.query(f"Clusters == {cluster}")
#     print(f"=================== Cluster {cluster} ===================")
    
#     for col in num_var:
#       print(f"* {col} distribution for cluster {cluster}")
#       plt.figure(figsize=(10,5));
#       sns.histplot(data=df_cluster, x=col)
#       plt.show();

#       iqr = df_cluster[col].quantile([0.25,0.75])
#       print(f"* IQR: {iqr[0.25]} - {iqr[0.75]}")
#       print("\n")

#     for col in categorical_var:
#       print(f"* {col} distribution for cluster {cluster}")
#       try:
#         plt.figure(figsize=(10,5));
#         sns.countplot(data=df_cluster, x=col)
#         freq = df_cluster[col].value_counts()
#         plt.show()
#       except Exception as e:
#         print(e)
#       print("\n")

    


* Description All Clusters
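
As a rough illustration of the kind of summary built for each cluster, here is a simplified sketch on invented values: the most frequent levels with their shares for a categorical variable, and the interquartile range for a numerical one. It is not the notebook's function, only the same idea in a few lines.

```python
import pandas as pd

# Hypothetical values for a single cluster
toy = pd.DataFrame({
    "WindGustDir": ["N", "N", "N", "S", "S"],
    "Humidity3pm": [40.0, 50.0, 60.0, 70.0, 80.0],
})

# Categorical variable: share of (up to) the three most frequent levels
top3 = toy["WindGustDir"].value_counts(normalize=True).nlargest(3)
categorical_summary = " ; ".join(
    f"'{level}' ({int(round(share * 100))}%)" for level, share in top3.items()
)

# Numerical variable: first and third quartiles
q1, q3 = toy["Humidity3pm"].quantile([0.25, 0.75])
numerical_summary = f"{int(round(q1))} -- {int(round(q3))}"

print(categorical_summary)  # 'N' (60%) ; 'S' (40%)
print(numerical_summary)    # 50 -- 70
```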

In [None]:
def Clusters_IndividualDescription(EDA_Cluster,cluster):

  ClustersDescription = pd.DataFrame(columns=EDA_Cluster.columns)
  for col in EDA_Cluster.columns:
    
    try:  # a given cluster may have only missing data for a given variable
      
      if EDA_Cluster[col].dtypes == 'object':
        
        top_frequencies = EDA_Cluster.dropna(subset=[col])[[col]].value_counts(normalize=True).nlargest(n=3)
        Description = ''
        
        for x in range(len(top_frequencies)):
          freq = top_frequencies.iloc[x]
          category = top_frequencies.index[x][0]
          CategoryPercentage = int(round(freq*100,0))
          statement =  f"'{category}' ({CategoryPercentage}%) ; "  
          Description = Description + statement
        
        ClustersDescription.at[0,col] = Description[:-2]


      
      elif EDA_Cluster[col].dtypes in ['float', 'int']:
        DescStats = EDA_Cluster.dropna(subset=[col])[[col]].describe()
        Q1 = int(round(DescStats.iloc[4,0],0))
        Q3 = int(round(DescStats.iloc[6,0],0))
        Description = f"{Q1} -- {Q3}"
        ClustersDescription.at[0,col] = Description
    
    
    except Exception as e:
      ClustersDescription.at[0,col] = 'Not available'
      print(f"** Error Exception: {e} - cluster {cluster}, variable {col}")
  
  ClustersDescription['Cluster'] = str(cluster)
  
  return ClustersDescription


def DescriptionAllClusters(df_cluster_profile):

  DescriptionAllClusters = pd.DataFrame(columns=df_cluster_profile.drop(['Clusters'],axis=1).columns)
  for cluster in df_cluster_profile.sort_values(by='Clusters')['Clusters'].unique():
    
      EDA_ClusterSubset = df_cluster_profile.query(f"Clusters == {cluster}").drop(['Clusters'],axis=1)
      ClusterDescription = Clusters_IndividualDescription(EDA_ClusterSubset,cluster)
      # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
      DescriptionAllClusters = pd.concat([DescriptionAllClusters, ClusterDescription])

  
  DescriptionAllClusters.set_index(['Cluster'],inplace=True)
  return DescriptionAllClusters




## All Cluster Analysis

In [None]:
pd.set_option('display.max_colwidth', None)
DescriptionAllClusters(df_cluster_profile)

In [None]:
PlotClustersDistribution(df=df_cluster_profile,num_var=num_var,categorical_var=categorical_var)

## Individual Cluster Analysis

In [None]:
# IndividualClusterAnalysis(df_cluster_profile)  # maybe remove? analysis above is better