# Cluster

## Objectives

* Fit and evaluate a cluster model to group Australian cities/states based on weather information
* Understand the profile of each cluster


## Inputs

* content/WalkthroughProject/outputs/datasets/collection/WeatherAustralia.csv
* Instructions on which variables to use for data cleaning and feature engineering, found in their respective notebooks.

## Outputs

* Cluster model
* Classifier model to explain clusters

## Additional Comments | Insights | Conclusions


* how to translate cluster to map?
  * dataset is time series, each row is a day for each city
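One possible way to approach the map question, sketched under assumptions: since each row is one day for one city, the daily cluster labels could be collapsed to a single label per city (here the most frequent one) before plotting. The column names `Location` and `Clusters` follow this notebook; the toy values and the `DominantCluster` name are made up for illustration.

```python
import pandas as pd

# Toy daily data: one row per city per day (values are illustrative only)
daily = pd.DataFrame({
    "Location": ["Sydney", "Sydney", "Sydney", "Perth", "Perth"],
    "Clusters": [0, 0, 2, 1, 1],
})

# Most frequent cluster per city; ties resolve to the smallest label
dominant = (
    daily.groupby("Location")["Clusters"]
    .agg(lambda s: s.mode().iloc[0])
    .rename("DominantCluster")
)
print(dominant)
```

The resulting one-row-per-city table could then be joined to city coordinates for plotting.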



---

# Install and Import packages

* You may need to restart the runtime after installing packages; check the cell output when installing a package.

In [None]:
! pip install feature-engine==1.0.2
! pip install scikit-learn==0.23.2
! pip install yellowbrick==1.2


# Code for restarting the runtime, that will restart colab session
# It is a good practice after you install a package in a colab session
import os
os.kill(os.getpid(), 9)

---

# Setup GPU

* Go to Edit → Notebook Settings
* In the Hardware accelerator menu, select GPU
* Note: selecting an option (GPU, TPU or None) switches to a different kernel/session

---
* How do I know if I am using the GPU?
  * Run the code below. If the output is a non-empty string such as '/device:GPU:0', you are using the GPU in this session; an empty string means no GPU was found.


In [None]:
import tensorflow as tf
tf.test.gpu_device_name()

# **Connection between: Colab Session and your GitHub Repo**

### Insert your **credentials**

* The variable's content will exist only while the session exists. Once this session terminates, the variable's content will be erased permanently.

In [None]:
from getpass import getpass
import os
from IPython.display import clear_output 

print("=== Insert your credentials === \nType in and hit Enter")
os.environ['UserName'] = getpass('GitHub User Name: ')
os.environ['UserEmail'] = getpass('GitHub User E-mail: ')
os.environ['RepoName'] = getpass('GitHub Repository Name: ')
os.environ['UserPwd'] = getpass('GitHub Account Password: ')
clear_output()
print("* Thanks for inserting your credentials!")
print(f"* You may now Clone your Repo to this Session, "
      f"then Connect this Session to your Repo.")

* **Credentials format disclaimer**: when opening Jupyter notebooks in Colab that are hosted on GitHub, please avoid special characters in your **password**, such as @ ! " # $ % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ ` { } | ~
  * Otherwise the git push command will not work properly, because the credentials are concatenated into the command as username:password@github.com/username/repo
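The reason special characters break the command is that characters such as `@`, `:` and `/` have a meaning inside a URL. A minimal sketch of one possible workaround, percent-encoding the credentials with the standard library before building the URL, is shown here. The helper `build_remote_url` and the example values are hypothetical and not part of this notebook's workflow.

```python
from urllib.parse import quote

def build_remote_url(user: str, password: str, repo: str) -> str:
    """Hypothetical helper: percent-encode credentials before embedding them in the URL."""
    safe_user = quote(user, safe="")
    safe_pwd = quote(password, safe="")
    return f"https://{safe_user}:{safe_pwd}@github.com/{safe_user}/{repo}.git"

# "@", ":" and "/" in the password become %40, %3A and %2F
print(build_remote_url("octocat", "p@ss:word/1", "my-repo"))
```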

---

### **Clone** your GitHub Repo to your current Colab session

* So you can have access to your project's files

In [None]:
! git clone https://github.com/{os.environ['UserName']}/{os.environ['RepoName']}.git
! rm -rf sample_data   # remove content/sample_data folder, since we don't need it for this project

import os
if os.path.isdir(os.environ['RepoName']):
  print("\n")
  %cd /content/{os.environ['RepoName']}
  print(f"\n\n* Current session directory is:{os.getcwd()}")
  print(f"* You may refresh the session folder to access {os.environ['RepoName']} folder.")
else:
  print(f"\n* The Repo {os.environ['UserName']}/{os.environ['RepoName']} was not cloned."
        f" Please check your Credentials: UserName and RepoName")

---

### **Connect** this Colab session to your GitHub Repo

* So that, if needed, you can push files generated in this session to your Repo.

In [None]:
! git config --global user.email {os.environ['UserEmail']}
! git config --global user.name {os.environ['UserName']}
! git remote rm origin
! git remote add origin https://{os.environ['UserName']}:{os.environ['UserPwd']}@github.com/{os.environ['UserName']}/{os.environ['RepoName']}.git

# the logic is: create a temporary file in the session and update the repo; then delete the file and update the repo again
# If it works, it is a sign that the session is connected to the repo.
# import uuid
# file_name = "session_connection_test_" + str(uuid.uuid4()) # generates a unique file name
# with open(f"{file_name}.txt", "w") as file: file.write("text")
# print("=== Testing Session Connectivity to the Repo === \n")
# ! git add . ; ! git commit -m {file_name + "_added_file"} ; ! git push origin main 
# print("\n\n")
# os.remove(f"{file_name}.txt")
# ! git add . ; ! git commit -m {file_name + "_removed_file"}; ! git push origin main

# delete your Credentials (username and password)
os.environ['UserName'] = os.environ['UserPwd'] = os.environ['UserEmail'] = ""

* If the output above indicates an **authentication failure**, please insert your credentials again.

---

### **Push** generated/new files from this Session to GitHub repo

* Git status

In [None]:
! git status

* Git commit

In [None]:
CommitMsg = "update"
!git add .
!git commit -m {CommitMsg}

* Git Push

In [None]:
!git push origin main


---

### **Delete** Cloned Repo from current Session

* Delete cloned repo and move current directory to /content

In [None]:
%cd /content
import os
!rm -rf {os.environ['RepoName']}

print(f"\n * Please refresh session folder to validate that {os.environ['RepoName']} folder was removed from this session.")
print(f"\n\n* Current session directory is:  {os.getcwd()}")

---

# Load your data

In [None]:
import pandas as pd
df = pd.read_csv("/content/WalkthroughProject/outputs/datasets/collection/WeatherAustralia.csv")
df.drop(['RainTomorrow','RainfallTomorrow'],axis=1,inplace=True)
df.info()

# Cluster Pipeline

## Custom transformer


  * Convert ['Cloud9am','Cloud3pm'] to categorical
  * Get Day, Month, Year, WeekDay, IsWeekend from Date

In [None]:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# Convert ['Cloud9am','Cloud3pm'] to categorical
class ConvertToCategorical(BaseEstimator, TransformerMixin):

  def __init__(self, variables=None):
      if not isinstance(variables, list):
          self.variables = [variables]
      else:
          self.variables = variables

  def fit(self, X, y=None):
      return self

  def transform(self, X):
      X = X.copy()
      for feature in self.variables:
          X[feature] = X[feature].astype('object')

      return X


# Get Day, Month, Year, Weekday, IsWeekend from Date
class GetFeaturesFromDate(BaseEstimator, TransformerMixin):

  def __init__(self, variable=None):
      self.variable = variable

  def fit(self, X, y=None):
      return self

  def transform(self, X):
      X = X.copy()
      X[self.variable] = pd.to_datetime(X[self.variable])
      X['Day'] = X[self.variable].dt.day
      X['Month'] = X[self.variable].dt.month
      X['Year'] = X[self.variable].dt.year
      X['WeekDay']= X[self.variable].dt.weekday
      X['IsWeekend'] = X['WeekDay'].apply(lambda x: 1 if x >= 5 else 0)

      return X
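As a quick illustration of the date logic used in `GetFeaturesFromDate`, the sketch below applies the same pandas accessors to a toy frame. The dates are made up; the point is that `dt.weekday` counts Monday as 0, so values 5 and 6 are Saturday and Sunday.

```python
import pandas as pd

dates = pd.DataFrame({"Date": ["2021-01-01", "2021-01-02", "2021-01-04"]})
dates["Date"] = pd.to_datetime(dates["Date"])

# Monday is 0 and Sunday is 6, so values 5 and 6 mark the weekend
dates["WeekDay"] = dates["Date"].dt.weekday
dates["IsWeekend"] = (dates["WeekDay"] >= 5).astype(int)
print(dates)
```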


## ML Pipeline: Base, Cluster and ClfToExplainClusters

In [None]:
from config import config
from sklearn.pipeline import Pipeline

### Data Cleaning
from feature_engine.selection import DropFeatures
from feature_engine.imputation import DropMissingData
from feature_engine.imputation import CategoricalImputer
from feature_engine.imputation import MeanMedianImputer

### Feature Engineering
from feature_engine.outliers import Winsorizer
from feature_engine.transformation import (LogTransformer,
                                           ReciprocalTransformer,
                                           PowerTransformer,
                                           BoxCoxTransformer,
                                           YeoJohnsonTransformer)
from feature_engine.discretisation import EqualFrequencyDiscretiser
from feature_engine.encoding import RareLabelEncoder
from feature_engine.encoding import CountFrequencyEncoder

### PCA
from sklearn.decomposition import PCA

### Feat Selection
from sklearn.feature_selection import SelectFromModel

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### ML algorithms 
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingClassifier




def PipelineDataCleaningAndFeatureEngineering():

  pipeline_base = Pipeline(
      [
      ### Data Cleaning
      ("ConvertToCategorical",ConvertToCategorical(variables = ['Cloud9am','Cloud3pm'])
      ),

      ("GetFeaturesFromDate",GetFeaturesFromDate(variable= 'Date')
      ),

      ("DropFeatures",DropFeatures(features_to_drop = ['Sunshine','Evaporation','Cloud9am','Date'])
      ),

      ("DropMissingData",DropMissingData(variables =['RainfallToday', 'RainToday',
                                                      # 'RainfallTomorrow','RainTomorrow'
                                                      ])
      ),

      ("CategoricalImputer",CategoricalImputer(variables=['WindDir9am', 'WindGustDir', 'WindDir3pm','Cloud3pm'],
                                                imputation_method='missing',fill_value='Missing')
      ),

      ("MedianImputer",MeanMedianImputer(imputation_method='median',
                                          variables=['Pressure3pm', 'Pressure9am','WindGustSpeed',
                                                    'Humidity3pm', 'Temp3pm', 'WindSpeed3pm', 'Humidity9am',
                                                    'WindSpeed9am','Temp9am','MaxTemp']
                                          )
      ),

      ("MeanImputer",MeanMedianImputer(imputation_method='mean',variables=['MinTemp'])
      ),

      ### Feature Engineering

      ("Winsorizer_iqr",Winsorizer(capping_method='iqr',tail='both', fold=3,variables = ['RainfallToday'])
      ),


      ("PowerTransformer",PowerTransformer(variables = ['WindSpeed3pm','Humidity3pm'])
      ),

      ("YeoJohnsonTransformer",YeoJohnsonTransformer(variables=['RainfallToday','WindGustSpeed','WindSpeed9am','Humidity9am'])
      ),

      ("EqualFrequencyDiscretiser",EqualFrequencyDiscretiser(q=5,variables = ['Latitude','Longitude' ])
      ),

      ("RareLabelEncoder_tol5",RareLabelEncoder(tol=0.05, n_categories=2, variables=['WindDir3pm'])
      ),

      ("RareLabelEncoder_tol7",RareLabelEncoder(tol=0.06, n_categories=2, variables=['State'])
      ),

      ("CountEncoder",CountFrequencyEncoder(encoding_method='count',
                                            variables = ['Location','WindGustDir','WindDir9am',
                                                          'WindDir3pm','State','Cloud3pm',
                                                          'RainToday'])##############
      ),

      # ### Feature Selection - Dimensionality Reduction    
      # ("PCA",PCA(n_components=3,random_state=config.RANDOM_STATE)
      # ),

      # ### Feature Scaling
      # ("scaler",StandardScaler()
      # ),
      
      # ### Model
      # ("model",KMeans(n_clusters=5,random_state=config.RANDOM_STATE)
      # )

    ]
  )
  return pipeline_base


def PipelineCluster():
  pipe = PipelineDataCleaningAndFeatureEngineering()

  pipe.steps.append([
                     "PCA",PCA(n_components=3,random_state=config.RANDOM_STATE)
                     ])
  
  pipe.steps.append([
                     "scaler",StandardScaler()
                     ])
  
  pipe.steps.append([
                     "model",KMeans(n_clusters=5,random_state=config.RANDOM_STATE)
                     ])
  return pipe


def PipelineClf2ExplainClusters():
   pipe = PipelineDataCleaningAndFeatureEngineering()

   pipe.steps.append([
                     "feat_selection",SelectFromModel(GradientBoostingClassifier(random_state=config.RANDOM_STATE))
                     ])
   
   pipe.steps.append([
                     "scaler",StandardScaler()
                     ])
   
   pipe.steps.append([
                     "model",GradientBoostingClassifier(random_state=config.RANDOM_STATE)
                     ])
   return pipe
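To make the `Winsorizer_iqr` step of the base pipeline more concrete, here is a minimal NumPy sketch of IQR capping with `fold=3`. It assumes the usual rule of capping at Q1 - fold*IQR and Q3 + fold*IQR; `cap_iqr` is a hypothetical helper, and feature-engine's implementation may differ in detail.

```python
import numpy as np

def cap_iqr(values, fold=3.0):
    """Cap values outside [Q1 - fold*IQR, Q3 + fold*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return np.clip(values, q1 - fold * iqr, q3 + fold * iqr)

rain = [1, 2, 3, 4, 100]   # toy values with one extreme reading
capped = cap_iqr(rain)     # Q1=2, Q3=4, IQR=2, so the upper cap is 4 + 3*2 = 10
print(capped)
```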



In [None]:
pipeline_cluster = PipelineCluster()
pipeline_cluster

# Principal Component Analysis

* PCA needs the dataset after data cleaning and feature engineering
  * That means removing the last 3 steps of the cluster pipeline: PCA, scaler and model

In [None]:
pipeline_pca = Pipeline(pipeline_cluster.steps[:-3])
df_pca = pipeline_pca.fit_transform(df)
df_pca.head(3)

* Apply PCA component

In [None]:
from sklearn.decomposition import PCA
n_components = 3

pca = PCA(n_components=n_components).fit(df_pca)
x_PCA = pca.transform(df_pca) # array with transformed PCA

ComponentsList = ["Component " + str(number) for number in range(n_components)]
dfExplVarRatio = pd.DataFrame(
    data= np.round(100 * pca.explained_variance_ratio_ ,3),
    index=ComponentsList,
    columns=['Explained Variance Ratio (%)'])

PercentageOfDataExplained = dfExplVarRatio['Explained Variance Ratio (%)'].sum()

print(f"* The {n_components} components explain {round(PercentageOfDataExplained,2)}% of the data \n")
print(dfExplVarRatio)
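For intuition on what the explained variance ratio measures, the sketch below compares scikit-learn's `explained_variance_ratio_` with the same quantity computed by hand from the eigenvalues of the covariance matrix. The data is random toy data, not the weather dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy data: 200 rows, 4 columns with very different spreads
data = rng.normal(size=(200, 4)) * np.array([5.0, 2.0, 1.0, 0.5])

pca = PCA(n_components=3).fit(data)

# Same ratios computed by hand from the covariance eigenvalues
eigenvalues = np.sort(np.linalg.eigvalsh(np.cov(data, rowvar=False)))[::-1]
manual_ratio = eigenvalues[:3] / eigenvalues.sum()

print(np.round(100 * pca.explained_variance_ratio_, 3))
print(np.round(100 * manual_ratio, 3))
```

Each ratio is the share of total variance captured by one component, which is why the sum over the kept components can be read as the percentage of the data explained.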

* Heatmap: PCA components and variables

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
df_comp = pd.DataFrame(pca.components_, columns=df_pca.columns)
plt.figure(figsize=(20,5))
sns.heatmap(df_comp,center=0,linewidths=.5)
plt.show()

# Elbow Analysis and Quick Silhouette Visualizer

* Prepare data for analysis
  * You need to clean and feature engineer your data using the pipeline without the model

In [None]:
pipeline_elbow = Pipeline(pipeline_cluster.steps[:-1])
df_elbow = pipeline_elbow.fit_transform(df)
df_elbow.shape

* Elbow Analysis

In [None]:
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans

visualizer = KElbowVisualizer(KMeans(), k=(1,16))
visualizer.fit(df_elbow) 
visualizer.show() 

# 6 clusters
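What the elbow plot shows can be sketched without Yellowbrick: fit KMeans for several values of k and record the inertia (the sum of squared distances to the closest centre). The example uses three made-up blobs, so the drop in inertia should flatten after k=3. It illustrates the idea only and is not a replacement for the visualizer.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated toy blobs in 2D
centers = np.array([[0, 0], [10, 10], [-10, 10]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_  # sum of squared distances to the closest centre

for k, value in inertias.items():
    print(k, round(value, 1))
```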

* Quick Silhouette Visualizer

In [None]:
from yellowbrick.cluster import SilhouetteVisualizer
from sklearn.cluster import KMeans

n_clusters = 5
visualizer = SilhouetteVisualizer(KMeans(n_clusters=n_clusters),colors='yellowbrick')
visualizer.fit(df_elbow)
visualizer.show()  

# takes 9m28s when clusters=5

In [None]:
# pca 4, 6 clusters, silhouette score 0.52
# pca 3, 5 clusters, silhouette score 0.8

# Modeling - Cluster

* Fit Cluster pipeline

In [None]:
X = df.copy()
pipeline_cluster.fit(X)

* The cluster model output is an array of cluster labels

In [None]:
pipeline_cluster['model'].labels_

In [None]:
pipeline_cluster['model'].labels_.shape

* The goal is to merge the cluster labels into our data.
  * However, the pipeline dropped rows with missing values in ['RainfallToday', 'RainToday'].
  * Before merging, we need to drop the same rows from the data so both have the same length

In [None]:
drop_imputer = DropMissingData(variables =['RainfallToday', 'RainToday'])
drop_imputer.fit(X)
X = drop_imputer.transform(X)
X.shape
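The alignment matters because the labels array has one entry per row that survived the drop step, not one per original row. A small pandas-only sketch of the same idea on a toy frame follows; the values and labels are hypothetical, and `dropna` stands in for `DropMissingData`.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "RainfallToday": [0.0, np.nan, 1.2, 3.4],
    "RainToday": ["No", "No", None, "Yes"],
    "MaxTemp": [25.1, 22.0, 19.5, 18.0],
})

# Hypothetical labels: one per row that survived the drop step (rows 0 and 3)
labels = np.array([1, 0])

kept = raw.dropna(subset=["RainfallToday", "RainToday"]).copy()
assert len(kept) == len(labels), "row count and label count must match"
kept["Clusters"] = labels
print(kept)
```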

* We add a column "Clusters" to the data and check the cluster distribution

In [None]:
X['Clusters'] = pipeline_cluster['model'].labels_
X['Clusters'] = X['Clusters'].astype('object')

print(f"* Clusters frequencies \n{ X['Clusters'].value_counts(normalize=True)} \n\n")
X['Clusters'].value_counts().sort_values().plot(kind='bar');

* The clusters don't look imbalanced
* This is what our data looks like from now on
  * Check the last column: Clusters
  * Quick reminder: the data is unprocessed (no data cleaning or feature engineering applied)

In [None]:
X.head(3)

# Clusters Evaluation

* To evaluate the clusters' silhouette we need:
  * the transformed data (data passed through the pipeline without the model step)
  * the array of cluster labels
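As a small illustration of these two inputs, the sketch below computes the average and per-sample silhouette on six hand-made points forming two tight, distant groups. Values close to 1 indicate points that sit well inside their own cluster; the data is purely illustrative.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Two tight groups far apart from each other
points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])

average = silhouette_score(points, labels)        # mean over all samples
per_sample = silhouette_samples(points, labels)   # one coefficient per row
print(round(average, 2))
print(np.round(per_sample, 2))
```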

In [None]:
pipeline_silhouette = Pipeline(pipeline_cluster.steps[:-1])
df_transformed = pipeline_silhouette.transform(df)
df_transformed.shape

In [None]:
from config import config
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from sklearn.metrics import silhouette_samples, silhouette_score


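def EvaluateClusterSilhouette(X,Clusters):

	# the body refers to the labels as `cluster_labels`; cast to int so `cluster_labels == i` matches
	cluster_labels = np.asarray(Clusters).astype(int)
	n_clusters = len(set(cluster_labels))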

	print(" Silhouette plot for each cluster")
	fig, (ax1) = plt.subplots(1, 1)
	fig.set_size_inches(18, 7)
	ax1.set_xlim([-0.1, 1])
	ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])
	
	silhouette_avg = silhouette_score(X, cluster_labels,random_state=config.RANDOM_STATE)
	print("* The silhouette average score is ",str(round(float(silhouette_avg),2)))
	# print(
	# 	f"* Silhouette assesses consistency within clusters - "
	# 	f"[Link 1] (https://en.wikipedia.org/wiki/Silhouette_(clustering)) and "
	# 	f"[Link 2] (https://dzone.com/articles/kmeans-silhouette-score-explained-with-python-exam) ")


	sample_silhouette_values = silhouette_samples(X, cluster_labels)
	y_lower = 10
	for i in range(n_clusters):
		ith_cluster_silhouette_values = \
			sample_silhouette_values[cluster_labels == i]
		ith_cluster_silhouette_values.sort()
		size_cluster_i = ith_cluster_silhouette_values.shape[0]
		y_upper = y_lower + size_cluster_i
		color = cm.nipy_spectral(float(i) / n_clusters)
		ax1.fill_betweenx(np.arange(y_lower, y_upper),
							0, ith_cluster_silhouette_values,
							facecolor=color, edgecolor=color, alpha=0.7)
		ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
		y_lower = y_upper + 10

	ax1.set_title("The silhouette plot for each cluster")
	ax1.set_xlabel("The silhouette coefficient values")
	ax1.set_ylabel("Cluster label")
	ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
	ax1.set_yticks([])
	ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
	plt.show()

In [None]:
EvaluateClusterSilhouette(
    X=df_transformed,
    Clusters=X['Clusters'].values)

# State and RainToday distribution per cluster

In [None]:
variables_of_interest = ['State','RainToday']

import plotly.express as px

for col in variables_of_interest:
  df_stack = X.filter([col,'Clusters'],axis=1).groupby([col,'Clusters']).size().reset_index()
  df_stack.rename({0:"Count"},axis=1,inplace=True)
  fig = px.bar(df_stack, color = col, y = 'Count', x = 'Clusters', barmode = 'stack',width=None, height=400)
  fig.update_xaxes(type='category',categoryorder='category ascending')
  
  print(f"* Clusters per {col}")
  fig.show()



# 0 - type of day that doesn't happen in New South Wales, Victoria
# 1 - type of day that happens only in New South Wales
# 2 - type of day that happens in every state: rain day
# 3 - type of day that happens only in New South Wales, Victoria
# 4 - type of day that doesn't happen only in New South Wales

# Classifier to explain cluster

* We need to find the most relevant variables, to define each cluster in terms of each relevant variable
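The idea of using a classifier to surface the relevant variables can be sketched on toy data: the label depends on one column only, and `SelectFromModel` should keep just that column. The column names and data are made up; the selector's default threshold (mean importance) is assumed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
features = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})
target = (features["signal"] > 0).astype(int)  # label depends on "signal" only

selector = SelectFromModel(
    GradientBoostingClassifier(n_estimators=20, random_state=0)
).fit(features, target)

# get_support() is a boolean mask aligned with the columns passed to fit
selected = features.columns[selector.get_support()].to_list()
print(selected)
```

Note that the mask is aligned with whatever columns the selector was fitted on, so inside a pipeline it refers to the columns produced by the preceding steps.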

In [None]:
df_clf = X.copy() #.sample(frac=0.1, random_state=config.RANDOM_STATE)
df_clf['Clusters'] = df_clf['Clusters'].astype('int32')
print(df_clf.shape)
df_clf.head(3)

* Split Train and Test sets

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test,y_train, y_test = train_test_split(
                                    df_clf.drop(['Clusters'],axis=1),
                                    df_clf['Clusters'],
                                    test_size=config.TEST_SIZE,
                                    random_state=config.RANDOM_STATE,
                                    stratify=df_clf['Clusters']
                                    )

print(X_train.shape, X_test.shape)

* Create pipeline

In [None]:
pipeline_clf_cluster = PipelineClf2ExplainClusters()
pipeline_clf_cluster

* Fit pipeline

In [None]:
pipeline_clf_cluster.fit(X_train,y_train)

# do GridCV after

* Check main features

In [None]:
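# the selector was fitted on the preprocessed data, so the mask must be applied to those column names
cols_after_preprocessing = Pipeline(pipeline_clf_cluster.steps[:-3]).transform(X_train).columns
cols_after_preprocessing[pipeline_clf_cluster['feat_selection'].get_support()]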

In [None]:
from sklearn.metrics import classification_report
print( classification_report(y_train, pipeline_clf_cluster.predict(X_train)) )

In [None]:
print( classification_report(y_test, pipeline_clf_cluster.predict(X_test)) )

In [None]:
# pipeline_yb = Pipeline(pipeline_cluster.steps[:-1])
# X_train_yb = pipeline_yb.transform(X_train)

# from yellowbrick.model_selection import FeatureImportances
# model = GradientBoostingClassifier(random_state=config.RANDOM_STATE)
# viz = FeatureImportances(model)
# viz.fit(X_train_yb, y_train)
# viz.show()

In [None]:
from sklearn.model_selection import GridSearchCV
_parameters = {
    'model__n_estimators':[50], # [100,200,50],
    'model__max_depth': [3] # [None,3,10]
}


_pipe = GridSearchCV(
		estimator = pipeline_clf_cluster,
		param_grid = _parameters, 
		cv=2,n_jobs=-2,verbose=2)
_pipe.fit(X_train, y_train)

In [None]:
PipelineToDeploy = _pipe.best_estimator_
PipelineToDeploy

In [None]:
_pipe.best_params_
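The keys in the parameter grid follow scikit-learn's `<step name>__<parameter>` convention, which is how `GridSearchCV` knows that `model__max_depth` belongs to the `model` step. A minimal sketch on toy data, with a deliberately tiny grid:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 3))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", GradientBoostingClassifier(n_estimators=10, random_state=0)),
])

# "<step name>__<parameter>" routes each value to the matching pipeline step
search = GridSearchCV(pipe, param_grid={"model__max_depth": [1, 3]}, cv=2)
search.fit(X_toy, y_toy)
print(search.best_params_)
```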

In [None]:
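# take column names from the preprocessing output, which is what feat_selection receives
cols_after_preprocessing = Pipeline(PipelineToDeploy.steps[:-3]).transform(X_train).columns
cols_after_preprocessing[PipelineToDeploy['feat_selection'].get_support()].to_list()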

In [None]:
from sklearn.metrics import classification_report
print( classification_report(y_test, PipelineToDeploy.predict(X_test)) )

# Clusters Profile

* Main variables that define a cluster

In [None]:
df_cluster_profile = X.copy()
for col in ['Cloud9am','Cloud3pm']:
  df_cluster_profile[col] =df_cluster_profile[col].astype('object')

In [None]:
main_clusters_variables = ['MaxTemp', 'Humidity3pm', 'Cloud9am', 'Temp3pm']

num_var = df_cluster_profile.filter(main_clusters_variables,axis=1).select_dtypes(include=['number']).columns.to_list()
categorical_var = df_cluster_profile.filter(main_clusters_variables,axis=1).select_dtypes(exclude=['number']).columns.to_list()


#['State','Cloud3pm','Pressure9am','RainToday']

## All Cluster Analysis

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style("white")

def PlotClustersDistribution(df,num_var,categorical_var):
  for col in num_var:
    print(f"* {col} distribution per cluster")
    plt.figure(figsize=(15,5));
    sns.kdeplot(data=df, x=col, hue="Clusters",palette='coolwarm')
    plt.show()
    print("\n")

  for col in categorical_var:
    print(f"* {col} distribution per cluster")
    plt.figure(figsize=(15,5));
    sns.countplot(data=df.sort_values(by=col), hue=col, x="Clusters",palette='coolwarm')
    plt.legend(loc='upper right')
    plt.show()
    print("\n")




In [None]:
PlotClustersDistribution(df=df_cluster_profile,num_var=num_var,categorical_var=categorical_var)

## Individual Cluster Analysis

In [None]:
sns.set_style("darkgrid")
for cluster in df_cluster_profile.sort_values(by='Clusters')['Clusters'].unique():

  df_cluster = df_cluster_profile.query(f"Clusters == {cluster}")
  print(f"============ Cluster {cluster} ============")
  
  for col in num_var:
    print(f"* {col} distribution for cluster {cluster}")
    plt.figure(figsize=(10,5));
    sns.histplot(data=df_cluster, x=col)
    plt.show();

    iqr = df_cluster[col].quantile([0.25,0.75])
    print(f"* IQR: {iqr[0.25]} - {iqr[0.75]}")
    print("\n")

  for col in categorical_var:
    print(f"* {col} distribution for cluster {cluster}")
    try:
      plt.figure(figsize=(10,5));
      sns.countplot(data=df_cluster, x=col)
      freq = df_cluster[col].value_counts()
      plt.show()
    except Exception as e:
      print(e)
    print("\n")

    



In [None]:
freq

In [None]:
iqr[0.25]

In [None]:
# df_cluster['Cloud3pm'].value_counts().sort_values().plot(kind='bar')

df_cluster[col].dropna()#.value_counts().sort_values().plot(kind='bar')

In [None]:
df_stack = X.filter([col,'Clusters'],axis=1).groupby([col,'Clusters']).size().reset_index()

df_stack.rename({0:"Count"},axis=1,inplace=True)
df_stack
fig = px.bar(df_stack, x = 'Clusters', y = 'Count', color=col, height=400)
# fig.update_xaxes(type='category',categoryorder='category ascending')
fig.show()