# Cluster

## Objectives

* Fit and evaluate a cluster model to group Australian cities/states based on weather information
* Understand the profile of each cluster


## Inputs

* content/WalkthroughProject/outputs/datasets/collection/WeatherAustralia.csv
* Instructions on which variables to use for data cleaning and feature engineering. They are found in their respective notebooks.

## Outputs

* Cluster model
* Classifier model to explain clusters

## Additional Comments | Insights | Conclusions


* How to translate clusters to a map?
  * The dataset is a time series: each row is one day for one city



---

# Install and Import packages

* You may need to restart the runtime after installing packages; check the cell output after each installation

In [None]:
! pip install feature-engine==1.0.2
! pip install scikit-learn==0.23.2
! pip install yellowbrick==1.2



# Code for restarting the runtime; this restarts the Colab session
# It is good practice to restart after installing a package in a Colab session
import os
os.kill(os.getpid(), 9)

---

# Setup GPU

* Go to Edit → Notebook Settings
* In the Hardware accelerator menu, select GPU
* Note: when you select an option (GPU, TPU or None), you switch among kernels/sessions

---
* How do I know if I am using the GPU?
  * Run the code below. If the output is a device name such as '/device:GPU:0', rather than '0' or an empty result, you are using the GPU in this session


In [1]:
import tensorflow as tf
tf.test.gpu_device_name()

'/device:GPU:0'

# **Connection between: Colab Session and your GitHub Repo**

### Insert your **credentials**

* The variable's content will exist only while the session exists. Once this session terminates, the variable's content will be erased permanently.

In [10]:
from getpass import getpass
import os
from IPython.display import clear_output 

print("=== Insert your credentials === \nType in and hit Enter")
os.environ['UserName'] = getpass('GitHub User Name: ')
os.environ['UserEmail'] = getpass('GitHub User E-mail: ')
os.environ['RepoName'] = getpass('GitHub Repository Name: ')
os.environ['UserPwd'] = getpass('GitHub Account Password: ')
clear_output()
print("* Thanks for inserting your credentials!")
print(f"* You may now Clone your Repo to this Session, "
      f"then Connect this Session to your Repo.")

* Thanks for inserting your credentials!
* You may now Clone your Repo to this Session, then Connect this Session to your Repo.


* **Credentials format disclaimer**: when opening Jupyter notebooks hosted on GitHub in Colab, please avoid special characters in your **password**, such as ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { } | ~
  * Otherwise the git push command will not work properly, because the credentials are concatenated into the remote URL (username:password@github.com/username/repo) and special characters break that URL
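
One possible workaround, which is not part of the original notebook, is to percent-encode the credentials before they are placed in the remote URL. The sketch below uses only the standard library; `build_remote_url` is a hypothetical helper and the user name and password are made-up values. Whether your Git setup accepts percent-encoded credentials is an assumption you should verify.

```python
from urllib.parse import quote


def build_remote_url(user: str, password: str, repo: str) -> str:
    """Return an HTTPS remote URL with percent-encoded credentials (hypothetical helper)."""
    # safe="" forces every reserved character, including "/", to be encoded
    user_enc = quote(user, safe="")
    pwd_enc = quote(password, safe="")
    return f"https://{user_enc}:{pwd_enc}@github.com/{user}/{repo}.git"


# Made-up credentials, for illustration only
url = build_remote_url("octocat", "p@ss:w/rd", "WalkthroughProject")
print(url)
```

After encoding, the only `@` left in the URL is the one separating the credentials from the host, so the URL can be parsed unambiguously.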

---

### **Clone** your GitHub Repo to your current Colab session

* So you can have access to your project's files

In [12]:
! git clone https://github.com/{os.environ['UserName']}/{os.environ['RepoName']}.git
! rm -rf sample_data   # remove the content/sample_data folder, since we don't need it for this project

import os
if os.path.isdir(os.environ['RepoName']):
  print("\n")
  %cd /content/{os.environ['RepoName']}
  print(f"\n\n* Current session directory is:{os.getcwd()}")
  print(f"* You may refresh the session folder to access {os.environ['RepoName']} folder.")
else:
  print(f"\n* The Repo {os.environ['UserName']}/{os.environ['RepoName']} was not cloned."
        f" Please check your Credentials: UserName and RepoName")

Cloning into 'WalkthroughProject'...
remote: Enumerating objects: 934, done.
remote: Counting objects: 100% (490/490), done.
remote: Compressing objects: 100% (423/423), done.
remote: Total 934 (delta 297), reused 115 (delta 46), pack-reused 444
Receiving objects: 100% (934/934), 31.68 MiB | 7.95 MiB/s, done.
Resolving deltas: 100% (523/523), done.


/content/WalkthroughProject


* Current session directory is:/content/WalkthroughProject
* You may refresh the session folder to access WalkthroughProject folder.


---

### **Connect** this Colab session to your GitHub Repo

* So that, if needed, you can push files generated in this session to your Repo.

In [13]:
! git config --global user.email {os.environ['UserEmail']}
! git config --global user.name {os.environ['UserName']}
! git remote rm origin
! git remote add origin https://{os.environ['UserName']}:{os.environ['UserPwd']}@github.com/{os.environ['UserName']}/{os.environ['RepoName']}.git

# The logic is: create a temporary file in the session and update the repo; then delete the file and update the repo again.
# If it works, it is a sign that the session is connected to the repo.
# import uuid
# file_name = "session_connection_test_" + str(uuid.uuid4()) # generates a unique file name
# with open(f"{file_name}.txt", "w") as file: file.write("text")
# print("=== Testing Session Connectivity to the Repo === \n")
# ! git add . ; ! git commit -m {file_name + "_added_file"} ; ! git push origin main 
# print("\n\n")
# os.remove(f"{file_name}.txt")
# ! git add . ; ! git commit -m {file_name + "_removed_file"}; ! git push origin main

# delete your Credentials (username and password)
os.environ['UserName'] = os.environ['UserPwd'] = os.environ['UserEmail'] = ""

* If the output above indicates there was a **failure in the authentication**, please insert your credentials again.

---

### **Push** generated/new files from this Session to GitHub repo

* Git status

In [None]:
! git status

* Git commit

In [None]:
CommitMsg = "update"
!git add .
!git commit -m {CommitMsg}

* Git Push

In [None]:
!git push origin main


---

### **Delete** Cloned Repo from current Session

* Delete cloned repo and move current directory to /content

In [11]:
%cd /content
!rm -rf {os.environ['RepoName']}

print(f"\n * Please refresh session folder to validate that {os.environ['RepoName']} folder was removed from this session.")
print(f"\n\n* Current session directory is:  {os.getcwd()}")

/content

 * Please refresh session folder to validate that WalkthroughProject folder was removed from this session.


* Current session directory is:  /content


---

# Load your data

In [14]:
import pandas as pd
df = pd.read_csv("/content/WalkthroughProject/outputs/datasets/collection/WeatherAustralia.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 27 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Date              145460 non-null  object 
 1   Location          145460 non-null  object 
 2   MinTemp           143975 non-null  float64
 3   MaxTemp           144199 non-null  float64
 4   RainfallToday     142199 non-null  float64
 5   Evaporation       82670 non-null   float64
 6   Sunshine          75625 non-null   float64
 7   WindGustDir       135134 non-null  object 
 8   WindGustSpeed     135197 non-null  float64
 9   WindDir9am        134894 non-null  object 
 10  WindDir3pm        141232 non-null  object 
 11  WindSpeed9am      143693 non-null  float64
 12  WindSpeed3pm      142398 non-null  float64
 13  Humidity9am       142806 non-null  float64
 14  Humidity3pm       140953 non-null  float64
 15  Pressure9am       130395 non-null  float64
 16  Pressure3pm       13

# Cluster Pipeline

## Custom transformer


  * Convert ['Cloud9am','Cloud3pm'] to categorical
  * Get Day, Month, Year, Weekday, IsWeekend from Date

In [15]:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# Convert ['Cloud9am','Cloud3pm'] to categorical
class ConvertToCategorical(BaseEstimator, TransformerMixin):

  def __init__(self, variables=None):
      if not isinstance(variables, list):
          self.variables = [variables]
      else:
          self.variables = variables

  def fit(self, X, y=None):
      return self

  def transform(self, X):
      X = X.copy()
      for feature in self.variables:
          X[feature] = X[feature].astype('object')

      return X


# Get Day, Month, Year, Weekday, IsWeekend from Date
class GetFeaturesFromDate(BaseEstimator, TransformerMixin):

  def __init__(self, variable=None):
      self.variable = variable

  def fit(self, X, y=None):
      return self

  def transform(self, X):
      X = X.copy()
      X[self.variable] = pd.to_datetime(X[self.variable])
      X['Day'] = X[self.variable].dt.day
      X['Month'] = X[self.variable].dt.month
      X['Year'] = X[self.variable].dt.year
      X['WeekDay']= X[self.variable].dt.weekday
      X['IsWeekend'] = X['WeekDay'].apply(lambda x: 1 if x >= 5 else 0)

      return X
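
As a quick sanity check of the date-derived columns, the sketch below applies the same logic to a three-row toy frame. `date_features` is a hypothetical stand-alone function written for this example (it uses a vectorized comparison instead of `apply`, and leaves the original date column untouched); the dates are illustrative.

```python
import pandas as pd


def date_features(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Add Day, Month, Year, WeekDay and IsWeekend columns derived from a date column."""
    out = df.copy()
    dates = pd.to_datetime(out[column])
    out["Day"] = dates.dt.day
    out["Month"] = dates.dt.month
    out["Year"] = dates.dt.year
    out["WeekDay"] = dates.dt.weekday  # Monday=0 ... Sunday=6
    out["IsWeekend"] = (out["WeekDay"] >= 5).astype(int)  # Saturday and Sunday
    return out


# Friday, Saturday and Sunday
toy = pd.DataFrame({"Date": ["2017-06-23", "2017-06-24", "2017-06-25"]})
result = date_features(toy, "Date")
print(result)
```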


## Cluster Pipeline

* Add PCA to the pipeline, consider KMeans, fit with grid search CV
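
A minimal sketch of what such a search could look like, assuming synthetic data from `make_blobs` and a plain loop over candidate cluster counts rather than `GridSearchCV` (clustering has no target, so this sketch scores each candidate with the silhouette score instead). The step order, the candidate range and all parameter values are assumptions of this example, not the notebook's final choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered weather features
X, _ = make_blobs(n_samples=300, centers=4, n_features=6,
                  cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 8):
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("PCA", PCA(n_components=3, random_state=0)),
        ("model", KMeans(n_clusters=k, n_init=10, random_state=0)),
    ])
    labels = pipe.fit_predict(X)
    # Score the clusters in the same space KMeans worked in
    X_reduced = pipe[:-1].transform(X)
    scores[k] = silhouette_score(X_reduced, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("Best number of clusters by silhouette:", best_k)
```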

In [14]:
convert_cat = ConvertToCategorical(variables=['Cloud9am','Cloud3pm'])
convert_date = GetFeaturesFromDate(variable='Date')


In [16]:
from config import config
from sklearn.pipeline import Pipeline


### Data Cleaning
from feature_engine.selection import DropFeatures
from feature_engine.imputation import DropMissingData
from feature_engine.imputation import CategoricalImputer
from feature_engine.imputation import MeanMedianImputer

### Feature Engineering
from feature_engine.outliers import Winsorizer
from feature_engine.transformation import (LogTransformer,
                                           ReciprocalTransformer,
                                           PowerTransformer,
                                           BoxCoxTransformer,
                                           YeoJohnsonTransformer)
from feature_engine.discretisation import EqualFrequencyDiscretiser
from feature_engine.encoding import RareLabelEncoder
from feature_engine.encoding import CountFrequencyEncoder

### PCA
from sklearn.decomposition import PCA

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### ML algorithms 
from sklearn.cluster import KMeans


In [17]:
ml_pipeline_cluster = Pipeline(
    [
     ### Data Cleaning
     ("ConvertToCategorical",ConvertToCategorical(variables = ['Cloud9am','Cloud3pm'])
     ),

     ("GetFeaturesFromDate",GetFeaturesFromDate(variable= 'Date')
     ),

     ("DropFeatures",DropFeatures(features_to_drop = ['Sunshine','Evaporation','Cloud9am','Date'])
     ),

     ("DropMissingData",DropMissingData(variables =['RainTomorrow', 'RainfallToday', 'RainToday','RainfallTomorrow'])
     ),

     ("CategoricalImputer",CategoricalImputer(variables=['WindDir9am', 'WindGustDir', 'WindDir3pm','Cloud3pm'],
                                              imputation_method='missing',fill_value='Missing')
     ),

     ("MedianImputer",MeanMedianImputer(imputation_method='median',
                                        variables=['Pressure3pm', 'Pressure9am','WindGustSpeed',
                                                   'Humidity3pm', 'Temp3pm', 'WindSpeed3pm', 'Humidity9am',
                                                   'WindSpeed9am','Temp9am','MaxTemp']
                                        )
     ),

     ("MeanImputer",MeanMedianImputer(imputation_method='mean',variables=['MinTemp'])
     ),

     ### Feature Engineering

     ("Winsorizer_iqr",Winsorizer(capping_method='iqr',tail='both', fold=3,variables = ['RainfallToday'])
     ),


     ("PowerTransformer",PowerTransformer(variables = ['WindSpeed3pm','Humidity3pm'])
     ),

     ("YeoJohnsonTransformer",YeoJohnsonTransformer(variables=['RainfallToday','WindGustSpeed','WindSpeed9am','Humidity9am'])
     ),

     ("EqualFrequencyDiscretiser",EqualFrequencyDiscretiser(q=5,variables = ['Latitude','Longitude' ])
     ),

     ("RareLabelEncoder_tol5",RareLabelEncoder(tol=0.05, n_categories=2, variables=['WindDir3pm'])
     ),

     ("RareLabelEncoder_tol7",RareLabelEncoder(tol=0.06, n_categories=2, variables=['State'])
     ),

     ("CountEncoder",CountFrequencyEncoder(encoding_method='count',
                                           variables = ['Location','WindGustDir','WindDir9am','WindDir3pm','State'])
     ),

     ### Feature Selection - Dimensionality Reduction    
     ("PCA",PCA(n_components=3,random_state=config.RANDOM_STATE)
     ),

     ### Feature Scaling
     ("scaler",StandardScaler()
     ),
     
     ### Model
     ("model",KMeans(n_clusters=6,random_state=config.RANDOM_STATE)
     )

    ])

In [18]:
ml_pipeline_cluster

Pipeline(steps=[('ConvertToCategorical',
                 ConvertToCategorical(variables=['Cloud9am', 'Cloud3pm'])),
                ('GetFeaturesFromDate', GetFeaturesFromDate(variable='Date')),
                ('DropFeatures',
                 DropFeatures(features_to_drop=['Sunshine', 'Evaporation',
                                                'Cloud9am', 'Date'])),
                ('DropMissingData',
                 DropMissingData(variables=['RainTomorrow', 'RainfallToday',
                                            'RainToday', 'RainfallT...
                 RareLabelEncoder(n_categories=2, variables=['WindDir3pm'])),
                ('RareLabelEncoder_tol7',
                 RareLabelEncoder(n_categories=2, tol=0.06,
                                  variables=['State'])),
                ('CountEncoder',
                 CountFrequencyEncoder(variables=['Location', 'WindGustDir',
                                                  'WindDir9am', 'WindDir3pm',
                

# Elbow Analysis and Quick Silhouette Visualizer

* Prepare data for analysis
  * You need to clean and feature engineer your data using the pipeline (every step except the model step)
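
One way to run every step except the model, sketched here on random toy data, is to slice the scikit-learn `Pipeline`. The names `toy_pipeline` and `preprocessing_only` and the three-step pipeline are assumptions of this example; the notebook itself uses a project helper for this purpose.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))  # toy numeric data

toy_pipeline = Pipeline([
    ("PCA", PCA(n_components=3, random_state=0)),
    ("scaler", StandardScaler()),
    ("model", KMeans(n_clusters=3, n_init=10, random_state=0)),
])

# Slicing returns a new Pipeline holding all steps but the last one
preprocessing_only = toy_pipeline[:-1]
X_prepared = preprocessing_only.fit_transform(X)
print(X_prepared.shape)
```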

In [None]:
from scr.FeatEngineering.ApplyPipeline_FeatEng import ApplyFeatEngPipeline

# ApplyFeatEngPipeline comes from the project's own code
df = ApplyFeatEngPipeline(df)
df = df.drop([config.TARGET], axis=1)  # remove the target before clustering

* Elbow Analysis

In [None]:
nClusters = 4  # number of clusters used for the silhouette visualizer
# This is done in two steps: first the elbow analysis to choose the number of clusters,
# then the silhouette visualizer with that number of clusters.
# KMeansAlgoAnalysis(df, nClusters)  # helper is not defined in this notebook

# It needs the data already transformed by the cluster pipeline

from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans
visualizer = KElbowVisualizer(KMeans(), k=(1,16))
visualizer.fit(df) 
visualizer.show() 
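
To make explicit what the elbow curve is built from, the sketch below fits KMeans for several values of k on synthetic data and records the within-cluster sum of squares (`inertia_`). The data, the range of k and the parameter values are assumptions of this example; the visualizer above remains the tool used in this notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 groups
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

inertias = {}
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_  # sum of squared distances to the closest centre

for k, value in inertias.items():
    print(k, round(value, 1))
```

The "elbow" is the value of k after which the inertia stops dropping sharply.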

* Quick Silhouette Visualizer

In [None]:
# It needs the data already transformed by the cluster pipeline

from yellowbrick.cluster import SilhouetteVisualizer
from sklearn.cluster import KMeans
visualizer = SilhouetteVisualizer(
    KMeans(n_clusters=nClusters),  # nClusters is set in the elbow analysis cell
    colors='yellowbrick')
visualizer.fit(df)
visualizer.show()

# Principal Component Analysis

* Calculate each component

In [None]:
import pandas as pd
import plotly.express as px
from sklearn.decomposition import PCA
from scr.FeatEngineering.ApplyPipeline_FeatEng import ApplyFeatEngPipeline

# df_orginal is assumed to hold the raw dataset (it is not defined earlier in this notebook)
df_orginal = ApplyFeatEngPipeline(df_orginal)

n_components = 3  # assumed value; matches the PCA step of the cluster pipeline and the 3D plot below

# If there is an intended target, remove it
try:
    dfNoTarget = df_orginal.drop([config.TARGET], axis=1)
except (KeyError, AttributeError):
    dfNoTarget = df_orginal.copy()

pca = PCA(n_components=n_components).fit(dfNoTarget)
x_PCA = pca.transform(dfNoTarget)  # array with the PCA-transformed data

# Generate a dataframe according to n_components
ComponentsList = ["Component " + str(number) for number in range(n_components)]
dfPCA = pd.DataFrame(data=x_PCA, columns=ComponentsList)

try:
    dfPCA_WithTarget = dfPCA.copy()
    # .values avoids index misalignment if rows were dropped during cleaning
    dfPCA_WithTarget[config.TARGET] = df_orginal[config.TARGET].astype(str).values
except (KeyError, AttributeError):
    pass  # dataset doesn't have TARGET

# How much of the data variance each component explains
dfExplVarRatio = pd.DataFrame(
    data=pca.explained_variance_ratio_,
    index=ComponentsList,
    columns=['Explained Variance Ratio'])
PercentageOfDataExplained = round(float(dfExplVarRatio['Explained Variance Ratio'].sum()) * 100, 2)

fig = px.scatter_3d(dfPCA, x='Component 0', y='Component 1', z='Component 2')
fig.update_traces(marker=dict(size=3, line=dict(width=0)))
fig.show()


* PCA summary

In [None]:
import streamlit as st  # these calls come from a Streamlit app; streamlit must be installed

st.write(
    # "* PCA - Transformed dataset:",dfPCA.shape,dfPCA,
    "* Explained Variance Ratio per PCA Component: ", dfExplVarRatio)
st.write(f"> * Together, the components explain {PercentageOfDataExplained} % of the data")
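
The explained variance ratios can also guide how many components to keep. The sketch below, on synthetic data with three dominant directions, accumulates the ratios and finds the smallest number of components reaching a 95% threshold. The data, the threshold and the variable names are assumptions of this example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Toy data: 3 informative directions of different scale, mixed into 5 features, plus small noise
latent = rng.normal(size=(200, 3)) * [5.0, 3.0, 1.0]
X = latent @ rng.normal(size=(3, 5)) + 0.01 * rng.normal(size=(200, 5))

pca = PCA(n_components=5).fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Index of the first cumulative value at or above the threshold, converted to a count
n_needed = int(np.argmax(cumulative >= 0.95)) + 1
print(cumulative.round(3))
print("Components needed for 95% of the variance:", n_needed)
```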

* Present feature composition per component

In [None]:
# Each row is a PCA component, each column an input feature
df_comp = pd.DataFrame(pca.components_, columns=dfNoTarget.columns)
fig = px.imshow(df_comp)

In [None]:
st.write("* Heatmap: Feature Composition for each PCA Component")
st.plotly_chart(fig, use_container_width=True) 

# Modeling

# Evaluation

* Use silhouette score

In [None]:
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from sklearn.metrics import silhouette_samples, silhouette_score


def Cluster_Silhouette(X,Clusters):

	cluster_labels  = Clusters
	n_clusters = len(set(cluster_labels))

	print(" Silhouette plot for each cluster")
	fig, (ax1) = plt.subplots(1, 1)
	fig.set_size_inches(18, 7)
	ax1.set_xlim([-0.1, 1])
	ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])
	
	silhouette_avg = silhouette_score(X, cluster_labels,random_state=config.RANDOM_STATE)
	print("* The silhouette average score is ",str(round(float(silhouette_avg),2)))
	# print(
	# 	f"* Silhouette assesses consistency within clusters - "
	# 	f"[Link 1] (https://en.wikipedia.org/wiki/Silhouette_(clustering)) and "
	# 	f"[Link 2] (https://dzone.com/articles/kmeans-silhouette-score-explained-with-python-exam) ")


	sample_silhouette_values = silhouette_samples(X, cluster_labels)
	y_lower = 10
	for i in range(n_clusters):
		ith_cluster_silhouette_values = \
			sample_silhouette_values[cluster_labels == i]
		ith_cluster_silhouette_values.sort()
		size_cluster_i = ith_cluster_silhouette_values.shape[0]
		y_upper = y_lower + size_cluster_i
		color = cm.nipy_spectral(float(i) / n_clusters)
		ax1.fill_betweenx(np.arange(y_lower, y_upper),
							0, ith_cluster_silhouette_values,
							facecolor=color, edgecolor=color, alpha=0.7)
		ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
		y_lower = y_upper + 10

	ax1.set_title("The silhouette plot for each cluster")
	ax1.set_xlabel("The silhouette coefficient values")
	ax1.set_ylabel("Cluster label")
	ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
	ax1.set_yticks([])
	ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
	plt.show()

* Plot clusters in a 3D scatter plot, using PCA for dimensionality reduction
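
A minimal sketch of how the data for such a plot could be prepared, assuming synthetic data and the hypothetical name `df_plot`: cluster in the full feature space, then project to three PCA components only for display.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in for the transformed dataset
X, _ = make_blobs(n_samples=200, centers=3, n_features=8, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Reduce to 3 dimensions for plotting only
components = PCA(n_components=3, random_state=0).fit_transform(X)
df_plot = pd.DataFrame(
    components, columns=["Component 0", "Component 1", "Component 2"])
df_plot["Cluster"] = labels.astype(str)  # strings are treated as discrete categories when colouring
print(df_plot.head())
```

The resulting frame can then be passed to `px.scatter_3d(df_plot, x='Component 0', y='Component 1', z='Component 2', color='Cluster')`.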

# Plot Clusters in a map

* I need a dataframe with unique locations and their lat/long/state/cluster
* Drop duplicates per location?
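
Because each location has one row per day, simply dropping duplicates would keep an arbitrary day's cluster. One alternative, sketched below under the assumption that a `Cluster` column has already been added to the daily data, is to keep the most frequent cluster per location. The toy rows and coordinates are illustrative values, not taken from the dataset.

```python
import pandas as pd

# Toy daily data: several rows per location, each with a cluster label
daily = pd.DataFrame({
    "Location": ["Sydney", "Sydney", "Sydney", "Perth", "Perth"],
    "State": ["NSW", "NSW", "NSW", "WA", "WA"],
    "Latitude": [-33.87, -33.87, -33.87, -31.95, -31.95],
    "Longitude": [151.21, 151.21, 151.21, 115.86, 115.86],
    "Cluster": [2, 2, 0, 1, 1],
})

# One row per location, labelled with its most frequent (modal) cluster
locations = (
    daily.groupby(["Location", "State", "Latitude", "Longitude"])["Cluster"]
    .agg(lambda s: s.mode().iloc[0])
    .reset_index()
)
print(locations)
```

A frame shaped like `locations` has the columns the map call below expects (`Latitude`, `Longitude`, `Cluster`, `State`, `Location`).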

In [None]:
# https://towardsdatascience.com/interactive-maps-with-python-pandas-and-plotly-following-bloggers-through-sydney-c24d6f30867e

In [None]:
import plotly.express as px 


fig = px.scatter_mapbox(xxxxx,
                        lat="Latitude", lon="Longitude", color="Cluster",
                        hover_data=['State','Location'],
                        # size='RainfallToday',
                        zoom=2.5,
                        mapbox_style="open-street-map",
                        # animation_frame='Month',
                        center={"lat":-27,"lon":133},
                        size_max=15
                        )
fig.show()