## **ML :** Weather AUS

#### _Rain in Australia_

🟠 `work in progress`

---

1. **Preprocessing**
    * Variable extraction
    * Prototype modeling
    * Outlier removal
    * Feature Selection
    * Feature Engineering
    * Feature Scaling
2. **Modeling**
    * Evaluation function
    * Training multiple models
    * Optimization
    * Error analysis
    * Learning curve
    * Decision

**Built-in**

**Libraries**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

**ML Objects**

In [None]:
# Algorithms
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import plot_tree
# - -
# Evaluation, tuning, etc.
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
# - -
# Preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
# from sklearn.preprocessing import RobustScaler
# from sklearn.preprocessing import MinMaxScaler
# - -
# Metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
# from sklearn.metrics import accuracy_score
# from sklearn.metrics import average_precision_score
# from sklearn.metrics import precision_score
# - - 
from sklearn.compose import make_column_transformer

**User Code**

In [None]:
def extract_x_y(dataframe:pd.DataFrame, target:str|list[str]) -> tuple :
    """Extract Features and Target from dataset

    Args:
        dataframe (pd.DataFrame): Dataframe to extract columns from
        target (str | list[str]): Target name

    Returns:
        tuple: Feature as X, and Label as y
    """

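    y = dataframe[target]
    X = dataframe.drop(columns=target)

    # `y` is a Series for a single target and a DataFrame for a list of targets
    if isinstance(y, pd.Series):
        print(y.unique())
    else:
        print({col: y[col].unique() for col in y.columns})
    print(X.columns.to_list())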

    return X, y

In [None]:
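def save_cm(cm:np.ndarray, name:str) -> None :
    """Save a 2x2 Confusion Matrix as CSV file in `./_outputs/` subdirectory

    Args:
        cm (np.ndarray): Confusion Matrix built from `sklearn.metrics`
            (rows are true labels, columns are predicted labels)
        name (str): A lowercase spaceless text for file name
    """

    # Assumes the first label of the matrix is the "Yes" class;
    # each column of the frame holds one predicted label
    df = pd.DataFrame({
        'Yes': [cm[0,0], cm[1,0]],
        'No': [cm[0,1], cm[1,1]]
    }, index=['True Yes', 'True No'])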
    
    df.to_csv(f'./_outputs/cm_{name}.csv')

**Notebook Setup**

In [None]:
# Colour codes
mean_c = '#FFFFFF'
median_c = '#c2e800'
default_c = '#336699'
palette_c = [
    '#b8e600', # Sunny
    '#00bfff' # Rainy
]

# Pandas
pd.options.display.max_rows = 30
pd.options.display.min_rows = 6

# Matplotlib
plt.style.use('dark_background')

plt.rcParams['figure.facecolor'] = '#242428'
plt.rcParams['axes.facecolor'] = '#242428'
plt.rcParams['axes.titleweight'] = 'bold'

**Weather AUS**

[Rain in Australia](https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package)

In [None]:
weather_file_path = './_datasets/weather_data_prepare.csv'
weather_data = pd.read_csv(weather_file_path)
weather_data['RainTomorrow'] = weather_data['RainTomorrow'].astype('category')

weather_data.head(3)

---

### **1.** Preprocessing

##### **1.1** - Preparation and extraction

Extracting the _features_ and the _label_

In [None]:
X, y = extract_x_y(weather_data, 'RainTomorrow')

Encoding the categorical variables

Standardizing the values
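
This step has no code yet either. A minimal sketch with the `StandardScaler` imported above, on made-up numbers. Fitting the scaler on the training part only and reusing it on the test part is one common way to keep test statistics out of training; it is shown here as an assumption, not as the notebook's final choice.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data, for illustration only
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std on train
test_scaled = scaler.transform(test)        # reuses the train statistics

print(train_scaled.ravel())
print(test_scaled.ravel())
```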

Splitting the data into training and test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=5)

In [None]:
pd.concat([
    pd.DataFrame({
        'Training label': y_train.describe(),
        'Test label': y_test.describe()
    }),
    # 'freq' is the count of the most frequent class (assumed here to be "No")
    pd.Series([
        (y_train.describe()['freq'] / y_train.count()) * 100,
        (y_test.describe()['freq'] / y_test.count()) * 100
    ], name='percent of no', index=['Training label', 'Test label']).to_frame().T
])
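
The table above compares the share of the majority class in the two splits. If the shares drift apart, the `stratify` argument of `train_test_split` is one option to keep them aligned. The sketch below uses a made-up 80/20 label and a hypothetical `feature` column, not the weather data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up imbalanced label: 80 "No" and 20 "Yes"
y_toy = pd.Series(['No'] * 80 + ['Yes'] * 20)
X_toy = pd.DataFrame({'feature': range(100)})

# stratify=y_toy preserves the class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=.25, random_state=5, stratify=y_toy
)

train_share = (y_tr == 'No').mean()
test_share = (y_te == 'No').mean()
print(train_share, test_share)
```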

In [None]:
cv_KF = KFold(n_splits=5, shuffle=True, random_state=5)
gd_param = {'max_depth': np.arange(1,25), 'criterion' : ['entropy', 'gini']}

m1_gd_DT = GridSearchCV(DecisionTreeClassifier(), gd_param, cv=cv_KF)
m1_gd_DT.fit(X_train, y_train)
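
Once a grid search has been fitted, its outcome can be read from the `best_params_`, `best_score_` and `best_estimator_` attributes. The sketch below reproduces the same setup on a synthetic dataset from `make_classification` with a smaller depth range, so it runs without the weather file; the names `X_toy`, `y_toy` and `search` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, standing in for the real features
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=5)

cv = KFold(n_splits=5, shuffle=True, random_state=5)
params = {'max_depth': np.arange(1, 6), 'criterion': ['entropy', 'gini']}

# random_state on the tree makes the search repeatable between runs
search = GridSearchCV(DecisionTreeClassifier(random_state=5), params, cv=cv)
search.fit(X_toy, y_toy)

print(search.best_params_)            # best depth / criterion combination
print(round(search.best_score_, 3))   # mean cross-validated accuracy
best_tree = search.best_estimator_    # tree refitted on all of X_toy
```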

##### **1.2** - Prototype modeling

Definition and training