## **ML :** Weather AUS

#### _Rain in Australia_

🟠 `on work`

---

1. **Preprocessing**
    * Extraction et préparation
    * Nettoyage et encodage
    * Proto-modélisation
    * Traitement du contenu
    * Feature Selection
    * Feature Engineering
    * Feature Scaling
2. **Modeling**
    * Fonction d’évaluation
    * Entrainements multiples modèles
    * Optimisation
    * Analyse des erreurs
    * Courbe d'aprentissage
    * Décision

**Built-in**

**Librairies**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

**ML Objects**

In [16]:
# Algorithms
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.ensemble import RandomForestClassifier
# - -
# Evaluation, tuning, etc.
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
# from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
# - -
# Preprocessing
# from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
# from sklearn.preprocessing import RobustScaler
# from sklearn.preprocessing import MinMaxScaler
# - -
# Imputer
from sklearn.impute import KNNImputer
# - -
# Metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
# from sklearn.metrics import accuracy_score
# from sklearn.metrics import average_precision_score
# from sklearn.metrics import precision_score
# - - 
# Tools
# from sklearn.compose import make_column_transformer
# from sklearn.tree import plot_tree
from sklearn.utils import resample

**User Code**

In [11]:
def extract_x_y(dataframe:pd.DataFrame, target:str|list[str]) -> tuple :
    """Extract Features and Target from dataset

    Args:
        dataframe (pd.DataFrame): Dataframe to extract columns from
        target (str | list[str]): Target name

    Returns:
        tuple: Feature as X, and Label as y
    """

    y = dataframe[target] 
    X = dataframe.drop(columns=target)

    print(y.unique())
    print(X.columns.to_list())

    return X, y

In [12]:
def save_cm(cm:list, name:str) -> None :
    """Save a Confusion Matrix as CSV file in `./_outputs/` subdirectory

    Args:
        cm (list): Confusion Matrix built with `sklearn.metrics.confusion_matrix`
        name (str): A lowercase spaceless text for file name
    """
        
    df = pd.DataFrame({
        'Predict. Yes': [cm[0,0], cm[1,0]],
        'Predict. No': [cm[1,0], cm[1,1]]
    }, index=['True Yes', 'True No'])
    
    df.to_csv(f'./_outputs/cm_{name}.csv')

In [98]:
def encode_binaries(X:pd.DataFrame, features:list[str]) -> pd.DataFrame :
    """I'll not describe ! Just read the code. And if you can't, just learn coding ! And if you can't, just TEACH CODERS BETTER and they'll do it for you !!! 

    Args:
        X (pd.DataFrame): _description_
        features (list[str]): _description_

    Returns:
        pd.DataFrame: _description_
    """
    
    ohe = OneHotEncoder(handle_unknown='ignore', drop='if_binary', sparse=False, dtype=np.int8)
    X_enc = pd.DataFrame()
    
    X_enc[features] = pd.DataFrame(ohe.fit_transform(X[features]), columns=features)

    # One-hot encoding removed index ; put it back
    X_enc.index = X.index
    
    X.drop(features, axis=1, inplace=True)

    X_ohe = pd.concat([X, X_enc], axis=1)
    
    return X_ohe

**Notebook Setup**

In [2]:
# Colour codes
mean_c = '#FFFFFF'
median_c = '#c2e800'
default_c = '#336699'
palette_c = [
    '#b8e600', # Sunny
    '#00bfff' # Rainy
]

# Pandas
pd.options.display.max_rows = 30
pd.options.display.min_rows = 6

# Matplotlib
plt.style.use('dark_background')

plt.rcParams['figure.facecolor'] = '#242428'
plt.rcParams['axes.facecolor'] = '#242428'
plt.rcParams['axes.titleweight'] = 'bold'

**Weather AUS**

[Rain in Australia](https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package)

In [20]:
weather_file_path = './_datasets/weather_data_prepare.csv'
weather_data = pd.read_csv(weather_file_path)

weather_data

Unnamed: 0,Evaporation,Sunshine,WindGustSpeed,Humidity3pm,Pressure,Cloud3pm,Temp3pm,RainToday,RainTomorrow
0,,,44.0,22.0,1007.40,,21.8,No,No
1,,,44.0,25.0,1009.20,,24.3,No,No
2,,,46.0,30.0,1008.15,2.0,23.2,No,No
...,...,...,...,...,...,...,...,...,...
140784,,,22.0,21.0,1021.30,,24.5,No,No
140785,,,37.0,24.0,1018.90,,26.1,No,No
140786,,,28.0,24.0,1017.95,2.0,26.0,No,No


In [21]:
pd.DataFrame({
    'Types': weather_data.dtypes,
    'Qté de Nulles' : weather_data.isnull().sum()
})

Unnamed: 0,Types,Qté de Nulles
Evaporation,float64,59694
Sunshine,float64,66805
WindGustSpeed,float64,9105
Humidity3pm,float64,3501
Pressure,float64,13583
Cloud3pm,float64,56094
Temp3pm,float64,2624
RainToday,object,0
RainTomorrow,object,0


---

### **1.** Preprocessing

##### **1.1** - Extraction et préparation

In [28]:
X, y = extract_x_y(weather_data, 'RainTomorrow')

['No' 'Yes']
['Evaporation', 'Sunshine', 'WindGustSpeed', 'Humidity3pm', 'Pressure', 'Cloud3pm', 'Temp3pm', 'RainToday']


In [29]:
y = y.astype('category')

In [30]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=5)

In [25]:
pd.concat([
    pd.DataFrame({
    'Label Entrainement': y_train.describe(),
    'Label Test': y_test.describe()
    }),
    pd.Series([
        (y_train.describe()[3] / y_train.count()) * 100,
        (y_test.describe()[3] / y_test.count()) * 100
    ], name='percent of no', index=['Label Entrainement', 'Label Test']).to_frame().T
])

Unnamed: 0,Label Entrainement,Label Test
count,105590,35197
unique,2,2
top,No,No
freq,82180,27406
percent of no,77.82934,77.864591


##### **1.2** - Nettoyage et encodage

Nettoyage des valeurs

In [44]:
knn_imputer = KNNImputer(n_neighbors=5, copy=False)
X_train[['Evaporation', 'Sunshine', 'WindGustSpeed', 'Humidity3pm', 'Pressure', 'Cloud3pm', 'Temp3pm']] = pd.DataFrame(knn_imputer.fit_transform(X_train[['Evaporation', 'Sunshine', 'WindGustSpeed', 'Humidity3pm', 'Pressure', 'Cloud3pm', 'Temp3pm']]))

X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 105590 entries, 114264 to 35683
Data columns (total 8 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Evaporation    79080 non-null   float64
 1   Sunshine       79080 non-null   float64
 2   WindGustSpeed  79080 non-null   float64
 3   Humidity3pm    79080 non-null   float64
 4   Pressure       79080 non-null   float64
 5   Cloud3pm       79080 non-null   float64
 6   Temp3pm        79080 non-null   float64
 7   RainToday      105590 non-null  object 
dtypes: float64(7), object(1)
memory usage: 7.3+ MB


In [46]:
X_train

Unnamed: 0,Evaporation,Sunshine,WindGustSpeed,Humidity3pm,Pressure,Cloud3pm,Temp3pm,RainToday
114264,,,,,,,,No
52305,6.040000,6.40000,37.000000,98.000000,1010.260000,7.200000,0.600000,No
95661,2.600000,9.20000,17.000000,42.000000,1019.700000,1.000000,20.100000,No
...,...,...,...,...,...,...,...,...
20463,8.600000,11.10000,41.000000,26.000000,1013.100000,2.000000,32.000000,No
18638,4.020000,5.24000,50.000000,30.000000,1020.200000,1.000000,29.000000,No
35683,5.464991,7.61917,39.981735,51.529439,1016.464406,4.505281,21.664885,Yes


Encodage des variables

- setting handle_unknown='ignore' to avoid errors when the validation data contains classes that aren't represented in the training data, and
- setting sparse=False ensures that the encoded columns are returned as a numpy array (instead of a sparse matrix).

In [91]:
categoricals = ['RainToday']

In [101]:
X_train = encode_binaries(X_train, categoricals)

X_train

Unnamed: 0,Evaporation,Sunshine,WindGustSpeed,Humidity3pm,Pressure,Cloud3pm,Temp3pm,RainToday
114264,,,,,,,,0
52305,6.040000,6.40000,37.000000,98.000000,1010.260000,7.200000,0.600000,0
95661,2.600000,9.20000,17.000000,42.000000,1019.700000,1.000000,20.100000,0
...,...,...,...,...,...,...,...,...
20463,8.600000,11.10000,41.000000,26.000000,1013.100000,2.000000,32.000000,0
18638,4.020000,5.24000,50.000000,30.000000,1020.200000,1.000000,29.000000,0
35683,5.464991,7.61917,39.981735,51.529439,1016.464406,4.505281,21.664885,1


##### **1.2** - Proto-modélisation

Définition et entrainement

##### **1.3** - Traitements du contenu

Valeurs aberrantes

Rééchantillonnage

In [None]:
# [?] - REMINDER
count_majority, count_minority = weather_prepared['RainTomorrow'].value_counts()

display(
    count_majority,
    count_minority
)

In [None]:
df_class_majority = weather_prepared[weather_prepared['RainTomorrow'] == 0]
df_class_minority = weather_prepared[weather_prepared['RainTomorrow'] == 1]

display(
    df_class_majority,
    df_class_minority
)

In [None]:
# Downsample majority class
df_majority_downsampled = resample(df_class_majority, 
                                 replace=False,             # sample without replacement
                                 n_samples=count_minority,  # to match minority class
                                 random_state=5)            # reproducible results
 
# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_class_minority])
 
# Display new class counts
df_downsampled['RainTomorrow'].value_counts()

##### **1.4** - Traitements Features

Features Selection

Features Engineering

Features Scaling

In [None]:
# [?] - REMINDER : also Strat'KFold 
cv_KF = KFold(n_splits=5, shuffle=True, random_state=5)
gd_param = {'max_depth': np.arange(1,25), 'criterion' : ['entropy', 'gini']}

m1_gd_DT = GridSearchCV(DecisionTreeClassifier(), gd_param, cv=cv_KF)
m1_gd_DT.fit(X_train, y_train)