## **ML :** Weather AUS

#### _Rain in Australia_

🟠 `work in progress`

---

1. **Preprocessing**
    * Extraction and preparation
    * Proto-modeling
    * Value processing
    * Feature Selection
    * Feature Engineering
    * Feature Scaling
2. **Modeling**
    * Evaluation function
    * Training multiple models
    * Optimization
    * Error analysis
    * Learning curve
    * Decision

**Built-in**

**Libraries**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

**ML Objects**

In [8]:
# Algorithms
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.ensemble import RandomForestClassifier
# - -
# Evaluation, tuning, etc.
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
# - -
# Preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
# from sklearn.preprocessing import RobustScaler
# from sklearn.preprocessing import MinMaxScaler
# - -
# Metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
# from sklearn.metrics import accuracy_score
# from sklearn.metrics import average_precision_score
# from sklearn.metrics import precision_score
# - - 
# Tools
from sklearn.compose import make_column_transformer
from sklearn.tree import plot_tree
from sklearn.utils import resample

**User Code**

In [3]:
def extract_x_y(dataframe:pd.DataFrame, target:str|list[str]) -> tuple :
    """Extract Features and Target from dataset

    Args:
        dataframe (pd.DataFrame): Dataframe to extract columns from
        target (str | list[str]): Target name

    Returns:
        tuple: Feature as X, and Label as y
    """

    y = dataframe[target] 
    X = dataframe.drop(columns=target)

    print(y.unique())
    print(X.columns.to_list())

    return X, y

In [4]:
def save_cm(cm:np.ndarray, name:str) -> None :
    """Save a Confusion Matrix as CSV file in `./_outputs/` subdirectory

    Args:
        cm (np.ndarray): 2x2 Confusion Matrix built from `sklearn.metrics`,
            with the 'Yes' class at index 0 (rows = true, columns = predicted)
        name (str): A lowercase spaceless text for file name
    """

    df = pd.DataFrame({
        'Predict. Yes': [cm[0,0], cm[1,0]],
        'Predict. No': [cm[0,1], cm[1,1]]
    }, index=['True Yes', 'True No'])
    
    df.to_csv(f'./_outputs/cm_{name}.csv')
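
The row and column labels used in `save_cm` put the 'Yes' class at index 0. By default `confusion_matrix` sorts the labels, which would put 'No' first. The sketch below shows one way to get the expected order, assuming string labels 'Yes'/'No' as in the dataset; the label lists are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, for illustration only
y_true = ['Yes', 'Yes', 'No', 'No', 'No', 'Yes']
y_pred = ['Yes', 'No', 'No', 'Yes', 'No', 'Yes']

# Passing `labels` explicitly fixes the order: index 0 is 'Yes', index 1 is 'No'
cm = confusion_matrix(y_true, y_pred, labels=['Yes', 'No'])

# Rows are true classes, columns are predicted classes
true_yes_pred_yes = cm[0, 0]
true_yes_pred_no = cm[0, 1]
true_no_pred_yes = cm[1, 0]
true_no_pred_no = cm[1, 1]
```

A matrix built this way can be passed to `save_cm` as-is.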

**Notebook Setup**

In [5]:
# Colour codes
mean_c = '#FFFFFF'
median_c = '#c2e800'
default_c = '#336699'
palette_c = [
    '#b8e600', # Sunny
    '#00bfff' # Rainy
]

# Pandas
pd.options.display.max_rows = 30
pd.options.display.min_rows = 6

# Matplotlib
plt.style.use('dark_background')

plt.rcParams['figure.facecolor'] = '#242428'
plt.rcParams['axes.facecolor'] = '#242428'
plt.rcParams['axes.titleweight'] = 'bold'

**Weather AUS**

[Rain in Australia](https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package)

In [6]:
weather_file_path = './_datasets/weather_data_prepare.csv'
weather_data = pd.read_csv(weather_file_path)
# weather_data['RainTomorrow'] = weather_data['RainTomorrow'].astype('category')

weather_data.head(3)

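| | Date | Location | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir3pm | Humidity3pm | Cloud3pm | Temp3pm | RainToday | RainTomorrow | Pressure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008-12-01 | Albury | NaN | NaN | W | 44.0 | WNW | 22.0 | NaN | 21.8 | No | No | 1007.4 |
| 1 | 2008-12-02 | Albury | NaN | NaN | WNW | 44.0 | WSW | 25.0 | NaN | 24.3 | No | No | 1009.2 |
| 2 | 2008-12-03 | Albury | NaN | NaN | WSW | 46.0 | WSW | 30.0 | 2.0 | 23.2 | No | No | 1008.15 |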


**Help**

[Categorical Variables - Kaggle](https://www.kaggle.com/code/alexisbcook/categorical-variables/tutorial)

The scikit-learn algorithm for MI treats discrete features differently from continuous features. Consequently, you need to tell it which are which. As a rule of thumb, anything that must have a float dtype is not discrete. Categoricals (object or categorical dtype) can be treated as discrete by giving them a label encoding. (You can review label encodings in our Categorical Variables lesson.)
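
As a minimal sketch of that rule of thumb, the block below label-encodes the non-numeric columns and passes a boolean mask to `mutual_info_classif` through its `discrete_features` parameter. The toy frame, its column names, and the 0/1 target are hypothetical and only serve as an illustration.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Hypothetical toy data: one categorical and one continuous feature
X_toy = pd.DataFrame({
    'WindDir': ['W', 'E', 'W', 'N', 'E', 'N', 'W', 'E'] * 10,
    'Temp': [21.8, 24.3, 23.2, 18.0, 30.1, 17.5, 22.0, 29.4] * 10,
})
y_toy = pd.Series([0, 1, 0, 0, 1, 0, 0, 1] * 10)  # 1 = rain, 0 = no rain

# Label-encode non-numeric columns so they can be treated as discrete
for col in X_toy.columns:
    if not pd.api.types.is_numeric_dtype(X_toy[col]):
        X_toy[col] = pd.factorize(X_toy[col])[0]

# Integer columns are flagged as discrete; float columns stay continuous
discrete_mask = [pd.api.types.is_integer_dtype(dtype) for dtype in X_toy.dtypes]

mi_scores = mutual_info_classif(X_toy, y_toy, discrete_features=discrete_mask, random_state=0)
mi_scores = pd.Series(mi_scores, index=X_toy.columns).sort_values(ascending=False)
```

In this toy data 'E' always goes with rain, so `WindDir` receives a clearly positive score.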

One-hot encoding generally does not perform well if the categorical variable takes on a large number of values (i.e., you generally won't use it for variables taking more than 15 different values).
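
A quick way to apply this rule is to count the distinct values of each categorical column before choosing an encoding. The sketch below assumes a hypothetical sample frame; in the notebook the same check could be run on `X_train`. The threshold of 15 is the rule of thumb quoted above, not a hard limit.

```python
import pandas as pd

# Hypothetical sample, for illustration only
X_sample = pd.DataFrame({
    'Location': [f'City{i}' for i in range(20)],
    'RainToday': ['No', 'Yes'] * 10,
    'Temp3pm': [20.0 + i for i in range(20)],
})

# Count distinct values per non-numeric column
categorical_cols = [
    col for col in X_sample.columns
    if not pd.api.types.is_numeric_dtype(X_sample[col])
]
cardinality = X_sample[categorical_cols].nunique()

# Columns at or below the threshold are candidates for one-hot encoding
low_cardinality_cols = cardinality[cardinality <= 15].index.to_list()
high_cardinality_cols = cardinality[cardinality > 15].index.to_list()
```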

[Mutual Information - Kaggle](https://www.kaggle.com/code/ryanholbrook/mutual-information/tutorial)

_Locate features with the most potential_

A great first step is to construct a ranking with a feature utility metric, a function measuring associations between a feature and the target. Then you can choose a smaller set of the most useful features to develop initially and have more confidence that your time will be well spent.

The metric we'll use is called **"mutual information"**. Mutual information is a lot like correlation in that it measures a relationship between two quantities. The advantage of mutual information is that **it can detect any kind of relationship**, while correlation **only detects linear relationships.**

Mutual information is a great general-purpose metric and especially useful at the start of feature development when you might not know what model you'd like to use yet. It is:
- easy to use and interpret,
- computationally efficient,
- theoretically well-founded,
- resistant to overfitting, and,
- able to detect any kind of relationship
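
To illustrate the difference with correlation, here is a small sketch on synthetic data (an assumption, not taken from the weather dataset). The target depends on the feature through its absolute value, so the linear correlation is close to zero while the mutual information is clearly positive. A pure-noise feature is included for comparison.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2000

# Synthetic feature and a target that depends on it non-linearly
x = rng.uniform(-1, 1, size=n)
noise = rng.uniform(-1, 1, size=n)
y = (np.abs(x) > 0.5).astype(int)

# Pearson correlation misses the symmetric relationship
pearson = np.corrcoef(x, y)[0, 1]

# Mutual information scores: first entry for x, second for the noise feature
mi = mutual_info_classif(np.column_stack([x, noise]), y, random_state=0)
```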

Technical note: What we're calling uncertainty is measured using a quantity from information theory known as "entropy". The entropy of a variable means roughly: "how many yes-or-no questions you would need to describe an occurrence of that variable, on average." The more questions you have to ask, the more uncertain you must be about the variable. Mutual information is how many questions you expect the feature to answer about the target.
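
The "yes-or-no questions" reading corresponds to entropy measured in bits. The sketch below computes it by hand for small discrete samples, using the identity I(X; Y) = H(X) + H(Y) - H(X, Y). The helper names and the sample values are hypothetical; this is an illustration of the definition, not how scikit-learn estimates MI.

```python
from collections import Counter
from math import log2

def entropy_bits(values):
    """Empirical entropy of a discrete sample, in bits."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def mutual_information_bits(feature, target):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), in bits."""
    joint = list(zip(feature, target))
    return entropy_bits(feature) + entropy_bits(target) - entropy_bits(joint)

# Hypothetical samples
rain_today = ['No', 'No', 'Yes', 'Yes']
rain_tomorrow = ['No', 'No', 'Yes', 'Yes']
coin = ['H', 'T', 'H', 'T']

h_target = entropy_bits(rain_tomorrow)                            # one question needed
mi_informative = mutual_information_bits(rain_today, rain_tomorrow)  # answers that question
mi_useless = mutual_information_bits(coin, rain_tomorrow)         # answers nothing
```

Here the target needs one yes-or-no question, the first feature answers it entirely, and the coin flip answers none of it.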

[Imbalanced Classification (theory) - Neptune.ai](https://neptune.ai/blog/how-to-deal-with-imbalanced-classification-and-regression-data)

[Imbalanced Classification (example) - Elite DataScience](https://elitedatascience.com/imbalanced-classes)

[Resampling Strategies - Kaggle](https://www.kaggle.com/code/rafjaa/resampling-strategies-for-imbalanced-datasets/notebook)

---

### **1.** Preprocessing

##### **1.1** - Extraction and preparation

Extracting the _features_ and the _label_

In [7]:
X, y = extract_x_y(weather_data, 'RainTomorrow')

['No' 'Yes' nan]
['Date', 'Location', 'Evaporation', 'Sunshine', 'WindGustDir', 'WindGustSpeed', 'WindDir3pm', 'Humidity3pm', 'Cloud3pm', 'Temp3pm', 'RainToday', 'Pressure']


Splitting the training and test data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=5)
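
The output of `extract_x_y` lists `nan` among the target values, and most classifiers cannot be trained on missing labels. The sketch below shows one possible way to handle this, assuming that unlabelled rows should simply be discarded and that a stratified split is wanted; neither choice is stated in the notebook. The toy frame is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame with some missing labels
toy = pd.DataFrame({
    'Humidity3pm': np.arange(40, dtype=float),
    'RainTomorrow': ['No'] * 24 + ['Yes'] * 8 + [np.nan] * 8,
})

# Drop rows without a label before extracting features and target
labelled = toy.dropna(subset=['RainTomorrow'])
X_toy = labelled.drop(columns='RainTomorrow')
y_toy = labelled['RainTomorrow']

# `stratify` keeps the same class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=.25, random_state=5, stratify=y_toy
)
```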

In [None]:
pd.concat([
    pd.DataFrame({
        'Train label': y_train.describe(),
        'Test label': y_test.describe()
    }),
    pd.Series([
        # 'freq' is the count of the most frequent class ('No')
        (y_train.describe()['freq'] / y_train.count()) * 100,
        (y_test.describe()['freq'] / y_test.count()) * 100
    ], name='percent of no', index=['Train label', 'Test label']).to_frame().T
])
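
The share of each class can also be read directly with `value_counts(normalize=True)`. A minimal sketch on a hypothetical label series; in the notebook it would be applied to `y_train` and `y_test`.

```python
import pandas as pd

# Hypothetical labels, for illustration only
y_toy = pd.Series(['No', 'No', 'No', 'Yes'], name='RainTomorrow')

# Percentage of each class (missing labels are ignored by default)
class_share = y_toy.value_counts(normalize=True) * 100
```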

Encoding categorical variables

- `handle_unknown='ignore'` avoids errors when the validation data contains classes that aren't represented in the training data, and
- `sparse_output=False` (named `sparse` before scikit-learn 1.2) ensures that the encoded columns are returned as a NumPy array instead of a sparse matrix.

In [None]:
# Categorical columns to encode (assumption: every non-numeric column; adjust as needed)
object_cols = X_train.select_dtypes(exclude='number').columns.to_list()

# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_test = pd.DataFrame(OH_encoder.transform(X_test[object_cols]))

In [None]:
# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_test.index = X_test.index

In [None]:
# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_test = X_test.drop(object_cols, axis=1)

In [None]:
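# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_test = pd.concat([num_X_test, OH_cols_test], axis=1)
# scikit-learn requires all column names to be strings
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_test.columns = OH_X_test.columns.astype(str)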

In [None]:
# Function for comparing different approaches (classification, scored with accuracy)
from sklearn.metrics import accuracy_score

def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return accuracy_score(y_valid, preds)

In [None]:
print(score_dataset(OH_X_train, OH_X_test, y_train, y_test))

##### **1.2** - Proto-modeling

Definition and training

##### **1.3** - Value processing

Outliers

Resampling

In [None]:
count_majority, count_minority = weather_data['RainTomorrow'].value_counts()

display(
    count_majority,
    count_minority
)

In [None]:
df_class_majority = weather_data[weather_data['RainTomorrow'] == 'No']
df_class_minority = weather_data[weather_data['RainTomorrow'] == 'Yes']

display(
    df_class_majority,
    df_class_minority
)

In [None]:
# Downsample majority class
df_majority_downsampled = resample(df_class_majority, 
                                 replace=False,             # sample without replacement
                                 n_samples=count_minority,  # to match minority class
                                 random_state=5)            # reproducible results
 
# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_class_minority])
 
# Display new class counts
df_downsampled['RainTomorrow'].value_counts()
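
The counterpart to downsampling is to upsample the minority class with replacement, which keeps every majority row at the cost of duplicating minority rows. A minimal sketch on a hypothetical frame, assuming 'No'/'Yes' labels; in practice this would usually be applied to the training split only, so that duplicated rows do not leak into the test set.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced frame
toy = pd.DataFrame({
    'Humidity3pm': range(10),
    'RainTomorrow': ['No'] * 8 + ['Yes'] * 2,
})
majority = toy[toy['RainTomorrow'] == 'No']
minority = toy[toy['RainTomorrow'] == 'Yes']

# Upsample minority class
minority_upsampled = resample(minority,
                              replace=True,             # sample with replacement
                              n_samples=len(majority),  # to match majority class
                              random_state=5)           # reproducible results

# Combine majority class with upsampled minority class
upsampled = pd.concat([majority, minority_upsampled])
counts = upsampled['RainTomorrow'].value_counts()
```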

##### **1.4** - Feature processing

Feature Selection

Feature Engineering

Feature Scaling

In [None]:
cv_KF = KFold(n_splits=5, shuffle=True, random_state=5)
gd_param = {'max_depth': np.arange(1,25), 'criterion' : ['entropy', 'gini']}

m1_gd_DT = GridSearchCV(DecisionTreeClassifier(), gd_param, cv=cv_KF)
m1_gd_DT.fit(X_train, y_train)
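
Once fitted, a `GridSearchCV` object exposes the selected hyperparameters, their mean cross-validated score, and a model refitted on the whole training set. The sketch below reproduces the same setup on synthetic data generated with `make_classification` (an assumption, used so the example runs on its own), with a smaller depth range to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded training data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=5)

cv = KFold(n_splits=5, shuffle=True, random_state=5)
param_grid = {'max_depth': np.arange(1, 6), 'criterion': ['entropy', 'gini']}

search = GridSearchCV(DecisionTreeClassifier(random_state=5), param_grid, cv=cv)
search.fit(X_demo, y_demo)

best_params = search.best_params_    # best combination found in the grid
best_score = search.best_score_      # mean cross-validated accuracy for that combination
best_model = search.best_estimator_  # tree refitted on all of X_demo with best_params
```

The full table of results is available in `search.cv_results_`, which can be loaded into a DataFrame for inspection.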