# Challenge - Rainy Day

![](https://images.unsplash.com/photo-1558920778-a82b686f0521?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=967&q=80)

Photo by [Ren zo](https://unsplash.com/photos/rsilYJQOoVo)

In this exercise, we will try to use a neural network on a typical prediction task: predicting whether tomorrow will be a rainy day.

The dataset is in `weatherAUS.csv`. Load it and explore it. The target value is the column `'RainTomorrow'`.

In [200]:
import pandas as pd
import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense, Input
from datetime import datetime
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

In [201]:
# TODO: Data exploration
df = pd.read_csv('../data/weatherAUS.csv',on_bad_lines='skip')

In [202]:
df.head()

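|   | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008-12-01 | Albury | 13.4 | 22.9 | 0.6 | NaN | NaN | W | 44.0 | W | ... | 71.0 | 22.0 | 1007.7 | 1007.1 | 8.0 | NaN | 16.9 | 21.8 | No | No |
| 1 | 2008-12-02 | Albury | 7.4 | 25.1 | 0.0 | NaN | NaN | WNW | 44.0 | NNW | ... | 44.0 | 25.0 | 1010.6 | 1007.8 | NaN | NaN | 17.2 | 24.3 | No | No |
| 2 | 2008-12-03 | Albury | 12.9 | 25.7 | 0.0 | NaN | NaN | WSW | 46.0 | W | ... | 38.0 | 30.0 | 1007.6 | 1008.7 | NaN | 2.0 | 21.0 | 23.2 | No | No |
| 3 | 2008-12-04 | Albury | 9.2 | 28.0 | 0.0 | NaN | NaN | NE | 24.0 | SE | ... | 45.0 | 16.0 | 1017.6 | 1012.8 | NaN | NaN | 18.1 | 26.5 | No | No |
| 4 | 2008-12-05 | Albury | 17.5 | 32.3 | 1.0 | NaN | NaN | W | 41.0 | ENE | ... | 82.0 | 33.0 | 1010.8 | 1006.0 | 7.0 | 8.0 | 17.8 | 29.7 | No | No |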


In [203]:
df.shape

(142193, 23)

In [204]:
df.Sunshine[df.Sunshine>0]

5939      12.3
5940      13.0
5941      13.3
5942      10.6
5943      12.2
          ... 
139108    11.0
139109     8.6
139110    11.0
139111    10.6
139112    10.7
Name: Sunshine, Length: 72069, dtype: float64

In [205]:
df.Location.value_counts()

Canberra            3418
Sydney              3337
Perth               3193
Darwin              3192
Hobart              3188
Brisbane            3161
Adelaide            3090
Bendigo             3034
Townsville          3033
AliceSprings        3031
MountGambier        3030
Launceston          3028
Ballarat            3028
Albany              3016
Albury              3011
PerthAirport        3009
MelbourneAirport    3009
Mildura             3007
SydneyAirport       3005
Nuriootpa           3002
Sale                3000
Watsonia            2999
Tuggeranong         2998
Portland            2996
Woomera             2990
Cairns              2988
Cobar               2988
Wollongong          2983
GoldCoast           2980
WaggaWagga          2976
Penrith             2964
NorfolkIsland       2964
SalmonGums          2955
Newcastle           2955
CoffsHarbour        2953
Witchcliffe         2952
Richmond            2951
Dartmoor            2943
NorahHead           2929
BadgerysCreek       2928


In [206]:
for col in df.columns:
    print(col)

Date
Location
MinTemp
MaxTemp
Rainfall
Evaporation
Sunshine
WindGustDir
WindGustSpeed
WindDir9am
WindDir3pm
WindSpeed9am
WindSpeed3pm
Humidity9am
Humidity3pm
Pressure9am
Pressure3pm
Cloud9am
Cloud3pm
Temp9am
Temp3pm
RainToday
RainTomorrow


### Location is too complicated to handle

In [207]:
# Derive a new Month feature from Date
df['Month'] = df.Date.apply(lambda x: datetime.strptime(x, '%Y-%m-%d').date().month)
df.Month

# TODO: check whether Month should be treated as a categorical feature

# Drop Date and Location
if 'Date' in df.columns:
    df = df.drop(columns=['Date'])
if 'Location' in df.columns:
    df = df.drop(columns=['Location'])
df.head()

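|   | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | WindSpeed9am | ... | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow | Month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13.4 | 22.9 | 0.6 | NaN | NaN | W | 44.0 | W | WNW | 20.0 | ... | 22.0 | 1007.7 | 1007.1 | 8.0 | NaN | 16.9 | 21.8 | No | No | 12 |
| 1 | 7.4 | 25.1 | 0.0 | NaN | NaN | WNW | 44.0 | NNW | WSW | 4.0 | ... | 25.0 | 1010.6 | 1007.8 | NaN | NaN | 17.2 | 24.3 | No | No | 12 |
| 2 | 12.9 | 25.7 | 0.0 | NaN | NaN | WSW | 46.0 | W | WSW | 19.0 | ... | 30.0 | 1007.6 | 1008.7 | NaN | 2.0 | 21.0 | 23.2 | No | No | 12 |
| 3 | 9.2 | 28.0 | 0.0 | NaN | NaN | NE | 24.0 | SE | E | 11.0 | ... | 16.0 | 1017.6 | 1012.8 | NaN | NaN | 18.1 | 26.5 | No | No | 12 |
| 4 | 17.5 | 32.3 | 1.0 | NaN | NaN | W | 41.0 | ENE | NW | 7.0 | ... | 33.0 | 1010.8 | 1006.0 | 7.0 | 8.0 | 17.8 | 29.7 | No | No | 12 |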
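
On the open question of how to treat `Month`: as a plain integer, December (12) and January (1) look far apart even though they are adjacent. One common option, shown here only as a sketch and not something this notebook does, is a cyclical sine/cosine encoding. The helper `circular_distance` is a hypothetical name used for illustration.

```python
import numpy as np

months = np.arange(1, 13)  # 1 = January, ..., 12 = December

# Map each month to a point on the unit circle
angle = 2 * np.pi * (months - 1) / 12
month_sin = np.sin(angle)
month_cos = np.cos(angle)

def circular_distance(a: int, b: int) -> float:
    """Euclidean distance between months a and b in (sin, cos) space."""
    i, j = a - 1, b - 1
    return float(np.hypot(month_sin[i] - month_sin[j],
                          month_cos[i] - month_cos[j]))

# December -> January is now as close as January -> February
dec_jan = circular_distance(12, 1)
jan_feb = circular_distance(1, 2)
```

With this encoding, `Month` would be replaced by two numeric columns, which keeps the feature matrix fully numeric.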


In [208]:
df.columns

Index(['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine',
       'WindGustDir', 'WindGustSpeed', 'WindDir9am', 'WindDir3pm',
       'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm',
       'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am',
       'Temp3pm', 'RainToday', 'RainTomorrow', 'Month'],
      dtype='object')

In [209]:
categorical_features = ['WindGustDir', 'WindDir9am', 'WindDir3pm']

# Every column that is neither categorical nor the target is treated as numeric
numeric_features = [col for col in df.columns if col not in categorical_features and col != 'RainTomorrow']
df[numeric_features].head()

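|   | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustSpeed | WindSpeed9am | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | Month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13.4 | 22.9 | 0.6 | NaN | NaN | 44.0 | 20.0 | 24.0 | 71.0 | 22.0 | 1007.7 | 1007.1 | 8.0 | NaN | 16.9 | 21.8 | No | 12 |
| 1 | 7.4 | 25.1 | 0.0 | NaN | NaN | 44.0 | 4.0 | 22.0 | 44.0 | 25.0 | 1010.6 | 1007.8 | NaN | NaN | 17.2 | 24.3 | No | 12 |
| 2 | 12.9 | 25.7 | 0.0 | NaN | NaN | 46.0 | 19.0 | 26.0 | 38.0 | 30.0 | 1007.6 | 1008.7 | NaN | 2.0 | 21.0 | 23.2 | No | 12 |
| 3 | 9.2 | 28.0 | 0.0 | NaN | NaN | 24.0 | 11.0 | 9.0 | 45.0 | 16.0 | 1017.6 | 1012.8 | NaN | NaN | 18.1 | 26.5 | No | 12 |
| 4 | 17.5 | 32.3 | 1.0 | NaN | NaN | 41.0 | 7.0 | 20.0 | 82.0 | 33.0 | 1010.8 | 1006.0 | 7.0 | 8.0 | 17.8 | 29.7 | No | 12 |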
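
The three wind-direction columns are set aside in `categorical_features`. If they were to be used later, one possible route is one-hot encoding with `pd.get_dummies`. The sketch below runs on a toy frame with made-up values; only the column names come from the notebook.

```python
import pandas as pd

# Toy frame with hypothetical values; column names follow the notebook
toy = pd.DataFrame({
    "WindGustDir": ["W", "NE", None],
    "WindDir9am": ["W", "SE", "W"],
    "Rainfall": [0.6, 0.0, 1.0],
})

encoded = pd.get_dummies(
    toy,
    columns=["WindGustDir", "WindDir9am"],
    dummy_na=True,  # extra indicator column for missing directions
    dtype=float,    # numeric indicators instead of booleans
)
```

Each direction becomes its own 0/1 column, so the result can be cast to a float array in the same way as the numeric features.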


In [210]:
df.isna().sum()

MinTemp            637
MaxTemp            322
Rainfall          1406
Evaporation      60843
Sunshine         67816
WindGustDir       9330
WindGustSpeed     9270
WindDir9am       10013
WindDir3pm        3778
WindSpeed9am      1348
WindSpeed3pm      2630
Humidity9am       1774
Humidity3pm       3610
Pressure9am      14014
Pressure3pm      13981
Cloud9am         53657
Cloud3pm         57094
Temp9am            904
Temp3pm           2726
RainToday         1406
RainTomorrow         0
Month                0
dtype: int64

### How to handle missing values?
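
One possible answer, sketched here under assumptions rather than taken from this notebook, is to fill each numeric column's missing entries with the median computed on the training split, then reuse those medians for the test split. The helper names `fit_medians` and `fill_missing` and the toy arrays are hypothetical.

```python
import numpy as np

def fit_medians(X_train: np.ndarray) -> np.ndarray:
    """Per-column medians of the training data, ignoring NaN entries."""
    return np.nanmedian(X_train, axis=0)

def fill_missing(X: np.ndarray, medians: np.ndarray) -> np.ndarray:
    """Return a copy of X where each NaN is replaced by its column median."""
    return np.where(np.isnan(X), medians, X)

# Toy stand-ins for the numeric feature matrix (hypothetical values)
X_train_toy = np.array([[1.0, np.nan], [3.0, 10.0], [np.nan, 20.0]])
X_test_toy = np.array([[np.nan, np.nan], [5.0, 30.0]])

medians = fit_medians(X_train_toy)            # learned from train only
X_train_filled = fill_missing(X_train_toy, medians)
X_test_filled = fill_missing(X_test_toy, medians)
```

scikit-learn's `SimpleImputer(strategy="median")` offers the same fit/transform pattern. For columns such as `Sunshine` or `Evaporation`, where a large share of values is missing, dropping the column may be a reasonable alternative to imputing it.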

In [211]:
df.columns

Index(['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine',
       'WindGustDir', 'WindGustSpeed', 'WindDir9am', 'WindDir3pm',
       'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm',
       'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am',
       'Temp3pm', 'RainToday', 'RainTomorrow', 'Month'],
      dtype='object')

In [212]:
# Assign the result back instead of relying on a chained in-place replace
df['RainToday'] = df['RainToday'].replace({'No': 0, 'Yes': 1})
df['RainTomorrow'] = df['RainTomorrow'].replace({'No': 0, 'Yes': 1})
target = df.RainTomorrow
target.unique()

array([0, 1])

Prepare the data.

In [213]:
# TODO: Data preparation
# 1. split
# 2. MinMaxScaler

y = target.to_numpy()
if 'RainTomorrow' in df.columns:
    X = df.drop(columns=['RainTomorrow'])
else:
    X = df

# Keep only the numeric features for now
X = X[numeric_features]

print(X.shape)
print(y.shape)

# Cast to float32 to avoid this error in mlp.fit:
## ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type float)
X = np.asarray(X).astype('float32')


# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# numeric features scaling 
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)
#X_train[numeric_features] = scaler.fit_transform(X_train[numeric_features])
#X_test[numeric_features] = scaler.transform(X_test[numeric_features])


(142193, 18)
(142193,)
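
`MinMaxScaler` rescales each column with the minimum and maximum seen during `fit`, which is why it is fitted on the training split only and then applied to the test split. The sketch below reproduces that arithmetic with NumPy on made-up arrays; the variable names are illustrative.

```python
import numpy as np

# Toy train/test splits with hypothetical values
train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
test = np.array([[2.5, 40.0]])

# "fit": column-wise statistics come from the training data only
col_min = train.min(axis=0)
col_max = train.max(axis=0)

# "transform": the same statistics are applied to both splits
train_scaled = (train - col_min) / (col_max - col_min)
test_scaled = (test - col_min) / (col_max - col_min)
```

Because the statistics come from the training data, a test value above the training maximum (40.0 here) scales to more than 1, which is expected.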


Now build an MLP model. Start with, for example, 2 hidden layers of 20 units.

In [214]:
# TODO: Build a model

def binary_class_mlp(input_dim: tuple[int, ...], nb_layers: int, nb_units: int) -> Sequential:
    # Create a so-called Sequential model
    model = Sequential()

    # Specify the input dimension via an `Input` layer
    model.add(Input(input_dim))

    # Add `nb_layers` hidden Dense layers of `nb_units` units (neurons) each
    for _ in range(nb_layers):
        model.add(Dense(nb_units, activation="sigmoid"))

    # Finally, add the output layer with one unit: the prediction
    model.add(Dense(1, activation="sigmoid"))

    # Return the created model
    return model

# 2 hidden layers of 100 units, which is what the summary below reports
mlp = binary_class_mlp(input_dim=(X.shape[1],), nb_layers=2, nb_units=100)
mlp.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_3 (Dense)             (None, 100)               1900      
                                                                 
 dense_4 (Dense)             (None, 100)               10100     
                                                                 
 dense_5 (Dense)             (None, 1)                 101       
                                                                 
Total params: 12,101
Trainable params: 12,101
Non-trainable params: 0
_________________________________________________________________
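
The parameter counts in the summary can be checked by hand: a Dense layer with `n` inputs and `m` units has `n * m` weights plus `m` biases. A small sketch, with `dense_params` as a hypothetical helper name:

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# 18 input features, two hidden layers of 100 units, one output unit
layer_sizes = [18, 100, 100, 1]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)
```

Here `counts` is `[1900, 10100, 101]` and `total` is `12101`, which agrees with the summary.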


Now compile and fit your model.

In [215]:
# TODO: Compile and fit the model
mlp.compile(optimizer="SGD", loss="binary_crossentropy", metrics=["accuracy"])

X.shape

(142193, 18)

In [216]:
# Train the model, iterating on the data in batches of 20,000 samples
mlp.fit(x=X_train, y=y_train, validation_data=(X_test, y_test), epochs=10, batch_size=20000)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f7978723040>

Now check the accuracy on the test dataset.

In [217]:
# TODO: Compute the accuracy
loss, accuracy = mlp.evaluate(X_test, y_test, verbose=0)
print("loss is:", loss)
print("accuracy is:", accuracy)

loss is: nan
accuracy is: 0.7759414911270142
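
A `nan` loss usually means that NaN values reached the network. Here the feature matrix still contains the missing values counted by `df.isna().sum()`, and `MinMaxScaler` leaves NaN entries as they are. The sketch below is a single sigmoid unit in NumPy with made-up numbers, not the Keras model. It shows how one NaN input makes the loss NaN while a thresholded prediction is still produced.

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y: np.ndarray, p: np.ndarray) -> float:
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Hypothetical inputs: the first sample has one missing feature
x = np.array([[0.2, np.nan], [0.5, 0.1]])
w = np.array([0.3, -0.4])
y = np.array([0.0, 1.0])

p = sigmoid(x @ w)                 # NaN propagates into the first prediction
loss = binary_crossentropy(y, p)   # the mean over the batch becomes NaN
predicted = (p > 0.5).astype(int)  # NaN > 0.5 is False, so it maps to class 0
```

If the same mechanism applies to the model above, every NaN prediction counts as "no rain", so the reported accuracy most likely reflects the share of dry days in the test set and not anything the network learned.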


---

Now try a classical machine learning classification method of your choice. Fit it and compute the accuracy of your model.

In [225]:
# TODO: Redo the classification with the model of your choice
#from sklearn.linear_model import LogisticRegression
#clf = LogisticRegression(random_state=0).fit(X_train, y_train)
# LogisticRegression does not accept NaN values,
# so use a classifier that handles them natively instead

from xgboost import XGBClassifier

xgb_classifier = XGBClassifier(eta=0.7)
xgb_classifier.fit(X_train, y_train)

y_pred_test = xgb_classifier.predict(X_test)

y_test.shape

# `accuracy` is a float from an earlier cell, not a function; use accuracy_score
#from sklearn.metrics import accuracy_score
#acc_test = accuracy_score(y_test, y_pred_test)
#print(f"accuracy on test: {acc_test}")


(28439,)
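
The accuracy itself is not computed in the cell above. As a minimal sketch, accuracy is the fraction of predictions that match the labels. The function below uses toy arrays with made-up values; it is named `accuracy_sketch` (a hypothetical name) so that it does not clash with the `accuracy` float from the Keras evaluation.

```python
import numpy as np

def accuracy_sketch(y_true, y_pred) -> float:
    """Fraction of predictions equal to the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Toy labels and predictions (hypothetical values)
y_true_toy = np.array([0, 1, 1, 0, 1])
y_pred_toy = np.array([0, 1, 0, 0, 1])

acc = accuracy_sketch(y_true_toy, y_pred_toy)  # 4 of 5 correct
```

On the real data, `sklearn.metrics.accuracy_score(y_test, y_pred_test)` performs the same computation.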