<center><img src ="https://kubrick.htvapps.com/htv-prod-media.s3.amazonaws.com/images/nhjxygyi-1559831904.jpg?crop=1.00xw:1.00xh;0,0&resize=900:*"></center>
<h1><center>  <div style="background-color:lightgreen;border-radius:10px; padding: 10px;">Introduction</div></center></h1>

🌨 Motivation
> I have been listening to rain forecasts since childhood, and it feels like they are wrong 99% of the time 😅! I used to wonder how the forecasters could be wrong so often. Is predicting rain really that difficult? This motivated me to take up the challenge of forecasting rain myself and see whether I can do better.

🎯 Goal
> The goal of this notebook is to predict, with high accuracy, whether it will rain the next day, using data science and ML techniques.

🤖 Artificial Neural Network
> I will predict rain using an ANN model. Other classification algorithms would also work, but I wanted to tackle this problem with one of the most sophisticated algorithms we have, and an ANN is one of them. The focus is on high accuracy rather than on interpreting the model or finding feature importance, which is another reason for choosing an ANN. So, let's get started!

<h1><center>  <div style="background-color:lightgreen;border-radius:10px; padding: 10px;">Reading and Cleaning the Data 🧹</div></center></h1>

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

In [None]:
# Reading the data
df = pd.read_csv('../input/weather-dataset-rattle-package/weatherAUS.csv')
df

In [None]:
# Checking the data types
df.info()

> The datatype of `Date` is object, so I will convert it to datetime for easier handling of dates.

In [None]:
# Changing the data type of Date to datetime
df['Date'] = pd.to_datetime(df['Date'])

> Now let's analyse the target variable `RainTomorrow`. I will check it for missing values and class imbalance.

In [None]:
# Checking for the missing values in the target variable
df['RainTomorrow'].isnull().sum()

> `RainTomorrow` has 3267 missing values. Since `RainTomorrow` is the variable to be predicted, we can't impute its missing values, so we have to drop the rows where it is missing.

In [None]:
# Dropping the rows with a missing target
df = df.dropna(subset = ['RainTomorrow'])

In [None]:
# Checking for the class imbalance
fig = plt.figure(figsize = (10, 6))
axis = sns.countplot(x = 'RainTomorrow', data = df);
axis.set_title('Class Distribution for the target feature', size = 16);

for patch in axis.patches:
    axis.text(x = patch.get_x() + patch.get_width()/2, y = patch.get_height()/2, 
            s = f"{np.round(patch.get_height()/len(df)*100, 1)}%", 
            ha = 'center', size = 40, rotation = 0, weight = 'bold' ,color = 'white')
    
axis.set_xlabel('Rain Tomorrow', size = 14)
axis.set_ylabel('Count', size = 14);

> A dataset is usually called imbalanced when the minority class holds only 5-10% of the data. Here 22.4% of the rows belong to the minority class, so the imbalance is mild and I will not treat it specially. Now, let's look at the other features.
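
> The class shares can also be read off as numbers rather than from the plot. A minimal sketch on a made-up target column (`toy_target` is a hypothetical stand-in for `df['RainTomorrow']`, and its values are invented):

```python
import pandas as pd

# Toy target column standing in for df['RainTomorrow'] (values are made up)
toy_target = pd.Series(['No'] * 7 + ['Yes'] * 3, name='RainTomorrow')

# normalize=True returns class shares instead of raw counts
class_share = toy_target.value_counts(normalize=True)
minority_share = class_share.min()

print(class_share)
print(f"Minority class share: {minority_share:.1%}")
```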

> I will create new `day` and `month` features from the `Date` column. Both are cyclic in nature. If I feed them to the ANN without any preprocessing, it will treat them as ordinary magnitudes. E.g. days take values from 1 to 31, so the ANN sees 31 as far away from 1, even though day 31 is immediately followed by day 1, and the model can go wrong. Thus, I will transform these features so that they are cyclic.

<center><img src = "https://www.math.hkust.edu.hk/~machiang/1013/Notes/sine_2.gif"></center>

> A point moving around a circle traces out a cyclic pattern, and its sine and cosine give its position on the circle. This concept is used here to transform the features.
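
> A quick way to see why this helps: a minimal sketch, using a hypothetical helper `cyclic_point`, that compares how far December is from January before and after the sine/cosine transformation.

```python
import numpy as np

def cyclic_point(value, max_val):
    """Map a cyclic value (e.g. month 1..12) to a point on the unit circle."""
    angle = 2 * np.pi * value / max_val
    return np.array([np.sin(angle), np.cos(angle)])

jan, jun, dec = (cyclic_point(m, 12) for m in (1, 6, 12))

# Raw values: December looks 11 steps away from January
raw_gap = abs(12 - 1)

# Encoded values: Euclidean distance between the points on the circle
dist_dec_jan = np.linalg.norm(dec - jan)
dist_jun_jan = np.linalg.norm(jun - jan)

print(raw_gap, round(dist_dec_jan, 3), round(dist_jun_jan, 3))
```

> On the raw scale December and January are 11 apart, while on the circle December is closer to January than June is.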

<h3><center>  <div style="background-color:lightgreen;border-radius:10px; padding: 10px;">Feature Engineering 📐📏</div></center></h3>

In [None]:
# Encoding months and days as cyclic continuous features

def encode(data, col, max_val):
    data[col + '_sin'] = np.sin(2 * np.pi * data[col]/max_val)
    data[col + '_cos'] = np.cos(2 * np.pi * data[col]/max_val)
    return data

df['month'] = df['Date'].dt.month
df = encode(df, 'month', 12)

df['day'] = df['Date'].dt.day
df = encode(df, 'day', 31)

In [None]:
# Let's look at the transformed features

plt.style.use('ggplot')
fig, (ax1,ax2,ax3) = plt.subplots(1,3, figsize = (12, 4), constrained_layout = True)

ax1 = sns.lineplot(x = 'month', y = 'day', data = df, estimator = None, ax=ax1)
ax2 = sns.scatterplot(x = 'day_sin', y = 'day_cos', data = df, ax = ax2)
ax3 = sns.scatterplot(x = 'month_sin', y = 'month_cos', data = df, ax = ax3)

ax1.set_title('Original Day Distribution')
ax2.set_title('Cyclic encoding of Day')
ax3.set_title('Cyclic encoding of Month')

fig.suptitle('Feature Engineering for Cyclic Features', size = 16, y = 1.1);

> Before doing any further preprocessing, I will split the data into train and test sets. The reason is that we are not supposed to see the test data, so all preprocessing should be based on the train data. If we base the preprocessing on the test data, it means that we did some cheating 😛 as we looked at the test data.

> I am splitting the data into train and test with a ratio of 80% - 20% (an arbitrary choice), and using `stratify`, which keeps the proportion of the target variable equal in both train and test data.

In [None]:
# Splitting the data into train and test
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, train_size = 0.8, random_state  = 99, stratify = df['RainTomorrow'])
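
> To check what `stratify` does, here is a small sketch on made-up data (`toy` is hypothetical, not the weather dataset) that compares the share of `Yes` in both splits:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: 20% 'Yes', 80% 'No' (made-up data)
toy = pd.DataFrame({
    'feature': range(100),
    'RainTomorrow': ['Yes'] * 20 + ['No'] * 80,
})

toy_train, toy_test = train_test_split(
    toy, train_size=0.8, random_state=99, stratify=toy['RainTomorrow']
)

# Share of the minority class in each split
train_share = (toy_train['RainTomorrow'] == 'Yes').mean()
test_share = (toy_test['RainTomorrow'] == 'Yes').mean()

print(len(toy_train), len(toy_test), train_share, test_share)
```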

<h3><center>  <div style="background-color:lightgreen;border-radius:10px; padding: 10px;">Cleaning Categorical Features 🚿</div></center></h3>

In [None]:
# Let's first handle missing values for categorical data
# [:-1] leaves out the last object column, the target RainTomorrow
categorical_col = df_train.select_dtypes('object').columns[:-1].to_list()
df_train[categorical_col].isnull().mean()*100

> As the missing values are less than 10%, I will impute them with the mode.

In [None]:
# Imputing with the mode of the train data
for col in categorical_col:
    train_mode = df_train[col].mode()[0] # computed once, from the train set only
    df_train[col] = df_train[col].fillna(train_mode)
    df_test[col] = df_test[col].fillna(train_mode) # Imputing test data using train data
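
> The idea of imputing the test data with a statistic learned from the train data can be seen on a tiny example. A minimal sketch with made-up wind directions (`toy_train` and `toy_test` are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up wind directions; np.nan marks the missing entries
toy_train = pd.DataFrame({'WindGustDir': ['N', 'N', 'W', np.nan, 'N']})
toy_test = pd.DataFrame({'WindGustDir': ['W', np.nan, 'W', 'W']})

# Learn the fill value from the training split only
train_mode = toy_train['WindGustDir'].mode()[0]

toy_train['WindGustDir'] = toy_train['WindGustDir'].fillna(train_mode)
toy_test['WindGustDir'] = toy_test['WindGustDir'].fillna(train_mode)

print(train_mode)
print(toy_test['WindGustDir'].tolist())
```

> The gap in the test split is filled with the train mode, even though the most frequent value inside the test split itself is different.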

<h3><center>  <div style="background-color:lightgreen;border-radius:10px; padding: 10px;">Cleaning Numeric Features 🚿</div></center></h3>

> Cleaning the numeric features involves the following steps:
> - Handling missing values
> - Removing multicollinearity
> - Removing outliers

> Let's do them one by one

<h4><center>  <div style="background-color:skyblue;border-radius:10px; padding: 10px;">Handling Missing Values</div></center></h4>

> Missing data present various problems. First, the absence of data reduces statistical power, which refers to the probability that the test will reject the null hypothesis when it is false. Second, the lost data can cause bias in the estimation of parameters. Third, it can reduce the representativeness of the samples.

In [None]:
# Missing values for numeric data
numeric_col = df.describe().columns.to_list()
df_train[numeric_col].isnull().mean()*100

> `Evaporation`, `Sunshine`, `Cloud9am` and `Cloud3pm` have a large share of missing values, so we will first look at these features and then handle the missing values for the remaining numeric features.

In [None]:
# Let's explore the features having high missing values
cols = ['Evaporation', 'Sunshine', 'Cloud9am', 'Cloud3pm']

plt.style.use('seaborn-dark') # named 'seaborn-v0_8-dark' in matplotlib 3.6 and later
fig, ax = plt.subplots(4,2, figsize = (12, 8), constrained_layout = True)

for i, num_var in enumerate(cols): 
    sns.kdeplot(data = df_train, x = num_var, ax = ax[i][0],
                fill = True, alpha = 0.6, linewidth = 1.5)
    ax[i][0].set_ylabel(num_var)
    ax[i][0].set_xlabel(None)
    
    sns.histplot(data = df_train, x = num_var, ax = ax[i][1])
    ax[i][1].set_ylabel(None)
    ax[i][1].set_xlabel(None)
    
fig.suptitle('Features having high missing values (>35%)', size = 16);

> Except for `Evaporation`, the other three features have values spread across their range, so we will impute their missing values with the median, and impute the missing values of `Evaporation` with the mean.

In [None]:
# Imputing the columns with high missing values (>35%) using train statistics
# (the alternative of dropping these columns is kept below, commented out)
# for dataframe in [df_train, df_test]:
#     dataframe.drop(columns = ['Sunshine', 'Cloud9am', 'Cloud3pm'], axis = 1, inplace = True)

# Fill values are computed from the train set: median for three columns, mean for Evaporation
fill_values = {col: df_train[col].median() for col in ['Sunshine', 'Cloud9am', 'Cloud3pm']}
fill_values['Evaporation'] = df_train['Evaporation'].mean()

for dataframe in [df_train, df_test]:
    for col, value in fill_values.items():
        dataframe[col] = dataframe[col].fillna(value)

> Now I will drop the rows that still have missing values in the remaining numerical features, as each of them has <10% missing. One can also impute them with the mean or median, whichever is appropriate.

In [None]:
# Dropping the rows with missing values in the remaining numerical features (<10% missing).
# numeric_col = ['MinTemp', 'MaxTemp', 'Rainfall','WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm',
#                'Humidity9am','Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Temp9am', 'Temp3pm']

# for dataframe in [df_train, df_test]:
#     for col in numeric_col:
#         # Imputing missing values with median based on train set
#         dataframe[col].fillna(df_train[col].median(), inplace = True)

df_train.dropna(inplace = True)
df_test.dropna(inplace = True)

<h4><center>  <div style="background-color:skyblue;border-radius:10px; padding: 10px;">Removing multicollinearity 🧑👦</div></center></h4>

> Multicollinearity is a problem because it undermines the statistical significance of an independent variable: it inflates the standard error of a regression coefficient, and other things being equal, the larger the standard error, the less likely it is that the coefficient will be statistically significant. Redundant features also cost storage and training time.

In [None]:
# Checking the correlation between the numeric features

numeric_col = ['MinTemp', 'MaxTemp', 'Rainfall','WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm',
               'Humidity9am','Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Temp9am', 'Temp3pm',
              'Sunshine', 'Cloud9am', 'Cloud3pm', 'Evaporation']

fig=plt.figure(figsize=(16,12))
axis=sns.heatmap(df_train[numeric_col].corr(), annot=True, linewidths=3, square=True, cmap='Blues', fmt=".0%")

axis.set_title('Correlation between the features', fontsize=16);
axis.set_xticklabels(numeric_col, fontsize=12)
axis.set_yticklabels(numeric_col, fontsize=12, rotation=0);

> #### Strong correlation between
> - `Temp3pm` and `MaxTemp`
> - `Pressure3pm` and `Pressure9am`
> - `Temp9am` and `MinTemp`
> - `Temp9am` and `MaxTemp`
> - `Temp3pm` and `Temp9am`

> We will remove one feature from each pair to avoid multicollinearity.

In [None]:
# Dropping the columns
for dataframe in [df_train, df_test]:
    dataframe.drop(['Temp3pm', 'Pressure3pm', 'Temp9am'], axis = 1, inplace = True)
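
> The pairs above were read off the heatmap. As a sketch of how such pairs could be listed programmatically, assuming a hypothetical cutoff of 0.85 and synthetic features `a`, `b` and `c` (none of these come from the notebook):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Synthetic features: 'b' is nearly a copy of 'a', 'c' is independent
toy = pd.DataFrame({
    'a': base,
    'b': base + rng.normal(scale=0.05, size=200),
    'c': rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair is listed once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

cutoff = 0.85  # hypothetical cutoff, not a value from the notebook
strong_pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > cutoff  # NaN entries compare as False
]

# Drop the second feature of every strongly correlated pair
to_drop = sorted({col for _, col in strong_pairs})
print(strong_pairs, to_drop)
```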

<h4><center>  <div style="background-color:skyblue;border-radius:10px; padding: 10px;">Removing Outliers 👚👚👚👚🩱👚</div></center></h4>

In [None]:
# Let's look at the outliers and the distribution of the numeric features

numeric_col = ['MinTemp', 'MaxTemp', 'Rainfall','WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm',
               'Humidity9am','Humidity3pm', 'Pressure9am', 'Sunshine', 'Cloud9am', 'Cloud3pm', 'Evaporation']

plt.style.use('seaborn') # named 'seaborn-v0_8' in matplotlib 3.6 and later
fig, axis = plt.subplots(13, 2, figsize = (12, 24))
for i, num_var in enumerate(numeric_col):
    
    # Checking for the outliers using boxplot
    sns.boxplot(y = num_var, data = df_train, ax = axis[i][0], color = 'skyblue')
    
    # Checking for the distribution using kdeplot
    sns.kdeplot(x = num_var, data = df_train, ax = axis[i][1], color = 'skyblue',
               fill = True, alpha = 0.6, linewidth = 1.5)
    
    axis[i][0].set_ylabel(f"{num_var}", fontsize = 12)
    axis[i][0].set_xlabel(None)
    axis[i][1].set_xlabel(None)
    axis[i][1].set_ylabel(None)

fig.suptitle('Analysing Numeric Features', fontsize = 16, y = 1)
plt.tight_layout()

> Many numeric features have data points beyond the boxplot whiskers (more than 1.5 × IQR away from the quartiles). I am using a threshold of the 5th percentile for outlier removal, i.e. any point above the 95th percentile or below the 5th percentile is considered an outlier and will be removed.

> The 5th-percentile threshold is an arbitrary choice; you can very well consider other values for the threshold.

In [None]:
threshold = 0.05
for col in numeric_col:
    
    # Lower and upper threshold
    lower_threshold = df_train[col].quantile(threshold)
    upper_threshold = df_train[col].quantile(1-threshold)
    
    # Dropping the values below lower threshold and beyond upper threshold
    df_train = df_train[(df_train[col]>=lower_threshold) & (df_train[col]<=upper_threshold)]
    df_test = df_test[(df_test[col]>=lower_threshold) & (df_test[col]<=upper_threshold)]
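
> To make the percentile rule concrete, here is a minimal sketch on a single made-up column (`toy_train` is hypothetical and simply holds the values 1 to 100):

```python
import numpy as np
import pandas as pd

# Made-up training column: values 1..100
toy_train = pd.DataFrame({'Humidity3pm': np.arange(1, 101, dtype=float)})

threshold = 0.05
lower = toy_train['Humidity3pm'].quantile(threshold)
upper = toy_train['Humidity3pm'].quantile(1 - threshold)

# Keep only the rows inside [lower, upper]; between() is inclusive by default
mask = toy_train['Humidity3pm'].between(lower, upper)
toy_trimmed = toy_train[mask]

print(lower, upper, len(toy_trimmed))
```

> The loop above applies this filter column by column, so the removals compound. As a rough upper bound, if every one of the 13 columns trimmed a full 10% of the remaining rows, only about 0.9^13 ≈ 25% of the rows would survive, so it is worth checking `len(df_train)` before and after.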

<h3><center>  <div style="background-color:lightgreen;border-radius:10px; padding: 10px;"> 👶 Transforming Features 👨</div></center></h3>

> Feature transformation is the process of modifying your data while keeping the information it carries. These modifications make the data easier for machine learning algorithms to learn from, which delivers better results. The aim is to reduce repetition and to improve performance and data integrity.

In [None]:
# Converting 'Yes' and 'No' to '1' and '0' respectively
df_train['RainTomorrow'] = df_train['RainTomorrow'].map(dict({'Yes':1, 'No':0}))
df_test['RainTomorrow'] = df_test['RainTomorrow'].map(dict({'Yes':1, 'No':0}))

In [None]:
# Dropping the features not required for model
df_train.drop(['Date', 'day', 'month'], axis = 1 ,inplace = True)
df_test.drop(['Date', 'day', 'month'], axis = 1 ,inplace = True)

In [None]:
# Splitting the data into y and X
y_train = df_train.pop('RainTomorrow')
X_train = df_train

y_test = df_test.pop('RainTomorrow')
X_test = df_test

In [None]:
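# Now the data is ready for preprocessing, let's convert categorical variables into one hot encodings
X_train = pd.get_dummies(X_train, drop_first = True, dtype = float).reset_index(drop = True) # float dummies keep all columns numeric
X_test = pd.get_dummies(X_test, drop_first = True, dtype = float).reset_index(drop = True)
# Aligning the test columns to the train columns in case a category is missing from one split
X_test = X_test.reindex(columns = X_train.columns, fill_value = 0)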
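
> One thing to watch when `get_dummies` is called separately on each split: the two frames can end up with different columns if a category is absent from one of them. A minimal sketch, on made-up data, of aligning the test columns to the train columns (`toy_train`, `toy_test` and the `WindDir` column are hypothetical):

```python
import pandas as pd

toy_train = pd.DataFrame({'WindDir': ['N', 'S', 'W', 'N']})
toy_test = pd.DataFrame({'WindDir': ['N', 'N', 'S']})  # 'W' never appears here

train_dummies = pd.get_dummies(toy_train, drop_first=True, dtype=float)
test_dummies = pd.get_dummies(toy_test, drop_first=True, dtype=float)

# Add any column the test split is missing (filled with 0) and
# drop any column the train split never saw
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0.0)

print(list(train_dummies.columns))
print(list(test_dummies.columns))
print(list(test_aligned.columns))
```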

In [None]:
# Getting the numeric and categorical columns
numeric_col = ['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine',
               'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am',
               'Humidity3pm', 'Pressure9am', 'Cloud9am', 'Cloud3pm',
               'month_sin', 'month_cos', 'day_sin', 'day_cos']

categorical_col = [i for i in X_train.columns if i not in numeric_col]

In [None]:
# Scaling the numeric features
from sklearn.preprocessing import StandardScaler

scalar = StandardScaler()

X_train_scale = pd.DataFrame(scalar.fit_transform(X_train[numeric_col]), columns = numeric_col) # fit_transform on train
X_test_scale = pd.DataFrame(scalar.transform(X_test[numeric_col]), columns = numeric_col) # only transform on test
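
> The difference between `fit_transform` on train and `transform` on test is easiest to see on a tiny example. A minimal sketch with made-up single-feature arrays (`toy_train` and `toy_test` are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature splits
toy_train = np.array([[10.0], [20.0], [30.0]])
toy_test = np.array([[20.0], [40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(toy_train)  # learns mean and std from train
test_scaled = scaler.transform(toy_test)        # reuses the train mean and std

print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())
```

> The test values are centred on the train mean, so a test value equal to the train mean maps to 0 regardless of what the test split looks like.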

In [None]:
# Creating final train and test data
X_train_final = pd.concat([X_train_scale, X_train[categorical_col]], axis = 1)
X_test_final = pd.concat([X_test_scale, X_test[categorical_col]], axis = 1)

<h1><center>  <div style="background-color:lightgreen;border-radius:10px; padding: 10px;"> Creating ANN</div></center></h1>

In [None]:
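from tensorflow import keras
# Importing everything from tensorflow.keras to avoid mixing it with the standalone keras package
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Dropout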

In [None]:
# Creating the ANN
model = Sequential()

# layers
model.add(Dense(units = 1024, kernel_initializer = 'uniform', activation = 'relu', input_dim = X_train_final.shape[1]))
model.add(Dense(units = 512, kernel_initializer = 'uniform', activation = 'relu'))
model.add(Dropout(0.3))
model.add(Dense(units = 32, kernel_initializer = 'uniform', activation = 'relu'))
model.add(Dropout(0.4))
model.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

# Compiling the ANN
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy', keras.metrics.AUC()])

In [None]:
# For live plotting
from IPython.display import clear_output
class live_accuracy(keras.callbacks.Callback):
    plt.style.use('ggplot')
    
    def on_train_begin(self, logs={}):
        self.i = 0
        self.x = []
        self.accuracy = []
        self.val_accuracy = []
        self.auc = []
        
        self.fig = plt.figure()
        
        self.logs = []

    def on_epoch_end(self, epoch, logs={}):
        
        plt.xlim([0, epochs-1])
        plt.ylim([0.5, 1.0])
        plt.title('Training and Validation Accuracy', size = 16)
        plt.xlabel('epochs')
        plt.ylabel('accuracy')
        
        self.logs.append(logs)
        self.x.append(self.i)
        self.accuracy.append(logs.get('accuracy'))
        self.val_accuracy.append(logs.get('val_accuracy'))
        self.auc.append(logs.get('auc'))
        
        self.i += 1
        
        clear_output(wait=True)
        
        plt.plot(self.x, self.accuracy, label="train_accuracy")
        plt.plot(self.x, self.val_accuracy, label="val_accuracy")
        
#         plt.text(x = epochs-2, y = 0.7,
#                 s = f"AUC : {round(self.auc[-1],2)}",
#                 ha = 'center', size = 14, rotation = 0, color = 'black',
#                 bbox=dict(boxstyle="round,pad=1", fc='none', ec="black", lw=2))
           
        plt.legend()
        plt.show();
        
plot_accuracy = live_accuracy()

In [None]:
# Train the ANN
epochs = 20
batch_size = 32

history = model.fit(X_train_final, y_train, batch_size = batch_size, epochs = epochs,
                    callbacks=[plot_accuracy], validation_split = 0.3)

<h2><center>  <div style="background-color:lightgreen;border-radius:10px; padding: 10px;"> Model Evaluation 🔎</div></center></h2>

In [None]:
# Model Evaluation
from sklearn.metrics import confusion_matrix
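# predict_classes was removed in TensorFlow 2.6, so the sigmoid output is thresholded at 0.5 instead
y_pred = (model.predict(X_test_final) > 0.5).astype("int32").ravel()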

matrix = confusion_matrix(y_test, y_pred)

plt.style.use('seaborn-dark') # named 'seaborn-v0_8-dark' in matplotlib 3.6 and later
fig, axis1 = plt.subplots(1, 1, figsize=(10, 6), constrained_layout = True)

# Threshold = 0.5

axis1 = sns.heatmap(matrix, annot=True, fmt = '.0f', cbar=False, cmap='Blues',
                    linewidths=3, square=True, ax = axis1, annot_kws={"fontsize":16})
axis1.set_title("Confusion Matrix", fontsize=16, y=1.05);
axis1.set_xlabel('Predicted', fontsize=12)
axis1.set_ylabel('Actual', fontsize=12)
axis1.set_xticklabels([0,1], fontsize=12 )
axis1.set_yticklabels([0,1], fontsize=12, rotation=0);
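
> The predicted classes come from comparing the sigmoid output with a cut-off (0.5, as noted in the cell above). A minimal sketch with made-up probabilities, shaped like the `(n_samples, 1)` array a single sigmoid unit returns:

```python
import numpy as np

# Made-up probabilities standing in for the model output
toy_proba = np.array([[0.10], [0.50], [0.51], [0.93]])

cutoff = 0.5
# Strictly greater than the cut-off counts as rain; ravel() flattens to 1-D
toy_pred = (toy_proba > cutoff).astype("int32").ravel()

print(toy_pred)
```

> Lowering the cut-off would flag more days as rainy, trading precision for recall on the `Yes` class.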

In [None]:
# Classification Report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

#### Around 84% accurate!
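
> To put the accuracy in context, it helps to compare it with a trivial baseline. With 22.4% of the original rows in the `Yes` class, a model that always predicts `No` would already be right about 77.6% of the time on data with that balance (the cleaned test set may have a somewhat different balance). A sketch with made-up labels, not the notebook's predictions:

```python
import numpy as np

# Made-up labels with the class balance reported earlier (22.4% 'Yes')
y_true = np.array([1] * 224 + [0] * 776)

# A "model" that always predicts no rain
y_baseline = np.zeros_like(y_true)

baseline_accuracy = (y_baseline == y_true).mean()

# Recall on the rain class: share of rainy days the baseline catches
baseline_recall = ((y_baseline == 1) & (y_true == 1)).sum() / (y_true == 1).sum()

print(baseline_accuracy, baseline_recall)
```

> The baseline never catches a rainy day, which is why the recall on the `Yes` class in the classification report is worth reading alongside the accuracy.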

<h3><center>  <div style="background-color:lightgreen;border-radius:10px; padding: 10px;"> Upvote if you like it! This motivates me to produce more notebooks for the community 😊 </div></center></h3>