# Title
Predicting whether it will rain tomorrow in Australia using a Decision Tree Classifier

## Overview
Predicting weather conditions such as rain, sunshine and cold is one of the most common applications of forecasting in machine learning. We will use Australia's weather and rain dataset, which covers almost 10 years of observations, to predict whether it will rain tomorrow.

## Objective 
To apply data science, data analysis and machine learning to a sample binary classification problem (Yes/No)

## Dataset 
Data is openly available on Kaggle on this link - https://www.kaggle.com/jsphyg/weather-dataset-rattle-package

## Steps
We will follow the steps below -
* Install and import required libraries
* Download the dataset and import it into pandas
* Perform basic descriptive analytics on the data
* Visualize the data for patterns and identify critical features impacting the decision
* Prepare the data for training - data cleaning, data normalization, data encoding, train/test split, etc.
* Train a model using a Decision Tree Classifier
* Evaluate and improve model performance
* Package the model

# Import required libraries

In [638]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns

#Set basic configuration for pandas, matplotlib and seaborn
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 150)
sns.set_style("darkgrid")

matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

# Downloading/Import data 

In [639]:
raw_df = pd.read_csv("../input/weather-dataset-rattle-package/weatherAUS.csv")
raw_df

In [640]:
raw_df.info()

The output above shows that many columns have missing values. We need to handle them in several steps so that model training does not fail.
First, let us drop all rows where RainTomorrow is null. RainTomorrow is the target column, so every row used for training must have a value for it.

In [641]:
#drop all null values from dataframe[RainTomorrow]
raw_df.dropna(subset = ["RainTomorrow"], inplace = True)

In [642]:
raw_df.describe().T

# Exploratory data analysis

Let us plot a heatmap of the correlations between the numeric features in the dataset

In [643]:
plt.figure(figsize = (20,10))
#corr() is restricted to numeric columns; recent pandas versions raise an error on string columns
sns.heatmap(raw_df.select_dtypes(include = np.number).corr(), annot = True)

From the correlation heatmap above, we can see that a few pairs of features are strongly positively correlated with each other
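
A heatmap with many features can be hard to read, so it can help to list the strongest pairs directly. The sketch below does this on a small synthetic frame; the column names and values are made up for illustration and are not taken from the weather data. On the real data one would start from the numeric columns of `raw_df` instead.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical values, not the Kaggle dataset)
rng = np.random.default_rng(0)
temp_9am = rng.normal(20, 5, 200)
toy_df = pd.DataFrame({
    "Temp9am": temp_9am,
    "Temp3pm": temp_9am + rng.normal(3, 1, 200),  # built to track Temp9am
    "WindSpeed": rng.normal(15, 4, 200),          # independent noise
})

corr = toy_df.corr()

# Visit each unordered pair of columns once and record its correlation
pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
]

# Strongest relationships first, regardless of sign
pairs.sort(key=lambda p: abs(p[2]), reverse=True)
for a, b, r in pairs:
    print(f"{a} vs {b}: {r:.2f}")
```

Sorting by absolute value also surfaces strong negative correlations, which are easy to overlook on a colour scale.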

In [644]:
sns.boxplot(x = "MinTemp", y = "RainTomorrow", data = raw_df, dodge = True);

In [645]:
sns.boxplot(x = "MaxTemp", y = "RainTomorrow", data = raw_df, dodge = True);

In [646]:
sns.boxplot(x = "Cloud9am", y = "RainTomorrow", data = raw_df, dodge = True);

# Preparing data for training

## Dealing with missing values
Missing values must be handled before the data is fed to a model. They must also be handled carefully so that we do not distort the data with meaningless entries.
Let us count the missing values in each column

In [647]:
raw_df.isna().sum().sort_values(ascending = False)

As we can see, there are many missing values. For example, Sunshine has 67,816 missing entries, which is almost half of the total records we have. Similarly, Evaporation, Cloud3pm and Cloud9am each have over 50,000 missing values.
We cannot remove this many records: we would lose almost half of our dataset and would not have enough data left to train on. Different kinds of columns may need different handling, so let us separate the numeric columns from the categorical/string columns
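
Raw counts are easier to judge as a share of all rows. The sketch below computes the percentage of missing entries per column on a hypothetical toy frame; the values are invented, and on the real data `raw_df` would take the place of `toy_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for raw_df
toy_df = pd.DataFrame({
    "Sunshine": [7.1, np.nan, np.nan, 9.3],
    "MinTemp": [13.4, 7.4, np.nan, 9.2],
    "Location": ["Albury", "Albury", "Sydney", "Sydney"],
})

# isna().mean() is the fraction of missing entries in each column
missing_pct = (toy_df.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

A column that is missing in half of the rows is a candidate for imputation or removal, whereas dropping rows would discard a large part of the dataset.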

In [648]:
#Fetching numeric columns from the dataset
numeric_cols = raw_df.select_dtypes(include = np.number).columns.to_list()
print(numeric_cols)

In [649]:
#Fetching categorical or string columns from the dataset
categorical_cols = raw_df.select_dtypes('object').columns.to_list()
print(categorical_cols)

### Dealing with missing values in numerical columns
One can use the mean, median or mode to replace missing numerical values. For simplicity, we will use the mean

In [650]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = "mean").fit(raw_df[numeric_cols])

In [651]:
#apply the mean imputing strategy on numerical columns of raw_df
raw_df[numeric_cols] = imputer.transform(raw_df[numeric_cols])

Let us now check whether any missing values remain in the numeric columns

In [652]:
raw_df[numeric_cols].isna().sum()
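
The choice between mean, median and mode matters most for skewed columns. The following sketch compares the three strategies on a made-up column with one large outlier; the values are hypothetical, and `toy_imputer` is named differently from the notebook's `imputer` so that it does not overwrite it.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical rainfall-like column: mostly small values, one large outlier
values = np.array([[0.0], [0.0], [1.0], [2.0], [np.nan], [47.0]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    toy_imputer = SimpleImputer(strategy=strategy)
    # Row 4 is the one that was missing; record what it gets replaced with
    filled[strategy] = toy_imputer.fit_transform(values)[4, 0]

print(filled)
```

The mean is pulled up by the outlier, while the median and mode stay close to the typical values, which is one reason the median is often preferred for heavily skewed features.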

### Dealing with missing values in Categorical columns

In [653]:
raw_df[categorical_cols].isna().sum()

Let us replace all null/missing values with "unknown", which will become another category

In [654]:
raw_df[categorical_cols] = raw_df[categorical_cols].fillna("unknown")

In [655]:
raw_df[categorical_cols].isna().sum()

# Feature Engineering 
Day and month might have a significant impact on the prediction. For example, in India it is far more likely to rain during the monsoon than in winter. Let us break Date into day, month and year, and then drop the Date column from the dataset

In [656]:
raw_df["year"] = pd.to_datetime(raw_df.Date).dt.year;
raw_df["month"] = pd.to_datetime(raw_df.Date).dt.month;
raw_df["day"] = pd.to_datetime(raw_df.Date).dt.day;

So we have new columns of year, month and day in the dataset. We can now drop Date column from the dataset

In [657]:
raw_df.drop(columns = ["Date"], inplace = True)

In [658]:
raw_df.info()
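
As a small illustration of the date breakdown, the sketch below parses a few made-up date strings once and derives all three parts from the parsed result. The dates and the `toy_df` name are hypothetical.

```python
import pandas as pd

# Hypothetical dates in the same YYYY-MM-DD form as the Date column
toy_df = pd.DataFrame({"Date": ["2008-12-01", "2015-06-30", "2017-01-15"]})

# Parse the strings once and reuse the resulting datetime Series
parsed = pd.to_datetime(toy_df["Date"])
toy_df["year"] = parsed.dt.year
toy_df["month"] = parsed.dt.month
toy_df["day"] = parsed.dt.day

print(toy_df)
```

Parsing once avoids repeating the string-to-datetime conversion for every derived column, which is noticeable on a dataset of this size.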

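These 3 columns also need to be part of the numeric_cols list. Both column lists are recomputed from the training data after the split below, so the next two cells are left commented out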

In [659]:
#numeric_cols = raw_df.select_dtypes(include = np.number).columns.to_list()
#print(numeric_cols)

In [660]:
#Fetching categorical or string columns from the dataset
#categorical_cols = raw_df.select_dtypes('object').columns.to_list()
#print(categorical_cols)

### Splitting train, test and validation data
Our data is chronological, covering the years from roughly 2008 to 2017. Since we expect the model to predict values in the future, it is a good idea to separate training, validation and test data chronologically. We will use data before 2015 for training, data from 2015 for validation and data after 2015 for testing

In [661]:
plt.title('No. of Rows per Year')
sns.countplot(x = "year", data = raw_df);

Let us split the data into training, validation and test sets by year

In [662]:
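#Filter on the year column; .copy() avoids modifying a view of raw_df later on
train_df = raw_df[raw_df["year"] < 2015].copy()
validate_df = raw_df[raw_df["year"] == 2015].copy()
test_df = raw_df[raw_df["year"] > 2015].copy()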

In [663]:
print("Shape of train dataset", train_df.shape)
print("Shape of validation dataset", validate_df.shape)
print("Shape of test dataset", test_df.shape)
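
The chronological split can also be wrapped in a small function so that the cut-off year is a single parameter. The sketch below is one way to do this on a toy frame; `split_by_year` is a hypothetical helper, not part of pandas or scikit-learn, and the rows are invented.

```python
import pandas as pd

# Hypothetical rows; only the year column matters for the split
toy_df = pd.DataFrame({
    "year": [2013, 2014, 2015, 2015, 2016, 2017],
    "RainTomorrow": ["No", "Yes", "No", "No", "Yes", "No"],
})

def split_by_year(df, validation_year):
    """Hypothetical helper: rows before, in, and after validation_year."""
    train = df[df["year"] < validation_year].copy()
    validate = df[df["year"] == validation_year].copy()
    test = df[df["year"] > validation_year].copy()
    return train, validate, test

toy_train, toy_validate, toy_test = split_by_year(toy_df, 2015)
print(len(toy_train), len(toy_validate), len(toy_test))
```

Because the three masks are mutually exclusive and cover every year, each row lands in exactly one set and no future data leaks into training.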

In [664]:
#Separate the target column and drop it from the train, validate and test dataframes
train_target = train_df["RainTomorrow"]
train_df.drop(columns = ["RainTomorrow"], inplace = True)

validate_target = validate_df["RainTomorrow"]
validate_df.drop(columns = ["RainTomorrow"], inplace = True)


test_target = test_df["RainTomorrow"]
test_df.drop(columns = ["RainTomorrow"], inplace = True)

In [665]:
#Separating input columns
train_inputs = train_df.copy()
validate_inputs = validate_df.copy()
test_inputs = test_df.copy()

Let us print the shape of train inputs and train targets 

In [666]:
print("Shape of Train input", train_inputs.shape)
print("Shape of Validate input", validate_inputs.shape)
print("Shape of Test input", test_inputs.shape)

In [667]:
print("Shape of Train target", train_target.shape)
print("Shape of Validate target", validate_target.shape)
print("Shape of Test target", test_target.shape)

Get numerical and categorical columns

In [668]:
#Fetch numerical columns
numeric_cols = train_df.select_dtypes(include = np.number).columns.to_list()
print("Numerical columns", numeric_cols)

#Fetching categorical or string columns from the dataset
categorical_cols = train_df.select_dtypes('object').columns.to_list()
print("Categorical columns", categorical_cols)

### Encoding categorical columns 
We have categorical columns which must be converted to numbers, as the ML model cannot deal with string values. We will use one-hot encoding because our columns have more than 2 categories with no natural order

In [669]:
from sklearn.preprocessing import OneHotEncoder
#sparse_output replaces the older sparse argument from scikit-learn 1.2 onwards
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore').fit(raw_df[categorical_cols])

In [670]:
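#Fetch new columns which will be created as a part of OneHotEncoding operation
#get_feature_names_out replaces get_feature_names, which newer scikit-learn versions no longer provide
encoded_cols = list(encoder.get_feature_names_out(categorical_cols))
print(encoded_cols)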

In [671]:
#Transform categorical cols for train, test and validation dataset
train_inputs[encoded_cols] = encoder.transform(train_inputs[categorical_cols])
validate_inputs[encoded_cols] = encoder.transform(validate_inputs[categorical_cols])
test_inputs[encoded_cols] = encoder.transform(test_inputs[categorical_cols])

In [672]:
print("Shape of training dataset", train_inputs.shape)
print("Shape of validation dataset", validate_inputs.shape)
print("Shape of test dataset", test_inputs.shape)

In [673]:
X_train = train_inputs[numeric_cols + encoded_cols]
X_validate = validate_inputs[numeric_cols + encoded_cols]
X_test = test_inputs[numeric_cols + encoded_cols]

# Training model using DecisionTreeClassifier

In [674]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(random_state = 42)

In [675]:
%%time
dt.fit(X_train, train_target)

# Model evaluation and metrics for train, test and validation 

In [676]:
from sklearn.metrics import accuracy_score, confusion_matrix

## Predict and evaluate model performance on training data itself

In [677]:
train_preds = dt.predict(X_train)
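#Series.value_counts() is used because the top-level pd.value_counts is deprecated in recent pandas
print(pd.Series(train_preds).value_counts())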

Let us check the accuracy score -

In [678]:
accuracy_score(train_target, train_preds)

As we can see above, the accuracy score is 1.0. This is expected, as the model is being evaluated on the same data it was trained on and an unrestricted decision tree can memorise that data. Let us now get the score of the model on the validation dataset

In [679]:
dt.score(X_validate, validate_target)

In [680]:
dt.score(X_test, test_target)

# Drawing the decision tree 

In [681]:
from sklearn.tree import plot_tree, export_text
plt.figure(figsize=(80,20))
plot_tree(dt, feature_names=X_train.columns, max_depth=2, filled=True);

Let us get the max depth of the tree

In [687]:
#Max depth of decision tree
print("Depth of the decision tree", dt.tree_.max_depth)
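
Besides plotting, the split rules can be printed as text with `export_text`, which is imported above but not otherwise used. The sketch below shows this on scikit-learn's bundled iris data with a shallow tree; iris is only a stand-in here, and on the weather model one would pass `dt` and `list(X_train.columns)` instead.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Small bundled dataset used purely as a stand-in
iris = load_iris()
small_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
small_tree.fit(iris.data, iris.target)

# export_text returns the split rules as an indented string
rules = export_text(small_tree, feature_names=list(iris.feature_names))
print(rules)
```

The text form is convenient for very deep trees, where a plotted figure becomes unreadable.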

We can also get the feature importances, which give us a clue about which features influence the decision the most

In [685]:
print(dt.feature_importances_)

In [689]:

importance_df = pd.DataFrame({
    'feature': X_train.columns,
    'importance': dt.feature_importances_
}).sort_values('importance', ascending=False)

print(importance_df.head())

In [690]:
plt.title('Feature Importance')
sns.barplot(data = importance_df.head(10), x='importance', y='feature');
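
Feature importances of a fitted tree are normalised to sum to 1, so a cumulative sum shows how few features carry most of the decision. The sketch below illustrates this on synthetic data from `make_classification`; the feature names `f0` to `f5` are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 6 features, of which only 2 carry signal
X_toy, y_toy = make_classification(n_samples=500, n_features=6, n_informative=2,
                                   n_redundant=0, random_state=0)
toy_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_toy, y_toy)

imp = pd.Series(toy_tree.feature_importances_,
                index=[f"f{i}" for i in range(6)]).sort_values(ascending=False)

# Running total of importance, most important feature first
print(imp.cumsum())
```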

# Hyperparameter tuning
Our model had 100% training accuracy, which means that the model is memorising the inputs. Comparing this with the validation and test accuracy of approx. 77%, we clearly see a case of overfitting. We need to change some of the model's training parameters to avoid overfitting. One possible way of doing this is to reduce the max depth of the tree. Let us train the model again

In [693]:
dt = DecisionTreeClassifier(max_depth = 4, random_state = 42)
dt.fit(X_train, train_target)

Let us score the model on training, validation and test dataset again

In [695]:
#Scoring against training dataset
dt.score(X_train, train_target)

As we can see, the training accuracy is now just 83%, which suggests that the model is no longer memorising the training data. Let us check the validation and test datasets as well

In [696]:
#Scoring against validation dataset
dt.score(X_validate, validate_target)

In [697]:
#scoring against test dataset
dt.score(X_test, test_target)
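
Rather than choosing one depth by hand, several depths can be compared in a loop. The sketch below does this on synthetic data with label noise and a random split, which are assumptions made only to keep the example self-contained; on the weather data one would reuse `X_train`, `X_validate` and their targets.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; flip_y adds label noise so deep trees can overfit
X_toy, y_toy = make_classification(n_samples=2000, n_features=20, n_informative=5,
                                   flip_y=0.1, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X_toy, y_toy, test_size=0.25,
                                            random_state=42)

results = {}
for depth in (2, 4, 8, None):  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_val, y_val))
    print(depth, results[depth])
```

A depth where the training and validation scores are close, and the validation score is near its peak, is usually a reasonable choice.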

With the tree depth limited to 4, we now have a significantly better performance on the validation and test datasets. Let us get the confusion matrix and classification report for each dataset

In [698]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [705]:
#Confusion matrix and classification report for training data
train_pred = dt.predict(X_train)
matrix_train = confusion_matrix(train_target, train_pred)
print(matrix_train)

report_train = classification_report(train_target, train_pred)
print(report_train)

In [706]:
#Confusion matrix and classification report for validation data
validation_pred = dt.predict(X_validate)
matrix_validate = confusion_matrix(validate_target, validation_pred)
print(matrix_validate)

report_validate = classification_report(validate_target, validation_pred)
print(report_validate)

In [707]:
#Confusion matrix and classification report for test data
test_pred = dt.predict(X_test)
matrix_test = confusion_matrix(test_target, test_pred)
print(matrix_test)

report_test = classification_report(test_target, test_pred)
print(report_test)
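
To make the link between the confusion matrix and the classification report explicit, the sketch below derives precision and recall for the "Yes" class by hand from a small set of made-up labels and compares them with scikit-learn's own functions. The labels are hypothetical and unrelated to the model's actual predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels, not taken from the weather data
y_true = ["No", "No", "No", "No", "Yes", "Yes", "Yes", "No"]
y_pred = ["No", "No", "Yes", "No", "Yes", "No", "No", "No"]

# Rows are actual classes, columns are predicted classes, ordered as in `labels`
matrix = confusion_matrix(y_true, y_pred, labels=["No", "Yes"])
tn, fp, fn, tp = matrix.ravel()

precision = tp / (tp + fp)  # of the predicted "Yes", how many were right
recall = tp / (tp + fn)     # of the actual "Yes", how many were found
print(matrix)
print(precision, recall)

# The library functions should agree with the manual calculation
print(precision_score(y_true, y_pred, pos_label="Yes"),
      recall_score(y_true, y_pred, pos_label="Yes"))
```

For rain prediction, where "No" days are much more common, recall on the "Yes" class is often more informative than overall accuracy.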

# Train Random Forest algorithm
Random Forest is an ensemble technique where
* multiple decision trees are trained, each on a different random sample of the rows and with a random subset of features considered at each split
* the outcome of each decision tree is voted on / averaged
* for classification, the class with the most votes is the winning prediction
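
The idea can be sketched by hand with plain decision trees: train each tree on a bootstrap sample and take a majority vote. This is a simplified illustration on synthetic data and not how `RandomForestClassifier` is implemented internally; scikit-learn's version averages the trees' predicted probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data used as a stand-in
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=42)
rng = np.random.default_rng(42)

trees = []
for _ in range(25):  # an odd number of trees avoids tied votes
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_toy), size=len(X_toy))
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    trees.append(tree.fit(X_toy[idx], y_toy[idx]))

# Majority vote across trees (labels are 0/1, so the mean is the vote share)
votes = np.mean([t.predict(X_toy) for t in trees], axis=0)
ensemble_pred = (votes > 0.5).astype(int)
print((ensemble_pred == y_toy).mean())
```

Each tree sees slightly different data and features, so their individual errors tend to differ, and combining them usually reduces overfitting compared with a single deep tree.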

In [709]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_jobs = 1, random_state = 42)

In [710]:
%%time
rfc.fit(X_train, train_target)

Let us now get the score of model for train, test and validation dataset

In [713]:
print("Training accuracy = ", rfc.score(X_train, train_target) * 100, "%")

In [714]:
print("Validation accuracy = ", rfc.score(X_validate, validate_target) * 100, "%")

In [715]:
print("Test accuracy = ", rfc.score(X_test, test_target) * 100, "%")