#### Libraries and modules required to run the file


In [None]:
# importing required libraries
import opendatasets as od
import sys
import os
import pandas as pd
import dash
import numpy as np
import plotly.graph_objects as go
from sklearn.impute import SimpleImputer
import plotly.express as px
from plotly.subplots import make_subplots
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'


from sklearn.model_selection import train_test_split


# Introduction:


## Problem statement: 

The goal is to build a fully automated system that uses today's weather data for a given location to predict whether it will rain there tomorrow. This is a binary classification problem.

In this project I analyse the collected dataset and explore the relationship between the chance of rain the next day and today's weather indicators, such as temperature, amount of rainfall, humidity, wind direction and pressure. Finally, I created a Dash app (a web application) that tells whether it will rain tomorrow based on today's weather data. The app also offers interactive visualizations that help users understand the relationship between the target variable and the predictors. The task therefore involves prediction as well as inference.


A web application that takes in today's weather data and predicts with high accuracy whether it will rain tomorrow has several uses:

1. We can plan outdoor activities, workouts or events for the following day; the application helps us make informed, data-driven decisions.
2. Farmers can decide on irrigation and plant protection strategies based on the rain prediction.
3. Travelers can plan their travel based on the rain predictions.
4. It could also help in areas such as sports event planning, smart irrigation, water management, gardening and research work.

## Dataset used:

For this task we train the system on the "Rain in Australia" dataset, downloaded from Kaggle. The dataset was compiled by the Bureau of Meteorology, an Australian government agency responsible for providing weather-related services. It includes various features related to weather conditions such as temperature, humidity, rainfall, wind speed, and more. It also contains the target variable, RainTomorrow, which indicates whether it rained the next day (Yes/No).

RainTomorrow is the target variable to predict. It answers the question: did it rain the next day, Yes or No? The column is Yes if the rainfall for that day was 1 mm or more.
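As a rough illustration of how such a label could be derived, the sketch below builds a RainTomorrow-style column from a small made-up rainfall series by shifting it back one day and applying the 1 mm threshold. The toy values and the helper name `label_rain_tomorrow` are assumptions for illustration only; the Kaggle dataset already ships with this column.

```python
import pandas as pd

def label_rain_tomorrow(rainfall_mm: pd.Series, threshold_mm: float = 1.0) -> pd.Series:
    """Label each day 'Yes'/'No' depending on whether the NEXT day's rainfall reaches the threshold."""
    next_day = rainfall_mm.shift(-1)  # rainfall recorded on the following day
    labels = next_day.ge(threshold_mm).map({True: 'Yes', False: 'No'})
    return labels.where(next_day.notna())  # the last day has no "tomorrow", so keep it missing

# Hypothetical daily rainfall in millimetres.
toy_rainfall = pd.Series([0.0, 0.6, 1.0, 12.4, 0.0])
labels = label_rain_tomorrow(toy_rainfall)
print(labels.tolist())
```

The last row stays missing because no next-day measurement exists for it, which is one reason rows with a missing target are dropped later on.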

## Downloading the data

In [None]:
dataset_url = 'https://www.kaggle.com/jsphyg/weather-dataset-rattle-package'

In [None]:
# od.download prompts for your Kaggle username and API key; keep real credentials out of the notebook.
od.download(dataset_url)

In [None]:
data_dir = './weather-dataset-rattle-package'
#os.listdir(data_dir)
train_csv = data_dir + '/weatherAUS.csv'
raw_df = pd.read_csv(train_csv)

In [None]:
raw_df
# total number of columns is 23.
# total number of rows is 145,460.

The dataset contains over 145,000 rows and 23 columns, with date, numeric and categorical columns. The objective is to create a model that predicts the value in the RainTomorrow column. The dataset contains weather information for 49 different locations across Australia.

## Columns in the dataset and their explanation

In [None]:
raw_df.columns

Wind Gust Speed (WindGustSpeed):
Wind gust speed represents the maximum wind speed recorded over a short period during a gust of wind. Wind gusts are sudden increases in wind speed that can occur in certain weather conditions, such as during thunderstorms or strong frontal passages. Wind gust speed is usually reported in kilometers per hour (km/h) or meters per second (m/s).

Wind Gust Direction (WindGustDir):
Wind gust direction indicates the compass direction from which the strongest gusts of wind are blowing. It is reported as a cardinal direction, such as North (N), Northeast (NE), East (E), Southeast (SE), South (S), Southwest (SW), West (W), or Northwest (NW).

## Classifying the columns into Numeric and Categorical columns:

This is essential because different types of columns require different types of data preprocessing and analysis.

First, we exclude the rows where the value of 'RainTomorrow' or 'RainToday' is missing to make the analysis and modeling simpler (one of them is the target variable, and the other is likely to be very closely related to it).

In [None]:
raw_df.dropna(subset = ['RainToday', 'RainTomorrow'], inplace=True) # inplace=True modifies raw_df directly instead of returning a new DataFrame

In [None]:
raw_df['Date'] = pd.to_datetime(raw_df['Date'])
numeric_cols = raw_df.select_dtypes(include=np.number).columns.tolist()
categorical_cols = raw_df.select_dtypes('object').columns.tolist()

In [None]:
print(numeric_cols)
print(categorical_cols)

We have now extracted the categorical and numerical columns. Next we perform some exploratory data analysis.

# Exploratory Data Analysis and Visualization

Here we perform Exploratory Data Analysis (EDA), a critical step in understanding and gaining insights from the dataset before building and training any machine learning model. EDA involves examining and visualizing the data to identify patterns, trends, relationships and potential issues. Through EDA I look at a data summary, univariate analysis, bivariate analysis, missing-value analysis, outlier detection and time-series analysis.

In [None]:
raw_df.info()

In [None]:
raw_df['RainTomorrow'].value_counts()

So we see that there is a class imbalance in the target variable: there are approximately 3.5 times as many rows with RainTomorrow as 'No' as with 'Yes'.
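The imbalance ratio can be read off directly from normalized value counts. The snippet below is a minimal sketch on a made-up target column (`toy_target` is hypothetical); in the notebook the same calls would be applied to `raw_df['RainTomorrow']`.

```python
import pandas as pd

# Toy target column with 7 'No' rows and 2 'Yes' rows (made-up data).
toy_target = pd.Series(['No'] * 7 + ['Yes'] * 2)

class_share = toy_target.value_counts(normalize=True)     # fraction of rows per class
imbalance_ratio = class_share['No'] / class_share['Yes']  # how many 'No' rows per 'Yes' row
print(class_share.round(3).to_dict(), round(imbalance_ratio, 2))
```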

In [None]:
# 1.
px.histogram(raw_df, x = 'Location', title = 'Location vs Rainy Days', 
             color = 'RainTomorrow')

# In our dataset it rains the next day roughly 20% of the time at almost all locations,
# so the distribution is more or less uniform across locations.
# We are therefore not motivated to consider location an important factor in our analysis, since it
# does not appear to have a significant impact on whether it rains tomorrow.

In [None]:
#2. 
px.histogram(raw_df,
            x='Temp3pm',
            title='Temperature at 3pm vs. Rain tomorrow',
            color='RainTomorrow',
            width=850,
            height=500)



# A moderate or slightly higher temperature at 9am is associated with a higher chance of rain.

# A low temperature at 3pm seems to make rain tomorrow more likely,
# but there are cases where the temperature is high and it still rains the next day.
    

In [None]:
px.histogram(raw_df,
            x='Pressure9am',
            title='Pressure at 9am vs. Rain tomorrow',
            color='RainTomorrow',
            width=850,
            height=500)

In [None]:
px.histogram(raw_df,
            x='Pressure3pm',
            title='Pressure at 3pm vs. Rain tomorrow',
            color='RainTomorrow',
            width=850,
            height=500)

Thus we see that low pressure suggests that it's more likely to rain the next day.

In [None]:
px.scatter(raw_df.sample(4000),
          title = 'Min Temp vs. Max Temp',
          x = 'MinTemp',
          y = 'MaxTemp',
           opacity = 0.7,
          color = 'RainTomorrow', width = 850, height = 500)

Thus, if the gap between today's minimum and maximum temperature is small, rain tomorrow is more likely.

In [None]:
px.scatter(raw_df.sample(2000),
        title = 'Temp (3pm) vs. Humidity (3pm)',
        x = 'Temp3pm',
        y = 'Humidity3pm',
        color = 'RainTomorrow', width = 850, height = 500)

We can see that if the temperature today is low and humidity is high then there is a fairly good chance of raining tomorrow.

In [None]:
px.histogram(raw_df,
            x = 'RainTomorrow',
            color = "RainToday",
            title = 'Rain tomorrow vs. Rain Today', width = 850, height = 500)

If it did not rain today, there is a pretty good chance that it won't rain tomorrow either. Predicting 'Yes' for rain tomorrow is more difficult than predicting 'No'.

In [None]:
print(numeric_cols, categorical_cols)

In [None]:
fig = px.histogram(raw_df, x = 'Rainfall', color = 'RainTomorrow')
fig.update_xaxes(range=[0, 50]) 
fig.update_yaxes(range=[0, 1000])
fig.show()

So we see that as the amount of rainfall today increases, the proportion of days followed by rain increases considerably.

In [None]:
fig = px.scatter(raw_df.sample(1000),
        title = 'Pressure (3pm) vs. Pressure (9am)',
        x = 'Pressure3pm',
        y = 'Pressure9am',
        color = 'RainTomorrow', width = 800, height = 500,
        opacity=0.9,
        color_discrete_sequence=['#ff7f0e', '#1f77b4'])

fig.update_traces(opacity=0.9, selector=dict(type='scatter', mode='markers', name='Yes'))

fig.show()
# So we see that if the pressure difference between 9am and 3pm is small, rain tomorrow is more likely.

In [None]:
df1 = raw_df[raw_df['RainTomorrow'] == 'Yes']
df2 = raw_df[raw_df['RainTomorrow'] == 'No']

In [None]:
raw_df['RainTomorrow'].value_counts()

In [None]:
px.scatter(df1, x = 'Rainfall', y = 'Sunshine')

# Here we see that comparatively lower sunshine and higher rainfall characterize the days that are followed by rain.

In [None]:
px.scatter(df2, x = 'Rainfall', y = 'Sunshine')
# We see that very low rainfall and medium sunshine characterize the days that are not followed by rain.

In [None]:
px.scatter(raw_df.sample(2000),
        title = 'Humidity (3pm) vs. Humidity (9am)',
        x = 'Humidity3pm',
        y = 'Humidity9am',
        color = 'RainTomorrow', width = 850, height = 500)

# so we see that if the Humidity difference is greater we have more chances of raining tomorrow.

In [None]:
# Checking whether the change in wind direction affects the chances of raining tomorrow or not.

In [None]:
# 1 if the wind direction differs between 9am and 3pm, else 0 (vectorized comparison).
raw_df['change_wind_dir'] = (raw_df['WindDir9am'] != raw_df['WindDir3pm']).astype(int)

In [None]:
raw_df['change_wind_dir'].value_counts()

In [None]:
px.histogram(raw_df,
            x = 'change_wind_dir',
            color = "RainTomorrow",
            title = 'Change in wind direction vs. Rain tomorrow', width = 850, height = 500)

It rains the next day 26% of the time when there is no change in wind direction and 21% of the time when there is a change, so a change in wind direction does not significantly affect the probability of rain tomorrow.

So we drop the column created above.
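Per-group rain rates like the percentages quoted above can be computed with a grouped mean rather than read off the histogram. This is a minimal sketch on made-up rows; the frame `toy` and its values are assumptions, and in the notebook the same expression would be run on `raw_df` before the column is dropped.

```python
import pandas as pd

# Made-up rows: 4 days without a wind direction change, 5 days with one.
toy = pd.DataFrame({
    'change_wind_dir': [0, 0, 0, 0, 1, 1, 1, 1, 1],
    'RainTomorrow': ['Yes', 'No', 'No', 'No', 'Yes', 'No', 'No', 'No', 'No'],
})

# Share of days followed by rain within each group.
rain_rate = (toy['RainTomorrow'] == 'Yes').groupby(toy['change_wind_dir']).mean()
print(rain_rate.to_dict())
```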

In [None]:
raw_df.drop('change_wind_dir', axis=1, inplace=True)

In [None]:
categorical_cols

In [None]:
categorical_cols.remove("RainTomorrow")

In [None]:
categorical_cols

## Imputing missing Numeric Data:

We can't work with missing values, so we need to fill them in. There are several techniques for imputation (filling missing values in the dataset), but I use the most basic one: replacing missing values with the average value of the column, using the SimpleImputer class from sklearn.impute.
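To make the effect of mean imputation concrete, here is a small sketch on a made-up 3x2 array (the values and the names `toy` and `filled` are assumptions). Each missing entry is replaced by the mean of the observed values in its column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Two toy columns, each with one missing value.
toy = np.array([[10.0, 1.0],
                [np.nan, 3.0],
                [20.0, np.nan]])

imputer = SimpleImputer(strategy='mean')
filled = imputer.fit_transform(toy)  # learns the column means, then fills the gaps

print(imputer.statistics_)  # column means computed from the non-missing values
print(filled)
```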

In [None]:
# Imputing Missing Numeric Data:

# Machine learning models can't work with missing numerical data. The process of filling missing values is called imputation.
# Here we replace missing values with the average value in the column using the SimpleImputer class from sklearn.impute.
imputer1 = SimpleImputer(strategy = 'mean')



# Before performing the imputation, we check the number of missing values in the data.
raw_df[numeric_cols].isna().sum()

In [None]:
imputer1.fit(raw_df[numeric_cols])
#After calling fit, the computed statistic for each column is stored in the statistics_ property of imputer.

In [None]:
print(list(imputer1.statistics_))

In [None]:
raw_df[numeric_cols] = imputer1.transform(raw_df[numeric_cols])

In [None]:
raw_df[numeric_cols].isna().sum()

In [None]:
raw_df[categorical_cols].isna().sum()

We impute the missing values in the categorical columns with the most frequently occurring value in each column.

In [None]:
imputer2 = SimpleImputer(strategy = 'most_frequent')

In [None]:
raw_df[categorical_cols] = imputer2.fit_transform(raw_df[categorical_cols])

In [None]:
print(list(imputer2.statistics_))

In [None]:
print(list(imputer1.statistics_))

In [None]:
raw_df[categorical_cols].isna().sum()

### Feature Engineering


Feature Engineering: Feature engineering is the process of creating new features or transforming existing ones in a dataset to enhance the performance of machine learning models. It involves selecting, modifying, or creating features that provide relevant and valuable information to the model, thus improving its ability to make accurate predictions or classifications.

Apart from using these columns for prediction, I have created new, more informative columns that can be used along with them to study the data more efficiently. These are as follows:

1. Making a column Temp_diff that captures the Maximum and the Minimum temperature difference in a day.
2. Making a column Pressure_diff that calculates the difference in pressure at 3pm and 9am.
3. Making a column Humidity_diff that calculates the difference in Humidity at 3pm and 9am.
4. Binning the Sunshine and Rainfall values into categories (low, medium, high) and combining the two categories into a single new categorical feature, which is then encoded numerically and used alongside the other features. Its impact on model performance can be assessed with cross-validation or feature importance.



Some more ideas that can be implemented are as follows: 

5. Evaporation, Sunshine: Ratio of evaporation to rainfall, or ratio of sunshine duration to total daylight hours, as they can give information about moisture levels in the atmosphere.
6. Wind Features (WindGustDir, WindGustSpeed, WindDir9am, WindDir3pm, WindSpeed9am, WindSpeed3pm): Calculating the difference in wind direction and speed between morning and afternoon, or computing the overall wind speed, as wind patterns can influence rainfall.
7. Cloud Features (Cloud9am, Cloud3pm): Difference in cloud cover between morning and afternoon, or aggregating cloud cover data to create a new feature.
8. Further, we can use the wind speed at 3pm and 9am, or the wind gust speed, to make new columns.
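The wind and cloud ideas listed above could be sketched as follows. The column names `WindSpeed_diff`, `Cloud_diff` and `Cloud_mean` are hypothetical, the rows are made up, and these features are not part of the model trained in this notebook.

```python
import pandas as pd

# Made-up observations using the dataset's column names.
toy = pd.DataFrame({
    'WindSpeed9am': [13.0, 20.0, 7.0],
    'WindSpeed3pm': [20.0, 11.0, 7.0],
    'Cloud9am': [8.0, 1.0, 4.0],
    'Cloud3pm': [5.0, 7.0, 4.0],
})

# Absolute change between the morning and afternoon readings.
toy['WindSpeed_diff'] = (toy['WindSpeed3pm'] - toy['WindSpeed9am']).abs()
toy['Cloud_diff'] = (toy['Cloud3pm'] - toy['Cloud9am']).abs()

# One simple way to aggregate cloud cover over the day.
toy['Cloud_mean'] = toy[['Cloud9am', 'Cloud3pm']].mean(axis=1)
print(toy)
```

Whether such columns help would still need to be checked on the validation set.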

In [None]:
# 1.
raw_df['Temp_diff'] = abs(raw_df['MaxTemp'] - raw_df['MinTemp'])
# px.histogram(raw_df,
#             x = 'Temp_diff',
#             title = 'Temperature difference during a day vs. Rain tomorrow',
#             color = 'RainTomorrow', width = 850, height = 500)
numeric_cols.append('Temp_diff')

In [None]:
# The plots show that month, week and day of the week are distributed more or less uniformly with respect to RainTomorrow,
# so we do not include them in our study.

# Making a month column and a day of the week column

# raw_df['Day_of_week'] = pd.to_datetime(raw_df['Date']).dt.dayofweek
# raw_df['Month'] = pd.to_datetime(raw_df['Date']).dt.month
# raw_df['week_no'] = pd.to_datetime(raw_df['Date']).dt.isocalendar().week
# # raw_df.Month.value_counts()

# px.histogram(raw_df,
#             x = raw_df.week_no.map(lambda x:str(x)),
#             color = raw_df.RainTomorrow)

# px.histogram(raw_df,
#             x = raw_df.Day_of_week.map(lambda x:str(x)),
#             color = raw_df.RainTomorrow)

In [None]:
px.histogram(raw_df,
            x = 'Temp_diff',
            title = 'Temperature difference during a day vs. Rain tomorrow',
            color = 'RainTomorrow', width = 850, height = 500)

Here we see that if the temperature difference during the day, i.e. the absolute difference between the maximum and minimum temperature, is small, there is a fairly high chance of rain tomorrow.

In [None]:
#2. Creating the Pressure difference Column.

raw_df['Pressure_diff'] = abs(raw_df['Pressure3pm'] - raw_df['Pressure9am'])
numeric_cols.append('Pressure_diff')

In [None]:
fig = px.histogram(raw_df,
            x = 'Pressure_diff',
            title = 'Pressure difference during a day vs. Rain tomorrow',
            color = 'RainTomorrow', width = 850, height = 500)
fig.update_xaxes(range=[0,15]) 
fig.update_yaxes(range=[0, 4000])
fig.show()

Although the effect is not very pronounced, the plot suggests that a low pressure difference may be a factor in whether it rains tomorrow.

In [None]:
#3. Creating the Humidity difference Column.

raw_df['Humidity_diff'] = abs(raw_df['Humidity3pm'] - raw_df['Humidity9am'])
numeric_cols.append('Humidity_diff')

In [None]:
px.histogram(raw_df,
            x = 'Humidity_diff',
            title = 'Humidity difference during a day vs. Rain tomorrow',
            color = 'RainTomorrow', width = 850, height = 500)

So we see that low Humidity difference may be a significant factor in determining whether it would rain tomorrow or not.

Now we feature engineer the sunshine and the Rainfall column to create a new column

In [None]:
sunshine_bins = [-1, 5, 10, 15]      # bin edges for Low, Medium, High sunshine
rainfall_bins = [-1, 50, 250, 400]   # bin edges for Low, Medium, High rainfall

In [None]:
raw_df['Sunshine_Category'] = pd.cut(raw_df['Sunshine'], bins=sunshine_bins, labels=['Low', 'Medium', 'High'])
raw_df['Rainfall_Category'] = pd.cut(raw_df['Rainfall'], bins=rainfall_bins, labels=['Low', 'Medium', 'High'])

In [None]:
raw_df['Combined_Feature'] = raw_df['Sunshine_Category'].astype(str) + '_' + raw_df['Rainfall_Category'].astype(str)

In [None]:
# Convert categories to numerical labels
raw_df['Combined_Encoded'] = raw_df['Combined_Feature'].astype('category').cat.codes

In [None]:
raw_df['Combined_Encoded'].value_counts()

In [None]:
raw_df = raw_df.drop('Combined_Encoded', axis=1)

In [None]:
raw_df.columns

The samples coded 2, 4 and 6 (Low_High, Low_Medium and Medium_Medium, where the first word denotes the sunshine category and the second the rainfall category) are followed by rain tomorrow in almost all cases, although such samples are very rare.

## Encoding Categorical columns:

Since machine learning models can only be trained with numeric data, we need to convert categorical data to numbers. A common technique is to use one-hot encoding for categorical columns.

In [None]:
raw_df.info()

In [None]:
numeric_cols = raw_df.select_dtypes(include=np.number).columns.tolist()
categorical_cols = raw_df.select_dtypes('object').columns.tolist()
print(numeric_cols, categorical_cols)

In [None]:
print(numeric_cols, categorical_cols)

In [None]:
categorical_cols.remove("RainTomorrow")

In [None]:
print(numeric_cols, categorical_cols)

In [None]:
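# sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2); use sparse=False on older versions.
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')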
encoder.fit(raw_df[categorical_cols])
encoder.categories_
encoded_cols = list(encoder.get_feature_names_out(categorical_cols))
len(encoded_cols)

In [None]:
print(encoded_cols)

In [None]:
raw_df[encoded_cols] = encoder.transform(raw_df[categorical_cols])

In [None]:
raw_df.info()

In [None]:
required_cols = numeric_cols + encoded_cols

In [None]:
raw_df[required_cols].info()

## Creating Training, Validation and Test Splits in the dataset.

As a general rule of thumb, we can use around 60% of the data for the training set, 20% for the validation set and 20% for the test set. When rows in the dataset have no inherent order, it's common practice to pick random subsets of rows for the test and validation sets.

But in this case, since we are working with dates, it's a better idea to separate the training, validation and test sets by time, so that the model is trained on past data and evaluated on future data.

In [None]:
# Set rand to True to split the rows randomly instead of by date.
rand = False
if rand:
    train_val_df, test_df = train_test_split(raw_df, test_size=0.2, random_state=42)
    train_df, val_df = train_test_split(train_val_df, test_size=0.25, random_state=42)

For the current dataset, I have used the year extracted from the Date column. I picked the last two years for the test set and the year before them for the validation set.

In [None]:
plt.title('No. of Rows per Year')
sns.countplot(x=pd.to_datetime(raw_df.Date).dt.year);

In [None]:

year = raw_df['Date'].dt.year
train_df = raw_df[year < 2015] # we will train the model on the data before 2015
val_df = raw_df[year == 2015]  # validation set would consist of data of 2015
test_df = raw_df[year > 2015]  # test set would consist of data after 2015
print('train_df.shape :', train_df.shape)
print('val_df.shape :', val_df.shape)
print('test_df.shape :', test_df.shape)
     

We have also ensured that the training, validation and test sets all contain data for all 12 months of the year.
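One way to check month coverage is to count the distinct months in each split. The helper `covers_all_months` below is a hypothetical name and the date ranges are made up; in the notebook it would be called on `train_df['Date']`, `val_df['Date']` and `test_df['Date']`.

```python
import pandas as pd

def covers_all_months(dates: pd.Series) -> bool:
    """Return True if the datetime series contains at least one date in every calendar month."""
    return dates.dt.month.nunique() == 12

# Made-up date ranges: one full year and one that stops in June.
full_year = pd.Series(pd.date_range('2015-01-01', '2015-12-31', freq='D'))
half_year = pd.Series(pd.date_range('2017-01-01', '2017-06-25', freq='D'))

print(covers_all_months(full_year), covers_all_months(half_year))
```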

## Identifying the Input and Target Columns.

In [None]:
input_cols = numeric_cols + encoded_cols
target_col = 'RainTomorrow'

In [None]:
print(input_cols)

In [None]:
train_inputs = train_df[input_cols].copy()
train_targets = train_df[target_col].copy()

val_inputs = val_df[input_cols].copy()
val_targets = val_df[target_col].copy()

test_inputs = test_df[input_cols].copy()
test_targets = test_df[target_col].copy()

In [None]:
# Only about 23 percent of the data points have rain tomorrow, which shows a class imbalance
# in the training data.
train_targets.value_counts()

## Scaling Numeric Features

This is required because:

1. Different features may have different scales, which can lead to numerical instability in many algorithms. Scaling puts all features on a similar scale so that no feature dominates the learning process merely because of its larger values.
2. Scaling speeds up the convergence of optimization algorithms, allowing them to reach the optimal solution more quickly and efficiently.
3. It enhances the performance of algorithms that depend on distance-based calculations or gradients.
4. Some algorithms, like linear regression, assess feature importance based on the magnitude of their coefficients. Scaling ensures that features with smaller numerical values are not overlooked when calculating importance.
5. Regularization techniques, like L1 and L2 regularization, are applied to penalize large coefficients in linear models. Scaling ensures that all features are penalized equally, regardless of their original scales.
6. Algorithms that rely on distance measures, such as clustering algorithms, are sensitive to feature scales. Scaling ensures that distances reflect the actual significance of the features.
7. Some algorithms, like tree-based models (Random Forest, Gradient Boosting), are less sensitive to feature scales due to their internal structure. Scaling is still good practice when the same preprocessed data is shared across different types of algorithms.
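Min-max scaling maps each value to (x - min) / (max - min), which is what MinMaxScaler computes per column with its default (0, 1) range. The sketch below applies the formula by hand to made-up pressure readings; the function name `min_max_scale` is an assumption for illustration.

```python
import numpy as np

def min_max_scale(values, data_min, data_max):
    """Rescale values linearly so that data_min maps to 0 and data_max maps to 1."""
    return (values - data_min) / (data_max - data_min)

# Made-up pressure readings in hPa.
pressure = np.array([980.0, 1000.0, 1040.0])
scaled = min_max_scale(pressure, pressure.min(), pressure.max())
print(scaled)
```

Values outside the range seen when the minimum and maximum were computed would fall outside [0, 1].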


In [None]:
raw_df[numeric_cols].describe()

In [None]:
scaler = MinMaxScaler()
scaler.fit(raw_df[numeric_cols])
print(list(scaler.data_min_), list(scaler.data_max_))

In [None]:
train_inputs[numeric_cols] = scaler.transform(train_inputs[numeric_cols])
val_inputs[numeric_cols] = scaler.transform(val_inputs[numeric_cols])
test_inputs[numeric_cols] = scaler.transform(test_inputs[numeric_cols])

In [None]:
train_inputs[numeric_cols].info()

In [None]:
train_targets.info()

## Training a Logistic Model

In [None]:
print(len(numeric_cols + encoded_cols))

In [None]:
train_inputs.info()

The training set had 97,988 rows before resampling and 152,380 rows after resampling.

In [None]:
model = LogisticRegression(solver='liblinear')
model.fit(train_inputs, train_targets)

In [None]:
print(model.coef_.tolist(), model.intercept_)

## Making Prediction and Evaluating The Model

In [None]:
X_train = train_inputs[numeric_cols + encoded_cols]
X_val = val_inputs[numeric_cols + encoded_cols]
X_test = test_inputs[numeric_cols + encoded_cols]

In [None]:
train_preds = model.predict(X_train)

In [None]:
train_preds, train_targets

In [None]:
# We can output a probabilistic prediction using predict_proba.
train_probs = model.predict_proba(X_train)
train_probs

We can test the accuracy of the model's predictions by computing the percentage of matching values in train_preds and train_targets.
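Accuracy is simply the fraction of positions where prediction and target agree. A minimal sketch on made-up labels (the arrays `toy_targets` and `toy_preds` are assumptions):

```python
import numpy as np

toy_targets = np.array(['No', 'No', 'Yes', 'Yes', 'No'])
toy_preds = np.array(['No', 'Yes', 'Yes', 'No', 'No'])

# The element-wise comparison gives a boolean array; its mean is the share of matches.
accuracy = (toy_targets == toy_preds).mean()
print(accuracy)
```

This is the same quantity that `accuracy_score` returns for these inputs.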

In [None]:
accuracy_score(train_targets, train_preds)

Hence the logistic regression model gives an accuracy of 85.25% on the training set.

In [None]:
val_preds = model.predict(X_val)

In [None]:
val_preds, val_targets

In [None]:
accuracy_score(val_targets, val_preds)

The logistic regression model gives an accuracy of 85.48% on the validation set.

In [None]:
test_preds = model.predict(X_test)
accuracy_score(test_targets, test_preds)

The logistic regression model gives an accuracy of 84.22% on the test set.

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
confusion_matrix(train_targets, train_preds, normalize='true')

In [None]:
# defining a helper function to generate predictions, compute the accuracy score, and plot a confusion matrix for a given 
# set of inputs.

def predict_and_plot(inputs, targets, name=''):
    preds = model.predict(inputs)
    
    accuracy = accuracy_score(targets, preds)
    print("Accuracy: {:.2f}%".format(accuracy * 100))
    
    cf = confusion_matrix(targets, preds, normalize='true')
    plt.figure()
    sns.heatmap(cf, annot=True)
    plt.xlabel('Prediction')
    plt.ylabel('Target')
    plt.title('{} Confusion Matrix'.format(name));
    
    return preds

In [None]:
train_preds = predict_and_plot(X_train, train_targets, 'Training')

In [None]:
val_preds = predict_and_plot(X_val, val_targets, 'Validation')

In [None]:
test_preds = predict_and_plot(X_test, test_targets, 'Test')

In [None]:
# Let's check how good an accuracy of 84% is by comparing it against two simple baselines.

In [None]:
def random_guess(inputs):
    return np.random.choice(["No", "Yes"], len(inputs))

In [None]:
accuracy_score(test_targets, random_guess(X_test))

In [None]:
def all_no(inputs):
    return np.full(len(inputs), "No")
accuracy_score(test_targets, all_no(X_test))

Our random model achieves an accuracy of 50% and our "always No" model achieves an accuracy of 77%. 

Thus, our model is better than a "dumb" or "random" model!
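scikit-learn also ships a ready-made baseline, `DummyClassifier`, which could stand in for the hand-written "always No" function. The sketch below runs it on made-up data with an 80/20 class split; it is an alternative to the helper functions above, not what the notebook uses.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up data: the features are irrelevant for a dummy model, 8 'No' and 2 'Yes' targets.
X = np.zeros((10, 1))
y = np.array(['No'] * 8 + ['Yes'] * 2)

# strategy='most_frequent' always predicts the majority class seen during fit.
baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
baseline_accuracy = baseline.score(X, y)
print(baseline.predict(X[:3]), baseline_accuracy)
```

On imbalanced data this baseline scores exactly the majority-class share, which is why accuracy alone can look better than it is.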

### Making prediction on a new input

In [None]:
new_input = {'Date': '2021-06-19',
             'Location': 'Katherine',
             'MinTemp': 23.2,
             'MaxTemp': 33.2,
             'Rainfall': 10.2,
             'Evaporation': 4.2,
             'Sunshine': np.nan,
             'WindGustDir': 'NNW',
             'WindGustSpeed': 52.0,
             'WindDir9am': 'NW',
             'WindDir3pm': 'NNE',
             'WindSpeed9am': 13.0,
             'WindSpeed3pm': 20.0,
             'Humidity9am': 89.0,
             'Humidity3pm': 58.0,
             'Pressure9am': 1004.8,
             'Pressure3pm': 1001.5,
             'Cloud9am': 8.0,
             'Cloud3pm': 5.0,
             'Temp9am': 25.7,
             'Temp3pm': 33.0,
             'RainToday': 'Yes'}

In [None]:
new_input_df = pd.DataFrame([new_input])

In [None]:
numeric_cols = new_input_df.select_dtypes(include=np.number).columns.tolist()
categorical_cols = new_input_df.select_dtypes('object').columns.tolist()

In [None]:
print(numeric_cols, categorical_cols)

In [None]:
categorical_cols.remove("Date")

In [None]:
new_input_df[numeric_cols] = imputer1.transform(new_input_df[numeric_cols])
new_input_df[categorical_cols] = imputer2.transform(new_input_df[categorical_cols])

In [None]:
# Temp_diff must match its training-time definition (MaxTemp - MinTemp).
new_input_df['Temp_diff'] = abs(new_input_df['MaxTemp']-new_input_df['MinTemp'])
new_input_df['Pressure_diff'] = abs(new_input_df['Pressure3pm']-new_input_df['Pressure9am'])
new_input_df['Humidity_diff'] = abs(new_input_df['Humidity3pm']-new_input_df['Humidity9am'])

In [None]:
new_input_df['Sunshine_Category'] = pd.cut(new_input_df['Sunshine'], bins=sunshine_bins, labels=['Low', 'Medium', 'High'])
new_input_df['Rainfall_Category'] = pd.cut(new_input_df['Rainfall'], bins=rainfall_bins, labels=['Low', 'Medium', 'High'])
new_input_df['Combined_Feature'] = new_input_df['Sunshine_Category'].astype(str) + '_' + new_input_df['Rainfall_Category'].astype(str)

In [None]:
numeric_cols = new_input_df.select_dtypes(include=np.number).columns.tolist()
categorical_cols = new_input_df.select_dtypes('object').columns.tolist()

In [None]:
print(numeric_cols, categorical_cols)

In [None]:
categorical_cols.remove('Date')

In [None]:
new_input_df[numeric_cols] = scaler.transform(new_input_df[numeric_cols])
new_input_df[encoded_cols] = encoder.transform(new_input_df[categorical_cols])

In [None]:
X_new_input = new_input_df[numeric_cols + encoded_cols]
X_new_input

In [None]:
prediction = model.predict(X_new_input)[0]

In [None]:
prediction

In [None]:
# def predict_input(single_input):
#     input_df = pd.DataFrame([single_input])
#     input_df[numeric_cols] = imputer.transform(input_df[numeric_cols])
#     input_df[numeric_cols] = scaler.transform(input_df[numeric_cols])
#     input_df[encoded_cols] = encoder.transform(input_df[categorical_cols])
#     X_input = input_df[numeric_cols + encoded_cols]
#     pred = model.predict(X_input)[0]
#     prob = model.predict_proba(X_input)[0][list(model.classes_).index(pred)]
#     return pred, prob
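A single-input prediction helper along these lines can be made runnable by bundling the preprocessing and the model into one object. The sketch below swaps the manual steps for scikit-learn's `Pipeline` and `ColumnTransformer`, which is a different approach from the notebook's, and trains on a tiny made-up frame. The names `num_cols`, `cat_cols` and `pipeline` and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Tiny made-up training set using a few of the dataset's column names.
train = pd.DataFrame({
    'Humidity3pm': [20.0, 35.0, 80.0, 90.0, np.nan, 75.0, 30.0, 85.0],
    'Pressure3pm': [1020.0, 1018.0, 1002.0, 1000.0, 1005.0, np.nan, 1019.0, 1001.0],
    'RainToday': ['No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes'],
    'RainTomorrow': ['No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes'],
})
num_cols = ['Humidity3pm', 'Pressure3pm']
cat_cols = ['RainToday']

# Same steps as in the notebook (impute, scale, one-hot encode), bundled into one object.
preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='mean')),
                      ('scale', MinMaxScaler())]), num_cols),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), cat_cols),
])
pipeline = Pipeline([('preprocess', preprocess),
                     ('model', LogisticRegression(solver='liblinear'))])
pipeline.fit(train[num_cols + cat_cols], train['RainTomorrow'])

def predict_input(single_input: dict):
    """Return the predicted class and its probability for one dictionary of raw inputs."""
    input_df = pd.DataFrame([single_input])
    pred = pipeline.predict(input_df)[0]
    prob = pipeline.predict_proba(input_df)[0][list(pipeline.classes_).index(pred)]
    return pred, prob

pred, prob = predict_input({'Humidity3pm': 88.0, 'Pressure3pm': 1001.0, 'RainToday': 'Yes'})
print(pred, round(prob, 3))
```

Keeping the fitted transformers and the model together means a new input goes through exactly the same steps as the training data.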

## Decision Trees

In [None]:
# plotting and data libraries used in this section
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib
%matplotlib inline

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 150)
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
X_train = train_inputs[numeric_cols + encoded_cols]
X_val = val_inputs[numeric_cols + encoded_cols]
X_test = test_inputs[numeric_cols + encoded_cols]

In [None]:
X_train

In [None]:
train_targets

In [None]:
model = DecisionTreeClassifier(random_state=42)

In [None]:
%%time
model.fit(X_train, train_targets)

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix

In [None]:
train_preds = model.predict(X_train)

In [None]:
print(train_preds)

In [None]:
pd.Series(train_preds).value_counts()

In [None]:
accuracy_score(train_targets, train_preds)

The training set accuracy is 100%, but that is not what we are interested in; what matters is how well the model generalizes to unseen data.

In [None]:
val_preds = model.predict(X_val)
accuracy_score(val_targets, val_preds)

In [None]:
test_preds = model.predict(X_test)
accuracy_score(test_targets, test_preds)

Although the model achieves 100% accuracy on the training set, the validation and test accuracies are considerably lower. This indicates that the model has memorized the training examples and does not generalize well to unseen data. This is overfitting; in other words, the model has high variance.

#### Creating some visualizations

In [None]:
from sklearn.tree import plot_tree, export_text

In [None]:
plt.figure(figsize=(80,20))
plot_tree(model, feature_names=list(X_train.columns), max_depth=2, filled=True);

In [None]:
model.tree_.max_depth

##### Feature Importance

Based on the gini index computations, a decision tree assigns an "importance" value to each feature. These values can be used to interpret the results given by a decision tree.
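To show what the Gini computation looks like for a single split, the sketch below computes the impurity 1 - sum(p_i^2) of a node and the impurity reduction achieved by splitting it. The label lists are made up, and the helper names `gini_impurity` and `gini_gain` are hypothetical. A tree's feature importance aggregates such reductions over all splits that use the feature and then normalizes them.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

def gini_gain(parent, left, right):
    """Impurity reduction from splitting parent into left and right, weighted by child size."""
    n = len(parent)
    weighted = len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)
    return gini_impurity(parent) - weighted

# Made-up node with 4 'Yes' and 4 'No' samples, split into two children.
parent = ['Yes'] * 4 + ['No'] * 4
left = ['Yes'] * 3 + ['No'] * 1
right = ['Yes'] * 1 + ['No'] * 3

print(gini_impurity(parent), gini_gain(parent, left, right))
```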

In [None]:
model.feature_importances_

In [None]:
importance_df = pd.DataFrame({
    'feature': X_train.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

In [None]:
importance_df.head(10)

In [None]:
plt.title('Feature Importance')
sns.barplot(data=importance_df.head(10), x='importance', y='feature');

##### Hyperparameter Tuning and Reducing Overfitting

In [None]:
model = DecisionTreeClassifier(max_depth=3, random_state=42)

In [None]:
model.fit(X_train, train_targets)

In [None]:
model.score(X_train, train_targets), model.score(X_val, val_targets)

In [None]:
plt.figure(figsize=(80,20))
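# feature_names and class_names are passed as plain lists, since some scikit-learn versions reject arrays here
plot_tree(model, feature_names=list(X_train.columns), filled=True, rounded=True, class_names=list(model.classes_));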

In [None]:
def max_depth_error(md):
    # Fit a tree of the given maximum depth and report its error rates (1 - accuracy)
    model = DecisionTreeClassifier(max_depth=md, random_state=42)
    model.fit(X_train, train_targets)
    train_error = 1 - model.score(X_train, train_targets)
    val_error = 1 - model.score(X_val, val_targets)
    return {'Max Depth': md, 'Training Error': train_error, 'Validation Error': val_error}

In [None]:
%%time
errors_df = pd.DataFrame([max_depth_error(md) for md in range(1, 21)])

In [None]:
errors_df.sort_values('Validation Error', ascending = True)

In [None]:
plt.figure()
plt.plot(errors_df['Max Depth'], errors_df['Training Error'])
plt.plot(errors_df['Max Depth'], errors_df['Validation Error'])
plt.title('Training vs. Validation Error')
plt.xticks(range(0,21, 2))
plt.xlabel('Max. Depth')
plt.ylabel('Prediction Error (1 - Accuracy)')
plt.legend(['Training', 'Validation'])

From the plot, a max_depth of 8 appears to give a good balance between training accuracy and validation accuracy.

In [None]:
model = DecisionTreeClassifier(max_depth=8, random_state=42).fit(X_train, train_targets)
model.score(X_val, val_targets)

#### `max_leaf_nodes`

In [None]:
model = DecisionTreeClassifier(max_leaf_nodes=128, random_state=42)

In [None]:
model.fit(X_train, train_targets)

In [None]:
model.score(X_train, train_targets), model.score(X_val, val_targets)

In [None]:
model.tree_.max_depth

In [None]:
# finding a combination of max_depth and max_leaf_nodes that gives least validation set error.

In [None]:
def depth_leaf(depth_list, max_leaf_list):
    combination = []
    train_acc_list = []
    val_acc_list = []
    
    for i in depth_list:
        for j in max_leaf_list:
            combination.append([i, j])
            model = DecisionTreeClassifier(max_depth=i, max_leaf_nodes=j, random_state=42)
            model.fit(X_train, train_targets)
            train_acc = model.score(X_train, train_targets)
            val_acc = model.score(X_val, val_targets)
            train_acc_list.append(train_acc)
            val_acc_list.append(val_acc)
            
    data = {'Combination': combination, 'Training_accuracy': train_acc_list, 'Validation_accuracy': val_acc_list}
    df = pd.DataFrame(data)
    
    return df

In [None]:
comb = depth_leaf(list(range(5, 11)), list(range(100, 130)))

In [None]:
comb.sort_values('Validation_accuracy', ascending = False)

Thus, the decision tree with a max_depth of 9 and max_leaf_nodes of 120 gives the highest validation accuracy, 84.7%.
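
The nested loops above can also be expressed with scikit-learn's `GridSearchCV`. The sketch below is an alternative, not the method used in this notebook: it scores each combination with 3-fold cross-validation rather than with the fixed validation set, and it runs on synthetic data with a deliberately small, hypothetical grid (`param_grid`) so that it finishes quickly.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption) in place of X_train / train_targets.
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, n_informative=4,
                                     random_state=42)

# Small illustrative grid; the ranges used in the notebook are much larger.
param_grid = {'max_depth': [3, 5, 7], 'max_leaf_nodes': [8, 16, 32]}

search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, cv=3, scoring='accuracy')
search.fit(X_demo, y_demo)

# best_params_ holds the winning combination, best_score_ its mean CV accuracy.
print(search.best_params_, round(search.best_score_, 3))
```

Because cross-validation averages over several splits, the combination it selects may differ from the one found with a single validation set.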

### Training a Random Forest

A random forest is a model that combines the predictions of many decision trees, each trained with slight variations (such as a different random sample of rows and features). The idea is that each tree in the forest makes different kinds of errors, and when their predictions are averaged, many of these errors cancel out.
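
The averaging idea can be checked directly. In the minimal sketch below (synthetic data and illustrative names such as `forest` and `mean_probs`, all assumptions), the class probabilities returned by a `RandomForestClassifier` are compared with the mean of the probabilities predicted by its individual trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption) in place of the weather dataset.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

forest = RandomForestClassifier(n_estimators=25, random_state=42).fit(X_demo, y_demo)

# Stack each tree's predicted probabilities: shape (n_trees, n_samples, n_classes).
tree_probs = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# Averaging over the trees should reproduce the forest's own probabilities.
mean_probs = tree_probs.mean(axis=0)
matches_forest = np.allclose(mean_probs, forest.predict_proba(X_demo))
print(matches_forest)
```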

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
model = RandomForestClassifier(n_jobs=-1, random_state=42)

`n_jobs=-1` allows the random forest to use multiple parallel workers to train the decision trees, and `random_state=42` ensures that we get the same results on each execution.

In [None]:
%%time
model.fit(X_train, train_targets)

In [None]:

model.score(X_train, train_targets), model.score(X_val, val_targets)

In [None]:
train_probs = model.predict_proba(X_train)
train_probs

Looking at individual decision trees

In [None]:
model.estimators_[0]

In [None]:
plt.figure(figsize=(80,20))
plot_tree(model.estimators_[0], max_depth=2, feature_names=list(X_train.columns), filled=True, rounded=True, class_names=list(model.classes_));

In [None]:
plt.figure(figsize=(80,20))
plot_tree(model.estimators_[20], max_depth=2, feature_names=list(X_train.columns), filled=True, rounded=True, class_names=list(model.classes_));

In [None]:
len(model.estimators_)

In [None]:
importance_df = pd.DataFrame({
    'feature': X_train.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

In [None]:
importance_df.head(10)

In [None]:
plt.title('Feature Importance')
sns.barplot(data=importance_df.head(10), x='importance', y='feature');

An important insight into the usefulness of ensembling is that the feature importances are now much less skewed than they were for the single decision tree.

### Hyperparameter Tuning with Random Forest

In [None]:
base_model = RandomForestClassifier(random_state=42, n_jobs=-1).fit(X_train, train_targets)

In [None]:
base_train_acc = base_model.score(X_train, train_targets)
base_val_acc = base_model.score(X_val, val_targets)

In [None]:
base_accs = base_train_acc, base_val_acc
base_accs

### Some more hyperparameters

#### `n_estimators`
This controls the number of decision trees in the random forest.
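
A quick way to see the effect of this parameter is to sweep a few values and record the validation accuracy for each. The sketch below uses synthetic data and small, arbitrary values of `n_estimators` (both assumptions), so its numbers say nothing about the weather dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (assumption) in place of the weather dataset.
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Map each n_estimators value to (number of trees built, validation accuracy).
results = {}
for n in [5, 25, 100]:
    forest = RandomForestClassifier(n_estimators=n, random_state=42).fit(X_tr, y_tr)
    results[n] = (len(forest.estimators_), forest.score(X_va, y_va))

for n, (n_trees, val_acc) in results.items():
    print(f'n_estimators={n}: trees={n_trees}, val_acc={val_acc:.3f}')
```

More trees usually give more stable predictions at the cost of longer training time, with diminishing returns beyond some point.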

In [None]:
model = RandomForestClassifier(random_state=42, n_jobs=-1, n_estimators=10)

In [None]:
%%time
model.fit(X_train, train_targets)
model.score(X_train, train_targets), model.score(X_val, val_targets)

In [None]:
base_accs

In [None]:
model = RandomForestClassifier(random_state=42, n_jobs=-1, n_estimators=500)
model.fit(X_train, train_targets)

In [None]:
model.score(X_train, train_targets), model.score(X_val, val_targets)

In [None]:
model = RandomForestClassifier(random_state=42, n_jobs=-1, n_estimators=250)
model.fit(X_train, train_targets)
model.score(X_train, train_targets), model.score(X_val, val_targets)

#### Max Leaf Nodes and Max Depth

In [None]:
# making a helper function to test hyperparameters
def test_params(**params):
    model = RandomForestClassifier(random_state=42, n_jobs=-1, **params).fit(X_train, train_targets)
    return model.score(X_train, train_targets), model.score(X_val, val_targets)

In [None]:
test_params(max_depth=26)

In [None]:
test_params(max_leaf_nodes=2**20)

In [None]:
base_accs

Let's try values of `max_leaf_nodes` and `max_depth` based on those obtained earlier.

In [None]:
test_params(max_leaf_nodes = 250, max_depth = 9)

The accuracy seems to be low

In [None]:
# max_features: the default value is sqrt(n), which means that
# only sqrt(n) of the n features are considered, chosen at random, at each split

In [None]:
test_params(max_features='log2')

### `min_samples_split` and `min_samples_leaf`

By default, the decision tree classifier tries to split every node that has 2 or more samples, and a leaf may contain as few as 1 sample. You can increase the values of these arguments to change this behavior and reduce overfitting, especially for very large datasets.
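
The effect of these two arguments on the shape of a tree can be inspected directly. The following sketch (synthetic data and hypothetical names such as `default_tree`, `pruned_tree` and `smallest_leaf`) compares the number of leaves with and without the constraints and checks the size of the smallest leaf.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy data (assumption) in place of the weather dataset.
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, flip_y=0.1,
                                     random_state=42)

default_tree = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)
pruned_tree = DecisionTreeClassifier(min_samples_split=100, min_samples_leaf=50,
                                     random_state=42).fit(X_demo, y_demo)

# In the fitted tree structure, leaves are the nodes without children (children_left == -1).
is_leaf = pruned_tree.tree_.children_left == -1
smallest_leaf = pruned_tree.tree_.n_node_samples[is_leaf].min()

print('leaves (default):', default_tree.get_n_leaves())
print('leaves (constrained):', pruned_tree.get_n_leaves())
print('smallest constrained leaf:', smallest_leaf)
```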

In [None]:
test_params(min_samples_split=100, min_samples_leaf=60)

In [None]:
test_params(min_samples_split=50, min_samples_leaf=10)

### `min_impurity_decrease`

This argument is used to control the threshold for splitting nodes. A node will be split if the split induces a decrease of the impurity (Gini index) greater than or equal to this value. Its default value is 0, and you can increase it to reduce overfitting.
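
For reference, the quantity compared against this threshold is, as described in the scikit-learn documentation, a weighted impurity decrease. In the notation below (chosen here for illustration), $N$ is the total number of samples, $N_t$ the number of samples at node $t$, $N_{t_L}$ and $N_{t_R}$ the numbers of samples in its left and right children, $I$ the impurity, and $p_k(t)$ the fraction of samples at node $t$ that belong to class $k$:

$$
\Delta(t) = \frac{N_t}{N}\left( I(t) - \frac{N_{t_R}}{N_t}\, I(t_R) - \frac{N_{t_L}}{N_t}\, I(t_L) \right),
\qquad
I(t) = 1 - \sum_{k} p_k(t)^2
$$

The factor $N_t / N$ means that a split deep in the tree, which affects only a few samples, needs a larger local improvement to pass the same threshold.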


In [None]:
test_params(min_impurity_decrease=1e-7)

In [None]:
base_accs

#### `bootstrap`, `max_samples` 

By default, a random forest doesn't use the entire dataset for training each decision tree. Instead it applies a technique called bootstrapping. For each tree, rows from the dataset are picked one by one randomly, with replacement i.e. some rows may not show up at all, while some rows may show up multiple times.


<img src="https://i.imgur.com/W8UGaEA.png" width="640">

Bootstrapping helps the random forest generalize better, because each decision tree only sees a fraction of the training set, and some rows randomly get more weight than others.
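
The "fraction of the training set" can be estimated with a few lines of NumPy. This sketch simulates a single bootstrap sample of row indices (the row count `n_rows` is an arbitrary assumption) and measures how many distinct rows it contains; for large datasets this fraction is known to approach $1 - 1/e$, roughly 63%.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 10_000  # hypothetical dataset size

# Draw n_rows indices with replacement, as bootstrapping does for each tree.
sample_idx = rng.integers(0, n_rows, size=n_rows)

# Fraction of distinct rows that appear at least once in the sample.
unique_fraction = np.unique(sample_idx).size / n_rows
print(f'distinct rows in the bootstrap sample: {unique_fraction:.1%}')
```

The remaining rows are never seen by that particular tree, which is part of what makes the trees in the forest differ from one another.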

In [None]:
test_params(bootstrap=False)

In [None]:
base_accs

In [None]:
test_params(max_samples=0.9)

#### `class_weight`

The purpose of using class weights is to address class imbalance in the training data. When one class has significantly more samples than the other, the model may become biased towards the majority class. Assigning higher weights to the minority class makes the model pay more attention to it and prevents it from being overshadowed by the majority class.
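
Instead of choosing the weights by hand, scikit-learn can derive them from the class frequencies. The sketch below uses `compute_class_weight` with the `'balanced'` heuristic on a made-up label vector (80 `'No'` and 20 `'Yes'`, an assumption for illustration, not the actual class ratio of the weather data). Each weight is computed as `n_samples / (n_classes * class_count)`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 80 'No' and 20 'Yes'.
y_demo = np.array(['No'] * 80 + ['Yes'] * 20)
classes = np.unique(y_demo)

# 'balanced' weights: n_samples / (n_classes * count of each class).
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo)
weight_map = dict(zip(classes, weights))
print(weight_map)
```

The same heuristic can be requested directly by passing `class_weight='balanced'` to `RandomForestClassifier`, which is an alternative to the explicit dictionaries used in this notebook.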

In [None]:
test_params(class_weight={'No': 1, 'Yes': 4})

In [None]:
# so finally we have:

model = RandomForestClassifier(n_jobs=-1, 
                               random_state=42, 
                               n_estimators=500,
                               max_features=7,
                               max_depth=30, 
                               class_weight={'No': 1, 'Yes': 1.5})

In [None]:
model.fit(X_train, train_targets)

In [None]:
model.score(X_train, train_targets), model.score(X_val, val_targets)

In [None]:
base_accs

## SVM classifier

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Initialize and train an SVM classifier
svm_classifier = SVC(kernel="linear")  # You can experiment with different kernels
svm_classifier.fit(X_train, train_targets)

# Make predictions on the test set
predictions = svm_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(test_targets, predictions)
print("Accuracy:", accuracy)

classification_rep = classification_report(test_targets, predictions)
print("Classification Report:\n", classification_rep)


In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Initialize and train an SVM classifier
svm_classifier = SVC(kernel="rbf")  # You can experiment with different kernels
svm_classifier.fit(X_train, train_targets)

# Make predictions on the test set
predictions = svm_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(test_targets, predictions)
print("Accuracy:", accuracy)

classification_rep = classification_report(test_targets, predictions)
print("Classification Report:\n", classification_rep)


In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Initialize and train an SVM classifier
svm_classifier = SVC(kernel="poly")  # You can experiment with different kernels
svm_classifier.fit(X_train, train_targets)

# Make predictions on the test set
predictions = svm_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(test_targets, predictions)
print("Accuracy:", accuracy)

classification_rep = classification_report(test_targets, predictions)
print("Classification Report:\n", classification_rep)


## Conclusion:

Of all the models applied in this study, the random forest gives the best validation set accuracy, 85.95%. Hence we select it as the final model.

## Future work ideas:

Since this is spatio-temporal data, i.e. it includes both a spatial and a time component, analyzing it often involves techniques such as space-time clustering, geostatistics, spatial interpolation, time series analysis, and various machine learning and data mining techniques tailored to spatio-temporal patterns. I believe this would also require some domain-specific knowledge of geography and rainfall-related sciences.

Studying this dataset as spatio-temporal data can provide insights into dynamic processes that occur across both space and time, enabling more accurate predictions, better decision-making, and a deeper understanding of complex phenomena.

Examples of other models that could be used are Long Short-Term Memory (LSTM) networks, Convolutional Neural Networks (CNNs), kriging, and spatial regression (which includes Geographically Weighted Regression (GWR) and Spatial Autoregressive (SAR) models).
