# Rainfall Weather Forecasting

Project Description
Weather forecasting is the application of science and technology to predict the conditions of
the atmosphere for a given location and time. Weather forecasts are made by
collecting quantitative data about the current state of the atmosphere at a given place and
using meteorology to project how the atmosphere will change.
The Rain dataset is used to predict whether or not it will rain tomorrow. It contains about 10
years of daily weather observations from different locations in Australia. Two things are predicted:
 
1. Problem Statement: 
a) Design a predictive model using machine learning algorithms to forecast whether or
not it will rain tomorrow.
b) Design a predictive model using machine learning algorithms to predict how much
rainfall there will be.

Dataset Description:
Number of columns: 23

Date - The date of observation
Location - The common name of the location of the weather station
MinTemp - The minimum temperature in degrees Celsius
MaxTemp - The maximum temperature in degrees Celsius
Rainfall - The amount of rainfall recorded for the day in mm
Evaporation - The so-called Class A pan evaporation (mm) in the 24 hours to 9am
Sunshine - The number of hours of bright sunshine in the day
WindGustDir - The direction of the strongest wind gust in the 24 hours to midnight
WindGustSpeed - The speed (km/h) of the strongest wind gust in the 24 hours to midnight
WindDir9am - Direction of the wind at 9am
WindDir3pm - Direction of the wind at 3pm
WindSpeed9am - Wind speed (km/h) averaged over 10 minutes prior to 9am
WindSpeed3pm - Wind speed (km/h) averaged over 10 minutes prior to 3pm
Humidity9am - Humidity (percent) at 9am
Humidity3pm - Humidity (percent) at 3pm
Pressure9am - Atmospheric pressure (hPa) reduced to mean sea level at 9am
Pressure3pm - Atmospheric pressure (hPa) reduced to mean sea level at 3pm
Cloud9am - Fraction of sky obscured by cloud at 9am
Cloud3pm - Fraction of sky obscured by cloud at 3pm
Temp9am - Temperature (degrees C) at 9am
Temp3pm - Temperature (degrees C) at 3pm
RainToday - Boolean: 1 if precipitation (mm) in the 24 hours to 9am exceeds 1mm, otherwise 0
RainTomorrow - Whether it rains the next day; this is the response variable. It is derived from
the amount of next-day rain in mm, a kind of measure of the "risk".
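
The RainToday rule described above can be illustrated with a small sketch. The toy rainfall values and the column name RainToday_derived are assumptions made for this example only; the dataset already ships its own RainToday column.

```python
import pandas as pd

# Toy rainfall readings in mm (illustrative values, not taken from the dataset)
toy = pd.DataFrame({"Rainfall": [0.0, 0.6, 1.0, 1.2, 15.4]})

# The flag is 1 only when rainfall strictly exceeds 1 mm, otherwise 0
toy["RainToday_derived"] = (toy["Rainfall"] > 1.0).astype(int)

print(toy)
```

A reading of exactly 1.0 mm does not count as rain under this rule, because the description says "exceeds".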

Dataset Link-
 https://raw.githubusercontent.com/dsrscientist/dataset3/main/weatherAUS.csv
 https://github.com/dsrscientist/dataset3

In [None]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
from scipy.stats import zscore
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

In [None]:
# reading the file. 
df = pd.read_csv('https://raw.githubusercontent.com/dsrscientist/dataset3/main/weatherAUS.csv')

In [None]:
# reading the first 5 rows and all columns
df.head()

In [None]:
# 5 rows from the last 
df.tail()

In [None]:
df.shape

In [None]:
df.columns.tolist()

In [None]:
# the sum of the missing values
df.isnull().sum().to_frame("No. of missing values")

In [None]:
# the number of unique values in each column
df.nunique().to_frame("No. of unique value")

In [None]:
# plotting the heatmap to show the missing value
sns.heatmap(df.isnull())

In [None]:
# We can identify that there are missing values in the heatmap. 

In [None]:
# To check the percentage of the missing data. 
missing_percentage = (df.isnull().sum() / len(df)) * 100
print(missing_percentage)

In [None]:
for col in df.columns:
    print(f"Column: {col}")
    print(df[col].value_counts())
    print("\n")

In [None]:
# Analysing the dataframe
df.info()

- We can identify that only 2 data types are present in the df.

In [None]:
# understanding the dataframe.
df.describe()

Observation
- MinTemp ranges from -2 to 28.50; the large gap between the 75% quantile and the maximum points to outliers in the column.
- For MaxTemp the lower value is 23.857657 and the maximum is 45; the gap between the last quantile and the maximum again suggests outliers.
- In the Rainfall column very few values lie between the min (-5.946275) and the max (107.000000), which indicates outliers.
- Left skew: MinTemp, WindSpeed3pm, Humidity9am, Humidity3pm
- Right skew: MaxTemp, Rainfall, WindGustSpeed, WindSpeed9am, Pressure9am, Pressure3pm, Temp9am, Temp3pm
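
The "gap between the last quantile and the maximum" argument can be made more concrete with the interquartile-range (IQR) rule. The sketch below is an illustration on made-up values, not output from the dataset; the 1.5 multiplier is the conventional choice and an assumption here.

```python
import pandas as pd

# Toy rainfall-like values with one extreme reading (illustrative only)
values = pd.Series([0.0, 0.0, 0.2, 0.4, 1.0, 1.6, 2.4, 3.0, 107.0], name="Rainfall")

# Quartiles and the IQR fences
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is flagged as a potential outlier
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("Fences:", lower_fence, upper_fence)
print("Flagged values:", outliers.tolist())
print("Skewness:", values.skew())  # positive means right-skewed
```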

In [None]:
# Plotting Rainfall over time
plt.figure(figsize=(10,5))
plt.plot(df['Rainfall'])
plt.title('Rainfall Over Time')
plt.xlabel('Observation index')
plt.ylabel('Rainfall')
plt.show()

In [None]:
# Converting the 'Date' column to datetime format
df['Date'] = pd.to_datetime(df['Date'], format= '%Y-%m-%d')

In [None]:
# Extracting the day, month, and year and create new columns for each
df['Day'] = df['Date'].dt.day
df['Month'] = df['Date'].dt.month
df['Year'] = df['Date'].dt.year

In [None]:
# Converting the datatype to date time format 
df['Date'] = pd.to_datetime(df['Date'])

In [None]:
# to verify the format
print(df['Date'].dtype)

In [None]:
# Drop the columns with a high share of missing values; Date is no longer needed once Day, Month and Year are extracted.
df = df.drop(['Evaporation', 'Sunshine','Cloud9am','Cloud3pm','Date'], axis=1)
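
Rather than listing the sparse columns by hand, they could be selected from the missing-value percentages. This is a minimal sketch on toy data; the 40% cutoff (max_missing_pct) is a hypothetical value chosen for the example, not one used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with one mostly-empty column (illustrative values)
toy = pd.DataFrame({
    "Sunshine": [np.nan, np.nan, np.nan, 7.5],
    "MinTemp": [13.4, np.nan, 12.9, 9.2],
    "MaxTemp": [22.9, 25.1, 25.7, 28.0],
})

max_missing_pct = 40.0  # hypothetical cutoff

# Percentage of missing values per column
missing_pct = toy.isnull().mean() * 100

# Columns whose missing share is above the cutoff
cols_to_drop = missing_pct[missing_pct > max_missing_pct].index.tolist()
toy_reduced = toy.drop(columns=cols_to_drop)

print(missing_pct.round(1))
print("Dropped:", cols_to_drop)
```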

In [None]:
categorical_col = []
numerical_col = []

# Categorize columns as categorical or numerical
for i in df.columns:
    if df.dtypes[i] == "object":
        categorical_col.append(i)
    else:
        numerical_col.append(i)

print("Categorical Columns:", categorical_col)
print("\n")
print("Numerical Columns:", numerical_col)

In [None]:
# Count unique occurrences of each location
location_counts = df['Location'].value_counts()

# Plot pie chart
plt.figure(figsize=(10,8))
plt.pie(location_counts, labels=location_counts.index, autopct='%1.1f%%')
plt.axis('equal')
plt.title('Pie chart of Locations')
plt.show()


In [None]:
# Replacing the missing values in the categorical columns with the mode.
Missing_categorical = ['WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday', 'RainTomorrow']

for column in Missing_categorical:
    # Assign the result back; inplace fillna on a selected column is unreliable in recent pandas
    df[column] = df[column].fillna(df[column].mode()[0])

In [None]:
#  to check if the missing values are removed
df['RainToday'].unique()

In [None]:
df['Rainfall'].unique()

In [None]:
# replacing the missing values in Rainfall with the mean
mean_rainfall = df['Rainfall'].mean()
df['Rainfall'] = df['Rainfall'].fillna(mean_rainfall)

In [None]:
df['Rainfall'].unique()

In [None]:
# Numerical columns whose missing values are replaced with the column mean
Missing_numerical = ['MinTemp', 'MaxTemp', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Temp9am', 'Temp3pm']

In [None]:
for col in Missing_numerical:
    mean_value = df[col].mean()
    df[col] = df[col].fillna(mean_value)
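
The loop above fills every gap with a single global mean per column. As a point of comparison, the sketch below contrasts that with a per-location mean, which is a different technique from the one used in this notebook. The location names and temperatures are toy values.

```python
import numpy as np
import pandas as pd

# Toy data: two locations with clearly different temperature levels
toy = pd.DataFrame({
    "Location": ["Albury", "Albury", "Albury", "Darwin", "Darwin", "Darwin"],
    "MaxTemp": [20.0, np.nan, 24.0, 32.0, 34.0, np.nan],
})

# Global mean imputation, as in the notebook
global_mean = toy["MaxTemp"].mean()
toy["MaxTemp_global"] = toy["MaxTemp"].fillna(global_mean)

# Alternative: fill each gap with the mean of the same location
location_mean = toy.groupby("Location")["MaxTemp"].transform("mean")
toy["MaxTemp_by_location"] = toy["MaxTemp"].fillna(location_mean)

print(toy)
```

The global mean pulls both gaps to the same value, while the per-location variant keeps the two stations apart.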

In [None]:
df.isnull().sum()

In [None]:
df['Rainfall'].unique()

In [None]:
# Plotting the Rain prediction
plt.figure(figsize=(5, 5))
sns.countplot(x='RainTomorrow', data=df)
plt.title('Rainfall Distribution')
print(df['RainTomorrow'].value_counts())
plt.show()

In [None]:
# Rainfall
plt.figure(figsize=(25, 15))
# histplot with kde=True replaces the deprecated distplot
sns.histplot(df['Rainfall'], kde=True)
plt.title('Rainfall Distribution')
print(df['Rainfall'].value_counts())
plt.show()

- The RainTomorrow classes are imbalanced, so the data has to be balanced before training the model.

In [None]:
df.head()

In [None]:
plt.figure(figsize=(15, 5))
sns.countplot(x='Location', hue='RainTomorrow', data=df, palette="hls" )

In [None]:
plt.figure(figsize=(15, 5))
sns.histplot(x='MinTemp', hue='RainTomorrow', bins=50, data=df, palette="hls", multiple="stack")
plt.show()

In [None]:
plt.figure(figsize=(15, 5))
sns.histplot(x='MaxTemp', hue='RainTomorrow', bins=100, data=df, palette="husl", multiple="stack")
plt.show()

In [None]:
# pairplot creates its own figure, so no plt.figure call is needed
sns.pairplot(df, hue="RainTomorrow", palette="Set1")

In [None]:
# checking the skewness
df.skew(numeric_only=True)

In [None]:
df.shape

In [None]:
numerical_col = df.select_dtypes(include=[np.number]).columns.tolist()

plt.figure(figsize=(20,25), facecolor='green')
plotnumber = 1

for col in numerical_col:
    # The 4 x 5 grid holds at most 20 plots
    if plotnumber <= 20:
        ax = plt.subplot(4, 5, plotnumber)
        # histplot with kde=True replaces the deprecated distplot
        sns.histplot(df[col], kde=True)
        plt.xlabel(col, fontsize=15)
        plt.xticks(rotation=0, fontsize=10)
        title = "Skewness : %.2f"%(df[col].skew())
        plt.title(title, fontsize=15)
        plotnumber += 1
plt.tight_layout()
plt.show()

In [None]:
df.head()

In [None]:
skew_col = ['Rainfall','WindGustSpeed','WindSpeed9am']

from scipy.stats import yeojohnson
# removing the skewness
for col in skew_col:
    df[col], _ = yeojohnson(df[col])

- We shall proceed without removing the skewness, since there are NaN values in the data. 
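
Whether a Yeo-Johnson transform reduces skewness can be checked by comparing the skewness before and after. The sketch below does this on synthetic, NaN-free data drawn from an exponential distribution; the data and variable names are assumptions for illustration, not values from the weather dataset.

```python
import numpy as np
from scipy.stats import skew, yeojohnson

# Synthetic right-skewed sample standing in for a column such as Rainfall
rng = np.random.default_rng(0)
toy_rainfall = rng.exponential(scale=2.0, size=500)

# With no lambda given, yeojohnson fits lambda and returns (transformed, lambda)
transformed, fitted_lambda = yeojohnson(toy_rainfall)

skew_before = skew(toy_rainfall)
skew_after = skew(transformed)

print(f"lambda = {fitted_lambda:.3f}")
print(f"skewness before = {skew_before:.3f}, after = {skew_after:.3f}")
```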

In [None]:
df.head()

In [None]:
# Encoding the values of the categorical column
le = LabelEncoder()
for col in categorical_col:
    df[col] = le.fit_transform(df[col])
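
The loop above reuses one LabelEncoder, so after it finishes only the mapping of the last column is still available. If the original labels are needed later (for example to read a prediction back as a wind direction), one encoder per column can be kept. A minimal sketch on toy data; the encoders dictionary is a name introduced for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical columns (illustrative values)
toy = pd.DataFrame({
    "WindGustDir": ["W", "NNE", "W", "SE"],
    "RainToday": ["No", "Yes", "No", "No"],
})

# Keep one fitted encoder per column so each mapping can be inverted later
encoders = {}
for col in toy.columns:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# Recover the original wind-direction labels from the integer codes
decoded = encoders["WindGustDir"].inverse_transform(toy["WindGustDir"])

print(toy)
print(list(decoded))
```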

In [None]:
df.head()

In [None]:
# Checking for the outliers in the numerical columns
numerical_col = df.select_dtypes(include=[np.number]).columns.tolist()

plt.figure(figsize =(10,6))
plotnumber = 1

for col in numerical_col:
    if plotnumber<=9:
        ax = plt.subplot(4,5,plotnumber)
        sns.boxplot(df[col])
        plt.xlabel(col,fontsize=15)
        plt.xticks(rotation=0,fontsize=10)
    plotnumber+=1
plt.tight_layout()

In [None]:
from scipy.stats import zscore 
feature_outliers =(['Rainfall','WindGustSpeed','WindSpeed3pm','Humidity9am','Pressure9am'])
Z = np.abs(zscore(df[feature_outliers]))

In [None]:
Z

In [None]:
# keeping the threshold = 3
threshold = 3
print(np.where(Z > threshold))

In [None]:
df1 = df[(Z<3).all(axis=1)]

In [None]:
print("Data Loss Percentage = ",((df.shape[0]-df1.shape[0])/df.shape[0])*100)
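
The z-score filter and the data-loss figure above can be reproduced on a small synthetic frame. The sketch computes the z-scores by hand with the population standard deviation (ddof=0), which is the default used by scipy's zscore; the data and the injected outlier are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for two of the numerical columns
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "WindGustSpeed": rng.normal(40, 10, 200),
    "Humidity9am": rng.normal(65, 15, 200),
})
toy.loc[0, "WindGustSpeed"] = 200.0  # injected extreme value

# Absolute z-score of every cell, column by column
z = ((toy - toy.mean()) / toy.std(ddof=0)).abs()

# Keep only the rows where every column is within the threshold
threshold = 3
kept = toy[(z < threshold).all(axis=1)]

loss_pct = (len(toy) - len(kept)) / len(toy) * 100
print(f"Rows kept: {len(kept)} of {len(toy)}")
print(f"Data loss percentage: {loss_pct:.2f}")
```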

- We shall proceed with the original data. 

In [None]:
# Separating the features and the label.
X1 = df1.drop('RainTomorrow', axis=1)
Y1 = df1['RainTomorrow']

In [None]:
# Dividing the data in to the training and the testing data. 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X1, Y1, test_size=0.2, random_state=22)
X_train 

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X1_train_scaled = scaler.fit_transform(X_train)
X1_test_scaled = scaler.transform(X_test)

In [None]:
from imblearn.over_sampling import SMOTE

# Create an instance of SMOTE
smote = SMOTE()

# Resample the data
X1, Y1 = smote.fit_resample(X1,Y1)
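
The cell above resamples the full X1, Y1 before the train/test split. A common precaution is to split first and balance only the training part, so that the test set contains real observations only. The sketch below shows that ordering with plain random oversampling in pandas, which is a simpler stand-in and not SMOTE; the toy data and all names are assumptions.

```python
import pandas as pd

# Toy imbalanced data: 4 rainy days, 16 dry days (illustrative)
toy = pd.DataFrame({
    "Humidity3pm": range(20),
    "RainTomorrow": [1] * 4 + [0] * 16,
})

# 1) Split first: hold out 25% of the rows as the test set
test = toy.sample(frac=0.25, random_state=22)
train = toy.drop(test.index)

# 2) Balance the training rows only, by resampling the minority class
counts = train["RainTomorrow"].value_counts()
minority_label = counts.idxmin()
n_extra = counts.max() - counts.min()
extra = train[train["RainTomorrow"] == minority_label].sample(
    n=n_extra, replace=True, random_state=22
)
train_balanced = pd.concat([train, extra], ignore_index=True)

print(train_balanced["RainTomorrow"].value_counts())
print("Test rows left untouched:", len(test))
```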

In [None]:
Y1.value_counts()

In [None]:
from sklearn.model_selection import train_test_split
X1_train, X1_test, Y1_train, Y1_test = train_test_split(X1, Y1, test_size=0.2, random_state=22)
X1_train 

In [None]:
def metric_score(clf, X1_train, X1_test, Y1_train, Y1_test, train):
    if train == True: 
        pred = clf.predict(X1_train)
        print("\n======================Train Result==========================")
        print(f"Accuracy Score : {accuracy_score(Y1_train, pred) * 100:.2f}%")
    elif train == False: 
        pred = clf.predict(X1_test)
        print("\n======================Test Result==========================")
        print(f"Accuracy Score : {accuracy_score(Y1_test, pred) * 100:.2f}%")
        print('\n \n Test Classification Report \n', classification_report(Y1_test, pred, digits=2))

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()

# training the model 
rf.fit(X1_train, Y1_train)

In [None]:
metric_score(rf, X1_train, X1_test, Y1_train, Y1_test, train= True)
metric_score(rf, X1_train, X1_test, Y1_train, Y1_test, train= False)

In [None]:
lr = LogisticRegression()

# training the model 
lr.fit(X1_train, Y1_train)

In [None]:
metric_score(lr, X1_train, X1_test, Y1_train, Y1_test, train= True)
metric_score(lr, X1_train, X1_test, Y1_train, Y1_test, train= False)

In [None]:
dt = DecisionTreeClassifier()

# training the model
dt.fit(X1_train, Y1_train)

In [None]:
metric_score(dt, X1_train, X1_test, Y1_train, Y1_test, train= True)
metric_score(dt, X1_train, X1_test, Y1_train, Y1_test, train= False)

In [None]:
# Gradient Boosting
gb = GradientBoostingClassifier()

# training the model
gb.fit(X1_train, Y1_train)

In [None]:
metric_score(gb, X1_train, X1_test, Y1_train, Y1_test, train= True)
metric_score(gb, X1_train, X1_test, Y1_train, Y1_test, train= False)

In [None]:
models = RandomForestClassifier(), GradientBoostingClassifier(), DecisionTreeClassifier(), LogisticRegression()
model_names = ['RandomForestClassifier', 'GradientBoostingClassifier', 'DecisionTreeClassifier', 'LogisticRegression']

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Define the models
models = [RandomForestClassifier(), GradientBoostingClassifier(), DecisionTreeClassifier(), LogisticRegression()]

plt.figure(figsize=(10, 6))

for model in models:
    # Train the model
    model.fit(X1_train, Y1_train)

    # Predict probabilities
    probabilities = model.predict_proba(X1_test)[:, 1]

    # Compute ROC curve and AUC score
    fpr, tpr, _ = roc_curve(Y1_test, probabilities)
    auc_score = roc_auc_score(Y1_test, probabilities)

    # Plot the ROC curve
    plt.plot(fpr, tpr, label=f'{type(model).__name__} (AUC = {auc_score:.2f})')

# Random guess line
plt.plot([0, 1], [0, 1], 'k--')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristics (ROC) Curve')
plt.legend(loc='lower right')
plt.show()

In [None]:
from sklearn.model_selection import cross_val_score

models = [RandomForestClassifier(), GradientBoostingClassifier(), DecisionTreeClassifier(), LogisticRegression()]

# For each model
for model in models:
    # Perform cross-validation
    scores = cross_val_score(model, X1, Y1, cv=5)
    
    # Print cross-validation score for each model
    print(f'{type(model).__name__}: {scores.mean():.2f} +/- {scores.std():.2f}')


In [None]:
# Names of the models
model_names = ['RandomForestClassifier', 'GradientBoostingClassifier', 'DecisionTreeClassifier', 'LogisticRegression']

# Cross-validation scores
cross_val_scores = [0.79, 0.65, 0.76, 0.71] # Replace these with your actual scores

# Standard Deviation scores
std_scores = [0.08, 0.07, 0.08, 0.06] # Replace these with your actual scores



# Create a dataframe (named cv_results so the weather data in df is not overwritten)
cv_results = pd.DataFrame({
    'Model': model_names,
    'Cross_Val_Score': cross_val_scores,
    'Std_Score': std_scores
})

# Print the dataframe
print(cv_results)
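
The scores in the table above are typed in by hand. They could instead be collected directly from cross_val_score, which avoids copy errors. A minimal sketch on synthetic data from make_classification; the two candidate models and the pipeline with StandardScaler are choices made for this example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for X1, Y1
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

# Putting the scaler in a pipeline refits it inside every fold
candidates = {
    "LogisticRegression": make_pipeline(StandardScaler(), LogisticRegression()),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
}

# Collect mean and standard deviation of the fold accuracies
rows = []
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_toy, y_toy, cv=5)
    rows.append({
        "Model": name,
        "Cross_Val_Score": scores.mean(),
        "Std_Score": scores.std(),
    })

cv_table = pd.DataFrame(rows)
print(cv_table)
```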

In [None]:
# RandomForestClassifier has the highest cross-validation score above, so it is taken as the best model

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()

scores = cross_val_score(model, X1, Y1, cv=5)

print("Cross-validation scores: ", scores)
print("Average cross-validation score: ", scores.mean())

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

rf = RandomForestClassifier(random_state=0)

# Define the grid of hyperparameters to search
param_grid = {
    'n_estimators': [50, 100, 200], 
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

grid_rf = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)

grid_rf.fit(X1_train, Y1_train)
print("Best parameters found: ", grid_rf.best_params_)
print("Best score found: ", grid_rf.best_score_)
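
The grid above has 3 x 4 x 3 x 3 x 2 = 216 combinations, each fitted 5 times, which can take a long time. RandomizedSearchCV, which is imported above but not used, samples a fixed number of combinations instead. A minimal sketch on synthetic data; the reduced parameter values, n_iter=5 and cv=3 are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for X1_train, Y1_train
X_toy, y_toy = make_classification(n_samples=150, n_features=6, random_state=0)

# Candidate values to sample from (smaller than the notebook's grid)
param_distributions = {
    "n_estimators": [10, 20, 40],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Try only n_iter randomly chosen combinations
search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X_toy, y_toy)

print("Best parameters found:", search.best_params_)
print("Best score found:", search.best_score_)
```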

In [None]:
# Get the best parameters
best_params = grid_rf.best_params_
print("Best parameters found: ", best_params)

# Get the best score
best_score = grid_rf.best_score_
print("Best score found: ", best_score)

In [None]:
# Refitting the model with the best parameters
best_rf = RandomForestClassifier(**best_params)
best_rf.fit(X1_train, Y1_train)

In [None]:
from sklearn.metrics import accuracy_score, classification_report

# Predict the labels
Y1_pred = best_rf.predict(X1_test)

# Print accuracy score and classification report
print("Accuracy Score : ", accuracy_score(Y1_test, Y1_pred))
print("\nClassification Report : \n", classification_report(Y1_test, Y1_pred))

In [None]:
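# Feature values for a single observation. The keys must match the training columns,
# and the target RainTomorrow is not a feature. The RainToday, Day, Month and Year
# values are illustrative placeholders.
data_Predict = {'Location': [3],
                'MinTemp': [13.4],
                'MaxTemp': [22.9],
                'Rainfall': [0.6],
                'WindGustDir': [22.0],
                'WindGustSpeed': [1007.7],
                'WindDir3pm': [3],
                'WindSpeed9am': [1.19],
                'WindSpeed3pm': [13.00],
                'WindDir9am': [13],
                'Humidity9am': [4.0],
                'Humidity3pm': [25.0],
                'Pressure9am': [1007.7],
                'Pressure3pm': [1007.1],
                'Temp9am': [16.9],
                'Temp3pm': [21.8],
                'RainToday': [1],
                'Day': [1],
                'Month': [12],
                'Year': [2008]}

# Reorder the columns to match the order used during training
df_pred = pd.DataFrame(data_Predict)[X1_train.columns]
df_pred

In [None]:
# Predicting the result using the best model.
new_pred = best_rf.predict(df_pred)
print(new_pred)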

In [1414]:
#Model 2 : Regression Model 

In [1415]:
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.feature_selection import SelectPercentile
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

In [1416]:
X2 = df1.drop(['Rainfall'],axis=1)
y2 = df1['Rainfall']

In [None]:
# df1 is the outlier-filtered weather data that X2 and y2 are built from
sns.pairplot(df1, hue="Rainfall", palette="Set1")

In [None]:
y2

In [None]:
X2

In [None]:
# Splitting the data
from sklearn.model_selection import train_test_split

X2_train, X2_test, Y2_train, Y2_test = train_test_split(X2, y2, test_size=0.2, random_state=42)

In [None]:
# Scaling the data
scaler = StandardScaler()
X2_train_scaled = scaler.fit_transform(X2_train)
X2_test_scaled = scaler.transform(X2_test)

In [None]:
#  training the data
lr = LinearRegression()
lr.fit(X2_train_scaled, Y2_train)

rfr = RandomForestRegressor()
rfr.fit(X2_train_scaled, Y2_train)

gr = GradientBoostingRegressor()
gr.fit(X2_train_scaled, Y2_train)

dt = DecisionTreeRegressor()
dt.fit(X2_train_scaled, Y2_train)

In [None]:
# Making predictions on the scaled test data
lr_predictions = lr.predict(X2_test_scaled)
rfr_predictions = rfr.predict(X2_test_scaled)
gr_predictions = gr.predict(X2_test_scaled)
dt_predictions = dt.predict(X2_test_scaled)

# Evaluating the models
lr_mse2 = mean_squared_error(Y2_test, lr_predictions)
rfr_mse2 = mean_squared_error(Y2_test, rfr_predictions)
gr_mse2 = mean_squared_error(Y2_test, gr_predictions)
dt_mse2 = mean_squared_error(Y2_test, dt_predictions)

lr_mae2 = mean_absolute_error(Y2_test, lr_predictions)
rfr_mae2 = mean_absolute_error(Y2_test, rfr_predictions)
gr_mae2 = mean_absolute_error(Y2_test, gr_predictions)
dt_mae2 = mean_absolute_error(Y2_test, dt_predictions)

lr_r2_2 = r2_score(Y2_test, lr_predictions)
rfr_r2_2 = r2_score(Y2_test, rfr_predictions)
gr_r2_2 = r2_score(Y2_test, gr_predictions)
dt_r2_2 = r2_score(Y2_test, dt_predictions)
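
For reference, the three metrics used above can be computed by hand from their definitions: MSE is the mean of the squared errors, MAE the mean of the absolute errors, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch with made-up numbers:

```python
import numpy as np

# Toy actual and predicted rainfall values (illustrative only)
y_true = np.array([0.0, 1.0, 2.0, 5.0])
y_pred = np.array([0.5, 1.0, 1.0, 4.0])

errors = y_true - y_pred

# Mean squared error and mean absolute error
mse = np.mean(errors ** 2)
mae = np.mean(np.abs(errors))

# R2 = 1 - SS_res / SS_tot
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MSE:", mse)
print("MAE:", mae)
print("R2:", r2)
```

Lower MSE and MAE are better, while R2 is better the closer it is to 1.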

In [None]:
# Print Mean Squared Error for each model
print("Linear Regression MSE: ", lr_mse2)
print("Random Forest Regressor MSE: ", rfr_mse2)
print("Gradient Boosting Regressor MSE: ", gr_mse2)
print("Decision Tree Regressor MSE: ", dt_mse2)

# Print Mean Absolute Error for each model
print("\nLinear Regression MAE: ", lr_mae2)
print("Random Forest Regressor MAE: ", rfr_mae2)
print("Gradient Boosting Regressor MAE: ", gr_mae2)
print("Decision Tree Regressor MAE: ", dt_mae2)

# Print R2 Score for each model
print("\nLinear Regression R2 Score: ", lr_r2_2)
print("Random Forest Regressor R2 Score: ", rfr_r2_2)
print("Gradient Boosting Regressor R2 Score: ", gr_r2_2)
print("Decision Tree Regressor R2 Score: ", dt_r2_2)

In [None]:
# Create a table to compare the scores
data = {
    'Model': ['Linear Regression', 'Random Forest', 'Gradient Boosting', 'Decision Tree'],
    'Mean Squared Error (MSE)': [lr_mse2, rfr_mse2, gr_mse2, dt_mse2],
    'Mean Absolute Error (MAE)': [lr_mae2, rfr_mae2, gr_mae2, dt_mae2],
    'R^2 Score': [lr_r2_2, rfr_r2_2, gr_r2_2, dt_r2_2]}

# Named results_df so the weather data in df is not overwritten
results_df = pd.DataFrame(data)

# Print the DataFrame
print(results_df)

In [None]:
# Prediction on the scaled test data (the models were trained on scaled features).
y_pred1 = lr.predict(X2_test_scaled)
y_pred2 = rfr.predict(X2_test_scaled)
y_pred3 = gr.predict(X2_test_scaled)
y_pred4 = dt.predict(X2_test_scaled)
df3 = pd.DataFrame({'Actual': Y2_test, 'Lr': y_pred1, 'rfr': y_pred2, 'gr': y_pred3, 'dt': y_pred4})

In [None]:
df3

In [None]:
# Use the first 10 test rows, re-indexed 0-9 so the x-axis is ordered
sample = df3.iloc[0:10].reset_index(drop=True)
plt.figure(figsize=(12, 8))

# For rfr comparison
plt.subplot(2, 2, 1)
plt.plot(sample['Actual'], label='Actual')
plt.plot(sample['rfr'], label="rfr")
plt.xlabel('Data Point')
plt.ylabel('Value')
plt.title('Comparison - Actual vs rfr')
plt.legend()

# For Lr comparison
plt.subplot(2, 2, 2)
plt.plot(sample['Actual'], label='Actual')
plt.plot(sample['Lr'], label="Lr")
plt.xlabel('Data Point')
plt.ylabel('Value')
plt.title('Comparison - Actual vs Lr')
plt.legend()

# For gr comparison
plt.subplot(2, 2, 3)
plt.plot(sample['Actual'], label='Actual')
plt.plot(sample['gr'], label="gr")
plt.xlabel('Data Point')
plt.ylabel('Value')
plt.title('Comparison - Actual vs gr')
plt.legend()

# For dt comparison
plt.subplot(2, 2, 4)
plt.plot(sample['Actual'], label='Actual')
plt.plot(sample['dt'], label="dt")
plt.xlabel('Data Point')
plt.ylabel('Value')
plt.title('Comparison - Actual vs dt')
plt.legend()

plt.tight_layout()
plt.show()

In [None]:
# Calculating mean squared error (MSE) for each model
mse_lr = mean_squared_error(df3['Actual'], df3['Lr'])
mse_rfr = mean_squared_error(df3['Actual'], df3['rfr'])
mse_gr = mean_squared_error(df3['Actual'], df3['gr'])
mse_dt = mean_squared_error(df3['Actual'], df3['dt'])

# dictionary that stores the MSE values
mse_scores = {'Linear Regression': mse_lr,
    'Random Forest': mse_rfr,
    'Gradient Boosting': mse_gr,
    'Decision Tree': mse_dt}

# To find the best model 
best_model = min(mse_scores, key=mse_scores.get)

print("The best model is:", best_model)

In [None]:
#Define the hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 4, 6]
}

# Perform grid search with cross-validation
grid_search = GridSearchCV(estimator=RandomForestRegressor(), param_grid=param_grid, scoring='neg_mean_squared_error', cv=5)
grid_search.fit(X2_train, Y2_train)

# Get the best hyperparameters and model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

print("Best Hyperparameters:", best_params)
print()

# Train the best model on the entire training set
best_model.fit(X2_train, Y2_train)

# Evaluate the best model on the testing set
best_predictions = best_model.predict(X2_test)
best_mse = mean_squared_error(Y2_test, best_predictions)
best_mae = mean_absolute_error(Y2_test, best_predictions)
best_r2 = r2_score(Y2_test, best_predictions)

print("Best Model Evaluation:")
print("MSE:", best_mse)
print("MAE:", best_mae)
print("R-squared:", best_r2)

In [None]:
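# Feature values for a single observation for the regression model. Rainfall is the
# target here, so it is left out. The Day, Month and Year values are illustrative
# placeholders.
data_Predict = {'Location': [3],
                'MinTemp': [13.4],
                'MaxTemp': [22.9],
                'WindGustDir': [22.0],
                'WindGustSpeed': [1007.7],
                'WindDir3pm': [3],
                'WindSpeed9am': [1.19],
                'WindSpeed3pm': [13.00],
                'WindDir9am': [13],
                'Humidity9am': [4.0],
                'Humidity3pm': [25.0],
                'Pressure9am': [1007.7],
                'Pressure3pm': [1007.1],
                'Temp9am': [16.9],
                'Temp3pm': [21.8],
                'RainToday': [1],
                'RainTomorrow': [1],
                'Day': [1],
                'Month': [12],
                'Year': [2008]}

# Reorder the columns to match the order used during training
df_pred = pd.DataFrame(data_Predict)[X2_train.columns]
df_pred

In [None]:
# Predicting the rainfall using the tuned regression model.
new_pred = best_model.predict(df_pred)
print(new_pred)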

In [None]:
from joblib import dump, load

# saving the model. 
dump(best_rf, 'best_rf.joblib') 

# Loading the model
best_rf_from_joblib = load('best_rf.joblib') 

# using the loaded model to make predictions
best_rf_from_joblib.predict(X1_test)