# Rainfall Weather Forecasting

## Project Description

Weather forecasting is the application of science and technology to predict the conditions of the atmosphere for a given location and time. Weather forecasts are made by collecting quantitative data about the current state of the atmosphere at a given place and using meteorology to project how the atmosphere will change.
The Rain dataset is used to predict whether or not it will rain tomorrow. It contains about 10 years of daily weather observations from different locations in Australia. Two things are predicted here:
 
## 1. Problem Statement: 

a) Design a predictive model with the use of machine learning algorithms to forecast whether or not it will rain tomorrow.

b)  Design a predictive model with the use of machine learning algorithms to predict how much rainfall could be there.


## Dataset Description:

Number of columns: 23

Date - The date of observation

Location - The common name of the location of the weather station

MinTemp - The minimum temperature in degrees Celsius

MaxTemp - The maximum temperature in degrees Celsius

Rainfall - The amount of rainfall recorded for the day in mm

Evaporation - The so-called Class A pan evaporation (mm) in the 24 hours to 9am

Sunshine - The number of hours of bright sunshine in the day

WindGustDir - The direction of the strongest wind gust in the 24 hours to midnight

WindGustSpeed - The speed (km/h) of the strongest wind gust in the 24 hours to midnight

WindDir9am - Direction of the wind at 9am

WindDir3pm - Direction of the wind at 3pm

WindSpeed9am - Wind speed (km/h) averaged over 10 minutes prior to 9am

WindSpeed3pm - Wind speed (km/h) averaged over 10 minutes prior to 3pm

Humidity9am - Humidity (percent) at 9am

Humidity3pm - Humidity (percent) at 3pm

Pressure9am - Atmospheric pressure (hPa) reduced to mean sea level at 9am

Pressure3pm - Atmospheric pressure (hPa) reduced to mean sea level at 3pm

Cloud9am - Fraction of sky obscured by cloud at 9am

Cloud3pm - Fraction of sky obscured by cloud at 3pm

Temp9am - Temperature (degrees C) at 9am

Temp3pm - Temperature (degrees C) at 3pm

RainToday - Boolean: 1 if precipitation (mm) in the 24 hours to 9am exceeds 1mm, otherwise 0

RainTomorrow - Whether it rains the next day (the classification target). In the source data this response variable is derived from the amount of next-day rain in mm, a kind of measure of the "risk".


## Dataset Link - 

•	https://raw.githubusercontent.com/dsrscientist/dataset3/main/weatherAUS.csv

•	https://github.com/dsrscientist/dataset3


# Importing the Dataset:

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv("weatherAUS.csv") #reading the data file
df.head() # displaying the first 5 rows of the dataset

In [None]:
df.shape # (number of rows, number of columns)

In [None]:
df.columns #to see the columns names

In [None]:
df.head() # to see the first 5 rows of the dataset

# EDA

# Checking for NULL values if any in the data frame

Missing values can appear as np.nan, None, NaN and similar markers.

In [None]:
df.isnull().sum()

Observations:-

- There is a significant number of null values in the dataset; Evaporation and Sunshine have the most.
- Cloud9am and Cloud3pm also have a significant number of nulls.
- Pressure9am and Pressure3pm also have a large number of null values.
- The only 2 columns without nulls are Date and Location.
- The next step is to understand each column to decide how to treat it; domain knowledge is needed.
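
Raw null counts are easier to compare as percentages of the rows. The following is a minimal sketch on a made-up toy frame (the values and the `null_summary` name are illustrative assumptions, not taken from the weather data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the weather data (values are made up).
toy = pd.DataFrame({
    "Sunshine": [7.1, np.nan, np.nan, 9.8],
    "Evaporation": [np.nan, 4.2, 5.0, 6.1],
    "Location": ["Albury", "Albury", "Perth", "Perth"],
})

# Count and percentage of missing values per column, largest first.
null_summary = pd.DataFrame({
    "null_count": toy.isnull().sum(),
    "null_pct": toy.isnull().mean() * 100,
}).sort_values("null_pct", ascending=False)

print(null_summary)
```

Applying the same two expressions to `df` would rank the real columns by how much of their data is missing.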


In [None]:
# df.info() is another way to check: it reports the non-null count per column

df.info()

The second check confirms the same null counts, and they need to be handled before we can proceed further

In [None]:
df['Sunshine'].unique() # inspecting the unique values of a column that contains nulls

In [None]:
df['Evaporation'].unique() # inspecting the unique values of a column that contains nulls

In [None]:
df['Cloud9am'].unique() # inspecting the unique values of a column that contains nulls

In [None]:
df['Cloud3pm'].unique() # inspecting the unique values of a column that contains nulls

In [None]:
df['Pressure9am'].unique() # inspecting the unique values of a column that contains nulls

In [None]:
df['Pressure3pm'].unique() # inspecting the unique values of a column that contains nulls

In [None]:
df['WindGustDir'].unique() # inspecting the unique values of a column that contains nulls

In [None]:
df['WindGustSpeed'].unique() # inspecting the unique values of a column that contains nulls

In [None]:
df['WindDir9am'].unique() # inspecting the unique values of a column that contains nulls

In [None]:
df['Rainfall'].unique() # inspecting the unique values of a column that contains nulls

In [None]:
df['WindDir3pm'].unique() # inspecting the unique values of a column that contains nulls

In [None]:
df['RainToday'].unique() # inspecting the unique values of a column that contains nulls

## Splitting the numeric features as well as the categorical features

In [None]:
# We are defining numerical & categorical columns
numeric_features = [feature for feature in df.columns if df[feature].dtype != 'O']
categorical_features = [feature for feature in df.columns if df[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))

In [None]:
df_visualization_continuous=df[['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am', 'Temp3pm']].copy()
# keeping the numeric columns in their own variable, since most of the nulls are in them

# Iterative Imputer- Imputing the nulls 

We cannot use the mean or mode, or simply remove the nulls, because there are so many of them. Replacing them with 0 would produce fake readings for locations that genuinely have that property. The best option is to replace them with the closest plausible value using an imputation technique.
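
To illustrate the idea on a small scale, here is a hedged sketch of scikit-learn's `IterativeImputer` on a made-up two-column frame in which `Temp3pm` is always 4 degrees above `Temp9am`. The values, the `filled` name and the `random_state` are assumptions chosen for the illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Made-up data: Temp3pm = Temp9am + 4, with two readings missing.
toy = pd.DataFrame({
    "Temp9am": [12.0, 15.0, 18.0, 21.0, 24.0, 27.0],
    "Temp3pm": [16.0, 19.0, np.nan, 25.0, 28.0, np.nan],
})

# Each column with gaps is modelled from the other columns and the gaps are predicted.
imputer = IterativeImputer(random_state=0)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)

print(filled)
```

Because the imputer learns from the relationship between columns, the filled values follow the trend in the data rather than collapsing to a single mean.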

In [None]:
# Before using IterativeImputer, it must be enabled with the import below
from sklearn.experimental import enable_iterative_imputer

# Import IterativeImputer

from sklearn.impute import IterativeImputer

In [None]:
df=df.dropna(subset=['Rainfall']) 

# The target variable Rainfall has nulls; we do not want to synthesize target values, so those rows are removed

In [None]:
import warnings
warnings.filterwarnings('ignore')

iter_impute = IterativeImputer()

ite_imp = pd.DataFrame(iter_impute.fit_transform(df[['Rainfall','MinTemp', 'MaxTemp', 'Evaporation', 'Sunshine', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am', 'Temp3pm']]), columns = ['Rainfall','MinTemp', 'MaxTemp', 'Evaporation', 'Sunshine', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am', 'Temp3pm'])

ite_imp

In [None]:
ite_imp.isnull().sum() # checking to see that all the columns have been imputed

In [None]:
df.shape # next we drop the old numeric columns from the dataset and join the imputed ones

In [None]:
df.drop(columns=['Rainfall','MinTemp', 'MaxTemp', 'Evaporation', 'Sunshine', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am', 'Temp3pm'],inplace=True)

In [None]:
df = pd.concat([df, ite_imp], axis=1)
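
One point worth checking at this step: `pd.concat(..., axis=1)` aligns rows by index label, and a frame built from an imputer's array gets a fresh 0..n-1 index, while a frame that went through `dropna` keeps gaps in its index. The sketch below uses made-up data and hypothetical names (`raw`, `kept`, `misaligned`, `aligned`) to show the effect and one possible remedy, passing the original index when rebuilding the frame.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "Location": ["A", "B", "C", "D"],
    "Rainfall": [0.0, np.nan, 2.0, 5.0],
})
kept = raw.dropna(subset=["Rainfall"])        # index is now [0, 2, 3]

# Stand-in for the imputer output: a plain array with no index information.
values = kept[["Rainfall"]].to_numpy()

# A fresh RangeIndex (0, 1, 2) no longer matches kept's labels (0, 2, 3).
misaligned = pd.concat(
    [kept[["Location"]], pd.DataFrame(values, columns=["Rainfall"])],
    axis=1,
)

# Reusing the original index keeps every value on its own row.
aligned = pd.concat(
    [kept[["Location"]], pd.DataFrame(values, columns=["Rainfall"], index=kept.index)],
    axis=1,
)

print(misaligned)
print(aligned)
```

In the misaligned result, extra rows appear with NaN and some values sit next to the wrong location, so comparing `df.shape` before and after the join is a quick sanity check.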

In [None]:
df # we have joined the dataset 

In [None]:
df.isnull().sum()

## Treating the String or categorical variables with mode of the column

In [None]:
df['WindGustDir'] = df['WindGustDir'].fillna(df['WindGustDir'].mode()[0])

In [None]:
df['WindDir9am'] = df['WindDir9am'].fillna(df['WindDir9am'].mode()[0])

In [None]:
df['WindDir3pm'] = df['WindDir3pm'].fillna(df['WindDir3pm'].mode()[0])

In [None]:
df['RainToday'] = df['RainToday'].fillna(df['RainToday'].mode()[0])

In [None]:
df.isnull().sum()

In [None]:
df = df.dropna() # the 2nd target variable (RainTomorrow) still has some nulls, which we should not impute, so we remove those rows

In [None]:
df.isnull().sum()

Finally we see that we have got rid of the null values in the dataset

# Checking for Duplicate Values

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
df.shape

There were many duplicate rows, so we removed them; next we check the data loss

In [None]:
Data_loss = ((8425-6631)/8425)*100

In [None]:
Data_loss

After removing the null values and the duplicates we end up with 6631 rows, a data loss of approximately 21.29%. This is high, but we cannot proceed with model building while nulls and duplicates remain, so they had to be treated
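
The data-loss figure can also be wrapped in a small helper so it does not depend on hand-typed expressions. This is a minimal sketch; the function name `data_loss_pct` is hypothetical, and the two row counts are the ones quoted above.

```python
def data_loss_pct(rows_before: int, rows_after: int) -> float:
    """Percentage of rows lost between two stages of cleaning."""
    return (rows_before - rows_after) / rows_before * 100


# Row counts quoted in this notebook: 8425 before cleaning, 6631 after.
loss = data_loss_pct(8425, 6631)
print(f"{loss:.2f}% of rows removed")
```

In the notebook itself, the first argument could be captured from `df.shape[0]` right after loading, and the second from `df.shape[0]` after cleaning.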

# Check the datatypes of the columns 

In [None]:
df.dtypes 

Observations after viewing the dataset:-

- We have 16 numeric columns in the dataset and 7 object features, which we need to treat by encoding

- There are two targets or labels: Rainfall (numeric) and RainTomorrow (categorical), which needs to be encoded

- 'WindGustSpeed', 'WindSpeed9am' and 'WindSpeed3pm' are all related; likewise 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am' and 'Temp3pm' describe the humidity and moisture conditions that accompany rainfall
 

In [None]:
df.describe()

Observations:-

- The label Rainfall has many outliers and an extreme range; its mean is lower than its standard deviation
- Evaporation also has outliers and abnormal readings
- The wind speed columns are heavily skewed by their extreme range, with a minimum of 0
- The temperature columns show a similarly wide range, from about 1 to 39, as seen in Temp3pm
- The humidity columns show the same pattern

In [None]:
# Checking the number of unique values in each column

for col in df:
    print(col, df[col].nunique(), '\n')

# Observations on individual columns are noted in the cells above. Overall there is a huge variation in the type of data, but the contents of each column are straightforward to interpret
    

In [None]:
# Checking whether any value in the target Rainfall is a whitespace string

df.loc[df['Rainfall'] == " "]

In [None]:
# Checking whether any value in the target RainTomorrow is a whitespace string

df.loc[df['RainTomorrow'] == " "]

## As the targets contain no whitespace values, we can move ahead

# EDA

# Visualization of the Data

In [None]:
import matplotlib.pyplot as plt
df.hist(figsize=(20,20))
plt.show()

To understand the data properly we need to review each feature individually; this plot only shows the overall trend, and the histograms cover the numeric features only

# Splitting the columns into categorical and numerical data

In [None]:
# We are defining numerical & categorical columns
numeric_features = [feature for feature in df.columns if df[feature].dtype != 'O']
categorical_features = [feature for feature in df.columns if df[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))

In [None]:
df_visualization_continuous=df[['Rainfall', 'MinTemp', 'MaxTemp', 'Evaporation', 'Sunshine', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am', 'Temp3pm']].copy()

In [None]:
df_visualization_nominal=df[['Date', 'Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday', 'RainTomorrow']]

# Visualization of the distribution of the continuous value of the float and int columns.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt


import warnings
warnings.filterwarnings('ignore')

In [None]:
#Lets see how the data is distributed for every column

plt.figure(figsize =(20,25), facecolor = 'white')
plotnumber = 1

for column in df_visualization_continuous:
    if plotnumber <=16:
        ax = plt.subplot(6,3,plotnumber)
        sns.distplot(df_visualization_continuous[column])
        plt.xlabel(column,fontsize = 12)
        
    plotnumber +=1
plt.tight_layout()

# Observations :-

   - The columns Evaporation, WindSpeed9am, WindGustSpeed and Humidity9am show some skewness and need to be treated

   - Overall the dataset looks good, with most of the other features close to normally distributed

## Treating the 1st target variable, RainTomorrow, where Yes is encoded as 1 and No as 0 (classification problem)

In [None]:
df['RainTomorrow'].unique() # checking which values the target column contains

In [None]:
df['RainTomorrow'] = df['RainTomorrow'].map({'No': 0, 'Yes': 1}) # explicit mapping: No -> 0, Yes -> 1

In [None]:
df

## Visualizing the Target Variable 

In [None]:
ax = sns.countplot(x='RainTomorrow',data = df_visualization_nominal)
print(df_visualization_nominal['RainTomorrow'].value_counts())

In [None]:
ax = sns.countplot(x='RainTomorrow',data = df)
print(df['RainTomorrow'].value_counts())

In [None]:
import plotly.graph_objs as go
import plotly.offline as py

# value_counts() is ordered by frequency, so the majority class (0 = No) comes first
trace = go.Pie(labels = ['No: it is not going to rain', 'Yes: it is going to rain'], values = df['RainTomorrow'].value_counts(),
               textfont=dict(size=15),
               marker=dict(colors=['#B9C0C9','yellow'],
               line=dict(color='#000000', width=1.5)))
layout = dict(title =  'Distribution of RainTomorrow variable')
fig = dict(data = [trace], layout=layout)
py.iplot(fig)

Observations :-

We see a large imbalance in the label column: days with no rain the next day make up 76.2% and days with rain 23.8%, so we need to balance the classes, otherwise the model will be biased
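
To make the effect of this imbalance concrete, the sketch below uses made-up labels with roughly the same 76/24 split. It computes the accuracy of a model that always predicts the majority class, plus inverse-frequency class weights (the same heuristic scikit-learn uses for `class_weight='balanced'`). The variable names are illustrative assumptions.

```python
import pandas as pd

# Made-up labels with roughly the 76/24 split reported above.
labels = pd.Series([0] * 76 + [1] * 24, name="RainTomorrow")

# Share of each class in the label column.
class_share = labels.value_counts(normalize=True)

# A model that always predicts the majority class reaches this accuracy.
baseline_accuracy = class_share.max()

# Inverse-frequency weights: n_samples / (n_classes * class_count).
class_weights = len(labels) / (labels.nunique() * labels.value_counts())

print(class_share)
print("Majority-class baseline accuracy:", baseline_accuracy)
print(class_weights)
```

Any classifier trained on this data should beat the majority-class baseline to be useful, which is why accuracy alone can be misleading here.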

## Let's graph the columns individually so we can make clear findings

In [None]:
sns.distplot(df_visualization_continuous['Rainfall'],kde=True,)

In [None]:
sns.histplot(x=df_visualization_continuous['Rainfall'], ec = "gold", color='g', kde=True)

The data is right skewed with many outliers; the range is roughly 0 to 350. It needs treatment to get a better look at the data, but since Rainfall is itself a target variable, we will only treat it when modelling the 1st label

In [None]:
import matplotlib.pyplot as plt
plt.subplots(figsize=(12,4))
sns.countplot(x="Rainfall", hue="RainTomorrow", data=df, palette="colorblind")

In [None]:
sns.distplot(df['MinTemp'],kde=True,)

In [None]:
sns.histplot(x=df.MinTemp, ec = "black", color='g', kde=True)

MinTemp is close to normally distributed, with most values between roughly 10 and 20

In [None]:
sns.distplot(df['MaxTemp'],kde=True,)

In [None]:
sns.histplot(x=df['MaxTemp'], ec = "black", color='g', kde=True)

Like MinTemp, MaxTemp has a good distribution; the majority of the data falls between roughly 17.5 and 26

In [None]:
sns.distplot(df['Evaporation'],kde=True,)

In [None]:
sns.histplot(x=df['Evaporation'], ec = "black", color='g', kde=True)

The data is right skewed, with a range from 0 to about 140; we need to treat the outliers or they will affect the model

In [None]:
sns.distplot(df['Sunshine'],kde=True,)

In [None]:
sns.histplot(x=df['Sunshine'], ec = "black", color='g', kde=True)

Sunshine is fairly normally distributed; the peak is around 10 to 12

In [None]:
sns.distplot(df['WindGustSpeed'],kde=True,)

In [None]:
sns.histplot(x=df['WindGustSpeed'], ec = "black", color='g', kde=True)

The data is slightly skewed to the right but otherwise evenly distributed; the peak is around 25 to 26

In [None]:
sns.distplot(df['WindSpeed9am'],kde=True,)

In [None]:
sns.histplot(x=df['WindSpeed9am'], ec = "black", color='g', kde=True)

Here the data is right skewed with relatively few distinct values; the most frequent value is 0

In [None]:
sns.distplot(df['WindSpeed3pm'],kde=True,)

In [None]:
sns.histplot(x=df['WindSpeed3pm'], ec = "black", color='g', kde=True)

The data is again somewhat skewed but has fewer outliers and is otherwise close to normal; the peak is around 17 to 18

In [None]:
sns.histplot(x=df['Humidity9am'], ec = "black", color='g', kde=True)

There is some left skewness in the data, which also needs treatment in order to reduce variance; most values fall between 60 and 80

In [None]:
sns.distplot(df['Humidity3pm'],kde=True,)

In [None]:
sns.histplot(x=df['Humidity3pm'], ec = "black", color='g', kde=True)

This column is a very good example of normally distributed data; the peak lies between 40 and 60

In [None]:
sns.distplot(df['Pressure9am'],kde=True,)

In [None]:
sns.histplot(x=df['Pressure9am'], ec = "black", color='g', kde=True)

The histogram has a symmetric, tree-like shape, which indicates a well-behaved distribution; most values lie between 1010 and 1030

In [None]:
sns.distplot(df['Pressure3pm'],kde=True,)

In [None]:
sns.histplot(x=df['Pressure3pm'], ec = "black", color='g', kde=True)

The data in this column is slightly left skewed; apart from that, the majority of the data falls between 1010 and 1020

In [None]:
sns.distplot(df['Cloud9am'],kde=True,)

In [None]:
sns.histplot(x=df['Cloud9am'], ec = "black", color='g', kde=True)

This column behaves like a categorical feature; the most frequent values are 7 and 1

In [None]:
sns.distplot(df['Cloud3pm'],kde=True,)

In [None]:
sns.histplot(x=df['Cloud3pm'], ec = "black", color='g', kde=True)

Here too we see a strong relationship with the previous column: 1 and 7 are again the most frequent values, so the two columns are nearly identical

In [None]:
sns.distplot(df['Temp9am'],kde=True,)

In [None]:
sns.histplot(x=df['Temp9am'], ec = "black", color='g', kde=True)

The data is normally distributed; most values fall between 15 and 25

In [None]:
sns.distplot(df['Temp3pm'],kde=True,)

In [None]:
sns.histplot(x=df['Temp3pm'], ec = "black", color='g', kde=True)

As with the previous column, the peak here is roughly 20 to 27

### MULTIVARIATE ANALYSIS -WITH PAIRPLOT

In [None]:
sns.pairplot(df)

## The pairplot shows many patterns, but the relationships are hard to pinpoint, so we will need separate relationship plots to find the actual relationships between the columns

# Visualization of the categorical features

Date should be treated as an index, so we will not visualize it: nearly all of its values are unique, and it will not be useful until we transform it to study its relationship with the other features and the labels

In [None]:
#Lets see the representation individually now with each column 

ax = sns.countplot(x='Location',data = df,)
print(df['Location'].value_counts())

We see the different locations in Australia where the recordings were made; the highest count is at Perth Airport

In [None]:
import matplotlib.pyplot as plt
plt.subplots(figsize=(12,4))
sns.countplot(x="Location", hue="RainTomorrow", data=df, palette="colorblind")

In [None]:
yes_group = df[df["RainTomorrow"]== 0]
no_group = df[df["RainTomorrow"]!= 0]

fig=plt.figure(figsize=(9,9))
plt.style.use('seaborn-colorblind')
fig.add_subplot(2,2,1)
yes_group["Location"].value_counts().plot(kind="pie",  subplots=True,autopct='%1.1f%%', startangle=180)

plt.title(' Distribution of NO RAIN TOMORROW('+str(len(yes_group))+')');

fig.add_subplot(2,2,2)
no_group["Location"].value_counts().plot(kind="pie",  subplots=True,autopct='%1.1f%%', startangle=180)

plt.title(' Distribution of YES RAIN TOMORROW ('+str(len(no_group))+')');

Perth Airport has the most instances of rain compared to the rest, with Melbourne coming in close behind

In [None]:
#Lets see the representation individually now with each column 

ax = sns.countplot(x='WindGustDir',data = df)
print(df['WindGustDir'].value_counts())

The most frequent wind gust direction is North (N), with a count almost equal to half of all the other directions combined

In [None]:
import matplotlib.pyplot as plt
plt.subplots(figsize=(12,4))
sns.countplot(x="WindGustDir", hue="RainTomorrow", data=df, palette="colorblind")

As the countplot shows, the N wind direction is associated with the most rain by a huge margin

In [None]:
yes_group = df[df["RainTomorrow"]== 0]
no_group = df[df["RainTomorrow"]!= 0]

fig=plt.figure(figsize=(9,9))
plt.style.use('seaborn-colorblind')
fig.add_subplot(2,2,1)
yes_group["WindGustDir"].value_counts().plot(kind="pie",  subplots=True,autopct='%1.1f%%', startangle=180)

plt.title(' Distribution of NO RAIN TOMORROW('+str(len(yes_group))+')');

fig.add_subplot(2,2,2)
no_group["WindGustDir"].value_counts().plot(kind="pie",  subplots=True,autopct='%1.1f%%', startangle=180)

plt.title(' Distribution of YES RAIN TOMORROW ('+str(len(no_group))+')');

In [None]:
#Lets see the representation individually now with each column 

ax = sns.countplot(x='WindDir9am',data = df)
print(df['WindDir9am'].value_counts())

This feature behaves like the previous one: N (North) is the most frequent by a wide margin

In [None]:
import matplotlib.pyplot as plt
plt.subplots(figsize=(12,4))
sns.countplot(x="WindDir9am", hue="RainTomorrow", data=df, palette="colorblind")

Again the N (North) category has both the highest count in the column and the most instances of rain

In [None]:
#Lets see the representation individually now with each column 

ax = sns.countplot(x='WindDir3pm',data = df_visualization_nominal)
print(df_visualization_nominal['WindDir3pm'].value_counts())

In [None]:
yes_group = df[df["RainTomorrow"]== 0]
no_group = df[df["RainTomorrow"]!= 0]

fig=plt.figure(figsize=(9,9))
plt.style.use('seaborn-colorblind')
fig.add_subplot(2,2,1)
yes_group["WindDir9am"].value_counts().plot(kind="pie",  subplots=True,autopct='%1.1f%%', startangle=180)

plt.title(' Distribution of NO RAIN TOMORROW('+str(len(yes_group))+')');

fig.add_subplot(2,2,2)
no_group["WindDir9am"].value_counts().plot(kind="pie",  subplots=True,autopct='%1.1f%%', startangle=180)

plt.title(' Distribution of YES RAIN TOMORROW ('+str(len(no_group))+')');

Here we see a slight change: the most frequent direction is South East (SE)

In [None]:
import matplotlib.pyplot as plt
plt.subplots(figsize=(12,4))
sns.countplot(x="WindDir3pm", hue="RainTomorrow", data=df, palette="colorblind")

SE, being the most frequent direction, also has more instances of rain than the rest in this analysis

In [None]:
#Lets see the representation individually now with each column 

ax = sns.countplot(x='RainToday',data = df_visualization_nominal)
print(df_visualization_nominal['RainToday'].value_counts())

The results closely mirror the label: the number of days without rain is much higher than the number of days with rain

In [None]:
yes_group = df[df["RainTomorrow"]== 0]
no_group = df[df["RainTomorrow"]!= 0]

fig=plt.figure(figsize=(9,9))
plt.style.use('seaborn-colorblind')
fig.add_subplot(2,2,1)
yes_group["WindDir3pm"].value_counts().plot(kind="pie",  subplots=True,autopct='%1.1f%%', startangle=180)

plt.title(' Distribution of NO RAIN TOMORROW('+str(len(yes_group))+')');

fig.add_subplot(2,2,2)
no_group["WindDir3pm"].value_counts().plot(kind="pie",  subplots=True,autopct='%1.1f%%', startangle=180)

plt.title(' Distribution of YES RAIN TOMORROW ('+str(len(no_group))+')');

In [None]:
import matplotlib.pyplot as plt
plt.subplots(figsize=(12,4))
sns.countplot(x="RainToday", hue="RainTomorrow", data=df, palette="colorblind")

The 'Yes' counts are strikingly similar to the label, so this column may have the highest correlation with the label

# Encoding the categorical Features to numerical features

### Treating the Date column and extracting the month and year

In [None]:
df["Date_month"] = pd.to_datetime(df["Date"], format = "%Y/%m/%d").dt.month # extracting the month

In [None]:
yes_group = df[df["RainTomorrow"]== 0]
no_group = df[df["RainTomorrow"]!= 0]

fig=plt.figure(figsize=(9,9))
plt.style.use('seaborn-colorblind')
fig.add_subplot(2,2,1)
yes_group["RainToday"].value_counts().plot(kind="pie",  subplots=True,autopct='%1.1f%%', startangle=180)

plt.title(' Distribution of NO RAIN TOMORROW('+str(len(yes_group))+')');

fig.add_subplot(2,2,2)
no_group["RainToday"].value_counts().plot(kind="pie",  subplots=True,autopct='%1.1f%%', startangle=180)

plt.title(' Distribution of YES RAIN TOMORROW ('+str(len(no_group))+')');

In [None]:
df["Date_year"] = pd.to_datetime(df["Date"], format = "%Y/%m/%d").dt.year # extracting the year

In [None]:
df.shape

In [None]:
df.drop(columns = ['Date'],inplace=True)

In [None]:
df.dtypes

In [None]:
from sklearn.preprocessing import LabelEncoder
enc=LabelEncoder()

In [None]:
for i in df.columns:
    if df[i].dtypes == "object":
        df[i] = enc.fit_transform(df[i])  # LabelEncoder expects a 1-D array, so no reshape is needed


In [None]:
df.dtypes

We have converted the categorical data to numerical data, so we can move ahead to the next step
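
When a single encoder is refitted for every column, only the last column's classes remain available afterwards. As a hedged alternative sketch, the snippet below encodes made-up columns with `pd.factorize(sort=True)`, which yields alphabetical integer codes, and keeps a code-to-label mapping per column. The `mappings` dictionary and the toy values are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "WindGustDir": ["N", "SE", "N", "W"],
    "RainToday": ["No", "Yes", "No", "No"],
})

# Encode each column and remember how to translate codes back to labels.
mappings = {}
for col in ["WindGustDir", "RainToday"]:
    codes, uniques = pd.factorize(toy[col], sort=True)
    toy[col] = codes
    mappings[col] = dict(enumerate(uniques))

print(toy)
print(mappings)
```

Keeping the mappings makes it possible to label plots and interpret model output in terms of the original categories.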

# Visualizing the relationship between the features and the 1st  target variable - Raintomorrow

In [None]:
#Divide data into features and label

x = df.drop(columns = ['RainTomorrow'])
y = df['RainTomorrow']

In [None]:
x

In [None]:
y

# Scatter plot

In [None]:
#Lets see how the data is distributed for every column as a whole

#Visualizing Relationship

plt.figure(figsize =(25,30), facecolor = 'yellow')
plotnumber = 1

for column in x:
    if plotnumber <=23:
        ax = plt.subplot(8,3,plotnumber)
        plt.scatter(x[column],y)
        plt.xlabel(column,fontsize = 20)
        plt.ylabel('Raintomorrow',fontsize = 10)
    plotnumber +=1
plt.tight_layout()


In [None]:
#Lets see how the data is distributed for every column as a whole

#Visualizing Relationship

plt.figure(figsize =(25,30), facecolor = 'white')
plotnumber = 1

for column in x:
    if plotnumber <=23:
        ax = plt.subplot(8,3,plotnumber)
        sns.lineplot(x=x[column], y=y)  # keyword arguments are required by recent seaborn versions
        plt.xlabel(column,fontsize = 20)
        plt.ylabel('Raintomorrow',fontsize = 10)
    plotnumber +=1
plt.tight_layout()

The scatter plot only shows a binary relationship because this is a classification problem, but the line plots are more informative: we see some positive relationships for Location, WindGustDir, WindDir9am, WindDir3pm and the date columns, while the other columns do not show an upward or downward trend

# Visualizing the relationship between the features and the 2nd  target variable - Rainfall

In [None]:
#Divide data into features and label

x1 = df.drop(columns = ['Rainfall'])
y1 = df['Rainfall']

In [None]:
#Lets see how the data is distributed for every column as a whole

#Visualizing Relationship

plt.figure(figsize =(25,30), facecolor = 'yellow')
plotnumber = 1

for column in x1:
    if plotnumber <=23:
        ax = plt.subplot(8,3,plotnumber)
        plt.scatter(x1[column],y1)
        plt.xlabel(column,fontsize = 20)
        plt.ylabel('Rainfall',fontsize = 10)
    plotnumber +=1
plt.tight_layout()


In [None]:
#Lets see how the data is distributed for every column as a whole

#Visualizing Relationship

plt.figure(figsize =(25,30), facecolor = 'white')
plotnumber = 1

for column in x1:
    if plotnumber <=23:
        ax = plt.subplot(8,3,plotnumber)
        sns.lineplot(x=x1[column], y=y1)  # keyword arguments are required by recent seaborn versions
        plt.xlabel(column,fontsize = 20)
        plt.ylabel('Rainfall',fontsize = 10)
    plotnumber +=1
plt.tight_layout()

# EDA

### Describing the Dataset

In [None]:
df.describe()

# Observations:-

- As noted earlier with df.describe(), there are outliers and extreme values in some of the columns which need to be treated

# Visualization of the Data Properties


In [None]:
#Lets see how the data is distributed for every column

import matplotlib.pyplot as plt
plt.figure(figsize=(25,10))
sns.heatmap(df.describe(),annot=True,linewidths=0.1,linecolor="black",fmt='0.2f')

In [None]:
df.skew()

We see skewness of more than 0.55 in columns such as Evaporation, WindGustDir and WindDir9am, which needs to be treated using the z-score. Rainfall is one of the labels, so we will only treat it with the z-score; manipulating it further would make the regression results unfair
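
As an illustration of z-score treatment, the sketch below generates a made-up right-skewed sample, drops the values lying more than 3 standard deviations from the mean, and compares the skewness before and after. The threshold of 3 and all names are assumptions for the example, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed sample standing in for a column such as Evaporation.
rng = np.random.default_rng(0)
sample = pd.Series(rng.exponential(scale=3.0, size=2000))

# z-score: distance from the mean in units of standard deviation.
z = (sample - sample.mean()) / sample.std()

# Keep only the observations within 3 standard deviations of the mean.
filtered = sample[z.abs() < 3]

print("rows removed:", len(sample) - len(filtered))
print("skew before:", sample.skew())
print("skew after: ", filtered.skew())
```

Removing the extreme tail lowers the skewness, at the cost of a small number of rows.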

# Correlation of the columns with the  target variables 

In [None]:
df.corr()['RainTomorrow'].sort_values()

In [None]:
df.corr()['Rainfall'].sort_values()

### The two targets correlate with the features in similar ways: Sunshine, all temperature columns, the date columns and the pressure columns have a negative relationship, while RainToday, humidity, cloud, evaporation and the wind columns all have a positive relationship with the targets


## Heatmap of Correlation of the columns within the Columns or Features and Target

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
#size of canvas
plt.figure(figsize=(35,20))
sns.heatmap(df.corr(),annot=True, linewidths=0.5,linecolor='black', fmt='.2f')

### Observations from the heatmap

- There is high multicollinearity between MaxTemp and Temp3pm, the highest at 98%; MinTemp and Temp9am come 2nd at 89%, and MaxTemp and Temp9am 3rd at 87%. Since the impact of these columns on the target is very small, we may remove them later after the PCA analysis
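
Reading pairs off a large heatmap is error-prone, so the highly correlated pairs can also be listed programmatically. This is a minimal sketch on made-up data; the 0.85 threshold and the `high_pairs` name are assumptions.

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.normal(20, 5, size=500)
toy = pd.DataFrame({
    "MaxTemp": base,
    "Temp3pm": base + rng.normal(0, 0.5, size=500),  # nearly a copy of MaxTemp
    "Humidity3pm": rng.normal(50, 10, size=500),     # independent of both
})

corr = toy.corr().abs()

# Visit each unordered pair of columns once and keep the strongly correlated ones.
high_pairs = [
    (a, b, round(corr.loc[a, b], 3))
    for a, b in combinations(corr.columns, 2)
    if corr.loc[a, b] > 0.85
]

print(high_pairs)
```

On the real data, the same loop over `df.corr().abs()` would list the temperature pairs noted above without reading them off the plot.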


In [None]:
# Plotting a barplot to see the relationship with the 1st label more clearly

df.drop('RainTomorrow', axis=1).corrwith(df.RainTomorrow).plot(kind='bar', grid=True,figsize=(32,15),
                                                  title='Correlation with target')

plt.show()

In [None]:
# Plotting a barplot to see the relationship with the 2nd label more clearly

df.drop('Rainfall', axis=1).corrwith(df.Rainfall).plot(kind='bar', grid=True,figsize=(32,15),
                                                  title='Correlation with target')

plt.show()

The values from the correlation table are shown here graphically.
- For RainTomorrow, the humidity and cloud columns are the most impactful
- For Rainfall, RainToday and RainTomorrow are the highest, with humidity and cloud trailing behind

# Using SelectKBest Feature Selection Method - Target - Raintomorrow

SelectKBest uses the f_classif function to find the best features; f_classif is based on the ANOVA F-test

In [None]:
# Again we Divide data into features and label

X = df.drop(columns = ['RainTomorrow'])
y = df['RainTomorrow']

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

In [None]:
best_features = SelectKBest(score_func = f_classif, k=23)

fit = best_features.fit(X,y)

df_scores = pd.DataFrame(fit.scores_)

df_columns = pd.DataFrame(X.columns)


#concatenate dataframes

feature_scores = pd.concat([df_columns, df_scores], axis = 1)

feature_scores.columns = ['Feature_name', 'Score']   #name output columns

print(feature_scores.nlargest(23,'Score'))  #Print Best features

### Humidity3pm is the best feature, with a score above roughly 1955, which is very high. Sunshine also scores highly, and the rest have a good influence on the RainTomorrow label. We are only performing this step to analyze the data further. Correlation and SelectKBest rank the features differently, so we will move on to more analysis

# Using SelectKBest Feature Selection Method - Target - Rainfall

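SelectKBest is used here with f_classif, which is based on the ANOVA F-test. Note that f_classif is designed for categorical targets; Rainfall is continuous, for which scikit-learn provides f_regression, so the scores in this section should be read with that in mind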

In [None]:
# Again we Divide data into features and label

X1 = df.drop(columns = ['Rainfall'])
y1 = df['Rainfall']

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

In [None]:
best_features = SelectKBest(score_func = f_classif, k=23)

fit = best_features.fit(X1,y1)

df_scores = pd.DataFrame(fit.scores_)

df_columns = pd.DataFrame(X1.columns)


#concatenate dataframes

feature_scores = pd.concat([df_columns, df_scores], axis = 1)

feature_scores.columns = ['Feature_name', 'Score']   #name output columns

print(feature_scores.nlargest(23,'Score'))  #Print the features ranked by score

## RainToday scores best, as in the correlation analysis, followed by Evaporation and Sunshine, but overall this label does not show a strong relationship with the features
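Because Rainfall is a continuous label, a regression scoring function is a natural alternative to f_classif. The sketch below uses scikit-learn's `f_regression` with SelectKBest on synthetic data; the arrays `X_demo` and `y_demo` are made up for illustration, and this is not the scoring used in the notebook.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (assumption): only the first of three features drives the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

# k="all" keeps every feature and only ranks them
selector = SelectKBest(score_func=f_regression, k="all").fit(X_demo, y_demo)
best = int(np.argmax(selector.scores_))
print(selector.scores_, best)
```

With the notebook's data the same call would take `X1` and `y1`; the resulting ranking may differ from the one printed above.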

# Variance Inflation Factor

We check for multicollinearity to see whether one feature depends on the others. The data is scaled first using MinMaxScaler

In [None]:
from sklearn.preprocessing import MinMaxScaler
mms=MinMaxScaler()

In [None]:
X_scaled = mms.fit_transform(X)

In [None]:
X_scaled.shape

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
vif["vif"] = [variance_inflation_factor(X_scaled, i) for i in range(X_scaled.shape[1])]
vif["Features"] = X.columns

#check values
vif

We see high VIF scores in the columns MinTemp, MaxTemp, Sunshine, WindGustSpeed, WindSpeed, Humidity, Pressure, and Temp at 9am and 3pm, which need to be treated
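The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. The sketch below computes it with NumPy on synthetic columns. It is an illustration under assumptions: the helper `vif` adds an intercept to each regression, so its numbers will not match the statsmodels call above exactly.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of every column, using an intercept in each regression."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - resid.var() / target.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic columns (assumption): c is nearly a copy of a, b is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
scores = vif(np.column_stack([a, b, c]))
print(scores)
```

Columns that duplicate each other's information get a large VIF, while an independent column stays close to 1.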

In [None]:
X_scaled1 = mms.fit_transform(X1)

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
vif["vif"] = [variance_inflation_factor(X_scaled1, i) for i in range(X_scaled1.shape[1])]
vif["Features"] = X1.columns

#check values
vif

As in the earlier observation, many columns have high VIF scores and need to be treated

# Principal Component Analysis (PCA)

PCA is a dimensionality reduction technique, not a feature selection one.

We apply it to the features only. It is mainly used when there are too many features and little correlation with the target.

It is the final analysis we do to check for the multicollinearity problem

In [None]:
from sklearn.decomposition import PCA

In [None]:
pca = PCA()

In [None]:
pca.fit_transform(X_scaled) #Fit PCA on the scaled features so we can plot how much variance the components cover

In [None]:
# lets plot scree plot to check the best components

plt.figure()
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Principal Components')
plt.ylabel("Variance Covered")
plt.title('PCA')
plt.show()

### To cover 95% - 100% of the variance we need only about 11 components. We will use SelectKBest to decide which features are best and whether any should be removed, but at this point we move ahead with all columns, as they all contribute to making the model better
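Reading the number of components off the plot can also be done numerically. The sketch below finds the smallest number of components whose cumulative explained variance reaches a target; the helper `components_for` and the example ratios are illustrative assumptions.

```python
import numpy as np

def components_for(explained_variance_ratio, target=0.95):
    """Smallest number of leading components whose cumulative variance reaches target."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted gives the first index whose cumulative value is >= target
    idx = int(np.searchsorted(cumulative, target))
    return min(idx + 1, len(cumulative))

# Hypothetical ratios, not the ones from this dataset
n_needed = components_for([0.5, 0.3, 0.1, 0.06, 0.04])
print(n_needed)
```

With the fitted object from the cell above, the call would be `components_for(pca.explained_variance_ratio_)`.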


In [None]:
pca.fit_transform(X_scaled1) #Fit PCA on the scaled features so we can plot how much variance the components cover

In [None]:
# lets plot scree plot to check the best components

plt.figure()
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Principal Components')
plt.ylabel("Variance Covered")
plt.title('PCA')
plt.show()

### As in the other scree plot, about 11 components cover 95% - 100% of the variance. We will again rely on SelectKBest to decide whether any features should be removed, and for now we keep all columns, as they all contribute to making the model better


# Using Z-score to deal with the outliers in the data - 1st label - RainTomorrow

In [None]:
df.shape

In [None]:
from scipy.stats import zscore
import numpy as np
z=np.abs(zscore(df))
threshold=3
np.where(z>3)

In [None]:
df_new_z=df[(z<3).all(axis=1)]
df_new_z

In [None]:
df_new_z.shape

In [None]:
#Percentage of data loss (6631 rows before and 6319 rows after the Z-score filter in this run)

Data_loss = ((df.shape[0]-df_new_z.shape[0])/df.shape[0])*100

In [None]:
Data_loss

We have lost about 4.7% of the data. The outliers were removed to reduce the skewness they caused, so that the model is not biased towards them
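The Z-score filter and the data-loss figure can be reproduced in a few lines. This is a minimal NumPy sketch on made-up data; the helpers `zscore_filter` and `data_loss_pct` are illustrative names, and the standard deviation uses the population formula, which is also the default of `scipy.stats.zscore`.

```python
import numpy as np

def zscore_filter(data, threshold=3.0):
    """Keep only the rows where every column lies within threshold standard deviations."""
    data = np.asarray(data, dtype=float)
    z = np.abs((data - data.mean(axis=0)) / data.std(axis=0))
    return data[(z < threshold).all(axis=1)]

def data_loss_pct(rows_before, rows_after):
    """Share of rows removed, in percent."""
    return (rows_before - rows_after) / rows_before * 100

# Toy data (hypothetical): 20 rows, one extreme value in the first column
col0 = list(range(19)) + [1000]
col1 = list(range(20))
kept = zscore_filter(np.column_stack([col0, col1]))

print(len(kept), data_loss_pct(20, len(kept)))
print(data_loss_pct(6631, 6319))  # the row counts reported in this notebook
```

Computing the percentage from the row counts keeps it correct if the dataset or the threshold changes.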

In [None]:
collist=df_new_z.columns.values
ncol=30
nrows=14
plt.figure(figsize=(ncol,3*ncol))
for i in range (0,len(collist)):
    plt.subplot(nrows,ncol,i+1)
    sns.boxplot(df_new_z[collist[i]],color='green',orient='h')
    plt.tight_layout()

In [None]:
df_new_z['Rainfall'].plot.box()

After the Z-score treatment the data looks much better. The only columns that still show outliers are the categorical ones, which we leave as they are: we need all of that data, and outlier treatment applies only to continuous columns

# Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
lr=LogisticRegression()
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, roc_auc_score,classification_report
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

In [None]:
scores=[]
for i in range(0,100):
    X_train_ns,X_test,y_train_ns,y_test = train_test_split(X,y,test_size = 0.25,random_state = i)
    lr.fit(X_train_ns,y_train_ns)
    pred_train = lr.predict(X_train_ns)
    pred_test=lr.predict(X_test)
    print(f"At random state {i},the training accuracy is :-{accuracy_score(y_train_ns,pred_train)}")
    print(f"At random state {i},the Testing accuracy is :-{accuracy_score(y_test,pred_test)}")
    print('\n')
    scores.append(accuracy_score(y_test,pred_test))

Finding the highest score using Argmax

In [None]:
np.argmax(scores)

In [None]:
scores[np.argmax(scores)]

# This model works well with the data: the training and testing scores are almost the same

- The best test score is at random state 88:

  At random state 88, the training accuracy is 0.8432158683266512

  At random state 88, the testing accuracy is 0.8424050632911393

- The training score and testing score are nearly equal here
- Both the train and test scores are good, but we will test further and also check the CV score to see if it is consistent


# Train Test Split

In [None]:
X_train_ns,X_test,y_train_ns,y_test = train_test_split(X,y,test_size = 0.25,random_state = 5 ) 

# as the best random state we have chosen is 5

### We create a function called metric_score that prints the metrics of each classification model we use, so we don't have to write the code again

In [None]:
#Write one function and call it as many times as needed to check the accuracy_score of different models

def metric_score(clf,X_train_ns,X_test,y_train_ns,y_test,train=True):
    if train:
        # Score the classifier on the data it was trained on
        y_pred = clf.predict(X_train_ns)

        print("\n===============================Train Result=============================")
        print(f"Accuracy Score : {accuracy_score(y_train_ns,y_pred) * 100:.2f}%")

    else:
        # Score the classifier on the held-out test data
        pred = clf.predict(X_test)

        print("\n===============================Test Result===============================")
        print(f"Accuracy Score : {accuracy_score(y_test,pred) * 100:.2f}%")

        print ('\n \n Test Classification Report \n', classification_report(y_test, pred, digits = 2)) #per-class precision, recall and F1
        

In [None]:
#Fit on the current split, then call the function to check the train score and the test score

lr.fit(X_train_ns,y_train_ns) #lr was last fitted inside the random-state loop on a different split

metric_score(lr,X_train_ns,X_test,y_train_ns,y_test,train=True) #This is for the Training Score

metric_score(lr,X_train_ns,X_test,y_train_ns,y_test,train=False) #This is for the Testing Score

### Logistic Regression gives a pretty good score: the train score is 84.72% and the test score is 83.35%, which is good considering that this is an upsampled dataset in which we also balanced the label classes

Note that we used a different random state from the one found above: that split had a higher train score than this one, so we changed it

In [None]:
print(confusion_matrix(y_test,lr.predict(X_test)))  #predictions taken on the current split so they line up with y_test

We see that the Type I and Type II errors are pretty high, so we need to look at other models, but before that we check the CV score
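To read the Type I and Type II errors off a confusion matrix, it helps to count the four cells by hand. The sketch below does this for binary labels; `confusion_counts` and the toy labels are illustrative assumptions. For labels 0 and 1, scikit-learn's `confusion_matrix` lays the same counts out as `[[TN, FP], [FN, TP]]`.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels coded as 0 and 1."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 1:
            fp += 1  # Type I error: rain predicted, but it stayed dry
        elif actual == 1 and predicted == 0:
            fn += 1  # Type II error: rain missed
        else:
            tn += 1
    return tn, fp, fn, tp

# Toy labels (hypothetical)
counts = confusion_counts([0, 0, 1, 1, 1, 0], [0, 1, 1, 0, 1, 0])
print(counts)
```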

# Cross-Validation of the model

In [None]:
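from sklearn.model_selection import cross_val_score

# Predictions taken on the current split so they line up with y_train_ns and y_test
pred_train = lr.predict(X_train_ns)
pred_test = lr.predict(X_test)

for j in range(2,10):
    cv_score=cross_val_score(lr,X,y,cv=j)
    cv_mean=cv_score.mean()
    print(f"At cross fold {j} the cv score is {cv_mean}, the accuracy score for training is {accuracy_score(y_train_ns,pred_train)} and the accuracy for testing is {accuracy_score(y_test,pred_test)}")
    print('\n')

The CV score is about 83%. In the recorded run the test score printed next to it was only about 66%, but that figure compared predictions left over from the random-state loop with the labels of a different split. With predictions taken on the current split, the test accuracy is in line with the 83.35% reported above. We still check other models to see whether any of them does better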
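The loop above varies the number of folds from 2 to 9. As a reminder of what a fold is, here is a minimal sketch of plain k-fold index splitting. The generator `kfold_indices` is an illustrative assumption; for classifiers with an integer `cv`, scikit-learn uses a stratified variant, which this sketch does not reproduce.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for consecutive, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for i in range(n_folds):
        size = base + (1 if i < extra else 0)  # the first folds absorb the remainder
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(kfold_indices(10, 3))
for train, test in folds:
    print(len(train), test)
```

Every sample lands in exactly one test fold, so the mean CV score uses each row for testing once.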

# Decision Tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier

dt=DecisionTreeClassifier()

X_train_ns,X_test,y_train_ns,y_test = train_test_split(X,y,test_size = 0.25,random_state = 5) #same split (random state 5) as used for Logistic Regression
dt.fit(X_train_ns,y_train_ns)
pred_train = dt.predict(X_train_ns)
pred_test = dt.predict(X_test)
print(f"At random state {5},the training accuracy is :-{accuracy_score(y_train_ns,pred_train)}")
print(f"At random state {5},the Testing accuracy is :-{accuracy_score(y_test,pred_test)}")
print('\n')


In [None]:
#Call the function and pass dataset to check the train score and the test score

metric_score(dt,X_train_ns,X_test,y_train_ns,y_test,train=True) #This is for the Training Score

metric_score(dt,X_train_ns,X_test,y_train_ns,y_test,train=False) #This is for the Testing Score

In [None]:
print(confusion_matrix(y_test,pred_test)) 

# Observations from the Decision Tree Classifier :-

- The training score is boosted all the way to 100%, the highest possible, but the testing score falls short of Logistic Regression at 77.28%. The F1 score is the same as the test accuracy, and precision is 69% overall: 86% for class 0 and only 53% for class 1, similar to the last model

- The model is not performing as well as Logistic Regression, so we would not pick it at this point. We move on to check the CV score

In [None]:
#Cross validation of the model
from sklearn.model_selection import cross_val_score
for j in range(2,10):
    cv_score=cross_val_score(dt,X,y,cv=j)
    cv_mean=cv_score.mean()
    print(f"At cross fold{j} the cv score is {cv_mean} and accuracy score for training is {accuracy_score(y_train_ns,pred_train)}and the accuracy for testing is {accuracy_score(y_test,pred_test)}")
    print('\n')

We see a real improvement in how closely the CV score tracks the test score compared to the last model: the two come out close together, at 72% to 77%, which is good for the model

# KNN Classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn=KNeighborsClassifier()

knn.fit(X_train_ns,y_train_ns)
knn.score(X_train_ns,y_train_ns)
pred_decision =knn.predict(X_test)

knns = accuracy_score(y_test,pred_decision)
print('Accuracy Score :',knns*100)

knnscore = cross_val_score(knn,X,y,cv=8)
knnc =knnscore.mean()
print('Cross Val Score :',knnc*100)
print(confusion_matrix(y_test,pred_decision)) 

In [None]:
#Call the function and pass dataset to check the train score and the test score

metric_score(knn,X_train_ns,X_test,y_train_ns,y_test,train=True) #This is for the Training Score

metric_score(knn,X_train_ns,X_test,y_train_ns,y_test,train=False) #This is for the Testing Score

# Observations from the KNN Classifier :-

- The training score is 83.10% (the Decision Tree reached 100% on the training data)

- The F1 score is higher, at 90% for class 0 but only 56% for class 1, which is bad: the model is biased towards class 0

- The CV score is good and very similar to the test accuracy, at 81%, the highest among the 3 models we tested. Overall the model is acceptable, but we need to improve the F1 score and precision

- In the confusion matrix the Type I and Type II errors are better than in the previous models, but the error rate is still very high and needs to improve, so let's test other models
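The per-class precision, recall and F1 values quoted from the classification report can be computed directly from the labels. This is a small sketch with made-up labels; `precision_recall_f1` is an illustrative helper, not part of the notebook.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels (hypothetical): the minority class 1 is predicted poorly
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_demo = [1, 0, 0, 1, 0, 0, 0, 0]
class1 = precision_recall_f1(y_true_demo, y_pred_demo, positive=1)
class0 = precision_recall_f1(y_true_demo, y_pred_demo, positive=0)
print(class1, class0)
```

A model biased towards class 0 shows exactly this pattern: good scores for class 0 and much weaker ones for class 1, even when the overall accuracy looks fine.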

# Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf=RandomForestClassifier()

rf.fit(X_train_ns,y_train_ns)
rf.score(X_train_ns,y_train_ns)
pred_decision =rf.predict(X_test)

rfs = accuracy_score(y_test,pred_decision)
print('Accuracy Score :',rfs*100)

rfscore = cross_val_score(rf,X,y,cv=4)
rfc =rfscore.mean()
print('Cross Val Score :',rfc*100)
print(confusion_matrix(y_test,pred_decision)) 

In [None]:
#Call the function and pass dataset to check the train score and the test score

metric_score(rf,X_train_ns,X_test,y_train_ns,y_test,train=True) #This is for the Training Score

metric_score(rf,X_train_ns,X_test,y_train_ns,y_test,train=False) #This is for the Testing Score

# Observations from the Random Forest Classifier :-

- Like the Decision Tree, the train score is at the maximum of 100%, but the test score is much better at 83.23%. The F1 score is 82% and the precision is 82%, which makes this a good model and the best so far. The dataset was imbalanced and we treated it; the CV score is approximately 81.13%, on par with the test score, which is what we need from the model

- The model has lower errors in the confusion matrix than the other models, though these could still be reduced

# XgBoost

In [None]:
import xgboost as xgb

xgb = xgb.XGBClassifier()

xgb.fit(X_train_ns,y_train_ns)
xgb.score(X_train_ns,y_train_ns)
pred_decision =xgb.predict(X_test)

xgbs = accuracy_score(y_test,pred_decision)
print('Accuracy Score :',xgbs*100)

xgbscore = cross_val_score(xgb,X,y,cv=9)
xgbc =xgbscore.mean()
print('Cross Val Score :',xgbc*100)
print(confusion_matrix(y_test,pred_decision)) 

In [None]:
#Call the function and pass dataset to check the train score and the test score

metric_score(xgb,X_train_ns,X_test,y_train_ns,y_test,train=True) #This is for the Training Score

metric_score(xgb,X_train_ns,X_test,y_train_ns,y_test,train=False) #This is for the Testing Score

# Observations from the XGBoost Classifier :-

- The training score is 99.98%, slightly lower than Random Forest and the Decision Tree, and the test accuracy is 83.42%, slightly higher than Random Forest. The F1 score is 83%, about the same as Random Forest, and the precision is 82%
- The CV score is 77%, lower than Random Forest, whose CV score of approximately 81% is closer to its test score. Overall these are really good scores, but Random Forest looks best because its CV score is closer to its test score. More analysis is needed

# SVC

In [None]:
from sklearn.svm import SVC

svc= SVC()

svc.fit(X_train_ns,y_train_ns)
svc.score(X_train_ns,y_train_ns)
pred_decision =svc.predict(X_test)

svcs = accuracy_score(y_test,pred_decision)
print('Accuracy Score :',svcs*100)

svcscore = cross_val_score(svc,X,y,cv=8)
svcc =svcscore.mean()
print('Cross Val Score :',svcc*100)
print(confusion_matrix(y_test,pred_decision))

In [None]:
#Call the function and pass dataset to check the train score and the test score

metric_score(svc,X_train_ns,X_test,y_train_ns,y_test,train=True) #This is for the Training Score

metric_score(svc,X_train_ns,X_test,y_train_ns,y_test,train=False) #This is for the Testing Score

# Observations from the SVC Classifier :-

- This model performs worse than all the other models we tested: the train score is 77% and the test score is 76.14%. The F1 score is low, the precision is only 38%, and the CV score is also on the lower side

- The confusion matrix shows that the model predicts only one class and makes no predictions for the other, so this model is not doing well at all on this dataset

## We can assume that Random Forest Classifier is the best algorithm for this project, as it has the highest scores and the smallest difference between the cross-validation score and accuracy, but we need to check ROC AUC to finalize the decision

# Let's check ROC AUC Curve for the fitted Model

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.metrics import RocCurveDisplay  #plot_roc_curve was removed in scikit-learn 1.2
import matplotlib.pyplot as plt


### How well our models work on training data

disp = RocCurveDisplay.from_estimator(lr,X_train_ns,y_train_ns)

for model in (dt, knn, rf, xgb, svc):
    RocCurveDisplay.from_estimator(model,X_train_ns,y_train_ns, ax=disp.ax_) #ax_ = Axes holding the ROC curves

plt.legend(prop={'size' : 10}, loc='lower right' )

plt.show()

In [None]:
### How well our models work on testing data

from sklearn.metrics import RocCurveDisplay

disp = RocCurveDisplay.from_estimator(lr,X_test,y_test)

for model in (dt, knn, rf, xgb, svc):
    RocCurveDisplay.from_estimator(model,X_test,y_test, ax=disp.ax_) #ax_ = Axes holding the ROC curves

plt.legend(prop={'size' : 10}, loc='lower right' )

plt.show()

# The ROC curves again support choosing the Random Forest Classifier as the best model

- Logistic Regression reaches an AUC of 0.86 on the training data and 0.86 on the test data. Random Forest reaches 1.00 on the training data and 0.86 on the test data. On AUC alone Logistic Regression would be the winner, but the CV score of Random Forest is better at 81%. XGBoost reaches 1.00 on training and 0.87 on test, the highest among all, but its CV score is lower, so we go with Random Forest
- The closest model to Random Forest is the XGBoost classifier, but the scores are a little better with Random Forest
- Random Forest should become an even better model with hyperparameter tuning
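The AUC shown in the legend of each ROC curve has a simple reading: it is the probability that a randomly chosen rainy day gets a higher score than a randomly chosen dry day. The sketch below computes it by comparing every positive and negative pair; `roc_auc` here is an illustrative helper written for clarity, not the scikit-learn function, and it is too slow for large datasets.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy scores (hypothetical)
auc_demo = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_demo)
```

An AUC of 1.00 on training data next to a lower value on test data is the same overfitting signal as a 100% train accuracy.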

# Hyper parameter Tuning

In [None]:
#Plotting the distribution of the prediction errors

sns.histplot(y_test-pred_decision, kde=True)  #distplot is deprecated in recent seaborn releases
plt.show()

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
rf= RandomForestClassifier()

#Creating parameters to pass to GridSearchCV

parameters = {'criterion':['gini','entropy'],
             'max_features':['sqrt','log2'],   #'auto' is no longer accepted by recent scikit-learn releases
             'min_samples_split': [2, 3, 4 ,5],   #an integer min_samples_split must be at least 2
             'min_samples_leaf': [1, 3, 4, 5, 6],
             'n_estimators' : [100,200,300,400,500]
             }

GCV = GridSearchCV(estimator = rf,param_grid=parameters, verbose=2,cv=3, n_jobs = -1, scoring='accuracy')
GCV.fit(X_train_ns,y_train_ns) #fitting data into the model
GCV.best_params_ #printing the best parameters found by the GridSearch CV
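A grid search fits one model per parameter combination and per fold, so the cost grows quickly with every list in the grid. The sketch below counts the fits before launching a search; `count_fits` and `demo_grid` are illustrative assumptions.

```python
from itertools import product

def count_fits(param_grid, cv):
    """Return (number of candidate settings, number of model fits during the search)."""
    n_candidates = len(list(product(*param_grid.values())))
    return n_candidates, n_candidates * cv

# Small hypothetical grid: 2 criteria x 3 forest sizes
demo_grid = {"criterion": ["gini", "entropy"], "n_estimators": [100, 200, 300]}
fits = count_fits(demo_grid, cv=3)
print(fits)
```

Calling `count_fits(parameters, 3)` on the grid defined above gives the workload of the search; with the default `refit=True`, GridSearchCV adds one final fit on the whole training set.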

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf=RandomForestClassifier(n_estimators = 200 ,min_samples_split=4,min_samples_leaf=1,max_features='sqrt',criterion='entropy')

rf.fit(X_train_ns,y_train_ns)
rf.score(X_train_ns,y_train_ns)
pred_decision =rf.predict(X_test)

rfs = accuracy_score(y_test,pred_decision)
print('Accuracy Score :',rfs*100)

rfscore = cross_val_score(rf,X,y,cv=8)
rfc =rfscore.mean()
print('Cross Val Score :',rfc*100)
print(confusion_matrix(y_test,pred_decision)) #use the tuned Random Forest predictions

After tuning we get a higher score than before, and the CV score has come much closer to the accuracy score at 82%, which is great: the closer the two scores, the more consistent the model. Let's save the rf model in a pickle file

In [None]:
GCV_pred=GCV.best_estimator_.predict(X_test) #predicting with the best parameters
accuracy_score(y_test,GCV_pred) #Checking Final Accuracy

In [None]:
plt.scatter(y_test, pred_decision, alpha = 0.5)
plt.xlabel("y_test")
plt.ylabel("pred_test")
plt.show()

We see that the model is largely accurate, as points with actual class 0 are predicted as 0 and points with actual class 1 as 1

In [None]:
import pickle
filename = 'raintomorrow.pkl'
pickle.dump(rf,open(filename,'wb'))
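When saving and loading a model with pickle, opening the file in a `with` block makes sure it is closed even if something fails. The sketch below shows the round trip with a plain dictionary standing in for the fitted model; `model_stub` and the temporary file name are hypothetical.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model; any picklable object behaves the same way
model_stub = {"name": "random_forest", "n_estimators": 200}

path = os.path.join(tempfile.mkdtemp(), "raintomorrow_demo.pkl")

with open(path, "wb") as fh:   # the file is closed automatically on leaving the block
    pickle.dump(model_stub, fh)

with open(path, "rb") as fh:
    restored = pickle.load(fh)

print(restored == model_stub)
```

Pickle files should only be loaded from trusted sources, because loading one can execute arbitrary code.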

# Conclusion

In [None]:
loaded_model = pickle.load(open('raintomorrow.pkl','rb'))
result = loaded_model.score(X_test,y_test)
print(result*100)

In [None]:
conclusion = pd.DataFrame([loaded_model.predict(X_test)[:],y_test.values],index=['Predicted','Original']) #compare predictions with the actual labels

In [None]:
conclusion

# We have 1580 columns (one per test sample) with predicted and actual values, and the model we have chosen as ideal for this project is the Random Forest Classifier

# PCA treatment

In [None]:
#Splitting the features and target for rainfall
X=df_new_z.drop(columns=['Rainfall'])
y=df_new_z['Rainfall']

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled =scaler.fit_transform(X)

In [None]:
pca = PCA()
pca.fit_transform(X_scaled)

In [None]:
#let's plot the scree plot to check the best components

plt.figure()
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Principal Components')
plt.ylabel("Variance Covered")
plt.title('PCA')
plt.show()

In [None]:
pca = PCA(n_components = 13)
new_pcomp = pca.fit_transform(X_scaled)
princ1_comp = pd.DataFrame(new_pcomp,columns=[f'PC{i}' for i in range(1,14)]) #PC1 ... PC13
princ1_comp

In [None]:
#Data split into train and test

X_train,X_test,y_train,y_test = train_test_split(princ1_comp ,y,test_size = 0.25,random_state = 355)

## We get nearly the same score even after incorporating PCA, so we are going to use all the features: with PCA the R2 score is 49.94% and without it 50.06%

In [None]:
df_new_z.skew()

The skewness is reduced after the Z-score treatment. The only columns still skewed are the categorical ones and the target itself, which we cannot treat, so some skew remains in the data

In [None]:
X

In [None]:
y

## Progressing with the normal steps, as the additional steps did not help increase the score

In [None]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,train_size = 0.75,random_state = 322)

In [None]:
from sklearn.preprocessing import StandardScaler # scale the data so that all features are on a comparable scale

scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns = X_train.columns)
X_test= pd.DataFrame(scaler.transform(X_test), columns = X_test.columns)
X_train
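The cell above fits the scaler on the training data and only transforms the test data, so no information from the test set leaks into the scaling. A minimal NumPy sketch of the same idea follows; `standardise` and the toy arrays are illustrative assumptions.

```python
import numpy as np

def standardise(train, test):
    """Scale both sets with the mean and standard deviation of the training set only."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    mean = train.mean(axis=0)
    std = train.std(axis=0)   # population standard deviation, as StandardScaler uses
    std[std == 0] = 1.0       # leave constant columns unscaled instead of dividing by zero
    return (train - mean) / std, (test - mean) / std

# Toy data (hypothetical)
train_s, test_s = standardise([[1, 10], [3, 30]], [[5, 20]])
print(train_s)
print(test_s)
```

Test values can fall outside the range seen in training, which is expected: the test set is scaled with the training statistics on purpose.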

# Linear Regression Model

In [None]:
from sklearn.linear_model import LinearRegression
lr=LinearRegression()
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split

In [None]:
scores_test=[]
for i in range(0,100):
    X_train,X_test,y_train,y_test = train_test_split(princ1_comp,y,test_size = 0.25,random_state = i)
    lr.fit(X_train,y_train)
    pred_train = lr.predict(X_train)
    pred_test = lr.predict(X_test)
    print(f"At random state {i},the training accuracy is :-{r2_score(y_train,pred_train)}")
    print(f"At random state {i},the testing accuracy is :-{r2_score(y_test,pred_test)}")
    print('\n')
    scores_test.append(r2_score(y_test,pred_test))

In [None]:
scores_test[np.argmax(scores_test)]

# The best random state for these observations is 30, but all the scores are around 50%, which is very low: the model is not a good fit for the data

At random state 30, the training accuracy is 0.5014971732811984

At random state 30, the testing accuracy is 0.5006375344803744

We have chosen this state because the difference between the two scores is very small, only in the decimals, and no other state scores higher

### Splitting again train test split with ideal Random state

In [None]:
X_train,X_test,y_train,y_test = train_test_split(princ1_comp,y,test_size = 0.25,random_state = 30)

In [None]:
lr.fit(X_train,y_train)

In [None]:
pred_test=lr.predict(X_test)

In [None]:
print(r2_score(y_test,pred_test))

# Linear Regression does very poorly with this data: we get approximately 50.06%, and an R2 score of 50% means it is not a good model


We will try other models, because we cannot accept a model with a 50% R2 score
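An R2 score of 50% means the model explains about half of the variance in the target compared with always predicting the mean. The sketch below computes R2 by hand to show where the number comes from; the helper `r2` and the toy values are illustrative assumptions.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

# Toy values (hypothetical)
actual = [1.0, 2.0, 3.0, 4.0]
score_demo = r2(actual, [1.5, 1.5, 3.5, 3.5])
score_mean = r2(actual, [2.5, 2.5, 2.5, 2.5])   # always predicting the mean
print(score_demo, score_mean)
```

Predicting the mean for every row gives an R2 of 0, and a model worse than that gives a negative value.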

# Cross-Validation of the model

In [None]:
# Predictions taken on the current split so they line up with y_train and y_test
Train_accuracy=r2_score(y_train,lr.predict(X_train))
Test_accuracy=r2_score(y_test,lr.predict(X_test))

from sklearn.model_selection import cross_val_score
for j in range(2,10):
    cv_score=cross_val_score(lr,X,y,cv=j)
    cv_mean=cv_score.mean()
    print(f"At cross fold{j} the cv score is {cv_mean} and accuracy score for training is {Train_accuracy}and the accuracy for testing is {Test_accuracy}")
    print('\n')

## The CV scores are pretty close to the test score; the CV score at 7 folds is the closest, at 49%. Overall, though, the model is not doing well on this dataset

# Plotting the linear Regression graph with actual and predicted values comparison

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(8,6))
plt.scatter(x=y_test, y=pred_test,color='r')
plt.plot(y_test,y_test,color='b')
plt.xlabel('Actual Rainfall',fontsize=14)
plt.ylabel('Predicted Rainfall',fontsize=14)
plt.title('Linear Regression',fontsize=18)
plt.show()

## The model predicts very badly: the actual and predicted values do not fall on the same line, and the data does not show a trend either

# Regularization of the Linear Model

In [None]:
from sklearn.model_selection import GridSearchCV #to select the best parameters for hyperparameter tuning
from sklearn.model_selection import cross_val_score #to check the difference from the earlier score without hyper parameter tuning

In [None]:
from sklearn.linear_model import Lasso

parameters ={'alpha' : [.0001, .001, .01, .1, 1, 10],
            'random_state' : list(range(0,15))}
ls = Lasso()
clf = GridSearchCV(ls,parameters)
clf.fit(X_train, y_train)

print(clf.best_params_)

# Final model training for Linear Regression

In [None]:
ls = Lasso(alpha= 0.01, random_state= 0)
ls.fit(X_train,y_train)
ls_score_training = ls.score(X_train,y_train)
pred_ls = ls.predict(X_test)
ls_score_training*100

We get a poor Lasso training score of 50.14%, only slightly above the cross-validation score of approximately 49% we got earlier, so regularization does not help and the model still does very badly on this dataset


One reason the scores are this bad is the nature of the label. The dataset was imbalanced to begin with: rain fell only about 26% of the time, so the label has huge outliers, with an extreme range between the days it does not rain and the days it does. Other features have extreme values as well. Even after the Z-score treatment the data is not close to normally distributed, so we will have to try other models and see whether this improves

# Checking MSE,RMSE score 

In [None]:
from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, pred_test))
print('MSE:', metrics.mean_squared_error(y_test, pred_test))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, pred_test)))

We see that the MAE and MSE values are low, so the absolute error rates are small
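Error values are easier to judge next to a baseline. The sketch below computes MAE, MSE and RMSE by hand and compares a model's predictions with always predicting the mean; `regression_errors` and the toy rainfall values are illustrative assumptions.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """MAE, MSE and RMSE of a set of predictions."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    mse = (err ** 2).mean()
    return {"MAE": np.abs(err).mean(), "MSE": mse, "RMSE": np.sqrt(mse)}

# Toy rainfall values in mm (hypothetical): mostly dry days and one wet day
y_demo = [0.0, 0.0, 1.0, 7.0]
model_err = regression_errors(y_demo, [0.5, 0.5, 1.0, 4.0])
baseline_err = regression_errors(y_demo, [np.mean(y_demo)] * len(y_demo))
print(model_err)
print(baseline_err)
```

With a target that is zero on most days, small absolute errors can coexist with a low R2 score, so the comparison with the baseline is the more telling figure.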

# Decision Tree Regressor

In [None]:
from sklearn.tree import DecisionTreeRegressor

dt=DecisionTreeRegressor()
dt.fit(X_train,y_train)
dt.score(X_train,y_train)
pred_test =dt.predict(X_test)
dfs = r2_score(y_test,pred_test)
print('R2 Score :',dfs*100)

dfscore = cross_val_score(dt,X,y,cv=7)
dfc =dfscore.mean()
print('Cross Val Score :',dfc*100)


## We observe that for Decision Tree Regressor :-
- The model has an R2 score of approximately 10.87%, which is far worse than Linear Regression at approximately 50%
- The cross-validation score is also very bad compared to Linear Regression, at approximately 5.01%, so we cannot choose this model
- There is only a slight difference between the R2 score and the cross-validation score, but a score of 10% is not good
- We try the next model, KNN, as we need both a better score and a small difference between the cross-validation score and the R2 score


In [None]:
from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, pred_test))
print('MSE:', metrics.mean_squared_error(y_test, pred_test))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, pred_test)))

The MSE is a little on the higher end while the other error scores are acceptable, but the R2 score is not improving, so we move on

# K- Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsRegressor

knn=KNeighborsRegressor()
knn.fit(X_train,y_train)
knn.score(X_train,y_train)
pred_test =knn.predict(X_test)
knns = r2_score(y_test,pred_test)
print('R2 Score :',knns*100)

knnscore = cross_val_score(knn,X,y,cv=8)
knnc =knnscore.mean()
print('Cross Val Score :',knnc*100)


# We observe that for K-Nearest Neighbors :-
- The model is not working well for the dataset: the R2 score is approximately 12.31%, far below Linear Regression but higher than the Decision Tree
- The cross-validation score is also very bad compared to Linear Regression, at approximately 10%, though better than the Decision Tree's, so we cannot choose this model either
- There is very little difference between the R2 score and the cross-validation score
- We try the next model, an ensemble technique: Random Forest Regressor


In [None]:
from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, pred_test))
print('MSE:', metrics.mean_squared_error(y_test, pred_test))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, pred_test)))

The MSE is a little on the higher end while the other error scores are acceptable, but the R2 score is not improving, so we move on

# Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf=RandomForestRegressor()
rf.fit(X_train,y_train)
rf.score(X_train,y_train)
pred_decision =rf.predict(X_test)

rfs = r2_score(y_test,pred_decision)
print('R2 Score :',rfs*100)

rfscore = cross_val_score(rf,X,y,cv=7)
rfc =rfscore.mean()
print('Cross Val Score :',rfc*100)


# We observe that for Random Forest Regressor :-

- The model does much better than the 3 models we tested before: the R2 score is approximately 53%, and the cross-validation score is approximately 50.99%
- The cross-validation score is a little better than for Linear Regression and the Decision Tree, and the difference between the R2 score and the CV score is small as well
- The score is higher than the Linear Regression model's approximately 50%
- We try the next model, another ensemble technique: AdaBoost



In [None]:
from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, pred_decision)) #pred_decision holds the Random Forest predictions
print('MSE:', metrics.mean_squared_error(y_test, pred_decision))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, pred_decision)))

# ADA Boost Regressor

In [None]:
from sklearn.ensemble import AdaBoostRegressor
ada = AdaBoostRegressor()

ada.fit(X_train,y_train)
ada.score(X_train,y_train)
pred_decision =ada.predict(X_test)

adas = r2_score(y_test,pred_decision)
print('R2 Score :',adas*100)

adascore = cross_val_score(ada,X,y,cv=7)
adac =adascore.mean()
print('Cross Val Score :',adac*100)


In [None]:
#Checking MAE MSE and RMSE scores
from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, pred_decision))
print('MSE:', metrics.mean_squared_error(y_test, pred_decision))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, pred_decision)))

# We observe that for ADA Boost Regressor :-
- The model is not working well for the dataset: its score is lower than Random Forest and, by a small margin, Linear Regression
- The cross-validation score is much lower than for Linear Regression and Random Forest, so we cannot choose this over Random Forest
- There is very little difference between the R2 score and the cross-validation score, but the model is not doing well at all
- The score is approximately 43%, lower than the Linear Regression model
- We try the next model, XGBoost


# Xgboost Regressor

In [None]:
import xgboost as xgb
xgb = xgb.XGBRegressor()

xgb.fit(X_train,y_train)
xgb.score(X_train,y_train)
pred_decision =xgb.predict(X_test)

xgbs = r2_score(y_test,pred_decision)
print('R2 Score :',xgbs*100)

xgbscore = cross_val_score(xgb,X,y,cv=7)
xgbc =xgbscore.mean()
print('Cross Val Score :',xgbc*100)



In [None]:
#Checking MAE MSE and RMSE scores
from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, pred_decision))
print('MSE:', metrics.mean_squared_error(y_test, pred_decision))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, pred_decision)))

# We observe that for Extreme Gradient Boost Regressor :-

- The model gives a lower score than Linear Regression and Random Forest, at 47.30%
- The cross-validation score is much closer to the R2 score than for Linear Regression and the other models
- The MAE and MSE scores are low, which is good
   

In [None]:
#Checking MAE MSE and RMSE scores for the chosen model (Random Forest)
from sklearn import metrics

pred_test = rf.predict(X_test) #pred_test previously held the KNN predictions

print('MAE:', metrics.mean_absolute_error(y_test, pred_test))
print('MSE:', metrics.mean_squared_error(y_test, pred_test))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, pred_test)))

In [None]:
#Plotting the distribution of the prediction errors

sns.histplot(y_test-pred_test, kde=True)  #distplot is deprecated in recent seaborn releases
plt.show()

In [None]:
plt.scatter(y_test, pred_test, alpha = 0.5)
plt.xlabel("y_test")
plt.ylabel("pred_test")
plt.show()

In [None]:
import pickle
filename = 'rainfall.pkl'
pickle.dump(rf,open(filename,'wb'))

# Conclusion

In [None]:
loaded_model = pickle.load(open('rainfall.pkl','rb'))
result = loaded_model.score(X_test,y_test)
print(result*100)

In [None]:
conclusion = pd.DataFrame([loaded_model.predict(X_test)[:],y_test.values],index=['Predicted','Original']) #compare predictions with the actual rainfall

In [None]:
conclusion

### The model has predicted the rainfall for 1580 columns (one per test sample) against the actual rainfall, and Random Forest is the best of all the models