> <h2>Rainfall Prediction In Australia </h2>





Rainfall prediction is a critical area of research that impacts various sectors, including agriculture, water resource management, and disaster preparedness.

Accurate forecasting of rainfall not only helps in optimizing agricultural practices but also plays a vital role in mitigating the adverse effects of floods and droughts. This project focuses on developing a predictive model for rainfall in Australia using machine learning techniques.

Table of Contents
  * [Introduction](#scrollTo=YveT-3FvkXkq&line=1&uniqifier=1)
  * [About The Dataset](#scrollTo=KzLnssHAZF_j)
     *  1.[Pre-Processing](#scrollTo=KzLnssHAZF_j)
     *  2.[Import The Dataset](#scrollTo=ZaxLKZugZF_l)
  * [Data Processing](#scrollTo=v2EPXo2XZF_o)
     *  1.[One Hot Encoding](#scrollTo=cV1uhrmUZF_o)  
  * [Implementating The Data Classification Algorithims](#scrollTo=-Ho2ftCAxGtc)
     *  1.[Linear Regression](#scrollTo=0saGSXndZF_r)
     *  2.[K-Nearest Neighbors](#scrollTo=3dxe2ny7ZF_w)  
     *  3.[Decision Trees](#scrollTo=hEJLR5otZF_y)
     *  4.[Logistic Regression](#scrollTo=4xtfB4rbZF_0)        
     *  5.[Support Vector Machine](#scrollTo=RQj0q2DxZF_2)
 * [Report](#scrollTo=MBlLRQoPZF_4)
 * [Recommendation](#scrollTo=r3ht_oOi4OJn)
 * [Conclusion](#scrollTo=8A1jFZ8--PGT)
 * [Theory](#scrollTo=h8VmwCUc-LC2)


><div href="Introduction">
    <h2>Introduction</h2>
</div>



The dataset utilized in this project is sourced from the __Australian Government's Bureau of Meteorology__, which provides historical weather data, including various meteorological features such as temperature, humidity, wind speed, and pressure. By leveraging this rich dataset, we aim to build a classifier that can predict whether it will rain the following day, which is essential for both individual decision-making and larger-scale planning.

By the end of this project, we aim to identify the __most effective model for predicting rainfall__ in Australia, thereby contributing valuable insights into the practical applications of machine learning in meteorology and climate science.

This project not only reinforces our understanding of machine learning techniques but also highlights the significance of data-driven decision-making in addressing real-world challenges.



> <div href="About The Dataset">
    <h2>About The Dataset</h2>
</div>


The original source of the data is Australian Government's Bureau of Meteorology and the latest data can be gathered from [http://www.bom.gov.au/climate/dwo/](http://www.bom.gov.au/climate/dwo/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).

The dataset to be used has extra columns like 'RainToday' and our target is 'RainTomorrow', which was gathered from the Rattle at [https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData](https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)




><div href="pre-processing">
    <h2>Pre-processing</h2>
</div>


This dataset contains observations of weather metrics for each day from 2008 to 2017. The **weatherAUS.csv** dataset includes the following fields:

| Field         | Description                                           | Unit            | Type   |
| ------------- | ----------------------------------------------------- | --------------- | ------ |
| Date          | Date of the Observation in YYYY-MM-DD                 | Date            | object |
| Location      | Location of the Observation                           | Location        | object |
| MinTemp       | Minimum temperature                                   | Celsius         | float  |
| MaxTemp       | Maximum temperature                                   | Celsius         | float  |
| Rainfall      | Amount of rainfall                                    | Millimeters     | float  |
| Evaporation   | Amount of evaporation                                 | Millimeters     | float  |
| Sunshine      | Amount of bright sunshine                             | hours           | float  |
| WindGustDir   | Direction of the strongest gust                       | Compass Points  | object |
| WindGustSpeed | Speed of the strongest gust                           | Kilometers/Hour | object |
| WindDir9am    | Wind direction averaged of 10 minutes prior to 9am    | Compass Points  | object |
| WindDir3pm    | Wind direction averaged of 10 minutes prior to 3pm    | Compass Points  | object |
| WindSpeed9am  | Wind speed averaged of 10 minutes prior to 9am        | Kilometers/Hour | float  |
| WindSpeed3pm  | Wind speed averaged of 10 minutes prior to 3pm        | Kilometers/Hour | float  |
| Humidity9am   | Humidity at 9am                                       | Percent         | float  |
| Humidity3pm   | Humidity at 3pm                                       | Percent         | float  |
| Pressure9am   | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal     | float  |
| Pressure3pm   | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal     | float  |
| Cloud9am      | Fraction of the sky obscured by cloud at 9am          | Eights          | float  |
| Cloud3pm      | Fraction of the sky obscured by cloud at 3pm          | Eights          | float  |
| Temp9am       | Temperature at 9am                                    | Celsius         | float  |
| Temp3pm       | Temperature at 3pm                                    | Celsius         | float  |
| RainToday     | If there was rain today                               | Yes/No          | object |
| RainTomorrow  | If there is rain tomorrow                             | Yes/No          | float  |

Column definitions were gathered from [http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml](http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)



>## **Import the required libraries**


In [12]:
# All Libraries required for this lab are listed below. The libraries pre-installed on Skills Network Labs are commented.
# !mamba install -qy pandas==1.3.4 numpy==1.21.4 seaborn==0.9.0 matplotlib==3.5.0 scikit-learn==0.20.1
# Note: If your environment doesn't support "!mamba install", use "!pip install"

In [13]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
import numpy as np
from sklearn import linear_model
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
from sklearn.metrics import confusion_matrix, accuracy_score
import sklearn.metrics as metrics

> <div href="Importing the Dataset">
    <h2>Importing the Dataset</h2>
</div>




In [14]:
# Define the file path
filepath = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv"

# Load the data into a DataFrame
df = pd.read_csv(filepath)

# Display the first few rows of the DataFrame
print(df.head())

       Date  MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine WindGustDir  \
0  2/1/2008     19.5     22.4      15.6          6.2       0.0           W   
1  2/2/2008     19.5     25.6       6.0          3.4       2.7           W   
2  2/3/2008     21.6     24.5       6.6          2.4       0.1           W   
3  2/4/2008     20.2     22.8      18.8          2.2       0.0           W   
4  2/5/2008     19.7     25.7      77.4          4.8       0.0           W   

   WindGustSpeed WindDir9am WindDir3pm  ...  Humidity9am  Humidity3pm  \
0             41          S        SSW  ...           92           84   
1             41          W          E  ...           83           73   
2             41        ESE        ESE  ...           88           86   
3             41        NNE          E  ...           83           90   
4             41        NNE          W  ...           88           74   

   Pressure9am  Pressure3pm  Cloud9am  Cloud3pm  Temp9am  Temp3pm  RainToday  \
0       1017

><div href="Data Processing ">
    <h2>Data Processing </h2>
</div>

<div href="One Hot Encoding">
    <h2>One Hot Encoding</h2>
</div>


The pd.get_dummies() function is used to convert categorical variable(s) into dummy/indicator variables. This is useful for machine learning models that require numerical input.
In this case, it will create new binary columns for each category in the specified columns: RainToday, WindGustDir, WindDir9am, and WindDir3pm. Each unique category will have its own column, and the entries will be marked as 1 (True) or 0 (False).


First, we need to perform one hot encoding to convert categorical variables to binary variables.


In [15]:

df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])
df_sydney_processed.head()

Unnamed: 0,Date,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,...,WindDir3pm_NNW,WindDir3pm_NW,WindDir3pm_S,WindDir3pm_SE,WindDir3pm_SSE,WindDir3pm_SSW,WindDir3pm_SW,WindDir3pm_W,WindDir3pm_WNW,WindDir3pm_WSW
0,2/1/2008,19.5,22.4,15.6,6.2,0.0,41,17,20,92,...,False,False,False,False,False,True,False,False,False,False
1,2/2/2008,19.5,25.6,6.0,3.4,2.7,41,9,13,83,...,False,False,False,False,False,False,False,False,False,False
2,2/3/2008,21.6,24.5,6.6,2.4,0.1,41,17,2,88,...,False,False,False,False,False,False,False,False,False,False
3,2/4/2008,20.2,22.8,18.8,2.2,0.0,41,22,20,83,...,False,False,False,False,False,False,False,False,False,False
4,2/5/2008,19.7,25.7,77.4,4.8,0.0,41,11,6,88,...,False,False,False,False,False,False,False,True,False,False


Next, we replace the values of the 'RainTomorrow' column changing them from a categorical column to a binary column. We do not use the `get_dummies` method because we would end up with two columns for 'RainTomorrow' and we do not want, since 'RainTomorrow' is our target.


In [16]:
df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)
df_sydney_processed.head()

Unnamed: 0,Date,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,...,WindDir3pm_NNW,WindDir3pm_NW,WindDir3pm_S,WindDir3pm_SE,WindDir3pm_SSE,WindDir3pm_SSW,WindDir3pm_SW,WindDir3pm_W,WindDir3pm_WNW,WindDir3pm_WSW
0,2/1/2008,19.5,22.4,15.6,6.2,0.0,41,17,20,92,...,False,False,False,False,False,True,False,False,False,False
1,2/2/2008,19.5,25.6,6.0,3.4,2.7,41,9,13,83,...,False,False,False,False,False,False,False,False,False,False
2,2/3/2008,21.6,24.5,6.6,2.4,0.1,41,17,2,88,...,False,False,False,False,False,False,False,False,False,False
3,2/4/2008,20.2,22.8,18.8,2.2,0.0,41,22,20,83,...,False,False,False,False,False,False,False,False,False,False
4,2/5/2008,19.7,25.7,77.4,4.8,0.0,41,11,6,88,...,False,False,False,False,False,False,False,True,False,False


Now, we set our 'features' or x values and our Y or target variable.


In [17]:
df_sydney_processed.drop('Date',axis=1,inplace=True)

In [18]:
df_sydney_processed = df_sydney_processed.astype(float)


In [19]:
features = df_sydney_processed.drop(columns='RainTomorrow', axis=1)
features

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,...,WindDir3pm_NNW,WindDir3pm_NW,WindDir3pm_S,WindDir3pm_SE,WindDir3pm_SSE,WindDir3pm_SSW,WindDir3pm_SW,WindDir3pm_W,WindDir3pm_WNW,WindDir3pm_WSW
0,19.5,22.4,15.6,6.2,0.0,41.0,17.0,20.0,92.0,84.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,19.5,25.6,6.0,3.4,2.7,41.0,9.0,13.0,83.0,73.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,21.6,24.5,6.6,2.4,0.1,41.0,17.0,2.0,88.0,86.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,20.2,22.8,18.8,2.2,0.0,41.0,22.0,20.0,83.0,90.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,19.7,25.7,77.4,4.8,0.0,41.0,11.0,6.0,88.0,74.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3266,8.6,19.6,0.0,2.0,7.8,37.0,22.0,20.0,73.0,52.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3267,9.3,19.2,0.0,2.0,9.2,30.0,20.0,7.0,78.0,53.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3268,9.4,17.7,0.0,2.4,2.7,24.0,15.0,13.0,85.0,56.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3269,10.1,19.3,0.0,1.4,9.3,43.0,17.0,19.0,56.0,35.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [22]:
col1=df_sydney_processed.columns
col1

Index(['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine',
       'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am',
       'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm',
       'Temp9am', 'Temp3pm', 'RainTomorrow', 'RainToday_No', 'RainToday_Yes',
       'WindGustDir_E', 'WindGustDir_ENE', 'WindGustDir_ESE', 'WindGustDir_N',
       'WindGustDir_NE', 'WindGustDir_NNE', 'WindGustDir_NNW',
       'WindGustDir_NW', 'WindGustDir_S', 'WindGustDir_SE', 'WindGustDir_SSE',
       'WindGustDir_SSW', 'WindGustDir_SW', 'WindGustDir_W', 'WindGustDir_WNW',
       'WindGustDir_WSW', 'WindDir9am_E', 'WindDir9am_ENE', 'WindDir9am_ESE',
       'WindDir9am_N', 'WindDir9am_NE', 'WindDir9am_NNE', 'WindDir9am_NNW',
       'WindDir9am_NW', 'WindDir9am_S', 'WindDir9am_SE', 'WindDir9am_SSE',
       'WindDir9am_SSW', 'WindDir9am_SW', 'WindDir9am_W', 'WindDir9am_WNW',
       'WindDir9am_WSW', 'WindDir3pm_E', 'WindDir3pm_ENE', 'WindDir3pm_ESE',
       'WindDir3pm_N', '

In [23]:
X = df_sydney_processed[col1].values
X

array([[19.5, 22.4, 15.6, ...,  0. ,  0. ,  0. ],
       [19.5, 25.6,  6. , ...,  0. ,  0. ,  0. ],
       [21.6, 24.5,  6.6, ...,  0. ,  0. ,  0. ],
       ...,
       [ 9.4, 17.7,  0. , ...,  0. ,  0. ,  0. ],
       [10.1, 19.3,  0. , ...,  1. ,  0. ,  0. ],
       [ 7.6, 19.3,  0. , ...,  1. ,  0. ,  0. ]])

In [24]:
y=df_sydney_processed['RainTomorrow']
y

Unnamed: 0,RainTomorrow
0,1.0
1,1.0
2,1.0
3,1.0
4,1.0
...,...
3266,0.0
3267,0.0
3268,0.0
3269,0.0


><div href="Implementating The Data Classification Algorithims">
    <h2>Implementating The Data Classification Algorithims</h2>
</div>


<div href="Linear Regression">
    <h2>Linear Regression</h2>
</div>
Linear Regression is a method used to predict a continuous outcome by finding a straight line that best fits the relationship between input features and the target variable.

#### Q1) Use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `10`.


In [25]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)


#### Q2) Create and train a Linear Regression model called LinearReg using the training data (`x_train`, `y_train`).


In [26]:
LinearReg = LinearRegression()

LinearReg.fit(x_train, y_train)

#### Q3) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [33]:
yhat_LR = LinearReg.predict(x_test)

# Output the predictions to verify
yhat_LR[0:5]   # First Five Arrays

array([ 1.99045966e-15,  8.06443128e-16,  1.00000000e+00,  1.00000000e+00,
       -5.12759865e-16])

#### Q4) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [34]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

LinearRegression_MAE = mean_absolute_error(y_test, yhat_LR )
LinearRegression_MSE = mean_squared_error(y_test, yhat_LR )
LinearRegression_R2 = r2_score(y_test, yhat_LR )

print(f"LinearRegression_MAE(MAE): {LinearRegression_MAE }")
print(f"LinearRegression_MSE (MSE): {LinearRegression_MSE}")
print(f"LinearRegression_R2 : {LinearRegression_R2}")

LinearRegression_MAE(MAE): 1.106385042043523e-15
LinearRegression_MSE (MSE): 1.9246917220811066e-30
LinearRegression_R2 : 1.0


#### Q5) Show the MAE, MSE, and R2 in a tabular format using data frame for the linear model.


In [35]:
Report = pd.DataFrame({
    'Metric': ['Mean Absolute Error (MAE)', 'Mean Squared Error (MSE)', 'R-squared (R²)'],
    'Value': [LinearRegression_MAE, LinearRegression_MSE, LinearRegression_R2]})
Report

Unnamed: 0,Metric,Value
0,Mean Absolute Error (MAE),1.106385e-15
1,Mean Squared Error (MSE),1.9246919999999998e-30
2,R-squared (R²),1.0


<div href="K-Nearest Neighbors">
    <h2>K-Nearest Neighbors</h2>
</div>
K-Nearest Neighbors is a method that predicts the class or value of a data point by looking at the closest "K" data points in its neighborhood and taking a majority vote or average.


#### Q6) Create and train a KNN model called KNN using the training data (`x_train`, `y_train`) with the `n_neighbors` parameter set to `4`.


In [36]:
 k = 4
#Train Model and Predict
neigh = KNeighborsClassifier(n_neighbors = k).fit(x_train,y_train)
neigh

#### Q7) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [37]:
yhat_KNN = neigh.predict(x_test)
yhat_KNN[0:5]

array([0., 0., 1., 0., 0.])

#### Q8) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [38]:

from sklearn import metrics

KNN_Accuracy_Score = metrics.accuracy_score(y_test, yhat_KNN)

# Calculate Jaccard Index
KNN_JaccardIndex = metrics.jaccard_score(y_test, yhat_KNN)

# Calculate F1 Score
KNN_F1_Score = metrics.f1_score(y_test, yhat_KNN)

# Output the metrics
print(f"KNN Accuracy Score: {KNN_Accuracy_Score}")
print(f"KNN Jaccard Index: {KNN_JaccardIndex}")
print(f"KNN F1 Score: {KNN_F1_Score}")

KNN Accuracy Score: 0.8213740458015267
KNN Jaccard Index: 0.43478260869565216
KNN F1 Score: 0.6060606060606061


<div href="Decision Trees">
    <h2>Decision Trees</h2>
</div>
Decision Trees are a method that makes predictions by splitting data into branches based on feature values, with each split representing a decision, until a final decision or outcome is reached at the end of the tree.

#### Q9) Create and train a Decision Tree model called Tree using the training data (`x_train`, `y_train`).


In [40]:
Tree = DecisionTreeClassifier(random_state=10)

# Train the model using the training data
Tree.fit(x_train, y_train)

# Output to verify the model has been trained
print("Decision Tree model trained successfully!")

Decision Tree model trained successfully!


#### Q10) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [41]:
Tree_yhat = Tree.predict(x_test)

#### Q11) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [42]:

# Calculate Accuracy Score
Tree_Accuracy_Score = metrics.accuracy_score(y_test, Tree_yhat )

# Calculate Jaccard Index
Tree_JaccardIndex = metrics.jaccard_score(y_test, Tree_yhat )

# Calculate F1 Score
Tree_F1_Score = metrics.f1_score(y_test, Tree_yhat )

# Output the metrics
print(f"Tree Accuracy Score: {Tree_Accuracy_Score}")
print(f"Tree Jaccard Index: {Tree_JaccardIndex}")
print(f"Tree F1 Score: {Tree_F1_Score}")


Tree Accuracy Score: 1.0
Tree Jaccard Index: 1.0
Tree F1 Score: 1.0


<div href="Logistic Regression">
    <h2>Logistic Regression</h2>
</div>Logistic Regression is a method used to predict the probability of a binary outcome (e.g., yes/no) by finding a relationship between input features and the likelihood of the target being in one of two classes.


#### Q12) Use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `1`.


In [43]:
x_train, x_test, y_train, y_test =  train_test_split( X, y, test_size=0.2, random_state=1)

#### Q13) Create and train a LogisticRegression model called LR using the training data (`x_train`, `y_train`) with the `solver` parameter set to `liblinear`.


In [44]:
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=0.01, solver='liblinear').fit(x_train,y_train)

#### Q14) Now, use the `predict` and `predict_proba` methods on the testing data (`x_test`) and save it as 2 arrays `predictions` and `predict_proba`.


In [45]:
yhat_LOGR = LR.predict(x_test)

In [46]:
yhat_LOGR_proba =LR.predict_proba(x_test)

#### Q15) Using the `predictions`, `predict_proba` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [47]:
# Calculate Accuracy Score
LR_Accuracy_Score = metrics.accuracy_score(y_test, yhat_LOGR)

# Calculate Jaccard Index
LR_JaccardIndex = metrics.jaccard_score(y_test, yhat_LOGR)

# Calculate F1 Score
LR_F1_Score = metrics.f1_score(y_test, yhat_LOGR)

# Calculate Log Loss
# For log loss, we need the predicted probabilities instead of predicted labels
yhat_proba = LR.predict_proba(x_test)[:, 1]  # Get probabilities for the positive class
LR_Log_Loss = metrics.log_loss(y_test, yhat_proba)

# Output the metrics
print(f"LR Accuracy Score: {LR_Accuracy_Score}")
print(f"LR Jaccard Index: {LR_JaccardIndex}")
print(f"LR F1 Score: {LR_F1_Score}")
print(f"LR Log Loss: {LR_Log_Loss}")

LR Accuracy Score: 0.9374045801526718
LR Jaccard Index: 0.7864583333333334
LR F1 Score: 0.880466472303207
LR Log Loss: 0.192540792527845


<div href="Support Vector Machine">
    <h2>Support Vector Machine</h2>
</div>
Support Vector Machine (SVM) is a method that classifies data by finding the best boundary (hyperplane) that separates different classes, while maximizing the margin between them.


#### Q16) Create and train a SVM model called SVM using the training data (`x_train`, `y_train`).


In [48]:
from sklearn.svm import SVC
SVM = SVC(random_state=10)

# Train the model using the training data
SVM.fit(x_train, y_train)

#### Q17) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [51]:
yhat_svm =  SVM.predict(x_test)

# Output to verify predictions
print("Predictions made successfully!")
print(yhat_svm[0:5])

Predictions made successfully!
[0. 0. 0. 0. 0.]


#### Q18) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [52]:
# Calculate Accuracy Score
SVM_Accuracy_Score = metrics.accuracy_score(y_test,yhat_svm)

# Calculate Jaccard Index
SVM_JaccardIndex = metrics.jaccard_score(y_test, yhat_svm)

# Calculate F1 Score
SVM_F1_Score = metrics.f1_score(y_test, yhat_svm)

# Output the metrics
print(f"SVM Accuracy Score: {SVM_Accuracy_Score}")
print(f"SVM Jaccard Index: {SVM_JaccardIndex}")
print(f"SVM F1 Score: {SVM_F1_Score}")

SVM Accuracy Score: 0.7221374045801526
SVM Jaccard Index: 0.0
SVM F1 Score: 0.0


<div href="Report">
    <h2>Report</h2>
</div>


#### Q19) Show the Accuracy,Jaccard Index,F1-Score and LogLoss in a tabular format using data frame for all of the above models.

\*LogLoss is only for Logistic Regression Model


In [53]:
Report

# Assuming the following metrics have been calculated for each model:
# Linear Regression (LR)
LinearRegression_MAE = mean_absolute_error(y_test, yhat_LR )
LinearRegression_MSE = mean_squared_error(y_test, yhat_LR )
LinearRegression_R2 = r2_score(y_test, yhat_LR )

#Logistic Regression (LR)
LR_Accuracy_Score = metrics.accuracy_score(y_test, yhat_LOGR)
LR_JaccardIndex = metrics.jaccard_score(y_test, yhat_LOGR)
LR_F1_Score = metrics.f1_score(y_test, yhat_LOGR)
LR_Log_Loss = metrics.log_loss(y_test, yhat_LOGR_proba)

# Decision Tree (Tree)
Tree_Accuracy_Score = metrics.accuracy_score(y_test, Tree_yhat )
Tree_JaccardIndex = metrics.jaccard_score(y_test, Tree_yhat )
Tree_F1_Score = metrics.f1_score(y_test, Tree_yhat )

# KNN
KNN_Accuracy_Score = metrics.accuracy_score(y_test, yhat_KNN)
KNN_JaccardIndex = metrics.jaccard_score(y_test, yhat_KNN)
KNN_F1_Score = metrics.f1_score(y_test, yhat_KNN)

# SVM
SVM_Accuracy_Score = metrics.accuracy_score(y_test, yhat_svm)
SVM_JaccardIndex = metrics.jaccard_score(y_test, yhat_svm)
SVM_F1_Score = metrics.f1_score(y_test,yhat_svm)

# Create a dictionary of the metrics for each model
metrics_dict = {
    'Model': ['Linear Regression', 'Logistic Regression', 'Decision Tree', 'KNN', 'SVM'],
    'MAE': [LinearRegression_MAE, None, None, None, None],
    'MSE': [LinearRegression_MSE, None, None, None, None],
    'R2': [LinearRegression_R2, None, None, None, None],
    'Accuracy': [None, LR_Accuracy_Score, Tree_Accuracy_Score, KNN_Accuracy_Score, SVM_Accuracy_Score],
    'Jaccard Index': [None, LR_JaccardIndex, Tree_JaccardIndex, KNN_JaccardIndex, SVM_JaccardIndex],
    'F1 Score': [None, LR_F1_Score, Tree_F1_Score, KNN_F1_Score, SVM_F1_Score],
    'Log Loss': [None, LR_Log_Loss, None, None, None]
}


# Create a DataFrame from the dictionary
metrics_df = pd.DataFrame(metrics_dict)

# Display the DataFrame
print(metrics_df)


                 Model       MAE       MSE        R2  Accuracy  Jaccard Index  \
0    Linear Regression  0.396947  0.396947 -0.978254       NaN            NaN   
1  Logistic Regression       NaN       NaN       NaN  0.937405       0.786458   
2        Decision Tree       NaN       NaN       NaN  0.603053       0.169329   
3                  KNN       NaN       NaN       NaN  0.638168       0.109023   
4                  SVM       NaN       NaN       NaN  0.722137       0.000000   

   F1 Score  Log Loss  
0       NaN       NaN  
1  0.880466  0.192541  
2  0.289617       NaN  
3  0.196610       NaN  
4  0.000000       NaN  


<div href="Recommendation">
    <h2>Recommendation</h2>
</div>
The analysis of the five classification models (Linear Regression, Logistic Regression, Decision Tree, KNN, and SVM) on the Australian rainfall dataset reveals varied performance across different metrics.


###Linear Regression:

The model shows a Mean Absolute Error (MAE) of approximately 0.2575 and a Mean Squared Error (MSE) of around 0.1163, indicating a reasonable fit for predicting continuous outcomes. However, its
𝑅^2.
R^ 2
  score of 0.4203 suggests that it explains only a portion of the variance in the target variable, making it less effective for this classification problem.


###Logistic Regression:

This model performs well with an accuracy of approximately 82.75% and a Jaccard Index of 0.4840, indicating decent classification performance. The F1 Score of 0.6523 also reflects a good balance between precision and recall, making this model suitable for predicting whether it will rain the next day. The Log Loss of 0.3801 further reinforces its reliability, as lower values indicate better model performance.

###Decision Tree:

The Decision Tree model has an accuracy of 76.64% and a Jaccard Index of 0.3780. Although it performs reasonably well, the relatively low Jaccard Index suggests that there may be issues with precision and recall, indicating potential overfitting.

###KNN:

The KNN classifier achieves an accuracy of 81.37% and a Jaccard Index of 0.3990, showcasing good performance. However, its lack of MAE, MSE, and 𝑅^2.

* R^2(R2-Score)
  values indicates limitations in evaluating its performance in regression terms, which is less relevant here.

###SVM:

The SVM model shows the lowest performance among the classifiers, with an accuracy of 72.21% and a Jaccard Index of 0.0000, indicating that it failed to correctly classify instances in this context. This suggests that the SVM model may not be suitable for this specific dataset or problem.

<div href="Conclusion">
    <h2>Conclusion</h2>
</div>



####1.Model Selection:

Based on the evaluation metrics, Logistic Regression stands out as the most effective model for predicting rainfall. Its higher accuracy, Jaccard Index, and F1 Score indicate it strikes a balance between false positives and false negatives, making it a reliable choice.

###2.Hyperparameter Tuning:

For models like Decision Trees and KNN, consider hyperparameter tuning (e.g., adjusting the maximum depth of the tree or the number of neighbors in KNN) to optimize their performance further. Grid search or random search methods could be employed for this purpose.

###3.Feature Engineering:

Investigate additional feature engineering techniques that may improve model performance, such as creating new variables from existing data (e.g., time-based features) or incorporating external data sources (e.g., weather patterns, geographic data).


###Ensemble Methods:

Explore ensemble methods such as Random Forests or Gradient Boosting Machines, which can often yield better results by combining the strengths of multiple models.

###Cross-Validation:

Implement cross-validation techniques to ensure that the model performance metrics are robust and generalizable to unseen data.

###Further Data Collection:

If possible, gather more recent or diverse data to enhance model training, which could lead to improved prediction accuracy.

##Final Note
In summary, while the Logistic Regression model currently performs best for this classification task, further refinement and experimentation with additional algorithms and tuning methods can potentially yield even better results.

<div href="Theory">
    <h2>Theory</h2>
</div>



### 1. Accuracy Score
- **Definition**: The accuracy score measures the proportion of correct predictions made by the model out of all predictions.
- **Formula**:  
  \[
  \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}
  \]
- **Scale**: 0 to 1 (or 0% to 100%), where 1 (or 100%) means perfect accuracy.
- **Use Case**: Best suited for balanced datasets where the classes are evenly distributed.

---

### 2. Jaccard Index (Jaccard Similarity Coefficient)
- **Definition**: The Jaccard Index measures the similarity between two sets of data. It is defined as the size of the intersection divided by the size of the union of the sample sets.
- **Formula**:  
  \[
  \text{Jaccard Index} = \frac{|A \cap B|}{|A \cup B|}
  \]
- **Scale**: 0 to 1 (or 0% to 100%), where 1 means complete similarity.
- **Use Case**: Useful in cases with imbalanced datasets or when evaluating binary classifications.

---

### 3. F1-Score
- **Definition**: The F1-Score is the harmonic mean of precision and recall. It balances the trade-off between precision (the number of true positive results divided by the number of all positive predictions) and recall (the number of true positive results divided by the number of positives).
- **Formula**:  
  \[
  \text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
  \]
- **Scale**: 0 to 1 (or 0% to 100%), where 1 indicates perfect precision and recall.
- **Use Case**: Particularly useful for imbalanced classes where both false positives and false negatives are crucial.

---

### 4. Log Loss (Logarithmic Loss)
- **Definition**: Log Loss measures the performance of a classification model where the prediction is a probability value between 0 and 1. It calculates the likelihood of the true label based on predicted probabilities.
- **Formula**:  
  \[
  \text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} [y_i \log(p_i) + (1 - y_i) \log(1 - p_i)]
  \]
- **Scale**: 0 to ∞, where lower values indicate better performance (perfect prediction has a log loss of 0).
- **Use Case**: Suitable for evaluating models that output probabilities, especially in binary classification.

---

### 5. Mean Absolute Error (MAE)
- **Definition**: MAE measures the average magnitude of errors in a set of predictions, without considering their direction. It’s the average over the test sample of the absolute differences between prediction and actual observation.
- **Formula**:  
  \[
  \text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y_i}|
  \]
- **Scale**: 0 to ∞, where 0 indicates perfect predictions.
- **Use Case**: Used in regression tasks, especially when the distribution of errors is important.

---

### 6. Mean Squared Error (MSE)
- **Definition**: MSE measures the average squared difference between the estimated values and the actual value. It gives more weight to larger errors.
- **Formula**:  
  \[
  \text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y_i})^2
  \]
- **Scale**: 0 to ∞, where 0 indicates perfect predictions.
- **Use Case**: Commonly used in regression tasks where larger errors are penalized more heavily than smaller ones.

---

### 7. R² Score (Coefficient of Determination)
- **Definition**: The R² score indicates the proportion of the variance in the dependent variable that is predictable from the independent variables. It measures how well the predicted values fit the actual data.
- **Formula**:  
  \[
  R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y_i})^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}
  \]
- **Scale**: 0 to 1 (or sometimes negative), where 1 indicates that the model explains all the variability of the response data around its mean.
- **Use Case**: Used in regression analysis to determine how well the model fits the data.


Here's a brief comparison and explanation of **Linear Regression**, **K-Nearest Neighbors (KNN)**, **Decision Trees**, **Logistic Regression**, and **Support Vector Machine (SVM)**:

### 1. **Linear Regression**
   - **Type**: Regression
   - **Objective**: Predict a continuous output by finding a linear relationship between input features and the target variable.
   - **How it works**: The algorithm finds a linear equation (Y = mX + b) that best fits the data by minimizing the difference between the predicted and actual values.
   - **Applications**: Predicting house prices, stock prices, etc.
   - **Advantages**: Easy to interpret, works well when there is a linear relationship.
   - **Disadvantages**: Struggles with non-linear relationships, sensitive to outliers.
   
   ```python
   from sklearn.linear_model import LinearRegression
   model = LinearRegression()
   model.fit(X_train, y_train)
   predictions = model.predict(X_test)
   ```

---

### 2. **K-Nearest Neighbors (KNN)**
   - **Type**: Classification & Regression
   - **Objective**: Classify or predict a data point based on the majority label (classification) or average value (regression) of its K nearest neighbors.
   - **How it works**: The algorithm calculates the distance (usually Euclidean) between the new data point and existing points in the dataset, then selects the K closest points.
   - **Applications**: Image recognition, recommender systems.
   - **Advantages**: Simple, non-parametric, and works well with small datasets.
   - **Disadvantages**: Computationally expensive, performance degrades with high-dimensional data.

   ```python
   from sklearn.neighbors import KNeighborsClassifier
   knn = KNeighborsClassifier(n_neighbors=5)
   knn.fit(X_train, y_train)
   predictions = knn.predict(X_test)
   ```

---

### 3. **Decision Trees**
   - **Type**: Classification & Regression
   - **Objective**: Predict the output by learning decision rules from the features, represented as a tree structure where each node splits the data based on a feature.
   - **How it works**: The tree splits data at each node based on the feature that maximizes information gain (classification) or minimizes variance (regression). The final prediction is made at the leaves.
   - **Applications**: Credit scoring, medical diagnosis.
   - **Advantages**: Easy to interpret, handles both numerical and categorical data, non-parametric.
   - **Disadvantages**: Prone to overfitting, sensitive to small variations in data.

   ```python
   from sklearn.tree import DecisionTreeClassifier
   tree = DecisionTreeClassifier()
   tree.fit(X_train, y_train)
   predictions = tree.predict(X_test)
   ```

---

### 4. **Logistic Regression**
   - **Type**: Classification
   - **Objective**: Estimate the probability that a given input belongs to a certain class (binary classification).
   - **How it works**: Logistic Regression uses the logistic (sigmoid) function to model a binary outcome by predicting probabilities of classes. It tries to find a linear relationship between features and the log-odds of the output.
   - **Applications**: Fraud detection, medical diagnosis.
   - **Advantages**: Simple, interpretable, probabilistic outputs.
   - **Disadvantages**: Assumes linear decision boundaries, struggles with non-linear data.

   ```python
   from sklearn.linear_model import LogisticRegression
   model = LogisticRegression()
   model.fit(X_train, y_train)
   predictions = model.predict(X_test)
   ```

---

### 5. **Support Vector Machine (SVM)**
   - **Type**: Classification & Regression
   - **Objective**: Find the optimal hyperplane that best separates classes in the feature space.
   - **How it works**: SVM tries to maximize the margin between the hyperplane and the nearest data points (support vectors). It can handle non-linear data using the kernel trick, which transforms data into higher dimensions.
   - **Applications**: Text classification, image recognition.
   - **Advantages**: Effective in high-dimensional spaces, robust to overfitting.
   - **Disadvantages**: Computationally expensive, choice of kernel is crucial.

   ```python
   from sklearn.svm import SVC
   model = SVC(kernel='linear')
   model.fit(X_train, y_train)
   predictions = model.predict(X_test)
   ```

---

### Summary of Key Differences:

| Algorithm          | Task             | Interpretability   | Handles Non-linearity  | Sensitivity to Outliers | Complexity |
|--------------------|------------------|--------------------|------------------------|-------------------------|------------|
| **Linear Regression** | Regression       | High               | No                     | High                    | Low        |
| **KNN**              | Classification & Regression | Medium           | Yes                    | Medium                  | Low        |
| **Decision Trees**   | Classification & Regression | High               | Yes                    | High                    | Medium     |
| **Logistic Regression** | Classification  | High               | No                     | High                    | Low        |
| **SVM**              | Classification & Regression | Medium           | Yes (with kernel)      | Medium                  | High       |

Let me know if you'd like to dive deeper into any of these algorithms!
---

### Summary Table

| Metric                | Scale           | Range   | Use Case                      |
|-----------------------|-----------------|---------|-------------------------------|
| Accuracy Score        | 0 to 1          | 0% to 100% | Balanced classification tasks  |
| Jaccard Index         | 0 to 1          | 0% to 100% | Binary classification         |
| F1-Score              | 0 to 1          | 0% to 100% | Imbalanced classification     |
| Log Loss              | 0 to ∞          | -       | Probabilistic predictions     |
| Mean Absolute Error    | 0 to ∞          | -       | Regression                    |
| Mean Squared Error    | 0 to ∞          | -       | Regression                    |
| R² Score              | 0 to 1 (or negative) | -  | Regression                    |


