# Rain Prediction in Australia - project

<h2>Table of Contents</h2>
<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ul>
    <li><a href="https://#Section_1">Instructions</a></li>
    <li><a href="https://#Section_2">About the Data</a></li>
    <li><a href="https://#Section_3">Importing Data</a></li>
    <li><a href="https://#Section_4">Data Preprocessing</a></li>
    <li><a href="https://#Section_5">One Hot Encoding</a></li>
    <li><a href="https://#Section_6">Train and Test Data Split</a></li>
    <li><a href="https://#Section_7">Train Logistic Regression, KNN, Decision Tree, SVM, and Linear Regression models and return their appropriate accuracy scores</a></li>
    </ul>
</div>

<hr>





In this notebook, we use the algorithms listed below to build models on the training data, then evaluate them on the test data using the evaluation metrics covered in the course.

We will use some of the algorithms taught in the course, specifically:

1. Linear Regression
2. KNN
3. Decision Trees
4. Logistic Regression
5. SVM

We will evaluate our models using:

1.  Accuracy Score
2.  Jaccard Index
3.  F1-Score
4.  LogLoss
5.  Mean Absolute Error
6.  Mean Squared Error
7.  R2-Score

Finally, we use the models to generate a report at the end.


# About The Dataset


The original source of the data is the Australian Government's Bureau of Meteorology, and the latest data can be obtained from [http://www.bom.gov.au/climate/dwo/](http://www.bom.gov.au/climate/dwo/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).

The dataset used here has extra columns such as 'RainToday', and its target is 'RainTomorrow'. It was obtained from Rattle at [https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData](https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)




This dataset contains observations of weather metrics for each day from 2008 to 2017. The **weatherAUS.csv** dataset includes the following fields:

| Field         | Description                                           | Unit            | Type   |
| ------------- | ----------------------------------------------------- | --------------- | ------ |
| Date          | Date of the Observation in YYYY-MM-DD                 | Date            | object |
| Location      | Location of the Observation                           | Location        | object |
| MinTemp       | Minimum temperature                                   | Celsius         | float  |
| MaxTemp       | Maximum temperature                                   | Celsius         | float  |
| Rainfall      | Amount of rainfall                                    | Millimeters     | float  |
| Evaporation   | Amount of evaporation                                 | Millimeters     | float  |
| Sunshine      | Amount of bright sunshine                             | hours           | float  |
| WindGustDir   | Direction of the strongest gust                       | Compass Points  | object |
| WindGustSpeed | Speed of the strongest gust                           | Kilometers/Hour | object |
| WindDir9am    | Wind direction averaged over 10 minutes prior to 9am  | Compass Points  | object |
| WindDir3pm    | Wind direction averaged over 10 minutes prior to 3pm  | Compass Points  | object |
| WindSpeed9am  | Wind speed averaged over 10 minutes prior to 9am      | Kilometers/Hour | float  |
| WindSpeed3pm  | Wind speed averaged over 10 minutes prior to 3pm      | Kilometers/Hour | float  |
| Humidity9am   | Humidity at 9am                                       | Percent         | float  |
| Humidity3pm   | Humidity at 3pm                                       | Percent         | float  |
| Pressure9am   | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal     | float  |
| Pressure3pm   | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal     | float  |
| Cloud9am      | Fraction of the sky obscured by cloud at 9am          | Eighths         | float  |
| Cloud3pm      | Fraction of the sky obscured by cloud at 3pm          | Eighths         | float  |
| Temp9am       | Temperature at 9am                                    | Celsius         | float  |
| Temp3pm       | Temperature at 3pm                                    | Celsius         | float  |
| RainToday     | If there was rain today                               | Yes/No          | object |
| RainTomorrow  | If there is rain tomorrow                             | Yes/No          | float  |

Column definitions were gathered from [http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml](http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)



## **Import the required libraries**


In [None]:
# All Libraries required for this lab are listed below. The libraries pre-installed on Skills Network Labs are commented.
# !mamba install -qy pandas==1.3.4 numpy==1.21.4 seaborn==0.9.0 matplotlib==3.5.0 scikit-learn==0.20.1
# Note: If your environment doesn't support "!mamba install", use "!pip install"

In [1]:
# Suppress warnings:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
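
Replacing `warnings.warn` with a no-op, as the cell above does, silences every warning for the rest of the session. As a narrower alternative, the following sketch uses the standard library's `warnings.catch_warnings` to suppress warnings only inside one block. The `noisy` function is a hypothetical stand-in for a library call and is not part of the lab.

```python
import warnings


def noisy():
    # Hypothetical stand-in for a library call that emits a warning.
    warnings.warn("this is only a demonstration", UserWarning)
    return 42


# Suppress warnings only inside this block; the filters are restored on exit.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy()

# Outside the block, warnings are delivered again and can be recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy()

print(result, len(caught))
```

Scoping the filter this way keeps messages such as convergence or data-conversion warnings visible in the rest of the notebook.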

In [3]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
from sklearn.metrics import confusion_matrix, accuracy_score
import sklearn.metrics as metrics

### Importing the Dataset


In [5]:
path='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv'

In [38]:
df = pd.read_csv(path)
df.head()

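|   | Date | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
| - | ---- | ------- | ------- | -------- | ----------- | -------- | ----------- | ------------- | ---------- | ---------- | --- | ----------- | ----------- | ----------- | ----------- | -------- | -------- | ------- | ------- | --------- | ------------ |
| 0 | 2/1/2008 | 19.5 | 22.4 | 15.6 | 6.2 | 0.0 | W | 41 | S | SSW | ... | 92 | 84 | 1017.6 | 1017.4 | 8 | 8 | 20.7 | 20.9 | Yes | Yes |
| 1 | 2/2/2008 | 19.5 | 25.6 | 6.0 | 3.4 | 2.7 | W | 41 | W | E | ... | 83 | 73 | 1017.9 | 1016.4 | 7 | 7 | 22.4 | 24.8 | Yes | Yes |
| 2 | 2/3/2008 | 21.6 | 24.5 | 6.6 | 2.4 | 0.1 | W | 41 | ESE | ESE | ... | 88 | 86 | 1016.7 | 1015.6 | 7 | 8 | 23.5 | 23.0 | Yes | Yes |
| 3 | 2/4/2008 | 20.2 | 22.8 | 18.8 | 2.2 | 0.0 | W | 41 | NNE | E | ... | 83 | 90 | 1014.2 | 1011.8 | 8 | 8 | 21.4 | 20.9 | Yes | Yes |
| 4 | 2/5/2008 | 19.7 | 25.7 | 77.4 | 4.8 | 0.0 | W | 41 | NNE | W | ... | 88 | 74 | 1008.3 | 1004.8 | 8 | 8 | 22.5 | 25.5 | Yes | Yes |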


### Data Preprocessing


#### One Hot Encoding


First, we need to perform one hot encoding to convert categorical variables to binary variables.


In [9]:
df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])

Next, we replace the values of the 'RainTomorrow' column, changing it from a categorical column to a binary column. We do not use the `get_dummies` method here because it would produce two columns for 'RainTomorrow', which we do not want since 'RainTomorrow' is our target.


In [10]:
df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)
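
The two encoding steps above can be illustrated on a small frame. This is a minimal sketch on made-up rows, not the weather data; the name `toy` is hypothetical, and the explicit `map` on the target column is an alternative to the frame-wide `replace` used in the lab, chosen here so that only one column is touched.

```python
import pandas as pd

# Tiny made-up frame with the same kinds of columns as the weather data.
toy = pd.DataFrame({
    "MinTemp": [19.5, 9.4, 7.6],
    "RainToday": ["Yes", "No", "No"],
    "WindGustDir": ["W", "NNE", "W"],
    "RainTomorrow": ["Yes", "No", "No"],
})

# One-hot encode the categorical predictors only.
encoded = pd.get_dummies(toy, columns=["RainToday", "WindGustDir"])

# Map the target explicitly so that only this column is changed.
encoded["RainTomorrow"] = encoded["RainTomorrow"].map({"No": 0, "Yes": 1})

print(encoded.columns.tolist())
```

Each categorical predictor becomes one indicator column per category, while the target stays a single 0/1 column.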

### Training Data and Test Data


Now we define the feature matrix (`features`) and the target variable (`Y`).


In [11]:
df_sydney_processed.drop('Date',axis=1,inplace=True)

In [12]:
df_sydney_processed = df_sydney_processed.astype(float)

In [85]:
features = df_sydney_processed.drop(columns='RainTomorrow')
Y = df_sydney_processed['RainTomorrow']

In [79]:
df_sydney_processed.shape

(3271, 67)

In [103]:
features = np.asarray(features)
features

array([[19.5, 22.4, 15.6, ...,  0. ,  0. ,  0. ],
       [19.5, 25.6,  6. , ...,  0. ,  0. ,  0. ],
       [21.6, 24.5,  6.6, ...,  0. ,  0. ,  0. ],
       ...,
       [ 9.4, 17.7,  0. , ...,  0. ,  0. ,  0. ],
       [10.1, 19.3,  0. , ...,  1. ,  0. ,  0. ],
       [ 7.6, 19.3,  0. , ...,  1. ,  0. ,  0. ]])

In [127]:
X = preprocessing.StandardScaler().fit(features).transform(features.astype(float))
X[0:5]

array([[ 1.01512601, -0.13507807,  1.23613927,  0.37146005, -1.87896481,
        -0.04408087,  0.27304105,  0.09468284,  1.57493504,  1.80020165,
        -0.10463342,  0.19902353,  1.45711035,  1.58608764,  0.58822905,
        -0.1498131 , -1.6890139 ,  1.6890139 , -0.19683104, -0.26458131,
        -0.2114572 , -0.0962102 , -0.23341161, -0.21846239, -0.12709872,
        -0.14570551, -0.23341161, -0.14570551, -0.28075943, -0.22900939,
        -0.12459185,  1.13817336, -0.15731934, -0.20667842, -0.21224496,
        -0.15731934, -0.20667842, -0.1775832 , -0.1583367 , -0.16724183,
        -0.18031258, -0.16333775,  4.46855108, -0.1775832 , -0.19599158,
        -0.24484312, -0.1319786 , -0.79155161, -0.4172346 , -0.17481649,
        -0.48552917, -0.34946664, -0.33553693, -0.1319786 , -0.33327672,
        -0.13899878, -0.10841489, -0.13553087, -0.28454344, -0.26458131,
        -0.30236527,  4.5774469 , -0.11266543, -0.25587558, -0.24343717,
        -0.18742087],
       [ 1.01512601,  0.57871
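
Note that the standardized array `X` is computed from all rows, and the later cells in this notebook split `features` rather than `X`, so the models are trained on unscaled values. The sketch below shows one common way to use scaling, under the assumption that it is wanted: fit the scaler on the training rows only and reuse it for the test rows. The arrays are synthetic stand-ins and every name in the block is hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for `features`: two columns on very different scales.
toy_features = np.column_stack([rng.normal(1015, 7, 200), rng.normal(0.5, 0.1, 200)])
toy_target = rng.integers(0, 2, 200)

tr_x, te_x, tr_y, te_y = train_test_split(
    toy_features, toy_target, test_size=0.2, random_state=10
)

scaler = StandardScaler().fit(tr_x)    # statistics come from the training rows only
tr_x_scaled = scaler.transform(tr_x)
te_x_scaled = scaler.transform(te_x)   # test rows reuse the training mean and std

print(tr_x_scaled.mean(axis=0).round(6), tr_x_scaled.std(axis=0).round(6))
```

Distance-based and kernel-based models such as KNN and SVM are usually sensitive to feature scale, which is why this step tends to matter for them.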

In [105]:
Y = np.asarray(Y).reshape(-1, 1)
Y

array([[1.],
       [1.],
       [1.],
       ...,
       [0.],
       [0.],
       [0.]])
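
The reshape above turns `Y` into a column vector. Several scikit-learn classifiers expect a one-dimensional target and convert a column vector internally, usually with a warning. A small sketch of the two shapes, using made-up labels:

```python
import numpy as np

labels = np.array([1.0, 1.0, 0.0, 0.0])  # made-up labels
column = labels.reshape(-1, 1)           # shape (4, 1), as in the cell above
flat = column.ravel()                    # shape (4,), the 1-D form

print(column.shape, flat.shape)
```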

In [80]:
df_sydney_processed.dtypes

MinTemp           float64
MaxTemp           float64
Rainfall          float64
Evaporation       float64
Sunshine          float64
                   ...   
WindDir3pm_SSW    float64
WindDir3pm_SW     float64
WindDir3pm_W      float64
WindDir3pm_WNW    float64
WindDir3pm_WSW    float64
Length: 67, dtype: object

### Linear Regression


#### Q1) Use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `10`.


In [128]:
x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=10)

In [129]:
print(x_train.shape)
print(y_train.shape)

(2616, 66)
(2616, 1)
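
The shapes above follow from the row count reported earlier (3271 rows, 66 feature columns). As a quick check, the sketch below repeats the split on placeholder arrays of the same size; the array contents are zeros and the names are hypothetical, since only the shapes matter here.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_rows, n_cols = 3271, 66                    # sizes reported earlier in this notebook
placeholder_x = np.zeros((n_rows, n_cols))   # contents are irrelevant for a shape check
placeholder_y = np.zeros((n_rows, 1))

a_train, a_test, b_train, b_test = train_test_split(
    placeholder_x, placeholder_y, test_size=0.2, random_state=10
)

expected_test = math.ceil(0.2 * n_rows)      # the test share is rounded up
print(a_train.shape, a_test.shape, expected_test)
```

With 3271 rows, a 20% test share rounds up to 655 rows, leaving 2616 for training.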


#### Q2) Create and train a Linear Regression model called LinearReg using the training data (`x_train`, `y_train`).


In [109]:
LinearReg = LinearRegression()

In [110]:
LinearReg.fit(x_train, y_train)

#### Q3) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [111]:
predictions = LinearReg.predict(x_test)
predictions[:10]

array([[0.13184071],
       [0.2761859 ],
       [0.97818819],
       [0.2874561 ],
       [0.13241371],
       [0.46046418],
       [0.35678746],
       [0.85640685],
       [0.67501191],
       [0.03824739]])

#### Q4) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [114]:
LinearRegression_MAE = metrics.mean_absolute_error(y_test, predictions)
LinearRegression_MSE = metrics.mean_squared_error(y_test, predictions)
LinearRegression_R2 = metrics.r2_score(y_test, predictions)

#### Q5) Show the MAE, MSE, and R2 in a tabular format using data frame for the linear model.


In [113]:
# Use a name other than `metrics` so the imported `sklearn.metrics` module is not shadowed
linear_metrics = {
    "Metric": ["MAE", "MSE", "R2"],
    "Value": [LinearRegression_MAE, LinearRegression_MSE, LinearRegression_R2]
}
# Create a Pandas DataFrame from the dictionary
Report = pd.DataFrame(linear_metrics)
Report

|     | Metric | Value    |
| --- | ------ | -------- |
| 0   | MAE    | 0.256325 |
| 1   | MSE    | 0.115723 |
| 2   | R2     | 0.427119 |


## Model Performance:

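* Mean Absolute Error (MAE): 0.2563 (average absolute difference between the predicted value and the actual 0/1 `RainTomorrow` label)
* Mean Squared Error (MSE): 0.1157 (average squared difference between the predicted value and the actual label)
* R-squared: 0.4271 (the model explains approximately 43% of the variance in the `RainTomorrow` label)

## Interpretation:

* The target is the binary `RainTomorrow` label, not a rainfall amount, so the predictions are best read as rough scores for rain. On average they differ from the true label by 0.2563.
* An R-squared of 0.43 shows some explanatory power, but linear regression is not designed for a binary target: its outputs are not constrained to the range 0 to 1. The classifiers in the following sections are a more natural fit for this task.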
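
To make the three regression metrics concrete for a 0/1 target, the sketch below computes them by hand with NumPy. The labels and predictions are made up for illustration, and the 0.5 threshold used to turn scores into class labels is an assumption rather than a step taken in the lab.

```python
import numpy as np

# Made-up 0/1 labels and real-valued predictions, for illustration only.
y_true = np.array([0, 0, 1, 1, 0])
y_hat = np.array([0.1, 0.4, 0.8, 0.6, 0.2])

mae = np.mean(np.abs(y_true - y_hat))
mse = np.mean((y_true - y_hat) ** 2)
r2 = 1 - mse / np.var(y_true)   # np.var is the population variance of the labels

# Thresholding at 0.5 (an assumed cut-off) turns the scores into class labels.
labels = (y_hat >= 0.5).astype(int)

print(round(mae, 3), round(mse, 3), round(r2, 3), labels)
```

The last line of the calculation shows why R-squared depends on the class balance: the denominator is the variance of the 0/1 labels.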

### KNN


#### Q6) Create and train a KNN model called KNN using the training data (`x_train`, `y_train`) with the `n_neighbors` parameter set to `4`.


In [130]:
k = 4
#Train Model and Predict  
KNN = KNeighborsClassifier(n_neighbors = k).fit(x_train,y_train)
KNN

#### Q7) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [131]:
# Make predictions on the test data using the trained KNN model
predictions_neigh = KNN.predict(x_test)
predictions_neigh[:8]

array([0., 0., 1., 0., 0., 0., 0., 1.])

#### Q8) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [132]:
KNN_Accuracy_Score = accuracy_score(y_test, predictions_neigh)
KNN_JaccardIndex = jaccard_score(y_test, predictions_neigh)
KNN_F1_Score = f1_score(y_test, predictions_neigh)

In [133]:
matrix = {
    "Metric":['Accuracy Score', 'Jaccard Index', 'F1_score'],
    "Value":[KNN_Accuracy_Score, KNN_JaccardIndex, KNN_F1_Score]
}

report3 = pd.DataFrame(matrix)
report3

|     | Metric         | Value    |
| --- | -------------- | -------- |
| 0   | Accuracy Score | 0.818321 |
| 1   | Jaccard Index  | 0.425121 |
| 2   | F1_score       | 0.59661  |


## Using a different k value

In [140]:
k = 5
#Train Model and Predict  
KNN2 = KNeighborsClassifier(n_neighbors = k).fit(x_train,y_train)
KNN2

In [141]:
# Make predictions on the test data using the trained KNN model
predictions_neigh2 = KNN2.predict(x_test)
predictions_neigh2[:8]

array([0., 0., 1., 0., 0., 0., 0., 1.])

In [143]:
KNN_Accuracy_Score2 = accuracy_score(y_test, predictions_neigh2)
KNN_JaccardIndex2 = jaccard_score(y_test, predictions_neigh2)
KNN_F1_Score2 = f1_score(y_test, predictions_neigh2)

In [144]:
matrix = {
    "Metric":['Accuracy Score', 'Jaccard Index', 'F1_score'],
    "Value":[KNN_Accuracy_Score2, KNN_JaccardIndex2, KNN_F1_Score2]
}

report3 = pd.DataFrame(matrix)
report3

|     | Metric         | Value    |
| --- | -------------- | -------- |
| 0   | Accuracy Score | 0.819847 |
| 1   | Jaccard Index  | 0.466063 |
| 2   | F1_score       | 0.635802 |
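
Comparing values of k on the test set, as done above for k = 4 and k = 5, risks tuning the model to that particular split. One common alternative is cross-validation on the training data. The following is a minimal sketch on synthetic data from `make_classification`; the data, the range of k, and the 5-fold setting are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training split (not the weather data).
demo_x, demo_y = make_classification(n_samples=300, n_features=10, random_state=0)

scores = {}
for k in range(1, 11):
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean 5-fold cross-validated accuracy, computed on training data only.
    scores[k] = cross_val_score(model, demo_x, demo_y, cv=5, scoring="accuracy").mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

The test set is then used once, at the end, to score the chosen k.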


## Interpretation:

* Accuracy (0.8198): With k = 5, the KNN model classifies about 82% of the test days correctly. Accuracy alone can flatter a model on this dataset because dry days are the majority class, so it should be read alongside the other two metrics.
* Jaccard Index (0.4661): For the rain class, this is the number of days correctly predicted as rainy divided by the number of days that were either predicted as rainy or actually rainy. A value of 0.47 means that fewer than half of those days were true hits; the rest were false alarms or missed rain days.
* F1-Score (0.6358): This is the harmonic mean of precision (the share of predicted rain days that were actually rainy) and recall (the share of actual rain days that were predicted). A score of 0.64 indicates moderate performance on the rain class; it does not by itself show whether precision and recall are balanced.
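
Both the Jaccard Index and the F1-score can be written in terms of the true positives (TP), false positives (FP), and false negatives (FN) of one class. The sketch below uses made-up counts and a hypothetical helper, `jaccard_and_f1`, to show the two formulas and the identity that links them.

```python
def jaccard_and_f1(tp, fp, fn):
    """Return (Jaccard, F1) for one class from its confusion counts."""
    jaccard = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return jaccard, f1


# Made-up counts for the "rain" class, for illustration only.
j, f1 = jaccard_and_f1(tp=30, fp=10, fn=20)

# For the same class, the two scores are linked by J = F1 / (2 - F1).
implied_j = f1 / (2 - f1)

print(round(j, 4), round(f1, 4), round(implied_j, 4))
```

Applied to the KNN scores reported above, 0.635802 / (2 - 0.635802) is approximately 0.4661, which agrees with the reported Jaccard Index. The two metrics therefore rank models identically when they are computed for the same class.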


### Decision Tree


#### Q9) Create and train a Decision Tree model called Tree using the training data (`x_train`, `y_train`).


In [121]:
Tree = DecisionTreeClassifier().fit(x_train, y_train)
Tree

#### Q10) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [122]:
predictions_tree = Tree.predict(x_test)
predictions_tree[:10]

array([0., 0., 1., 0., 0., 1., 1., 1., 1., 1.])

#### Q11) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [123]:
Tree_Accuracy_Score = accuracy_score(y_test, predictions_tree)
Tree_JaccardIndex = jaccard_score(y_test, predictions_tree)
Tree_F1_Score = f1_score(y_test, predictions_tree)

In [124]:
matrix = {
    'Metric':['Tree_Accuracy_Score', 'Tree_JaccardIndex', 'Tree_F1_Score'],
    'Value':[Tree_Accuracy_Score, Tree_JaccardIndex, Tree_F1_Score]
}

report_tree = pd.DataFrame(matrix) 
report_tree

|     | Metric              | Value    |
| --- | ------------------- | -------- |
| 0   | Tree_Accuracy_Score | 0.738931 |
| 1   | Tree_JaccardIndex   | 0.373626 |
| 2   | Tree_F1_Score       | 0.544    |


## Interpretation:

* Accuracy (0.7389): The model classifies about 74% of the test days correctly.
* Jaccard Index (0.3736): For the rain class, this is the number of days correctly predicted as rainy divided by the number of days that were either predicted as rainy or actually rainy. A value of 0.37 means that only a little over a third of those days were true hits.
* F1-Score (0.5440): This is the harmonic mean of precision (the share of predicted rain days that were actually rainy) and recall (the share of actual rain days that were predicted). A score of 0.54 indicates weaker performance on the rain class than KNN (0.64).
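
A `DecisionTreeClassifier` created without `random_state` can produce slightly different trees, and therefore slightly different scores, from run to run. The sketch below, on synthetic data with hypothetical names, shows that fixing `random_state` makes two fits agree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (not the weather data).
demo_x, demo_y = make_classification(n_samples=300, n_features=20, random_state=0)

# Fixing random_state makes the choice between equally good splits repeatable.
tree_a = DecisionTreeClassifier(random_state=0).fit(demo_x, demo_y)
tree_b = DecisionTreeClassifier(random_state=0).fit(demo_x, demo_y)

same = np.array_equal(tree_a.predict(demo_x), tree_b.predict(demo_x))
print(same)
```

Setting the seed keeps the reported table and its written interpretation in step when the notebook is re-run.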

### Logistic Regression


#### Q12) Use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `1`.


In [145]:
x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=1)

#### Q13) Create and train a LogisticRegression model called LR using the training data (`x_train`, `y_train`) with the `solver` parameter set to `liblinear`.


In [147]:
LR = LogisticRegression(solver='liblinear').fit(x_train, y_train)
LR

#### Q14) Now, use the `predict` and `predict_proba` methods on the testing data (`x_test`) and save it as 2 arrays `predictions` and `predict_proba`.


In [154]:
predictions_lr = LR.predict(x_test)
predictions_lr[:10]

array([0., 0., 0., 0., 0., 1., 0., 0., 0., 0.])

In [155]:
predict_proba = LR.predict_proba(x_test)
predict_proba[:15]

array([[0.74087112, 0.25912888],
       [0.97444658, 0.02555342],
       [0.52599984, 0.47400016],
       [0.84829603, 0.15170397],
       [0.9667289 , 0.0332711 ],
       [0.06155102, 0.93844898],
       [0.69523495, 0.30476505],
       [0.96243641, 0.03756359],
       [0.92078303, 0.07921697],
       [0.93372274, 0.06627726],
       [0.31957095, 0.68042905],
       [0.44654511, 0.55345489],
       [0.68890187, 0.31109813],
       [0.97850538, 0.02149462],
       [0.98622478, 0.01377522]])
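
Each row of the `predict_proba` output holds one probability per class, in the order given by the model's `classes_` attribute, and `predict` returns the class with the larger probability. The following sketch checks this relationship on synthetic data; the data and variable names are assumptions, not part of the lab.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the weather features and labels.
demo_x, demo_y = make_classification(n_samples=200, n_features=8, random_state=1)
model = LogisticRegression(solver="liblinear").fit(demo_x, demo_y)

proba = model.predict_proba(demo_x)                 # one column per entry in model.classes_
from_proba = model.classes_[proba.argmax(axis=1)]   # most probable class per row

print(model.classes_, proba[:2].round(3))
```

In the output above, the second column is therefore the estimated probability of rain (label 1).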

#### Q15) Using the `predictions`, `predict_proba` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [157]:
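LR_Accuracy_Score = accuracy_score(y_test, predictions_lr)
# Note: pos_label=0 scores the "no rain" class, unlike the KNN and Decision Tree cells, which use the default pos_label=1
LR_JaccardIndex = jaccard_score(y_test, predictions_lr,pos_label=0)
LR_F1_Score = f1_score(y_test, predictions_lr)  # default: F1 for the rain class (label 1)
LR_Log_Loss = log_loss(y_test, predict_proba)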

In [158]:
matrix = {
    'Metric':['Accuracy Score', 'Jaccard Index', 'F1_score', 'Log Loss'],
    'Value':[LR_Accuracy_Score, LR_JaccardIndex, LR_F1_Score, LR_Log_Loss]
}

report4 = pd.DataFrame(matrix)
report4

|     | Metric         | Value    |
| --- | -------------- | -------- |
| 0   | Accuracy Score | 0.833588 |
| 1   | Jaccard Index  | 0.800366 |
| 2   | F1_score       | 0.666667 |
| 3   | Log Loss       | 0.38115  |


## Interpretations:


* Accuracy Score (0.834): The model classifies about 83% of the test days correctly.

* Jaccard Index (0.800): This value was computed with `pos_label=0`, so it describes the "no rain" class: of the days that were either predicted dry or actually dry, about 80% were both. It is not comparable with the KNN and Decision Tree Jaccard values, which describe the rain class. For the rain class, the identity J = F1 / (2 - F1) with F1 = 0.667 gives a Jaccard Index of 0.50.

* F1-score (0.667): This value was computed for the rain class (the default). It is the harmonic mean of precision and recall, and 0.667 indicates moderate performance in identifying rain days. The apparent gap between the Jaccard Index and the F1-score comes from scoring different classes, not from a difference in what the metrics reveal.

* Log Loss (0.381): Log Loss is computed here on the test data from the predicted probabilities, and lower is better. For reference, a model that always outputs a probability of 0.5 would score ln 2, approximately 0.693, so 0.381 shows that the predicted probabilities carry useful information.
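
Log Loss is the average negative log-likelihood of the true labels under the predicted probabilities. The sketch below implements the binary case by hand with a hypothetical helper, `binary_log_loss`, on made-up labels and probabilities.

```python
import math


def binary_log_loss(y_true, p_rain):
    """Mean negative log-likelihood of 0/1 labels given P(label == 1)."""
    total = 0.0
    for y, p in zip(y_true, p_rain):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Made-up labels and probabilities, for illustration only.
labels = [0, 0, 1, 1]
confident = binary_log_loss(labels, [0.1, 0.2, 0.9, 0.8])
uninformed = binary_log_loss(labels, [0.5, 0.5, 0.5, 0.5])   # equals ln(2)

print(round(confident, 4), round(uninformed, 4))
```

Probabilities that lean the right way score below ln 2, while confident wrong probabilities are penalized heavily.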

### SVM


#### Q16) Create and train a SVM model called SVM using the training data (`x_train`, `y_train`).


In [161]:
SVM = svm.SVC(kernel='rbf').fit(x_train, y_train)
SVM

#### Q17) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [163]:
predictions_svm = SVM.predict(x_test)
predictions_svm[:15]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

#### Q18) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [175]:
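SVM_Accuracy_Score = accuracy_score(y_test, predictions_svm)
# Note: pos_label=0 scores the "no rain" class
SVM_JaccardIndex = jaccard_score(y_test, predictions_svm, pos_label=0)
# Note: average='weighted' averages F1 over both classes, unlike the rain-only default used for the other models
SVM_F1_Score = f1_score(y_test, predictions_svm, average='weighted')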

In [176]:
matrix = {
    'Metric':['Accuracy Score', 'Jaccard Index', 'F1 Score'],
    'Value':[SVM_Accuracy_Score, SVM_JaccardIndex, SVM_F1_Score]
}
report5 = pd.DataFrame(matrix)
report5

|     | Metric         | Value    |
| --- | -------------- | -------- |
| 0   | Accuracy Score | 0.722137 |
| 1   | Jaccard Index  | 0.722137 |
| 2   | F1 Score       | 0.605622 |
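
The RBF-kernel scores above are the ones a model would obtain by never predicting rain. The sketch below reproduces them with a constant predictor. The class counts (473 dry and 182 rainy days in a 655-row test split) are an assumption inferred from the reported accuracy and the split size; they are not read from the data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, jaccard_score

# Assumed class counts: 473 dry and 182 rainy days out of 655 test rows.
# They are inferred from the reported accuracy, not read from the data.
truth = np.array([0] * 473 + [1] * 182)
always_dry = np.zeros_like(truth)   # a "model" that never predicts rain

acc = accuracy_score(truth, always_dry)
jac_dry = jaccard_score(truth, always_dry, pos_label=0)
f1_weighted = f1_score(truth, always_dry, average="weighted", zero_division=0)
f1_rain = f1_score(truth, always_dry, zero_division=0)

print(round(acc, 6), round(jac_dry, 6), round(f1_weighted, 6), f1_rain)
```

Accuracy and the `pos_label=0` Jaccard Index coincide for such a predictor, and the weighted F1-score stays well above zero even though the rain-class F1-score is 0.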


## Different kernel function - linear

In [177]:
SVM2 = svm.SVC(kernel='linear').fit(x_train, y_train)
SVM2

In [178]:
predictions_svm2 = SVM2.predict(x_test)
predictions_svm2[:15]

array([0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 0., 0., 0.])

In [182]:
SVM_Accuracy_Score2 = accuracy_score(y_test, predictions_svm2)
SVM_JaccardIndex2 = jaccard_score(y_test, predictions_svm2, pos_label=0)
SVM_F1_Score2 = f1_score(y_test, predictions_svm2, average='weighted') 

In [183]:
matrix = {
    'Metric':['Accuracy Score', 'Jaccard Index', 'F1 Score'],
    'Value':[SVM_Accuracy_Score2, SVM_JaccardIndex2, SVM_F1_Score2]
}
report6 = pd.DataFrame(matrix)
report6

|     | Metric         | Value    |
| --- | -------------- | -------- |
| 0   | Accuracy Score | 0.832061 |
| 1   | Jaccard Index  | 0.798903 |
| 2   | F1 Score       | 0.825516 |


## Interpretations:

* RBF kernel: Accuracy and the `pos_label=0` Jaccard Index are identical (0.722), which can only happen when the model gets no rain day right, and the 15 predictions shown are all 0. This model is therefore no better than always predicting "no rain". A likely cause, not verified here, is that the models are trained on the unscaled `features` array rather than the standardized `X`.

* Accuracy Score (0.832): The linear-kernel SVM is practically tied with the logistic regression model (0.834). Both classify a high proportion of the test days correctly.

* Jaccard Index (0.799): Both the SVM and the logistic regression values were computed with `pos_label=0`, so they are comparable with each other, and they are essentially equal (0.799 versus 0.800). Both describe the "no rain" class.

* F1-score (0.826): This value was computed with `average='weighted'`, which averages the F1-scores of both classes weighted by their size. The logistic regression F1-score (0.667) covers the rain class only, so the two numbers are not comparable, and the higher SVM value does not show that the SVM is better at identifying rain days.
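
The effect of the `average` and `pos_label` arguments can be seen by scoring one set of predictions in several ways. The labels below are made up for illustration (7 dry days and 3 rainy days).

```python
from sklearn.metrics import f1_score, jaccard_score

# Made-up labels: 7 dry days (0) and 3 rainy days (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

f1_rain = f1_score(y_true, y_pred)                          # default: class 1 only
f1_weighted = f1_score(y_true, y_pred, average="weighted")  # both classes, weighted by support
jac_rain = jaccard_score(y_true, y_pred)                    # default: class 1
jac_dry = jaccard_score(y_true, y_pred, pos_label=0)        # class 0

print(round(f1_rain, 3), round(f1_weighted, 3), round(jac_rain, 3), round(jac_dry, 3))
```

The same predictions give a rain-class F1-score of about 0.667 and a weighted F1-score of 0.8, and a Jaccard Index of 0.5 for rain against 0.75 for no rain, so the settings must match before two models are compared.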

#### Q19) Show the Accuracy, Jaccard Index, F1-Score and LogLoss in a tabular format using data frame for all of the above models.

\*LogLoss is only for Logistic Regression Model


In [186]:
import pandas as pd

# Collect the metrics computed above for each model (KNN with k=5, SVM with the linear kernel)
data = {
    'Model': ['KNN', 'Decision Tree', 'Logistic Regression', 'SVM'],
    'Accuracy Score': [KNN_Accuracy_Score2, Tree_Accuracy_Score, LR_Accuracy_Score, SVM_Accuracy_Score2],
    'Jaccard Index': [KNN_JaccardIndex2, Tree_JaccardIndex, LR_JaccardIndex, SVM_JaccardIndex2],
    'F1_Score': [KNN_F1_Score2, Tree_F1_Score, LR_F1_Score, SVM_F1_Score2],
    'Log Loss': [None, None, LR_Log_Loss, None]  # Log Loss is computed only for Logistic Regression
}

# Create a DataFrame from the dictionary; a new name avoids overwriting the weather data in `df`
report_all = pd.DataFrame(data)

# Print the DataFrame with clean formatting
print(report_all.to_string(index=False))


              Model  Accuracy Score  Jaccard Index  F1_Score  Log Loss
                KNN        0.819847       0.466063  0.635802       NaN
      Decision Tree        0.738931       0.373626  0.544000       NaN
Logistic Regression        0.833588       0.800366  0.666667   0.38115
                SVM        0.832061       0.798903  0.825516       NaN


## Overall:

* Logistic Regression and the linear-kernel SVM achieved nearly identical Accuracy Scores (about 83%) and Jaccard Indexes (about 0.80) on the same test split.
* The SVM F1-score (0.826) was computed with `average='weighted'`, while the Logistic Regression F1-score (0.667) covers the rain class only, so the two values cannot be ranked against each other.
* The Jaccard Index was computed with `pos_label=0` (no rain) for Logistic Regression and SVM, and with the default `pos_label=1` (rain) for KNN and Decision Tree. The gap between 0.80 and 0.47 therefore reflects the choice of class as much as model quality.
* KNN and Decision Tree were evaluated on a different test split (`random_state=10`) from Logistic Regression and SVM (`random_state=1`), which further limits direct comparison.
* KNN reached an Accuracy Score of 0.820, close to the two linear models, with a rain-class F1-score of 0.636.
* Decision Tree had the lowest Accuracy Score in the table (0.739).

## Key Takeaways:

* Logistic Regression and the linear-kernel SVM are the strongest candidates for rain prediction here, and the table does not separate them.
* A fair ranking needs the same test split and the same metric settings (`pos_label` and `average`) for every model.
* KNN and Decision Tree require further investigation or optimization if they are to be considered for this task.
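
One way to keep a comparison table consistent is to score every model through a single helper with fixed metric settings. The following is a minimal sketch under that assumption; `summarize` is a hypothetical helper, and the labels and predictions are made up rather than taken from the models above.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, jaccard_score


def summarize(name, y_true, y_pred):
    """Score one model with the same metric settings as every other model."""
    return {
        "Model": name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Jaccard (rain)": jaccard_score(y_true, y_pred, pos_label=1),
        "F1 (rain)": f1_score(y_true, y_pred, pos_label=1),
    }


# Made-up labels and predictions standing in for two fitted models.
truth = [0, 0, 0, 1, 1, 0, 1, 0]
rows = [
    summarize("model_a", truth, [0, 0, 0, 1, 0, 0, 1, 0]),
    summarize("model_b", truth, [0, 1, 0, 1, 1, 0, 1, 0]),
]
comparison = pd.DataFrame(rows)

print(comparison.to_string(index=False))
```

In this toy case the two models have the same accuracy but different rain-class scores, which is the kind of difference a consistent table makes visible.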