
# Classification Project
This is a classification project on the Australian rain dataset.

The original source of the data is the Australian Government's Bureau of Meteorology, and the latest data can be gathered from http://www.bom.gov.au/climate/dwo/.

The dataset used here has extra columns such as 'RainToday' and the target 'RainTomorrow', which were gathered from Rattle at https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData

The copy of the dataset used in this project is hosted by the IBM Skills Network and is downloaded directly from there.


# Importing the required libraries

In [None]:
# Suppressing warnings to avoid cluttering the output
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

In [None]:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import jaccard_score, f1_score, log_loss
from sklearn.metrics import confusion_matrix, accuracy_score
import sklearn.metrics as metrics

# Importing Dataset from skills network

In [None]:
# The dataset is loaded from the Skills Network URL and assigned to the DataFrame df
filepath = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv"
df = pd.read_csv(filepath)


# Understanding the Data

In [None]:
# Checking the structure
print('DATASET INFORMATION')
print()
df.info()  # info() prints its report directly and returns None
print('\n\n\n', 'SHAPE\n', df.shape)
print('\n IDENTIFYING UNIQUE VALUES')
print(df.nunique())

DATASET INFORMATION

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3271 entries, 0 to 3270
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           3271 non-null   object 
 1   MinTemp        3271 non-null   float64
 2   MaxTemp        3271 non-null   float64
 3   Rainfall       3271 non-null   float64
 4   Evaporation    3271 non-null   float64
 5   Sunshine       3271 non-null   float64
 6   WindGustDir    3271 non-null   object 
 7   WindGustSpeed  3271 non-null   int64  
 8   WindDir9am     3271 non-null   object 
 9   WindDir3pm     3271 non-null   object 
 10  WindSpeed9am   3271 non-null   int64  
 11  WindSpeed3pm   3271 non-null   int64  
 12  Humidity9am    3271 non-null   int64  
 13  Humidity3pm    3271 non-null   int64  
 14  Pressure9am    3271 non-null   float64
 15  Pressure3pm    3271 non-null   float64
 16  Cloud9am       3271 non-null   int64  
 17  Cloud3pm       3271 non-null   

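From the dataset information, we can see that the dataset contains 22 columns and 3271 rows, with no null values in any of the columns shown. Most of the columns are numeric (float64 or int64); the remaining ones, such as Date and the wind direction columns, are stored as object.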

# Data Summary and Descriptive Statistics

In [None]:
df.describe()

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm
count,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0
mean,14.877102,23.005564,3.342158,5.175787,7.16897,41.476307,15.077041,19.294405,68.243962,54.698563,1018.334424,1016.003085,4.318557,4.176093,17.821461,21.543656
std,4.55471,4.483752,9.917746,2.757684,3.815966,10.806951,7.043825,7.453331,15.086127,16.279241,7.02009,7.019915,2.526923,2.411274,4.894316,4.297053
min,4.3,11.7,0.0,0.0,0.0,17.0,0.0,0.0,19.0,10.0,986.7,989.8,0.0,0.0,6.4,10.2
25%,11.0,19.6,0.0,3.2,4.25,35.0,11.0,15.0,58.0,44.0,1013.7,1011.3,2.0,2.0,13.8,18.4
50%,14.9,22.8,0.0,4.8,8.3,41.0,15.0,19.0,69.0,56.0,1018.6,1016.3,5.0,4.0,18.2,21.3
75%,18.8,26.0,1.4,7.0,10.2,44.0,20.0,24.0,80.0,64.0,1023.1,1020.8,7.0,7.0,21.7,24.5
max,27.6,45.8,119.4,18.4,13.6,96.0,54.0,57.0,100.0,99.0,1039.0,1036.7,9.0,8.0,36.5,44.7


In [None]:
# checking for missing values
df.isnull().sum()

In [None]:
#checking for duplicates
df.duplicated().sum()

0

In [None]:
df

Unnamed: 0,Date,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2/1/2008,19.5,22.4,15.6,6.2,0.0,W,41,S,SSW,...,92,84,1017.6,1017.4,8,8,20.7,20.9,Yes,Yes
1,2/2/2008,19.5,25.6,6.0,3.4,2.7,W,41,W,E,...,83,73,1017.9,1016.4,7,7,22.4,24.8,Yes,Yes
2,2/3/2008,21.6,24.5,6.6,2.4,0.1,W,41,ESE,ESE,...,88,86,1016.7,1015.6,7,8,23.5,23.0,Yes,Yes
3,2/4/2008,20.2,22.8,18.8,2.2,0.0,W,41,NNE,E,...,83,90,1014.2,1011.8,8,8,21.4,20.9,Yes,Yes
4,2/5/2008,19.7,25.7,77.4,4.8,0.0,W,41,NNE,W,...,88,74,1008.3,1004.8,8,8,22.5,25.5,Yes,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3266,6/21/2017,8.6,19.6,0.0,2.0,7.8,SSE,37,W,SSE,...,73,52,1025.9,1025.3,2,2,10.5,17.9,No,No
3267,6/22/2017,9.3,19.2,0.0,2.0,9.2,W,30,W,ESE,...,78,53,1028.5,1024.6,2,2,11.0,18.7,No,No
3268,6/23/2017,9.4,17.7,0.0,2.4,2.7,W,24,WNW,N,...,85,56,1020.8,1015.0,6,6,10.2,17.3,No,No
3269,6/24/2017,10.1,19.3,0.0,1.4,9.3,W,43,W,W,...,56,35,1017.3,1015.1,5,2,12.4,19.0,No,No


In [None]:
# converting the date column to datetime
df['Date'] = pd.to_datetime(df['Date'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3271 entries, 0 to 3270
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Date           3271 non-null   datetime64[ns]
 1   MinTemp        3271 non-null   float64       
 2   MaxTemp        3271 non-null   float64       
 3   Rainfall       3271 non-null   float64       
 4   Evaporation    3271 non-null   float64       
 5   Sunshine       3271 non-null   float64       
 6   WindGustDir    3271 non-null   object        
 7   WindGustSpeed  3271 non-null   int64         
 8   WindDir9am     3271 non-null   object        
 9   WindDir3pm     3271 non-null   object        
 10  WindSpeed9am   3271 non-null   int64         
 11  WindSpeed3pm   3271 non-null   int64         
 12  Humidity9am    3271 non-null   int64         
 13  Humidity3pm    3271 non-null   int64         
 14  Pressure9am    3271 non-null   float64       
 15  Pressure3pm    3271 n

We have five categorical variables in total: WindGustDir, WindDir9am, WindDir3pm, RainToday and RainTomorrow. We first convert the first four to numerical variables using one-hot encoding. 'RainTomorrow' is the target variable, so we map it directly to 1 and 0 instead of one-hot encoding it, which keeps it as a single column.
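The difference between the two encodings can be seen on a small example. This is only a sketch: the `toy` frame and its values are made up for illustration and are not rows from the weather dataset.

```python
import pandas as pd

# Toy frame standing in for the weather data (values are made up)
toy = pd.DataFrame({
    "WindGustDir": ["W", "SSE", "W"],
    "RainToday": ["Yes", "No", "No"],
    "RainTomorrow": ["Yes", "No", "Yes"],
})

# One-hot encode the feature columns: one new column per category
encoded = pd.get_dummies(toy, columns=["WindGustDir", "RainToday"])

# Map the binary target to a single 0/1 column instead
encoded["RainTomorrow"] = encoded["RainTomorrow"].map({"Yes": 1, "No": 0})

print(encoded.columns.tolist())
print(encoded["RainTomorrow"].tolist())
```

The two feature columns expand into four indicator columns, while the target stays a single column of 0s and 1s.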

In [None]:
# Performing one-hot encoding on the first four
df_processed = pd.get_dummies(data =df, columns=['WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday'])

In [None]:
# Converting the target variable to numerical variable directly
df_processed.replace({'RainTomorrow':{'Yes':1, 'No':0}}, inplace=True)



In [None]:
df_processed.info()

# Training Data and Test Data
We set the 'features', otherwise known as the x values, and the 'target', or y values.

In [None]:
# First, we drop the Date column, since it is not used as a feature
df_processed.drop('Date', axis=1, inplace=True)

In [None]:
# We convert all the columns to float datatype
df_processed = df_processed.astype(float)

In [None]:
# Assigning the features and target columns
x = df_processed.drop('RainTomorrow', axis=1)
y = df_processed['RainTomorrow']

# Linear Regression

Q1) Use the train_test_split function to split the features and Y dataframes with a test_size of 0.2 and the random_state set to 10.

In [None]:
# The first step is to split the data into training and testing sets using sklearn's train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=10)




### Scaling the data

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
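
The scaler is fitted on the training split only and then reused on the test split, so the test data is standardized with the training mean and standard deviation. A minimal sketch on made-up numbers (the `train` and `test` arrays are illustrative, not the project's data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training statistics

print(scaler.mean_, scaler.scale_)
print(train_scaled.ravel(), test_scaled.ravel())
```

The scaled training column has mean 0 and standard deviation 1, while the test value is expressed in training standard deviations and can fall well outside that range.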

Q2) Create and train a Linear Regression model called LinearReg using the training data (x_train, y_train).

In [None]:
# We then create and train a linear Regression model called LinearReg using the training data (x_train, y_train)
LinearReg = LinearRegression()
LinearReg.fit(x_train, y_train)


Q3) Now use the predict method on the testing data (x_test) and save it to the array predictions.

In [None]:
# We now use the predict method on the testing data (x_test) and save the result to the array predictions
predictions= LinearReg.predict(x_test)
print(predictions[:5])

[0.13184071 0.2761859  0.97818819 0.2874561  0.13241371]
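
Because LinearRegression is a regression model, its predictions are continuous values rather than class labels. If class labels were wanted, one option would be to threshold them. The sketch below applies a cutoff of 0.5 to the five values printed above; the cutoff is an assumption for illustration, not something this notebook does.

```python
import numpy as np

# The five predictions printed above
predictions_head = np.array([0.13184071, 0.2761859, 0.97818819, 0.2874561, 0.13241371])

# Hypothetical cutoff: values at or above 0.5 are read as "rain tomorrow"
threshold = 0.5
labels_head = (predictions_head >= threshold).astype(int)

print(labels_head)
```

Only the third value exceeds the cutoff, so it is the only one that would be labelled 1.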


Q4) Using the predictions and the y_test dataframe calculate the value for each metric using the appropriate function.

In [None]:
# We use the predictions and the y_test dataframe to calculate each metric with the appropriate function
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
LinearRegression_MAE = mean_absolute_error(y_test, predictions)
LinearRegression_MSE = mean_squared_error(y_test, predictions)
LinearRegression_R2 = r2_score(y_test, predictions)


Q5) Show the MAE, MSE, and R2 in a tabular format using data frame for the linear model.

In [None]:
Report = {
    'Metric': ['Mean Absolute Error (MAE)', 'Mean Squared Error (MSE)', 'R-squared (R²)'],
    'Value': [LinearRegression_MAE, LinearRegression_MSE, LinearRegression_R2]
}

Report = pd.DataFrame(Report)

# Display the metrics DataFrame
print(Report)

                      Metric     Value
0  Mean Absolute Error (MAE)  0.256318
1   Mean Squared Error (MSE)  0.115721
2             R-squared (R²)  0.427132
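
To make the three metrics concrete, they can be computed by hand and compared with the sklearn functions. This is a small sketch on made-up values (`y_true` and `y_pred` below are illustrative, not the model's output):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions
y_true = np.array([0.0, 1.0, 0.0, 1.0])
y_pred = np.array([0.1, 0.8, 0.3, 0.6])

residuals = y_true - y_pred

mae = np.mean(np.abs(residuals))                 # average absolute error
mse = np.mean(residuals ** 2)                    # average squared error
ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot

print(mae, mse, r2)
print(mean_absolute_error(y_true, y_pred),
      mean_squared_error(y_true, y_pred),
      r2_score(y_true, y_pred))
```

Both lines print the same three values, which confirms that the hand-written formulas match the library functions.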


# KNN

Q6) Create and train a KNN model called KNN using the training data (x_train, y_train) with the n_neighbors parameter set to 4.

In [None]:
# Creating the KNN Model with n_neighbors set to 4
KNN = KNeighborsClassifier(n_neighbors=4)

In [None]:
# Training the model using the training data (x_train, y_train)
KNN.fit(x_train, y_train)

Q7) Now use the predict method on the testing data (x_test) and save it to the array predictions.

In [None]:
# Using the trained KNN model to make predictions on the test data
predictions = KNN.predict(x_test)

Q8) Using the predictions and the y_test dataframe calculate the value for each metric using the appropriate function.

In [None]:
from sklearn.metrics import accuracy_score, jaccard_score, f1_score

# Calculate KNN Accuracy Score
KNN_Accuracy_Score = accuracy_score(y_test, predictions)

# Calculate KNN Jaccard Index (binary classification)
KNN_JaccardIndex = jaccard_score(y_test, predictions, pos_label= 1)

# Calculate KNN F1 Score
KNN_F1_Score = f1_score(y_test, predictions, pos_label= 1)

In [None]:
print(f"KNN Accuracy Score = {KNN_Accuracy_Score}")
print(f"KNN Jaccard Index = {KNN_JaccardIndex}")
print(f"KNN F1 Score = {KNN_F1_Score}")


KNN Accuracy Score = 0.7603053435114504
KNN Jaccard Index = 0.23786407766990292
KNN F1 Score = 0.3843137254901961


# Decision Tree

Q9) Create and train a Decision Tree model called Tree using the training data (x_train, y_train).

In [None]:
# Creating a Decision Tree classifier model
Tree = DecisionTreeClassifier(random_state = 10)

In [None]:
# fitting the model using the training data
Tree.fit(x_train, y_train)

Q10) Now use the predict method on the testing data (x_test) and save it to the array predictions.

In [None]:
# Using the trained Decision Tree model to make predictions on the test data
predictions = Tree.predict(x_test)

Q11) Using the predictions and the y_test dataframe calculate the value for each metric using the appropriate function.

In [None]:
from sklearn.metrics import accuracy_score, jaccard_score, f1_score

# Calculate Decision Tree Accuracy Score
Tree_Accuracy_Score = accuracy_score(y_test, predictions)

# Calculate Decision Tree Jaccard Index (binary classification)
Tree_JaccardIndex = jaccard_score(y_test, predictions, pos_label=1)

# Calculate Decision Tree F1 Score (binary classification)
Tree_F1_Score = f1_score(y_test, predictions, pos_label= 1)

# Display the metrics
print(f"Decision Tree Accuracy Score: {Tree_Accuracy_Score}")
print(f"Decision Tree Jaccard Index: {Tree_JaccardIndex}")
print(f"Decision Tree F1 Score: {Tree_F1_Score}")


Decision Tree Accuracy Score: 0.751145038167939
Decision Tree Jaccard Index: 0.38257575757575757
Decision Tree F1 Score: 0.5534246575342465


# Logistic Regression

Q12) Use the train_test_split function to split the features and Y dataframes with a test_size of 0.2 and the random_state set to 1.

In [None]:
x_trainlg, x_testlg, y_trainlg, y_testlg = train_test_split(x, y, test_size=0.2, random_state=1)

Q13) Create and train a LogisticRegression model called LR using the training data (x_train, y_train) with the solver parameter set to liblinear.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_trainlg = scaler.fit_transform(x_trainlg)
x_testlg = scaler.transform(x_testlg)

In [None]:
# Create the Logistic Regression model with the 'liblinear' solver
LR = LogisticRegression(solver='liblinear', random_state=1)

# Train the model using the training data
LR.fit(x_trainlg, y_trainlg)

Q14) Now, use the predict and predict_proba methods on the testing data (x_test) and save it as 2 arrays predictions and predict_proba.

In [None]:
# Generate predictions using the trained Logistic Regression model
predictions = LR.predict(x_testlg)

# Generate probability predictions
predict_proba = LR.predict_proba(x_testlg)

Q15) Using the predictions, predict_proba and the y_test dataframe calculate the value for each metric using the appropriate function.

In [None]:
# Accuracy Score
LR_Accuracy_Score = accuracy_score(y_testlg, predictions)

# Jaccard Index
LR_JaccardIndex = jaccard_score(y_testlg, predictions, pos_label= 1)

# F1 Score
LR_F1_Score = f1_score(y_testlg, predictions, pos_label= 1)

# Log Loss
LR_Log_Loss = log_loss(y_testlg, predict_proba)

In [None]:
print(f"Logistic Regression Accuracy Score: {LR_Accuracy_Score:.4f}")
print(f"Logistic Regression Jaccard Index: {LR_JaccardIndex:.4f}")
print(f"Logistic Regression F1 Score: {LR_F1_Score:.4f}")
print(f"Logistic Regression Log Loss: {LR_Log_Loss:.4f}")

Logistic Regression Accuracy Score: 0.8305
Logistic Regression Jaccard Index: 0.4955
Logistic Regression F1 Score: 0.6626
Logistic Regression Log Loss: 0.3836
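
Log Loss is the only metric here that uses the predicted probabilities rather than the predicted labels. As a sketch of what it measures, the binary formula can be evaluated by hand on made-up values (`y_true` and `proba` are illustrative, not the model's output) and compared with `log_loss`:

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up labels and predict_proba-style output (columns: P(class 0), P(class 1))
y_true = np.array([1, 0, 1, 0])
proba = np.array([
    [0.2, 0.8],
    [0.9, 0.1],
    [0.4, 0.6],
    [0.3, 0.7],
])

p_pos = proba[:, 1]  # probability of the positive class

# Average negative log-likelihood of the true labels
manual = -np.mean(y_true * np.log(p_pos) + (1 - y_true) * np.log(1 - p_pos))

print(manual, log_loss(y_true, proba))
```

The last row, where the model gives 0.7 to the wrong class, contributes the most to the loss; confident wrong predictions are penalized heavily.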


# SVM

Q16) Create and train an SVM model called SVM using the training data (x_train, y_train).

In [None]:
x_trainsvm, x_testsvm, y_trainsvm, y_testsvm = train_test_split(x, y, test_size=0.2, random_state=1)
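
This call uses the same test_size and random_state as the Logistic Regression split, so it should produce the same partition of the data. A minimal sketch on made-up arrays (`X` and `y` here are illustrative) showing that a fixed random_state makes the split reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 50 rows, 2 features
X = np.arange(100).reshape(50, 2)
y = np.arange(50) % 2

# Two calls with the same random_state give the same partition
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.2, random_state=1)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.2, random_state=1)
same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)

# A different seed generally selects different rows
_, X_test_c, _, _ = train_test_split(X, y, test_size=0.2, random_state=10)

print(same_split, X_test_a.shape, X_test_c.shape)
```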

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_trainsvm = scaler.fit_transform(x_trainsvm)
x_testsvm = scaler.transform(x_testsvm)

In [None]:
# Creating the SVM model
SVM = SVC(kernel='linear', probability=True, random_state=10)

# Training  the model using the training data
SVM.fit(x_trainsvm, y_trainsvm)

Q17) Now use the predict method on the testing data (x_test) and save it to the array predictions.

In [None]:
# Generating predictions using the trained SVM model
predictions = SVM.predict(x_testsvm)

Q18) Using the predictions and the y_test dataframe calculate the value for each metric using the appropriate function.

In [None]:
from sklearn.metrics import accuracy_score, jaccard_score, f1_score

# Accuracy Score
SVM_Accuracy_Score = accuracy_score(y_testsvm, predictions)

# Jaccard Index
SVM_JaccardIndex = jaccard_score(y_testsvm, predictions, pos_label= 1)

# F1 Score
SVM_F1_Score = f1_score(y_testsvm, predictions, pos_label= 1)

# Display results
print(f"SVM Accuracy Score: {SVM_Accuracy_Score:.4f}")
print(f"SVM Jaccard Index: {SVM_JaccardIndex:.4f}")
print(f"SVM F1 Score: {SVM_F1_Score:.4f}")


SVM Accuracy Score: 0.8366
SVM Jaccard Index: 0.5000
SVM F1 Score: 0.6667


Q19) Show the Accuracy,Jaccard Index,F1-Score and LogLoss in a tabular format using data frame for all of the above models.

In [None]:
# Log Loss is reported only for Logistic Regression; it was not computed for the other models, so they are marked N/A
LR_Log_Loss = log_loss(y_testlg, predict_proba)
SVM_Log_Loss = "N/A"
KNN_Log_Loss = "N/A"
Tree_Log_Loss = "N/A"

# Create a DataFrame to compare model performance
metrics_comparison = pd.DataFrame({
    "Model": ["Logistic Regression", "SVM", "KNN", "Decision Tree"],
    "Accuracy Score": [LR_Accuracy_Score, SVM_Accuracy_Score, KNN_Accuracy_Score, Tree_Accuracy_Score],
    "Jaccard Index": [LR_JaccardIndex, SVM_JaccardIndex, KNN_JaccardIndex, Tree_JaccardIndex],
    "F1 Score": [LR_F1_Score, SVM_F1_Score, KNN_F1_Score, Tree_F1_Score],
    "Log Loss": [LR_Log_Loss, SVM_Log_Loss, KNN_Log_Loss, Tree_Log_Loss]
})

# Display the DataFrame
print(metrics_comparison)

                 Model  Accuracy Score  Jaccard Index  F1 Score  Log Loss
0  Logistic Regression        0.830534       0.495455  0.662614  0.383591
1                  SVM        0.836641       0.500000  0.666667       N/A
2                  KNN        0.760305       0.237864  0.384314       N/A
3        Decision Tree        0.751145       0.382576  0.553425       N/A
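
The N/A entries could in principle be filled in, since KNeighborsClassifier, DecisionTreeClassifier and SVC(probability=True) all expose predict_proba. The sketch below shows the idea on synthetic data; `make_classification`, the `max_depth=3` setting and the variable names are assumptions for illustration, not this project's data or settings.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for the weather features)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=4),
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "SVM": SVC(kernel="linear", probability=True, random_state=0),
}

# Fit each model and score its predicted probabilities with log loss
losses = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    losses[name] = log_loss(y_te, model.predict_proba(X_te))

for name, value in losses.items():
    print(f"{name}: {value:.4f}")
```

Models that output hard 0 or 1 probabilities, such as KNN with few neighbors or a fully grown tree, can produce a large log loss when a confident prediction is wrong.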


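SVM is the best-performing model overall, with the highest accuracy (0.8366), Jaccard Index (0.5000), and F1 Score (0.6667).

Logistic Regression comes a close second, with an accuracy of 0.8305 and a slightly lower Jaccard Index (0.4955) and F1 Score (0.6626) than SVM. It is also the only model for which Log Loss was computed (0.3836).

KNN has the lowest Jaccard Index (0.2379) and F1 Score (0.3843), even though its accuracy (0.7603) is slightly higher than the Decision Tree's, which suggests it handles the positive (rain) class poorly. It may benefit from hyperparameter tuning, for example a different n_neighbors value.

Decision Tree has the lowest accuracy (0.7511) but a clearly better Jaccard Index (0.3826) and F1 Score (0.5534) than KNN. It still trails Logistic Regression and SVM on every metric, and it might benefit from pruning to avoid overfitting.

Note that Logistic Regression and SVM were evaluated on the split made with random_state=1, while KNN and Decision Tree were evaluated on the split made with random_state=10, so the two pairs of models were scored on different test sets.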
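
As a sketch of what tuning and pruning could look like, the block below runs a cross-validated grid search for both models on synthetic data. The parameter grids, the `f1` scoring choice and the `make_classification` data are assumptions for illustration; they are not results from this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for the weather features)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Tune the number of neighbors for KNN
knn_search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7, 9, 11]},
    scoring="f1",
    cv=5,
)
knn_search.fit(X, y)

# Limit tree growth: max_depth caps the depth, ccp_alpha applies
# cost-complexity pruning
tree_search = GridSearchCV(
    DecisionTreeClassifier(random_state=10),
    param_grid={"max_depth": [2, 4, 6, None], "ccp_alpha": [0.0, 0.01, 0.05]},
    scoring="f1",
    cv=5,
)
tree_search.fit(X, y)

print(knn_search.best_params_, round(knn_search.best_score_, 4))
print(tree_search.best_params_, round(tree_search.best_score_, 4))
```

On the real data, the search would be fitted on the scaled training split only, with the test split kept aside for the final comparison.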