# Classification Algorithms

**Name:** Prithivi Raaj K

**Roll No:** 21z238

**Importing the libraries**

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from scipy import stats
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.exceptions import ConvergenceWarning
from sklearn.svm import SVC
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import HillClimbSearch, BayesianEstimator
from pgmpy.inference import VariableElimination
from sklearn.metrics import accuracy_score
import warnings

warnings.filterwarnings("ignore", category=ConvergenceWarning)

# Data Preprocessing

**Loading the dataset**

In [None]:
dataset = pd.read_csv("weatherAUS.csv")
dataset.head()

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,...,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,...,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,...,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,SE,...,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,ENE,...,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No


**Analyzing the dataset**

In [None]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 23 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Date           145460 non-null  object 
 1   Location       145460 non-null  object 
 2   MinTemp        143975 non-null  float64
 3   MaxTemp        144199 non-null  float64
 4   Rainfall       142199 non-null  float64
 5   Evaporation    82670 non-null   float64
 6   Sunshine       75625 non-null   float64
 7   WindGustDir    135134 non-null  object 
 8   WindGustSpeed  135197 non-null  float64
 9   WindDir9am     134894 non-null  object 
 10  WindDir3pm     141232 non-null  object 
 11  WindSpeed9am   143693 non-null  float64
 12  WindSpeed3pm   142398 non-null  float64
 13  Humidity9am    142806 non-null  float64
 14  Humidity3pm    140953 non-null  float64
 15  Pressure9am    130395 non-null  float64
 16  Pressure3pm    130432 non-null  float64
 17  Cloud9am       89572 non-null

In [None]:
dataset.shape

(145460, 23)

**Handling the Null Values in the dataset**

In [None]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
dataset['RainToday'] = dataset['RainToday'].fillna('Unknown')

In [None]:
dataset.dropna(subset=['RainTomorrow'],inplace=True)

The missing values in the Numerical columns are replaced with their mean values in the column.

In [None]:
num_columns = dataset.select_dtypes(include='number').columns

dataset[num_columns] = dataset[num_columns].fillna(dataset[num_columns].mean())

The missing values in Categorical columns are replaced with the mode values in the column.

In [None]:
cat_columns = dataset.select_dtypes(include='object').columns
cat_columns = cat_columns[cat_columns!='Date']

for col in cat_columns:
  dataset[col] = dataset[col].fillna(dataset[col].mode().iloc[0])
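
The two imputation steps above can be tried on a small frame first. This is a minimal sketch with made-up values (the toy frame and the names `num_cols` and `cat_cols` are illustrative and not taken from weatherAUS.csv); it fills the numeric column with its mean and the categorical column with its mode.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, one numeric and one categorical column
toy = pd.DataFrame({
    "MinTemp": [10.0, np.nan, 14.0, 12.0],
    "WindGustDir": ["W", "W", None, "NE"],
})

# Numeric columns: replace missing values with the column mean
num_cols = toy.select_dtypes(include="number").columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].mean())

# Non-numeric columns: replace missing values with the most frequent value
cat_cols = toy.select_dtypes(exclude="number").columns
for col in cat_cols:
    toy[col] = toy[col].fillna(toy[col].mode().iloc[0])

print(toy)
```

Mean and mode imputation keep every row, but they also shrink the variance of heavily imputed columns such as Evaporation and Sunshine, where close to half of the values are missing.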

Null values in the dataset after being handled.

In [None]:
print((dataset.isna().mean()*100).round(2))

Date             0.0
Location         0.0
MinTemp          0.0
MaxTemp          0.0
Rainfall         0.0
Evaporation      0.0
Sunshine         0.0
WindGustDir      0.0
WindGustSpeed    0.0
WindDir9am       0.0
WindDir3pm       0.0
WindSpeed9am     0.0
WindSpeed3pm     0.0
Humidity9am      0.0
Humidity3pm      0.0
Pressure9am      0.0
Pressure3pm      0.0
Cloud9am         0.0
Cloud3pm         0.0
Temp9am          0.0
Temp3pm          0.0
RainToday        0.0
RainTomorrow     0.0
dtype: float64


**Handling Categorical Columns**

In [None]:
label_encoder = LabelEncoder()
for i in cat_columns[4:]:
  dataset[i] = label_encoder.fit_transform(dataset[i])

In [None]:
dataset = pd.get_dummies(dataset,columns=cat_columns[0:4])

In [None]:
dataset.head()

Unnamed: 0,Date,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,...,WindDir3pm_NNW,WindDir3pm_NW,WindDir3pm_S,WindDir3pm_SE,WindDir3pm_SSE,WindDir3pm_SSW,WindDir3pm_SW,WindDir3pm_W,WindDir3pm_WNW,WindDir3pm_WSW
0,2008-12-01,13.4,22.9,0.6,5.469824,7.624853,44.0,20.0,24.0,71.0,...,False,False,False,False,False,False,False,False,True,False
1,2008-12-02,7.4,25.1,0.0,5.469824,7.624853,44.0,4.0,22.0,44.0,...,False,False,False,False,False,False,False,False,False,True
2,2008-12-03,12.9,25.7,0.0,5.469824,7.624853,46.0,19.0,26.0,38.0,...,False,False,False,False,False,False,False,False,False,True
3,2008-12-04,9.2,28.0,0.0,5.469824,7.624853,24.0,11.0,9.0,45.0,...,False,False,False,False,False,False,False,False,False,False
4,2008-12-05,17.5,32.3,1.0,5.469824,7.624853,41.0,7.0,20.0,82.0,...,False,True,False,False,False,False,False,False,False,False


In [None]:
dataset = dataset.drop(columns=['Date'])

**Handling Outliers in the dataset**

In [None]:
z_scores = stats.zscore(dataset[num_columns])
abs_z_score = np.abs(z_scores)
filtered_entries = (abs_z_score < 3).all(axis=1)
dataset = dataset[filtered_entries]
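
The z-score rule used above keeps a row only if every numeric value lies within three standard deviations of its column mean. A minimal sketch on synthetic data (the random values and the planted outlier are made up for illustration):

```python
import numpy as np
from scipy import stats

# Synthetic data: 1000 rows, 2 columns, roughly standard normal
rng = np.random.default_rng(0)
values = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
values[0] = [50.0, 0.0]  # planted outlier in the first column

# Absolute z-score per column, then keep rows where all columns are below 3
z = np.abs(stats.zscore(values))
keep = (z < 3).all(axis=1)
filtered = values[keep]

print(values.shape, "->", filtered.shape)
```

Because the condition uses `all(axis=1)`, a single extreme value in any column removes the whole row.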

In [None]:
dataset.shape

(133601, 115)

In [None]:
dataset.head()

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,...,WindDir3pm_NNW,WindDir3pm_NW,WindDir3pm_S,WindDir3pm_SE,WindDir3pm_SSE,WindDir3pm_SSW,WindDir3pm_SW,WindDir3pm_W,WindDir3pm_WNW,WindDir3pm_WSW
0,13.4,22.9,0.6,5.469824,7.624853,44.0,20.0,24.0,71.0,22.0,...,False,False,False,False,False,False,False,False,True,False
1,7.4,25.1,0.0,5.469824,7.624853,44.0,4.0,22.0,44.0,25.0,...,False,False,False,False,False,False,False,False,False,True
2,12.9,25.7,0.0,5.469824,7.624853,46.0,19.0,26.0,38.0,30.0,...,False,False,False,False,False,False,False,False,False,True
3,9.2,28.0,0.0,5.469824,7.624853,24.0,11.0,9.0,45.0,16.0,...,False,False,False,False,False,False,False,False,False,False
4,17.5,32.3,1.0,5.469824,7.624853,41.0,7.0,20.0,82.0,33.0,...,False,True,False,False,False,False,False,False,False,False


**Splitting the target variable**

In [None]:
X = dataset.drop(columns=['RainTomorrow'])
Y = dataset['RainTomorrow']

X1 = dataset.drop(columns=['RainTomorrow'])
Y1 = dataset['RainTomorrow']

X1 = X1.sample(frac=0.1,random_state=42)
Y1 = Y1.sample(frac=0.1,random_state=42)

**Splitting the Train set and the Test set from the dataset**

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.2, random_state=0)
X_train1, X_test1, Y_train1, Y_test1 = train_test_split(X1,Y1, test_size=0.2, random_state=0)

**Feature Scaling**

In [None]:
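scaler = StandardScaler()

# Fit the scaler on the training split only, then reuse it for the test split
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Separate scaler for the 10% subsample
scaler1 = StandardScaler()
X_train_scaled1 = scaler1.fit_transform(X_train1)
X_test_scaled1 = scaler1.transform(X_test1)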
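
One way to make sure scaling statistics come from the training split only is to wrap the scaler and the classifier in a scikit-learn pipeline. The sketch below uses synthetic data from `make_classification`; the names `X_demo`, `y_demo`, and `pipe` are illustrative and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the weather features
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# fit() learns the scaling statistics from X_tr only;
# score() applies those same statistics to X_te before predicting
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)
test_acc = pipe.score(X_te, y_te)

scaler_step = pipe.named_steps["standardscaler"]
print("Test accuracy:", test_acc)
```

A pipeline can also be passed directly to GridSearchCV or RandomizedSearchCV, in which case the scaler is refit inside every cross-validation fold.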

# Linear models: Logistic Regression and Naive Bayes

## Logistic Regression

In [None]:
log_model  = LogisticRegression(max_iter=2000,solver='liblinear')
log_model.fit(X_train_scaled,Y_train)

log_train_acc = log_model.score(X_train_scaled,Y_train)
log_test_acc = log_model.score(X_test_scaled,Y_test)

print("Logistic Regression: ")
print("Training Set Accuracy: ",log_train_acc)
print("Test Set Accuracy: ",log_test_acc)

Logistic Regression: 
Training Set Accuracy:  0.8516467065868264
Test Set Accuracy:  0.8489577485872535


**Hyperparameter Tuning for Logistic Regression**

In [None]:
log_model_tuned = LogisticRegression(max_iter=1000,solver='lbfgs',C=1.0)
log_model_tuned.fit(X_train_scaled,Y_train)

log_train_acc_tuned = log_model_tuned.score(X_train_scaled,Y_train)
log_test_acc_tuned = log_model_tuned.score(X_test_scaled,Y_test)

print("Logistic Regression after tuning parameters: ")
print("Training Set Accuracy: ",log_train_acc_tuned)
print("Test Set Accuracy: ",log_test_acc_tuned)

Logistic Regression after tuning parameters: 
Training Set Accuracy:  0.8516373502994012
Test Set Accuracy:  0.8489577485872535


## Naive Bayes Classifier

In [None]:
nb_model = GaussianNB()
nb_model.fit(X_train_scaled,Y_train)

nb_train_acc = nb_model.score(X_train_scaled,Y_train)
nb_test_acc = nb_model.score(X_test_scaled,Y_test)

print("Naive Bayes: ")
print("Training Set Accuracy: ",nb_train_acc)
print("Test Set Accuracy: ",nb_test_acc)

Naive Bayes: 
Training Set Accuracy:  0.6401478293413174
Test Set Accuracy:  0.6435762134650649


## Result Analysis

**Logistic Regression**

**Training Set Accuracy:**


*   Before tuning: 85.16%
*   After tuning: 85.16%


**Test Set Accuracy:**


*   Before tuning: 84.90%
*   After tuning: 84.90%

Both before and after tuning, Logistic Regression performs consistently, with about 85% accuracy on both the training and test sets; the small gap between the two indicates little overfitting.

The identical test accuracy is expected: the "tuned" model only switches the solver from liblinear to lbfgs and keeps C=1.0, which is the default regularization strength, so no hyperparameter search was actually performed. A search over C would be needed to judge whether tuning helps on this dataset.
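
A search over the regularization strength C might look like the following sketch. It runs on synthetic data, and the grid values and the names `param_grid_log` and `grid_log` are assumptions chosen for illustration, not settings taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled training data
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# Candidate regularization strengths (smaller C means stronger regularization)
param_grid_log = {"C": [0.01, 0.1, 1.0, 10.0]}

grid_log = GridSearchCV(LogisticRegression(max_iter=1000),
                        param_grid=param_grid_log, cv=3)
grid_log.fit(X_demo, y_demo)

best_C = grid_log.best_params_["C"]
best_cv_acc = grid_log.best_score_
print("Best C:", best_C, "CV accuracy:", best_cv_acc)
```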


**Naive Bayes**

Training Set Accuracy: 64.01%

Test Set Accuracy: 64.36%

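The Naive Bayes classifier performs considerably worse than Logistic Regression, with accuracies around 64% on both the training and test sets. A likely reason is that GaussianNB assumes every feature is normally distributed and conditionally independent given the class, which fits the many one-hot encoded columns poorly.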
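
Accuracy is easier to judge against a majority-class baseline. The notebook does not report the class balance of RainTomorrow, so the 80/20 split below is a made-up illustration; on the real data the same check could be run with Y_train and Y_test.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up imbalanced labels: 80 zeros ("No") and 20 ones ("Yes")
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.zeros((100, 1))  # DummyClassifier ignores the features

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_acc = baseline.score(X_demo, y_demo)

print("Majority-class baseline accuracy:", baseline_acc)
```

A model whose accuracy is below this baseline is doing worse than always predicting the majority class.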


# Non-Linear Models: Decision Tree and Neural Network

## Decision Tree

In [None]:
dt_model = DecisionTreeClassifier(random_state=0)
dt_model.fit(X_train_scaled,Y_train)

dt_train_acc = dt_model.score(X_train_scaled,Y_train)
dt_test_acc = dt_model.score(X_test_scaled,Y_test)

print("Decision Tree Classifier: ")
print("Training Accuracy: ",dt_train_acc)
print("Testing Accuracy: ",dt_test_acc)

Decision Tree Classifier: 
Training Accuracy:  0.9999625748502994
Testing Accuracy:  0.7914374462033607


**Hyperparameter Tuning for Decision Tree Classifier**

In [None]:
param_grid_dt = {'max_depth': [None,10,20]}

randomSearch_dt = RandomizedSearchCV(DecisionTreeClassifier(random_state=0),
                                     param_distributions=param_grid_dt,
                                     n_iter=3,cv=3,random_state=0)

randomSearch_dt.fit(X_train_scaled,Y_train)
dt_model_tuned = randomSearch_dt.best_estimator_

dt_train_acc_tuned = dt_model_tuned.score(X_train_scaled,Y_train)
dt_test_acc_tuned = dt_model_tuned.score(X_test_scaled,Y_test)

print("Decision Tree Classifier after Tuning Hyperparameters: ")
print("Training Accuracy: ", dt_train_acc_tuned)
print("Testing Accuracy: ",dt_test_acc_tuned)

Decision Tree Classifier after Tuning Hyperparameters: 
Training Accuracy:  0.8664577095808383
Testing Accuracy:  0.8407993712810149


## Neural Network

In [None]:
neu_net_model = MLPClassifier(random_state=0)
neu_net_model.fit(X_train_scaled1,Y_train1)

neu_net_train_acc = neu_net_model.score(X_train_scaled1,Y_train1)
neu_net_test_acc = neu_net_model.score(X_test_scaled1,Y_test1)

print("Neural Network (MLPClassifier): ")
print("Training Accuracy: ",neu_net_train_acc)
print("Testing Accuracy: ",neu_net_test_acc)

Neural Network (MLPClassifier): 
Training Accuracy:  0.999064371257485
Testing Accuracy:  0.8139970059880239


**Hyperparameter Tuning for Neural Network Classifier**

In [None]:
param_grid_nn = {'hidden_layer_sizes':[(50,),(100,)]}

randomSearch_nn = RandomizedSearchCV(MLPClassifier(random_state=0),
                                     param_distributions=param_grid_nn,
                                     n_iter=2, cv=3, random_state=0)

randomSearch_nn.fit(X_train_scaled1,Y_train1)
neu_net_model_tuned = randomSearch_nn.best_estimator_

neu_net_train_acc_tuned = neu_net_model_tuned.score(X_train_scaled1,Y_train1)
neu_net_test_acc_tuned = neu_net_model_tuned.score(X_test_scaled1,Y_test1)

print("Neural Network (MLPClassifier) after Tuning Hyperparameters: ")
print("Training Accuracy: ",neu_net_train_acc_tuned)
print("Testing Accuracy: ",neu_net_test_acc_tuned)


Neural Network (MLPClassifier) after Tuning Hyperparameters: 
Training Accuracy:  0.999064371257485
Testing Accuracy:  0.8139970059880239


## Result Analysis

**Decision Tree Classifier**

**Training Set Accuracy:**

*   Original: 99.99%
*   After Tuning: 86.65%


The untuned Decision Tree Classifier reaches nearly 100% training accuracy, which points to overfitting: a tree with no depth limit can keep splitting until it memorizes the training set.

After tuning hyperparameters, the training accuracy decreases to 86.65%, indicating that the model's tendency to overfit has been reduced.

**Test Set Accuracy:**

*   Original: 79.14%
*   After Tuning: 84.08%



The original Decision Tree Classifier achieves a test set accuracy of 79.14%, about 21 percentage points below its training accuracy.

After tuning max_depth, the test set accuracy increases to 84.08% and the gap between training and test accuracy narrows to about 2.6 percentage points, so the tuned tree generalizes better to unseen data.
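
The effect of limiting tree depth can be reproduced on synthetic data. In this sketch the dataset, the noise level `flip_y=0.1`, and the depth of 3 are assumptions chosen for illustration; it records training and test accuracy for an unrestricted tree and a shallow one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise so that memorization hurts test accuracy
X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     n_informative=5, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Compare an unrestricted tree (max_depth=None) with a shallow tree
scores = {}
for depth in [None, 3]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))

for depth, (train_acc, test_acc) in scores.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```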




**Neural Network**

**Training Set Accuracy:**

*   Original: 99.91%
*   After Tuning: 99.91%


**Test Set Accuracy:**

*   Original: 81.40%
*   After Tuning: 81.40%


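The MLPClassifier reaches 99.91% training accuracy but only 81.40% test accuracy, a gap of about 18.5 percentage points that indicates overfitting rather than good generalization. Tuning changed nothing because the search covered only hidden_layer_sizes of (50,) and (100,), and (100,) is the scikit-learn default; the identical scores suggest that the search selected the default configuration. The neural network was also trained and evaluated on the 10% subsample, so its scores are not directly comparable with those of the models trained on the full dataset.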
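
When training accuracy is far above test accuracy, the regularization options of MLPClassifier, such as `alpha` (L2 penalty) and `early_stopping`, may be worth trying. The sketch below runs on synthetic data, and the parameter values are illustrative assumptions rather than tuned settings for the weather dataset.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the scaled subsample
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)

# early_stopping holds out 10% of the training data and stops when the
# validation score stops improving; alpha adds an L2 penalty on the weights
mlp = MLPClassifier(hidden_layer_sizes=(50,), alpha=1e-2,
                    early_stopping=True, validation_fraction=0.1,
                    max_iter=200, random_state=0)
mlp.fit(X_demo, y_demo)

print("Epochs run:", mlp.n_iter_)
print("Last validation score:", mlp.validation_scores_[-1])
```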

# Hybrid Models: SVM and Bayesian Network

## Support Vector Machines (SVM)

In [None]:
svm_model = SVC(random_state=0)

svm_model.fit(X_train_scaled1,Y_train1)

svm_train_acc = svm_model.score(X_train_scaled1,Y_train1)
svm_test_acc = svm_model.score(X_test_scaled1,Y_test1)

print("SVM Classifier: ")
print("Training Accuracy: ",svm_train_acc)
print("Testing Accuracy: ",svm_test_acc)

SVM Classifier: 
Training Accuracy:  0.8997941616766467
Testing Accuracy:  0.8536676646706587


**Hyperparameter Tuning for SVM**

In [None]:
param_grid_svm = {'C':[0.1,1,10,100],'gamma':[1,0.1,0.01,0.001]}

gridSearch_svm = GridSearchCV(SVC(random_state=0),
                              param_grid=param_grid_svm,cv=2,n_jobs=-1)

gridSearch_svm.fit(X_train_scaled1, Y_train1)
svm_model_tuned = gridSearch_svm.best_estimator_

svm_train_acc_tuned = svm_model_tuned.score(X_train_scaled1,Y_train1)
svm_test_acc_tuned = svm_model_tuned.score(X_test_scaled1,Y_test1)

print("SVM Classifier after Hyperparameter Tuning: ")
print("Training Accuracy: ",svm_train_acc_tuned)
print("Testing Accuracy: ",svm_test_acc_tuned)

SVM Classifier after Hyperparameter Tuning: 
Training Accuracy:  0.8720059880239521
Testing Accuracy:  0.8540419161676647


## Bayesian Network

In [None]:
X = dataset[['RainToday', 'MaxTemp', 'Rainfall','RainTomorrow']]
print("Attributes passed to the model",X.columns.tolist())

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

model = BayesianNetwork([
    ('RainToday', 'RainTomorrow'),
    ('MaxTemp', 'RainTomorrow'),
    ('Rainfall', 'RainTomorrow')
])


model.fit(X_train, estimator=BayesianEstimator, n_jobs=-1)

Rain_infer = VariableElimination(model)

print('\nProbability of RainTomorrow given RainToday= 1')
q1=Rain_infer.query(variables=['RainTomorrow'],evidence={'RainToday':1})
print(q1)

Attributes passed to the model ['RainToday', 'MaxTemp', 'Rainfall', 'RainTomorrow']





Probability of RainTomorrow given RainToday= 1
+-----------------+---------------------+
| RainTomorrow    |   phi(RainTomorrow) |
+=================+=====================+
| RainTomorrow(0) |              0.5000 |
+-----------------+---------------------+
| RainTomorrow(1) |              0.5000 |
+-----------------+---------------------+


## Result Analysis

**Support Vector Machines (SVM)**

**Training Set Accuracy:**

*   Original: 89.98%
*   After Tuning: 87.20%


**Test Set Accuracy:**

*   Original: 85.37%
*   After Tuning: 85.40%

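Hyperparameter tuning lowered training accuracy by about 2.8 percentage points while test accuracy stayed essentially unchanged: the gain from 85.37% to 85.40% corresponds to a single additional correctly classified example among the 2,672 test rows of the subsample. The tuned model therefore fits the training data less tightly, but the test result alone does not show that it generalizes better.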
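
The size of the change in test accuracy can be expressed as a number of examples. The sketch below assumes the subsample test set has 2,672 rows, which is 20% of the 13,360 rows drawn as a 10% sample of the 133,601 rows reported by dataset.shape.

```python
# Assumed test-set size: 20% of the 10% subsample of 133,601 rows
n_test = 2672

# Test accuracies printed by the notebook before and after tuning
svm_test_acc = 0.8536676646706587
svm_test_acc_tuned = 0.8540419161676647

# Convert accuracies back into counts of correctly classified rows
correct_before = round(svm_test_acc * n_test)
correct_after = round(svm_test_acc_tuned * n_test)
extra_correct = correct_after - correct_before

print(correct_before, correct_after, extra_correct)
```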





**Bayesian Network**

The attributes passed to the model include 'RainToday', 'MaxTemp', 'Rainfall', and 'RainTomorrow'.

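The query returns the probability of rain tomorrow given RainToday = 1. Because missing RainToday values were filled with 'Unknown' before label encoding, and LabelEncoder assigns codes in alphabetical order (No = 0, Unknown = 1, Yes = 2), this evidence most likely corresponds to the 'Unknown' category rather than to a rainy day; RainToday = 2 would represent rain.

The equal probabilities should therefore not be read as genuine uncertainty about rain after a rainy day. In addition, MaxTemp and Rainfall are continuous but are treated as discrete states, so most parent combinations in the conditional probability table of RainTomorrow have no training examples and fall back to the estimator's uniform prior, which pulls the marginalized result toward 0.5. Discretizing the continuous attributes into a few bins would likely give a more informative estimate.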
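
Two things are worth checking when interpreting this query: which category each integer code stands for, and how many distinct states the continuous parents contribute. The sketch below inspects the LabelEncoder mapping for the three RainToday categories and bins a made-up temperature column into three states with `pd.qcut`; the temperature values and bin labels are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# LabelEncoder sorts the classes, so codes follow alphabetical order
encoder = LabelEncoder().fit(["No", "Yes", "Unknown"])
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# Made-up temperatures, binned into three equally populated states
max_temp = pd.Series([12.0, 18.5, 23.1, 27.4, 31.0, 35.2])
max_temp_bin = pd.qcut(max_temp, q=3, labels=["low", "mid", "high"])
print(list(max_temp_bin))
```

With a handful of bins per parent, each row of the conditional probability table is backed by many more training examples than when every distinct temperature or rainfall value is its own state.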



# Conclusion



*   Logistic Regression and SVM Classifier exhibit the highest test accuracies, around 85%, indicating strong predictive performance.
*   Decision Tree Classifier initially suffers from overfitting but improves after hyperparameter tuning, indicating the importance of tuning in decision tree models.

*   Naive Bayes shows the lowest accuracy among the algorithms, suggesting that it may not capture the relationships between features as effectively as other algorithms.
*   Neural Network (MLPClassifier) reaches about 81% test accuracy but nearly 100% training accuracy both before and after tuning, indicating overfitting that the limited hyperparameter search did not address.

In conclusion, the choice of classification algorithm depends on factors such as the complexity of the data, interpretability of the model, and computational resources available.