# Naive Bayes on our dataset

***

**Description**
In this document we prepare the data for the Gaussian Naive Bayes algorithm and run it on our dataset. We then evaluate its accuracy on a held-out test set.





***

The first step is to load the libraries, import the dataset, fill in missing values, encode the categorical variables, and separate the features from the target.


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load the dataset
file_path = 'preprocessed.csv'
df = pd.read_csv(file_path)

# Fill missing values with the mean for continuous variables and the mode for categorical variables
df.fillna(df.select_dtypes(include=[np.number]).mean(), inplace=True)
for column in df.select_dtypes(include=['object']).columns:
    # Assign the result back instead of calling fillna(inplace=True) on df[column]:
    # the chained in-place call works on a copy and triggers a pandas FutureWarning
    df[column] = df[column].fillna(df[column].mode()[0])

# Encode categorical variables as integers, keeping each encoder for later inverse transforms
label_encoders = {}
for column in df.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    df[column] = le.fit_transform(df[column])
    label_encoders[column] = le

# Use 'match' as the target; exclude it and the decision and decision_o columns from the features
X = df.drop(['match', 'decision', 'decision_o'], axis=1)
y = df['match']

print(X.shape)
print(y.shape)


(8378, 62)
(8378,)


In [2]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)


(5864, 62)
(2514, 62)
(5864,)
(2514,)
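
The split above passes `stratify=y`, which asks `train_test_split` to keep the class proportions the same in the training and test sets. The sketch below illustrates that effect on synthetic data; `X_demo` and `y_demo` are hypothetical stand-ins and are not taken from `preprocessed.csv`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 160 positives and 840 negatives (not the notebook's data)
rng = np.random.default_rng(42)
y_demo = np.array([1] * 160 + [0] * 840)
X_demo = rng.normal(size=(1000, 3))

# Same split settings as the notebook cell above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# With stratification, the share of positives should be nearly identical in every subset
overall_rate = y_demo.mean()
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"overall: {overall_rate:.3f}, train: {train_rate:.3f}, test: {test_rate:.3f}")
```

Running the same comparison on `y`, `y_train`, and `y_test` from the notebook would confirm that the real split preserves the share of matches.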


In [3]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Initialize Gaussian Naive Bayes
gnb = GaussianNB()

# Perform cross-validation
cv = StratifiedKFold(n_splits=10)
cv_scores = cross_val_score(gnb, X_train, y_train, cv=cv)
mean_cv_score = np.mean(cv_scores)
print(f"Mean CV score: {mean_cv_score}")


Mean CV score: 0.7853053357443123
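
The mean alone does not show how much the score varies from fold to fold. A minimal sketch of reporting the spread as well, run here on synthetic data from `make_classification` (a hypothetical stand-in for the notebook's dataset). Note one assumption that differs from the cell above: `shuffle=True` is set so that fold membership does not depend on row order.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Hypothetical imbalanced dataset, used only to make the example self-contained
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.84, 0.16], random_state=42
)

# Shuffled, stratified 10-fold cross-validation (shuffling is an assumption of this sketch)
cv_demo = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=cv_demo)

# Report the per-fold accuracies together with their mean and standard deviation
print("Fold scores:", np.round(scores, 3))
print(f"Mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```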


In [4]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# Fit on the full training set, then predict on the test set
gnb.fit(X_train, y_train)
predicted = gnb.predict(X_test)

# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, predicted))
print("Accuracy Score:")
print(accuracy_score(y_test, predicted))
print("Classification Report:")
print(classification_report(y_test, predicted))


Confusion Matrix:
[[1680  426]
 [ 134  274]]
Accuracy Score:
0.7772474144789181
Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.80      0.86      2106
           1       0.39      0.67      0.49       408

    accuracy                           0.78      2514
   macro avg       0.66      0.73      0.68      2514
weighted avg       0.84      0.78      0.80      2514
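
To read these numbers, it helps to recompute them from the confusion matrix. In scikit-learn's convention the rows are the true classes and the columns are the predicted classes, so the four counts above can be plugged in directly. The sketch below also compares the accuracy with a simple reference point chosen for this illustration: a classifier that always predicts the majority class (0).

```python
# Counts copied from the confusion matrix above (rows = true class, columns = predicted class)
tn, fp = 1680, 426   # true class 0: predicted 0, predicted 1
fn, tp = 134, 274    # true class 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_pos = tp / (tp + fp)   # of the predicted matches, how many were real
recall_pos = tp / (tp + fn)      # of the real matches, how many were found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# Accuracy of always predicting the majority class (class 0 here)
majority_baseline = max(tn + fp, fn + tp) / total

print(f"accuracy:          {accuracy:.3f}")           # 0.777
print(f"precision (1):     {precision_pos:.3f}")      # 0.391
print(f"recall (1):        {recall_pos:.3f}")         # 0.672
print(f"f1 (1):            {f1_pos:.3f}")             # 0.495
print(f"majority baseline: {majority_baseline:.3f}")  # 0.838
```

The test accuracy of about 0.78 is lower than the 0.84 that always predicting "no match" would reach, because only 408 of the 2514 test rows are matches. The model does recover about two thirds of the real matches (recall 0.67), but roughly six in ten of its predicted matches are false positives (precision 0.39), so accuracy by itself gives an incomplete picture on this imbalanced target.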
