# Maria Yasin 

## 18716929

## ML with Python Assignment 1
## Encoding Category Features

In [None]:
# Import essential libraries
import pandas as pd
import numpy as np 
from sklearn.metrics import confusion_matrix 
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB, GaussianNB, BernoulliNB, MultinomialNB
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Load the dataset
cancer = pd.read_csv('breast-cancer.csv')
cancer.head()

In [None]:
# Variable types
cancer.dtypes

## Task 1 

### Ordinal Encoder

In [None]:
y = cancer.pop('irradiat').values # Set this as the y (target) 
print(cancer.columns)
print(y)
ord_encoder = OrdinalEncoder()
cancerOE = ord_encoder.fit_transform(cancer)
cancerOE

In [None]:
# Creating a new dataframe to look at our variables after applying the ordinal encoder
df = pd.DataFrame(cancerOE, columns = cancer.columns)
df.head()

In [None]:
df.dtypes

The dataframe above shows that all our categorical variables have been converted into numerical codes.
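As a minimal sketch of what the encoder does, the toy frame below (hypothetical column names and values, not taken from the dataset) shows that OrdinalEncoder sorts each column's categories and replaces every value with its position in that sorted list, stored as a float.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data; column names and values are illustrative only
toy = pd.DataFrame({
    "menopause": ["premeno", "ge40", "lt40", "ge40"],
    "breast": ["left", "right", "left", "left"],
})

toy_encoder = OrdinalEncoder()
toy_encoded = toy_encoder.fit_transform(toy)

# categories_ holds the sorted categories learned for each column;
# a value's code is its position in that list
for name, cats in zip(toy.columns, toy_encoder.categories_):
    print(name, list(cats))
print(toy_encoded)
```

The codes follow alphabetical order, which need not reflect any real ordering of the categories. CategoricalNB treats them only as category indices, so that order does not affect it.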

In [None]:
#Categorical Naive Bayes
catNB = CategoricalNB(fit_prior=True,alpha = 0.0001)
cancer_catNB = catNB.fit(cancerOE,y)
y_dash = cancer_catNB.predict(cancerOE) # y_dash is the prediction
confusion = confusion_matrix(y, y_dash)
print("Confusion matrix:\n{}".format(confusion)) 

The code above applies the Categorical Naive Bayes algorithm to our array 'cancerOE'. It trains the classifier using the fit method, makes predictions on the same data using the predict method, and computes the confusion matrix to evaluate the performance of the classifier.

The confusion matrix above has 190 true negatives (TN), 33 false negatives (FN), 28 false positives (FP), and 35 true positives (TP).
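To read the four counts off the matrix, it helps to know scikit-learn's layout: rows are true classes and columns are predicted classes, so a binary matrix is [[TN, FP], [FN, TP]]. A small sketch with made-up labels follows; the 'no'/'yes' values are an assumption about how the target is coded.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration; 'yes' is treated as the positive class
y_true = ["no", "no", "no", "no", "yes", "yes", "yes", "no"]
y_pred = ["no", "no", "yes", "no", "yes", "no", "yes", "no"]

# Fix the label order so row/column 0 is 'no' and row/column 1 is 'yes'
cm = confusion_matrix(y_true, y_pred, labels=["no", "yes"])

# Flattening row by row yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```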

In [None]:
#accuracy
from sklearn.metrics import accuracy_score
OEaccuracy_beforesplit = accuracy_score(y,y_dash)
# Print the accuracy
print(f"Accuracy: {OEaccuracy_beforesplit}")

The classifier correctly classified 225 samples (190 TNs and 35 TPs) out of a total of 286 samples, i.e., it has an accuracy of approximately 78.7%.

### OneHot Encoder

In [None]:
#Load the dataset again
cancer = pd.read_csv('breast-cancer.csv')
cancer
## OneHot Encoder
y = cancer.pop('irradiat').values # Set this as the y (target)
onehot_encoder = OneHotEncoder(sparse_output=False) # on scikit-learn < 1.2 use sparse=False instead
cancerOH = onehot_encoder.fit_transform(cancer)
cancerOH

In [None]:
#Features 
onehot_encoder.get_feature_names_out(cancer.columns)

In [None]:
bnb = BernoulliNB()
cancer_numNB = bnb.fit(cancerOH,y)
y_dash = cancer_numNB.predict(cancerOH) # y_dash is the prediction

Here we are performing classification on the cancerOH array using the Bernoulli Naive Bayes algorithm. The fit method is used to train the classifier on the data, and the predict method is used to generate predictions based on the trained classifier.

In [None]:
#Confusion matrix
confusion = confusion_matrix(y, y_dash)
print("Confusion matrix:\n{}".format(confusion)) 

The confusion matrix above has 187 true negatives (TN), 39 true positives (TP), 31 false positives (FP), and 29 false negatives (FN).

In [None]:
#accuracy
OHEaccuracy_beforesplit = accuracy_score(y,y_dash)
# Print the accuracy
print(f"Accuracy: {OHEaccuracy_beforesplit}")

The classifier correctly classified 226 samples out of a total of 286 samples, resulting in an accuracy of approximately 79%.

## Task 2

### CatBoost Encoder

In [None]:
import category_encoders as ce

In [None]:
#Load the dataset again
cancer = pd.read_csv('breast-cancer.csv')
cancer
# Define features and target
y = cancer['irradiat'] # target
x = cancer.drop('irradiat', axis = 1) # features

y = LabelEncoder().fit_transform(y) # convert y to numeric
# Define catboost encoder
cbe_encoder = ce.cat_boost.CatBoostEncoder()

Here we are setting the target variable y as the 'irradiat' column of the cancer dataframe, while the input features x are defined as all other columns except the 'irradiat' column in the breast cancer dataset.

We then apply the LabelEncoder() from the sklearn library to encode the target variable y as numeric values.

Finally, we define the CatBoostEncoder() encoder from the category_encoders library, which can be used later to encode categorical variables.
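The label-encoding step described above can be sketched on a few illustrative target values. The 'no'/'yes' coding is an assumption; LabelEncoder sorts the classes and replaces each label with its class index.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative target values; the 'no'/'yes' coding is an assumption
toy_target = ["no", "yes", "no", "no"]

label_encoder = LabelEncoder()
toy_target_num = label_encoder.fit_transform(toy_target)

print(list(label_encoder.classes_))  # classes in sorted order
print(toy_target_num)                # each label replaced by its class index
```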

In [None]:
# Fit encoder and transform the features
cbe_encoder.fit(x,y)
train_cbe = cbe_encoder.transform(x)

This code fits the CatBoostEncoder() encoder to the input features x and target variable y using the fit() function.

Then, it uses the encoding mappings learned during fit() to transform the categorical variables in x. The target variable y is used only when fitting, not in the transform() call. The encoded dataset is stored in a variable called train_cbe.
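The general idea behind this kind of target-based encoding can be sketched as a smoothed per-category target mean. This is a simplified illustration, not the category_encoders implementation, whose exact formula and handling of training rows may differ; the smoothing strength `a` and the toy data are hypothetical.

```python
import pandas as pd

# Illustrative data: one categorical feature and a 0/1 target
toy_feature = pd.Series(["a", "a", "b", "a", "b"])
toy_target = pd.Series([1, 0, 1, 1, 0])

prior = toy_target.mean()  # global target mean
a = 1.0                    # smoothing strength (hypothetical choice)

# Per-category target sum and count, learned during "fit"
stats = toy_target.groupby(toy_feature).agg(["sum", "count"])

# Smoothed target mean per category: pulled toward the prior for rare levels
mapping = (stats["sum"] + prior * a) / (stats["count"] + a)

# "transform" only looks categories up; it needs no target values
encoded = toy_feature.map(mapping)
print(mapping)
print(encoded.tolist())
```

Because each category is replaced by a single continuous value, the encoded features suit a model for continuous inputs such as GaussianNB.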

In [None]:
#Confusion Matrix
cancer_numGNB = GaussianNB().fit(train_cbe,y)
y_dash = cancer_numGNB.predict(train_cbe) # y_dash is the prediction
confusion = confusion_matrix(y, y_dash)
print("Confusion matrix:\n{}".format(confusion)) 

Here we are using the GaussianNB() function from the sklearn library to train a Naive Bayes model on the encoded dataset train_cbe and the target variable y.

We then make predictions on the same dataset using the predict() function and store the results in a variable y_dash.

Finally, we compute the confusion matrix using the confusion_matrix() function from sklearn.metrics, comparing the true target variable y with the predicted values y_dash.


The model correctly predicted 223 samples (181 TN and 42 TP) out of 286, and misclassified 63 samples (37 FP and 26 FN).

In [None]:
#accuracy
CBEaccuracy_beforesplit = accuracy_score(y,y_dash)
# Print the accuracy
print(f"Accuracy: {CBEaccuracy_beforesplit}")

The accuracy of the model is 77.97%.
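As a cross-check, the three accuracies can be recomputed from the confusion-matrix counts reported in the text. The dictionary keys are just descriptive labels chosen for this sketch.

```python
# Confusion-matrix counts as reported in the text above: (TN, FP, FN, TP)
reported = {
    "Ordinal + CategoricalNB": (190, 28, 33, 35),
    "OneHot + BernoulliNB": (187, 31, 29, 39),
    "CatBoost + GaussianNB": (181, 37, 26, 42),
}

accuracies = {}
for name, (tn, fp, fn, tp) in reported.items():
    total = tn + fp + fn + tp
    # Accuracy = correctly classified samples / all samples
    accuracies[name] = (tn + tp) / total
    print(f"{name}: {tn + tp}/{total} = {accuracies[name]:.4f}")
```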

## Task 3

### Train-Test Split

In [None]:
#Splitting the data as 50:50 (i.e. use 50% of the data for testing)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=0)

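# Print both shapes (a bare expression is only displayed if it is the last line of the cell)
print(X_train.shape, X_test.shape)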

In [None]:
#Ordinal encoder after the train-test split
ord_encoder = OrdinalEncoder()
cancerOE2 = ord_encoder.fit_transform(X_train)
cancerOE2

In [None]:
#Categorical Naive Bayes
catNB = CategoricalNB(fit_prior=True,alpha = 0.0001)
cancer_catNB = catNB.fit(cancerOE2,y_train)
y_dash2 = cancer_catNB.predict(cancerOE2) #prediction after the train-test split
confusion = confusion_matrix(y_train, y_dash2)
print("Confusion matrix:\n{}".format(confusion)) 

We repeat the steps from the previous tasks after the train-test split: we fit the encoder on the training half and transform it, train the classifier using the fit method, make predictions on that same training half using the predict method, and generate a confusion matrix.
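For comparison, a held-out evaluation would fit the encoder and the classifier on the training half and score on the test half. The sketch below is one possible variant on hypothetical toy data, not the procedure used in this notebook; shuffle=False is used only to keep the toy split deterministic so that every category appears in the training half.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: two categorical features and a binary target
toy_X = pd.DataFrame({
    "size": ["small", "large", "medium", "small"] * 10,
    "side": ["left", "right"] * 20,
})
toy_y = np.array(["no", "yes", "no", "no"] * 10)

# Deterministic 50:50 split: first half for training, second half for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.5, shuffle=False
)

# Learn the category codes from the training half only, then reuse them
enc = OrdinalEncoder()
X_tr_enc = enc.fit_transform(X_tr)
X_te_enc = enc.transform(X_te)

model = CategoricalNB(alpha=1.0).fit(X_tr_enc, y_tr)
train_acc = accuracy_score(y_tr, model.predict(X_tr_enc))
test_acc = accuracy_score(y_te, model.predict(X_te_enc))
print(f"Train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

On real data, a category that appears only in the test half would need extra handling, because the encoder and CategoricalNB only know the categories seen during fitting.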

In [None]:
#accuracy
OEaccuracy_aftersplit = accuracy_score(y_train,y_dash2)
# Print the accuracy
print(f"Accuracy: {OEaccuracy_aftersplit}")

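Our Ordinal encoder classifier has an accuracy of 81.1% after the train-test split, up from approximately 78.7% before it. Because the model is both fitted and scored on the training half (X_train), this is a training accuracy on half of the data. X_test is not used, so the increase does not by itself show that the classifier generalizes better.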

In [None]:
#### OneHot after train-test split
onehot_encoder = OneHotEncoder(sparse_output=False) # on scikit-learn < 1.2 use sparse=False instead
cancerOH2 = onehot_encoder.fit_transform(X_train)
cancerOH2

In [None]:
bnb = BernoulliNB()
cancer_numNB = bnb.fit(cancerOH2,y_train)
y_dash3 = cancer_numNB.predict(cancerOH2)  #prediction after the train-test split

In [None]:
confusion = confusion_matrix(y_train, y_dash3)
print("Confusion matrix:\n{}".format(confusion)) 

In [None]:
OHEaccuracy_aftersplit = accuracy_score(y_train,y_dash3)
# Print the accuracy
print(f"Accuracy: {OHEaccuracy_aftersplit}")

Once again, we repeat the same steps for the OneHot encoder after the train-test split. The new classifier has an accuracy of 76.92%, slightly lower than the roughly 79% obtained before the split.

In [None]:
###CatBoost after train-test split
# Fit encoder and transform the features
cbe_encoder.fit(x,y)
train_cbe2 = cbe_encoder.transform(X_train)

In [None]:
cancer_numGNB = GaussianNB().fit(train_cbe2,y_train)
y_dash4 = cancer_numGNB.predict(train_cbe2)  #prediction after the train-test split
confusion = confusion_matrix(y_train, y_dash4)
print("Confusion matrix:\n{}".format(confusion)) 

In [None]:
#accuracy
CBEaccuracy_aftersplit = accuracy_score(y_train,y_dash4)
# Print the accuracy
print(f"Accuracy: {CBEaccuracy_aftersplit}")

Lastly, we transform the training half with the CatBoost encoder and calculate the accuracy of our predictions with accuracy_score. Note that the encoder itself is fitted on the full dataset (x, y) rather than on X_train alone. We get an accuracy of 79.02%, a slight increase over the 77.97% obtained before the train-test split.

## Task 4

### Plot

In [None]:
# data
before_split = [OEaccuracy_beforesplit, OHEaccuracy_beforesplit, CBEaccuracy_beforesplit]
after_split = [OEaccuracy_aftersplit, OHEaccuracy_aftersplit, CBEaccuracy_aftersplit]

# set the width of the bars
bar_width = 0.25

# create an array of indices for the bars
indices = np.arange(len(before_split))

# create a new figure object with a larger size
fig, ax = plt.subplots(figsize=(12, 6))

# plot the bar chart
ax.bar(indices, before_split, width=bar_width, label='Before Split')
ax.bar(indices + bar_width, after_split, width=bar_width, label='After Split')

# label the x-axis
ax.set_xticks(indices + bar_width / 2)
# labels must follow the order of before_split / after_split: Ordinal, OneHot, CatBoost
ax.set_xticklabels(['Ordinal','OneHot','CatBoost'])

# add labels and title
ax.set_xlabel('Encoding Models')
ax.set_ylabel('Accuracy')
ax.set_title('Accuracy Before and After Train-Test Split')

# adjust the position and size of the legend
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left')

# adjust the plot layout to accommodate the legend
plt.subplots_adjust(right=0.8)

# show the plot
plt.show()

The plot above shows the accuracy of our 3 classifiers before and after the train-test split. We can conclude that the accuracy of our Ordinal encoder classifier and CatBoost encoder classifier increased after the split (from about 78.7% to 81.1% and from 77.97% to 79.02%), whereas the accuracy of our OneHot encoder classifier decreased (from about 79% to 76.92%).
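The direction of each change can be checked directly from the accuracies reported in the text. In this sketch the values are the rounded percentages quoted above, and keeping names and values in one mapping is a suggested way to keep bar labels and bar heights in the same order.

```python
# Accuracies (%) as reported in the text: (before split, after split)
reported_accuracy = {
    "Ordinal": (78.67, 81.10),
    "OneHot": (79.02, 76.92),
    "CatBoost": (77.97, 79.02),
}

# Labels and values come from the same mapping, so they cannot get out of order
labels = list(reported_accuracy)
before = [pair[0] for pair in reported_accuracy.values()]
after = [pair[1] for pair in reported_accuracy.values()]

direction = {
    name: "increased" if aft > bef else "decreased"
    for name, (bef, aft) in reported_accuracy.items()
}
print(direction)
```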