##### Explain why it is not useful to include the column 'Roommate' in a classification procedure.

Including the Roommate column in the classification is not helpful, because it is only an identifier: a sequence number that labels each student. It carries no information relevant to the main task, which is predicting the test result from each roommate's symptoms.
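
As a rough illustration of this point, the sketch below attaches a hypothetical Roommate column to part of the data. The values 1 to 7 and the names `df_with_id`, `is_pure_identifier`, and `features` are assumptions made for the example; the shivers and test result values are taken from the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical reconstruction: the 'Roommate' values 1..7 are assumed identifiers.
df_with_id = pd.DataFrame({
    'Roommate': [1, 2, 3, 4, 5, 6, 7],
    'shivers': ['Y', 'N', 'Y', 'N', 'N', 'Y', 'Y'],
    'test result': ['Negative', 'Negative', 'Positive', 'Negative',
                    'Positive', 'Negative', 'Positive'],
})

# Every row has its own value, so the column cannot group rows that share a class.
is_pure_identifier = df_with_id['Roommate'].nunique() == len(df_with_id)

# Drop the identifier (and the target) before building the feature matrix.
features = df_with_id.drop(columns=['Roommate', 'test result'])

print(is_pure_identifier, list(features.columns))
```

A column with one distinct value per row can only be memorized by a classifier; it says nothing about a student who was not in the training data.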

### Train a Categorical Naive Bayes classifier

#### Making data

In [None]:
import pandas as pd

# Define the dataset
data = {
    'shivers': ['Y', 'N', 'Y', 'N', 'N', 'Y', 'Y'],
    'running nose': ['N', 'N', 'Y', 'Y', 'N', 'N', 'Y'],
    'headache': ['No', 'Mild', 'No', 'No', 'Heavy', 'No', 'Mild'],
    'test result': ['Negative', 'Negative', 'Positive', 'Negative', 'Positive', 'Negative', 'Positive']
}

# Create a DataFrame from the dataset
df = pd.DataFrame(data)

#### Preparing data

To train a Categorical Naive Bayes classifier on the dataset above, the nominal data must first be converted into a numeric format that scikit-learn can work with. The following code uses the get_dummies function from pandas to convert each categorical variable into binary (0 or 1) indicator features, which CategoricalNB then treats as two-category features.

In [None]:
from sklearn.naive_bayes import CategoricalNB

# Converting categorical variables to binary features
df_encoded = pd.get_dummies(df.drop('test result', axis=1))

# Creating the target variable
target = df['test result']

# Training the Categorical Naive Bayes classifier
nb_classifier = CategoricalNB()
nb_classifier.fit(df_encoded, target)
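
To make the effect of the encoding step concrete, the following sketch rebuilds only the symptom columns and prints the indicator columns that get_dummies produces, together with the encoded row for observation 5. The names `symptoms` and `encoded` are introduced here for illustration only.

```python
import pandas as pd

# Symptom columns copied from the dataset above (target column left out).
symptoms = pd.DataFrame({
    'shivers': ['Y', 'N', 'Y', 'N', 'N', 'Y', 'Y'],
    'running nose': ['N', 'N', 'Y', 'Y', 'N', 'N', 'Y'],
    'headache': ['No', 'Mild', 'No', 'No', 'Heavy', 'No', 'Mild'],
})

# One indicator column per (feature, category) pair.
encoded = pd.get_dummies(symptoms)
print(list(encoded.columns))

# Observation 5 (position 4): shivers N, running nose N, headache Heavy.
# astype(int) makes the output independent of the pandas indicator dtype.
print(encoded.iloc[4].astype(int).tolist())
```

Three nominal features become seven indicator columns, so the classifier is fitted on seven binary features rather than three multi-valued ones.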

#### Predicting

Now the classifier can predict the test results for the dataset. Observation number 5 is the case where the prediction and the actual label disagree, so its prediction probabilities for the Negative and Positive classes are examined below.

In [None]:
# Retrieving the prediction probabilities for observation 5
# Position 4 (0-based) is observation 5; double brackets keep a one-row DataFrame
# so the feature names match those seen during fit
observation_5 = df_encoded.iloc[[4]]
prediction_probabilities = nb_classifier.predict_proba(observation_5)

# Displaying the prediction probabilities (columns follow nb_classifier.classes_)
print(f'Classes: {nb_classifier.classes_}')
print(f'Prediction probabilities: {prediction_probabilities}')

The prediction probabilities indicate that the classifier assigns a higher probability to the Negative class (0.598) than to the Positive class (0.402) for observation number 5. This higher probability for the Negative class is why the classifier predicts a Negative result for this observation, even though the actual value is Positive.
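
The probabilities quoted above can also be reproduced by hand. The sketch below is a plain-Python reconstruction, not scikit-learn's own code. It assumes the default settings of CategoricalNB (smoothing alpha = 1 and class priors estimated from the data) and writes out the one-hot rows manually in the column order produced by get_dummies. The helper `joint` and the other names are hypothetical.

```python
from fractions import Fraction

# One-hot rows, columns in this order:
# shivers_N, shivers_Y, running nose_N, running nose_Y,
# headache_Heavy, headache_Mild, headache_No
rows = [
    (0, 1, 1, 0, 0, 0, 1),  # obs 1: Y, N, No
    (1, 0, 1, 0, 0, 1, 0),  # obs 2: N, N, Mild
    (0, 1, 0, 1, 0, 0, 1),  # obs 3: Y, Y, No
    (1, 0, 0, 1, 0, 0, 1),  # obs 4: N, Y, No
    (1, 0, 1, 0, 1, 0, 0),  # obs 5: N, N, Heavy
    (0, 1, 1, 0, 0, 0, 1),  # obs 6: Y, N, No
    (0, 1, 0, 1, 0, 1, 0),  # obs 7: Y, Y, Mild
]
labels = ['Negative', 'Negative', 'Positive', 'Negative',
          'Positive', 'Negative', 'Positive']

ALPHA = 1         # assumed: the default smoothing of CategoricalNB
N_CATEGORIES = 2  # each indicator column takes the value 0 or 1


def joint(x, cls):
    """Class prior times the smoothed per-feature likelihoods of x."""
    members = [r for r, y in zip(rows, labels) if y == cls]
    score = Fraction(len(members), len(rows))  # empirical class prior
    for j, value in enumerate(x):
        matches = sum(1 for r in members if r[j] == value)
        score *= Fraction(matches + ALPHA, len(members) + ALPHA * N_CATEGORIES)
    return score


x5 = rows[4]  # observation 5
scores = {cls: joint(x5, cls) for cls in ('Negative', 'Positive')}
total = sum(scores.values())
posterior = {cls: float(s / total) for cls, s in scores.items()}

print({cls: round(p, 3) for cls, p in posterior.items()})
# {'Negative': 0.598, 'Positive': 0.402}
```

The Negative class wins mainly through its larger prior (4/7 against 3/7) and the indicator columns that are 0 for this observation; the only Heavy headache in the data belongs to a Positive case, but a single smoothed count is not enough to outweigh the rest.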

#### Visualizing the results

To visualize the results, a confusion matrix and a classification report are used. The confusion matrix gives an overview of the predicted and actual classes, while the classification report gives more detailed performance metrics such as precision, recall, and F1-score for each class.

In [None]:
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report
import matplotlib.pyplot as plt

# Make predictions on the entire dataset
predictions = nb_classifier.predict(df_encoded)

In [None]:
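# Creating a confusion matrix (rows: true class, columns: predicted class)
class_labels = nb_classifier.classes_
confusion_mat = confusion_matrix(target, predictions, labels=class_labels)

# Labeling the ticks with the class names instead of 0 and 1
sns.heatmap(confusion_mat, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_labels, yticklabels=class_labels)
plt.xlabel('Predicted Class')
plt.ylabel('True Class')
plt.title('Confusion Matrix')
plt.show()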

The top-left value (4) represents the number of instances that were correctly predicted as Negative (True Negative).

The top-right value (0) represents the number of instances that were predicted as Positive but are actually Negative (False Positive).

The bottom-left value (1) represents the number of instances that were predicted as Negative but are actually Positive (False Negative).

The bottom-right value (2) represents the number of instances that were correctly predicted as Positive (True Positive).

In total, 4 + 2 = 6 instances are correctly predicted and 1 instance is misclassified.
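
The four cells can be recounted from the label lists without any library. In the sketch below, the `predicted` list is reconstructed from the text rather than taken from the classifier: it is assumed to match the actual labels everywhere except observation 5, the single misclassified case. All variable names are illustrative.

```python
from collections import Counter

actual = ['Negative', 'Negative', 'Positive', 'Negative',
          'Positive', 'Negative', 'Positive']

# Assumed predictions: identical to the actual labels except observation 5,
# which is reported as predicted Negative.
predicted = list(actual)
predicted[4] = 'Negative'

# Count each (true class, predicted class) pair.
cells = Counter(zip(actual, predicted))

# Rows are true classes, columns are predicted classes, in sorted label order.
classes = ['Negative', 'Positive']
matrix = [[cells[(t, p)] for p in classes] for t in classes]

print(matrix)  # [[4, 0], [1, 2]]
```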

In [None]:
# Creating a classification report
report = classification_report(target, predictions)

print('Classification Report:')
print(report)

The classification report provides a comprehensive assessment of the model's performance for both the Negative and Positive classes. The precision, recall, and F1-score values offer insight into the model's ability to correctly classify instances, and the support values give context to the class distribution. The accuracy and average F1-scores give an overall view of the model's performance.

Precision: For the 'Negative' class, the precision is 0.80, indicating that 80% of the instances predicted as 'Negative' were actually 'Negative'. For the 'Positive' class, the precision is 1.00, meaning that all instances predicted as 'Positive' were actually 'Positive'.

Recall: For the 'Negative' class, the recall is 1.00, indicating that all actual 'Negative' instances were correctly identified. For the 'Positive' class, the recall is 0.67, meaning that 67% of actual 'Positive' instances were correctly identified.

F1-score: The F1-score is the harmonic mean of precision and recall, so it provides a balanced view of the model's performance. For the 'Negative' class, the F1-score is 0.89, and for the 'Positive' class, the F1-score is 0.80.

Support: Support is the number of actual occurrences of each class in the evaluated data, which here is the same dataset the classifier was trained on. For the 'Negative' class, the support is 4, and for the 'Positive' class, the support is 3.

Accuracy: The accuracy is 0.86, meaning that the model predicted the correct test result for 86% of the instances.

Macro Average F1-score: The macro average F1-score is the average of the F1-scores of all classes. It gives equal weight to each class, regardless of its size. In this case, the macro average F1-score is 0.84.

Weighted Average F1-score: The weighted average F1-score is the average of the F1-scores, weighted by the support of each class. It provides a measure of overall performance that takes class imbalance into account. In this case, the weighted average F1-score is 0.85.
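
All of the figures in the report follow from the four confusion-matrix counts. The sketch below recomputes them in plain Python as a cross-check. The helper `prf` and the variable names are hypothetical, and 'Positive' is treated as the positive class when naming the counts.

```python
# Confusion-matrix counts from the section above.
tn, fp, fn, tp = 4, 0, 1, 2


def prf(correct, predicted_total, actual_total):
    """Return (precision, recall, F1) for one class."""
    precision = correct / predicted_total
    recall = correct / actual_total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Negative class: tn + fn rows were predicted Negative, tn + fp are actually Negative.
neg = prf(tn, tn + fn, tn + fp)
# Positive class: tp + fp rows were predicted Positive, tp + fn are actually Positive.
pos = prf(tp, tp + fp, tp + fn)

support = {'Negative': tn + fp, 'Positive': tp + fn}
total = sum(support.values())

accuracy = (tn + tp) / total
macro_f1 = (neg[2] + pos[2]) / 2
weighted_f1 = (neg[2] * support['Negative'] + pos[2] * support['Positive']) / total

print([round(v, 2) for v in neg])  # [0.8, 1.0, 0.89]
print([round(v, 2) for v in pos])  # [1.0, 0.67, 0.8]
print(round(accuracy, 2), round(macro_f1, 2), round(weighted_f1, 2))  # 0.86 0.84 0.85
```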