# CSCI 5622: Machine Learning
## Fall 2023
### Instructor: Daniel Acuna, Associate Professor, Department of Computer Science, University of Colorado at Boulder

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [1]:
NAME = "Luk Letif"
COLLABORATORS = ""

---

In [2]:
# Necessary imports
import matplotlib.pyplot as plt
import numpy as np
import sklearn

In [3]:
from sklearn.datasets import make_classification

# Generate a synthetic cancer dataset
X, y = make_classification(n_samples=10_000,     # Number of samples
                           n_features=10,      # Total number of features
                           n_informative=5,    # Number of informative features
                           n_redundant=3,      # Number of redundant features
                           n_clusters_per_class=1, # Number of clusters per class
                           weights=[0.9, 0.1], # Imbalanced classes (about 90% / 10%)
                           flip_y=0.05,        # Adds a bit of noise
                           class_sep=0.8,      # Make classes slightly separable
                           random_state=42)    # Seed for reproducibility
# Binarize the features: 1 if the value is positive, 0 otherwise
X = (X > 0).astype(int)

# Part 2: Classification with Naive Bayes

### Introduction to Scikit-Learn

**Scikit-learn** is a widely used machine learning library for Python that we will use throughout the semester. It provides algorithms and utilities for almost every standard machine learning task.

### Basic Workflow with Scikit-Learn

1. **Preparing Data:** (already done above)
    - Begin by organizing your data. Typically in machine learning, especially with `scikit-learn`, you'll have input data referred to as `X` and labels referred to as `y`. Here, `X` is a matrix where each row is a sample and each column is a feature. In contrast, `y` is a vector of labels corresponding to each sample.

2. **Splitting Data:**
    - Divide your dataset into two parts: a training set and a testing set. The training set is used to teach your model, while the testing set is used to evaluate its generalization performance (i.e., risk).

3. **Defining the Model:**
    - Choose the algorithm that best fits your task. For this assignment, you'll be working with the Bernoulli naive Bayes classifier. In `scikit-learn`, each algorithm is implemented as a class. To use it, you'll first instantiate it, resulting in what's commonly referred to as a model.

4. **Training the Model:**
    - Once you've chosen a model, the next step is to train it using your training data. This is typically done using a method called `fit()`, where you'll pass in your training data and corresponding labels.

5. **Evaluating the Model:**
    - After training, you'll want to see how well your model performs. You can achieve this by making predictions on your testing data and comparing those predictions to the actual labels. Many metrics can be used for evaluation, with accuracy being one of the most common for classification tasks.

Remember, the key to mastering machine learning and `scikit-learn` is practice. As you experiment with different algorithms and datasets, you'll gain a deeper understanding of the underlying concepts and the nuances of each method.
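The five steps above can be sketched end to end on a tiny toy problem. This is only an illustration under stated assumptions: the toy data and the names `X_toy`, `y_toy`, and `toy_model` are made up for this example and are not part of the assignment, so they do not overwrite the variables used in the questions below.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# 1. Prepare data (hypothetical toy data): 200 samples, 4 binary features.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(200, 4))
# The label follows the first feature, with roughly 10% of labels flipped as noise.
y_toy = X_toy[:, 0] ^ (rng.random(200) < 0.1).astype(int)

# 2. Split into training and testing halves.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=0
)

# 3. Define the model by instantiating its class.
toy_model = BernoulliNB()

# 4. Train the model on the training half only.
toy_model.fit(X_tr, y_tr)

# 5. Evaluate on the held-out half.
toy_pred = toy_model.predict(X_te)
toy_accuracy = accuracy_score(y_te, toy_pred)
print(f"Toy accuracy: {toy_accuracy:.2f}")
```

The same pattern (instantiate, `fit`, `predict`, score) applies to most `scikit-learn` estimators, which is why the questions below follow it step by step.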

In [4]:
# the matrix X contains a set of 10 binary markers predictive of a disease y
# (this is simulated data)
X

array([[0, 1, 0, ..., 0, 1, 1],
       [1, 1, 1, ..., 0, 1, 0],
       [0, 1, 1, ..., 1, 1, 0],
       ...,
       [0, 1, 0, ..., 1, 1, 0],
       [1, 0, 0, ..., 1, 1, 0],
       [0, 1, 1, ..., 1, 1, 0]])

In [5]:
# prevalence of each marker
X.mean(axis=0)

array([0.4104, 0.7192, 0.3411, 0.6876, 0.4409, 0.6346, 0.2661, 0.5054,
       0.5074, 0.7181])

In [6]:
y

array([0, 0, 0, ..., 0, 0, 0])

In [7]:
# most people don't have the disease
y.mean()

0.1215

## Question 1: Splitting Data for Training and Testing

Using the synthetic cancer dataset generated above, which simulates binary disease markers, your task is as follows:

**Your Task:**

1. Split the dataset (`X` and `y`) into two subsets: `X_train`, `X_test` for the features and `y_train`, `y_test` for the labels.
2. Ensure a 50-50 split, so both training and testing sets have an equal number of samples.
3. The split should be reproducible; set a fixed random seed.
4. Once you've split the data, report the number of samples in `X_train`, `X_test`, `y_train`, and `y_test`.

**Hints:**

- Use the `train_test_split` function from `sklearn.model_selection`.
- The `random_state` parameter in `sklearn` functions can set a fixed seed, ensuring reproducibility.

**Requirements:**

- Store the training features in `X_train`, testing features in `X_test`, training labels in `y_train`, and testing labels in `y_test`.
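As a quick illustration of the two hints, the sketch below calls `train_test_split` twice with the same `random_state` on a small made-up array and checks that both calls return identical splits. The arrays `X_demo` and `y_demo` are hypothetical and exist only for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical demo data: 10 samples, 2 features, alternating labels.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# Two calls with the same seed; each returns
# (X_train, X_test, y_train, y_test) in that order.
split_a = train_test_split(X_demo, y_demo, test_size=0.5, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.5, random_state=0)

# With a fixed seed, every returned array matches across the two calls.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
split_shapes = [part.shape for part in split_a]

print("Identical splits:", same_split)
print("Shapes:", split_shapes)
```

With `test_size=0.5`, the 10 demo samples are divided into two halves of 5 each, which mirrors the 50-50 split requested in this question.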

In [8]:
from sklearn.model_selection import train_test_split

# YOUR CODE HERE
# raise NotImplementedError()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)


In [9]:
""" (10 pts) check that data is correctly defined """
# Check the size of training data
assert X_train.shape[0] == 5000, "The number of samples in X_train is incorrect."
assert y_train.shape[0] == 5000, "The number of samples in y_train is incorrect."

# Check the size of testing data
assert X_test.shape[0] == 5000, "The number of samples in X_test is incorrect."
assert y_test.shape[0] == 5000, "The number of samples in y_test is incorrect."

## Question 2: Defining and Fitting the Bernoulli Naive Bayes Model

Naive Bayes classifiers are a family of probabilistic classifiers based on Bayes' theorem. The Bernoulli Naive Bayes classifier is particularly suited for binary/boolean features, making it a perfect fit for our dataset with binary disease markers.

Your task is to define and fit a Bernoulli Naive Bayes model to the training data you've prepared in the previous question.

**Your Task:**

1. Define a Bernoulli Naive Bayes classifier.
2. Fit the classifier to the training data (`X_train` and `y_train`).
3. (Optional) After fitting, check the `class_log_prior_` attribute of the trained model, which holds the log of the prior probability of each class as estimated from the training labels. This will give you an idea of the distribution of the two classes in the training data.

**Hints:**

- The `BernoulliNB` class in `sklearn.naive_bayes` is your go-to for defining a Bernoulli Naive Bayes classifier.
- Note that by default, `BernoulliNB` applies additive (Laplace) smoothing with `alpha=1.0`.
- Like other `sklearn` models, the `.fit()` method will be used to train your classifier.
- Remember to only fit the model to the training data. The testing data (`X_test` and `y_test`) will be used later for evaluation.
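To make the smoothing hint and the `class_log_prior_` attribute concrete, the sketch below fits `BernoulliNB` on a tiny hand-made dataset and recomputes both quantities by hand. The data (`X_small`, `y_small`) is hypothetical, and the manual formulas assume the default settings (`alpha=1.0`, priors learned from the data).

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical data: 6 samples, 2 binary features, 3 samples per class.
X_small = np.array([[1, 0], [1, 1], [0, 0], [1, 0], [0, 1], [0, 0]])
y_small = np.array([1, 1, 1, 0, 0, 0])

alpha = 1.0  # scikit-learn's default smoothing strength
nb_small = BernoulliNB(alpha=alpha).fit(X_small, y_small)

# Class prior: fraction of training samples in each class, on a log scale.
manual_log_prior = np.log(nb_small.class_count_ / nb_small.class_count_.sum())

# Smoothed P(feature = 1 | class): add alpha to the count of ones and
# 2 * alpha to the class count (one alpha for each of the two outcomes).
manual_feature_prob = (nb_small.feature_count_ + alpha) / (
    nb_small.class_count_[:, None] + 2 * alpha
)

prior_matches = np.allclose(nb_small.class_log_prior_, manual_log_prior)
feature_matches = np.allclose(
    np.exp(nb_small.feature_log_prob_), manual_feature_prob
)
print("Prior matches:", prior_matches)
print("Smoothed feature probabilities match:", feature_matches)
```

Smoothing keeps every estimated probability strictly between 0 and 1, so a marker value never seen with a given class in training cannot force that class's posterior to zero.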

In [10]:
from sklearn.naive_bayes import BernoulliNB

# Define the Bernoulli Naive Bayes classifier
bnb_model = BernoulliNB()

# YOUR CODE HERE
# raise NotImplementedError()
bnb_model.fit(X_train, y_train)


In [11]:
""" (10 pts) simple checks """
# Check that bnb_model is an instance of BernoulliNB
assert isinstance(bnb_model, BernoulliNB), \
    "bnb_model is not an instance of BernoulliNB."

# Check that bnb_model has been fitted
# One way to check this is by ensuring that attributes available after fitting are present
assert hasattr(bnb_model, "class_log_prior_"), \
    "bnb_model does not seem to be fitted."

# Optionally, check that the model has been fitted with the right data size
assert bnb_model.class_count_.sum() == 5000, \
    "bnb_model doesn't seem to be fitted with the correct number of samples."

## Question 3: (10 pts) Evaluating the Bernoulli Naive Bayes Model

Now that you've trained a Bernoulli Naive Bayes classifier, the next step is to evaluate its performance on the test data. Evaluating a model is crucial to understand how well it might perform on unseen real-world data.

**Your Task:**

1. Use your trained Bernoulli Naive Bayes model, `bnb_model`, to predict the labels of the test data (`X_test`).
2. Calculate the accuracy of the model's predictions against the true labels (`y_test`).
3. Report the calculated accuracy.

**Hints:**

- `sklearn` provides a convenient function, `accuracy_score`, in the `sklearn.metrics` module to compute the accuracy of predictions.
- Remember the general workflow: Use the `.predict()` method on your trained model to get predictions and then compare these predictions to the true labels to evaluate accuracy.
  
**Requirements:**

- Store your predictions in a variable named `y_pred`.
- Compute the accuracy of your model on the test data, store it in the variable `accuracy`, and report it.
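For reference, accuracy is simply the fraction of predictions that match the true labels. The sketch below checks this on a small made-up pair of label vectors (`true_labels` and `pred_labels` are hypothetical and unrelated to the assignment data).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 8 samples, 2 of which are predicted incorrectly.
true_labels = np.array([0, 0, 1, 0, 1, 0, 0, 1])
pred_labels = np.array([0, 0, 1, 0, 0, 0, 1, 1])

# accuracy_score takes the true labels first, then the predictions.
acc_sklearn = accuracy_score(true_labels, pred_labels)

# Equivalent manual computation: mean of the element-wise matches.
acc_manual = np.mean(true_labels == pred_labels)

print(acc_sklearn, acc_manual)  # both are 0.75 (6 of 8 correct)
```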

In [12]:
from sklearn.metrics import accuracy_score

# YOUR CODE HERE
# raise NotImplementedError()
y_pred = bnb_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

# Report the calculated accuracy
print(f"Accuracy of the Bernoulli Naive Bayes model on the test data: {accuracy * 100:.2f}%")

Accuracy of the Bernoulli Naive Bayes model on the test data: 84.48%


In [13]:
""" (10 pts) Simple test """
# Check if y_pred is correctly defined
assert 'y_pred' in locals(), "y_pred is not defined."
assert len(y_pred) == 5000, "The length of y_pred is not correct."

# Check if the predictions are in the right range
assert set(y_pred).issubset({0, 1}), "y_pred contains values other than 0 or 1."
