<div style="text-align:center; color:gray;">
    <h1>DSCI-D 590 – Introduction to Python Programming</h1>
    <h1>Final Project</h1>
</div>

<div class="alert alert-block alert-warning">  
<b>Notes:</b> 
    
- Use pandas to perform vectorized operations on DataFrames and Series.<br>
    
- Do not alter the given function names specified in the assignment prompt. Your task is to concentrate your coding efforts solely within those specified sections.<br>

  
- You should only provide your solution in the designated areas marked by the comment "# Write your code here."<br>
    

- <b>Don't forget to call the relevant function(s) in your Jupyter Notebook after completing your answers. Display the results or output of calling the function(s) to demonstrate their functionality.</b>
    
</div>

<div class="alert alert-block alert-info">
<h2>
<b> PROBLEM STATEMENT</b>
</h2> 
    
<br>
The objective of the final project is to implement the $k$-means algorithm for the Wisconsin Breast Cancer data set
using Python.
<br>
<br>
Breast cancer is a rising issue among women. A cancer’s stage is a crucial factor in deciding what
treatment options to recommend, and in determining the patient’s prognosis. Today, in the United States,
approximately one in eight women is at risk of developing breast cancer over her lifetime. An analysis
of the most recent data has shown that the survival rate is 88% five years after diagnosis and 80% ten
years after diagnosis. With early detection and treatment, it is possible that this type of cancer will go into
remission. In such a case, the worst fear of a cancer patient is the recurrence of the cancer.
<br>
<br>  
In the final project we will address this issue by using $k$-means clustering to analyze cancer data and
classify patients into two groups, one with benign and the other with malignant cells. You will
implement a $k$-means clustering program in Python and test it on the well-known Wisconsin Breast Cancer data.
    
</div>


<div class="alert alert-block alert-info">
<h2>
<b> DATA SET DESCRIPTION</b>
</h2> 
    
Wolberg’s breast cancer data can be found here
    (https://archive.ics.uci.edu/dataset/15/breast+cancer+wisconsin+original). Samples arrive periodically as Dr. Wolberg reports his
clinical cases. There are 11 columns (attributes) in this data set. The first column contains Sample code
number (SCN), which is the patient id number.

![image.png](attachment:image.png)

</div>


<div class="alert alert-block alert-warning">  
<h2><b>PHASE 1 (30%)</b></h2>
    
<br>
    
This project has been designed to be completed in three weeks. Each week you will deliver different phases of the project.
<br>    
During the first week, you need to complete the following data analysis tasks.
- Get yourself comfortable with the $k$-means algorithm
- Download the data and load it in Python
- Impute missing values
- Compute data statistics
- Plot basic graphs  

</div>

<div class="alert alert-block alert-info">
<h2>
<b> INSTRUCTIONS</b>
</h2>

a) Download the breast cancer data from the following UCI Machine Learning Repository page.
Link: https://archive.ics.uci.edu/dataset/15/breast+cancer+wisconsin+original
<br>   
b) Load dataset into Python.

<div class="alert alert-block alert-warning">  
<b>Hints:</b> 
    
You may use the pandas library to load the dataset. Use the option `na_values='?'` so that the missing values entered in column `A7` as `?` are read as `NaN`.

The downloaded data is in the file `breast-cancer-wisconsin.data` (without the names of the columns).

The following statements read the data into the dataframe `df`, replace the `?` entries with `NaN`, and assign the column names:

`col = ["Scn", "A2", "A3", "A4", "A5", "A6", "A7", "A8", "A9", "A10", "Class"]`

`df = pd.read_csv('breast-cancer-wisconsin.data', na_values = '?', names = col)`
    
</div>
<br> 
    
c) Impute the missing values in column `A7`. The missing values appear either as `?` (in the raw file) or as `NaN` (after loading with `na_values='?'`). You may replace missing values using techniques like mean, median, or mode imputation, or any other method of your choice.  
    
Example: If you want to impute missing values by the `mean` imputation method, first compute the mean of column `A7` without considering the missing entries. Then replace every missing entry with the computed mean.   <br>  
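
As a rough illustration of mean imputation (a sketch on a made-up toy Series, not the actual `A7` data), pandas skips `NaN` when computing the mean, so the result can be passed straight to `fillna`. The variable names `a7`, `a7_mean` and `a7_filled` are assumptions chosen for this example only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for A7; NaN marks a missing entry
a7 = pd.Series([1.0, 10.0, np.nan, 4.0, np.nan, 5.0], name="A7")

# Series.mean() skips NaN by default, so only the four known values count
a7_mean = a7.mean()

# fillna returns a new Series with every NaN replaced by the mean
a7_filled = a7.fillna(a7_mean)

print(a7_mean)             # 5.0
print(a7_filled.tolist())  # [1.0, 10.0, 5.0, 4.0, 5.0, 5.0]
```

Assigning the result back (for example `df['A7'] = df['A7'].fillna(...)`) avoids relying on in-place modification of a column selection.
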
d) Find the mean, median, standard deviation and variance of each of the attributes `A2` to `A10`. You will therefore have nine values each of the mean, median, variance and standard deviation. Print all results rounded to 1 decimal place.    
<br> 
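
One possible way to get all four statistics at once is `DataFrame.agg`, sketched below on a small invented table (the values are not from the breast cancer data). Note that pandas computes the sample variance and standard deviation (`ddof=1`) by default; whether the assignment expects sample or population statistics is an assumption you may want to check.

```python
import pandas as pd

# Hypothetical toy data with two attribute columns
toy = pd.DataFrame({"A2": [2, 4, 6, 8], "A3": [1, 1, 1, 9]})

# One row per statistic, one column per attribute, rounded to 1 decimal place
stats = toy.agg(["mean", "median", "var", "std"]).round(1)
print(stats)
```
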
e) Plot histograms with 10 bins for attributes `A2` to `A10` (nine histograms).    

<div class="alert alert-block alert-warning">  
<b>Hints:</b> 
A histogram is a kind of bar plot that gives a discretized display of value frequency. The data points are split into discrete, evenly spaced bins, and the number of data points in each bin is plotted. Use `matplotlib` and `hist`  method on the Series to plot a histogram.
    
For example, if $s$ is a Series, the following statement plots to subplot $sp$ a histogram with 10 blue
bins and opacity 0.5:

`sp.hist(s, bins=10, color = "blue", alpha = 0.5)`
    
</div>    
    
    
    
    
</div>


In [3]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

def analyze_breast_cancer_data():
    # Step 1: Load the dataset
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
        
    col = ['Scn','A2','A3','A4','A5','A6','A7','A8','A9','A10','Class']
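    # The file has no header row, so pass the column names explicitly;
    # otherwise the first data row would be consumed as the header
    df = pd.read_csv(url, na_values="?", names=col)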

    # Step 2: Impute missing values
    mean_value = df['A7'].mean()
    df['A7'] = df['A7'].fillna(mean_value)

    # Step 3: Calculate statistics
    attributes = df.columns[1:-1]
    statistics = df[attributes].agg(['mean', 'median', 'var', 'std']).round(1)
    print(statistics)

    # Step 4: Plot histograms
    plt.figure(figsize=(12, 9))
    for i, attribute in enumerate(attributes):
        plt.subplot(3, 3, i + 1)
        plt.hist(df[attribute], bins=10, color="blue", alpha=0.5)
        plt.title(f"Histogram of attribute {attribute}")
    plt.tight_layout()
    plt.show()
    
analyze_breast_cancer_data()

#### Sample Output:
Attribute A2 ---------------<br>
    Mean:               4.4<br>
    Median:             4.0<br>
    Variance:           7.9<br>
    Standard Deviation: 2.8<br>

Attribute A3 ---------------<br>
    Mean:               3.1<br>
    Median:             1.0<br>
    Variance:           9.3<br>
    Standard Deviation: 3.1<br>
![download.png](attachment:download.png)


![image.png](attachment:image.png)

<div class="alert alert-block alert-danger">  
<h2>
<b> SUBMISSION GUIDELINES FOR PHASE 1</b> 
</h2>
    
- You should submit a single Jupyter Notebook (.ipynb) for this assignment. The results should include 9 tables (one for each attribute) and 9 histograms (again, one per attribute).

    
- You don’t have to write any formal project report in this week.
    
    
- Write all programs in one single Jupyter Notebook. Make sure your code is properly formatted, structured and commented. The statements in the program have to be properly indented.

    
- Make sure that your program compiles. A program that doesn’t compile is usually scored 0 points.

    
- The submission files should be in ".ipynb" format. Name your submission files "firstname_project_phase_(number)".

    
- Please upload one single file for all questions. Irrespective of the number of questions, you'll be uploading one single file Jupyter notebook with all programs in the assignment.

    
- If you want to provide more instructions for executing your code, upload a ‘readme.txt’ file in which you may provide additional information.
    
</div>

<div class="alert alert-block alert-warning">  
<h2><b>PHASE 2 (40%)</b></h2>
    
You will implement the $k$-means algorithm in the second phase. We suggest that you review $k$-means clustering to make sure that you understand the algorithm before writing code. 
    
In the Wisconsin Breast Cancer data set, `Scn` is the first and `Class` is the last column. We do not consider these two columns for $k$-means computation. However, you will need these two columns for printing results and, in phase 3, to calculate the errors. 
    
Use the dataset with imputed missing values from phase 1 and use columns `A2` to `A10` for $k$-means computation.

</div>

<div class="alert alert-block alert-info">
<h2>
<b> INSTRUCTIONS</b>
</h2>
The steps in phase 2 include:

- Write code for `initial` Step
- Write code for `assign` Step
- Write code for `recompute` Step
    
    
a) Write code for `initial` step 

Randomly choose two points from the data set as the initial centroids. Since you only consider columns `A2` to `A10`, an initial centroid is a nine-dimensional vector. Name the first centroid $\mu_{2}$ and the second centroid $\mu_{4}$.
<br>   

<div class="alert alert-block alert-warning">  
<b>Hints:</b> 

Example:
    
Let’s say the randomly selected data points are rows $6$ and $246$, used as the two initial centroids. The values of $\mu_{2}$ and
$\mu_{4}$ are then
    
$\mu_{2} = (8, 10, 10, 8, 7, 10, 9, 7, 1)$<br>
$\mu_{4} = (5, 1, 1, 2, 2, 2, 3, 1, 1)$
    
</div>

<div class="alert alert-block alert-warning">  
<b>Note:</b> 
    
You need to select the two initial centroids randomly, so every time the code runs, different initial centroids may be selected. You may use the `numpy.random` module to select random points from the data set.
</div>
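
A minimal sketch of the `initial` step, assuming a toy DataFrame in place of the real data: sampling row positions without replacement guarantees that the two centroids come from different rows. The names `rng`, `toy` and `rows` are invented for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng()

# Hypothetical stand-in for columns A2..A10: 20 rows of integers in 1..10
columns = [f"A{i}" for i in range(2, 11)]
toy = pd.DataFrame(rng.integers(1, 11, size=(20, 9)), columns=columns)

# Draw two distinct row positions (replace=False rules out duplicates)
rows = rng.choice(len(toy), size=2, replace=False)

# Each centroid is the nine-dimensional vector of the chosen row
mu_2 = toy.iloc[rows[0]].to_numpy(dtype=float)
mu_4 = toy.iloc[rows[1]].to_numpy(dtype=float)

print(rows, mu_2, mu_4)
```
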
<br>  
b) Write code for `assign` step

For each of the 699 data points, compute the Euclidean distance from the two centroids. Use the initial centroids in the first iteration. In the remaining iterations, use the centroids calculated in step *c* of the previous iteration.
    
At the end of this step, each data point will be assigned to one of the two clusters, which we call predicted clusters. Store the information on `predicted clusters` into a column `Predicted_Class`. 
    
For each data point the value of `Predicted_Class` will be:
    
- `Predicted_Class = 2`, if the program calculates that the corresponding data point is in the cluster $2$ that corresponds to centroid $\mu_{2}$
- `Predicted_Class = 4`, if the program calculates that the corresponding data point is in the cluster $4$ that corresponds to centroid $\mu_{4}$

This step calculates two distances for each data point. Assign a data point to cluster 2
(`Predicted_Class = 2`) if its distance from $\mu_{2}$ is smaller than its distance from $\mu_{4}$; assign the data
point to cluster 4 (`Predicted_Class = 4`) otherwise.
    
<div class="alert alert-block alert-warning">  
<b>Hints:</b> 
    
Example:
    
Let’s take the data point at row number $375$ and compute its distance $d$ from both centroids.

data point $375 = (3, 1, 2, 1, 2, 1, 2, 1, 1)$<br>
$d(375,\mu_{2})=\sqrt{(3-8)^2+(1-10)^2+(2-10)^2+\cdots+(1-1)^2}=20.25$<br>
$d(375,\mu_{4})=\sqrt{(3-5)^2+(1-1)^2+(2-1)^2+\cdots+(1-1)^2}=2.83$

Since $d(375,\mu_{4}) < d(375,\mu_{2})$, the data point at row number $375$ will be assigned to cluster $4$.
    
</div>  
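
The worked example can be checked with a few lines of NumPy. This is only a sketch of the distance computation for a single point, using the centroid values from the hints; the variable names are chosen for illustration.

```python
import numpy as np

# Values taken from the hints: data point 375 and the two example centroids
point = np.array([3, 1, 2, 1, 2, 1, 2, 1, 1], dtype=float)
mu_2 = np.array([8, 10, 10, 8, 7, 10, 9, 7, 1], dtype=float)
mu_4 = np.array([5, 1, 1, 2, 2, 2, 3, 1, 1], dtype=float)

# Euclidean distance: square root of the sum of squared differences
d2 = np.sqrt(((point - mu_2) ** 2).sum())
d4 = np.sqrt(((point - mu_4) ** 2).sum())

# Cluster 2 only if strictly closer to mu_2, cluster 4 otherwise
predicted = 2 if d2 < d4 else 4

print(round(d2, 2), round(d4, 2), predicted)  # 20.25 2.83 4
```
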
    
c) Write code for `recompute` step

So far, you have assigned each data point to one of the two clusters. Next, you will update the centroids.  
    
<div class="alert alert-block alert-warning">  
<b>Hints:</b> 

Example:
    
Let’s say that after performing step *b*, you have $300$ data points assigned to cluster $2$ (i.e. `Predicted_Class = 2`), and the remaining $399$ data points assigned to cluster 4 (`Predicted_Class = 4`). Update $\mu_{2}$ by computing the mean of the cluster 2 data points and update $\mu_{4}$ by computing the mean of the cluster 4 data points.
   
</div>    
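
One way to express the recompute step with pandas is a `groupby` on `Predicted_Class` followed by `mean`. The sketch below uses an invented two-attribute table to keep the arithmetic easy to follow; with the real data you would select columns `A2` to `A10` instead.

```python
import pandas as pd

# Hypothetical toy data: two attributes and an existing cluster assignment
toy = pd.DataFrame({
    "A2": [1, 3, 9, 7],
    "A3": [2, 2, 8, 10],
    "Predicted_Class": [2, 2, 4, 4],
})

# Mean of every attribute within each predicted cluster
centroids = toy.groupby("Predicted_Class")[["A2", "A3"]].mean()

mu_2 = centroids.loc[2].to_numpy()
mu_4 = centroids.loc[4].to_numpy()

print(mu_2, mu_4)  # [2. 2.] [8. 9.]
```
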

d) Iterate steps *b* and *c* until any one of the following conditions is true:
    
1. Centroids ( or clusters) don’t change compared to the previous iteration.
    
2. Steps *b* and *c* iterated $50$ times.
    
At the end of step *d*, you will have the final values of the centroids and the predicted clusters. Print this result to the console. 
    
</div>
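
To show how steps *b*, *c* and *d* fit together, here is a compact sketch on six made-up two-dimensional points. The helper name `kmeans_two_clusters` and its signature are assumptions for this example, not part of the assignment, and the sketch assumes that neither cluster ever becomes empty.

```python
import numpy as np

def kmeans_two_clusters(X, mu_2, mu_4, max_iter=50):
    """Hypothetical helper: repeat assign/recompute until the centroids settle."""
    for iteration in range(1, max_iter + 1):
        # Assign step: distance of every row to each centroid
        d2 = np.linalg.norm(X - mu_2, axis=1)
        d4 = np.linalg.norm(X - mu_4, axis=1)
        labels = np.where(d2 < d4, 2, 4)

        # Recompute step: each centroid becomes the mean of its cluster
        new_mu_2 = X[labels == 2].mean(axis=0)
        new_mu_4 = X[labels == 4].mean(axis=0)

        # Stop as soon as the centroids no longer change
        if np.allclose(new_mu_2, mu_2) and np.allclose(new_mu_4, mu_4):
            break
        mu_2, mu_4 = new_mu_2, new_mu_4
    return mu_2, mu_4, labels, iteration

# Two well-separated toy groups; rows 0 and 3 serve as the initial centroids
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
              [8.0, 8.0], [9.0, 8.0], [8.0, 9.0]])
mu_2, mu_4, labels, n_iter = kmeans_two_clusters(X, X[0], X[3])

print(labels, n_iter)
```

Comparing centroids with `np.allclose` rather than `==` is a design choice that tolerates tiny floating-point differences between iterations.
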

In [None]:
import pandas as pd
import numpy as np
import math


## Initiate class to calculate distance between data points and centroids
## returns 2 or 4 depending on least distance between centroid and data point 
class Predicted_Class:
    
    def __init__(self, mu_2, mu_4, data_point):
        
        self.data_point = list(data_point[1:10])
        self.calculate_distance(mu_2, mu_4)
        self.cluster = self.predict_cluster()


    ## calculate the Euclidean distance between each centroid and the data point
    def calculate_distance(self, mu_2, mu_4):
        
        distance_sum_2 = 0
        distance_sum_4 = 0
        
        for n in range(len(self.data_point)):
            distance_sum_2 += (self.data_point[n] - mu_2[n])**2
        ## keep full precision: rounding here can turn unequal distances into ties
        self.distance_2 = math.sqrt(distance_sum_2)
        
        for n in range(len(self.data_point)):
            distance_sum_4 += (self.data_point[n] - mu_4[n])**2
        self.distance_4 = math.sqrt(distance_sum_4)
        
    
    ## Assign 2 or 4 depending on least distance between centroid
    def predict_cluster(self):
         
        if self.distance_4 < self.distance_2:
            return 4
        else:
            return 2
            
## Generates centroids randomly from the df
def initial(df):
    
    ## pick two distinct rows so the two initial centroids cannot coincide
    centroid_index = np.random.choice(len(df), size=2, replace=False)
    
    mu_2 = list(df.iloc[centroid_index[0], 1:10])
    mu_4 = list(df.iloc[centroid_index[1], 1:10])      

    
    print(f'Randomly selected row {centroid_index[0]} for centroid mu_2.')
    print(df.iloc[centroid_index[0], 1:10])
    print('\n')
    
    print(f'Randomly selected row {centroid_index[1]} for centroid mu_4.')
    print(df.iloc[centroid_index[1], 1:10])
    print('\n')
    
    return mu_2, mu_4


## Assigns the data points to centroid cluster 2 or centroid cluster 4
def assign(df, mu_2, mu_4):
    
    ##Assign each data point to Predicted_class = 2 or Predicted_Class = 4
    predicted_classes = []
    for i in range(len(df)):
        predicted_classes.append(Predicted_Class(mu_2, mu_4, df.loc[i]))
    
    ## separate clusters by Predicted_class = 2 and Predicted_class = 4
    cluster_2_points = []
    cluster_4_points = []
    predicted_cluster = []
    for i in range(len(predicted_classes)):
        pc = predicted_classes[i]
        if pc.cluster == 2:
            cluster_2_points.append(pc.data_point)
            predicted_cluster.append(pc.cluster)
        elif pc.cluster == 4:
            cluster_4_points.append(pc.data_point)
            predicted_cluster.append(pc.cluster)
    
    ## Add Predicted_Class column to df      
    df['Predicted_Class'] = predicted_cluster
            
    return cluster_2_points, cluster_4_points, df


## Compute the means from cluster 2 and cluster 4
def compute(cluster_points):
    
    cluster_columns = ["A2", "A3", "A4", "A5", "A6", 
            "A7", "A8", "A9", "A10"]
    
    ##create dataframes for Predicted_Class = 2 and Predicted_Class = 4
    cluster_df = pd.DataFrame(cluster_points, columns = cluster_columns)
    row, col = cluster_df.shape
    
    ##calculate means for each column
    cluster_means = []
    for c in range(col):
        values = cluster_columns[c]
        cluster_means.append(np.mean(cluster_df[values]))
        
    return cluster_means

## Update centroids based on cluster 2 and cluster 4 means  
def recompute(df, mu_2, mu_4, cluster_2_points, cluster_4_points):
    
    iterations = 0
    
    ## run the assign/recompute loop at most 50 times
    while iterations < 50:
        
        iterations += 1
        final_clusters = []
        cluster_2_points, cluster_4_points, df = assign(df, mu_2, mu_4)
        new_mu_2 = compute(cluster_2_points)
        new_mu_4 = compute(cluster_4_points)
        
        ## Print results if newly generated centroids are == previous centroid
        
        if new_mu_2 == mu_2 and new_mu_4 == mu_4:
            print(f'Program ended after {iterations} iterations. \n')
            final_clusters.append(new_mu_2)
            final_clusters.append(new_mu_4)
            
     
            ## convert to final_clusters to dataframe 
            cluster_columns = ["A2", "A3", "A4", "A5", "A6", 
            "A7", "A8", "A9", "A10"]
    
            ##create dataframe for final_clusters
            centroid_df = pd.DataFrame(final_clusters, columns = cluster_columns)
            row, col = centroid_df.shape
            
            print(f'Final centroid for mu_2: \n{centroid_df.iloc[0]}')
            print('\n')
            print(f'Final centroid for mu_4: \n{centroid_df.iloc[1]}')
            print('\n')
            print(f'Final cluster assignment: \n')
            print(df[['Scn', 'Class', 'Predicted_Class']].head(21))
            np.save("scores.npy", df[['Scn', 'Class', 'Predicted_Class']].head(21))
            break
        
        else:
            mu_2 = new_mu_2
            mu_4 = new_mu_4
            
    return mu_2, mu_4
    

def main():
    
    ##define column names of dataframe
    col =  ["Scn", "A2", "A3", "A4", "A5", "A6", 
            "A7", "A8", "A9", "A10", "Class"]
    
    ## read dataframe
    df = pd.read_csv('breast-cancer-wisconsin.data', 
                     na_values ='?', names = col)
    row, column = df.shape
    
    ## replace NaN values with column A7 mean 
    A7_mean = round(df['A7'].mean(), 1) 
    df = df.fillna(A7_mean)
 
    
    ## Initialization of centroids Step
    mu_2, mu_4 = initial(df)

    ## Assign data points to centroid 2 or centroid 4
    cluster_2_points, cluster_4_points, df = assign(df, mu_2, mu_4)
    
    ## Compute means of each column in clusters
    mu_2 = compute(cluster_2_points)
    mu_4 = compute(cluster_4_points)
    
    ## Iterate assign and recompute steps until the clusters do not change
    ## (equal to the previous centroid)
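    ## keep the converged centroids; otherwise the final assignment below
    ## would use the centroids from before the iteration
    mu_2, mu_4 = recompute(df, mu_2, mu_4, cluster_2_points, cluster_4_points)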
    
    
    ##final cluster assignment
    cluster_2_points, cluster_4_points, df = assign(df, mu_2, mu_4)
    
    
main()

#### Sample Output

<pre>
Randomly selected row 148 for centroid mu_2.
Initial centroid mu_2:
A2      3.0
A3      1.0
A4      1.0
A5      3.0
A6      8.0
A7      1.0
A8      5.0
A9      8.0
A10     1.0
Name: 148, dtype: float64

Randomly selected row 190 for centroid mu_4.
Initial centroid mu_4:
A2      10.0
A3      10.0
A4      10.0
A5      8.0
A6      6.0
A7      8.0
A8      7.0
A9      10.0
A10     1.0
Name: 190, dtype: float64

Program ended after 4 iterations.
Final centroid mu_2:
A2    3.047210
A3    1.302575
A4    1.446352
A5    1.343348
A6    2.087983
A7    1.380001
A8    2.105150
A9    1.261803
A10   1.109442
dtype: float64

Final centroid mu_4:
A2    7.158798
A3    6.798283
A4    6.729614
A5    5.733906
A6    5.472103
A7    7.873966
A8    6.103004
A9    6.077253
A10   2.549356
dtype: float64
</pre>



Final cluster assignment:
<pre>
    Scn   Class Predicted_Class
0  1000025 2    2
1  1002945 2    4
2  1015425 2    2
3  1016277 2    4
4  1017023 2    2
5  1017122 4    4
6  1018099 2    2
7  1018561 2    2
8  1033078 2    2
9  1033078 2    2
10 1035283 2    2
11 1036172 2    2
12 1041801 4    2
13 1043999 2    2
14 1044572 4    4
15 1047630 4    2
16 1048672 2    2
17 1049815 2    2
18 1050670 4    4
19 1050718 2    2
20 1054590 4    4
</pre>

<div class="alert alert-block alert-warning">  
<b>Note:</b> 

You can pass arguments to the function if required.
    
Your output may be different than the above. Include the first $20$ data points (rows) in your report.
    
This version of the $k$-means algorithm may suffer from poor initialization. Therefore, you may see different answers or swapped cluster assignments. We recommend running the program multiple times and submitting the best results. 
</div>

<div class="alert alert-block alert-danger">  
<h2>
<b> SUBMISSION GUIDELINES FOR PHASE 2</b> 
</h2>
    
- You should submit a single Jupyter Notebook (.ipynb) for this assignment. The results should include the initial and final centroids and the cluster assignment for the first 20 data points.

    
- You don’t have to write any formal project report in this week.
    
    
- Write all programs in one single Jupyter Notebook. Make sure your code is properly formatted, structured and commented. The statements in the program have to be properly indented. 

    
- Make sure that your program compiles. A program that doesn’t compile is usually scored 0 points.

    
- The submission files should be in ".ipynb" format. Name your submission files "firstname_project_phase_(number)".

    
- Please upload one single file for all questions. Irrespective of the number of questions, you'll be uploading one single file Jupyter notebook with all programs in the assignment.

    
- If you want to provide more instructions for executing your code, upload a ‘readme.txt’ file in which you may provide additional information.
    
</div>

<div class="alert alert-block alert-warning">  
<h2><b>PHASE 3 (30%)</b></h2>
    
<br>

The phase 2 program, which implements the $k$-means algorithm, produces two clusters: one containing benign
cells (`predicted class = 2`) and the other containing malignant cells (`predicted class = 4`). However, a
malignant cell may be clustered into the benign cluster and vice versa.<br><br>
In phase 3 you will analyze the quality of the clustering. To check how well your clustering worked, you
will calculate the error rate for your clusters. Assume that the column `Class` of the initial data set
contains the correct clustering of the data points.<br>
This project has been designed to be completed in three weeks. Each week you will deliver different phases of the project.
<br>    
During the third week, you need to complete the following tasks. There are two parts in phase 3:
- Write a code to calculate the individual and total error rates of the predicted clusters.
- Prepare and submit final report.

<div class="alert alert-block alert-info">  
<h2><b>PART 1</b></h2>
<br>   
Write code to calculate the individual and total error rates of the predicted clusters.
Your phase 3 program will calculate the error rates based on two arguments:

- The predicted clusters, calculated by your phase 2 program,
- The correct clusters, specified by the column `Class` of the initial data set.

</div>



</div>

<div class="alert alert-block alert-info">
<h2>
<b> INSTRUCTIONS</b>
</h2>

Let’s have a look at the example of the cluster assignment for the first 20 data points, listed in the
phase 2 sample output. Column `Class` represents the correct clusters and column `Predicted_Class` represents the
clusters calculated by the $k$-means algorithm.

![image.png](attachment:image.png)
<br>
Marked data points represent the errors of the $k$-means clustering:
- Yellow data points are predicted as class 4 (malignant cells), while the correct class is 2
(benign cells).
- Gray data points are predicted as class 2 (benign cells), while the correct class is 4
(malignant cells).
<br>
    
Let’s define the following notation:
    
- `error_24`: number of data points predicted as class 2, while the correct class is 4
- `error_42`: number of data points predicted as class 4, while the correct class is 2
- `error_all`: number of data points with predicted class not equal to correct class
- `pclass_2`: number of data points with predicted class equal to 2
- `pclass_4`: number of data points with predicted class equal to 4
- `class_all`: number of data points
- `error_B`: error rate for the benign cells
- `error_M`: error rate for the malignant cells
- `error_T`: total error rate

<br>
Use the following formulae to calculate and print error rates for each cluster:

- $error\_B = (error\_24 \div pclass\_2) \times 100$ %
- $error\_M = (error\_42 \div pclass\_4) \times 100$ %
- $error\_T = (error\_all \div class\_all) \times 100$ %

<br>
    
A total error rate above 50% indicates that your program swapped the predicted clusters. Your
program has to detect this situation, swap the predicted clusters by replacing 2 with 4, and 4 with
2 in column `Predicted_Class`, and recalculate the error rates.

</div>
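
The formulae and the swap rule above can be sketched as follows. The function name `error_rates` and the ten-point toy labels are invented for illustration; the sketch assumes both predicted clusters are non-empty, so the divisions are well defined.

```python
import numpy as np

def error_rates(predicted, actual):
    """Hypothetical helper returning (error_B, error_M, error_T, swapped)."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)

    # More than 50% mismatches means the cluster labels are swapped
    swapped = np.mean(predicted != actual) > 0.5
    if swapped:
        predicted = np.where(predicted == 2, 4, 2)

    error_24 = np.sum((predicted == 2) & (actual == 4))
    error_42 = np.sum((predicted == 4) & (actual == 2))
    pclass_2 = np.sum(predicted == 2)
    pclass_4 = np.sum(predicted == 4)

    error_B = error_24 / pclass_2 * 100
    error_M = error_42 / pclass_4 * 100
    error_T = (error_24 + error_42) / len(predicted) * 100
    return error_B, error_M, error_T, swapped

# Toy labels in which most predictions carry the opposite label
actual = [2, 2, 2, 2, 2, 2, 4, 4, 4, 4]
predicted = [4, 4, 4, 4, 4, 2, 2, 2, 2, 4]

error_B, error_M, error_T, swapped = error_rates(predicted, actual)
print(round(error_B, 2), round(error_M, 2), round(error_T, 2), swapped)
```

In this toy case 8 of the 10 labels disagree, so the labels are swapped first; afterwards one point is wrong in each cluster.
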


In [2]:
import pandas as pd
import numpy as np

def error_rate():

    
    ##define column names of dataframe
    col =  ["Scn", "A2", "A3", "A4", "A5", "A6", 
            "A7", "A8", "A9", "A10", "Class"]
    
    ## read dataframe
    df = pd.read_csv('breast-cancer-wisconsin.data', 
                     na_values ='?', names = col)
    row, column = df.shape
    
    ## replace NaN values with column A7 mean 
    A7_mean = round(df['A7'].mean(), 1) 
    df = df.fillna(A7_mean)
 
    
    ## Initialization of centroids Step
    mu_2, mu_4 = initial(df)

    ## Assign data points to centroid 2 or centroid 4
    cluster_2_points, cluster_4_points, df = assign(df, mu_2, mu_4)
    
    ## Compute means of each column in clusters
    mu_2 = compute(cluster_2_points)
    mu_4 = compute(cluster_4_points)
    
    ## Iterate assign and recompute steps until the clusters do not change
    ## (equal to the previous centroid)
    ## keep the converged centroids for the final assignment below
    mu_2, mu_4 = recompute(df, mu_2, mu_4, cluster_2_points, cluster_4_points)
    
    
    ##final cluster assignment
    cluster_2_points, cluster_4_points, df = assign(df, mu_2, mu_4)
 



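    # Get the predicted clusters from the Phase 2 output
    predicted_clusters = df['Predicted_Class'].values

    # Get the correct clusters from the "Class" column
    correct_clusters = df['Class'].values

    # A total error rate above 50% means the cluster labels are swapped:
    # replace 2 with 4 and 4 with 2 before calculating the error rates
    if np.mean(predicted_clusters != correct_clusters) * 100 > 50:
        print("Clusters are swapped!")
        print("Swapping Predicted_Class")
        predicted_clusters = np.where(predicted_clusters == 2, 4, 2)
        df['Predicted_Class'] = predicted_clusters

    # Calculate the individual error rates for benign and malignant cells
    error_24 = np.sum((predicted_clusters == 2) & (correct_clusters == 4))
    error_42 = np.sum((predicted_clusters == 4) & (correct_clusters == 2))
    pclass_2 = np.sum(predicted_clusters == 2)
    pclass_4 = np.sum(predicted_clusters == 4)
    error_B = (error_24 / pclass_2) * 100
    error_M = (error_42 / pclass_4) * 100

    # Calculate the total error rate
    error_all = error_24 + error_42
    class_all = len(predicted_clusters)
    error_T = (error_all / class_all) * 100

    # Print the error rates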


    print("Number of all data points:", class_all)
    print(f"Data points in predicted class 2 {pclass_2}")
    print(f"Data points in predicted class 4 {pclass_4}")
    


  
    print("Error rate for class 2:", round(error_B, 2), "%")
    print("Error rate for class 4:", round(error_M, 2), "%")
    print("Total error rate:", round(error_T, 2), "%")


error_rate()

#### Sample Output:

This is the output in case the clusters are swapped and the program swapped the predicted class.

<br>
Total errors: $96.0$ %<br>
Clusters are swapped!<br>
Swapping Predicted_Class<br>
<br>
Data points in Predicted Class 2: $464$<br>
Data points in Predicted Class 4: $235$<br>

Error data points, Predicted Class 2:

| | Scn | Class | Predicted_Class | 
|-------------|:----------|:-----------:|:----------:|
|12  |1041801 |4 |2 |
|15  |1047630 |4 |2 |
|50  |1108370 |4 |2 |
|51  |1108449 |4 |2 |
|57  |1113038 |4 |2 |
|59  |1113906 |4 |2 |
|63  |1116132 |4 |2 |
|65  |1116998 |4 |2 |
|101 |1167439 |4 |2 |
|103 |1168359 |4 |2 |
|105 |1169049 |4 |2 |
|222 |1226012 |4 |2 |
|273 |428903  |4 |2 |
|348 |832226  |4 |2 |
|356 |859164  |4 |2 |
|455 |1246562 |4 |2 |
|489 |1084139 |4 |2 |

<br>

Error data points, Predicted Class 4:

| | Scn | Class | Predicted_Class | 
|-------------|:----------|:-----------:|:----------:|
|1   |1002945 |2 |4 |
|3   |1016277 |2 |4 |
|40  |1096800 |2 |4 |
|196 |1213375 |2 |4 |
|252 |1017023 |2 |4 |
|259 |242970  |2 |4 |
|296 |616240  |2 |4 |
|315 |704168  |2 |4 |
|319 |721482  |2 |4 |
|352 |846832  |2 |4 |
|434 |1293439 |2 |4 |

<br>

Number of all data points: $699$
<br>

Number of error data points: $28$

<br>
Error rate for class 2: $3.7$ %<br>
Error rate for class 4: $4.7$ %

<br>

Total error rate: $4.0$ %

<div class="alert alert-block alert-danger">  
<h2>
<b> SUBMISSION GUIDELINES FOR PHASE 3</b> 
</h2>
    
- You should submit a single Jupyter Notebook (.ipynb) for this assignment. 

    
- Prepare and submit a PDF with final report that includes:
    - Project statement
    - Short description of phase 1, 2 and 3 programs (algorithm, description of input data, structure of the programs and description of results)
    - Phase 1, 2 and 3 results
    - Conclusion
- Submit the phase 1, 2 and 3 programs together in one Jupyter Notebook. Submit any data files that may be needed to run your programs.
    
    
- Write all programs in one single Jupyter Notebook. Make sure your code is properly formatted, structured and commented. The statements in the program have to be properly indented. 

    
- Make sure that your program compiles. A program that doesn’t compile is usually scored 0 points.

    
- The submission files should be in ".ipynb" format. Name your submission files "firstname_project_phase_(number)".

    
- Please upload one single file for all questions. Irrespective of the number of questions, you'll be uploading one single file Jupyter notebook with all programs in the assignment.

    
- If you want to provide more instructions for executing your code, upload a ‘readme.txt’ file in which you may provide additional information.
    
</div>