# Hands-On Assignment 5

In this assignment, you will practice everything that you have learned so far in an end-to-end setting.
You will be provided with a dataset that is **unique to you**, and your task is to perform
all the steps from previous assignments to clean, explore, visualize, and analyze your dataset.

**Written Portion**: Additionally, you will create a report that describes your process and provides insights about your dataset.
Each section that should appear in your report is noted with an orange star (like normal HO tasks). The report should be 4-6 pages (12 pt font, 1.5 line spacing) and turned in on Canvas as a PDF.

The coding aspect of this assignment will be turned in the same way as all other HOs,
by submitting this file to the autograder.


For this assignment, feel free to make additional functions instead of implementing everything in the provided function.

The objective of this assignment is for you to apply and solidify the skills you have learned in previous assignments.

# Prompt

You have graduated from this class, and are a huge success!
You landed a job doing data science at some fancy company.

You just got a new client with some really interesting problems you get to solve.
Unfortunately, because of a big mess-up on their side, the data's metadata got corrupted
(and the person that used to maintain the data just took a vow of silence and moved to a bog).

The only column you are sure about is the `label` column,
which contains a numeric label for each row.
Aside from that, the client does not know anything about the names, content, or even data types for each column.

Your task is to explore, clean, and analyze this data.
You should have already received an email with the details on obtaining your unique data.
Place it in the same directory as this notebook (and your `local_grader.py` script) and name it `data.txt`.

*I know this prompt may sound unrealistic, but I have literally been in a situation exactly like this.
I was working at a database startup, and one of our clients gave us data with over 70 columns and more than a million records and told us:
"The person who used to manage the data is no longer working with us, but this was the data they used to make all their decisions.
We also lost all the metadata information, like column names."
...
Working in industry is not always glamorous.
-Eriq*

# Part 0: Explore Your Data

Before you start doing things to/with your data, it's always a good idea to load up your data and take a look.

In [31]:
import pandas as pd
import numpy as np
import re
import scipy.stats

import sklearn.ensemble
import sklearn.neighbors
import sklearn.linear_model
import sklearn.preprocessing
import sklearn.model_selection
import sklearn.metrics

# Modify this to point to your data.
unique_data = pd.read_csv('data.txt', sep = "\t")
unique_data

Unnamed: 0,label,col_00,col_01,col_02,col_03,col_04,col_05,col_06,col_07,col_08,col_09,col_10,col_11,col_12,col_13,col_14
0,1,600 mph,0.1179,fabrice,Baseball,359,aNDREW,958 m/s,913,-597,652,0.3977,2.2247,Bioinformatics,-75,0.7676
1,6,518 mph,1.0059,Tony,ice hockey,2650,Fabrice,456 m/s,2392,447,-191,1.2029,1.3058,Computer Game Design,575,1.0125
2,3,-585 mph,-0.3057,Andrew,soccer,1234,Tony,1117 m/s,93,227,1153,0.8166,1.1266,Biotechnology,1845,2.2419
3,3,-1042 mph,-0.2849,Andrew,soccer,705,Tony,557 m/s,1275,1581,-701,-0.2707,-0.1947,Natural Language Processing,489,0.1878
4,1,473 mph,-0.4617,Fabrice,baseball,249,Lise,839 m/s,211,-819,-753,1.3098,1.7498,Bioinformatics,924,-0.0791
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1442,7,521 mph,-1.002,Inan,tennis,983,Inan,681 m/s,1169,427,261,1.2008,1.0294,Computational Media,384,0.3957
1443,4,-804 mph,1.0782,Chris,badminton,657,Vani,167 m/s,-899,1967,253,1.1767,,Applied Mathematics,-1432,-0.6581
1444,6,1109 mph,0.8671,Tony,golf,1114,Chris,3008 m/s,353,1629,-378,0.7717,0.6322,Games and Playable Media,555,0.5452
1445,4,-643 mph,1.7351,Chris,boxing,1338,Vani,-1155 m/s,2378,1330,740,1.5394,1.7111,Technology and Information Management,73,1.2654


Don't forget to check out the column information.

In [32]:
unique_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1447 entries, 0 to 1446
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   1447 non-null   int64 
 1   col_00  1438 non-null   object
 2   col_01  1433 non-null   object
 3   col_02  1438 non-null   object
 4   col_03  1437 non-null   object
 5   col_04  1436 non-null   object
 6   col_05  1435 non-null   object
 7   col_06  1441 non-null   object
 8   col_07  1436 non-null   object
 9   col_08  1437 non-null   object
 10  col_09  1439 non-null   object
 11  col_10  1435 non-null   object
 12  col_11  1437 non-null   object
 13  col_12  1441 non-null   object
 14  col_13  1447 non-null   object
 15  col_14  1432 non-null   object
dtypes: int64(1), object(15)
memory usage: 181.0+ KB


And any numeric information.

In [33]:
unique_data.describe()

Unnamed: 0,label
count,1447.0
mean,3.487906
std,2.281312
min,0.0
25%,1.0
50%,3.0
75%,5.0
max,7.0
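
Because `describe()` only summarizes `label` while every other column is still `object`, it can help to look at a few raw values per column before guessing a type. The following is a minimal sketch, not part of the assignment template: `peek_columns` is a hypothetical helper, and the small frame is made up to mimic the preview above.

```python
import pandas as pd

def peek_columns(frame, top=3):
    """Return {column: most frequent raw values} for every non-numeric column."""
    summary = {}
    for column in frame.select_dtypes(exclude="number"):
        counts = frame[column].value_counts(dropna=False)
        summary[column] = list(counts.index[:top])
    return summary

# Tiny made-up frame that mimics the preview above (not the real data).
sample = pd.DataFrame({
    "label": [1, 6, 3],
    "col_00": ["600 mph", "518 mph", "-585 mph"],
    "col_03": ["Baseball", "ice hockey", "soccer"],
})

for column, values in peek_columns(sample).items():
    print(column, values)
```

Columns whose values look like a number followed by a unit are candidates for numeric conversion, while columns with a handful of repeated words are probably categorical.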


<h4 style="color: darkorange";>★ Written Task: Introduction</h4>

Briefly describe the dataset you were given, define the goal of the project, and explain how you approach it.
For example, you can present a basic introduction of your data (shape and proposed data types)
and state that your goal is to use these features to predict the label (the response variable).
Then propose a few models that are suitable for this project, which will be introduced in the modeling section.

# Part 1: Data Cleaning

As always, we should start with data cleaning.
Take what you learned from HO3 to clean up this messy data to a point where it is ready for machine learning algorithms.

Some things you may want to do:
 - Deal with missing/empty values.
 - Fix numeric columns so that they actually contain numbers.
 - Remove inconsistencies from columns.
 - Assign a data type to each column.
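
As one possible way to "fix numeric columns", the sketch below pulls a signed number out of strings such as `-585 mph`. It is an illustration under assumptions: `extract_numbers` and `NUMBER_PATTERN` are hypothetical names, and the pattern treats any `-` directly before the digits as a sign, which may not suit every column.

```python
import numpy as np
import pandas as pd

# Optional sign, then digits with an optional decimal part (or a bare ".5").
NUMBER_PATTERN = r"([-+]?(?:\d+\.?\d*|\.\d+))"

def extract_numbers(series):
    """Pull the first signed number out of each string; other cells become NaN."""
    extracted = series.str.extract(NUMBER_PATTERN, expand=False)
    return pd.to_numeric(extracted, errors="coerce").astype(float)

# Made-up values in the style of the preview above.
raw = pd.Series(["600 mph", "-585 mph", None, "0.1179", "fabrice"])
clean = extract_numbers(raw)
print(clean.tolist())
```

Keeping the sign matters here because the raw preview contains negative values; a pattern that only matches digits and dots would silently turn them positive.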

<h4 style="color: darkorange";>★ Task 1.A</h4>

Complete the following function that takes in a DataFrame and outputs a clean version of the DataFrame.
You can assume that the frame has the same structure as your unique dataset.
You can return the same or a new data frame.
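
For text columns, inconsistent capitalization (for example `aNDREW` versus `Andrew`) is a common inconsistency. A rough sketch of one way to normalize such a column and then one-hot encode it follows; `normalize_categories` and the `"Unknown"` placeholder for missing cells are assumptions made for illustration.

```python
import pandas as pd

def normalize_categories(series, missing="Unknown"):
    """Trim whitespace, unify capitalization, and label missing cells explicitly."""
    cleaned = series.str.strip().str.title()
    return cleaned.fillna(missing)

# Made-up values in the style of the preview above.
names = pd.Series(["fabrice", "aNDREW", " Tony", None, "Fabrice"])
normalized = normalize_categories(names)

# One indicator column per distinct (normalized) category.
one_hot = pd.get_dummies(normalized, prefix="name", dtype=int)

print(normalized.tolist())
print(sorted(one_hot.columns))
```

Giving missing cells their own category is only one option; dropping those rows or filling with the most frequent value are alternatives worth discussing in the report.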

In [34]:
def clean_data(frame):
    for column in frame:
        if column == "label":
            continue

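        # Capture the first run of digits and dots in each cell (e.g. "600 mph" -> "600").
        # The pattern has no sign, so a leading "-" is not captured ("-585 mph" -> "585").
        expression = re.compile(r'([0-9.]+)')
        extract = frame[column].str.extract(expression, expand=False)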

        if extract.notna().any():
            frame[column] = pd.to_numeric(extract, errors='coerce')

            np_col = frame[column].dropna().to_numpy()
            
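            # The mean is truncated to an integer (also for float columns); it is used
            # below as the center of the clipping range and as the fill value.
            mean = np.mean(np_col).astype(int)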
            std = frame[column].std()
            num_stds = 3
            
            upper_bound = mean + num_stds * std
            lower_bound = mean - num_stds * std

            frame[column] = frame[column].clip(lower=lower_bound, upper=upper_bound)
            frame[column] = frame[column].fillna(mean)

            if np.all(np_col % 1 == 0):
                frame[column] = frame[column].astype(int)

            else:
                frame[column] = frame[column].astype(float)

        else:
            frame[column] = frame[column].str.title()
            orig_column = column

            one_hot = pd.get_dummies(frame[column])
            one_hot.columns = [f'{column}_{col.lower()}' for col in one_hot.columns]

            location = frame.columns.get_loc(column) + 1
            for column in reversed(one_hot.columns):
                frame.insert(location, column, one_hot[column])

            frame.drop(orig_column, axis=1, inplace=True)

    return frame

unique_data = clean_data(unique_data)
unique_data

Unnamed: 0,label,col_00,col_01,col_02_?,col_02_andrew,col_02_chris,col_02_eriq,col_02_fabrice,col_02_inan,col_02_lise,...,col_12_electrical engineering,col_12_games and playable media,col_12_human computer interaction,col_12_natural language processing,col_12_none,col_12_robotics engineering,col_12_statistics,col_12_technology and information management,col_13,col_14
0,1,600,0.117900,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,75,0.767600
1,6,518,1.005900,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,575,1.012500
2,3,585,0.305700,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1845,1.629107
3,3,1042,0.284900,0,1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,489,0.187800
4,1,473,0.461700,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,924,0.079100
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1442,7,521,1.002000,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,384,0.395700
1443,4,804,1.078200,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1432,0.658100
1444,6,1109,0.867100,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,555,0.545200
1445,4,643,1.645613,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,73,1.265400


Now we should also be able to view all the numeric columns.

In [22]:
unique_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1447 entries, 0 to 1446
Data columns (total 71 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   label                                         1447 non-null   int64  
 1   col_00                                        1447 non-null   int32  
 2   col_01                                        1447 non-null   float64
 3   col_02_?                                      1447 non-null   uint8  
 4   col_02_andrew                                 1447 non-null   uint8  
 5   col_02_chris                                  1447 non-null   uint8  
 6   col_02_eriq                                   1447 non-null   uint8  
 7   col_02_fabrice                                1447 non-null   uint8  
 8   col_02_inan                                   1447 non-null   uint8  
 9   col_02_lise                                   1447 non-null   u

<h4 style="color: darkorange";>★ Written Task: Data Cleaning</h4>

Describe the steps you took for data cleaning.
Why did you do this?
Did you have to make some choices along the way? If so, describe them.

# Part 2: Data Visualization

Once you have cleaned up the data, it is time to explore it and find interesting things.
Part of this exploration will be visualizing the data in a way that makes it easier for yourself and others to understand.
Use what you have learned in HO1 and HO2 to create some visualizations for your dataset.
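
A simple first visualization is the class balance of the `label` column. The sketch below is only an illustration: `plot_label_counts` is a hypothetical helper, the toy frame is made up, and the `Agg` backend is chosen so that the figure is written to a file without needing a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render to files; no display needed
import matplotlib.pyplot as plt
import pandas as pd

def plot_label_counts(frame, path):
    """Save a bar chart with the number of rows per label value."""
    counts = frame["label"].value_counts().sort_index()
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(counts.index.astype(str), counts.to_numpy())
    ax.set_xlabel("Label")
    ax.set_ylabel("Number of rows")
    ax.set_title("Rows per label")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
    return counts

# Made-up labels for illustration only.
toy = pd.DataFrame({"label": [1, 6, 3, 3, 1, 7, 4, 6, 4]})
output_path = os.path.join(tempfile.mkdtemp(), "label_counts.png")
label_counts = plot_label_counts(toy, output_path)
print(label_counts.to_dict())
```

A strongly unbalanced chart would suggest looking at metrics other than plain accuracy later on.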

In [23]:
# Build a correlation heatmap between the one-hot encoded columns.
import matplotlib.pyplot as plt
import seaborn as sns
import os

os.makedirs("./graphs/", exist_ok=True)

# Split the one-hot columns into two groups: names (col_02, col_05)
# and the other categorical columns (col_03, col_12).
name_cols = [col for col in unique_data.columns if col.startswith("col_02") or col.startswith("col_05")]
data_cols = [col for col in unique_data.columns if col.startswith("col_03") or col.startswith("col_12")]

name_cols = [col for col in name_cols if "none" not in col and "?" not in col]
data_cols = [col for col in data_cols if "none" not in col and "?" not in col]

correlations = np.corrcoef(unique_data[data_cols].values.T, unique_data[name_cols].values.T)
cross_corr_matrix = correlations[:len(data_cols), len(data_cols):]
corr_matrix = pd.DataFrame(cross_corr_matrix, index=unique_data[data_cols].columns, columns=unique_data[name_cols].columns)

# Create a heatmap
fig, ax = plt.subplots(figsize=(10, 8), facecolor="white")
ax.set_facecolor("white")

sns.heatmap(corr_matrix, annot=False, cmap='coolwarm', ax=ax)
plt.title('Heatmap')

# Adjust the aspect of the axes
plt.gca().set_aspect('equal', adjustable='box')

# Adjust the layout to make room for the labels
plt.subplots_adjust(left=0.2, right=1, top=0.95, bottom=0.2)

# Save the heatmap
heatmap_path = os.path.join("./graphs", "heatmap.png")
plt.savefig(heatmap_path)
plt.close()  # Close the figure to avoid displaying it

In [24]:
one_hots = ["col_02", "col_05", "col_03", "col_12"]

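non_one_hots = unique_data.copy()
# Drop every one-hot column, i.e. every column derived from the categorical columns above.
one_hot_columns = [column for column in non_one_hots.columns
                   if any(col in column for col in one_hots)]
non_one_hots = non_one_hots.drop(columns=one_hot_columns)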

non_one_hots

Unnamed: 0,label,col_00,col_01,col_04,col_06,col_07,col_08,col_09,col_10,col_11,col_13,col_14
0,1,600,0.117900,359,958,913,597,652,0.3977,2.038494,75,0.767600
1,6,518,1.005900,2650,456,2392,447,191,1.2029,1.305800,575,1.012500
2,3,585,0.305700,1234,1117,93,227,1153,0.8166,1.126600,1845,1.629107
3,3,1042,0.284900,705,557,1275,1581,701,0.2707,0.194700,489,0.187800
4,1,473,0.461700,249,839,211,819,753,1.3098,1.749800,924,0.079100
...,...,...,...,...,...,...,...,...,...,...,...,...
1442,7,521,1.002000,983,681,1169,427,261,1.2008,1.029400,384,0.395700
1443,4,804,1.078200,657,167,899,1967,253,1.1767,0.000000,1432,0.658100
1444,6,1109,0.867100,1114,3008,353,1629,378,0.7717,0.632200,555,0.545200
1445,4,643,1.645613,1338,1155,2378,1330,740,1.5394,1.711100,73,1.265400


In [25]:
# Continuous: col_01, col_10, col_11, col_14
# Discrete: col_00, col_04, col_06, col_07, col_08, col_09, col_13

continuous = ["col_01", "col_10", "col_11", "col_14"]
discrete = ["col_00", "col_04", "col_06", "col_07", "col_08", "col_09", "col_13"]

# Make sure the './graphs/' directory exists
os.makedirs("./graphs/", exist_ok=True)

# Initialize the plot with a white background
fig, ax = plt.subplots(figsize=(10, 8), facecolor="white")
ax.set_facecolor("white")

# Build 50 shared bin edges (49 bins) spanning the range of all continuous columns
bins = np.linspace(min(non_one_hots[continuous].min()), max(non_one_hots[continuous].max()), 50)

for column in continuous:
    counts, edges = np.histogram(non_one_hots[column], bins=bins)
    bin_centers = 0.5 * (edges[1:] + edges[:-1])
    ax.plot(bin_centers, counts, '-o', label=column)

# ax.set_yscale('log')
ax.legend(loc='upper right')
plt.title("Line Histogram of Continuous Features")
plt.xlabel("Original Value")
plt.ylabel("Quantity")

histogram_path = os.path.join("./graphs", "histogram_continuous.png")
plt.savefig(histogram_path)
plt.close()

fig, ax = plt.subplots(figsize=(10, 8), facecolor="white")
ax.set_facecolor("white")
# Each discrete column gets its own 10 equal-width bins over its own range.
for column in discrete:
    counts, edges = np.histogram(non_one_hots[column], bins=10)
    bin_centers = 0.5 * (edges[1:] + edges[:-1])
    ax.plot(bin_centers, counts, '-o', label=column)

# ax.set_yscale('log')
ax.legend(loc='upper right')
plt.title("Line Histogram of Discrete Features")
plt.xlabel("Original Value")
plt.ylabel("Quantity")

histogram_path = os.path.join("./graphs", "histogram_discrete.png")
plt.savefig(histogram_path)
plt.close()

In [26]:
# Create boxplots for the continuous variables

for column in continuous:
    fig, ax = plt.subplots(figsize=(10, 8), facecolor="white")
    ax.set_facecolor("white")
    sns.boxplot(x='label', y=column, data=non_one_hots)

    plt.title(f"Box plot of {column}")
    plt.xlabel("Label")
    plt.ylabel(f"{column} value")
    plt.savefig(f"./graphs/boxplot_{column}.png")
    plt.close()

# Create pair plots for the continuous variables

sns.pairplot(non_one_hots[continuous])
plt.savefig(f"./graphs/pairplot.png")
plt.close()

In [27]:
# Plot the frequency polygon
fig, ax = plt.subplots(figsize=(10, 6), facecolor="white")
ax.set_facecolor("white")

for column in discrete:
    counts, bin_edges = np.histogram(non_one_hots[column], bins='auto')  # 'auto' lets numpy decide the number of bins
    bin_midpoints = (bin_edges[:-1] + bin_edges[1:]) / 2

    ax.plot(bin_midpoints, counts, marker='o', linestyle='-', label=column)

ax.legend(loc='upper right')  # identify which line belongs to which column
plt.title('Frequency Polygon')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.grid(True)  # Optional: adds a grid for easier readability

# Save the plot
plt.savefig(f'./graphs/polygon.png')
plt.close()

In [28]:
from pandas.plotting import parallel_coordinates

non_one_hots_normalized = (non_one_hots - non_one_hots.min()) / (non_one_hots.max() - non_one_hots.min())

fig, ax = plt.subplots(figsize=(10, 8), facecolor="white")
ax.set_facecolor("white")

parallel_coordinates(non_one_hots_normalized, 'label', colormap='viridis')
plt.savefig(f'./graphs/parallel_coordinates.png')
plt.close()

<h4 style="color: darkorange";>★ Written Task: Data Visualization</h4>

Create at least two different visualizations that help describe what you see in your dataset.
Include these visualizations in your report along with descriptions of
how you created the visualization,
what data preparation you had to do for the visualization (aside from the data cleaning in the previous part),
and what the visualization tells us about the data.

# Part 3: Modeling

Now that you have a good grasp of your clean data,
it is time to do some machine learning!
(Technically all our previous steps were also machine learning,
but now we get to use classifiers!)

Use the skills you developed to select **three** classifiers and implement them on your data.
For example, you can narrow down your choices to three classifiers which may include:
- Logistic regression
- K-nearest neighbors
- Decision tree
- Or others

<h4 style="color: darkorange";>★ Task 3.A</h4>

Complete the following function that takes in no parameters,
and returns a list with **three** untrained classifiers you are going to explore in this assignment.
This method may set parameters/options for the classifiers, but should not do any training/fitting.

For example, if you wanted to use logistic regression,
then **one** of your list items may be:
```
sklearn.linear_model.LogisticRegression()
```
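
Since the function may set parameters, a hedged sketch of what such a list could look like is shown below. The function name `create_classifiers_sketch` and the specific settings (`max_iter=1000`, `n_neighbors=5`, `n_estimators=100`, `random_state=0`) are illustrative assumptions, not recommended values.

```python
import sklearn.ensemble
import sklearn.linear_model
import sklearn.neighbors

def create_classifiers_sketch():
    """Return three unfitted classifiers with a few explicit (illustrative) settings."""
    return [
        sklearn.linear_model.LogisticRegression(max_iter=1000),
        sklearn.neighbors.KNeighborsClassifier(n_neighbors=5),
        sklearn.ensemble.RandomForestClassifier(n_estimators=100, random_state=0),
    ]

classifiers = create_classifiers_sketch()
for classifier in classifiers:
    # Fitted scikit-learn classifiers gain attributes such as `classes_`;
    # these have none because nothing has been trained yet.
    print(type(classifier).__name__, hasattr(classifier, "classes_"))
```

Whatever settings you choose, record them, because the written portion asks for all parameter details.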

In [35]:
def create_classifiers():
    random_forest = sklearn.ensemble.RandomForestClassifier()
    k_nearest_neighbors = sklearn.neighbors.KNeighborsClassifier()
    logistic_regression = sklearn.linear_model.LogisticRegression()

    return [random_forest, k_nearest_neighbors, logistic_regression]

my_classifiers = create_classifiers()
my_classifiers

[RandomForestClassifier(), KNeighborsClassifier(), LogisticRegression()]

Now that we have some classifiers, we can see how they perform.

<h4 style="color: darkorange";>★ Task 3.B</h4>

Complete the following function that takes in an untrained classifier, a DataFrame, and a number of folds.
This function should run k-fold cross validation with the classifier and the data,
and return a list with the accuracy of each run of cross validation.
You can assume that the frame has the column `label` and the rest of the columns can be considered clean numeric features.

Note that you may have to break your frame into features and labels to do this.
Do not change the passed-in frame (make copies instead).

If you are getting any `ConvergenceWarning`s you may either ignore them,
or try and address them
(they will not affect your autograder score, but may be something to discuss in the written portion of this assignment).
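
One way to satisfy these requirements is sketched below, under several assumptions: the helper name `cross_validate_accuracy`, the optional `scale` flag, the shuffled `KFold` with a fixed seed, and the synthetic stand-in data are all choices made for illustration. Cloning the classifier leaves the passed-in object untrained, and placing the scaler inside a pipeline means it is re-fit on each training fold only.

```python
import pandas as pd
import sklearn.base
import sklearn.datasets
import sklearn.linear_model
import sklearn.model_selection
import sklearn.pipeline
import sklearn.preprocessing

def cross_validate_accuracy(classifier, frame, folds, scale=False):
    """Return one accuracy per fold without modifying `frame` or `classifier`."""
    features = frame.drop(columns=["label"]).to_numpy()
    labels = frame["label"].to_numpy()

    model = sklearn.base.clone(classifier)  # leave the caller's object unfitted
    if scale:
        # Inside a pipeline the scaler is re-fit on each training fold only.
        model = sklearn.pipeline.make_pipeline(
            sklearn.preprocessing.StandardScaler(), model)

    splitter = sklearn.model_selection.KFold(
        n_splits=folds, shuffle=True, random_state=0)
    scores = sklearn.model_selection.cross_val_score(
        model, features, labels, cv=splitter, scoring="accuracy")
    return [float(score) for score in scores]

# Synthetic stand-in data (an assumption; not the assignment dataset).
X, y = sklearn.datasets.make_classification(
    n_samples=200, n_features=6, n_informative=4, random_state=0)
toy = pd.DataFrame(X, columns=[f"f{i}" for i in range(6)])
toy["label"] = y

accuracies = cross_validate_accuracy(
    sklearn.linear_model.LogisticRegression(max_iter=1000), toy, folds=5, scale=True)
print(accuracies)
```

Scaling the features or raising `max_iter` are the two remedies the `ConvergenceWarning` message itself suggests.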

In [39]:
def cross_fold_validation(classifier, frame, folds, scale=False):
    features = frame.drop(columns=["label"]).to_numpy()
    labels = frame["label"].to_numpy()

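    if scale:
        # The scaler is fit on all rows before the folds are split, so the
        # scaling statistics also include each fold's test rows.
        scaler = sklearn.preprocessing.StandardScaler()
        features = scaler.fit_transform(features)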

    kf = sklearn.model_selection.KFold(n_splits=folds)
    accuracy_scores = []
    f1_scores = []
    precision_scores = []
    recall_scores = []

    for train_index, test_index in kf.split(features, labels):
        train_features = features[train_index]
        train_labels = labels[train_index]
        test_features = features[test_index]
        test_labels = labels[test_index]

        classifier.fit(train_features, train_labels)
        predictions = classifier.predict(test_features)

        accuracy = sklearn.metrics.accuracy_score(test_labels, predictions)
        accuracy_scores.append(np.round(accuracy, 4))

        f1 = sklearn.metrics.f1_score(test_labels, predictions, average="weighted")
        f1_scores.append(np.round(f1, 4))

        precision = sklearn.metrics.precision_score(test_labels, predictions, average="weighted")
        precision_scores.append(np.round(precision, 4))

        recall = sklearn.metrics.recall_score(test_labels, predictions, average="weighted")
        recall_scores.append(np.round(recall, 4))

    return accuracy_scores, f1_scores, precision_scores, recall_scores

results = {}

my_classifiers_scores = []
for classifier in my_classifiers:
    for scale in [True, False]:
        accuracy_scores, f1_scores, precision_scores, recall_scores = cross_fold_validation(classifier, unique_data, 5, scale)
        my_classifiers_scores.append(accuracy_scores)
        name = f"{type(classifier).__name__}".replace("Classifier", "")
        if scale: name += " (Scaled)"
        else: name += " (Unscaled)"

        print(f"{name} & {np.round(np.mean(accuracy_scores), 3)}, {np.round(np.std(accuracy_scores), 3)} & {np.round(np.mean(f1_scores), 3)}, {np.round(np.std(f1_scores), 3)} & {np.round(np.mean(precision_scores), 3)}, {np.round(np.std(precision_scores), 3)} & {np.round(np.mean(recall_scores), 3)}, {np.round(np.std(recall_scores), 3)} \\\\ \\hline")

        # print(f"Classifier: {type(classifier).__name__}; Scaled: {scale}")
        # print(f"Accuracy: {np.round(np.mean(accuracy_scores), 3)}, {np.round(np.std(accuracy_scores), 3)}")
        # print(f"F1: {np.round(np.mean(f1_scores), 3)}, {np.round(np.std(f1_scores), 3)}")
        # print(f"Precision: {np.round(np.mean(precision_scores), 3)}, {np.round(np.std(precision_scores), 3)}")
        # print(f"Recall: {np.round(np.mean(recall_scores), 3)}, {np.round(np.std(recall_scores), 3)}")
        print()

RandomForest (Scaled) & 0.98, 0.009 & 0.98, 0.009 & 0.981, 0.009 & 0.98, 0.009 \\ \hline

RandomForest (Unscaled) & 0.974, 0.006 & 0.974, 0.006 & 0.975, 0.006 & 0.974, 0.006 \\ \hline

KNeighbors (Scaled) & 0.99, 0.004 & 0.99, 0.004 & 0.991, 0.004 & 0.99, 0.004 \\ \hline

KNeighbors (Unscaled) & 0.207, 0.023 & 0.201, 0.022 & 0.217, 0.019 & 0.207, 0.023 \\ \hline

LogisticRegression (Scaled) & 0.994, 0.006 & 0.994, 0.006 & 0.994, 0.005 & 0.994, 0.006 \\ \hline



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

LogisticRegression (Unscaled) & 0.991, 0.002 & 0.991, 0.002 & 0.991, 0.002 & 0.991, 0.002 \\ \hline



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


<h4 style="color: darkorange";>★ Task 3.C</h4>

Complete the following function that takes in two equally-sized lists of numbers and a p-value.
This function should determine whether the difference between
these two lists of numbers is statistically significant, using a [Student's t-test](https://en.wikipedia.org/wiki/Student%27s_t-test)
at the given p-value (used as the significance threshold).
Return `True` if the difference is statistically significant, and `False` otherwise.
Hint: If you wish, you may use the `ttest_ind()` [method](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) provided in the scipy package.
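
As a rough illustration of the hint, the sketch below runs `ttest_ind()` on every pair of models. The fold accuracies are made-up placeholders, and `is_significant` and `scores_by_model` are hypothetical names; keying the scores by model name keeps each list attached to the model it came from.

```python
from itertools import combinations

import scipy.stats

def is_significant(a_values, b_values, alpha):
    """True when the two-sided independent t-test p-value falls below `alpha`."""
    _, p = scipy.stats.ttest_ind(a_values, b_values)
    return bool(p < alpha)

# Made-up fold accuracies keyed by model name (placeholders, not real results).
scores_by_model = {
    "model_a": [0.70, 0.72, 0.71, 0.73, 0.69],
    "model_b": [0.71, 0.70, 0.72, 0.72, 0.70],
    "model_c": [0.90, 0.91, 0.89, 0.92, 0.90],
}

for name_a, name_b in combinations(scores_by_model, 2):
    verdict = is_significant(scores_by_model[name_a], scores_by_model[name_b], 0.10)
    print(f"{name_a} vs {name_b}: {verdict}")
```

Here `model_a` and `model_b` have the same mean, so their difference is not significant, while `model_c` differs clearly from both.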

In [None]:
def significance_test(a_values, b_values, p_value):
    t, p = scipy.stats.ttest_ind(a_values, b_values)
    return p < p_value

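# my_classifiers_scores holds two entries per classifier, in the order
# (scaled, unscaled), so take every other entry to compare the scaled runs.
scaled_scores = my_classifiers_scores[0::2]

for i in range(len(my_classifiers)):
    for j in range(i + 1, len(my_classifiers)):
        significant = significance_test(scaled_scores[i], scaled_scores[j], 0.10)
        print(f"{type(my_classifiers[i]).__name__} vs ", end = "")
        print(f"{type(my_classifiers[j]).__name__}: {significant}")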

RandomForestClassifier vs KNeighborsClassifier: True
RandomForestClassifier vs LogisticRegression: True
KNeighborsClassifier vs LogisticRegression: False


<h4 style="color: darkorange";>★ Written Task: Modeling</h4>

Describe the classifiers you have chosen.
Be sure to include all details about any parameter settings used for the algorithms.

Compare the performance of your models using k-fold validation.
You may look at accuracy, F1 or other measures.

Then, briefly summarize your results.
Are your results statistically significant?
Is there a clear winner?
What do the standard deviations look like, and what do they tell us about the different models?
Include a table like Table 1.

<center>Table 1: Every table needs a caption.</center>

| Model | Mean Accuracy | Standard Deviation of Accuracy |
|-------|---------------|--------------------------------|
| Logistic Regression | 0.724 | 0.004 |
| K-Nearest Neighbor | 0.750 | 0.003 |
| Decision Tree | 0.655 | 0.011 |
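
A table in this layout can be generated from the per-fold accuracies. The following sketch assumes a hypothetical helper `results_table` and uses placeholder scores, so the numbers it prints are not results for any real model.

```python
import numpy as np

def results_table(scores_by_model):
    """Format per-fold accuracies as a Markdown table of mean and standard deviation."""
    lines = [
        "| Model | Mean Accuracy | Standard Deviation of Accuracy |",
        "|-------|---------------|--------------------------------|",
    ]
    for name, scores in scores_by_model.items():
        mean = np.mean(scores)
        std = np.std(scores)  # population std (ddof=0); pass ddof=1 for the sample std
        lines.append(f"| {name} | {mean:.3f} | {std:.3f} |")
    return "\n".join(lines)

# Placeholder fold accuracies for illustration only.
example_scores = {
    "Model A": [0.70, 0.72, 0.71, 0.73, 0.69],
    "Model B": [0.90, 0.91, 0.89, 0.92, 0.90],
}
table = results_table(example_scores)
print(table)
```

Whichever standard deviation you report (population or sample), say so in the caption, since the two differ noticeably with only a few folds.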

# Part 4: Analysis

Now, take some time to go over your results for each classifier and try to make sense of them.
 - Why do some classifiers work better than others?
 - Would another evaluation metric work better than vanilla accuracy?
 - Is there still a problem in the data that should be fixed in data cleaning?
 - Does the statistical significance between the different classifiers make sense?
 - Are there parameters for the classifier that I can tweak to get better performance?
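
For the last question, a grid search is one common way to try several parameter values. The sketch below is an assumption-laden example: it uses synthetic stand-in data, a k-nearest neighbors classifier, and an arbitrary list of candidate values for `n_neighbors`.

```python
import sklearn.datasets
import sklearn.model_selection
import sklearn.neighbors
import sklearn.pipeline
import sklearn.preprocessing

# Synthetic stand-in data (an assumption; not the assignment dataset).
X, y = sklearn.datasets.make_classification(
    n_samples=200, n_features=6, n_informative=4, random_state=0)

pipeline = sklearn.pipeline.Pipeline([
    ("scale", sklearn.preprocessing.StandardScaler()),
    ("knn", sklearn.neighbors.KNeighborsClassifier()),
])

# Candidate values are illustrative; "<step>__<parameter>" addresses a pipeline step.
param_grid = {"knn__n_neighbors": [1, 3, 5, 9]}

search = sklearn.model_selection.GridSearchCV(
    pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The best cross-validated score found this way is slightly optimistic, because the same folds were used to pick the parameter; that caveat is worth mentioning in the analysis.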

<h4 style="color: darkorange";>★ Written Task: Analysis</h4>

Discuss your observations, the relationships you found, and how you applied concepts from the class to this project.
For example, you may find that some feature has the most impact in predicting your response variable or removing a feature improves the model accuracy.
Or you may observe that your training accuracy is much higher than your test accuracy and you may want to explain what issues may arise.
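
To check for a gap between training and test accuracy, both can be measured on a held-out split. This is a minimal sketch on synthetic stand-in data; the random forest, the 25% test split, and the seeds are arbitrary choices made for illustration.

```python
import sklearn.datasets
import sklearn.ensemble
import sklearn.metrics
import sklearn.model_selection

# Synthetic stand-in data (an assumption; not the assignment dataset).
X, y = sklearn.datasets.make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0)

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, test_size=0.25, random_state=0)

model = sklearn.ensemble.RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

train_accuracy = sklearn.metrics.accuracy_score(y_train, model.predict(X_train))
test_accuracy = sklearn.metrics.accuracy_score(y_test, model.predict(X_test))

# A large positive gap is a typical sign of overfitting.
print(f"train={train_accuracy:.3f} test={test_accuracy:.3f} "
      f"gap={train_accuracy - test_accuracy:.3f}")
```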

# Part 5: Conclusion

<h4 style="color: darkorange";>★ Written Task: Conclusion</h4>

Briefly summarize the important results and conclusions presented in the project.
What are the important points illustrated by your work?
Are there any areas for further investigation or improvement?

<h4 style="color: darkorange";>★ Written Task: References</h4>

Include a standard bibliography with citations referring to techniques or published papers you used throughout your report (if you used any).

For example:
```
[1] Derpanopoulos, G. (n.d.). Bayesian Model Checking & Comparison.
https://georgederpa.github.io/teaching/modelChecking.html.
```

# Part XC: Extra Credit

So far you have used a synthetic dataset that was created just for you.
But data science is always more interesting when you are dealing with actual data from the real world.
Therefore, you will have an opportunity for extra credit on this assignment using real-world data.

For extra credit, repeat the **written tasks** of Parts 0 through 4 with an additional dataset that you find yourself.
For the written portion of the extra credit for Part 0, include information about where you got the data and what the data represents.
You may choose any dataset that represents real data (i.e., is **not** synthetic or generated)
and is **not** [pre-packaged in scikit-learn](https://scikit-learn.org/stable/datasets.html).

Below are some of the many places you can start looking for datasets:
 - [Kaggle](https://www.kaggle.com/datasets) -- Kaggle is a website focused around machine learning competitions,
       where people compete to see who can get the best results on a dataset.
       It is very popular in the machine learning community and has thousands of datasets with descriptions.
       Make sure to read the dataset's description, as Kaggle also has synthetic datasets.
 - [data.gov](https://data.gov/) -- A portal for data from the US government.
        The US government has a lot of data, and much of it has to be available to the public by law.
        This portal contains some of the more organized data from several different government agencies.
        In general, the government has A LOT of interesting data.
        It may not always be clean (remember the CIA factbook), but it is interesting and available.
        All data here should be real-world, but make sure to read the description to verify.
 - [UCI's Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.php) -- UC Irvine has their own data repository with a few hundred datasets on many different topics.
        Make sure to read the dataset's description, as UCI also has synthetic datasets.
 - [WHO's Global Health Observatory](https://apps.who.int/gho/data/node.home) -- The World Health Organization keeps track of many different health-related statistics for most of the countries in the world.
        All data here should be real-world, but make sure to read the description to verify.
 - [Google's Dataset Search](https://datasetsearch.research.google.com/) -- Google indexes many datasets that can be searched here.

You can even create a dataset from scratch if you find some data you like that is not already organized into a specific dataset.
The only real distinction between "data" and a "dataset" is that a dataset is organized and finite (has a fixed size).

Create a new section in your written report for this extra credit and include all the written tasks for the extra credit there.
Each written task/section that you complete for your new dataset is eligible for extra credit (so you can still receive some extra credit even if you do not complete all parts).
There is no need to submit any code for the extra credit.
If you created a new dataset, include the dataset or links to it with your submission.