# Week 1 Challenge Project
This is the complete notebook, which includes both the data cleaning and visualization section released on Day 2, and additional guidelines for model creation and evaluation. If you were already working on the notebook from before, you may have to copy your work over.

At the end of this notebook, there is also a description of how to finalize and present your project.

## Challenge Introduction

> Original author: Lyle Lalunio

Hypothyroidism, also called underactive thyroid or low thyroid, is a disorder of the endocrine system in which the thyroid gland does not produce enough thyroid hormone. It can cause a number of symptoms, such as poor ability to tolerate cold, a feeling of tiredness, constipation, depression, and weight gain. Occasionally there may be swelling of the front part of the neck due to goitre. Untreated hypothyroidism during pregnancy can lead to delays in growth and intellectual development in the baby or cretinism.

Worldwide, too little iodine in the diet is the most common cause of hypothyroidism. In countries with enough iodine in the diet, the most common cause of hypothyroidism is the autoimmune condition Hashimoto's thyroiditis. Less common causes include: previous treatment with radioactive iodine, injury to the hypothalamus or the anterior pituitary gland, certain medications, a lack of a functioning thyroid at birth, or previous thyroid surgery. The diagnosis of hypothyroidism, when suspected, can be confirmed with blood tests measuring thyroid-stimulating hormone (TSH) and thyroxine levels.

Worldwide about one billion people are estimated to be iodine deficient; however, it is unknown how often this results in hypothyroidism. In the United States, hypothyroidism occurs in 0.3–0.4% of people.

And that is why we iodize salt.

![alt text](https://www.mayoclinic.org/-/media/kcms/gbs/patient-consumer/images/2013/11/15/17/39/ds00181_-ds00344_-ds00353_-ds00491_-ds00492_-ds00567_-ds00660_-my00709_im01872_thyroid_gif.jpg)



Background: Doctors all around the world need our help to predict whether a patient has hypothyroid disease. We have already overspent our budget to collect such complete data on about 30 attributes for 2800 patients--a good starting number, but a larger sample would certainly be preferred. Moving forward, however, we simply cannot afford to spend so much money on data collection. Therefore, we also need to determine which attributes are the most meaningful to the predictive models, and cut out the rest that don't contribute much. 

The boss wants to see a **balanced** model that predicts with **high sensitivity** and **high specificity** while using a ***small number of features***. Collecting data this complete is rare, time-consuming, and often very expensive. Minimizing the number of features tells us what future data collection needs to gather and what it can skip.

## Loading the data

Let's read the data into a Pandas dataframe and look at the first 10 records.

In [305]:
import pandas as pd

url = "https://raw.githubusercontent.com/BeaverWorksMedlytics2020/Data_Public/master/ChallengeProjects/Week1/allhypo.train.data.csv"
dataset=pd.read_csv(url) 
dataset.head(10)

Unnamed: 0,Age,Sex,On thyroxine,query on thyroxine,on antithyroid medication,sick,pregnant,thyroid surgery,I131 treatment,query hypothyroid,query hyperthyroid,lithium,goitre,tumor,psych,TSH measured,TSH,T3 measured,T3,TT4 measured,TT4,T4U measured,T4u,FTI measured,FTI,TBG measured,TBG,referral source,class
0,41,F,f,f,f,f,f,f,f,f,f,f,f,f,f,t,1.3,t,2.5,t,125,t,1.14,t,109,f,?,SVHC,negative.|3733
1,23,F,f,f,f,f,f,f,f,f,f,f,f,f,f,t,4.1,t,2,t,102,f,?,f,?,f,?,other,negative.|1442
2,46,M,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.98,f,?,t,109,t,0.91,t,120,f,?,other,negative.|2965
3,70,F,t,f,f,f,f,f,f,f,f,f,f,f,f,t,0.16,t,1.9,t,175,f,?,f,?,f,?,other,negative.|806
4,70,F,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.72,t,1.2,t,61,t,0.87,t,70,f,?,SVI,negative.|2807
5,18,F,t,f,f,f,f,f,f,f,f,f,f,f,f,t,0.03,f,?,t,183,t,1.3,t,141,f,?,other,negative.|3434
6,59,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,?,f,?,t,72,t,0.92,t,78,f,?,other,negative.|1595
7,80,F,f,f,f,f,f,f,f,f,f,f,f,f,f,t,2.2,t,0.6,t,80,t,0.7,t,115,f,?,SVI,negative.|1367
8,66,F,f,f,f,f,f,f,f,f,f,f,f,t,f,t,0.6,t,2.2,t,123,t,0.93,t,132,f,?,SVI,negative.|1787
9,68,M,f,f,f,f,f,f,f,f,f,f,f,f,f,t,2.4,t,1.6,t,83,t,0.89,t,93,f,?,SVI,negative.|2534


Great, looks like the data loaded in properly. Let's continue looking at some summary statistics on our data.

## Viewing summary statistics
The functions describe() and info() are your friends.

In [306]:
dataset.describe()

Unnamed: 0,Age,Sex,On thyroxine,query on thyroxine,on antithyroid medication,sick,pregnant,thyroid surgery,I131 treatment,query hypothyroid,query hyperthyroid,lithium,goitre,tumor,psych,TSH measured,TSH,T3 measured,T3,TT4 measured,TT4,T4U measured,T4u,FTI measured,FTI,TBG measured,TBG,referral source,class
count,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800
unique,94,3,2,2,2,2,2,2,2,2,2,2,2,2,2,2,264,2,65,2,218,2,139,2,210,1,1,5,2800
top,59,F,f,f,f,f,f,f,f,f,f,f,f,f,f,t,?,t,?,t,?,t,?,t,?,f,?,other,negative.|2588
freq,75,1830,2470,2760,2766,2690,2759,2761,2752,2637,2627,2786,2775,2729,2665,2516,284,2215,585,2616,184,2503,297,2505,295,2800,2800,1632,1


In [307]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2800 entries, 0 to 2799
Data columns (total 29 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Age                        2800 non-null   object
 1   Sex                        2800 non-null   object
 2   On thyroxine               2800 non-null   object
 3   query on thyroxine         2800 non-null   object
 4   on antithyroid medication  2800 non-null   object
 5   sick                       2800 non-null   object
 6   pregnant                   2800 non-null   object
 7   thyroid surgery            2800 non-null   object
 8   I131 treatment             2800 non-null   object
 9   query hypothyroid          2800 non-null   object
 10  query hyperthyroid         2800 non-null   object
 11  lithium                    2800 non-null   object
 12  goitre                     2800 non-null   object
 13  tumor                      2800 non-null   object
 14  psych   

Note the data types are all objects--even columns that are obviously numeric like Age.


## Data cleaning

To start, let's make all the numerical columns contain the correct type of values and change the data type of those columns to numeric. Let's also replace all those question marks with the median of the respective column.

Hint: To make it easier, first try converting all the "?" to NaN.
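
The hint above can be tried out on a small scale first. The following is only a sketch on a made-up single-column frame (the name `toy` and its values are assumptions for illustration, not part of the challenge data): it converts "?" to NaN, casts the column to a numeric type, and then fills the gap with the column median.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column that mimics the "?" placeholders in the real data
toy = pd.DataFrame({"TSH": ["1.3", "?", "0.98", "4.1"]})

# Step 1: turn the "?" placeholder into NaN and cast to a numeric dtype
toy["TSH"] = pd.to_numeric(toy["TSH"].replace("?", np.nan))

# Step 2: fill the missing entries with the median of the observed values
toy["TSH"] = toy["TSH"].fillna(toy["TSH"].median())

print(toy)
```

The median is computed only from the observed values, so a few extreme lab results pull it around less than they would pull the mean.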

In [308]:
import numpy as np

# Convert "?" to NaN
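listy = ['Age','TSH','T3','TT4','T4u','FTI']
for i in listy:
  # Assign the result back to the column instead of calling replace(inplace=True)
  # on a column selection, which may silently leave the dataframe unchanged
  dataset[i] = pd.to_numeric(dataset[i].replace('?', np.nan))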

# Identify columns by what type of data they hold
numeric_columns = listy
categorical_columns = list(dataset.select_dtypes(include=['object']).columns)

# Print statement for sanity check
print('Numerical Columns: ',numeric_columns)
print('Categorical Columns: ',categorical_columns)

dataset.head(10)

Numerical Columns:  ['Age', 'TSH', 'T3', 'TT4', 'T4u', 'FTI']
Categorical Columns:  ['Sex', 'On thyroxine', 'query on thyroxine', 'on antithyroid medication', 'sick', 'pregnant', 'thyroid surgery', 'I131 treatment', 'query hypothyroid', 'query hyperthyroid', 'lithium', 'goitre', 'tumor', 'psych', 'TSH measured', 'T3 measured', 'TT4 measured', 'T4U measured', 'FTI measured', 'TBG measured', 'TBG', 'referral source', 'class']


Unnamed: 0,Age,Sex,On thyroxine,query on thyroxine,on antithyroid medication,sick,pregnant,thyroid surgery,I131 treatment,query hypothyroid,query hyperthyroid,lithium,goitre,tumor,psych,TSH measured,TSH,T3 measured,T3,TT4 measured,TT4,T4U measured,T4u,FTI measured,FTI,TBG measured,TBG,referral source,class
0,41.0,F,f,f,f,f,f,f,f,f,f,f,f,f,f,t,1.3,t,2.5,t,125.0,t,1.14,t,109.0,f,?,SVHC,negative.|3733
1,23.0,F,f,f,f,f,f,f,f,f,f,f,f,f,f,t,4.1,t,2.0,t,102.0,f,,f,,f,?,other,negative.|1442
2,46.0,M,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.98,f,,t,109.0,t,0.91,t,120.0,f,?,other,negative.|2965
3,70.0,F,t,f,f,f,f,f,f,f,f,f,f,f,f,t,0.16,t,1.9,t,175.0,f,,f,,f,?,other,negative.|806
4,70.0,F,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.72,t,1.2,t,61.0,t,0.87,t,70.0,f,?,SVI,negative.|2807
5,18.0,F,t,f,f,f,f,f,f,f,f,f,f,f,f,t,0.03,f,,t,183.0,t,1.3,t,141.0,f,?,other,negative.|3434
6,59.0,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,,f,,t,72.0,t,0.92,t,78.0,f,?,other,negative.|1595
7,80.0,F,f,f,f,f,f,f,f,f,f,f,f,f,f,t,2.2,t,0.6,t,80.0,t,0.7,t,115.0,f,?,SVI,negative.|1367
8,66.0,F,f,f,f,f,f,f,f,f,f,f,f,t,f,t,0.6,t,2.2,t,123.0,t,0.93,t,132.0,f,?,SVI,negative.|1787
9,68.0,M,f,f,f,f,f,f,f,f,f,f,f,f,f,t,2.4,t,1.6,t,83.0,t,0.89,t,93.0,f,?,SVI,negative.|2534


Hmm, it looks like the TBG column is still unfilled, implying it was empty to begin with. Let's get rid of this column, then (and make sure to remove it from your list of categorical columns, too!)

In [309]:
new_data = dataset.drop(['TBG'],axis=1)
categorical_columns.remove('TBG')

All right, let's take a look now at the info of *just the numeric columns* in the dataset:

In [310]:
new_data[numeric_columns].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2800 entries, 0 to 2799
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Age     2799 non-null   float64
 1   TSH     2516 non-null   float64
 2   T3      2215 non-null   float64
 3   TT4     2616 non-null   float64
 4   T4u     2503 non-null   float64
 5   FTI     2505 non-null   float64
dtypes: float64(6)
memory usage: 131.4 KB


Perfect, now let's fix that class feature. According to the note the data collectors included with this data, the ".|####" refers to a patient number, and is not necessarily relevant for our purposes here.
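
As a small illustration of that cleanup, the sketch below strips the patient-number suffix from a few labels using a vectorized string replace. The first two labels come from the table above; the third raw label is a hypothetical example of how a positive class might look before cleaning.

```python
import pandas as pd

# Two labels taken from the preview above, plus one hypothetical positive label
labels = pd.Series(["negative.|3733", "negative.|1442", "compensated_hypothyroid.|12"])

# Keep letters only: this removes ".|####" and any other non-letter characters
cleaned = labels.str.replace("[^A-Za-z]+", "", regex=True)

print(cleaned.tolist())
```

Note that the pattern removes every non-letter character, so punctuation inside a class name disappears along with the patient number.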

In [311]:
# Strip everything except letters, which drops the ".|####" patient number.
# Assigning the whole column at once avoids pandas' SettingWithCopyWarning.
new_data['class'] = new_data['class'].str.replace('[^A-Za-z]+', '', regex=True)


Let's run the describe() function on just the "class" column.

In [312]:
new_data['class'].describe()

count         2800
unique           4
top       negative
freq          2580
Name: class, dtype: object

It looks like there are actually 4 unique class values! Thank goodness we didn't assume it was binary.

Display all the unique values in the class column.

In [313]:
np.unique(new_data['class'])

array(['compensatedhypothyroid', 'negative', 'primaryhypothyroid',
       'secondaryhypothyroid'], dtype=object)

But let's make it binary for the sake of this example anyway. If you finish early later on, try the multiclass classifier with all 4 values!

In [314]:
# Map 'negative' to 0 and every hypothyroid class to 1.
# The column is kept object-typed for now; it is converted to numeric later on.
new_data['class'] = (new_data['class'] != 'negative').astype(int).astype(object)


Before we move on, let's not forget to run the info() function on just your categorical columns, too.
Compare it to the info() output for your numeric columns.

In [315]:
new_data[categorical_columns].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2800 entries, 0 to 2799
Data columns (total 22 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Sex                        2800 non-null   object
 1   On thyroxine               2800 non-null   object
 2   query on thyroxine         2800 non-null   object
 3   on antithyroid medication  2800 non-null   object
 4   sick                       2800 non-null   object
 5   pregnant                   2800 non-null   object
 6   thyroid surgery            2800 non-null   object
 7   I131 treatment             2800 non-null   object
 8   query hypothyroid          2800 non-null   object
 9   query hyperthyroid         2800 non-null   object
 10  lithium                    2800 non-null   object
 11  goitre                     2800 non-null   object
 12  tumor                      2800 non-null   object
 13  psych                      2800 non-null   object
 14  TSH meas

Great! Let's see if there are any other records we have to address. count() is a nice way to check if we have any other missing values.

In [316]:
new_data.count()

Age                          2799
Sex                          2800
On thyroxine                 2800
query on thyroxine           2800
on antithyroid medication    2800
sick                         2800
pregnant                     2800
thyroid surgery              2800
I131 treatment               2800
query hypothyroid            2800
query hyperthyroid           2800
lithium                      2800
goitre                       2800
tumor                        2800
psych                        2800
TSH measured                 2800
TSH                          2516
T3 measured                  2800
T3                           2215
TT4 measured                 2800
TT4                          2616
T4U measured                 2800
T4u                          2503
FTI measured                 2800
FTI                          2505
TBG measured                 2800
referral source              2800
class                        2800
dtype: int64

The Sex column also has missing values, recorded as '?' (describe() reported 3 unique values for it). We could fill them in proportion to the current share of males and females, but that is an assumption we don't have to make. For now, let's simply drop the records with no recorded sex.

In [317]:
# Drop every row whose Sex is missing in a single call
new_data = new_data.drop(new_data[new_data['Sex'] == '?'].index)

Nice! Now we have a pretty clean dataset to work with. Let's now do some further data analysis and visualization to better understand what we're working with.

## Data analysis and visualization

Check the correlations between the numeric columns.

In [318]:
new_data[numeric_columns].corr()

Unnamed: 0,Age,TSH,T3,TT4,T4u,FTI
Age,1.0,-0.041316,-0.249888,-0.05456,-0.166453,0.038776
TSH,-0.041316,1.0,-0.184824,-0.26705,0.067,-0.306676
T3,-0.249888,-0.184824,1.0,0.565775,0.465089,0.352194
TT4,-0.05456,-0.26705,0.565775,1.0,0.439125,0.795785
T4u,-0.166453,0.067,0.465089,0.439125,1.0,-0.168474
FTI,0.038776,-0.306676,0.352194,0.795785,-0.168474,1.0


Convert the class feature to numeric so we can also see the correlations it has with the numeric features, and check the correlation again.

In [319]:
new_data['class'] = pd.to_numeric(new_data['class'])
# Restrict to numeric columns; the remaining object columns cannot be correlated
new_data[numeric_columns + ['class']].corr()


Unnamed: 0,Age,TSH,T3,TT4,T4u,FTI,class
Age,1.0,-0.041316,-0.249888,-0.05456,-0.166453,0.038776,-0.008204
TSH,-0.041316,1.0,-0.184824,-0.26705,0.067,-0.306676,0.439362
T3,-0.249888,-0.184824,1.0,0.565775,0.465089,0.352194,-0.185861
TT4,-0.05456,-0.26705,0.565775,1.0,0.439125,0.795785,-0.273877
T4u,-0.166453,0.067,0.465089,0.439125,1.0,-0.168474,0.032811
FTI,0.038776,-0.306676,0.352194,0.795785,-0.168474,1.0,-0.307223
class,-0.008204,0.439362,-0.185861,-0.273877,0.032811,-0.307223,1.0


Let's do some further visual analysis using a new module called seaborn. Explore its incredible versatility and diversity with data visualization here: https://seaborn.pydata.org/

In [320]:
import seaborn as sns
#sns.pairplot(new_data)

OK! I think we're ready to create and select some supervised learning models. To get the ball rolling, select Age and Sex as our explanatory features (and class as the target feature, obviously).

## Model training and selection

Let's use get_dummies on the categorical variables (but not the class value!) to view the column names to select some for our model.
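
To see what get_dummies produces before running it on the full dataset, here is a minimal sketch on a made-up three-row frame (the frame `toy` is an assumption for illustration). With `drop_first=True`, the first category of each column, in sorted order, becomes the implicit baseline and gets no column of its own.

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns
toy = pd.DataFrame({
    "Sex": ["F", "M", "F"],
    "referral source": ["SVI", "other", "SVHC"],
})

# One-hot encode; drop_first removes the first category of each column,
# dtype=int gives 0/1 instead of True/False
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Here "F" and "SVHC" are the dropped baselines, so a row with all zeros in the `referral source_*` columns means "SVHC".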

In [321]:
dummies = categorical_columns[:-1]  # every categorical column except 'class'
for col in dummies:
  # dtype=int keeps the dummy columns as 0/1 rather than True/False
  to_add = pd.get_dummies(new_data[col], prefix=col, drop_first=True, dtype=int)
  new_data = pd.concat([new_data, to_add], axis=1)
  del new_data[col]
new_data.head()

Unnamed: 0,Age,TSH,T3,TT4,T4u,FTI,class,Sex_M,On thyroxine_t,query on thyroxine_t,on antithyroid medication_t,sick_t,pregnant_t,thyroid surgery_t,I131 treatment_t,query hypothyroid_t,query hyperthyroid_t,lithium_t,goitre_t,tumor_t,psych_t,TSH measured_t,T3 measured_t,TT4 measured_t,T4U measured_t,FTI measured_t,referral source_SVHC,referral source_SVHD,referral source_SVI,referral source_other
0,41.0,1.3,2.5,125.0,1.14,109.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0
1,23.0,4.1,2.0,102.0,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,1
2,46.0,0.98,,109.0,0.91,120.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,1,0,0,0,1
3,70.0,0.16,1.9,175.0,,,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,1
4,70.0,0.72,1.2,61.0,0.87,70.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,1,0


In [322]:
# Move 'class' and the dummy columns ahead of the six numeric columns
class_to_front = new_data.columns.tolist()
class_to_front = class_to_front[6:] + class_to_front[:6]
new_data = new_data[class_to_front]
new_data.head()

Unnamed: 0,class,Sex_M,On thyroxine_t,query on thyroxine_t,on antithyroid medication_t,sick_t,pregnant_t,thyroid surgery_t,I131 treatment_t,query hypothyroid_t,query hyperthyroid_t,lithium_t,goitre_t,tumor_t,psych_t,TSH measured_t,T3 measured_t,TT4 measured_t,T4U measured_t,FTI measured_t,referral source_SVHC,referral source_SVHD,referral source_SVI,referral source_other,Age,TSH,T3,TT4,T4u,FTI
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,41.0,1.3,2.5,125.0,1.14,109.0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,1,23.0,4.1,2.0,102.0,,
2,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,1,0,0,0,1,46.0,0.98,,109.0,0.91,120.0
3,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,1,70.0,0.16,1.9,175.0,,
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,1,0,70.0,0.72,1.2,61.0,0.87,70.0


In [323]:
new_data.corr()

Unnamed: 0,class,Sex_M,On thyroxine_t,query on thyroxine_t,on antithyroid medication_t,sick_t,pregnant_t,thyroid surgery_t,I131 treatment_t,query hypothyroid_t,query hyperthyroid_t,lithium_t,goitre_t,tumor_t,psych_t,TSH measured_t,T3 measured_t,TT4 measured_t,T4U measured_t,FTI measured_t,referral source_SVHC,referral source_SVHD,referral source_SVI,referral source_other,Age,TSH,T3,TT4,T4u,FTI
class,1.0,-0.042863,-0.086285,-0.001571,-0.020632,0.01781,-0.035843,-0.012251,0.015851,0.078626,-0.023872,-0.001886,-0.028257,0.004423,-0.022312,0.097649,0.022703,0.059295,0.017943,0.017287,-0.060899,0.008517,0.041389,0.006508,-0.008204,0.439362,-0.185861,-0.273877,0.032811,-0.307223
Sex_M,-0.042863,1.0,-0.088265,0.040912,-0.034749,0.001917,-0.084223,-0.043136,-0.027264,-0.034216,-0.071517,-0.027428,0.024984,-0.071997,0.114167,0.03877,0.09603,0.074904,0.045724,0.046813,0.154559,-0.012077,0.112454,-0.155142,-0.008012,-0.037527,-0.070792,-0.167941,-0.240761,-0.032953
On thyroxine_t,-0.086285,-0.088265,1.0,0.002005,-0.000716,-0.052083,0.011467,0.0415,0.076911,0.07189,-0.019451,0.005159,-0.011846,-0.031501,-0.079169,0.03972,-0.15909,0.012118,0.020688,0.019853,-0.092987,0.015365,-0.192846,0.238603,0.008531,0.019718,0.029562,0.213735,0.050568,0.187885
query on thyroxine_t,-0.001571,0.040912,0.002005,1.0,-0.013901,0.021812,0.03566,0.010794,-0.016025,-0.030896,-0.006323,-0.008886,0.05212,-0.000789,-0.028131,-0.122152,-0.034004,0.03211,0.031819,0.03162,-0.041122,0.016201,-0.007597,0.035759,-0.016533,-0.011823,-0.024347,-0.01387,-0.014712,-0.005212
on antithyroid medication_t,-0.020632,-0.034749,-0.000716,-0.013901,1.0,-0.02314,0.068583,-0.013723,0.011189,-0.014383,0.122072,-0.008184,-0.010958,-0.018494,-0.025906,-0.006354,0.009808,-0.03844,-0.037715,-0.038111,-0.03787,-0.012016,-0.070153,0.069019,-0.070929,-0.012796,0.083169,0.004109,0.069989,-0.041206
sick_t,0.01781,0.001917,-0.052083,0.021812,-0.02314,1.0,-0.025127,-0.024806,-0.026676,0.036639,-0.029308,-0.014793,-0.019809,-0.00964,-0.02942,0.030705,0.031904,0.00701,0.020102,0.019666,-0.062163,0.284836,0.084604,-0.087492,0.07991,-0.021698,-0.08576,-0.034003,-0.041064,-0.016222
pregnant_t,-0.035843,-0.084223,0.011467,0.03566,0.068583,-0.025127,1.0,-0.014902,-0.016025,-0.01791,0.133303,-0.008886,-0.011899,0.114969,-0.014013,0.010508,0.026125,0.03211,0.041877,0.04171,-0.030917,-0.013047,-0.076178,-0.038953,-0.116034,-0.023895,0.203985,0.198272,0.366885,-0.00961
thyroid surgery_t,-0.012251,-0.043136,0.0415,0.010794,-0.013723,-0.024806,-0.014902,1.0,-0.015821,-0.017353,0.020105,-0.008773,-0.011748,-0.00029,-0.027772,0.040597,-0.013212,0.0317,0.000606,0.000311,-0.040597,-0.012881,-0.026597,0.052482,-0.033293,0.052304,-0.031098,-0.031658,0.0214,-0.0392
I131 treatment_t,0.015851,-0.027264,0.076911,-0.016025,0.011189,-0.026676,-0.016025,-0.015821,1.0,0.040728,0.0981,-0.009434,-0.012633,-0.02132,-0.029865,0.043658,-0.003296,0.03409,0.025476,0.025238,-0.043658,-0.013852,-0.080876,0.110674,0.049765,0.004905,0.013711,-0.014219,0.008462,-0.021906
query hypothyroid_t,0.078626,-0.034216,0.07189,-0.030896,-0.014383,0.036639,-0.01791,-0.017353,0.040728,1.0,0.013039,-0.018189,-0.024357,-0.041105,-0.014234,0.05284,-0.077295,0.01433,0.003367,0.002764,-0.05284,0.018194,-0.01906,0.044521,0.024745,0.049205,-0.06049,-0.020155,0.011898,-0.029311
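
A table this wide is hard to read by eye. One possible way to use it for feature selection is to rank the features by the absolute value of their correlation with `class`. The sketch below does this on a made-up frame (the frame `toy` and its values are assumptions for illustration); it is only a first-pass heuristic, since correlation captures linear relationships only.

```python
import pandas as pd

# Hypothetical toy data: TSH clearly separates the classes, Age does not
toy = pd.DataFrame({
    "class": [0, 0, 0, 1, 1, 1],
    "TSH": [1.0, 2.0, 1.5, 9.0, 12.0, 8.0],
    "Age": [40, 70, 55, 60, 45, 65],
})

# Correlation of every feature with the target, strongest first
ranking = toy.corr()["class"].drop("class").abs().sort_values(ascending=False)

print(ranking)
```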


All right, let's now split our data into training and testing in an 80-20 split. For consistency, let's all use a seed of 8675309.

In [325]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(new_data[class_to_front[1:]], new_data[class_to_front[0]], test_size=0.2, random_state=8675309)

# further split X and y of training into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0)

print('There are {} training samples with {} features and {} associated classification labels'.format(*X_train.shape, *y_train.shape))
print('There are {} validation samples with {} features and {} associated classification labels'.format(*X_val.shape, *y_val.shape))
print('There are {} test samples with {} features and {} associated classification labels'.format(*X_test.shape, *y_test.shape))

There are 1721 training samples with 29 features and 1721 associated classification labels
There are 431 validation samples with 29 features and 431 associated classification labels
There are 538 test samples with 29 features and 538 associated classification labels
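
Because positive cases are rare in this dataset (2580 of the 2800 original records were negative), a purely random split can leave the smaller sets with very few positives. One optional variation, not used in this notebook, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in every split. A sketch on synthetic labels (the arrays below are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 92 negatives and 8 positives
y = np.array([0] * 92 + [1] * 8)
X = np.arange(100).reshape(-1, 1)

# stratify=y preserves the 92:8 class ratio in both parts of the split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=8675309
)

print("positives in train:", y_train.sum(), "| positives in test:", y_test.sum())
```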


For reusability, let's make a logistic regression function that will take our training and testing data as arguments. Inside the function, build a model on your training data, fit it with your training class data, and return a list of your predictions.

In [None]:
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

def log_reg(train_X,train_Y,test_X,test_Y):
  ## Your code here
  pass


Fantastic, we have just built a logistic regression model! Let's go see how well it performs.
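
For reference, here is one possible shape for such a function, shown as a sketch on synthetic data rather than as the expected solution. The name `log_reg_sketch`, the `max_iter` setting, and the data from `make_classification` are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def log_reg_sketch(train_X, train_y, test_X):
    """Fit a logistic regression on the training data and return predictions as a list."""
    model = LogisticRegression(max_iter=1000)
    model.fit(train_X, train_y)
    return model.predict(test_X).tolist()

# Synthetic stand-in for the real features and labels
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

preds = log_reg_sketch(X_tr, y_tr, X_te)
print(preds[:10])
```

Keep in mind that `LogisticRegression` rejects inputs containing NaN, and the numeric columns in this notebook still contain missing values, so they need to be imputed or dropped before fitting.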

### Model evaluation

To start, let's establish the baseline performance. This is important because it provides a starting point of comparison for later evaluation methods, like accuracy.

A good baseline model to use is the Zero Rule algorithm. In classification problems, it simply predicts the class value with the greatest number of instances every time.
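
The idea fits in a few lines. The sketch below uses made-up label lists, and the helper name `zero_rule_sketch` is hypothetical; it is meant to show the mechanics, not the exact signature the exercise asks for.

```python
from collections import Counter

def zero_rule_sketch(train_labels, n_test):
    """Predict the most frequent training label for every test record."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n_test

# Hypothetical labels: 0 is the majority class in the training set
train_labels = [0, 0, 0, 1, 0, 1, 0]
test_labels = [0, 1, 0]

preds = zero_rule_sketch(train_labels, len(test_labels))
accuracy = sum(p == t for p, t in zip(preds, test_labels)) / len(test_labels)

print(preds, accuracy)
```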

In [None]:
def zero_rule_algorithm_classification(train,test):
  ## Your code here
  pass


Get your baseline performance by calculating the accuracy of your Zero Rule algorithm.

In [None]:
## Your code here

So maybe accuracy isn't the best performance measure for this dataset. As you've seen, a model that predicts "negative" for every record already achieves ~92% accuracy. However, that model misses 100% of the positive cases, which, in the context of this problem, could be fatal.

Thankfully, it isn't the only way to evaluate your model. Let's start by creating a confusion matrix using the logistic regression function you built earlier.
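
Sensitivity and specificity, the two measures the boss asked for, can be read straight off a confusion matrix. A minimal sketch with made-up labels and predictions (both lists are assumptions for illustration):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and model predictions (1 = hypothyroid, 0 = negative)
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # share of actual positives that were caught
specificity = tn / (tn + fp)  # share of actual negatives correctly cleared

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```

A Zero Rule model that always predicts "negative" would score a specificity of 1.0 and a sensitivity of 0.0, which is exactly the failure that accuracy hides.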

In [None]:
from sklearn.metrics import confusion_matrix
### Your code here

Hopefully you remember our discussion of the Area Under the Receiver Operating Characteristic curve (AUROC) metric. It measures how well a test discriminates diseased cases from normal cases.

When you consider the results of a particular test in two populations, one with a disease and the other without, you will rarely observe a perfect separation between the two groups. The overlapping areas in the diagram below correspond to the false negatives (FN) and false positives (FP).

To review, on a Receiver Operating Characteristic (ROC) curve, the true positive rate is plotted as a function of the false positive rate for different cut-off points. Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold. A test with perfect discrimination (no overlap in the two distributions) has a ROC curve that passes through the upper left corner. Therefore, the closer the ROC curve is to the upper left corner, the higher the overall accuracy of the test.

![alt text](https://www.medcalc.org/manual/_help/images/roc_intro1.png)
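
To make the threshold idea concrete, the sketch below computes the ROC points for four made-up scores (the labels and scores are assumptions for illustration). Each row printed is one cut-off point on the curve.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted scores for the positive class
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# One (false positive rate, true positive rate) pair per decision threshold
fpr, tpr, thresholds = roc_curve(y_true, scores)

for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}  (sensitivity={t:.2f}, specificity={1 - f:.2f})")

auroc = roc_auc_score(y_true, scores)
print("AUROC:", auroc)
```

Lowering the threshold moves along the curve toward the upper right: more positives are caught (higher sensitivity) at the cost of more false alarms (lower specificity).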





Now, to graph the ROC curve, we need the predicted probability of each class value rather than the predicted class value itself. Make a new logistic regression model that returns those probabilities.
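
In scikit-learn, class probabilities come from `predict_proba` rather than `predict`. The following sketch shows the pattern on synthetic data (the data from `make_classification` and the `max_iter` setting are assumptions for illustration, not the challenge dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and labels
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# predict_proba returns one column per class, ordered as in model.classes_
proba = model.predict_proba(X_val)
prob_pos = proba[:, 1]  # probability of the positive class (label 1)

auroc = roc_auc_score(y_val, prob_pos)
print(f"AUROC on the synthetic validation set: {auroc:.3f}")
```

The same `prob_pos` values can be passed to `sklearn.metrics.roc_curve` to obtain the points for plotting.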

In [None]:
from sklearn.linear_model import LogisticRegression

### Your code here

Now calculate the area under the ROC curve with your predictions.

In [None]:
from sklearn import metrics

### Your code here

Now graph the ROC curve using matplotlib, fully labeled.

In [None]:
import matplotlib.pyplot as plt
### Your code here

In conclusion, it looks like this model performed pretty badly. It's probably best to try out different columns or perhaps use a different model before we submit our model for scoring. Get creative!

In [None]:
## Your code here

## Submitting your Model

Once you believe you have found the best classifier, run it on the test data and make a pickle file containing your predictions in a pandas dataframe.

This dataframe will contain three columns for your binary classifier (or 5 columns for the multiclass classifier): the first column should be your model's "best guess" for each patient (either 0 or 1, negative or positive), and the last two columns should be the probabilities that the patient is classified as 0 and as 1, respectively.

(see below for reference)

In [None]:
#pickling example
import pickle
import pandas as pd

# Each row's prob_neg and prob_pos should sum to 1
predictions=pd.DataFrame({"guesses":[0,1,0,1],"prob_neg":[.75,.15,.63,.20],"prob_pos":[.25,.85,.37,.80]})
prediction_pickle_path = 'prediction_pickle.pkl'

from google.colab import files
# Open the pickle file in write mode and dump the predictions into it;
# the with-block closes the file before it is downloaded
with open(prediction_pickle_path, 'wb') as prediction_pickle:
  pickle.dump(predictions, prediction_pickle)
files.download(prediction_pickle_path)

In [None]:
# load the pickled object back into a variable; the with-block closes the file
with open(prediction_pickle_path, 'rb') as prediction_unpickle:
  predictions = pickle.load(prediction_unpickle)

print(predictions)

We will compare your guesses with the true classifications to score your model using the AUC metric.
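
The three-column dataframe described above could be assembled from a fitted model roughly as follows. This is a sketch on synthetic data (the data from `make_classification` and the model settings are assumptions for illustration); the column names follow the pickling example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: first 80 rows for training, last 20 as the "test" set
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X[:80], y[:80])

# Columns of predict_proba follow model.classes_, here [0, 1]
proba = model.predict_proba(X[80:])

predictions = pd.DataFrame({
    "guesses": model.predict(X[80:]),
    "prob_neg": proba[:, 0],
    "prob_pos": proba[:, 1],
})

print(predictions.head())
```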

## Presenting your Model

Finally, we would like you to be able to present your model to the class. Prepare a notebook with the following things:

* **Features Chosen:** a list of the features used in your model, and an explanation of how you chose them.
* **Type of Model:** an explanation of the model type, parameters used, and why.
* **Evaluation:** at least one plot showing an evaluation of your model against a validation set. You can use a confusion matrix, AUROC, or another metric of your choice.

Feel free to include one or two additional plots that describe your process and/or model if you think that would be helpful.

## Moving to the Next Level

For those who finish early, remember how we converted the class values into the binary of "negative" and "positive"? Now try tackling the multiclass classifier (predicting the different types of positive hypothyroid cases instead of simply negative or positive)!

The same rules apply!