# Week 1 Challenge Project
---
Hypothyroidism, also called underactive thyroid or low thyroid, is a disorder of the endocrine system in which the thyroid gland does not produce enough thyroid hormone. It can cause a number of symptoms, such as poor ability to tolerate cold, a feeling of tiredness, constipation, depression, and weight gain. Occasionally there may be swelling of the front part of the neck due to goitre. Untreated hypothyroidism during pregnancy can lead to delays in growth and intellectual development in the baby or cretinism.

Worldwide, too little iodine in the diet is the most common cause of hypothyroidism. In countries with enough iodine in the diet, the most common cause of hypothyroidism is the autoimmune condition Hashimoto's thyroiditis. Less common causes include: previous treatment with radioactive iodine, injury to the hypothalamus or the anterior pituitary gland, certain medications, a lack of a functioning thyroid at birth, or previous thyroid surgery. The diagnosis of hypothyroidism, when suspected, can be confirmed with blood tests measuring thyroid-stimulating hormone (TSH) and thyroxine levels.

Worldwide about one billion people are estimated to be iodine deficient; however, it is unknown how often this results in hypothyroidism. In the United States, hypothyroidism occurs in 0.3–0.4% of people.

And that is why we iodize salt.

![alt text](https://www.mayoclinic.org/-/media/kcms/gbs/patient-consumer/images/2013/11/15/17/39/ds00181_-ds00344_-ds00353_-ds00491_-ds00492_-ds00567_-ds00660_-my00709_im01872_thyroid_gif.jpg)



Background: Doctors all around the world need our help to predict whether a patient has hypothyroid disease. We have already overspent our budget to collect such complete data on about 30 attributes for 2800 patients--a good starting number, but a larger sample would certainly be preferred. Moving forward, however, we simply cannot afford to spend so much money on data collection. Therefore, we also need to determine which attributes are the most meaningful to the predictive models, and cut out the rest that don't contribute much. 

The boss wants to see a **balanced** model that predicts with **high sensitivity** and **high specificity** while using a ***small number of features***. Collecting data this complete is rare, time-consuming, and often very expensive. Minimizing the number of features tells us what future data collection needs to gather and what it can skip.
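To make the two target metrics concrete: sensitivity is the fraction of truly positive patients the model flags, and specificity is the fraction of truly negative patients it clears. Below is a minimal sketch on made-up labels; the helper name `sensitivity_specificity` is hypothetical and not part of the project.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # sick, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # sick, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy, cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy, flagged
    return tp / (tp + fn), tn / (tn + fp)

# Made-up example: 4 positive and 6 negative patients
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)
```

A model that predicts "negative" for everyone would score a specificity of 1.0 but a sensitivity of 0.0, which is why the two are reported together.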

## Loading the data

Let's read the data into a Pandas dataframe and look at the first 20 records.

In [5]:
import pandas as pd

url = "https://raw.githubusercontent.com/BeaverWorksMedlytics2020/Data_Public/master/ChallengeProjects/Week1/allhypo.train.data.csv"
dataset = pd.read_csv(url)
dataset.head(20)

Unnamed: 0,Age,Sex,On thyroxine,query on thyroxine,on antithyroid medication,sick,pregnant,thyroid surgery,I131 treatment,query hypothyroid,query hyperthyroid,lithium,goitre,tumor,psych,TSH measured,TSH,T3 measured,T3,TT4 measured,TT4,T4U measured,T4u,FTI measured,FTI,TBG measured,TBG,referral source,class
0,41,F,f,f,f,f,f,f,f,f,f,f,f,f,f,t,1.3,t,2.5,t,125,t,1.14,t,109,f,?,SVHC,negative.|3733
1,23,F,f,f,f,f,f,f,f,f,f,f,f,f,f,t,4.1,t,2,t,102,f,?,f,?,f,?,other,negative.|1442
2,46,M,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.98,f,?,t,109,t,0.91,t,120,f,?,other,negative.|2965
3,70,F,t,f,f,f,f,f,f,f,f,f,f,f,f,t,0.16,t,1.9,t,175,f,?,f,?,f,?,other,negative.|806
4,70,F,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.72,t,1.2,t,61,t,0.87,t,70,f,?,SVI,negative.|2807
5,18,F,t,f,f,f,f,f,f,f,f,f,f,f,f,t,0.03,f,?,t,183,t,1.3,t,141,f,?,other,negative.|3434
6,59,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,?,f,?,t,72,t,0.92,t,78,f,?,other,negative.|1595
7,80,F,f,f,f,f,f,f,f,f,f,f,f,f,f,t,2.2,t,0.6,t,80,t,0.7,t,115,f,?,SVI,negative.|1367
8,66,F,f,f,f,f,f,f,f,f,f,f,f,t,f,t,0.6,t,2.2,t,123,t,0.93,t,132,f,?,SVI,negative.|1787
9,68,M,f,f,f,f,f,f,f,f,f,f,f,f,f,t,2.4,t,1.6,t,83,t,0.89,t,93,f,?,SVI,negative.|2534


Great, looks like the data loaded in properly. Let's continue looking at some summary statistics on our data.

## Viewing summary statistics
The functions describe() and info() are your friends

In [6]:
dataset.describe()

Unnamed: 0,Age,Sex,On thyroxine,query on thyroxine,on antithyroid medication,sick,pregnant,thyroid surgery,I131 treatment,query hypothyroid,query hyperthyroid,lithium,goitre,tumor,psych,TSH measured,TSH,T3 measured,T3,TT4 measured,TT4,T4U measured,T4u,FTI measured,FTI,TBG measured,TBG,referral source,class
count,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800
unique,94,3,2,2,2,2,2,2,2,2,2,2,2,2,2,2,264,2,65,2,218,2,139,2,210,1,1,5,2800
top,59,F,f,f,f,f,f,f,f,f,f,f,f,f,f,t,?,t,?,t,?,t,?,t,?,f,?,other,negative.|2893
freq,75,1830,2470,2760,2766,2690,2759,2761,2752,2637,2627,2786,2775,2729,2665,2516,284,2215,585,2616,184,2503,297,2505,295,2800,2800,1632,1


In [7]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2800 entries, 0 to 2799
Data columns (total 29 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Age                        2800 non-null   object
 1   Sex                        2800 non-null   object
 2   On thyroxine               2800 non-null   object
 3   query on thyroxine         2800 non-null   object
 4   on antithyroid medication  2800 non-null   object
 5   sick                       2800 non-null   object
 6   pregnant                   2800 non-null   object
 7   thyroid surgery            2800 non-null   object
 8   I131 treatment             2800 non-null   object
 9   query hypothyroid          2800 non-null   object
 10  query hyperthyroid         2800 non-null   object
 11  lithium                    2800 non-null   object
 12  goitre                     2800 non-null   object
 13  tumor                      2800 non-null   object
 14  psych   

Note the data types are all objects--even columns that are obviously numeric like Age.


## Data cleaning

To start, let's make all the numerical columns contain the correct type of values and change the data type of those columns to numeric. Let's also replace all those question marks with the median of the respective column.

Hint: To make it easier, first try converting all the "?" to NaN.
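The hint can be tried out on a toy frame first. This is only a sketch: the values below are made up, and `toy` is an illustrative name rather than the project data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data: "?" marks a missing value
toy = pd.DataFrame({"TSH": ["1.3", "?", "0.98", "4.1"],
                    "Sex": ["F", "M", "?", "F"]})

toy = toy.replace("?", np.nan)                        # "?" -> NaN everywhere
toy["TSH"] = pd.to_numeric(toy["TSH"])                # object -> float64
toy["TSH"] = toy["TSH"].fillna(toy["TSH"].median())   # fill with the column median

print(toy)
```

Only the numeric column is filled here; the missing Sex value stays NaN, because a median is not defined for text.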

In [36]:
import numpy as np
import math

# Convert "?" to NaN
dataset[dataset == "?"] = np.nan

# Identify columns by what type of data they hold
allCols = list(dataset.columns.values)

numeric_columns = []
categorical_columns = []

for col in allCols:
  first = dataset[col][0]
  # A first value that is missing (NaN) or made of digits marks the column as numeric
  if pd.isna(first) or str(first).replace(".", "").isdigit():
    numeric_columns.append(col)
  else:
    categorical_columns.append(col)

# Make those columns numeric instead of object
dataset[numeric_columns] = dataset[numeric_columns].apply(pd.to_numeric)

# Replace NaN with the median of each column (an all-NaN column stays NaN)
dataset[numeric_columns] = dataset[numeric_columns].fillna(dataset[numeric_columns].median())

# Print statement for sanity check
print('Numerical Columns: ',numeric_columns)
print('Categorical Columns: ',categorical_columns)

dataset.head(10)

#dataset.info()


Numerical Columns:  ['Age', 'TSH', 'T3', 'TT4', 'T4u', 'FTI', 'TBG']
Categorical Columns:  ['Sex', 'On thyroxine', 'query on thyroxine', 'on antithyroid medication', 'sick', 'pregnant', 'thyroid surgery', 'I131 treatment', 'query hypothyroid', 'query hyperthyroid', 'lithium', 'goitre', 'tumor', 'psych', 'TSH measured', 'T3 measured', 'TT4 measured', 'T4U measured', 'FTI measured', 'TBG measured', 'referral source', 'class']


Unnamed: 0,Age,Sex,On thyroxine,query on thyroxine,on antithyroid medication,sick,pregnant,thyroid surgery,I131 treatment,query hypothyroid,query hyperthyroid,lithium,goitre,tumor,psych,TSH measured,TSH,T3 measured,T3,TT4 measured,TT4,T4U measured,T4u,FTI measured,FTI,TBG measured,TBG,referral source,class
0,41.0,F,f,f,f,f,f,f,f,f,f,f,f,f,f,t,1.3,t,2.5,t,125.0,t,1.14,t,109.0,f,,SVHC,negative.|3733
1,23.0,F,f,f,f,f,f,f,f,f,f,f,f,f,f,t,4.1,t,2.0,t,102.0,f,0.98,f,107.0,f,,other,negative.|1442
2,46.0,M,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.98,f,2.0,t,109.0,t,0.91,t,120.0,f,,other,negative.|2965
3,70.0,F,t,f,f,f,f,f,f,f,f,f,f,f,f,t,0.16,t,1.9,t,175.0,f,0.98,f,107.0,f,,other,negative.|806
4,70.0,F,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.72,t,1.2,t,61.0,t,0.87,t,70.0,f,,SVI,negative.|2807
5,18.0,F,t,f,f,f,f,f,f,f,f,f,f,f,f,t,0.03,f,2.0,t,183.0,t,1.3,t,141.0,f,,other,negative.|3434
6,59.0,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,1.4,f,2.0,t,72.0,t,0.92,t,78.0,f,,other,negative.|1595
7,80.0,F,f,f,f,f,f,f,f,f,f,f,f,f,f,t,2.2,t,0.6,t,80.0,t,0.7,t,115.0,f,,SVI,negative.|1367
8,66.0,F,f,f,f,f,f,f,f,f,f,f,f,t,f,t,0.6,t,2.2,t,123.0,t,0.93,t,132.0,f,,SVI,negative.|1787
9,68.0,M,f,f,f,f,f,f,f,f,f,f,f,f,f,t,2.4,t,1.6,t,83.0,t,0.89,t,93.0,f,,SVI,negative.|2534


Hmm, it still looks like the TBG columns are unfilled, implying they were empty to begin with. Let's get rid of these columns, then (and make sure to remove them from your lists of categorical/numeric columns, too!)

In [37]:
## YOUR CODE HERE
tbgCols = [s for s in allCols if "TBG" in s]
print(tbgCols)

for col in tbgCols:
  del dataset[col]
  allCols.remove(col)
  if col in numeric_columns:
    numeric_columns.remove(col)
  elif col in categorical_columns:
    categorical_columns.remove(col)


dataset.head(10)

['TBG measured', 'TBG']


Unnamed: 0,Age,Sex,On thyroxine,query on thyroxine,on antithyroid medication,sick,pregnant,thyroid surgery,I131 treatment,query hypothyroid,query hyperthyroid,lithium,goitre,tumor,psych,TSH measured,TSH,T3 measured,T3,TT4 measured,TT4,T4U measured,T4u,FTI measured,FTI,referral source,class
0,41.0,F,f,f,f,f,f,f,f,f,f,f,f,f,f,t,1.3,t,2.5,t,125.0,t,1.14,t,109.0,SVHC,negative.|3733
1,23.0,F,f,f,f,f,f,f,f,f,f,f,f,f,f,t,4.1,t,2.0,t,102.0,f,0.98,f,107.0,other,negative.|1442
2,46.0,M,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.98,f,2.0,t,109.0,t,0.91,t,120.0,other,negative.|2965
3,70.0,F,t,f,f,f,f,f,f,f,f,f,f,f,f,t,0.16,t,1.9,t,175.0,f,0.98,f,107.0,other,negative.|806
4,70.0,F,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.72,t,1.2,t,61.0,t,0.87,t,70.0,SVI,negative.|2807
5,18.0,F,t,f,f,f,f,f,f,f,f,f,f,f,f,t,0.03,f,2.0,t,183.0,t,1.3,t,141.0,other,negative.|3434
6,59.0,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,1.4,f,2.0,t,72.0,t,0.92,t,78.0,other,negative.|1595
7,80.0,F,f,f,f,f,f,f,f,f,f,f,f,f,f,t,2.2,t,0.6,t,80.0,t,0.7,t,115.0,SVI,negative.|1367
8,66.0,F,f,f,f,f,f,f,f,f,f,f,f,t,f,t,0.6,t,2.2,t,123.0,t,0.93,t,132.0,SVI,negative.|1787
9,68.0,M,f,f,f,f,f,f,f,f,f,f,f,f,f,t,2.4,t,1.6,t,83.0,t,0.89,t,93.0,SVI,negative.|2534


All right, let's take a look now at the info of *just the numeric columns* in the dataset:

In [38]:
dataset[numeric_columns].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2800 entries, 0 to 2799
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Age     2800 non-null   float64
 1   TSH     2800 non-null   float64
 2   T3      2800 non-null   float64
 3   TT4     2800 non-null   float64
 4   T4u     2800 non-null   float64
 5   FTI     2800 non-null   float64
dtypes: float64(6)
memory usage: 131.4 KB


Perfect! Now let's see what's going on with the "class" column. According to the note the data collectors included with this data, the ".|####" suffix is a patient number and is not relevant for our purposes here. Let's turn the "class" column into a useful multi-class label.
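One way to separate the label from the patient number is a literal split on the ".|" separator. Here is a sketch on made-up values of the same shape; `raw`, `labels`, and `patient_ids` are illustrative names only.

```python
import pandas as pd

# Made-up values in the same "<label>.|<patient number>" shape as the class column
raw = pd.Series(["negative.|3733",
                 "compensated hypothyroid.|1442",
                 "primary hypothyroid.|806"])

# str.partition splits on the first literal occurrence of the separator
parts = raw.str.partition(".|")
labels = parts[0]        # text before the separator
patient_ids = parts[2]   # text after the separator

print(labels.tolist())
```

`str.partition` is used here instead of `str.split` because it treats the separator as plain text, so the "." and "|" characters are not interpreted as a regular expression.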

In [39]:
## YOUR CODE HERE
classCol = dataset["class"]
resultCol = []
idCol = []

# Each value looks like "<label>.|<patient number>"; split it into its two parts
for value in classCol:
  label, patient_id = value.split(".|")
  resultCol.append(label)
  idCol.append(patient_id)

print(resultCol)
print(idCol)

['negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'compensated hypothyroid', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'primary hypothyroid', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'compensated hypothyroid', 'compensated hypothyroid', 'compensated hypothyroid', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'neg

Let's run the describe() function on just the "class" column.

In [41]:
dataset['class'].describe()

count               2800
unique              2800
top       negative.|2893
freq                   1
Name: class, dtype: object

Because the patient number is still attached, every value looks unique here. With the ".|####" suffix stripped, though, there are actually 4 unique class labels! Thank goodness we didn't assume it was binary.

Display all the unique values in the class column.

In [42]:
## YOUR CODE HERE
print(list(set(resultCol)))

['compensated hypothyroid', 'primary hypothyroid', 'negative', 'secondary hypothyroid']


But let's make it binary for the sake of this example anyway. If you finish early later on Thursday/Friday, try the multiclass classifier with all 4 values!
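The collapse to two classes can also be written as a vectorized comparison. Here is a small sketch on made-up labels; the variable names are illustrative, not part of the assignment.

```python
import pandas as pd

# Made-up labels using the same class names as the dataset
labels = pd.Series(["negative", "primary hypothyroid", "negative",
                    "compensated hypothyroid", "negative"])

# Anything other than "negative" counts as a positive case (1)
binary = labels.ne("negative").astype(int)

# The mean of a 0/1 column is the fraction of positive cases
prevalence = binary.mean()
print(binary.tolist(), prevalence)
```

The mean of the binary column is a quick check on how imbalanced the two classes are.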

In [43]:
## YOUR CODE HERE
# Any hypothyroid label counts as positive (1); "negative" is 0
binaryResultCol = [1 if s != "negative" else 0 for s in resultCol]
dataset["class"] = binaryResultCol
print(dataset)

       Age Sex On thyroxine  ...    FTI referral source class
0     41.0   F            f  ...  109.0            SVHC     0
1     23.0   F            f  ...  107.0           other     0
2     46.0   M            f  ...  120.0           other     0
3     70.0   F            t  ...  107.0           other     0
4     70.0   F            f  ...   70.0             SVI     0
...    ...  ..          ...  ...    ...             ...   ...
2795  70.0   M            f  ...  148.0             SVI     0
2796  73.0   M            f  ...   72.0           other     0
2797  75.0   M            f  ...  183.0           other     0
2798  60.0   F            f  ...  121.0           other     0
2799  81.0   F            f  ...  115.0             SVI     0

[2800 rows x 27 columns]


Before we move on, let's not forget to run the describe() function on just your categorical columns, too.
Compare it to the describe() output for your numeric columns.

In [44]:
## YOUR CODE HERE
dataset[categorical_columns].describe()

Unnamed: 0,class
count,2800.0
mean,0.078571
std,0.269117
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0
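Note that the output above only summarizes class: when a frame mixes numeric and object columns, describe() reports only the numeric ones by default. Here is a small sketch on a toy frame (not the project data) showing one way to ask for the non-numeric summary instead.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Sex": ["F", "M", "F", "F"],
                    "sick": ["f", "f", "t", "f"],
                    "class": [0, 0, 1, 0]})

default_summary = toy.describe()                         # numeric columns only
categorical_summary = toy.describe(exclude=[np.number])  # count / unique / top / freq

print(categorical_summary)
```

For text columns the summary rows are count, unique, top, and freq rather than mean and standard deviation.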


Great! Let's see if there are any other records we have to address. count() is a nice way to check if we have any other missing values.

In [45]:
dataset.count()

Age                          2800
Sex                          2690
On thyroxine                 2800
query on thyroxine           2800
on antithyroid medication    2800
sick                         2800
pregnant                     2800
thyroid surgery              2800
I131 treatment               2800
query hypothyroid            2800
query hyperthyroid           2800
lithium                      2800
goitre                       2800
tumor                        2800
psych                        2800
TSH measured                 2800
TSH                          2800
T3 measured                  2800
T3                           2800
TT4 measured                 2800
TT4                          2800
T4U measured                 2800
T4u                          2800
FTI measured                 2800
FTI                          2800
referral source              2800
class                        2800
dtype: int64

The Sex column has only 2690 non-null values, so 110 rows are missing data. There are techniques you can use to try to handle this situation (and some models in sklearn can handle NaN values without problem). But let's just remove those rows for now. When working in groups, you're more than welcome to choose your own method of dealing with the missing data.
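One alternative to dropping rows is to fill a missing category with the most common value. The sketch below uses a toy frame and assumes that kind of fill is acceptable for your analysis, which is a modelling choice rather than something the data dictates.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Sex": ["F", np.nan, "M", "F"],
                    "Age": [41, 23, 46, 70]})

dropped = toy.dropna()                       # loses the row with the missing Sex
most_common = toy["Sex"].mode()[0]           # most frequent non-missing value
filled = toy.assign(Sex=toy["Sex"].fillna(most_common))

print(len(dropped), len(filled))
```

Filling keeps every row but adds values that were never observed, so it is worth comparing model results under both approaches.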

In [46]:
## YOUR CODE HERE
dfClean = dataset.dropna()
print(dfClean)

       Age Sex On thyroxine  ...    FTI referral source class
0     41.0   F            f  ...  109.0            SVHC     0
1     23.0   F            f  ...  107.0           other     0
2     46.0   M            f  ...  120.0           other     0
3     70.0   F            t  ...  107.0           other     0
4     70.0   F            f  ...   70.0             SVI     0
...    ...  ..          ...  ...    ...             ...   ...
2795  70.0   M            f  ...  148.0             SVI     0
2796  73.0   M            f  ...   72.0           other     0
2797  75.0   M            f  ...  183.0           other     0
2798  60.0   F            f  ...  121.0           other     0
2799  81.0   F            f  ...  115.0             SVI     0

[2690 rows x 27 columns]


Ooof! We just cut out about 4% of our data set (110 of 2800 rows)! You probably won't want to throw out this data for your project, but let's keep going now that we have a clean dataset and do some further data analysis and visualization to better understand what we're working with.

## Data analysis and visualization

As the name suggests, [DataFrame.corr()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.corr.html) will compute pairwise correlation of (numerical) columns, excluding NA/null values. Notice that in this case, since we've converted 'class' to a number (0 or 1), we can see how correlated different features are with the class label!
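To read the correlation matrix as a feature ranking, one option is to pull out the class column and sort by absolute value. Here is a sketch on a made-up toy frame; the numbers are not from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({"TSH": [1.0, 2.0, 9.0, 12.0],
                    "Age": [30, 60, 40, 50],
                    "Sex": ["F", "M", "F", "F"],
                    "class": [0, 0, 1, 1]})

# Keep numeric columns only, then read off each feature's correlation with the label
corr_with_class = toy.select_dtypes("number").corr()["class"].drop("class")

# Rank by strength, ignoring the sign of the relationship
ranked = corr_with_class.abs().sort_values(ascending=False)
print(ranked)
```

Correlation only measures linear association, so a low value here does not prove a feature is useless to a non-linear model.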

Check the correlation

In [47]:
## YOUR CODE HERE
dfClean.select_dtypes("number").corr()  # recent pandas raises if text columns are passed to corr()

Unnamed: 0,Age,TSH,T3,TT4,T4u,FTI,class
Age,1.0,-0.033569,-0.224286,-0.050343,-0.155124,0.040092,-0.008215
TSH,-0.033569,1.0,-0.155681,-0.259671,0.066218,-0.295188,0.441618
T3,-0.224286,-0.155681,1.0,0.514505,0.424036,0.31207,-0.167724
TT4,-0.050343,-0.259671,0.514505,1.0,0.431904,0.781832,-0.268986
T4u,-0.155124,0.066218,0.424036,0.431904,1.0,-0.167147,0.03188
FTI,0.040092,-0.295188,0.31207,0.781832,-0.167147,1.0,-0.292745
class,-0.008215,0.441618,-0.167724,-0.268986,0.03188,-0.292745,1.0


The class feature is already 0/1 after the binarization above, but make sure it is stored as a numeric type so we can see the correlations it has with the numeric features, and check the correlation again.

In [48]:
## YOUR CODE HERE
dfClean = dfClean.copy()  # work on an explicit copy to avoid SettingWithCopyWarning
dfClean["class"] = pd.to_numeric(dfClean["class"])
dfClean.select_dtypes("number").corr()


Unnamed: 0,Age,TSH,T3,TT4,T4u,FTI,class
Age,1.0,-0.033569,-0.224286,-0.050343,-0.155124,0.040092,-0.008215
TSH,-0.033569,1.0,-0.155681,-0.259671,0.066218,-0.295188,0.441618
T3,-0.224286,-0.155681,1.0,0.514505,0.424036,0.31207,-0.167724
TT4,-0.050343,-0.259671,0.514505,1.0,0.431904,0.781832,-0.268986
T4u,-0.155124,0.066218,0.424036,0.431904,1.0,-0.167147,0.03188
FTI,0.040092,-0.295188,0.31207,0.781832,-0.167147,1.0,-0.292745
class,-0.008215,0.441618,-0.167724,-0.268986,0.03188,-0.292745,1.0


Let's do some further visual analysis using a new module called seaborn. Explore its incredible versatility and diversity with data visualization here: https://seaborn.pydata.org/

In [49]:
import seaborn as sns




OK! I think we're ready to create and select some supervised learning models. To get the ball rolling, select Age and Sex as our explanatory features (and class as the target feature, obviously).

In [51]:
#dfClean.describe()
explanatoryFeatures = ["Age", "Sex"]
targetFeature = "class"
dfClean.head(10)  # last expression in the cell, so the table is displayed

Unnamed: 0,Age,Sex,On thyroxine,query on thyroxine,on antithyroid medication,sick,pregnant,thyroid surgery,I131 treatment,query hypothyroid,query hyperthyroid,lithium,goitre,tumor,psych,TSH measured,TSH,T3 measured,T3,TT4 measured,TT4,T4U measured,T4u,FTI measured,FTI,referral source,class
0,41.0,F,f,f,f,f,f,f,f,f,f,f,f,f,f,t,1.3,t,2.5,t,125.0,t,1.14,t,109.0,SVHC,0
1,23.0,F,f,f,f,f,f,f,f,f,f,f,f,f,f,t,4.1,t,2.0,t,102.0,f,0.98,f,107.0,other,0
2,46.0,M,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.98,f,2.0,t,109.0,t,0.91,t,120.0,other,0
3,70.0,F,t,f,f,f,f,f,f,f,f,f,f,f,f,t,0.16,t,1.9,t,175.0,f,0.98,f,107.0,other,0
4,70.0,F,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.72,t,1.2,t,61.0,t,0.87,t,70.0,SVI,0
5,18.0,F,t,f,f,f,f,f,f,f,f,f,f,f,f,t,0.03,f,2.0,t,183.0,t,1.3,t,141.0,other,0
6,59.0,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,1.4,f,2.0,t,72.0,t,0.92,t,78.0,other,0
7,80.0,F,f,f,f,f,f,f,f,f,f,f,f,f,f,t,2.2,t,0.6,t,80.0,t,0.7,t,115.0,SVI,0
8,66.0,F,f,f,f,f,f,f,f,f,f,f,f,t,f,t,0.6,t,2.2,t,123.0,t,0.93,t,132.0,SVI,0
9,68.0,M,f,f,f,f,f,f,f,f,f,f,f,f,f,t,2.4,t,1.6,t,83.0,t,0.89,t,93.0,SVI,0


Now let's take a look at our categorical columns!

In [61]:
## YOUR CODE HERE
for e in categorical_columns:
  # describe() inside a loop displays nothing unless it is printed
  print(dfClean[e].describe())

Uh oh... we have several features that are non-informative (they only have a single value). We probably didn't notice this before because there were still '?' values in there, or perhaps dropping the rows with missing data removed some variation in these features. Let's just drop those columns.
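A single-valued column can be detected by counting distinct values. Here is a sketch on a toy frame; the column names are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"Age": [41, 23, 46],
                    "always_f": ["f", "f", "f"],   # constant, carries no information
                    "Sex": ["F", "M", "F"]})

# A column with at most one distinct value cannot help separate the classes
constant_cols = [c for c in toy.columns if toy[c].nunique() <= 1]
informative = toy.drop(columns=constant_cols)

print(constant_cols)
```

Listing the constant columns before dropping them makes it easy to confirm that nothing unexpected is removed.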

In [62]:
## YOUR CODE HERE
df = dfClean.loc[:, ~(dfClean == dfClean.iloc[0]).all()]
df.head(10)

Unnamed: 0,Age,Sex,On thyroxine,query on thyroxine,on antithyroid medication,sick,pregnant,thyroid surgery,I131 treatment,query hypothyroid,query hyperthyroid,lithium,goitre,tumor,psych,TSH measured,TSH,T3 measured,T3,TT4 measured,TT4,T4U measured,T4u,FTI measured,FTI,referral source,class
0,41.0,F,f,f,f,f,f,f,f,f,f,f,f,f,f,t,1.3,t,2.5,t,125.0,t,1.14,t,109.0,SVHC,0
1,23.0,F,f,f,f,f,f,f,f,f,f,f,f,f,f,t,4.1,t,2.0,t,102.0,f,0.98,f,107.0,other,0
2,46.0,M,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.98,f,2.0,t,109.0,t,0.91,t,120.0,other,0
3,70.0,F,t,f,f,f,f,f,f,f,f,f,f,f,f,t,0.16,t,1.9,t,175.0,f,0.98,f,107.0,other,0
4,70.0,F,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.72,t,1.2,t,61.0,t,0.87,t,70.0,SVI,0
5,18.0,F,t,f,f,f,f,f,f,f,f,f,f,f,f,t,0.03,f,2.0,t,183.0,t,1.3,t,141.0,other,0
6,59.0,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,1.4,f,2.0,t,72.0,t,0.92,t,78.0,other,0
7,80.0,F,f,f,f,f,f,f,f,f,f,f,f,f,f,t,2.2,t,0.6,t,80.0,t,0.7,t,115.0,SVI,0
8,66.0,F,f,f,f,f,f,f,f,f,f,f,f,t,f,t,0.6,t,2.2,t,123.0,t,0.93,t,132.0,SVI,0
9,68.0,M,f,f,f,f,f,f,f,f,f,f,f,f,f,t,2.4,t,1.6,t,83.0,t,0.89,t,93.0,SVI,0


We can convert categorical columns (i.e., True/False or Male/Female) into indicator values (0,1) using a pretty nifty function: [pandas.get_dummies()](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.get_dummies.html).
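Here is a small sketch of what get_dummies() produces, using a toy frame with made-up values. The `drop_first` and `dtype` arguments are choices made for this illustration, not requirements of the assignment.

```python
import pandas as pd

toy = pd.DataFrame({"Sex": ["F", "M", "F"],
                    "sick": ["f", "t", "f"],
                    "Age": [41, 23, 46]})

# drop_first=True keeps one indicator per two-valued column instead of two;
# dtype=int gives 0/1 rather than True/False
encoded = pd.get_dummies(toy, columns=["Sex", "sick"], drop_first=True, dtype=int)
print(encoded)
```

Each encoded column is named after the original column and the value it flags, so `Sex_M` is 1 for male patients and `sick_t` is 1 where sick was "t".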

In [None]:
## YOUR CODE HERE
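# One-hot encode the remaining categorical features; "class" is already 0/1
dummy_cols = [c for c in categorical_columns if c in df.columns and c != "class"]
df = pd.get_dummies(df, columns=dummy_cols, drop_first=True)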