# A medical project: horse colic

This project explores data about horse colic. Its goal is to predict whether a horse needs surgery, given its symptoms.

## The Data

The data comes from [here](http://archive.ics.uci.edu/ml/machine-learning-databases/horse-colic). It originally contained 28 variables and 300 observations in the training set (68 in the test set). I am using [this version](../data/train.csv) with 21 variables: 20 features and the response, `surgical_lesion`. I dropped some identifiers and selected one of the 8 possible responses.

In [1]:
import pandas as pd
pd.read_csv("../data/train.csv", index_col=0).head(10)

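|  | abdom_protein | abdomen | abdominal_dist | abdomino_appearance | age | capillary_time | cell_vol | extreme_temp | feces | mucous | ... | nasogastric_reflux | nasogastric_tube | pain | peripheral_pulse | peristalsis | protein | pulse | rectal_temp | respiration | surgical_lesion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 2.0 | 2.0 | 2.0 | 1 | 1.0 | 50.0 | NaN | 4.0 | 4.0 | ... | NaN | NaN | 3.0 | NaN | 4.0 | 85.0 | 88.0 | 39.2 | 20.0 | 2 |
| 1 | NaN | 1.0 | 1.0 | NaN | 1 | 1.0 | 33.0 | 1.0 | 1.0 | 3.0 | ... | NaN | NaN | 3.0 | 1.0 | 3.0 | 6.7 | 40.0 | 38.3 | 24.0 | 2 |
| 2 | 5.3 | NaN | 4.0 | 3.0 | 9 | 2.0 | 48.0 | 4.0 | 3.0 | 6.0 | ... | 2.0 | 1.0 | 2.0 | 1.0 | 4.0 | 7.2 | 164.0 | 39.1 | 84.0 | 1 |
| 3 | NaN | NaN | NaN | NaN | 1 | 2.0 | 74.0 | NaN | NaN | 6.0 | ... | NaN | NaN | NaN | NaN | NaN | 7.4 | 104.0 | 37.3 | 35.0 | 2 |
| 4 | NaN | 3.0 | 2.0 | NaN | 1 | 1.0 | NaN | 2.0 | 3.0 | 3.0 | ... | 1.0 | 2.0 | 2.0 | 1.0 | 3.0 | NaN | NaN | NaN | NaN | 2 |
| 5 | NaN | 5.0 | 3.0 | NaN | 1 | 1.0 | 37.0 | 1.0 | 3.0 | 1.0 | ... | 1.0 | 1.0 | 3.0 | 1.0 | 3.0 | 7.0 | 48.0 | 37.9 | 16.0 | 1 |
| 6 | NaN | 4.0 | 2.0 | NaN | 1 | 1.0 | 44.0 | 3.0 | 3.0 | NaN | ... | 1.0 | 2.0 | NaN | NaN | 4.0 | 8.3 | 60.0 | NaN | NaN | 1 |
| 7 | NaN | 5.0 | 4.0 | NaN | 1 | 1.0 | 38.0 | 3.0 | 3.0 | 3.0 | ... | 1.0 | 2.0 | 4.0 | 4.0 | 4.0 | 6.2 | 80.0 | NaN | 36.0 | 1 |
| 8 | 2.2 | NaN | 1.0 | 1.0 | 9 | 1.0 | 40.0 | 1.0 | 3.0 | 1.0 | ... | 1.0 | 2.0 | 5.0 | NaN | 3.0 | 6.2 | 90.0 | 38.3 | NaN | 2 |
| 9 | 3.6 | 5.0 | 1.0 | 2.0 | 1 | 1.0 | 44.0 | 3.0 | 2.0 | 5.0 | ... | 1.0 | 2.0 | 3.0 | 3.0 | 3.0 | 6.0 | 66.0 | 38.1 | 12.0 | 1 |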


There seem to be a lot of missing values, so I decided to drop the variables with more than 20% missing values and to impute the remaining missing data with a $k$-nearest neighbours algorithm.

In [2]:
pd.read_csv("../results/na_count.csv", index_col=0, header=[0,1]).sort_values(axis=0, by=('train','nb_na'))

|  | train nb_na | train pct_na | test nb_na | test pct_na |
|---|---|---|---|---|
| surgical_lesion | 0 | 0.0 | 0 | 0.0 |
| age | 0 | 0.0 | 0 | 0.0 |
| pulse | 24 | 8.0 | 2 | 2.941176 |
| cell_vol | 29 | 9.666667 | 8 | 11.764706 |
| capillary_time | 32 | 10.666667 | 6 | 8.823529 |
| protein | 33 | 11.0 | 10 | 14.705882 |
| peristalsis | 44 | 14.666667 | 8 | 11.764706 |
| mucous | 47 | 15.666667 | 1 | 1.470588 |
| pain | 55 | 18.333333 | 8 | 11.764706 |
| abdominal_dist | 56 | 18.666667 | 9 | 13.235294 |
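
The 20% rule described above can be sketched in a few lines of pandas. This is only a minimal illustration on a made-up frame: `drop_sparse_columns` and the `toy` data are hypothetical names and values, not taken from the project's code.

```python
import pandas as pd


def drop_sparse_columns(df, max_na_fraction=0.20):
    """Return df without the columns whose share of missing values exceeds the threshold."""
    na_fraction = df.isna().mean()  # per-column fraction of missing values
    keep = na_fraction[na_fraction <= max_na_fraction].index
    return df[keep]


# Hypothetical toy data: six rows, three columns.
toy = pd.DataFrame({
    "pulse": [88.0, 40.0, None, 104.0, 60.0, 48.0],   # 1/6 missing -> kept
    "abdomen": [2.0, None, None, None, 3.0, None],    # 4/6 missing -> dropped
    "age": [1, 1, 9, 1, 1, 1],                        # nothing missing -> kept
})

filtered = drop_sparse_columns(toy)
print(list(filtered.columns))
```

In practice the threshold would presumably be computed on the training set only and the same columns dropped from the test set, so that both sets keep identical features.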


I now have 13 variables: 12 features and the response, `surgical_lesion`. Among the features, 5 are continuous and 7 are categorical. Two of the categorical features are not ordered. Here is a plot displaying two continuous variables and the response:

[link to image](../results/EDA_plot.html)

I used the Python [`fancyimpute`](https://pypi.python.org/pypi/fancyimpute) package to impute the remaining missing values.
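
To give an idea of what $k$-nearest neighbours imputation does, here is a small sketch. Note that it swaps in scikit-learn's `KNNImputer` rather than `fancyimpute`, and that the array `X` and the choice `n_neighbors=2` are assumptions made for the illustration, not the settings used in this project.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy matrix with two clear groups of rows and two missing entries.
X = np.array([
    [1.0, 2.0, 3.0],
    [1.1, np.nan, 3.1],
    [0.9, 2.2, 2.9],
    [8.0, 9.0, np.nan],
    [8.2, 9.1, 7.9],
    [7.9, 8.8, 8.1],
])

# Each missing entry is replaced by the mean of that column over the
# 2 nearest rows (distances are computed on the coordinates both rows share).
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

This sketch treats every column as numeric. Averaging neighbours can produce non-integer values for categorical codes, so those columns may need rounding or a different strategy.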

## The Model

Since the response is binary, I decided to fit a [Logistic Regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) model from scikit-learn with $L_1$ regularisation.
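
As a reminder of what is being optimised, the $L_1$-penalised logistic regression problem can be written as follows, assuming the labels are coded as $y_i \in \{-1, 1\}$, with weights $w$, intercept $c$ and feature vectors $x_i$ (the notation is mine, not the project's):

$$
\min_{w,\,c} \; \lVert w \rVert_1 + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \left(x_i^\top w + c\right)\right)\right)
$$

In this form `C` multiplies the data-fit term, so a smaller `C` means a stronger penalty and more coefficients driven exactly to zero.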

To select the most relevant features, I also used [Recursive Feature Elimination](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html) (RFE).

I tuned this nested model on two hyperparameters:   
- the number of features to select by RFE,   
- the regularisation parameter `C` used in the logistic regression.  

For each pair (`C`, number of features), I computed the score (the mean accuracy) with 6-fold cross-validation. I got the following color map:

![](../results/scores.png)
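
A grid search of this kind could look like the sketch below. It is not the project's actual tuning script: the data is synthetic (`make_classification`), and the grids for `C` and the number of features are assumptions chosen to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the imputed training set: 300 rows, 12 features.
X, y = make_classification(n_samples=300, n_features=12, n_informative=5,
                           random_state=0)

# RFE wraps the L1-penalised logistic regression; liblinear supports the L1 penalty.
rfe = RFE(estimator=LogisticRegression(penalty="l1", solver="liblinear"))

# Hypothetical grids for the two hyperparameters.
param_grid = {
    "n_features_to_select": [2, 4, 6, 8],
    "estimator__C": [0.01, 0.1, 1.0],
}

# 6-fold cross-validation, scored by mean accuracy.
search = GridSearchCV(rfe, param_grid, cv=6, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.cv_results_["mean_test_score"]` holds one cross-validated score per pair of hyperparameters, which is the quantity a colour map like the one above displays.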

In [3]:
params = pd.read_csv("../results/params.csv", index_col=0)
# reindex_axis was removed from pandas; reindex(columns=...) reorders the columns the same way
params.reindex(columns=params.columns[[0, 3, 4, 2, 1]])

|  | C | nb_feat | test_score_max | best_mod_train | best_mod_test |
|---|---|---|---|---|---|
| 0 | 0.1 | 6 | 0.746692 | 0.746667 | 0.720588 |


The best parameters are `C = 0.1` and 6 features. The cross-validation score for the best model is 0.747.  
The best model has a training accuracy of 0.747 and a test accuracy of 0.721.

##### But which features are the most relevant?

In [5]:
pd.options.display.float_format = '{:,.3f}'.format
pd.read_csv("../results/ranks.csv", index_col=0)

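|  | pain | abdominal_dist | peristalsis | rectal_temp | respiration | cell_vol | pulse | protein | mucous | extreme_temp | capillary_time | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rank | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 | 8.0 | 9.0 | 10.0 | 11.0 | 12.0 |
| coef | 0.443 | 0.134 | 0.308 | -0.085 | 0.028 | 0.012 | NaN | NaN | NaN | NaN | NaN | NaN |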


According to RFE, the horse's pain is the most relevant feature. However, pain is a subjective judgement, and prior pain treatment may mask the true pain level.  
The other selected features are the abdominal distension, the peristalsis (an indication of the activity in the horse's gut), the rectal temperature, the respiratory rate, and the packed cell volume (the number of red cells by volume in the blood).
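
A table of ranks and coefficients like the one above could be assembled as in the following sketch. It is an assumption about the procedure, not the project's code: in scikit-learn, the `ranking_` attribute of `RFE` gives rank 1 to every selected feature, so a full ordering from 1 to 12 requires running the elimination down to a single feature. The data and the feature names are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the 12 imputed features (names are hypothetical).
X, y = make_classification(n_samples=300, n_features=12, n_informative=5,
                           random_state=0)
names = [f"feature_{i}" for i in range(X.shape[1])]

model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)

# Eliminate features one at a time until one is left: ranking_ is then 1..12.
full = RFE(model, n_features_to_select=1).fit(X, y)
rank = full.ranking_

# Refit the model on the 6 best-ranked features to get their coefficients.
top6 = rank <= 6
final = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X[:, top6], y)

coef = np.full(X.shape[1], np.nan)  # NaN for the features that were not kept
coef[top6] = final.coef_.ravel()

summary = (pd.DataFrame({"rank": rank, "coef": coef}, index=names)
           .sort_values("rank")
           .T)
print(summary)
```

Because the features here are not standardised, the size of a coefficient depends on the unit of its feature, so coefficients of different features are not directly comparable as importance measures.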

### Overall
On the held-out test set, this model reaches an accuracy of 72%. Using only 6 features would also make the measurements easier for the doctor.