# Prosthetics Web Application
This notebook has 3 sections:
* Data Preprocessing
* Data Modeling
* Model Persistence

## Data Preprocessing

In [1]:
import pandas as pd
import numpy as np

The data are split across two tab-delimited text files, one for men and one for women. Let's import them.

In [2]:
men=pd.read_table("ansur_men.txt")
women=pd.read_table("ansur_women.txt")
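
As a side note, read_table is essentially read_csv with a tab as the default separator. Here is a minimal sketch of that equivalence, assuming a tiny in-memory sample instead of the real files; the sample's column layout is made up for illustration.

```python
import io

import pandas as pd

# Hypothetical two-row, tab-delimited sample standing in for ansur_men.txt
sample = "SUBJECT_NUMBER\tGENDER\tFOOT_LNTH\n1\t1\t260\n2\t1\t290\n"

# read_table defaults to sep="\t"; read_csv needs the separator spelled out
via_table = pd.read_table(io.StringIO(sample))
via_csv = pd.read_csv(io.StringIO(sample), sep="\t")

print(via_table.equals(via_csv))
```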


Let's check if the two data frames have the same set of variables. We then check if the variable "GENDER" is in both, and explore how it is coded.

In [3]:
men.columns.tolist() == women.columns.tolist()


True

In [6]:
("GENDER"  in men.columns.tolist()) and ("GENDER" in women.columns.tolist())


True

In [8]:
pd.unique(men.GENDER)


array([1], dtype=int64)

In [9]:
pd.unique(women.GENDER)


array([2], dtype=int64)

We want to recode women as 0 instead of 2, but let's first make sure that there are no missing values in that column. We'll then use NumPy's where() for the recoding, and concatenate the two data frames.

In [10]:
women.GENDER.isnull().any()


False

In [11]:
women["GENDER"] = np.where(women["GENDER"] == 2, 0, women["GENDER"])


In [12]:
pd.unique(women.GENDER)


array([0], dtype=int64)
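
The recoding can be written in more than one way. The sketch below, on a toy Series rather than the real column, compares np.where with a scalar replacement value against Series.replace; both should give the same result here.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the women's GENDER column, where every value is 2
gender = pd.Series([2, 2, 2], name="GENDER")

# np.where broadcasts the scalar 0, so no list of zeros is needed
recoded_where = pd.Series(np.where(gender == 2, 0, gender), name="GENDER")

# Series.replace expresses the same recoding as a mapping
recoded_replace = gender.replace({2: 0})

print(recoded_where.tolist(), recoded_replace.tolist())
```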

In [13]:
both=pd.concat([men, women], ignore_index=True)


In [14]:
men.shape[0] + women.shape[0]  == both.shape[0]


True
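
To see why ignore_index=True matters when stacking the two frames, here is a small sketch on toy frames (the values are placeholders). Without it, each frame keeps its own row labels and the combined index contains duplicates.

```python
import pandas as pd

# Toy frames standing in for the men's and women's data
men_toy = pd.DataFrame({"GENDER": [1, 1], "FOOT_LNTH": [260, 290]})
women_toy = pd.DataFrame({"GENDER": [0, 0], "FOOT_LNTH": [239, 234]})

# Default behaviour keeps the original row labels from each frame
kept = pd.concat([men_toy, women_toy])

# ignore_index=True renumbers the rows from 0 to n-1
renumbered = pd.concat([men_toy, women_toy], ignore_index=True)

print(kept.index.is_unique, renumbered.index.is_unique)
```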

Now let's select the variables of interest into a lighter-weight data frame, replace their cryptic labels, and inspect their types (i.e., categorical or numeric). We'll also check for missing data.

In [15]:
df=both[['ACR_HT-SIT',
'TENTH_RIB',
'WAIST_NAT_LNTH',
'GENDER',
'AGE-ANSUR88',
'HAND_LNTH',
'HAND_CIRC_AT_METACARPALE',
'FOREARM_CIRC-FLEXED',
'RADIALE-STYLION_LNTH',
'ARM_CIRC-AXILLARY',
'SHOULDER_ELBOW_LNTH',
'FOOT_BRTH',
'FOOT_LNTH',
'KNEE_HT_-_SITTING',
'CALF_CIRC',
'THIGH_CIRC-PROXIMAL',
'BUTT_KNEE_LNTH',
 'THUMB-TIP_REACH',
'FUNCTIONAL_LEG_LNTH']]



In [16]:
# rename returns a new frame; reassigning it avoids the SettingWithCopyWarning
# that inplace=True raises on a frame selected from another one
df = df.rename(columns={'ACR_HT-SIT' : "shoulder_height",
'TENTH_RIB' : "chest_height",
'WAIST_NAT_LNTH':"waist_length",
'GENDER':"gender",
'AGE-ANSUR88':"age",
'HAND_LNTH':"hand1",
'HAND_CIRC_AT_METACARPALE':"hand2",
'FOREARM_CIRC-FLEXED':"forearm2",
'RADIALE-STYLION_LNTH':"forearm1",
'ARM_CIRC-AXILLARY':"arm2",
'SHOULDER_ELBOW_LNTH':"arm1",
'FOOT_BRTH':"foot2",
'FOOT_LNTH':"foot1",
'KNEE_HT_-_SITTING':"foreleg1",
'CALF_CIRC':"foreleg2",
'THIGH_CIRC-PROXIMAL':"thigh2",
'BUTT_KNEE_LNTH':"thigh1",
'THUMB-TIP_REACH':"upperlimb",
'FUNCTIONAL_LEG_LNTH':"lowerlimb"})
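
Renaming in place on a frame obtained by column selection is a common trigger for pandas' SettingWithCopyWarning. One way to sidestep it, sketched below on a toy frame with made-up values, is to take an explicit copy of the selection and reassign the result of rename instead of using inplace=True.

```python
import pandas as pd

# Toy frame standing in for the combined data; OTHER is a placeholder column
both_toy = pd.DataFrame(
    {"FOOT_LNTH": [260, 290], "FOOT_BRTH": [97, 105], "OTHER": [1, 2]}
)

# .copy() makes the subset an independent frame
subset = both_toy[["FOOT_LNTH", "FOOT_BRTH"]].copy()

# rename returns a new frame, so we reassign rather than mutate in place
subset = subset.rename(columns={"FOOT_LNTH": "foot1", "FOOT_BRTH": "foot2"})

print(subset.columns.tolist())
print(both_toy.columns.tolist())  # the source frame keeps its original labels
```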


In [17]:
df.isnull().any().any()

False

In [18]:
len(df.select_dtypes(["category"]).columns)

0
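
Note that selecting the "category" dtype only finds columns stored as pandas categoricals; text columns read from a file would not show up there. A slightly broader check, sketched here on a toy frame with a deliberately non-numeric column, is to list every column that is not numeric.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame: two numeric columns and one text column (hypothetical)
toy = pd.DataFrame({"age": [34, 37], "hand1": [193, 213], "label": ["a", "b"]})

# Collect every column whose dtype is not numeric
non_numeric = [col for col in toy.columns if not is_numeric_dtype(toy[col])]

print(non_numeric)
```

An empty list would mean every column is numeric.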

In [19]:
df.shape

(3982, 19)

Good! So we have a data frame of 19 variables and roughly 4,000 examples; all variables are numeric, and there are no missing data. Let's take a look at the first and last five rows, then get descriptive statistics for the variables, and finally write the data frame to a CSV file for future use before moving on to the modeling step.

In [20]:
df.head()

,shoulder_height,chest_height,waist_length,gender,age,hand1,hand2,forearm2,forearm1,arm2,arm1,foot2,foot1,foreleg1,foreleg2,thigh2,thigh1,upperlimb,lowerlimb
0,583,1110,385,1,34,193,217,318,273,364,363,97,260,551,400,627,618,807,1071
1,631,1204,414,1,37,213,228,294,285,348,395,105,290,601,400,615,653,875,1152
2,586,1090,395,1,38,191,213,282,264,328,359,97,254,531,362,576,594,785,1036
3,615,1124,434,1,33,204,213,308,279,294,381,108,271,546,410,616,624,845,1072
4,577,1046,386,1,42,176,203,292,264,335,342,97,240,540,380,587,584,750,1021


In [21]:
df.tail()

,shoulder_height,chest_height,waist_length,gender,age,hand1,hand2,forearm2,forearm1,arm2,arm1,foot2,foot1,foreleg1,foreleg2,thigh2,thigh1,upperlimb,lowerlimb
3977,557,1010,333,0,33,181,192,263,252,297,341,90,239,499,360,615,572,747,984
3978,519,1007,319,0,25,178,186,257,230,297,335,87,234,497,333,586,605,713,1003
3979,518,1068,314,0,28,197,192,250,258,268,353,95,246,539,358,564,599,781,1055
3980,492,994,323,0,26,185,182,250,236,287,323,89,228,503,382,650,599,718,1020
3981,614,1172,407,0,24,181,199,278,247,347,361,97,259,570,414,669,649,761,1126


In [22]:
df.describe()

,shoulder_height,chest_height,waist_length,gender,age,hand1,hand2,forearm2,forearm1,arm2,arm1,foot2,foot1,foreleg1,foreleg2,thigh2,thigh1,upperlimb,lowerlimb
count,3982.0,3982.0,3982.0,3982.0,3982.0,3982.0,3982.0,3982.0,3982.0,3982.0,3982.0,3982.0,3982.0,3982.0,3982.0,3982.0,3982.0,3982.0,3982.0
mean,574.347062,1078.595932,387.220241,0.445505,26.638122,186.390256,198.462833,275.887745,255.204169,310.909844,350.589151,94.547715,255.654696,534.739076,363.819437,587.505525,601.173782,764.120291,1043.22225
std,35.853782,62.805923,32.881028,0.497084,6.241669,11.764664,16.435055,29.892711,20.34607,33.537753,24.179413,7.44796,17.814418,34.587067,27.35527,47.668024,32.721096,50.037924,60.928713
min,464.0,876.0,283.0,0.0,17.0,149.0,158.0,210.0,157.0,223.0,282.0,73.0,203.0,406.0,285.0,454.0,491.0,605.0,819.0
25%,548.0,1033.0,362.0,0.0,22.0,178.0,185.0,251.0,241.0,285.0,333.0,89.0,242.0,510.0,345.0,555.0,579.0,728.0,1003.0
50%,574.0,1077.0,387.0,0.0,25.0,186.0,196.0,271.0,255.0,308.0,350.0,94.0,255.0,533.0,363.0,586.0,600.0,761.0,1041.0
75%,600.0,1121.0,411.0,1.0,31.0,194.0,212.0,300.0,269.0,335.0,368.0,100.0,268.0,558.0,382.0,619.0,623.0,799.0,1085.0
max,695.0,1353.0,493.0,1.0,51.0,233.0,247.0,372.0,325.0,453.0,446.0,122.0,310.0,675.0,470.0,787.0,723.0,980.0,1291.0


In [23]:
df.to_csv("prosthetics.csv")
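
One thing to keep in mind when the CSV is read back later: by default to_csv also writes the row index, which reappears as an extra "Unnamed: 0" column on reload. The sketch below uses a toy frame and in-memory buffers instead of a real file to compare the default with index=False.

```python
import io

import pandas as pd

toy = pd.DataFrame({"foot1": [260, 290], "foot2": [97, 105]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with = pd.read_csv(with_index)

# index=False writes only the data columns
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without = pd.read_csv(without_index)

print(reloaded_with.columns.tolist())
print(reloaded_without.columns.tolist())
```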

## Data Modeling

We need to build a predictive model for each of the "lowerlimb" and "upperlimb" outcome variables. Each model uses all other variables except those that make up the limb being predicted. For example, we will exclude the hand, forearm, and arm variables from the upperlimb model, and use all other predictors. The other limb's total length is left out as well.
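
The selection rule can be sketched as a small helper that drops the segments of the limb being predicted, plus both outcome columns. The helper name predictors_for and the two segment lists are illustrative assumptions; the column names are the renamed labels from the preprocessing step.

```python
# Renamed column labels, in the order used by the data frame
all_columns = [
    "shoulder_height", "chest_height", "waist_length", "gender", "age",
    "hand1", "hand2", "forearm2", "forearm1", "arm2", "arm1",
    "foot2", "foot1", "foreleg1", "foreleg2", "thigh2", "thigh1",
    "upperlimb", "lowerlimb",
]
outcomes = ["upperlimb", "lowerlimb"]

# Segment measurements that make up each limb (grouping assumed for illustration)
upper_segments = ["hand1", "hand2", "forearm1", "forearm2", "arm1", "arm2"]
lower_segments = ["foot1", "foot2", "foreleg1", "foreleg2", "thigh1", "thigh2"]


def predictors_for(excluded_segments):
    """Return every column that is neither an outcome nor an excluded segment."""
    return [
        col for col in all_columns
        if col not in outcomes and col not in excluded_segments
    ]


upper_predictors = predictors_for(upper_segments)
lower_predictors = predictors_for(lower_segments)

print(upper_predictors)
print(lower_predictors)
```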

We'll definitely use a regression model, since our dependent variable is continuous. Anthropometric body measurements are known to be highly correlated, however, so the variables certainly carry redundant information. Accordingly, we will use a regularized regression model to trade some bias in the parameter estimates for out-of-sample stability of those estimates. We will go for Ridge regression here, since we want to retain measurements from as many body segments as possible in our model (Lasso would tend to produce a much sparser one).
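
To make the Ridge versus Lasso point concrete, here is a rough sketch on synthetic data, assuming five predictors that are noisy copies of one underlying measurement. The alpha values are arbitrary picks for illustration, not tuned. Ridge shrinks the coefficients but keeps them all, while Lasso is free to set some of them exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)

# Five highly correlated predictors: noisy copies of one underlying measurement
X = np.column_stack([base + 0.05 * rng.normal(size=n) for _ in range(5)])
y = X.sum(axis=1) + rng.normal(scale=0.5, size=n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("ridge non-zero coefficients:", np.count_nonzero(ridge.coef_))
print("lasso non-zero coefficients:", np.count_nonzero(lasso.coef_))
```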

The hyperparameter alpha in Ridge regression needs to be tuned empirically. We will use a coarse-grained grid search for the best value of that parameter, then home in on it with a finer-grained search. We'll use R-squared as the metric for model performance. We'll start with the lower-limb model, then build the upper-limb model.

### Lower Limb Model

In [24]:
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import Ridge




In [25]:
y1=df["lowerlimb"]

x1 = df[['shoulder_height', 'chest_height', 'waist_length', 'gender', 'age', 'hand1', 'hand2', 'forearm2', 'forearm1', 'arm2', 'arm1']]


In [26]:
grid=[{"alpha" : [0.1, 1, 10]}]


In [27]:
model1=GridSearchCV(Ridge(),
grid,
cv=5,
scoring= 'r2')



In [28]:
model1.fit(x1,y1)


GridSearchCV(cv=5, error_score='raise',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001),
       fit_params={}, iid=True, n_jobs=1,
       param_grid=[{'alpha': [0.1, 1, 10]}], pre_dispatch='2*n_jobs',
       refit=True, scoring='r2', verbose=0)

In [29]:
for params, mean_score, scores in model1.grid_scores_:
	print("mean_score is {}, its standard deviation is {}, and the parameter value is {}".format(mean_score, scores.std(), params))


mean_score is 0.8816575415382724, its standard deviation is 0.01462638955470248, and the parameter value is {'alpha': 0.1}
mean_score is 0.8816626944593696, its standard deviation is 0.014631879388053784, and the parameter value is {'alpha': 1}
mean_score is 0.8817009735231847, its standard deviation is 0.01468624862588155, and the parameter value is {'alpha': 10}
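
A note for readers on recent scikit-learn releases: the sklearn.grid_search module and the grid_scores_ attribute were removed in version 0.20. GridSearchCV now lives in sklearn.model_selection and reports its scores through cv_results_. The sketch below shows the equivalent loop on synthetic stand-in data, so its scores will not match the ones above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the ANSUR measurements)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=200)

search = GridSearchCV(Ridge(), {"alpha": [0.1, 1, 10]}, cv=5, scoring="r2")
search.fit(X, y)

# cv_results_ holds one entry per candidate, in the order they were tried
results = search.cv_results_
for mean, std, params in zip(
    results["mean_test_score"], results["std_test_score"], results["params"]
):
    print(
        "mean_score is {:.4f}, its standard deviation is {:.4f}, "
        "and the parameter value is {}".format(mean, std, params)
    )
```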


Interestingly, the model is highly stable: the mean R-squared is nearly identical across the values of alpha, and there is little variability across the 5 cross-validation folds for each alpha. Around 88% of the variability in lower-limb length can be predicted by the model for data it was not trained on. For completeness, we'll search for the best value of alpha in the vicinity of 10, using small increments.
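
As a reminder of what the metric measures, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand and with scikit-learn's r2_score. The observed values are the five lowerlimb entries from the head of the data frame; the predictions are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# lowerlimb values from the first five rows shown earlier
y_true = np.array([1071.0, 1152.0, 1036.0, 1072.0, 1021.0])

# Hypothetical predictions, not produced by any model in this notebook
y_pred = np.array([1060.0, 1140.0, 1045.0, 1080.0, 1030.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```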

In [30]:
fine_grid=[{"alpha" : np.arange(7,13, 0.2)}]


In [31]:
model2=GridSearchCV(Ridge(),
fine_grid,
cv=5,
scoring= 'r2')



In [32]:
model2.fit(x1,y1)


GridSearchCV(cv=5, error_score='raise',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001),
       fit_params={}, iid=True, n_jobs=1,
       param_grid=[{'alpha': array([  7. ,   7.2,   7.4,   7.6,   7.8,   8. ,   8.2,   8.4,   8.6,
         8.8,   9. ,   9.2,   9.4,   9.6,   9.8,  10. ,  10.2,  10.4,
        10.6,  10.8,  11. ,  11.2,  11.4,  11.6,  11.8,  12. ,  12.2,
        12.4,  12.6,  12.8])}],
       pre_dispatch='2*n_jobs', refit=True, scoring='r2', verbose=0)

In [33]:
model2.best_estimator_


Ridge(alpha=12.800000000000004, copy_X=True, fit_intercept=True,
   max_iter=None, normalize=False, random_state=None, solver='auto',
   tol=0.001)
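
The selected value, 12.8, is the last point of the fine grid, which is worth checking for whenever a search range is chosen by hand. Below is a sketch of such a check on synthetic stand-in data, using the modern sklearn.model_selection import. Rounding the grid also removes floating-point drift such as 12.800000000000004.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the ANSUR measurements)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Same range as the fine grid, rounded to one decimal place
alphas = np.round(np.arange(7, 13, 0.2), 1)
search = GridSearchCV(Ridge(), {"alpha": alphas}, cv=5, scoring="r2").fit(X, y)

best_alpha = search.best_params_["alpha"]

# If the winner sits on either end of the grid, the optimum may lie outside it
on_edge = bool(
    np.isclose(best_alpha, alphas.min()) or np.isclose(best_alpha, alphas.max())
)
print("best alpha:", best_alpha, "| on grid edge:", on_edge)
```

If on_edge is true, extending the grid in that direction is a cheap way to confirm whether the score keeps improving.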

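Alright, alpha=12.8 scores best on this grid. Note that 12.8 is the largest value we searched, so the optimum may lie beyond it; given how flat the scores are across alphas, the practical difference is likely small.

### Upper Limb Model

Let's repeat the same steps for the upper-limb model.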

In [35]:
y2=df["upperlimb"]

x2 = df[['shoulder_height', 'chest_height', 'waist_length', 'gender', 'age', 'foot1', 'foot2', 'foreleg1', 'foreleg2', 'thigh2', 'thigh1']]


In [36]:
model3=GridSearchCV(Ridge(),
grid,
cv=5,
scoring= 'r2')



In [37]:
model3.fit(x2,y2)


GridSearchCV(cv=5, error_score='raise',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001),
       fit_params={}, iid=True, n_jobs=1,
       param_grid=[{'alpha': [0.1, 1, 10]}], pre_dispatch='2*n_jobs',
       refit=True, scoring='r2', verbose=0)

In [38]:
for params, mean_score, scores in model3.grid_scores_:
	print("mean_score is {}, its standard deviation is {}, and the parameter value is {}".format(mean_score, scores.std(), params))


mean_score is 0.7562480874516535, its standard deviation is 0.02166740064148744, and the parameter value is {'alpha': 0.1}
mean_score is 0.7562447153804018, its standard deviation is 0.0216824321404775, and the parameter value is {'alpha': 1}
mean_score is 0.7562000680958149, its standard deviation is 0.021826773507645152, and the parameter value is {'alpha': 10}


So, the model is also stable, predicting around 76% of the variability in upper-limb length for examples it has not seen. Alpha = 0.1 is nominally the best value.

## Model Persistence

Now, on to the final step. We arrived at the hyperparameter values through model selection above. Let's now use these values in plain Ridge regression models, fitted on the full data set, to obtain the parameter estimates. We will then "freeze", or persist, these models into "pkl" binary files. These files can then be transferred to the server side, where they reside in the Flask web application directory. When needed, the pickled files can be "defrosted" and used to predict the lower-limb and upper-limb lengths for patients with amputation.

In [40]:
from sklearn.linear_model import Ridge
from sklearn.externals import joblib


In [41]:
model_lower=Ridge(alpha=12.8)


In [42]:
model_upper=Ridge(alpha=0.1)


In [43]:
model_lower.fit(x1, y1)

Ridge(alpha=12.8, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [44]:
model_upper.fit(x2, y2)

Ridge(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [45]:
joblib.dump(model_lower, "model_lower.pkl")


['model_lower.pkl']

In [46]:
joblib.dump(model_upper, "model_upper.pkl")


['model_upper.pkl']
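
For reference, here is a sketch of the full round trip: dump a fitted model, load it back, and confirm that it predicts the same values. It assumes the standalone joblib package, since sklearn.externals.joblib was removed in scikit-learn 0.23. The data are synthetic, and the file name model_demo.pkl is a placeholder.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic stand-in data (not the ANSUR measurements)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

model = Ridge(alpha=0.1).fit(X, y)

# Write the model to a temporary directory, then read it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_demo.pkl")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

same_predictions = np.allclose(model.predict(X), restored.predict(X))
print(same_predictions)
```

When the restored model is used on the server side, the input columns need to be in the same order as the predictors it was trained on.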

Great! You should now find two "pkl" files in your working directory. Copy them to the Flask app directories, and see you there!