
[![AnalyticsDojo](https://github.com/rpi-techfundamentals/spring2019-materials/blob/master/fig/final-logo.png?raw=1)](http://rpi.analyticsdojo.com)
<center><h1>Titanic Regression</h1></center>
<center><h3><a href = 'http://introml.analyticsdojo.com'>introml.analyticsdojo.com</a></h3></center>



# Titanic Regression

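Here we build a regression model that predicts each passenger's `Age` from the other columns. Passengers with a recorded age serve as training data; passengers with a missing age are the rows we ultimately want predictions for.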


In [129]:
import pandas as pd
raw_df = pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/train.csv')

print(raw_df.columns)

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')


Here is a broad description of the keys and what they mean:

```
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
survival        Survival
                (0 = No; 1 = Yes)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)
boat            Lifeboat
body            Body Identification Number
home.dest       Home/Destination
```

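In general, it looks like `name`, `sex`, `cabin`, `embarked`, `boat`, `body`, and `home.dest` may be candidates for categorical features, while the rest appear to be numerical. Note that `boat`, `body`, and `home.dest` come from a fuller version of the Titanic data and are not among the columns of the CSV loaded above, and that `pclass` is stored as a number but is treated as categorical in the preprocessing later on. We can also look at the first few rows of the dataset to get a better understanding: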

In [130]:
raw_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
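
To check which columns look categorical, one option is to inspect each column's dtype and its number of distinct values. The sketch below uses a tiny hand-made frame (`toy_df` is hypothetical, not the Titanic data) so that it runs without a download; the same two calls can be applied to `raw_df`.

```python
import pandas as pd

# Hypothetical miniature frame that mimics a few Titanic columns.
toy_df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Sex": ["male", "female", "female", "female"],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
    "Embarked": ["S", "C", "S", "S"],
})

# String-typed columns are categorical candidates; integer columns with only
# a few distinct values (such as Pclass) often are too.
summary = pd.DataFrame({
    "dtype": toy_df.dtypes.astype(str),
    "n_unique": toy_df.nunique(),
})
print(summary)
```

A numeric dtype alone does not settle the question: `Pclass` is an integer column, yet it has only a handful of levels and no meaningful arithmetic, which is why it can reasonably be encoded as a category.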


In [131]:
raw_df.shape

(891, 12)

In [132]:
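# Rows with a missing Age are the ones we want to predict; rows with a known Age are used for training.
test = raw_df.loc[raw_df['Age'].isnull(),:]
train = raw_df.loc[raw_df['Age'].notnull(),:]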
train

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [133]:
test

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
17,18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0000,,S
19,20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.2250,,C
26,27,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.2250,,C
28,29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792,,Q
...,...,...,...,...,...,...,...,...,...,...,...,...
859,860,0,3,"Razi, Mr. Raihed",male,,0,0,2629,7.2292,,C
863,864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.5500,,S
868,869,0,3,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.5000,,S
878,879,0,3,"Laleff, Mr. Kristo",male,,0,0,349217,7.8958,,S


### Preprocessing function

We want a single preprocessing function that applies the same transformations to both the train and test sets: imputing missing values and one-hot encoding the categorical features.
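
A point to keep in mind when one function serves both sets: imputation values and dummy columns are usually learned from the training rows and then reused on the test rows, so that both end up with identical columns. The sketch below shows that pattern on two small hypothetical frames (`toy_train`, `toy_test`). The helper name `fit_preprocessor` is made up for illustration and is not part of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

cat_cols = ["Sex", "Embarked"]
num_cols = ["Fare"]

# Hypothetical data, not taken from the Titanic CSV.
toy_train = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
})
toy_test = pd.DataFrame({
    "Sex": ["male", "female"],
    "Embarked": [np.nan, "S"],
    "Fare": [np.nan, 13.0],
})

def fit_preprocessor(train_df):
    """Learn imputation values and dummy columns from the training rows only."""
    imp_mode = SimpleImputer(strategy="most_frequent").fit(train_df[cat_cols].astype(object))
    imp_mean = SimpleImputer(strategy="mean").fit(train_df[num_cols])

    def encode(df, drop_first):
        cats = pd.DataFrame(imp_mode.transform(df[cat_cols].astype(object)),
                            columns=cat_cols, index=df.index)
        nums = pd.DataFrame(imp_mean.transform(df[num_cols]),
                            columns=num_cols, index=df.index)
        return nums.join(pd.get_dummies(cats, drop_first=drop_first, dtype=int))

    # The training rows decide which dummy columns exist (reference levels dropped).
    train_columns = encode(train_df, drop_first=True).columns

    def transform(df):
        # Encode every level present, then keep exactly the training columns.
        return encode(df, drop_first=False).reindex(columns=train_columns, fill_value=0)

    return transform

transform = fit_preprocessor(toy_train)
train_X_toy = transform(toy_train)
test_X_toy = transform(toy_test)
print(test_X_toy)
```

Encoding the test rows without `drop_first` and then reindexing to the training columns avoids a subtle problem: if a test set happens to contain only one level of a feature, `drop_first=True` would drop that level instead of the training reference level.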

In [134]:
from sklearn.impute import SimpleImputer
import numpy as np

cat_features = ['Pclass', 'Sex', 'Embarked']
num_features =  [ 'SibSp', 'Parch', 'Fare'  ]
def preprocess(df, num_features, cat_features, dv):
    """Impute missing values and one-hot encode the features; return (y, X)."""
    # Work on a copy so the caller's frame is left unchanged and pandas does not
    # raise SettingWithCopyWarning when df is a slice of another DataFrame.
    df = df.copy()
    features = cat_features + num_features
    # Return the dependent variable as y when the column is present.
    y = df[dv] if dv in df.columns else None

    # Impute missing values: most frequent value for categorical features,
    # mean for numerical features.
    imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
    df[cat_features] = imp_mode.fit_transform(df[cat_features])

    imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
    df[num_features] = imp_mean.fit_transform(df[num_features])
    # No standardization here: ordinary least squares and tree-based models
    # give the same predictions regardless of feature scale.

    # One-hot encode the categorical features, dropping the first level of each.
    X = pd.get_dummies(df[features], columns=cat_features, drop_first=True)
    return y, X

train_y, train_X = preprocess(train, num_features, cat_features, 'Age')


In [135]:
train_X

Unnamed: 0,SibSp,Parch,Fare,Pclass_2,Pclass_3,Sex_male,Embarked_Q,Embarked_S
0,1.0,0.0,7.2500,0,1,1,0,1
1,1.0,0.0,71.2833,0,0,0,0,0
2,0.0,0.0,7.9250,0,1,0,0,1
3,1.0,0.0,53.1000,0,0,0,0,1
4,0.0,0.0,8.0500,0,1,1,0,1
...,...,...,...,...,...,...,...,...
885,0.0,5.0,29.1250,0,1,0,1,0
886,0.0,0.0,13.0000,1,0,1,0,1
887,0.0,0.0,30.0000,0,0,0,0,1
889,0.0,0.0,30.0000,0,0,1,0,0


In [136]:
train_y

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
885    39.0
886    27.0
887    19.0
889    26.0
890    32.0
Name: Age, Length: 714, dtype: float64

In [137]:
from sklearn.model_selection import train_test_split

# Split into training and validation, stratifying on sex
smaller_train_X, validation_X, smaller_train_y, validation_y = train_test_split(train_X, train_y, train_size=0.6, test_size=0.4, random_state=122, stratify = train_X['Sex_male'])
print("Length of final training DataFrame:", len(smaller_train_X))
assert len(smaller_train_X) == len(smaller_train_y)
print("Length of validation DataFrame:", len(validation_X))
assert len(validation_X) == len(validation_y)
smaller_train_X

Length of final training DataFrame: 428
Length of validation DataFrame: 286


Unnamed: 0,SibSp,Parch,Fare,Pclass_2,Pclass_3,Sex_male,Embarked_Q,Embarked_S
70,0.0,0.0,10.5000,1,0,1,0,1
618,2.0,1.0,39.0000,1,0,0,0,1
536,0.0,0.0,26.5500,0,0,1,0,1
710,0.0,0.0,49.5042,0,0,0,0,0
836,0.0,0.0,8.6625,0,1,1,0,1
...,...,...,...,...,...,...,...,...
238,0.0,0.0,10.5000,1,0,1,0,1
342,0.0,0.0,13.0000,1,0,1,0,1
432,1.0,0.0,26.0000,1,0,0,0,1
696,0.0,0.0,8.0500,0,1,1,0,1


In [138]:
validation_X

Unnamed: 0,SibSp,Parch,Fare,Pclass_2,Pclass_3,Sex_male,Embarked_Q,Embarked_S
661,0.0,0.0,7.2250,0,1,1,0,0
624,0.0,0.0,16.1000,0,1,1,0,1
34,1.0,0.0,82.1708,0,0,1,0,0
854,1.0,0.0,26.0000,1,0,0,0,1
98,0.0,1.0,23.0000,1,0,0,0,1
...,...,...,...,...,...,...,...,...
587,1.0,1.0,79.2000,0,0,1,0,0
476,1.0,0.0,21.0000,1,0,1,0,1
550,0.0,2.0,110.8833,0,0,1,0,0
294,0.0,0.0,7.8958,0,1,1,0,1


In [139]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

In [140]:
reg = LinearRegression()
reg.fit(smaller_train_X, smaller_train_y)

print('Coefficients: \n', reg.coef_)
print('Intercept: \n', reg.intercept_)
print('R2 for train', reg.score(smaller_train_X, smaller_train_y))
print('R2 for validation', reg.score(validation_X, validation_y))

Coefficients: 
 [-4.29741890e+00 -2.67412991e-01 -9.49154983e-03 -9.87322021e+00
 -1.47939030e+01  2.19619426e+00  5.43658582e+00  3.84105821e+00]
Intercept: 
 37.88921436586399
R2 for train 0.2659980660554415
R2 for validation 0.21284235213769886
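
As a reminder of what these scores mean: `reg.score` returns the coefficient of determination, computed as `1 - SS_res / SS_tot`, where `SS_res` is the sum of squared prediction errors and `SS_tot` is the sum of squared deviations of the target from its mean. The sketch below recomputes it by hand on made-up numbers (`y_true` and `y_pred` are hypothetical, not model output) and checks the result against `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical ages and predictions, for illustration only.
y_true = np.array([22.0, 38.0, 26.0, 35.0])
y_pred = np.array([25.0, 33.0, 28.0, 31.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # squared prediction errors
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared deviations from the mean
r2_manual = 1 - ss_res / ss_tot

# The manual value agrees with scikit-learn's implementation.
assert np.isclose(r2_manual, r2_score(y_true, y_pred))
print(r2_manual)
```

A score of 0 corresponds to always predicting the mean of the target, so the validation value of about 0.21 above means the linear model explains roughly a fifth of the variance in age on unseen rows.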


In [141]:
from sklearn import metrics

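def evaluate(name, dtype, y_true, y_pred, results=None):
  """
  Record the R2 score for one split ('train' or 'validation') in a pandas Series.
  Pass an existing Series as `results` to add to it; otherwise a new one is created.
  """
  if results is None:
    # Create a fresh Series on each call; a Series used as a default argument
    # would be shared (and overwritten) across calls.
    results = pd.Series(dtype=object)
  results['name']=name
  results['r2-'+dtype]=metrics.r2_score(y_true, y_pred)
  return results


def fit(name, regressor, train_X, train_y, val_X, val_y):
  """
  Train a regressor and evaluate it on the training and validation sets.
  """
  regressor.fit(train_X, train_y)
  # Score the predictions on each split.
  r1 = evaluate(name, "train", train_y, regressor.predict(train_X))
  r1 = evaluate(name, "validation", val_y, regressor.predict(val_X),  results=r1)
  return r1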

In [142]:
rows = []
# No random_state is set for the ensemble models, so their scores vary slightly between runs.
allmodels={"linear": LinearRegression(),
           "gradient": GradientBoostingRegressor(),
           "randomforest": RandomForestRegressor()}

for name, regressor in allmodels.items():
  print("Modeling: ", name, "...")
  results = fit(name, regressor, smaller_train_X, smaller_train_y, validation_X, validation_y)
  # Store a snapshot of each result; DataFrame.append was removed in pandas 2.0.
  rows.append(results.to_dict())
final = pd.DataFrame(rows)
final

Modeling:  linear ...
Modeling:  gradient ...
Modeling:  randomforest ...


Unnamed: 0,name,r2-train,r2-validation
0,linear,0.265998,0.212842
1,gradient,0.573295,0.236534
2,randomforest,0.726844,0.212732


# Challenge: does dropping the categorical features improve the R2 values?
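
One way to set up this comparison is sketched below on synthetic data, so that it does not give away the answer for the Titanic set: fit the same model on the full feature matrix and on the numerical columns only, using the same split, then compare the validation R2. The column names mirror `train_X`, but the values are randomly generated, and the helper `validation_r2` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200

# Synthetic features; names borrowed from train_X, values made up.
X = pd.DataFrame({
    "SibSp": rng.integers(0, 4, n),
    "Fare": rng.uniform(5, 100, n),
    "Pclass_3": rng.integers(0, 2, n),
    "Sex_male": rng.integers(0, 2, n),
})
# Synthetic target, not real ages.
y = 35 - 3 * X["SibSp"] - 8 * X["Pclass_3"] + rng.normal(0, 5, n)

def validation_r2(columns):
    """Fit a linear model on the given columns and return its validation R2."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X[columns], y, test_size=0.4, random_state=122)
    return LinearRegression().fit(X_tr, y_tr).score(X_val, y_val)

scores = {
    "all features": validation_r2(list(X.columns)),
    "numeric only": validation_r2(["SibSp", "Fare"]),
}
print(scores)
```

Fixing `random_state` keeps the train/validation rows identical across the two runs, so any difference in R2 comes from the feature set rather than from the split.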