### Exploring the prediction results

To better understand what is going on, I wrote scripts that compare the resulting predictions and help reveal the logic the algorithms follow.

In [None]:
def mismatch(a, b):
    return [(x1, x2) for x1, x2 in zip(a, b) if x1[1] != x2[1]]

In [2]:
def read(ds):
    # Parse a submission CSV into tuples of ints, skipping the header row and empty lines.
    with open(ds) as f:
        r = [tuple(int(s) for s in l.split(',') if s != '') for l in f.read().split('\n')[1:] if l]
    return r

In [3]:
def print_mismatch(a, b):
    m = mismatch(a, b)
    print(len(m))
    for r in m:
        print('{}-{}'.format(r[0], r[1]))
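
To compare every pair of models in one pass instead of calling `print_mismatch` once per pair, the same comparison can be wrapped in a loop. A minimal sketch, assuming each prediction list holds `(PassengerId, Survived)` tuples in the same order; `pairwise_mismatch_counts` and the toy data are hypothetical and not part of the original notebook:

```python
from itertools import combinations


def mismatch(a, b):
    # Same logic as the notebook's helper: keep pairs whose labels differ.
    return [(x1, x2) for x1, x2 in zip(a, b) if x1[1] != x2[1]]


def pairwise_mismatch_counts(predictions):
    # predictions: dict mapping a model name to its list of (id, label) tuples.
    return {
        (name_a, name_b): len(mismatch(predictions[name_a], predictions[name_b]))
        for name_a, name_b in combinations(predictions, 2)
    }


# Toy data for illustration only; not the real submissions.
toy = {
    "lr": [(1, 1), (2, 0), (3, 1)],
    "rf": [(1, 0), (2, 0), (3, 1)],
    "gbc": [(1, 0), (2, 1), (3, 1)],
}
counts = pairwise_mismatch_counts(toy)
for pair, n in sorted(counts.items(), key=lambda item: item[1]):
    print(pair, n)
```

Sorting by count puts the most similar pair of models first, which is the comparison the sections below make by hand.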

In [4]:
def survived(a):
    survived = 0
    not_survived = 0
    for i in a:
        if i[1] == 0:
            not_survived += 1
        else:
            survived += 1
    return survived, not_survived, len(a)

In [5]:
def print_survived(a):
    s = survived(a)
    p = s[2] / 100
    print('survived:', s[0], ', percent:', s[0] / p, '|', 
          'not_survived:', s[1], ', percent:', s[1] / p, '|', 
          'total:', s[2]
         )

In [6]:
lr = read('sample_submission_lr.csv')

In [7]:
knn = read('sample_submission_knn.csv')

In [8]:
gbc = read('sample_submission_gbc.csv')

In [9]:
rf = read('sample_submission_rf.csv')

### Let's see how the algorithms split passengers into survivors and non-survivors:

#### Logit survived:

In [10]:
print_survived(lr)

survived: 146 , percent: 34.928229665071775 | not_survived: 272 , percent: 65.07177033492823 | total: 418


#### kNN survived:

In [11]:
print_survived(knn)

survived: 130 , percent: 31.100478468899524 | not_survived: 288 , percent: 68.89952153110049 | total: 418


#### Gradient Boosting survived:

In [12]:
print_survived(gbc)

survived: 127 , percent: 30.38277511961723 | not_survived: 291 , percent: 69.61722488038278 | total: 418


#### Random Forest survived:

In [13]:
print_survived(rf)

survived: 127 , percent: 30.38277511961723 | not_survived: 291 , percent: 69.61722488038278 | total: 418


The most "peaceful" algorithm turned out to be logistic regression, and the most "bloodthirsty" were the ensemble models, which produced exactly the same numbers of survivors and non-survivors but assigned passengers to those groups differently.
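
Equal totals do not imply equal predictions: two models can mark the same number of passengers as survivors while picking different people. A minimal sketch of how that can be checked with sets, using made-up toy predictions; `survivor_ids` is a hypothetical helper, not part of the original notebook:

```python
def survivor_ids(predictions):
    # Collect the IDs that a model labelled as survived (label == 1).
    return {pid for pid, label in predictions if label == 1}


# Toy predictions for illustration only; not the real submissions.
model_a = [(1, 1), (2, 0), (3, 1), (4, 0)]
model_b = [(1, 0), (2, 1), (3, 1), (4, 0)]

a_ids = survivor_ids(model_a)
b_ids = survivor_ids(model_b)

same_total = len(a_ids) == len(b_ids)
only_a = a_ids - b_ids  # survivors according to model_a only
only_b = b_ids - a_ids  # survivors according to model_b only

print("same total:", same_total)
print("only in a:", sorted(only_a), "| only in b:", sorted(only_b))
```

When the totals are equal, the two one-sided sets necessarily have the same size.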

### Let's see how much the models' predictions differ:

#### Gradient Boosting and Logit:

In [14]:
print_mismatch(gbc, lr)

37
(910, 0)-(910, 1)
(911, 0)-(911, 1)
(913, 1)-(913, 0)
(928, 0)-(928, 1)
(967, 0)-(967, 1)
(972, 1)-(972, 0)
(980, 0)-(980, 1)
(982, 0)-(982, 1)
(1003, 0)-(1003, 1)
(1010, 1)-(1010, 0)
(1052, 0)-(1052, 1)
(1053, 1)-(1053, 0)
(1057, 0)-(1057, 1)
(1091, 0)-(1091, 1)
(1092, 0)-(1092, 1)
(1093, 1)-(1093, 0)
(1106, 1)-(1106, 0)
(1108, 0)-(1108, 1)
(1117, 0)-(1117, 1)
(1119, 0)-(1119, 1)
(1141, 0)-(1141, 1)
(1160, 0)-(1160, 1)
(1165, 0)-(1165, 1)
(1173, 1)-(1173, 0)
(1174, 0)-(1174, 1)
(1175, 0)-(1175, 1)
(1196, 0)-(1196, 1)
(1199, 1)-(1199, 0)
(1205, 0)-(1205, 1)
(1239, 0)-(1239, 1)
(1274, 0)-(1274, 1)
(1275, 0)-(1275, 1)
(1282, 0)-(1282, 1)
(1284, 1)-(1284, 0)
(1295, 0)-(1295, 1)
(1300, 0)-(1300, 1)
(1302, 0)-(1302, 1)


#### Random Forest and Logit:

In [15]:
print_mismatch(rf, lr)

31
(911, 0)-(911, 1)
(913, 1)-(913, 0)
(928, 0)-(928, 1)
(967, 0)-(967, 1)
(972, 1)-(972, 0)
(980, 0)-(980, 1)
(1003, 0)-(1003, 1)
(1051, 0)-(1051, 1)
(1052, 0)-(1052, 1)
(1053, 1)-(1053, 0)
(1091, 0)-(1091, 1)
(1092, 0)-(1092, 1)
(1093, 1)-(1093, 0)
(1108, 0)-(1108, 1)
(1117, 0)-(1117, 1)
(1119, 0)-(1119, 1)
(1141, 0)-(1141, 1)
(1160, 0)-(1160, 1)
(1165, 0)-(1165, 1)
(1173, 1)-(1173, 0)
(1174, 0)-(1174, 1)
(1196, 0)-(1196, 1)
(1199, 1)-(1199, 0)
(1205, 0)-(1205, 1)
(1239, 0)-(1239, 1)
(1251, 0)-(1251, 1)
(1274, 0)-(1274, 1)
(1282, 0)-(1282, 1)
(1295, 0)-(1295, 1)
(1300, 0)-(1300, 1)
(1302, 0)-(1302, 1)


#### Gradient Boosting and Random Forest:

In [16]:
print_mismatch(gbc, rf)

10
(910, 0)-(910, 1)
(982, 0)-(982, 1)
(1010, 1)-(1010, 0)
(1051, 1)-(1051, 0)
(1057, 0)-(1057, 1)
(1106, 1)-(1106, 0)
(1175, 0)-(1175, 1)
(1251, 1)-(1251, 0)
(1275, 0)-(1275, 1)
(1284, 1)-(1284, 0)


#### KNN and Logit:

In [17]:
print_mismatch(knn, lr)

70
(895, 1)-(895, 0)
(896, 0)-(896, 1)
(898, 0)-(898, 1)
(900, 0)-(900, 1)
(910, 0)-(910, 1)
(911, 0)-(911, 1)
(913, 1)-(913, 0)
(920, 1)-(920, 0)
(928, 0)-(928, 1)
(929, 0)-(929, 1)
(931, 1)-(931, 0)
(933, 1)-(933, 0)
(958, 0)-(958, 1)
(960, 1)-(960, 0)
(962, 0)-(962, 1)
(967, 0)-(967, 1)
(971, 0)-(971, 1)
(972, 1)-(972, 0)
(973, 1)-(973, 0)
(979, 0)-(979, 1)
(980, 0)-(980, 1)
(982, 0)-(982, 1)
(986, 1)-(986, 0)
(990, 0)-(990, 1)
(996, 0)-(996, 1)
(1003, 0)-(1003, 1)
(1005, 0)-(1005, 1)
(1019, 1)-(1019, 0)
(1036, 1)-(1036, 0)
(1040, 1)-(1040, 0)
(1050, 1)-(1050, 0)
(1052, 0)-(1052, 1)
(1053, 1)-(1053, 0)
(1057, 0)-(1057, 1)
(1063, 1)-(1063, 0)
(1083, 1)-(1083, 0)
(1089, 0)-(1089, 1)
(1091, 0)-(1091, 1)
(1092, 0)-(1092, 1)
(1093, 1)-(1093, 0)
(1097, 1)-(1097, 0)
(1098, 0)-(1098, 1)
(1108, 0)-(1108, 1)
(1117, 0)-(1117, 1)
(1119, 0)-(1119, 1)
(1141, 0)-(1141, 1)
(1160, 0)-(1160, 1)
(1172, 0)-(1172, 1)
(1173, 1)-(1173, 0)
(1174, 0)-(1174, 1)
(1183, 0)-(1183, 1)
(1196, 0)-(1196, 1)
(1199, 

The smallest differences are between Gradient Boosting and Random Forest, which suggests that the two algorithms work in a broadly similar way. Let's look at who these passengers are:

In [18]:
import pandas as pd

In [19]:
df_test = pd.read_csv('test.csv')

In [20]:
# numeric_only=True keeps median() from failing on text columns such as Name or Ticket.
df_test = df_test.fillna(df_test.median(numeric_only=True))

In [21]:
df_test['AgeGroup'] = df_test['Age']
df_test['AgeGroup'] = df_test['AgeGroup'].map(lambda age: int(age // 10) + 1)
df_test['IsChild'] = df_test.apply(lambda row: 1 if row['Age'] <= 18 and row['Parch'] > 0 else 0, axis=1)

In [22]:
p_ids = []
for m in mismatch(gbc, rf):
    if m[0][1] == 1:
        p_ids.append(m[0][0])
df_test[df_test['PassengerId'].isin(p_ids)]

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,AgeGroup,IsChild
118,1010,1,"Beattie, Mr. Thomson",male,36.0,0,0,13050,75.2417,C6,C,4,0
159,1051,3,"Peacock, Mrs. Benjamin (Edith Nile)",female,26.0,0,2,SOTON/O.Q. 3101315,13.775,,S,3,0
214,1106,3,"Andersson, Miss. Ida Augusta Margareta",female,38.0,4,2,347091,7.775,,S,4,0
359,1251,3,"Lindell, Mrs. Edvard Bengtsson (Elin Gerda Per...",female,30.0,1,0,349910,15.55,,S,4,0
392,1284,3,"Abbott, Master. Eugene Joseph",male,13.0,0,2,C.A. 2673,20.25,,S,2,1


In [23]:
p_ids = []
for m in mismatch(gbc, rf):
    if m[0][1] == 0:
        p_ids.append(m[0][0])
df_test[df_test['PassengerId'].isin(p_ids)]

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,AgeGroup,IsChild
18,910,3,"Ilmakangas, Miss. Ida Livija",female,27.0,1,0,STON/O2. 3101270,7.925,,S,3,0
90,982,3,"Dyker, Mrs. Adolf Fredrik (Anna Elisabeth Judi...",female,22.0,1,0,347072,13.9,,S,3,0
165,1057,3,"Kink-Heilmann, Mrs. Anton (Luise Heilmann)",female,26.0,1,1,315153,22.025,,S,3,0
283,1175,3,"Touma, Miss. Maria Youssef",female,9.0,1,1,2650,15.2458,,C,1,1
383,1275,3,"McNamee, Mrs. Neal (Eileen O'Leary)",female,19.0,1,0,376566,16.1,,S,2,0


It is not immediately obvious what Gradient Boosting and Random Forest rely on when deciding which category to put a passenger in. What stands out is that Gradient Boosting pays more attention to the SibSp feature; this may be a consequence of the model overfitting or underfitting.
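
The SibSp observation can be made concrete by tallying the values in the two tables above. A small sketch that uses only the SibSp values shown in those outputs; the variable names are illustrative:

```python
from collections import Counter

# SibSp values copied from the two tables above.
gbc_only_survived = [0, 0, 4, 1, 0]  # Gradient Boosting: 1, Random Forest: 0
rf_only_survived = [1, 1, 1, 1, 1]   # Gradient Boosting: 0, Random Forest: 1

gbc_tally = Counter(gbc_only_survived)
rf_tally = Counter(rf_only_survived)

print("GB says survived:", dict(gbc_tally))
print("RF says survived:", dict(rf_tally))
```

Every passenger that only Random Forest marks as a survivor has SibSp equal to 1, while most of the passengers that only Gradient Boosting marks as survivors have SibSp equal to 0. With ten rows in total, this is a hint rather than evidence.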

In [24]:
p_ids = []
for m in mismatch(lr, rf):
    if m[0][1] == 1:
        p_ids.append(m[0][0])
df_test[df_test['PassengerId'].isin(p_ids)]

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,AgeGroup,IsChild
19,911,3,"Assaf Khalil, Mrs. Mariana (Miriam"")""",female,45.0,0,0,2696,7.225,,C,5,0
36,928,3,"Roth, Miss. Sarah A",female,27.0,0,0,342712,8.05,,S,3,0
75,967,1,"Keeping, Mr. Edwin",male,32.5,0,0,113503,211.5,C132,C,4,0
88,980,3,"O'Donoghue, Ms. Bridget",female,27.0,0,0,364856,7.75,,Q,3,0
111,1003,3,"Shine, Miss. Ellen Natalia",female,27.0,0,0,330968,7.7792,,Q,3,0
159,1051,3,"Peacock, Mrs. Benjamin (Edith Nile)",female,26.0,0,2,SOTON/O.Q. 3101315,13.775,,S,3,0
160,1052,3,"Smyth, Miss. Julia",female,27.0,0,0,335432,7.7333,,Q,3,0
199,1091,3,"Rasmussen, Mrs. (Lena Jacobsen Solvang)",female,27.0,0,0,65305,8.1125,,S,3,0
200,1092,3,"Murphy, Miss. Nora",female,27.0,0,0,36568,15.5,,Q,3,0
216,1108,3,"Mahon, Miss. Bridget Delia",female,27.0,0,0,330924,7.8792,,Q,3,0


In [25]:
p_ids = []
for m in mismatch(lr, rf):
    if m[0][1] == 0:
        p_ids.append(m[0][0])
df_test[df_test['PassengerId'].isin(p_ids)]

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,AgeGroup,IsChild
21,913,3,"Olsen, Master. Artur Karl",male,9.0,0,1,C 17368,3.1708,,S,1,1
80,972,3,"Boulos, Master. Akar",male,6.0,1,1,2678,15.2458,,C,1,1
161,1053,3,"Touma, Master. Georges Youssef",male,7.0,1,1,2650,15.2458,,C,1,1
201,1093,3,"Danbom, Master. Gilbert Sigvard Emanuel",male,0.33,0,2,347080,14.4,,S,1,1
281,1173,3,"Peacock, Master. Alfred Edward",male,0.75,1,1,SOTON/O.Q. 3101315,13.775,,S,1,1
307,1199,3,"Aks, Master. Philip Frank",male,0.83,0,1,392091,9.35,,S,1,1


As we can see, compared with logistic regression, Random Forest tends to classify women older than average as non-survivors and, conversely, classifies more children as survivors. This suggests that, compared with logistic regression, the algorithm relies much more on the passenger's age, as well as on the Parch and IsChild features.
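
The age claim can be checked with a quick summary of the two tables above. A sketch using the Age and IsChild columns exactly as shown in those outputs; note that the first table lists only 10 of the 25 passengers for whom logistic regression alone predicts survival, so the numbers are indicative. `summarize` is a hypothetical helper:

```python
from statistics import mean


def summarize(rows):
    # rows: list of (age, is_child) pairs; returns mean age and share of children.
    ages = [age for age, _ in rows]
    children = [is_child for _, is_child in rows]
    return round(mean(ages), 2), sum(children) / len(children)


# (Age, IsChild) pairs copied from the two tables above.
lr_only_survived = [(45.0, 0), (27.0, 0), (32.5, 0), (27.0, 0), (27.0, 0),
                    (26.0, 0), (27.0, 0), (27.0, 0), (27.0, 0), (27.0, 0)]
rf_only_survived = [(9.0, 1), (6.0, 1), (7.0, 1), (0.33, 1), (0.75, 1), (0.83, 1)]

print("LR says survived:", summarize(lr_only_survived))
print("RF says survived:", summarize(rf_only_survived))
```

In the rows shown, every passenger that only Random Forest marks as a survivor has IsChild equal to 1, and none of the passengers that only logistic regression marks as survivors do.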