# Modeling and Prediction with Logistic Regression

This notebook uses Kaggle's tutorial competition, Titanic: Machine Learning from Disaster, to build a model from a few simple features in the training data, make predictions, and put them into the submission format.

In [54]:
import pandas as pd
import numpy as np
from pandas import DataFrame
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

This analysis focuses on sex (Sex) and passenger class (Pclass) as features, so the other, unneeded columns are dropped. Age is kept for now.

In [55]:
train_df = train_df.drop(['Name', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],axis=1)
test_df = test_df.drop(['Name', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], axis=1)

Missing Age values are filled with the mean Age, computed separately for each sex and rounded to the nearest integer.

In [56]:
age_train_mean = train_df.groupby('Sex').Age.mean()

def fage(x):
    if x.Sex == 'male':
        return round(age_train_mean['male'])
    if x.Sex == 'female':
        return round(age_train_mean['female'])

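# Assign the result back; inplace=True on a selected column may not propagate to train_df in newer pandas
train_df['Age'] = train_df['Age'].fillna(train_df[train_df.Age.isnull()].apply(fage, axis=1))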

In [57]:
age_test_mean = test_df.groupby('Sex').Age.mean()

def fage(x):
    if x.Sex == 'male':
        return round(age_test_mean['male'])
    if x.Sex == 'female':
        return round(age_test_mean['female'])

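# Assign the result back; inplace=True on a selected column may not propagate to test_df in newer pandas
test_df['Age'] = test_df['Age'].fillna(test_df[test_df.Age.isnull()].apply(fage, axis=1))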
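
The per-sex mean imputation above can also be written without a row-wise `apply`. The following is a minimal sketch on a made-up toy frame (the values are illustrative and not taken from the Titanic data); `fill_age_by_sex` is a hypothetical helper name, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def fill_age_by_sex(df):
    """Return a copy of df with missing Age filled by the rounded mean Age of the same Sex."""
    out = df.copy()
    # transform('mean') broadcasts each group's mean back onto that group's rows
    group_mean = out.groupby('Sex')['Age'].transform('mean').round()
    out['Age'] = out['Age'].fillna(group_mean)
    return out

# Toy data for illustration only
toy = pd.DataFrame({
    'Sex': ['male', 'male', 'male', 'female', 'female', 'female'],
    'Age': [20.0, 30.0, np.nan, 25.0, 35.0, np.nan],
})
filled = fill_age_by_sex(toy)
print(filled)
```

Because the helper returns a new frame, the same function could be applied to the training and test frames without redefining it for each.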

To turn Sex and Pclass into usable features, we look at the contents of each in turn.
We start with Sex.

In [58]:
sex_ct = pd.crosstab(train_df['Sex'], train_df['Survived'])
sex_ct

| Sex | Survived = 0 | Survived = 1 |
|---|---|---|
| female | 81 | 233 |
| male | 468 | 109 |
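
To read the Sex crosstab as rates rather than counts, each row can be divided by its total. This sketch rebuilds the counts from the output above by hand so that it runs without `train.csv`; the name `sex_counts` is introduced here for illustration. With the real data, passing `normalize='index'` to `pd.crosstab` is another way to get row-wise rates.

```python
import pandas as pd

# Counts copied from the Sex crosstab output above
sex_counts = pd.DataFrame(
    {0: [81, 468], 1: [233, 109]},
    index=pd.Index(['female', 'male'], name='Sex'),
)
sex_counts.columns.name = 'Survived'

# Divide each row by its total to get per-sex survival rates
sex_rate = sex_counts.div(sex_counts.sum(axis=1), axis=0)
print(sex_rate.round(3))
```

Roughly 74% of the women in the training data survived, against about 19% of the men.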


Women have a much higher survival rate, so we add a Female column that is 1 for women and 0 for men.

In [59]:
train_df['Female'] = train_df['Sex'].map({'male': 0, 'female' : 1}).astype(int)
test_df['Female'] = test_df['Sex'].map({'male': 0, 'female': 1}).astype(int)

In [60]:
train_df.head()

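|   | PassengerId | Survived | Pclass | Sex | Age | Female |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 0 |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 |
| 2 | 3 | 1 | 3 | female | 26.0 | 1 |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 |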


In [61]:
test_df.head()

|   | PassengerId | Pclass | Sex | Age | Female |
|---|---|---|---|---|---|
| 0 | 892 | 3 | male | 34.5 | 0 |
| 1 | 893 | 3 | female | 47.0 | 1 |
| 2 | 894 | 2 | male | 62.0 | 0 |
| 3 | 895 | 3 | male | 27.0 | 0 |
| 4 | 896 | 3 | female | 22.0 | 1 |


Next, we look at Pclass.

In [62]:
pclass_ct = pd.crosstab(train_df['Pclass'], train_df['Survived'])
pclass_ct

| Pclass | Survived = 0 | Survived = 1 |
|---|---|---|
| 1 | 80 | 136 |
| 2 | 97 | 87 |
| 3 | 372 | 119 |


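Pclass 3 (the cheapest passenger class) has a low survival rate: 119 of 491 passengers survived.<br>
Pclass is split into dummy variables, and the Pclass 3 column is then dropped so that Pclass 1 and Pclass 2 serve as explanatory variables with Pclass 3 as the baseline. Dropping one level removes a redundant column, since the three dummies always sum to 1.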

In [63]:
pclass_train_df = pd.get_dummies(train_df['Pclass'],prefix='Class')
pclass_test_df = pd.get_dummies(test_df['Pclass'],prefix='Class')

In [64]:
pclass_train_df.head()

|   | Class_1 | Class_2 | Class_3 |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 |
| 3 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 |


In [65]:
pclass_train_df = pclass_train_df.drop(['Class_3'], axis=1)
pclass_test_df = pclass_test_df.drop(['Class_3'], axis=1)
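
One of the three dummy columns can be dropped without losing information, because every row has exactly one 1 across them. The sketch below checks this on made-up class labels (not the real data) and shows that the dropped `Class_3` column can be recovered from the other two.

```python
import pandas as pd

# Toy Pclass values for illustration only
pclass = pd.Series([3, 1, 3, 2, 1], name='Pclass')

dummies = pd.get_dummies(pclass, prefix='Class').astype(int)
# Every row has exactly one 1, so the columns sum to 1
row_sums = dummies.sum(axis=1)

# Drop Class_3 so that it becomes the reference category
reduced = dummies.drop(columns=['Class_3'])
# Class_3 is fully determined by the remaining columns
recovered = 1 - reduced.sum(axis=1)
print(reduced)
```

With this encoding, the coefficients of `Class_1` and `Class_2` are read relative to third class.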

Joined back onto the original data frames, the result looks like this.

In [53]:
train_df = train_df.join(pclass_train_df)
test_df = test_df.join(pclass_test_df)

train_df.head()

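|   | PassengerId | Survived | Pclass | Sex | Age | Female | Class_1 | Class_2 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 0 | 0 | 0 |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 1 | 0 |
| 2 | 3 | 1 | 3 | female | 26.0 | 1 | 0 | 0 |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 1 | 0 |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 0 |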


In [34]:
test_df.head()

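|   | PassengerId | Pclass | Sex | Age | Female | Class_1 | Class_2 |
|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | male | 34.5 | 0 | 0 | 0 |
| 1 | 893 | 3 | female | 47.0 | 1 | 0 | 0 |
| 2 | 894 | 2 | male | 62.0 | 0 | 0 | 1 |
| 3 | 895 | 3 | male | 27.0 | 0 | 0 | 0 |
| 4 | 896 | 3 | female | 22.0 | 1 | 0 | 0 |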


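The model is meant to use 'Female', 'Class_1', and 'Class_2', so the remaining columns are dropped; the features become X and the survival labels become y. Note that, as written, the drop call below leaves Age in X as well, so the model is fitted on four columns.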

In [74]:
X = train_df.drop(['PassengerId','Survived','Pclass','Sex'],axis=1)
y = train_df.Survived

Now for the logistic regression itself.

In [75]:
clf = LogisticRegression()

clf.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
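
A fitted `LogisticRegression` stores one coefficient per feature in `coef_`, and exponentiating a coefficient gives the multiplicative change in the odds of survival for a one-unit increase in that feature. The sketch below is an illustration only: it rebuilds a data set with Female as the single feature from the Sex crosstab counts, so its numbers are not those of the notebook's model, and the names `model` and `odds_ratio` are introduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rebuild a Female/Survived data set from the Sex crosstab counts
# (female: 81 died, 233 survived; male: 468 died, 109 survived)
female = np.repeat([1, 1, 0, 0], [81, 233, 468, 109])
survived = np.repeat([0, 1, 0, 1], [81, 233, 468, 109])

model = LogisticRegression()
model.fit(female.reshape(-1, 1), survived)

# exp(coefficient) is the odds ratio for a one-unit change in the feature
odds_ratio = np.exp(model.coef_[0][0])
print(odds_ratio)
```

An odds ratio above 1 for Female means the model assigns women higher odds of survival than men. The same inspection could be applied to `clf.coef_` in the notebook, reading the columns in the order of `X.columns`.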

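Next, we check how accurate this model is. The score below is computed on the training data itself, so it is likely an optimistic estimate of performance on unseen data.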

In [83]:
predict_y = clf.predict(X)
accuracy_score(y, predict_y)

0.7867564534231201
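
For context, this value equals (468 + 233) / 891, which is the training accuracy of a simple rule that predicts survival for every woman and death for every man, using the counts from the Sex crosstab. The sketch below reproduces that arithmetic from the counts alone. The match suggests, but does not prove, that the fitted model is making essentially the same predictions as that rule.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Rebuild labels from the Sex crosstab counts
# (female: 81 died, 233 survived; male: 468 died, 109 survived)
is_female = np.repeat([1, 1, 0, 0], [81, 233, 468, 109])
y_true = np.repeat([0, 1, 0, 1], [81, 233, 468, 109])

# Sex-only rule: predict survival exactly when the passenger is female
y_rule = is_female

rule_accuracy = accuracy_score(y_true, y_rule)
print(rule_accuracy)  # (233 + 468) / 891
```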

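That is not especially high.<br>
Only simple preprocessing was applied here; richer preprocessing and additional features would likely improve the accuracy.<br>
Next, the model fitted above is applied to the test data.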

In [77]:
test_model_df = test_df.drop(['PassengerId','Pclass','Sex'],axis=1)

In [79]:
test_predict = clf.predict(test_model_df)
print(test_predict)

[0 1 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 1 1 0 0 1 0 1 0 1 0 0 0 0 0 1 1 0 0 1
 1 0 0 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 1 1 0 1 0
 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0
 1 1 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0
 0 0 1 0 0 1 0 0 1 1 0 1 1 0 1 0 0 1 0 0 1 1 0 0 0 0 0 1 1 0 1 1 0 0 1 0 1
 0 1 0 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 1 0 0 1 0 1 0 0 0 0 1 1 0 1 0 1 0 1 0
 1 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 1
 0 0 0 1 1 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0
 1 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0
 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 1 1 0 1 1 0
 0 1 0 0 1 1 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0
 0 1 1 1 1 1 0 1 0 0 0]


Finally, the predictions are saved in the format specified on the Kaggle competition page.

In [None]:
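# Reload the test set to read PassengerId in its original order
test_df = pd.read_csv('test.csv')
PassengerId = np.array(test_df["PassengerId"]).astype(int)
# Index the predictions by PassengerId, with a single Survived column
my_solution = pd.DataFrame(test_predict, PassengerId, columns=["Survived"])
my_solution.to_csv("my_solution_one.csv", index_label=["PassengerId"])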
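
Before uploading, it can help to read the file back and check its layout. This is a minimal sketch, assuming the layout produced by the cell above (a `PassengerId` column followed by a `Survived` column of 0/1 values). It uses a small made-up prediction array and an in-memory buffer instead of the real file, and the names `solution` and `check` are introduced for illustration.

```python
import io
import numpy as np
import pandas as pd

# Made-up IDs and predictions for illustration only
passenger_id = np.array([892, 893, 894])
predictions = np.array([0, 1, 0])

solution = pd.DataFrame({'Survived': predictions}, index=passenger_id)

# Write to an in-memory buffer the same way the notebook writes to disk
buffer = io.StringIO()
solution.to_csv(buffer, index_label='PassengerId')
buffer.seek(0)

# Read it back and check the header and the value range
check = pd.read_csv(buffer)
print(check.columns.tolist())
print(check['Survived'].isin([0, 1]).all())
```

For the real file, the same checks could be run on `pd.read_csv("my_solution_one.csv")`, together with a check that the row count matches the number of rows in `test.csv`.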