# Modeling and Prediction with Logistic Regression

This notebook uses Kaggle's tutorial competition, Titanic: Machine Learning from Disaster, to build a model from a few simple features, make predictions on the test data, and put the result into the submission format.

In [24]:
import pandas as pd
import numpy as np
from pandas import DataFrame
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
%matplotlib inline

train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

Only a few simple features are used in this analysis: Pclass and Sex, plus Age, which stays in the data and is imputed in the next step. The remaining columns are not needed, so they are dropped.

In [25]:
train_df = train_df.drop(['Name', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],axis=1)
test_df = test_df.drop(['Name', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], axis=1)

Missing Age values are filled with the mean Age of passengers of the same sex, rounded to the nearest integer.

In [26]:
age_train_mean = train_df.groupby('Sex').Age.mean()

def fage(x):
    if x.Sex == 'male':
        return round(age_train_mean['male'])
    if x.Sex == 'female':
        return round(age_train_mean['female'])

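# Assign the result back; fillna(inplace=True) on train_df.Age may not update train_df in recent pandas versions.
train_df['Age'] = train_df['Age'].fillna(train_df[train_df.Age.isnull()].apply(fage, axis=1))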

In [27]:
age_test_mean = test_df.groupby('Sex').Age.mean()

def fage(x):
    if x.Sex == 'male':
        return round(age_test_mean['male'])
    if x.Sex == 'female':
        return round(age_test_mean['female'])

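# Assign the result back; fillna(inplace=True) on test_df.Age may not update test_df in recent pandas versions.
test_df['Age'] = test_df['Age'].fillna(test_df[test_df.Age.isnull()].apply(fage, axis=1))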
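The same group-wise imputation can be written more compactly with `groupby(...).transform`. The following is only a sketch on a made-up toy frame (`toy_df` and its values are assumptions for illustration, not the Titanic data): it broadcasts each sex's mean Age back to every row and uses that to fill the gaps.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up).
toy_df = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [20.0, 32.0, np.nan, 30.0, np.nan, 36.0],
})

# Mean Age per Sex, broadcast to every row of that group, then rounded.
group_mean = toy_df.groupby("Sex")["Age"].transform("mean").round()

# Fill only the missing entries and assign the result back to the column.
toy_df["Age"] = toy_df["Age"].fillna(group_mean)
print(toy_df)
```

Because `transform` returns a Series aligned with the original index, no helper function or row-wise `apply` is needed.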

In [28]:
sex_ct = pd.crosstab(train_df['Sex'], train_df['Survived'])
sex_ct

Sex,Survived=0,Survived=1
female,81,233
male,468,109


In [29]:
train_df['Female'] = train_df['Sex'].map({'male': 0, 'female' : 1}).astype(int)
test_df['Female'] = test_df['Sex'].map({'male': 0, 'female': 1}).astype(int)

In [30]:
pclass_ct = pd.crosstab(train_df['Pclass'], train_df['Survived'])
pclass_ct

Pclass,Survived=0,Survived=1
1,80,136
2,97,87
3,372,119


In [31]:
pclass_train_df = pd.get_dummies(train_df['Pclass'],prefix='Class')
pclass_test_df = pd.get_dummies(test_df['Pclass'],prefix='Class')

In [32]:
pclass_train_df = pclass_train_df.drop(['Class_3'], axis=1)
pclass_test_df = pclass_test_df.drop(['Class_3'], axis=1)

train_df = train_df.join(pclass_train_df)
test_df = test_df.join(pclass_test_df)
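Class_3 is dropped because the three indicator columns always sum to 1, so one of them is redundant and third class serves as the reference category. One thing to watch with this approach is that `pd.get_dummies` only creates columns for values it actually sees. The sketch below uses made-up toy frames (`toy_train` and `toy_test` are hypothetical) to show one way of forcing the test indicators to match the training columns, in case a class were absent from one file.

```python
import pandas as pd

# Toy data: the test frame deliberately contains no first-class passenger.
toy_train = pd.DataFrame({"Pclass": [1, 2, 3, 3]})
toy_test = pd.DataFrame({"Pclass": [3, 2, 2]})

# Indicator columns for the training data, with Class_3 as the reference category.
train_dummies = pd.get_dummies(toy_train["Pclass"], prefix="Class").astype(int)
train_dummies = train_dummies.drop(columns=["Class_3"])

# Build test indicators, then align them to the training columns:
# columns missing from the test data are added as 0, extra ones are dropped.
test_dummies = pd.get_dummies(toy_test["Pclass"], prefix="Class").astype(int)
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(test_dummies)
```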

In [33]:
train_df.head()

,PassengerId,Survived,Pclass,Sex,Age,Female,Class_1,Class_2
0,1,0,3,male,22.0,0,0,0
1,2,1,1,female,38.0,1,1,0
2,3,1,3,female,26.0,1,0,0
3,4,1,1,female,35.0,1,1,0
4,5,0,3,male,35.0,0,0,0


In [34]:
test_df.head()

,PassengerId,Pclass,Sex,Age,Female,Class_1,Class_2
0,892,3,male,34.5,0,0,0
1,893,3,female,47.0,1,0,0
2,894,2,male,62.0,0,0,1
3,895,3,male,27.0,0,0,0
4,896,3,female,22.0,1,0,0


In [35]:
X = train_df.drop(['PassengerId','Survived','Pclass','Sex'],axis=1)
y = train_df.Survived

In [36]:
clf = LogisticRegression()

clf.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
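For reference, the model fitted here on the four columns left in X (Age, Female, Class_1, Class_2) can be written as follows. The coefficient symbols are notation introduced for this note, not names used by scikit-learn.

$$
z = \beta_0 + \beta_{\mathrm{Age}}\,\mathrm{Age} + \beta_{\mathrm{Female}}\,\mathrm{Female} + \beta_{1}\,\mathrm{Class\_1} + \beta_{2}\,\mathrm{Class\_2}
$$

$$
P(\mathrm{Survived} = 1 \mid x) = \frac{1}{1 + e^{-z}}
$$

`clf.predict` returns 1 when this probability exceeds 0.5, which is the same as $z > 0$. With `penalty='l2'` and `C=1.0`, as shown in the output above, the fit is regularized rather than a plain maximum-likelihood estimate.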

In [37]:
predict_y = clf.predict(X)
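`accuracy_score` is imported at the top of the notebook but never called. A natural use is to compare predictions on the training data with the known labels. The sketch below shows the call on a synthetic toy dataset (`X_toy` and `y_toy` are made up), so the number it prints says nothing about the Titanic model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic one-feature toy data (not the Titanic set).
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [7.0], [8.0], [9.0], [10.0]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

toy_clf = LogisticRegression()
toy_clf.fit(X_toy, y_toy)

# Predict on the same rows used for fitting and measure the share of correct labels.
toy_pred = toy_clf.predict(X_toy)
train_accuracy = accuracy_score(y_toy, toy_pred)
print(train_accuracy)
```

In the notebook the equivalent call would be `accuracy_score(y, predict_y)`. Accuracy measured on the training rows tends to be optimistic, so it is best read as a sanity check rather than an estimate of the leaderboard score.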

In [38]:
coeff_df = DataFrame([X.columns, clf.coef_[0]]).T
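Building the frame from a list and transposing it gives columns named 0 and 1 with object dtype. A possible alternative, sketched here on made-up toy data (`toy_X`, `toy_y`, and the column labels are assumptions), names the columns and adds the odds ratio `exp(coefficient)`, which is often easier to read than the raw coefficient.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy features and labels (made up); the label simply equals the Female column.
toy_X = pd.DataFrame({
    "Female":  [0, 0, 0, 1, 1, 1, 0, 1],
    "Class_1": [0, 1, 0, 1, 0, 1, 0, 0],
})
toy_y = pd.Series([0, 0, 0, 1, 1, 1, 0, 1])

toy_clf = LogisticRegression().fit(toy_X, toy_y)

# One row per feature, with named and numeric columns.
coeff_table = pd.DataFrame({
    "feature": toy_X.columns,
    "coefficient": toy_clf.coef_[0],
})
# exp(coefficient) is the multiplicative change in the odds per unit increase.
coeff_table["odds_ratio"] = np.exp(coeff_table["coefficient"])
print(coeff_table)
```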

In [39]:
test_model_df = test_df.drop(['PassengerId','Pclass','Sex'],axis=1)

In [40]:
test_predict = clf.predict(test_model_df)

Save the finished predictions in the format specified on the Kaggle competition page.

In [None]:
test_df = pd.read_csv('test.csv')
PassengerId = np.array(test_df["PassengerId"]).astype(int)
my_solution = pd.DataFrame(test_predict, PassengerId, columns = ["Survived"])
my_solution.to_csv("my_solution_one.csv", index_label = ["PassengerId"])
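Before uploading, it can help to confirm that the file has exactly the two columns PassengerId and Survived and one row per test passenger. The sketch below builds a small made-up submission (`toy_ids` and `toy_pred` are hypothetical values) and writes it to an in-memory buffer instead of a file, then reads it back to check the layout.

```python
import io

import numpy as np
import pandas as pd

# Made-up IDs and predictions standing in for the real test set.
toy_ids = np.array([892, 893, 894])
toy_pred = np.array([0, 1, 0])

# Two explicit columns; index=False keeps the row index out of the CSV.
submission = pd.DataFrame({"PassengerId": toy_ids, "Survived": toy_pred})

buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read the CSV back and inspect its header and size.
check = pd.read_csv(io.StringIO(csv_text))
print(check.columns.tolist(), len(check))
```

Writing the ID as a regular column with `index=False` produces the same header as the index-based version above; which one to use is a matter of taste.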