## Data layout
- train.csv : passenger data + survival flag (Survived)
- test.csv  : passenger data only (no Survived column)

Train the model on train and generate predictions for test.


### Column notes
- Pclass: ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- SibSp: number of siblings/spouses aboard
- Parch: number of parents/children aboard
- Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

In [1]:
import pandas as pd

df_train = pd.read_csv("train.csv")
df_test  = pd.read_csv("test.csv")

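### Overview of train
- Passenger data with the survival flag (training data)
- Missing values
  - Age : about 20% missing (177 of 891)
  - Cabin : mostly missing (687 of 891)
  - Embarked : only 2 values missing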

In [2]:
#info
print(df_train.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None


In [3]:
#Check row count and missing values
print(f"- rows, columns \n{df_train.shape}")
print(f"- missing values \n{df_train.isnull().sum()}")

- rows, columns 
(891, 12)
- missing values 
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
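
To put the counts above in proportion, the share of missing values per column can be worked out from the `isnull().sum()` output and the 891 rows. The snippet below is a minimal sketch in plain Python; the dictionary `missing_counts` is hand-copied from the output for illustration. With the DataFrame loaded, `df_train.isnull().mean()` should give the same ratios directly.

```python
# Missing counts copied from the df_train.isnull().sum() output above.
n_rows = 891
missing_counts = {"Age": 177, "Cabin": 687, "Embarked": 2}

# Ratio of missing values per column, as a fraction of all rows.
missing_ratio = {col: count / n_rows for col, count in missing_counts.items()}

for col, ratio in missing_ratio.items():
    print(f"{col}: {ratio:.1%}")
```

This gives roughly 19.9% for Age, 77.1% for Cabin, and 0.2% for Embarked.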


In [4]:
#head
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


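## 1-Preprocessing train
1. Drop unused columns (they may be used again later)
  - Name : the title part is planned to become a feature
  - Ticket : dropped because the format is inconsistent
  - Cabin : dropped for now because most values are missing
  - Embarked : dropped on the assumption that the port of embarkation has little bearing on survival (not verified here)
2. Encode categorical variables
  - Sex : add dummy variables for ["male","female"]
    - then drop the Sex column
3. Impute missing values
  - Age : fill with the median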
## 2-Preparation for analysis
1. Split into features and target variable
2. Create validation data (split train 8:2)

## 3-Training, validation, and accuracy check
1. Create the model
  - This time, use logistic regression (LogisticRegression)
2. Train
3. Predict on the validation data
4. Check the accuracy

In [5]:
#1-1 Drop unused columns
drop_cols = ["Name","Ticket","Cabin","Embarked"]
df_train = df_train.drop(columns=drop_cols)
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare
0,1,0,3,male,22.0,1,0,7.25
1,2,1,1,female,38.0,1,0,71.2833
2,3,1,3,female,26.0,0,0,7.925
3,4,1,1,female,35.0,1,0,53.1
4,5,0,3,male,35.0,0,0,8.05


In [6]:
#1-2 Convert non-numeric columns to numeric
df_train_sex = pd.get_dummies(df_train["Sex"])
df_train = pd.concat([df_train,df_train_sex],axis=1)

## Drop the Sex column now that it is encoded
df_train = df_train.drop(columns=["Sex"])
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,female,male
0,1,0,3,22.0,1,0,7.25,False,True
1,2,1,1,38.0,1,0,71.2833,True,False
2,3,1,3,26.0,0,0,7.925,True,False
3,4,1,1,35.0,1,0,53.1,True,False
4,5,0,3,35.0,0,0,8.05,False,True
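
Because every row has exactly one of `female` / `male` set, the two dummy columns carry the same information. The notebook keeps both; as an alternative, a single indicator column could be produced with `drop_first=True`. The sketch below uses a tiny made-up Series rather than the real data.

```python
import pandas as pd

# Tiny stand-in for the Sex column (illustrative values, not the real data).
sex = pd.Series(["male", "female", "female", "male"], name="Sex")

# drop_first=True drops the first category in sorted order ("female"),
# leaving a single "male" indicator column; astype(int) turns it into 0/1.
sex_encoded = pd.get_dummies(sex, drop_first=True).astype(int)
print(sex_encoded)
```

Whether this changes the validation score is not checked here; it mainly removes a redundant column.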


In [7]:
#1-3 Impute missing values
age_median = df_train["Age"].median()
df_train["Age"] = df_train["Age"].fillna(age_median)
df_train.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Age            0
SibSp          0
Parch          0
Fare           0
female         0
male           0
dtype: int64

In [8]:
#2-1 Split into features and target variable
X = df_train.drop(columns=["Survived", "PassengerId"])
y = df_train["Survived"]

#2-2 Create validation data (split train 8:2)
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print("X_train:", X_train.shape)
print("X_valid:", X_valid.shape)
print("y_train:", y_train.shape)
print("y_valid:", y_valid.shape)

X_train: (712, 7)
X_valid: (179, 7)
y_train: (712,)
y_valid: (179,)


In [9]:
#3-1 Create the model
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

#3-2 Train
model.fit(X_train, y_train)

#3-3 Predict on the validation data
y_pred = model.predict(X_valid)

#3-4 Check accuracy
from sklearn.metrics import accuracy_score, classification_report

print("Accuracy Score:", accuracy_score(y_valid, y_pred))
print(classification_report(y_valid, y_pred))

Accuracy Score: 0.770949720670391
              precision    recall  f1-score   support

           0       0.80      0.84      0.82       110
           1       0.72      0.67      0.69        69

    accuracy                           0.77       179
   macro avg       0.76      0.75      0.75       179
weighted avg       0.77      0.77      0.77       179



STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
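
The warning above says the solver stopped at its iteration limit and suggests raising `max_iter` or scaling the data. A minimal sketch of both remedies follows, using a scikit-learn pipeline on synthetic data. The arrays `X_demo` and `y_demo` are made up for illustration and are not the Titanic features, and `max_iter=1000` is an arbitrary choice above the default of 100.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for features on very different scales
# (illustrative only, not the Titanic data).
rng = np.random.default_rng(0)
n = 200
X_demo = np.column_stack([
    rng.integers(1, 4, size=n),      # class-like column: 1..3
    rng.uniform(1, 80, size=n),      # age-like column
    rng.exponential(30.0, size=n),   # fare-like column
])
noise = rng.normal(0, 0.5, n)
y_demo = (X_demo[:, 2] / 30.0 - X_demo[:, 0] / 2.0 + noise > -0.3).astype(int)

# Standardize the features, then fit logistic regression with a higher
# iteration cap than the default.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

# n_iter_ reports how many iterations the solver actually used.
n_iter = scaled_model[-1].n_iter_[0]
print("iterations used:", n_iter)
```

In this notebook the same pipeline could be fitted on `X_train` and `y_train` in place of the bare `LogisticRegression()`. The scores would then differ from the ones printed above.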


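### About the evaluation metrics

- **Accuracy**
  - 0.7709 = 77.1%  
  - → roughly 77 out of 100 passengers were predicted correctly  

- **Precision**
  - Label "0" (died): 0.80  
  - Label "1" (survived): 0.72  
  - → 72% of the passengers predicted as "survived" actually survived  

- **Recall**
  - Died: 0.84  
  - Survived: 0.67  
  - → about one third of the actual survivors are missed  

- **F1-score (harmonic mean of precision and recall)**
  - Survived: 0.69  
  - → balances precision and recall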
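
The figures in the `classification_report` output can be reproduced by hand from confusion-matrix counts. The counts used below (TN = 92, FP = 18, FN = 23, TP = 46) are not printed by the notebook. They are inferred as the integers consistent with the reported support (110 and 69) and the rounded scores, so treat them as an assumption.

```python
# Confusion-matrix counts with label 1 (survived) as the positive class.
# Inferred from the printed report, not taken from notebook output.
tn, fp, fn, tp = 92, 18, 23, 46

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # predicted survivors who really survived
recall = tp / (tp + fn)      # actual survivors the model found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy : {accuracy:.4f}")
print(f"precision: {precision:.2f}")
print(f"recall   : {recall:.2f}")
print(f"f1       : {f1:.2f}")
```

With the real predictions, `sklearn.metrics.confusion_matrix(y_valid, y_pred)` would give the actual counts.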

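### Overview of test
- Passenger data without the survival flag (test data)
- Missing values
  - Age : about 20% missing (86 of 418)
  - Fare : one value missing (none missing in train)
  - Cabin : mostly missing (327 of 418)
  - Embarked : none missing (train has missing values)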

In [10]:
#info
print(df_test.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.1+ KB
None


In [11]:
#Check row count and missing values
print(f"- rows, columns \n{df_test.shape}")
print(f"- missing values \n{df_test.isnull().sum()}")

- rows, columns 
(418, 11)
- missing values 
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64


In [12]:
#head
df_test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


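## 1-Preprocessing test
1. Drop unused columns
  - Name : the title part is planned to become a feature
  - Ticket : dropped because the format is inconsistent
  - Cabin : dropped for now because most values are missing
  - Embarked : dropped on the assumption that the port of embarkation has little bearing on survival (not verified here)
2. Encode categorical variables
  - Sex : add dummy variables for ["male","female"]
    - then drop the Sex column
3. Impute missing values
  - Age : fill with the median
  - Fare : fill with the median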

## 2-Creating the submission file
1. Align the column layout with train
2. Predict with the model
3. Assemble the results in the submission format
4. Write to csv
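
For step 1, one way to guarantee that test has the same columns in the same order as the training features is `DataFrame.reindex(columns=...)`. The sketch below uses a small made-up frame; `train_columns` mirrors the feature columns of `X_train` in this notebook, and the `_demo` names are illustrative.

```python
import pandas as pd

# Column order the model was trained on (as in X_train in this notebook).
train_columns = ["Pclass", "Age", "SibSp", "Parch", "Fare", "female", "male"]

# Illustrative test rows with an extra ID column and a different column order.
df_test_demo = pd.DataFrame({
    "PassengerId": [892, 893],
    "male": [True, False],
    "female": [False, True],
    "Fare": [7.8292, 7.0],
    "Pclass": [3, 3],
    "Age": [34.5, 47.0],
    "SibSp": [0, 1],
    "Parch": [0, 0],
})

# Keep only the training columns, in the training order.
X_test_demo = df_test_demo.reindex(columns=train_columns)

# Fail early if a training column was absent (reindex would fill it with NaN).
assert not X_test_demo.isnull().any().any()
print(list(X_test_demo.columns))
```

In the notebook itself, `df_test.reindex(columns=X_train.columns)` would play the same role as dropping PassengerId by hand.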

In [13]:
#1-1 Drop unused columns
drop_cols = ["Name","Ticket","Cabin","Embarked"]
df_test = df_test.drop(columns=drop_cols)
df_test.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare
0,892,3,male,34.5,0,0,7.8292
1,893,3,female,47.0,1,0,7.0
2,894,2,male,62.0,0,0,9.6875
3,895,3,male,27.0,0,0,8.6625
4,896,3,female,22.0,1,1,12.2875


In [14]:
#1-2 Convert non-numeric columns to numeric
df_test_sex = pd.get_dummies(df_test["Sex"])
df_test = pd.concat([df_test,df_test_sex],axis=1)

## Drop the Sex column now that it is encoded
df_test = df_test.drop(columns=["Sex"])
df_test.head()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,female,male
0,892,3,34.5,0,0,7.8292,False,True
1,893,3,47.0,1,0,7.0,True,False
2,894,2,62.0,0,0,9.6875,False,True
3,895,3,27.0,0,0,8.6625,False,True
4,896,3,22.0,1,1,12.2875,True,False


In [15]:
#1-3 Impute missing values (using the medians computed on train)
df_test["Age"] = df_test["Age"].fillna(age_median)

fare_median = df_train["Fare"].median()
df_test["Fare"] = df_test["Fare"].fillna(fare_median)

df_test.isnull().sum()

PassengerId    0
Pclass         0
Age            0
SibSp          0
Parch          0
Fare           0
female         0
male           0
dtype: int64

In [16]:
#2-1 Align the column layout with train
X_test = df_test.drop(columns=["PassengerId"])

print(X_train)
print(X_test)

     Pclass   Age  SibSp  Parch     Fare  female   male
502       3  28.0      0      0   7.6292    True  False
464       3  28.0      0      0   8.0500   False   True
198       3  28.0      0      0   7.7500    True  False
765       1  51.0      1      0  77.9583    True  False
421       3  21.0      0      0   7.7333   False   True
..      ...   ...    ...    ...      ...     ...    ...
131       3  20.0      0      0   7.0500   False   True
490       3  28.0      1      0  19.9667   False   True
528       3  39.0      0      0   7.9250   False   True
48        3  28.0      2      0  21.6792   False   True
80        3  22.0      0      0   9.0000   False   True

[712 rows x 7 columns]
     Pclass   Age  SibSp  Parch      Fare  female   male
0         3  34.5      0      0    7.8292   False   True
1         3  47.0      1      0    7.0000    True  False
2         2  62.0      0      0    9.6875   False   True
3         3  27.0      0      0    8.6625   False   True
4         3  22.0  

In [None]:
#2-2 Predict with the model
y_pred_test = model.predict(X_test)

In [24]:
#2-3 Assemble the results in the submission format
submission = pd.DataFrame({
    "PassengerId": df_test["PassengerId"],
    "Survived": y_pred_test
})

submission.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,1


In [22]:
#2-4 Write to csv
submission.to_csv("submission.csv", index=False)
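
Before uploading, it can help to check the written file. The helper below is a hypothetical sketch, not part of the notebook: `check_submission` verifies the header, the row count (418, matching the rows in test), and that Survived only contains 0 or 1. The demo feeds it an in-memory two-row file instead of the real `submission.csv`.

```python
import csv
import io

def check_submission(csv_file, expected_rows=418):
    """Hypothetical sanity check for a PassengerId/Survived submission file."""
    reader = csv.DictReader(csv_file)
    rows = list(reader)
    problems = []
    if reader.fieldnames != ["PassengerId", "Survived"]:
        problems.append(f"unexpected header: {reader.fieldnames}")
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    if any(row.get("Survived") not in {"0", "1"} for row in rows):
        problems.append("Survived must be 0 or 1")
    return problems

# In-memory stand-in for submission.csv with two rows.
demo = io.StringIO("PassengerId,Survived\n892,0\n893,1\n")
print(check_submission(demo, expected_rows=2))
```

For the real file, the call would look like `with open("submission.csv", newline="") as f: check_submission(f)`; an empty list means no problems were found.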