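We fit logistic regression and a random forest on the training data, and use whichever performs better to classify the test data.

First, we classify with logistic regression.

1. Load vehicle.csv
2. Preprocess the data (drop columns with many missing values, add dummy variables, standardize or scale, normalize)
3. Classify with logistic regression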

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, normalize
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report

In [2]:
# Load the data
data_train = pd.read_csv('/Users/ichinose_kaori/Downloads/M8exercise/vehicle.csv', encoding="utf-8")
data_train

Unnamed: 0,destination,passanger,weather,temperature,time,coupon,expiration,gender,age,maritalStatus,...,CoffeeHouse,CarryAway,RestaurantLessThan20,Restaurant20To50,toCoupon_GEQ5min,toCoupon_GEQ15min,toCoupon_GEQ25min,direction_same,direction_opp,Y
0,Home,Alone,Sunny,80,10PM,Carry out & Take away,1d,Female,26,Single,...,less1,4~8,1~3,never,1,0,0,1,0,1
1,No Urgent Place,Alone,Sunny,55,10AM,Coffee House,2h,Female,26,Single,...,less1,4~8,less1,4~8,1,0,0,0,1,1
2,Home,Alone,Snowy,30,6PM,Coffee House,1d,Female,46,Unmarried partner,...,less1,less1,1~3,less1,1,1,0,0,1,0
3,Home,Alone,Snowy,30,10PM,Coffee House,2h,Male,31,Married partner,...,less1,4~8,1~3,4~8,1,1,0,0,1,0
4,Home,Alone,Rainy,55,10PM,Coffee House,1d,Female,26,Single,...,less1,1~3,gt8,gt8,1,0,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9508,No Urgent Place,Alone,Rainy,55,10AM,Bar,1d,Female,31,Married partner,...,less1,gt8,4~8,less1,1,1,0,0,1,0
9509,No Urgent Place,Friend(s),Sunny,80,2PM,Coffee House,2h,Female,31,Unmarried partner,...,never,4~8,,never,1,1,0,0,1,1
9510,Work,Alone,Sunny,55,7AM,Coffee House,1d,Female,50plus,Single,...,never,4~8,4~8,less1,1,1,0,0,1,0
9511,Work,Alone,Sunny,80,7AM,Restaurant(20-50),2h,Female,31,Single,...,less1,gt8,1~3,less1,1,0,0,1,0,0


In [3]:
# Check the dtypes and missing values.
data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9513 entries, 0 to 9512
Data columns (total 26 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   destination           9513 non-null   object
 1   passanger             9513 non-null   object
 2   weather               9513 non-null   object
 3   temperature           9513 non-null   int64 
 4   time                  9513 non-null   object
 5   coupon                9513 non-null   object
 6   expiration            9513 non-null   object
 7   gender                9513 non-null   object
 8   age                   9513 non-null   object
 9   maritalStatus         9513 non-null   object
 10  has_children          9513 non-null   int64 
 11  education             9513 non-null   object
 12  occupation            9513 non-null   object
 13  income                9513 non-null   object
 14  car                   86 non-null     object
 15  Bar                   9423 non-null   
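
The `info()` output above shows that `car` has only 86 non-null values out of 9513 rows. As a minimal sketch, the missing ratio per column could be computed directly and used to pick columns to drop. The toy DataFrame and the 0.5 threshold below are assumptions for illustration, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for vehicle.csv (hypothetical values).
toy = pd.DataFrame({
    "weather": ["Sunny", "Rainy", "Snowy", "Sunny"],
    "car": [np.nan, np.nan, np.nan, "Mazda5"],
    "Bar": ["never", np.nan, "less1", "1~3"],
})

# Fraction of missing values in each column.
missing_ratio = toy.isna().mean()

# Assumed threshold: drop columns where more than half of the values are missing.
threshold = 0.5
cols_to_drop = missing_ratio[missing_ratio > threshold].index.tolist()
cleaned = toy.drop(columns=cols_to_drop)

print(missing_ratio)
print(cols_to_drop)
```

Columns with only a few missing values, such as `Bar` here, are kept; `pd.get_dummies` ignores missing entries by default, so those rows get zeros in every dummy column for that feature.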

In [4]:
# Check the columns before dropping any.
data_train.dtypes

destination             object
passanger               object
weather                 object
temperature              int64
time                    object
coupon                  object
expiration              object
gender                  object
age                     object
maritalStatus           object
has_children             int64
education               object
occupation              object
income                  object
car                     object
Bar                     object
CoffeeHouse             object
CarryAway               object
RestaurantLessThan20    object
Restaurant20To50        object
toCoupon_GEQ5min         int64
toCoupon_GEQ15min        int64
toCoupon_GEQ25min        int64
direction_same           int64
direction_opp            int64
Y                        int64
dtype: object

In [5]:
# Drop car, which is mostly missing (86 non-null values out of 9513).
data_train = data_train.drop(["car"], axis=1)
data_train.dtypes

destination             object
passanger               object
weather                 object
temperature              int64
time                    object
coupon                  object
expiration              object
gender                  object
age                     object
maritalStatus           object
has_children             int64
education               object
occupation              object
income                  object
Bar                     object
CoffeeHouse             object
CarryAway               object
RestaurantLessThan20    object
Restaurant20To50        object
toCoupon_GEQ5min         int64
toCoupon_GEQ15min        int64
toCoupon_GEQ25min        int64
direction_same           int64
direction_opp            int64
Y                        int64
dtype: object

In [6]:
# Most columns are object dtype and must be converted to dummy variables. List the column names so they are easy to specify.
data_train.columns

Index(['destination', 'passanger', 'weather', 'temperature', 'time', 'coupon',
       'expiration', 'gender', 'age', 'maritalStatus', 'has_children',
       'education', 'occupation', 'income', 'Bar', 'CoffeeHouse', 'CarryAway',
       'RestaurantLessThan20', 'Restaurant20To50', 'toCoupon_GEQ5min',
       'toCoupon_GEQ15min', 'toCoupon_GEQ25min', 'direction_same',
       'direction_opp', 'Y'],
      dtype='object')

In [7]:
# Create dummy variables for the categorical columns
cat_col = ['destination', 'passanger', 'weather', 'time', 'coupon','expiration', 'gender', 'age', 'maritalStatus','education', 'occupation', 'income', 'Bar', 'CoffeeHouse', 'CarryAway','RestaurantLessThan20', 'Restaurant20To50']
data_train=pd.get_dummies(data_train,columns=cat_col)

Run the classification

In [8]:
# Split the data into training and test sets.
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(data_train, test_size=0.2, random_state=0)
print('Training samples: {}, test samples: {}'.format(len(train_df), len(test_df)))

Training samples: 7610, test samples: 1903


In [9]:
# Standardize or scale the features, then normalize each row.

standard_f = False  # True: StandardScaler, False: MinMaxScaler
if standard_f:
    from sklearn.preprocessing import StandardScaler
else:
    from sklearn.preprocessing import MinMaxScaler

# Separate the features from the target; the target is taken out as a NumPy array
Y_column = 'Y'  # name the target column explicitly
X_train = train_df.drop(Y_column, axis=1)
Y_train = train_df[Y_column].values
X_test = test_df.drop(Y_column, axis=1)
Y_test = test_df[Y_column].values

if standard_f:
    transformer = StandardScaler()
    X_train = transformer.fit_transform(X_train)  # fit on the training data only
    X_test = transformer.transform(X_test)  # reuse the statistics learned from the training data
else:
    transformer = MinMaxScaler()
    X_train = transformer.fit_transform(X_train)  # fit on the training data only
    X_test = transformer.transform(X_test)  # reuse the min/max learned from the training data

normalize_f = True
if normalize_f:
    from sklearn.preprocessing import normalize
    # L1-normalize each row (axis=1) so that its absolute values sum to 1
    X_train = normalize(X_train, norm='l1', axis=1)
    X_test = normalize(X_test, norm='l1', axis=1)
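
To make the effect of the two steps in the cell above concrete, here is a small sketch on a toy array (the numbers are made up for illustration). `MinMaxScaler` rescales each column using the training minimum and maximum, and `normalize(..., norm='l1', axis=1)` then divides each row by the sum of its absolute values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, normalize

# Toy data (hypothetical): 3 training rows and 1 test row with 2 features.
X_train_toy = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 40.0]])
X_test_toy = np.array([[2.5, 25.0]])

# Column-wise scaling: fit on the training rows, reuse on the test row.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train_toy)
X_test_scaled = scaler.transform(X_test_toy)

# Row-wise L1 normalization: each non-zero row sums to 1 afterwards.
X_train_norm = normalize(X_train_scaled, norm="l1", axis=1)
X_test_norm = normalize(X_test_scaled, norm="l1", axis=1)

print(X_train_scaled)
print(X_train_norm)
print(X_test_norm)
```

Note that a row that is all zeros after scaling (the first training row here) stays all zeros, and that row normalization discards the overall magnitude of each row, which may or may not be desirable for one-hot encoded features.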

In [10]:
# Train logistic regression and evaluate it; C is chosen by 5-fold cross-validation.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import numpy as np

params = {"C": np.logspace(0, 4, 5)}
logreg_cv = GridSearchCV(LogisticRegression(), cv=5, param_grid=params)
logreg_cv.fit(X_train, Y_train)
print('Training accuracy: {0:.2%}'.format(logreg_cv.score(X_train, Y_train)))
print('Test accuracy: {0:.2%}'.format(logreg_cv.score(X_test, Y_test)))

Training accuracy: 68.98%
Test accuracy: 69.15%
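
The cell above reports only the final accuracies. As a hedged sketch, the fitted `GridSearchCV` object can also be inspected to see which `C` was selected and what the cross-validated score was. The data below is synthetic (`make_classification`), and `max_iter=1000` is an assumption added to avoid convergence warnings; neither comes from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the preprocessed features (assumption, not vehicle.csv).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Same grid as in the notebook: C in {1, 10, 100, 1000, 10000}.
params = {"C": np.logspace(0, 4, 5)}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid=params, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)        # the C value with the best mean CV accuracy
print(search.best_score_)         # mean cross-validated accuracy for that C
print(search.score(X_te, y_te))   # accuracy of the refitted best model on held-out data
```

Because `refit=True` is the default, `search.score` and `search.predict` use the best estimator refitted on the whole training split.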


The random forest described later in this notebook was less accurate than logistic regression, so logistic regression is used to predict on the test data.

In [11]:
# Load the test data
df_test_without_label = pd.read_csv('vehicle_test_without_label.csv')

In [12]:
df_test_without_label.head()

Unnamed: 0,destination,passanger,weather,temperature,time,coupon,expiration,gender,age,maritalStatus,...,Bar,CoffeeHouse,CarryAway,RestaurantLessThan20,Restaurant20To50,toCoupon_GEQ5min,toCoupon_GEQ15min,toCoupon_GEQ25min,direction_same,direction_opp
0,Home,Alone,Snowy,30,10PM,Coffee House,2h,Female,26,Unmarried partner,...,1~3,never,4~8,4~8,less1,1,1,0,0,1
1,No Urgent Place,Friend(s),Sunny,80,2PM,Restaurant(20-50),2h,Female,41,Unmarried partner,...,never,1~3,1~3,4~8,less1,1,1,0,0,1
2,No Urgent Place,Partner,Sunny,55,2PM,Bar,1d,Female,31,Married partner,...,less1,less1,gt8,4~8,less1,1,0,0,0,1
3,No Urgent Place,Friend(s),Sunny,30,10PM,Restaurant(<20),2h,Female,31,Single,...,never,4~8,gt8,4~8,less1,1,0,0,0,1
4,Work,Alone,Sunny,55,7AM,Bar,1d,Female,50plus,Married partner,...,never,never,4~8,gt8,less1,1,1,1,0,1


In [13]:
# Drop car, which is mostly missing.
data_test = df_test_without_label.drop(["car"],axis=1)
data_test.dtypes

destination             object
passanger               object
weather                 object
temperature              int64
time                    object
coupon                  object
expiration              object
gender                  object
age                     object
maritalStatus           object
has_children             int64
education               object
occupation              object
income                  object
Bar                     object
CoffeeHouse             object
CarryAway               object
RestaurantLessThan20    object
Restaurant20To50        object
toCoupon_GEQ5min         int64
toCoupon_GEQ15min        int64
toCoupon_GEQ25min        int64
direction_same           int64
direction_opp            int64
dtype: object

In [14]:
# Create dummy variables for the categorical columns
cat_col = ['destination', 'passanger', 'weather', 'time', 'coupon','expiration', 'gender', 'age', 'maritalStatus','education', 'occupation', 'income', 'Bar', 'CoffeeHouse', 'CarryAway','RestaurantLessThan20', 'Restaurant20To50']
data_test=pd.get_dummies(data_test,columns=cat_col)

In [15]:
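# Align the test columns with the training feature columns (the target Y is excluded)
feature_cols = data_train.drop(Y_column, axis=1).columns
missing_cols = set(feature_cols) - set(data_test.columns)
for c in missing_cols:
    data_test[c] = 0  # a dummy column absent from the test data is all zeros
data_test = data_test[feature_cols]  # reorder to match the training column order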
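
The alignment step above can also be written with `DataFrame.reindex`, which adds missing columns, drops extra ones, and fixes the column order in one call. This is an alternative sketch on toy data, not the code the notebook ran; the frames and values are hypothetical.

```python
import pandas as pd

# Toy frames (hypothetical): the test data lacks the Rainy and Snowy categories.
train_toy = pd.DataFrame({
    "weather": ["Sunny", "Rainy", "Snowy"],
    "temperature": [80, 55, 30],
    "Y": [1, 0, 1],
})
test_toy = pd.DataFrame({"weather": ["Sunny", "Sunny"], "temperature": [55, 80]})

train_dummies = pd.get_dummies(train_toy, columns=["weather"])
test_dummies = pd.get_dummies(test_toy, columns=["weather"])

# Feature columns seen during training, without the target.
feature_cols = train_dummies.drop(columns="Y").columns

# Add missing dummy columns as zeros and match the training column order.
test_aligned = test_dummies.reindex(columns=feature_cols, fill_value=0)
print(test_aligned)
```

One difference worth noting: `reindex` silently drops any category that appears only in the test data, whereas selecting columns by name, as in the cell above, leaves such columns out in the same way but without any warning either, so checking for unseen categories separately may be useful.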

In [16]:
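# Scale with the transformer fitted on the training data.
# X_test now holds the unlabeled test features and replaces the earlier held-out split.
X_test = transformer.transform(data_test)

# Normalize each row
if normalize_f:
    X_test = normalize(X_test, norm='l1', axis=1)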

In [17]:
# Get the predictions
result = logreg_cv.predict(X_test)

In [18]:
result

array([0, 1, 1, ..., 1, 1, 1])

In [19]:
# Convert the predictions to a DataFrame
result_df = pd.DataFrame(result, columns=['Y'])

In [20]:
result_df.head()

Unnamed: 0,Y
0,0
1,1
2,1
3,1
4,0


In [21]:
# Write the DataFrame to a CSV file
result_df.to_csv('predict.csv', index=False)  # index=False keeps the row index out of the CSV

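Next, we predict with a random forest.

1. Load the data
2. Preprocess the data (drop columns with many missing values, add dummy variables, standardize)
3. Build a confusion matrix from the random forest predictions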

In [22]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix

np.random.seed(42)

In [23]:
# Load the data
data_train = pd.read_csv('/Users/ichinose_kaori/Downloads/M8exercise/vehicle.csv', encoding="utf-8")  # training data
data_train.head()

Unnamed: 0,destination,passanger,weather,temperature,time,coupon,expiration,gender,age,maritalStatus,...,CoffeeHouse,CarryAway,RestaurantLessThan20,Restaurant20To50,toCoupon_GEQ5min,toCoupon_GEQ15min,toCoupon_GEQ25min,direction_same,direction_opp,Y
0,Home,Alone,Sunny,80,10PM,Carry out & Take away,1d,Female,26,Single,...,less1,4~8,1~3,never,1,0,0,1,0,1
1,No Urgent Place,Alone,Sunny,55,10AM,Coffee House,2h,Female,26,Single,...,less1,4~8,less1,4~8,1,0,0,0,1,1
2,Home,Alone,Snowy,30,6PM,Coffee House,1d,Female,46,Unmarried partner,...,less1,less1,1~3,less1,1,1,0,0,1,0
3,Home,Alone,Snowy,30,10PM,Coffee House,2h,Male,31,Married partner,...,less1,4~8,1~3,4~8,1,1,0,0,1,0
4,Home,Alone,Rainy,55,10PM,Coffee House,1d,Female,26,Single,...,less1,1~3,gt8,gt8,1,0,0,1,0,1


In [24]:
# Drop car, which is mostly missing.
data_train = data_train.drop(["car"],axis=1)
data_train.dtypes

destination             object
passanger               object
weather                 object
temperature              int64
time                    object
coupon                  object
expiration              object
gender                  object
age                     object
maritalStatus           object
has_children             int64
education               object
occupation              object
income                  object
Bar                     object
CoffeeHouse             object
CarryAway               object
RestaurantLessThan20    object
Restaurant20To50        object
toCoupon_GEQ5min         int64
toCoupon_GEQ15min        int64
toCoupon_GEQ25min        int64
direction_same           int64
direction_opp            int64
Y                        int64
dtype: object

In [25]:
# Create dummy variables for the categorical columns
cat_col = ['destination', 'passanger', 'weather', 'time', 'coupon','expiration', 'gender', 'age', 'maritalStatus','education', 'occupation', 'income', 'Bar', 'CoffeeHouse', 'CarryAway','RestaurantLessThan20', 'Restaurant20To50']
data_train=pd.get_dummies(data_train,columns=cat_col)

In [26]:
# Store the features in X and the labels in y
Y_column = 'Y'
X = np.array(data_train.drop(Y_column, axis=1))  # every column except Y is a feature
y = np.array(data_train[Y_column].values)  # target column Y (coded 0/1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=15)

transformer = StandardScaler()  # transformer that standardizes the data
X_train = transformer.fit_transform(X_train)  # fit on and standardize the training data
X_test = transformer.transform(X_test)  # standardize the test split with the same fitted transformer

In [27]:
params = {"max_depth": [2, 3, 4, 5, 6], "n_estimators": [1, 10, 100]}  # parameter grid to search
rf = RandomForestClassifier(random_state=42)  # random forest instance
clf = GridSearchCV(estimator=rf, param_grid=params, scoring='accuracy', cv=5)  # searches for the best parameters by cross-validation
clf.fit(X_train, y_train)
pd.DataFrame(clf.cv_results_)[['rank_test_score', 'params', 'mean_test_score', 'std_test_score']].sort_values(by=["rank_test_score"], ascending=True)  # cross-validation results

Unnamed: 0,rank_test_score,params,mean_test_score,std_test_score
14,1,"{'max_depth': 6, 'n_estimators': 100}",0.700309,0.010584
13,2,"{'max_depth': 6, 'n_estimators': 10}",0.688254,0.008231
11,3,"{'max_depth': 5, 'n_estimators': 100}",0.68601,0.009854
10,4,"{'max_depth': 5, 'n_estimators': 10}",0.672414,0.003679
8,5,"{'max_depth': 4, 'n_estimators': 100}",0.670872,0.007665
7,6,"{'max_depth': 4, 'n_estimators': 10}",0.662321,0.004324
4,7,"{'max_depth': 3, 'n_estimators': 10}",0.652087,0.016388
5,8,"{'max_depth': 3, 'n_estimators': 100}",0.644378,0.010123
9,9,"{'max_depth': 5, 'n_estimators': 1}",0.635831,0.013638
12,10,"{'max_depth': 6, 'n_estimators': 1}",0.634846,0.014447


In [28]:
y_pred = clf.predict(X_test)  # GridSearchCV can be used directly as the estimator with the best parameters
print(classification_report(y_test, y_pred))  # compute the per-class metrics
print(confusion_matrix(y_test, y_pred))  # show the confusion matrix

              precision    recall  f1-score   support

           0       0.70      0.45      0.55      1042
           1       0.66      0.85      0.75      1337

    accuracy                           0.67      2379
   macro avg       0.68      0.65      0.65      2379
weighted avg       0.68      0.67      0.66      2379

[[ 466  576]
 [ 200 1137]]
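
As a worked check, the figures in the classification report can be recomputed from the confusion matrix printed above. In scikit-learn's `confusion_matrix`, rows are the true labels and columns are the predicted labels.

```python
import numpy as np

# Confusion matrix from the random forest output above.
cm = np.array([[466, 576],
               [200, 1137]])

# For binary labels 0/1, ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()   # 1603 / 2379
precision_1 = tp / (tp + fp)      # 1137 / 1713
recall_1 = tp / (tp + fn)         # 1137 / 1337
precision_0 = tn / (tn + fn)      # 466 / 666
recall_0 = tn / (tn + fp)         # 466 / 1042

print(f"accuracy={accuracy:.2f}")
print(f"class 1: precision={precision_1:.2f}, recall={recall_1:.2f}")
print(f"class 0: precision={precision_0:.2f}, recall={recall_0:.2f}")
```

The low recall for class 0 (466 of 1042) shows that the model labels more than half of the true class-0 samples as class 1.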


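Comparing the two results, logistic regression (68.98% on the training data, 69.15% on its held-out split) was slightly more accurate than the random forest (67% on its held-out split). The two models were evaluated on different splits (test_size=0.2 and test_size=0.25) with different preprocessing, so the gap is only indicative.
We therefore used logistic regression to classify the test data.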
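
The two accuracies compared here come from separate `train_test_split` calls (test_size=0.2 with random_state=0 for logistic regression, test_size=0.25 with random_state=15 for the random forest). A sketch of how both models could be scored on one shared split is given below. It uses synthetic data and fixed hyperparameters as assumptions, so its numbers say nothing about vehicle.csv.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the dummy-encoded features (assumption, not vehicle.csv).
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# One scaler, fitted on the shared training split.
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)

# Hyperparameters are fixed here for brevity (assumed values).
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(max_depth=6, n_estimators=100, random_state=42),
}

# Fit each model on the same training rows and score it on the same held-out rows.
scores = {}
for name, model in models.items():
    model.fit(X_tr_s, y_tr)
    scores[name] = model.score(X_te_s, y_te)

print(scores)
```

Scoring both models on identical held-out rows removes split-to-split variation from the comparison.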