# Notes
## Overview of this file
- Testing XGBoost
- [Reference book](https://books.google.co.jp/books?id=HjlAEAAAQBAJ&pg=PA124&lpg=PA124&dq=education+国勢調査%E3%80%80機械学習&source=bl&ots=60Qzg00sXc&sig=ACfU3U29yBgBXPQmscE8XpM6E7hj6t32YQ&hl=ja&sa=X&ved=2ahUKEwjq2eWhqoD2AhXrT2wGHbRgCHg4ChDoAXoECB4QAw#v=onepage&q=education%20国勢調査%E3%80%80機械学習&f=false)
- [GitHub source code](https://github.com/abhishekkrthakur/approachingalmost)

## ToDo
- [ ] ~~Strikethrough~~
- [ ] Code the XGBoost model
- [ ] 


## Notes on feature ideas, improvements, and analysis items
- Manage the code on GitHub


## Other


# Initial setup


## [Setup] Google Drive settings

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## [Setup] File loading settings
- Folder paths must end with '/'

## Folder structure
- base_folder: project folder
  - src: source code
  - data: input data
  - result: results and submission data
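The trailing-slash rule matters because the notebook builds paths by plain string concatenation. Below is a minimal sketch of a helper that enforces the rule. `ensure_trailing_slash` and the example folder names are hypothetical and not part of the notebook.

```python
def ensure_trailing_slash(folder: str) -> str:
    """Return folder with exactly one trailing '/' (hypothetical helper)."""
    return folder.rstrip('/') + '/'


# Example folder names are made up for illustration
base_folder = ensure_trailing_slash('/content/drive/My Drive/Colab/project')
input_folder = ensure_trailing_slash(base_folder + 'data/input')

# Concatenation is now safe whether or not the caller added the slash
print(input_folder + 'train.csv')
```

With this in place, a path such as `input_folder + 'train.csv'` is well formed even if a folder was typed without its final slash.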

In [None]:
glab_folder = '/content/drive/My Drive/Colab/'
project_folder = 'project/2022-08-12_DXQ_model_assesment/'
base_folder = glab_folder + project_folder  # project folder

input_folder = base_folder + 'data/input/'
result_folder = base_folder + 'data/output/'

# File name = run name + timestamp + extension -> the timestamp and extension must be appended when the submission file is created

# Project overview
- 【19th Beginner-only competition】 Income prediction from census data

## Site
[https://signate.jp/competitions/576](https://signate.jp/competitions/576)

## Result file
.csv

## Score



# Function definitions
## Header
- nt_aiq_xxx_fff

## xxx -> 
- table:: table creation
- feature:: feature creation
- analysis:: analysis and plotting

## Description text
- [Function] description::function name


# Competition site
[【AIQuest2021】PBL_01 Demand forecasting and inventory optimization (retail)](https://signate.jp/competitions/488)

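# Code structure

#### 0. Load common functions
- Load libraries
- Load common functions

#### 1. Load the data
- Load the input data

#### 2. Analysis (function creation)
- Inspect the data, analyze it, and plot graphs
- Write functions for data analysis and plotting

#### 3. Preprocessing
- Fix NaN values

#### 4. Feature and table generation (function creation)
- Generate features and write feature-generation functions
- Generate new tables and write table-generation functions

#### 5. Data organization
- Combine the generated features and tables to build the train and test data
- This can overlap with step 4 (feature and table generation), so stay flexible


#### 6. Modeling

#### 7. Create the submission file

# 0. Load common functions
- Load libraries
- Load common functions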

In [None]:
# base
import pandas as pd
import numpy as np

from itertools import product

# sklearn
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.metrics import accuracy_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report

# Debugging
#import pdb; pdb.set_trace()

In [None]:
pd.__version__

'1.3.5'

In [None]:
# Install Japanese font support for matplotlib
!pip install japanize_matplotlib

import matplotlib.pyplot as plt
%matplotlib inline
import japanize_matplotlib 
import seaborn as sns

sns.set(font="IPAexGothic")

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## [Function] Getting the timestamp and generating the submission file name


In [None]:
import datetime

# Return the current time (JST, UTC+9) as a timestamp string
def nt_get_now():
  now = datetime.datetime.now(datetime.timezone(datetime.timedelta(hours=9)))
  now = now.strftime('%Y%m%d-%H%M%S')
  
  return now

# Build a submission file name that includes the timestamp
def nt_get_submit_filename(file_name):
  fname = file_name + '['+ nt_get_now() + '].csv'
  print(fname)
  return fname

## [Function] Extended print function
- nprint()

- Prints when the global nt_debug is True
- Prints nothing when it is False

In [None]:
# Settings
nt_debug = True

In [None]:
def nprint(string):
    if nt_debug:
        print(string)

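## [Settings] Change the number of columns shown by head()

In [None]:
# Print the current maximum number of displayed columns
nprint(pd.get_option("display.max_columns"))

# Set the maximum number of displayed columns (50 here)
pd.set_option('display.max_columns', 50)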

50


## [Function] Label encoding::nt_labal_encoding
- nt_labal_encoding(tbl=pd.DataFrame(), categorical_features=[]):
- Given a table and a list of columns, encodes the labels as integers and converts the columns to category dtype
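To make the intended result concrete, here is a minimal pandas-only sketch of the same idea: 0-based integer codes stored as category dtype. `encode_labels_with_pandas` and the toy table are hypothetical. This is a stand-in for illustration, not the notebook's `nt_labal_encoding`, which relies on `category_encoders`.

```python
import pandas as pd


def encode_labels_with_pandas(tbl, categorical_features):
    """Encode each listed column as 0-based integer codes with category dtype.

    Hypothetical pandas-only stand-in, not the notebook's nt_labal_encoding.
    """
    encoded = tbl.copy()
    for col in categorical_features:
        # factorize assigns 0, 1, 2, ... in order of first appearance
        codes, _ = pd.factorize(encoded[col])
        encoded[col] = pd.Series(codes, index=encoded.index).astype('category')
    return encoded


# Toy data invented for illustration
toy = pd.DataFrame({'city': ['LA', 'NYC', 'LA', 'SF'], 'beds': [1, 2, 1, 3]})
toy_enc = encode_labels_with_pandas(toy, ['city'])
print(toy_enc['city'].tolist())  # [0, 1, 0, 2]
```

Note that `pd.factorize` maps missing values to -1 by default, so NaN handling may differ from the encoder used in the notebook.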

In [None]:
!pip install category_encoders

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# Label encoding

# Installation
# !pip install category_encoders

# Usage
# list_item = ['商品カテゴリ名']
# tbl = nt_labal_encoding(category_names, list_item)

# Reference
# https://qiita.com/sinchir0/items/b038757e578b790ec96a

# int_change_features = ['pay_year_month','商品カテゴリID','商品カテゴリ名', 'category_main','category_sub','shop_item_id']

def nt_labal_encoding(tbl=pd.DataFrame(), categorical_features=[]):

    # Run the label (ordinal) encoding
    import category_encoders as ce

    # 'value' encodes unknown categories as -1 in category_encoders 2.x;
    # the original 'impute' option belongs to older releases
    ce_oe = ce.OrdinalEncoder(cols=categorical_features, handle_unknown='value')

    # Convert strings to ordinal integers
    table_enc = ce_oe.fit_transform(tbl)

    # Shift the codes so they start at 0 instead of 1
    for i in categorical_features:
        table_enc[i] = table_enc[i] - 1
    
    # Convert every categorical column to category dtype
    for i in categorical_features:
        table_enc[i] = table_enc[i].astype('category')

    return table_enc

## [Function] Table information display function
nt_aiq_analysis_print_table_info(com='', tbl=pd.DataFrame())

In [None]:
def nt_aiq_analysis_print_table_info(com='', tbl=pd.DataFrame()):
    
    print('###############################################')
    print('# {}'.format(com))
    print('###############################################')

    print('------------------------------Table Info------------------------------')
    # info() prints its report directly and returns None, so it is not wrapped in print()
    tbl.info()
    print('-------------------------------------------------------------------------------\n')

    print('------------------------------Table Head------------------------------')
    print(tbl.head())
    print('---------------------------------------------------------------------------\n')

    print('------------------------------Table Describe------------------------------')
    print(tbl.describe())
    print('-------------------------------------------------------------------------------\n')

# 1. Load the data
- Load the input data

## Load the CSV data

In [None]:
train = pd.read_csv(input_folder + 'train.csv')
test = pd.read_csv(input_folder + 'test.csv', index_col=0) # note: the first column of test.csv is the index
sample_sub = pd.read_csv(input_folder + 'sample_submit.csv', header=None)

## [Function] Display information for each table
- nt_aiq_analysis_print_each_table()

In [None]:
def nt_aiq_analysis_print_each_table():

    # Show information about train.csv
    com = 'Information about train'
    nt_aiq_analysis_print_table_info(com, train)

    # Show information about test.csv
    com = 'Information about test.csv'
    nt_aiq_analysis_print_table_info(com, test)

    # Show information about sample_submit.csv
    com = 'Information about sample_submit.csv'
    nt_aiq_analysis_print_table_info(com, sample_sub)

nt_aiq_analysis_print_each_table()

###############################################
# Information about train
###############################################
------------------------------Table Info------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55583 entries, 0 to 55582
Data columns (total 29 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      55583 non-null  int64  
 1   accommodates            55583 non-null  int64  
 2   amenities               55583 non-null  object 
 3   bathrooms               55436 non-null  float64
 4   bed_type                55583 non-null  object 
 5   bedrooms                55512 non-null  float64
 6   beds                    55487 non-null  float64
 7   cancellation_policy     55583 non-null  object 
 8   city                    55583 non-null  object 
 9   cleaning_fee            55583 non-null  object 
 10  description             55583 non-null  object 
 11  first_review 

# 3. Preprocessing
- Fix NaN values
- 
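The notebook leaves this step empty, although the `train.info()` output shows missing values in numeric columns such as `bathrooms`, `bedrooms`, and `beds`. The following is a minimal sketch of one common option, median imputation. The helper name `fill_numeric_nan_with_median` and the toy tables are hypothetical. The choice of the median is an assumption, not the author's method.

```python
import numpy as np
import pandas as pd


def fill_numeric_nan_with_median(tbl, columns, medians=None):
    """Fill NaN in the given numeric columns with per-column medians.

    Hypothetical helper. Pass the medians computed on train when filling test
    so that both tables are imputed with the same values.
    """
    filled = tbl.copy()
    if medians is None:
        medians = filled[columns].median()
    filled[columns] = filled[columns].fillna(medians)
    return filled, medians


# Toy data invented for illustration
toy_train = pd.DataFrame({'bathrooms': [1.0, np.nan, 2.0], 'beds': [1.0, 3.0, np.nan]})
toy_test = pd.DataFrame({'bathrooms': [np.nan], 'beds': [2.0]})

cols = ['bathrooms', 'beds']
toy_train_filled, train_medians = fill_numeric_nan_with_median(toy_train, cols)
toy_test_filled, _ = fill_numeric_nan_with_median(toy_test, cols, train_medians)
print(toy_train_filled)
print(toy_test_filled)
```

Reusing the train medians for the test table keeps the two tables consistent and avoids computing statistics on test data.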

# 4. Feature creation
- Feature generation
- Normalization
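This section has no code yet. As a rough illustration of the two bullets, the sketch below turns a percentage string column into a number (the data shows `host_response_rate` values such as `100%`) and standardizes a numeric column with `StandardScaler`. The helper `percent_to_float`, the toy table, and the new column name `accommodates_scaled` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def percent_to_float(series):
    """Convert strings such as '100%' to floats such as 100.0; missing values stay NaN."""
    return series.str.rstrip('%').astype(float)


# Toy data invented for illustration
toy = pd.DataFrame({
    'host_response_rate': ['100%', None, '50%'],
    'accommodates': [6, 2, 4],
})

# Feature generation: parse the percentage strings into numbers
toy['host_response_rate'] = percent_to_float(toy['host_response_rate'])

# Normalization: rescale one numeric column to zero mean and unit variance
scaler = StandardScaler()
toy['accommodates_scaled'] = scaler.fit_transform(toy[['accommodates']]).ravel()
print(toy)
```

When applied to real data, the scaler would be fitted on the training table and reused with `scaler.transform` on the test table.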

# 5. Organizing the datasets

### Splitting the datasets
- Split into features and target

In [None]:
X_train = train.drop(['y','id'], axis=1)
y_train = train['y']

X_test = test

In [None]:
print(X_test.shape)
print(X_train.shape)
print(y_train.shape)

(18528, 27)
(55583, 27)
(55583,)


# Cross-validation
- Uses XGBoost

In [None]:
X_train.columns

Index(['accommodates', 'amenities', 'bathrooms', 'bed_type', 'bedrooms',
       'beds', 'cancellation_policy', 'city', 'cleaning_fee', 'description',
       'first_review', 'host_has_profile_pic', 'host_identity_verified',
       'host_response_rate', 'host_since', 'instant_bookable', 'last_review',
       'latitude', 'longitude', 'name', 'neighbourhood', 'number_of_reviews',
       'property_type', 'review_scores_rating', 'room_type', 'thumbnail_url',
       'zipcode'],
      dtype='object')

In [None]:
# feature = ['index', 'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race','sex', 'native-country']

In [None]:
from sklearn import model_selection

# Work on a copy of train
#df_train = pd.read_csv(input_folder + 'train.csv')
df_train = train.copy()

df_train['kfold'] = -1

# Number of folds
num_fold = 5

# Split the dataset into 5 folds for cross-validation and store each row's fold index
kf = model_selection.KFold(n_splits=num_fold, shuffle=True, random_state=42)
for fold, (train_indices, valid_indices) in enumerate(kf.split(X=df_train)):
  df_train.loc[valid_indices, 'kfold'] = fold
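The cell above only assigns fold numbers; the notebook never runs the validation loop itself. The sketch below shows how a `kfold` column could be used for that. It makes several assumptions: RMSE is an assumed metric (the notebook does not state the competition metric), `run_kfold_rmse` is a hypothetical helper, and `LinearRegression` on synthetic data stands in for the real model so the example stays small.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def run_kfold_rmse(df, feature_cols, target_col, make_model, fold_col='kfold'):
    """Train on all folds but one, score RMSE on the held-out fold, repeat."""
    scores = []
    for fold in sorted(df[fold_col].unique()):
        train_part = df[df[fold_col] != fold]
        valid_part = df[df[fold_col] == fold]
        model = make_model()  # fresh model for every fold
        model.fit(train_part[feature_cols].values, train_part[target_col].values)
        preds = model.predict(valid_part[feature_cols].values)
        rmse = np.sqrt(mean_squared_error(valid_part[target_col].values, preds))
        scores.append(rmse)
    return scores


# Synthetic data invented for illustration: y is an exact linear function of x
rng = np.random.default_rng(42)
toy = pd.DataFrame({'x': rng.normal(size=100)})
toy['y'] = 3.0 * toy['x'] + 1.0
toy['kfold'] = np.arange(len(toy)) % 5

fold_scores = run_kfold_rmse(toy, ['x'], 'y', LinearRegression)
print(fold_scores, np.mean(fold_scores))
```

For the notebook's data, `make_model` could be something like `lambda: XGBRegressor(max_depth=4, n_jobs=-1)`, applied to `df_train` after the features have been label-encoded.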


In [None]:
df_train

Unnamed: 0,id,accommodates,amenities,bathrooms,bed_type,bedrooms,beds,cancellation_policy,city,cleaning_fee,description,first_review,host_has_profile_pic,host_identity_verified,host_response_rate,host_since,instant_bookable,last_review,latitude,longitude,name,neighbourhood,number_of_reviews,property_type,review_scores_rating,room_type,thumbnail_url,zipcode,y,kfold
0,0,6,"{TV,""Wireless Internet"",Kitchen,""Free parking ...",2.0,Real Bed,1.0,4.0,flexible,LA,t,My place is meant for family and a few friends...,2016-07-27,t,f,,2016-07-13,f,2016-07-27,33.788931,-118.154761,The Penthouse,,1,Apartment,60.0,Private room,,90804,138.0,1
1,1,2,"{TV,""Cable TV"",Internet,""Wireless Internet"",""A...",1.0,Real Bed,1.0,1.0,strict,DC,t,This is a new listing for a lovely guest bedro...,2016-09-12,t,t,100%,2015-12-30,f,2017-03-31,38.934810,-76.978190,Guest Bedroom in Brookland,Brookland,9,House,100.0,Private room,https://a0.muscache.com/im/pictures/e4d8b51f-6...,20018,42.0,1
2,2,2,"{TV,Internet,""Wireless Internet"",Kitchen,""Indo...",2.0,Real Bed,1.0,1.0,strict,NYC,t,We're looking forward to your stay at our apt....,2016-06-15,t,f,100%,2016-05-21,t,2017-08-13,40.695118,-73.926240,Clean Modern Room in Lux Apt 1 Block From J Train,Bushwick,27,Apartment,83.0,Private room,https://a0.muscache.com/im/pictures/5ffecc9b-d...,,65.0,4
3,3,2,"{TV,""Cable TV"",Internet,""Wireless Internet"",""A...",1.0,Real Bed,1.0,1.0,strict,SF,t,BEST CITY VIEWS - - ROOF DECK W/ BBQ & WiFi - ...,2014-03-15,t,t,100%,2012-06-19,t,2017-09-03,37.796728,-122.411906,BEST views + reviews! 5/5 stars*****,Nob Hill,38,Apartment,95.0,Private room,,94133,166.0,2
4,4,2,"{TV,Internet,""Wireless Internet"",""Air conditio...",1.0,Real Bed,1.0,1.0,strict,NYC,t,Charming Apartment on the upper west side of M...,2015-08-05,t,t,100%,2015-03-25,f,2017-09-10,40.785050,-73.974691,Charming 1-bedroom - UWS Manhattan,Upper West Side,5,Apartment,100.0,Entire home/apt,https://a0.muscache.com/im/pictures/92879730/5...,10024,165.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
55578,55578,4,"{TV,""Cable TV"",Internet,""Wireless Internet"",""A...",1.5,Real Bed,1.0,1.0,strict,NYC,t,"1100 square foot, New York apartment right in ...",2013-02-27,t,t,100%,2013-01-08,f,2017-09-18,40.739261,-73.994814,Super Swanky New York Apartment!,Flatiron District,110,Apartment,93.0,Entire home/apt,https://a0.muscache.com/im/pictures/12634916/5...,10010.0,340.0,3
55579,55579,2,"{TV,Internet,""Wireless Internet"",""Air conditio...",1.0,Real Bed,1.0,1.0,moderate,Chicago,f,Private room in apartment in Avondale Chicago ...,2015-11-29,t,t,,2015-01-29,f,2016-10-29,41.933710,-87.720810,Cozy Room in Avondale 3,,6,Apartment,87.0,Private room,https://a0.muscache.com/im/pictures/6a4a3dcb-6...,60618,30.0,3
55580,55580,2,"{TV,Internet,""Wireless Internet"",Kitchen,Heati...",1.0,Real Bed,1.0,1.0,flexible,SF,t,This private bedroom is part of a spacious uni...,2016-03-02,t,f,,2016-02-16,f,2017-04-07,37.762222,-122.416493,Cozy Bedroom in Mission District,Mission District,14,Apartment,99.0,Private room,https://a0.muscache.com/im/pictures/e8dd8cfd-5...,94110,100.0,4
55581,55581,1,"{TV,""Wireless Internet"",""Air conditioning"",Poo...",1.5,Real Bed,1.0,1.0,moderate,LA,t,You’ll love my place because of the ambiance a...,2016-10-18,t,t,100%,2016-04-06,t,2017-04-18,34.217543,-118.534260,Private Room in Resort-Style Townhouse,Reseda,10,Townhouse,100.0,Private room,https://a0.muscache.com/im/pictures/603399bb-9...,91335,38.0,2


In [None]:
train

Unnamed: 0,id,accommodates,amenities,bathrooms,bed_type,bedrooms,beds,cancellation_policy,city,cleaning_fee,description,first_review,host_has_profile_pic,host_identity_verified,host_response_rate,host_since,instant_bookable,last_review,latitude,longitude,name,neighbourhood,number_of_reviews,property_type,review_scores_rating,room_type,thumbnail_url,zipcode,y
0,0,6,"{TV,""Wireless Internet"",Kitchen,""Free parking ...",2.0,Real Bed,1.0,4.0,flexible,LA,t,My place is meant for family and a few friends...,2016-07-27,t,f,,2016-07-13,f,2016-07-27,33.788931,-118.154761,The Penthouse,,1,Apartment,60.0,Private room,,90804,138.0
1,1,2,"{TV,""Cable TV"",Internet,""Wireless Internet"",""A...",1.0,Real Bed,1.0,1.0,strict,DC,t,This is a new listing for a lovely guest bedro...,2016-09-12,t,t,100%,2015-12-30,f,2017-03-31,38.934810,-76.978190,Guest Bedroom in Brookland,Brookland,9,House,100.0,Private room,https://a0.muscache.com/im/pictures/e4d8b51f-6...,20018,42.0
2,2,2,"{TV,Internet,""Wireless Internet"",Kitchen,""Indo...",2.0,Real Bed,1.0,1.0,strict,NYC,t,We're looking forward to your stay at our apt....,2016-06-15,t,f,100%,2016-05-21,t,2017-08-13,40.695118,-73.926240,Clean Modern Room in Lux Apt 1 Block From J Train,Bushwick,27,Apartment,83.0,Private room,https://a0.muscache.com/im/pictures/5ffecc9b-d...,,65.0
3,3,2,"{TV,""Cable TV"",Internet,""Wireless Internet"",""A...",1.0,Real Bed,1.0,1.0,strict,SF,t,BEST CITY VIEWS - - ROOF DECK W/ BBQ & WiFi - ...,2014-03-15,t,t,100%,2012-06-19,t,2017-09-03,37.796728,-122.411906,BEST views + reviews! 5/5 stars*****,Nob Hill,38,Apartment,95.0,Private room,,94133,166.0
4,4,2,"{TV,Internet,""Wireless Internet"",""Air conditio...",1.0,Real Bed,1.0,1.0,strict,NYC,t,Charming Apartment on the upper west side of M...,2015-08-05,t,t,100%,2015-03-25,f,2017-09-10,40.785050,-73.974691,Charming 1-bedroom - UWS Manhattan,Upper West Side,5,Apartment,100.0,Entire home/apt,https://a0.muscache.com/im/pictures/92879730/5...,10024,165.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
55578,55578,4,"{TV,""Cable TV"",Internet,""Wireless Internet"",""A...",1.5,Real Bed,1.0,1.0,strict,NYC,t,"1100 square foot, New York apartment right in ...",2013-02-27,t,t,100%,2013-01-08,f,2017-09-18,40.739261,-73.994814,Super Swanky New York Apartment!,Flatiron District,110,Apartment,93.0,Entire home/apt,https://a0.muscache.com/im/pictures/12634916/5...,10010.0,340.0
55579,55579,2,"{TV,Internet,""Wireless Internet"",""Air conditio...",1.0,Real Bed,1.0,1.0,moderate,Chicago,f,Private room in apartment in Avondale Chicago ...,2015-11-29,t,t,,2015-01-29,f,2016-10-29,41.933710,-87.720810,Cozy Room in Avondale 3,,6,Apartment,87.0,Private room,https://a0.muscache.com/im/pictures/6a4a3dcb-6...,60618,30.0
55580,55580,2,"{TV,Internet,""Wireless Internet"",Kitchen,Heati...",1.0,Real Bed,1.0,1.0,flexible,SF,t,This private bedroom is part of a spacious uni...,2016-03-02,t,f,,2016-02-16,f,2017-04-07,37.762222,-122.416493,Cozy Bedroom in Mission District,Mission District,14,Apartment,99.0,Private room,https://a0.muscache.com/im/pictures/e8dd8cfd-5...,94110,100.0
55581,55581,1,"{TV,""Wireless Internet"",""Air conditioning"",Poo...",1.5,Real Bed,1.0,1.0,moderate,LA,t,You’ll love my place because of the ambiance a...,2016-10-18,t,t,100%,2016-04-06,t,2017-04-18,34.217543,-118.534260,Private Room in Resort-Style Townhouse,Reseda,10,Townhouse,100.0,Private room,https://a0.muscache.com/im/pictures/603399bb-9...,91335,38.0


## Feature setup

In [None]:
train.info()
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55583 entries, 0 to 55582
Data columns (total 29 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      55583 non-null  int64  
 1   accommodates            55583 non-null  int64  
 2   amenities               55583 non-null  object 
 3   bathrooms               55436 non-null  float64
 4   bed_type                55583 non-null  object 
 5   bedrooms                55512 non-null  float64
 6   beds                    55487 non-null  float64
 7   cancellation_policy     55583 non-null  object 
 8   city                    55583 non-null  object 
 9   cleaning_fee            55583 non-null  object 
 10  description             55583 non-null  object 
 11  first_review            43675 non-null  object 
 12  host_has_profile_pic    55435 non-null  object 
 13  host_identity_verified  55435 non-null  object 
 14  host_response_rate      41879 non-null

In [None]:
base_col = ['id', 'kfold','y']
obj_col = ['amenities', 'bed_type','cancellation_policy', 'city', 'cleaning_fee', 
           'description','first_review', 'host_has_profile_pic', 
           'host_identity_verified', 'host_response_rate', 'host_since',
           'instant_bookable', 'last_review', 'name', 'neighbourhood',
           'property_type', 'room_type', 'thumbnail_url', 'zipcode']

num_cols = ['accommodates', 'bathrooms', 'bedrooms', 'beds',
            'latitude', 'longitude', 'number_of_reviews',
            'review_scores_rating']

In [None]:
# Feature setup
features = [
    # Use every column except the id, the target, and the fold number as a feature
    f for f in df_train.columns if f not in ('id','kfold','y')
    # Alternative: also exclude the numeric columns
    #f for f in df_train.columns if f not in ("kfold","y") and f not in num_cols
]

In [None]:
# Non-numeric features
non_num_features = [
    # Exclude the id, the target, the fold number, and the numeric columns
    f for f in df_train.columns if f not in ('id','kfold','y') and f not in num_cols
]

# All features
all_features = [
    # Exclude the id, the target, and the fold number
    f for f in df_train.columns if f not in ('id','kfold','y')
]

In [None]:
# Selected features
select_features = obj_col + num_cols

In [None]:
print(non_num_features)
print(all_features)
print(select_features)

['amenities', 'bed_type', 'cancellation_policy', 'city', 'cleaning_fee', 'description', 'first_review', 'host_has_profile_pic', 'host_identity_verified', 'host_response_rate', 'host_since', 'instant_bookable', 'last_review', 'name', 'neighbourhood', 'property_type', 'room_type', 'thumbnail_url', 'zipcode']
['accommodates', 'amenities', 'bathrooms', 'bed_type', 'bedrooms', 'beds', 'cancellation_policy', 'city', 'cleaning_fee', 'description', 'first_review', 'host_has_profile_pic', 'host_identity_verified', 'host_response_rate', 'host_since', 'instant_bookable', 'last_review', 'latitude', 'longitude', 'name', 'neighbourhood', 'number_of_reviews', 'property_type', 'review_scores_rating', 'room_type', 'thumbnail_url', 'zipcode']
['amenities', 'bed_type', 'cancellation_policy', 'city', 'cleaning_fee', 'description', 'first_review', 'host_has_profile_pic', 'host_identity_verified', 'host_response_rate', 'host_since', 'instant_bookable', 'last_review', 'name', 'neighbourhood', 'property_type'

## XGBoost cross-validation

In [None]:
import xgboost as xgb
from xgboost import XGBRegressor

from sklearn import metrics
from sklearn import preprocessing

In [None]:
print(select_features)

['amenities', 'bed_type', 'cancellation_policy', 'city', 'cleaning_fee', 'description', 'first_review', 'host_has_profile_pic', 'host_identity_verified', 'host_response_rate', 'host_since', 'instant_bookable', 'last_review', 'name', 'neighbourhood', 'property_type', 'room_type', 'thumbnail_url', 'zipcode', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'latitude', 'longitude', 'number_of_reviews', 'review_scores_rating']



# Building a model with XGBoost (XGBRegressor)

In [None]:
features

['accommodates',
 'amenities',
 'bathrooms',
 'bed_type',
 'bedrooms',
 'beds',
 'cancellation_policy',
 'city',
 'cleaning_fee',
 'description',
 'first_review',
 'host_has_profile_pic',
 'host_identity_verified',
 'host_response_rate',
 'host_since',
 'instant_bookable',
 'last_review',
 'latitude',
 'longitude',
 'name',
 'neighbourhood',
 'number_of_reviews',
 'property_type',
 'review_scores_rating',
 'room_type',
 'thumbnail_url',
 'zipcode']

## Label encoding

In [None]:
df_train = train.copy()

In [None]:
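# Label-encode the features (training data)
# Fit each encoder on the train and test values together so that the same
# category receives the same integer code in both tables
label_encoders = {}
#for col in features:
for col in non_num_features:
    # Treat missing values as their own category and compare everything as strings
    train_values = df_train[col].fillna('NONE').astype(str)
    test_values = X_test[col].fillna('NONE').astype(str)
    # Initialize and fit the label encoder
    label = preprocessing.LabelEncoder()
    label.fit(pd.concat([train_values, test_values]))
    label_encoders[col] = label
    # Transform the training data
    df_train[col] = label.transform(train_values)

In [None]:
# Label-encode the features (test data) with the encoders fitted above
#for col in features:
for col in non_num_features:
    test_values = X_test[col].fillna('NONE').astype(str)
    # Transform the test data
    X_test[col] = label_encoders[col].transform(test_values)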

In [None]:
X_train = df_train
y_train = X_train.y

In [None]:
# Prepare the training and test datasets
x_train = X_train[select_features].values
x_test = X_test[select_features].values

# Initialize the model
model_XGBoost = xgb.XGBRegressor(
    max_depth=4,
    n_jobs=-1
)

# Train the model
model_XGBoost.fit(x_train, y_train)



XGBRegressor(max_depth=4, n_jobs=-1)

In [None]:
# Use the trained model to compute predictions for the test data
y_pred_XGBoost = model_XGBoost.predict(x_test)

In [None]:
y_pred_XGBoost.shape

(18528,)

# Creating the submission file (XGBoost)

In [None]:
# Check the contents of sample_sub (sample_submit.csv)
sample_sub.head()

Unnamed: 0,0,1
0,0,10
1,1,10
2,2,10
3,3,10
4,4,10


In [None]:
# Assign the predictions to the rightmost column of sample_sub
sample_sub.iloc[:, -1] = y_pred_XGBoost

sample_sub.head()

Unnamed: 0,0,1
0,0,263.914825
1,1,125.311157
2,2,86.905724
3,3,138.058685
4,4,151.656235


In [None]:
# Write the prediction file
sample_sub.to_csv(result_folder + nt_get_submit_filename('XGBoost_'), index=False, header=False)
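The submission appears to be a two-column CSV without a header, since `sample_submit.csv` is read with `header=None` and written with `header=False`. The sketch below builds such a file from scratch and reads it back as a quick sanity check. `write_submission`, the toy values, and the temporary file name are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def write_submission(ids, predictions, path):
    """Write a two-column, header-less CSV: id, prediction (hypothetical helper)."""
    if len(ids) != len(predictions):
        raise ValueError('ids and predictions must have the same length')
    submission = pd.DataFrame({
        'id': list(ids),
        'pred': np.asarray(predictions, dtype=float),
    })
    submission.to_csv(path, index=False, header=False)
    return submission


# Toy values invented for illustration
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'toy_submission.csv')
    write_submission([0, 1, 2], [263.9, 125.3, 86.9], out_path)
    # Read the file back the same way the notebook reads sample_submit.csv
    check = pd.read_csv(out_path, header=None)

print(check)
```

Checking the row count and column count of the file that was actually written catches length mismatches before uploading.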

XGBoost_[20220814-182830].csv


# Building a model with LightGBM (LGBMRegressor)

In [None]:
import lightgbm as lgb

In [None]:
# Prepare the training and test datasets
x_train = X_train[select_features].values
x_test = X_test[select_features].values

# Initialize the model
model_LightGBM = lgb.LGBMRegressor(
    max_depth=4,
    n_jobs=-1
)

# Train the model
model_LightGBM.fit(x_train, y_train)

LGBMRegressor(max_depth=4)

In [None]:
# Use the trained model to compute predictions for the test data
y_pred_LightGBM = model_LightGBM.predict(x_test)

# Round to integers
#y_pred = (y_pred_LightGBM+0.5).astype(int)
#y_pred

In [None]:
y_pred_LightGBM.shape

(18528,)

In [None]:
# Show feature importance
#importance = pd.DataFrame(model_LightGBM.feature_importances_, index=select_features, columns=['importance'])
#display(importance)

# Creating the submission file (LightGBM)

In [None]:
# Check the contents of sample_sub (sample_submit.csv)
sample_sub.head()

Unnamed: 0,0,1
0,0,263.914825
1,1,125.311157
2,2,86.905724
3,3,138.058685
4,4,151.656235


In [None]:
# Assign the predictions to the rightmost column of sample_sub
sample_sub.iloc[:, -1] = y_pred_LightGBM

sample_sub.head()

Unnamed: 0,0,1
0,0,263.558422
1,1,129.526806
2,2,94.053343
3,3,140.342896
4,4,152.37044


In [None]:
# Write the prediction file
sample_sub.to_csv(result_folder + nt_get_submit_filename('LightGBM_'), index=False, header=False)

LightGBM_[20220814-182926].csv


# Building a model with CatBoost

In [None]:
!pip install catboost

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting catboost
  Downloading catboost-1.0.6-cp37-none-manylinux1_x86_64.whl (76.6 MB)
Installing collected packages: catboost
Successfully installed catboost-1.0.6


In [None]:
from catboost import Pool
from catboost import CatBoostRegressor

In [None]:
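# Candidate indices of categorical features (not passed to the model below)
cat_features_index = [0,1,2,3,4,5,6]

# Prepare the training and test datasets
x_train = X_train[select_features].values
x_test = X_test[select_features].values

# Initialize the model
model_CatBoost = CatBoostRegressor(
    max_depth=4,
)
# Train the model
model_CatBoost.fit(x_train, y_train)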


Learning rate set to 0.077247
0:	learn: 163.5169830	total: 58.8ms	remaining: 58.8s
1:	learn: 159.2806284	total: 69.7ms	remaining: 34.8s
2:	learn: 155.5557325	total: 79.5ms	remaining: 26.4s
3:	learn: 152.1271687	total: 88.9ms	remaining: 22.1s
4:	learn: 149.2199328	total: 99.1ms	remaining: 19.7s
5:	learn: 146.5491194	total: 109ms	remaining: 18.1s
6:	learn: 144.1631522	total: 130ms	remaining: 18.4s
7:	learn: 142.0797097	total: 142ms	remaining: 17.6s
8:	learn: 140.1969091	total: 152ms	remaining: 16.7s
9:	learn: 138.4723014	total: 161ms	remaining: 16s
10:	learn: 136.9529681	total: 170ms	remaining: 15.3s
11:	learn: 135.5198343	total: 180ms	remaining: 14.8s
12:	learn: 134.2315464	total: 194ms	remaining: 14.7s
13:	learn: 133.1298164	total: 204ms	remaining: 14.4s
14:	learn: 132.0920416	total: 214ms	remaining: 14.1s
15:	learn: 131.1439614	total: 224ms	remaining: 13.8s
16:	learn: 130.2954972	total: 233ms	remaining: 13.5s
17:	learn: 129.4335990	total: 248ms	remaining: 13.5s
18:	learn: 128.6372232	

<catboost.core.CatBoostRegressor at 0x7fb90e08e590>

In [None]:
# Use the trained model to compute predictions for the test data
y_pred_CatBoost = model_CatBoost.predict(x_test)

# Round to integers
#y_pred = (y_pred_CatBoost+0.5).astype(int)
#y_pred

In [None]:
y_pred_CatBoost.shape

(18528,)

In [None]:
# Show feature importance
#importance = pd.DataFrame(model_CatBoost.get_feature_importance(), index=select_features, columns=['importance'])
#display(importance)

# Creating the submission file (CatBoost)

In [None]:
# Check the contents of sample_sub (sample_submit.csv)
sample_sub.head()

Unnamed: 0,0,1
0,0,263.558422
1,1,129.526806
2,2,94.053343
3,3,140.342896
4,4,152.37044


In [None]:
# Assign the predictions to the rightmost column of sample_sub
sample_sub.iloc[:, -1] = y_pred_CatBoost

sample_sub.head()

Unnamed: 0,0,1
0,0,293.742556
1,1,141.737156
2,2,93.469758
3,3,142.257415
4,4,144.743108


In [None]:
# Write the prediction file
sample_sub.to_csv(result_folder + nt_get_submit_filename('CatBoost_'), index=False, header=False)

CatBoost_[20220814-183109].csv


# Ensemble

In [None]:
# Ensemble weights (they sum to 1)
rate_XGBoost = 0.5
rate_LightGBM = 0.2
rate_CatBoost = 0.3

# Weighted average of the three models' predictions
y_pred = rate_XGBoost * y_pred_XGBoost + rate_LightGBM * y_pred_LightGBM + rate_CatBoost * y_pred_CatBoost

# Round to integers
#y_pred = (y_pred+0.5).astype(int)
#y_pred
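The blend above is a weighted average of the three prediction arrays. The sketch below wraps the same arithmetic in a small function that also checks the weights. `weighted_ensemble` and the toy predictions are hypothetical; only the 0.5 / 0.2 / 0.3 weights come from the notebook.

```python
import numpy as np


def weighted_ensemble(predictions, weights):
    """Blend prediction arrays with weights that must sum to 1 (hypothetical helper)."""
    if len(predictions) != len(weights):
        raise ValueError('need one weight per prediction array')
    if not np.isclose(sum(weights), 1.0):
        raise ValueError('weights must sum to 1')
    stacked = np.vstack([np.asarray(p, dtype=float) for p in predictions])
    # Average over models (axis 0), giving one blended value per test row
    return np.average(stacked, axis=0, weights=weights)


# Toy predictions invented for illustration: three models, two test rows
blend = weighted_ensemble(
    [[100.0, 200.0], [110.0, 190.0], [120.0, 210.0]],
    [0.5, 0.2, 0.3],
)
print(blend)  # [108. 201.]
```

For the first row the result is 0.5 * 100 + 0.2 * 110 + 0.3 * 120 = 108, which matches the manual formula used in the notebook cell.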

In [None]:
# Assign the predictions to the rightmost column of sample_sub
sample_sub.iloc[:, -1] = y_pred

sample_sub.head()

Unnamed: 0,0,1
0,0,272.791864
1,1,131.082087
2,2,90.304458
3,3,139.775146
4,4,149.725138


In [None]:
# Write the prediction file
sample_sub.to_csv(result_folder + nt_get_submit_filename('ensemble_LightGBM_XGBoost_CatBoost_'), index=False, header=False)

ensemble_LightGBM_XGBoost_CatBoost_[20220814-183120].csv
