# Version 3 Validation

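## What Is Validated

> Apply the three preprocessing options introduced in `Version 3`, individually and in combination, and observe how performance changes.

1. [Test 1] Drop every train example whose `공급유형` (supply type) is `장기전세`, `공공분양`, or `공공임대(5년)`.
2. [Test 2] For each `단지코드` (complex code), map `임대건물구분` (rental building type) to 1 if the complex contains only `아파트` and to 0 if it contains both `상가` and `아파트`.
3. [Test 3] Add a "mean rent per `자격유형` (qualification type)" feature.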

## Validation Order

1. **Version_3-1** : [Test 1] only
2. **Version_3-2** : [Test 2] only
3. **Version_3-3** : [Test 3] only
4. **Version_3-4** : [Test 1] & [Test 2]
5. **Version_3-5** : [Test 1] & [Test 3]
6. **Version_3-6** : [Test 2] & [Test 3]
7. **Version_3-7** : all three options
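
The seven variants listed above are the non-empty subsets of the three tests. As a minimal sketch (the names `tests` and `variants` are illustrative and do not appear in the notebook), `itertools.combinations` reproduces the same ordering:

```python
from itertools import combinations

tests = ["Test 1", "Test 2", "Test 3"]

# Every non-empty subset of the three tests, smallest subsets first
variants = [
    combo
    for size in range(1, len(tests) + 1)
    for combo in combinations(tests, size)
]

for number, combo in enumerate(variants, start=1):
    print(f"Version_3-{number}: {' & '.join(combo)}")
```

Generating the list this way makes it harder to skip or duplicate a combination when more options are added later.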

## Validation Results

### Best CV Score

- **Version_3-5** ([Test 1] & [Test 3]) : -148.1634 (`neg_mean_absolute_error`, 5-fold CV)

## Import Module

In [1]:
import pandas as pd
import numpy as np
from os.path import join as Join
from tqdm.notebook import tqdm

## Data Load

In [2]:
DATA_ROOT = ''
DATA_ROOT = Join(DATA_ROOT, '../../../../competition_data/parking_data/')

TRAIN_ROOT = Join(DATA_ROOT, 'train.csv')
TEST_ROOT = Join(DATA_ROOT, 'test.csv')
SUBMISSION_ROOT = Join(DATA_ROOT, 'sample_submission.csv')

print(f"DATA_ROOT : {DATA_ROOT}")
print(f"TRAIN_ROOT : {TRAIN_ROOT}")
print(f"TEST_ROOT : {TEST_ROOT}")
print(f"SUBMISSION_ROOT : {SUBMISSION_ROOT}")

DATA_ROOT : ../../../../competition_data/parking_data/
TRAIN_ROOT : ../../../../competition_data/parking_data/train.csv
TEST_ROOT : ../../../../competition_data/parking_data/test.csv
SUBMISSION_ROOT : ../../../../competition_data/parking_data/sample_submission.csv


In [3]:
train = pd.read_csv(TRAIN_ROOT)
test = pd.read_csv(TEST_ROOT)
submission = pd.read_csv(SUBMISSION_ROOT)

print("Data Loaded!")

Data Loaded!


## Preprocessing (Version 1)

### Map Region Names to Integers

In [4]:
local_map = {}

for i, loc in enumerate(train['지역'].unique()):
    local_map[loc] = i

train['지역'] = train['지역'].map(local_map)
test['지역'] = test['지역'].map(local_map)
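
Because `local_map` is built from train only, a region that occurs only in test would be mapped to NaN. The sketch below uses hypothetical region values (not taken from the dataset) to show one way to check for this:

```python
import pandas as pd

# Hypothetical region values, for illustration only
train_regions = pd.Series(["서울특별시", "부산광역시", "서울특별시"])
test_regions = pd.Series(["부산광역시", "세종특별자치시"])

# Same idea as the cell above: integer codes in order of first appearance in train
local_map = {loc: i for i, loc in enumerate(train_regions.unique())}

mapped = test_regions.map(local_map)

# Regions unseen in train end up as NaN after the mapping
unseen = test_regions[mapped.isna()].unique().tolist()
print(unseen)
```

If `unseen` is non-empty on the real data, those rows need an explicit fallback code before modelling.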

### Bin 전용면적 (Exclusive Area) into 5-Unit Steps

In [5]:
train['전용면적'] = train['전용면적']//5*5
test['전용면적'] = test['전용면적']//5*5
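
The expression `// 5 * 5` floors each area to the nearest lower multiple of 5. A small sketch with made-up area values shows the effect:

```python
import pandas as pd

# Made-up areas, for illustration only
areas = pd.Series([39.72, 46.9, 51.93, 59.99])

# Floor division by 5, then multiply back: each value drops to its 5-unit bin
binned = areas // 5 * 5
print(binned.tolist())
```

Values such as 46.9 and 49.1 land in the same bin (45.0), which reduces the number of distinct area values per complex.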

## Preprocessing (Version 2)

Replace `'-'` with NaN and cast the columns to float.

In [6]:
columns = ['임대보증금', '임대료']

for col in columns:
    train.loc[train[col] == '-', col] = np.nan
    test.loc[test[col] == '-', col] = np.nan

    train[col] = train[col].astype(float)
    test[col] = test[col].astype(float)
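
An equivalent way to express the same cleaning step, sketched here on a made-up Series rather than the competition data, is to mask the placeholder and let `pd.to_numeric` do the conversion:

```python
import pandas as pd

# Made-up raw values: numeric strings, the '-' placeholder, and a missing entry
raw = pd.Series(["15667000", "-", None, "103680"])

# Turn '-' into NaN, then convert everything that remains to a float column
cleaned = pd.to_numeric(raw.mask(raw == "-"))
print(cleaned.tolist())
```

This is only an alternative formulation; the loop in the cell above produces the same float columns.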

### Handle Missing Values

#### **임대보증금, 임대료**

In [7]:
train[['임대보증금', '임대료']] = train[['임대보증금', '임대료']].fillna(0)
test[['임대보증금', '임대료']] = test[['임대보증금', '임대료']].fillna(0)

#### **Subway, Bus**

In [8]:
cols = ['도보 10분거리 내 지하철역 수(환승노선 수 반영)', '도보 10분거리 내 버스정류장 수']
train[cols] = train[cols].fillna(0)
test[cols] = test[cols].fillna(0)

#### **자격유형**

In [9]:
test.loc[test.단지코드.isin(['C2411']) & test.자격유형.isnull(), '자격유형'] = 'A'
test.loc[test.단지코드.isin(['C2253']) & test.자격유형.isnull(), '자격유형'] = 'C'

### Remove Duplicate Examples

In [10]:
train = train.drop_duplicates()
test = test.drop_duplicates()

### Merge 자격유형 Categories

In [11]:
train.loc[train.자격유형.isin(['J', 'L', 'K', 'N', 'M', 'O']), '자격유형'] = '행복주택_공급대상'
test.loc[test.자격유형.isin(['J', 'L', 'K', 'N', 'M', 'O']), '자격유형'] = '행복주택_공급대상'

train.loc[train.자격유형.isin(['H', 'B', 'E', 'G']), '자격유형'] = '국민임대_공급대상'
test.loc[test.자격유형.isin(['H', 'B', 'E', 'G']), '자격유형'] = '국민임대_공급대상'

train.loc[train.자격유형.isin(['C', 'I', 'F']), '자격유형'] = '영구임대_공급대상'
test.loc[test.자격유형.isin(['C', 'I', 'F']), '자격유형'] = '영구임대_공급대상'

### Merge 공급유형 Categories

In [12]:
train.loc[train.공급유형.isin(['공공임대(10년)', '공공임대(분납)']), '공급유형'] = '공공임대(10년/분납)'
test.loc[test.공급유형.isin(['공공임대(10년)', '공공임대(분납)']), '공급유형'] = '공공임대(10년/분납)'

## Preprocessing (Version 3)

### **Version_3-1** : [Test 1] only

- [Test 1] Drop every train example whose `공급유형` is `장기전세`, `공공분양`, or `공공임대(5년)`.

In [13]:
idx = train[(train.공급유형 == '장기전세') | (train.공급유형 == '공공분양') | (train.공급유형 == '공공임대(5년)')].index
version_3_1_train = train.drop(idx)

version_3_1_train.공급유형.unique().tolist()

['국민임대', '공공임대(50년)', '영구임대', '임대상가', '공공임대(10년/분납)', '행복주택']

### **Version_3-2** : [Test 2] only

- [Test 2] For each `단지코드`, map `임대건물구분` to 1 if the complex contains only `아파트` and to 0 if it contains both `상가` and `아파트`.

In [14]:
version_3_2_train = train.copy()

codes = version_3_2_train.단지코드.unique().tolist()

for code in tqdm(codes):
    values = version_3_2_train.loc[version_3_2_train.단지코드 == code, '임대건물구분'].unique().tolist()

    if '상가' in values:
        version_3_2_train.loc[version_3_2_train.단지코드 == code, '임대건물구분'] = 0
    else:
        version_3_2_train.loc[version_3_2_train.단지코드 == code, '임대건물구분'] = 1

  0%|          | 0/423 [00:00<?, ?it/s]
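
The per-code loop above can also be written without iteration. The following is a sketch on a toy frame (the rows are invented; only the column names come from the notebook) that uses a grouped `transform("any")` to flag complexes containing at least one `상가` row:

```python
import pandas as pd

# Toy data: C1 mixes shops and apartments, C2 and C3 are apartment-only
toy = pd.DataFrame({
    "단지코드": ["C1", "C1", "C2", "C2", "C3"],
    "임대건물구분": ["아파트", "상가", "아파트", "아파트", "아파트"],
})

# True on every row of a complex that has at least one '상가' row
has_shop = (toy["임대건물구분"] == "상가").groupby(toy["단지코드"]).transform("any")

# 0 for mixed complexes, 1 for apartment-only complexes
toy["임대건물구분"] = (~has_shop).astype(int)
print(toy)
```

The result matches the loop's rule, and the same expression can be applied to train and test without copying the cell.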

In [15]:
version_3_2_test = test.copy()

codes = version_3_2_test.단지코드.unique().tolist()

for code in tqdm(codes):
    values = version_3_2_test.loc[version_3_2_test.단지코드 == code, '임대건물구분'].unique().tolist()

    if '상가' in values:
        version_3_2_test.loc[version_3_2_test.단지코드 == code, '임대건물구분'] = 0
    else:
        version_3_2_test.loc[version_3_2_test.단지코드 == code, '임대건물구분'] = 1

  0%|          | 0/150 [00:00<?, ?it/s]

### **Version_3-3** : [Test 3] only

- [Test 3] Add a "mean rent per `자격유형`" feature.

In [16]:
version_3_3_train = train.copy()

qualifies = version_3_3_train.자격유형.unique().tolist()

for qualify in tqdm(qualifies):
    version_3_3_train.loc[version_3_3_train.자격유형 == qualify, '평균임대료(자격유형)'] = version_3_3_train.loc[version_3_3_train.자격유형 == qualify, '임대료'].mean()

  0%|          | 0/5 [00:00<?, ?it/s]
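
The same feature can be computed in one line with a grouped `transform("mean")`. This is a sketch on invented rows, not the notebook's data:

```python
import pandas as pd

# Invented rents for two qualification types
toy = pd.DataFrame({
    "자격유형": ["A", "A", "D", "D", "D"],
    "임대료": [100.0, 200.0, 30.0, 60.0, 90.0],
})

# Broadcast each group's mean rent back onto its own rows
toy["평균임대료(자격유형)"] = toy.groupby("자격유형")["임대료"].transform("mean")
print(toy)
```

Since no loop variable is involved, this form also avoids mistakes from mistyped variable names inside the loop body.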

In [17]:
version_3_3_test = test.copy()

qualifies = version_3_3_test.자격유형.unique().tolist()

for qualify in tqdm(qualifies):
    version_3_3_test.loc[version_3_3_test.자격유형 == qualify, '평균임대료(자격유형)'] = version_3_3_test.loc[version_3_3_test.자격유형 == qualify, '임대료'].mean()

  0%|          | 0/5 [00:00<?, ?it/s]

### **Version_3-4** : [Test 1] & [Test 2]

- [Test 1] Drop every train example whose `공급유형` is `장기전세`, `공공분양`, or `공공임대(5년)`.
- [Test 2] For each `단지코드`, map `임대건물구분` to 1 if the complex contains only `아파트` and to 0 if it contains both `상가` and `아파트`.

#### [Test 1]

In [18]:
idx = train[(train.공급유형 == '장기전세') | (train.공급유형 == '공공분양') | (train.공급유형 == '공공임대(5년)')].index
version_3_4_train = train.drop(idx)

version_3_4_train.공급유형.unique().tolist()

['국민임대', '공공임대(50년)', '영구임대', '임대상가', '공공임대(10년/분납)', '행복주택']

#### [Test 2]

In [19]:
codes = version_3_4_train.단지코드.unique().tolist()

for code in tqdm(codes):
    values = version_3_4_train.loc[version_3_4_train.단지코드 == code, '임대건물구분'].unique().tolist()

    if '상가' in values :
        version_3_4_train.loc[version_3_4_train.단지코드 == code, '임대건물구분'] = 0
    else :
        version_3_4_train.loc[version_3_4_train.단지코드 == code, '임대건물구분'] = 1

  0%|          | 0/421 [00:00<?, ?it/s]

In [20]:
version_3_4_test = test.copy()

codes = version_3_4_test.단지코드.unique().tolist()

for code in tqdm(codes):
    values = version_3_4_test.loc[version_3_4_test.단지코드 == code, '임대건물구분'].unique().tolist()

    if '상가' in values:
        version_3_4_test.loc[version_3_4_test.단지코드 == code, '임대건물구분'] = 0
    else:
        version_3_4_test.loc[version_3_4_test.단지코드 == code, '임대건물구분'] = 1

  0%|          | 0/150 [00:00<?, ?it/s]

### **Version_3-5** : [Test 1] & [Test 3]

- [Test 1] Drop every train example whose `공급유형` is `장기전세`, `공공분양`, or `공공임대(5년)`.
- [Test 3] Add a "mean rent per `자격유형`" feature.

#### [Test 1]

In [21]:
idx = train[(train.공급유형 == '장기전세') | (train.공급유형 == '공공분양') | (train.공급유형 == '공공임대(5년)')].index
version_3_5_train = train.drop(idx)

version_3_5_train.공급유형.unique().tolist()

['국민임대', '공공임대(50년)', '영구임대', '임대상가', '공공임대(10년/분납)', '행복주택']

#### [Test 3]

In [22]:
qualifies = version_3_5_train.자격유형.unique().tolist()

for qualify in tqdm(qualifies):
    version_3_5_train.loc[version_3_5_train.자격유형 == qualify, '평균임대료(자격유형)'] = version_3_5_train.loc[version_3_5_train.자격유형 == qualify, '임대료'].mean()

  0%|          | 0/5 [00:00<?, ?it/s]

In [23]:
version_3_5_test = test.copy()

qualifies = version_3_5_test.자격유형.unique().tolist()

for qualify in tqdm(qualifies):
    version_3_5_test.loc[version_3_5_test.자격유형 == qualify, '평균임대료(자격유형)'] = version_3_5_test.loc[version_3_5_test.자격유형 == qualify, '임대료'].mean()

  0%|          | 0/5 [00:00<?, ?it/s]

### **Version_3-6** : [Test 2] & [Test 3]

- [Test 2] For each `단지코드`, map `임대건물구분` to 1 if the complex contains only `아파트` and to 0 if it contains both `상가` and `아파트`.
- [Test 3] Add a "mean rent per `자격유형`" feature.

#### [Test 2]

In [24]:
version_3_6_train = train.copy()

codes = version_3_6_train.단지코드.unique().tolist()

for code in tqdm(codes):
    values = version_3_6_train.loc[version_3_6_train.단지코드 == code, '임대건물구분'].unique().tolist()

    if '상가' in values:
        version_3_6_train.loc[version_3_6_train.단지코드 == code, '임대건물구분'] = 0
    else :
        version_3_6_train.loc[version_3_6_train.단지코드 == code, '임대건물구분'] = 1

  0%|          | 0/423 [00:00<?, ?it/s]

In [25]:
version_3_6_test = test.copy()

codes = version_3_6_test.단지코드.unique().tolist()

for code in tqdm(codes):
    values = version_3_6_test.loc[version_3_6_test.단지코드 == code, '임대건물구분'].unique().tolist()

    if '상가' in values:
        version_3_6_test.loc[version_3_6_test.단지코드 == code, '임대건물구분'] = 0
    else:
        version_3_6_test.loc[version_3_6_test.단지코드 == code, '임대건물구분'] = 1

  0%|          | 0/150 [00:00<?, ?it/s]

#### [Test 3]

In [26]:
qualifies = version_3_6_train.자격유형.unique().tolist()

for qualify in tqdm(qualifies):
    version_3_6_train.loc[version_3_6_train.자격유형 == qualify, '평균임대료(자격유형)'] = version_3_6_train.loc[version_3_6_train.자격유형 == qualify, '임대료'].mean()

  0%|          | 0/5 [00:00<?, ?it/s]

In [27]:
qualifies = version_3_6_test.자격유형.unique().tolist()

for qualify in tqdm(qualifies):
    version_3_6_test.loc[version_3_6_test.자격유형 == qualify, '평균임대료(자격유형)'] = version_3_6_test.loc[version_3_6_test.자격유형 == qualify, '임대료'].mean()

  0%|          | 0/5 [00:00<?, ?it/s]

### **Version_3-7** : all three options

- [Test 1] Drop every train example whose `공급유형` is `장기전세`, `공공분양`, or `공공임대(5년)`.
- [Test 2] For each `단지코드`, map `임대건물구분` to 1 if the complex contains only `아파트` and to 0 if it contains both `상가` and `아파트`.
- [Test 3] Add a "mean rent per `자격유형`" feature.

#### [Test 1]

In [28]:
idx = train[(train.공급유형 == '장기전세') | (train.공급유형 == '공공분양') | (train.공급유형 == '공공임대(5년)')].index
version_3_7_train = train.drop(idx)

version_3_7_train.공급유형.unique().tolist()

['국민임대', '공공임대(50년)', '영구임대', '임대상가', '공공임대(10년/분납)', '행복주택']

#### [Test 2]

In [29]:
codes = version_3_7_train.단지코드.unique().tolist()

for code in tqdm(codes):
    values = version_3_7_train.loc[version_3_7_train.단지코드 == code, '임대건물구분'].unique().tolist()

    if '상가' in values:
        version_3_7_train.loc[version_3_7_train.단지코드 == code, '임대건물구분'] = 0
    else :
        version_3_7_train.loc[version_3_7_train.단지코드 == code, '임대건물구분'] = 1

  0%|          | 0/421 [00:00<?, ?it/s]

In [30]:
version_3_7_test = test.copy()

codes = version_3_7_test.단지코드.unique().tolist()

for code in tqdm(codes):
    values = version_3_7_test.loc[version_3_7_test.단지코드 == code, '임대건물구분'].unique().tolist()

    if '상가' in values:
        version_3_7_test.loc[version_3_7_test.단지코드 == code, '임대건물구분'] = 0
    else:
        version_3_7_test.loc[version_3_7_test.단지코드 == code, '임대건물구분'] = 1

  0%|          | 0/150 [00:00<?, ?it/s]

#### [Test 3]

In [31]:
qualifies = version_3_7_train.자격유형.unique().tolist()

for qualify in tqdm(qualifies):
    version_3_7_train.loc[version_3_7_train.자격유형 == qualify, '평균임대료(자격유형)'] = version_3_7_train.loc[version_3_7_train.자격유형 == qualify, '임대료'].mean()

  0%|          | 0/5 [00:00<?, ?it/s]

In [32]:
qualifies = version_3_7_test.자격유형.unique().tolist()

for qualify in tqdm(qualifies):
    version_3_7_test.loc[version_3_7_test.자격유형 == qualify, '평균임대료(자격유형)'] = version_3_7_test.loc[version_3_7_test.자격유형 == qualify, '임대료'].mean()

  0%|          | 0/5 [00:00<?, ?it/s]

## Aggregation

### Features with a Single Value per 단지코드

In [33]:
idx = train[(train.공급유형 == '장기전세') | (train.공급유형 == '공공분양') | (train.공급유형 == '공공임대(5년)')].index
train_drop = train.drop(idx)

train_drop.공급유형.unique().tolist()

['국민임대', '공공임대(50년)', '영구임대', '임대상가', '공공임대(10년/분납)', '행복주택']

In [34]:
unique_cols = ['총세대수', '지역', '공가수', '도보 10분거리 내 지하철역 수(환승노선 수 반영)', '도보 10분거리 내 버스정류장 수', '단지내주차면수', '등록차량수']

train_eq = train.set_index('단지코드')[unique_cols].drop_duplicates()
test_eq = test.set_index('단지코드')[[col for col in unique_cols if col != '등록차량수']].drop_duplicates()

train_drop = train_drop.set_index('단지코드')[unique_cols].drop_duplicates()
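
One subtlety of the `set_index(...)[cols].drop_duplicates()` pattern is that `drop_duplicates` ignores the index, so two different complexes with identical values in every selected column would collapse into one row. The toy frame below (invented rows, and a shortened `cols` list as an assumption) illustrates this and a possible alternative that deduplicates on the code as well:

```python
import pandas as pd

# Invented rows: C2 and C3 are different complexes with identical feature values
toy = pd.DataFrame({
    "단지코드": ["C1", "C1", "C2", "C3"],
    "총세대수": [500, 500, 300, 300],
    "지역": [0, 0, 1, 1],
})
cols = ["총세대수", "지역"]

# Notebook pattern: the index is ignored, so the C3 row is dropped as a duplicate of C2
by_values = toy.set_index("단지코드")[cols].drop_duplicates()

# Alternative: include the code in the duplicate check, then index by it
by_code = toy.drop_duplicates(["단지코드"] + cols).set_index("단지코드")[cols]

print(by_values.index.tolist(), by_code.index.tolist())
```

With seven columns such a collision is unlikely on the real data, but checking `index.is_unique` and the row count against the number of unique codes is a cheap safeguard.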

### Features with Multiple Values per 단지코드

#### Define the Feature Reshape Function

In [35]:
def reshape_cat_features(data, cast_col, value_col):
    res = data.drop_duplicates(['단지코드', cast_col]).assign(counter=1).pivot(index='단지코드', columns=cast_col, values=value_col).fillna(0)
    res.columns.name = None
    res = res.rename(columns={col:cast_col+'_'+col for col in res.columns})
    return res
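
To make the behaviour of `reshape_cat_features` concrete, the sketch below repeats the function (reformatted but otherwise unchanged) and runs it on a toy frame with invented rows. Each complex becomes one row, and each category becomes a 0/1 indicator column:

```python
import pandas as pd

def reshape_cat_features(data, cast_col, value_col):
    # One row per (complex, category) pair, flagged with counter=1, then pivoted wide
    res = (
        data.drop_duplicates(['단지코드', cast_col])
        .assign(counter=1)
        .pivot(index='단지코드', columns=cast_col, values=value_col)
        .fillna(0)
    )
    res.columns.name = None
    res = res.rename(columns={col: cast_col + '_' + col for col in res.columns})
    return res

# Invented rows: C1 offers two supply types, C2 offers one
toy = pd.DataFrame({
    "단지코드": ["C1", "C1", "C1", "C2"],
    "공급유형": ["국민임대", "국민임대", "영구임대", "국민임대"],
})

wide = reshape_cat_features(toy, cast_col="공급유형", value_col="counter")
print(wide)
```

Note that the column rename concatenates strings, so the function expects the category values to be strings.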

#### **Version_3-1** : [Test 1]

In [36]:
version_3_1_train = version_3_1_train.drop(unique_cols, axis=1)
version_3_1_test = test.drop([col for col in unique_cols if col != '등록차량수'], axis=1)

In [37]:
version_3_1_train = pd.concat([train_drop, \
    reshape_cat_features(data=version_3_1_train, cast_col='임대건물구분', value_col='counter'),
    reshape_cat_features(data=version_3_1_train, cast_col='공급유형', value_col='counter'),
    reshape_cat_features(data=version_3_1_train, cast_col='자격유형', value_col='counter')], axis=1)

In [38]:
version_3_1_test = pd.concat([test_eq, \
    reshape_cat_features(data=version_3_1_test, cast_col='임대건물구분', value_col='counter'),
    reshape_cat_features(data=version_3_1_test, cast_col='공급유형', value_col='counter'),
    reshape_cat_features(data=version_3_1_test, cast_col='자격유형', value_col='counter')], axis=1)

#### Version_3-2 : [Test 2]

In [39]:
version_3_2_train = version_3_2_train.drop(unique_cols, axis=1)
version_3_2_test = version_3_2_test.drop([col for col in unique_cols if col != '등록차량수'], axis=1)

In [40]:
version_3_2_train = pd.concat([train_eq, version_3_2_train[['단지코드', '임대건물구분']].drop_duplicates().set_index('단지코드'), \
    reshape_cat_features(data=version_3_2_train, cast_col='공급유형', value_col='counter'),
    reshape_cat_features(data=version_3_2_train, cast_col='자격유형', value_col='counter')], axis=1)

In [41]:
version_3_2_test = pd.concat([test_eq, version_3_2_test[['단지코드', '임대건물구분']].drop_duplicates().set_index('단지코드'), \
    reshape_cat_features(data=version_3_2_test, cast_col='공급유형', value_col='counter'),
    reshape_cat_features(data=version_3_2_test, cast_col='자격유형', value_col='counter')], axis=1)

#### Version_3-3 : [Test 3]

In [42]:
version_3_3_train = version_3_3_train.drop(unique_cols, axis=1)
version_3_3_test = version_3_3_test.drop([col for col in unique_cols if col != '등록차량수'], axis=1)

In [43]:
mean_vals = version_3_3_train[['단지코드', '임대료', '평균임대료(자격유형)']].copy()

codes = mean_vals.단지코드.unique().tolist()

for code in tqdm(codes):
    mean_vals.loc[mean_vals.단지코드 == code, '평균임대료(자격유형)'] = mean_vals.loc[mean_vals.단지코드 == code, '평균임대료(자격유형)'].mean()
    mean_vals.loc[mean_vals.단지코드 == code, '임대료'] = mean_vals.loc[mean_vals.단지코드 == code, '임대료'].mean()

mean_vals = mean_vals.drop_duplicates().set_index('단지코드')

  0%|          | 0/423 [00:00<?, ?it/s]
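
The per-complex averaging loop above is equivalent to a grouped mean. A sketch on invented rows:

```python
import pandas as pd

# Invented rows: two rental units in C1, one in C2
toy = pd.DataFrame({
    "단지코드": ["C1", "C1", "C2"],
    "임대료": [100.0, 300.0, 50.0],
    "평균임대료(자격유형)": [150.0, 250.0, 60.0],
})

# One row per complex, holding the mean of each column within that complex
mean_vals = toy.groupby("단지코드")[["임대료", "평균임대료(자격유형)"]].mean()
print(mean_vals)
```

`groupby` sorts the index by code, whereas the loop keeps first-appearance order. That difference does not matter for the later `pd.concat(..., axis=1)`, which aligns on the index.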

In [44]:
version_3_3_train = pd.concat([train_eq, mean_vals, \
    reshape_cat_features(data=version_3_3_train, cast_col='임대건물구분', value_col='counter'),
    reshape_cat_features(data=version_3_3_train, cast_col='공급유형', value_col='counter'),
    reshape_cat_features(data=version_3_3_train, cast_col='자격유형', value_col='counter')], axis=1)

In [45]:
mean_vals = version_3_3_test[['단지코드', '임대료', '평균임대료(자격유형)']].copy()

codes = mean_vals.단지코드.unique().tolist()

for code in tqdm(codes):
    mean_vals.loc[mean_vals.단지코드 == code, '평균임대료(자격유형)'] = mean_vals.loc[mean_vals.단지코드 == code, '평균임대료(자격유형)'].mean()
    mean_vals.loc[mean_vals.단지코드 == code, '임대료'] = mean_vals.loc[mean_vals.단지코드 == code, '임대료'].mean()

mean_vals = mean_vals.drop_duplicates().set_index('단지코드')

  0%|          | 0/150 [00:00<?, ?it/s]

In [46]:
version_3_3_test = pd.concat([test_eq, mean_vals, \
    reshape_cat_features(data=version_3_3_test, cast_col='임대건물구분', value_col='counter'),
    reshape_cat_features(data=version_3_3_test, cast_col='공급유형', value_col='counter'),
    reshape_cat_features(data=version_3_3_test, cast_col='자격유형', value_col='counter')], axis=1)

#### Version_3-4 : [Test 1] & [Test 2]

In [47]:
version_3_4_train = version_3_4_train.drop(unique_cols, axis=1)
version_3_4_test = version_3_4_test.drop([col for col in unique_cols if col != '등록차량수'], axis=1)

In [48]:
version_3_4_train = pd.concat([train_drop, version_3_4_train[['단지코드', '임대건물구분']].drop_duplicates().set_index('단지코드'), \
    reshape_cat_features(data=version_3_4_train, cast_col='공급유형', value_col='counter'),
    reshape_cat_features(data=version_3_4_train, cast_col='자격유형', value_col='counter')], axis=1)

In [49]:
version_3_4_test = pd.concat([test_eq, version_3_4_test[['단지코드', '임대건물구분']].drop_duplicates().set_index('단지코드'), \
    reshape_cat_features(data=version_3_4_test, cast_col='공급유형', value_col='counter'),
    reshape_cat_features(data=version_3_4_test, cast_col='자격유형', value_col='counter')], axis=1)

#### Version_3-5 : [Test 1] & [Test 3]

In [50]:
version_3_5_train = version_3_5_train.drop(unique_cols, axis=1)
version_3_5_test = version_3_5_test.drop([col for col in unique_cols if col != '등록차량수'], axis=1)

In [51]:
mean_vals = version_3_5_train[['단지코드', '임대료', '평균임대료(자격유형)']].copy()

codes = mean_vals.단지코드.unique().tolist()

for code in tqdm(codes):
    mean_vals.loc[mean_vals.단지코드 == code, '평균임대료(자격유형)'] = mean_vals.loc[mean_vals.단지코드 == code, '평균임대료(자격유형)'].mean()
    mean_vals.loc[mean_vals.단지코드 == code, '임대료'] = mean_vals.loc[mean_vals.단지코드 == code, '임대료'].mean()

mean_vals = mean_vals.drop_duplicates().set_index('단지코드')

  0%|          | 0/421 [00:00<?, ?it/s]

In [52]:
version_3_5_train = pd.concat([train_drop, mean_vals, \
    reshape_cat_features(data=version_3_5_train, cast_col='임대건물구분', value_col='counter'),
    reshape_cat_features(data=version_3_5_train, cast_col='공급유형', value_col='counter'),
    reshape_cat_features(data=version_3_5_train, cast_col='자격유형', value_col='counter')], axis=1)

In [53]:
mean_vals = version_3_5_test[['단지코드', '임대료', '평균임대료(자격유형)']].copy()

codes = mean_vals.단지코드.unique().tolist()

for code in tqdm(codes):
    mean_vals.loc[mean_vals.단지코드 == code, '평균임대료(자격유형)'] = mean_vals.loc[mean_vals.단지코드 == code, '평균임대료(자격유형)'].mean()
    mean_vals.loc[mean_vals.단지코드 == code, '임대료'] = mean_vals.loc[mean_vals.단지코드 == code, '임대료'].mean()

mean_vals = mean_vals.drop_duplicates().set_index('단지코드')

  0%|          | 0/150 [00:00<?, ?it/s]

In [54]:
version_3_5_test = pd.concat([test_eq, mean_vals, \
    reshape_cat_features(data=version_3_5_test, cast_col='임대건물구분', value_col='counter'),
    reshape_cat_features(data=version_3_5_test, cast_col='공급유형', value_col='counter'),
    reshape_cat_features(data=version_3_5_test, cast_col='자격유형', value_col='counter')], axis=1)

#### Version_3-6 : [Test 2] & [Test 3]

In [55]:
version_3_6_train = version_3_6_train.drop(unique_cols, axis=1)
version_3_6_test = version_3_6_test.drop([col for col in unique_cols if col != '등록차량수'], axis=1)

In [56]:
mean_vals = version_3_6_train[['단지코드', '임대료', '평균임대료(자격유형)']].copy()

codes = mean_vals.단지코드.unique().tolist()

for code in tqdm(codes):
    mean_vals.loc[mean_vals.단지코드 == code, '평균임대료(자격유형)'] = mean_vals.loc[mean_vals.단지코드 == code, '평균임대료(자격유형)'].mean()
    mean_vals.loc[mean_vals.단지코드 == code, '임대료'] = mean_vals.loc[mean_vals.단지코드 == code, '임대료'].mean()

mean_vals = mean_vals.drop_duplicates().set_index('단지코드')

  0%|          | 0/423 [00:00<?, ?it/s]

In [57]:
version_3_6_train = pd.concat([train_eq, version_3_6_train[['단지코드', '임대건물구분']].drop_duplicates().set_index('단지코드'), mean_vals, \
    reshape_cat_features(data=version_3_6_train, cast_col='공급유형', value_col='counter'),
    reshape_cat_features(data=version_3_6_train, cast_col='자격유형', value_col='counter')], axis=1)

In [58]:
mean_vals = version_3_6_test[['단지코드', '임대료', '평균임대료(자격유형)']].copy()

codes = mean_vals.단지코드.unique().tolist()

for code in tqdm(codes):
    mean_vals.loc[mean_vals.단지코드 == code, '평균임대료(자격유형)'] = mean_vals.loc[mean_vals.단지코드 == code, '평균임대료(자격유형)'].mean()
    mean_vals.loc[mean_vals.단지코드 == code, '임대료'] = mean_vals.loc[mean_vals.단지코드 == code, '임대료'].mean()

mean_vals = mean_vals.drop_duplicates().set_index('단지코드')

  0%|          | 0/150 [00:00<?, ?it/s]

In [59]:
version_3_6_test = pd.concat([test_eq, version_3_6_test[['단지코드', '임대건물구분']].drop_duplicates().set_index('단지코드'), mean_vals, \
    reshape_cat_features(data=version_3_6_test, cast_col='공급유형', value_col='counter'),
    reshape_cat_features(data=version_3_6_test, cast_col='자격유형', value_col='counter')], axis=1)

#### Version_3-7 : [Test 1] & [Test 2] & [Test 3]

In [60]:
version_3_7_train = version_3_7_train.drop(unique_cols, axis=1)
version_3_7_test = version_3_7_test.drop([col for col in unique_cols if col != '등록차량수'], axis=1)

In [61]:
mean_vals = version_3_7_train[['단지코드', '임대료', '평균임대료(자격유형)']].copy()

codes = mean_vals.단지코드.unique().tolist()

for code in tqdm(codes):
    mean_vals.loc[mean_vals.단지코드 == code, '평균임대료(자격유형)'] = mean_vals.loc[mean_vals.단지코드 == code, '평균임대료(자격유형)'].mean()
    mean_vals.loc[mean_vals.단지코드 == code, '임대료'] = mean_vals.loc[mean_vals.단지코드 == code, '임대료'].mean()

mean_vals = mean_vals.drop_duplicates().set_index('단지코드')

  0%|          | 0/421 [00:00<?, ?it/s]

In [62]:
version_3_7_train = pd.concat([train_drop, version_3_7_train[['단지코드', '임대건물구분']].drop_duplicates().set_index('단지코드'), mean_vals, \
    reshape_cat_features(data=version_3_7_train, cast_col='공급유형', value_col='counter'),
    reshape_cat_features(data=version_3_7_train, cast_col='자격유형', value_col='counter')], axis=1)

In [63]:
mean_vals = version_3_7_test[['단지코드', '임대료', '평균임대료(자격유형)']].copy()

codes = mean_vals.단지코드.unique().tolist()

for code in tqdm(codes):
    mean_vals.loc[mean_vals.단지코드 == code, '평균임대료(자격유형)'] = mean_vals.loc[mean_vals.단지코드 == code, '평균임대료(자격유형)'].mean()
    mean_vals.loc[mean_vals.단지코드 == code, '임대료'] = mean_vals.loc[mean_vals.단지코드 == code, '임대료'].mean()

mean_vals = mean_vals.drop_duplicates().set_index('단지코드')

  0%|          | 0/150 [00:00<?, ?it/s]

In [64]:
version_3_7_test = pd.concat([test_eq, version_3_7_test[['단지코드', '임대건물구분']].drop_duplicates().set_index('단지코드'), mean_vals, \
    reshape_cat_features(data=version_3_7_test, cast_col='공급유형', value_col='counter'),
    reshape_cat_features(data=version_3_7_test, cast_col='자격유형', value_col='counter')], axis=1)

## Performance Test

### Model Define (Baseline Model)

In [65]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_jobs=-1, random_state=42)
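
The evaluation cells that follow repeat the same three steps for each variant. As a hedged sketch, they could be wrapped in a helper; `cv_mae` is a hypothetical name, the toy frame is synthetic, and `n_estimators=20` is chosen only to keep the example fast (the notebook's baseline uses the default):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def cv_mae(frame, target="등록차량수", cv=5):
    """Return the mean absolute error averaged over the CV folds."""
    X = frame.drop(columns=[target])
    y = frame[target]
    model = RandomForestRegressor(n_estimators=20, random_state=42)
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
    # scikit-learn reports negated MAE, so flip the sign for readability
    return -scores.mean()

# Synthetic data: the target is roughly 0.8 x the single feature, plus noise
rng = np.random.default_rng(0)
toy = pd.DataFrame({"총세대수": rng.integers(100, 1000, size=50)})
toy["등록차량수"] = toy["총세대수"] * 0.8 + rng.normal(0, 10, size=50)

mae = cv_mae(toy)
print(f"MAE: {mae:.2f}")
```

Calling such a helper once per variant would keep the model settings and scoring identical across all seven comparisons.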

### CV Score (`neg_mean_absolute_error`)

In [66]:
from sklearn.model_selection import cross_val_score

#### Version_3-1 : [Test 1]

In [67]:
X_train = version_3_1_train.drop(['등록차량수'], axis=1)
y_train = version_3_1_train['등록차량수']

In [68]:
print(f"[Test 1] only : {cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_absolute_error').mean()}")

[Test 1] only : -150.15621848739497


#### Version_3-2 : [Test 2]

In [69]:
X_train = version_3_2_train.drop(['등록차량수'], axis=1)
y_train = version_3_2_train['등록차량수']

In [70]:
print(f"[Test 2] only : {cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_absolute_error').mean()}")

[Test 2] only : -150.4797406162465


#### Version_3-3 : [Test 3]

In [71]:
X_train = version_3_3_train.drop(['등록차량수'], axis=1)
y_train = version_3_3_train['등록차량수']

In [72]:
print(f"[Test 3] only : {cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_absolute_error').mean()}")

[Test 3] only : -149.68167647058823


#### Version_3-4 : [Test 1] & [Test 2]

In [73]:
X_train = version_3_4_train.drop(['등록차량수'], axis=1)
y_train = version_3_4_train['등록차량수']

In [74]:
print(f"[Test 1] & [Test 2] : {cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_absolute_error').mean()}")

[Test 1] & [Test 2] : -149.67148151260503


#### Version_3-5 : [Test 1] & [Test 3]

In [75]:
X_train = version_3_5_train.drop(['등록차량수'], axis=1)
y_train = version_3_5_train['등록차량수']

In [76]:
print(f"[Test 1] & [Test 3] : {cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_absolute_error').mean()}")

[Test 1] & [Test 3] : -148.16336526610647


#### Version_3-6 : [Test 2] & [Test 3]

In [77]:
X_train = version_3_6_train.drop(['등록차량수'], axis=1)
y_train = version_3_6_train['등록차량수']

In [78]:
print(f"[Test 2] & [Test 3] : {cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_absolute_error').mean()}")

[Test 2] & [Test 3] : -150.09751596638654


#### Version_3-7 : [Test 1] & [Test 2] & [Test 3]

In [79]:
X_train = version_3_7_train.drop(['등록차량수'], axis=1)
y_train = version_3_7_train['등록차량수']

In [80]:
print(f"[Test 1] & [Test 2] & [Test 3] : {cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_absolute_error').mean()}")

[Test 1] & [Test 2] & [Test 3] : -149.82798487394956
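
To compare the seven results side by side, the scores printed above can be collected in a dictionary. This is a small sketch; the `scores` dictionary simply copies the reported values, and because the metric is negated MAE, the value closest to zero is the best:

```python
# CV scores copied from the cell outputs above (neg_mean_absolute_error)
scores = {
    "Version_3-1": -150.15621848739497,
    "Version_3-2": -150.4797406162465,
    "Version_3-3": -149.68167647058823,
    "Version_3-4": -149.67148151260503,
    "Version_3-5": -148.16336526610647,
    "Version_3-6": -150.09751596638654,
    "Version_3-7": -149.82798487394956,
}

# Higher (closer to zero) is better for a negated error metric
ranking = sorted(scores, key=scores.get, reverse=True)
best = ranking[0]

for name in ranking:
    print(f"{name}: {scores[name]:.4f}")
```

The spread between the best and worst variant is a little over 2 MAE points, so the differences are small relative to the overall error level of roughly 150.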
