# Version 3 Validation

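## What Is Validated

> Apply each of the three preprocessing options introduced in `Version 3` separately and observe how performance changes.

1. [Test 1] Drop every train example whose 공급유형 (supply type) is 장기전세, 공공분양, or 공공임대(5년).
2. [Test 2] For each 단지코드 (complex code), map 임대건물구분 (rental building type) to 1 if the complex contains only '아파트' (apartments) and to 0 if it contains both '상가' (shops) and '아파트'.
3. [Test 3] Add a 'mean rent per 자격유형 (qualification type)' feature.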

## Validation Order

1. **Version_3-1** : [Test 1] only
2. **Version_3-2** : [Test 2] only
3. **Version_3-3** : [Test 3] only
4. **Version_3-4** : [Test 1] & [Test 2]
5. **Version_3-5** : [Test 1] & [Test 3]
6. **Version_3-6** : [Test 2] & [Test 3]
7. **Version_3-7** : all three preprocessing options
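
The seven variants are the non-empty subsets of the three tests, listed by subset size. A minimal sketch that enumerates them in the same order, assuming the hypothetical labels `test_1` to `test_3` (these names do not appear in the notebook):

```python
from itertools import combinations

# Hypothetical labels for the three preprocessing options.
TESTS = ("test_1", "test_2", "test_3")


def enumerate_variants(tests=TESTS):
    """Return every non-empty combination of tests, smallest subsets first."""
    variants = []
    for size in range(1, len(tests) + 1):
        for combo in combinations(tests, size):
            variants.append(combo)
    return variants


variants = enumerate_variants()
for number, combo in enumerate(variants, start=1):
    print(f"Version_3-{number}: {' & '.join(combo)}")
```

With three tests this gives 2^3 - 1 = 7 variants, matching the list above.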

## Import Module

In [1]:
import pandas as pd
import numpy as np
from os.path import join as Join
from tqdm.notebook import tqdm

## Data Load

In [2]:
DATA_ROOT = ''
DATA_ROOT = Join(DATA_ROOT, '../../../../competition_data/parking_data/')

TRAIN_ROOT = Join(DATA_ROOT, 'train.csv')
TEST_ROOT = Join(DATA_ROOT, 'test.csv')
SUBMISSION_ROOT = Join(DATA_ROOT, 'sample_submission.csv')

print(f"DATA_ROOT : {DATA_ROOT}")
print(f"TRAIN_ROOT : {TRAIN_ROOT}")
print(f"TEST_ROOT : {TEST_ROOT}")
print(f"SUBMISSION_ROOT : {SUBMISSION_ROOT}")

DATA_ROOT : ../../../../competition_data/parking_data/
TRAIN_ROOT : ../../../../competition_data/parking_data/train.csv
TEST_ROOT : ../../../../competition_data/parking_data/test.csv
SUBMISSION_ROOT : ../../../../competition_data/parking_data/sample_submission.csv


In [3]:
train = pd.read_csv(TRAIN_ROOT)
test = pd.read_csv(TEST_ROOT)
submission = pd.read_csv(SUBMISSION_ROOT)

print("Data Loaded!")

Data Loaded!


## Preprocessing (Version 1)

### Map 지역 (Region) Names to Integers

In [4]:
local_map = {}

for i, loc in enumerate(train['지역'].unique()):
    local_map[loc] = i

train['지역'] = train['지역'].map(local_map)
test['지역'] = test['지역'].map(local_map)

### Bin 전용면적 (Exclusive Area) into Steps of 5

In [5]:
train['전용면적'] = train['전용면적']//5*5
test['전용면적'] = test['전용면적']//5*5
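
The expression `// 5 * 5` floors each area to the nearest lower multiple of 5, so areas that differ only slightly fall into the same bin. A small illustration on made-up values (the numbers below are not taken from the dataset):

```python
import pandas as pd

# Made-up areas, used only to show the effect of floor-binning.
area = pd.Series([33.48, 39.72, 51.93, 59.99, 60.0])

# Floor division by 5, then multiply back: 33.48 -> 30.0, 59.99 -> 55.0, 60.0 -> 60.0
binned = area // 5 * 5
print(binned.tolist())
```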

## Preprocessing (Version 2)

Replace `'-'` with NULL and change the dtype to float.

In [6]:
columns = ['임대보증금', '임대료']

for col in columns:
    train.loc[train[col] == '-', col] = np.nan
    test.loc[test[col] == '-', col] = np.nan

    train[col] = train[col].astype(float)
    test[col] = test[col].astype(float)
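
A possible alternative, sketched here and not what the notebook does, is `pd.to_numeric` with `errors="coerce"`. The helper name `to_float` is hypothetical. One difference to keep in mind: coercion turns every non-numeric string into NaN, not only `'-'`.

```python
import pandas as pd


def to_float(series):
    """Convert a string column to float; '-' and any other non-numeric text become NaN."""
    return pd.to_numeric(series, errors="coerce").astype(float)


# Toy column mimicking the '-' placeholder.
toy = pd.Series(["15667000", "-", "103680"])
converted = to_float(toy)
print(converted)
```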

### Handle NULL Values

#### **임대보증금, 임대료** (rental deposit, rent)

In [7]:
train[['임대보증금', '임대료']] = train[['임대보증금', '임대료']].fillna(0)
test[['임대보증금', '임대료']] = test[['임대보증금', '임대료']].fillna(0)

#### **Subway, Bus**

In [8]:
cols = ['도보 10분거리 내 지하철역 수(환승노선 수 반영)', '도보 10분거리 내 버스정류장 수']
train[cols] = train[cols].fillna(0)
test[cols] = test[cols].fillna(0)

#### **자격유형** (qualification type)

In [9]:
test.loc[test.단지코드.isin(['C2411']) & test.자격유형.isnull(), '자격유형'] = 'A'
test.loc[test.단지코드.isin(['C2253']) & test.자격유형.isnull(), '자격유형'] = 'C'

### Remove Duplicate Examples

In [10]:
train = train.drop_duplicates()
test = test.drop_duplicates()

### Merge 자격유형 Categories

In [11]:
train.loc[train.자격유형.isin(['J', 'L', 'K', 'N', 'M', 'O']), '자격유형'] = '행복주택_공급대상'
test.loc[test.자격유형.isin(['J', 'L', 'K', 'N', 'M', 'O']), '자격유형'] = '행복주택_공급대상'

train.loc[train.자격유형.isin(['H', 'B', 'E', 'G']), '자격유형'] = '국민임대_공급대상'
test.loc[test.자격유형.isin(['H', 'B', 'E', 'G']), '자격유형'] = '국민임대_공급대상'

train.loc[train.자격유형.isin(['C', 'I', 'F']), '자격유형'] = '영구임대_공급대상'
test.loc[test.자격유형.isin(['C', 'I', 'F']), '자격유형'] = '영구임대_공급대상'

### Merge 공급유형 Categories

In [12]:
train.loc[train.공급유형.isin(['공공임대(10년)', '공공임대(분납)']), '공급유형'] = '공공임대(10년/분납)'
test.loc[test.공급유형.isin(['공공임대(10년)', '공공임대(분납)']), '공급유형'] = '공공임대(10년/분납)'

## Preprocessing (Version 3)

### **Version_3-1** : [Test 1] only

- [Test 1] Drop every train example whose 공급유형 (supply type) is 장기전세, 공공분양, or 공공임대(5년).

In [13]:
idx = train[(train.공급유형 == '장기전세') | (train.공급유형 == '공공분양') | (train.공급유형 == '공공임대(5년)')].index
version_3_1_train = train.drop(idx)

version_3_1_train.공급유형.unique().tolist()

['국민임대', '공공임대(50년)', '영구임대', '임대상가', '공공임대(10년/분납)', '행복주택']
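
The same filter is repeated for several variants further on. A minimal sketch of a reusable version built on `Series.isin`; the helper name `drop_supply_types` and the toy frame are assumptions for illustration:

```python
import pandas as pd

# Supply types removed by [Test 1].
DROP_TYPES = ['장기전세', '공공분양', '공공임대(5년)']


def drop_supply_types(df, drop_types=DROP_TYPES):
    """Return a copy of df without rows whose 공급유형 is in drop_types."""
    return df[~df['공급유형'].isin(drop_types)].copy()


# Toy frame with only the column the filter needs.
toy = pd.DataFrame({'공급유형': ['국민임대', '장기전세', '공공분양', '영구임대', '공공임대(5년)']})
kept = drop_supply_types(toy)
print(kept['공급유형'].tolist())
```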

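### **Version_3-2** : [Test 2] only

- [Test 2] For each 단지코드 (complex code), map 임대건물구분 (rental building type) to 1 if the complex contains only '아파트' (apartments) and to 0 if it contains both '상가' (shops) and '아파트'.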

In [14]:
version_3_2_train = train[['단지코드', '임대건물구분']].copy()

codes = version_3_2_train.단지코드.unique().tolist()

for code in tqdm(codes):
    values = version_3_2_train.loc[version_3_2_train.단지코드 == code, '임대건물구분'].unique().tolist()

    if '상가' in values:
        version_3_2_train.loc[version_3_2_train.단지코드 == code, '임대건물구분'] = 0
    else:
        version_3_2_train.loc[version_3_2_train.단지코드 == code, '임대건물구분'] = 1

  0%|          | 0/423 [00:00<?, ?it/s]
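
The per-code loop can also be written without explicit iteration. A minimal sketch using `groupby(...).transform('any')`, assuming the same rule as the loop (0 if any row of the complex is '상가', otherwise 1); the helper name `apartment_only_flag` and the toy codes are hypothetical:

```python
import pandas as pd


def apartment_only_flag(df):
    """Return 1 for rows of complexes with no '상가' rows, 0 for complexes that have any."""
    is_shop = df['임대건물구분'] == '상가'
    # Broadcast "does this complex contain a shop?" back to every row of the complex.
    has_shop = is_shop.groupby(df['단지코드']).transform('any')
    return (~has_shop).astype(int)


# Toy data: C0001 mixes apartments and shops, C0002 has apartments only.
toy = pd.DataFrame({
    '단지코드': ['C0001', 'C0001', 'C0002'],
    '임대건물구분': ['아파트', '상가', '아파트'],
})
toy['임대건물구분'] = apartment_only_flag(toy)
print(toy)
```

Because the function returns a new Series and does not assign inside a loop, the same call can be applied to the train and test frames alike.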

In [15]:
version_3_2_test = test[['단지코드', '임대건물구분']].copy()

codes = version_3_2_test.단지코드.unique().tolist()

for code in tqdm(codes):
    values = version_3_2_test.loc[version_3_2_test.단지코드 == code, '임대건물구분'].unique().tolist()

    if '상가' in values:
        version_3_2_test.loc[version_3_2_test.단지코드 == code, '임대건물구분'] = 0
    else:
        version_3_2_test.loc[version_3_2_test.단지코드 == code, '임대건물구분'] = 1

  0%|          | 0/150 [00:00<?, ?it/s]

### **Version_3-3** : [Test 3] only

- [Test 3] Add a 'mean rent per 자격유형 (qualification type)' feature.

In [16]:
version_3_3_train = train[['단지코드', '자격유형', '임대료']].copy()

qualifies = version_3_3_train.자격유형.unique().tolist()

for qualify in tqdm(qualifies):
    version_3_3_train.loc[version_3_3_train.자격유형 == qualify, '평균임대료(자격유형)'] = version_3_3_train.loc[version_3_3_train.자격유형 == qualify, '임대료'].mean()

  0%|          | 0/5 [00:00<?, ?it/s]
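
The loop assigns each row the mean 임대료 of its 자격유형 group. An equivalent formulation, sketched here on toy values, uses `groupby(...).transform('mean')`; the helper name `add_mean_rent_by_qualification` is an assumption, not part of the notebook:

```python
import pandas as pd


def add_mean_rent_by_qualification(df):
    """Return a copy of df with the group mean of 임대료 per 자격유형 as a new column."""
    out = df.copy()
    out['평균임대료(자격유형)'] = out.groupby('자격유형')['임대료'].transform('mean')
    return out


# Toy values, not taken from the dataset.
toy = pd.DataFrame({
    '자격유형': ['A', 'A', 'D'],
    '임대료': [100.0, 300.0, 50.0],
})
result = add_mean_rent_by_qualification(toy)
print(result)
```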

In [17]:
version_3_3_test = test[['단지코드', '자격유형', '임대료']].copy()

qualifies = version_3_3_test.자격유형.unique().tolist()

for qualify in tqdm(qualifies):
    version_3_3_test.loc[version_3_3_test.자격유형 == qualify, '평균임대료(자격유형)'] = version_3_3_test.loc[version_3_3_test.자격유형 == qualify, '임대료'].mean()

  0%|          | 0/5 [00:00<?, ?it/s]

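### **Version_3-4** : [Test 1] & [Test 2]

- [Test 1] Drop every train example whose 공급유형 (supply type) is 장기전세, 공공분양, or 공공임대(5년).
- [Test 2] For each 단지코드 (complex code), map 임대건물구분 (rental building type) to 1 if the complex contains only '아파트' (apartments) and to 0 if it contains both '상가' (shops) and '아파트'.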

#### [Test 1]

In [18]:
idx = train[(train.공급유형 == '장기전세') | (train.공급유형 == '공공분양') | (train.공급유형 == '공공임대(5년)')].index
version_3_4_train = train.drop(idx)

version_3_4_train.공급유형.unique().tolist()

['국민임대', '공공임대(50년)', '영구임대', '임대상가', '공공임대(10년/분납)', '행복주택']

#### [Test 2]

In [19]:
version_3_4_train = version_3_4_train[['단지코드', '임대건물구분']].copy()

codes = version_3_4_train.단지코드.unique().tolist()

for code in tqdm(codes):
    values = version_3_4_train.loc[version_3_4_train.단지코드 == code, '임대건물구분'].unique().tolist()

    if '상가' in values:
        version_3_4_train.loc[version_3_4_train.단지코드 == code, '임대건물구분'] = 0
    else:
        version_3_4_train.loc[version_3_4_train.단지코드 == code, '임대건물구분'] = 1

  0%|          | 0/421 [00:00<?, ?it/s]

In [20]:
version_3_4_test = test[['단지코드', '임대건물구분']].copy()

codes = version_3_4_test.단지코드.unique().tolist()

for code in tqdm(codes):
    values = version_3_4_test.loc[version_3_4_test.단지코드 == code, '임대건물구분'].unique().tolist()

    if '상가' in values:
        version_3_4_test.loc[version_3_4_test.단지코드 == code, '임대건물구분'] = 0
    else:
        version_3_4_test.loc[version_3_4_test.단지코드 == code, '임대건물구분'] = 1

  0%|          | 0/150 [00:00<?, ?it/s]

### **Version_3-5** : [Test 1] & [Test 3]

- [Test 1] Drop every train example whose 공급유형 (supply type) is 장기전세, 공공분양, or 공공임대(5년).
- [Test 3] Add a 'mean rent per 자격유형 (qualification type)' feature.

#### [Test 1]

In [21]:
idx = train[(train.공급유형 == '장기전세') | (train.공급유형 == '공공분양') | (train.공급유형 == '공공임대(5년)')].index
version_3_5_train = train.drop(idx)

version_3_5_train.공급유형.unique().tolist()

['국민임대', '공공임대(50년)', '영구임대', '임대상가', '공공임대(10년/분납)', '행복주택']

#### [Test 3]

In [22]:
version_3_5_train = version_3_5_train[['단지코드', '자격유형', '임대료']].copy()

qualifies = version_3_5_train.자격유형.unique().tolist()

for qualify in tqdm(qualifies):
    version_3_5_train.loc[version_3_5_train.자격유형 == qualify, '평균임대료(자격유형)'] = version_3_5_train.loc[version_3_5_train.자격유형 == qualify, '임대료'].mean()

  0%|          | 0/5 [00:00<?, ?it/s]

In [23]:
version_3_5_test = test[['단지코드', '자격유형', '임대료']].copy()

qualifies = version_3_5_test.자격유형.unique().tolist()

for qualify in tqdm(qualifies):
    version_3_5_test.loc[version_3_5_test.자격유형 == qualify, '평균임대료(자격유형)'] = version_3_5_test.loc[version_3_5_test.자격유형 == qualify, '임대료'].mean()

  0%|          | 0/5 [00:00<?, ?it/s]

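### **Version_3-6** : [Test 2] & [Test 3]

- [Test 2] For each 단지코드 (complex code), map 임대건물구분 (rental building type) to 1 if the complex contains only '아파트' (apartments) and to 0 if it contains both '상가' (shops) and '아파트'.
- [Test 3] Add a 'mean rent per 자격유형 (qualification type)' feature.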

#### [Test 2]

In [24]:
version_3_6_train = train[['단지코드', '임대건물구분', '자격유형', '임대료']].copy()

codes = version_3_6_train.단지코드.unique().tolist()

for code in tqdm(codes):
    values = version_3_6_train.loc[version_3_6_train.단지코드 == code, '임대건물구분'].unique().tolist()

    if '상가' in values:
        version_3_6_train.loc[version_3_6_train.단지코드 == code, '임대건물구분'] = 0
    else :
        version_3_6_train.loc[version_3_6_train.단지코드 == code, '임대건물구분'] = 1

  0%|          | 0/423 [00:00<?, ?it/s]

In [25]:
version_3_6_test = test[['단지코드', '임대건물구분', '자격유형', '임대료']].copy()

codes = version_3_6_test.단지코드.unique().tolist()

for code in tqdm(codes):
    values = version_3_6_test.loc[version_3_6_test.단지코드 == code, '임대건물구분'].unique().tolist()

    if '상가' in values:
        version_3_6_test.loc[version_3_6_test.단지코드 == code, '임대건물구분'] = 0
    else:
        version_3_6_test.loc[version_3_6_test.단지코드 == code, '임대건물구분'] = 1

  0%|          | 0/150 [00:00<?, ?it/s]

#### [Test 3]

In [26]:
qualifies = version_3_6_train.자격유형.unique().tolist()

for qualify in tqdm(qualifies):
    version_3_6_train.loc[version_3_6_train.자격유형 == qualify, '평균임대료(자격유형)'] = version_3_6_train.loc[version_3_6_train.자격유형 == qualify, '임대료'].mean()

  0%|          | 0/5 [00:00<?, ?it/s]

In [27]:
qualifies = version_3_6_test.자격유형.unique().tolist()

for qualify in tqdm(qualifies):
    version_3_6_test.loc[version_3_6_test.자격유형 == qualify, '평균임대료(자격유형)'] = version_3_6_test.loc[version_3_6_test.자격유형 == qualify, '임대료'].mean()

  0%|          | 0/5 [00:00<?, ?it/s]

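### **Version_3-7** : all three preprocessing options

- [Test 1] Drop every train example whose 공급유형 (supply type) is 장기전세, 공공분양, or 공공임대(5년).
- [Test 2] For each 단지코드 (complex code), map 임대건물구분 (rental building type) to 1 if the complex contains only '아파트' (apartments) and to 0 if it contains both '상가' (shops) and '아파트'.
- [Test 3] Add a 'mean rent per 자격유형 (qualification type)' feature.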

#### [Test 1]

In [28]:
idx = train[(train.공급유형 == '장기전세') | (train.공급유형 == '공공분양') | (train.공급유형 == '공공임대(5년)')].index
version_3_7_train = train.drop(idx)

version_3_7_train.공급유형.unique().tolist()

['국민임대', '공공임대(50년)', '영구임대', '임대상가', '공공임대(10년/분납)', '행복주택']

#### [Test 2]

In [29]:
version_3_7_train = version_3_7_train[['단지코드', '임대건물구분', '자격유형', '임대료']].copy()

codes = version_3_7_train.단지코드.unique().tolist()

for code in tqdm(codes):
    values = version_3_7_train.loc[version_3_7_train.단지코드 == code, '임대건물구분'].unique().tolist()

    if '상가' in values:
        version_3_7_train.loc[version_3_7_train.단지코드 == code, '임대건물구분'] = 0
    else :
        version_3_7_train.loc[version_3_7_train.단지코드 == code, '임대건물구분'] = 1

  0%|          | 0/421 [00:00<?, ?it/s]

In [30]:
version_3_7_test = test[['단지코드', '임대건물구분', '자격유형', '임대료']].copy()

codes = version_3_7_test.단지코드.unique().tolist()

for code in tqdm(codes):
    values = version_3_7_test.loc[version_3_7_test.단지코드 == code, '임대건물구분'].unique().tolist()

    if '상가' in values:
        version_3_7_test.loc[version_3_7_test.단지코드 == code, '임대건물구분'] = 0
    else:
        version_3_7_test.loc[version_3_7_test.단지코드 == code, '임대건물구분'] = 1

  0%|          | 0/150 [00:00<?, ?it/s]

#### [Test 3]

In [31]:
qualifies = version_3_7_train.자격유형.unique().tolist()

for qualify in tqdm(qualifies):
    version_3_7_train.loc[version_3_7_train.자격유형 == qualify, '평균임대료(자격유형)'] = version_3_7_train.loc[version_3_7_train.자격유형 == qualify, '임대료'].mean()

  0%|          | 0/5 [00:00<?, ?it/s]

In [32]:
qualifies = version_3_7_test.자격유형.unique().tolist()

for qualify in tqdm(qualifies):
    version_3_7_test.loc[version_3_7_test.자격유형 == qualify, '평균임대료(자격유형)'] = version_3_7_test.loc[version_3_7_test.자격유형 == qualify, '임대료'].mean()

  0%|          | 0/5 [00:00<?, ?it/s]

## Aggregation

### Features with a Single Value per 단지코드

In [33]:
unique_cols = ['총세대수', '지역', '공가수', '도보 10분거리 내 지하철역 수(환승노선 수 반영)', '도보 10분거리 내 버스정류장 수', '단지내주차면수', '등록차량수']

train_eq = train.set_index('단지코드')[unique_cols].drop_duplicates()
test_eq = test.set_index('단지코드')[[col for col in unique_cols if col != '등록차량수']].drop_duplicates()
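
Whether the columns in `unique_cols` really are constant within each 단지코드 can be checked before aggregating. A minimal sketch, assuming a hypothetical helper `varying_columns` and toy data; it reports the columns that take more than one value inside at least one group:

```python
import pandas as pd


def varying_columns(df, key, cols):
    """List the columns in cols that take more than one value within some key group."""
    counts = df.groupby(key)[cols].nunique(dropna=False)
    return [col for col in cols if (counts[col] > 1).any()]


# Toy data: 총세대수 is constant per complex, 전용면적 is not.
toy = pd.DataFrame({
    '단지코드': ['C0001', 'C0001', 'C0002'],
    '총세대수': [900, 900, 545],
    '전용면적': [35.0, 50.0, 45.0],
})
print(varying_columns(toy, '단지코드', ['총세대수', '전용면적']))
```

An empty result for `unique_cols` would mean that `drop_duplicates()` leaves one row per 단지코드.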

In [34]:
train_eq

| 단지코드 | 총세대수 | 지역 | 공가수 | 도보 10분거리 내 지하철역 수(환승노선 수 반영) | 도보 10분거리 내 버스정류장 수 | 단지내주차면수 | 등록차량수 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| C2483 | 900 | 0 | 38.0 | 0.0 | 3.0 | 1425.0 | 1015.0 |
| C2515 | 545 | 1 | 17.0 | 0.0 | 3.0 | 624.0 | 205.0 |
| C1407 | 1216 | 2 | 13.0 | 1.0 | 1.0 | 1285.0 | 1064.0 |
| C1945 | 755 | 3 | 6.0 | 1.0 | 3.0 | 734.0 | 730.0 |
| C1470 | 696 | 4 | 14.0 | 0.0 | 2.0 | 645.0 | 553.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| C2586 | 90 | 9 | 7.0 | 0.0 | 3.0 | 66.0 | 57.0 |
| C2035 | 492 | 5 | 24.0 | 0.0 | 1.0 | 521.0 | 246.0 |
| C2020 | 40 | 8 | 7.0 | 1.0 | 2.0 | 25.0 | 19.0 |
| C2437 | 90 | 11 | 12.0 | 0.0 | 1.0 | 30.0 | 16.0 |


### Features That Vary Within a 단지코드

#### **Version_3-1** : [Test 1]

In [35]:
version_3_1_train = version_3_1_train.drop(unique_cols, axis=1)
version_3_1_test = test.drop([col for col in unique_cols if col != '등록차량수'], axis=1)

##### '전용면적', '전용면적별세대수' (exclusive area, households per exclusive area)

In [37]:
version_3_1_train

|  | 단지코드 | 임대건물구분 | 공급유형 | 전용면적 | 전용면적별세대수 | 자격유형 | 임대보증금 | 임대료 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | C2483 | 아파트 | 국민임대 | 35.0 | 134 | A | 15667000.0 | 103680.0 |
| 1 | C2483 | 아파트 | 국민임대 | 35.0 | 15 | A | 15667000.0 | 103680.0 |
| 2 | C2483 | 아파트 | 국민임대 | 50.0 | 385 | A | 27304000.0 | 184330.0 |
| 3 | C2483 | 아파트 | 국민임대 | 50.0 | 15 | A | 27304000.0 | 184330.0 |
| 4 | C2483 | 아파트 | 국민임대 | 50.0 | 41 | A | 27304000.0 | 184330.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2945 | C2437 | 아파트 | 영구임대 | 20.0 | 90 | 영구임대_공급대상 | 10346000.0 | 107530.0 |
| 2946 | C2532 | 아파트 | 국민임대 | 45.0 | 19 | A | 11346000.0 | 116090.0 |
| 2948 | C2532 | 아파트 | 국민임대 | 50.0 | 34 | A | 14005000.0 | 142310.0 |
| 2950 | C2532 | 아파트 | 국민임대 | 50.0 | 114 | A | 14005000.0 | 142310.0 |
