<a href="https://colab.research.google.com/github/Jin-Yuseung/apartment/blob/main/Make_TrainTest_Set.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 1. Import Libraries & Load Dataset

In [None]:
# Data Handling
import pandas as pd
import numpy as np
import os

# Model Selection
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
FILE_PATH = '/content/drive/MyDrive/apt'
FILE_NAME = '2019_2022_매매데이터.csv'
TRAIN_SAVE_PATH = '/content/drive/MyDrive/apt'
TEST_SAVE_PATH = '/content/drive/MyDrive/apt'

In [None]:
df = pd.read_csv(os.path.join(FILE_PATH, FILE_NAME), encoding='cp949')
df.head(1)

Unnamed: 0,시군구,번지,본번,부번,단지명,전용면적(㎡),계약년월,계약일,거래금액(만원),층,건축년도,도로명,해제사유발생일,거래유형,중개사소재지
0,광주광역시 광산구 도산동,1138-2,1138,2,대덕1,59.31,202110,9,7700,6,1990,송도로 143,,-,-


## 2. Preprocessing

In [None]:
df['구'] = df['시군구'].apply(lambda x: x.split(' ')[1])

In [None]:
# Select the price column before aggregating so non-numeric columns are never averaged or summed
dong_mean_money = pd.DataFrame(df.groupby('시군구')['거래금액(만원)'].mean().round().astype('int'))
dong_sum_money = pd.DataFrame(df.groupby('시군구')['거래금액(만원)'].sum().round().astype('int'))

dong_mean_money.columns = ['동별_평균_거래금액']
dong_sum_money.columns = ['동별_총_거래금액']

df = pd.merge(left=df, right=dong_mean_money, on='시군구')
df = pd.merge(left=df, right=dong_sum_money, on='시군구')

In [None]:
# Select the price column before aggregating so non-numeric columns are never averaged or summed
gu_mean_money = pd.DataFrame(df.groupby('구')['거래금액(만원)'].mean().round().astype('int'))
gu_sum_money = pd.DataFrame(df.groupby('구')['거래금액(만원)'].sum().round().astype('int'))

gu_mean_money.columns = ['구별_평균_거래금액']
gu_sum_money.columns = ['구별_총_거래금액']

df = pd.merge(left=df, right=gu_mean_money, on='구')
df = pd.merge(left=df, right=gu_sum_money, on='구')
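
The aggregate-then-merge pattern used in the two cells above can also be written with `groupby(...).transform(...)`, which returns values already aligned to the original rows. The following is a minimal sketch on made-up toy data; the column names `gu`, `price`, `gu_mean_price`, and `gu_sum_price` are hypothetical stand-ins and not part of this notebook.

```python
import pandas as pd

# Toy data standing in for the transaction table (values are made up for illustration)
toy = pd.DataFrame({
    "gu": ["A", "A", "B", "B", "B"],
    "price": [100, 300, 200, 400, 600],
})

# transform() broadcasts each group's statistic back onto its rows, so no merge is needed
grouped = toy.groupby("gu")["price"]
toy["gu_mean_price"] = grouped.transform("mean").round().astype("int")
toy["gu_sum_price"] = grouped.transform("sum")

print(toy)
```

One possible advantage of this form is that the row order and index of the original frame are left untouched, whereas `pd.merge` builds a new frame with a fresh index.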

In [None]:
# Split the YYYYMM contract date into year and month strings
df['계약년'] = df['계약년월'].astype(str).str[:4]
df['계약월'] = df['계약년월'].astype(str).str[4:]

In [None]:
# Use .loc for the assignment; chained indexing triggers SettingWithCopyWarning
df.loc[df['거래유형'] == '-', '거래유형'] = '중개거래'
df['거래유형'].value_counts()


중개거래    68725
직거래      2084
Name: 거래유형, dtype: int64

In [None]:
POP_FILE_PATH = '/content/drive/MyDrive/apt'
POP_FILE_NAME = 'pop.csv'

pop = pd.read_csv(os.path.join(POP_FILE_PATH, POP_FILE_NAME), encoding='cp949')

In [None]:
# Strip spaces from district names and thousands separators from the population counts
pop['구별'] = pop['구별'].str.replace(' ', '', regex=False)
pop['인구'] = pop['인구'].str.replace(',', '', regex=False).astype("int")

pop.head(1)

Unnamed: 0,구별,인구
0,동구,105077


In [None]:
df = pd.merge(df, pop, left_on='구', right_on='구별')
df.head(1)

Unnamed: 0,시군구,번지,본번,부번,단지명,전용면적(㎡),계약년월,계약일,거래금액(만원),층,...,중개사소재지,구,동별_평균_거래금액,동별_총_거래금액,구별_평균_거래금액,구별_총_거래금액,계약년,계약월,구별,인구
0,광주광역시 광산구 도산동,1138-2,1138,2,대덕1,59.31,202110,9,7700,6,...,-,광산구,14989,9698148,26038,520261751,2021,10,광산구,416012
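
`pd.merge` defaults to an inner join, so any transaction whose district name has no match in the population table would be dropped silently. The sketch below shows one way to check for that, using the `how`, `validate`, and `indicator` parameters of `pd.merge` on a small toy frame. The frames `transactions` and `population` and the `price` column are hypothetical; only the two population figures are taken from the outputs above.

```python
import pandas as pd

# Toy frames; '서구' is deliberately missing from the population table
transactions = pd.DataFrame({"구": ["동구", "북구", "북구", "서구"], "price": [1, 2, 3, 4]})
population = pd.DataFrame({"구별": ["동구", "북구"], "인구": [105077, 431587]})

merged = pd.merge(
    transactions, population,
    left_on="구", right_on="구별",
    how="left",              # keep every transaction row, matched or not
    validate="many_to_one",  # raise if a district appears twice in the population table
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Rows that found no population entry
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched["구"].unique())
```

If `unmatched` is empty on the real data, the inner join used in this notebook loses no rows.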


## 3. Make Train/Test Set

In [None]:
new_df = df[[
    '전용면적(㎡)', '계약년', '계약월',
    '층', '건축년도', '거래유형', '동별_평균_거래금액',
    '동별_총_거래금액', '구별_평균_거래금액', '구별_총_거래금액', '인구', '거래금액(만원)'
]]

In [None]:
df[df['단지명'] == '삼호'].loc[54741,:]

시군구           광주광역시 북구 일곡동
번지                   816-4
본번                     816
부번                       4
단지명                     삼호
전용면적(㎡)             119.99
계약년월                202106
계약일                      6
거래금액(만원)             36500
층                        6
건축년도                  1998
도로명                설죽로 600
해제사유발생일                NaN
거래유형                  중개거래
중개사소재지                   -
구                       북구
동별_평균_거래금액           16467
동별_총_거래금액         38829187
구별_평균_거래금액           21113
구별_총_거래금액        484249928
계약년                   2021
계약월                     06
구별                      북구
인구                  431587
Name: 54741, dtype: object

In [None]:
scaler = StandardScaler()

num_column = [
    '전용면적(㎡)', '동별_평균_거래금액',
    '동별_총_거래금액', '구별_평균_거래금액', '구별_총_거래금액', '인구'
]

In [None]:
# Work on an independent copy so the assignment below does not write into a slice of df
new_df = new_df.copy()
# Assign the scaled array directly; an intermediate DataFrame would be realigned on the index
new_df[num_column] = scaler.fit_transform(new_df[num_column])


In [None]:
new_df.head(1)

Unnamed: 0,전용면적(㎡),계약년,계약월,층,건축년도,거래유형,동별_평균_거래금액,동별_총_거래금액,구별_평균_거래금액,구별_총_거래금액,인구,거래금액(만원)
0,-0.637053,2021,10,6,1990,중개거래,-1.080376,-1.138411,0.034814,0.832702,0.65056,7700


In [None]:
train, test = train_test_split(new_df, test_size = 0.3, shuffle=True)
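
The split above is not seeded, and the scaler was fitted on all rows before splitting, so test-set statistics contribute to the scaling. An alternative ordering, sketched here on synthetic data, splits first with a fixed `random_state` and fits `StandardScaler` on the training rows only. The column names `area`, `floor`, and `price` and the seed value are assumptions for illustration, not part of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature table (column names are hypothetical)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "area": rng.uniform(30, 150, size=100),
    "floor": rng.integers(1, 25, size=100),
    "price": rng.uniform(5000, 50000, size=100),
})
num_cols = ["area"]

# Split first; random_state makes the split reproducible across runs
train, test = train_test_split(toy, test_size=0.3, shuffle=True, random_state=42)
train, test = train.copy(), test.copy()

# Fit on the training rows only, then reuse the same mean and scale for the test rows
scaler = StandardScaler()
train[num_cols] = scaler.fit_transform(train[num_cols])
test[num_cols] = scaler.transform(test[num_cols])

print(train.shape, test.shape)
```

The same caveat would apply to the district-level price aggregates, which are computed from the target column over all rows; whether that matters depends on how the saved sets are used downstream.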

In [None]:
train.to_csv(os.path.join(TRAIN_SAVE_PATH, 'train.csv'), encoding='cp949')
test.to_csv(os.path.join(TEST_SAVE_PATH, 'test.csv'), encoding='cp949')
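
By default `to_csv` writes the DataFrame index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back with `pd.read_csv`. The sketch below illustrates the difference that `index=False` makes, using a temporary directory and a tiny toy frame; the file name `toy.csv` and the toy values are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Tiny frame with a non-default index, similar to a shuffled train/test split
toy = pd.DataFrame({"층": [6, 3], "거래유형": ["중개거래", "직거래"]}, index=[10, 42])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")

    # Default: the index is written as an unnamed first column
    toy.to_csv(path, encoding="cp949")
    with_index = pd.read_csv(path, encoding="cp949")

    # index=False: only the data columns are written
    toy.to_csv(path, encoding="cp949", index=False)
    without_index = pd.read_csv(path, encoding="cp949")

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

Keeping the index can still be useful if the original row positions need to be traced back to `df`; in that case `pd.read_csv(..., index_col=0)` restores it on load.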

In [None]:
train.shape

(49566, 12)

In [None]:
test.shape

(21243, 12)