
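# Project Topic

- This project predicts future apartment prices.

# Project Overview

- Among the many regions and property types (apartments, detached houses, etc.), this project predicts the 'future price change rate' of 'apartments' in 'Seoul'.
- The ultimate goal is to find the apartments expected to have the best 'future price change rate', that is, the highest expected return.
- Assume that the value of real estate can be assessed through two aspects: '1. characteristics as a residence' and '2. characteristics as a financial product'.
- '1. Characteristics as a residence' covers factors that provide a more comfortable living environment, such as nearby amenities, educational facilities, apartment size, and nearby transportation.
- '2. Characteristics as a financial product' covers factors expressed as financial figures, such as the base interest rate, apartment supply, unsold apartments, the current sale price, and the jeonse-to-sale price ratio.
- The factors that signal high value under '1. Characteristics as a residence' can change over time (for example, as households shift from extended families to small families, the preferred apartment size may change, and the growth of online lectures may reduce the importance of educational infrastructure in the future).
- What counted as high value under '1. Characteristics as a residence' may have kept changing in the past, but it is hard to determine how it changed and impossible to know how it will change, so its evaluation criteria are 'variable'.
- In contrast, '2. Characteristics as a financial product' is expressed as 'figures' grounded in prices and the economy, so it can assess real estate value more consistently than '1. Characteristics as a residence'.
- Assume that the figures under '2. Characteristics as a financial product' already reflect the changing value of '1. Characteristics as a residence'.
- This project focuses on '2. Characteristics as a financial product' to predict changes in housing prices.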
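
The 'future price change rate' target can be made concrete with a small sketch. It assumes the rate means the relative change between the current price and the price a fixed number of periods later; the notebook does not spell out the formula or the horizon, so the function name `future_change_rate`, the `horizon` parameter, and the sample prices are all hypothetical.

```python
import pandas as pd


def future_change_rate(prices: pd.Series, horizon: int) -> pd.Series:
    """Relative change between each price and the price `horizon` periods later."""
    future = prices.shift(-horizon)  # price observed `horizon` rows ahead
    return (future - prices) / prices


# Hypothetical monthly average prices (unit: 10,000 KRW) for one apartment
monthly_price = pd.Series([60000, 61000, 63000, 66000])
rate = future_change_rate(monthly_price, horizon=2)
print(rate)
```

The last `horizon` rows have no future price yet, so their rate is NaN and they cannot be used as training targets.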

# Obtaining original_data

- Apartment sale prices and apartment jeonse/monthly rent prices were obtained as files from 'http://rtdown.molit.go.kr/'
- Korean government bond yields, US government bond yields, and KOSPI data were obtained from 'https://kr.investing.com/'
- Unsold apartment counts were obtained from 'https://data.kbland.kr/publicdata/unsold-apartments'
- Counts of newly supplied (pre-sale) apartments were obtained from 'https://asil.kr/asil/sub/movein.jsp'
- The base interest rate was obtained from 'https://www.bok.or.kr/portal/singl/baseRate/list.do?dataSeCd=01&menuNo=200643'
- The Seoul housing price index was obtained from 'https://data.seoul.go.kr/dataList/801/S/2/datasetView.do'



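>> The original plan was to get the apartment sale and jeonse/monthly rent prices through the 공공데이터포털 (Public Data Portal) API, but because of its daily traffic limit, the files were downloaded directly from 'http://rtdown.molit.go.kr/' instead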

In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Check the Python version
!python --version

Python 3.8.10


In [None]:
# Check the library versions
pip list

Package                       Version
----------------------------- ----------------------
absl-py                       1.4.0
aeppl                         0.0.33
aesara                        2.7.9
aiohttp                       3.8.4
aiosignal                     1.3.1
alabaster                     0.7.13
albumentations                1.2.1
altair                        4.2.2
appdirs                       1.4.4
argon2-cffi                   21.3.0
argon2-cffi-bindings          21.2.0
arviz                         0.12.1
astor                         0.8.1
astropy                       4.3.1
astunparse                    1.6.3
async-timeout                 4.0.2
atari-py                      0.2.9
atomicwrites                  1.4.1
attrs                         22.2.0
audioread                     3.0.0
autograd                      1.5
Babel                         2.11.0
backcall                      0.2.0
beautifulsoup4                4.6.3
bleach                        6.0.0
blis

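# Creating apartment_deal.csv

- Create the apartment_deal (apartment sales) file
- Apartment sale price data was obtained as files from 'http://rtdown.molit.go.kr/'

## Loading and merging the csv files

- The original apartment sale files are split by year, and not all of the information in each csv file is needed, so a preprocessing step is applied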

In [None]:
import pandas as pd
import os

# Path to the folder that holds the yearly apartment sale csv files
dir_path = "/content/drive/MyDrive/house_price/original_data/deal_price/Seoul"
file_list = os.listdir(dir_path)
file_list.sort()
df_list = list()
# Read every csv file in the folder and collect the DataFrames in a list
for csv_file in file_list:
    df_list.append(pd.read_csv(os.path.join(dir_path, csv_file), skiprows=15, encoding='cp949'))

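>> Colab seems to list files in the order they were uploaded; `os.listdir` makes no ordering guarantee, which is why the file list is sorted explicitly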
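
A minimal sketch of a listing helper that is both sorted and limited to csv files, so stray files in the folder are not passed to `pd.read_csv`. The helper name `list_csv_files` is hypothetical and is not part of the original notebook.

```python
from pathlib import Path


def list_csv_files(dir_path):
    """Return the csv files directly inside dir_path, sorted by file name."""
    return sorted(Path(dir_path).glob("*.csv"), key=lambda p: p.name)
```

Sorting by name only gives chronological order if the yearly files are named so that alphabetical order matches year order.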

In [None]:
df_list[0].info() # check that the list was filled correctly

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120812 entries, 0 to 120811
Data columns (total 15 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   시군구       120812 non-null  object 
 1   번지        120812 non-null  object 
 2   본번        120812 non-null  int64  
 3   부번        120812 non-null  int64  
 4   단지명       120812 non-null  object 
 5   전용면적(㎡)   120812 non-null  float64
 6   계약년월      120812 non-null  int64  
 7   계약일       120812 non-null  int64  
 8   거래금액(만원)  120812 non-null  object 
 9   층         120812 non-null  int64  
 10  건축년도      120812 non-null  int64  
 11  도로명       120812 non-null  object 
 12  해제사유발생일   0 non-null       float64
 13  거래유형      120812 non-null  object 
 14  중개사소재지    120812 non-null  object 
dtypes: float64(2), int64(6), object(7)
memory usage: 13.8+ MB


In [None]:
df_list[0].head()

Unnamed: 0,시군구,번지,본번,부번,단지명,전용면적(㎡),계약년월,계약일,거래금액(만원),층,건축년도,도로명,해제사유발생일,거래유형,중개사소재지
0,서울특별시 강남구 개포동,655-2,655,2,개포2차현대아파트(220),77.75,200603,10,59500,7,1988,언주로 103,,-,-
1,서울특별시 강남구 개포동,655-2,655,2,개포2차현대아파트(220),77.75,200603,29,60000,6,1988,언주로 103,,-,-
2,서울특별시 강남구 개포동,655-2,655,2,개포2차현대아파트(220),77.75,200604,29,67000,9,1988,언주로 103,,-,-
3,서울특별시 강남구 개포동,655-2,655,2,개포2차현대아파트(220),77.75,200606,1,60000,4,1988,언주로 103,,-,-
4,서울특별시 강남구 개포동,655-2,655,2,개포2차현대아파트(220),77.75,200610,20,72250,5,1988,언주로 103,,-,-


In [None]:
# Merge all DataFrames into a single DataFrame
df_default = pd.concat(df_list, axis=0)
df_default.reset_index(drop=True, inplace=True) # reset the index after concatenation
df_default.loc[1]

시군구          서울특별시 강남구 개포동
번지                   655-2
본번                   655.0
부번                     2.0
단지명         개포2차현대아파트(220)
전용면적(㎡)              77.75
계약년월                200603
계약일                     29
거래금액(만원)            60,000
층                        6
건축년도                1988.0
도로명                언주로 103
해제사유발생일                NaN
거래유형                     -
중개사소재지                   -
Name: 1, dtype: object

In [None]:
df_default.head() # inspect the merged table

Unnamed: 0,시군구,번지,본번,부번,단지명,전용면적(㎡),계약년월,계약일,거래금액(만원),층,건축년도,도로명,해제사유발생일,거래유형,중개사소재지
0,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),77.75,200603,10,59500,7,1988.0,언주로 103,,-,-
1,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),77.75,200603,29,60000,6,1988.0,언주로 103,,-,-
2,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),77.75,200604,29,67000,9,1988.0,언주로 103,,-,-
3,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),77.75,200606,1,60000,4,1988.0,언주로 103,,-,-
4,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),77.75,200610,20,72250,5,1988.0,언주로 103,,-,-


In [None]:
df_default.info() # check the result of merging the DataFrames

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1237491 entries, 0 to 1237490
Data columns (total 15 columns):
 #   Column    Non-Null Count    Dtype  
---  ------    --------------    -----  
 0   시군구       1237491 non-null  object 
 1   번지        1237270 non-null  object 
 2   본번        1237416 non-null  float64
 3   부번        1237416 non-null  float64
 4   단지명       1237491 non-null  object 
 5   전용면적(㎡)   1237491 non-null  float64
 6   계약년월      1237491 non-null  int64  
 7   계약일       1237491 non-null  int64  
 8   거래금액(만원)  1237491 non-null  object 
 9   층         1237491 non-null  int64  
 10  건축년도      1237489 non-null  float64
 11  도로명       1237491 non-null  object 
 12  해제사유발생일   5242 non-null     float64
 13  거래유형      1237491 non-null  object 
 14  중개사소재지    1237491 non-null  object 
dtypes: float64(5), int64(3), object(7)
memory usage: 141.6+ MB
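
In the merged result, 본번, 부번, and 건축년도 changed from int64 to float64. A likely reason is that some yearly files contain missing values in those columns, and NumPy-backed integer columns cannot hold NaN. The sketch below reproduces the effect on made-up data and shows pandas' nullable `Int64` dtype as one possible way to keep integers; the notebook itself keeps the float columns.

```python
import pandas as pd

a = pd.DataFrame({"본번": [655, 658]})            # int64, no missing values
b = pd.DataFrame({"본번": [141, float("nan")]})   # float64 because of the NaN
merged = pd.concat([a, b], ignore_index=True)

print(merged["본번"].dtype)                        # upcast to float64 after the merge
nullable = merged["본번"].astype("Int64")          # nullable integer dtype keeps the gap as <NA>
print(nullable)
```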


## Selecting only the needed columns

- Not every column of the df_default DataFrame is used, so only the columns that will be used are selected

In [None]:
# Keep only the columns that will be used and rename them in English
df_default = df_default[['시군구','본번','부번','도로명','단지명','계약년월','계약일','전용면적(㎡)','거래금액(만원)','층']].copy() # copy so later column assignments do not act on a slice
df_default.columns = ['address','main_number','sub_number','road','name','year_month','day','area','deal_price','floor']
df_default.head() # check the selection

Unnamed: 0,address,main_number,sub_number,road,name,year_month,day,area,deal_price,floor
0,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200603,10,77.75,59500,7
1,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200603,29,77.75,60000,6
2,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200604,29,77.75,67000,9
3,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200606,1,77.75,60000,4
4,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200610,20,77.75,72250,5


In [None]:
# Convert the types of deal_price, year_month, and day
df_default["deal_price"] = df_default["deal_price"].str.replace(",", "") # strip ',' from 'deal_price' so it can be used in later calculations
df = df_default.astype({'year_month':'str','day':'str','deal_price':'int64'}).copy()
df.head() # confirm the converted form

Unnamed: 0,address,main_number,sub_number,road,name,year_month,day,area,deal_price,floor
0,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200603,10,77.75,59500,7
1,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200603,29,77.75,60000,6
2,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200604,29,77.75,67000,9
3,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200606,1,77.75,60000,4
4,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200610,20,77.75,72250,5
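
As an alternative to stripping commas after loading, `pd.read_csv` has a `thousands` parameter that parses such values as numbers at read time. The sketch below uses a tiny in-memory csv with made-up rows; whether it applies cleanly to the real files (which also need `skiprows=15` and `encoding='cp949'`) is an assumption that was not checked here.

```python
import io

import pandas as pd

# Made-up two-row csv that mimics the quoted, comma-separated price column
raw = io.StringIO('단지명,거래금액(만원)\n"A","59,500"\n"B","60,000"\n')
df_demo = pd.read_csv(raw, thousands=",")
print(df_demo.dtypes)
```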


In [None]:
df.info() # check the type changes and nulls

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1237491 entries, 0 to 1237490
Data columns (total 10 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   address      1237491 non-null  object 
 1   main_number  1237416 non-null  float64
 2   sub_number   1237416 non-null  float64
 3   road         1237491 non-null  object 
 4   name         1237491 non-null  object 
 5   year_month   1237491 non-null  object 
 6   day          1237491 non-null  object 
 7   area         1237491 non-null  float64
 8   deal_price   1237491 non-null  int64  
 9   floor        1237491 non-null  int64  
dtypes: float64(3), int64(2), object(5)
memory usage: 94.4+ MB


In [None]:
# Look for rows where 'main_number' or 'sub_number' is null and 'road' is also null -> none
# In other words, 'road' carries more address information
df[((df['main_number'].isnull()) |(df['sub_number'].isnull())) &(df['road'].isnull()) ]

Unnamed: 0,address,main_number,sub_number,road,name,year_month,day,area,deal_price,floor


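- main_number and sub_number contain null values -> the road information was judged suitable as the address information

## Creating new columns

- Date-related columns are used later for grouping and similar steps, so the 'year_month' and 'day' columns are processed to create several date-related columns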

In [None]:
# Split out and create date-related columns for later grouping
df['year'] = df['year_month'].str[0:4] # extract the year from the combined year/month column
df['month'] = df['year_month'].str[4:] # extract the month from the combined year/month column
df['day'] = df['day'].str.zfill(2) # left-pad single-digit days (1, 2, ...) with 0
df['date'] = pd.to_datetime(df['year']+df['month']+df['day'], format='%Y%m%d') # combine the parts into a date column
df = df.astype({'year':'int64','month':'int64','day':'int64'}) # convert to the desired types
df = df.drop(['year_month'], axis=1) # drop the column that is no longer used
df.head()

Unnamed: 0,address,main_number,sub_number,road,name,day,area,deal_price,floor,year,month,date
0,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),10,77.75,59500,7,2006,3,2006-03-10
1,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),29,77.75,60000,6,2006,3,2006-03-29
2,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),29,77.75,67000,9,2006,4,2006-04-29
3,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),1,77.75,60000,4,2006,6,2006-06-01
4,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),20,77.75,72250,5,2006,10,2006-10-20
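
The date column can also be assembled from integer components instead of concatenated strings, because `pd.to_datetime` accepts a DataFrame with `year`, `month`, and `day` columns. This is a sketch on made-up rows, not the notebook's own approach; it removes the need for zero-padding the day.

```python
import pandas as pd

parts = pd.DataFrame({"year_month": ["200603", "200610"], "day": ["1", "20"]})
parts["year"] = parts["year_month"].str[:4].astype("int64")
parts["month"] = parts["year_month"].str[4:].astype("int64")
parts["day"] = parts["day"].astype("int64")

# to_datetime builds one timestamp per row from the three component columns
parts["date"] = pd.to_datetime(parts[["year", "month", "day"]])
print(parts)
```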


In [None]:
df.info() # confirm that the types changed as intended

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1237491 entries, 0 to 1237490
Data columns (total 12 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   address      1237491 non-null  object        
 1   main_number  1237416 non-null  float64       
 2   sub_number   1237416 non-null  float64       
 3   road         1237491 non-null  object        
 4   name         1237491 non-null  object        
 5   day          1237491 non-null  int64         
 6   area         1237491 non-null  float64       
 7   deal_price   1237491 non-null  int64         
 8   floor        1237491 non-null  int64         
 9   year         1237491 non-null  int64         
 10  month        1237491 non-null  int64         
 11  date         1237491 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(3), int64(5), object(3)
memory usage: 113.3+ MB


In [None]:
# Split the address and the road name
df["address_0"] = df["address"].str.split(' ',expand=True)[0] # city only (only Seoul is covered, so not strictly needed for now)
df["address_1"] = df["address"].str.split(' ',expand=True)[1] # district (gu) only
df["address_2"] = df["address"].str.split(' ',expand=True)[2] # neighborhood (dong) only
df["road_name"] = df["road"].str.split(' ',expand=True)[0] # road name only
df["road_number"] = df["road"].str.split(' ',expand=True)[1] # road number only
df= df[['year','month','day','address_0','address_1','address_2','road_name','road_number','area','deal_price','name','main_number','sub_number','date']] # keep only the columns that will be used
df.head()

Unnamed: 0,year,month,day,address_0,address_1,address_2,road_name,road_number,area,deal_price,name,main_number,sub_number,date
0,2006,3,10,서울특별시,강남구,개포동,언주로,103,77.75,59500,개포2차현대아파트(220),655.0,2.0,2006-03-10
1,2006,3,29,서울특별시,강남구,개포동,언주로,103,77.75,60000,개포2차현대아파트(220),655.0,2.0,2006-03-29
2,2006,4,29,서울특별시,강남구,개포동,언주로,103,77.75,67000,개포2차현대아파트(220),655.0,2.0,2006-04-29
3,2006,6,1,서울특별시,강남구,개포동,언주로,103,77.75,60000,개포2차현대아파트(220),655.0,2.0,2006-06-01
4,2006,10,20,서울특별시,강남구,개포동,언주로,103,77.75,72250,개포2차현대아파트(220),655.0,2.0,2006-10-20
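
The cell above splits the same column several times. A sketch of splitting each column once and assigning all parts together is shown below, on made-up rows. It also shows how an empty road string produces an empty `road_name` but a null `road_number`, which is one plausible way the two columns end up with different kinds of missing values.

```python
import pandas as pd

demo = pd.DataFrame({
    "address": ["서울특별시 강남구 개포동", "서울특별시 중구 만리동2가"],
    "road": ["언주로 103", ""],  # second row has no road information
})

# n limits the number of splits, so the result has a fixed number of columns
demo[["address_0", "address_1", "address_2"]] = demo["address"].str.split(" ", n=2, expand=True)
demo[["road_name", "road_number"]] = demo["road"].str.split(" ", n=1, expand=True)
print(demo)
```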


## Handling missing values 1

In [None]:
df.info() # confirm that road_number now has 1 null value

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1237491 entries, 0 to 1237490
Data columns (total 14 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   year         1237491 non-null  int64         
 1   month        1237491 non-null  int64         
 2   day          1237491 non-null  int64         
 3   address_0    1237491 non-null  object        
 4   address_1    1237491 non-null  object        
 5   address_2    1237491 non-null  object        
 6   road_name    1237491 non-null  object        
 7   road_number  1237490 non-null  object        
 8   area         1237491 non-null  float64       
 9   deal_price   1237491 non-null  int64         
 10  name         1237491 non-null  object        
 11  main_number  1237416 non-null  float64       
 12  sub_number   1237416 non-null  float64       
 13  date         1237491 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(3), int64(4), object(6)
memory usage

In [None]:
df[df['road_number'].isnull()] # inspect the row whose road_number is null

Unnamed: 0,year,month,day,address_0,address_1,address_2,road_name,road_number,area,deal_price,name,main_number,sub_number,date
1177515,2020,12,31,서울특별시,중구,만리동2가,만리재로,,39.9541,161000,서울역센트럴자이(임대),176.0,1.0,2020-12-31


In [None]:
# Inspect '서울역센트럴자이' -> confirm that '' values exist
df.loc[df['name'] == '서울역센트럴자이',:]

Unnamed: 0,year,month,day,address_0,address_1,address_2,road_name,road_number,area,deal_price,name,main_number,sub_number,date
936223,2017,5,3,서울특별시,중구,만리동2가,만리재로,175.0,84.972,79390,서울역센트럴자이,176.0,1.0,2017-05-03
936224,2017,12,20,서울특별시,중구,만리동2가,만리재로,175.0,59.943,85000,서울역센트럴자이,176.0,1.0,2017-12-20
936225,2017,12,30,서울특별시,중구,만리동2가,,,59.94,85000,서울역센트럴자이,176.0,1.0,2017-12-30
1018067,2018,3,20,서울특별시,중구,만리동2가,,,72.99,85000,서울역센트럴자이,176.0,1.0,2018-03-20
1093938,2019,7,13,서울특별시,중구,만리동2가,만리재로,175.0,84.972,134500,서울역센트럴자이,176.0,1.0,2019-07-13
1093939,2019,8,20,서울특별시,중구,만리동2가,만리재로,175.0,59.94,95000,서울역센트럴자이,176.0,1.0,2019-08-20
1093940,2019,8,23,서울특별시,중구,만리동2가,만리재로,175.0,84.972,139000,서울역센트럴자이,176.0,1.0,2019-08-23
1093941,2019,9,8,서울특별시,중구,만리동2가,만리재로,175.0,59.94,113800,서울역센트럴자이,176.0,1.0,2019-09-08
1093942,2019,9,21,서울특별시,중구,만리동2가,만리재로,175.0,72.9733,132000,서울역센트럴자이,176.0,1.0,2019-09-21
1093943,2019,11,30,서울특별시,중구,만리동2가,만리재로,175.0,59.9808,120000,서울역센트럴자이,176.0,1.0,2019-11-30


In [None]:
# Inspect the rows whose value is ''
df.loc[df['road_name'] == '',:]

Unnamed: 0,year,month,day,address_0,address_1,address_2,road_name,road_number,area,deal_price,name,main_number,sub_number,date
1606,2006,2,23,서울특별시,강남구,논현동,,,128.67,73500,경복,276.0,0.0,2006-02-23
1628,2006,10,19,서울특별시,강남구,논현동,,,95.48,71000,경복,276.0,0.0,2006-10-19
2799,2006,1,24,서울특별시,강남구,대치동,,,76.56,80000,청실1,633.0,0.0,2006-01-24
2806,2006,2,14,서울특별시,강남구,대치동,,,102.64,143500,청실1,633.0,0.0,2006-02-14
2807,2006,2,14,서울특별시,강남구,대치동,,,102.64,142000,청실1,633.0,0.0,2006-02-14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1234065,2022,6,24,서울특별시,송파구,거여동,,,84.95,128000,e편한세상송파파크센트럴,696.0,0.0,2022-06-24
1234066,2022,7,21,서울특별시,송파구,거여동,,,84.97,135000,e편한세상송파파크센트럴,696.0,0.0,2022-07-21
1234067,2022,7,23,서울특별시,송파구,거여동,,,59.96,125000,e편한세상송파파크센트럴,696.0,0.0,2022-07-23
1234069,2022,8,19,서울특별시,송파구,거여동,,,84.96,130000,e편한세상송파파크센트럴,696.0,0.0,2022-08-19


>> Having no null values does not mean having no '' values! -> A value can be semantically missing but stored as '', so it looks as if a value is present
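
A quick way to see both kinds of missing values side by side is to count nulls and empty strings per column. The sketch below uses made-up rows; the column names mirror the notebook, the values do not.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "road_name": ["언주로", "", ""],
    "road_number": ["103", None, ""],
})

null_counts = demo.isnull().sum()      # counts only real nulls
empty_counts = demo.eq("").sum()       # counts '' values that isnull() misses
cleaned = demo.replace("", np.nan)     # after this, isnull() sees both kinds
print(null_counts, empty_counts, cleaned.isnull().sum(), sep="\n\n")
```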

In [None]:
df.loc[df['name'] == '서울역센트럴자이(임대)','name']='서울역센트럴자이' # rename '서울역센트럴자이(임대)' to '서울역센트럴자이'
df.loc[df['name'] == '서울역센트럴자이','road_name']='만리재로' # set 'road_name' to the value seen above for '서울역센트럴자이'
df.loc[df['name'] == '서울역센트럴자이','road_number']='175' # set 'road_number' to the value seen above for '서울역센트럴자이'
df.info() # confirm that, as a first pass, the values shown as null have been handled

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1237491 entries, 0 to 1237490
Data columns (total 14 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   year         1237491 non-null  int64         
 1   month        1237491 non-null  int64         
 2   day          1237491 non-null  int64         
 3   address_0    1237491 non-null  object        
 4   address_1    1237491 non-null  object        
 5   address_2    1237491 non-null  object        
 6   road_name    1237491 non-null  object        
 7   road_number  1237491 non-null  object        
 8   area         1237491 non-null  float64       
 9   deal_price   1237491 non-null  int64         
 10  name         1237491 non-null  object        
 11  main_number  1237416 non-null  float64       
 12  sub_number   1237416 non-null  float64       
 13  date         1237491 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(3), int64(4), object(6)
memory usage

## Handling missing values 2

- The steps above showed that '' can be stored as a value, so '' values are treated as null and handled as missing values

In [None]:
import numpy as np
df = df.replace('', np.nan) # turn values that consist only of '' into null values
df.info() # check the result

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1237491 entries, 0 to 1237490
Data columns (total 14 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   year         1237491 non-null  int64         
 1   month        1237491 non-null  int64         
 2   day          1237491 non-null  int64         
 3   address_0    1237491 non-null  object        
 4   address_1    1237491 non-null  object        
 5   address_2    1237491 non-null  object        
 6   road_name    1235462 non-null  object        
 7   road_number  1234196 non-null  object        
 8   area         1237491 non-null  float64       
 9   deal_price   1237491 non-null  int64         
 10  name         1237491 non-null  object        
 11  main_number  1237416 non-null  float64       
 12  sub_number   1237416 non-null  float64       
 13  date         1237491 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(3), int64(4), object(6)
memory usage

In [None]:
df.isnull().sum() # confirm that the null counts of 'road_name' and 'road_number' increased

year              0
month             0
day               0
address_0         0
address_1         0
address_2         0
road_name      2029
road_number    3295
area              0
deal_price        0
name              0
main_number      75
sub_number       75
date              0
dtype: int64

- At first the road-name address appeared to have fewer null values, but preprocessing showed that the lot-number (jibun) address actually has fewer

In [None]:
# Look for rows where only one of 'main_number' and 'sub_number' is null -> none
# In other words, the two are always null together
df[((df['main_number'].isnull()) &(df['sub_number'].notnull()))
  |((df['main_number'].notnull()) &(df['sub_number'].isnull()))]

Unnamed: 0,year,month,day,address_0,address_1,address_2,road_name,road_number,area,deal_price,name,main_number,sub_number,date


In [None]:
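# Look for rows where the road-name information is null and the lot-number address is also null -> none
# In other words, every row has address information in at least one of the road-name address and the lot-number address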
df[((df['road_name'].isnull()) | (df['road_number'].isnull())) & (df['main_number'].isnull())] 

Unnamed: 0,year,month,day,address_0,address_1,address_2,road_name,road_number,area,deal_price,name,main_number,sub_number,date


In [None]:
# Show the rows with null values that still need to be handled
df.loc[df['main_number'].isnull(),['address_0','address_1','address_2','road_name','road_number','name','main_number','sub_number']] 

Unnamed: 0,address_0,address_1,address_2,road_name,road_number,name,main_number,sub_number
681633,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
681634,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
681635,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
681636,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
681637,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
...,...,...,...,...,...,...,...,...
1209122,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
1209123,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
1209124,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
1232880,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,


In [None]:
df.loc[df['main_number'].isnull(),'name'].unique() # list the apartment names whose lot-number address is null
                                                   # only '힐스테이트 서초 젠트리스' seems to need fixing

array(['힐스테이트 서초 젠트리스'], dtype=object)

In [None]:
df.loc[df['name']=='힐스테이트 서초 젠트리스',:] # every row whose name is '힐스테이트 서초 젠트리스' has a null lot-number address

Unnamed: 0,year,month,day,address_0,address_1,address_2,road_name,road_number,area,deal_price,name,main_number,sub_number,date
681633,2015,3,1,서울특별시,서초구,신원동,헌릉로8길,10-12,84.95,73430,힐스테이트 서초 젠트리스,,,2015-03-01
681634,2015,4,17,서울특별시,서초구,신원동,헌릉로8길,10-12,84.99,79000,힐스테이트 서초 젠트리스,,,2015-04-17
681635,2015,5,1,서울특별시,서초구,신원동,헌릉로8길,10-12,101.90,95000,힐스테이트 서초 젠트리스,,,2015-05-01
681636,2015,6,16,서울특별시,서초구,신원동,헌릉로8길,10-12,84.95,87200,힐스테이트 서초 젠트리스,,,2015-06-16
681637,2015,6,26,서울특별시,서초구,신원동,헌릉로8길,10-12,101.90,94500,힐스테이트 서초 젠트리스,,,2015-06-26
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1209122,2021,4,27,서울특별시,서초구,신원동,헌릉로8길,10-12,101.90,184500,힐스테이트 서초 젠트리스,,,2021-04-27
1209123,2021,5,26,서울특별시,서초구,신원동,헌릉로8길,10-12,84.95,165000,힐스테이트 서초 젠트리스,,,2021-05-26
1209124,2021,7,26,서울특별시,서초구,신원동,헌릉로8길,10-12,84.99,182000,힐스테이트 서초 젠트리스,,,2021-07-26
1232880,2022,6,23,서울특별시,서초구,신원동,헌릉로8길,10-12,101.90,204000,힐스테이트 서초 젠트리스,,,2022-06-23


In [None]:
# Fill the null lot-number values with information found by searching on Naver
df.loc[df['name']=='힐스테이트 서초 젠트리스','main_number'] = 557
df.loc[df['name']=='힐스테이트 서초 젠트리스','sub_number'] = 0
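
If more complexes ever need manual lot numbers, the two assignments above could be driven by a lookup table. The sketch below is one way to do that; the helper `fill_lot_numbers` and the dictionary `LOT_NUMBERS` are hypothetical names, and the only entry is the value used in the cell above. The demo rows are made up.

```python
import pandas as pd

# complex name -> (main_number, sub_number), looked up manually
LOT_NUMBERS = {"힐스테이트 서초 젠트리스": (557, 0)}


def fill_lot_numbers(df, lookup):
    """Fill null main_number/sub_number for the complexes listed in lookup."""
    out = df.copy()
    for name, (main_no, sub_no) in lookup.items():
        mask = (out["name"] == name) & out["main_number"].isnull()
        out.loc[mask, "main_number"] = main_no
        out.loc[mask, "sub_number"] = sub_no
    return out


demo = pd.DataFrame({
    "name": ["힐스테이트 서초 젠트리스", "개포주공1단지"],
    "main_number": [float("nan"), 141.0],
    "sub_number": [float("nan"), 0.0],
})
filled = fill_lot_numbers(demo, LOT_NUMBERS)
print(filled)
```

Unlike the cell above, this version only touches rows whose `main_number` is null, so existing values are never overwritten.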

In [None]:
# Select the columns to use and rename them
df_deal = df[['date','year','month','day','address_0','address_1','address_2','main_number','sub_number','name','area','deal_price']].copy()
df_deal.columns =['date','year','month','day','address_0','address_1','address_2','address_3','address_4','name','area','deal_price']
df_deal = df_deal[df_deal['year']>=2011] # keep 2011 onward, because the jeonse/monthly rent data starts in 2011
df_deal.head()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,deal_price
355306,2011-07-09,2011,7,9,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,64000
355307,2011-07-28,2011,7,28,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,65500
355308,2011-01-19,2011,1,19,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,70500
355309,2011-09-02,2011,9,2,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,85000
355310,2011-12-17,2011,12,17,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,68000


In [None]:
df_deal.info() # check the DataFrame information

<class 'pandas.core.frame.DataFrame'>
Int64Index: 882185 entries, 355306 to 1237490
Data columns (total 12 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   date        882185 non-null  datetime64[ns]
 1   year        882185 non-null  int64         
 2   month       882185 non-null  int64         
 3   day         882185 non-null  int64         
 4   address_0   882185 non-null  object        
 5   address_1   882185 non-null  object        
 6   address_2   882185 non-null  object        
 7   address_3   882185 non-null  float64       
 8   address_4   882185 non-null  float64       
 9   name        882185 non-null  object        
 10  area        882185 non-null  float64       
 11  deal_price  882185 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(4), object(4)
memory usage: 87.5+ MB


In [None]:
df_deal.iloc[200] # spot-check that the values are in place

date          2011-12-23 00:00:00
year                         2011
month                          12
day                            23
address_0                   서울특별시
address_1                     강남구
address_2                     개포동
address_3                   141.0
address_4                     0.0
name                      개포주공1단지
area                        56.57
deal_price                  95000
Name: 355506, dtype: object

In [None]:
df_deal.to_csv('/content/drive/MyDrive/house_price/after_data/apartment_deal.csv',index=False)
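
A csv file does not store dtypes, so the `date` column comes back as plain text unless it is parsed on load. The sketch below shows a round trip through an in-memory buffer with one made-up row; reading the real file would use the same `parse_dates` argument with the path written above.

```python
import io

import pandas as pd

demo = pd.DataFrame({"date": pd.to_datetime(["2011-07-09"]), "deal_price": [64000]})

buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype that the csv format itself cannot keep
restored = pd.read_csv(buffer, parse_dates=["date"])
print(restored.dtypes)
```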

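# Creating apartment_full_rent.csv and apartment_month_rent.csv

- Create the apartment_full_rent (apartment jeonse) and apartment_month_rent (apartment monthly rent) files
- Apartment jeonse and monthly rent data was obtained as files from 'http://rtdown.molit.go.kr/'
- The apartment jeonse csv files are split by year, and not all of the information in each csv file is needed, so a preprocessing step is applied

## Loading and merging the csv files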

In [None]:
import pandas as pd
import os


dir_path = "/content/drive/MyDrive/house_price/original_data/rent_price/Seoul"
file_list = os.listdir(dir_path)
file_list.sort()
df_list = list()

# Read every csv file in the folder and collect the DataFrames in a list
for csv_file in file_list:
    df_list.append(pd.read_csv(dir_path+"/"+csv_file ,skiprows=15,  encoding='cp949'))




In [None]:
df_list[-1].info() # check that the list was filled correctly

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 231846 entries, 0 to 231845
Data columns (total 19 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   시군구            231846 non-null  object 
 1   번지             231657 non-null  object 
 2   본번             231819 non-null  float64
 3   부번             231819 non-null  float64
 4   단지명            231846 non-null  object 
 5   전월세구분          231846 non-null  object 
 6   전용면적(㎡)        231846 non-null  float64
 7   계약년월           231846 non-null  int64  
 8   계약일            231846 non-null  int64  
 9   보증금(만원)        231846 non-null  object 
 10  월세(만원)         231846 non-null  object 
 11  층              231846 non-null  int64  
 12  건축년도           231749 non-null  float64
 13  도로명            231846 non-null  object 
 14  계약기간           231846 non-null  object 
 15  계약구분           231846 non-null  object 
 16  갱신요구권 사용       231846 non-null  object 
 17  종전계약 보증금 (만원)  188985 non-nul

In [None]:
# Merge all DataFrames
df_default = pd.concat(df_list, axis=0)
df_default.reset_index(drop=True, inplace=True) # reset the index after concatenation
df_default.info() # check the result of merging the DataFrames

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2085775 entries, 0 to 2085774
Data columns (total 19 columns):
 #   Column         Dtype  
---  ------         -----  
 0   시군구            object 
 1   번지             object 
 2   본번             float64
 3   부번             float64
 4   단지명            object 
 5   전월세구분          object 
 6   전용면적(㎡)        float64
 7   계약년월           int64  
 8   계약일            int64  
 9   보증금(만원)        object 
 10  월세(만원)         object 
 11  층              float64
 12  건축년도           float64
 13  도로명            object 
 14  계약기간           object 
 15  계약구분           object 
 16  갱신요구권 사용       object 
 17  종전계약 보증금 (만원)  object 
 18  종전계약 월세 (만원)   object 
dtypes: float64(5), int64(2), object(12)
memory usage: 302.4+ MB


In [None]:
df_default.loc[1]

시군구               서울특별시 강남구 개포동
번지                        655-2
본번                        655.0
부번                          2.0
단지명              개포2차현대아파트(220)
전월세구분                        전세
전용면적(㎡)                   77.75
계약년월                     201101
계약일                          18
보증금(만원)                  20,000
월세(만원)                        0
층                           8.0
건축년도                     1988.0
도로명                     언주로 103
계약기간                          -
계약구분                          -
갱신요구권 사용                      -
종전계약 보증금 (만원)               NaN
종전계약 월세 (만원)                NaN
Name: 1, dtype: object

In [None]:
df_default.head() # check the shape of the data

Unnamed: 0,시군구,번지,본번,부번,단지명,전월세구분,전용면적(㎡),계약년월,계약일,보증금(만원),월세(만원),층,건축년도,도로명,계약기간,계약구분,갱신요구권 사용,종전계약 보증금 (만원),종전계약 월세 (만원)
0,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),전세,77.75,201101,5,35000,0,7.0,1988.0,언주로 103,-,-,-,,
1,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),전세,77.75,201101,18,20000,0,8.0,1988.0,언주로 103,-,-,-,,
2,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),전세,77.75,201102,1,24000,0,5.0,1988.0,언주로 103,-,-,-,,
3,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),전세,77.75,201102,11,31000,0,9.0,1988.0,언주로 103,-,-,-,,
4,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),전세,77.75,201102,24,30500,0,9.0,1988.0,언주로 103,-,-,-,,


In [None]:
df_default.isnull().sum() # 번지, 본번, and 부번 contain null values

시군구                    0
번지                  1586
본번                   234
부번                   234
단지명                    0
전월세구분                  0
전용면적(㎡)               36
계약년월                   0
계약일                    0
보증금(만원)                0
월세(만원)                 0
층                     36
건축년도                 249
도로명                    0
계약기간                   0
계약구분                   0
갱신요구권 사용               0
종전계약 보증금 (만원)    1793799
종전계약 월세 (만원)     1793799
dtype: int64

In [None]:
df_default['전월세구분'].unique()

array(['전세', '월세'], dtype=object)

- 전월세구분 takes only two values, '전세' and '월세', so the data is easy to split with a condition
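
As a sketch of that split, the rows can be separated in one pass with `groupby`, which yields one DataFrame per value of 전월세구분. The rows below are made up, and the dictionary name `by_type` is hypothetical; the notebook itself uses a boolean condition instead.

```python
import pandas as pd

demo = pd.DataFrame({
    "전월세구분": ["전세", "월세", "전세"],
    "보증금(만원)": ["20,000", "5,000", "31,000"],
    "월세(만원)": ["0", "50", "0"],
})

# One DataFrame per rent type, each with a fresh index
by_type = {kind: part.reset_index(drop=True) for kind, part in demo.groupby("전월세구분")}
print(by_type["전세"])
```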

## Creating the jeonse DataFrame

- The process is almost identical to apartment_deal, so the steps from the apartment_deal.ipynb file are combined into a single cell
- The commented-out lines are intermediate checks

In [None]:
# Build the jeonse DataFrame; the commented-out lines are intermediate checks
import numpy as np

df_full_rent = df_default.loc[df_default['전월세구분']=='전세',['시군구','본번','부번','도로명','계약년월','계약일','보증금(만원)','전용면적(㎡)','단지명']].copy()
df_full_rent.columns = ['address','main_number','sub_number','road','year_month','day','full_rent_price','area','name']
# print(df_full_rent.head())
# print(df_full_rent.info())

df_full_rent = df_full_rent.astype({'year_month':'str','day':'str','full_rent_price':'str'})
df_full_rent["full_rent_price"] = df_full_rent["full_rent_price"].str.replace(",", "")
df_full_rent['day'] = df_full_rent['day'].str.zfill(2) # left-pad single-digit days with 0
df_full_rent['year'] = df_full_rent['year_month'].str[0:4] # extract the year from the combined year/month column
df_full_rent['month'] = df_full_rent['year_month'].str[4:] # extract the month from the combined year/month column
df_full_rent['date'] = pd.to_datetime(df_full_rent['year']+df_full_rent['month']+df_full_rent['day'], format='%Y%m%d') # combine the parts into a date column
df_full_rent = df_full_rent.astype({'year':'int64','month':'int64','day':'int64','full_rent_price':'int64'})
df_full_rent = df_full_rent.drop(['year_month'], axis=1) # drop the column that is no longer used
# print(df_full_rent.head())
# print(df_full_rent.info())

df_full_rent["address_0"] = df_full_rent["address"].str.split(' ',expand=True)[0] # city only (only Seoul is covered, so not strictly needed for now)
df_full_rent["address_1"] = df_full_rent["address"].str.split(' ',expand=True)[1] # district (gu) only
df_full_rent["address_2"] = df_full_rent["address"].str.split(' ',expand=True)[2] # neighborhood (dong) only
df_full_rent["road_name"] = df_full_rent["road"].str.split(' ',expand=True)[0] # road name only
df_full_rent["road_number"] = df_full_rent["road"].str.split(' ',expand=True)[1] # road number only
df_full_rent= df_full_rent[['year','month','day','address_0','address_1','address_2','main_number','sub_number','road_name','road_number','area',"full_rent_price",'name','date']] # keep only the columns that will be used
# print(df_full_rent.head())
# print(df_full_rent.info())
# print(df_full_rent.isnull().sum())

# Turn values that consist only of '' into null values.
# np.nan is used as in the sale-data step; in older pandas versions replace('', None) can fill from the previous row instead of setting null.
df_full_rent = df_full_rent.replace('', np.nan)
# print(df_full_rent.isnull().sum())

# df_full_rent[((df_full_rent['main_number'].isnull()) &(df_full_rent['sub_number'].notnull()))
#   |((df_full_rent['main_number'].notnull()) &(df_full_rent['sub_number'].isnull()))]

# df_full_rent[((df_full_rent['road_name'].isnull()) | (df_full_rent['road_number'].isnull())) & (df_full_rent['main_number'].isnull())] 

# df_full_rent.loc[df_full_rent['main_number'].isnull(),['address_0','address_1','address_2','main_number','sub_number','road_name','road_number','name']]

# df_full_rent.loc[df_full_rent['main_number'].isnull(),'name'].unique()

# df_full_rent.loc[df_full_rent['name']=='힐스테이트 서초 젠트리스',:]

df_full_rent.loc[df_full_rent['name']=='힐스테이트 서초 젠트리스','main_number'] = 557
df_full_rent.loc[df_full_rent['name']=='힐스테이트 서초 젠트리스','sub_number'] = 0


df_full_rent = df_full_rent[['date','year','month','day','address_0','address_1','address_2','main_number','sub_number','name','area','full_rent_price']].copy()
df_full_rent.columns =['date','year','month','day','address_0','address_1','address_2','address_3','address_4','name','area','full_rent_price']
# df_full_rent.head()

# df_full_rent.info() 



In [None]:
df_full_rent.isnull().sum()

date                0
year                0
month               0
day                 0
address_0           0
address_1           0
address_2           0
address_3           0
address_4           0
name                0
area               25
full_rent_price     0
dtype: int64

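### Handling missing values in the 'area' column

- Unlike apartment_deal.ipynb, the area column has missing values here, so a missing-value handling step is added
- Each missing value is replaced with the area value that appears most often among the jeonse transactions at the same address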
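
The imputation rule above can also be sketched with a groupby instead of explicit loops. This is only an illustration on made-up rows: the column names follow the notebook, while `toy`, `most_common_area`, and `area_filled` are hypothetical names that do not appear in it.

```python
import pandas as pd

# Toy data: column names follow the notebook, values are made up for illustration.
toy = pd.DataFrame({
    "address_1": ["A", "A", "A", "A", "B"],
    "address_3": [10, 10, 10, 10, 20],
    "area": [84.9, 84.9, 59.0, None, None],
})

def most_common_area(s):
    """Return the most frequent non-null value in s, or NaN if there is none."""
    counts = s.value_counts()  # value_counts() ignores NaN by default
    return counts.idxmax() if len(counts) else float("nan")

# Per-address mode, broadcast back to every row of that address.
mode_per_address = toy.groupby(["address_1", "address_3"])["area"].transform(most_common_area)

# Fill only the rows whose area is missing; rows with a recorded area stay untouched.
toy["area_filled"] = toy["area"].fillna(mode_per_address)
print(toy)
```

Rows at an address with no recorded area at all (address "B" here) stay null, which corresponds to the single listing that the notebook later drops.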

In [None]:
# inspect the rows whose area is empty
df_full_rent[df_full_rent['area'].isnull()].tail()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,full_rent_price
357440,2013-11-16,2013,11,16,서울특별시,노원구,공릉동,683.0,14.0,한일휴니스빌,,8000
375219,2013-11-30,2013,11,30,서울특별시,동대문구,장안동,312.0,8.0,태솔에버빌,,12000
389892,2013-01-17,2013,1,17,서울특별시,서대문구,창천동,501.0,14.0,삼성아트빌,,9000
439901,2013-01-20,2013,1,20,서울특별시,영등포구,영등포동4가,103.0,0.0,영등포그랑그루,,8000
490009,2014-02-19,2014,2,19,서울특별시,강서구,화곡동,29.0,47.0,드림하우스(29-47),,9500


In [None]:
# collect the address columns of the rows whose area is null into lists
add_1 = list(df_full_rent.loc[df_full_rent['area'].isnull(),'address_1'])
add_2 = list(df_full_rent.loc[df_full_rent['area'].isnull(),'address_2'])
add_3 = list(df_full_rent.loc[df_full_rent['area'].isnull(),'address_3'])
add_4 = list(df_full_rent.loc[df_full_rent['area'].isnull(),'address_4'])
area_list = list()

In [None]:
# append values to area_list
for i in range(len(add_1)):
    # if no transaction at this address has a recorded area, there is nothing to impute from, so use ''
    if (len(df_full_rent.loc[(df_full_rent['address_1'] ==add_1[i]) & 
                     (df_full_rent['address_2'] ==add_2[i]) &
                     (df_full_rent['address_3'] ==add_3[i]) &
                     (df_full_rent['address_4'] ==add_4[i]),
                     'area'].value_counts())) == 0:

        area_list.append('')
    else:
        # otherwise use the most frequently traded area at this address
        area_list.append(df_full_rent.loc[(df_full_rent['address_1'] ==add_1[i]) & 
                     (df_full_rent['address_2'] ==add_2[i]) &
                     (df_full_rent['address_3'] ==add_3[i]) &
                     (df_full_rent['address_4'] ==add_4[i]),
                     'area'].value_counts().idxmax())
print(area_list) # most frequently traded area for each address whose area is null

[84.9, 33.33, 15.94, 15.94, 84.98, 142.034, 142.034, 142.034, 142.034, 17.07, 17.07, 17.07, 17.07, 17.07, 64.52, 23.47, 23.47, 13.2195, 13.2195, 13.2195, 13.2195, 49.65, 39.28, 12.1, '']


- The trailing '' means that this listing has no other transactions to reference

In [None]:
# use len to check that all lists were built with the same length
print(len(add_1),len(add_2),len(add_3),len(add_4),len(area_list)) 

25 25 25 25 25


In [None]:
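# the last entry is '', so look for a value that could fill this row's area -> none found
# the area cannot be determined because there are no other area values to reference -> the row must be removed later
df_full_rent.loc[(df_full_rent['address_3']==29)&(df_full_rent['address_4']==47),:] # inspect the row whose area is null as a representative check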

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,full_rent_price
490009,2014-02-19,2014,2,19,서울특별시,강서구,화곡동,29.0,47.0,드림하우스(29-47),,9500


In [None]:
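# fill only the rows whose area is null, using the most frequently traded area at that address
for i in range(len(add_1)):
    df_full_rent.loc[(df_full_rent['address_1'] ==add_1[i]) & 
                         (df_full_rent['address_2'] ==add_2[i]) &
                         (df_full_rent['address_3'] ==add_3[i]) &
                         (df_full_rent['address_4'] ==add_4[i]) &
                         (df_full_rent['area'].isnull()), # without this condition every row at the address would be overwritten
                         'area']=area_list[i]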

In [None]:
# check that '' was stored in place of null
df_full_rent.loc[df_full_rent['area']=='',:]

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,full_rent_price
490009,2014-02-19,2014,2,19,서울특별시,강서구,화곡동,29.0,47.0,드림하우스(29-47),,9500


In [None]:
# remove the rows whose area is ''
df_full_rent=df_full_rent.drop(df_full_rent[df_full_rent['area']==''].index)

# check after removal
df_full_rent.loc[df_full_rent['area']=='',:] # confirmed removed

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,full_rent_price


In [None]:
df_full_rent.info() # use info() to check whether the null values were handled

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1448687 entries, 0 to 2085774
Data columns (total 12 columns):
 #   Column           Non-Null Count    Dtype         
---  ------           --------------    -----         
 0   date             1448687 non-null  datetime64[ns]
 1   year             1448687 non-null  int64         
 2   month            1448687 non-null  int64         
 3   day              1448687 non-null  int64         
 4   address_0        1448687 non-null  object        
 5   address_1        1448687 non-null  object        
 6   address_2        1448687 non-null  object        
 7   address_3        1448687 non-null  float64       
 8   address_4        1448687 non-null  float64       
 9   name             1448687 non-null  object        
 10  area             1448662 non-null  float64       
 11  full_rent_price  1448687 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(4), object(4)
memory usage: 143.7+ MB


In [None]:
df_full_rent.to_csv('/content/drive/MyDrive/house_price/after_data/apartment_full_rent.csv', index=False) # write the jeonse csv file

## Building the monthly-rent DataFrame

- See the section on building the jeonse DataFrame

In [None]:
# build the monthly-rent DataFrame, keeping only the needed columns
df_month_rent = df_default.loc[df_default['전월세구분']=='월세',['시군구','본번','부번','도로명','계약년월','계약일','보증금(만원)','월세(만원)','전용면적(㎡)','단지명']].copy()
df_month_rent.columns = ['address','main_number','sub_number','road','year_month','day','rent_deposit','month_rent_price','area','name']
# df_month_rent.head()

df_month_rent.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 637088 entries, 25 to 2085770
Data columns (total 10 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   address           637088 non-null  object 
 1   main_number       637039 non-null  float64
 2   sub_number        637039 non-null  float64
 3   road              637088 non-null  object 
 4   year_month        637088 non-null  int64  
 5   day               637088 non-null  int64  
 6   rent_deposit      637088 non-null  object 
 7   month_rent_price  637088 non-null  object 
 8   area              637077 non-null  float64
 9   name              637088 non-null  object 
dtypes: float64(3), int64(2), object(5)
memory usage: 53.5+ MB


Note the part that differs from the jeonse section! ↓

In [None]:
df_month_rent["month_rent_price2"] = df_month_rent["month_rent_price"].str.replace(',','')
df_month_rent.info() 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 637088 entries, 25 to 2085770
Data columns (total 11 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   address            637088 non-null  object 
 1   main_number        637039 non-null  float64
 2   sub_number         637039 non-null  float64
 3   road               637088 non-null  object 
 4   year_month         637088 non-null  int64  
 5   day                637088 non-null  int64  
 6   rent_deposit       637088 non-null  object 
 7   month_rent_price   637088 non-null  object 
 8   area               637077 non-null  float64
 9   name               637088 non-null  object 
 10  month_rent_price2  349840 non-null  object 
dtypes: float64(3), int64(2), object(6)
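memory usage: 58.3+ MB


- Applying replace to "month_rent_price" to create month_rent_price2 shows that replace does not work as intended

>> df_month_rent["month_rent_price"].str.replace(',','') 

>> After this call, the number of nulls in 'month_rent_price2' grows sharply (only 349,840 of 637,088 values remain non-null) -> the replace method is not behaving as expected

>> Why? -> The likely cause is that the column has dtype object and mixes types: the .str accessor returns NaN for every element that is not a string, so values stored as numbers are lost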
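
The behaviour can be reproduced on a tiny Series. This sketch assumes the cause in the notebook is a mix of strings and numbers in one object column; the values below are made up, and `mixed`, `broken`, and `fixed` are hypothetical names.

```python
import pandas as pd

# Made-up values: strings with thousands separators mixed with plain integers,
# which is what an object column can hold after reading a csv file.
mixed = pd.Series(["1,200", 63, "35", 160], dtype="object")

# Direct call: every non-string element comes back as NaN.
broken = mixed.str.replace(",", "")

# Cast to str first, then strip the commas and convert to integers.
fixed = mixed.astype(str).str.replace(",", "").astype("int64")

print(broken.isna().sum())
print(fixed.tolist())
```

Casting with astype before calling .str.replace is the same order of operations the notebook adopts in the next cell.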

In [None]:
# the columns must first be cast to str, as done here, before the replace step
df_month_rent = df_month_rent.astype({'month_rent_price':'str','rent_deposit':'str'})

- The process is almost identical to apartment_deal, so it is done in a single cell
- The commented-out lines are intermediate checks

In [None]:
df_month_rent["rent_deposit"] = df_month_rent["rent_deposit"].str.replace(",", "")
df_month_rent["month_rent_price"] = df_month_rent["month_rent_price"].str.replace(',','')
df_month_rent = df_month_rent.astype({'year_month':'str','day':'str','rent_deposit':'int64','month_rent_price':'int64'})
df_month_rent['year'] = df_month_rent['year_month'].str[0:4] # extract the year from the combined year-month column
df_month_rent['month'] = df_month_rent['year_month'].str[4:] # extract the month from the combined year-month column
df_month_rent.loc[df_month_rent["day"].str.len()==1,"day"]='0'+df_month_rent.loc[df_month_rent["day"].str.len()==1,"day"] # zero-pad single-digit days
df_month_rent['date'] = pd.to_datetime(df_month_rent['year']+df_month_rent['month']+df_month_rent['day']) # concatenate the parts to build the date column
df_month_rent = df_month_rent.astype({'year':'int64','month':'int64','day':'int64'})
df_month_rent = df_month_rent.drop(['year_month'], axis=1) # drop the column that is no longer needed
# print(df_month_rent.head())

df_month_rent["address_0"] = df_month_rent["address"].str.split(' ',expand=True)[0] # city (si) only; the data covers Seoul only, so no further handling for now
df_month_rent["address_1"] = df_month_rent["address"].str.split(' ',expand=True)[1] # district (gu) only
df_month_rent["address_2"] = df_month_rent["address"].str.split(' ',expand=True)[2] # neighborhood (dong) only
df_month_rent["road_name"] = df_month_rent["road"].str.split(' ',expand=True)[0] # road name only
df_month_rent["road_number"] = df_month_rent["road"].str.split(' ',expand=True)[1] # road number only
df_month_rent= df_month_rent[['year','month','day','address_0','address_1','address_2','main_number','sub_number','road_name','road_number','area',"rent_deposit","month_rent_price",'name','date']] # keep only the columns to be used
# print(df_month_rent.head())

# print(df_month_rent.info())
# print(df_month_rent.isnull().sum())

df_month_rent = df_month_rent.replace('', None) # turn values that are only '' into nulls
# print(df_month_rent.isnull().sum()) # check after the change -> the null counts of road_name and road_number rise sharply

# df_month_rent[((df_month_rent['main_number'].isnull()) &(df_month_rent['sub_number'].notnull()))
#   |((df_month_rent['main_number'].notnull()) &(df_month_rent['sub_number'].isnull()))]

# df_month_rent[((df_month_rent['road_name'].isnull()) | (df_month_rent['road_number'].isnull())) & (df_month_rent['main_number'].isnull())] 

# df_month_rent.loc[df_month_rent['main_number'].isnull(),['address_0','address_1','address_2','main_number','sub_number','road_name','road_number','name']]

# df_month_rent.loc[df_month_rent['main_number'].isnull(),'name'].unique()

# df_month_rent.loc[df_month_rent['name']=='힐스테이트 서초 젠트리스',:]


df_month_rent.loc[df_month_rent['name']=='힐스테이트 서초 젠트리스','main_number'] = 557
df_month_rent.loc[df_month_rent['name']=='힐스테이트 서초 젠트리스','sub_number'] = 0

df_month_rent = df_month_rent[['date','year','month','day','address_0','address_1','address_2','main_number','sub_number','name','area','rent_deposit','month_rent_price']]
df_month_rent.columns =['date','year','month','day','address_0','address_1','address_2','address_3','address_4','name','area','rent_deposit','month_rent_price']
# df_month_rent.head()

# df_month_rent.info()



In [None]:
df_month_rent.isnull().sum()

date                 0
year                 0
month                0
day                  0
address_0            0
address_1            0
address_2            0
address_3            0
address_4            0
name                 0
area                11
rent_deposit         0
month_rent_price     0
dtype: int64

### Handling missing values in the 'area' column

- See the area missing-value handling in the jeonse section

In [None]:
# df_month_rent[df_month_rent['area'].isnull()].tail()


add_1 = list(df_month_rent.loc[df_month_rent['area'].isnull(),'address_1'])
add_2 = list(df_month_rent.loc[df_month_rent['area'].isnull(),'address_2'])
add_3 = list(df_month_rent.loc[df_month_rent['area'].isnull(),'address_3'])
add_4 = list(df_month_rent.loc[df_month_rent['area'].isnull(),'address_4'])
area_list = list()
# append values to area_list
for i in range(len(add_1)):
    # if no transaction at this address has a recorded area, there is nothing to impute from, so use ''
    if (len(df_month_rent.loc[(df_month_rent['address_1'] ==add_1[i]) & 
                     (df_month_rent['address_2'] ==add_2[i]) &
                     (df_month_rent['address_3'] ==add_3[i]) &
                     (df_month_rent['address_4'] ==add_4[i]),
                     'area'].value_counts())) == 0:

        area_list.append('')
    else:
        # otherwise use the most frequently traded area at this address
        area_list.append(df_month_rent.loc[(df_month_rent['address_1'] ==add_1[i]) & 
                     (df_month_rent['address_2'] ==add_2[i]) &
                     (df_month_rent['address_3'] ==add_3[i]) &
                     (df_month_rent['address_4'] ==add_4[i]),
                     'area'].value_counts().idxmax())
# print(area_list)

# print(len(add_1),len(add_2),len(add_3),len(add_4),len(area_list)) 

for i in range(len(add_1)):
    df_month_rent.loc[(df_month_rent['address_1'] ==add_1[i]) & 
                         (df_month_rent['address_2'] ==add_2[i]) &
                         (df_month_rent['address_3'] ==add_3[i]) &
                         (df_month_rent['address_4'] ==add_4[i]) &
                         (df_month_rent['area'].isnull()), # fill only the null rows, not every row at the address
                         'area']=area_list[i]

# df_month_rent.head()

# df_month_rent.info()



In [None]:
df_month_rent.isnull().sum()

date                0
year                0
month               0
day                 0
address_0           0
address_1           0
address_2           0
address_3           0
address_4           0
name                0
area                0
rent_deposit        0
month_rent_price    0
dtype: int64

In [None]:
df_month_rent.head()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,rent_deposit,month_rent_price
25,2011-03-18,2011,3,18,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,19000,63
28,2011-04-09,2011,4,9,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,21000,35
38,2011-07-09,2011,7,9,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,3000,160
46,2011-09-19,2011,9,19,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,6000,140
47,2011-09-20,2011,9,20,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,5000,160


In [None]:
df_month_rent.to_csv('/content/drive/MyDrive/house_price/after_data/apartment_month_rent.csv', index=False)

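# Creating the economic_data.csv file

- Build the economic_data file (macroeconomic information)
- economic_data contains the Korean base rate, real-estate index, base rate, KOSPI index, Korean government bond yields, US government bond yields, long-short yield spread, apartment supply volume (new sales), number of unsold apartments, and unsold-apartment ratio

## Building the base-rate DataFrame

- The page 'https://www.bok.or.kr/portal/singl/baseRate/list.do?dataSeCd=01&menuNo=200643' lists the dates on which the base rate changed, so it is crawled to build a DataFrame of the base rate by date

### Fetching the base-rate information by crawling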

In [None]:
# import libraries

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
# fetch the web page

res = requests.get('https://www.bok.or.kr/portal/singl/baseRate/list.do?dataSeCd=01&menuNo=200643')

# parse the web page
soup = BeautifulSoup(res.content,'html.parser')

# select the table rows that hold the data
items = soup.select('#content > div.table.tac > table > tbody > tr')

# lists that collect the crawled values; they become DataFrame columns later
change_year_list = list()
change_date_list = list()
rp_list = list()

# read the text of each table cell and append it to the matching list
for item in items:
    table_list = item.select('td')
    change_year_list.append(table_list[0].get_text())
    change_date_list.append(table_list[1].get_text())
    rp_list.append(table_list[2].get_text())
    
# df holds the crawled base-rate information
df = pd.DataFrame({
    "year": change_year_list,
    "change_date": change_date_list,
    "korea_rp": rp_list
}, columns=["year", "change_date", "korea_rp"])

df.tail() # check the DataFrame

Unnamed: 0,year,change_date,korea_rp
50,2001,07월 05일,4.75
51,2001,02월 08일,5.0
52,2000,10월 05일,5.25
53,2000,02월 10일,5.0
54,1999,05월 06일,4.75


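- change_date is the date on which the base rate changed, and korea_rp is the base rate after the change

### Merging columns

- The year and change_date columns both describe the date, so they are merged into a single column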
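
As a minimal sketch of this merge, the rows below are copied from the crawled table shown above. Using a regular expression with `str.extract` instead of fixed string slices is a swapped-in variation for illustration, and `rates` and `parts` are hypothetical names.

```python
import pandas as pd

# Sample rows copied from the crawled table shown above.
rates = pd.DataFrame({
    "year": ["2001", "2000", "1999"],
    "change_date": ["07월 05일", "10월 05일", "05월 06일"],
    "korea_rp": ["4.75", "5.25", "4.75"],
})

# Pull the two-digit month and day out of strings such as "07월 05일".
parts = rates["change_date"].str.extract(r"(?P<month>\d{2})월\s*(?P<day>\d{2})일")

# Concatenate to "YYYYMMDD" and parse with an explicit, matching format.
rates["rp_date"] = pd.to_datetime(rates["year"] + parts["month"] + parts["day"], format="%Y%m%d")
rates["korea_rp"] = rates["korea_rp"].astype("float64")

# Oldest change first, as in the notebook.
rates = rates[["rp_date", "korea_rp"]].sort_values("rp_date").reset_index(drop=True)
print(rates)
```

The regular expression tolerates small spacing differences in the scraped text, whereas positional slices depend on every cell having exactly the same layout.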

In [None]:
df['month']=df['change_date'].str[0:2] # extract the month
df['date'] = df['change_date'].str[4:6] # extract the day
df = df.astype({'korea_rp':'float64'}) # change the dtype of the rate column
df['rp_date'] = df['year']+df['month']+df['date'] # build a new column as a YYYYMMDD string
df = df.drop(['change_date', 'year','month','date'], axis=1) # drop the columns that are no longer used
df=df.sort_index(ascending=False) # the dates are in reverse order, so sort
df['rp_date'] = pd.to_datetime(df['rp_date'], format='%Y%m%d', errors='raise') # convert to datetime; the format matches the YYYYMMDD string

In [None]:
df.head() # check the DataFrame

Unnamed: 0,korea_rp,rp_date
54,4.75,1999-05-06
53,5.0,2000-02-10
52,5.25,2000-10-05
51,5.0,2001-02-08
50,4.75,2001-07-05


In [None]:
df.tail() # check the DataFrame

Unnamed: 0,korea_rp,rp_date
4,2.25,2022-07-13
3,2.5,2022-08-25
2,3.0,2022-10-12
1,3.25,2022-11-24
0,3.5,2023-01-13


### Generating the base rate for the dates between rate changes

- The DataFrame above only holds the dates on which the base rate changed and the new rate. The base rate for every date between two changes is also needed, so it is generated here
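
The idea can be sketched compactly with `pd.date_range`, a left merge, and a forward fill. The three change dates come from the table above, the calendar range is shortened for the example, and `changes` and `daily` are hypothetical names.

```python
import pandas as pd

# Three change dates taken from the table above.
changes = pd.DataFrame({
    "rp_date": pd.to_datetime(["2022-10-12", "2022-11-24", "2023-01-13"]),
    "korea_rp": [3.0, 3.25, 3.5],
})

# One row per calendar day, both endpoints included.
daily = pd.DataFrame({"date": pd.date_range("2022-10-12", "2023-01-30", freq="D")})

# Left-join the change dates onto the calendar, then carry each rate forward
# until the next change.
daily = daily.merge(changes, left_on="date", right_on="rp_date", how="left")
daily["korea_rp"] = daily["korea_rp"].ffill()
daily = daily[["date", "korea_rp"]]
print(daily.tail())
```

Unlike a list built with `range(0, (end-start).days)`, `pd.date_range` includes the end date, so the final day has to be chosen with that in mind.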

In [None]:
import datetime

# build every date in the crawled period
start = datetime.datetime.strptime("06-05-1999", "%d-%m-%Y") # start date
end = datetime.datetime.strptime("31-01-2023", "%d-%m-%Y") # end date (excluded, because range() stops one day before it)
date_generated = [start + datetime.timedelta(days=x) for x in range(0, (end-start).days)] # generate the dates from start up to, but not including, end
date_list=list()
for date in date_generated:
    date_list.append(date.strftime("%Y-%m-%d")) # format each generated date and append it to date_list

In [None]:
# df_date is a DataFrame that holds every date to be looked up
df_date = pd.DataFrame({
    "date": date_list
}, columns=["date"])
df_date['date'] = pd.to_datetime(df_date['date'], format='%Y-%m-%d', errors='raise') # convert to datetime; the format matches the YYYY-MM-DD strings

In [None]:
df_date.head() # check the DataFrame

Unnamed: 0,date
0,1999-05-06
1,1999-05-07
2,1999-05-08
3,1999-05-09
4,1999-05-10


In [None]:
# merge the two DataFrames to get the base rate by date
df_rp=pd.merge(df_date, df, left_on='date', right_on='rp_date', how='left')

In [None]:
# keep only the columns to be used
df_rp = df_rp[['date','korea_rp']]
df_rp # check the resulting DataFrame

Unnamed: 0,date,korea_rp
0,1999-05-06,4.75
1,1999-05-07,
2,1999-05-08,
3,1999-05-09,
4,1999-05-10,
...,...,...
8666,2023-01-26,
8667,2023-01-27,
8668,2023-01-28,
8669,2023-01-29,


In [None]:
# a base rate stays in effect until the next change, so each null is filled with the closest preceding value (the most recently set rate)
# this produces the base rate for every date
df_rp=df_rp.ffill() # ffill() fills each null with the last valid value above it
df_rp

Unnamed: 0,date,korea_rp
0,1999-05-06,4.75
1,1999-05-07,4.75
2,1999-05-08,4.75
3,1999-05-09,4.75
4,1999-05-10,4.75
...,...,...
8666,2023-01-26,3.50
8667,2023-01-27,3.50
8668,2023-01-28,3.50
8669,2023-01-29,3.50


In [None]:
# plot the base rate over time
# x-axis: date, y-axis: base rate
import plotly.graph_objects as go

# Create traces
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_rp['date'], y=df_rp['korea_rp'],
                    mode='lines',
                    name='korea_rp',yaxis='y1'))


fig.show(renderer="colab")

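## Building the real-estate index DataFrame

- The apartment sale price index file was downloaded from https://data.seoul.go.kr/dataList/801/S/2/datasetView.do
- The apartment sale price index is used only for a rough check of whether the macroeconomic indicators are related to apartment prices -> it is not used afterwards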

In [None]:
# load the real-estate index file
df_real_estate = pd.read_csv("/content/drive/MyDrive/house_price/original_data/seoul_deal_index.csv",  encoding='UTF8') # load the real-estate index
df_real_estate= df_real_estate.loc[(df_real_estate['시점']>1998) & (df_real_estate['자치구별(2)']=='소계'),['시점','아파트']]# keep only the rows that match the conditions
df_real_estate.head()


Unnamed: 0,시점,아파트
39,1999,38.7
42,2000,40.3
45,2001,48.1
48,2002,62.9
51,2003,61.2


In [None]:
df_real_estate.info() # check the DataFrame info

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23 entries, 39 to 519
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   시점      23 non-null     int64  
 1   아파트     23 non-null     float64
dtypes: float64(1), int64(1)
memory usage: 552.0 bytes


In [None]:
# convert the period column and check the head of the index
df_real_estate['시점'] = pd.to_datetime(df_real_estate['시점'], format='%Y') # convert the year to datetime (January 1 of that year)
df_real_estate.head()

Unnamed: 0,시점,아파트
39,1999-01-01,38.7
42,2000-01-01,40.3
45,2001-01-01,48.1
48,2002-01-01,62.9
51,2003-01-01,61.2


In [None]:
df_real_estate.info() # confirm that the dtype changed

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23 entries, 39 to 519
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   시점      23 non-null     datetime64[ns]
 1   아파트     23 non-null     float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 552.0 bytes


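## Combining the base rate and the real-estate index

- Merge the base-rate and real-estate index DataFrames
- The base-rate DataFrame has a row for every date, so it is used as the left side of the merge
- The real-estate index is assumed to stay constant throughout each year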
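
One way to express the "constant within a year" assumption directly is to join on the calendar year rather than on the exact date. This is a sketch of an alternative, not the approach used in the cells below; the two index values are copied from the output above, and `annual`, `daily`, and the `year` key are hypothetical names.

```python
import pandas as pd

# Annual index values copied from the output above (시점 -> year, 아파트 -> apartment_index).
annual = pd.DataFrame({"year": [1999, 2000], "apartment_index": [38.7, 40.3]})

daily = pd.DataFrame({
    "date": pd.to_datetime(["1999-05-06", "1999-12-31", "2000-01-01", "2000-06-15"])
})

# Join on the calendar year so every day inherits that year's value.
daily["year"] = daily["date"].dt.year
daily = daily.merge(annual, on="year", how="left").drop(columns="year")
print(daily)
```

With this join, dates in 1999 receive that year's value even though the daily series starts after 1999-01-01, so no separate fill for the leading rows is needed.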

In [None]:
df_final=pd.merge(df_rp, df_real_estate, left_on='date', right_on='시점', how='left') # merge the base-rate and real-estate index DataFrames
df_final=df_final.ffill() # fill nulls with the value above, assuming the index stays constant for the year
df_final.head()

Unnamed: 0,date,korea_rp,시점,아파트
0,1999-05-06,4.75,NaT,
1,1999-05-07,4.75,NaT,
2,1999-05-08,4.75,NaT,
3,1999-05-09,4.75,NaT,
4,1999-05-10,4.75,NaT,


In [None]:
df_final.info() # check the DataFrame info

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8671 entries, 0 to 8670
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   date      8671 non-null   datetime64[ns]
 1   korea_rp  8671 non-null   float64       
 2   시점        8431 non-null   datetime64[ns]
 3   아파트       8431 non-null   float64       
dtypes: datetime64[ns](2), float64(2)
memory usage: 338.7 KB


In [None]:
df_final.tail()

Unnamed: 0,date,korea_rp,시점,아파트
8666,2023-01-26,3.5,2021-01-01,104.4
8667,2023-01-27,3.5,2021-01-01,104.4
8668,2023-01-28,3.5,2021-01-01,104.4
8669,2023-01-29,3.5,2021-01-01,104.4
8670,2023-01-30,3.5,2021-01-01,104.4


In [None]:
df_final['아파트'] = df_final['아파트'].fillna(38.7) # fill the leading nulls of the index column only; 38.7 is the earliest value
df_final = df_final[['date','korea_rp','아파트']] # keep only the columns to be used
df_final.columns = ['date','korea_rp','apartment_index'] # rename the columns

In [None]:
df_final.head()

Unnamed: 0,date,korea_rp,apartment_index
0,1999-05-06,4.75,38.7
1,1999-05-07,4.75,38.7
2,1999-05-08,4.75,38.7
3,1999-05-09,4.75,38.7
4,1999-05-10,4.75,38.7


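### Comparing the base rate (inverted) with the real-estate index

In [None]:
# plot the base rate and the real-estate index as two traces
# the base rate is drawn on a reversed y-axis, i.e. flipped about the x-axis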

import plotly.graph_objects as go

# Create traces
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['korea_rp'],
                    mode='lines',
                    name='korea_rp',yaxis='y1'))
# reverse the first y-axis so the rate curve is flipped vertically
fig.update_layout(
    yaxis = dict(autorange="reversed")
)


fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['apartment_index'],
                    mode='lines',
                    name='apartment_index',
                    yaxis="y2"))
fig.update_layout(

   # create first Y-axis
   yaxis=dict(
      title="rp point",
      titlefont=dict(color="blue"),
      tickfont=dict(color="blue")
   ),

   # create second Y-axis
   yaxis2=dict(
      title="apartment index",
      overlaying="y",
      side="right")
)
fig.show(renderer="colab")

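Up to 2005 the two series move together, from 2005 to 2008 they move in opposite directions, and after 2008 they again move together to some extent.
Could it be that, from 2008 onward, the extra liquidity created by quantitative easing makes the base rate (inverted) and real-estate prices move alike?

## Trimming the DataFrame period

- Jeonse and monthly-rent data exist only from 2011 onward, so the DataFrame is cut to 2011 through 2022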

In [None]:
df_final = df_final[(df_final['date']>='2011-01-01') & (df_final['date']<='2022-12-31')] # keep only the dates to be used
df_final.head()

Unnamed: 0,date,korea_rp,apartment_index
4258,2011-01-01,2.5,93.0
4259,2011-01-02,2.5,93.0
4260,2011-01-03,2.5,93.0
4261,2011-01-04,2.5,93.0
4262,2011-01-05,2.5,93.0


In [None]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4383 entries, 4258 to 8640
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   date             4383 non-null   datetime64[ns]
 1   korea_rp         4383 non-null   float64       
 2   apartment_index  4383 non-null   float64       
dtypes: datetime64[ns](1), float64(2)
memory usage: 137.0 KB


### Comparing the base rate (inverted) with the real-estate index

In [None]:
# Create traces
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['korea_rp'],
                    mode='lines',
                    name='korea_rp',yaxis='y1'))
# reverse the first y-axis so the rate curve is flipped vertically
fig.update_layout(
    yaxis = dict(autorange="reversed")
)


fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['apartment_index'],
                    mode='lines',
                    name='apartment_index',
                    yaxis="y2"))
fig.update_layout(

   # create first Y-axis
   yaxis=dict(
      title="rp point",
      titlefont=dict(color="blue"),
      tickfont=dict(color="blue")
   ),

   # create second Y-axis
   yaxis2=dict(
      title="apartment index",
      overlaying="y",
      side="right")
)

- The base rate (inverted) and the real-estate index appear to be related
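
A visual impression like this can be complemented with a correlation coefficient. The sketch below uses made-up numbers purely for illustration, not the notebook's data; `toy`, `corr`, and `corr_inverse` are hypothetical names.

```python
import pandas as pd

# Made-up numbers purely for illustration; they are not the notebook's data.
toy = pd.DataFrame({
    "korea_rp": [3.25, 3.0, 2.5, 2.0, 1.5, 1.25],
    "apartment_index": [93.0, 92.0, 94.0, 97.0, 101.0, 104.0],
})

# Pearson correlation between the raw rate and the index.
corr = toy["korea_rp"].corr(toy["apartment_index"])

# Same measure for the sign-flipped rate, which plays the role of the inverted series in the chart.
corr_inverse = (-toy["korea_rp"]).corr(toy["apartment_index"])

print(round(corr, 3), round(corr_inverse, 3))
```

Flipping the sign of one series only flips the sign of the coefficient, so a strongly negative value for the raw rate says the same thing as the two curves tracking each other on the reversed axis. On forward-filled daily data the coefficient should be read with care, because consecutive rows repeat the same values.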

## Building the KOSPI index DataFrame

In [None]:
df_kospi = pd.read_csv("/content/drive/MyDrive/house_price/original_data/kospi.csv",  encoding='UTF8') # load the KOSPI index data
df_kospi.head()

Unnamed: 0,날짜,종가,오픈,고가,저가,거래량,변동 %
0,2022- 12- 29,2236.4,2265.73,2272.67,2236.38,361.19M,-1.93%
1,2022- 12- 28,2280.45,2296.45,2296.45,2276.9,405.89M,-2.24%
2,2022- 12- 27,2332.79,2327.52,2335.99,2321.48,448.50M,0.68%
3,2022- 12- 26,2317.14,2312.54,2321.92,2304.2,427.84M,0.15%
4,2022- 12- 23,2313.69,2325.86,2333.08,2311.9,366.99M,-1.83%


In [None]:
df_kospi=df_kospi.sort_index(ascending=False) # the dates are in reverse order, so sort
df_kospi.reset_index(drop=True, inplace=True) # reset the index
df_kospi.head()

Unnamed: 0,날짜,종가,오픈,고가,저가,거래량,변동 %
0,2007- 01- 02,1435.26,1438.89,1439.71,1430.06,147.74M,0.06%
1,2007- 01- 03,1409.35,1436.42,1437.79,1409.31,203.21M,-1.81%
2,2007- 01- 04,1397.29,1410.55,1411.12,1388.5,241.17M,-0.86%
3,2007- 01- 05,1385.76,1398.6,1400.59,1372.36,277.29M,-0.83%
4,2007- 01- 08,1370.81,1376.76,1384.65,1366.48,177.59M,-1.08%


In [None]:
df_kospi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3956 entries, 0 to 3955
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   날짜      3956 non-null   object
 1   종가      3956 non-null   object
 2   오픈      3956 non-null   object
 3   고가      3956 non-null   object
 4   저가      3956 non-null   object
 5   거래량     3956 non-null   object
 6   변동 %    3956 non-null   object
dtypes: object(7)
memory usage: 216.5+ KB


In [None]:
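# select the needed columns, rename them, and convert the date type
df_kospi = df_kospi[['날짜','종가']].copy() # copy() so the assignment below does not write to a slice
df_kospi.columns = ['kospi_date','kospi_index']
df_kospi["kospi_date"] = pd.to_datetime(df_kospi["kospi_date"])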
df_kospi.head()

Unnamed: 0,kospi_date,kospi_index
0,2007-01-02,1435.26
1,2007-01-03,1409.35
2,2007-01-04,1397.29
3,2007-01-05,1385.76
4,2007-01-08,1370.81


In [None]:
df_kospi.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3956 entries, 0 to 3955
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   kospi_date   3956 non-null   datetime64[ns]
 1   kospi_index  3956 non-null   object        
dtypes: datetime64[ns](1), object(1)
memory usage: 61.9+ KB


In [None]:
# convert kospi_index to a numeric type so it can be used in later calculations
df_kospi["kospi_index"] = df_kospi["kospi_index"].str.replace(",", "") # the values are strings, so remove the commas
df_kospi = df_kospi.astype({'kospi_index': 'float64'})# change the column dtype
df_kospi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3956 entries, 0 to 3955
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   kospi_date   3956 non-null   datetime64[ns]
 1   kospi_index  3956 non-null   float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 61.9 KB


In [None]:
df_kospi.head() # check the DataFrame

Unnamed: 0,kospi_date,kospi_index
0,2007-01-02,1435.26
1,2007-01-03,1409.35
2,2007-01-04,1397.29
3,2007-01-05,1385.76
4,2007-01-08,1370.81


## Merging with the KOSPI index DataFrame

In [None]:
# merge the base-rate & real-estate index DataFrame with the KOSPI index DataFrame
df_final=pd.merge(df_final, df_kospi, left_on='date', right_on='kospi_date', how='left') # join the two DataFrames
df_final.head()

Unnamed: 0,date,korea_rp,apartment_index,kospi_date,kospi_index
0,2011-01-01,2.5,93.0,NaT,
1,2011-01-02,2.5,93.0,NaT,
2,2011-01-03,2.5,93.0,2011-01-03,2070.08
3,2011-01-04,2.5,93.0,2011-01-04,2085.14
4,2011-01-05,2.5,93.0,2011-01-05,2082.55


In [None]:
df_final.info() # check -> kospi_date and kospi_index contain nulls because of weekends and other market holidays

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4383 entries, 0 to 4382
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   date             4383 non-null   datetime64[ns]
 1   korea_rp         4383 non-null   float64       
 2   apartment_index  4383 non-null   float64       
 3   kospi_date       2958 non-null   datetime64[ns]
 4   kospi_index      2958 non-null   float64       
dtypes: datetime64[ns](2), float64(3)
memory usage: 205.5 KB


In [None]:
# assume the index keeps its previous value on days when the market is closed
# so fill each null with the preceding value
df_final["kospi_index"]=df_final["kospi_index"].ffill()
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4383 entries, 0 to 4382
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   date             4383 non-null   datetime64[ns]
 1   korea_rp         4383 non-null   float64       
 2   apartment_index  4383 non-null   float64       
 3   kospi_date       2958 non-null   datetime64[ns]
 4   kospi_index      4381 non-null   float64       
dtypes: datetime64[ns](2), float64(3)
memory usage: 205.5 KB


In [None]:
# the nulls at the very top are filled with a value looked up manually (via a Naver search)
df_final["kospi_index"] = df_final["kospi_index"].fillna(2051)
df_final.info() # check that the values were filled in

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4383 entries, 0 to 4382
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   date             4383 non-null   datetime64[ns]
 1   korea_rp         4383 non-null   float64       
 2   apartment_index  4383 non-null   float64       
 3   kospi_date       2958 non-null   datetime64[ns]
 4   kospi_index      4383 non-null   float64       
dtypes: datetime64[ns](2), float64(3)
memory usage: 205.5 KB


In [None]:
df_final.head() # check the DataFrame

Unnamed: 0,date,korea_rp,apartment_index,kospi_date,kospi_index
0,2011-01-01,2.5,93.0,NaT,2051.0
1,2011-01-02,2.5,93.0,NaT,2051.0
2,2011-01-03,2.5,93.0,2011-01-03,2070.08
3,2011-01-04,2.5,93.0,2011-01-04,2085.14
4,2011-01-05,2.5,93.0,2011-01-05,2082.55


In [None]:
# keep only the columns to be used
df_final = df_final[['date','korea_rp','apartment_index','kospi_index']]
df_final.head()

Unnamed: 0,date,korea_rp,apartment_index,kospi_index
0,2011-01-01,2.5,93.0,2051.0
1,2011-01-02,2.5,93.0,2051.0
2,2011-01-03,2.5,93.0,2070.08
3,2011-01-04,2.5,93.0,2085.14
4,2011-01-05,2.5,93.0,2082.55


### Checking the usefulness of the KOSPI index with a chart

In [None]:
# Create traces
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['kospi_index'],
                    mode='lines',
                    name='kospi_index',yaxis='y1'))



fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['apartment_index'],
                    mode='lines',
                    name='apartment_index',
                    yaxis="y2"))
fig.update_layout(

   # create first Y-axis
   yaxis=dict(
      title="kospi index",
      titlefont=dict(color="blue"),
      tickfont=dict(color="blue")
   ),

   # create second Y-axis
   yaxis2=dict(
      title="apartment index",
      overlaying="y",
      side="right")
)

- Is there some degree of correlation between the KOSPI index and the real-estate index? It is hard to tell from the plot alone.
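
Since the plot is inconclusive, one way to put a number on the relationship is a Pearson correlation. The sketch below is only an illustration: `toy` is a hypothetical stand-in for `df_final`, and its values are made up rather than taken from the real data.

```python
import pandas as pd

# Hypothetical stand-in for df_final; the values are made up for illustration.
toy = pd.DataFrame({
    "kospi_index": [2051.0, 2070.08, 2085.14, 2082.55, 2077.61],
    "apartment_index": [93.0, 93.1, 93.3, 93.2, 93.4],
})

# Pearson correlation of the two levels
level_corr = toy["kospi_index"].corr(toy["apartment_index"])

# Correlation of period-over-period changes, which is less inflated by a shared trend
change_corr = toy["kospi_index"].pct_change().corr(toy["apartment_index"].pct_change())

print(round(level_corr, 3), round(change_corr, 3))
```

Two series that both trend upward can show a high level correlation even when their short-term moves are unrelated, so it may be worth looking at both numbers.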

## Building the Korean government bond yield DataFrame

- Almost the same as the process used to build the KOSPI DataFrame
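
The merge-and-fill pattern used for each bond file can be sketched on a tiny example. The frames `calendar` and `quotes` and the column `yield_3y` are hypothetical names, and the values are illustrative only.

```python
import pandas as pd

# Daily calendar, standing in for df_final (hypothetical).
calendar = pd.DataFrame({"date": pd.date_range("2011-01-01", "2011-01-06")})

# Quotes exist only on trading days (hypothetical values).
quotes = pd.DataFrame({
    "quote_date": pd.to_datetime(["2011-01-03", "2011-01-04", "2011-01-06"]),
    "yield_3y": [3.44, 3.495, 3.47],
})

# Left merge keeps every calendar day; days without a quote become NaN.
merged = calendar.merge(quotes, left_on="date", right_on="quote_date", how="left")
merged["yield_3y"] = merged["yield_3y"].ffill()  # non-trading days take the previous quote
merged["yield_3y"] = merged["yield_3y"].bfill()  # the leading gap takes the first available quote
merged = merged.drop(columns=["quote_date"])
print(merged)
```

Forward fill runs first so that only the rows before the very first quote are left for the backward fill.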

In [None]:
import os


dir_path = "/content/drive/MyDrive/house_price/original_data/korean_bond"
file_list = os.listdir(dir_path)
file_list.sort()
name_list = list()
df_list = list()

# Read every CSV file in the folder and collect the DataFrames in a list
for csv_file in file_list:
    df_list.append(pd.read_csv(dir_path+"/"+csv_file , encoding='UTF8'))
    name_list.append(csv_file.split('.')[0])
for i in range(len(df_list)):
    df_korea = df_list[i] # one file per bond maturity
    df_korea=df_korea.sort_index(ascending=False) # rows are in reverse date order, so flip them
    df_korea.reset_index(drop=True, inplace=True) # reset the index
    df_korea = df_korea[['날짜','종가']].copy() # date and closing value; copy to avoid SettingWithCopyWarning
    df_korea.columns = ['korea_date',name_list[i]]
    df_korea['korea_date'] = pd.to_datetime(df_korea['korea_date'])
    df_final=pd.merge(df_final, df_korea, left_on='date', right_on='korea_date', how='left')
    df_final[name_list[i]]=df_final[name_list[i]].ffill() # fill weekends and holidays with the previous value
    df_final[name_list[i]]=df_final[name_list[i]].bfill() # fill the leading rows with the nearest later value
    df_final = df_final.drop(['korea_date'], axis=1)

In [None]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4383 entries, 0 to 4382
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   date             4383 non-null   datetime64[ns]
 1   korea_rp         4383 non-null   float64       
 2   apartment_index  4383 non-null   float64       
 3   kospi_index      4383 non-null   float64       
 4   korea_10_year    4383 non-null   float64       
 5   korea_1_year     4383 non-null   float64       
 6   korea_20_year    4383 non-null   float64       
 7   korea_2_year     4383 non-null   float64       
 8   korea_3_year     4383 non-null   float64       
 9   korea_4_year     4383 non-null   float64       
 10  korea_5_year     4383 non-null   float64       
dtypes: datetime64[ns](1), float64(10)
memory usage: 410.9 KB


In [None]:
# Reorder the columns
df_final = df_final[['date', 'apartment_index','kospi_index','korea_rp',
                    'korea_1_year','korea_2_year','korea_3_year','korea_4_year','korea_5_year',
                    'korea_10_year','korea_20_year']]
df_final.head()

Unnamed: 0,date,apartment_index,kospi_index,korea_rp,korea_1_year,korea_2_year,korea_3_year,korea_4_year,korea_5_year,korea_10_year,korea_20_year
0,2011-01-01,93.0,2051.0,2.5,2.81,3.4,3.44,4.09,4.14,4.57,4.73
1,2011-01-02,93.0,2051.0,2.5,2.81,3.4,3.44,4.09,4.14,4.57,4.73
2,2011-01-03,93.0,2070.08,2.5,2.81,3.4,3.44,4.09,4.14,4.57,4.73
3,2011-01-04,93.0,2085.14,2.5,2.83,3.37,3.495,4.16,4.2,4.58,4.74
4,2011-01-05,93.0,2082.55,2.5,2.8,3.42,3.495,4.15,4.17,4.63,4.75


In [None]:
# Create year, month, and day columns
df_final['year'] = df_final['date'].dt.year
df_final['month'] = df_final['date'].dt.month
df_final['day'] = df_final['date'].dt.day
df_final.head()

Unnamed: 0,date,apartment_index,kospi_index,korea_rp,korea_1_year,korea_2_year,korea_3_year,korea_4_year,korea_5_year,korea_10_year,korea_20_year,year,month,day
0,2011-01-01,93.0,2051.0,2.5,2.81,3.4,3.44,4.09,4.14,4.57,4.73,2011,1,1
1,2011-01-02,93.0,2051.0,2.5,2.81,3.4,3.44,4.09,4.14,4.57,4.73,2011,1,2
2,2011-01-03,93.0,2070.08,2.5,2.81,3.4,3.44,4.09,4.14,4.57,4.73,2011,1,3
3,2011-01-04,93.0,2085.14,2.5,2.83,3.37,3.495,4.16,4.2,4.58,4.74,2011,1,4
4,2011-01-05,93.0,2082.55,2.5,2.8,3.42,3.495,4.15,4.17,4.63,4.75,2011,1,5


### Visualizing the real-estate index and Korean government bond yields

In [None]:
fig = go.Figure()
# Add one line per rate series, all on the first y-axis
rate_columns = ['korea_rp', 'korea_1_year', 'korea_2_year', 'korea_3_year',
                'korea_4_year', 'korea_5_year', 'korea_10_year', 'korea_20_year']
for column in rate_columns:
    fig.add_trace(go.Scatter(x=df_final['date'], y=df_final[column],
                        mode='lines',
                        name=column, yaxis='y1'))

# Invert the y-axis for the traces added above
fig.update_layout(
    yaxis = dict(autorange="reversed")
)
fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['apartment_index'],
                    mode='lines',
                    name='apartment_index',
                    yaxis="y2"))
fig.update_layout(

   # create first Y-axis
   yaxis=dict(
      title="rate index",
      titlefont=dict(color="black"),
      tickfont=dict(color="black")
   ),

   # create second Y-axis
   yaxis2=dict(
      title="apartment index",
      overlaying="y",
      side="right")
)

- The (inverted) Korean government bond yields appear to be related to the real-estate index

In [None]:
# The yields move roughly together, so keep only the 3-year and 10-year bonds
df_final = df_final[['date','year','month','day','apartment_index','kospi_index','korea_rp',
                    'korea_3_year','korea_10_year']]
df_final.head()

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year
0,2011-01-01,2011,1,1,93.0,2051.0,2.5,3.44,4.57
1,2011-01-02,2011,1,2,93.0,2051.0,2.5,3.44,4.57
2,2011-01-03,2011,1,3,93.0,2070.08,2.5,3.44,4.57
3,2011-01-04,2011,1,4,93.0,2085.14,2.5,3.495,4.58
4,2011-01-05,2011,1,5,93.0,2082.55,2.5,3.495,4.63


### Adding the Korean long/short-term yield spread column

In [None]:
# For now, compute the yield spread from the 10-year and 3-year yields only
df_final['korea_10-3_year'] = df_final['korea_10_year']- df_final['korea_3_year']
df_final.head()

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,korea_10-3_year
0,2011-01-01,2011,1,1,93.0,2051.0,2.5,3.44,4.57,1.13
1,2011-01-02,2011,1,2,93.0,2051.0,2.5,3.44,4.57,1.13
2,2011-01-03,2011,1,3,93.0,2070.08,2.5,3.44,4.57,1.13
3,2011-01-04,2011,1,4,93.0,2085.14,2.5,3.495,4.58,1.085
4,2011-01-05,2011,1,5,93.0,2082.55,2.5,3.495,4.63,1.135


In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['korea_10-3_year'],
                    mode='lines',
                    name='korea_10-3_year',yaxis='y1'))



fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['apartment_index'],
                    mode='lines',
                    name='apartment_index',
                    yaxis="y2"))
fig.update_layout(

   # create first Y-axis
   yaxis=dict(
      title="korea_10-3_year",
      titlefont=dict(color="black"),
      tickfont=dict(color="black")
   ),

   # create second Y-axis
   yaxis2=dict(
      title="apartment index",
      overlaying="y",
      side="right")
)

- Is the Korean long/short-term yield spread related to the real-estate index? Not clear from the plot.

## Building the US Treasury yield DataFrame

- Almost identical to the process used to build the Korean government bond yield DataFrame

In [None]:
# Reset the variables
dir_path = "/content/drive/MyDrive/house_price/original_data/us_bond"
file_list = os.listdir(dir_path)
file_list.sort()
name_list = list()
df_list = list()

# Read every CSV file in the folder and collect the DataFrames in a list
for csv_file in file_list:
    df_list.append(pd.read_csv(dir_path+"/"+csv_file , encoding='UTF8'))
    name_list.append(csv_file.split('.')[0])
for i in range(len(df_list)):
    df_us = df_list[i] # one file per Treasury maturity
    df_us=df_us.sort_index(ascending=False) # rows are in reverse date order, so flip them
    df_us.reset_index(drop=True, inplace=True) # reset the index
    df_us = df_us[['날짜','종가']].copy() # date and closing value; copy to avoid SettingWithCopyWarning
    df_us.columns = ['us_date',name_list[i]]
    df_us['us_date'] = pd.to_datetime(df_us['us_date'])
    df_final=pd.merge(df_final, df_us, left_on='date', right_on='us_date', how='left')
    df_final[name_list[i]]=df_final[name_list[i]].ffill() # fill weekends and holidays with the previous value
    df_final[name_list[i]]=df_final[name_list[i]].bfill() # fill the leading rows with the nearest later value
    df_final = df_final.drop(['us_date'], axis=1)

In [None]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4383 entries, 0 to 4382
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   date             4383 non-null   datetime64[ns]
 1   year             4383 non-null   int64         
 2   month            4383 non-null   int64         
 3   day              4383 non-null   int64         
 4   apartment_index  4383 non-null   float64       
 5   kospi_index      4383 non-null   float64       
 6   korea_rp         4383 non-null   float64       
 7   korea_3_year     4383 non-null   float64       
 8   korea_10_year    4383 non-null   float64       
 9   korea_10-3_year  4383 non-null   float64       
 10  us_10_year       4383 non-null   float64       
 11  us_1_month       4383 non-null   float64       
 12  us_2_year        4383 non-null   float64       
 13  us_30_year       4383 non-null   float64       
 14  us_3_month       4383 non-null   float64

In [None]:
df_final = df_final[['date','year','month','day','apartment_index','kospi_index','korea_rp',
                    'korea_3_year','korea_10_year','korea_10-3_year','us_1_month','us_3_month',
                    'us_6_month','us_2_year', 'us_3_year', 'us_5_year',
                    'us_10_year','us_30_year']]

In [None]:
df_final.head()

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,korea_10-3_year,us_1_month,us_3_month,us_6_month,us_2_year,us_3_year,us_5_year,us_10_year,us_30_year
0,2011-01-01,2011,1,1,93.0,2051.0,2.5,3.44,4.57,1.13,0.096,0.124,0.183,0.601,1.006,2.011,3.334,4.401
1,2011-01-02,2011,1,2,93.0,2051.0,2.5,3.44,4.57,1.13,0.096,0.124,0.183,0.601,1.006,2.011,3.334,4.401
2,2011-01-03,2011,1,3,93.0,2070.08,2.5,3.44,4.57,1.13,0.096,0.124,0.183,0.601,1.006,2.011,3.334,4.401
3,2011-01-04,2011,1,4,93.0,2085.14,2.5,3.495,4.58,1.085,0.106,0.142,0.187,0.621,1.026,2.016,3.338,4.422
4,2011-01-05,2011,1,5,93.0,2082.55,2.5,3.495,4.63,1.135,0.129,0.142,0.184,0.708,1.129,2.133,3.463,4.541


### Comparing US Treasury yields with the real-estate index

In [None]:
fig = go.Figure()
# Add one line per Treasury yield series, all on the first y-axis
us_rate_columns = ['us_1_month', 'us_3_month', 'us_6_month', 'us_2_year',
                   'us_3_year', 'us_5_year', 'us_10_year', 'us_30_year']
for column in us_rate_columns:
    fig.add_trace(go.Scatter(x=df_final['date'], y=df_final[column],
                        mode='lines',
                        name=column, yaxis='y1'))

# Invert the y-axis for the traces added above
fig.update_layout(
    yaxis = dict(autorange="reversed")
)
fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['apartment_index'],
                    mode='lines',
                    name='apartment_index',
                    yaxis="y2"))
fig.update_layout(

   # create first Y-axis
   yaxis=dict(
      title="rate index",
      titlefont=dict(color="black"),
      tickfont=dict(color="black")
   ),

   # create second Y-axis
   yaxis2=dict(
      title="apartment index",
      overlaying="y",
      side="right")
)

- The (inverted) US Treasury yields seem somewhat more related to the real-estate index than the (inverted) Korean government bond yields?

In [None]:
# The yields move roughly together, so keep only the 3-month, 2-year, and 10-year Treasuries
df_final = df_final[['date','year','month','day','apartment_index','kospi_index','korea_rp',
                    'korea_3_year','korea_10_year','korea_10-3_year','us_3_month', 'us_2_year', 'us_10_year']]
df_final.head()

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,korea_10-3_year,us_3_month,us_2_year,us_10_year
0,2011-01-01,2011,1,1,93.0,2051.0,2.5,3.44,4.57,1.13,0.124,0.601,3.334
1,2011-01-02,2011,1,2,93.0,2051.0,2.5,3.44,4.57,1.13,0.124,0.601,3.334
2,2011-01-03,2011,1,3,93.0,2070.08,2.5,3.44,4.57,1.13,0.124,0.601,3.334
3,2011-01-04,2011,1,4,93.0,2085.14,2.5,3.495,4.58,1.085,0.142,0.621,3.338
4,2011-01-05,2011,1,5,93.0,2082.55,2.5,3.495,4.63,1.135,0.142,0.708,3.463


### Adding the US long/short-term yield spread columns

In [None]:
df_final['us_10-2_year'] = df_final['us_10_year'] - df_final['us_2_year']
df_final['us_10-3_year_month'] = df_final['us_10_year'] -df_final['us_3_month']
df_final.head()

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,korea_10-3_year,us_3_month,us_2_year,us_10_year,us_10-2_year,us_10-3_year_month
0,2011-01-01,2011,1,1,93.0,2051.0,2.5,3.44,4.57,1.13,0.124,0.601,3.334,2.733,3.21
1,2011-01-02,2011,1,2,93.0,2051.0,2.5,3.44,4.57,1.13,0.124,0.601,3.334,2.733,3.21
2,2011-01-03,2011,1,3,93.0,2070.08,2.5,3.44,4.57,1.13,0.124,0.601,3.334,2.733,3.21
3,2011-01-04,2011,1,4,93.0,2085.14,2.5,3.495,4.58,1.085,0.142,0.621,3.338,2.717,3.196
4,2011-01-05,2011,1,5,93.0,2082.55,2.5,3.495,4.63,1.135,0.142,0.708,3.463,2.755,3.321


In [None]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4383 entries, 0 to 4382
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   date                4383 non-null   datetime64[ns]
 1   year                4383 non-null   int64         
 2   month               4383 non-null   int64         
 3   day                 4383 non-null   int64         
 4   apartment_index     4383 non-null   float64       
 5   kospi_index         4383 non-null   float64       
 6   korea_rp            4383 non-null   float64       
 7   korea_3_year        4383 non-null   float64       
 8   korea_10_year       4383 non-null   float64       
 9   korea_10-3_year     4383 non-null   float64       
 10  us_3_month          4383 non-null   float64       
 11  us_2_year           4383 non-null   float64       
 12  us_10_year          4383 non-null   float64       
 13  us_10-2_year        4383 non-null   float64     

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['us_10-2_year'],
                    mode='lines',
                    name='us_10-2_year',yaxis='y1'))

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['us_10-3_year_month'],
                    mode='lines',
                    name='us_10-3_year_month',yaxis='y1'))



# Invert the y-axis for the traces added above
fig.update_layout(
    yaxis = dict(autorange="reversed")
)

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['apartment_index'],
                    mode='lines',
                    name='apartment_index',
                    yaxis="y2"))
fig.update_layout(

   # create first Y-axis
   yaxis=dict(
      title="rate gap",
      titlefont=dict(color="black"),
      tickfont=dict(color="black")
   ),

   # create second Y-axis
   yaxis2=dict(
      title="apartment index",
      overlaying="y",
      side="right")
)

- Is the (inverted) US long/short-term yield spread correlated to some degree with the real-estate index?

## Building the apartment supply DataFrame

- Apartment supply figures were obtained from https://asil.kr/asil/sub/movein.jsp

In [None]:
# Load the txt file
df_apartment_supply = pd.read_csv("/content/drive/MyDrive/house_price/original_data/apartment_supply.txt",  encoding='UTF8',sep="\t")
df_apartment_supply.head()

Unnamed: 0,위치,단지명,입주년월,총세대수
0,서울 서대문구 홍은동,e편한세상홍제가든플라츠,2022년 12월,481세대
1,서울 서초구 잠원동,르엘신반포,2022년 12월,280세대
2,서울 마포구 아현동,마포더클래시,2022년 12월,"1,419세대"
3,서울 중랑구 면목동,"봄작시티201(민간임대,도시형)",2022년 12월,128세대
4,서울 서대문구 홍은동,힐스테이트홍은포레스트,2022년 11월,623세대


In [None]:
df_apartment_supply.info() # check the DataFrame info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1003 entries, 0 to 1002
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   위치      1003 non-null   object
 1   단지명     1003 non-null   object
 2   입주년월    1003 non-null   object
 3   총세대수    1003 non-null   object
dtypes: object(4)
memory usage: 31.5+ KB


In [None]:
# Create year and month columns
# by splitting the move-in date string on ' '
df_apartment_supply['year'] =df_apartment_supply['입주년월'].str.split(' ',expand=True)[0]
df_apartment_supply['month'] =df_apartment_supply['입주년월'].str.split(' ',expand=True)[1]

In [None]:
df_apartment_supply.head()

Unnamed: 0,위치,단지명,입주년월,총세대수,year,month
0,서울 서대문구 홍은동,e편한세상홍제가든플라츠,2022년 12월,481세대,2022년,12월
1,서울 서초구 잠원동,르엘신반포,2022년 12월,280세대,2022년,12월
2,서울 마포구 아현동,마포더클래시,2022년 12월,"1,419세대",2022년,12월
3,서울 중랑구 면목동,"봄작시티201(민간임대,도시형)",2022년 12월,128세대,2022년,12월
4,서울 서대문구 홍은동,힐스테이트홍은포레스트,2022년 11월,623세대,2022년,11월


In [None]:
df_apartment_supply.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1003 entries, 0 to 1002
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   위치      1003 non-null   object
 1   단지명     1003 non-null   object
 2   입주년월    1003 non-null   object
 3   총세대수    1003 non-null   object
 4   year    1003 non-null   object
 5   month   1003 non-null   object
dtypes: object(6)
memory usage: 47.1+ KB


In [None]:
# Strip specific characters from the strings
# so the values are easier to compute with later
df_apartment_supply["year"] = df_apartment_supply["year"].str.replace("년", "")
df_apartment_supply["month"] = df_apartment_supply["month"].str.replace("월", "")
df_apartment_supply["apartment_supply"] = df_apartment_supply["총세대수"].str.replace("세대", "")
df_apartment_supply["apartment_supply"] = df_apartment_supply["apartment_supply"].str.replace(",", "")
df_apartment_supply.head()

Unnamed: 0,위치,단지명,입주년월,총세대수,year,month,apartment_supply
0,서울 서대문구 홍은동,e편한세상홍제가든플라츠,2022년 12월,481세대,2022,12,481
1,서울 서초구 잠원동,르엘신반포,2022년 12월,280세대,2022,12,280
2,서울 마포구 아현동,마포더클래시,2022년 12월,"1,419세대",2022,12,1419
3,서울 중랑구 면목동,"봄작시티201(민간임대,도시형)",2022년 12월,128세대,2022,12,128
4,서울 서대문구 홍은동,힐스테이트홍은포레스트,2022년 11월,623세대,2022,11,623
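
As a possible alternative to chaining several `str.replace` calls, the household counts could be cleaned in one step by dropping every non-digit character. This is a minimal sketch on a hypothetical `raw` Series that reuses three of the strings shown above; it is not the notebook's own method.

```python
import pandas as pd

# Hypothetical Series reusing a few household-count strings from the output above.
raw = pd.Series(["481세대", "1,419세대", "128세대"])

# Remove everything that is not a digit (unit suffix and thousands separator),
# then convert the remaining text to integers.
households = raw.str.replace(r"[^0-9]", "", regex=True).astype("int64")
print(households.tolist())
```

Passing `regex=True` explicitly avoids relying on the default, which differs between pandas versions.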


In [None]:
df_apartment_supply.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1003 entries, 0 to 1002
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   위치                1003 non-null   object
 1   단지명               1003 non-null   object
 2   입주년월              1003 non-null   object
 3   총세대수              1003 non-null   object
 4   year              1003 non-null   object
 5   month             1003 non-null   object
 6   apartment_supply  1003 non-null   object
dtypes: object(7)
memory usage: 55.0+ KB


In [None]:
# Create the date column
df_apartment_supply['date'] = pd.to_datetime(df_apartment_supply['year']+'-'+df_apartment_supply['month'], format="%Y-%m")

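- Assume that each month's figures are published the following month (for example, the January 2011 figures cannot be known during January 2011; they only become available in February, once January's results have been compiled)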
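
The one-month publication lag can also be expressed as a calendar-month shift rather than a fixed number of days. The sketch below assumes the dates are month starts, as they are after parsing with `format="%Y-%m"`; the names `dates` and `announced` are hypothetical.

```python
import pandas as pd

# Hypothetical month-start dates, parsed the same way as the 'date' column.
dates = pd.to_datetime(pd.Series(["2011-01", "2022-11", "2022-12"]), format="%Y-%m")

# Shift by exactly one calendar month; a December date rolls over into the next year.
announced = dates + pd.DateOffset(months=1)

print(announced.dt.year.tolist(), announced.dt.month.tolist())
```

Adding 32 days to a month-start date also always lands in the following month, so both approaches yield the same announcement year and month here.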

In [None]:
# Assume the figure is published the following month
df_apartment_supply['date_column'] = df_apartment_supply['date'] + datetime.timedelta(days=32)
df_apartment_supply['announcement_year'] = df_apartment_supply['date_column'].dt.year
df_apartment_supply['announcement_month'] = df_apartment_supply['date_column'].dt.month

In [None]:
df_apartment_supply.head()

Unnamed: 0,위치,단지명,입주년월,총세대수,year,month,apartment_supply,date,date_column,announcement_year,announcement_month
0,서울 서대문구 홍은동,e편한세상홍제가든플라츠,2022년 12월,481세대,2022,12,481,2022-12-01,2023-01-02,2023,1
1,서울 서초구 잠원동,르엘신반포,2022년 12월,280세대,2022,12,280,2022-12-01,2023-01-02,2023,1
2,서울 마포구 아현동,마포더클래시,2022년 12월,"1,419세대",2022,12,1419,2022-12-01,2023-01-02,2023,1
3,서울 중랑구 면목동,"봄작시티201(민간임대,도시형)",2022년 12월,128세대,2022,12,128,2022-12-01,2023-01-02,2023,1
4,서울 서대문구 홍은동,힐스테이트홍은포레스트,2022년 11월,623세대,2022,11,623,2022-11-01,2022-12-03,2022,12


In [None]:
df_apartment_supply.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1003 entries, 0 to 1002
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   위치                  1003 non-null   object        
 1   단지명                 1003 non-null   object        
 2   입주년월                1003 non-null   object        
 3   총세대수                1003 non-null   object        
 4   year                1003 non-null   object        
 5   month               1003 non-null   object        
 6   apartment_supply    1003 non-null   object        
 7   date                1003 non-null   datetime64[ns]
 8   date_column         1003 non-null   datetime64[ns]
 9   announcement_year   1003 non-null   int64         
 10  announcement_month  1003 non-null   int64         
dtypes: datetime64[ns](2), int64(2), object(7)
memory usage: 86.3+ KB


In [None]:
# Keep only the columns to be used, then convert the dtype
df_apartment_supply = df_apartment_supply[['announcement_year','announcement_month','apartment_supply']]
df_apartment_supply = df_apartment_supply.astype({'apartment_supply': 'int64'})
df_apartment_supply.head()

Unnamed: 0,announcement_year,announcement_month,apartment_supply
0,2023,1,481
1,2023,1,280
2,2023,1,1419
3,2023,1,128
4,2022,12,623


In [None]:
df_apartment_supply.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1003 entries, 0 to 1002
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   announcement_year   1003 non-null   int64
 1   announcement_month  1003 non-null   int64
 2   apartment_supply    1003 non-null   int64
dtypes: int64(3)
memory usage: 23.6 KB


In [None]:
# Sum the supply per year and month with groupby, then turn the group keys back into columns with reset_index
df_apartment_supply=df_apartment_supply.groupby(['announcement_year','announcement_month'])['apartment_supply'].agg('sum')
df_apartment_supply = df_apartment_supply.reset_index(['announcement_year','announcement_month'])
df_apartment_supply.head()

Unnamed: 0,announcement_year,announcement_month,apartment_supply
0,2011,2,5342
1,2011,3,3494
2,2011,4,1511
3,2011,5,709
4,2011,6,1507
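
The aggregation step can be reproduced on the five rows shown earlier. This sketch uses `as_index=False` as an alternative to calling `reset_index` afterwards; `toy` and `monthly` are hypothetical names, and the December 2022 total here is partial because only one of that month's rows is included.

```python
import pandas as pd

# The five rows from the head() output above (announcement year/month and supply).
toy = pd.DataFrame({
    "announcement_year": [2023, 2023, 2023, 2023, 2022],
    "announcement_month": [1, 1, 1, 1, 12],
    "apartment_supply": [481, 280, 1419, 128, 623],
})

# as_index=False keeps the group keys as ordinary columns.
monthly = toy.groupby(
    ["announcement_year", "announcement_month"], as_index=False
)["apartment_supply"].sum()
print(monthly)
```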


## Building the unsold-apartment DataFrame

- Unsold-apartment data was obtained from https://data.kbland.kr/publicdata/unsold-apartments

In [None]:
df_apartment_unsold = pd.read_excel("/content/drive/MyDrive/house_price/original_data/unsold/서울 미분양 현황.xlsx")
df_apartment_unsold.index = df_apartment_unsold['구분']
df_apartment_unsold=df_apartment_unsold.drop('구분',axis=1)
df_apartment_unsold.head()

Unnamed: 0_level_0,'07.01,'07.02,'07.03,'07.04,'07.05,'07.06,'07.07,'07.08,'07.09,'07.10,...,'22.02,'22.03,'22.04,'22.05,'22.06,'22.07,'22.08,'22.09,'22.10,'22.11
구분,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
미분양,697,590.0,687.0,685.0,704.0,778.0,840.0,730.0,724.0,977.0,...,47,180.0,360,688.0,719.0,592.0,610.0,719.0,866.0,865.0
변동률,-,-15.35,16.44,-0.29,2.77,10.51,7.97,-13.1,-0.82,34.94,...,0,282.98,100,91.11,4.51,-17.66,3.04,17.87,20.45,-0.12


In [None]:
# Swap rows and columns with the .T attribute
df_apartment_unsold=df_apartment_unsold.T
df_apartment_unsold.head()

구분,미분양,변동률
'07.01,697.0,-
'07.02,590.0,-15.35
'07.03,687.0,16.44
'07.04,685.0,-0.29
'07.05,704.0,2.77


In [None]:
df_apartment_unsold.info()

<class 'pandas.core.frame.DataFrame'>
Index: 191 entries, '07.01 to '22.11
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   미분양     191 non-null    object
 1   변동률     191 non-null    object
dtypes: object(2)
memory usage: 8.5+ KB


In [None]:
# The index holds the date information, so use reset_index to turn it into a column
df_apartment_unsold = df_apartment_unsold.reset_index()
df_apartment_unsold.head()

구분,index,미분양,변동률
0,'07.01,697.0,-
1,'07.02,590.0,-15.35
2,'07.03,687.0,16.44
3,'07.04,685.0,-0.29
4,'07.05,704.0,2.77


In [None]:
# Rename the columns
df_apartment_unsold.columns=['year_month','unsold_count','ratio']
df_apartment_unsold.head()

Unnamed: 0,year_month,unsold_count,ratio
0,'07.01,697.0,-
1,'07.02,590.0,-15.35
2,'07.03,687.0,16.44
3,'07.04,685.0,-0.29
4,'07.05,704.0,2.77


In [None]:
# Remove the ' character from the year_month column
df_apartment_unsold["year_month"] = df_apartment_unsold["year_month"].str.replace("'", "")
df_apartment_unsold.head()

Unnamed: 0,year_month,unsold_count,ratio
0,07.01,697.0,-
1,07.02,590.0,-15.35
2,07.03,687.0,16.44
3,07.04,685.0,-0.29
4,07.05,704.0,2.77


In [None]:
# Create year and month columns
df_apartment_unsold['year'] =df_apartment_unsold["year_month"].str.split('.',expand=True)[0]
df_apartment_unsold['month'] =df_apartment_unsold["year_month"].str.split('.',expand=True)[1]
df_apartment_unsold.head()

Unnamed: 0,year_month,unsold_count,ratio,year,month
0,07.01,697.0,-,07,01
1,07.02,590.0,-15.35,07,02
2,07.03,687.0,16.44,07,03
3,07.04,685.0,-0.29,07,04
4,07.05,704.0,2.77,07,05


In [None]:
# Fix the year column and select the columns to be used
df_apartment_unsold['year'] = '20'+df_apartment_unsold['year']
df_apartment_unsold = df_apartment_unsold[['year','month','unsold_count']]
df_apartment_unsold.head()

Unnamed: 0,year,month,unsold_count
0,2007,01,697.0
1,2007,02,590.0
2,2007,03,687.0
3,2007,04,685.0
4,2007,05,704.0
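
The quote-stripping, splitting, and `'20'` prefixing steps could probably be collapsed into a single parse, since the column labels follow a fixed `'YY.MM` pattern. A minimal sketch, assuming every label has that exact shape; `labels` is a hypothetical Series reusing labels from the source sheet.

```python
import pandas as pd

# Hypothetical Series with labels in the same form as the spreadsheet headers.
labels = pd.Series(["'07.01", "'07.02", "'22.11"])

# The leading apostrophe and the dot are matched literally;
# %y reads a two-digit year and %m a two-digit month.
dates = pd.to_datetime(labels, format="'%y.%m")
print(dates.dt.year.tolist(), dates.dt.month.tolist())
```

Note that `%y` maps two-digit years 69 to 99 to the 1900s, which does not matter for data starting in 2007.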


In [None]:
df_apartment_unsold.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 191 entries, 0 to 190
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   year          191 non-null    object
 1   month         191 non-null    object
 2   unsold_count  191 non-null    object
dtypes: object(3)
memory usage: 4.6+ KB


In [None]:
# Assume the unsold figures only become known one month later
df_apartment_unsold['date'] = pd.to_datetime(df_apartment_unsold['year']+'-'+df_apartment_unsold['month'], format="%Y-%m")
df_apartment_unsold['date_column'] = df_apartment_unsold['date'] + datetime.timedelta(days=32)
df_apartment_unsold['announcement_year'] = df_apartment_unsold['date_column'].dt.year
df_apartment_unsold['announcement_month'] = df_apartment_unsold['date_column'].dt.month
df_apartment_unsold = df_apartment_unsold[['announcement_year','announcement_month','unsold_count']]
df_apartment_unsold = df_apartment_unsold.astype({'unsold_count': 'int64'})
df_apartment_unsold.head()

Unnamed: 0,announcement_year,announcement_month,unsold_count
0,2007,2,697
1,2007,3,590
2,2007,4,687
3,2007,5,685
4,2007,6,704


In [None]:
# Keep only the range of years to be used
df_apartment_unsold=df_apartment_unsold[df_apartment_unsold['announcement_year']>=2011]
df_apartment_unsold.head()

Unnamed: 0,announcement_year,announcement_month,unsold_count
47,2011,1,2729
48,2011,2,2269
49,2011,3,2216
50,2011,4,2104
51,2011,5,1855


In [None]:
df_apartment_unsold.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 144 entries, 47 to 190
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   announcement_year   144 non-null    int64
 1   announcement_month  144 non-null    int64
 2   unsold_count        144 non-null    int64
dtypes: int64(3)
memory usage: 4.5 KB


## Merging the apartment supply and unsold DataFrames

In [None]:
df_apartment_supply.tail()

Unnamed: 0,announcement_year,announcement_month,apartment_supply
139,2022,9,1853
140,2022,10,1552
141,2022,11,1265
142,2022,12,1759
143,2023,1,2308


In [None]:
df_apartment_supply.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144 entries, 0 to 143
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   announcement_year   144 non-null    int64
 1   announcement_month  144 non-null    int64
 2   apartment_supply    144 non-null    int64
dtypes: int64(3)
memory usage: 3.5 KB


In [None]:
df_apartment_unsold.tail()

Unnamed: 0,announcement_year,announcement_month,unsold_count
186,2022,8,592
187,2022,9,610
188,2022,10,719
189,2022,11,866
190,2022,12,865


In [None]:
df_apartment_unsold.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 144 entries, 47 to 190
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   announcement_year   144 non-null    int64
 1   announcement_month  144 non-null    int64
 2   unsold_count        144 non-null    int64
dtypes: int64(3)
memory usage: 4.5 KB


In [None]:
# The unsold data has no row for January 2023, so add one by hand, reusing the latest value (865)
df_apartment_unsold.loc[191] = [2023,1,865]
df_apartment_unsold.tail()

Unnamed: 0,announcement_year,announcement_month,unsold_count
187,2022,9,610
188,2022,10,719
189,2022,11,866
190,2022,12,865
191,2023,1,865


In [None]:
df_apartment_unsold.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 145 entries, 47 to 191
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   announcement_year   145 non-null    int64
 1   announcement_month  145 non-null    int64
 2   unsold_count        145 non-null    int64
dtypes: int64(3)
memory usage: 4.5 KB


In [None]:
# Merge the DataFrames
df_apartment_supply_unsold=pd.merge(df_apartment_supply, df_apartment_unsold, on=['announcement_year','announcement_month'], how='left')
df_apartment_supply_unsold.tail()

Unnamed: 0,announcement_year,announcement_month,apartment_supply,unsold_count
139,2022,9,1853,610
140,2022,10,1552,719
141,2022,11,1265,866
142,2022,12,1759,865
143,2023,1,2308,865
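
Before patching rows by hand, a left merge with `indicator=True` can show which months have no match on the right side. The sketch below rebuilds the last few months without the manually added January 2023 row; `supply`, `unsold`, `checked`, and `missing` are hypothetical names.

```python
import pandas as pd

keys = ["announcement_year", "announcement_month"]

# Last months of each table, taken from the tail() outputs above,
# but without the hand-added 2023-01 unsold row.
supply = pd.DataFrame({
    "announcement_year": [2022, 2022, 2023],
    "announcement_month": [11, 12, 1],
    "apartment_supply": [1265, 1759, 2308],
})
unsold = pd.DataFrame({
    "announcement_year": [2022, 2022],
    "announcement_month": [11, 12],
    "unsold_count": [866, 865],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left merge.
checked = supply.merge(unsold, on=keys, how="left", indicator=True)
missing = checked[checked["_merge"] == "left_only"]
print(missing[keys])
```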


In [None]:
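# Assume the January 2011 values match the February values and insert them
# (note: the supply entered below is 5345, while the February figure shown above is 5342)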
df_apartment_supply_unsold.loc[-1] = [2011,1,5345,2269]  # adding a row
df_apartment_supply_unsold.index = df_apartment_supply_unsold.index + 1  # shifting index
df_apartment_supply_unsold.sort_index(inplace=True) 

In [None]:
df_apartment_supply_unsold.head()

Unnamed: 0,announcement_year,announcement_month,apartment_supply,unsold_count
0,2011,1,5345,2269
1,2011,2,5342,2269
2,2011,3,3494,2216
3,2011,4,1511,2104
4,2011,5,709,1855


In [None]:
df_apartment_supply_unsold.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 145 entries, 0 to 144
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   announcement_year   145 non-null    int64
 1   announcement_month  145 non-null    int64
 2   apartment_supply    145 non-null    int64
 3   unsold_count        145 non-null    int64
dtypes: int64(4)
memory usage: 5.7 KB


### Adding the unsold ratio column

In [None]:
# Compute the unsold ratio: unsold count as a percentage of the month's supply
df_apartment_supply_unsold['unsold_ratio'] = 100*(df_apartment_supply_unsold['unsold_count'] / df_apartment_supply_unsold['apartment_supply'])
df_apartment_supply_unsold.head()

Unnamed: 0,announcement_year,announcement_month,apartment_supply,unsold_count,unsold_ratio
0,2011,1,5345,2269,42.450889
1,2011,2,5342,2269,42.474729
2,2011,3,3494,2216,63.423011
3,2011,4,1511,2104,139.245533
4,2011,5,709,1855,261.636107
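
The ratio is the unsold count divided by that month's supply, times 100. A small check against two rows of the output above; the helper `unsold_ratio` is a hypothetical name introduced only for this illustration.

```python
def unsold_ratio(unsold_count, apartment_supply):
    """Unsold count as a percentage of the month's supply."""
    return 100 * unsold_count / apartment_supply

# Rows 0 and 4 of the table above
first = unsold_ratio(2269, 5345)
fifth = unsold_ratio(1855, 709)
print(round(first, 6), round(fifth, 6))
```

The value can exceed 100 because the unsold count is a running stock while the supply is a single month's figure, as in the May 2011 row.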


In [None]:
df_apartment_supply_unsold.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 145 entries, 0 to 144
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   announcement_year   145 non-null    int64  
 1   announcement_month  145 non-null    int64  
 2   apartment_supply    145 non-null    int64  
 3   unsold_count        145 non-null    int64  
 4   unsold_ratio        145 non-null    float64
dtypes: float64(1), int64(4)
memory usage: 6.8 KB


## Merge into the final table

In [None]:
# Merge the monthly data into the daily table on year and month
df_final=pd.merge(df_final, df_apartment_supply_unsold, left_on=['year','month'], right_on=['announcement_year','announcement_month'], how='left')
df_final = df_final.drop(["announcement_year", "announcement_month"], axis=1)
df_final.head()

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,korea_10-3_year,us_3_month,us_2_year,us_10_year,us_10-2_year,us_10-3_year_month,apartment_supply,unsold_count,unsold_ratio
0,2011-01-01,2011,1,1,93.0,2051.0,2.5,3.44,4.57,1.13,0.124,0.601,3.334,2.733,3.21,5345,2269,42.450889
1,2011-01-02,2011,1,2,93.0,2051.0,2.5,3.44,4.57,1.13,0.124,0.601,3.334,2.733,3.21,5345,2269,42.450889
2,2011-01-03,2011,1,3,93.0,2070.08,2.5,3.44,4.57,1.13,0.124,0.601,3.334,2.733,3.21,5345,2269,42.450889
3,2011-01-04,2011,1,4,93.0,2085.14,2.5,3.495,4.58,1.085,0.142,0.621,3.338,2.717,3.196,5345,2269,42.450889
4,2011-01-05,2011,1,5,93.0,2082.55,2.5,3.495,4.63,1.135,0.142,0.708,3.463,2.755,3.321,5345,2269,42.450889
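
Because a daily table is joined to a monthly one here, each monthly row should match many daily rows and never the other way round. As a sketch, the `validate` and `indicator` arguments of `pd.merge` can check this. The frames `daily` and `monthly` below are hypothetical toy data.

```python
import pandas as pd

daily = pd.DataFrame({
    "date": pd.to_datetime(["2011-01-01", "2011-01-02", "2011-02-01"]),
    "year": [2011, 2011, 2011],
    "month": [1, 1, 2],
})
monthly = pd.DataFrame({
    "announcement_year": [2011, 2011],
    "announcement_month": [1, 2],
    "apartment_supply": [5345, 5342],
})

merged = pd.merge(
    daily, monthly,
    left_on=["year", "month"],
    right_on=["announcement_year", "announcement_month"],
    how="left",
    validate="many_to_one",  # raises MergeError if a month appears twice on the right
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Daily rows that found no monthly match
unmatched = merged[merged["_merge"] == "left_only"]
merged = merged.drop(columns=["announcement_year", "announcement_month", "_merge"])
print(merged)
```

An empty `unmatched` frame means every date received supply and unsold values.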


In [None]:
df_final.info() # Check the DataFrame info

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4383 entries, 0 to 4382
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   date                4383 non-null   datetime64[ns]
 1   year                4383 non-null   int64         
 2   month               4383 non-null   int64         
 3   day                 4383 non-null   int64         
 4   apartment_index     4383 non-null   float64       
 5   kospi_index         4383 non-null   float64       
 6   korea_rp            4383 non-null   float64       
 7   korea_3_year        4383 non-null   float64       
 8   korea_10_year       4383 non-null   float64       
 9   korea_10-3_year     4383 non-null   float64       
 10  us_3_month          4383 non-null   float64       
 11  us_2_year           4383 non-null   float64       
 12  us_10_year          4383 non-null   float64       
 13  us_10-2_year        4383 non-null   float64     

## Compare supply and unsold data with the apartment index

In [None]:
# Compare apartment supply, unsold count, and the apartment index
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['apartment_supply'],
                    mode='lines',
                    name='apartment_supply',yaxis='y1'))

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['unsold_count'],
                    mode='lines',
                    name='unsold_count',yaxis='y1'))



fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['apartment_index'],
                    mode='lines',
                    name='apartment_index',
                    yaxis="y2"))
fig.update_layout(

   # create first Y-axis
   yaxis=dict(
      title="supply&unsold",
      titlefont=dict(color="black"),
      tickfont=dict(color="black")
   ),

   # create second Y-axis
   yaxis2=dict(
      title="apartment index",
      overlaying="y",
      side="right")
)

In [None]:
# Compare unsold count and the apartment index
fig = go.Figure()

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['unsold_count'],
                    mode='lines',
                    name='unsold_count',yaxis='y1'))


# Reverse the first y-axis so the trace added above is drawn inverted
fig.update_layout(
    yaxis = dict(autorange="reversed")
)
fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['apartment_index'],
                    mode='lines',
                    name='apartment_index',
                    yaxis="y2"))
fig.update_layout(

   # create first Y-axis
   yaxis=dict(
      title="unsold",
      titlefont=dict(color="black"),
      tickfont=dict(color="black")
   ),

   # create second Y-axis
   yaxis2=dict(
      title="apartment index",
      overlaying="y",
      side="right")
)

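- It is unclear whether supply is related to the apartment index.
- Plotting only the (inverted) unsold count against the apartment index suggests that the two are related.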
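
The visual impression could be backed by a number, for example a Pearson correlation between the two series. The values below are made up purely to show the call and are not taken from `df_final`. Since `df_final` is daily while the unsold figures are monthly, reducing it to one row per month first would probably be a sensible assumption.

```python
import pandas as pd

# Made-up monthly numbers for illustration only
df_toy = pd.DataFrame({
    "unsold_count": [2269, 2216, 2104, 1855, 1500, 900],
    "apartment_index": [93.0, 93.2, 93.1, 94.0, 95.5, 98.0],
})

# Pearson correlation; a value near -1 would match the "inverted" relationship in the plot
pearson = df_toy["unsold_count"].corr(df_toy["apartment_index"])
print(round(pearson, 3))
```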

## Compare the unsold ratio with the apartment index

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['unsold_ratio'],
                    mode='lines',
                    name='unsold_ratio',yaxis='y1'))




# Reverse the first y-axis so the trace added above is drawn inverted
fig.update_layout(
    yaxis = dict(autorange="reversed")
)
fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['apartment_index'],
                    mode='lines',
                    name='apartment_index',
                    yaxis="y2"))
fig.update_layout(

   # create first Y-axis
   yaxis=dict(
      title="unsold_ratio",
      titlefont=dict(color="black"),
      tickfont=dict(color="black")
   ),

   # create second Y-axis
   yaxis2=dict(
      title="apartment index",
      overlaying="y",
      side="right")
)

- The (inverted) unsold ratio and the apartment index appear to be related to some degree.

In [None]:
df_final.head()

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,korea_10-3_year,us_3_month,us_2_year,us_10_year,us_10-2_year,us_10-3_year_month,apartment_supply,unsold_count,unsold_ratio
0,2011-01-01,2011,1,1,93.0,2051.0,2.5,3.44,4.57,1.13,0.124,0.601,3.334,2.733,3.21,5345,2269,42.450889
1,2011-01-02,2011,1,2,93.0,2051.0,2.5,3.44,4.57,1.13,0.124,0.601,3.334,2.733,3.21,5345,2269,42.450889
2,2011-01-03,2011,1,3,93.0,2070.08,2.5,3.44,4.57,1.13,0.124,0.601,3.334,2.733,3.21,5345,2269,42.450889
3,2011-01-04,2011,1,4,93.0,2085.14,2.5,3.495,4.58,1.085,0.142,0.621,3.338,2.717,3.196,5345,2269,42.450889
4,2011-01-05,2011,1,5,93.0,2082.55,2.5,3.495,4.63,1.135,0.142,0.708,3.463,2.755,3.321,5345,2269,42.450889


In [None]:
df_final.to_csv('/content/drive/MyDrive/house_price/after_data/economic_data.csv',index=False)

# Create the economic_data2 file and the deal_everyday, full_rent_everyday, and year_rent_everyday files

In [2]:
import pandas as pd
import numpy as np
# Load the data
df_deal = pd.read_csv("/content/drive/MyDrive/house_price/after_data/apartment_deal.csv",  encoding='UTF8')
df_month_rent = pd.read_csv("/content/drive/MyDrive/house_price/after_data/apartment_month_rent.csv",  encoding='UTF8')
df_full_rent = pd.read_csv("/content/drive/MyDrive/house_price/after_data/apartment_full_rent.csv",  encoding='UTF8')
df_economic = pd.read_csv("/content/drive/MyDrive/house_price/after_data/economic_data.csv",  encoding='UTF8')

>> Check the library versions carefully when saving and loading data
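
One lightweight way to follow this advice is to record the interpreter and library versions next to the saved files. The sketch below only collects them in a dictionary; where to store them is left open, and the name `versions` is hypothetical.

```python
import sys

import numpy as np
import pandas as pd

# Versions of the tools used to write the data files
versions = {
    "python": sys.version.split()[0],
    "pandas": pd.__version__,
    "numpy": np.__version__,
}
print(versions)
```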

# Create the economic_data2 file


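- The economic_data2 file is the economic_data file with the monthly apartment transaction counts (sales, jeonse, and monthly rent) added.
- "Apartment transactions" covers apartment sales, jeonse (lump-sum deposit leases), and monthly rentals combined.
- The monthly transaction count is the total number of apartment transactions concluded in Seoul in that month.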

## Create the apartment sales count DataFrame

In [3]:
df_deal.head()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,deal_price
0,2011-07-09,2011,7,9,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,64000
1,2011-07-28,2011,7,28,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,65500
2,2011-01-19,2011,1,19,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,70500
3,2011-09-02,2011,9,2,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,85000
4,2011-12-17,2011,12,17,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,68000


In [4]:
df_deal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 882185 entries, 0 to 882184
Data columns (total 12 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   date        882185 non-null  object 
 1   year        882185 non-null  int64  
 2   month       882185 non-null  int64  
 3   day         882185 non-null  int64  
 4   address_0   882185 non-null  object 
 5   address_1   882185 non-null  object 
 6   address_2   882185 non-null  object 
 7   address_3   882185 non-null  float64
 8   address_4   882185 non-null  float64
 9   name        882185 non-null  object 
 10  area        882185 non-null  float64
 11  deal_price  882185 non-null  int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 80.8+ MB


In [5]:
# Compute Seoul's monthly apartment sales count with groupby
df_count = df_deal.groupby(["year","month"])["name"].agg('count').copy()
df_count = df_count.reset_index(["year","month"]) # Turn the index levels back into columns
df_count.columns = ["year","month","deal_count"] # Rename the columns
df_count

Unnamed: 0,year,month,deal_count
0,2011,1,7179
1,2011,2,6026
2,2011,3,5419
3,2011,4,4028
4,2011,5,3836
...,...,...,...
139,2022,8,760
140,2022,9,649
141,2022,10,574
142,2022,11,750
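
Note that `agg('count')` on the `name` column counts non-null names, while `size()` counts rows. In this notebook `name` has no nulls according to `df_deal.info()`, so both should agree. The toy frame below includes a missing name on purpose to show the difference; `df_toy` is hypothetical.

```python
import pandas as pd

df_toy = pd.DataFrame({
    "year": [2011, 2011, 2011, 2011],
    "month": [1, 1, 2, 2],
    "name": ["A", "B", "A", None],  # one missing name on purpose
})

# Counts non-null values of `name` per (year, month)
count_non_null = (
    df_toy.groupby(["year", "month"])["name"].count().reset_index(name="deal_count")
)

# Counts rows per (year, month), regardless of missing values
count_rows = df_toy.groupby(["year", "month"]).size().reset_index(name="deal_count")

print(count_non_null)
print(count_rows)
```

`reset_index(name=...)` also names the count column in one step, which replaces the separate column-renaming line.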


In [6]:
df_count.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144 entries, 0 to 143
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   year        144 non-null    int64
 1   month       144 non-null    int64
 2   deal_count  144 non-null    int64
dtypes: int64(3)
memory usage: 3.5 KB


## Add the apartment jeonse count DataFrame

- See the apartment sales count section.

In [7]:
df_full_rent.head()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,full_rent_price
0,2011-01-05,2011,1,5,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,35000
1,2011-01-18,2011,1,18,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,20000
2,2011-02-01,2011,2,1,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,24000
3,2011-02-11,2011,2,11,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,31000
4,2011-02-24,2011,2,24,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,30500


In [8]:
df_temp = df_full_rent.groupby(["year","month"])["name"].agg('count').copy()
df_temp = df_temp.reset_index(["year","month"])
df_temp.columns = ["year","month","full_rent_count"]
df_temp

Unnamed: 0,year,month,full_rent_count
0,2011,1,12336
1,2011,2,12261
2,2011,3,12121
3,2011,4,9754
4,2011,5,9280
...,...,...,...
139,2022,8,11341
140,2022,9,10258
141,2022,10,10559
142,2022,11,8890


In [9]:
# Merge the sales count DataFrame with the jeonse count DataFrame
df_count=pd.merge(df_count,df_temp, left_on=["year","month"], right_on=["year","month"], how="inner")
df_count.head()

Unnamed: 0,year,month,deal_count,full_rent_count
0,2011,1,7179,12336
1,2011,2,6026,12261
2,2011,3,5419,12121
3,2011,4,4028,9754
4,2011,5,3836,9280


In [10]:
df_count.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 144 entries, 0 to 143
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   year             144 non-null    int64
 1   month            144 non-null    int64
 2   deal_count       144 non-null    int64
 3   full_rent_count  144 non-null    int64
dtypes: int64(4)
memory usage: 5.6 KB


## Add the apartment monthly-rent count DataFrame

- See the apartment sales count DataFrame section.

In [11]:
df_month_rent.head()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,rent_deposit,month_rent_price
0,2011-03-18,2011,3,18,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,19000,63
1,2011-04-09,2011,4,9,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,21000,35
2,2011-07-09,2011,7,9,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,3000,160
3,2011-09-19,2011,9,19,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,6000,140
4,2011-09-20,2011,9,20,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,5000,160


In [12]:
df_temp = df_month_rent.groupby(["year","month"])["name"].agg('count').copy()
df_temp = df_temp.reset_index(["year","month"])
df_temp.columns = ["year","month","month_rent_count"]
df_temp

Unnamed: 0,year,month,month_rent_count
0,2011,1,2514
1,2011,2,2711
2,2011,3,2775
3,2011,4,2210
4,2011,5,2168
...,...,...,...
139,2022,8,7415
140,2022,9,7793
141,2022,10,7694
142,2022,11,7709


In [13]:
# Merge in the monthly-rent count DataFrame
df_count=pd.merge(df_count,df_temp, left_on=["year","month"], right_on=["year","month"], how="inner")
df_count.head()

Unnamed: 0,year,month,deal_count,full_rent_count,month_rent_count
0,2011,1,7179,12336,2514
1,2011,2,6026,12261,2711
2,2011,3,5419,12121,2775
3,2011,4,4028,9754,2210
4,2011,5,3836,9280,2168


In [14]:
df_count.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 144 entries, 0 to 143
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   year              144 non-null    int64
 1   month             144 non-null    int64
 2   deal_count        144 non-null    int64
 3   full_rent_count   144 non-null    int64
 4   month_rent_count  144 non-null    int64
dtypes: int64(5)
memory usage: 6.8 KB


## Shift the monthly values

- A month's transaction count only becomes known in the following month, so each value is shifted down by one row (a one-month lag).
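
A minimal sketch of what `shift(1)` does here, using the first three sales counts from the table above (`df_toy` is a hypothetical name). `shift` works on row position, so it relies on the rows being sorted by month with no month missing, which the 144 consecutive entries suggest is the case.

```python
import pandas as pd

df_toy = pd.DataFrame({
    "year": [2011, 2011, 2011],
    "month": [1, 2, 3],
    "deal_count": [7179, 6026, 5419],
})

# Each row now carries the previous month's count; the first row has no predecessor
df_toy["last_month_total_deal_count"] = df_toy["deal_count"].shift(1)
print(df_toy)
```

The first row becomes `NaN`, and the column turns into `float64` because integers cannot hold `NaN`.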

In [15]:
df_count['deal_count'] = df_count['deal_count'].shift(1)
df_count['month_rent_count'] = df_count['month_rent_count'].shift(1)
df_count['full_rent_count'] = df_count['full_rent_count'].shift(1)
# Rename the columns
df_count.columns = ['year','month','last_month_total_deal_count','last_month_total_full_rent_count', 'last_month_month_total_rent_count']
df_count

Unnamed: 0,year,month,last_month_total_deal_count,last_month_total_full_rent_count,last_month_month_total_rent_count
0,2011,1,,,
1,2011,2,7179.0,12336.0,2514.0
2,2011,3,6026.0,12261.0,2711.0
3,2011,4,5419.0,12121.0,2775.0
4,2011,5,4028.0,9754.0,2210.0
...,...,...,...,...,...
139,2022,8,688.0,11654.0,8916.0
140,2022,9,760.0,11341.0,7415.0
141,2022,10,649.0,10258.0,7793.0
142,2022,11,574.0,10559.0,7694.0


In [16]:
df_count.dropna(axis=0,inplace=True)
df_count.reset_index(inplace=True,drop=True)
df_count

Unnamed: 0,year,month,last_month_total_deal_count,last_month_total_full_rent_count,last_month_month_total_rent_count
0,2011,2,7179.0,12336.0,2514.0
1,2011,3,6026.0,12261.0,2711.0
2,2011,4,5419.0,12121.0,2775.0
3,2011,5,4028.0,9754.0,2210.0
4,2011,6,3836.0,9280.0,2168.0
...,...,...,...,...,...
138,2022,8,688.0,11654.0,8916.0
139,2022,9,760.0,11341.0,7415.0
140,2022,10,649.0,10258.0,7793.0
141,2022,11,574.0,10559.0,7694.0


In [None]:
df_count.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 143 entries, 0 to 142
Data columns (total 5 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   year                               143 non-null    int64  
 1   month                              143 non-null    int64  
 2   last_month_total_deal_count        143 non-null    float64
 3   last_month_total_full_rent_count   143 non-null    float64
 4   last_month_month_total_rent_count  143 non-null    float64
dtypes: float64(3), int64(2)
memory usage: 5.7 KB


## Merge with economic_data

In [17]:
df_economic.head()

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,korea_10-3_year,us_3_month,us_2_year,us_10_year,us_10-2_year,us_10-3_year_month,apartment_supply,unsold_count,unsold_ratio
0,2011-01-01,2011,1,1,93.0,2051.0,2.5,3.44,4.57,1.13,0.124,0.601,3.334,2.733,3.21,5345,2269,42.450889
1,2011-01-02,2011,1,2,93.0,2051.0,2.5,3.44,4.57,1.13,0.124,0.601,3.334,2.733,3.21,5345,2269,42.450889
2,2011-01-03,2011,1,3,93.0,2070.08,2.5,3.44,4.57,1.13,0.124,0.601,3.334,2.733,3.21,5345,2269,42.450889
3,2011-01-04,2011,1,4,93.0,2085.14,2.5,3.495,4.58,1.085,0.142,0.621,3.338,2.717,3.196,5345,2269,42.450889
4,2011-01-05,2011,1,5,93.0,2082.55,2.5,3.495,4.63,1.135,0.142,0.708,3.463,2.755,3.321,5345,2269,42.450889


In [18]:
# The macroeconomic table has a row for every date, so merge on year and month
df_economic=pd.merge(df_economic, df_count, left_on=["year","month"], right_on=["year","month"], how="inner")
df_economic.head()

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,korea_10-3_year,...,us_2_year,us_10_year,us_10-2_year,us_10-3_year_month,apartment_supply,unsold_count,unsold_ratio,last_month_total_deal_count,last_month_total_full_rent_count,last_month_month_total_rent_count
0,2011-02-01,2011,2,1,93.0,2072.03,2.75,3.97,4.71,0.74,...,0.605,3.435,2.83,3.278,5342,2269,42.474729,7179.0,12336.0,2514.0
1,2011-02-02,2011,2,2,93.0,2072.03,2.75,3.97,4.71,0.74,...,0.664,3.479,2.815,3.322,5342,2269,42.474729,7179.0,12336.0,2514.0
2,2011-02-03,2011,2,3,93.0,2072.03,2.75,3.97,4.71,0.74,...,0.712,3.547,2.835,3.395,5342,2269,42.474729,7179.0,12336.0,2514.0
3,2011-02-04,2011,2,4,93.0,2072.03,2.75,3.97,4.71,0.74,...,0.752,3.638,2.886,3.486,5342,2269,42.474729,7179.0,12336.0,2514.0
4,2011-02-05,2011,2,5,93.0,2072.03,2.75,3.97,4.71,0.74,...,0.752,3.638,2.886,3.486,5342,2269,42.474729,7179.0,12336.0,2514.0
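
With `how="inner"`, dates whose month has no row in `df_count` disappear. January 2011 was removed by `dropna` after the shift, which is consistent with `info()` reporting 4383 rows before this merge and 4352 after it (31 fewer). The toy frames below, with hypothetical names, contrast `inner` with `left`.

```python
import pandas as pd

daily = pd.DataFrame({"year": [2011, 2011, 2011], "month": [1, 1, 2], "day": [1, 2, 1]})
monthly = pd.DataFrame({
    "year": [2011],
    "month": [2],
    "last_month_total_deal_count": [7179.0],
})

# inner: January rows are dropped because `monthly` has no January entry
inner = pd.merge(daily, monthly, on=["year", "month"], how="inner")

# left: January rows are kept, with NaN in the lagged column
left = pd.merge(daily, monthly, on=["year", "month"], how="left")

print(len(inner), len(left))
```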


In [19]:
df_economic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4352 entries, 0 to 4351
Data columns (total 21 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   date                               4352 non-null   object 
 1   year                               4352 non-null   int64  
 2   month                              4352 non-null   int64  
 3   day                                4352 non-null   int64  
 4   apartment_index                    4352 non-null   float64
 5   kospi_index                        4352 non-null   float64
 6   korea_rp                           4352 non-null   float64
 7   korea_3_year                       4352 non-null   float64
 8   korea_10_year                      4352 non-null   float64
 9   korea_10-3_year                    4352 non-null   float64
 10  us_3_month                         4352 non-null   float64
 11  us_2_year                          4352 non-null   float

In [21]:
# Rename the columns
df_economic.columns = ['date', 'year', 'month', 'day', 'apartment_index', 'kospi_index',
       'korea_rp', 'korea_3_year', 'korea_10_year', 'korea_10-3_year',
       'us_3_month', 'us_2_year', 'us_10_year', 'us_10-2_year',
       'us_10-3_year_month', 'last_month_total_apartment_supply', 'last_month_total_unsold_count',
       'last_month_total_unsold_ratio', 'last_month_total_deal_count',
       'last_month_total_full_rent_count',
       'last_month_month_total_rent_count']
df_economic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4352 entries, 0 to 4351
Data columns (total 21 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   date                               4352 non-null   object 
 1   year                               4352 non-null   int64  
 2   month                              4352 non-null   int64  
 3   day                                4352 non-null   int64  
 4   apartment_index                    4352 non-null   float64
 5   kospi_index                        4352 non-null   float64
 6   korea_rp                           4352 non-null   float64
 7   korea_3_year                       4352 non-null   float64
 8   korea_10_year                      4352 non-null   float64
 9   korea_10-3_year                    4352 non-null   float64
 10  us_3_month                         4352 non-null   float64
 11  us_2_year                          4352 non-null   float

In [22]:
# Change the column dtypes
df_economic=df_economic.astype({'year': 'int16','month': 'int16','day': 'int16',
                    'last_month_total_apartment_supply': 'int32',
                    'last_month_total_unsold_count': 'int32',
                    'last_month_total_deal_count': 'int32',
                    'last_month_total_full_rent_count': 'int32',
                    'last_month_month_total_rent_count': 'int32'})
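
Casting a float column to an integer type raises an error if any `NaN` is present, and narrowing to `int16` only works if the values fit. A sketch of checking both before the cast, on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

df_toy = pd.DataFrame({
    "year": [2011, 2011],
    "last_month_total_deal_count": [7179.0, 6026.0],
})

int_cols = ["last_month_total_deal_count"]

# float -> int fails on NaN, so confirm there are none first
has_nan = df_toy[int_cols].isna().any().any()

# Confirm the values fit into int16 before narrowing
limits = np.iinfo(np.int16)
fits_int16 = df_toy["year"].between(limits.min, limits.max).all()

if not has_nan and fits_int16:
    df_toy = df_toy.astype({"year": "int16", "last_month_total_deal_count": "int32"})

print(df_toy.dtypes)
```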

In [23]:
df_economic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4352 entries, 0 to 4351
Data columns (total 21 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   date                               4352 non-null   object 
 1   year                               4352 non-null   int16  
 2   month                              4352 non-null   int16  
 3   day                                4352 non-null   int16  
 4   apartment_index                    4352 non-null   float64
 5   kospi_index                        4352 non-null   float64
 6   korea_rp                           4352 non-null   float64
 7   korea_3_year                       4352 non-null   float64
 8   korea_10_year                      4352 non-null   float64
 9   korea_10-3_year                    4352 non-null   float64
 10  us_3_month                         4352 non-null   float64
 11  us_2_year                          4352 non-null   float

In [24]:
# Save as a pickle file
df_economic.to_pickle('/content/drive/MyDrive/house_price/after_data/economic_data2.pkl')
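
A likely reason to prefer pickle here is that it keeps the dtypes set in the previous cell, whereas a CSV round trip re-infers them. The sketch below demonstrates this on a toy frame written to a temporary directory; the file names are hypothetical.

```python
import os
import tempfile

import pandas as pd

df_toy = pd.DataFrame({
    "date": pd.to_datetime(["2011-02-01", "2011-02-02"]),
    "year": pd.Series([2011, 2011], dtype="int16"),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "toy.pkl")
    csv_path = os.path.join(tmp_dir, "toy.csv")

    df_toy.to_pickle(pkl_path)
    df_toy.to_csv(csv_path, index=False)

    from_pickle = pd.read_pickle(pkl_path)  # dtypes preserved
    from_csv = pd.read_csv(csv_path)        # dtypes re-inferred from text

print(from_pickle.dtypes)
print(from_csv.dtypes)
```

This also explains why `date` shows up as `object` in `df_economic.info()`: that frame was loaded from a CSV file.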

# Create pivot tables

- Assume that on days with no apartment transaction, the most recent transaction price still holds.

## Load the required data

In [9]:
import pandas as pd
import numpy as np
# Load the data
df_deal = pd.read_csv("/content/drive/MyDrive/house_price/after_data/apartment_deal.csv",  encoding='UTF8')
df_month_rent = pd.read_csv("/content/drive/MyDrive/house_price/after_data/apartment_month_rent.csv",  encoding='UTF8')
df_full_rent = pd.read_csv("/content/drive/MyDrive/house_price/after_data/apartment_full_rent.csv",  encoding='UTF8')
df_economic = pd.read_csv("/content/drive/MyDrive/house_price/after_data/economic_data.csv",  encoding='UTF8')

## Create the apartment sales pivot table

In [3]:
# Inspect sample rows
df_deal.head()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,deal_price
0,2011-07-09,2011,7,9,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,64000
1,2011-07-28,2011,7,28,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,65500
2,2011-01-19,2011,1,19,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,70500
3,2011-09-02,2011,9,2,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,85000
4,2011-12-17,2011,12,17,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,68000


In [4]:
# Add the price per unit area
df_deal['area_deal_price'] = df_deal['deal_price'] / df_deal['area']
df_deal.head()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,deal_price,area_deal_price
0,2011-07-09,2011,7,9,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,64000,823.151125
1,2011-07-28,2011,7,28,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,65500,842.44373
2,2011-01-19,2011,1,19,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,70500,1047.859691
3,2011-09-02,2011,9,2,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,85000,1062.898587
4,2011-12-17,2011,12,17,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,68000,1010.701546


In [7]:
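# Assume the most recent transaction price persists, and derive a price for every period from it
# To do this, group by (year, month) and address, averaging the price per unit area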
import numpy as np
pivot_table_area_deal = df_deal.pivot_table(index=['year','month'], columns=['address_1','address_2','address_3','address_4'], values='area_deal_price', aggfunc=np.mean)
pivot_table_area_deal


Unnamed: 0_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,...,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,address_3,12.0,12.0,138.0,140.0,141.0,166.0,172.0,176.0,177.0,179.0,...,307.0,314.0,318.0,331.0,413.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,address_4,0.0,2.0,0.0,0.0,0.0,4.0,3.0,1.0,0.0,0.0,...,76.0,1.0,81.0,64.0,8.0,0.0,0.0,0.0,0.0,0.0
year,month,Unnamed: 2_level_4,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4,Unnamed: 22_level_4
2011,1,1056.223061,,1985.413117,1843.539520,1834.608613,,,,,,...,,,,,,,450.299419,,,
2011,2,1047.044551,920.318252,1914.281706,2068.121832,1929.906477,,,,1119.359020,954.653938,...,,365.380762,,,,,462.221502,,445.152911,
2011,3,997.464329,,1983.669216,1966.517376,1728.264881,,,,1284.317191,,...,,359.487524,,,,,449.526357,,,
2011,4,1056.760361,,1965.817286,1828.668916,1904.404669,,,,,,...,,383.060476,,,,,452.296247,,,
2011,5,1009.316006,,1852.884387,1971.081572,1653.633357,,,,,,...,,,,,,371.55534,444.357652,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022,8,2801.826056,,,,,,,,,,...,,,,,,,,,,
2022,9,2772.754671,,4019.153604,,,,,,,,...,,,,,,818.61013,,,,
2022,10,2342.766727,,,,,,,,,,...,,,,,,,,,,
2022,11,2459.681105,,2963.637355,,,,,,,,...,,,,,,,,,,


In [18]:
# The transaction months may not cover every month, so build the list of
# all (year, month) pairs from df_economic, which has a row for every date
date_range_list = list(df_economic['date'])
date_list = list()
for date_element in date_range_list:
    date_list_element = list()
    for i in range(2):
        date_list_element.append(int(date_element.split('-')[i]))
    date_list.append(tuple(date_list_element))
date_list = sorted(list(set(date_list)))
print(date_list, len(date_list))

[(2011, 1), (2011, 2), (2011, 3), (2011, 4), (2011, 5), (2011, 6), (2011, 7), (2011, 8), (2011, 9), (2011, 10), (2011, 11), (2011, 12), (2012, 1), (2012, 2), (2012, 3), (2012, 4), (2012, 5), (2012, 6), (2012, 7), (2012, 8), (2012, 9), (2012, 10), (2012, 11), (2012, 12), (2013, 1), (2013, 2), (2013, 3), (2013, 4), (2013, 5), (2013, 6), (2013, 7), (2013, 8), (2013, 9), (2013, 10), (2013, 11), (2013, 12), (2014, 1), (2014, 2), (2014, 3), (2014, 4), (2014, 5), (2014, 6), (2014, 7), (2014, 8), (2014, 9), (2014, 10), (2014, 11), (2014, 12), (2015, 1), (2015, 2), (2015, 3), (2015, 4), (2015, 5), (2015, 6), (2015, 7), (2015, 8), (2015, 9), (2015, 10), (2015, 11), (2015, 12), (2016, 1), (2016, 2), (2016, 3), (2016, 4), (2016, 5), (2016, 6), (2016, 7), (2016, 8), (2016, 9), (2016, 10), (2016, 11), (2016, 12), (2017, 1), (2017, 2), (2017, 3), (2017, 4), (2017, 5), (2017, 6), (2017, 7), (2017, 8), (2017, 9), (2017, 10), (2017, 11), (2017, 12), (2018, 1), (2018, 2), (2018, 3), (2018, 4), (2018, 5),
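
The same list of `(year, month)` tuples can be built without splitting strings by parsing the dates once. This is a sketch on a toy `date` column; `date_list_alt` is a hypothetical name.

```python
import pandas as pd

df_toy = pd.DataFrame({"date": ["2011-01-01", "2011-01-02", "2011-02-01", "2011-12-31"]})

# Parse once, then pair up the year and month components
parsed = pd.to_datetime(df_toy["date"])
date_list_alt = sorted(set(zip(parsed.dt.year, parsed.dt.month)))

print(date_list_alt, len(date_list_alt))
```

Because the tuples hold integers, sorting puts month 2 before month 12.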

In [19]:
pivot_table_area_deal.info() # January 2011 to December 2022 is 144 (year, month) entries; the pivot table also has 144, so no month is missing

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 144 entries, (2011, 1) to (2022, 12)
Columns: 8860 entries, ('강남구', '개포동', 12.0, 0.0) to ('중랑구', '중화동', 454.0, 0.0)
dtypes: float64(8860)
memory usage: 9.7 MB


In [20]:
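# Subtract the transaction months from all months in the period to find months with no transactions
print(set(date_list) - set(pivot_table_area_deal.index)) # Months in the full list but missing from the pivot table
print(set(pivot_table_area_deal.index) - set(date_list)) # Months that were created by mistake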

set()
set()
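
Both sets are empty here, so nothing needs fixing. If a month were missing, one possible remedy (an assumption, not something this notebook does) would be to `reindex` the pivot table on the full month list before forward-filling. The names `pivot_toy` and `all_months` below are hypothetical.

```python
import pandas as pd

# Toy pivot table with February 2011 missing
pivot_toy = pd.DataFrame(
    {"A": [1000.0, 1200.0]},
    index=pd.MultiIndex.from_tuples([(2011, 1), (2011, 3)], names=["year", "month"]),
)
all_months = pd.MultiIndex.from_tuples(
    [(2011, 1), (2011, 2), (2011, 3)], names=["year", "month"]
)

# Same set-difference check as above
missing = set(all_months) - set(pivot_toy.index)

# Insert the missing month as a NaN row, then carry the last price forward
pivot_full = pivot_toy.reindex(all_months).ffill()

print(missing)
print(pivot_full)
```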


In [21]:
# The most recent transaction price is carried forward, so use ffill()
pivot_table_area_deal=pivot_table_area_deal.ffill()
pivot_table_area_deal

Unnamed: 0_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,...,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,address_3,12.0,12.0,138.0,140.0,141.0,166.0,172.0,176.0,177.0,179.0,...,307.0,314.0,318.0,331.0,413.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,address_4,0.0,2.0,0.0,0.0,0.0,4.0,3.0,1.0,0.0,0.0,...,76.0,1.0,81.0,64.0,8.0,0.0,0.0,0.0,0.0,0.0
year,month,Unnamed: 2_level_4,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4,Unnamed: 22_level_4
2011,1,1056.223061,,1985.413117,1843.539520,1834.608613,,,,,,...,,,,,,,450.299419,,,
2011,2,1047.044551,920.318252,1914.281706,2068.121832,1929.906477,,,,1119.359020,954.653938,...,,365.380762,,,,,462.221502,,445.152911,
2011,3,997.464329,920.318252,1983.669216,1966.517376,1728.264881,,,,1284.317191,954.653938,...,,359.487524,,,,,449.526357,,445.152911,
2011,4,1056.760361,920.318252,1965.817286,1828.668916,1904.404669,,,,1284.317191,954.653938,...,,383.060476,,,,,452.296247,,445.152911,
2011,5,1009.316006,920.318252,1852.884387,1971.081572,1653.633357,,,,1284.317191,954.653938,...,,383.060476,,,,371.555340,444.357652,,445.152911,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022,8,2801.826056,1779.004227,3857.953574,2508.957785,4208.615293,1413.594063,1342.758827,2172.968275,2380.225816,3014.696646,...,457.073761,872.199239,466.954023,956.130484,595.238095,874.431303,1163.591651,727.417008,1006.355932,1131.141746
2022,9,2772.754671,1779.004227,4019.153604,2508.957785,4208.615293,1413.594063,1342.758827,2172.968275,2380.225816,3014.696646,...,457.073761,872.199239,466.954023,956.130484,595.238095,818.610130,1163.591651,727.417008,1006.355932,1131.141746
2022,10,2342.766727,1779.004227,4019.153604,2508.957785,4208.615293,1413.594063,1342.758827,2172.968275,2380.225816,3014.696646,...,457.073761,872.199239,466.954023,956.130484,595.238095,818.610130,1163.591651,727.417008,1006.355932,1131.141746
2022,11,2459.681105,1779.004227,2963.637355,2508.957785,4208.615293,1413.594063,1342.758827,2172.968275,2380.225816,3014.696646,...,457.073761,872.199239,466.954023,956.130484,595.238095,818.610130,1163.591651,727.417008,1006.355932,1131.141746
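
The pivot-then-forward-fill steps can be traced on a tiny example. Here a single `address_1` level stands in for the four address columns, and all numbers are made up; `df_toy`, `pivot`, and `filled` are hypothetical names.

```python
import pandas as pd

df_toy = pd.DataFrame({
    "year": [2011, 2011, 2011, 2011],
    "month": [1, 1, 2, 3],
    "address_1": ["A", "A", "B", "A"],
    "area_deal_price": [1000.0, 1100.0, 500.0, 1200.0],
})

# Mean price per unit area for each (year, month) and address
pivot = df_toy.pivot_table(
    index=["year", "month"], columns="address_1",
    values="area_deal_price", aggfunc="mean",
)

# Carry the last known price forward into months without a transaction
filled = pivot.ffill()

print(pivot)
print(filled)
```

Months before an address's first transaction stay `NaN` after `ffill()`, which matches the remaining `NaN` cells in the early 2011 rows above.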


## Create the apartment jeonse pivot table

- See the apartment sales pivot table section.

In [22]:
df_full_rent.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1448686 entries, 0 to 1448685
Data columns (total 12 columns):
 #   Column           Non-Null Count    Dtype  
---  ------           --------------    -----  
 0   date             1448686 non-null  object 
 1   year             1448686 non-null  int64  
 2   month            1448686 non-null  int64  
 3   day              1448686 non-null  int64  
 4   address_0        1448686 non-null  object 
 5   address_1        1448686 non-null  object 
 6   address_2        1448686 non-null  object 
 7   address_3        1448686 non-null  float64
 8   address_4        1448686 non-null  float64
 9   name             1448686 non-null  object 
 10  area             1448686 non-null  float64
 11  full_rent_price  1448686 non-null  int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 132.6+ MB


In [23]:
# Add the price per unit area
df_full_rent['area_full_rent_price'] = df_full_rent['full_rent_price'] / df_full_rent['area']
df_full_rent.head()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,full_rent_price,area_full_rent_price
0,2011-01-05,2011,1,5,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,35000,450.160772
1,2011-01-18,2011,1,18,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,20000,257.234727
2,2011-02-01,2011,2,1,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,24000,308.681672
3,2011-02-11,2011,2,11,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,31000,398.713826
4,2011-02-24,2011,2,24,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,30500,392.282958


In [24]:
pivot_table_area_full_rent=df_full_rent.pivot_table(index=['year','month'], columns=['address_1','address_2','address_3','address_4'], values='area_full_rent_price')
pivot_table_area_full_rent # Confirmed that when a period has several transactions, the mean is shown

Unnamed: 0_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,...,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,address_3,12.0,12.0,138.0,140.0,141.0,166.0,172.0,176.0,177.0,179.0,...,307.0,314.0,318.0,331.0,413.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,address_4,0.0,2.0,0.0,0.0,0.0,4.0,3.0,1.0,0.0,0.0,...,76.0,1.0,81.0,64.0,8.0,0.0,0.0,0.0,0.0,0.0
year,month,Unnamed: 2_level_4,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4,Unnamed: 22_level_4
2011,1,432.532897,440.045054,211.154904,258.848450,209.447885,,,,396.240321,425.511870,...,194.457948,231.800699,203.665988,,,198.560437,239.439022,,,
2011,2,420.187917,461.653016,209.653658,247.684779,206.098616,,,,412.395428,426.431723,...,,225.715445,,143.141648,,176.678445,246.031308,238.063748,235.404896,
2011,3,425.833338,,202.726317,240.732852,208.836187,,,,,445.923879,...,,187.021280,155.426409,,,252.780586,241.615927,,229.519774,
2011,4,414.627546,52.122115,206.219598,267.855555,195.808012,,,,400.612702,410.752418,...,,82.505333,,,,,244.869185,238.063748,,
2011,5,430.543477,409.530901,188.072915,245.028909,197.540818,300.388738,,,356.400356,372.179732,...,,243.181791,215.517241,,,188.457008,252.230868,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022,8,951.719030,977.289650,2038.030102,,,,,,,873.633966,...,,436.099619,,,,467.537034,565.292599,,,
2022,9,917.385474,,1635.203915,,,717.398987,,,854.440037,1063.517984,...,,,,,,,619.117902,,,
2022,10,908.431603,944.864228,1916.378550,,,,,,1119.359020,1094.711720,...,,,,,,,605.027652,,,
2022,11,867.409593,920.512565,1780.685565,,,,,,993.686192,1080.266298,...,,,,,,,611.818738,,661.235093,638.071606


In [25]:
pivot_table_area_full_rent.info() # 144 monthly entries in total, from January 2011 through December 2022

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 144 entries, (2011, 1) to (2022, 12)
Columns: 9258 entries, ('강남구', '개포동', 12.0, 0.0) to ('중랑구', '중화동', 454.0, 0.0)
dtypes: float64(9258)
memory usage: 10.2 MB


In [26]:
# Confirm that every date in date_list is present in the index, and vice versa
print(set(date_list) - set(pivot_table_area_full_rent.index))
print(set(pivot_table_area_full_rent.index) - set(date_list))

set()
set()


In [27]:
pivot_table_area_full_rent = pivot_table_area_full_rent.ffill()
pivot_table_area_full_rent

Unnamed: 0_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,...,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,address_3,12.0,12.0,138.0,140.0,141.0,166.0,172.0,176.0,177.0,179.0,...,307.0,314.0,318.0,331.0,413.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,address_4,0.0,2.0,0.0,0.0,0.0,4.0,3.0,1.0,0.0,0.0,...,76.0,1.0,81.0,64.0,8.0,0.0,0.0,0.0,0.0,0.0
year,month,Unnamed: 2_level_4,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4,Unnamed: 22_level_4
2011,1,432.532897,440.045054,211.154904,258.848450,209.447885,,,,396.240321,425.511870,...,194.457948,231.800699,203.665988,,,198.560437,239.439022,,,
2011,2,420.187917,461.653016,209.653658,247.684779,206.098616,,,,412.395428,426.431723,...,194.457948,225.715445,203.665988,143.141648,,176.678445,246.031308,238.063748,235.404896,
2011,3,425.833338,461.653016,202.726317,240.732852,208.836187,,,,412.395428,445.923879,...,194.457948,187.021280,155.426409,143.141648,,252.780586,241.615927,238.063748,229.519774,
2011,4,414.627546,52.122115,206.219598,267.855555,195.808012,,,,400.612702,410.752418,...,194.457948,82.505333,155.426409,143.141648,,252.780586,244.869185,238.063748,229.519774,
2011,5,430.543477,409.530901,188.072915,245.028909,197.540818,300.388738,,,356.400356,372.179732,...,194.457948,243.181791,215.517241,143.141648,,188.457008,252.230868,238.063748,229.519774,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022,8,951.719030,977.289650,2038.030102,1493.370551,266.299118,591.311344,549.820467,931.272118,793.408605,873.633966,...,475.382003,436.099619,172.413793,192.930047,540.0914,467.537034,565.292599,595.159370,682.674200,494.874514
2022,9,917.385474,977.289650,1635.203915,1493.370551,266.299118,717.398987,549.820467,931.272118,854.440037,1063.517984,...,475.382003,436.099619,172.413793,192.930047,540.0914,467.537034,619.117902,595.159370,682.674200,494.874514
2022,10,908.431603,944.864228,1916.378550,1493.370551,266.299118,717.398987,549.820467,931.272118,1119.359020,1094.711720,...,475.382003,436.099619,172.413793,192.930047,540.0914,467.537034,605.027652,595.159370,682.674200,494.874514
2022,11,867.409593,920.512565,1780.685565,1493.370551,266.299118,717.398987,549.820467,931.272118,993.686192,1080.266298,...,475.382003,436.099619,172.413793,192.930047,540.0914,467.537034,611.818738,595.159370,661.235093,638.071606


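## Apartment monthly-rent pivot table -> apartment yearly-rent pivot table

- See the section on building the apartment sale pivot table
- The deposit differs from contract to contract
- The jeonse-to-monthly-rent conversion rate (전월세전환율) is applied to convert the deposit of a monthly-rent contract into a rent-equivalent amount
- Because the deposit and the monthly rent vary by contract, the yearly rent (the total cost over one year) is computed as 5.8% of the deposit plus 12 times the monthly rent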
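
The yearly-rent formula above can be checked by hand on a single contract. The sketch below is only an illustration: the helper name `yearly_rent` and the constant `CONVERSION_RATE` are hypothetical, and the input values (deposit 19000, monthly rent 63, area 79.97) are taken from the first row of `df_month_rent`.

```python
CONVERSION_RATE = 0.058  # conversion rate used in this notebook


def yearly_rent(deposit, monthly_rent, rate=CONVERSION_RATE):
    """Yearly cost of a monthly-rent contract: rate * deposit + 12 * monthly rent."""
    return deposit * rate + monthly_rent * 12


# First row of df_month_rent: deposit 19000, monthly rent 63, area 79.97
year_rent_price = yearly_rent(19000, 63)
area_year_rent_price = year_rent_price / 79.97

print(round(year_rent_price, 2), round(area_year_rent_price, 6))
```

The two printed values should agree with the `year_rent_price` and `area_year_rent_price` columns of that row (1858.0 and 23.233713).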

In [28]:
# Yearly rent = 5.8% of the deposit + 12 * monthly rent (the total cost over one year)
df_month_rent['year_rent_price'] = (df_month_rent['rent_deposit']*0.058)+(df_month_rent['month_rent_price']*12)
df_month_rent['area_year_rent_price'] = df_month_rent['year_rent_price'] / df_month_rent['area']
df_month_rent

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,rent_deposit,month_rent_price,year_rent_price,area_year_rent_price
0,2011-03-18,2011,3,18,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,19000,63,1858.0,23.233713
1,2011-04-09,2011,4,9,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,21000,35,1638.0,20.482681
2,2011-07-09,2011,7,9,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,3000,160,2094.0,26.184819
3,2011-09-19,2011,9,19,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,6000,140,2028.0,25.359510
4,2011-09-20,2011,9,20,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,5000,160,2210.0,27.635363
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
637083,2022-11-25,2022,11,25,서울특별시,중랑구,중화동,450.0,0.0,한신아파트(103~109),84.03,30000,48,2316.0,27.561585
637084,2022-12-10,2022,12,10,서울특별시,중랑구,중화동,450.0,0.0,한신아파트(103~109),59.76,25000,50,2050.0,34.303882
637085,2022-12-24,2022,12,24,서울특별시,중랑구,중화동,450.0,0.0,한신아파트(103~109),59.76,20000,50,1760.0,29.451138
637086,2022-12-28,2022,12,28,서울특별시,중랑구,중화동,450.0,0.0,한신아파트(103~109),84.03,5000,150,2090.0,24.872069


In [29]:
pivot_table_area_year_rent=df_month_rent.pivot_table(index=['year','month'], columns=['address_1','address_2','address_3','address_4'], values='area_year_rent_price')
pivot_table_area_year_rent

Unnamed: 0_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,...,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,address_3,12.0,12.0,138.0,140.0,141.0,172.0,176.0,177.0,179.0,185.0,...,307.0,307.0,314.0,318.0,331.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,address_4,0.0,2.0,0.0,0.0,0.0,3.0,1.0,0.0,0.0,0.0,...,6.0,76.0,1.0,81.0,64.0,0.0,0.0,0.0,0.0,0.0
year,month,Unnamed: 2_level_4,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4,Unnamed: 22_level_4
2011,1,29.358370,,17.911876,20.323663,18.050763,,,,29.895742,27.382319,...,,,,,,,13.922356,,,
2011,2,28.277485,,17.456082,18.333546,17.506052,,,,,26.441473,...,,14.316808,,,,,16.238534,,14.523557,
2011,3,27.614375,,17.629960,18.453881,16.910225,,,24.412572,29.644517,28.500476,...,,,,,,,14.892905,,,
2011,4,26.463862,25.650160,17.165162,18.575909,16.591916,,,,,23.895105,...,,,13.593862,,,,13.947456,15.500595,,
2011,5,27.021347,,11.109562,18.235376,16.988492,,,,,24.901078,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022,8,40.697447,39.761727,84.104350,,,,,48.971620,,39.601365,...,,,,,,,29.029074,,,
2022,9,45.050984,,90.944970,,,,,,,32.009178,...,,,26.969377,,,,35.475234,,,
2022,10,42.932170,48.741623,87.228899,,,,,,56.148725,39.161184,...,,,12.994336,,,,31.408411,21.293480,,
2022,11,47.229884,,79.793732,,,,,67.260516,52.003517,38.385315,...,,,27.508765,,,,29.190680,,,


In [30]:
print(pivot_table_area_year_rent.info()) # all 144 (year, month) index entries are present
# Confirm that every date in date_list is present in the index, and vice versa
print(set(date_list) - set(pivot_table_area_year_rent.index))
print(set(pivot_table_area_year_rent.index) - set(date_list))

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 144 entries, (2011, 1) to (2022, 12)
Columns: 8358 entries, ('강남구', '개포동', 12.0, 0.0) to ('중랑구', '중화동', 454.0, 0.0)
dtypes: float64(8358)
memory usage: 9.2 MB
None
set()
set()


In [31]:
pivot_table_area_year_rent=pivot_table_area_year_rent.ffill()
pivot_table_area_year_rent

Unnamed: 0_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,...,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,address_3,12.0,12.0,138.0,140.0,141.0,172.0,176.0,177.0,179.0,185.0,...,307.0,307.0,314.0,318.0,331.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,address_4,0.0,2.0,0.0,0.0,0.0,3.0,1.0,0.0,0.0,0.0,...,6.0,76.0,1.0,81.0,64.0,0.0,0.0,0.0,0.0,0.0
year,month,Unnamed: 2_level_4,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4,Unnamed: 22_level_4
2011,1,29.358370,,17.911876,20.323663,18.050763,,,,29.895742,27.382319,...,,,,,,,13.922356,,,
2011,2,28.277485,,17.456082,18.333546,17.506052,,,,29.895742,26.441473,...,,14.316808,,,,,16.238534,,14.523557,
2011,3,27.614375,,17.629960,18.453881,16.910225,,,24.412572,29.644517,28.500476,...,,14.316808,,,,,14.892905,,14.523557,
2011,4,26.463862,25.650160,17.165162,18.575909,16.591916,,,24.412572,29.644517,23.895105,...,,14.316808,13.593862,,,,13.947456,15.500595,14.523557,
2011,5,27.021347,25.650160,11.109562,18.235376,16.988492,,,24.412572,29.644517,24.901078,...,,14.316808,13.593862,,,,13.947456,15.500595,14.523557,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022,8,40.697447,39.761727,84.104350,11.258409,6.298209,20.496709,59.353076,48.971620,29.116945,39.601365,...,17.172279,16.291854,26.415850,4.281609,12.4471,6.555443,29.029074,19.574130,24.442083,22.19917
2022,9,45.050984,39.761727,90.944970,11.258409,6.298209,20.496709,59.353076,48.971620,29.116945,32.009178,...,17.172279,16.291854,26.969377,4.281609,12.4471,6.555443,35.475234,19.574130,24.442083,22.19917
2022,10,42.932170,48.741623,87.228899,11.258409,6.298209,20.496709,59.353076,48.971620,56.148725,39.161184,...,17.172279,16.291854,12.994336,4.281609,12.4471,6.555443,31.408411,21.293480,24.442083,22.19917
2022,11,47.229884,48.741623,79.793732,11.258409,6.298209,20.496709,59.353076,67.260516,52.003517,38.385315,...,17.172279,16.291854,27.508765,4.281609,12.4471,6.555443,29.190680,21.293480,24.442083,22.19917


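# Creating the deal_everyday, full_rent_everyday, and year_rent_everyday folders and files

- The files in the deal_everyday folder extend the apartment_deal file with the apartment sale price for every date
- The files in the full_rent_everyday folder extend the apartment_full_rent file with the apartment full-rent (jeonse) price for every date
- The files in the year_rent_everyday folder extend the apartment_month_rent file with the apartment monthly-rent price for every date

- The apartment_deal, apartment_full_rent, and apartment_month_rent DataFrames built earlier hold data only for the dates on which a transaction was concluded.
- New DataFrames are created to obtain sale, full-rent, and yearly-rent information for every date in the period, not only the transaction dates
- For later processing, the pivot tables created above have to be restructured into the columns address_1, address_2, address_3, address_4, year, month, day, and transaction price (sale price, full-rent price, or yearly-rent price)
- However, the pivot tables have so many columns that the restructuring step fails with an out-of-memory error
- Several approaches were tried to work around the memory shortage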

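## 7. Slice the columns, then stack

- No memory problems, and very fast -> when there are very many columns, cut them into parts and process the parts one by one

>> Processing can be chunked by slicing columns, not only rows.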
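
As a minimal sketch of this idea, the snippet below applies column-sliced stacking to a small toy table. The data, the two-level column layout, the `chunk_size` of 4, and the names `wide`, `parts`, and `long_table` are all assumptions made for illustration; the notebook itself works on four-level columns in slices of 20.

```python
import numpy as np
import pandas as pd

# Toy wide table: 6 dates x 10 two-level columns (hypothetical data)
index = pd.MultiIndex.from_product(
    [[2011], [1], range(1, 7)], names=['year', 'month', 'day'])
columns = pd.MultiIndex.from_product(
    [['A', 'B'], range(5)], names=['address_1', 'address_2'])
wide = pd.DataFrame(
    np.arange(60, dtype=float).reshape(6, 10), index=index, columns=columns)

chunk_size = 4  # columns per slice (assumed value for this toy example)
parts = []
for start in range(0, wide.shape[1], chunk_size):
    part = wide.iloc[:, start:start + chunk_size]        # slice by column position
    long_part = part.stack(level=[0, 1]).reset_index()   # wide -> long, this slice only
    long_part.columns = ['year', 'month', 'day', 'address_1', 'address_2', 'price']
    parts.append(long_part)

# Combine the long-format slices once at the end
long_table = pd.concat(parts, ignore_index=True)
print(long_table.shape)
```

Only one slice is in wide-to-long conversion at any moment, which is what keeps peak memory low. Collecting the slices in a list and concatenating once is a design choice of this sketch, not something taken from the notebook.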

### Creating the deal_everyday folder and files

- The result has to be saved in several parts, so the files are grouped inside a folder

In [None]:
# Here pivot_table_deal is the table before reset_index
pivot_table_deal.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,Unnamed: 1_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,...,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,Unnamed: 1_level_2,address_3,12.0,12.0,138.0,140.0,141.0,166.0,172.0,176.0,177.0,179.0,...,307.0,314.0,318.0,331.0,413.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,Unnamed: 1_level_3,address_4,0.0,2.0,0.0,0.0,0.0,4.0,3.0,1.0,0.0,0.0,...,76.0,1.0,81.0,64.0,8.0,0.0,0.0,0.0,0.0,0.0
year,month,day,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4,Unnamed: 22_level_4,Unnamed: 23_level_4
2011,1,1,,,,,,,,,,,...,,,,,,,,,,
2011,1,2,,,,,,,,,,,...,,,,,,,25800.0,,,
2011,1,3,,,,,,,,,,,...,,,,,,,25800.0,,,
2011,1,4,33800.0,,,,,,,,,,...,,,,,,,25800.0,,,
2011,1,5,43000.0,,89400.0,,80300.0,,,,,,...,,,,,,,25800.0,,,


In [None]:
pivot_table_deal.info() # check the pivot table info

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 4383 entries, (2011, 1, 1) to (2022, 12, 31)
Columns: 8860 entries, ('강남구', '개포동', 12.0, 0.0) to ('중랑구', '중화동', 454.0, 0.0)
dtypes: float64(8860)
memory usage: 296.3 MB


In [None]:
# Fill the null values - if they are left unfilled, the later stack call leaves those cells out
pivot_table_deal = pivot_table_deal.fillna(0)
pivot_table_deal

Unnamed: 0_level_0,Unnamed: 1_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,Unnamed: 1_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,...,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,Unnamed: 1_level_2,address_3,12.0,12.0,138.0,140.0,141.0,166.0,172.0,176.0,177.0,179.0,...,307.0,314.0,318.0,331.0,413.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,Unnamed: 1_level_3,address_4,0.0,2.0,0.0,0.0,0.0,4.0,3.0,1.0,0.0,0.0,...,76.0,1.0,81.0,64.0,8.0,0.0,0.0,0.0,0.0,0.0
year,month,day,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4,Unnamed: 22_level_4,Unnamed: 23_level_4
2011,1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2011,1,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,25800.0,0.0,0.0,0.0
2011,1,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,25800.0,0.0,0.0,0.0
2011,1,4,33800.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,25800.0,0.0,0.0,0.0
2011,1,5,43000.0,0.0,89400.0,0.0,80300.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,25800.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022,12,27,104000.0,303000.0,220000.0,63250.0,184000.0,120000.0,179500.0,350000.0,390000.0,240000.0,...,37800.0,74000.0,32500.0,76500.0,28000.0,69500.0,91500.0,55000.0,85500.0,96000.0
2022,12,28,104000.0,303000.0,220000.0,63250.0,184000.0,120000.0,179500.0,350000.0,390000.0,240000.0,...,37800.0,74000.0,32500.0,76500.0,28000.0,69500.0,91500.0,55000.0,85500.0,96000.0
2022,12,29,104000.0,303000.0,220000.0,63250.0,184000.0,120000.0,179500.0,350000.0,390000.0,240000.0,...,37800.0,74000.0,32500.0,76500.0,28000.0,69500.0,91500.0,55000.0,85500.0,96000.0
2022,12,30,104000.0,303000.0,220000.0,63250.0,184000.0,120000.0,179500.0,350000.0,390000.0,240000.0,...,37800.0,74000.0,32500.0,76500.0,28000.0,69500.0,91500.0,55000.0,85500.0,96000.0


>> stack leaves null values out of its result, so the nulls have to be filled first to keep cells from silently dropping out during restructuring.
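
The effect of filling first can be seen on a tiny table. This is a hedged sketch with made-up values; it only counts cells, so it does not depend on how a particular pandas version treats nulls inside `stack`.

```python
import numpy as np
import pandas as pd

# Tiny wide table with two missing cells (hypothetical data)
wide = pd.DataFrame(
    [[1.0, np.nan], [np.nan, 4.0], [5.0, 6.0]],
    index=pd.Index([1, 2, 3], name='day'),
    columns=pd.Index(['a', 'b'], name='address'),
)

# Cells that remain if null entries are left out of the long format
non_null_cells = int(wide.notna().sum().sum())

# After fillna(0) every (day, address) pair is guaranteed a row
filled_long = wide.fillna(0).stack()

print(non_null_cells, len(filled_long), wide.size)
```

After `fillna(0)` the long table has exactly rows x columns entries, so every date keeps a row for every address.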

In [None]:
# Final code
# pivot_table_deal has 8860 columns, i.e. 443 chunks of 20 columns
# This uses a lot of memory, so the result is saved to file along the way
column_range_list = [[0,100],[100,200],[200,300],[300,400],[400,443]] # chunk-index ranges (the output is split into 5 files)
for column_range in column_range_list:
    for i in range(column_range[0],column_range[1]):
        start = 20*i # column position where this chunk starts
        end = 20*i + 20 # column position where this chunk ends (exclusive)
        pivot_table_deal_part = pivot_table_deal.iloc[:,start:end] # take one slice of the pivot table
        stack_deal_table =pivot_table_deal_part.stack(level=[0,1,2,3]) # restructure the slice with stack
        stack_deal_table=stack_deal_table.reset_index() # turn the index levels into columns
        stack_deal_table.columns=['year','month','day','address_1','address_2','address_3','address_4','deal_price'] # rename the columns
        if i == column_range[0]: # first chunk of the current range
            df_deal_final = stack_deal_table.copy() # start a new DataFrame for this range
        else: # any later chunk of the current range
            df_deal_final = pd.concat([df_deal_final, stack_deal_table], axis=0) # append below the existing rows
            df_deal_final.reset_index(drop=True, inplace=True) # reset the index
        print(i)
        print(stack_deal_table.head(1)) # print one row as a sanity check
    df_deal_final.to_csv('/content/drive/MyDrive/house_price/after_data/deal_everyday/'+str(column_range[0])+'_'+str(column_range[1])+'.csv',index=False)

0
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       강남구       개포동       12.0        0.0         0.0
1
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       강남구       개포동      655.0        3.0         0.0
2
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       강남구       논현동        9.0        2.0         0.0
3
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       강남구       논현동      103.0       11.0         0.0
4
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       강남구       논현동      196.0        4.0         0.0
5
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       강남구       논현동      261.0        8.0         0.0
6
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1     

55
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       강동구       천호동       27.0        6.0         0.0
56
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       강동구       천호동       49.0        8.0         0.0
57
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       강동구       천호동       55.0        4.0         0.0
58
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       강동구       천호동      302.0        3.0         0.0
59
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       강동구       천호동      416.0       12.0         0.0
60
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       강동구       천호동      570.0        0.0         0.0
61
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1   

112
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       관악구       신림동     1684.0        7.0         0.0
113
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       관악구       신림동     1718.0        0.0         0.0
114
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       관악구       신림동     1738.0        0.0         0.0
115
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       광진구       광장동      453.0        1.0         0.0
116
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       광진구       구의동       45.0       11.0         0.0
117
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       광진구       구의동      224.0        7.0         0.0
118
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011   

166
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       금천구       시흥동      798.0       64.0         0.0
167
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       금천구       시흥동      949.0       18.0         0.0
168
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       금천구       시흥동      999.0        1.0         0.0
169
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       노원구       공릉동      109.0        0.0         0.0
170
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       노원구       공릉동      380.0       56.0         0.0
171
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       노원구       공릉동      585.0        7.0         0.0
172
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011   

221
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       동작구      신대방동      360.0       17.0         0.0
222
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       동작구      신대방동      711.0        0.0         0.0
223
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       동작구       흑석동      332.0        0.0         0.0
224
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       마포구       공덕동      464.0        0.0         0.0
225
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       마포구       도화동       82.0        0.0         0.0
226
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       마포구       망원동      239.0        0.0         0.0
227
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011   

275
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       서초구       방배동     2626.0        0.0         0.0
276
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       서초구       서초동     1315.0        0.0         0.0
277
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       서초구       서초동     1337.0       22.0         0.0
278
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       서초구       서초동     1363.0       25.0         0.0
279
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       서초구       서초동     1454.0       29.0         0.0
280
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       서초구       서초동     1472.0        1.0         0.0
281
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011   

330
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       송파구       송파동      187.0        5.0         0.0
331
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       송파구       오금동        9.0        7.0         0.0
332
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       송파구       오금동       75.0        0.0         0.0
333
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       송파구       오금동      615.0        0.0         0.0
334
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       송파구       잠실동      336.0        5.0         0.0
335
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       송파구       장지동      896.0        0.0         0.0
336
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011   

384
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       용산구       한남동      805.0        0.0         0.0
385
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       용산구       효창동        5.0      127.0         0.0
386
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       용산구       후암동      143.0       23.0         0.0
387
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       은평구       갈현동      281.0      208.0         0.0
388
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       은평구       갈현동      499.0       18.0         0.0
389
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       은평구       구산동       24.0       59.0         0.0
390
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011   

439
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       중랑구       상봉동      269.0        7.0         0.0
440
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       중랑구       신내동      449.0        1.0         0.0
441
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       중랑구       신내동      788.0        0.0         0.0
442
   year  month  day address_1 address_2  address_3  address_4  deal_price
0  2011      1    1       중랑구       중화동      207.0       14.0         0.0


>> When a table has a very large number of columns, restructuring it can run out of memory.

>> In that case, slicing the columns and restructuring one slice at a time resolves the problem.
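
The slice boundaries do not have to be hard-coded per table. The helper below, `chunk_bounds`, is a hypothetical name introduced only for this sketch; it derives the (start, end) positions from the column count, using the 20-column slice width and the 8860 and 9258 column counts reported by `info()` in this notebook.

```python
def chunk_bounds(n_cols, chunk_size=20):
    """Return (start, end) column positions that cover range(n_cols) in chunks."""
    return [(start, min(start + chunk_size, n_cols))
            for start in range(0, n_cols, chunk_size)]


deal_bounds = chunk_bounds(8860)  # sale pivot table: 8860 columns
rent_bounds = chunk_bounds(9258)  # full-rent pivot table: 9258 columns

print(len(deal_bounds), deal_bounds[-1])
print(len(rent_bounds), rent_bounds[-1])
```

For 8860 columns this gives 443 slices, and for 9258 columns it gives 463 slices with a shorter final slice, so no special case is needed for the last chunk.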

### Creating the full_rent_everyday folder and files

- See the deal_everyday section above

In [None]:
pivot_table_full_rent.info() # 9258 columns in total

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 4383 entries, (2011, 1, 1) to (2022, 12, 31)
Columns: 9258 entries, ('강남구', '개포동', 12.0, 0.0) to ('중랑구', '중화동', 454.0, 0.0)
dtypes: float64(9258)
memory usage: 309.6 MB


In [None]:
pivot_table_full_rent = pivot_table_full_rent.fillna(0)
pivot_table_full_rent

Unnamed: 0_level_0,Unnamed: 1_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,Unnamed: 1_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,...,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,Unnamed: 1_level_2,address_3,12.0,12.0,138.0,140.0,141.0,166.0,172.0,176.0,177.0,179.0,...,307.0,314.0,318.0,331.0,413.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,Unnamed: 1_level_3,address_4,0.0,2.0,0.0,0.0,0.0,4.0,3.0,1.0,0.0,0.0,...,76.0,1.0,81.0,64.0,8.0,0.0,0.0,0.0,0.0,0.0
year,month,day,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4,Unnamed: 22_level_4,Unnamed: 23_level_4
2011,1,1,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2011,1,2,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2011,1,3,17000.000000,63000.0,0.0,0.000000,9500.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2011,1,4,18833.333333,63000.0,0.0,16000.000000,7250.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,15000.0,0.0,0.0,0.0,15000.0,0.0,0.0,0.0
2011,1,5,18833.333333,63000.0,11000.0,15833.333333,10000.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,15000.0,0.0,0.0,0.0,16000.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022,12,27,36550.000000,140000.0,75000.0,107000.000000,8000.0,60900.0,73500.0,150000.0,130000.0,70000.0,...,33600.0,37000.0,12000.0,15500.0,26000.0,32000.0,32500.0,33000.0,57750.0,40000.0
2022,12,28,30200.000000,140000.0,75000.0,107000.000000,8000.0,60900.0,73500.0,150000.0,130000.0,70000.0,...,33600.0,37000.0,12000.0,15500.0,26000.0,32000.0,32500.0,33000.0,57750.0,40000.0
2022,12,29,30200.000000,140000.0,75000.0,107000.000000,8000.0,60900.0,73500.0,150000.0,130000.0,68000.0,...,33600.0,37000.0,12000.0,15500.0,26000.0,32000.0,32500.0,33000.0,57750.0,40000.0
2022,12,30,30200.000000,140000.0,75000.0,107000.000000,8000.0,60900.0,73500.0,150000.0,130000.0,68000.0,...,33600.0,37000.0,12000.0,15500.0,26000.0,32000.0,32500.0,33000.0,57750.0,40000.0


In [None]:
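# 9258 columns -> 463 chunks of 20 columns (the last chunk has only 18)
column_range_list = [[0,100],[100,200],[200,300],[300,400],[400,463]] # chunk-index ranges (the output is split into 5 files)
for column_range in column_range_list:
    for i in range(column_range[0],column_range[1]):
        start = 20*i # column position where this chunk starts
        end = 20*i + 20 # column position where this chunk ends (exclusive)
        if i == 462: # last chunk: stop at the final column
            end = 9258
        pivot_table_full_rent_part = pivot_table_full_rent.iloc[:,start:end] # take one slice of the pivot table
        stack_full_rent_table = pivot_table_full_rent_part.stack(level=[0,1,2,3]) # restructure the slice with stack
        stack_full_rent_table=stack_full_rent_table.reset_index() # turn the index levels into columns
        stack_full_rent_table.columns=['year','month','day','address_1','address_2','address_3','address_4','full_rent_price'] # rename the columns
        if i == column_range[0]: # first chunk of the current range
            df_full_rent_final = stack_full_rent_table.copy()
        else: # any later chunk of the current range
            df_full_rent_final = pd.concat([df_full_rent_final, stack_full_rent_table], axis=0) # append below the existing rows
            df_full_rent_final.reset_index(drop=True, inplace=True) # reset the index
        print(i)
        print(stack_full_rent_table.head(1)) # print one row as a sanity check
        print()
    df_full_rent_final.to_csv('/content/drive/MyDrive/house_price/after_data/full_rent_everyday/'+str(column_range[0])+'_'+str(column_range[1])+'.csv',index=False)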

0
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       강남구       개포동       12.0        0.0              0.0

1
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       강남구       개포동      655.0        3.0              0.0

2
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       강남구       개포동     1283.0        0.0              0.0

3
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       강남구       논현동       80.0       13.0              0.0

4
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       강남구       논현동      194.0       22.0              0.0

5
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       강남구       논현동      252.0        0.0              0.0

6
   year  month  day address_1 ad

51
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       강동구       성내동      438.0        6.0              0.0

52
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       강동구       성내동      452.0        2.0              0.0

53
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       강동구       성내동      513.0        0.0              0.0

54
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       강동구       성내동      601.0        0.0              0.0

55
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       강동구       암사동      442.0        1.0              0.0

56
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       강동구       암사동      487.0       37.0              0.0

57
   year  month  day addre

104
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       강서구       화곡동     1160.0        0.0              0.0

105
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       관악구       남현동      602.0      201.0              0.0

106
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       관악구       남현동     1072.0       46.0              0.0

107
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       관악구       남현동     1139.0        0.0              0.0

108
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       관악구       봉천동      148.0      129.0              0.0

109
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       관악구       봉천동      645.0       87.0              0.0

110
   year  month  da

155
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       구로구       구로동      796.0        3.0              0.0

156
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       구로구       구로동      797.0        9.0              0.0

157
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       구로구       구로동      799.0        1.0              0.0

158
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       구로구       구로동      803.0       13.0              0.0

159
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       구로구       구로동      807.0       16.0              0.0

160
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       구로구       구로동     1268.0        0.0              0.0

161
   year  month  da

206
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1      동대문구       용두동       29.0        1.0              0.0

207
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1      동대문구       용두동      792.0        0.0              0.0

208
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1      동대문구       이문동      327.0        1.0              0.0

209
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1      동대문구       장안동      306.0       13.0              0.0

210
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1      동대문구       장안동      333.0        5.0              0.0

211
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1      동대문구       장안동      345.0       10.0              0.0

212
   year  month  da

257
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1      서대문구       연희동      103.0        1.0              0.0

258
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1      서대문구       연희동      708.0        4.0              0.0

259
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1      서대문구       창천동      501.0       14.0              0.0

260
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1      서대문구     충정로3가      222.0        0.0              0.0

261
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1      서대문구       홍은동      150.0       13.0              0.0

262
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1      서대문구       홍은동      274.0       60.0              0.0

263
   year  month  da

308
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       서초구       잠원동      162.0        0.0              0.0

309
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       성동구     금호동4가      180.0        0.0              0.0

310
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       성동구       마장동      784.0        0.0              0.0

311
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       성동구     상왕십리동      811.0        0.0              0.0

312
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       성동구     성수동1가      716.0        0.0              0.0

313
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       성동구     성수동2가      838.0        0.0              0.0

314
   year  month  da

359
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       양천구       신월동       54.0       10.0              0.0

360
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       양천구       신월동      134.0       17.0              0.0

361
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       양천구       신월동      222.0        7.0              0.0

362
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       양천구       신월동      440.0       10.0              0.0

363
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       양천구       신월동      485.0        2.0              0.0

364
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       양천구       신월동      510.0        3.0              0.0

365
   year  month  da

412
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       은평구       대조동       59.0       48.0              0.0

413
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       은평구       대조동      197.0       16.0              0.0

414
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       은평구       불광동      305.0        1.0              0.0

415
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       은평구       불광동      629.0        0.0              0.0

416
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       은평구       수색동       75.0        0.0              0.0

417
   year  month  day address_1 address_2  address_3  address_4  full_rent_price
0  2011      1    1       은평구       신사동       19.0       84.0              0.0

418
   year  month  da

### Creating the year_rent_everyday folder and files

- Follows the same procedure as the deal_everyday step.

In [None]:
pivot_table_year_rent.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 4383 entries, (2011, 1, 1) to (2022, 12, 31)
Columns: 8358 entries, ('강남구', '개포동', 12.0, 0.0) to ('중랑구', '중화동', 454.0, 0.0)
dtypes: float64(8358)
memory usage: 279.5 MB


In [None]:
pivot_table_year_rent = pivot_table_year_rent.fillna(0)
pivot_table_year_rent

Unnamed: 0_level_0,Unnamed: 1_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,Unnamed: 1_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,...,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,Unnamed: 1_level_2,address_3,12.0,12.0,138.0,140.0,141.0,172.0,176.0,177.0,179.0,185.0,...,307.0,307.0,314.0,318.0,331.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,Unnamed: 1_level_3,address_4,0.0,2.0,0.0,0.0,0.0,3.0,1.0,0.0,0.0,0.0,...,6.0,76.0,1.0,81.0,64.0,0.0,0.0,0.0,0.0,0.0
year,month,day,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4,Unnamed: 22_level_4,Unnamed: 23_level_4
2011,1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2011,1,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2011,1,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2011,1,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1786.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2011,1,5,0.0,0.0,778.0,0.0,0.0,0.0,0.0,0.0,0.0,1786.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022,12,27,958.0,6546.0,3920.0,251.6,365.8,2740.0,9560.0,3700.0,3160.0,2700.0,...,1234.0,1304.0,2244.0,298.0,1000.0,389.0,1760.0,1610.0,2070.0,2140.0
2022,12,28,2240.0,6546.0,3920.0,251.6,365.8,2740.0,9560.0,3700.0,3160.0,1779.0,...,1234.0,1304.0,2244.0,298.0,1000.0,389.0,2090.0,1610.0,2070.0,2140.0
2022,12,29,1781.0,6546.0,3920.0,251.6,365.8,2740.0,9560.0,3700.0,3160.0,1779.0,...,1234.0,1304.0,2244.0,298.0,1000.0,389.0,2090.0,1610.0,2070.0,2140.0
2022,12,30,2036.5,6546.0,4000.0,251.6,365.8,2740.0,9560.0,3700.0,3160.0,2096.0,...,1234.0,1304.0,2244.0,298.0,1000.0,389.0,2090.0,1610.0,2070.0,2140.0


In [None]:
# Convert the wide pivot table to long format 20 columns at a time to limit memory use,
# and save one CSV per group of up to 100 chunks
column_range_list = [[0,100],[100,200],[200,300],[300,400],[400,418]]
for column_range in column_range_list:
    stacked_parts = []
    for i in range(column_range[0],column_range[1]):
        start = 20*i
        end = 20*i + 20
        if i == 417:
            end = 8358  # the last chunk ends at the final column
        pivot_table_year_rent_part = pivot_table_year_rent.iloc[:,start:end]
        # Move all four address levels from the columns into the rows
        stack_year_rent_table = pivot_table_year_rent_part.stack(level=[0,1,2,3])
        stack_year_rent_table = stack_year_rent_table.reset_index()
        stack_year_rent_table.columns=['year','month','day','address_1','address_2','address_3','address_4','year_rent_price']
        stacked_parts.append(stack_year_rent_table)
        # Progress log: chunk number and its first row
        print(i)
        print(stack_year_rent_table.head(1))
        print()
    # Concatenate once per group instead of growing a DataFrame inside the loop
    df_year_rent_final = pd.concat(stacked_parts, axis=0, ignore_index=True)
    df_year_rent_final.to_csv('/content/drive/MyDrive/house_price/after_data/year_rent_everyday/'+str(column_range[0])+'_'+str(column_range[1])+'.csv',index=False)

0
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       강남구       개포동       12.0        0.0              0.0

1
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       강남구       개포동      656.0        0.0              0.0

2
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       강남구       논현동       22.0        0.0              0.0

3
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       강남구       논현동      103.0       11.0              0.0

4
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       강남구       논현동      194.0       23.0              0.0

5
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       강남구       논현동      252.0        1.0              0.0

6
   year  month  day address_1 ad

51
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       강동구       성내동      604.0        0.0              0.0

52
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       강동구       암사동      451.0       16.0              0.0

53
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       강동구       암사동      508.0        0.0              0.0

54
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       강동구       천호동       35.0        4.0              0.0

55
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       강동구       천호동       52.0        3.0              0.0

56
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       강동구       천호동      166.0      106.0              0.0

57
   year  month  day addre

103
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       관악구       봉천동     1644.0       26.0              0.0

104
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       관악구       봉천동     1705.0        0.0              0.0

105
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       관악구       봉천동     1723.0        0.0              0.0

106
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       관악구       신림동      244.0       21.0              0.0

107
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       관악구       신림동      746.0       43.0              0.0

108
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       관악구       신림동     1463.0       11.0              0.0

109
   year  month  da

155
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       금천구       독산동      958.0        0.0              0.0

156
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       금천구       독산동     1006.0      139.0              0.0

157
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       금천구       독산동     1141.0        0.0              0.0

158
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       금천구       시흥동      791.0       40.0              0.0

159
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       금천구       시흥동      959.0       11.0              0.0

160
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       금천구       시흥동     1012.0        0.0              0.0

161
   year  month  da

207
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       동작구       사당동     1151.0        0.0              0.0

208
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       동작구       상도동        1.0        7.0              0.0

209
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       동작구       상도동      301.0        4.0              0.0

210
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       동작구       상도동      421.0        0.0              0.0

211
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       동작구       상도동      532.0        0.0              0.0

212
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       동작구      신대방동      686.0       48.0              0.0

213
   year  month  da

258
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       서초구       방배동      963.0       16.0              0.0

259
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       서초구       방배동     1002.0       10.0              0.0

260
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       서초구       방배동     2525.0        0.0              0.0

261
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       서초구       서초동     1311.0       10.0              0.0

262
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       서초구       서초동     1336.0        0.0              0.0

263
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       서초구       서초동     1359.0       50.0              0.0

264
   year  month  da

311
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       송파구       방이동      217.0        0.0              0.0

312
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       송파구       석촌동       54.0       31.0              0.0

313
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       송파구       송파동       14.0        0.0              0.0

314
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       송파구       송파동      164.0        0.0              0.0

315
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       송파구       신천동       11.0       10.0              0.0

316
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       송파구       오금동       54.0        4.0              0.0

317
   year  month  da

362
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       용산구      이태원동      198.0       16.0              0.0

363
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       용산구     한강로2가        2.0       11.0              0.0

364
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       용산구       한남동        1.0      349.0              0.0

365
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       용산구       한남동      723.0        3.0              0.0

366
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       용산구       효창동        5.0        1.0              0.0

367
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       용산구       후암동      143.0       23.0              0.0

368
   year  month  da

414
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       중랑구       상봉동      284.0       11.0              0.0

415
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       중랑구       신내동      479.0        0.0              0.0

416
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       중랑구       신내동      795.0        0.0              0.0

417
   year  month  day address_1 address_2  address_3  address_4  year_rent_price
0  2011      1    1       중랑구       중화동      207.0       14.0              0.0
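
The loop above turns the wide pivot table into long format a few columns at a time. The following is a minimal sketch of the same idea on a toy table. The helper name `stack_in_chunks` and the prices are made up for illustration, and the chunks are collected in a list and concatenated once.

```python
import pandas as pd

def stack_in_chunks(wide, chunk_size, value_name):
    """Convert a wide table (MultiIndex rows and columns) to long format, chunk by chunk."""
    n_levels = wide.columns.nlevels
    parts = []
    for start in range(0, wide.shape[1], chunk_size):
        part = wide.iloc[:, start:start + chunk_size]
        # Move every column level into the row index, then flatten the index to columns.
        long_part = part.stack(level=list(range(n_levels))).reset_index()
        long_part.columns = list(wide.index.names) + list(wide.columns.names) + [value_name]
        parts.append(long_part)
    return pd.concat(parts, ignore_index=True)

# Toy data (made-up prices): 2 days x 3 apartments
rows = pd.MultiIndex.from_tuples(
    [(2011, 1, 1), (2011, 1, 2)], names=['year', 'month', 'day'])
cols = pd.MultiIndex.from_tuples(
    [('강남구', '개포동', 12.0, 0.0),
     ('강남구', '개포동', 12.0, 2.0),
     ('중랑구', '중화동', 454.0, 0.0)],
    names=['address_1', 'address_2', 'address_3', 'address_4'])
wide = pd.DataFrame([[0.0, 100.0, 200.0], [10.0, 110.0, 210.0]], index=rows, columns=cols)

long_df = stack_in_chunks(wide, chunk_size=2, value_name='year_rent_price')
print(long_df)
```

Each (day, apartment) pair becomes one row, so the toy table yields 2 x 3 = 6 rows.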



# Creating the final_economic file

- economic_data2 holds the macroeconomic indicators for each date.
- final_economic is economic_data2 plus columns describing how each indicator has changed relative to its past values.

In [None]:
import pandas as pd
import os

## Basic information

In [None]:
# Load the DataFrame
df_economic = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/economic_data2.pkl')
df_economic.head()

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,korea_10-3_year,...,us_2_year,us_10_year,us_10-2_year,us_10-3_year_month,apartment_supply,unsold_count,unsold_ratio,last_month_deal_count,last_month_full_rent_count,last_month_month_rent_count
0,2011-01-01,2011,1,1,93.0,2051.0,2.5,3.44,4.57,1.13,...,0.601,3.334,2.733,3.21,5345,2269,42.450889,7179,12336,2514
1,2011-01-02,2011,1,2,93.0,2051.0,2.5,3.44,4.57,1.13,...,0.601,3.334,2.733,3.21,5345,2269,42.450889,7179,12336,2514
2,2011-01-03,2011,1,3,93.0,2070.08,2.5,3.44,4.57,1.13,...,0.601,3.334,2.733,3.21,5345,2269,42.450889,7179,12336,2514
3,2011-01-04,2011,1,4,93.0,2085.14,2.5,3.495,4.58,1.085,...,0.621,3.338,2.717,3.196,5345,2269,42.450889,7179,12336,2514
4,2011-01-05,2011,1,5,93.0,2082.55,2.5,3.495,4.63,1.135,...,0.708,3.463,2.755,3.321,5345,2269,42.450889,7179,12336,2514


In [None]:
df_economic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4383 entries, 0 to 4382
Data columns (total 21 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   date                         4383 non-null   object 
 1   year                         4383 non-null   int16  
 2   month                        4383 non-null   int16  
 3   day                          4383 non-null   int16  
 4   apartment_index              4383 non-null   float64
 5   kospi_index                  4383 non-null   float64
 6   korea_rp                     4383 non-null   float64
 7   korea_3_year                 4383 non-null   float64
 8   korea_10_year                4383 non-null   float64
 9   korea_10-3_year              4383 non-null   float64
 10  us_3_month                   4383 non-null   float64
 11  us_2_year                    4383 non-null   float64
 12  us_10_year                   4383 non-null   float64
 13  us_10-2_year      

In [None]:
# Rename the columns
# Columns such as apartment_supply hold Seoul-wide totals, so prefix them with total_ to avoid confusion
df_economic.columns=['date','year','month','day','apartment_index',
                    'kospi_index','korea_rp','korea_3_year',
                    'korea_10_year','korea_10-3_year','us_3_month','us_2_year',
                    'us_10_year','us_10-2_year','us_10-3_year_month',
                    'total_apartment_supply','total_unsold_count', 'total_unsold_ratio',
                    'total_last_month_deal_count','total_last_month_full_rent_count',
                    'total_last_month_month_rent_count']

In [None]:
# Build the list of original indicator columns
origin_column_list = list(df_economic.columns)
origin_column_list = origin_column_list[5:]
origin_column_list

['kospi_index',
 'korea_rp',
 'korea_3_year',
 'korea_10_year',
 'korea_10-3_year',
 'us_3_month',
 'us_2_year',
 'us_10_year',
 'us_10-2_year',
 'us_10-3_year_month',
 'total_apartment_supply',
 'total_unsold_count',
 'total_unsold_ratio',
 'total_last_month_deal_count',
 'total_last_month_full_rent_count',
 'total_last_month_month_rent_count']

## Computing changes relative to 1, 3, 6, and 12 months earlier

- To capture the trend in each indicator, compute its change relative to its value 1, 3, 6, and 12 months earlier.
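
To illustrate what a "change relative to N months earlier" column might look like, here is a minimal sketch on toy data. The column names `value`, `value_1m_before`, and `value_1m_change` are hypothetical. A plain difference is only one possible definition of change, and the notebook may use a different one, such as a ratio.

```python
import pandas as pd

# Toy daily series (made-up values): value rises by 1 each day from 2011-01-01.
df = pd.DataFrame({'date': pd.date_range('2011-01-01', '2011-03-31', freq='D')})
df['value'] = list(range(len(df)))

# Look up the value one calendar month earlier via a self-merge on the shifted date.
df['past_date'] = df['date'] - pd.DateOffset(months=1)
past = df[['date', 'value']].rename(
    columns={'date': 'past_date', 'value': 'value_1m_before'})
merged = df.merge(past, on='past_date', how='inner')

# Change relative to one month earlier (plain difference; an assumption).
merged['value_1m_change'] = merged['value'] - merged['value_1m_before']
print(merged[['date', 'value', 'value_1m_before', 'value_1m_change']].head())
```

Rows from January 2011 have no value one month earlier in the toy data, so the inner join leaves them out.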

In [None]:
# Compute the dates 1, 3, 6, and 12 months before each row's date
# and store each past date as separate year / month / day columns
base_date = pd.to_datetime(df_economic['date'])
for months in [1, 3, 6, 12]:
    past_date = base_date - pd.DateOffset(months=months)
    df_economic[str(months)+'m_before_year'] = past_date.dt.year
    df_economic[str(months)+'m_before_month'] = past_date.dt.month
    df_economic[str(months)+'m_before_day'] = past_date.dt.day
df_economic

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,korea_10-3_year,...,1m_before_day,3m_before_year,3m_before_month,3m_before_day,6m_before_year,6m_before_month,6m_before_day,12m_before_year,12m_before_month,12m_before_day
0,2011-01-01,2011,1,1,93.0,2051.00,2.50,3.440,4.570,1.130,...,1,2010,10,1,2010,7,1,2010,1,1
1,2011-01-02,2011,1,2,93.0,2051.00,2.50,3.440,4.570,1.130,...,2,2010,10,2,2010,7,2,2010,1,2
2,2011-01-03,2011,1,3,93.0,2070.08,2.50,3.440,4.570,1.130,...,3,2010,10,3,2010,7,3,2010,1,3
3,2011-01-04,2011,1,4,93.0,2085.14,2.50,3.495,4.580,1.085,...,4,2010,10,4,2010,7,4,2010,1,4
4,2011-01-05,2011,1,5,93.0,2082.55,2.50,3.495,4.630,1.135,...,5,2010,10,5,2010,7,5,2010,1,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4378,2022-12-27,2022,12,27,104.4,2332.79,3.25,3.661,3.612,-0.049,...,27,2022,9,27,2022,6,27,2021,12,27
4379,2022-12-28,2022,12,28,104.4,2280.45,3.25,3.668,3.675,0.007,...,28,2022,9,28,2022,6,28,2021,12,28
4380,2022-12-29,2022,12,29,104.4,2236.40,3.25,3.718,3.723,0.005,...,29,2022,9,29,2022,6,29,2021,12,29
4381,2022-12-30,2022,12,30,104.4,2236.40,3.25,3.725,3.735,0.010,...,30,2022,9,30,2022,6,30,2021,12,30
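
One detail of `pd.DateOffset(months=n)` is worth knowing here. When the target month has no such day, the result is clamped to that month's last day, so several dates can share the same "before" date. A small check, with dates chosen only for illustration:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(
    ['2012-03-29', '2012-03-30', '2012-03-31', '2011-03-31']))
one_month_before = dates - pd.DateOffset(months=1)

# 2012 is a leap year, so the last three days of March 2012 all map to Feb 29;
# 2011 is not, so 2011-03-31 maps to Feb 28.
print(one_month_before.dt.strftime('%Y-%m-%d').tolist())
# ['2012-02-29', '2012-02-29', '2012-02-29', '2011-02-28']
```

As long as the lookup table has exactly one row per calendar day, each of these shared dates still matches a single row when merged.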


In [None]:
# Build the names of the columns to add
# (origin_column_list is the list of 16 indicator columns created in the earlier cell)
temp_column_total_list = list()
month_num_list = [1,3,6,12] # look back 1, 3, 6, and 12 months
for i in month_num_list:
    suffix = '_'+str(i)+'m_before'
    # Three temporary date-key columns, then one lagged column per original indicator
    column_list = ['temp_year'+suffix, 'temp_month'+suffix, 'temp_day'+suffix]
    column_list += [column+suffix for column in origin_column_list]
    temp_column_total_list.append(column_list)

In [None]:
temp_column_total_list[0]

['temp_year_1m_before',
 'temp_month_1m_before',
 'temp_day_1m_before',
 'kospi_index_1m_before',
 'korea_rp_1m_before',
 'korea_3_year_1m_before',
 'korea_10_year_1m_before',
 'korea_10-3_year_1m_before',
 'us_3_month_1m_before',
 'us_2_year_1m_before',
 'us_10_year_1m_before',
 'us_10-2_year_1m_before',
 'us_10-3_year_month_1m_before',
 'total_apartment_supply_1m_before',
 'total_unsold_count_1m_before',
 'total_unsold_ratio_1m_before',
 'total_last_month_deal_count_1m_before',
 'total_last_month_full_rent_count_1m_before',
 'total_last_month_month_rent_count_1m_before']

In [None]:
# df_economic columns to use as merge keys when adding the lagged columns
economic_columns_list = [['1m_before_year','1m_before_month','1m_before_day'],
                        ['3m_before_year','3m_before_month','3m_before_day'],
                        ['6m_before_year','6m_before_month','6m_before_day'],
                        ['12m_before_year','12m_before_month','12m_before_day']]

In [None]:
# Make a copy to use in the merge
# df_temp holds the past data to be attached
# Matching df_economic's past dates (before_year, before_month, etc.) against df_temp's 'year', 'month', 'day' adds the past values as new columns
df_temp= df_economic[['year','month','day',
                    'kospi_index','korea_rp','korea_3_year',
                    'korea_10_year','korea_10-3_year','us_3_month','us_2_year',
                    'us_10_year','us_10-2_year','us_10-3_year_month',
                    'total_apartment_supply','total_unsold_count', 'total_unsold_ratio',
                    'total_last_month_deal_count','total_last_month_full_rent_count',
                    'total_last_month_month_rent_count']].copy()

In [None]:
# Merge the lagged data into df_economic
pd.set_option('display.max_columns', 100)
for i,column_list in enumerate(temp_column_total_list):
    df_temp.columns = column_list
    # merge defaults to an inner join, so rows with no matching past date are dropped automatically
    df_economic = pd.merge(df_economic, df_temp, left_on=economic_columns_list[i], 
         right_on=[column_list[0],column_list[1],column_list[2]])
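
Because an inner join drops rows with no matching past date without any warning, it can help to count them first. This is a minimal sketch on toy data using `how='left'` with `indicator=True`. The frame names and values are hypothetical.

```python
import pandas as pd

# Toy data (made-up values)
current = pd.DataFrame({
    'date': pd.to_datetime(['2011-01-15', '2011-02-15', '2011-03-15']),
    'kospi_index': [2100.0, 2000.0, 1950.0],
})
current['past_date'] = current['date'] - pd.DateOffset(months=1)

lookup = current[['date', 'kospi_index']].rename(
    columns={'date': 'past_date', 'kospi_index': 'kospi_index_1m_before'})

# A left join keeps every row; the _merge column marks rows with no past match.
checked = current.merge(lookup, on='past_date', how='left', indicator=True)
n_unmatched = int((checked['_merge'] == 'left_only').sum())

# The inner join keeps only the matched rows.
inner = current.merge(lookup, on='past_date', how='inner')
print(n_unmatched, len(inner))
```

Here the 2011-01-15 row has no data for 2010-12-15, so it is the one row the inner join would drop.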

In [None]:
df_economic

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,korea_10-3_year,us_3_month,us_2_year,us_10_year,us_10-2_year,us_10-3_year_month,total_apartment_supply,total_unsold_count,total_unsold_ratio,total_last_month_deal_count,total_last_month_full_rent_count,total_last_month_month_rent_count,1m_before_year,1m_before_month,1m_before_day,3m_before_year,3m_before_month,3m_before_day,6m_before_year,6m_before_month,6m_before_day,12m_before_year,12m_before_month,12m_before_day,temp_year_1m_before,temp_month_1m_before,temp_day_1m_before,kospi_index_1m_before,korea_rp_1m_before,korea_3_year_1m_before,korea_10_year_1m_before,korea_10-3_year_1m_before,us_3_month_1m_before,us_2_year_1m_before,us_10_year_1m_before,us_10-2_year_1m_before,us_10-3_year_month_1m_before,total_apartment_supply_1m_before,total_unsold_count_1m_before,total_unsold_ratio_1m_before,total_last_month_deal_count_1m_before,...,korea_10-3_year_3m_before,us_3_month_3m_before,us_2_year_3m_before,us_10_year_3m_before,us_10-2_year_3m_before,us_10-3_year_month_3m_before,total_apartment_supply_3m_before,total_unsold_count_3m_before,total_unsold_ratio_3m_before,total_last_month_deal_count_3m_before,total_last_month_full_rent_count_3m_before,total_last_month_month_rent_count_3m_before,temp_year_6m_before,temp_month_6m_before,temp_day_6m_before,kospi_index_6m_before,korea_rp_6m_before,korea_3_year_6m_before,korea_10_year_6m_before,korea_10-3_year_6m_before,us_3_month_6m_before,us_2_year_6m_before,us_10_year_6m_before,us_10-2_year_6m_before,us_10-3_year_month_6m_before,total_apartment_supply_6m_before,total_unsold_count_6m_before,total_unsold_ratio_6m_before,total_last_month_deal_count_6m_before,total_last_month_full_rent_count_6m_before,total_last_month_month_rent_count_6m_before,temp_year_12m_before,temp_month_12m_before,temp_day_12m_before,kospi_index_12m_before,korea_rp_12m_before,korea_3_year_12m_before,korea_10_year_12m_before,korea_10-3_year_12m_before,us_3_month_12m_before,us_2_year_12m_before,us_10_year_12m_before,us_10-2_year_12m_before,us_10-3_year_month_12m_before,total_apartment_supply_12m_before,total_unsold_count_12m_before,total_unsold_ratio_12m_before,total_last_month_deal_count_12m_before,total_last_month_full_rent_count_12m_before,total_last_month_month_rent_count_12m_before
0,2012-01-01,2012,1,1,86.8,1825.74,3.25,3.320,3.770,0.450,0.020,0.2470,1.876,1.6290,1.856,6960,1861,26.738506,3712,9861,1988,2011,12,1,2011,10,1,2011,7,1,2011,1,1,2011,12,1,1916.18,3.25,3.390,3.800,0.410,-0.0030,0.2580,2.086,1.8280,2.0890,3764,1801,47.848034,3519,...,0.430,0.025,0.2470,1.917,1.6700,1.892,1272,1776,139.622642,4178,9272,2359,2011,7,1,2125.74,3.25,3.765,4.320,0.555,0.018,0.4740,3.182,2.7080,3.164,4051,1825,45.050605,3687,10008,2199,2011,1,1,2051.00,2.5,3.440,4.570,1.130,0.124,0.6010,3.334,2.7330,3.210,5345,2269,42.450889,7179,12336,2514
1,2012-01-02,2012,1,2,86.8,1826.37,3.25,3.340,3.780,0.440,0.020,0.2470,1.873,1.6260,1.853,6960,1861,26.738506,3712,9861,1988,2011,12,2,2011,10,2,2011,7,2,2011,1,2,2011,12,2,1916.04,3.25,3.380,3.790,0.410,0.0030,0.2540,2.033,1.7790,2.0300,3764,1801,47.848034,3519,...,0.430,0.025,0.2470,1.917,1.6700,1.892,1272,1776,139.622642,4178,9272,2359,2011,7,2,2125.74,3.25,3.765,4.320,0.555,0.018,0.4740,3.182,2.7080,3.164,4051,1825,45.050605,3687,10008,2199,2011,1,2,2051.00,2.5,3.440,4.570,1.130,0.124,0.6010,3.334,2.7330,3.210,5345,2269,42.450889,7179,12336,2514
2,2012-01-03,2012,1,3,86.8,1875.41,3.25,3.360,3.790,0.430,0.020,0.2590,1.956,1.6970,1.936,6960,1861,26.738506,3712,9861,1988,2011,12,3,2011,10,3,2011,7,3,2011,1,3,2011,12,3,1916.04,3.25,3.380,3.790,0.410,0.0030,0.2540,2.033,1.7790,2.0300,3764,1801,47.848034,3519,...,0.430,0.008,0.2350,1.749,1.5140,1.741,1272,1776,139.622642,4178,9272,2359,2011,7,3,2125.74,3.25,3.765,4.320,0.555,0.018,0.4740,3.182,2.7080,3.164,4051,1825,45.050605,3687,10008,2199,2011,1,3,2070.08,2.5,3.440,4.570,1.130,0.124,0.6010,3.334,2.7330,3.210,5345,2269,42.450889,7179,12336,2514
3,2012-01-04,2012,1,4,86.8,1866.22,3.25,3.360,3.790,0.430,0.015,0.2630,1.984,1.7210,1.969,6960,1861,26.738506,3712,9861,1988,2011,12,4,2011,10,4,2011,7,4,2011,1,4,2011,12,4,1916.04,3.25,3.380,3.790,0.410,0.0030,0.2540,2.033,1.7790,2.0300,3764,1801,47.848034,3519,...,0.280,0.008,0.2550,1.817,1.5620,1.809,1272,1776,139.622642,4178,9272,2359,2011,7,4,2145.30,3.25,3.775,4.320,0.545,0.015,0.4740,3.200,2.7260,3.185,4051,1825,45.050605,3687,10008,2199,2011,1,4,2085.14,2.5,3.495,4.580,1.085,0.142,0.6210,3.338,2.7170,3.196,5345,2269,42.450889,7179,12336,2514
4,2012-01-05,2012,1,5,86.8,1863.74,3.25,3.340,3.780,0.440,0.015,0.2630,1.996,1.7330,1.981,6960,1861,26.738506,3712,9861,1988,2011,12,5,2011,10,5,2011,7,5,2011,1,5,2011,12,5,1922.90,3.25,3.380,3.780,0.400,0.0100,0.2620,2.033,1.7710,2.0230,3764,1801,47.848034,3519,...,0.280,0.003,0.2590,1.891,1.6320,1.888,1272,1776,139.622642,4178,9272,2359,2011,7,5,2161.75,3.25,3.770,4.300,0.530,0.005,0.4340,3.119,2.6850,3.114,4051,1825,45.050605,3687,10008,2199,2011,1,5,2082.55,2.5,3.495,4.630,1.135,0.142,0.7080,3.463,2.7550,3.321,5345,2269,42.450889,7179,12336,2514
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4013,2022-12-27,2022,12,27,104.4,2332.79,3.25,3.661,3.612,-0.049,4.311,4.3827,3.849,-0.5337,-0.462,1759,865,49.175668,750,8890,7709,2022,11,27,2022,9,27,2022,6,27,2021,12,27,2022,11,27,2437.86,3.25,3.640,3.613,-0.027,4.4107,4.4629,3.694,-0.7689,-0.7167,1265,866,68.458498,574,...,-0.088,3.328,4.2871,3.949,-0.3381,0.621,1853,610,32.919590,760,11341,7415,2022,6,27,2401.92,1.75,3.570,3.724,0.154,1.685,3.1319,3.202,0.0701,1.517,105,688,655.238095,1844,11656,8696,2021,12,27,2999.55,1.0,1.776,2.212,0.436,0.056,0.7048,1.477,0.7722,1.421,2025,54,2.666667,1445,11186,6661
4014,2022-12-28,2022,12,28,104.4,2280.45,3.25,3.668,3.675,0.007,4.457,4.3574,3.886,-0.4714,-0.571,1759,865,49.175668,750,8890,7709,2022,11,28,2022,9,28,2022,6,28,2021,12,28,2022,11,28,2408.27,3.25,3.670,3.610,-0.060,4.4020,4.4443,3.679,-0.7653,-0.7230,1265,866,68.458498,574,...,0.015,3.374,4.1267,3.737,-0.3897,0.363,1853,610,32.919590,760,11341,7415,2022,6,28,2422.09,1.75,3.539,3.664,0.125,1.780,3.1116,3.177,0.0654,1.397,105,688,655.238095,1844,11656,8696,2021,12,28,3020.24,1.0,1.786,2.196,0.410,0.061,0.7559,1.484,0.7281,1.423,2025,54,2.666667,1445,11186,6661
4015,2022-12-29,2022,12,29,104.4,2236.40,3.25,3.718,3.723,0.005,4.423,4.3656,3.820,-0.5456,-0.603,1759,865,49.175668,750,8890,7709,2022,11,29,2022,9,29,2022,6,29,2021,12,29,2022,11,29,2433.39,3.25,3.723,3.672,-0.051,4.3840,4.4814,3.750,-0.7314,-0.6340,1265,866,68.458498,574,...,-0.096,3.314,4.1924,3.782,-0.4104,0.468,1853,610,32.919590,760,11341,7415,2022,6,29,2377.99,1.75,3.547,3.667,0.120,1.757,3.0426,3.087,0.0444,1.330,105,688,655.238095,1844,11656,8696,2021,12,29,2993.29,1.0,1.783,2.180,0.397,0.051,0.7520,1.556,0.8040,1.505,2025,54,2.666667,1445,11186,6661
4016,2022-12-30,2022,12,30,104.4,2236.40,3.25,3.725,3.735,0.010,4.405,4.4279,3.879,-0.5489,-0.526,1759,865,49.175668,750,8890,7709,2022,11,30,2022,9,30,2022,6,30,2021,12,30,2022,11,30,2472.53,3.25,3.666,3.654,-0.012,4.3640,4.3287,3.611,-0.7177,-0.7530,1265,866,68.458498,574,...,-0.105,3.283,4.2726,3.829,-0.4436,0.546,1853,610,32.919590,760,11341,7415,2022,6,30,2332.64,1.75,3.549,3.637,0.088,1.700,2.9554,3.017,0.0616,1.317,105,688,655.238095,1844,11656,8696,2021,12,30,2977.65,1.0,1.802,2.248,0.446,0.046,0.7303,1.507,0.7767,1.461,2025,54,2.666667,1445,11186,6661


In [None]:
df_economic.isnull().sum()[:50]

date                                     0
year                                     0
month                                    0
day                                      0
apartment_index                          0
kospi_index                              0
korea_rp                                 0
korea_3_year                             0
korea_10_year                            0
korea_10-3_year                          0
us_3_month                               0
us_2_year                                0
us_10_year                               0
us_10-2_year                             0
us_10-3_year_month                       0
total_apartment_supply                   0
total_unsold_count                       0
total_unsold_ratio                       0
total_last_month_deal_count              0
total_last_month_full_rent_count         0
total_last_month_month_rent_count        0
1m_before_year                           0
1m_before_month                          0
1m_before_d

In [None]:
df_economic2 = df_economic.copy()

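- We first tried to compute rates of change, but some values are 0, so the calculation produced null or inf in some cases. We therefore decided to use the amount of change (a simple difference) instead of the rate of change.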

>> When writing a formula, watch for cases where zero is the divisor or the dividend.
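
The division problem noted here can be reproduced on a toy frame. This is a minimal sketch with made-up numbers (the column names `current` and `past` are hypothetical, not columns from the project data): dividing by a past value of 0 yields `inf` or `NaN`, while the simple difference stays finite.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; the second and third rows have a past value of 0
toy = pd.DataFrame({"current": [110.0, 5.0, 0.0],
                    "past":    [100.0, 0.0, 0.0]})

# Rate of change: (current - past) / past -> inf for 5/0, NaN for 0/0
toy["rate"] = (toy["current"] - toy["past"]) / toy["past"]

# Amount of change: current - past -> finite whenever the inputs are finite
toy["diff"] = toy["current"] - toy["past"]

print(toy)
```

The difference loses the scale-free interpretation of a rate, but it never needs a guard against zero denominators.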

In [None]:
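# Overwrite the existing lagged columns with the computed change amounts
for column_list in temp_column_total_list:
    for i,column_name in enumerate(column_list[3:]): # column_name is a column holding the past-date value
        # change amount = current value - past value (stored over the column that previously held the past value)
        df_economic2[column_name] = df_economic2[origin_column_list[i]] - df_economic2[column_name]
df_economic2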

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,korea_10-3_year,us_3_month,us_2_year,us_10_year,us_10-2_year,us_10-3_year_month,total_apartment_supply,total_unsold_count,total_unsold_ratio,total_last_month_deal_count,total_last_month_full_rent_count,total_last_month_month_rent_count,1m_before_year,1m_before_month,1m_before_day,3m_before_year,3m_before_month,3m_before_day,6m_before_year,6m_before_month,6m_before_day,12m_before_year,12m_before_month,12m_before_day,temp_year_1m_before,temp_month_1m_before,temp_day_1m_before,kospi_index_1m_before,korea_rp_1m_before,korea_3_year_1m_before,korea_10_year_1m_before,korea_10-3_year_1m_before,us_3_month_1m_before,us_2_year_1m_before,us_10_year_1m_before,us_10-2_year_1m_before,us_10-3_year_month_1m_before,total_apartment_supply_1m_before,total_unsold_count_1m_before,total_unsold_ratio_1m_before,total_last_month_deal_count_1m_before,...,korea_10-3_year_3m_before,us_3_month_3m_before,us_2_year_3m_before,us_10_year_3m_before,us_10-2_year_3m_before,us_10-3_year_month_3m_before,total_apartment_supply_3m_before,total_unsold_count_3m_before,total_unsold_ratio_3m_before,total_last_month_deal_count_3m_before,total_last_month_full_rent_count_3m_before,total_last_month_month_rent_count_3m_before,temp_year_6m_before,temp_month_6m_before,temp_day_6m_before,kospi_index_6m_before,korea_rp_6m_before,korea_3_year_6m_before,korea_10_year_6m_before,korea_10-3_year_6m_before,us_3_month_6m_before,us_2_year_6m_before,us_10_year_6m_before,us_10-2_year_6m_before,us_10-3_year_month_6m_before,total_apartment_supply_6m_before,total_unsold_count_6m_before,total_unsold_ratio_6m_before,total_last_month_deal_count_6m_before,total_last_month_full_rent_count_6m_before,total_last_month_month_rent_count_6m_before,temp_year_12m_before,temp_month_12m_before,temp_day_12m_before,kospi_index_12m_before,korea_rp_12m_before,korea_3_year_12m_before,korea_10_year_12m_before,korea_10-3_year_12m_before,us_3_month_12m_before,u
s_2_year_12m_before,us_10_year_12m_before,us_10-2_year_12m_before,us_10-3_year_month_12m_before,total_apartment_supply_12m_before,total_unsold_count_12m_before,total_unsold_ratio_12m_before,total_last_month_deal_count_12m_before,total_last_month_full_rent_count_12m_before,total_last_month_month_rent_count_12m_before
0,2012-01-01,2012,1,1,86.8,1825.74,3.25,3.320,3.770,0.450,0.020,0.2470,1.876,1.6290,1.856,6960,1861,26.738506,3712,9861,1988,2011,12,1,2011,10,1,2011,7,1,2011,1,1,2011,12,1,-90.44,0.0,-0.070,-0.030,0.040,0.0230,-0.0110,-0.210,-0.1990,-0.2330,3196,60,-21.109528,193,...,0.020,-0.005,0.0000,-0.041,-0.0410,-0.036,5688,85,-112.884136,-466,589,-371,2011,7,1,-300.00,0.0,-0.445,-0.550,-0.105,0.002,-0.2270,-1.306,-1.0790,-1.308,2909,36,-18.312099,25,-147,-211,2011,1,1,-225.26,0.75,-0.120,-0.800,-0.680,-0.104,-0.3540,-1.458,-1.1040,-1.354,1615,-408,-15.712383,-3467,-2475,-526
1,2012-01-02,2012,1,2,86.8,1826.37,3.25,3.340,3.780,0.440,0.020,0.2470,1.873,1.6260,1.853,6960,1861,26.738506,3712,9861,1988,2011,12,2,2011,10,2,2011,7,2,2011,1,2,2011,12,2,-89.67,0.0,-0.040,-0.010,0.030,0.0170,-0.0070,-0.160,-0.1530,-0.1770,3196,60,-21.109528,193,...,0.010,-0.005,0.0000,-0.044,-0.0440,-0.039,5688,85,-112.884136,-466,589,-371,2011,7,2,-299.37,0.0,-0.425,-0.540,-0.115,0.002,-0.2270,-1.309,-1.0820,-1.311,2909,36,-18.312099,25,-147,-211,2011,1,2,-224.63,0.75,-0.100,-0.790,-0.690,-0.104,-0.3540,-1.461,-1.1070,-1.357,1615,-408,-15.712383,-3467,-2475,-526
2,2012-01-03,2012,1,3,86.8,1875.41,3.25,3.360,3.790,0.430,0.020,0.2590,1.956,1.6970,1.936,6960,1861,26.738506,3712,9861,1988,2011,12,3,2011,10,3,2011,7,3,2011,1,3,2011,12,3,-40.63,0.0,-0.020,0.000,0.020,0.0170,0.0050,-0.077,-0.0820,-0.0940,3196,60,-21.109528,193,...,0.000,0.012,0.0240,0.207,0.1830,0.195,5688,85,-112.884136,-466,589,-371,2011,7,3,-250.33,0.0,-0.405,-0.530,-0.125,0.002,-0.2150,-1.226,-1.0110,-1.228,2909,36,-18.312099,25,-147,-211,2011,1,3,-194.67,0.75,-0.080,-0.780,-0.700,-0.104,-0.3420,-1.378,-1.0360,-1.274,1615,-408,-15.712383,-3467,-2475,-526
3,2012-01-04,2012,1,4,86.8,1866.22,3.25,3.360,3.790,0.430,0.015,0.2630,1.984,1.7210,1.969,6960,1861,26.738506,3712,9861,1988,2011,12,4,2011,10,4,2011,7,4,2011,1,4,2011,12,4,-49.82,0.0,-0.020,0.000,0.020,0.0120,0.0090,-0.049,-0.0580,-0.0610,3196,60,-21.109528,193,...,0.150,0.007,0.0080,0.167,0.1590,0.160,5688,85,-112.884136,-466,589,-371,2011,7,4,-279.08,0.0,-0.415,-0.530,-0.115,0.000,-0.2110,-1.216,-1.0050,-1.216,2909,36,-18.312099,25,-147,-211,2011,1,4,-218.92,0.75,-0.135,-0.790,-0.655,-0.127,-0.3580,-1.354,-0.9960,-1.227,1615,-408,-15.712383,-3467,-2475,-526
4,2012-01-05,2012,1,5,86.8,1863.74,3.25,3.340,3.780,0.440,0.015,0.2630,1.996,1.7330,1.981,6960,1861,26.738506,3712,9861,1988,2011,12,5,2011,10,5,2011,7,5,2011,1,5,2011,12,5,-59.16,0.0,-0.040,0.000,0.040,0.0050,0.0010,-0.037,-0.0380,-0.0420,3196,60,-21.109528,193,...,0.160,0.012,0.0040,0.105,0.1010,0.093,5688,85,-112.884136,-466,589,-371,2011,7,5,-298.01,0.0,-0.430,-0.520,-0.090,0.010,-0.1710,-1.123,-0.9520,-1.133,2909,36,-18.312099,25,-147,-211,2011,1,5,-218.81,0.75,-0.155,-0.850,-0.695,-0.127,-0.4450,-1.467,-1.0220,-1.340,1615,-408,-15.712383,-3467,-2475,-526
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4013,2022-12-27,2022,12,27,104.4,2332.79,3.25,3.661,3.612,-0.049,4.311,4.3827,3.849,-0.5337,-0.462,1759,865,49.175668,750,8890,7709,2022,11,27,2022,9,27,2022,6,27,2021,12,27,2022,11,27,-105.07,0.0,0.021,-0.001,-0.022,-0.0997,-0.0802,0.155,0.2352,0.2547,494,-1,-19.282830,176,...,0.039,0.983,0.0956,-0.100,-0.1956,-1.083,-94,255,16.256078,-10,-2451,294,2022,6,27,-69.13,1.5,0.091,-0.112,-0.203,2.626,1.2508,0.647,-0.6038,-1.979,1654,177,-606.062427,-1094,-2766,-987,2021,12,27,-666.76,2.25,1.885,1.400,-0.485,4.255,3.6779,2.372,-1.3059,-1.883,-266,811,46.509001,-695,-2296,1048
4014,2022-12-28,2022,12,28,104.4,2280.45,3.25,3.668,3.675,0.007,4.457,4.3574,3.886,-0.4714,-0.571,1759,865,49.175668,750,8890,7709,2022,11,28,2022,9,28,2022,6,28,2021,12,28,2022,11,28,-127.82,0.0,-0.002,0.065,0.067,0.0550,-0.0869,0.207,0.2939,0.1520,494,-1,-19.282830,176,...,-0.008,1.083,0.2307,0.149,-0.0817,-0.934,-94,255,16.256078,-10,-2451,294,2022,6,28,-141.64,1.5,0.129,0.011,-0.118,2.677,1.2458,0.709,-0.5368,-1.968,1654,177,-606.062427,-1094,-2766,-987,2021,12,28,-739.79,2.25,1.882,1.479,-0.403,4.396,3.6015,2.402,-1.1995,-1.994,-266,811,46.509001,-695,-2296,1048
4015,2022-12-29,2022,12,29,104.4,2236.40,3.25,3.718,3.723,0.005,4.423,4.3656,3.820,-0.5456,-0.603,1759,865,49.175668,750,8890,7709,2022,11,29,2022,9,29,2022,6,29,2021,12,29,2022,11,29,-196.99,0.0,-0.005,0.051,0.056,0.0390,-0.1158,0.070,0.1858,0.0310,494,-1,-19.282830,176,...,0.101,1.109,0.1732,0.038,-0.1352,-1.071,-94,255,16.256078,-10,-2451,294,2022,6,29,-141.59,1.5,0.171,0.056,-0.115,2.666,1.3230,0.733,-0.5900,-1.933,1654,177,-606.062427,-1094,-2766,-987,2021,12,29,-756.89,2.25,1.935,1.543,-0.392,4.372,3.6136,2.264,-1.3496,-2.108,-266,811,46.509001,-695,-2296,1048
4016,2022-12-30,2022,12,30,104.4,2236.40,3.25,3.725,3.735,0.010,4.405,4.4279,3.879,-0.5489,-0.526,1759,865,49.175668,750,8890,7709,2022,11,30,2022,9,30,2022,6,30,2021,12,30,2022,11,30,-236.13,0.0,0.059,0.081,0.022,0.0410,0.0992,0.268,0.1688,0.2270,494,-1,-19.282830,176,...,0.115,1.122,0.1553,0.050,-0.1053,-1.072,-94,255,16.256078,-10,-2451,294,2022,6,30,-96.24,1.5,0.176,0.098,-0.078,2.705,1.4725,0.862,-0.6105,-1.843,1654,177,-606.062427,-1094,-2766,-987,2021,12,30,-741.25,2.25,1.923,1.487,-0.436,4.359,3.6976,2.372,-1.3256,-1.987,-266,811,46.509001,-695,-2296,1048
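
The overwrite loop relies on `temp_column_total_list` and `origin_column_list`, which are defined earlier in the notebook. Below is a minimal sketch of the same pattern on toy data. It assumes, as the `[3:]` slice suggests, that each inner list starts with three date-helper columns followed by lagged columns in the same order as the current-value columns. The names `origin_columns` and `lag_column_lists` are hypothetical stand-ins, and the two rows reuse the first two KOSPI and base-rate values shown in the tables above.

```python
import pandas as pd

toy = pd.DataFrame({
    "kospi_index": [1825.74, 1826.37],
    "korea_rp": [3.25, 3.25],
    "temp_year_1m_before": [2011, 2011],
    "temp_month_1m_before": [12, 12],
    "temp_day_1m_before": [1, 2],
    "kospi_index_1m_before": [1916.18, 1916.04],
    "korea_rp_1m_before": [3.25, 3.25],
})

# Hypothetical stand-ins for origin_column_list / temp_column_total_list
origin_columns = ["kospi_index", "korea_rp"]
lag_column_lists = [[
    "temp_year_1m_before", "temp_month_1m_before", "temp_day_1m_before",
    "kospi_index_1m_before", "korea_rp_1m_before",
]]

for lag_columns in lag_column_lists:
    # Skip the three date-helper columns; zip pairs each lagged column
    # with its current-value counterpart by position.
    for origin_name, lag_name in zip(origin_columns, lag_columns[3:]):
        toy[lag_name] = toy[origin_name] - toy[lag_name]

print(toy[["kospi_index_1m_before", "korea_rp_1m_before"]])
```

Pairing by position only works while both lists keep the same column order, so any reordering of one list must be mirrored in the other.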


In [None]:
# Replace inf values with NaN, then count nulls so that inf and null values are counted together
import numpy as np
df_economic2.replace([np.inf, -np.inf], np.nan, inplace=True)
var = df_economic2.isnull().sum()
print(var.to_string())

date                                            0
year                                            0
month                                           0
day                                             0
apartment_index                                 0
kospi_index                                     0
korea_rp                                        0
korea_3_year                                    0
korea_10_year                                   0
korea_10-3_year                                 0
us_3_month                                      0
us_2_year                                       0
us_10_year                                      0
us_10-2_year                                    0
us_10-3_year_month                              0
total_apartment_supply                          0
total_unsold_count                              0
total_unsold_ratio                              0
total_last_month_deal_count                     0
total_last_month_full_rent_count                0


In [None]:
# Drop the columns that are no longer used
df_economic2.drop(["1m_before_year", "1m_before_month","1m_before_day",
                 "3m_before_year", "3m_before_month","3m_before_day",
                 "6m_before_year", "6m_before_month","6m_before_day",
                 "12m_before_year", "12m_before_month","12m_before_day",
                 "temp_year_1m_before", "temp_month_1m_before","temp_day_1m_before",
                 "temp_year_3m_before", "temp_month_3m_before","temp_day_3m_before",
                 "temp_year_6m_before", "temp_month_6m_before","temp_day_6m_before",
                 "temp_year_12m_before", "temp_month_12m_before","temp_day_12m_before"], 
                 inplace=True,axis=1)
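
The 24 names in this drop call follow two naming patterns, so the list can also be generated instead of typed out. This is a sketch that assumes only the patterns visible in the cell above; the toy frame and the variable names `periods`, `parts`, and `drop_columns` are hypothetical.

```python
import pandas as pd

periods = ["1m", "3m", "6m", "12m"]
parts = ["year", "month", "day"]

# Pattern 1: "<period>_before_<part>", pattern 2: "temp_<part>_<period>_before"
drop_columns = (
    [f"{p}_before_{part}" for p in periods for part in parts]
    + [f"temp_{part}_{p}_before" for p in periods for part in parts]
)

# Toy frame holding every helper column plus one column to keep
toy = pd.DataFrame({name: [0] for name in drop_columns + ["kospi_index"]})
toy = toy.drop(columns=drop_columns)

print(len(drop_columns), list(toy.columns))
```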

In [None]:
df_economic2

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,korea_10-3_year,us_3_month,us_2_year,us_10_year,us_10-2_year,us_10-3_year_month,total_apartment_supply,total_unsold_count,total_unsold_ratio,total_last_month_deal_count,total_last_month_full_rent_count,total_last_month_month_rent_count,kospi_index_1m_before,korea_rp_1m_before,korea_3_year_1m_before,korea_10_year_1m_before,korea_10-3_year_1m_before,us_3_month_1m_before,us_2_year_1m_before,us_10_year_1m_before,us_10-2_year_1m_before,us_10-3_year_month_1m_before,total_apartment_supply_1m_before,total_unsold_count_1m_before,total_unsold_ratio_1m_before,total_last_month_deal_count_1m_before,total_last_month_full_rent_count_1m_before,total_last_month_month_rent_count_1m_before,kospi_index_3m_before,korea_rp_3m_before,korea_3_year_3m_before,korea_10_year_3m_before,korea_10-3_year_3m_before,us_3_month_3m_before,us_2_year_3m_before,us_10_year_3m_before,us_10-2_year_3m_before,us_10-3_year_month_3m_before,total_apartment_supply_3m_before,total_unsold_count_3m_before,total_unsold_ratio_3m_before,total_last_month_deal_count_3m_before,total_last_month_full_rent_count_3m_before,total_last_month_month_rent_count_3m_before,kospi_index_6m_before,korea_rp_6m_before,korea_3_year_6m_before,korea_10_year_6m_before,korea_10-3_year_6m_before,us_3_month_6m_before,us_2_year_6m_before,us_10_year_6m_before,us_10-2_year_6m_before,us_10-3_year_month_6m_before,total_apartment_supply_6m_before,total_unsold_count_6m_before,total_unsold_ratio_6m_before,total_last_month_deal_count_6m_before,total_last_month_full_rent_count_6m_before,total_last_month_month_rent_count_6m_before,kospi_index_12m_before,korea_rp_12m_before,korea_3_year_12m_before,korea_10_year_12m_before,korea_10-3_year_12m_before,us_3_month_12m_before,us_2_year_12m_before,us_10_year_12m_before,us_10-2_year_12m_before,us_10-3_year_month_12m_before,total_apartment_supply_12m_before,total_unsold_count_12m_before,total_unsold_ratio_12m_before,total
_last_month_deal_count_12m_before,total_last_month_full_rent_count_12m_before,total_last_month_month_rent_count_12m_before
0,2012-01-01,2012,1,1,86.8,1825.74,3.25,3.320,3.770,0.450,0.020,0.2470,1.876,1.6290,1.856,6960,1861,26.738506,3712,9861,1988,-90.44,0.0,-0.070,-0.030,0.040,0.0230,-0.0110,-0.210,-0.1990,-0.2330,3196,60,-21.109528,193,1629,165,56.09,0.00,-0.180,-0.160,0.020,-0.005,0.0000,-0.041,-0.0410,-0.036,5688,85,-112.884136,-466,589,-371,-300.00,0.0,-0.445,-0.550,-0.105,0.002,-0.2270,-1.306,-1.0790,-1.308,2909,36,-18.312099,25,-147,-211,-225.26,0.75,-0.120,-0.800,-0.680,-0.104,-0.3540,-1.458,-1.1040,-1.354,1615,-408,-15.712383,-3467,-2475,-526
1,2012-01-02,2012,1,2,86.8,1826.37,3.25,3.340,3.780,0.440,0.020,0.2470,1.873,1.6260,1.853,6960,1861,26.738506,3712,9861,1988,-89.67,0.0,-0.040,-0.010,0.030,0.0170,-0.0070,-0.160,-0.1530,-0.1770,3196,60,-21.109528,193,1629,165,56.72,0.00,-0.160,-0.150,0.010,-0.005,0.0000,-0.044,-0.0440,-0.039,5688,85,-112.884136,-466,589,-371,-299.37,0.0,-0.425,-0.540,-0.115,0.002,-0.2270,-1.309,-1.0820,-1.311,2909,36,-18.312099,25,-147,-211,-224.63,0.75,-0.100,-0.790,-0.690,-0.104,-0.3540,-1.461,-1.1070,-1.357,1615,-408,-15.712383,-3467,-2475,-526
2,2012-01-03,2012,1,3,86.8,1875.41,3.25,3.360,3.790,0.430,0.020,0.2590,1.956,1.6970,1.936,6960,1861,26.738506,3712,9861,1988,-40.63,0.0,-0.020,0.000,0.020,0.0170,0.0050,-0.077,-0.0820,-0.0940,3196,60,-21.109528,193,1629,165,105.76,0.00,-0.140,-0.140,0.000,0.012,0.0240,0.207,0.1830,0.195,5688,85,-112.884136,-466,589,-371,-250.33,0.0,-0.405,-0.530,-0.125,0.002,-0.2150,-1.226,-1.0110,-1.228,2909,36,-18.312099,25,-147,-211,-194.67,0.75,-0.080,-0.780,-0.700,-0.104,-0.3420,-1.378,-1.0360,-1.274,1615,-408,-15.712383,-3467,-2475,-526
3,2012-01-04,2012,1,4,86.8,1866.22,3.25,3.360,3.790,0.430,0.015,0.2630,1.984,1.7210,1.969,6960,1861,26.738506,3712,9861,1988,-49.82,0.0,-0.020,0.000,0.020,0.0120,0.0090,-0.049,-0.0580,-0.0610,3196,60,-21.109528,193,1629,165,160.03,0.00,-0.160,-0.010,0.150,0.007,0.0080,0.167,0.1590,0.160,5688,85,-112.884136,-466,589,-371,-279.08,0.0,-0.415,-0.530,-0.115,0.000,-0.2110,-1.216,-1.0050,-1.216,2909,36,-18.312099,25,-147,-211,-218.92,0.75,-0.135,-0.790,-0.655,-0.127,-0.3580,-1.354,-0.9960,-1.227,1615,-408,-15.712383,-3467,-2475,-526
4,2012-01-05,2012,1,5,86.8,1863.74,3.25,3.340,3.780,0.440,0.015,0.2630,1.996,1.7330,1.981,6960,1861,26.738506,3712,9861,1988,-59.16,0.0,-0.040,0.000,0.040,0.0050,0.0010,-0.037,-0.0380,-0.0420,3196,60,-21.109528,193,1629,165,197.22,0.00,-0.160,0.000,0.160,0.012,0.0040,0.105,0.1010,0.093,5688,85,-112.884136,-466,589,-371,-298.01,0.0,-0.430,-0.520,-0.090,0.010,-0.1710,-1.123,-0.9520,-1.133,2909,36,-18.312099,25,-147,-211,-218.81,0.75,-0.155,-0.850,-0.695,-0.127,-0.4450,-1.467,-1.0220,-1.340,1615,-408,-15.712383,-3467,-2475,-526
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4013,2022-12-27,2022,12,27,104.4,2332.79,3.25,3.661,3.612,-0.049,4.311,4.3827,3.849,-0.5337,-0.462,1759,865,49.175668,750,8890,7709,-105.07,0.0,0.021,-0.001,-0.022,-0.0997,-0.0802,0.155,0.2352,0.2547,494,-1,-19.282830,176,-1669,15,108.93,0.75,-0.633,-0.594,0.039,0.983,0.0956,-0.100,-0.1956,-1.083,-94,255,16.256078,-10,-2451,294,-69.13,1.5,0.091,-0.112,-0.203,2.626,1.2508,0.647,-0.6038,-1.979,1654,177,-606.062427,-1094,-2766,-987,-666.76,2.25,1.885,1.400,-0.485,4.255,3.6779,2.372,-1.3059,-1.883,-266,811,46.509001,-695,-2296,1048
4014,2022-12-28,2022,12,28,104.4,2280.45,3.25,3.668,3.675,0.007,4.457,4.3574,3.886,-0.4714,-0.571,1759,865,49.175668,750,8890,7709,-127.82,0.0,-0.002,0.065,0.067,0.0550,-0.0869,0.207,0.2939,0.1520,494,-1,-19.282830,176,-1669,15,111.16,0.75,-0.548,-0.556,-0.008,1.083,0.2307,0.149,-0.0817,-0.934,-94,255,16.256078,-10,-2451,294,-141.64,1.5,0.129,0.011,-0.118,2.677,1.2458,0.709,-0.5368,-1.968,1654,177,-606.062427,-1094,-2766,-987,-739.79,2.25,1.882,1.479,-0.403,4.396,3.6015,2.402,-1.1995,-1.994,-266,811,46.509001,-695,-2296,1048
4015,2022-12-29,2022,12,29,104.4,2236.40,3.25,3.718,3.723,0.005,4.423,4.3656,3.820,-0.5456,-0.603,1759,865,49.175668,750,8890,7709,-196.99,0.0,-0.005,0.051,0.056,0.0390,-0.1158,0.070,0.1858,0.0310,494,-1,-19.282830,176,-1669,15,65.47,0.75,-0.605,-0.504,0.101,1.109,0.1732,0.038,-0.1352,-1.071,-94,255,16.256078,-10,-2451,294,-141.59,1.5,0.171,0.056,-0.115,2.666,1.3230,0.733,-0.5900,-1.933,1654,177,-606.062427,-1094,-2766,-987,-756.89,2.25,1.935,1.543,-0.392,4.372,3.6136,2.264,-1.3496,-2.108,-266,811,46.509001,-695,-2296,1048
4016,2022-12-30,2022,12,30,104.4,2236.40,3.25,3.725,3.735,0.010,4.405,4.4279,3.879,-0.5489,-0.526,1759,865,49.175668,750,8890,7709,-236.13,0.0,0.059,0.081,0.022,0.0410,0.0992,0.268,0.1688,0.2270,494,-1,-19.282830,176,-1669,15,80.91,0.75,-0.489,-0.374,0.115,1.122,0.1553,0.050,-0.1053,-1.072,-94,255,16.256078,-10,-2451,294,-96.24,1.5,0.176,0.098,-0.078,2.705,1.4725,0.862,-0.6105,-1.843,1654,177,-606.062427,-1094,-2766,-987,-741.25,2.25,1.923,1.487,-0.436,4.359,3.6976,2.372,-1.3256,-1.987,-266,811,46.509001,-695,-2296,1048


In [None]:
df_economic2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4018 entries, 0 to 4017
Data columns (total 85 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   date                                          4018 non-null   object 
 1   year                                          4018 non-null   int16  
 2   month                                         4018 non-null   int16  
 3   day                                           4018 non-null   int16  
 4   apartment_index                               4018 non-null   float64
 5   kospi_index                                   4018 non-null   float64
 6   korea_rp                                      4018 non-null   float64
 7   korea_3_year                                  4018 non-null   float64
 8   korea_10_year                                 4018 non-null   float64
 9   korea_10-3_year                               4018 non-null   f

In [None]:
# Convert float64 columns to float32 to reduce memory usage
df_economic2_columns = list(df_economic2.columns)
for df_economic2_column in df_economic2_columns:
    if df_economic2[df_economic2_column].dtype == 'float64':
        df_economic2[df_economic2_column] = df_economic2[df_economic2_column].astype('float32')

In [None]:
df_economic2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4018 entries, 0 to 4017
Data columns (total 85 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   date                                          4018 non-null   object 
 1   year                                          4018 non-null   int16  
 2   month                                         4018 non-null   int16  
 3   day                                           4018 non-null   int16  
 4   apartment_index                               4018 non-null   float32
 5   kospi_index                                   4018 non-null   float32
 6   korea_rp                                      4018 non-null   float32
 7   korea_3_year                                  4018 non-null   float32
 8   korea_10_year                                 4018 non-null   float32
 9   korea_10-3_year                               4018 non-null   f

In [None]:
df_economic2

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,korea_10-3_year,us_3_month,us_2_year,us_10_year,us_10-2_year,us_10-3_year_month,total_apartment_supply,total_unsold_count,total_unsold_ratio,total_last_month_deal_count,total_last_month_full_rent_count,total_last_month_month_rent_count,kospi_index_1m_before,korea_rp_1m_before,korea_3_year_1m_before,korea_10_year_1m_before,korea_10-3_year_1m_before,us_3_month_1m_before,us_2_year_1m_before,us_10_year_1m_before,us_10-2_year_1m_before,us_10-3_year_month_1m_before,total_apartment_supply_1m_before,total_unsold_count_1m_before,total_unsold_ratio_1m_before,total_last_month_deal_count_1m_before,total_last_month_full_rent_count_1m_before,total_last_month_month_rent_count_1m_before,kospi_index_3m_before,korea_rp_3m_before,korea_3_year_3m_before,korea_10_year_3m_before,korea_10-3_year_3m_before,us_3_month_3m_before,us_2_year_3m_before,us_10_year_3m_before,us_10-2_year_3m_before,us_10-3_year_month_3m_before,total_apartment_supply_3m_before,total_unsold_count_3m_before,total_unsold_ratio_3m_before,total_last_month_deal_count_3m_before,total_last_month_full_rent_count_3m_before,total_last_month_month_rent_count_3m_before,kospi_index_6m_before,korea_rp_6m_before,korea_3_year_6m_before,korea_10_year_6m_before,korea_10-3_year_6m_before,us_3_month_6m_before,us_2_year_6m_before,us_10_year_6m_before,us_10-2_year_6m_before,us_10-3_year_month_6m_before,total_apartment_supply_6m_before,total_unsold_count_6m_before,total_unsold_ratio_6m_before,total_last_month_deal_count_6m_before,total_last_month_full_rent_count_6m_before,total_last_month_month_rent_count_6m_before,kospi_index_12m_before,korea_rp_12m_before,korea_3_year_12m_before,korea_10_year_12m_before,korea_10-3_year_12m_before,us_3_month_12m_before,us_2_year_12m_before,us_10_year_12m_before,us_10-2_year_12m_before,us_10-3_year_month_12m_before,total_apartment_supply_12m_before,total_unsold_count_12m_before,total_unsold_ratio_12m_before,total
_last_month_deal_count_12m_before,total_last_month_full_rent_count_12m_before,total_last_month_month_rent_count_12m_before
0,2012-01-01,2012,1,1,86.800003,1825.739990,3.25,3.320,3.770,0.450,0.020,0.2470,1.876,1.6290,1.856,6960,1861,26.738506,3712,9861,1988,-90.440002,0.0,-0.070,-0.030,0.040,0.0230,-0.0110,-0.210,-0.1990,-0.2330,3196,60,-21.109528,193,1629,165,56.090000,0.00,-0.180,-0.160,0.020,-0.005,0.0000,-0.041,-0.0410,-0.036,5688,85,-112.884132,-466,589,-371,-300.000000,0.0,-0.445,-0.550,-0.105,0.002,-0.2270,-1.306,-1.0790,-1.308,2909,36,-18.312099,25,-147,-211,-225.259995,0.75,-0.120,-0.800,-0.680,-0.104,-0.3540,-1.458,-1.1040,-1.354,1615,-408,-15.712383,-3467,-2475,-526
1,2012-01-02,2012,1,2,86.800003,1826.369995,3.25,3.340,3.780,0.440,0.020,0.2470,1.873,1.6260,1.853,6960,1861,26.738506,3712,9861,1988,-89.669998,0.0,-0.040,-0.010,0.030,0.0170,-0.0070,-0.160,-0.1530,-0.1770,3196,60,-21.109528,193,1629,165,56.720001,0.00,-0.160,-0.150,0.010,-0.005,0.0000,-0.044,-0.0440,-0.039,5688,85,-112.884132,-466,589,-371,-299.369995,0.0,-0.425,-0.540,-0.115,0.002,-0.2270,-1.309,-1.0820,-1.311,2909,36,-18.312099,25,-147,-211,-224.630005,0.75,-0.100,-0.790,-0.690,-0.104,-0.3540,-1.461,-1.1070,-1.357,1615,-408,-15.712383,-3467,-2475,-526
2,2012-01-03,2012,1,3,86.800003,1875.410034,3.25,3.360,3.790,0.430,0.020,0.2590,1.956,1.6970,1.936,6960,1861,26.738506,3712,9861,1988,-40.630001,0.0,-0.020,0.000,0.020,0.0170,0.0050,-0.077,-0.0820,-0.0940,3196,60,-21.109528,193,1629,165,105.760002,0.00,-0.140,-0.140,0.000,0.012,0.0240,0.207,0.1830,0.195,5688,85,-112.884132,-466,589,-371,-250.330002,0.0,-0.405,-0.530,-0.125,0.002,-0.2150,-1.226,-1.0110,-1.228,2909,36,-18.312099,25,-147,-211,-194.669998,0.75,-0.080,-0.780,-0.700,-0.104,-0.3420,-1.378,-1.0360,-1.274,1615,-408,-15.712383,-3467,-2475,-526
3,2012-01-04,2012,1,4,86.800003,1866.219971,3.25,3.360,3.790,0.430,0.015,0.2630,1.984,1.7210,1.969,6960,1861,26.738506,3712,9861,1988,-49.820000,0.0,-0.020,0.000,0.020,0.0120,0.0090,-0.049,-0.0580,-0.0610,3196,60,-21.109528,193,1629,165,160.029999,0.00,-0.160,-0.010,0.150,0.007,0.0080,0.167,0.1590,0.160,5688,85,-112.884132,-466,589,-371,-279.079987,0.0,-0.415,-0.530,-0.115,0.000,-0.2110,-1.216,-1.0050,-1.216,2909,36,-18.312099,25,-147,-211,-218.919998,0.75,-0.135,-0.790,-0.655,-0.127,-0.3580,-1.354,-0.9960,-1.227,1615,-408,-15.712383,-3467,-2475,-526
4,2012-01-05,2012,1,5,86.800003,1863.739990,3.25,3.340,3.780,0.440,0.015,0.2630,1.996,1.7330,1.981,6960,1861,26.738506,3712,9861,1988,-59.160000,0.0,-0.040,0.000,0.040,0.0050,0.0010,-0.037,-0.0380,-0.0420,3196,60,-21.109528,193,1629,165,197.220001,0.00,-0.160,0.000,0.160,0.012,0.0040,0.105,0.1010,0.093,5688,85,-112.884132,-466,589,-371,-298.010010,0.0,-0.430,-0.520,-0.090,0.010,-0.1710,-1.123,-0.9520,-1.133,2909,36,-18.312099,25,-147,-211,-218.809998,0.75,-0.155,-0.850,-0.695,-0.127,-0.4450,-1.467,-1.0220,-1.340,1615,-408,-15.712383,-3467,-2475,-526
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4013,2022-12-27,2022,12,27,104.400002,2332.790039,3.25,3.661,3.612,-0.049,4.311,4.3827,3.849,-0.5337,-0.462,1759,865,49.175667,750,8890,7709,-105.070000,0.0,0.021,-0.001,-0.022,-0.0997,-0.0802,0.155,0.2352,0.2547,494,-1,-19.282829,176,-1669,15,108.930000,0.75,-0.633,-0.594,0.039,0.983,0.0956,-0.100,-0.1956,-1.083,-94,255,16.256079,-10,-2451,294,-69.129997,1.5,0.091,-0.112,-0.203,2.626,1.2508,0.647,-0.6038,-1.979,1654,177,-606.062439,-1094,-2766,-987,-666.760010,2.25,1.885,1.400,-0.485,4.255,3.6779,2.372,-1.3059,-1.883,-266,811,46.509003,-695,-2296,1048
4014,2022-12-28,2022,12,28,104.400002,2280.449951,3.25,3.668,3.675,0.007,4.457,4.3574,3.886,-0.4714,-0.571,1759,865,49.175667,750,8890,7709,-127.820000,0.0,-0.002,0.065,0.067,0.0550,-0.0869,0.207,0.2939,0.1520,494,-1,-19.282829,176,-1669,15,111.160004,0.75,-0.548,-0.556,-0.008,1.083,0.2307,0.149,-0.0817,-0.934,-94,255,16.256079,-10,-2451,294,-141.639999,1.5,0.129,0.011,-0.118,2.677,1.2458,0.709,-0.5368,-1.968,1654,177,-606.062439,-1094,-2766,-987,-739.789978,2.25,1.882,1.479,-0.403,4.396,3.6015,2.402,-1.1995,-1.994,-266,811,46.509003,-695,-2296,1048
4015,2022-12-29,2022,12,29,104.400002,2236.399902,3.25,3.718,3.723,0.005,4.423,4.3656,3.820,-0.5456,-0.603,1759,865,49.175667,750,8890,7709,-196.990005,0.0,-0.005,0.051,0.056,0.0390,-0.1158,0.070,0.1858,0.0310,494,-1,-19.282829,176,-1669,15,65.470001,0.75,-0.605,-0.504,0.101,1.109,0.1732,0.038,-0.1352,-1.071,-94,255,16.256079,-10,-2451,294,-141.589996,1.5,0.171,0.056,-0.115,2.666,1.3230,0.733,-0.5900,-1.933,1654,177,-606.062439,-1094,-2766,-987,-756.890015,2.25,1.935,1.543,-0.392,4.372,3.6136,2.264,-1.3496,-2.108,-266,811,46.509003,-695,-2296,1048
4016,2022-12-30,2022,12,30,104.400002,2236.399902,3.25,3.725,3.735,0.010,4.405,4.4279,3.879,-0.5489,-0.526,1759,865,49.175667,750,8890,7709,-236.130005,0.0,0.059,0.081,0.022,0.0410,0.0992,0.268,0.1688,0.2270,494,-1,-19.282829,176,-1669,15,80.910004,0.75,-0.489,-0.374,0.115,1.122,0.1553,0.050,-0.1053,-1.072,-94,255,16.256079,-10,-2451,294,-96.239998,1.5,0.176,0.098,-0.078,2.705,1.4725,0.862,-0.6105,-1.843,1654,177,-606.062439,-1094,-2766,-987,-741.250000,2.25,1.923,1.487,-0.436,4.359,3.6976,2.372,-1.3256,-1.987,-266,811,46.509003,-695,-2296,1048


In [None]:
# Check for null or inf values
import numpy as np
df_economic2.replace([np.inf, -np.inf], np.nan, inplace=True)
var = df_economic2.isnull().sum()
print(var.to_string())

date                                            0
year                                            0
month                                           0
day                                             0
apartment_index                                 0
kospi_index                                     0
korea_rp                                        0
korea_3_year                                    0
korea_10_year                                   0
korea_10-3_year                                 0
us_3_month                                      0
us_2_year                                       0
us_10_year                                      0
us_10-2_year                                    0
us_10-3_year_month                              0
total_apartment_supply                          0
total_unsold_count                              0
total_unsold_ratio                              0
total_last_month_deal_count                     0
total_last_month_full_rent_count                0


In [None]:
df_economic2.to_pickle('/content/drive/MyDrive/house_price/after_data/final_economic.pkl')

>> Column dtypes can also be converted to reduce memory usage.

>> After merging or modifying values, always check whether null or inf values are present. If they are found only later in the pipeline, many steps have to be redone.
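
The check described in the note above can be wrapped in a small helper and called after each merge. This is a minimal sketch and not part of the original notebook; the function name `count_missing_or_inf` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd


def count_missing_or_inf(df: pd.DataFrame) -> pd.Series:
    """Return, per column, how many values are NaN/None or +/-inf."""
    numeric = df.select_dtypes(include="number")
    # inf can only occur in numeric columns; other columns get a count of 0.
    inf_count = np.isinf(numeric).sum().reindex(df.columns, fill_value=0)
    return df.isnull().sum() + inf_count


# Hypothetical toy data standing in for a freshly merged frame.
toy = pd.DataFrame({"a": [1.0, np.inf, np.nan], "b": [1, 2, 3], "c": ["x", None, "z"]})
bad = count_missing_or_inf(toy)
print(bad.to_string())
```

Unlike the cell above, this variant does not modify the frame in place, so it can be used purely as a diagnostic.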

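# Creating month_region_count

- Build a DataFrame of detailed apartment transaction counts (sales, jeonse leases, and monthly rents)
- The transaction counts added earlier when building economic_data2 covered all of Seoul. The counts added here are per apartment, grouped by year, month, address_1, address_2, and address_3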

In [None]:
import pandas as pd
import os
# Load the data
df_deal = pd.read_csv('/content/drive/MyDrive/house_price/after_data/apartment_deal.csv',encoding='utf-8')
df_full_rent = pd.read_csv('/content/drive/MyDrive/house_price/after_data/apartment_full_rent.csv',encoding='utf-8')
df_month_rent = pd.read_csv('/content/drive/MyDrive/house_price/after_data/apartment_month_rent.csv',encoding='utf-8')
df_economic = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/economic_data2.pkl')

## Checking data characteristics

- Before building the transaction-count DataFrame, run a quick test to understand the characteristics of the data

In [None]:
# Count transactions per day and per address (down to address_4)
df_deal_count = df_deal.groupby(['year','month','day','address_1','address_2','address_3','address_4'])['deal_price'].count()
df_deal_count = df_deal_count.reset_index()
df_deal_count.columns = ['year','month','day','address_1','address_2','address_3','address_4','deal_count']

df_full_rent_count = df_full_rent.groupby(['year','month','day','address_1','address_2','address_3','address_4'])['full_rent_price'].count()
df_full_rent_count = df_full_rent_count.reset_index()
df_full_rent_count.columns = ['year','month','day','address_1','address_2','address_3','address_4','full_rent_count']

df_month_rent_count = df_month_rent.groupby(['year','month','day','address_1','address_2','address_3','address_4'])['month_rent_price'].count()
df_month_rent_count = df_month_rent_count.reset_index()
df_month_rent_count.columns = ['year','month','day','address_1','address_2','address_3','address_4','month_rent_count']

In [None]:
df_deal_count['deal_count'].value_counts()

1      673402
2       70339
3       12832
4        3224
5         958
        ...  
32          1
134         1
65          1
31          1
72          1
Name: deal_count, Length: 71, dtype: int64

In [None]:
df_full_rent_count['full_rent_count'].value_counts()

1      977795
2      140427
3       30791
4        9477
5        3613
        ...  
56          1
71          1
100         1
48          1
44          1
Name: full_rent_count, Length: 65, dtype: int64

In [None]:
df_month_rent_count['month_rent_count'].value_counts()

1      455309
2       48783
3        9272
4        2814
5        1231
        ...  
304         1
243         1
163         1
67          1
52          1
Name: month_rent_count, Length: 94, dtype: int64

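- When the counts are grouped by date down to year, month, day and by address down to address_1, address_2, address_3, address_4, the vast majority of groups contain only 1, 2, or 3 transactions, which makes changes in transaction volume hard to compute
- The grouping will therefore use year and month for the date, and address_1, address_2, address_3 for the address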
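
The effect of the two grouping levels can be seen on a tiny example. This is a sketch on hypothetical toy rows that only borrow the column names of `df_deal`; it uses `size()` with `reset_index(name=...)` instead of `count()` followed by renaming the columns.

```python
import pandas as pd

# Hypothetical toy transactions; column names follow the notebook's df_deal.
toy_deal = pd.DataFrame({
    "year": [2011, 2011, 2011, 2011],
    "month": [1, 1, 1, 2],
    "day": [3, 3, 17, 5],
    "address_1": ["강남구"] * 4,
    "address_2": ["개포동"] * 4,
    "address_3": [12, 12, 12, 12],
    "address_4": [0, 1, 0, 0],
    "deal_price": [50000, 52000, 51000, 53000],
})

daily_keys = ["year", "month", "day", "address_1", "address_2", "address_3", "address_4"]
monthly_keys = ["address_1", "address_2", "address_3", "year", "month"]

# size() counts rows per group; reset_index(name=...) names the count column directly.
daily = toy_deal.groupby(daily_keys).size().reset_index(name="deal_count")
monthly = toy_deal.groupby(monthly_keys).size().reset_index(name="deal_count")

print(daily["deal_count"].tolist())    # every daily group holds a single deal
print(monthly["deal_count"].tolist())  # monthly groups aggregate them
```

Note that `size()` counts every row, whereas `count()` on `deal_price` skips rows where the price is missing; the two agree only when the price column has no nulls.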

## Creating the per-month, per-region transaction-count DataFrames

In [None]:
# Count transactions per month and per region (down to address_3)
df_deal_count = df_deal.groupby(['address_1','address_2','address_3','year','month'])['deal_price'].count()
df_deal_count = df_deal_count.reset_index()
df_deal_count.columns = ['address_1','address_2','address_3','year','month','deal_count']

df_full_rent_count = df_full_rent.groupby(['address_1','address_2','address_3','year','month'])['full_rent_price'].count()
df_full_rent_count = df_full_rent_count.reset_index()
df_full_rent_count.columns = ['address_1','address_2','address_3','year','month','full_rent_count']

df_month_rent_count = df_month_rent.groupby(['address_1','address_2','address_3','year','month'])['month_rent_price'].count()
df_month_rent_count = df_month_rent_count.reset_index()
df_month_rent_count.columns = ['address_1','address_2','address_3','year','month','month_rent_count']

In [None]:
df_deal_count

Unnamed: 0,address_1,address_2,address_3,year,month,deal_count
0,강남구,개포동,12.0,2011,1,16
1,강남구,개포동,12.0,2011,2,17
2,강남구,개포동,12.0,2011,3,8
3,강남구,개포동,12.0,2011,4,13
4,강남구,개포동,12.0,2011,5,7
...,...,...,...,...,...,...
292758,중랑구,중화동,454.0,2020,12,1
292759,중랑구,중화동,454.0,2021,1,1
292760,중랑구,중화동,454.0,2021,7,1
292761,중랑구,중화동,454.0,2021,8,1


- These counts belong to the month in which the transactions took place; we assume the monthly totals are published in the following month

In [None]:
# Combine the counts (three DataFrames: sales, jeonse leases, and monthly rents)
# The values of every DataFrame are needed, so merge with how='outer'
df_final_count = pd.merge(df_deal_count, df_full_rent_count, on=['year','month','address_1','address_2','address_3'], how='outer')
df_final_count = pd.merge(df_final_count,df_month_rent_count, on=['year','month','address_1','address_2','address_3'], how='outer')
df_final_count.fillna(0,inplace=True)
df_final_count

Unnamed: 0,address_1,address_2,address_3,year,month,deal_count,full_rent_count,month_rent_count
0,강남구,개포동,12.0,2011,1,16.0,38.0,7.0
1,강남구,개포동,12.0,2011,2,17.0,38.0,8.0
2,강남구,개포동,12.0,2011,3,8.0,46.0,16.0
3,강남구,개포동,12.0,2011,4,13.0,30.0,11.0
4,강남구,개포동,12.0,2011,5,7.0,21.0,8.0
...,...,...,...,...,...,...,...,...
495788,중랑구,중화동,453.0,2021,9,0.0,0.0,1.0
495789,중랑구,중화동,453.0,2021,11,0.0,0.0,1.0
495790,중랑구,중화동,453.0,2021,12,0.0,0.0,1.0
495791,중랑구,중화동,454.0,2015,12,0.0,0.0,1.0


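## Adding all months, not only those with transactions

- Months without any transaction are also needed, so rows are added for them
- The transaction counts for those months are set to 0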

In [None]:
# Keep only the date columns covering the whole period
# df_economic has a row for every date, so the date information is taken from it
df_date = df_economic[['year','month']].copy()
df_date.columns=['temp_year','temp_month']
# Build a DataFrame of the (year, month) pairs within the period
# Drop the duplicated pairs
df_date = df_date.drop_duplicates(subset=['temp_year','temp_month'], keep='last')
df_date

Unnamed: 0,temp_year,temp_month
30,2011,1
58,2011,2
89,2011,3
119,2011,4
150,2011,5
...,...,...
4260,2022,8
4290,2022,9
4321,2022,10
4351,2022,11


In [None]:
# Put the years and the months into lists
df_date_year_list = list(df_date['temp_year'])
df_date_month_list = list(df_date['temp_month'])
print(df_date_year_list,len(df_date_year_list))
print(df_date_month_list,len(df_date_month_list))

[2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022] 144
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5

- df_final_count only contains months in which a transaction occurred, so a DataFrame covering every month is built first and then merged with it

In [None]:
# Index by address in order to find the distinct addresses
df_index_is_address = df_final_count.set_index(['address_1','address_2','address_3']).copy()
df_index_is_address

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,year,month,deal_count,full_rent_count,month_rent_count
address_1,address_2,address_3,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
강남구,개포동,12.0,2011,1,16.0,38.0,7.0
강남구,개포동,12.0,2011,2,17.0,38.0,8.0
강남구,개포동,12.0,2011,3,8.0,46.0,16.0
강남구,개포동,12.0,2011,4,13.0,30.0,11.0
강남구,개포동,12.0,2011,5,7.0,21.0,8.0
...,...,...,...,...,...,...,...
중랑구,중화동,453.0,2021,9,0.0,0.0,1.0
중랑구,중화동,453.0,2021,11,0.0,0.0,1.0
중랑구,중화동,453.0,2021,12,0.0,0.0,1.0
중랑구,중화동,454.0,2015,12,0.0,0.0,1.0


In [None]:
# Many index entries are repeated, so remove the duplicates
address_kind_list = list(df_index_is_address.index)
address_kind_list = list(dict.fromkeys(address_kind_list)) # drop duplicates while preserving order
print(len(address_kind_list))
address_kind_list

6902


[('강남구', '개포동', 12.0),
 ('강남구', '개포동', 138.0),
 ('강남구', '개포동', 140.0),
 ('강남구', '개포동', 141.0),
 ('강남구', '개포동', 166.0),
 ('강남구', '개포동', 172.0),
 ('강남구', '개포동', 176.0),
 ('강남구', '개포동', 177.0),
 ('강남구', '개포동', 179.0),
 ('강남구', '개포동', 185.0),
 ('강남구', '개포동', 187.0),
 ('강남구', '개포동', 189.0),
 ('강남구', '개포동', 649.0),
 ('강남구', '개포동', 651.0),
 ('강남구', '개포동', 652.0),
 ('강남구', '개포동', 653.0),
 ('강남구', '개포동', 654.0),
 ('강남구', '개포동', 655.0),
 ('강남구', '개포동', 656.0),
 ('강남구', '개포동', 658.0),
 ('강남구', '개포동', 1164.0),
 ('강남구', '개포동', 1165.0),
 ('강남구', '개포동', 1167.0),
 ('강남구', '개포동', 1204.0),
 ('강남구', '개포동', 1242.0),
 ('강남구', '개포동', 1260.0),
 ('강남구', '개포동', 1280.0),
 ('강남구', '개포동', 1282.0),
 ('강남구', '논현동', 9.0),
 ('강남구', '논현동', 16.0),
 ('강남구', '논현동', 22.0),
 ('강남구', '논현동', 39.0),
 ('강남구', '논현동', 43.0),
 ('강남구', '논현동', 44.0),
 ('강남구', '논현동', 46.0),
 ('강남구', '논현동', 47.0),
 ('강남구', '논현동', 53.0),
 ('강남구', '논현동', 55.0),
 ('강남구', '논현동', 58.0),
 ('강남구', '논현동', 62.0),
 ('강남구', '논현동', 79.0),
 ('강남구', '논현동', 80.0),


In [None]:
address_1_list = list()
address_2_list = list()
address_3_list = list()
year_list = list()
month_list = list()
# Apply to every distinct address
for address_kind in address_kind_list:
    # There are 144 (year, month) pairs in total
    for i in range(len(df_date_year_list)): # append the values to the lists
        address_1_list.append(address_kind[0])
        address_2_list.append(address_kind[1])
        address_3_list.append(address_kind[2])
        year_list.append(df_date_year_list[i])
        month_list.append(df_date_month_list[i])
        

In [None]:
# Build a DataFrame covering every month in the period for every address
df_every_day = pd.DataFrame({'address_1':address_1_list,
                            'address_2':address_2_list,
                            'address_3':address_3_list,
                            'year':year_list,
                            'month':month_list,},columns=['address_1','address_2','address_3','year','month'])
df_every_day

Unnamed: 0,address_1,address_2,address_3,year,month
0,강남구,개포동,12.0,2011,1
1,강남구,개포동,12.0,2011,2
2,강남구,개포동,12.0,2011,3
3,강남구,개포동,12.0,2011,4
4,강남구,개포동,12.0,2011,5
...,...,...,...,...,...
993883,중랑구,신내동,831.0,2022,8
993884,중랑구,신내동,831.0,2022,9
993885,중랑구,신내동,831.0,2022,10
993886,중랑구,신내동,831.0,2022,11


In [None]:
# Merge the data
# Every month must be present, so merge with df_every_day on the left
df_final_count = pd.merge(df_every_day, df_final_count, on=['address_1','address_2','address_3','year','month'], how='left')
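
The address-by-month grid built with nested loops above can also be produced with a cross join. This is a minimal sketch on hypothetical toy data, assuming pandas 1.2 or later (where `how="cross"` is available); it is an alternative formulation, not the notebook's own code.

```python
import pandas as pd

# Hypothetical toy counts: one address traded in two months, another in one.
counts = pd.DataFrame({
    "address_1": ["강남구", "강남구", "중랑구"],
    "address_2": ["개포동", "개포동", "중화동"],
    "address_3": [12.0, 12.0, 454.0],
    "year": [2011, 2011, 2011],
    "month": [1, 3, 2],
    "deal_count": [16.0, 8.0, 1.0],
})
# Hypothetical calendar of (year, month) pairs; the notebook derives these from df_economic.
months = pd.DataFrame({"year": [2011, 2011, 2011], "month": [1, 2, 3]})

addresses = counts[["address_1", "address_2", "address_3"]].drop_duplicates()
# how="cross" pairs every address with every month.
grid = addresses.merge(months, how="cross")

keys = ["address_1", "address_2", "address_3", "year", "month"]
full = grid.merge(counts, on=keys, how="left")
# Months with no recorded transaction get a count of 0.
full["deal_count"] = full["deal_count"].fillna(0)
print(full)
```

The grid has one row per address and month (2 x 3 here), which mirrors the 6902 x 144 rows of `df_every_day`.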

In [None]:
df_final_count.head()

Unnamed: 0,address_1,address_2,address_3,year,month,deal_count,full_rent_count,month_rent_count
0,강남구,개포동,12.0,2011,1,16.0,38.0,7.0
1,강남구,개포동,12.0,2011,2,17.0,38.0,8.0
2,강남구,개포동,12.0,2011,3,8.0,46.0,16.0
3,강남구,개포동,12.0,2011,4,13.0,30.0,11.0
4,강남구,개포동,12.0,2011,5,7.0,21.0,8.0


In [None]:
df_final_count.info() 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 993888 entries, 0 to 993887
Data columns (total 8 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   address_1         993888 non-null  object 
 1   address_2         993888 non-null  object 
 2   address_3         993888 non-null  float64
 3   year              993888 non-null  int64  
 4   month             993888 non-null  int64  
 5   deal_count        495793 non-null  float64
 6   full_rent_count   495793 non-null  float64
 7   month_rent_count  495793 non-null  float64
dtypes: float64(4), int64(2), object(2)
memory usage: 68.2+ MB


In [None]:
# Nulls in the count columns mean that no transaction took place at that address in that month, so fill them with 0
df_final_count.fillna(0,inplace=True)

In [None]:
df_final_count.info() # confirm that the nulls have been filled

<class 'pandas.core.frame.DataFrame'>
Int64Index: 993888 entries, 0 to 993887
Data columns (total 8 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   address_1         993888 non-null  object 
 1   address_2         993888 non-null  object 
 2   address_3         993888 non-null  float64
 3   year              993888 non-null  int64  
 4   month             993888 non-null  int64  
 5   deal_count        993888 non-null  float64
 6   full_rent_count   993888 non-null  float64
 7   month_rent_count  993888 non-null  float64
dtypes: float64(4), int64(2), object(2)
memory usage: 68.2+ MB


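## Shifting the transaction month by one month

- The counts for a month are only known in the following month, so they are shifted forward by one month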

In [None]:
# This is where address_3 changes
df_final_count.iloc[140:150]

Unnamed: 0,address_1,address_2,address_3,year,month,deal_count,full_rent_count,month_rent_count
140,강남구,개포동,12.0,2022,9,1.0,47.0,21.0
141,강남구,개포동,12.0,2022,10,2.0,38.0,26.0
142,강남구,개포동,12.0,2022,11,3.0,28.0,15.0
143,강남구,개포동,12.0,2022,12,3.0,30.0,24.0
144,강남구,개포동,138.0,2011,1,4.0,11.0,9.0
145,강남구,개포동,138.0,2011,2,2.0,14.0,4.0
146,강남구,개포동,138.0,2011,3,4.0,15.0,10.0
147,강남구,개포동,138.0,2011,4,2.0,7.0,9.0
148,강남구,개포동,138.0,2011,5,2.0,14.0,2.0
149,강남구,개포동,138.0,2011,6,2.0,5.0,5.0


In [None]:
# Before shifting by one row, set the last month's values to null so they do not carry over into the next address
import numpy as np
df_final_count.loc[(df_final_count['year']==2022)&(df_final_count['month']==12),['deal_count','full_rent_count','month_rent_count']]=np.nan
df_final_count.iloc[140:150] # confirm that the values at the point where the address changes are now null

Unnamed: 0,address_1,address_2,address_3,year,month,deal_count,full_rent_count,month_rent_count
140,강남구,개포동,12.0,2022,9,1.0,47.0,21.0
141,강남구,개포동,12.0,2022,10,2.0,38.0,26.0
142,강남구,개포동,12.0,2022,11,3.0,28.0,15.0
143,강남구,개포동,12.0,2022,12,,,
144,강남구,개포동,138.0,2011,1,4.0,11.0,9.0
145,강남구,개포동,138.0,2011,2,2.0,14.0,4.0
146,강남구,개포동,138.0,2011,3,4.0,15.0,10.0
147,강남구,개포동,138.0,2011,4,2.0,7.0,9.0
148,강남구,개포동,138.0,2011,5,2.0,14.0,2.0
149,강남구,개포동,138.0,2011,6,2.0,5.0,5.0


In [None]:
# Shift only the count columns by one month
df_final_count[['last_month_deal_count','last_month_full_rent_count','last_month_month_rent_count']] = df_final_count[['deal_count','full_rent_count','month_rent_count']].shift(1)
df_final_count

Unnamed: 0,address_1,address_2,address_3,year,month,deal_count,full_rent_count,month_rent_count,last_month_deal_count,last_month_full_rent_count,last_month_month_rent_count
0,강남구,개포동,12.0,2011,1,16.0,38.0,7.0,,,
1,강남구,개포동,12.0,2011,2,17.0,38.0,8.0,16.0,38.0,7.0
2,강남구,개포동,12.0,2011,3,8.0,46.0,16.0,17.0,38.0,8.0
3,강남구,개포동,12.0,2011,4,13.0,30.0,11.0,8.0,46.0,16.0
4,강남구,개포동,12.0,2011,5,7.0,21.0,8.0,13.0,30.0,11.0
...,...,...,...,...,...,...,...,...,...,...,...
993883,중랑구,신내동,831.0,2022,8,0.0,0.0,0.0,0.0,0.0,2.0
993884,중랑구,신내동,831.0,2022,9,0.0,0.0,0.0,0.0,0.0,0.0
993885,중랑구,신내동,831.0,2022,10,0.0,0.0,1.0,0.0,0.0,0.0
993886,중랑구,신내동,831.0,2022,11,0.0,0.0,97.0,0.0,0.0,1.0


In [None]:
# Drop the original columns
df_final_count.drop(["deal_count", "full_rent_count","month_rent_count"], axis=1,inplace=True)

In [None]:
df_final_count.iloc[140:150] # confirm that the rows were shifted correctly

Unnamed: 0,address_1,address_2,address_3,year,month,last_month_deal_count,last_month_full_rent_count,last_month_month_rent_count
140,강남구,개포동,12.0,2022,9,3.0,44.0,17.0
141,강남구,개포동,12.0,2022,10,1.0,47.0,21.0
142,강남구,개포동,12.0,2022,11,2.0,38.0,26.0
143,강남구,개포동,12.0,2022,12,3.0,28.0,15.0
144,강남구,개포동,138.0,2011,1,,,
145,강남구,개포동,138.0,2011,2,4.0,11.0,9.0
146,강남구,개포동,138.0,2011,3,2.0,14.0,4.0
147,강남구,개포동,138.0,2011,4,4.0,15.0,10.0
148,강남구,개포동,138.0,2011,5,2.0,7.0,9.0
149,강남구,개포동,138.0,2011,6,2.0,14.0,2.0
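
The plain `shift(1)` used above gives the right result only because every address occupies the same number of consecutive rows in chronological order, and because the December 2022 values are blanked first so that nothing leaks into the next address. A sketch of a variant that makes the address boundary explicit with a grouped shift, on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: two addresses, three months each.
toy = pd.DataFrame({
    "address_3": [12, 12, 12, 138, 138, 138],
    "year": [2022, 2022, 2022, 2022, 2022, 2022],
    "month": [10, 11, 12, 10, 11, 12],
    "deal_count": [2.0, 3.0, 3.0, 4.0, 2.0, 4.0],
})

# Sort so that the rows inside each address are in chronological order.
toy = toy.sort_values(["address_3", "year", "month"]).reset_index(drop=True)
# Shifting within each group never carries a value across an address boundary,
# so the last month does not have to be set to NaN beforehand.
toy["last_month_deal_count"] = toy.groupby("address_3")["deal_count"].shift(1)
print(toy)
```

In the real data the group keys would be `address_1`, `address_2`, and `address_3` together; a single key is used here only to keep the example short.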


## Creating new columns

- Add past values as columns

In [None]:
df_final_count2 = df_final_count.copy()
# Values from 1 month earlier
df_final_count2['deal_count_1m_before'] = df_final_count2['last_month_deal_count']
df_final_count2['full_rent_count_1m_before'] = df_final_count2['last_month_full_rent_count']
df_final_count2['month_rent_count_1m_before'] = df_final_count2['last_month_month_rent_count']
df_final_count2.loc[(df_final_count2['year']==2022)&(df_final_count2['month']==12),
                    ['deal_count_1m_before','full_rent_count_1m_before',
                     'month_rent_count_1m_before']]=np.nan
df_final_count2[['deal_count_1m_before','full_rent_count_1m_before',
                 'month_rent_count_1m_before']] = df_final_count2[['deal_count_1m_before',
                 'full_rent_count_1m_before','month_rent_count_1m_before']].shift(1)

# Values from 3 months earlier
df_final_count2['deal_count_3m_before'] = df_final_count2['last_month_deal_count']
df_final_count2['full_rent_count_3m_before'] = df_final_count2['last_month_full_rent_count']
df_final_count2['month_rent_count_3m_before'] = df_final_count2['last_month_month_rent_count']
for i in range(3):
    number=12-i
    df_final_count2.loc[(df_final_count2['year']==2022)&(df_final_count2['month']==number),
                        ['deal_count_3m_before','full_rent_count_3m_before',
                         'month_rent_count_3m_before']]=np.nan
df_final_count2[['deal_count_3m_before','full_rent_count_3m_before',
                 'month_rent_count_3m_before']] = df_final_count2[['deal_count_3m_before',
                 'full_rent_count_3m_before','month_rent_count_3m_before']].shift(3)

# Values from 6 months earlier
df_final_count2['deal_count_6m_before'] = df_final_count2['last_month_deal_count']
df_final_count2['full_rent_count_6m_before'] = df_final_count2['last_month_full_rent_count']
df_final_count2['month_rent_count_6m_before'] = df_final_count2['last_month_month_rent_count']
for i in range(6):
    number=12-i
    df_final_count2.loc[(df_final_count2['year']==2022)&(df_final_count2['month']==number),
                        ['deal_count_6m_before','full_rent_count_6m_before',
                         'month_rent_count_6m_before']]=np.nan
df_final_count2[['deal_count_6m_before','full_rent_count_6m_before',
                 'month_rent_count_6m_before']] = df_final_count2[['deal_count_6m_before',
                 'full_rent_count_6m_before','month_rent_count_6m_before']].shift(6)

# Values from 12 months earlier
df_final_count2['deal_count_12m_before'] = df_final_count2['last_month_deal_count']
df_final_count2['full_rent_count_12m_before'] = df_final_count2['last_month_full_rent_count']
df_final_count2['month_rent_count_12m_before'] = df_final_count2['last_month_month_rent_count']
df_final_count2.loc[(df_final_count2['year']==2022), ['deal_count_12m_before',
              'full_rent_count_12m_before','month_rent_count_12m_before']]=np.nan
df_final_count2[['deal_count_12m_before','full_rent_count_12m_before',
                 'month_rent_count_12m_before']] = df_final_count2[['deal_count_12m_before',
                 'full_rent_count_12m_before','month_rent_count_12m_before']].shift(12)

In [None]:
df_final_count2.head(30) # confirm that the columns were created

Unnamed: 0,address_1,address_2,address_3,year,month,last_month_deal_count,last_month_full_rent_count,last_month_month_rent_count,deal_count_1m_before,full_rent_count_1m_before,month_rent_count_1m_before,deal_count_3m_before,full_rent_count_3m_before,month_rent_count_3m_before,deal_count_6m_before,full_rent_count_6m_before,month_rent_count_6m_before,deal_count_12m_before,full_rent_count_12m_before,month_rent_count_12m_before
0,강남구,개포동,12.0,2011,1,,,,,,,,,,,,,,,
1,강남구,개포동,12.0,2011,2,16.0,38.0,7.0,,,,,,,,,,,,
2,강남구,개포동,12.0,2011,3,17.0,38.0,8.0,16.0,38.0,7.0,,,,,,,,,
3,강남구,개포동,12.0,2011,4,8.0,46.0,16.0,17.0,38.0,8.0,,,,,,,,,
4,강남구,개포동,12.0,2011,5,13.0,30.0,11.0,8.0,46.0,16.0,16.0,38.0,7.0,,,,,,
5,강남구,개포동,12.0,2011,6,7.0,21.0,8.0,13.0,30.0,11.0,17.0,38.0,8.0,,,,,,
6,강남구,개포동,12.0,2011,7,8.0,41.0,14.0,7.0,21.0,8.0,8.0,46.0,16.0,,,,,,
7,강남구,개포동,12.0,2011,8,4.0,33.0,13.0,8.0,41.0,14.0,13.0,30.0,11.0,16.0,38.0,7.0,,,
8,강남구,개포동,12.0,2011,9,8.0,29.0,9.0,4.0,33.0,13.0,7.0,21.0,8.0,17.0,38.0,8.0,,,
9,강남구,개포동,12.0,2011,10,8.0,28.0,13.0,8.0,29.0,9.0,8.0,41.0,14.0,8.0,46.0,16.0,,,


In [None]:
# Drop the rows that contain missing values
df_final_count2.dropna(inplace=True)

In [None]:
df_final_count2

Unnamed: 0,address_1,address_2,address_3,year,month,last_month_deal_count,last_month_full_rent_count,last_month_month_rent_count,deal_count_1m_before,full_rent_count_1m_before,month_rent_count_1m_before,deal_count_3m_before,full_rent_count_3m_before,month_rent_count_3m_before,deal_count_6m_before,full_rent_count_6m_before,month_rent_count_6m_before,deal_count_12m_before,full_rent_count_12m_before,month_rent_count_12m_before
13,강남구,개포동,12.0,2012,2,7.0,33.0,9.0,9.0,32.0,9.0,8.0,40.0,14.0,4.0,33.0,13.0,16.0,38.0,7.0
14,강남구,개포동,12.0,2012,3,7.0,42.0,9.0,7.0,33.0,9.0,4.0,36.0,5.0,8.0,29.0,9.0,17.0,38.0,8.0
15,강남구,개포동,12.0,2012,4,5.0,49.0,9.0,7.0,42.0,9.0,9.0,32.0,9.0,8.0,28.0,13.0,8.0,46.0,16.0
16,강남구,개포동,12.0,2012,5,1.0,35.0,12.0,5.0,49.0,9.0,7.0,33.0,9.0,8.0,40.0,14.0,13.0,30.0,11.0
17,강남구,개포동,12.0,2012,6,6.0,26.0,8.0,1.0,35.0,12.0,7.0,42.0,9.0,4.0,36.0,5.0,7.0,21.0,8.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
993883,중랑구,신내동,831.0,2022,8,0.0,0.0,2.0,0.0,0.0,3.0,0.0,0.0,3.0,0.0,0.0,3.0,0.0,0.0,0.0
993884,중랑구,신내동,831.0,2022,9,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,5.0,0.0,0.0,17.0,0.0,0.0,15.0
993885,중랑구,신내동,831.0,2022,10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,17.0,0.0,0.0,0.0
993886,중랑구,신내동,831.0,2022,11,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,3.0,0.0,0.0,2.0


## Computing the changes in counts

- The columns created above are overwritten with the computed changes

In [None]:
# change in count = current count - past count
df_final_count2['deal_count_1m_before'] = df_final_count2['last_month_deal_count']-df_final_count2['deal_count_1m_before']
df_final_count2['full_rent_count_1m_before'] = df_final_count2['last_month_full_rent_count']-df_final_count2['full_rent_count_1m_before']
df_final_count2['month_rent_count_1m_before'] = df_final_count2['last_month_month_rent_count']-df_final_count2['month_rent_count_1m_before']

df_final_count2['deal_count_3m_before'] = df_final_count2['last_month_deal_count']-df_final_count2['deal_count_3m_before']
df_final_count2['full_rent_count_3m_before'] = df_final_count2['last_month_full_rent_count']-df_final_count2['full_rent_count_3m_before']
df_final_count2['month_rent_count_3m_before'] = df_final_count2['last_month_month_rent_count']-df_final_count2['month_rent_count_3m_before']

df_final_count2['deal_count_6m_before'] = df_final_count2['last_month_deal_count']-df_final_count2['deal_count_6m_before']
df_final_count2['full_rent_count_6m_before'] = df_final_count2['last_month_full_rent_count']-df_final_count2['full_rent_count_6m_before']
df_final_count2['month_rent_count_6m_before'] = df_final_count2['last_month_month_rent_count']-df_final_count2['month_rent_count_6m_before']

df_final_count2['deal_count_12m_before'] = df_final_count2['last_month_deal_count']-df_final_count2['deal_count_12m_before']
df_final_count2['full_rent_count_12m_before'] = df_final_count2['last_month_full_rent_count']-df_final_count2['full_rent_count_12m_before']
df_final_count2['month_rent_count_12m_before'] = df_final_count2['last_month_month_rent_count']-df_final_count2['month_rent_count_12m_before']
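
The two steps above (store the lagged value, then subtract it from the current value) amount to a lagged difference. A minimal sketch of the same arithmetic with a grouped `diff(periods=lag)`, on hypothetical toy data with a single address; the column names mirror the notebook's, but the loop itself is an assumption about how one might condense the code.

```python
import pandas as pd

# Hypothetical toy data: one address, six consecutive months.
toy = pd.DataFrame({
    "address_3": [12] * 6,
    "last_month_deal_count": [16.0, 17.0, 8.0, 13.0, 7.0, 8.0],
})

for lag in (1, 3):
    # diff(periods=lag) computes value[t] - value[t - lag] within each group.
    toy[f"deal_count_{lag}m_before"] = (
        toy.groupby("address_3")["last_month_deal_count"].diff(periods=lag)
    )
print(toy)
```

The first `lag` rows of each address come out as NaN, which corresponds to the rows removed by `dropna` in the notebook.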

In [None]:
df_final_count2 # check the computed differences (subtraction yields no nulls, even where counts are 0)

Unnamed: 0,address_1,address_2,address_3,year,month,last_month_deal_count,last_month_full_rent_count,last_month_month_rent_count,deal_count_1m_before,full_rent_count_1m_before,month_rent_count_1m_before,deal_count_3m_before,full_rent_count_3m_before,month_rent_count_3m_before,deal_count_6m_before,full_rent_count_6m_before,month_rent_count_6m_before,deal_count_12m_before,full_rent_count_12m_before,month_rent_count_12m_before
13,강남구,개포동,12.0,2012,2,7.0,33.0,9.0,-2.0,1.0,0.0,-1.0,-7.0,-5.0,3.0,0.0,-4.0,-9.0,-5.0,2.0
14,강남구,개포동,12.0,2012,3,7.0,42.0,9.0,0.0,9.0,0.0,3.0,6.0,4.0,-1.0,13.0,0.0,-10.0,4.0,1.0
15,강남구,개포동,12.0,2012,4,5.0,49.0,9.0,-2.0,7.0,0.0,-4.0,17.0,0.0,-3.0,21.0,-4.0,-3.0,3.0,-7.0
16,강남구,개포동,12.0,2012,5,1.0,35.0,12.0,-4.0,-14.0,3.0,-6.0,2.0,3.0,-7.0,-5.0,-2.0,-12.0,5.0,1.0
17,강남구,개포동,12.0,2012,6,6.0,26.0,8.0,5.0,-9.0,-4.0,-1.0,-16.0,-1.0,2.0,-10.0,3.0,-1.0,5.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
993883,중랑구,신내동,831.0,2022,8,0.0,0.0,2.0,0.0,0.0,-1.0,0.0,0.0,-1.0,0.0,0.0,-1.0,0.0,0.0,2.0
993884,중랑구,신내동,831.0,2022,9,0.0,0.0,0.0,0.0,0.0,-2.0,0.0,0.0,-5.0,0.0,0.0,-17.0,0.0,0.0,-15.0
993885,중랑구,신내동,831.0,2022,10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-3.0,0.0,0.0,-17.0,0.0,0.0,0.0
993886,중랑구,신내동,831.0,2022,11,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,-1.0,0.0,0.0,-2.0,0.0,0.0,-1.0


In [None]:
df_final_count2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 904162 entries, 13 to 993887
Data columns (total 20 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   address_1                    904162 non-null  object 
 1   address_2                    904162 non-null  object 
 2   address_3                    904162 non-null  float64
 3   year                         904162 non-null  int64  
 4   month                        904162 non-null  int64  
 5   last_month_deal_count        904162 non-null  float64
 6   last_month_full_rent_count   904162 non-null  float64
 7   last_month_month_rent_count  904162 non-null  float64
 8   deal_count_1m_before         904162 non-null  float64
 9   full_rent_count_1m_before    904162 non-null  float64
 10  month_rent_count_1m_before   904162 non-null  float64
 11  deal_count_3m_before         904162 non-null  float64
 12  full_rent_count_3m_before    904162 non-null  float64
 13

In [None]:
# Change the dtypes
df_final_count2 = df_final_count2.astype({'address_3':'int16','year': 'int16', 'month':'int16',
                                         'last_month_deal_count':'int16','last_month_full_rent_count':'int16', 'last_month_month_rent_count':'int16',
                                         'deal_count_1m_before':'int16', 'full_rent_count_1m_before':'int16', 'month_rent_count_1m_before':'int16',
                                         'deal_count_3m_before':'int16', 'full_rent_count_3m_before':'int16', 'month_rent_count_3m_before':'int16',
                                         'deal_count_6m_before':'int16', 'full_rent_count_6m_before':'int16', 'month_rent_count_6m_before':'int16',
                                         'deal_count_12m_before':'int16', 'full_rent_count_12m_before':'int16', 'month_rent_count_12m_before':'int16'})
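
Casting to `int16` is only safe when every value lies between -32768 and 32767; an out-of-range value may not raise an error and can end up stored as a wrong number. A small sketch of a range check before the cast, with a hypothetical helper name `safe_astype_int` and hypothetical toy data:

```python
import numpy as np
import pandas as pd


def safe_astype_int(df: pd.DataFrame, columns, dtype="int16") -> pd.DataFrame:
    """Cast columns to an integer dtype, raising if any value would not fit."""
    info = np.iinfo(dtype)
    for col in columns:
        lo, hi = df[col].min(), df[col].max()
        if lo < info.min or hi > info.max:
            raise ValueError(f"{col}: range [{lo}, {hi}] does not fit in {dtype}")
    return df.astype({col: dtype for col in columns})


# Hypothetical toy data: 'year' fits in int16, 'deal_count' does not.
toy = pd.DataFrame({"year": [2012.0, 2022.0], "deal_count": [7.0, 40000.0]})
small = safe_astype_int(toy, ["year"])
print(small.dtypes)

try:
    safe_astype_int(toy, ["deal_count"])
    overflow_caught = False
except ValueError as err:
    overflow_caught = True
    print(err)
```

The values shown in the outputs above (lot numbers up to a few thousand, monthly counts in the hundreds) are well inside the `int16` range, so the check is a safeguard rather than a correction.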

In [None]:
df_final_count2.head()

Unnamed: 0,address_1,address_2,address_3,year,month,last_month_deal_count,last_month_full_rent_count,last_month_month_rent_count,deal_count_1m_before,full_rent_count_1m_before,month_rent_count_1m_before,deal_count_3m_before,full_rent_count_3m_before,month_rent_count_3m_before,deal_count_6m_before,full_rent_count_6m_before,month_rent_count_6m_before,deal_count_12m_before,full_rent_count_12m_before,month_rent_count_12m_before
13,강남구,개포동,12,2012,2,7,33,9,-2,1,0,-1,-7,-5,3,0,-4,-9,-5,2
14,강남구,개포동,12,2012,3,7,42,9,0,9,0,3,6,4,-1,13,0,-10,4,1
15,강남구,개포동,12,2012,4,5,49,9,-2,7,0,-4,17,0,-3,21,-4,-3,3,-7
16,강남구,개포동,12,2012,5,1,35,12,-4,-14,3,-6,2,3,-7,-5,-2,-12,5,1
17,강남구,개포동,12,2012,6,6,26,8,5,-9,-4,-1,-16,-1,2,-10,3,-1,5,0


In [None]:
df_final_count2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 904162 entries, 13 to 993887
Data columns (total 20 columns):
 #   Column                       Non-Null Count   Dtype 
---  ------                       --------------   ----- 
 0   address_1                    904162 non-null  object
 1   address_2                    904162 non-null  object
 2   address_3                    904162 non-null  int16 
 3   year                         904162 non-null  int16 
 4   month                        904162 non-null  int16 
 5   last_month_deal_count        904162 non-null  int16 
 6   last_month_full_rent_count   904162 non-null  int16 
 7   last_month_month_rent_count  904162 non-null  int16 
 8   deal_count_1m_before         904162 non-null  int16 
 9   full_rent_count_1m_before    904162 non-null  int16 
 10  month_rent_count_1m_before   904162 non-null  int16 
 11  deal_count_3m_before         904162 non-null  int16 
 12  full_rent_count_3m_before    904162 non-null  int16 
 13  month_rent_co

In [None]:
df_final_count2.to_pickle('/content/drive/MyDrive/house_price/after_data/month_region_count.pkl')

# Merging the deal_everyday files

## Creating final_deal

- Because of memory limits, the data was generated in pieces inside the deal_everyday folder; those files are combined here into the final_deal file

In [None]:
import pandas as pd
import os

dir_path = "/content/drive/MyDrive/house_price/after_data/deal_everyday"
file_list = os.listdir(dir_path)
file_list.sort()
# Read the CSV files in the folder one by one and concatenate them into a single DataFrame
for i,csv_file in enumerate(file_list):
    if i == 0:
        final_deal = pd.read_csv(dir_path+"/"+csv_file ,encoding='utf-8')
    else:
        final_deal = pd.concat([final_deal, pd.read_csv(dir_path+"/"+csv_file ,encoding='utf-8')], axis=0)
final_deal.reset_index(drop=True, inplace=True)
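
Calling `pd.concat` inside the loop copies the rows accumulated so far on every iteration. A sketch of the common alternative, reading all files into a list and concatenating once, shown on a hypothetical temporary folder with two small CSV files (the file names and contents are made up for the example):

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the deal_everyday folder: two small CSV files.
with tempfile.TemporaryDirectory() as dir_path:
    pd.DataFrame({"year": [2011], "deal_price": [50000]}).to_csv(
        os.path.join(dir_path, "part_0.csv"), index=False)
    pd.DataFrame({"year": [2012], "deal_price": [52000]}).to_csv(
        os.path.join(dir_path, "part_1.csv"), index=False)

    file_list = sorted(f for f in os.listdir(dir_path) if f.endswith(".csv"))
    # Read every file first, then concatenate a single time.
    frames = [pd.read_csv(os.path.join(dir_path, f), encoding="utf-8") for f in file_list]

# ignore_index=True renumbers the rows, replacing the separate reset_index call.
combined = pd.concat(frames, axis=0, ignore_index=True)
print(combined)
```

Filtering on the `.csv` suffix is an added precaution in this sketch; the original loop reads every entry that `os.listdir` returns.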

In [None]:
final_deal.tail()

Unnamed: 0,year,month,day,address_1,address_2,address_3,address_4,deal_price
38833375,2022,12,31,중랑구,중화동,438,0,69500
38833376,2022,12,31,중랑구,중화동,450,0,91500
38833377,2022,12,31,중랑구,중화동,452,0,55000
38833378,2022,12,31,중랑구,중화동,453,0,85500
38833379,2022,12,31,중랑구,중화동,454,0,96000


In [None]:
final_deal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38833380 entries, 0 to 38833379
Data columns (total 8 columns):
 #   Column      Dtype 
---  ------      ----- 
 0   year        int64 
 1   month       int64 
 2   day         int64 
 3   address_1   object
 4   address_2   object
 5   address_3   int64 
 6   address_4   int64 
 7   deal_price  int64 
dtypes: int64(6), object(2)
memory usage: 2.3+ GB


In [None]:
# Change the dtypes to reduce memory usage
final_deal=final_deal.astype({'year': 'int16','month': 'int16','day': 'int16','address_3': 'int16','address_4': 'int16','deal_price': 'int32'})

In [None]:
final_deal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38833380 entries, 0 to 38833379
Data columns (total 8 columns):
 #   Column      Dtype 
---  ------      ----- 
 0   year        int16 
 1   month       int16 
 2   day         int16 
 3   address_1   object
 4   address_2   object
 5   address_3   int16 
 6   address_4   int16 
 7   deal_price  int32 
dtypes: int16(5), int32(1), object(2)
memory usage: 1.1+ GB
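
The `+` in the reported memory usage means that the `object` columns (`address_1`, `address_2`) are not fully counted. Since those columns repeat a small set of district names, converting them to the `category` dtype is one further option; the sketch below uses hypothetical toy data and is a suggestion, not a step the notebook performs.

```python
import pandas as pd

# Hypothetical toy data with heavily repeated district names, as in final_deal.
toy = pd.DataFrame({
    "address_1": ["강남구", "중랑구"] * 5000,
    "address_2": ["개포동", "중화동"] * 5000,
    "deal_price": range(10000),
})

# deep=True includes the memory held by the string values themselves.
before = toy.memory_usage(deep=True).sum()
compact = toy.astype({"address_1": "category", "address_2": "category"})
after = compact.memory_usage(deep=True).sum()
print(before, after)
```

One caveat: grouping on categorical keys can produce rows for unobserved category combinations unless `observed=True` is passed to `groupby`, so later steps may need adjusting.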


In [None]:
# Save to file
final_deal.to_pickle('/content/drive/MyDrive/house_price/after_data/final_deal.pkl')

## Creating final_full_rent

- Same procedure as for final_deal

In [None]:
import pandas as pd
import os
dir_path = "/content/drive/MyDrive/house_price/after_data/full_rent_everyday"
file_list = os.listdir(dir_path)
file_list.sort()
for i,csv_file in enumerate(file_list):
    if i == 0:
        final_full_rent = pd.read_csv(dir_path+"/"+csv_file ,encoding='utf-8')
    else:
        final_full_rent = pd.concat([final_full_rent, pd.read_csv(dir_path+"/"+csv_file ,encoding='utf-8')], axis=0)
final_full_rent.reset_index(drop=True, inplace=True)

In [None]:
final_full_rent.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40577814 entries, 0 to 40577813
Data columns (total 8 columns):
 #   Column           Dtype 
---  ------           ----- 
 0   year             int64 
 1   month            int64 
 2   day              int64 
 3   address_1        object
 4   address_2        object
 5   address_3        int64 
 6   address_4        int64 
 7   full_rent_price  int64 
dtypes: int64(6), object(2)
memory usage: 2.4+ GB


In [None]:
final_full_rent = final_full_rent.astype({'year': 'int16','month': 'int16','day': 'int16','address_3': 'int16','address_4': 'int16','full_rent_price': 'int32'})
final_full_rent.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40577814 entries, 0 to 40577813
Data columns (total 8 columns):
 #   Column           Dtype 
---  ------           ----- 
 0   year             int16 
 1   month            int16 
 2   day              int16 
 3   address_1        object
 4   address_2        object
 5   address_3        int16 
 6   address_4        int16 
 7   full_rent_price  int32 
dtypes: int16(5), int32(1), object(2)
memory usage: 1.1+ GB
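The `+` in the `info()` output means the `object` columns (`address_1`, `address_2`) are not fully counted. As a hedged sketch on toy data, `memory_usage(deep=True)` shows their real footprint, and converting a repetitive string column to `category` is one possible way to shrink it. This was not done in the original notebook, and merging on categorical keys may fall back to `object` when the categories differ between frames.

```python
import pandas as pd

# Toy frame with a highly repetitive string column, similar in spirit to address_1
toy = pd.DataFrame({
    "address_1": ["강남구"] * 1000 + ["중랑구"] * 1000,
    "deal_price": range(2000),
})

before = toy.memory_usage(deep=True).sum()

# Store each distinct string once and keep small integer codes per row
toy_cat = toy.astype({"address_1": "category"})
after = toy_cat.memory_usage(deep=True).sum()

print(before, after)
```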


In [None]:
# Save to file
final_full_rent.to_pickle('/content/drive/MyDrive/house_price/after_data/final_full_rent.pkl')

## Creating final_year_rent

- Follows the same procedure used to create final_deal.

In [None]:
import pandas as pd
import os

dir_path = "/content/drive/MyDrive/house_price/after_data/year_rent_everyday"
file_list = os.listdir(dir_path)
file_list.sort()
# Read the CSV files in the folder into a list of DataFrames, then concatenate them once
frames = [pd.read_csv(os.path.join(dir_path, csv_file), encoding='utf-8') for csv_file in file_list]
final_year_rent = pd.concat(frames, axis=0)
final_year_rent.reset_index(drop=True, inplace=True)

In [None]:
final_year_rent.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36633114 entries, 0 to 36633113
Data columns (total 8 columns):
 #   Column           Dtype 
---  ------           ----- 
 0   year             int64 
 1   month            int64 
 2   day              int64 
 3   address_1        object
 4   address_2        object
 5   address_3        int64 
 6   address_4        int64 
 7   year_rent_price  int64 
dtypes: int64(6), object(2)
memory usage: 2.2+ GB


In [None]:
final_year_rent = final_year_rent.astype({'year': 'int16','month': 'int16','day': 'int16','address_3': 'int16','address_4': 'int16','year_rent_price': 'int32'})
final_year_rent.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36633114 entries, 0 to 36633113
Data columns (total 8 columns):
 #   Column           Dtype 
---  ------           ----- 
 0   year             int16 
 1   month            int16 
 2   day              int16 
 3   address_1        object
 4   address_2        object
 5   address_3        int16 
 6   address_4        int16 
 7   year_rent_price  int32 
dtypes: int16(5), int32(1), object(2)
memory usage: 1.0+ GB


In [None]:
# Save to file
final_year_rent.to_pickle('/content/drive/MyDrive/house_price/after_data/final_year_rent.pkl')

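# Creating final_df1

- Merge the previously created final_deal, final_full_rent, and final_year_rent into a single file.
- Because of limited RAM, the merge is done in stages: an intermediate result is saved, reloaded, and then merged further.

## Loading final_deal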

In [None]:
import pandas as pd
final_deal = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/final_deal.pkl')
final_deal

Unnamed: 0,year,month,day,address_1,address_2,address_3,address_4,deal_price
0,2011,1,1,강남구,개포동,12,0,0
1,2011,1,1,강남구,개포동,12,2,0
2,2011,1,1,강남구,개포동,138,0,0
3,2011,1,1,강남구,개포동,140,0,0
4,2011,1,1,강남구,개포동,141,0,0
...,...,...,...,...,...,...,...,...
38833375,2022,12,31,중랑구,중화동,438,0,69500
38833376,2022,12,31,중랑구,중화동,450,0,91500
38833377,2022,12,31,중랑구,중화동,452,0,55000
38833378,2022,12,31,중랑구,중화동,453,0,85500


In [None]:
final_deal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38833380 entries, 0 to 38833379
Data columns (total 8 columns):
 #   Column      Dtype 
---  ------      ----- 
 0   year        int16 
 1   month       int16 
 2   day         int16 
 3   address_1   object
 4   address_2   object
 5   address_3   int16 
 6   address_4   int16 
 7   deal_price  int32 
dtypes: int16(5), int32(1), object(2)
memory usage: 1.1+ GB


## Loading final_full_rent

In [None]:
final_full_rent = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/final_full_rent.pkl')
final_full_rent

Unnamed: 0,year,month,day,address_1,address_2,address_3,address_4,full_rent_price
0,2011,1,1,강남구,개포동,12,0,0
1,2011,1,1,강남구,개포동,12,2,0
2,2011,1,1,강남구,개포동,138,0,0
3,2011,1,1,강남구,개포동,140,0,0
4,2011,1,1,강남구,개포동,141,0,0
...,...,...,...,...,...,...,...,...
40577809,2022,12,31,중랑구,중화동,438,0,32000
40577810,2022,12,31,중랑구,중화동,450,0,32500
40577811,2022,12,31,중랑구,중화동,452,0,33000
40577812,2022,12,31,중랑구,중화동,453,0,57750


In [None]:
final_full_rent.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40577814 entries, 0 to 40577813
Data columns (total 8 columns):
 #   Column           Dtype 
---  ------           ----- 
 0   year             int16 
 1   month            int16 
 2   day              int16 
 3   address_1        object
 4   address_2        object
 5   address_3        int16 
 6   address_4        int16 
 7   full_rent_price  int32 
dtypes: int16(5), int32(1), object(2)
memory usage: 1.1+ GB


## Merging final_deal and final_full_rent

In [None]:
# Both the sale price and the jeonse (full_rent) price are required, so use an inner merge
final_deal_full_rent = pd.merge(final_deal,final_full_rent, on =['year','month','day','address_1','address_2','address_3','address_4'], how='inner')

In [None]:
final_deal_full_rent.head()

Unnamed: 0,year,month,day,address_1,address_2,address_3,address_4,deal_price,full_rent_price
0,2011,1,1,강남구,개포동,12,0,0,0
1,2011,1,1,강남구,개포동,12,2,0,0
2,2011,1,1,강남구,개포동,138,0,0,0
3,2011,1,1,강남구,개포동,140,0,0,0
4,2011,1,1,강남구,개포동,141,0,0,0


In [None]:
# Check the merged result
final_deal_full_rent.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37816524 entries, 0 to 37816523
Data columns (total 9 columns):
 #   Column           Dtype 
---  ------           ----- 
 0   year             int16 
 1   month            int16 
 2   day              int16 
 3   address_1        object
 4   address_2        object
 5   address_3        int16 
 6   address_4        int16 
 7   deal_price       int32 
 8   full_rent_price  int32 
dtypes: int16(5), int32(2), object(2)
memory usage: 1.5+ GB
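The inner merge reduced the row count from 38,833,380 to 37,816,524. A small sketch of how such a merge could be audited is shown below, using toy frames and a shortened key list (both are assumptions for illustration). `validate="one_to_one"` raises if a key is duplicated on either side, and an outer merge with `indicator=True` counts the keys that exist on only one side.

```python
import pandas as pd

keys = ["year", "month", "day", "address_1"]  # shortened key list for the toy example

deal = pd.DataFrame({
    "year": [2011, 2011, 2011], "month": [1, 1, 1], "day": [1, 2, 3],
    "address_1": ["강남구"] * 3, "deal_price": [0, 89400, 89000],
})
rent = pd.DataFrame({
    "year": [2011, 2011, 2011], "month": [1, 1, 1], "day": [2, 3, 4],
    "address_1": ["강남구"] * 3, "full_rent_price": [11000, 22000, 7500],
})

# Inner merge that fails loudly if either side has duplicate keys
merged = pd.merge(deal, rent, on=keys, how="inner", validate="one_to_one")

# Outer merge on the keys only, to count rows that the inner merge drops
check = pd.merge(deal[keys], rent[keys], on=keys, how="outer", indicator=True)
counts = check["_merge"].value_counts()
print(counts)
```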


### Intermediate save - stored in stages because of limited RAM

In [None]:
# Save to file
final_deal_full_rent.to_pickle('/content/drive/MyDrive/house_price/after_data/final_deal_full_rent.pkl')

## Loading the saved final_deal_full_rent

In [None]:
import pandas as pd

# Load the file
final_deal_full_rent = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/final_deal_full_rent.pkl')

In [None]:
final_deal_full_rent.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37816524 entries, 0 to 37816523
Data columns (total 9 columns):
 #   Column           Dtype 
---  ------           ----- 
 0   year             int16 
 1   month            int16 
 2   day              int16 
 3   address_1        object
 4   address_2        object
 5   address_3        int16 
 6   address_4        int16 
 7   deal_price       int32 
 8   full_rent_price  int32 
dtypes: int16(5), int32(2), object(2)
memory usage: 1.5+ GB


## Loading final_year_rent

In [None]:
# Load the data
final_year_rent = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/final_year_rent.pkl')
final_year_rent

Unnamed: 0,year,month,day,address_1,address_2,address_3,address_4,year_rent_price
0,2011,1,1,강남구,개포동,12,0,0
1,2011,1,1,강남구,개포동,12,2,0
2,2011,1,1,강남구,개포동,138,0,0
3,2011,1,1,강남구,개포동,140,0,0
4,2011,1,1,강남구,개포동,141,0,0
...,...,...,...,...,...,...,...,...
36633109,2022,12,31,중랑구,중화동,438,0,389
36633110,2022,12,31,중랑구,중화동,450,0,2090
36633111,2022,12,31,중랑구,중화동,452,0,1610
36633112,2022,12,31,중랑구,중화동,453,0,2070


In [None]:
final_year_rent.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36633114 entries, 0 to 36633113
Data columns (total 8 columns):
 #   Column           Dtype 
---  ------           ----- 
 0   year             int16 
 1   month            int16 
 2   day              int16 
 3   address_1        object
 4   address_2        object
 5   address_3        int16 
 6   address_4        int16 
 7   year_rent_price  int32 
dtypes: int16(5), int32(1), object(2)
memory usage: 1.0+ GB


## Merging final_deal_full_rent and final_year_rent

In [None]:
final_df = pd.merge(final_deal_full_rent, final_year_rent, on =['year','month','day','address_1','address_2','address_3','address_4'], how='inner')
final_df.head()

Unnamed: 0,year,month,day,address_1,address_2,address_3,address_4,deal_price,full_rent_price,year_rent_price
0,2011,1,1,강남구,개포동,12,0,0,0,0
1,2011,1,1,강남구,개포동,12,2,0,0,0
2,2011,1,1,강남구,개포동,138,0,0,0,0
3,2011,1,1,강남구,개포동,140,0,0,0,0
4,2011,1,1,강남구,개포동,141,0,0,0,0


In [None]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 33117948 entries, 0 to 33117947
Data columns (total 10 columns):
 #   Column           Dtype 
---  ------           ----- 
 0   year             int16 
 1   month            int16 
 2   day              int16 
 3   address_1        object
 4   address_2        object
 5   address_3        int16 
 6   address_4        int16 
 7   deal_price       int32 
 8   full_rent_price  int32 
 9   year_rent_price  int32 
dtypes: int16(5), int32(3), object(2)
memory usage: 1.4+ GB


In [None]:
# Save to file
final_df.to_pickle('/content/drive/MyDrive/house_price/after_data/final_df1.pkl')

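# Creating final_df2 (modifying final_df1)

- Macroeconomic indicators capture the overall mood of the apartment market, so separate valuation indicators are needed to assess each individual apartment.
- In this project, 'macroeconomic indicators' correspond to the economic indicators used for stocks, such as interest rates, exchange rates, and economic growth rates. 'Valuation indicators' are indicators built from money-related figures, analogous to PER, PBR, and ROE for stocks.
- Add the valuation indicators to final_df1.

## Removing zero values from the existing DataFrame

- A value of 0 means that no sale price, jeonse price, or yearly rent exists, so rows containing such a value must be removed.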

In [None]:
import pandas as pd
# Load the saved file
final_df = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/final_df1.pkl')
final_df.head()

Unnamed: 0,year,month,day,address_1,address_2,address_3,address_4,deal_price,full_rent_price,year_rent_price
0,2011,1,1,강남구,개포동,12,0,0,0,0
1,2011,1,1,강남구,개포동,12,2,0,0,0
2,2011,1,1,강남구,개포동,138,0,0,0,0
3,2011,1,1,강남구,개포동,140,0,0,0,0
4,2011,1,1,강남구,개포동,141,0,0,0,0


In [None]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 33117948 entries, 0 to 33117947
Data columns (total 10 columns):
 #   Column           Dtype 
---  ------           ----- 
 0   year             int16 
 1   month            int16 
 2   day              int16 
 3   address_1        object
 4   address_2        object
 5   address_3        int16 
 6   address_4        int16 
 7   deal_price       int32 
 8   full_rent_price  int32 
 9   year_rent_price  int32 
dtypes: int16(5), int32(3), object(2)
memory usage: 1.4+ GB


In [None]:
# Remove rows in which any of the three prices is 0
final_df.drop(final_df[(final_df.deal_price == 0)|(final_df.full_rent_price == 0)|(final_df.year_rent_price == 0)].index, inplace=True)
final_df.reset_index(drop=True,inplace=True)
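The same filter can also be written as a boolean mask, which avoids building an index of rows to drop. The following is a sketch on toy data; it is an alternative formulation, not the code that produced the outputs in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "deal_price": [0, 89400, 89000, 35000],
    "full_rent_price": [0, 11000, 0, 24000],
    "year_rent_price": [0, 778, 1786, 1196],
})
price_cols = ["deal_price", "full_rent_price", "year_rent_price"]

# Keep only the rows where all three prices are non-zero
mask = (toy[price_cols] != 0).all(axis=1)
filtered = toy[mask].reset_index(drop=True)
print(filtered)
```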

In [None]:
final_df.head()

Unnamed: 0,year,month,day,address_1,address_2,address_3,address_4,deal_price,full_rent_price,year_rent_price
0,2011,1,5,강남구,개포동,138,0,89400,11000,778
1,2011,1,5,강남구,개포동,185,0,89000,22000,1786
2,2011,1,6,강남구,개포동,138,0,89400,7500,778
3,2011,1,6,강남구,개포동,185,0,89000,30666,1424
4,2011,1,7,강남구,개포동,12,0,35000,24000,1196


In [None]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24144106 entries, 0 to 24144105
Data columns (total 10 columns):
 #   Column           Dtype 
---  ------           ----- 
 0   year             int16 
 1   month            int16 
 2   day              int16 
 3   address_1        object
 4   address_2        object
 5   address_3        int16 
 6   address_4        int16 
 7   deal_price       int32 
 8   full_rent_price  int32 
 9   year_rent_price  int32 
dtypes: int16(5), int32(3), object(2)
memory usage: 875.0+ MB


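## Adding the valuation indicator columns

- deal_price is the sale price, full_rent_price is the jeonse price, and year_rent_price is the yearly rent, so the valuation columns are built by combining these values.
- The original plan was to compute multiples as sale price / jeonse price and sale price / yearly rent. Because the multiples span too wide a range in later calculations, 'deal_to_full_rent_rate' is instead computed as jeonse price / sale price (as a percentage).
- Yearly rent / sale price would produce long decimal fractions, so deal_to_year_rent_multiple is kept as a multiple.

>> When creating a column through a calculation, check carefully that the computation matches the intended formula.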
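As a quick way to perform that check, the two formulas can be evaluated by hand on a few rows and compared with the values shown in the output tables of this section. The sketch below uses the first two rows of the filtered data (2011-01-05, 개포동 138 and 185); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

rows = pd.DataFrame({
    "deal_price": [89400, 89000],
    "full_rent_price": [11000, 22000],
    "year_rent_price": [778, 1786],
})

# Jeonse rate in percent: 100 * jeonse price / sale price
rate = 100 * rows["full_rent_price"] / rows["deal_price"]
# Multiple: sale price / yearly rent
multiple = rows["deal_price"] / rows["year_rent_price"]

# Values taken from the notebook's own output table
expected_rate = [12.304251, 24.719101]
expected_multiple = [114.910026, 49.832027]

print(np.allclose(rate, expected_rate, atol=1e-4))
print(np.allclose(multiple, expected_multiple, atol=1e-4))
```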

In [None]:
# Compute the jeonse rate (%) and the sale-price-to-yearly-rent multiple
final_df['deal_to_full_rent_rate'] = 100*(final_df['full_rent_price']/final_df['deal_price'])
final_df['deal_to_year_rent_multiple'] = final_df['deal_price']/final_df['year_rent_price']
final_df

Unnamed: 0,year,month,day,address_1,address_2,address_3,address_4,deal_price,full_rent_price,year_rent_price,deal_to_full_rent_rate,deal_to_year_rent_multiple
0,2011,1,5,강남구,개포동,138,0,89400,11000,778,12.304251,114.910026
1,2011,1,5,강남구,개포동,185,0,89000,22000,1786,24.719101,49.832027
2,2011,1,6,강남구,개포동,138,0,89400,7500,778,8.389262,114.910026
3,2011,1,6,강남구,개포동,185,0,89000,30666,1424,34.456180,62.500000
4,2011,1,7,강남구,개포동,12,0,35000,24000,1196,68.571429,29.264214
...,...,...,...,...,...,...,...,...,...,...,...,...
24144101,2022,12,31,중랑구,중화동,438,0,69500,32000,389,46.043165,178.663239
24144102,2022,12,31,중랑구,중화동,450,0,91500,32500,2090,35.519126,43.779904
24144103,2022,12,31,중랑구,중화동,452,0,55000,33000,1610,60.000000,34.161491
24144104,2022,12,31,중랑구,중화동,453,0,85500,57750,2070,67.543860,41.304348


In [None]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24144106 entries, 0 to 24144105
Data columns (total 12 columns):
 #   Column                      Dtype  
---  ------                      -----  
 0   year                        int16  
 1   month                       int16  
 2   day                         int16  
 3   address_1                   object 
 4   address_2                   object 
 5   address_3                   int16  
 6   address_4                   int16  
 7   deal_price                  int32  
 8   full_rent_price             int32  
 9   year_rent_price             int32  
 10  deal_to_full_rent_rate      float64
 11  deal_to_year_rent_multiple  float64
dtypes: float64(2), int16(5), int32(3), object(2)
memory usage: 1.2+ GB


In [None]:
# Change dtypes to float32 because of memory constraints
final_df = final_df.astype({'deal_to_full_rent_rate':'float32', 'deal_to_year_rent_multiple':'float32'})
final_df

Unnamed: 0,year,month,day,address_1,address_2,address_3,address_4,deal_price,full_rent_price,year_rent_price,deal_to_full_rent_rate,deal_to_year_rent_multiple
0,2011,1,5,강남구,개포동,138,0,89400,11000,778,12.304251,114.910027
1,2011,1,5,강남구,개포동,185,0,89000,22000,1786,24.719101,49.832027
2,2011,1,6,강남구,개포동,138,0,89400,7500,778,8.389262,114.910027
3,2011,1,6,강남구,개포동,185,0,89000,30666,1424,34.456181,62.500000
4,2011,1,7,강남구,개포동,12,0,35000,24000,1196,68.571426,29.264214
...,...,...,...,...,...,...,...,...,...,...,...,...
24144101,2022,12,31,중랑구,중화동,438,0,69500,32000,389,46.043167,178.663239
24144102,2022,12,31,중랑구,중화동,450,0,91500,32500,2090,35.519127,43.779903
24144103,2022,12,31,중랑구,중화동,452,0,55000,33000,1610,60.000000,34.161491
24144104,2022,12,31,중랑구,중화동,453,0,85500,57750,2070,67.543861,41.304348


In [None]:
# Check for null and inf values
import numpy as np
print(final_df.isnull().sum())

final_df.replace([np.inf, -np.inf], np.nan, inplace=True)
var = final_df.isnull().sum()
print(var.to_string())

year                          0
month                         0
day                           0
address_1                     0
address_2                     0
address_3                     0
address_4                     0
deal_price                    0
full_rent_price               0
year_rent_price               0
deal_to_full_rent_rate        0
deal_to_year_rent_multiple    0
dtype: int64
year                          0
month                         0
day                           0
address_1                     0
address_2                     0
address_3                     0
address_4                     0
deal_price                    0
full_rent_price               0
year_rent_price               0
deal_to_full_rent_rate        0
deal_to_year_rent_multiple    0
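Both counts are zero here because rows with a zero price were removed before the division. As an illustrative sketch on toy data, the snippet below shows what the check would catch if zeros had remained: in pandas, a non-zero value divided by zero yields `inf` and zero divided by zero yields `NaN`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "deal_price": [100, 200, 0],
    "year_rent_price": [10, 0, 0],
})
toy["multiple"] = toy["deal_price"] / toy["year_rent_price"]

numeric = toy.select_dtypes(include="number")
inf_counts = np.isinf(numeric).sum()   # counts +inf and -inf per column
nan_counts = numeric.isna().sum()      # counts NaN per column

print(inf_counts)
print(nan_counts)
```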


In [None]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24144106 entries, 0 to 24144105
Data columns (total 12 columns):
 #   Column                      Dtype  
---  ------                      -----  
 0   year                        int16  
 1   month                       int16  
 2   day                         int16  
 3   address_1                   object 
 4   address_2                   object 
 5   address_3                   int16  
 6   address_4                   int16  
 7   deal_price                  int32  
 8   full_rent_price             int32  
 9   year_rent_price             int32  
 10  deal_to_full_rent_rate      float32
 11  deal_to_year_rent_multiple  float32
dtypes: float32(2), int16(5), int32(3), object(2)
memory usage: 1.0+ GB


In [None]:
final_df.to_pickle('/content/drive/MyDrive/house_price/after_data/final_df2.pkl')

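# Creating final_df3 (modifying final_df2)

- Add information about each apartment's past prices to final_df2.
- Using the price from exactly 30 days earlier would produce too many different cases of price change and seemed inconsistent, so the average price from a number of months earlier is used as the past price.

## Loading the DataFrame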

In [None]:
import pandas as pd
# Load the apartment transaction table
final_df2 = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/final_df2.pkl')
final_df2.head()

Unnamed: 0,year,month,day,address_1,address_2,address_3,address_4,deal_price,full_rent_price,year_rent_price,deal_to_full_rent_rate,deal_to_year_rent_multiple
0,2011,1,5,강남구,개포동,138,0,89400,11000,778,12.304251,114.910027
1,2011,1,5,강남구,개포동,185,0,89000,22000,1786,24.719101,49.832027
2,2011,1,6,강남구,개포동,138,0,89400,7500,778,8.389262,114.910027
3,2011,1,6,강남구,개포동,185,0,89000,30666,1424,34.456181,62.5
4,2011,1,7,강남구,개포동,12,0,35000,24000,1196,68.571426,29.264214


In [None]:
final_df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24144106 entries, 0 to 24144105
Data columns (total 12 columns):
 #   Column                      Dtype  
---  ------                      -----  
 0   year                        int16  
 1   month                       int16  
 2   day                         int16  
 3   address_1                   object 
 4   address_2                   object 
 5   address_3                   int16  
 6   address_4                   int16  
 7   deal_price                  int32  
 8   full_rent_price             int32  
 9   year_rent_price             int32  
 10  deal_to_full_rent_rate      float32
 11  deal_to_year_rent_multiple  float32
dtypes: float32(2), int16(5), int32(3), object(2)
memory usage: 1.0+ GB


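## Computing the average values of past months

- To measure change relative to past values, compute the average transaction prices (sale, jeonse, monthly rent) of previous months.
- Figures such as transaction counts and indicators for a given month only become known the following month, so they were lagged by one month. Contract prices are assumed to be observable immediately, so they are not lagged by an extra month.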

In [None]:
# Compute the monthly average of each price per location (down to address_4)
group_final_df = final_df2.groupby(['address_1','address_2','address_3','address_4','year','month']).agg({'deal_price': 'mean','full_rent_price':'mean',
                                                                                                          'year_rent_price':'mean','deal_to_full_rent_rate':'mean','deal_to_year_rent_multiple':'mean'}).copy()
group_final_df.reset_index(inplace=True)
group_final_df.head(10)

Unnamed: 0,address_1,address_2,address_3,address_4,year,month,deal_price,full_rent_price,year_rent_price,deal_to_full_rent_rate,deal_to_year_rent_multiple
0,강남구,개포동,12,0,2011,1,40938.0,18332.64,1190.48,45.434387,35.128788
1,강남구,개포동,12,0,2011,2,46330.357143,16232.107143,1176.464286,36.548016,40.936523
2,강남구,개포동,12,0,2011,3,47822.580645,17971.774194,1256.774194,38.856297,39.303127
3,강남구,개포동,12,0,2011,4,45638.666667,16180.866667,1026.8,35.936687,46.845188
4,강남구,개포동,12,0,2011,5,41406.451613,18709.645161,1200.387097,45.567566,35.201447
5,강남구,개포동,12,0,2011,6,42056.666667,19296.066667,1096.966667,48.413422,40.14835
6,강남구,개포동,12,0,2011,7,42387.096774,20866.129032,1332.83871,49.842899,32.93047
7,강남구,개포동,12,0,2011,8,44316.129032,20264.516129,1228.387097,45.971195,36.481705
8,강남구,개포동,12,0,2011,9,43208.333333,20738.866667,1133.933333,49.645,38.670517
9,강남구,개포동,12,0,2011,10,35287.096774,21646.451613,1315.354839,66.50042,28.182751


In [None]:
# 'deal_to_full_rent_rate' and 'deal_to_year_rent_multiple' must be recomputed from the average sale price, average jeonse price, and average rent
group_final_df['deal_to_full_rent_rate'] = 100*(group_final_df['full_rent_price']/group_final_df['deal_price'])
group_final_df['deal_to_year_rent_multiple'] = group_final_df['deal_price']/group_final_df['year_rent_price']
group_final_df.head(10)

Unnamed: 0,address_1,address_2,address_3,address_4,year,month,deal_price,full_rent_price,year_rent_price,deal_to_full_rent_rate,deal_to_year_rent_multiple
0,강남구,개포동,12,0,2011,1,40938.0,18332.64,1190.48,44.781474,34.38781
1,강남구,개포동,12,0,2011,2,46330.357143,16232.107143,1176.464286,35.035575,39.381015
2,강남구,개포동,12,0,2011,3,47822.580645,17971.774194,1256.774194,37.580101,38.051848
3,강남구,개포동,12,0,2011,4,45638.666667,16180.866667,1026.8,35.454293,44.447474
4,강남구,개포동,12,0,2011,5,41406.451613,18709.645161,1200.387097,45.185338,34.494249
5,강남구,개포동,12,0,2011,6,42056.666667,19296.066667,1096.966667,45.881113,38.339056
6,강남구,개포동,12,0,2011,7,42387.096774,20866.129032,1332.83871,49.227549,31.80212
7,강남구,개포동,12,0,2011,8,44316.129032,20264.516129,1228.387097,45.72718,36.076681
8,강남구,개포동,12,0,2011,9,43208.333333,20738.866667,1133.933333,47.997377,38.104827
9,강남구,개포동,12,0,2011,10,35287.096774,21646.451613,1315.354839,61.343816,26.827055
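The recomputation matters because the mean of per-row ratios generally differs from the ratio of the means, which is why the rate for 2011-01 changes from 45.434387 to 44.781474 above. A minimal sketch with made-up numbers:

```python
import pandas as pd

# Two hypothetical transactions in the same month
toy = pd.DataFrame({
    "deal_price": [100.0, 200.0],
    "full_rent_price": [50.0, 50.0],
})
toy["rate"] = 100 * toy["full_rent_price"] / toy["deal_price"]

# Averaging the per-row rates: (50 + 25) / 2
mean_of_ratios = toy["rate"].mean()

# Recomputing the rate from the averaged prices: 100 * 50 / 150
ratio_of_means = 100 * toy["full_rent_price"].mean() / toy["deal_price"].mean()

print(mean_of_ratios, ratio_of_means)
```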


In [None]:
import numpy as np
# Create columns holding the average prices from 1 month earlier
group_final_df['deal_price_mean_1m_before'] = group_final_df['deal_price']
group_final_df['full_rent_price_mean_1m_before'] = group_final_df['full_rent_price']
group_final_df['year_rent_price_mean_1m_before'] = group_final_df['year_rent_price']
group_final_df['deal_to_full_rent_rate_mean_1m_before'] = group_final_df['deal_to_full_rent_rate']
group_final_df['deal_to_year_rent_multiple_mean_1m_before'] = group_final_df['deal_to_year_rent_multiple']
# Set the values that would spill over during the shift to null; otherwise they would overwrite the rows of a different apartment
group_final_df.loc[(group_final_df['year']==2022)&(group_final_df['month']==12),
                   ['deal_price_mean_1m_before', 'full_rent_price_mean_1m_before',
                    'year_rent_price_mean_1m_before','deal_to_full_rent_rate_mean_1m_before','deal_to_year_rent_multiple_mean_1m_before']] = np.nan
# Shift down by one row to obtain the previous month's values
group_final_df[['deal_price_mean_1m_before', 'full_rent_price_mean_1m_before',
                'year_rent_price_mean_1m_before','deal_to_full_rent_rate_mean_1m_before','deal_to_year_rent_multiple_mean_1m_before']] = group_final_df[['deal_price_mean_1m_before', 
                                                                     'full_rent_price_mean_1m_before','year_rent_price_mean_1m_before','deal_to_full_rent_rate_mean_1m_before',
                                                                     'deal_to_year_rent_multiple_mean_1m_before']].shift(1)

# Create columns holding the average prices from 3 months earlier
group_final_df['deal_price_mean_3m_before'] = group_final_df['deal_price']
group_final_df['full_rent_price_mean_3m_before'] = group_final_df['full_rent_price']
group_final_df['year_rent_price_mean_3m_before'] = group_final_df['year_rent_price']
group_final_df['deal_to_full_rent_rate_mean_3m_before'] = group_final_df['deal_to_full_rent_rate']
group_final_df['deal_to_year_rent_multiple_mean_3m_before'] = group_final_df['deal_to_year_rent_multiple']
for i in range(3):
    number=12-i
    group_final_df.loc[(group_final_df['year']==2022)&(group_final_df['month']==number),
                        ['deal_price_mean_3m_before', 'full_rent_price_mean_3m_before',
                         'year_rent_price_mean_3m_before','deal_to_full_rent_rate_mean_3m_before','deal_to_year_rent_multiple_mean_3m_before']]=np.nan
group_final_df[['deal_price_mean_3m_before', 'full_rent_price_mean_3m_before',
                 'year_rent_price_mean_3m_before','deal_to_full_rent_rate_mean_3m_before','deal_to_year_rent_multiple_mean_3m_before']] = group_final_df[['deal_price_mean_3m_before',
                 'full_rent_price_mean_3m_before','year_rent_price_mean_3m_before','deal_to_full_rent_rate_mean_3m_before','deal_to_year_rent_multiple_mean_3m_before']].shift(3)

# Create columns holding the average prices from 6 months earlier
group_final_df['deal_price_mean_6m_before'] = group_final_df['deal_price']
group_final_df['full_rent_price_mean_6m_before'] = group_final_df['full_rent_price']
group_final_df['year_rent_price_mean_6m_before'] = group_final_df['year_rent_price']
group_final_df['deal_to_full_rent_rate_mean_6m_before'] = group_final_df['deal_to_full_rent_rate']
group_final_df['deal_to_year_rent_multiple_mean_6m_before'] = group_final_df['deal_to_year_rent_multiple']
for i in range(6):
    number=12-i
    group_final_df.loc[(group_final_df['year']==2022)&(group_final_df['month']==number),
                        ['deal_price_mean_6m_before', 'full_rent_price_mean_6m_before',
                         'year_rent_price_mean_6m_before','deal_to_full_rent_rate_mean_6m_before','deal_to_year_rent_multiple_mean_6m_before']]=np.nan
group_final_df[['deal_price_mean_6m_before', 'full_rent_price_mean_6m_before',
                 'year_rent_price_mean_6m_before','deal_to_full_rent_rate_mean_6m_before','deal_to_year_rent_multiple_mean_6m_before']] = group_final_df[['deal_price_mean_6m_before',
                 'full_rent_price_mean_6m_before','year_rent_price_mean_6m_before','deal_to_full_rent_rate_mean_6m_before','deal_to_year_rent_multiple_mean_6m_before']].shift(6)

# Create columns holding the average prices from 12 months earlier
group_final_df['deal_price_mean_12m_before'] = group_final_df['deal_price']
group_final_df['full_rent_price_mean_12m_before'] = group_final_df['full_rent_price']
group_final_df['year_rent_price_mean_12m_before'] = group_final_df['year_rent_price']
group_final_df['deal_to_full_rent_rate_mean_12m_before'] = group_final_df['deal_to_full_rent_rate']
group_final_df['deal_to_year_rent_multiple_mean_12m_before'] = group_final_df['deal_to_year_rent_multiple']
group_final_df.loc[(group_final_df['year']==2022),
                        ['deal_price_mean_12m_before', 'full_rent_price_mean_12m_before',
                         'year_rent_price_mean_12m_before','deal_to_full_rent_rate_mean_12m_before','deal_to_year_rent_multiple_mean_12m_before']]=np.nan
group_final_df[['deal_price_mean_12m_before', 'full_rent_price_mean_12m_before',
                 'year_rent_price_mean_12m_before','deal_to_full_rent_rate_mean_12m_before','deal_to_year_rent_multiple_mean_12m_before']] = group_final_df[['deal_price_mean_12m_before',
                 'full_rent_price_mean_12m_before','year_rent_price_mean_12m_before','deal_to_full_rent_rate_mean_12m_before','deal_to_year_rent_multiple_mean_12m_before']].shift(12)

group_final_df.drop(['deal_price','full_rent_price','year_rent_price','deal_to_full_rent_rate','deal_to_year_rent_multiple'],axis=1, inplace=True)
group_final_df.dropna(inplace=True)
group_final_df.head(10)      

Unnamed: 0,address_1,address_2,address_3,address_4,year,month,deal_price_mean_1m_before,full_rent_price_mean_1m_before,year_rent_price_mean_1m_before,deal_to_full_rent_rate_mean_1m_before,...,deal_price_mean_6m_before,full_rent_price_mean_6m_before,year_rent_price_mean_6m_before,deal_to_full_rent_rate_mean_6m_before,deal_to_year_rent_multiple_mean_6m_before,deal_price_mean_12m_before,full_rent_price_mean_12m_before,year_rent_price_mean_12m_before,deal_to_full_rent_rate_mean_12m_before,deal_to_year_rent_multiple_mean_12m_before
12,강남구,개포동,12,0,2012,1,42426.709677,21678.483871,1155.290323,51.096312,...,42387.096774,20866.129032,1332.83871,49.227549,31.80212,40938.0,18332.64,1190.48,44.781474,34.38781
13,강남구,개포동,12,0,2012,2,30774.193548,19862.774194,1112.387097,64.543606,...,44316.129032,20264.516129,1228.387097,45.72718,36.076681,46330.357143,16232.107143,1176.464286,35.035575,39.381015
14,강남구,개포동,12,0,2012,3,40358.62069,20747.275862,1088.896552,51.407297,...,43208.333333,20738.866667,1133.933333,47.997377,38.104827,47822.580645,17971.774194,1256.774194,37.580101,38.051848
15,강남구,개포동,12,0,2012,4,43870.967742,19877.935484,1057.032258,45.31,...,35287.096774,21646.451613,1315.354839,61.343816,26.827055,45638.666667,16180.866667,1026.8,35.454293,44.447474
16,강남구,개포동,12,0,2012,5,44583.333333,19611.533333,1185.033333,43.988486,...,53386.666667,21059.133333,1280.266667,39.446429,41.699646,41406.451613,18709.645161,1200.387097,45.185338,34.494249
17,강남구,개포동,12,0,2012,6,42716.129032,19727.129032,1055.677419,46.181921,...,42426.709677,21678.483871,1155.290323,51.096312,36.723851,42056.666667,19296.066667,1096.966667,45.881113,38.339056
18,강남구,개포동,12,0,2012,7,43401.666667,19952.733333,1163.833333,45.972274,...,30774.193548,19862.774194,1112.387097,64.543606,27.665004,42387.096774,20866.129032,1332.83871,49.227549,31.80212
19,강남구,개포동,12,0,2012,8,32402.419355,18897.806452,1070.645161,58.322208,...,40358.62069,20747.275862,1088.896552,51.407297,37.063779,44316.129032,20264.516129,1228.387097,45.72718,36.076681
20,강남구,개포동,12,0,2012,9,35403.225806,19605.354839,1282.0,55.377312,...,43870.967742,19877.935484,1057.032258,45.31,41.503906,43208.333333,20738.866667,1133.933333,47.997377,38.104827
21,강남구,개포동,12,0,2012,10,32816.666667,22546.366667,1250.2,68.704012,...,44583.333333,19611.533333,1185.033333,43.988486,37.622008,35287.096774,21646.451613,1315.354839,61.343816,26.827055
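The row-based `shift` above assumes that every apartment has one row for every month up to 2022-12. If an apartment has missing months, a shifted value may come from the wrong month or from a neighbouring apartment. As a hedged alternative sketch (not the method used to produce the outputs here), the lag can be attached by calendar month with a self-merge. `add_lag` and `month_idx` are hypothetical names, and the toy data uses a single address key for brevity.

```python
import pandas as pd

def add_lag(df, keys, value_cols, n_months):
    """Attach the values from exactly n_months earlier for the same apartment (illustrative helper)."""
    out = df.copy()
    # Running month counter so that month arithmetic crosses year boundaries correctly
    out["month_idx"] = out["year"] * 12 + (out["month"] - 1)
    lagged = out[keys + ["month_idx"] + value_cols].copy()
    lagged["month_idx"] += n_months  # a row observed in month t becomes the lag for month t + n
    lagged = lagged.rename(columns={c: f"{c}_mean_{n_months}m_before" for c in value_cols})
    out = out.merge(lagged, on=keys + ["month_idx"], how="left")
    return out.drop(columns="month_idx")

toy = pd.DataFrame({
    "address_1": ["a", "a", "a", "b"],
    "year": [2011, 2011, 2011, 2011],
    "month": [1, 2, 4, 1],          # apartment "a" has no row for month 3
    "deal_price": [100.0, 110.0, 130.0, 500.0],
})
lag1 = add_lag(toy, ["address_1"], ["deal_price"], 1)
print(lag1)
```

In this toy example only the 2011-02 row of apartment "a" receives a lagged value. The 2011-04 row stays empty because month 3 is missing, and apartment "b" never inherits a value from "a".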


In [None]:
group_final_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 707038 entries, 12 to 796746
Data columns (total 26 columns):
 #   Column                                      Non-Null Count   Dtype  
---  ------                                      --------------   -----  
 0   address_1                                   707038 non-null  object 
 1   address_2                                   707038 non-null  object 
 2   address_3                                   707038 non-null  int64  
 3   address_4                                   707038 non-null  int64  
 4   year                                        707038 non-null  int64  
 5   month                                       707038 non-null  int64  
 6   deal_price_mean_1m_before                   707038 non-null  float64
 7   full_rent_price_mean_1m_before              707038 non-null  float64
 8   year_rent_price_mean_1m_before              707038 non-null  float64
 9   deal_to_full_rent_rate_mean_1m_before       707038 non-null  float64


In [None]:
# Apply rounding (commented out)
# group_final_df['deal_to_full_rent_rate_mean_1m_before'] = group_final_df['deal_to_full_rent_rate_mean_1m_before'].astype(float).round(2)
# group_final_df['deal_to_year_rent_multiple_mean_1m_before'] = group_final_df['deal_to_year_rent_multiple_mean_1m_before'].astype(float).round(2)

# group_final_df['deal_to_full_rent_rate_mean_3m_before'] = group_final_df['deal_to_full_rent_rate_mean_3m_before'].astype(float).round(2)
# group_final_df['deal_to_year_rent_multiple_mean_3m_before'] = group_final_df['deal_to_year_rent_multiple_mean_3m_before'].astype(float).round(2)

# group_final_df['deal_to_full_rent_rate_mean_6m_before'] = group_final_df['deal_to_full_rent_rate_mean_6m_before'].astype(float).round(2)
# group_final_df['deal_to_year_rent_multiple_mean_6m_before'] = group_final_df['deal_to_year_rent_multiple_mean_6m_before'].astype(float).round(2)

# group_final_df['deal_to_full_rent_rate_mean_12m_before'] = group_final_df['deal_to_full_rent_rate_mean_12m_before'].astype(float).round(2)
# group_final_df['deal_to_year_rent_multiple_mean_12m_before'] = group_final_df['deal_to_year_rent_multiple_mean_12m_before'].astype(float).round(2)

In [None]:
# Change dtypes
group_final_df = group_final_df.astype({'address_3':'int16','address_4':'int16','year':'int16','month':'int16',
                       'deal_price_mean_1m_before':'int32', 'full_rent_price_mean_1m_before':'int32', 'year_rent_price_mean_1m_before':'int32', 'deal_to_full_rent_rate_mean_1m_before':'float32', 'deal_to_year_rent_multiple_mean_1m_before':'float32',
                       'deal_price_mean_3m_before':'int32', 'full_rent_price_mean_3m_before':'int32', 'year_rent_price_mean_3m_before':'int32', 'deal_to_full_rent_rate_mean_3m_before':'float32', 'deal_to_year_rent_multiple_mean_3m_before':'float32',
                       'deal_price_mean_6m_before':'int32', 'full_rent_price_mean_6m_before':'int32', 'year_rent_price_mean_6m_before':'int32', 'deal_to_full_rent_rate_mean_6m_before':'float32', 'deal_to_year_rent_multiple_mean_6m_before':'float32',
                       'deal_price_mean_12m_before':'int32', 'full_rent_price_mean_12m_before':'int32', 'year_rent_price_mean_12m_before':'int32', 'deal_to_full_rent_rate_mean_12m_before':'float32', 'deal_to_year_rent_multiple_mean_12m_before':'float32'})

In [None]:
group_final_df.head()

Unnamed: 0,address_1,address_2,address_3,address_4,year,month,deal_price_mean_1m_before,full_rent_price_mean_1m_before,year_rent_price_mean_1m_before,deal_to_full_rent_rate_mean_1m_before,...,deal_price_mean_6m_before,full_rent_price_mean_6m_before,year_rent_price_mean_6m_before,deal_to_full_rent_rate_mean_6m_before,deal_to_year_rent_multiple_mean_6m_before,deal_price_mean_12m_before,full_rent_price_mean_12m_before,year_rent_price_mean_12m_before,deal_to_full_rent_rate_mean_12m_before,deal_to_year_rent_multiple_mean_12m_before
12,강남구,개포동,12,0,2012,1,42426,21678,1155,51.096313,...,42387,20866,1332,49.227551,31.80212,40938,18332,1190,44.781475,34.38781
13,강남구,개포동,12,0,2012,2,30774,19862,1112,64.54361,...,44316,20264,1228,45.72718,36.076679,46330,16232,1176,35.035576,39.381016
14,강남구,개포동,12,0,2012,3,40358,20747,1088,51.407295,...,43208,20738,1133,47.997375,38.104828,47822,17971,1256,37.580101,38.051849
15,강남구,개포동,12,0,2012,4,43870,19877,1057,45.310001,...,35287,21646,1315,61.343815,26.827055,45638,16180,1026,35.454292,44.447475
16,강남구,개포동,12,0,2012,5,44583,19611,1185,43.988487,...,53386,21059,1280,39.44643,41.699646,41406,18709,1200,45.185337,34.494247


In [None]:
group_final_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 707038 entries, 12 to 796746
Data columns (total 26 columns):
 #   Column                                      Non-Null Count   Dtype  
---  ------                                      --------------   -----  
 0   address_1                                   707038 non-null  object 
 1   address_2                                   707038 non-null  object 
 2   address_3                                   707038 non-null  int16  
 3   address_4                                   707038 non-null  int16  
 4   year                                        707038 non-null  int16  
 5   month                                       707038 non-null  int16  
 6   deal_price_mean_1m_before                   707038 non-null  int32  
 7   full_rent_price_mean_1m_before              707038 non-null  int32  
 8   year_rent_price_mean_1m_before              707038 non-null  int32  
 9   deal_to_full_rent_rate_mean_1m_before       707038 non-null  float32


## Merging with final_df2

- group_final_df holds the past average values for each region, so it is merged with final_df2 to create final_df3

In [None]:
final_df3 = pd.merge(final_df2, group_final_df, 
         on=['address_1','address_2','address_3','address_4','year','month'], how='left')
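A left merge keeps every row of `final_df2` and leaves the averaged columns as NaN wherever `group_final_df` has no matching region and month. The sketch below reproduces that behaviour on toy data. The frames `left` and `right` and all their values are hypothetical stand-ins, not the project's data; `validate` and `indicator` are standard `pd.merge` parameters, added here only as an optional sanity check.

```python
import pandas as pd

# Hypothetical toy stand-ins for final_df2 (left) and group_final_df (right).
left = pd.DataFrame({
    "address_1": ["A", "A", "A"],
    "year": [2011, 2012, 2012],
    "month": [1, 1, 2],
    "deal_price": [100, 110, 120],
})
right = pd.DataFrame({
    "address_1": ["A", "A"],
    "year": [2012, 2012],
    "month": [1, 2],
    "deal_price_mean_1m_before": pd.Series([105, 108], dtype="int32"),
})

merged = pd.merge(
    left, right,
    on=["address_1", "year", "month"],
    how="left",
    validate="many_to_one",  # raises if the right side has duplicate keys
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Rows without a match get NaN, which forces the int32 column up to float64.
n_unmatched = int((merged["_merge"] == "left_only").sum())
print(n_unmatched)
print(merged["deal_price_mean_1m_before"].dtype)
```

The upcast caused by NaN is consistent with the averaged columns showing up as float64 in `final_df3.info()` even though they were int32 in `group_final_df`.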

In [None]:
final_df3.head()

Unnamed: 0,year,month,day,address_1,address_2,address_3,address_4,deal_price,full_rent_price,year_rent_price,...,deal_price_mean_6m_before,full_rent_price_mean_6m_before,year_rent_price_mean_6m_before,deal_to_full_rent_rate_mean_6m_before,deal_to_year_rent_multiple_mean_6m_before,deal_price_mean_12m_before,full_rent_price_mean_12m_before,year_rent_price_mean_12m_before,deal_to_full_rent_rate_mean_12m_before,deal_to_year_rent_multiple_mean_12m_before
0,2011,1,5,강남구,개포동,138,0,89400,11000,778,...,,,,,,,,,,
1,2011,1,5,강남구,개포동,185,0,89000,22000,1786,...,,,,,,,,,,
2,2011,1,6,강남구,개포동,138,0,89400,7500,778,...,,,,,,,,,,
3,2011,1,6,강남구,개포동,185,0,89000,30666,1424,...,,,,,,,,,,
4,2011,1,7,강남구,개포동,12,0,35000,24000,1196,...,,,,,,,,,,


In [None]:
final_df3.isnull().sum() # rows from 2011 become null (because of the averages that look back up to 12 months)

year                                                0
month                                               0
day                                                 0
address_1                                           0
address_2                                           0
address_3                                           0
address_4                                           0
deal_price                                          0
full_rent_price                                     0
year_rent_price                                     0
deal_to_full_rent_rate                              0
deal_to_year_rent_multiple                          0
deal_price_mean_1m_before                     2620275
full_rent_price_mean_1m_before                2620275
year_rent_price_mean_1m_before                2620275
deal_to_full_rent_rate_mean_1m_before         2620275
deal_to_year_rent_multiple_mean_1m_before     2620275
deal_price_mean_3m_before                     2620275
full_rent_price_mean_3m_befo

In [None]:
final_df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24144106 entries, 0 to 24144105
Data columns (total 32 columns):
 #   Column                                      Dtype  
---  ------                                      -----  
 0   year                                        int16  
 1   month                                       int16  
 2   day                                         int16  
 3   address_1                                   object 
 4   address_2                                   object 
 5   address_3                                   int16  
 6   address_4                                   int16  
 7   deal_price                                  int32  
 8   full_rent_price                             int32  
 9   year_rent_price                             int32  
 10  deal_to_full_rent_rate                      float32
 11  deal_to_year_rent_multiple                  float32
 12  deal_price_mean_1m_before                   float64
 13  full_rent_price_mean_1m_b

- Because of memory limits, the result is saved to disk and the runtime is restarted before continuing

In [None]:
final_df3.to_pickle('/content/drive/MyDrive/house_price/after_data/final_df3.pkl')

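# Creating final_df4 (modifying final_df3)

- Handle the missing values in final_df3
- Compute the rate of change of the current price relative to the past prices obtained in final_df3
- Earlier, some values were 0, which made a rate hard to compute, so the change was measured as an absolute difference (a count). Deal prices and valuation metrics contain no zeros, so the change can be computed as a rate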
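The rate of change described above is 100 × (current − past) / past, applied to each base column and each look-back window. As a minimal sketch, the same calculation can be written as a loop over column names. The toy frame and its values are hypothetical; the column naming pattern `<base>_mean_<window>_before` follows the notebook.

```python
import pandas as pd

# Hypothetical toy frame: current values plus one past-average column each.
df = pd.DataFrame({
    "deal_price": [110, 90],
    "full_rent_price": [50, 60],
    "deal_price_mean_1m_before": [100.0, 120.0],
    "full_rent_price_mean_1m_before": [40.0, 80.0],
})

bases = ["deal_price", "full_rent_price"]
windows = ["1m"]  # the notebook uses 1m, 3m, 6m and 12m

for base in bases:
    for window in windows:
        past = f"{base}_mean_{window}_before"
        # rate of change (%) = 100 * (current - past) / past
        # The past-average column is overwritten with the rate, as in the notebook.
        df[past] = 100 * (df[base] - df[past]) / df[past]

deal_rates = df["deal_price_mean_1m_before"].tolist()
rent_rates = df["full_rent_price_mean_1m_before"].tolist()
print(deal_rates, rent_rates)
```

Because the division is by the past average, this only works when that average is never 0, which is the condition stated in the last bullet.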

In [None]:
import pandas as pd
final_df3 = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/final_df3.pkl')

In [None]:
final_df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24144106 entries, 0 to 24144105
Data columns (total 32 columns):
 #   Column                                      Dtype  
---  ------                                      -----  
 0   year                                        int16  
 1   month                                       int16  
 2   day                                         int16  
 3   address_1                                   object 
 4   address_2                                   object 
 5   address_3                                   int16  
 6   address_4                                   int16  
 7   deal_price                                  int32  
 8   full_rent_price                             int32  
 9   year_rent_price                             int32  
 10  deal_to_full_rent_rate                      float32
 11  deal_to_year_rent_multiple                  float32
 12  deal_price_mean_1m_before                   float64
 13  full_rent_price_mean_1m_b

In [None]:
# Drop rows that contain nulls
final_df3.dropna(inplace=True)

In [None]:
# rate of change = 100*((current price - past price)/past price)
final_df3['deal_price_mean_1m_before'] = 100*((final_df3['deal_price']-final_df3['deal_price_mean_1m_before'])/final_df3['deal_price_mean_1m_before'])
final_df3['full_rent_price_mean_1m_before'] = 100*((final_df3['full_rent_price']-final_df3['full_rent_price_mean_1m_before'])/final_df3['full_rent_price_mean_1m_before'])
final_df3['year_rent_price_mean_1m_before'] = 100*((final_df3['year_rent_price']-final_df3['year_rent_price_mean_1m_before'])/final_df3['year_rent_price_mean_1m_before'])
final_df3['deal_to_full_rent_rate_mean_1m_before'] = 100*((final_df3['deal_to_full_rent_rate']-final_df3['deal_to_full_rent_rate_mean_1m_before'])/final_df3['deal_to_full_rent_rate_mean_1m_before'])
final_df3['deal_to_year_rent_multiple_mean_1m_before'] = 100*((final_df3['deal_to_year_rent_multiple']-final_df3['deal_to_year_rent_multiple_mean_1m_before'])/final_df3['deal_to_year_rent_multiple_mean_1m_before'])

final_df3['deal_price_mean_3m_before'] = 100*((final_df3['deal_price']-final_df3['deal_price_mean_3m_before'])/final_df3['deal_price_mean_3m_before'])
final_df3['full_rent_price_mean_3m_before'] = 100*((final_df3['full_rent_price']-final_df3['full_rent_price_mean_3m_before'])/final_df3['full_rent_price_mean_3m_before'])
final_df3['year_rent_price_mean_3m_before'] = 100*((final_df3['year_rent_price']-final_df3['year_rent_price_mean_3m_before'])/final_df3['year_rent_price_mean_3m_before'])
final_df3['deal_to_full_rent_rate_mean_3m_before'] = 100*((final_df3['deal_to_full_rent_rate']-final_df3['deal_to_full_rent_rate_mean_3m_before'])/final_df3['deal_to_full_rent_rate_mean_3m_before'])
final_df3['deal_to_year_rent_multiple_mean_3m_before'] = 100*((final_df3['deal_to_year_rent_multiple']-final_df3['deal_to_year_rent_multiple_mean_3m_before'])/final_df3['deal_to_year_rent_multiple_mean_3m_before'])

final_df3['deal_price_mean_6m_before'] = 100*((final_df3['deal_price']-final_df3['deal_price_mean_6m_before'])/final_df3['deal_price_mean_6m_before'])
final_df3['full_rent_price_mean_6m_before'] = 100*((final_df3['full_rent_price']-final_df3['full_rent_price_mean_6m_before'])/final_df3['full_rent_price_mean_6m_before'])
final_df3['year_rent_price_mean_6m_before'] = 100*((final_df3['year_rent_price']-final_df3['year_rent_price_mean_6m_before'])/final_df3['year_rent_price_mean_6m_before'])
final_df3['deal_to_full_rent_rate_mean_6m_before'] = 100*((final_df3['deal_to_full_rent_rate']-final_df3['deal_to_full_rent_rate_mean_6m_before'])/final_df3['deal_to_full_rent_rate_mean_6m_before'])
final_df3['deal_to_year_rent_multiple_mean_6m_before'] = 100*((final_df3['deal_to_year_rent_multiple']-final_df3['deal_to_year_rent_multiple_mean_6m_before'])/final_df3['deal_to_year_rent_multiple_mean_6m_before'])

final_df3['deal_price_mean_12m_before'] = 100*((final_df3['deal_price']-final_df3['deal_price_mean_12m_before'])/final_df3['deal_price_mean_12m_before'])
final_df3['full_rent_price_mean_12m_before'] = 100*((final_df3['full_rent_price']-final_df3['full_rent_price_mean_12m_before'])/final_df3['full_rent_price_mean_12m_before'])
final_df3['year_rent_price_mean_12m_before'] = 100*((final_df3['year_rent_price']-final_df3['year_rent_price_mean_12m_before'])/final_df3['year_rent_price_mean_12m_before'])
final_df3['deal_to_full_rent_rate_mean_12m_before'] = 100*((final_df3['deal_to_full_rent_rate']-final_df3['deal_to_full_rent_rate_mean_12m_before'])/final_df3['deal_to_full_rent_rate_mean_12m_before'])
final_df3['deal_to_year_rent_multiple_mean_12m_before'] = 100*((final_df3['deal_to_year_rent_multiple']-final_df3['deal_to_year_rent_multiple_mean_12m_before'])/final_df3['deal_to_year_rent_multiple_mean_12m_before'])


In [None]:
final_df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21523831 entries, 4847 to 24144105
Data columns (total 32 columns):
 #   Column                                      Dtype  
---  ------                                      -----  
 0   year                                        int16  
 1   month                                       int16  
 2   day                                         int16  
 3   address_1                                   object 
 4   address_2                                   object 
 5   address_3                                   int16  
 6   address_4                                   int16  
 7   deal_price                                  int32  
 8   full_rent_price                             int32  
 9   year_rent_price                             int32  
 10  deal_to_full_rent_rate                      float32
 11  deal_to_year_rent_multiple                  float32
 12  deal_price_mean_1m_before                   float64
 13  full_rent_price_mean_1

In [None]:
pd.set_option('display.max_columns', 35) # show all columns so they can be inspected
final_df3.head(10)

Unnamed: 0,year,month,day,address_1,address_2,address_3,address_4,deal_price,full_rent_price,year_rent_price,deal_to_full_rent_rate,deal_to_year_rent_multiple,deal_price_mean_1m_before,full_rent_price_mean_1m_before,year_rent_price_mean_1m_before,deal_to_full_rent_rate_mean_1m_before,deal_to_year_rent_multiple_mean_1m_before,deal_price_mean_3m_before,full_rent_price_mean_3m_before,year_rent_price_mean_3m_before,deal_to_full_rent_rate_mean_3m_before,deal_to_year_rent_multiple_mean_3m_before,deal_price_mean_6m_before,full_rent_price_mean_6m_before,year_rent_price_mean_6m_before,deal_to_full_rent_rate_mean_6m_before,deal_to_year_rent_multiple_mean_6m_before,deal_price_mean_12m_before,full_rent_price_mean_12m_before,year_rent_price_mean_12m_before,deal_to_full_rent_rate_mean_12m_before,deal_to_year_rent_multiple_mean_12m_before
4847,2012,1,1,강남구,개포동,12,0,26500,15000,1172,56.603775,22.610922,-37.538302,-30.805425,1.471861,10.778589,-38.429871,-24.901522,-30.703132,-10.874525,-7.727005,-15.715975,-37.480831,-28.112719,-12.012012,14.983935,-28.901213,-35.267966,-18.175867,-1.512605,26.399979,-34.247276
4849,2012,1,1,강남구,개포동,138,0,60000,12000,665,20.0,90.225563,-3.124243,49.868865,-11.92053,54.692726,10.117944,-3.074164,46.092038,8.660131,50.716675,-10.799408,-10.886678,48.957299,10.465116,67.147148,-19.251957,-24.552028,26.275913,-7.638889,67.357758,-18.20787
4850,2012,1,1,강남구,개포동,140,0,98800,25000,1170,25.303644,84.444443,12.577197,20.25012,-2.743142,6.814438,15.804789,42.490409,45.070504,33.40935,1.80661,6.868674,67.307334,39.828849,6.074343,-16.423784,57.859653,9.741197,64.171263,-4.723127,49.596737,15.229732
4851,2012,1,1,강남구,개포동,141,0,96500,11500,731,11.917098,132.010941,27.401149,3.912533,-12.244898,-18.437162,45.238098,23.071037,2.376925,-13.079667,-16.819038,41.670792,15.409914,19.072272,-0.136612,3.166785,15.668888,32.345882,9.806168,-9.079602,-17.034622,45.64743
4854,2012,1,1,강남구,개포동,185,0,80800,35500,1644,43.935642,49.148418,2.031797,18.538801,1.669759,16.176289,0.372925,4.533223,5.188302,-19.608802,0.624885,30.084763,7.687387,10.958305,-8.310095,3.036009,17.478666,-1.844068,31.696097,-1.261261,34.167637,-0.575397
4855,2012,1,1,강남구,개포동,187,0,61000,22000,2390,36.065575,25.523012,-8.540242,-23.454299,18.199802,-16.308435,-22.601431,-26.469702,-32.773109,38.470452,-8.574032,-46.894669,-19.152828,-32.215923,58.383035,-16.159578,-48.940884,31.343798,-26.095136,25.32774,-43.732899,4.819834
4856,2012,1,1,강남구,개포동,189,0,63000,10000,680,15.873015,92.647057,-2.264971,2.574623,-8.232119,4.942883,6.575386,-4.109589,4.036621,-2.439024,8.492653,-1.708987,-7.920315,0.908174,2.255639,9.580949,-9.938848,51.602657,11.209964,-5.4242,-26.645311,60.375427
4857,2012,1,1,강남구,개포동,649,0,120800,50000,3430,41.390728,35.218658,0.0,-5.515977,0.0,-5.516608,0.0,25.496063,27.625903,2.265951,1.696118,22.734194,-3.234592,-9.974793,9.76,-6.965523,-11.816846,21.407035,3.092784,56.621005,-15.084998,-22.483553
4862,2012,1,2,강남구,개포동,12,0,26500,23500,1172,88.679245,22.610922,-37.538302,8.404834,1.471861,73.553116,-38.429871,-24.901522,8.565093,-10.874525,44.56102,-15.715975,-37.480831,12.623406,-12.012012,80.141495,-28.901213,-35.267966,28.191141,-1.512605,98.026627,-34.247276
4864,2012,1,2,강남구,개포동,138,0,60000,12000,665,20.0,90.225563,-3.124243,49.868865,-11.92053,54.692726,10.117944,-3.074164,46.092038,8.660131,50.716675,-10.799408,-10.886678,48.957299,10.465116,67.147148,-19.251957,-24.552028,26.275913,-7.638889,67.357758,-18.20787


In [None]:
# Round the rates of change to two decimal places
final_df3['deal_to_year_rent_multiple'] = final_df3['deal_to_year_rent_multiple'].astype(float).round(2)

final_df3['deal_price_mean_1m_before'] = final_df3['deal_price_mean_1m_before'].astype(float).round(2)
final_df3['full_rent_price_mean_1m_before'] = final_df3['full_rent_price_mean_1m_before'].astype(float).round(2)
final_df3['year_rent_price_mean_1m_before'] = final_df3['year_rent_price_mean_1m_before'].astype(float).round(2)
final_df3['deal_to_full_rent_rate_mean_1m_before'] = final_df3['deal_to_full_rent_rate_mean_1m_before'].astype(float).round(2)
final_df3['deal_to_year_rent_multiple_mean_1m_before'] = final_df3['deal_to_year_rent_multiple_mean_1m_before'].astype(float).round(2)

final_df3['deal_price_mean_3m_before'] = final_df3['deal_price_mean_3m_before'].astype(float).round(2)
final_df3['full_rent_price_mean_3m_before'] = final_df3['full_rent_price_mean_3m_before'].astype(float).round(2)
final_df3['year_rent_price_mean_3m_before'] = final_df3['year_rent_price_mean_3m_before'].astype(float).round(2)
final_df3['deal_to_full_rent_rate_mean_3m_before'] = final_df3['deal_to_full_rent_rate_mean_3m_before'].astype(float).round(2)
final_df3['deal_to_year_rent_multiple_mean_3m_before'] = final_df3['deal_to_year_rent_multiple_mean_3m_before'].astype(float).round(2)

final_df3['deal_price_mean_6m_before'] = final_df3['deal_price_mean_6m_before'].astype(float).round(2)
final_df3['full_rent_price_mean_6m_before'] = final_df3['full_rent_price_mean_6m_before'].astype(float).round(2)
final_df3['year_rent_price_mean_6m_before'] = final_df3['year_rent_price_mean_6m_before'].astype(float).round(2)
final_df3['deal_to_full_rent_rate_mean_6m_before'] = final_df3['deal_to_full_rent_rate_mean_6m_before'].astype(float).round(2)
final_df3['deal_to_year_rent_multiple_mean_6m_before'] = final_df3['deal_to_year_rent_multiple_mean_6m_before'].astype(float).round(2)

final_df3['deal_price_mean_12m_before'] = final_df3['deal_price_mean_12m_before'].astype(float).round(2)
final_df3['full_rent_price_mean_12m_before'] = final_df3['full_rent_price_mean_12m_before'].astype(float).round(2)
final_df3['year_rent_price_mean_12m_before'] = final_df3['year_rent_price_mean_12m_before'].astype(float).round(2)
final_df3['deal_to_full_rent_rate_mean_12m_before'] = final_df3['deal_to_full_rent_rate_mean_12m_before'].astype(float).round(2)
final_df3['deal_to_year_rent_multiple_mean_12m_before'] = final_df3['deal_to_year_rent_multiple_mean_12m_before'].astype(float).round(2)
final_df3.head()

Unnamed: 0,year,month,day,address_1,address_2,address_3,address_4,deal_price,full_rent_price,year_rent_price,deal_to_full_rent_rate,deal_to_year_rent_multiple,deal_price_mean_1m_before,full_rent_price_mean_1m_before,year_rent_price_mean_1m_before,deal_to_full_rent_rate_mean_1m_before,deal_to_year_rent_multiple_mean_1m_before,deal_price_mean_3m_before,full_rent_price_mean_3m_before,year_rent_price_mean_3m_before,deal_to_full_rent_rate_mean_3m_before,deal_to_year_rent_multiple_mean_3m_before,deal_price_mean_6m_before,full_rent_price_mean_6m_before,year_rent_price_mean_6m_before,deal_to_full_rent_rate_mean_6m_before,deal_to_year_rent_multiple_mean_6m_before,deal_price_mean_12m_before,full_rent_price_mean_12m_before,year_rent_price_mean_12m_before,deal_to_full_rent_rate_mean_12m_before,deal_to_year_rent_multiple_mean_12m_before
4847,2012,1,1,강남구,개포동,12,0,26500,15000,1172,56.603775,22.61,-37.54,-30.81,1.47,10.78,-38.43,-24.9,-30.7,-10.87,-7.73,-15.72,-37.48,-28.11,-12.01,14.98,-28.9,-35.27,-18.18,-1.51,26.4,-34.25
4849,2012,1,1,강남구,개포동,138,0,60000,12000,665,20.0,90.23,-3.12,49.87,-11.92,54.69,10.12,-3.07,46.09,8.66,50.72,-10.8,-10.89,48.96,10.47,67.15,-19.25,-24.55,26.28,-7.64,67.36,-18.21
4850,2012,1,1,강남구,개포동,140,0,98800,25000,1170,25.303644,84.44,12.58,20.25,-2.74,6.81,15.8,42.49,45.07,33.41,1.81,6.87,67.31,39.83,6.07,-16.42,57.86,9.74,64.17,-4.72,49.6,15.23
4851,2012,1,1,강남구,개포동,141,0,96500,11500,731,11.917098,132.01,27.4,3.91,-12.24,-18.44,45.24,23.07,2.38,-13.08,-16.82,41.67,15.41,19.07,-0.14,3.17,15.67,32.35,9.81,-9.08,-17.03,45.65
4854,2012,1,1,강남구,개포동,185,0,80800,35500,1644,43.935642,49.15,2.03,18.54,1.67,16.18,0.37,4.53,5.19,-19.61,0.62,30.08,7.69,10.96,-8.31,3.04,17.48,-1.84,31.7,-1.26,34.17,-0.58


In [None]:
final_df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21523831 entries, 4847 to 24144105
Data columns (total 32 columns):
 #   Column                                      Dtype  
---  ------                                      -----  
 0   year                                        int16  
 1   month                                       int16  
 2   day                                         int16  
 3   address_1                                   object 
 4   address_2                                   object 
 5   address_3                                   int16  
 6   address_4                                   int16  
 7   deal_price                                  int32  
 8   full_rent_price                             int32  
 9   year_rent_price                             int32  
 10  deal_to_full_rent_rate                      float32
 11  deal_to_year_rent_multiple                  float64
 12  deal_price_mean_1m_before                   float64
 13  full_rent_price_mean_1

In [None]:
# Convert the float64 columns to float32
final_df3 = final_df3.astype({'deal_to_year_rent_multiple':'float32',
                              'deal_price_mean_1m_before':'float32', 'full_rent_price_mean_1m_before':'float32', 'year_rent_price_mean_1m_before':'float32','deal_to_full_rent_rate_mean_1m_before':'float32','deal_to_year_rent_multiple_mean_1m_before':'float32',
                              'deal_price_mean_3m_before':'float32', 'full_rent_price_mean_3m_before':'float32', 'year_rent_price_mean_3m_before':'float32','deal_to_full_rent_rate_mean_3m_before':'float32','deal_to_year_rent_multiple_mean_3m_before':'float32',
                              'deal_price_mean_6m_before':'float32', 'full_rent_price_mean_6m_before':'float32', 'year_rent_price_mean_6m_before':'float32','deal_to_full_rent_rate_mean_6m_before':'float32','deal_to_year_rent_multiple_mean_6m_before':'float32',
                              'deal_price_mean_12m_before':'float32', 'full_rent_price_mean_12m_before':'float32', 'year_rent_price_mean_12m_before':'float32','deal_to_full_rent_rate_mean_12m_before':'float32','deal_to_year_rent_multiple_mean_12m_before':'float32',})

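>> Casting to float32 after rounding appears to undo the rounding: most two-decimal values have no exact float32 representation, so only the nearest float32 value is stored.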
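A small check illustrates the effect. The value `22.610922` is taken from the `deal_to_year_rent_multiple` column shown earlier; everything else in the snippet is an illustrative assumption, not part of the original pipeline.

```python
import pandas as pd

rounded = pd.Series([22.610922]).round(2)  # float64 value rounded to 2 decimals
as_f32 = rounded.astype("float32")         # stored as the nearest float32

# Widening back to float64 exposes the float32 representation error.
back = float(as_f32.astype("float64").iloc[0])

print(back == 22.61)            # the stored value is not exactly 22.61
print(abs(back - 22.61) < 1e-5) # but it is very close
print(round(back, 2) == 22.61)  # rounding again recovers the 2-decimal value
```

The rounding itself is not lost; the extra digits come from how float32 stores the number. If exact two-decimal display matters, one option is to round again when displaying or after any later cast back to float64.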

In [None]:
final_df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21523831 entries, 4847 to 24144105
Data columns (total 32 columns):
 #   Column                                      Dtype  
---  ------                                      -----  
 0   year                                        int16  
 1   month                                       int16  
 2   day                                         int16  
 3   address_1                                   object 
 4   address_2                                   object 
 5   address_3                                   int16  
 6   address_4                                   int16  
 7   deal_price                                  int32  
 8   full_rent_price                             int32  
 9   year_rent_price                             int32  
 10  deal_to_full_rent_rate                      float32
 11  deal_to_year_rent_multiple                  float32
 12  deal_price_mean_1m_before                   float32
 13  full_rent_price_mean_1

In [None]:
final_df3.to_pickle('/content/drive/MyDrive/house_price/after_data/final_df4.pkl')

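# Creating future_df

- The ultimate goal is to predict apartment sale prices from the various indicators, so a dataframe with the address and the future price as columns is created (so that the future price can be attached later)
- The future price is defined as the apartment's sale price one year later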
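The idea can be sketched on toy data: copy the price table, shift every date back by one year, and merge it onto the original, so that each row picks up the price recorded one year after it. The frame `prices`, the column `address`, and all values below are hypothetical; the 365-day shift mirrors what the notebook does further down.

```python
import pandas as pd

# Hypothetical toy prices for a single apartment.
prices = pd.DataFrame({
    "address": ["A", "A", "A"],
    "date": pd.to_datetime(["2010-01-05", "2011-01-05", "2012-01-05"]),
    "deal_price": [100, 110, 130],
})

# Shift each date back by 365 days and rename the price column.
future = prices.copy()
future["date"] = future["date"] - pd.Timedelta(days=365)
future = future.rename(columns={"deal_price": "future_deal_price"})

# A row dated D now matches the price that was observed at D + 365 days.
labeled = prices.merge(future, on=["address", "date"], how="left")
print(labeled)
```

The last row has no price one year later, so its `future_deal_price` is NaN; in the real data the same happens for the final year of observations.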

## Loading final_df2

- final_df2 holds the transaction prices (sale price, jeonse deposit, annualized rent) with the zero values removed, so the future-price dataframe is built from final_df2

In [None]:
import pandas as pd
# Load final_df2, since rows with a transaction price of 0 have already been removed from it
future_df = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/final_df2.pkl')
future_df

Unnamed: 0,year,month,day,address_1,address_2,address_3,address_4,deal_price,full_rent_price,year_rent_price,deal_to_full_rent_rate,deal_to_year_rent_multiple
0,2011,1,5,강남구,개포동,138,0,89400,11000,778,12.304251,114.910027
1,2011,1,5,강남구,개포동,185,0,89000,22000,1786,24.719101,49.832027
2,2011,1,6,강남구,개포동,138,0,89400,7500,778,8.389262,114.910027
3,2011,1,6,강남구,개포동,185,0,89000,30666,1424,34.456181,62.500000
4,2011,1,7,강남구,개포동,12,0,35000,24000,1196,68.571426,29.264214
...,...,...,...,...,...,...,...,...,...,...,...,...
24144101,2022,12,31,중랑구,중화동,438,0,69500,32000,389,46.043167,178.663239
24144102,2022,12,31,중랑구,중화동,450,0,91500,32500,2090,35.519127,43.779903
24144103,2022,12,31,중랑구,중화동,452,0,55000,33000,1610,60.000000,34.161491
24144104,2022,12,31,중랑구,중화동,453,0,85500,57750,2070,67.543861,41.304348


In [None]:
future_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24144106 entries, 0 to 24144105
Data columns (total 12 columns):
 #   Column                      Dtype  
---  ------                      -----  
 0   year                        int16  
 1   month                       int16  
 2   day                         int16  
 3   address_1                   object 
 4   address_2                   object 
 5   address_3                   int16  
 6   address_4                   int16  
 7   deal_price                  int32  
 8   full_rent_price             int32  
 9   year_rent_price             int32  
 10  deal_to_full_rent_rate      float32
 11  deal_to_year_rent_multiple  float32
dtypes: float32(2), int16(5), int32(3), object(2)
memory usage: 1.0+ GB


In [None]:
# Create a date column so the date one year later can be worked out
future_df = future_df.astype({'year':'str','month':'str','day':'str'})
future_df.loc[future_df["month"].str.len()==1,"month"]='0'+future_df.loc[future_df["month"].str.len()==1,"month"]
future_df.loc[future_df["day"].str.len()==1,"day"]='0'+future_df.loc[future_df["day"].str.len()==1,"day"] # prepend 0 to single-digit days
future_df['date'] = pd.to_datetime(future_df['year']+future_df['month']+future_df['day']) 
future_df = future_df.astype({'year':'int16','month':'int16','day':'int16'})
future_df.head(10)

Unnamed: 0,year,month,day,address_1,address_2,address_3,address_4,deal_price,full_rent_price,year_rent_price,deal_to_full_rent_rate,deal_to_year_rent_multiple,date
0,2011,1,5,강남구,개포동,138,0,89400,11000,778,12.304251,114.910027,2011-01-05
1,2011,1,5,강남구,개포동,185,0,89000,22000,1786,24.719101,49.832027,2011-01-05
2,2011,1,6,강남구,개포동,138,0,89400,7500,778,8.389262,114.910027,2011-01-06
3,2011,1,6,강남구,개포동,185,0,89000,30666,1424,34.456181,62.5,2011-01-06
4,2011,1,7,강남구,개포동,12,0,35000,24000,1196,68.571426,29.264214,2011-01-07
5,2011,1,7,강남구,개포동,138,0,89400,7000,778,7.829978,114.910027,2011-01-07
6,2011,1,7,강남구,개포동,141,0,80300,12000,1134,14.94396,70.811287,2011-01-07
7,2011,1,7,강남구,개포동,185,0,79000,24000,1424,30.379747,55.477528,2011-01-07
8,2011,1,8,강남구,개포동,12,0,35000,24000,1196,68.571426,29.264214,2011-01-08
9,2011,1,8,강남구,개포동,138,0,89400,9000,658,10.067114,135.866257,2011-01-08
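As an aside, the zero-padding and string concatenation used to build `date` can also be expressed with `pd.to_datetime` applied to a frame whose columns are named `year`, `month`, and `day`. The sketch below uses hypothetical toy rows; the explicit cast to int64 is a precaution added here, not something the notebook does.

```python
import pandas as pd

# Hypothetical toy rows with the same small integer dtypes as future_df.
toy = pd.DataFrame({
    "year": pd.Series([2011, 2022], dtype="int16"),
    "month": pd.Series([1, 12], dtype="int16"),
    "day": pd.Series([5, 31], dtype="int16"),
})

# pd.to_datetime assembles dates from columns named year, month and day.
toy["date"] = pd.to_datetime(toy[["year", "month", "day"]].astype("int64"))
date_strings = toy["date"].dt.strftime("%Y-%m-%d").tolist()
print(date_strings)
```

This avoids the round trip through string columns, so the `year`, `month`, and `day` columns keep their integer dtypes throughout.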


In [None]:
future_df.tail()

Unnamed: 0,year,month,day,address_1,address_2,address_3,address_4,deal_price,full_rent_price,year_rent_price,deal_to_full_rent_rate,deal_to_year_rent_multiple,date
24144101,2022,12,31,중랑구,중화동,438,0,69500,32000,389,46.043167,178.663239,2022-12-31
24144102,2022,12,31,중랑구,중화동,450,0,91500,32500,2090,35.519127,43.779903,2022-12-31
24144103,2022,12,31,중랑구,중화동,452,0,55000,33000,1610,60.0,34.161491,2022-12-31
24144104,2022,12,31,중랑구,중화동,453,0,85500,57750,2070,67.543861,41.304348,2022-12-31
24144105,2022,12,31,중랑구,중화동,454,0,96000,40000,2140,41.666668,44.859814,2022-12-31


In [None]:
# Compute the date one year earlier
# Merging future_df's shifted (past) date with final_df's current date gives the price one year after each current date in final_df
future_df['date'] = future_df['date'] - pd.Timedelta(days=365)
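One point worth checking: a fixed 365-day shift drifts by one day whenever the interval contains 29 February, so some rows may line up with a neighbouring day rather than the same calendar day. The comparison below is a sketch on two hypothetical dates; `pd.DateOffset(years=1)` is shown as one possible alternative, not as the notebook's method.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2012-01-05", "2013-01-05"]))

# Fixed-length shift, as used in the cell above.
fixed_365 = dates - pd.Timedelta(days=365)

# Calendar-aware shift (assumption: an alternative, not the original code).
calendar_year = dates - pd.DateOffset(years=1)

fixed_strings = fixed_365.dt.strftime("%Y-%m-%d").tolist()
calendar_strings = calendar_year.dt.strftime("%Y-%m-%d").tolist()
print(fixed_strings)
print(calendar_strings)
```

For 2013-01-05 the two approaches differ by one day because 2012 is a leap year. Whether that difference matters depends on how much prices move from one day to the next in the merged data.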

In [None]:
future_df

Unnamed: 0,year,month,day,address_1,address_2,address_3,address_4,deal_price,full_rent_price,year_rent_price,deal_to_full_rent_rate,deal_to_year_rent_multiple,date
0,2011,1,5,강남구,개포동,138,0,89400,11000,778,12.304251,114.910027,2010-01-05
1,2011,1,5,강남구,개포동,185,0,89000,22000,1786,24.719101,49.832027,2010-01-05
2,2011,1,6,강남구,개포동,138,0,89400,7500,778,8.389262,114.910027,2010-01-06
3,2011,1,6,강남구,개포동,185,0,89000,30666,1424,34.456181,62.500000,2010-01-06
4,2011,1,7,강남구,개포동,12,0,35000,24000,1196,68.571426,29.264214,2010-01-07
...,...,...,...,...,...,...,...,...,...,...,...,...,...
24144101,2022,12,31,중랑구,중화동,438,0,69500,32000,389,46.043167,178.663239,2021-12-31
24144102,2022,12,31,중랑구,중화동,450,0,91500,32500,2090,35.519127,43.779903,2021-12-31
24144103,2022,12,31,중랑구,중화동,452,0,55000,33000,1610,60.000000,34.161491,2021-12-31
24144104,2022,12,31,중랑구,중화동,453,0,85500,57750,2070,67.543861,41.304348,2021-12-31


In [None]:
# Regenerate the year, month and day columns from the shifted date for the later merge
future_df['year'] = future_df['date'].dt.year
future_df['month'] = future_df['date'].dt.month
future_df['day'] = future_df['date'].dt.day
future_df

Unnamed: 0,year,month,day,address_1,address_2,address_3,address_4,deal_price,full_rent_price,year_rent_price,deal_to_full_rent_rate,deal_to_year_rent_multiple,date
0,2010,1,5,강남구,개포동,138,0,89400,11000,778,12.304251,114.910027,2010-01-05
1,2010,1,5,강남구,개포동,185,0,89000,22000,1786,24.719101,49.832027,2010-01-05
2,2010,1,6,강남구,개포동,138,0,89400,7500,778,8.389262,114.910027,2010-01-06
3,2010,1,6,강남구,개포동,185,0,89000,30666,1424,34.456181,62.500000,2010-01-06
4,2010,1,7,강남구,개포동,12,0,35000,24000,1196,68.571426,29.264214,2010-01-07
...,...,...,...,...,...,...,...,...,...,...,...,...,...
24144101,2021,12,31,중랑구,중화동,438,0,69500,32000,389,46.043167,178.663239,2021-12-31
24144102,2021,12,31,중랑구,중화동,450,0,91500,32500,2090,35.519127,43.779903,2021-12-31
24144103,2021,12,31,중랑구,중화동,452,0,55000,33000,1610,60.000000,34.161491,2021-12-31
24144104,2021,12,31,중랑구,중화동,453,0,85500,57750,2070,67.543861,41.304348,2021-12-31


In [None]:
# Drop unused columns and rename the remaining ones
future_df.drop(["date","full_rent_price","year_rent_price","deal_to_full_rent_rate","deal_to_year_rent_multiple"], axis=1,inplace=True)
future_df.columns = ['year','month','day','address_1','address_2','address_3','address_4','future_deal_price']
future_df

Unnamed: 0,year,month,day,address_1,address_2,address_3,address_4,future_deal_price
0,2010,1,5,강남구,개포동,138,0,89400
1,2010,1,5,강남구,개포동,185,0,89000
2,2010,1,6,강남구,개포동,138,0,89400
3,2010,1,6,강남구,개포동,185,0,89000
4,2010,1,7,강남구,개포동,12,0,35000
...,...,...,...,...,...,...,...,...
24144101,2021,12,31,중랑구,중화동,438,0,69500
24144102,2021,12,31,중랑구,중화동,450,0,91500
24144103,2021,12,31,중랑구,중화동,452,0,55000
24144104,2021,12,31,중랑구,중화동,453,0,85500


In [None]:
# Change column dtypes
future_df = future_df.astype({'year':'int16','month':'int16','day':'int16'})
future_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24144106 entries, 0 to 24144105
Data columns (total 8 columns):
 #   Column             Dtype 
---  ------             ----- 
 0   year               int16 
 1   month              int16 
 2   day                int16 
 3   address_1          object
 4   address_2          object
 5   address_3          int16 
 6   address_4          int16 
 7   future_deal_price  int32 
dtypes: int16(5), int32(1), object(2)
memory usage: 690.8+ MB


In [None]:
future_df.to_pickle('/content/drive/MyDrive/house_price/after_data/future_df.pkl')

# Creating final_df5 (modifying final_df4)

- Merge future_df into final_df4 to create final_df5

In [None]:
import pandas as pd
# Load the files
future_df = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/future_df.pkl')
final_df4 = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/final_df4.pkl')

In [None]:
final_df4.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21523831 entries, 4847 to 24144105
Data columns (total 32 columns):
 #   Column                                      Dtype  
---  ------                                      -----  
 0   year                                        int16  
 1   month                                       int16  
 2   day                                         int16  
 3   address_1                                   object 
 4   address_2                                   object 
 5   address_3                                   int16  
 6   address_4                                   int16  
 7   deal_price                                  int32  
 8   full_rent_price                             int32  
 9   year_rent_price                             int32  
 10  deal_to_full_rent_rate                      float32
 11  deal_to_year_rent_multiple                  float32
 12  deal_price_mean_1m_before                   float32
 13  full_rent_price_mean_1

In [None]:
# The column names alone do not show whether a value is a rate or an absolute change, so rename the columns to make the distinction clear
rename_column_list = list()
column_list = list(final_df4.columns)
for i,column in enumerate(column_list):
  if i <12:
    rename_column_list.append(column)
  else:
    rename_column_list.append(column+'_rate')
rename_column_list

['year',
 'month',
 'day',
 'address_1',
 'address_2',
 'address_3',
 'address_4',
 'deal_price',
 'full_rent_price',
 'year_rent_price',
 'deal_to_full_rent_rate',
 'deal_to_year_rent_multiple',
 'deal_price_mean_1m_before_rate',
 'full_rent_price_mean_1m_before_rate',
 'year_rent_price_mean_1m_before_rate',
 'deal_to_full_rent_rate_mean_1m_before_rate',
 'deal_to_year_rent_multiple_mean_1m_before_rate',
 'deal_price_mean_3m_before_rate',
 'full_rent_price_mean_3m_before_rate',
 'year_rent_price_mean_3m_before_rate',
 'deal_to_full_rent_rate_mean_3m_before_rate',
 'deal_to_year_rent_multiple_mean_3m_before_rate',
 'deal_price_mean_6m_before_rate',
 'full_rent_price_mean_6m_before_rate',
 'year_rent_price_mean_6m_before_rate',
 'deal_to_full_rent_rate_mean_6m_before_rate',
 'deal_to_year_rent_multiple_mean_6m_before_rate',
 'deal_price_mean_12m_before_rate',
 'full_rent_price_mean_12m_before_rate',
 'year_rent_price_mean_12m_before_rate',
 'deal_to_full_rent_rate_mean_12m_before_rate',

In [None]:
# Apply the new column names
final_df4.columns = rename_column_list
final_df4.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21523831 entries, 4847 to 24144105
Data columns (total 32 columns):
 #   Column                                           Dtype  
---  ------                                           -----  
 0   year                                             int16  
 1   month                                            int16  
 2   day                                              int16  
 3   address_1                                        object 
 4   address_2                                        object 
 5   address_3                                        int16  
 6   address_4                                        int16  
 7   deal_price                                       int32  
 8   full_rent_price                                  int32  
 9   year_rent_price                                  int32  
 10  deal_to_full_rent_rate                           float32
 11  deal_to_year_rent_multiple                       float32
 12  deal_pric

In [None]:
# Add the future price (sale price) as a column via merge
final_df5 = pd.merge(final_df4, future_df, on=['address_1','address_2','address_3','address_4',
                                   'year','month','day'], how = 'inner')

In [None]:
final_df5.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18875832 entries, 0 to 18875831
Data columns (total 33 columns):
 #   Column                                           Dtype  
---  ------                                           -----  
 0   year                                             int16  
 1   month                                            int16  
 2   day                                              int16  
 3   address_1                                        object 
 4   address_2                                        object 
 5   address_3                                        int16  
 6   address_4                                        int16  
 7   deal_price                                       int32  
 8   full_rent_price                                  int32  
 9   year_rent_price                                  int32  
 10  deal_to_full_rent_rate                           float32
 11  deal_to_year_rent_multiple                       float32
 12  deal_price_m

In [None]:
final_df5

Unnamed: 0,year,month,day,address_1,address_2,address_3,address_4,deal_price,full_rent_price,year_rent_price,...,full_rent_price_mean_6m_before_rate,year_rent_price_mean_6m_before_rate,deal_to_full_rent_rate_mean_6m_before_rate,deal_to_year_rent_multiple_mean_6m_before_rate,deal_price_mean_12m_before_rate,full_rent_price_mean_12m_before_rate,year_rent_price_mean_12m_before_rate,deal_to_full_rent_rate_mean_12m_before_rate,deal_to_year_rent_multiple_mean_12m_before_rate,future_deal_price
0,2012,1,1,강남구,개포동,12,0,26500,15000,1172,...,-28.110001,-12.01,14.980000,-28.900000,-35.270000,-18.180000,-1.510000,26.400000,-34.250000,40000
1,2012,1,1,강남구,개포동,138,0,60000,12000,665,...,48.959999,10.47,67.150002,-19.250000,-24.549999,26.280001,-7.640000,67.360001,-18.209999,45000
2,2012,1,1,강남구,개포동,140,0,98800,25000,1170,...,39.830002,6.07,-16.420000,57.860001,9.740000,64.169998,-4.720000,49.599998,15.230000,85750
3,2012,1,1,강남구,개포동,141,0,96500,11500,731,...,19.070000,-0.14,3.170000,15.670000,32.349998,9.810000,-9.080000,-17.030001,45.650002,86500
4,2012,1,1,강남구,개포동,185,0,80800,35500,1644,...,10.960000,-8.31,3.040000,17.480000,-1.840000,31.700001,-1.260000,34.169998,-0.580000,49000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18875827,2021,12,31,중랑구,중화동,438,0,62500,23000,389,...,-23.330000,0.00,-38.160000,23.969999,68.919998,-23.330000,0.000000,-54.610001,68.919998,69500
18875828,2021,12,31,중랑구,중화동,450,0,83200,31500,1336,...,-24.850000,-13.86,-35.509998,35.320000,37.320000,-9.900000,-24.090000,-34.389999,80.900002,91500
18875829,2021,12,31,중랑구,중화동,452,0,55000,45000,1480,...,8.870000,0.00,8.870000,0.000000,0.000000,36.360001,0.000000,36.360001,0.000000,55000
18875830,2021,12,31,중랑구,중화동,453,0,80000,50000,2070,...,-9.090000,-13.53,-9.090000,15.650000,19.940001,-9.090000,7.250000,-24.200001,11.870000,85500


In [None]:
# Check for null and inf values
import numpy as np
print(final_df5.isnull().sum())

final_df5.replace([np.inf, -np.inf], np.nan, inplace=True)
var = final_df5.isnull().sum()
print(var.to_string())

year                                               0
month                                              0
day                                                0
address_1                                          0
address_2                                          0
address_3                                          0
address_4                                          0
deal_price                                         0
full_rent_price                                    0
year_rent_price                                    0
deal_to_full_rent_rate                             0
deal_to_year_rent_multiple                         0
deal_price_mean_1m_before_rate                     0
full_rent_price_mean_1m_before_rate                0
year_rent_price_mean_1m_before_rate                0
deal_to_full_rent_rate_mean_1m_before_rate         0
deal_to_year_rent_multiple_mean_1m_before_rate     0
deal_price_mean_3m_before_rate                     0
full_rent_price_mean_3m_before_rate           

In [None]:
final_df5.to_pickle('/content/drive/MyDrive/house_price/after_data/final_df5.pkl')

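# Creating the df_micro dataframes

- Merge month_region_count (transaction-volume information) into final_df5.
- EDA_file1.ipynb showed that the correlations come out very low.
- The first cause: the apartment evaluation indicators (full-rent ratio, change in the full-rent ratio, and so on) hold values for every apartment on every date. Because indicators were generated even for dates on which no transaction took place, many input/output pairs are duplicated, and some identical inputs map to different outputs.
- To address this first problem, handle the duplicates with a groupby on each feature.

>> The characteristics of the data need to be examined to check whether duplicate values are common.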
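
One way to carry out that check is sketched below on made-up toy data. For a single feature it measures the share of rows that repeat an earlier feature value, and counts how many feature values are paired with more than one distinct target value. The frame `df` and the names `duplicate_share` and `conflicting_values` are illustrative assumptions.

```python
import pandas as pd

# Toy data: the same feature value repeats, sometimes with different targets.
df = pd.DataFrame({
    "deal_to_full_rent_rate": [50.0, 50.0, 50.0, 60.0, 70.0, 70.0],
    "future_deal_change_rate": [1.0, 1.0, 3.0, 2.0, 5.0, 5.0],
})
feature, target = "deal_to_full_rent_rate", "future_deal_change_rate"

# Share of rows whose feature value already appeared earlier in the frame.
duplicate_share = df[feature].duplicated().mean()

# Number of distinct target values observed for each feature value.
targets_per_value = df.groupby(feature)[target].nunique()

# Feature values that map to more than one target value.
conflicting_values = int((targets_per_value > 1).sum())

print(duplicate_share, conflicting_values)
```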

## Creating df_micro

In [None]:
import pandas as pd
final_df5 = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/final_df5.pkl')
final_df5.head()

Unnamed: 0,year,month,day,address_1,address_2,address_3,address_4,deal_price,full_rent_price,year_rent_price,...,full_rent_price_mean_6m_before_rate,year_rent_price_mean_6m_before_rate,deal_to_full_rent_rate_mean_6m_before_rate,deal_to_year_rent_multiple_mean_6m_before_rate,deal_price_mean_12m_before_rate,full_rent_price_mean_12m_before_rate,year_rent_price_mean_12m_before_rate,deal_to_full_rent_rate_mean_12m_before_rate,deal_to_year_rent_multiple_mean_12m_before_rate,future_deal_price
0,2012,1,1,강남구,개포동,12,0,26500,15000,1172,...,-28.110001,-12.01,14.98,-28.9,-35.27,-18.18,-1.51,26.4,-34.25,40000
1,2012,1,1,강남구,개포동,138,0,60000,12000,665,...,48.959999,10.47,67.150002,-19.25,-24.549999,26.280001,-7.64,67.360001,-18.209999,45000
2,2012,1,1,강남구,개포동,140,0,98800,25000,1170,...,39.830002,6.07,-16.42,57.860001,9.74,64.169998,-4.72,49.599998,15.23,85750
3,2012,1,1,강남구,개포동,141,0,96500,11500,731,...,19.07,-0.14,3.17,15.67,32.349998,9.81,-9.08,-17.030001,45.650002,86500
4,2012,1,1,강남구,개포동,185,0,80800,35500,1644,...,10.96,-8.31,3.04,17.48,-1.84,31.700001,-1.26,34.169998,-0.58,49000


In [None]:
final_df5.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18875832 entries, 0 to 18875831
Data columns (total 33 columns):
 #   Column                                           Dtype  
---  ------                                           -----  
 0   year                                             int16  
 1   month                                            int16  
 2   day                                              int16  
 3   address_1                                        object 
 4   address_2                                        object 
 5   address_3                                        int16  
 6   address_4                                        int16  
 7   deal_price                                       int32  
 8   full_rent_price                                  int32  
 9   year_rent_price                                  int32  
 10  deal_to_full_rent_rate                           float32
 11  deal_to_year_rent_multiple                       float32
 12  deal_price_m

In [None]:
# Load the apartment transaction-volume table
final_count = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/month_region_count.pkl')
final_count.head()

Unnamed: 0,address_1,address_2,address_3,year,month,last_month_deal_count,last_month_full_rent_count,last_month_month_rent_count,deal_count_1m_before,full_rent_count_1m_before,month_rent_count_1m_before,deal_count_3m_before,full_rent_count_3m_before,month_rent_count_3m_before,deal_count_6m_before,full_rent_count_6m_before,month_rent_count_6m_before,deal_count_12m_before,full_rent_count_12m_before,month_rent_count_12m_before
13,강남구,개포동,12,2012,2,7,33,9,-2,1,0,-1,-7,-5,3,0,-4,-9,-5,2
14,강남구,개포동,12,2012,3,7,42,9,0,9,0,3,6,4,-1,13,0,-10,4,1
15,강남구,개포동,12,2012,4,5,49,9,-2,7,0,-4,17,0,-3,21,-4,-3,3,-7
16,강남구,개포동,12,2012,5,1,35,12,-4,-14,3,-6,2,3,-7,-5,-2,-12,5,1
17,강남구,개포동,12,2012,6,6,26,8,5,-9,-4,-1,-16,-1,2,-10,3,-1,5,0


In [None]:
final_count.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 904162 entries, 13 to 993887
Data columns (total 20 columns):
 #   Column                       Non-Null Count   Dtype 
---  ------                       --------------   ----- 
 0   address_1                    904162 non-null  object
 1   address_2                    904162 non-null  object
 2   address_3                    904162 non-null  int16 
 3   year                         904162 non-null  int16 
 4   month                        904162 non-null  int16 
 5   last_month_deal_count        904162 non-null  int16 
 6   last_month_full_rent_count   904162 non-null  int16 
 7   last_month_month_rent_count  904162 non-null  int16 
 8   deal_count_1m_before         904162 non-null  int16 
 9   full_rent_count_1m_before    904162 non-null  int16 
 10  month_rent_count_1m_before   904162 non-null  int16 
 11  deal_count_3m_before         904162 non-null  int16 
 12  full_rent_count_3m_before    904162 non-null  int16 
 13  month_rent_co

In [None]:
# Run the merge
df_micro = pd.merge(final_df5, final_count, on=['year','month','address_1','address_2','address_3'])
df_micro

Unnamed: 0,year,month,day,address_1,address_2,address_3,address_4,deal_price,full_rent_price,year_rent_price,...,month_rent_count_1m_before,deal_count_3m_before,full_rent_count_3m_before,month_rent_count_3m_before,deal_count_6m_before,full_rent_count_6m_before,month_rent_count_6m_before,deal_count_12m_before,full_rent_count_12m_before,month_rent_count_12m_before
0,2012,2,1,강남구,개포동,12,0,50500,19000,1210,...,0,-1,-7,-5,3,0,-4,-9,-5,2
1,2012,2,2,강남구,개포동,12,0,50100,19000,1210,...,0,-1,-7,-5,3,0,-4,-9,-5,2
2,2012,2,3,강남구,개포동,12,0,50100,30000,1210,...,0,-1,-7,-5,3,0,-4,-9,-5,2
3,2012,2,4,강남구,개포동,12,0,50100,24500,1194,...,0,-1,-7,-5,3,0,-4,-9,-5,2
4,2012,2,5,강남구,개포동,12,0,50100,23000,1194,...,0,-1,-7,-5,3,0,-4,-9,-5,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18853197,2021,12,27,중랑구,중화동,454,0,96000,45150,2140,...,0,-1,1,0,0,1,0,0,0,0
18853198,2021,12,28,중랑구,중화동,454,0,96000,45150,2140,...,0,-1,1,0,0,1,0,0,0,0
18853199,2021,12,29,중랑구,중화동,454,0,96000,45150,2140,...,0,-1,1,0,0,1,0,0,0,0
18853200,2021,12,30,중랑구,중화동,454,0,96000,45150,2140,...,0,-1,1,0,0,1,0,0,0,0


In [None]:
df_micro.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18853202 entries, 0 to 18853201
Data columns (total 48 columns):
 #   Column                                           Dtype  
---  ------                                           -----  
 0   year                                             int16  
 1   month                                            int16  
 2   day                                              int16  
 3   address_1                                        object 
 4   address_2                                        object 
 5   address_3                                        int16  
 6   address_4                                        int16  
 7   deal_price                                       int32  
 8   full_rent_price                                  int32  
 9   year_rent_price                                  int32  
 10  deal_to_full_rent_rate                           float32
 11  deal_to_year_rent_multiple                       float32
 12  deal_price_m

In [None]:
# Create the date column
df_micro['date'] = pd.to_datetime(df_micro[['year', 'month', 'day']])
df_micro.head()

Unnamed: 0,year,month,day,address_1,address_2,address_3,address_4,deal_price,full_rent_price,year_rent_price,...,deal_count_3m_before,full_rent_count_3m_before,month_rent_count_3m_before,deal_count_6m_before,full_rent_count_6m_before,month_rent_count_6m_before,deal_count_12m_before,full_rent_count_12m_before,month_rent_count_12m_before,date
0,2012,2,1,강남구,개포동,12,0,50500,19000,1210,...,-1,-7,-5,3,0,-4,-9,-5,2,2012-02-01
1,2012,2,2,강남구,개포동,12,0,50100,19000,1210,...,-1,-7,-5,3,0,-4,-9,-5,2,2012-02-02
2,2012,2,3,강남구,개포동,12,0,50100,30000,1210,...,-1,-7,-5,3,0,-4,-9,-5,2,2012-02-03
3,2012,2,4,강남구,개포동,12,0,50100,24500,1194,...,-1,-7,-5,3,0,-4,-9,-5,2,2012-02-04
4,2012,2,5,강남구,개포동,12,0,50100,23000,1194,...,-1,-7,-5,3,0,-4,-9,-5,2,2012-02-05


In [None]:
# Add future_deal_change_rate, the price change rate column
# change rate = 100 * ((future price - current price) / current price)
df_micro['future_deal_change_rate'] = 100*((df_micro['future_deal_price']-df_micro['deal_price'])/df_micro['deal_price'])
df_micro = df_micro.astype({'future_deal_change_rate':'float32'})
df_micro.head()

Unnamed: 0,year,month,day,address_1,address_2,address_3,address_4,deal_price,full_rent_price,year_rent_price,...,full_rent_count_3m_before,month_rent_count_3m_before,deal_count_6m_before,full_rent_count_6m_before,month_rent_count_6m_before,deal_count_12m_before,full_rent_count_12m_before,month_rent_count_12m_before,date,future_deal_change_rate
0,2012,2,1,강남구,개포동,12,0,50500,19000,1210,...,-7,-5,3,0,-4,-9,-5,2,2012-02-01,-33.168316
1,2012,2,2,강남구,개포동,12,0,50100,19000,1210,...,-7,-5,3,0,-4,-9,-5,2,2012-02-02,-32.634731
2,2012,2,3,강남구,개포동,12,0,50100,30000,1210,...,-7,-5,3,0,-4,-9,-5,2,2012-02-03,-19.161676
3,2012,2,4,강남구,개포동,12,0,50100,24500,1194,...,-7,-5,3,0,-4,-9,-5,2,2012-02-04,-44.111778
4,2012,2,5,강남구,개포동,12,0,50100,23000,1194,...,-7,-5,3,0,-4,-9,-5,2,2012-02-05,-44.111778


### Saving the file

In [None]:
df_micro.to_pickle('/content/drive/MyDrive/house_price/after_data/df_micro.pkl')

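### Checking correlation coefficients

- In df_micro, a feature only changes when a traded price changes, the month rolls over, or a similar event occurs, so each feature contains a very large number of duplicate values.
- To address this, group by each feature and take the mean of the target, which collapses the duplicates.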
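
A minimal sketch of this group-then-correlate step on made-up toy data follows. The column name `feature` and the numbers are assumptions chosen only to show the mechanics. Because averaging the target within each feature value removes row-level scatter, the grouped correlation here comes out higher than the row-level one, so the two figures should not be compared directly.

```python
import pandas as pd

# Toy data: each feature value repeats with a different target each time.
df = pd.DataFrame({
    "feature": [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 3.0],
    "future_deal_change_rate": [0.0, 2.0, 1.0, 3.0, 2.0, 4.0, 6.0],
})

# Row-level correlation, duplicates included.
raw_corr = df["feature"].corr(df["future_deal_change_rate"])

# Collapse duplicates: one row per feature value, target averaged.
collapsed = df.groupby("feature", as_index=False)["future_deal_change_rate"].mean()
grouped_corr = collapsed["feature"].corr(collapsed["future_deal_change_rate"])

print(round(raw_corr, 3), round(grouped_corr, 3))
```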

In [None]:
import pandas as pd
df_micro = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/df_micro.pkl')
df_micro

Unnamed: 0,year,month,day,address_1,address_2,address_3,address_4,deal_price,full_rent_price,year_rent_price,...,full_rent_count_3m_before,month_rent_count_3m_before,deal_count_6m_before,full_rent_count_6m_before,month_rent_count_6m_before,deal_count_12m_before,full_rent_count_12m_before,month_rent_count_12m_before,date,future_deal_change_rate
0,2012,2,1,강남구,개포동,12,0,50500,19000,1210,...,-7,-5,3,0,-4,-9,-5,2,2012-02-01,-33.168316
1,2012,2,2,강남구,개포동,12,0,50100,19000,1210,...,-7,-5,3,0,-4,-9,-5,2,2012-02-02,-32.634731
2,2012,2,3,강남구,개포동,12,0,50100,30000,1210,...,-7,-5,3,0,-4,-9,-5,2,2012-02-03,-19.161676
3,2012,2,4,강남구,개포동,12,0,50100,24500,1194,...,-7,-5,3,0,-4,-9,-5,2,2012-02-04,-44.111778
4,2012,2,5,강남구,개포동,12,0,50100,23000,1194,...,-7,-5,3,0,-4,-9,-5,2,2012-02-05,-44.111778
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18853197,2021,12,27,중랑구,중화동,454,0,96000,45150,2140,...,1,0,0,1,0,0,0,0,2021-12-27,0.000000
18853198,2021,12,28,중랑구,중화동,454,0,96000,45150,2140,...,1,0,0,1,0,0,0,0,2021-12-28,0.000000
18853199,2021,12,29,중랑구,중화동,454,0,96000,45150,2140,...,1,0,0,1,0,0,0,0,2021-12-29,0.000000
18853200,2021,12,30,중랑구,중화동,454,0,96000,45150,2140,...,1,0,0,1,0,0,0,0,2021-12-30,0.000000


In [None]:
micro_corr_columns = [
 'deal_price',
 'full_rent_price',
 'year_rent_price',
 'deal_to_full_rent_rate',
 'deal_to_year_rent_multiple',
 'deal_price_mean_1m_before_rate',
 'full_rent_price_mean_1m_before_rate',
 'year_rent_price_mean_1m_before_rate',
 'deal_to_full_rent_rate_mean_1m_before_rate',
 'deal_to_year_rent_multiple_mean_1m_before_rate',
 'deal_price_mean_3m_before_rate',
 'full_rent_price_mean_3m_before_rate',
 'year_rent_price_mean_3m_before_rate',
 'deal_to_full_rent_rate_mean_3m_before_rate',
 'deal_to_year_rent_multiple_mean_3m_before_rate',
 'deal_price_mean_6m_before_rate',
 'full_rent_price_mean_6m_before_rate',
 'year_rent_price_mean_6m_before_rate',
 'deal_to_full_rent_rate_mean_6m_before_rate',
 'deal_to_year_rent_multiple_mean_6m_before_rate',
 'deal_price_mean_12m_before_rate',
 'full_rent_price_mean_12m_before_rate',
 'year_rent_price_mean_12m_before_rate',
 'deal_to_full_rent_rate_mean_12m_before_rate',
 'deal_to_year_rent_multiple_mean_12m_before_rate',
 'future_deal_price',
 'last_month_deal_count',
 'last_month_full_rent_count',
 'last_month_month_rent_count',
 'deal_count_1m_before',
 'full_rent_count_1m_before',
 'month_rent_count_1m_before',
 'deal_count_3m_before',
 'full_rent_count_3m_before',
 'month_rent_count_3m_before',
 'deal_count_6m_before',
 'full_rent_count_6m_before',
 'month_rent_count_6m_before',
 'deal_count_12m_before',
 'full_rent_count_12m_before',
 'month_rent_count_12m_before',]

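# For each feature, average the target over rows sharing the same feature value,
# then print the correlation between the feature and that averaged target.
for micro_corr_column in micro_corr_columns:
  grouped = df_micro.groupby(micro_corr_column).agg({'future_deal_change_rate': 'mean'}).reset_index(micro_corr_column)
  print(grouped.corr()['future_deal_change_rate'].sort_values(ascending=True).to_frame())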

                         future_deal_change_rate
deal_price                             -0.100013
future_deal_change_rate                 1.000000
                         future_deal_change_rate
full_rent_price                         0.065563
future_deal_change_rate                 1.000000
                         future_deal_change_rate
year_rent_price                          0.13132
future_deal_change_rate                  1.00000
                         future_deal_change_rate
deal_to_full_rent_rate                  0.412803
future_deal_change_rate                 1.000000
                            future_deal_change_rate
deal_to_year_rent_multiple                -0.087031
future_deal_change_rate                    1.000000
                                future_deal_change_rate
deal_price_mean_1m_before_rate                -0.462093
future_deal_change_rate                        1.000000
                                     future_deal_change_rate
full_rent_price_mean_1m_bef

- The features 'deal_to_full_rent_rate', 'deal_price_mean_1m_before_rate', and 'deal_price_mean_3m_before_rate' have correlation coefficients of 0.4 or more in absolute value, which is relatively high among these features.
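
Picking such features by eye from many printed tables is error-prone, so the selection could also be done programmatically. The sketch below is an assumption about how that might look, not the notebook's method: the helper `grouped_corr`, the toy columns `feat_pos`, `feat_neg`, and `feat_weak`, and the 0.4 cutoff applied to the absolute value are all illustrative.

```python
import pandas as pd

def grouped_corr(df, feature, target):
    """Correlation between a feature and the per-value mean of the target."""
    collapsed = df.groupby(feature, as_index=False)[target].mean()
    return collapsed[feature].corr(collapsed[target])

# Toy data with one positive, one negative and one weak relationship.
df = pd.DataFrame({
    "feat_pos": [1, 2, 3, 4, 5, 6],
    "feat_neg": [6, 5, 4, 3, 2, 1],
    "feat_weak": [3, 6, 1, 4, 2, 5],
    "target": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

features = ["feat_pos", "feat_neg", "feat_weak"]
corrs = pd.Series({f: grouped_corr(df, f, "target") for f in features})

# Rank by magnitude so strong negative correlations are not overlooked.
ranked = corrs.reindex(corrs.abs().sort_values(ascending=False).index)
selected = ranked[ranked.abs() >= 0.4].index.tolist()

print(ranked)
print(selected)
```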

#### Saving the high-correlation features

In [None]:
df_micro.groupby('deal_to_full_rent_rate').agg({'future_deal_change_rate': 'mean'}).reset_index('deal_to_full_rent_rate').to_pickle('/content/drive/MyDrive/house_price/after_data/df_micro_deal_to_full_rent_rate.pkl')
df_micro.groupby('deal_price_mean_1m_before_rate').agg({'future_deal_change_rate': 'mean'}).reset_index('deal_price_mean_1m_before_rate').to_pickle('/content/drive/MyDrive/house_price/after_data/df_micro_deal_price_mean_1m_before_rate.pkl')
df_micro.groupby('deal_price_mean_3m_before_rate').agg({'future_deal_change_rate': 'mean'}).reset_index('deal_price_mean_3m_before_rate').to_pickle('/content/drive/MyDrive/house_price/after_data/df_micro_deal_price_mean_3m_before_ratee.pkl')

- The correlation coefficients come out lower than expected. One hypothesis is that each indicator matters less on its own than in combination with other indicators.

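# Creating the df_macro dataframes

- EDA_file1.ipynb showed that the correlations come out very low.
- The second cause: macroeconomic indicators are identical for every apartment on a given date, but each apartment's return is different, so the same inputs (macroeconomic indicators) map to different outputs (future price change rate). This is likely why the expected figures did not appear.
- To address this second problem, build a dataframe that groups by date and pairs that date's macroeconomic indicators with an average outcome for the date.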
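
There is more than one way to define the average outcome for a date. The cells in this section sum deal_price and future_deal_price per date and compute the change rate from those sums, which weights expensive apartments more heavily than a simple mean of per-apartment rates would. The sketch below, on made-up numbers, shows the two definitions side by side. The column names `future_sum`, `deal_sum`, and `mean_of_rates` are illustrative assumptions.

```python
import pandas as pd

# Toy data: two apartments per date with very different price levels.
df = pd.DataFrame({
    "date": pd.to_datetime(["2012-02-01", "2012-02-01", "2012-02-02", "2012-02-02"]),
    "deal_price": [10000, 90000, 20000, 80000],
    "future_deal_price": [12000, 90000, 20000, 72000],
})
df["future_deal_change_rate"] = (
    100 * (df["future_deal_price"] - df["deal_price"]) / df["deal_price"]
)

daily = df.groupby("date").agg(
    future_sum=("future_deal_price", "sum"),
    deal_sum=("deal_price", "sum"),
    mean_of_rates=("future_deal_change_rate", "mean"),
)

# Change rate of the summed prices: a price-weighted average of the rates.
daily["total_future_deal_change_rate"] = (
    100 * (daily["future_sum"] - daily["deal_sum"]) / daily["deal_sum"]
)

print(daily[["mean_of_rates", "total_future_deal_change_rate"]])
```

On the first toy date the simple mean of rates is 10 while the rate of the sums is 2, because the cheaper apartment moved and the expensive one did not.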



## Creating df_future_mean

In [None]:
# Load df_micro to compute the average future change rate
import pandas as pd
df_micro = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/df_micro.pkl')
df_micro.head()

Unnamed: 0,year,month,day,address_1,address_2,address_3,address_4,deal_price,full_rent_price,year_rent_price,...,full_rent_count_3m_before,month_rent_count_3m_before,deal_count_6m_before,full_rent_count_6m_before,month_rent_count_6m_before,deal_count_12m_before,full_rent_count_12m_before,month_rent_count_12m_before,date,future_deal_change_rate
0,2012,2,1,강남구,개포동,12,0,50500,19000,1210,...,-7,-5,3,0,-4,-9,-5,2,2012-02-01,-33.168316
1,2012,2,2,강남구,개포동,12,0,50100,19000,1210,...,-7,-5,3,0,-4,-9,-5,2,2012-02-02,-32.634731
2,2012,2,3,강남구,개포동,12,0,50100,30000,1210,...,-7,-5,3,0,-4,-9,-5,2,2012-02-03,-19.161676
3,2012,2,4,강남구,개포동,12,0,50100,24500,1194,...,-7,-5,3,0,-4,-9,-5,2,2012-02-04,-44.111778
4,2012,2,5,강남구,개포동,12,0,50100,23000,1194,...,-7,-5,3,0,-4,-9,-5,2,2012-02-05,-44.111778


In [None]:
# Group by date and compute the daily sums of deal_price and future_deal_price
df_total_future_change_rate = df_micro.groupby(['date']).agg({'future_deal_price': 'sum','deal_price': 'sum'})
df_total_future_change_rate

Unnamed: 0_level_0,future_deal_price,deal_price
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2012-02-01,55920578,59924382
2012-02-02,55839205,59861770
2012-02-03,55726218,60076511
2012-02-04,55705868,60124070
2012-02-05,55705975,60133004
...,...,...
2021-12-27,638879069,644457033
2021-12-28,638753969,644393558
2021-12-29,638507169,644620508
2021-12-30,638286918,644552975


In [None]:
# Compute the change rate of the future sale price
df_total_future_change_rate['total_future_deal_change_rate'] = 100*((df_total_future_change_rate['future_deal_price']- df_total_future_change_rate['deal_price']) / df_total_future_change_rate['deal_price'])
df_total_future_change_rate

Unnamed: 0_level_0,future_deal_price,deal_price,total_future_deal_change_rate
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2012-02-01,55920578,59924382,-6.681427
2012-02-02,55839205,59861770,-6.719756
2012-02-03,55726218,60076511,-7.241254
2012-02-04,55705868,60124070,-7.348475
2012-02-05,55705975,60133004,-7.362062
...,...,...,...
2021-12-27,638879069,644457033,-0.865529
2021-12-28,638753969,644393558,-0.875178
2021-12-29,638507169,644620508,-0.948362
2021-12-30,638286918,644552975,-0.972155


In [None]:
# Drop unused columns and change the dtype
df_total_future_change_rate = df_total_future_change_rate.astype({'total_future_deal_change_rate': 'float32'})
df_total_future_change_rate = df_total_future_change_rate.drop(["future_deal_price", "deal_price"], axis=1)
df_total_future_change_rate = df_total_future_change_rate.reset_index('date')
df_total_future_change_rate

Unnamed: 0,date,total_future_deal_change_rate
0,2012-02-01,-6.681427
1,2012-02-02,-6.719756
2,2012-02-03,-7.241254
3,2012-02-04,-7.348475
4,2012-02-05,-7.362062
...,...,...
3617,2021-12-27,-0.865529
3618,2021-12-28,-0.875178
3619,2021-12-29,-0.948362
3620,2021-12-30,-0.972155


In [None]:
# Check the dataframe info
df_total_future_change_rate.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3622 entries, 0 to 3621
Data columns (total 2 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   date                           3622 non-null   datetime64[ns]
 1   total_future_deal_change_rate  3622 non-null   float32       
dtypes: datetime64[ns](1), float32(1)
memory usage: 42.6 KB


### Saving the file

In [None]:
df_total_future_change_rate.to_pickle('/content/drive/MyDrive/house_price/after_data/df_total_future_change_rate.pkl')

## Creating df_macro

In [None]:
# Load the macroeconomic indicators
import pandas as pd
final_economic = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/final_economic.pkl')
final_economic.head()

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,korea_10-3_year,...,us_2_year_12m_before,us_10_year_12m_before,us_10-2_year_12m_before,us_10-3_year_month_12m_before,total_apartment_supply_12m_before,total_unsold_count_12m_before,total_unsold_ratio_12m_before,total_last_month_deal_count_12m_before,total_last_month_full_rent_count_12m_before,total_last_month_month_rent_count_12m_before
0,2012-01-01,2012,1,1,86.800003,1825.73999,3.25,3.32,3.77,0.45,...,-0.354,-1.458,-1.104,-1.354,1615,-408,-15.712383,-3467,-2475,-526
1,2012-01-02,2012,1,2,86.800003,1826.369995,3.25,3.34,3.78,0.44,...,-0.354,-1.461,-1.107,-1.357,1615,-408,-15.712383,-3467,-2475,-526
2,2012-01-03,2012,1,3,86.800003,1875.410034,3.25,3.36,3.79,0.43,...,-0.342,-1.378,-1.036,-1.274,1615,-408,-15.712383,-3467,-2475,-526
3,2012-01-04,2012,1,4,86.800003,1866.219971,3.25,3.36,3.79,0.43,...,-0.358,-1.354,-0.996,-1.227,1615,-408,-15.712383,-3467,-2475,-526
4,2012-01-05,2012,1,5,86.800003,1863.73999,3.25,3.34,3.78,0.44,...,-0.445,-1.467,-1.022,-1.34,1615,-408,-15.712383,-3467,-2475,-526


In [None]:
final_economic.info() # date is stored as object here, so it is converted to datetime in the next cell

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4018 entries, 0 to 4017
Data columns (total 85 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   date                                          4018 non-null   object 
 1   year                                          4018 non-null   int16  
 2   month                                         4018 non-null   int16  
 3   day                                           4018 non-null   int16  
 4   apartment_index                               4018 non-null   float32
 5   kospi_index                                   4018 non-null   float32
 6   korea_rp                                      4018 non-null   float32
 7   korea_3_year                                  4018 non-null   float32
 8   korea_10_year                                 4018 non-null   float32
 9   korea_10-3_year                               4018 non-null   f

In [None]:
final_economic['date'] = pd.to_datetime(final_economic['date'], format='%Y-%m-%d')
final_economic['date']

0      2012-01-01
1      2012-01-02
2      2012-01-03
3      2012-01-04
4      2012-01-05
          ...    
4013   2022-12-27
4014   2022-12-28
4015   2022-12-29
4016   2022-12-30
4017   2022-12-31
Name: date, Length: 4018, dtype: datetime64[ns]

In [None]:
# Load df_total_future_change_rate
df_total_future_change_rate = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/df_total_future_change_rate.pkl')
df_total_future_change_rate.head()

Unnamed: 0,date,total_future_deal_change_rate
0,2012-02-01,-6.681427
1,2012-02-02,-6.719756
2,2012-02-03,-7.241254
3,2012-02-04,-7.348475
4,2012-02-05,-7.362062


In [None]:
df_total_future_change_rate.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3622 entries, 0 to 3621
Data columns (total 2 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   date                           3622 non-null   datetime64[ns]
 1   total_future_deal_change_rate  3622 non-null   float32       
dtypes: datetime64[ns](1), float32(1)
memory usage: 42.6 KB


In [None]:
df_macro=pd.merge(final_economic, df_total_future_change_rate, on='date',how='inner')
df_macro # 3622 rows, which confirms the merge worked as expected

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,korea_10-3_year,...,us_10_year_12m_before,us_10-2_year_12m_before,us_10-3_year_month_12m_before,total_apartment_supply_12m_before,total_unsold_count_12m_before,total_unsold_ratio_12m_before,total_last_month_deal_count_12m_before,total_last_month_full_rent_count_12m_before,total_last_month_month_rent_count_12m_before,total_future_deal_change_rate
0,2012-02-01,2012,2,1,86.800003,1959.239990,3.25,3.380,3.750,0.370,...,-1.605,-1.2260,-1.5090,-3520,-379,61.257435,-4393,-1891,-237,-6.681427
1,2012-02-02,2012,2,2,86.800003,1984.300049,3.25,3.380,3.760,0.380,...,-1.656,-1.2180,-1.5830,-3520,-379,61.257435,-4393,-1891,-237,-6.719756
2,2012-02-03,2012,2,3,86.800003,1972.339966,3.25,3.380,3.760,0.380,...,-1.623,-1.1490,-1.5500,-3520,-379,61.257435,-4393,-1891,-237,-7.241254
3,2012-02-04,2012,2,4,86.800003,1972.339966,3.25,3.380,3.760,0.380,...,-1.714,-1.2000,-1.6410,-3520,-379,61.257435,-4393,-1891,-237,-7.348475
4,2012-02-05,2012,2,5,86.800003,1972.339966,3.25,3.380,3.760,0.380,...,-1.714,-1.2000,-1.6410,-3520,-379,61.257435,-4393,-1891,-237,-7.362062
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3617,2021-12-27,2021,12,27,104.400002,2999.550049,1.00,1.776,2.212,0.436,...,0.539,-0.0408,0.5717,789,2,-1.540453,-5084,1313,-278,-0.865529
3618,2021-12-28,2021,12,28,104.400002,3020.239990,1.00,1.786,2.196,0.410,...,0.561,-0.0719,0.5940,789,2,-1.540453,-5084,1313,-278,-0.875178
3619,2021-12-29,2021,12,29,104.400002,2993.290039,1.00,1.783,2.180,0.397,...,0.616,-0.0071,0.6610,789,2,-1.540453,-5084,1313,-278,-0.948362
3620,2021-12-30,2021,12,30,104.400002,2977.649902,1.00,1.802,2.248,0.446,...,0.581,-0.0243,0.6160,789,2,-1.540453,-5084,1313,-278,-0.972155


In [None]:
df_macro.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3622 entries, 0 to 3621
Data columns (total 86 columns):
 #   Column                                        Non-Null Count  Dtype         
---  ------                                        --------------  -----         
 0   date                                          3622 non-null   datetime64[ns]
 1   year                                          3622 non-null   int16         
 2   month                                         3622 non-null   int16         
 3   day                                           3622 non-null   int16         
 4   apartment_index                               3622 non-null   float32       
 5   kospi_index                                   3622 non-null   float32       
 6   korea_rp                                      3622 non-null   float32       
 7   korea_3_year                                  3622 non-null   float32       
 8   korea_10_year                                 3622 non-null   float3

### Saving the file

In [None]:
df_macro.to_pickle('/content/drive/MyDrive/house_price/after_data/df_macro.pkl')

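- Some columns of df_macro change daily, some change monthly, and some change at irregular intervals.
- Since each column may have its own duplicate-value problem, create separate dataframes for each case.

### Checking correlation coefficients


- As with df_micro, groupby is applied to each column separately anyway, so there is no need to consider whether a column changes daily, monthly, and so on.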
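
The reason the update frequency stops mattering can be seen on a small example: grouping by a column yields one row per distinct value of that column, however often it changes. The sketch below uses made-up values for two columns named after those in df_macro. The numbers and the dictionary `rows_after_grouping` are illustrative assumptions.

```python
import pandas as pd

# Toy data: one column that rarely changes and one that changes every day.
df = pd.DataFrame({
    "date": pd.date_range("2012-02-01", periods=6, freq="D"),
    "korea_rp": [3.25, 3.25, 3.25, 3.0, 3.0, 3.0],
    "kospi_index": [1959.2, 1984.3, 1972.3, 1990.0, 2001.5, 2010.1],
    "total_future_deal_change_rate": [-6.0, -7.0, -8.0, -3.0, -2.0, -1.0],
})
target = "total_future_deal_change_rate"

# Grouping by a column leaves one row per distinct value of that column.
rows_after_grouping = {
    col: len(df.groupby(col)[target].mean())
    for col in ["korea_rp", "kospi_index"]
}

print(rows_after_grouping)
```

A rarely changing column collapses to a handful of rows, so its correlation rests on far fewer points than a daily column's does, which is worth keeping in mind when comparing the resulting coefficients.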

In [None]:
# Load df_macro (macroeconomic indicators merged with the target)
import pandas as pd
df_macro = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/df_macro.pkl')
df_macro.head()

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,korea_10-3_year,...,us_10_year_12m_before,us_10-2_year_12m_before,us_10-3_year_month_12m_before,total_apartment_supply_12m_before,total_unsold_count_12m_before,total_unsold_ratio_12m_before,total_last_month_deal_count_12m_before,total_last_month_full_rent_count_12m_before,total_last_month_month_rent_count_12m_before,total_future_deal_change_rate
0,2012-02-01,2012,2,1,86.800003,1959.23999,3.25,3.38,3.75,0.37,...,-1.605,-1.226,-1.509,-3520,-379,61.257435,-4393,-1891,-237,-6.681427
1,2012-02-02,2012,2,2,86.800003,1984.300049,3.25,3.38,3.76,0.38,...,-1.656,-1.218,-1.583,-3520,-379,61.257435,-4393,-1891,-237,-6.719756
2,2012-02-03,2012,2,3,86.800003,1972.339966,3.25,3.38,3.76,0.38,...,-1.623,-1.149,-1.55,-3520,-379,61.257435,-4393,-1891,-237,-7.241254
3,2012-02-04,2012,2,4,86.800003,1972.339966,3.25,3.38,3.76,0.38,...,-1.714,-1.2,-1.641,-3520,-379,61.257435,-4393,-1891,-237,-7.348475
4,2012-02-05,2012,2,5,86.800003,1972.339966,3.25,3.38,3.76,0.38,...,-1.714,-1.2,-1.641,-3520,-379,61.257435,-4393,-1891,-237,-7.362062


In [None]:
list(df_macro.columns)

['date',
 'year',
 'month',
 'day',
 'apartment_index',
 'kospi_index',
 'korea_rp',
 'korea_3_year',
 'korea_10_year',
 'korea_10-3_year',
 'us_3_month',
 'us_2_year',
 'us_10_year',
 'us_10-2_year',
 'us_10-3_year_month',
 'total_apartment_supply',
 'total_unsold_count',
 'total_unsold_ratio',
 'total_last_month_deal_count',
 'total_last_month_full_rent_count',
 'total_last_month_month_rent_count',
 'kospi_index_1m_before',
 'korea_rp_1m_before',
 'korea_3_year_1m_before',
 'korea_10_year_1m_before',
 'korea_10-3_year_1m_before',
 'us_3_month_1m_before',
 'us_2_year_1m_before',
 'us_10_year_1m_before',
 'us_10-2_year_1m_before',
 'us_10-3_year_month_1m_before',
 'total_apartment_supply_1m_before',
 'total_unsold_count_1m_before',
 'total_unsold_ratio_1m_before',
 'total_last_month_deal_count_1m_before',
 'total_last_month_full_rent_count_1m_before',
 'total_last_month_month_rent_count_1m_before',
 'kospi_index_3m_before',
 'korea_rp_3m_before',
 'korea_3_year_3m_before',
 'korea_10_y

In [None]:
macro_corr_columns = [
 'kospi_index',
 'korea_rp',
 'korea_3_year',
 'korea_10_year',
 'korea_10-3_year',
 'us_3_month',
 'us_2_year',
 'us_10_year',
 'us_10-2_year',
 'us_10-3_year_month',
 'total_apartment_supply',
 'total_unsold_count',
 'total_unsold_ratio',
 'total_last_month_deal_count',
 'total_last_month_full_rent_count',
 'total_last_month_month_rent_count',
 'kospi_index_1m_before',
 'korea_rp_1m_before',
 'korea_3_year_1m_before',
 'korea_10_year_1m_before',
 'korea_10-3_year_1m_before',
 'us_3_month_1m_before',
 'us_2_year_1m_before',
 'us_10_year_1m_before',
 'us_10-2_year_1m_before',
 'us_10-3_year_month_1m_before',
 'total_apartment_supply_1m_before',
 'total_unsold_count_1m_before',
 'total_unsold_ratio_1m_before',
 'total_last_month_deal_count_1m_before',
 'total_last_month_full_rent_count_1m_before',
 'total_last_month_month_rent_count_1m_before',
 'kospi_index_3m_before',
 'korea_rp_3m_before',
 'korea_3_year_3m_before',
 'korea_10_year_3m_before',
 'korea_10-3_year_3m_before',
 'us_3_month_3m_before',
 'us_2_year_3m_before',
 'us_10_year_3m_before',
 'us_10-2_year_3m_before',
 'us_10-3_year_month_3m_before',
 'total_apartment_supply_3m_before',
 'total_unsold_count_3m_before',
 'total_unsold_ratio_3m_before',
 'total_last_month_deal_count_3m_before',
 'total_last_month_full_rent_count_3m_before',
 'total_last_month_month_rent_count_3m_before',
 'kospi_index_6m_before',
 'korea_rp_6m_before',
 'korea_3_year_6m_before',
 'korea_10_year_6m_before',
 'korea_10-3_year_6m_before',
 'us_3_month_6m_before',
 'us_2_year_6m_before',
 'us_10_year_6m_before',
 'us_10-2_year_6m_before',
 'us_10-3_year_month_6m_before',
 'total_apartment_supply_6m_before',
 'total_unsold_count_6m_before',
 'total_unsold_ratio_6m_before',
 'total_last_month_deal_count_6m_before',
 'total_last_month_full_rent_count_6m_before',
 'total_last_month_month_rent_count_6m_before',
 'kospi_index_12m_before',
 'korea_rp_12m_before',
 'korea_3_year_12m_before',
 'korea_10_year_12m_before',
 'korea_10-3_year_12m_before',
 'us_3_month_12m_before',
 'us_2_year_12m_before',
 'us_10_year_12m_before',
 'us_10-2_year_12m_before',
 'us_10-3_year_month_12m_before',
 'total_apartment_supply_12m_before',
 'total_unsold_count_12m_before',
 'total_unsold_ratio_12m_before',
 'total_last_month_deal_count_12m_before',
 'total_last_month_full_rent_count_12m_before',
 'total_last_month_month_rent_count_12m_before']

# For each feature: average the target per distinct feature value, then correlate the two columns
for macro_corr_column in macro_corr_columns:
  grouped = df_macro.groupby(macro_corr_column).agg({'total_future_deal_change_rate': 'mean'}).reset_index(macro_corr_column)
  print(grouped.corr()['total_future_deal_change_rate'].sort_values(ascending=True).to_frame())

                               total_future_deal_change_rate
kospi_index                                         0.144402
total_future_deal_change_rate                       1.000000
                               total_future_deal_change_rate
korea_rp                                           -0.695317
total_future_deal_change_rate                       1.000000
                               total_future_deal_change_rate
korea_3_year                                       -0.753658
total_future_deal_change_rate                       1.000000
                               total_future_deal_change_rate
korea_10_year                                      -0.715527
total_future_deal_change_rate                       1.000000
                               total_future_deal_change_rate
korea_10-3_year                                     -0.17929
total_future_deal_change_rate                        1.00000
                               total_future_deal_change_rate
us_3_month              

- Unlike df_micro, the target in df_macro is the mean of the future change rates, so the two cannot be compared exactly; even so, many df_macro indicators show stronger correlations than those in df_micro.
- korea_rp : -0.695317, korea_3_year : -0.753658, korea_10_year : -0.715527, us_10-2_year : -0.571734, us_10-3_year_month : -0.614389, 
total_unsold_count : -0.663598, total_unsold_ratio : -0.49486, korea_rp_1m_before : -0.825879, korea_rp_3m_before : -0.948187,
korea_rp_6m_before : -0.829453, korea_10-3_year_6m_before : 0.63468, korea_rp_12m_before : -0.824738
- The very high correlation coefficients for the korea_rp indicators are probably a result of the small number of data points for those columns.
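One way to check the "few data points" explanation is to count how many distinct values each feature has, since that is the number of rows left after the groupby. The sketch below uses made-up data; `group_counts`, `policy_rate`, and `index_level` are hypothetical names chosen for illustration.

```python
import pandas as pd

def group_counts(df: pd.DataFrame, features: list) -> pd.Series:
    """Number of distinct values (groupby rows) per feature, smallest first."""
    return df[features].nunique().sort_values()

# Toy data (hypothetical): one column changes rarely, the other changes every day.
toy = pd.DataFrame({
    "policy_rate": [1.5, 1.5, 1.5, 1.25, 1.25, 1.25],
    "index_level": [2001.2, 2003.5, 1998.7, 2010.1, 2012.4, 2009.9],
})
counts = group_counts(toy, ["policy_rate", "index_level"])
print(counts)
```

A feature with only a handful of distinct values yields a correlation computed from only a handful of points, so a large coefficient there should be read with caution.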


#### Saving the highly correlated features

In [None]:
df_macro.groupby('korea_rp').agg({'total_future_deal_change_rate': 'mean'}).reset_index('korea_rp').to_pickle('/content/drive/MyDrive/house_price/after_data/df_macro_korea_rp.pkl')
df_macro.groupby('korea_3_year').agg({'total_future_deal_change_rate': 'mean'}).reset_index('korea_3_year').to_pickle('/content/drive/MyDrive/house_price/after_data/df_macro_korea_3_year.pkl')
df_macro.groupby('korea_10_year').agg({'total_future_deal_change_rate': 'mean'}).reset_index('korea_10_year').to_pickle('/content/drive/MyDrive/house_price/after_data/df_macro_korea_10_year.pkl')
df_macro.groupby('us_10-2_year').agg({'total_future_deal_change_rate': 'mean'}).reset_index('us_10-2_year').to_pickle('/content/drive/MyDrive/house_price/after_data/df_macro_us_10-2_year.pkl')
df_macro.groupby('us_10-3_year_month').agg({'total_future_deal_change_rate': 'mean'}).reset_index('us_10-3_year_month').to_pickle('/content/drive/MyDrive/house_price/after_data/df_macro_us_10-3_year_month.pkl')
df_macro.groupby('total_unsold_count').agg({'total_future_deal_change_rate': 'mean'}).reset_index('total_unsold_count').to_pickle('/content/drive/MyDrive/house_price/after_data/df_macro_total_unsold_count.pkl')
df_macro.groupby('total_unsold_ratio').agg({'total_future_deal_change_rate': 'mean'}).reset_index('total_unsold_ratio').to_pickle('/content/drive/MyDrive/house_price/after_data/df_macro_total_unsold_ratio.pkl')
df_macro.groupby('korea_rp_1m_before').agg({'total_future_deal_change_rate': 'mean'}).reset_index('korea_rp_1m_before').to_pickle('/content/drive/MyDrive/house_price/after_data/df_macro_korea_rp_1m_before.pkl')
df_macro.groupby('korea_rp_3m_before').agg({'total_future_deal_change_rate': 'mean'}).reset_index('korea_rp_3m_before').to_pickle('/content/drive/MyDrive/house_price/after_data/df_macro_korea_rp_3m_before.pkl')
df_macro.groupby('korea_rp_6m_before').agg({'total_future_deal_change_rate': 'mean'}).reset_index('korea_rp_6m_before').to_pickle('/content/drive/MyDrive/house_price/after_data/df_macro_korea_rp_6m_before.pkl')
df_macro.groupby('korea_10-3_year_6m_before').agg({'total_future_deal_change_rate': 'mean'}).reset_index('korea_10-3_year_6m_before').to_pickle('/content/drive/MyDrive/house_price/after_data/df_macro_korea_10-3_year_6m_before.pkl')
df_macro.groupby('korea_rp_12m_before').agg({'total_future_deal_change_rate': 'mean'}).reset_index('korea_rp_12m_before').to_pickle('/content/drive/MyDrive/house_price/after_data/df_macro_korea_rp_12m_before.pkl')
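The twelve near-identical save statements could also be written as a loop, which avoids copy-paste slips in a column name or file name. This is a minimal sketch, not the notebook's code: `save_grouped_means` and its parameters are hypothetical, and the demo writes to a temporary directory instead of Google Drive.

```python
import os
import tempfile

import pandas as pd

def save_grouped_means(df, features, target, out_dir, prefix="df_macro_"):
    """Write one pickle per feature with the mean target for each distinct feature value."""
    paths = []
    for feature in features:
        grouped = df.groupby(feature).agg({target: "mean"}).reset_index()
        path = os.path.join(out_dir, f"{prefix}{feature}.pkl")
        grouped.to_pickle(path)
        paths.append(path)
    return paths

# Demo on toy data (hypothetical values).
with tempfile.TemporaryDirectory() as tmp:
    toy = pd.DataFrame({
        "korea_rp": [1.0, 1.0, 1.25],
        "total_future_deal_change_rate": [2.0, 4.0, -1.0],
    })
    saved = save_grouped_means(toy, ["korea_rp"], "total_future_deal_change_rate", tmp)
    reloaded = pd.read_pickle(saved[0])
```

With the real data, the call would take the list of selected feature names and the `after_data` directory as `out_dir`.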



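# Building the model

- The df_micro features do not have high correlation coefficients, but they are directly related to the 'future change rate' that is ultimately being predicted.
- The df_macro features are highly correlated, but they are only indirectly related to the 'future change rate' that is ultimately being predicted.
- To address this, a model trained on the most highly correlated df_micro features and a model trained on the most highly correlated df_macro features are combined into a single integrated model.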
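The text does not say how the two models are combined. One simple possibility, shown here purely as an assumption, is a weighted average of their predictions. The function `blend_predictions` and the weight `weight_micro` are hypothetical and would need to be tuned on validation data.

```python
import numpy as np

def blend_predictions(pred_micro, pred_macro, weight_micro=0.5):
    """Weighted average of two prediction arrays of the same shape."""
    if not 0.0 <= weight_micro <= 1.0:
        raise ValueError("weight_micro must be between 0 and 1")
    pred_micro = np.asarray(pred_micro, dtype=float)
    pred_macro = np.asarray(pred_macro, dtype=float)
    if pred_micro.shape != pred_macro.shape:
        raise ValueError("prediction arrays must have the same shape")
    return weight_micro * pred_micro + (1.0 - weight_micro) * pred_macro

# Toy predictions (made up): future change rates in percent.
blended = blend_predictions([2.0, -1.0], [4.0, 1.0], weight_micro=0.25)
print(blended)
```

Other fusion schemes, such as stacking a second model on top of both outputs, would fit the same description; the averaging above is only the simplest case.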

# Data visualization

- Explore various pieces of information through visualization

## Creating df_address

In [None]:
# !pip install geopy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# Load the valuation indicators
import pandas as pd
df_micro = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/df_micro.pkl')
df_micro.head()

Unnamed: 0,year,month,day,address_1,address_2,address_3,address_4,deal_price,full_rent_price,year_rent_price,...,full_rent_count_3m_before,month_rent_count_3m_before,deal_count_6m_before,full_rent_count_6m_before,month_rent_count_6m_before,deal_count_12m_before,full_rent_count_12m_before,month_rent_count_12m_before,date,future_deal_change_rate
0,2012,2,1,강남구,개포동,12,0,50500,19000,1210,...,-7,-5,3,0,-4,-9,-5,2,2012-02-01,-33.168316
1,2012,2,2,강남구,개포동,12,0,50100,19000,1210,...,-7,-5,3,0,-4,-9,-5,2,2012-02-02,-32.634731
2,2012,2,3,강남구,개포동,12,0,50100,30000,1210,...,-7,-5,3,0,-4,-9,-5,2,2012-02-03,-19.161676
3,2012,2,4,강남구,개포동,12,0,50100,24500,1194,...,-7,-5,3,0,-4,-9,-5,2,2012-02-04,-44.111778
4,2012,2,5,강남구,개포동,12,0,50100,23000,1194,...,-7,-5,3,0,-4,-9,-5,2,2012-02-05,-44.111778


In [None]:
# Geocoding down to address_4 often returns None
# Addresses that match through address_3 get the same latitude and longitude even when address_4 differs
df_address = df_micro.drop_duplicates(subset=['address_1','address_2','address_3'], keep='last')
df_address = df_address[['address_1','address_2','address_3']]
df_address

Unnamed: 0,address_1,address_2,address_3
65832,강남구,개포동,12
65863,강남구,개포동,138
65894,강남구,개포동,140
65925,강남구,개포동,141
65956,강남구,개포동,172
...,...,...,...
18853077,중랑구,중화동,438
18853108,중랑구,중화동,450
18853139,중랑구,중화동,452
18853170,중랑구,중화동,453


In [None]:
df_address['address_full'] = df_address['address_1']+' ' + df_address['address_2'] +' ' + df_address['address_3'].apply(str)
df_address

Unnamed: 0,address_1,address_2,address_3,address_full
65832,강남구,개포동,12,강남구 개포동 12
65863,강남구,개포동,138,강남구 개포동 138
65894,강남구,개포동,140,강남구 개포동 140
65925,강남구,개포동,141,강남구 개포동 141
65956,강남구,개포동,172,강남구 개포동 172
...,...,...,...,...
18853077,중랑구,중화동,438,중랑구 중화동 438
18853108,중랑구,중화동,450,중랑구 중화동 450
18853139,중랑구,중화동,452,중랑구 중화동 452
18853170,중랑구,중화동,453,중랑구 중화동 453


In [None]:
df_address.reset_index(inplace=True,drop=True)
df_address

Unnamed: 0,address_1,address_2,address_3,address_full
0,강남구,개포동,12,강남구 개포동 12
1,강남구,개포동,138,강남구 개포동 138
2,강남구,개포동,140,강남구 개포동 140
3,강남구,개포동,141,강남구 개포동 141
4,강남구,개포동,172,강남구 개포동 172
...,...,...,...,...
5606,중랑구,중화동,438,중랑구 중화동 438
5607,중랑구,중화동,450,중랑구 중화동 450
5608,중랑구,중화동,452,중랑구 중화동 452
5609,중랑구,중화동,453,중랑구 중화동 453


In [None]:
df_address.to_pickle('/content/drive/MyDrive/house_price/after_data/df_address.pkl')

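## Using geopy (not executed)

- The geopy library was tried, but it did not return detailed latitude and longitude for the addresses and often returned the same coordinates for different addresses, so it is not suitable here.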

In [None]:
# Geocoding down to address_4 often returns None
# Addresses that match through address_3 get the same latitude and longitude even when address_4 differs

from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent = 'South Korea')

for i, location in enumerate(df_address['address_full']):
  try:
    print(i,'/5610',)
    geo = geolocator.geocode(location)
    # print('geo :',geo)
    if geo is None:
      print(location,'geo is none')
    else:
      # print('location :', location, 'latitude :', geo.latitude, 'longitude :',geo.longitude) 
      df_address.loc[i,'latitude'] = geo.latitude
      df_address.loc[i,'longitude'] = geo.longitude
  except Exception:
    # Skip addresses whose lookup raises (e.g. a timeout); they keep missing coordinates
    pass

Streaming output truncated to the last 5000 lines.
622 /5610
...
716 /5610


In [None]:
df_address.head(30)

Unnamed: 0,index,address_1,address_2,address_3,address_full,latitude,longitude
0,65832,강남구,개포동,12,강남구 개포동 12,37.482088,127.061548
1,65863,강남구,개포동,138,강남구 개포동 138,37.482088,127.061548
2,65894,강남구,개포동,140,강남구 개포동 140,37.482088,127.061548
3,65925,강남구,개포동,141,강남구 개포동 141,37.482088,127.061548
4,65956,강남구,개포동,172,강남구 개포동 172,37.482088,127.061548
5,65987,강남구,개포동,176,강남구 개포동 176,37.482088,127.061548
6,66018,강남구,개포동,177,강남구 개포동 177,37.482088,127.061548
7,66049,강남구,개포동,179,강남구 개포동 179,37.482088,127.061548
8,66080,강남구,개포동,185,강남구 개포동 185,37.482088,127.061548
9,66111,강남구,개포동,187,강남구 개포동 187,37.482088,127.061548


- Contrary to expectations, many different addresses were assigned identical coordinates, so the result was not what was intended
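The extent of the problem can be quantified by measuring what share of geocoded rows share their coordinate pair with another row. A minimal sketch on made-up data follows; `duplicate_coordinate_share` is a hypothetical helper, and the column names `latitude` and `longitude` follow the ones used for df_address.

```python
import pandas as pd

def duplicate_coordinate_share(df, lat_col="latitude", lon_col="longitude"):
    """Fraction of geocoded rows whose (lat, lon) pair also appears on another row."""
    coords = df[[lat_col, lon_col]].dropna()
    if coords.empty:
        return 0.0
    # keep=False marks every member of a duplicated group, not only the repeats.
    return float(coords.duplicated(keep=False).mean())

# Toy data (hypothetical): three addresses collapse onto one point, one is distinct.
toy = pd.DataFrame({
    "address_full": ["A 1", "A 2", "A 3", "B 1"],
    "latitude": [37.482088, 37.482088, 37.482088, 37.5],
    "longitude": [127.061548, 127.061548, 127.061548, 127.0],
})
share = duplicate_coordinate_share(toy)
print(share)
```

A share close to 1 would indicate that the geocoder is resolving addresses only to the neighborhood level rather than to individual lots.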

In [None]:
location = '강남구 개포동 649'
geo = geolocator.geocode(location)
print(geo.latitude, geo.longitude)

location = '강남구 개포동 1282'
geo = geolocator.geocode(location)
print(geo.latitude, geo.longitude) # Problem found: different addresses return the same latitude and longitude

37.48208765 127.06154796893333
37.48208765 127.06154796893333


## Using Google Maps

In [None]:
pip install googlemaps

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting googlemaps
  Downloading googlemaps-4.10.0.tar.gz (33 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: googlemaps
  Building wheel for googlemaps (setup.py) ... done
  Created wheel for googlemaps: filename=googlemaps-4.10.0-py3-none-any.whl size=40715 sha256=8a047d46ec3e3a80693c6645abd1553526764a4fb0529febc63aa2a37407ddd1
  Stored in directory: /root/.cache/pip/wheels/d9/5f/46/54a2bdb4bcb07d3faba4463d2884865705914cc72a7b8bb5f0
Successfully built googlemaps
Installing collected packages: googlemaps
Successfully installed googlemaps-4.10.0


In [None]:
# Load the Google Maps API client
import googlemaps
from datetime import datetime

import pandas as pd
df_address = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/df_address.pkl')

my_key = "YOUR_GOOGLE_MAPS_API_KEY" # Google Maps API key (redacted; supply your own and keep it out of the notebook)
maps = googlemaps.Client(key=my_key)  # create the Google Maps API client

In [None]:
df_address

Unnamed: 0,address_1,address_2,address_3,address_full
0,강남구,개포동,12,강남구 개포동 12
1,강남구,개포동,138,강남구 개포동 138
2,강남구,개포동,140,강남구 개포동 140
3,강남구,개포동,141,강남구 개포동 141
4,강남구,개포동,172,강남구 개포동 172
...,...,...,...,...
5606,중랑구,중화동,438,중랑구 중화동 438
5607,중랑구,중화동,450,중랑구 중화동 450
5608,중랑구,중화동,452,중랑구 중화동 452
5609,중랑구,중화동,453,중랑구 중화동 453


In [None]:
import time # imported to measure the running time


lat = []  # latitudes
lng = []  # longitudes

i=0

t1 = time.time() # timestamp before geocoding starts

for address in df_address['address_full']:   
    i = i + 1
    print(i)
    try:
        geo_location = maps.geocode(address)[0].get('geometry')
        lat.append(geo_location['location']['lat'])
        lng.append(geo_location['location']['lng'])
        
# Report an error when the coordinates could not be retrieved
    except Exception:
        lat.append('')
        lng.append('')
        print("Error at item %d"%(i))


Streaming output truncated to the last 5000 lines.
612
...
848

In [None]:
df_address['latitude'] = lat
df_address['longitude'] = lng
df_address[(df_address['latitude']=='')|(df_address['longitude']=='')|(df_address['latitude']==0)|(df_address['longitude']==0)] # show rows whose coordinates could not be found

Unnamed: 0,address_1,address_2,address_3,address_full,latitude,longitude


In [None]:
df_address.to_pickle('/content/drive/MyDrive/house_price/after_data/df_address_2.pkl')
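The geocoding loop can also be written as a function that takes the lookup as a parameter, which makes it testable without network access and records failures as NaN rather than empty strings. This is a sketch under assumptions: `geocode_all` and `fake_geocode` are hypothetical names, and the response shape (`[0]['geometry']['location']['lat']`) is taken from how the notebook reads `maps.geocode` results.

```python
import math

def geocode_all(addresses, geocode_fn):
    """Return parallel latitude/longitude lists; failed lookups become NaN."""
    lats, lngs = [], []
    for address in addresses:
        try:
            results = geocode_fn(address)
            location = results[0]["geometry"]["location"]
            lats.append(location["lat"])
            lngs.append(location["lng"])
        except (IndexError, KeyError, TypeError):
            # No result, or a response without the expected keys: mark as missing.
            lats.append(math.nan)
            lngs.append(math.nan)
    return lats, lngs

# Stand-in for a real geocoding call (hypothetical data, no network access).
def fake_geocode(address):
    table = {"known": [{"geometry": {"location": {"lat": 37.5, "lng": 127.0}}}]}
    return table.get(address, [])

lats, lngs = geocode_all(["known", "unknown"], fake_geocode)
print(lats, lngs)
```

With the real client, `maps.geocode` would be passed as `geocode_fn`, and missing rows could then be found with `df_address['latitude'].isna()`. Network or quota errors raised by the client are not caught here and would need their own handling.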

In [None]:
import pandas as pd
df_address = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/df_address_2.pkl')
df_micro = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/df_micro.pkl')
df_micro.head()

Unnamed: 0,year,month,day,address_1,address_2,address_3,address_4,deal_price,full_rent_price,year_rent_price,...,full_rent_count_3m_before,month_rent_count_3m_before,deal_count_6m_before,full_rent_count_6m_before,month_rent_count_6m_before,deal_count_12m_before,full_rent_count_12m_before,month_rent_count_12m_before,date,future_deal_change_rate
0,2012,2,1,강남구,개포동,12,0,50500,19000,1210,...,-7,-5,3,0,-4,-9,-5,2,2012-02-01,-33.168316
1,2012,2,2,강남구,개포동,12,0,50100,19000,1210,...,-7,-5,3,0,-4,-9,-5,2,2012-02-02,-32.634731
2,2012,2,3,강남구,개포동,12,0,50100,30000,1210,...,-7,-5,3,0,-4,-9,-5,2,2012-02-03,-19.161676
3,2012,2,4,강남구,개포동,12,0,50100,24500,1194,...,-7,-5,3,0,-4,-9,-5,2,2012-02-04,-44.111778
4,2012,2,5,강남구,개포동,12,0,50100,23000,1194,...,-7,-5,3,0,-4,-9,-5,2,2012-02-05,-44.111778


In [None]:
import folium
