<a href="https://colab.research.google.com/github/Ryong1998/house_price/blob/main/EDA_file3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

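# Project topic

- This project predicts future apartment prices.

# Project overview

- Among the many regions and property types (apartments, detached houses, etc.), this project predicts the 'future price change rate' of 'apartments' in 'Seoul'.
- The ultimate goal is to find the apartments expected to have the best 'future price change rate', that is, the highest expected return.
- The value of real estate is assumed to be assessable through two aspects: '1. characteristics as a residence' and '2. characteristics as a financial product'.
- '1. Characteristics as a residence' covers factors that provide a more comfortable living environment, such as nearby amenities, educational facilities, apartment size, and nearby transportation.
- '2. Characteristics as a financial product' covers factors expressed as financial figures, such as the base interest rate, apartment supply, unsold apartments, the current sale price, and the jeonse-to-sale-price ratio.
- The factors that signal high value under '1. characteristics as a residence' can change over time (for example, as households shift from extended families to small families the preferred apartment size may change, and the growth of online lectures may make educational infrastructure less important in the future).
- These high-value factors may have kept changing in the past, but how they changed is hard to determine and how they will change in the future is unknown, so the evaluation criteria are 'variable'.
- In contrast, '2. characteristics as a financial product' is expressed as 'numbers' grounded in prices and the economy, so it can evaluate real estate value more consistently than '1. characteristics as a residence'.
- The figures under '2. characteristics as a financial product' are assumed to already embed the changing value of '1. characteristics as a residence'.
- This project therefore focuses on '2. characteristics as a financial product' to predict housing price changes.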

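# Obtaining original_data

- Apartment sale prices and apartment jeonse/monthly rent prices were obtained as files from 'http://rtdown.molit.go.kr/'.
- Korean government bond yields, US government bond yields, and KOSPI data were obtained from 'https://kr.investing.com/'.
- The number of unsold apartments was obtained from 'https://data.kbland.kr/publicdata/unsold-apartments'.
- The number of newly supplied (분양) apartments was obtained from 'https://asil.kr/asil/sub/movein.jsp'.
- The base interest rate was obtained from 'https://www.bok.or.kr/portal/singl/baseRate/list.do?dataSeCd=01&menuNo=200643'.
- The Seoul housing price index was obtained from 'https://data.seoul.go.kr/dataList/801/S/2/datasetView.do'.

>> The original plan was to pull the apartment sale and jeonse/monthly rent prices through the 공공데이터포털 (public data portal) API, but because of its daily traffic limit the files were instead downloaded directly from 'http://rtdown.molit.go.kr/'.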

In [6]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Check the Python version
!python --version

Python 3.8.10


In [None]:
# Check the installed library versions
pip list

Package                       Version
----------------------------- ----------------------
absl-py                       1.4.0
aeppl                         0.0.33
aesara                        2.7.9
aiohttp                       3.8.4
aiosignal                     1.3.1
alabaster                     0.7.13
albumentations                1.2.1
altair                        4.2.2
appdirs                       1.4.4
argon2-cffi                   21.3.0
argon2-cffi-bindings          21.2.0
arviz                         0.12.1
astor                         0.8.1
astropy                       4.3.1
astunparse                    1.6.3
async-timeout                 4.0.2
atari-py                      0.2.9
atomicwrites                  1.4.1
attrs                         22.2.0
audioread                     3.0.0
autograd                      1.5
Babel                         2.11.0
backcall                      0.2.0
beautifulsoup4                4.6.3
bleach                        6.0.0
blis

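# Creating apartment_deal.csv

- Create the apartment_deal (apartment sales) file.
- Apartment sale price data was obtained as files from 'http://rtdown.molit.go.kr/'.

## Loading and merging the csv files

- The original apartment sale files are split by year, and not every field in each csv file is needed, so a preprocessing step is applied.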

In [None]:
import pandas as pd
import os

# Path to the folder holding the yearly apartment sale csv files
dir_path = "/content/drive/MyDrive/house_price/original_data/deal_price/Seoul"
file_list = os.listdir(dir_path)
file_list.sort()
df_list = list()
# Read every csv file in the folder and collect the DataFrames in a list
for csv_file in file_list:
    df_list.append(pd.read_csv(os.path.join(dir_path, csv_file), skiprows=15, encoding='cp949'))

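>> Colab appears to list files in the order they were uploaded, and `os.listdir` itself returns names in arbitrary order, which is why `file_list` is sorted before reading.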
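The loading step can be sketched as a small helper that always reads the files in sorted filename order. This is only an illustration: `load_yearly_csvs` is a hypothetical name that does not appear in the notebook, and the two tiny csv files are made up so the example runs without the real data.

```python
import os
import tempfile

import pandas as pd


def load_yearly_csvs(dir_path, **read_csv_kwargs):
    """Read every .csv file in dir_path in sorted filename order and merge them."""
    # os.listdir returns names in arbitrary order, so sort for a reproducible result
    csv_names = sorted(f for f in os.listdir(dir_path) if f.endswith(".csv"))
    frames = [
        pd.read_csv(os.path.join(dir_path, name), **read_csv_kwargs)
        for name in csv_names
    ]
    # ignore_index=True gives the merged frame a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)


# Tiny demonstration with two made-up files written in "wrong" order
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"year": [2007], "price": [2]}).to_csv(
        os.path.join(tmp, "deal_2007.csv"), index=False)
    pd.DataFrame({"year": [2006], "price": [1]}).to_csv(
        os.path.join(tmp, "deal_2006.csv"), index=False)
    merged = load_yearly_csvs(tmp)

print(merged)
```

For the real files the same keyword arguments used in the notebook (`skiprows=15, encoding='cp949'`) would be passed through `read_csv_kwargs`.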

In [None]:
df_list[0].info() # check that the DataFrames were loaded into the list

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120812 entries, 0 to 120811
Data columns (total 15 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   시군구       120812 non-null  object 
 1   번지        120812 non-null  object 
 2   본번        120812 non-null  int64  
 3   부번        120812 non-null  int64  
 4   단지명       120812 non-null  object 
 5   전용면적(㎡)   120812 non-null  float64
 6   계약년월      120812 non-null  int64  
 7   계약일       120812 non-null  int64  
 8   거래금액(만원)  120812 non-null  object 
 9   층         120812 non-null  int64  
 10  건축년도      120812 non-null  int64  
 11  도로명       120812 non-null  object 
 12  해제사유발생일   0 non-null       float64
 13  거래유형      120812 non-null  object 
 14  중개사소재지    120812 non-null  object 
dtypes: float64(2), int64(6), object(7)
memory usage: 13.8+ MB


In [None]:
df_list[0].head()

Unnamed: 0,시군구,번지,본번,부번,단지명,전용면적(㎡),계약년월,계약일,거래금액(만원),층,건축년도,도로명,해제사유발생일,거래유형,중개사소재지
0,서울특별시 강남구 개포동,655-2,655,2,개포2차현대아파트(220),77.75,200603,10,59500,7,1988,언주로 103,,-,-
1,서울특별시 강남구 개포동,655-2,655,2,개포2차현대아파트(220),77.75,200603,29,60000,6,1988,언주로 103,,-,-
2,서울특별시 강남구 개포동,655-2,655,2,개포2차현대아파트(220),77.75,200604,29,67000,9,1988,언주로 103,,-,-
3,서울특별시 강남구 개포동,655-2,655,2,개포2차현대아파트(220),77.75,200606,1,60000,4,1988,언주로 103,,-,-
4,서울특별시 강남구 개포동,655-2,655,2,개포2차현대아파트(220),77.75,200610,20,72250,5,1988,언주로 103,,-,-


In [None]:
# Merge all yearly DataFrames into a single DataFrame
df_default = pd.concat(df_list, axis=0, ignore_index=True) # ignore_index resets the index after concatenation
df_default.loc[1]

시군구          서울특별시 강남구 개포동
번지                   655-2
본번                   655.0
부번                     2.0
단지명         개포2차현대아파트(220)
전용면적(㎡)              77.75
계약년월                200603
계약일                     29
거래금액(만원)            60,000
층                        6
건축년도                1988.0
도로명                언주로 103
해제사유발생일                NaN
거래유형                     -
중개사소재지                   -
Name: 1, dtype: object

In [None]:
df_default.head() # inspect the merged table

Unnamed: 0,시군구,번지,본번,부번,단지명,전용면적(㎡),계약년월,계약일,거래금액(만원),층,건축년도,도로명,해제사유발생일,거래유형,중개사소재지
0,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),77.75,200603,10,59500,7,1988.0,언주로 103,,-,-
1,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),77.75,200603,29,60000,6,1988.0,언주로 103,,-,-
2,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),77.75,200604,29,67000,9,1988.0,언주로 103,,-,-
3,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),77.75,200606,1,60000,4,1988.0,언주로 103,,-,-
4,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),77.75,200610,20,72250,5,1988.0,언주로 103,,-,-


In [None]:
df_default.info() # check the result of merging the DataFrames

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1237491 entries, 0 to 1237490
Data columns (total 15 columns):
 #   Column    Non-Null Count    Dtype  
---  ------    --------------    -----  
 0   시군구       1237491 non-null  object 
 1   번지        1237270 non-null  object 
 2   본번        1237416 non-null  float64
 3   부번        1237416 non-null  float64
 4   단지명       1237491 non-null  object 
 5   전용면적(㎡)   1237491 non-null  float64
 6   계약년월      1237491 non-null  int64  
 7   계약일       1237491 non-null  int64  
 8   거래금액(만원)  1237491 non-null  object 
 9   층         1237491 non-null  int64  
 10  건축년도      1237489 non-null  float64
 11  도로명       1237491 non-null  object 
 12  해제사유발생일   5242 non-null     float64
 13  거래유형      1237491 non-null  object 
 14  중개사소재지    1237491 non-null  object 
dtypes: float64(5), int64(3), object(7)
memory usage: 141.6+ MB


## Selecting only the needed columns

- Not every column of the df_default DataFrame is used, so only the needed columns are selected.

In [None]:
# Keep only the columns that will be used and rename them in English
df_default = df_default[['시군구','본번','부번','도로명','단지명','계약년월','계약일','전용면적(㎡)','거래금액(만원)','층']].copy() # copy() so later assignments do not act on a slice
df_default.columns = ['address','main_number','sub_number','road','name','year_month','day','area','deal_price','floor']
df_default.head() # check that the selection worked

Unnamed: 0,address,main_number,sub_number,road,name,year_month,day,area,deal_price,floor
0,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200603,10,77.75,59500,7
1,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200603,29,77.75,60000,6
2,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200604,29,77.75,67000,9
3,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200606,1,77.75,60000,4
4,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200610,20,77.75,72250,5


In [None]:
# Convert the types of deal_price, year_month, and day
df_default["deal_price"] = df_default["deal_price"].str.replace(",", "", regex=False) # remove ',' from 'deal_price' so it can be used in later calculations
df = df_default.astype({'year_month':'str','day':'str','deal_price':'int64'}).copy()
df.head() # check that the format changed

Unnamed: 0,address,main_number,sub_number,road,name,year_month,day,area,deal_price,floor
0,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200603,10,77.75,59500,7
1,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200603,29,77.75,60000,6
2,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200604,29,77.75,67000,9
3,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200606,1,77.75,60000,4
4,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200610,20,77.75,72250,5
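The comma removal above can also be handled while parsing. The sketch below uses a made-up two-row csv string (not the real data) and compares two approaches: the `thousands` parameter of `pd.read_csv`, and cleaning a text column after loading. Whether `thousands=","` is safe for every column of the real files is an assumption that would need checking.

```python
import io

import pandas as pd

# Made-up sample shaped like the price column (quoted, comma-grouped numbers)
raw_csv = 'name,deal_price\nA,"60,000"\nB,"59,500"\n'

# Option 1: let read_csv strip the thousands separator while parsing
parsed = pd.read_csv(io.StringIO(raw_csv), thousands=",")

# Option 2: load as text, then remove the commas and convert to numbers
as_text = pd.read_csv(io.StringIO(raw_csv), dtype={"deal_price": str})
cleaned = pd.to_numeric(as_text["deal_price"].str.replace(",", "", regex=False))

print(parsed["deal_price"].tolist(), cleaned.tolist())
```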


In [None]:
df.info() # check the type changes and null counts

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1237491 entries, 0 to 1237490
Data columns (total 10 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   address      1237491 non-null  object 
 1   main_number  1237416 non-null  float64
 2   sub_number   1237416 non-null  float64
 3   road         1237491 non-null  object 
 4   name         1237491 non-null  object 
 5   year_month   1237491 non-null  object 
 6   day          1237491 non-null  object 
 7   area         1237491 non-null  float64
 8   deal_price   1237491 non-null  int64  
 9   floor        1237491 non-null  int64  
dtypes: float64(3), int64(2), object(5)
memory usage: 94.4+ MB


In [None]:
# Look for rows where 'main_number' or 'sub_number' is null and 'road' is also null -> none
# In other words, 'road' appears to carry more complete address information
df[((df['main_number'].isnull()) |(df['sub_number'].isnull())) &(df['road'].isnull()) ]

Unnamed: 0,address,main_number,sub_number,road,name,year_month,day,area,deal_price,floor


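- main_number and sub_number contain null values -> at this point the road information looks better suited as the address.

## Creating new columns

- Date columns are needed later for grouping and similar operations, so the 'year_month' and 'day' columns are processed into several date-related columns.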

In [None]:
# Split and create date-related columns for later grouping
df['year'] = df['year_month'].str[0:4] # extract the year from the combined year-month column
df['month'] = df['year_month'].str[4:] # extract the month from the combined year-month column
df['day'] = df['day'].str.zfill(2) # left-pad single-digit days such as 1 or 2 with a 0
df['date'] = pd.to_datetime(df['year']+df['month']+df['day'], format='%Y%m%d') # combine the parts into a date column
df = df.astype({'year':'int64','month':'int64','day':'int64'}) # convert to the desired types
df = df.drop(['year_month'], axis=1) # drop the column that is no longer used
df.head()

Unnamed: 0,address,main_number,sub_number,road,name,day,area,deal_price,floor,year,month,date
0,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),10,77.75,59500,7,2006,3,2006-03-10
1,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),29,77.75,60000,6,2006,3,2006-03-29
2,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),29,77.75,67000,9,2006,4,2006-04-29
3,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),1,77.75,60000,4,2006,6,2006-06-01
4,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),20,77.75,72250,5,2006,10,2006-10-20
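As a compact alternative to the cell above, the date can be parsed first and the year and month derived from it afterwards. This is a sketch on two made-up rows, not the notebook's actual procedure.

```python
import pandas as pd

# Made-up rows in the same shape as the raw columns (integers, as read from the csv)
df_dates = pd.DataFrame({"year_month": [200603, 200610], "day": [1, 20]})

# Build an 8-character YYYYMMDD key; zfill pads single-digit days with a leading 0
date_key = df_dates["year_month"].astype(str) + df_dates["day"].astype(str).str.zfill(2)
df_dates["date"] = pd.to_datetime(date_key, format="%Y%m%d")

# Derive the integer parts from the parsed date instead of slicing strings
df_dates["year"] = df_dates["date"].dt.year
df_dates["month"] = df_dates["date"].dt.month

print(df_dates)
```

Passing an explicit `format` makes a malformed value raise an error instead of being guessed at.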


In [None]:
df.info() # confirm the types changed as intended

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1237491 entries, 0 to 1237490
Data columns (total 12 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   address      1237491 non-null  object        
 1   main_number  1237416 non-null  float64       
 2   sub_number   1237416 non-null  float64       
 3   road         1237491 non-null  object        
 4   name         1237491 non-null  object        
 5   day          1237491 non-null  int64         
 6   area         1237491 non-null  float64       
 7   deal_price   1237491 non-null  int64         
 8   floor        1237491 non-null  int64         
 9   year         1237491 non-null  int64         
 10  month        1237491 non-null  int64         
 11  date         1237491 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(3), int64(5), object(3)
memory usage: 113.3+ MB


In [None]:
# Split the address and the road name
df["address_0"] = df["address"].str.split(' ',expand=True)[0] # city ('시'); always Seoul in this dataset
df["address_1"] = df["address"].str.split(' ',expand=True)[1] # district ('구')
df["address_2"] = df["address"].str.split(' ',expand=True)[2] # neighborhood ('동')
df["road_name"] = df["road"].str.split(' ',expand=True)[0] # road name
df["road_number"] = df["road"].str.split(' ',expand=True)[1] # building number on the road
df= df[['year','month','day','address_0','address_1','address_2','road_name','road_number','area','deal_price','name','main_number','sub_number','date']] # keep only the columns to be used
df.head()

Unnamed: 0,year,month,day,address_0,address_1,address_2,road_name,road_number,area,deal_price,name,main_number,sub_number,date
0,2006,3,10,서울특별시,강남구,개포동,언주로,103,77.75,59500,개포2차현대아파트(220),655.0,2.0,2006-03-10
1,2006,3,29,서울특별시,강남구,개포동,언주로,103,77.75,60000,개포2차현대아파트(220),655.0,2.0,2006-03-29
2,2006,4,29,서울특별시,강남구,개포동,언주로,103,77.75,67000,개포2차현대아파트(220),655.0,2.0,2006-04-29
3,2006,6,1,서울특별시,강남구,개포동,언주로,103,77.75,60000,개포2차현대아파트(220),655.0,2.0,2006-06-01
4,2006,10,20,서울특별시,강남구,개포동,언주로,103,77.75,72250,개포2차현대아파트(220),655.0,2.0,2006-10-20
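The same split can be written with one `str.split` call per source column. The sketch below uses two made-up rows and also shows how a road value without a building number ends up as a null `road_number`.

```python
import pandas as pd

# Made-up rows; the second road value has no building number
df_addr = pd.DataFrame({
    "address": ["서울특별시 강남구 개포동", "서울특별시 중구 만리동2가"],
    "road": ["언주로 103", "만리재로"],
})

# One split per column; n limits the number of splits so extra spaces stay in the last part
df_addr[["address_0", "address_1", "address_2"]] = df_addr["address"].str.split(" ", n=2, expand=True)
road_parts = df_addr["road"].str.split(" ", n=1, expand=True)
df_addr["road_name"] = road_parts[0]
df_addr["road_number"] = road_parts[1]  # missing when the road has no number part

print(df_addr[["address_1", "address_2", "road_name", "road_number"]])
```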


## Missing value handling 1

In [None]:
df.info() # road_number now has 1 null value

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1237491 entries, 0 to 1237490
Data columns (total 14 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   year         1237491 non-null  int64         
 1   month        1237491 non-null  int64         
 2   day          1237491 non-null  int64         
 3   address_0    1237491 non-null  object        
 4   address_1    1237491 non-null  object        
 5   address_2    1237491 non-null  object        
 6   road_name    1237491 non-null  object        
 7   road_number  1237490 non-null  object        
 8   area         1237491 non-null  float64       
 9   deal_price   1237491 non-null  int64         
 10  name         1237491 non-null  object        
 11  main_number  1237416 non-null  float64       
 12  sub_number   1237416 non-null  float64       
 13  date         1237491 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(3), int64(4), object(6)
memory usage

In [None]:
df[df['road_number'].isnull()] # inspect the row where road_number is null

Unnamed: 0,year,month,day,address_0,address_1,address_2,road_name,road_number,area,deal_price,name,main_number,sub_number,date
1177515,2020,12,31,서울특별시,중구,만리동2가,만리재로,,39.9541,161000,서울역센트럴자이(임대),176.0,1.0,2020-12-31


In [None]:
# Look up '서울역센트럴자이' -> some rows hold '' values
df.loc[df['name'] == '서울역센트럴자이',:]

Unnamed: 0,year,month,day,address_0,address_1,address_2,road_name,road_number,area,deal_price,name,main_number,sub_number,date
936223,2017,5,3,서울특별시,중구,만리동2가,만리재로,175.0,84.972,79390,서울역센트럴자이,176.0,1.0,2017-05-03
936224,2017,12,20,서울특별시,중구,만리동2가,만리재로,175.0,59.943,85000,서울역센트럴자이,176.0,1.0,2017-12-20
936225,2017,12,30,서울특별시,중구,만리동2가,,,59.94,85000,서울역센트럴자이,176.0,1.0,2017-12-30
1018067,2018,3,20,서울특별시,중구,만리동2가,,,72.99,85000,서울역센트럴자이,176.0,1.0,2018-03-20
1093938,2019,7,13,서울특별시,중구,만리동2가,만리재로,175.0,84.972,134500,서울역센트럴자이,176.0,1.0,2019-07-13
1093939,2019,8,20,서울특별시,중구,만리동2가,만리재로,175.0,59.94,95000,서울역센트럴자이,176.0,1.0,2019-08-20
1093940,2019,8,23,서울특별시,중구,만리동2가,만리재로,175.0,84.972,139000,서울역센트럴자이,176.0,1.0,2019-08-23
1093941,2019,9,8,서울특별시,중구,만리동2가,만리재로,175.0,59.94,113800,서울역센트럴자이,176.0,1.0,2019-09-08
1093942,2019,9,21,서울특별시,중구,만리동2가,만리재로,175.0,72.9733,132000,서울역센트럴자이,176.0,1.0,2019-09-21
1093943,2019,11,30,서울특별시,중구,만리동2가,만리재로,175.0,59.9808,120000,서울역센트럴자이,176.0,1.0,2019-11-30


In [None]:
# Inspect the rows whose value is ''
df.loc[df['road_name'] == '',:]

Unnamed: 0,year,month,day,address_0,address_1,address_2,road_name,road_number,area,deal_price,name,main_number,sub_number,date
1606,2006,2,23,서울특별시,강남구,논현동,,,128.67,73500,경복,276.0,0.0,2006-02-23
1628,2006,10,19,서울특별시,강남구,논현동,,,95.48,71000,경복,276.0,0.0,2006-10-19
2799,2006,1,24,서울특별시,강남구,대치동,,,76.56,80000,청실1,633.0,0.0,2006-01-24
2806,2006,2,14,서울특별시,강남구,대치동,,,102.64,143500,청실1,633.0,0.0,2006-02-14
2807,2006,2,14,서울특별시,강남구,대치동,,,102.64,142000,청실1,633.0,0.0,2006-02-14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1234065,2022,6,24,서울특별시,송파구,거여동,,,84.95,128000,e편한세상송파파크센트럴,696.0,0.0,2022-06-24
1234066,2022,7,21,서울특별시,송파구,거여동,,,84.97,135000,e편한세상송파파크센트럴,696.0,0.0,2022-07-21
1234067,2022,7,23,서울특별시,송파구,거여동,,,59.96,125000,e편한세상송파파크센트럴,696.0,0.0,2022-07-23
1234069,2022,8,19,서울특별시,송파구,거여동,,,84.96,130000,e편한세상송파파크센트럴,696.0,0.0,2022-08-19


>> Having no null values does not mean having no '' values! A '' is semantically a missing value, but it can sit in the data as if it were a real value.
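The observation above can be checked directly by counting blank strings next to the null counts. The sketch below uses a made-up three-row frame. `count_blank` is a hypothetical helper, and treating whitespace-only strings as missing is an extra assumption beyond the '' case seen in the data.

```python
import numpy as np
import pandas as pd

# Made-up rows: a real value, an empty string, and a whitespace-only / None value
sample = pd.DataFrame({
    "road_name": ["만리재로", "", " "],
    "road_number": ["175", "", None],
})


def count_blank(col):
    # Count string values that are empty or whitespace-only; NaN/None are left to isnull()
    return int(col.map(lambda v: isinstance(v, str) and v.strip() == "").sum())


null_counts = sample.isnull().sum()      # does not see '' at all
blank_counts = sample.apply(count_blank)  # sees only the blank strings

# Turn blank strings into NaN so that isnull() reports them too
cleaned = sample.replace(r"^\s*$", np.nan, regex=True)

print(null_counts, blank_counts, cleaned.isnull().sum(), sep="\n")
```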

In [None]:
df.loc[df['name'] == '서울역센트럴자이(임대)','name']='서울역센트럴자이' # rename '서울역센트럴자이(임대)' to '서울역센트럴자이'
df.loc[df['name'] == '서울역센트럴자이','road_name']='만리재로' # set 'road_name' from the '서울역센트럴자이' rows inspected above
df.loc[df['name'] == '서울역센트럴자이','road_number']='175' # set 'road_number' from the '서울역센트럴자이' rows inspected above
df.info() # as a first pass, the values that show up as null have been handled

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1237491 entries, 0 to 1237490
Data columns (total 14 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   year         1237491 non-null  int64         
 1   month        1237491 non-null  int64         
 2   day          1237491 non-null  int64         
 3   address_0    1237491 non-null  object        
 4   address_1    1237491 non-null  object        
 5   address_2    1237491 non-null  object        
 6   road_name    1237491 non-null  object        
 7   road_number  1237491 non-null  object        
 8   area         1237491 non-null  float64       
 9   deal_price   1237491 non-null  int64         
 10  name         1237491 non-null  object        
 11  main_number  1237416 non-null  float64       
 12  sub_number   1237416 non-null  float64       
 13  date         1237491 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(3), int64(4), object(6)
memory usage

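## Missing value handling 2

- The steps above showed that '' can be stored as a value, so '' values are now treated as null and handled as missing values.

In [None]:
import numpy as np
df = df.replace('', np.nan) # turn values that are exactly '' into null
df.info() # check the info after the change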

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1237491 entries, 0 to 1237490
Data columns (total 14 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   year         1237491 non-null  int64         
 1   month        1237491 non-null  int64         
 2   day          1237491 non-null  int64         
 3   address_0    1237491 non-null  object        
 4   address_1    1237491 non-null  object        
 5   address_2    1237491 non-null  object        
 6   road_name    1235462 non-null  object        
 7   road_number  1234196 non-null  object        
 8   area         1237491 non-null  float64       
 9   deal_price   1237491 non-null  int64         
 10  name         1237491 non-null  object        
 11  main_number  1237416 non-null  float64       
 12  sub_number   1237416 non-null  float64       
 13  date         1237491 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(3), int64(4), object(6)
memory usage

In [None]:
df.isnull().sum() # the null counts of 'road_name' and 'road_number' have increased

year              0
month             0
day               0
address_0         0
address_1         0
address_2         0
road_name      2029
road_number    3295
area              0
deal_price        0
name              0
main_number      75
sub_number       75
date              0
dtype: int64

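- The road-name address initially seemed to have fewer nulls, but preprocessing showed that the lot-number address (지번주소) actually has fewer.

In [None]:
# Look for rows where only one of 'main_number' and 'sub_number' is null -> none
# In other words, the two are always null together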
df[((df['main_number'].isnull()) &(df['sub_number'].notnull()))
  |((df['main_number'].notnull()) &(df['sub_number'].isnull()))]

Unnamed: 0,year,month,day,address_0,address_1,address_2,road_name,road_number,area,deal_price,name,main_number,sub_number,date


In [None]:
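# Look for rows where the road-name fields are null and the lot-number address is also null -> none
# In other words, every row has address information in either the road-name or the lot-number address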
df[((df['road_name'].isnull()) | (df['road_number'].isnull())) & (df['main_number'].isnull())] 

Unnamed: 0,year,month,day,address_0,address_1,address_2,road_name,road_number,area,deal_price,name,main_number,sub_number,date


In [None]:
# Show the rows with null values that still need handling
df.loc[df['main_number'].isnull(),['address_0','address_1','address_2','road_name','road_number','name','main_number','sub_number']] 

Unnamed: 0,address_0,address_1,address_2,road_name,road_number,name,main_number,sub_number
681633,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
681634,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
681635,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
681636,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
681637,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
...,...,...,...,...,...,...,...,...
1209122,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
1209123,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
1209124,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
1232880,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,


In [None]:
df.loc[df['main_number'].isnull(),'name'].unique() # list the apartment names whose lot-number address is null
                                                   # only '힐스테이트 서초 젠트리스' seems to need fixing

array(['힐스테이트 서초 젠트리스'], dtype=object)

In [None]:
df.loc[df['name']=='힐스테이트 서초 젠트리스',:] # every row whose name is '힐스테이트 서초 젠트리스' has a null lot-number address

Unnamed: 0,year,month,day,address_0,address_1,address_2,road_name,road_number,area,deal_price,name,main_number,sub_number,date
681633,2015,3,1,서울특별시,서초구,신원동,헌릉로8길,10-12,84.95,73430,힐스테이트 서초 젠트리스,,,2015-03-01
681634,2015,4,17,서울특별시,서초구,신원동,헌릉로8길,10-12,84.99,79000,힐스테이트 서초 젠트리스,,,2015-04-17
681635,2015,5,1,서울특별시,서초구,신원동,헌릉로8길,10-12,101.90,95000,힐스테이트 서초 젠트리스,,,2015-05-01
681636,2015,6,16,서울특별시,서초구,신원동,헌릉로8길,10-12,84.95,87200,힐스테이트 서초 젠트리스,,,2015-06-16
681637,2015,6,26,서울특별시,서초구,신원동,헌릉로8길,10-12,101.90,94500,힐스테이트 서초 젠트리스,,,2015-06-26
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1209122,2021,4,27,서울특별시,서초구,신원동,헌릉로8길,10-12,101.90,184500,힐스테이트 서초 젠트리스,,,2021-04-27
1209123,2021,5,26,서울특별시,서초구,신원동,헌릉로8길,10-12,84.95,165000,힐스테이트 서초 젠트리스,,,2021-05-26
1209124,2021,7,26,서울특별시,서초구,신원동,헌릉로8길,10-12,84.99,182000,힐스테이트 서초 젠트리스,,,2021-07-26
1232880,2022,6,23,서울특별시,서초구,신원동,헌릉로8길,10-12,101.90,204000,힐스테이트 서초 젠트리스,,,2022-06-23


In [None]:
# Fill the null lot-number values with information looked up through a Naver search
df.loc[df['name']=='힐스테이트 서초 젠트리스','main_number'] = 557
df.loc[df['name']=='힐스테이트 서초 젠트리스','sub_number'] = 0

In [None]:
# Select the columns to use and rename them
df_deal = df[['date','year','month','day','address_0','address_1','address_2','main_number','sub_number','name','area','deal_price']].copy()
df_deal.columns =['date','year','month','day','address_0','address_1','address_2','address_3','address_4','name','area','deal_price']
df_deal = df_deal[df_deal['year']>=2011] # keep 2011 onward, because the jeonse/monthly rent data starts in 2011
df_deal.head()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,deal_price
355306,2011-07-09,2011,7,9,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,64000
355307,2011-07-28,2011,7,28,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,65500
355308,2011-01-19,2011,1,19,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,70500
355309,2011-09-02,2011,9,2,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,85000
355310,2011-12-17,2011,12,17,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,68000


In [None]:
df_deal.info() # check the DataFrame info

<class 'pandas.core.frame.DataFrame'>
Int64Index: 882185 entries, 355306 to 1237490
Data columns (total 12 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   date        882185 non-null  datetime64[ns]
 1   year        882185 non-null  int64         
 2   month       882185 non-null  int64         
 3   day         882185 non-null  int64         
 4   address_0   882185 non-null  object        
 5   address_1   882185 non-null  object        
 6   address_2   882185 non-null  object        
 7   address_3   882185 non-null  float64       
 8   address_4   882185 non-null  float64       
 9   name        882185 non-null  object        
 10  area        882185 non-null  float64       
 11  deal_price  882185 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(4), object(4)
memory usage: 87.5+ MB


In [None]:
df_deal.iloc[200] # check that the values look right

date          2011-12-23 00:00:00
year                         2011
month                          12
day                            23
address_0                   서울특별시
address_1                     강남구
address_2                     개포동
address_3                   141.0
address_4                     0.0
name                      개포주공1단지
area                        56.57
deal_price                  95000
Name: 355506, dtype: object

In [None]:
df_deal.to_csv('/content/drive/MyDrive/house_price/after_data/apartment_deal.csv',index=False)

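# Creating apartment_full_rent.csv and apartment_month_rent.csv

- Create the apartment_full_rent (apartment jeonse) and apartment_month_rent (apartment monthly rent) files.
- Apartment jeonse and monthly rent data was obtained as files from 'http://rtdown.molit.go.kr/'.
- The apartment rent csv files are split by year, and not every field in each csv file is needed, so a preprocessing step is applied.

## Loading and merging the csv files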

In [None]:
import pandas as pd
import os


dir_path = "/content/drive/MyDrive/house_price/original_data/rent_price/Seoul"
file_list = os.listdir(dir_path)
file_list.sort()
df_list = list()

# Read every csv file in the folder and collect the DataFrames in a list
for csv_file in file_list:
    df_list.append(pd.read_csv(dir_path+"/"+csv_file ,skiprows=15,  encoding='cp949'))




In [None]:
df_list[-1].info() # check that the DataFrames were loaded into the list

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 231846 entries, 0 to 231845
Data columns (total 19 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   시군구            231846 non-null  object 
 1   번지             231657 non-null  object 
 2   본번             231819 non-null  float64
 3   부번             231819 non-null  float64
 4   단지명            231846 non-null  object 
 5   전월세구분          231846 non-null  object 
 6   전용면적(㎡)        231846 non-null  float64
 7   계약년월           231846 non-null  int64  
 8   계약일            231846 non-null  int64  
 9   보증금(만원)        231846 non-null  object 
 10  월세(만원)         231846 non-null  object 
 11  층              231846 non-null  int64  
 12  건축년도           231749 non-null  float64
 13  도로명            231846 non-null  object 
 14  계약기간           231846 non-null  object 
 15  계약구분           231846 non-null  object 
 16  갱신요구권 사용       231846 non-null  object 
 17  종전계약 보증금 (만원)  188985 non-nul

In [None]:
# Merge all DataFrames
df_default = pd.concat(df_list, axis=0, ignore_index=True) # ignore_index resets the index after concatenation
df_default.info() # check the result of merging the DataFrames

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2085775 entries, 0 to 2085774
Data columns (total 19 columns):
 #   Column         Dtype  
---  ------         -----  
 0   시군구            object 
 1   번지             object 
 2   본번             float64
 3   부번             float64
 4   단지명            object 
 5   전월세구분          object 
 6   전용면적(㎡)        float64
 7   계약년월           int64  
 8   계약일            int64  
 9   보증금(만원)        object 
 10  월세(만원)         object 
 11  층              float64
 12  건축년도           float64
 13  도로명            object 
 14  계약기간           object 
 15  계약구분           object 
 16  갱신요구권 사용       object 
 17  종전계약 보증금 (만원)  object 
 18  종전계약 월세 (만원)   object 
dtypes: float64(5), int64(2), object(12)
memory usage: 302.4+ MB


In [None]:
df_default.loc[1]

시군구               서울특별시 강남구 개포동
번지                        655-2
본번                        655.0
부번                          2.0
단지명              개포2차현대아파트(220)
전월세구분                        전세
전용면적(㎡)                   77.75
계약년월                     201101
계약일                          18
보증금(만원)                  20,000
월세(만원)                        0
층                           8.0
건축년도                     1988.0
도로명                     언주로 103
계약기간                          -
계약구분                          -
갱신요구권 사용                      -
종전계약 보증금 (만원)               NaN
종전계약 월세 (만원)                NaN
Name: 1, dtype: object

In [None]:
df_default.head() # check the shape of the data

Unnamed: 0,시군구,번지,본번,부번,단지명,전월세구분,전용면적(㎡),계약년월,계약일,보증금(만원),월세(만원),층,건축년도,도로명,계약기간,계약구분,갱신요구권 사용,종전계약 보증금 (만원),종전계약 월세 (만원)
0,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),전세,77.75,201101,5,35000,0,7.0,1988.0,언주로 103,-,-,-,,
1,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),전세,77.75,201101,18,20000,0,8.0,1988.0,언주로 103,-,-,-,,
2,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),전세,77.75,201102,1,24000,0,5.0,1988.0,언주로 103,-,-,-,,
3,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),전세,77.75,201102,11,31000,0,9.0,1988.0,언주로 103,-,-,-,,
4,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),전세,77.75,201102,24,30500,0,9.0,1988.0,언주로 103,-,-,-,,


In [None]:
df_default.isnull().sum() # 번지, 본번, and 부번 contain null values

시군구                    0
번지                  1586
본번                   234
부번                   234
단지명                    0
전월세구분                  0
전용면적(㎡)               36
계약년월                   0
계약일                    0
보증금(만원)                0
월세(만원)                 0
층                     36
건축년도                 249
도로명                    0
계약기간                   0
계약구분                   0
갱신요구권 사용               0
종전계약 보증금 (만원)    1793799
종전계약 월세 (만원)     1793799
dtype: int64

In [None]:
df_default['전월세구분'].unique()

array(['전세', '월세'], dtype=object)

- '전월세구분' has only two values, '전세' (jeonse) and '월세' (monthly rent), so the data is easy to split with a condition.
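Because the column holds exactly two values, a single boolean mask and its negation cover every row. A minimal sketch on made-up rows (the variable names `df_full` and `df_month` are illustrative, not the notebook's):

```python
import pandas as pd

# Made-up rows using the original column names from the rent files
df_rent = pd.DataFrame({
    "전월세구분": ["전세", "월세", "전세"],
    "보증금(만원)": ["20,000", "5,000", "35,000"],
    "월세(만원)": ["0", "50", "0"],
})

is_full_rent = df_rent["전월세구분"] == "전세"

# copy() so that later edits do not act on a view of the merged frame
df_full = df_rent[is_full_rent].copy()
df_month = df_rent[~is_full_rent].copy()

# With only two categories, the two parts together account for every row
print(len(df_full), len(df_month), len(df_rent))
```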

## Creating the jeonse DataFrame

- The process is almost identical to apartment_deal, so the steps from apartment_deal.ipynb are followed and combined into a single cell.
- The commented-out lines are intermediate checks.

In [None]:
# Create the jeonse DataFrame; the commented-out lines check intermediate values
df_full_rent = df_default.loc[df_default['전월세구분']=='전세',['시군구','본번','부번','도로명','계약년월','계약일','보증금(만원)','전용면적(㎡)','단지명']].copy()
df_full_rent.columns = ['address','main_number','sub_number','road','year_month','day','full_rent_price','area','name']
# print(df_full_rent.head())
# print(df_full_rent.info())

df_full_rent = df_full_rent.astype({'full_rent_price':'str','year_month':'str','day':'str'})
df_full_rent["full_rent_price"] = df_full_rent["full_rent_price"].str.replace(",", "", regex=False)
df_full_rent["day"] = df_full_rent["day"].str.zfill(2) # left-pad single-digit days with a 0
df_full_rent['year'] = df_full_rent['year_month'].str[0:4] # extract the year from the combined year-month column
df_full_rent['month'] = df_full_rent['year_month'].str[4:] # extract the month from the combined year-month column
df_full_rent['date'] = pd.to_datetime(df_full_rent['year']+df_full_rent['month']+df_full_rent['day'], format='%Y%m%d') # combine the parts into a date column
df_full_rent = df_full_rent.astype({'year':'int64','month':'int64','day':'int64','full_rent_price':'int64'})
df_full_rent = df_full_rent.drop(['year_month'], axis=1) # drop the column that is no longer used
# print(df_full_rent.head())
# print(df_full_rent.info())

df_full_rent["address_0"] = df_full_rent["address"].str.split(' ',expand=True)[0] # city ('시'); always Seoul in this dataset
df_full_rent["address_1"] = df_full_rent["address"].str.split(' ',expand=True)[1] # district ('구')
df_full_rent["address_2"] = df_full_rent["address"].str.split(' ',expand=True)[2] # neighborhood ('동')
df_full_rent["road_name"] = df_full_rent["road"].str.split(' ',expand=True)[0] # road name
df_full_rent["road_number"] = df_full_rent["road"].str.split(' ',expand=True)[1] # building number on the road
df_full_rent= df_full_rent[['year','month','day','address_0','address_1','address_2','main_number','sub_number','road_name','road_number','area',"full_rent_price",'name','date']] # keep only the columns to be used
# print(df_full_rent.head())
# print(df_full_rent.info())
# print(df_full_rent.isnull().sum())

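df_full_rent = df_full_rent.replace('', np.nan) # turn values that are exactly '' into null (np was imported earlier in the notebook); in some pandas versions replace('', None) forward-fills instead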
# print(df_full_rent.isnull().sum())

# df_full_rent[((df_full_rent['main_number'].isnull()) &(df_full_rent['sub_number'].notnull()))
#   |((df_full_rent['main_number'].notnull()) &(df_full_rent['sub_number'].isnull()))]

# df_full_rent[((df_full_rent['road_name'].isnull()) | (df_full_rent['road_number'].isnull())) & (df_full_rent['main_number'].isnull())] 

# df_full_rent.loc[df_full_rent['main_number'].isnull(),['address_0','address_1','address_2','main_number','sub_number','road_name','road_number','name']]

# df_full_rent.loc[df_full_rent['main_number'].isnull(),'name'].unique()

# df_full_rent.loc[df_full_rent['name']=='힐스테이트 서초 젠트리스',:]

df_full_rent.loc[df_full_rent['name']=='힐스테이트 서초 젠트리스','main_number'] = 557
df_full_rent.loc[df_full_rent['name']=='힐스테이트 서초 젠트리스','sub_number'] = 0


df_full_rent = df_full_rent[['date','year','month','day','address_0','address_1','address_2','main_number','sub_number','name','area','full_rent_price']].copy()
df_full_rent.columns =['date','year','month','day','address_0','address_1','address_2','address_3','address_4','name','area','full_rent_price']
# df_full_rent.head()

# df_full_rent.info() 



In [None]:
df_full_rent.isnull().sum()

date                0
year                0
month               0
day                 0
address_0           0
address_1           0
address_2           0
address_3           0
address_4           0
name                0
area               25
full_rent_price     0
dtype: int64

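### Handling missing values in the 'area' column

- Unlike apartment_deal.ipynb, the area column has missing values here, so a missing-value step is added
- Each missing value is replaced with the area value that appears most often among the jeonse (lump-sum deposit lease) transactions at the same address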
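
The same rule can also be sketched in vectorised form. This is only an illustration, not the code this notebook runs: the helper name `fill_area_with_group_mode` and the toy rows are hypothetical, and on ties `Series.mode()` returns the smallest value, which may differ from what `value_counts().idxmax()` picks.

```python
import pandas as pd


def fill_area_with_group_mode(df, keys, col="area"):
    """Fill nulls in `col` with the most frequent non-null value within each key group."""

    def mode_or_nan(values):
        modes = values.mode()  # mode() ignores nulls; it is empty when the group has no value
        return modes.iloc[0] if not modes.empty else float("nan")

    group_mode = df.groupby(keys)[col].transform(mode_or_nan)
    out = df.copy()
    out[col] = out[col].fillna(group_mode)  # rows that already have an area are left alone
    return out


# Toy rows (hypothetical): one address with a clear mode, one with no reference value
demo = pd.DataFrame({
    "address_3": [683, 683, 683, 683, 29],
    "address_4": [14, 14, 14, 14, 47],
    "area": [84.9, 84.9, 59.0, None, None],
})
filled = fill_area_with_group_mode(demo, ["address_3", "address_4"])
print(filled)
```

Rows whose address has no other area value stay null, so they can be dropped with `dropna(subset=["area"])` instead of going through a `''` placeholder.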

In [None]:
# Inspect the rows whose area is missing
df_full_rent[df_full_rent['area'].isnull()].tail()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,full_rent_price
357440,2013-11-16,2013,11,16,서울특별시,노원구,공릉동,683.0,14.0,한일휴니스빌,,8000
375219,2013-11-30,2013,11,30,서울특별시,동대문구,장안동,312.0,8.0,태솔에버빌,,12000
389892,2013-01-17,2013,1,17,서울특별시,서대문구,창천동,501.0,14.0,삼성아트빌,,9000
439901,2013-01-20,2013,1,20,서울특별시,영등포구,영등포동4가,103.0,0.0,영등포그랑그루,,8000
490009,2014-02-19,2014,2,19,서울특별시,강서구,화곡동,29.0,47.0,드림하우스(29-47),,9500


In [None]:
# Collect the address columns of the rows whose area is null into lists
add_1 = list(df_full_rent.loc[df_full_rent['area'].isnull(),'address_1'])
add_2 = list(df_full_rent.loc[df_full_rent['area'].isnull(),'address_2'])
add_3 = list(df_full_rent.loc[df_full_rent['area'].isnull(),'address_3'])
add_4 = list(df_full_rent.loc[df_full_rent['area'].isnull(),'address_4'])
area_list = list()

In [None]:
# Build area_list: the replacement area for each row whose area is null
for i in range(len(add_1)):
    # All transactions at the same address as the i-th null-area row
    same_address = ((df_full_rent['address_1'] == add_1[i]) &
                    (df_full_rent['address_2'] == add_2[i]) &
                    (df_full_rent['address_3'] == add_3[i]) &
                    (df_full_rent['address_4'] == add_4[i]))
    area_counts = df_full_rent.loc[same_address, 'area'].value_counts()
    if len(area_counts) == 0:
        # No transaction at this address has an area value, so there is nothing to reference; use '' as a placeholder
        area_list.append('')
    else:
        # Use the area that was traded most often at this address
        area_list.append(area_counts.idxmax())
print(area_list) # most frequently traded area for each null-area row

[84.9, 33.33, 15.94, 15.94, 84.98, 142.034, 142.034, 142.034, 142.034, 17.07, 17.07, 17.07, 17.07, 17.07, 64.52, 23.47, 23.47, 13.2195, 13.2195, 13.2195, 13.2195, 49.65, 39.28, 12.1, '']


- The trailing '' means that this listing has no other transaction record to reference

In [None]:
# Use len to confirm that all the lists were built with the same length
print(len(add_1),len(add_2),len(add_3),len(add_4),len(area_list)) 

25 25 25 25 25


In [None]:
# The last entry is '', so look for another row that could supply this row's area -> none exists
# There is no way to recover this area (no other area values at this address) -> the row must be removed later
df_full_rent.loc[(df_full_rent['address_3']==29)&(df_full_rent['address_4']==47),:] # inspect the null-area row as a check

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,full_rent_price
490009,2014-02-19,2014,2,19,서울특별시,강서구,화곡동,29.0,47.0,드림하우스(29-47),,9500


In [None]:
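# Fill the null area values with the most frequently traded area at the same address
for i in range(len(add_1)):
    df_full_rent.loc[(df_full_rent['address_1'] ==add_1[i]) & 
                         (df_full_rent['address_2'] ==add_2[i]) &
                         (df_full_rent['address_3'] ==add_3[i]) &
                         (df_full_rent['address_4'] ==add_4[i]) &
                         (df_full_rent['area'].isnull()), # only touch rows that are actually missing, so valid areas are not overwritten
                         'area']=area_list[i]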

In [None]:
# Confirm that '' was stored in place of null
df_full_rent.loc[df_full_rent['area']=='',:]

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,full_rent_price
490009,2014-02-19,2014,2,19,서울특별시,강서구,화곡동,29.0,47.0,드림하우스(29-47),,9500


In [None]:
# Remove the rows whose area is ''
df_full_rent=df_full_rent.drop(df_full_rent[df_full_rent['area']==''].index)

# Check after removal
df_full_rent.loc[df_full_rent['area']=='',:] # confirms that the rows were removed

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,full_rent_price


In [None]:
df_full_rent.info() # check whether the nulls were handled

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1448687 entries, 0 to 2085774
Data columns (total 12 columns):
 #   Column           Non-Null Count    Dtype         
---  ------           --------------    -----         
 0   date             1448687 non-null  datetime64[ns]
 1   year             1448687 non-null  int64         
 2   month            1448687 non-null  int64         
 3   day              1448687 non-null  int64         
 4   address_0        1448687 non-null  object        
 5   address_1        1448687 non-null  object        
 6   address_2        1448687 non-null  object        
 7   address_3        1448687 non-null  float64       
 8   address_4        1448687 non-null  float64       
 9   name             1448687 non-null  object        
 10  area             1448662 non-null  float64       
 11  full_rent_price  1448687 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(4), object(4)
memory usage: 143.7+ MB


In [None]:
df_full_rent.to_csv('/content/drive/MyDrive/house_price/after_data/apartment_full_rent.csv', index=False) # write the jeonse CSV file

## Creating the monthly-rent DataFrame

- See the jeonse DataFrame section above

In [None]:
# Build the monthly-rent DataFrame, keeping only the needed columns
df_month_rent = df_default.loc[df_default['전월세구분']=='월세',['시군구','본번','부번','도로명','계약년월','계약일','보증금(만원)','월세(만원)','전용면적(㎡)','단지명']].copy()
df_month_rent.columns = ['address','main_number','sub_number','road','year_month','day','rent_deposit','month_rent_price','area','name']
# df_month_rent.head()

df_month_rent.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 637088 entries, 25 to 2085770
Data columns (total 10 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   address           637088 non-null  object 
 1   main_number       637039 non-null  float64
 2   sub_number        637039 non-null  float64
 3   road              637088 non-null  object 
 4   year_month        637088 non-null  int64  
 5   day               637088 non-null  int64  
 6   rent_deposit      637088 non-null  object 
 7   month_rent_price  637088 non-null  object 
 8   area              637077 non-null  float64
 9   name              637088 non-null  object 
dtypes: float64(3), int64(2), object(5)
memory usage: 53.5+ MB


Note the difference from the jeonse section! ↓

In [None]:
df_month_rent["month_rent_price2"] = df_month_rent["month_rent_price"].str.replace(',','')
df_month_rent.info() 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 637088 entries, 25 to 2085770
Data columns (total 11 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   address            637088 non-null  object 
 1   main_number        637039 non-null  float64
 2   sub_number         637039 non-null  float64
 3   road               637088 non-null  object 
 4   year_month         637088 non-null  int64  
 5   day                637088 non-null  int64  
 6   rent_deposit       637088 non-null  object 
 7   month_rent_price   637088 non-null  object 
 8   area               637077 non-null  float64
 9   name               637088 non-null  object 
 10  month_rent_price2  349840 non-null  object 
dtypes: float64(3), int64(2), object(6)
memory usage: 58.3+ MB


- Creating month_rent_price2 by applying replace to "month_rent_price" showed that replace does not process the column correctly

>> df_month_rent["month_rent_price"].str.replace(',','') 

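>> After running this, the number of nulls in 'month_rent_price2' jumps (only 349,840 of 637,088 rows are non-null) -> the replace method is not behaving as intended

>> Why? The likely cause is the difference between string and object: an object column can hold mixed types, and the .str methods return NaN for any element that is not a string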
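
The effect can be reproduced on a small example. This sketch assumes the column mixes strings with values that were read as integers; the sample values are hypothetical, and the actual mix in `month_rent_price` was not inspected here.

```python
import pandas as pd

# Hypothetical mixed column: two strings and one value that was read as an integer
mixed = pd.Series(["1,200", 35, "80"], dtype="object")

# Calling .str directly: the non-string element comes back as NaN
direct = mixed.str.replace(",", "", regex=False)

# Casting to str first: every element is a string, so nothing is lost
casted = mixed.astype(str).str.replace(",", "", regex=False)

print(direct.isna().sum())
print(casted.tolist())
```

A quick way to check a real column for this is `df_month_rent["month_rent_price"].map(type).value_counts()`, which lists how many elements there are of each Python type.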

In [None]:
# The dtypes must be converted to str first, as done here, before proceeding
df_month_rent = df_month_rent.astype({'month_rent_price':'str','rent_deposit':'str'})

- The steps are almost identical to apartment_deal, so they are combined into a single cell
- The commented-out lines are intermediate checks
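
The date-building step in the cell below can be sketched on its own. This is a minimal example with hypothetical sample values, assuming the raw contract year-month is `YYYYMM` and the contract day is an integer, as in the `info()` output above.

```python
import pandas as pd

# Hypothetical sample in the raw layout: year-month as YYYYMM, day as an integer
raw = pd.DataFrame({"year_month": [201103, 201311], "day": [8, 16]})

# zfill(2) left-pads single-digit days so every date string has eight digits
date_text = raw["year_month"].astype(str) + raw["day"].astype(str).str.zfill(2)
raw["date"] = pd.to_datetime(date_text, format="%Y%m%d")

# Year and month can then be read back from the parsed date
raw["year"] = raw["date"].dt.year
raw["month"] = raw["date"].dt.month
print(raw)
```

Passing an explicit `format` makes the parse fail loudly on a malformed value instead of guessing.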

In [None]:
df_month_rent["rent_deposit"] = df_month_rent["rent_deposit"].str.replace(",", "") # strip the thousands separators
df_month_rent["month_rent_price"] = df_month_rent["month_rent_price"].str.replace(',','')
df_month_rent = df_month_rent.astype({'year_month':'str','day':'str','rent_deposit':'int64','month_rent_price':'int64'})
df_month_rent['year'] = df_month_rent['year_month'].str[0:4] # extract the year from the combined year-month column
df_month_rent['month'] = df_month_rent['year_month'].str[4:] # extract the month from the combined year-month column
df_month_rent["day"] = df_month_rent["day"].str.zfill(2) # left-pad single-digit days with a zero
df_month_rent['date'] = pd.to_datetime(df_month_rent['year']+df_month_rent['month']+df_month_rent['day'], format='%Y%m%d') # combine the parts into a date column
df_month_rent = df_month_rent.astype({'year':'int64','month':'int64','day':'int64'})
df_month_rent = df_month_rent.drop(['year_month'], axis=1) # drop the column that is no longer needed
# print(df_month_rent.head())

df_month_rent["address_0"] = df_month_rent["address"].str.split(' ',expand=True)[0] # extract the city (only Seoul is in scope for now)
df_month_rent["address_1"] = df_month_rent["address"].str.split(' ',expand=True)[1] # extract the district ('gu')
df_month_rent["address_2"] = df_month_rent["address"].str.split(' ',expand=True)[2] # extract the neighborhood ('dong')
df_month_rent["road_name"] = df_month_rent["road"].str.split(' ',expand=True)[0] # extract the road name
df_month_rent["road_number"] = df_month_rent["road"].str.split(' ',expand=True)[1] # extract the road number
df_month_rent= df_month_rent[['year','month','day','address_0','address_1','address_2','main_number','sub_number','road_name','road_number','area',"rent_deposit","month_rent_price",'name','date']] # keep only the columns in use
# print(df_month_rent.head())

# print(df_month_rent.info())
# print(df_month_rent.isnull().sum())

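df_month_rent = df_month_rent.replace('', None) # turn empty-string cells into nulls (how an explicit None is handled varies by pandas version, so check isnull().sum() afterwards)
# print(df_month_rent.isnull().sum()) # check after the change -> the null counts of road_name and road_number increase sharply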

# df_month_rent[((df_month_rent['main_number'].isnull()) &(df_month_rent['sub_number'].notnull()))
#   |((df_month_rent['main_number'].notnull()) &(df_month_rent['sub_number'].isnull()))]

# df_month_rent[((df_month_rent['road_name'].isnull()) | (df_month_rent['road_number'].isnull())) & (df_month_rent['main_number'].isnull())] 

# df_month_rent.loc[df_month_rent['main_number'].isnull(),['address_0','address_1','address_2','main_number','sub_number','road_name','road_number','name']]

# df_month_rent.loc[df_month_rent['main_number'].isnull(),'name'].unique()

# df_month_rent.loc[df_month_rent['name']=='힐스테이트 서초 젠트리스',:]


df_month_rent.loc[df_month_rent['name']=='힐스테이트 서초 젠트리스','main_number'] = 557
df_month_rent.loc[df_month_rent['name']=='힐스테이트 서초 젠트리스','sub_number'] = 0

df_month_rent = df_month_rent[['date','year','month','day','address_0','address_1','address_2','main_number','sub_number','name','area','rent_deposit','month_rent_price']]
df_month_rent.columns =['date','year','month','day','address_0','address_1','address_2','address_3','address_4','name','area','rent_deposit','month_rent_price']
# df_month_rent.head()

# df_month_rent.info()



In [None]:
df_month_rent.isnull().sum()

date                 0
year                 0
month                0
day                  0
address_0            0
address_1            0
address_2            0
address_3            0
address_4            0
name                 0
area                11
rent_deposit         0
month_rent_price     0
dtype: int64

### Handling missing values in the 'area' column

- See the area missing-value handling in the jeonse section

In [None]:
# df_month_rent[df_month_rent['area'].isnull()].tail()


add_1 = list(df_month_rent.loc[df_month_rent['area'].isnull(),'address_1'])
add_2 = list(df_month_rent.loc[df_month_rent['area'].isnull(),'address_2'])
add_3 = list(df_month_rent.loc[df_month_rent['area'].isnull(),'address_3'])
add_4 = list(df_month_rent.loc[df_month_rent['area'].isnull(),'address_4'])
area_list = list()
# Build area_list: the replacement area for each row whose area is null
for i in range(len(add_1)):
    # All transactions at the same address as the i-th null-area row
    same_address = ((df_month_rent['address_1'] == add_1[i]) &
                    (df_month_rent['address_2'] == add_2[i]) &
                    (df_month_rent['address_3'] == add_3[i]) &
                    (df_month_rent['address_4'] == add_4[i]))
    area_counts = df_month_rent.loc[same_address, 'area'].value_counts()
    if len(area_counts) == 0:
        # No transaction at this address has an area value, so there is nothing to reference; use '' as a placeholder
        area_list.append('')
    else:
        # Use the area that was traded most often at this address
        area_list.append(area_counts.idxmax())
# print(area_list)

# print(len(add_1),len(add_2),len(add_3),len(add_4),len(area_list)) 

for i in range(len(add_1)):
    df_month_rent.loc[(df_month_rent['address_1'] ==add_1[i]) & 
                         (df_month_rent['address_2'] ==add_2[i]) &
                         (df_month_rent['address_3'] ==add_3[i]) &
                         (df_month_rent['address_4'] ==add_4[i]) &
                         (df_month_rent['area'].isnull()), # only touch rows that are actually missing, so valid areas are not overwritten
                         'area']=area_list[i]

# df_month_rent.head()

# df_month_rent.info()



In [None]:
df_month_rent.isnull().sum()

date                0
year                0
month               0
day                 0
address_0           0
address_1           0
address_2           0
address_3           0
address_4           0
name                0
area                0
rent_deposit        0
month_rent_price    0
dtype: int64

In [None]:
df_month_rent.head()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,rent_deposit,month_rent_price
25,2011-03-18,2011,3,18,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,19000,63
28,2011-04-09,2011,4,9,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,21000,35
38,2011-07-09,2011,7,9,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,3000,160
46,2011-09-19,2011,9,19,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,6000,140
47,2011-09-20,2011,9,20,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,5000,160


In [None]:
df_month_rent.to_csv('/content/drive/MyDrive/house_price/after_data/apartment_month_rent.csv', index=False)

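# Creating the economic_data.csv file

- Build the economic_data file (macroeconomic information)
- economic_data contains the Korean base rate, the real estate index, the KOSPI index, Korean government bond yields, US government bond yields, the long-short yield spread, the apartment supply volume, the number of unsold apartments, and the unsold-apartment rate

## Creating the base-rate DataFrame

- The page 'https://www.bok.or.kr/portal/singl/baseRate/list.do?dataSeCd=01&menuNo=200643' lists the dates on which the base rate changed, so it is crawled to build a DataFrame of the base rate by date

### Fetching the base rate by crawling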

In [None]:
# Import libraries

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
# Fetch the web page

res = requests.get('https://www.bok.or.kr/portal/singl/baseRate/list.do?dataSeCd=01&menuNo=200643')

# Parse the web page
soup = BeautifulSoup(res.content,'html.parser')

# Select the table rows that hold the data
items = soup.select('#content > div.table.tac > table > tbody > tr')

# Lists that collect the crawled values; they become the DataFrame columns below
change_year_list = list()
change_date_list = list()
rp_list = list()

# Read the text of each table cell and append it to the matching list
for item in items:
    table_list = item.select('td')
    change_year_list.append(table_list[0].get_text())
    change_date_list.append(table_list[1].get_text())
    rp_list.append(table_list[2].get_text())

# Build df, the DataFrame that holds the crawled base-rate information
df = pd.DataFrame({
    "year": change_year_list,
    "change_date": change_date_list,
    "korea_rp": rp_list
}, columns=["year", "change_date", "korea_rp"])

df.tail() # check the DataFrame

Unnamed: 0,year,change_date,korea_rp
51,2001,07월 05일,4.75
52,2001,02월 08일,5.0
53,2000,10월 05일,5.25
54,2000,02월 10일,5.0
55,1999,05월 06일,4.75


- change_date is the date on which the base rate changed, and korea_rp is the new base rate

### Merging columns

- The year and change_date columns both describe the date, so they are combined into a single column
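
As an alternative to slicing the month and day out by position, the two text columns can be parsed in one step. This is a sketch on hypothetical rows shaped like the crawled table, assuming `change_date` always follows the `MM월 DD일` pattern shown in the output above.

```python
import pandas as pd

# Hypothetical rows in the same shape as the crawled table
raw = pd.DataFrame({
    "year": ["2001", "1999"],
    "change_date": ["07월 05일", "05월 06일"],
    "korea_rp": ["4.75", "4.75"],
})

# The Korean unit characters are matched literally by the format string
raw["rp_date"] = pd.to_datetime(raw["year"] + " " + raw["change_date"], format="%Y %m월 %d일")
raw["korea_rp"] = raw["korea_rp"].astype("float64")

# The page lists the newest change first, so sort into chronological order
raw = raw.sort_values("rp_date").reset_index(drop=True)
print(raw[["rp_date", "korea_rp"]])
```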

In [None]:
df['month']=df['change_date'].str[0:2] # extract the month
df['date'] = df['change_date'].str[4:6] # extract the day
df = df.astype({'korea_rp':'float64'}) # convert the rate column to float
df['rp_date'] = df['year']+df['month']+df['date'] # build a YYYYMMDD string
df = df.drop(['change_date', 'year','month','date'], axis=1) # drop the columns that are no longer needed
df=df.sort_index(ascending=False) # the page lists the newest change first, so reverse into chronological order
df['rp_date'] = pd.to_datetime(df['rp_date'], format='%Y%m%d', errors='raise') # convert to datetime; the format matches the YYYYMMDD string built above

In [None]:
df.head() # check the DataFrame

Unnamed: 0,korea_rp,rp_date
55,4.75,1999-05-06
54,5.0,2000-02-10
53,5.25,2000-10-05
52,5.0,2001-02-08
51,4.75,2001-07-05


In [None]:
df.tail() # check the DataFrame

Unnamed: 0,korea_rp,rp_date
4,2.5,2022-08-25
3,3.0,2022-10-12
2,3.25,2022-11-24
1,3.5,2023-01-13
0,3.5,2023-02-23


### Generating base-rate values for the dates between rate changes

- The DataFrame above holds only the dates on which the base rate changed and the new rate, but the base rate for every date in between is also needed, so rates for those in-between dates are generated
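
The whole expansion can be sketched compactly with `pd.date_range`, `reindex`, and `ffill`. This is an illustrative alternative rather than the code used in the cells below; the two change dates and rates are taken from the output above, and the end date is chosen only to keep the example small.

```python
import pandas as pd

# Two rate changes taken from the crawled table shown above
changes = pd.DataFrame({
    "rp_date": pd.to_datetime(["1999-05-06", "2000-02-10"]),
    "korea_rp": [4.75, 5.0],
})

# Every calendar day in the period; unlike range(), date_range includes the end date
all_days = pd.date_range("1999-05-06", "2000-02-12", freq="D")

daily = (
    changes.set_index("rp_date")["korea_rp"]
    .reindex(all_days)   # days without a change become NaN
    .ffill()             # a rate stays in effect until the next change
    .rename_axis("date")
    .reset_index()
)
print(daily.tail())
```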

In [None]:
import datetime

# Generate every date in the crawled period
start = datetime.datetime.strptime("06-05-1999", "%d-%m-%Y") # start date
end = datetime.datetime.strptime("31-01-2023", "%d-%m-%Y") # end date (exclusive: range() stops one day before it)
date_generated = [start + datetime.timedelta(days=x) for x in range(0, (end-start).days)] # every date from start up to the day before end
date_list=list()
for date in date_generated:
    date_list.append(date.strftime("%Y-%m-%d")) # store each generated date as a YYYY-MM-DD string

In [None]:
# df_date is a DataFrame holding every date to look up
df_date = pd.DataFrame({
    "date": date_list
}, columns=["date"])
df_date['date'] = pd.to_datetime(df_date['date'], format='%Y-%m-%d', errors='raise') # convert to datetime; the format matches the YYYY-MM-DD strings

In [None]:
df_date.head() # check the DataFrame

Unnamed: 0,date
0,1999-05-06
1,1999-05-07
2,1999-05-08
3,1999-05-09
4,1999-05-10


In [None]:
# Merge the two DataFrames to get the base rate by date
df_rp=pd.merge(df_date, df, left_on='date', right_on='rp_date', how='left')

In [None]:
# Keep only the columns in use
df_rp = df_rp[['date','korea_rp']]
df_rp # check the resulting DataFrame

Unnamed: 0,date,korea_rp
0,1999-05-06,4.75
1,1999-05-07,
2,1999-05-08,
3,1999-05-09,
4,1999-05-10,
...,...,...
8666,2023-01-26,
8667,2023-01-27,
8668,2023-01-28,
8669,2023-01-29,


In [None]:
# A base rate stays in effect until the next change, so each null is filled with the most recent preceding rate
# This yields the base rate for every date
df_rp=df_rp.ffill() # ffill() fills each null with the last non-null value above it
df_rp

Unnamed: 0,date,korea_rp
0,1999-05-06,4.75
1,1999-05-07,4.75
2,1999-05-08,4.75
3,1999-05-09,4.75
4,1999-05-10,4.75
...,...,...
8666,2023-01-26,3.50
8667,2023-01-27,3.50
8668,2023-01-28,3.50
8669,2023-01-29,3.50


In [None]:
# Plot the base rate over time
# x-axis: date, y-axis: base rate
import plotly.graph_objects as go

# Create traces
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_rp['date'], y=df_rp['korea_rp'],
                    mode='lines',
                    name='korea_rp',yaxis='y1'))


fig.show(renderer="colab")

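## Creating the real estate index DataFrame

- Uses the apartment sales price index file downloaded from https://data.seoul.go.kr/dataList/801/S/2/datasetView.do
- The apartment sales price index is used only for a rough check of whether the macroeconomic indicators relate to apartment prices -> it is not used afterwards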

In [None]:
# Load the real estate index file
df_real_estate = pd.read_csv("/content/drive/MyDrive/house_price/original_data/seoul_deal_index.csv",  encoding='UTF8') # load the real estate index
df_real_estate= df_real_estate.loc[(df_real_estate['시점']>1998) & (df_real_estate['자치구별(2)']=='소계'),['시점','아파트']] # keep only the rows that match these conditions
df_real_estate.head()


Unnamed: 0,시점,아파트
39,1999,38.7
42,2000,40.3
45,2001,48.1
48,2002,62.9
51,2003,61.2


In [None]:
df_real_estate.info() # check the DataFrame info

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23 entries, 39 to 519
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   시점      23 non-null     int64  
 1   아파트     23 non-null     float64
dtypes: float64(1), int64(1)
memory usage: 552.0 bytes


In [None]:
# Convert the index period to datetime and check the head
df_real_estate['시점'] = pd.to_datetime(df_real_estate['시점'], format='%Y') # convert the year-only values to datetime
df_real_estate.head()

Unnamed: 0,시점,아파트
39,1999-01-01,38.7
42,2000-01-01,40.3
45,2001-01-01,48.1
48,2002-01-01,62.9
51,2003-01-01,61.2


In [None]:
df_real_estate.info() # confirm that the dtype changed

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23 entries, 39 to 519
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   시점      23 non-null     datetime64[ns]
 1   아파트     23 non-null     float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 552.0 bytes


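## Merging the base rate & real estate index

- Merge the base-rate and real-estate-index DataFrames
- The base-rate DataFrame has a row for every date, so it is used as the left side of the merge
- The real-estate-index values are assumed to stay constant throughout each year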
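
Under that constant-within-year assumption, the annual value can also be looked up directly by year. This is a sketch, not the approach the cells below take; the two index values are the 1999 and 2000 figures from the output above, and the dates are hypothetical.

```python
import pandas as pd

# Hypothetical daily rows spanning two years
daily = pd.DataFrame({
    "date": pd.to_datetime(["1999-05-06", "1999-12-31", "2000-01-01", "2000-06-30"]),
})

# Annual apartment index keyed by year (values from the table above)
annual_index = pd.Series({1999: 38.7, 2000: 40.3})

# Each date takes the index of its own year
daily["apartment_index"] = daily["date"].dt.year.map(annual_index)
print(daily)
```

Because the lookup is keyed by year, the 1999 rows get their value directly, with no separate fill for the dates before the first merge match.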

In [None]:
df_final=pd.merge(df_rp, df_real_estate, left_on='date', right_on='시점', how='left') # merge the base-rate DataFrame with the real-estate-index DataFrame
df_final=df_final.ffill() # fill each null with the preceding value, i.e. assume the index is constant within each year
df_final.head()

Unnamed: 0,date,korea_rp,시점,아파트
0,1999-05-06,4.75,NaT,
1,1999-05-07,4.75,NaT,
2,1999-05-08,4.75,NaT,
3,1999-05-09,4.75,NaT,
4,1999-05-10,4.75,NaT,


In [None]:
df_final.info() # check the DataFrame info

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8671 entries, 0 to 8670
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   date      8671 non-null   datetime64[ns]
 1   korea_rp  8671 non-null   float64       
 2   시점        8431 non-null   datetime64[ns]
 3   아파트       8431 non-null   float64       
dtypes: datetime64[ns](2), float64(2)
memory usage: 338.7 KB


In [None]:
df_final.tail()

Unnamed: 0,date,korea_rp,시점,아파트
8666,2023-01-26,3.5,2021-01-01,104.4
8667,2023-01-27,3.5,2021-01-01,104.4
8668,2023-01-28,3.5,2021-01-01,104.4
8669,2023-01-29,3.5,2021-01-01,104.4
8670,2023-01-30,3.5,2021-01-01,104.4


In [None]:
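df_final['아파트'] = df_final['아파트'].fillna(38.7) # the rows before the first match (2000-01-01) are still null; fill them with 38.7, the oldest (1999) value
df_final = df_final[['date','korea_rp','아파트']] # keep only the columns in use
df_final.columns = ['date','korea_rp','apartment_index'] # rename the columns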

In [None]:
df_final.head()

Unnamed: 0,date,korea_rp,apartment_index
0,1999-05-06,4.75,38.7
1,1999-05-07,4.75,38.7
2,1999-05-08,4.75,38.7
3,1999-05-09,4.75,38.7
4,1999-05-10,4.75,38.7


### Comparing the base rate (inverted) with the real estate index

In [None]:
# Plot the base rate and the real estate index together
# The base-rate axis is reversed, so the base-rate line is drawn upside down

import plotly.graph_objects as go

# Create traces
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['korea_rp'],
                    mode='lines',
                    name='korea_rp',yaxis='y1'))
# Reverse the first y-axis so the base-rate line is flipped vertically
fig.update_layout(
    yaxis = dict(autorange="reversed")
)


fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['apartment_index'],
                    mode='lines',
                    name='apartment_index',
                    yaxis="y2"))
fig.update_layout(

   # create first Y-axis
   yaxis=dict(
      title=dict(text="rp point", font=dict(color="blue")),
      tickfont=dict(color="blue")
   ),

   # create second Y-axis
   yaxis2=dict(
      title="apartment index",
      overlaying="y",
      side="right")
)
fig.show(renderer="colab")

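Up to 2005 the two series move together, from 2005 to 2008 they move in opposite directions, and after 2008 they again move roughly together.
Could the added liquidity from quantitative easing after 2008 be what makes the inverted base rate and real estate prices move alike?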

## Trimming the DataFrame period

- Jeonse and monthly-rent data exist only from 2011 onward, so the DataFrame is cut to 2011 through 2022

In [None]:
df_final = df_final[(df_final['date']>='2011-01-01') & (df_final['date']<='2022-12-31')] # keep only the dates in use
df_final.head()

Unnamed: 0,date,korea_rp,apartment_index
4258,2011-01-01,2.5,93.0
4259,2011-01-02,2.5,93.0
4260,2011-01-03,2.5,93.0
4261,2011-01-04,2.5,93.0
4262,2011-01-05,2.5,93.0


In [None]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4383 entries, 4258 to 8640
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   date             4383 non-null   datetime64[ns]
 1   korea_rp         4383 non-null   float64       
 2   apartment_index  4383 non-null   float64       
dtypes: datetime64[ns](1), float64(2)
memory usage: 137.0 KB


### Comparing the base rate (inverted) with the real estate index

In [None]:
# Create traces
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['korea_rp'],
                    mode='lines',
                    name='korea_rp',yaxis='y1'))
# Reverse the first y-axis so the base-rate line is flipped vertically
fig.update_layout(
    yaxis = dict(autorange="reversed")
)


fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['apartment_index'],
                    mode='lines',
                    name='apartment_index',
                    yaxis="y2"))
fig.update_layout(

   # create first Y-axis
   yaxis=dict(
      title=dict(text="rp point", font=dict(color="blue")),
      tickfont=dict(color="blue")
   ),

   # create second Y-axis
   yaxis2=dict(
      title="apartment index",
      overlaying="y",
      side="right")
)

- The inverted base rate and the real estate index appear to be related
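
One way to put a number on this visual impression is a correlation coefficient. The sketch below uses made-up values and a hypothetical helper, `level_and_change_correlation`; it is not a result from this notebook. Two trending series can correlate strongly in levels without being related, so the helper also correlates the period-to-period changes.

```python
import pandas as pd


def level_and_change_correlation(df, a="korea_rp", b="apartment_index"):
    """Return the Pearson correlation of the levels and of the first differences."""
    level_corr = df[a].corr(df[b])
    change_corr = df[a].diff().corr(df[b].diff())  # the leading NaN from diff() is ignored
    return level_corr, change_corr


# Made-up values for illustration only (not taken from the notebook's data)
toy = pd.DataFrame({
    "korea_rp": [5.0, 4.0, 3.5, 2.0, 1.5],
    "apartment_index": [40.0, 50.0, 55.0, 70.0, 75.0],
})
level_corr, change_corr = level_and_change_correlation(toy)
print(level_corr, change_corr)
```

A strongly negative value would match the plot, where the index tracks the rate once the rate axis is inverted.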

## Creating the KOSPI index DataFrame

In [None]:
df_kospi = pd.read_csv("/content/drive/MyDrive/house_price/original_data/kospi.csv",  encoding='UTF8') # load the KOSPI index data
df_kospi.head()

Unnamed: 0,날짜,종가,오픈,고가,저가,거래량,변동 %
0,2022- 12- 29,2236.4,2265.73,2272.67,2236.38,361.19M,-1.93%
1,2022- 12- 28,2280.45,2296.45,2296.45,2276.9,405.89M,-2.24%
2,2022- 12- 27,2332.79,2327.52,2335.99,2321.48,448.50M,0.68%
3,2022- 12- 26,2317.14,2312.54,2321.92,2304.2,427.84M,0.15%
4,2022- 12- 23,2313.69,2325.86,2333.08,2311.9,366.99M,-1.83%


In [None]:
df_kospi=df_kospi.sort_index(ascending=False) # rows are newest first, so reverse into chronological order
df_kospi.reset_index(drop=True, inplace=True) # reset the index
df_kospi.head()

Unnamed: 0,날짜,종가,오픈,고가,저가,거래량,변동 %
0,2007- 01- 02,1435.26,1438.89,1439.71,1430.06,147.74M,0.06%
1,2007- 01- 03,1409.35,1436.42,1437.79,1409.31,203.21M,-1.81%
2,2007- 01- 04,1397.29,1410.55,1411.12,1388.5,241.17M,-0.86%
3,2007- 01- 05,1385.76,1398.6,1400.59,1372.36,277.29M,-0.83%
4,2007- 01- 08,1370.81,1376.76,1384.65,1366.48,177.59M,-1.08%


In [None]:
df_kospi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3956 entries, 0 to 3955
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   날짜      3956 non-null   object
 1   종가      3956 non-null   object
 2   오픈      3956 non-null   object
 3   고가      3956 non-null   object
 4   저가      3956 non-null   object
 5   거래량     3956 non-null   object
 6   변동 %    3956 non-null   object
dtypes: object(7)
memory usage: 216.5+ KB


In [None]:
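# Keep only the needed columns, rename them, and convert the date type
df_kospi = df_kospi[['날짜','종가']].copy() # copy so the assignments below do not act on a slice
df_kospi.columns = ['kospi_date','kospi_index']
df_kospi["kospi_date"] = pd.to_datetime(df_kospi["kospi_date"].str.replace(" ", "")) # drop the spaces inside values such as '2007- 01- 02' before parsing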
df_kospi.head()

Unnamed: 0,kospi_date,kospi_index
0,2007-01-02,1435.26
1,2007-01-03,1409.35
2,2007-01-04,1397.29
3,2007-01-05,1385.76
4,2007-01-08,1370.81


In [None]:
df_kospi.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3956 entries, 0 to 3955
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   kospi_date   3956 non-null   datetime64[ns]
 1   kospi_index  3956 non-null   object        
dtypes: datetime64[ns](1), object(1)
memory usage: 61.9+ KB


In [None]:
# Convert kospi_index to a numeric type for later calculations
df_kospi["kospi_index"] = df_kospi["kospi_index"].str.replace(",", "") # the values are strings, so strip the thousands separators
df_kospi = df_kospi.astype({'kospi_index': 'float64'}) # convert the column type
df_kospi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3956 entries, 0 to 3955
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   kospi_date   3956 non-null   datetime64[ns]
 1   kospi_index  3956 non-null   float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 61.9 KB


In [None]:
df_kospi.head() # check the DataFrame

Unnamed: 0,kospi_date,kospi_index
0,2007-01-02,1435.26
1,2007-01-03,1409.35
2,2007-01-04,1397.29
3,2007-01-05,1385.76
4,2007-01-08,1370.81


## Merging with the KOSPI index DataFrame

In [None]:
# Merge the base-rate & real-estate-index DataFrame with the KOSPI index DataFrame
df_final=pd.merge(df_final, df_kospi, left_on='date', right_on='kospi_date', how='left') # join the two DataFrames
df_final.head()

Unnamed: 0,date,korea_rp,apartment_index,kospi_date,kospi_index
0,2011-01-01,2.5,93.0,NaT,
1,2011-01-02,2.5,93.0,NaT,
2,2011-01-03,2.5,93.0,2011-01-03,2070.08
3,2011-01-04,2.5,93.0,2011-01-04,2085.14
4,2011-01-05,2.5,93.0,2011-01-05,2082.55


In [None]:
df_final.info() # check the result -> kospi_date and kospi_index contain nulls because of weekends and other market holidays

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4383 entries, 0 to 4382
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   date             4383 non-null   datetime64[ns]
 1   korea_rp         4383 non-null   float64       
 2   apartment_index  4383 non-null   float64       
 3   kospi_date       2958 non-null   datetime64[ns]
 4   kospi_index      2958 non-null   float64       
dtypes: datetime64[ns](2), float64(3)
memory usage: 205.5 KB


In [None]:
# Assume the index keeps its previous value on days when the market is closed,
# so fill each null with the preceding value
df_final["kospi_index"]=df_final["kospi_index"].ffill()
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4383 entries, 0 to 4382
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   date             4383 non-null   datetime64[ns]
 1   korea_rp         4383 non-null   float64       
 2   apartment_index  4383 non-null   float64       
 3   kospi_date       2958 non-null   datetime64[ns]
 4   kospi_index      4381 non-null   float64       
dtypes: datetime64[ns](2), float64(3)
memory usage: 205.5 KB


In [None]:
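# The nulls at the very top have no earlier value to carry forward, so the value was looked up by hand (via a Naver search) and filled in
df_final["kospi_index"] = df_final["kospi_index"].fillna(2051)
df_final.info() # confirm that the values were filled in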

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4383 entries, 0 to 4382
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   date             4383 non-null   datetime64[ns]
 1   korea_rp         4383 non-null   float64       
 2   apartment_index  4383 non-null   float64       
 3   kospi_date       2958 non-null   datetime64[ns]
 4   kospi_index      4383 non-null   float64       
dtypes: datetime64[ns](2), float64(3)
memory usage: 205.5 KB
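
The merge-then-forward-fill step can also be done in one call with `pd.merge_asof`, which attaches the latest close on or before each calendar date. This is a sketch under that assumption rather than the code used above; the two closes are the values shown in the merged output, and the five-day calendar is hypothetical.

```python
import pandas as pd

# Hypothetical five-day calendar; 2011-01-01 and 2011-01-02 fall on a weekend
calendar = pd.DataFrame({
    "date": pd.to_datetime(["2011-01-01", "2011-01-02", "2011-01-03", "2011-01-04", "2011-01-05"]),
})
kospi = pd.DataFrame({
    "kospi_date": pd.to_datetime(["2011-01-03", "2011-01-04"]),
    "kospi_index": [2070.08, 2085.14],
})

# Both frames must be sorted by their date keys;
# direction="backward" picks the latest close on or before each calendar date
merged = pd.merge_asof(calendar, kospi, left_on="date", right_on="kospi_date", direction="backward")

# Dates before the first trading day have no earlier close and stay null
leading_gap = merged["kospi_index"].isna().sum()
print(merged)
```

The leading gap still needs an explicit decision, such as the manual value used above or a backward fill.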


In [None]:
df_final.head() # check the result

Unnamed: 0,date,korea_rp,apartment_index,kospi_date,kospi_index
0,2011-01-01,2.5,93.0,NaT,2051.0
1,2011-01-02,2.5,93.0,NaT,2051.0
2,2011-01-03,2.5,93.0,2011-01-03,2070.08
3,2011-01-04,2.5,93.0,2011-01-04,2085.14
4,2011-01-05,2.5,93.0,2011-01-05,2082.55


In [None]:
# Keep only the columns in use
df_final = df_final[['date','korea_rp','apartment_index','kospi_index']]
df_final.head()

Unnamed: 0,date,korea_rp,apartment_index,kospi_index
0,2011-01-01,2.5,93.0,2051.0
1,2011-01-02,2.5,93.0,2051.0
2,2011-01-03,2.5,93.0,2070.08
3,2011-01-04,2.5,93.0,2085.14
4,2011-01-05,2.5,93.0,2082.55


### Checking with a plot whether the KOSPI index is needed

In [None]:
# Create traces
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['kospi_index'],
                    mode='lines',
                    name='kospi_index',yaxis='y1'))



fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['apartment_index'],
                    mode='lines',
                    name='apartment_index',
                    yaxis="y2"))
fig.update_layout(

   # create first Y-axis
   yaxis=dict(
      title=dict(text="kospi index", font=dict(color="blue")),
      tickfont=dict(color="blue")
   ),

   # create second Y-axis
   yaxis2=dict(
      title="apartment index",
      overlaying="y",
      side="right")
)

- Is there some degree of correlation between the KOSPI index and the real-estate index? It is hard to tell from the graph alone
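
Since the graph alone is inconclusive, one option is to put a number on the relationship. The following is a minimal sketch, not part of the original notebook: `index_correlation` is a hypothetical helper, and `demo` is a small synthetic stand-in for `df_final` (the apartment values are made up for illustration). It computes the Pearson correlation of the two series in levels and in day-to-day changes.

```python
import pandas as pd

# Synthetic stand-in for df_final (illustrative values, not the project's data)
demo = pd.DataFrame({
    "date": pd.date_range("2011-01-01", periods=6, freq="D"),
    "kospi_index": [2051.0, 2051.0, 2070.08, 2085.14, 2082.55, 2077.61],
    "apartment_index": [93.0, 93.0, 93.0, 93.1, 93.1, 93.2],
})


def index_correlation(df, col_a, col_b):
    """Return (correlation of levels, correlation of first differences)."""
    level_corr = df[col_a].corr(df[col_b])  # Pearson by default
    diff_corr = df[col_a].diff().corr(df[col_b].diff())
    return level_corr, diff_corr


level_corr, diff_corr = index_correlation(demo, "kospi_index", "apartment_index")
print(f"levels: {level_corr:.3f}, changes: {diff_corr:.3f}")
```

Two trending series can show a high correlation in levels even when their changes are unrelated, which is why the sketch reports both numbers.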

## Building the Korean government bond yield dataframe

- Almost the same as the process used to build the KOSPI dataframe

In [None]:
import os


dir_path = "/content/drive/MyDrive/house_price/original_data/korean_bond"
file_list = os.listdir(dir_path)
file_list.sort()
name_list = list()
df_list = list()

# read every csv file in the folder and collect the dataframes in a list
for csv_file in file_list:
    df_list.append(pd.read_csv(dir_path+"/"+csv_file , encoding='UTF8'))
    name_list.append(csv_file.split('.')[0])
for i in range(len(df_list)):
    df_korea = df_list[i] # dataframe for one maturity
    df_korea=df_korea.sort_index(ascending=False) # the dates are in reverse order, so sort them
    df_korea.reset_index(drop=True, inplace=True) # reset the index
    df_korea = df_korea[['날짜','종가']].copy() # keep the date ('날짜') and closing value ('종가') columns
    df_korea.columns = ['korea_date',name_list[i]]
    df_korea['korea_date'] = pd.to_datetime(df_korea['korea_date'])
    df_final=pd.merge(df_final, df_korea, left_on='date', right_on='korea_date', how='left')
    df_final[name_list[i]]=df_final[name_list[i]].ffill() # fill holidays in between with the previous value
    df_final[name_list[i]]=df_final[name_list[i]].bfill() # fill the topmost rows with the nearest following value
    df_final = df_final.drop(['korea_date'], axis=1)

In [None]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4383 entries, 0 to 4382
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   date             4383 non-null   datetime64[ns]
 1   korea_rp         4383 non-null   float64       
 2   apartment_index  4383 non-null   float64       
 3   kospi_index      4383 non-null   float64       
 4   korea_10_year    4383 non-null   float64       
 5   korea_1_year     4383 non-null   float64       
 6   korea_20_year    4383 non-null   float64       
 7   korea_2_year     4383 non-null   float64       
 8   korea_3_year     4383 non-null   float64       
 9   korea_4_year     4383 non-null   float64       
 10  korea_5_year     4383 non-null   float64       
dtypes: datetime64[ns](1), float64(10)
memory usage: 410.9 KB
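
The merge-then-fill step used for each bond file can be seen in isolation with a minimal sketch. The frames `calendar` and `quotes` and the column `yield_3y` are hypothetical names with illustrative values; only the pattern (left merge onto a daily calendar, forward fill, then backward fill) mirrors the loop above.

```python
import pandas as pd

# Daily calendar including weekends (stand-in for df_final['date'])
calendar = pd.DataFrame(
    {"date": pd.date_range("2011-01-01", "2011-01-05", freq="D")}
)

# Hypothetical quotes: no value for Jan 1-2 (holiday/weekend) or Jan 4
quotes = pd.DataFrame({
    "quote_date": pd.to_datetime(["2011-01-03", "2011-01-05"]),
    "yield_3y": [3.44, 3.495],
})

# The left merge keeps every calendar day; days without a quote become NaN
merged = calendar.merge(quotes, left_on="date", right_on="quote_date", how="left")

# ffill carries the last known quote forward; bfill then covers the leading
# rows that have no earlier quote to carry forward
merged["yield_3y"] = merged["yield_3y"].ffill().bfill()
merged = merged.drop(columns="quote_date")
print(merged)
```

Note that the backward fill gives the first rows a value taken from a later date, which is worth keeping in mind if the exact timing of the first observations matters.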


In [None]:
# reorder the columns
df_final = df_final[['date', 'apartment_index','kospi_index','korea_rp',
                    'korea_1_year','korea_2_year','korea_3_year','korea_4_year','korea_5_year',
                    'korea_10_year','korea_20_year']]
df_final.head()

Unnamed: 0,date,apartment_index,kospi_index,korea_rp,korea_1_year,korea_2_year,korea_3_year,korea_4_year,korea_5_year,korea_10_year,korea_20_year
0,2011-01-01,93.0,2051.0,2.5,2.81,3.4,3.44,4.09,4.14,4.57,4.73
1,2011-01-02,93.0,2051.0,2.5,2.81,3.4,3.44,4.09,4.14,4.57,4.73
2,2011-01-03,93.0,2070.08,2.5,2.81,3.4,3.44,4.09,4.14,4.57,4.73
3,2011-01-04,93.0,2085.14,2.5,2.83,3.37,3.495,4.16,4.2,4.58,4.74
4,2011-01-05,93.0,2082.55,2.5,2.8,3.42,3.495,4.15,4.17,4.63,4.75


In [None]:
# create year, month, and day columns
df_final['year'] = df_final['date'].dt.year
df_final['month'] = df_final['date'].dt.month
df_final['day'] = df_final['date'].dt.day
df_final.head()

Unnamed: 0,date,apartment_index,kospi_index,korea_rp,korea_1_year,korea_2_year,korea_3_year,korea_4_year,korea_5_year,korea_10_year,korea_20_year,year,month,day
0,2011-01-01,93.0,2051.0,2.5,2.81,3.4,3.44,4.09,4.14,4.57,4.73,2011,1,1
1,2011-01-02,93.0,2051.0,2.5,2.81,3.4,3.44,4.09,4.14,4.57,4.73,2011,1,2
2,2011-01-03,93.0,2070.08,2.5,2.81,3.4,3.44,4.09,4.14,4.57,4.73,2011,1,3
3,2011-01-04,93.0,2085.14,2.5,2.83,3.37,3.495,4.16,4.2,4.58,4.74,2011,1,4
4,2011-01-05,93.0,2082.55,2.5,2.8,3.42,3.495,4.15,4.17,4.63,4.75,2011,1,5


### Visualizing the real-estate index against Korean government bond yields

In [None]:
fig = go.Figure()

# plot the base rate and every government bond yield on the first y-axis
rate_columns = ['korea_rp', 'korea_1_year', 'korea_2_year', 'korea_3_year',
                'korea_4_year', 'korea_5_year', 'korea_10_year', 'korea_20_year']
for column in rate_columns:
    fig.add_trace(go.Scatter(x=df_final['date'], y=df_final[column],
                        mode='lines',
                        name=column,yaxis='y1'))

# flip the axis used by the traces above
fig.update_layout(
    yaxis = dict(autorange="reversed")
)
fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['apartment_index'],
                    mode='lines',
                    name='apartment_index',
                    yaxis="y2"))
fig.update_layout(

   # create first Y-axis
   yaxis=dict(
      title="rate index",
      titlefont=dict(color="black"),
      tickfont=dict(color="black")
   ),

   # create second Y-axis
   yaxis2=dict(
      title="apartment index",
      overlaying="y",
      side="right")
)

- The (inverted) Korean government bond yields appear to be related to the real-estate index

In [None]:
# the yields move roughly together, so keep only the 3-year and 10-year government bonds
df_final = df_final[['date','year','month','day','apartment_index','kospi_index','korea_rp',
                    'korea_3_year','korea_10_year']]
df_final.head()

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year
0,2011-01-01,2011,1,1,93.0,2051.0,2.5,3.44,4.57
1,2011-01-02,2011,1,2,93.0,2051.0,2.5,3.44,4.57
2,2011-01-03,2011,1,3,93.0,2070.08,2.5,3.44,4.57
3,2011-01-04,2011,1,4,93.0,2085.14,2.5,3.495,4.58
4,2011-01-05,2011,1,5,93.0,2082.55,2.5,3.495,4.63


## Building the US Treasury yield dataframe

- Almost identical to the process used to build the Korean government bond yield dataframe

In [None]:
# reset the variables
dir_path = "/content/drive/MyDrive/house_price/original_data/us_bond"
file_list = os.listdir(dir_path)
file_list.sort()
name_list = list()
df_list = list()

# read every csv file in the folder and collect the dataframes in a list
for csv_file in file_list:
    df_list.append(pd.read_csv(dir_path+"/"+csv_file , encoding='UTF8'))
    name_list.append(csv_file.split('.')[0])
for i in range(len(df_list)):
    df_us = df_list[i] # dataframe for one maturity
    df_us=df_us.sort_index(ascending=False) # the dates are in reverse order, so sort them
    df_us.reset_index(drop=True, inplace=True) # reset the index
    df_us = df_us[['날짜','종가']].copy() # keep the date ('날짜') and closing value ('종가') columns
    df_us.columns = ['us_date',name_list[i]]
    df_us['us_date'] = pd.to_datetime(df_us['us_date'])
    df_final=pd.merge(df_final, df_us, left_on='date', right_on='us_date', how='left')
    df_final[name_list[i]]=df_final[name_list[i]].ffill() # fill holidays in between with the previous value
    df_final[name_list[i]]=df_final[name_list[i]].bfill() # fill the topmost rows with the nearest following value
    df_final = df_final.drop(['us_date'], axis=1)

In [None]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4383 entries, 0 to 4382
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   date             4383 non-null   datetime64[ns]
 1   year             4383 non-null   int64         
 2   month            4383 non-null   int64         
 3   day              4383 non-null   int64         
 4   apartment_index  4383 non-null   float64       
 5   kospi_index      4383 non-null   float64       
 6   korea_rp         4383 non-null   float64       
 7   korea_3_year     4383 non-null   float64       
 8   korea_10_year    4383 non-null   float64       
 9   us_10_year       4383 non-null   float64       
 10  us_1_month       4383 non-null   float64       
 11  us_2_year        4383 non-null   float64       
 12  us_30_year       4383 non-null   float64       
 13  us_3_month       4383 non-null   float64       
 14  us_3_year        4383 non-null   float64

In [None]:
df_final = df_final[['date','year','month','day','apartment_index','kospi_index','korea_rp',
                    'korea_3_year','korea_10_year','us_1_month','us_3_month',
                    'us_6_month','us_2_year', 'us_3_year', 'us_5_year',
                    'us_10_year','us_30_year']]

In [None]:
df_final.head()

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,us_1_month,us_3_month,us_6_month,us_2_year,us_3_year,us_5_year,us_10_year,us_30_year
0,2011-01-01,2011,1,1,93.0,2051.0,2.5,3.44,4.57,0.096,0.124,0.183,0.601,1.006,2.011,3.334,4.401
1,2011-01-02,2011,1,2,93.0,2051.0,2.5,3.44,4.57,0.096,0.124,0.183,0.601,1.006,2.011,3.334,4.401
2,2011-01-03,2011,1,3,93.0,2070.08,2.5,3.44,4.57,0.096,0.124,0.183,0.601,1.006,2.011,3.334,4.401
3,2011-01-04,2011,1,4,93.0,2085.14,2.5,3.495,4.58,0.106,0.142,0.187,0.621,1.026,2.016,3.338,4.422
4,2011-01-05,2011,1,5,93.0,2082.55,2.5,3.495,4.63,0.129,0.142,0.184,0.708,1.129,2.133,3.463,4.541


### Comparing US Treasury yields with the real-estate index

In [None]:
fig = go.Figure()

# plot every US Treasury yield on the first y-axis
rate_columns = ['us_1_month', 'us_3_month', 'us_6_month', 'us_2_year',
                'us_3_year', 'us_5_year', 'us_10_year', 'us_30_year']
for column in rate_columns:
    fig.add_trace(go.Scatter(x=df_final['date'], y=df_final[column],
                        mode='lines',
                        name=column,yaxis='y1'))

# flip the axis used by the traces above
fig.update_layout(
    yaxis = dict(autorange="reversed")
)
fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['apartment_index'],
                    mode='lines',
                    name='apartment_index',
                    yaxis="y2"))
fig.update_layout(

   # create first Y-axis
   yaxis=dict(
      title="rate index",
      titlefont=dict(color="black"),
      tickfont=dict(color="black")
   ),

   # create second Y-axis
   yaxis2=dict(
      title="apartment index",
      overlaying="y",
      side="right")
)

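- The (inverted) US Treasury yields seem somewhat more related to the real-estate index than the (inverted) Korean government bond yields, though this is not certain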

In [None]:
# the yields move roughly together, so keep only the 3-month, 2-year, and 10-year Treasuries
df_final = df_final[['date','year','month','day','apartment_index','kospi_index','korea_rp',
                    'korea_3_year','korea_10_year','us_3_month', 'us_2_year', 'us_10_year']]
df_final.head()

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,us_10_year
0,2011-01-01,2011,1,1,93.0,2051.0,2.5,3.44,4.57,0.124,0.601,3.334
1,2011-01-02,2011,1,2,93.0,2051.0,2.5,3.44,4.57,0.124,0.601,3.334
2,2011-01-03,2011,1,3,93.0,2070.08,2.5,3.44,4.57,0.124,0.601,3.334
3,2011-01-04,2011,1,4,93.0,2085.14,2.5,3.495,4.58,0.142,0.621,3.338
4,2011-01-05,2011,1,5,93.0,2082.55,2.5,3.495,4.63,0.142,0.708,3.463


## Obtaining monthly data by grouping

In [None]:
df_final = df_final.groupby(['year','month']).agg({
    'apartment_index': 'mean', 'kospi_index': 'mean', 'korea_rp': 'mean',
    'korea_3_year': 'mean', 'korea_10_year': 'mean',
    'us_3_month': 'mean', 'us_2_year': 'mean', 'us_10_year': 'mean'})
df_final.reset_index(inplace= True)
df_final

Unnamed: 0,year,month,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,us_10_year
0,2011,1,93.0,2089.359032,2.653226,3.687742,4.681290,0.148097,0.598097,3.357290
1,2011,2,93.0,2011.301786,2.750000,3.939286,4.745714,0.127536,0.762071,3.565429
2,2011,3,93.0,1999.638710,2.927419,3.745968,4.518387,0.095839,0.679452,3.404226
3,2011,4,93.0,2152.758000,3.000000,3.747000,4.483333,0.056200,0.719767,3.434600
4,2011,5,93.0,2126.069355,3.000000,3.673710,4.347742,0.035581,0.536161,3.152774
...,...,...,...,...,...,...,...,...,...,...
139,2022,8,104.4,2485.917097,2.306452,3.246419,3.325935,2.662774,3.262348,2.904935
140,2022,9,104.4,2346.002333,2.500000,3.842233,3.852467,3.139330,3.828267,3.486200
141,2022,10,104.4,2223.492903,2.822581,4.235484,4.252290,3.734961,4.382039,3.979645
142,2022,11,104.4,2420.471667,3.058333,3.882467,3.907033,4.238337,4.501560,3.873067
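
When every column uses the same aggregation, the grouping step can also be written by selecting the value columns and calling `.mean()` once. This is a minimal sketch on a synthetic frame; `daily` and `value_columns` are hypothetical names and the numbers are illustrative.

```python
import pandas as pd

# Synthetic daily data spanning a month boundary (illustrative values)
daily = pd.DataFrame({
    "date": pd.date_range("2011-01-30", periods=4, freq="D"),
    "kospi_index": [2100.0, 2110.0, 2000.0, 2020.0],
})
daily["year"] = daily["date"].dt.year
daily["month"] = daily["date"].dt.month

# One mean per (year, month); as_index=False keeps the keys as ordinary
# columns, so no separate reset_index call is needed
value_columns = ["kospi_index"]
monthly = daily.groupby(["year", "month"], as_index=False)[value_columns].mean()
print(monthly)
```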



## Adding yield-spread columns

In [None]:
# add the yield-spread columns (long maturity minus short maturity)
df_final['korea_10-3_year'] = df_final['korea_10_year'] - df_final['korea_3_year']
df_final['us_10-2_year'] = df_final['us_10_year'] - df_final['us_2_year']
df_final['us_10-3_year_month'] = df_final['us_10_year'] - df_final['us_3_month']
df_final

Unnamed: 0,year,month,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,us_10_year,korea_10-3_year,us_10-2_year,us_10-3_year_month
0,2011,1,93.0,2089.359032,2.653226,3.687742,4.681290,0.148097,0.598097,3.357290,0.993548,2.759194,3.209194
1,2011,2,93.0,2011.301786,2.750000,3.939286,4.745714,0.127536,0.762071,3.565429,0.806429,2.803357,3.437893
2,2011,3,93.0,1999.638710,2.927419,3.745968,4.518387,0.095839,0.679452,3.404226,0.772419,2.724774,3.308387
3,2011,4,93.0,2152.758000,3.000000,3.747000,4.483333,0.056200,0.719767,3.434600,0.736333,2.714833,3.378400
4,2011,5,93.0,2126.069355,3.000000,3.673710,4.347742,0.035581,0.536161,3.152774,0.674032,2.616613,3.117194
...,...,...,...,...,...,...,...,...,...,...,...,...,...
139,2022,8,104.4,2485.917097,2.306452,3.246419,3.325935,2.662774,3.262348,2.904935,0.079516,-0.357413,0.242161
140,2022,9,104.4,2346.002333,2.500000,3.842233,3.852467,3.139330,3.828267,3.486200,0.010233,-0.342067,0.346870
141,2022,10,104.4,2223.492903,2.822581,4.235484,4.252290,3.734961,4.382039,3.979645,0.016806,-0.402394,0.244684
142,2022,11,104.4,2420.471667,3.058333,3.882467,3.907033,4.238337,4.501560,3.873067,0.024567,-0.628493,-0.365270


## Building the apartment supply dataframe

- Apartment supply figures were obtained from https://asil.kr/asil/sub/movein.jsp

In [None]:
# load the txt file
df_apartment_supply = pd.read_csv("/content/drive/MyDrive/house_price/original_data/apartment_supply.txt",  encoding='UTF8',sep="\t")
df_apartment_supply.head()

Unnamed: 0,위치,단지명,입주년월,총세대수
0,서울 서대문구 홍은동,e편한세상홍제가든플라츠,2022년 12월,481세대
1,서울 서초구 잠원동,르엘신반포,2022년 12월,280세대
2,서울 마포구 아현동,마포더클래시,2022년 12월,"1,419세대"
3,서울 중랑구 면목동,"봄작시티201(민간임대,도시형)",2022년 12월,128세대
4,서울 서대문구 홍은동,힐스테이트홍은포레스트,2022년 11월,623세대


In [None]:
df_apartment_supply.info() # check the dataframe info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1003 entries, 0 to 1002
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   위치      1003 non-null   object
 1   단지명     1003 non-null   object
 2   입주년월    1003 non-null   object
 3   총세대수    1003 non-null   object
dtypes: object(4)
memory usage: 31.5+ KB


In [None]:
# create the year and month columns
# split the move-in date ('입주년월') on ' ' to build them
df_apartment_supply['year'] =df_apartment_supply['입주년월'].str.split(' ',expand=True)[0]
df_apartment_supply['month'] =df_apartment_supply['입주년월'].str.split(' ',expand=True)[1]

In [None]:
df_apartment_supply.head()

Unnamed: 0,위치,단지명,입주년월,총세대수,year,month
0,서울 서대문구 홍은동,e편한세상홍제가든플라츠,2022년 12월,481세대,2022년,12월
1,서울 서초구 잠원동,르엘신반포,2022년 12월,280세대,2022년,12월
2,서울 마포구 아현동,마포더클래시,2022년 12월,"1,419세대",2022년,12월
3,서울 중랑구 면목동,"봄작시티201(민간임대,도시형)",2022년 12월,128세대,2022년,12월
4,서울 서대문구 홍은동,힐스테이트홍은포레스트,2022년 11월,623세대,2022년,11월


In [None]:
df_apartment_supply.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1003 entries, 0 to 1002
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   위치      1003 non-null   object
 1   단지명     1003 non-null   object
 2   입주년월    1003 non-null   object
 3   총세대수    1003 non-null   object
 4   year    1003 non-null   object
 5   month   1003 non-null   object
dtypes: object(6)
memory usage: 47.1+ KB


In [None]:
# strip specific characters from the strings
# so that the columns are easier to use in later calculations
df_apartment_supply["year"] = df_apartment_supply["year"].str.replace("년", "")
df_apartment_supply["month"] = df_apartment_supply["month"].str.replace("월", "")
df_apartment_supply["apartment_supply"] = df_apartment_supply["총세대수"].str.replace("세대", "")
df_apartment_supply["apartment_supply"] = df_apartment_supply["apartment_supply"].str.replace(",", "")
df_apartment_supply.head()

Unnamed: 0,위치,단지명,입주년월,총세대수,year,month,apartment_supply
0,서울 서대문구 홍은동,e편한세상홍제가든플라츠,2022년 12월,481세대,2022,12,481
1,서울 서초구 잠원동,르엘신반포,2022년 12월,280세대,2022,12,280
2,서울 마포구 아현동,마포더클래시,2022년 12월,"1,419세대",2022,12,1419
3,서울 중랑구 면목동,"봄작시티201(민간임대,도시형)",2022년 12월,128세대,2022,12,128
4,서울 서대문구 홍은동,힐스테이트홍은포레스트,2022년 11월,623세대,2022,11,623
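
The split-and-replace steps above can also be expressed with regular expressions, which convert the values to integers in the same pass. This is a minimal sketch, not the notebook's method; `raw` and `parts` are hypothetical names and the two rows are copied from the output above.

```python
import pandas as pd

raw = pd.DataFrame({
    "입주년월": ["2022년 12월", "2022년 11월"],   # move-in year and month
    "총세대수": ["1,419세대", "623세대"],          # total number of households
})

# Named groups become the column names of the extracted frame
parts = raw["입주년월"].str.extract(r"(?P<year>\d{4})년\s*(?P<month>\d{1,2})월")
raw["year"] = parts["year"].astype("int64")
raw["month"] = parts["month"].astype("int64")

# Drop everything that is not a digit (thousands separator and the '세대' suffix)
raw["apartment_supply"] = (
    raw["총세대수"].str.replace(r"[^\d]", "", regex=True).astype("int64")
)
print(raw[["year", "month", "apartment_supply"]])
```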


In [None]:
df_apartment_supply.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1003 entries, 0 to 1002
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   위치                1003 non-null   object
 1   단지명               1003 non-null   object
 2   입주년월              1003 non-null   object
 3   총세대수              1003 non-null   object
 4   year              1003 non-null   object
 5   month             1003 non-null   object
 6   apartment_supply  1003 non-null   object
dtypes: object(7)
memory usage: 55.0+ KB


In [None]:
# create the date column
df_apartment_supply['date'] = pd.to_datetime(df_apartment_supply['year']+'-'+df_apartment_supply['month'], format="%Y-%m")

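- Assume that each month's figures are published in the following month (for example, the transaction figures for January 2011 cannot be known during January 2011; they only become available in February, once January's results have been compiled)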

In [None]:
# assume the figure is announced in the following month
df_apartment_supply['date_column'] = df_apartment_supply['date'] + datetime.timedelta(days=32)
df_apartment_supply['announcement_year'] = df_apartment_supply['date_column'].dt.year
df_apartment_supply['announcement_month'] = df_apartment_supply['date_column'].dt.month

In [None]:
df_apartment_supply.head()

Unnamed: 0,위치,단지명,입주년월,총세대수,year,month,apartment_supply,date,date_column,announcement_year,announcement_month
0,서울 서대문구 홍은동,e편한세상홍제가든플라츠,2022년 12월,481세대,2022,12,481,2022-12-01,2023-01-02,2023,1
1,서울 서초구 잠원동,르엘신반포,2022년 12월,280세대,2022,12,280,2022-12-01,2023-01-02,2023,1
2,서울 마포구 아현동,마포더클래시,2022년 12월,"1,419세대",2022,12,1419,2022-12-01,2023-01-02,2023,1
3,서울 중랑구 면목동,"봄작시티201(민간임대,도시형)",2022년 12월,128세대,2022,12,128,2022-12-01,2023-01-02,2023,1
4,서울 서대문구 홍은동,힐스테이트홍은포레스트,2022년 11월,623세대,2022,11,623,2022-11-01,2022-12-03,2022,12
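
Adding 32 days to the first day of a month always lands in the next month, which is why the shift above works. A calendar-aware alternative is `pd.DateOffset(months=1)`. The sketch below compares the two on a few hypothetical dates; `base`, `by_days`, and `by_offset` are illustrative names.

```python
import pandas as pd

# First-of-month dates, as produced by format="%Y-%m"
base = pd.Series(pd.to_datetime(["2011-01-01", "2011-02-01", "2022-12-01"]))

by_days = base + pd.Timedelta(days=32)      # fixed 32-day shift
by_offset = base + pd.DateOffset(months=1)  # exactly one calendar month

comparison = pd.DataFrame({
    "base": base,
    "by_days": by_days,
    "by_offset": by_offset,
})
print(comparison)

# Both approaches agree on the announcement year and month
same_month = (by_days.dt.month == by_offset.dt.month).all()
same_year = (by_days.dt.year == by_offset.dt.year).all()
```

The fixed-day shift is only safe because every date here is the first of a month. With arbitrary days of the month, 32 days could skip past the following month, while the month offset would not.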


In [None]:
# keep only the columns to use, then change the type
df_apartment_supply = df_apartment_supply[['announcement_year','announcement_month','apartment_supply']]
df_apartment_supply = df_apartment_supply.astype({'apartment_supply': 'int64'})
df_apartment_supply.head()

Unnamed: 0,announcement_year,announcement_month,apartment_supply
0,2023,1,481
1,2023,1,280
2,2023,1,1419
3,2023,1,128
4,2022,12,623


In [None]:
df_apartment_supply.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1003 entries, 0 to 1002
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   announcement_year   1003 non-null   int64
 1   announcement_month  1003 non-null   int64
 2   apartment_supply    1003 non-null   int64
dtypes: int64(3)
memory usage: 23.6 KB


In [None]:
# sum the supply per year and month with groupby, then turn the keys back into columns with reset_index
df_apartment_supply=df_apartment_supply.groupby(['announcement_year','announcement_month'])['apartment_supply'].agg('sum')
df_apartment_supply = df_apartment_supply.reset_index(['announcement_year','announcement_month'])
df_apartment_supply.head()

Unnamed: 0,announcement_year,announcement_month,apartment_supply
0,2011,2,5342
1,2011,3,3494
2,2011,4,1511
3,2011,5,709
4,2011,6,1507


## Building the unsold-apartment dataframe

- Unsold-apartment figures were obtained from https://data.kbland.kr/publicdata/unsold-apartments

In [None]:
df_apartment_unsold = pd.read_excel("/content/drive/MyDrive/house_price/original_data/unsold/서울 미분양 현황.xlsx")
df_apartment_unsold.index = df_apartment_unsold['구분']
df_apartment_unsold=df_apartment_unsold.drop('구분',axis=1)
df_apartment_unsold.head()

Unnamed: 0_level_0,'07.01,'07.02,'07.03,'07.04,'07.05,'07.06,'07.07,'07.08,'07.09,'07.10,...,'22.02,'22.03,'22.04,'22.05,'22.06,'22.07,'22.08,'22.09,'22.10,'22.11
구분,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
미분양,697,590.0,687.0,685.0,704.0,778.0,840.0,730.0,724.0,977.0,...,47,180.0,360,688.0,719.0,592.0,610.0,719.0,866.0,865.0
변동률,-,-15.35,16.44,-0.29,2.77,10.51,7.97,-13.1,-0.82,34.94,...,0,282.98,100,91.11,4.51,-17.66,3.04,17.87,20.45,-0.12


In [None]:
# swap rows and columns with the T attribute
df_apartment_unsold=df_apartment_unsold.T
df_apartment_unsold.head()

구분,미분양,변동률
'07.01,697.0,-
'07.02,590.0,-15.35
'07.03,687.0,16.44
'07.04,685.0,-0.29
'07.05,704.0,2.77


In [None]:
df_apartment_unsold.info()

<class 'pandas.core.frame.DataFrame'>
Index: 191 entries, '07.01 to '22.11
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   미분양     191 non-null    object
 1   변동률     191 non-null    object
dtypes: object(2)
memory usage: 8.5+ KB


In [None]:
# the index holds the date information, so use reset_index to turn it into a column
df_apartment_unsold = df_apartment_unsold.reset_index()
df_apartment_unsold.head()

구분,index,미분양,변동률
0,'07.01,697.0,-
1,'07.02,590.0,-15.35
2,'07.03,687.0,16.44
3,'07.04,685.0,-0.29
4,'07.05,704.0,2.77


In [None]:
# rename the columns
df_apartment_unsold.columns=['year_month','unsold_count','ratio']
df_apartment_unsold.head()

Unnamed: 0,year_month,unsold_count,ratio
0,'07.01,697.0,-
1,'07.02,590.0,-15.35
2,'07.03,687.0,16.44
3,'07.04,685.0,-0.29
4,'07.05,704.0,2.77


In [None]:
# remove the ' character from the year_month column
df_apartment_unsold["year_month"] = df_apartment_unsold["year_month"].str.replace("'", "")
df_apartment_unsold.head()

Unnamed: 0,year_month,unsold_count,ratio
0,07.01,697.0,-
1,07.02,590.0,-15.35
2,07.03,687.0,16.44
3,07.04,685.0,-0.29
4,07.05,704.0,2.77


In [None]:
# create the year and month columns
df_apartment_unsold['year'] =df_apartment_unsold["year_month"].str.split('.',expand=True)[0]
df_apartment_unsold['month'] =df_apartment_unsold["year_month"].str.split('.',expand=True)[1]
df_apartment_unsold.head()

Unnamed: 0,year_month,unsold_count,ratio,year,month
0,07.01,697.0,-,07,01
1,07.02,590.0,-15.35,07,02
2,07.03,687.0,16.44,07,03
3,07.04,685.0,-0.29,07,04
4,07.05,704.0,2.77,07,05


In [None]:
# fix the year column and select the columns to use
df_apartment_unsold['year'] = '20'+df_apartment_unsold['year']
df_apartment_unsold = df_apartment_unsold[['year','month','unsold_count']]
df_apartment_unsold.head()

Unnamed: 0,year,month,unsold_count
0,2007,01,697.0
1,2007,02,590.0
2,2007,03,687.0
3,2007,04,685.0
4,2007,05,704.0


In [None]:
df_apartment_unsold.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 191 entries, 0 to 190
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   year          191 non-null    object
 1   month         191 non-null    object
 2   unsold_count  191 non-null    object
dtypes: object(3)
memory usage: 4.6+ KB
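
The quote removal, split, and `'20'` prefix steps can also be collapsed into one `pd.to_datetime` call with an explicit format. This is a minimal sketch under the assumption that every label has the form `'YY.MM`, as in the header row above; `labels` and `parsed` are hypothetical names.

```python
import pandas as pd

# Column labels as they appear in the spreadsheet header
labels = pd.Series(["'07.01", "'07.02", "'22.11"])

# The leading apostrophe and the dot are matched literally;
# %y reads the two-digit year and %m the month
parsed = pd.to_datetime(labels, format="'%y.%m")

result = pd.DataFrame({
    "label": labels,
    "year": parsed.dt.year,
    "month": parsed.dt.month,
})
print(result)
```

All years in this file fall between 2007 and 2022, so the two-digit years resolve to 20xx here.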


In [None]:
# assume the unsold figures only become known one month later
df_apartment_unsold['date'] = pd.to_datetime(df_apartment_unsold['year']+'-'+df_apartment_unsold['month'], format="%Y-%m")
df_apartment_unsold['date_column'] = df_apartment_unsold['date'] + datetime.timedelta(days=32)
df_apartment_unsold['announcement_year'] = df_apartment_unsold['date_column'].dt.year
df_apartment_unsold['announcement_month'] = df_apartment_unsold['date_column'].dt.month
df_apartment_unsold = df_apartment_unsold[['announcement_year','announcement_month','unsold_count']]
df_apartment_unsold = df_apartment_unsold.astype({'unsold_count': 'int64'})
df_apartment_unsold.head()

Unnamed: 0,announcement_year,announcement_month,unsold_count
0,2007,2,697
1,2007,3,590
2,2007,4,687
3,2007,5,685
4,2007,6,704


In [None]:
# keep only the range of years to use
df_apartment_unsold=df_apartment_unsold[df_apartment_unsold['announcement_year']>=2011]
df_apartment_unsold.head()

Unnamed: 0,announcement_year,announcement_month,unsold_count
47,2011,1,2729
48,2011,2,2269
49,2011,3,2216
50,2011,4,2104
51,2011,5,1855


In [None]:
df_apartment_unsold.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 144 entries, 47 to 190
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   announcement_year   144 non-null    int64
 1   announcement_month  144 non-null    int64
 2   unsold_count        144 non-null    int64
dtypes: int64(3)
memory usage: 4.5 KB


## Merging the apartment supply and unsold dataframes

In [None]:
df_apartment_supply.tail()

Unnamed: 0,announcement_year,announcement_month,apartment_supply
139,2022,9,1853
140,2022,10,1552
141,2022,11,1265
142,2022,12,1759
143,2023,1,2308


In [None]:
df_apartment_supply.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144 entries, 0 to 143
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   announcement_year   144 non-null    int64
 1   announcement_month  144 non-null    int64
 2   apartment_supply    144 non-null    int64
dtypes: int64(3)
memory usage: 3.5 KB


In [None]:
df_apartment_unsold.tail()

Unnamed: 0,announcement_year,announcement_month,unsold_count
186,2022,8,592
187,2022,9,610
188,2022,10,719
189,2022,11,866
190,2022,12,865


In [None]:
df_apartment_unsold.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 144 entries, 47 to 190
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   announcement_year   144 non-null    int64
 1   announcement_month  144 non-null    int64
 2   unsold_count        144 non-null    int64
dtypes: int64(3)
memory usage: 4.5 KB


In [None]:
# merge the dataframes
df_apartment_supply_unsold=pd.merge(df_apartment_supply, df_apartment_unsold, on=['announcement_year','announcement_month'], how='inner')
df_apartment_supply_unsold.tail()

Unnamed: 0,announcement_year,announcement_month,apartment_supply,unsold_count
138,2022,8,1736,592
139,2022,9,1853,610
140,2022,10,1552,719
141,2022,11,1265,866
142,2022,12,1759,865


In [None]:
df_apartment_supply_unsold.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 143 entries, 0 to 142
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   announcement_year   143 non-null    int64
 1   announcement_month  143 non-null    int64
 2   apartment_supply    143 non-null    int64
 3   unsold_count        143 non-null    int64
dtypes: int64(4)
memory usage: 5.6 KB
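
Both inputs have 144 rows but the inner merge keeps 143, so one month from each side has no partner. An outer merge with `indicator=True` shows which keys were dropped. The sketch below uses trimmed frames whose values are copied from the outputs above; `supply`, `unsold`, `check`, and `unmatched` are hypothetical names.

```python
import pandas as pd

# Trimmed samples: supply runs 2011-02 .. 2023-01, unsold runs 2011-01 .. 2022-12
supply = pd.DataFrame({
    "announcement_year": [2011, 2011, 2023],
    "announcement_month": [2, 3, 1],
    "apartment_supply": [5342, 3494, 2308],
})
unsold = pd.DataFrame({
    "announcement_year": [2011, 2011, 2011],
    "announcement_month": [1, 2, 3],
    "unsold_count": [2729, 2269, 2216],
})

keys = ["announcement_year", "announcement_month"]

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
check = supply.merge(unsold, on=keys, how="outer", indicator=True)

# Rows that an inner merge would silently drop
unmatched = check.loc[check["_merge"] != "both", keys + ["_merge"]]
print(unmatched)
```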


### Adding the unsold-ratio column

In [None]:
# compute the unsold ratio (unsold count as a percentage of that month's supply)
df_apartment_supply_unsold['unsold_ratio'] = 100*(df_apartment_supply_unsold['unsold_count'] / df_apartment_supply_unsold['apartment_supply'])
df_apartment_supply_unsold.head()

Unnamed: 0,announcement_year,announcement_month,apartment_supply,unsold_count,unsold_ratio
0,2011,2,5342,2269,42.474729
1,2011,3,3494,2216,63.423011
2,2011,4,1511,2104,139.245533
3,2011,5,709,1855,261.636107
4,2011,6,1507,1785,118.447246


In [None]:
df_apartment_supply_unsold.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 143 entries, 0 to 142
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   announcement_year   143 non-null    int64  
 1   announcement_month  143 non-null    int64  
 2   apartment_supply    143 non-null    int64  
 3   unsold_count        143 non-null    int64  
 4   unsold_ratio        143 non-null    float64
dtypes: float64(1), int64(4)
memory usage: 6.7 KB


## Merging into the final table

In [None]:
# merge the data
df_final=pd.merge(df_final, df_apartment_supply_unsold, left_on=['year','month'], right_on=['announcement_year','announcement_month'], how='left')
df_final = df_final.drop(["announcement_year", "announcement_month"], axis=1)
df_final.head()

Unnamed: 0,year,month,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,us_10_year,korea_10-3_year,us_10-2_year,us_10-3_year_month,apartment_supply,unsold_count,unsold_ratio
0,2011,1,93.0,2089.359032,2.653226,3.687742,4.68129,0.148097,0.598097,3.35729,0.993548,2.759194,3.209194,,,
1,2011,2,93.0,2011.301786,2.75,3.939286,4.745714,0.127536,0.762071,3.565429,0.806429,2.803357,3.437893,5342.0,2269.0,42.474729
2,2011,3,93.0,1999.63871,2.927419,3.745968,4.518387,0.095839,0.679452,3.404226,0.772419,2.724774,3.308387,3494.0,2216.0,63.423011
3,2011,4,93.0,2152.758,3.0,3.747,4.483333,0.0562,0.719767,3.4346,0.736333,2.714833,3.3784,1511.0,2104.0,139.245533
4,2011,5,93.0,2126.069355,3.0,3.67371,4.347742,0.035581,0.536161,3.152774,0.674032,2.616613,3.117194,709.0,1855.0,261.636107


In [None]:
df_final.info() # check the dataframe info

<class 'pandas.core.frame.DataFrame'>
Int64Index: 144 entries, 0 to 143
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   year                144 non-null    int64  
 1   month               144 non-null    int64  
 2   apartment_index     144 non-null    float64
 3   kospi_index         144 non-null    float64
 4   korea_rp            144 non-null    float64
 5   korea_3_year        144 non-null    float64
 6   korea_10_year       144 non-null    float64
 7   us_3_month          144 non-null    float64
 8   us_2_year           144 non-null    float64
 9   us_10_year          144 non-null    float64
 10  korea_10-3_year     144 non-null    float64
 11  us_10-2_year        144 non-null    float64
 12  us_10-3_year_month  144 non-null    float64
 13  apartment_supply    143 non-null    float64
 14  unsold_count        143 non-null    float64
 15  unsold_ratio        143 non-null    float64
dtypes: float

In [None]:
df_final.isnull().sum() # check for null data

year                  0
month                 0
apartment_index       0
kospi_index           0
korea_rp              0
korea_3_year          0
korea_10_year         0
us_3_month            0
us_2_year             0
us_10_year            0
korea_10-3_year       0
us_10-2_year          0
us_10-3_year_month    0
apartment_supply      1
unsold_count          1
unsold_ratio          1
dtype: int64

In [None]:
df_final = df_final.dropna(subset=['apartment_supply']) # drop the row where this column is null
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 143 entries, 1 to 143
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   year                143 non-null    int64  
 1   month               143 non-null    int64  
 2   apartment_index     143 non-null    float64
 3   kospi_index         143 non-null    float64
 4   korea_rp            143 non-null    float64
 5   korea_3_year        143 non-null    float64
 6   korea_10_year       143 non-null    float64
 7   us_3_month          143 non-null    float64
 8   us_2_year           143 non-null    float64
 9   us_10_year          143 non-null    float64
 10  korea_10-3_year     143 non-null    float64
 11  us_10-2_year        143 non-null    float64
 12  us_10-3_year_month  143 non-null    float64
 13  apartment_supply    143 non-null    float64
 14  unsold_count        143 non-null    float64
 15  unsold_ratio        143 non-null    float64
dtypes: float

In [None]:
df_final.to_csv('/content/drive/MyDrive/house_price/after_data/economic_data.csv',index=False)

# Creating the economic_data2 file


In [None]:
import pandas as pd
import numpy as np
# Load the data
df_deal = pd.read_csv("/content/drive/MyDrive/house_price/after_data/apartment_deal.csv",  encoding='UTF8')
df_month_rent = pd.read_csv("/content/drive/MyDrive/house_price/after_data/apartment_month_rent.csv",  encoding='UTF8')
df_full_rent = pd.read_csv("/content/drive/MyDrive/house_price/after_data/apartment_full_rent.csv",  encoding='UTF8')
df_economic = pd.read_csv("/content/drive/MyDrive/house_price/after_data/economic_data.csv",  encoding='UTF8')

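- The economic_data2 file adds monthly apartment transaction counts (sales, jeonse leases, and monthly rentals) to the economic_data file
- 'Apartment transactions' here covers apartment sales, jeonse (lump-sum deposit) leases, and monthly rentals combined
- The monthly transaction count is the total number of apartment transactions concluded in Seoul in that month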

## Building the apartment sales count DataFrame

In [None]:
df_deal.head()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,deal_price
0,2011-07-09,2011,7,9,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,64000
1,2011-07-28,2011,7,28,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,65500
2,2011-01-19,2011,1,19,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,70500
3,2011-09-02,2011,9,2,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,85000
4,2011-12-17,2011,12,17,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,68000


In [None]:
df_deal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 882185 entries, 0 to 882184
Data columns (total 12 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   date        882185 non-null  object 
 1   year        882185 non-null  int64  
 2   month       882185 non-null  int64  
 3   day         882185 non-null  int64  
 4   address_0   882185 non-null  object 
 5   address_1   882185 non-null  object 
 6   address_2   882185 non-null  object 
 7   address_3   882185 non-null  float64
 8   address_4   882185 non-null  float64
 9   name        882185 non-null  object 
 10  area        882185 non-null  float64
 11  deal_price  882185 non-null  int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 80.8+ MB


In [None]:
# Compute the monthly Seoul apartment sales count with groupby
df_count = df_deal.groupby(["year","month"])["name"].agg('count').copy()
df_count = df_count.reset_index(["year","month"]) # turn the index levels back into columns
df_count.columns = ["year","month","deal_count"] # rename the columns
df_count

Unnamed: 0,year,month,deal_count
0,2011,1,7179
1,2011,2,6026
2,2011,3,5419
3,2011,4,4028
4,2011,5,3836
...,...,...,...
139,2022,8,760
140,2022,9,649
141,2022,10,574
142,2022,11,750


In [None]:
df_count.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144 entries, 0 to 143
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   year        144 non-null    int64
 1   month       144 non-null    int64
 2   deal_count  144 non-null    int64
dtypes: int64(3)
memory usage: 3.5 KB
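The count above relies on `name` being non-null in every row, which holds for `df_deal` according to its `info()` output. As a minimal sketch on a toy frame (the rows below are hypothetical, not the real data), `size()` counts rows per group regardless of missing values and lets the new column be named in one step:

```python
import pandas as pd

# Toy stand-in for df_deal (hypothetical rows, not the real data)
toy_deal = pd.DataFrame({
    "year":  [2011, 2011, 2011, 2011, 2012],
    "month": [1, 1, 2, 2, 1],
    "name":  ["A", "B", "A", None, "C"],
})

# size() counts rows per (year, month), including rows where `name` is missing
toy_count = (
    toy_deal.groupby(["year", "month"])
    .size()
    .reset_index(name="deal_count")
)
print(toy_count)
```

With `agg('count')` on `name`, the February 2011 group in this toy frame would be counted as 1 instead of 2, because the missing name is skipped.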


## Adding the apartment jeonse lease count DataFrame

- Same procedure as in the apartment sales count section

In [None]:
df_full_rent.head()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,full_rent_price
0,2011-01-05,2011,1,5,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,35000
1,2011-01-18,2011,1,18,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,20000
2,2011-02-01,2011,2,1,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,24000
3,2011-02-11,2011,2,11,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,31000
4,2011-02-24,2011,2,24,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,30500


In [None]:
df_temp = df_full_rent.groupby(["year","month"])["name"].agg('count').copy()
df_temp = df_temp.reset_index(["year","month"])
df_temp.columns = ["year","month","full_rent_count"]
df_temp

Unnamed: 0,year,month,full_rent_count
0,2011,1,12336
1,2011,2,12261
2,2011,3,12121
3,2011,4,9754
4,2011,5,9280
...,...,...,...
139,2022,8,11341
140,2022,9,10258
141,2022,10,10559
142,2022,11,8890


In [None]:
# Merge the apartment sales count DataFrame with the jeonse lease count DataFrame
df_count=pd.merge(df_count,df_temp, left_on=["year","month"], right_on=["year","month"], how="inner")
df_count.head()

Unnamed: 0,year,month,deal_count,full_rent_count
0,2011,1,7179,12336
1,2011,2,6026,12261
2,2011,3,5419,12121
3,2011,4,4028,9754
4,2011,5,3836,9280


In [None]:
df_count.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 144 entries, 0 to 143
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   year             144 non-null    int64
 1   month            144 non-null    int64
 2   deal_count       144 non-null    int64
 3   full_rent_count  144 non-null    int64
dtypes: int64(4)
memory usage: 5.6 KB


## Adding the apartment monthly rent count DataFrame

- Same procedure as for the apartment sales count DataFrame

In [None]:
df_month_rent.head()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,rent_deposit,month_rent_price
0,2011-03-18,2011,3,18,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,19000,63
1,2011-04-09,2011,4,9,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,21000,35
2,2011-07-09,2011,7,9,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,3000,160
3,2011-09-19,2011,9,19,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,6000,140
4,2011-09-20,2011,9,20,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,5000,160


In [None]:
df_temp = df_month_rent.groupby(["year","month"])["name"].agg('count').copy()
df_temp = df_temp.reset_index(["year","month"])
df_temp.columns = ["year","month","month_rent_count"]
df_temp

Unnamed: 0,year,month,month_rent_count
0,2011,1,2514
1,2011,2,2711
2,2011,3,2775
3,2011,4,2210
4,2011,5,2168
...,...,...,...
139,2022,8,7415
140,2022,9,7793
141,2022,10,7694
142,2022,11,7709


In [None]:
# Merge in the apartment monthly rent count DataFrame
df_count=pd.merge(df_count,df_temp, left_on=["year","month"], right_on=["year","month"], how="inner")
df_count.head()

Unnamed: 0,year,month,deal_count,full_rent_count,month_rent_count
0,2011,1,7179,12336,2514
1,2011,2,6026,12261,2711
2,2011,3,5419,12121,2775
3,2011,4,4028,9754,2210
4,2011,5,3836,9280,2168


In [None]:
df_count.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 144 entries, 0 to 143
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   year              144 non-null    int64
 1   month             144 non-null    int64
 2   deal_count        144 non-null    int64
 3   full_rent_count   144 non-null    int64
 4   month_rent_count  144 non-null    int64
dtypes: int64(5)
memory usage: 6.8 KB


## Shifting the monthly values

- A month's transaction count is only known the following month, so each value is shifted down one row (delayed by one month)

In [None]:
df_count['deal_count'] = df_count['deal_count'].shift(1)
df_count['month_rent_count'] = df_count['month_rent_count'].shift(1)
df_count['full_rent_count'] = df_count['full_rent_count'].shift(1)
# Rename the columns
df_count.columns = ['year','month','last_month_total_deal_count','last_month_total_full_rent_count', 'last_month_month_total_rent_count']
df_count

Unnamed: 0,year,month,last_month_total_deal_count,last_month_total_full_rent_count,last_month_month_total_rent_count
0,2011,1,,,
1,2011,2,7179.0,12336.0,2514.0
2,2011,3,6026.0,12261.0,2711.0
3,2011,4,5419.0,12121.0,2775.0
4,2011,5,4028.0,9754.0,2210.0
...,...,...,...,...,...
139,2022,8,688.0,11654.0,8916.0
140,2022,9,760.0,11341.0,7415.0
141,2022,10,649.0,10258.0,7793.0
142,2022,11,574.0,10559.0,7694.0


In [None]:
df_count.dropna(axis=0,inplace=True)
df_count.reset_index(inplace=True,drop=True)
df_count

Unnamed: 0,year,month,last_month_total_deal_count,last_month_total_full_rent_count,last_month_month_total_rent_count
0,2011,2,7179.0,12336.0,2514.0
1,2011,3,6026.0,12261.0,2711.0
2,2011,4,5419.0,12121.0,2775.0
3,2011,5,4028.0,9754.0,2210.0
4,2011,6,3836.0,9280.0,2168.0
...,...,...,...,...,...
138,2022,8,688.0,11654.0,8916.0
139,2022,9,760.0,11341.0,7415.0
140,2022,10,649.0,10258.0,7793.0
141,2022,11,574.0,10559.0,7694.0


In [None]:
df_count.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 143 entries, 0 to 142
Data columns (total 5 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   year                               143 non-null    int64  
 1   month                              143 non-null    int64  
 2   last_month_total_deal_count        143 non-null    float64
 3   last_month_total_full_rent_count   143 non-null    float64
 4   last_month_month_total_rent_count  143 non-null    float64
dtypes: float64(3), int64(2)
memory usage: 5.7 KB
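`shift(1)` moves values by row position, so it only means "previous month" when the rows are sorted by year and month with no missing months. A minimal sketch of such a check, on a hypothetical toy frame (the `month_index` helper and the numbers are illustrative assumptions):

```python
import pandas as pd

# Hypothetical toy data spanning a year boundary
toy = pd.DataFrame({
    "year":  [2011, 2011, 2011, 2012],
    "month": [10, 11, 12, 1],
    "deal_count": [100, 110, 120, 130],
})

# Running month number; consecutive months differ by exactly 1
month_index = toy["year"] * 12 + (toy["month"] - 1)
is_contiguous = bool((month_index.diff().dropna() == 1).all())

# Only shift when a positional shift really means "previous month"
if is_contiguous:
    toy["last_month_deal_count"] = toy["deal_count"].shift(1)
print(toy)
```

The first row has no previous month, so its shifted value is NaN, which is why the notebook drops it afterwards.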


## Merging with economic_data

In [None]:
df_economic.head()

Unnamed: 0,year,month,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,us_10_year,korea_10-3_year,us_10-2_year,us_10-3_year_month,apartment_supply,unsold_count,unsold_ratio
0,2011,2,93.0,2011.301786,2.75,3.939286,4.745714,0.127536,0.762071,3.565429,0.806429,2.803357,3.437893,5342.0,2269.0,42.474729
1,2011,3,93.0,1999.63871,2.927419,3.745968,4.518387,0.095839,0.679452,3.404226,0.772419,2.724774,3.308387,3494.0,2216.0,63.423011
2,2011,4,93.0,2152.758,3.0,3.747,4.483333,0.0562,0.719767,3.4346,0.736333,2.714833,3.3784,1511.0,2104.0,139.245533
3,2011,5,93.0,2126.069355,3.0,3.67371,4.347742,0.035581,0.536161,3.152774,0.674032,2.616613,3.117194,709.0,1855.0,261.636107
4,2011,6,93.0,2074.891667,3.175,3.6385,4.240667,0.032667,0.402067,2.976833,0.602167,2.574767,2.944167,1507.0,1785.0,118.447246


In [None]:
# The macroeconomic indicators cover every date, so merge on year and month
df_economic=pd.merge(df_economic, df_count, left_on=["year","month"], right_on=["year","month"], how="inner")
df_economic.head()

Unnamed: 0,year,month,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,us_10_year,korea_10-3_year,us_10-2_year,us_10-3_year_month,apartment_supply,unsold_count,unsold_ratio,last_month_total_deal_count,last_month_total_full_rent_count,last_month_month_total_rent_count
0,2011,2,93.0,2011.301786,2.75,3.939286,4.745714,0.127536,0.762071,3.565429,0.806429,2.803357,3.437893,5342.0,2269.0,42.474729,7179.0,12336.0,2514.0
1,2011,3,93.0,1999.63871,2.927419,3.745968,4.518387,0.095839,0.679452,3.404226,0.772419,2.724774,3.308387,3494.0,2216.0,63.423011,6026.0,12261.0,2711.0
2,2011,4,93.0,2152.758,3.0,3.747,4.483333,0.0562,0.719767,3.4346,0.736333,2.714833,3.3784,1511.0,2104.0,139.245533,5419.0,12121.0,2775.0
3,2011,5,93.0,2126.069355,3.0,3.67371,4.347742,0.035581,0.536161,3.152774,0.674032,2.616613,3.117194,709.0,1855.0,261.636107,4028.0,9754.0,2210.0
4,2011,6,93.0,2074.891667,3.175,3.6385,4.240667,0.032667,0.402067,2.976833,0.602167,2.574767,2.944167,1507.0,1785.0,118.447246,3836.0,9280.0,2168.0
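An inner merge silently drops any month that is missing from either side. As a hedged sketch (toy frames with hypothetical values), the `validate` and `indicator` parameters of `pd.merge` can confirm that the keys are unique and show which months would be lost:

```python
import pandas as pd

# Hypothetical toy frames; April 2011 is missing from the right side
left = pd.DataFrame({
    "year": [2011, 2011, 2011],
    "month": [2, 3, 4],
    "kospi_index": [2011.3, 1999.6, 2152.8],
})
right = pd.DataFrame({
    "year": [2011, 2011],
    "month": [2, 3],
    "deal_count": [7179, 6026],
})

# validate raises if (year, month) is duplicated on either side;
# indicator adds a `_merge` column telling where each row came from
checked = pd.merge(left, right, on=["year", "month"], how="outer",
                   validate="one_to_one", indicator=True)
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["year", "month", "_merge"]])
```

If `unmatched` is empty, an inner merge on the same keys keeps every row.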


In [None]:
df_economic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 143 entries, 0 to 142
Data columns (total 19 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   year                               143 non-null    int64  
 1   month                              143 non-null    int64  
 2   apartment_index                    143 non-null    float64
 3   kospi_index                        143 non-null    float64
 4   korea_rp                           143 non-null    float64
 5   korea_3_year                       143 non-null    float64
 6   korea_10_year                      143 non-null    float64
 7   us_3_month                         143 non-null    float64
 8   us_2_year                          143 non-null    float64
 9   us_10_year                         143 non-null    float64
 10  korea_10-3_year                    143 non-null    float64
 11  us_10-2_year                       143 non-null    float64

In [None]:
# Some column names are confusing, so rename them
df_economic.columns = ['year', 'month','apartment_index', 'kospi_index',
       'korea_rp', 'korea_3_year', 'korea_10_year', 'us_3_month', 'us_2_year', 'us_10_year', 
       'korea_10-3_year', 'us_10-2_year', 'us_10-3_year_month', 'last_month_total_apartment_supply', 'last_month_total_unsold_count',
       'last_month_total_unsold_ratio', 'last_month_total_deal_count',
       'last_month_total_full_rent_count',
       'last_month_total_month_rent_count']
df_economic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 143 entries, 0 to 142
Data columns (total 19 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   year                               143 non-null    int64  
 1   month                              143 non-null    int64  
 2   apartment_index                    143 non-null    float64
 3   kospi_index                        143 non-null    float64
 4   korea_rp                           143 non-null    float64
 5   korea_3_year                       143 non-null    float64
 6   korea_10_year                      143 non-null    float64
 7   us_3_month                         143 non-null    float64
 8   us_2_year                          143 non-null    float64
 9   us_10_year                         143 non-null    float64
 10  korea_10-3_year                    143 non-null    float64
 11  us_10-2_year                       143 non-null    float64

In [None]:
# Change the column dtypes
df_economic=df_economic.astype({'year': 'int16','month': 'int16',
                    'last_month_total_apartment_supply': 'int32',
                    'last_month_total_unsold_count': 'int32',
                    'last_month_total_deal_count': 'int32',
                    'last_month_total_full_rent_count': 'int32',
                    'last_month_total_month_rent_count': 'int32'})
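Casting a float column to an integer dtype truncates any fractional part and raises an error on NaN. A minimal sketch, assuming a toy column with hypothetical values, that checks the cast is lossless before applying it:

```python
import pandas as pd

# Hypothetical toy column; counts stored as floats after shift()
col = "last_month_total_deal_count"
toy = pd.DataFrame({col: [7179.0, 6026.0, 5419.0]})

# True only if there are no NaNs and every value is a whole number
is_lossless = bool(toy[col].notna().all() and (toy[col] % 1 == 0).all())
if is_lossless:
    toy = toy.astype({col: "int32"})
print(toy.dtypes)
```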

In [None]:
df_economic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 143 entries, 0 to 142
Data columns (total 19 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   year                               143 non-null    int16  
 1   month                              143 non-null    int16  
 2   apartment_index                    143 non-null    float64
 3   kospi_index                        143 non-null    float64
 4   korea_rp                           143 non-null    float64
 5   korea_3_year                       143 non-null    float64
 6   korea_10_year                      143 non-null    float64
 7   us_3_month                         143 non-null    float64
 8   us_2_year                          143 non-null    float64
 9   us_10_year                         143 non-null    float64
 10  korea_10-3_year                    143 non-null    float64
 11  us_10-2_year                       143 non-null    float64

In [None]:
df_economic.head()

Unnamed: 0,year,month,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,us_10_year,korea_10-3_year,us_10-2_year,us_10-3_year_month,last_month_total_apartment_supply,last_month_total_unsold_count,last_month_total_unsold_ratio,last_month_total_deal_count,last_month_total_full_rent_count,last_month_total_month_rent_count
0,2011,2,93.0,2011.301786,2.75,3.939286,4.745714,0.127536,0.762071,3.565429,0.806429,2.803357,3.437893,5342,2269,42.474729,7179,12336,2514
1,2011,3,93.0,1999.63871,2.927419,3.745968,4.518387,0.095839,0.679452,3.404226,0.772419,2.724774,3.308387,3494,2216,63.423011,6026,12261,2711
2,2011,4,93.0,2152.758,3.0,3.747,4.483333,0.0562,0.719767,3.4346,0.736333,2.714833,3.3784,1511,2104,139.245533,5419,12121,2775
3,2011,5,93.0,2126.069355,3.0,3.67371,4.347742,0.035581,0.536161,3.152774,0.674032,2.616613,3.117194,709,1855,261.636107,4028,9754,2210
4,2011,6,93.0,2074.891667,3.175,3.6385,4.240667,0.032667,0.402067,2.976833,0.602167,2.574767,2.944167,1507,1785,118.447246,3836,9280,2168


In [None]:
# Save as a pickle file
df_economic.to_pickle('/content/drive/MyDrive/house_price/after_data/economic_data2.pkl')
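One practical difference between the two formats is that a pickle keeps the dtypes set above, while a CSV round trip re-infers them on load. A small sketch with a hypothetical one-column frame and a temporary file path:

```python
import io
import os
import tempfile

import pandas as pd

# Hypothetical toy frame with a narrowed integer dtype
toy = pd.DataFrame({"year": [2011, 2012]}).astype({"year": "int16"})

# CSV round trip: the dtype is inferred again when reading
buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)
from_csv = pd.read_csv(buf)

# Pickle round trip: the dtype is stored with the data
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.pkl")
    toy.to_pickle(path)
    from_pickle = pd.read_pickle(path)

print(from_csv["year"].dtype, from_pickle["year"].dtype)
```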

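# Creating the final_economic file

- economic_data2 holds the macroeconomic indicators as of each given date (year and month).
- The final_economic file adds, on top of economic_data2, the change in each indicator relative to its past values

## Basic information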

In [None]:
import pandas as pd
# Load the DataFrame
df_economic = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/economic_data2.pkl')
df_economic.head()

Unnamed: 0,year,month,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,us_10_year,korea_10-3_year,us_10-2_year,us_10-3_year_month,last_month_total_apartment_supply,last_month_total_unsold_count,last_month_total_unsold_ratio,last_month_total_deal_count,last_month_total_full_rent_count,last_month_total_month_rent_count
0,2011,2,93.0,2011.301786,2.75,3.939286,4.745714,0.127536,0.762071,3.565429,0.806429,2.803357,3.437893,5342,2269,42.474729,7179,12336,2514
1,2011,3,93.0,1999.63871,2.927419,3.745968,4.518387,0.095839,0.679452,3.404226,0.772419,2.724774,3.308387,3494,2216,63.423011,6026,12261,2711
2,2011,4,93.0,2152.758,3.0,3.747,4.483333,0.0562,0.719767,3.4346,0.736333,2.714833,3.3784,1511,2104,139.245533,5419,12121,2775
3,2011,5,93.0,2126.069355,3.0,3.67371,4.347742,0.035581,0.536161,3.152774,0.674032,2.616613,3.117194,709,1855,261.636107,4028,9754,2210
4,2011,6,93.0,2074.891667,3.175,3.6385,4.240667,0.032667,0.402067,2.976833,0.602167,2.574767,2.944167,1507,1785,118.447246,3836,9280,2168


In [None]:
df_economic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 143 entries, 0 to 142
Data columns (total 19 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   year                               143 non-null    int16  
 1   month                              143 non-null    int16  
 2   apartment_index                    143 non-null    float64
 3   kospi_index                        143 non-null    float64
 4   korea_rp                           143 non-null    float64
 5   korea_3_year                       143 non-null    float64
 6   korea_10_year                      143 non-null    float64
 7   us_3_month                         143 non-null    float64
 8   us_2_year                          143 non-null    float64
 9   us_10_year                         143 non-null    float64
 10  korea_10-3_year                    143 non-null    float64
 11  us_10-2_year                       143 non-null    float64

## Computing the change from 6 and 12 months earlier

In [None]:
# Compute the date 6 months earlier
df_economic.loc[df_economic['month']<7, '6m_before_year'] = df_economic['year']-1
df_economic.loc[df_economic['month']<7, '6m_before_month'] = 12-(6-df_economic['month'])
df_economic.loc[df_economic['month']>=7, '6m_before_year'] = df_economic['year']
df_economic.loc[df_economic['month']>=7, '6m_before_month'] = df_economic['month']-6

# Compute the date 12 months earlier
df_economic.loc[:, '12m_before_year'] = df_economic['year']-1
df_economic.loc[:, '12m_before_month'] = df_economic['month']

df_economic=df_economic.astype({'6m_before_year': 'int16','6m_before_month': 'int16'})
df_economic

Unnamed: 0,year,month,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,us_10_year,...,last_month_total_apartment_supply,last_month_total_unsold_count,last_month_total_unsold_ratio,last_month_total_deal_count,last_month_total_full_rent_count,last_month_total_month_rent_count,6m_before_year,6m_before_month,12m_before_year,12m_before_month
0,2011,2,93.0,2011.301786,2.750000,3.939286,4.745714,0.127536,0.762071,3.565429,...,5342,2269,42.474729,7179,12336,2514,2010,8,2010,2
1,2011,3,93.0,1999.638710,2.927419,3.745968,4.518387,0.095839,0.679452,3.404226,...,3494,2216,63.423011,6026,12261,2711,2010,9,2010,3
2,2011,4,93.0,2152.758000,3.000000,3.747000,4.483333,0.056200,0.719767,3.434600,...,1511,2104,139.245533,5419,12121,2775,2010,10,2010,4
3,2011,5,93.0,2126.069355,3.000000,3.673710,4.347742,0.035581,0.536161,3.152774,...,709,1855,261.636107,4028,9754,2210,2010,11,2010,5
4,2011,6,93.0,2074.891667,3.175000,3.638500,4.240667,0.032667,0.402067,2.976833,...,1507,1785,118.447246,3836,9280,2168,2010,12,2010,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
138,2022,8,104.4,2485.917097,2.306452,3.246419,3.325935,2.662774,3.262348,2.904935,...,1736,592,34.101382,688,11654,8916,2022,2,2021,8
139,2022,9,104.4,2346.002333,2.500000,3.842233,3.852467,3.139330,3.828267,3.486200,...,1853,610,32.919590,760,11341,7415,2022,3,2021,9
140,2022,10,104.4,2223.492903,2.822581,4.235484,4.252290,3.734961,4.382039,3.979645,...,1552,719,46.327320,649,10258,7793,2022,4,2021,10
141,2022,11,104.4,2420.471667,3.058333,3.882467,3.907033,4.238337,4.501560,3.873067,...,1265,866,68.458498,574,10559,7694,2022,5,2021,11


In [None]:
# Create the past-value DataFrames to merge
df_economic_6m_before = df_economic[['year', 'month','kospi_index', 'korea_rp',
       'korea_3_year', 'korea_10_year', 'us_3_month', 'us_2_year',
       'us_10_year', 'korea_10-3_year', 'us_10-2_year', 'us_10-3_year_month',
       'last_month_total_apartment_supply', 'last_month_total_unsold_count',
       'last_month_total_unsold_ratio', 'last_month_total_deal_count',
       'last_month_total_full_rent_count', 'last_month_total_month_rent_count']].copy()
df_economic_12m_before = df_economic[['year', 'month','kospi_index', 'korea_rp',
       'korea_3_year', 'korea_10_year', 'us_3_month', 'us_2_year',
       'us_10_year', 'korea_10-3_year', 'us_10-2_year', 'us_10-3_year_month',
       'last_month_total_apartment_supply', 'last_month_total_unsold_count',
       'last_month_total_unsold_ratio', 'last_month_total_deal_count',
       'last_month_total_full_rent_count', 'last_month_total_month_rent_count']].copy()

In [None]:
# Build the names of the columns to add
temp_column_total_list = list()
month_num_list = [6,12] # build columns for 6 and 12 months earlier
for i in month_num_list:
    column_list = list()
    column_list.append('year_'+str(i)+'m_before')
    column_list.append('month_'+str(i)+'m_before')
    column_list.append('kospi_index_'+str(i)+'m_before')
    column_list.append('korea_rp_'+str(i)+'m_before')
    column_list.append('korea_3_year_'+str(i)+'m_before')
    column_list.append('korea_10_year_'+str(i)+'m_before')
    column_list.append('us_3_month_'+str(i)+'m_before')
    column_list.append('us_2_year_'+str(i)+'m_before')
    column_list.append('us_10_year_'+str(i)+'m_before')
    column_list.append('korea_10-3_year_'+str(i)+'m_before')
    column_list.append('us_10-2_year_'+str(i)+'m_before')
    column_list.append('us_10-3_year_month_'+str(i)+'m_before')
    column_list.append('last_month_total_apartment_supply_'+str(i)+'m_before')
    column_list.append('last_month_total_unsold_count_'+str(i)+'m_before')
    column_list.append('last_month_total_unsold_ratio_'+str(i)+'m_before')
    column_list.append('last_month_total_deal_count_'+str(i)+'m_before')
    column_list.append('last_month_total_full_rent_count_'+str(i)+'m_before')
    column_list.append('last_month_total_month_rent_count_'+str(i)+'m_before')
    temp_column_total_list.append(column_list)

In [None]:
temp_column_total_list[0]

['year_6m_before',
 'month_6m_before',
 'kospi_index_6m_before',
 'korea_rp_6m_before',
 'korea_3_year_6m_before',
 'korea_10_year_6m_before',
 'us_3_month_6m_before',
 'us_2_year_6m_before',
 'us_10_year_6m_before',
 'korea_10-3_year_6m_before',
 'us_10-2_year_6m_before',
 'us_10-3_year_month_6m_before',
 'last_month_total_apartment_supply_6m_before',
 'last_month_total_unsold_count_6m_before',
 'last_month_total_unsold_ratio_6m_before',
 'last_month_total_deal_count_6m_before',
 'last_month_total_full_rent_count_6m_before',
 'last_month_total_month_rent_count_6m_before']

In [None]:
temp_column_total_list[1]

['year_12m_before',
 'month_12m_before',
 'kospi_index_12m_before',
 'korea_rp_12m_before',
 'korea_3_year_12m_before',
 'korea_10_year_12m_before',
 'us_3_month_12m_before',
 'us_2_year_12m_before',
 'us_10_year_12m_before',
 'korea_10-3_year_12m_before',
 'us_10-2_year_12m_before',
 'us_10-3_year_month_12m_before',
 'last_month_total_apartment_supply_12m_before',
 'last_month_total_unsold_count_12m_before',
 'last_month_total_unsold_ratio_12m_before',
 'last_month_total_deal_count_12m_before',
 'last_month_total_full_rent_count_12m_before',
 'last_month_total_month_rent_count_12m_before']

In [None]:
df_economic_6m_before.columns = temp_column_total_list[0]
df_economic_12m_before.columns = temp_column_total_list[1]
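Because every new name is the original name plus a fixed suffix, pandas' `DataFrame.add_suffix` can produce the same renaming without building the lists by hand. A minimal sketch on a hypothetical three-column frame:

```python
import pandas as pd

# Hypothetical toy frame with a subset of the real columns
toy = pd.DataFrame({"year": [2011], "month": [2], "kospi_index": [2011.3]})

# Appends the suffix to every column label and returns a new DataFrame
toy_6m_before = toy.add_suffix("_6m_before")
print(list(toy_6m_before.columns))
```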

In [None]:
pd.set_option('display.max_columns', 100)
df_economic = pd.merge(df_economic, df_economic_6m_before, left_on=['6m_before_year', '6m_before_month'], right_on=['year_6m_before','month_6m_before'], how='inner')
df_economic = pd.merge(df_economic, df_economic_12m_before, left_on=['12m_before_year', '12m_before_month'], right_on=['year_12m_before','month_12m_before'], how='inner')
df_economic = df_economic.drop(["6m_before_year", "6m_before_month", "12m_before_year", "12m_before_month", "year_6m_before", "month_6m_before","year_12m_before", "month_12m_before"], axis=1)
df_economic

Unnamed: 0,year,month,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,us_10_year,korea_10-3_year,us_10-2_year,us_10-3_year_month,last_month_total_apartment_supply,last_month_total_unsold_count,last_month_total_unsold_ratio,last_month_total_deal_count,last_month_total_full_rent_count,last_month_total_month_rent_count,kospi_index_6m_before,korea_rp_6m_before,korea_3_year_6m_before,korea_10_year_6m_before,us_3_month_6m_before,us_2_year_6m_before,us_10_year_6m_before,korea_10-3_year_6m_before,us_10-2_year_6m_before,us_10-3_year_month_6m_before,last_month_total_apartment_supply_6m_before,last_month_total_unsold_count_6m_before,last_month_total_unsold_ratio_6m_before,last_month_total_deal_count_6m_before,last_month_total_full_rent_count_6m_before,last_month_total_month_rent_count_6m_before,kospi_index_12m_before,korea_rp_12m_before,korea_3_year_12m_before,korea_10_year_12m_before,us_3_month_12m_before,us_2_year_12m_before,us_10_year_12m_before,korea_10-3_year_12m_before,us_10-2_year_12m_before,us_10-3_year_month_12m_before,last_month_total_apartment_supply_12m_before,last_month_total_unsold_count_12m_before,last_month_total_unsold_ratio_12m_before,last_month_total_deal_count_12m_before,last_month_total_full_rent_count_12m_before,last_month_total_month_rent_count_12m_before
0,2012,2,86.8,2002.988966,3.250000,3.432241,3.812414,0.090172,0.277207,1.964621,0.380172,1.687414,1.874448,1822,1890,103.732162,2786,10445,2277,1852.969677,3.250000,3.569194,3.942903,0.015387,0.225677,2.286581,0.373710,2.060903,2.271194,1964,1826,92.973523,4319,9682,2311,2011.301786,2.750000,3.939286,4.745714,0.127536,0.762071,3.565429,0.806429,2.803357,3.437893,5342,2269,42.474729,7179,12336,2514
1,2012,3,86.8,2024.652903,3.250000,3.546452,3.941290,0.082568,0.336494,2.156613,0.394839,1.820119,2.074045,1026,1703,165.984405,3948,13055,2638,1796.043000,3.250000,3.429333,3.728333,0.011367,0.201433,1.959400,0.299000,1.757967,1.948033,1106,1767,159.764919,4711,10164,2498,1999.638710,2.927419,3.745968,4.518387,0.095839,0.679452,3.404226,0.772419,2.724774,3.308387,3494,2216,63.423011,6026,12261,2711
2,2012,4,86.8,1996.757333,3.250000,3.500000,3.894333,0.081993,0.289883,2.023367,0.394333,1.733483,1.941373,607,1732,285.337727,4077,11902,2566,1824.276452,3.250000,3.465806,3.861290,0.014935,0.275161,2.136581,0.395484,1.861419,2.121645,1272,1776,139.622642,4178,9272,2359,2152.758000,3.000000,3.747000,4.483333,0.056200,0.719767,3.434600,0.736333,2.714833,3.378400,1511,2104,139.245533,5419,12121,2775
3,2012,5,86.8,1885.333871,3.250000,3.391613,3.751290,0.087297,0.277287,1.789903,0.359677,1.512616,1.702606,1661,1691,101.806141,3415,9420,2077,1856.945333,3.250000,3.392100,3.803333,0.011667,0.250833,2.004233,0.411233,1.753400,1.992567,5195,1821,35.052936,4027,9521,2154,2126.069355,3.000000,3.673710,4.347742,0.035581,0.536161,3.152774,0.674032,2.616613,3.117194,709,1855,261.636107,4028,9754,2210
4,2012,6,86.8,1845.975000,3.250000,3.284000,3.627667,0.086127,0.284087,1.604633,0.343667,1.320547,1.518507,927,1703,183.710895,3489,9363,2038,1863.717742,3.250000,3.355968,3.807742,0.007065,0.252903,1.973129,0.451774,1.720226,1.966065,3764,1801,47.848034,3519,8232,1823,2074.891667,3.175000,3.638500,4.240667,0.032667,0.402067,2.976833,0.602167,2.574767,2.944167,1507,1785,118.447246,3836,9280,2168
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
126,2022,8,104.4,2485.917097,2.306452,3.246419,3.325935,2.662774,3.262348,2.904935,0.079516,-0.357413,0.242161,1736,592,34.101382,688,11654,8916,2721.337500,1.250000,2.275321,2.680821,0.321386,1.441118,1.929607,0.405500,0.488489,1.608221,5665,47,0.829656,1139,12114,8229,3176.717742,0.548387,1.409387,1.905194,0.053300,0.214306,1.280645,0.495806,1.066339,1.227345,2912,59,2.026099,4804,11447,6760
127,2022,9,104.4,2346.002333,2.500000,3.842233,3.852467,3.139330,3.828267,3.486200,0.010233,-0.342067,0.346870,1853,610,32.919590,760,11341,7415,2697.356129,1.250000,2.344032,2.766806,0.435958,1.894190,2.117613,0.422774,0.223423,1.681655,2198,47,2.138308,856,12977,8786,3143.298667,0.750000,1.515200,2.056933,0.040160,0.233207,1.369233,0.541733,1.136027,1.329073,3241,55,1.697007,4208,11800,8461
128,2022,10,104.4,2223.492903,2.822581,4.235484,4.252290,3.734961,4.382039,3.979645,0.016806,-0.402394,0.244684,1552,719,46.327320,649,10258,7793,2704.839000,1.391667,2.937967,3.216867,0.751133,2.536937,2.745700,0.278900,0.208763,1.994567,3218,180,5.593536,1501,12601,8689,2990.897742,0.750000,1.825290,2.390355,0.051616,0.386716,1.577935,0.565065,1.191219,1.526319,2640,55,2.083333,2804,9307,6184
129,2022,11,104.4,2420.471667,3.058333,3.882467,3.907033,4.238337,4.501560,3.873067,0.024567,-0.628493,-0.365270,1265,866,68.458498,574,10559,7694,2632.900323,1.548387,3.019129,3.293774,0.982135,2.615255,2.899419,0.274645,0.284165,1.917284,1967,360,18.301983,1836,11707,7728,2963.523333,0.800000,1.949567,2.349567,0.050583,0.507863,1.549667,0.400000,1.041803,1.499083,2789,55,1.972033,2295,11996,8418


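- The original plan was to compute rates of change, but some values are 0, which makes the calculation produce null or inf, so the analysis uses the amount of change (difference) rather than the rate of change

>> When writing a formula, watch out for cases where zero is the divisor or the dividend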
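The problem can be reproduced on two short Series (hypothetical values): a zero past value turns the rate of change into inf or NaN, while the plain difference stays finite.

```python
import numpy as np
import pandas as pd

# Hypothetical current and past values; two of the past values are 0
current = pd.Series([2.0, 3.0, 0.0])
past = pd.Series([1.0, 0.0, 0.0])

rate = (current - past) / past   # rate of change: breaks when past == 0
diff = current - past            # amount of change: always finite here

print(rate.tolist())
print(diff.tolist())

# If a rate is still wanted, inf can be mapped to NaN and handled explicitly
rate_clean = rate.replace([np.inf, -np.inf], np.nan)
```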

In [None]:
column_list = list()
column_list.append(['kospi_index', 'korea_rp',
       'korea_3_year', 'korea_10_year', 'us_3_month', 'us_2_year',
       'us_10_year', 'korea_10-3_year', 'us_10-2_year', 'us_10-3_year_month',
       'last_month_total_apartment_supply', 'last_month_total_unsold_count',
       'last_month_total_unsold_ratio', 'last_month_total_deal_count',
       'last_month_total_full_rent_count', 'last_month_total_month_rent_count'])
column_list

[['kospi_index',
  'korea_rp',
  'korea_3_year',
  'korea_10_year',
  'us_3_month',
  'us_2_year',
  'us_10_year',
  'korea_10-3_year',
  'us_10-2_year',
  'us_10-3_year_month',
  'last_month_total_apartment_supply',
  'last_month_total_unsold_count',
  'last_month_total_unsold_ratio',
  'last_month_total_deal_count',
  'last_month_total_full_rent_count',
  'last_month_total_month_rent_count']]

In [None]:
column_list.append(temp_column_total_list[0][2:])
column_list.append(temp_column_total_list[1][2:])
column_list

[['kospi_index',
  'korea_rp',
  'korea_3_year',
  'korea_10_year',
  'us_3_month',
  'us_2_year',
  'us_10_year',
  'korea_10-3_year',
  'us_10-2_year',
  'us_10-3_year_month',
  'last_month_total_apartment_supply',
  'last_month_total_unsold_count',
  'last_month_total_unsold_ratio',
  'last_month_total_deal_count',
  'last_month_total_full_rent_count',
  'last_month_total_month_rent_count'],
 ['kospi_index_6m_before',
  'korea_rp_6m_before',
  'korea_3_year_6m_before',
  'korea_10_year_6m_before',
  'us_3_month_6m_before',
  'us_2_year_6m_before',
  'us_10_year_6m_before',
  'korea_10-3_year_6m_before',
  'us_10-2_year_6m_before',
  'us_10-3_year_month_6m_before',
  'last_month_total_apartment_supply_6m_before',
  'last_month_total_unsold_count_6m_before',
  'last_month_total_unsold_ratio_6m_before',
  'last_month_total_deal_count_6m_before',
  'last_month_total_full_rent_count_6m_before',
  'last_month_total_month_rent_count_6m_before'],
 ['kospi_index_12m_before',
  'korea_rp_12m_

In [None]:
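# Amount of change = current value - past value
# Note: the *_6m_before and *_12m_before columns are overwritten in place,
# so after this cell they hold differences rather than the past values.
for i in range(len(column_list[0])):
  df_economic[column_list[1][i]] = df_economic[column_list[0][i]] - df_economic[column_list[1][i]]
  df_economic[column_list[2][i]] = df_economic[column_list[0][i]] - df_economic[column_list[2][i]]
df_economic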

Unnamed: 0,year,month,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,us_10_year,korea_10-3_year,us_10-2_year,us_10-3_year_month,last_month_total_apartment_supply,last_month_total_unsold_count,last_month_total_unsold_ratio,last_month_total_deal_count,last_month_total_full_rent_count,last_month_total_month_rent_count,kospi_index_6m_before,korea_rp_6m_before,korea_3_year_6m_before,korea_10_year_6m_before,us_3_month_6m_before,us_2_year_6m_before,us_10_year_6m_before,korea_10-3_year_6m_before,us_10-2_year_6m_before,us_10-3_year_month_6m_before,last_month_total_apartment_supply_6m_before,last_month_total_unsold_count_6m_before,last_month_total_unsold_ratio_6m_before,last_month_total_deal_count_6m_before,last_month_total_full_rent_count_6m_before,last_month_total_month_rent_count_6m_before,kospi_index_12m_before,korea_rp_12m_before,korea_3_year_12m_before,korea_10_year_12m_before,us_3_month_12m_before,us_2_year_12m_before,us_10_year_12m_before,korea_10-3_year_12m_before,us_10-2_year_12m_before,us_10-3_year_month_12m_before,last_month_total_apartment_supply_12m_before,last_month_total_unsold_count_12m_before,last_month_total_unsold_ratio_12m_before,last_month_total_deal_count_12m_before,last_month_total_full_rent_count_12m_before,last_month_total_month_rent_count_12m_before
0,2012,2,86.8,2002.988966,3.250000,3.432241,3.812414,0.090172,0.277207,1.964621,0.380172,1.687414,1.874448,1822,1890,103.732162,2786,10445,2277,150.019288,0.000000,-0.136952,-0.130489,0.074785,0.051529,-0.321960,0.006463,-0.373489,-0.396745,-142,64,10.758639,-1533,763,-34,-8.312820,0.500000,-0.507044,-0.933300,-0.037363,-0.484865,-1.600808,-0.426256,-1.115943,-1.563445,-3520,-379,61.257434,-4393,-1891,-237
1,2012,3,86.8,2024.652903,3.250000,3.546452,3.941290,0.082568,0.336494,2.156613,0.394839,1.820119,2.074045,1026,1703,165.984405,3948,13055,2638,228.609903,0.000000,0.117118,0.212957,0.071201,0.135060,0.197213,0.095839,0.062153,0.126012,-80,-64,6.219487,-763,2891,140,25.014194,0.322581,-0.199516,-0.577097,-0.013271,-0.342958,-1.247613,-0.377581,-0.904655,-1.234342,-2468,-513,102.561395,-2078,794,-73
2,2012,4,86.8,1996.757333,3.250000,3.500000,3.894333,0.081993,0.289883,2.023367,0.394333,1.733483,1.941373,607,1732,285.337727,4077,11902,2566,172.480882,0.000000,0.034194,0.033043,0.067058,0.014722,-0.113214,-0.001151,-0.127936,-0.180272,-665,-44,145.715085,-101,2630,207,-156.000667,0.250000,-0.247000,-0.589000,0.025793,-0.429883,-1.411233,-0.342000,-0.981350,-1.437027,-904,-372,146.092194,-1342,-219,-209
3,2012,5,86.8,1885.333871,3.250000,3.391613,3.751290,0.087297,0.277287,1.789903,0.359677,1.512616,1.702606,1661,1691,101.806141,3415,9420,2077,28.388538,0.000000,-0.000487,-0.052043,0.075630,0.026454,-0.214330,-0.051556,-0.240784,-0.289960,-3534,-130,66.753205,-612,-101,-77,-240.735484,0.250000,-0.282097,-0.596452,0.051716,-0.258874,-1.362871,-0.314355,-1.103997,-1.414587,952,-164,-159.829966,-613,-334,-133
4,2012,6,86.8,1845.975000,3.250000,3.284000,3.627667,0.086127,0.284087,1.604633,0.343667,1.320547,1.518507,927,1703,183.710895,3489,9363,2038,-17.742742,0.000000,-0.071968,-0.180075,0.079062,0.031183,-0.368496,-0.108108,-0.399679,-0.447558,-2837,-98,135.862861,-30,1131,215,-228.916667,0.075000,-0.354500,-0.613000,0.053460,-0.117980,-1.372200,-0.258500,-1.254220,-1.425660,-580,-82,65.263649,-347,83,-130
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
126,2022,8,104.4,2485.917097,2.306452,3.246419,3.325935,2.662774,3.262348,2.904935,0.079516,-0.357413,0.242161,1736,592,34.101382,688,11654,8916,-235.420403,1.056452,0.971098,0.645114,2.341388,1.821231,0.975328,-0.325984,-0.845902,-1.366060,-3929,545,33.271727,-451,-460,687,-690.800645,1.758065,1.837032,1.420742,2.609474,3.048042,1.624290,-0.416290,-1.423752,-0.985184,-1176,533,32.075284,-4116,207,2156
127,2022,9,104.4,2346.002333,2.500000,3.842233,3.852467,3.139330,3.828267,3.486200,0.010233,-0.342067,0.346870,1853,610,32.919590,760,11341,7415,-351.353796,1.250000,1.498201,1.085660,2.703372,1.934076,1.368587,-0.412541,-0.565489,-1.334785,-345,563,30.781282,-96,-1636,-1371,-797.296333,1.750000,2.327033,1.795533,3.099170,3.595060,2.116967,-0.531500,-1.478093,-0.982203,-1388,555,31.222583,-3448,-459,-1046
128,2022,10,104.4,2223.492903,2.822581,4.235484,4.252290,3.734961,4.382039,3.979645,0.016806,-0.402394,0.244684,1552,719,46.327320,649,10258,7793,-481.346097,1.430914,1.297517,1.035424,2.983828,1.845102,1.233945,-0.262094,-0.611157,-1.749883,-1666,539,40.733783,-852,-2343,-896,-767.404839,2.072581,2.410194,1.861935,3.683345,3.995323,2.401710,-0.548258,-1.593613,-1.281635,-1088,664,44.243986,-2155,951,1609
129,2022,11,104.4,2420.471667,3.058333,3.882467,3.907033,4.238337,4.501560,3.873067,0.024567,-0.628493,-0.365270,1265,866,68.458498,574,10559,7694,-212.428656,1.509946,0.863338,0.613259,3.256201,1.886305,0.973647,-0.250078,-0.912658,-2.282554,-702,506,50.156515,-1262,-1148,-34,-543.051667,2.258333,1.932900,1.557467,4.187753,3.993697,2.323400,-0.375433,-1.670297,-1.864353,-1524,811,66.486465,-1721,-1437,-724


In [None]:
# Replace inf values with null, then count nulls so that inf and null values are counted together
import numpy as np
df_economic.replace([np.inf, -np.inf], np.nan, inplace=True)
var = df_economic.isnull().sum()
print(var.to_string())

year                                            0
month                                           0
apartment_index                                 0
kospi_index                                     0
korea_rp                                        0
korea_3_year                                    0
korea_10_year                                   0
us_3_month                                      0
us_2_year                                       0
us_10_year                                      0
korea_10-3_year                                 0
us_10-2_year                                    0
us_10-3_year_month                              0
last_month_total_apartment_supply               0
last_month_total_unsold_count                   0
last_month_total_unsold_ratio                   0
last_month_total_deal_count                     0
last_month_total_full_rent_count                0
last_month_total_month_rent_count               0
kospi_index_6m_before                           0


In [None]:
df_economic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 131 entries, 0 to 130
Data columns (total 51 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   year                                          131 non-null    int16  
 1   month                                         131 non-null    int16  
 2   apartment_index                               131 non-null    float64
 3   kospi_index                                   131 non-null    float64
 4   korea_rp                                      131 non-null    float64
 5   korea_3_year                                  131 non-null    float64
 6   korea_10_year                                 131 non-null    float64
 7   us_3_month                                    131 non-null    float64
 8   us_2_year                                     131 non-null    float64
 9   us_10_year                                    131 non-null    flo

In [None]:
# Convert float64 columns to float32 to reduce memory usage
df_economic_columns = list(df_economic.columns)
for df_economic_column in df_economic_columns:
    if df_economic[df_economic_column].dtype == 'float64':
        df_economic[df_economic_column] = df_economic[df_economic_column].astype('float32')

In [None]:
df_economic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 131 entries, 0 to 130
Data columns (total 51 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   year                                          131 non-null    int16  
 1   month                                         131 non-null    int16  
 2   apartment_index                               131 non-null    float32
 3   kospi_index                                   131 non-null    float32
 4   korea_rp                                      131 non-null    float32
 5   korea_3_year                                  131 non-null    float32
 6   korea_10_year                                 131 non-null    float32
 7   us_3_month                                    131 non-null    float32
 8   us_2_year                                     131 non-null    float32
 9   us_10_year                                    131 non-null    flo

In [None]:
df_economic

Unnamed: 0,year,month,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,us_10_year,korea_10-3_year,us_10-2_year,us_10-3_year_month,last_month_total_apartment_supply,last_month_total_unsold_count,last_month_total_unsold_ratio,last_month_total_deal_count,last_month_total_full_rent_count,last_month_total_month_rent_count,kospi_index_6m_before,korea_rp_6m_before,korea_3_year_6m_before,korea_10_year_6m_before,us_3_month_6m_before,us_2_year_6m_before,us_10_year_6m_before,korea_10-3_year_6m_before,us_10-2_year_6m_before,us_10-3_year_month_6m_before,last_month_total_apartment_supply_6m_before,last_month_total_unsold_count_6m_before,last_month_total_unsold_ratio_6m_before,last_month_total_deal_count_6m_before,last_month_total_full_rent_count_6m_before,last_month_total_month_rent_count_6m_before,kospi_index_12m_before,korea_rp_12m_before,korea_3_year_12m_before,korea_10_year_12m_before,us_3_month_12m_before,us_2_year_12m_before,us_10_year_12m_before,korea_10-3_year_12m_before,us_10-2_year_12m_before,us_10-3_year_month_12m_before,last_month_total_apartment_supply_12m_before,last_month_total_unsold_count_12m_before,last_month_total_unsold_ratio_12m_before,last_month_total_deal_count_12m_before,last_month_total_full_rent_count_12m_before,last_month_total_month_rent_count_12m_before
0,2012,2,86.800003,2002.989014,3.250000,3.432241,3.812414,0.090172,0.277207,1.964621,0.380172,1.687414,1.874448,1822,1890,103.732162,2786,10445,2277,150.019287,0.000000,-0.136952,-0.130489,0.074785,0.051529,-0.321960,0.006463,-0.373489,-0.396745,-142,64,10.758639,-1533,763,-34,-8.312820,0.500000,-0.507044,-0.933300,-0.037363,-0.484865,-1.600808,-0.426256,-1.115943,-1.563445,-3520,-379,61.257435,-4393,-1891,-237
1,2012,3,86.800003,2024.652954,3.250000,3.546452,3.941290,0.082568,0.336494,2.156613,0.394839,1.820119,2.074045,1026,1703,165.984406,3948,13055,2638,228.609909,0.000000,0.117118,0.212957,0.071201,0.135060,0.197213,0.095839,0.062153,0.126012,-80,-64,6.219487,-763,2891,140,25.014194,0.322581,-0.199516,-0.577097,-0.013271,-0.342958,-1.247613,-0.377581,-0.904655,-1.234342,-2468,-513,102.561394,-2078,794,-73
2,2012,4,86.800003,1996.757324,3.250000,3.500000,3.894333,0.081993,0.289883,2.023367,0.394333,1.733483,1.941373,607,1732,285.337738,4077,11902,2566,172.480881,0.000000,0.034194,0.033043,0.067058,0.014722,-0.113214,-0.001151,-0.127936,-0.180272,-665,-44,145.715088,-101,2630,207,-156.000671,0.250000,-0.247000,-0.589000,0.025793,-0.429883,-1.411233,-0.342000,-0.981350,-1.437027,-904,-372,146.092194,-1342,-219,-209
3,2012,5,86.800003,1885.333862,3.250000,3.391613,3.751290,0.087297,0.277287,1.789903,0.359677,1.512616,1.702606,1661,1691,101.806137,3415,9420,2077,28.388538,0.000000,-0.000487,-0.052043,0.075630,0.026454,-0.214330,-0.051556,-0.240784,-0.289960,-3534,-130,66.753204,-612,-101,-77,-240.735489,0.250000,-0.282097,-0.596452,0.051716,-0.258874,-1.362871,-0.314355,-1.103997,-1.414587,952,-164,-159.829971,-613,-334,-133
4,2012,6,86.800003,1845.974976,3.250000,3.284000,3.627667,0.086127,0.284087,1.604633,0.343667,1.320547,1.518507,927,1703,183.710892,3489,9363,2038,-17.742743,0.000000,-0.071968,-0.180075,0.079062,0.031183,-0.368496,-0.108108,-0.399679,-0.447558,-2837,-98,135.862854,-30,1131,215,-228.916672,0.075000,-0.354500,-0.613000,0.053460,-0.117980,-1.372200,-0.258500,-1.254220,-1.425660,-580,-82,65.263649,-347,83,-130
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
126,2022,8,104.400002,2485.916992,2.306452,3.246419,3.325936,2.662774,3.262348,2.904936,0.079516,-0.357413,0.242161,1736,592,34.101383,688,11654,8916,-235.420410,1.056452,0.971098,0.645114,2.341388,1.821231,0.975328,-0.325984,-0.845902,-1.366060,-3929,545,33.271729,-451,-460,687,-690.800659,1.758065,1.837032,1.420742,2.609474,3.048042,1.624290,-0.416290,-1.423752,-0.985184,-1176,533,32.075283,-4116,207,2156
127,2022,9,104.400002,2346.002441,2.500000,3.842233,3.852467,3.139330,3.828267,3.486200,0.010233,-0.342067,0.346870,1853,610,32.919590,760,11341,7415,-351.353790,1.250000,1.498201,1.085660,2.703372,1.934076,1.368587,-0.412541,-0.565489,-1.334785,-345,563,30.781282,-96,-1636,-1371,-797.296326,1.750000,2.327033,1.795533,3.099170,3.595060,2.116967,-0.531500,-1.478093,-0.982203,-1388,555,31.222582,-3448,-459,-1046
128,2022,10,104.400002,2223.492920,2.822581,4.235484,4.252290,3.734961,4.382039,3.979645,0.016806,-0.402394,0.244684,1552,719,46.327320,649,10258,7793,-481.346100,1.430914,1.297517,1.035424,2.983828,1.845102,1.233945,-0.262094,-0.611157,-1.749883,-1666,539,40.733784,-852,-2343,-896,-767.404846,2.072581,2.410193,1.861935,3.683345,3.995322,2.401710,-0.548258,-1.593613,-1.281636,-1088,664,44.243988,-2155,951,1609
129,2022,11,104.400002,2420.471680,3.058333,3.882467,3.907033,4.238337,4.501560,3.873067,0.024567,-0.628493,-0.365270,1265,866,68.458496,574,10559,7694,-212.428650,1.509946,0.863338,0.613259,3.256201,1.886305,0.973647,-0.250078,-0.912658,-2.282554,-702,506,50.156517,-1262,-1148,-34,-543.051697,2.258333,1.932900,1.557467,4.187753,3.993697,2.323400,-0.375433,-1.670297,-1.864353,-1524,811,66.486465,-1721,-1437,-724


In [None]:
# Check again for null or inf values after the dtype conversion
import numpy as np
df_economic.replace([np.inf, -np.inf], np.nan, inplace=True)
var = df_economic.isnull().sum()
print(var.to_string())

year                                            0
month                                           0
apartment_index                                 0
kospi_index                                     0
korea_rp                                        0
korea_3_year                                    0
korea_10_year                                   0
us_3_month                                      0
us_2_year                                       0
us_10_year                                      0
korea_10-3_year                                 0
us_10-2_year                                    0
us_10-3_year_month                              0
last_month_total_apartment_supply               0
last_month_total_unsold_count                   0
last_month_total_unsold_ratio                   0
last_month_total_deal_count                     0
last_month_total_full_rent_count                0
last_month_total_month_rent_count               0
kospi_index_6m_before                           0


In [None]:
df_economic.to_pickle('/content/drive/MyDrive/house_price/after_data/final_economic.pkl')

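>> Column dtypes can be converted to reduce memory usage.

>> After merging or modifying values, check whether any null or inf values exist. If they are found only later in the process, much of the downstream work has to be redone.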
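
The two notes above can be sketched as a pair of small helpers. This is one possible way to package the steps, not the notebook's own code: the function names `downcast_floats` and `count_missing_or_inf` and the toy frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def downcast_floats(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with every float64 column converted to float32."""
    float_cols = df.select_dtypes(include="float64").columns
    return df.astype({col: "float32" for col in float_cols})

def count_missing_or_inf(df: pd.DataFrame) -> pd.Series:
    """Count NaN and +/-inf values per column (inf is treated as missing)."""
    return df.replace([np.inf, -np.inf], np.nan).isnull().sum()

# Hypothetical toy frame with one inf value to detect.
toy = pd.DataFrame({
    "year": np.array([2012, 2012], dtype="int16"),
    "rate": [3.25, np.inf],
})

small = downcast_floats(toy)
bad = count_missing_or_inf(small)

print(small.dtypes)
print(bad)
```

Running the count right after each merge or transformation keeps the check cheap and makes it clear which step introduced a bad value.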

# Creating the df_area_deal, df_area_full_rent, and df_area_year_rent files

- For dates on which no apartment transaction was concluded, the most recently concluded transaction price is assumed to still hold
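
A minimal sketch of this assumption on a made-up table (the column names `complex_a` and `complex_b` and all prices are hypothetical). Forward fill carries the last observed price into later months that have no deal:

```python
import numpy as np
import pandas as pd

# Hypothetical price-per-area table: NaN means no deal in that month.
prices = pd.DataFrame(
    {
        "complex_a": [100.0, np.nan, np.nan, 120.0],
        "complex_b": [np.nan, 50.0, np.nan, np.nan],
    },
    index=pd.MultiIndex.from_tuples(
        [(2011, 1), (2011, 2), (2011, 3), (2011, 4)],
        names=["year", "month"],
    ),
)

# Carry the most recent deal price forward into months without a deal.
filled = prices.ffill()
print(filled)
```

Months before a complex's first recorded deal stay `NaN`, because there is no earlier price to carry forward.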

## Loading the required data

In [3]:
import pandas as pd
import numpy as np
# Load the data
df_deal = pd.read_csv("/content/drive/MyDrive/house_price/after_data/apartment_deal.csv",  encoding='UTF8')
df_month_rent = pd.read_csv("/content/drive/MyDrive/house_price/after_data/apartment_month_rent.csv",  encoding='UTF8')
df_full_rent = pd.read_csv("/content/drive/MyDrive/house_price/after_data/apartment_full_rent.csv",  encoding='UTF8')
df_economic = pd.read_csv("/content/drive/MyDrive/house_price/after_data/economic_data.csv",  encoding='UTF8') # used to get every monthly date

## Creating the monthly apartment sale pivot table

In [4]:
# Inspect a few sample rows
df_deal.head()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,deal_price
0,2011-07-09,2011,7,9,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,64000
1,2011-07-28,2011,7,28,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,65500
2,2011-01-19,2011,1,19,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,70500
3,2011-09-02,2011,9,2,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,85000
4,2011-12-17,2011,12,17,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,68000


In [5]:
# Add a price-per-area column
df_deal['area_deal_price'] = df_deal['deal_price'] / df_deal['area']
df_deal.head()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,deal_price,area_deal_price
0,2011-07-09,2011,7,9,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,64000,823.151125
1,2011-07-28,2011,7,28,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,65500,842.44373
2,2011-01-19,2011,1,19,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,70500,1047.859691
3,2011-09-02,2011,9,2,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,85000,1062.898587
4,2011-12-17,2011,12,17,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,68000,1010.701546


In [6]:
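# Assume the most recent deal price persists, and determine the price for every month
# To do this, group the deals by (year, month) and by address; months with several deals are averaged
import numpy as np
pivot_table_area_deal = df_deal.pivot_table(index=['year','month'], columns=['address_1','address_2','address_3','address_4'], values='area_deal_price', aggfunc='mean')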
pivot_table_area_deal


Unnamed: 0_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,...,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,address_3,12.0,12.0,138.0,140.0,141.0,166.0,172.0,176.0,177.0,179.0,...,307.0,314.0,318.0,331.0,413.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,address_4,0.0,2.0,0.0,0.0,0.0,4.0,3.0,1.0,0.0,0.0,...,76.0,1.0,81.0,64.0,8.0,0.0,0.0,0.0,0.0,0.0
year,month,Unnamed: 2_level_4,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4,Unnamed: 22_level_4
2011,1,1056.223061,,1985.413117,1843.539520,1834.608613,,,,,,...,,,,,,,450.299419,,,
2011,2,1047.044551,920.318252,1914.281706,2068.121832,1929.906477,,,,1119.359020,954.653938,...,,365.380762,,,,,462.221502,,445.152911,
2011,3,997.464329,,1983.669216,1966.517376,1728.264881,,,,1284.317191,,...,,359.487524,,,,,449.526357,,,
2011,4,1056.760361,,1965.817286,1828.668916,1904.404669,,,,,,...,,383.060476,,,,,452.296247,,,
2011,5,1009.316006,,1852.884387,1971.081572,1653.633357,,,,,,...,,,,,,371.55534,444.357652,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022,8,2801.826056,,,,,,,,,,...,,,,,,,,,,
2022,9,2772.754671,,4019.153604,,,,,,,,...,,,,,,818.61013,,,,
2022,10,2342.766727,,,,,,,,,,...,,,,,,,,,,
2022,11,2459.681105,,2963.637355,,,,,,,,...,,,,,,,,,,


In [7]:
pivot_table_area_deal.info() # January 2011 through December 2022 should give 144 monthly entries, and all 144 months appear in the deal data

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 144 entries, (2011, 1) to (2022, 12)
Columns: 8860 entries, ('강남구', '개포동', 12.0, 0.0) to ('중랑구', '중화동', 454.0, 0.0)
dtypes: float64(8860)
memory usage: 9.7 MB


In [8]:
# The most recently concluded price is kept as the deal price, so use ffill()
pivot_table_area_deal = pivot_table_area_deal.ffill()
pivot_table_area_deal

Unnamed: 0_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,...,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,address_3,12.0,12.0,138.0,140.0,141.0,166.0,172.0,176.0,177.0,179.0,...,307.0,314.0,318.0,331.0,413.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,address_4,0.0,2.0,0.0,0.0,0.0,4.0,3.0,1.0,0.0,0.0,...,76.0,1.0,81.0,64.0,8.0,0.0,0.0,0.0,0.0,0.0
year,month,Unnamed: 2_level_4,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4,Unnamed: 22_level_4
2011,1,1056.223061,,1985.413117,1843.539520,1834.608613,,,,,,...,,,,,,,450.299419,,,
2011,2,1047.044551,920.318252,1914.281706,2068.121832,1929.906477,,,,1119.359020,954.653938,...,,365.380762,,,,,462.221502,,445.152911,
2011,3,997.464329,920.318252,1983.669216,1966.517376,1728.264881,,,,1284.317191,954.653938,...,,359.487524,,,,,449.526357,,445.152911,
2011,4,1056.760361,920.318252,1965.817286,1828.668916,1904.404669,,,,1284.317191,954.653938,...,,383.060476,,,,,452.296247,,445.152911,
2011,5,1009.316006,920.318252,1852.884387,1971.081572,1653.633357,,,,1284.317191,954.653938,...,,383.060476,,,,371.555340,444.357652,,445.152911,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022,8,2801.826056,1779.004227,3857.953574,2508.957785,4208.615293,1413.594063,1342.758827,2172.968275,2380.225816,3014.696646,...,457.073761,872.199239,466.954023,956.130484,595.238095,874.431303,1163.591651,727.417008,1006.355932,1131.141746
2022,9,2772.754671,1779.004227,4019.153604,2508.957785,4208.615293,1413.594063,1342.758827,2172.968275,2380.225816,3014.696646,...,457.073761,872.199239,466.954023,956.130484,595.238095,818.610130,1163.591651,727.417008,1006.355932,1131.141746
2022,10,2342.766727,1779.004227,4019.153604,2508.957785,4208.615293,1413.594063,1342.758827,2172.968275,2380.225816,3014.696646,...,457.073761,872.199239,466.954023,956.130484,595.238095,818.610130,1163.591651,727.417008,1006.355932,1131.141746
2022,11,2459.681105,1779.004227,2963.637355,2508.957785,4208.615293,1413.594063,1342.758827,2172.968275,2380.225816,3014.696646,...,457.073761,872.199239,466.954023,956.130484,595.238095,818.610130,1163.591651,727.417008,1006.355932,1131.141746


## Creating the monthly apartment jeonse (lump-sum deposit lease) pivot table

- See the section on creating the apartment sale pivot table
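
Since the sale, jeonse, and annual-rent tables all follow the same three steps (price per area, monthly pivot with mean aggregation, forward fill), the steps could be wrapped in one helper. The function `build_monthly_area_pivot` below is a hypothetical sketch, not part of the notebook, and the toy data is made up:

```python
import pandas as pd

def build_monthly_area_pivot(df, price_col, key_cols):
    """Price per area -> (year, month) pivot with mean -> forward fill."""
    out = df.copy()
    area_col = f"area_{price_col}"
    out[area_col] = out[price_col] / out["area"]
    pivot = out.pivot_table(
        index=["year", "month"],
        columns=key_cols,
        values=area_col,
        aggfunc="mean",  # several deals in one month are averaged
    )
    return pivot.ffill()

# Hypothetical toy data: two deals for "A" in one month, one later deal for "B".
toy = pd.DataFrame({
    "year": [2011, 2011, 2011],
    "month": [1, 1, 3],
    "address_1": ["A", "A", "B"],
    "area": [50.0, 50.0, 40.0],
    "full_rent_price": [10000, 20000, 8000],
})

pivot = build_monthly_area_pivot(toy, "full_rent_price", ["address_1"])
print(pivot)
```

With the notebook's data, `key_cols` would be the four address columns used in the pivot cells of this section.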

In [9]:
df_full_rent.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1448686 entries, 0 to 1448685
Data columns (total 12 columns):
 #   Column           Non-Null Count    Dtype  
---  ------           --------------    -----  
 0   date             1448686 non-null  object 
 1   year             1448686 non-null  int64  
 2   month            1448686 non-null  int64  
 3   day              1448686 non-null  int64  
 4   address_0        1448686 non-null  object 
 5   address_1        1448686 non-null  object 
 6   address_2        1448686 non-null  object 
 7   address_3        1448686 non-null  float64
 8   address_4        1448686 non-null  float64
 9   name             1448686 non-null  object 
 10  area             1448686 non-null  float64
 11  full_rent_price  1448686 non-null  int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 132.6+ MB


In [10]:
# Add a price-per-area column
df_full_rent['area_full_rent_price'] = df_full_rent['full_rent_price'] / df_full_rent['area']
df_full_rent.head()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,full_rent_price,area_full_rent_price
0,2011-01-05,2011,1,5,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,35000,450.160772
1,2011-01-18,2011,1,18,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,20000,257.234727
2,2011-02-01,2011,2,1,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,24000,308.681672
3,2011-02-11,2011,2,11,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,31000,398.713826
4,2011-02-24,2011,2,24,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,30500,392.282958


In [11]:
pivot_table_area_full_rent = df_full_rent.pivot_table(index=['year','month'], columns=['address_1','address_2','address_3','address_4'], values='area_full_rent_price')
pivot_table_area_full_rent # Confirmed: when a month has several deals, the mean value is returned

Unnamed: 0_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,...,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,address_3,12.0,12.0,138.0,140.0,141.0,166.0,172.0,176.0,177.0,179.0,...,307.0,314.0,318.0,331.0,413.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,address_4,0.0,2.0,0.0,0.0,0.0,4.0,3.0,1.0,0.0,0.0,...,76.0,1.0,81.0,64.0,8.0,0.0,0.0,0.0,0.0,0.0
year,month,Unnamed: 2_level_4,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4,Unnamed: 22_level_4
2011,1,432.532897,440.045054,211.154904,258.848450,209.447885,,,,396.240321,425.511870,...,194.457948,231.800699,203.665988,,,198.560437,239.439022,,,
2011,2,420.187917,461.653016,209.653658,247.684779,206.098616,,,,412.395428,426.431723,...,,225.715445,,143.141648,,176.678445,246.031308,238.063748,235.404896,
2011,3,425.833338,,202.726317,240.732852,208.836187,,,,,445.923879,...,,187.021280,155.426409,,,252.780586,241.615927,,229.519774,
2011,4,414.627546,52.122115,206.219598,267.855555,195.808012,,,,400.612702,410.752418,...,,82.505333,,,,,244.869185,238.063748,,
2011,5,430.543477,409.530901,188.072915,245.028909,197.540818,300.388738,,,356.400356,372.179732,...,,243.181791,215.517241,,,188.457008,252.230868,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022,8,951.719030,977.289650,2038.030102,,,,,,,873.633966,...,,436.099619,,,,467.537034,565.292599,,,
2022,9,917.385474,,1635.203915,,,717.398987,,,854.440037,1063.517984,...,,,,,,,619.117902,,,
2022,10,908.431603,944.864228,1916.378550,,,,,,1119.359020,1094.711720,...,,,,,,,605.027652,,,
2022,11,867.409593,920.512565,1780.685565,,,,,,993.686192,1080.266298,...,,,,,,,611.818738,,661.235093,638.071606


In [12]:
pivot_table_area_full_rent.info() # 144 entries in total, from January 2011 through December 2022

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 144 entries, (2011, 1) to (2022, 12)
Columns: 9258 entries, ('강남구', '개포동', 12.0, 0.0) to ('중랑구', '중화동', 454.0, 0.0)
dtypes: float64(9258)
memory usage: 10.2 MB


In [13]:
pivot_table_area_full_rent = pivot_table_area_full_rent.ffill()
pivot_table_area_full_rent

Unnamed: 0_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,...,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,address_3,12.0,12.0,138.0,140.0,141.0,166.0,172.0,176.0,177.0,179.0,...,307.0,314.0,318.0,331.0,413.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,address_4,0.0,2.0,0.0,0.0,0.0,4.0,3.0,1.0,0.0,0.0,...,76.0,1.0,81.0,64.0,8.0,0.0,0.0,0.0,0.0,0.0
year,month,Unnamed: 2_level_4,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4,Unnamed: 22_level_4
2011,1,432.532897,440.045054,211.154904,258.848450,209.447885,,,,396.240321,425.511870,...,194.457948,231.800699,203.665988,,,198.560437,239.439022,,,
2011,2,420.187917,461.653016,209.653658,247.684779,206.098616,,,,412.395428,426.431723,...,194.457948,225.715445,203.665988,143.141648,,176.678445,246.031308,238.063748,235.404896,
2011,3,425.833338,461.653016,202.726317,240.732852,208.836187,,,,412.395428,445.923879,...,194.457948,187.021280,155.426409,143.141648,,252.780586,241.615927,238.063748,229.519774,
2011,4,414.627546,52.122115,206.219598,267.855555,195.808012,,,,400.612702,410.752418,...,194.457948,82.505333,155.426409,143.141648,,252.780586,244.869185,238.063748,229.519774,
2011,5,430.543477,409.530901,188.072915,245.028909,197.540818,300.388738,,,356.400356,372.179732,...,194.457948,243.181791,215.517241,143.141648,,188.457008,252.230868,238.063748,229.519774,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022,8,951.719030,977.289650,2038.030102,1493.370551,266.299118,591.311344,549.820467,931.272118,793.408605,873.633966,...,475.382003,436.099619,172.413793,192.930047,540.0914,467.537034,565.292599,595.159370,682.674200,494.874514
2022,9,917.385474,977.289650,1635.203915,1493.370551,266.299118,717.398987,549.820467,931.272118,854.440037,1063.517984,...,475.382003,436.099619,172.413793,192.930047,540.0914,467.537034,619.117902,595.159370,682.674200,494.874514
2022,10,908.431603,944.864228,1916.378550,1493.370551,266.299118,717.398987,549.820467,931.272118,1119.359020,1094.711720,...,475.382003,436.099619,172.413793,192.930047,540.0914,467.537034,605.027652,595.159370,682.674200,494.874514
2022,11,867.409593,920.512565,1780.685565,1493.370551,266.299118,717.398987,549.820467,931.272118,993.686192,1080.266298,...,475.382003,436.099619,172.413793,192.930047,540.0914,467.537034,611.818738,595.159370,661.235093,638.071606


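## Apartment monthly-rent pivot table -> monthly annual-rent pivot table

- See the section on creating the apartment sale pivot table
- The deposit differs from contract to contract
- The deposit of a monthly-rent contract is converted by applying the jeonse-to-monthly-rent conversion rate
- Because the deposit and the monthly rent vary from deal to deal, the annual rent (the amount paid over one year) is computed as 5.8% of the deposit plus monthly rent * 12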
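
A small worked example of the annual-rent formula, using the first row of the monthly-rent data shown in this section (deposit 19000, monthly rent 63, area 79.97). The function names are hypothetical; the 5.8% rate is the one stated above.

```python
DEPOSIT_CONVERSION_RATE = 0.058  # 5.8%, the rate used in this notebook

def annual_rent(deposit, monthly_rent, rate=DEPOSIT_CONVERSION_RATE):
    """Yearly cost: converted deposit plus twelve months of rent."""
    return deposit * rate + monthly_rent * 12

def annual_rent_per_area(deposit, monthly_rent, area):
    """Yearly cost divided by floor area."""
    return annual_rent(deposit, monthly_rent) / area

# 19000 * 0.058 = 1102 and 63 * 12 = 756, so the annual rent is about 1858.
example_rent = annual_rent(19000, 63)
example_per_area = annual_rent_per_area(19000, 63, 79.97)

print(round(example_rent, 2), round(example_per_area, 6))
```

The results agree with the `year_rent_price` (1858.0) and `area_year_rent_price` (23.233713) values in the table for that row.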

In [14]:
# Annual rent (cost over one year) = 5.8% of the deposit + monthly rent * 12
df_month_rent['year_rent_price'] = (df_month_rent['rent_deposit']*0.058)+(df_month_rent['month_rent_price']*12)
df_month_rent['area_year_rent_price'] = df_month_rent['year_rent_price'] / df_month_rent['area']
df_month_rent

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,rent_deposit,month_rent_price,year_rent_price,area_year_rent_price
0,2011-03-18,2011,3,18,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,19000,63,1858.0,23.233713
1,2011-04-09,2011,4,9,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,21000,35,1638.0,20.482681
2,2011-07-09,2011,7,9,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,3000,160,2094.0,26.184819
3,2011-09-19,2011,9,19,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,6000,140,2028.0,25.359510
4,2011-09-20,2011,9,20,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,5000,160,2210.0,27.635363
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
637083,2022-11-25,2022,11,25,서울특별시,중랑구,중화동,450.0,0.0,한신아파트(103~109),84.03,30000,48,2316.0,27.561585
637084,2022-12-10,2022,12,10,서울특별시,중랑구,중화동,450.0,0.0,한신아파트(103~109),59.76,25000,50,2050.0,34.303882
637085,2022-12-24,2022,12,24,서울특별시,중랑구,중화동,450.0,0.0,한신아파트(103~109),59.76,20000,50,1760.0,29.451138
637086,2022-12-28,2022,12,28,서울특별시,중랑구,중화동,450.0,0.0,한신아파트(103~109),84.03,5000,150,2090.0,24.872069


In [15]:
pivot_table_area_year_rent=df_month_rent.pivot_table(index=['year','month'], columns=['address_1','address_2','address_3','address_4'], values='area_year_rent_price')
pivot_table_area_year_rent

Unnamed: 0_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,...,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,address_3,12.0,12.0,138.0,140.0,141.0,172.0,176.0,177.0,179.0,185.0,...,307.0,307.0,314.0,318.0,331.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,address_4,0.0,2.0,0.0,0.0,0.0,3.0,1.0,0.0,0.0,0.0,...,6.0,76.0,1.0,81.0,64.0,0.0,0.0,0.0,0.0,0.0
year,month,Unnamed: 2_level_4,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4,Unnamed: 22_level_4
2011,1,29.358370,,17.911876,20.323663,18.050763,,,,29.895742,27.382319,...,,,,,,,13.922356,,,
2011,2,28.277485,,17.456082,18.333546,17.506052,,,,,26.441473,...,,14.316808,,,,,16.238534,,14.523557,
2011,3,27.614375,,17.629960,18.453881,16.910225,,,24.412572,29.644517,28.500476,...,,,,,,,14.892905,,,
2011,4,26.463862,25.650160,17.165162,18.575909,16.591916,,,,,23.895105,...,,,13.593862,,,,13.947456,15.500595,,
2011,5,27.021347,,11.109562,18.235376,16.988492,,,,,24.901078,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022,8,40.697447,39.761727,84.104350,,,,,48.971620,,39.601365,...,,,,,,,29.029074,,,
2022,9,45.050984,,90.944970,,,,,,,32.009178,...,,,26.969377,,,,35.475234,,,
2022,10,42.932170,48.741623,87.228899,,,,,,56.148725,39.161184,...,,,12.994336,,,,31.408411,21.293480,,
2022,11,47.229884,,79.793732,,,,,67.260516,52.003517,38.385315,...,,,27.508765,,,,29.190680,,,


In [16]:
print(pivot_table_area_year_rent.info()) # all 144 (year, month) index entries are present

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 144 entries, (2011, 1) to (2022, 12)
Columns: 8358 entries, ('강남구', '개포동', 12.0, 0.0) to ('중랑구', '중화동', 454.0, 0.0)
dtypes: float64(8358)
memory usage: 9.2 MB
None


In [17]:
pivot_table_area_year_rent=pivot_table_area_year_rent.ffill()
pivot_table_area_year_rent

Unnamed: 0_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,...,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,address_3,12.0,12.0,138.0,140.0,141.0,172.0,176.0,177.0,179.0,185.0,...,307.0,307.0,314.0,318.0,331.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,address_4,0.0,2.0,0.0,0.0,0.0,3.0,1.0,0.0,0.0,0.0,...,6.0,76.0,1.0,81.0,64.0,0.0,0.0,0.0,0.0,0.0
year,month,Unnamed: 2_level_4,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4,Unnamed: 22_level_4
2011,1,29.358370,,17.911876,20.323663,18.050763,,,,29.895742,27.382319,...,,,,,,,13.922356,,,
2011,2,28.277485,,17.456082,18.333546,17.506052,,,,29.895742,26.441473,...,,14.316808,,,,,16.238534,,14.523557,
2011,3,27.614375,,17.629960,18.453881,16.910225,,,24.412572,29.644517,28.500476,...,,14.316808,,,,,14.892905,,14.523557,
2011,4,26.463862,25.650160,17.165162,18.575909,16.591916,,,24.412572,29.644517,23.895105,...,,14.316808,13.593862,,,,13.947456,15.500595,14.523557,
2011,5,27.021347,25.650160,11.109562,18.235376,16.988492,,,24.412572,29.644517,24.901078,...,,14.316808,13.593862,,,,13.947456,15.500595,14.523557,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022,8,40.697447,39.761727,84.104350,11.258409,6.298209,20.496709,59.353076,48.971620,29.116945,39.601365,...,17.172279,16.291854,26.415850,4.281609,12.4471,6.555443,29.029074,19.574130,24.442083,22.19917
2022,9,45.050984,39.761727,90.944970,11.258409,6.298209,20.496709,59.353076,48.971620,29.116945,32.009178,...,17.172279,16.291854,26.969377,4.281609,12.4471,6.555443,35.475234,19.574130,24.442083,22.19917
2022,10,42.932170,48.741623,87.228899,11.258409,6.298209,20.496709,59.353076,48.971620,56.148725,39.161184,...,17.172279,16.291854,12.994336,4.281609,12.4471,6.555443,31.408411,21.293480,24.442083,22.19917
2022,11,47.229884,48.741623,79.793732,11.258409,6.298209,20.496709,59.353076,67.260516,52.003517,38.385315,...,17.172279,16.291854,27.508765,4.281609,12.4471,6.555443,29.190680,21.293480,24.442083,22.19917


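- The files in the deal_everyday folder are the apartment_deal file extended with the sale-price status for every date.
- The files in the full_rent_everyday folder are the apartment_full_rent file extended with the jeonse-price status for every date.
- The files in the year_rent_everyday folder are the apartment_month_rent file extended with the monthly-rent status for every date.

- The apartment_deal, apartment_full_rent, and apartment_month_rent DataFrames built earlier only contain data for the dates on which a deal was closed.
- New DataFrames are created to obtain sale, jeonse, and annual-rent values for every date in the period, not only the deal dates.
- For later processing, the pivot tables built above must be restructured into the columns address_1, address_2, address_3, address_4, year, month, day, and the deal price (sale price, jeonse price, or annual rent).
- However, the pivot tables have so many columns that the restructuring ran out of memory.
- Several approaches were tried to work around the memory shortage.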
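
One way to keep memory bounded during this kind of restructuring is to reshape one block of columns at a time and combine (or save) the pieces afterwards. The sketch below is a hedged illustration, not necessarily one of the approaches tried in this project: the function name `wide_to_long_in_chunks`, the `chunk_size` parameter, and the tiny `toy` table are assumptions made for the example.

```python
import pandas as pd


def wide_to_long_in_chunks(pivot, value_name, chunk_size=1000):
    """Yield long-format pieces of a wide pivot table, one block of columns at a time."""
    n_levels = pivot.columns.nlevels
    for start in range(0, pivot.shape[1], chunk_size):
        block = pivot.iloc[:, start:start + chunk_size]
        # Move every column level into the index, then discard cells without a value.
        long_block = block.stack(list(range(n_levels))).dropna()
        yield long_block.rename(value_name).reset_index()


# Toy pivot table: (year, month) rows and two-level address columns.
index = pd.MultiIndex.from_product([[2011], [1, 2]], names=['year', 'month'])
columns = pd.MultiIndex.from_tuples(
    [('강남구', '개포동'), ('중랑구', '중화동')], names=['address_1', 'address_2']
)
toy = pd.DataFrame([[10.0, float('nan')], [11.0, 5.0]], index=index, columns=columns)

pieces = list(wide_to_long_in_chunks(toy, 'area_deal', chunk_size=1))
long_df = pd.concat(pieces, ignore_index=True)
print(long_df)
```

In a real run each piece could be written to its own file instead of being collected in a list, so that only one block is held in memory at a time.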

## Creating the df_area_deal file

- The data has to be saved in several parts, so the files are collected in a folder.

In [18]:
# Here pivot_table_area_deal is the table before reset_index is applied
pivot_table_area_deal.head()

Unnamed: 0_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,...,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,address_3,12.0,12.0,138.0,140.0,141.0,166.0,172.0,176.0,177.0,179.0,...,307.0,314.0,318.0,331.0,413.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,address_4,0.0,2.0,0.0,0.0,0.0,4.0,3.0,1.0,0.0,0.0,...,76.0,1.0,81.0,64.0,8.0,0.0,0.0,0.0,0.0,0.0
year,month,Unnamed: 2_level_4,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4,Unnamed: 22_level_4
2011,1,1056.223061,,1985.413117,1843.53952,1834.608613,,,,,,...,,,,,,,450.299419,,,
2011,2,1047.044551,920.318252,1914.281706,2068.121832,1929.906477,,,,1119.35902,954.653938,...,,365.380762,,,,,462.221502,,445.152911,
2011,3,997.464329,920.318252,1983.669216,1966.517376,1728.264881,,,,1284.317191,954.653938,...,,359.487524,,,,,449.526357,,445.152911,
2011,4,1056.760361,920.318252,1965.817286,1828.668916,1904.404669,,,,1284.317191,954.653938,...,,383.060476,,,,,452.296247,,445.152911,
2011,5,1009.316006,920.318252,1852.884387,1971.081572,1653.633357,,,,1284.317191,954.653938,...,,383.060476,,,,371.55534,444.357652,,445.152911,


In [19]:
pivot_table_area_deal.info() # inspect the pivot table

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 144 entries, (2011, 1) to (2022, 12)
Columns: 8860 entries, ('강남구', '개포동', 12.0, 0.0) to ('중랑구', '중화동', 454.0, 0.0)
dtypes: float64(8860)
memory usage: 9.7 MB


In [20]:
# Fill the null values: otherwise the later stack() step leaves the null cells out
pivot_table_area_deal = pivot_table_area_deal.fillna(0)
pivot_table_area_deal

Unnamed: 0_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,...,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,address_3,12.0,12.0,138.0,140.0,141.0,166.0,172.0,176.0,177.0,179.0,...,307.0,314.0,318.0,331.0,413.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,address_4,0.0,2.0,0.0,0.0,0.0,4.0,3.0,1.0,0.0,0.0,...,76.0,1.0,81.0,64.0,8.0,0.0,0.0,0.0,0.0,0.0
year,month,Unnamed: 2_level_4,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4,Unnamed: 22_level_4
2011,1,1056.223061,0.000000,1985.413117,1843.539520,1834.608613,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,450.299419,0.000000,0.000000,0.000000
2011,2,1047.044551,920.318252,1914.281706,2068.121832,1929.906477,0.000000,0.000000,0.000000,1119.359020,954.653938,...,0.000000,365.380762,0.000000,0.000000,0.000000,0.000000,462.221502,0.000000,445.152911,0.000000
2011,3,997.464329,920.318252,1983.669216,1966.517376,1728.264881,0.000000,0.000000,0.000000,1284.317191,954.653938,...,0.000000,359.487524,0.000000,0.000000,0.000000,0.000000,449.526357,0.000000,445.152911,0.000000
2011,4,1056.760361,920.318252,1965.817286,1828.668916,1904.404669,0.000000,0.000000,0.000000,1284.317191,954.653938,...,0.000000,383.060476,0.000000,0.000000,0.000000,0.000000,452.296247,0.000000,445.152911,0.000000
2011,5,1009.316006,920.318252,1852.884387,1971.081572,1653.633357,0.000000,0.000000,0.000000,1284.317191,954.653938,...,0.000000,383.060476,0.000000,0.000000,0.000000,371.555340,444.357652,0.000000,445.152911,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022,8,2801.826056,1779.004227,3857.953574,2508.957785,4208.615293,1413.594063,1342.758827,2172.968275,2380.225816,3014.696646,...,457.073761,872.199239,466.954023,956.130484,595.238095,874.431303,1163.591651,727.417008,1006.355932,1131.141746
2022,9,2772.754671,1779.004227,4019.153604,2508.957785,4208.615293,1413.594063,1342.758827,2172.968275,2380.225816,3014.696646,...,457.073761,872.199239,466.954023,956.130484,595.238095,818.610130,1163.591651,727.417008,1006.355932,1131.141746
2022,10,2342.766727,1779.004227,4019.153604,2508.957785,4208.615293,1413.594063,1342.758827,2172.968275,2380.225816,3014.696646,...,457.073761,872.199239,466.954023,956.130484,595.238095,818.610130,1163.591651,727.417008,1006.355932,1131.141746
2022,11,2459.681105,1779.004227,2963.637355,2508.957785,4208.615293,1413.594063,1342.758827,2172.968275,2380.225816,3014.696646,...,457.073761,872.199239,466.954023,956.130484,595.238095,818.610130,1163.591651,727.417008,1006.355932,1131.141746


>> Because stack() leaves out null values, the nulls are filled beforehand so that the values are not altered during the reshape.
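
The fill-then-reshape route described in the note can be compared with reshaping first and discarding missing cells afterwards. The following is a minimal sketch on a made-up two-column table; the names `wide`, `complex`, and `price` are assumptions for illustration. Note that the 0 sentinel also removes any cell whose real value is 0, which is harmless only when a price of 0 cannot occur.

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame(
    {'A': [1.5, np.nan], 'B': [np.nan, 2.5]},
    index=pd.Index([1, 2], name='month'),
)
wide.columns.name = 'complex'

# Route used in this notebook: fill with a 0 sentinel, reshape, then drop the sentinel rows.
filled = wide.fillna(0).stack().rename('price').reset_index()
via_sentinel = filled[filled['price'] != 0].reset_index(drop=True)

# Alternative: reshape first, then discard whatever missing cells remain.
via_dropna = wide.stack().dropna().rename('price').reset_index()

print(via_sentinel)
print(via_dropna)
```

Both routes keep only the cells that actually hold a value, so on this table they produce the same two rows.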

In [21]:
# When slicing columns to process values, a large number of columns consumes more memory than a large number of rows, so the table is transposed
pivot_table_area_deal = pivot_table_area_deal.T
pivot_table_area_deal

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,year,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,...,2022,2022,2022,2022,2022,2022,2022,2022,2022,2022
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,month,1,2,3,4,5,6,7,8,9,10,...,3,4,5,6,7,8,9,10,11,12
address_1,address_2,address_3,address_4,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2,Unnamed: 24_level_2
강남구,개포동,12.0,0.0,1056.223061,1047.044551,997.464329,1056.760361,1009.316006,1006.335055,1037.309983,1022.117418,1004.370464,938.972701,...,3414.733596,3491.675164,3491.675164,3059.071730,3154.856873,2801.826056,2772.754671,2342.766727,2459.681105,2042.620747
강남구,개포동,12.0,2.0,0.000000,920.318252,920.318252,920.318252,920.318252,920.318252,920.318252,920.318252,926.255789,926.255789,...,1460.634129,1460.634129,1460.634129,1460.634129,1779.004227,1779.004227,1779.004227,1779.004227,1779.004227,1779.004227
강남구,개포동,138.0,0.0,1985.413117,1914.281706,1983.669216,1965.817286,1852.884387,1817.946369,1843.967435,1856.844091,1793.024518,1669.471442,...,4492.752961,4004.153817,3857.953574,3857.953574,3857.953574,3857.953574,4019.153604,4019.153604,2963.637355,2888.242098
강남구,개포동,140.0,0.0,1843.539520,2068.121832,1966.517376,1828.668916,1971.081572,1813.294626,1738.376472,1962.128801,1787.659569,1585.645329,...,2508.957785,2508.957785,2508.957785,2508.957785,2508.957785,2508.957785,2508.957785,2508.957785,2508.957785,2508.957785
강남구,개포동,141.0,0.0,1834.608613,1929.906477,1728.264881,1904.404669,1653.633357,1621.679351,1752.040181,1707.681397,1664.676817,1602.701955,...,4208.615293,4208.615293,4208.615293,4208.615293,4208.615293,4208.615293,4208.615293,4208.615293,4208.615293,4208.615293
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
중랑구,중화동,438.0,0.0,0.000000,0.000000,0.000000,0.000000,371.555340,371.555340,371.555340,371.555340,371.555340,421.300977,...,874.431303,874.431303,874.431303,874.431303,874.431303,874.431303,818.610130,818.610130,818.610130,818.610130
중랑구,중화동,450.0,0.0,450.299419,462.221502,449.526357,452.296247,444.357652,445.558466,465.444836,419.313137,446.894347,425.661949,...,1221.552878,1221.552878,1221.552878,1221.552878,1163.591651,1163.591651,1163.591651,1163.591651,1163.591651,1163.591651
중랑구,중화동,452.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,420.579288,420.579288,416.611559,416.611559,416.611559,...,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008
중랑구,중화동,453.0,0.0,0.000000,445.152911,445.152911,445.152911,445.152911,445.152911,401.464163,429.613936,429.613936,429.613936,...,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932


In [22]:
# Convert the pivot table into a regular (long-format) DataFrame
df_area_deal = pivot_table_area_deal.stack(level=[0,1])
df_area_deal = df_area_deal.reset_index()
df_area_deal.columns = ['address_1','address_2','address_3','address_4','year','month','area_deal'] # rename the columns
df_area_deal = df_area_deal.astype({'address_3': 'int16', 'address_4': 'int16'})
df_area_deal = df_area_deal.drop(df_area_deal[df_area_deal.area_deal == 0].index) # nulls were replaced with 0 above, so rows whose value is 0 are removed
df_area_deal

Unnamed: 0,address_1,address_2,address_3,address_4,year,month,area_deal
0,강남구,개포동,12,0,2011,1,1056.223061
1,강남구,개포동,12,0,2011,2,1047.044551
2,강남구,개포동,12,0,2011,3,997.464329
3,강남구,개포동,12,0,2011,4,1056.760361
4,강남구,개포동,12,0,2011,5,1009.316006
...,...,...,...,...,...,...,...
1275835,중랑구,중화동,454,0,2022,8,1131.141746
1275836,중랑구,중화동,454,0,2022,9,1131.141746
1275837,중랑구,중화동,454,0,2022,10,1131.141746
1275838,중랑구,중화동,454,0,2022,11,1131.141746


In [23]:
df_area_deal.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1082921 entries, 0 to 1275839
Data columns (total 7 columns):
 #   Column     Non-Null Count    Dtype  
---  ------     --------------    -----  
 0   address_1  1082921 non-null  object 
 1   address_2  1082921 non-null  object 
 2   address_3  1082921 non-null  int16  
 3   address_4  1082921 non-null  int16  
 4   year       1082921 non-null  int64  
 5   month      1082921 non-null  int64  
 6   area_deal  1082921 non-null  float64
dtypes: float64(1), int16(2), int64(2), object(2)
memory usage: 53.7+ MB


In [24]:
df_area_deal.to_pickle('/content/drive/MyDrive/house_price/after_data/df_area_deal.pkl')

## Creating the df_area_full_rent file

- See the deal_everyday (df_area_deal) creation section above.

In [25]:
pivot_table_area_full_rent.info() # 9,258 columns in total

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 144 entries, (2011, 1) to (2022, 12)
Columns: 9258 entries, ('강남구', '개포동', 12.0, 0.0) to ('중랑구', '중화동', 454.0, 0.0)
dtypes: float64(9258)
memory usage: 10.2 MB


In [26]:
pivot_table_area_full_rent = pivot_table_area_full_rent.fillna(0)
pivot_table_area_full_rent

Unnamed: 0_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,...,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,address_3,12.0,12.0,138.0,140.0,141.0,166.0,172.0,176.0,177.0,179.0,...,307.0,314.0,318.0,331.0,413.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,address_4,0.0,2.0,0.0,0.0,0.0,4.0,3.0,1.0,0.0,0.0,...,76.0,1.0,81.0,64.0,8.0,0.0,0.0,0.0,0.0,0.0
year,month,Unnamed: 2_level_4,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4,Unnamed: 22_level_4
2011,1,432.532897,440.045054,211.154904,258.848450,209.447885,0.000000,0.000000,0.000000,396.240321,425.511870,...,194.457948,231.800699,203.665988,0.000000,0.0000,198.560437,239.439022,0.000000,0.000000,0.000000
2011,2,420.187917,461.653016,209.653658,247.684779,206.098616,0.000000,0.000000,0.000000,412.395428,426.431723,...,194.457948,225.715445,203.665988,143.141648,0.0000,176.678445,246.031308,238.063748,235.404896,0.000000
2011,3,425.833338,461.653016,202.726317,240.732852,208.836187,0.000000,0.000000,0.000000,412.395428,445.923879,...,194.457948,187.021280,155.426409,143.141648,0.0000,252.780586,241.615927,238.063748,229.519774,0.000000
2011,4,414.627546,52.122115,206.219598,267.855555,195.808012,0.000000,0.000000,0.000000,400.612702,410.752418,...,194.457948,82.505333,155.426409,143.141648,0.0000,252.780586,244.869185,238.063748,229.519774,0.000000
2011,5,430.543477,409.530901,188.072915,245.028909,197.540818,300.388738,0.000000,0.000000,356.400356,372.179732,...,194.457948,243.181791,215.517241,143.141648,0.0000,188.457008,252.230868,238.063748,229.519774,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022,8,951.719030,977.289650,2038.030102,1493.370551,266.299118,591.311344,549.820467,931.272118,793.408605,873.633966,...,475.382003,436.099619,172.413793,192.930047,540.0914,467.537034,565.292599,595.159370,682.674200,494.874514
2022,9,917.385474,977.289650,1635.203915,1493.370551,266.299118,717.398987,549.820467,931.272118,854.440037,1063.517984,...,475.382003,436.099619,172.413793,192.930047,540.0914,467.537034,619.117902,595.159370,682.674200,494.874514
2022,10,908.431603,944.864228,1916.378550,1493.370551,266.299118,717.398987,549.820467,931.272118,1119.359020,1094.711720,...,475.382003,436.099619,172.413793,192.930047,540.0914,467.537034,605.027652,595.159370,682.674200,494.874514
2022,11,867.409593,920.512565,1780.685565,1493.370551,266.299118,717.398987,549.820467,931.272118,993.686192,1080.266298,...,475.382003,436.099619,172.413793,192.930047,540.0914,467.537034,611.818738,595.159370,661.235093,638.071606


In [27]:
# Convert the pivot table into a regular (long-format) DataFrame
pivot_table_area_full_rent = pivot_table_area_full_rent.T
df_area_full_rent = pivot_table_area_full_rent.stack(level=[0,1])
df_area_full_rent = df_area_full_rent.reset_index()
df_area_full_rent.columns = ['address_1','address_2','address_3','address_4','year','month','area_full_rent'] # rename the columns
df_area_full_rent = df_area_full_rent.astype({'address_3': 'int16', 'address_4': 'int16'})
df_area_full_rent = df_area_full_rent.drop(df_area_full_rent[df_area_full_rent.area_full_rent == 0].index) # nulls were replaced with 0 above, so rows whose value is 0 are removed
df_area_full_rent

Unnamed: 0,address_1,address_2,address_3,address_4,year,month,area_full_rent
0,강남구,개포동,12,0,2011,1,432.532897
1,강남구,개포동,12,0,2011,2,420.187917
2,강남구,개포동,12,0,2011,3,425.833338
3,강남구,개포동,12,0,2011,4,414.627546
4,강남구,개포동,12,0,2011,5,430.543477
...,...,...,...,...,...,...,...
1333147,중랑구,중화동,454,0,2022,8,494.874514
1333148,중랑구,중화동,454,0,2022,9,494.874514
1333149,중랑구,중화동,454,0,2022,10,494.874514
1333150,중랑구,중화동,454,0,2022,11,638.071606


In [28]:
df_area_full_rent.to_pickle('/content/drive/MyDrive/house_price/after_data/df_area_full_rent.pkl')

## Creating the df_area_year_rent file

- See the deal_everyday (df_area_deal) creation section above.

In [29]:
pivot_table_area_year_rent.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 144 entries, (2011, 1) to (2022, 12)
Columns: 8358 entries, ('강남구', '개포동', 12.0, 0.0) to ('중랑구', '중화동', 454.0, 0.0)
dtypes: float64(8358)
memory usage: 9.2 MB


In [30]:
pivot_table_area_year_rent = pivot_table_area_year_rent.fillna(0)
pivot_table_area_year_rent

Unnamed: 0_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,...,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,address_3,12.0,12.0,138.0,140.0,141.0,172.0,176.0,177.0,179.0,185.0,...,307.0,307.0,314.0,318.0,331.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,address_4,0.0,2.0,0.0,0.0,0.0,3.0,1.0,0.0,0.0,0.0,...,6.0,76.0,1.0,81.0,64.0,0.0,0.0,0.0,0.0,0.0
year,month,Unnamed: 2_level_4,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4,Unnamed: 22_level_4
2011,1,29.358370,0.000000,17.911876,20.323663,18.050763,0.000000,0.000000,0.000000,29.895742,27.382319,...,0.000000,0.000000,0.000000,0.000000,0.0000,0.000000,13.922356,0.000000,0.000000,0.00000
2011,2,28.277485,0.000000,17.456082,18.333546,17.506052,0.000000,0.000000,0.000000,29.895742,26.441473,...,0.000000,14.316808,0.000000,0.000000,0.0000,0.000000,16.238534,0.000000,14.523557,0.00000
2011,3,27.614375,0.000000,17.629960,18.453881,16.910225,0.000000,0.000000,24.412572,29.644517,28.500476,...,0.000000,14.316808,0.000000,0.000000,0.0000,0.000000,14.892905,0.000000,14.523557,0.00000
2011,4,26.463862,25.650160,17.165162,18.575909,16.591916,0.000000,0.000000,24.412572,29.644517,23.895105,...,0.000000,14.316808,13.593862,0.000000,0.0000,0.000000,13.947456,15.500595,14.523557,0.00000
2011,5,27.021347,25.650160,11.109562,18.235376,16.988492,0.000000,0.000000,24.412572,29.644517,24.901078,...,0.000000,14.316808,13.593862,0.000000,0.0000,0.000000,13.947456,15.500595,14.523557,0.00000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022,8,40.697447,39.761727,84.104350,11.258409,6.298209,20.496709,59.353076,48.971620,29.116945,39.601365,...,17.172279,16.291854,26.415850,4.281609,12.4471,6.555443,29.029074,19.574130,24.442083,22.19917
2022,9,45.050984,39.761727,90.944970,11.258409,6.298209,20.496709,59.353076,48.971620,29.116945,32.009178,...,17.172279,16.291854,26.969377,4.281609,12.4471,6.555443,35.475234,19.574130,24.442083,22.19917
2022,10,42.932170,48.741623,87.228899,11.258409,6.298209,20.496709,59.353076,48.971620,56.148725,39.161184,...,17.172279,16.291854,12.994336,4.281609,12.4471,6.555443,31.408411,21.293480,24.442083,22.19917
2022,11,47.229884,48.741623,79.793732,11.258409,6.298209,20.496709,59.353076,67.260516,52.003517,38.385315,...,17.172279,16.291854,27.508765,4.281609,12.4471,6.555443,29.190680,21.293480,24.442083,22.19917


In [31]:
# Convert the pivot table into a regular (long-format) DataFrame
pivot_table_area_year_rent = pivot_table_area_year_rent.T
df_area_year_rent = pivot_table_area_year_rent.stack(level=[0,1])
df_area_year_rent = df_area_year_rent.reset_index()
df_area_year_rent.columns = ['address_1','address_2','address_3','address_4','year','month','area_year_rent'] # rename the columns
df_area_year_rent = df_area_year_rent.astype({'address_3': 'int16', 'address_4': 'int16'})
df_area_year_rent = df_area_year_rent.drop(df_area_year_rent[df_area_year_rent.area_year_rent == 0].index) # nulls were replaced with 0 above, so rows whose value is 0 are removed
df_area_year_rent

Unnamed: 0,address_1,address_2,address_3,address_4,year,month,area_year_rent
0,강남구,개포동,12,0,2011,1,29.358370
1,강남구,개포동,12,0,2011,2,28.277485
2,강남구,개포동,12,0,2011,3,27.614375
3,강남구,개포동,12,0,2011,4,26.463862
4,강남구,개포동,12,0,2011,5,27.021347
...,...,...,...,...,...,...,...
1203547,중랑구,중화동,454,0,2022,8,22.199170
1203548,중랑구,중화동,454,0,2022,9,22.199170
1203549,중랑구,중화동,454,0,2022,10,22.199170
1203550,중랑구,중화동,454,0,2022,11,22.199170


In [32]:
df_area_year_rent.to_pickle('/content/drive/MyDrive/house_price/after_data/df_area_year_rent.pkl')

# Creating df_address

In [7]:
import pandas as pd
import os
# Load the saved data
df_area_deal = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/df_area_deal.pkl')
df_area_full_rent = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/df_area_full_rent.pkl')
df_area_year_rent = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/df_area_year_rent.pkl')

In [8]:
addr_cols = ['address_1','address_2','address_3','address_4']
df_address = pd.concat([df_area_deal[addr_cols], df_area_full_rent[addr_cols], df_area_year_rent[addr_cols]], axis=0)
df_address

Unnamed: 0,address_1,address_2,address_3,address_4
0,강남구,개포동,12,0
1,강남구,개포동,12,0
2,강남구,개포동,12,0
3,강남구,개포동,12,0
4,강남구,개포동,12,0
...,...,...,...,...
1203547,중랑구,중화동,454,0
1203548,중랑구,중화동,454,0
1203549,중랑구,중화동,454,0
1203550,중랑구,중화동,454,0


In [9]:
df_address = df_address.drop_duplicates(subset=['address_1','address_2','address_3','address_4'], keep='last')
df_address.reset_index(inplace=True,drop=True)
df_address

Unnamed: 0,address_1,address_2,address_3,address_4
0,강남구,대치동,633,22
1,강남구,도곡동,153,2
2,강남구,도곡동,193,67
3,강남구,도곡동,893,2
4,강남구,역삼동,709,0
...,...,...,...,...
9681,중랑구,중화동,438,0
9682,중랑구,중화동,450,0
9683,중랑구,중화동,452,0
9684,중랑구,중화동,453,0


In [10]:
df_address['address'] = df_address['address_1'] +' '+ df_address['address_2'] +' '+ df_address['address_3'].apply(str) +'-'+ df_address['address_4'].apply(str)
df_address

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_address['address'] = df_address['address_1'] +' '+ df_address['address_2'] +' '+ df_address['address_3'].apply(str) +'-'+ df_address['address_4'].apply(str)


Unnamed: 0,address_1,address_2,address_3,address_4,address
0,강남구,대치동,633,22,강남구 대치동 633-22
1,강남구,도곡동,153,2,강남구 도곡동 153-2
2,강남구,도곡동,193,67,강남구 도곡동 193-67
3,강남구,도곡동,893,2,강남구 도곡동 893-2
4,강남구,역삼동,709,0,강남구 역삼동 709-0
...,...,...,...,...,...
9681,중랑구,중화동,438,0,중랑구 중화동 438-0
9682,중랑구,중화동,450,0,중랑구 중화동 450-0
9683,중랑구,중화동,452,0,중랑구 중화동 452-0
9684,중랑구,중화동,453,0,중랑구 중화동 453-0
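
The warning printed for this cell is pandas reporting that a column is being assigned on a frame it still treats as derived from another one. A common way to avoid it is to take an explicit `.copy()` of the de-duplicated frame before adding columns. The sketch below uses a made-up three-row table; the names `raw`, `key_cols`, and `unique_addr` are assumptions for illustration.

```python
import pandas as pd

raw = pd.DataFrame({
    'address_1': ['강남구', '강남구', '중랑구'],
    'address_2': ['개포동', '개포동', '중화동'],
    'address_3': [12, 12, 454],
    'address_4': [0, 0, 0],
})
key_cols = ['address_1', 'address_2', 'address_3', 'address_4']

# .copy() makes the de-duplicated frame an independent object before it is modified.
unique_addr = (
    raw.drop_duplicates(subset=key_cols, keep='last')
       .reset_index(drop=True)
       .copy()
)

# Build "district neighbourhood main-sub" strings, e.g. "강남구 개포동 12-0".
unique_addr['address'] = (
    unique_addr['address_1'] + ' ' + unique_addr['address_2'] + ' '
    + unique_addr['address_3'].astype(str) + '-' + unique_addr['address_4'].astype(str)
)
print(unique_addr)
```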


## Using the Google API (drawbacks)

In [44]:
!pip install -U googlemaps

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting googlemaps
  Downloading googlemaps-4.10.0.tar.gz (33 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: googlemaps
  Building wheel for googlemaps (setup.py) ... [?25l[?25hdone
  Created wheel for googlemaps: filename=googlemaps-4.10.0-py3-none-any.whl size=40718 sha256=b3003e37bc58ca2bc265112b322f8db53b259420df231c4b7e5600f15e00fdf4
  Stored in directory: /root/.cache/pip/wheels/d9/5f/46/54a2bdb4bcb07d3faba4463d2884865705914cc72a7b8bb5f0
Successfully built googlemaps
Installing collected packages: googlemaps
Successfully installed googlemaps-4.10.0


In [45]:
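# Load the Google Maps API client
import os
import googlemaps
from datetime import datetime
my_key = os.environ["GOOGLE_MAPS_API_KEY"]  # Google Maps API key, read from an environment variable (the variable name is an assumption) instead of being hard-coded
maps = googlemaps.Client(key=my_key)  # create the Google Maps API client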

In [None]:
df_address_google = df_address.copy()

In [46]:
import time  # used to measure the total geocoding time

lat = []  # latitudes
lng = []  # longitudes

t1 = time.time()  # timestamp before geocoding starts

for i, address in enumerate(df_address_google['address'], start=1):
    try:
        geo_location = maps.geocode(address)[0].get('geometry')
        lat.append(geo_location['location']['lat'])
        lng.append(geo_location['location']['lng'])
    except Exception:
        # No coordinates could be fetched: store empty values and report the (1-based) row
        lat.append('')
        lng.append('')
        print("Error at row %d" % i)

print(time.time() - t1)  # total geocoding time in seconds

784.9911198616028


In [47]:
df_address_google['lat'] = lat
df_address_google['lng'] = lng
df_address_google

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_address['lat'] = lat
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_address['lng'] = lng


Unnamed: 0,address_1,address_2,address_3,address_4,address,lat,lng
0,강남구,대치동,633,22,강남구 대치동 633-22,37.495111,127.057637
1,강남구,도곡동,153,2,강남구 도곡동 153-2,37.488505,127.045644
2,강남구,도곡동,193,67,강남구 도곡동 193-67,37.487497,127.044117
3,강남구,도곡동,893,2,강남구 도곡동 893-2,37.487448,127.039270
4,강남구,역삼동,709,0,강남구 역삼동 709-0,37.502436,127.046655
...,...,...,...,...,...,...,...
9681,중랑구,중화동,438,0,중랑구 중화동 438-0,37.604499,127.077902
9682,중랑구,중화동,450,0,중랑구 중화동 450-0,37.597069,127.081830
9683,중랑구,중화동,452,0,중랑구 중화동 452-0,37.599716,127.078481
9684,중랑구,중화동,453,0,중랑구 중화동 453-0,37.602642,127.080072


In [49]:
df_address_google[df_address_google.duplicated(subset=['lat','lng'])]


Unnamed: 0,address_1,address_2,address_3,address_4,address,lat,lng
77,서초구,내곡동,619,0,서초구 내곡동 619-0,37.461771,127.051160
97,성북구,동소문동7가,280,0,성북구 동소문동7가 280-0,37.596110,127.014267
301,강서구,방화동,907,0,강서구 방화동 907-0,37.576456,126.813873
1364,강남구,개포동,1280,0,강남구 개포동 1280-0,37.478964,127.060928
1365,강남구,개포동,1282,0,강남구 개포동 1282-0,37.478964,127.060928
...,...,...,...,...,...,...,...
9603,중랑구,상봉동,193,1,중랑구 상봉동 193-1,37.599074,127.089344
9619,중랑구,상봉동,500,0,중랑구 상봉동 500-0,37.599074,127.089344
9620,중랑구,상봉동,501,0,중랑구 상봉동 501-0,37.599074,127.089344
9659,중랑구,신내동,826,0,중랑구 신내동 826-0,37.610331,127.096077
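
The check above lists only the second and later rows of each repeated coordinate pair. To see which distinct addresses were resolved to the same point, every member of a repeated pair can be kept and grouped. The following is a small sketch, assuming a frame shaped like `df_address_google`; the name `geo` and the three sample rows (copied from the outputs above) are chosen for illustration.

```python
import pandas as pd

geo = pd.DataFrame({
    'address': ['강남구 개포동 1280-0', '강남구 개포동 1282-0', '강남구 역삼동 709-0'],
    'lat': [37.478964, 37.478964, 37.502436],
    'lng': [127.060928, 127.060928, 127.046655],
})

# keep=False marks every row of a repeated (lat, lng) pair, not just the later ones.
shared = geo[geo.duplicated(subset=['lat', 'lng'], keep=False)]

# One list of addresses per shared coordinate pair.
groups = shared.groupby(['lat', 'lng'])['address'].apply(list)
print(groups)
```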


## Using the Naver API

In [11]:
# Libraries used
import numpy as np
import pandas as pd
from urllib.request import urlopen
from urllib import parse
from urllib.request import Request
from urllib.error import HTTPError
from bs4 import BeautifulSoup
import json

In [13]:
df_address_naver = df_address.copy()
df_address_naver

Unnamed: 0,address_1,address_2,address_3,address_4,address
0,강남구,대치동,633,22,강남구 대치동 633-22
1,강남구,도곡동,153,2,강남구 도곡동 153-2
2,강남구,도곡동,193,67,강남구 도곡동 193-67
3,강남구,도곡동,893,2,강남구 도곡동 893-2
4,강남구,역삼동,709,0,강남구 역삼동 709-0
...,...,...,...,...,...
9681,중랑구,중화동,438,0,중랑구 중화동 438-0
9682,중랑구,중화동,450,0,중랑구 중화동 450-0
9683,중랑구,중화동,452,0,중랑구 중화동 452-0
9684,중랑구,중화동,453,0,중랑구 중화동 453-0


In [14]:
df_address_naver['address'][:10]

0    강남구 대치동 633-22
1     강남구 도곡동 153-2
2    강남구 도곡동 193-67
3     강남구 도곡동 893-2
4     강남구 역삼동 709-0
5     강남구 역삼동 728-6
6     강남구 율현동 686-0
7     강남구 청담동 90-17
8    강남구 청담동 106-21
9     강동구 암사동 513-4
Name: address, dtype: object

In [20]:
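# Naver Maps API credentials, read from environment variables
# (the variable names are assumptions) instead of being hard-coded
import os
client_id = os.environ['NAVER_CLIENT_ID']    # the client ID issued to you
client_pw = os.environ['NAVER_CLIENT_SECRET']    # the client secret issued to you

api_url = 'https://naveropenapi.apigw.ntruss.com/map-geocode/v2/geocode?query='

# Look up latitude and longitude with the Naver Maps API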
geo_coordi = list()
for i,add in enumerate(df_address_naver['address']):
    add_urlenc = parse.quote(add)  
    url = api_url + add_urlenc
    request = Request(url)
    request.add_header('X-NCP-APIGW-API-KEY-ID', client_id)
    request.add_header('X-NCP-APIGW-API-KEY', client_pw)
    try:
        response = urlopen(request)
    except HTTPError as e:
        print('HTTP Error!')
        latitude = None
        longitude = None
    else:
        rescode = response.getcode()
        if rescode == 200:
            response_body = response.read().decode('utf-8')
            response_body = json.loads(response_body)   # json
            if response_body['addresses'] == [] :
                print(add," not exist!")
                latitude = None
                longitude = None
            else:
                latitude = response_body['addresses'][0]['y']
                longitude = response_body['addresses'][0]['x']
                # print("Success!")
        else:
            print('Response error code : %d' % rescode)
            latitude = None
            longitude = None

    geo_coordi.append([latitude, longitude])
    print(i+1,'/',len(df_address_naver['address']))

1 / 9686
2 / 9686
3 / 9686
4 / 9686
5 / 9686
6 / 9686
강남구 율현동 686-0  not exist!
7 / 9686
8 / 9686
9 / 9686
10 / 9686
11 / 9686
12 / 9686
13 / 9686
14 / 9686
15 / 9686
16 / 9686
17 / 9686
관악구 봉천동 1152-0  not exist!
18 / 9686
19 / 9686
20 / 9686
21 / 9686
22 / 9686
23 / 9686
24 / 9686
25 / 9686
26 / 9686
27 / 9686
28 / 9686
29 / 9686
30 / 9686
31 / 9686
32 / 9686
33 / 9686
34 / 9686
35 / 9686
36 / 9686
37 / 9686
38 / 9686
39 / 9686
40 / 9686
41 / 9686
42 / 9686
43 / 9686
44 / 9686
45 / 9686
46 / 9686
47 / 9686
48 / 9686
구로구 천왕동 292-10  not exist!
49 / 9686
50 / 9686
51 / 9686
52 / 9686
53 / 9686
54 / 9686
55 / 9686
56 / 9686
57 / 9686
58 / 9686
59 / 9686
60 / 9686
61 / 9686
62 / 9686
63 / 9686
64 / 9686
65 / 9686
동작구 신대방동 729-24  not exist!
66 / 9686
마포구 공덕동 800-0  not exist!
67 / 9686
68 / 9686
69 / 9686
70 / 9686
서대문구 남가좌동 458-0  not exist!
71 / 9686
72 / 9686
73 / 9686
74 / 9686
75 / 9686
76 / 9686
서초구 내곡동 568-0  not exist!
77 / 9686
서초구 내곡동 619-0  not exist!
78 / 9686
79 / 9686
80 / 

KeyboardInterrupt: ignored

In [18]:
geo_coordi

[['37.4951529', '127.0576638'],
 ['37.4886005', '127.0457672'],
 ['37.4875758', '127.0441622'],
 ['37.4877483', '127.0392143'],
 ['37.5025026', '127.0463890'],
 ['37.4976924', '127.0396481'],
 [None, None],
 ['37.5248507', '127.0427856'],
 ['37.5277680', '127.0514480'],
 ['37.5489814', '127.1262129']]
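
The loop above reads the first match from the `addresses` list in the JSON response and treats an empty list as a missing address. As a minimal sketch of that parsing step in isolation, the hypothetical helper `parse_geocode_body` below (not part of the original notebook) returns the coordinates as floats, or `(None, None)` when nothing matched. The sample payloads are made up and contain only the `addresses`, `x`, and `y` fields that the loop relies on.

```python
import json


def parse_geocode_body(body):
    """Return (latitude, longitude) as floats, or (None, None) if no match.

    Hypothetical helper: it assumes only the response fields used in the
    loop above ('addresses', and 'x' / 'y' on each entry).
    """
    data = json.loads(body)
    addresses = data.get('addresses') or []
    if not addresses:
        return None, None
    first = addresses[0]
    # 'y' holds the latitude and 'x' the longitude, both as strings.
    return float(first['y']), float(first['x'])


# Made-up sample payloads for illustration only.
found = '{"addresses": [{"x": "127.0576638", "y": "37.4951529"}]}'
missing = '{"addresses": []}'

print(parse_geocode_body(found))
print(parse_geocode_body(missing))
```

Converting to float at this point avoids carrying string coordinates such as `'37.4951529'` into later numeric work.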

# Building DataFrames of the previous month's transaction counts by region

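- Build DataFrames of transaction counts (sale, full-deposit (jeonse) rent, and monthly-rent contracts) at the level of individual apartments
- The transaction counts added earlier when building economic_data2 covered Seoul as a whole
- The counts added here are per apartment: rows that share the same year, month, address_1, address_2, and address_3 are counted together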

In [None]:
import pandas as pd
import os
# Load the data
df_deal = pd.read_csv('/content/drive/MyDrive/house_price/after_data/apartment_deal.csv',encoding='utf-8')
df_full_rent = pd.read_csv('/content/drive/MyDrive/house_price/after_data/apartment_full_rent.csv',encoding='utf-8')
df_month_rent = pd.read_csv('/content/drive/MyDrive/house_price/after_data/apartment_month_rent.csv',encoding='utf-8')

- Before building the transaction-count DataFrames, run some tests to understand the characteristics of the data

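- Grouping by date down to year, month, day and by address down to address_1, address_2, address_3, address_4 leaves too many groups with very small counts (1, 2, 3, and so on), which seems to make changes in transaction volume hard to compute
- The counts are therefore grouped by year and month for the date, and by address_1, address_2, and address_3 for the address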
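
As a rough illustration of the point above, the sketch below counts deals on a small made-up table at both levels of granularity. The column names follow `df_deal`, `day` and `address_4` are assumed to be among its columns (the bullets above mention them), and every row is invented for the example.

```python
import pandas as pd

# Invented sample rows; column names mirror df_deal.
sample = pd.DataFrame({
    'address_1': ['강남구'] * 6,
    'address_2': ['개포동'] * 6,
    'address_3': [12, 12, 12, 12, 12, 12],
    'address_4': [0, 0, 1, 1, 2, 2],
    'year': [2011] * 6,
    'month': [1, 1, 1, 1, 2, 2],
    'day': [3, 3, 15, 20, 7, 9],
    'deal_price': [90000, 91000, 88000, 87500, 92000, 93000],
})

fine_keys = ['address_1', 'address_2', 'address_3', 'address_4',
             'year', 'month', 'day']
coarse_keys = ['address_1', 'address_2', 'address_3', 'year', 'month']

# size() counts the rows in each group.
fine_counts = sample.groupby(fine_keys).size()
coarse_counts = sample.groupby(coarse_keys).size()

print(fine_counts.tolist())
print(coarse_counts.tolist())
```

On this sample the fine grouping yields five groups, four of which hold a single deal, while the coarse grouping yields two groups with 4 and 2 deals.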

## Building the transaction-count DataFrames

In [None]:
# Count transactions per month and per region (down to address_3)
df_deal_count = df_deal.groupby(['address_1','address_2','address_3','year','month'])['deal_price'].count()
df_deal_count = df_deal_count.reset_index()
df_deal_count.columns = ['address_1','address_2','address_3','year','month','deal_count']

df_full_rent_count = df_full_rent.groupby(['address_1','address_2','address_3','year','month'])['full_rent_price'].count()
df_full_rent_count = df_full_rent_count.reset_index()
df_full_rent_count.columns = ['address_1','address_2','address_3','year','month','full_rent_count']

df_month_rent_count = df_month_rent.groupby(['address_1','address_2','address_3','year','month'])['month_rent_price'].count()
df_month_rent_count = df_month_rent_count.reset_index()
df_month_rent_count.columns = ['address_1','address_2','address_3','year','month','month_rent_count']

In [None]:
df_deal_count

Unnamed: 0,address_1,address_2,address_3,year,month,deal_count
0,강남구,개포동,12.0,2011,1,16
1,강남구,개포동,12.0,2011,2,17
2,강남구,개포동,12.0,2011,3,8
3,강남구,개포동,12.0,2011,4,13
4,강남구,개포동,12.0,2011,5,7
...,...,...,...,...,...,...
292758,중랑구,중화동,454.0,2020,12,1
292759,중랑구,중화동,454.0,2021,1,1
292760,중랑구,중화동,454.0,2021,7,1
292761,중랑구,중화동,454.0,2021,8,1
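
One property of these counts is worth keeping in mind: `groupby(...).count()` only produces rows for months that have at least one transaction, which is why the last rows above jump from 2021-01 to 2021-07. The sketch below is not part of the original pipeline. It assumes one wanted explicit zero-count rows and shows one possible way to add them for a single apartment, using invented rows and a hypothetical helper named `fill_missing_months`.

```python
import pandas as pd


def fill_missing_months(counts, value_col):
    """Add zero rows for months with no transactions (hypothetical helper).

    `counts` is assumed to hold one apartment's rows, with integer
    'year' and 'month' columns, one row per month, and a count column
    named `value_col`.
    """
    # Represent each row as a monthly period so gaps are easy to fill.
    periods = pd.PeriodIndex(
        [pd.Period(year=y, month=m, freq='M')
         for y, m in zip(counts['year'], counts['month'])]
    )
    series = pd.Series(counts[value_col].to_numpy(), index=periods)
    full = pd.period_range(periods.min(), periods.max(), freq='M')
    filled = series.reindex(full, fill_value=0)
    return pd.DataFrame({
        'year': filled.index.year,
        'month': filled.index.month,
        value_col: filled.to_numpy(),
    })


# Invented rows shaped like the tail of df_deal_count.
sample = pd.DataFrame({
    'year': [2021, 2021, 2021],
    'month': [1, 7, 8],
    'deal_count': [1, 1, 1],
})
filled = fill_missing_months(sample, 'deal_count')
print(filled)
```

Whether zero rows are desirable depends on how the counts are used later; the notebook itself keeps only the months that appear in the data.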


- Each count is the number of transactions in that month; the monthly totals are assumed to be announced in the following month

In [None]:
# Shift each record forward by one month
df_deal_count.loc[df_deal_count['month'] == 12, 'announced_year'] = df_deal_count['year']+1
df_deal_count.loc[df_deal_count['month'] == 12, 'announced_month'] = 1
df_deal_count.loc[df_deal_count['month'] != 12, 'announced_year'] = df_deal_count['year']
df_deal_count.loc[df_deal_count['month'] != 12, 'announced_month'] = df_deal_count['month']+1

df_full_rent_count.loc[df_full_rent_count['month'] == 12, 'announced_year'] = df_full_rent_count['year']+1
df_full_rent_count.loc[df_full_rent_count['month'] == 12, 'announced_month'] = 1
df_full_rent_count.loc[df_full_rent_count['month'] != 12, 'announced_year'] = df_full_rent_count['year']
df_full_rent_count.loc[df_full_rent_count['month'] != 12, 'announced_month'] = df_full_rent_count['month']+1

df_month_rent_count.loc[df_month_rent_count['month'] == 12, 'announced_year'] = df_month_rent_count['year']+1
df_month_rent_count.loc[df_month_rent_count['month'] == 12, 'announced_month'] = 1
df_month_rent_count.loc[df_month_rent_count['month'] != 12, 'announced_year'] = df_month_rent_count['year']
df_month_rent_count.loc[df_month_rent_count['month'] != 12, 'announced_month'] = df_month_rent_count['month']+1
df_deal_count

Unnamed: 0,address_1,address_2,address_3,year,month,deal_count,announced_year,announced_month
0,강남구,개포동,12.0,2011,1,16,2011.0,2.0
1,강남구,개포동,12.0,2011,2,17,2011.0,3.0
2,강남구,개포동,12.0,2011,3,8,2011.0,4.0
3,강남구,개포동,12.0,2011,4,13,2011.0,5.0
4,강남구,개포동,12.0,2011,5,7,2011.0,6.0
...,...,...,...,...,...,...,...,...
292758,중랑구,중화동,454.0,2020,12,1,2021.0,1.0
292759,중랑구,중화동,454.0,2021,1,1,2021.0,2.0
292760,중랑구,중화동,454.0,2021,7,1,2021.0,8.0
292761,중랑구,중화동,454.0,2021,8,1,2021.0,9.0
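
The output above shows `announced_year` and `announced_month` as floats (`2011.0`, `2.0`), most likely because the masked `.loc` assignments first create the new columns with missing values. As an alternative sketch, the month shift can be written with integer arithmetic on a running month index. `add_months` is a hypothetical helper, not something the notebook defines.

```python
def add_months(year, month, n):
    """Shift a (year, month) pair forward by n months.

    Hypothetical helper. Months are 1-12; the pair is converted to a
    zero-based month index, shifted, and converted back.
    """
    index = year * 12 + (month - 1) + n
    return index // 12, index % 12 + 1


print(add_months(2011, 1, 1))    # (2011, 2)
print(add_months(2020, 12, 1))   # (2021, 1)
print(add_months(2021, 8, 6))    # (2022, 2)
```

Because the function uses only `+`, `//`, and `%`, it should also work element-wise on integer pandas Series, for example `add_months(df_deal_count['year'], df_deal_count['month'], 1)`. The same call with `n=6` or `n=12` would cover the 6- and 12-month offsets.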


## Creating the future-date columns

- Add columns holding the future dates for each count (6 months and 12 months after the announcement date)

In [None]:
# Compute the dates 6 months later
df_deal_count.loc[df_deal_count['announced_month']<7, '6m_after_year'] = df_deal_count['announced_year']
df_deal_count.loc[df_deal_count['announced_month']<7, '6m_after_month'] = df_deal_count['announced_month']+6
df_deal_count.loc[df_deal_count['announced_month']>=7, '6m_after_year'] = df_deal_count['announced_year']+1
df_deal_count.loc[df_deal_count['announced_month']>=7, '6m_after_month'] = df_deal_count['announced_month']-6

df_full_rent_count.loc[df_full_rent_count['announced_month']<7, '6m_after_year'] = df_full_rent_count['announced_year']
df_full_rent_count.loc[df_full_rent_count['announced_month']<7, '6m_after_month'] = df_full_rent_count['announced_month']+6
df_full_rent_count.loc[df_full_rent_count['announced_month']>=7, '6m_after_year'] = df_full_rent_count['announced_year']+1
df_full_rent_count.loc[df_full_rent_count['announced_month']>=7, '6m_after_month'] = df_full_rent_count['announced_month']-6

df_month_rent_count.loc[df_month_rent_count['announced_month']<7, '6m_after_year'] = df_month_rent_count['announced_year']
df_month_rent_count.loc[df_month_rent_count['announced_month']<7, '6m_after_month'] = df_month_rent_count['announced_month']+6
df_month_rent_count.loc[df_month_rent_count['announced_month']>=7, '6m_after_year'] = df_month_rent_count['announced_year']+1
df_month_rent_count.loc[df_month_rent_count['announced_month']>=7, '6m_after_month'] = df_month_rent_count['announced_month']-6

# Compute the dates 12 months later
df_deal_count.loc[:, '12m_after_year'] = df_deal_count['announced_year']+1
df_deal_count.loc[:, '12m_after_month'] = df_deal_count['announced_month']
                
df_full_rent_count.loc[:, '12m_after_year'] = df_full_rent_count['announced_year']+1
df_full_rent_count.loc[:, '12m_after_month'] = df_full_rent_count['announced_month']
                  
df_month_rent_count.loc[:, '12m_after_year'] = df_month_rent_count['announced_year']+1
df_month_rent_count.loc[:, '12m_after_month'] = df_month_rent_count['announced_month']

# Convert the data types
df_deal_count = df_deal_count.astype({'announced_year': 'int16','announced_month': 'int16',
                                    '6m_after_year': 'int16','6m_after_month': 'int16',
                                    '12m_after_year': 'int16','12m_after_month': 'int16'})
df_full_rent_count = df_full_rent_count.astype({'announced_year': 'int16','announced_month': 'int16',
                                    '6m_after_year': 'int16','6m_after_month': 'int16',
                                    '12m_after_year': 'int16','12m_after_month': 'int16'})
df_month_rent_count = df_month_rent_count.astype({'announced_year': 'int16','announced_month': 'int16',
                                    '6m_after_year': 'int16','6m_after_month': 'int16',
                                    '12m_after_year': 'int16','12m_after_month': 'int16'})

# Drop the columns that are no longer needed
df_deal_count = df_deal_count.drop(['year','month'], axis=1)
df_full_rent_count = df_full_rent_count.drop(['year','month'], axis=1)
df_month_rent_count = df_month_rent_count.drop(['year','month'], axis=1)

In [None]:
df_deal_count

Unnamed: 0,address_1,address_2,address_3,deal_count,announced_year,announced_month,6m_after_year,6m_after_month,12m_after_year,12m_after_month
0,강남구,개포동,12.0,16,2011,2,2011,8,2012,2
1,강남구,개포동,12.0,17,2011,3,2011,9,2012,3
2,강남구,개포동,12.0,8,2011,4,2011,10,2012,4
3,강남구,개포동,12.0,13,2011,5,2011,11,2012,5
4,강남구,개포동,12.0,7,2011,6,2011,12,2012,6
...,...,...,...,...,...,...,...,...,...,...
292758,중랑구,중화동,454.0,1,2021,1,2021,7,2022,1
292759,중랑구,중화동,454.0,1,2021,2,2021,8,2022,2
292760,중랑구,중화동,454.0,1,2021,8,2022,2,2022,8
292761,중랑구,중화동,454.0,1,2021,9,2022,3,2022,9


In [None]:
df_full_rent_count

Unnamed: 0,address_1,address_2,address_3,full_rent_count,announced_year,announced_month,6m_after_year,6m_after_month,12m_after_year,12m_after_month
0,강남구,개포동,12.0,38,2011,2,2011,8,2012,2
1,강남구,개포동,12.0,38,2011,3,2011,9,2012,3
2,강남구,개포동,12.0,46,2011,4,2011,10,2012,4
3,강남구,개포동,12.0,30,2011,5,2011,11,2012,5
4,강남구,개포동,12.0,21,2011,6,2011,12,2012,6
...,...,...,...,...,...,...,...,...,...,...
367856,중랑구,중화동,454.0,3,2021,11,2022,5,2022,11
367857,중랑구,중화동,454.0,1,2021,12,2022,6,2022,12
367858,중랑구,중화동,454.0,2,2022,4,2022,10,2023,4
367859,중랑구,중화동,454.0,1,2022,12,2023,6,2023,12


In [None]:
df_month_rent_count

Unnamed: 0,address_1,address_2,address_3,month_rent_count,announced_year,announced_month,6m_after_year,6m_after_month,12m_after_year,12m_after_month
0,강남구,개포동,12.0,7,2011,2,2011,8,2012,2
1,강남구,개포동,12.0,8,2011,3,2011,9,2012,3
2,강남구,개포동,12.0,16,2011,4,2011,10,2012,4
3,강남구,개포동,12.0,11,2011,5,2011,11,2012,5
4,강남구,개포동,12.0,8,2011,6,2011,12,2012,6
...,...,...,...,...,...,...,...,...,...,...
221022,중랑구,중화동,454.0,1,2019,1,2019,7,2020,1
221023,중랑구,중화동,454.0,1,2020,2,2020,8,2021,2
221024,중랑구,중화동,454.0,1,2020,7,2021,1,2021,7
221025,중랑구,중화동,454.0,1,2020,9,2021,3,2021,9


## Saving df_deal_count to a file

In [None]:
df_deal_count.to_pickle('/content/drive/MyDrive/house_price/after_data/df_deal_count.pkl')

## Saving df_full_rent_count to a file

In [None]:
df_full_rent_count.to_pickle('/content/drive/MyDrive/house_price/after_data/df_full_rent_count.pkl')

## Saving df_year_rent_count to a file

In [None]:
df_month_rent_count.to_pickle('/content/drive/MyDrive/house_price/after_data/df_year_rent_count.pkl')