<a href="https://colab.research.google.com/github/Ryong1998/house_price/blob/main/apartment_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

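# Project topic

- This project predicts the future rate of change of apartment prices.

# Project overview

- Among the many regions and property types (apartments, detached houses, and so on), this project predicts the 'future price change rate' of 'apartments' in 'Seoul'.
- The ultimate goal is to find the apartments expected to have the best 'future price change rate', that is, the highest expected return.
- We assume that the value of real estate can be assessed through two aspects: '1. characteristics as a residence' and '2. characteristics as a financial product'.
- '1. characteristics as a residence' covers factors that make for a more comfortable living environment, such as nearby amenities, educational facilities, apartment size, and nearby transportation.
- '2. characteristics as a financial product' covers factors expressed as financial figures, such as the base rate, apartment supply, unsold apartments, the current sale price, and the jeonse-to-sale price ratio.
- The factors that signal high value under '1. characteristics as a residence' can change over time (for example, as households shift from extended to smaller families the preferred apartment size may change, and the spread of online lectures may reduce the importance of educational infrastructure in the future).
- These high-value factors may have kept changing in the past, but it is hard to determine how they changed and impossible to know how they will change, so the evaluation criteria under '1. characteristics as a residence' are 'variable'.
- In contrast, '2. characteristics as a financial product' is expressed through 'figures' grounded in prices and the economy, so we assume it can assess real estate value more consistently than '1. characteristics as a residence'.
- We also assume that the figures under '2. characteristics as a financial product' already reflect the changing value of '1. characteristics as a residence'.
- This project therefore focuses on '2. characteristics as a financial product' to predict changes in housing prices.
- The work proceeds by building a model that predicts, for each day, the 'average sale price per pyeong across all Seoul apartments one year later'.
- Even though this cannot recommend individual apartments, the one-year outlook for the Seoul apartment market is used to predict whether now is a good time to buy an apartment.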
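
As a rough illustration of the prediction target described above, the sketch below computes a daily average price per pyeong and its change rate one year ahead. The helper `one_year_change_rate`, the input column names `date`, `area`, and `deal_price`, and the conversion of 1 pyeong to about 3.3058 m² are assumptions made for this example; this is not the project's actual modeling code.

```python
import pandas as pd

M2_PER_PYEONG = 3.3058  # assumed conversion: 1 pyeong is about 3.3058 square meters


def one_year_change_rate(deals: pd.DataFrame) -> pd.DataFrame:
    """Return the daily mean price per pyeong and its change rate one year later.

    `deals` is a hypothetical frame with a datetime `date` column,
    an `area` column in square meters, and a `deal_price` column.
    """
    per_pyeong = deals["deal_price"] / (deals["area"] / M2_PER_PYEONG)
    daily = (
        per_pyeong.groupby(deals["date"]).mean()
        .rename("price_per_pyeong")
        .to_frame()
    )
    # Look up the price observed on the same calendar day one year later.
    # Days without a trade exactly one year later stay NaN.
    target_dates = daily.index + pd.DateOffset(years=1)
    daily["price_1y_later"] = daily["price_per_pyeong"].reindex(target_dates).to_numpy()
    daily["change_rate_1y"] = daily["price_1y_later"] / daily["price_per_pyeong"] - 1
    return daily
```

A positive `change_rate_1y` means the average price per pyeong was higher one year later than on that day.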

In [2]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


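# Collecting original_data

- Apartment sale prices and apartment jeonse/monthly rent prices were obtained as files from 'http://rtdown.molit.go.kr/'.
- Korean government bond yields, US Treasury yields, and KOSPI data were obtained from 'https://kr.investing.com/'.
- Unsold apartment counts were obtained from 'https://data.kbland.kr/publicdata/unsold-apartments'.
- Newly supplied (pre-sale) apartment counts were obtained from 'https://asil.kr/asil/sub/movein.jsp'.
- Base rate data were obtained from 'https://www.bok.or.kr/portal/singl/baseRate/list.do?dataSeCd=01&menuNo=200643'.
- The Seoul housing price index was obtained from 'https://data.seoul.go.kr/dataList/801/S/2/datasetView.do'.

>> The original plan was to fetch the apartment sale and jeonse/monthly rent prices through the 공공데이터포털 (Public Data Portal) API, but because of its daily traffic limit the files were downloaded directly from 'http://rtdown.molit.go.kr/' instead.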

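# Creating apartment_deal.csv

- The apartment sale price files were obtained from 'http://rtdown.molit.go.kr/'.
- A DataFrame holding the 'apartment sale' information is built and saved as apartment_deal.csv.

## Loading and merging the csv files

- The original apartment sale files are split by year, and not every column in each csv file is needed, so a preprocessing step is applied.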

In [None]:
import pandas as pd
import os

# Directory holding the yearly apartment sale csv files
dir_path = "/content/drive/MyDrive/house_price/original_data/deal_price/Seoul"
file_list = os.listdir(dir_path)
file_list.sort()
df_list = list()
# Read each csv file in the folder and collect the DataFrames in a list
for csv_file in file_list:
    df_list.append(pd.read_csv(os.path.join(dir_path, csv_file), skiprows=15, encoding='cp949'))

>> Colab appeared to list the files in the order they were uploaded, so the file list is sorted by name before reading.
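
Since the listing order of a directory should not be relied on, the loading step can be wrapped in a small helper that always reads the files in name order. The sketch below is one possible way to do that; the function name `load_yearly_csvs` is hypothetical, and the defaults `skiprows=15` and `encoding="cp949"` simply mirror the values used in this notebook.

```python
from pathlib import Path

import pandas as pd


def load_yearly_csvs(dir_path, skiprows=15, encoding="cp949"):
    """Read every .csv file in dir_path in file-name order and stack them."""
    csv_paths = sorted(Path(dir_path).glob("*.csv"))  # explicit, reproducible order
    frames = [pd.read_csv(p, skiprows=skiprows, encoding=encoding) for p in csv_paths]
    # ignore_index=True rebuilds a clean 0..n-1 index for the stacked frame
    return pd.concat(frames, ignore_index=True)
```

Note that `pd.concat` raises an error when the folder contains no csv files, which makes an empty or mistyped path fail early.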

In [None]:
df_list[0].info() # check that the frames were loaded into the list

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120812 entries, 0 to 120811
Data columns (total 15 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   시군구       120812 non-null  object 
 1   번지        120812 non-null  object 
 2   본번        120812 non-null  int64  
 3   부번        120812 non-null  int64  
 4   단지명       120812 non-null  object 
 5   전용면적(㎡)   120812 non-null  float64
 6   계약년월      120812 non-null  int64  
 7   계약일       120812 non-null  int64  
 8   거래금액(만원)  120812 non-null  object 
 9   층         120812 non-null  int64  
 10  건축년도      120812 non-null  int64  
 11  도로명       120812 non-null  object 
 12  해제사유발생일   0 non-null       float64
 13  거래유형      120812 non-null  object 
 14  중개사소재지    120812 non-null  object 
dtypes: float64(2), int64(6), object(7)
memory usage: 13.8+ MB


In [None]:
df_list[0].head() # inspect the shape of the data

Unnamed: 0,시군구,번지,본번,부번,단지명,전용면적(㎡),계약년월,계약일,거래금액(만원),층,건축년도,도로명,해제사유발생일,거래유형,중개사소재지
0,서울특별시 강남구 개포동,655-2,655,2,개포2차현대아파트(220),77.75,200603,10,59500,7,1988,언주로 103,,-,-
1,서울특별시 강남구 개포동,655-2,655,2,개포2차현대아파트(220),77.75,200603,29,60000,6,1988,언주로 103,,-,-
2,서울특별시 강남구 개포동,655-2,655,2,개포2차현대아파트(220),77.75,200604,29,67000,9,1988,언주로 103,,-,-
3,서울특별시 강남구 개포동,655-2,655,2,개포2차현대아파트(220),77.75,200606,1,60000,4,1988,언주로 103,,-,-
4,서울특별시 강남구 개포동,655-2,655,2,개포2차현대아파트(220),77.75,200610,20,72250,5,1988,언주로 103,,-,-


In [None]:
# Merge all DataFrames into a single DataFrame
# concat stacks the frames vertically; ignore_index=True rebuilds the index from 0
df_default = pd.concat(df_list, axis=0, ignore_index=True)
df_default.loc[1]

시군구          서울특별시 강남구 개포동
번지                   655-2
본번                   655.0
부번                     2.0
단지명         개포2차현대아파트(220)
전용면적(㎡)              77.75
계약년월                200603
계약일                     29
거래금액(만원)            60,000
층                        6
건축년도                1988.0
도로명                언주로 103
해제사유발생일                NaN
거래유형                     -
중개사소재지                   -
Name: 1, dtype: object

In [None]:
df_default.head() # inspect the merged table

Unnamed: 0,시군구,번지,본번,부번,단지명,전용면적(㎡),계약년월,계약일,거래금액(만원),층,건축년도,도로명,해제사유발생일,거래유형,중개사소재지
0,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),77.75,200603,10,59500,7,1988.0,언주로 103,,-,-
1,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),77.75,200603,29,60000,6,1988.0,언주로 103,,-,-
2,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),77.75,200604,29,67000,9,1988.0,언주로 103,,-,-
3,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),77.75,200606,1,60000,4,1988.0,언주로 103,,-,-
4,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),77.75,200610,20,72250,5,1988.0,언주로 103,,-,-


In [None]:
df_default.info() # check the result of merging the DataFrames

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1246694 entries, 0 to 1246693
Data columns (total 15 columns):
 #   Column    Non-Null Count    Dtype  
---  ------    --------------    -----  
 0   시군구       1246694 non-null  object 
 1   번지        1246471 non-null  object 
 2   본번        1246619 non-null  float64
 3   부번        1246619 non-null  float64
 4   단지명       1246694 non-null  object 
 5   전용면적(㎡)   1246694 non-null  float64
 6   계약년월      1246694 non-null  int64  
 7   계약일       1246694 non-null  int64  
 8   거래금액(만원)  1246694 non-null  object 
 9   층         1246694 non-null  int64  
 10  건축년도      1246692 non-null  float64
 11  도로명       1246694 non-null  object 
 12  해제사유발생일   5454 non-null     float64
 13  거래유형      1246694 non-null  object 
 14  중개사소재지    1246694 non-null  object 
dtypes: float64(5), int64(3), object(7)
memory usage: 142.7+ MB


## Selecting only the required columns

- Not every column of the df_default DataFrame is used, so only the columns that are needed are selected.
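
One way to keep the selected source columns and their English names paired is a single mapping, so that reordering one list cannot silently mislabel a column. The sketch below assumes the same ten column pairs that this notebook uses; the names `COLUMN_MAP` and `select_and_rename` are hypothetical.

```python
import pandas as pd

# Source column -> English name (the same pairs used in this notebook)
COLUMN_MAP = {
    "시군구": "address",
    "본번": "main_number",
    "부번": "sub_number",
    "도로명": "road",
    "단지명": "name",
    "계약년월": "year_month",
    "계약일": "day",
    "전용면적(㎡)": "area",
    "거래금액(만원)": "deal_price",
    "층": "floor",
}


def select_and_rename(raw: pd.DataFrame, column_map: dict = COLUMN_MAP) -> pd.DataFrame:
    """Keep only the mapped columns and rename them; returns an independent copy."""
    return raw[list(column_map)].rename(columns=column_map).copy()
```

Returning a copy also avoids pandas' chained-assignment warnings when the result is modified later.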

In [None]:
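# Keep only the columns to be used and rename them to English
# .copy() makes the selection an independent frame so later assignments do not raise SettingWithCopyWarning
df_default = df_default[['시군구','본번','부번','도로명','단지명','계약년월','계약일','전용면적(㎡)','거래금액(만원)','층']].copy()
df_default.columns = ['address','main_number','sub_number','road','name','year_month','day','area','deal_price','floor']
df_default.head() # check the selection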

Unnamed: 0,address,main_number,sub_number,road,name,year_month,day,area,deal_price,floor
0,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200603,10,77.75,59500,7
1,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200603,29,77.75,60000,6
2,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200604,29,77.75,67000,9
3,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200606,1,77.75,60000,4
4,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200610,20,77.75,72250,5


In [None]:
# Convert the types of deal_price, year_month, and day
df_default["deal_price"] = df_default["deal_price"].str.replace(",", "") # strip the ',' thousands separators from 'deal_price' so it can be used in calculations
df = df_default.astype({'year_month':'str','day':'str','deal_price':'int64'}).copy()
df.head() # check the converted values

Unnamed: 0,address,main_number,sub_number,road,name,year_month,day,area,deal_price,floor
0,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200603,10,77.75,59500,7
1,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200603,29,77.75,60000,6
2,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200604,29,77.75,67000,9
3,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200606,1,77.75,60000,4
4,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200610,20,77.75,72250,5


In [None]:
df.info() # check the type changes and nulls

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1246694 entries, 0 to 1246693
Data columns (total 10 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   address      1246694 non-null  object 
 1   main_number  1246619 non-null  float64
 2   sub_number   1246619 non-null  float64
 3   road         1246694 non-null  object 
 4   name         1246694 non-null  object 
 5   year_month   1246694 non-null  object 
 6   day          1246694 non-null  object 
 7   area         1246694 non-null  float64
 8   deal_price   1246694 non-null  int64  
 9   floor        1246694 non-null  int64  
dtypes: float64(3), int64(2), object(5)
memory usage: 95.1+ MB


In [None]:
# Look for rows where 'main_number' or 'sub_number' is null and 'road' is also null -> none
# i.e. at this point 'road' appears to carry more complete address information
df[((df['main_number'].isnull()) |(df['sub_number'].isnull())) &(df['road'].isnull()) ]

Unnamed: 0,address,main_number,sub_number,road,name,year_month,day,area,deal_price,floor


- main_number and sub_number contain null values, so at this stage the road information was judged to be the better source of address information.

## Adding year, month, and day columns

- Date columns are used later for grouping, so the 'year_month' and 'day' columns are processed to create several date-related columns.
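
The date construction can also be written compactly with `str.zfill` and an explicit format string. The sketch below is an illustrative alternative, not the cell used in this notebook; the helper name `build_date` is hypothetical, and it assumes `year_month` holds values such as 200603 and `day` holds 1 to 31.

```python
import pandas as pd


def build_date(year_month: pd.Series, day: pd.Series) -> pd.Series:
    """Combine a YYYYMM column and a day-of-month column into datetimes."""
    ym = year_month.astype(str)
    dd = day.astype(str).str.zfill(2)  # '1' -> '01', '20' stays '20'
    # An explicit format avoids ambiguous parsing and fails loudly on malformed values
    return pd.to_datetime(ym + dd, format="%Y%m%d")
```

Year and month can then be read from the result with `.dt.year` and `.dt.month` instead of slicing strings.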

In [None]:
# Split and create date-related columns for later grouping
df['year'] = df['year_month'].str[0:4] # extract the year from the combined year-month column
df['month'] = df['year_month'].str[4:] # extract the month from the combined year-month column
df.loc[df["day"].str.len()==1,"day"]='0'+df.loc[df["day"].str.len()==1,"day"] # prepend '0' when the day is a single digit such as 1 or 2
df['date'] = pd.to_datetime(df['year']+df['month']+df['day']) # combine the parts into a date column
df = df.astype({'year':'int64','month':'int64','day':'int64'}) # convert to the desired types
df = df.drop(['year_month'], axis=1) # drop the column that is no longer used
df.head()

Unnamed: 0,address,main_number,sub_number,road,name,day,area,deal_price,floor,year,month,date
0,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),10,77.75,59500,7,2006,3,2006-03-10
1,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),29,77.75,60000,6,2006,3,2006-03-29
2,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),29,77.75,67000,9,2006,4,2006-04-29
3,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),1,77.75,60000,4,2006,6,2006-06-01
4,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),20,77.75,72250,5,2006,10,2006-10-20


In [None]:
df.info() # confirm that the types changed as intended

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1246694 entries, 0 to 1246693
Data columns (total 12 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   address      1246694 non-null  object        
 1   main_number  1246619 non-null  float64       
 2   sub_number   1246619 non-null  float64       
 3   road         1246694 non-null  object        
 4   name         1246694 non-null  object        
 5   day          1246694 non-null  int64         
 6   area         1246694 non-null  float64       
 7   deal_price   1246694 non-null  int64         
 8   floor        1246694 non-null  int64         
 9   year         1246694 non-null  int64         
 10  month        1246694 non-null  int64         
 11  date         1246694 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(3), int64(5), object(3)
memory usage: 114.1+ MB


In [None]:
# Split the address and road strings into their parts (each column is split once and reused)
address_parts = df["address"].str.split(' ',expand=True)
road_parts = df["road"].str.split(' ',expand=True)
df["address_0"] = address_parts[0] # city (si)
df["address_1"] = address_parts[1] # district (gu)
df["address_2"] = address_parts[2] # neighborhood (dong)
df["road_name"] = road_parts[0] # road name
df["road_number"] = road_parts[1] # road number
df= df[['year','month','day','address_0','address_1','address_2','road_name','road_number','area','deal_price','name','main_number','sub_number','date']] # keep only the columns to be used
df.head()

Unnamed: 0,year,month,day,address_0,address_1,address_2,road_name,road_number,area,deal_price,name,main_number,sub_number,date
0,2006,3,10,서울특별시,강남구,개포동,언주로,103,77.75,59500,개포2차현대아파트(220),655.0,2.0,2006-03-10
1,2006,3,29,서울특별시,강남구,개포동,언주로,103,77.75,60000,개포2차현대아파트(220),655.0,2.0,2006-03-29
2,2006,4,29,서울특별시,강남구,개포동,언주로,103,77.75,67000,개포2차현대아파트(220),655.0,2.0,2006-04-29
3,2006,6,1,서울특별시,강남구,개포동,언주로,103,77.75,60000,개포2차현대아파트(220),655.0,2.0,2006-06-01
4,2006,10,20,서울특별시,강남구,개포동,언주로,103,77.75,72250,개포2차현대아파트(220),655.0,2.0,2006-10-20


## Handling missing values, part 1

In [None]:
df.info() # road_number now has one null value

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1246694 entries, 0 to 1246693
Data columns (total 14 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   year         1246694 non-null  int64         
 1   month        1246694 non-null  int64         
 2   day          1246694 non-null  int64         
 3   address_0    1246694 non-null  object        
 4   address_1    1246694 non-null  object        
 5   address_2    1246694 non-null  object        
 6   road_name    1246694 non-null  object        
 7   road_number  1246693 non-null  object        
 8   area         1246694 non-null  float64       
 9   deal_price   1246694 non-null  int64         
 10  name         1246694 non-null  object        
 11  main_number  1246619 non-null  float64       
 12  sub_number   1246619 non-null  float64       
 13  date         1246694 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(3), int64(4), object(6)
memory usage

In [None]:
df[df['road_number'].isnull()] # inspect the row whose road_number is null

Unnamed: 0,year,month,day,address_0,address_1,address_2,road_name,road_number,area,deal_price,name,main_number,sub_number,date
1177515,2020,12,31,서울특별시,중구,만리동2가,만리재로,,39.9541,161000,서울역센트럴자이(임대),176.0,1.0,2020-12-31


In [None]:
# Inspect '서울역센트럴자이' -> some rows hold '' (empty strings) in the road columns
df.loc[df['name'] == '서울역센트럴자이',:]

Unnamed: 0,year,month,day,address_0,address_1,address_2,road_name,road_number,area,deal_price,name,main_number,sub_number,date
936223,2017,5,3,서울특별시,중구,만리동2가,만리재로,175.0,84.972,79390,서울역센트럴자이,176.0,1.0,2017-05-03
936224,2017,12,20,서울특별시,중구,만리동2가,만리재로,175.0,59.943,85000,서울역센트럴자이,176.0,1.0,2017-12-20
936225,2017,12,30,서울특별시,중구,만리동2가,,,59.94,85000,서울역센트럴자이,176.0,1.0,2017-12-30
1018067,2018,3,20,서울특별시,중구,만리동2가,,,72.99,85000,서울역센트럴자이,176.0,1.0,2018-03-20
1093938,2019,7,13,서울특별시,중구,만리동2가,만리재로,175.0,84.972,134500,서울역센트럴자이,176.0,1.0,2019-07-13
1093939,2019,8,20,서울특별시,중구,만리동2가,만리재로,175.0,59.94,95000,서울역센트럴자이,176.0,1.0,2019-08-20
1093940,2019,8,23,서울특별시,중구,만리동2가,만리재로,175.0,84.972,139000,서울역센트럴자이,176.0,1.0,2019-08-23
1093941,2019,9,8,서울특별시,중구,만리동2가,만리재로,175.0,59.94,113800,서울역센트럴자이,176.0,1.0,2019-09-08
1093942,2019,9,21,서울특별시,중구,만리동2가,만리재로,175.0,72.9733,132000,서울역센트럴자이,176.0,1.0,2019-09-21
1093943,2019,11,30,서울특별시,중구,만리동2가,만리재로,175.0,59.9808,120000,서울역센트럴자이,176.0,1.0,2019-11-30


In [None]:
# Inspect the rows whose value is ''
df.loc[df['road_name'] == '',:]

Unnamed: 0,year,month,day,address_0,address_1,address_2,road_name,road_number,area,deal_price,name,main_number,sub_number,date
1606,2006,2,23,서울특별시,강남구,논현동,,,128.6700,73500,경복,276.0,0.0,2006-02-23
1628,2006,10,19,서울특별시,강남구,논현동,,,95.4800,71000,경복,276.0,0.0,2006-10-19
2799,2006,1,24,서울특별시,강남구,대치동,,,76.5600,80000,청실1,633.0,0.0,2006-01-24
2806,2006,2,14,서울특별시,강남구,대치동,,,102.6400,143500,청실1,633.0,0.0,2006-02-14
2807,2006,2,14,서울특별시,강남구,대치동,,,102.6400,142000,청실1,633.0,0.0,2006-02-14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1234067,2022,7,23,서울특별시,송파구,거여동,,,59.9600,125000,e편한세상송파파크센트럴,696.0,0.0,2022-07-23
1234069,2022,8,19,서울특별시,송파구,거여동,,,84.9600,130000,e편한세상송파파크센트럴,696.0,0.0,2022-08-19
1234071,2022,10,7,서울특별시,송파구,거여동,,,113.1800,148000,e편한세상송파파크센트럴,696.0,0.0,2022-10-07
1242574,2023,2,6,서울특별시,서대문구,북아현동,,,59.8404,89999,힐스테이트신촌,1017.0,0.0,2023-02-06


>> Having no null values does not mean there are no '' values. A value can be semantically missing yet stored as '', which makes it look as if data were present.
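
The point above can be reproduced on a tiny frame: `isnull()` does not count empty or whitespace-only strings until they are converted to NaN. The sketch below is a minimal illustration; the helper name `blank_to_nan` and the sample values are made up for this example, and the regex also catches whitespace-only strings, which goes slightly beyond replacing only `''`.

```python
import numpy as np
import pandas as pd


def blank_to_nan(frame: pd.DataFrame) -> pd.DataFrame:
    """Replace empty or whitespace-only strings with NaN."""
    return frame.replace(r"^\s*$", np.nan, regex=True)


# Illustrative sample: one real null, one '' and one whitespace-only string
sample = pd.DataFrame({
    "road_name": ["만리재로", "", "  "],
    "road_number": ["175", None, "3"],
})
before = sample.isnull().sum()               # blank strings are not counted as null
after = blank_to_nan(sample).isnull().sum()  # blank strings are now counted
```

Comparing `before` and `after` column by column shows how many values were hidden behind blank strings.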

In [None]:
df.loc[df['name'] == '서울역센트럴자이(임대)','name']='서울역센트럴자이' # rename '서울역센트럴자이(임대)' to '서울역센트럴자이'
df.loc[df['name'] == '서울역센트럴자이','road_name']='만리재로' # set 'road_name' from the '서울역센트럴자이' rows inspected above
df.loc[df['name'] == '서울역센트럴자이','road_number']='175' # set 'road_number' from the '서울역센트럴자이' rows inspected above
df.info() # confirm that the values reported as null have been handled in this first pass

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1246694 entries, 0 to 1246693
Data columns (total 14 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   year         1246694 non-null  int64         
 1   month        1246694 non-null  int64         
 2   day          1246694 non-null  int64         
 3   address_0    1246694 non-null  object        
 4   address_1    1246694 non-null  object        
 5   address_2    1246694 non-null  object        
 6   road_name    1246694 non-null  object        
 7   road_number  1246694 non-null  object        
 8   area         1246694 non-null  float64       
 9   deal_price   1246694 non-null  int64         
 10  name         1246694 non-null  object        
 11  main_number  1246619 non-null  float64       
 12  sub_number   1246619 non-null  float64       
 13  date         1246694 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(3), int64(4), object(6)
memory usage

## Handling missing values, part 2

- The steps above showed that '' can be stored as a value, so '' values are now treated as null and handled as missing values.

In [None]:
import numpy as np
df = df.replace('', np.nan) # convert values that are exactly '' to null
df.info() # check the result after the conversion

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1246694 entries, 0 to 1246693
Data columns (total 14 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   year         1246694 non-null  int64         
 1   month        1246694 non-null  int64         
 2   day          1246694 non-null  int64         
 3   address_0    1246694 non-null  object        
 4   address_1    1246694 non-null  object        
 5   address_2    1246694 non-null  object        
 6   road_name    1244663 non-null  object        
 7   road_number  1243397 non-null  object        
 8   area         1246694 non-null  float64       
 9   deal_price   1246694 non-null  int64         
 10  name         1246694 non-null  object        
 11  main_number  1246619 non-null  float64       
 12  sub_number   1246619 non-null  float64       
 13  date         1246694 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(3), int64(4), object(6)
memory usage

In [None]:
df.isnull().sum() # the null counts of 'road_name' and 'road_number' have increased

year              0
month             0
day               0
address_0         0
address_1         0
address_2         0
road_name      2031
road_number    3297
area              0
deal_price        0
name              0
main_number      75
sub_number       75
date              0
dtype: int64

- At first the road-name address seemed to have fewer nulls, but preprocessing showed that the lot-number address (main_number, sub_number) actually has fewer.

In [None]:
# Look for rows where only one of 'main_number' and 'sub_number' is null -> none
# i.e. the two columns are always null together
df[((df['main_number'].isnull()) &(df['sub_number'].notnull()))
  |((df['main_number'].notnull()) &(df['sub_number'].isnull()))]

Unnamed: 0,year,month,day,address_0,address_1,address_2,road_name,road_number,area,deal_price,name,main_number,sub_number,date


In [None]:
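# Look for rows where both the road-name address and the lot-number address are null -> none
# i.e. address information can always be recovered from either the road-name address or the lot-number address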
df[((df['road_name'].isnull()) | (df['road_number'].isnull())) & (df['main_number'].isnull())]

Unnamed: 0,year,month,day,address_0,address_1,address_2,road_name,road_number,area,deal_price,name,main_number,sub_number,date


In [None]:
# Show the rows that still contain null values to be handled
df.loc[df['main_number'].isnull(),['address_0','address_1','address_2','road_name','road_number','name','main_number','sub_number']]

Unnamed: 0,address_0,address_1,address_2,road_name,road_number,name,main_number,sub_number
681633,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
681634,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
681635,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
681636,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
681637,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
...,...,...,...,...,...,...,...,...
1209122,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
1209123,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
1209124,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
1232880,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,


In [None]:
df.loc[df['main_number'].isnull(),'name'].unique() # apartment names whose lot-number address is null
                                                   # only '힐스테이트 서초 젠트리스' needs to be fixed

array(['힐스테이트 서초 젠트리스'], dtype=object)

In [None]:
df.loc[df['name']=='힐스테이트 서초 젠트리스',:] # every row whose name is '힐스테이트 서초 젠트리스' has a null lot-number address

Unnamed: 0,year,month,day,address_0,address_1,address_2,road_name,road_number,area,deal_price,name,main_number,sub_number,date
681633,2015,3,1,서울특별시,서초구,신원동,헌릉로8길,10-12,84.95,73430,힐스테이트 서초 젠트리스,,,2015-03-01
681634,2015,4,17,서울특별시,서초구,신원동,헌릉로8길,10-12,84.99,79000,힐스테이트 서초 젠트리스,,,2015-04-17
681635,2015,5,1,서울특별시,서초구,신원동,헌릉로8길,10-12,101.90,95000,힐스테이트 서초 젠트리스,,,2015-05-01
681636,2015,6,16,서울특별시,서초구,신원동,헌릉로8길,10-12,84.95,87200,힐스테이트 서초 젠트리스,,,2015-06-16
681637,2015,6,26,서울특별시,서초구,신원동,헌릉로8길,10-12,101.90,94500,힐스테이트 서초 젠트리스,,,2015-06-26
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1209122,2021,4,27,서울특별시,서초구,신원동,헌릉로8길,10-12,101.90,184500,힐스테이트 서초 젠트리스,,,2021-04-27
1209123,2021,5,26,서울특별시,서초구,신원동,헌릉로8길,10-12,84.95,165000,힐스테이트 서초 젠트리스,,,2021-05-26
1209124,2021,7,26,서울특별시,서초구,신원동,헌릉로8길,10-12,84.99,182000,힐스테이트 서초 젠트리스,,,2021-07-26
1232880,2022,6,23,서울특별시,서초구,신원동,헌릉로8길,10-12,101.90,204000,힐스테이트 서초 젠트리스,,,2022-06-23


In [None]:
# Fill the null lot-number values with the address found through a Naver search
df.loc[df['name']=='힐스테이트 서초 젠트리스','main_number'] = 557
df.loc[df['name']=='힐스테이트 서초 젠트리스','sub_number'] = 0

In [None]:
# Select the columns to be used and rename them
df_deal = df[['date','year','month','day','address_0','address_1','address_2','main_number','sub_number','name','area','deal_price']].copy()
df_deal.columns =['date','year','month','day','address_0','address_1','address_2','address_3','address_4','name','area','deal_price']
df_deal = df_deal[df_deal['year']>=2011] # keep 2011 onward because the jeonse/monthly rent data starts in 2011
df_deal.head()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,deal_price
355306,2011-07-09,2011,7,9,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,64000
355307,2011-07-28,2011,7,28,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,65500
355308,2011-01-19,2011,1,19,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,70500
355309,2011-09-02,2011,9,2,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,85000
355310,2011-12-17,2011,12,17,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,68000


In [None]:
df_deal.info() # check the DataFrame information

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891388 entries, 355306 to 1246693
Data columns (total 12 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   date        891388 non-null  datetime64[ns]
 1   year        891388 non-null  int64         
 2   month       891388 non-null  int64         
 3   day         891388 non-null  int64         
 4   address_0   891388 non-null  object        
 5   address_1   891388 non-null  object        
 6   address_2   891388 non-null  object        
 7   address_3   891388 non-null  float64       
 8   address_4   891388 non-null  float64       
 9   name        891388 non-null  object        
 10  area        891388 non-null  float64       
 11  deal_price  891388 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(4), object(4)
memory usage: 88.4+ MB


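- This DataFrame consists of the 'date', 'year', 'month', and 'day' columns for the contract date; the 'address_0', 'address_1', 'address_2', 'address_3', and 'address_4' columns for the address; the 'name' column for the apartment name; the 'area' column for the apartment's floor area (㎡); and the 'deal_price' column for the sale price (in units of 10,000 KRW).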

In [None]:
df_deal.iloc[200] # check that the values look correct

date          2011-12-23 00:00:00
year                         2011
month                          12
day                            23
address_0                   서울특별시
address_1                     강남구
address_2                     개포동
address_3                   141.0
address_4                     0.0
name                      개포주공1단지
area                        56.57
deal_price                  95000
Name: 355506, dtype: object

In [None]:
df_deal.to_csv('/content/drive/MyDrive/house_price/after_data/apartment_deal.csv',index=False) # save to file

# Creating apartment_full_rent.csv and apartment_month_rent.csv

- The apartment jeonse and monthly rent files were obtained from 'http://rtdown.molit.go.kr/'.
- A DataFrame holding the 'apartment jeonse' information is built and saved as apartment_full_rent.csv.
- A DataFrame holding the 'apartment monthly rent' information is built and saved as apartment_month_rent.csv.

## Loading and merging the csv files

- The apartment rental csv files are split by year, and not every column in each csv file is needed, so a preprocessing step is applied.

In [None]:
import pandas as pd
import os


dir_path = "/content/drive/MyDrive/house_price/original_data/rent_price/Seoul"
file_list = os.listdir(dir_path)
file_list.sort()
df_list = list()

# Read each csv file in the folder and collect the DataFrames in a list
for csv_file in file_list:
    df_list.append(pd.read_csv(os.path.join(dir_path, csv_file), skiprows=15, encoding='cp949', low_memory=False))


In [None]:
df_list[-1].info() # check that the frames were loaded into the list

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85011 entries, 0 to 85010
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   시군구            85011 non-null  object 
 1   번지             84951 non-null  object 
 2   본번             85005 non-null  float64
 3   부번             85005 non-null  float64
 4   단지명            85011 non-null  object 
 5   전월세구분          85011 non-null  object 
 6   전용면적(㎡)        85011 non-null  float64
 7   계약년월           85011 non-null  int64  
 8   계약일            85011 non-null  int64  
 9   보증금(만원)        85011 non-null  object 
 10  월세(만원)         85011 non-null  object 
 11  층              85011 non-null  int64  
 12  건축년도           84999 non-null  float64
 13  도로명            85011 non-null  object 
 14  계약기간           85011 non-null  object 
 15  계약구분           85011 non-null  object 
 16  갱신요구권 사용       85011 non-null  object 
 17  종전계약 보증금 (만원)  72050 non-null  object 
 18  종전계약 월

In [None]:
# Merge all DataFrames into one
# concat stacks the frames vertically; ignore_index=True rebuilds the index from 0
df_default = pd.concat(df_list, axis=0, ignore_index=True)
df_default.info() # check the result of merging the DataFrames

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2170786 entries, 0 to 2170785
Data columns (total 19 columns):
 #   Column         Dtype  
---  ------         -----  
 0   시군구            object 
 1   번지             object 
 2   본번             float64
 3   부번             float64
 4   단지명            object 
 5   전월세구분          object 
 6   전용면적(㎡)        float64
 7   계약년월           int64  
 8   계약일            int64  
 9   보증금(만원)        object 
 10  월세(만원)         object 
 11  층              float64
 12  건축년도           float64
 13  도로명            object 
 14  계약기간           object 
 15  계약구분           object 
 16  갱신요구권 사용       object 
 17  종전계약 보증금 (만원)  object 
 18  종전계약 월세 (만원)   object 
dtypes: float64(5), int64(2), object(12)
memory usage: 314.7+ MB


In [None]:
df_default.head() # inspect the shape of the data

Unnamed: 0,시군구,번지,본번,부번,단지명,전월세구분,전용면적(㎡),계약년월,계약일,보증금(만원),월세(만원),층,건축년도,도로명,계약기간,계약구분,갱신요구권 사용,종전계약 보증금 (만원),종전계약 월세 (만원)
0,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),전세,77.75,201101,5,35000,0,7.0,1988.0,언주로 103,-,-,-,,
1,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),전세,77.75,201101,18,20000,0,8.0,1988.0,언주로 103,-,-,-,,
2,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),전세,77.75,201102,1,24000,0,5.0,1988.0,언주로 103,-,-,-,,
3,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),전세,77.75,201102,11,31000,0,9.0,1988.0,언주로 103,-,-,-,,
4,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),전세,77.75,201102,24,30500,0,9.0,1988.0,언주로 103,-,-,-,,


In [None]:
df_default.isnull().sum() # 번지, 본번, and 부번 contain null values

시군구                    0
번지                  1646
본번                   240
부번                   240
단지명                    0
전월세구분                  0
전용면적(㎡)               36
계약년월                   0
계약일                    0
보증금(만원)                0
월세(만원)                 0
층                     36
건축년도                 261
도로명                    0
계약기간                   0
계약구분                   0
갱신요구권 사용               0
종전계약 보증금 (만원)    1806760
종전계약 월세 (만원)     1806760
dtype: int64

In [None]:
df_default['전월세구분'].unique()

array(['전세', '월세'], dtype=object)

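- 전월세구분 takes only two values, '전세' (jeonse, a lump-sum deposit lease) and '월세' (monthly rent), so the data is easy to split with a boolean condition.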
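
The split by rent type can be sketched with `groupby`, which returns one frame per category without writing a separate condition for each. The helper name `split_by_rent_type` is hypothetical; the default column name `전월세구분` is the one shown in the output above.

```python
import pandas as pd


def split_by_rent_type(rentals: pd.DataFrame, type_col: str = "전월세구분") -> dict:
    """Return {rent type: independent copy of the rows of that type}."""
    return {rent_type: group.copy() for rent_type, group in rentals.groupby(type_col)}
```

For example, `split_by_rent_type(df_default)['전세']` would hold only the jeonse rows, and an unexpected third category would show up as an extra key instead of being silently dropped.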

## Building the jeonse DataFrame

- The steps are nearly identical to those for apartment_deal, so they follow the apartment_deal.csv workflow and are combined into a single cell.
- The comments mark the intermediate steps.

In [None]:
import numpy as np

# Build the jeonse DataFrame
df_full_rent = df_default.loc[df_default['전월세구분']=='전세',['시군구','본번','부번','도로명','계약년월','계약일','보증금(만원)','전용면적(㎡)','단지명']].copy()
df_full_rent.columns = ['address','main_number','sub_number','road','year_month','day','full_rent_price','area','name']


df_full_rent = df_full_rent.astype({'full_rent_price':'str','year_month':'str','day':'str'})
df_full_rent["full_rent_price"] = df_full_rent["full_rent_price"].str.replace(",", "") # strip the ',' thousands separators
df_full_rent.loc[df_full_rent["day"].str.len()==1,"day"]='0'+df_full_rent.loc[df_full_rent["day"].str.len()==1,"day"] # prepend '0' when the day is a single digit
df_full_rent['year'] = df_full_rent['year_month'].str[0:4] # extract the year from the combined year-month column
df_full_rent['month'] = df_full_rent['year_month'].str[4:] # extract the month from the combined year-month column
df_full_rent['date'] = pd.to_datetime(df_full_rent['year']+df_full_rent['month']+df_full_rent['day']) # combine the parts into a date column
df_full_rent = df_full_rent.astype({'year':'int64','month':'int64','day':'int64','full_rent_price':'int64'})
df_full_rent = df_full_rent.drop(['year_month'], axis=1) # drop the column that is no longer used

df_full_rent["address_0"] = df_full_rent["address"].str.split(' ',expand=True)[0] # city (si); always Seoul in this data
df_full_rent["address_1"] = df_full_rent["address"].str.split(' ',expand=True)[1] # district (gu)
df_full_rent["address_2"] = df_full_rent["address"].str.split(' ',expand=True)[2] # neighborhood (dong)
df_full_rent["road_name"] = df_full_rent["road"].str.split(' ',expand=True)[0] # road name
df_full_rent["road_number"] = df_full_rent["road"].str.split(' ',expand=True)[1] # road number
df_full_rent= df_full_rent[['year','month','day','address_0','address_1','address_2','main_number','sub_number','road_name','road_number','area',"full_rent_price",'name','date']] # keep only the columns to be used


df_full_rent = df_full_rent.replace('', np.nan) # convert values that are exactly '' to null (np.nan, as in the sale-data cell)

df_full_rent.loc[df_full_rent['name']=='힐스테이트 서초 젠트리스','main_number'] = 557
df_full_rent.loc[df_full_rent['name']=='힐스테이트 서초 젠트리스','sub_number'] = 0


df_full_rent = df_full_rent[['date','year','month','day','address_0','address_1','address_2','main_number','sub_number','name','area','full_rent_price']].copy()
df_full_rent.columns =['date','year','month','day','address_0','address_1','address_2','address_3','address_4','name','area','full_rent_price']

In [None]:
df_full_rent.isnull().sum()

date                0
year                0
month               0
day                 0
address_0           0
address_1           0
address_2           0
address_3           0
address_4           0
name                0
area               25
full_rent_price     0
dtype: int64

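### Handling missing values in the 'area' column

- Unlike the apartment_deal.csv workflow, the area column has missing values here, so a missing-value step is added.
- Each missing value is replaced with the area value that appears most often among the jeonse transactions at the same address.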
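
The rule described above, filling a missing area with the most frequent area at the same address, can also be sketched as a grouped transform instead of an explicit loop. This is an illustrative alternative under the assumption that an address is identified by `address_1` through `address_4`; the helper name `fill_area_with_group_mode` is hypothetical. Rows whose address has no known area at all are left as NaN.

```python
import numpy as np
import pandas as pd

ADDRESS_COLS = ["address_1", "address_2", "address_3", "address_4"]


def fill_area_with_group_mode(frame, key_cols=ADDRESS_COLS, value_col="area"):
    """Fill missing value_col entries with the most frequent value in the same key group."""

    def most_common(values):
        counts = values.value_counts()  # NaN is excluded by default
        return counts.idxmax() if len(counts) else np.nan

    # transform broadcasts each group's most common value back to every row of the group
    group_mode = frame.groupby(key_cols)[value_col].transform(most_common)
    filled = frame.copy()
    filled[value_col] = filled[value_col].fillna(group_mode)
    return filled
```

Only rows whose area is missing are changed; existing area values are kept as they are.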

In [None]:
# Inspect the rows with a missing area
df_full_rent[df_full_rent['area'].isnull()].tail()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,full_rent_price
357440,2013-11-16,2013,11,16,서울특별시,노원구,공릉동,683.0,14.0,한일휴니스빌,,8000
375219,2013-11-30,2013,11,30,서울특별시,동대문구,장안동,312.0,8.0,태솔에버빌,,12000
389892,2013-01-17,2013,1,17,서울특별시,서대문구,창천동,501.0,14.0,삼성아트빌,,9000
439901,2013-01-20,2013,1,20,서울특별시,영등포구,영등포동4가,103.0,0.0,영등포그랑그루,,8000
490009,2014-02-19,2014,2,19,서울특별시,강서구,화곡동,29.0,47.0,드림하우스(29-47),,9500


In [None]:
# Collect the address columns of the rows whose area is null into lists
add_1 = list(df_full_rent.loc[df_full_rent['area'].isnull(),'address_1'])
add_2 = list(df_full_rent.loc[df_full_rent['area'].isnull(),'address_2'])
add_3 = list(df_full_rent.loc[df_full_rent['area'].isnull(),'address_3'])
add_4 = list(df_full_rent.loc[df_full_rent['area'].isnull(),'address_4'])
area_list = list()

In [None]:
# Fill area_list with one replacement value per missing row
for i in range(len(add_1)):
    # If no transaction at this address has an 'area' value, there is nothing to reference, so use ''
    if (len(df_full_rent.loc[(df_full_rent['address_1'] ==add_1[i]) &
                     (df_full_rent['address_2'] ==add_2[i]) &
                     (df_full_rent['address_3'] ==add_3[i]) &
                     (df_full_rent['address_4'] ==add_4[i]),
                     'area'].value_counts())) == 0:

        area_list.append('')
    else:
        # Otherwise use the area that was traded most often at this address
        area_list.append(df_full_rent.loc[(df_full_rent['address_1'] ==add_1[i]) &
                     (df_full_rent['address_2'] ==add_2[i]) &
                     (df_full_rent['address_3'] ==add_3[i]) &
                     (df_full_rent['address_4'] ==add_4[i]),
                     'area'].value_counts().idxmax())
print(area_list) # the most frequently traded area for each address whose area was null

[84.9, 33.33, 15.94, 15.94, 84.98, 142.034, 142.034, 142.034, 142.034, 17.07, 17.07, 17.07, 17.07, 17.07, 64.52, 23.47, 23.47, 13.2195, 13.2195, 13.2195, 13.2195, 49.65, 39.28, 12.1, '']


- The final '' entry means that this listing has no other transaction at the same address to reference.

In [None]:
# Use len to check that all the lists were built with the same length
print(len(add_1),len(add_2),len(add_3),len(add_4),len(area_list))

25 25 25 25 25


In [None]:
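# The last entry is '', so look for a reference value to fill that row's area -> none exists
# There is no way to determine this area because no other transaction at the address records one -> the row must be dropped later
df_full_rent.loc[(df_full_rent['address_3']==29)&(df_full_rent['address_4']==47),:] # inspect the address of the row that has no reference value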

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,full_rent_price
490009,2014-02-19,2014,2,19,서울특별시,강서구,화곡동,29.0,47.0,드림하우스(29-47),,9500


In [None]:
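# Fill the null 'area' cells with the most frequently traded 'area' at the same address
# The isnull() condition limits the assignment to the missing cells, so recorded areas are not overwritten
for i in range(len(add_1)):
    df_full_rent.loc[(df_full_rent['address_1'] ==add_1[i]) &
                         (df_full_rent['address_2'] ==add_2[i]) &
                         (df_full_rent['address_3'] ==add_3[i]) &
                         (df_full_rent['address_4'] ==add_4[i]) &
                         (df_full_rent['area'].isnull()),
                         'area']=area_list[i]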

In [None]:
# Check that '' was stored in place of the null
df_full_rent.loc[df_full_rent['area']=='',:]

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,full_rent_price
490009,2014-02-19,2014,2,19,서울특별시,강서구,화곡동,29.0,47.0,드림하우스(29-47),,9500


In [None]:
# Drop the rows whose area is ''
df_full_rent=df_full_rent.drop(df_full_rent[df_full_rent['area']==''].index)

# Check after dropping
df_full_rent.loc[df_full_rent['area']=='',:] # confirm that the rows are gone

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,full_rent_price


In [None]:
df_full_rent.info() # confirm that the nulls have been handled

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1498500 entries, 0 to 2170785
Data columns (total 12 columns):
 #   Column           Non-Null Count    Dtype         
---  ------           --------------    -----         
 0   date             1498500 non-null  datetime64[ns]
 1   year             1498500 non-null  int64         
 2   month            1498500 non-null  int64         
 3   day              1498500 non-null  int64         
 4   address_0        1498500 non-null  object        
 5   address_1        1498500 non-null  object        
 6   address_2        1498500 non-null  object        
 7   address_3        1498500 non-null  float64       
 8   address_4        1498500 non-null  float64       
 9   name             1498500 non-null  object        
 10  area             1498500 non-null  object        
 11  full_rent_price  1498500 non-null  int64         
dtypes: datetime64[ns](1), float64(2), int64(4), object(5)
memory usage: 148.6+ MB


- This DataFrame consists of the 'date', 'year', 'month', and 'day' columns for the transaction date; the 'address_0' through 'address_4' columns for the address; the 'name' column for the apartment name; the 'area' column for the floor area; and the 'full_rent_price' column for the jeonse price.

In [None]:
df_full_rent.to_csv('/content/drive/MyDrive/house_price/after_data/apartment_full_rent.csv', index=False) # write the jeonse CSV file

## Building the monthly-rent DataFrame

- See the section on building the jeonse DataFrame.

In [None]:
# Build the monthly-rent DataFrame, keeping only the needed columns
df_month_rent = df_default.loc[df_default['전월세구분']=='월세',['시군구','본번','부번','도로명','계약년월','계약일','보증금(만원)','월세(만원)','전용면적(㎡)','단지명']].copy()
df_month_rent.columns = ['address','main_number','sub_number','road','year_month','day','rent_deposit','month_rent_price','area','name']
# df_month_rent.head()

df_month_rent.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 672285 entries, 25 to 2170783
Data columns (total 10 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   address           672285 non-null  object 
 1   main_number       672235 non-null  float64
 2   sub_number        672235 non-null  float64
 3   road              672285 non-null  object 
 4   year_month        672285 non-null  int64  
 5   day               672285 non-null  int64  
 6   rent_deposit      672285 non-null  object 
 7   month_rent_price  672285 non-null  object 
 8   area              672274 non-null  float64
 9   name              672285 non-null  object 
dtypes: float64(3), int64(2), object(5)
memory usage: 56.4+ MB


Note the difference from the jeonse section! ↓

In [None]:
df_month_rent["month_rent_price2"] = df_month_rent["month_rent_price"].str.replace(',','')
df_month_rent.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 672285 entries, 25 to 2170783
Data columns (total 11 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   address            672285 non-null  object 
 1   main_number        672235 non-null  float64
 2   sub_number         672235 non-null  float64
 3   road               672285 non-null  object 
 4   year_month         672285 non-null  int64  
 5   day                672285 non-null  int64  
 6   rent_deposit       672285 non-null  object 
 7   month_rent_price   672285 non-null  object 
 8   area               672274 non-null  float64
 9   name               672285 non-null  object 
 10  month_rent_price2  607817 non-null  object 
dtypes: float64(3), int64(2), object(6)
memory usage: 61.5+ MB


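- Creating the month_rent_price2 column by applying replace to "month_rent_price" does not work as intended.

>> df_month_rent["month_rent_price"].str.replace(',','')

>> After running this, 'month_rent_price2' has only 607,817 non-null values out of 672,285, so tens of thousands of rows became null -> the replace method is not behaving as expected.

>> Why? -> The column has dtype object, which can hold a mix of Python types. The `.str` accessor only processes elements that are actual strings and returns NaN for everything else, so the values that were most likely read in as numbers turn into nulls.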
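
The behaviour can be reproduced on a small example. This is a sketch with made-up values; the series `mixed` is hypothetical and only mimics an object column that holds both strings and integers.

```python
import pandas as pd

# Hypothetical mixed column: some values are strings, some are already ints.
mixed = pd.Series(["1,200", 63, "35", 160], dtype="object")

# .str methods return NaN for elements that are not strings.
broken = mixed.str.replace(",", "")
print(broken.isna().sum())  # number of values lost to NaN

# Casting to str first turns every element into a string, so nothing is lost.
fixed = mixed.astype(str).str.replace(",", "").astype("int64")
print(fixed.tolist())
```

This is why the next cell casts the two price columns to `str` before stripping the commas.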

In [None]:
# The types must be converted first, as done here, before continuing
df_month_rent = df_month_rent.astype({'month_rent_price':'str','rent_deposit':'str'})

- The process is almost identical to the one for apartment_deal, so it is done in a single cell.
- The commented-out lines are intermediate checks.

In [None]:
df_month_rent["rent_deposit"] = df_month_rent["rent_deposit"].str.replace(",", "")
df_month_rent["month_rent_price"] = df_month_rent["month_rent_price"].str.replace(',','')
df_month_rent = df_month_rent.astype({'year_month':'str','day':'str','rent_deposit':'int64','month_rent_price':'int64'})
df_month_rent['year'] = df_month_rent['year_month'].str[0:4] # extract the year from the combined year-month column
df_month_rent['month'] = df_month_rent['year_month'].str[4:] # extract the month from the combined year-month column
df_month_rent.loc[df_month_rent["day"].str.len()==1,"day"]='0'+df_month_rent.loc[df_month_rent["day"].str.len()==1,"day"] # zero-pad single-digit days
df_month_rent['date'] = pd.to_datetime(df_month_rent['year']+df_month_rent['month']+df_month_rent['day']) # concatenate the parts to build the date column
df_month_rent = df_month_rent.astype({'year':'int64','month':'int64','day':'int64'})
df_month_rent = df_month_rent.drop(['year_month'], axis=1) # drop the column that is no longer needed

df_month_rent["address_0"] = df_month_rent["address"].str.split(' ',expand=True)[0] # extract the city (always Seoul in this dataset)
df_month_rent["address_1"] = df_month_rent["address"].str.split(' ',expand=True)[1] # extract the district ('구')
df_month_rent["address_2"] = df_month_rent["address"].str.split(' ',expand=True)[2] # extract the neighborhood ('동')
df_month_rent["road_name"] = df_month_rent["road"].str.split(' ',expand=True)[0] # extract the road name
df_month_rent["road_number"] = df_month_rent["road"].str.split(' ',expand=True)[1] # extract the road number
df_month_rent= df_month_rent[['year','month','day','address_0','address_1','address_2','main_number','sub_number','road_name','road_number','area',"rent_deposit","month_rent_price",'name','date']] # keep only the columns that will be used


df_month_rent = df_month_rent.replace('', None) # turn empty-string values into nulls

df_month_rent.loc[df_month_rent['name']=='힐스테이트 서초 젠트리스','main_number'] = 557
df_month_rent.loc[df_month_rent['name']=='힐스테이트 서초 젠트리스','sub_number'] = 0

df_month_rent = df_month_rent[['date','year','month','day','address_0','address_1','address_2','main_number','sub_number','name','area','rent_deposit','month_rent_price']]
df_month_rent.columns =['date','year','month','day','address_0','address_1','address_2','address_3','address_4','name','area','rent_deposit','month_rent_price']

In [None]:
df_month_rent.isnull().sum()

date                 0
year                 0
month                0
day                  0
address_0            0
address_1            0
address_2            0
address_3            0
address_4            0
name                 0
area                11
rent_deposit         0
month_rent_price     0
dtype: int64

### Handling missing values in the 'area' column

- See the jeonse section on handling missing area values.

In [None]:
add_1 = list(df_month_rent.loc[df_month_rent['area'].isnull(),'address_1'])
add_2 = list(df_month_rent.loc[df_month_rent['area'].isnull(),'address_2'])
add_3 = list(df_month_rent.loc[df_month_rent['area'].isnull(),'address_3'])
add_4 = list(df_month_rent.loc[df_month_rent['area'].isnull(),'address_4'])
area_list = list()
# Fill area_list with one replacement value per missing row
for i in range(len(add_1)):
    # If no transaction at this address has an 'area' value, there is nothing to reference, so use ''
    if (len(df_month_rent.loc[(df_month_rent['address_1'] ==add_1[i]) &
                     (df_month_rent['address_2'] ==add_2[i]) &
                     (df_month_rent['address_3'] ==add_3[i]) &
                     (df_month_rent['address_4'] ==add_4[i]),
                     'area'].value_counts())) == 0:

        area_list.append('')
    else:
        # Otherwise use the area that was traded most often at this address
        area_list.append(df_month_rent.loc[(df_month_rent['address_1'] ==add_1[i]) &
                     (df_month_rent['address_2'] ==add_2[i]) &
                     (df_month_rent['address_3'] ==add_3[i]) &
                     (df_month_rent['address_4'] ==add_4[i]),
                     'area'].value_counts().idxmax())


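# The isnull() condition limits the assignment to the missing cells, so recorded areas are not overwritten
for i in range(len(add_1)):
    df_month_rent.loc[(df_month_rent['address_1'] ==add_1[i]) &
                         (df_month_rent['address_2'] ==add_2[i]) &
                         (df_month_rent['address_3'] ==add_3[i]) &
                         (df_month_rent['address_4'] ==add_4[i]) &
                         (df_month_rent['area'].isnull()),
                         'area']=area_list[i]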



In [None]:
df_month_rent.isnull().sum()

date                0
year                0
month               0
day                 0
address_0           0
address_1           0
address_2           0
address_3           0
address_4           0
name                0
area                0
rent_deposit        0
month_rent_price    0
dtype: int64

In [None]:
df_month_rent.head()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,rent_deposit,month_rent_price
25,2011-03-18,2011,3,18,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,19000,63
28,2011-04-09,2011,4,9,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,21000,35
38,2011-07-09,2011,7,9,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,3000,160
46,2011-09-19,2011,9,19,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,6000,140
47,2011-09-20,2011,9,20,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,5000,160


In [None]:
df_month_rent.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 672285 entries, 25 to 2170783
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   date              672285 non-null  datetime64[ns]
 1   year              672285 non-null  int64         
 2   month             672285 non-null  int64         
 3   day               672285 non-null  int64         
 4   address_0         672285 non-null  object        
 5   address_1         672285 non-null  object        
 6   address_2         672285 non-null  object        
 7   address_3         672285 non-null  float64       
 8   address_4         672285 non-null  float64       
 9   name              672285 non-null  object        
 10  area              672285 non-null  float64       
 11  rent_deposit      672285 non-null  int64         
 12  month_rent_price  672285 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(5), object(4)
mem

- This DataFrame consists of the 'date', 'year', 'month', and 'day' columns for the transaction date; the 'address_0' through 'address_4' columns for the address; the 'name' column for the apartment name; the 'area' column for the floor area; the 'rent_deposit' column for the deposit; and the 'month_rent_price' column for the monthly rent.

In [None]:
df_month_rent.to_csv('/content/drive/MyDrive/house_price/after_data/apartment_month_rent.csv', index=False)

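# Creating economic_data.csv

- Build the economic_data file (macroeconomic information).
- economic_data contains the Korean base rate, the real-estate index, the KOSPI index, Korean government bond yields, US government bond yields, the long-short yield spread, new apartment supply, the number of unsold apartments, and the unsold-apartment rate.

## Building the base-rate DataFrame

- The page 'https://www.bok.or.kr/portal/singl/baseRate/list.do?dataSeCd=01&menuNo=200643' lists the dates on which the base rate changed, so it is scraped to build a DataFrame of the base rate by date.

### Scraping the base-rate information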

In [None]:
# Import libraries

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
# Fetch the web page

res = requests.get('https://www.bok.or.kr/portal/singl/baseRate/list.do?dataSeCd=01&menuNo=200643')

# Parse the web page
soup = BeautifulSoup(res.content,'html.parser')

# Select the table rows that hold the data
items = soup.select('#content > div.table.tac > table > tbody > tr')

# Lists that collect the scraped values; they become the DataFrame columns below
change_year_list = list()
change_date_list = list()
rp_list = list()

# Read the text of each table cell and append it to the matching list
for item in items:
    table_list = item.select('td')
    change_year_list.append(table_list[0].get_text())
    change_date_list.append(table_list[1].get_text())
    rp_list.append(table_list[2].get_text())

# df holds the scraped base-rate history
df = pd.DataFrame({
    "year": change_year_list,
    "change_date": change_date_list,
    "korea_rp": rp_list
}, columns=["year", "change_date", "korea_rp"])

df.tail() # check the shape of the DataFrame

Unnamed: 0,year,change_date,korea_rp
50,2001,07월 05일,4.75
51,2001,02월 08일,5.0
52,2000,10월 05일,5.25
53,2000,02월 10일,5.0
54,1999,05월 06일,4.75


- change_date is the date on which the base rate changed, and korea_rp is the new base rate.

### Creating the rp_date column

- The year and change_date columns both describe the date, so they are merged into a single column.
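
As an illustration of this step, the sketch below parses two of the scraped rows shown above into a single datetime column. It is an alternative to fixed-position slicing, not the notebook's method: the frame `raw` and the regular expression are assumptions introduced here.

```python
import pandas as pd

# Two rows copied from the scraped table above.
raw = pd.DataFrame({
    "year": ["2001", "1999"],
    "change_date": ["07월 05일", "05월 06일"],
    "korea_rp": ["4.75", "4.75"],
})

# Pull the two numeric groups (month, day) out of strings like "07월 05일".
md = raw["change_date"].str.extract(r"(\d{1,2})\D+(\d{1,2})")

# Build a 'YYYYMMDD' string and parse it with a format that matches it exactly.
stamp = raw["year"].str.strip() + md[0].str.zfill(2) + md[1].str.zfill(2)
raw["rp_date"] = pd.to_datetime(stamp, format="%Y%m%d")
print(raw[["rp_date", "korea_rp"]])
```

Extracting the digits by pattern keeps working if the site ever drops the leading zeros, whereas slicing by position assumes the text always has the same width.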

In [None]:
df['month']=df['change_date'].str[0:2] # extract the month
df['date'] = df['change_date'].str[4:6] # extract the day
df = df.astype({'korea_rp':'float64'}) # convert the rate column to float
df['rp_date'] = df['year']+df['month']+df['date'] # build a new 'YYYYMMDD' string column
df = df.drop(['change_date', 'year','month','date'], axis=1) # drop the columns that are no longer needed
df=df.sort_index(ascending=False) # the rows are in reverse chronological order, so reverse them
df['rp_date'] = pd.to_datetime(df['rp_date'], format='%Y%m%d', errors='raise') # convert to datetime; the format matches the 'YYYYMMDD' string built above

In [None]:
df.head() # check the shape of the DataFrame

Unnamed: 0,korea_rp,rp_date
54,4.75,1999-05-06
53,5.0,2000-02-10
52,5.25,2000-10-05
51,5.0,2001-02-08
50,4.75,2001-07-05


### Generating the base rate for the dates between rate changes

- The DataFrame above holds only the dates on which the base rate changed and the new rate. The base rate is also needed for every date in between, so those rows are generated next.
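
The idea can be sketched compactly with `pd.date_range`, a left merge, and a forward fill. The change dates and rates below are hypothetical values chosen for illustration, and the names `changes`, `days`, and `daily` are not from the notebook.

```python
import pandas as pd

# Hypothetical change dates and rates, for illustration only.
changes = pd.DataFrame({
    "rp_date": pd.to_datetime(["2020-01-01", "2020-01-04"]),
    "korea_rp": [1.25, 0.75],
})

# One row per calendar day; date_range includes both endpoints.
days = pd.DataFrame({"date": pd.date_range("2020-01-01", "2020-01-06", freq="D")})

# Left merge keeps every day; only the change dates receive a rate.
daily = days.merge(changes, left_on="date", right_on="rp_date", how="left")
# Carry the most recently announced rate forward until the next change.
daily["korea_rp"] = daily["korea_rp"].ffill()
daily = daily[["date", "korea_rp"]]
print(daily)
```

The cells that follow do the same thing in separate steps, building the list of dates with `datetime` first.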

In [None]:
import datetime

# Generate every date in the scraped period
start = datetime.datetime.strptime("06-05-1999", "%d-%m-%Y") # start date
end = datetime.datetime.strptime("31-01-2023", "%d-%m-%Y") # end date (excluded by the range below)
date_generated = [start + datetime.timedelta(days=x) for x in range(0, (end-start).days)] # every date from start up to, but not including, end
date_list=list()
for date in date_generated:
    date_list.append(date.strftime("%Y-%m-%d")) # store each generated date as a 'YYYY-MM-DD' string

In [None]:
# df_date is a DataFrame holding every date to look up
df_date = pd.DataFrame({
    "date": date_list
}, columns=["date"])
df_date['date'] = pd.to_datetime(df_date['date'], format='%Y-%m-%d', errors='raise') # convert to datetime; the strings are 'YYYY-MM-DD'

In [None]:
df_date.head() # check the shape of the DataFrame

Unnamed: 0,date
0,1999-05-06
1,1999-05-07
2,1999-05-08
3,1999-05-09
4,1999-05-10


In [None]:
# Merge the two DataFrames to get the base rate by date
df_rp=pd.merge(df_date, df, left_on='date', right_on='rp_date', how='left')

In [None]:
# Keep only the columns that will be used
df_rp = df_rp[['date','korea_rp']]
df_rp # check the resulting DataFrame

Unnamed: 0,date,korea_rp
0,1999-05-06,4.75
1,1999-05-07,
2,1999-05-08,
3,1999-05-09,
4,1999-05-10,
...,...,...
8666,2023-01-26,
8667,2023-01-27,
8668,2023-01-28,
8669,2023-01-29,


In [None]:
# A base rate stays in effect until the next change, so fill each null with the closest earlier value (the most recently announced rate)
# This yields the base rate for every calendar day
df_rp=df_rp.ffill() # ffill() propagates the last non-null value forward
df_rp

Unnamed: 0,date,korea_rp
0,1999-05-06,4.75
1,1999-05-07,4.75
2,1999-05-08,4.75
3,1999-05-09,4.75
4,1999-05-10,4.75
...,...,...
8666,2023-01-26,3.50
8667,2023-01-27,3.50
8668,2023-01-28,3.50
8669,2023-01-29,3.50


In [None]:
# Plot the base-rate history
# x-axis: date, y-axis: base rate
import plotly.graph_objects as go

# Create traces
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_rp['date'], y=df_rp['korea_rp'],
                    mode='lines',
                    name='korea_rp',yaxis='y1'))


fig.show(renderer="colab")

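## Adding the real-estate index data

- The apartment sales price index file was downloaded from https://data.seoul.go.kr/dataList/801/S/2/datasetView.do.
- The index is used for a rough check of whether the macroeconomic indicators are related to apartment prices.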

In [None]:
# Load the real-estate index file
df_real_estate = pd.read_csv("/content/drive/MyDrive/house_price/original_data/seoul_deal_index.csv",  encoding='UTF8') # load the real-estate index
df_real_estate= df_real_estate.loc[(df_real_estate['시점']>1998) & (df_real_estate['자치구별(2)']=='소계'),['시점','아파트']]# keep years after 1998 and the subtotal ('소계') rows, with the year ('시점') and apartment ('아파트') columns
df_real_estate.head()


Unnamed: 0,시점,아파트
39,1999,38.7
42,2000,40.3
45,2001,48.1
48,2002,62.9
51,2003,61.2


In [None]:
df_real_estate.info() # check the DataFrame info

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23 entries, 39 to 519
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   시점      23 non-null     int64  
 1   아파트     23 non-null     float64
dtypes: float64(1), int64(1)
memory usage: 552.0 bytes


In [None]:
# Convert the year to datetime and check the first rows
df_real_estate['시점'] = pd.to_datetime(df_real_estate['시점'], format='%Y') # parse the year as a datetime (January 1 of that year)
df_real_estate.head()

Unnamed: 0,시점,아파트
39,1999-01-01,38.7
42,2000-01-01,40.3
45,2001-01-01,48.1
48,2002-01-01,62.9
51,2003-01-01,61.2


In [None]:
df_real_estate.info() # confirm that the type changed

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23 entries, 39 to 519
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   시점      23 non-null     datetime64[ns]
 1   아파트     23 non-null     float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 552.0 bytes


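### Merging the base rate and the real-estate index

- Merge the base-rate and real-estate-index DataFrames.
- The base-rate DataFrame has a row for every date, so it is used as the left side of the merge.
- The real-estate index is assumed to stay constant throughout each year.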
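
One way to express "each day takes the index of its year" in a single call is `pd.merge_asof`, which matches every left row to the most recent right row at or before it. This is a sketch of an alternative, not what the notebook does; the frames `daily` and `annual` are illustrative, and the two index values are the 1999 and 2000 figures from the table above.

```python
import pandas as pd

# A few calendar days around a year boundary.
daily = pd.DataFrame({"date": pd.date_range("1999-12-30", "2000-01-02", freq="D")})

# Annual index values, stamped on January 1 of each year.
annual = pd.DataFrame({
    "시점": pd.to_datetime(["1999-01-01", "2000-01-01"]),
    "아파트": [38.7, 40.3],
})

# Both frames must be sorted by their key; the default direction is 'backward',
# so each day picks up the latest annual value that is not in the future.
merged = pd.merge_asof(daily, annual, left_on="date", right_on="시점")
print(merged[["date", "아파트"]])
```

With this approach, days in 1999 that fall before the first daily row's exact match still receive the 1999 value, so no manual fill of the leading nulls is needed.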

In [None]:
df_final=pd.merge(df_rp, df_real_estate, left_on='date', right_on='시점', how='left') # merge the base-rate and real-estate-index DataFrames
df_final=df_final.ffill() # forward-fill the nulls, assuming the index stays constant throughout each year
df_final.head()

Unnamed: 0,date,korea_rp,시점,아파트
0,1999-05-06,4.75,NaT,
1,1999-05-07,4.75,NaT,
2,1999-05-08,4.75,NaT,
3,1999-05-09,4.75,NaT,
4,1999-05-10,4.75,NaT,


In [None]:
df_final.info() # check the DataFrame info

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8671 entries, 0 to 8670
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   date      8671 non-null   datetime64[ns]
 1   korea_rp  8671 non-null   float64       
 2   시점        8431 non-null   datetime64[ns]
 3   아파트       8431 non-null   float64       
dtypes: datetime64[ns](2), float64(2)
memory usage: 338.7 KB


In [None]:
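df_final['아파트'] = df_final['아파트'].fillna(38.7) # fill the remaining nulls (the 1999 dates) with 38.7, the 1999 index value and the oldest one available
df_final = df_final[['date','korea_rp','아파트']] # keep only the columns that will be used
df_final.columns = ['date','korea_rp','apartment_index'] # rename the columns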

In [None]:
df_final.head()

Unnamed: 0,date,korea_rp,apartment_index
0,1999-05-06,4.75,38.7
1,1999-05-07,4.75,38.7
2,1999-05-08,4.75,38.7
3,1999-05-09,4.75,38.7
4,1999-05-10,4.75,38.7


### Comparing the (inverted) base rate with the real-estate index

In [None]:
# Plot the base rate and the real-estate index together
# The base rate is drawn on an inverted y-axis

import plotly.graph_objects as go

# Create traces
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['korea_rp'],
                    mode='lines',
                    name='korea_rp',yaxis='y1'))
# Invert the y-axis of the base rate (flips the curve about the x-axis)
fig.update_layout(
    yaxis = dict(autorange="reversed")
)


fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['apartment_index'],
                    mode='lines',
                    name='apartment_index',
                    yaxis="y2"))
fig.update_layout(

   # create first Y-axis
   yaxis=dict(
      title="rp point",
      titlefont=dict(color="blue"),
      tickfont=dict(color="blue")
   ),

   # create second Y-axis
   yaxis2=dict(
      title="apartment index",
      overlaying="y",
      side="right")
)
fig.show(renderer="colab")

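Until 2005 the two series move together, from 2005 to 2008 they move in opposite directions, and after 2008 they again move roughly together.
Could it be that, since 2008, the extra liquidity brought by quantitative easing has made the (inverted) base rate and real-estate prices move alike?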

## Adjusting the DataFrame period

- Jeonse and monthly-rent data are only available from 2011 onward, so the DataFrame is trimmed to 2011-2022.

In [None]:
df_final = df_final[(df_final['date']>='2011-01-01') & (df_final['date']<='2022-12-31')] # keep only the dates that will be used
df_final.head()

Unnamed: 0,date,korea_rp,apartment_index
4258,2011-01-01,2.5,93.0
4259,2011-01-02,2.5,93.0
4260,2011-01-03,2.5,93.0
4261,2011-01-04,2.5,93.0
4262,2011-01-05,2.5,93.0


In [None]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4383 entries, 4258 to 8640
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   date             4383 non-null   datetime64[ns]
 1   korea_rp         4383 non-null   float64       
 2   apartment_index  4383 non-null   float64       
dtypes: datetime64[ns](1), float64(2)
memory usage: 137.0 KB


### Comparing the (inverted) base rate with the real-estate index (2011-2022)

In [None]:
# Create traces
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['korea_rp'],
                    mode='lines',
                    name='korea_rp',yaxis='y1'))
# Invert the y-axis of the base rate (flips the curve about the x-axis)
fig.update_layout(
    yaxis = dict(autorange="reversed")
)


fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['apartment_index'],
                    mode='lines',
                    name='apartment_index',
                    yaxis="y2"))
fig.update_layout(

   # create first Y-axis
   yaxis=dict(
      title="rp point",
      titlefont=dict(color="blue"),
      tickfont=dict(color="blue")
   ),

   # create second Y-axis
   yaxis2=dict(
      title="apartment index",
      overlaying="y",
      side="right")
)

- The (inverted) base rate and the real-estate index appear to be related.

## Adding the KOSPI index data

In [None]:
df_kospi = pd.read_csv("/content/drive/MyDrive/house_price/original_data/kospi.csv",  encoding='UTF8') # load the KOSPI index data
df_kospi.head()

Unnamed: 0,날짜,종가,오픈,고가,저가,거래량,변동 %
0,2022- 12- 29,2236.4,2265.73,2272.67,2236.38,361.19M,-1.93%
1,2022- 12- 28,2280.45,2296.45,2296.45,2276.9,405.89M,-2.24%
2,2022- 12- 27,2332.79,2327.52,2335.99,2321.48,448.50M,0.68%
3,2022- 12- 26,2317.14,2312.54,2321.92,2304.2,427.84M,0.15%
4,2022- 12- 23,2313.69,2325.86,2333.08,2311.9,366.99M,-1.83%


In [None]:
df_kospi=df_kospi.sort_index(ascending=False) # the rows are in reverse chronological order, so reverse them
df_kospi.reset_index(drop=True, inplace=True) # reset the index
df_kospi.head()

Unnamed: 0,날짜,종가,오픈,고가,저가,거래량,변동 %
0,2007- 01- 02,1435.26,1438.89,1439.71,1430.06,147.74M,0.06%
1,2007- 01- 03,1409.35,1436.42,1437.79,1409.31,203.21M,-1.81%
2,2007- 01- 04,1397.29,1410.55,1411.12,1388.5,241.17M,-0.86%
3,2007- 01- 05,1385.76,1398.6,1400.59,1372.36,277.29M,-0.83%
4,2007- 01- 08,1370.81,1376.76,1384.65,1366.48,177.59M,-1.08%


In [None]:
df_kospi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3956 entries, 0 to 3955
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   날짜      3956 non-null   object
 1   종가      3956 non-null   object
 2   오픈      3956 non-null   object
 3   고가      3956 non-null   object
 4   저가      3956 non-null   object
 5   거래량     3956 non-null   object
 6   변동 %    3956 non-null   object
dtypes: object(7)
memory usage: 216.5+ KB


In [None]:
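# Keep only the needed columns, rename them, and convert the date type
df_kospi = df_kospi[['날짜','종가']].copy() # date ('날짜') and closing price ('종가'); copy() avoids a SettingWithCopyWarning on the assignment below
df_kospi.columns = ['kospi_date','kospi_index']
df_kospi["kospi_date"] = pd.to_datetime(df_kospi["kospi_date"])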
df_kospi.head()

Unnamed: 0,kospi_date,kospi_index
0,2007-01-02,1435.26
1,2007-01-03,1409.35
2,2007-01-04,1397.29
3,2007-01-05,1385.76
4,2007-01-08,1370.81


In [None]:
df_kospi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3956 entries, 0 to 3955
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   kospi_date   3956 non-null   datetime64[ns]
 1   kospi_index  3956 non-null   object        
dtypes: datetime64[ns](1), object(1)
memory usage: 61.9+ KB


In [None]:
# Convert kospi_index to a numeric type for later calculations
df_kospi["kospi_index"] = df_kospi["kospi_index"].str.replace(",", "") # the values are strings, so strip the thousands separators
df_kospi = df_kospi.astype({'kospi_index': 'float64'})# change the column type
df_kospi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3956 entries, 0 to 3955
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   kospi_date   3956 non-null   datetime64[ns]
 1   kospi_index  3956 non-null   float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 61.9 KB


In [None]:
df_kospi.head() # check the shape of the DataFrame

Unnamed: 0,kospi_date,kospi_index
0,2007-01-02,1435.26
1,2007-01-03,1409.35
2,2007-01-04,1397.29
3,2007-01-05,1385.76
4,2007-01-08,1370.81


### Merging with the KOSPI index data

In [None]:
# Merge the base-rate and real-estate-index DataFrame with the KOSPI index DataFrame
df_final=pd.merge(df_final, df_kospi, left_on='date', right_on='kospi_date', how='left') # join the two DataFrames
df_final.head()

Unnamed: 0,date,korea_rp,apartment_index,kospi_date,kospi_index
0,2011-01-01,2.5,93.0,NaT,
1,2011-01-02,2.5,93.0,NaT,
2,2011-01-03,2.5,93.0,2011-01-03,2070.08
3,2011-01-04,2.5,93.0,2011-01-04,2085.14
4,2011-01-05,2.5,93.0,2011-01-05,2082.55


In [None]:
df_final.info() # weekends and other market holidays leave nulls in the kospi_date and kospi_index columns

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4383 entries, 0 to 4382
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   date             4383 non-null   datetime64[ns]
 1   korea_rp         4383 non-null   float64       
 2   apartment_index  4383 non-null   float64       
 3   kospi_date       2958 non-null   datetime64[ns]
 4   kospi_index      2958 non-null   float64       
dtypes: datetime64[ns](2), float64(3)
memory usage: 205.5 KB


In [None]:
# Assume the index keeps its previous value on days when the market is closed,
# so fill each null with the preceding value
df_final["kospi_index"]=df_final["kospi_index"].ffill()
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4383 entries, 0 to 4382
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   date             4383 non-null   datetime64[ns]
 1   korea_rp         4383 non-null   float64       
 2   apartment_index  4383 non-null   float64       
 3   kospi_date       2958 non-null   datetime64[ns]
 4   kospi_index      4381 non-null   float64       
dtypes: datetime64[ns](2), float64(3)
memory usage: 205.5 KB


In [None]:
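# The nulls at the very top have no earlier value to copy, so the value was looked up by hand (via a Naver search) and filled in
df_final["kospi_index"] = df_final["kospi_index"].fillna(2051)
df_final.info() # check that the values were filled in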

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4383 entries, 0 to 4382
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   date             4383 non-null   datetime64[ns]
 1   korea_rp         4383 non-null   float64       
 2   apartment_index  4383 non-null   float64       
 3   kospi_date       2958 non-null   datetime64[ns]
 4   kospi_index      4383 non-null   float64       
dtypes: datetime64[ns](2), float64(3)
memory usage: 205.5 KB
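
The two cells above are needed because a forward fill has nothing to copy into the rows that precede the first observation. The sketch below shows this on a short series shaped like the first rows of `kospi_index` (two weekend nulls, then the 2011-01-03 and 2011-01-05 closes from the output above); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Leading nulls (market closed), then observed values with a gap in between.
s = pd.Series([np.nan, np.nan, 2070.08, np.nan, 2085.14])

filled = s.ffill()
print(filled.isna().sum())  # the leading nulls are still there

# Option used for the bond yields in the next section: back-fill what is left.
filled_both = s.ffill().bfill()
print(filled_both.tolist())
```

Back-filling copies a later value into earlier dates, so the manual fill with the last known earlier close, as done above, avoids borrowing information from the future.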


In [None]:
df_final.head() # check the shape

Unnamed: 0,date,korea_rp,apartment_index,kospi_date,kospi_index
0,2011-01-01,2.5,93.0,NaT,2051.0
1,2011-01-02,2.5,93.0,NaT,2051.0
2,2011-01-03,2.5,93.0,2011-01-03,2070.08
3,2011-01-04,2.5,93.0,2011-01-04,2085.14
4,2011-01-05,2.5,93.0,2011-01-05,2082.55


In [None]:
# Keep only the columns that will be used
df_final = df_final[['date','korea_rp','apartment_index','kospi_index']]
df_final.head()

Unnamed: 0,date,korea_rp,apartment_index,kospi_index
0,2011-01-01,2.5,93.0,2051.0
1,2011-01-02,2.5,93.0,2051.0
2,2011-01-03,2.5,93.0,2070.08
3,2011-01-04,2.5,93.0,2085.14
4,2011-01-05,2.5,93.0,2082.55


### Checking with a plot whether the KOSPI index is useful

In [None]:
# Create traces
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['kospi_index'],
                    mode='lines',
                    name='kospi_index',yaxis='y1'))



fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['apartment_index'],
                    mode='lines',
                    name='apartment_index',
                    yaxis="y2"))
fig.update_layout(

   # create first Y-axis
   yaxis=dict(
      title="kospi index",
      titlefont=dict(color="blue"),
      tickfont=dict(color="blue")
   ),

   # create second Y-axis
   yaxis2=dict(
      title="apartment index",
      overlaying="y",
      side="right")
)

- Are the KOSPI index and the real-estate index correlated to some degree? The plot alone does not make it clear.
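
When a plot is inconclusive, a correlation coefficient gives a rough numeric check. The sketch below uses synthetic data, not the notebook's series, and the frame `toy` and its construction are assumptions made only to show the calls. Two trending series can correlate strongly in levels simply because both drift, so the correlation of day-to-day changes is computed as well.

```python
import numpy as np
import pandas as pd

# Synthetic series that share a common random-walk component.
rng = np.random.default_rng(0)
n = 500
common = rng.normal(size=n).cumsum()
toy = pd.DataFrame({
    "kospi_index": 2000 + 10 * common + rng.normal(scale=5, size=n),
    "apartment_index": 90 + 0.5 * common + rng.normal(scale=0.5, size=n),
})

# Pearson correlation of the levels.
level_corr = toy["kospi_index"].corr(toy["apartment_index"])
# Pearson correlation of the day-to-day changes.
change_corr = toy["kospi_index"].diff().corr(toy["apartment_index"].diff())
print(round(level_corr, 3), round(change_corr, 3))
```

On the real `df_final`, the same two calls would apply to the `kospi_index` and `apartment_index` columns. Because the apartment index here is an annual value carried forward, its daily changes are mostly zero, so comparing yearly averages may be more informative than daily differences.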

## Adding the Korean government bond yield data

- The process is almost the same as for the KOSPI DataFrame.

In [None]:
import os


dir_path = "/content/drive/MyDrive/house_price/original_data/korean_bond"
file_list = os.listdir(dir_path)
file_list.sort()
name_list = list()
df_list = list()

# Read every CSV file in the folder and collect the DataFrames in a list
for csv_file in file_list:
    df_list.append(pd.read_csv(dir_path+"/"+csv_file , encoding='UTF8'))
    name_list.append(csv_file.split('.')[0])
for i in range(len(df_list)):
    df_korea = df_list[i] # take the i-th bond DataFrame
    df_korea=df_korea.sort_index(ascending=False) # the rows are in reverse chronological order, so reverse them
    df_korea.reset_index(drop=True, inplace=True) # reset the index
    df_korea = df_korea[['날짜','종가']].copy() # date ('날짜') and closing yield ('종가')
    df_korea.columns = ['korea_date',name_list[i]]
    df_korea['korea_date'] = pd.to_datetime(df_korea['korea_date'])
    df_final=pd.merge(df_final, df_korea, left_on='date', right_on='korea_date', how='left')
    df_final[name_list[i]]=df_final[name_list[i]].ffill() # fill holidays and weekends with the previous value
    df_final[name_list[i]]=df_final[name_list[i]].bfill() # fill the leading nulls with the next available value
    df_final = df_final.drop(['korea_date'], axis=1)

In [None]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4383 entries, 0 to 4382
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   date             4383 non-null   datetime64[ns]
 1   korea_rp         4383 non-null   float64       
 2   apartment_index  4383 non-null   float64       
 3   kospi_index      4383 non-null   float64       
 4   korea_10_year    4383 non-null   float64       
 5   korea_1_year     4383 non-null   float64       
 6   korea_20_year    4383 non-null   float64       
 7   korea_2_year     4383 non-null   float64       
 8   korea_3_year     4383 non-null   float64       
 9   korea_4_year     4383 non-null   float64       
 10  korea_5_year     4383 non-null   float64       
dtypes: datetime64[ns](1), float64(10)
memory usage: 410.9 KB


In [None]:
# Reorder the columns
df_final = df_final[['date', 'apartment_index','kospi_index','korea_rp',
                    'korea_1_year','korea_2_year','korea_3_year','korea_4_year','korea_5_year',
                    'korea_10_year','korea_20_year']]
df_final.head()

Unnamed: 0,date,apartment_index,kospi_index,korea_rp,korea_1_year,korea_2_year,korea_3_year,korea_4_year,korea_5_year,korea_10_year,korea_20_year
0,2011-01-01,93.0,2051.0,2.5,2.81,3.4,3.44,4.09,4.14,4.57,4.73
1,2011-01-02,93.0,2051.0,2.5,2.81,3.4,3.44,4.09,4.14,4.57,4.73
2,2011-01-03,93.0,2070.08,2.5,2.81,3.4,3.44,4.09,4.14,4.57,4.73
3,2011-01-04,93.0,2085.14,2.5,2.83,3.37,3.495,4.16,4.2,4.58,4.74
4,2011-01-05,93.0,2082.55,2.5,2.8,3.42,3.495,4.15,4.17,4.63,4.75


In [None]:
# Create year, month, and day columns
df_final['year'] = df_final['date'].dt.year
df_final['month'] = df_final['date'].dt.month
df_final['day'] = df_final['date'].dt.day
df_final.head()

Unnamed: 0,date,apartment_index,kospi_index,korea_rp,korea_1_year,korea_2_year,korea_3_year,korea_4_year,korea_5_year,korea_10_year,korea_20_year,year,month,day
0,2011-01-01,93.0,2051.0,2.5,2.81,3.4,3.44,4.09,4.14,4.57,4.73,2011,1,1
1,2011-01-02,93.0,2051.0,2.5,2.81,3.4,3.44,4.09,4.14,4.57,4.73,2011,1,2
2,2011-01-03,93.0,2070.08,2.5,2.81,3.4,3.44,4.09,4.14,4.57,4.73,2011,1,3
3,2011-01-04,93.0,2085.14,2.5,2.83,3.37,3.495,4.16,4.2,4.58,4.74,2011,1,4
4,2011-01-05,93.0,2082.55,2.5,2.8,3.42,3.495,4.15,4.17,4.63,4.75,2011,1,5


### Visualizing the apartment index and Korean government bond yields

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['korea_rp'],
                    mode='lines',
                    name='korea_rp',yaxis='y1'))

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['korea_1_year'],
                    mode='lines',
                    name='korea_1_year',yaxis='y1'))

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['korea_2_year'],
                    mode='lines',
                    name='korea_2_year',yaxis='y1'))

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['korea_3_year'],
                    mode='lines',
                    name='korea_3_year',yaxis='y1'))

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['korea_4_year'],
                    mode='lines',
                    name='korea_4_year',yaxis='y1'))

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['korea_5_year'],
                    mode='lines',
                    name='korea_5_year',yaxis='y1'))

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['korea_10_year'],
                    mode='lines',
                    name='korea_10_year',yaxis='y1'))

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['korea_20_year'],
                    mode='lines',
                    name='korea_20_year',yaxis='y1'))

# Reverse the yield axis so that the yield traces added above are drawn upside down
fig.update_layout(
    yaxis = dict(autorange="reversed")
)
fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['apartment_index'],
                    mode='lines',
                    name='apartment_index',
                    yaxis="y2"))
fig.update_layout(

   # create first Y-axis
   yaxis=dict(
      title=dict(text="rate index", font=dict(color="black")),
      tickfont=dict(color="black")
   ),

   # create second Y-axis
   yaxis2=dict(
      title="apartment index",
      overlaying="y",
      side="right")
)

- The Korean government bond yields (inverted) appear to be related to the apartment index
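
The visual impression noted above can be checked with a correlation coefficient. A minimal sketch on synthetic data follows; the numbers are invented for illustration and `toy` is a hypothetical name. The real check would run on the columns of `df_final`.

```python
import pandas as pd

# Synthetic series: yields drift down while the index drifts up (hypothetical values)
toy = pd.DataFrame({
    "korea_3_year": [3.4, 3.2, 2.9, 2.5, 2.0, 1.8],
    "apartment_index": [93.0, 93.5, 94.2, 95.0, 97.1, 98.0],
})

# Pearson correlation; a value close to -1 means the index tends to rise as yields fall,
# which is what a plot with a reversed yield axis shows as two lines moving together.
corr = toy["korea_3_year"].corr(toy["apartment_index"])
print(round(corr, 3))
```

A strong correlation between two trending series does not by itself show a causal link, so this is only a first sanity check.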

In [None]:
# The yields move along broadly similar paths, so keep only the 3-year and 10-year government bonds
df_final = df_final[['date','year','month','day','apartment_index','kospi_index','korea_rp',
                    'korea_3_year','korea_10_year']]
df_final.head()

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year
0,2011-01-01,2011,1,1,93.0,2051.0,2.5,3.44,4.57
1,2011-01-02,2011,1,2,93.0,2051.0,2.5,3.44,4.57
2,2011-01-03,2011,1,3,93.0,2070.08,2.5,3.44,4.57
3,2011-01-04,2011,1,4,93.0,2085.14,2.5,3.495,4.58
4,2011-01-05,2011,1,5,93.0,2082.55,2.5,3.495,4.63


## Adding US Treasury yield data

- The process is almost identical to the one used to build the Korean government bond yield DataFrame

In [None]:
# Initialize the variables
dir_path = "/content/drive/MyDrive/house_price/original_data/us_bond"
file_list = os.listdir(dir_path)
file_list.sort()
name_list = list()
df_list = list()

# Read every CSV file in the folder and collect the resulting DataFrames in a list
for csv_file in file_list:
    df_list.append(pd.read_csv(os.path.join(dir_path, csv_file), encoding='UTF8'))
    name_list.append(csv_file.split('.')[0])
for i in range(len(df_list)):
    df_us = df_list[i]
    df_us = df_us.sort_index(ascending=False) # rows are in reverse date order, so flip them
    df_us.reset_index(drop=True, inplace=True) # reset the index
    df_us = df_us[['날짜','종가']].copy() # keep only the date ('날짜') and closing value ('종가') columns
    df_us.columns = ['us_date',name_list[i]]
    df_us['us_date'] = pd.to_datetime(df_us['us_date'])
    df_final = pd.merge(df_final, df_us, left_on='date', right_on='us_date', how='left')
    df_final[name_list[i]] = df_final[name_list[i]].ffill() # carry the last quote over holidays and weekends
    df_final[name_list[i]] = df_final[name_list[i]].bfill() # fill the leading rows with the first available quote
    df_final = df_final.drop(['us_date'], axis=1)

In [None]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4383 entries, 0 to 4382
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   date             4383 non-null   datetime64[ns]
 1   year             4383 non-null   int64         
 2   month            4383 non-null   int64         
 3   day              4383 non-null   int64         
 4   apartment_index  4383 non-null   float64       
 5   kospi_index      4383 non-null   float64       
 6   korea_rp         4383 non-null   float64       
 7   korea_3_year     4383 non-null   float64       
 8   korea_10_year    4383 non-null   float64       
 9   us_10_year       4383 non-null   float64       
 10  us_1_month       4383 non-null   float64       
 11  us_2_year        4383 non-null   float64       
 12  us_30_year       4383 non-null   float64       
 13  us_3_month       4383 non-null   float64       
 14  us_3_year        4383 non-null   float64

In [None]:
df_final = df_final[['date','year','month','day','apartment_index','kospi_index','korea_rp',
                    'korea_3_year','korea_10_year','us_1_month','us_3_month',
                    'us_6_month','us_2_year', 'us_3_year', 'us_5_year',
                    'us_10_year','us_30_year']]

In [None]:
df_final.head()

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,us_1_month,us_3_month,us_6_month,us_2_year,us_3_year,us_5_year,us_10_year,us_30_year
0,2011-01-01,2011,1,1,93.0,2051.0,2.5,3.44,4.57,0.096,0.124,0.183,0.601,1.006,2.011,3.334,4.401
1,2011-01-02,2011,1,2,93.0,2051.0,2.5,3.44,4.57,0.096,0.124,0.183,0.601,1.006,2.011,3.334,4.401
2,2011-01-03,2011,1,3,93.0,2070.08,2.5,3.44,4.57,0.096,0.124,0.183,0.601,1.006,2.011,3.334,4.401
3,2011-01-04,2011,1,4,93.0,2085.14,2.5,3.495,4.58,0.106,0.142,0.187,0.621,1.026,2.016,3.338,4.422
4,2011-01-05,2011,1,5,93.0,2082.55,2.5,3.495,4.63,0.129,0.142,0.184,0.708,1.129,2.133,3.463,4.541


### Comparing US Treasury yields with the apartment index

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['us_1_month'],
                    mode='lines',
                    name='us_1_month',yaxis='y1'))

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['us_3_month'],
                    mode='lines',
                    name='us_3_month',yaxis='y1'))

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['us_6_month'],
                    mode='lines',
                    name='us_6_month',yaxis='y1'))

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['us_2_year'],
                    mode='lines',
                    name='us_2_year',yaxis='y1'))

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['us_3_year'],
                    mode='lines',
                    name='us_3_year',yaxis='y1'))

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['us_5_year'],
                    mode='lines',
                    name='us_5_year',yaxis='y1'))

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['us_10_year'],
                    mode='lines',
                    name='us_10_year',yaxis='y1'))

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['us_30_year'],
                    mode='lines',
                    name='us_30_year',yaxis='y1'))

# Reverse the yield axis so that the yield traces added above are drawn upside down
fig.update_layout(
    yaxis = dict(autorange="reversed")
)
fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['apartment_index'],
                    mode='lines',
                    name='apartment_index',
                    yaxis="y2"))
fig.update_layout(

   # create first Y-axis
   yaxis=dict(
      title=dict(text="rate index", font=dict(color="black")),
      tickfont=dict(color="black")
   ),

   # create second Y-axis
   yaxis2=dict(
      title="apartment index",
      overlaying="y",
      side="right")
)

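- The US Treasury yields (inverted) seem to track the apartment index somewhat more closely than the Korean government bond yields (inverted) do, though this is only a visual impression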

In [None]:
# The yields move along broadly similar paths, so keep only the 3-month, 2-year, and 10-year Treasury yields
df_final = df_final[['date','year','month','day','apartment_index','kospi_index','korea_rp',
                    'korea_3_year','korea_10_year','us_3_month', 'us_2_year', 'us_10_year']]
df_final.head()

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,us_10_year
0,2011-01-01,2011,1,1,93.0,2051.0,2.5,3.44,4.57,0.124,0.601,3.334
1,2011-01-02,2011,1,2,93.0,2051.0,2.5,3.44,4.57,0.124,0.601,3.334
2,2011-01-03,2011,1,3,93.0,2070.08,2.5,3.44,4.57,0.124,0.601,3.334
3,2011-01-04,2011,1,4,93.0,2085.14,2.5,3.495,4.58,0.142,0.621,3.338
4,2011-01-05,2011,1,5,93.0,2082.55,2.5,3.495,4.63,0.142,0.708,3.463


## Intermediate save

In [None]:
df_final.to_csv('/content/drive/MyDrive/house_price/after_data/economic_data_temp.csv',index=False)

## Loading the saved file

In [None]:
df_economic = pd.read_csv('/content/drive/MyDrive/house_price/after_data/economic_data_temp.csv')
df_economic

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,us_10_year
0,2011-01-01,2011,1,1,93.0,2051.00,2.50,3.440,4.570,0.124,0.6010,3.334
1,2011-01-02,2011,1,2,93.0,2051.00,2.50,3.440,4.570,0.124,0.6010,3.334
2,2011-01-03,2011,1,3,93.0,2070.08,2.50,3.440,4.570,0.124,0.6010,3.334
3,2011-01-04,2011,1,4,93.0,2085.14,2.50,3.495,4.580,0.142,0.6210,3.338
4,2011-01-05,2011,1,5,93.0,2082.55,2.50,3.495,4.630,0.142,0.7080,3.463
...,...,...,...,...,...,...,...,...,...,...,...,...
4378,2022-12-27,2022,12,27,104.4,2332.79,3.25,3.661,3.612,4.311,4.3827,3.849
4379,2022-12-28,2022,12,28,104.4,2280.45,3.25,3.668,3.675,4.457,4.3574,3.886
4380,2022-12-29,2022,12,29,104.4,2236.40,3.25,3.718,3.723,4.423,4.3656,3.820
4381,2022-12-30,2022,12,30,104.4,2236.40,3.25,3.725,3.735,4.405,4.4279,3.879



## Adding yield spread columns

In [None]:
# Add the yield spread columns (long-term yield minus short-term yield)
df_economic['korea_10-3_year'] = df_economic['korea_10_year'] - df_economic['korea_3_year']
df_economic['us_10-2_year'] = df_economic['us_10_year'] - df_economic['us_2_year']
df_economic['us_10-3_year_month'] = df_economic['us_10_year'] - df_economic['us_3_month']
df_economic

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,us_10_year,korea_10-3_year,us_10-2_year,us_10-3_year_month
0,2011-01-01,2011,1,1,93.0,2051.00,2.50,3.440,4.570,0.124,0.6010,3.334,1.130,2.7330,3.210
1,2011-01-02,2011,1,2,93.0,2051.00,2.50,3.440,4.570,0.124,0.6010,3.334,1.130,2.7330,3.210
2,2011-01-03,2011,1,3,93.0,2070.08,2.50,3.440,4.570,0.124,0.6010,3.334,1.130,2.7330,3.210
3,2011-01-04,2011,1,4,93.0,2085.14,2.50,3.495,4.580,0.142,0.6210,3.338,1.085,2.7170,3.196
4,2011-01-05,2011,1,5,93.0,2082.55,2.50,3.495,4.630,0.142,0.7080,3.463,1.135,2.7550,3.321
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4378,2022-12-27,2022,12,27,104.4,2332.79,3.25,3.661,3.612,4.311,4.3827,3.849,-0.049,-0.5337,-0.462
4379,2022-12-28,2022,12,28,104.4,2280.45,3.25,3.668,3.675,4.457,4.3574,3.886,0.007,-0.4714,-0.571
4380,2022-12-29,2022,12,29,104.4,2236.40,3.25,3.718,3.723,4.423,4.3656,3.820,0.005,-0.5456,-0.603
4381,2022-12-30,2022,12,30,104.4,2236.40,3.25,3.725,3.735,4.405,4.4279,3.879,0.010,-0.5489,-0.526
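
A negative spread means the short-term yield is above the long-term yield, as in the last rows of the table above. A minimal sketch of flagging such days follows; the column `us_curve_inverted` is a hypothetical addition and is not part of the original pipeline.

```python
import pandas as pd

# Values copied from the first and last rows shown above
sample = pd.DataFrame({
    "date": pd.to_datetime(["2011-01-01", "2022-12-30"]),
    "us_10_year": [3.334, 3.879],
    "us_2_year": [0.6010, 4.4279],
})

sample["us_10-2_year"] = sample["us_10_year"] - sample["us_2_year"]

# Hypothetical helper column: True when the 10-year yield is below the 2-year yield
sample["us_curve_inverted"] = sample["us_10-2_year"] < 0
print(sample[["date", "us_10-2_year", "us_curve_inverted"]])
```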


## Adding apartment supply data

### Adding apartment pre-sale supply data

- Apartment supply data was obtained from https://asil.kr/asil/sub/movein.jsp

In [None]:
import pandas as pd
# Load the tab-separated txt file
df_apartment_supply = pd.read_csv("/content/drive/MyDrive/house_price/original_data/apartment_supply.txt",  encoding='UTF8',sep="\t")
df_apartment_supply.head()

Unnamed: 0,위치,단지명,입주년월,총세대수
0,서울 서대문구 홍은동,e편한세상홍제가든플라츠,2022년 12월,481세대
1,서울 서초구 잠원동,르엘신반포,2022년 12월,280세대
2,서울 마포구 아현동,마포더클래시,2022년 12월,"1,419세대"
3,서울 중랑구 면목동,"봄작시티201(민간임대,도시형)",2022년 12월,128세대
4,서울 서대문구 홍은동,힐스테이트홍은포레스트,2022년 11월,623세대


In [None]:
df_apartment_supply.info() # inspect the DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1003 entries, 0 to 1002
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   위치      1003 non-null   object
 1   단지명     1003 non-null   object
 2   입주년월    1003 non-null   object
 3   총세대수    1003 non-null   object
dtypes: object(4)
memory usage: 31.5+ KB


In [None]:
# Create the year and month columns
# by splitting the '입주년월' (move-in year and month) string on the space character
df_apartment_supply['year'] =df_apartment_supply['입주년월'].str.split(' ',expand=True)[0]
df_apartment_supply['month'] =df_apartment_supply['입주년월'].str.split(' ',expand=True)[1]

In [None]:
df_apartment_supply.head()

Unnamed: 0,위치,단지명,입주년월,총세대수,year,month
0,서울 서대문구 홍은동,e편한세상홍제가든플라츠,2022년 12월,481세대,2022년,12월
1,서울 서초구 잠원동,르엘신반포,2022년 12월,280세대,2022년,12월
2,서울 마포구 아현동,마포더클래시,2022년 12월,"1,419세대",2022년,12월
3,서울 중랑구 면목동,"봄작시티201(민간임대,도시형)",2022년 12월,128세대,2022년,12월
4,서울 서대문구 홍은동,힐스테이트홍은포레스트,2022년 11월,623세대,2022년,11월


In [None]:
df_apartment_supply.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1003 entries, 0 to 1002
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   위치      1003 non-null   object
 1   단지명     1003 non-null   object
 2   입주년월    1003 non-null   object
 3   총세대수    1003 non-null   object
 4   year    1003 non-null   object
 5   month   1003 non-null   object
dtypes: object(6)
memory usage: 47.1+ KB


In [None]:
# Strip specific characters from the strings
# so that the values are easier to use in later calculations
df_apartment_supply["year"] = df_apartment_supply["year"].str.replace("년", "")
df_apartment_supply["month"] = df_apartment_supply["month"].str.replace("월", "")
df_apartment_supply["apartment_supply"] = df_apartment_supply["총세대수"].str.replace("세대", "")
df_apartment_supply["apartment_supply"] = df_apartment_supply["apartment_supply"].str.replace(",", "")
df_apartment_supply.head()

Unnamed: 0,위치,단지명,입주년월,총세대수,year,month,apartment_supply
0,서울 서대문구 홍은동,e편한세상홍제가든플라츠,2022년 12월,481세대,2022,12,481
1,서울 서초구 잠원동,르엘신반포,2022년 12월,280세대,2022,12,280
2,서울 마포구 아현동,마포더클래시,2022년 12월,"1,419세대",2022,12,1419
3,서울 중랑구 면목동,"봄작시티201(민간임대,도시형)",2022년 12월,128세대,2022,12,128
4,서울 서대문구 홍은동,힐스테이트홍은포레스트,2022년 11월,623세대,2022,11,623
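
The chained `str.replace` calls above remove the unit suffix and the thousands separator one at a time. As an alternative sketch, a single regular expression can strip every non-digit character before casting; `raw` and `supply` are hypothetical names, and the sample strings are taken from the table above.

```python
import pandas as pd

raw = pd.Series(["481세대", "280세대", "1,419세대"])

# \D matches any non-digit character, so the unit suffix and the comma are both removed
supply = raw.str.replace(r"\D", "", regex=True).astype("int64")
print(supply.tolist())
```

Casting at this point also means the column is numeric immediately, rather than staying `object` until a later `astype` call.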


In [None]:
df_apartment_supply.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1003 entries, 0 to 1002
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   위치                1003 non-null   object
 1   단지명               1003 non-null   object
 2   입주년월              1003 non-null   object
 3   총세대수              1003 non-null   object
 4   year              1003 non-null   object
 5   month             1003 non-null   object
 6   apartment_supply  1003 non-null   object
dtypes: object(7)
memory usage: 55.0+ KB


In [None]:
# Create the date column (first day of the move-in month)
df_apartment_supply['date'] = pd.to_datetime(df_apartment_supply['year']+'-'+df_apartment_supply['month'], format="%Y-%m")

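- Assume that each month's figures are published in the following month (for example, the January 2011 figures cannot be known during January 2011; they become available only in February, once January's results have been compiled)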

In [None]:
import datetime
# Assume the figures are announced in the following month (first of the month + 32 days always falls in the next month)
df_apartment_supply['date_column'] = df_apartment_supply['date'] + datetime.timedelta(days=32)
df_apartment_supply['announcement_year'] = df_apartment_supply['date_column'].dt.year
df_apartment_supply['announcement_month'] = df_apartment_supply['date_column'].dt.month
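
Adding 32 days to the first day of a month always lands in the following month, which is all the code above needs. A minimal sketch comparing that approach with a calendar-aware offset follows; `pd.DateOffset(months=1)` is offered as an alternative, not as what the notebook uses.

```python
import datetime
import pandas as pd

dates = pd.Series(pd.to_datetime(["2022-11-01", "2022-12-01", "2023-01-01"]))

# Approach used in the notebook: a fixed 32-day shift from the first of the month
via_timedelta = dates + datetime.timedelta(days=32)

# Alternative: a calendar-aware offset that lands exactly on the first of the next month
via_offset = dates + pd.DateOffset(months=1)

# Both approaches agree on the announcement year and month
same_month = (via_timedelta.dt.year == via_offset.dt.year) & (
    via_timedelta.dt.month == via_offset.dt.month
)
print(same_month.all())
```

The fixed shift is only safe because every date here is the first of a month; the offset version would also work for dates later in the month.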

In [None]:
df_apartment_supply.head()

Unnamed: 0,위치,단지명,입주년월,총세대수,year,month,apartment_supply,date,date_column,announcement_year,announcement_month
0,서울 서대문구 홍은동,e편한세상홍제가든플라츠,2022년 12월,481세대,2022,12,481,2022-12-01,2023-01-02,2023,1
1,서울 서초구 잠원동,르엘신반포,2022년 12월,280세대,2022,12,280,2022-12-01,2023-01-02,2023,1
2,서울 마포구 아현동,마포더클래시,2022년 12월,"1,419세대",2022,12,1419,2022-12-01,2023-01-02,2023,1
3,서울 중랑구 면목동,"봄작시티201(민간임대,도시형)",2022년 12월,128세대,2022,12,128,2022-12-01,2023-01-02,2023,1
4,서울 서대문구 홍은동,힐스테이트홍은포레스트,2022년 11월,623세대,2022,11,623,2022-11-01,2022-12-03,2022,12


In [None]:
df_apartment_supply.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1003 entries, 0 to 1002
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   위치                  1003 non-null   object        
 1   단지명                 1003 non-null   object        
 2   입주년월                1003 non-null   object        
 3   총세대수                1003 non-null   object        
 4   year                1003 non-null   object        
 5   month               1003 non-null   object        
 6   apartment_supply    1003 non-null   object        
 7   date                1003 non-null   datetime64[ns]
 8   date_column         1003 non-null   datetime64[ns]
 9   announcement_year   1003 non-null   int64         
 10  announcement_month  1003 non-null   int64         
dtypes: datetime64[ns](2), int64(2), object(7)
memory usage: 86.3+ KB


In [None]:
# Keep only the columns that will be used, then convert the type
df_apartment_supply = df_apartment_supply[['announcement_year','announcement_month','apartment_supply']]
df_apartment_supply = df_apartment_supply.astype({'apartment_supply': 'int64'})
df_apartment_supply.head()

Unnamed: 0,announcement_year,announcement_month,apartment_supply
0,2023,1,481
1,2023,1,280
2,2023,1,1419
3,2023,1,128
4,2022,12,623


In [None]:
# Sum the supply per year and month with groupby, then turn the group keys back into columns with reset_index
df_apartment_supply=df_apartment_supply.groupby(['announcement_year','announcement_month'])['apartment_supply'].agg('sum')
df_apartment_supply = df_apartment_supply.reset_index(['announcement_year','announcement_month'])
df_apartment_supply.head()

Unnamed: 0,announcement_year,announcement_month,apartment_supply
0,2011,2,5342
1,2011,3,3494
2,2011,4,1511
3,2011,5,709
4,2011,6,1507
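
The aggregation above can also be written without the separate `reset_index` step. A minimal sketch follows, using only the five rows shown by `head()` before aggregation, so the December 2022 total here is partial; `rows` and `monthly` are hypothetical names.

```python
import pandas as pd

# The five rows shown by head() before aggregation
rows = pd.DataFrame({
    "announcement_year": [2023, 2023, 2023, 2023, 2022],
    "announcement_month": [1, 1, 1, 1, 12],
    "apartment_supply": [481, 280, 1419, 128, 623],
})

# as_index=False keeps the group keys as ordinary columns, so no reset_index is needed
monthly = rows.groupby(
    ["announcement_year", "announcement_month"], as_index=False
)["apartment_supply"].sum()
print(monthly)
```

The four rows for January 2023 sum to 2308, which matches the value the notebook later reports for that month.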


### Adding unsold apartment data

- Unsold apartment data was obtained from https://data.kbland.kr/publicdata/unsold-apartments

In [None]:
df_apartment_unsold = pd.read_excel("/content/drive/MyDrive/house_price/original_data/unsold/서울 미분양 현황.xlsx")
df_apartment_unsold.head()

Unnamed: 0,구분,'07.01,'07.02,'07.03,'07.04,'07.05,'07.06,'07.07,'07.08,'07.09,...,'22.02,'22.03,'22.04,'22.05,'22.06,'22.07,'22.08,'22.09,'22.10,'22.11
0,미분양,697,590.0,687.0,685.0,704.0,778.0,840.0,730.0,724.0,...,47,180.0,360,688.0,719.0,592.0,610.0,719.0,866.0,865.0
1,변동률,-,-15.35,16.44,-0.29,2.77,10.51,7.97,-13.1,-0.82,...,0,282.98,100,91.11,4.51,-17.66,3.04,17.87,20.45,-0.12


In [None]:
df_apartment_unsold = df_apartment_unsold.set_index('구분') # use the '구분' (category) column as the index
df_apartment_unsold.head()

구분,'07.01,'07.02,'07.03,'07.04,'07.05,'07.06,'07.07,'07.08,'07.09,'07.10,...,'22.02,'22.03,'22.04,'22.05,'22.06,'22.07,'22.08,'22.09,'22.10,'22.11
미분양,697,590.0,687.0,685.0,704.0,778.0,840.0,730.0,724.0,977.0,...,47,180.0,360,688.0,719.0,592.0,610.0,719.0,866.0,865.0
변동률,-,-15.35,16.44,-0.29,2.77,10.51,7.97,-13.1,-0.82,34.94,...,0,282.98,100,91.11,4.51,-17.66,3.04,17.87,20.45,-0.12


In [None]:
# Swap rows and columns with the T (transpose) attribute
df_apartment_unsold=df_apartment_unsold.T
df_apartment_unsold.head()

구분,미분양,변동률
'07.01,697.0,-
'07.02,590.0,-15.35
'07.03,687.0,16.44
'07.04,685.0,-0.29
'07.05,704.0,2.77


In [None]:
df_apartment_unsold.info()

<class 'pandas.core.frame.DataFrame'>
Index: 191 entries, '07.01 to '22.11
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   미분양     191 non-null    object
 1   변동률     191 non-null    object
dtypes: object(2)
memory usage: 8.5+ KB


In [None]:
# The index holds the date information, so move it into a column with reset_index
df_apartment_unsold = df_apartment_unsold.reset_index()
df_apartment_unsold.head()

구분,index,미분양,변동률
0,'07.01,697.0,-
1,'07.02,590.0,-15.35
2,'07.03,687.0,16.44
3,'07.04,685.0,-0.29
4,'07.05,704.0,2.77


In [None]:
# Rename the columns
df_apartment_unsold.columns=['year_month','unsold_count','ratio']
df_apartment_unsold.head()

Unnamed: 0,year_month,unsold_count,ratio
0,'07.01,697.0,-
1,'07.02,590.0,-15.35
2,'07.03,687.0,16.44
3,'07.04,685.0,-0.29
4,'07.05,704.0,2.77


In [None]:
# Remove the apostrophe from the year_month column
df_apartment_unsold["year_month"] = df_apartment_unsold["year_month"].str.replace("'", "")
df_apartment_unsold.head()

Unnamed: 0,year_month,unsold_count,ratio
0,07.01,697.0,-
1,07.02,590.0,-15.35
2,07.03,687.0,16.44
3,07.04,685.0,-0.29
4,07.05,704.0,2.77


In [None]:
# Create the year and month columns
df_apartment_unsold['year'] =df_apartment_unsold["year_month"].str.split('.',expand=True)[0]
df_apartment_unsold['month'] =df_apartment_unsold["year_month"].str.split('.',expand=True)[1]
df_apartment_unsold.head()

Unnamed: 0,year_month,unsold_count,ratio,year,month
0,07.01,697.0,-,07,01
1,07.02,590.0,-15.35,07,02
2,07.03,687.0,16.44,07,03
3,07.04,685.0,-0.29,07,04
4,07.05,704.0,2.77,07,05


In [None]:
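# Expand the year to four digits and select the columns to use
df_apartment_unsold['year'] = '20'+df_apartment_unsold['year']
df_apartment_unsold = df_apartment_unsold[['year','month','unsold_count']].copy() # copy so that later column assignments do not act on a slice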
df_apartment_unsold.head()

Unnamed: 0,year,month,unsold_count
0,2007,01,697.0
1,2007,02,590.0
2,2007,03,687.0
3,2007,04,685.0
4,2007,05,704.0


In [None]:
# Assume the unsold figures become known only one month later
df_apartment_unsold['date'] = pd.to_datetime(df_apartment_unsold['year']+'-'+df_apartment_unsold['month'], format="%Y-%m")
df_apartment_unsold['date_column'] = df_apartment_unsold['date'] + datetime.timedelta(days=32) # a date in the following month (the assumed announcement date)
df_apartment_unsold['announcement_year'] = df_apartment_unsold['date_column'].dt.year
df_apartment_unsold['announcement_month'] = df_apartment_unsold['date_column'].dt.month
df_apartment_unsold = df_apartment_unsold[['announcement_year','announcement_month','unsold_count']]
df_apartment_unsold = df_apartment_unsold.astype({'unsold_count': 'int64'})
df_apartment_unsold.head()

Unnamed: 0,announcement_year,announcement_month,unsold_count
0,2007,2,697
1,2007,3,590
2,2007,4,687
3,2007,5,685
4,2007,6,704
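
The steps above remove the apostrophe, split on the dot, prepend `20`, and then rebuild a date. As an alternative sketch, the original labels can be parsed in one call with a format string; `labels`, `period_start`, and `announced` are hypothetical names, and the one-month shift mirrors the announcement assumption stated earlier.

```python
import pandas as pd

labels = pd.Series(["'07.01", "'07.02", "'22.11"])

# %y is the two-digit year; the leading apostrophe and the dot are matched literally
period_start = pd.to_datetime(labels, format="'%y.%m")

# Shift to the assumed announcement month (one month after the reference month)
announced = period_start + pd.DateOffset(months=1)
print(announced.dt.strftime("%Y-%m").tolist())
```

Note that `%y` maps two-digit years 00 to 68 onto 2000 to 2068, which suits this dataset because it starts in 2007.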


In [None]:
# Restrict the data to the range of years that will be used
df_apartment_unsold=df_apartment_unsold[df_apartment_unsold['announcement_year']>=2011]
df_apartment_unsold.head()

Unnamed: 0,announcement_year,announcement_month,unsold_count
47,2011,1,2729
48,2011,2,2269
49,2011,3,2216
50,2011,4,2104
51,2011,5,1855


In [None]:
df_apartment_unsold.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 144 entries, 47 to 190
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   announcement_year   144 non-null    int64
 1   announcement_month  144 non-null    int64
 2   unsold_count        144 non-null    int64
dtypes: int64(3)
memory usage: 4.5 KB


### Merging the apartment supply and unsold DataFrames

In [None]:
df_apartment_supply.tail()

Unnamed: 0,announcement_year,announcement_month,apartment_supply
139,2022,9,1853
140,2022,10,1552
141,2022,11,1265
142,2022,12,1759
143,2023,1,2308


In [None]:
df_apartment_supply.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144 entries, 0 to 143
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   announcement_year   144 non-null    int64
 1   announcement_month  144 non-null    int64
 2   apartment_supply    144 non-null    int64
dtypes: int64(3)
memory usage: 3.5 KB


In [None]:
df_apartment_unsold.tail()

Unnamed: 0,announcement_year,announcement_month,unsold_count
186,2022,8,592
187,2022,9,610
188,2022,10,719
189,2022,11,866
190,2022,12,865


In [None]:
df_apartment_unsold.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 144 entries, 47 to 190
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   announcement_year   144 non-null    int64
 1   announcement_month  144 non-null    int64
 2   unsold_count        144 non-null    int64
dtypes: int64(3)
memory usage: 4.5 KB


In [None]:
# Merge the DataFrames
df_apartment_supply_unsold=pd.merge(df_apartment_supply, df_apartment_unsold, on=['announcement_year','announcement_month'], how='inner')
df_apartment_supply_unsold.tail()

Unnamed: 0,announcement_year,announcement_month,apartment_supply,unsold_count
138,2022,8,1736,592
139,2022,9,1853,610
140,2022,10,1552,719
141,2022,11,1265,866
142,2022,12,1759,865


In [None]:
df_apartment_supply_unsold.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 143 entries, 0 to 142
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   announcement_year   143 non-null    int64
 1   announcement_month  143 non-null    int64
 2   apartment_supply    143 non-null    int64
 3   unsold_count        143 non-null    int64
dtypes: int64(4)
memory usage: 5.6 KB
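
Each input has 144 rows but the inner merge keeps 143, so one month on each side has no partner. A minimal sketch of how such rows can be located with `indicator=True` follows, using the last few months from the tables above; `supply`, `unsold`, `check`, and `unmatched` are hypothetical names.

```python
import pandas as pd

# Last months of each table, copied from the tail() outputs above
supply = pd.DataFrame({"announcement_year": [2022, 2022, 2023],
                       "announcement_month": [11, 12, 1],
                       "apartment_supply": [1265, 1759, 2308]})
unsold = pd.DataFrame({"announcement_year": [2022, 2022],
                       "announcement_month": [11, 12],
                       "unsold_count": [866, 865]})

# An outer merge with indicator=True labels each row 'both', 'left_only' or 'right_only'
check = supply.merge(unsold, on=["announcement_year", "announcement_month"],
                     how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["announcement_year", "announcement_month", "_merge"]])
```

In this sample the January 2023 supply row has no unsold counterpart, which is consistent with the unsold data ending one month earlier.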


#### Adding the unsold ratio column

In [None]:
# Compute the unsold ratio: unsold units as a percentage of that month's supply
df_apartment_supply_unsold['unsold_ratio'] = 100*(df_apartment_supply_unsold['unsold_count'] / df_apartment_supply_unsold['apartment_supply'])
df_apartment_supply_unsold.head()

Unnamed: 0,announcement_year,announcement_month,apartment_supply,unsold_count,unsold_ratio
0,2011,2,5342,2269,42.474729
1,2011,3,3494,2216,63.423011
2,2011,4,1511,2104,139.245533
3,2011,5,709,1855,261.636107
4,2011,6,1507,1785,118.447246
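
As a worked check of the formula, the first row above can be recomputed by hand. This sketch only reproduces the arithmetic with the values shown in the table.

```python
# Values from the first row of the table above (announcement month 2011-02)
unsold_count = 2269
apartment_supply = 5342

# Same formula as the unsold_ratio column: 100 * unsold / supply
unsold_ratio = 100 * unsold_count / apartment_supply
print(round(unsold_ratio, 6))
```

Values above 100, such as the third row, are possible, presumably because the unsold count is a running stock while the supply figure covers a single month.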


In [None]:
df_apartment_supply_unsold.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 143 entries, 0 to 142
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   announcement_year   143 non-null    int64  
 1   announcement_month  143 non-null    int64  
 2   apartment_supply    143 non-null    int64  
 3   unsold_count        143 non-null    int64  
 4   unsold_ratio        143 non-null    float64
dtypes: float64(1), int64(4)
memory usage: 6.7 KB


### Merging into the final table

In [None]:
# Merge the monthly supply and unsold data onto the daily table
df_economic=pd.merge(df_economic, df_apartment_supply_unsold, left_on=['year','month'], right_on=['announcement_year','announcement_month'], how='left')
df_economic = df_economic.drop(["announcement_year", "announcement_month"], axis=1)
df_economic.head()

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,us_10_year,korea_10-3_year,us_10-2_year,us_10-3_year_month,apartment_supply,unsold_count,unsold_ratio
0,2011-01-01,2011,1,1,93.0,2051.0,2.5,3.44,4.57,0.124,0.601,3.334,1.13,2.733,3.21,,,
1,2011-01-02,2011,1,2,93.0,2051.0,2.5,3.44,4.57,0.124,0.601,3.334,1.13,2.733,3.21,,,
2,2011-01-03,2011,1,3,93.0,2070.08,2.5,3.44,4.57,0.124,0.601,3.334,1.13,2.733,3.21,,,
3,2011-01-04,2011,1,4,93.0,2085.14,2.5,3.495,4.58,0.142,0.621,3.338,1.085,2.717,3.196,,,
4,2011-01-05,2011,1,5,93.0,2082.55,2.5,3.495,4.63,0.142,0.708,3.463,1.135,2.755,3.321,,,
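
The left merge above copies each monthly value onto every daily row of that month, and months with no monthly row are left as NaN, which is why the January 2011 rows are empty. A minimal sketch on a toy frame follows; `daily`, `monthly`, and `out` are hypothetical names, and the `validate` argument is an optional safeguard that the notebook does not use.

```python
import pandas as pd

# Four daily rows spanning a month boundary
daily = pd.DataFrame({"date": pd.date_range("2011-01-30", "2011-02-02")})
daily["year"] = daily["date"].dt.year
daily["month"] = daily["date"].dt.month

# One row per announcement month (value from the table above)
monthly = pd.DataFrame({"announcement_year": [2011], "announcement_month": [2],
                        "apartment_supply": [5342]})

# validate="many_to_one" raises an error if the monthly table has duplicate keys
out = daily.merge(monthly, left_on=["year", "month"],
                  right_on=["announcement_year", "announcement_month"],
                  how="left", validate="many_to_one")
out = out.drop(columns=["announcement_year", "announcement_month"])
print(out)
```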


In [None]:
df_economic.info() # inspect the DataFrame

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4383 entries, 0 to 4382
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   date                4383 non-null   object 
 1   year                4383 non-null   int64  
 2   month               4383 non-null   int64  
 3   day                 4383 non-null   int64  
 4   apartment_index     4383 non-null   float64
 5   kospi_index         4383 non-null   float64
 6   korea_rp            4383 non-null   float64
 7   korea_3_year        4383 non-null   float64
 8   korea_10_year       4383 non-null   float64
 9   us_3_month          4383 non-null   float64
 10  us_2_year           4383 non-null   float64
 11  us_10_year          4383 non-null   float64
 12  korea_10-3_year     4383 non-null   float64
 13  us_10-2_year        4383 non-null   float64
 14  us_10-3_year_month  4383 non-null   float64
 15  apartment_supply    4352 non-null   float64
 16  unsold

In [None]:
df_economic.isnull().sum() # check for null values

year                  0
month                 0
apartment_index       0
kospi_index           0
korea_rp              0
korea_3_year          0
korea_10_year         0
us_3_month            0
us_2_year             0
us_10_year            0
korea_10-3_year       0
us_10-2_year          0
us_10-3_year_month    0
apartment_supply      1
unsold_count          1
unsold_ratio          1
dtype: int64

In [None]:
df_economic = df_economic.dropna(subset=['apartment_supply']) # drop the rows where this column is null
df_economic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4352 entries, 31 to 4382
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   date                4352 non-null   object 
 1   year                4352 non-null   int64  
 2   month               4352 non-null   int64  
 3   day                 4352 non-null   int64  
 4   apartment_index     4352 non-null   float64
 5   kospi_index         4352 non-null   float64
 6   korea_rp            4352 non-null   float64
 7   korea_3_year        4352 non-null   float64
 8   korea_10_year       4352 non-null   float64
 9   us_3_month          4352 non-null   float64
 10  us_2_year           4352 non-null   float64
 11  us_10_year          4352 non-null   float64
 12  korea_10-3_year     4352 non-null   float64
 13  us_10-2_year        4352 non-null   float64
 14  us_10-3_year_month  4352 non-null   float64
 15  apartment_supply    4352 non-null   float64
 16  unsol

In [None]:
df_economic.to_csv('/content/drive/MyDrive/house_price/after_data/economic_data.csv',index=False) # save df_economic, which holds the spread and supply columns added above

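# Creating the economic_data2 file


- The economic_data2 file is the economic_data file with monthly apartment transaction counts (sale, jeonse, and monthly-rent contracts) added
- 'Apartment transactions' covers apartment sales, jeonse (lump-sum deposit) leases, and monthly-rent leases combined
- The monthly transaction count for a given month is the total number of Seoul apartment transactions concluded in the previous month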
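
The last point can be sketched as a one-month shift of the reference month, in the same spirit as the announcement-month logic used for the supply data. This is an assumption about how the lag could be applied, not the notebook's confirmed implementation; the `available_*` columns are hypothetical, the first two counts come from this notebook's sale data, and the December value is made up to show the year rollover.

```python
import pandas as pd

counts = pd.DataFrame({
    "year": [2011, 2011, 2011],
    "month": [1, 2, 12],
    "deal_count": [7179, 6026, 100],  # last value is hypothetical
})

# Build a month-start timestamp, move it one month forward, and split it again
ref = pd.to_datetime(counts[["year", "month"]].assign(day=1))
avail = ref + pd.DateOffset(months=1)
counts["available_year"] = avail.dt.year
counts["available_month"] = avail.dt.month
print(counts)
```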

In [None]:
import pandas as pd
import numpy as np
# Load the data
df_deal = pd.read_csv("/content/drive/MyDrive/house_price/after_data/apartment_deal.csv",  encoding='UTF8')
df_month_rent = pd.read_csv("/content/drive/MyDrive/house_price/after_data/apartment_month_rent.csv",  encoding='UTF8')
df_full_rent = pd.read_csv("/content/drive/MyDrive/house_price/after_data/apartment_full_rent.csv",  encoding='UTF8')
df_economic = pd.read_csv("/content/drive/MyDrive/house_price/after_data/economic_data.csv",  encoding='UTF8')

## Building the apartment sale count DataFrame

In [None]:
df_deal.head()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,deal_price
0,2011-07-09,2011,7,9,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,64000
1,2011-07-28,2011,7,28,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,65500
2,2011-01-19,2011,1,19,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,70500
3,2011-09-02,2011,9,2,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,85000
4,2011-12-17,2011,12,17,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,68000


In [None]:
# Count Seoul apartment sales per month with groupby
df_count = df_deal.groupby(["year","month"])["name"].agg('count').copy()
df_count = df_count.reset_index(["year","month"]) # turn the index levels back into columns
df_count.columns = ["year","month","deal_count"] # rename the columns
df_count

Unnamed: 0,year,month,deal_count
0,2011,1,7179
1,2011,2,6026
2,2011,3,5419
3,2011,4,4028
4,2011,5,3836
...,...,...,...
143,2022,12,888
144,2023,1,1483
145,2023,2,2520
146,2023,3,3040


In [None]:
df_count.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148 entries, 0 to 147
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   year        148 non-null    int64
 1   month       148 non-null    int64
 2   deal_count  148 non-null    int64
dtypes: int64(3)
memory usage: 3.6 KB


## Adding the jeonse lease count

- Computed the same way as the apartment sales count above
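
Since the sales, jeonse, and monthly-rent counts are all derived the same way, the shared step could be factored into a helper. The following is only a sketch: `monthly_count` and the toy data are hypothetical names that do not appear in the notebook, and it uses `size()` (row count) rather than the notebook's count of non-null `name` values, so the two differ whenever `name` is missing.

```python
import pandas as pd

def monthly_count(df: pd.DataFrame, count_name: str) -> pd.DataFrame:
    """Return one row per (year, month) with the number of transactions."""
    # size() counts rows, so a missing apartment name does not lower the count
    counts = df.groupby(["year", "month"]).size()
    return counts.reset_index(name=count_name)

# Toy data standing in for df_deal / df_full_rent / df_month_rent
toy = pd.DataFrame({
    "year": [2011, 2011, 2011, 2011],
    "month": [1, 1, 2, 2],
    "name": ["A", "B", "A", None],
})
toy_counts = monthly_count(toy, "deal_count")
print(toy_counts)
```

With the real data, the same call could be repeated with `"full_rent_count"` and `"month_rent_count"` before merging the three results.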

In [None]:
df_full_rent.head()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,full_rent_price
0,2011-01-05,2011,1,5,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,35000
1,2011-01-18,2011,1,18,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,20000
2,2011-02-01,2011,2,1,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,24000
3,2011-02-11,2011,2,11,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,31000
4,2011-02-24,2011,2,24,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,30500


In [None]:
# Compute the monthly jeonse lease count with groupby and count
df_temp = df_full_rent.groupby(["year","month"])["name"].agg('count').copy()
df_temp = df_temp.reset_index(["year","month"])
df_temp.columns = ["year","month","full_rent_count"]
df_temp

Unnamed: 0,year,month,full_rent_count
0,2011,1,12336
1,2011,2,12261
2,2011,3,12121
3,2011,4,9754
4,2011,5,9280
...,...,...,...
143,2022,12,8783
144,2023,1,11420
145,2023,2,14763
146,2023,3,13931


In [None]:
# Merge the sales count DataFrame with the jeonse lease count DataFrame
df_count=pd.merge(df_count,df_temp, left_on=["year","month"], right_on=["year","month"], how="inner")
df_count.head()

Unnamed: 0,year,month,deal_count,full_rent_count
0,2011,1,7179,12336
1,2011,2,6026,12261
2,2011,3,5419,12121
3,2011,4,4028,9754
4,2011,5,3836,9280


In [None]:
df_count.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 148 entries, 0 to 147
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   year             148 non-null    int64
 1   month            148 non-null    int64
 2   deal_count       148 non-null    int64
 3   full_rent_count  148 non-null    int64
dtypes: int64(4)
memory usage: 5.8 KB


## Adding the monthly rental count

- Computed the same way as the apartment sales count DataFrame

In [None]:
df_month_rent.head()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,rent_deposit,month_rent_price
0,2011-03-18,2011,3,18,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,19000,63
1,2011-04-09,2011,4,9,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,21000,35
2,2011-07-09,2011,7,9,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,3000,160
3,2011-09-19,2011,9,19,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,6000,140
4,2011-09-20,2011,9,20,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,5000,160


In [None]:
df_temp = df_month_rent.groupby(["year","month"])["name"].agg('count').copy()
df_temp = df_temp.reset_index(["year","month"])
df_temp.columns = ["year","month","month_rent_count"]
df_temp

Unnamed: 0,year,month,month_rent_count
0,2011,1,2514
1,2011,2,2711
2,2011,3,2775
3,2011,4,2210
4,2011,5,2168
...,...,...,...
143,2022,12,8588
144,2023,1,9276
145,2023,2,11346
146,2023,3,8708


In [None]:
# Merge in the monthly rental count DataFrame
df_count=pd.merge(df_count,df_temp, left_on=["year","month"], right_on=["year","month"], how="inner")
df_count.head()

Unnamed: 0,year,month,deal_count,full_rent_count,month_rent_count
0,2011,1,7179,12336,2514
1,2011,2,6026,12261,2711
2,2011,3,5419,12121,2775
3,2011,4,4028,9754,2210
4,2011,5,3836,9280,2168


In [None]:
df_count.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 148 entries, 0 to 147
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   year              148 non-null    int64
 1   month             148 non-null    int64
 2   deal_count        148 non-null    int64
 3   full_rent_count   148 non-null    int64
 4   month_rent_count  148 non-null    int64
dtypes: int64(5)
memory usage: 6.9 KB


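## Shifting the monthly values

- A month's transaction count only becomes known the following month, so each count is shifted down one row (delayed by one month)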
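
As a small illustration of the shift described above, the sketch below applies `shift(1)` to three toy rows. The column name `last_month_deal_count` is made up for the example. It also assumes that every month is present and that the rows are sorted by date, because `shift(1)` means "previous row", not "previous calendar month".

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [2011, 2011, 2011],
    "month": [1, 2, 3],
    "deal_count": [7179, 6026, 5419],
})
# Sort first so that "previous row" really means "previous month"
toy = toy.sort_values(["year", "month"]).reset_index(drop=True)
# Each row now carries the count of the month before it; the first row has none
toy["last_month_deal_count"] = toy["deal_count"].shift(1)
print(toy)
```

The first row becomes NaN, which is why the shifted columns turn into floats and the first month is dropped afterwards.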

In [None]:
df_count['deal_count'] = df_count['deal_count'].shift(1)
df_count['month_rent_count'] = df_count['month_rent_count'].shift(1)
df_count['full_rent_count'] = df_count['full_rent_count'].shift(1)
# Rename the columns
df_count.columns = ['year','month','last_month_total_deal_count','last_month_total_full_rent_count', 'last_month_total_month_rent_count']
df_count

Unnamed: 0,year,month,last_month_total_deal_count,last_month_total_full_rent_count,last_month_total_month_rent_count
0,2011,1,,,
1,2011,2,7179.0,12336.0,2514.0
2,2011,3,6026.0,12261.0,2711.0
3,2011,4,5419.0,12121.0,2775.0
4,2011,5,4028.0,9754.0,2210.0
...,...,...,...,...,...
143,2022,12,750.0,8890.0,7709.0
144,2023,1,888.0,8783.0,8588.0
145,2023,2,1483.0,11420.0,9276.0
146,2023,3,2520.0,14763.0,11346.0


In [None]:
# Drop the rows that contain nulls (the first month has no previous month)
df_count.dropna(axis=0,inplace=True)
df_count.reset_index(inplace=True,drop=True) # reset the index
df_count

Unnamed: 0,year,month,last_month_total_deal_count,last_month_total_full_rent_count,last_month_total_month_rent_count
0,2011,2,7179.0,12336.0,2514.0
1,2011,3,6026.0,12261.0,2711.0
2,2011,4,5419.0,12121.0,2775.0
3,2011,5,4028.0,9754.0,2210.0
4,2011,6,3836.0,9280.0,2168.0
...,...,...,...,...,...
142,2022,12,750.0,8890.0,7709.0
143,2023,1,888.0,8783.0,8588.0
144,2023,2,1483.0,11420.0,9276.0
145,2023,3,2520.0,14763.0,11346.0


In [None]:
df_count.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 147 entries, 0 to 146
Data columns (total 5 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   year                               147 non-null    int64  
 1   month                              147 non-null    int64  
 2   last_month_total_deal_count        147 non-null    float64
 3   last_month_total_full_rent_count   147 non-null    float64
 4   last_month_total_month_rent_count  147 non-null    float64
dtypes: float64(3), int64(2)
memory usage: 5.9 KB


In [None]:
df_economic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4352 entries, 0 to 4351
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   date                4352 non-null   object 
 1   year                4352 non-null   int64  
 2   month               4352 non-null   int64  
 3   day                 4352 non-null   int64  
 4   apartment_index     4352 non-null   float64
 5   kospi_index         4352 non-null   float64
 6   korea_rp            4352 non-null   float64
 7   korea_3_year        4352 non-null   float64
 8   korea_10_year       4352 non-null   float64
 9   us_3_month          4352 non-null   float64
 10  us_2_year           4352 non-null   float64
 11  us_10_year          4352 non-null   float64
 12  korea_10-3_year     4352 non-null   float64
 13  us_10-2_year        4352 non-null   float64
 14  us_10-3_year_month  4352 non-null   float64
 15  apartment_supply    4352 non-null   float64
 16  unsold

## Merging with economic_data

In [None]:
# The macroeconomic indicators have a row for every date, so merge on year and month
df_economic2=pd.merge(df_economic, df_count, left_on=["year","month"], right_on=["year","month"], how="inner")

df_economic2 = df_economic2.rename(columns={'apartment_supply':  'last_month_total_apartment_supply', 'unsold_count' : 'last_month_total_unsold_count',
                                          'unsold_ratio' : 'last_month_total_unsold_ratio'})



print(df_economic2.info())
df_economic2.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4352 entries, 0 to 4351
Data columns (total 21 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   date                               4352 non-null   object 
 1   year                               4352 non-null   int64  
 2   month                              4352 non-null   int64  
 3   day                                4352 non-null   int64  
 4   apartment_index                    4352 non-null   float64
 5   kospi_index                        4352 non-null   float64
 6   korea_rp                           4352 non-null   float64
 7   korea_3_year                       4352 non-null   float64
 8   korea_10_year                      4352 non-null   float64
 9   us_3_month                         4352 non-null   float64
 10  us_2_year                          4352 non-null   float64
 11  us_10_year                         4352 non-null   float

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,...,us_10_year,korea_10-3_year,us_10-2_year,us_10-3_year_month,last_month_total_apartment_supply,last_month_total_unsold_count,last_month_total_unsold_ratio,last_month_total_deal_count,last_month_total_full_rent_count,last_month_total_month_rent_count
0,2011-02-01,2011,2,1,93.0,2072.03,2.75,3.97,4.71,0.157,...,3.435,0.74,2.83,3.278,5342.0,2269.0,42.474729,7179.0,12336.0,2514.0
1,2011-02-02,2011,2,2,93.0,2072.03,2.75,3.97,4.71,0.157,...,3.479,0.74,2.815,3.322,5342.0,2269.0,42.474729,7179.0,12336.0,2514.0
2,2011-02-03,2011,2,3,93.0,2072.03,2.75,3.97,4.71,0.152,...,3.547,0.74,2.835,3.395,5342.0,2269.0,42.474729,7179.0,12336.0,2514.0
3,2011-02-04,2011,2,4,93.0,2072.03,2.75,3.97,4.71,0.152,...,3.638,0.74,2.886,3.486,5342.0,2269.0,42.474729,7179.0,12336.0,2514.0
4,2011-02-05,2011,2,5,93.0,2072.03,2.75,3.97,4.71,0.152,...,3.638,0.74,2.886,3.486,5342.0,2269.0,42.474729,7179.0,12336.0,2514.0
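
The merge above attaches one monthly row to every daily row of the same year and month. One way to guard against accidental row duplication is the `validate` argument of `pd.merge`, sketched below on toy frames (`daily` and `monthly` are illustrative stand-ins for `df_economic` and `df_count`, not names from the notebook).

```python
import pandas as pd

daily = pd.DataFrame({
    "date": ["2011-02-01", "2011-02-02", "2011-03-01"],
    "year": [2011, 2011, 2011],
    "month": [2, 2, 3],
})
monthly = pd.DataFrame({
    "year": [2011, 2011],
    "month": [2, 3],
    "last_month_total_deal_count": [7179.0, 6026.0],
})
# validate="many_to_one" raises MergeError if a (year, month) pair repeats in `monthly`
merged = pd.merge(daily, monthly, on=["year", "month"], how="inner",
                  validate="many_to_one")
print(merged)
```

Because the join is `inner`, daily rows whose month is missing from the monthly frame are dropped silently, so comparing the row count before and after the merge is a cheap sanity check.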


In [None]:

# Change the column dtypes
df_economic2=df_economic2.astype({'year': 'int16','month': 'int16',
                    'last_month_total_apartment_supply': 'int32',
                    'last_month_total_unsold_count': 'int32',
                    'last_month_total_deal_count': 'int32',
                    'last_month_total_full_rent_count': 'int32',
                    'last_month_total_month_rent_count': 'int32'})

print(df_economic2.info())
df_economic2.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4352 entries, 0 to 4351
Data columns (total 21 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   date                               4352 non-null   object 
 1   year                               4352 non-null   int16  
 2   month                              4352 non-null   int16  
 3   day                                4352 non-null   int64  
 4   apartment_index                    4352 non-null   float64
 5   kospi_index                        4352 non-null   float64
 6   korea_rp                           4352 non-null   float64
 7   korea_3_year                       4352 non-null   float64
 8   korea_10_year                      4352 non-null   float64
 9   us_3_month                         4352 non-null   float64
 10  us_2_year                          4352 non-null   float64
 11  us_10_year                         4352 non-null   float

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,...,us_10_year,korea_10-3_year,us_10-2_year,us_10-3_year_month,last_month_total_apartment_supply,last_month_total_unsold_count,last_month_total_unsold_ratio,last_month_total_deal_count,last_month_total_full_rent_count,last_month_total_month_rent_count
0,2011-02-01,2011,2,1,93.0,2072.03,2.75,3.97,4.71,0.157,...,3.435,0.74,2.83,3.278,5342,2269,42.474729,7179,12336,2514
1,2011-02-02,2011,2,2,93.0,2072.03,2.75,3.97,4.71,0.157,...,3.479,0.74,2.815,3.322,5342,2269,42.474729,7179,12336,2514
2,2011-02-03,2011,2,3,93.0,2072.03,2.75,3.97,4.71,0.152,...,3.547,0.74,2.835,3.395,5342,2269,42.474729,7179,12336,2514
3,2011-02-04,2011,2,4,93.0,2072.03,2.75,3.97,4.71,0.152,...,3.638,0.74,2.886,3.486,5342,2269,42.474729,7179,12336,2514
4,2011-02-05,2011,2,5,93.0,2072.03,2.75,3.97,4.71,0.152,...,3.638,0.74,2.886,3.486,5342,2269,42.474729,7179,12336,2514
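
The cast above relies on the count columns holding whole numbers with no nulls; a float-to-int `astype` drops any fractional part without warning. A hedged sketch of a checked cast follows (`to_int32_checked` is a hypothetical helper, not part of the notebook).

```python
import pandas as pd

def to_int32_checked(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Cast the given float columns to int32 only if no information is lost."""
    for col in columns:
        if df[col].isna().any():
            raise ValueError(f"{col} contains nulls")
        if not (df[col] % 1 == 0).all():
            raise ValueError(f"{col} contains non-integer values")
    return df.astype({col: "int32" for col in columns})

toy = pd.DataFrame({
    "last_month_total_deal_count": [7179.0, 6026.0],
    "last_month_total_unsold_ratio": [42.474729, 50.0],
})
# Only the count column is cast; the ratio column stays float
toy = to_int32_checked(toy, ["last_month_total_deal_count"])
print(toy.dtypes)
```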


In [None]:
# Save as a pickle file (the merged, retyped df_economic2)
df_economic2.to_pickle('/content/drive/MyDrive/house_price/after_data/economic_data2.pkl')

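# Creating the final_economic file

- economic_data2 holds the macroeconomic indicators for 'the current month'.
- The final_economic file is economic_data2 with additional information on how each value has changed relative to its past values

## Checking the basic information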

In [None]:
import pandas as pd
# Load the DataFrame
df_economic = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/economic_data2.pkl')
df_economic.head()

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,us_10_year,korea_10-3_year,us_10-2_year,us_10-3_year_month,last_month_total_apartment_supply,last_month_total_unsold_count,last_month_total_unsold_ratio,last_month_total_deal_count,last_month_total_full_rent_count,last_month_total_month_rent_count
0,2011-02-01,2011,2,1,93.0,2072.03,2.75,3.97,4.71,0.157,0.605,3.435,0.74,2.83,3.278,5342,2269,42.474729,7179,12336,2514
1,2011-02-02,2011,2,2,93.0,2072.03,2.75,3.97,4.71,0.157,0.664,3.479,0.74,2.815,3.322,5342,2269,42.474729,7179,12336,2514
2,2011-02-03,2011,2,3,93.0,2072.03,2.75,3.97,4.71,0.152,0.712,3.547,0.74,2.835,3.395,5342,2269,42.474729,7179,12336,2514
3,2011-02-04,2011,2,4,93.0,2072.03,2.75,3.97,4.71,0.152,0.752,3.638,0.74,2.886,3.486,5342,2269,42.474729,7179,12336,2514
4,2011-02-05,2011,2,5,93.0,2072.03,2.75,3.97,4.71,0.152,0.752,3.638,0.74,2.886,3.486,5342,2269,42.474729,7179,12336,2514


In [None]:
df_economic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4352 entries, 0 to 4351
Data columns (total 21 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   date                               4352 non-null   object 
 1   year                               4352 non-null   int16  
 2   month                              4352 non-null   int16  
 3   day                                4352 non-null   int64  
 4   apartment_index                    4352 non-null   float64
 5   kospi_index                        4352 non-null   float64
 6   korea_rp                           4352 non-null   float64
 7   korea_3_year                       4352 non-null   float64
 8   korea_10_year                      4352 non-null   float64
 9   us_3_month                         4352 non-null   float64
 10  us_2_year                          4352 non-null   float64
 11  us_10_year                         4352 non-null   float

## Computing the change versus 6 and 12 months earlier

In [None]:
# Build two DataFrames of monthly averages (used later in the merges)
df_economic_6m_before = df_economic.drop(['date','day','apartment_index'],axis=1).copy()


# The data is daily, so group it by month and take the mean
df_economic_6m_before = df_economic_6m_before.groupby(['year','month']).agg('mean').reset_index()


df_economic_12m_before = df_economic.drop(['date','day','apartment_index'],axis=1).copy()
df_economic_12m_before = df_economic_12m_before.groupby(['year','month']).agg('mean').reset_index()

In [None]:
# Compute the year and month six months earlier
df_economic.loc[df_economic['month']<7, '6m_before_year'] = df_economic['year']-1
df_economic.loc[df_economic['month']<7, '6m_before_month'] = 12-(6-df_economic['month'])
df_economic.loc[df_economic['month']>=7, '6m_before_year'] = df_economic['year']
df_economic.loc[df_economic['month']>=7, '6m_before_month'] = df_economic['month']-6

# Compute the year and month twelve months earlier
df_economic.loc[:, '12m_before_year'] = df_economic['year']-1
df_economic.loc[:, '12m_before_month'] = df_economic['month']

df_economic=df_economic.astype({'6m_before_year': 'int16','6m_before_month': 'int16'})
df_economic

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,us_10_year,korea_10-3_year,us_10-2_year,us_10-3_year_month,last_month_total_apartment_supply,last_month_total_unsold_count,last_month_total_unsold_ratio,last_month_total_deal_count,last_month_total_full_rent_count,last_month_total_month_rent_count,6m_before_year,6m_before_month,12m_before_year,12m_before_month
0,2011-02-01,2011,2,1,93.0,2072.03,2.75,3.970,4.710,0.157,0.6050,3.435,0.740,2.8300,3.278,5342,2269,42.474729,7179,12336,2514,2010,8,2010,2
1,2011-02-02,2011,2,2,93.0,2072.03,2.75,3.970,4.710,0.157,0.6640,3.479,0.740,2.8150,3.322,5342,2269,42.474729,7179,12336,2514,2010,8,2010,2
2,2011-02-03,2011,2,3,93.0,2072.03,2.75,3.970,4.710,0.152,0.7120,3.547,0.740,2.8350,3.395,5342,2269,42.474729,7179,12336,2514,2010,8,2010,2
3,2011-02-04,2011,2,4,93.0,2072.03,2.75,3.970,4.710,0.152,0.7520,3.638,0.740,2.8860,3.486,5342,2269,42.474729,7179,12336,2514,2010,8,2010,2
4,2011-02-05,2011,2,5,93.0,2072.03,2.75,3.970,4.710,0.152,0.7520,3.638,0.740,2.8860,3.486,5342,2269,42.474729,7179,12336,2514,2010,8,2010,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4347,2022-12-27,2022,12,27,104.4,2332.79,3.25,3.661,3.612,4.311,4.3827,3.849,-0.049,-0.5337,-0.462,1759,865,49.175668,750,8890,7709,2022,6,2021,12
4348,2022-12-28,2022,12,28,104.4,2280.45,3.25,3.668,3.675,4.457,4.3574,3.886,0.007,-0.4714,-0.571,1759,865,49.175668,750,8890,7709,2022,6,2021,12
4349,2022-12-29,2022,12,29,104.4,2236.40,3.25,3.718,3.723,4.423,4.3656,3.820,0.005,-0.5456,-0.603,1759,865,49.175668,750,8890,7709,2022,6,2021,12
4350,2022-12-30,2022,12,30,104.4,2236.40,3.25,3.725,3.735,4.405,4.4279,3.879,0.010,-0.5489,-0.526,1759,865,49.175668,750,8890,7709,2022,6,2021,12
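
The year and month offsets computed above can be cross-checked with plain month arithmetic. The sketch below uses a hypothetical helper, `months_before`, that works for any offset `k` instead of special-casing 6 and 12.

```python
from typing import Tuple

def months_before(year: int, month: int, k: int) -> Tuple[int, int]:
    """Return the (year, month) that lies k months before (year, month)."""
    # Count months from year 0 with January = 0, subtract k, then convert back
    total = year * 12 + (month - 1) - k
    return total // 12, total % 12 + 1

print(months_before(2011, 2, 6))    # (2010, 8), as in the first rows above
print(months_before(2022, 12, 6))   # (2022, 6), as in the last rows above
print(months_before(2022, 12, 12))  # (2021, 12)
```

The same two expressions also work element-wise on pandas Series, since `//` and `%` are vectorized.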


In [None]:
# Build the names of the columns to add
temp_column_total_list = list()
month_num_list = [6,12] # build the names for the 6- and 12-month lagged data
for i in month_num_list:
    column_list = list()
    column_list.append('year_'+str(i)+'m_before')
    column_list.append('month_'+str(i)+'m_before')
    column_list.append('kospi_index_'+str(i)+'m_before')
    column_list.append('korea_rp_'+str(i)+'m_before')
    column_list.append('korea_3_year_'+str(i)+'m_before')
    column_list.append('korea_10_year_'+str(i)+'m_before')
    column_list.append('us_3_month_'+str(i)+'m_before')
    column_list.append('us_2_year_'+str(i)+'m_before')
    column_list.append('us_10_year_'+str(i)+'m_before')
    column_list.append('korea_10-3_year_'+str(i)+'m_before')
    column_list.append('us_10-2_year_'+str(i)+'m_before')
    column_list.append('us_10-3_year_month_'+str(i)+'m_before')
    column_list.append('last_month_total_apartment_supply_'+str(i)+'m_before')
    column_list.append('last_month_total_unsold_count_'+str(i)+'m_before')
    column_list.append('last_month_total_unsold_ratio_'+str(i)+'m_before')
    column_list.append('last_month_total_deal_count_'+str(i)+'m_before')
    column_list.append('last_month_total_full_rent_count_'+str(i)+'m_before')
    column_list.append('last_month_total_month_rent_count_'+str(i)+'m_before')
    temp_column_total_list.append(column_list)

df_economic_6m_before.columns = temp_column_total_list[0]
df_economic_12m_before.columns = temp_column_total_list[1]

In [None]:
df_economic_6m_before.columns

Index(['year_6m_before', 'month_6m_before', 'kospi_index_6m_before',
       'korea_rp_6m_before', 'korea_3_year_6m_before',
       'korea_10_year_6m_before', 'us_3_month_6m_before',
       'us_2_year_6m_before', 'us_10_year_6m_before',
       'korea_10-3_year_6m_before', 'us_10-2_year_6m_before',
       'us_10-3_year_month_6m_before',
       'last_month_total_apartment_supply_6m_before',
       'last_month_total_unsold_count_6m_before',
       'last_month_total_unsold_ratio_6m_before',
       'last_month_total_deal_count_6m_before',
       'last_month_total_full_rent_count_6m_before',
       'last_month_total_month_rent_count_6m_before'],
      dtype='object')

In [None]:
df_economic_12m_before.columns

Index(['year_12m_before', 'month_12m_before', 'kospi_index_12m_before',
       'korea_rp_12m_before', 'korea_3_year_12m_before',
       'korea_10_year_12m_before', 'us_3_month_12m_before',
       'us_2_year_12m_before', 'us_10_year_12m_before',
       'korea_10-3_year_12m_before', 'us_10-2_year_12m_before',
       'us_10-3_year_month_12m_before',
       'last_month_total_apartment_supply_12m_before',
       'last_month_total_unsold_count_12m_before',
       'last_month_total_unsold_ratio_12m_before',
       'last_month_total_deal_count_12m_before',
       'last_month_total_full_rent_count_12m_before',
       'last_month_total_month_rent_count_12m_before'],
      dtype='object')
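
The suffixed names listed above follow one pattern, `<column>_<k>m_before`, so they could also be produced with `DataFrame.add_suffix` rather than being typed out one by one. A minimal sketch on a toy frame (`monthly` and `lagged` are illustrative names):

```python
import pandas as pd

monthly = pd.DataFrame({"year": [2011], "month": [8], "kospi_index": [1852.97]})
# One renamed copy per lag; every column, including year and month, gets the suffix
lagged = {k: monthly.add_suffix(f"_{k}m_before") for k in (6, 12)}
print(list(lagged[6].columns))
print(list(lagged[12].columns))
```

This keeps the names in sync with the columns automatically if indicators are added or removed later.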

In [None]:
pd.set_option('display.max_columns', 100)
# '6m_before_year' and '6m_before_month' in df_economic match 'year_6m_before' and 'month_6m_before' in df_economic_6m_before
df_economic = pd.merge(df_economic, df_economic_6m_before, left_on=['6m_before_year', '6m_before_month'], right_on=['year_6m_before','month_6m_before'], how='inner')
df_economic = pd.merge(df_economic, df_economic_12m_before, left_on=['12m_before_year', '12m_before_month'], right_on=['year_12m_before','month_12m_before'], how='inner')
df_economic = df_economic.drop(["6m_before_year", "6m_before_month", "12m_before_year", "12m_before_month", "year_6m_before", "month_6m_before","year_12m_before", "month_12m_before"], axis=1)
df_economic.head(20)

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,us_10_year,korea_10-3_year,us_10-2_year,us_10-3_year_month,last_month_total_apartment_supply,last_month_total_unsold_count,last_month_total_unsold_ratio,last_month_total_deal_count,last_month_total_full_rent_count,last_month_total_month_rent_count,kospi_index_6m_before,korea_rp_6m_before,korea_3_year_6m_before,korea_10_year_6m_before,us_3_month_6m_before,us_2_year_6m_before,us_10_year_6m_before,korea_10-3_year_6m_before,us_10-2_year_6m_before,us_10-3_year_month_6m_before,last_month_total_apartment_supply_6m_before,last_month_total_unsold_count_6m_before,last_month_total_unsold_ratio_6m_before,last_month_total_deal_count_6m_before,last_month_total_full_rent_count_6m_before,last_month_total_month_rent_count_6m_before,kospi_index_12m_before,korea_rp_12m_before,korea_3_year_12m_before,korea_10_year_12m_before,us_3_month_12m_before,us_2_year_12m_before,us_10_year_12m_before,korea_10-3_year_12m_before,us_10-2_year_12m_before,us_10-3_year_month_12m_before,last_month_total_apartment_supply_12m_before,last_month_total_unsold_count_12m_before,last_month_total_unsold_ratio_12m_before,last_month_total_deal_count_12m_before,last_month_total_full_rent_count_12m_before,last_month_total_month_rent_count_12m_before
0,2012-02-01,2012,2,1,86.8,1959.24,3.25,3.38,3.75,0.061,0.226,1.83,0.37,1.604,1.769,1822,1890,103.732162,2786,10445,2277,1852.969677,3.25,3.569194,3.942903,0.015387,0.225677,2.286581,0.37371,2.060903,2.271194,1964.0,1826.0,92.973523,4319.0,9682.0,2311.0,2011.301786,2.75,3.939286,4.745714,0.127536,0.762071,3.565429,0.806429,2.803357,3.437893,5342.0,2269.0,42.474729,7179.0,12336.0,2514.0
1,2012-02-02,2012,2,2,86.8,1984.3,3.25,3.38,3.76,0.084,0.226,1.823,0.38,1.597,1.739,1822,1890,103.732162,2786,10445,2277,1852.969677,3.25,3.569194,3.942903,0.015387,0.225677,2.286581,0.37371,2.060903,2.271194,1964.0,1826.0,92.973523,4319.0,9682.0,2311.0,2011.301786,2.75,3.939286,4.745714,0.127536,0.762071,3.565429,0.806429,2.803357,3.437893,5342.0,2269.0,42.474729,7179.0,12336.0,2514.0
2,2012-02-03,2012,2,3,86.8,1972.34,3.25,3.38,3.76,0.079,0.238,1.924,0.38,1.686,1.845,1822,1890,103.732162,2786,10445,2277,1852.969677,3.25,3.569194,3.942903,0.015387,0.225677,2.286581,0.37371,2.060903,2.271194,1964.0,1826.0,92.973523,4319.0,9682.0,2311.0,2011.301786,2.75,3.939286,4.745714,0.127536,0.762071,3.565429,0.806429,2.803357,3.437893,5342.0,2269.0,42.474729,7179.0,12336.0,2514.0
3,2012-02-04,2012,2,4,86.8,1972.34,3.25,3.38,3.76,0.079,0.238,1.924,0.38,1.686,1.845,1822,1890,103.732162,2786,10445,2277,1852.969677,3.25,3.569194,3.942903,0.015387,0.225677,2.286581,0.37371,2.060903,2.271194,1964.0,1826.0,92.973523,4319.0,9682.0,2311.0,2011.301786,2.75,3.939286,4.745714,0.127536,0.762071,3.565429,0.806429,2.803357,3.437893,5342.0,2269.0,42.474729,7179.0,12336.0,2514.0
4,2012-02-05,2012,2,5,86.8,1972.34,3.25,3.38,3.76,0.079,0.238,1.924,0.38,1.686,1.845,1822,1890,103.732162,2786,10445,2277,1852.969677,3.25,3.569194,3.942903,0.015387,0.225677,2.286581,0.37371,2.060903,2.271194,1964.0,1826.0,92.973523,4319.0,9682.0,2311.0,2011.301786,2.75,3.939286,4.745714,0.127536,0.762071,3.565429,0.806429,2.803357,3.437893,5342.0,2269.0,42.474729,7179.0,12336.0,2514.0
5,2012-02-06,2012,2,6,86.8,1973.13,3.25,3.39,3.78,0.086,0.234,1.901,0.39,1.667,1.815,1822,1890,103.732162,2786,10445,2277,1852.969677,3.25,3.569194,3.942903,0.015387,0.225677,2.286581,0.37371,2.060903,2.271194,1964.0,1826.0,92.973523,4319.0,9682.0,2311.0,2011.301786,2.75,3.939286,4.745714,0.127536,0.762071,3.565429,0.806429,2.803357,3.437893,5342.0,2269.0,42.474729,7179.0,12336.0,2514.0
6,2012-02-07,2012,2,7,86.8,1981.59,3.25,3.41,3.81,0.081,0.25,1.977,0.4,1.727,1.896,1822,1890,103.732162,2786,10445,2277,1852.969677,3.25,3.569194,3.942903,0.015387,0.225677,2.286581,0.37371,2.060903,2.271194,1964.0,1826.0,92.973523,4319.0,9682.0,2311.0,2011.301786,2.75,3.939286,4.745714,0.127536,0.762071,3.565429,0.806429,2.803357,3.437893,5342.0,2269.0,42.474729,7179.0,12336.0,2514.0
7,2012-02-08,2012,2,8,86.8,2003.73,3.25,3.44,3.83,0.081,0.258,1.982,0.39,1.724,1.901,1822,1890,103.732162,2786,10445,2277,1852.969677,3.25,3.569194,3.942903,0.015387,0.225677,2.286581,0.37371,2.060903,2.271194,1964.0,1826.0,92.973523,4319.0,9682.0,2311.0,2011.301786,2.75,3.939286,4.745714,0.127536,0.762071,3.565429,0.806429,2.803357,3.437893,5342.0,2269.0,42.474729,7179.0,12336.0,2514.0
8,2012-02-09,2012,2,9,86.8,2014.62,3.25,3.45,3.81,0.091,0.266,2.036,0.36,1.77,1.945,1822,1890,103.732162,2786,10445,2277,1852.969677,3.25,3.569194,3.942903,0.015387,0.225677,2.286581,0.37371,2.060903,2.271194,1964.0,1826.0,92.973523,4319.0,9682.0,2311.0,2011.301786,2.75,3.939286,4.745714,0.127536,0.762071,3.565429,0.806429,2.803357,3.437893,5342.0,2269.0,42.474729,7179.0,12336.0,2514.0
9,2012-02-10,2012,2,10,86.8,1993.71,3.25,3.455,3.82,0.089,0.278,1.984,0.365,1.706,1.895,1822,1890,103.732162,2786,10445,2277,1852.969677,3.25,3.569194,3.942903,0.015387,0.225677,2.286581,0.37371,2.060903,2.271194,1964.0,1826.0,92.973523,4319.0,9682.0,2311.0,2011.301786,2.75,3.939286,4.745714,0.127536,0.762071,3.565429,0.806429,2.803357,3.437893,5342.0,2269.0,42.474729,7179.0,12336.0,2514.0


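- The original plan was to compute rates of change, but some values are 0, which makes the result null or inf, so the absolute change (difference) is used instead

>> When writing a formula, watch out for cases where a value is divided by zero or where zero itself is divided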
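
The problem described above is easy to reproduce on toy numbers. In the sketch below (the values are made up), a past value of 0 turns the rate of change into `inf`, and 0 divided by 0 into `NaN`, while the plain difference stays finite.

```python
import numpy as np
import pandas as pd

past = pd.Series([2.0, 0.0, 0.0])
current = pd.Series([3.0, 1.0, 0.0])

rate = (current - past) / past   # relative change: breaks when past == 0
diff = current - past            # absolute change: always finite here

print(rate.tolist())
print(diff.tolist())
```

The trade-off is that differences are scale-dependent: a change of 100 in `kospi_index` and a change of 100 in a transaction count are not comparable without further scaling.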

In [None]:
column_list = list()
column_list.append(['kospi_index', 'korea_rp',
       'korea_3_year', 'korea_10_year', 'us_3_month', 'us_2_year',
       'us_10_year', 'korea_10-3_year', 'us_10-2_year', 'us_10-3_year_month',
       'last_month_total_apartment_supply', 'last_month_total_unsold_count',
       'last_month_total_unsold_ratio', 'last_month_total_deal_count',
       'last_month_total_full_rent_count', 'last_month_total_month_rent_count'])

column_list.append(temp_column_total_list[0][2:])
column_list.append(temp_column_total_list[1][2:])
column_list

[['kospi_index',
  'korea_rp',
  'korea_3_year',
  'korea_10_year',
  'us_3_month',
  'us_2_year',
  'us_10_year',
  'korea_10-3_year',
  'us_10-2_year',
  'us_10-3_year_month',
  'last_month_total_apartment_supply',
  'last_month_total_unsold_count',
  'last_month_total_unsold_ratio',
  'last_month_total_deal_count',
  'last_month_total_full_rent_count',
  'last_month_total_month_rent_count'],
 ['kospi_index_6m_before',
  'korea_rp_6m_before',
  'korea_3_year_6m_before',
  'korea_10_year_6m_before',
  'us_3_month_6m_before',
  'us_2_year_6m_before',
  'us_10_year_6m_before',
  'korea_10-3_year_6m_before',
  'us_10-2_year_6m_before',
  'us_10-3_year_month_6m_before',
  'last_month_total_apartment_supply_6m_before',
  'last_month_total_unsold_count_6m_before',
  'last_month_total_unsold_ratio_6m_before',
  'last_month_total_deal_count_6m_before',
  'last_month_total_full_rent_count_6m_before',
  'last_month_total_month_rent_count_6m_before'],
 ['kospi_index_12m_before',
  'korea_rp_12m_

In [None]:
# change = current value - past value
# Compute the changes in a loop (each *_before column is overwritten with its change)
for i in range(len(column_list[0])):
  df_economic[column_list[1][i]] = df_economic[column_list[0][i]] - df_economic[column_list[1][i]]
  df_economic[column_list[2][i]] = df_economic[column_list[0][i]] - df_economic[column_list[2][i]]
df_economic

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,us_10_year,korea_10-3_year,us_10-2_year,us_10-3_year_month,last_month_total_apartment_supply,last_month_total_unsold_count,last_month_total_unsold_ratio,last_month_total_deal_count,last_month_total_full_rent_count,last_month_total_month_rent_count,kospi_index_6m_before,korea_rp_6m_before,korea_3_year_6m_before,korea_10_year_6m_before,us_3_month_6m_before,us_2_year_6m_before,us_10_year_6m_before,korea_10-3_year_6m_before,us_10-2_year_6m_before,us_10-3_year_month_6m_before,last_month_total_apartment_supply_6m_before,last_month_total_unsold_count_6m_before,last_month_total_unsold_ratio_6m_before,last_month_total_deal_count_6m_before,last_month_total_full_rent_count_6m_before,last_month_total_month_rent_count_6m_before,kospi_index_12m_before,korea_rp_12m_before,korea_3_year_12m_before,korea_10_year_12m_before,us_3_month_12m_before,us_2_year_12m_before,us_10_year_12m_before,korea_10-3_year_12m_before,us_10-2_year_12m_before,us_10-3_year_month_12m_before,last_month_total_apartment_supply_12m_before,last_month_total_unsold_count_12m_before,last_month_total_unsold_ratio_12m_before,last_month_total_deal_count_12m_before,last_month_total_full_rent_count_12m_before,last_month_total_month_rent_count_12m_before
0,2012-02-01,2012,2,1,86.8,1959.24,3.25,3.380,3.750,0.061,0.2260,1.830,0.370,1.6040,1.769,1822,1890,103.732162,2786,10445,2277,106.270323,0.0,-0.189194,-0.192903,0.045613,0.000323,-0.456581,-0.00371,-0.456903,-0.502194,-142.0,64.0,10.758639,-1533.0,763.0,-34.0,-52.061786,0.50,-0.559286,-0.995714,-0.066536,-0.536071,-1.735429,-0.436429,-1.199357,-1.668893,-3520.0,-379.0,61.257434,-4393.0,-1891.0,-237.0
1,2012-02-02,2012,2,2,86.8,1984.30,3.25,3.380,3.760,0.084,0.2260,1.823,0.380,1.5970,1.739,1822,1890,103.732162,2786,10445,2277,131.330323,0.0,-0.189194,-0.182903,0.068613,0.000323,-0.463581,0.00629,-0.463903,-0.532194,-142.0,64.0,10.758639,-1533.0,763.0,-34.0,-27.001786,0.50,-0.559286,-0.985714,-0.043536,-0.536071,-1.742429,-0.426429,-1.206357,-1.698893,-3520.0,-379.0,61.257434,-4393.0,-1891.0,-237.0
2,2012-02-03,2012,2,3,86.8,1972.34,3.25,3.380,3.760,0.079,0.2380,1.924,0.380,1.6860,1.845,1822,1890,103.732162,2786,10445,2277,119.370323,0.0,-0.189194,-0.182903,0.063613,0.012323,-0.362581,0.00629,-0.374903,-0.426194,-142.0,64.0,10.758639,-1533.0,763.0,-34.0,-38.961786,0.50,-0.559286,-0.985714,-0.048536,-0.524071,-1.641429,-0.426429,-1.117357,-1.592893,-3520.0,-379.0,61.257434,-4393.0,-1891.0,-237.0
3,2012-02-04,2012,2,4,86.8,1972.34,3.25,3.380,3.760,0.079,0.2380,1.924,0.380,1.6860,1.845,1822,1890,103.732162,2786,10445,2277,119.370323,0.0,-0.189194,-0.182903,0.063613,0.012323,-0.362581,0.00629,-0.374903,-0.426194,-142.0,64.0,10.758639,-1533.0,763.0,-34.0,-38.961786,0.50,-0.559286,-0.985714,-0.048536,-0.524071,-1.641429,-0.426429,-1.117357,-1.592893,-3520.0,-379.0,61.257434,-4393.0,-1891.0,-237.0
4,2012-02-05,2012,2,5,86.8,1972.34,3.25,3.380,3.760,0.079,0.2380,1.924,0.380,1.6860,1.845,1822,1890,103.732162,2786,10445,2277,119.370323,0.0,-0.189194,-0.182903,0.063613,0.012323,-0.362581,0.00629,-0.374903,-0.426194,-142.0,64.0,10.758639,-1533.0,763.0,-34.0,-38.961786,0.50,-0.559286,-0.985714,-0.048536,-0.524071,-1.641429,-0.426429,-1.117357,-1.592893,-3520.0,-379.0,61.257434,-4393.0,-1891.0,-237.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3982,2022-12-27,2022,12,27,104.4,2332.79,3.25,3.661,3.612,4.311,4.3827,3.849,-0.049,-0.5337,-0.462,1759,865,49.175668,750,8890,7709,-167.330667,1.5,0.224867,0.006767,2.815190,1.377177,0.717067,-0.21810,-0.660110,-2.098123,1654.0,177.0,-606.062427,-1094.0,-2766.0,-987.0,-658.886129,2.25,1.857516,1.422548,4.251632,3.720555,2.393742,-0.434968,-1.326813,-1.857890,-266.0,811.0,46.509001,-695.0,-2296.0,1048.0
3983,2022-12-28,2022,12,28,104.4,2280.45,3.25,3.668,3.675,4.457,4.3574,3.886,0.007,-0.4714,-0.571,1759,865,49.175668,750,8890,7709,-219.670667,1.5,0.231867,0.069767,2.961190,1.351877,0.754067,-0.16210,-0.597810,-2.207123,1654.0,177.0,-606.062427,-1094.0,-2766.0,-987.0,-711.226129,2.25,1.864516,1.485548,4.397632,3.695255,2.430742,-0.378968,-1.264513,-1.966890,-266.0,811.0,46.509001,-695.0,-2296.0,1048.0
3984,2022-12-29,2022,12,29,104.4,2236.40,3.25,3.718,3.723,4.423,4.3656,3.820,0.005,-0.5456,-0.603,1759,865,49.175668,750,8890,7709,-263.720667,1.5,0.281867,0.117767,2.927190,1.360077,0.688067,-0.16410,-0.672010,-2.239123,1654.0,177.0,-606.062427,-1094.0,-2766.0,-987.0,-755.276129,2.25,1.914516,1.533548,4.363632,3.703455,2.364742,-0.380968,-1.338713,-1.998890,-266.0,811.0,46.509001,-695.0,-2296.0,1048.0
3985,2022-12-30,2022,12,30,104.4,2236.40,3.25,3.725,3.735,4.405,4.4279,3.879,0.010,-0.5489,-0.526,1759,865,49.175668,750,8890,7709,-263.720667,1.5,0.288867,0.129767,2.909190,1.422377,0.747067,-0.15910,-0.675310,-2.162123,1654.0,177.0,-606.062427,-1094.0,-2766.0,-987.0,-755.276129,2.25,1.921516,1.545548,4.345632,3.765755,2.423742,-0.375968,-1.342013,-1.921890,-266.0,811.0,46.509001,-695.0,-2296.0,1048.0


In [None]:
df_economic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3987 entries, 0 to 3986
Data columns (total 53 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   date                                          3987 non-null   object 
 1   year                                          3987 non-null   int16  
 2   month                                         3987 non-null   int16  
 3   day                                           3987 non-null   int64  
 4   apartment_index                               3987 non-null   float64
 5   kospi_index                                   3987 non-null   float64
 6   korea_rp                                      3987 non-null   float64
 7   korea_3_year                                  3987 non-null   float64
 8   korea_10_year                                 3987 non-null   float64
 9   us_3_month                                    3987 non-null   f

In [None]:

# Convert float64 columns to float32 to reduce memory usage
df_economic_columns = list(df_economic.columns)
for df_economic_column in df_economic_columns:
    if df_economic[df_economic_column].dtype == 'float64':
        df_economic[df_economic_column] = df_economic[df_economic_column].astype('float32')
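The same downcast can be sketched without an explicit loop. The block below is a minimal illustration on a small hypothetical frame (`demo` stands in for `df_economic`; its values are copied from a few rows shown in this notebook), and it also measures memory before and after so the saving can be checked rather than assumed.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_economic: two float64 columns and one int16 column.
demo = pd.DataFrame({
    "apartment_index": [86.8, 86.8, 104.4],
    "kospi_index": [1959.24, 1984.30, 2236.40],
    "year": np.array([2012, 2012, 2022], dtype="int16"),
})

before = demo.memory_usage(deep=True).sum()

# Pick every float64 column at once and downcast only those columns.
float_cols = demo.select_dtypes(include="float64").columns
demo = demo.astype({col: "float32" for col in float_cols})

after = demo.memory_usage(deep=True).sum()
print(f"bytes before: {before}, bytes after: {after}")

# float32 keeps roughly 7 significant digits, so values are only approximately preserved.
print(float(demo.loc[0, "apartment_index"]))
```

The precision loss is visible in the outputs of this notebook as well: 86.8 is displayed as 86.800003 after the conversion, which is usually acceptable for features but worth keeping in mind for exact comparisons.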

In [None]:
df_economic.info() # Confirm that the dtypes changed

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3987 entries, 0 to 3986
Data columns (total 53 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   date                                          3987 non-null   object 
 1   year                                          3987 non-null   int16  
 2   month                                         3987 non-null   int16  
 3   day                                           3987 non-null   int64  
 4   apartment_index                               3987 non-null   float32
 5   kospi_index                                   3987 non-null   float32
 6   korea_rp                                      3987 non-null   float32
 7   korea_3_year                                  3987 non-null   float32
 8   korea_10_year                                 3987 non-null   float32
 9   us_3_month                                    3987 non-null   f

In [None]:
df_economic # Inspect the DataFrame

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,us_10_year,korea_10-3_year,us_10-2_year,us_10-3_year_month,last_month_total_apartment_supply,last_month_total_unsold_count,last_month_total_unsold_ratio,last_month_total_deal_count,last_month_total_full_rent_count,last_month_total_month_rent_count,kospi_index_6m_before,korea_rp_6m_before,korea_3_year_6m_before,korea_10_year_6m_before,us_3_month_6m_before,us_2_year_6m_before,us_10_year_6m_before,korea_10-3_year_6m_before,us_10-2_year_6m_before,us_10-3_year_month_6m_before,last_month_total_apartment_supply_6m_before,last_month_total_unsold_count_6m_before,last_month_total_unsold_ratio_6m_before,last_month_total_deal_count_6m_before,last_month_total_full_rent_count_6m_before,last_month_total_month_rent_count_6m_before,kospi_index_12m_before,korea_rp_12m_before,korea_3_year_12m_before,korea_10_year_12m_before,us_3_month_12m_before,us_2_year_12m_before,us_10_year_12m_before,korea_10-3_year_12m_before,us_10-2_year_12m_before,us_10-3_year_month_12m_before,last_month_total_apartment_supply_12m_before,last_month_total_unsold_count_12m_before,last_month_total_unsold_ratio_12m_before,last_month_total_deal_count_12m_before,last_month_total_full_rent_count_12m_before,last_month_total_month_rent_count_12m_before
0,2012-02-01,2012,2,1,86.800003,1959.239990,3.25,3.380,3.750,0.061,0.2260,1.830,0.370,1.6040,1.769,1822,1890,103.732162,2786,10445,2277,106.270325,0.0,-0.189194,-0.192903,0.045613,0.000323,-0.456581,-0.00371,-0.456903,-0.502194,-142.0,64.0,10.758639,-1533.0,763.0,-34.0,-52.061787,0.50,-0.559286,-0.995714,-0.066536,-0.536071,-1.735429,-0.436429,-1.199357,-1.668893,-3520.0,-379.0,61.257435,-4393.0,-1891.0,-237.0
1,2012-02-02,2012,2,2,86.800003,1984.300049,3.25,3.380,3.760,0.084,0.2260,1.823,0.380,1.5970,1.739,1822,1890,103.732162,2786,10445,2277,131.330322,0.0,-0.189194,-0.182903,0.068613,0.000323,-0.463581,0.00629,-0.463903,-0.532194,-142.0,64.0,10.758639,-1533.0,763.0,-34.0,-27.001785,0.50,-0.559286,-0.985714,-0.043536,-0.536071,-1.742429,-0.426429,-1.206357,-1.698893,-3520.0,-379.0,61.257435,-4393.0,-1891.0,-237.0
2,2012-02-03,2012,2,3,86.800003,1972.339966,3.25,3.380,3.760,0.079,0.2380,1.924,0.380,1.6860,1.845,1822,1890,103.732162,2786,10445,2277,119.370323,0.0,-0.189194,-0.182903,0.063613,0.012323,-0.362581,0.00629,-0.374903,-0.426194,-142.0,64.0,10.758639,-1533.0,763.0,-34.0,-38.961784,0.50,-0.559286,-0.985714,-0.048536,-0.524071,-1.641429,-0.426429,-1.117357,-1.592893,-3520.0,-379.0,61.257435,-4393.0,-1891.0,-237.0
3,2012-02-04,2012,2,4,86.800003,1972.339966,3.25,3.380,3.760,0.079,0.2380,1.924,0.380,1.6860,1.845,1822,1890,103.732162,2786,10445,2277,119.370323,0.0,-0.189194,-0.182903,0.063613,0.012323,-0.362581,0.00629,-0.374903,-0.426194,-142.0,64.0,10.758639,-1533.0,763.0,-34.0,-38.961784,0.50,-0.559286,-0.985714,-0.048536,-0.524071,-1.641429,-0.426429,-1.117357,-1.592893,-3520.0,-379.0,61.257435,-4393.0,-1891.0,-237.0
4,2012-02-05,2012,2,5,86.800003,1972.339966,3.25,3.380,3.760,0.079,0.2380,1.924,0.380,1.6860,1.845,1822,1890,103.732162,2786,10445,2277,119.370323,0.0,-0.189194,-0.182903,0.063613,0.012323,-0.362581,0.00629,-0.374903,-0.426194,-142.0,64.0,10.758639,-1533.0,763.0,-34.0,-38.961784,0.50,-0.559286,-0.985714,-0.048536,-0.524071,-1.641429,-0.426429,-1.117357,-1.592893,-3520.0,-379.0,61.257435,-4393.0,-1891.0,-237.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3982,2022-12-27,2022,12,27,104.400002,2332.790039,3.25,3.661,3.612,4.311,4.3827,3.849,-0.049,-0.5337,-0.462,1759,865,49.175667,750,8890,7709,-167.330673,1.5,0.224867,0.006767,2.815190,1.377177,0.717067,-0.21810,-0.660110,-2.098123,1654.0,177.0,-606.062439,-1094.0,-2766.0,-987.0,-658.886108,2.25,1.857516,1.422548,4.251632,3.720555,2.393742,-0.434968,-1.326813,-1.857890,-266.0,811.0,46.509003,-695.0,-2296.0,1048.0
3983,2022-12-28,2022,12,28,104.400002,2280.449951,3.25,3.668,3.675,4.457,4.3574,3.886,0.007,-0.4714,-0.571,1759,865,49.175667,750,8890,7709,-219.670670,1.5,0.231867,0.069767,2.961190,1.351877,0.754067,-0.16210,-0.597810,-2.207123,1654.0,177.0,-606.062439,-1094.0,-2766.0,-987.0,-711.226135,2.25,1.864516,1.485548,4.397632,3.695255,2.430742,-0.378968,-1.264513,-1.966890,-266.0,811.0,46.509003,-695.0,-2296.0,1048.0
3984,2022-12-29,2022,12,29,104.400002,2236.399902,3.25,3.718,3.723,4.423,4.3656,3.820,0.005,-0.5456,-0.603,1759,865,49.175667,750,8890,7709,-263.720673,1.5,0.281867,0.117767,2.927190,1.360077,0.688067,-0.16410,-0.672010,-2.239123,1654.0,177.0,-606.062439,-1094.0,-2766.0,-987.0,-755.276123,2.25,1.914516,1.533548,4.363632,3.703455,2.364742,-0.380968,-1.338713,-1.998890,-266.0,811.0,46.509003,-695.0,-2296.0,1048.0
3985,2022-12-30,2022,12,30,104.400002,2236.399902,3.25,3.725,3.735,4.405,4.4279,3.879,0.010,-0.5489,-0.526,1759,865,49.175667,750,8890,7709,-263.720673,1.5,0.288867,0.129767,2.909190,1.422377,0.747067,-0.15910,-0.675310,-2.162123,1654.0,177.0,-606.062439,-1094.0,-2766.0,-987.0,-755.276123,2.25,1.921516,1.545548,4.345632,3.765755,2.423742,-0.375968,-1.342013,-1.921890,-266.0,811.0,46.509003,-695.0,-2296.0,1048.0


In [None]:
df_economic.to_pickle('/content/drive/MyDrive/house_price/after_data/final_economic.pkl')

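>> Converting dtypes is one way to reduce memory usage.

>> After merging or modifying values, check whether any null or inf values exist. If they are discovered only later in the pipeline, many earlier steps have to be redone.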
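A check like the one described in the note above could look as follows. This is a minimal sketch: `count_bad_values` is a hypothetical helper name, and `demo` is made-up data, not part of the project files.

```python
import numpy as np
import pandas as pd

def count_bad_values(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column counts of null values and of +/-inf values."""
    null_counts = df.isna().sum()
    # inf can only occur in numeric columns, so restrict the check to those.
    numeric = df.select_dtypes(include="number")
    inf_counts = pd.Series(
        np.isinf(numeric.to_numpy(dtype="float64")).sum(axis=0),
        index=numeric.columns,
    ).reindex(df.columns, fill_value=0)
    return pd.DataFrame({"n_null": null_counts, "n_inf": inf_counts})

# Made-up frame with one inf and one null in the same column.
demo = pd.DataFrame({
    "date": ["2012-02-01", "2012-02-02", "2012-02-03"],
    "ratio": [103.7, np.inf, np.nan],
    "count": [1890, 1890, 1890],
})

report = count_bad_values(demo)
print(report)
```

Running such a check right after each merge or ratio computation keeps the failure close to the step that caused it.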

# Creating the df_area_deal, df_area_full_rent, df_area_year_rent files

- On days without an apartment transaction, the most recently concluded transaction price is assumed to remain in effect
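The assumption above corresponds to a forward fill. A minimal sketch on a single hypothetical price series (the values and the five-day range are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical per-area prices for one apartment; NaN marks days with no transaction.
prices = pd.Series(
    [np.nan, 431.7, np.nan, np.nan, 485.3],
    index=pd.date_range("2011-01-01", periods=5, freq="D"),
)

# Forward fill carries the last observed price into the following days.
# Days before the first transaction stay NaN because there is nothing to carry yet.
carried = prices.ffill()
print(carried)
```

Note that the days before an apartment's first recorded transaction remain empty, which is why the later steps still have to deal with missing values.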

## Loading the required data

In [None]:
import pandas as pd
import numpy as np
# Load the data
df_deal = pd.read_csv("/content/drive/MyDrive/house_price/after_data/apartment_deal.csv",  encoding='UTF8')
df_month_rent = pd.read_csv("/content/drive/MyDrive/house_price/after_data/apartment_month_rent.csv",  encoding='UTF8')
df_full_rent = pd.read_csv("/content/drive/MyDrive/house_price/after_data/apartment_full_rent.csv",  encoding='UTF8')

## Creating the df_area_deal file

- Build a DataFrame that holds, for each apartment, the most recently concluded sale price per unit area

### Creating the daily apartment sales pivot table

In [None]:
# Inspect sample rows
df_deal.head()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,deal_price
0,2011-07-09,2011,7,9,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,64000
1,2011-07-28,2011,7,28,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,65500
2,2011-01-19,2011,1,19,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,70500
3,2011-09-02,2011,9,2,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,85000
4,2011-12-17,2011,12,17,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,68000


In [None]:
df_deal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891388 entries, 0 to 891387
Data columns (total 12 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   date        891388 non-null  object 
 1   year        891388 non-null  int64  
 2   month       891388 non-null  int64  
 3   day         891388 non-null  int64  
 4   address_0   891388 non-null  object 
 5   address_1   891388 non-null  object 
 6   address_2   891388 non-null  object 
 7   address_3   891388 non-null  float64
 8   address_4   891388 non-null  float64
 9   name        891388 non-null  object 
 10  area        891388 non-null  float64
 11  deal_price  891388 non-null  int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 81.6+ MB


In [None]:
# Add a price-per-area column
df_deal['area_deal_price'] = df_deal['deal_price'] / df_deal['area']
df_deal.head()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,deal_price,area_deal_price
0,2011-07-09,2011,7,9,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,64000,823.151125
1,2011-07-28,2011,7,28,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,65500,842.44373
2,2011-01-19,2011,1,19,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,70500,1047.859691
3,2011-09-02,2011,9,2,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,85000,1062.898587
4,2011-12-17,2011,12,17,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,68000,1010.701546


In [None]:
# Assume the most recent transaction price persists, and use it to determine the price on every date
import numpy as np
pivot_table_area_deal = df_deal.pivot_table(index=['year','month','day'], columns=['address_1','address_2','address_3','address_4'], values='area_deal_price')
pivot_table_area_deal


Unnamed: 0_level_0,Unnamed: 1_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,Unnamed: 1_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,...,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,Unnamed: 1_level_2,address_3,12.0,12.0,138.0,140.0,141.0,166.0,172.0,176.0,177.0,179.0,...,307.0,314.0,318.0,331.0,413.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,Unnamed: 1_level_3,address_4,0.0,2.0,0.0,0.0,0.0,4.0,3.0,1.0,0.0,0.0,...,76.0,1.0,81.0,64.0,8.0,0.0,0.0,0.0,0.0,0.0
year,month,day,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4,Unnamed: 22_level_4,Unnamed: 23_level_4
2011,1,1,,,,,,,,,,,...,,,,,,,,,,
2011,1,2,,,,,,,,,,,...,,,,,,,431.726908,,,
2011,1,3,,,,,,,,,,,...,,,,,,,,,,
2011,1,4,1018.685955,,,,,,,,,,...,,,,,,,,,,
2011,1,5,1087.781432,,2101.057579,,1887.191539,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023,4,26,,,,,,,,,,,...,,,,,,,,,,
2023,4,27,,,,,,,,,,,...,,,,,,,,,,
2023,4,28,,,,,,,,,,,...,,,,,,,,,,
2023,4,29,,,,,,,,,,,...,,,,,,,,,,
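One detail of the pivot above: `pivot_table` aggregates with the mean by default, so several deals at the same address on the same day collapse into their average. A minimal sketch with made-up rows (the prices are hypothetical; only the column names follow `df_deal`):

```python
import pandas as pd

# Two hypothetical deals on 2011-01-05 and one on 2011-01-06 for the same district.
deals = pd.DataFrame({
    "year": [2011, 2011, 2011],
    "month": [1, 1, 1],
    "day": [5, 5, 6],
    "address_1": ["강남구", "강남구", "강남구"],
    "area_deal_price": [1000.0, 1200.0, 1100.0],
})

# No aggfunc is given, so pandas falls back to the default, which is the mean.
pivot = deals.pivot_table(
    index=["year", "month", "day"],
    columns=["address_1"],
    values="area_deal_price",
)
print(pivot)
```

If a different rule is preferred (for example the last deal of the day), it would have to be passed explicitly through `aggfunc`.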


In [None]:
pivot_table_area_deal.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 4501 entries, (2011, 1, 1) to (2023, 4, 30)
Columns: 8890 entries, ('강남구', '개포동', 12.0, 0.0) to ('중랑구', '중화동', 454.0, 0.0)
dtypes: float64(8890)
memory usage: 305.3 MB


In [None]:
# Collect every date from 2011-01-01 through 2023-04-30 in a list
from datetime import datetime, timedelta

start_date = datetime(2011, 1, 1)  # start date
end_date = datetime(2023, 4, 30)  # end date

date_list = []
current_date = start_date
while current_date <= end_date:
    date_tuple = (current_date.year, current_date.month, current_date.day)
    date_list.append(date_tuple)
    current_date += timedelta(days=1)

print(date_list)

[(2011, 1, 1), (2011, 1, 2), (2011, 1, 3), (2011, 1, 4), (2011, 1, 5), (2011, 1, 6), (2011, 1, 7), (2011, 1, 8), (2011, 1, 9), (2011, 1, 10), (2011, 1, 11), (2011, 1, 12), (2011, 1, 13), (2011, 1, 14), (2011, 1, 15), (2011, 1, 16), (2011, 1, 17), (2011, 1, 18), (2011, 1, 19), (2011, 1, 20), (2011, 1, 21), (2011, 1, 22), (2011, 1, 23), (2011, 1, 24), (2011, 1, 25), (2011, 1, 26), (2011, 1, 27), (2011, 1, 28), (2011, 1, 29), (2011, 1, 30), (2011, 1, 31), (2011, 2, 1), (2011, 2, 2), (2011, 2, 3), (2011, 2, 4), (2011, 2, 5), (2011, 2, 6), (2011, 2, 7), (2011, 2, 8), (2011, 2, 9), (2011, 2, 10), (2011, 2, 11), (2011, 2, 12), (2011, 2, 13), (2011, 2, 14), (2011, 2, 15), (2011, 2, 16), (2011, 2, 17), (2011, 2, 18), (2011, 2, 19), (2011, 2, 20), (2011, 2, 21), (2011, 2, 22), (2011, 2, 23), (2011, 2, 24), (2011, 2, 25), (2011, 2, 26), (2011, 2, 27), (2011, 2, 28), (2011, 3, 1), (2011, 3, 2), (2011, 3, 3), (2011, 3, 4), (2011, 3, 5), (2011, 3, 6), (2011, 3, 7), (2011, 3, 8), (2011, 3, 9), (2011,

In [None]:
len(date_list)

4503

In [None]:
pivot_table_area_deal.index

MultiIndex([(2011, 1,  1),
            (2011, 1,  2),
            (2011, 1,  3),
            (2011, 1,  4),
            (2011, 1,  5),
            (2011, 1,  6),
            (2011, 1,  7),
            (2011, 1,  8),
            (2011, 1,  9),
            (2011, 1, 10),
            ...
            (2023, 4, 21),
            (2023, 4, 22),
            (2023, 4, 23),
            (2023, 4, 24),
            (2023, 4, 25),
            (2023, 4, 26),
            (2023, 4, 27),
            (2023, 4, 28),
            (2023, 4, 29),
            (2023, 4, 30)],
           names=['year', 'month', 'day'], length=4501)

In [None]:
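# Subtract the transaction dates from all dates in the period to find dates with no transaction
print(set(date_list) - set(pivot_table_area_deal.index)) # dates in the full calendar but missing from the transaction dates
print(set(pivot_table_area_deal.index) - set(date_list)) # dates that were created by mistake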

{(2016, 2, 9), (2022, 9, 11)}
set()


In [None]:
# Add the missing dates (dates absent from the transaction dates) as rows filled with null
pivot_table_area_deal.loc[(2016, 2, 9)]=np.nan
pivot_table_area_deal.loc[(2022, 9, 11)]=np.nan
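The two rows above are added by hand. As a hedged alternative, the same result can be sketched with `reindex` against a complete calendar, which inserts all missing dates at once and returns them in calendar order. The frame and the column name `apt_a` below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical pivot with a (year, month, day) index that skips 2016-02-09.
idx = pd.MultiIndex.from_tuples(
    [(2016, 2, 8), (2016, 2, 10)], names=["year", "month", "day"]
)
pivot = pd.DataFrame({"apt_a": [1000.0, 1010.0]}, index=idx)

# Build the complete calendar as tuples, mirroring the date_list used in this notebook.
days = pd.date_range("2016-02-08", "2016-02-10", freq="D")
full_index = pd.MultiIndex.from_tuples(
    [(d.year, d.month, d.day) for d in days], names=["year", "month", "day"]
)

# reindex adds an all-NaN row for every date that is absent from the pivot.
pivot_full = pivot.reindex(full_index)
print(pivot_full)
```

With this approach the set-difference check still serves as a useful sanity test, but the missing dates no longer need to be typed in manually.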

In [None]:
# Sort by year, month, day. Without sorting, the rows added just above are not in their proper positions
pivot_table_area_deal = pivot_table_area_deal.sort_values(by=['year','month','day'])
pivot_table_area_deal

Unnamed: 0_level_0,Unnamed: 1_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,Unnamed: 1_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,...,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,Unnamed: 1_level_2,address_3,12.0,12.0,138.0,140.0,141.0,166.0,172.0,176.0,177.0,179.0,...,307.0,314.0,318.0,331.0,413.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,Unnamed: 1_level_3,address_4,0.0,2.0,0.0,0.0,0.0,4.0,3.0,1.0,0.0,0.0,...,76.0,1.0,81.0,64.0,8.0,0.0,0.0,0.0,0.0,0.0
year,month,day,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4,Unnamed: 22_level_4,Unnamed: 23_level_4
2011,1,1,,,,,,,,,,,...,,,,,,,,,,
2011,1,2,,,,,,,,,,,...,,,,,,,431.726908,,,
2011,1,3,,,,,,,,,,,...,,,,,,,,,,
2011,1,4,1018.685955,,,,,,,,,,...,,,,,,,,,,
2011,1,5,1087.781432,,2101.057579,,1887.191539,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023,4,26,,,,,,,,,,,...,,,,,,,,,,
2023,4,27,,,,,,,,,,,...,,,,,,,,,,
2023,4,28,,,,,,,,,,,...,,,,,,,,,,
2023,4,29,,,,,,,,,,,...,,,,,,,,,,


In [None]:
# The most recently concluded price stays in effect, so use ffill()
pivot_table_area_deal=pivot_table_area_deal.ffill()
pivot_table_area_deal

Unnamed: 0_level_0,Unnamed: 1_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,Unnamed: 1_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,...,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,Unnamed: 1_level_2,address_3,12.0,12.0,138.0,140.0,141.0,166.0,172.0,176.0,177.0,179.0,...,307.0,314.0,318.0,331.0,413.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,Unnamed: 1_level_3,address_4,0.0,2.0,0.0,0.0,0.0,4.0,3.0,1.0,0.0,0.0,...,76.0,1.0,81.0,64.0,8.0,0.0,0.0,0.0,0.0,0.0
year,month,day,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4,Unnamed: 22_level_4,Unnamed: 23_level_4
2011,1,1,,,,,,,,,,,...,,,,,,,,,,
2011,1,2,,,,,,,,,,,...,,,,,,,431.726908,,,
2011,1,3,,,,,,,,,,,...,,,,,,,431.726908,,,
2011,1,4,1018.685955,,,,,,,,,,...,,,,,,,431.726908,,,
2011,1,5,1087.781432,,2101.057579,,1887.191539,,,,,,...,,,,,,,431.726908,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023,4,26,2712.477396,1779.004227,3297.187014,2487.219819,4324.324324,1413.594063,1342.758827,2172.968275,2136.100092,3014.696646,...,589.761736,872.199239,466.954023,956.130484,595.238095,818.61013,1029.116466,727.417008,1006.355932,1131.141746
2023,4,27,2712.477396,1779.004227,3297.187014,2487.219819,4324.324324,1413.594063,1342.758827,2172.968275,2136.100092,3014.696646,...,589.761736,872.199239,466.954023,956.130484,595.238095,818.61013,1029.116466,727.417008,1006.355932,1131.141746
2023,4,28,2712.477396,1779.004227,3297.187014,2487.219819,4324.324324,1413.594063,1342.758827,2172.968275,2136.100092,3014.696646,...,589.761736,872.199239,466.954023,956.130484,595.238095,818.61013,1029.116466,727.417008,1006.355932,1131.141746
2023,4,29,2712.477396,1779.004227,3297.187014,2487.219819,4324.324324,1413.594063,1342.758827,2172.968275,2136.100092,3014.696646,...,589.761736,872.199239,466.954023,956.130484,595.238095,818.61013,1029.116466,727.417008,1006.355932,1131.141746


In [None]:
# Fill nulls with 0. If they are left as null, the later stack() call leaves those entries out
pivot_table_area_deal = pivot_table_area_deal.fillna(0)
pivot_table_area_deal

Unnamed: 0_level_0,Unnamed: 1_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,Unnamed: 1_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,...,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,Unnamed: 1_level_2,address_3,12.0,12.0,138.0,140.0,141.0,166.0,172.0,176.0,177.0,179.0,...,307.0,314.0,318.0,331.0,413.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,Unnamed: 1_level_3,address_4,0.0,2.0,0.0,0.0,0.0,4.0,3.0,1.0,0.0,0.0,...,76.0,1.0,81.0,64.0,8.0,0.0,0.0,0.0,0.0,0.0
year,month,day,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4,Unnamed: 22_level_4,Unnamed: 23_level_4
2011,1,1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000
2011,1,2,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,431.726908,0.000000,0.000000,0.000000
2011,1,3,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,431.726908,0.000000,0.000000,0.000000
2011,1,4,1018.685955,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,431.726908,0.000000,0.000000,0.000000
2011,1,5,1087.781432,0.000000,2101.057579,0.000000,1887.191539,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,431.726908,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023,4,26,2712.477396,1779.004227,3297.187014,2487.219819,4324.324324,1413.594063,1342.758827,2172.968275,2136.100092,3014.696646,...,589.761736,872.199239,466.954023,956.130484,595.238095,818.61013,1029.116466,727.417008,1006.355932,1131.141746
2023,4,27,2712.477396,1779.004227,3297.187014,2487.219819,4324.324324,1413.594063,1342.758827,2172.968275,2136.100092,3014.696646,...,589.761736,872.199239,466.954023,956.130484,595.238095,818.61013,1029.116466,727.417008,1006.355932,1131.141746
2023,4,28,2712.477396,1779.004227,3297.187014,2487.219819,4324.324324,1413.594063,1342.758827,2172.968275,2136.100092,3014.696646,...,589.761736,872.199239,466.954023,956.130484,595.238095,818.61013,1029.116466,727.417008,1006.355932,1131.141746
2023,4,29,2712.477396,1779.004227,3297.187014,2487.219819,4324.324324,1413.594063,1342.758827,2172.968275,2136.100092,3014.696646,...,589.761736,872.199239,466.954023,956.130484,595.238095,818.61013,1029.116466,727.417008,1006.355932,1131.141746


### Pivot table -> DataFrame

In [None]:
# When slicing columns to process values, many columns consume more memory than many rows, so transpose
pivot_table_area_deal = pivot_table_area_deal.T
pivot_table_area_deal

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,year,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,...,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,month,1,1,1,1,1,1,1,1,1,1,...,4,4,4,4,4,4,4,4,4,4
Unnamed: 0_level_2,Unnamed: 1_level_2,Unnamed: 2_level_2,day,1,2,3,4,5,6,7,8,9,10,...,21,22,23,24,25,26,27,28,29,30
address_1,address_2,address_3,address_4,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3,Unnamed: 14_level_3,Unnamed: 15_level_3,Unnamed: 16_level_3,Unnamed: 17_level_3,Unnamed: 18_level_3,Unnamed: 19_level_3,Unnamed: 20_level_3,Unnamed: 21_level_3,Unnamed: 22_level_3,Unnamed: 23_level_3,Unnamed: 24_level_3
강남구,개포동,12.0,0.0,0.0,0.000000,0.000000,1018.685955,1087.781432,1040.914561,1054.852321,1054.852321,1054.852321,1054.852321,...,2712.477396,2712.477396,2712.477396,2712.477396,2712.477396,2712.477396,2712.477396,2712.477396,2712.477396,2712.477396
강남구,개포동,12.0,2.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1779.004227,1779.004227,1779.004227,1779.004227,1779.004227,1779.004227,1779.004227,1779.004227,1779.004227,1779.004227
강남구,개포동,138.0,0.0,0.0,0.000000,0.000000,0.000000,2101.057579,2101.057579,2101.057579,2101.057579,2101.057579,2101.057579,...,3297.187014,3297.187014,3297.187014,3297.187014,3297.187014,3297.187014,3297.187014,3297.187014,3297.187014,3297.187014
강남구,개포동,140.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1565.991903,2064.490759,...,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819
강남구,개포동,141.0,0.0,0.0,0.000000,0.000000,0.000000,1887.191539,1887.191539,1887.191539,1887.191539,1887.191539,1887.191539,...,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
중랑구,중화동,438.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130
중랑구,중화동,450.0,0.0,0.0,431.726908,431.726908,431.726908,431.726908,485.274431,485.274431,485.274431,485.274431,485.274431,...,1029.116466,1029.116466,1029.116466,1029.116466,1029.116466,1029.116466,1029.116466,1029.116466,1029.116466,1029.116466
중랑구,중화동,452.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008
중랑구,중화동,453.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932


- In this workflow, a pandas frame with many columns was heavier on memory than one with many rows, which is why the table is transposed

In [None]:
# Convert the pivot table into a regular (long-format) DataFrame
df_area_deal = pivot_table_area_deal.stack(level=[0,1,2])
df_area_deal =df_area_deal.reset_index()
df_area_deal

Unnamed: 0,address_1,address_2,address_3,address_4,year,month,day,0
0,강남구,개포동,12.0,0.0,2011,1,1,0.000000
1,강남구,개포동,12.0,0.0,2011,1,2,0.000000
2,강남구,개포동,12.0,0.0,2011,1,3,0.000000
3,강남구,개포동,12.0,0.0,2011,1,4,1018.685955
4,강남구,개포동,12.0,0.0,2011,1,5,1087.781432
...,...,...,...,...,...,...,...,...
40031665,중랑구,중화동,454.0,0.0,2023,4,26,1131.141746
40031666,중랑구,중화동,454.0,0.0,2023,4,27,1131.141746
40031667,중랑구,중화동,454.0,0.0,2023,4,28,1131.141746
40031668,중랑구,중화동,454.0,0.0,2023,4,29,1131.141746


In [None]:
df_area_deal.columns = ['address_1','address_2','address_3','address_4','year','month','day','area_deal'] # rename columns
df_area_deal = df_area_deal.astype({'address_3': 'int16', 'address_4': 'int16','year':'int16', 'month':'int16', 'day':'int16', 'area_deal':'float32'}) # change dtypes
df_area_deal = df_area_deal.drop(df_area_deal[df_area_deal.area_deal == 0].index) # nulls were replaced with 0 above, so drop the rows whose value is 0
df_area_deal

Unnamed: 0,address_1,address_2,address_3,address_4,year,month,day,area_deal
3,강남구,개포동,12,0,2011,1,4,1018.685974
4,강남구,개포동,12,0,2011,1,5,1087.781372
5,강남구,개포동,12,0,2011,1,6,1040.914551
6,강남구,개포동,12,0,2011,1,7,1054.852295
7,강남구,개포동,12,0,2011,1,8,1054.852295
...,...,...,...,...,...,...,...,...
40031665,중랑구,중화동,454,0,2023,4,26,1131.141724
40031666,중랑구,중화동,454,0,2023,4,27,1131.141724
40031667,중랑구,중화동,454,0,2023,4,28,1131.141724
40031668,중랑구,중화동,454,0,2023,4,29,1131.141724
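The steps above use 0 as a placeholder for "no price yet" and remove it after reshaping. A hedged alternative sketch keeps NaN and drops it explicitly after a wide-to-long reshape. It swaps in `melt` for the notebook's `stack`, and the frame and the names `apt_a` and `apt_b` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical forward-filled pivot: rows are days, columns are apartments.
wide = pd.DataFrame(
    {"apt_a": [np.nan, 1000.0, 1000.0], "apt_b": [np.nan, np.nan, 500.0]},
    index=pd.Index([1, 2, 3], name="day"),
)

# melt turns every (day, apartment) cell into one row of a long table.
long_df = wide.reset_index().melt(
    id_vars="day", var_name="apartment", value_name="area_deal"
)

# Drop cells that never received a price instead of encoding them as 0.
long_df = long_df.dropna(subset=["area_deal"]).reset_index(drop=True)
long_df = long_df[["apartment", "day", "area_deal"]]
print(long_df)
```

This avoids treating a genuine value of 0 as missing, although for price data that case is unlikely to occur in practice.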


In [None]:
df_area_deal.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 33909436 entries, 3 to 40031669
Data columns (total 8 columns):
 #   Column     Dtype  
---  ------     -----  
 0   address_1  object 
 1   address_2  object 
 2   address_3  int16  
 3   address_4  int16  
 4   year       int16  
 5   month      int16  
 6   day        int16  
 7   area_deal  float32
dtypes: float32(1), int16(5), object(2)
memory usage: 1.2+ GB


### Saving the file

In [None]:
df_area_deal.to_pickle('/content/drive/MyDrive/house_price/after_data/df_area_deal.pkl')

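## Creating the df_area_full_rent file

- Build a DataFrame that holds, for each apartment, the most recently concluded jeonse (full-deposit rent) price per unit area

- See the df_area_deal file creation section for the details of each step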

In [None]:
df_full_rent.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1498500 entries, 0 to 1498499
Data columns (total 12 columns):
 #   Column           Non-Null Count    Dtype  
---  ------           --------------    -----  
 0   date             1498500 non-null  object 
 1   year             1498500 non-null  int64  
 2   month            1498500 non-null  int64  
 3   day              1498500 non-null  int64  
 4   address_0        1498500 non-null  object 
 5   address_1        1498500 non-null  object 
 6   address_2        1498500 non-null  object 
 7   address_3        1498500 non-null  float64
 8   address_4        1498500 non-null  float64
 9   name             1498500 non-null  object 
 10  area             1498500 non-null  float64
 11  full_rent_price  1498500 non-null  int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 137.2+ MB


In [None]:
import numpy as np
# Add a price-per-area column
df_full_rent['area_full_rent_price'] = df_full_rent['full_rent_price'] / df_full_rent['area']
pivot_table_area_full_rent = df_full_rent.pivot_table(index=['year','month','day'], columns=['address_1','address_2','address_3','address_4'], values='area_full_rent_price')
pivot_table_area_full_rent

Unnamed: 0_level_0,Unnamed: 1_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,Unnamed: 1_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,...,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,Unnamed: 1_level_2,address_3,12.0,12.0,138.0,140.0,141.0,166.0,172.0,176.0,177.0,179.0,...,307.0,314.0,318.0,331.0,413.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,Unnamed: 1_level_3,address_4,0.0,2.0,0.0,0.0,0.0,4.0,3.0,1.0,0.0,0.0,...,76.0,1.0,81.0,64.0,8.0,0.0,0.0,0.0,0.0,0.0
year,month,day,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4,Unnamed: 22_level_4,Unnamed: 23_level_4
2011,1,1,,,,,,,,,,,...,,,,,,,,,,
2011,1,2,,,,,,,,,,,...,,,,,,,,,,
2011,1,3,430.053124,469.099032,,,190.044764,,,,,,...,,,,,,,,,,
2011,1,4,416.009890,,,259.109312,159.620342,,,,,,...,,,203.665988,,,,251.004016,,,
2011,1,5,,,217.090981,267.487606,212.476466,,,,,,...,,,,,,,190.408188,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023,4,26,766.075782,,,,,,,,,,...,,,,,,,,,,
2023,4,27,818.527648,,,,,,,,,,...,,,,,,,,,,
2023,4,28,,,,,,,,,,,...,,,,,,,,,,
2023,4,29,,,,,,,,,,,...,,,,,,,632.184286,,,


In [None]:
from datetime import datetime, timedelta

start_date = datetime(2011, 1, 1)  # start date
end_date = datetime(2023, 4, 30)  # end date

date_list = []
current_date = start_date
while current_date <= end_date:
    date_tuple = (current_date.year, current_date.month, current_date.day)
    date_list.append(date_tuple)
    current_date += timedelta(days=1)

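# Subtract the trade dates from every date in the period to find the dates that have no trades
print(set(date_list) - set(pivot_table_area_full_rent.index)) # dates in the full range that are missing from the trade dates
print(set(pivot_table_area_full_rent.index) - set(date_list)) # dates that were created by mistake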

set()
set()


In [None]:
pivot_table_area_full_rent = pivot_table_area_full_rent.ffill()
pivot_table_area_full_rent = pivot_table_area_full_rent.fillna(0)
pivot_table_area_full_rent = pivot_table_area_full_rent.T
df_area_full_rent = pivot_table_area_full_rent.stack(level=[0,1,2])
df_area_full_rent =df_area_full_rent.reset_index()
df_area_full_rent.columns = ['address_1','address_2','address_3','address_4','year','month','day','area_full_rent'] # rename the columns
df_area_full_rent = df_area_full_rent.drop(df_area_full_rent[df_area_full_rent.area_full_rent == 0].index) # nulls were filled with 0 above, so drop the rows whose value is 0
df_area_full_rent = df_area_full_rent.astype({'address_3': 'int16', 'address_4': 'int16','year':'int16', 'month':'int16', 'day':'int16', 'area_full_rent':'float32'})
df_area_full_rent

Unnamed: 0,address_1,address_2,address_3,address_4,year,month,day,area_full_rent
2,강남구,개포동,12,0,2011,1,3,430.053131
3,강남구,개포동,12,0,2011,1,4,416.009888
4,강남구,개포동,12,0,2011,1,5,416.009888
5,강남구,개포동,12,0,2011,1,6,416.009888
6,강남구,개포동,12,0,2011,1,7,400.000000
...,...,...,...,...,...,...,...,...
41828362,중랑구,중화동,454,0,2023,4,26,466.804993
41828363,중랑구,중화동,454,0,2023,4,27,466.804993
41828364,중랑구,중화동,454,0,2023,4,28,466.804993
41828365,중랑구,중화동,454,0,2023,4,29,466.804993


In [None]:
df_area_full_rent.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 35627911 entries, 2 to 41828366
Data columns (total 8 columns):
 #   Column          Dtype  
---  ------          -----  
 0   address_1       object 
 1   address_2       object 
 2   address_3       int16  
 3   address_4       int16  
 4   year            int16  
 5   month           int16  
 6   day             int16  
 7   area_full_rent  float32
dtypes: float32(1), int16(5), object(2)
memory usage: 1.3+ GB


In [None]:
df_area_full_rent.to_pickle('/content/drive/MyDrive/house_price/after_data/df_area_full_rent.pkl')

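## Creating the df_area_year_rent file

- Create a DataFrame holding, for each apartment, the most recently contracted monthly-rent price per unit area

- See the df_area_deal file creation section
- Apartment monthly-rent pivot table -> apartment annual-rent pivot table
- The deposit differs with the circumstances of each contract
- Apply the jeonse-to-monthly-rent conversion rate to convert the deposit of a monthly-rent contract
- Because the deposit and monthly rent can differ from deal to deal, compute the annual rent, i.e. the amount paid over one year, as 5.8% of the deposit plus monthly rent * 12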
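
The conversion described above can be sketched as a small function. This is only an illustration: the helper names `annual_rent` and `annual_rent_per_area` and the sample contract values are hypothetical, and the sketch assumes the deposit and the monthly rent are expressed in the same currency unit.

```python
CONVERSION_RATE = 0.058  # the 5.8% conversion rate used in this notebook


def annual_rent(deposit, monthly_rent, rate=CONVERSION_RATE):
    """Yearly cost of a monthly-rent contract: deposit * rate + monthly_rent * 12."""
    return deposit * rate + monthly_rent * 12


def annual_rent_per_area(deposit, monthly_rent, area, rate=CONVERSION_RATE):
    """Annual rent divided by the floor area of the unit."""
    return annual_rent(deposit, monthly_rent, rate) / area


# Hypothetical contract: deposit 10,000, monthly rent 50, area 60
yearly = annual_rent(10_000, 50)
yearly_per_area = annual_rent_per_area(10_000, 50, 60)
print(yearly, yearly_per_area)
```

Keeping the rate as a parameter makes it easy to check how sensitive the annual-rent figure is to the chosen conversion rate.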

In [None]:
df_month_rent.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 672285 entries, 0 to 672284
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   date              672285 non-null  object 
 1   year              672285 non-null  int64  
 2   month             672285 non-null  int64  
 3   day               672285 non-null  int64  
 4   address_0         672285 non-null  object 
 5   address_1         672285 non-null  object 
 6   address_2         672285 non-null  object 
 7   address_3         672285 non-null  float64
 8   address_4         672285 non-null  float64
 9   name              672285 non-null  object 
 10  area              672285 non-null  float64
 11  rent_deposit      672285 non-null  int64  
 12  month_rent_price  672285 non-null  int64  
dtypes: float64(3), int64(5), object(5)
memory usage: 66.7+ MB


In [None]:
# Annual rent (yearly cost) = 5.8% of the deposit + monthly rent * 12
df_month_rent['year_rent_price'] = (df_month_rent['rent_deposit']*0.058)+(df_month_rent['month_rent_price']*12)
df_month_rent['area_year_rent_price'] = df_month_rent['year_rent_price'] / df_month_rent['area']
pivot_table_area_year_rent = df_month_rent.pivot_table(index=['year','month','day'], columns=['address_1','address_2','address_3','address_4'], values='area_year_rent_price')
pivot_table_area_year_rent

Unnamed: 0_level_0,Unnamed: 1_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,Unnamed: 1_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,...,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,Unnamed: 1_level_2,address_3,12.0,12.0,138.0,140.0,141.0,172.0,176.0,177.0,179.0,185.0,...,307.0,307.0,314.0,318.0,331.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,Unnamed: 1_level_3,address_4,0.0,2.0,0.0,0.0,0.0,3.0,1.0,0.0,0.0,0.0,...,6.0,76.0,1.0,81.0,64.0,0.0,0.0,0.0,0.0,0.0
year,month,day,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4,Unnamed: 22_level_4,Unnamed: 23_level_4
2011,1,1,,,,,,,,,,,...,,,,,,,,,,
2011,1,2,,,,,,,,,,,...,,,,,,,,,,
2011,1,3,,,,,,,,,,,...,,,,,,,,,,
2011,1,4,,,,,,,,,,29.702312,...,,,,,,,,,,
2011,1,5,,,18.284371,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023,4,26,,,,,,,,,,,...,,,,,,,,,,
2023,4,27,31.722742,,,,,,,,,,...,,,,,,,,,,
2023,4,28,,,,,,,,,,,...,,,,,,,,,,
2023,4,29,28.332912,,99.341124,,,,,,,,...,,,,,,,,,,


In [None]:
from datetime import datetime, timedelta

start_date = datetime(2011, 1, 1)  # start date
end_date = datetime(2023, 4, 30)  # end date

date_list = []
current_date = start_date
while current_date <= end_date:
    date_tuple = (current_date.year, current_date.month, current_date.day)
    date_list.append(date_tuple)
    current_date += timedelta(days=1)

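# Subtract the trade dates from every date in the period to find the dates that have no trades
print(set(date_list) - set(pivot_table_area_year_rent.index)) # dates in the full range that are missing from the trade dates
print(set(pivot_table_area_year_rent.index) - set(date_list)) # dates that were created by mistake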

set()
set()


In [None]:
pivot_table_area_year_rent=pivot_table_area_year_rent.ffill()
pivot_table_area_year_rent = pivot_table_area_year_rent.fillna(0)

# Convert the pivot table back into a regular (long-format) DataFrame
pivot_table_area_year_rent = pivot_table_area_year_rent.T
df_area_year_rent = pivot_table_area_year_rent.stack(level=[0,1,2])
df_area_year_rent = df_area_year_rent.reset_index()
df_area_year_rent.columns = ['address_1','address_2','address_3','address_4','year','month','day','area_year_rent'] # rename the columns
df_area_year_rent = df_area_year_rent.drop(df_area_year_rent[df_area_year_rent.area_year_rent == 0].index) # nulls were filled with 0 above, so drop the rows whose value is 0
df_area_year_rent = df_area_year_rent.astype({'address_3': 'int16', 'address_4': 'int16','year':'int16', 'month':'int16', 'day':'int16', 'area_year_rent':'float32'})
df_area_year_rent

Unnamed: 0,address_1,address_2,address_3,address_4,year,month,day,area_year_rent
6,강남구,개포동,12,0,2011,1,7,30.255503
7,강남구,개포동,12,0,2011,1,8,30.255503
8,강남구,개포동,12,0,2011,1,9,30.255503
9,강남구,개포동,12,0,2011,1,10,30.255503
10,강남구,개포동,12,0,2011,1,11,30.255503
...,...,...,...,...,...,...,...,...
37910752,중랑구,중화동,454,0,2023,4,26,22.199171
37910753,중랑구,중화동,454,0,2023,4,27,22.199171
37910754,중랑구,중화동,454,0,2023,4,28,22.199171
37910755,중랑구,중화동,454,0,2023,4,29,22.199171


In [None]:
df_area_year_rent.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28251402 entries, 6 to 37910756
Data columns (total 8 columns):
 #   Column          Dtype  
---  ------          -----  
 0   address_1       object 
 1   address_2       object 
 2   address_3       int16  
 3   address_4       int16  
 4   year            int16  
 5   month           int16  
 6   day             int16  
 7   area_year_rent  float32
dtypes: float32(1), int16(5), object(2)
memory usage: 1023.8+ MB


In [None]:
df_area_year_rent.to_pickle('/content/drive/MyDrive/house_price/after_data/df_area_year_rent.pkl')

## Creating the df_area_all file

- Merge the three files df_area_deal, df_area_full_rent, and df_area_year_rent to create df_area_all
- To compute the valuation columns, use (inner) merges to keep only the apartments that have sale, jeonse, and annual-rent records
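
As a minimal sketch of why the merges keep only apartments with all three records: `pd.merge` defaults to an inner join, so a key that is missing from any one table disappears from the result. The tiny frames and the helper `make` below are made up for illustration.

```python
import pandas as pd

keys = ["address_1", "address_2", "address_3", "address_4", "year", "month", "day"]


def make(rows, value_col):
    """Build a toy frame with the join keys plus one value column."""
    return pd.DataFrame(rows, columns=keys + [value_col])


# Apartment "A" has all three records; apartment "B" has no annual-rent record.
deal = make([("A", "a", 1, 0, 2011, 1, 7, 1000.0),
             ("B", "b", 2, 0, 2011, 1, 7, 800.0)], "area_deal")
full_rent = make([("A", "a", 1, 0, 2011, 1, 7, 400.0),
                  ("B", "b", 2, 0, 2011, 1, 7, 300.0)], "area_full_rent")
year_rent = make([("A", "a", 1, 0, 2011, 1, 7, 30.0)], "area_year_rent")

# Two chained inner merges, mirroring the two-step merge used in the notebook.
merged = deal.merge(full_rent, on=keys).merge(year_rent, on=keys)
print(merged)
```

Only apartment "A" survives, because "B" has no matching row in the annual-rent table.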

In [None]:
import pandas as pd

df_area_deal = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/df_area_deal.pkl')
df_area_full_rent = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/df_area_full_rent.pkl')

In [None]:
df_area_deal_full_rent = pd.merge(df_area_deal,df_area_full_rent, on=['address_1', 'address_2', 'address_3', 'address_4', 'year', 'month','day'])

In [None]:
df_area_deal_full_rent.to_pickle('/content/drive/MyDrive/house_price/after_data/df_area_deal_full_rent.pkl')

- Run in separate steps because of insufficient memory

In [None]:
import pandas as pd

df_area_deal_full_rent = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/df_area_deal_full_rent.pkl')
df_area_year_rent = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/df_area_year_rent.pkl')

In [None]:
df_area_all = pd.merge(df_area_deal_full_rent, df_area_year_rent , on=['address_1', 'address_2', 'address_3', 'address_4', 'year', 'month','day'])

In [None]:
df_area_all.to_pickle('/content/drive/MyDrive/house_price/after_data/df_area_all.pkl')

# Creating df_original_dataset

- df_original_dataset is the DataFrame obtained by adding rate-of-change information to df_area_all and then merging in final_economic

## Checking daily aggregate figures

In [None]:
import pandas as pd

df_area_all = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/df_area_all.pkl')
df_area_all.head()

Unnamed: 0,address_1,address_2,address_3,address_4,year,month,day,area_deal,area_full_rent,area_year_rent
0,강남구,개포동,12,0,2011,1,7,1054.852295,400.0,30.255503
1,강남구,개포동,12,0,2011,1,8,1054.852295,400.0,30.255503
2,강남구,개포동,12,0,2011,1,9,1054.852295,400.0,30.255503
3,강남구,개포동,12,0,2011,1,10,1054.852295,420.425629,30.255503
4,강남구,개포동,12,0,2011,1,11,1006.830261,434.408142,30.255503


In [None]:
df_area_all.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25054084 entries, 0 to 25054083
Data columns (total 10 columns):
 #   Column          Dtype  
---  ------          -----  
 0   address_1       object 
 1   address_2       object 
 2   address_3       int16  
 3   address_4       int16  
 4   year            int16  
 5   month           int16  
 6   day             int16  
 7   area_deal       float32
 8   area_full_rent  float32
 9   area_year_rent  float32
dtypes: float32(3), int16(5), object(2)
memory usage: 1.1+ GB


In [None]:
# Check the actual memory usage
real_memory_usage = df_area_all.memory_usage(deep=True).sum() # deep=True also counts the contents of object columns, giving the actual usage
print(real_memory_usage/(1024**3),'GB')

4.815261030569673 GB


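## Data filtering

- The earliest days are backed by too few contracts for the data to be considered reliable
- Count the contracts (sale, jeonse, monthly rent) available per day and remove the days with too few
- The threshold for "too few" is set with the IQR: days whose count is a low outlier are removed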
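
The IQR rule described above can be sketched as follows. The helper name `lower_fence` and the toy counts are hypothetical; the multiplier 1.5 is the conventional box-plot choice that the notebook also uses.

```python
import pandas as pd


def lower_fence(counts, k=1.5):
    """Lower outlier bound Q1 - k * IQR for a Series of per-day counts."""
    q1 = counts.quantile(0.25)
    q3 = counts.quantile(0.75)
    return q1 - k * (q3 - q1)


# Toy per-day counts: a few very small early values followed by stable ones.
counts = pd.Series([1, 6, 18, 5000, 5100, 5200, 5300, 5400, 5500, 5600, 5700, 5800])
fence = lower_fence(counts)
too_few = counts[counts < fence]
print(fence, too_few.tolist())
```

With the quartiles reported by `describe()` in this notebook (Q1 = 4441, Q3 = 6913.75), the same formula gives 4441 - 1.5 * 2472.75 = 731.875, so days with fewer than about 732 records are treated as outliers.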

In [None]:
df_area_all_count = df_area_all.groupby(["year","month","day"])[["area_deal","area_full_rent","area_year_rent"]].count()
df_area_all_count

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,area_deal,area_full_rent,area_year_rent
year,month,day,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2011,1,2,1,1,1
2011,1,3,6,6,6
2011,1,4,18,18,18
2011,1,5,43,43,43
2011,1,6,79,79,79
...,...,...,...,...,...
2023,4,26,7606,7606,7606
2023,4,27,7606,7606,7606
2023,4,28,7607,7607,7607
2023,4,29,7607,7607,7607


>> When using the dataset, a figure is only valuable as data if the sample it was derived from is large enough

In [None]:
df_area_all_count.describe() # note the very large gap between the min and the first quartile

Unnamed: 0,area_deal,area_full_rent,area_year_rent
count,4502.0,4502.0,4502.0
mean,5565.100844,5565.100844,5565.100844
std,1644.460048,1644.460048,1644.460048
min,1.0,1.0,1.0
25%,4441.0,4441.0,4441.0
50%,6014.5,6014.5,6014.5
75%,6913.75,6913.75,6913.75
max,7607.0,7607.0,7607.0


In [None]:
# Use a box plot to confirm that outliers exist
import plotly.express as px
fig = px.box(df_area_all_count, y="area_deal")
fig.show()

In [None]:
# The bar chart confirms that the number of records increases steadily over time
# So dropping the rows whose count is below a threshold only removes the earliest dates, which makes the removal safe
import plotly.express as px

df_area_all_count_2 = df_area_all_count.reset_index()
fig = px.bar(df_area_all_count_2, x=df_area_all_count_2.index, y='area_deal')
fig.show()

In [None]:
# Quartiles and IQR used to remove the outliers
q1=df_area_all_count['area_deal'].quantile(0.25)
q2=df_area_all_count['area_deal'].quantile(0.5)
q3=df_area_all_count['area_deal'].quantile(0.75)
iqr=q3-q1
iqr

2472.75

In [None]:
# Inspect the index (dates) of the outliers
df_area_all_count.loc[df_area_all_count['area_deal']<q1-1.5*iqr,'area_deal'].index


MultiIndex([(2011, 1,  2),
            (2011, 1,  3),
            (2011, 1,  4),
            (2011, 1,  5),
            (2011, 1,  6),
            (2011, 1,  7),
            (2011, 1,  8),
            (2011, 1,  9),
            (2011, 1, 10),
            (2011, 1, 11),
            (2011, 1, 12),
            (2011, 1, 13),
            (2011, 1, 14),
            (2011, 1, 15),
            (2011, 1, 16),
            (2011, 1, 17),
            (2011, 1, 18),
            (2011, 1, 19),
            (2011, 1, 20),
            (2011, 1, 21),
            (2011, 1, 22),
            (2011, 1, 23),
            (2011, 1, 24),
            (2011, 1, 25),
            (2011, 1, 26),
            (2011, 1, 27),
            (2011, 1, 28),
            (2011, 1, 29),
            (2011, 1, 30),
            (2011, 1, 31)],
           names=['year', 'month', 'day'])

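## Creating df_area_micro

- df_area_micro is an intermediate DataFrame produced while building df_original_dataset: df_area_all extended with valuation metrics and rates of change relative to past figures

### Grouping by day


- First, group the trade records by day and compute the average prices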

In [None]:
df_area_micro=df_area_all.groupby(["year","month","day"])[["area_deal","area_full_rent","area_year_rent"]].mean()
df_area_micro

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,area_deal,area_full_rent,area_year_rent
year,month,day,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2011,1,2,595.000000,259.771637,18.880535
2011,1,3,519.548096,274.167999,16.547453
2011,1,4,704.231018,326.309021,21.519497
2011,1,5,768.772095,326.750854,21.022175
2011,1,6,709.595642,303.017273,19.990875
...,...,...,...,...,...
2023,4,26,1024.179443,580.640137,29.545254
2023,4,27,1024.288940,580.863586,29.561056
2023,4,28,1024.102783,581.085632,29.577700
2023,4,29,1023.981995,581.219116,29.587328


In [None]:
# Drop the days identified above whose sample is too small to be reliable
df_area_micro.drop(df_area_all_count.loc[df_area_all_count['area_deal']<q1-1.5*iqr,'area_deal'].index,inplace=True)
df_area_micro

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,area_deal,area_full_rent,area_year_rent
year,month,day,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2011,2,1,650.274597,304.564636,21.068266
2011,2,2,650.676758,303.993530,21.060005
2011,2,3,650.676758,304.175018,21.060005
2011,2,4,650.503906,303.648407,21.046247
2011,2,5,650.336243,303.861664,21.049116
...,...,...,...,...,...
2023,4,26,1024.179443,580.640137,29.545254
2023,4,27,1024.288940,580.863586,29.561056
2023,4,28,1024.102783,581.085632,29.577700
2023,4,29,1023.981995,581.219116,29.587328


In [None]:
df_area_micro.reset_index(inplace=True)
df_area_micro

Unnamed: 0,year,month,day,area_deal,area_full_rent,area_year_rent
0,2011,2,1,650.274597,304.564636,21.068266
1,2011,2,2,650.676758,303.993530,21.060005
2,2011,2,3,650.676758,304.175018,21.060005
3,2011,2,4,650.503906,303.648407,21.046247
4,2011,2,5,650.336243,303.861664,21.049116
...,...,...,...,...,...,...
4467,2023,4,26,1024.179443,580.640137,29.545254
4468,2023,4,27,1024.288940,580.863586,29.561056
4469,2023,4,28,1024.102783,581.085632,29.577700
4470,2023,4,29,1023.981995,581.219116,29.587328


#### Adding valuation-metric columns

- Compute the jeonse-to-sale price ratio (deal_full_rent_rate) and the annual-rent multiple (deal_year_rent_multiple)
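
A minimal sketch of the two metrics, using the column names from this notebook and the per-area averages shown for 2011-02-01 in the table above. The wrapper function `add_valuation_metrics` is a hypothetical name introduced for illustration.

```python
import pandas as pd


def add_valuation_metrics(df):
    """Return a copy with the jeonse ratio (%) and the annual-rent multiple added."""
    out = df.copy()
    # Jeonse price as a percentage of the sale price
    out["deal_full_rent_rate"] = 100 * out["area_full_rent"] / out["area_deal"]
    # How many years of annual rent equal the sale price
    out["deal_year_rent_multiple"] = out["area_deal"] / out["area_year_rent"]
    return out


sample = pd.DataFrame({
    "area_deal": [650.274597],
    "area_full_rent": [304.564636],
    "area_year_rent": [21.068266],
})
result = add_valuation_metrics(sample)
print(result[["deal_full_rent_rate", "deal_year_rent_multiple"]])
```

The multiple reads like a price-to-earnings ratio: a higher value means the sale price is expensive relative to the rent it could earn.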

In [None]:
df_area_micro['deal_full_rent_rate'] = 100*(df_area_micro['area_full_rent'] / df_area_micro['area_deal'])
df_area_micro['deal_year_rent_multiple'] = df_area_micro['area_deal']/ df_area_micro['area_year_rent']
df_area_micro

Unnamed: 0,year,month,day,area_deal,area_full_rent,area_year_rent,deal_full_rent_rate,deal_year_rent_multiple
0,2011,2,1,650.274597,304.564636,21.068266,46.836311,30.865122
1,2011,2,2,650.676758,303.993530,21.060005,46.719593,30.896324
2,2011,2,3,650.676758,304.175018,21.060005,46.747486,30.896324
3,2011,2,4,650.503906,303.648407,21.046247,46.678951,30.908308
4,2011,2,5,650.336243,303.861664,21.049116,46.723778,30.896132
...,...,...,...,...,...,...,...,...
4467,2023,4,26,1024.179443,580.640137,29.545254,56.693203,34.664772
4468,2023,4,27,1024.288940,580.863586,29.561056,56.708954,34.649944
4469,2023,4,28,1024.102783,581.085632,29.577700,56.740944,34.624153
4470,2023,4,29,1023.981995,581.219116,29.587328,56.760674,34.608803


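### Checking monthly average aggregate figures

- Taking the figures from 6 and 12 months ago on a daily basis would tie the comparison too closely to one specific date, so compute monthly averages and use them as the baseline for the rates of change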
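
Comparing against monthly baselines requires a join key that says which (year, month) lies N months after a given month. One possible way to compute that key, as an alternative sketch to branching on the month number, is plain integer arithmetic; the helper name `months_later` is hypothetical.

```python
import pandas as pd


def months_later(year, month, n):
    """Return the (year, month) that lies n months after the given (year, month).

    Works on plain integers and on pandas Series alike.
    """
    total = year * 12 + (month - 1) + n  # months counted from year 0
    return total // 12, total % 12 + 1


monthly = pd.DataFrame({"year": [2011, 2011, 2022], "month": [3, 7, 12]})
monthly["6m_after_year"], monthly["6m_after_month"] = months_later(
    monthly["year"], monthly["month"], 6
)
print(monthly)
```

The same function covers the 12-month case with `n=12`, and the resulting columns stay integer-typed, so no later cast is needed.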

In [None]:
df_area_micro_month = df_area_micro.groupby(["year","month"])[["area_deal","area_full_rent","area_year_rent"]].mean().copy()
df_area_micro_month.reset_index(inplace=True)
df_area_micro_month

Unnamed: 0,year,month,area_deal,area_full_rent,area_year_rent
0,2011,2,641.325623,300.576172,20.937654
1,2011,3,615.491150,297.227692,20.475676
2,2011,4,600.585022,295.111023,20.246986
3,2011,5,589.073303,295.179962,20.066994
4,2011,6,581.239624,297.469910,19.992504
...,...,...,...,...,...
142,2022,12,1060.324219,605.178101,30.067490
143,2023,1,1050.356201,596.822388,29.944647
144,2023,2,1040.774170,588.755737,29.721769
145,2023,3,1031.198730,583.412354,29.565010


In [None]:
# Assume the figures are published one month after the contract date, so shift them by one row (one month)
df_area_micro_month['area_deal'] = df_area_micro_month['area_deal'].shift(1)
df_area_micro_month['area_full_rent'] = df_area_micro_month['area_full_rent'].shift(1)
df_area_micro_month['area_year_rent'] = df_area_micro_month['area_year_rent'].shift(1)
df_area_micro_month = df_area_micro_month.dropna()
df_area_micro_month.columns = ['year','month','last_month_area_deal','last_month_area_full_count', 'last_month_area_year_rent']
df_area_micro_month

Unnamed: 0,year,month,last_month_area_deal,last_month_area_full_count,last_month_area_year_rent
1,2011,3,641.325623,300.576172,20.937654
2,2011,4,615.491150,297.227692,20.475676
3,2011,5,600.585022,295.111023,20.246986
4,2011,6,589.073303,295.179962,20.066994
5,2011,7,581.239624,297.469910,19.992504
...,...,...,...,...,...
142,2022,12,1068.705444,612.513855,30.279573
143,2023,1,1060.324219,605.178101,30.067490
144,2023,2,1050.356201,596.822388,29.944647
145,2023,3,1040.774170,588.755737,29.721769


#### 6개월전 종합 수치 병합

In [None]:
# Add columns holding the year and month 6 months after each row of df_area_micro_month; merging them on df_area_micro's year and month then yields the figures from 6 months earlier
df_area_micro_month_6m = df_area_micro_month.copy()
df_area_micro_month_6m.loc[df_area_micro_month_6m['month']<7, '6m_after_year'] = df_area_micro_month_6m['year']
df_area_micro_month_6m.loc[df_area_micro_month_6m['month']<7, '6m_after_month'] = df_area_micro_month_6m['month'] + 6
df_area_micro_month_6m.loc[df_area_micro_month_6m['month']>=7, '6m_after_year'] = df_area_micro_month_6m['year'] + 1
df_area_micro_month_6m.loc[df_area_micro_month_6m['month']>=7, '6m_after_month'] = df_area_micro_month_6m['month'] - 6

df_area_micro_month_6m



Unnamed: 0,year,month,last_month_area_deal,last_month_area_full_count,last_month_area_year_rent,6m_after_year,6m_after_month
1,2011,3,641.325623,300.576172,20.937654,2011.0,9.0
2,2011,4,615.491150,297.227692,20.475676,2011.0,10.0
3,2011,5,600.585022,295.111023,20.246986,2011.0,11.0
4,2011,6,589.073303,295.179962,20.066994,2011.0,12.0
5,2011,7,581.239624,297.469910,19.992504,2012.0,1.0
...,...,...,...,...,...,...,...
142,2022,12,1068.705444,612.513855,30.279573,2023.0,6.0
143,2023,1,1060.324219,605.178101,30.067490,2023.0,7.0
144,2023,2,1050.356201,596.822388,29.944647,2023.0,8.0
145,2023,3,1040.774170,588.755737,29.721769,2023.0,9.0


In [None]:
df_area_micro_month_6m = df_area_micro_month_6m.drop(['year','month'],axis=1)
df_area_micro_month_6m = df_area_micro_month_6m.astype({'6m_after_year':'int16', '6m_after_month' : 'int16'})
df_area_micro_month_6m.rename(columns = {'last_month_area_deal' : '6m_before_area_deal_mean', 'last_month_area_full_count' : '6m_before_area_full_rent_mean',
                                      'last_month_area_year_rent' : '6m_before_area_year_rent_mean'}, inplace = True)
df_area_micro_month_6m

Unnamed: 0,6m_before_area_deal_mean,6m_before_area_full_rent_mean,6m_before_area_year_rent_mean,6m_after_year,6m_after_month
1,641.325623,300.576172,20.937654,2011,9
2,615.491150,297.227692,20.475676,2011,10
3,600.585022,295.111023,20.246986,2011,11
4,589.073303,295.179962,20.066994,2011,12
5,581.239624,297.469910,19.992504,2012,1
...,...,...,...,...,...
142,1068.705444,612.513855,30.279573,2023,6
143,1060.324219,605.178101,30.067490,2023,7
144,1050.356201,596.822388,29.944647,2023,8
145,1040.774170,588.755737,29.721769,2023,9


In [None]:
df_area_micro_month_6m['6m_before_deal_full_rent_rate'] = 100*(df_area_micro_month_6m['6m_before_area_full_rent_mean'] / df_area_micro_month_6m['6m_before_area_deal_mean'])
df_area_micro_month_6m['6m_before_deal_year_rent_multiple'] = df_area_micro_month_6m['6m_before_area_deal_mean']/ df_area_micro_month_6m['6m_before_area_year_rent_mean']
df_area_micro_month_6m

Unnamed: 0,6m_before_area_deal_mean,6m_before_area_full_rent_mean,6m_before_area_year_rent_mean,6m_after_year,6m_after_month,6m_before_deal_full_rent_rate,6m_before_deal_year_rent_multiple
1,641.325623,300.576172,20.937654,2011,9,46.867950,30.630251
2,615.491150,297.227692,20.475676,2011,10,48.291142,30.059626
3,600.585022,295.111023,20.246986,2011,11,49.137260,29.662933
4,589.073303,295.179962,20.066994,2011,12,50.109207,29.355333
5,581.239624,297.469910,19.992504,2012,1,51.178532,29.072878
...,...,...,...,...,...,...,...
142,1068.705444,612.513855,30.279573,2023,6,57.313625,35.294601
143,1060.324219,605.178101,30.067490,2023,7,57.074814,35.264809
144,1050.356201,596.822388,29.944647,2023,8,56.820953,35.076591
145,1040.774170,588.755737,29.721769,2023,9,56.569016,35.017235


In [None]:
df_area_micro = pd.merge(df_area_micro,df_area_micro_month_6m, left_on=['year','month'], right_on=['6m_after_year','6m_after_month'],how = 'left') # with how='inner', even more rows would be lost when merging the 12-month part
df_area_micro

Unnamed: 0,year,month,day,area_deal,area_full_rent,area_year_rent,deal_full_rent_rate,deal_year_rent_multiple,6m_before_area_deal_mean,6m_before_area_full_rent_mean,6m_before_area_year_rent_mean,6m_after_year,6m_after_month,6m_before_deal_full_rent_rate,6m_before_deal_year_rent_multiple
0,2011,2,1,650.274597,304.564636,21.068266,46.836311,30.865122,,,,,,,
1,2011,2,2,650.676758,303.993530,21.060005,46.719593,30.896324,,,,,,,
2,2011,2,3,650.676758,304.175018,21.060005,46.747486,30.896324,,,,,,,
3,2011,2,4,650.503906,303.648407,21.046247,46.678951,30.908308,,,,,,,
4,2011,2,5,650.336243,303.861664,21.049116,46.723778,30.896132,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4467,2023,4,26,1024.179443,580.640137,29.545254,56.693203,34.664772,1079.784546,619.757629,30.16025,2023.0,4.0,57.39642,35.801579
4468,2023,4,27,1024.288940,580.863586,29.561056,56.708954,34.649944,1079.784546,619.757629,30.16025,2023.0,4.0,57.39642,35.801579
4469,2023,4,28,1024.102783,581.085632,29.577700,56.740944,34.624153,1079.784546,619.757629,30.16025,2023.0,4.0,57.39642,35.801579
4470,2023,4,29,1023.981995,581.219116,29.587328,56.760674,34.608803,1079.784546,619.757629,30.16025,2023.0,4.0,57.39642,35.801579


#### 12개월전 종합 수치 병합

In [None]:
df_area_micro_month_12m = df_area_micro_month.copy()
df_area_micro_month_12m['12m_after_year'] = df_area_micro_month_12m['year']+1
df_area_micro_month_12m['12m_after_month'] = df_area_micro_month_12m['month']

df_area_micro_month_12m = df_area_micro_month_12m.drop(['year','month'],axis=1)
df_area_micro_month_12m = df_area_micro_month_12m.astype({'12m_after_year':'int16', '12m_after_month' : 'int16'})
df_area_micro_month_12m.rename(columns = {'last_month_area_deal' : '12m_before_area_deal_mean', 'last_month_area_full_count' : '12m_before_area_full_rent_mean',
                                      'last_month_area_year_rent' : '12m_before_area_year_rent_mean'}, inplace = True)

df_area_micro_month_12m['12m_before_deal_full_rent_rate'] = 100*(df_area_micro_month_12m['12m_before_area_full_rent_mean'] / df_area_micro_month_12m['12m_before_area_deal_mean'])
df_area_micro_month_12m['12m_before_deal_year_rent_multiple'] =df_area_micro_month_12m['12m_before_area_deal_mean']/ df_area_micro_month_12m['12m_before_area_year_rent_mean']


df_area_micro = pd.merge(df_area_micro, df_area_micro_month_12m, left_on=['year','month'], right_on=['12m_after_year','12m_after_month'],how = 'left')
df_area_micro

Unnamed: 0,year,month,day,area_deal,area_full_rent,area_year_rent,deal_full_rent_rate,deal_year_rent_multiple,6m_before_area_deal_mean,6m_before_area_full_rent_mean,...,6m_after_month,6m_before_deal_full_rent_rate,6m_before_deal_year_rent_multiple,12m_before_area_deal_mean,12m_before_area_full_rent_mean,12m_before_area_year_rent_mean,12m_after_year,12m_after_month,12m_before_deal_full_rent_rate,12m_before_deal_year_rent_multiple
0,2011,2,1,650.274597,304.564636,21.068266,46.836311,30.865122,,,...,,,,,,,,,,
1,2011,2,2,650.676758,303.993530,21.060005,46.719593,30.896324,,,...,,,,,,,,,,
2,2011,2,3,650.676758,304.175018,21.060005,46.747486,30.896324,,,...,,,,,,,,,,
3,2011,2,4,650.503906,303.648407,21.046247,46.678951,30.908308,,,...,,,,,,,,,,
4,2011,2,5,650.336243,303.861664,21.049116,46.723778,30.896132,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4467,2023,4,26,1024.179443,580.640137,29.545254,56.693203,34.664772,1079.784546,619.757629,...,4.0,57.39642,35.801579,1082.813354,608.582642,28.969515,2023.0,4.0,56.203835,37.377682
4468,2023,4,27,1024.288940,580.863586,29.561056,56.708954,34.649944,1079.784546,619.757629,...,4.0,57.39642,35.801579,1082.813354,608.582642,28.969515,2023.0,4.0,56.203835,37.377682
4469,2023,4,28,1024.102783,581.085632,29.577700,56.740944,34.624153,1079.784546,619.757629,...,4.0,57.39642,35.801579,1082.813354,608.582642,28.969515,2023.0,4.0,56.203835,37.377682
4470,2023,4,29,1023.981995,581.219116,29.587328,56.760674,34.608803,1079.784546,619.757629,...,4.0,57.39642,35.801579,1082.813354,608.582642,28.969515,2023.0,4.0,56.203835,37.377682


### Modifying the df_area_micro columns

- Replace the 6- and 12-month-ago figures with the percentage change from them to the current value
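
The replacement applies the same percentage-change formula to every column pair. A minimal sketch with a hypothetical helper `pct_change_from` and made-up numbers; unlike the notebook, it writes the result to a separate, hypothetical column so the baseline stays visible.

```python
import pandas as pd


def pct_change_from(current, reference):
    """Percentage change of `current` relative to `reference`."""
    return 100 * (current - reference) / reference


df = pd.DataFrame({
    "area_deal": [110.0, 90.0],
    "6m_before_area_deal_mean": [100.0, 100.0],
})
# Hypothetical output column name, chosen to avoid overwriting the baseline
df["6m_change_area_deal"] = pct_change_from(
    df["area_deal"], df["6m_before_area_deal_mean"]
)
print(df)
```

A positive value means the current figure is above its baseline from 6 (or 12) months earlier, a negative value means it is below.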

In [None]:
df_area_micro = df_area_micro.drop(['6m_after_year','6m_after_month', '12m_after_year', '12m_after_month'], axis=1)

df_area_micro['6m_before_area_deal_mean'] = 100*((df_area_micro['area_deal'] - df_area_micro['6m_before_area_deal_mean'])/ df_area_micro['6m_before_area_deal_mean'])
df_area_micro['6m_before_area_full_rent_mean'] = 100*((df_area_micro['area_full_rent'] - df_area_micro['6m_before_area_full_rent_mean'])/ df_area_micro['6m_before_area_full_rent_mean'])
df_area_micro['6m_before_area_year_rent_mean'] = 100*((df_area_micro['area_year_rent'] - df_area_micro['6m_before_area_year_rent_mean'])/ df_area_micro['6m_before_area_year_rent_mean'])
df_area_micro['6m_before_deal_full_rent_rate'] = 100*((df_area_micro['deal_full_rent_rate'] - df_area_micro['6m_before_deal_full_rent_rate'])/ df_area_micro['6m_before_deal_full_rent_rate'])
df_area_micro['6m_before_deal_year_rent_multiple'] = 100*((df_area_micro['deal_year_rent_multiple'] - df_area_micro['6m_before_deal_year_rent_multiple'])/ df_area_micro['6m_before_deal_year_rent_multiple'])


df_area_micro['12m_before_area_deal_mean'] = 100*((df_area_micro['area_deal'] - df_area_micro['12m_before_area_deal_mean'])/ df_area_micro['12m_before_area_deal_mean'])
df_area_micro['12m_before_area_full_rent_mean'] = 100*((df_area_micro['area_full_rent'] - df_area_micro['12m_before_area_full_rent_mean'])/ df_area_micro['12m_before_area_full_rent_mean'])
df_area_micro['12m_before_area_year_rent_mean'] = 100*((df_area_micro['area_year_rent'] - df_area_micro['12m_before_area_year_rent_mean'])/ df_area_micro['12m_before_area_year_rent_mean'])
df_area_micro['12m_before_deal_full_rent_rate'] = 100*((df_area_micro['deal_full_rent_rate'] - df_area_micro['12m_before_deal_full_rent_rate'])/ df_area_micro['12m_before_deal_full_rent_rate'])
df_area_micro['12m_before_deal_year_rent_multiple'] = 100*((df_area_micro['deal_year_rent_multiple'] - df_area_micro['12m_before_deal_year_rent_multiple'])/ df_area_micro['12m_before_deal_year_rent_multiple'])

df_area_micro.head()

Unnamed: 0,year,month,day,area_deal,area_full_rent,area_year_rent,deal_full_rent_rate,deal_year_rent_multiple,6m_before_area_deal_mean,6m_before_area_full_rent_mean,6m_before_area_year_rent_mean,6m_before_deal_full_rent_rate,6m_before_deal_year_rent_multiple,12m_before_area_deal_mean,12m_before_area_full_rent_mean,12m_before_area_year_rent_mean,12m_before_deal_full_rent_rate,12m_before_deal_year_rent_multiple
0,2011,2,1,650.274597,304.564636,21.068266,46.836311,30.865122,,,,,,,,,,
1,2011,2,2,650.676758,303.99353,21.060005,46.719593,30.896324,,,,,,,,,,
2,2011,2,3,650.676758,304.175018,21.060005,46.747486,30.896324,,,,,,,,,,
3,2011,2,4,650.503906,303.648407,21.046247,46.678951,30.908308,,,,,,,,,,
4,2011,2,5,650.336243,303.861664,21.049116,46.723778,30.896132,,,,,,,,,,


In [None]:
df_area_micro = df_area_micro.dropna()
df_area_micro.head()

Unnamed: 0,year,month,day,area_deal,area_full_rent,area_year_rent,deal_full_rent_rate,deal_year_rent_multiple,6m_before_area_deal_mean,6m_before_area_full_rent_mean,6m_before_area_year_rent_mean,6m_before_deal_full_rent_rate,6m_before_deal_year_rent_multiple,12m_before_area_deal_mean,12m_before_area_full_rent_mean,12m_before_area_year_rent_mean,12m_before_deal_full_rent_rate,12m_before_deal_year_rent_multiple
394,2012,3,1,548.170105,309.03598,20.537769,56.375927,26.69083,-4.654602,1.893891,1.922042,6.868177,-6.452623,-14.525463,2.814531,-1.909885,20.286734,-12.861209
395,2012,3,2,547.971802,308.738037,20.522972,56.341957,26.700411,-4.689094,1.795654,1.848608,6.803782,-6.419044,-14.556384,2.715407,-1.980558,20.214254,-12.829931
396,2012,3,3,547.847473,309.738129,20.533909,56.537292,26.680136,-4.710719,2.1254,1.902884,7.174067,-6.490106,-14.57577,3.048131,-1.928323,20.631031,-12.896124
397,2012,3,4,548.025452,309.982178,20.534513,56.563461,26.688017,-4.679762,2.205867,1.905884,7.223673,-6.462483,-14.548018,3.129325,-1.925435,20.686869,-12.870394
398,2012,3,5,547.768799,309.442413,20.514805,56.491428,26.701145,-4.724403,2.027898,1.808077,7.087125,-6.416471,-14.588037,2.949749,-2.019566,20.533175,-12.827534


In [None]:
df_area_micro.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4078 entries, 394 to 4471
Data columns (total 18 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   year                                4078 non-null   int64  
 1   month                               4078 non-null   int64  
 2   day                                 4078 non-null   int64  
 3   area_deal                           4078 non-null   float32
 4   area_full_rent                      4078 non-null   float32
 5   area_year_rent                      4078 non-null   float32
 6   deal_full_rent_rate                 4078 non-null   float32
 7   deal_year_rent_multiple             4078 non-null   float32
 8   6m_before_area_deal_mean            4078 non-null   float32
 9   6m_before_area_full_rent_mean       4078 non-null   float32
 10  6m_before_area_year_rent_mean       4078 non-null   float32
 11  6m_before_deal_full_rent_rate       4078 

In [None]:
df_area_micro.to_pickle('/content/drive/MyDrive/house_price/after_data/df_area_micro.pkl')

## Merging with final_economic

In [None]:
import pandas as pd
df_economic = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/final_economic.pkl')
df_economic.head()

Unnamed: 0,date,year,month,day,apartment_index,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,...,us_10_year_12m_before,korea_10-3_year_12m_before,us_10-2_year_12m_before,us_10-3_year_month_12m_before,last_month_total_apartment_supply_12m_before,last_month_total_unsold_count_12m_before,last_month_total_unsold_ratio_12m_before,last_month_total_deal_count_12m_before,last_month_total_full_rent_count_12m_before,last_month_total_month_rent_count_12m_before
0,2012-02-01,2012,2,1,86.800003,1959.23999,3.25,3.38,3.75,0.061,...,-1.735429,-0.436429,-1.199357,-1.668893,-3520.0,-379.0,61.257435,-4393.0,-1891.0,-237.0
1,2012-02-02,2012,2,2,86.800003,1984.300049,3.25,3.38,3.76,0.084,...,-1.742429,-0.426429,-1.206357,-1.698893,-3520.0,-379.0,61.257435,-4393.0,-1891.0,-237.0
2,2012-02-03,2012,2,3,86.800003,1972.339966,3.25,3.38,3.76,0.079,...,-1.641429,-0.426429,-1.117357,-1.592893,-3520.0,-379.0,61.257435,-4393.0,-1891.0,-237.0
3,2012-02-04,2012,2,4,86.800003,1972.339966,3.25,3.38,3.76,0.079,...,-1.641429,-0.426429,-1.117357,-1.592893,-3520.0,-379.0,61.257435,-4393.0,-1891.0,-237.0
4,2012-02-05,2012,2,5,86.800003,1972.339966,3.25,3.38,3.76,0.079,...,-1.641429,-0.426429,-1.117357,-1.592893,-3520.0,-379.0,61.257435,-4393.0,-1891.0,-237.0


In [None]:
# Macroeconomic indicators are missing from 2023 onward, so those rows are dropped by this (inner) merge
df_original_dataset = pd.merge(df_area_micro,df_economic, on = ['year','month','day'])
df_original_dataset

Unnamed: 0,year,month,day,area_deal,area_full_rent,area_year_rent,deal_full_rent_rate,deal_year_rent_multiple,6m_before_area_deal_mean,6m_before_area_full_rent_mean,...,us_10_year_12m_before,korea_10-3_year_12m_before,us_10-2_year_12m_before,us_10-3_year_month_12m_before,last_month_total_apartment_supply_12m_before,last_month_total_unsold_count_12m_before,last_month_total_unsold_ratio_12m_before,last_month_total_deal_count_12m_before,last_month_total_full_rent_count_12m_before,last_month_total_month_rent_count_12m_before
0,2012,3,1,548.170105,309.035980,20.537769,56.375927,26.690830,-4.654602,1.893891,...,-1.374226,-0.392419,-0.987774,-1.352387,-2468.0,-513.0,102.561394,-2078.0,794.0,-73.0
1,2012,3,2,547.971802,308.738037,20.522972,56.341957,26.700411,-4.689094,1.795654,...,-1.425226,-0.407419,-1.023774,-1.395387,-2468.0,-513.0,102.561394,-2078.0,794.0,-73.0
2,2012,3,3,547.847473,309.738129,20.533909,56.537292,26.680136,-4.710719,2.125400,...,-1.425226,-0.407419,-1.023374,-1.395487,-2468.0,-513.0,102.561394,-2078.0,794.0,-73.0
3,2012,3,4,548.025452,309.982178,20.534513,56.563461,26.688017,-4.679762,2.205867,...,-1.425226,-0.407419,-1.023374,-1.395487,-2468.0,-513.0,102.561394,-2078.0,794.0,-73.0
4,2012,3,5,547.768799,309.442413,20.514805,56.491428,26.701145,-4.724403,2.027898,...,-1.397226,-0.392419,-1.014774,-1.382387,-2468.0,-513.0,102.561394,-2078.0,794.0,-73.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3953,2022,12,27,1058.162842,602.649963,30.013388,56.952477,35.256363,-2.863783,-2.328694,...,2.393742,-0.434968,-1.326813,-1.857890,-266.0,811.0,46.509003,-695.0,-2296.0,1048.0
3954,2022,12,28,1057.969971,602.549805,29.992699,56.953396,35.274250,-2.881488,-2.344926,...,2.430742,-0.378968,-1.264513,-1.966890,-266.0,811.0,46.509003,-695.0,-2296.0,1048.0
3955,2022,12,29,1057.233032,602.409302,30.018431,56.979805,35.219463,-2.949137,-2.367698,...,2.364742,-0.380968,-1.338713,-1.998890,-266.0,811.0,46.509003,-695.0,-2296.0,1048.0
3956,2022,12,30,1056.862427,602.243958,30.013784,56.984138,35.212566,-2.983157,-2.394495,...,2.423742,-0.375968,-1.342013,-1.921890,-266.0,811.0,46.509003,-695.0,-2296.0,1048.0
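
The merge above is an inner join, so it keeps only the dates present in both frames; this is why the row count falls from 4078 to 3958. Below is a minimal sketch of how `indicator=True` can show which rows an inner join would drop. The frames `left` and `right` and all of their values are made-up toy data, not project data.

```python
import pandas as pd

# Toy stand-ins for df_area_micro and df_economic (values are made up).
left = pd.DataFrame({
    "year": [2022, 2022, 2023],
    "month": [12, 12, 1],
    "day": [30, 31, 1],
    "area_deal": [1056.9, 1056.5, 1055.0],
})
right = pd.DataFrame({
    "year": [2022, 2022],
    "month": [12, 12],
    "day": [30, 31],
    "kospi_index": [2236.4, 2236.4],
})

keys = ["year", "month", "day"]

# An outer join with indicator=True labels each row by where it was found.
check = pd.merge(left, right, on=keys, how="outer", indicator=True)
dropped = check[check["_merge"] != "both"]

# The default inner join keeps only rows found in both frames.
merged = pd.merge(left, right, on=keys)
print(len(merged), "rows kept;", len(dropped), "rows dropped")
```

Inspecting `dropped` before merging makes it easy to confirm that only the expected date range is lost.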


In [None]:
df_original_dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3958 entries, 0 to 3957
Data columns (total 68 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   year                                          3958 non-null   int64  
 1   month                                         3958 non-null   int64  
 2   day                                           3958 non-null   int64  
 3   area_deal                                     3958 non-null   float32
 4   area_full_rent                                3958 non-null   float32
 5   area_year_rent                                3958 non-null   float32
 6   deal_full_rent_rate                           3958 non-null   float32
 7   deal_year_rent_multiple                       3958 non-null   float32
 8   6m_before_area_deal_mean                      3958 non-null   float32
 9   6m_before_area_full_rent_mean                 3958 non-null   f

In [None]:
# Convert the date column to datetime
df_original_dataset['date'] = pd.to_datetime(df_original_dataset['date'])
df_original_dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3958 entries, 0 to 3957
Data columns (total 68 columns):
 #   Column                                        Non-Null Count  Dtype         
---  ------                                        --------------  -----         
 0   year                                          3958 non-null   int64         
 1   month                                         3958 non-null   int64         
 2   day                                           3958 non-null   int64         
 3   area_deal                                     3958 non-null   float32       
 4   area_full_rent                                3958 non-null   float32       
 5   area_year_rent                                3958 non-null   float32       
 6   deal_full_rent_rate                           3958 non-null   float32       
 7   deal_year_rent_multiple                       3958 non-null   float32       
 8   6m_before_area_deal_mean                      3958 non-null   float3

In [None]:
df_original_dataset.to_pickle('/content/drive/MyDrive/house_price/after_data/df_original_dataset_without_future.pkl')

## Merging the price one year later

In [None]:
df_future = df_original_dataset[['date','area_deal']].copy()
df_future.head()

Unnamed: 0,date,area_deal
0,2012-03-01,548.170105
1,2012-03-02,547.971802
2,2012-03-03,547.847473
3,2012-03-04,548.025452
4,2012-03-05,547.768799


In [None]:
# Shift each date back by 365 days so that a price lines up with the row dated one year earlier
df_future['date'] = df_future['date'] - pd.Timedelta(days=365)
df_future.head()

Unnamed: 0,date,area_deal
0,2011-03-02,548.170105
1,2011-03-03,547.971802
2,2011-03-04,547.847473
3,2011-03-05,548.025452
4,2011-03-06,547.768799


In [None]:
df_future['year'] = df_future['date'].dt.year
df_future['month'] = df_future['date'].dt.month
df_future['day'] = df_future['date'].dt.day
df_future.rename(columns = {'area_deal' : 'future_area_deal'}, inplace = True)
df_future.drop('date',axis=1,inplace=True)

df_future.head()

Unnamed: 0,future_area_deal,year,month,day
0,548.170105,2011,3,2
1,547.971802,2011,3,3
2,547.847473,2011,3,4
3,548.025452,2011,3,5
4,547.768799,2011,3,6


In [None]:
# Merge the DataFrames
df_original_dataset = pd.merge(df_original_dataset,df_future, on = ['year','month','day'])
df_original_dataset

Unnamed: 0,year,month,day,area_deal,area_full_rent,area_year_rent,deal_full_rent_rate,deal_year_rent_multiple,6m_before_area_deal_mean,6m_before_area_full_rent_mean,...,korea_10-3_year_12m_before,us_10-2_year_12m_before,us_10-3_year_month_12m_before,last_month_total_apartment_supply_12m_before,last_month_total_unsold_count_12m_before,last_month_total_unsold_ratio_12m_before,last_month_total_deal_count_12m_before,last_month_total_full_rent_count_12m_before,last_month_total_month_rent_count_12m_before,future_area_deal
0,2012,3,1,548.170105,309.035980,20.537769,56.375927,26.690830,-4.654602,1.893891,...,-0.392419,-0.987774,-1.352387,-2468.0,-513.0,102.561394,-2078.0,794.0,-73.0,512.318481
1,2012,3,2,547.971802,308.738037,20.522972,56.341957,26.700411,-4.689094,1.795654,...,-0.407419,-1.023774,-1.395387,-2468.0,-513.0,102.561394,-2078.0,794.0,-73.0,512.909119
2,2012,3,3,547.847473,309.738129,20.533909,56.537292,26.680136,-4.710719,2.125400,...,-0.407419,-1.023374,-1.395487,-2468.0,-513.0,102.561394,-2078.0,794.0,-73.0,513.148926
3,2012,3,4,548.025452,309.982178,20.534513,56.563461,26.688017,-4.679762,2.205867,...,-0.407419,-1.023374,-1.395487,-2468.0,-513.0,102.561394,-2078.0,794.0,-73.0,513.173767
4,2012,3,5,547.768799,309.442413,20.514805,56.491428,26.701145,-4.724403,2.027898,...,-0.392419,-1.014774,-1.382387,-2468.0,-513.0,102.561394,-2078.0,794.0,-73.0,513.639587
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3588,2021,12,27,1083.472168,607.116211,28.812990,56.034313,37.603600,7.364027,4.543994,...,-0.272226,-0.024955,0.574126,789.0,2.0,-1.540453,-5084.0,1313.0,-278.0,1058.162842
3589,2021,12,28,1083.753906,606.659790,28.849360,55.977631,37.565960,7.391945,4.465400,...,-0.298226,-0.069055,0.576126,789.0,2.0,-1.540453,-5084.0,1313.0,-278.0,1057.969971
3590,2021,12,29,1083.566772,606.564270,28.839167,55.978481,37.572750,7.373402,4.448951,...,-0.311226,0.006845,0.658126,789.0,2.0,-1.540453,-5084.0,1313.0,-278.0,1057.233032
3591,2021,12,30,1083.580566,606.848999,28.829924,56.004047,37.585274,7.374769,4.497981,...,-0.262226,-0.020455,0.614126,789.0,2.0,-1.540453,-5084.0,1313.0,-278.0,1056.862427


In [None]:
# Add the future change rate column (percent change from the current price to the price one year later)
df_original_dataset['future_change_rate'] = 100*((df_original_dataset['future_area_deal'] - df_original_dataset['area_deal'])/df_original_dataset['area_deal'])
df_original_dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3593 entries, 0 to 3592
Data columns (total 70 columns):
 #   Column                                        Non-Null Count  Dtype         
---  ------                                        --------------  -----         
 0   year                                          3593 non-null   int64         
 1   month                                         3593 non-null   int64         
 2   day                                           3593 non-null   int64         
 3   area_deal                                     3593 non-null   float32       
 4   area_full_rent                                3593 non-null   float32       
 5   area_year_rent                                3593 non-null   float32       
 6   deal_full_rent_rate                           3593 non-null   float32       
 7   deal_year_rent_multiple                       3593 non-null   float32       
 8   6m_before_area_deal_mean                      3593 non-null   float3
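
The target construction in this section (shift the dates back, rename, merge, compute the percent change) can be condensed into one function. This is a minimal sketch on made-up toy data: `add_future_target` is a hypothetical helper, and it joins on the `date` column directly instead of on `year`/`month`/`day` as the cells above do.

```python
import pandas as pd

def add_future_target(df, price_col="area_deal", days=365):
    """Attach the price observed `days` later and the percent change to it."""
    future = df[["date", price_col]].copy()
    # Moving each date back by `days` lines a later price up with an earlier row.
    future["date"] = future["date"] - pd.Timedelta(days=days)
    future = future.rename(columns={price_col: "future_" + price_col})
    # Inner join: rows that have no price `days` later are dropped.
    out = df.merge(future, on="date")
    out["future_change_rate"] = (
        100 * (out["future_" + price_col] - out[price_col]) / out[price_col]
    )
    return out

# Toy daily series (made-up prices): 400 days, rising by 1 per day from 100.
toy = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=400, freq="D"),
    "area_deal": [100.0 + i for i in range(400)],
})
result = add_future_target(toy)
print(result.head())
```

Because the shift is a fixed 365 days, a span that contains February 29 ends one calendar day short of the same date a year later; in the output above, the price of 2012-03-01 is attached to the row dated 2011-03-02.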

## Saving the file

In [None]:
df_original_dataset.to_pickle('/content/drive/MyDrive/house_price/after_data/df_original_dataset.pkl')

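# Machine learning

- Build models that predict the overall trend of Seoul apartment prices using several regression models

## Creating df_train_test

- df_train_test is the DataFrame that keeps only the features in df_original_dataset that are strongly correlated with future_area_deal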

In [3]:
import pandas as pd

df_original_dataset = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/df_original_dataset.pkl')
df_original_dataset

Unnamed: 0,year,month,day,area_deal,area_full_rent,area_year_rent,deal_full_rent_rate,deal_year_rent_multiple,6m_before_area_deal_mean,6m_before_area_full_rent_mean,...,us_10-2_year_12m_before,us_10-3_year_month_12m_before,last_month_total_apartment_supply_12m_before,last_month_total_unsold_count_12m_before,last_month_total_unsold_ratio_12m_before,last_month_total_deal_count_12m_before,last_month_total_full_rent_count_12m_before,last_month_total_month_rent_count_12m_before,future_area_deal,future_change_rate
0,2012,3,1,548.170105,309.035980,20.537769,56.375927,26.690830,-4.654602,1.893891,...,-0.987774,-1.352387,-2468.0,-513.0,102.561394,-2078.0,794.0,-73.0,512.318481,-6.540236
1,2012,3,2,547.971802,308.738037,20.522972,56.341957,26.700411,-4.689094,1.795654,...,-1.023774,-1.395387,-2468.0,-513.0,102.561394,-2078.0,794.0,-73.0,512.909119,-6.398629
2,2012,3,3,547.847473,309.738129,20.533909,56.537292,26.680136,-4.710719,2.125400,...,-1.023374,-1.395487,-2468.0,-513.0,102.561394,-2078.0,794.0,-73.0,513.148926,-6.333615
3,2012,3,4,548.025452,309.982178,20.534513,56.563461,26.688017,-4.679762,2.205867,...,-1.023374,-1.395487,-2468.0,-513.0,102.561394,-2078.0,794.0,-73.0,513.173767,-6.359501
4,2012,3,5,547.768799,309.442413,20.514805,56.491428,26.701145,-4.724403,2.027898,...,-1.014774,-1.382387,-2468.0,-513.0,102.561394,-2078.0,794.0,-73.0,513.639587,-6.230587
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3588,2021,12,27,1083.472168,607.116211,28.812990,56.034313,37.603600,7.364027,4.543994,...,-0.024955,0.574126,789.0,2.0,-1.540453,-5084.0,1313.0,-278.0,1058.162842,-2.335946
3589,2021,12,28,1083.753906,606.659790,28.849360,55.977631,37.565960,7.391945,4.465400,...,-0.069055,0.576126,789.0,2.0,-1.540453,-5084.0,1313.0,-278.0,1057.969971,-2.379132
3590,2021,12,29,1083.566772,606.564270,28.839167,55.978481,37.572750,7.373402,4.448951,...,0.006845,0.658126,789.0,2.0,-1.540453,-5084.0,1313.0,-278.0,1057.233032,-2.430283
3591,2021,12,30,1083.580566,606.848999,28.829924,56.004047,37.585274,7.374769,4.497981,...,-0.020455,0.614126,789.0,2.0,-1.540453,-5084.0,1313.0,-278.0,1056.862427,-2.465727


In [4]:
pd.set_option('display.max_rows', 70)

df_original_dataset.corr()['future_area_deal'].sort_values(ascending=False).to_frame()


Unnamed: 0,future_area_deal
future_area_deal,1.0
area_deal,0.978133
deal_year_rent_multiple,0.967116
area_full_rent,0.96146
year,0.957828
area_year_rent,0.93252
apartment_index,0.863615
12m_before_area_deal_mean,0.840139
6m_before_area_deal_mean,0.79336
12m_before_deal_year_rent_multiple,0.74197


In [5]:
# Build a Series holding each column's correlation with future_area_deal
df_corr = df_original_dataset.corr(numeric_only=False)['future_area_deal']
df_corr.head()

year              0.957828
month             0.061658
day               0.007000
area_deal         0.978133
area_full_rent    0.961460
Name: future_area_deal, dtype: float64

In [6]:
df_corr.info()

<class 'pandas.core.series.Series'>
Index: 70 entries, year to future_change_rate
Series name: future_area_deal
Non-Null Count  Dtype  
--------------  -----  
70 non-null     float64
dtypes: float64(1)
memory usage: 3.1+ KB


In [7]:
# Rename the Series
df_corr.name = 'correlation'
df_corr.info()

<class 'pandas.core.series.Series'>
Index: 70 entries, year to future_change_rate
Series name: correlation
Non-Null Count  Dtype  
--------------  -----  
70 non-null     float64
dtypes: float64(1)
memory usage: 3.1+ KB


In [8]:
# Keep only columns whose correlation coefficient is at least 0.7 or at most -0.7 (strong positive or negative correlation)
learning_feature_list = list(df_corr[(df_corr >= 0.7) | (df_corr <= -0.7)].index)
learning_feature_list

['year',
 'area_deal',
 'area_full_rent',
 'area_year_rent',
 'deal_year_rent_multiple',
 '6m_before_area_deal_mean',
 '6m_before_deal_full_rent_rate',
 '12m_before_area_deal_mean',
 '12m_before_deal_full_rent_rate',
 '12m_before_deal_year_rent_multiple',
 'date',
 'apartment_index',
 'kospi_index',
 'korea_rp',
 'korea_3_year',
 'last_month_total_month_rent_count',
 'future_area_deal']

In [9]:
# year, apartment_index, and korea_rp are not used as features, so remove them; date is kept here and dropped later when the training columns are set
to_remove = ['year','apartment_index', 'korea_rp']
for x in to_remove:
    learning_feature_list.remove(x)
learning_feature_list

['area_deal',
 'area_full_rent',
 'area_year_rent',
 'deal_year_rent_multiple',
 '6m_before_area_deal_mean',
 '6m_before_deal_full_rent_rate',
 '12m_before_area_deal_mean',
 '12m_before_deal_full_rent_rate',
 '12m_before_deal_year_rent_multiple',
 'date',
 'kospi_index',
 'korea_3_year',
 'last_month_total_month_rent_count',
 'future_area_deal']

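- Both the macroeconomic indicators and the apartment valuation indicators are correlated with the future price, which confirms that numerically expressed indicators carry some signal for predicting it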
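
The threshold-based selection used above can be written as a small reusable function. This is a minimal sketch: `select_by_correlation` is a hypothetical helper, the toy data is made up, and unlike the notebook's list it excludes the target column itself from the result.

```python
import numpy as np
import pandas as pd

def select_by_correlation(df, target, threshold=0.7):
    """Return columns whose |Pearson correlation| with `target` meets the threshold."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    # Taking the absolute value covers both >= threshold and <= -threshold.
    return corr[corr.abs() >= threshold].index.tolist()

# Toy data (made up): `up` and `down` track the target, `noise` does not.
rng = np.random.default_rng(0)
t = np.arange(200, dtype=float)
toy = pd.DataFrame({
    "up": t + rng.normal(0, 5, 200),
    "down": -t + rng.normal(0, 5, 200),
    "noise": rng.normal(0, 1, 200),
    "target": t,
})
print(select_by_correlation(toy, "target"))
```

Note that Pearson correlation only measures linear association, and several of the selected columns here (for example `area_deal` and `area_full_rent`) are also strongly correlated with each other.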


In [10]:
# Select only the columns to be used to obtain the training and test dataset
df_train_test = df_original_dataset[learning_feature_list]
df_train_test

Unnamed: 0,area_deal,area_full_rent,area_year_rent,deal_year_rent_multiple,6m_before_area_deal_mean,6m_before_deal_full_rent_rate,12m_before_area_deal_mean,12m_before_deal_full_rent_rate,12m_before_deal_year_rent_multiple,date,kospi_index,korea_3_year,last_month_total_month_rent_count,future_area_deal
0,548.170105,309.035980,20.537769,26.690830,-4.654602,6.868177,-14.525463,20.286734,-12.861209,2012-03-01,2030.250000,3.430,2638,512.318481
1,547.971802,308.738037,20.522972,26.700411,-4.689094,6.803782,-14.556384,20.214254,-12.829931,2012-03-02,2034.630005,3.485,2638,512.909119
2,547.847473,309.738129,20.533909,26.680136,-4.710719,7.174067,-14.575770,20.631031,-12.896124,2012-03-03,2034.630005,3.485,2638,513.148926
3,548.025452,309.982178,20.534513,26.688017,-4.679762,7.223673,-14.548018,20.686869,-12.870394,2012-03-04,2034.630005,3.485,2638,513.173767
4,547.768799,309.442413,20.514805,26.701145,-4.724403,7.087125,-14.588037,20.533175,-12.827534,2012-03-05,2016.060059,3.490,2638,513.639587
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3588,1083.472168,607.116211,28.812990,37.603600,7.364027,-2.626612,15.883393,-5.929564,5.545363,2021-12-27,2999.550049,1.776,6661,1058.162842
3589,1083.753906,606.659790,28.849360,37.565960,7.391945,-2.725111,15.913527,-6.024723,5.439716,2021-12-28,3020.239990,1.786,6661,1057.969971
3590,1083.566772,606.564270,28.839167,37.572750,7.373402,-2.723633,15.893512,-6.023295,5.458775,2021-12-29,2993.290039,1.783,6661,1057.233032
3591,1083.580566,606.848999,28.829924,37.585274,7.374769,-2.679206,15.894987,-5.980375,5.493926,2021-12-30,2977.649902,1.802,6661,1056.862427


In [None]:
df_train_test.to_pickle('/content/drive/MyDrive/house_price/after_data/df_train_test.pkl')

## Declaring the training and test datasets

In [None]:
import pandas as pd

df_train_test = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/df_train_test.pkl')
df_train_test.head()

Unnamed: 0,area_deal,area_full_rent,area_year_rent,deal_year_rent_multiple,6m_before_area_deal_mean,6m_before_deal_full_rent_rate,12m_before_area_deal_mean,12m_before_deal_full_rent_rate,12m_before_deal_year_rent_multiple,date,kospi_index,korea_3_year,last_month_total_month_rent_count,future_area_deal
0,548.170105,309.03598,20.537769,26.69083,-4.654602,6.868177,-14.525463,20.286734,-12.861209,2012-03-01,2030.25,3.43,2638,512.318481
1,547.971802,308.738037,20.522972,26.700411,-4.689094,6.803782,-14.556384,20.214254,-12.829931,2012-03-02,2034.630005,3.485,2638,512.909119
2,547.847473,309.738129,20.533909,26.680136,-4.710719,7.174067,-14.57577,20.631031,-12.896124,2012-03-03,2034.630005,3.485,2638,513.148926
3,548.025452,309.982178,20.534513,26.688017,-4.679762,7.223673,-14.548018,20.686869,-12.870394,2012-03-04,2034.630005,3.485,2638,513.173767
4,547.768799,309.442413,20.514805,26.701145,-4.724403,7.087125,-14.588037,20.533175,-12.827534,2012-03-05,2016.060059,3.49,2638,513.639587


In [11]:
df_train_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3593 entries, 0 to 3592
Data columns (total 14 columns):
 #   Column                              Non-Null Count  Dtype         
---  ------                              --------------  -----         
 0   area_deal                           3593 non-null   float32       
 1   area_full_rent                      3593 non-null   float32       
 2   area_year_rent                      3593 non-null   float32       
 3   deal_year_rent_multiple             3593 non-null   float32       
 4   6m_before_area_deal_mean            3593 non-null   float32       
 5   6m_before_deal_full_rent_rate       3593 non-null   float32       
 6   12m_before_area_deal_mean           3593 non-null   float32       
 7   12m_before_deal_full_rent_rate      3593 non-null   float32       
 8   12m_before_deal_year_rent_multiple  3593 non-null   float32       
 9   date                                3593 non-null   datetime64[ns]
 10  kospi_index             

In [12]:
# Set the features used as model inputs during training
train_columns = list(df_train_test.columns)

to_remove = ['future_area_deal','date']
for x in to_remove:
    train_columns.remove(x)
train_columns


['area_deal',
 'area_full_rent',
 'area_year_rent',
 'deal_year_rent_multiple',
 '6m_before_area_deal_mean',
 '6m_before_deal_full_rent_rate',
 '12m_before_area_deal_mean',
 '12m_before_deal_full_rent_rate',
 '12m_before_deal_year_rent_multiple',
 'kospi_index',
 'korea_3_year',
 'last_month_total_month_rent_count']

In [13]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df_train_test[train_columns], df_train_test['future_area_deal'], test_size=0.3, random_state=42)


In [14]:
# Without sorting by index, the line plots used later to inspect model performance do not come out as intended
X_test_sorted = X_test.sort_index()
y_test_sorted = y_test.sort_index()
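
Because consecutive days have nearly identical prices, a random split places close neighbours of each test row in the training set, so the test error may understate the error on a genuinely unseen period. As a hedged alternative, which is not what this notebook does, the sketch below holds out the most recent 30% of rows instead. `chronological_split` is a hypothetical helper and the toy frame is made up.

```python
import pandas as pd

def chronological_split(df, feature_cols, target_col, test_size=0.3):
    """Split a frame by date so the test set is the most recent slice."""
    df = df.sort_values("date")
    n_test = int(round(len(df) * test_size))
    cut = len(df) - n_test
    train, test = df.iloc[:cut], df.iloc[cut:]
    return train[feature_cols], test[feature_cols], train[target_col], test[target_col]

# Toy frame (made-up values) with the same column roles as df_train_test.
toy = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=10, freq="D"),
    "area_deal": [float(i) for i in range(10)],
    "future_area_deal": [float(i + 1) for i in range(10)],
})
X_tr, X_te, y_tr, y_te = chronological_split(toy, ["area_deal"], "future_area_deal")
print(len(X_tr), len(X_te))
```

Scores from such a split are usually worse than those from a shuffled split, but they are closer to the situation of forecasting a year that the model has not seen.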

In [16]:
X_test

Unnamed: 0,area_deal,area_full_rent,area_year_rent,deal_year_rent_multiple,6m_before_area_deal_mean,6m_before_deal_full_rent_rate,12m_before_area_deal_mean,12m_before_deal_full_rent_rate,12m_before_deal_year_rent_multiple,kospi_index,korea_3_year,last_month_total_month_rent_count
2833,816.799133,496.585693,24.072159,33.931278,8.640079,-4.032959,7.834159,-3.953656,7.046749,2084.070068,1.462,4461
315,512.632385,314.482727,20.332415,25.212568,-4.162067,7.302878,-8.081483,10.517179,-7.241796,2006.800049,2.720,2448
3537,1080.277832,608.909912,28.460196,37.957500,8.146161,-2.012936,17.017311,-4.986022,6.290092,2969.270020,1.960,8418
3540,1080.603394,608.821228,28.500612,37.915092,8.178753,-2.056711,17.052576,-5.028468,6.171340,2962.459961,1.865,8418
439,513.715454,319.800659,20.334183,25.263639,-1.292054,4.549354,-4.861605,9.368598,-4.444160,1968.829956,2.560,2696
...,...,...,...,...,...,...,...,...,...,...,...,...
435,514.154541,320.140472,20.336897,25.281858,-1.207686,4.571066,-4.780288,9.391310,-4.375250,1944.750000,2.555,2696
365,512.318481,316.435547,20.360691,25.162136,-2.896181,6.564540,-6.990144,9.613562,-6.166691,2026.489990,2.625,3226
1566,570.298706,420.893433,23.141182,24.644321,2.126533,1.290554,5.878519,3.621788,0.582317,1972.030029,1.345,3959
276,516.272583,311.231934,20.257851,25.485062,-4.061530,5.972913,-8.425145,9.602610,-7.269067,1932.900024,2.840,2486


In [17]:
X_test_sorted

Unnamed: 0,area_deal,area_full_rent,area_year_rent,deal_year_rent_multiple,6m_before_area_deal_mean,6m_before_deal_full_rent_rate,12m_before_area_deal_mean,12m_before_deal_full_rent_rate,12m_before_deal_year_rent_multiple,kospi_index,korea_3_year,last_month_total_month_rent_count
0,548.170105,309.035980,20.537769,26.690830,-4.654602,6.868177,-14.525463,20.286734,-12.861209,2030.250000,3.430,2638
7,546.781677,309.040894,20.527166,26.636978,-4.896097,7.141244,-14.741957,20.594090,-13.037023,2000.760010,3.480,2638
12,545.325134,308.750000,20.508192,26.590601,-5.149439,7.326300,-14.969071,20.802380,-13.188432,2025.040039,3.475,2638
14,545.194580,308.387939,20.531059,26.554625,-5.172147,7.226110,-14.989429,20.689611,-13.305886,2043.760010,3.565,2638
17,544.595398,309.191559,20.509285,26.553602,-5.276365,7.623809,-15.082857,21.137245,-13.309224,2034.439941,3.580,2638
...,...,...,...,...,...,...,...,...,...,...,...,...
3581,1082.718262,607.646851,28.764551,37.640713,7.289321,-2.473635,15.802759,-5.781777,5.649532,2963.000000,1.741,6661
3585,1083.359253,606.215698,28.791737,37.627438,7.352838,-2.760908,15.871316,-6.059305,5.612271,3012.429932,1.802,6661
3586,1083.347656,606.836914,28.802065,37.613541,7.351689,-2.660220,15.870076,-5.962033,5.573266,3012.429932,1.795,6661
3587,1083.281128,607.223083,28.803324,37.609589,7.345097,-2.592293,15.862960,-5.896410,5.562173,3012.429932,1.795,6661


## Applying the models

### Linear regression model

In [18]:
# Importing required libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import plotly.graph_objs as go

# Creating a Linear Regression model
model = LinearRegression()

# Training the model on the training set
model.fit(X_train, y_train)


# Making predictions on the testing set
y_pred = model.predict(X_test_sorted)


# Evaluating the model using Mean Squared Error (MSE)
mse = mean_squared_error(y_test_sorted , y_pred)
print('LinearRegression Mean Squared Error:', mse)
print('LinearRegression Root Mean Squared Error:', np.sqrt(mse))
print()

# Creating the traces
trace1 = go.Scatter(
    x = y_test_sorted.index,
    y = y_test_sorted.values,
    mode = 'lines',
    name = 'actual_value'
)


trace2 = go.Scatter(
    x = y_test_sorted.index,
    y = list(y_pred),
    mode = 'lines',
    name = 'predict_value'
)

# Combining the traces and creating the layout
data = [trace1, trace2]
layout = go.Layout(title='Linear Regression Predict future price for test dataset', xaxis=dict(title='index'), yaxis=dict(title='future price per pyeong'))

# Creating the figure and plotting it
fig = go.Figure(data=data, layout=layout)
fig.show()


LinearRegression Mean Squared Error: 188.0048911043631
LinearRegression Root Mean Squared Error: 13.71148755986611



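- There is some error (RMSE of about 13.7 in the target's units), but the predictions follow the same overall trend as the actual values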
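
The model cells in this section all repeat the same fit, predict, and score steps. A minimal sketch of a helper that factors those steps out is shown here; `fit_and_score` is a hypothetical name, and the toy data is made up so that the expected error is essentially zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def fit_and_score(model, X_train, y_train, X_test, y_test):
    """Fit `model` on the training set and return (mse, rmse) on the test set."""
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    return mse, float(np.sqrt(mse))

# Toy data (made up): y is an exact linear function of two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0

mse, rmse = fit_and_score(LinearRegression(), X[:70], y[:70], X[70:], y[70:])
print(f"MSE={mse:.3g}, RMSE={rmse:.3g}")
```

With the notebook's variables, the call would take `X_train`, `y_train`, `X_test_sorted`, and `y_test_sorted`, and any estimator with `fit` and `predict` can be passed as `model`.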

### Polynomial regression models

#### Degree 2

In [19]:
# Importing required libraries
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import plotly.graph_objs as go


# Creating polynomial features
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_sorted_poly = poly.transform(X_test_sorted)

# Creating a Polynomial Regression model
model = LinearRegression()

# Training the model on the training set
model.fit(X_train_poly, y_train)

# Making predictions on the testing set
y_pred = model.predict(X_test_sorted_poly)

# Evaluating the model using Mean Squared Error (MSE)
mse = mean_squared_error(y_test_sorted, y_pred)
print('2 PolynomialFeatures Mean Squared Error:', mse)
print('2 PolynomialFeatures Root Mean Squared Error:', np.sqrt(mse))
print()

# Creating the traces
trace1 = go.Scatter(
    x = y_test_sorted.index,
    y = y_test_sorted.values,
    mode = 'lines',
    name = 'actual_value'
)


trace2 = go.Scatter(
    x = y_test_sorted.index,
    y = list(y_pred),
    mode = 'lines',
    name = 'predict_value'
)

# Combining the traces and creating the layout
data = [trace1, trace2]
layout = go.Layout(title='Linear Regression Poly2 Predict future price for test dataset', xaxis=dict(title='index'), yaxis=dict(title='future price per pyeong'))

# Creating the figure and plotting it
fig = go.Figure(data=data, layout=layout)
fig.show()



2 PolynomialFeatures Mean Squared Error: 9.01157800924518
2 PolynomialFeatures Root Mean Squared Error: 3.0019290480031637



#### Degree 3

In [20]:
# Importing required libraries
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import plotly.graph_objs as go


# Creating polynomial features
poly = PolynomialFeatures(degree=3)
X_train_poly = poly.fit_transform(X_train)
X_test_sorted_poly = poly.transform(X_test_sorted)

# Creating a Polynomial Regression model
model = LinearRegression()

# Training the model on the training set
model.fit(X_train_poly, y_train)

# Making predictions on the testing set
y_pred = model.predict(X_test_sorted_poly)

# Evaluating the model using Mean Squared Error (MSE)
mse = mean_squared_error(y_test_sorted, y_pred)
print('3 PolynomialFeatures Mean Squared Error:', mse)
print('3 PolynomialFeaturesRoot Mean Squared Error:', np.sqrt(mse))
print()

# Creating the traces
trace1 = go.Scatter(
    x = y_test_sorted.index,
    y = y_test_sorted.values,
    mode = 'lines',
    name = 'actual_value'
)


trace2 = go.Scatter(
    x = y_test_sorted.index,
    y = list(y_pred),
    mode = 'lines',
    name = 'predict_value'
)

# Combining the traces and creating the layout
data = [trace1, trace2]
layout = go.Layout(title='Linear Regression Poly3 Predict future price for test dataset', xaxis=dict(title='index'), yaxis=dict(title='future price per pyeong'))

# Creating the figure and plotting it
fig = go.Figure(data=data, layout=layout)
fig.show()


3 PolynomialFeatures Mean Squared Error: 1.415645335579257
3 PolynomialFeaturesRoot Mean Squared Error: 1.1898089491927926



#### Degree 4

In [21]:
# Importing required libraries
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import plotly.graph_objs as go



# Creating polynomial features
poly = PolynomialFeatures(degree=4)
X_train_poly = poly.fit_transform(X_train)
X_test_sorted_poly = poly.transform(X_test_sorted)

# Creating a Polynomial Regression model
model = LinearRegression()

# Training the model on the training set
model.fit(X_train_poly, y_train)

# Making predictions on the testing set
y_pred = model.predict(X_test_sorted_poly)

# Evaluating the model using Mean Squared Error (MSE)
mse = mean_squared_error(y_test_sorted, y_pred)
print('4 PolynomialFeatures Mean Squared Error:', mse)
print('4 PolynomialFeaturesRoot Mean Squared Error:', np.sqrt(mse))

print()

# Creating the traces
trace1 = go.Scatter(
    x = y_test_sorted.index,
    y = y_test_sorted.values,
    mode = 'lines',
    name = 'actual_value'
)


trace2 = go.Scatter(
    x = y_test_sorted.index,
    y = list(y_pred),
    mode = 'lines',
    name = 'predict_value'
)

# Combining the traces and creating the layout
data = [trace1, trace2]
layout = go.Layout(title='Linear Regression Poly4 Predict future price for test dataset', xaxis=dict(title='index'), yaxis=dict(title='future price per pyeong'))

# Creating the figure and plotting it
fig = go.Figure(data=data, layout=layout)
fig.show()


4 PolynomialFeatures Mean Squared Error: 1.3479647570803281
4 PolynomialFeaturesRoot Mean Squared Error: 1.1610188444122378
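
The number of terms that `PolynomialFeatures` generates grows quickly with the degree: for n input features and degree d, with the constant term included, it is C(n + d, d). The sketch below counts the terms for the 12 input features used here; `n_poly_terms` is a hypothetical helper written only for this illustration.

```python
from math import comb

def n_poly_terms(n_features, degree):
    """Count polynomial terms up to `degree`, including the constant term."""
    return comb(n_features + degree, degree)

for degree in (2, 3, 4):
    print(degree, n_poly_terms(12, degree))
```

At degree 4 the model has 1820 terms against roughly 2,500 training rows, so overfitting is a plausible risk, and the small gain from degree 3 to degree 4 (RMSE 1.19 to 1.16) suggests little is left to win from raising the degree further.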



### Gradient Boosting model

In [22]:
# Importing required libraries
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import plotly.graph_objs as go


# Creating a Gradient Boosting model
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0)

# Training the model on the training set
model.fit(X_train, y_train)

# Making predictions on the testing set
y_pred = model.predict(X_test_sorted)

# Evaluating the model using Mean Squared Error (MSE)
mse = mean_squared_error(y_test_sorted , y_pred)
print('GradientBoostingRegressor Mean Squared Error:', mse)
print('GradientBoostingRegressor Root Mean Squared Error:', np.sqrt(mse))
print()

final_pred = model.predict(df_train_test[train_columns])
final_pred = final_pred.tolist()


# Creating the traces
trace1 = go.Scatter(
    x = y_test_sorted.index,
    y = y_test_sorted.values,
    mode = 'lines',
    name = 'actual_value'
)


trace2 = go.Scatter(
    x = y_test_sorted.index,
    y = list(y_pred),
    mode = 'lines',
    name = 'predict_value'
)

# Combining the traces and creating the layout
data = [trace1, trace2]
layout = go.Layout(title='Gradient Boosting Predict future price for test dataset', xaxis=dict(title='index'), yaxis=dict(title='future price per pyeong'))

# Creating the figure and plotting it
fig = go.Figure(data=data, layout=layout)
fig.show()


GradientBoostingRegressor Mean Squared Error: 3.010988906888781
GradientBoostingRegressor Root Mean Squared Error: 1.7352201321125746



### XGBoost model

In [23]:
# Importing required libraries
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import plotly.graph_objs as go


# Creating an XGBoost model
model = xgb.XGBRegressor(objective ='reg:squarederror', n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0)

# Training the model on the training set
model.fit(X_train, y_train)


# Making predictions on the testing set
y_pred = model.predict(X_test_sorted)


# Evaluating the model using Mean Squared Error (MSE)
mse = mean_squared_error(y_test_sorted , y_pred)
print('XGBRegressor Mean Squared Error:', mse)
print('XGBRegressor Root Mean Squared Error:', np.sqrt(mse))


# Creating the traces
trace1 = go.Scatter(
    x = y_test_sorted.index,
    y = y_test_sorted.values,
    mode = 'lines',
    name = 'actual_value'
)


trace2 = go.Scatter(
    x = y_test_sorted.index,
    y = list(y_pred),
    mode = 'lines',
    name = 'predict_value'
)

# Combining the traces and creating the layout
data = [trace1, trace2]
layout = go.Layout(title='XGBoost Predict future price for test dataset', xaxis=dict(title='index'), yaxis=dict(title='평당가격'))

# Creating the figure and plotting it
fig = go.Figure(data=data, layout=layout)
fig.show()


XGBRegressor Mean Squared Error: 3.4640126
XGBRegressor Root Mean Squared Error: 1.8611858


### RandomForest Regressor model



In [None]:
# Importing required libraries
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import plotly.graph_objs as go


# Creating a Random Forest Regressor model
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Training the model on the training set
model.fit(X_train, y_train)

# Making predictions on the testing set
y_pred = model.predict(X_test_sorted)

# Evaluating the model using Mean Squared Error (MSE)
mse = mean_squared_error(y_test_sorted , y_pred)
print('RandomForestRegressor Mean Squared Error:', mse)
print('RandomForestRegressor Root Mean Squared Error:', np.sqrt(mse))
print()



# Creating the traces
trace1 = go.Scatter(
    x = y_test_sorted.index,
    y = y_test_sorted.values,
    mode = 'lines',
    name = 'actual_value'
)


trace2 = go.Scatter(
    x = y_test_sorted.index,
    y = list(y_pred),
    mode = 'lines',
    name = 'predict_value'
)


# Combining the traces and creating the layout
data = [trace1, trace2]
layout = go.Layout(title='RandomForest Regressor Predict future price for test dataset', xaxis=dict(title='Date'), yaxis=dict(title='평당가격'))

# Creating the figure and plotting it
fig = go.Figure(data=data, layout=layout)
fig.show()


Mean Squared Error: 0.6788606847170302
Root Mean Squared Error: 0.8239300241628716



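>> The RMSE is larger than the MSE here. This is expected rather than an error: the MSE (0.679) is below 1, and the square root of any value between 0 and 1 is larger than the value itself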
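The relationship between MSE and RMSE noted above can be checked directly. The following is a minimal sketch that reuses the three MSE values printed in the cell outputs above; the dictionary name `mse_values` is only an illustrative choice.

```python
import numpy as np

# MSE values copied from the cell outputs above.
mse_values = {
    "GradientBoostingRegressor": 3.010988906888781,
    "XGBRegressor": 3.4640126,
    "RandomForestRegressor": 0.6788606847170302,
}

for name, mse in mse_values.items():
    rmse = np.sqrt(mse)
    # sqrt(x) > x exactly when 0 < x < 1, so only an MSE below 1 gives RMSE > MSE.
    relation = "RMSE > MSE" if rmse > mse else "RMSE <= MSE"
    print(f"{name}: MSE={mse:.4f}, RMSE={rmse:.4f} ({relation})")
```

Only the RandomForest result has an MSE below 1, which is why it is the only model whose RMSE exceeds its MSE.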

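## Practicality test

- Training and prediction on the train and test data above produced predictions that track the actual values suspiciously closely, which raises doubts about the measured performance
- So performance is re-evaluated on 2023 data, which was used for neither training nor prediction
- The actual 2023 values are compared with the 2023 values predicted from the 2022 data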
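The comparison described above can also be expressed numerically rather than only as a chart. The sketch below is an assumption-laden illustration: `pred` and `actual` are small synthetic stand-ins for `df_final_pred_dataset` and `df_final_actual`, and the price values are made up. It joins predictions to actual prices on the calendar date and computes an RMSE over the days where both exist.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 2022 feature rows and their one-year-ahead predictions.
feature_dates = pd.date_range("2022-01-01", "2022-12-31", freq="D")
pred = pd.DataFrame({
    "future_date": feature_dates + pd.Timedelta(days=365),
    "predict_price": np.linspace(1055.0, 1020.0, len(feature_dates)),
})

# Hypothetical stand-in for the actual 2023 prices (available through April only).
actual_dates = pd.date_range("2023-01-01", "2023-04-30", freq="D")
actual = pd.DataFrame({
    "date": actual_dates,
    "area_deal": np.linspace(1056.0, 1024.0, len(actual_dates)),
})

# Inner join on the date so only days with both a prediction and an actual value are compared.
compared = actual.merge(pred, left_on="date", right_on="future_date", how="inner")

rmse = np.sqrt(np.mean((compared["area_deal"] - compared["predict_price"]) ** 2))
print(len(compared), "overlapping days, RMSE:", rmse)
```

Joining on the date avoids relying on both frames being in the same row order, which positional slicing would require.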

### Loading the dataset for the practicality test

- Load the dataset and keep the 2022 rows (the code below filters on `year > 2021`)

In [25]:
import pandas as pd
df_final_pred_dataset = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/df_original_dataset_without_future.pkl')
# df_final_pred_dataset = df_final_pred_dataset.loc[~((df_final_pred_dataset['year']==2022)&(df_final_pred_dataset['month']>4)),:]
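df_final_pred_dataset = df_final_pred_dataset.loc[df_final_pred_dataset['year']>2021,:].copy() # keep only rows after 2021 (i.e. 2022); .copy() avoids SettingWithCopyWarning on the next line
df_final_pred_dataset['future_date'] = df_final_pred_dataset['date'] + pd.Timedelta(days=365) # add the date one year later as a column (used as the x-axis when plotting)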
df_final_pred_dataset

Unnamed: 0,year,month,day,area_deal,area_full_rent,area_year_rent,deal_full_rent_rate,deal_year_rent_multiple,6m_before_area_deal_mean,6m_before_area_full_rent_mean,...,korea_10-3_year_12m_before,us_10-2_year_12m_before,us_10-3_year_month_12m_before,last_month_total_apartment_supply_12m_before,last_month_total_unsold_count_12m_before,last_month_total_unsold_ratio_12m_before,last_month_total_deal_count_12m_before,last_month_total_full_rent_count_12m_before,last_month_total_month_rent_count_12m_before,future_date
3593,2022,1,1,1083.536621,606.893738,28.842766,56.010448,37.567017,5.955772,3.132203,...,-0.298355,-0.152739,0.478029,-3602.0,5.0,3.115771,-6571.0,1795.0,3859.0,2023-01-01
3594,2022,1,2,1083.647461,606.853333,28.841305,56.000988,37.572762,5.966611,3.125337,...,-0.361355,-0.152739,0.475229,-3602.0,5.0,3.115771,-6571.0,1795.0,3859.0,2023-01-02
3595,2022,1,3,1083.479614,607.364014,28.852951,56.056805,37.551777,5.950198,3.212119,...,-0.291355,-0.069439,0.593029,-3602.0,5.0,3.115771,-6571.0,1795.0,3859.0,2023-01-03
3596,2022,1,4,1083.489624,606.642090,28.878763,55.989655,37.518562,5.951177,3.089440,...,-0.309355,-0.041539,0.582029,-3602.0,5.0,3.115771,-6571.0,1795.0,3859.0,2023-01-04
3597,2022,1,5,1083.651611,607.107971,28.830170,56.024277,37.587418,5.967017,3.168609,...,-0.296355,-0.060239,0.628029,-3602.0,5.0,3.115771,-6571.0,1795.0,3859.0,2023-01-05
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3953,2022,12,27,1058.162842,602.649963,30.013388,56.952477,35.256363,-2.863783,-2.328694,...,-0.434968,-1.326813,-1.857890,-266.0,811.0,46.509003,-695.0,-2296.0,1048.0,2023-12-27
3954,2022,12,28,1057.969971,602.549805,29.992699,56.953396,35.274250,-2.881488,-2.344926,...,-0.378968,-1.264513,-1.966890,-266.0,811.0,46.509003,-695.0,-2296.0,1048.0,2023-12-28
3955,2022,12,29,1057.233032,602.409302,30.018431,56.979805,35.219463,-2.949137,-2.367698,...,-0.380968,-1.338713,-1.998890,-266.0,811.0,46.509003,-695.0,-2296.0,1048.0,2023-12-29
3956,2022,12,30,1056.862427,602.243958,30.013784,56.984138,35.212566,-2.983157,-2.394495,...,-0.375968,-1.342013,-1.921890,-266.0,811.0,46.509003,-695.0,-2296.0,1048.0,2023-12-30


In [26]:
# Used to get the actual daily price per pyeong (2023)
df_area_micro = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/df_area_micro.pkl')
df_final_actual =  df_area_micro[['year','month','day','area_deal']].copy()
df_final_actual['date'] = pd.to_datetime(df_final_actual[['year','month','day']])
df_final_actual = df_final_actual.loc[df_final_actual['year']==2023]
df_final_actual

Unnamed: 0,year,month,day,area_deal,date
4352,2023,1,1,1056.691528,2023-01-01
4353,2023,1,2,1056.478394,2023-01-02
4354,2023,1,3,1055.845093,2023-01-03
4355,2023,1,4,1055.708862,2023-01-04
4356,2023,1,5,1055.373779,2023-01-05
...,...,...,...,...,...
4467,2023,4,26,1024.179443,2023-04-26
4468,2023,4,27,1024.288940,2023-04-27
4469,2023,4,28,1024.102783,2023-04-28
4470,2023,4,29,1023.981995,2023-04-29


### Applying the RandomForest Regressor model

In [None]:
# Importing required libraries
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import plotly.graph_objs as go


# Creating a Random Forest Regressor model
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Training the model on the training set
model.fit(X_train, y_train)

# Predicting 2023 prices from the held-out 2022 rows (not the test set)
final_pred = model.predict(df_final_pred_dataset[train_columns])
final_pred = final_pred.tolist()



# Creating the traces
trace1 = go.Scatter(
    x = df_final_actual['date'],
    y = df_final_actual['area_deal'],
    mode = 'lines',
    name = 'actual_value'
)


trace2 = go.Scatter(
    x = df_final_pred_dataset['future_date'],
    y = final_pred,
    mode = 'lines',
    name = 'predict_value'
)

# Combining the traces and creating the layout
data = [trace1, trace2]
layout = go.Layout(title='Final RandomForest Regressor Predict future price for total dataset', xaxis=dict(title='Date'), yaxis=dict(title='평당가격'))

# Creating the figure and plotting it
fig = go.Figure(data=data, layout=layout)
fig.show()






- The 2023 predictions look quite off. One hypothesis is that the economic indicators in 2022 took values that had rarely appeared in the earlier data
- Or is it an overfitting problem?
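The hypothesis above can be checked by measuring how many hold-out feature values fall outside the range seen during training. This matters for tree-based models in particular, because their predictions are piecewise constant and cannot extrapolate beyond what they saw. The sketch below uses synthetic data; the column names `base_rate` and `unsold_count` and their ranges are hypothetical, not taken from the project's dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical training features (e.g. a low-rate period).
train = pd.DataFrame({
    "base_rate": rng.uniform(0.5, 1.75, size=500),
    "unsold_count": rng.uniform(0, 300, size=500),
})

# Hypothetical hold-out features (e.g. a year in which rates rose sharply).
holdout = pd.DataFrame({
    "base_rate": rng.uniform(1.25, 3.25, size=100),
    "unsold_count": rng.uniform(0, 300, size=100),
})

# Per-column training range.
lo, hi = train.min(), train.max()

# True where a hold-out value lies below the training minimum or above the training maximum.
outside = holdout.lt(lo, axis=1) | holdout.gt(hi, axis=1)

# Share of hold-out rows outside the training range, per feature.
share_outside = outside.mean()
print(share_outside)
```

In the real notebook the same check could be run with `X_train` and `df_final_pred_dataset[train_columns]` in place of the synthetic frames. A high share for key indicators would support the out-of-range hypothesis over plain overfitting.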

### Applying the XGBoost model

In [None]:
# Importing required libraries
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import plotly.graph_objs as go

# Creating an XGBoost model
model = xgb.XGBRegressor(objective ='reg:squarederror', n_estimators=70, learning_rate=0.1, max_depth=3, random_state=0)

# Training the model on the training set
model.fit(X_train, y_train)

# Predicting 2023 prices from the held-out 2022 rows (not the test set)
final_pred = model.predict(df_final_pred_dataset[train_columns])
final_pred = final_pred.tolist()



# Creating the traces
trace1 = go.Scatter(
    x = df_final_actual['date'],
    y = df_final_actual['area_deal'],
    mode = 'lines',
    name = 'actual_value'
)


trace2 = go.Scatter(
    x = df_final_pred_dataset['future_date'],
    y = final_pred,
    mode = 'lines',
    name = 'predict_value'
)

# Combining the traces and creating the layout
data = [trace1, trace2]
layout = go.Layout(title='Final XGBoost Regressor Predict future price for total dataset', xaxis=dict(title='Date'), yaxis=dict(title='평당가격'))

# Creating the figure and plotting it
fig = go.Figure(data=data, layout=layout)
fig.show()






- The 2023 predictions look quite off. One hypothesis is that the economic indicators in 2022 took values that had rarely appeared in the earlier data
- Or is it an overfitting problem?

### Applying the Gradient Boosting model

In [None]:
# Importing required libraries
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import plotly.graph_objs as go

# Creating a Gradient Boosting model
model = GradientBoostingRegressor(n_estimators=80, learning_rate=0.1, max_depth=3, random_state=0)

# Training the model on the training set
model.fit(X_train, y_train)

# Predicting 2023 prices from the held-out 2022 rows (not the test set)

final_pred = model.predict(df_final_pred_dataset[train_columns])
final_pred = final_pred.tolist()



# Creating the traces
trace1 = go.Scatter(
    x = df_final_actual['date'],
    y = df_final_actual['area_deal'],
    mode = 'lines',
    name = 'actual_value'
)


trace2 = go.Scatter(
    x = df_final_pred_dataset['future_date'],
    y = final_pred,
    mode = 'lines',
    name = 'predict_value'
)

# Combining the traces and creating the layout
data = [trace1, trace2]
layout = go.Layout(title='Final Gradient Boosting Regressor Predict future price for total dataset', xaxis=dict(title='Date'), yaxis=dict(title='평당가격'))

# Creating the figure and plotting it
fig = go.Figure(data=data, layout=layout)
fig.show()






#### Adding a 20-day moving average

- A 20-day moving average is used to get a smoother view of the movement
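A simple moving average and an exponential moving average are easy to confuse, so the sketch below computes both with pandas on a synthetic price series. The series and variable names are illustrative assumptions. `sma_20_prev` averages the previous 20 days and excludes the current day, `sma_20` includes the current day, and `ema_20` weights recent days more heavily.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic daily price series (random walk around 1050).
prices = pd.Series(1050 + rng.normal(0, 1, size=60).cumsum())

# Simple moving average over a 20-day window that includes the current day.
sma_20 = prices.rolling(window=20).mean()

# Simple moving average of the previous 20 days only (current day excluded).
sma_20_prev = sma_20.shift(1)

# Exponential moving average with a 20-day span; recent days get larger weights.
ema_20 = prices.ewm(span=20, adjust=False).mean()

print(pd.DataFrame({"price": prices, "sma_20_prev": sma_20_prev, "ema_20": ema_20}).tail())
```

The shifted simple average has no value for the first 20 rows, while the exponential average is defined from the first row onward.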

In [None]:
import pandas as pd

df_with_average_move = pd.DataFrame({'date':df_final_pred_dataset['future_date'], 'predict_price':final_pred})
df_with_average_move.reset_index(inplace=True,drop=True)
print(df_with_average_move.info())
print()
print(len(df_with_average_move))
print()
df_with_average_move.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 365 entries, 0 to 364
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   date           365 non-null    datetime64[ns]
 1   predict_price  365 non-null    float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 5.8 KB
None

365



Unnamed: 0,date,predict_price
0,2023-01-01,1054.309771
1,2023-01-02,1054.309771
2,2023-01-03,1054.309771
3,2023-01-04,1054.309771
4,2023-01-05,1054.309771


In [None]:
# Creating the 20-day simple moving average (mean of the previous 20 days, excluding the current day)
df_with_average_move['20days_average_price'] = (
    df_with_average_move['predict_price'].rolling(window=20).mean().shift(1)
)

In [None]:
df_with_average_move = df_with_average_move.dropna(subset=['20days_average_price'])
df_with_average_move.head()

Unnamed: 0,date,predict_price,20days_average_price
20,2023-01-21,1053.862618,1053.996764
21,2023-01-22,1053.862618,1053.974406
22,2023-01-23,1053.862618,1053.952049
23,2023-01-24,1053.862618,1053.929691
24,2023-01-25,1053.862618,1053.907334


In [None]:
# Importing required libraries
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import plotly.graph_objs as go




# Creating the traces
trace1 = go.Scatter(
    x = df_final_actual['date'],
    y = df_final_actual['area_deal'],
    mode = 'lines',
    name = 'actual_value'
)


trace2 = go.Scatter(
    x = df_with_average_move['date'],
    y = df_with_average_move['predict_price'],
    mode = 'lines',
    name = 'predict_value'
)

trace3 = go.Scatter(
    x = df_with_average_move['date'],
    y = df_with_average_move['20days_average_price'],
    mode = 'lines',
    name = 'predict_value_20days_average'
)

# Combining the traces and creating the layout
data = [trace1, trace2, trace3]
layout = go.Layout(title='Final Gradient Boosting Regressor Predict future price for total dataset', xaxis=dict(title='Date'), yaxis=dict(title='평당가격'))

# Creating the figure and plotting it
fig = go.Figure(data=data, layout=layout)
fig.show()






- In the performance evaluation, the Gradient Boosting model scored worse than the RandomForest Regressor model, but when visualized, the Gradient Boosting model actually tracks the actual values more closely

### Applying the Linear Regression model

In [None]:
# Importing required libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import plotly.graph_objs as go
# Creating a Linear Regression model
model = LinearRegression()

# Training the model on the training set
model.fit(X_train, y_train)


# Predicting 2023 prices from the held-out 2022 rows (not the test set)

final_pred = model.predict(df_final_pred_dataset[train_columns])
final_pred = final_pred.tolist()



# Creating the traces
trace1 = go.Scatter(
    x = df_final_actual['date'],
    y = df_final_actual['area_deal'],
    mode = 'lines',
    name = 'actual_value'
)


trace2 = go.Scatter(
    x = df_final_pred_dataset['future_date'],
    y = final_pred,
    mode = 'lines',
    name = 'predict_value'
)

# Combining the traces and creating the layout
data = [trace1, trace2]
layout = go.Layout(title='Final Linear regressor Predict future price for total dataset', xaxis=dict(title='Date'), yaxis=dict(title='평당가격'))

# Creating the figure and plotting it
fig = go.Figure(data=data, layout=layout)
fig.show()






#### Adding a 20-day moving average

In [None]:
import pandas as pd

df_with_average_move = pd.DataFrame({'date':df_final_pred_dataset['future_date'], 'predict_price':final_pred})
df_with_average_move.reset_index(inplace=True,drop=True)

# Creating the 20-day simple moving average (mean of the previous 20 days, excluding the current day)
df_with_average_move['20days_average_price'] = (
    df_with_average_move['predict_price'].rolling(window=20).mean().shift(1)
)

df_with_average_move = df_with_average_move.dropna(subset=['20days_average_price'])

In [None]:
# Importing required libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import plotly.graph_objs as go





# Creating the traces
trace1 = go.Scatter(
    x = df_final_actual['date'],
    y = df_final_actual['area_deal'],
    mode = 'lines',
    name = 'actual_value'
)


trace2 = go.Scatter(
    x = df_with_average_move['date'],
    y = df_with_average_move['predict_price'],
    mode = 'lines',
    name = 'predict_value'
)

trace3 = go.Scatter(
    x = df_with_average_move['date'],
    y = df_with_average_move['20days_average_price'],
    mode = 'lines',
    name = 'predict_value_20days_average'
)

# Combining the traces and creating the layout
data = [trace1, trace2, trace3]
layout = go.Layout(title='Final Linear regressor Predict future price for total dataset', xaxis=dict(title='Date'), yaxis=dict(title='평당가격'))

# Creating the figure and plotting it
fig = go.Figure(data=data, layout=layout)
fig.show()






- In the performance evaluation, the Linear Regression model scored worse than the RandomForest Regressor model, but when visualized, the Linear Regression model actually tracks the actual values more closely

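### Applying an ensemble model

- Visually, the Gradient Boosting model tracks the actual values more closely than the Linear Regression model, but its predictions from 2023 onward run in such straight lines that they are hard to trust
- To address this, the Gradient Boosting and Linear Regression models are combined into an ensemble model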
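As a minimal sketch of what a weighted `VotingRegressor` does, the example below fits one on synthetic data and compares its output with a manually weighted average of the two base models. The data from `make_regression` and the variable names are assumptions for illustration only. `VotingRegressor.fit` trains clones of the estimators it is given, so the original `lr` and `gb` objects must be fitted separately before they can predict.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic regression data standing in for X_train / y_train.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

lr = LinearRegression()
gb = GradientBoostingRegressor(n_estimators=80, learning_rate=0.1, max_depth=3, random_state=0)

# Weighted ensemble: 30% Linear Regression, 70% Gradient Boosting.
ensemble = VotingRegressor([("lr", lr), ("gb", gb)], weights=[0.3, 0.7])
ensemble.fit(X, y)  # fits clones; lr and gb themselves remain unfitted here

# Fit the standalone models so their predictions can be compared.
lr.fit(X, y)
gb.fit(X, y)

# The ensemble prediction should match the weighted average of the base predictions.
manual = 0.3 * lr.predict(X) + 0.7 * gb.predict(X)
matches = np.allclose(ensemble.predict(X), manual)
print("ensemble equals weighted average:", matches)
```

Because the ensemble is a weighted average, a larger Gradient Boosting weight pulls the ensemble curve toward the Gradient Boosting curve.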

In [27]:
# Importing required libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import VotingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import plotly.graph_objs as go


# Creating a Linear Regression model
lr_model = LinearRegression()

# Creating a Gradient Boosting model
gb_model = GradientBoostingRegressor(n_estimators=80, learning_rate=0.1, max_depth=3, random_state=0)

# Building an ensemble model with VotingRegressor (weights: Linear Regression 0.3, Gradient Boosting 0.7)
ensemble_model_1 = VotingRegressor([('lr', lr_model), ('gb', gb_model)], weights=[0.3, 0.7])

# Building a second ensemble model (weights: Linear Regression 0.4, Gradient Boosting 0.6)
ensemble_model_2 = VotingRegressor([('lr', lr_model), ('gb', gb_model)], weights=[0.4, 0.6])


# Training ensemble_model_1 on the training set
ensemble_model_1.fit(X_train, y_train)
# Predicting 2023 prices from the held-out 2022 rows
final_pred_1 = ensemble_model_1.predict(df_final_pred_dataset[train_columns])
final_pred_1 = final_pred_1.tolist()


# Training ensemble_model_2 on the training set
ensemble_model_2.fit(X_train, y_train)
# Predicting 2023 prices from the held-out 2022 rows
final_pred_2 = ensemble_model_2.predict(df_final_pred_dataset[train_columns])
final_pred_2 = final_pred_2.tolist()



# VotingRegressor fits clones of its estimators, so the standalone models are fitted separately
# Training the Linear Regression model on the training set
lr_model.fit(X_train, y_train)
# Predicting 2023 prices from the held-out 2022 rows
final_pred_3 = lr_model.predict(df_final_pred_dataset[train_columns])
final_pred_3 = final_pred_3.tolist()


# Training the Gradient Boosting model on the training set
gb_model.fit(X_train, y_train)
# Predicting 2023 prices from the held-out 2022 rows
final_pred_4 = gb_model.predict(df_final_pred_dataset[train_columns])
final_pred_4 = final_pred_4.tolist()


# Creating the traces
trace1 = go.Scatter(
    x = df_final_actual['date'],
    y = df_final_actual['area_deal'],
    mode = 'lines',
    name = 'actual_value'
)


trace2 = go.Scatter(
    x = df_final_pred_dataset['future_date'],
    y = final_pred_1,
    mode = 'lines',
    name = 'ensemble_1 predict_value'
)

trace3 = go.Scatter(
    x = df_final_pred_dataset['future_date'],
    y = final_pred_2,
    mode = 'lines',
    name = 'ensemble_2 predict_value'
)


trace4 = go.Scatter(
    x = df_final_pred_dataset['future_date'],
    y = final_pred_3,
    mode = 'lines',
    name = 'Linear predict_value'
)

trace5 = go.Scatter(
    x = df_final_pred_dataset['future_date'],
    y = final_pred_4,
    mode = 'lines',
    name = 'Gradient Boosting predict_value'
)
# Combining the traces and creating the layout
data = [trace1, trace2, trace3, trace4,trace5]
layout = go.Layout(title='Final Ensemble Model Predict future price for total dataset', xaxis=dict(title='Date'), yaxis=dict(title='평당가격'))

# Creating the figure and plotting it
fig = go.Figure(data=data, layout=layout)
fig.show()

- ensemble_1 combines the Linear Regression and Gradient Boosting models with weights 3:7
- ensemble_2 combines the Linear Regression and Gradient Boosting models with weights 4:6
- The actual values and the predictions of the standalone Linear Regression and Gradient Boosting models are plotted as well
- The two ensemble models move similarly, but the predictions that combine Linear Regression with Gradient Boosting differ somewhat from those of Linear Regression alone

#### Ensemble model performance evaluation

##### ensemble_model_1

In [30]:

# Making predictions on the testing set
y_pred = ensemble_model_1.predict(X_test_sorted)
# Evaluating the model using Mean Squared Error (MSE)
mse = mean_squared_error(y_test_sorted , y_pred)
print('ensemble_model_1 Mean Squared Error:', mse)
print('ensemble_model_1 Root Mean Squared Error:', np.sqrt(mse))
print()
# Creating the traces
trace1 = go.Scatter(
    x = y_test_sorted.index,
    y = y_test_sorted.values,
    mode = 'lines',
    name = 'actual_value'
)


trace2 = go.Scatter(
    x = y_test_sorted.index,
    y = list(y_pred),
    mode = 'lines',
    name = 'predict_value'
)

# Combining the traces and creating the layout
data = [trace1, trace2]
layout = go.Layout(title='ensemble_model_1 Predict future price for test dataset', xaxis=dict(title='index'), yaxis=dict(title='미래평당가격'))

# Creating the figure and plotting it
fig = go.Figure(data=data, layout=layout)
fig.show()



ensemble_model_1 Mean Squared Error: 19.239289186371
ensemble_model_1 Root Mean Squared Error: 4.386261413364575



##### ensemble_model_2

In [31]:
# Making predictions on the testing set
y_pred = ensemble_model_2.predict(X_test_sorted)
# Evaluating the model using Mean Squared Error (MSE)
mse = mean_squared_error(y_test_sorted , y_pred)
print('ensemble_model_2 Mean Squared Error:', mse)
print('ensemble_model_2 Root Mean Squared Error:', np.sqrt(mse))
print()
# Creating the traces
trace1 = go.Scatter(
    x = y_test_sorted.index,
    y = y_test_sorted.values,
    mode = 'lines',
    name = 'actual_value'
)


trace2 = go.Scatter(
    x = y_test_sorted.index,
    y = list(y_pred),
    mode = 'lines',
    name = 'predict_value'
)

# Combining the traces and creating the layout
data = [trace1, trace2]
layout = go.Layout(title='ensemble_model_2 Predict future price for test dataset', xaxis=dict(title='index'), yaxis=dict(title='미래평당가격'))

# Creating the figure and plotting it
fig = go.Figure(data=data, layout=layout)
fig.show()

ensemble_model_2 Mean Squared Error: 32.07280700063213
ensemble_model_2 Root Mean Squared Error: 5.66328588371028



## Evaluating trend performance with corr()

- corr() is used to check how closely the four models that looked at least roughly right in the practicality test move with the actual values

In [None]:
df_final_actual.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 120 entries, 4352 to 4471
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   year       120 non-null    int64         
 1   month      120 non-null    int64         
 2   day        120 non-null    int64         
 3   area_deal  120 non-null    float32       
 4   date       120 non-null    datetime64[ns]
dtypes: datetime64[ns](1), float32(1), int64(3)
memory usage: 5.2 KB


1050.3454598034855

In [None]:
import pandas as pd

# Actual values exist only through April 2023, while predictions run through December 2023, so the predictions are sliced to the first 120 days
corr_df_2023 = pd.DataFrame({'date':df_final_actual['date'],'actual_value':df_final_actual['area_deal'], 'ensemble_1_value':final_pred_1[:120], 'ensemble_2_value':final_pred_2[:120],'linear_value':final_pred_3[:120],'Gradient_boosting_value':final_pred_4[:120]})

corr_df_2023.reset_index(inplace=True, drop=True)

corr_df_2023

Unnamed: 0,date,actual_value,ensemble_1_value,ensemble_2_value,linear_value,Gradient_boosting_value
0,2023-01-01,1056.691528,1050.345460,1049.024023,1041.095400,1054.309771
1,2023-01-02,1056.478394,1050.481302,1049.205146,1041.548208,1054.309771
2,2023-01-03,1055.845093,1050.146769,1048.759102,1040.433099,1054.309771
3,2023-01-04,1055.708862,1048.623669,1046.728301,1035.356096,1054.309771
4,2023-01-05,1055.373779,1049.864010,1048.382090,1039.490568,1054.309771
...,...,...,...,...,...,...
115,2023-04-26,1024.179443,1028.365674,1027.537441,1022.568045,1030.850372
116,2023-04-27,1024.288940,1027.692011,1026.697424,1020.729906,1030.675770
117,2023-04-28,1024.102783,1027.553437,1026.512659,1020.267994,1030.675770
118,2023-04-29,1023.981995,1027.913898,1026.993274,1021.469531,1030.675770


In [None]:
corr_df_2023.corr()['actual_value'].sort_values(ascending=False).to_frame()





Unnamed: 0,actual_value
actual_value,1.0
Gradient_boosting_value,0.812965
ensemble_1_value,0.803631
ensemble_2_value,0.782644
linear_value,0.333004


- The predictions follow the actual values most closely for Gradient Boosting, then ensemble_1, then ensemble_2, then Linear Regression
- Visually, the Linear Regression model seemed to follow a trend similar to the actual values, but its correlation turned out to be very low (0.333)
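Correlation on price levels is dominated by a shared long-run trend, so it can look high even when day-to-day movements are unrelated. One possible complement, which is an assumption here and not something the notebook itself does, is to also compare day-over-day changes and the share of days on which both series move in the same direction. The sketch below uses synthetic series with a common downward trend and independent noise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
days = np.arange(120)

# Synthetic actual and predicted series: same downward trend, independent daily noise.
actual = pd.Series(1056 - 0.27 * days + rng.normal(0, 0.3, 120))
pred = pd.Series(1054 - 0.20 * days + rng.normal(0, 0.3, 120))

# Pearson correlation of the price levels (what corr() measures in the table above).
level_corr = actual.corr(pred)

# Pearson correlation of the day-over-day changes.
change_corr = actual.diff().corr(pred.diff())

# Share of days on which both series move in the same direction.
actual_dir = np.sign(actual.diff().dropna())
pred_dir = np.sign(pred.diff().dropna())
same_direction = (actual_dir == pred_dir).mean()

print("level corr:", level_corr)
print("change corr:", change_corr)
print("same-direction share:", same_direction)
```

On data like this the level correlation is close to 1 while the change correlation is near 0, which shows the two measures answer different questions.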

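# Conclusion

- Judged by corr() alone, the Gradient Boosting model has the highest correlation with the actual values, but its plotted predictions move in such straight lines that its reliability is questionable
- The ensemble_1 and ensemble_2 models, which mix the Gradient Boosting and Linear Regression models, both stay close to the Gradient Boosting model while still showing a high correlation with the actual values
- Apartments that existed in the past but no longer exist were not filtered out, so the data may contain errors
- Predicting an exact return with a regression model means the predicted return itself will carry error. Would it be more useful to bin the future change rate into categories and predict which category a case falls into?

- The exact numbers differ from model to model, but the models that move similarly to the actual data all indicate that Seoul apartment prices overall will hold steady or decline through the end of 2023, so buying an apartment now is not recommended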
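The idea above of predicting a category instead of an exact change rate starts with binning the target. The sketch below shows one way to do that with `pd.cut`. The sample change rates, the ±3% thresholds, and the labels `fall`, `flat`, and `rise` are hypothetical choices for illustration, not values from the project.

```python
import numpy as np
import pandas as pd

# Hypothetical one-year-ahead price change rates, in percent.
change_rate = pd.Series([-7.2, -2.9, -0.4, 0.0, 1.3, 3.1, 8.5])

# Hypothetical thresholds: below -3% is a fall, above +3% is a rise, otherwise flat.
bins = [-np.inf, -3.0, 3.0, np.inf]
labels = ["fall", "flat", "rise"]

# Each value is assigned to the interval (lower, upper] that contains it.
category = pd.cut(change_rate, bins=bins, labels=labels)

print(pd.DataFrame({"change_rate": change_rate, "category": category}))
```

The resulting `category` column could then serve as the target of a classifier in place of the continuous change rate.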

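# Caveats & areas for improvement

## 1. Caveats

- Some datasets store '' instead of null for missing values, so check for blank entries before working with the data
- In pandas, the object dtype is not the same as a string type: an object column can hold non-string values, so convert the column to str before using functions such as .str.replace().
- When computing across columns, check for 0 or null values before carrying out the operation
- Data types can be converted to reduce memory usage.
- After merging or modifying values, check whether null or inf values are present.
- stack() skips null values, so nulls may need to be replaced beforehand to keep the calculation from silently deviating from what was intended.
- In pandas, a large number of rows seemed to put more pressure on memory than a large number of columns
- The memory usage reported by info() can differ from that reported by memory_usage(deep=True). (See https://pythonspeed.com/articles/pandas-dataframe-series-memory-usage/)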
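Several of the caveats above can be illustrated on a tiny frame. This is a sketch with made-up column names and values: it converts empty strings to missing values, cleans a string column before numeric conversion, catches `inf` produced by division by zero, downcasts a float column, and prints deep memory usage.

```python
import numpy as np
import pandas as pd

# Made-up data: '' stands in for a missing price, and one count is 0.
df = pd.DataFrame({
    "price": ["1,083", "", "1,057"],
    "count": [10.0, 5.0, 0.0],
})

# Treat empty strings as missing values before any further processing.
df["price"] = df["price"].replace("", np.nan)

# Remove thousands separators from the string values, then convert to numbers.
df["price"] = pd.to_numeric(df["price"].str.replace(",", "", regex=False))

# Division by zero produces inf; count those, then convert them to NaN.
ratio = df["price"] / df["count"]
n_inf = int(np.isinf(ratio).sum())
df["price_per_count"] = ratio.replace([np.inf, -np.inf], np.nan)

# Downcast the float column to a smaller dtype to reduce memory usage.
df["count"] = pd.to_numeric(df["count"], downcast="float")

print("inf values found:", n_inf)
print(df.isna().sum())
print(df.memory_usage(deep=True))
```

`memory_usage(deep=True)` inspects the actual Python objects in object columns, which is why it can report more than the default estimate.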


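## 2. Areas for improvement

### 2-1. Before starting the project

- Variable names, project terminology, and agreed-upon concepts should be fixed before starting so the project can proceed without confusion.
- Learned that designing the structure (schema) of the final table and of the intermediate tables in advance makes later preprocessing and data generation more efficient -> need to study schema design methods
- Need to improve the ability to judge which visualization fits which situation
- Which visualization to use for which metric should be planned in advance.
- Need to compare and study the differences between using csv files, pkl files, and a MySQL database
- When using machine learning models, decide up front which model applies to which kind of problem (classification, regression, etc.) before proceeding.

### 2-2. During the project

- When a hypothesis turns out not to be true, the ability to determine why, verify it, and revise it is needed
- Writing titles, subtitles, and explanations along the way makes later cleanup easier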




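#### 2-2-1. Preprocessing

- When preprocessing data with pandas, felt the need to write more memory-efficient code, for example by using functions (running the work in several batches because of memory shortage is cumbersome and can produce results different from the intended ones)
- Need the ability to use Python's language features to write more efficient functions
- Need to study outlier removal methods suited to each situation
- Need to study missing-value imputation methods suited to each situation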


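#### 2-2-2. Analysis

- Graphs are an effective way to grasp the performance of a machine learning model visually
- A model can perform well on the test dataset and still miss the actual future values
- Knowing how each model works makes it possible to judge which model fits which situation and makes parameter adjustments easier
- Need to study whether hyperparameters of machine learning models can simply be changed or must be tuned against some criterion

- For regression models, how the acceptable error range and the evaluation method are set seems to have a large effect on the measured performance
- For regression models, one must know what value of each evaluation metric counts as good performance
- Rather than trying to obtain exact values from a regression model, might it be better to focus on the trend?
- Predictions can appear to follow a similar trend in a line chart and still show a low value under corr(). -> need to research and study other metrics for measuring whether trends move together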





- Beyond machine learning models, need to study how to gain insights through data visualization and through statistical methods

### 2-3. After the project

- Need to study how to organize a project so that it is readable and persuasive