<a href="https://colab.research.google.com/github/Ryong1998/house_price/blob/main/apartment_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

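# Project Topic

- This project predicts the future 'Seoul apartment sale price index' (서울 아파트 매매가격지수).

# Project Overview

- Among the many regions and property types (apartments, detached houses, etc.), the scope is limited to 'apartments' in 'Seoul'.
- The ultimate goal is to forecast the Seoul apartment market over the next year and to provide an indicator that helps consumers who want to buy an apartment.
- We assume that the value of real estate can be assessed through two aspects: '1. characteristics as a residence' and '2. characteristics as a financial product'.
- '1. Characteristics as a residence' covers factors that make for a more comfortable living environment, such as nearby amenities, educational facilities, apartment size, and nearby transportation.
- '2. Characteristics as a financial product' covers factors expressed as financial figures, such as the base rate, apartment supply, unsold apartments, current sale prices, and the jeonse-to-sale price ratio (전세가율).
- The factors that signal high value under '1. characteristics as a residence' can change over time. For example, as households shift from extended families to smaller ones, the preferred apartment size may change, and the spread of online lectures may reduce the importance of educational infrastructure.
- These factors may have kept changing in the past, but it is hard to determine how they changed and impossible to know how they will change, so the evaluation criteria are 'variable'.
- In contrast, '2. characteristics as a financial product' is expressed as 'figures' grounded in prices and the economy, so we assume it can assess the value of real estate more consistently than '1. characteristics as a residence'.
- We also assume that the figures under '2. characteristics as a financial product' already reflect the changing value of '1. characteristics as a residence'.
- This project therefore focuses on '2. characteristics as a financial product' to predict changes in housing prices.
- The work proceeds by building a model that predicts, for each day, the 'Seoul apartment sale price index one year later'.
- Even though it cannot recommend individual apartments, the project uses the one-year outlook for the Seoul apartment market to predict whether now is a good time to buy an apartment.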
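
The daily 'index one year later' target described in the list above could be built roughly as follows. This is a minimal sketch and not the notebook's own code: the helper `add_one_year_ahead_target` and the column names `price_index` and `target_1y` are hypothetical, the demo values are made up, and it assumes a DataFrame indexed by unique daily dates.

```python
import pandas as pd

def add_one_year_ahead_target(df, value_col="price_index"):
    """Attach, for each date, the value observed one calendar year later.

    Assumes `df` has a DatetimeIndex with unique dates. Dates whose
    one-year-ahead value is not in the data get NaN.
    """
    series = df[value_col]
    out = df.copy()
    out["target_1y"] = [
        series.get(day + pd.DateOffset(years=1), float("nan"))
        for day in out.index
    ]
    return out

# Made-up daily series covering two years, only to show the alignment
dates = pd.date_range("2020-01-01", "2021-12-31", freq="D")
demo = pd.DataFrame({"price_index": range(len(dates))}, index=dates)
labeled = add_one_year_ahead_target(demo)
```

Rows in the final year have no label yet, so they would be dropped before training and kept only as inputs for the actual forecast.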

In [2]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Obtaining original_data

- Apartment sale prices and jeonse/monthly-rent prices were obtained as files from 'http://rtdown.molit.go.kr/'.
- Korean government bond yields, US government bond yields, and KOSPI data were obtained from 'https://kr.investing.com/'.
- Apartment supply data was obtained from 'https://data.kbland.kr/publicdata/housing-supply'.
- Unsold apartment data was obtained from 'https://data.kbland.kr/publicdata/unsold-apartments'.
- Base rate data was obtained from 'https://www.bok.or.kr/portal/singl/baseRate/list.do?dataSeCd=01&menuNo=200643'.
- The weekly Seoul housing price index was obtained from 'https://data.kbland.kr/kbstats/wmh?tIdx=HT01&tsIdx=weekAptSalePriceInx'.

>> The original plan was to fetch the apartment sale and jeonse/monthly-rent data through the 공공데이터포털 (public data portal) API, but because of its daily traffic limit the files were downloaded directly from 'http://rtdown.molit.go.kr/' instead.

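# Creating apartment_deal.csv

- The apartment sale price files were obtained from 'http://rtdown.molit.go.kr/'.
- A DataFrame holding the 'apartment sale' information is built and saved as apartment_deal.csv.

## Loading and merging the csv files

- The original apartment sale files are split by year, and not all of the information in each csv file is needed, so a preprocessing step is carried out.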

In [None]:
import pandas as pd
import os

# Directory that holds the yearly apartment sale csv files
dir_path = "/content/drive/MyDrive/house_price/original_data/deal_price/Seoul"
file_list = os.listdir(dir_path)
file_list.sort() # fix the load order by file name
df_list = list()
# Read every csv file in the folder and collect the DataFrames in a list
for csv_file in file_list:
    df_list.append(pd.read_csv(os.path.join(dir_path, csv_file), skiprows=15, encoding='cp949'))

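>> os.listdir returns entries in arbitrary order (on Colab it appeared to follow the upload order), so the list is sorted explicitly to load the yearly files in a fixed order.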
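
The point about load order can be checked in isolation. The sketch below uses only the standard library. The helper `list_csv_paths` and the file names are hypothetical, and the filter on the `.csv` extension is an addition that the notebook itself does not apply.

```python
import os
import tempfile

def list_csv_paths(dir_path):
    """Return the full paths of the .csv files in dir_path, sorted by file name."""
    names = sorted(n for n in os.listdir(dir_path) if n.lower().endswith(".csv"))
    return [os.path.join(dir_path, n) for n in names]

# Demonstration with throwaway files created out of order
with tempfile.TemporaryDirectory() as tmp:
    for name in ["2008.csv", "2006.csv", "2007.csv", "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()
    ordered_names = [os.path.basename(p) for p in list_csv_paths(tmp)]
```

Sorting by name only gives chronological order when the file names start with a fixed-width year, which is assumed here.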

In [None]:
df_list[0].info() # check that the DataFrames were loaded into the list

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120812 entries, 0 to 120811
Data columns (total 15 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   시군구       120812 non-null  object 
 1   번지        120812 non-null  object 
 2   본번        120812 non-null  int64  
 3   부번        120812 non-null  int64  
 4   단지명       120812 non-null  object 
 5   전용면적(㎡)   120812 non-null  float64
 6   계약년월      120812 non-null  int64  
 7   계약일       120812 non-null  int64  
 8   거래금액(만원)  120812 non-null  object 
 9   층         120812 non-null  int64  
 10  건축년도      120812 non-null  int64  
 11  도로명       120812 non-null  object 
 12  해제사유발생일   0 non-null       float64
 13  거래유형      120812 non-null  object 
 14  중개사소재지    120812 non-null  object 
dtypes: float64(2), int64(6), object(7)
memory usage: 13.8+ MB


In [None]:
df_list[0].head() # check what the data looks like

Unnamed: 0,시군구,번지,본번,부번,단지명,전용면적(㎡),계약년월,계약일,거래금액(만원),층,건축년도,도로명,해제사유발생일,거래유형,중개사소재지
0,서울특별시 강남구 개포동,655-2,655,2,개포2차현대아파트(220),77.75,200603,10,59500,7,1988,언주로 103,,-,-
1,서울특별시 강남구 개포동,655-2,655,2,개포2차현대아파트(220),77.75,200603,29,60000,6,1988,언주로 103,,-,-
2,서울특별시 강남구 개포동,655-2,655,2,개포2차현대아파트(220),77.75,200604,29,67000,9,1988,언주로 103,,-,-
3,서울특별시 강남구 개포동,655-2,655,2,개포2차현대아파트(220),77.75,200606,1,60000,4,1988,언주로 103,,-,-
4,서울특별시 강남구 개포동,655-2,655,2,개포2차현대아파트(220),77.75,200610,20,72250,5,1988,언주로 103,,-,-


In [None]:
# Merge all yearly DataFrames into a single DataFrame
# concat stacks the frames vertically; ignore_index=True renumbers the rows from 0
df_default = pd.concat(df_list, axis=0, ignore_index=True)
df_default.loc[1]

시군구          서울특별시 강남구 개포동
번지                   655-2
본번                   655.0
부번                     2.0
단지명         개포2차현대아파트(220)
전용면적(㎡)              77.75
계약년월                200603
계약일                     29
거래금액(만원)            60,000
층                        6
건축년도                1988.0
도로명                언주로 103
해제사유발생일                NaN
거래유형                     -
중개사소재지                   -
등기신청일자                 NaN
Name: 1, dtype: object

In [None]:
df_default.head() # inspect the merged table

Unnamed: 0,시군구,번지,본번,부번,단지명,전용면적(㎡),계약년월,계약일,거래금액(만원),층,건축년도,도로명,해제사유발생일,거래유형,중개사소재지,등기신청일자
0,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),77.75,200603,10,59500,7,1988.0,언주로 103,,-,-,
1,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),77.75,200603,29,60000,6,1988.0,언주로 103,,-,-,
2,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),77.75,200604,29,67000,9,1988.0,언주로 103,,-,-,
3,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),77.75,200606,1,60000,4,1988.0,언주로 103,,-,-,
4,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),77.75,200610,20,72250,5,1988.0,언주로 103,,-,-,


In [None]:
df_default.info() # check the result of merging the DataFrames

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1268633 entries, 0 to 1268632
Data columns (total 16 columns):
 #   Column    Non-Null Count    Dtype  
---  ------    --------------    -----  
 0   시군구       1268633 non-null  object 
 1   번지        1268408 non-null  object 
 2   본번        1268554 non-null  float64
 3   부번        1268554 non-null  float64
 4   단지명       1268633 non-null  object 
 5   전용면적(㎡)   1268633 non-null  float64
 6   계약년월      1268633 non-null  int64  
 7   계약일       1268633 non-null  int64  
 8   거래금액(만원)  1268633 non-null  object 
 9   층         1268633 non-null  int64  
 10  건축년도      1268631 non-null  float64
 11  도로명       1268633 non-null  object 
 12  해제사유발생일   6336 non-null     float64
 13  거래유형      1268633 non-null  object 
 14  중개사소재지    1268633 non-null  object 
 15  등기신청일자    31142 non-null    object 
dtypes: float64(5), int64(3), object(8)
memory usage: 154.9+ MB
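
In the merged `info()` output above, 본번, 부번, and 건축년도 have become float64, and a 등기신청일자 column that the first file lacked has appeared. A small sketch of why `pd.concat` can produce this, using made-up rows: integer columns that pick up missing values are upcast to float, and a column present in only some files is filled with NaN elsewhere.

```python
import pandas as pd

# Made-up stand-ins for an older file and a newer file with an extra column
older = pd.DataFrame({"본번": [655, 658], "계약일": [10, 29]})
newer = pd.DataFrame({
    "본번": [557, None],          # a missing lot number forces float64
    "계약일": [1, 16],
    "등기신청일자": ["23.09.20", "23.10.05"],
})

merged = pd.concat([older, newer], axis=0, ignore_index=True)
main_number_dtype = str(merged["본번"].dtype)
missing_registration = int(merged["등기신청일자"].isnull().sum())
```

Here 본번 ends up as float64 while 계약일, which has no missing values, stays an integer column.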


## Selecting only the required columns

- Not every column of df_default is used, so only the columns that will be used are selected.

In [None]:
# Keep only the columns that will be used and rename them to English
# '층' is kept as 'floor'
df_default = df_default[['시군구','본번','부번','도로명','단지명','계약년월','계약일','전용면적(㎡)','거래금액(만원)','층']].copy()
df_default.columns = ['address','main_number','sub_number','road','name','year_month','day','area','deal_price','floor']
df_default.head() # check that the selection worked

Unnamed: 0,address,main_number,sub_number,road,name,year_month,day,area,deal_price,floor
0,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200603,10,77.75,59500,7
1,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200603,29,77.75,60000,6
2,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200604,29,77.75,67000,9
3,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200606,1,77.75,60000,4
4,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200610,20,77.75,72250,5


In [None]:
# Convert the types of deal_price, year_month, and day
df_default["deal_price"] = df_default["deal_price"].str.replace(",", "") # strip the thousands separator ',' from 'deal_price' so it can be used in calculations later
df = df_default.astype({'year_month':'str','day':'str','deal_price':'int64'}).copy()
df.head() # check the converted values

Unnamed: 0,address,main_number,sub_number,road,name,year_month,day,area,deal_price,floor
0,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200603,10,77.75,59500,7
1,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200603,29,77.75,60000,6
2,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200604,29,77.75,67000,9
3,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200606,1,77.75,60000,4
4,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),200610,20,77.75,72250,5


In [None]:
df.info() # check the type changes and nulls

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1268633 entries, 0 to 1268632
Data columns (total 10 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   address      1268633 non-null  object 
 1   main_number  1268554 non-null  float64
 2   sub_number   1268554 non-null  float64
 3   road         1268633 non-null  object 
 4   name         1268633 non-null  object 
 5   year_month   1268633 non-null  object 
 6   day          1268633 non-null  object 
 7   area         1268633 non-null  float64
 8   deal_price   1268633 non-null  int64  
 9   floor        1268633 non-null  int64  
dtypes: float64(3), int64(2), object(5)
memory usage: 96.8+ MB
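
The `deal_price` conversion can be illustrated on a few values. This is a small sketch on sample strings in the same format as 거래금액(만원); `regex=False` is added here to make the literal replacement explicit.

```python
import pandas as pd

raw_prices = pd.Series(["59,500", "60,000", "143,500"])

# Remove the thousands separator, then convert the text to integers
clean_prices = raw_prices.str.replace(",", "", regex=False).astype("int64")
```

As a possible alternative, `pd.read_csv` accepts a `thousands=","` argument, which may allow such columns to be parsed as numbers at load time.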


In [None]:
# Check for rows where 'main_number' or 'sub_number' is null and 'road' is also null -> none
# i.e. 'road' appears to carry more complete address information
df[((df['main_number'].isnull()) |(df['sub_number'].isnull())) &(df['road'].isnull()) ]

Unnamed: 0,address,main_number,sub_number,road,name,year_month,day,area,deal_price,floor


> main_number and sub_number contain nulls -> road was judged to be the better source of address information

## Adding year, month, and day columns

- Date columns are needed later for grouping, so the 'year_month' and 'day' columns are processed into several date-related columns.

In [None]:
# Split and derive date-related columns for later grouping
df['year'] = df['year_month'].str[0:4] # extract the year from the combined year-month column
df['month'] = df['year_month'].str[4:] # extract the month from the combined year-month column
df['day'] = df['day'].str.zfill(2) # left-pad single-digit days (1, 2, ...) with a zero
df['date'] = pd.to_datetime(df['year']+df['month']+df['day'], format='%Y%m%d') # combine the parts into a date column
df = df.astype({'year':'int64','month':'int64','day':'int64'}) # convert to the desired types
df = df.drop(['year_month'], axis=1) # drop the column that is no longer needed
df.head()

Unnamed: 0,address,main_number,sub_number,road,name,day,area,deal_price,floor,year,month,date
0,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),10,77.75,59500,7,2006,3,2006-03-10
1,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),29,77.75,60000,6,2006,3,2006-03-29
2,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),29,77.75,67000,9,2006,4,2006-04-29
3,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),1,77.75,60000,4,2006,6,2006-06-01
4,서울특별시 강남구 개포동,655.0,2.0,언주로 103,개포2차현대아파트(220),20,77.75,72250,5,2006,10,2006-10-20
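
The date construction can be tried on two sample rows. A minimal sketch: the frame `parts` is made up, and passing an explicit `format` to `pd.to_datetime` is a choice made here so that the eight-digit text is read as year, month, day.

```python
import pandas as pd

parts = pd.DataFrame({"year_month": ["200603", "200610"], "day": ["1", "20"]})

# Pad the day to two digits so every date string has exactly eight characters
date_text = parts["year_month"] + parts["day"].str.zfill(2)
parts["date"] = pd.to_datetime(date_text, format="%Y%m%d")
```

Without the padding, "200603" + "1" would give the seven-character string "2006031", which no longer matches the year-month-day layout.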


In [None]:
df.info() # check that the types changed as intended

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1268633 entries, 0 to 1268632
Data columns (total 12 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   address      1268633 non-null  object        
 1   main_number  1268554 non-null  float64       
 2   sub_number   1268554 non-null  float64       
 3   road         1268633 non-null  object        
 4   name         1268633 non-null  object        
 5   day          1268633 non-null  int64         
 6   area         1268633 non-null  float64       
 7   deal_price   1268633 non-null  int64         
 8   floor        1268633 non-null  int64         
 9   year         1268633 non-null  int64         
 10  month        1268633 non-null  int64         
 11  date         1268633 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(3), int64(5), object(3)
memory usage: 116.1+ MB


In [None]:
# Split the lot address and the road address into parts
address_parts = df["address"].str.split(' ', expand=True)
road_parts = df["road"].str.split(' ', expand=True)
df["address_0"] = address_parts[0] # city (시)
df["address_1"] = address_parts[1] # district (구)
df["address_2"] = address_parts[2] # neighborhood (동)
df["road_name"] = road_parts[0] # road name
df["road_number"] = road_parts[1] # building number on the road
df = df[['year','month','day','address_0','address_1','address_2','road_name','road_number','area','deal_price','name','main_number','sub_number','date']].copy() # keep only the columns to be used
df.head()

Unnamed: 0,year,month,day,address_0,address_1,address_2,road_name,road_number,area,deal_price,name,main_number,sub_number,date
0,2006,3,10,서울특별시,강남구,개포동,언주로,103,77.75,59500,개포2차현대아파트(220),655.0,2.0,2006-03-10
1,2006,3,29,서울특별시,강남구,개포동,언주로,103,77.75,60000,개포2차현대아파트(220),655.0,2.0,2006-03-29
2,2006,4,29,서울특별시,강남구,개포동,언주로,103,77.75,67000,개포2차현대아파트(220),655.0,2.0,2006-04-29
3,2006,6,1,서울특별시,강남구,개포동,언주로,103,77.75,60000,개포2차현대아파트(220),655.0,2.0,2006-06-01
4,2006,10,20,서울특별시,강남구,개포동,언주로,103,77.75,72250,개포2차현대아파트(220),655.0,2.0,2006-10-20
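
How `str.split(' ', expand=True)` treats incomplete road values matters for the missing-value checks that follow. The sketch below uses three sample strings. The assumption that blank road entries in the source files are a single space is a guess and is not confirmed by the notebook.

```python
import pandas as pd

# A full road address, one without a building number, and a single space
roads = pd.Series(["언주로 103", "만리재로", " "])
road_parts = roads.str.split(" ", expand=True)

road_names = road_parts[0].tolist()
road_number_is_null = road_parts[1].isnull().tolist()
```

A value with no space yields a real null in the second column, while a lone space yields two empty strings, which `isnull()` does not flag.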


## Missing-value handling 1

In [None]:
df.info() # road_number now has one null value

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1268633 entries, 0 to 1268632
Data columns (total 14 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   year         1268633 non-null  int64         
 1   month        1268633 non-null  int64         
 2   day          1268633 non-null  int64         
 3   address_0    1268633 non-null  object        
 4   address_1    1268633 non-null  object        
 5   address_2    1268633 non-null  object        
 6   road_name    1268633 non-null  object        
 7   road_number  1268632 non-null  object        
 8   area         1268633 non-null  float64       
 9   deal_price   1268633 non-null  int64         
 10  name         1268633 non-null  object        
 11  main_number  1268554 non-null  float64       
 12  sub_number   1268554 non-null  float64       
 13  date         1268633 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(3), int64(4), object(6)
memory usage

In [None]:
df[df['road_number'].isnull()] # show the row whose road_number is null

Unnamed: 0,year,month,day,address_0,address_1,address_2,road_name,road_number,area,deal_price,name,main_number,sub_number,date
1177515,2020,12,31,서울특별시,중구,만리동2가,만리재로,,39.9541,161000,서울역센트럴자이(임대),176.0,1.0,2020-12-31


In [None]:
# Look up '서울역센트럴자이' -> some rows contain '' (empty-string) values
df.loc[df['name'] == '서울역센트럴자이',:]

Unnamed: 0,year,month,day,address_0,address_1,address_2,road_name,road_number,area,deal_price,name,main_number,sub_number,date
936223,2017,5,3,서울특별시,중구,만리동2가,만리재로,175.0,84.972,79390,서울역센트럴자이,176.0,1.0,2017-05-03
936224,2017,12,20,서울특별시,중구,만리동2가,만리재로,175.0,59.943,85000,서울역센트럴자이,176.0,1.0,2017-12-20
936225,2017,12,30,서울특별시,중구,만리동2가,,,59.94,85000,서울역센트럴자이,176.0,1.0,2017-12-30
1018067,2018,3,20,서울특별시,중구,만리동2가,,,72.99,85000,서울역센트럴자이,176.0,1.0,2018-03-20
1093938,2019,7,13,서울특별시,중구,만리동2가,만리재로,175.0,84.972,134500,서울역센트럴자이,176.0,1.0,2019-07-13
1093939,2019,8,20,서울특별시,중구,만리동2가,만리재로,175.0,59.94,95000,서울역센트럴자이,176.0,1.0,2019-08-20
1093940,2019,8,23,서울특별시,중구,만리동2가,만리재로,175.0,84.972,139000,서울역센트럴자이,176.0,1.0,2019-08-23
1093941,2019,9,8,서울특별시,중구,만리동2가,만리재로,175.0,59.94,113800,서울역센트럴자이,176.0,1.0,2019-09-08
1093942,2019,9,21,서울특별시,중구,만리동2가,만리재로,175.0,72.9733,132000,서울역센트럴자이,176.0,1.0,2019-09-21
1093943,2019,11,30,서울특별시,중구,만리동2가,만리재로,175.0,59.9808,120000,서울역센트럴자이,176.0,1.0,2019-11-30


In [None]:
# Show the rows whose value is ''
df.loc[df['road_name'] == '',:]

Unnamed: 0,year,month,day,address_0,address_1,address_2,road_name,road_number,area,deal_price,name,main_number,sub_number,date
1606,2006,2,23,서울특별시,강남구,논현동,,,128.67,73500,경복,276.0,0.0,2006-02-23
1628,2006,10,19,서울특별시,강남구,논현동,,,95.48,71000,경복,276.0,0.0,2006-10-19
2799,2006,1,24,서울특별시,강남구,대치동,,,76.56,80000,청실1,633.0,0.0,2006-01-24
2806,2006,2,14,서울특별시,강남구,대치동,,,102.64,143500,청실1,633.0,0.0,2006-02-14
2807,2006,2,14,서울특별시,강남구,대치동,,,102.64,142000,청실1,633.0,0.0,2006-02-14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1255393,2023,9,9,서울특별시,서초구,반포동,,,84.97,430000,래미안원베일리,1.0,1.0,2023-09-09
1255394,2023,9,13,서울특별시,서초구,반포동,,,101.97,470000,래미안원베일리,1.0,1.0,2023-09-13
1255395,2023,9,24,서울특별시,서초구,반포동,,,74.92,322000,래미안원베일리,1.0,1.0,2023-09-24
1255396,2023,9,25,서울특별시,서초구,반포동,,,84.93,394000,래미안원베일리,1.0,1.0,2023-09-25


> A column with no nulls can still contain '' values! Semantically these are missing values, but because they are stored as '' they look as if a value were present.
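
The difference between '' and null can be seen on a small sample. This sketch uses made-up rows: `isnull()` counts only real nulls, so empty strings have to be counted separately or converted to `np.nan` first.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "road_name": ["언주로", "", "만리재로"],
    "road_number": ["103", "", None],
})

nulls_before = sample.isnull().sum().to_dict()                      # real nulls only
empty_counts = (sample == "").sum().to_dict()                       # hidden '' values
nulls_after = sample.replace("", np.nan).isnull().sum().to_dict()   # both combined
```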

In [None]:
df.loc[df['name'] == '서울역센트럴자이(임대)','name']='서울역센트럴자이' # rename '서울역센트럴자이(임대)' to '서울역센트럴자이'
df.loc[df['name'] == '서울역센트럴자이','road_name']='만리재로' # set 'road_name' from the '서울역센트럴자이' rows checked above
df.loc[df['name'] == '서울역센트럴자이','road_number']='175' # set 'road_number' from the '서울역센트럴자이' rows checked above
df.info() # the values that show up as null have been handled for now

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1268633 entries, 0 to 1268632
Data columns (total 14 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   year         1268633 non-null  int64         
 1   month        1268633 non-null  int64         
 2   day          1268633 non-null  int64         
 3   address_0    1268633 non-null  object        
 4   address_1    1268633 non-null  object        
 5   address_2    1268633 non-null  object        
 6   road_name    1268633 non-null  object        
 7   road_number  1268633 non-null  object        
 8   area         1268633 non-null  float64       
 9   deal_price   1268633 non-null  int64         
 10  name         1268633 non-null  object        
 11  main_number  1268554 non-null  float64       
 12  sub_number   1268554 non-null  float64       
 13  date         1268633 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(3), int64(4), object(6)
memory usage

## Missing-value handling 2

- The steps above showed that '' can be stored as a value, so '' is now treated as null and handled as a missing value.

In [None]:
import numpy as np
df = df.replace('', np.nan) # turn values that are just '' into nulls
df.info() # check the result after the change

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1268633 entries, 0 to 1268632
Data columns (total 14 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   year         1268633 non-null  int64         
 1   month        1268633 non-null  int64         
 2   day          1268633 non-null  int64         
 3   address_0    1268633 non-null  object        
 4   address_1    1268633 non-null  object        
 5   address_2    1268633 non-null  object        
 6   road_name    1266588 non-null  object        
 7   road_number  1265322 non-null  object        
 8   area         1268633 non-null  float64       
 9   deal_price   1268633 non-null  int64         
 10  name         1268633 non-null  object        
 11  main_number  1268554 non-null  float64       
 12  sub_number   1268554 non-null  float64       
 13  date         1268633 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(3), int64(4), object(6)
memory usage

In [None]:
df.isnull().sum() # the null counts of 'road_name' and 'road_number' have increased

year              0
month             0
day               0
address_0         0
address_1         0
address_2         0
road_name      2045
road_number    3311
area              0
deal_price        0
name              0
main_number      79
sub_number       79
date              0
dtype: int64

> At first the road address seemed to have fewer nulls, but preprocessing showed that the lot-number address (지번주소) actually has fewer.

In [None]:
# Check for rows where only one of 'main_number' and 'sub_number' is null -> none
# i.e. the two are always null together
df[((df['main_number'].isnull()) &(df['sub_number'].notnull()))
  |((df['main_number'].notnull()) &(df['sub_number'].isnull()))]

Unnamed: 0,year,month,day,address_0,address_1,address_2,road_name,road_number,area,deal_price,name,main_number,sub_number,date


In [None]:
# Check for rows where the road address is null and the lot-number address is also null -> none
# i.e. either the road address or the lot-number address can always supply the address
df[((df['road_name'].isnull()) | (df['road_number'].isnull())) & (df['main_number'].isnull())]

Unnamed: 0,year,month,day,address_0,address_1,address_2,road_name,road_number,area,deal_price,name,main_number,sub_number,date


In [None]:
# Show the rows that still have nulls to handle
df.loc[df['main_number'].isnull(),['address_0','address_1','address_2','road_name','road_number','name','main_number','sub_number']]

Unnamed: 0,address_0,address_1,address_2,road_name,road_number,name,main_number,sub_number
681633,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
681634,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
681635,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
681636,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
681637,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
...,...,...,...,...,...,...,...,...
1232881,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
1256361,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
1256362,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,
1256363,서울특별시,서초구,신원동,헌릉로8길,10-12,힐스테이트 서초 젠트리스,,


In [None]:
df.loc[df['main_number'].isnull(),'name'].unique() # list the apartment names whose lot-number address is null
                                                   # only '힐스테이트 서초 젠트리스' needs fixing

array(['힐스테이트 서초 젠트리스'], dtype=object)

In [None]:
df.loc[df['name']=='힐스테이트 서초 젠트리스',:] # every row whose name is '힐스테이트 서초 젠트리스' has a null lot-number address

Unnamed: 0,year,month,day,address_0,address_1,address_2,road_name,road_number,area,deal_price,name,main_number,sub_number,date
681633,2015,3,1,서울특별시,서초구,신원동,헌릉로8길,10-12,84.95,73430,힐스테이트 서초 젠트리스,,,2015-03-01
681634,2015,4,17,서울특별시,서초구,신원동,헌릉로8길,10-12,84.99,79000,힐스테이트 서초 젠트리스,,,2015-04-17
681635,2015,5,1,서울특별시,서초구,신원동,헌릉로8길,10-12,101.90,95000,힐스테이트 서초 젠트리스,,,2015-05-01
681636,2015,6,16,서울특별시,서초구,신원동,헌릉로8길,10-12,84.95,87200,힐스테이트 서초 젠트리스,,,2015-06-16
681637,2015,6,26,서울특별시,서초구,신원동,헌릉로8길,10-12,101.90,94500,힐스테이트 서초 젠트리스,,,2015-06-26
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1232881,2022,9,28,서울특별시,서초구,신원동,헌릉로8길,10-12,84.95,173000,힐스테이트 서초 젠트리스,,,2022-09-28
1256361,2023,4,20,서울특별시,서초구,신원동,헌릉로8길,10-12,84.95,160000,힐스테이트 서초 젠트리스,,,2023-04-20
1256362,2023,9,11,서울특별시,서초구,신원동,헌릉로8길,10-12,84.99,163000,힐스테이트 서초 젠트리스,,,2023-09-11
1256363,2023,9,16,서울특별시,서초구,신원동,헌릉로8길,10-12,101.90,190000,힐스테이트 서초 젠트리스,,,2023-09-16


In [None]:
# Fill the null lot numbers with values looked up through a Naver search
df.loc[df['name']=='힐스테이트 서초 젠트리스','main_number'] = 557
df.loc[df['name']=='힐스테이트 서초 젠트리스','sub_number'] = 0

In [None]:
# Select the columns to use and rename them
df_deal = df[['date','year','month','day','address_0','address_1','address_2','main_number','sub_number','name','area','deal_price']].copy()
df_deal.columns =['date','year','month','day','address_0','address_1','address_2','address_3','address_4','name','area','deal_price']
df_deal = df_deal[df_deal['year']>=2011] # the jeonse/monthly-rent data starts in 2011, so keep 2011 onward
df_deal.head()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,deal_price
355306,2011-07-09,2011,7,9,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,64000
355307,2011-07-28,2011,7,28,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,65500
355308,2011-01-19,2011,1,19,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,70500
355309,2011-09-02,2011,9,2,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,85000
355310,2011-12-17,2011,12,17,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,68000


In [None]:
df_deal.info() # check the DataFrame information

<class 'pandas.core.frame.DataFrame'>
Int64Index: 913327 entries, 355306 to 1268632
Data columns (total 12 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   date        913327 non-null  datetime64[ns]
 1   year        913327 non-null  int64         
 2   month       913327 non-null  int64         
 3   day         913327 non-null  int64         
 4   address_0   913327 non-null  object        
 5   address_1   913327 non-null  object        
 6   address_2   913327 non-null  object        
 7   address_3   913327 non-null  float64       
 8   address_4   913327 non-null  float64       
 9   name        913327 non-null  object        
 10  area        913327 non-null  float64       
 11  deal_price  913327 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(4), object(4)
memory usage: 90.6+ MB


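> This DataFrame consists of the 'date', 'year', 'month', and 'day' columns for the transaction date; the 'address_0', 'address_1', 'address_2', 'address_3', and 'address_4' columns for the address; the 'name' column for the apartment name; the 'area' column for the floor area; and the 'deal_price' column for the sale price.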

In [None]:
df_deal.iloc[200] # spot-check that the values look right

date          2011-12-23 00:00:00
year                         2011
month                          12
day                            23
address_0                   서울특별시
address_1                     강남구
address_2                     개포동
address_3                   141.0
address_4                     0.0
name                      개포주공1단지
area                        56.57
deal_price                  95000
Name: 355506, dtype: object

In [None]:
df_deal.to_csv('/content/drive/MyDrive/house_price/after_data/apartment_deal.csv',index=False) # save to a file

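# Creating apartment_full_rent.csv and apartment_month_rent.csv

- The apartment jeonse (lump-sum deposit lease) and monthly-rent files were obtained from 'http://rtdown.molit.go.kr/'.
- A DataFrame holding the 'apartment jeonse' information is built and saved as apartment_full_rent.csv.
- A DataFrame holding the 'apartment monthly rent' information is built and saved as apartment_month_rent.csv.

## Loading and merging the csv files

- The apartment rent csv files are split by year, and not all of the information in each csv file is needed, so a preprocessing step is carried out.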

In [None]:
import pandas as pd
import os


dir_path = "/content/drive/MyDrive/house_price/original_data/rent_price/Seoul"
file_list = os.listdir(dir_path)
file_list.sort()
df_list = list()

# Read every csv file in the folder and collect the DataFrames in a list
for csv_file in file_list:
    df_list.append(pd.read_csv(dir_path+"/"+csv_file ,skiprows=15,  encoding='cp949', low_memory=False))


In [None]:
df_list[-1].info() # check that the DataFrames were loaded into the list

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 224919 entries, 0 to 224918
Data columns (total 19 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   시군구            224919 non-null  object 
 1   번지             224861 non-null  object 
 2   본번             224892 non-null  float64
 3   부번             224892 non-null  float64
 4   단지명            224919 non-null  object 
 5   전월세구분          224919 non-null  object 
 6   전용면적(㎡)        224919 non-null  float64
 7   계약년월           224919 non-null  int64  
 8   계약일            224919 non-null  int64  
 9   보증금(만원)        224919 non-null  object 
 10  월세(만원)         224919 non-null  object 
 11  층              224919 non-null  int64  
 12  건축년도           224903 non-null  float64
 13  도로명            224919 non-null  object 
 14  계약기간           224919 non-null  object 
 15  계약구분           224919 non-null  object 
 16  갱신요구권 사용       224919 non-null  object 
 17  종전계약 보증금 (만원)  184614 non-nul

In [None]:
# Merge all DataFrames into one
df_default = pd.concat(df_list, axis=0, ignore_index=True) # stack vertically and renumber the index
df_default.info() # check the result of merging the DataFrames

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2310694 entries, 0 to 2310693
Data columns (total 19 columns):
 #   Column         Dtype  
---  ------         -----  
 0   시군구            object 
 1   번지             object 
 2   본번             float64
 3   부번             float64
 4   단지명            object 
 5   전월세구분          object 
 6   전용면적(㎡)        float64
 7   계약년월           int64  
 8   계약일            int64  
 9   보증금(만원)        object 
 10  월세(만원)         object 
 11  층              float64
 12  건축년도           float64
 13  도로명            object 
 14  계약기간           object 
 15  계약구분           object 
 16  갱신요구권 사용       object 
 17  종전계약 보증금 (만원)  object 
 18  종전계약 월세 (만원)   object 
dtypes: float64(5), int64(2), object(12)
memory usage: 335.0+ MB


In [None]:
df_default.head() # check what the data looks like

Unnamed: 0,시군구,번지,본번,부번,단지명,전월세구분,전용면적(㎡),계약년월,계약일,보증금(만원),월세(만원),층,건축년도,도로명,계약기간,계약구분,갱신요구권 사용,종전계약 보증금 (만원),종전계약 월세 (만원)
0,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),전세,77.75,201101,5,35000,0,7.0,1988.0,언주로 103,-,-,-,,
1,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),전세,77.75,201101,18,20000,0,8.0,1988.0,언주로 103,-,-,-,,
2,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),전세,77.75,201102,1,24000,0,5.0,1988.0,언주로 103,-,-,-,,
3,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),전세,77.75,201102,11,31000,0,9.0,1988.0,언주로 103,-,-,-,,
4,서울특별시 강남구 개포동,655-2,655.0,2.0,개포2차현대아파트(220),전세,77.75,201102,24,30500,0,9.0,1988.0,언주로 103,-,-,-,,


In [None]:
df_default.isnull().sum() # 번지, 본번, and 부번 (among others) contain nulls

시군구                    0
번지                  1644
본번                   261
부번                   261
단지명                    0
전월세구분                  0
전용면적(㎡)               36
계약년월                   0
계약일                    0
보증금(만원)                0
월세(만원)                 0
층                     36
건축년도                 265
도로명                    0
계약기간                   0
계약구분                   0
갱신요구권 사용               0
종전계약 보증금 (만원)    1834104
종전계약 월세 (만원)     1834104
dtype: int64

In [None]:
df_default['전월세구분'].unique()

array(['전세', '월세'], dtype=object)

> '전월세구분' takes only two values, '전세' and '월세', so the data is easy to split with a condition.
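
Splitting on a two-valued column comes down to a boolean mask and its negation. A minimal sketch with made-up rows: the variable names `is_jeonse`, `full_rent`, and `month_rent` are illustrative only.

```python
import pandas as pd

rent = pd.DataFrame({
    "전월세구분": ["전세", "월세", "전세"],
    "보증금(만원)": ["35,000", "5,000", "20,000"],
    "월세(만원)": ["0", "120", "0"],
})

is_jeonse = rent["전월세구분"] == "전세"
full_rent = rent.loc[is_jeonse].copy()    # jeonse rows
month_rent = rent.loc[~is_jeonse].copy()  # every other row, i.e. monthly rent
```

Because the column has exactly two values, the two parts together cover every row with no overlap.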

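## Creating the jeonse DataFrame

- The process is nearly identical to apartment_deal, so the steps from the apartment_deal.csv section are combined into a single cell.
- The comments mark where intermediate results were checked.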

In [None]:
import numpy as np

# Build the jeonse DataFrame
df_full_rent = df_default.loc[df_default['전월세구분']=='전세',['시군구','본번','부번','도로명','계약년월','계약일','보증금(만원)','전용면적(㎡)','단지명']].copy()
df_full_rent.columns = ['address','main_number','sub_number','road','year_month','day','full_rent_price','area','name']


df_full_rent = df_full_rent.astype({'full_rent_price':'str','year_month':'str','day':'str'})
df_full_rent["full_rent_price"] = df_full_rent["full_rent_price"].str.replace(",", "") # strip the thousands separator ','
df_full_rent["day"] = df_full_rent["day"].str.zfill(2) # left-pad single-digit days with a zero
df_full_rent['year'] = df_full_rent['year_month'].str[0:4] # extract the year from the combined year-month column
df_full_rent['month'] = df_full_rent['year_month'].str[4:] # extract the month from the combined year-month column
df_full_rent['date'] = pd.to_datetime(df_full_rent['year']+df_full_rent['month']+df_full_rent['day'], format='%Y%m%d') # combine the parts into a date column
df_full_rent = df_full_rent.astype({'year':'int64','month':'int64','day':'int64','full_rent_price':'int64'})
df_full_rent = df_full_rent.drop(['year_month'], axis=1) # drop the column that is no longer needed

df_full_rent["address_0"] = df_full_rent["address"].str.split(' ',expand=True)[0] # city (시); always 서울특별시 in this data
df_full_rent["address_1"] = df_full_rent["address"].str.split(' ',expand=True)[1] # district (구)
df_full_rent["address_2"] = df_full_rent["address"].str.split(' ',expand=True)[2] # neighborhood (동)
df_full_rent["road_name"] = df_full_rent["road"].str.split(' ',expand=True)[0] # road name
df_full_rent["road_number"] = df_full_rent["road"].str.split(' ',expand=True)[1] # building number on the road
df_full_rent= df_full_rent[['year','month','day','address_0','address_1','address_2','main_number','sub_number','road_name','road_number','area',"full_rent_price",'name','date']] # keep only the columns to be used


df_full_rent = df_full_rent.replace('', np.nan) # turn '' values into nulls; np.nan is used, as in the apartment_deal section, because replace('', None) behaves differently across pandas versions

df_full_rent.loc[df_full_rent['name']=='힐스테이트 서초 젠트리스','main_number'] = 557
df_full_rent.loc[df_full_rent['name']=='힐스테이트 서초 젠트리스','sub_number'] = 0


df_full_rent = df_full_rent[['date','year','month','day','address_0','address_1','address_2','main_number','sub_number','name','area','full_rent_price']].copy()
df_full_rent.columns =['date','year','month','day','address_0','address_1','address_2','address_3','address_4','name','area','full_rent_price']

In [None]:
df_full_rent.isnull().sum()

date                0
year                0
month               0
day                 0
address_0           0
address_1           0
address_2           0
address_3           0
address_4           0
name                0
area               25
full_rent_price     0
dtype: int64

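### Handling missing values in the 'area' column

- Unlike the apartment_deal.csv case, the area column has missing values here, so a missing-value step is added.
- Each missing value is replaced with the most frequently traded area among the jeonse transactions at the same address.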
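
The rule in the list above, 'use the most frequently traded area at the same address', can also be written as a grouped operation. The sketch below is an alternative formulation and not the loop used in this notebook: the helper `fill_area_with_group_mode` is hypothetical and the sample rows are made up.

```python
import numpy as np
import pandas as pd

ADDRESS_KEYS = ["address_1", "address_2", "address_3", "address_4"]

def fill_area_with_group_mode(df, keys=ADDRESS_KEYS, col="area"):
    """Fill nulls in `col` with the most frequent value within each address group."""
    def most_common(values):
        counts = values.value_counts()  # nulls are not counted
        return counts.idxmax() if len(counts) else np.nan

    group_mode = df.groupby(keys)[col].transform(most_common)
    out = df.copy()
    out[col] = out[col].fillna(group_mode)
    return out

demo = pd.DataFrame({
    "address_1": ["강남구"] * 4 + ["강서구"],
    "address_2": ["개포동"] * 4 + ["화곡동"],
    "address_3": [655.0] * 4 + [29.0],
    "address_4": [2.0] * 4 + [47.0],
    "area": [77.75, 77.75, 59.0, np.nan, np.nan],
})
filled = fill_area_with_group_mode(demo)
```

An address with no other transactions keeps its null, so such rows still have to be removed separately.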

In [None]:
# Resolve the blanks in area
df_full_rent[df_full_rent['area'].isnull()].tail()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,full_rent_price
357440,2013-11-16,2013,11,16,서울특별시,노원구,공릉동,683.0,14.0,한일휴니스빌,,8000
375219,2013-11-30,2013,11,30,서울특별시,동대문구,장안동,312.0,8.0,태솔에버빌,,12000
389892,2013-01-17,2013,1,17,서울특별시,서대문구,창천동,501.0,14.0,삼성아트빌,,9000
439901,2013-01-20,2013,1,20,서울특별시,영등포구,영등포동4가,103.0,0.0,영등포그랑그루,,8000
490009,2014-02-19,2014,2,19,서울특별시,강서구,화곡동,29.0,47.0,드림하우스(29-47),,9500


In [None]:
# Collect the address columns of the rows whose area is null into lists
add_1 = list(df_full_rent.loc[df_full_rent['area'].isnull(),'address_1'])
add_2 = list(df_full_rent.loc[df_full_rent['area'].isnull(),'address_2'])
add_3 = list(df_full_rent.loc[df_full_rent['area'].isnull(),'address_3'])
add_4 = list(df_full_rent.loc[df_full_rent['area'].isnull(),'address_4'])
area_list = list()

In [None]:
# Build area_list
for i in range(len(add_1)):
    # If no transaction at this address has an 'area' value, there is nothing to impute from, so store ''
    if (len(df_full_rent.loc[(df_full_rent['address_1'] ==add_1[i]) &
                     (df_full_rent['address_2'] ==add_2[i]) &
                     (df_full_rent['address_3'] ==add_3[i]) &
                     (df_full_rent['address_4'] ==add_4[i]),
                     'area'].value_counts())) == 0:

        area_list.append('')
    else:
        # Otherwise use the area that was traded most often at this address
        area_list.append(df_full_rent.loc[(df_full_rent['address_1'] ==add_1[i]) &
                     (df_full_rent['address_2'] ==add_2[i]) &
                     (df_full_rent['address_3'] ==add_3[i]) &
                     (df_full_rent['address_4'] ==add_4[i]),
                     'area'].value_counts().idxmax())
print(area_list) # most frequently traded area for each address that has a null area

[84.9, 33.33, 15.94, 15.94, 84.98, 142.034, 142.034, 142.034, 142.034, 17.07, 17.07, 17.07, 17.07, 17.07, 64.52, 23.47, 23.47, 13.2195, 13.2195, 13.2195, 13.2195, 49.88, 39.28, 12.1, '']


> The '' at the end means that this listing has no other transaction to take an area from

In [None]:
# Check with len that all lists have the same length
print(len(add_1),len(add_2),len(add_3),len(add_4),len(area_list))

25 25 25 25 25


In [None]:
# The last entry is '', so look for other rows at that address that could supply an area -> none exist
# The area of this row cannot be recovered, so it has to be dropped later
df_full_rent.loc[(df_full_rent['address_3']==29)&(df_full_rent['address_4']==47),:] # look at the address of the row that has no reference area

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,full_rent_price
490009,2014-02-19,2014,2,19,서울특별시,강서구,화곡동,29.0,47.0,드림하우스(29-47),,9500


In [None]:
# Fill the null 'area' values with the most frequently traded area at the same address
for i in range(len(add_1)):
    df_full_rent.loc[(df_full_rent['address_1'] ==add_1[i]) &
                         (df_full_rent['address_2'] ==add_2[i]) &
                         (df_full_rent['address_3'] ==add_3[i]) &
                         (df_full_rent['address_4'] ==add_4[i]) &
                         (df_full_rent['area'].isnull()), # only touch the rows that are actually missing
                         'area']=area_list[i]

In [None]:
# Check that the unresolved row now holds '' instead of null
df_full_rent.loc[df_full_rent['area']=='',:]

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,full_rent_price
490009,2014-02-19,2014,2,19,서울특별시,강서구,화곡동,29.0,47.0,드림하우스(29-47),,9500


In [None]:
# Drop the rows whose area is ''
df_full_rent=df_full_rent.drop(df_full_rent[df_full_rent['area']==''].index)

# Check the result after dropping
df_full_rent.loc[df_full_rent['area']=='',:] # confirms the row was removed

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,full_rent_price


In [None]:
df_full_rent.info() # confirm that no null values remain

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1580989 entries, 0 to 2310692
Data columns (total 12 columns):
 #   Column           Non-Null Count    Dtype         
---  ------           --------------    -----         
 0   date             1580989 non-null  datetime64[ns]
 1   year             1580989 non-null  int64         
 2   month            1580989 non-null  int64         
 3   day              1580989 non-null  int64         
 4   address_0        1580989 non-null  object        
 5   address_1        1580989 non-null  object        
 6   address_2        1580989 non-null  object        
 7   address_3        1580989 non-null  float64       
 8   address_4        1580989 non-null  float64       
 9   name             1580989 non-null  object        
 10  area             1580989 non-null  object        
 11  full_rent_price  1580989 non-null  int64         
dtypes: datetime64[ns](1), float64(2), int64(4), object(5)
memory usage: 156.8+ MB


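> This dataframe consists of the 'date', 'year', 'month' and 'day' columns for the transaction date, the 'address_0', 'address_1', 'address_2', 'address_3' and 'address_4' columns for the address, the 'name' column for the apartment name, the 'area' column for the floor area, and the 'full_rent_price' column for the jeonse price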

In [None]:
df_full_rent.to_csv('/content/drive/MyDrive/house_price/after_data/apartment_full_rent.csv', index=False) # write the jeonse CSV file

## Building the monthly-rent dataframe

- See the jeonse dataframe section above

In [None]:
# Build the monthly-rent dataframe, keeping only the needed columns
df_month_rent = df_default.loc[df_default['전월세구분']=='월세',['시군구','본번','부번','도로명','계약년월','계약일','보증금(만원)','월세(만원)','전용면적(㎡)','단지명']].copy()
df_month_rent.columns = ['address','main_number','sub_number','road','year_month','day','rent_deposit','month_rent_price','area','name']
# df_month_rent.head()

df_month_rent.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 729704 entries, 25 to 2310693
Data columns (total 10 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   address           729704 non-null  object 
 1   main_number       729647 non-null  float64
 2   sub_number        729647 non-null  float64
 3   road              729704 non-null  object 
 4   year_month        729704 non-null  int64  
 5   day               729704 non-null  int64  
 6   rent_deposit      729704 non-null  object 
 7   month_rent_price  729704 non-null  object 
 8   area              729693 non-null  float64
 9   name              729704 non-null  object 
dtypes: float64(3), int64(2), object(5)
memory usage: 61.2+ MB


This is where the monthly-rent data differs from the jeonse part ↓

In [None]:
df_month_rent["month_rent_price2"] = df_month_rent["month_rent_price"].str.replace(',','')
df_month_rent.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 729704 entries, 25 to 2310693
Data columns (total 11 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   address            729704 non-null  object 
 1   main_number        729647 non-null  float64
 2   sub_number         729647 non-null  float64
 3   road               729704 non-null  object 
 4   year_month         729704 non-null  int64  
 5   day                729704 non-null  int64  
 6   rent_deposit       729704 non-null  object 
 7   month_rent_price   729704 non-null  object 
 8   area               729693 non-null  float64
 9   name               729704 non-null  object 
 10  month_rent_price2  665236 non-null  object 
dtypes: float64(3), int64(2), object(6)
memory usage: 66.8+ MB


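> Creating month_rent_price2 by applying replace to "month_rent_price" shows that replace does not process every row

>> df_month_rent["month_rent_price"].str.replace(',','')

>> After running it, the number of nulls in 'month_rent_price2' jumps (only 665236 of 729704 values are non-null) -> the replace method does not work as intended

>> Why? -> The column has object dtype and apparently mixes strings with non-string values; .str methods return NaN for any element that is not a string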
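The behaviour noted above can be reproduced on a small Series. This is a sketch under the assumption that the column mixes strings and integers; the notebook does not show the raw values, and "1,200" is an invented entry.

```python
import pandas as pd

# Assumed situation: an object column that mixes str and int values.
s = pd.Series(["1,200", 63, "35", 160], dtype="object")

# .str methods only operate on str elements; every other element becomes NaN.
bad = s.str.replace(",", "")

# Casting to str first turns every element into a string, so nothing is lost.
good = s.astype(str).str.replace(",", "").astype("int64")

print(bad.isna().sum())
print(good.tolist())
```

This is why casting the column with astype before calling replace avoids the extra nulls.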

In [None]:
# Cast the columns to str first, then continue
df_month_rent = df_month_rent.astype({'month_rent_price':'str','rent_deposit':'str'})

- The steps are almost the same as for apartment_deal, so they are combined into one cell
- Commented-out lines are intermediate checks

In [None]:
df_month_rent["rent_deposit"] = df_month_rent["rent_deposit"].str.replace(",", "")
df_month_rent["month_rent_price"] = df_month_rent["month_rent_price"].str.replace(',','')
df_month_rent = df_month_rent.astype({'year_month':'str','day':'str','rent_deposit':'int64','month_rent_price':'int64'})
df_month_rent['year'] = df_month_rent['year_month'].str[0:4] # take the year from the combined year-month column
df_month_rent['month'] = df_month_rent['year_month'].str[4:] # take the month from the combined year-month column
df_month_rent.loc[df_month_rent["day"].str.len()==1,"day"]='0'+df_month_rent.loc[df_month_rent["day"].str.len()==1,"day"] # left-pad single-digit days with a 0
df_month_rent['date'] = pd.to_datetime(df_month_rent['year']+df_month_rent['month']+df_month_rent['day']) # combine the parts into a date column
df_month_rent = df_month_rent.astype({'year':'int64','month':'int64','day':'int64'})
df_month_rent = df_month_rent.drop(['year_month'], axis=1) # drop columns that are no longer used

df_month_rent["address_0"] = df_month_rent["address"].str.split(' ',expand=True)[0] # extract the city (always Seoul in this dataset)
df_month_rent["address_1"] = df_month_rent["address"].str.split(' ',expand=True)[1] # extract the district (gu)
df_month_rent["address_2"] = df_month_rent["address"].str.split(' ',expand=True)[2] # extract the neighborhood (dong)
df_month_rent["road_name"] = df_month_rent["road"].str.split(' ',expand=True)[0] # extract the road name
df_month_rent["road_number"] = df_month_rent["road"].str.split(' ',expand=True)[1] # extract the road number
df_month_rent= df_month_rent[['year','month','day','address_0','address_1','address_2','main_number','sub_number','road_name','road_number','area',"rent_deposit","month_rent_price",'name','date']] # keep only the columns that will be used


df_month_rent = df_month_rent.replace('', None) # turn empty-string values into nulls

df_month_rent.loc[df_month_rent['name']=='힐스테이트 서초 젠트리스','main_number'] = 557
df_month_rent.loc[df_month_rent['name']=='힐스테이트 서초 젠트리스','sub_number'] = 0

df_month_rent = df_month_rent[['date','year','month','day','address_0','address_1','address_2','main_number','sub_number','name','area','rent_deposit','month_rent_price']]
df_month_rent.columns =['date','year','month','day','address_0','address_1','address_2','address_3','address_4','name','area','rent_deposit','month_rent_price']

In [None]:
df_month_rent.isnull().sum()

date                 0
year                 0
month                0
day                  0
address_0            0
address_1            0
address_2            0
address_3            0
address_4            0
name                 0
area                11
rent_deposit         0
month_rent_price     0
dtype: int64

### Handling missing values in the 'area' column

- See the jeonse 'area' missing-value section above

In [None]:
add_1 = list(df_month_rent.loc[df_month_rent['area'].isnull(),'address_1'])
add_2 = list(df_month_rent.loc[df_month_rent['area'].isnull(),'address_2'])
add_3 = list(df_month_rent.loc[df_month_rent['area'].isnull(),'address_3'])
add_4 = list(df_month_rent.loc[df_month_rent['area'].isnull(),'address_4'])
area_list = list()
# Build area_list
for i in range(len(add_1)):
    # If no transaction at this address has an 'area' value, there is nothing to impute from, so store ''
    if (len(df_month_rent.loc[(df_month_rent['address_1'] ==add_1[i]) &
                     (df_month_rent['address_2'] ==add_2[i]) &
                     (df_month_rent['address_3'] ==add_3[i]) &
                     (df_month_rent['address_4'] ==add_4[i]),
                     'area'].value_counts())) == 0:

        area_list.append('')
    else:
        # Otherwise use the area that was traded most often at this address
        area_list.append(df_month_rent.loc[(df_month_rent['address_1'] ==add_1[i]) &
                     (df_month_rent['address_2'] ==add_2[i]) &
                     (df_month_rent['address_3'] ==add_3[i]) &
                     (df_month_rent['address_4'] ==add_4[i]),
                     'area'].value_counts().idxmax())


for i in range(len(add_1)):
    df_month_rent.loc[(df_month_rent['address_1'] ==add_1[i]) &
                         (df_month_rent['address_2'] ==add_2[i]) &
                         (df_month_rent['address_3'] ==add_3[i]) &
                         (df_month_rent['address_4'] ==add_4[i]) &
                         (df_month_rent['area'].isnull()), # only touch the rows that are actually missing
                         'area']=area_list[i]



In [None]:
df_month_rent.isnull().sum()

date                0
year                0
month               0
day                 0
address_0           0
address_1           0
address_2           0
address_3           0
address_4           0
name                0
area                0
rent_deposit        0
month_rent_price    0
dtype: int64

In [None]:
df_month_rent.head()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,rent_deposit,month_rent_price
25,2011-03-18,2011,3,18,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,19000,63
28,2011-04-09,2011,4,9,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,21000,35
38,2011-07-09,2011,7,9,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,3000,160
46,2011-09-19,2011,9,19,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,6000,140
47,2011-09-20,2011,9,20,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,5000,160


In [None]:
df_month_rent.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 729704 entries, 25 to 2310693
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   date              729704 non-null  datetime64[ns]
 1   year              729704 non-null  int64         
 2   month             729704 non-null  int64         
 3   day               729704 non-null  int64         
 4   address_0         729704 non-null  object        
 5   address_1         729704 non-null  object        
 6   address_2         729704 non-null  object        
 7   address_3         729704 non-null  float64       
 8   address_4         729704 non-null  float64       
 9   name              729704 non-null  object        
 10  area              729704 non-null  float64       
 11  rent_deposit      729704 non-null  int64         
 12  month_rent_price  729704 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(5), object(4)
mem

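> This dataframe consists of the 'date', 'year', 'month' and 'day' columns for the transaction date, the 'address_0', 'address_1', 'address_2', 'address_3' and 'address_4' columns for the address, the 'name' column for the apartment name, the 'area' column for the floor area, the 'rent_deposit' column for the deposit, and the 'month_rent_price' column for the monthly rent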

In [None]:
df_month_rent.to_csv('/content/drive/MyDrive/house_price/after_data/apartment_month_rent.csv', index=False)

# Cleaning the real-estate index data

In [None]:
import pandas as pd
df_seoul_index = pd.read_excel('/content/drive/MyDrive/house_price/original_data/주간 아파트 매매가격지수_20231027.xlsx')
df_seoul_index

Unnamed: 0,지역명,2008-04-07 00:00:52,2008-04-14 00:00:52,2008-04-21 00:00:52,2008-04-28 00:00:52,2008-05-05 00:00:52,2008-05-12 00:00:52,2008-05-19 00:00:52,2008-05-26 00:00:52,2008-06-02 00:00:52,...,2023-08-14 00:00:52,2023-08-21 00:00:52,2023-08-28 00:00:52,2023-09-04 00:00:52,2023-09-11 00:00:52,2023-09-18 00:00:52,2023-09-25 00:00:52,2023-10-09 00:00:52,2023-10-16 00:00:52,2023-10-23 00:00:52
0,전국,58.013,58.089,58.198,58.281,58.355,58.427,58.493,58.553,58.616,...,90.012,90.026,90.05,90.065,90.088,90.112,90.135,90.152,90.172,90.187
1,서울,59.733,59.882,60.042,60.163,60.278,60.387,60.458,60.526,60.595,...,90.339,90.408,90.476,90.511,90.578,90.635,90.681,90.733,90.787,90.82
2,강북14개구,59.739,59.98,60.25,60.444,60.625,60.796,60.909,61.037,61.156,...,88.503,88.522,88.554,88.544,88.565,88.595,88.603,88.615,88.646,88.673
3,강남11개구,59.899,59.966,60.029,60.086,60.143,60.197,60.231,60.245,60.271,...,92.026,92.142,92.245,92.322,92.433,92.515,92.596,92.685,92.762,92.801
4,수도권,62.334,62.457,62.638,62.771,62.898,63.014,63.108,63.198,63.297,...,87.127,87.17,87.219,87.261,87.311,87.359,87.395,87.436,87.483,87.511
5,6개광역시,51.272,51.335,51.404,51.467,51.519,51.562,51.614,51.645,51.697,...,88.399,88.382,88.381,88.359,88.359,88.364,88.361,88.338,88.327,88.318
6,5개광역시,49.125,49.16,49.192,49.214,49.237,49.254,49.27,49.277,49.297,...,89.515,89.49,89.485,89.457,89.455,89.457,89.452,89.423,89.405,89.395
7,기타지방,62.411,62.436,62.483,62.537,62.557,62.603,62.678,62.743,62.782,...,96.49,96.483,96.483,96.482,96.476,96.474,96.496,96.505,96.506,96.515
8,부산,46.536,46.594,46.66,46.707,46.743,46.781,46.813,46.834,46.883,...,90.657,90.625,90.61,90.569,90.521,90.49,90.467,90.389,90.338,90.301
9,대구,51.373,51.353,51.331,51.313,51.294,51.273,51.271,51.247,51.221,...,84.816,84.768,84.738,84.667,84.661,84.651,84.629,84.598,84.558,84.54


In [None]:
df_seoul_index = df_seoul_index.loc[1,:]
df_seoul_index = df_seoul_index.reset_index()
df_seoul_index

Unnamed: 0,index,1
0,지역명,서울
1,2008-04-07 00:00:52,59.733
2,2008-04-14 00:00:52,59.882
3,2008-04-21 00:00:52,60.042
4,2008-04-28 00:00:52,60.163
...,...,...
779,2023-09-18 00:00:52,90.635
780,2023-09-25 00:00:52,90.681
781,2023-10-09 00:00:52,90.733
782,2023-10-16 00:00:52,90.787


In [None]:
df_seoul_index = df_seoul_index.loc[1:,:]
df_seoul_index.columns =['date','seoul_index']
df_seoul_index = df_seoul_index.reset_index(drop=True)
print(df_seoul_index.info())
df_seoul_index

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 783 entries, 0 to 782
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   date         783 non-null    object
 1   seoul_index  783 non-null    object
dtypes: object(2)
memory usage: 12.4+ KB
None


Unnamed: 0,date,seoul_index
0,2008-04-07 00:00:52,59.733
1,2008-04-14 00:00:52,59.882
2,2008-04-21 00:00:52,60.042
3,2008-04-28 00:00:52,60.163
4,2008-05-05 00:00:52,60.278
...,...,...
778,2023-09-18 00:00:52,90.635
779,2023-09-25 00:00:52,90.681
780,2023-10-09 00:00:52,90.733
781,2023-10-16 00:00:52,90.787


In [None]:
df_seoul_index['date'] = df_seoul_index['date'].astype('str')
df_seoul_index['seoul_index'] = df_seoul_index['seoul_index'].astype('float')
print(df_seoul_index.info())

df_seoul_index

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 783 entries, 0 to 782
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   date         783 non-null    object 
 1   seoul_index  783 non-null    float64
dtypes: float64(1), object(1)
memory usage: 12.4+ KB
None


Unnamed: 0,date,seoul_index
0,2008-04-07 00:00:52,59.733
1,2008-04-14 00:00:52,59.882
2,2008-04-21 00:00:52,60.042
3,2008-04-28 00:00:52,60.163
4,2008-05-05 00:00:52,60.278
...,...,...
778,2023-09-18 00:00:52,90.635
779,2023-09-25 00:00:52,90.681
780,2023-10-09 00:00:52,90.733
781,2023-10-16 00:00:52,90.787


In [None]:
df_seoul_index['date'] = df_seoul_index['date'].str.split(' ').str[0]
print(df_seoul_index.info())

df_seoul_index

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 783 entries, 0 to 782
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   date         783 non-null    object 
 1   seoul_index  783 non-null    float64
dtypes: float64(1), object(1)
memory usage: 12.4+ KB
None


Unnamed: 0,date,seoul_index
0,2008-04-07,59.733
1,2008-04-14,59.882
2,2008-04-21,60.042
3,2008-04-28,60.163
4,2008-05-05,60.278
...,...,...
778,2023-09-18,90.635
779,2023-09-25,90.681
780,2023-10-09,90.733
781,2023-10-16,90.787


In [None]:
import pandas as pd

# The date column already holds 'YYYY-MM-DD' strings at this point, so convert it to datetime
df_seoul_index['date'] = pd.to_datetime(df_seoul_index['date'], format='%Y-%m-%d')

# Print the converted DataFrame
print(df_seoul_index.info())

df_seoul_index

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 783 entries, 0 to 782
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   date         783 non-null    datetime64[ns]
 1   seoul_index  783 non-null    float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 12.4 KB
None


Unnamed: 0,date,seoul_index
0,2008-04-07,59.733
1,2008-04-14,59.882
2,2008-04-21,60.042
3,2008-04-28,60.163
4,2008-05-05,60.278
...,...,...
778,2023-09-18,90.635
779,2023-09-25,90.681
780,2023-10-09,90.733
781,2023-10-16,90.787


In [None]:
df_seoul_index.to_pickle('/content/drive/MyDrive/house_price/after_data/df_seoul_index.pkl')

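# Creating the economic_data.csv file

- Build the economic_data file (macroeconomic information)
- economic_data contains the Korean base rate, real-estate index, KOSPI index, Korean government bond yield, US government bond yield, long-short yield spread, apartment sales supply, number of unsold apartments, and unsold-apartment rate

## Building the base-rate dataframe

- 'https://www.bok.or.kr/portal/singl/baseRate/list.do?dataSeCd=01&menuNo=200643' lists the dates on which the base rate changed, so the information is taken from that site

### Inspecting the dataframe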

In [None]:
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/house_price/original_data/korean_rp.csv',encoding='cp949')
print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   년도      55 non-null     int64  
 1   월일      55 non-null     object 
 2   기준금리    55 non-null     float64
dtypes: float64(1), int64(1), object(1)
memory usage: 1.4+ KB
None


Unnamed: 0,년도,월일,기준금리
0,2023,01월 13일,3.5
1,2022,11월 24일,3.25
2,2022,10월 12일,3.0
3,2022,08월 25일,2.5
4,2022,07월 13일,2.25


In [None]:
df['month']=df['월일'].str[0:2] # extract the month
df['day'] = df['월일'].str[4:6] # extract the day
df.head()

Unnamed: 0,년도,월일,기준금리,month,day
0,2023,01월 13일,3.5,1,13
1,2022,11월 24일,3.25,11,24
2,2022,10월 12일,3.0,10,12
3,2022,08월 25일,2.5,8,25
4,2022,07월 13일,2.25,7,13


In [None]:
df['rp_date'] = df['년도'].astype('str')+df['month']+df['day'] # build a new YYYYMMDD string column
df.head()

Unnamed: 0,년도,월일,기준금리,month,day,rp_date
0,2023,01월 13일,3.5,1,13,20230113
1,2022,11월 24일,3.25,11,24,20221124
2,2022,10월 12일,3.0,10,12,20221012
3,2022,08월 25일,2.5,8,25,20220825
4,2022,07월 13일,2.25,7,13,20220713


In [None]:
df = df.rename(columns={'기준금리':'korea_rp'})
df = df.drop(['년도','월일','month','day'], axis=1) # drop unused columns
df=df.sort_index(ascending=False) # the rows are in reverse chronological order, so flip them
df['rp_date'] = pd.to_datetime(df['rp_date'], format='%Y%m%d') # rp_date is a YYYYMMDD string, so parse it with a matching format
print(df.info())
df

<class 'pandas.core.frame.DataFrame'>
Int64Index: 55 entries, 54 to 0
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   korea_rp  55 non-null     float64       
 1   rp_date   55 non-null     datetime64[ns]
dtypes: datetime64[ns](1), float64(1)
memory usage: 1.3 KB
None


Unnamed: 0,korea_rp,rp_date
54,4.75,1999-05-06
53,5.0,2000-02-10
52,5.25,2000-10-05
51,5.0,2001-02-08
50,4.75,2001-07-05
49,4.5,2001-08-09
48,4.0,2001-09-19
47,4.25,2002-05-07
46,4.0,2003-05-13
45,3.75,2003-07-10


### Generating the base rate for the days between rate-change dates

- The dataframe above holds only the rate-change dates and the new rate on each of them. A base rate is needed for every calendar day, so the rate for the days in between is generated
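The idea can be sketched compactly with `pd.date_range`, a left merge and a forward fill. This is an illustrative alternative rather than the notebook's own code; the two rate changes are taken from the table above, while the end date 2023-01-15 is chosen arbitrarily to keep the example small.

```python
import pandas as pd

# Two rate-change rows taken from the table above.
changes = pd.DataFrame({
    "rp_date": pd.to_datetime(["2022-11-24", "2023-01-13"]),
    "korea_rp": [3.25, 3.50],
})

# One row per calendar day; date_range includes both endpoints.
days = pd.DataFrame({"date": pd.date_range("2022-11-24", "2023-01-15", freq="D")})

# The left join keeps every day; only the change dates receive a rate.
daily = days.merge(changes, left_on="date", right_on="rp_date", how="left")

# Forward fill carries each rate until the next change date.
daily = daily[["date", "korea_rp"]].ffill()
```

Note that `pd.date_range` includes the end date, whereas a list built with `range(0, (end-start).days)` stops one day before it.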

In [None]:
import datetime

# Compute every date in the period covered by the collected data
start = datetime.datetime.strptime("06-05-1999", "%d-%m-%Y") # start date
end = datetime.datetime.strptime("31-10-2023", "%d-%m-%Y") # end date
date_generated = [start + datetime.timedelta(days=x) for x in range(0, (end-start).days)] # every date from start up to, but not including, end
date_list=list()
for date in date_generated:
    date_list.append(date.strftime("%Y-%m-%d")) # store each generated date as a 'YYYY-MM-DD' string

In [None]:
# df_date is a dataframe holding every date to look up
df_date = pd.DataFrame({
    "date": date_list
}, columns=["date"])
df_date['date'] = pd.to_datetime(df_date['date'], format='%Y-%m-%d', errors='raise') # convert to datetime; the strings carry no time part

In [None]:
df_date.head() # check the shape of the dataframe

Unnamed: 0,date
0,1999-05-06
1,1999-05-07
2,1999-05-08
3,1999-05-09
4,1999-05-10


In [None]:
# Left-join the two dataframes to get the base rate by date
df_rp=pd.merge(df_date, df, left_on='date', right_on='rp_date', how='left')

In [None]:
# Keep only the columns that will be used
df_rp = df_rp[['date','korea_rp']]
df_rp # check the resulting dataframe

Unnamed: 0,date,korea_rp
0,1999-05-06,4.75
1,1999-05-07,
2,1999-05-08,
3,1999-05-09,
4,1999-05-10,
...,...,...
8939,2023-10-26,
8940,2023-10-27,
8941,2023-10-28,
8942,2023-10-29,


In [None]:
# A base rate stays in force until the next change, so each null is filled with the last value above it (the most recent rate change)
# This yields the base rate for every day
df_rp=df_rp.ffill() # ffill() propagates the last non-null value forward
print(df_rp.info())
df_rp

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8944 entries, 0 to 8943
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   date      8944 non-null   datetime64[ns]
 1   korea_rp  8944 non-null   float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 209.6 KB
None


Unnamed: 0,date,korea_rp
0,1999-05-06,4.75
1,1999-05-07,4.75
2,1999-05-08,4.75
3,1999-05-09,4.75
4,1999-05-10,4.75
...,...,...
8939,2023-10-26,3.50
8940,2023-10-27,3.50
8941,2023-10-28,3.50
8942,2023-10-29,3.50


In [None]:
# Plot the base rate over time
# x axis: date, y axis: base rate
import plotly.graph_objects as go

# Create traces
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_rp['date'], y=df_rp['korea_rp'],
                    mode='lines',
                    name='korea_rp',yaxis='y1'))


fig.show(renderer="colab")

## Adding the Seoul apartment index

In [None]:
import pandas as pd
df_seoul_index =pd.read_pickle('/content/drive/MyDrive/house_price/after_data/df_seoul_index.pkl')
df_seoul_index.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 783 entries, 0 to 782
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   date         783 non-null    datetime64[ns]
 1   seoul_index  783 non-null    float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 12.4 KB


In [None]:
df_final = pd.merge(df_rp, df_seoul_index, on = 'date', how = 'left')
print(df_final.info())
df_final

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8944 entries, 0 to 8943
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   date         8944 non-null   datetime64[ns]
 1   korea_rp     8944 non-null   float64       
 2   seoul_index  783 non-null    float64       
dtypes: datetime64[ns](1), float64(2)
memory usage: 279.5 KB
None


Unnamed: 0,date,korea_rp,seoul_index
0,1999-05-06,4.75,
1,1999-05-07,4.75,
2,1999-05-08,4.75,
3,1999-05-09,4.75,
4,1999-05-10,4.75,
...,...,...,...
8939,2023-10-26,3.50,
8940,2023-10-27,3.50,
8941,2023-10-28,3.50,
8942,2023-10-29,3.50,


In [None]:
df_final.loc[df_final['seoul_index'].notnull(),:]

Unnamed: 0,date,korea_rp,seoul_index
3259,2008-04-07,5.0,59.733
3266,2008-04-14,5.0,59.882
3273,2008-04-21,5.0,60.042
3280,2008-04-28,5.0,60.163
3287,2008-05-05,5.0,60.278
...,...,...,...
8901,2023-09-18,3.5,90.635
8908,2023-09-25,3.5,90.681
8922,2023-10-09,3.5,90.733
8929,2023-10-16,3.5,90.787


In [None]:
df_final = df_final.loc[df_final['date']>='2008-04-07',:]
df_final.head(10)

Unnamed: 0,date,korea_rp,seoul_index
3259,2008-04-07,5.0,59.733
3260,2008-04-08,5.0,
3261,2008-04-09,5.0,
3262,2008-04-10,5.0,
3263,2008-04-11,5.0,
3264,2008-04-12,5.0,
3265,2008-04-13,5.0,
3266,2008-04-14,5.0,59.882
3267,2008-04-15,5.0,
3268,2008-04-16,5.0,


In [None]:
df_final = df_final.copy() # work on an independent copy so the assignments below do not raise SettingWithCopyWarning
df_final['seoul_index'] = df_final['seoul_index'].interpolate(method='values') # fill the days between weekly observations linearly along the index
df_final['seoul_index'] = round(df_final['seoul_index'],2)
print(df_final.info())
df_final.head(10)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5685 entries, 3259 to 8943
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   date         5685 non-null   datetime64[ns]
 1   korea_rp     5685 non-null   float64       
 2   seoul_index  5685 non-null   float64       
dtypes: datetime64[ns](1), float64(2)
memory usage: 177.7 KB
None


Unnamed: 0,date,korea_rp,seoul_index
3259,2008-04-07,5.0,59.73
3260,2008-04-08,5.0,59.75
3261,2008-04-09,5.0,59.78
3262,2008-04-10,5.0,59.8
3263,2008-04-11,5.0,59.82
3264,2008-04-12,5.0,59.84
3265,2008-04-13,5.0,59.86
3266,2008-04-14,5.0,59.88
3267,2008-04-15,5.0,59.9
3268,2008-04-16,5.0,59.93
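The weekly-to-daily step above can also be expressed with a DatetimeIndex. The sketch below is one possible formulation, not the notebook's method: it uses the first two weekly observations from the index table and interpolates by elapsed time instead of by row position.

```python
import pandas as pd

# First two weekly observations of the Seoul index, as listed in the notebook.
weekly = pd.DataFrame({
    "date": pd.to_datetime(["2008-04-07", "2008-04-14"]),
    "seoul_index": [59.733, 59.882],
})

daily = (
    weekly.set_index("date")
          .resample("D").asfreq()        # insert the missing calendar days as NaN
          .interpolate(method="time")    # linear in elapsed time between observations
          .round(2)
          .reset_index()
)
print(daily)
```

With evenly spaced daily rows, interpolating by time and interpolating by index position give the same values, so the result lines up with the table above.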


## Trimming the dataframe period

- Jeonse and monthly-rent data exist only from 2011 onward, so the dataframe is cut to 2011 and later

In [None]:
df_final = df_final[(df_final['date']>='2011-01-01') ] # keep only the dates that will be used
print(df_final.info())
df_final

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4686 entries, 4258 to 8943
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   date         4686 non-null   datetime64[ns]
 1   korea_rp     4686 non-null   float64       
 2   seoul_index  4686 non-null   float64       
dtypes: datetime64[ns](1), float64(2)
memory usage: 146.4 KB
None


Unnamed: 0,date,korea_rp,seoul_index
4258,2011-01-01,2.5,59.42
4259,2011-01-02,2.5,59.43
4260,2011-01-03,2.5,59.43
4261,2011-01-04,2.5,59.43
4262,2011-01-05,2.5,59.44
...,...,...,...
8939,2023-10-26,3.5,90.82
8940,2023-10-27,3.5,90.82
8941,2023-10-28,3.5,90.82
8942,2023-10-29,3.5,90.82


### Comparing the base rate (inverted) with the real-estate index

In [None]:
# Create traces
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['korea_rp'],
                    mode='lines',
                    name='korea_rp',yaxis='y1'))
# Reverse the y axis so the base rate is plotted upside down
fig.update_layout(
    yaxis = dict(autorange="reversed")
)


fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['seoul_index'],
                    mode='lines',
                    name='seoul_index',
                    yaxis="y2"))
fig.update_layout(

   # create first Y-axis
   yaxis=dict(
      title=dict(text="rp point", font=dict(color="blue")), # title.font replaces the deprecated titlefont attribute
      tickfont=dict(color="blue")
   ),

   # create second Y-axis
   yaxis2=dict(
      title="seoul_index",
      overlaying="y",
      side="right")
)

> The inverted base rate and the real-estate index appear to move together

## Adding the KOSPI index data

In [None]:
df_kospi = pd.read_csv("/content/drive/MyDrive/house_price/original_data/kospi.csv",  encoding='UTF8') # load the KOSPI index data
df_kospi.head()

Unnamed: 0,날짜,종가,시가,고가,저가,거래량,변동 %
0,2023- 11- 10,2409.66,2406.4,2413.62,2393.64,312.47M,-0.72%
1,2023- 11- 09,2427.08,2425.93,2437.9,2413.04,395.03M,0.23%
2,2023- 11- 08,2421.62,2460.22,2468.43,2418.14,467.22M,-0.91%
3,2023- 11- 07,2443.96,2476.35,2476.35,2418.74,457.68M,-2.33%
4,2023- 11- 06,2502.37,2399.8,2502.37,2395.03,528.58M,5.66%


In [None]:
df_kospi=df_kospi.sort_index(ascending=False) # the rows are in reverse chronological order, so flip them
df_kospi.reset_index(drop=True, inplace=True) # reset the index
df_kospi.head()

Unnamed: 0,날짜,종가,시가,고가,저가,거래량,변동 %
0,2007- 01- 02,1435.26,1438.89,1439.71,1430.06,147.74M,0.06%
1,2007- 01- 03,1409.35,1436.42,1437.79,1409.31,203.21M,-1.81%
2,2007- 01- 04,1397.29,1410.55,1411.12,1388.5,241.17M,-0.86%
3,2007- 01- 05,1385.76,1398.6,1400.59,1372.36,277.29M,-0.83%
4,2007- 01- 08,1370.81,1376.76,1384.65,1366.48,177.59M,-1.08%


In [None]:
df_kospi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4168 entries, 0 to 4167
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   날짜      4168 non-null   object
 1   종가      4168 non-null   object
 2   시가      4168 non-null   object
 3   고가      4168 non-null   object
 4   저가      4168 non-null   object
 5   거래량     4168 non-null   object
 6   변동 %    4168 non-null   object
dtypes: object(7)
memory usage: 228.1+ KB


In [None]:
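# Keep only the needed columns, rename them, and convert the type
df_kospi = df_kospi[['날짜','종가']].copy() # copy so the column assignments below act on an independent frame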
df_kospi.columns = ['kospi_date','kospi_index']
df_kospi["kospi_date"] = pd.to_datetime(df_kospi["kospi_date"])
df_kospi.head()

Unnamed: 0,kospi_date,kospi_index
0,2007-01-02,1435.26
1,2007-01-03,1409.35
2,2007-01-04,1397.29
3,2007-01-05,1385.76
4,2007-01-08,1370.81


In [None]:
df_kospi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4168 entries, 0 to 4167
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   kospi_date   4168 non-null   datetime64[ns]
 1   kospi_index  4168 non-null   object        
dtypes: datetime64[ns](1), object(1)
memory usage: 65.2+ KB


In [None]:
# Convert kospi_index to a numeric type so it can be used in later calculations
df_kospi["kospi_index"] = df_kospi["kospi_index"].str.replace(",", "") # the values are strings, so strip the thousands separator
df_kospi = df_kospi.astype({'kospi_index': 'float64'}) # change the column type
df_kospi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4168 entries, 0 to 4167
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   kospi_date   4168 non-null   datetime64[ns]
 1   kospi_index  4168 non-null   float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 65.2 KB


In [None]:
df_kospi.head() # check the shape of the dataframe

Unnamed: 0,kospi_date,kospi_index
0,2007-01-02,1435.26
1,2007-01-03,1409.35
2,2007-01-04,1397.29
3,2007-01-05,1385.76
4,2007-01-08,1370.81


### Merging with the KOSPI index data

In [None]:
# Merge the base-rate and real-estate-index dataframe with the KOSPI dataframe
df_final=pd.merge(df_final, df_kospi, left_on='date', right_on='kospi_date', how='left') # left-join the two dataframes
df_final.head()

Unnamed: 0,date,korea_rp,seoul_index,kospi_date,kospi_index
0,2011-01-01,2.5,59.42,NaT,
1,2011-01-02,2.5,59.43,NaT,
2,2011-01-03,2.5,59.43,2011-01-03,2070.08
3,2011-01-04,2.5,59.43,2011-01-04,2085.14
4,2011-01-05,2.5,59.44,2011-01-05,2082.55


In [None]:
df_final.info() # check -> kospi_date and kospi_index contain nulls on weekends and other market holidays

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4686 entries, 0 to 4685
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   date         4686 non-null   datetime64[ns]
 1   korea_rp     4686 non-null   float64       
 2   seoul_index  4686 non-null   float64       
 3   kospi_date   3161 non-null   datetime64[ns]
 4   kospi_index  3161 non-null   float64       
dtypes: datetime64[ns](2), float64(3)
memory usage: 219.7 KB


In [None]:
# Assume the index keeps its previous value on days the market is closed,
# so fill the nulls with the last preceding value
df_final["kospi_index"]=df_final["kospi_index"].ffill()
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4686 entries, 0 to 4685
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   date         4686 non-null   datetime64[ns]
 1   korea_rp     4686 non-null   float64       
 2   seoul_index  4686 non-null   float64       
 3   kospi_date   3161 non-null   datetime64[ns]
 4   kospi_index  4684 non-null   float64       
dtypes: datetime64[ns](2), float64(3)
memory usage: 219.7 KB
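
The two nulls that remain after the forward fill (4684 of 4686 rows) can be reproduced on a small scale. The sketch below uses a hypothetical five-day calendar and three trading-day quotes taken from the output above; `calendar`, `quotes`, and `merged` are illustrative names only.

```python
import pandas as pd

# Hypothetical daily calendar (starts on a weekend) and trading-day-only quotes
calendar = pd.DataFrame({"date": pd.date_range("2011-01-01", "2011-01-05")})
quotes = pd.DataFrame({
    "kospi_date": pd.to_datetime(["2011-01-03", "2011-01-04", "2011-01-05"]),
    "kospi_index": [2070.08, 2085.14, 2082.55],
})

# Left join keeps every calendar day; days without a quote get NaN
merged = calendar.merge(quotes, left_on="date", right_on="kospi_date", how="left")

# Forward fill carries the last close across closed days;
# rows before the first quote have nothing to carry and stay NaN
merged["kospi_index"] = merged["kospi_index"].ffill()
n_leading_nan = int(merged["kospi_index"].isna().sum())
print(n_leading_nan)
```

Those leading rows need either a backward fill or a manually supplied value.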


In [None]:
# The nulls at the very top have no earlier value, so fill them with a figure looked up manually (via a Naver search)
df_final["kospi_index"] = df_final["kospi_index"].fillna(2051)
df_final.info() # confirm the values were filled in

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4686 entries, 0 to 4685
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   date         4686 non-null   datetime64[ns]
 1   korea_rp     4686 non-null   float64       
 2   seoul_index  4686 non-null   float64       
 3   kospi_date   3161 non-null   datetime64[ns]
 4   kospi_index  4686 non-null   float64       
dtypes: datetime64[ns](2), float64(3)
memory usage: 219.7 KB


In [None]:
df_final.head() # check the result

Unnamed: 0,date,korea_rp,seoul_index,kospi_date,kospi_index
0,2011-01-01,2.5,59.42,NaT,2051.0
1,2011-01-02,2.5,59.43,NaT,2051.0
2,2011-01-03,2.5,59.43,2011-01-03,2070.08
3,2011-01-04,2.5,59.43,2011-01-04,2085.14
4,2011-01-05,2.5,59.44,2011-01-05,2082.55


In [None]:
# Keep only the columns to be used
df_final = df_final[['date','korea_rp','seoul_index','kospi_index']]
df_final.head()

Unnamed: 0,date,korea_rp,seoul_index,kospi_index
0,2011-01-01,2.5,59.42,2051.0
1,2011-01-02,2.5,59.43,2051.0
2,2011-01-03,2.5,59.43,2070.08
3,2011-01-04,2.5,59.43,2085.14
4,2011-01-05,2.5,59.44,2082.55


### Checking the usefulness of the KOSPI index with a plot

In [None]:
# Create traces
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['kospi_index'],
                    mode='lines',
                    name='kospi_index',yaxis='y1'))



fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['seoul_index'],
                    mode='lines',
                    name='seoul_index',
                    yaxis="y2"))
fig.update_layout(

   # create first Y-axis
   yaxis=dict(
      title="kospi index",
      titlefont=dict(color="blue"),
      tickfont=dict(color="blue")
   ),

   # create second Y-axis
   yaxis2=dict(
      title="seoul index",
      overlaying="y",
      side="right")
)

> Is there any meaningful correlation between the KOSPI index and the real-estate index? It is hard to tell from the plot alone.
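
Since the plot is inconclusive, one option is to put a number on the relationship with a Pearson correlation. The sketch below is only an illustration on a hypothetical toy frame (`toy` and its values are made up to stand in for `df_final`); it compares the levels and also the period-over-period changes, because two trending series can correlate strongly in levels without moving together from day to day.

```python
import pandas as pd

# Hypothetical toy series standing in for df_final's two columns
toy = pd.DataFrame({
    "kospi_index": [2000.0, 2050.0, 2100.0, 2080.0, 2150.0],
    "seoul_index": [59.0, 59.5, 60.0, 60.2, 61.0],
})

# Pearson correlation of the raw levels
level_corr = toy["kospi_index"].corr(toy["seoul_index"])

# Correlation of percentage changes, which is less affected by a shared trend
change_corr = toy["kospi_index"].pct_change().corr(toy["seoul_index"].pct_change())
print(round(level_corr, 3), round(change_corr, 3))
```

On the real data the same two calls would be applied to `df_final['kospi_index']` and `df_final['seoul_index']`.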

## Adding Korean government bond yield data

- Almost the same process as building the KOSPI DataFrame

In [None]:
import os


dir_path = "/content/drive/MyDrive/house_price/original_data/korean_bond"
file_list = os.listdir(dir_path)
file_list.sort()
name_list = list()
df_list = list()

# Read every CSV file in the folder and collect the DataFrames in a list
for csv_file in file_list:
    df_list.append(pd.read_csv(dir_path+"/"+csv_file , encoding='UTF8'))
    name_list.append(csv_file.split('.')[0])
for i in range(len(df_list)):
    df_korea = df_list[i] # take the i-th loaded DataFrame
    df_korea=df_korea.sort_index(ascending=False) # the dates are in reverse order, so reverse the rows
    df_korea.reset_index(drop=True, inplace=True) # reset the index
    df_korea = df_korea[['날짜','종가']]
    df_korea.columns = ['korea_date',name_list[i]]
    df_korea['korea_date'] = pd.to_datetime(df_korea['korea_date'])
    df_final=pd.merge(df_final, df_korea, left_on='date', right_on='korea_date', how='left')
    df_final[name_list[i]]=df_final[name_list[i]].ffill() # fill weekends and holidays with the previous value
    df_final[name_list[i]]=df_final[name_list[i]].bfill() # fill the leading rows with the nearest later value
    df_final = df_final.drop(['korea_date'], axis=1)

In [None]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4686 entries, 0 to 4685
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   date           4686 non-null   datetime64[ns]
 1   korea_rp       4686 non-null   float64       
 2   seoul_index    4686 non-null   float64       
 3   kospi_index    4686 non-null   float64       
 4   korea_10_year  4686 non-null   float64       
 5   korea_1_year   4686 non-null   float64       
 6   korea_20_year  4686 non-null   float64       
 7   korea_2_year   4686 non-null   float64       
 8   korea_3_year   4686 non-null   float64       
 9   korea_4_year   4686 non-null   float64       
 10  korea_5_year   4686 non-null   float64       
dtypes: datetime64[ns](1), float64(10)
memory usage: 439.3 KB


In [None]:
# Reorder the columns
df_final = df_final[['date', 'seoul_index','kospi_index','korea_rp',
                    'korea_1_year','korea_2_year','korea_3_year','korea_4_year','korea_5_year',
                    'korea_10_year','korea_20_year']]
df_final

Unnamed: 0,date,seoul_index,kospi_index,korea_rp,korea_1_year,korea_2_year,korea_3_year,korea_4_year,korea_5_year,korea_10_year,korea_20_year
0,2011-01-01,59.42,2051.00,2.5,2.810,3.400,3.440,4.090,4.140,4.570,4.730
1,2011-01-02,59.43,2051.00,2.5,2.810,3.400,3.440,4.090,4.140,4.570,4.730
2,2011-01-03,59.43,2070.08,2.5,2.810,3.400,3.440,4.090,4.140,4.570,4.730
3,2011-01-04,59.43,2085.14,2.5,2.830,3.370,3.495,4.160,4.200,4.580,4.740
4,2011-01-05,59.44,2082.55,2.5,2.800,3.420,3.495,4.150,4.170,4.630,4.750
...,...,...,...,...,...,...,...,...,...,...,...
4681,2023-10-26,90.82,2299.08,3.5,3.833,3.995,3.692,4.218,3.608,3.977,4.254
4682,2023-10-27,90.82,2302.81,3.5,3.813,3.979,3.692,4.169,3.608,3.977,4.095
4683,2023-10-28,90.82,2302.81,3.5,3.813,3.973,3.692,4.169,3.608,3.977,4.085
4684,2023-10-29,90.82,2302.81,3.5,3.813,3.973,3.692,4.169,3.608,3.977,4.085


In [None]:
# Create year, month, and day columns
df_final['year'] = df_final['date'].dt.year
df_final['month'] = df_final['date'].dt.month
df_final['day'] = df_final['date'].dt.day
df_final.head()

Unnamed: 0,date,seoul_index,kospi_index,korea_rp,korea_1_year,korea_2_year,korea_3_year,korea_4_year,korea_5_year,korea_10_year,korea_20_year,year,month,day
0,2011-01-01,59.42,2051.0,2.5,2.81,3.4,3.44,4.09,4.14,4.57,4.73,2011,1,1
1,2011-01-02,59.43,2051.0,2.5,2.81,3.4,3.44,4.09,4.14,4.57,4.73,2011,1,2
2,2011-01-03,59.43,2070.08,2.5,2.81,3.4,3.44,4.09,4.14,4.57,4.73,2011,1,3
3,2011-01-04,59.43,2085.14,2.5,2.83,3.37,3.495,4.16,4.2,4.58,4.74,2011,1,4
4,2011-01-05,59.44,2082.55,2.5,2.8,3.42,3.495,4.15,4.17,4.63,4.75,2011,1,5


### Visualizing the real-estate index and Korean government bond yields

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['korea_rp'],
                    mode='lines',
                    name='korea_rp',yaxis='y1'))

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['korea_1_year'],
                    mode='lines',
                    name='korea_1_year',yaxis='y1'))

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['korea_2_year'],
                    mode='lines',
                    name='korea_2_year',yaxis='y1'))

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['korea_3_year'],
                    mode='lines',
                    name='korea_3_year',yaxis='y1'))

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['korea_4_year'],
                    mode='lines',
                    name='korea_4_year',yaxis='y1'))

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['korea_5_year'],
                    mode='lines',
                    name='korea_5_year',yaxis='y1'))

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['korea_10_year'],
                    mode='lines',
                    name='korea_10_year',yaxis='y1'))

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['korea_20_year'],
                    mode='lines',
                    name='korea_20_year',yaxis='y1'))

# Reverse the primary y-axis so the yield traces above are drawn inverted
fig.update_layout(
    yaxis = dict(autorange="reversed")
)
fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['seoul_index'],
                    mode='lines',
                    name='seoul_index',
                    yaxis="y2"))
fig.update_layout(

   # create first Y-axis
   yaxis=dict(
      title="rate index",
      titlefont=dict(color="black"),
      tickfont=dict(color="black")
   ),

   # create second Y-axis
   yaxis2=dict(
      title="apartment index",
      overlaying="y",
      side="right")
)

> Korean government bond yields (inverted) appear to be related to the real-estate index.

In [None]:
# The yields move roughly together, so keep only the 3-year and 10-year government bond yields
df_final = df_final[['date','year','month','day','seoul_index','kospi_index','korea_rp',
                    'korea_3_year','korea_10_year']]
df_final.head()

Unnamed: 0,date,year,month,day,seoul_index,kospi_index,korea_rp,korea_3_year,korea_10_year
0,2011-01-01,2011,1,1,59.42,2051.0,2.5,3.44,4.57
1,2011-01-02,2011,1,2,59.43,2051.0,2.5,3.44,4.57
2,2011-01-03,2011,1,3,59.43,2070.08,2.5,3.44,4.57
3,2011-01-04,2011,1,4,59.43,2085.14,2.5,3.495,4.58
4,2011-01-05,2011,1,5,59.44,2082.55,2.5,3.495,4.63


## Adding US Treasury yield data

- Almost identical to the process used for the Korean government bond yields

In [None]:
# Reset the variables
dir_path = "/content/drive/MyDrive/house_price/original_data/us_bond"
file_list = os.listdir(dir_path)
file_list.sort()
name_list = list()
df_list = list()

# Read every CSV file in the folder and collect the DataFrames in a list
for csv_file in file_list:
    df_list.append(pd.read_csv(dir_path+"/"+csv_file , encoding='UTF8'))
    name_list.append(csv_file.split('.')[0])
for i in range(len(df_list)):
    df_us = df_list[i]
    df_us=df_us.sort_index(ascending=False) # the dates are in reverse order, so reverse the rows
    df_us.reset_index(drop=True, inplace=True) # reset the index
    df_us = df_us[['날짜','종가']]
    df_us.columns = ['us_date',name_list[i]]
    df_us['us_date'] = pd.to_datetime(df_us['us_date'])
    df_final=pd.merge(df_final, df_us, left_on='date', right_on='us_date', how='left')
    df_final[name_list[i]]=df_final[name_list[i]].ffill() # fill weekends and holidays with the previous value
    df_final[name_list[i]]=df_final[name_list[i]].bfill() # fill the leading rows with the nearest later value
    df_final = df_final.drop(['us_date'], axis=1)

In [None]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4686 entries, 0 to 4685
Data columns (total 17 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   date           4686 non-null   datetime64[ns]
 1   year           4686 non-null   int64         
 2   month          4686 non-null   int64         
 3   day            4686 non-null   int64         
 4   seoul_index    4686 non-null   float64       
 5   kospi_index    4686 non-null   float64       
 6   korea_rp       4686 non-null   float64       
 7   korea_3_year   4686 non-null   float64       
 8   korea_10_year  4686 non-null   float64       
 9   us_10_year     4686 non-null   float64       
 10  us_1_month     4686 non-null   float64       
 11  us_2_year      4686 non-null   float64       
 12  us_30_year     4686 non-null   float64       
 13  us_3_month     4686 non-null   float64       
 14  us_3_year      4686 non-null   float64       
 15  us_5_year      4686 n

In [None]:
df_final = df_final[['date','year','month','day','seoul_index','kospi_index','korea_rp',
                    'korea_3_year','korea_10_year','us_1_month','us_3_month',
                    'us_6_month','us_2_year', 'us_3_year', 'us_5_year',
                    'us_10_year','us_30_year']]

In [None]:
df_final.head()

Unnamed: 0,date,year,month,day,seoul_index,kospi_index,korea_rp,korea_3_year,korea_10_year,us_1_month,us_3_month,us_6_month,us_2_year,us_3_year,us_5_year,us_10_year,us_30_year
0,2011-01-01,2011,1,1,59.42,2051.0,2.5,3.44,4.57,0.096,0.124,0.183,0.601,1.006,2.011,3.334,4.401
1,2011-01-02,2011,1,2,59.43,2051.0,2.5,3.44,4.57,0.096,0.124,0.183,0.601,1.006,2.011,3.334,4.401
2,2011-01-03,2011,1,3,59.43,2070.08,2.5,3.44,4.57,0.096,0.124,0.183,0.601,1.006,2.011,3.334,4.401
3,2011-01-04,2011,1,4,59.43,2085.14,2.5,3.495,4.58,0.106,0.142,0.187,0.621,1.026,2.016,3.338,4.422
4,2011-01-05,2011,1,5,59.44,2082.55,2.5,3.495,4.63,0.129,0.142,0.184,0.708,1.129,2.133,3.463,4.541


### Comparing US Treasury yields with the real-estate index

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['us_1_month'],
                    mode='lines',
                    name='us_1_month',yaxis='y1'))

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['us_3_month'],
                    mode='lines',
                    name='us_3_month',yaxis='y1'))

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['us_6_month'],
                    mode='lines',
                    name='us_6_month',yaxis='y1'))

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['us_2_year'],
                    mode='lines',
                    name='us_2_year',yaxis='y1'))

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['us_3_year'],
                    mode='lines',
                    name='us_3_year',yaxis='y1'))

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['us_5_year'],
                    mode='lines',
                    name='us_5_year',yaxis='y1'))

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['us_10_year'],
                    mode='lines',
                    name='us_10_year',yaxis='y1'))

fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['us_30_year'],
                    mode='lines',
                    name='us_30_year',yaxis='y1'))

# Reverse the primary y-axis so the yield traces above are drawn inverted
fig.update_layout(
    yaxis = dict(autorange="reversed")
)
fig.add_trace(go.Scatter(x=df_final['date'], y=df_final['seoul_index'],
                    mode='lines',
                    name='seoul_index',
                    yaxis="y2"))
fig.update_layout(

   # create first Y-axis
   yaxis=dict(
      title="rate index",
      titlefont=dict(color="black"),
      tickfont=dict(color="black")
   ),

   # create second Y-axis
   yaxis2=dict(
      title="seoul index",
      overlaying="y",
      side="right")
)

> US Treasury yields (inverted) seem somewhat more related to the real-estate index than Korean government bond yields (inverted)?

In [None]:
# The yields move roughly together, so keep only the 3-month, 2-year, and 10-year Treasury yields (seoul_index is also left out of this frame)
df_final = df_final[['date','year','month','day','kospi_index','korea_rp',
                    'korea_3_year','korea_10_year','us_3_month', 'us_2_year', 'us_10_year']]
df_final.head()

Unnamed: 0,date,year,month,day,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,us_10_year
0,2011-01-01,2011,1,1,2051.0,2.5,3.44,4.57,0.124,0.601,3.334
1,2011-01-02,2011,1,2,2051.0,2.5,3.44,4.57,0.124,0.601,3.334
2,2011-01-03,2011,1,3,2070.08,2.5,3.44,4.57,0.124,0.601,3.334
3,2011-01-04,2011,1,4,2085.14,2.5,3.495,4.58,0.142,0.621,3.338
4,2011-01-05,2011,1,5,2082.55,2.5,3.495,4.63,0.142,0.708,3.463


## Saving the file

In [None]:
df_final.to_csv('/content/drive/MyDrive/house_price/after_data/economic_data.csv',index=False)

# Creating economic_data2.csv

- economic_data2 is the economic_data file with yield spreads, apartment sale supply volumes, and unsold-apartment counts added

In [None]:
import pandas as pd
df_economic = pd.read_csv('/content/drive/MyDrive/house_price/after_data/economic_data.csv')
df_economic

Unnamed: 0,date,year,month,day,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,us_10_year
0,2011-01-01,2011,1,1,2051.00,2.5,3.440,4.570,0.124,0.601,3.334
1,2011-01-02,2011,1,2,2051.00,2.5,3.440,4.570,0.124,0.601,3.334
2,2011-01-03,2011,1,3,2070.08,2.5,3.440,4.570,0.124,0.601,3.334
3,2011-01-04,2011,1,4,2085.14,2.5,3.495,4.580,0.142,0.621,3.338
4,2011-01-05,2011,1,5,2082.55,2.5,3.495,4.630,0.142,0.708,3.463
...,...,...,...,...,...,...,...,...,...,...,...
4681,2023-10-26,2023,10,26,2299.08,3.5,3.692,3.977,5.479,5.046,4.849
4682,2023-10-27,2023,10,27,2302.81,3.5,3.692,3.977,5.477,5.015,4.845
4683,2023-10-28,2023,10,28,2302.81,3.5,3.692,3.977,5.477,5.015,4.845
4684,2023-10-29,2023,10,29,2302.81,3.5,3.692,3.977,5.477,5.015,4.845



## Adding yield-spread columns

In [None]:
# Add the yield-spread columns
df_economic['korea_10-3_year'] = df_economic['korea_10_year'] - df_economic['korea_3_year']
df_economic['us_10-2_year'] = df_economic['us_10_year'] - df_economic['us_2_year']
df_economic['us_10-3_year_month'] = df_economic['us_10_year'] - df_economic['us_3_month']
df_economic

Unnamed: 0,date,year,month,day,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,us_10_year,korea_10-3_year,us_10-2_year,us_10-3_year_month
0,2011-01-01,2011,1,1,2051.00,2.5,3.440,4.570,0.124,0.601,3.334,1.130,2.733,3.210
1,2011-01-02,2011,1,2,2051.00,2.5,3.440,4.570,0.124,0.601,3.334,1.130,2.733,3.210
2,2011-01-03,2011,1,3,2070.08,2.5,3.440,4.570,0.124,0.601,3.334,1.130,2.733,3.210
3,2011-01-04,2011,1,4,2085.14,2.5,3.495,4.580,0.142,0.621,3.338,1.085,2.717,3.196
4,2011-01-05,2011,1,5,2082.55,2.5,3.495,4.630,0.142,0.708,3.463,1.135,2.755,3.321
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4681,2023-10-26,2023,10,26,2299.08,3.5,3.692,3.977,5.479,5.046,4.849,0.285,-0.197,-0.630
4682,2023-10-27,2023,10,27,2302.81,3.5,3.692,3.977,5.477,5.015,4.845,0.285,-0.170,-0.632
4683,2023-10-28,2023,10,28,2302.81,3.5,3.692,3.977,5.477,5.015,4.845,0.285,-0.170,-0.632
4684,2023-10-29,2023,10,29,2302.81,3.5,3.692,3.977,5.477,5.015,4.845,0.285,-0.170,-0.632
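
As a quick sanity check on the spread arithmetic, the sketch below recomputes `us_10-2_year` for two rows copied from the table above (2011-01-01 and 2023-10-27). The `inverted` column is a hypothetical helper, not part of the original pipeline; it marks rows where the 10-year yield is below the 2-year yield, the situation commonly called an inverted yield curve.

```python
import pandas as pd

# Two rows copied from the table above (2011-01-01 and 2023-10-27)
rows = pd.DataFrame({
    "us_2_year": [0.601, 5.015],
    "us_10_year": [3.334, 4.845],
})

# Same subtraction as in the cell above, rounded to the displayed precision
rows["us_10-2_year"] = (rows["us_10_year"] - rows["us_2_year"]).round(3)

# Hypothetical helper column: True where the long yield is below the short yield
rows["inverted"] = rows["us_10-2_year"] < 0
print(rows)
```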


## Adding apartment supply data

### Adding apartment sale supply data

- Apartment supply figures were obtained directly from https://data.kbland.kr/publicdata/housing-supply

In [None]:
import pandas as pd
# Load the CSV file
df_apartment_supply = pd.read_csv("/content/drive/MyDrive/house_price/original_data/apartment_supply.csv",  encoding='cp949')
df_apartment_supply = df_apartment_supply.dropna()
print(df_apartment_supply.info())
df_apartment_supply

<class 'pandas.core.frame.DataFrame'>
Int64Index: 168 entries, 0 to 167
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   year_month   168 non-null    float64
 1   general      168 non-null    object 
 2   combination  168 non-null    object 
 3   etc          168 non-null    object 
dtypes: float64(1), object(3)
memory usage: 6.6+ KB
None


Unnamed: 0,year_month,general,combination,etc
0,10.01,0,0,0
1,10.02,1199,147,506
2,10.03,31,96,0
3,10.04,148,0,0
4,10.05,102,0,0
...,...,...,...,...
163,23.08,1191,167,127
164,23.09,222,492,0
165,23.10,2242,452,5215
166,23.11,299,70,0


In [None]:
df_apartment_supply["general"] = df_apartment_supply["general"].str.replace(",", "")
df_apartment_supply["combination"] = df_apartment_supply["combination"].str.replace(",", "")
df_apartment_supply["etc"] = df_apartment_supply["etc"].str.replace(",", "")
df_apartment_supply[['general', 'combination', 'etc']] = df_apartment_supply[['general', 'combination', 'etc']].astype(int)
df_apartment_supply['year_month'] = df_apartment_supply['year_month'].astype(str)
print(df_apartment_supply.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 168 entries, 0 to 167
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   year_month   168 non-null    object
 1   general      168 non-null    int64 
 2   combination  168 non-null    int64 
 3   etc          168 non-null    int64 
dtypes: int64(3), object(1)
memory usage: 6.6+ KB
None


In [None]:
df_apartment_supply['year_month'].unique()

array(['10.01', '10.02', '10.03', '10.04', '10.05', '10.06', '10.07',
       '10.08', '10.09', '10.1', '10.11', '10.12', '11.01', '11.02',
       '11.03', '11.04', '11.05', '11.06', '11.07', '11.08', '11.09',
       '11.1', '11.11', '11.12', '12.01', '12.02', '12.03', '12.04',
       '12.05', '12.06', '12.07', '12.08', '12.09', '12.1', '12.11',
       '12.12', '13.01', '13.02', '13.03', '13.04', '13.05', '13.06',
       '13.07', '13.08', '13.09', '13.1', '13.11', '13.12', '14.01',
       '14.02', '14.03', '14.04', '14.05', '14.06', '14.07', '14.08',
       '14.09', '14.1', '14.11', '14.12', '15.01', '15.02', '15.03',
       '15.04', '15.05', '15.06', '15.07', '15.08', '15.09', '15.1',
       '15.11', '15.12', '16.01', '16.02', '16.03', '16.04', '16.05',
       '16.06', '16.07', '16.08', '16.09', '16.1', '16.11', '16.12',
       '17.01', '17.02', '17.03', '17.04', '17.05', '17.06', '17.07',
       '17.08', '17.09', '17.1', '17.11', '17.12', '18.01', '18.02',
       '18.03', '18.04', '18

In [None]:
df_apartment_supply.loc[df_apartment_supply['year_month'].str.len() == 4, 'year_month'] = df_apartment_supply['year_month']+'0'
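
The `'10.1'` values appear because `year_month` was parsed as a float, which drops the trailing zero of `10.10`. A possible way to avoid the repair step, shown here only as a sketch on a hypothetical three-row sample (`sample_csv`, `df_sample`, and the `general` values are illustrative, not the real file), is to read the column as text from the start.

```python
import io
import pandas as pd

# Hypothetical three-row sample shaped like apartment_supply.csv
sample_csv = io.StringIO("year_month,general\n10.09,0\n10.10,124\n10.11,2956\n")

# dtype=str keeps "10.10" as text, so the trailing zero is never lost
df_sample = pd.read_csv(sample_csv, dtype={"year_month": str})
print(df_sample["year_month"].tolist())
```

With the column kept as a string, the later `str.split('.')` step can be applied directly.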

In [None]:
df_apartment_supply['year_month'].unique()

array(['10.01', '10.02', '10.03', '10.04', '10.05', '10.06', '10.07',
       '10.08', '10.09', '10.10', '10.11', '10.12', '11.01', '11.02',
       '11.03', '11.04', '11.05', '11.06', '11.07', '11.08', '11.09',
       '11.10', '11.11', '11.12', '12.01', '12.02', '12.03', '12.04',
       '12.05', '12.06', '12.07', '12.08', '12.09', '12.10', '12.11',
       '12.12', '13.01', '13.02', '13.03', '13.04', '13.05', '13.06',
       '13.07', '13.08', '13.09', '13.10', '13.11', '13.12', '14.01',
       '14.02', '14.03', '14.04', '14.05', '14.06', '14.07', '14.08',
       '14.09', '14.10', '14.11', '14.12', '15.01', '15.02', '15.03',
       '15.04', '15.05', '15.06', '15.07', '15.08', '15.09', '15.10',
       '15.11', '15.12', '16.01', '16.02', '16.03', '16.04', '16.05',
       '16.06', '16.07', '16.08', '16.09', '16.10', '16.11', '16.12',
       '17.01', '17.02', '17.03', '17.04', '17.05', '17.06', '17.07',
       '17.08', '17.09', '17.10', '17.11', '17.12', '18.01', '18.02',
       '18.03', '18.

In [None]:
df_apartment_supply['total'] = df_apartment_supply['general'] + df_apartment_supply['combination'] + df_apartment_supply['etc']

print(df_apartment_supply.info())
df_apartment_supply

<class 'pandas.core.frame.DataFrame'>
Int64Index: 168 entries, 0 to 167
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   year_month   168 non-null    object
 1   general      168 non-null    int64 
 2   combination  168 non-null    int64 
 3   etc          168 non-null    int64 
 4   total        168 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 11.9+ KB
None


Unnamed: 0,year_month,general,combination,etc,total
0,10.01,0,0,0,0
1,10.02,1199,147,506,1852
2,10.03,31,96,0,127
3,10.04,148,0,0,148
4,10.05,102,0,0,102
...,...,...,...,...,...
163,23.08,1191,167,127,1485
164,23.09,222,492,0,714
165,23.10,2242,452,5215,7909
166,23.11,299,70,0,369


In [None]:
# Create year and month columns
df_apartment_supply['year'] =df_apartment_supply["year_month"].str.split('.',expand=True)[0]
df_apartment_supply['month'] =df_apartment_supply["year_month"].str.split('.',expand=True)[1]

# Prepend '20' to the year and select the columns to use
df_apartment_supply['year'] = '20'+df_apartment_supply['year']

df_apartment_supply = df_apartment_supply.astype({'year':'int', 'month':'int'})

df_apartment_supply = df_apartment_supply[['year', 'month', 'general', 'combination', 'etc', 'total']]

print(df_apartment_supply.info())
display(df_apartment_supply.head())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 168 entries, 0 to 167
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   year         168 non-null    int64
 1   month        168 non-null    int64
 2   general      168 non-null    int64
 3   combination  168 non-null    int64
 4   etc          168 non-null    int64
 5   total        168 non-null    int64
dtypes: int64(6)
memory usage: 13.3 KB
None


Unnamed: 0,year,month,general,combination,etc,total
0,2010,1,0,0,0,0
1,2010,2,1199,147,506,1852
2,2010,3,31,96,0,127
3,2010,4,148,0,0,148
4,2010,5,102,0,0,102


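- Assume each month's figures are published the following month (for example, the January 2011 figures cannot be known during January 2011; they only become available in February, once January's results have been compiled)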

In [None]:
# Shift the supply figures down by one row (one month)
df_apartment_supply[['general','combination', 'etc', 'total']] = df_apartment_supply[['general','combination', 'etc', 'total']].shift(1)
df_apartment_supply

Unnamed: 0,year,month,general,combination,etc,total
0,2010,1,,,,
1,2010,2,0.0,0.0,0.0,0.0
2,2010,3,1199.0,147.0,506.0,1852.0
3,2010,4,31.0,96.0,0.0,127.0
4,2010,5,148.0,0.0,0.0,148.0
...,...,...,...,...,...,...
163,2023,8,1169.0,226.0,432.0,1827.0
164,2023,9,1191.0,167.0,127.0,1485.0
165,2023,10,222.0,492.0,0.0,714.0
166,2023,11,2242.0,452.0,5215.0,7909.0
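
The effect of `shift(1)` on the publication-lag assumption can be seen on the first three months of the supply table. The sketch below is illustrative only; `monthly` and `total_known` are hypothetical names.

```python
import pandas as pd

# First three months of the supply totals from the table above
monthly = pd.DataFrame({
    "year": [2010, 2010, 2010],
    "month": [1, 2, 3],
    "total": [0, 1852, 127],
})

# shift(1) moves each value down one row: the figure for month m
# now sits on the row for month m+1, when it is assumed to be known
monthly["total_known"] = monthly["total"].shift(1)
print(monthly)
```

The first row has no earlier month to draw from, so it becomes NaN; that is why the next cell drops it.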


In [None]:
df_apartment_supply = df_apartment_supply.dropna()
df_apartment_supply = df_apartment_supply.reset_index(drop=True)
df_apartment_supply

Unnamed: 0,year,month,general,combination,etc,total
0,2010,2,0.0,0.0,0.0,0.0
1,2010,3,1199.0,147.0,506.0,1852.0
2,2010,4,31.0,96.0,0.0,127.0
3,2010,5,148.0,0.0,0.0,148.0
4,2010,6,102.0,0.0,0.0,102.0
...,...,...,...,...,...,...
162,2023,8,1169.0,226.0,432.0,1827.0
163,2023,9,1191.0,167.0,127.0,1485.0
164,2023,10,222.0,492.0,0.0,714.0
165,2023,11,2242.0,452.0,5215.0,7909.0


In [None]:
df_apartment_supply = df_apartment_supply.rename(columns={'year':'announcement_year', 'month':'announcement_month', 'general':'general_supply',
                                                          'combination':'combination_supply', 'etc':'etc_supply', 'total':'total_supply'})
df_apartment_supply.head()

Unnamed: 0,announcement_year,announcement_month,general_supply,combination_supply,etc_supply,total_supply
0,2010,2,0.0,0.0,0.0,0.0
1,2010,3,1199.0,147.0,506.0,1852.0
2,2010,4,31.0,96.0,0.0,127.0
3,2010,5,148.0,0.0,0.0,148.0
4,2010,6,102.0,0.0,0.0,102.0


### Adding unsold-apartment data

- Unsold-apartment data was obtained from https://data.kbland.kr/publicdata/unsold-apartments

In [None]:
df_apartment_unsold = pd.read_excel("/content/drive/MyDrive/house_price/original_data/unsold/서울 미분양 현황.xlsx")
df_apartment_unsold.head()

Unnamed: 0,구분,'07.01,'07.02,'07.03,'07.04,'07.05,'07.06,'07.07,'07.08,'07.09,...,'22.12,'23.01,'23.02,'23.03,'23.04,'23.05,'23.06,'23.07,'23.08,'23.09
0,미분양,72831,73546.0,73162.0,73393.0,78571.0,89924.0,90658.0,91714.0,98235.0,...,68107.0,75359.0,75438.0,72104.0,71365.0,68865.0,66388.0,63087.0,61811.0,59806.0
1,변동률,-,0.98,-0.52,0.32,7.06,14.45,0.82,1.16,7.11,...,17.37,10.65,0.1,-4.42,-1.02,-3.5,-3.6,-4.97,-2.02,-3.24


In [None]:
import numpy as np
df_apartment_unsold["'23.10"] = [0,np.nan]
df_apartment_unsold.head()

Unnamed: 0,구분,'07.01,'07.02,'07.03,'07.04,'07.05,'07.06,'07.07,'07.08,'07.09,...,'23.01,'23.02,'23.03,'23.04,'23.05,'23.06,'23.07,'23.08,'23.09,'23.10
0,미분양,72831,73546.0,73162.0,73393.0,78571.0,89924.0,90658.0,91714.0,98235.0,...,75359.0,75438.0,72104.0,71365.0,68865.0,66388.0,63087.0,61811.0,59806.0,0.0
1,변동률,-,0.98,-0.52,0.32,7.06,14.45,0.82,1.16,7.11,...,10.65,0.1,-4.42,-1.02,-3.5,-3.6,-4.97,-2.02,-3.24,


In [None]:
df_apartment_unsold = df_apartment_unsold.set_index('구분') # set the '구분' column as the index
df_apartment_unsold.head()

Unnamed: 0_level_0,'07.01,'07.02,'07.03,'07.04,'07.05,'07.06,'07.07,'07.08,'07.09,'07.10,...,'23.01,'23.02,'23.03,'23.04,'23.05,'23.06,'23.07,'23.08,'23.09,'23.10
구분,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
미분양,72831,73546.0,73162.0,73393.0,78571.0,89924.0,90658.0,91714.0,98235.0,100887.0,...,75359.0,75438.0,72104.0,71365.0,68865.0,66388.0,63087.0,61811.0,59806.0,0.0
변동률,-,0.98,-0.52,0.32,7.06,14.45,0.82,1.16,7.11,2.7,...,10.65,0.1,-4.42,-1.02,-3.5,-3.6,-4.97,-2.02,-3.24,


In [None]:
# Swap rows and columns with the .T attribute
df_apartment_unsold=df_apartment_unsold.T
df_apartment_unsold.head()

구분,미분양,변동률
'07.01,72831.0,-
'07.02,73546.0,0.98
'07.03,73162.0,-0.52
'07.04,73393.0,0.32
'07.05,78571.0,7.06


In [None]:
df_apartment_unsold.info()

<class 'pandas.core.frame.DataFrame'>
Index: 202 entries, '07.01 to '23.10
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   미분양     202 non-null    object
 1   변동률     201 non-null    object
dtypes: object(2)
memory usage: 12.8+ KB


In [None]:
# The index holds the date information, so use reset_index to turn it into a column
df_apartment_unsold = df_apartment_unsold.reset_index()
df_apartment_unsold.head()

구분,index,미분양,변동률
0,'07.01,72831.0,-
1,'07.02,73546.0,0.98
2,'07.03,73162.0,-0.52
3,'07.04,73393.0,0.32
4,'07.05,78571.0,7.06


In [None]:
# Rename the columns
df_apartment_unsold.columns=['year_month','unsold_count','ratio']
df_apartment_unsold.head()

Unnamed: 0,year_month,unsold_count,ratio
0,'07.01,72831.0,-
1,'07.02,73546.0,0.98
2,'07.03,73162.0,-0.52
3,'07.04,73393.0,0.32
4,'07.05,78571.0,7.06


In [None]:
# Remove the ' character from the year_month column
df_apartment_unsold["year_month"] = df_apartment_unsold["year_month"].str.replace("'", "")
df_apartment_unsold.head()

Unnamed: 0,year_month,unsold_count,ratio
0,7.01,72831.0,-
1,7.02,73546.0,0.98
2,7.03,73162.0,-0.52
3,7.04,73393.0,0.32
4,7.05,78571.0,7.06


In [None]:
# Create year and month columns
df_apartment_unsold['year'] =df_apartment_unsold["year_month"].str.split('.',expand=True)[0]
df_apartment_unsold['month'] =df_apartment_unsold["year_month"].str.split('.',expand=True)[1]
df_apartment_unsold.head()

Unnamed: 0,year_month,unsold_count,ratio,year,month
0,7.01,72831.0,-,7,1
1,7.02,73546.0,0.98,7,2
2,7.03,73162.0,-0.52,7,3
3,7.04,73393.0,0.32,7,4
4,7.05,78571.0,7.06,7,5


In [None]:
# Prepend '20' to the year and select the columns to use
df_apartment_unsold['year'] = '20'+df_apartment_unsold['year']
df_apartment_unsold = df_apartment_unsold[['year','month','unsold_count']]
df_apartment_unsold

Unnamed: 0,year,month,unsold_count
0,2007,01,72831
1,2007,02,73546.0
2,2007,03,73162.0
3,2007,04,73393.0
4,2007,05,78571.0
...,...,...,...
197,2023,06,66388.0
198,2023,07,63087.0
199,2023,08,61811.0
200,2023,09,59806.0


In [None]:
import datetime

# Assume unsold-apartment figures only become known one month later
df_apartment_unsold['date'] = pd.to_datetime(df_apartment_unsold['year']+'-'+df_apartment_unsold['month'], format="%Y-%m")
df_apartment_unsold['date_column'] = df_apartment_unsold['date'] + datetime.timedelta(days=32) # 32 days after the 1st always lands in the next month (the announcement date)
df_apartment_unsold['announcement_year'] = df_apartment_unsold['date_column'].dt.year
df_apartment_unsold['announcement_month'] = df_apartment_unsold['date_column'].dt.month
df_apartment_unsold = df_apartment_unsold[['announcement_year','announcement_month','unsold_count']]
df_apartment_unsold = df_apartment_unsold.astype({'unsold_count': 'int64'})
df_apartment_unsold

Unnamed: 0,announcement_year,announcement_month,unsold_count
0,2007,2,72831
1,2007,3,73546
2,2007,4,73162
3,2007,5,73393
4,2007,6,78571
...,...,...,...
197,2023,7,66388
198,2023,8,63087
199,2023,9,61811
200,2023,10,59806
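
Adding 32 days works because every reference date is the first of a month. A calendar-aware alternative, sketched below under that same assumption, is `pd.DateOffset(months=1)`, which states the "one month later" intent directly. `ref` and `announcement` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Hypothetical sample of reference months, including a year boundary
ref = pd.DataFrame(
    {"date": pd.to_datetime(["2007-01", "2010-12", "2023-09"], format="%Y-%m")}
)

# DateOffset(months=1) steps to the same day of the next calendar month
ref["announcement"] = ref["date"] + pd.DateOffset(months=1)
ref["announcement_year"] = ref["announcement"].dt.year
ref["announcement_month"] = ref["announcement"].dt.month
print(ref[["announcement_year", "announcement_month"]])
```

Either approach yields the same year and month here; the offset version simply does not depend on the day count.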


In [None]:
# Restrict to the year range to be used
df_apartment_unsold=df_apartment_unsold[df_apartment_unsold['announcement_year']>=2011]
df_apartment_unsold = df_apartment_unsold.drop(df_apartment_unsold[(df_apartment_unsold['announcement_year'] == 2023) & (df_apartment_unsold['announcement_month'] == 11)].index)

df_apartment_unsold

Unnamed: 0,announcement_year,announcement_month,unsold_count
47,2011,1,88706
48,2011,2,84923
49,2011,3,80588
50,2011,4,77572
51,2011,5,72232
...,...,...,...
196,2023,6,68865
197,2023,7,66388
198,2023,8,63087
199,2023,9,61811


In [None]:
df_apartment_unsold.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 154 entries, 47 to 200
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   announcement_year   154 non-null    int64
 1   announcement_month  154 non-null    int64
 2   unsold_count        154 non-null    int64
dtypes: int64(3)
memory usage: 4.8 KB


### Merging the apartment supply and unsold DataFrames

In [None]:
df_apartment_supply.head(20)

Unnamed: 0,announcement_year,announcement_month,general_supply,combination_supply,etc_supply,total_supply
0,2010,2,0.0,0.0,0.0,0.0
1,2010,3,1199.0,147.0,506.0,1852.0
2,2010,4,31.0,96.0,0.0,127.0
3,2010,5,148.0,0.0,0.0,148.0
4,2010,6,102.0,0.0,0.0,102.0
5,2010,7,472.0,0.0,0.0,472.0
6,2010,8,331.0,22.0,0.0,353.0
7,2010,9,974.0,0.0,0.0,974.0
8,2010,10,312.0,631.0,0.0,943.0
9,2010,11,124.0,0.0,0.0,124.0


In [None]:
df_apartment_supply.loc[df_apartment_supply['announcement_year']==2011,:]

Unnamed: 0,announcement_year,announcement_month,general_supply,combination_supply,etc_supply,total_supply
11,2011,1,2956.0,615.0,0.0,3571.0
12,2011,2,647.0,0.0,0.0,647.0
13,2011,3,307.0,154.0,2.0,463.0
14,2011,4,558.0,0.0,0.0,558.0
15,2011,5,633.0,310.0,0.0,943.0
16,2011,6,1939.0,81.0,1724.0,3744.0
17,2011,7,739.0,70.0,442.0,1251.0
18,2011,8,880.0,0.0,0.0,880.0
19,2011,9,1030.0,96.0,0.0,1126.0
20,2011,10,0.0,0.0,0.0,0.0


In [None]:
df_apartment_supply.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167 entries, 0 to 166
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   announcement_year   167 non-null    int64  
 1   announcement_month  167 non-null    int64  
 2   general_supply      167 non-null    float64
 3   combination_supply  167 non-null    float64
 4   etc_supply          167 non-null    float64
 5   total_supply        167 non-null    float64
dtypes: float64(4), int64(2)
memory usage: 8.0 KB


In [None]:
df_apartment_unsold.head()

Unnamed: 0,announcement_year,announcement_month,unsold_count
47,2011,1,88706
48,2011,2,84923
49,2011,3,80588
50,2011,4,77572
51,2011,5,72232


In [None]:
df_apartment_unsold.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 154 entries, 47 to 200
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   announcement_year   154 non-null    int64
 1   announcement_month  154 non-null    int64
 2   unsold_count        154 non-null    int64
dtypes: int64(3)
memory usage: 4.8 KB


In [None]:
# Merge the two DataFrames on announcement year and month
df_apartment_supply_unsold=pd.merge(df_apartment_supply, df_apartment_unsold, on=['announcement_year', 'announcement_month'])
df_apartment_supply_unsold.head()

Unnamed: 0,announcement_year,announcement_month,general_supply,combination_supply,etc_supply,total_supply,unsold_count
0,2011,1,2956.0,615.0,0.0,3571.0,88706
1,2011,2,647.0,0.0,0.0,647.0,84923
2,2011,3,307.0,154.0,2.0,463.0,80588
3,2011,4,558.0,0.0,0.0,558.0,77572
4,2011,5,633.0,310.0,0.0,943.0,72232


In [None]:
df_apartment_supply_unsold.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 154 entries, 0 to 153
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   announcement_year   154 non-null    int64  
 1   announcement_month  154 non-null    int64  
 2   general_supply      154 non-null    float64
 3   combination_supply  154 non-null    float64
 4   etc_supply          154 non-null    float64
 5   total_supply        154 non-null    float64
 6   unsold_count        154 non-null    int64  
dtypes: float64(4), int64(3)
memory usage: 9.6 KB
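
The merge above uses the default inner join, so any month present in only one table is dropped silently. A minimal sketch of how the match could be checked, assuming both tables hold exactly one row per announcement month; the small frames `supply` and `unsold` are illustrative, with the third month deliberately left out of `unsold`:

```python
import pandas as pd

keys = ["announcement_year", "announcement_month"]

supply = pd.DataFrame({
    "announcement_year": [2011, 2011, 2011],
    "announcement_month": [1, 2, 3],
    "total_supply": [3571.0, 647.0, 463.0],
})
# March is omitted on purpose to show what an unmatched month looks like.
unsold = pd.DataFrame({
    "announcement_year": [2011, 2011],
    "announcement_month": [1, 2],
    "unsold_count": [88706, 84923],
})

# validate raises MergeError if either side has duplicate keys;
# indicator adds a "_merge" column naming the side each row came from.
checked = pd.merge(supply, unsold, on=keys, how="outer",
                   validate="one_to_one", indicator=True)
unmatched = checked[checked["_merge"] != "both"]
print(unmatched)
```

If `unmatched` is empty, the inner join loses nothing and the row counts reported by `info()` can be trusted.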


#### Adding the unsold-ratio column

In [None]:
# Compute the unsold ratio: unsold count as a percentage of total supply
df_apartment_supply_unsold['unsold_ratio'] = 100*(df_apartment_supply_unsold['unsold_count'] / df_apartment_supply_unsold['total_supply'])
df_apartment_supply_unsold

Unnamed: 0,announcement_year,announcement_month,general_supply,combination_supply,etc_supply,total_supply,unsold_count,unsold_ratio
0,2011,1,2956.0,615.0,0.0,3571.0,88706,2484.066088
1,2011,2,647.0,0.0,0.0,647.0,84923,13125.656878
2,2011,3,307.0,154.0,2.0,463.0,80588,17405.615551
3,2011,4,558.0,0.0,0.0,558.0,77572,13901.792115
4,2011,5,633.0,310.0,0.0,943.0,72232,7659.809120
...,...,...,...,...,...,...,...,...
149,2023,6,327.0,454.0,191.0,972.0,68865,7084.876543
150,2023,7,206.0,148.0,389.0,743.0,66388,8935.127860
151,2023,8,1169.0,226.0,432.0,1827.0,63087,3453.037767
152,2023,9,1191.0,167.0,127.0,1485.0,61811,4162.356902
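
The supply table shown earlier contains months whose `total_supply` is 0.0 (October 2011, for example). Dividing by zero in pandas yields `inf` rather than raising an error, and `info()` counts `inf` as non-null. A minimal sketch of one way to guard the ratio, with made-up values in a hypothetical `frame`; whether such months should become missing values is a modelling decision the notebook does not state:

```python
import numpy as np
import pandas as pd

# Made-up values: the second month has no supply at all.
frame = pd.DataFrame({
    "total_supply": [647.0, 0.0],
    "unsold_count": [84923, 70000],
})

# Plain division: the zero-supply month becomes inf, which still counts as non-null.
raw_ratio = 100 * frame["unsold_count"] / frame["total_supply"]

# Guarded division: treat zero supply as missing so the ratio becomes NaN.
safe_ratio = 100 * frame["unsold_count"] / frame["total_supply"].replace(0, np.nan)

print(raw_ratio.tolist())
print(safe_ratio.tolist())
```

With the guarded version, `isnull().sum()` would expose the affected months instead of hiding them behind infinite values.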


In [None]:
df_apartment_supply_unsold.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 154 entries, 0 to 153
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   announcement_year   154 non-null    int64  
 1   announcement_month  154 non-null    int64  
 2   general_supply      154 non-null    float64
 3   combination_supply  154 non-null    float64
 4   etc_supply          154 non-null    float64
 5   total_supply        154 non-null    float64
 6   unsold_count        154 non-null    int64  
 7   unsold_ratio        154 non-null    float64
dtypes: float64(5), int64(3)
memory usage: 10.8 KB


### Merging into the final table

In [None]:
# Merge the monthly supply/unsold data onto the daily economic table
df_economic=pd.merge(df_economic, df_apartment_supply_unsold, left_on=['year','month'], right_on=['announcement_year','announcement_month'], how='left')
df_economic = df_economic.drop(["announcement_year", "announcement_month"], axis=1)
df_economic

Unnamed: 0,date,year,month,day,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,us_10_year,korea_10-3_year,us_10-2_year,us_10-3_year_month,general_supply,combination_supply,etc_supply,total_supply,unsold_count,unsold_ratio
0,2011-01-01,2011,1,1,2051.00,2.5,3.440,4.570,0.124,0.601,3.334,1.130,2.733,3.210,2956.0,615.0,0.0,3571.0,88706,2484.066088
1,2011-01-02,2011,1,2,2051.00,2.5,3.440,4.570,0.124,0.601,3.334,1.130,2.733,3.210,2956.0,615.0,0.0,3571.0,88706,2484.066088
2,2011-01-03,2011,1,3,2070.08,2.5,3.440,4.570,0.124,0.601,3.334,1.130,2.733,3.210,2956.0,615.0,0.0,3571.0,88706,2484.066088
3,2011-01-04,2011,1,4,2085.14,2.5,3.495,4.580,0.142,0.621,3.338,1.085,2.717,3.196,2956.0,615.0,0.0,3571.0,88706,2484.066088
4,2011-01-05,2011,1,5,2082.55,2.5,3.495,4.630,0.142,0.708,3.463,1.135,2.755,3.321,2956.0,615.0,0.0,3571.0,88706,2484.066088
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4681,2023-10-26,2023,10,26,2299.08,3.5,3.692,3.977,5.479,5.046,4.849,0.285,-0.197,-0.630,222.0,492.0,0.0,714.0,59806,8376.190476
4682,2023-10-27,2023,10,27,2302.81,3.5,3.692,3.977,5.477,5.015,4.845,0.285,-0.170,-0.632,222.0,492.0,0.0,714.0,59806,8376.190476
4683,2023-10-28,2023,10,28,2302.81,3.5,3.692,3.977,5.477,5.015,4.845,0.285,-0.170,-0.632,222.0,492.0,0.0,714.0,59806,8376.190476
4684,2023-10-29,2023,10,29,2302.81,3.5,3.692,3.977,5.477,5.015,4.845,0.285,-0.170,-0.632,222.0,492.0,0.0,714.0,59806,8376.190476


In [None]:
df_economic.info() # inspect the DataFrame

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4686 entries, 0 to 4685
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   date                4686 non-null   object 
 1   year                4686 non-null   int64  
 2   month               4686 non-null   int64  
 3   day                 4686 non-null   int64  
 4   kospi_index         4686 non-null   float64
 5   korea_rp            4686 non-null   float64
 6   korea_3_year        4686 non-null   float64
 7   korea_10_year       4686 non-null   float64
 8   us_3_month          4686 non-null   float64
 9   us_2_year           4686 non-null   float64
 10  us_10_year          4686 non-null   float64
 11  korea_10-3_year     4686 non-null   float64
 12  us_10-2_year        4686 non-null   float64
 13  us_10-3_year_month  4686 non-null   float64
 14  general_supply      4686 non-null   float64
 15  combination_supply  4686 non-null   float64
 16  etc_su

In [None]:
df_economic.isnull().sum() # check for null values

date                  0
year                  0
month                 0
day                   0
kospi_index           0
korea_rp              0
korea_3_year          0
korea_10_year         0
us_3_month            0
us_2_year             0
us_10_year            0
korea_10-3_year       0
us_10-2_year          0
us_10-3_year_month    0
general_supply        0
combination_supply    0
etc_supply            0
total_supply          0
unsold_count          0
unsold_ratio          0
dtype: int64

In [None]:
df_economic.to_csv('/content/drive/MyDrive/house_price/after_data/economic_data2.csv',index=False)

# Creating the economic_data3 file


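- The economic_data3 file is the economic_data2 file with monthly apartment transaction counts added (sale, jeonse, and monthly-rent contracts)
- 'Apartment transactions' here means apartment sales, jeonse (lump-sum deposit) leases, and monthly-rent leases combined
- Each monthly count is the total number of Seoul apartment transactions concluded in the previous month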

In [None]:
import pandas as pd
import numpy as np
# Load the data
df_deal = pd.read_csv("/content/drive/MyDrive/house_price/after_data/apartment_deal.csv",  encoding='UTF8')
df_month_rent = pd.read_csv("/content/drive/MyDrive/house_price/after_data/apartment_month_rent.csv",  encoding='UTF8')
df_full_rent = pd.read_csv("/content/drive/MyDrive/house_price/after_data/apartment_full_rent.csv",  encoding='UTF8')
df_economic2 = pd.read_csv("/content/drive/MyDrive/house_price/after_data/economic_data2.csv",  encoding='UTF8')

## Building the apartment sale-count DataFrame

In [None]:
df_deal.head()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,deal_price
0,2011-07-09,2011,7,9,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,64000
1,2011-07-28,2011,7,28,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,65500
2,2011-01-19,2011,1,19,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,70500
3,2011-09-02,2011,9,2,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,85000
4,2011-12-17,2011,12,17,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,68000


In [None]:
# Compute the monthly number of Seoul apartment sales with a group by
df_count = df_deal.groupby(["year","month"])["name"].agg('count').copy()
df_count = df_count.reset_index(["year","month"]) # turn the index levels back into columns
df_count.columns = ["year","month","deal_count"] # rename the columns
df_count

Unnamed: 0,year,month,deal_count
0,2011,1,7179
1,2011,2,6026
2,2011,3,5419
3,2011,4,4028
4,2011,5,3836
...,...,...,...
149,2023,6,4009
150,2023,7,3733
151,2023,8,3996
152,2023,9,3448


In [None]:
df_count.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 154 entries, 0 to 153
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   year        154 non-null    int64
 1   month       154 non-null    int64
 2   deal_count  154 non-null    int64
dtypes: int64(3)
memory usage: 3.7 KB
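
The monthly count above is taken over the `name` column, and `count` skips rows where that column is missing. A minimal sketch of the difference from `size`, which counts rows regardless of missing values; the frame `deals` is a toy example, and whether `name` is ever missing in the real transaction files is not shown in this notebook:

```python
import pandas as pd

# Toy transactions: one January row has no apartment name.
deals = pd.DataFrame({
    "year": [2011, 2011, 2011, 2011],
    "month": [1, 1, 2, 2],
    "name": ["A", None, "B", "C"],
})

grouped = deals.groupby(["year", "month"])

# count() ignores missing names; size() counts every row in the group.
by_count = grouped["name"].count().reset_index(name="deal_count")
by_size = grouped.size().reset_index(name="deal_count")

print(by_count)
print(by_size)
```

If the two results agree on the real data, the choice makes no difference; if they differ, `size` is the one that reflects the number of recorded transactions.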


## Adding the jeonse contract counts

- Same procedure as in the sale-count section above

In [None]:
df_full_rent.head()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,full_rent_price
0,2011-01-05,2011,1,5,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,35000
1,2011-01-18,2011,1,18,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,20000
2,2011-02-01,2011,2,1,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,24000
3,2011-02-11,2011,2,11,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,31000
4,2011-02-24,2011,2,24,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,30500


In [None]:
# Compute the monthly number of jeonse contracts with a group by and count
df_temp = df_full_rent.groupby(["year","month"])["name"].agg('count').copy()
df_temp = df_temp.reset_index(["year","month"])
df_temp.columns = ["year","month","full_rent_count"]
df_temp

Unnamed: 0,year,month,full_rent_count
0,2011,1,12336
1,2011,2,12261
2,2011,3,12121
3,2011,4,9754
4,2011,5,9280
...,...,...,...
149,2023,6,13228
150,2023,7,12965
151,2023,8,12526
152,2023,9,11173


In [None]:
# Merge the sale-count DataFrame with the jeonse-count DataFrame
df_count=pd.merge(df_count,df_temp, left_on=["year","month"], right_on=["year","month"], how="inner")
df_count.head()

Unnamed: 0,year,month,deal_count,full_rent_count
0,2011,1,7179,12336
1,2011,2,6026,12261
2,2011,3,5419,12121
3,2011,4,4028,9754
4,2011,5,3836,9280


In [None]:
df_count.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 154 entries, 0 to 153
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   year             154 non-null    int64
 1   month            154 non-null    int64
 2   deal_count       154 non-null    int64
 3   full_rent_count  154 non-null    int64
dtypes: int64(4)
memory usage: 6.0 KB


## Adding the monthly-rent contract counts

- Same procedure as for the sale-count DataFrame above

In [None]:
df_month_rent.head()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,rent_deposit,month_rent_price
0,2011-03-18,2011,3,18,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,19000,63
1,2011-04-09,2011,4,9,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,21000,35
2,2011-07-09,2011,7,9,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,3000,160
3,2011-09-19,2011,9,19,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,6000,140
4,2011-09-20,2011,9,20,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,5000,160


In [None]:
df_temp = df_month_rent.groupby(["year","month"])["name"].agg('count').copy()
df_temp = df_temp.reset_index(["year","month"])
df_temp.columns = ["year","month","month_rent_count"]
df_temp

Unnamed: 0,year,month,month_rent_count
0,2011,1,2514
1,2011,2,2711
2,2011,3,2775
3,2011,4,2210
4,2011,5,2168
...,...,...,...
149,2023,6,9829
150,2023,7,8594
151,2023,8,8170
152,2023,9,7342


In [None]:
# Merge in the monthly-rent count DataFrame
df_count=pd.merge(df_count,df_temp, left_on=["year","month"], right_on=["year","month"], how="inner")
df_count.head()

Unnamed: 0,year,month,deal_count,full_rent_count,month_rent_count
0,2011,1,7179,12336,2514
1,2011,2,6026,12261,2711
2,2011,3,5419,12121,2775
3,2011,4,4028,9754,2210
4,2011,5,3836,9280,2168


In [None]:
df_count.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 154 entries, 0 to 153
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   year              154 non-null    int64
 1   month             154 non-null    int64
 2   deal_count        154 non-null    int64
 3   full_rent_count   154 non-null    int64
 4   month_rent_count  154 non-null    int64
dtypes: int64(5)
memory usage: 7.2 KB


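## Shifting the monthly figures

- A month's transaction counts are only known in the following month, so each count column is shifted down by one row (delayed by one month)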

In [None]:
df_count['deal_count'] = df_count['deal_count'].shift(1)
df_count['month_rent_count'] = df_count['month_rent_count'].shift(1)
df_count['full_rent_count'] = df_count['full_rent_count'].shift(1)
# Rename the columns
df_count.columns = ['year','month','last_month_total_deal_count','last_month_total_full_rent_count', 'last_month_total_month_rent_count']
df_count

Unnamed: 0,year,month,last_month_total_deal_count,last_month_total_full_rent_count,last_month_total_month_rent_count
0,2011,1,,,
1,2011,2,7179.0,12336.0,2514.0
2,2011,3,6026.0,12261.0,2711.0
3,2011,4,5419.0,12121.0,2775.0
4,2011,5,4028.0,9754.0,2210.0
...,...,...,...,...,...
149,2023,6,3561.0,13314.0,9700.0
150,2023,7,4009.0,13228.0,9829.0
151,2023,8,3733.0,12965.0,8594.0
152,2023,9,3996.0,12526.0,8170.0
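
`shift(1)` moves values by one row, which equals one month only when the table is sorted and contains every month. A minimal sketch of a lag keyed on the calendar month instead, using a hypothetical running month number `month_number`; the toy frame `counts` omits March on purpose to show where the two approaches diverge:

```python
import pandas as pd

# Toy monthly counts with March missing on purpose.
counts = pd.DataFrame({
    "year": [2011, 2011, 2011],
    "month": [1, 2, 4],
    "deal_count": [7179, 6026, 4028],
})

# Row-based lag: April receives February's value because March is absent.
counts["last_month_by_shift"] = counts["deal_count"].shift(1)

# Key-based lag: number the months consecutively and look up month_number - 1.
month_number = counts["year"] * 12 + (counts["month"] - 1)
lookup = pd.Series(counts["deal_count"].to_numpy(), index=month_number)
counts["last_month_by_key"] = (month_number - 1).map(lookup)

print(counts)
```

In the notebook's tables every month from 2011-01 onward appears to be present, in which case both versions give the same result; the key-based lag only matters if a month is ever missing.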


In [None]:
# Drop the rows that contain null values (the first month has no previous month)
df_count.dropna(axis=0,inplace=True)
df_count.reset_index(inplace=True,drop=True) # reset the index
df_count

Unnamed: 0,year,month,last_month_total_deal_count,last_month_total_full_rent_count,last_month_total_month_rent_count
0,2011,2,7179.0,12336.0,2514.0
1,2011,3,6026.0,12261.0,2711.0
2,2011,4,5419.0,12121.0,2775.0
3,2011,5,4028.0,9754.0,2210.0
4,2011,6,3836.0,9280.0,2168.0
...,...,...,...,...,...
148,2023,6,3561.0,13314.0,9700.0
149,2023,7,4009.0,13228.0,9829.0
150,2023,8,3733.0,12965.0,8594.0
151,2023,9,3996.0,12526.0,8170.0


In [None]:
df_count.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 5 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   year                               153 non-null    int64  
 1   month                              153 non-null    int64  
 2   last_month_total_deal_count        153 non-null    float64
 3   last_month_total_full_rent_count   153 non-null    float64
 4   last_month_total_month_rent_count  153 non-null    float64
dtypes: float64(3), int64(2)
memory usage: 6.1 KB


In [None]:
df_economic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4686 entries, 0 to 4685
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   date                4686 non-null   object 
 1   year                4686 non-null   int64  
 2   month               4686 non-null   int64  
 3   day                 4686 non-null   int64  
 4   kospi_index         4686 non-null   float64
 5   korea_rp            4686 non-null   float64
 6   korea_3_year        4686 non-null   float64
 7   korea_10_year       4686 non-null   float64
 8   us_3_month          4686 non-null   float64
 9   us_2_year           4686 non-null   float64
 10  us_10_year          4686 non-null   float64
 11  korea_10-3_year     4686 non-null   float64
 12  us_10-2_year        4686 non-null   float64
 13  us_10-3_year_month  4686 non-null   float64
 14  general_supply      4686 non-null   float64
 15  combination_supply  4686 non-null   float64
 16  etc_su

## Merging with economic_data2

In [None]:
# The macroeconomic table has a row for every date, so merge the monthly counts on year and month
df_economic3 = pd.merge(df_economic2, df_count, on=["year","month"], how="inner")

df_economic3 = df_economic3.rename(columns={'unsold_count': 'last_month_total_unsold_count',
                                            'unsold_ratio': 'last_month_total_unsold_ratio',
                                            'general_supply': 'last_month_general_supply',
                                            'combination_supply': 'last_month_combination_supply',
                                            'etc_supply': 'last_month_etc_supply',
                                            'total_supply': 'last_month_total_supply'})



print(df_economic3.info())
df_economic3.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4655 entries, 0 to 4654
Data columns (total 23 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   date                               4655 non-null   object 
 1   year                               4655 non-null   int64  
 2   month                              4655 non-null   int64  
 3   day                                4655 non-null   int64  
 4   kospi_index                        4655 non-null   float64
 5   korea_rp                           4655 non-null   float64
 6   korea_3_year                       4655 non-null   float64
 7   korea_10_year                      4655 non-null   float64
 8   us_3_month                         4655 non-null   float64
 9   us_2_year                          4655 non-null   float64
 10  us_10_year                         4655 non-null   float64
 11  korea_10-3_year                    4655 non-null   float

Unnamed: 0,date,year,month,day,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,...,us_10-3_year_month,last_month_general_supply,last_month_combination_supply,last_month_etc_supply,last_month_total_supply,last_month_total_unsold_count,last_month_total_unsold_ratio,last_month_total_deal_count,last_month_total_full_rent_count,last_month_total_month_rent_count
0,2011-02-01,2011,2,1,2072.03,2.75,3.97,4.71,0.157,0.605,...,3.278,647.0,0.0,0.0,647.0,84923,13125.656878,7179.0,12336.0,2514.0
1,2011-02-02,2011,2,2,2072.03,2.75,3.97,4.71,0.157,0.664,...,3.322,647.0,0.0,0.0,647.0,84923,13125.656878,7179.0,12336.0,2514.0
2,2011-02-03,2011,2,3,2072.03,2.75,3.97,4.71,0.152,0.712,...,3.395,647.0,0.0,0.0,647.0,84923,13125.656878,7179.0,12336.0,2514.0
3,2011-02-04,2011,2,4,2072.03,2.75,3.97,4.71,0.152,0.752,...,3.486,647.0,0.0,0.0,647.0,84923,13125.656878,7179.0,12336.0,2514.0
4,2011-02-05,2011,2,5,2072.03,2.75,3.97,4.71,0.152,0.752,...,3.486,647.0,0.0,0.0,647.0,84923,13125.656878,7179.0,12336.0,2514.0


In [None]:
df_economic3.tail()

Unnamed: 0,date,year,month,day,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,...,us_10-3_year_month,last_month_general_supply,last_month_combination_supply,last_month_etc_supply,last_month_total_supply,last_month_total_unsold_count,last_month_total_unsold_ratio,last_month_total_deal_count,last_month_total_full_rent_count,last_month_total_month_rent_count
4650,2023-10-26,2023,10,26,2299.08,3.5,3.692,3.977,5.479,5.046,...,-0.63,222.0,492.0,0.0,714.0,59806,8376.190476,3448.0,11173.0,7342.0
4651,2023-10-27,2023,10,27,2302.81,3.5,3.692,3.977,5.477,5.015,...,-0.632,222.0,492.0,0.0,714.0,59806,8376.190476,3448.0,11173.0,7342.0
4652,2023-10-28,2023,10,28,2302.81,3.5,3.692,3.977,5.477,5.015,...,-0.632,222.0,492.0,0.0,714.0,59806,8376.190476,3448.0,11173.0,7342.0
4653,2023-10-29,2023,10,29,2302.81,3.5,3.692,3.977,5.477,5.015,...,-0.632,222.0,492.0,0.0,714.0,59806,8376.190476,3448.0,11173.0,7342.0
4654,2023-10-30,2023,10,30,2310.55,3.5,3.692,3.977,5.481,5.05,...,-0.593,222.0,492.0,0.0,714.0,59806,8376.190476,3448.0,11173.0,7342.0


In [None]:
# Save the file
df_economic3.to_pickle('/content/drive/MyDrive/house_price/after_data/economic_data3.pkl')

# Creating the final_economic file

- economic_data3 holds the macroeconomic indicators as of 'the current month'.
- The final_economic file extends economic_data3 with information on how each figure has changed relative to its past values

## Inspecting the basic information

In [None]:
import pandas as pd
# Load the DataFrame
df_economic = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/economic_data3.pkl')
df_economic.head()

Unnamed: 0,date,year,month,day,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,...,us_10-3_year_month,last_month_general_supply,last_month_combination_supply,last_month_etc_supply,last_month_total_supply,last_month_total_unsold_count,last_month_total_unsold_ratio,last_month_total_deal_count,last_month_total_full_rent_count,last_month_total_month_rent_count
0,2011-02-01,2011,2,1,2072.03,2.75,3.97,4.71,0.157,0.605,...,3.278,647.0,0.0,0.0,647.0,84923,13125.656878,7179.0,12336.0,2514.0
1,2011-02-02,2011,2,2,2072.03,2.75,3.97,4.71,0.157,0.664,...,3.322,647.0,0.0,0.0,647.0,84923,13125.656878,7179.0,12336.0,2514.0
2,2011-02-03,2011,2,3,2072.03,2.75,3.97,4.71,0.152,0.712,...,3.395,647.0,0.0,0.0,647.0,84923,13125.656878,7179.0,12336.0,2514.0
3,2011-02-04,2011,2,4,2072.03,2.75,3.97,4.71,0.152,0.752,...,3.486,647.0,0.0,0.0,647.0,84923,13125.656878,7179.0,12336.0,2514.0
4,2011-02-05,2011,2,5,2072.03,2.75,3.97,4.71,0.152,0.752,...,3.486,647.0,0.0,0.0,647.0,84923,13125.656878,7179.0,12336.0,2514.0


In [None]:
df_economic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4655 entries, 0 to 4654
Data columns (total 23 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   date                               4655 non-null   object 
 1   year                               4655 non-null   int64  
 2   month                              4655 non-null   int64  
 3   day                                4655 non-null   int64  
 4   kospi_index                        4655 non-null   float64
 5   korea_rp                           4655 non-null   float64
 6   korea_3_year                       4655 non-null   float64
 7   korea_10_year                      4655 non-null   float64
 8   us_3_month                         4655 non-null   float64
 9   us_2_year                          4655 non-null   float64
 10  us_10_year                         4655 non-null   float64
 11  korea_10-3_year                    4655 non-null   float

## Computing the change relative to 6 and 12 months earlier

In [None]:
# Build the DataFrame of monthly averages (used in the merges later on)
# 'korea_10-3_year', 'us_10-2_year', 'us_10-3_year_month' and 'last_month_total_unsold_ratio' are derived from other columns,
# so they are dropped here and recomputed from the monthly averages
df_economic_month = df_economic.drop(['date','day','korea_10-3_year', 'us_10-2_year', 'us_10-3_year_month', 'last_month_total_unsold_ratio'],axis=1).copy()


# The data is daily, so group it by month and take the mean
df_economic_month = df_economic_month.groupby(['year','month']).agg('mean').reset_index()

df_economic_month['korea_10-3_year'] = df_economic_month['korea_10_year'] - df_economic_month['korea_3_year']
df_economic_month['us_10-2_year'] = df_economic_month['us_10_year'] - df_economic_month['us_2_year']
df_economic_month['us_10-3_year_month'] = df_economic_month['us_10_year'] - df_economic_month['us_3_month']
df_economic_month['last_month_total_unsold_ratio'] = (100*df_economic_month['last_month_total_unsold_count']) / df_economic_month['last_month_total_supply']

display(df_economic_month.head())

Unnamed: 0,year,month,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,us_10_year,last_month_general_supply,...,last_month_etc_supply,last_month_total_supply,last_month_total_unsold_count,last_month_total_deal_count,last_month_total_full_rent_count,last_month_total_month_rent_count,korea_10-3_year,us_10-2_year,us_10-3_year_month,last_month_total_unsold_ratio
0,2011,2,2011.301786,2.75,3.939286,4.745714,0.127536,0.762071,3.565429,647.0,...,0.0,647.0,84923.0,7179.0,12336.0,2514.0,0.806429,2.803357,3.437893,13125.656878
1,2011,3,1999.63871,2.927419,3.745968,4.518387,0.095839,0.679452,3.404226,307.0,...,2.0,463.0,80588.0,6026.0,12261.0,2711.0,0.772419,2.724774,3.308387,17405.615551
2,2011,4,2152.758,3.0,3.747,4.483333,0.0562,0.719767,3.4346,558.0,...,0.0,558.0,77572.0,5419.0,12121.0,2775.0,0.736333,2.714833,3.3784,13901.792115
3,2011,5,2126.069355,3.0,3.67371,4.347742,0.035581,0.536161,3.152774,633.0,...,0.0,943.0,72232.0,4028.0,9754.0,2210.0,0.674032,2.616613,3.117194,7659.80912
4,2011,6,2074.891667,3.175,3.6385,4.240667,0.032667,0.402067,2.976833,1939.0,...,1724.0,3744.0,71360.0,3836.0,9280.0,2168.0,0.602167,2.574767,2.944167,1905.982906
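
The derived columns are recomputed from the monthly averages rather than averaged directly. A small worked example of why the order can matter, using made-up numbers and hypothetical column names `long_rate` and `short_rate`: a difference gives the same result either way because the mean is linear, while a ratio generally does not.

```python
import pandas as pd

# Two made-up daily observations within one month.
daily = pd.DataFrame({"long_rate": [4.0, 6.0], "short_rate": [1.0, 3.0]})

# Spread: averaging the daily differences equals differencing the averages.
mean_of_spread = (daily["long_rate"] - daily["short_rate"]).mean()
spread_of_means = daily["long_rate"].mean() - daily["short_rate"].mean()

# Ratio: averaging the daily ratios differs from the ratio of the averages.
mean_of_ratio = (daily["long_rate"] / daily["short_rate"]).mean()
ratio_of_means = daily["long_rate"].mean() / daily["short_rate"].mean()

print(mean_of_spread, spread_of_means)
print(mean_of_ratio, ratio_of_means)
```

In this notebook the supply and unsold figures appear to be constant within a month, so for `last_month_total_unsold_ratio` both orders should coincide; recomputing from the averages simply keeps the derived columns consistent with the averaged inputs.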


In [None]:
# Compute the year and month six months earlier
df_economic.loc[df_economic['month']<7, '6m_before_year'] = df_economic['year']-1
df_economic.loc[df_economic['month']<7, '6m_before_month'] = 12-(6-df_economic['month'])
df_economic.loc[df_economic['month']>=7, '6m_before_year'] = df_economic['year']
df_economic.loc[df_economic['month']>=7, '6m_before_month'] = df_economic['month']-6

# Compute the year and month twelve months earlier
df_economic.loc[:, '12m_before_year'] = df_economic['year']-1
df_economic.loc[:, '12m_before_month'] = df_economic['month']

df_economic

Unnamed: 0,date,year,month,day,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,...,last_month_total_supply,last_month_total_unsold_count,last_month_total_unsold_ratio,last_month_total_deal_count,last_month_total_full_rent_count,last_month_total_month_rent_count,6m_before_year,6m_before_month,12m_before_year,12m_before_month
0,2011-02-01,2011,2,1,2072.03,2.75,3.970,4.710,0.157,0.605,...,647.0,84923,13125.656878,7179.0,12336.0,2514.0,2010.0,8.0,2010,2
1,2011-02-02,2011,2,2,2072.03,2.75,3.970,4.710,0.157,0.664,...,647.0,84923,13125.656878,7179.0,12336.0,2514.0,2010.0,8.0,2010,2
2,2011-02-03,2011,2,3,2072.03,2.75,3.970,4.710,0.152,0.712,...,647.0,84923,13125.656878,7179.0,12336.0,2514.0,2010.0,8.0,2010,2
3,2011-02-04,2011,2,4,2072.03,2.75,3.970,4.710,0.152,0.752,...,647.0,84923,13125.656878,7179.0,12336.0,2514.0,2010.0,8.0,2010,2
4,2011-02-05,2011,2,5,2072.03,2.75,3.970,4.710,0.152,0.752,...,647.0,84923,13125.656878,7179.0,12336.0,2514.0,2010.0,8.0,2010,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4650,2023-10-26,2023,10,26,2299.08,3.50,3.692,3.977,5.479,5.046,...,714.0,59806,8376.190476,3448.0,11173.0,7342.0,2023.0,4.0,2022,10
4651,2023-10-27,2023,10,27,2302.81,3.50,3.692,3.977,5.477,5.015,...,714.0,59806,8376.190476,3448.0,11173.0,7342.0,2023.0,4.0,2022,10
4652,2023-10-28,2023,10,28,2302.81,3.50,3.692,3.977,5.477,5.015,...,714.0,59806,8376.190476,3448.0,11173.0,7342.0,2023.0,4.0,2022,10
4653,2023-10-29,2023,10,29,2302.81,3.50,3.692,3.977,5.477,5.015,...,714.0,59806,8376.190476,3448.0,11173.0,7342.0,2023.0,4.0,2022,10
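
The lookback keys above are built with separate branches for months before and after July, which also leaves the 6-month columns as floats. A minimal sketch of an equivalent calculation that works for any lag, assuming integer `year` and `month` columns; the frame `dates` and the intermediate `month_number` are illustrative names, not part of the notebook:

```python
import pandas as pd

dates = pd.DataFrame({"year": [2011, 2011, 2023], "month": [2, 7, 10]})

# Number the months consecutively so that subtracting a lag is plain integer arithmetic.
month_number = dates["year"] * 12 + (dates["month"] - 1)

for lag in (6, 12):
    earlier = month_number - lag
    dates[f"{lag}m_before_year"] = earlier // 12
    dates[f"{lag}m_before_month"] = earlier % 12 + 1

print(dates)
```

For February 2011 this gives August 2010 and February 2010, matching the first rows of the output above, and the key columns stay integers.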


In [None]:
pd.set_option('display.max_columns', 100)
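# The '6m_before_year'/'6m_before_month' keys (and their 12-month counterparts) are matched against 'year'/'month' in df_economic_month;
# the inner joins drop any date whose earlier month has no row in df_economic_month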
df_economic = pd.merge(df_economic, df_economic_month, left_on=['6m_before_year', '6m_before_month'], right_on=['year','month'], how='inner', suffixes=('', '_6m_before'))
df_economic = pd.merge(df_economic, df_economic_month, left_on=['12m_before_year', '12m_before_month'], right_on=['year','month'], how='inner', suffixes=('', '_12m_before'))
df_economic = df_economic.drop(["6m_before_year", "6m_before_month", "12m_before_year", "12m_before_month", "year_6m_before", "month_6m_before","year_12m_before", "month_12m_before"], axis=1)
df_economic.head(20)

Unnamed: 0,date,year,month,day,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,us_10_year,korea_10-3_year,us_10-2_year,us_10-3_year_month,last_month_general_supply,last_month_combination_supply,last_month_etc_supply,last_month_total_supply,last_month_total_unsold_count,last_month_total_unsold_ratio,last_month_total_deal_count,last_month_total_full_rent_count,last_month_total_month_rent_count,kospi_index_6m_before,korea_rp_6m_before,korea_3_year_6m_before,korea_10_year_6m_before,us_3_month_6m_before,us_2_year_6m_before,us_10_year_6m_before,last_month_general_supply_6m_before,last_month_combination_supply_6m_before,last_month_etc_supply_6m_before,last_month_total_supply_6m_before,last_month_total_unsold_count_6m_before,last_month_total_deal_count_6m_before,last_month_total_full_rent_count_6m_before,last_month_total_month_rent_count_6m_before,korea_10-3_year_6m_before,us_10-2_year_6m_before,us_10-3_year_month_6m_before,last_month_total_unsold_ratio_6m_before,kospi_index_12m_before,korea_rp_12m_before,korea_3_year_12m_before,korea_10_year_12m_before,us_3_month_12m_before,us_2_year_12m_before,us_10_year_12m_before,last_month_general_supply_12m_before,last_month_combination_supply_12m_before,last_month_etc_supply_12m_before,last_month_total_supply_12m_before,last_month_total_unsold_count_12m_before,last_month_total_deal_count_12m_before,last_month_total_full_rent_count_12m_before,last_month_total_month_rent_count_12m_before,korea_10-3_year_12m_before,us_10-2_year_12m_before,us_10-3_year_month_12m_before,last_month_total_unsold_ratio_12m_before
0,2012-02-01,2012,2,1,1959.24,3.25,3.38,3.75,0.061,0.226,1.83,0.37,1.604,1.769,912.0,0.0,0.0,912.0,67786,7432.675439,2786.0,10445.0,2277.0,1852.969677,3.25,3.569194,3.942903,0.015387,0.225677,2.286581,880.0,0.0,0.0,880.0,70087.0,4319.0,9682.0,2311.0,0.37371,2.060903,2.271194,7964.431818,2011.301786,2.75,3.939286,4.745714,0.127536,0.762071,3.565429,647.0,0.0,0.0,647.0,84923.0,7179.0,12336.0,2514.0,0.806429,2.803357,3.437893,13125.656878
1,2012-02-02,2012,2,2,1984.3,3.25,3.38,3.76,0.084,0.226,1.823,0.38,1.597,1.739,912.0,0.0,0.0,912.0,67786,7432.675439,2786.0,10445.0,2277.0,1852.969677,3.25,3.569194,3.942903,0.015387,0.225677,2.286581,880.0,0.0,0.0,880.0,70087.0,4319.0,9682.0,2311.0,0.37371,2.060903,2.271194,7964.431818,2011.301786,2.75,3.939286,4.745714,0.127536,0.762071,3.565429,647.0,0.0,0.0,647.0,84923.0,7179.0,12336.0,2514.0,0.806429,2.803357,3.437893,13125.656878
2,2012-02-03,2012,2,3,1972.34,3.25,3.38,3.76,0.079,0.238,1.924,0.38,1.686,1.845,912.0,0.0,0.0,912.0,67786,7432.675439,2786.0,10445.0,2277.0,1852.969677,3.25,3.569194,3.942903,0.015387,0.225677,2.286581,880.0,0.0,0.0,880.0,70087.0,4319.0,9682.0,2311.0,0.37371,2.060903,2.271194,7964.431818,2011.301786,2.75,3.939286,4.745714,0.127536,0.762071,3.565429,647.0,0.0,0.0,647.0,84923.0,7179.0,12336.0,2514.0,0.806429,2.803357,3.437893,13125.656878
3,2012-02-04,2012,2,4,1972.34,3.25,3.38,3.76,0.079,0.238,1.924,0.38,1.686,1.845,912.0,0.0,0.0,912.0,67786,7432.675439,2786.0,10445.0,2277.0,1852.969677,3.25,3.569194,3.942903,0.015387,0.225677,2.286581,880.0,0.0,0.0,880.0,70087.0,4319.0,9682.0,2311.0,0.37371,2.060903,2.271194,7964.431818,2011.301786,2.75,3.939286,4.745714,0.127536,0.762071,3.565429,647.0,0.0,0.0,647.0,84923.0,7179.0,12336.0,2514.0,0.806429,2.803357,3.437893,13125.656878
4,2012-02-05,2012,2,5,1972.34,3.25,3.38,3.76,0.079,0.238,1.924,0.38,1.686,1.845,912.0,0.0,0.0,912.0,67786,7432.675439,2786.0,10445.0,2277.0,1852.969677,3.25,3.569194,3.942903,0.015387,0.225677,2.286581,880.0,0.0,0.0,880.0,70087.0,4319.0,9682.0,2311.0,0.37371,2.060903,2.271194,7964.431818,2011.301786,2.75,3.939286,4.745714,0.127536,0.762071,3.565429,647.0,0.0,0.0,647.0,84923.0,7179.0,12336.0,2514.0,0.806429,2.803357,3.437893,13125.656878
5,2012-02-06,2012,2,6,1973.13,3.25,3.39,3.78,0.086,0.234,1.901,0.39,1.667,1.815,912.0,0.0,0.0,912.0,67786,7432.675439,2786.0,10445.0,2277.0,1852.969677,3.25,3.569194,3.942903,0.015387,0.225677,2.286581,880.0,0.0,0.0,880.0,70087.0,4319.0,9682.0,2311.0,0.37371,2.060903,2.271194,7964.431818,2011.301786,2.75,3.939286,4.745714,0.127536,0.762071,3.565429,647.0,0.0,0.0,647.0,84923.0,7179.0,12336.0,2514.0,0.806429,2.803357,3.437893,13125.656878
6,2012-02-07,2012,2,7,1981.59,3.25,3.41,3.81,0.081,0.25,1.977,0.4,1.727,1.896,912.0,0.0,0.0,912.0,67786,7432.675439,2786.0,10445.0,2277.0,1852.969677,3.25,3.569194,3.942903,0.015387,0.225677,2.286581,880.0,0.0,0.0,880.0,70087.0,4319.0,9682.0,2311.0,0.37371,2.060903,2.271194,7964.431818,2011.301786,2.75,3.939286,4.745714,0.127536,0.762071,3.565429,647.0,0.0,0.0,647.0,84923.0,7179.0,12336.0,2514.0,0.806429,2.803357,3.437893,13125.656878
7,2012-02-08,2012,2,8,2003.73,3.25,3.44,3.83,0.081,0.258,1.982,0.39,1.724,1.901,912.0,0.0,0.0,912.0,67786,7432.675439,2786.0,10445.0,2277.0,1852.969677,3.25,3.569194,3.942903,0.015387,0.225677,2.286581,880.0,0.0,0.0,880.0,70087.0,4319.0,9682.0,2311.0,0.37371,2.060903,2.271194,7964.431818,2011.301786,2.75,3.939286,4.745714,0.127536,0.762071,3.565429,647.0,0.0,0.0,647.0,84923.0,7179.0,12336.0,2514.0,0.806429,2.803357,3.437893,13125.656878
8,2012-02-09,2012,2,9,2014.62,3.25,3.45,3.81,0.091,0.266,2.036,0.36,1.77,1.945,912.0,0.0,0.0,912.0,67786,7432.675439,2786.0,10445.0,2277.0,1852.969677,3.25,3.569194,3.942903,0.015387,0.225677,2.286581,880.0,0.0,0.0,880.0,70087.0,4319.0,9682.0,2311.0,0.37371,2.060903,2.271194,7964.431818,2011.301786,2.75,3.939286,4.745714,0.127536,0.762071,3.565429,647.0,0.0,0.0,647.0,84923.0,7179.0,12336.0,2514.0,0.806429,2.803357,3.437893,13125.656878
9,2012-02-10,2012,2,10,1993.71,3.25,3.455,3.82,0.089,0.278,1.984,0.365,1.706,1.895,912.0,0.0,0.0,912.0,67786,7432.675439,2786.0,10445.0,2277.0,1852.969677,3.25,3.569194,3.942903,0.015387,0.225677,2.286581,880.0,0.0,0.0,880.0,70087.0,4319.0,9682.0,2311.0,0.37371,2.060903,2.271194,7964.431818,2011.301786,2.75,3.939286,4.745714,0.127536,0.762071,3.565429,647.0,0.0,0.0,647.0,84923.0,7179.0,12336.0,2514.0,0.806429,2.803357,3.437893,13125.656878


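> I first tried to compute rates of change, but some values are 0, which makes the calculation produce null or inf, so I decided to use the amount of change (a difference) instead of a rate.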

>> When building a formula, watch out for cases where a value is divided by 0 or where 0 itself is being divided.
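
As a small illustration of the note above, the sketch below compares a rate of change with a plain difference on toy data. The frame `toy` and its column names are hypothetical and are not part of `df_economic`; they only mimic supply-style columns that can legitimately be 0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a current value and the value 6 months earlier.
toy = pd.DataFrame({
    "now": [912.0, 0.0, 492.0],
    "before": [880.0, 0.0, 0.0],
})

# A rate of change divides by the past value, so zeros cause trouble:
# 0/0 becomes NaN and x/0 becomes inf.
toy["rate"] = (toy["now"] - toy["before"]) / toy["before"]

# A plain difference stays finite whenever both inputs are finite.
toy["diff"] = toy["now"] - toy["before"]

print(toy)
```

The second row yields NaN and the third yields inf in `rate`, while `diff` stays finite in every row.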

In [None]:
df_economic_columns = list(df_economic.columns)
df_economic_columns_now = df_economic_columns[4:23]
df_economic_columns_6m_before = df_economic_columns[23:42]
df_economic_columns_12m_before = df_economic_columns[42:]

In [None]:
df_economic_columns_now  = sorted(df_economic_columns_now)
print(len(df_economic_columns_now))
df_economic_columns_now

19


['korea_10-3_year',
 'korea_10_year',
 'korea_3_year',
 'korea_rp',
 'kospi_index',
 'last_month_combination_supply',
 'last_month_etc_supply',
 'last_month_general_supply',
 'last_month_total_deal_count',
 'last_month_total_full_rent_count',
 'last_month_total_month_rent_count',
 'last_month_total_supply',
 'last_month_total_unsold_count',
 'last_month_total_unsold_ratio',
 'us_10-2_year',
 'us_10-3_year_month',
 'us_10_year',
 'us_2_year',
 'us_3_month']

In [None]:
df_economic_columns_6m_before = sorted(df_economic_columns_6m_before)
print(len(df_economic_columns_6m_before))
df_economic_columns_6m_before

19


['korea_10-3_year_6m_before',
 'korea_10_year_6m_before',
 'korea_3_year_6m_before',
 'korea_rp_6m_before',
 'kospi_index_6m_before',
 'last_month_combination_supply_6m_before',
 'last_month_etc_supply_6m_before',
 'last_month_general_supply_6m_before',
 'last_month_total_deal_count_6m_before',
 'last_month_total_full_rent_count_6m_before',
 'last_month_total_month_rent_count_6m_before',
 'last_month_total_supply_6m_before',
 'last_month_total_unsold_count_6m_before',
 'last_month_total_unsold_ratio_6m_before',
 'us_10-2_year_6m_before',
 'us_10-3_year_month_6m_before',
 'us_10_year_6m_before',
 'us_2_year_6m_before',
 'us_3_month_6m_before']

In [None]:
df_economic_columns_12m_before = sorted(df_economic_columns_12m_before)
print(len(df_economic_columns_12m_before))
df_economic_columns_12m_before

19


['korea_10-3_year_12m_before',
 'korea_10_year_12m_before',
 'korea_3_year_12m_before',
 'korea_rp_12m_before',
 'kospi_index_12m_before',
 'last_month_combination_supply_12m_before',
 'last_month_etc_supply_12m_before',
 'last_month_general_supply_12m_before',
 'last_month_total_deal_count_12m_before',
 'last_month_total_full_rent_count_12m_before',
 'last_month_total_month_rent_count_12m_before',
 'last_month_total_supply_12m_before',
 'last_month_total_unsold_count_12m_before',
 'last_month_total_unsold_ratio_12m_before',
 'us_10-2_year_12m_before',
 'us_10-3_year_month_12m_before',
 'us_10_year_12m_before',
 'us_2_year_12m_before',
 'us_3_month_12m_before']

In [None]:
# change = current value - past value
# compute each change in a loop, overwriting the *_6m_before and *_12m_before columns in place
for i in range(len(df_economic_columns_now)):
  df_economic[df_economic_columns_6m_before[i]] = df_economic[df_economic_columns_now[i]] - df_economic[df_economic_columns_6m_before[i]]
  df_economic[df_economic_columns_12m_before[i]] = df_economic[df_economic_columns_now[i]] - df_economic[df_economic_columns_12m_before[i]]
df_economic

Unnamed: 0,date,year,month,day,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,us_10_year,korea_10-3_year,us_10-2_year,us_10-3_year_month,last_month_general_supply,last_month_combination_supply,last_month_etc_supply,last_month_total_supply,last_month_total_unsold_count,last_month_total_unsold_ratio,last_month_total_deal_count,last_month_total_full_rent_count,last_month_total_month_rent_count,kospi_index_6m_before,korea_rp_6m_before,korea_3_year_6m_before,korea_10_year_6m_before,us_3_month_6m_before,us_2_year_6m_before,us_10_year_6m_before,last_month_general_supply_6m_before,last_month_combination_supply_6m_before,last_month_etc_supply_6m_before,last_month_total_supply_6m_before,last_month_total_unsold_count_6m_before,last_month_total_deal_count_6m_before,last_month_total_full_rent_count_6m_before,last_month_total_month_rent_count_6m_before,korea_10-3_year_6m_before,us_10-2_year_6m_before,us_10-3_year_month_6m_before,last_month_total_unsold_ratio_6m_before,kospi_index_12m_before,korea_rp_12m_before,korea_3_year_12m_before,korea_10_year_12m_before,us_3_month_12m_before,us_2_year_12m_before,us_10_year_12m_before,last_month_general_supply_12m_before,last_month_combination_supply_12m_before,last_month_etc_supply_12m_before,last_month_total_supply_12m_before,last_month_total_unsold_count_12m_before,last_month_total_deal_count_12m_before,last_month_total_full_rent_count_12m_before,last_month_total_month_rent_count_12m_before,korea_10-3_year_12m_before,us_10-2_year_12m_before,us_10-3_year_month_12m_before,last_month_total_unsold_ratio_12m_before
0,2012-02-01,2012,2,1,1959.24,3.25,3.380,3.750,0.061,0.226,1.830,0.370,1.604,1.769,912.0,0.0,0.0,912.0,67786,7432.675439,2786.0,10445.0,2277.0,106.270323,0.0,-0.189194,-0.192903,0.045613,0.000323,-0.456581,32.0,0.0,0.0,32.0,-2301.0,-1533.0,763.0,-34.0,-0.00371,-0.456903,-0.502194,-531.756380,-52.061786,0.500000,-0.559286,-0.995714,-0.066536,-0.536071,-1.735429,265.0,0.0,0.0,265.0,-17137.0,-4393.0,-1891.0,-237.0,-0.436429,-1.199357,-1.668893,-5692.981439
1,2012-02-02,2012,2,2,1984.30,3.25,3.380,3.760,0.084,0.226,1.823,0.380,1.597,1.739,912.0,0.0,0.0,912.0,67786,7432.675439,2786.0,10445.0,2277.0,131.330323,0.0,-0.189194,-0.182903,0.068613,0.000323,-0.463581,32.0,0.0,0.0,32.0,-2301.0,-1533.0,763.0,-34.0,0.00629,-0.463903,-0.532194,-531.756380,-27.001786,0.500000,-0.559286,-0.985714,-0.043536,-0.536071,-1.742429,265.0,0.0,0.0,265.0,-17137.0,-4393.0,-1891.0,-237.0,-0.426429,-1.206357,-1.698893,-5692.981439
2,2012-02-03,2012,2,3,1972.34,3.25,3.380,3.760,0.079,0.238,1.924,0.380,1.686,1.845,912.0,0.0,0.0,912.0,67786,7432.675439,2786.0,10445.0,2277.0,119.370323,0.0,-0.189194,-0.182903,0.063613,0.012323,-0.362581,32.0,0.0,0.0,32.0,-2301.0,-1533.0,763.0,-34.0,0.00629,-0.374903,-0.426194,-531.756380,-38.961786,0.500000,-0.559286,-0.985714,-0.048536,-0.524071,-1.641429,265.0,0.0,0.0,265.0,-17137.0,-4393.0,-1891.0,-237.0,-0.426429,-1.117357,-1.592893,-5692.981439
3,2012-02-04,2012,2,4,1972.34,3.25,3.380,3.760,0.079,0.238,1.924,0.380,1.686,1.845,912.0,0.0,0.0,912.0,67786,7432.675439,2786.0,10445.0,2277.0,119.370323,0.0,-0.189194,-0.182903,0.063613,0.012323,-0.362581,32.0,0.0,0.0,32.0,-2301.0,-1533.0,763.0,-34.0,0.00629,-0.374903,-0.426194,-531.756380,-38.961786,0.500000,-0.559286,-0.985714,-0.048536,-0.524071,-1.641429,265.0,0.0,0.0,265.0,-17137.0,-4393.0,-1891.0,-237.0,-0.426429,-1.117357,-1.592893,-5692.981439
4,2012-02-05,2012,2,5,1972.34,3.25,3.380,3.760,0.079,0.238,1.924,0.380,1.686,1.845,912.0,0.0,0.0,912.0,67786,7432.675439,2786.0,10445.0,2277.0,119.370323,0.0,-0.189194,-0.182903,0.063613,0.012323,-0.362581,32.0,0.0,0.0,32.0,-2301.0,-1533.0,763.0,-34.0,0.00629,-0.374903,-0.426194,-531.756380,-38.961786,0.500000,-0.559286,-0.985714,-0.048536,-0.524071,-1.641429,265.0,0.0,0.0,265.0,-17137.0,-4393.0,-1891.0,-237.0,-0.426429,-1.117357,-1.592893,-5692.981439
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4285,2023-10-26,2023,10,26,2299.08,3.50,3.692,3.977,5.479,5.046,4.849,0.285,-0.197,-0.630,222.0,492.0,0.0,714.0,59806,8376.190476,3448.0,11173.0,7342.0,-222.089667,0.0,0.430500,0.655800,0.448833,1.004800,1.383167,-12462.0,-2830.0,-2629.0,-17921.0,-12298.0,351.0,-5301.0,-3332.0,0.22530,0.378367,0.934333,7989.262652,75.587097,0.677419,-0.543484,-0.275290,1.747935,0.662645,0.867968,193.0,492.0,-150.0,535.0,18202.0,2799.0,915.0,-451.0,0.268194,0.205323,-0.879968,-14866.267624
4286,2023-10-27,2023,10,27,2302.81,3.50,3.692,3.977,5.477,5.015,4.845,0.285,-0.170,-0.632,222.0,492.0,0.0,714.0,59806,8376.190476,3448.0,11173.0,7342.0,-218.359667,0.0,0.430500,0.655800,0.446833,0.973800,1.379167,-12462.0,-2830.0,-2629.0,-17921.0,-12298.0,351.0,-5301.0,-3332.0,0.22530,0.405367,0.932333,7989.262652,79.317097,0.677419,-0.543484,-0.275290,1.745935,0.631645,0.863968,193.0,492.0,-150.0,535.0,18202.0,2799.0,915.0,-451.0,0.268194,0.232323,-0.881968,-14866.267624
4287,2023-10-28,2023,10,28,2302.81,3.50,3.692,3.977,5.477,5.015,4.845,0.285,-0.170,-0.632,222.0,492.0,0.0,714.0,59806,8376.190476,3448.0,11173.0,7342.0,-218.359667,0.0,0.430500,0.655800,0.446833,0.973800,1.379167,-12462.0,-2830.0,-2629.0,-17921.0,-12298.0,351.0,-5301.0,-3332.0,0.22530,0.405367,0.932333,7989.262652,79.317097,0.677419,-0.543484,-0.275290,1.745935,0.631645,0.863968,193.0,492.0,-150.0,535.0,18202.0,2799.0,915.0,-451.0,0.268194,0.232323,-0.881968,-14866.267624
4288,2023-10-29,2023,10,29,2302.81,3.50,3.692,3.977,5.477,5.015,4.845,0.285,-0.170,-0.632,222.0,492.0,0.0,714.0,59806,8376.190476,3448.0,11173.0,7342.0,-218.359667,0.0,0.430500,0.655800,0.446833,0.973800,1.379167,-12462.0,-2830.0,-2629.0,-17921.0,-12298.0,351.0,-5301.0,-3332.0,0.22530,0.405367,0.932333,7989.262652,79.317097,0.677419,-0.543484,-0.275290,1.745935,0.631645,0.863968,193.0,492.0,-150.0,535.0,18202.0,2799.0,915.0,-451.0,0.268194,0.232323,-0.881968,-14866.267624
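
The loop in the cell above relies on the three sorted lists lining up position by position. A minimal sketch of an alternative that pairs columns by name is shown here, assuming every lagged column is the base name plus a `_6m_before` or `_12m_before` suffix, as the printed lists suggest. The frame `toy` and the list `base_columns` are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical miniature of df_economic with a single base column.
toy = pd.DataFrame({
    "korea_rp": [3.25, 3.50],
    "korea_rp_6m_before": [3.25, 3.50],
    "korea_rp_12m_before": [2.75, 2.82],
})

base_columns = ["korea_rp"]  # columns that hold the current values
for col in base_columns:
    for suffix in ("_6m_before", "_12m_before"):
        # Overwrite the lagged column with (current - lagged).
        toy[col + suffix] = toy[col] - toy[col + suffix]

print(toy)
```

Pairing by name avoids a silent mismatch if a column is ever added or renamed so that the sorted orders stop agreeing.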


In [None]:
df_economic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4290 entries, 0 to 4289
Data columns (total 61 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   date                                          4290 non-null   object 
 1   year                                          4290 non-null   int64  
 2   month                                         4290 non-null   int64  
 3   day                                           4290 non-null   int64  
 4   kospi_index                                   4290 non-null   float64
 5   korea_rp                                      4290 non-null   float64
 6   korea_3_year                                  4290 non-null   float64
 7   korea_10_year                                 4290 non-null   float64
 8   us_3_month                                    4290 non-null   float64
 9   us_2_year                                     4290 non-null   f

In [None]:
df_economic.to_pickle('/content/drive/MyDrive/house_price/after_data/final_economic.pkl')

>> Column dtypes can also be converted to reduce memory usage.
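
One way this could look is sketched below on a hypothetical two-column frame. The notebook itself does not perform this conversion. Note that `float32` keeps only about seven significant digits, so whether that is acceptable depends on the column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame using the default 64-bit dtypes.
toy = pd.DataFrame({
    "year": np.array([2012, 2013, 2014], dtype="int64"),
    "kospi_index": np.array([1959.24, 1984.30, 1972.34], dtype="float64"),
})
before = toy.memory_usage(deep=True).sum()

# Downcast the integer column to the smallest integer type that fits.
toy["year"] = pd.to_numeric(toy["year"], downcast="integer")
# Store the float column in 32 bits (this loses some precision).
toy["kospi_index"] = toy["kospi_index"].astype("float32")

after = toy.memory_usage(deep=True).sum()
print(toy.dtypes)
print(before, "->", after)
```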

>> After merging or modifying values, check whether any null or inf values exist -> if they are only discovered later in the process, a large part of the work has to be redone.
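
A minimal sketch of such a check follows. The helper `count_bad_values` is a hypothetical name, not something defined elsewhere in this notebook. It reports, per numeric column, how many NaN and how many infinite values are present.

```python
import numpy as np
import pandas as pd

def count_bad_values(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column counts of NaN and +/-inf for the numeric columns."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "n_null": numeric.isna().sum(),
        "n_inf": np.isinf(numeric).sum(),
    })

# Hypothetical data with one NaN in "a" and two infinities in "b".
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.inf, 2.0, -np.inf]})
report = count_bad_values(toy)
print(report)
```

Running a check like this right after each merge or derived-column step keeps the problem close to the code that caused it.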

# Creating the df_area_deal, df_area_full_rent, and df_area_year_rent files

- On days when no 'apartment deal' was concluded, the most recently concluded deal price is assumed to remain in effect
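
The assumption above can be illustrated on a single hypothetical apartment. The prices and dates below are made up for the sketch. Days before the first recorded deal have nothing to carry forward, so they remain NaN.

```python
import pandas as pd

# Hypothetical deals for one apartment: price per area on the days a deal closed.
deals = pd.Series(
    [823.2, 842.4],
    index=pd.to_datetime(["2011-07-09", "2011-07-12"]),
)

# Expand to every calendar day, then carry the last known price forward.
all_days = pd.date_range("2011-07-08", "2011-07-13", freq="D")
daily = deals.reindex(all_days).ffill()
print(daily)
```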

## Loading the required data

In [None]:
import pandas as pd
import numpy as np
# Load the data
df_deal = pd.read_csv("/content/drive/MyDrive/house_price/after_data/apartment_deal.csv",  encoding='UTF8')
df_month_rent = pd.read_csv("/content/drive/MyDrive/house_price/after_data/apartment_month_rent.csv",  encoding='UTF8')
df_full_rent = pd.read_csv("/content/drive/MyDrive/house_price/after_data/apartment_full_rent.csv",  encoding='UTF8')

## Creating the df_area_deal file

- Create a DataFrame that holds, for each apartment, the most recently concluded 'sale price per pyeong' (computed below as `deal_price / area`)
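
The bullet above speaks of a price per pyeong, while the notebook divides by `area` directly. Assuming `area` is in square metres (the source does not state the unit), a conversion could be sketched as follows. `SQM_PER_PYEONG` and `price_per_pyeong` are hypothetical names introduced only for this illustration.

```python
# 1 pyeong is defined as 400/121 square metres (about 3.3058).
SQM_PER_PYEONG = 400 / 121

def price_per_pyeong(deal_price: float, area_sqm: float) -> float:
    """Convert a deal price and an area in square metres to a price per pyeong."""
    return deal_price / (area_sqm / SQM_PER_PYEONG)

# Values taken from the first row of df_deal shown in this notebook.
per_sqm = 64000 / 77.75
per_pyeong = price_per_pyeong(64000, 77.75)
print(round(per_sqm, 2), round(per_pyeong, 2))
```

Because the conversion is a constant factor, it does not affect relative comparisons between apartments or over time.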

### Creating the daily apartment sales pivot table

In [None]:
# Inspect a few sample rows
df_deal.head()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,deal_price
0,2011-07-09,2011,7,9,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,64000
1,2011-07-28,2011,7,28,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,65500
2,2011-01-19,2011,1,19,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,70500
3,2011-09-02,2011,9,2,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,85000
4,2011-12-17,2011,12,17,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,68000


In [None]:
df_deal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 913327 entries, 0 to 913326
Data columns (total 12 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   date        913327 non-null  object 
 1   year        913327 non-null  int64  
 2   month       913327 non-null  int64  
 3   day         913327 non-null  int64  
 4   address_0   913327 non-null  object 
 5   address_1   913327 non-null  object 
 6   address_2   913327 non-null  object 
 7   address_3   913327 non-null  float64
 8   address_4   913327 non-null  float64
 9   name        913327 non-null  object 
 10  area        913327 non-null  float64
 11  deal_price  913327 non-null  int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 83.6+ MB


In [None]:
# Add a price-per-area column
df_deal['area_deal_price'] = df_deal['deal_price'] / df_deal['area']
df_deal.head()

Unnamed: 0,date,year,month,day,address_0,address_1,address_2,address_3,address_4,name,area,deal_price,area_deal_price
0,2011-07-09,2011,7,9,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,64000,823.151125
1,2011-07-28,2011,7,28,서울특별시,강남구,개포동,655.0,2.0,개포2차현대아파트(220),77.75,65500,842.44373
2,2011-01-19,2011,1,19,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,70500,1047.859691
3,2011-09-02,2011,9,2,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,79.97,85000,1062.898587
4,2011-12-17,2011,12,17,서울특별시,강남구,개포동,658.0,1.0,개포6차우성아파트1동~8동,67.28,68000,1010.701546


In [None]:
# Assuming the most recently concluded price stays in effect, determine the price for every date
import numpy as np
pivot_table_area_deal = df_deal.pivot_table(index=['year','month','day'], columns=['address_1','address_2','address_3','address_4'], values='area_deal_price')
pivot_table_area_deal


Unnamed: 0_level_0,Unnamed: 1_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,Unnamed: 1_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,논현동,논현동,논현동,논현동,논현동,논현동,논현동,논현동,논현동,논현동,...,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,Unnamed: 1_level_2,address_3,12.0,12.0,138.0,140.0,141.0,166.0,172.0,176.0,177.0,179.0,185.0,187.0,189.0,649.0,651.0,652.0,653.0,654.0,655.0,655.0,655.0,656.0,658.0,1164.0,1164.0,1164.0,1164.0,1164.0,1164.0,1164.0,1164.0,1165.0,1167.0,1204.0,1242.0,1260.0,1260.0,1260.0,1280.0,1282.0,9.0,16.0,22.0,39.0,43.0,44.0,44.0,44.0,46.0,46.0,...,618.0,650.0,654.0,656.0,657.0,661.0,783.0,785.0,786.0,787.0,788.0,795.0,796.0,797.0,799.0,800.0,801.0,802.0,803.0,804.0,816.0,817.0,11.0,11.0,19.0,19.0,19.0,42.0,110.0,207.0,207.0,208.0,208.0,274.0,274.0,274.0,274.0,286.0,297.0,307.0,307.0,314.0,318.0,331.0,413.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,Unnamed: 1_level_3,address_4,0.0,2.0,0.0,0.0,0.0,4.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,2.0,3.0,0.0,1.0,7.0,12.0,13.0,14.0,20.0,25.0,28.0,30.0,0.0,0.0,5.0,2.0,2.0,4.0,11.0,0.0,0.0,2.0,42.0,0.0,10.0,3.0,6.0,8.0,9.0,2.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0,3.0,16.0,0.0,26.0,12.0,14.0,4.0,33.0,51.0,75.0,76.0,77.0,22.0,84.0,6.0,76.0,1.0,81.0,64.0,8.0,0.0,0.0,0.0,0.0,0.0
year,month,day
2011,1,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2011,1,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,431.726908,,,
2011,1,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,448.028674,,,,,,,,,,,,,,,410.887475,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2011,1,4,1018.685955,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,688.806888,,,...,,,382.878461,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2011,1,5,1087.781432,,2101.057579,,1887.191539,,,,,,1218.844152,,,,,,1026.278961,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,398.357072,,,418.922099,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023,10,27,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2023,10,28,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,932.652661,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2023,10,29,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2023,10,30,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,971.385542,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [None]:
pivot_table_area_deal.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 4685 entries, (2011, 1, 1) to (2023, 10, 31)
Columns: 8933 entries, ('강남구', '개포동', 12.0, 0.0) to ('중랑구', '중화동', 454.0, 0.0)
dtypes: float64(8933)
memory usage: 319.3 MB
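
One detail worth keeping in mind about the pivot above: when `aggfunc` is not given, `pivot_table` averages all rows that share the same index and column keys. The sketch below uses a hypothetical toy frame to show that two deals for the same address on the same day collapse into their mean.

```python
import pandas as pd

# Hypothetical toy deals: two deals for the same address on the same day.
toy = pd.DataFrame({
    "year": [2011, 2011, 2011],
    "month": [1, 1, 1],
    "day": [4, 4, 5],
    "address_1": ["A", "A", "A"],
    "area_deal_price": [1000.0, 1200.0, 1100.0],
})

# With no aggfunc given, duplicate (index, column) pairs are averaged.
pivot = toy.pivot_table(
    index=["year", "month", "day"],
    columns=["address_1"],
    values="area_deal_price",
)
print(pivot)
```

If the last deal of the day or the median is preferred instead, `aggfunc` can be set accordingly.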


In [None]:
# Build a list of every date from January 1, 2011 through October 31, 2023
from datetime import datetime, timedelta

start_date = datetime(2011, 1, 1)  # start date
end_date = datetime(2023, 10, 31)  # end date

date_list = []
current_date = start_date
while current_date <= end_date:
    date_tuple = (current_date.year, current_date.month, current_date.day)
    date_list.append(date_tuple)
    current_date += timedelta(days=1)

print(date_list)

[(2011, 1, 1), (2011, 1, 2), (2011, 1, 3), (2011, 1, 4), (2011, 1, 5), (2011, 1, 6), (2011, 1, 7), (2011, 1, 8), (2011, 1, 9), (2011, 1, 10), (2011, 1, 11), (2011, 1, 12), (2011, 1, 13), (2011, 1, 14), (2011, 1, 15), (2011, 1, 16), (2011, 1, 17), (2011, 1, 18), (2011, 1, 19), (2011, 1, 20), (2011, 1, 21), (2011, 1, 22), (2011, 1, 23), (2011, 1, 24), (2011, 1, 25), (2011, 1, 26), (2011, 1, 27), (2011, 1, 28), (2011, 1, 29), (2011, 1, 30), (2011, 1, 31), (2011, 2, 1), (2011, 2, 2), (2011, 2, 3), (2011, 2, 4), (2011, 2, 5), (2011, 2, 6), (2011, 2, 7), (2011, 2, 8), (2011, 2, 9), (2011, 2, 10), (2011, 2, 11), (2011, 2, 12), (2011, 2, 13), (2011, 2, 14), (2011, 2, 15), (2011, 2, 16), (2011, 2, 17), (2011, 2, 18), (2011, 2, 19), (2011, 2, 20), (2011, 2, 21), (2011, 2, 22), (2011, 2, 23), (2011, 2, 24), (2011, 2, 25), (2011, 2, 26), (2011, 2, 27), (2011, 2, 28), (2011, 3, 1), (2011, 3, 2), (2011, 3, 3), (2011, 3, 4), (2011, 3, 5), (2011, 3, 6), (2011, 3, 7), (2011, 3, 8), (2011, 3, 9), (2011,

In [None]:
len(date_list)

4687
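
The same list of dates can also be built with `pd.date_range`. This is a sketch of an equivalent approach, not what the notebook runs, and the names `days` and `date_tuples` are introduced here only for illustration.

```python
import pandas as pd

# Every calendar day from 2011-01-01 through 2023-10-31, inclusive.
days = pd.date_range("2011-01-01", "2023-10-31", freq="D")

# Same (year, month, day) tuple layout as date_list above.
date_tuples = list(zip(days.year, days.month, days.day))
print(len(date_tuples))
```

The length matches the 4687 days counted above.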

In [None]:
pivot_table_area_deal.index

MultiIndex([(2011,  1,  1),
            (2011,  1,  2),
            (2011,  1,  3),
            (2011,  1,  4),
            (2011,  1,  5),
            (2011,  1,  6),
            (2011,  1,  7),
            (2011,  1,  8),
            (2011,  1,  9),
            (2011,  1, 10),
            ...
            (2023, 10, 22),
            (2023, 10, 23),
            (2023, 10, 24),
            (2023, 10, 25),
            (2023, 10, 26),
            (2023, 10, 27),
            (2023, 10, 28),
            (2023, 10, 29),
            (2023, 10, 30),
            (2023, 10, 31)],
           names=['year', 'month', 'day'], length=4685)

In [None]:
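# Subtract the 'deal dates' from all dates in the period to find the dates that have no deals
print(set(date_list) - set(pivot_table_area_deal.index)) # dates present in 'all dates' but missing from 'deal dates'
print(set(pivot_table_area_deal.index) - set(date_list)) # dates that were created by mistake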

{(2016, 2, 9), (2022, 9, 11)}
set()
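
The missing dates found above can also be added with a single `reindex` call instead of one `.loc` assignment per date. The sketch below uses a hypothetical two-row pivot. The notebook's own cells take the `.loc` plus sort route.

```python
import numpy as np
import pandas as pd

# Hypothetical pivot whose (year, month, day) index skips 2016-02-09.
idx = pd.MultiIndex.from_tuples(
    [(2016, 2, 8), (2016, 2, 10)], names=["year", "month", "day"]
)
pivot = pd.DataFrame({"apt": [500.0, 510.0]}, index=idx)

# Full calendar as a MultiIndex with the same level names.
days = pd.date_range("2016-02-08", "2016-02-10", freq="D")
full_idx = pd.MultiIndex.from_tuples(
    [(d.year, d.month, d.day) for d in days], names=["year", "month", "day"]
)

# reindex inserts the missing dates as NaN rows, already in calendar order.
filled = pivot.reindex(full_idx)
print(filled)
```

Since the result follows the order of `full_idx`, no separate sorting step is needed afterwards.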


In [None]:
# Add the missing dates (dates on which no deal was recorded) as rows filled with null
pivot_table_area_deal.loc[(2016, 2, 9)]=np.nan
pivot_table_area_deal.loc[(2022, 9, 11)]=np.nan

In [None]:
# Sort by year, month, day - without sorting, the rows added just above would not sit in their proper positions
pivot_table_area_deal = pivot_table_area_deal.sort_values(by=['year','month','day'])
pivot_table_area_deal

Unnamed: 0_level_0,Unnamed: 1_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,Unnamed: 1_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,논현동,논현동,논현동,논현동,논현동,논현동,논현동,논현동,논현동,논현동,...,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,Unnamed: 1_level_2,address_3,12.0,12.0,138.0,140.0,141.0,166.0,172.0,176.0,177.0,179.0,185.0,187.0,189.0,649.0,651.0,652.0,653.0,654.0,655.0,655.0,655.0,656.0,658.0,1164.0,1164.0,1164.0,1164.0,1164.0,1164.0,1164.0,1164.0,1165.0,1167.0,1204.0,1242.0,1260.0,1260.0,1260.0,1280.0,1282.0,9.0,16.0,22.0,39.0,43.0,44.0,44.0,44.0,46.0,46.0,...,618.0,650.0,654.0,656.0,657.0,661.0,783.0,785.0,786.0,787.0,788.0,795.0,796.0,797.0,799.0,800.0,801.0,802.0,803.0,804.0,816.0,817.0,11.0,11.0,19.0,19.0,19.0,42.0,110.0,207.0,207.0,208.0,208.0,274.0,274.0,274.0,274.0,286.0,297.0,307.0,307.0,314.0,318.0,331.0,413.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,Unnamed: 1_level_3,address_4,0.0,2.0,0.0,0.0,0.0,4.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,2.0,3.0,0.0,1.0,7.0,12.0,13.0,14.0,20.0,25.0,28.0,30.0,0.0,0.0,5.0,2.0,2.0,4.0,11.0,0.0,0.0,2.0,42.0,0.0,10.0,3.0,6.0,8.0,9.0,2.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0,3.0,16.0,0.0,26.0,12.0,14.0,4.0,33.0,51.0,75.0,76.0,77.0,22.0,84.0,6.0,76.0,1.0,81.0,64.0,8.0,0.0,0.0,0.0,0.0,0.0
year,month,day
2011,1,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2011,1,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,431.726908,,,
2011,1,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,448.028674,,,,,,,,,,,,,,,410.887475,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2011,1,4,1018.685955,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,688.806888,,,...,,,382.878461,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2011,1,5,1087.781432,,2101.057579,,1887.191539,,,,,,1218.844152,,,,,,1026.278961,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,398.357072,,,418.922099,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023,10,27,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2023,10,28,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,932.652661,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2023,10,29,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2023,10,30,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,971.385542,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [None]:
# The most recently concluded price stays in effect as the deal price, so use ffill()
pivot_table_area_deal=pivot_table_area_deal.ffill()
pivot_table_area_deal

Unnamed: 0_level_0,Unnamed: 1_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,Unnamed: 1_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,논현동,논현동,논현동,논현동,논현동,논현동,논현동,논현동,논현동,논현동,...,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,Unnamed: 1_level_2,address_3,12.0,12.0,138.0,140.0,141.0,166.0,172.0,176.0,177.0,179.0,185.0,187.0,189.0,649.0,651.0,652.0,653.0,654.0,655.0,655.0,655.0,656.0,658.0,1164.0,1164.0,1164.0,1164.0,1164.0,1164.0,1164.0,1164.0,1165.0,1167.0,1204.0,1242.0,1260.0,1260.0,1260.0,1280.0,1282.0,9.0,16.0,22.0,39.0,43.0,44.0,44.0,44.0,46.0,46.0,...,618.0,650.0,654.0,656.0,657.0,661.0,783.0,785.0,786.0,787.0,788.0,795.0,796.0,797.0,799.0,800.0,801.0,802.0,803.0,804.0,816.0,817.0,11.0,11.0,19.0,19.0,19.0,42.0,110.0,207.0,207.0,208.0,208.0,274.0,274.0,274.0,274.0,286.0,297.0,307.0,307.0,314.0,318.0,331.0,413.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,Unnamed: 1_level_3,address_4,0.0,2.0,0.0,0.0,0.0,4.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,2.0,3.0,0.0,1.0,7.0,12.0,13.0,14.0,20.0,25.0,28.0,30.0,0.0,0.0,5.0,2.0,2.0,4.0,11.0,0.0,0.0,2.0,42.0,0.0,10.0,3.0,6.0,8.0,9.0,2.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0,3.0,16.0,0.0,26.0,12.0,14.0,4.0,33.0,51.0,75.0,76.0,77.0,22.0,84.0,6.0,76.0,1.0,81.0,64.0,8.0,0.0,0.0,0.0,0.0,0.0
year,month,day
2011,1,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2011,1,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,431.726908,,,
2011,1,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,448.028674,,,,,,,,,,,,,,,410.887475,,,,,,,,,,,,,,,,,,,,,,,,,,,,431.726908,,,
2011,1,4,1018.685955,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,688.806888,,,...,,,382.878461,448.028674,,,,,,,,,,,,,,,410.887475,,,,,,,,,,,,,,,,,,,,,,,,,,,,431.726908,,,
2011,1,5,1087.781432,,2101.057579,,1887.191539,,,,,,1218.844152,,,,,,1026.278961,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,688.806888,,,...,,,382.878461,448.028674,,,,,,,,,,,,,398.357072,,410.887475,418.922099,,,,,,,,,,,,,,,,,,,,,,,,,,,431.726908,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023,10,27,3134.418324,1842.889054,3481.293903,2487.219819,4324.324324,1413.594063,1342.758827,2172.968275,2768.940733,2386.634845,3385.203385,2639.73064,3656.21874,2536.605486,2126.525428,2352.764873,2295.597484,2230.144884,2335.766423,2340.836013,582.442656,2756.990941,2738.526948,906.308851,1517.006803,1253.561254,1061.6261,989.28277,777.325999,856.06329,882.012724,518.830495,909.50432,746.199481,769.056774,806.947433,784.106215,1103.852596,3306.016951,3167.556742,1090.016727,1187.178988,2420.934789,1209.574739,1187.396633,1188.734455,1057.21393,1158.301158,858.669152,820.179511,...,808.080808,953.815261,761.905217,811.860219,832.998796,884.066707,717.504744,916.230366,796.475174,885.991307,791.18678,663.507109,677.506775,783.619818,787.401575,423.120838,738.964154,826.739105,778.528724,847.756976,808.806216,871.511012,547.069981,693.444982,751.560843,393.22444,630.412634,792.459126,497.945973,671.378092,897.178609,635.451505,922.002604,589.207699,685.405705,586.105022,384.786148,978.542797,396.210164,542.779426,589.761736,872.199239,466.954023,956.130484,595.238095,818.61013,934.190170,727.417008,1006.355932,1131.141746
2023,10,28,3134.418324,1842.889054,3481.293903,2487.219819,4324.324324,1413.594063,1342.758827,2172.968275,2768.940733,2386.634845,3385.203385,2639.73064,3656.21874,2536.605486,2126.525428,2352.764873,2295.597484,2230.144884,2335.766423,2340.836013,582.442656,2756.990941,2738.526948,906.308851,1517.006803,1253.561254,1061.6261,989.28277,777.325999,856.06329,882.012724,518.830495,909.50432,746.199481,769.056774,806.947433,784.106215,1103.852596,3306.016951,3167.556742,1090.016727,1187.178988,2420.934789,1209.574739,1187.396633,1188.734455,1057.21393,1158.301158,858.669152,820.179511,...,808.080808,953.815261,932.652661,811.860219,832.998796,884.066707,717.504744,916.230366,796.475174,885.991307,791.18678,663.507109,677.506775,783.619818,787.401575,423.120838,738.964154,826.739105,778.528724,847.756976,808.806216,871.511012,547.069981,693.444982,751.560843,393.22444,630.412634,792.459126,497.945973,671.378092,897.178609,635.451505,922.002604,589.207699,685.405705,586.105022,384.786148,978.542797,396.210164,542.779426,589.761736,872.199239,466.954023,956.130484,595.238095,818.61013,934.190170,727.417008,1006.355932,1131.141746
2023,10,29,3134.418324,1842.889054,3481.293903,2487.219819,4324.324324,1413.594063,1342.758827,2172.968275,2768.940733,2386.634845,3385.203385,2639.73064,3656.21874,2536.605486,2126.525428,2352.764873,2295.597484,2230.144884,2335.766423,2340.836013,582.442656,2756.990941,2738.526948,906.308851,1517.006803,1253.561254,1061.6261,989.28277,777.325999,856.06329,882.012724,518.830495,909.50432,746.199481,769.056774,806.947433,784.106215,1103.852596,3306.016951,3167.556742,1090.016727,1187.178988,2420.934789,1209.574739,1187.396633,1188.734455,1057.21393,1158.301158,858.669152,820.179511,...,808.080808,953.815261,932.652661,811.860219,832.998796,884.066707,717.504744,916.230366,796.475174,885.991307,791.18678,663.507109,677.506775,783.619818,787.401575,423.120838,738.964154,826.739105,778.528724,847.756976,808.806216,871.511012,547.069981,693.444982,751.560843,393.22444,630.412634,792.459126,497.945973,671.378092,897.178609,635.451505,922.002604,589.207699,685.405705,586.105022,384.786148,978.542797,396.210164,542.779426,589.761736,872.199239,466.954023,956.130484,595.238095,818.61013,934.190170,727.417008,1006.355932,1131.141746
2023,10,30,3134.418324,1842.889054,3481.293903,2487.219819,4324.324324,1413.594063,1342.758827,2172.968275,2768.940733,2386.634845,3385.203385,2639.73064,3656.21874,2536.605486,2126.525428,2352.764873,2295.597484,2230.144884,2335.766423,2340.836013,582.442656,2756.990941,2738.526948,906.308851,1517.006803,1253.561254,1061.6261,989.28277,777.325999,856.06329,882.012724,518.830495,909.50432,746.199481,769.056774,806.947433,784.106215,1103.852596,3306.016951,3167.556742,1090.016727,1187.178988,2420.934789,1209.574739,1187.396633,1188.734455,1057.21393,1158.301158,858.669152,820.179511,...,808.080808,971.385542,932.652661,811.860219,832.998796,884.066707,717.504744,916.230366,796.475174,885.991307,791.18678,663.507109,677.506775,783.619818,787.401575,423.120838,738.964154,826.739105,778.528724,847.756976,808.806216,871.511012,547.069981,693.444982,751.560843,393.22444,630.412634,792.459126,497.945973,671.378092,897.178609,635.451505,922.002604,589.207699,685.405705,586.105022,384.786148,978.542797,396.210164,542.779426,589.761736,872.199239,466.954023,956.130484,595.238095,818.61013,934.190170,727.417008,1006.355932,1131.141746


In [None]:
# Fill null values with 0; if they are left unfilled, the later stack() call skips the null cells
pivot_table_area_deal = pivot_table_area_deal.fillna(0)
pivot_table_area_deal

Unnamed: 0_level_0,Unnamed: 1_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,Unnamed: 1_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,논현동,논현동,논현동,논현동,논현동,논현동,논현동,논현동,논현동,논현동,...,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,신내동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,Unnamed: 1_level_2,address_3,12.0,12.0,138.0,140.0,141.0,166.0,172.0,176.0,177.0,179.0,185.0,187.0,189.0,649.0,651.0,652.0,653.0,654.0,655.0,655.0,655.0,656.0,658.0,1164.0,1164.0,1164.0,1164.0,1164.0,1164.0,1164.0,1164.0,1165.0,1167.0,1204.0,1242.0,1260.0,1260.0,1260.0,1280.0,1282.0,9.0,16.0,22.0,39.0,43.0,44.0,44.0,44.0,46.0,46.0,...,618.0,650.0,654.0,656.0,657.0,661.0,783.0,785.0,786.0,787.0,788.0,795.0,796.0,797.0,799.0,800.0,801.0,802.0,803.0,804.0,816.0,817.0,11.0,11.0,19.0,19.0,19.0,42.0,110.0,207.0,207.0,208.0,208.0,274.0,274.0,274.0,274.0,286.0,297.0,307.0,307.0,314.0,318.0,331.0,413.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,Unnamed: 1_level_3,address_4,0.0,2.0,0.0,0.0,0.0,4.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,2.0,3.0,0.0,1.0,7.0,12.0,13.0,14.0,20.0,25.0,28.0,30.0,0.0,0.0,5.0,2.0,2.0,4.0,11.0,0.0,0.0,2.0,42.0,0.0,10.0,3.0,6.0,8.0,9.0,2.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0,3.0,16.0,0.0,26.0,12.0,14.0,4.0,33.0,51.0,75.0,76.0,77.0,22.0,84.0,6.0,76.0,1.0,81.0,64.0,8.0,0.0,0.0,0.0,0.0,0.0
year,month,day
2011,1,1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0000,0.00000,0.000000,0.00000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000
2011,1,2,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0000,0.00000,0.000000,0.00000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,431.726908,0.000000,0.000000,0.000000
2011,1,3,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0000,0.00000,0.000000,0.00000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,448.028674,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,410.887475,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,431.726908,0.000000,0.000000,0.000000
2011,1,4,1018.685955,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0000,0.00000,0.000000,0.00000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,688.806888,0.000000,0.000000,...,0.000000,0.000000,382.878461,448.028674,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,410.887475,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,431.726908,0.000000,0.000000,0.000000
2011,1,5,1087.781432,0.000000,2101.057579,0.000000,1887.191539,0.000000,0.000000,0.000000,0.000000,0.000000,1218.844152,0.00000,0.00000,0.000000,0.000000,0.000000,1026.278961,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0000,0.00000,0.000000,0.00000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,688.806888,0.000000,0.000000,...,0.000000,0.000000,382.878461,448.028674,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,398.357072,0.000000,410.887475,418.922099,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,431.726908,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023,10,27,3134.418324,1842.889054,3481.293903,2487.219819,4324.324324,1413.594063,1342.758827,2172.968275,2768.940733,2386.634845,3385.203385,2639.73064,3656.21874,2536.605486,2126.525428,2352.764873,2295.597484,2230.144884,2335.766423,2340.836013,582.442656,2756.990941,2738.526948,906.308851,1517.006803,1253.561254,1061.6261,989.28277,777.325999,856.06329,882.012724,518.830495,909.50432,746.199481,769.056774,806.947433,784.106215,1103.852596,3306.016951,3167.556742,1090.016727,1187.178988,2420.934789,1209.574739,1187.396633,1188.734455,1057.21393,1158.301158,858.669152,820.179511,...,808.080808,953.815261,761.905217,811.860219,832.998796,884.066707,717.504744,916.230366,796.475174,885.991307,791.18678,663.507109,677.506775,783.619818,787.401575,423.120838,738.964154,826.739105,778.528724,847.756976,808.806216,871.511012,547.069981,693.444982,751.560843,393.22444,630.412634,792.459126,497.945973,671.378092,897.178609,635.451505,922.002604,589.207699,685.405705,586.105022,384.786148,978.542797,396.210164,542.779426,589.761736,872.199239,466.954023,956.130484,595.238095,818.61013,934.190170,727.417008,1006.355932,1131.141746
2023,10,28,3134.418324,1842.889054,3481.293903,2487.219819,4324.324324,1413.594063,1342.758827,2172.968275,2768.940733,2386.634845,3385.203385,2639.73064,3656.21874,2536.605486,2126.525428,2352.764873,2295.597484,2230.144884,2335.766423,2340.836013,582.442656,2756.990941,2738.526948,906.308851,1517.006803,1253.561254,1061.6261,989.28277,777.325999,856.06329,882.012724,518.830495,909.50432,746.199481,769.056774,806.947433,784.106215,1103.852596,3306.016951,3167.556742,1090.016727,1187.178988,2420.934789,1209.574739,1187.396633,1188.734455,1057.21393,1158.301158,858.669152,820.179511,...,808.080808,953.815261,932.652661,811.860219,832.998796,884.066707,717.504744,916.230366,796.475174,885.991307,791.18678,663.507109,677.506775,783.619818,787.401575,423.120838,738.964154,826.739105,778.528724,847.756976,808.806216,871.511012,547.069981,693.444982,751.560843,393.22444,630.412634,792.459126,497.945973,671.378092,897.178609,635.451505,922.002604,589.207699,685.405705,586.105022,384.786148,978.542797,396.210164,542.779426,589.761736,872.199239,466.954023,956.130484,595.238095,818.61013,934.190170,727.417008,1006.355932,1131.141746
2023,10,29,3134.418324,1842.889054,3481.293903,2487.219819,4324.324324,1413.594063,1342.758827,2172.968275,2768.940733,2386.634845,3385.203385,2639.73064,3656.21874,2536.605486,2126.525428,2352.764873,2295.597484,2230.144884,2335.766423,2340.836013,582.442656,2756.990941,2738.526948,906.308851,1517.006803,1253.561254,1061.6261,989.28277,777.325999,856.06329,882.012724,518.830495,909.50432,746.199481,769.056774,806.947433,784.106215,1103.852596,3306.016951,3167.556742,1090.016727,1187.178988,2420.934789,1209.574739,1187.396633,1188.734455,1057.21393,1158.301158,858.669152,820.179511,...,808.080808,953.815261,932.652661,811.860219,832.998796,884.066707,717.504744,916.230366,796.475174,885.991307,791.18678,663.507109,677.506775,783.619818,787.401575,423.120838,738.964154,826.739105,778.528724,847.756976,808.806216,871.511012,547.069981,693.444982,751.560843,393.22444,630.412634,792.459126,497.945973,671.378092,897.178609,635.451505,922.002604,589.207699,685.405705,586.105022,384.786148,978.542797,396.210164,542.779426,589.761736,872.199239,466.954023,956.130484,595.238095,818.61013,934.190170,727.417008,1006.355932,1131.141746
2023,10,30,3134.418324,1842.889054,3481.293903,2487.219819,4324.324324,1413.594063,1342.758827,2172.968275,2768.940733,2386.634845,3385.203385,2639.73064,3656.21874,2536.605486,2126.525428,2352.764873,2295.597484,2230.144884,2335.766423,2340.836013,582.442656,2756.990941,2738.526948,906.308851,1517.006803,1253.561254,1061.6261,989.28277,777.325999,856.06329,882.012724,518.830495,909.50432,746.199481,769.056774,806.947433,784.106215,1103.852596,3306.016951,3167.556742,1090.016727,1187.178988,2420.934789,1209.574739,1187.396633,1188.734455,1057.21393,1158.301158,858.669152,820.179511,...,808.080808,971.385542,932.652661,811.860219,832.998796,884.066707,717.504744,916.230366,796.475174,885.991307,791.18678,663.507109,677.506775,783.619818,787.401575,423.120838,738.964154,826.739105,778.528724,847.756976,808.806216,871.511012,547.069981,693.444982,751.560843,393.22444,630.412634,792.459126,497.945973,671.378092,897.178609,635.451505,922.002604,589.207699,685.405705,586.105022,384.786148,978.542797,396.210164,542.779426,589.761736,872.199239,466.954023,956.130484,595.238095,818.61013,934.190170,727.417008,1006.355932,1131.141746
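
The effect of filling before stacking can be checked on a toy table. This is a minimal sketch, not the notebook's data: the frame `pivot` and the column names `complex_a` and `complex_b` are hypothetical. Whether unfilled NaN cells are dropped by `stack` depends on the pandas version and options, so the sketch only verifies the filled case, where every cell survives.

```python
import numpy as np
import pandas as pd

# Toy pivot: 3 days x 2 hypothetical complexes, NaN where no deal exists yet.
pivot = pd.DataFrame(
    {"complex_a": [np.nan, 100.0, 110.0], "complex_b": [np.nan, np.nan, 200.0]},
    index=pd.Index([1, 2, 3], name="day"),
)

n_missing = int(pivot.isna().sum().sum())  # cells without a price
filled = pivot.fillna(0)                   # same step as in the notebook
stacked = filled.stack()                   # one row per (day, complex) pair

n_cells = pivot.shape[0] * pivot.shape[1]
print(n_missing, len(stacked), n_cells)
```

After filling, the stacked result has one row per cell of the pivot table, and the formerly missing cells appear as 0. Note that a 0 here means "no deal recorded yet", not a price of zero.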


### Pivot table -> DataFrame

In [None]:
# When slicing columns to process values, a large number of columns consumes more memory than a large number of rows, so transpose the table
pivot_table_area_deal = pivot_table_area_deal.T
pivot_table_area_deal

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,year,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,...,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,2023
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,month,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,...,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10
Unnamed: 0_level_2,Unnamed: 1_level_2,Unnamed: 2_level_2,day,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,...,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
address_1,address_2,address_3,address_4
강남구,개포동,12.0,0.0,0.0,0.000000,0.000000,1018.685955,1087.781432,1040.914561,1054.852321,1054.852321,1054.852321,1054.852321,1006.830256,1062.484189,1062.484189,1062.484189,1062.484189,1062.484189,1062.484189,1032.418980,1054.852321,1075.132811,1075.132811,1075.132811,1075.132811,1075.132811,1075.132811,1075.132811,1104.224640,1062.571868,1100.430053,1100.430053,1100.430053,1100.430053,1100.430053,1100.430053,1100.430053,1100.430053,1100.430053,1100.430053,1100.430053,1050.000000,1050.000000,1050.000000,1036.776213,1091.054954,1091.054954,1091.054954,1100.430053,923.749247,1071.420440,1087.781432,...,2554.610769,2554.610769,2554.610769,2554.610769,3048.317733,3048.317733,3048.317733,3048.317733,3044.002411,2776.372375,2776.372375,3098.912219,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324,3134.418324
강남구,개포동,12.0,2.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,920.318252,920.318252,920.318252,920.318252,920.318252,...,1779.004227,1779.004227,1779.004227,1779.004227,1779.004227,1779.004227,1779.004227,1779.004227,1779.004227,1779.004227,1779.004227,1779.004227,1779.004227,1779.004227,1779.004227,1779.004227,1779.004227,1779.004227,1779.004227,1779.004227,1842.889054,1842.889054,1842.889054,1842.889054,1842.889054,1842.889054,1842.889054,1842.889054,1842.889054,1842.889054,1842.889054,1842.889054,1842.889054,1842.889054,1842.889054,1842.889054,1842.889054,1842.889054,1842.889054,1842.889054,1842.889054,1842.889054,1842.889054,1842.889054,1842.889054,1842.889054,1842.889054,1842.889054,1842.889054,1842.889054
강남구,개포동,138.0,0.0,0.0,0.000000,0.000000,0.000000,2101.057579,2101.057579,2101.057579,2101.057579,2101.057579,2101.057579,2101.057579,2101.057579,2101.057579,1862.280457,2110.458284,2110.458284,2110.458284,2110.458284,2110.458284,2110.458284,1867.856147,1867.856147,1867.856147,1867.856147,1867.856147,1867.856147,1867.856147,1867.856147,1867.856147,1867.856147,1867.856147,1867.856147,1867.856147,1867.856147,1867.856147,1867.856147,1867.856147,1867.856147,1867.856147,1867.856147,1867.856147,1867.856147,1867.856147,1867.856147,1867.856147,1867.856147,1867.856147,1920.594837,1920.594837,1907.968575,...,3339.993077,3339.993077,3339.993077,3339.993077,3339.993077,3339.993077,3339.993077,3339.993077,3339.993077,3339.993077,3339.993077,3339.993077,3339.993077,3339.993077,3339.993077,3339.993077,3339.993077,3339.993077,3339.993077,3339.993077,3339.993077,3339.993077,3663.055771,3663.055771,3663.055771,3674.431966,3674.431966,3674.431966,3674.431966,3674.431966,3481.293903,3481.293903,3481.293903,3481.293903,3481.293903,3481.293903,3481.293903,3481.293903,3481.293903,3481.293903,3481.293903,3481.293903,3481.293903,3481.293903,3481.293903,3481.293903,3481.293903,3481.293903,3481.293903,3481.293903
강남구,개포동,140.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1565.991903,2064.490759,1759.201497,1665.038381,1665.038381,1665.038381,1665.038381,1665.038381,1665.038381,2095.949666,2095.949666,2095.949666,2168.579343,1585.525089,1585.525089,1585.525089,1585.525089,1585.525089,1585.525089,1585.525089,1585.525089,1585.525089,1585.525089,2176.493866,2176.493866,2176.493866,2176.493866,2176.493866,2176.493866,2176.493866,2176.493866,2176.493866,2156.707558,1619.100914,1619.100914,1619.100914,1619.100914,2162.799843,2162.799843,2162.799843,2162.799843,2162.799843,...,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819,2487.219819
강남구,개포동,141.0,0.0,0.0,0.000000,0.000000,0.000000,1887.191539,1887.191539,1887.191539,1887.191539,1887.191539,1887.191539,1887.191539,1866.687890,1866.687890,182.229633,182.229633,182.229633,182.229633,1905.669366,1871.081129,1893.939394,1893.939394,1886.909963,1886.909963,1905.583038,1945.915807,1945.934819,1891.281789,2027.576454,1865.819770,1865.819770,2009.899240,1948.703652,1948.703652,1948.703652,1948.703652,1948.703652,1948.703652,1980.027548,1980.027548,1980.027548,1977.131968,1875.987362,1941.619281,1941.619281,1941.619281,1941.619281,1941.619281,1941.619281,1941.619281,1832.543444,...,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324,4324.324324
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
중랑구,중화동,438.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130,818.610130
중랑구,중화동,450.0,0.0,0.0,431.726908,431.726908,431.726908,431.726908,485.274431,485.274431,485.274431,485.274431,485.274431,485.274431,449.839343,449.839343,449.839343,449.839343,449.839343,468.540830,468.540830,468.540830,468.540830,431.782467,431.782467,431.782467,485.274431,485.274431,399.933066,399.933066,399.933066,399.933066,399.933066,399.933066,399.933066,399.933066,399.933066,399.933066,399.933066,399.933066,399.933066,399.933066,399.933066,473.560910,476.474092,451.807229,451.807229,451.807229,443.440428,443.440428,443.440428,443.440428,443.440428,...,1042.287076,1042.287076,1042.287076,1042.287076,1042.287076,1042.287076,1042.287076,1042.287076,1042.287076,1042.287076,1042.287076,1042.287076,1042.287076,1042.287076,1042.287076,1042.287076,1042.287076,1042.287076,1042.287076,1042.287076,1042.287076,1042.287076,1042.287076,1042.287076,1042.287076,1042.287076,1042.287076,1020.749665,1020.749665,1020.749665,1020.749665,1059.029148,1004.016064,1004.016064,1004.016064,1020.749665,1020.749665,1020.749665,934.190170,934.190170,934.190170,934.190170,934.190170,934.190170,934.190170,934.190170,934.190170,934.190170,934.190170,934.190170
중랑구,중화동,452.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008,727.417008
중랑구,중화동,453.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932,1006.355932


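- In pandas, a long table (many rows) puts more pressure on memory than a wide table (many columns), because the long form repeats the address and date keys on every row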
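
The effect can be checked on a toy table. This is a minimal sketch, not the notebook's data: `wide_df`, `long_df`, and the table size are hypothetical. Stacking moves every cell to its own row and stores both keys next to it, so the long form ends up heavier.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: 100 rows x 50 float64 columns (wide format).
wide_df = pd.DataFrame(np.ones((100, 50)))

# Stacking puts each cell on its own row and repeats both keys per row.
long_df = wide_df.stack().reset_index()
long_df.columns = ["row_key", "col_key", "value"]

wide_bytes = wide_df.memory_usage(deep=True).sum()
long_bytes = long_df.memory_usage(deep=True).sum()
print(wide_df.shape, wide_bytes)
print(long_df.shape, long_bytes)
```

With string keys such as district names, the gap is likely to be larger still, since every repeated key is a Python object.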

In [None]:
# Convert the pivot table into a regular (long-format) DataFrame
df_area_deal = pivot_table_area_deal.stack(level=[0,1,2])
df_area_deal = df_area_deal.reset_index()
df_area_deal

Unnamed: 0,address_1,address_2,address_3,address_4,year,month,day,0
0,강남구,개포동,12.0,0.0,2011,1,1,0.000000
1,강남구,개포동,12.0,0.0,2011,1,2,0.000000
2,강남구,개포동,12.0,0.0,2011,1,3,0.000000
3,강남구,개포동,12.0,0.0,2011,1,4,1018.685955
4,강남구,개포동,12.0,0.0,2011,1,5,1087.781432
...,...,...,...,...,...,...,...,...
41868966,중랑구,중화동,454.0,0.0,2023,10,27,1131.141746
41868967,중랑구,중화동,454.0,0.0,2023,10,28,1131.141746
41868968,중랑구,중화동,454.0,0.0,2023,10,29,1131.141746
41868969,중랑구,중화동,454.0,0.0,2023,10,30,1131.141746


In [None]:
df_area_deal.columns = ['address_1','address_2','address_3','address_4','year','month','day','area_deal'] # rename columns
df_area_deal = df_area_deal.astype({'address_3': 'int16', 'address_4': 'int16','year':'int16', 'month':'int16', 'day':'int16', 'area_deal':'float32'}) # change dtypes
df_area_deal = df_area_deal.drop(df_area_deal[df_area_deal.area_deal == 0].index) # nulls were filled with 0 above, so drop the rows whose value is 0
df_area_deal

Unnamed: 0,address_1,address_2,address_3,address_4,year,month,day,area_deal
3,강남구,개포동,12,0,2011,1,4,1018.685974
4,강남구,개포동,12,0,2011,1,5,1087.781372
5,강남구,개포동,12,0,2011,1,6,1040.914551
6,강남구,개포동,12,0,2011,1,7,1054.852295
7,강남구,개포동,12,0,2011,1,8,1054.852295
...,...,...,...,...,...,...,...,...
41868966,중랑구,중화동,454,0,2023,10,27,1131.141724
41868967,중랑구,중화동,454,0,2023,10,28,1131.141724
41868968,중랑구,중화동,454,0,2023,10,29,1131.141724
41868969,중랑구,중화동,454,0,2023,10,30,1131.141724


In [None]:
df_area_deal.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 35550086 entries, 3 to 41868970
Data columns (total 8 columns):
 #   Column     Dtype  
---  ------     -----  
 0   address_1  object 
 1   address_2  object 
 2   address_3  int16  
 3   address_4  int16  
 4   year       int16  
 5   month      int16  
 6   day        int16  
 7   area_deal  float32
dtypes: float32(1), int16(5), object(2)
memory usage: 1.3+ GB


### Save file

In [None]:
df_area_deal.to_pickle('/content/drive/MyDrive/house_price/after_data/df_area_deal.pkl')

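## Creating the df_area_full_rent file

- Create a DataFrame that holds, for each apartment, the most recently contracted jeonse (lump-sum deposit lease) price per unit area

- See the df_area_deal file creation section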
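
The "most recently contracted price" idea can be sketched on a single series. This is a minimal illustration with made-up numbers, assuming one apartment and `NaN` on days without a contract; `prices`, `filled`, and `latest` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical per-area prices for one apartment; NaN means no contract that day.
prices = pd.Series(
    [np.nan, 430.0, np.nan, 416.0, np.nan],
    index=pd.date_range("2011-01-02", periods=5, freq="D"),
)

filled = prices.ffill()       # carry the latest contracted price forward
filled = filled.fillna(0)     # days before the first contract become 0
latest = filled[filled != 0]  # drop those placeholder zeros again
print(latest)
```

One caveat of using 0 as the placeholder: a genuine price of 0 would be dropped as well, which seems harmless here because real contract prices are positive.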

In [None]:
df_full_rent.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1580989 entries, 0 to 1580988
Data columns (total 12 columns):
 #   Column           Non-Null Count    Dtype  
---  ------           --------------    -----  
 0   date             1580989 non-null  object 
 1   year             1580989 non-null  int64  
 2   month            1580989 non-null  int64  
 3   day              1580989 non-null  int64  
 4   address_0        1580989 non-null  object 
 5   address_1        1580989 non-null  object 
 6   address_2        1580989 non-null  object 
 7   address_3        1580989 non-null  float64
 8   address_4        1580989 non-null  float64
 9   name             1580989 non-null  object 
 10  area             1580989 non-null  float64
 11  full_rent_price  1580989 non-null  int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 144.7+ MB


In [None]:
import numpy as np
# Add the price per unit area
df_full_rent['area_full_rent_price'] = df_full_rent['full_rent_price'] / df_full_rent['area']
pivot_table_area_full_rent = df_full_rent.pivot_table(index=['year','month','day'], columns=['address_1','address_2','address_3','address_4'], values='area_full_rent_price')
pivot_table_area_full_rent

Unnamed: 0_level_0,Unnamed: 1_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,Unnamed: 1_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,...,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,Unnamed: 1_level_2,address_3,12.0,12.0,138.0,140.0,141.0,166.0,172.0,176.0,177.0,179.0,...,307.0,314.0,318.0,331.0,413.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,Unnamed: 1_level_3,address_4,0.0,2.0,0.0,0.0,0.0,4.0,3.0,1.0,0.0,0.0,...,76.0,1.0,81.0,64.0,8.0,0.0,0.0,0.0,0.0,0.0
year,month,day,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4,Unnamed: 22_level_4,Unnamed: 23_level_4
2011,1,1,,,,,,,,,,,...,,,,,,,,,,
2011,1,2,,,,,,,,,,,...,,,,,,,,,,
2011,1,3,430.053124,469.099032,,,190.044764,,,,,,...,,,,,,,,,,
2011,1,4,416.009890,,,259.109312,159.620342,,,,,,...,,,203.665988,,,,251.004016,,,
2011,1,5,,,217.090981,267.487606,212.476466,,,,,,...,,,,,,,190.408188,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023,10,27,802.246290,,1706.688512,,,,,,,,...,,,,,,,,,,
2023,10,28,848.576322,,,,,,,,883.704489,,...,,,,,,,669.344043,,,
2023,10,29,,,,,,,,,,,...,,,,,,,568.942436,,,
2023,10,30,806.208559,,,,,,,,,,...,,,,,,,491.967871,,,


In [None]:
from datetime import datetime, timedelta

start_date = datetime(2011, 1, 1)  # start date
end_date = datetime(2023, 10, 31)  # end date

date_list = []
current_date = start_date
while current_date <= end_date:
    date_tuple = (current_date.year, current_date.month, current_date.day)
    date_list.append(date_tuple)
    current_date += timedelta(days=1)

# Subtract the deal dates from all dates in the period to find the dates with no deals
print(set(date_list) - set(pivot_table_area_full_rent.index)) # dates in the full range that are missing from the deal dates
print(set(pivot_table_area_full_rent.index) - set(date_list)) # dates that were created by mistake

set()
set()


In [None]:
pivot_table_area_full_rent = pivot_table_area_full_rent.ffill()
pivot_table_area_full_rent = pivot_table_area_full_rent.fillna(0)
pivot_table_area_full_rent = pivot_table_area_full_rent.T
df_area_full_rent = pivot_table_area_full_rent.stack(level=[0,1,2])
df_area_full_rent = df_area_full_rent.reset_index()
df_area_full_rent.columns = ['address_1','address_2','address_3','address_4','year','month','day','area_full_rent'] # rename columns
df_area_full_rent = df_area_full_rent.drop(df_area_full_rent[df_area_full_rent.area_full_rent == 0].index) # nulls were filled with 0 above, so drop the rows whose value is 0
df_area_full_rent = df_area_full_rent.astype({'address_3': 'int16', 'address_4': 'int16','year':'int16', 'month':'int16', 'day':'int16', 'area_full_rent':'float32'})
df_area_full_rent

Unnamed: 0,address_1,address_2,address_3,address_4,year,month,day,area_full_rent
2,강남구,개포동,12,0,2011,1,3,430.053131
3,강남구,개포동,12,0,2011,1,4,416.009888
4,강남구,개포동,12,0,2011,1,5,416.009888
5,강남구,개포동,12,0,2011,1,6,416.009888
6,강남구,개포동,12,0,2011,1,7,400.000000
...,...,...,...,...,...,...,...,...
43776575,중랑구,중화동,454,0,2023,10,27,571.462219
43776576,중랑구,중화동,454,0,2023,10,28,571.462219
43776577,중랑구,중화동,454,0,2023,10,29,571.462219
43776578,중랑구,중화동,454,0,2023,10,30,571.462219


In [None]:
df_area_full_rent.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37345835 entries, 2 to 43776579
Data columns (total 8 columns):
 #   Column          Dtype  
---  ------          -----  
 0   address_1       object 
 1   address_2       object 
 2   address_3       int16  
 3   address_4       int16  
 4   year            int16  
 5   month           int16  
 6   day             int16  
 7   area_full_rent  float32
dtypes: float32(1), int16(5), object(2)
memory usage: 1.3+ GB


In [None]:
df_area_full_rent.to_pickle('/content/drive/MyDrive/house_price/after_data/df_area_full_rent.pkl')

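## Creating the df_area_year_rent file

- Create a DataFrame that holds, for each apartment, the most recently contracted monthly-rent price per unit area, expressed as an annual rent

- See the df_area_deal file creation section
- Apartment monthly-rent pivot table -> apartment annual-rent pivot table
- The deposit differs with the circumstances of each contract
- Apply the jeonse-to-monthly-rent conversion rate to convert the deposit of a monthly-rent contract
- Because deposit and monthly rent vary from deal to deal, annual rent (the cost over one year) is computed as 5.8% of the deposit plus monthly rent * 12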
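
The annual-rent rule in the last bullet can be written as a small function. This is a sketch only: `annual_rent` is a hypothetical helper, the 5.8% default comes from the text above, and the example amounts are made up.

```python
def annual_rent(deposit, monthly_rent, conversion_rate=0.058):
    """Yearly cost of a monthly-rent contract.

    The deposit is converted at the given rate and added to 12 months of rent.
    """
    return deposit * conversion_rate + monthly_rent * 12


# Made-up contract: deposit 10000, monthly rent 50 (same currency unit for both).
example = annual_rent(deposit=10000, monthly_rent=50)
print(example)
```

Keeping the rate as a parameter makes it easy to test how sensitive the later indicators are to the assumed conversion rate.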

In [None]:
df_month_rent.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 729704 entries, 0 to 729703
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   date              729704 non-null  object 
 1   year              729704 non-null  int64  
 2   month             729704 non-null  int64  
 3   day               729704 non-null  int64  
 4   address_0         729704 non-null  object 
 5   address_1         729704 non-null  object 
 6   address_2         729704 non-null  object 
 7   address_3         729704 non-null  float64
 8   address_4         729704 non-null  float64
 9   name              729704 non-null  object 
 10  area              729704 non-null  float64
 11  rent_deposit      729704 non-null  int64  
 12  month_rent_price  729704 non-null  int64  
dtypes: float64(3), int64(5), object(5)
memory usage: 72.4+ MB


In [None]:
# Compute annual rent, the cost over one year: 5.8% of the deposit plus monthly rent * 12
df_month_rent['year_rent_price'] = (df_month_rent['rent_deposit']*0.058)+(df_month_rent['month_rent_price']*12)
df_month_rent['area_year_rent_price'] = df_month_rent['year_rent_price'] / df_month_rent['area']
pivot_table_area_year_rent = df_month_rent.pivot_table(index=['year','month','day'], columns=['address_1','address_2','address_3','address_4'], values='area_year_rent_price')
pivot_table_area_year_rent

Unnamed: 0_level_0,Unnamed: 1_level_0,address_1,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,강남구,...,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구,중랑구
Unnamed: 0_level_1,Unnamed: 1_level_1,address_2,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,개포동,...,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동,중화동
Unnamed: 0_level_2,Unnamed: 1_level_2,address_3,12.0,12.0,138.0,140.0,141.0,172.0,176.0,177.0,179.0,185.0,...,307.0,307.0,314.0,318.0,331.0,438.0,450.0,452.0,453.0,454.0
Unnamed: 0_level_3,Unnamed: 1_level_3,address_4,0.0,2.0,0.0,0.0,0.0,3.0,1.0,0.0,0.0,0.0,...,6.0,76.0,1.0,81.0,64.0,0.0,0.0,0.0,0.0,0.0
year,month,day,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4,Unnamed: 22_level_4,Unnamed: 23_level_4
2011,1,1,,,,,,,,,,,...,,,,,,,,,,
2011,1,2,,,,,,,,,,,...,,,,,,,,,,
2011,1,3,,,,,,,,,,,...,,,,,,,,,,
2011,1,4,,,,,,,,,,29.702312,...,,,,,,,,,,
2011,1,5,,,18.284371,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023,10,27,,,,,,,,,,,...,,,,,,,,,,
2023,10,28,,,,,,,,,,30.885711,...,,,,,,,,,,
2023,10,29,,,,,,,,,,,...,,,,,,,,,,
2023,10,30,,,,,,,,,,38.070109,...,,,,,,,,,,


In [None]:
from datetime import datetime, timedelta

start_date = datetime(2011, 1, 1)  # start date
end_date = datetime(2023, 10, 31)  # end date

date_list = []
current_date = start_date
while current_date <= end_date:
    date_tuple = (current_date.year, current_date.month, current_date.day)
    date_list.append(date_tuple)
    current_date += timedelta(days=1)

# Subtract the deal dates from all dates in the period to find the dates with no deals
print(set(date_list) - set(pivot_table_area_year_rent.index)) # dates in the full range that are missing from the deal dates
print(set(pivot_table_area_year_rent.index) - set(date_list)) # dates that were created by mistake

set()
set()


In [None]:
pivot_table_area_year_rent = pivot_table_area_year_rent.ffill()
pivot_table_area_year_rent = pivot_table_area_year_rent.fillna(0)

# Convert the pivot table into a regular (long-format) DataFrame
pivot_table_area_year_rent = pivot_table_area_year_rent.T
df_area_year_rent = pivot_table_area_year_rent.stack(level=[0,1,2])
df_area_year_rent = df_area_year_rent.reset_index()
df_area_year_rent.columns = ['address_1','address_2','address_3','address_4','year','month','day','area_year_rent'] # rename columns
df_area_year_rent = df_area_year_rent.drop(df_area_year_rent[df_area_year_rent.area_year_rent == 0].index) # nulls were filled with 0 above, so drop the rows whose value is 0
df_area_year_rent = df_area_year_rent.astype({'address_3': 'int16', 'address_4': 'int16','year':'int16', 'month':'int16', 'day':'int16', 'area_year_rent':'float32'})
df_area_year_rent

Unnamed: 0,address_1,address_2,address_3,address_4,year,month,day,area_year_rent
6,강남구,개포동,12,0,2011,1,7,30.255503
7,강남구,개포동,12,0,2011,1,8,30.255503
8,강남구,개포동,12,0,2011,1,9,30.255503
9,강남구,개포동,12,0,2011,1,10,30.255503
10,강남구,개포동,12,0,2011,1,11,30.255503
...,...,...,...,...,...,...,...,...
40172272,중랑구,중화동,454,0,2023,10,27,30.705395
40172273,중랑구,중화동,454,0,2023,10,28,30.705395
40172274,중랑구,중화동,454,0,2023,10,29,30.705395
40172275,중랑구,중화동,454,0,2023,10,30,30.705395


In [None]:
df_area_year_rent.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29822452 entries, 6 to 40172276
Data columns (total 8 columns):
 #   Column          Dtype  
---  ------          -----  
 0   address_1       object 
 1   address_2       object 
 2   address_3       int16  
 3   address_4       int16  
 4   year            int16  
 5   month           int16  
 6   day             int16  
 7   area_year_rent  float32
dtypes: float32(1), int16(5), object(2)
memory usage: 1.1+ GB


In [None]:
df_area_year_rent.to_pickle('/content/drive/MyDrive/house_price/after_data/df_area_year_rent.pkl')

## Creating the df_area_all file

- Merge the three files df_area_deal, df_area_full_rent, and df_area_year_rent to create df_area_all
- To compute the valuation columns, merge is used to keep only the apartments that have sale, jeonse, and annual-rent records
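
Why a plain merge keeps only the records present in every table can be seen on toy data. This is a sketch with hypothetical frames; `apt` and `day` stand in for the address and date key columns used in the notebook.

```python
import pandas as pd

keys = ["apt", "day"]  # hypothetical stand-ins for the address and date keys

deal = pd.DataFrame(
    {"apt": ["A", "A", "B"], "day": [1, 2, 1], "area_deal": [10.0, 11.0, 20.0]}
)
rent = pd.DataFrame(
    {"apt": ["A", "B", "C"], "day": [2, 1, 1], "area_full_rent": [5.0, 9.0, 7.0]}
)

# pd.merge defaults to how="inner": only key combinations found in both frames survive.
both = pd.merge(deal, rent, on=keys)
print(both)
```

Here `("A", 1)` and `("C", 1)` disappear because each exists in only one frame, which mirrors how apartments lacking any of the three price records drop out of df_area_all.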

In [None]:
import pandas as pd

df_area_deal = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/df_area_deal.pkl')
df_area_full_rent = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/df_area_full_rent.pkl')

In [None]:
df_area_deal_full_rent = pd.merge(df_area_deal,df_area_full_rent, on=['address_1', 'address_2', 'address_3', 'address_4', 'year', 'month','day'])

In [None]:
df_area_deal_full_rent.to_pickle('/content/drive/MyDrive/house_price/after_data/df_area_deal_full_rent.pkl')

- Run in separate steps because of memory limits

In [None]:
import pandas as pd

df_area_deal_full_rent = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/df_area_deal_full_rent.pkl')
df_area_year_rent = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/df_area_year_rent.pkl')

In [None]:
df_area_all = pd.merge(df_area_deal_full_rent, df_area_year_rent , on=['address_1', 'address_2', 'address_3', 'address_4', 'year', 'month','day'])

In [None]:
df_area_all.to_pickle('/content/drive/MyDrive/house_price/after_data/df_area_all.pkl')

# Creating df_original_dataset

- df_original_dataset is the DataFrame obtained by adding rate-of-change information to df_area_all and then merging in the final_economic tables

## Checking aggregate figures by day

In [None]:
import pandas as pd

df_area_all = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/df_area_all.pkl')
df_area_all.head()

Unnamed: 0,address_1,address_2,address_3,address_4,year,month,day,area_deal,area_full_rent,area_year_rent
0,강남구,개포동,12,0,2011,1,7,1054.852295,400.0,30.255503
1,강남구,개포동,12,0,2011,1,8,1054.852295,400.0,30.255503
2,강남구,개포동,12,0,2011,1,9,1054.852295,400.0,30.255503
3,강남구,개포동,12,0,2011,1,10,1054.852295,420.425629,30.255503
4,강남구,개포동,12,0,2011,1,11,1006.830261,434.408142,30.255503


In [None]:
df_area_all.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26467941 entries, 0 to 26467940
Data columns (total 10 columns):
 #   Column          Dtype  
---  ------          -----  
 0   address_1       object 
 1   address_2       object 
 2   address_3       int16  
 3   address_4       int16  
 4   year            int16  
 5   month           int16  
 6   day             int16  
 7   area_deal       float32
 8   area_full_rent  float32
 9   area_year_rent  float32
dtypes: float32(3), int16(5), object(2)
memory usage: 1.1+ GB


In [None]:
# Check the actual memory usage
real_memory_usage = df_area_all.memory_usage(deep=True).sum() # deep=True also counts the contents of object columns, giving the real memory usage
print(real_memory_usage/(1024**3),'GB')

5.086982175707817 GB


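## Data filtering

- For the earliest dates, the number of concluded contracts per day is small, so the data were judged hard to trust
- The daily number of deal records (sale, jeonse, monthly rent) must be counted so that dates with too few records can be removed
- The threshold for "too few" is set with the IQR: counts low enough to be outliers are treated as too few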
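
The IQR rule mentioned in the last bullet reduces to one line of arithmetic. The sketch below uses the quartiles that `describe()` reports for the daily counts in this section; `lower_fence` is a hypothetical helper, and the factor 1.5 is the conventional choice that the notebook also uses.

```python
def lower_fence(q1, q3, k=1.5):
    """Lower outlier bound of the IQR rule: q1 - k * (q3 - q1)."""
    return q1 - k * (q3 - q1)


# Quartiles of the daily record counts as reported by describe() in this notebook.
threshold = lower_fence(4544.25, 7007.75)
print(threshold)  # 849.0
```

Dates whose record count falls below this bound are the ones removed later in the notebook.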

In [None]:
df_area_all_count = df_area_all.groupby(["year","month","day"])[["area_deal","area_full_rent","area_year_rent"]].count()
df_area_all_count

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,area_deal,area_full_rent,area_year_rent
year,month,day,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2011,1,2,1,1,1
2011,1,3,6,6,6
2011,1,4,18,18,18
2011,1,5,43,43,43
2011,1,6,79,79,79
...,...,...,...,...,...
2023,10,27,7728,7728,7728
2023,10,28,7728,7728,7728
2023,10,29,7728,7728,7728
2023,10,30,7729,7729,7729


>> When using a dataset, each figure is only meaningful as data if the sample it was derived from is large enough

In [None]:
df_area_all_count.describe() # the gap between min and the first quartile is very large

Unnamed: 0,area_deal,area_full_rent,area_year_rent
count,4686.0,4686.0,4686.0
mean,5648.301536,5648.301536,5648.301536
std,1663.565707,1663.565707,1663.565707
min,1.0,1.0,1.0
25%,4544.25,4544.25,4544.25
50%,6106.5,6106.5,6106.5
75%,7007.75,7007.75,7007.75
max,7730.0,7730.0,7730.0


In [None]:
# The boxplot confirms that there are outliers
import plotly.express as px
fig = px.box(df_area_all_count, y="area_deal")
fig.show()

In [None]:
# The bar chart confirms that the record count rises steadily over time
# So dropping rows below a threshold count is safe: the dates removed are the earliest ones, whose counts all fall below it
import plotly.express as px

df_area_all_count_2 = df_area_all_count.reset_index()
fig = px.bar(df_area_all_count_2, x=df_area_all_count_2.index, y='area_deal')
fig.show()

In [None]:
# Declare the variables used for outlier removal
q1=df_area_all_count['area_deal'].quantile(0.25)
q2=df_area_all_count['area_deal'].quantile(0.5)
q3=df_area_all_count['area_deal'].quantile(0.75)
iqr=q3-q1
iqr

2463.5

In [None]:
# Check the index entries of the outliers
df_area_all_count.loc[df_area_all_count['area_deal']<q1-1.5*iqr,'area_deal'].index


MultiIndex([(2011, 1,  2),
            (2011, 1,  3),
            (2011, 1,  4),
            (2011, 1,  5),
            (2011, 1,  6),
            (2011, 1,  7),
            (2011, 1,  8),
            (2011, 1,  9),
            (2011, 1, 10),
            (2011, 1, 11),
            (2011, 1, 12),
            (2011, 1, 13),
            (2011, 1, 14),
            (2011, 1, 15),
            (2011, 1, 16),
            (2011, 1, 17),
            (2011, 1, 18),
            (2011, 1, 19),
            (2011, 1, 20),
            (2011, 1, 21),
            (2011, 1, 22),
            (2011, 1, 23),
            (2011, 1, 24),
            (2011, 1, 25),
            (2011, 1, 26),
            (2011, 1, 27),
            (2011, 1, 28),
            (2011, 1, 29),
            (2011, 1, 30),
            (2011, 1, 31),
            (2011, 2,  1),
            (2011, 2,  2),
            (2011, 2,  3),
            (2011, 2,  4),
            (2011, 2,  5),
            (2011, 2,  6),
            (2011, 2,  7),
 

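## Creating df_area_micro

- df_area_micro is an intermediate DataFrame produced while building df_original_dataset; it adds valuation indicators and rates of change versus past figures to df_area_all

### Grouping by day


- First, group the deal records by day and derive the average prices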

In [None]:
df_area_micro=df_area_all.groupby(["year","month","day"])[["area_deal","area_full_rent","area_year_rent"]].mean()
df_area_micro

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,area_deal,area_full_rent,area_year_rent
year,month,day,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2011,1,2,595.000000,259.771637,18.880535
2011,1,3,519.548096,274.167999,16.547453
2011,1,4,704.231018,326.309021,21.519497
2011,1,5,768.772095,326.750854,21.022175
2011,1,6,709.595642,303.017273,19.990875
...,...,...,...,...,...
2023,10,27,1029.540405,596.237915,30.427759
2023,10,28,1029.958740,596.546021,30.465477
2023,10,29,1029.966797,596.627197,30.464407
2023,10,30,1030.069458,596.618347,30.486279


In [None]:
# Drop the dates identified above whose record count is too small to be reliable
df_area_micro.drop(df_area_all_count.loc[df_area_all_count['area_deal']<q1-1.5*iqr,'area_deal'].index,inplace=True)
df_area_micro

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,area_deal,area_full_rent,area_year_rent
year,month,day,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2011,2,10,649.267273,303.338654,21.142321
2011,2,11,644.917053,302.780884,21.052322
2011,2,12,641.195679,300.848663,21.090029
2011,2,13,640.122192,300.396454,21.067911
2011,2,14,638.963379,298.462372,20.963787
...,...,...,...,...,...
2023,10,27,1029.540405,596.237915,30.427759
2023,10,28,1029.958740,596.546021,30.465477
2023,10,29,1029.966797,596.627197,30.464407
2023,10,30,1030.069458,596.618347,30.486279


In [None]:
df_area_micro.reset_index(inplace=True)
df_area_micro

Unnamed: 0,year,month,day,area_deal,area_full_rent,area_year_rent
0,2011,2,10,649.267273,303.338654,21.142321
1,2011,2,11,644.917053,302.780884,21.052322
2,2011,2,12,641.195679,300.848663,21.090029
3,2011,2,13,640.122192,300.396454,21.067911
4,2011,2,14,638.963379,298.462372,20.963787
...,...,...,...,...,...,...
4642,2023,10,27,1029.540405,596.237915,30.427759
4643,2023,10,28,1029.958740,596.546021,30.465477
4644,2023,10,29,1029.966797,596.627197,30.464407
4645,2023,10,30,1030.069458,596.618347,30.486279


#### Adding valuation indicator columns

- Compute the jeonse-to-sale price ratio (deal_full_rent_rate) and the annual-rent multiple (deal_year_rent_multiple)
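
As a quick check of the two definitions, the sketch below applies them to the first row of df_area_micro shown above (2011-02-10). The function names are hypothetical; the notebook computes the same quantities directly on columns.

```python
def jeonse_ratio(area_full_rent, area_deal):
    """Jeonse price as a percentage of the sale price."""
    return 100 * area_full_rent / area_deal


def annual_rent_multiple(area_deal, area_year_rent):
    """How many years of annual rent the sale price corresponds to."""
    return area_deal / area_year_rent


# First row of df_area_micro (2011-02-10) as printed in this notebook.
ratio = jeonse_ratio(303.338654, 649.267273)
multiple = annual_rent_multiple(649.267273, 21.142321)
print(round(ratio, 2), round(multiple, 2))
```

Small differences in the last decimals versus the notebook output are expected, since the notebook stores these columns as float32.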

In [None]:
df_area_micro['deal_full_rent_rate'] = 100*(df_area_micro['area_full_rent'] / df_area_micro['area_deal'])
df_area_micro['deal_year_rent_multiple'] = df_area_micro['area_deal']/ df_area_micro['area_year_rent']
df_area_micro

Unnamed: 0,year,month,day,area_deal,area_full_rent,area_year_rent,deal_full_rent_rate,deal_year_rent_multiple
0,2011,2,10,649.267273,303.338654,21.142321,46.720150,30.709366
1,2011,2,11,644.917053,302.780884,21.052322,46.948811,30.634010
2,2011,2,12,641.195679,300.848663,21.090029,46.919945,30.402788
3,2011,2,13,640.122192,300.396454,21.067911,46.927986,30.383753
4,2011,2,14,638.963379,298.462372,20.963787,46.710403,30.479387
...,...,...,...,...,...,...,...,...
4642,2023,10,27,1029.540405,596.237915,30.427759,57.913017,33.835564
4643,2023,10,28,1029.958740,596.546021,30.465477,57.919407,33.807407
4644,2023,10,29,1029.966797,596.627197,30.464407,57.926838,33.808857
4645,2023,10,30,1030.069458,596.618347,30.486279,57.920204,33.787971


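### Checking monthly average aggregate figures

- Taking the figures from 6 and 12 months ago on a daily basis would tie the comparison too closely to one specific date, so the rates of change are computed on a monthly basis; the monthly averages are computed for that purpose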

In [None]:
df_area_micro_month = df_area_micro.groupby(["year","month"])[["area_deal","area_full_rent","area_year_rent"]].mean().copy()
df_area_micro_month.reset_index(inplace=True)
df_area_micro_month

Unnamed: 0,year,month,area_deal,area_full_rent,area_year_rent
0,2011,2,636.811462,298.967072,20.891218
1,2011,3,615.491150,297.227692,20.475676
2,2011,4,600.585022,295.111023,20.246986
3,2011,5,589.073303,295.179962,20.066994
4,2011,6,581.239624,297.469910,19.992504
...,...,...,...,...,...
148,2023,6,1021.233032,582.908325,29.696184
149,2023,7,1021.891541,584.898132,29.837887
150,2023,8,1024.649536,588.286255,30.112097
151,2023,9,1027.772583,591.920471,30.319450


In [None]:
# Assume figures are published one month after the contract date, so shift the values by one row
df_area_micro_month['area_deal'] = df_area_micro_month['area_deal'].shift(1)
df_area_micro_month['area_full_rent'] = df_area_micro_month['area_full_rent'].shift(1)
df_area_micro_month['area_year_rent'] = df_area_micro_month['area_year_rent'].shift(1)
df_area_micro_month = df_area_micro_month.dropna()
df_area_micro_month.columns = ['year','month','last_month_area_deal','last_month_area_full_count', 'last_month_area_year_rent']
df_area_micro_month

Unnamed: 0,year,month,last_month_area_deal,last_month_area_full_count,last_month_area_year_rent
1,2011,3,636.811462,298.967072,20.891218
2,2011,4,615.491150,297.227692,20.475676
3,2011,5,600.585022,295.111023,20.246986
4,2011,6,589.073303,295.179962,20.066994
5,2011,7,581.239624,297.469910,19.992504
...,...,...,...,...,...
148,2023,6,1021.817688,581.627808,29.655708
149,2023,7,1021.233032,582.908325,29.696184
150,2023,8,1021.891541,584.898132,29.837887
151,2023,9,1024.649536,588.286255,30.112097


#### Merging the aggregate figures from 6 months earlier

In [None]:
# Compute the year and month 6 months after each row of df_area_micro_month; merging those columns with df_area_micro's year and month yields the figures from 6 months earlier
df_area_micro_month_6m = df_area_micro_month.copy()
df_area_micro_month_6m.loc[df_area_micro_month_6m['month']<7, '6m_after_year'] = df_area_micro_month_6m['year']
df_area_micro_month_6m.loc[df_area_micro_month_6m['month']<7, '6m_after_month'] = df_area_micro_month_6m['month'] + 6
df_area_micro_month_6m.loc[df_area_micro_month_6m['month']>=7, '6m_after_year'] = df_area_micro_month_6m['year'] + 1
df_area_micro_month_6m.loc[df_area_micro_month_6m['month']>=7, '6m_after_month'] = df_area_micro_month_6m['month'] - 6

df_area_micro_month_6m



Unnamed: 0,year,month,last_month_area_deal,last_month_area_full_count,last_month_area_year_rent,6m_after_year,6m_after_month
1,2011,3,636.811462,298.967072,20.891218,2011.0,9.0
2,2011,4,615.491150,297.227692,20.475676,2011.0,10.0
3,2011,5,600.585022,295.111023,20.246986,2011.0,11.0
4,2011,6,589.073303,295.179962,20.066994,2011.0,12.0
5,2011,7,581.239624,297.469910,19.992504,2012.0,1.0
...,...,...,...,...,...,...,...
148,2023,6,1021.817688,581.627808,29.655708,2023.0,12.0
149,2023,7,1021.233032,582.908325,29.696184,2024.0,1.0
150,2023,8,1021.891541,584.898132,29.837887,2024.0,2.0
151,2023,9,1024.649536,588.286255,30.112097,2024.0,3.0
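
The four `.loc` assignments in the cell above split the month shift into two cases. The same shift can be sketched with integer arithmetic, which also covers other offsets such as 12 months. `add_months` is a hypothetical helper, not part of the notebook.

```python
def add_months(year, month, k):
    """Return the (year, month) that lies k months after the given one."""
    total = year * 12 + (month - 1) + k  # months counted from year 0, zero-based
    return total // 12, total % 12 + 1


print(add_months(2011, 3, 6))  # matches the first row of the table above
print(add_months(2011, 7, 6))  # rolls over into the next year
```

Applied to whole columns, the same expressions work element-wise on pandas Series, so the result stays an integer and the later `astype` step would not be needed.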


In [None]:
df_area_micro_month_6m = df_area_micro_month_6m.drop(['year','month'],axis=1)
df_area_micro_month_6m = df_area_micro_month_6m.astype({'6m_after_year':'int16', '6m_after_month' : 'int16'})
df_area_micro_month_6m.rename(columns = {'last_month_area_deal' : '6m_before_area_deal_mean', 'last_month_area_full_count' : '6m_before_area_full_rent_mean',
                                      'last_month_area_year_rent' : '6m_before_area_year_rent_mean'}, inplace = True)
df_area_micro_month_6m

Unnamed: 0,6m_before_area_deal_mean,6m_before_area_full_rent_mean,6m_before_area_year_rent_mean,6m_after_year,6m_after_month
1,636.811462,298.967072,20.891218,2011,9
2,615.491150,297.227692,20.475676,2011,10
3,600.585022,295.111023,20.246986,2011,11
4,589.073303,295.179962,20.066994,2011,12
5,581.239624,297.469910,19.992504,2012,1
...,...,...,...,...,...
148,1021.817688,581.627808,29.655708,2023,12
149,1021.233032,582.908325,29.696184,2024,1
150,1021.891541,584.898132,29.837887,2024,2
151,1024.649536,588.286255,30.112097,2024,3


In [None]:
df_area_micro_month_6m['6m_before_deal_full_rent_rate'] = 100*(df_area_micro_month_6m['6m_before_area_full_rent_mean'] / df_area_micro_month_6m['6m_before_area_deal_mean'])
df_area_micro_month_6m['6m_before_deal_year_rent_multiple'] = df_area_micro_month_6m['6m_before_area_deal_mean']/ df_area_micro_month_6m['6m_before_area_year_rent_mean']
df_area_micro_month_6m

Unnamed: 0,6m_before_area_deal_mean,6m_before_area_full_rent_mean,6m_before_area_year_rent_mean,6m_after_year,6m_after_month,6m_before_deal_full_rent_rate,6m_before_deal_year_rent_multiple
1,636.811462,298.967072,20.891218,2011,9,46.947502,30.482256
2,615.491150,297.227692,20.475676,2011,10,48.291142,30.059626
3,600.585022,295.111023,20.246986,2011,11,49.137260,29.662933
4,589.073303,295.179962,20.066994,2011,12,50.109207,29.355333
5,581.239624,297.469910,19.992504,2012,1,51.178532,29.072878
...,...,...,...,...,...,...,...
148,1021.817688,581.627808,29.655708,2023,12,56.920898,34.456020
149,1021.233032,582.908325,29.696184,2024,1,57.078873,34.389370
150,1021.891541,584.898132,29.837887,2024,2,57.236809,34.248119
151,1024.649536,588.286255,30.112097,2024,3,57.413410,34.027836


In [None]:
df_area_micro = pd.merge(df_area_micro,df_area_micro_month_6m, left_on=['year','month'], right_on=['6m_after_year','6m_after_month'],how = 'left') # with an inner join, even more rows would be lost when the 12-month part is merged
df_area_micro

Unnamed: 0,year,month,day,area_deal,area_full_rent,area_year_rent,deal_full_rent_rate,deal_year_rent_multiple,6m_before_area_deal_mean,6m_before_area_full_rent_mean,6m_before_area_year_rent_mean,6m_after_year,6m_after_month,6m_before_deal_full_rent_rate,6m_before_deal_year_rent_multiple
0,2011,2,10,649.267273,303.338654,21.142321,46.720150,30.709366,,,,,,,
1,2011,2,11,644.917053,302.780884,21.052322,46.948811,30.634010,,,,,,,
2,2011,2,12,641.195679,300.848663,21.090029,46.919945,30.402788,,,,,,,
3,2011,2,13,640.122192,300.396454,21.067911,46.927986,30.383753,,,,,,,
4,2011,2,14,638.963379,298.462372,20.963787,46.710403,30.479387,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4642,2023,10,27,1029.540405,596.237915,30.427759,57.913017,33.835564,1030.953491,583.086731,29.516256,2023.0,10.0,56.558006,34.928329
4643,2023,10,28,1029.958740,596.546021,30.465477,57.919407,33.807407,1030.953491,583.086731,29.516256,2023.0,10.0,56.558006,34.928329
4644,2023,10,29,1029.966797,596.627197,30.464407,57.926838,33.808857,1030.953491,583.086731,29.516256,2023.0,10.0,56.558006,34.928329
4645,2023,10,30,1030.069458,596.618347,30.486279,57.920204,33.787971,1030.953491,583.086731,29.516256,2023.0,10.0,56.558006,34.928329


#### Merging the aggregate figures from 12 months earlier

In [None]:
df_area_micro_month_12m = df_area_micro_month.copy()
df_area_micro_month_12m['12m_after_year'] = df_area_micro_month_12m['year']+1
df_area_micro_month_12m['12m_after_month'] = df_area_micro_month_12m['month']

df_area_micro_month_12m = df_area_micro_month_12m.drop(['year','month'],axis=1)
df_area_micro_month_12m = df_area_micro_month_12m.astype({'12m_after_year':'int16', '12m_after_month' : 'int16'})
df_area_micro_month_12m.rename(columns = {'last_month_area_deal' : '12m_before_area_deal_mean', 'last_month_area_full_count' : '12m_before_area_full_rent_mean',
                                      'last_month_area_year_rent' : '12m_before_area_year_rent_mean'}, inplace = True)

df_area_micro_month_12m['12m_before_deal_full_rent_rate'] = 100*(df_area_micro_month_12m['12m_before_area_full_rent_mean'] / df_area_micro_month_12m['12m_before_area_deal_mean'])
df_area_micro_month_12m['12m_before_deal_year_rent_multiple'] =df_area_micro_month_12m['12m_before_area_deal_mean']/ df_area_micro_month_12m['12m_before_area_year_rent_mean']


df_area_micro = pd.merge(df_area_micro, df_area_micro_month_12m, left_on=['year','month'], right_on=['12m_after_year','12m_after_month'],how = 'left')
df_area_micro

Unnamed: 0,year,month,day,area_deal,area_full_rent,area_year_rent,deal_full_rent_rate,deal_year_rent_multiple,6m_before_area_deal_mean,6m_before_area_full_rent_mean,...,6m_after_month,6m_before_deal_full_rent_rate,6m_before_deal_year_rent_multiple,12m_before_area_deal_mean,12m_before_area_full_rent_mean,12m_before_area_year_rent_mean,12m_after_year,12m_after_month,12m_before_deal_full_rent_rate,12m_before_deal_year_rent_multiple
0,2011,2,10,649.267273,303.338654,21.142321,46.720150,30.709366,,,...,,,,,,,,,,
1,2011,2,11,644.917053,302.780884,21.052322,46.948811,30.634010,,,...,,,,,,,,,,
2,2011,2,12,641.195679,300.848663,21.090029,46.919945,30.402788,,,...,,,,,,,,,,
3,2011,2,13,640.122192,300.396454,21.067911,46.927986,30.383753,,,...,,,,,,,,,,
4,2011,2,14,638.963379,298.462372,20.963787,46.710403,30.479387,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4642,2023,10,27,1029.540405,596.237915,30.427759,57.913017,33.835564,1030.953491,583.086731,...,10.0,56.558006,34.928329,1079.784546,619.757629,30.158903,2023.0,10.0,57.39642,35.803177
4643,2023,10,28,1029.958740,596.546021,30.465477,57.919407,33.807407,1030.953491,583.086731,...,10.0,56.558006,34.928329,1079.784546,619.757629,30.158903,2023.0,10.0,57.39642,35.803177
4644,2023,10,29,1029.966797,596.627197,30.464407,57.926838,33.808857,1030.953491,583.086731,...,10.0,56.558006,34.928329,1079.784546,619.757629,30.158903,2023.0,10.0,57.39642,35.803177
4645,2023,10,30,1030.069458,596.618347,30.486279,57.920204,33.787971,1030.953491,583.086731,...,10.0,56.558006,34.928329,1079.784546,619.757629,30.158903,2023.0,10.0,57.39642,35.803177


### Revising the df_area_micro columns

- Replace the 6-month-ago and 12-month-ago values with percent changes relative to those values
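The replacement described above is the usual percent-change formula, 100 * (current - past) / past, applied column by column. The following is a minimal sketch on made-up numbers; the helper name `pct_change_from` is hypothetical and does not appear in the notebook.

```python
import pandas as pd


def pct_change_from(current: pd.Series, past: pd.Series) -> pd.Series:
    """Percent change of `current` relative to `past` (hypothetical helper)."""
    return 100 * (current - past) / past


# Toy values, made up for illustration only
toy = pd.DataFrame({
    "area_deal": [110.0, 90.0],
    "6m_before_area_deal_mean": [100.0, 100.0],
})

# Overwrite the past level with the percent change, as the cell below does
toy["6m_before_area_deal_mean"] = pct_change_from(
    toy["area_deal"], toy["6m_before_area_deal_mean"]
)
print(toy)
```

A helper like this would avoid repeating the same expression for each of the ten columns, at the cost of one extra function.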

In [None]:
df_area_micro = df_area_micro.drop(['6m_after_year','6m_after_month', '12m_after_year', '12m_after_month'], axis=1)

df_area_micro['6m_before_area_deal_mean'] = 100*((df_area_micro['area_deal'] - df_area_micro['6m_before_area_deal_mean'])/ df_area_micro['6m_before_area_deal_mean'])
df_area_micro['6m_before_area_full_rent_mean'] = 100*((df_area_micro['area_full_rent'] - df_area_micro['6m_before_area_full_rent_mean'])/ df_area_micro['6m_before_area_full_rent_mean'])
df_area_micro['6m_before_area_year_rent_mean'] = 100*((df_area_micro['area_year_rent'] - df_area_micro['6m_before_area_year_rent_mean'])/ df_area_micro['6m_before_area_year_rent_mean'])
df_area_micro['6m_before_deal_full_rent_rate'] = 100*((df_area_micro['deal_full_rent_rate'] - df_area_micro['6m_before_deal_full_rent_rate'])/ df_area_micro['6m_before_deal_full_rent_rate'])
df_area_micro['6m_before_deal_year_rent_multiple'] = 100*((df_area_micro['deal_year_rent_multiple'] - df_area_micro['6m_before_deal_year_rent_multiple'])/ df_area_micro['6m_before_deal_year_rent_multiple'])


df_area_micro['12m_before_area_deal_mean'] = 100*((df_area_micro['area_deal'] - df_area_micro['12m_before_area_deal_mean'])/ df_area_micro['12m_before_area_deal_mean'])
df_area_micro['12m_before_area_full_rent_mean'] = 100*((df_area_micro['area_full_rent'] - df_area_micro['12m_before_area_full_rent_mean'])/ df_area_micro['12m_before_area_full_rent_mean'])
df_area_micro['12m_before_area_year_rent_mean'] = 100*((df_area_micro['area_year_rent'] - df_area_micro['12m_before_area_year_rent_mean'])/ df_area_micro['12m_before_area_year_rent_mean'])
df_area_micro['12m_before_deal_full_rent_rate'] = 100*((df_area_micro['deal_full_rent_rate'] - df_area_micro['12m_before_deal_full_rent_rate'])/ df_area_micro['12m_before_deal_full_rent_rate'])
df_area_micro['12m_before_deal_year_rent_multiple'] = 100*((df_area_micro['deal_year_rent_multiple'] - df_area_micro['12m_before_deal_year_rent_multiple'])/ df_area_micro['12m_before_deal_year_rent_multiple'])

df_area_micro.head()

Unnamed: 0,year,month,day,area_deal,area_full_rent,area_year_rent,deal_full_rent_rate,deal_year_rent_multiple,6m_before_area_deal_mean,6m_before_area_full_rent_mean,6m_before_area_year_rent_mean,6m_before_deal_full_rent_rate,6m_before_deal_year_rent_multiple,12m_before_area_deal_mean,12m_before_area_full_rent_mean,12m_before_area_year_rent_mean,12m_before_deal_full_rent_rate,12m_before_deal_year_rent_multiple
0,2011,2,10,649.267273,303.338654,21.142321,46.72015,30.709366,,,,,,,,,,
1,2011,2,11,644.917053,302.780884,21.052322,46.948811,30.63401,,,,,,,,,,
2,2011,2,12,641.195679,300.848663,21.090029,46.919945,30.402788,,,,,,,,,,
3,2011,2,13,640.122192,300.396454,21.067911,46.927986,30.383753,,,,,,,,,,
4,2011,2,14,638.963379,298.462372,20.963787,46.710403,30.479387,,,,,,,,,,


In [None]:
df_area_micro = df_area_micro.dropna()
df_area_micro.head()

Unnamed: 0,year,month,day,area_deal,area_full_rent,area_year_rent,deal_full_rent_rate,deal_year_rent_multiple,6m_before_area_deal_mean,6m_before_area_full_rent_mean,6m_before_area_year_rent_mean,6m_before_deal_full_rent_rate,6m_before_deal_year_rent_multiple,12m_before_area_deal_mean,12m_before_area_full_rent_mean,12m_before_area_year_rent_mean,12m_before_deal_full_rent_rate,12m_before_deal_year_rent_multiple
385,2012,3,1,548.170105,309.03598,20.537769,56.375927,26.69083,-4.654602,1.893891,1.922042,6.868177,-6.452623,-13.91956,3.367899,-1.691854,20.082911,-12.43814
386,2012,3,2,547.971802,308.738037,20.522972,56.341957,26.700411,-4.689094,1.795654,1.848608,6.803782,-6.419044,-13.950701,3.268241,-1.762684,20.010553,-12.40671
387,2012,3,3,547.847473,309.738129,20.533909,56.537292,26.680136,-4.710719,2.1254,1.902884,7.174067,-6.490106,-13.970224,3.602757,-1.710333,20.426624,-12.473225
388,2012,3,4,548.025452,309.982178,20.534513,56.563461,26.688017,-4.679762,2.205867,1.905884,7.223673,-6.462483,-13.942276,3.684388,-1.707439,20.482367,-12.44737
389,2012,3,5,547.768799,309.442413,20.514805,56.491428,26.701145,-4.724403,2.027898,1.808077,7.087125,-6.416471,-13.982579,3.503844,-1.801778,20.328934,-12.404301


In [None]:
df_area_micro.tail()

Unnamed: 0,year,month,day,area_deal,area_full_rent,area_year_rent,deal_full_rent_rate,deal_year_rent_multiple,6m_before_area_deal_mean,6m_before_area_full_rent_mean,6m_before_area_year_rent_mean,6m_before_deal_full_rent_rate,6m_before_deal_year_rent_multiple,12m_before_area_deal_mean,12m_before_area_full_rent_mean,12m_before_area_year_rent_mean,12m_before_deal_full_rent_rate,12m_before_deal_year_rent_multiple
4642,2023,10,27,1029.540405,596.237915,30.427759,57.913017,33.835564,-0.137066,2.255442,3.088138,2.39579,-3.128595,-4.653163,-3.794986,0.891465,0.900052,-5.495639
4643,2023,10,28,1029.95874,596.546021,30.465477,57.919407,33.807407,-0.096488,2.308283,3.215925,2.407087,-3.209206,-4.614421,-3.745272,1.016528,0.911185,-5.57428
4644,2023,10,29,1029.966797,596.627197,30.464407,57.926838,33.808857,-0.095707,2.322204,3.2123,2.420226,-3.205056,-4.613675,-3.732174,1.012981,0.924131,-5.570232
4645,2023,10,30,1030.069458,596.618347,30.486279,57.920204,33.787971,-0.085749,2.320687,3.2864,2.408497,-3.264851,-4.604167,-3.733602,1.085502,0.912574,-5.628566
4646,2023,10,31,1029.9823,596.441956,30.495527,57.907982,33.774864,-0.094203,2.290435,3.317734,2.386887,-3.302377,-4.612239,-3.762064,1.116168,0.891279,-5.665175


In [None]:
df_area_micro.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4262 entries, 385 to 4646
Data columns (total 18 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   year                                4262 non-null   int64  
 1   month                               4262 non-null   int64  
 2   day                                 4262 non-null   int64  
 3   area_deal                           4262 non-null   float32
 4   area_full_rent                      4262 non-null   float32
 5   area_year_rent                      4262 non-null   float32
 6   deal_full_rent_rate                 4262 non-null   float32
 7   deal_year_rent_multiple             4262 non-null   float32
 8   6m_before_area_deal_mean            4262 non-null   float32
 9   6m_before_area_full_rent_mean       4262 non-null   float32
 10  6m_before_area_year_rent_mean       4262 non-null   float32
 11  6m_before_deal_full_rent_rate       4262 

In [None]:
df_area_micro.to_pickle('/content/drive/MyDrive/house_price/after_data/df_area_micro.pkl')

## Merging with final_economic

In [None]:
import pandas as pd
df_area_micro = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/df_area_micro.pkl')
df_area_micro

Unnamed: 0,year,month,day,area_deal,area_full_rent,area_year_rent,deal_full_rent_rate,deal_year_rent_multiple,6m_before_area_deal_mean,6m_before_area_full_rent_mean,6m_before_area_year_rent_mean,6m_before_deal_full_rent_rate,6m_before_deal_year_rent_multiple,12m_before_area_deal_mean,12m_before_area_full_rent_mean,12m_before_area_year_rent_mean,12m_before_deal_full_rent_rate,12m_before_deal_year_rent_multiple
394,2012,3,1,548.170105,309.035980,20.537769,56.375927,26.690830,-4.654602,1.893891,1.922042,6.868177,-6.452623,-14.525463,2.814531,-1.909885,20.286734,-12.861209
395,2012,3,2,547.971802,308.738037,20.522972,56.341957,26.700411,-4.689094,1.795654,1.848608,6.803782,-6.419044,-14.556384,2.715407,-1.980558,20.214254,-12.829931
396,2012,3,3,547.847473,309.738129,20.533909,56.537292,26.680136,-4.710719,2.125400,1.902884,7.174067,-6.490106,-14.575770,3.048131,-1.928323,20.631031,-12.896124
397,2012,3,4,548.025452,309.982178,20.534513,56.563461,26.688017,-4.679762,2.205867,1.905884,7.223673,-6.462483,-14.548018,3.129325,-1.925435,20.686869,-12.870394
398,2012,3,5,547.768799,309.442413,20.514805,56.491428,26.701145,-4.724403,2.027898,1.808077,7.087125,-6.416471,-14.588037,2.949749,-2.019566,20.533175,-12.827534
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4467,2023,4,26,1024.179443,580.640137,29.545254,56.693203,34.664772,-5.149648,-6.311741,-2.039094,-1.225192,-3.175297,-5.414960,-4.591407,1.987396,0.870703,-7.258101
4468,2023,4,27,1024.288940,580.863586,29.561056,56.708954,34.649944,-5.139507,-6.275686,-1.986700,-1.197750,-3.216714,-5.404848,-4.554690,2.041944,0.898727,-7.297771
4469,2023,4,28,1024.102783,581.085632,29.577700,56.740944,34.624153,-5.156748,-6.239858,-1.931516,-1.142015,-3.288753,-5.422040,-4.518205,2.099396,0.955645,-7.366772
4470,2023,4,29,1023.981995,581.219116,29.587328,56.760674,34.608803,-5.167934,-6.218320,-1.899592,-1.107641,-3.331629,-5.433195,-4.496271,2.132632,0.990749,-7.407840


In [None]:
import pandas as pd
df_economic = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/final_economic.pkl')
df_economic

Unnamed: 0,date,year,month,day,kospi_index,korea_rp,korea_3_year,korea_10_year,us_3_month,us_2_year,...,last_month_etc_supply_12m_before,last_month_total_supply_12m_before,last_month_total_unsold_count_12m_before,last_month_total_deal_count_12m_before,last_month_total_full_rent_count_12m_before,last_month_total_month_rent_count_12m_before,korea_10-3_year_12m_before,us_10-2_year_12m_before,us_10-3_year_month_12m_before,last_month_total_unsold_ratio_12m_before
0,2012-02-01,2012,2,1,1959.24,3.25,3.380,3.750,0.061,0.226,...,0.0,265.0,-17137.0,-4393.0,-1891.0,-237.0,-0.436429,-1.199357,-1.668893,-5692.981439
1,2012-02-02,2012,2,2,1984.30,3.25,3.380,3.760,0.084,0.226,...,0.0,265.0,-17137.0,-4393.0,-1891.0,-237.0,-0.426429,-1.206357,-1.698893,-5692.981439
2,2012-02-03,2012,2,3,1972.34,3.25,3.380,3.760,0.079,0.238,...,0.0,265.0,-17137.0,-4393.0,-1891.0,-237.0,-0.426429,-1.117357,-1.592893,-5692.981439
3,2012-02-04,2012,2,4,1972.34,3.25,3.380,3.760,0.079,0.238,...,0.0,265.0,-17137.0,-4393.0,-1891.0,-237.0,-0.426429,-1.117357,-1.592893,-5692.981439
4,2012-02-05,2012,2,5,1972.34,3.25,3.380,3.760,0.079,0.238,...,0.0,265.0,-17137.0,-4393.0,-1891.0,-237.0,-0.426429,-1.117357,-1.592893,-5692.981439
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4285,2023-10-26,2023,10,26,2299.08,3.50,3.692,3.977,5.479,5.046,...,-150.0,535.0,18202.0,2799.0,915.0,-451.0,0.268194,0.205323,-0.879968,-14866.267624
4286,2023-10-27,2023,10,27,2302.81,3.50,3.692,3.977,5.477,5.015,...,-150.0,535.0,18202.0,2799.0,915.0,-451.0,0.268194,0.232323,-0.881968,-14866.267624
4287,2023-10-28,2023,10,28,2302.81,3.50,3.692,3.977,5.477,5.015,...,-150.0,535.0,18202.0,2799.0,915.0,-451.0,0.268194,0.232323,-0.881968,-14866.267624
4288,2023-10-29,2023,10,29,2302.81,3.50,3.692,3.977,5.477,5.015,...,-150.0,535.0,18202.0,2799.0,915.0,-451.0,0.268194,0.232323,-0.881968,-14866.267624


In [None]:
df_original_dataset = pd.merge(df_area_micro,df_economic, on = ['year','month','day'])
df_original_dataset

Unnamed: 0,year,month,day,area_deal,area_full_rent,area_year_rent,deal_full_rent_rate,deal_year_rent_multiple,6m_before_area_deal_mean,6m_before_area_full_rent_mean,...,last_month_etc_supply_12m_before,last_month_total_supply_12m_before,last_month_total_unsold_count_12m_before,last_month_total_deal_count_12m_before,last_month_total_full_rent_count_12m_before,last_month_total_month_rent_count_12m_before,korea_10-3_year_12m_before,us_10-2_year_12m_before,us_10-3_year_month_12m_before,last_month_total_unsold_ratio_12m_before
0,2012,3,1,548.170105,309.035980,20.537769,56.375927,26.690830,-4.654602,1.893891,...,-2.0,-35.0,-15738.0,-2078.0,794.0,-73.0,-0.392419,-0.987774,-1.352387,-2253.746392
1,2012,3,2,547.971802,308.738037,20.522972,56.341957,26.700411,-4.689094,1.795654,...,-2.0,-35.0,-15738.0,-2078.0,794.0,-73.0,-0.407419,-1.023774,-1.395387,-2253.746392
2,2012,3,3,547.847473,309.738129,20.533909,56.537292,26.680136,-4.710719,2.125400,...,-2.0,-35.0,-15738.0,-2078.0,794.0,-73.0,-0.407419,-1.023774,-1.395387,-2253.746392
3,2012,3,4,548.025452,309.982178,20.534513,56.563461,26.688017,-4.679762,2.205867,...,-2.0,-35.0,-15738.0,-2078.0,794.0,-73.0,-0.407419,-1.023774,-1.395387,-2253.746392
4,2012,3,5,547.768799,309.442413,20.514805,56.491428,26.701145,-4.724403,2.027898,...,-2.0,-35.0,-15738.0,-2078.0,794.0,-73.0,-0.392419,-1.014774,-1.382387,-2253.746392
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4256,2023,10,26,1029.659058,596.442871,30.401781,57.926250,33.868378,-0.125557,2.290592,...,-150.0,535.0,18202.0,2799.0,915.0,-451.0,0.268194,0.205323,-0.879968,-14866.267624
4257,2023,10,27,1029.540405,596.237915,30.427759,57.913017,33.835564,-0.137066,2.255442,...,-150.0,535.0,18202.0,2799.0,915.0,-451.0,0.268194,0.232323,-0.881968,-14866.267624
4258,2023,10,28,1029.958740,596.546021,30.465477,57.919407,33.807407,-0.096488,2.308283,...,-150.0,535.0,18202.0,2799.0,915.0,-451.0,0.268194,0.232323,-0.881968,-14866.267624
4259,2023,10,29,1029.966797,596.627197,30.464407,57.926838,33.808857,-0.095707,2.322204,...,-150.0,535.0,18202.0,2799.0,915.0,-451.0,0.268194,0.232323,-0.881968,-14866.267624


In [None]:
df_original_dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4261 entries, 0 to 4260
Data columns (total 76 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   year                                          4261 non-null   int64  
 1   month                                         4261 non-null   int64  
 2   day                                           4261 non-null   int64  
 3   area_deal                                     4261 non-null   float32
 4   area_full_rent                                4261 non-null   float32
 5   area_year_rent                                4261 non-null   float32
 6   deal_full_rent_rate                           4261 non-null   float32
 7   deal_year_rent_multiple                       4261 non-null   float32
 8   6m_before_area_deal_mean                      4261 non-null   float32
 9   6m_before_area_full_rent_mean                 4261 non-null   f

> Confirmed that this dataset contains no null values
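An inner merge cannot introduce nulls, but it can silently drop rows whose key is missing on one side (the `info()` outputs above list 4262 rows before the merge and 4261 after). As a hedged sketch on toy data, the `indicator=True` option of `pd.merge` can show which keys failed to match; the frames `left` and `right` are made up for illustration.

```python
import pandas as pd

# Toy frames, made up for illustration only
left = pd.DataFrame({
    "year": [2012, 2012, 2012],
    "month": [3, 3, 3],
    "day": [1, 2, 3],
    "area_deal": [548.2, 548.0, 547.8],
})
right = pd.DataFrame({
    "year": [2012, 2012],
    "month": [3, 3],
    "day": [1, 2],
    "kospi_index": [2030.25, 2034.63],
})

keys = ["year", "month", "day"]

# A left merge with indicator=True tags rows that have no partner on the right
checked = pd.merge(left, right, on=keys, how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[keys])

# The inner merge keeps matched rows only, so it has no nulls but fewer rows
merged = pd.merge(left, right, on=keys)
total_nulls = int(merged.isnull().sum().sum())
print(len(left), len(merged), total_nulls)
```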

In [None]:
# Convert the date column to datetime
df_original_dataset['date'] = pd.to_datetime(df_original_dataset['date'])
df_original_dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4261 entries, 0 to 4260
Data columns (total 76 columns):
 #   Column                                        Non-Null Count  Dtype         
---  ------                                        --------------  -----         
 0   year                                          4261 non-null   int64         
 1   month                                         4261 non-null   int64         
 2   day                                           4261 non-null   int64         
 3   area_deal                                     4261 non-null   float32       
 4   area_full_rent                                4261 non-null   float32       
 5   area_year_rent                                4261 non-null   float32       
 6   deal_full_rent_rate                           4261 non-null   float32       
 7   deal_year_rent_multiple                       4261 non-null   float32       
 8   6m_before_area_deal_mean                      4261 non-null   float3

In [None]:
df_original_dataset.to_pickle('/content/drive/MyDrive/house_price/after_data/df_original_dataset_without_future.pkl')

## Merging the apartment index one year ahead

In [None]:
import pandas as pd
df_seoul_index = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/df_seoul_index.pkl')
df_seoul_index

Unnamed: 0,date,seoul_index
0,2008-04-07,59.733
1,2008-04-14,59.882
2,2008-04-21,60.042
3,2008-04-28,60.163
4,2008-05-05,60.278
...,...,...
778,2023-09-18,90.635
779,2023-09-25,90.681
780,2023-10-09,90.733
781,2023-10-16,90.787


In [None]:
import datetime

# Build every calendar date in the crawled period
start = datetime.datetime.strptime("07-04-2008", "%d-%m-%Y") # start date
end = datetime.datetime.strptime("31-10-2023", "%d-%m-%Y") # end date (range() below stops one day before it)
date_generated = [start + datetime.timedelta(days=x) for x in range(0, (end-start).days)] # dates from start up to, but not including, end
date_list=list()
for date in date_generated:
    date_list.append(date.strftime("%Y-%m-%d")) # format each generated date as YYYY-MM-DD
# df_date is a DataFrame holding every date to look up
df_date = pd.DataFrame({
    "future_date": date_list
}, columns=["future_date"])
df_date['future_date'] = pd.to_datetime(df_date['future_date'], format='%Y-%m-%d', errors='raise') # convert to datetime; the format matches the strings built above

df_date

Unnamed: 0,future_date
0,2008-04-07
1,2008-04-08
2,2008-04-09
3,2008-04-10
4,2008-04-11
...,...
5680,2023-10-26
5681,2023-10-27
5682,2023-10-28
5683,2023-10-29


In [None]:
# Left-join the daily dates with the weekly Seoul index to get the index value for each date
df_future =pd.merge(df_date, df_seoul_index, left_on='future_date', right_on='date', how='left')
print(df_future.info())
df_future

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5685 entries, 0 to 5684
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   future_date  5685 non-null   datetime64[ns]
 1   date         783 non-null    datetime64[ns]
 2   seoul_index  783 non-null    float64       
dtypes: datetime64[ns](2), float64(1)
memory usage: 177.7 KB
None


Unnamed: 0,future_date,date,seoul_index
0,2008-04-07,2008-04-07,59.733
1,2008-04-08,NaT,
2,2008-04-09,NaT,
3,2008-04-10,NaT,
4,2008-04-11,NaT,
...,...,...,...
5680,2023-10-26,NaT,
5681,2023-10-27,NaT,
5682,2023-10-28,NaT,
5683,2023-10-29,NaT,


In [None]:
df_future.drop('date',axis=1,inplace=True)

df_future['seoul_index'] = df_future['seoul_index'].interpolate(method='values')
df_future['seoul_index'] = round(df_future['seoul_index'],2)
print(df_future.info())
df_future

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5685 entries, 0 to 5684
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   future_date  5685 non-null   datetime64[ns]
 1   seoul_index  5685 non-null   float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 133.2 KB
None


Unnamed: 0,future_date,seoul_index
0,2008-04-07,59.73
1,2008-04-08,59.75
2,2008-04-09,59.78
3,2008-04-10,59.80
4,2008-04-11,59.82
...,...,...
5680,2023-10-26,90.82
5681,2023-10-27,90.82
5682,2023-10-28,90.82
5683,2023-10-29,90.82
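The index is published weekly, so the left merge leaves NaN on the days in between and `interpolate` fills them along a straight line. The sketch below reproduces this on two made-up weekly observations. It assumes a default integer index, where `method='linear'` gives the same result as the `method='values'` call used above, and it shows that days after the last observation simply repeat the last value.

```python
import pandas as pd

# Ten consecutive days and two weekly observations (toy values, made up)
daily = pd.DataFrame(
    {"future_date": pd.date_range("2008-04-07", periods=10, freq="D")}
)
weekly = pd.DataFrame({
    "date": pd.to_datetime(["2008-04-07", "2008-04-14"]),
    "seoul_index": [60.0, 60.7],
})

# Left merge keeps every day; days without an observation become NaN
toy = pd.merge(
    daily, weekly, left_on="future_date", right_on="date", how="left"
).drop(columns="date")

# Fill the gaps linearly; trailing NaN take the last observed value
toy["seoul_index"] = toy["seoul_index"].interpolate(method="linear").round(2)
print(toy)
```

The flat tail is worth keeping in mind: the constant 90.82 in the last rows of the output above is consistent with this fill behaviour rather than with newly observed values.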


In [None]:
# Compute the date 365 days earlier
df_future['date'] = df_future['future_date'] - pd.Timedelta(days=365)

df_future = df_future.rename(columns={'seoul_index':'future_index'})

print(df_future.info())
df_future.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5685 entries, 0 to 5684
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   future_date   5685 non-null   datetime64[ns]
 1   future_index  5685 non-null   float64       
 2   date          5685 non-null   datetime64[ns]
dtypes: datetime64[ns](2), float64(1)
memory usage: 177.7 KB
None


Unnamed: 0,future_date,future_index,date
0,2008-04-07,59.73,2007-04-08
1,2008-04-08,59.75,2007-04-09
2,2008-04-09,59.78,2007-04-10
3,2008-04-10,59.8,2007-04-11
4,2008-04-11,59.82,2007-04-12


In [None]:
df_future.tail()

Unnamed: 0,future_date,future_index,date
5680,2023-10-26,90.82,2022-10-26
5681,2023-10-27,90.82,2022-10-27
5682,2023-10-28,90.82,2022-10-28
5683,2023-10-29,90.82,2022-10-29
5684,2023-10-30,90.82,2022-10-30


In [None]:
df_future['year'] = df_future['date'].dt.year
df_future['month'] = df_future['date'].dt.month
df_future['day'] = df_future['date'].dt.day
df_future.drop('date',axis=1,inplace=True)

df_future.head()

Unnamed: 0,future_date,future_index,year,month,day
0,2008-04-07,59.73,2007,4,8
1,2008-04-08,59.75,2007,4,9
2,2008-04-09,59.78,2007,4,10
3,2008-04-10,59.8,2007,4,11
4,2008-04-11,59.82,2007,4,12


In [None]:
# Merge the DataFrames
df_original_dataset = pd.merge(df_original_dataset,df_future, on = ['year','month','day'], how='left')
df_original_dataset

Unnamed: 0,year,month,day,area_deal,area_full_rent,area_year_rent,deal_full_rent_rate,deal_year_rent_multiple,6m_before_area_deal_mean,6m_before_area_full_rent_mean,...,last_month_total_unsold_count_12m_before,last_month_total_deal_count_12m_before,last_month_total_full_rent_count_12m_before,last_month_total_month_rent_count_12m_before,korea_10-3_year_12m_before,us_10-2_year_12m_before,us_10-3_year_month_12m_before,last_month_total_unsold_ratio_12m_before,future_date,future_index
0,2012,3,1,548.170105,309.035980,20.537769,56.375927,26.690830,-4.654602,1.893891,...,-15738.0,-2078.0,794.0,-73.0,-0.392419,-0.987774,-1.352387,-2253.746392,2013-03-01,56.00
1,2012,3,2,547.971802,308.738037,20.522972,56.341957,26.700411,-4.689094,1.795654,...,-15738.0,-2078.0,794.0,-73.0,-0.407419,-1.023774,-1.395387,-2253.746392,2013-03-02,56.00
2,2012,3,3,547.847473,309.738129,20.533909,56.537292,26.680136,-4.710719,2.125400,...,-15738.0,-2078.0,794.0,-73.0,-0.407419,-1.023774,-1.395387,-2253.746392,2013-03-03,55.99
3,2012,3,4,548.025452,309.982178,20.534513,56.563461,26.688017,-4.679762,2.205867,...,-15738.0,-2078.0,794.0,-73.0,-0.407419,-1.023774,-1.395387,-2253.746392,2013-03-04,55.99
4,2012,3,5,547.768799,309.442413,20.514805,56.491428,26.701145,-4.724403,2.027898,...,-15738.0,-2078.0,794.0,-73.0,-0.392419,-1.014774,-1.382387,-2253.746392,2013-03-05,55.98
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4256,2023,10,26,1029.659058,596.442871,30.401781,57.926250,33.868378,-0.125557,2.290592,...,18202.0,2799.0,915.0,-451.0,0.268194,0.205323,-0.879968,-14866.267624,NaT,
4257,2023,10,27,1029.540405,596.237915,30.427759,57.913017,33.835564,-0.137066,2.255442,...,18202.0,2799.0,915.0,-451.0,0.268194,0.232323,-0.881968,-14866.267624,NaT,
4258,2023,10,28,1029.958740,596.546021,30.465477,57.919407,33.807407,-0.096488,2.308283,...,18202.0,2799.0,915.0,-451.0,0.268194,0.232323,-0.881968,-14866.267624,NaT,
4259,2023,10,29,1029.966797,596.627197,30.464407,57.926838,33.808857,-0.095707,2.322204,...,18202.0,2799.0,915.0,-451.0,0.268194,0.232323,-0.881968,-14866.267624,NaT,


In [None]:
df_original_dataset.isnull().sum()

year                                          0
month                                         0
day                                           0
area_deal                                     0
area_full_rent                                0
                                           ... 
us_10-2_year_12m_before                       0
us_10-3_year_month_12m_before                 0
last_month_total_unsold_ratio_12m_before      0
future_date                                 365
future_index                                365
Length: 78, dtype: int64

> The target is shifted 365 days ahead, so 365 null values are expected
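The count can be checked on a toy calendar: when the target for each day is the value 365 days later, the final 365 days have no target yet. The sketch below uses made-up frames (`features`, `target`) rather than the notebook's data. It also shows that a fixed 365-day shift and a calendar-year shift (`pd.DateOffset(years=1)`) disagree by one day when a leap day falls in between, which may matter if "one year later" is meant literally.

```python
import pandas as pd

# Two full non-leap years of daily rows (toy data, made up)
dates = pd.date_range("2022-01-01", "2023-12-31", freq="D")
features = pd.DataFrame({"date": dates})
target = pd.DataFrame({"future_date": dates, "future_index": range(len(dates))})

# Attach each target value to the date 365 days before it
target["date"] = target["future_date"] - pd.Timedelta(days=365)
labelled = features.merge(target, on="date", how="left")

# Rows in the last 365 days have no target yet
n_missing = int(labelled["future_index"].isnull().sum())
print(n_missing)

# Fixed 365-day shift versus calendar-year shift across a leap day
fixed = pd.Timestamp("2012-03-01") - pd.Timedelta(days=365)
calendar = pd.Timestamp("2012-03-01") - pd.DateOffset(years=1)
print(fixed.date(), calendar.date())
```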

In [None]:
df_original_dataset.loc[(df_original_dataset['year']>=2022)&(df_original_dataset['future_date'].isnull()),:]

Unnamed: 0,year,month,day,area_deal,area_full_rent,area_year_rent,deal_full_rent_rate,deal_year_rent_multiple,6m_before_area_deal_mean,6m_before_area_full_rent_mean,...,last_month_total_unsold_count_12m_before,last_month_total_deal_count_12m_before,last_month_total_full_rent_count_12m_before,last_month_total_month_rent_count_12m_before,korea_10-3_year_12m_before,us_10-2_year_12m_before,us_10-3_year_month_12m_before,last_month_total_unsold_ratio_12m_before,future_date,future_index
3896,2022,10,31,1072.249390,616.301147,30.388184,57.477409,35.285076,-0.975604,1.268276,...,27762.0,-2155.0,951.0,1609.0,-0.499065,-1.627161,-1.565419,22272.451093,NaT,
3897,2022,11,1,1071.884399,615.858154,30.359941,57.455654,35.305878,-1.399316,0.338758,...,33142.0,-1721.0,-1437.0,-724.0,-0.313000,-1.544267,-1.603100,-58197.386187,NaT,
3898,2022,11,2,1071.720825,615.359009,30.345861,57.417847,35.316868,-1.414363,0.257434,...,33142.0,-1721.0,-1437.0,-724.0,-0.357000,-1.556267,-1.539100,-58197.386187,NaT,
3899,2022,11,3,1071.687134,615.154297,30.362978,57.400547,35.295849,-1.417462,0.224082,...,33142.0,-1721.0,-1437.0,-724.0,-0.369000,-1.610267,-1.483100,-58197.386187,NaT,
3900,2022,11,4,1071.556396,615.286621,30.390690,57.419903,35.259365,-1.429489,0.245641,...,33142.0,-1721.0,-1437.0,-724.0,-0.331000,-1.536267,-1.452100,-58197.386187,NaT,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4256,2023,10,26,1029.659058,596.442871,30.401781,57.926250,33.868378,-0.125557,2.290592,...,18202.0,2799.0,915.0,-451.0,0.268194,0.205323,-0.879968,-14866.267624,NaT,
4257,2023,10,27,1029.540405,596.237915,30.427759,57.913017,33.835564,-0.137066,2.255442,...,18202.0,2799.0,915.0,-451.0,0.268194,0.232323,-0.881968,-14866.267624,NaT,
4258,2023,10,28,1029.958740,596.546021,30.465477,57.919407,33.807407,-0.096488,2.308283,...,18202.0,2799.0,915.0,-451.0,0.268194,0.232323,-0.881968,-14866.267624,NaT,
4259,2023,10,29,1029.966797,596.627197,30.464407,57.926838,33.808857,-0.095707,2.322204,...,18202.0,2799.0,915.0,-451.0,0.268194,0.232323,-0.881968,-14866.267624,NaT,


## Saving the file

In [None]:
df_original_dataset.to_pickle('/content/drive/MyDrive/house_price/after_data/df_original_dataset.pkl')

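# Machine learning

- Build models that forecast the trend of overall Seoul housing prices using several regression models

## Creating df_train_test

- df_train_test is a DataFrame containing only the features of df_original_dataset that are highly correlated with future_index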

In [None]:
import pandas as pd

df_original_dataset = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/df_original_dataset.pkl')
print(df_original_dataset.info())
df_original_dataset

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4261 entries, 0 to 4260
Data columns (total 78 columns):
 #   Column                                        Non-Null Count  Dtype         
---  ------                                        --------------  -----         
 0   year                                          4261 non-null   int64         
 1   month                                         4261 non-null   int64         
 2   day                                           4261 non-null   int64         
 3   area_deal                                     4261 non-null   float32       
 4   area_full_rent                                4261 non-null   float32       
 5   area_year_rent                                4261 non-null   float32       
 6   deal_full_rent_rate                           4261 non-null   float32       
 7   deal_year_rent_multiple                       4261 non-null   float32       
 8   6m_before_area_deal_mean                      4261 non-null   float3

Unnamed: 0,year,month,day,area_deal,area_full_rent,area_year_rent,deal_full_rent_rate,deal_year_rent_multiple,6m_before_area_deal_mean,6m_before_area_full_rent_mean,...,last_month_total_unsold_count_12m_before,last_month_total_deal_count_12m_before,last_month_total_full_rent_count_12m_before,last_month_total_month_rent_count_12m_before,korea_10-3_year_12m_before,us_10-2_year_12m_before,us_10-3_year_month_12m_before,last_month_total_unsold_ratio_12m_before,future_date,future_index
0,2012,3,1,548.170105,309.035980,20.537769,56.375927,26.690830,-4.654602,1.893891,...,-15738.0,-2078.0,794.0,-73.0,-0.392419,-0.987774,-1.352387,-2253.746392,2013-03-01,56.00
1,2012,3,2,547.971802,308.738037,20.522972,56.341957,26.700411,-4.689094,1.795654,...,-15738.0,-2078.0,794.0,-73.0,-0.407419,-1.023774,-1.395387,-2253.746392,2013-03-02,56.00
2,2012,3,3,547.847473,309.738129,20.533909,56.537292,26.680136,-4.710719,2.125400,...,-15738.0,-2078.0,794.0,-73.0,-0.407419,-1.023774,-1.395387,-2253.746392,2013-03-03,55.99
3,2012,3,4,548.025452,309.982178,20.534513,56.563461,26.688017,-4.679762,2.205867,...,-15738.0,-2078.0,794.0,-73.0,-0.407419,-1.023774,-1.395387,-2253.746392,2013-03-04,55.99
4,2012,3,5,547.768799,309.442413,20.514805,56.491428,26.701145,-4.724403,2.027898,...,-15738.0,-2078.0,794.0,-73.0,-0.392419,-1.014774,-1.382387,-2253.746392,2013-03-05,55.98
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4256,2023,10,26,1029.659058,596.442871,30.401781,57.926250,33.868378,-0.125557,2.290592,...,18202.0,2799.0,915.0,-451.0,0.268194,0.205323,-0.879968,-14866.267624,NaT,
4257,2023,10,27,1029.540405,596.237915,30.427759,57.913017,33.835564,-0.137066,2.255442,...,18202.0,2799.0,915.0,-451.0,0.268194,0.232323,-0.881968,-14866.267624,NaT,
4258,2023,10,28,1029.958740,596.546021,30.465477,57.919407,33.807407,-0.096488,2.308283,...,18202.0,2799.0,915.0,-451.0,0.268194,0.232323,-0.881968,-14866.267624,NaT,
4259,2023,10,29,1029.966797,596.627197,30.464407,57.926838,33.808857,-0.095707,2.322204,...,18202.0,2799.0,915.0,-451.0,0.268194,0.232323,-0.881968,-14866.267624,NaT,


In [None]:
pd.set_option('display.max_rows', 80)

df_original_dataset.corr()['future_index'].sort_values(ascending=False).to_frame()





Unnamed: 0,future_index
future_index,1.0
deal_year_rent_multiple,0.967644
area_deal,0.959716
area_full_rent,0.950341
year,0.945281
area_year_rent,0.896595
12m_before_area_deal_mean,0.785215
kospi_index,0.774767
last_month_total_month_rent_count,0.73273
6m_before_area_deal_mean,0.681655


In [None]:
# Build a Series holding each column's correlation with future_index
df_corr = df_original_dataset.corr(numeric_only=False)['future_index']
df_corr.head()

year              0.945281
month             0.020036
day               0.004487
area_deal         0.959716
area_full_rent    0.950341
Name: future_index, dtype: float64

In [None]:
df_corr.info()

<class 'pandas.core.series.Series'>
Index: 78 entries, year to future_index
Series name: future_index
Non-Null Count  Dtype  
--------------  -----  
78 non-null     float64
dtypes: float64(1)
memory usage: 3.3+ KB


In [None]:
# Rename the Series
df_corr.name = 'correlation'
df_corr.info()

<class 'pandas.core.series.Series'>
Index: 78 entries, year to future_index
Series name: correlation
Non-Null Count  Dtype  
--------------  -----  
78 non-null     float64
dtypes: float64(1)
memory usage: 3.3+ KB


In [None]:
# Keep columns whose correlation coefficient is >= 0.7 or <= -0.7 (strong positive or strong negative correlation)
learning_feature_list = list(df_corr[(df_corr >= 0.7) | (df_corr <= -0.7)].index)
learning_feature_list

['year',
 'area_deal',
 'area_full_rent',
 'area_year_rent',
 'deal_year_rent_multiple',
 '12m_before_area_deal_mean',
 '12m_before_deal_full_rent_rate',
 'date',
 'kospi_index',
 'korea_rp',
 'last_month_total_month_rent_count',
 'future_date',
 'future_index']

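> Both the macroeconomic indicators and the apartment valuation indicators correlate with the future price -> this confirms that numerically expressed indicators are, to some degree, correlated with the future price

> The year, date, future_date, and future_index columns are highly correlated with future_index, but they were selected only because the real-estate index has trended upward over the period, so the correlation carries little meaning. They are kept anyway because the later visualization steps need them for plotting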
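The two-sided threshold used above (>= 0.7 or <= -0.7) is equivalent to a single test on the absolute value. A minimal sketch on made-up columns follows; it also drops the target itself from the selected list, which the notebook deliberately does not do because it keeps those columns for plotting.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only
x = np.arange(1, 11, dtype=float)
toy = pd.DataFrame({
    "up": 2 * x + 1,                              # rises with the target
    "down": -3 * x,                               # falls as the target rises
    "noise": [1, -1, 1, -1, 1, -1, 1, -1, 1, -1],  # barely related
    "future_index": x,
})

# Correlation of every column with the target
corr = toy.corr()["future_index"]

# |r| >= 0.7 covers both strong positive and strong negative correlation
selected = corr[corr.abs() >= 0.7].index.drop("future_index").tolist()
print(selected)
```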

In [None]:
# Select only the columns to be used to obtain the training & test dataset
df_train_test = df_original_dataset[learning_feature_list]
df_train_test

Unnamed: 0,year,area_deal,area_full_rent,area_year_rent,deal_year_rent_multiple,12m_before_area_deal_mean,12m_before_deal_full_rent_rate,date,kospi_index,korea_rp,last_month_total_month_rent_count,future_date,future_index
0,2012,548.170105,309.035980,20.537769,26.690830,-13.919560,20.082911,2012-03-01,2030.25,3.25,2638.0,2013-03-01,56.00
1,2012,547.971802,308.738037,20.522972,26.700411,-13.950701,20.010553,2012-03-02,2034.63,3.25,2638.0,2013-03-02,56.00
2,2012,547.847473,309.738129,20.533909,26.680136,-13.970224,20.426624,2012-03-03,2034.63,3.25,2638.0,2013-03-03,55.99
3,2012,548.025452,309.982178,20.534513,26.688017,-13.942276,20.482367,2012-03-04,2034.63,3.25,2638.0,2013-03-04,55.99
4,2012,547.768799,309.442413,20.514805,26.701145,-13.982579,20.328934,2012-03-05,2016.06,3.25,2638.0,2013-03-05,55.98
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4256,2023,1029.659058,596.442871,30.401781,33.868378,-4.642175,0.923108,2023-10-26,2299.08,3.50,7342.0,NaT,
4257,2023,1029.540405,596.237915,30.427759,33.835564,-4.653163,0.900052,2023-10-27,2302.81,3.50,7342.0,NaT,
4258,2023,1029.958740,596.546021,30.465477,33.807407,-4.614421,0.911185,2023-10-28,2302.81,3.50,7342.0,NaT,
4259,2023,1029.966797,596.627197,30.464407,33.808857,-4.613675,0.924131,2023-10-29,2302.81,3.50,7342.0,NaT,


In [None]:
df_train_test.to_pickle('/content/drive/MyDrive/house_price/after_data/df_train_test.pkl')

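## Creating df_train and df_test

- Rows whose target falls in 2023 (feature dates in 2022, with `future_index` taken one year later) are held out as test data, to check how well a model trained on the past performs on recent data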

In [3]:
import pandas as pd

df_train_test = pd.read_pickle('/content/drive/MyDrive/house_price/after_data/df_train_test.pkl')
df_train_test.head()

Unnamed: 0,year,area_deal,area_full_rent,area_year_rent,deal_year_rent_multiple,12m_before_area_deal_mean,12m_before_deal_full_rent_rate,date,kospi_index,korea_rp,last_month_total_month_rent_count,future_date,future_index
0,2012,548.170105,309.03598,20.537769,26.69083,-13.91956,20.082911,2012-03-01,2030.25,3.25,2638.0,2013-03-01,56.0
1,2012,547.971802,308.738037,20.522972,26.700411,-13.950701,20.010553,2012-03-02,2034.63,3.25,2638.0,2013-03-02,56.0
2,2012,547.847473,309.738129,20.533909,26.680136,-13.970224,20.426624,2012-03-03,2034.63,3.25,2638.0,2013-03-03,55.99
3,2012,548.025452,309.982178,20.534513,26.688017,-13.942276,20.482367,2012-03-04,2034.63,3.25,2638.0,2013-03-04,55.99
4,2012,547.768799,309.442413,20.514805,26.701145,-13.982579,20.328934,2012-03-05,2016.06,3.25,2638.0,2013-03-05,55.98


In [4]:
df_train_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4261 entries, 0 to 4260
Data columns (total 13 columns):
 #   Column                             Non-Null Count  Dtype         
---  ------                             --------------  -----         
 0   year                               4261 non-null   int64         
 1   area_deal                          4261 non-null   float32       
 2   area_full_rent                     4261 non-null   float32       
 3   area_year_rent                     4261 non-null   float32       
 4   deal_year_rent_multiple            4261 non-null   float32       
 5   12m_before_area_deal_mean          4261 non-null   float32       
 6   12m_before_deal_full_rent_rate     4261 non-null   float32       
 7   date                               4261 non-null   datetime64[ns]
 8   kospi_index                        4261 non-null   float64       
 9   korea_rp                           4261 non-null   float64       
 10  last_month_total_month_rent_count  4

In [5]:
# Set the input features used for training
train_columns = list(df_train_test.columns)

to_remove = ['future_index','date','year','future_date']
for x in to_remove:
    train_columns.remove(x)
train_columns


['area_deal',
 'area_full_rent',
 'area_year_rent',
 'deal_year_rent_multiple',
 '12m_before_area_deal_mean',
 '12m_before_deal_full_rent_rate',
 'kospi_index',
 'korea_rp',
 'last_month_total_month_rent_count']

In [6]:
# Some rows in df_train_test have a null future_index, so filter them out: the training set goes to df_train and the test set to df_test
df_train = df_train_test.loc[df_train_test['year']<2022, :].copy()
df_test = df_train_test.loc[(df_train_test['year']>=2022)&(df_train_test['future_index'].notnull()), :].copy()

In [7]:
df_train

Unnamed: 0,year,area_deal,area_full_rent,area_year_rent,deal_year_rent_multiple,12m_before_area_deal_mean,12m_before_deal_full_rent_rate,date,kospi_index,korea_rp,last_month_total_month_rent_count,future_date,future_index
0,2012,548.170105,309.035980,20.537769,26.690830,-13.919560,20.082911,2012-03-01,2030.25,3.25,2638.0,2013-03-01,56.00
1,2012,547.971802,308.738037,20.522972,26.700411,-13.950701,20.010553,2012-03-02,2034.63,3.25,2638.0,2013-03-02,56.00
2,2012,547.847473,309.738129,20.533909,26.680136,-13.970224,20.426624,2012-03-03,2034.63,3.25,2638.0,2013-03-03,55.99
3,2012,548.025452,309.982178,20.534513,26.688017,-13.942276,20.482367,2012-03-04,2034.63,3.25,2638.0,2013-03-04,55.99
4,2012,547.768799,309.442413,20.514805,26.701145,-13.982579,20.328934,2012-03-05,2016.06,3.25,2638.0,2013-03-05,55.98
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3588,2021,1083.472168,607.116211,28.811615,37.605396,15.883393,-5.929564,2021-12-27,2999.55,1.00,6661.0,2022-12-27,95.94
3589,2021,1083.753906,606.659790,28.847986,37.567749,15.913527,-6.024723,2021-12-28,3020.24,1.00,6661.0,2022-12-28,95.89
3590,2021,1083.566772,606.564270,28.837791,37.574539,15.893512,-6.023295,2021-12-29,2993.29,1.00,6661.0,2022-12-29,95.85
3591,2021,1083.580566,606.848999,28.828548,37.587067,15.894987,-5.980375,2021-12-30,2977.65,1.00,6661.0,2022-12-30,95.80


In [8]:
df_test

Unnamed: 0,year,area_deal,area_full_rent,area_year_rent,deal_year_rent_multiple,12m_before_area_deal_mean,12m_before_deal_full_rent_rate,date,kospi_index,korea_rp,last_month_total_month_rent_count,future_date,future_index
3593,2022,1083.536621,606.893738,28.841393,37.568806,13.972645,-5.584571,2022-01-01,2977.65,1.0,10052.0,2023-01-01,95.71
3594,2022,1083.647461,606.853333,28.839933,37.574547,13.984303,-5.600518,2022-01-02,2977.65,1.0,10052.0,2023-01-02,95.67
3595,2022,1083.479614,607.364014,28.851576,37.553570,13.966648,-5.506430,2022-01-03,2988.77,1.0,10052.0,2023-01-03,95.62
3596,2022,1083.489624,606.642090,28.877390,37.520344,13.967700,-5.619623,2022-01-04,2989.24,1.0,10052.0,2023-01-04,95.56
3597,2022,1083.651611,607.107971,28.828796,37.589207,13.984739,-5.561261,2022-01-05,2953.97,1.0,10052.0,2023-01-05,95.51
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3891,2022,1073.413330,616.464355,30.461447,35.238422,0.732220,1.600907,2022-10-26,2249.56,3.0,7793.0,2023-10-26,90.82
3892,2022,1073.308960,616.624390,30.406357,35.298836,0.722425,1.637168,2022-10-27,2288.78,3.0,7793.0,2023-10-27,90.82
3893,2022,1073.112549,616.791504,30.423437,35.272560,0.703994,1.683322,2022-10-28,2268.40,3.0,7793.0,2023-10-28,90.82
3894,2022,1072.432983,617.013306,30.416872,35.257832,0.640221,1.784342,2022-10-29,2268.40,3.0,7793.0,2023-10-29,90.82
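
> As a sanity check on the split, the row counts can be reproduced from the calendar. This sketch assumes one row per calendar day (which the displayed indices suggest) and assumes the last row with a known future_index is dated 2022-10-30, one row past the last one displayed. The variable names are illustrative.

```python
from datetime import date

# Training rows: every day from the first row (2012-03-01) through 2021-12-31
train_days = (date(2021, 12, 31) - date(2012, 3, 1)).days + 1

# Test rows: every day from 2022-01-01 through the last day whose
# one-year-ahead index is known (assumed here to be 2022-10-30)
test_days = (date(2022, 10, 30) - date(2022, 1, 1)).days + 1

print(train_days, test_days)  # 3593 303
```

> Under those assumptions the counts agree with the first test index shown above (3593) and with the 303 test targets summarized in the next section.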


## Applying the models

In [18]:
display(df_test['future_index'].describe())

count    303.000000
mean      91.590693
std        1.501675
min       90.310000
25%       90.520000
50%       90.800000
75%       92.390000
max       95.710000
Name: future_index, dtype: float64

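> The future_index values to be predicted in the test set range from 90.31 to 95.71.

> The goal is a model whose MSE is at most 1. When the MSE is below 1, the RMSE is numerically larger than the MSE, so a better model can look worse if the two numbers are mixed up. To avoid that confusion, performance is tracked with MSE.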
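
> A small sketch of the MSE/RMSE relationship described above, using two MSE values that appear later in this notebook. The variable names are illustrative. Because the square root is monotonic, both metrics rank models identically; only the printed magnitude flips around 1.

```python
import math

# MSE values reported later in this notebook for two of the models
mse_gradient_boosting = 1.1167184205462022
mse_best_ensemble = 0.44679675735094954

rmse_gb = math.sqrt(mse_gradient_boosting)
rmse_ens = math.sqrt(mse_best_ensemble)

# Above 1 the square root shrinks the number; below 1 it grows it
print(f"gradient boosting: mse={mse_gradient_boosting:.4f} rmse={rmse_gb:.4f}")
print(f"best ensemble:     mse={mse_best_ensemble:.4f} rmse={rmse_ens:.4f}")

# The ordering of the two models is the same under either metric
print(rmse_ens < rmse_gb, mse_best_ensemble < mse_gradient_boosting)
```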

### Linear regression model

In [None]:
# Importing required libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import plotly.graph_objs as go
from sklearn.metrics import r2_score

# Creating a Linear Regression model
model = LinearRegression()

X_train = df_train[train_columns]
y_train = df_train['future_index']

X_test = df_test[train_columns]
y_test = df_test['future_index']

# Training the model on the training set
model.fit(X_train, y_train)


# Making predictions on the testing set
y_pred = model.predict(X_test)


# Evaluating the model using Mean Squared Error (MSE)
mse = mean_squared_error(y_test , y_pred)
print('LinearRegression Mean Squared Error:', mse)
print('LinearRegression Root Mean Squared Error:', np.sqrt(mse))
print()

# Creating the traces
trace1 = go.Scatter(
    x = df_test['future_date'],
    y = y_test.values,
    mode = 'lines',
    name = 'actual_value'
)


trace2 = go.Scatter(
    x = df_test['future_date'],
    y = list(y_pred),
    mode = 'lines',
    name = 'predict_value'
)

# Combining the traces and creating the layout
data = [trace1, trace2]
layout = go.Layout(title='Linear Regression Predict future price for test dataset', xaxis=dict(title='Date'), yaxis=dict(title='Seoul apartment index'))

# Creating the figure and plotting it
fig = go.Figure(data=data, layout=layout)
fig.show()


LinearRegression Mean Squared Error: 21.649886770747557
LinearRegression Root Mean Squared Error: 4.65294388218336



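> There is some error, but the predictions follow a trend similar to the actual values.

> However, from August 2023 onward the predictions diverge sharply from the actual values.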

### Polynomial regression model

#### Degree 2

In [8]:
# Importing required libraries
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import plotly.graph_objs as go


X_train = df_train[train_columns]
y_train = df_train['future_index']

X_test = df_test[train_columns]
y_test = df_test['future_index']

# Creating polynomial features
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Creating a Polynomial Regression model
model = LinearRegression()

# Training the model on the training set
model.fit(X_train_poly, y_train)

# Making predictions on the testing set
y_pred = model.predict(X_test_poly)

# Evaluating the model using Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print('2 PolynomialFeatures Mean Squared Error:', mse)
print('2 PolynomialFeatures Root Mean Squared Error:', np.sqrt(mse))
print()

# Creating the traces
trace1 = go.Scatter(
    x = df_test['future_date'],
    y = y_test.values,
    mode = 'lines',
    name = 'actual_value'
)


trace2 = go.Scatter(
    x = df_test['future_date'],
    y = list(y_pred),
    mode = 'lines',
    name = 'predict_value'
)

# Combining the traces and creating the layout
data = [trace1, trace2]
layout = go.Layout(title='Linear Regression Poly2 Predict future price for test dataset', xaxis=dict(title='Date'), yaxis=dict(title='Seoul apartment index'))

# Creating the figure and plotting it
fig = go.Figure(data=data, layout=layout)
fig.show()



2 PolynomialFeatures Mean Squared Error: 477.93997993894186
2 PolynomialFeatures Root Mean Squared Error: 21.861838439137316



> A degree-2 polynomial regression was tried, but it performed worse than linear regression (MSE 477.94 versus 21.65).

#### Degree 3

In [10]:
# Importing required libraries
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import plotly.graph_objs as go


X_train = df_train[train_columns]
y_train = df_train['future_index']

X_test = df_test[train_columns]
y_test = df_test['future_index']

# Creating polynomial features
poly = PolynomialFeatures(degree=3)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Creating a Polynomial Regression model
model = LinearRegression()

# Training the model on the training set
model.fit(X_train_poly, y_train)

# Making predictions on the testing set
y_pred = model.predict(X_test_poly)

# Evaluating the model using Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print('3 PolynomialFeatures Mean Squared Error:', mse)
print('3 PolynomialFeatures Root Mean Squared Error:', np.sqrt(mse))
print()

# Creating the traces
trace1 = go.Scatter(
    x = df_test['future_date'],
    y = y_test.values,
    mode = 'lines',
    name = 'actual_value'
)


trace2 = go.Scatter(
    x = df_test['future_date'],
    y = list(y_pred),
    mode = 'lines',
    name = 'predict_value'
)

# Combining the traces and creating the layout
data = [trace1, trace2]
layout = go.Layout(title='Linear Regression Poly3 Predict future price for test dataset', xaxis=dict(title='Date'), yaxis=dict(title='Seoul apartment index'))

# Creating the figure and plotting it
fig = go.Figure(data=data, layout=layout)
fig.show()



3 PolynomialFeatures Mean Squared Error: 37099.686023837174
3 PolynomialFeatures Root Mean Squared Error: 192.61278779934932



> A degree-3 polynomial regression was also tried, but its performance was far worse (MSE 37,099.69).
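
> One plausible reason for the degradation (an interpretation, not something this notebook tests) is how quickly the number of polynomial terms grows. With its default settings, `PolynomialFeatures` generates every monomial up to the given degree, including the constant term, so the count for `n` inputs is the binomial coefficient C(n + degree, degree). The helper name below is illustrative.

```python
from math import comb

n_features = 9  # length of train_columns

def n_polynomial_terms(n, degree):
    """Number of monomials of total degree <= `degree` in `n` variables,
    including the constant term."""
    return comb(n + degree, degree)

for degree in (1, 2, 3):
    print(degree, n_polynomial_terms(n_features, degree))
```

> The 9 inputs become 55 columns at degree 2 and 220 at degree 3. Many of these are products of already correlated features, and high-degree terms can extrapolate badly when the test period sits near or beyond the edge of the training range.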

### Gradient Boosting model

In [None]:
# Importing required libraries
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import plotly.graph_objs as go
from tqdm import tqdm

X_train = df_train[train_columns]
y_train = df_train['future_index']

X_test = df_test[train_columns]
y_test = df_test['future_index']

best_mse_score = np.inf
best_estimator = 0
best_learning = 0
best_depth = 0

estimator_list = np.arange(80, 301, 10)
learning_list = [0.001,0.005,0.01,0.05,0.1]

# Grid search over n_estimators, learning_rate, and max_depth, scored by MSE on the test set
for estimator in tqdm(estimator_list):
  for learning in learning_list:
    for depth in range(1, 6):

      # Creating a Gradient Boosting model
      model = GradientBoostingRegressor(n_estimators= estimator, learning_rate= learning, max_depth = depth, random_state=0)

      # Training the model on the training set
      model.fit(X_train, y_train)

      # Making predictions on the testing set
      y_pred = model.predict(X_test)

      # Evaluating the model using Mean Squared Error (MSE)
      mse = mean_squared_error(y_test , y_pred)

      if mse < best_mse_score:

        best_mse_score = mse
        best_estimator = estimator
        best_learning = learning
        best_depth =  depth


print()
print('GradientBoostingRegressor Mean Squared Error:', best_mse_score)
print('GradientBoostingRegressor Root Mean Squared Error:', np.sqrt(best_mse_score))
print()
print('n_estimators :', best_estimator,'learning_rate :',best_learning,'max_depth',  best_depth)
model = GradientBoostingRegressor(n_estimators= best_estimator, learning_rate=  best_learning, max_depth = best_depth, random_state=0)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)


# Creating the traces
trace1 = go.Scatter(
    x = df_test['future_date'],
    y = y_test.values,
    mode = 'lines',
    name = 'actual_value'
)


trace2 = go.Scatter(
    x = df_test['future_date'],
    y = list(y_pred),
    mode = 'lines',
    name = 'predict_value'
)

# Combining the traces and creating the layout
data = [trace1, trace2]
layout = go.Layout(title='Gradient Boosting Predict future price for test dataset', xaxis=dict(title='Date'), yaxis=dict(title='Seoul apartment index'))

# Creating the figure and plotting it
fig = go.Figure(data=data, layout=layout)
fig.show()


100%|██████████| 23/23 [20:43<00:00, 54.07s/it]



GradientBoostingRegressor Mean Squared Error: 1.1167184205462022
GradientBoostingRegressor Root Mean Squared Error: 1.0567489865366335

n_estimators : 230 learning_rate : 0.01 max_depth 5


> A for loop varies the hyperparameters to find the combination with the best performance (lowest MSE on the test set).

> With n_estimators=230, learning_rate=0.01, and max_depth=5, the MSE is 1.12, very close to the target.
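
> The three nested loops above walk a full grid. A minimal sketch of the same grid built with `itertools.product`, which makes its size explicit; `depth_list` and `param_grid` are illustrative names, and `range` stands in for `np.arange` with the same values.

```python
from itertools import product

estimator_list = range(80, 301, 10)   # same values as np.arange(80, 301, 10)
learning_list = [0.001, 0.005, 0.01, 0.05, 0.1]
depth_list = range(1, 6)

# One tuple per model that the nested loops fit
param_grid = list(product(estimator_list, learning_list, depth_list))

print(len(estimator_list), len(param_grid))  # 23 575
print((230, 0.01, 5) in param_grid)          # True
```

> With 575 fits in the 20:43 reported by the progress bar, each fit took roughly 2 seconds on average.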

### XGBoost model

In [None]:
# Importing required libraries
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import plotly.graph_objs as go
from tqdm import tqdm

X_train = df_train[train_columns]
y_train = df_train['future_index']

X_test = df_test[train_columns]
y_test = df_test['future_index']


best_mse_score = np.inf
best_estimator = 0
best_learning = 0
best_depth = 0

estimator_list = np.arange(80, 201, 10)
learning_list = [0.001,0.005,0.01,0.05,0.1]

# Grid search over n_estimators, learning_rate, and max_depth, scored by MSE on the test set
for estimator in tqdm(estimator_list):
  for learning in learning_list:
    for depth in range(1, 6):


      # Creating an XGBoost model
      model = xgb.XGBRegressor(objective ='reg:squarederror', n_estimators=estimator, learning_rate= learning, max_depth=depth, random_state=0)

      # Training the model on the training set
      model.fit(X_train, y_train)


      # Making predictions on the testing set
      y_pred = model.predict(X_test)


      # Evaluating the model using Mean Squared Error (MSE)
      mse = mean_squared_error(y_test , y_pred)

      if mse < best_mse_score:

        best_mse_score = mse
        best_estimator = estimator
        best_learning = learning
        best_depth =  depth
print()
print('XGBRegressor Mean Squared Error:', best_mse_score)
print('XGBRegressor Root Mean Squared Error:', np.sqrt(best_mse_score))

print()
print('n_estimators :', best_estimator,'learning_rate :',best_learning,'max_depth',  best_depth)
model = xgb.XGBRegressor(objective ='reg:squarederror', n_estimators= best_estimator, learning_rate=  best_learning, max_depth = best_depth, random_state=0)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)


# Creating the traces
trace1 = go.Scatter(
    x = df_test['future_date'],
    y = y_test.values,
    mode = 'lines',
    name = 'actual_value'
)


trace2 = go.Scatter(
    x = df_test['future_date'],
    y = list(y_pred),
    mode = 'lines',
    name = 'predict_value'
)

# Combining the traces and creating the layout
data = [trace1, trace2]
layout = go.Layout(title='XGBoost Predict future price for test dataset', xaxis=dict(title='Date'), yaxis=dict(title='Seoul apartment index'))

# Creating the figure and plotting it
fig = go.Figure(data=data, layout=layout)
fig.show()


100%|██████████| 13/13 [01:07<00:00,  5.16s/it]


XGBRegressor Mean Squared Error: 1.2220096120066088
XGBRegressor Root Mean Squared Error: 1.1054454360150974

n_estimators : 120 learning_rate : 0.1 max_depth 4





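> A for loop varies the hyperparameters to find the combination with the best performance (lowest MSE on the test set).

> With n_estimators=120, learning_rate=0.1, and max_depth=4, the MSE is 1.22, the best result for this model.

> Performance is slightly lower than the Gradient Boosting model, but the gap is small (about 0.1 in MSE) and training is much faster: the grid search took about 1 minute, versus about 21 minutes for Gradient Boosting over a larger grid (23 versus 13 n_estimators values).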

### RandomForest Regressor model



In [None]:
# Importing required libraries
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import plotly.graph_objs as go
from tqdm import tqdm

X_train = df_train[train_columns]
y_train = df_train['future_index']

X_test = df_test[train_columns]
y_test = df_test['future_index']

best_mse_score = np.inf
best_estimator = 0
estimator_list = np.arange(80, 201, 10)

for estimator in tqdm(estimator_list):
  # Creating a Random Forest Regressor model
  model = RandomForestRegressor(n_estimators= estimator, random_state=0)

  # Training the model on the training set
  model.fit(X_train, y_train)

  # Making predictions on the testing set
  y_pred = model.predict(X_test)

  # Evaluating the model using Mean Squared Error (MSE)
  mse = mean_squared_error(y_test , y_pred)

  if mse < best_mse_score:
      best_mse_score = mse
      best_estimator = estimator

print()
print('RandomForestRegressor Mean Squared Error:', best_mse_score)
print('RandomForestRegressor Root Mean Squared Error:', np.sqrt(best_mse_score))
print()

print('n_estimators :', best_estimator)

# Refit with the best n_estimators found above (not the last value tried in the loop)
model = RandomForestRegressor(n_estimators=best_estimator, random_state=0)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)



# Creating the traces
trace1 = go.Scatter(
    x = df_test['future_date'],
    y = y_test.values,
    mode = 'lines',
    name = 'actual_value'
)


trace2 = go.Scatter(
    x = df_test['future_date'],
    y = list(y_pred),
    mode = 'lines',
    name = 'predict_value'
)


# Combining the traces and creating the layout
data = [trace1, trace2]
layout = go.Layout(title='RandomForest Regressor Predict future price for test dataset', xaxis=dict(title='Date'), yaxis=dict(title='Seoul apartment index'))

# Creating the figure and plotting it
fig = go.Figure(data=data, layout=layout)
fig.show()


100%|██████████| 13/13 [00:53<00:00,  4.08s/it]



RandomForestRegressor Mean Squared Error: 30.085491006472935
RandomForestRegressor Root Mean Squared Error: 5.48502424848541

n_estimators : 190


> A for loop varies n_estimators to find the value with the best performance (lowest MSE on the test set).

> With n_estimators=190, the MSE is 30.09, the best result for this model.

> This is worse than Gradient Boosting, XGBoost, and even Linear Regression.

## Rescaling the data range

In [None]:
df_train_test.describe()

Unnamed: 0,year,area_deal,area_full_rent,area_year_rent,deal_year_rent_multiple,12m_before_area_deal_mean,12m_before_deal_full_rent_rate,kospi_index,korea_rp,last_month_total_month_rent_count,future_index
count,4261.0,4261.0,4261.0,4261.0,4261.0,4261.0,4261.0,4261.0,4261.0,4261.0,3896.0
mean,2017.496362,732.685425,459.80249,24.162024,29.78093,5.907815,1.053344,2247.941908,1.81049,4948.403661,72.756907
std,3.375665,210.211655,99.731697,3.097236,4.954143,8.543085,6.849267,354.375349,0.864054,2094.932302,15.602975
min,2012.0,508.184509,304.899719,20.161331,24.165903,-14.94167,-11.132855,1457.64,0.5,2038.0,55.36
25%,2015.0,536.843872,368.355988,21.448662,24.884462,-0.036408,-4.619988,1986.62,1.25,3585.0,58.7875
50%,2017.0,657.146118,460.357452,23.608965,27.786295,6.869967,0.016122,2100.2,1.5,4277.0,68.335
75%,2020.0,942.029785,560.719177,26.345444,34.550373,12.578342,6.237414,2439.9,2.5,5842.0,89.09
max,2023.0,1090.708984,622.134399,30.486279,38.265236,21.542885,21.222189,3305.21,3.5,12374.0,100.64


In [None]:
from sklearn.preprocessing import MinMaxScaler

# Create the MinMaxScaler object
scaler = MinMaxScaler(feature_range=(50, 100))

# Columns to scale
columns_to_scale = ['area_deal', 'area_full_rent', 'area_year_rent', 'deal_year_rent_multiple', '12m_before_area_deal_mean', '12m_before_deal_full_rent_rate', 'kospi_index', 'korea_rp', 'last_month_total_month_rent_count']

# Apply min-max scaling to the selected columns
df_train_test[columns_to_scale] = scaler.fit_transform(df_train_test[columns_to_scale].values)

# Check the result
display(df_train_test[columns_to_scale].describe())
display(df_train_test)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



Unnamed: 0,area_deal,area_full_rent,area_year_rent,deal_year_rent_multiple,12m_before_area_deal_mean,12m_before_deal_full_rent_rate,kospi_index,korea_rp,last_month_total_month_rent_count
count,4261.0,4261.0,4261.0,4261.0,4261.0,4261.0,4261.0,4261.0,4261.0
mean,69.269651,74.414539,69.373917,69.912373,78.573029,68.831993,71.387604,71.841508,64.078965
std,18.043161,15.718914,14.998797,17.568713,11.70781,10.584544,9.590309,14.400901,10.134154
min,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0
25%,52.459928,60.001471,56.234078,52.548203,70.426812,60.064686,64.31556,62.5,57.483553
50%,62.785867,74.502008,66.695648,62.838876,79.891604,67.229118,67.389328,66.666667,60.831076
75%,87.238373,90.320223,79.947429,86.826104,87.714606,76.843217,76.582484,83.333333,68.401703
max,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0


Unnamed: 0,year,area_deal,area_full_rent,area_year_rent,deal_year_rent_multiple,12m_before_area_deal_mean,12m_before_deal_full_rent_rate,date,kospi_index,korea_rp,last_month_total_month_rent_count,future_date,future_index
0,2012,53.432096,50.651924,51.822954,58.954066,51.400743,98.239411,2012-03-01,65.496301,95.833333,52.902477,2013-03-01,56.00
1,2012,53.415075,50.604965,51.751297,58.988041,51.358067,98.127594,2012-03-02,65.614835,95.833333,52.902477,2013-03-02,56.00
2,2012,53.404403,50.762592,51.804259,58.916140,51.331311,98.770571,2012-03-03,65.614835,95.833333,52.902477,2013-03-03,55.99
3,2012,53.419680,50.801057,51.807187,58.944089,51.369613,98.856712,2012-03-04,65.614835,95.833333,52.902477,2013-03-04,55.99
4,2012,53.397650,50.715983,51.711746,58.990646,51.314380,98.619604,2012-03-05,65.112283,95.833333,52.902477,2013-03-05,55.98
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4256,2023,94.759883,95.950706,99.590809,84.407566,64.114870,68.630732,2023-10-26,72.771532,100.000000,75.657895,NaT,
4257,2023,94.749699,95.918403,99.716612,84.291199,64.099811,68.595103,2023-10-27,72.872476,100.000000,75.657895,NaT,
4258,2023,94.785606,95.966964,99.899266,84.191349,64.152906,68.612307,2023-10-28,72.872476,100.000000,75.657895,NaT,
4259,2023,94.786297,95.979758,99.894084,84.196490,64.153929,68.632314,2023-10-29,72.872476,100.000000,75.657895,NaT,
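
> The scaled values can be reproduced by hand. `MinMaxScaler(feature_range=(50, 100))` maps each column linearly so that its minimum becomes 50 and its maximum becomes 100. A minimal sketch using the first row and the min/max from the `describe()` output above; the function name is illustrative.

```python
def min_max_scale(x, x_min, x_max, low=50.0, high=100.0):
    """Map x from [x_min, x_max] linearly onto [low, high]."""
    return low + (x - x_min) / (x_max - x_min) * (high - low)

# kospi_index, first row: raw value 2030.25, column min 1457.64, max 3305.21
kospi_scaled = min_max_scale(2030.25, 1457.64, 3305.21)

# korea_rp, first row: raw value 3.25, column min 0.5, max 3.5
rate_scaled = min_max_scale(3.25, 0.5, 3.5)

print(round(kospi_scaled, 6), round(rate_scaled, 6))
```

> These agree with the scaled first-row values in the table above (65.496301 for kospi_index and 95.833333 for korea_rp).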


In [None]:
# Set the input features used for training
train_columns = list(df_train_test.columns)

to_remove = ['future_index','date','year','future_date']
for x in to_remove:
    train_columns.remove(x)
train_columns


['area_deal',
 'area_full_rent',
 'area_year_rent',
 'deal_year_rent_multiple',
 '12m_before_area_deal_mean',
 '12m_before_deal_full_rent_rate',
 'kospi_index',
 'korea_rp',
 'last_month_total_month_rent_count']

In [None]:
df_train = df_train_test.loc[df_train_test['year']<2022, :].copy()
df_test = df_train_test.loc[(df_train_test['year']>=2022)&(df_train_test['future_index'].notnull()), :].copy()

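> The rescaled dataset was applied to each model, but performance did not improve. This is plausible, since least-squares regression with an intercept and tree-based models are essentially unaffected by a linear rescaling of each feature.

> Rescaling the data does not automatically improve performance.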

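## Model blending (ensemble)

- Judging visually, the models rank Gradient Boosting, XGBoost, Linear Regression from best to worst
- To build a better model, the Gradient Boosting and Linear Regression models are combined into an ensemble

#### Trial ensemble

- To check whether an ensemble actually improves performance, a single ensemble with arbitrarily chosen weights is built first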

In [None]:
# Importing required libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
import xgboost as xgb
from sklearn.ensemble import VotingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import plotly.graph_objs as go

X_train = df_train[train_columns]
y_train = df_train['future_index']

X_test = df_test[train_columns]
y_test = df_test['future_index']

# Creating a Linear Regression model
lr_model = LinearRegression()

# Creating a Gradient Boosting model
gb_model = GradientBoostingRegressor(n_estimators=230, learning_rate=0.01, max_depth=5, random_state=0)

# Creating a XGBoosting model
xgb_model = xgb.XGBRegressor(objective ='reg:squarederror', n_estimators= 120, learning_rate=  0.1, max_depth = 4, random_state=0)

# Build the ensemble with VotingRegressor (weights: 0.2 linear regression, 0.8 gradient boosting)
ensemble_model_1 = VotingRegressor([('lr', lr_model), ('gb', gb_model)], weights=[0.2, 0.8])




# Training the model on the training set
lr_model.fit(X_train, y_train)
lr_predict = lr_model.predict(X_test)
lr_mse = mean_squared_error(y_test ,lr_predict)

# Training the model on the training set
gb_model.fit(X_train, y_train)
gb_predict = gb_model.predict(X_test)
gb_mse = mean_squared_error(y_test ,gb_predict)

# Training the model on the training set
xgb_model.fit(X_train, y_train)
xgb_predict = xgb_model.predict(X_test)
xgb_mse = mean_squared_error(y_test ,xgb_predict)

# Training the model on the training set
ensemble_model_1.fit(X_train, y_train)
ensemble_1_predict = ensemble_model_1.predict(X_test)
ensemble_1_mse = mean_squared_error(y_test ,ensemble_1_predict)


# Creating the traces
trace1 = go.Scatter(
    x = df_test['future_date'],
    y = y_test.values,
    mode = 'lines',
    name = 'actual_value'
)


trace2 = go.Scatter(
    x = df_test['future_date'],
    y = list(lr_predict),
    mode = 'lines',
    name = 'lr_model predict_value'
)

trace3 = go.Scatter(
    x = df_test['future_date'],
    y = list(gb_predict),
    mode = 'lines',
    name = 'gb_model predict_value'
)

trace4 = go.Scatter(
    x = df_test['future_date'],
    y = list(xgb_predict),
    mode = 'lines',
    name = 'xgb_model predict_value'
)

trace5 = go.Scatter(
    x = df_test['future_date'],
    y = list(ensemble_1_predict),
    mode = 'lines',
    name = 'ensemble_model_1 predict_value'
)





# Combining the traces and creating the layout
data = [trace1, trace2, trace3, trace4,trace5]
layout = go.Layout(title='Various Models Predict future price for test dataset', xaxis=dict(title='Date'), yaxis=dict(title='Apartment index'))

# Creating the figure and plotting it
fig = go.Figure(data=data, layout=layout)
fig.show()

print('LinearRegression Mean Squared Error:', lr_mse)
print('GradientBoostingRegressor Mean Squared Error:', gb_mse)
print('XGBRegressor Mean Squared Error:', xgb_mse)
print('Ensemble 1 Mean Squared Error:', ensemble_1_mse)

LinearRegression Mean Squared Error: 21.649886770747557
GradientBoostingRegressor Mean Squared Error: 1.1167184205462022
XGBRegressor Mean Squared Error: 1.2220096120066088
Ensemble 1 Mean Squared Error: 0.5584074793365612


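> ensemble_1 combines the Linear Regression and Gradient Boosting models with weights 2:8.

> Previously the Gradient Boosting model was the best with an MSE of 1.12, but this arbitrarily weighted ensemble reaches 0.56, a clear improvement.

> A for loop is needed to try many weight combinations.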
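
> A weighted `VotingRegressor` predicts the weighted average of its base models' predictions. The sketch below uses made-up numbers (not values from this notebook) to show how a 2:8 blend can beat both members when their errors point in opposite directions. The function and variable names are illustrative.

```python
def weighted_average(predictions, weights):
    """Weighted mean of per-model predictions for a single sample."""
    total = sum(weights)
    return sum(p * w for p, w in zip(predictions, weights)) / total

actual = 91.0                  # made-up target value
lr_pred, gb_pred = 94.0, 90.4  # made-up predictions: one too high, one too low

blend = weighted_average([lr_pred, gb_pred], [0.2, 0.8])

# Absolute error of each model and of the blend
print(abs(lr_pred - actual), abs(gb_pred - actual), abs(blend - actual))
```

> Whether the errors actually offset each other on the test period is an empirical question, which is what the MSE comparison above measures.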

#### Building the best ensemble model

- A combination of the three models is needed; this part enumerates the candidate weights for the three models

In [10]:
from sklearn.ensemble import VotingRegressor
from itertools import product



# List to store results
weights_list = []

# Define weight increments (10% increments)
weights = [i / 10.0 for i in range(11)]

# Generate all possible combinations of weights
weight_combinations = product(weights, repeat=3)

# Iterate through all combinations
for w_combination in weight_combinations:
    # Keep only combinations whose weights sum to 1.
    # Note: this is an exact floating-point comparison, so a combination whose
    # float sum rounds to 0.9999999999999999 is skipped.
    if sum(w_combination) == 1.0:
        weights_list.append(w_combination)

# Display the results
for weights in weights_list:
    print(weights)
    print()


(0.0, 0.0, 1.0)

(0.0, 0.1, 0.9)

(0.0, 0.2, 0.8)

(0.0, 0.3, 0.7)

(0.0, 0.4, 0.6)

(0.0, 0.5, 0.5)

(0.0, 0.6, 0.4)

(0.0, 0.7, 0.3)

(0.0, 0.8, 0.2)

(0.0, 0.9, 0.1)

(0.0, 1.0, 0.0)

(0.1, 0.0, 0.9)

(0.1, 0.1, 0.8)

(0.1, 0.2, 0.7)

(0.1, 0.3, 0.6)

(0.1, 0.4, 0.5)

(0.1, 0.5, 0.4)

(0.1, 0.6, 0.3)

(0.1, 0.7, 0.2)

(0.1, 0.8, 0.1)

(0.1, 0.9, 0.0)

(0.2, 0.0, 0.8)

(0.2, 0.1, 0.7)

(0.2, 0.2, 0.6)

(0.2, 0.3, 0.5)

(0.2, 0.4, 0.4)

(0.2, 0.5, 0.3)

(0.2, 0.6, 0.2)

(0.2, 0.8, 0.0)

(0.3, 0.0, 0.7)

(0.3, 0.1, 0.6)

(0.3, 0.2, 0.5)

(0.3, 0.3, 0.4)

(0.3, 0.4, 0.3)

(0.3, 0.5, 0.2)

(0.3, 0.7, 0.0)

(0.4, 0.0, 0.6)

(0.4, 0.1, 0.5)

(0.4, 0.2, 0.4)

(0.4, 0.3, 0.3)

(0.4, 0.4, 0.2)

(0.4, 0.5, 0.1)

(0.4, 0.6, 0.0)

(0.5, 0.0, 0.5)

(0.5, 0.1, 0.4)

(0.5, 0.2, 0.3)

(0.5, 0.3, 0.2)

(0.5, 0.4, 0.1)

(0.5, 0.5, 0.0)

(0.6, 0.0, 0.4)

(0.6, 0.1, 0.3)

(0.6, 0.2, 0.2)

(0.6, 0.4, 0.0)

(0.7, 0.0, 0.3)

(0.7, 0.1, 0.2)

(0.7, 0.3, 0.0)

(0.8, 0.0, 0.2)

(0.8, 0.1, 0.1)

(0.8, 0.2, 0.0
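
> The listing above skips some valid combinations, for example (0.2, 0.7, 0.1), and the search in the next cell iterates over 62 items, whereas there are 66 triples of tenths that sum to 1. That is consistent with the exact float comparison `sum(w_combination) == 1.0` rejecting sums that round to just below 1. A minimal sketch of one way to avoid this, by doing the sum in integer tenths; `exact_weights_list` is an illustrative name, and the results reported below were produced with the original 62 combinations.

```python
from itertools import product

# Work in integer tenths (0..10) so the "sums to 1" test is exact
tenths = range(11)
exact_weights_list = [
    (a / 10, b / 10, c / 10)
    for a, b, c in product(tenths, repeat=3)
    if a + b + c == 10
]

print(len(exact_weights_list))                  # 66
print((0.2, 0.7, 0.1) in exact_weights_list)    # True
```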

In [11]:
# Importing required libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
import xgboost as xgb
from sklearn.ensemble import VotingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import plotly.graph_objs as go
from tqdm import tqdm
from sklearn.metrics import mean_absolute_error, r2_score

X_train = df_train[train_columns]
y_train = df_train['future_index']

X_test = df_test[train_columns]
y_test = df_test['future_index']

# Creating a Linear Regression model
lr_model = LinearRegression()

# Creating a Gradient Boosting model
gb_model = GradientBoostingRegressor(n_estimators=230, learning_rate=0.01, max_depth=5, random_state=0)

# Creating a XGBoosting model
xgb_model = xgb.XGBRegressor(objective ='reg:squarederror', n_estimators= 120, learning_rate=  0.1, max_depth = 4, random_state=0)


best_mse_score = np.inf
best_weights = 0

for weights in tqdm(weights_list):
  # Build the ensemble with VotingRegressor using the current weight combination
  ensemble_model = VotingRegressor(estimators=[('lr', lr_model),('gb', gb_model),('xgb', xgb_model)], weights= weights)
  # Training the model on the training set
  ensemble_model.fit(X_train, y_train)
  ensemble_predict = ensemble_model.predict(X_test)
  ensemble_mse = mean_squared_error(y_test ,ensemble_predict)

  if ensemble_mse < best_mse_score:
    best_mse_score = ensemble_mse
    best_weights = weights


# Training the model on the training set
gb_model.fit(X_train, y_train)
gb_predict = gb_model.predict(X_test)
gb_mse = mean_squared_error(y_test ,gb_predict)
print('GradientBoostingRegressor Mean Squared Error:', gb_mse)

print()

best_ensemble_model = VotingRegressor(estimators=[('lr', lr_model),('gb', gb_model),('xgb', xgb_model)], weights=  best_weights)
best_ensemble_model.fit(X_train, y_train)
best_ensemble_predict = best_ensemble_model.predict(X_test)

best_ensemble_mse = mean_squared_error(y_test ,best_ensemble_predict)
best_ensemble_mae = mean_absolute_error(y_test, best_ensemble_predict)
best_ensemble_r2 = r2_score(y_test, best_ensemble_predict)
best_ensemble_corr = np.corrcoef(y_test, best_ensemble_predict)[0, 1]

print('best Weights :', best_weights)
print('Best Ensemble Mean Squared Error:', best_ensemble_mse)
print('Best Ensemble Mean Absolute Error:', best_ensemble_mae)
print('Best Ensemble R-squared:', best_ensemble_r2)
print('Best Ensemble Pearson Correlation Coefficient:', best_ensemble_corr)
print()





# Creating the traces
trace1 = go.Scatter(
    x = df_test['future_date'],
    y = y_test.values,
    mode = 'lines',
    name = 'actual_value'
)

trace2 = go.Scatter(
    x = df_test['future_date'],
    y = list(gb_predict),
    mode = 'lines',
    name = 'gb_model predict_value'
)


trace3 = go.Scatter(
    x = df_test['future_date'],
    y = list(best_ensemble_predict),
    mode = 'lines',
    name = 'best_ensemble_model predict_value'
)







# Combining the traces and creating the layout
data = [trace1, trace2, trace3]
layout = go.Layout(title='Final Models Predict future price for total dataset', xaxis=dict(title='Date'), yaxis=dict(title='Apartment price index'))

# Creating the figure and plotting it
fig = go.Figure(data=data, layout=layout)
fig.show()


100%|██████████| 62/62 [04:30<00:00,  4.36s/it]


GradientBoostingRegressor Mean Squared Error: 1.1167184205462022

best Weights : (0.1, 0.5, 0.4)
Best Ensemble Mean Squared Error: 0.44679675735094954
Best Ensemble Mean Absolute Error: 0.5561022859696734
Best Ensemble R-squared: 0.8012102664346892
Best Ensemble Pearson Correlation Coefficient: 0.9169610372958904



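> From the Linear Regression, Gradient Boosting, and XGBoost candidates, the best-performing configuration of each model was selected, and the three were combined with weights of 1:5:4.

> The resulting ensemble reaches a Mean Squared Error of about 0.45, a clear improvement over the 1.12 of the earlier standalone Gradient Boosting model.

> The predicted values range from 90.31 to 95.71, and with an MSE of about 0.45 the model is judged to perform well (the target was an MSE of 1 or lower).

> MAE is about 0.56, R-squared about 0.80, and the Pearson correlation coefficient about 0.92, all of which indicate good performance.

> Reference: https://medium.com/analytics-vidhya/mae-mse-rmse-coefficient-of-determination-adjusted-r-squared-which-metric-is-better-cd0326a5697e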
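
To read the MSE in the same units as the index, it can help to take its square root and set it against the spread of the predicted values. The sketch below only does that arithmetic with the numbers printed above. Comparing RMSE with the range is an illustrative choice made here, not a criterion the notebook defines.

```python
import math

# Values copied from the printed output above
mse = 0.44679675735094954           # best ensemble MSE
pred_min, pred_max = 90.31, 95.71   # min / max of the predicted values

rmse = math.sqrt(mse)               # typical error, in index points
value_range = pred_max - pred_min   # spread of the predictions
rmse_share = rmse / value_range     # RMSE as a fraction of that spread (illustrative ratio)

print(f"RMSE: {rmse:.3f} index points")
print(f"Range of predictions: {value_range:.2f}")
print(f"RMSE / range: {rmse_share:.1%}")
```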

## Predicting the future apartment price index

- The results so far only cover predictions up to October 30, 2023, so predictions beyond that date are derived by applying the trained model.

In [None]:
df_test = df_train_test.loc[(df_train_test['year']>=2022), :].copy()
df_test['future_date'] = df_test['date'] + pd.Timedelta(days=365)
df_test

Unnamed: 0,year,area_deal,area_full_rent,area_year_rent,deal_year_rent_multiple,12m_before_area_deal_mean,12m_before_deal_full_rent_rate,date,kospi_index,korea_rp,last_month_total_month_rent_count,future_date,future_index
3593,2022,1083.536621,606.893738,28.841393,37.568806,13.972645,-5.584571,2022-01-01,2977.65,1.0,10052.0,2023-01-01,95.71
3594,2022,1083.647461,606.853333,28.839933,37.574547,13.984303,-5.600518,2022-01-02,2977.65,1.0,10052.0,2023-01-02,95.67
3595,2022,1083.479614,607.364014,28.851576,37.553570,13.966648,-5.506430,2022-01-03,2988.77,1.0,10052.0,2023-01-03,95.62
3596,2022,1083.489624,606.642090,28.877390,37.520344,13.967700,-5.619623,2022-01-04,2989.24,1.0,10052.0,2023-01-04,95.56
3597,2022,1083.651611,607.107971,28.828796,37.589207,13.984739,-5.561261,2022-01-05,2953.97,1.0,10052.0,2023-01-05,95.51
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4256,2023,1029.659058,596.442871,30.401781,33.868378,-4.642175,0.923108,2023-10-26,2299.08,3.5,7342.0,2024-10-25,
4257,2023,1029.540405,596.237915,30.427759,33.835564,-4.653163,0.900052,2023-10-27,2302.81,3.5,7342.0,2024-10-26,
4258,2023,1029.958740,596.546021,30.465477,33.807407,-4.614421,0.911185,2023-10-28,2302.81,3.5,7342.0,2024-10-27,
4259,2023,1029.966797,596.627197,30.464407,33.808857,-4.613675,0.924131,2023-10-29,2302.81,3.5,7342.0,2024-10-28,


In [None]:
# Importing required libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
import xgboost as xgb
from sklearn.ensemble import VotingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import plotly.graph_objs as go
from tqdm import tqdm

X_train = df_train[train_columns]
y_train = df_train['future_index']

X_test = df_test[train_columns]

# Creating a Linear Regression model
lr_model = LinearRegression()

# Creating a Gradient Boosting model
gb_model = GradientBoostingRegressor(n_estimators=230, learning_rate=0.01, max_depth=5, random_state=0)

# Creating an XGBoost model
xgb_model = xgb.XGBRegressor(objective ='reg:squarederror', n_estimators= 120, learning_rate=  0.1, max_depth = 4, random_state=0)




best_ensemble_model = VotingRegressor(estimators=[('lr', lr_model),('gb', gb_model),('xgb', xgb_model)], weights=  (0.1, 0.5, 0.4))
best_ensemble_model.fit(X_train, y_train)
best_ensemble_predict = best_ensemble_model.predict(X_test)




# Creating the traces
trace1 = go.Scatter(
    x = df_test['future_date'],
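    y = df_test['future_index'].values,  # actual index from the df_test defined above; NaN where the future value is not yet known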
    mode = 'lines',
    name = 'actual_value'
)


trace2 = go.Scatter(
    x = df_test['future_date'],
    y = list(best_ensemble_predict),
    mode = 'lines',
    name = 'best_ensemble_model predict_value'
)





# Combining the traces and creating the layout
data = [trace1, trace2]
layout = go.Layout(title='Final Models Predict future price for total dataset', xaxis=dict(title='Date'), yaxis=dict(title='Apartment price index'))

# Creating the figure and plotting it
fig = go.Figure(data=data, layout=layout)
fig.show()


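> The final ensemble model is used to predict the 'Seoul apartment sale price index'.

> The blue line is the 'Seoul apartment sale price index' published to date; the red line is the model's prediction for both the period to date and the period ahead.

> For inputs up to October 30, 2022, the actual value one year later (up to October 30, 2023) is available. Inputs after October 30, 2022 yield the predicted 'Seoul apartment sale price index' for dates after October 30, 2023.

> The model expects the downturn in Seoul apartment real estate to continue through October 2024.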

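# Conclusion

- The final ensemble model shows the highest correlation with the actual values, but the plot contains stretches that move in an almost straight line, which raises doubts about how reliable it is.
- The model may have been overfitted in the effort to raise test performance, so its performance needs to be verified through ongoing monitoring.
- No step was taken to filter out apartments that existed in the past but no longer exist, so the data may contain errors.

- The exact numbers differ from model to model, but when the models that track the actual data closely are applied, Seoul apartment prices as a whole are expected to hold steady or fall through the end of 2023, so buying an apartment now is not recommended.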
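
One way to reduce the overfitting risk noted above is to choose the ensemble weights on a validation slice held out from the training period and keep the test set for a single final evaluation. The following is a minimal sketch with NumPy and synthetic predictions. The arrays, the `candidate_weights` grid, and the `pick_weights` helper are hypothetical stand-ins, not the notebook's data or code.

```python
import itertools
import numpy as np

def pick_weights(base_preds, y_true, candidate_weights):
    """Return the weight tuple with the lowest MSE on the given (validation) data.

    base_preds has shape (n_models, n_samples).
    """
    best_w, best_mse = None, np.inf
    for w in candidate_weights:
        blended = np.average(base_preds, axis=0, weights=w)  # weighted mean per sample
        mse = float(np.mean((y_true - blended) ** 2))
        if mse < best_mse:
            best_w, best_mse = w, mse
    return best_w, best_mse

# Hypothetical grid: weights in steps of 0.1 that sum to 1
steps = [i / 10 for i in range(11)]
candidate_weights = [w for w in itertools.product(steps, repeat=3)
                     if abs(sum(w) - 1.0) < 1e-9]

# Synthetic validation data standing in for three base models' predictions
rng = np.random.default_rng(0)
y_val = np.linspace(90, 96, 200)
base_preds_val = np.vstack([
    y_val + rng.normal(0, 1.0, y_val.size),  # stand-in for linear regression
    y_val + rng.normal(0, 0.5, y_val.size),  # stand-in for gradient boosting
    y_val + rng.normal(0, 0.6, y_val.size),  # stand-in for XGBoost
])

weights, val_mse = pick_weights(base_preds_val, y_val, candidate_weights)
print("weights chosen on validation data:", weights)
print("validation MSE:", round(val_mse, 4))
```

With weights fixed this way, the test-set MSE is computed once and is not part of the selection loop.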

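# Caveats & Areas for Improvement

## 1. Caveats

- Some datasets fill missing values with '' instead of null, so check for blank entries before working with the data.
- A pandas object column is not guaranteed to hold only strings, so cast it with .astype(str) before calling string methods such as .str.replace().
- When computing across columns, check the operands for 0 or null values before running the operation.
- Data types can be converted to smaller ones to reduce memory usage.
- After merging or modifying values, check whether null or inf values have appeared.
- stack() can skip null values, so replace them beforehand if dropping them would change the intended calculation.
- In practice, having many rows tends to put more pressure on memory in pandas than having many columns.
- The memory usage reported by info() can differ from the one reported by memory_usage(deep=True). (See https://pythonspeed.com/articles/pandas-dataframe-series-memory-usage/)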
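
Several of the checks listed above can be sketched on a toy frame. The column names and values below are hypothetical and chosen only to trigger each case; they are not taken from the project's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (illustrative names and values)
df = pd.DataFrame({
    "price": ["1,000", "", "2,500", "3,000"],  # numbers stored as text, one blank
    "area": [50.0, 40.0, 0.0, 60.0],           # contains a zero divisor
})

# 1) Treat empty strings as missing values
df["price"] = df["price"].replace("", np.nan)

# 2) Cast to str before string methods, then convert to numbers
df["price"] = pd.to_numeric(
    df["price"].astype(str).str.replace(",", "", regex=False),
    errors="coerce",  # anything that is not a number becomes NaN
)

# 3) Division by zero yields inf, so count inf and null values afterwards
df["price_per_area"] = df["price"] / df["area"]
n_inf = int(np.isinf(df["price_per_area"]).sum())
n_null = int(df["price_per_area"].isna().sum())
print("inf:", n_inf, "null:", n_null)

# 4) Downcasting a numeric column reduces its memory footprint
area32 = pd.to_numeric(df["area"], downcast="float")
print(df["area"].dtype, "->", area32.dtype)

# 5) Shallow vs deep memory usage differ for object columns
text = pd.Series(["a" * 100] * 1000, dtype=object)
shallow = text.memory_usage(deep=False)
deep = text.memory_usage(deep=True)
print("shallow bytes:", shallow, "deep bytes:", deep)
```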


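## 2. Areas for Improvement

### 2-1. Before starting the project

- Variable names, the terms used during the project, and the agreed concepts should be fixed up front so that the project can proceed without confusion.
- Designing the structure (schema) of the final table and of the intermediate tables in advance makes later preprocessing and data generation more efficient -> need to study design methods.
- Need to get better at judging which kind of visualization suits which situation.
- Which visualization to use for which metric should be planned in advance.
- Need to compare and study the differences between working with csv files, pkl files, and a mysql database.
- When using machine learning, decide in advance which model to apply to which kind of problem (classification, regression, etc.) before starting the project.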
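
One concrete difference between csv and pkl files is how column types survive a round trip. The sketch below uses a hypothetical two-row frame and temporary files; it does not cover a MySQL database, which would need a running server.

```python
import os
import tempfile
import pandas as pd

# Hypothetical toy frame with a datetime column
df = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-01", "2022-01-02"]),
    "index_value": [95.71, 95.67],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "sample.csv")
    pkl_path = os.path.join(tmp, "sample.pkl")

    df.to_csv(csv_path, index=False)  # index=False avoids an extra "Unnamed: 0" column on reload
    df.to_pickle(pkl_path)

    from_csv = pd.read_csv(csv_path)      # dates come back as plain text
    from_pkl = pd.read_pickle(pkl_path)   # dtypes are preserved

print(from_csv.dtypes)
print(from_pkl.dtypes)
```

Passing `parse_dates=["date"]` to `pd.read_csv` restores the datetime type for CSV. Pickle files keep types without extra arguments but should only be loaded from trusted sources.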

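### 2-2. During the project

- When a hypothesis turns out to be false, the ability to work out why, verify it, and revise it is needed.
- Writing titles, subtitles, and explanations as the work proceeds makes it easier to organize everything later.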




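#### 2-2-1. Preprocessing

- When preprocessing data with pandas, the ability to write memory-efficient code (for example, by wrapping steps in functions) is needed. Splitting a run into several batches because of insufficient memory is cumbersome and can produce results that differ from the intended ones.
- Need the ability to use Python's language features to write more efficient functions.
- Need to study outlier-removal methods suited to each situation.
- Need to study missing-value imputation methods suited to each situation.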
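
As a starting point for the last two items, the sketch below shows one common outlier rule (1.5 × IQR) and two common ways to fill gaps in a time series. The series is hypothetical, and the appropriate rule depends on the data, so this is an illustration rather than the method used in this project.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with one gap and one extreme value
s = pd.Series(
    [100.0, 101.0, np.nan, 103.0, 500.0, 104.0, 105.0],
    index=pd.date_range("2022-01-01", periods=7, freq="D"),
)

# Outliers: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
is_outlier = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
cleaned = s.mask(is_outlier)  # outliers become NaN instead of being dropped

# Missing values: two options for time-indexed data
filled_ffill = cleaned.ffill()                       # carry the last known value forward
filled_interp = cleaned.interpolate(method="time")   # linear in time between neighbours

print(pd.DataFrame({"raw": s, "ffill": filled_ffill, "interp": filled_interp}))
```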


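#### 2-2-2. Analysis

- Plots are an effective way to grasp the performance of a machine learning model visually.
- A model may perform well on the test dataset and still be off for actual future values.
- Understanding how each model works is needed to know which model fits which situation and to make parameter tuning easier.
- Need to learn whether hyperparameters can simply be adjusted by trial or whether they should be adjusted against explicit criteria.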
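
One way to tune hyperparameters against an explicit criterion is a grid search scored by cross-validated error, with folds that respect time order. The sketch below uses scikit-learn's `GridSearchCV` and `TimeSeriesSplit` on synthetic data. The grid values and the data are hypothetical and are not the settings used in this notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic, time-ordered data standing in for the notebook's features and target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -0.5, 0.2]) + rng.normal(scale=0.1, size=200)

# Hypothetical small grid; sensible ranges depend on the data
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),    # each fold validates on rows after its training rows
    scoring="neg_mean_squared_error",  # the explicit criterion being optimised
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated MSE:", -search.best_score_)
```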

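- In a regression model, how the acceptable error range and the evaluation method are set appears to have a large effect on the model's measured performance.
- For regression models, one needs to know what value of each evaluation metric counts as good performance.
- It may be more appropriate to use a regression model to read the trend than to try to obtain exact values.
- Two series can appear to move together on a line chart and still show a low value from corr(). -> need to research and study other metrics that capture how similarly trends move.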
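
Two simple candidates for such a trend-oriented metric are the share of steps on which both series move in the same direction, and the correlation of their step-to-step changes. The sketch below is a minimal illustration with NumPy. The helper names and the toy series are hypothetical, and these are only two of many possible measures.

```python
import numpy as np

def directional_accuracy(actual, predicted):
    """Share of steps where both series move in the same direction."""
    actual_dir = np.sign(np.diff(actual))
    predicted_dir = np.sign(np.diff(predicted))
    return float(np.mean(actual_dir == predicted_dir))

def diff_correlation(actual, predicted):
    """Pearson correlation of the step-to-step changes."""
    return float(np.corrcoef(np.diff(actual), np.diff(predicted))[0, 1])

# Hypothetical toy series: the prediction follows every turn but drifts upward
actual = np.array([100.0, 101.0, 100.0, 101.0, 100.0, 101.0])
predicted = np.array([100.0, 100.2, 100.1, 100.3, 100.2, 100.4])

level_corr = float(np.corrcoef(actual, predicted)[0, 1])
print("level correlation:", round(level_corr, 3))
print("directional accuracy:", directional_accuracy(actual, predicted))
print("correlation of changes:", round(diff_correlation(actual, predicted), 3))
```

In this toy case the level correlation is pulled down by the drift, while both trend-oriented measures still register that every move has the same direction.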





- Beyond machine learning models, need to study ways of gaining insight through data visualization and through statistical methods.

### 2-3. After the project

- Need to study how to present a project clearly and persuasively.