# Acknowledgements

This data was collected directly from a website.
Use ML to build a model that predicts house prices.

## House Price Prediction Model (Regression Model)

# File descriptions

- `train.csv` - the training set
- `test.csv` - the test set
- `sampleSubmission.csv` - answer data set used to check the Kaggle score

# Data fields

- `ID` - unique number of each house
- `ADDRESS` - address of the house
- `SUBURB` - name of the suburb
- `PRICE` - price
- `BEDROOMS` - number of bedrooms
- `BATHROOMS` - number of bathrooms
- `GARAGE` - number of garages
- `LAND_AREA` - land area
- `FLOOR_AREA` - floor area of the building
- `BUILD_YEAR` - year built
- `CBD_DIST` - distance to the central business district
- `NEAREST_STN` - nearest station
- `NEAREST_STN_DIST` - distance to the nearest station
- `DATE_SOLD` - date sold
- `POSTCODE` - postcode
- `LATITUDE` - latitude
- `LONGITUDE` - longitude
- `NEAREST_SCH` - nearest school
- `NEAREST_SCH_DIST` - distance to the nearest school
- `NEAREST_SCH_RANK` - ranking of the nearest school


In [47]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LinearRegression, Ridge, Lasso

import warnings
warnings.filterwarnings(action='ignore')

In [48]:
df = pd.read_csv('./train.csv')
test_df = pd.read_csv('./test.csv')

In [49]:
# Target data
y = df['PRICE']

In [50]:
x = df.drop(['PRICE'], axis=1)

In [51]:
x = x.drop(['ADDRESS', 'NEAREST_SCH_RANK', 'SUBURB'], axis=1)

test_df = test_df.drop(['ADDRESS', 'NEAREST_SCH_RANK', 'SUBURB'], axis=1)

In [52]:
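# Treat a missing GARAGE value as no garage.
x['GARAGE'] = x['GARAGE'].fillna(0)
test_df['GARAGE'] = test_df['GARAGE'].fillna(0)

# Fill missing BUILD_YEAR values with each set's own median.
x['BUILD_YEAR'] = x['BUILD_YEAR'].fillna(x['BUILD_YEAR'].median())
test_df['BUILD_YEAR'] = test_df['BUILD_YEAR'].fillna(test_df['BUILD_YEAR'].median())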
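
The cell above fills `BUILD_YEAR` in the test set with the test set's own median. A common alternative is to compute each fill value once on the training data and reuse it for the test data, so both sets are transformed identically. The sketch below illustrates this on small made-up frames; `train_demo`, `test_demo`, and their values are hypothetical and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for the training and test features.
train_demo = pd.DataFrame({
    "GARAGE": [1.0, np.nan, 2.0],
    "BUILD_YEAR": [1990.0, np.nan, 2010.0],
})
test_demo = pd.DataFrame({
    "GARAGE": [np.nan, 3.0],
    "BUILD_YEAR": [np.nan, 2005.0],
})

# Fill values are derived from the training data only.
fill_values = {
    "GARAGE": 0,
    "BUILD_YEAR": train_demo["BUILD_YEAR"].median(),
}

# fillna with a dict fills each named column with its own value
# and returns a new frame instead of modifying in place.
train_filled = train_demo.fillna(fill_values)
test_filled = test_demo.fillna(fill_values)

print(test_filled)
```

Assigning the result back, rather than calling `fillna(..., inplace=True)` on a selected column, also avoids depending on whether the selected column is a view or a copy.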

In [53]:
# Split the sale date into year and month features, then drop the raw date.
x['DATE_SOLD'] = pd.to_datetime(x['DATE_SOLD'])
x['DATE_YEAR'] = x['DATE_SOLD'].dt.year
x['DATE_MONTH'] = x['DATE_SOLD'].dt.month

x = x.drop('DATE_SOLD', axis=1)

test_df['DATE_SOLD'] = pd.to_datetime(test_df['DATE_SOLD'])
test_df['DATE_YEAR'] = test_df['DATE_SOLD'].dt.year
test_df['DATE_MONTH'] = test_df['DATE_SOLD'].dt.month

test_df = test_df.drop('DATE_SOLD', axis=1)

In [54]:
categorical_columns = ['NEAREST_STN', 'NEAREST_SCH', 'POSTCODE', 'GARAGE']

# One-hot encode each categorical column of the training features.
for column in categorical_columns:
    dummies = pd.get_dummies(x[column], prefix=column)
    x = pd.concat([x, dummies], axis=1)
    x = x.drop(column, axis=1)

# Repeat the same encoding, separately, for the test features.
for column in categorical_columns:
    dummies = pd.get_dummies(test_df[column], prefix=column)
    test_df = pd.concat([test_df, dummies], axis=1)
    test_df = test_df.drop(column, axis=1)

In [55]:
x.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18510 entries, 0 to 18509
Columns: 377 entries, ID to GARAGE_99.0
dtypes: float64(4), int64(9), uint8(364)
memory usage: 8.3 MB


In [56]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15146 entries, 0 to 15145
Columns: 374 entries, ID to GARAGE_50.0
dtypes: float64(4), int64(9), uint8(361)
memory usage: 6.7 MB
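
The two `info()` outputs report 377 columns for `x` and 374 for `test_df`. `pd.get_dummies` only creates columns for the categories present in the frame it receives, so encoding the two frames separately can give different layouts (the last columns here are `GARAGE_99.0` and `GARAGE_50.0`). The sketch below, on made-up data, lists the columns that differ and then encodes both frames jointly so they share one layout. The frames `train_demo` and `test_demo` are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames; the GARAGE categories differ between them.
train_demo = pd.DataFrame({"FLOOR_AREA": [100, 150, 200],
                           "GARAGE": [1.0, 2.0, 99.0]})
test_demo = pd.DataFrame({"FLOOR_AREA": [120, 180],
                          "GARAGE": [2.0, 50.0]})

# Encoding separately reproduces the mismatch.
train_enc = pd.get_dummies(train_demo, columns=["GARAGE"])
test_enc = pd.get_dummies(test_demo, columns=["GARAGE"])

only_in_train = sorted(set(train_enc.columns) - set(test_enc.columns))
only_in_test = sorted(set(test_enc.columns) - set(train_enc.columns))
print("Only in train:", only_in_train)
print("Only in test:", only_in_test)

# Encoding jointly: stack both frames under labelled keys, encode once,
# then split them apart again by key.
combined = pd.concat([train_demo, test_demo], keys=["train", "test"])
combined_enc = pd.get_dummies(combined, columns=["GARAGE"])
train_joint = combined_enc.loc["train"]
test_joint = combined_enc.loc["test"]

print(list(train_joint.columns) == list(test_joint.columns))
```

With joint encoding, a category that occurs only in the test data becomes a column that is all zeros in the training data, so the model cannot learn anything from it. Whether that is acceptable depends on the task.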


In [46]:
X_train, X_test, y_train, y_test = train_test_split(x, y, shuffle=True, random_state=1)

In [34]:
scaler = StandardScaler()

scaler.fit(X_train)

X_train = pd.DataFrame(scaler.transform(X_train), index=X_train.index, columns=X_train.columns)

X_test = pd.DataFrame(scaler.transform(X_test), index=X_test.index, columns=X_test.columns)
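
The scaler above is fitted on `X_train` only and then applied to both splits, which keeps statistics of the held-out rows out of the fit. The same pattern can be packaged with scikit-learn's `make_pipeline`, so that scaling is always applied before the model, including at prediction time. This is a minimal sketch on synthetic data; `X_demo`, `y_demo`, and the coefficients are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data: three unscaled features and a noisy linear target.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=100.0, scale=25.0, size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# fit() fits the scaler on X_tr and trains Ridge on the scaled values;
# score() and predict() reuse the fitted scaler automatically.
pipe = make_pipeline(StandardScaler(), Ridge())
pipe.fit(X_tr, y_tr)

r2 = pipe.score(X_te, y_te)
print("R^2 on held-out data: {:.5f}".format(r2))
```

Because the fitted scaler travels with the model, there is no separate step that can be forgotten when predicting on new rows.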

In [35]:
models = {
    "                Linear Regression": LinearRegression(),
    "Ridge (L2-Regularized) Regression": Ridge(),
    "Lasso (L1-Regularized) Regression": Lasso()
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name + " trained.")

                Linear Regression trained.
Ridge (L2-Regularized) Regression trained.
Lasso (L1-Regularized) Regression trained.


In [36]:
for name, model in models.items():
    print(name + ": R^2 Score: {:.5f}".format(model.score(X_test, y_test)))

                Linear Regression: R^2 Score: -1115297028088144345432064.00000
Ridge (L2-Regularized) Regression: R^2 Score: 0.75078
Lasso (L1-Regularized) Regression: R^2 Score: 0.75080
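
The very large negative R^2 for plain `LinearRegression`, next to about 0.75 for Ridge and Lasso, is the kind of result that collinear inputs can produce. A complete set of one-hot columns always sums to 1, so the columns are linearly dependent once an intercept is present, and unregularized coefficients are then not uniquely determined. This is a plausible explanation, not something verified on this dataset. The sketch below checks the rank of a design matrix built from made-up station labels, with and without `drop_first=True`.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with three levels.
demo = pd.DataFrame({"NEAREST_STN": ["A", "B", "C", "A", "B", "C"]})


def design_matrix(dummies):
    """Return the dummy columns as floats with an intercept column prepended."""
    values = dummies.to_numpy(dtype=float)
    return np.column_stack([np.ones(len(values)), values])


# All three dummy columns plus the intercept: 4 columns in total.
full = design_matrix(pd.get_dummies(demo["NEAREST_STN"]))
# One level dropped: 2 dummy columns plus the intercept.
reduced = design_matrix(pd.get_dummies(demo["NEAREST_STN"], drop_first=True))

full_rank = np.linalg.matrix_rank(full)
reduced_rank = np.linalg.matrix_rank(reduced)

print("full:    {} columns, rank {}".format(full.shape[1], full_rank))
print("reduced: {} columns, rank {}".format(reduced.shape[1], reduced_rank))
```

The full encoding has more columns than its rank, while the reduced one has full column rank. Ridge and Lasso add a penalty that keeps coefficients bounded even when columns are dependent, which would be consistent with their stable scores above.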


In [37]:
lasso_model = Lasso(alpha=10.0)
lasso_model.fit(X_train, y_train)

print("R^2 Score: {:.5f}".format(lasso_model.score(X_test, y_test)))

R^2 Score: 0.75102
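
`alpha=10.0` above is a single hand-picked value. One way to compare several regularization strengths without touching the held-out split is cross-validation on the training data, for example with scikit-learn's `LassoCV`. The sketch uses synthetic data and an arbitrary candidate list; neither comes from the original notebook.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data: 10 features, only 3 of which influence the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 10))
true_coef = np.array([5.0, -3.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=300)

# Arbitrary candidate values, chosen only for illustration.
candidate_alphas = [0.001, 0.01, 0.1, 1.0, 10.0]

# 5-fold cross-validation over the candidates; after fitting, the model is
# refit on all provided rows with the best-scoring alpha.
search = LassoCV(alphas=candidate_alphas, cv=5)
search.fit(X_demo, y_demo)

best_alpha = search.alpha_
print("Selected alpha: {}".format(best_alpha))
print("Non-zero coefficients: {}".format(int(np.sum(search.coef_ != 0))))
```

In the notebook's setting, the analogous call would fit on the scaled `X_train` and `y_train`, leaving `X_test` for the final score.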


In [42]:
score_fin = lasso_model.predict(x)

score_fin

array([-2.29243100e+09, -1.03152663e+09, -2.15991026e+09, ...,
       -5.52284356e+08, -9.34620769e+08, -7.33810026e+08])

In [43]:
submit_df = pd.read_csv('./sample_submission.csv')

submit_df['PRICE'] = score_fin

# submit_df.to_csv('model01_REG_73.csv', index=False)

ValueError: Length of values (18510) does not match length of index (15146)
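
The `ValueError` arises because `score_fin` was computed from `x`, the 18,510-row training features, while the submission file has 15,146 rows, one per test row. Since `x` was also passed to the model unscaled, the predictions printed above are large negative numbers. A possible fix is to predict on the test features after giving them the training column layout and the fitted scaling. The sketch below shows that flow end to end on made-up data; every `*_demo` name and value is hypothetical.

```python
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the test set contains a GARAGE value unseen in training.
train_demo = pd.DataFrame({
    "FLOOR_AREA": [100, 150, 200, 120, 180, 90],
    "GARAGE": [1.0, 2.0, 1.0, 2.0, 3.0, 1.0],
    "PRICE": [300000, 450000, 600000, 370000, 560000, 270000],
})
test_demo = pd.DataFrame({
    "FLOOR_AREA": [110, 160, 210],
    "GARAGE": [2.0, 4.0, 2.0],
})

y_demo = train_demo["PRICE"]
x_demo = pd.get_dummies(train_demo.drop("PRICE", axis=1),
                        columns=["GARAGE"], dtype=float)
test_x_demo = pd.get_dummies(test_demo, columns=["GARAGE"], dtype=float)

# Give the test frame exactly the training columns: columns missing from the
# test data are added as zeros, columns unknown to the model are dropped.
test_x_demo = test_x_demo.reindex(columns=x_demo.columns, fill_value=0)

# Fit the scaler on the training features and apply it to both frames.
scaler_demo = StandardScaler().fit(x_demo)
x_scaled = pd.DataFrame(scaler_demo.transform(x_demo), columns=x_demo.columns)
test_scaled = pd.DataFrame(scaler_demo.transform(test_x_demo),
                           columns=x_demo.columns)

model_demo = Lasso(alpha=10.0).fit(x_scaled, y_demo)

# Predict on the scaled test features, so there is one prediction per test row.
predictions = model_demo.predict(test_scaled)

submit_demo = pd.DataFrame({"ID": range(len(test_demo))})
submit_demo["PRICE"] = predictions
print(submit_demo)
```

In the notebook's own variables, this would correspond to reindexing `test_df` to `X_train.columns`, applying `scaler.transform`, and calling `lasso_model.predict` on the result before assigning it to `submit_df['PRICE']`.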