# Acknowledgements

This data was collected directly from websites.<br>
Use ML to build a model that predicts house prices.<br><br>

## House Price Prediction Model (Regression)<br><br>

# File descriptions

- `train.csv` - Training set<br><br>
- `test.csv` - Test set<br><br>
- `sampleSubmission.csv` - Answer data set used to check the Kaggle score<br><br>

# Data fields

- `ID` - Unique ID of each house<br><br>
- `ADDRESS` - Address of the house<br><br>
- `SUBURB` - Suburb name<br><br>
- `PRICE` - Price<br><br>
- `BEDROOMS` - Number of bedrooms<br><br>
- `BATHROOMS` - Number of bathrooms<br><br>
- `GARAGE` - Number of garages<br><br>
- `LAND_AREA` - Land area<br><br>
- `FLOOR_AREA` - Building floor area<br><br>
- `BUILD_YEAR` - Year built<br><br>
- `CBD_DIST` - Distance to the central business district<br><br>
- `NEAREST_STN` - Nearest station<br><br>
- `NEAREST_STN_DIST` - Distance to the nearest station<br><br>
- `DATE_SOLD` - Date sold<br><br>
- `POSTCODE` - Postcode<br><br>
- `LATITUDE` - Latitude<br><br>
- `LONGITUDE` - Longitude<br><br>
- `NEAREST_SCH` - Nearest school<br><br>
- `NEAREST_SCH_DIST` - Distance to the nearest school<br><br>
- `NEAREST_SCH_RANK` - Ranking of the nearest school<br><br>


In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LinearRegression, Ridge, Lasso

import warnings
warnings.filterwarnings(action='ignore')

In [2]:
df = pd.read_csv('./train.csv')
test_df = pd.read_csv('./test.csv')
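
The preprocessing below drops one column for having more than 25% missing values and fills another with its median. A minimal sketch of how such a decision could be checked is shown here. The `toy` frame and its values are made up for illustration; in the notebook the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv (values are invented).
toy = pd.DataFrame({
    "BUILD_YEAR": [1990, np.nan, 2005, 2010],
    "NEAREST_SCH_RANK": [np.nan, np.nan, 12, np.nan],
    "PRICE": [500000, 620000, 710000, 450000],
})

# Fraction of missing values per column.
missing_ratio = toy.isna().mean()

# Columns whose missing fraction exceeds the 25% threshold.
to_drop = missing_ratio[missing_ratio > 0.25].index.tolist()
print(to_drop)

# Remaining gaps in a numeric column can be filled with the column median.
build_year_filled = toy["BUILD_YEAR"].fillna(toy["BUILD_YEAR"].median())
print(build_year_filled.tolist())
```

Here `BUILD_YEAR` is missing in exactly 25% of the rows, so it stays below the strict `>` threshold and is filled rather than dropped.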

In [3]:
def preprocess_inputs(df):
    df = df.copy()
    
    # Drop high-cardinality ADDRESS column
    df = df.drop('ADDRESS', axis=1)
    
    # Drop high-missing value (> 25%) column
    df = df.drop('NEAREST_SCH_RANK', axis=1)
    
    # Fill missing values
    df['BUILD_YEAR'] = df['BUILD_YEAR'].fillna(df['BUILD_YEAR'].median())
    
    # Extract date features
    df['DATE_SOLD'] = pd.to_datetime(df['DATE_SOLD'])
    df['DATE_YEAR'] = df['DATE_SOLD'].dt.year
    df['DATE_MONTH'] = df['DATE_SOLD'].dt.month
    df = df.drop('DATE_SOLD', axis=1)
    
    # One-hot encode nominal features
    for column in ['SUBURB', 'NEAREST_STN', 'NEAREST_SCH', 'POSTCODE', 'GARAGE']:
        dummies = pd.get_dummies(df[column], prefix=column)
        df = pd.concat([df, dummies], axis=1)
        df = df.drop(column, axis=1)
    
    # Split df into X and y
    y = df['PRICE']
    X = df.drop('PRICE', axis=1)
    
    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True, random_state=1)
    
    # Scale X
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train = pd.DataFrame(scaler.transform(X_train), index=X_train.index, columns=X_train.columns)
    X_test = pd.DataFrame(scaler.transform(X_test), index=X_test.index, columns=X_test.columns)
    
    return X_train, X_test, y_train, y_test

In [4]:
X_train, X_test, y_train, y_test = preprocess_inputs(df)
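
`preprocess_inputs` one-hot encodes the nominal columns of a single frame. If the same encoding were applied separately to another frame such as `test_df`, the resulting columns might not match, because a category can appear in one file and not the other. The sketch below shows one possible way to align them with `DataFrame.reindex`. The frames `train_part` and `test_part` are hypothetical, and this alignment step is an assumption, not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature frames; suburb "C" appears only in the second one.
train_part = pd.DataFrame({"SUBURB": ["A", "B", "A"], "BEDROOMS": [3, 4, 2]})
test_part = pd.DataFrame({"SUBURB": ["B", "C"], "BEDROOMS": [3, 5]})

# Encode each frame on its own; the dummy columns differ between the two.
train_enc = pd.get_dummies(train_part, columns=["SUBURB"])
test_enc = pd.get_dummies(test_part, columns=["SUBURB"])

# Align to the training columns: missing dummies become 0,
# and categories unseen during training are discarded.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

After alignment, a scaler or model fitted on the training columns can be applied to the second frame without a shape mismatch.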

In [7]:
models = {
    "                Linear Regression": LinearRegression(),
    "Ridge (L2-Regularized) Regression": Ridge(),
    "Lasso (L1-Regularized) Regression": Lasso()
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name + " trained.")

                Linear Regression trained.
Ridge (L2-Regularized) Regression trained.
Lasso (L1-Regularized) Regression trained.


# Results

In [8]:
for name, model in models.items():
    print(name + ": R^2 Score: {:.5f}".format(model.score(X_test, y_test)))

                Linear Regression: R^2 Score: -2597381310416862188666880.00000
Ridge (L2-Regularized) Regression: R^2 Score: 0.75647
Lasso (L1-Regularized) Regression: R^2 Score: 0.75651
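
The `score` method of scikit-learn regressors returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot. It is 1 for a perfect fit, 0 for a model that always predicts the mean of the targets, and negative when the predictions are worse than that baseline. The sketch below computes it by hand on made-up numbers to show how a strongly negative score can arise. The helper `r2_score_manual` is a name introduced here for illustration.

```python
def r2_score_manual(y_true, y_pred):
    """Compute R^2 = 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Invented targets and predictions, for illustration only.
y_true = [100.0, 200.0, 300.0]
good = r2_score_manual(y_true, [110.0, 190.0, 310.0])   # close predictions
bad = r2_score_manual(y_true, [1000.0, -500.0, 900.0])  # wildly off predictions

print(good)
print(bad)
```

Because the residual sum of squares is unbounded, a few extreme predictions are enough to push R^2 far below zero.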


In [10]:
lasso_model = Lasso(alpha=10.0)
lasso_model.fit(X_train, y_train)

print("R^2 Score: {:.5f}".format(lasso_model.score(X_test, y_test)))
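
The cell above refits Lasso with a larger regularization strength, `alpha=10.0`. As a rough sketch of what `alpha` controls, the example below fits Lasso on synthetic data from `make_regression` and counts the non-zero coefficients for several values. The data set, the chosen alphas, and the names `X_demo`, `y_demo`, and `nonzero_counts` are assumptions for illustration and say nothing about the house price data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic regression problem: 20 features, only 5 of them informative.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

nonzero_counts = {}
for alpha in [0.1, 10.0, 1e6]:
    demo_model = Lasso(alpha=alpha)
    demo_model.fit(X_demo, y_demo)
    # The L1 penalty drives some coefficients to exactly zero.
    nonzero_counts[alpha] = int(np.sum(demo_model.coef_ != 0))

print(nonzero_counts)
```

A very large `alpha` zeroes every coefficient, leaving only the intercept, while a small one keeps more features in the model. A value in between is usually chosen by validation.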