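# House Price Prediction - Data Preprocessing
In this lesson we learn how to build a regression model that predicts a house's price from information about the house. Along the way, the project covers using the Python packages pandas and numpy to manipulate data, matplotlib and seaborn to visualize it, and keras to build a deep learning model.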

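### Environment Notes
Before running this example, check that the Jupyter notebook is configured correctly. From the main menu choose "Edit" > "Notebook settings", set "Runtime type" to "Python 3", change the "Hardware accelerator" drop-down from "None" to "GPU", and then click "Save".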

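### Course Outline
In this house price prediction project, we walk through building a deep learning model and using it to predict house prices. The work is divided into four steps:

>1.   How to preprocess the data (Data Preprocessing)

>2.   How to carry out exploratory data analysis (Exploratory Data Analysis)

>3.   How to apply feature engineering (Feature Engineering)

>4.   How to choose a model and evaluate how well it learns (Model & Inference)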

---

**1.1 Loading the Required Packages**

---

In [1]:
# 1-1
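# First, import the required packages. The usual pattern is `import package_name as alias`,
# which gives each package a shorter name that is more convenient to call later.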
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  # built on matplotlib; provides higher-level visualization functions
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.models import load_model

import warnings
plt.style.use('ggplot')
warnings.filterwarnings('ignore')
%matplotlib inline

Using TensorFlow backend.


---

**1.2 Inspecting the Data Distribution**

---

In [2]:
# 1-2
# pandas can read a CSV file with pd.read_csv('file_name')

train = pd.read_csv('train/train-v3.csv')
test = pd.read_csv('test/test-v3.csv')
valid = pd.read_csv('vaild/valid-v3.csv')
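As a minimal sketch of what `pd.read_csv` returns, the example below reads a small in-memory CSV instead of the course files. The `sample_csv` text and the name `sample` are hypothetical stand-ins for illustration, not the contents of `train-v3.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk; the real course files are not read here.
sample_csv = io.StringIO(
    "id,price,bedrooms,sqft_living\n"
    "5615100330,200000,4,1900\n"
    "8835900086,350000,4,3380\n"
    "9510900270,254000,3,2070\n"
)

# read_csv accepts a file path or any file-like object and returns a DataFrame.
sample = pd.read_csv(sample_csv)

# The first line of the file becomes the column names; each later line becomes one row.
print(sample.shape)
print(list(sample.columns))
```

The same call works with a path string such as the ones used in the cell above; only the source of the text differs.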

In [3]:
# 1-3

train.shape

(12967, 23)

In [4]:
# 1-4

train.columns

Index(['id', 'price', 'sale_yr', 'sale_month', 'sale_day', 'bedrooms',
       'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view',
       'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built',
       'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15',
       'sqft_lot15'],
      dtype='object')
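Because the goal is to predict `price`, the column list can be split into a target and a set of candidate features. The sketch below does this on a handful of the column names printed above. The names `target` and `feature_cols` are illustrative, and treating every column other than `id` and `price` as a feature is an assumption, not necessarily the feature set the course ends up using.

```python
import pandas as pd

# A few of the column names from train.columns, used here for illustration.
columns = pd.Index(['id', 'price', 'sale_yr', 'bedrooms', 'bathrooms', 'sqft_living'])

# The value we want the model to predict.
target = 'price'

# Assumption: everything except the identifier and the target is a candidate feature.
feature_cols = [c for c in columns if c not in ('id', target)]
print(feature_cols)
```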

In [5]:
# 1-5

train.head()

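| | id | price | sale_yr | sale_month | sale_day | bedrooms | bathrooms | sqft_living | sqft_lot | floors | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5615100330 | 200000 | 2015 | 3 | 27 | 4 | 2.0 | 1900 | 8160 | 1 | ... | 7 | 1900 | 0 | 1975 | 0 | 98022 | 47.2114 | -121.986 | 1280 | 6532 |
| 1 | 8835900086 | 350000 | 2014 | 9 | 2 | 4 | 3.0 | 3380 | 16133 | 1 | ... | 8 | 2330 | 1050 | 1959 | 0 | 98118 | 47.5501 | -122.261 | 2500 | 11100 |
| 2 | 9510900270 | 254000 | 2014 | 12 | 11 | 3 | 2.0 | 2070 | 9000 | 1 | ... | 7 | 1450 | 620 | 1969 | 0 | 98023 | 47.3085 | -122.376 | 1630 | 7885 |
| 3 | 2621600015 | 175000 | 2015 | 4 | 30 | 3 | 1.0 | 1150 | 8924 | 1 | ... | 6 | 1150 | 0 | 1943 | 0 | 98030 | 47.3865 | -122.217 | 1492 | 8924 |
| 4 | 8078350090 | 619000 | 2015 | 3 | 31 | 3 | 2.5 | 2040 | 7503 | 2 | ... | 8 | 2040 | 0 | 1987 | 0 | 98029 | 47.5718 | -122.021 | 2170 | 7503 |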


In [6]:
# 1-6

train = train.drop('id',axis=1)
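`DataFrame.drop` returns a new DataFrame rather than modifying the original in place, which is why the result is assigned back to `train`. The following is a minimal sketch on a hypothetical two-row frame `df`, using the equivalent `columns=` keyword:

```python
import pandas as pd

# Hypothetical two-row frame for illustration.
df = pd.DataFrame({
    'id': [5615100330, 8835900086],
    'price': [200000, 350000],
    'bedrooms': [4, 4],
})

# Same effect as df.drop('id', axis=1); the original df is left unchanged.
dropped = df.drop(columns=['id'])

print(list(df.columns))       # still contains 'id'
print(list(dropped.columns))  # 'id' removed
```

The `id` column is presumably dropped because it only identifies a record and is not a property of the house itself.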

In [7]:
# 1-7

train.describe()

| | price | sale_yr | sale_month | sale_day | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 12967.0 | 12967.0 | 12967.0 | 12967.0 | 12967.0 | 12967.0 | 12967.0 | 12967.0 | 12967.0 | 12967.0 | ... | 12967.0 | 12967.0 | 12967.0 | 12967.0 | 12967.0 | 12967.0 | 12967.0 | 12967.0 | 12967.0 | 12967.0 |
| mean | 537383.3 | 2014.322511 | 6.572068 | 15.72993 | 3.362381 | 2.106058 | 2071.295057 | 14995.39 | 1.442893 | 0.00802 | ... | 7.646025 | 1781.741806 | 289.553251 | 1970.766947 | 82.94887 | 98078.459166 | 47.55891 | -122.214565 | 1980.143672 | 12796.53829 |
| std | 366884.0 | 0.467455 | 3.107792 | 8.619505 | 0.941124 | 0.76528 | 919.35518 | 38751.91 | 0.551628 | 0.0892 | ... | 1.171189 | 826.580915 | 440.742634 | 29.472777 | 398.333729 | 53.525055 | 0.138978 | 0.140481 | 683.572323 | 27429.856166 |
| min | 75000.0 | 2014.0 | 1.0 | 1.0 | 0.0 | 0.0 | 290.0 | 572.0 | 1.0 | 0.0 | ... | 1.0 | 290.0 | 0.0 | 1900.0 | 0.0 | 98001.0 | 47.1593 | -122.515 | 399.0 | 659.0 |
| 25% | 319950.0 | 2014.0 | 4.0 | 8.0 | 3.0 | 1.5 | 1420.0 | 5040.0 | 1.0 | 0.0 | ... | 7.0 | 1190.0 | 0.0 | 1951.0 | 0.0 | 98033.0 | 47.47045 | -122.33 | 1480.0 | 5100.0 |
| 50% | 447000.0 | 2014.0 | 6.0 | 16.0 | 3.0 | 2.25 | 1900.0 | 7620.0 | 1.0 | 0.0 | ... | 7.0 | 1550.0 | 0.0 | 1974.0 | 0.0 | 98065.0 | 47.5699 | -122.232 | 1830.0 | 7625.0 |
| 75% | 637000.0 | 2015.0 | 9.0 | 23.0 | 4.0 | 2.5 | 2540.0 | 10634.0 | 2.0 | 0.0 | ... | 8.0 | 2200.0 | 560.0 | 1996.0 | 0.0 | 98118.0 | 47.6773 | -122.125 | 2350.0 | 10051.0 |
| max | 7062500.0 | 2015.0 | 12.0 | 31.0 | 33.0 | 8.0 | 13540.0 | 1024068.0 | 3.0 | 1.0 | ... | 13.0 | 9410.0 | 4820.0 | 2015.0 | 2015.0 | 98199.0 | 47.7776 | -121.315 | 6210.0 | 871200.0 |
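The summary statistics are a quick way to spot values that deserve a closer look: here `bedrooms` ranges from 0 to 33, and `bathrooms` has a minimum of 0. The sketch below shows one way such rows could be pulled out for inspection. The frame `df` is hypothetical (only the extreme bedroom counts mirror the summary above), and the 1 to 10 plausibility range is an assumption chosen for illustration, not a rule from the course.

```python
import pandas as pd

# Hypothetical rows; only the bedroom counts 0 and 33 mirror the min/max in the summary.
df = pd.DataFrame({
    'price': [447000, 319950, 637000, 640000],
    'bedrooms': [3, 0, 4, 33],
    'sqft_living': [1900, 1420, 2540, 1620],
})

# Assumed plausibility range, for illustration only: 1 to 10 bedrooms.
out_of_range = (df['bedrooms'] < 1) | (df['bedrooms'] > 10)

# Boolean indexing keeps only the rows where the mask is True.
suspicious = df[out_of_range]
print(suspicious)
```

Whether such rows should be corrected, removed, or kept is a separate decision that depends on what the exploratory analysis shows.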


In [8]:
# 1-8

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12967 entries, 0 to 12966
Data columns (total 22 columns):
price            12967 non-null int64
sale_yr          12967 non-null int64
sale_month       12967 non-null int64
sale_day         12967 non-null int64
bedrooms         12967 non-null int64
bathrooms        12967 non-null float64
sqft_living      12967 non-null int64
sqft_lot         12967 non-null int64
floors           12967 non-null int64
waterfront       12967 non-null int64
view             12967 non-null int64
condition        12967 non-null int64
grade            12967 non-null int64
sqft_above       12967 non-null int64
sqft_basement    12967 non-null int64
yr_built         12967 non-null int64
yr_renovated     12967 non-null int64
zipcode          12967 non-null int64
lat              12967 non-null float64
long             12967 non-null float64
sqft_living15    12967 non-null int64
sqft_lot15       12967 non-null int64
dtypes: float64(3), int64(19)
memory usage: 2.2 MB
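In the `info()` output every column reports 12967 non-null entries, matching the number of rows, so the training set has no missing values. The same check can be done programmatically. The sketch below uses a hypothetical three-row frame `df` with one deliberately missing value, so that the counts are not all zero:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one deliberately missing value in 'bathrooms'.
df = pd.DataFrame({
    'price': [200000, 350000, 254000],
    'bathrooms': [2.0, np.nan, 2.0],
    'bedrooms': [4, 4, 3],
})

# isnull() marks missing cells as True; sum() counts them per column.
missing_per_column = df.isnull().sum()

# How many columns there are of each dtype (similar to the last lines of info()).
dtype_counts = df.dtypes.value_counts()

print(missing_per_column)
print(dtype_counts)
```

Applied to `train`, `train.isnull().sum()` would be expected to show zero for every column, given the `info()` output above.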

----