
beijing-house-scikit-learn


https://jungh0.github.io/beijing-house-scikit-learn/map.html

Basic concept of the project

Predict Housing Price

Housing prices in Beijing from 2011 to 2017, fetched from an external dataset.

Description of the data

Dataset size: 134,266 rows

Data Preprocessing

Data Value Changes (Cleaning dirty data)


# Clean dirty values in the 'floor' and 'constructionTime' columns
data = clean_value(data, 'floor')
data = clean_value(data, 'constructionTime')

# Drop rows that still contain NaN values
data_dnan = dropnan(data)
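
The dropnan helper is defined elsewhere in the repository; a minimal sketch, assuming it simply wraps pandas.DataFrame.dropna and resets the index:

def dropnan(D):
    # D is a pandas DataFrame; drop every row with a missing value,
    # then reset the index so later column-wise concatenations stay aligned.
    return D.dropna().reset_index(drop=True)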

Data Value Changes (Text preprocessing)

Remove meaningless words
Keep only the floor number


def clean_value(D, col):
    # Extract the first run of digits (e.g. the floor number) from the text column
    new_col = D[col].str.extract(r'(\d+)')
    new_col.columns = [col + '_new']
    D = drop(D, col)        # drop the original text column
    D = concat(D, new_col)  # concatenate the cleaned column onto the dataset
    return D
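
The drop and concat wrappers used above are also defined elsewhere in the repository; a minimal sketch, assuming they are thin wrappers around the corresponding pandas operations:

import pandas as pd

def drop(D, col):
    # Remove a single column from the DataFrame
    return D.drop(columns=[col])

def concat(D, new_cols):
    # Append the new column(s) to the DataFrame side by side (axis=1)
    return pd.concat([D, new_cols], axis=1)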

Feature Engineering

Categorical data

def one_hot_data(D, c_str):
    # Convert the categorical column to string type and one-hot encode it
    D[c_str] = D[c_str].astype(str)
    dummies = pd.get_dummies(D[[c_str]])
    D = drop(D, c_str)        # drop the original categorical column
    new = concat(D, dummies)  # concatenate the indicator columns onto the dataset
    return new
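
A usage example for the helper (the column name here is hypothetical, not taken from the README):

# One-hot encode a categorical column such as 'district' (hypothetical column name)
data_dnan = one_hot_data(data_dnan, 'district')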

Preprocessing Results

After preprocessing
Dataset size: 127,670 rows (down from 134,266)


Data Analysis

Find model

Split the data into a training set of 102,137 samples and a test set of 25,534 samples.
The Random Forest Regressor is used because it produced the smallest MSE among the models compared (see the sketch below).

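A minimal sketch of this model-selection step, assuming scikit-learn's train_test_split and a few candidate regressors; the feature matrix X, target y, and the exact list of models compared are assumptions, since the README only reports the split sizes and the MSE comparison:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Split the preprocessed features X and prices y into ~80/20 train/test sets
train_set_x, test_set_x, train_set_y, test_set_y = train_test_split(
    X, y, test_size=0.2, random_state=531)

candidates = {
    'LinearRegression': LinearRegression(),
    'DecisionTree': DecisionTreeRegressor(random_state=531),
    'RandomForest': RandomForestRegressor(random_state=531),
}

# Keep the model with the smallest test-set MSE (the random forest, per the README)
for name, model in candidates.items():
    model.fit(train_set_x, train_set_y)
    print(name, mean_squared_error(test_set_y, model.predict(test_set_x)))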

Random Forest Regressor

Find estimators

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

def find_estimators(train_set_x, train_set_y, test_set_x, test_set_y):
    # Train random forests over a range of ensemble sizes to see how the MSE changes
    mseOos = []
    nTreeList = range(10, 500, 100)
    for iTrees in nTreeList:
        maxFeat = None  # 4  # to be tuned later
        rfModel = RandomForestRegressor(n_estimators=iTrees, max_depth=None,
                                        max_features=maxFeat, oob_score=False,
                                        random_state=531)
        rfModel.fit(train_set_x, train_set_y)
        # Accumulate the MSE on the test set
        prediction = rfModel.predict(test_set_x)
        mseOos.append(mean_squared_error(test_set_y, prediction))
    print("MSE")
    print(mseOos[-1])
    # Plot the number of trees in the ensemble against the test-set error
    plt.plot(nTreeList, mseOos)
    plt.xlabel('Number of Trees in Ensemble')
    plt.ylabel('Mean Squared Error')
    plt.show()
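
The search is then run on the split data from above:

find_estimators(train_set_x, train_set_y, test_set_x, test_set_y)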

Set n_estimators = 100

Find max depth

def find_max_depth(train_set_x, train_set_y, test_set_x, test_set_y):
    # Train random forests over a range of tree depths to see how the MSE changes
    mseOos = []
    depthList = range(10, 50, 1)
    for iDepth in depthList:
        maxFeat = None
        rfModel = RandomForestRegressor(n_estimators=30, max_depth=iDepth,
                                        max_features=maxFeat, oob_score=False,
                                        random_state=531)
        rfModel.fit(train_set_x, train_set_y)
        # Accumulate the MSE on the test set
        prediction = rfModel.predict(test_set_x)
        mseOos.append(mean_squared_error(test_set_y, prediction))
    print("MSE")
    print(mseOos[-1])
    # Plot the maximum tree depth against the test-set error
    plt.plot(depthList, mseOos)
    plt.xlabel('Maximum Tree Depth')
    plt.ylabel('Mean Squared Error')
    plt.show()


Find max features

def find_max_features(train_set_x, train_set_y, test_set_x, test_set_y):
    # Train random forests over a range of max_features values to see how the MSE changes
    mseOos = []
    featList = range(1, 23, 1)
    for nFeat in featList:
        rfModel = RandomForestRegressor(n_estimators=30, max_depth=17,
                                        max_features=nFeat, oob_score=False,
                                        random_state=531)
        rfModel.fit(train_set_x, train_set_y)
        # Accumulate the MSE on the test set
        prediction = rfModel.predict(test_set_x)
        mseOos.append(mean_squared_error(test_set_y, prediction))
    print("MSE")
    print(mseOos[-1])
    # Plot max_features against the test-set error
    plt.plot(featList, mseOos)
    plt.xlabel('Maximum Features per Split')
    plt.ylabel('Mean Squared Error')
    plt.show()

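A minimal sketch of the final tuned model, assuming the values found above (n_estimators = 100, max_depth = 17); the chosen max_features value is only visible in the plot, so it is left at the scikit-learn default here:

final_model = RandomForestRegressor(n_estimators=100, max_depth=17, random_state=531)
final_model.fit(train_set_x, train_set_y)
print(mean_squared_error(test_set_y, final_model.predict(test_set_x)))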

Data Analysis

Result


Feature Importance

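The feature-importance chart comes from the fitted forest; a minimal sketch, assuming final_model from the sketch above and that train_set_x is a pandas DataFrame whose columns are the engineered features:

import numpy as np

# Rank features by the forest's impurity-based importances, largest first
importances = final_model.feature_importances_
for idx in np.argsort(importances)[::-1]:
    print(train_set_x.columns[idx], importances[idx])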

Predict housing price

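New listings are pushed through the same preprocessing steps and then scored with the fitted model; a minimal sketch using a held-out row as the example input:

# Take one preprocessed row (same columns as the training features) and predict its price
sample = test_set_x.iloc[[0]]
predicted_price = final_model.predict(sample)
print(predicted_price[0])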
