My solution to the Kaggle house price prediction challenge


Data Processing and Feature Extraction Approaches

Trial 1:

  • Dropped 'Id'.
  • Replaced all non-numerical NaN values with 'None'.
  • One-hot encoded all non-numerical features.
  • Filled all numerical NaN values with the column mean.
  • Shifted year data to be relative to the minimum of its column.
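A minimal sketch of the pipeline above, assuming pandas, a 'train.csv' path, and the usual Ames year columns (all assumptions, not the exact original code):

import numpy as np
import pandas as pd

train = pd.read_csv('train.csv')    # assumed path to the competition data
train = train.drop(columns=['Id'])  # drop 'Id'

# Replace NaN in non-numerical features with 'None', then one-hot encode them.
cat_cols = train.select_dtypes(include='object').columns
train[cat_cols] = train[cat_cols].fillna('None')
train = pd.get_dummies(train, columns=list(cat_cols))

# Fill numerical NaN with the column mean.
num_cols = train.select_dtypes(include=np.number).columns
train[num_cols] = train[num_cols].fillna(train[num_cols].mean())

# Rebase each year column on its minimum.
for col in ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold']:
    train[col] = train[col] - train[col].min()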

Problems:

  • The data contains outliers.
  • Some numerical features are actually categorical.
  • Filling numerical NaN with column means is not a good approach because:
    • Numerical features usually contain NaN because the house does not have that feature, so 0 is more appropriate than the mean.
    • Outliers affect the mean greatly.
  • The target column 'SalePrice' is not normally distributed.
  • Highly correlated features have a repeated impact on the model.

Trial 2:

  • One-hot encoded all categorical features.
  • Normalized the 'SalePrice' distribution toward a normal curve by taking the log:
train['LogSalePrice'] = np.log(train['SalePrice'])

Use:

train['SalePrice'] = np.exp(train['LogSalePrice'])

to return to the original distribution.

  • Removed one feature from each pair of features with a correlation above 0.8, based on the correlation graph.
    Of the two, the feature with the higher correlation with 'SalePrice' is removed (see the sketch after this list).
  • Filled all numerical NaN values with 0.
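A minimal sketch of that pruning rule, assuming 'SalePrice' is still in the frame and following the pair rule exactly as described above (threshold and tie-breaking per the text; helper names are illustrative):

corr = train.corr(numeric_only=True)
target_corr = corr['SalePrice'].abs()

to_drop = set()
features = [c for c in corr.columns if c != 'SalePrice']
for i, a in enumerate(features):
    for b in features[i + 1:]:
        if abs(corr.loc[a, b]) > 0.8:
            # Of each highly correlated pair, drop the one with the
            # higher correlation to 'SalePrice', as described above.
            to_drop.add(a if target_corr[a] > target_corr[b] else b)

train = train.drop(columns=list(to_drop))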

Trial 3:

All credit for these methods goes to @Golden and her notebook.

  • Filled all numerical NaN values with 0.
  • Filled all categorical NaN values with 'None'.
  • Removed the outliers recommended by the author:
train = train[train['GrLivArea']<4000]
  • Normalized 'SalePrice' (log transform, as in Trial 2).
  • One-hot encoded all categorical features.
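Put together, a minimal sketch of the Trial 3 pipeline (same pandas assumptions as in the Trial 1 sketch):

num_cols = train.select_dtypes(include=np.number).columns
cat_cols = train.select_dtypes(include='object').columns
train[num_cols] = train[num_cols].fillna(0)
train[cat_cols] = train[cat_cols].fillna('None')
train = train[train['GrLivArea'] < 4000]            # outliers recommended by the author
train['LogSalePrice'] = np.log(train['SalePrice'])  # normalize the target
train = pd.get_dummies(train, columns=list(cat_cols))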

Model Approaches


Linear Regression:

  • Used hyperparameter tuning to tune a scikit-learn linear regression model (see the sketch after this list).
  • Used polynomial features to expand the feature space.
  • Used Root Mean Squared Error (RMSE) as the loss function, since it is the competition's evaluation metric.
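The README does not name the exact estimator; since an alpha is tuned (see the results below), a regularized linear model such as Ridge is assumed here. A minimal tuning sketch, with X_train and y_train taken from the processed data:

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Polynomial expansion followed by a regularized linear model (Ridge assumed).
pipe = make_pipeline(PolynomialFeatures(degree=1), Ridge())

# Tune alpha by cross-validation, scoring with (negated) RMSE.
params = {'ridge__alpha': [0.1, 1, 10, 100, 1000]}
search = GridSearchCV(pipe, params, scoring='neg_root_mean_squared_error', cv=5)
search.fit(X_train, y_train)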

Result:

  • First-degree polynomial features showed the best result.
  • The optimal alpha is less than 1000.
  • Scores:
    • Data from Trial 1: 0.24922.
    • Data from Trial 3: 0.31011.

Neural Network:

  • Implemented RMSE losses for both the default 'SalePrice' and 'LogSalePrice' targets:
from keras import backend as K

def root_mean_squared_error(y_true, y_pred):
    return K.sqrt(K.mean(K.square(y_pred - y_true)))

def exp_root_mean_squared_error(y_true, y_pred):
    # RMSE in the original price scale when training on 'LogSalePrice'
    return K.sqrt(K.mean(K.square(K.exp(y_pred) - K.exp(y_true))))
  • Established a baseline model:
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 512)               207872    
_________________________________________________________________
re_lu_1 (ReLU)               (None, 512)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 512)               262656    
_________________________________________________________________
re_lu_2 (ReLU)               (None, 512)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 512)               262656    
_________________________________________________________________
re_lu_3 (ReLU)               (None, 512)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 513       
=================================================================
Total params: 733,697
Trainable params: 733,697
Non-trainable params: 0
_________________________________________________________________
  • 3 hidden layers of 512 Dense ReLU neurons and one output neuron (reconstruction sketch after this list).
  • Trained until 'val_loss' stopped improving for 50 epochs (early stopping).
  • Default 'adam' optimizer.
  • root_mean_squared_error (RMSE) as the loss.
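A sketch reconstructing this baseline from the summary above; the input width, validation split, and epoch cap are assumptions:

from keras.models import Sequential
from keras.layers import Dense, ReLU
from keras.callbacks import EarlyStopping

# Three Dense(512) + ReLU blocks and a single linear output, as in the summary.
model = Sequential()
model.add(Dense(512, input_dim=X_train.shape[1]))
model.add(ReLU())
model.add(Dense(512))
model.add(ReLU())
model.add(Dense(512))
model.add(ReLU())
model.add(Dense(1))

model.compile(optimizer='adam', loss=root_mean_squared_error)

# Stop once 'val_loss' has not improved for 50 epochs.
early_stop = EarlyStopping(monitor='val_loss', patience=50, restore_best_weights=True)
model.fit(X_train, y_train, validation_split=0.2, epochs=10000, callbacks=[early_stop])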

Attempts:

  • Structures:
    • Increased / decreased the number of neurons per layer.
    • Increased / decreased the depth of the network.
  • Activation functions:
    • Sigmoid.
    • Default LeakyReLU, alpha = 0.1.
    • LeakyReLU, alpha = 0.5.
  • Optimizers:
    • Adam with increased / decreased learning rates.
    • Default SGD.

Result:

  • Baseline score: 0.21801.
  • Structures:
    • Increasing the model size and number of neurons resulted in the exact same score.
    • Decreasing them resulted in a significantly worse score.
  • Activation functions:
    • Sigmoid did not converge within 10000 epochs.
    • Default LeakyReLU resulted in a slightly better score: 0.21259.
    • LeakyReLU with alpha = 0.5 performed worse than the default, scoring 0.21337.
  • Optimizers:
    • The most optimal Adam learning_rate was 0.0001, scoring 0.21106.
    • SGD did not converge within 10000 epochs.
  • Combined Model:
    • Parameters: default LeakyReLU, Adam learning_rate = 0.0001, 3 layers of 512 neurons.
    • Score: 0.21406; somehow combining these changes scored worse than the best individual change.

Lasso Regression

This approach is built upon @Golden's notebook.

  • Used hyperparameter tuning to tune a Lasso Regression model (tuning sketch after the results below).
  • Started from Golden's parameters.

Result:

  • Golden's Score: 0.11888.
  • Best parameters from hyperparameter tuning:
Lasso(alpha = 0.0005, fit_intercept = True, normalize = False)
  • Score: 0.11744.
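A minimal tuning sketch, assuming the Trial 3 data, a log target, and an alpha grid around Golden's value (grid and max_iter are assumptions):

from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

params = {'alpha': [0.0001, 0.0005, 0.001, 0.005, 0.01]}
search = GridSearchCV(Lasso(max_iter=10000), params,
                      scoring='neg_root_mean_squared_error', cv=5)
search.fit(X_train, y_log_train)  # y_log_train: the 'LogSalePrice' column
best_lasso = search.best_estimator_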

Conclusion


  • A neural network is not an all-powerful solution.
  • Better data cleaning and feature engineering with a simple model can produce a much better result than a neural network.
  • The complexity of this data is manageable by humans, so careful data cleaning and feature engineering should be done.
  • Traditional approaches should be considered before deep learning for this type of data.
