**Data fields**:

Here's a brief version of what you'll find in the data description file.

- SalePrice: The property's sale price in dollars. This is the target variable that you're trying to predict.
- MSSubClass: The building class
- MSZoning: The general zoning classification
- LotFrontage: Linear feet of street connected to property
- LotArea: Lot size in square feet
- Street: Type of road access
- Alley: Type of alley access
- LotShape: General shape of property
- LandContour: Flatness of the property
- Utilities: Type of utilities available
- LotConfig: Lot configuration
- LandSlope: Slope of property
- Neighborhood: Physical locations within Ames city limits
- Condition1: Proximity to main road or railroad
- Condition2: Proximity to main road or railroad (if a second is present)
- BldgType: Type of dwelling
- HouseStyle: Style of dwelling
- OverallQual: Overall material and finish quality
- OverallCond: Overall condition rating
- YearBuilt: Original construction date
- YearRemodAdd: Remodel date
- RoofStyle: Type of roof
- RoofMatl: Roof material
- Exterior1st: Exterior covering on house
- Exterior2nd: Exterior covering on house (if more than one material)
- MasVnrType: Masonry veneer type
- MasVnrArea: Masonry veneer area in square feet
- ExterQual: Exterior material quality
- ExterCond: Present condition of the material on the exterior
- Foundation: Type of foundation
- BsmtQual: Height of the basement
- BsmtCond: General condition of the basement
- BsmtExposure: Walkout or garden level basement walls
- BsmtFinType1: Quality of basement finished area
- BsmtFinSF1: Type 1 finished square feet
- BsmtFinType2: Quality of second finished area (if present)
- BsmtFinSF2: Type 2 finished square feet
- BsmtUnfSF: Unfinished square feet of basement area
- TotalBsmtSF: Total square feet of basement area
- Heating: Type of heating
- HeatingQC: Heating quality and condition
- CentralAir: Central air conditioning
- Electrical: Electrical system
- 1stFlrSF: First Floor square feet
- 2ndFlrSF: Second floor square feet
- LowQualFinSF: Low quality finished square feet (all floors)
- GrLivArea: Above grade (ground) living area square feet
- BsmtFullBath: Basement full bathrooms
- BsmtHalfBath: Basement half bathrooms
- FullBath: Full bathrooms above grade
- HalfBath: Half baths above grade
- BedroomAbvGr: Number of bedrooms above basement level
- KitchenAbvGr: Number of kitchens
- KitchenQual: Kitchen quality
- TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
- Functional: Home functionality rating
- Fireplaces: Number of fireplaces
- FireplaceQu: Fireplace quality
- GarageType: Garage location
- GarageYrBlt: Year garage was built
- GarageFinish: Interior finish of the garage
- GarageCars: Size of garage in car capacity
- GarageArea: Size of garage in square feet
- GarageQual: Garage quality
- GarageCond: Garage condition
- PavedDrive: Paved driveway
- WoodDeckSF: Wood deck area in square feet
- OpenPorchSF: Open porch area in square feet
- EnclosedPorch: Enclosed porch area in square feet
- 3SsnPorch: Three season porch area in square feet
- ScreenPorch: Screen porch area in square feet
- PoolArea: Pool area in square feet
- PoolQC: Pool quality
- Fence: Fence quality
- MiscFeature: Miscellaneous feature not covered in other categories
- MiscVal: Value of miscellaneous feature
- MoSold: Month Sold
- YrSold: Year Sold
- SaleType: Type of sale
- SaleCondition: Condition of sale

______
## `1. Data Acquisition`
- The pandas package helps us work with our datasets. We start by loading the training dataset into a pandas DataFrame and take a first look at it using pandas' descriptive statistics functions.
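
As a minimal sketch of that first look, assuming a tiny made-up frame in place of the real CSV (the values and the `toy` name are illustrative only):

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are made up for illustration.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Pave", "Grvl"],
    "SalePrice": [208500, 181500, 223500],
})

shape = toy.shape                           # (rows, columns)
summary = toy.describe()                    # count, mean, std, min, quartiles, max of numeric columns
cat_summary = toy[["Street"]].describe()    # count, unique, top, freq of a text column

print(shape)
print(summary)
print(cat_summary)
```

`describe()` only summarises numeric columns by default, which is why the text column is described separately here.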

### Importing Necessary libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df= pd.read_csv('House Prices.csv')
df.head()

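| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |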


In [3]:
df.shape

(1460, 81)

## `2. Data Cleaning`

In [4]:
# df.isna().sum()
null_columns= df.columns[df.isna().any()]
df[null_columns].isna().sum()

LotFrontage      259
Alley           1369
MasVnrType         8
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64

In [5]:
# The Id column adds no predictive value, so drop it.
# Alley, PoolQC and MiscFeature are more than 90% null, Fence is about 81% null
# and FireplaceQu about 47% null (out of 1460 rows); drop all of them as well.

df.drop(['Id', 'Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature'], axis= 1, inplace= True)
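
Dividing each null count by the number of rows gives the share of missing values per column, which makes a drop rule explicit. A minimal sketch on made-up data; the `threshold` of 0.5 is a hypothetical cut-off chosen for illustration, not a value used in this notebook:

```python
import numpy as np
import pandas as pd

# Toy frame; column names mirror the dataset but the values are made up.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],   # 75% missing
    "FireplaceQu": [np.nan, "TA", "TA", "Gd"],   # 25% missing
})

null_share = toy.isna().mean()    # fraction of missing values per column
threshold = 0.5                   # hypothetical cut-off for this sketch
to_drop = null_share[null_share > threshold].index.tolist()

cleaned = toy.drop(columns=to_drop)   # returns a new frame, `toy` is unchanged
print(null_share)
print(to_drop)
```

Computing the share first also shows which columns sit well below the cut-off and might be worth imputing instead of dropping.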

In [6]:
# df.isna().sum()
null_columns= df.columns[df.isna().any()]
df[null_columns].isna().sum()

LotFrontage     259
MasVnrType        8
MasVnrArea        8
BsmtQual         37
BsmtCond         37
BsmtExposure     38
BsmtFinType1     37
BsmtFinType2     38
Electrical        1
GarageType       81
GarageYrBlt      81
GarageFinish     81
GarageQual       81
GarageCond       81
dtype: int64

In [7]:
df.shape

(1460, 75)

In [8]:
[df[i].dtypes for i in null_columns]

[dtype('float64'),
 dtype('O'),
 dtype('float64'),
 dtype('O'),
 dtype('O'),
 dtype('O'),
 dtype('O'),
 dtype('O'),
 dtype('O'),
 dtype('O'),
 dtype('float64'),
 dtype('O'),
 dtype('O'),
 dtype('O')]

In [9]:
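# Fill missing values in numerical features with the column median:
num_features= df[null_columns].select_dtypes(exclude=['object']).columns

for c in num_features:
    # assign back instead of inplace=True on a column selection, which newer pandas deprecates
    df[c] = df[c].fillna(df[c].median())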

In [10]:
# Fill missing values in object features with a forward fill (ffill):
object_features= df[null_columns].select_dtypes(include=['object']).columns

for c in object_features:
    # Series.ffill() replaces the deprecated fillna(method='ffill')
    df[c] = df[c].ffill()
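
The two imputation strategies behave differently at the edges, which a small made-up frame can show. This is a sketch under the assumption of toy values, not the notebook's data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "GarageType": [np.nan, "Attchd", np.nan, "Detchd"],
})

# Median imputation for the numeric column (the median of 60, 65, 68 is 65).
toy["LotFrontage"] = toy["LotFrontage"].fillna(toy["LotFrontage"].median())

# Forward fill copies the previous row's value downward, so a missing value
# in the first row has nothing to copy from and stays missing.
toy["GarageType"] = toy["GarageType"].ffill()

remaining = toy.isna().sum()
print(toy)
print(remaining)
```

If a leading gap like this remains, a following `.bfill()` or a fill with the most frequent value would close it; which one is appropriate depends on the data.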

In [11]:
# df.isna().sum()
null_columns= df.columns[df.isna().any()]
df[null_columns].isna().sum()        # as shown, there are no null values left

Series([], dtype: float64)

## `3. Data Preprocessing`

In [12]:
# Collect the object (categorical) features; copy so the encoding below
# does not write into a slice of df:
object_data = df.select_dtypes(include=['object']).copy()

# Label Encoder, to convert categorical values to integers:
from sklearn.preprocessing import LabelEncoder

la = LabelEncoder()
for col in object_data.columns:
    object_data[col] = la.fit_transform(object_data[col])

# Concatenate the encoded object columns with the numerical columns:
num_data = df.select_dtypes(exclude=['object'])
data = pd.concat([object_data, num_data], axis=1)
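
To see what the label encoding does to each column, here is a minimal sketch on made-up values. Working on an explicit copy avoids pandas' SettingWithCopyWarning, and the `mappings` dictionary is a hypothetical helper added here to keep the category-to-integer mapping inspectable:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "LotShape": ["Reg", "IR1", "Reg"],
})

encoded = toy.copy()    # explicit copy, so assignments do not touch `toy`
mappings = {}
for col in encoded.columns:
    le = LabelEncoder()    # one encoder per column keeps each mapping recoverable
    encoded[col] = le.fit_transform(encoded[col])
    # classes_ is sorted, so codes follow alphabetical order of the labels
    mappings[col] = {str(label): code for code, label in enumerate(le.classes_)}

print(encoded)
print(mappings)
```

Because the codes follow the sorted order of the labels, they impose an arbitrary numeric ordering on the categories, which matters for models that treat the values as magnitudes.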


In [13]:
data.head()

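| | MSZoning | Street | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1 | 3 | 3 | 0 | 4 | 0 | 5 | 2 | 2 | ... | 0 | 61 | 0 | 0 | 0 | 0 | 0 | 2 | 2008 | 208500 |
| 1 | 3 | 1 | 3 | 3 | 0 | 2 | 0 | 24 | 1 | 2 | ... | 298 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 2007 | 181500 |
| 2 | 3 | 1 | 0 | 3 | 0 | 4 | 0 | 5 | 2 | 2 | ... | 0 | 42 | 0 | 0 | 0 | 0 | 0 | 9 | 2008 | 223500 |
| 3 | 3 | 1 | 0 | 3 | 0 | 0 | 0 | 6 | 2 | 2 | ... | 0 | 35 | 272 | 0 | 0 | 0 | 0 | 2 | 2006 | 140000 |
| 4 | 3 | 1 | 0 | 3 | 0 | 2 | 0 | 15 | 2 | 2 | ... | 192 | 84 | 0 | 0 | 0 | 0 | 0 | 12 | 2008 | 250000 |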


## `4. Make Polynomial Features`

**Polynomial features**:
Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].

In [14]:
### Example to understand the functionality of parameter [interaction_only= True / False]

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
X = np.arange(6).reshape(3, 2)
X
# output:
# array([[0, 1],
#       [2, 3],
#       [4, 5]])

poly = PolynomialFeatures(2, interaction_only=False)    # interaction_only=False yields [1, a, b, a^2, ab, b^2]
poly.fit_transform(X)
# output:
# array([[ 1.,  0.,  1.,  0.,  0.,  1.],
#        [ 1.,  2.,  3.,  4.,  6.,  9.],
#        [ 1.,  4.,  5., 16., 20., 25.]])

poly = PolynomialFeatures(interaction_only=True)    # interaction_only=True yields [1, a, b, ab]; the pure powers a^2 and b^2 are omitted
poly.fit_transform(X)
# output:
# array([[ 1.,  0.,  1.,  0.],
#        [ 1.,  2.,  3.,  6.],
#        [ 1.,  4.,  5., 20.]])

array([[ 1.,  0.,  1.,  0.],
       [ 1.,  2.,  3.,  6.],
       [ 1.,  4.,  5., 20.]])

In [15]:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(2, interaction_only= True)
output_nparray = poly.fit_transform(data)
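# Name each output column after the input columns it multiplies. poly was fitted on `data`,
# so the names must follow data.columns (its column order differs from df.columns).
target_feature_names = ['x'.join('{}^{}'.format(name, power) for name, power in zip(data.columns, powers) if power != 0) for powers in poly.powers_]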

output_df = pd.DataFrame(output_nparray, columns= target_feature_names)

## `5. Modeling`

In [16]:
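# Splitting data: X is every expanded column except the last; y is the last column,
# i.e. the interaction of the last two columns of `data` (YrSold x SalePrice).
# Note that X still contains SalePrice and its other interaction terms.
X = output_df.iloc[:, :-1]
y = output_df.iloc[:, -1]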

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### 1. LinearRegression:

In [17]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

LR = LinearRegression()
LR.fit(X_train, y_train)
y_pre= LR.predict(X_test)

print('Linear Regression Train Accuracy is : '   , r2_score(y_train, LR.predict(X_train)))
print('Linear Regression Train Score is :    '   , LR.score(X_train, y_train))
print('Linear Regression Test Accuracy is :  '   , r2_score(y_test, y_pre))
print('Linear Regression Test Score is :     '   , LR.score(X_test, y_test))
 
print('\n')
print('Linear Regression Train MSE is :      '   , round(mean_squared_error(y_train, LR.predict(X_train)), 4))
print('Linear Regression Test MSE is :       '   , round(mean_squared_error(y_test, y_pre), 4))

Linear Regression Train Accuracy is :  1.0
Linear Regression Train Score is :     1.0
Linear Regression Test Accuracy is :   0.9999526056913903
Linear Regression Test Score is :      0.9999526056913903


Linear Regression Train MSE is :       0.0
Linear Regression Test MSE is :        1464888151698.8047
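
A test MSE on the order of 10^12 next to an R² above 0.9999 can look contradictory. The two are linked by R² = 1 - MSE / Var(y), so a large MSE still gives an R² near 1 when the target's variance is far larger. A small sketch with made-up numbers on a similar scale to the values printed in this notebook:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and errors, chosen only to mimic the order of magnitude.
y_true = np.array([3.0e8, 6.5e8, 2.3e8, 3.2e8, 6.3e8])
y_pred = y_true + np.array([1.0e6, -1.0e6, 2.0e6, -2.0e6, 1.0e6])

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
rmse = np.sqrt(mse)    # back in the units of the target

# Same R^2 computed from the MSE and the (population) variance of the target.
r2_from_mse = 1 - mse / np.var(y_true)

print(mse, rmse, r2, r2_from_mse)
```

Reporting the RMSE alongside the MSE puts the error back in the target's own units, which is usually easier to judge.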


In [18]:
print('Y predict: ',y_pre[:5])
print('Y test: ', y_test[:5])

Y predict:  [3.09777830e+08 6.53149286e+08 2.31355910e+08 3.19041557e+08
 6.33561031e+08]
Y test:  892     309927000.0
1105    653250000.0
413     231150000.0
522     318954000.0
1036    633839500.0
Name: SaleCondition^1xSalePrice^1, dtype: float64


### 2. DecisionTreeRegressor:

In [19]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error

DT = DecisionTreeRegressor(max_depth = 4, random_state= 1)
DT.fit(X_train, y_train)
y_pre= DT.predict(X_test)

print('Decision Tree Train Accuracy is : '   , r2_score(y_train, DT.predict(X_train)))
print('Decision Tree Train Score is :    '   , DT.score(X_train, y_train))
print('Decision Tree Test Accuracy is :  '   , r2_score(y_test, y_pre))
print('Decision Tree Test Score is :     '   , DT.score(X_test, y_test))
 
print('\n')
print('Decision Tree Train MSE is :      '   , round(mean_squared_error(y_train, DT.predict(X_train)), 4))
print('Decision Tree Test MSE is :       '   , round(mean_squared_error(y_test, y_pre), 4))


Decision Tree Train Accuracy is :  0.9893572156644006
Decision Tree Train Score is :     0.9893572156644006
Decision Tree Test Accuracy is :   0.9670343303978861
Decision Tree Test Score is :      0.9670343303978861


Decision Tree Train MSE is :       255904166641344.22
Decision Tree Test MSE is :        1018920208555727.5


In [20]:
print('Y predict: ',y_pre[:5])
print('Y test: ', y_test[:5])

Y predict:  [3.16932755e+08 6.40672301e+08 2.49823286e+08 3.16932755e+08
 6.40672301e+08]
Y test:  892     309927000.0
1105    653250000.0
413     231150000.0
522     318954000.0
1036    633839500.0
Name: SaleCondition^1xSalePrice^1, dtype: float64
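
The repeated values among the tree's predictions above (3.16932755e+08 and 6.40672301e+08 each appear twice) are expected: a tree with `max_depth=4` has at most 2^4 = 16 leaves, and each leaf predicts a single constant. A minimal sketch on synthetic data (the features and target below are generated for illustration only):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))                    # synthetic features
y = 5 * X[:, 0] + X[:, 1] ** 2 + rng.normal(size=200)    # synthetic target

tree = DecisionTreeRegressor(max_depth=4, random_state=1)
tree.fit(X, y)

preds = tree.predict(X)
n_leaves = tree.get_n_leaves()          # number of terminal nodes in the fitted tree
n_distinct = len(np.unique(preds))      # distinct predicted values over 200 rows

print(n_leaves, n_distinct)
```

However many rows are scored, the number of distinct predictions cannot exceed the number of leaves, which is why a shallow tree produces a step-like output.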
