## Project: Predicting Boston Housing Prices
### Data  

This project uses a modified version of the Boston Housing dataset. It consists of 489 data points, each with 3 features. The original dataset comes from the <a href="https://archive.ics.uci.edu/ml/index.php">UCI Machine Learning Repository</a>, and the full dataset is also available on
<a href="https://www.kaggle.com/c/boston-housing">Kaggle</a>.


### Features  

RM: average number of rooms per dwelling (total number of rooms in home)  
LSTAT: percentage of population considered lower status (neighborhood poverty level)  
PTRATIO: pupil-teacher ratio by town (student-teacher ratio of nearby schools)  
Target variable: MEDV: median value of owner-occupied homes (house price)  

In [1]:
# Import libraries:
import numpy as np
import pandas as pd

In [2]:
# Load the Boston housing dataset
data = pd.read_csv('housing.csv')
data.info()
data.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 489 entries, 0 to 488
Data columns (total 4 columns):
RM         489 non-null float64
LSTAT      489 non-null float64
PTRATIO    489 non-null float64
MEDV       489 non-null float64
dtypes: float64(4)
memory usage: 15.4 KB


,RM,LSTAT,PTRATIO,MEDV
0,6.575,4.98,15.3,504000.0
1,6.421,9.14,17.8,453600.0
2,7.185,4.03,17.8,728700.0
3,6.998,2.94,18.7,701400.0
4,7.147,5.33,18.7,760200.0
5,6.43,5.21,18.7,602700.0
6,6.012,12.43,15.2,480900.0
7,6.172,19.15,15.2,569100.0
8,5.631,29.93,15.2,346500.0
9,6.004,17.1,15.2,396900.0


In [3]:
# Data Exploration
data.plot.scatter('RM','MEDV',c='r');
data.plot.scatter('LSTAT','MEDV',c='c');
data.plot.scatter('PTRATIO','MEDV',c='g');





In [4]:
# Define the variables: features (inputs) and prices (target)
prices = data['MEDV']

features = data.drop('MEDV', axis = 1)
features.head(5)    

,RM,LSTAT,PTRATIO
0,6.575,4.98,15.3
1,6.421,9.14,17.8
2,7.185,4.03,17.8
3,6.998,2.94,18.7
4,7.147,5.33,18.7


In [31]:
# Preprocessing step:
# Generate new features consisting of all polynomial combinations of the features up to the specified degree
from sklearn.preprocessing import PolynomialFeatures
# Define the transformer:
poly_fun = PolynomialFeatures(degree=2)

# Fit the transformer to the data:
poly_fun = poly_fun.fit(features)
# The fitted transformer is now ready to use and can be reused at any time:
# pass it data with the same three columns and call transform.
#
# Transform the data into degree-2 polynomial features:
X_trasformed = poly_fun.transform(features)
# This is the new input (feature matrix). We will pass it to our model later:
X_trasformed


array([[  1.    ,   6.575 ,   4.98  , ...,  24.8004,  76.194 , 234.09  ],
       [  1.    ,   6.421 ,   9.14  , ...,  83.5396, 162.692 , 316.84  ],
       [  1.    ,   7.185 ,   4.03  , ...,  16.2409,  71.734 , 316.84  ],
       ...,
       [  1.    ,   6.976 ,   5.64  , ...,  31.8096, 118.44  , 441.    ],
       [  1.    ,   6.794 ,   6.48  , ...,  41.9904, 136.08  , 441.    ],
       [  1.    ,   6.03  ,   7.88  , ...,  62.0944, 165.48  , 441.    ]])
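To make the ten columns above less opaque, here is a minimal sketch in plain Python that rebuilds the degree-2 expansion for the first house by hand. The helper name `expand_degree_2` is hypothetical and written only for this illustration; it mirrors the column order seen in the output (a constant 1, the three original features, then every pairwise product), but it is not how scikit-learn is implemented internally.

```python
from itertools import combinations_with_replacement


def expand_degree_2(row):
    """Return [1, x_i..., x_i * x_j...] for one sample (hypothetical helper)."""
    bias = [1.0]
    linear = list(row)
    # Every unordered pair (i, j) with i <= j contributes one degree-2 term.
    quadratic = [
        row[i] * row[j]
        for i, j in combinations_with_replacement(range(len(row)), 2)
    ]
    return bias + linear + quadratic


first_row = [6.575, 4.98, 15.3]  # RM, LSTAT, PTRATIO of the first house
expanded = expand_degree_2(first_row)

print(len(expanded))   # 1 bias + 3 linear + 6 quadratic terms = 10 columns
print(expanded[-3:])   # LSTAT^2, LSTAT * PTRATIO, PTRATIO^2
```

The last three values correspond to the last three entries of the first row printed above (24.8004, 76.194 and 234.09), up to floating-point rounding.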

In [20]:
# Split the data into two sets: a training set and a testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_trasformed, prices, test_size = 0.25)

In [21]:
print("training set:",X_train.shape,y_train.shape)
print("testing set:",X_test.shape,y_test.shape[0])



training set: (366, 10) (366,)
testing set: (123, 10) 123
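The split above does not set `random_state`, so the rows that land in each set (and therefore the scores reported later) will change from run to run. The sketch below shows one way to make the split repeatable. The arrays are placeholders with the same shapes as `X_trasformed` and `prices`, and the seed value 42 is an arbitrary assumption, not something the notebook used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shapes as X_trasformed and prices.
X_demo = np.arange(489 * 10, dtype=float).reshape(489, 10)
y_demo = np.arange(489, dtype=float)

# Fixing random_state (42 is arbitrary) makes the shuffle repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

print(X_tr.shape, X_te.shape)        # same sizes as in the notebook output
print(np.array_equal(X_te, X_te2))   # identical test rows on both calls
```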


Unfortunately, scikit-learn has no single estimator that builds a polynomial regression model directly.
Instead, we transform the input data (features/X) into polynomial terms of the chosen degree and then fit a LinearRegression model on the transformed features.
This transformation is called a preprocessing step. We will talk about the preprocessing step in detail.
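One common way to bundle the two steps is scikit-learn's `make_pipeline`, which chains the transformer and the regressor into a single object. The sketch below is an illustration on synthetic data, not the notebook's own workflow; the names `X_demo`, `y_demo` and `poly_model`, and the quadratic used to generate the target, are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumption for illustration): y is an exact quadratic
# function of two features, so a degree-2 model can fit it perfectly.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = (
    3.0
    + 2.0 * X_demo[:, 0]
    - X_demo[:, 1]
    + 0.5 * X_demo[:, 0] * X_demo[:, 1]
)

# The pipeline expands the features, then fits a linear model on them.
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X_demo, y_demo)

# Raw (untransformed) inputs go straight into the pipeline.
print(poly_model.score(X_demo, y_demo))   # R^2 on the training data
print(poly_model.predict([[1.0, 2.0]]))   # true value is 3 + 2 - 2 + 1 = 4
```

With a pipeline, new inputs are expanded automatically at prediction time, so there is no separate transform call to forget.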

In [22]:
# Create the regression model:
from sklearn.linear_model import LinearRegression
model = LinearRegression()


In [23]:
# Fit (train) the model:
model.fit(X_train,y_train);

In [24]:
# Predict the prices for X_test with the model:
y_pred=model.predict(X_test)

In [25]:
# Model performance on the test set (R^2 score):
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

0.8076757056859534
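The number above is the coefficient of determination, R² = 1 - SS_res / SS_tot, rather than an accuracy in the classification sense. As a rough sketch of what `r2_score` computes in the simple single-output, unweighted case, the function below does the same arithmetic in plain Python. The name `r_squared` and the four sample values are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (illustrative helper)."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Small made-up example values.
y_true_demo = [3.0, -0.5, 2.0, 7.0]
y_pred_demo = [2.5, 0.0, 2.0, 8.0]
demo_score = r_squared(y_true_demo, y_pred_demo)
print(round(demo_score, 4))  # 1 - 1.5 / 29.1875, about 0.9486
```

A score of 1 means perfect predictions, 0 means no better than predicting the mean price, and negative values are possible for a model that does worse than the mean.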

In [26]:
# Extra: R^2 on the training set, to check for underfitting and overfitting
y_train_pre=model.predict(X_train)
r2_score(y_train, y_train_pre)

0.8438492232351235
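Comparing one training score with one test score depends on which rows happened to fall into each set. A possible complement, sketched below under assumptions, is k-fold cross-validation with `cross_val_score`, which repeats the fit on several splits and reports one R² per fold. The data here is synthetic, and the choice of 5 folds is arbitrary; neither comes from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumption): a quadratic signal plus noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = X_demo[:, 0] ** 2 - 4.0 * X_demo[:, 1] + rng.normal(0, 1.0, size=200)

# Putting the expansion inside the pipeline means it is refit on each fold.
pipeline = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="r2")

print(scores)                        # one R^2 value per fold
print(scores.mean(), scores.std())   # average and spread across folds
```

A large spread across folds, or a mean well below the training score, would be a hint of overfitting.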

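With an R² of about 0.81 on the test set and 0.84 on the training set, the model performs slightly better than the multiple linear regression model we saw during the session. The small gap between the two scores suggests that the model is not overfitting badly.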

## Extra:  
Predict the price of a house with:  
Total number of rooms in home = 7 rooms  
Neighborhood poverty level = 20%  
Student-teacher ratio of nearby schools = 19-to-1  


### Note: 
Because the model was trained on transformed data, every new input must also be transformed with poly_fun.transform() before it is passed to the model.
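To illustrate why the transform matters, the sketch below fits a small model on expanded synthetic features and then tries to predict from a raw three-column input. The data and the names `expander` and `demo_model` are assumptions for the example. In scikit-learn the mismatch in column count (3 versus 10) is expected to surface as a `ValueError`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic training data with three features (assumption for illustration).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 3))
y_demo = X_demo.sum(axis=1)

expander = PolynomialFeatures(degree=2).fit(X_demo)
demo_model = LinearRegression().fit(expander.transform(X_demo), y_demo)

new_house = np.array([[7.0, 20.0, 19.0]])  # rooms, poverty level, pupil-teacher ratio

try:
    demo_model.predict(new_house)  # 3 raw columns, but the model was fit on 10
    raised = False
except ValueError as err:
    raised = True
    print("predict failed:", err)

# Transforming first gives the model the 10 columns it was trained on.
prediction = demo_model.predict(expander.transform(new_house))
print(prediction.shape)
```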

In [35]:
extra=np.array([7,20,19]).reshape(1,-1)

In [36]:
extra_trasformed = poly_fun.transform(extra)


In [37]:
model.predict(extra_trasformed)

array([386776.50852631])

#### For more information, read about polynomial preprocessing at the link below:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html