# What is polynomial regression?
Polynomial regression is a special case of linear regression in which we fit a polynomial equation to data that shows a curvilinear relationship between the target variable and the independent variables.
In a curvilinear relationship, the value of the target variable changes in a non-uniform manner with respect to the predictor(s).
In linear regression with a single predictor, we have the following equation:
linear regression equation

<img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2020/03/pr2.png" alt="drawing" width="200"/>

where

* Y is the target,
* x is the predictor,
* 𝜃0 is the bias,
* and 𝜃1 is the weight in the regression equation.
          
This linear equation can only represent a linear relationship. In polynomial regression, we instead have a polynomial equation of degree n, represented as:
polynomial regression equation

<img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2020/03/pr3.png" alt="drawing" width="500"/>

Here:

* 𝜃0 is the bias,
* 𝜃1, 𝜃2, …, 𝜃n are the weights in the equation of the polynomial regression,
* and n is the degree of the polynomial.

The number of higher-order terms increases with the increasing value of n, and hence the equation becomes more complicated.
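
To make the "special case of linear regression" point concrete, here is a minimal sketch on toy data. The values and the names `x`, `y_toy`, `expanded`, and `toy_model` are made up for illustration and are not part of the dataset used later. The predictor is expanded into the columns x and x², and an ordinary linear model is then fitted on those columns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Toy data (hypothetical, not from the dataset): y = 1 + 2x + 3x^2 exactly
x = np.arange(5, dtype=float).reshape(-1, 1)
y_toy = 1 + 2 * x[:, 0] + 3 * x[:, 0] ** 2

# Expand x into the columns [x, x^2]; the bias is left to LinearRegression
expanded = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)

# An ordinary linear fit on the expanded columns recovers the polynomial
toy_model = LinearRegression().fit(expanded, y_toy)
print(toy_model.intercept_, toy_model.coef_)
```

The fitted intercept plays the role of 𝜃0 and the two coefficients play the roles of 𝜃1 and 𝜃2. The model is still linear in the weights, even though it is not linear in x.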

## But what if we have more than one predictor?

For 2 predictors, the equation of the polynomial regression becomes:

two degree polynomial regression

<img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2020/03/pr10.png" alt="drawing" width="500"/>

where,

* Y is the target,

* x1, x2 are the predictors,

* 𝜃0 is the bias,

* and 𝜃1, 𝜃2, 𝜃3, 𝜃4, and 𝜃5 are the weights in the regression equation.

For n predictors, the equation includes all the possible combinations of different order polynomials. This is known as Multi-dimensional Polynomial Regression.
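
As a rough illustration of which terms appear for 2 predictors, the sketch below expands a single made-up row with `PolynomialFeatures(degree=2)` and prints the exponents of each generated column. The row values and the name `X_toy` are assumptions for the example. The column order is scikit-learn's own and may differ from the order of the weights in the equation above.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One toy row with x1 = 2 and x2 = 3 (hypothetical values)
X_toy = np.array([[2.0, 3.0]])

poly = PolynomialFeatures(degree=2)
expanded = poly.fit_transform(X_toy)

# Each row of powers_ gives the exponents of (x1, x2) for one output column
for exponents, value in zip(poly.powers_, expanded[0]):
    print(f"x1^{exponents[0]} * x2^{exponents[1]} = {value}")
```

There are six columns: the bias, x1, x2, x1², x1·x2, and x2². This matches the six parameters 𝜃0 to 𝜃5.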

For more information, you can check out these websites:

[Introduction to Polynomial Regression (with Python Implementation)](https://www.analyticsvidhya.com/blog/2020/03/polynomial-regression-python/)

[Polynomial regression (Wikipedia)](https://en.wikipedia.org/wiki/Polynomial_regression)

# Import all necessary Libraries:
First, we import the libraries.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Import the Dataset:

In [None]:
df= pd.read_csv('../input/real-estate-price-prediction/Real estate.csv')

# Data Overview:
The dataset I worked on describes house prices based on these 6 parameters:
* 1-transaction date
* 2-house age
* 3-distance to the nearest MRT station
* 4-number of convenience stores
* 5-latitude
* 6-longitude

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.info()

# Exploratory Data Analysis:
The chart below shows the distribution of 'house price of unit area'. Based on this chart, the mean of 'house price of unit area' is about 40 and the maximum price is about 120.

In [None]:
sns.displot(df['Y house price of unit area'], kde=True, aspect=2, color='purple')
plt.show()

## Correlation:
To check the correlation between each parameter and the house price, I drew 6 scatter plots, one per parameter.
* chart 1: there is no noticeable correlation between transaction date and house price.
* chart 2: there is a small negative correlation between house age and house price.
* chart 3: there is a negative correlation between distance to the nearest MRT station and house price. This means that as the 'distance to the nearest MRT station' increases, the house price decreases.
* chart 4: there is a positive correlation. The more convenience stores there are, the higher the house price.
* chart 5 and 6: both charts show a positive correlation.

In [None]:
fig, axes= plt.subplots(nrows=3, ncols=2, figsize=(15,15))
fig.subplots_adjust(wspace=0.3, hspace=0.3)

for i in range(1, df.shape[1]-1):
    axes[(i-1)//2, (i+1)%2].set_title(f'chart {i}').set_size(20)
    sns.scatterplot(data=df, x=df.iloc[:, i], y='Y house price of unit area', ax=axes[(i-1)//2, (i+1)%2])

To better understand the correlations, look at the last row of the heatmap below. As mentioned, 'house age' and 'distance to the nearest MRT station' have a negative correlation with house price, while 'number of convenience stores' and the geographical location (latitude and longitude) have a positive correlation with it.

__Note__: With the 'Greens' colormap, darker green marks a stronger positive correlation, and shades close to white mark a negative correlation.

In [None]:
fig = plt.figure(figsize=(10,5))
sns.heatmap(df.iloc[:, 1:].corr(), annot=True, cmap='Greens')
plt.show()
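
The last row of the heatmap can also be read as a sorted list of numbers. The sketch below shows the idea on a small made-up table. The column names `distance`, `stores`, and `price` and all values are hypothetical stand-ins, not rows from the real dataset.

```python
import pandas as pd

# Hypothetical toy table: price falls with distance and rises with store count
toy = pd.DataFrame({
    "distance": [100, 400, 900, 1500, 3000],
    "stores": [9, 7, 5, 2, 0],
    "price": [60, 50, 42, 30, 18],
})

# Correlation of every column with the target, sorted from most negative to most positive
corr_with_price = toy.corr()["price"].drop("price").sort_values()
print(corr_with_price)
```

On the real data, the same pattern would be `df.iloc[:, 1:].corr()['Y house price of unit area']`, assuming the column names used in the cells above.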

In [None]:
X = df.drop(['Y house price of unit area', 'No'],axis=1)
y = df['Y house price of unit area']

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# Make and Train the Model:
In this part, we use a for loop to build and train a model for each degree from 1 to 9. Next, we compare the RMSEs and choose the best degree for predicting the house price.

In [None]:
# Train List of RMSE per degree
train_RMSE_list=[]
#Test List of RMSE per degree
test_RMSE_list=[]

for d in range(1,10):
    
    #Preprocessing
    #create poly data set for degree (d)
    polynomial_converter= PolynomialFeatures(degree=d)
    poly_features= polynomial_converter.fit_transform(X)
    
    #Split the dataset
    X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=101)
    
    #Train the Model
    polymodel=LinearRegression()
    polymodel.fit(X_train, y_train)
    
    #Predicting on both Train & Test Data
    y_train_pred=polymodel.predict(X_train)
    y_test_pred=polymodel.predict(X_test)
    
    #RMSE of Train set
    train_RMSE=np.sqrt(metrics.mean_squared_error(y_train, y_train_pred))
    
    #RMSE of Test Set
    test_RMSE=np.sqrt(metrics.mean_squared_error(y_test, y_test_pred))
    
    #Append the RMSE to the Train and Test List
    train_RMSE_list.append(train_RMSE)
    test_RMSE_list.append(test_RMSE)
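
The same loop can be written more compactly with a scikit-learn pipeline. The following is only a sketch on synthetic data, not the notebook's method: `X_demo`, `y_demo`, and the helper `rmse_for_degree` are names invented for the example. The data is split once, and the polynomial expansion happens inside the pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumption for the sketch): a quadratic signal plus small noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(200, 2))
y_demo = 1 + 2 * X_demo[:, 0] - 3 * X_demo[:, 1] ** 2 + rng.normal(0, 0.1, size=200)

# Split once, then reuse the same split for every degree
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=101)

def rmse_for_degree(d):
    """Fit a degree-d polynomial model and return (train RMSE, test RMSE)."""
    model = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
    model.fit(X_tr, y_tr)
    train_rmse = np.sqrt(mean_squared_error(y_tr, model.predict(X_tr)))
    test_rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    return train_rmse, test_rmse

results = {d: rmse_for_degree(d) for d in range(1, 5)}
for d, (train_rmse, test_rmse) in results.items():
    print(d, round(train_rmse, 3), round(test_rmse, 3))
```

Because the synthetic signal is quadratic, the error drops sharply from degree 1 to degree 2, which mirrors the kind of comparison made in the next section.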

# Compare the RMSEs:

In [None]:
display(pd.DataFrame({'degree': list(range(1, 10)),'train_RMSE': train_RMSE_list,'test_RMSE':test_RMSE_list}).set_index('degree'))

fig = plt.figure(figsize=(10,5))
plt.plot(range(1,5), train_RMSE_list[:4], label='Train RMSE')
plt.plot(range(1,5), test_RMSE_list[:4], label='Test RMSE')

plt.xlabel('Polynomial Degree')
plt.ylabel('RMSE')
plt.legend()
plt.show()

Based on the chart above, the best degree for our model is 2. At this point, test_RMSE is lowest.
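
The degree with the lowest test RMSE can also be picked programmatically instead of being read off the chart. The numbers below are placeholders made up for this sketch, not the RMSEs produced by the notebook. In the notebook, `test_RMSE_list` would take the place of `example_test_rmse`.

```python
import numpy as np

# Placeholder values for illustration only (not the notebook's results)
example_test_rmse = [9.1, 7.4, 8.8, 15.2]

# Degrees start at 1, so shift the zero-based argmin index by one
degrees = list(range(1, len(example_test_rmse) + 1))
best_degree = degrees[int(np.argmin(example_test_rmse))]
print(best_degree)  # 2
```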

Now we build a model with degree 2 and check the scatter plot of 'test_residuals' against 'y_test'.

In [None]:
#create poly data set for degree 2
polynomial_converter= PolynomialFeatures(degree=2)
poly_features= polynomial_converter.fit_transform(X)

#Split the dataset
X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=101)

#Train the Model
polymodel=LinearRegression()
polymodel.fit(X_train, y_train)

#Predicting on both Train & Test Data
y_train_pred=polymodel.predict(X_train)
y_test_pred=polymodel.predict(X_test)

In [None]:
test_residuals = y_test - y_test_pred

As you can see, the mean of 'test_residuals' is about 0.
The scatter plot shows no pattern, and the dots are distributed almost randomly above and below the red line. If the dots followed a pattern, it would suggest that our model is not a good model.

In [None]:
fig = plt.figure()
ax1 = fig.add_axes([0, 0, 1, 1])
ax2 = fig.add_axes([1.2, 0, 1, 1])

sns.scatterplot(x=y_test, y=test_residuals, ax=ax1)
ax1.axhline(y=0, color='r', ls='--')

sns.kdeplot(test_residuals, color='purple', ax=ax2)
plt.show()
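
The residual check can also be summarized with a few numbers. The sketch below does this on synthetic data with a plain linear model, purely to show the pattern. `X_demo`, `y_demo`, and `demo_residuals` are hypothetical names, and the data is generated rather than taken from the dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): a linear signal with Gaussian noise of standard deviation 2
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = 3 * X_demo[:, 0] + 5 + rng.normal(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=101)
demo_model = LinearRegression().fit(X_tr, y_tr)

# Residuals on the held-out data
demo_residuals = y_te - demo_model.predict(X_te)

# For a well-specified model: mean near 0, and roughly half the residuals above 0
mean_residual = demo_residuals.mean()
share_above_zero = (demo_residuals > 0).mean()
print(mean_residual, demo_residuals.std(), share_above_zero)
```

In the notebook itself, `test_residuals.mean()` gives the corresponding number for the degree 2 model.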

# What is overfitting?

In the RMSE chart above, beyond degree 2 the error on the training data is low, but the error on the test data is high. In this situation our model is __overfit__.

For example, if we set the degree to 8, the scatter plot looks like the chart below. You can see that the dots fall along a line and show a clear pattern. This model is not a good model and is __overfit__.
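
One way to see why a high degree overfits so easily is to count how many columns the polynomial expansion produces for the 6 predictors. This is a small sketch. The all-zero `placeholder` array is an assumption used only to tell `PolynomialFeatures` how many input columns there are, since its values do not affect the count.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_predictors = 6  # the six parameters listed in the data overview
placeholder = np.zeros((1, n_predictors))  # values do not matter for counting

# Number of generated columns (including the bias column) for each degree
feature_counts = {}
for d in range(1, 10):
    feature_counts[d] = PolynomialFeatures(degree=d).fit(placeholder).n_output_features_

print(feature_counts)
```

Degree 2 gives 28 columns, while degree 8 gives 3003. When the training set has fewer rows than the model has columns, plain least squares can match the training data almost exactly without capturing a relationship that holds on new data.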

In [None]:
#Note: this is an overfit model!

#create poly data set for degree 8
polynomial_converter= PolynomialFeatures(degree=8)
poly_features= polynomial_converter.fit_transform(X)

#Split the dataset
X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=101)

#Train the Model
polymodel=LinearRegression()
polymodel.fit(X_train, y_train)

#Predicting on both Train & Test Data
y_train_pred=polymodel.predict(X_train)
y_test_pred=polymodel.predict(X_test)

test_residuals = y_test - y_test_pred

sns.scatterplot(x=y_test, y=test_residuals)
plt.axhline(y=0, color='r', ls='--')
plt.show()