# Model Development
- A model or estimator can be thought of as a mathematical equation used to predict a value given one or more other values.
- It relates one or more independent variables to a dependent variable.
- Usually, the more relevant data we have, the more accurate our model is.
- To understand why more data is important, consider the following situation:
    * We have two almost identical cars.
    * We want to use our model to determine the price of the two cars; one is pink and the other is red.
    * If our model's independent variables or features do not include color, our model will predict the same price for both cars.
    * If our model's independent variables or features do include color, our model can predict a different price for each car, for example a higher price for the red car than for the pink car.
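
The effect of leaving out a feature can be illustrated with a small sketch. All numbers below are made up for illustration (the mileage values, the `is_pink` flag and the prices are assumptions, not data from this notebook), and NumPy's least-squares solver stands in for a full regression model.

```python
import numpy as np

# Hypothetical toy data: mileage (in 1000 km), an is_pink flag, and price.
mileage = np.array([10.0, 10.0, 50.0, 50.0, 90.0, 90.0])
is_pink = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
price = np.array([20000.0, 17000.0, 16000.0, 13000.0, 12000.0, 9000.0])

def fit_predict(features, target, new_rows):
    """Fit ordinary least squares with an intercept and predict for new_rows."""
    A = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    B = np.column_stack([np.ones(len(new_rows)), new_rows])
    return B @ coef

# Two almost identical cars: same mileage, one red (0) and one pink (1).
without_color = fit_predict(mileage.reshape(-1, 1), price,
                            np.array([[30.0], [30.0]]))
with_color = fit_predict(np.column_stack([mileage, is_pink]), price,
                         np.array([[30.0, 0.0], [30.0, 1.0]]))

print(without_color)  # same prediction for both cars
print(with_color)     # the pink car gets a lower predicted price
```

Without the color column the model has no way to tell the two cars apart, so both predictions coincide.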

## Linear Regression and Multiple Regression
* `Linear Regression` refers to using one independent variable to make a prediction.
* `Multiple Regression` refers to using two or more independent variables to make a prediction.
* `Simple Linear Regression (SLR)` is a method to help us understand the relationship between two variables:
    * The predictor/independent variable (X)
    * The response/dependent variable that we want to predict (Y)
* The result of Linear Regression is a `linear function` that predicts the response (dependent) variable as a function of the predictor (independent) variable.
* `Y: Response/Target/Dependent Variable`
* `X: Predictor/Independent Variable`
* `Y = b0 + b1X`
* `b0: Intercept`
* `b1: Slope`
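
As a minimal sketch of how `b0` and `b1` can be estimated, the standard least-squares formulas are applied below to made-up numbers (the `x` and `y` arrays are assumptions for illustration only).

```python
import numpy as np

# Hypothetical data, roughly following y = 2 + 3x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.0])

# Least-squares estimates for Y = b0 + b1*X
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x    # predicted values
residuals = y - y_hat  # the part of y that the line does not explain

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```

The `residuals` array holds what the fitted line leaves unexplained for each point.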


In many cases, many factors influence how much people pay for a car, for example, how old the car is.
- This uncertainty is represented by the error term, also called the `noise`.
- The `noise` is the part of the data that cannot be explained by the model, because other variables that influence the price have not been included in the model.
- We estimate the `noise` with the residual: the difference between the true value of the dependent variable and the predicted value of the dependent variable.
- It cannot be removed, but it can be examined using a `residual plot`.

# Model Evaluation using visualization
* A `regression plot` gives a good estimate of:
    - The relationship between two variables,
    - The strength of the correlation, and
    - The direction of the relationship (positive or negative).

* The horizontal axis is the independent variable and the vertical axis is the dependent variable.
* Each point on the graph represents a different target point.
* The fitted line represents the predicted values.
* There are several ways to draw a `regression plot`; a simple way is to use `regplot()` from the `seaborn` library.

In [None]:
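# Regression Plot
import matplotlib.pyplot as plt
import seaborn as sns

def plot_regression(x, y):
    # Scatter plot of x against y with a fitted regression line
    sns.regplot(x=x, y=y)
    plt.ylim(0,)  # start the vertical (price) axis at zero
    plt.show()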

In [None]:
# Splitting the dataset into independent and dependent variables
x = df.drop('price', axis=1).select_dtypes(include='number') # Numeric independent variables (regplot needs numeric data)
y = df['price'] # Dependent variable

In [None]:
for col in x:
    plot_regression(x[col], y)

- From the above graphs, we can see that some of the columns show a linear relationship with the target variable and some do not.
- So we select the columns that show a linear relationship with the target variable.

## Residual Plot
- A `residual plot` is a graph that shows the residuals on the vertical axis and the independent variable on the horizontal axis.
- If the points in a `residual plot` are randomly spread out around the x-axis, then a `linear model` is appropriate for the data.
- Randomly spread-out residuals mean that the variance is roughly constant and that no pattern is left unexplained, so the linear model is a good fit for this data.

In [None]:
# Residual Plot
def plot_residual(x, y):
    # Residuals of a simple linear fit of y on x, plotted against x
    sns.residplot(x=x, y=y)
    plt.show()

In [None]:
for col in x:
    plot_residual(x[col], y)

So, we can say that `wheel-base`, `length`, `width`, `curb-weight`, `engine-size`, `bore`, `city-L/100Km`, `horsepower` and `highway-mpg` are the best predictor variables for the `price` target variable.

# Distribution Plot
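* A distribution plot compares the distribution of the predicted values with the distribution of the actual values.
* These plots are extremely useful for visualizing models with more than one independent variable or feature.
* We can draw a distribution plot with the `distplot()` function from the `seaborn` library. Note that `distplot()` is deprecated in recent `seaborn` versions, where `kdeplot()` and `histplot()` take its place.
* We can also draw a distribution plot directly with the `matplotlib` library.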
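
One way to draw such a plot with `matplotlib` alone is sketched below: normalized step histograms of the actual and the fitted values are overlaid on the same axes. The helper name `plot_distributions` and the random data are hypothetical and only serve the illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_distributions(actual, predicted, xlabel="Price"):
    """Overlay normalized histograms of actual and predicted values."""
    fig, ax = plt.subplots(figsize=(8, 5))
    # Shared bin edges so that both histograms are directly comparable
    bins = np.histogram_bin_edges(np.concatenate([actual, predicted]), bins=20)
    ax.hist(actual, bins=bins, density=True, histtype="step",
            color="r", label="Actual values")
    ax.hist(predicted, bins=bins, density=True, histtype="step",
            color="b", label="Fitted values")
    ax.set_xlabel(xlabel)
    ax.set_ylabel("Density")
    ax.legend()
    return ax

# Hypothetical values for illustration only
rng = np.random.default_rng(0)
actual = rng.normal(15000, 5000, size=200)
predicted = actual + rng.normal(0, 2000, size=200)
ax = plot_distributions(actual, predicted)
```

With a fitted model, the same helper could be called with the real target column and the model's predictions. Outside a notebook, `plt.show()` displays the figure.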

We can summarize the process like this: 
- We have a set of training points. We use these training points to fit (train) the model and obtain its parameters.
- We then use these parameters in the model to predict the value of the target variable.
- We then compare the predicted values with the actual values to see how accurate our model is.
- Finally, we use the test data to evaluate the model.
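
The steps above can be sketched as follows. The data is synthetic and the names `X_demo`, `y_demo` and `model` are assumptions for illustration; `train_test_split` from `sklearn.model_selection` is one common way to hold out test data, not necessarily the one intended here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: one feature with a noisy linear relationship to the target
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = 3.0 * X_demo[:, 0] + 5.0 + rng.normal(0, 1.0, size=200)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)          # learn the parameters from the training points
y_test_pred = model.predict(X_test)  # predict on data the model has not seen

print("R^2 on training data:", model.score(X_train, y_train))
print("R^2 on test data:", model.score(X_test, y_test))
```

Comparing the two scores gives a first impression of how well the model generalizes beyond the points it was trained on.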

To fit the model in Python, we import `LinearRegression` from the `sklearn.linear_model` module.
- We then create a `linear regression` object.
- We define the `predictor` and `target` variables.
- We then fit the model using the `fit()` method.
- We then use the `predict()` method to predict the value of the target variable.

In [None]:
from sklearn.linear_model import LinearRegression
X = df[['highway-mpg']]
Y = df['price']
lm = LinearRegression()
lm.fit(X,Y)
lm.predict(X)

As we can see, with an accuracy of only about 49% our model has a high error rate, so we have to improve or change our model.

# Multiple Linear Regression
- Multiple Linear Regression is very similar to Simple Linear Regression, but this method is used to explain the relationship between one continuous response (dependent) variable and two or more predictor (independent) variables.
- Most of the real-world regression models involve multiple predictors. We can use all of the predictors to create a model that predicts the response variable based on the predictors.
- The equation is given by:
    * `Y: Response/Target/Dependent Variable`
    * `X1, X2, ..., Xn: Predictor/Independent Variables`
    * `Y = b0 + b1X1 + b2X2 + b3X3 + ... + bnXn`
    * `b0: Intercept`
    * `b1, b2, ..., bn: Coefficients of X1, X2, ..., Xn`
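
To connect the equation to `sklearn`, the sketch below fits a model on synthetic data (the generating equation `Y = 1 + 2*X1 - 3*X2` and the names `X_demo`, `y_demo` and `mlr` are assumptions for illustration) and reads the intercept and coefficients back from the fitted object.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from Y = 1 + 2*X1 - 3*X2 (no noise)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1]

mlr = LinearRegression().fit(X_demo, y_demo)
print("b0 (intercept):", mlr.intercept_)
print("b1, b2 (coefficients):", mlr.coef_)

# The prediction is exactly the equation b0 + b1*X1 + b2*X2
manual = mlr.intercept_ + X_demo @ mlr.coef_
```

Here `manual` matches `mlr.predict(X_demo)`, which shows that `predict()` simply evaluates the fitted equation.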

In [None]:
X = df[['wheel-base', 'length', 'width', 'curb-weight', 'engine-size', 'bore', 'horsepower', 'city-L/100Km', 'highway-mpg']]
Y = df['price']
lm = LinearRegression()
lm.fit(X,Y)
lm.predict(X)

In [None]:
# Calculating the R^2 value
lm.score(X,Y)

# Polynomial Regression 
- `Polynomial regression` is a special case of the general linear regression model or multiple linear regression models.
- We model non-linear relationships by squaring the predictor variables or adding higher-order terms of them.
- There are different orders of `polynomial regression`:
    * `Quadratic - 2nd order`
    * `Cubic - 3rd order`
    * `Higher order`
- `Polynomial Regression` is beneficial for describing a `curvilinear relationship`.
- A `Curvilinear Relationship` is what we get by `squaring` or adding `higher-order` terms of the predictor variables in the model, which transforms the data.
- The degree of the regression makes a big difference and can result in a better fit if we pick the right value.
- In all cases, the relationship between the target and the parameters is linear: the model is a linear combination of the (transformed) predictor terms.
- In Python, we can do this with the `polyfit()` function from the `numpy` library. It takes the predictor and target variables along with the degree of the polynomial and returns the parameters of the polynomial function.
- We can use the `poly1d()` function from the `numpy` library to display the polynomial function.

In [None]:
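import numpy as np

# Fit a 3rd-order polynomial of highway-mpg to price
f = np.polyfit(df['highway-mpg'], df['price'], 3)
p = np.poly1d(f)  # wrap the coefficients in a callable polynomial
print(p)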

We can print out the model. The symbolic form of the polynomial regression is:
- `Yhat = b0 + b1X + b2X^2 + b3X^3`

We can also have multi-dimensional polynomial linear regression:
- `Yhat = b0 + b1X1 + b2X2 + b3X1X2 + b4X1^2 + b5X2^2 + ...`
- The `numpy` `polyfit()` function cannot perform this type of regression.
- We can use `PolynomialFeatures` from the `sklearn.preprocessing` module to transform the original data into polynomial features.

In [None]:
from sklearn.preprocessing import PolynomialFeatures
pr = PolynomialFeatures(degree=2, include_bias=False)
x_polly = pr.fit_transform(X)

In [None]:
x_polly
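
To see what the transformation produces, here is a minimal sketch on a single made-up sample with two features (the values 2 and 3 and the names `sample`, `poly` and `expanded` are assumptions for illustration).

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One hypothetical sample with two features: X1 = 2, X2 = 3
sample = np.array([[2.0, 3.0]])

poly = PolynomialFeatures(degree=2, include_bias=False)
expanded = poly.fit_transform(sample)

# Columns are X1, X2, X1^2, X1*X2, X2^2
print(expanded)  # [[2. 3. 4. 6. 9.]]
```

Two input features become five columns at degree 2, so the number of columns grows quickly with more features or a higher degree.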

# Pipeline 
- We can simplify our code by using the `Pipeline` class from `sklearn`.
- There are many steps to getting a prediction, for example `Normalization`, `Polynomial Features` and `Linear Regression`.
- A pipeline sequentially performs a series of transformations. The last step carries out the prediction.
- First we import all the modules we need, including `Pipeline` from `sklearn.pipeline`.
- We create a list of tuples:
    - The first element of each tuple contains the name of the estimator or transformer.
    - The second element contains the model constructor.
- We pass this list to the `Pipeline` constructor to get a pipeline object. We can train the pipeline by calling the `fit()` method on the `Pipeline` object.
- We can also produce a prediction with the `predict()` method.

In [None]:
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

Input = [('scale', StandardScaler()), ('polynomial', PolynomialFeatures(degree=2)), ('model', LinearRegression())]

pipe = Pipeline(Input)
pipe.fit(X, Y)

pipe.score(X, Y)
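
Producing a prediction from a fitted pipeline might look like the sketch below. It uses synthetic data with a known quadratic relationship (the names `x_demo`, `y_demo`, `demo_pipe` and `new_x` are assumptions for illustration) so that the result is easy to check by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical data with a quadratic relationship: y = 1 + 2x + 0.5x^2
x_demo = np.linspace(-3, 3, 30).reshape(-1, 1)
y_demo = 1.0 + 2.0 * x_demo[:, 0] + 0.5 * x_demo[:, 0] ** 2

steps = [
    ("scale", StandardScaler()),
    ("polynomial", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", LinearRegression()),
]
demo_pipe = Pipeline(steps)
demo_pipe.fit(x_demo, y_demo)

# predict() applies the scaling and the polynomial transform, then the regression
new_x = np.array([[0.0], [2.0]])
print(demo_pipe.predict(new_x))
```

Because the data is noise-free, the predictions should come out close to 1 and 7, the values of the generating equation at `x = 0` and `x = 2`.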

# Measures for In-Sample Evaluation
- A way to numerically determine how well the model fits the dataset.
- Two important measures to determine the fit of a model:
    * `Mean-Squared Error (MSE)`
    * `R-squared`
### Mean-Squared Error
- To measure the MSE, we find the difference between the `actual value` y and the `predicted value` yhat, then square it.
- For example, if the actual value is 150 and the predicted value is 50, subtracting them gives 100. We then square that number to get 10,000.
- We then take the mean (average) of all the squared errors by adding them together and dividing by the number of samples.
- To find the MSE in Python, we can import the `mean_squared_error()` function from `sklearn.metrics`.
- The `mean_squared_error()` function takes two inputs:
    - The `actual value` of the `target variable`.
    - The `predicted value` of the `target variable`.
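
The calculation can also be written out by hand. The sketch below uses a hypothetical helper `mse` built on NumPy; apart from the 150 versus 50 example from the text, the sample values are made up.

```python
import numpy as np

def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean((actual - predicted) ** 2)

# The single-sample example from the text: (150 - 50)^2 = 10000
print(mse([150], [50]))  # 10000.0

# Hypothetical three-sample example: errors of 100, -10 and 20
print(mse([150, 90, 120], [50, 100, 100]))  # (10000 + 100 + 400) / 3 = 3500.0
```

Because the errors are squared, one large error dominates the result, as the second example shows.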

### R-squared
- It is also called the `coefficient of determination`.
- It is a measure to determine how close the data is to the fitted regression line.
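
As a sketch of what this measure computes, R-squared can be written as one minus the ratio of the squared error of the model to the squared error of always predicting the mean. The helper `r_squared` and the sample values below are hypothetical.

```python
import numpy as np

def r_squared(actual, predicted):
    """R^2 = 1 - (sum of squared residuals) / (total sum of squares)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)      # error left after the fit
    ss_tot = np.sum((actual - actual.mean()) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical values for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
print(r_squared(y_true, y_true))                          # 1.0, perfect fit
print(r_squared(y_true, np.full(4, y_true.mean())))       # 0.0, no better than the mean
print(r_squared(y_true, np.array([2.5, 5.5, 6.5, 9.5])))  # 0.95
```

Values close to 1 mean the fitted line explains most of the variation in the target, while values near 0 mean it does little better than the mean.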


In [None]:
# Calculating the R^2 value
lm.score(X,Y)

In [None]:
# Calculating the MSE
from sklearn.metrics import mean_squared_error
y_pred = lm.predict(X)
mean_squared_error(Y, y_pred)

Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:
* It shows how spread out the errors are.
* It's always non-negative.
* Values closer to zero are better.

In [None]:
# Calculating the RMSE
from math import sqrt
sqrt(mean_squared_error(Y, y_pred))

Mean Absolute Error (MAE) is the mean of the absolute values of the errors:
* It shows how close the predictions are to the actual values.
* It's always non-negative.
* Values closer to zero are better.

In [None]:
# Calculating the MAE
from sklearn.metrics import mean_absolute_error
mean_absolute_error(Y, y_pred)