### Model Development

Developing models means dealing with:  
1. Simple and multiple linear regression
2. Model evaluation using visualization
3. Polynomial regression and pipelines
4. R-squared and MSE for in-sample evaluation
5. Prediction and decision making  

Ultimately, you can answer decisive questions like, "how can you determine  
a fair value for a used car?"  

A model can be thought of as a mathematical equation used to predict a value  
given one or more other values. It relates **one or more independent variables  
to a dependent variable**.  

Usually the more **relevant data** you have, the more accurate your model is.  
For example:  

  - You feed the following into your model:  
      - `highway-mpg`  
      - `curb-weight`  
      - `engine-size`  

The model then returns a prediction for `price`.  

---

### Linear and Multiple Linear Regression  

**Simple linear regression** uses one independent variable to make a  
prediction, while **multiple linear regression** uses two or more  
independent variables.

### Simple Linear Regression  

In simple linear regression you have the following:  
  - The *predictor* (independent) variable - **X**  
  - The *target* (dependent) variable - **Y**  
    - We would like to come up with a linear relationship expressed as the 
      following:  
      $y = b_0 + b_1 x$
  - $b_0$: the **intercept**  
  - $b_1$: the **slope**  

Determining the slope and intercept requires heavy calculations that, luckily,  
Python can abstract away (love this language). Still, it's important  
to understand what is happening. For this example, we'll consider  
`auto_df["highway-L/100km"]` our *predictor* variable and `auto_df["price"]` our *target* variable.  

At this point in our modeling, we'll primarily use `LinearRegression` from  
the `linear_model` module in the `sklearn` (scikit-learn) library:  

  - We'll start by using it to create a `LinearRegression` object, which is our model.  
  - Assign the independent variable(s) (X) and the dependent variable (Y), then use  
    `fit()` to determine the intercept ($b_0$) and slope ($b_1$):  

$$
\text{slope} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
$$

$$
\text{intercept} = \bar{y} - \text{slope} \cdot \bar{x}
$$  

  - There is no prediction without fitting your data first.  
  - Finally, use `predict()` to get a prediction (returned as an array).  
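
To make the slope and intercept formulas concrete, here is a minimal sketch that computes both by hand with NumPy and compares them against what `fit()` returns. The data is synthetic (the values and the name `lm_check` are assumptions for illustration, not the auto data set).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only (not the auto data set)
rng = np.random.default_rng(0)
x = rng.uniform(5, 15, size=50)
y = 3000 + 2500 * x + rng.normal(0, 500, size=50)

# Closed-form least-squares estimates, term for term as in the formulas above
x_bar, y_bar = x.mean(), y.mean()
slope = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
intercept = y_bar - slope * x_bar

# The same fit with scikit-learn; X must be 2D, hence reshape(-1, 1)
lm_check = LinearRegression().fit(x.reshape(-1, 1), y)

print("slope:    ", slope, "vs", lm_check.coef_[0])
print("intercept:", intercept, "vs", lm_check.intercept_)
```

Both pairs should agree up to floating-point error, which is a quick way to confirm that `fit()` is solving the same least-squares problem.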

In [None]:
from pathlib import Path
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from scipy import stats as sts
from sklearn.linear_model import LinearRegression

# Path to the cleaned data set, two directories above the current working directory
df_data = Path.cwd().parent.parent / "Data" / "Clean_Data" / "clean_auto_df.csv"
auto_df = pd.read_csv(df_data)

In [None]:
lm = LinearRegression()

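# X must always be a 2D object, hence the double brackets (a DataFrame)
X = auto_df[["highway-L/100km"]]
Y = auto_df["price"]

lm.fit(X, Y)
b_int = lm.intercept_   # b_0
b_slope = lm.coef_[0]   # b_1

# The input to predict() must be 2D as well; reusing the training column
# name avoids scikit-learn's feature-name warning
Yhat = lm.predict(pd.DataFrame({"highway-L/100km": [10]}))

print(b_int, "\n", b_slope, "\n", Yhat)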

**SLR Use Cases**  

Simple linear regression (SLR) might seem rather simple compared to multiple  
linear regression (MLR), but it provides critical insight:  
- It answers one question clearly: "**How does this one variable affect the
  outcome?**"  

- You get *one* slope and *one* relationship, which is easier to explain to an  
  audience or stakeholder.  

- It acts as a baseline, telling you how well a **single feature** performs.  

- It can be ideal for low-data situations, especially when there aren't many  
  strong predictors.  

Ultimately, **SLR** is great when you're trying to explain something, such as  
how much predictive power one variable has for your target during exploratory  
data analysis (EDA). **MLR** is geared toward *full modeling*, when your  
target is affected by multiple interacting variables.  
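
As a rough illustration of the baseline idea, the sketch below fits one SLR model per feature and compares each in-sample R-squared (from `score()`) with that of an MLR model using both features. The data frame `demo_df` and its values are synthetic stand-ins made up for this example; only the column names echo the auto data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (hypothetical values, not the auto data set)
rng = np.random.default_rng(1)
n = 200
demo_df = pd.DataFrame({
    "engine-size": rng.uniform(70, 300, n),
    "highway-L/100km": rng.uniform(4, 15, n),
})
demo_df["price"] = (
    2000
    + 100 * demo_df["engine-size"]
    + 300 * demo_df["highway-L/100km"]
    + rng.normal(0, 2000, n)
)

features = ["engine-size", "highway-L/100km"]
target = demo_df["price"]

# One SLR model per feature: each R-squared is that feature's baseline
baseline_r2 = {}
for feature in features:
    slr = LinearRegression().fit(demo_df[[feature]], target)
    baseline_r2[feature] = slr.score(demo_df[[feature]], target)

# One MLR model using both features together
mlr = LinearRegression().fit(demo_df[features], target)
mlr_r2 = mlr.score(demo_df[features], target)

print(baseline_r2)
print("MLR:", mlr_r2)
```

In-sample, the MLR score can never fall below the best single-feature score, so the interesting question is how much the extra features add beyond the baseline.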

---  

### Multiple Linear Regression  

This method is used to explain the relationship between:
- One continuous target (Y) variable  
- Two or more predictor (X) variables  

MLR uses the same functions and principles as SLR. Aside from X holding  
multiple columns, the key distinction is that `fit()` generates multiple  
coefficients, one per predictor.  
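
Written out, the model extends the simple linear equation with one slope per predictor. Here $k$, the number of predictors, and $\hat{y}$, the predicted value, are symbols introduced for illustration:

$$
\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k
$$

With $k = 1$ this reduces to the simple linear regression equation $y = b_0 + b_1 x$.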

Additionally, when calling `predict()`, it is best practice to pass a  
DataFrame whose column names match the column names in X, with one predictor  
value for each column.  

In [None]:
lm2 = LinearRegression()

X2 = auto_df[["horsepower", "engine-size", "fuel-type-gas", "highway-L/100km"]]
Y2 = auto_df["price"]

lm2.fit(X2, Y2)
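b0 = lm2.intercept_   # intercept
b1 = lm2.coef_        # array with one coefficient per column of X2, in column order

# New observation as a one-row DataFrame with the same column names as X2
predictor = pd.DataFrame([{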
    "horsepower": 125,
    "engine-size": 130,
    "fuel-type-gas": 1,
    "highway-L/100km": 10
}])

Yhat2 = lm2.predict(predictor)

print(b0, "\n", b1, "\n", Yhat2)
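
To see what `predict()` does with those coefficients, the sketch below rebuilds a prediction by hand as the intercept plus the dot product of the coefficients and the input values. It uses synthetic data, and the names `toy_X`, `toy_y`, `toy_lm`, and `new_car` are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (hypothetical values, not the auto data set)
rng = np.random.default_rng(2)
n = 100
toy_X = pd.DataFrame({
    "horsepower": rng.uniform(50, 250, n),
    "engine-size": rng.uniform(70, 300, n),
})
toy_y = (
    1500
    + 60 * toy_X["horsepower"]
    + 40 * toy_X["engine-size"]
    + rng.normal(0, 1000, n)
)

toy_lm = LinearRegression().fit(toy_X, toy_y)

# One new observation, with the same column names and order as toy_X
new_car = pd.DataFrame([{"horsepower": 125, "engine-size": 130}])

# Prediction from the model vs. the same value rebuilt by hand
by_model = toy_lm.predict(new_car)[0]
by_hand = toy_lm.intercept_ + np.dot(toy_lm.coef_, new_car.iloc[0].to_numpy())

# Pair each coefficient with its column name for easier reading
coef_by_name = dict(zip(toy_X.columns, toy_lm.coef_))

print(by_model, by_hand)
print(coef_by_name)
```

Pairing the coefficients with the column names is a small convenience, since `coef_` on its own is an unlabeled array in the column order of X.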