# Module 4.2: Linear Regression

**Linear Regression** is one of the simplest and most widely used algorithms in machine learning. It's a fundamental building block and the perfect place to start our journey into specific models.

**The Goal:** The core idea is to find the straight line that best fits our data points, usually the one that minimizes the sum of squared differences between predicted and actual values. This line can then be used to predict a continuous output (like a price, temperature, or salary).

The famous equation for a line is $y = mx + c$.
* `y` is the value we want to predict (target).
* `x` is our input variable (feature).
* `m` is the slope of the line (coefficient).
* `c` is the y-intercept.

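The model's job is to learn the best values for `m` and `c` from the training data. With several features, as in this notebook, the same idea extends to $y = m_1 x_1 + m_2 x_2 + \dots + m_n x_n + c$, with one coefficient per feature.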
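As a minimal sketch of what "learning `m` and `c`" means, the snippet below fits a straight line to four made-up points by least squares, the criterion ordinary linear regression uses to define the best fit. The data values and variable names are illustrative assumptions, not part of the housing dataset.

```python
import numpy as np

# Toy data that lies exactly on the line y = 2x + 1 (illustrative values)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# A degree-1 polynomial fit is a least-squares straight-line fit;
# np.polyfit returns the coefficients from highest degree down: [m, c]
m, c = np.polyfit(x, y, deg=1)

print(f"slope m = {m:.2f}, intercept c = {c:.2f}")
```

Because the points sit exactly on a line, the fit recovers a slope close to 2 and an intercept close to 1. With noisy real data the line passes near the points rather than through them.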

**Goals of this Notebook:**
1.  Train a Linear Regression model on a real dataset.
2.  Interpret the model's coefficients.
3.  Evaluate the model's performance using key regression metrics.

### Dataset Setup

We will use a housing dataset to predict house prices.

➡️ **Action:** Go to the `02_Data_Analysis_and_Wrangling/data/` folder. Create a new file named `USA_Housing.csv` and paste the following content into it:

```csv
Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Address
79545.45857,5.682861322,7.009188143,4.09,23086.8005,1059033.558,"208 Michael Ferry Apt. 674 Laurabury, NE 37010-5101"
79248.64245,6.002900364,6.73082101,3.09,40173.07217,1505890.915,"188 Johnson Views Suite 079 Lake Kathleen, CA 48958"
61287.06718,5.86588984,8.512727431,5.13,36882.1594,1058987.988,"9127 Elizabeth Stravenue Apt. 196, WI 06290-3607"
63345.24005,7.188236139,5.58672872,3.26,34310.24283,1260616.807,"USS Barnett"
59982.19723,5.0405545,7.839387993,4.23,26354.10947,630943.4893,"USNS Raymond"
80175.75416,4.988408422,6.104512411,4.04,26748.42842,1068138.074,"9725 Angel Cliff Apt. 303, WI 16673-3232"
64698.46343,6.025336334,8.147760223,3.41,60828.24909,1502055.816,"4759 Lott Forge Suite 400, OK 73016-5122"
```

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
# Load the dataset
df = pd.read_csv('../02_Data_Analysis_and_Wrangling/data/USA_Housing.csv')

df.head()

## 1. Preparing the Data

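We follow the standard Scikit-Learn process: define the features (`X`) and the target (`y`), then split the data into training and test sets.

Note that the seven-row sample above leaves only four training rows after a 30% test split, fewer than the five features, so coefficients fitted on it alone are not meaningful. The figures quoted later in this notebook appear to come from a larger version of the dataset.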

In [None]:
# Features (all columns except Price and Address)
X = df[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
       'Avg. Area Number of Bedrooms', 'Area Population']]

# Target
y = df['Price']

# Train-Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
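To make the effect of `test_size` and `random_state` concrete, here is a small sketch on seven synthetic rows. The arrays `X_demo` and `y_demo` are placeholders introduced for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Seven synthetic rows with two features each (illustrative values)
X_demo = np.arange(14).reshape(7, 2)
y_demo = np.arange(7)

# test_size=0.3 reserves 30% of the rows, rounded up, for testing;
# random_state fixes the shuffle so the split is reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)

print(len(X_tr), "training rows,", len(X_te), "test rows")
```

Re-running with the same `random_state` yields the same partition, which keeps results comparable between runs.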

## 2. Training the Model

In [None]:
from sklearn.linear_model import LinearRegression

# Instantiate the model
lm = LinearRegression()

# Fit the model to the training data
lm.fit(X_train, y_train)
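As a rough illustration of what `fit` estimates, the sketch below trains a `LinearRegression` on synthetic, noise-free data with known coefficients, then reproduces `predict` by hand from `coef_` and `intercept_`. The data and the names `X_demo`, `y_demo`, and `model` are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from y = 3*x1 - 2*x2 + 5 (no noise, illustrative)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 2))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 5

model = LinearRegression().fit(X_demo, y_demo)

# predict() is the intercept plus the coefficient-weighted sum of the features
manual = model.intercept_ + X_demo @ model.coef_

print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
print("manual == predict:", np.allclose(manual, model.predict(X_demo)))
```

The fitted coefficients and intercept should match the values used to generate the data, up to floating-point error.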

## 3. Model Interpretation & Evaluation

Once the model is trained, we can inspect its learned parameters to understand how it's making predictions.

In [None]:
# Each coefficient is the expected change in price for a one-unit increase
# in that feature, with the other features held constant
coeff_df = pd.DataFrame(lm.coef_, index=X.columns, columns=['Coefficient'])
coeff_df

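> **Interpretation:** For every one-unit (one-dollar) increase in 'Avg. Area Income', with the other features held constant, the predicted price increases by about \$21.6. The size of a raw coefficient does not by itself show which feature is most influential, because the features are measured in different units and span different ranges.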
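One common way to compare feature influence is to refit on standardized features, so each coefficient is expressed per standard deviation instead of per raw unit. The sketch below uses synthetic data and plain NumPy least squares. The feature scales, the effect sizes, and the helper `fit_ols` are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two hypothetical features on very different scales
income = rng.normal(70_000, 10_000, size=n)   # dollars
age = rng.normal(6, 1, size=n)                # years
X_demo = np.column_stack([income, age])

# Synthetic target with known per-unit effects (no noise)
y_demo = 20 * income + 160_000 * age

def fit_ols(features, target):
    """Least-squares fit with an intercept; returns the feature coefficients only."""
    design = np.column_stack([features, np.ones(len(features))])
    solution, *_ = np.linalg.lstsq(design, target, rcond=None)
    return solution[:-1]

raw_coefs = fit_ols(X_demo, y_demo)

# Standardize: subtract the mean and divide by the standard deviation
Z_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
std_coefs = fit_ols(Z_demo, y_demo)

print("per-unit coefficients:        ", raw_coefs)
print("per-standard-deviation coefs: ", std_coefs)
```

In this synthetic setup the per-dollar coefficient is thousands of times smaller than the per-year one, yet the two effects are of the same order of magnitude once each is scaled by its feature's spread.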

### Making Predictions

In [None]:
predictions = lm.predict(X_test)

# A scatter plot of actual vs. predicted values is a great way to visualize performance.
# The closer the points lie to the diagonal (predicted = actual), the better the model.
plt.scatter(y_test, predictions)
plt.xlabel('Y Test (Actual Values)')
plt.ylabel('Predicted Values')
plt.title('Actual vs. Predicted House Prices')

### Regression Evaluation Metrics

Visuals are good, but we need concrete metrics to quantify our model's performance.

* **Mean Absolute Error (MAE):** The average of the absolute differences between predicted and actual values. It is easy to interpret because it is in the same units as the target.
* **Mean Squared Error (MSE):** The average of the squared differences. Squaring penalizes large errors more heavily, and the result is in squared units (e.g., dollars squared).
* **Root Mean Squared Error (RMSE):** The square root of the MSE. It's back in the same units as the target variable (e.g., dollars), making it very interpretable.
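The three definitions above can be checked by hand on a tiny example. The sketch below uses four made-up actual and predicted values and plain NumPy. The numbers are illustrative and unrelated to the housing data.

```python
import numpy as np

# Hypothetical actual and predicted prices (illustrative values)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 400.0])

errors = y_pred - y_true            # [10, -10, 30, 0]

mae = np.mean(np.abs(errors))       # (10 + 10 + 30 + 0) / 4 = 12.5
mse = np.mean(errors ** 2)          # (100 + 100 + 900 + 0) / 4 = 275.0
rmse = np.sqrt(mse)                 # about 16.58

print(f"MAE:  {mae:.2f}")
print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
```

The single large miss of 30 pulls the RMSE above the MAE, which is the practical meaning of "punishing larger errors more".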


In [None]:
from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))

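> **Interpretation:** The RMSE of about \$81,025 gives the typical size of a prediction error in dollars. It is not a plain average: because errors are squared before averaging, large misses count more, so the RMSE is always at least as large as the MAE. Whether this is 'good' or 'bad' depends entirely on the business context.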


## ✅ What's Next?

You have now trained, interpreted, and evaluated your first regression model!

Next, we'll shift from predicting continuous values (like price) to predicting categories (like 'Yes' or 'No', 'Spam' or 'Not Spam'). This is called **Classification**, and our first algorithm for this task will be **Logistic Regression**.