# Predicting Outcomes using Linear Regression

#### What is linear regression? 

Concept: Linear regression is a type of supervised learning that predicts a continuous outcome by modeling it as a linear combination of the features. It is simple to understand and visualize, and it provides foundational knowledge about regression and relationships between variables.

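Use Case: Predicting house prices based on square footage. This notebook applies the same idea to the California housing dataset, predicting median house value from several features at once.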
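
As a minimal sketch of the use case above, the snippet below fits a line to a handful of made-up (square footage, price) pairs. The numbers and the names `toy_sqft`, `toy_price`, and `toy_model` are hypothetical and chosen only for illustration; they are not part of the California housing data used in the rest of this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical, noise-free data: price = 150 * sqft + 50,000
toy_sqft = np.array([[800.0], [1200.0], [1500.0], [2000.0], [2500.0]])
toy_price = 150.0 * toy_sqft.ravel() + 50_000.0

# Fit a one-feature linear regression
toy_model = LinearRegression().fit(toy_sqft, toy_price)

# The learned slope and intercept should recover the generating line
toy_slope = toy_model.coef_[0]
toy_intercept = toy_model.intercept_

# Predict the price of an unseen 1,800 sq ft house
toy_prediction = toy_model.predict(np.array([[1800.0]]))[0]
print(f"slope={toy_slope:.2f}, intercept={toy_intercept:.2f}, prediction={toy_prediction:.2f}")
```

Because the toy data has no noise, the fitted slope and intercept match the generating line up to floating-point error. Real data will not fit this cleanly.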

In [3]:
# Package imports 
import pandas as pd 
import numpy as np

# Import all scikit-learn packages 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error
from sklearn.datasets import fetch_california_housing

The target variable (median house value) is not included in the feature matrix of this dataset. It is stored in a separate array that is accessible as *data.target* and is expressed in units of $100,000.

The features are stored as a *feature matrix* (*data.data*, with column names in *data.feature_names*), which can be converted to a pandas dataframe. The target variable *median house value* then needs to be added as a column to the dataframe.

In [None]:
# Load the data set 
data = fetch_california_housing()

# Create dataframe using the housing matrix
df = pd.DataFrame(data.data, columns = data.feature_names)

# Add the target variable as a separate column in the dataframe
df["MedHouseVal"] = data.target 

df.head()

In [8]:
# Separate the data into features (X) and target (y)
# Also known as input and output variables

X = df.drop("MedHouseVal", axis = 1)
y = df["MedHouseVal"]

Create the training and testing variables

- Test size is a floating-point number between 0.0 and 1.0 representing the fraction of the dataset to be used for testing. In this case, we set aside 20% of the data for testing and use 80% of the data for training.

- Random state is set to 42. Random state controls the shuffling applied to the data before it is split. Fixing it ensures reproducibility because the split remains the same each time.
    - This argument can be any integer value. When random state is set to None, each execution produces a different random split, so the training and testing data can differ each time.
    - By setting this parameter to 42, we ensure that every time the split is run, the same 80/20 train-test split occurs.
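
The effect of `test_size` and `random_state` described above can be checked on a small stand-in dataset. This is only a sketch: `X_demo` and `y_demo` are hypothetical arrays, not the housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 50 rows, 2 features
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

# Two splits with the same random_state
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Same seed, same rows in the test set
same_split = np.array_equal(a_test, b_test)

print(len(a_train), len(a_test), same_split)  # → 40 10 True
```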

In [9]:
# Create training and testing variables 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [10]:
# Create an instance of the LinearRegression model 
model = LinearRegression()

# Call fit() to train the model on the training data
model.fit(X_train, y_train)

In [11]:
# Use the trained model to predict housing prices using the test set 
y_pred = model.predict(X_test)

##### Evaluating how the model performed using the following metrics: 

- Mean Absolute Error (MAE)
- Root Mean Squared Error (RMSE)
- R-squared (R^2)
- Mean Absolute Percentage Error (MAPE)

##### Mean Absolute Error (MAE)

- Concept: MAE provides a straightforward average error in the same units as the target, making it easy to interpret. Since it doesn’t square the errors, it weights every error in proportion to its size, which is helpful when you want a reliable measure that’s not overly sensitive to outliers.
- Interpretation: Tells you how far, on average, the predictions deviate from actual housing prices. Because the target in this dataset is expressed in units of $100,000, an MAE of 0.53 corresponds to an average error of roughly $53,000.
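
To make the definition concrete, here is a small sketch that computes MAE by hand and compares it with scikit-learn. The arrays `toy_actual` and `toy_pred` are made-up values for illustration, read in the same units as the target ($100,000).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical values, in units of $100,000
toy_actual = np.array([2.0, 3.0, 5.0, 1.0])
toy_pred = np.array([2.5, 2.0, 4.0, 1.5])

# MAE by hand: the mean of the absolute differences
manual_mae = np.mean(np.abs(toy_actual - toy_pred))

# The same quantity from scikit-learn
sklearn_mae = mean_absolute_error(toy_actual, toy_pred)

print(manual_mae, sklearn_mae)  # → 0.75 0.75
```

An MAE of 0.75 in these units would mean the predictions are off by about $75,000 on average.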

In [21]:
# Mean Absolute Error
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae:.2f}")

Mean Absolute Error: 0.53


##### Root Mean Squared Error (RMSE)

- Concept: RMSE penalizes larger errors more heavily due to squaring, making it a useful metric if minimizing large errors is a priority (e.g., high-priced properties). RMSE is also in the same units as the target, which makes it interpretable and comparable to MAE.
- Interpretation: Gives insight into the size of typical errors, emphasizing the impact of larger deviations.
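
The sketch below illustrates the penalty on large errors with two made-up sets of predictions that share the same MAE. The arrays and the helper `mae_and_rmse` are hypothetical and exist only for this comparison.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

toy_actual = np.array([3.0, 3.0, 3.0, 3.0])
even_pred = np.array([2.0, 4.0, 2.0, 4.0])   # every prediction off by 1
spiky_pred = np.array([3.0, 3.0, 3.0, 7.0])  # three exact, one off by 4

def mae_and_rmse(actual, pred):
    """Return (MAE, RMSE) for a pair of arrays."""
    mae_value = mean_absolute_error(actual, pred)
    rmse_value = np.sqrt(mean_squared_error(actual, pred))
    return mae_value, rmse_value

even_mae, even_rmse = mae_and_rmse(toy_actual, even_pred)
spiky_mae, spiky_rmse = mae_and_rmse(toy_actual, spiky_pred)

print(even_mae, even_rmse)    # → 1.0 1.0
print(spiky_mae, spiky_rmse)  # → 1.0 2.0
```

Both sets of predictions have the same total absolute error, but concentrating it in one large miss doubles the RMSE.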

In [20]:
# Root Mean Squared Error
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Root Mean Squared Error: {rmse:.2f}")

Root Mean Squared Error: 0.75


##### R-squared (R^2)

- Concept: R-squared helps understand how well the model captures the overall variability in housing prices. It’s a standardized measure that tells us the proportion of variance in the target variable explained by the features, giving a broader picture of the model’s fit.
- Interpretation: An R-squared close to 1 indicates that the model explains a large portion of the variance in housing prices, while values closer to 0 suggest that much of the variance is unexplained.
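
As a sketch of what "proportion of variance explained" means, R-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The arrays below are made-up values for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

toy_actual = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.0])

# Residual sum of squares: variation the predictions fail to capture
ss_res = np.sum((toy_actual - toy_pred) ** 2)
# Total sum of squares: variation around the mean of the actual values
ss_tot = np.sum((toy_actual - toy_actual.mean()) ** 2)

manual_r2 = 1.0 - ss_res / ss_tot
sklearn_r2 = r2_score(toy_actual, toy_pred)

# Baseline: always predicting the mean of the actual values gives R-squared of 0
baseline_pred = np.full_like(toy_actual, toy_actual.mean())
baseline_r2 = r2_score(toy_actual, baseline_pred)

print(f"{manual_r2:.2f} {sklearn_r2:.2f} {baseline_r2:.2f}")
```

On held-out data, R-squared can also be negative, which means the model predicts worse than this mean baseline.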

In [19]:
# R-squared
r2 = r2_score(y_test, y_pred)
print(f"R-Squared: {r2:.2f}")

R-Squared: 0.58


##### Mean Absolute Percentage Error (MAPE)
    
- Concept: MAPE provides error as a percentage, which can make errors more interpretable in relative terms. This metric is particularly useful for understanding how the error compares to actual housing prices, helping to contextualize the model’s predictions across different price ranges.
- Interpretation: A MAPE of 10%, for instance, would mean that on average, the predictions are off by 10% of the actual price, regardless of the price range.
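
The sketch below computes MAPE by hand on made-up values and compares it with scikit-learn's `mean_absolute_percentage_error`, which returns a fraction rather than a percentage. The arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

toy_actual = np.array([1.0, 2.0, 4.0])
toy_pred = np.array([1.1, 1.8, 5.0])

# MAPE by hand, expressed as a percentage
manual_mape = np.mean(np.abs((toy_actual - toy_pred) / toy_actual)) * 100

# scikit-learn returns a fraction (0.15 means 15%), so scale by 100 to compare
sklearn_mape = mean_absolute_percentage_error(toy_actual, toy_pred) * 100

print(f"{manual_mape:.2f}% {sklearn_mape:.2f}%")  # → 15.00% 15.00%
```

Because each error is divided by the actual value, MAPE becomes very large when actual values are close to zero.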

In [18]:
# Mean Absolute Percentage Error
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100
print(f"Mean Absolute Percentage Error: {mape:.2f}%")

Mean Absolute Percentage Error: 31.95%


##### Summary

- MAE and RMSE together describe the typical size of the errors and the impact of larger errors: the further RMSE sits above MAE, the more a few large misses are contributing.
- R-squared gives an understanding of how well the model captures the variance in housing prices.
- MAPE adds context by showing the error as a percentage, which can help when evaluating performance across different price levels.

#### Visualizations

##### Predicted vs. Actual House Prices

In [23]:
import plotly.express as px

fig = px.scatter(x = y_test, 
                 y = y_pred,
                 labels = {"x": "Actual House Prices", "y": "Predicted House Prices"},
                 title = "Predicted vs. Actual House Prices")
fig.add_shape(type = "line", 
              x0 = y_test.min(),
              y0 = y_test.min(),
              x1 = y_test.max(),
              y1 = y_test.max(),
              line = dict(color = "Red", width = 2))
fig.show()

- The points represent actual (x-axis) vs. predicted (y-axis) values
- The red line represents the ideal prediction line (predicted = actual)

##### Distribution of Residuals (Errors)

In [24]:
import plotly.graph_objects as go

In [25]:
residuals = y_test - y_pred

fig = go.Figure()
fig.add_trace(go.Histogram(x = residuals,
                           nbinsx = 50, 
                           marker_color = "blue",
                           opacity = 0.75))
fig.update_layout(title = "Distribution of Residuals",
                  xaxis_title = "Residuals (Actual - Predicted)",
                  yaxis_title = "Frequency")
fig.show()

- If the residuals are centered around zero with a narrow spread and a symmetrical shape, it’s a good sign that the model is making accurate predictions without significant bias.
- If the residuals are skewed to one side or show a wide spread and outliers, the model might need tuning, additional features, or even a different algorithm.
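
The visual checks above can be complemented with a few summary numbers. The sketch below uses simulated residuals rather than the model's output, so the data and the names `toy_actual`, `toy_pred`, and `toy_residuals` are assumptions for illustration only.

```python
import numpy as np

# Simulated stand-in data: predictions equal the actual values plus unbiased noise
rng = np.random.default_rng(0)
toy_actual = rng.uniform(1.0, 5.0, size=1000)
toy_pred = toy_actual + rng.normal(0.0, 0.5, size=1000)

toy_residuals = toy_actual - toy_pred

# Center: a mean or median far from zero suggests systematic over- or under-prediction
residual_mean = toy_residuals.mean()
residual_median = np.median(toy_residuals)

# Spread: the standard deviation of the residuals
residual_std = toy_residuals.std()

# Symmetry: moment-based skewness, near zero for a symmetric distribution
residual_skew = np.mean((toy_residuals - residual_mean) ** 3) / residual_std ** 3

print(f"mean={residual_mean:.3f}, median={residual_median:.3f}, "
      f"std={residual_std:.3f}, skew={residual_skew:.3f}")
```

Replacing `toy_residuals` with the `residuals` series computed in the cell above would give the same summary for the fitted model.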

In [26]:
metrics = {
    'MAE': mae,
    'RMSE': rmse,
    'R2 Score': r2,
    'MAPE': mape
}

# Bar plot for metrics
# Note: MAPE is a percentage, so its bar is on a much larger scale than the other three
fig = go.Figure(go.Bar(
    x = list(metrics.keys()),
    y = list(metrics.values()),
    marker_color = ['blue', 'green', 'orange', 'purple']
))

fig.update_layout(title = "Model Evaluation Metrics", 
                  xaxis_title = "Metrics", 
                  yaxis_title = "Metric Value")
fig.show()