# Height vs Weight Dataset with Polynomial Regression

This notebook builds a synthetic height and weight dataset in which weight depends quadratically on height, then compares a straight-line fit with a degree-2 polynomial fit to show where linear regression falls short.


## Configuration Parameters

This section contains adjustable parameters for the dataset generation and visualization.
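
For orientation, the weight-related parameters below enter the generating formula used later in the notebook. Written out, with $h$ the height in cm and $\varepsilon$ the Gaussian noise term (these symbols are shorthand introduced here, not names from the code):

```latex
w = \texttt{base\_weight}
  + \texttt{height\_factor} \cdot h^{2}
  + \texttt{linear\_factor} \cdot h
  + \varepsilon,
\qquad
\varepsilon \sim \mathcal{N}\!\left(0,\ \texttt{weight\_noise\_std}^{2}\right)
```

Despite its name, `height_factor` multiplies the squared height, so it controls the curvature. With the default values, the noise-free value at the mean height of 170 cm is $-210 + 0.11 \cdot 170^{2} + 0.21 \cdot 170 = 3004.7$, so the generated values are on the order of 3000 and the "kg" label is best read as nominal.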


In [None]:
# Configuration parameters
config = {
    # Dataset parameters
    "n_samples": 200,  # Number of samples to generate
    "random_seed": 42,  # Random seed for reproducibility
    # Height distribution parameters
    "height_mean": 170,  # Mean height in cm
    "height_std": 15,  # Standard deviation for height
    # Weight parameters: weight = base_weight + height_factor * height**2 + linear_factor * height + noise
    "base_weight": -210,  # Constant (intercept) term
    "height_factor": 0.11,  # Coefficient of height**2 (quadratic term)
    "linear_factor": 0.21,  # Coefficient of height (linear term)
    "weight_noise_std": 15,  # Standard deviation of noise in weight
    # Polynomial regression parameters
    "poly_degree": 2,  # Degree of the polynomial
    # Plot parameters
    "plot_figsize": (14, 8),  # Figure size
    "scatter_alpha": 0.6,  # Transparency of scatter points
    "scatter_color": "blue",  # Color of scatter points
    "line_color_linear": "red",  # Color of linear regression line
    "line_color_poly": "green",  # Color of polynomial regression line
    "line_width": 2,  # Width of regression line
    "grid_alpha": 0.3,  # Transparency of grid lines
}

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, r2_score

# Set random seed for reproducibility
np.random.seed(config["random_seed"])

In [None]:
# Generate heights in cm (normally distributed)
heights = np.random.normal(config["height_mean"], config["height_std"], config["n_samples"])

# Create weights with a non-linear relationship to height plus some noise
# Weight = base_weight + (height_factor * height^2) + (linear_factor * height) + noise
noise = np.random.normal(0, config["weight_noise_std"], config["n_samples"])
weights = (
    config["base_weight"]
    + (config["height_factor"] * heights**2)
    + (config["linear_factor"] * heights)
    + noise
)

# Create a DataFrame
data = pd.DataFrame({"Height (cm)": heights, "Weight (kg)": weights})

# Display the first few rows
data.head()

## Exploratory Data Analysis

We start with descriptive statistics, the Pearson correlation between height and weight, and a scatter plot of the raw data.
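
A high correlation coefficient alone does not show that a straight line is adequate, since Pearson correlation only measures linear association. One quick curvature check, sketched here on data regenerated with the notebook's default formula (the `default_rng` generator and seed 0 are assumptions made to keep the example self-contained), is to fit separate lines to the shorter and taller halves of the sample and compare their slopes.

```python
import numpy as np

# Synthetic data using the same formula and default values as this notebook.
# The generator and seed are illustrative assumptions.
rng = np.random.default_rng(0)
heights = rng.normal(170, 15, 200)
weights = -210 + 0.11 * heights**2 + 0.21 * heights + rng.normal(0, 15, 200)

# Pearson correlation only measures linear association.
pearson_r = np.corrcoef(heights, weights)[0, 1]

# Fit a separate straight line to the shorter and taller halves of the sample.
# np.polyfit returns coefficients from highest degree down, so index 0 is the slope.
median_height = np.median(heights)
lower = heights <= median_height
slope_lower = np.polyfit(heights[lower], weights[lower], 1)[0]
slope_upper = np.polyfit(heights[~lower], weights[~lower], 1)[0]

print(f"Pearson r: {pearson_r:.4f}")
print(f"Slope (shorter half): {slope_lower:.2f}")
print(f"Slope (taller half):  {slope_upper:.2f}")
```

If the relationship were truly linear, the two slopes would agree up to noise. A clearly steeper slope in the taller half points to upward curvature even when the correlation is close to 1.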


In [None]:
# Display descriptive statistics
print("Descriptive Statistics:")
display(data.describe())

# Calculate the Pearson correlation (captures linear association only)
correlation = data["Height (cm)"].corr(data["Weight (kg)"])
print(f"\nCorrelation between Height and Weight: {correlation:.4f}")

# Scatter plot to visualize the relationship
plt.figure(figsize=(10, 6))
plt.scatter(data["Height (cm)"], data["Weight (kg)"], alpha=config["scatter_alpha"])
plt.title("Height vs Weight - Non-linear Relationship")
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.grid(alpha=config["grid_alpha"])
plt.show()

## Linear Regression (Insufficient Model)

First, let's try fitting a simple linear regression model to see why it's insufficient.
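
For a single predictor, the least-squares line has a closed form: the slope is the covariance of the two variables divided by the variance of the predictor, and the intercept makes the line pass through the means. The sketch below works through that arithmetic with NumPy on a tiny made-up data set. The helper name `fit_line` is hypothetical, and this is not how scikit-learn computes the fit internally, only a way to see what the fitted numbers mean.

```python
import numpy as np


def fit_line(x, y):
    """Closed-form simple least squares: returns (slope, intercept)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    # slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
    slope = np.sum(x_centered * (y - y.mean())) / np.sum(x_centered**2)
    # The fitted line passes through (mean_x, mean_y).
    intercept = y.mean() - slope * x.mean()
    return slope, intercept


# Tiny illustrative data set (made-up values) lying exactly on y = 2x + 1
x_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_demo = 2.0 * x_demo + 1.0

slope_demo, intercept_demo = fit_line(x_demo, y_demo)
print(slope_demo, intercept_demo)
```

On this noise-free example the function recovers a slope of 2 and an intercept of 1. A straight line has one constant slope, which is why it cannot follow data whose slope changes with height.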


In [None]:
# Prepare data for linear regression
X = data["Height (cm)"].to_numpy().reshape(-1, 1)  # Independent variable
y = data["Weight (kg)"].to_numpy()  # Dependent variable

# Create and fit the linear regression model
linear_model = LinearRegression()
linear_model.fit(X, y)

# Get the coefficient (slope) and intercept
slope = linear_model.coef_[0]
intercept = linear_model.intercept_

# Make predictions
y_pred_linear = linear_model.predict(X)

# Calculate metrics for linear model
mse_linear = mean_squared_error(y, y_pred_linear)
r2_linear = r2_score(y, y_pred_linear)

sign = "+" if intercept >= 0 else "-"  # Avoid printing "+ -..." for a negative intercept
print(f"Linear Regression Model: Weight = {slope:.4f} × Height {sign} {abs(intercept):.4f}")
print(f"Mean Squared Error (MSE): {mse_linear:.4f}")
print(f"R-squared (R²): {r2_linear:.4f}")

## Polynomial Regression (Better Model)

Now let's implement polynomial regression to better fit the non-linear relationship.
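
Polynomial regression is still linear regression. The model stays linear in its coefficients and only the inputs are expanded into powers of height. The NumPy sketch below builds the three columns `[1, h, h^2]` by hand and solves ordinary least squares on them. It is an illustrative equivalent and not the scikit-learn pipeline itself. The noise-free data and the evenly spaced heights are assumptions chosen so the known coefficients can be recovered.

```python
import numpy as np

# Noise-free quadratic built from this notebook's default coefficients
heights = np.linspace(140.0, 200.0, 50)
weights = -210 + 0.21 * heights + 0.11 * heights**2

# Design matrix with columns [1, h, h^2]
design = np.vander(heights, N=3, increasing=True)

# Ordinary least squares on the expanded features
coef, *_ = np.linalg.lstsq(design, weights, rcond=None)

# coef holds the intercept, linear and quadratic coefficients, in that order
print(coef)
```

For a single input feature, `PolynomialFeatures(degree=2)` with its default settings produces the same three columns (constant, `x`, `x^2`), and the `LinearRegression` step then fits the coefficients.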


In [None]:
# Create and fit a polynomial regression model
poly_model = make_pipeline(PolynomialFeatures(degree=config["poly_degree"]), LinearRegression())
poly_model.fit(X, y)

# Make predictions with the polynomial model
y_pred_poly = poly_model.predict(X)

# Calculate metrics for polynomial model
mse_poly = mean_squared_error(y, y_pred_poly)
r2_poly = r2_score(y, y_pred_poly)

# Extract polynomial coefficients
coefficients = poly_model.named_steps["linearregression"].coef_
intercept_poly = poly_model.named_steps["linearregression"].intercept_

print("Polynomial Regression Model:")
print(f"Intercept: {intercept_poly:.4f}")
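for i, coef in enumerate(coefficients):
    # Index 0 belongs to the constant (bias) column added by PolynomialFeatures.
    # LinearRegression fits its own intercept, so that coefficient is 0 and is skipped.
    if i > 0:
        print(f"Coefficient for degree {i}: {coef:.6f}")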
print(f"\nMean Squared Error (MSE): {mse_poly:.4f}")
print(f"R-squared (R²): {r2_poly:.4f}")
print(f"\nImprovement in MSE: {mse_linear - mse_poly:.4f} ({(1 - mse_poly/mse_linear) * 100:.2f}%)")
print(f"Improvement in R²: {r2_poly - r2_linear:.4f}")

In [None]:
# Visualization to compare linear and polynomial models
plt.figure(figsize=config["plot_figsize"])

# Original data points
plt.scatter(
    data["Height (cm)"],
    data["Weight (kg)"],
    alpha=config["scatter_alpha"],
    color=config["scatter_color"],
    label="Data points",
)

# Sort X for smoother lines
X_sorted = np.sort(X, axis=0)
y_linear_sorted = linear_model.predict(X_sorted)
y_poly_sorted = poly_model.predict(X_sorted)

# Linear regression line
plt.plot(
    X_sorted,
    y_linear_sorted,
    color=config["line_color_linear"],
    linewidth=config["line_width"],
    label=f"Linear model (R² = {r2_linear:.4f})",
)

# Polynomial regression line
plt.plot(
    X_sorted,
    y_poly_sorted,
    color=config["line_color_poly"],
    linewidth=config["line_width"],
    label=f"Polynomial model (R² = {r2_poly:.4f})",
)

plt.title("Comparison of Linear vs Polynomial Regression", fontsize=14)
plt.xlabel("Height (cm)", fontsize=12)
plt.ylabel("Weight (kg)", fontsize=12)
plt.grid(True, alpha=config["grid_alpha"])
plt.legend(fontsize=12)
plt.tight_layout()
plt.show()

## Residual Analysis

If a model captures the trend, its residuals should scatter around zero with no visible pattern. The plots below compare the residuals of both models.
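
Leftover curvature can also be expressed as a number. One simple check, sketched here under assumptions (data regenerated with the notebook's default formula, `default_rng` with seed 1, and a hypothetical helper `residual_curvature`), is to correlate each model's residuals with the centered squared height. A value far from zero means the model missed a quadratic trend.

```python
import numpy as np

# Synthetic data using the same formula and default values as this notebook.
# The generator and seed are illustrative assumptions.
rng = np.random.default_rng(1)
heights = rng.normal(170, 15, 200)
weights = -210 + 0.11 * heights**2 + 0.21 * heights + rng.normal(0, 15, 200)

# Centered squared height acts as a probe for quadratic structure.
curvature_term = (heights - heights.mean()) ** 2


def residual_curvature(degree):
    """Correlation between the residuals of a degree-`degree` fit and the curvature probe."""
    coefs = np.polyfit(heights, weights, degree)
    residuals = weights - np.polyval(coefs, heights)
    return np.corrcoef(residuals, curvature_term)[0, 1]


corr_linear = residual_curvature(1)
corr_quadratic = residual_curvature(2)

print(f"Linear fit:    {corr_linear:.4f}")
print(f"Quadratic fit: {corr_quadratic:.4f}")
```

The quadratic fit's value is zero up to rounding by construction, because least-squares residuals are orthogonal to every fitted feature and the probe is a combination of those features. The informative number is the one for the linear fit.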


In [None]:
# Calculate residuals for both models
residuals_linear = y - y_pred_linear
residuals_poly = y - y_pred_poly

# Create a figure with 2 rows and 2 columns
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Residuals vs. Fitted values plot for linear model
axes[0, 0].scatter(y_pred_linear, residuals_linear, alpha=0.6)
axes[0, 0].axhline(y=0, color="r", linestyle="-")
axes[0, 0].set_xlabel("Predicted Weight (kg)")
axes[0, 0].set_ylabel("Residuals")
axes[0, 0].set_title("Linear Model: Residuals vs Fitted Values")
axes[0, 0].grid(alpha=0.3)

# Histogram of residuals for linear model
sns.histplot(residuals_linear, kde=True, ax=axes[0, 1], color="red", alpha=0.6)
axes[0, 1].axvline(x=0, color="k", linestyle="-")
axes[0, 1].set_xlabel("Residual Value")
axes[0, 1].set_title(
    f"Linear Model: Distribution of Residuals (std={np.std(residuals_linear):.2f})"
)
axes[0, 1].grid(alpha=0.3)

# Residuals vs. Fitted values plot for polynomial model
axes[1, 0].scatter(y_pred_poly, residuals_poly, alpha=0.6, color="green")
axes[1, 0].axhline(y=0, color="r", linestyle="-")
axes[1, 0].set_xlabel("Predicted Weight (kg)")
axes[1, 0].set_ylabel("Residuals")
axes[1, 0].set_title("Polynomial Model: Residuals vs Fitted Values")
axes[1, 0].grid(alpha=0.3)

# Histogram of residuals for polynomial model
sns.histplot(residuals_poly, kde=True, ax=axes[1, 1], color="green", alpha=0.6)
axes[1, 1].axvline(x=0, color="k", linestyle="-")
axes[1, 1].set_xlabel("Residual Value")
axes[1, 1].set_title(
    f"Polynomial Model: Distribution of Residuals (std={np.std(residuals_poly):.2f})"
)
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

## Prediction Example

We now use both models to predict weights for new height values and compare the predictions.
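
Predictions are most trustworthy inside the range of heights the models were trained on, and a polynomial can diverge quickly outside it. The sketch below flags prediction inputs that fall outside the observed range. The helper `outside_training_range` and the five training heights are hypothetical, made up for illustration. In this notebook the training inputs would be the generated `heights` array.

```python
import numpy as np


def outside_training_range(train_x, new_x):
    """Boolean mask marking new inputs that lie outside [min, max] of the training inputs."""
    train_x = np.asarray(train_x, dtype=float)
    new_x = np.asarray(new_x, dtype=float)
    return (new_x < train_x.min()) | (new_x > train_x.max())


# Made-up training heights for illustration only
train_heights = np.array([152.0, 160.0, 171.0, 183.0, 195.0])
example_heights = np.array([150, 160, 170, 180, 190, 200])

mask = outside_training_range(train_heights, example_heights)
print(example_heights[mask])  # heights that would require extrapolation
```

With these made-up training values, 150 and 200 are flagged because they fall below the minimum and above the maximum training height.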


In [None]:
# Example predictions for different heights
example_heights = np.array([150, 160, 170, 180, 190, 200])
example_heights_reshaped = example_heights.reshape(-1, 1)

# Make predictions with both models
linear_predictions = linear_model.predict(example_heights_reshaped)
poly_predictions = poly_model.predict(example_heights_reshaped)

# Create a DataFrame for the comparison
comparison_df = pd.DataFrame(
    {
        "Height (cm)": example_heights,
        "Linear Model Prediction (kg)": linear_predictions,
        "Polynomial Model Prediction (kg)": poly_predictions,
        "Difference (kg)": poly_predictions - linear_predictions,
    }
)

# Display the comparison
comparison_df

## Conclusion

This notebook shows how polynomial regression can improve model fit when the relationship between variables is non-linear. The key observations are:

1. The simple linear regression model cannot capture the curvature in the data.
2. The polynomial regression model fits better, as shown by its lower MSE and higher R².
3. The residual analysis shows that the polynomial model's residuals scatter around zero without the systematic pattern left by the linear model.
4. The polynomial model's predictions better reflect the quadratic relationship used to generate the data.

Although the data here is synthetic, the same checks apply to real-world data: when residuals show structure, it is worth exploring models beyond simple linear regression.
