![image.png](https://i.imgur.com/a3uAqnb.png)

# Machine Learning with Scikit-Learn - Lab Exercise

In this lab, you'll work with a real dataset to practice the machine learning concepts we covered in class. We'll be predicting house prices using various features.

## Dataset Overview
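We're using the California housing dataset that ships with scikit-learn. Each row represents a census block group (a small district), not an individual house, with features like:
- Median income in the block group
- Median house age
- Average rooms per household
- Block group population
- And more...

Your goal: build a model that accurately predicts the median house value of a block group (the target, expressed in units of $100,000).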

## Step 1: Import Libraries

Fill in the missing imports. You'll need:
- pandas for data manipulation
- numpy for numerical operations
- matplotlib and seaborn for visualization
- sklearn modules for machine learning

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# TODO: Import train_test_split from sklearn.model_selection
# TODO: Import LinearRegression from sklearn.linear_model
# TODO: Import RandomForestRegressor from sklearn.ensemble
# TODO: Import mean_squared_error and r2_score from sklearn.metrics

# Set random seed for reproducibility
np.random.seed(42)

## Step 2: Load and Explore the Data

In [None]:
# Load the California housing dataset
from sklearn.datasets import fetch_california_housing

# TODO: Use fetch_california_housing() to load the data
housing =

# TODO: Create a DataFrame with the features
# Hint: use housing.data for features and housing.feature_names for column names
df =

# TODO: Add the target variable (house prices) as a new column called 'price'
df['price'] =

print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
df.head()
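
The pattern in this step (a feature array, a list of column names, and a separate target vector combined into one DataFrame) can be sketched on a tiny hand-made example. The arrays and the names `toy_df`, `feat_a`, and `feat_b` below are invented for illustration and are not part of the lab dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for a loaded dataset: a 2-D feature array,
# matching column names, and a 1-D target vector (all values invented).
data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
feature_names = ["feat_a", "feat_b"]
target = np.array([0.5, 1.5, 2.5])

# Build the frame from the features, then attach the target as a new column.
toy_df = pd.DataFrame(data, columns=feature_names)
toy_df["target"] = target

print(toy_df.shape)  # (3, 3)
print(toy_df.head())
```

The toy frame uses its own variable name so running it will not overwrite the `df` you build for the lab.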

## Step 3: Basic Data Analysis

Let's understand our data better before building models.

In [None]:
# TODO: Display basic statistics about the dataset
# Hint: use the describe() method


In [None]:
# TODO: Check for missing values
# Hint: use isnull().sum()


In [None]:
# TODO: Create a histogram of house prices
plt.figure(figsize=(10, 6))
# Your code here

plt.title('Distribution of House Prices')
plt.xlabel('Median house value ($100,000s)')
plt.ylabel('Frequency')
plt.show()
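
As a rough illustration of what the two hints in this step return, here is a sketch on a small invented frame with one deliberately missing value. The column names and numbers are assumptions made up for the example:

```python
import numpy as np
import pandas as pd

# Invented toy data; one value in "rooms" is missing on purpose.
toy_df = pd.DataFrame({
    "rooms": [4.0, 5.0, np.nan, 6.0],
    "value": [1.2, 1.8, 2.1, 2.6],
})

# describe() reports count, mean, std, min, quartiles, and max per numeric column.
# Missing values are excluded from those statistics.
summary = toy_df.describe()

# isnull() marks each missing cell as True; sum() counts them per column.
missing = toy_df.isnull().sum()

print(summary)
print(missing)
```

Note that the `count` row of `describe()` already hints at missing data: it is lower for any column that contains `NaN` values.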

## Step 4: Feature Analysis

Let's see which features are most correlated with house prices.

In [None]:
# TODO: Create a correlation matrix and visualize it
plt.figure(figsize=(12, 10))

# Calculate correlation matrix
corr_matrix =

# Create heatmap
# Hint: use sns.heatmap() with annot=True to show correlation values

plt.title('Feature Correlation Matrix')
plt.show()

In [None]:
# TODO: Show correlations with the target variable (price) in descending order
# Hint: Get correlations with 'price' column and sort them
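
To see what a correlation ranking looks like, the sketch below builds a synthetic frame where one column drives the target and another is pure noise. All names (`strong`, `unrelated`, `target`) and the data-generating formula are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
strong = rng.normal(size=n)

# Synthetic data: "target" depends almost entirely on "strong".
toy_df = pd.DataFrame({
    "strong": strong,
    "unrelated": rng.normal(size=n),
    "target": 2.0 * strong + 0.1 * rng.normal(size=n),
})

# corr() computes pairwise Pearson correlations by default.
corr = toy_df.corr()

# Select the target's column and sort from most positive to most negative.
target_corr = corr["target"].sort_values(ascending=False)
print(target_corr)
```

The target always correlates perfectly with itself, so it appears first in the ranking. Pearson correlation only measures linear association, so a feature with a low score can still be useful to a non-linear model.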


## Step 5: Prepare Data for Machine Learning

Split the data into features (X) and target (y), then create training and testing sets.

In [None]:
# TODO: Separate features and target
# X should contain all columns except 'price'
# y should contain only the 'price' column
X =
y =

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

In [None]:
# TODO: Split the data into training and testing sets
# Use 80% for training, 20% for testing
# Set random_state=42 for reproducibility
X_train, X_test, y_train, y_test =

print(f"Training set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")
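
The effect of an 80/20 split can be checked on a tiny array where the result is easy to verify by eye. This is a minimal sketch with invented data and hypothetical variable names (`X_toy`, `y_toy`, and so on):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten invented samples with two features each; row i is [2*i, 2*i + 1].
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```

Rows are shuffled before splitting, but each feature row stays paired with its own target value.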

## Step 6: Build and Train Models

We'll compare two different algorithms: Linear Regression and Random Forest.

### Linear Regression Model

In [None]:
# TODO: Create and train a Linear Regression model
lr_model =

# Train the model

print("Linear Regression model trained!")

### Random Forest Model

In [None]:
# TODO: Create and train a Random Forest model
# Use 100 trees (n_estimators=100) and random_state=42
rf_model =

# Train the model

print("Random Forest model trained!")
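
Both estimators follow the same scikit-learn pattern: construct the object, then call `fit(X, y)`. The sketch below shows that pattern on synthetic data with a known linear relationship. The data, the variable names, and the choice of 20 trees are assumptions made for a quick illustration, not the lab's settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic, noise-free data: y = 3*x0 - 2*x1 + 1.
X_toy = rng.uniform(-1, 1, size=(200, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 1.0

# fit() returns the estimator itself, so construction and training can be chained.
lin = LinearRegression().fit(X_toy, y_toy)
forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_toy, y_toy)

# The linear model recovers the coefficients and intercept used to build y.
print(lin.coef_, lin.intercept_)

# The forest has no coefficients; it predicts by averaging its trees.
print(forest.predict(X_toy[:3]))
```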

## Step 7: Make Predictions and Evaluate Models

Let's see how well our models perform on the test data.

In [None]:
# TODO: Make predictions with both models
lr_predictions =
rf_predictions =

In [None]:
# TODO: Calculate evaluation metrics for both models
# Calculate Mean Squared Error and R² score

# Linear Regression metrics
lr_mse =
lr_r2 =

# Random Forest metrics
rf_mse =
rf_r2 =

print("Model Performance Comparison:")
print("\nLinear Regression:")
print(f"  Mean Squared Error: {lr_mse:.4f}")
print(f"  R² Score: {lr_r2:.4f}")

print("\nRandom Forest:")
print(f"  Mean Squared Error: {rf_mse:.4f}")
print(f"  R² Score: {rf_r2:.4f}")
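
To make the two metrics concrete, here is a small worked example with invented numbers that can be checked by hand. The squared errors are 0.25, 0, 0.25, and 0, so MSE = 0.5 / 4 = 0.125. The true values have mean 6 and total squared deviation 20, so R² = 1 - 0.5 / 20 = 0.975.

```python
from sklearn.metrics import mean_squared_error, r2_score

# Invented values, chosen so the arithmetic is easy to verify by hand.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

# Both functions take the true values first, then the predictions.
toy_mse = mean_squared_error(y_true, y_pred)
toy_r2 = r2_score(y_true, y_pred)

print(f"MSE: {toy_mse:.3f}")  # MSE: 0.125
print(f"R²: {toy_r2:.3f}")    # R²: 0.975
```

Lower MSE is better, and it is expressed in squared target units. R² is unitless, with 1.0 meaning perfect predictions.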

## Step 8: Visualize Results

Create scatter plots to see how well our predictions match the actual prices.

In [None]:
# Scatter plots comparing predictions vs actual values (this code is provided; run it after Step 7)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Linear Regression plot: actual (x) vs predicted (y); the dashed line marks perfect predictions
ax1.scatter(y_test, lr_predictions, alpha=0.5)
ax1.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
ax1.set_xlabel('Actual Prices')
ax1.set_ylabel('Predicted Prices')
ax1.set_title(f'Linear Regression (R² = {lr_r2:.3f})')

# Random Forest plot: same layout as the Linear Regression panel
ax2.scatter(y_test, rf_predictions, alpha=0.5)
ax2.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
ax2.set_xlabel('Actual Prices')
ax2.set_ylabel('Predicted Prices')
ax2.set_title(f'Random Forest (R² = {rf_r2:.3f})')

plt.tight_layout()
plt.show()

## Step 9: Feature Importance (Bonus)

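A trained Random Forest exposes impurity-based importance scores through its `feature_importances_` attribute. The scores sum to 1 and indicate how much each feature contributed to the splits in the trees.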

In [None]:
# TODO: Get feature importance from the Random Forest model
# Create a DataFrame with feature names and their importance scores
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': # TODO: Get feature_importances_ from rf_model
})

# TODO: Sort by importance in descending order
feature_importance =

print("Feature Importance (Random Forest):")
print(feature_importance)

In [None]:
# TODO: Create a bar plot of feature importance
plt.figure(figsize=(10, 6))
# Hint: use plt.barh() for horizontal bar plot

plt.title('Feature Importance (Random Forest)')
plt.xlabel('Importance')
plt.show()
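
As a sanity check on what these scores mean, the sketch below trains a small forest on synthetic data where only one feature carries information. The column names `signal` and `noise`, the data, and the 20-tree setting are all assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data: the target depends only on "signal"; "noise" is irrelevant.
X_toy = pd.DataFrame({
    "signal": rng.uniform(size=300),
    "noise": rng.uniform(size=300),
})
y_toy = 5.0 * X_toy["signal"]

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_toy, y_toy)

# Pair each column name with its score, then rank from most to least important.
toy_importance = pd.DataFrame({
    "feature": X_toy.columns,
    "importance": forest.feature_importances_,
}).sort_values("importance", ascending=False)

print(toy_importance)
```

Here `signal` should receive nearly all of the importance. On real data, impurity-based scores can favor features with many distinct values, so treat the ranking as a guide rather than proof.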