# AI Summer Camp: Linear Regression

This notebook covers the theory and implementation of Linear Regression using the House Prices dataset.

## Linear Regression Theory

Linear regression is a fundamental statistical and machine learning technique used to model the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship between the variables.

### The Linear Regression Equation

The basic form of a linear regression model is:

```
y = β₀ + β₁x₁ + β₂x₂ + ... + ε
```

Where:
- `y` is the dependent variable (what we're trying to predict)
- `x₁, x₂, ...` are the independent variables (features)
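- `β₀` is the y-intercept (the expected value of y when all x's are 0)
- `β₁, β₂, ...` are the coefficients (weights) for each independent variable; each one is the expected change in y for a one-unit increase in its x, holding the other x's fixed
- `ε` is the error term (the part of y that the linear terms do not explain; for a fitted model its estimate is the residual, the actual y minus the predicted y)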
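
To make the equation concrete, here is a minimal sketch that evaluates it with NumPy. The coefficient values, feature values, and variable names (`beta_0`, `beta`, `y_actual`) are made up for illustration and do not come from the House Prices data.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration
beta_0 = 50.0                    # intercept (β₀)
beta = np.array([2.0, -1.0])     # weights (β₁, β₂)

# Three made-up observations with two features each
X = np.array([[1.0, 2.0],
              [3.0, 0.0],
              [0.0, 5.0]])
y_actual = np.array([51.0, 55.0, 46.0])

# Predicted values: β₀ + β₁x₁ + β₂x₂ for every row at once
y_hat = beta_0 + X @ beta

# Residuals: actual minus predicted, the observable stand-in for ε
residuals = y_actual - y_hat

print("Predictions:", y_hat)
print("Residuals:  ", residuals)
```

Each row of `X` is one observation, so the matrix product `X @ beta` applies the same weights to every row in a single step.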

### Key Concepts

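1. **Best Fit Line**: The goal is to find the line (a hyperplane when there is more than one feature) that best fits the data points, minimizing the overall error.
2. **Ordinary Least Squares (OLS)**: This is the most common method for estimating the coefficients. It chooses the values of `β₀, β₁, ...` that minimize the sum of squared residuals.
3. **Model Evaluation**: We use metrics like R-squared (R²) and Mean Squared Error (MSE) to evaluate the model's performance. R² is the fraction of the variance in `y` that the model explains (1 is a perfect fit), and MSE is the average squared prediction error, expressed in squared units of the target.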
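
The concepts above can be sketched on synthetic data. This example assumes a single feature and a made-up true relationship `y = 3 + 2x` plus noise. It solves the least-squares problem with `np.linalg.lstsq` rather than scikit-learn, purely to show what OLS computes, and then derives MSE and R² by hand.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 3 + 2x + noise
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, size=n)
y = 3.0 + 2.0 * x + rng.normal(0, 1.0, size=n)

# Design matrix with a column of ones so the intercept is estimated too
A = np.column_stack([np.ones(n), x])

# OLS: find the coefficients that minimize the sum of squared residuals
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ coef

# Model evaluation computed from the definitions
mse = np.mean((y - y_hat) ** 2)
ss_res = np.sum((y - y_hat) ** 2)        # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)     # total variation around the mean
r2 = 1.0 - ss_res / ss_tot

print(f"Intercept: {coef[0]:.3f}, slope: {coef[1]:.3f}")
print(f"MSE: {mse:.3f}, R-squared: {r2:.3f}")
```

The estimated intercept and slope should land close to 3 and 2, and `LinearRegression` in scikit-learn solves the same minimization problem.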

## Linear Regression Implementation

Let's implement a linear regression model using the House Prices dataset from Kaggle.

In [None]:
# Install required libraries
!pip install pandas numpy matplotlib seaborn scikit-learn

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
# Load the data
# Note: You need to upload the train.csv file to your Colab environment or use the Kaggle API
train_df = pd.read_csv('train.csv')
print(train_df.head())

In [None]:
# Data exploration
train_df.info()  # info() prints its summary itself and returns None, so no print() is needed
print(train_df.describe())

# Visualize the target variable (SalePrice)
plt.figure(figsize=(10, 6))
sns.histplot(train_df['SalePrice'], kde=True)
plt.title('Sale Price Distribution')
plt.show()

In [None]:
# Preprocessing
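# Separate features and target
# 'Id' is a row identifier in the Kaggle file, not a property of the house, so it is dropped too
X = train_df.drop(columns=['Id', 'SalePrice'])
y = train_df['SalePrice']

# Identify numeric and categorical columns
# (selecting by "number" / "not number" does not depend on the exact integer, float, or string dtype)
numeric_features = X.select_dtypes(include='number').columns
categorical_features = X.select_dtypes(exclude='number').columns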

# Create preprocessing steps
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Create a pipeline
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('regressor', LinearRegression())])

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

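# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # same units as SalePrice, so easier to interpret than MSE
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:,.0f}")
print(f"Root Mean Squared Error: {rmse:,.0f}")
print(f"R-squared Score: {r2:.4f}")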

In [None]:
# Visualize predictions vs actual
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Predicted vs Actual House Prices')
plt.show()