# Project 1: Predicting House Prices with Linear Regression

*Machine Learning Foundations for Beginners*

*Codecademy Live Learning*

# Overview


In this project, you will build a linear regression model to predict housing prices
using the Ames Housing dataset. This is a real-world dataset containing information
about residential homes in Ames, Iowa.

## Learning Objectives
- Perform exploratory data analysis (EDA)
- Engineer meaningful features for prediction
- Train and evaluate a linear regression model
- Interpret model results and coefficients
- Create professional documentation for your portfolio



## Deliverables
A completed Jupyter notebook that includes:
1. Clear explanations of your methodology
2. Visualizations of data and results
3. A trained linear regression model
4. Evaluation metrics and interpretation
5. Reflection on results and potential improvements

## Instructions
1. Save a copy of this .ipynb in your own Google Drive.
2. Run the code in the Setup section and upload the AmesHousing.csv file.
    - The file will be shared in Discord.
    - Alternatively, you can download it from Kaggle [here](https://www.kaggle.com/datasets/marcopale/housing)
3. Work through each section to explore the data, train the model, and interpret the results.
    - You can review the lecture slides and recording, and see other examples from the instructor, for help.
    - If you are stuck, post a question in #doubts in Discord.
4. When you are finished, share a link to your notebook in the #project-showcase channel in Discord. Also consider creating a git repo and publishing your notebook on GitHub.

The notebook contains a mix of code cells and text cells. Write your code in the code cells and add comments and documentation as needed to explain the code itself. Use the text cells to explain your observations, thought process, and decisions. Imagine you are working as a data scientist and presenting your findings to your company.

# Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Set random seed for reproducibility
np.random.seed(42)

# Configure visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

print("✓ All libraries imported successfully!")

✓ All libraries imported successfully!


## Load the dataset from CSV

In [None]:
from google.colab import files
uploaded = files.upload()

Saving AmesHousing.csv to AmesHousing.csv


In [None]:
df = pd.read_csv('AmesHousing.csv')
df

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,...,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,...,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,...,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,...,0,,,,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,,IR1,Lvl,...,0,,MnPrv,,0,3,2010,WD,Normal,189900
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2925,2926,923275080,80,RL,37.0,7937,Pave,,IR1,Lvl,...,0,,GdPrv,,0,3,2006,WD,Normal,142500
2926,2927,923276100,20,RL,,8885,Pave,,IR1,Low,...,0,,MnPrv,,0,6,2006,WD,Normal,131000
2927,2928,923400125,85,RL,62.0,10441,Pave,,Reg,Lvl,...,0,,MnPrv,Shed,700,7,2006,WD,Normal,132000
2928,2929,924100070,20,RL,77.0,10010,Pave,,Reg,Lvl,...,0,,,,0,4,2006,WD,Normal,170000


# Exploratory Data Analysis

In [None]:
# TODO: Display basic information about the dataset
# Suggested explorations:
# - What are the dimensions of the dataset?
# - What columns are available?
# - What are the data types?
# - Are there any missing values?

print("Dataset shape:", df.shape)
print("\nFirst few rows:", df.head())
print("\nSummary statistics:", df.describe())

df.columns


Dataset shape: (2930, 82)

First few rows:    Order        PID  MS SubClass MS Zoning  Lot Frontage  Lot Area Street  \
0      1  526301100           20        RL         141.0     31770   Pave   
1      2  526350040           20        RH          80.0     11622   Pave   
2      3  526351010           20        RL          81.0     14267   Pave   
3      4  526353030           20        RL          93.0     11160   Pave   
4      5  527105010           60        RL          74.0     13830   Pave   

  Alley Lot Shape Land Contour  ... Pool Area Pool QC  Fence Misc Feature  \
0   NaN       IR1          Lvl  ...         0     NaN    NaN          NaN   
1   NaN       Reg          Lvl  ...         0     NaN  MnPrv          NaN   
2   NaN       IR1          Lvl  ...         0     NaN    NaN         Gar2   
3   NaN       Reg          Lvl  ...         0     NaN    NaN          NaN   
4   NaN       IR1          Lvl  ...         0     NaN  MnPrv          NaN   

  Misc Val Mo Sold Yr Sold Sale

Index(['Order', 'PID', 'MS SubClass', 'MS Zoning', 'Lot Frontage', 'Lot Area',
       'Street', 'Alley', 'Lot Shape', 'Land Contour', 'Utilities',
       'Lot Config', 'Land Slope', 'Neighborhood', 'Condition 1',
       'Condition 2', 'Bldg Type', 'House Style', 'Overall Qual',
       'Overall Cond', 'Year Built', 'Year Remod/Add', 'Roof Style',
       'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type',
       'Mas Vnr Area', 'Exter Qual', 'Exter Cond', 'Foundation', 'Bsmt Qual',
       'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin SF 1',
       'BsmtFin Type 2', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF',
       'Heating', 'Heating QC', 'Central Air', 'Electrical', '1st Flr SF',
       '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Bsmt Full Bath',
       'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr',
       'Kitchen AbvGr', 'Kitchen Qual', 'TotRms AbvGrd', 'Functional',
       'Fireplaces', 'Fireplace Qu', 'Garage Type', 'Garage Yr Blt',
      

In [None]:
# TODO: Perform exploratory data analysis
# Suggested analyses:
# - Distribution of the target variable (SalePrice)
# - Relationships between features and target
# - Correlation analysis
# - Identify potential outliers
# - Understand categorical vs numerical features

# Document your findings with visualizations and written observations.


# Example: Visualize target variable distribution
# TODO: Create a histogram of SalePrice


# TODO: Create visualizations to explore key features
# Consider: scatter plots, box plots, correlation heatmaps


# TODO: Analyze the relationship between square footage and price
# Hint: Features like 'Gr Liv Area' (above ground living area) might be relevant


Write your key findings here:
- What patterns do you observe?
- Which features seem most correlated with price?
- Are there any anomalies or outliers?
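
As one possible starting point for the questions above, the sketch below ranks numeric features by their correlation with `SalePrice` and flags potential outliers with the 1.5 × IQR rule. The `sample` rows are made-up values so the block runs on its own; in the notebook you would pass `df` instead. The helper names `correlation_with_target` and `iqr_outliers` are illustrative and not part of any library.

```python
import pandas as pd

# Made-up rows that reuse column names from the Ames dataset (values are illustrative only).
sample = pd.DataFrame({
    "Gr Liv Area": [900, 1200, 1500, 1800, 2100, 5600],
    "Overall Qual": [4, 5, 6, 7, 8, 10],
    "Year Built": [1950, 1965, 1980, 1995, 2005, 2008],
    "SalePrice": [110000, 140000, 175000, 215000, 260000, 185000],
})


def correlation_with_target(data, target="SalePrice"):
    """Return numeric features ordered by absolute Pearson correlation with the target."""
    corr = data.select_dtypes("number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


def iqr_outliers(series, k=1.5):
    """Return a boolean mask for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


print(correlation_with_target(sample))
print(sample[iqr_outliers(sample["Gr Liv Area"])])
```

A flagged row is only a candidate for closer inspection; whether to drop it is a judgment call you should explain in your write-up.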

# Data Preparation

In [None]:
# TODO: Prepare your data for modeling
# Consider:
# - Handling missing values (if any)
# - Selecting features for your model
# - Encoding categorical variables
# - Feature scaling/normalization (optional but recommended)
# - Removing outliers (optional)

In [None]:
# TODO: Select your features and target variable
# Example structure:
# features = ['feature1', 'feature2', ...]  # Choose your features
# X = df[features]
# y = df['SalePrice']

X = None  # Replace with your feature selection
y = None  # Replace with target variable
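
One way to fill in this cell is sketched below. It adds two engineered columns and then selects the features and the target. The column names `Total SF` and `House Age` are hypothetical names introduced for this example, the feature list is just an illustration rather than a recommended set, and the `sample` rows are made up so the block runs without the CSV.

```python
import pandas as pd

# Made-up rows; in the notebook, work on a copy of `df` instead.
sample = pd.DataFrame({
    "Total Bsmt SF": [1000, 800, 1200],
    "1st Flr SF": [1000, 900, 1200],
    "2nd Flr SF": [0, 600, 0],
    "Overall Qual": [6, 5, 7],
    "Year Built": [1960, 1990, 2005],
    "Yr Sold": [2010, 2009, 2008],
    "SalePrice": [150000, 180000, 230000],
})

# Engineered features (hypothetical names chosen for this example).
sample["Total SF"] = sample["Total Bsmt SF"] + sample["1st Flr SF"] + sample["2nd Flr SF"]
sample["House Age"] = sample["Yr Sold"] - sample["Year Built"]

# Feature matrix and target vector.
features = ["Total SF", "House Age", "Overall Qual"]
X = sample[features]
y = sample["SalePrice"]

print(X.shape, y.shape)
```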

In [None]:
# TODO: Handle categorical variables if you include any
# Hint: Consider sklearn.preprocessing.OneHotEncoder


In [None]:
# TODO: Handle any missing values in your selected features

Explain why you chose these features:
- What features did you include and why?
- Did you create any new features?
- How did you handle categorical variables?

# Model Training

## Train-Test Split

In [None]:
# TODO: Split your data into training and testing sets
# Hint: Use sklearn.model_selection.train_test_split with test_size=0.2
X_train = None
X_test = None
y_train = None
y_test = None

print(f"Training set size: {None}")  # Replace None
print(f"Testing set size: {None}")   # Replace None

Training set size: None
Testing set size: None
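
The split described in the hint can be sketched as follows. The data here is synthetic (10 made-up rows) so the block is self-contained, and `random_state=42` is an assumption chosen to match the seed used in Setup.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 10 rows, with price proportional to living area.
X = pd.DataFrame({"Gr Liv Area": np.arange(1000, 2000, 100)})
y = pd.Series(X["Gr Liv Area"] * 100, name="SalePrice")

# Hold out 20% of the rows for testing; fix the seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")
```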


## Fit a LinearRegression Model
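
A minimal sketch of fitting the model is shown below. It uses noise-free synthetic data (price = 50,000 + 100 × area) so that the learned intercept and coefficient are easy to check; in the notebook you would fit on your own `X_train` and `y_train`.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic training data: price = 50,000 + 100 * area, with no noise.
X_train = pd.DataFrame({"Gr Liv Area": [1000, 1500, 2000, 2500]})
y_train = 50000 + 100 * X_train["Gr Liv Area"]

# Fit ordinary least squares.
model = LinearRegression()
model.fit(X_train, y_train)

print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_[0])

# Predict the price of a hypothetical 1,800 sq ft home.
new_home = pd.DataFrame({"Gr Liv Area": [1800]})
predicted_price = model.predict(new_home)[0]
print("Predicted price:", predicted_price)
```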

# Model Evaluation

## Make predictions

In [None]:
# TODO: Make predictions on both training and test sets
y_train_pred = None
y_test_pred = None


## Calculate evaluation metrics

In [None]:
# TODO: Calculate evaluation metrics
# Suggested metrics: RMSE, MAE, R^2
train_rmse = None
test_rmse = None
train_r2 = None
test_r2 = None
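
The suggested metrics can be computed as sketched below. The helper `regression_metrics` is a hypothetical convenience function, and the prices are made up so the arithmetic is easy to follow: every prediction is off by exactly 10,000, so RMSE and MAE both equal 10,000.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R^2 for a set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "rmse": float(np.sqrt(mse)),  # square root of the mean squared error
        "mae": float(mean_absolute_error(y_true, y_pred)),
        "r2": float(r2_score(y_true, y_pred)),
    }


# Made-up actual and predicted prices.
y_true = [200000, 150000, 300000]
y_pred = [210000, 140000, 290000]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Calling the helper once on the training set and once on the test set makes the comparison straightforward: a training error far below the test error is a common sign of overfitting.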

In [None]:
# TODO: Create visualizations of your results
# Suggested visualizations:
# - Actual vs Predicted prices scatter plot
# - Residual plot
# - Distribution of prediction errors


# Actual vs Predicted Plot
# TODO: Create scatter plot comparing actual and predicted prices


# Residual Plot
# TODO: Create residual plot (errors vs predictions)
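
One possible layout for the two plots above is sketched here with matplotlib. The arrays are made-up stand-ins for `y_test` and `y_test_pred`, and the styling choices (colors, transparency, figure size) are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up actual and predicted prices standing in for y_test and y_test_pred.
y_test = np.array([120000, 150000, 200000, 260000, 310000])
y_test_pred = np.array([130000, 145000, 190000, 270000, 300000])
residuals = y_test - y_test_pred  # positive means the model under-predicted

fig, (ax_fit, ax_res) = plt.subplots(1, 2, figsize=(12, 5))

# Actual vs. predicted: points on the dashed diagonal are perfect predictions.
ax_fit.scatter(y_test, y_test_pred, alpha=0.7)
limits = [y_test.min(), y_test.max()]
ax_fit.plot(limits, limits, color="red", linestyle="--")
ax_fit.set_xlabel("Actual SalePrice")
ax_fit.set_ylabel("Predicted SalePrice")
ax_fit.set_title("Actual vs Predicted")

# Residuals vs. predictions: look for a random scatter around zero.
ax_res.scatter(y_test_pred, residuals, alpha=0.7)
ax_res.axhline(0, color="red", linestyle="--")
ax_res.set_xlabel("Predicted SalePrice")
ax_res.set_ylabel("Residual (actual - predicted)")
ax_res.set_title("Residual Plot")

fig.tight_layout()
```

A funnel shape or a curve in the residual plot suggests that the linear model is missing some structure in the data.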

# Interpretation

In [None]:
# TODO: Examine feature coefficients
# What do the coefficients tell you about feature importance?

print("\nFeature Coefficients:")
# Display coefficients here


Feature Coefficients:


Analyze your model:
- Which features have the strongest positive/negative impact on price?
- Does this align with your intuition about housing prices?
- Are there any surprising findings?
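
To support the questions above, the sketch below pairs each coefficient with its feature name. The data is synthetic and generated from a known formula (price = 40,000 + 90 × area − 500 × age), and `House Age` is a hypothetical engineered feature, so the recovered coefficients can be checked against the formula.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic training data generated from a known linear formula.
X_train = pd.DataFrame({
    "Gr Liv Area": [1000, 1500, 2000, 2500, 1200, 1800],
    "House Age": [50, 30, 10, 5, 40, 20],
})
y_train = 40000 + 90 * X_train["Gr Liv Area"] - 500 * X_train["House Age"]

model = LinearRegression().fit(X_train, y_train)

# Label each coefficient with its feature and sort from most positive to most negative.
coefficients = pd.Series(model.coef_, index=X_train.columns).sort_values(ascending=False)

print("\nFeature Coefficients:")
print(coefficients)
```

Each coefficient is the change in predicted price for a one-unit increase in that feature with the others held fixed. Because the features use different units, raw magnitudes are not directly comparable unless the features are scaled first.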

# Conclusion

TODO: Summarize your project results

1. Model Performance:
   - How well does your model predict housing prices?
   - Is there evidence of overfitting or underfitting?

2. Key Insights:
   - What are the most important factors affecting house prices?
   - What did you learn from this analysis?

3. Limitations:
   - What are the limitations of your model?
   - What assumptions does linear regression make?

4. Future Improvements:
   - How could you improve this model?
   - What additional features or techniques might help?
   - Would other algorithms perform better?

You are finished! However, you can always continue working on this project as you learn more throughout the course (use git to track your changes).

## Suggested enhancements
Here are some things you can do to make this really shine as a portfolio project:
- Polish up the documentation, analysis, and visualizations in this notebook
- Create interactive visualizations, e.g. using [Plotly](https://github.com/plotly/plotly.py)
- Publish your work as a web app, e.g. using [Streamlit](https://streamlit.io/) or [Mercury](https://github.com/mljar/mercury)
- Write a blog post explaining your approach, the results, and what you learned.

## A note on licensing and attribution

The Ames Housing Dataset is provided under the GPL-2 license. For your portfolio projects, you have two options:

1. (Recommended) Don't include the CSV file in your GitHub repo. Instead, link to the original data source in your README.
2. If you include the data, you must include proper attribution and the GPL-2 license in a DATA_LICENSE file.

Your code and analysis remain your own intellectual property and can be licensed however you choose.