First, let's download and load the data.
I picked this dataset from Kaggle; here is the link:
https://www.kaggle.com/code/hnazari8665/ames-housing-dataset/input
Then I loaded it with pandas and printed a quick overview.

In [3]:
import pandas as pd

file_path = 'AmesHousing.csv'
data = pd.read_csv(file_path)

# Display basic information about the dataset
print(data.head())
data.info()  # info() prints its summary directly and returns None
print(data.describe())
print(data.isnull().sum())


   Order        PID  MS SubClass MS Zoning  Lot Frontage  Lot Area Street  \
0      1  526301100           20        RL         141.0     31770   Pave   
1      2  526350040           20        RH          80.0     11622   Pave   
2      3  526351010           20        RL          81.0     14267   Pave   
3      4  526353030           20        RL          93.0     11160   Pave   
4      5  527105010           60        RL          74.0     13830   Pave   

  Alley Lot Shape Land Contour  ... Pool Area Pool QC  Fence Misc Feature  \
0   NaN       IR1          Lvl  ...         0     NaN    NaN          NaN   
1   NaN       Reg          Lvl  ...         0     NaN  MnPrv          NaN   
2   NaN       IR1          Lvl  ...         0     NaN    NaN         Gar2   
3   NaN       Reg          Lvl  ...         0     NaN    NaN          NaN   
4   NaN       IR1          Lvl  ...         0     NaN  MnPrv          NaN   

  Misc Val Mo Sold Yr Sold Sale Type  Sale Condition  SalePrice  
0       

Data Cleaning:
After a brief overview, you can see that the dataset has 2930 entries and 82 columns. There are some columns with missing values, particularly Lot Frontage, Alley, Mas Vnr Type, Mas Vnr Area, Bsmt Qual, Bsmt Cond, Bsmt Exposure, and others.
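
To see which columns are affected most, it can help to sort the missing-value counts instead of scanning the full `isnull().sum()` printout. Here is a small sketch; `missing_summary` is a helper name made up for this example, and the toy frame only stands in for the real data.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and share of missing values for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "share": counts / len(df),
    })
    # Keep only columns with at least one missing value, worst first
    summary = summary[summary["missing"] > 0]
    return summary.sort_values("missing", ascending=False)

# Toy frame for illustration only (hypothetical, not the full dataset)
toy = pd.DataFrame({
    "Lot Frontage": [141.0, np.nan, 81.0, 93.0],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "Lot Area": [31770, 11622, 14267, 11160],
})
summary = missing_summary(toy)
print(summary)
```

On the real data you would call `missing_summary(data)`; columns with a very high share are the candidates for dropping.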

So let's start by dropping the columns with too many missing values and filling in the rest.

In [4]:
# Drop columns with too many missing values
data = data.drop(['Alley', 'Pool QC', 'Fence', 'Misc Feature'], axis=1)

# Fill missing values for numerical columns with the median.
# Assigning the result back avoids the chained `inplace=True` pattern,
# which newer pandas versions warn about.
num_cols = ['Lot Frontage', 'Mas Vnr Area', 'BsmtFin SF 1', 'BsmtFin SF 2',
            'Bsmt Unf SF', 'Total Bsmt SF', 'Bsmt Full Bath', 'Bsmt Half Bath',
            'Garage Yr Blt', 'Garage Cars', 'Garage Area']
for col in num_cols:
    data[col] = data[col].fillna(data[col].median())

# Fill missing values for categorical columns with the mode
cat_cols = ['Mas Vnr Type', 'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure',
            'BsmtFin Type 1', 'BsmtFin Type 2', 'Garage Type', 'Garage Finish',
            'Garage Qual', 'Garage Cond']
for col in cat_cols:
    data[col] = data[col].fillna(data[col].mode()[0])
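
If listing every column by name gets tedious, a variation is to pick the columns by dtype instead. This is only a sketch on a toy frame: `impute_by_dtype` is a hypothetical helper, not part of pandas, and on the real data it would also fill columns that the cell above leaves alone.

```python
import numpy as np
import pandas as pd

def impute_by_dtype(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the median and all other columns with the mode."""
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    cat_cols = out.select_dtypes(exclude="number").columns
    for col in num_cols:
        out[col] = out[col].fillna(out[col].median())
    for col in cat_cols:
        modes = out[col].mode()
        if not modes.empty:  # a column that is entirely missing has no mode
            out[col] = out[col].fillna(modes.iloc[0])
    return out

# Toy frame for illustration only (values are made up)
toy = pd.DataFrame({
    "Garage Area": [528.0, np.nan, 312.0, 522.0],
    "Garage Type": ["Attchd", "Attchd", None, "Detchd"],
})
filled = impute_by_dtype(toy)
print(filled)
```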



Feature Engineering:
Linear regression needs numeric inputs, so we convert the categorical variables to dummy/indicator variables (one-hot encoding).

In [5]:
data = pd.get_dummies(data, drop_first=True)
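
To make it concrete what `drop_first=True` does, here is a toy example (the two-column frame is made up for illustration). For each categorical column, the first category in sorted order is dropped and becomes the baseline, while numeric columns pass through unchanged.

```python
import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({
    "Lot Area": [31770, 11622, 14267],
    "MS Zoning": ["RL", "RH", "RL"],
})

encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
# ['Lot Area', 'MS Zoning_RL']
```

Here `RH` is the dropped baseline, so a row with `MS Zoning_RL` equal to false (or 0) means the zoning was `RH`.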


Now we train a linear regression model. The features are standardized first, with the scaler fitted on the training split only.

In [6]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Define features and target variable
X = data.drop('SalePrice', axis=1)
y = data['SalePrice']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features; fit the scaler on the training split only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a linear regression model
model = LinearRegression()
model.fit(X_train_scaled, y_train)



Model Evaluation and Fine-Tuning:

In [7]:
# Evaluate the model
from sklearn.metrics import mean_squared_error, r2_score

y_train_pred = model.predict(X_train_scaled)
y_test_pred = model.predict(X_test_scaled)

# Training set evaluation
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
train_r2 = r2_score(y_train, y_train_pred)

# Testing set evaluation
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
test_r2 = r2_score(y_test, y_test_pred)

train_rmse, train_r2, test_rmse, test_r2


(20851.98002184223, 0.9268711496324904, 2174349311901.0908, -589680906295430.1)
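
The training scores look reasonable, but the test RMSE and R² are off by many orders of magnitude, which points to unstable coefficients rather than ordinary overfitting. One plausible cause, not verified on this dataset, is that some one-hot columns are almost exact linear combinations of others in the training split. The sketch below reproduces that failure mode on synthetic data (everything in it is made up) and compares plain least squares with a ridge penalty, which is used here as an assumed remedy and not as something the notebook above already does.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Training data: x2 is almost an exact copy of x1 (near-perfect collinearity)
x1_train = rng.normal(size=200)
x2_train = x1_train + 1e-6 * rng.normal(size=200)
X_train = np.column_stack([x1_train, x2_train])
y_train = 3 * x1_train + rng.normal(scale=0.5, size=200)

# Test data: the two columns no longer move together
x1_test = rng.normal(size=50)
x2_test = x1_test + rng.normal(size=50)
X_test = np.column_stack([x1_test, x2_test])
y_test = 3 * x1_test + rng.normal(scale=0.5, size=50)

def fit_and_score(model):
    """Fit on the training split and return the RMSE on the test split."""
    model.fit(X_train, y_train)
    return np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

ols_rmse = fit_and_score(LinearRegression())
ridge_rmse = fit_and_score(Ridge(alpha=1.0))
print(f"OLS test RMSE:   {ols_rmse:.2f}")
print(f"Ridge test RMSE: {ridge_rmse:.2f}")
```

If the same thing is happening with the housing data, swapping `LinearRegression()` for `Ridge(alpha=1.0)` in the training cell would be a quick check; the value of `alpha` is a guess and would need tuning.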