### Disclaimer

The following document was prepared for submission to the "Housing Prices Competition for Kaggle Learn Users".

All references can be traced back to the following links:

[Competition Overview](https://www.kaggle.com/competitions/home-data-for-ml-course/overview)  
[Intermediate ML Course](https://www.kaggle.com/learn/intermediate-machine-learning)

# Goal
It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable. 
# Metric
Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
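
As a minimal sketch of this metric (the function name `log_rmse` and the sample prices are illustrative assumptions, not part of the competition code), the snippet below shows that a 10% overestimate is penalized equally for a cheap house and for an expensive one. Both values equal ln(1.1), roughly 0.0953.

```python
import numpy as np

def log_rmse(y_true, y_pred):
    """RMSE between log(predicted price) and log(observed price)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Hypothetical prices: both predictions are 10% too high.
cheap = log_rmse([100_000], [110_000])
expensive = log_rmse([500_000], [550_000])
print(round(cheap, 4), round(expensive, 4))
```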

# Strategy
We will use the knowledge obtained in the "Intro to Machine Learning" course from Kaggle. 

Notes with more detailed theory explanation can be found in my GitHub repository: https://github.com/CallejoSanzDavid/IA-Portfolio/tree/main

In [90]:
# Required libraries
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

## Selecting Data for Modeling

In [91]:
housing_data = pd.read_csv("train.csv")

# Reorder columns
cols = housing_data.columns.tolist()
cols.insert(cols.index("MSSubClass"), cols.pop(cols.index("SalePrice")))
housing_data = housing_data[cols]

housing_data.describe()

,Id,SalePrice,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
count,1460.0,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,180921.19589,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,...,472.980137,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753
std,421.610009,79442.502883,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,...,213.804841,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095
min,1.0,34900.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0
25%,365.75,129975.0,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,...,334.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0
50%,730.5,163000.0,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,...,480.0,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0
75%,1095.25,214000.0,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,...,576.0,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0
max,1460.0,755000.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,...,1418.0,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0


In [92]:
housing_data.columns

Index(['Id', 'SalePrice', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea',
       'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond'

## Data Cleaning and Choosing Features
With this quantity of data, it is important to identify the different types of information and clean them for better analysis. If we skip this step, we will most likely run into errors when fitting the data to our model.

Following the basic course, I ran into this issue. To solve it, I implemented the Pipeline methodology, a good practice taught in Kaggle's "Intermediate Machine Learning" course.

### Lines with missing information
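Rows with missing values do not help the model make reliable predictions, so a simple option is to drop them. Several columns in this data set contain missing values (for example, `LotFrontage` and `MasVnrArea` have fewer than 1460 entries in the `describe()` output above). Note that the next cell rebuilds `X_train` from `housing_data`, so this drop does not carry over; in the final workflow, missing values are handled by the imputers in the pipeline.

In [93]:
# Drop rows with missing information.
# Note: X_train is reassigned from housing_data in the next cell,
# so the result of this drop is not used later on.
X_train = housing_data.dropna(axis=0)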

### Identifier Variables
Identifier variables are unique values used to identify each record (e.g., `Id`). They carry no predictive information, and the target variable (`SalePrice`) must not be among the features, so both are removed before modeling.

In [94]:
# Drop target and identifier variables
X_train = housing_data.drop(columns=['Id', 'SalePrice'])

### Numerical Variables
Variables that represent measurable quantities.

Types:
- Continuous: Can take any value within a range.
- Discrete: Represent counts or whole numbers.

In [95]:
numerical_cols = [col for col in X_train.columns if X_train[col].dtype in ['int64', 'float64']]
print(numerical_cols)

['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']


### Categorical Variables
Variables that represent categories or groups.

Types:
- Nominal: Categories without inherent order.
- Ordinal: Categories with a meaningful order.
```
    Ex	Excellent
    Gd	Good
    TA	Average/Typical
    Fa	Fair
    Po	Poor
```
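
One way to preserve the order of such a scale is to map it to integers. The sketch below uses pandas; the `quality_order` mapping and the sample values are illustrative assumptions, and the pipeline in this notebook one-hot encodes these columns instead.

```python
import pandas as pd

# Illustrative mapping from the quality scale to integers (higher = better).
quality_order = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Hypothetical sample of an ordinal column such as KitchenQual.
kitchen_qual = pd.Series(["Gd", "TA", "Ex", "Fa"])
kitchen_qual_encoded = kitchen_qual.map(quality_order)
print(kitchen_qual_encoded.tolist())
```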

In [96]:
categorical_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]
print(categorical_cols)

['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType', 'SaleCondition']


## Pipelines
Before fitting the model on the selected features, it is good practice to wrap the preprocessing in a pipeline. This ensures that all preprocessing steps (such as handling missing values, scaling, and encoding) are applied consistently and correctly, so the data reaches the model in a form it can learn from effectively.

A pipeline guarantees that the data is cleaned, transformed, and prepared in the exact same way every time, whether you're training, validating, or predicting. It prevents errors, avoids data leakage, and keeps your workflow organized and reproducible. Without it, you risk applying inconsistent transformations, leaking test data into training, or forgetting key steps. With it, your model gets reliable input, and your code stays clean and scalable.

### Numerical Variables
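To transform the numerical variables, we use `SimpleImputer(strategy='median')`, which replaces missing values with the median of the corresponding column.

Fitting the Random Forest model on data with missing values raises an error in scikit-learn releases that lack native missing-value support for random forests (to my knowledge, those before 1.4), so imputing first keeps the pipeline working regardless of the installed version.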
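
As a small illustration of median imputation (the array below is made-up data, not taken from `train.csv`), the missing entry is replaced by the median of the observed values, which is 80:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column feature with one missing entry.
lot_frontage = np.array([[60.0], [np.nan], [80.0], [100.0]])

imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(lot_frontage)
print(filled.ravel())
```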

In [97]:
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median'))
])

### Categorical Variables
Most ML models (like linear regression, decision trees, etc.) can't handle text or categories directly, which is why it is good practice to transform these variables beforehand. One-hot encoding is especially useful when the categories have no natural order, which is how we treat them here.

- `SimpleImputer(strategy='most_frequent')`: Fills in missing values using the most common value in each column, since models can't handle missing values directly. Using the most frequent value avoids introducing outliers or unrealistic replacements.
- `OneHotEncoder(handle_unknown='ignore')`: Converts categorical variables into binary columns. `handle_unknown='ignore'` is used to avoid errors when new categories appear in test data.
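
To make the effect of `handle_unknown='ignore'` concrete, here is a minimal sketch with made-up categories ("Dirt" is a hypothetical value that the encoder never sees during fitting). The unseen category is encoded as a row of zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Categories seen during training (illustrative values).
train_values = np.array([["Pave"], ["Grvl"]], dtype=object)
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_values)

# "Dirt" was not seen during fit, so its row is all zeros.
new_values = np.array([["Pave"], ["Dirt"]], dtype=object)
encoded = encoder.transform(new_values).toarray()
print(encoder.categories_[0].tolist())
print(encoded.tolist())
```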

In [98]:
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

### Transformation designation
The following code combines the numerical and categorical pipelines and specifies which columns each one is applied to before the data is fed into the machine learning model.

The final argument, `remainder='passthrough'`, passes any columns that are not listed through without transformation.
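
To illustrate the `remainder` argument, the sketch below uses a hypothetical two-column frame (`area` and `rooms` are made-up names). Only `area` is listed for imputation; `rooms` is not listed, yet it is kept and appended after the transformed columns. With the default `remainder='drop'`, it would be discarded.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical frame: "area" is listed for imputation, "rooms" is not.
df = pd.DataFrame({"area": [50.0, np.nan, 70.0], "rooms": [2, 3, 4]})

ct = ColumnTransformer(
    transformers=[("num", SimpleImputer(strategy="median"), ["area"])],
    remainder="passthrough",
)
result = ct.fit_transform(df)
print(result)
```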

In [99]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_pipeline, numerical_cols),
        ('cat', cat_pipeline, categorical_cols)
    ], remainder='passthrough')

## Selecting The Prediction Target

In [100]:
y_train = housing_data.SalePrice

## Building Our Model

To create our model, we will use the scikit-learn library (`sklearn`). The workflow has four steps:
- Define: Choose the type of model, in our case a Random Forest.
- Fit: Capture patterns from the provided data.
- Predict: Generate values for new data.
- Evaluate: Determine how accurate the model's predictions are.

In [101]:
# Create the model pipeline
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    # Extra safety net: the preprocessor above already imputes every listed column.
    ('imputer', SimpleImputer(strategy='mean')),
    ('model', RandomForestRegressor(random_state=1))
])

model_pipeline.fit(X_train, y_train)

# Model prediction
For the final submission to the competition, I adapt the code provided on the Kaggle website.

In [102]:
# Load test data
test_data = pd.read_csv("test.csv")

X_test = test_data.drop(columns=['Id'])
X_test = X_test[numerical_cols + categorical_cols]  # keep only relevant features

predictions = model_pipeline.predict(X_test)

# Run the code to save predictions in the format used for competition scoring
output = pd.DataFrame({
    'Id': test_data['Id'],
    'SalePrice': predictions
})
output.to_csv('submission.csv', index=False)

# Model Validation

Now that the model is built, we need to know how accurate it is. The results in this competition are evaluated on the Root Mean Squared Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price.
```python
from sklearn.metrics import mean_squared_error
```
Always make sure that your feature matrix and target vector are sliced from the same original DataFrame and stay aligned through every transformation.
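
A minimal sketch of what alignment means in practice, using a hypothetical three-row frame: if rows are removed from the features, the target must be selected with the same index so that each row of `X` still matches its own price.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature dataset with one incomplete row.
df = pd.DataFrame({
    "LotArea": [8450.0, np.nan, 11250.0],
    "SalePrice": [208500, 181500, 223500],
})

X = df.drop(columns=["SalePrice"]).dropna()  # row 1 is removed from X
y = df.loc[X.index, "SalePrice"]             # select the same rows for y

print(X.index.tolist(), y.tolist())
```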


In [103]:
X, y = housing_data.drop(columns=['Id', 'SalePrice']), housing_data['SalePrice']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, random_state=42)

model_pipeline.fit(X_train, y_train)
preds = model_pipeline.predict(X_valid)

rmse = np.sqrt(mean_squared_error(y_valid, preds))
print(f'Root Mean Squared Error: {rmse:.2f}')

Root Mean Squared Error: 29724.82


## Results analysis

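For context, the `SalePrice` column has:
- Mean: `$180,921`
- Standard deviation: `$79,443`
- Range: `$34,900` to `$755,000`

The RMSE obtained (`$29,724.82`) is approximately 16.4% of the mean price and well below the standard deviation. Note that this RMSE is measured in dollars on the raw prices, whereas the competition metric is computed on the logarithm of the prices, so the two numbers are not directly comparable.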
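
The percentage can be re-derived from the RMSE printed earlier and the `SalePrice` statistics in the `describe()` output. The variable names below are illustrative; the ratio to the standard deviation is an extra reference point computed from the same numbers.

```python
rmse = 29724.82
mean_price = 180921.19589   # mean of SalePrice from describe()
std_price = 79442.502883    # std of SalePrice from describe()

share_of_mean = rmse / mean_price
share_of_std = rmse / std_price
print(f"{share_of_mean:.1%} of the mean, {share_of_std:.1%} of the standard deviation")
```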

The model is reasonably accurate and captures the general trend of the data. It could be improved with model tuning or more careful feature engineering.