## 1. Problem Definition and Dataset Identification

### Problem Definition:

The task is to predict house sale prices from features such as location, size, and number of bedrooms. This is a supervised regression problem: a model learns the relationship between these features and the sale price from historical sales, then applies it to houses it has not seen.

### Why Machine Learning?

Machine learning suits house price prediction because it can use many features at once, and flexible models can capture non-linear relationships and feature interactions that are tedious to specify by hand in a traditional statistical model. A model can also be retrained as more sales data becomes available. The notebook below starts from a simple baseline, a linear regression on a handful of features.

### Dataset Identification:

For this project, we'll use the well-known "House Prices: Advanced Regression Techniques" dataset from Kaggle. You can download it from the following location: [Kaggle House Prices Dataset](https://www.kaggle.com/datasets/lespin/house-prices-dataset/data).

## 2. Data Exploration, Cleaning, Feature Engineering, and Selection

In [19]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import pickle


### Load the dataset

In [20]:
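# Load the train and test files
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# Combine both files for exploration only. test.csv carries no SalePrice,
# so that column is NaN for every test row in the combined frame.
data = pd.concat([train_df, test_df], ignore_index=True)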
print(data.shape)
print(data.head(10))

(2919, 81)
   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1   
5   6          50       RL         85.0    14115   Pave   NaN      IR1   
6   7          20       RL         75.0    10084   Pave   NaN      Reg   
7   8          60       RL          NaN    10382   Pave   NaN      IR1   
8   9          50       RM         51.0     6120   Pave   NaN      Reg   
9  10         190       RL         50.0     7420   Pave   NaN      Reg   

  LandContour Utilities  ... PoolArea PoolQC  Fence MiscFeature MiscVal  \
0         Lvl    AllPub  ...        0    NaN    NaN         NaN       0   
1         Lvl    AllPub 

In [21]:
print(data.info())
print(data.describe())

# Check for missing values
missing_values = data.isnull().sum()
print(missing_values[missing_values > 0])

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2919 entries, 0 to 2918
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             2919 non-null   int64  
 1   MSSubClass     2919 non-null   int64  
 2   MSZoning       2915 non-null   object 
 3   LotFrontage    2433 non-null   float64
 4   LotArea        2919 non-null   int64  
 5   Street         2919 non-null   object 
 6   Alley          198 non-null    object 
 7   LotShape       2919 non-null   object 
 8   LandContour    2919 non-null   object 
 9   Utilities      2917 non-null   object 
 10  LotConfig      2919 non-null   object 
 11  LandSlope      2919 non-null   object 
 12  Neighborhood   2919 non-null   object 
 13  Condition1     2919 non-null   object 
 14  Condition2     2919 non-null   object 
 15  BldgType       2919 non-null   object 
 16  HouseStyle     2919 non-null   object 
 17  OverallQual    2919 non-null   int64  
 18  OverallC
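The non-null counts above vary widely: `Alley`, for example, has only 198 non-null values out of 2919. A per-column missing fraction makes that easier to scan than raw counts. The sketch below uses a small hypothetical frame (`toy`, with made-up rows) rather than the real data, so the numbers it prints are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the combined housing data
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, "Grvl", np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Count and fraction of missing values per column
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "frac_missing": toy.isnull().mean(),
})

# Keep only columns with gaps, worst first
missing = missing[missing["n_missing"] > 0].sort_values("frac_missing", ascending=False)
print(missing)
```

A column that is mostly missing may be better dropped or given an explicit "missing" category than filled with its most frequent value; which option fits depends on what a missing entry means for that feature.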

### Data Cleaning

In [22]:
# Fill numerical columns with the median value.
# SalePrice is excluded: it is the target and is missing for the test rows,
# so filling it with the median would fabricate prices.
num_cols = data.select_dtypes(include=[np.number]).columns.drop('SalePrice')
data[num_cols] = data[num_cols].fillna(data[num_cols].median())

# Fill categorical columns with the most frequent value
cat_cols = data.select_dtypes(include=[object]).columns
data[cat_cols] = data[cat_cols].apply(lambda x: x.fillna(x.value_counts().index[0]), axis=0)

# One-hot encode categorical variables
data = pd.get_dummies(data)

print(data.head(10))

   Id  MSSubClass  LotFrontage  LotArea  OverallQual  OverallCond  YearBuilt  \
0   1          60         65.0     8450            7            5       2003   
1   2          20         80.0     9600            6            8       1976   
2   3          60         68.0    11250            7            5       2001   
3   4          70         60.0     9550            7            5       1915   
4   5          60         84.0    14260            8            5       2000   
5   6          50         85.0    14115            5            5       1993   
6   7          20         75.0    10084            8            5       2004   
7   8          60         68.0    10382            7            6       1973   
8   9          50         51.0     6120            7            5       1931   
9  10         190         50.0     7420            5            6       1939   

   YearRemodAdd  MasVnrArea  BsmtFinSF1  ...  SaleType_ConLw  SaleType_New  \
0          2003       196.0       706.0  

In [23]:
# Select the neighborhood with the most homes: 'NAmes'
selected_neighborhood = 'NAmes'
key_variables = [
    'LotArea', 'YearBuilt', 'OverallQual', 'TotalBsmtSF', '1stFlrSF', 
    'GrLivArea', 'FullBath', 'BedroomAbvGr', 'KitchenQual', 'GarageCars', 'GarageArea', 'SalePrice'
]

In [24]:
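# Filter the training data for the selected neighborhood and key variables.
# .copy() makes filtered_data independent of train_df, so the column assignments
# in the following cells do not trigger pandas' SettingWithCopyWarning.
filtered_data = train_df.loc[train_df['Neighborhood'] == selected_neighborhood, key_variables].copy()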


In [26]:
# Fill missing values
num_cols = filtered_data.select_dtypes(include=[np.number]).columns
cat_cols = filtered_data.select_dtypes(include=[object]).columns

filtered_data[num_cols] = filtered_data[num_cols].apply(lambda x: x.fillna(x.median()), axis=0)
filtered_data[cat_cols] = filtered_data[cat_cols].apply(lambda x: x.fillna(x.value_counts().index[0]), axis=0)


In [27]:
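# One-hot encode KitchenQual. dtype=int keeps the dummy columns numeric, so the
# np.number column selection in the preprocessor picks them up
# (pandas 2.x defaults to bool dummies, which that selection skips).
filtered_data = pd.get_dummies(filtered_data, columns=['KitchenQual'], drop_first=True, dtype=int)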
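One thing worth checking with this encoding: the preprocessor defined later selects columns with `select_dtypes(include=[np.number])`, and whether dummy columns match that selection depends on their dtype. The sketch below uses a hypothetical three-row frame and sets `dtype` explicitly in both cases, so it does not rely on the default of any particular pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786],
    "KitchenQual": ["Gd", "TA", "Gd"],
})

# Same encoding, two explicit dummy dtypes
as_bool = pd.get_dummies(toy, columns=["KitchenQual"], drop_first=True, dtype=bool)
as_int = pd.get_dummies(toy, columns=["KitchenQual"], drop_first=True, dtype=int)

# Which columns does a np.number selection keep in each case?
bool_numeric = list(as_bool.select_dtypes(include=[np.number]).columns)
int_numeric = list(as_int.select_dtypes(include=[np.number]).columns)

print(bool_numeric)  # bool dummies are not matched by np.number
print(int_numeric)   # int dummies are matched
```

If the dummies are bool and the `ColumnTransformer` keeps its default `remainder='drop'`, they never reach the regressor, so it is worth printing the selected columns once before training.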

In [28]:
# Define the target and features
X = filtered_data.drop('SalePrice', axis=1)
y = filtered_data['SalePrice']

In [29]:
# Hold out 20% of the rows as the test set, then 20% of the remainder as the
# validation set: roughly 64% train / 16% validation / 20% test overall.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
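Applying `test_size=0.2` twice does not give an 80/20 split of the whole dataset; the second call takes 20% of the remaining 80%. A minimal sketch on a synthetic 100-row array (an assumption chosen so the counts read directly as percentages) shows the resulting proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, one feature
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# First split: 20% of all rows go to the test set
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Second split: 20% of the remaining 80 rows go to the validation set
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.2, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```

With a single neighborhood the validation and test sets hold only a few dozen rows each, so metrics computed on them can swing noticeably from one random split to another.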

In [32]:
# Define the preprocessors
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
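The `handle_unknown='ignore'` setting matters when a category shows up at prediction time that was absent during fitting. A small sketch of that behavior, using hypothetical quality codes rather than the real data:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on two known categories (hypothetical values)
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["Gd"], ["TA"]], dtype=object))

# "Ex" was never seen during fit; "TA" was
encoded = encoder.transform(np.array([["Ex"], ["TA"]], dtype=object)).toarray()
print(encoder.categories_)
print(encoded)
```

The unseen category is encoded as a row of zeros instead of raising an error, which keeps prediction from failing but also means the model treats that row as having none of the known categories.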

In [33]:
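# Combine preprocessors, routing columns by dtype.
# KitchenQual was already one-hot encoded above, so X_train has no object
# columns left and the 'cat' branch receives no columns here.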
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, X_train.select_dtypes(include=[np.number]).columns),
        ('cat', cat_transformer, X_train.select_dtypes(include=[object]).columns)
    ])


In [34]:
# Define the model
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

In [35]:
# Train the model
model.fit(X_train, y_train)

In [36]:
# Make predictions
y_pred_train = model.predict(X_train)
y_pred_val = model.predict(X_val)
y_pred_test = model.predict(X_test)

In [37]:
# Evaluate the model on all three splits.
# MSE is in squared dollars; Pipeline.score returns R^2 for a regressor.
train_mse = mean_squared_error(y_train, y_pred_train)
val_mse = mean_squared_error(y_val, y_pred_val)
test_mse = mean_squared_error(y_test, y_pred_test)

train_r2 = model.score(X_train, y_train)
val_r2 = model.score(X_val, y_val)
test_r2 = model.score(X_test, y_test)

print(f'Train MSE: {train_mse}, Train R2: {train_r2}')
print(f'Validation MSE: {val_mse}, Validation R2: {val_r2}')
print(f'Test MSE: {test_mse}, Test R2: {test_r2}')


Train MSE: 328955922.5025872, Train R2: 0.7356483483296241
Validation MSE: 400386323.7260555, Validation R2: 0.3582112812043511
Test MSE: 391099193.0866673, Test R2: 0.5720552881535859
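The MSE values above are in squared dollars, which makes them hard to interpret. Taking the square root gives the RMSE, expressed in the same unit as `SalePrice`. A minimal sketch using the three MSE values printed above:

```python
import math

# MSE values copied from the output above
mse = {
    "train": 328955922.5025872,
    "validation": 400386323.7260555,
    "test": 391099193.0866673,
}

# RMSE is the square root of MSE, in dollars
rmse = {split: math.sqrt(value) for split, value in mse.items()}

for split, value in rmse.items():
    print(f"{split} RMSE: {value:,.0f}")
```

This puts the typical error at roughly 18,000 to 20,000 dollars on each split. The R2 values differ much more between splits (about 0.74, 0.36, and 0.57) than the RMSE does, which is consistent with small validation and test sets whose price variance differs from the training set.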


In [38]:
# Save the model
with open('house_price_model.pkl', 'wb') as file:
    pickle.dump(model, file)
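To confirm a saved model is usable, it helps to load it back and compare predictions. The sketch below runs the round trip on a hypothetical toy model and a temporary file (both assumptions, so it does not touch `house_price_model.pkl`); the same `pickle.load` pattern would apply to the pipeline saved above.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy model standing in for the fitted pipeline
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
toy_model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")

    # Save the model
    with open(path, "wb") as file:
        pickle.dump(toy_model, file)

    # Load it back
    with open(path, "rb") as file:
        restored = pickle.load(file)

# The restored model should reproduce the original predictions
same = np.allclose(restored.predict(X), toy_model.predict(X))
print(same)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code, and load the model with the same scikit-learn version that saved it.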