This is my second attempt at creating a machine learning script using the sklearn library to accurately predict house prices.

In my first attempt, the model was off by an average of $20,000 per house. This was mainly due to incorrect data cleaning and improper model evaluation.

This time, I’ll carefully go through each step one at a time. Additionally, instead of training the model on the entire dataset and all features at once, I will train the model on each feature individually to see if that improves accuracy.
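
To make the per-feature idea concrete, here is a minimal sketch of one way it could be done: fit the same model on one column at a time and compare cross-validated R² scores. The data is synthetic, and the column names (`living_area`, `year_built`, `noise`) and the helper `score_each_feature` are hypothetical, not part of the Kaggle dataset or of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical columns, not the Kaggle file)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "living_area": rng.uniform(50, 250, n),
    "year_built": rng.integers(1950, 2010, n).astype(float),
    "noise": rng.normal(size=n),
})
# Price depends only on living_area, plus random noise
price = 1000 * demo["living_area"] + rng.normal(0, 10000, n)

def score_each_feature(features, target, cv=5):
    """Return the mean cross-validated R^2 of a KNN model fit on each column alone."""
    scores = {}
    for column in features.columns:
        # Scaling lives inside the pipeline so it is refit on every CV fold
        model = make_pipeline(StandardScaler(), KNeighborsRegressor())
        r2 = cross_val_score(model, features[[column]], target, cv=cv, scoring="r2")
        scores[column] = r2.mean()
    return pd.Series(scores).sort_values(ascending=False)

feature_scores = score_each_feature(demo, price)
print(feature_scores)
```

Columns that carry real signal should float to the top of the ranking, while uninformative ones score near or below zero.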

I will train and test three models (SGD regressor, K Neighbors regressor, and a decision tree) and compare how well each one predicts prices.

Importing libraries

In [192]:
import pandas as pd
from sklearn.linear_model import SGDRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

Reading Data

In [193]:
train_df = pd.read_csv("./kaggle/house-prices-advanced-regression-techniques/train.csv")
test_df = pd.read_csv("./kaggle/house-prices-advanced-regression-techniques/test.csv")

In [None]:
train_df.head(5)

In [None]:
test_df.head(5)

Cleaning data 

In [196]:
# Fill every missing value (numeric and categorical columns alike) with that column's mode
train_df_cleaned = train_df.fillna(train_df.mode().iloc[0])
test_df_cleaned = test_df.fillna(test_df.mode().iloc[0])
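
The mode is a natural fill value for categories, but for numeric columns the median is a common alternative. The sketch below treats the two column types separately. It runs on a tiny toy frame, and the helper name `fill_missing` is hypothetical. Whether the median is the better choice for this dataset is an assumption I have not tested here.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up)
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 100.0],
    "GarageType": ["Attchd", None, "Attchd", "Detchd"],
})

def fill_missing(df):
    """Fill numeric columns with the median and all other columns with the mode."""
    filled = df.copy()
    numeric_cols = filled.select_dtypes(include="number").columns
    other_cols = filled.columns.difference(numeric_cols)
    if len(numeric_cols) > 0:
        filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].median())
    if len(other_cols) > 0:
        # mode() can return several rows on ties; row 0 holds the first mode per column
        filled[other_cols] = filled[other_cols].fillna(filled[other_cols].mode().iloc[0])
    return filled

toy_filled = fill_missing(toy)
print(toy_filled)
```

On the real data this would be called as `fill_missing(train_df)`.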

In [None]:
# Check for missing values
print(train_df_cleaned.isnull().sum())
print(test_df_cleaned.isnull().sum())

In [None]:
train_df_cleaned.head(5)

In [199]:
train_df_cleaned.drop(columns=['Id', 'Alley', 'MiscVal', 'Fence', 'MiscFeature'], inplace=True)

In [None]:
train_df_cleaned.head(5)

Processing data 

In [201]:
# LabelEncoder is refit for every text column, mapping its categories to integer codes
encoder = LabelEncoder()

for column in train_df_cleaned.select_dtypes(include=['object']).columns:
    train_df_cleaned[column] = encoder.fit_transform(train_df_cleaned[column])
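
Because the loop refits the encoder on every column, the fitted mappings are not kept around for the test set. One possible alternative is scikit-learn's `OrdinalEncoder`, which handles several feature columns at once and can be reused on new data. This is a sketch on toy frames (the values are made up), not what the notebook currently does.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames standing in for the cleaned train and test data
toy_train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"],
                          "LotShape": ["Reg", "IR1", "Reg"]})
toy_test = pd.DataFrame({"Street": ["Pave", "Grvl"],
                         "LotShape": ["IR2", "Reg"]})

categorical_cols = ["Street", "LotShape"]

# Categories not seen during fit are mapped to -1 instead of raising an error
ordinal_encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Fit on the training data only, then reuse the same mapping for the test data
train_encoded = ordinal_encoder.fit_transform(toy_train[categorical_cols])
test_encoded = ordinal_encoder.transform(toy_test[categorical_cols])

print(pd.DataFrame(test_encoded, columns=categorical_cols))
```

Both encoders assign integer codes in sorted category order, so distance-based models such as K Neighbors will still treat the codes as if they had a meaningful order.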

Model/Training

In [202]:
X = train_df_cleaned.drop(columns=["SalePrice"])
Y = train_df_cleaned['SalePrice']

In [None]:
X

In [None]:
Y

In [205]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=20)

In [206]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
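# SalePrice is a continuous target, so the decision tree must be the regression variant
from sklearn.tree import DecisionTreeRegressor

model_1 = SGDRegressor()
model_2 = KNeighborsRegressor()
model_3 = DecisionTreeRegressor()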

In [None]:
model_1.fit(X_train_scaled, y_train)

In [None]:
model_2.fit(X_train_scaled, y_train)

In [None]:
model_3.fit(X_train_scaled, y_train)

Evaluating (inspired by Kaggle user IRON WOLF)

In [None]:
models = [model_1, model_2, model_3]

# Take the names from the model objects so the labels always match what was trained
models_names = [type(model).__name__ for model in models]

squared_errors = []
train_scores = []
test_scores = []
ratios = []
model_evaluations = []

# Loop through models and calculate metrics
for model, name in zip(models, models_names):
    # Predict on test data
    y_pred = model.predict(X_test_scaled)
    
    # Calculate the root mean squared error, which is in the same units as SalePrice (dollars)
    mse = mean_squared_error(y_test, y_pred)
    squared_errors.append(f'${mse ** 0.5:,.0f}')
    
    # Calculate train and test scores
    train_score = model.score(X_train_scaled, y_train)
    test_score = model.score(X_test_scaled, y_test)
    
    train_scores.append(train_score)
    test_scores.append(test_score)
    
    # Gap between train and test score (a difference in percentage points, not a ratio)
    ratio_diff = train_score - test_score
    ratios.append(f'{ratio_diff * 100:.2f}%')
    
    # Model evaluation
    if train_score <= 0.65 and test_score <= 0.65:
        model_evaluations.append('bad')
    elif train_score > test_score * 1.10:
        model_evaluations.append('overfit')
    elif 0.65 < train_score < 0.80 and 0.65 < test_score < 0.80:
        model_evaluations.append('middle')
    elif 0.80 <= train_score < 1.00 and 0.80 <= test_score < 1.00:
        model_evaluations.append('good')
    elif train_score >= 0.80 and test_score < 0.80:
        model_evaluations.append('high train, low test')
    else:
        model_evaluations.append('unknown')

# Create a DataFrame to display the results
model_score = pd.DataFrame({
    'Model': models_names,
    'Train score': [f'{round(score * 100, 2)}%' for score in train_scores],
    'Test score': [f'{round(score * 100, 2)}%' for score in test_scores],
    'Ratio difference': ratios,
    'Evaluate model': model_evaluations,
})

# Display the result
model_score
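
The scores in the table are not in dollars, so they cannot be compared directly with the $20,000 average error from my first attempt. A minimal sketch of dollar-based metrics follows. The helper `dollar_errors` and the sample prices are hypothetical, for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

def dollar_errors(y_true, y_pred):
    """Return (MAE, RMSE), both in the same units as the target."""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    return mae, rmse

# Made-up sale prices and predictions
actual = np.array([200000, 150000, 320000, 275000])
predicted = np.array([210000, 140000, 300000, 290000])

mae, rmse = dollar_errors(actual, predicted)
print(f"MAE:  ${mae:,.0f}")
print(f"RMSE: ${rmse:,.0f}")
```

With the notebook's own variables, the call would be `dollar_errors(y_test, model.predict(X_test_scaled))` for each model. The MAE is the figure that corresponds to "off by an average of X dollars per house".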