<a id="top"></a>
# [Table of Contents](#top)

- [Setup & Read data](#setup) <br>
- [Obtain target and features & Split](#features) <br>
- [Exercise: Missing Values](#missing) <br>
- [Exercise: Categorical Variables](#categorical) <br>
- [Exercise: Pipelines](#pipelines) <br>
- [Exercise: Cross-Validation](#crossval) <br>
- [Exercise: XGBoost](#xgboost) <br>
- [Submit your results](#results) <br>
- [Extra](#extra) <br>
- [Continue to learn from others](#learn) <br>

<a id="setup"></a>
# [Setup](#setup)

In [1]:
# Importing all here

import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from xgboost import XGBRegressor

# Input data files are available in the read-only "../input/" directory
# For example, running this will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/home-data-for-ml-course/sample_submission.csv
/kaggle/input/home-data-for-ml-course/sample_submission.csv.gz
/kaggle/input/home-data-for-ml-course/train.csv.gz
/kaggle/input/home-data-for-ml-course/data_description.txt
/kaggle/input/home-data-for-ml-course/test.csv.gz
/kaggle/input/home-data-for-ml-course/train.csv
/kaggle/input/home-data-for-ml-course/test.csv


In [2]:
# Read the data
X = pd.read_csv('/kaggle/input/home-data-for-ml-course/train.csv', index_col='Id')  # other names during the course: train_data, X_full
X_test_full = pd.read_csv('/kaggle/input/home-data-for-ml-course/test.csv', index_col='Id')  # other names during the course: test_data, X_test (at reading) 

## use ` Ctrl + A ` to select all lines
## use ` Ctrl + / ` to comment or uncomment selected lines

<a id="features"></a>
# [Obtain target and features & Split](#features)
[Back to contents](#top)

In [3]:
# # Exercise: Introduction
# # Obtain target and predictors
# # Only selected features are used
# features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
# X = X_full[features].copy()
# X_test = X_test_full[features].copy()


# # Exercise: Missing Values
# # To keep things simple, we'll use only numerical predictors
# X = X_full.select_dtypes(exclude=['object'])
# X_test = X_test_full.select_dtypes(exclude=['object'])


# # Exercise: Categorical Variables
# # To keep things simple, we'll drop columns with missing values
# cols_with_missing = [col for col in X.columns if X[col].isnull().any()] 
# X.drop(cols_with_missing, axis=1, inplace=True)
# X_test.drop(cols_with_missing, axis=1, inplace=True)


# Remove rows with missing target, separate target from predictors
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice
X.drop(['SalePrice'], axis=1, inplace=True)

In [4]:
# Break off validation set from training data
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

# (!) skip for Exercise: Cross-Validation (!)

In [5]:
X_train_full.head()

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
619,20,RL,90.0,11694,Pave,,Reg,Lvl,AllPub,Inside,...,260,0,,,,0,7,2007,New,Partial
871,20,RL,60.0,6600,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,8,2009,WD,Normal
93,30,RL,80.0,13360,Pave,Grvl,IR1,HLS,AllPub,Inside,...,0,0,,,,0,8,2009,WD,Normal
818,20,RL,,13265,Pave,,IR1,Lvl,AllPub,CulDSac,...,0,0,,,,0,7,2008,WD,Normal
303,20,RL,118.0,13704,Pave,,IR1,Lvl,AllPub,Corner,...,0,0,,,,0,1,2006,WD,Normal


In [6]:
# Lesson 1 Introduction -- Selecting best model using selected features

## Define the models
# model_1 = RandomForestRegressor(n_estimators=50, random_state=0)
# model_2 = RandomForestRegressor(n_estimators=100, random_state=0)
# model_3 = RandomForestRegressor(n_estimators=100, criterion='absolute_error', random_state=0)
# model_4 = RandomForestRegressor(n_estimators=200, min_samples_split=20, random_state=0)
# model_5 = RandomForestRegressor(n_estimators=100, max_depth=7, random_state=0)

# models = [model_1, model_2, model_3, model_4, model_5]

# # Function for comparing different models
# def score_model(model, X_t=X_train, X_v=X_valid, y_t=y_train, y_v=y_valid):
#     model.fit(X_t, y_t)
#     preds = model.predict(X_v)
#     return mean_absolute_error(y_v, preds)
    
# for i in range(0, len(models)):
#     mae = score_model(models[i])
#     print("Model %d MAE: %d" % (i+1, mae))

# # Fit the model to the training data
# my_model = model_3
# my_model.fit(X, y)

# # Generate test predictions
# preds_test = my_model.predict(X_test)

In [7]:
# Function for comparing different approaches -- used in Exercise: Missing Values & Exercise: Categorical Variables
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
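
`score_dataset` returns the mean absolute error (MAE), that is, the average of |actual - predicted| over the validation rows. As a small sketch, the same quantity can be computed by hand; the prices below are made-up illustration values, not rows from the competition data.

```python
import numpy as np

# Hypothetical sale prices and predictions, for illustration only
y_true = np.array([200000.0, 150000.0, 310000.0, 95000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0, 100000.0])

# MAE = mean of the absolute differences between actual and predicted values
mae_manual = np.mean(np.abs(y_true - y_pred))
print(mae_manual)  # 8750.0
```

A lower MAE means the predictions are, on average, closer to the actual sale prices, which is why the approaches in the exercises are compared on this number.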

# Then we may follow 
#### [Exercise: Missing Values](#missing) <br>
#### [Exercise: Categorical Variables](#categorical) <br>
#### [Exercise: Pipelines](#pipelines) <br>
#### [Exercise: Cross-Validation](#crossval) <br>
#### [Exercise: XGBoost](#xgboost) <br>

<a id="missing"></a>
# [Missing Values](#missing)
[Back to contents](#top)

In [8]:
# # Shape of training data (num_rows, num_columns)
# print(X_train.shape)

# # Number of missing values in each column of training data
# missing_val_count_by_column = (X_train.isnull().sum())
# print(missing_val_count_by_column[missing_val_count_by_column > 0])

In [9]:
# # Exercise: Missing Values -- Drop columns with missing values

# # Get names of columns with missing values
# cols_with_missing = [col for col in X_train.columns
#                      if X_train[col].isnull().any()]

# # Drop columns in training and validation data
# reduced_X_train = X_train.drop(cols_with_missing, axis=1)
# reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

# print("MAE (Drop columns with missing values):")
# print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))

# # Exercise: Missing Values -- Imputation

# # Imputation
# my_imputer = SimpleImputer()
# imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
# imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# # Imputation removed column names; put them back
# imputed_X_train.columns = X_train.columns
# imputed_X_valid.columns = X_valid.columns

# print("MAE (Imputation):")
# print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))
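
The "put them back" step is needed because `SimpleImputer` returns a plain NumPy array by default, so both the column names and the `Id` index are lost. A minimal sketch on toy data (the values below are illustrative assumptions) that restores both in one step:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with a non-default index and one missing value (illustrative data)
toy = pd.DataFrame(
    {"LotFrontage": [60.0, np.nan, 80.0], "LotArea": [8000, 9000, 10000]},
    index=[619, 871, 93],
)

imputer = SimpleImputer()  # default strategy replaces NaN with the column mean

# Wrap the returned array back into a DataFrame, keeping names and index
toy_imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)
print(toy_imputed)
```

Here the missing `LotFrontage` is filled with 70.0, the mean of 60.0 and 80.0. Passing `index=` as well is optional for fitting a model, but it keeps the rows aligned with `y_train` if the frames are later joined or compared.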

In [10]:
# # Exercise: Missing Values -- Generate test predictions

# ## 1. Train and evaluate a random forest model

# ### Preprocess training and validation features

# # Imputation
# final_imputer = SimpleImputer(strategy='median')
# final_X_train = pd.DataFrame(final_imputer.fit_transform(X_train))
# final_X_valid = pd.DataFrame(final_imputer.transform(X_valid))

# # Imputation removed column names; put them back
# final_X_train.columns = X_train.columns
# final_X_valid.columns = X_valid.columns

# # Define and fit model
# model = RandomForestRegressor(n_estimators=100, random_state=0)
# model.fit(final_X_train, y_train)

# # Get validation predictions and MAE
# preds_valid = model.predict(final_X_valid)
# print("MAE (Your approach):", mean_absolute_error(y_valid, preds_valid))

# ## 2. Preprocess the test data before generating predictions that can be submitted

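# ### Preprocess test data
# final_X_test = pd.DataFrame(final_imputer.transform(X_test))
# # Imputation removed column names; put them back so they match the training data
# final_X_test.columns = X_test.columns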

# ### Get test predictions
# preds_test = model.predict(final_X_test)

<a id="categorical"></a>
# [Categorical Variables](#categorical)
[Back to contents](#top)

In [11]:
# # Exercise: Categorical Variables -- Drop columns with categorical data

# # Fill in the lines below: drop columns in training and validation data
# drop_X_train = X_train.select_dtypes(exclude=['object'])
# drop_X_valid = X_valid.select_dtypes(exclude=['object'])

# print("MAE from Approach 1 (Drop categorical variables):")
# print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))

### Ordinal encoding

In [12]:
# # Investigating the dataset for Ordinal encoding
# print("Unique values in 'Condition2' column in training data:", X_train['Condition2'].unique())
# print("\nUnique values in 'Condition2' column in validation data:", X_valid['Condition2'].unique())

Fitting an ordinal encoder to a column in the training data assigns an integer label to each unique value that appears in the training data. If the validation data contains values that don't appear in the training data, the encoder will throw an error by default, because these values have no integer assigned to them. **Notice that the 'Condition2' column** in the validation data contains the values 'RRAn' and 'RRNn', which don't appear in the training data. An ordinal encoder from scikit-learn that was fitted on the training data will therefore throw an error when applied to the validation data.

This is a common problem that you'll encounter with real-world data, and there are many approaches to fixing this issue.  For instance, you can write a custom ordinal encoder to deal with new categories.  The simplest approach, however, is to drop the problematic categorical columns.  

Run the code cell below to save the problematic columns to a Python list `bad_label_cols`.  Likewise, columns that can be safely ordinal encoded are stored in `good_label_cols`.

In [13]:
# # Categorical columns in the training data
# object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]

# # Columns that can be safely ordinal encoded
# good_label_cols = [col for col in object_cols if 
#                    set(X_valid[col]).issubset(set(X_train[col]))]
        
# # Problematic columns that will be dropped from the dataset
# bad_label_cols = list(set(object_cols)-set(good_label_cols))
        
# print('Categorical columns that will be ordinal encoded:', good_label_cols)
# print('\nCategorical columns that will be dropped from the dataset:', bad_label_cols)

# # Drop categorical columns that will not be encoded
# label_X_train = X_train.drop(bad_label_cols, axis=1)
# label_X_valid = X_valid.drop(bad_label_cols, axis=1)

# # Apply ordinal encoder 
# ordinal_encoder = OrdinalEncoder()
# label_X_train[good_label_cols] = ordinal_encoder.fit_transform(X_train[good_label_cols])
# label_X_valid[good_label_cols] = ordinal_encoder.transform(X_valid[good_label_cols])

# print("MAE from Approach 2 (Ordinal Encoding):") 
# print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))
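
Dropping the problematic columns is the simplest fix, but it is not the only one. As a hedged alternative sketch, assuming scikit-learn 0.24 or later, `OrdinalEncoder` can map categories it has not seen during fitting to a sentinel value instead of raising an error. The toy frames and the choice of `-1` as the sentinel are illustrative assumptions, not part of the course solution.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data: 'RRAn' appears in validation but not in training
train_toy = pd.DataFrame({"Condition2": ["Norm", "Feedr", "Norm"]})
valid_toy = pd.DataFrame({"Condition2": ["Norm", "RRAn"]})

# Unknown categories are encoded as -1 instead of raising an error
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(train_toy)

# Categories are sorted, so 'Feedr' -> 0.0 and 'Norm' -> 1.0; 'RRAn' -> -1.0
encoded_valid = enc.transform(valid_toy)
print(encoded_valid)
```

Whether a single sentinel value is appropriate depends on the model; tree-based models such as random forests can usually split it off as its own group.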

### One-hot encoding

In [14]:
# # Investigating the dataset for One-hot encoding - cardinality

# # Get number of unique entries in each column with categorical data
# object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
# d = dict(zip(object_cols, object_nunique))

# # Print number of unique entries by column, in ascending order
# sorted(d.items(), key=lambda x: x[1])

For large datasets with many rows, one-hot encoding can greatly expand the size of the dataset. For this reason, we typically will only **one-hot encode columns with relatively low cardinality**. Then, high cardinality columns can either be dropped from the dataset, or we can use ordinal encoding.

Next, you'll experiment with one-hot encoding.  But, instead of encoding all of the categorical variables in the dataset, you'll only create a one-hot encoding for columns with cardinality less than 10.

Run the code cell below without changes to set `low_cardinality_cols` to a Python list containing the columns that will be one-hot encoded.  Likewise, `high_cardinality_cols` contains a list of categorical columns that will be dropped from the dataset.

In [15]:
# # Columns that will be one-hot encoded
# low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]

# # Columns that will be dropped from the dataset
# high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))

# print('Categorical columns that will be one-hot encoded:', low_cardinality_cols)
# print('\nCategorical columns that will be dropped from the dataset:', high_cardinality_cols)

# # Apply one-hot encoder to each column with categorical data
# # (`sparse_output` replaces the `sparse` argument from scikit-learn 1.2 onward)
# OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
# OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
# OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))

# # One-hot encoding removed index; put it back
# OH_cols_train.index = X_train.index
# OH_cols_valid.index = X_valid.index

# # Remove categorical columns (will replace with one-hot encoding)
# num_X_train = X_train.drop(object_cols, axis=1)
# num_X_valid = X_valid.drop(object_cols, axis=1)

# # Add one-hot encoded columns to numerical features
# OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
# OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

# # Ensure all columns have string type
# OH_X_train.columns = OH_X_train.columns.astype(str)
# OH_X_valid.columns = OH_X_valid.columns.astype(str)

# print("MAE from Approach 3 (One-Hot Encoding):") 
# print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))
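
To see why cardinality matters for one-hot encoding, note that each categorical column is replaced by one new column per unique value. A small sketch with made-up rows (it uses `pd.get_dummies` for brevity rather than `OneHotEncoder`):

```python
import pandas as pd

# Illustrative toy data: one low-cardinality and one higher-cardinality column
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Pave"],                  # 2 unique values
    "Neighborhood": ["NAmes", "CollgCr", "OldTown", "Edwards"],  # 4 unique values
})

cardinality = toy.nunique()    # unique values per column
encoded = pd.get_dummies(toy)  # one new column per unique value

print(cardinality.to_dict())
print(encoded.shape)  # (4, 6): 2 + 4 one-hot columns
```

With only two original columns the encoded frame already has six, so a column with hundreds of distinct values would add hundreds of columns, which is the motivation for the cardinality threshold used above.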

In [16]:
# Generate test predictions

## preprocess the test data

## generate predictions

<a id="pipelines"></a>
# [Pipelines](#pipelines)
[Back to contents](#top)

In [17]:
# # Exercise: Pipelines

# # "Cardinality" means the number of unique values in a column
# # Select categorical columns with relatively low cardinality (convenient but arbitrary)
# categorical_cols = [cname for cname in X_train_full.columns if
#                     X_train_full[cname].nunique() < 10 and 
#                     X_train_full[cname].dtype == "object"]

# # Select numerical columns
# numerical_cols = [cname for cname in X_train_full.columns if 
#                 X_train_full[cname].dtype in ['int64', 'float64']]

# # Keep selected columns only
# my_cols = categorical_cols + numerical_cols
# X_train = X_train_full[my_cols].copy()
# X_valid = X_valid_full[my_cols].copy()
# X_test = X_test_full[my_cols].copy()

In [18]:
# # Preprocess the data and train a model

# # Preprocessing for numerical data
# numerical_transformer = SimpleImputer(strategy='constant')

# # Preprocessing for categorical data
# categorical_transformer = Pipeline(steps=[
#     ('imputer', SimpleImputer(strategy='most_frequent')),
#     ('onehot', OneHotEncoder(handle_unknown='ignore'))
# ])

# # Bundle preprocessing for numerical and categorical data
# preprocessor = ColumnTransformer(
#     transformers=[
#         ('num', numerical_transformer, numerical_cols),
#         ('cat', categorical_transformer, categorical_cols)
#     ])

# # Define model
# model = RandomForestRegressor(n_estimators=100, random_state=0)

# # Bundle preprocessing and modeling code in a pipeline
# my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
#                               ('model', model)
#                              ])

# # Preprocessing of training data, fit model 
# my_pipeline.fit(X_train, y_train)

# # Preprocessing of validation data, get predictions
# preds = my_pipeline.predict(X_valid)

# # Evaluate the model
# print('MAE:', mean_absolute_error(y_valid, preds))

# # Preprocessing of test data, get predictions
# preds_test = my_pipeline.predict(X_test)

<a id="crossval"></a>
# [Cross-Validation](#crossval)
[Back to contents](#top)

In [19]:
# # Select numeric columns only
# numeric_cols = [cname for cname in train_data.columns if train_data[cname].dtype in ['int64', 'float64']]
# X = train_data[numeric_cols].copy()
# X_test = test_data[numeric_cols].copy()

# my_pipeline = Pipeline(steps=[
#     ('preprocessor', SimpleImputer()),
#     ('model', RandomForestRegressor(n_estimators=50, random_state=0))
# ])

# # Multiply by -1 since sklearn calculates *negative* MAE
# scores = -1 * cross_val_score(my_pipeline, X, y,
#                               cv=5,
#                               scoring='neg_mean_absolute_error')

# print("Average MAE score:", scores.mean())

# def get_score(n_estimators):
#     """Return the average MAE over 3 CV folds of random forest model.
#
#     Keyword argument:
#     n_estimators -- the number of trees in the forest
#     """
#     my_pipeline = Pipeline(steps=[
#         ('preprocessor', SimpleImputer()),
#         ('model', RandomForestRegressor(n_estimators=n_estimators, random_state=0))
#     ])
#     scores = -1 * cross_val_score(my_pipeline, X, y,
#                                   cv=3,
#                                   scoring='neg_mean_absolute_error')
#     return scores.mean()

# results = {estimator: get_score(estimator) for estimator in [50, 100, 150, 200, 250, 300, 350, 400]}



# plt.plot(list(results.keys()), list(results.values()))
# plt.show()
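
The multiplication by -1 in the cell above follows from a scikit-learn convention: scorers are defined so that higher is always better, so error metrics such as MAE are reported negated. A minimal sketch on synthetic data (the data and the use of `LinearRegression` as a fast stand-in model are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

# One score per fold; each one is the *negative* MAE on that fold
raw_scores = cross_val_score(
    LinearRegression(), X_toy, y_toy, cv=5, scoring="neg_mean_absolute_error"
)

# Flip the sign to get ordinary (non-negative) MAE values
mae_scores = -1 * raw_scores
print(raw_scores)
print("Average MAE:", mae_scores.mean())
```

Each entry of `raw_scores` is zero or negative, and averaging the sign-flipped values gives a single MAE estimate that uses every row for validation exactly once.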

<a id="xgboost"></a>
# [XGBoost (gradient boosting)](#xgboost)
[Back to contents](#top)

In [20]:
# "Cardinality" means the number of unique values in a column

# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]

# Select numeric columns
numeric_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = low_cardinality_cols + numeric_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()

# One-hot encode the data (to shorten the code, we use pandas)
X_train = pd.get_dummies(X_train)
X_valid = pd.get_dummies(X_valid)
X_test = pd.get_dummies(X_test)
X_train, X_valid = X_train.align(X_valid, join='left', axis=1)
X_train, X_test = X_train.align(X_test, join='left', axis=1)
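
The two `align` calls matter because `pd.get_dummies` is applied to each frame separately, so the resulting columns can differ when a category appears in one split but not another. A small sketch with made-up values showing what `join='left'` does:

```python
import pandas as pd

# Toy data: 'FV' appears only in test, 'RM' only in train (illustrative values)
train_toy = pd.DataFrame({"MSZoning": ["RL", "RM", "RL"]})
test_toy = pd.DataFrame({"MSZoning": ["RL", "FV"]})

train_oh = pd.get_dummies(train_toy)  # columns: MSZoning_RL, MSZoning_RM
test_oh = pd.get_dummies(test_toy)    # columns: MSZoning_FV, MSZoning_RL

# Keep exactly the training columns, in the training order
train_oh, test_oh = train_oh.align(test_oh, join="left", axis=1)
print(list(test_oh.columns))
print(test_oh)
```

After alignment the test frame drops `MSZoning_FV` (unseen in training) and gains an all-NaN `MSZoning_RM` column. XGBoost treats NaN as a missing value, so this works here; for models that do not accept NaN, filling these columns with 0 would be one option.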

In [21]:
# Improve the model by setting n_estimators and learning_rate

# Define the model
my_model_2 = XGBRegressor(n_estimators=1000, learning_rate=0.01) 

# Fit the model
my_model_2.fit(X_train, y_train)

# Get predictions
predictions_2 = my_model_2.predict(X_valid) 

# Calculate MAE (scikit-learn's argument order is y_true, y_pred)
mae_2 = mean_absolute_error(y_valid, predictions_2)
print("Mean Absolute Error:", mae_2)


Mean Absolute Error: 17079.305463398974


<a id="results"></a>
# [Submit your results](#results)
[Back to contents](#top)

In [22]:
# Generate predictions for the (already preprocessed) test data
preds_test = my_model_2.predict(X_test)

In [23]:
# Save predictions in format used for competition scoring

output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)

<a id="extra"></a>
# [Extra](#extra)
[Back to contents](#top)

#### House Prices Best Solution - "LightGBM"
https://www.kaggle.com/code/tnr1337/house-prices-best-solution/notebook

In [24]:
# import warnings
# warnings.filterwarnings('ignore')
# import numpy as np
# import pandas as pd
# from sklearn.preprocessing import LabelEncoder
# from sklearn.model_selection import KFold
# from sklearn.metrics import mean_squared_error
# import lightgbm as lgb
# from lightgbm import early_stopping, log_evaluation

# # from scipy.stats import norm, skew
# # from scipy import stats

# y = train["SalePrice"]
# train_id = train["Id"]
# test_id = test["Id"]

# # Preprocessing function
# def preprocess(df):
#     df = df.copy()
#     for col in df.select_dtypes(include="object"):
#         df[col] = df[col].fillna("None")
#         df[col] = LabelEncoder().fit_transform(df[col])
#     for col in df.select_dtypes(include=["int64", "float64"]):
#         df[col] = df[col].fillna(df[col].median())
#     return df

# # Feature separation
# X = preprocess(train.drop(["Id", "SalePrice"], axis=1))
# X_test = preprocess(test.drop("Id", axis=1))

# # K-Fold Cross Validation
# kf = KFold(n_splits=5, shuffle=True, random_state=42)
# oof = np.zeros(len(X))
# preds = np.zeros(len(X_test))

# for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
#     X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
#     y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
    
#     model = lgb.LGBMRegressor(n_estimators=1000, learning_rate=0.01, random_state=42)
    
#     model.fit(
#         X_train, y_train,
#         eval_set=[(X_val, y_val)],
#         callbacks=[
#             early_stopping(stopping_rounds=100),
#             log_evaluation(period=100)
#         ]
#     )
    
#     oof[val_idx] = model.predict(X_val)
#     preds += model.predict(X_test) / kf.n_splits

# # CV RMSE (np.sqrt is used because `squared=False` is deprecated since scikit-learn 1.4)
# rmse = np.sqrt(mean_squared_error(y, oof))
# print(f"CV RMSE: {rmse:.4f}")

# # Submission file
# submission = pd.DataFrame({
#     "Id": test_id,
#     "SalePrice": preds
# })
# submission.to_csv("submission.csv", index=False)

<a id="learn"></a>
# [Continue to learn from others](#learn)
[Back to contents](#top)

**Top 1% Approach: EDA, New Models and Stacking**
https://www.kaggle.com/code/datafan07/top-1-approach-eda-new-models-and-stacking

**House Prices: 1st Approach to Data Science Process**
https://www.kaggle.com/code/cheesu/house-prices-1st-approach-to-data-science-process
"This kernel represents my first personal exploration of an (almost) end-to-end data science process. While starting off as a kernel to complete the Kaggle Learn Machine Learning exercise, it has now been significantly modified to serve a broader learning experience for myself."

**Top 1%-🏡Housing Price(EDA+Random) for Everyone🤓**
https://www.kaggle.com/code/lazer999/top-1-housing-price-eda-random-for-everyone

**Data Science Workflow TOP 2% (with Tuning)**
https://www.kaggle.com/code/angqx95/data-science-workflow-top-2-with-tuning/notebook

**9th RANK | HOUSING PRICES (feature engineering)**
https://www.kaggle.com/code/manyasinghal912/9th-rank-housing-prices-feature-engineering

**feature engineering + optuna + stacked pipe**
https://www.kaggle.com/code/rzatemizel/feature-engineering-optuna-stacked-pipe

**StackingAndEnsembling**
https://www.kaggle.com/code/donaldst/stackingandensembling

**Stack&Blend LRs XGB LGB {House Prices K} v17**
https://www.kaggle.com/code/itslek/stack-blend-lrs-xgb-lgb-house-prices-k-v17