### **1. Problem Understanding**

---

#### **Objective**  
The goal of the project is to predict house prices based on a dataset containing various features related to properties (e.g., size, location, quality). This involves building a regression model that can accurately estimate the sale price (`SalePrice`) for each house.

---

#### **Output**  
The target variable is **continuous numerical values** representing the sale prices of houses. Models will aim to predict these values as closely as possible to the actual prices.

---

#### **Evaluation Metric**  
The performance metric is likely **Root Mean Squared Error (RMSE)** applied to the **log-transformed `SalePrice`**.  

- **Why log transformation?**  
   - House prices often exhibit skewed distributions, with a few extremely high-priced houses. Applying a log transformation reduces skewness and helps the model focus on relative differences rather than absolute values.
   - RMSE on the log scale penalizes large deviations in terms of percentage error rather than raw error.

- **RMSE Formula on Log Scale:**  

   \[
   RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^n \left( \log(\text{predicted}_i + 1) - \log(\text{actual}_i + 1) \right)^2}
   \]
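
The formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than an official scoring implementation; the function name `rmse_log` and the example prices are assumptions made up for this sketch.

```python
import numpy as np


def rmse_log(actual, predicted):
    """RMSE between log(predicted + 1) and log(actual + 1)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # np.log1p(x) computes log(x + 1), matching the formula above
    log_diff = np.log1p(predicted) - np.log1p(actual)
    return float(np.sqrt(np.mean(log_diff ** 2)))


# Hypothetical prices, for illustration only
actual_prices = [200000, 150000, 320000]
predicted_prices = [210000, 140000, 300000]
score = rmse_log(actual_prices, predicted_prices)
print(f"RMSE on the log scale: {score:.4f}")
```

Because the error is computed on logarithms, predicting 110,000 for a 100,000 house costs about as much as predicting 550,000 for a 500,000 house.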

---

#### **Key Considerations**  
- **Data Structure:** You are likely working with tabular data, including numerical, ordinal, and categorical features.
- **Feature Scope:** Features may include:
  - **Numerical:** `LotArea`, `GrLivArea`, `TotalBsmtSF`.
  - **Categorical:** `Neighborhood`, `GarageType`, `HouseStyle`.
  - **Ordinal:** `OverallQual`, `ExterQual`.
- **Domain Knowledge:** Leverage real estate knowledge where possible (e.g., location and quality are typically strong price determinants).

### Requirements

#### 1. Drop Irrelevant Columns

For the purposes of this lab, we will only be using a subset of all of the features present in the Ames Housing dataset. In this step you will drop all irrelevant columns.

#### 2. Handle Missing Values

Often for reasons outside of a data scientist's control, datasets are missing some values. In this step you will assess the presence of NaN values in our subset of data, and use `MissingIndicator` and `SimpleImputer` from the `sklearn.impute` submodule to handle any missing values.

#### 3. Convert Categorical Features into Numbers

A built-in assumption of the scikit-learn library is that all data being fed into a machine learning model is already in a numeric format, otherwise you will get a `ValueError` when you try to fit a model. In this step you will use an `OrdinalEncoder` to replace data within individual non-numeric columns with 0s and 1s, and a `OneHotEncoder` to replace columns containing more than 2 categories with multiple "dummy" columns containing 0s and 1s.

At this point, a scikit-learn model should be able to run without errors!

#### 4. Preprocess Test Data

Apply Steps 1-3 to the test data in order to perform a final model evaluation.

### **2. Data Exploration**

> Load the dataset

In [1]:
import pandas as pd


df1 = pd.read_csv('Data/train.csv')

df1

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125


In [2]:
df1.describe()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


The prediction target for this analysis is the sale price of the home, so I separate the data into `X` and `y` accordingly:

In [4]:
y = df1["SalePrice"]
X = df1.drop("SalePrice", axis=1)

Next, I separate the data into a train set and a test set prior to performing any preprocessing steps:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [6]:
# Checking the number of rows and columns in the dataframe
print(f"X_train is a DataFrame with {X_train.shape[0]} rows and {X_train.shape[1]} columns")
print(f"y_train is a Series with {y_train.shape[0]} values")

# There always should be the same number of rows in X as values in y
assert X_train.shape[0] == y_train.shape[0]

X_train is a DataFrame with 1095 rows and 80 columns
y_train is a Series with 1095 values


## 1. Drop Irrelevant Columns


In [7]:
# viewing the columns in the dataframe
df1.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

To select the most relevant features for predicting `SalePrice` (14 are kept below), we can combine correlation analysis with domain knowledge to prioritize features. Here's how we can proceed:

1. **Numerical feature selection:** Identify numerical features that are highly correlated with `SalePrice`.

2. **Categorical feature selection:** Choose categorical features that provide significant context for pricing, such as `Neighborhood`, `GarageType`, and `ExterQual`.

3. **Combine with domain knowledge:** Combine statistical relevance (e.g., correlation) with logical reasoning about features that intuitively affect house prices.
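
Correlation only covers numeric columns. One rough way to gauge a categorical feature, which is not part of the original notebook, is to compare the median sale price across its categories. The sketch below uses a made-up toy frame (`toy`) in place of `df1`; a large spread between groups suggests the feature carries pricing information.

```python
import pandas as pd

# Toy data standing in for df1; all values are made up for illustration
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "C", "C"],
    "SalePrice": [100000, 120000, 200000, 220000, 150000, 170000],
})

# Median sale price per category, highest first
median_by_group = (
    toy.groupby("Neighborhood")["SalePrice"]
    .median()
    .sort_values(ascending=False)
)
print(median_by_group)

# Spread between the most and least expensive category
spread = median_by_group.max() - median_by_group.min()
print(f"Spread in median price: {spread}")
```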

In [8]:
# Identifying the numerical columns most correlated with the SalePrice column

numerical_cols = df1.select_dtypes(include=['float64', 'int64']).columns
correlation_matrix = df1[numerical_cols].corr()
saleprice_correlation = correlation_matrix['SalePrice'].sort_values(ascending=False)
print("Top Numerical Features by Correlation:\n", saleprice_correlation.head(15))


Top Numerical Features by Correlation:
 SalePrice       1.000000
OverallQual     0.790982
GrLivArea       0.708624
GarageCars      0.640409
GarageArea      0.623431
TotalBsmtSF     0.613581
1stFlrSF        0.605852
FullBath        0.560664
TotRmsAbvGrd    0.533723
YearBuilt       0.522897
YearRemodAdd    0.507101
GarageYrBlt     0.486362
MasVnrArea      0.477493
Fireplaces      0.466929
BsmtFinSF1      0.386420
Name: SalePrice, dtype: float64
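
Note that the ranking above lists `SalePrice` itself first, and sorting by the raw coefficient would push strongly negative correlations to the bottom. A small variation, shown here on made-up toy data (the `toy` frame and its `Age` column are assumptions for this sketch), drops the target and ranks by absolute correlation instead.

```python
import pandas as pd

# Toy data, values made up for illustration
toy = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500],
    "Age": [40, 30, 15, 5],
    "MoSold": [6, 3, 6, 3],
    "SalePrice": [100000, 150000, 210000, 260000],
})

# Correlation of every numeric column with the target, excluding the target itself
corr_with_target = toy.corr(numeric_only=True)["SalePrice"].drop("SalePrice")

# Rank by magnitude so that strong negative relationships are not overlooked
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```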


In [9]:
relevant_columns = [
    'OverallQual', 'GrLivArea', 'TotalBsmtSF', 'GarageCars', 'GarageArea', 
    '1stFlrSF', 'YearBuilt', 'YearRemodAdd', 'FireplaceQu', 'FullBath', 'KitchenQual', 
    'Fireplaces', 'ExterQual', 'LotArea'
]

# Reassign X_train so that it only contains relevant columns
X_train = X_train.loc[:, relevant_columns]

# Visually inspect X_train
X_train

Unnamed: 0,OverallQual,GrLivArea,TotalBsmtSF,GarageCars,GarageArea,1stFlrSF,YearBuilt,YearRemodAdd,FireplaceQu,FullBath,KitchenQual,Fireplaces,ExterQual,LotArea
1023,7,1504,1346,2,437,1504,2005,2006,Gd,2,Gd,1,Gd,3182
810,6,1309,1040,2,484,1309,1974,1999,Fa,1,Gd,1,TA,10140
1384,6,1258,560,1,280,698,1939,1950,,1,TA,0,TA,9060
626,5,1422,978,1,286,1422,1960,1978,TA,1,TA,1,TA,12342
813,6,1442,1442,1,301,1442,1958,1958,,1,TA,0,TA,9750
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1095,6,1314,1314,2,440,1314,2006,2006,Gd,2,Gd,1,Gd,9317
1130,4,1981,1122,2,576,1328,1928,1950,TA,2,Gd,2,TA,7804
1294,5,864,864,2,572,864,1955,1990,,1,TA,0,TA,8172
860,7,1426,912,1,216,912,1918,1998,Gd,1,Gd,1,Gd,7642


## 2. Handle Missing Values

In the cell below, I check to see if there are any NaNs in the selected subset of data:

In [10]:
X_train.isna().sum()

OverallQual       0
GrLivArea         0
TotalBsmtSF       0
GarageCars        0
GarageArea        0
1stFlrSF          0
YearBuilt         0
YearRemodAdd      0
FireplaceQu     512
FullBath          0
KitchenQual       0
Fireplaces        0
ExterQual         0
LotArea           0
dtype: int64

The NaNs in `FireplaceQu` could mean that these houses do not have a fireplace. I check this using the `Fireplaces` column:

In [11]:
X_train[X_train["Fireplaces"] == 0]

Unnamed: 0,OverallQual,GrLivArea,TotalBsmtSF,GarageCars,GarageArea,1stFlrSF,YearBuilt,YearRemodAdd,FireplaceQu,FullBath,KitchenQual,Fireplaces,ExterQual,LotArea
1384,6,1258,560,1,280,698,1939,1950,,1,TA,0,TA,9060
813,6,1442,1442,1,301,1442,1958,1958,,1,TA,0,TA,9750
839,5,1200,768,1,240,768,1946,1995,,1,TA,0,TA,11767
430,6,987,483,1,264,483,1971,1971,,1,TA,0,TA,1680
513,6,1080,1084,2,484,1080,1983,1983,,1,TA,0,TA,9187
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
87,6,1224,612,2,528,612,2009,2009,,2,Gd,0,Gd,3951
330,5,1728,1728,1,352,1728,1964,1964,,2,TA,0,TA,10624
1238,6,1141,1141,2,484,1141,2005,2005,,1,TA,0,Gd,13072
121,4,1123,732,1,264,772,1939,1950,,1,TA,0,TA,6060


The filter returns exactly 512 rows, matching the 512 missing `FireplaceQu` values, which means that these houses have no fireplace.
So, let's replace those NaNs with the string "N/A" to indicate that this is a real category, not missing data:

In [12]:
X_train["FireplaceQu"] = X_train["FireplaceQu"].fillna("N/A")
X_train["FireplaceQu"].value_counts()

FireplaceQu
N/A    512
Gd     286
TA     236
Fa      26
Ex      19
Po      16
Name: count, dtype: int64
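
The reasoning above can also be checked programmatically, and the same fill can be done with `SimpleImputer` from `sklearn.impute`, as mentioned in the requirements. The following is a sketch on a made-up toy frame, not the notebook's actual data; the variable names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data standing in for X_train; values are made up for illustration
toy = pd.DataFrame({
    "Fireplaces": [0, 1, 0, 2],
    "FireplaceQu": pd.Series([np.nan, "Gd", np.nan, "TA"], dtype=object),
})

# Every missing quality value should coincide with a house without a fireplace
no_fireplace = toy["Fireplaces"] == 0
missing_quality = toy["FireplaceQu"].isna()
masks_match = bool((no_fireplace == missing_quality).all())
print("Masks match:", masks_match)

# Constant-fill imputation, comparable to fillna("N/A") on this column
imputer = SimpleImputer(strategy="constant", fill_value="N/A")
filled = imputer.fit_transform(toy[["FireplaceQu"]])
print(filled.ravel().tolist())
```

An imputer fitted on the training data can later be reused with `transform` on the test data, which is harder to guarantee with ad hoc `fillna` calls.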

Now `X_train` contains no null values:

In [13]:
X_train.isna().sum()

OverallQual     0
GrLivArea       0
TotalBsmtSF     0
GarageCars      0
GarageArea      0
1stFlrSF        0
YearBuilt       0
YearRemodAdd    0
FireplaceQu     0
FullBath        0
KitchenQual     0
Fireplaces      0
ExterQual       0
LotArea         0
dtype: int64

## 3. Convert Categorical Features into Numbers


In [14]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1095 entries, 1023 to 1126
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   OverallQual   1095 non-null   int64 
 1   GrLivArea     1095 non-null   int64 
 2   TotalBsmtSF   1095 non-null   int64 
 3   GarageCars    1095 non-null   int64 
 4   GarageArea    1095 non-null   int64 
 5   1stFlrSF      1095 non-null   int64 
 6   YearBuilt     1095 non-null   int64 
 7   YearRemodAdd  1095 non-null   int64 
 8   FireplaceQu   1095 non-null   object
 9   FullBath      1095 non-null   int64 
 10  KitchenQual   1095 non-null   object
 11  Fireplaces    1095 non-null   int64 
 12  ExterQual     1095 non-null   object
 13  LotArea       1095 non-null   int64 
dtypes: int64(11), object(3)
memory usage: 128.3+ KB


In [15]:
# Inspect the value counts of the categorical features
print(X_train["KitchenQual"].value_counts())
print()
print(X_train["FireplaceQu"].value_counts())
print()
print(X_train["ExterQual"].value_counts())

KitchenQual
TA    550
Gd    440
Ex     73
Fa     32
Name: count, dtype: int64

FireplaceQu
N/A    512
Gd     286
TA     236
Fa      26
Ex      19
Po      16
Name: count, dtype: int64

ExterQual
TA    682
Gd    363
Ex     39
Fa     11
Name: count, dtype: int64
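
These quality ratings have a natural order (`Po` < `Fa` < `TA` < `Gd` < `Ex`), so an ordinal encoding is a possible alternative to the one-hot encoding this notebook applies. The sketch below is only an illustration of that alternative on a made-up toy column; it is not what the notebook does.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column standing in for X_train[["KitchenQual"]]
toy = pd.DataFrame({"KitchenQual": ["TA", "Gd", "Ex", "Fa", "TA"]})

# Explicit order from worst to best, so that a higher code means higher quality
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]
encoder = OrdinalEncoder(categories=[quality_order])

# Each rating is replaced by its position in quality_order
codes = encoder.fit_transform(toy[["KitchenQual"]])
print(codes.ravel().tolist())
```

The trade-off is that ordinal codes assume equal spacing between ratings, while one-hot columns let the model learn a separate effect for each rating.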


I will use a `OneHotEncoder` from `sklearn.preprocessing` to convert the categorical columns to numeric values.

1. KitchenQual

In [16]:

# (0) Import OneHotEncoder from sklearn.preprocessing
from sklearn.preprocessing import OneHotEncoder

# (1) Create a variable kitchen_qu
# extracted from X_train
# (double brackets due to shape expected by OHE)
kitchen_qu = X_train[["KitchenQual"]]

# (2) Instantiate a OneHotEncoder with categories="auto",
# sparse_output=False (named `sparse` before scikit-learn 1.2),
# and handle_unknown="ignore"
ohe = OneHotEncoder(categories="auto", sparse_output=False, handle_unknown="ignore")

# (3) Fit the encoder on kitchen_qu
ohe.fit(kitchen_qu)

# Inspect the categories of the fitted encoder
ohe.categories_



[array(['Ex', 'Fa', 'Gd', 'TA'], dtype=object)]

In [17]:
# (4) Transform kitchen_qu using the encoder and
# assign the result to kitchen_qu_train
kitchen_qu_train = ohe.transform(kitchen_qu)

# Visually inspect kitchen_qu_train
kitchen_qu_train

array([[0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       ...,
       [0., 0., 0., 1.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.]])

In [18]:
# (5a) Make the transformed data into a dataframe
kitchen_qu_train = pd.DataFrame(
    # Pass in NumPy array
    kitchen_qu_train,
    # Set the column names to the categories found by OHE
    columns=ohe.categories_[0],
    # Set the index to match X_train's index
    index=X_train.index
)

# Visually inspect new dataframe
kitchen_qu_train

Unnamed: 0,Ex,Fa,Gd,TA
1023,0.0,0.0,1.0,0.0
810,0.0,0.0,1.0,0.0
1384,0.0,0.0,0.0,1.0
626,0.0,0.0,0.0,1.0
813,0.0,0.0,0.0,1.0
...,...,...,...,...
1095,0.0,0.0,1.0,0.0
1130,0.0,0.0,1.0,0.0
1294,0.0,0.0,0.0,1.0
860,0.0,0.0,1.0,0.0


2. FireplaceQu

In [19]:
# (0) import OneHotEncoder from sklearn.preprocessing
from sklearn.preprocessing import OneHotEncoder

# (1) Create a variable fireplace_qu_train
# extracted from X_train
# (double brackets due to shape expected by OHE)
fireplace_qu_train = X_train[["FireplaceQu"]]

# (2) Instantiate a OneHotEncoder with categories="auto",
# sparse_output=False (named `sparse` before scikit-learn 1.2),
# and handle_unknown="ignore"
ohe = OneHotEncoder(categories="auto", sparse_output=False, handle_unknown="ignore")

# (3) Fit the encoder on fireplace_qu_train
ohe.fit(fireplace_qu_train)

# Inspect the categories of the fitted encoder
ohe.categories_



[array(['Ex', 'Fa', 'Gd', 'N/A', 'Po', 'TA'], dtype=object)]

In [20]:
# (4) Transform fireplace_qu_train using the encoder and
# assign the result to fireplace_qu_encoded_train
fireplace_qu_encoded_train = ohe.transform(fireplace_qu_train)

# Visually inspect fireplace_qu_encoded_train
fireplace_qu_encoded_train

array([[0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0.],
       ...,
       [0., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1.]])

In [21]:
# (5a) Make the transformed data into a dataframe
fireplace_qu_encoded_train = pd.DataFrame(
    # Pass in NumPy array
    fireplace_qu_encoded_train,
    # Set the column names to the categories found by OHE
    columns=ohe.categories_[0],
    # Set the index to match X_train's index
    index=X_train.index
)

# Visually inspect new dataframe
fireplace_qu_encoded_train

Unnamed: 0,Ex,Fa,Gd,N/A,Po,TA
1023,0.0,0.0,1.0,0.0,0.0,0.0
810,0.0,1.0,0.0,0.0,0.0,0.0
1384,0.0,0.0,0.0,1.0,0.0,0.0
626,0.0,0.0,0.0,0.0,0.0,1.0
813,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...
1095,0.0,0.0,1.0,0.0,0.0,0.0
1130,0.0,0.0,0.0,0.0,0.0,1.0
1294,0.0,0.0,0.0,1.0,0.0,0.0
860,0.0,0.0,1.0,0.0,0.0,0.0


3. ExterQual

In [22]:
# (0) import OneHotEncoder from sklearn.preprocessing
from sklearn.preprocessing import OneHotEncoder

# (1) Create a variable external_qu_train
# extracted from X_train
# (double brackets due to shape expected by OHE)
external_qu_train = X_train[["ExterQual"]]

# (2) Instantiate a OneHotEncoder with categories="auto",
# sparse_output=False (named `sparse` before scikit-learn 1.2),
# and handle_unknown="ignore"
ohe = OneHotEncoder(categories="auto", sparse_output=False, handle_unknown="ignore")

# (3) Fit the encoder on external_qu_train
ohe.fit(external_qu_train)

# Inspect the categories of the fitted encoder
ohe.categories_



[array(['Ex', 'Fa', 'Gd', 'TA'], dtype=object)]

In [23]:
# (4) Transform external_qu_train using the encoder and
# assign the result to external_qu_encoded_train
external_qu_encoded_train = ohe.transform(external_qu_train)

# Visually inspect external_qu_encoded_train
external_qu_encoded_train

array([[0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       ...,
       [0., 0., 0., 1.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.]])

In [24]:
# (5a) Make the transformed data into a dataframe
external_qu_encoded_train = pd.DataFrame(
    # Pass in NumPy array
    external_qu_encoded_train,
    # Set the column names to the categories found by OHE
    columns=ohe.categories_[0],
    # Set the index to match X_train's index
    index=X_train.index
)

# Visually inspect new dataframe
external_qu_encoded_train

Unnamed: 0,Ex,Fa,Gd,TA
1023,0.0,0.0,1.0,0.0
810,0.0,0.0,0.0,1.0
1384,0.0,0.0,0.0,1.0
626,0.0,0.0,0.0,1.0
813,0.0,0.0,0.0,1.0
...,...,...,...,...
1095,0.0,0.0,1.0,0.0
1130,0.0,0.0,0.0,1.0
1294,0.0,0.0,0.0,1.0
860,0.0,0.0,1.0,0.0


In [25]:
# (5b) Drop the original categorical columns
X_train.drop(["FireplaceQu", "ExterQual", "KitchenQual"], axis=1, inplace=True)

# Visually inspect X_train
X_train

Unnamed: 0,OverallQual,GrLivArea,TotalBsmtSF,GarageCars,GarageArea,1stFlrSF,YearBuilt,YearRemodAdd,FullBath,Fireplaces,LotArea
1023,7,1504,1346,2,437,1504,2005,2006,2,1,3182
810,6,1309,1040,2,484,1309,1974,1999,1,1,10140
1384,6,1258,560,1,280,698,1939,1950,1,0,9060
626,5,1422,978,1,286,1422,1960,1978,1,1,12342
813,6,1442,1442,1,301,1442,1958,1958,1,0,9750
...,...,...,...,...,...,...,...,...,...,...,...
1095,6,1314,1314,2,440,1314,2006,2006,2,1,9317
1130,4,1981,1122,2,576,1328,1928,1950,2,2,7804
1294,5,864,864,2,572,864,1955,1990,1,0,8172
860,7,1426,912,1,216,912,1918,1998,1,1,7642


In [26]:

# (5c) Concatenate the new dataframes with current X_train
X_train = pd.concat([X_train, fireplace_qu_encoded_train, external_qu_encoded_train, kitchen_qu_train], axis=1)

# Visually inspect X_train
X_train

Unnamed: 0,OverallQual,GrLivArea,TotalBsmtSF,GarageCars,GarageArea,1stFlrSF,YearBuilt,YearRemodAdd,FullBath,Fireplaces,...,Po,TA,Ex,Fa,Gd,TA.1,Ex.1,Fa.1,Gd.1,TA.2
1023,7,1504,1346,2,437,1504,2005,2006,2,1,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
810,6,1309,1040,2,484,1309,1974,1999,1,1,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
1384,6,1258,560,1,280,698,1939,1950,1,0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
626,5,1422,978,1,286,1422,1960,1978,1,1,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
813,6,1442,1442,1,301,1442,1958,1958,1,0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1095,6,1314,1314,2,440,1314,2006,2006,2,1,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
1130,4,1981,1122,2,576,1328,1928,1950,2,2,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
1294,5,864,864,2,572,864,1955,1990,1,0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
860,7,1426,912,1,216,912,1918,1998,1,1,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0


In [27]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1095 entries, 1023 to 1126
Data columns (total 25 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   OverallQual   1095 non-null   int64  
 1   GrLivArea     1095 non-null   int64  
 2   TotalBsmtSF   1095 non-null   int64  
 3   GarageCars    1095 non-null   int64  
 4   GarageArea    1095 non-null   int64  
 5   1stFlrSF      1095 non-null   int64  
 6   YearBuilt     1095 non-null   int64  
 7   YearRemodAdd  1095 non-null   int64  
 8   FullBath      1095 non-null   int64  
 9   Fireplaces    1095 non-null   int64  
 10  LotArea       1095 non-null   int64  
 11  Ex            1095 non-null   float64
 12  Fa            1095 non-null   float64
 13  Gd            1095 non-null   float64
 14  N/A           1095 non-null   float64
 15  Po            1095 non-null   float64
 16  TA            1095 non-null   float64
 17  Ex            1095 non-null   float64
 18  Fa            1095 non-null   

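In [28]:
from sklearn.linear_model import LinearRegression

# The train/test split belongs before preprocessing (see Data Exploration);
# repeating it here would overwrite the preprocessed X_train with raw data.
# Assumption: `model` was never defined in this notebook, so a plain
# LinearRegression is used as a placeholder baseline.
model = LinearRegression()
model.fit(X_train, y_train)

In [29]:
from sklearn.model_selection import cross_val_score

# 3-fold cross-validation; for a regressor the default score is R^2
cross_val_score(model, X_train, y_train, cv=3)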
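
Step 4 of the requirements (preprocessing the test data) is not carried out above. One way to apply identical steps to both sets is to wrap the preprocessing and the model in a scikit-learn `Pipeline`, so that everything is fitted on the training data only and then reused on the test data. The sketch below runs on a made-up toy frame; the column choice, the variable names, and the use of `LinearRegression` are assumptions for illustration, not the notebook's final model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data standing in for the Ames subset; all values are made up
toy = pd.DataFrame({
    "GrLivArea": [1500, 1300, 1250, 1400, 1450, 1980, 860, 1420],
    "FireplaceQu": pd.Series(
        ["Gd", "Fa", np.nan, "TA", np.nan, "TA", np.nan, "Gd"], dtype=object
    ),
    "SalePrice": [210000, 150000, 120000, 160000, 140000, 230000, 90000, 200000],
})
X_toy = toy.drop("SalePrice", axis=1)
y_toy = toy["SalePrice"]
X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, random_state=42
)

# Categorical branch: fill missing values, then one-hot encode
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="N/A")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Numeric columns pass through unchanged
preprocess = ColumnTransformer(
    [("cat", categorical, ["FireplaceQu"])],
    remainder="passthrough",
)
pipeline = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])

# Fit on the training split only, then reuse the fitted steps on the test split
pipeline.fit(X_toy_train, y_toy_train)
test_predictions = pipeline.predict(X_toy_test)
print(test_predictions.shape)
```

With `handle_unknown="ignore"`, a category that appears only in the test split is encoded as all zeros instead of raising an error.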