In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

url = "https://drive.google.com/file/d/1YxeVDZHfDhqWb0VOn-lfxnDKoLOayJeD/view?usp=drive_link"  # data from iteration 5
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
data = pd.read_csv(path)


X = data.drop(columns=['Id']).copy()
y = X.pop("Expensive")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [2]:
data["Id"]

0          1
1          2
2          3
3          4
4          5
        ... 
1455    1456
1456    1457
1457    1458
1458    1459
1459    1460
Name: Id, Length: 1460, dtype: int64

Explore the data a bit!

In [48]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1168 entries, 254 to 1126
Data columns (total 79 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   LotArea        1168 non-null   int64  
 1   LotFrontage    951 non-null    float64
 2   TotalBsmtSF    1168 non-null   int64  
 3   BedroomAbvGr   1168 non-null   int64  
 4   Fireplaces     1168 non-null   int64  
 5   PoolArea       1168 non-null   int64  
 6   GarageCars     1168 non-null   int64  
 7   WoodDeckSF     1168 non-null   int64  
 8   ScreenPorch    1168 non-null   int64  
 9   MSZoning       1168 non-null   object 
 10  Condition1     1168 non-null   object 
 11  Heating        1168 non-null   object 
 12  Street         1168 non-null   object 
 13  CentralAir     1168 non-null   object 
 14  Foundation     1168 non-null   object 
 15  ExterQual      1168 non-null   object 
 16  ExterCond      1168 non-null   object 
 17  BsmtQual       1140 non-null   object 
 18  BsmtCond   

That's a lot of columns! Will all of them be useful?

Don't forget to check the accompanying `.txt` file for additional info.
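One way to start answering that question is to look at how much of each column is missing and how many distinct categories the text columns hold. The sketch below uses a tiny made-up frame called `toy` as a stand-in for `X_train`; the values are invented for illustration, and only the column names are borrowed from the output above.

```python
import pandas as pd

# Toy stand-in for X_train (hypothetical values, real column names)
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "LotFrontage": [65.0, None, 68.0, None],
    "BsmtQual": ["Gd", "Gd", None, "TA"],
})

# Share of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Number of distinct categories in each non-numeric column (NaN is not counted)
n_categories = toy.select_dtypes(exclude="number").nunique()
print(n_categories)
```

Running the same two expressions on `X_train` should show which columns are mostly empty and which categorical columns would produce many one-hot columns.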

### The lazy model

In [49]:

# Select the numerical columns from X_train
X_num = X_train.select_dtypes(include="number").copy()

# Select the categorical columns from X_train
X_cat = X_train.select_dtypes(exclude="number").copy()

In [50]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer

# Numerical columns: fill missing values (SimpleImputer defaults to the mean)
num_pipe = make_pipeline(SimpleImputer())

# Categorical columns: fill missing values with "NA", then one-hot encode
cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="NA"),
    OneHotEncoder(handle_unknown="ignore"),
)

preprocessor = make_column_transformer(
    (num_pipe, X_num.columns),
    (cat_pipe, X_cat.columns),
)

lazy_pipe = make_pipeline(preprocessor, DecisionTreeClassifier())

lazy_pipe.fit(X_train, y_train)

Since this is a *lazy* model, we won't tune or even test it!
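For a less lazy model, the held-out split from the first cell is where a check would happen. The sketch below shows the idea on made-up data rather than the housing set: `toy`, `X_tr`, `X_te`, and the values are all hypothetical, and the pipeline is a simplified numeric-only version of the one above.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical values) standing in for the housing data
toy = pd.DataFrame({
    "LotArea": [5000, 6000, 7000, 8000, 15000, 16000, 17000, 18000] * 5,
    "LotFrontage": [50.0, None, 60.0, 65.0, 90.0, None, 100.0, 110.0] * 5,
    "Expensive": [0, 0, 0, 0, 1, 1, 1, 1] * 5,
})
X_toy = toy.drop(columns="Expensive")
y_toy = toy["Expensive"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Simplified pipeline: impute missing numbers, then fit a decision tree
pipe = make_pipeline(SimpleImputer(), DecisionTreeClassifier(random_state=0))
pipe.fit(X_tr, y_tr)

# Compare accuracy on the data the model saw with accuracy on data it did not
train_acc = accuracy_score(y_tr, pipe.predict(X_tr))
test_acc = accuracy_score(y_te, pipe.predict(X_te))
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

An untuned decision tree tends to score close to perfectly on its own training data, so the test accuracy is the number that says something about new houses.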

### The test set

Here's even more new data, this time *without labels*. To see how well our model performs, we'll predict whether each of these houses is expensive and upload the results to the [competition site](https://housingcomp-data023.streamlit.app/).

In [5]:
test_url = "https://drive.google.com/file/d/1MZnPvWoGQtBHij32Rti26C2T0KT1xGBc/view?usp=drive_link"
test_path = 'https://drive.google.com/uc?export=download&id='+test_url.split('/')[-2]
test = pd.read_csv(test_path)
test

Unnamed: 0,LotArea,LotFrontage,TotalBsmtSF,BedroomAbvGr,Fireplaces,PoolArea,GarageCars,WoodDeckSF,ScreenPorch,MSZoning,...,GarageType,GarageFinish,GarageQual,GarageCond,PavedDrive,PoolQC,Fence,MiscFeature,SaleType,SaleCondition
0,11622,80.0,882.0,2,0,0,1.0,140,120,RH,...,Attchd,Unf,TA,TA,Y,,MnPrv,,WD,Normal
1,14267,81.0,1329.0,3,0,0,1.0,393,0,RL,...,Attchd,Unf,TA,TA,Y,,,Gar2,WD,Normal
2,13830,74.0,928.0,3,1,0,2.0,212,0,RL,...,Attchd,Fin,TA,TA,Y,,MnPrv,,WD,Normal
3,9978,78.0,926.0,3,1,0,2.0,360,0,RL,...,Attchd,Fin,TA,TA,Y,,,,WD,Normal
4,5005,43.0,1280.0,2,0,0,2.0,0,144,RL,...,Attchd,RFn,TA,TA,Y,,,,WD,Normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1454,1936,21.0,546.0,3,0,0,0.0,0,0,RM,...,,,,,Y,,,,WD,Normal
1455,1894,21.0,546.0,3,0,0,1.0,0,0,RM,...,CarPort,Unf,TA,TA,Y,,,,WD,Abnorml
1456,20000,160.0,1224.0,4,1,0,2.0,474,0,RL,...,Detchd,Unf,TA,TA,Y,,,,WD,Abnorml
1457,10441,62.0,912.0,3,0,0,0.0,80,0,RL,...,,,,,Y,,MnPrv,Shed,WD,Normal


In [7]:
data["Id"].tail()

1455    1456
1456    1457
1457    1458
1458    1459
1459    1460
Name: Id, dtype: int64

In [6]:
test["Id"].head()

0    1461
1    1462
2    1463
3    1464
4    1465
Name: Id, dtype: int64

The upload will be a `.csv` file with two columns: "Id" and "Expensive" (the columns __*must*__ have these names and they __*must*__ be in this order).

The resulting file should start off something like this:
> Id,Expensive    
1461,0    
1462,1    
1463,0    
1464,1   
1465,1    
1466,1   

If your "Id"s differ from these, you've run your model on the wrong file.
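These requirements can be checked before uploading. The helper below is a minimal sketch, not part of the competition tooling: the name `check_submission` and its messages are made up for illustration, and the default `first_id=1461` is taken from the example above.

```python
import pandas as pd

def check_submission(sub, first_id=1461):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    # Column names and their order both matter
    if list(sub.columns) != ["Id", "Expensive"]:
        problems.append("columns must be exactly ['Id', 'Expensive'], in that order")
        return problems
    if sub.empty:
        problems.append("submission has no rows")
        return problems
    # A different first Id suggests predictions were made on the wrong file
    if sub["Id"].iloc[0] != first_id:
        problems.append("unexpected first Id: wrong test file?")
    # Assumption: predictions are binary labels, as in the example above
    if not sub["Expensive"].isin([0, 1]).all():
        problems.append("'Expensive' must contain only 0 and 1")
    return problems

good = pd.DataFrame({"Id": [1461, 1462, 1463], "Expensive": [0, 1, 0]})
print(check_submission(good))                       # no problems expected
print(check_submission(good[["Expensive", "Id"]]))  # wrong column order
```

Calling it on the frame that is about to be written to `.csv` gives a quick check before using up an upload.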

In [54]:
# The dataframe passed to predict() must have the same columns as the one the model was trained on.
# "Id" was not a training column, so drop it for the prediction only;
# it is still needed for the submission, so don't drop it permanently!

test["Expensive"] = lazy_pipe.predict(test.drop(columns=["Id"]))

test["Expensive"]

0       0
1       0
2       0
3       0
4       0
       ..
1454    0
1455    0
1456    1
1457    0
1458    0
Name: Expensive, Length: 1459, dtype: int64

In [53]:

test[["Id", "Expensive"]].to_csv('./lazy_model.csv', index=False)  # index=False keeps the row index out of the file