# Predicting values

This notebook aims to help you get started with ML using the [housing dataset](https://platform.wbscodingschool.com/courses/data-science/12667/). Take a look at the [platform](https://platform.wbscodingschool.com/courses/data-science/14402/) before starting the task.

The goal is to create a simple model based on some basic EDA, apply it to our housing data, and calculate its performance.

In [None]:
import pandas as pd
housing = pd.read_csv('https://raw.githubusercontent.com/JoanClaverol/housing_data/main/housing-classification-iter-0-2.csv')

## Initial exploration

What columns exist in this data? What are their data types?

In [None]:
housing.tail()

Unnamed: 0,LotArea,LotFrontage,TotalBsmtSF,BedroomAbvGr,Fireplaces,PoolArea,GarageCars,WoodDeckSF,ScreenPorch,Expensive
1455,7917,62.0,953,3,1,0,2,0,0,0
1456,13175,85.0,1542,3,2,0,2,349,0,0
1457,9042,66.0,1152,4,2,0,1,0,0,1
1458,9717,68.0,1078,2,0,0,1,366,0,0
1459,9937,75.0,1256,3,0,0,1,736,0,0
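`tail()` shows the last rows, but not the data types. A minimal sketch of how both questions could be answered, using a small hand-made frame: `toy` is a hypothetical stand-in for `housing`, with a few values copied from the rows above.

```python
import pandas as pd

# Toy stand-in for `housing` (hypothetical): three rows copied from the output above.
toy = pd.DataFrame({
    "LotArea": [7917, 13175, 9042],
    "LotFrontage": [62.0, 85.0, 66.0],
    "Fireplaces": [1, 2, 2],
    "Expensive": [0, 0, 1],
})

column_names = toy.columns.tolist()  # names of all columns
column_types = toy.dtypes            # one dtype per column

print(column_names)
print(column_types)
```

On the real data, `housing.dtypes` or `housing.info()` would give the same kind of overview.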


In [None]:
housing['LotArea'].mean()

10516.828082191782

Do we have missing values in this dataset?

In [None]:
housing.isna().sum()

LotArea           0
LotFrontage     259
TotalBsmtSF       0
BedroomAbvGr      0
Fireplaces        0
PoolArea          0
GarageCars        0
WoodDeckSF        0
ScreenPorch       0
Expensive         0
dtype: int64
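Raw counts are easier to judge as a share of all rows. A small sketch, assuming a hypothetical frame `toy` in which one `LotFrontage` value has been set to `NaN` purely for illustration:

```python
import numpy as np
import pandas as pd

# Toy stand-in for `housing` (hypothetical); the NaN is placed by hand.
toy = pd.DataFrame({
    "LotArea": [7917, 13175, 9042, 9717],
    "LotFrontage": [62.0, np.nan, 66.0, 68.0],
})

missing_share = toy.isna().mean()                    # fraction of NaN per column
rows_with_missing = toy[toy["LotFrontage"].isna()]   # the affected rows themselves

print(missing_share)
print(rows_with_missing)
```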

Do we have duplicated information?

In [None]:
housing.duplicated().any()

True
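`any()` only says that at least one duplicated row exists. The sketch below, on a hypothetical frame `toy`, shows one way to count the duplicates and look at them before deciding what to do:

```python
import pandas as pd

# Toy stand-in for `housing` (hypothetical): the first and last rows are identical.
toy = pd.DataFrame({
    "LotArea": [7917, 13175, 7917],
    "GarageCars": [2, 2, 2],
})

n_duplicates = int(toy.duplicated().sum())      # repeats after the first occurrence
all_copies = toy[toy.duplicated(keep=False)]    # every row that has a twin
deduplicated = toy.drop_duplicates()            # keeps the first occurrence only

print(n_duplicates)
print(all_copies)
```

Whether dropping them is appropriate depends on whether they are true repeats or simply different houses that happen to share the same values.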

Is there any column that helps us identify if a house is expensive or not?

In [None]:
housing['LotArea'].mean()

10516.828082191782
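A single overall mean does not yet show how a column relates to `Expensive`. One possible approach is to compare each column's mean per class, or to look at its correlation with the target. The sketch uses a hypothetical frame `toy`; its last row is invented for illustration.

```python
import pandas as pd

# Toy stand-in for `housing` (hypothetical); the last row is made up.
toy = pd.DataFrame({
    "LotArea": [7917, 13175, 9042, 9717, 9937, 15000],
    "Fireplaces": [1, 2, 2, 0, 0, 2],
    "Expensive": [0, 0, 1, 0, 0, 1],
})

# Mean of every feature, split by class (0 = not expensive, 1 = expensive).
class_means = toy.groupby("Expensive").mean()

# Correlation of every feature with the target.
correlations = toy.corr()["Expensive"].drop("Expensive")

print(class_means)
print(correlations)
```

Columns whose class means differ clearly, or whose correlation with the target is far from zero, are candidates for a simple rule.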

## Create your first model

Based on the previous exploration, you have found some columns that are related to the price of a house. Now it's your turn to create a Python function that classifies whether a house is going to be expensive (`1`) or not (`0`). Read the following article on the [platform](https://platform.wbscodingschool.com/courses/data-science/14406/) to understand more about this process.

What are the predictions of your model?

In [None]:
housing_t = housing.copy()
y = housing_t.pop("Expensive")
X = housing_t

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=31416)

In [None]:
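# your code here
from sklearn.metrics import accuracy_score

def area_model(df):
    """Predict 1 (expensive) when LotArea is at or above the mean LotArea, else 0."""
    # Threshold: mean LotArea of the full `housing` data, as computed in the exploration.
    threshold = housing['LotArea'].mean()
    newdf = df.assign(prediction = 0)
    # Build the mask from the rows passed in, not from the global `housing` frame.
    newdf.loc[newdf['LotArea'] >= threshold, "prediction"] = 1
    return newdf.prediction.tolist()

pred_area_train = area_model(X_train)

accuracy_score(y_true = y_train,
               y_pred = pred_area_train
              )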

0.7054794520547946
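The function above takes its threshold from the mean of the full `housing` data, which includes rows that later end up in the test set. A variant worth considering is to derive the threshold from the training rows only. The sketch below is one way to do that; `fit_threshold`, `predict_with_threshold`, `toy_train` and `toy_test` are hypothetical names, and the value `12000` is invented.

```python
import pandas as pd

def fit_threshold(train_df, column="LotArea"):
    """Learn the decision threshold from training data only."""
    return train_df[column].mean()

def predict_with_threshold(df, threshold, column="LotArea"):
    """Return 1 where the column is at or above the threshold, else 0."""
    return (df[column] >= threshold).astype(int).tolist()

# Toy frames (hypothetical) standing in for X_train and X_test.
toy_train = pd.DataFrame({"LotArea": [7917, 13175, 9042, 9717]})
toy_test = pd.DataFrame({"LotArea": [9937, 12000]})

threshold = fit_threshold(toy_train)                      # uses training rows only
test_predictions = predict_with_threshold(toy_test, threshold)

print(threshold, test_predictions)
```

Separating the "fit" step from the "predict" step mirrors how scikit-learn estimators work, which may make a later switch to a real model easier.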

## Evaluate its performance

How can we evaluate our model? Is there a way to check its performance?

In [None]:
pred_area_test = area_model(X_test)
accuracy_score(y_true = y_test,
               y_pred = pred_area_test
              )

0.7226027397260274

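The accuracy on the training set (about 0.71) and on the test set (about 0.72) is nearly the same, so the model seems to have low variance. Both scores are modest, though, which points to high bias: a single threshold on `LotArea` is probably too simple to capture what makes a house expensive.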
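An accuracy score is easier to interpret next to a baseline. The sketch below, on made-up labels (`y_true` and `y_pred` are hypothetical and not taken from the housing data), compares a model's accuracy with that of always predicting the most frequent class, and builds a confusion matrix to show which kind of error dominates.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions, for illustration only.
y_true = pd.Series([0, 0, 0, 0, 1, 1, 0, 0, 1, 0])
y_pred = [0, 1, 0, 0, 1, 0, 0, 0, 0, 1]

# Baseline: always predict the most frequent class in y_true.
majority_class = y_true.mode()[0]
baseline_pred = [majority_class] * len(y_true)

baseline_accuracy = accuracy_score(y_true, baseline_pred)
model_accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

print(baseline_accuracy, model_accuracy)
print(cm)
```

If a model does not beat the majority-class baseline, its accuracy mostly reflects how unbalanced the classes are rather than anything it has learned.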