# Train/test splitting

> First, we need to hold out some data to evaluate how well the machine learning models we build generalize to new data.
> This is called splitting the data into a training set and a test set.

In [None]:
# | default_exp train_test_split

In [None]:
# | hide
from nbdev.showdoc import *
from fastcore.test import *

When splitting our data, we want the training set to be representative of the cases we want to generalize to. Otherwise, the machine learning models we train would not make accurate predictions.
The same holds for the test set: the distribution of key features correlated with our target must be preserved in it. We then evaluate our models against representative data, so we can trust the measured quality of their predictions.

Splitting the data in this manner is called _stratified sampling_. To do so, we first need some basic exploratory analysis of our data, which is what we will do now.
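
As a rough illustration of the idea, here is a minimal sketch of a stratified split using only pandas and NumPy. It is not the split used later in this project: the helper `stratified_split`, the synthetic `toy` DataFrame and its `feature` and `stratum` columns are all hypothetical names invented for this example. The sketch bins a continuous feature into quartiles and samples the same fraction of rows from each bin.

```python
import numpy as np
import pandas as pd


def stratified_split(df, strata_col, test_size=0.2, seed=42):
    """Hypothetical helper: hold out `test_size` of the rows of every stratum."""
    # Sample the same fraction of rows from each stratum for the test set
    test = df.groupby(strata_col).sample(frac=test_size, random_state=seed)
    # Everything that was not sampled goes to the training set
    train = df.drop(test.index)
    return train, test


# Synthetic data, made up for illustration only
rng = np.random.default_rng(0)
toy = pd.DataFrame({"feature": rng.lognormal(mean=1.0, sigma=0.5, size=1000)})

# Bin the continuous feature into four equally sized strata (quartiles)
toy["stratum"] = pd.qcut(toy["feature"], q=4, labels=False)

train, test = stratified_split(toy, "stratum")

# The stratum proportions in the test set should match the full data
print(toy["stratum"].value_counts(normalize=True).sort_index())
print(test["stratum"].value_counts(normalize=True).sort_index())
```

A purely random split would only match these proportions approximately, and the mismatch can be noticeable on small datasets. Sampling within each stratum keeps them aligned by construction.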

In [None]:
import matplotlib.pyplot as plt
import matplotlib
import pandas as pd
import numpy as np
from pathlib import Path

In [None]:
plt.rcParams["figure.figsize"] = (15, 6)
plt.style.use("ggplot")
plt.rcParams["axes.prop_cycle"] = matplotlib.cycler(color=["#1f77b4", "red"])

Let's first define some functions to import the data:

In [None]:
#|export
def get_project_path():
    """
    Get the path of the project root (the parent of the current working directory)
    :return: Path
    """
    return Path.cwd().parent.resolve()

In [None]:
get_project_path()

In [None]:
#|hide
test_eq(get_project_path().name, 'housing-ageron-nbdev')

In [None]:
# | export
def load_housing_raw_data():
    """Load the raw housing data from data/raw/housing.xlsx as a DataFrame."""
    project_path = get_project_path()
    return pd.read_excel(project_path / "data" / "raw" / "housing.xlsx")

In [None]:
df_housing_raw = load_housing_raw_data()

In [None]:
# | hide
import nbdev

nbdev.nbdev_export()