# Data Preprocessing

In this section, we learn how to preprocess data using [CSV.jl](https://csv.juliadata.org/stable/) and [DataFrames.jl](https://dataframes.juliadata.org/stable/).

## Reading the Dataset

Comma-separated values (CSV) files are ubiquitous for storing tabular (spreadsheet-like) data. Each line corresponds to one record and consists of several comma-separated fields, e.g., “Albert Einstein,March 14 1879,Ulm,Federal polytechnic school,Accomplishments in the field of gravitational physics”. To demonstrate how to load CSV files with CSV.jl, we first create a small CSV file, `../data/house_tiny.csv`. This file represents a dataset of homes: each row corresponds to a distinct home, and the columns give the number of rooms (`NumRooms`), the roof type (`RoofType`), and the price (`Price`).

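```julia
using CSV

# Raw CSV text: a header row followed by four records; empty fields become missing values
csv_data = """
NumRooms,RoofType,Price
,,127500
2,,106000
4,Slate,178100
,,140000
"""

dir_path = joinpath("..", "data")
file_path = joinpath(dir_path, "house_tiny.csv")
mkpath(dir_path)  # create the directory if it does not exist yet
# Parse the text and write it to disk; CSV.write returns the file path
CSV.write(file_path, CSV.File(IOBuffer(csv_data)))
```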

"../data/house_tiny.csv"

Now let’s import `DataFrames` and load the dataset with `CSV.read`.

```julia
using DataFrames

# Pass the path directly so that CSV.jl opens and closes the file itself
data = CSV.read(file_path, DataFrame)
```

| Row | NumRooms | RoofType | Price  |
|-----|----------|----------|--------|
|     | Int64?   | String7? | Int64  |
| 1   | missing  | missing  | 127500 |
| 2   | 2        | missing  | 106000 |
| 3   | 4        | Slate    | 178100 |
| 4   | missing  | missing  | 140000 |
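
The `?` suffix on a column type in the output above indicates that the column may contain `missing`, i.e., its element type is `Union{Missing, T}`. The following is a minimal sketch of how one might confirm this and count the missing entries per column. To keep it self-contained, it rebuilds the same table directly as a `DataFrame` rather than reading the file, which is an assumption made for illustration (as a consequence, `RoofType` holds plain `String` values here instead of `String7`). The names `col_types` and `missing_counts` are illustrative only.

```julia
using DataFrames

# Assumption: the table is rebuilt inline instead of being read from the CSV file
data = DataFrame(
    NumRooms = [missing, 2, 4, missing],
    RoofType = [missing, missing, "Slate", missing],
    Price    = [127500, 106000, 178100, 140000],
)

# Element type of each column; columns that allow missing values are Union{Missing, T}
col_types = eltype.(eachcol(data))

# Number of missing entries in each column, keyed by column name
missing_counts = Dict(name => count(ismissing, data[!, name]) for name in names(data))
```

Only `Price` is fully populated, which is why it is the only column whose type is displayed without the `?` suffix.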


## Data Preparation

In supervised learning, we train models to predict a designated target value, given some set of input values. Our first step in processing the dataset is to separate the columns corresponding to inputs from the column holding the target. We can select columns either by name or by integer position.
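
As a rough sketch of both selection styles, assume the column layout shown earlier, with `NumRooms` and `RoofType` as inputs and `Price` as the target. The table is rebuilt inline so that the snippet runs on its own, and the names `inputs_by_name`, `inputs_by_pos`, and `targets` are illustrative rather than taken from the original text.

```julia
using DataFrames

# Assumption: the table is rebuilt inline instead of being read from the CSV file
data = DataFrame(
    NumRooms = [missing, 2, 4, missing],
    RoofType = [missing, missing, "Slate", missing],
    Price    = [127500, 106000, 178100, 140000],
)

# Select the input columns by name ...
inputs_by_name = data[:, [:NumRooms, :RoofType]]

# ... or by integer position (Julia indices start at 1)
inputs_by_pos = data[:, 1:2]

# The target is the remaining column; indexing a single column returns a vector
targets = data[:, :Price]
```

Selecting by name is usually more robust, since it keeps working if the column order of the file changes; positional indexing is convenient when the layout is fixed.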