# Background - data processing

In the coming sections, we'll see how to use a few standard machine learning algorithms in Julia. To apply these algorithms, you'll first need to know how to load and manipulate the *data* that you'll pass to them.

We'll be using `JuliaDB`, a Julia package designed to allow you to work with "large, persistent datasets", to process our data. In this notebook, we'll review a few `JuliaDB` commands and familiarize you a bit with the dataset we'll use in later notebooks.

In [None]:
using JuliaDB

Let's start with some data.

The Sacramento real estate transactions file that we load next lists 985 real estate transactions in the Sacramento area reported over a five-day period.

In [None]:
houses = loadtable("../Intro-to-Julia/houses.csv")

`loadtable` gives us a summary with the names and types of all columns. Let's start by looking at how the housing prices (column 10) change with the house size/square footage (column 7).
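The same information can also be queried programmatically. The following is a minimal sketch on a small stand-in table: `toy` and its values are made up for illustration and are not taken from `houses.csv`. It assumes `colnames` and `length` as exported by `JuliaDB`.

```julia
using JuliaDB

# Hypothetical stand-in for `houses`; the values are made up.
toy = table((sq__ft = [0, 836, 1167], price = [59222, 68212, 68880]))

cols = colnames(toy)    # column names as symbols
nrows = length(toy)     # number of rows in the table
first_row = toy[1]      # first row, as a named tuple
```

On the real data, the same calls would be applied to `houses` instead of `toy`.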

To grab each column of interest from the `houses` table, we'll use the `select` command.

In [None]:
x = select(houses, :sq__ft)
y = select(houses, :price)
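As a rough sketch of how `select` behaves, the example below uses a small made-up table (`toy` is hypothetical, not part of the dataset). Passing a single column name should return that column as a vector, while passing a tuple of names should return a smaller table.

```julia
using JuliaDB

# Hypothetical stand-in for `houses`; the values are made up.
toy = table((sq__ft = [836, 1167, 796],
             price  = [68212, 68880, 81900],
             beds   = [2, 3, 2]))

sizes = select(toy, :sq__ft)            # one column: a plain vector
pair  = select(toy, (:sq__ft, :price))  # several columns: a smaller table
```

The vector form is what we pass to plotting functions; the table form is handy when several columns need to stay aligned row by row.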

We'll use `Plots` and its `scatter` function to create a scatter plot. To use the `pyplot` backend explicitly, call `pyplot()` after loading `Plots`.

In [None]:
using Plots
scatter(x,y,markersize=3)
xlabel!("Square footage")
ylabel!("Price")
title!("House prices vs. square footage")

*Why are there houses with 0 square feet that cost money?*

The square footage seems not to have been recorded in these cases.

Filtering these houses out is easy to do with `filter`!

Below we pass three inputs to `filter` to create a new table, `filtered_houses`:

* the rule to follow when filtering (only keep rows where `x > 0`, i.e. where the square footage is positive)
* the table to which we should apply the filtering rule
* the column of the table to which we should apply the filtering rule (using the `select` keyword argument)

In [None]:
filtered_houses = filter(x -> x > 0, houses; select = :sq__ft)
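To see what this kind of filtering does, here is a minimal sketch on a made-up table (`toy` and its values are hypothetical, chosen only for illustration). It applies the same rule and then counts how many rows were dropped.

```julia
using JuliaDB

# Hypothetical stand-in for `houses`; two rows have no recorded size.
toy = table((sq__ft = [0, 836, 1167, 0, 796],
             price  = [59222, 68212, 68880, 69307, 81900]))

# Keep only rows whose square footage is positive.
kept = filter(s -> s > 0, toy; select = :sq__ft)

# Number of rows removed by the filter.
dropped = length(toy) - length(kept)
```

Running the same subtraction with `houses` and `filtered_houses` would show how many transactions lack a recorded square footage.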

We again use `select` to grab the columns for house sizes and prices from the filtered table, and then we generate a new plot.

In [None]:
x = select(filtered_houses, :sq__ft)
y = select(filtered_houses, :price)
scatter(x,y)
xlabel!("Square footage")
ylabel!("Price")
title!("House prices vs. square footage (recorded only!)")

This makes sense! In general, the higher the square footage, the higher the price.
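One quick, informal way to check a trend like this numerically is the Pearson correlation coefficient, available as `cor` in Julia's `Statistics` standard library. The sketch below uses made-up vectors (`toy_x` and `toy_y` are hypothetical), not the real data.

```julia
using Statistics

# Hypothetical sizes and prices; the values are made up.
toy_x = [800, 1000, 1200, 1500]
toy_y = [60000, 75000, 80000, 120000]

# A value near 1 indicates a strong upward linear trend.
r = cor(toy_x, toy_y)
```

Calling `cor(x, y)` on the filtered columns would give the corresponding value for the housing data.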

In the next notebooks, we'll learn about dimensionality reduction and how to see this upward trend in prices with size more formally.