# Julia for Data Science

* Data
* **Data processing**
* Visualization

### Data processing: Standard machine learning algorithms in Julia
In what follows, we will see how to use some of the standard machine learning algorithms implemented in Julia.

In [None]:
using DataFrames

### Example 1: Kmeans Clustering

Let's start with some data.

The Sacramento real estate transactions file that we download next is a list of 985 real estate transactions in the Sacramento area reported over a five-day period.

In [None]:
download("http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv","houses.csv")
houses = readtable("houses.csv")

Let's use `Plots` to plot with the `pyplot` backend.

In [None]:
using Plots
plot(size=(500,500),leg=false)

Now let's create a scatter plot to show the price of a house vs. its square footage.

In [None]:
x = houses[:sq__ft]
y = houses[:price]
scatter(x,y,markersize=3)

*Houses with 0 square feet that cost money?*

The square footage seems to not have been recorded in these cases. 

Filtering these houses out is easy to do!

In [None]:
filter_houses = houses[houses[:sq__ft].>0,:]
x = filter_houses[:sq__ft]
y = filter_houses[:price]
scatter(x,y)

This makes sense! The higher the square footage, the higher the price.
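
The filter above works because the broadcast comparison `.>` produces one Boolean per row, and indexing with that mask keeps only the rows where it is `true`. Here is a minimal sketch of the same idea on plain vectors; the values are made up and merely stand in for the `:sq__ft` and `:price` columns.

```julia
# Toy stand-ins for the :sq__ft and :price columns (made-up values).
sqft  = [0, 850, 1200, 0, 2000]
price = [90000, 120000, 180000, 60000, 310000]

mask = sqft .> 0           # broadcast comparison: one Bool per element
kept_sqft  = sqft[mask]    # keeps the entries where mask is true
kept_price = price[mask]   # the same mask keeps the two vectors aligned

@show mask
@show kept_sqft
@show kept_price
```

Using one mask for both vectors is what keeps each price paired with its own square footage.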

We can also group a `DataFrame` by the values of a feature and apply a function to each group, using the `by` function.

In [None]:
by(filter_houses,:_type,size)

In [None]:
by(filter_houses,:_type,filter_houses->mean(filter_houses[:price]))
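
Conceptually, `by` splits the rows into groups, applies a function to each group, and combines the results. The sketch below mimics that pattern with a plain `Dict`, assuming Julia 1.x (where `mean` lives in `Statistics`). The type names and prices are made up, and this is not how `DataFrames` implements `by` internally.

```julia
using Statistics  # provides mean on Julia 1.x

# Made-up stand-ins for the :_type and :price columns.
types  = ["Residential", "Condo", "Residential", "Condo", "Multi-Family"]
prices = [200000.0, 100000.0, 300000.0, 150000.0, 250000.0]

# Split: collect the prices that belong to each type.
groups = Dict{String,Vector{Float64}}()
for (t, p) in zip(types, prices)
    push!(get!(groups, t, Float64[]), p)
end

# Apply and combine: one summary value per group.
group_sizes = Dict(t => length(ps) for (t, ps) in groups)
group_means = Dict(t => mean(ps) for (t, ps) in groups)

@show group_sizes
@show group_means
```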

Now let's do some kmeans clustering on this data.

First, we can load the `Clustering` package to do this.

In [None]:
#Pkg.add("Clustering")
using Clustering

Let's store the features `:latitude` and `:longitude` in an array `X` that we will pass to `kmeans`.

First we add data for `:latitude` and `:longitude` to a new `DataFrame` called `X`.

In [None]:
X = filter_houses[[:latitude,:longitude]]

and then we convert `X` to an `Array` via

```julia
X = Array(X)
```
or
```julia
X = convert(Array, X)
```

In particular,

```julia
X = Array{Float64}(X)
```
or
```julia
X = convert(Array{Float64}, X)
```
will turn `X` into an `Array` that stores `Float64`s.

In [None]:
X = Array{Float64}(X)

After the conversion, each row of `X` holds one house and each column holds one feature. `kmeans` expects the opposite layout, with one observation per column, so we transpose `X`.

In [None]:
X = X'

As a first pass at guessing how many clusters we might need, let's use the number of zip codes in our data.

(Try changing this to see how it impacts results!)

In [None]:
k = length(unique(filter_houses[:zip])) 

We can use the `kmeans` function to do kmeans clustering!

In [None]:
C = kmeans(X,k) # try changing k
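
To get a feel for what a k-means routine does, here is a minimal sketch of Lloyd's algorithm in plain Julia 1.x. It is an illustration only: the function name `lloyd_kmeans`, the fixed iteration count, and the toy data are assumptions made for this example, and the `Clustering` package uses its own initialization and stopping rules.

```julia
using Random, Statistics

# X is d×n: each column is one observation, the same layout passed to `kmeans` above.
function lloyd_kmeans(X::AbstractMatrix{<:Real}, k::Integer; iters::Integer = 20,
                      rng = Random.MersenneTwister(1))
    n = size(X, 2)
    # Initialize the centers with k distinct, randomly chosen observations.
    centers = float.(X[:, randperm(rng, n)[1:k]])
    assignments = zeros(Int, n)
    for _ in 1:iters
        # Assignment step: each point goes to its closest center.
        for j in 1:n
            best, bestdist = 0, Inf
            for c in 1:k
                dist = sum(abs2, X[:, j] .- centers[:, c])
                if dist < bestdist
                    best, bestdist = c, dist
                end
            end
            assignments[j] = best
        end
        # Update step: each center moves to the mean of its points.
        for c in 1:k
            members = findall(a -> a == c, assignments)
            isempty(members) && continue   # leave the center of an empty cluster unchanged
            centers[:, c] = vec(mean(X[:, members], dims = 2))
        end
    end
    return assignments, centers
end

# Two well-separated toy blobs (made-up coordinates).
toy = [0.0 0.1 0.2 5.0 5.1 5.2;
       0.0 0.1 0.0 5.0 5.1 5.0]
assignments, centers = lloyd_kmeans(toy, 2)
@show assignments
```

The cluster numbers themselves are arbitrary; what matters is which points end up sharing a number.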

Now let's create a new data frame, `df`, that holds the city, coordinates, and zip code of each house in `filter_houses`, together with a column for the cluster to which each house has been assigned.

In [None]:
df = DataFrame(cluster = C.assignments,city = filter_houses[:city],
    latitude = filter_houses[:latitude],longitude = filter_houses[:longitude],zip = filter_houses[:zip])

Let's plot each cluster as a different color.

In [None]:
clusters_figure = plot()
for i = 1:k
    clustered_houses = df[df[:cluster].== i,:]
    xvals = clustered_houses[:latitude]
    yvals = clustered_houses[:longitude]
    scatter!(clusters_figure,xvals,yvals,markersize=4)
end
xlabel!("Latitude")
ylabel!("Longitude")
title!("Houses color-coded by cluster")
display(clusters_figure)

And now let's try coloring them by zip code.

In [None]:
unique_zips = unique(filter_houses[:zip])
zips_figure = plot()
for uzip in unique_zips
    subs = filter_houses[filter_houses[:zip].==uzip,:]
    x = subs[:latitude]
    y = subs[:longitude]
    scatter!(zips_figure,x,y)
end
xlabel!("Latitude")
ylabel!("Longitude")
title!("Houses color-coded by zip code")
display(zips_figure)

Let's see the two plots together, one above the other.

In [None]:
plot(clusters_figure,zips_figure,layout=(2, 1))

The clusters do not match the zip codes exactly, but they come close. Now we know that ZIP codes are not randomly assigned!

### Example 2: Nearest Neighbor with a KDTree

For this example, let's start by loading the `NearestNeighbors` package.

In [None]:
using NearestNeighbors

With this package, we'll look for the `knearest` neighbors of one of the houses, `point`.

In [None]:
knearest = 10
id = 70 # try changing this
point = X[:,id]

Now we can build a `KDTree` and use `knn` to look for `point`'s nearest neighbors!

In [None]:
kdtree = KDTree(X)
idxs, dists = knn(kdtree, point, knearest, true)
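
A `KDTree` speeds up the search, but the answer it returns is the same one a brute-force scan would give. The sketch below, written for Julia 1.x, computes the distance to every column and keeps the `k` smallest. The function name `brute_force_knn` and the toy matrix are made up for this illustration.

```julia
# Brute-force k-nearest-neighbor search over the columns of a d×n matrix.
function brute_force_knn(X::AbstractMatrix{<:Real}, point::AbstractVector{<:Real}, k::Integer)
    # Euclidean distance from `point` to every column of X.
    dists = [sqrt(sum(abs2, X[:, j] .- point)) for j in 1:size(X, 2)]
    order = sortperm(dists)[1:k]   # indices of the k smallest distances, closest first
    return order, dists[order]
end

# Four made-up points on a line; the query is the first column.
toy = [0.0 1.0 2.0 10.0;
       0.0 0.0 0.0  0.0]
idxs_toy, dists_toy = brute_force_knn(toy, toy[:, 1], 3)
@show idxs_toy
@show dists_toy
```

Because the query point is itself a column of the matrix, it comes back as its own nearest neighbor at distance 0. The same should be expected of `point` in the `knn` call above, since `point` is a column of `X`.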

We'll first generate a plot with all of the houses in the same color,

In [None]:
x = filter_houses[:latitude];
y = filter_houses[:longitude];
scatter(x,y);

and then overlay the data corresponding to the nearest neighbors of `point` in a different color.

In [None]:
x = filter_houses[idxs,:latitude];
y = filter_houses[idxs,:longitude];
scatter!(x,y)

There are those nearest neighbors in red!

We can see the cities of the neighboring houses by using the indices, `idxs`, and the feature, `:city`, to index into the `DataFrame` `filter_houses`.

In [None]:
cities = filter_houses[idxs,:city]

### Example 3: PCA for dimensionality reduction

Let us try to reduce the dimensions of the price/area data from the houses dataset.

We can start by grabbing the square footage and prices of the houses and storing them in an `Array`.

In [None]:
F = filter_houses[[:sq__ft,:price]]
F = convert(Array{Float64,2},F)'

Recall how the data looks when we plot housing prices against square footage.

In [None]:
scatter(F[1,:],F[2,:])
xlabel!("Square footage")
ylabel!("Housing prices")

We can use the `MultivariateStats` package to run PCA

In [None]:
# Pkg.add("MultivariateStats")
using MultivariateStats

Use `fit` to fit the model

In [None]:
M = fit(PCA, F)

Note that you can choose the maximum dimension of the new space by setting `maxoutdim`, and you can change the method to, for example, `:svd` with the following syntax.

```julia
fit(PCA, F; maxoutdim = 1,method=:svd)
```

It seems like we only get one dimension with PCA! Let's use `transform` to map all of our 2D data in `F` to `1D` data with our model, `M`.

In [None]:
y = transform(M, F)

Let's use `reconstruct` to put our now 1D data, `y`, in a form that we can easily overlay (`Xr`) with our 2D data in `F` along the principal direction/component.

In [None]:
Xr = reconstruct(M, y)

And now we create that overlay, where we can see points along the principal component in red.

(Each blue point maps uniquely to some red point!)

In [None]:
scatter(F[1,:],F[2,:])
scatter!(Xr[1,:],Xr[2,:])
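
What `fit`, `transform`, and `reconstruct` do in this example can be sketched with the standard library alone: center the data, find the direction of largest variance with an SVD, project onto it, and map back. This is a hedged illustration for Julia 1.x on made-up data, assuming the data is centered by its mean first; it is not the `MultivariateStats` implementation.

```julia
using LinearAlgebra, Statistics

# Toy 2×n data, one observation per column (made-up values lying near a line).
D = [1.0 2.0 3.0 4.0 5.0;
     2.1 3.9 6.2 7.8 10.0]

mu = mean(D, dims = 2)    # per-feature mean (2×1)
Z = D .- mu               # centered data
U = svd(Z).U              # columns are principal directions, strongest first
w = U[:, 1]               # first principal direction (unit vector)

scores = w' * Z           # coordinate of each point along that direction
Dr = mu .+ w * scores     # 2×n reconstruction lying on the principal line

@show scores
@show Dr
```

Every reconstructed point lies on a single line through the mean, which is why the overlay above shows the second set of points falling on a straight line.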

### Example 4: Learn how to build a simple multi-layer-perceptron on the MNIST dataset

MNIST from: https://github.com/FluxML/model-zoo/blob/master/mnist/mlp.jl

Let's start by loading `Flux`, importing a few things from `Flux` explicitly, and bringing the `repeated` function into our scope.

In [None]:
using Flux, Flux.Data.MNIST
using Flux: onehotbatch, argmax, crossentropy, throttle
using Base.Iterators: repeated

We can now store all the MNIST images in `imgs` and take a peek into this vector to see what the data looks like.

In [None]:
imgs = MNIST.images()
imgs[3]

Let's look at the type of an individual image.

In [None]:
typeof(imgs[3])

#### Reorganizing our array of images

We see this is a 2D array that stores `ColorTypes`. To work more easily with this data, let's convert all `ColorTypes` to floating point numbers.

In [None]:
fpt_imgs = float.(imgs)

Now we can see what `imgs[3]` looks like as an array of floats, rather than as an array of colors!

In [None]:
fpt_imgs[3]

**Let's stack the images to create one large 2D array, `X`, that stores the data for each image as a column.**

To do this, we can **first** use `reshape` to unravel each image, creating a 1D array (`Vector`) of floats from a 2D array (`Matrix`) of floats.

In [None]:
unraveled_fpt_imgs = reshape.(fpt_imgs, :);
typeof(unraveled_fpt_imgs)

(Note that `Vector` is an alias for a 1D `Array`.)

In [None]:
Vector

This makes `unraveled_fpt_imgs` a `Vector` of `Vector`s where `imgs[3]` is now

In [None]:
unraveled_fpt_imgs[3]

After using `reshape` to get a `Vector` of `Vector`s, we can use `hcat` to build a `Matrix`, `X`, from `unraveled_fpt_imgs` where the `Vector`s stored in `unraveled_fpt_imgs` will become the columns of `X`.

Note that we're using the "splat" operator below, `...`, which allows you to pass all the elements of an object to a function as separate arguments, rather than just passing the object itself.

In [None]:
X = hcat(unraveled_fpt_imgs...)
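
The reshape, splat, and `hcat` steps are easier to follow on something small. Here is a sketch with three made-up 2×2 matrices standing in for the 28×28 digits, assuming Julia 1.x.

```julia
# Three tiny 2×2 "images" (made-up values) standing in for the 28×28 digits.
tiny_imgs = [[1.0 3.0; 2.0 4.0], [5.0 7.0; 6.0 8.0], [9.0 11.0; 10.0 12.0]]

# Unravel each matrix into a vector; Julia stores arrays column by column.
unraveled = reshape.(tiny_imgs, :)

# hcat(unraveled...) is the same call as hcat(unraveled[1], unraveled[2], unraveled[3]).
stacked = hcat(unraveled...)

# Reshaping a column recovers the original matrix.
recovered = reshape(stacked[:, 2], 2, 2)

@show stacked
@show recovered
```

Each column of `stacked` is one unraveled image, so the matrix has as many columns as there are images.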

#### How to go back to images from this 2D `Array`

So now each column in X is an image reshaped to a vector of floating points. Let's pick one column and see what the digit is.

Let's try to view the second image in the original array, `imgs`, by taking the second column of `X`

In [None]:
onefigure = X[:,2]

We'll `reshape` this array to a 2D, 28x28 array,

In [None]:
t1 = reshape(onefigure,28,28)

and finally use `colorview` from the `Images` package to view the handwritten digit.

In [None]:
using Images

In [None]:
colorview(Gray, t1)

*Our data is in working order!*

For our machine to learn the digit with which each image is associated, we'll need to train it using correct answers. Therefore we'll make use of the `labels` associated with these images from MNIST.

In [None]:
labels = MNIST.labels() # the true labels

One-hot-encode the labels with `onehotbatch`

In [None]:
Y = onehotbatch(labels, 0:9)

which gives a binary indicator vector for each figure
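
To make the encoding concrete, here is a hand-rolled sketch of one-hot encoding in Julia 1.x. The helper `onehot_matrix` and the three labels are made up for this illustration; `onehotbatch` returns its own specialized array type rather than a plain `BitMatrix`.

```julia
# A hand-rolled one-hot encoder: one column per label, one row per class.
function onehot_matrix(labels::AbstractVector, classes)
    classlist = collect(classes)
    Y = falses(length(classlist), length(labels))
    for (j, label) in enumerate(labels)
        i = findfirst(c -> c == label, classlist)
        i === nothing && error("label $label is not among the classes")
        Y[i, j] = true   # mark the row that corresponds to this label
    end
    return Y
end

toy_labels = [5, 0, 4]                   # made-up labels
Y_toy = onehot_matrix(toy_labels, 0:9)   # 10×3 matrix with a single `true` per column
@show Y_toy[:, 1]
```

Because the classes start at 0, the digit 5 is marked in row 6 of its column.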

Build the network

In [None]:
m = Chain(
  Dense(28^2, 32, relu),
  Dense(32, 10),
  softmax)

Define the loss functions and accuracy

In [None]:
loss(x, y) = Flux.crossentropy(m(x), y)
accuracy(x, y) = mean(argmax(m(x)) .== argmax(y))
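
The quantities involved here can be worked out by hand on a tiny example. The sketch below uses only Base Julia 1.x and `Statistics`: it implements a column-wise softmax, a common definition of mean cross-entropy for one-hot targets, and accuracy as the fraction of columns whose largest prediction matches the target. The function names and numbers are made up, and the `argmax` used is the one from Base, not one imported from `Flux`.

```julia
using Statistics

# Column-wise softmax: turn each column of scores into probabilities.
function softmax_cols(Z::AbstractMatrix)
    E = exp.(Z .- maximum(Z, dims = 1))   # subtract the column max for numerical stability
    return E ./ sum(E, dims = 1)
end

# Mean cross-entropy between predicted probabilities P and one-hot targets Y.
cross_entropy(P, Y) = -sum(Y .* log.(P)) / size(Y, 2)

# Fraction of columns whose largest prediction lines up with the target class.
function accuracy_cols(P, Y)
    predicted = [argmax(P[:, j]) for j in 1:size(P, 2)]
    actual    = [argmax(Y[:, j]) for j in 1:size(Y, 2)]
    return mean(predicted .== actual)
end

scores  = [2.0 0.0; 0.0 1.0; 0.0 3.0]   # made-up scores: 3 classes × 2 samples
targets = [1 0; 0 0; 0 1]               # one-hot targets: class 1, then class 3
P = softmax_cols(scores)

@show cross_entropy(P, targets)
@show accuracy_cols(P, targets)
```

The loss shrinks as the probability assigned to the correct class approaches 1, while accuracy only checks whether the correct class has the largest probability.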

Use `X` to create our training data, then declare our evaluation callback and our optimizer:

In [None]:
dataset = repeated((X, Y), 200)
evalcb = () -> @show(loss(X, Y))
opt = ADAM(Flux.params(m))

So far, we have defined our training data and our evaluation functions.

Let's take a look at the function signature of Flux.train!

In [None]:
?Flux.train!

**Now we can train our model and look at the accuracy thereafter.**

In [None]:
Flux.train!(loss, dataset, opt, cb = throttle(evalcb, 10))

accuracy(X, Y)

Now that we've trained our model, let's create test data, `tX`, 

In [None]:
tX = hcat(float.(reshape.(MNIST.images(:test), :))...)

and run our model on one of the images from `tX`

In [None]:
test_image = m(tX[:,1])

In [None]:
indmax(test_image) - 1

The largest element of `test_image` is the 8th element. The digit classes start at 0, hence the `- 1`, so our model says that the first test image is a "7".

Now we can look at the original image.

In [None]:
using Images
t1 = reshape(tX[:,1],28,28)
colorview(Gray, t1)

and there we have it!

### Example 5: Linear regression in Julia (we will write our own Julia code and Python code)

Let's try to find the best line fit of the following data:

In [None]:
xvals = repeat(1:0.5:10,inner=2)
yvals = 3 .+ xvals .+ 2 .* rand(length(xvals)) .- 1  # dotted operators broadcast the scalars over the vector
scatter(xvals,yvals,color=:black,leg=false)

We want to fit a line through this data.

Let's write a Julia function to do this.

In [None]:
function find_best_fit(xvals,yvals)
    meanx = mean(xvals)
    meany = mean(yvals)
    stdx = std(xvals)
    stdy = std(yvals)
    r = cor(xvals,yvals)
    a = r*stdy/stdx
    b = meany - a*meanx
    return a,b
end

To fit the line, we just need to find the slope and the y-intercept (a and b).

Then add this fit to the existing plot!

In [None]:
a,b = find_best_fit(xvals,yvals)
ynew = a .* xvals .+ b  # broadcast the scalar slope and intercept over xvals

In [None]:
plot!(xvals,ynew)
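
The slope formula in `find_best_fit`, the correlation times the ratio of the standard deviations, gives the ordinary least-squares line. As a quick check on made-up data (assuming Julia 1.x), the sketch below compares it with a direct least-squares solve using the backslash operator.

```julia
using Statistics

xs = [1.0, 2.0, 3.0, 4.0, 5.0]    # made-up data
ys = [2.9, 5.1, 7.0, 9.2, 10.8]

# Slope and intercept from the correlation formula used above.
a_cor = cor(xs, ys) * std(ys) / std(xs)
b_cor = mean(ys) - a_cor * mean(xs)

# The same line from a least-squares solve of [xs 1] * [a, b] ≈ ys.
A = hcat(xs, ones(length(xs)))
a_ls, b_ls = A \ ys

@show a_cor, b_cor
@show a_ls, b_ls
```

The two approaches agree up to floating-point rounding.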

Let's generate a much bigger dataset,

In [None]:
xvals = 1:100000;
xvals = repeat(xvals,inner=3);
yvals = 3 .+ xvals .+ 2 .* rand(length(xvals)) .- 1;

In [None]:
@show size(xvals)
@show size(yvals)

and now we can time how long it takes to find a fit to this data.

In [None]:
@time a,b = find_best_fit(xvals,yvals)

Now we will write the same code using Python

In [None]:
using PyCall
using Conda

In [None]:
py"""
import numpy
def find_best_fit_python(xvals,yvals):
    meanx = numpy.mean(xvals)
    meany = numpy.mean(yvals)
    stdx = numpy.std(xvals)
    stdy = numpy.std(yvals)
    r = numpy.corrcoef(xvals,yvals)[0][1]
    a = r*stdy/stdx
    b = meany - a*meanx
    return a,b
"""

In [None]:
find_best_fit_python = py"find_best_fit_python"

In [None]:
xpy = PyObject(xvals)
ypy = PyObject(yvals)
@time a,b = find_best_fit_python(xpy,ypy)

**Let's use the benchmarking package to time these two.**

In [None]:
using BenchmarkTools

In [None]:
# Interpolate the global variables with `$` so that @btime measures only the function call.
@btime find_best_fit_python($xvals,$yvals)

In [None]:
@btime find_best_fit($xvals,$yvals)