# Intro to Data Manipulation and Visualization in Julia
In this section, we will learn and practice how to read in data, manipulate it, and visualize it in Julia. These are important steps in solving a real-world optimization problem, because before running the optimization you typically need to:
* Read in data,
* Visualize the data to detect patterns and outliers, and
* Transform the data into a form ready for the optimization program.

## DataFrames
Like `R`, `Julia` has a data frame structure for datasets. You will need to load the package `DataFrames` first:

In [None]:
using DataFrames

Now let's read in a csv file for the dataset _iris_ using the `readtable` function. The csv file should sit in the same directory as your script. Otherwise, you will need to change the path passed as the first argument to `readtable`.

In [None]:
iris = readtable("iris.csv");

In [None]:
### If you are unable to read the data, you can uncomment the following code and run it:
# using RDatasets
# iris = dataset("datasets", "iris")
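
If `readtable` is not available in your installation (it was removed from later DataFrames.jl releases), reading can be done with the separate CSV.jl package instead. The following is a minimal sketch, assuming CSV.jl and a recent DataFrames.jl are installed. The small file written here is made-up stand-in data with a hypothetical name, not the real `iris.csv`.

```julia
using CSV, DataFrames

# Hypothetical stand-in for iris.csv: three made-up rows in a temporary directory.
path = joinpath(mktempdir(), "iris_sample.csv")
write(path, """
SepalLength,SepalWidth,Species
5.1,3.5,setosa
7.0,3.2,versicolor
6.3,3.3,virginica
""")

# Read the file into a DataFrame; the second argument is the type to build.
iris_sample = CSV.read(path, DataFrame)

size(iris_sample)   # (number of rows, number of columns)
```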

To view the first few rows of the data, you can use `head()`, or index the dataframe much as you did in `R`:

To subset rows, pass in the indices in the first dimension. If you are not subsetting to particular columns, just pass in ``:`` in the second dimension (as opposed to leaving it blank in `R`).

In [None]:
iris[1:5,:]

To index a column by name, simply put a `:` in front of the name. When you are indexing an entire column, you do not need the row placeholder `:` or the `,`.

In [None]:
iris[:SepalLength]
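
The indexing forms above follow the DataFrames version this tutorial was written for. As a hedged sketch of the equivalents in DataFrames.jl 1.x (an assumption about your installed version), `first` plays the role of `head`, and a whole column is taken with `df.name` or `df[!, :name]`. The values below are toy data, not the real iris measurements.

```julia
using DataFrames

# Toy data for illustration only.
df = DataFrame(SepalLength = [5.1, 4.9, 7.0, 6.3],
               Species = ["setosa", "setosa", "versicolor", "virginica"])

first_rows = first(df, 2)         # first two rows, in the role of head()
row_slice  = df[1:2, :]           # the same rows via explicit row indices
col_a      = df.SepalLength       # whole column by property access
col_b      = df[!, :SepalLength]  # whole column by indexing, without copying
```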

We often need to join/merge datasets. Let's look at an example: suppose we have a dataframe that gives each species and its price at a flower shop:

In [None]:
species_price = DataFrame(Species = ["setosa", "versicolor", "virginica"],
                        Price = [2.5, 3.1, 3.2])

To join, simply pass in:
* the two data frames,
* the shared variable name, and
* the kind of join you want: `:left`, `:right`, `:inner`, `:outer`, etc.

In [None]:
join(iris, species_price, on = :Species, kind = :left)
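
In DataFrames.jl 1.x, the `kind` keyword is replaced by one function per join kind, such as `leftjoin` and `innerjoin`. The sketch below, assuming that version, contrasts the two on toy data: the `flowers` table and its `"unknown"` species are hypothetical and exist only to show what happens to unmatched rows.

```julia
using DataFrames

# Toy left table; "unknown" has no matching price on purpose.
flowers = DataFrame(Species = ["setosa", "virginica", "unknown"],
                    SepalLength = [5.1, 6.3, 5.0])
prices  = DataFrame(Species = ["setosa", "versicolor", "virginica"],
                    Price = [2.5, 3.1, 3.2])

# Left join keeps every row of `flowers`; unmatched prices become `missing`.
left = leftjoin(flowers, prices, on = :Species)

# Inner join keeps only species present in both tables.
inner = innerjoin(flowers, prices, on = :Species)
```

A left join is usually the safer choice when the first table is the one you are analysing, since no observations are silently dropped.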

## Data Manipulation with DataFramesMeta

When you need to subset the data based on some column, generate new variables, select columns, group and summarize, and so on, ``DataFramesMeta`` is a powerful package that lets you do so with easy syntax. In ``R`` we learned the ``dplyr`` package; this is very similar in spirit. Here is a mapping between the names:

```
DataFramesMeta     dplyr
---------------------------------
@where            filter
@transform        mutate
@by
@groupby          group_by
@based_on         summarise/do
@orderby          arrange
@select           select
```

Let's load the package first.

In [None]:
using DataFramesMeta

* To subset the data based on some criterion on a column, we can use the `@where` macro (similar to `filter`). Let's subset the data to only the observations where `SepalLength` is greater than 5. Recall that the dot in `.>` makes the comparison elementwise:

In [None]:
iris_sub = @where(iris, :SepalLength .> 5)
head(iris_sub)
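
To see what the elementwise comparison produces on its own, here is a small sketch in plain Julia with toy values. The comparison yields one `Bool` per element, and that mask is what selects the rows.

```julia
sepal_length = [4.9, 5.1, 5.0, 5.8]   # toy values, not the iris column

mask = sepal_length .> 5      # one Bool per element
kept = sepal_length[mask]     # keep only the elements where mask is true
```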

* To select columns, use the `@select` macro and pass as many columns as you want. Here we keep the `SepalLength`, `SepalWidth`, and `Species` columns from the iris data.

In [None]:
iris_select = @select(iris, :SepalLength, :SepalWidth, :Species)
head(iris_select)

* Similar to `mutate` in `R`, `@transform` creates new variables based on operations on existing variables. Let's create the log transformation of the variable `SepalLength` and name it `logSepalLength`:

In [None]:
# log.() applies log to each element of the column
iris_trans1 = @transform(iris, logSepalLength = log.(:SepalLength))
head(iris_trans1)

* You can do more complicated operations to customize the variable transform. For example, `map` lets you run a function on an array:

In [None]:
function sqrt_minus_1(x)
  sqrt(x)-1
end

map(sqrt_minus_1, [1, 2, 3])

* Since each column of the data frame is an array (a `DataArray`, to be precise), you can use `map` to transform a variable with a row-wise (element-by-element) operation. In this example, we generate a new variable `SepalLGroup` that is `large` when `SepalLength` is at least 5, and `small` otherwise.

In [None]:
iris_trans2 = @transform(iris, SepalLGroup = map(x -> x >= 5 ? "large" : "small", :SepalLength))
head(iris_trans2)
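
The labelling step can be tried outside a data frame first. The sketch below uses toy values and shows the `map` form next to a broadcast `ifelse`, which is offered here as an alternative and is not part of the original example. Note that the ternary operator needs spaces around `?` and `:` in current Julia.

```julia
sepal_length = [4.9, 5.0, 6.3]   # toy values

# Anonymous function with the ternary operator, applied to each element.
labels_map = map(x -> x >= 5 ? "large" : "small", sepal_length)

# Alternative: broadcast ifelse over the elementwise comparison.
labels_bc = ifelse.(sepal_length .>= 5, "large", "small")
```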

Finally, you can chain operations together, just like you did with `%>%` in `dplyr` for `R`. The syntax starts with the macro `@linq` and chains operations using the symbol `|>`. You do not need the `@` in front of each operation.
In this example, we do the following:
* create the log transformations of `SepalLength` and `SepalWidth`;
* subset to only the observations where `logSL` is at least 1;
* group by species and summarize the mean `logSL` and mean `logSW`;
* sort by the mean `logSL` in ascending order;
* select only the species (renamed to `var`) and the mean `logSL`.

In [None]:
iris_summary = @linq iris |>
    transform(logSL = log.(:SepalLength), logSW = log.(:SepalWidth)) |>
    where(:logSL .>= 1)|>
    by(:Species, meanLogSL = mean(:logSL), meanLogSW = mean(:logSW))|>
    orderby(:meanLogSL)|>
    select(var = :Species, :meanLogSL)
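
To make each step of the chain visible, here is a sketch of the same five steps written one at a time with plain DataFrames.jl functions. It assumes DataFrames.jl 1.x, which differs from the version used in the cell above, and it runs on made-up toy data with hypothetical species names `"a"` and `"b"`.

```julia
using DataFrames, Statistics

# Toy data for illustration only.
df = DataFrame(Species = ["a", "a", "b", "b"],
               SepalLength = [1.0, 3.0, 4.0, 5.0],
               SepalWidth = [2.0, 3.0, 2.5, 3.5])

# 1. Add the two log-transformed columns.
step1 = transform(df, :SepalLength => ByRow(log) => :logSL,
                      :SepalWidth => ByRow(log) => :logSW)
# 2. Keep rows where logSL is at least 1.
step2 = subset(step1, :logSL => ByRow(>=(1)))
# 3. Group by species and take the mean of each log column.
step3 = combine(groupby(step2, :Species),
                :logSL => mean => :meanLogSL,
                :logSW => mean => :meanLogSW)
# 4. Sort by the mean logSL in ascending order.
step4 = sort(step3, :meanLogSL)
# 5. Keep the species (renamed to var) and the mean logSL.
result = select(step4, :Species => :var, :meanLogSL)
```

Writing the steps out this way is more verbose than the chain, but it lets you inspect every intermediate table.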

## Exercise 1: Manipulate Icecream data

### Task 1: Read in the Icecream Data

This time, we are going to read in a dataset directly from the package `RDatasets`. Use the following syntax 
```dataset("Ecdat", "Icecream")```

and save it as a dataframe called `icecream`. 

The dataset describes ice cream consumption. The columns are:
* `Cons`: consumption level of ice cream
* `Income`: income level
* `Price`: price of ice cream
* `Temp`: outside temperature at time of measurement

Inspect the first few rows of the data.

### Task 2: Summarize by Temperature

We want to know whether a higher temperature is associated with higher consumption. Let's do the following:
* Create a variable called `TempGroup` that maps `Temp` to `low` if it is less than 50, and `high` otherwise.
* Group by this new variable `TempGroup` and calculate the mean consumption; name it `meanCons`.
* Sort by the new variable `meanCons` in ascending order.

You can do this in the chained syntax with `@linq` and `|>`, or by creating intermediate datasets along the way.

What are your findings?

### Task 3: Prepare for Optimization

We would like to have the dataset ready for optimization later. It needs the following:
* A column called `Revenue`, calculated as the product of `Cons` and `Price`.
* Only the observations where `Temp` is at least `45`.
* Only the columns `Revenue` and `Income` in the final data.
* The final data written to a csv file named `icecream_prepared.csv` in the same directory.

How many observations do you have in your final data?
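
Writing a csv file has not been shown yet. One option, assuming the CSV.jl package is installed, is `CSV.write`. The sketch below only demonstrates the call on a hypothetical two-row table with made-up values and a temporary path, so it does not give away the exercise.

```julia
using CSV, DataFrames

# Hypothetical toy table; the values are made up.
toy = DataFrame(Revenue = [0.10, 0.12], Income = [78, 79])

# Write to a temporary location, then read the file back to check it.
out_path = joinpath(mktempdir(), "toy_prepared.csv")
CSV.write(out_path, toy)
roundtrip = CSV.read(out_path, DataFrame)

nrow(roundtrip)   # number of observations in the written file
```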


## Plotting in Julia

Julia also has extensive support for plotting. 

* `Plots.jl` is a powerful and concise tool for plotting. It provides the interface to many other plotting packages with simple and consistent syntax.
* `StatPlots.jl` offers the DataFrames integration for `Plots`. You can pass in a data frame, and map aesthetics to the column names directly. 

Using these is somewhat similar to working with `ggplot2` in `R`.

Here is an example of a scatter plot based on the `iris` data, where the x axis is `SepalLength`, the y axis is `SepalWidth`, and the grouping (and therefore the color) is based on `Species`.

In [None]:
using Plots
using StatPlots
pyplot()
scatter(iris, :SepalLength, :SepalWidth, group=:Species)
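
In later releases the package is named `StatsPlots`, and columns are mapped through its `@df` macro instead of passing the data frame as the first argument. The following is a minimal sketch under that assumption, on toy data and with whatever plotting backend is the default in your installation.

```julia
using DataFrames, StatsPlots   # StatsPlots also makes the Plots functions available

# Toy data for illustration only.
toy = DataFrame(SepalLength = [5.1, 7.0, 6.3, 4.9],
                SepalWidth = [3.5, 3.2, 3.3, 3.0],
                Species = ["setosa", "versicolor", "virginica", "setosa"])

# @df replaces the column symbols with the matching columns of `toy`.
plt = @df toy scatter(:SepalLength, :SepalWidth, group = :Species)
```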

We can make the plot more beautiful by adding a few custom settings. For example:
* Give it a title
* Provide xlabel and ylabel
* Change the transparency, shape, and size of the dots
* Change the background color to dark grey

In [None]:
scatter(iris, :SepalLength, :SepalWidth, group=:Species,
        title = "A more beautiful plot",
        xlabel = "Length", ylabel = "Width",
        m=(0.5, [:cross :hex :star7], 12),
        bg=RGB(.2,.2,.2))

You can also draw a box plot (with the cool violin plot in the back) grouped by species. Note that the `!` in `boxplot!` adds the new plot on top of the existing one.

In [None]:
violin(iris,:Species,:SepalLength)
boxplot!(iris, :Species,:SepalLength, leg=false)

There are many other types of plots and custom options. You can explore more from [the tutorial](https://juliaplots.github.io/tutorial/).

## Exercise 2: Plotting Icecream data

With the same `icecream` data, explore the following questions using visualization:

### Task 1:
How is income related to consumption?

### Task 2:
Create the `Revenue` variable as the product of `Price` and `Cons`.

Do you see a positive relationship between the temperature and revenue?

### Task 3:
Create a new variable `IncomeGroup` that groups income based on a few buckets (your choice).

Plot the distribution of the consumption over the different groups. What do you find?