## Julia Tutorial 002: DataFrames

In this tutorial we walk through how to extract summaries, rename columns, select columns, filter rows, sort rows, create new variables, join datasets, and create grouped summaries. DataFrames.jl is a Julia package for manipulating tabular data. We use the "mtcars" dataset from the RDatasets.jl package and make limited use of the StatsBase.jl package.

The DataFrames.jl documentation can be found here: https://juliadata.github.io/DataFrames.jl/stable/

In [2]:
#import Pkg
#Pkg.add("DataFrames")
#Pkg.add("RDatasets")
#Pkg.add("StatsBase")

In [3]:
# Load required packages into the environment
using DataFrames, RDatasets, StatsBase

In [None]:
# Load the "mtcars" dataset from the "datasets" R package
mtcars = dataset("datasets", "mtcars")

In [4]:
# Return the type of the "mtcars" object (a DataFrame)
typeof(mtcars)


In [None]:
# Return the dimensions of the "mtcars" DataFrame
size(mtcars)

In [None]:
# Return the number of rows in the "mtcars" DataFrame
nrow(mtcars)

In [None]:
# Return the number of columns in the "mtcars" DataFrame
ncol(mtcars)

In [None]:
# Return a numerical summary of the "mtcars" DataFrame
describe(mtcars)

In [5]:
# Isolate the variable "MPG" and return a numerical summary
mpg = mtcars[!, :MPG]   # the df[:col] form was removed from DataFrames.jl; use df[!, :col] or df.col
describe(mpg)
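
A note on column access, since it comes up in most of the cells that follow. In DataFrames.jl 1.0 and later a single column is taken with `df.col` or `df[!, :col]` (both return the stored vector without copying) or with `df[:, :col]` (which returns a copy). The sketch below uses a small hypothetical table, `df`, rather than mtcars, so that it runs on its own.

```julia
using DataFrames

# Hypothetical toy table standing in for mtcars
df = DataFrame(Model = ["A", "B", "C"], MPG = [21.0, 22.8, 18.7])

col_ref  = df.MPG        # the stored column itself (no copy)
col_same = df[!, :MPG]   # same vector as df.MPG
col_copy = df[:, :MPG]   # an independent copy

col_copy[1] = 0.0        # changes only the copy
col_ref[2]  = 99.0       # changes the column inside df
```

Use the copying form when the extracted vector will be modified and the DataFrame should stay as it is.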


In [None]:
# Rename multiple columns of the "mtcars" DataFrame
mtcars = rename(mtcars, :Cyl => :cylinders, :VS => :Engine);

In [None]:
# Use rename!() (note the trailing "!") to rename columns in place, without reassigning "mtcars"
rename!(mtcars, :HP => :Horsepower, :WT => :weight);

In [6]:
# Eliminate the "DRat" column from the "mtcars" DataFrame, using the Not() function
mtcars = mtcars[:, Not(:DRat)];


In [None]:
# Use the select() and Not() functions to eliminate the "QSec" column
mtcars = select(mtcars, Not(:QSec));

In [None]:
# Select desired columns from the "mtcars" DataFrame
mtcars = mtcars[:, [:Model, :MPG, :cylinders, :Horsepower, :weight]]

In [None]:
# Filter rows of the "mtcars" DataFrame
fourCyl = mtcars[mtcars.cylinders .== 4, :];
sixCyl = mtcars[mtcars.cylinders .== 6, :];
eightCyl = mtcars[mtcars.cylinders .== 8, :]
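
Row filtering can also be written with the `filter` and `subset` functions that DataFrames.jl provides. The sketch below shows the three forms side by side on a hypothetical toy table, `cars`, whose values are made up for illustration.

```julia
using DataFrames

# Hypothetical toy table with the same column names as the renamed mtcars
cars = DataFrame(Model = ["A", "B", "C", "D"],
                 cylinders = [4, 6, 4, 8],
                 MPG = [30.4, 21.0, 22.8, 18.7])

# Boolean-mask indexing, as in the cell above
four_a = cars[cars.cylinders .== 4, :]

# filter takes a column => predicate pair, applied row by row
four_b = filter(:cylinders => ==(4), cars)

# subset expects a whole-column function, so wrap the predicate in ByRow
four_c = subset(cars, :cylinders => ByRow(==(4)))
```

All three return a new DataFrame and leave `cars` untouched.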

In [None]:
# Sort rows of the "mtcars" DataFrame
sort!(mtcars, :MPG, rev = true)
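
Sorting on more than one column, with a different direction per column, can be sketched with `order`. The table `cars` below is a hypothetical stand-in with made-up values; `sort` (without `!`) returns a sorted copy instead of reordering the original.

```julia
using DataFrames

# Hypothetical toy table
cars = DataFrame(Model = ["A", "B", "C", "D"],
                 cylinders = [6, 4, 4, 6],
                 MPG = [21.0, 22.8, 30.4, 19.2])

# Ascending by cylinders, then descending by MPG within each cylinder count
sorted = sort(cars, [:cylinders, order(:MPG, rev = true)])
```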

In [None]:
# Create new variables in the "mtcars" DataFrame
mtcars[!, :powerToWeightRatio] = mtcars.Horsepower ./ mtcars.weight;
mtcars[!, :horsepowerTimesMPG] = mtcars.Horsepower .* mtcars.MPG;
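
The same derived columns can be created with `transform`, which returns a new DataFrame and leaves the input unchanged. This is a minimal sketch on a hypothetical toy table, `cars`, with made-up values.

```julia
using DataFrames

# Hypothetical toy table with the column names used in this tutorial
cars = DataFrame(Horsepower = [110, 93, 175],
                 weight = [2.62, 2.32, 3.44],
                 MPG = [21.0, 22.8, 18.7])

# source columns => row-wise function => new column name
cars2 = transform(cars,
    [:Horsepower, :weight] => ByRow(/) => :powerToWeightRatio,
    [:Horsepower, :MPG]    => ByRow(*) => :horsepowerTimesMPG)
```

`transform!` does the same in place, which matches the behaviour of the column assignments above.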

In [None]:
# Look at the first six rows of the "mtcars" DataFrame
carsHead = first(mtcars, 6)   # head() was removed from DataFrames.jl; first(df, n) replaces it
carsHead

In [None]:
# Create a new DataFrame called "models"
models = DataFrame(
    Manufacturer = ["Toyota", "Fiat", "Honda", "Lotus", "Fiat", "Porsche"],
    Model = ["Toyota Camry", "Fiat 128", "Honda Civic", "Lotus Europa", "Fiat X1-9", "Porsche 914-2"]
    )

In [None]:
# Left join the "models" and "carsHead" DataFrames
carInfo = leftjoin(
    models,
    carsHead[:, [:Model, :MPG, :Horsepower, :weight]],
    on = :Model)
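
A left join keeps every row of the left table. Where the right table has no match, the joined columns are filled with `missing`. That is what happens to "Toyota Camry" above, since mtcars contains no model with that name. The sketch below reproduces the effect with two hypothetical toy tables, `makers` and `specs`, and uses the `leftjoin` function from DataFrames.jl 1.0 and later.

```julia
using DataFrames

# Hypothetical left table: one model has no counterpart on the right
makers = DataFrame(Model = ["Fiat 128", "Honda Civic", "Toyota Camry"],
                   Manufacturer = ["Fiat", "Honda", "Toyota"])

# Hypothetical right table
specs = DataFrame(Model = ["Fiat 128", "Honda Civic"],
                  MPG = [32.4, 30.4])

joined = leftjoin(makers, specs, on = :Model)

# Locate the unmatched row by name rather than by position
camryRow = findfirst(==("Toyota Camry"), joined.Model)
ismissing(joined[camryRow, :MPG])
```

`innerjoin` would drop the unmatched row instead of filling it with `missing`.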

In [None]:
# Create a new DataFrame which provides grouped counts
cylSummary = combine(groupby(mtcars, :cylinders), nrow => :count)

In [None]:
# Create a new DataFrame which provides grouped averages
mpgSummary = combine(groupby(mtcars, :cylinders), :MPG => mean)
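
In DataFrames.jl 1.0 and later, grouped summaries are written with `groupby` and `combine`, and several summaries can be computed in one call. The sketch below assumes a hypothetical toy table, `cars`, with made-up values; the output column names `count` and `meanMPG` are chosen here for illustration.

```julia
using DataFrames, Statistics

# Hypothetical toy table
cars = DataFrame(cylinders = [4, 4, 6, 8, 8],
                 MPG = [30.0, 26.0, 21.0, 15.0, 17.0])

# One row per cylinder count, with a group size and a group mean
cylStats = combine(groupby(cars, :cylinders),
    nrow => :count,
    :MPG => mean => :meanMPG)
```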