# Introduction to Julia DataFrames

This notebook introduces Julia DataFrames using the Titanic dataset. We will cover the following topics:

* Loading data into a DataFrame
* Basic data manipulation and analysis using DataFrames
* Exercises for practicing data manipulation and analysis using the Titanic dataset in a DataFrame

We assume that you have a basic understanding of Julia programming. If you are new to Julia, we recommend working through the [official Julia documentation](https://docs.julialang.org/en/v1/) or an introductory Julia tutorial or course before proceeding with this notebook.

To get started, let's load the necessary packages:


In [76]:
# Load necessary packages
using DataFrames
using CSV
using Statistics

## Loading the Titanic dataset

The Titanic dataset contains information about the passengers on the Titanic, including their age, sex, passenger class, and survival status. We will load this dataset into a DataFrame using the `CSV.jl` package.

To load the Titanic dataset, we first need a local copy of the data as `titanic.csv`. You can download it from the [Kaggle website](https://www.kaggle.com/c/titanic/data), where the training file is named `train.csv`, and save it as `titanic.csv` in the same directory as this notebook (or adjust the path in the code below). Once the file is in place, the following code loads it into a DataFrame:

In [77]:
# Load the titanic dataset into a DataFrame using CSV.jl
titanic = DataFrame(CSV.File("titanic.csv"))
first(titanic, 5)
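
To see what `CSV.File` does without the downloaded file, here is a minimal sketch that parses a small in-memory CSV string. The three rows are made up for illustration (only the column names follow the Titanic data), and the names `csv_text` and `toy` are hypothetical. Note that the empty `Age` field is read as `missing`:

```julia
using CSV, DataFrames

# Made-up sample in the same shape as the Titanic data (not real records)
csv_text = """
Survived,Pclass,Age
1,1,30
0,3,
1,2,4
"""

# CSV.File accepts an IO source, so an IOBuffer stands in for a file on disk
toy = DataFrame(CSV.File(IOBuffer(csv_text)))

println(size(toy))              # (rows, columns)
println(ismissing(toy.Age[2]))  # the empty Age field was parsed as missing
```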

## Basic data manipulation and analysis using DataFrames

Now that we have loaded the Titanic dataset into a DataFrame, we can start exploring and analyzing the data. We will use basic DataFrames functions such as `first`, `describe`, `filter`, `groupby`, and `combine` to do so.

Let's start by determining the number of rows and columns in the titanic DataFrame:

In [None]:
# Determine the number of rows and columns in the titanic DataFrame
nrows, ncols = size(titanic)
println("Number of rows: ", nrows)
println("Number of columns: ", ncols)
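
As an alternative to destructuring `size`, DataFrames.jl also provides `nrow` and `ncol`, and `names` lists the column names. A small sketch on a made-up table (the name `toy` and its values are hypothetical):

```julia
using DataFrames

# Made-up toy table for illustration
toy = DataFrame(Survived = [1, 0, 1], Pclass = [1, 3, 2])

println(nrow(toy))    # number of rows
println(ncol(toy))    # number of columns
println(names(toy))   # column names as strings
```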

Next, let's print a summary of the titanic DataFrame using the `describe` function:

In [None]:
# Print a summary of the titanic DataFrame using the describe function
describe(titanic)
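
`describe` can also be restricted to specific statistics by passing their names as symbols. The sketch below, on a made-up table (`toy` and its values are hypothetical), requests only the mean and the count of missing values for each column:

```julia
using DataFrames

# Made-up toy table with one missing age
toy = DataFrame(Pclass = [1, 3, 2, 3], Age = [30.0, missing, 4.0, 22.0])

# Request only the mean and the number of missing values per column
summary_table = describe(toy, :mean, :nmissing)
println(summary_table)
```

The result is itself a DataFrame with one row per column of `toy`, so it can be filtered or sorted like any other table.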

Now let's filter the titanic DataFrame to include only survivors and print the first 5 rows:

In [None]:
# Filter the titanic DataFrame to include only survivors and print the first 5 rows
survivors = filter(row -> row.Survived == 1, titanic)
first(survivors, 5)
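
Two other ways to express a row filter are worth knowing. The sketch below uses a made-up table (`toy` is hypothetical) to show the `column => predicate` form of `filter`, and `subset` with `skipmissing = true`, which is convenient when the condition column contains missing values (a plain `filter` predicate that returns `missing` raises an error):

```julia
using DataFrames

# Made-up toy table with one missing age
toy = DataFrame(Survived = [1, 0, 1, 0], Age = [30, missing, 4, 22])

# Column => predicate form: the predicate receives one value per row
survivors = filter(:Survived => ==(1), toy)

# subset with skipmissing=true drops rows where the condition is missing,
# so the missing Age does not raise an error
adults = subset(toy, :Age => ByRow(>=(18)); skipmissing = true)

println(survivors)
println(adults)
```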

We can also calculate the proportion of survivors in the titanic DataFrame. Because `Survived` is coded as 0 or 1, its mean is the fraction of passengers who survived:

In [None]:
# Calculate the proportion of survivors in the titanic DataFrame
prop_survivors = mean(titanic.Survived)
println("Proportion of survivors: ", prop_survivors)
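
If a percentage is preferred over a proportion, the mean can be scaled by 100. A minimal sketch on made-up 0/1 flags (the vector `survived` is hypothetical, not taken from the dataset):

```julia
using Statistics

survived = [1, 0, 1, 0, 0]   # made-up 0/1 flags

prop = mean(survived)                 # fraction of ones
pct = round(100 * prop; digits = 1)   # same quantity as a percentage
println("Survivors: ", pct, "%")
```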

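We can group the titanic DataFrame by passenger class and calculate the average age in each class. Some ages are missing, so we wrap the column in `skipmissing`; otherwise `mean` would return `missing` for any class that contains a missing age: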

In [None]:
# Group the titanic DataFrame by passenger class and calculate the average age in each class, ignoring missing values
class_ages = combine(groupby(titanic, :Pclass), :Age => (x -> mean(skipmissing(x))) => :AvgAge)
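
The effect of `skipmissing` in this split-apply-combine pattern is easiest to see on a tiny example. The table below is made up (`toy`, `avg`, and `raw` are hypothetical names); it compares the aggregation with and without `skipmissing`:

```julia
using DataFrames, Statistics

# Made-up toy table: class 3 has one missing age
toy = DataFrame(Pclass = [1, 1, 3, 3, 3],
                Age = [40.0, 20.0, missing, 10.0, 30.0])

grouped = groupby(toy, :Pclass)

# Mean age per class, ignoring missing values
avg = combine(grouped, :Age => (x -> mean(skipmissing(x))) => :AvgAge)

# Without skipmissing, a group containing a missing age yields missing
raw = combine(grouped, :Age => mean => :AvgAge)

println(avg)
println(raw)
```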

Let's add a new column to the titanic DataFrame that indicates whether a passenger is a child or an adult, based on their age:

In [None]:
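# Classify a passenger as a child (under 18); a missing age stays missing
is_child(age) = ismissing(age) ? missing : age < 18
# Broadcast over the Age column and store the result as a new :Child column
titanic[!, :Child] = is_child.(titanic.Age)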
first(titanic, 5)
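
The same derived column can also be written with `transform!` and `ByRow`, which keeps the source column, the function, and the new column name in one expression. A sketch on a made-up table (`toy` is hypothetical):

```julia
using DataFrames

# Made-up toy table with one missing age
toy = DataFrame(Age = [4, 35, missing])

# ByRow applies the function to each Age value; missing ages stay missing
transform!(toy, :Age => ByRow(a -> ismissing(a) ? missing : a < 18) => :Child)

println(toy)
```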

We can also group the titanic DataFrame by age category and calculate the proportion of survivors in each category:

In [None]:
# Group the titanic DataFrame by age category and calculate the proportion of survivors in each category
age_cat_prop_survivors = combine(groupby(titanic, :Child), 
    :Survived => mean => :PropSurvivors)

Similarly, we can calculate the proportion of survivors in each passenger class:

In [None]:
# Group by passenger class and calculate the proportion of survivors in each group
class_cat_prop_survivors = combine(groupby(titanic, :Pclass), :Survived => mean => :PropSurvivors)
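
Grouping is not limited to a single column. As a sketch, the made-up table below (`toy` and its values are hypothetical) is grouped by class and age category at once, and `nrow` is used to report the size of each group alongside the survival proportion:

```julia
using DataFrames, Statistics

# Made-up toy table for illustration
toy = DataFrame(
    Pclass   = [1, 1, 3, 3, 3],
    Child    = [false, true, false, false, true],
    Survived = [1, 1, 0, 1, 0],
)

# One output row per (Pclass, Child) combination present in the data
summary_table = combine(groupby(toy, [:Pclass, :Child]),
    nrow => :Count,
    :Survived => mean => :PropSurvivors)

println(summary_table)
```

Reporting the group size next to each proportion helps when some groups are very small.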

Finally, let's plot the proportion of survivors by passenger class using the `Plots` package:

In [None]:
# Plot the proportion of survivors by passenger class using the Plots package
using Plots
gr()
bar(class_cat_prop_survivors.Pclass, class_cat_prop_survivors.PropSurvivors,
    xlabel="Class", ylabel="Proportion of Survivors", legend=false,
    xticks=(1:3, ["1st", "2nd", "3rd"]))
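
A similar `bar` call can plot a summary grouped by the `Child` column, once the `true`/`false`/`missing` keys are turned into text labels. The sketch below uses a hypothetical table `age_summary` with placeholder values (not computed from the Titanic data) and a hypothetical helper `age_label`:

```julia
using DataFrames, Plots

# Placeholder values for illustration only (not computed from the Titanic data)
age_summary = DataFrame(Child = [false, true, missing],
                        PropSurvivors = [0.25, 0.5, 0.75])

# Turn true/false/missing into readable bar labels
age_label(c) = ismissing(c) ? "Unknown" : (c ? "Child" : "Adult")
labels = age_label.(age_summary.Child)

bar(labels, age_summary.PropSurvivors,
    xlabel = "Age category", ylabel = "Proportion of Survivors", legend = false)
```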