# CA Data Science Takehome Problem

### For this problem, you are provided a data set recording various details about US domestic flights. Please explore the data however you prefer, and try to identify anything interesting, such as correlations, patterns, or strange outliers. When you are done, prepare a write-up or annotate your notebook to show what you discovered. Be prepared to present your findings to the team if invited to continue to an on-site interview.

### This is an open-ended problem, which we expect will take approximately three hours to complete. This starter notebook has loaded the airline data into a data frame for you to use, but feel free to use any additional libraries or outside data that you would like. The accompanying "Column_Descriptions.csv" file explains what each of the columns in the data frame means.

In [None]:
library(httr)

## Loading the CSVs

In [None]:
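# One on-time performance file per year, 2015 through 2017
csvs <- c("701878033_T_ONTIME_2015_8.csv", "701878033_T_ONTIME_2016_8.csv", "701878033_T_ONTIME_2017_8.csv")
base_url <- "http://ca-data-science-interview.s3.amazonaws.com/"
for (csv in csvs) {
    print(paste("Starting download of", csv))
    response <- GET(paste0(base_url, csv))
    # Fail loudly on an HTTP error instead of saving an error page as a CSV
    stop_for_status(response)
    writeBin(content(response, "raw"), csv)
    print(paste("Finished download of", csv))
}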

In [None]:
files <- list.files(path = "./", pattern = "8\\.csv$")
# Read each matching file and stack the results into a single data frame
df <- do.call(rbind, lapply(files, read.csv))

# Every line in the CSVs ends with an extra comma, which read.csv interprets
# as an unnamed column called "X". Drop it.
df <- df[, names(df) != "X"]
head(df, 5)
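
The trailing-comma behavior noted in the comment above can be reproduced on a small scale. The sketch below uses made-up toy CSV text (`toy_csv` and its two rows are illustrative, not taken from the flight data) to show how an empty trailing field becomes a column named `X`, and how the same filter removes it.

```r
# Toy CSV text (made up for illustration) in which every line ends with a comma
toy_csv <- "ORIGIN,DEST,\nSFO,JFK,\nLAX,ORD,\n"

# read.csv gives the empty trailing header field the syntactic name "X"
toy <- read.csv(text = toy_csv)
names_before <- names(toy)

# Same filter as in the cell above; drop = FALSE keeps a data frame
# even if only one column were left
toy <- toy[, names(toy) != "X", drop = FALSE]
names_after <- names(toy)
```

Comparing `names_before` with `names_after` shows the extra column disappearing while the real columns are left untouched.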

## Some basic views of the data

In [None]:
# Number of flights by day of the week

counts <- table(df$DAY_OF_WEEK)
barplot(
    counts,
    main = "Number of flights by day of the week",
    xlab = "Day of the week",
    ylab = "Number of flights"
)
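
The bars above are labeled with the raw `DAY_OF_WEEK` codes. A minimal sketch of attaching day names follows, assuming the codes run from 1 = Monday to 7 = Sunday; that mapping is an assumption here and should be checked against "Column_Descriptions.csv". The vector `toy_days` is made-up data standing in for `df$DAY_OF_WEEK`.

```r
# Assumed mapping: 1 = Monday ... 7 = Sunday (verify in Column_Descriptions.csv)
day_labels <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

# Made-up codes standing in for df$DAY_OF_WEEK
toy_days <- c(1, 1, 3, 7, 7, 7)

# Fixing levels to 1:7 keeps days with zero flights in the table
toy_counts <- table(factor(toy_days, levels = 1:7, labels = day_labels))
```

Passing a table built this way to `barplot()` labels each bar with the day name and still shows days that have no flights.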

In [None]:
# Total number of flights by airport

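dest_counts <- as.data.frame(table(df$DEST))
orig_counts <- as.data.frame(table(df$ORIGIN))
# all = TRUE keeps airports that appear only as an origin or only as a destination
total_counts <- merge(
    x = dest_counts,
    y = orig_counts,
    by = "Var1",
    all = TRUE
)
names(total_counts) <- c("airport", "dest_count", "orig_count")
# An airport missing from one side has no flights in that direction
total_counts$dest_count[is.na(total_counts$dest_count)] <- 0
total_counts$orig_count[is.na(total_counts$orig_count)] <- 0
total_counts$total_count <- total_counts$dest_count + total_counts$orig_count
total_counts <- total_counts[order(total_counts$total_count, decreasing = TRUE), ]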

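# Top 10 airports ranked by total (arriving plus departing) flights
barplot(
    height = total_counts$dest_count[1:10],
    names.arg = total_counts$airport[1:10],
    main = "Arriving flights, top 10 airports by total flights"
)
barplot(
    height = total_counts$orig_count[1:10],
    names.arg = total_counts$airport[1:10],
    main = "Departing flights, top 10 airports by total flights"
)
barplot(
    height = total_counts$total_count[1:10],
    names.arg = total_counts$airport[1:10],
    main = "Total flights, top 10 airports"
)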