# Joining data

Data is often spread across several tables that share one or more columns, rather than stored in a single table.

In these situations you can use join functions such as `inner_join`, `left_join` and `right_join` to combine the tables based on matching values in a shared column.
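As a minimal sketch of how these functions differ, the example below joins two small made-up tables. The tables `left_tbl` and `right_tbl` and their columns are hypothetical and only serve as an illustration.

```r
library(dplyr)

# Two hypothetical tables sharing the key column `id`
left_tbl  <- tibble(id = c(1, 2, 3), x = c("a", "b", "c"))
right_tbl <- tibble(id = c(2, 3, 4), y = c(TRUE, FALSE, TRUE))

# inner_join keeps only ids present in both tables (2 and 3)
inner <- inner_join(left_tbl, right_tbl, by = "id")

# left_join keeps every row of left_tbl; y is NA where no match exists (id 1)
left <- left_join(left_tbl, right_tbl, by = "id")

# right_join keeps every row of right_tbl; x is NA where no match exists (id 4)
right <- right_join(left_tbl, right_tbl, by = "id")

print(inner)
print(left)
print(right)
```

Which join to pick depends on whether rows without a match should be dropped or kept with missing values.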


Let's say that we want to check the effect of weather on crime in Vancouver. From Kaggle we get a data set of hourly weather data for cities in the US and Canada and a data set of crimes in Vancouver.[[1]](https://www.kaggle.com/selfishgene/historical-hourly-weather-data) [[2]](https://www.kaggle.com/wosaku/crime-in-vancouver)

The weather data is organized in individual .csv files, with the data for each city stored in a column named after the city.

In [None]:
library(tidyverse)

temperature_vancouver <- read_csv('historical-hourly-weather-data/temperature.csv') %>%
    # Remove hours from datetime
    mutate(date=format(datetime,'%Y-%m-%d')) %>%    
    # Convert date strings into Date objects
    mutate(date=as.Date(date)) %>%
    # Take only Vancouver column and date
    select(date,temperature=Vancouver) %>%
    # Remove missing values (NA) from temperature
    filter(!is.na(temperature)) %>%
    # Convert temperature from K to °C (0 °C = 273.15 K)
    mutate(temperature=temperature - 273.15) %>%
    # Group by date
    group_by(date) %>%
    # Get the mean temperature of each day
    summarize(temperature=mean(temperature))

print(temperature_vancouver)

The crime statistics are stored in a .csv file with the date given in individual columns (`YEAR`, `MONTH`, `DAY`). To build a single date from these columns we use the `make_date` function from the `lubridate` package [[make_date]](https://www.rdocumentation.org/packages/lubridate/versions/1.7.3/topics/make_datetime).
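A minimal sketch of what `make_date` does, using made-up year, month and day values (the vectors below are hypothetical, not taken from the crime data):

```r
library(lubridate)

# Hypothetical year/month/day values, laid out like separate columns
year  <- c(2017, 2017)
month <- c(3, 12)
day   <- c(5, 31)

# make_date is vectorised and returns Date objects
dates <- make_date(year, month, day)
print(dates)
```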

After this, `tally` is used to count the number of observations per date [[tally]](https://dplyr.tidyverse.org/reference/tally.html).
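To illustrate the counting step on its own, here is a small sketch with a hypothetical `incidents` table (one row per crime). It also shows `count`, which should give the same result as `group_by` followed by `tally`.

```r
library(dplyr)

# Hypothetical incidents: one row per crime, with the date it occurred
incidents <- tibble(
    date = as.Date(c("2017-01-01", "2017-01-01", "2017-01-02")),
    type = c("Theft", "Mischief", "Theft")
)

# group_by + tally yields one row per date with the count in column `n`
per_day <- incidents %>%
    group_by(date) %>%
    tally()

# count() is a shorthand for the same group_by + tally combination
per_day_short <- incidents %>% count(date)

print(per_day)
```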

In [None]:
library(lubridate)

crimes_vancouver <- read_csv('vancouver-crime/crime.csv') %>%
    # Convert individual columns into a date
    mutate(date=make_date(YEAR,MONTH,DAY)) %>%
    # Arrange and group based on date
    arrange(date) %>%
    group_by(date) %>%
    # Calculate number of crimes / date
    tally()

print(crimes_vancouver)

Now we want to join the datasets on the dates that are present in both. `inner_join` does this by keeping only the rows whose `date` value appears in both tables [[inner_join]](https://dplyr.tidyverse.org/reference/join.html).
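Before joining the real data, it can help to see the effect on a small example. The sketch below uses two hypothetical daily tables, `crimes` and `temps`, with partially overlapping dates, and uses `anti_join` to list the rows that the inner join would drop.

```r
library(dplyr)

# Hypothetical daily tables with partially overlapping dates
crimes <- tibble(
    date = as.Date(c("2017-01-01", "2017-01-02", "2017-01-03")),
    n = c(120L, 95L, 110L)
)
temps <- tibble(
    date = as.Date(c("2017-01-02", "2017-01-03", "2017-01-04")),
    temperature = c(1.5, 3.0, 2.2)
)

# Only the two dates present in both tables survive the inner join
joined <- inner_join(crimes, temps, by = "date")

# anti_join lists the rows of the first table that have no match in the second
unmatched <- anti_join(crimes, temps, by = "date")

print(joined)
print(unmatched)
```

Checking the unmatched rows this way is one option for spotting how much data a join discards.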

After joining the data we can visualize the results as a scatter plot.

In [None]:
crime_temperature <- inner_join(crimes_vancouver,temperature_vancouver,by=c('date'))

crime_temperature %>%
    ggplot(aes(x=temperature,y=n)) +
    geom_point() +
    labs(x='Mean daily temperature in Vancouver (°C)', y='Number of crimes per day in Vancouver')

Apparently, crime happens in Vancouver at all temperatures.