# Data Wrangling Challenge
### Pull and manipulate the API data

The point of this exercise is to practice data enrichment using external APIs. We will take data about car crashes in Monroe County, Indiana, from 2003 to 2015 and try to determine the weather at the time of each crash and how many bars there are in the area. We will work with two different APIs during this challenge:

- Foursquare API
- World Weather Online API

We will try to find correlations between the severity of a crash and both the weather and the number of bars in the area. To indicate the severity of a crash, we will use the `Injury Type` column.

In [1]:
import pandas as pd
import requests as re
import matplotlib.pyplot as plt

## Data

The data for this exercise can be found [here](https://drive.google.com/file/d/1_KF9oIJV8cB8i3ngA4JPOLWIE_ETE6CJ/view?usp=sharing).

Run the cells below to get your data ready; this part is provided for you.


In [2]:
# the correct encoding has to be specified explicitly for this file
crash_data = pd.read_csv('data/monroe_county_crash_data.csv', encoding='ISO-8859-1')
crash_data[["Latitude", "Longitude"]].head(5)

Unnamed: 0,Latitude,Longitude
0,39.159207,-86.525874
1,39.16144,-86.534848
2,39.14978,-86.56889
3,39.165655,-86.575956
4,39.164848,-86.579625


In [44]:
crash_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10399 entries, 0 to 10398
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Master Record Number  10399 non-null  int64  
 1   Year                  10399 non-null  int64  
 2   Month                 10399 non-null  int64  
 3   Day                   10399 non-null  int64  
 4   Weekend?              10399 non-null  object 
 5   Hour                  10399 non-null  int64  
 6   Collision Type        10399 non-null  object 
 7   Injury Type           10399 non-null  object 
 8   Primary Factor        10162 non-null  object 
 9   Reported_Location     10396 non-null  object 
 10  Latitude              10369 non-null  float64
 11  Longitude             10369 non-null  float64
dtypes: float64(2), int64(5), object(5)
memory usage: 975.0+ KB
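
The `info()` output above lists 10,369 non-null coordinates out of 10,399 rows, so 30 crashes have no location. A minimal sketch, using a toy DataFrame in place of `crash_data`, of dropping such rows before any location-based API call. The names `toy_crashes` and `located` and all values are illustrative and do not come from the real file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for crash_data; the values are illustrative only.
toy_crashes = pd.DataFrame({
    "Injury Type": ["No injury/unknown", "Incapacitating", "Non-incapacitating"],
    "Latitude": [39.159207, np.nan, 39.14978],
    "Longitude": [-86.525874, -86.534848, np.nan],
})

# Keep only the rows where both coordinates are present.
located = toy_crashes.dropna(subset=["Latitude", "Longitude"])

print(len(toy_crashes), "rows before,", len(located), "rows after")
```

Filtering first avoids sending requests that cannot succeed. The original index is preserved, so results can still be joined back to the full table.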


In [43]:
crash_data.head(10)

Unnamed: 0,Master Record Number,Year,Month,Day,Weekend?,Hour,Collision Type,Injury Type,Primary Factor,Reported_Location,Latitude,Longitude
0,902363382,2015,1,5,Weekday,0,2-Car,No injury/unknown,OTHER (DRIVER) - EXPLAIN IN NARRATIVE,1ST & FESS,39.159207,-86.525874
1,902364268,2015,1,6,Weekday,1500,2-Car,No injury/unknown,FOLLOWING TOO CLOSELY,2ND & COLLEGE,39.16144,-86.534848
2,902364412,2015,1,6,Weekend,2300,2-Car,Non-incapacitating,DISREGARD SIGNAL/REG SIGN,BASSWOOD & BLOOMFIELD,39.14978,-86.56889
3,902364551,2015,1,7,Weekend,900,2-Car,Non-incapacitating,FAILURE TO YIELD RIGHT OF WAY,GATES & JACOBS,39.165655,-86.575956
4,902364615,2015,1,7,Weekend,1100,2-Car,No injury/unknown,FAILURE TO YIELD RIGHT OF WAY,W 3RD,39.164848,-86.579625
5,902364664,2015,1,6,Weekday,1800,2-Car,No injury/unknown,FAILURE TO YIELD RIGHT OF WAY,BURKS & WALNUT,39.12667,-86.53137
6,902364682,2015,1,6,Weekday,1200,2-Car,No injury/unknown,DRIVER DISTRACTED - EXPLAIN IN NARRATIVE,SOUTH CURRY PIKE LOT 71,39.150825,-86.584899
7,902364683,2015,1,6,Weekday,1400,1-Car,Incapacitating,ENGINE FAILURE OR DEFECTIVE,NORTH LOUDEN RD,39.199272,-86.637024
8,902364714,2015,1,7,Weekend,1400,2-Car,No injury/unknown,FOLLOWING TOO CLOSELY,LIBERTY & W 3RD,39.16461,-86.57913
9,902364756,2015,1,7,Weekend,1600,1-Car,No injury/unknown,RAN OFF ROAD RIGHT,PATTERSON & W 3RD,39.16344,-86.55128


# Foursquare API

Foursquare API documentation is [here](https://developer.foursquare.com/)

1. Create a Foursquare application and get your API keys.
2. Create the function **get_venues**, which pulls the bars within a 5 km radius of a crash, and call it for each crash.

#### example
`get_venues('48.146394, 17.107969')`

3. Find a relationship (if there is any) between number of bars in the area and severity of the crash.

Hints:
- check out the Python package "foursquare" (no need to send HTTP requests directly with the `requests` library)
- **categoryId** for bars and nightlife needs to be found in the [foursquare API documentation](https://developer.foursquare.com/docs/api-reference/venues/search/)

### Function `get_venues`

In [3]:
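import os

# Read the API key from the environment rather than hard-coding it in the notebook.
# The variable name FSQ_API_KEY is our own choice, not something Foursquare requires.
FSQ_API_KEY = os.environ["FSQ_API_KEY"]

def get_venues(query, category, longitude, latitude, radius):
    """Return the response for a venue search around a reference point.

    NOTE: the coordinate parameter names are swapped relative to their use.
    Callers pass the latitude as `longitude` and the longitude as `latitude`,
    so `ll` ends up in the "latitude,longitude" order that the API expects.
    `category` is a list of category IDs; `radius` is in metres.
    """
    url = "https://api.foursquare.com/v3/places/search"

    # round the coordinates to two decimal places
    longitude = f"{longitude:.2f}"
    latitude = f"{latitude:.2f}"

    params = {
        "query": query,
        # the v3 search filter is `categories`: one comma-separated string of IDs
        "categories": ",".join(str(c) for c in category),
        "ll": longitude + "," + latitude,
        "sort": "DISTANCE",
        "radius": str(radius),
    }

    headers = {
        "Accept": "application/json",
        "Authorization": FSQ_API_KEY,
    }

    return re.get(url, params=params, headers=headers)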
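
It can help to build the query parameters separately from the request, so they can be inspected without a network call. A minimal sketch follows. `build_search_params` is a hypothetical helper, and the parameter names (`ll` in "latitude,longitude" order, `categories` as one comma-separated string) reflect our reading of the v3 search documentation, so check them against the current docs.

```python
def build_search_params(lat, lng, radius_m, category_ids, query="bar"):
    """Build query parameters for a place search (hypothetical helper)."""
    return {
        "query": query,
        # reference point as "latitude,longitude"
        "ll": f"{lat:.6f},{lng:.6f}",
        "radius": str(radius_m),
        # several category IDs joined into one comma-separated string
        "categories": ",".join(str(c) for c in category_ids),
        "sort": "DISTANCE",
    }


params = build_search_params(39.159207, -86.525874, 5000, range(13003, 13006))
print(params["ll"])
print(params["categories"])
```

Keeping the coordinates as named `lat` and `lng` arguments makes it harder to swap them by accident.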

### Example extraction for location `48.146394, 17.107969`

In [4]:
# example output in json format
categories = list(range(13003, 13026))
bar_data_example = get_venues("bar", categories, 48.146394, 17.107969, 5000).json()
bar_data_example['results']

[{'fsq_id': '5baf3efbc66666002c063861',
  'categories': [{'id': 13012,
    'name': 'Hookah Bar',
    'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/nightlife/hookahbar_',
     'suffix': '.png'}}],
  'chains': [],
  'distance': 78,
  'geocodes': {'main': {'latitude': 48.149291, 'longitude': 17.110026}},
  'link': '/v3/places/5baf3efbc66666002c063861',
  'location': {'address': 'Námestie 1. mája 4',
   'country': 'SK',
   'cross_street': '',
   'formatted_address': 'Námestie 1. mája 4, 811 06 Bratislava',
   'locality': 'Bratislava',
   'postcode': '811 06',
   'region': 'Bratislava Region'},
  'name': 'Vice City Shisha Bar & Lounge',
  'related_places': {},
  'timezone': 'Europe/Bratislava'},
 {'fsq_id': '53c2d3e2498eda35d7854d64',
  'categories': [{'id': 13003,
    'name': 'Bar',
    'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/nightlife/pub_',
     'suffix': '.png'}}],
  'chains': [],
  'distance': 85,
  'geocodes': {'main': {'latitude': 48.149235, 'longitude

In [5]:
# example conversion to DataFrame using json_normalize
bar_data_example_df = pd.json_normalize(
    data=bar_data_example["results"],
    record_path="categories",
    meta=["distance", "name", ["geocodes", "main", "latitude"], ["geocodes", "main", "longitude"]],
    record_prefix="cat_"
    ).drop(columns = ["cat_id", "cat_icon.prefix", "cat_icon.suffix"])

# rename columns for readability
bar_data_example_df.rename(columns={"geocodes.main.latitude" : "latitude", "geocodes.main.longitude" : "longitude"}, inplace=True)

In [6]:
df_bars = pd.DataFrame()
df_bars = pd.concat([df_bars, bar_data_example_df], ignore_index=True)
df_bars

Unnamed: 0,cat_name,distance,name,latitude,longitude
0,Hookah Bar,78,Vice City Shisha Bar & Lounge,48.149291,17.110026
1,Bar,85,Smile Bar & Caffe,48.149235,17.109996
2,Beer Bar,125,Mešuge Craft Beer Bar,48.148932,17.110533
3,Gastropub,133,Skupinová Terapia,48.149856,17.111783
4,Lounge,166,EVENT bar & restaurant,48.151471,17.110419
5,Bakery,177,Minute - Fresh Food Bar,48.149603,17.112317
6,Cocktail Bar,238,MYST BAR,48.149131,17.112938
7,Slovak Restaurant,247,1. Slovak pub,48.148398,17.11231
8,Café,247,Bar BaRon,48.147839,17.110789
9,Beer Bar,256,Kollarko,48.149161,17.113223
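
Because `record_path="categories"` produces one row per category, a venue tagged with several categories appears several times in the flattened table. The sketch below shows this on a made-up payload shaped like the `results` list above. The venue names and IDs are invented, and deduplicating on `fsq_id` is a suggestion rather than part of the original notebook.

```python
import pandas as pd

# Made-up payload in the same shape as the 'results' list shown above.
sample_results = [
    {"fsq_id": "a1", "name": "Venue A", "distance": 50,
     "categories": [{"id": 13003, "name": "Bar"}]},
    {"fsq_id": "b2", "name": "Venue B", "distance": 90,
     "categories": [{"id": 13003, "name": "Bar"},
                    {"id": 13012, "name": "Hookah Bar"}]},
]

rows = pd.json_normalize(
    data=sample_results,
    record_path="categories",
    meta=["fsq_id", "name", "distance"],
    record_prefix="cat_",
)

print(len(rows))                 # rows: one per (venue, category) pair
print(rows["fsq_id"].nunique())  # distinct venues
```

Counting distinct `fsq_id` values instead of rows gives the number of venues rather than the number of venue-category pairs.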


### API Query for Bars within range of each crash, indexed for each crash

In [7]:
query = "bar"
categories = list(range(13003, 13026))
radius = 5000
bar_frames = []  # one DataFrame per crash, concatenated once after the loop

for i in range(len(crash_data)):
# for i in range(2, 10): <-- to test output

    try:
        # extract the coordinates of the crash
        # NOTE: the names are swapped here (`longitude` holds the latitude and vice versa),
        # which matches the order in which get_venues builds the "ll" parameter
        longitude, latitude = crash_data[["Latitude", "Longitude"]].iloc[i]

        # search the FSQ API for bars in the vicinity of the crash,
        # then parse the JSON response
        bar_data_json = get_venues(
            query=query,
            category=categories,
            longitude=longitude,
            latitude=latitude,
            radius=radius).json()

        # flatten the response into a DataFrame with one row per venue category,
        # keeping the bar name, distance and location
        bar_data_df = pd.json_normalize(
            data=bar_data_json["results"],
            record_path="categories",
            meta=["distance", "name", ["geocodes", "main", "latitude"], ["geocodes", "main", "longitude"]],
            record_prefix="cat_")

        # rename columns for readability
        bar_data_df = bar_data_df.rename(columns={"geocodes.main.latitude" : "latitude", "geocodes.main.longitude" : "longitude"})

        # add the crash_data index to identify which bars are associated with which crash
        bar_data_df['crash_index'] = i

        bar_frames.append(bar_data_df)

    except KeyError:
        # skip responses that lack the expected keys (e.g. no "results" in an error response)
        continue

# combine the per-crash results into a single DataFrame
df_bars = pd.concat(bar_frames, ignore_index=True)
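
Crashes reported at the same location share coordinates, so querying once per distinct coordinate pair can reduce the number of API calls. A minimal sketch under that assumption follows. `count_bars_stub` is a hypothetical stand-in for the real request and simply returns how many times it has been called, and the toy coordinates are illustrative.

```python
import pandas as pd


def count_bars_stub(lat, lng):
    """Hypothetical stand-in for an API call; returns a fake bar count."""
    count_bars_stub.calls += 1
    return count_bars_stub.calls


count_bars_stub.calls = 0

# Toy crash table in which three crashes share one location.
toy = pd.DataFrame({
    "Latitude": [39.16466, 39.16466, 39.16466, 39.159207],
    "Longitude": [-86.58292, -86.58292, -86.58292, -86.525874],
})

# Query once per distinct coordinate pair...
unique_points = toy[["Latitude", "Longitude"]].drop_duplicates().copy()
unique_points["bar_count"] = [
    count_bars_stub(lat, lng)
    for lat, lng in zip(unique_points["Latitude"], unique_points["Longitude"])
]

# ...then map the counts back onto every crash.
toy_with_counts = toy.merge(unique_points, on=["Latitude", "Longitude"], how="left")

print(count_bars_stub.calls, "simulated calls for", len(toy), "crashes")
```

The left merge keeps every crash row and its order, so the enriched table lines up with the original one.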

In [8]:
# export the DataFrame to a CSV file so the results can be reloaded later
# without calling the FSQ API again
df_bars.to_csv("data/bar_data_3.csv")

In [42]:
df_bars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 141587 entries, 0 to 141586
Data columns (total 9 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   cat_id           141587 non-null  float64
 1   cat_name         141587 non-null  object 
 2   cat_icon.prefix  141587 non-null  object 
 3   cat_icon.suffix  141587 non-null  object 
 4   distance         141587 non-null  object 
 5   name             141587 non-null  object 
 6   latitude         141587 non-null  object 
 7   longitude        141587 non-null  object 
 8   crash_index      141587 non-null  int64  
dtypes: float64(1), int64(1), object(7)
memory usage: 9.7+ MB


In [20]:
# count the rows returned for each crash (one row per venue category)
df_bar_count = df_bars.groupby("crash_index").size().reset_index(name="bar_count")

In [41]:
df_bar_count.sort_values(by='crash_index')

Unnamed: 0,crash_index,bar_count
0,0,18
1,1,18
2,2,18
3,3,19
4,4,18
...,...,...
9692,10394,18
9693,10395,18
9694,10396,18
9695,10397,18


In [49]:
df_crash_to_bars = pd.merge(crash_data, df_bar_count, left_index=True, right_on='crash_index')
df_crash_to_bars

Unnamed: 0,Master Record Number,Year,Month,Day,Weekend?,Hour,Collision Type,Injury Type,Primary Factor,Reported_Location,Latitude,Longitude,crash_index,bar_count
0,902363382,2015,1,5,Weekday,0,2-Car,No injury/unknown,OTHER (DRIVER) - EXPLAIN IN NARRATIVE,1ST & FESS,39.159207,-86.525874,0,18
1,902364268,2015,1,6,Weekday,1500,2-Car,No injury/unknown,FOLLOWING TOO CLOSELY,2ND & COLLEGE,39.161440,-86.534848,1,18
2,902364412,2015,1,6,Weekend,2300,2-Car,Non-incapacitating,DISREGARD SIGNAL/REG SIGN,BASSWOOD & BLOOMFIELD,39.149780,-86.568890,2,18
3,902364551,2015,1,7,Weekend,900,2-Car,Non-incapacitating,FAILURE TO YIELD RIGHT OF WAY,GATES & JACOBS,39.165655,-86.575956,3,19
4,902364615,2015,1,7,Weekend,1100,2-Car,No injury/unknown,FAILURE TO YIELD RIGHT OF WAY,W 3RD,39.164848,-86.579625,4,18
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9692,902030180,2013,6,2,Weekday,1200,2-Car,No injury/unknown,,WEST THIRD ST,39.164660,-86.579293,10394,18
9693,901956938,2013,1,1,Weekend,1400,2-Car,Incapacitating,FAILURE TO YIELD RIGHT OF WAY,3RD ST & CURRY,39.164660,-86.582920,10395,18
9694,902003089,2013,4,5,Weekday,1200,2-Car,No injury/unknown,FOLLOWING TOO CLOSELY,S CURRY & SR 48 RD,39.164660,-86.582920,10396,18
9695,902068276,2013,8,5,Weekday,1700,2-Car,No injury/unknown,FOLLOWING TOO CLOSELY,CURRY & W 3RD ST,39.164660,-86.582920,10397,18


In [53]:
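# number of crashes for each bar_count value
df_crash_to_bars['bar_count'].value_counts()

# most crashes fall where bar_count is high (13 to 18), so the number of crashes appears to
# increase with the number of bars in the vicinity of the crash
# the pattern is not entirely clean and may also be affected by other variables such as weather
# note: this shows crash frequency, not severity; `Injury Type` is not used here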

16    2328
15    1748
18    1529
14     991
13     767
4      363
19     362
12     353
20     339
17     212
3      181
5       90
1       83
7       80
9       60
10      60
6       46
11      36
8       36
2       33
Name: bar_count, dtype: int64
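
The counts above describe how often crashes occur at each `bar_count`, but they do not involve `Injury Type`. One possible way to examine severity is to compare `bar_count` across injury categories and compute a rank correlation. The sketch below uses a toy table whose values are arbitrary and say nothing about the real data. The ordinal mapping `severity_rank` is our assumption and covers only the three categories visible in the sample rows above.

```python
import pandas as pd

# Toy data; the values are arbitrary and only illustrate the mechanics.
toy = pd.DataFrame({
    "Injury Type": ["No injury/unknown", "No injury/unknown",
                    "Non-incapacitating", "Non-incapacitating",
                    "Incapacitating", "Incapacitating"],
    "bar_count": [18, 16, 15, 13, 4, 6],
})

# Assumed ordinal encoding of severity (not defined in the source data).
severity_rank = {"No injury/unknown": 0, "Non-incapacitating": 1, "Incapacitating": 2}
toy["severity"] = toy["Injury Type"].map(severity_rank)

# Average bar count per injury category.
mean_bars = toy.groupby("Injury Type")["bar_count"].mean()
print(mean_bars)

# Spearman rank correlation, computed as the Pearson correlation of the ranks.
rho = toy["severity"].rank().corr(toy["bar_count"].rank())
print(round(rho, 3))
```

A rank correlation suits an ordinal severity scale because it uses only the ordering of the categories, not the distances between them.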

# World Weather Online API

World Weather Online API is [here](https://www.worldweatheronline.com/developer/api/historical-weather-api.aspx)

1. Sign up for a free API key if you haven't done so before (it's free for **30 days**).
2. For each crash, get the weather for the location and date.
3. Find a relationship between the weather and the severity of the crash.

Hints:

* pull the weather only for a smaller sample of crashes (250 or so) due to API limits
* for sending HTTP requests, check out the "requests" library [here](http://docs.python-requests.org/en/master/)
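
A minimal sketch of drawing the smaller sample mentioned in the hints, using a toy DataFrame in place of `crash_data`. The sample size of 250 comes from the hint; the seed value and the names `toy` and `weather_sample` are arbitrary choices.

```python
import pandas as pd

# Toy stand-in for crash_data.
toy = pd.DataFrame({"crash_id": range(1000)})

# A fixed random_state makes the sample reproducible across runs.
weather_sample = toy.sample(n=250, random_state=42)

print(weather_sample.shape)
```

Sampling keeps the original index labels, so the weather results can later be joined back to the full crash table.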


Although this challenge calls for the World Weather Online API, its free trial has ended and the API is no longer accessible to us. Instead, we will use historical weather data from the Dark Sky API (Dark Sky is now owned by Apple), accessed through RapidAPI.

### The `time` parameter in Dark Sky API requests

`time` must be either a UNIX time (that is, seconds since midnight GMT on 1 Jan 1970) or a string formatted as follows: `[YYYY]-[MM]-[DD]T[HH]:[MM]:[SS][timezone]`.

`timezone` should either be omitted (to refer to local time for the location being requested), be `Z` (referring to GMT), or be `+[HH][MM]` or `-[HH][MM]` for an offset from GMT in hours and minutes.

The timezone is only used for determining the time of the request; the response will always be relative to the local time zone.
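
A minimal sketch of building a `time` string in the format described above, with the timezone omitted so that it refers to local time. `to_time_string` is a hypothetical helper. It assumes the `Hour` column stores the time as an `HHMM` integer (0, 900, 1500, ...), as the sample rows suggest. The `Day` values in those rows (5 to 7, next to Weekday/Weekend labels) may encode the day of the week rather than the day of the month, so the day of the month is taken as an explicit argument here.

```python
from datetime import datetime


def to_time_string(year, month, day_of_month, hhmm):
    """Format a local timestamp as YYYY-MM-DDTHH:MM:SS (no timezone suffix).

    `hhmm` is assumed to be an integer such as 0, 900 or 1500.
    """
    hour, minute = divmod(int(hhmm), 100)
    stamp = datetime(year, month, day_of_month, hour, minute)
    return stamp.strftime("%Y-%m-%dT%H:%M:%S")


print(to_time_string(2015, 1, 6, 1500))  # 2015-01-06T15:00:00
```

Building the value through `datetime` also validates it: an impossible date or hour raises a `ValueError` instead of producing a malformed request.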

In [59]:
crash_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10399 entries, 0 to 10398
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Master Record Number  10399 non-null  int64 
 1   Year                  10399 non-null  int64 
 2   Month                 10399 non-null  int64 
 3   Day                   10399 non-null  int64 
 4   Weekend?              10399 non-null  object
 5   Hour                  10399 non-null  int64 
 6   Collision Type        10399 non-null  object
 7   Injury Type           10399 non-null  object
 8   Primary Factor        10162 non-null  object
 9   Reported_Location     10396 non-null  object
 10  Latitude              10369 non-null  string
 11  Longitude             10369 non-null  string
dtypes: int64(5), object(5), string(2)
memory usage: 975.0+ KB


In [60]:
crash_data[['Latitude', 'Longitude']] = crash_data[['Latitude', 'Longitude']].astype('string')

In [70]:
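import os

# Read the RapidAPI key from the environment rather than hard-coding it in the notebook.
# The variable name RAPIDAPI_KEY is our own choice.
RAPIDAPI_KEY = os.environ["RAPIDAPI_KEY"]

# set up function for the GET request
def rapid_api(latitude, longitude, date):
    '''
    Request weather data for a location and time from the Dark Sky API on RapidAPI.

    Latitude and longitude may be strings or numbers.
    `date` is the `time` value described above.
    '''
    # the f-string converts numeric coordinates to text, so no explicit str() call is needed
    url_weather = f'https://dark-sky.p.rapidapi.com/{latitude},{longitude},{date}'
    headers = {
        'X-RapidAPI-Key': RAPIDAPI_KEY,
        'X-RapidAPI-Host': 'dark-sky.p.rapidapi.com'}

    return re.get(url_weather, headers=headers)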

In [76]:
response = rapid_api(48.146394, 17.107969, '2019-02-20')

In [77]:
response.text

'{"message":"You are not subscribed to this API."}'
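
The response above carries an error message instead of weather data, so it is worth checking the payload before using it. A minimal sketch follows. `parse_api_payload` is a hypothetical helper, and treating a body that contains only a `message` key as an error is an assumption based on the response shown here. The `currently` key in the second call is a placeholder.

```python
import json


def parse_api_payload(text):
    """Parse a JSON response body and raise if it only carries an error message."""
    payload = json.loads(text)
    if isinstance(payload, dict) and set(payload) == {"message"}:
        raise RuntimeError(f"API error: {payload['message']}")
    return payload


try:
    parse_api_payload('{"message":"You are not subscribed to this API."}')
except RuntimeError as err:
    print(err)

# A body with other keys is returned unchanged.
print(parse_api_payload('{"currently": {"temperature": 1.5}}'))
```

Checking `response.status_code` (or calling `response.raise_for_status()`) before parsing is a complementary safeguard when the response comes from `requests`.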