<a href="https://colab.research.google.com/github/LaurenMHarris/markdown-portfolio/blob/main/byo_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Women Who Code :: Build-Your-Own Dataset

Sometimes data scientists are handed a fully prepared and cleaned dataset, but this is rarely the case. Today's workshop will give you practice building your own dataset from scratch. We will use public APIs and publicly available data files to create a dataset of weather and population data that is ready for downstream uses.

In this workshop, we'll be collecting and organizing information for fictional visitors to a fictional website.


In [None]:
# clone the git repo
!git clone https://github.com/kaylarobinson077/shareouts.git

# navigate to the women who code directory
%cd shareouts/womenwhocode/
# install requirements
!pip install -r requirements.in --quiet

Cloning into 'shareouts'...
remote: Enumerating objects: 63, done.
remote: Counting objects: 100% (63/63), done.
remote: Compressing objects: 100% (36/36), done.
remote: Total 63 (delta 24), reused 57 (delta 22), pack-reused 0
Unpacking objects: 100% (63/63), done.
/content/shareouts/womenwhocode

In [None]:
# If you have installed the requirements from the requirements.in file above,
# you will have the packages needed here
from faker import Faker  # This helps with creating fake data, more details below!
import pandas as pd
import requests

# Part 1 :: Calling Public APIs

In this first section we will use several publicly available APIs to collect information about fictional visitors to our website. The only information we directly collect about visitors is their IP address. Beyond that, we'll have to look to outside sources to learn more about our visitors.

## Get your IP address

An IP address is a unique address that identifies a device on the internet or a local network. IP stands for "Internet Protocol," which is the set of rules governing the format of data sent via the internet or local network.

To find out your own IP address, you can make a call to the [ipify](https://www.ipify.org/) API, a simple public IP address API. This API does not require an account or API key.
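
As a rough sketch of what such a call can look like, the snippet below prepares (but does not send) a request to ipify with `requests`, so you can see how a `params` dictionary turns into a query string. It assumes ipify's `format=json` query parameter; the helper name `fetch_public_ip` is made up for illustration and is only defined here, not called.

```python
import requests

IPIFY_URL = "https://api.ipify.org"

# Build the request without sending it, to inspect the final URL.
prepared = requests.Request("GET", IPIFY_URL, params={"format": "json"}).prepare()
print(prepared.url)


def fetch_public_ip(timeout: float = 10.0) -> str:
    """Hypothetical helper: ask ipify for this machine's public IP address."""
    resp = requests.get(IPIFY_URL, params={"format": "json"}, timeout=timeout)
    resp.raise_for_status()  # raise an error on 4xx/5xx responses
    return resp.json()["ip"]
```

Passing a timeout is a design choice rather than a requirement; it keeps a notebook cell from hanging if the service is slow to answer.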

In [None]:
# api endpoint
url = "https://api.ipify.org"

In [None]:
# TODO - call the api, and save the response to a variable called `resp`

Congrats, you just made a call to the first API of this workshop! Let's take a closer look at the response, and see what information we've collected from it.
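
To make those pieces concrete without touching the network, here is a small sketch that uses only the standard library. The status code and response body below are made-up samples standing in for a real response (the address comes from a range reserved for documentation).

```python
import json
from http import HTTPStatus

# Made-up sample values standing in for `resp.status_code` and `resp.text`.
status_code = 200
body_text = '{"ip": "203.0.113.7"}'

# A status code in the 200s means the request succeeded.
print(status_code, HTTPStatus(status_code).phrase)

# The raw body is just a string...
print(type(body_text).__name__)

# ...while parsing it as JSON gives a dictionary we can index by key.
body = json.loads(body_text)
ip_sample = body["ip"]
print(type(body).__name__, ip_sample)
```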

In [None]:
# http response status codes indicate whether the request has been successfully completed
# Here is one place you can learn more about these status codes: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

# TODO - print the response code

In [None]:
# can get response in a string format

# TODO - print the response in a string format

In [None]:
# or more usefully as a json dictionary

# TODO - print the response in json format

In [None]:
# That makes it a dictionary, making it easier to pull out values like the ip address

# TODO - print the datatype after reading the response as json

In [None]:
# let's hold on to ip address, and use it in some next steps

# TODO - create a variable called `ip` that contains the IP address value as a string

## Get location for an IP address

IP addresses can be linked to information about the location where you are connected to the internet.

To find geolocation information given an IP address, we can use the [ip-api](https://ip-api.com/) JSON endpoint. With ipify, the desired response format (JSON) is passed as a query parameter; ip-api instead has a dedicated JSON endpoint, so we'll specify the data format as part of the URL.
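
ip-api can report a failed lookup inside the JSON body itself (a `status` field set to `fail`), so it may help to check that field before reading coordinates. The sketch below works on made-up sample payloads shaped like ip-api responses; `extract_lat_lon` is a hypothetical helper name, and all values are illustrative.

```python
from typing import Tuple


def extract_lat_lon(payload: dict) -> Tuple[float, float]:
    """Return (lat, lon) from an ip-api style payload, or raise if the lookup failed."""
    if payload.get("status") != "success":
        raise ValueError(f"lookup failed: {payload.get('message', 'unknown reason')}")
    return payload["lat"], payload["lon"]


# Made-up sample payloads (not real lookups).
ok_payload = {
    "status": "success",
    "country": "Exampleland",
    "lat": 12.5,
    "lon": -45.25,
    "query": "203.0.113.7",
}
bad_payload = {"status": "fail", "message": "private range", "query": "192.168.1.1"}

print(extract_lat_lon(ok_payload))
try:
    extract_lat_lon(bad_payload)
except ValueError as err:
    print(err)
```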

In [None]:
url = f"http://ip-api.com/json/{ip}"

In [None]:
# TODO - call the api, and save the response to a variable called `resp`

In [None]:
# check that the call succeeded

# TODO - print the status code

In [None]:
# inspect the returned data

# TODO - print the contents of response as json

In [None]:
# pull out the lat/long fields, since we can look up info about this location

# TODO - create variables `lat` and `lon` that store the latitude and
# longitude from the response, respectively.

In [None]:
# to easily repeat this call, put the steps we just took into a function

def get_location_info(ip: str):
    """
    Given an IP address, return a dictionary of location information.
    """
    url = f"http://ip-api.com/json/{ip}"
    
    # TODO - call the api and return its json response

## Get the local weather

Now that we know where a visitor is from, we can collect any information from other sources to learn more about that location. For this workshop, let's suppose that the visitor's current weather is of interest to us.

Given the visitor's latitude and longitude, we can use the [Open Meteo](https://open-meteo.com/en) API to get information about the location's current weather. Like the other APIs we've used in this workshop, Open Meteo is public and does not require an API key.
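
When current weather is requested, the values of interest sit in a nested `current_weather` block of the response. The sketch below picks two fields out of a made-up sample payload shaped like such a response; `current_conditions` is a hypothetical helper name and the numbers are illustrative.

```python
def current_conditions(payload: dict) -> dict:
    """Pick temperature and wind speed out of an Open-Meteo style payload."""
    current = payload["current_weather"]
    return {"temperature": current["temperature"], "windspeed": current["windspeed"]}


# Made-up sample shaped like a forecast response with current weather included.
sample_weather = {
    "latitude": 12.5,
    "longitude": -45.25,
    "current_weather": {
        "temperature": 21.3,
        "windspeed": 9.7,
        "winddirection": 180,
        "weathercode": 3,
        "time": "2022-07-01T12:00",
    },
}

conditions = current_conditions(sample_weather)
print(conditions)
```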



In [None]:
# api endpoint
url = "https://api.open-meteo.com/v1/forecast"

# dictionary of the values we need to pass in our request
params = {"latitude": lat, "longitude": lon, "current_weather": True, "format": "json"}

In [None]:
# TODO - call the api, and save the response to a variable called `resp`

In [None]:
# TODO - print the response status code

In [None]:
# TODO - print the response contents as json

In [None]:
# to reproduce these steps, put them into a function
def get_weather_info(lat: float, long: float):
    """
    Given a latitude and longitude, return the current weather.
    """
    url = "https://api.open-meteo.com/v1/forecast"

    params = {
        "latitude": lat,
        "longitude": long,
        "current_weather": True,
        "format": "json",
    }

    # TODO - call the api and return its json response

# Part 2 :: Prepare Your Dataset

## 2.1 :: Generate fake IP addresses

Since we are working with fake data, we'll have to create some fake IP addresses for our website visitors. To do this, we'll use a package called [Faker](https://faker.readthedocs.io/en/master/) which generates fake data. It can generate all kinds of fake data, ranging from addresses to names to, wouldn't you know it, IP addresses!

In [None]:
# TODO - use faker to generate a fake ip address

## 2.2 :: Generate weather data

For each visitor IP address, we'll want to run our full weather collection process of getting location based on IP, then weather based on location. One way to do this is to define a function that takes in a (fake) IP address, hits the IP-to-location API, then sends this response to the Location-to-weather API.
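
The chaining idea can be sketched independently of the real APIs. In the snippet below the two lookups are passed in as arguments, and stand-in functions returning made-up data are used so the example runs offline; `combine_lookups`, `fake_locate`, and `fake_weather_at` are hypothetical names, not part of the workshop code.

```python
from typing import Callable


def combine_lookups(
    ip: str,
    locate: Callable[[str], dict],
    weather_at: Callable[[float, float], dict],
) -> dict:
    """Chain an IP-to-location lookup into a location-to-weather lookup."""
    location = locate(ip)
    weather = weather_at(location["lat"], location["lon"])
    # Merge ("stack") the two dictionaries; on a key clash the weather value wins.
    return {**location, **weather}


# Stand-in lookups returning made-up data, so the sketch runs without network access.
def fake_locate(ip: str) -> dict:
    return {"query": ip, "country": "Exampleland", "lat": 12.5, "lon": -45.25}


def fake_weather_at(lat: float, lon: float) -> dict:
    return {"latitude": lat, "longitude": lon, "current_weather": {"temperature": 21.3}}


row = combine_lookups("203.0.113.7", fake_locate, fake_weather_at)
print(row)
```

Dictionary unpacking is only one way to stack the results; `location | weather` (Python 3.9+) or `dict.update` would work as well.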

In [None]:
def get_geo_weather_data(ip: str):
    """Pull weather data for given ip address"""
    # TODO - get location info for the ip address using the function we defined

    # TODO - get the current weather at the lat/long using the function we defined

    # TODO - stack the dictionaries

    # TODO - return the resulting stacked dictionaries

Organizing all of the API calls into a single function allows us to write a simple function that:

1. Makes a fake IP address
2. Gets the location and weather data for that IP address
3. Handles the case where we don't get back valid weather data (for example, an API returned an error)

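Notice that we have set a parameter called `max_retries`, which caps how many times the function tries again with a new fake IP address before it gives up and returns `None`.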

In [None]:
def get_fake_geo_weather_data(max_retries=5):
    """Pull weather data for fake ip address, up to the given number of retries"""
    # keep trying again until we either get a valid result, or hit the max number of retries
    retries = 0
    faker = Faker()
    while retries <= max_retries:
        fake_ip = faker.ipv4()
        # we won't always get successful results from each IP
        try:
            return get_geo_weather_data(fake_ip)
        # for now, we can skip any failed attempts
        except Exception:
            retries += 1
    print("Max retries reached!")
    return None

To handle potential API errors, we allowed our function to return a value of `None` in cases where no valid data was returned after the maximum number of retries. To clean up the data and make it easier to analyze, we can drop these failed attempts from our list of weather data responses.
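
Dropping the failed attempts can be as simple as a list comprehension. The sketch below uses a stand-in fetch function that deliberately returns `None` on every third call, in place of the real API-backed function; `collect_rows` and `flaky_fetch` are hypothetical names.

```python
from typing import Callable, List, Optional


def collect_rows(fetch: Callable[[], Optional[dict]], n: int) -> List[dict]:
    """Call `fetch` n times and keep only the successful (non-None) results."""
    results = [fetch() for _ in range(n)]
    return [row for row in results if row is not None]


# Stand-in for the real data function: every third call "fails" and returns None.
calls = {"count": 0}


def flaky_fetch() -> Optional[dict]:
    calls["count"] += 1
    if calls["count"] % 3 == 0:
        return None
    return {"country": "Exampleland", "call": calls["count"]}


rows = collect_rows(flaky_fetch, 6)
print(len(rows))
```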

In [None]:
# TODO - use the above function to create a list of data

Pandas dataframes are a standard across many data science teams, and so we will convert this list of dicts to a DataFrame for downstream analysis and data validation.
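
As an illustration with made-up records (the country names and values below are invented, and real responses will have more fields), a list of dictionaries converts directly into a DataFrame in which each dictionary becomes a row and each key becomes a column:

```python
import pandas as pd

# Made-up records standing in for the collected location and weather dictionaries.
records = [
    {"country": "Exampleland", "lat": 12.5, "lon": -45.25, "temperature": 21.3},
    {"country": "Samplestan", "lat": 48.0, "lon": 11.5, "temperature": 14.0},
    {"country": "Exampleland", "lat": 13.0, "lon": -44.0, "temperature": 22.1},
]

df_sample = pd.DataFrame(records)
print(df_sample.head())

# Unique values are returned in order of first appearance.
print(df_sample["country"].unique())
```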

In [None]:
# list of dicts to pandas dataframe is easy!

# TODO - convert the list to a pandas dataframe

In [None]:
# take a peek to make sure the data looks as we'd expect it to

# TODO - print the first few rows of the dataframe

In [None]:
# pandas made it easy to inspect our data, such as seeing the set of countries we collected data from

# TODO - print a list of the unique countries in the dataframe

And there you have it! At this point, we have used several public APIs to collect location and weather data about imaginary visitors to our company's website. We've organized this data into a Pandas DataFrame format, which will make it easy to combine with additional data, and to use for downstream analysis or modeling applications.

## 2.3 :: Join with Migration Data

We learned a lot about our individual visitors by inspecting their IP addresses and calling other APIs to collect supplemental information based on them.

Often, relevant data already exists in a database or table format. For example, consider the case where our website offers relocation services, such as a moving company or a service that helps individuals find job opportunities in new countries. For a use case like this, it could be valuable to learn about the typical migration rates in and out of the countries in which our visitors reside.

Luckily for us, the United Nations publicly publishes migration rates at the country level, and we can download this data for free. After accessing this data, we can join it to our visitors data table using a key of "Country".

XLSX file available from the UN:

https://population.un.org/wpp/Download/Standard/Migration/

In case the location of this file changes, we've also attached a copy of it to this repo.

In [None]:
# skip the first few rows, which just contain extra header information
df_migration = pd.read_excel(
    "https://population.un.org/wpp/Download/Files/1_Indicators%20(Standard)/EXCEL_FILES/4_Migration/WPP2019_MIGR_F01_NET_MIGRATION_RATE.xlsx",
    skiprows=range(16),
)

In [None]:
# TODO - print the first few rows of this data table

In [None]:
# TODO - print the unique values in the `Type` column

Looking at the data as it's read in, you can make several observations:

- Data is reported at various aggregations, such as country, income, overall (world), etc.
- Metrics are reported for various date ranges. While these are interesting to have, we will likely be most interested in the most recent range (2015-2020).

Because of our specific interests, let's limit rows to just those reporting on country-level values, and limit columns to just the region name and most recent measurement.
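
The same filter-then-select pattern can be sketched on a toy table that borrows the column names from the UN file but uses invented rows and values. The final rename and the new column name `migration_rate` are assumptions made for illustration.

```python
import pandas as pd

name_col = "Region, subregion, country or area *"

# Toy table with made-up rows and values; only the column names mirror the UN file.
df_toy = pd.DataFrame(
    {
        name_col: ["WORLD", "Exampleland", "Samplestan"],
        "Type": ["World", "Country/Area", "Country/Area"],
        "2010-2015": [0.0, 1.0, -2.0],
        "2015-2020": [0.0, 1.5, -2.5],
    }
)

# Keep only country-level rows using a boolean mask.
countries = df_toy[df_toy["Type"] == "Country/Area"]

# Keep only the name column and the most recent period.
subset = countries[[name_col, "2015-2020"]]

# Give the columns shorter names for downstream use.
tidy = subset.rename(columns={name_col: "country", "2015-2020": "migration_rate"})
print(tidy)
```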

In [None]:
# limit to only country-level rows

# TODO - limit the migration dataframe to only entries with type of `Country/Area`

In [None]:
# limit to only relevant columns

# TODO - limit the migration dataframe to only include the columns
# "Region, subregion, country or area *", and "2015-2020"

Right now, the column names aren't very specific to our limited use case, so we can rename the columns in our subset dataframe to be more interpretable in our downstream dataset.

In [None]:
# TODO - rename columns of the migration dataframe to be easier to work with

At this point, we have a cleaned up DataFrame with weather data (at the visitor-level), and a cleaned up DataFrame with migration data (at the country-level). To be able to look at these metrics together, we will join the data together. Because our ultimate goal is to have all data at the website visitor level, we will want to perform a left join of the migration data to the weather data, as the migration data is aggregated at a coarser level.
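
A small sketch of a left join with `DataFrame.merge`, using invented visitor rows and invented migration rates. Every visitor row is kept, and a visitor whose country has no exact match on the right-hand side ends up with a missing migration rate. The column name `migration_rate` is an assumption.

```python
import pandas as pd

# Visitor-level table (left side); values are made up.
visitors = pd.DataFrame(
    {
        "country": ["Exampleland", "United States", "Exampleland"],
        "temperature": [21.3, 18.0, 22.1],
    }
)

# Country-level table (right side); rates are made up.
migration = pd.DataFrame(
    {
        "country": ["Exampleland", "United States of America"],
        "migration_rate": [1.5, 3.0],
    }
)

joined = visitors.merge(migration, how="left", on="country")
print(joined)

# Countries on the left that found no exact match on the right.
unmatched = joined.loc[joined["migration_rate"].isna(), "country"].unique()
print(unmatched)
```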

In [None]:
# TODO - left join migration data onto weather data, using the column 'country' as the key

And, voila!

In [None]:
# TODO - print the first few rows of the resulting dataframe

## 2.4 :: Data Validation and Cleaning

So far things are looking pretty good, but let's dig a little deeper to see how things turned out after the join. One thing to be cautious about is that a left join still returns a row even when no match was found. For example, if a particular visitor's country doesn't have an exact match in the migration dataset, it will remain a row in our dataframe, but its migration rate will be left empty!

In [None]:
# TODO - print the list of countries that have a non-null migration rate


In [None]:
# TODO - print the list of countries that have a null migration rate

Based on the findings of the above cell (listing out the countries where migration rate is empty) we can see a list of countries that don't have an exact string match to the migration data. To help troubleshoot this, we can search the migration data for entries that contain at least a partial string match.
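
One way to sketch that troubleshooting step is with pandas string methods, shown here on a toy migration table with invented rates. `str.contains` finds partial matches, and `Series.replace` with a dictionary swaps exact values so the names line up with the visitor data.

```python
import pandas as pd

# Toy migration table; the rates are made up.
migration = pd.DataFrame(
    {
        "country": ["Exampleland", "United States of America", "Viet Nam"],
        "migration_rate": [1.5, 3.0, -0.5],
    }
)

# regex=False treats the search text as a plain substring.
matches = migration[migration["country"].str.contains("United States", regex=False)]
print(matches)

# Replace exact country names; names not in the dict are left unchanged.
name_fixes = {"United States of America": "United States", "Viet Nam": "Vietnam"}
renamed = migration.assign(country=migration["country"].replace(name_fixes))
print(renamed)
```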

In [None]:
# search for strings containing `United States`
# this shows that the migration data refers to this country as `United States of America`
# we can clean this up prior to the join, and then they should match up

# TODO - print rows from the migration dataset containing the substring `United States`

In [None]:
# to get join to work, let's rename country in the migration dataset
# this dict of replacements came from running a number of IPs through our process
# it may not be exhaustive

# TODO - replace country names in the migration subset to match with the weather dataset

# use this dict to start, though we may have to add to it
to_replace={
    "United States of America": "United States",
    "Syrian Arab Republic": "Syria",
    "Russian Federation": "Russia",
    "Republic of Korea": "South Korea",
    "Venezuela (Bolivarian Republic of)": "Venezuela",
    "Viet Nam": "Vietnam",
    "China, Taiwan Province of China": "Taiwan",
}


In [None]:
# try the join again

# TODO - repeat the merge, but with the updated migration data

In [None]:
# now see if all entries have a match
# empty array means that no entries are missing migration data

# TODO - check for empty values, to see if we caught all missing values

At this point, we've created a dataset containing location, weather, and migration data for visitors to our website. Depending on your use-case, at this point you may decide to add in additional data sources, perform feature engineering, or implement extra cleanup and data validation steps.