### Import libraries

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import numpy as np

### CSV

We have some user-specific data that we want to enrich with new information. In the end, our user data should be
enhanced with the per capita income at the county level, which can be found here:
https://en.wikipedia.org/wiki/List_of_United_States_counties_by_per_capita_income

Let's first import our CSV containing all addresses and inspect our DataFrame in a bit more detail.

In [None]:
# YOUR CODE HERE

<details>
    <summary>🙈 Reveal solution</summary>

<p>
<pre>
full_df = pd.read_csv('../../data/adresses_day2.csv', index_col=0)
</pre>
</p>
</details>

In [None]:
# YOUR CODE HERE

<details>
    <summary>🙈 Reveal solution</summary>

<p>
<pre>
full_df.shape
full_df.head()
full_df.describe()
..
</pre>
</p>
</details>

If we check the share of null values per column, we see that quite a lot of data is missing. The data is also not at the county level.

In [None]:
# YOUR CODE HERE

<details>
    <summary>🙈 Reveal solution</summary>

<p>
<pre>
full_df.isnull().sum()/len(full_df)
</pre>
</p>
</details>

### API

The good news, however, is that the latitude and longitude are present in every row, so we can use an API to look up the county from the lat and lon columns.

Documentation:
https://nominatim.org/release-docs/develop/api/Overview/

Looking at the documentation we spot that we need to use the **endpoint /reverse**

https://nominatim.org/release-docs/develop/api/Reverse/

Let's construct our URL based on this information:

In [None]:
# YOUR CODE HERE

<details>
    <summary>🙈 Reveal solution</summary>

<p>
<pre>
base_url = 'https://nominatim.openstreetmap.org'

endpoint = '/reverse'

full_url = base_url + endpoint

full_url
</pre>
</p>
</details>

There are three params we will use: **lat** and **lon**, which we will get from our DF, and **format**, which will be fixed to json.
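
To make the role of these params concrete, here is a minimal sketch of the URL that ends up being requested. The coordinates are made-up example values, not rows from our DF, and `urlencode` is only used here to show the query string that `requests` would otherwise build for us.

```python
from urllib.parse import urlencode

base_url = 'https://nominatim.openstreetmap.org'
endpoint = '/reverse'

# hypothetical coordinates, chosen only for illustration
params = {'lat': 40.7128, 'lon': -74.006, 'format': 'json'}

# the params dict becomes the query string after the '?'
example_url = base_url + endpoint + '?' + urlencode(params)
print(example_url)
```

In practice we pass the dict to `requests.get` and let it do this encoding.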

Let's create a function that will send a *GET-request* for a given LAT and LON value and return the county

In [None]:
# YOUR CODE HERE

<details>
    <summary>🙈 Reveal solution</summary>

<p>
<pre>
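import numpy as np

def get_county(lon, lat):
    """Return the county name for a coordinate pair, or NaN if none is found."""

    # define query params
    params = {
        'lat': lat,
        'lon': lon,
        'format': 'json'
    }

    # Nominatim's usage policy asks clients to identify themselves;
    # the value below is a placeholder, replace it with your own
    headers = {'User-Agent': 'county-lookup-exercise'}

    # send GET request and parse the JSON body
    response = requests.get(full_url, params=params, headers=headers).json()

    # error responses carry no 'address' key, so fall back to an empty dict
    county = response.get('address', {}).get('county')
    if county is None:
        return np.nan

    # drop the ' County' suffix to match the Wikipedia names
    return county.replace(' County', '')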
</pre>
</p>
</details>
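
The dictionary handling can be tried out without calling the API. The sketch below uses hand-written sample payloads that only mimic the shape of a reverse-geocoding response (they are assumptions, not real API output), and a hypothetical helper `extract_county` that uses `math.nan` in place of `np.nan` to stay dependency-free.

```python
import math

def extract_county(payload):
    """Pull the county out of a response-shaped dict, or return NaN."""
    # a missing 'address' or 'county' key both end up as None here
    county = payload.get('address', {}).get('county')
    if county is None:
        return math.nan
    # strip the suffix only when it is actually at the end of the name
    if county.endswith(' County'):
        county = county[:-len(' County')]
    return county

# illustrative payloads, written by hand
with_county = {'address': {'county': 'Kings County', 'state': 'New York'}}
without_county = {'address': {'state': 'Alaska'}}
error_payload = {'error': 'Unable to geocode'}

print(extract_county(with_county))
print(extract_county(without_county))
print(extract_county(error_payload))
```

Chaining `.get()` calls avoids a `KeyError` when the response has no `address` entry at all, which can happen for coordinates that cannot be geocoded.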

We can now use this function and the **.iterrows()** method to loop over our DF and collect the county for every row.

In [None]:
# YOUR CODE HERE

<details>
    <summary>🙈 Reveal solution</summary>

<p>
<pre>
county_ls = []

for index, row in full_df.iterrows():
    county = get_county(row['LON'], row['LAT'])
    county_ls.append(county)
</pre>
</p>
</details>
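
Because the loop sends one request per row, it can be worth pausing between requests and reusing results for repeated coordinates. The public Nominatim server's usage policy limits clients to roughly one request per second (worth re-checking before running a large job). The following is a minimal sketch under that assumption; `lookup_all` and `fake_lookup` are hypothetical names, and the fake lookup stands in for `get_county` so the example runs offline.

```python
import time

def lookup_all(coords, lookup, delay=1.0):
    """Call lookup(lon, lat) once per distinct coordinate pair, pausing between calls."""
    cache = {}
    results = []
    for lon, lat in coords:
        key = (lon, lat)
        if key not in cache:
            if cache:
                # wait before every real request except the first one
                time.sleep(delay)
            cache[key] = lookup(lon, lat)
        results.append(cache[key])
    return results

calls = []

def fake_lookup(lon, lat):
    # stand-in for get_county; records each call instead of hitting the API
    calls.append((lon, lat))
    return f'county_{len(calls)}'

# made-up coordinates with one duplicate pair
coords = [(-74.0, 40.7), (-118.2, 34.1), (-74.0, 40.7)]
county_ls_demo = lookup_all(coords, fake_lookup, delay=0)
print(county_ls_demo, len(calls))
```

With the real function, the call would look like `lookup_all(zip(full_df['LON'], full_df['LAT']), get_county)`.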

Now we can add a new column with the county data.

In [None]:
# YOUR CODE HERE

<details>
    <summary>🙈 Reveal solution</summary>

<p>
<pre>
full_df['county'] = county_ls
full_df.county.value_counts()
</pre>
</p>
</details>

### Scraping

We can use these counties now to get the Per Capita Income

Let's first get a list of all unique counties in our dataset

In [None]:
# YOUR CODE HERE

<details>
    <summary>🙈 Reveal solution</summary>

<p>
<pre>
counties = full_df.county.unique()
counties
</pre>
</p>
</details>

Let's inspect our webpage and try to identify a unique class we could scrape.

In [None]:
# YOUR CODE HERE

<details>
    <summary>🙈 Reveal solution</summary>

<p>
<pre>
url_to_scrape = 'https://en.wikipedia.org/wiki/List_of_United_States_counties_by_per_capita_income'

content = requests.get(url_to_scrape).content

# name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(content, 'html.parser')
</pre>
</p>
</details>

The table with class **wikitable** is the one we need.

In [None]:
# YOUR CODE HERE

<details>
    <summary>🙈 Reveal solution</summary>

<p>
<pre>
table = soup.find('table', class_='wikitable')
</pre>
</p>
</details>

Inside this table we can use **.find_all()** to get all the **tr** elements. We should get a list of all table rows.

In [None]:
# YOUR CODE HERE

<details>
    <summary>🙈 Reveal solution</summary>

<p>
<pre>
trs = table.find_all('tr')
</pre>
</p>
</details>

Check the first **tr** element and try to figure out how we could get the necessary information out of this.

Watch out for the header row and for rows that raise an AttributeError.

In [None]:
# YOUR CODE HERE

<details>
    <summary>🙈 Reveal solution</summary>

<p>
<pre>
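cap_dict = {}

# skip the header row, then read the county name and income from each row
for row in trs[1:]:
    try:
        county = row.a.text
        per_capita = row.find_all('td')[3].text
        per_capita_stripped = int(per_capita.strip().lstrip('$').replace(',', ''))
        cap_dict[county] = per_capita_stripped
    except (AttributeError, IndexError, ValueError):
        # rows without a link, with too few cells, or with a non-numeric value
        continue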
</pre>
</p>
</details>
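
The string cleaning inside the loop is the part most likely to fail on unexpected cell contents, so it can help to test it in isolation. The sketch below uses a hypothetical helper, `parse_dollars`, and made-up cell texts rather than values taken from the Wikipedia page.

```python
def parse_dollars(text):
    """Turn a cell text such as '$41,234' into an int, or None if that fails."""
    # drop surrounding whitespace first so the '$' is really the first character
    cleaned = text.strip().lstrip('$').replace(',', '')
    try:
        return int(cleaned)
    except ValueError:
        return None

# made-up cell texts, including a trailing newline as table cells often have
samples = ['$41,234\n', ' $9,876 ', 'N/A', '']
parsed = [parse_dollars(s) for s in samples]
print(parsed)
```

Returning `None` instead of raising keeps the decision about what to do with bad cells in the calling loop.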

We can now use the **.map()** method to enrich our dataset. Counties that are not found in the dictionary will get a NaN value.

In [None]:
# YOUR CODE HERE

<details>
    <summary>🙈 Reveal solution</summary>

<p>
<pre>
full_df['per_capita'] = full_df.county.map(cap_dict)
</pre>
</p>
</details>
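
One caveat with mapping on the county name alone: several US states have counties with the same name, so later rows in the scraped table overwrite earlier ones in the dictionary. The sketch below illustrates the effect with made-up income figures and shows one possible workaround, keying on a (county, state) pair. It assumes a state value is available both in the scraped table and in our DF, which we have not verified here.

```python
# made-up rows in the form (county, state, per capita income)
rows = [
    ('Washington', 'Oregon', 30000),
    ('Washington', 'Utah', 20000),
    ('Kings', 'New York', 25000),
]

by_name = {}
by_name_and_state = {}

for county, state, income in rows:
    # same county name in another state silently replaces the earlier value
    by_name[county] = income
    # a tuple key keeps both entries apart
    by_name_and_state[(county, state)] = income

print(len(by_name), len(by_name_and_state))
print(by_name['Washington'])
```

With tuple keys, the lookup column would also have to hold (county, state) pairs before calling `.map()`.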