## Merging two dataframes

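Here we do a simple merge between two dataframes of London borough data: business counts by industry (`ECON_BusinessByBorough_09-18.csv`) and economic indicators (`ECON_EconByBorough_99-15.csv`). A preview of each file is shown in section 1.

In this notebook we shall:
1. Import the two `.csv` files
2. Remove the spaces from the column names of the economy dataframe
3. Merge the dataframes
4. Upload the merged dataframe to Count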

# Hackathon 1: 5 Dec

# Welcome to the London Open Data Hackathon!

## This document should be everything you need for today's session :D

Slack Channel (for prizes): 

Join the hackathon slack channel here: [Slack link](https://join.slack.com/t/counthackathons/shared_invite/enQtNDg5MTI2NzM0NzI0LTIyOTI0ZmUwOTY3M2Q1MmUwYjEzYjRkNzMxNTkzNTM1YTUyNzgxY2I5YzU0ZGI4YTAxYjYxOWNhNzU1NTE1Yzk)

# The Task 👨‍💻:

- Work in groups, or individually, to come up with interesting, opportunistic, or amusing findings using the data we've provided, or data you find on your own
- When you have something you want to share, share the link or image so others can see what you've found
- If you see a visual you like, please show it some love on Slack with a 👍😍👏⭐️ or something similar. This is how we decide who gets the prizes!
- At the end, we'll give anyone a chance to share parts of their analysis, and their findings (last chance to get votes)
- Lastly, we'll announce the winners!

# The Topics 🇬🇧: 
- The next "big" Borough: Find out where you should be investing
- A Clean Commute: A look at transportation and environmental data to find the cleanest way to commute to work
- The London Economy: A deep dive into the London economy past and future. Try to find how your industry is trending
- Have something else you want to explore? Feel free to work on your own, or if you're up for it, start your own team!

# The Rules 👩‍🏫:

- If you use data you've found, please make sure it's allowed to be shared with others
- Use any platform you want (Python, Excel, Fortran, etc.)!
- Please don't up-vote your own post...

# The Data 📈:

The data is housed in a few locations. Each one has all of the data, so pick whichever you prefer: 

1. [https://play.count.co/admin/tables](https://play.count.co/admin/tables)
2. Python starter scripts: [Binder](https://mybinder.org/v2/gh/count/hackathons/master)
3. WeTransfer link: [Download link](https://wetransfer.com/downloads/6f900f0344a79dd64e407379ed6b5f8320181205105007/218ef02f066f5bead4e36a93deba655b20181205105007/2ba39c)

Data sources: 

- [https://data.london.gov.uk/dataset](https://data.london.gov.uk/dataset)
- [https://www.metaweather.com/api/](https://www.metaweather.com/api/)

# The Prizes 🏆:

- People's Choice: Most votes from the Slack Channel
- Host's Choice: The Count team's favourite visual

---

# The Ask ❓:

This is our first meet-up, so we would love to get your feedback (good and bad). If you wouldn't mind, could you fill out a survey here: [https://www.surveymonkey.co.uk/r/VFCRCK7](https://www.surveymonkey.co.uk/r/VFCRCK7)

Join our Slack channel where we regularly post datasets we find and the latest news about Count:
[Slack Link](https://join.slack.com/t/countcommunity/shared_invite/enQtNDk1Mzc1MjcwODUyLTNmODYzNGMzODdmNzUzZjU0MTAwYWQ2OTBjZDc3ODEyZjk2ZWFlNWI3YzVmMjFiNTI2MTYxYjlhNDNjYzljN2U)



### Imports and definitions

The Python API for Count is hosted on PyPI: https://pypi.org/project/count-api/3.0.6/

In [1]:
# ! pip install pandas
# ! pip install count-api

In [1]:

import pandas as pd
import numpy as np
import os
os.environ["COUNT_API_URL"] = "https://play.count.co"
from count_api import CountAPI

# Set this to the local path of the GitHub repository
data_dir = os.path.join('..','data',)
token = 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1c2VybmFtZSI6IkpjUnVRZjFSVWhqbHdSZVlTRk1IZkBjb3VudC5jbyIsImp3dGlkIjoiMlRYRkY4QTVWVGlGc1lpSWJuOGdMIiwiaWF0IjoxNTQ0MDAzMzU5LCJleHAiOjE1NzU1MzkzNTksImF1ZCI6Imh0dHBzOi8vcGxheS5jb3VudC5jbyJ9.UHS_I1d8iG27VxYQjNpMEFLSYnBXoKmrKDf3HIHVILk'

CountAPI: Running with url: https://play.count.co


## 1. Importing the data

In [4]:
# Business counts per borough and industry
business_path = os.path.join(data_dir, "ECON_BusinessByBorough_09-18.csv")
business = pd.read_csv(business_path, engine='python')
# Economic indicators per borough and year
econ_path = os.path.join(data_dir, "ECON_EconByBorough_99-15.csv")
econ = pd.read_csv(econ_path, engine='python')


In [5]:
business.head(5)

Unnamed: 0,Year,Code,Area,Industry,Business Count
0,2018,E09000001,City of London,"Agriculture, forestry & fishing",25
1,2018,E09000002,Barking and Dagenham,"Agriculture, forestry & fishing",10
2,2018,E09000003,Barnet,"Agriculture, forestry & fishing",30
3,2018,E09000004,Bexley,"Agriculture, forestry & fishing",10
4,2018,E09000005,Brent,"Agriculture, forestry & fishing",0


In [6]:
econ.head()

Unnamed: 0,Area name,Year,% of economically active with NVQ4+ - working age_x,% of economically active with NVQ4+ - working age_y,% of economically active with NVQ3 only - working age,% of economically active with Trade Apprenticeships - working age,% of economically active with NVQ2 only - working age,% of economically active with NVQ1 only - working age,% of economically active with other qualifications - working age,% of economically active with no qualifications - working age,...,Fires per thousand population,Ambulance incidents per hundred population,Total Carbon Emissions,Household Waste Recycling Rate,Number of Licensed Vehicles,Traffic Flows,% of adults who cycle at least once per month,Road Casualties (Killed or Seriously Injured),Average Public Transport Accessibility Score,% children living in out-of-work households
0,City of London,2004,80.6,80.6,!,!,!,!,!,!,...,,,,14.0,,198.0,,44.0,,
1,City of London,2004,80.6,80.6,!,!,!,!,!,!,...,,,,14.0,,198.0,,44.0,,
2,City of London,2004,80.6,80.6,!,!,!,!,!,!,...,,,,14.0,,198.0,,44.0,,
3,City of London,2004,80.6,80.6,!,!,!,!,!,!,...,,,,14.0,,198.0,,44.0,,
4,Barking and Dagenham,2004,17.2,17.2,10.8,6.4,12.7,20.8,15.9,16.3,...,,,,14.0,,551.0,,90.0,,
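
The `econ.head()` preview above shows `!` placeholders in several columns and what look like repeated rows. A minimal sketch of one way to handle both at load time, on a hypothetical three-row miniature of the file (the `raw` text, the shortened column name `NVQ3 only` and the `sample` dataframe are illustrative and not part of the notebook; treating `!` as a missing value is an assumption about what the marker means):

```python
import io
import pandas as pd

# Hypothetical miniature of the economy file: "!" stands in for a missing
# value and the first row is repeated, as in the preview above.
raw = io.StringIO(
    "Area name,Year,NVQ3 only\n"
    "City of London,2004,!\n"
    "City of London,2004,!\n"
    "Barking and Dagenham,2004,10.8\n"
)

# na_values turns the "!" placeholder into NaN, so the column parses as float
sample = pd.read_csv(raw, na_values=["!"])

# drop_duplicates removes rows that are identical in every column
sample = sample.drop_duplicates().reset_index(drop=True)

print(sample)
```

Whether the repeated rows in the real file are true duplicates across all columns would need checking before dropping them.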


## 3. Merging the dataframes

Before merging, we remove the spaces from the column names of `econ`, so that `Area name` becomes `Areaname`. In the merge itself, we select the two dataframes (`left` and `right`) and the columns to merge on (`left_on` and `right_on`), and then drop `Areaname`, which duplicates the information in `Area`.

Finally, we save the dataframe to a CSV file, `merged.csv`. (Setting `index=False` just means we don't save the row numbers.)

In [22]:
# econ['AreaName']= econ['Area Name']
# econ.rename(index=str, columns={"AreaName": "Area name"})
econ.columns = econ.columns.str.replace(' ', '')

In [23]:
econ.head()

Unnamed: 0,Areaname,Year,%ofeconomicallyactivewithNVQ4+-workingage_x,%ofeconomicallyactivewithNVQ4+-workingage_y,%ofeconomicallyactivewithNVQ3only-workingage,%ofeconomicallyactivewithTradeApprenticeships-workingage,%ofeconomicallyactivewithNVQ2only-workingage,%ofeconomicallyactivewithNVQ1only-workingage,%ofeconomicallyactivewithotherqualifications-workingage,%ofeconomicallyactivewithnoqualifications-workingage,...,Ambulanceincidentsperhundredpopulation,TotalCarbonEmissions,HouseholdWasteRecyclingRate,NumberofLicensedVehicles,TrafficFlows,%ofadultswhocycleatleastoncepermonth,RoadCasualties(KilledorSeriouslyInjured),AveragePublicTransportAccessibilityScore,%childrenlivinginout-of-workhouseholds,a
0,City of London,2004,80.6,80.6,!,!,!,!,!,!,...,,,14.0,,198.0,,44.0,,,City of London
1,City of London,2004,80.6,80.6,!,!,!,!,!,!,...,,,14.0,,198.0,,44.0,,,City of London
2,City of London,2004,80.6,80.6,!,!,!,!,!,!,...,,,14.0,,198.0,,44.0,,,City of London
3,City of London,2004,80.6,80.6,!,!,!,!,!,!,...,,,14.0,,198.0,,44.0,,,City of London
4,Barking and Dagenham,2004,17.2,17.2,10.8,6.4,12.7,20.8,15.9,16.3,...,,,14.0,,551.0,,90.0,,,Barking and Dagenham
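
An alternative to stripping the spaces from every column name is to rename only the join key, which keeps the other headers readable. The commented-out lines in cell 22 appear to attempt something similar. Note that `DataFrame.rename` returns a new dataframe rather than modifying the original, so the result has to be assigned. A minimal sketch on a hypothetical stand-in dataframe (`econ_demo` and its values are not part of the notebook):

```python
import pandas as pd

# Hypothetical two-column stand-in for the economy dataframe
econ_demo = pd.DataFrame({"Area name": ["Barnet"], "Traffic Flows": [1.0]})

# rename returns a new dataframe, so assign the result back
econ_demo = econ_demo.rename(columns={"Area name": "Area"})

print(list(econ_demo.columns))
```

With the key named `Area` in both dataframes, the merge could use a single `on='Area'` argument instead of `left_on` and `right_on`.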


In [24]:
merged_econ = pd.merge(left=econ, right=business, left_on='Areaname', right_on='Area').drop('Areaname', axis=1)
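
Both dataframes also carry a `Year` column, and the merge above joins on the area alone. The sketch below, on hypothetical miniature dataframes (`econ_demo`, `business_demo` and their values are illustrative only), shows what that implies and how joining on area and year together would differ. It is one possible variation, not a statement of what the notebook intends.

```python
import pandas as pd

# Hypothetical miniature dataframes; the values are illustrative only
econ_demo = pd.DataFrame({
    "Areaname": ["Barnet", "Barnet"],
    "Year": [2017, 2018],
    "JobsDensity": [0.6, 0.7],
})
business_demo = pd.DataFrame({
    "Area": ["Barnet", "Barnet"],
    "Year": [2017, 2018],
    "Business Count": [30, 35],
})

# Joining on the area alone pairs every econ row with every business row
# for that area, and the shared "Year" column is split into Year_x and Year_y
by_area = pd.merge(econ_demo, business_demo, left_on="Areaname", right_on="Area")

# Joining on area and year keeps one row per borough-year
by_area_year = pd.merge(
    econ_demo, business_demo,
    left_on=["Areaname", "Year"], right_on=["Area", "Year"],
).drop("Areaname", axis=1)

print(len(by_area), len(by_area_year))
```

Because `pd.merge` defaults to an inner join, joining on year as well would keep only the years present in both files.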

In [31]:
len(merged_econ.columns)
len(econ.columns)
len(business.columns)

5

In [9]:
merged_econ.to_csv('merged.csv', index=False)
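
To illustrate what `index=False` changes, here is a small round trip through an in-memory buffer on a hypothetical dataframe (`demo` is not part of the notebook). The `Unnamed: 0` column seen in both previews in section 1 is what a saved index typically looks like when read back, though whether the source files were produced that way is an assumption.

```python
import io
import pandas as pd

# Hypothetical dataframe standing in for the merged data
demo = pd.DataFrame({"Area": ["Barnet", "Bexley"], "Business Count": [30, 10]})

# With the default index=True the row numbers are written as an unnamed
# first column, which read_csv then labels "Unnamed: 0"
with_index = pd.read_csv(io.StringIO(demo.to_csv()))

# With index=False only the real columns are written
without_index = pd.read_csv(io.StringIO(demo.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```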

## 4. Uploading to Count

The fun bit! Initialise the Count API (imported at the top of the notebook) with your access token, then upload the merged dataframe from step 3:

In [32]:
count = CountAPI()
count.set_api_token(token)
table = count.upload(data = merged_econ,name = 'Merged Economy Data')

In [33]:
columns = [t.name for t in table.columns()]

Finally, with the table uploaded, we list its column names and create an interactive bar chart of `Industry` against `Year_x`.

In [34]:
columns

['Year_x',
 '%ofeconomicallyactivewithNVQ4+-workingage_x',
 '%ofeconomicallyactivewithNVQ4+-workingage_y',
 '%ofeconomicallyactivewithNVQ3only-workingage',
 '%ofeconomicallyactivewithTradeApprenticeships-workingage',
 '%ofeconomicallyactivewithNVQ2only-workingage',
 '%ofeconomicallyactivewithNVQ1only-workingage',
 '%ofeconomicallyactivewithotherqualifications-workingage',
 '%ofeconomicallyactivewithnoqualifications-workingage',
 'GrossAnnualPay-FullTime-Total(GBP)',
 'GrossAnnualPay-FullTime-Male(GBP)',
 'GrossAnnualPay-FullTime-Female(GBP)',
 'MedianModelledHouseholdincome(GBP)',
 'MeanModelledHouseholdincome(GBP)',
 'Realchange%(afterinflation)-2001/02to2012/13',
 'Numberofactivebusinesses',
 'Two-yearbusinesssurvivalrates',
 'NumberEmployed-Total',
 'NumberEmployed-Male',
 'NumberEmployed-Female',
 'JobsDensity',
 'CrimeRatesperthousandpopulation',
 'Firesperthousandpopulation',
 'Ambulanceincidentsperhundredpopulation',
 'TotalCarbonEmissions',
 'HouseholdWasteRecyclingRate',
 'Num

In [35]:
visual = table.upload_visual(x=table['Year_x'], y=table['Industry'],
                            chart_options = {'type':'bar'})
visual.embed()