
# Data retrieval
---

Each code cell below is an example of how data can be retrieved from a particular kind of source.

### Read the text and run the code to see what it does.


## Scraping data from a web page
---

The code below reads all the data tables from the Wikipedia page on Glasgow.  The 8th table on the page shows population data over a period of centuries.

The code reads the data from the page into a list of dataframes.  An index, e.g. `[7]`, is used to access the 8th table in the list, because list indexes start at 0.

1.  Open the link to have a look at the [Glasgow Wikipedia](https://en.wikipedia.org/wiki/Glasgow#Climate) page
2.  Run the code.
3.  Change the index to see other data tables
4.  Add the line
```
print(len(datatables))
```
to show how many tables were on the page and so are in the list.
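
As a minimal offline sketch of the indexing described above, the list below stands in for the one that `pd.read_html` returns. The three small dataframes and their contents are placeholders made up for illustration; only the list indexing and `len` carry over to the real page.

```python
import pandas as pd

# Hypothetical stand-in for the list of dataframes returned by pd.read_html
datatables = [
    pd.DataFrame({"a": [1, 2], "b": [3, 4]}),
    pd.DataFrame({"c": [5, 6], "d": [7, 8]}),
    pd.DataFrame({"e": [1, 2, 3], "f": [4, 5, 6]}),
]

# List indexes start at 0, so index 1 is the 2nd table in the list
second_table = datatables[1]

# Summarise every table so it is easier to decide which index to look at
summary = [(index, table.shape) for index, table in enumerate(datatables)]
for index, shape in summary:
    print(f"table {index}: {shape[0]} rows x {shape[1]} columns")
```

Printing the shape of every table first can save some trial and error when a page holds many tables.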

## TAKEAWAY:
Take a look at a number of the data tables.  They can look messy. The job of the programmer is to write code that will tidy the tables up.
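
A minimal sketch of one kind of tidying, assuming a table shaped like the density table on the Glasgow page: a footer row that repeats a source note in every column, and a column with no values. The dataframe is typed in by hand (footnote markers removed and the source note shortened), and the helper name `tidy_table` is hypothetical.

```python
import pandas as pd

# Hand-typed stand-in for a scraped table with a repeated footer row and an empty column
raw = pd.DataFrame({
    "Location": ["Glasgow City Council Area", "Greater Glasgow Urban Area", "Source: census"],
    "Population": ["592820", "985290", "Source: census"],
    "Unnamed: 4": [None, None, "Source: census"],
})

def tidy_table(df):
    """Drop footer rows whose cells are all identical, then drop empty columns."""
    # A footer row repeats the same text in every column, so it has one unique value
    is_footer = df.nunique(axis=1, dropna=False) == 1
    tidy = df[~is_footer]
    # Columns left with no values at all carry no information
    tidy = tidy.dropna(axis=1, how="all")
    # Population was read as text because of the footer, so convert it to numbers
    tidy = tidy.assign(Population=pd.to_numeric(tidy["Population"]))
    return tidy.reset_index(drop=True)

tidy = tidy_table(raw)
print(tidy)
```

Other tables on the page will need different steps; the point is only that each fix is a short, repeatable piece of code.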

In [None]:
import pandas as pd

def get_web_data():
  datatables = pd.read_html('https://en.wikipedia.org/wiki/Glasgow#Climate')
  print(f"Total tables is {len(datatables)}")
  # change the index in [] to look at other tables (the print above shows how many there are)
  #df = datatables[7]  #Glasgow population data
  #df = datatables[2]  #Glasgow climate data
  df = datatables[3]  #Glasgow Density data
  #df = datatables[4]
  #df = datatables[5]  #Glasgow Ethnic Group
  #df = datatables[6]

  return df

# run and test the get_web_data() function, test visually - does it match the data on the web page?
web_table = get_web_data()
display(web_table)

Total tables is 20


Unnamed: 0,Location,Population,Area,Density,Unnamed: 4
0,Glasgow City Council Area[101],592820,67.76 sq mi (175.5 km2),"8,541.8/sq mi (3,298.0/km2)",
1,Greater Glasgow Urban Area[5],985290,265 km2 (102 sq mi),"3,775/km2 (9,780/sq mi)",
2,Source: Scotland's Census Results Online[102],Source: Scotland's Census Results Online[102],Source: Scotland's Census Results Online[102],Source: Scotland's Census Results Online[102],Source: Scotland's Census Results Online[102]


## From a csv file hosted on Github.com
---

Data has often already been tidied up and organised into table form.  It is often stored as Comma Separated Values (CSV): a plain text file, which is small and so quick to transfer over the internet.

The code below reads a data table stored in a CSV file.  Each line of the file is one row of data, and each column within the row is separated from the next by a comma.

(**Note**: If you were using Jupyter Notebooks on your device, the url could be replaced with the path to the CSV file).
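
To make the format concrete, the sketch below reads CSV text held in a string rather than fetched from a URL. The three rows are copied from the Paisley weather data used in this worksheet, with the set of columns cut down; `io.StringIO` is used here only so that the example runs without an internet connection.

```python
import io
import pandas as pd

# A few rows of comma separated values: the first line is the header, then one record per line
csv_text = """yyyy,mm,tmax (degC),tmin (degC)
1959,1,4,-2
1959,2,6.6,2.1
1959,3,10.6,4.2
"""

# pd.read_csv accepts a URL, a file path, or a file-like object such as StringIO
df = pd.read_csv(io.StringIO(csv_text))
print(df)
print(df.dtypes)
```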

In [None]:
import pandas as pd

def get_csv_data():
  #url = "https://raw.githubusercontent.com/futureCodersSE/working-with-data/main/Data%20sets/Paisley-Weather-Data.csv"
  # try the other link format
  url = "https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/Paisley-Weather-Data.csv?raw=true"
  df = pd.read_csv(url)
  return df

weather_data = get_csv_data()
display(weather_data)

Unnamed: 0,yyyy,mm,tmax (degC),tmin (degC),af (days),rain (mm),sun (hours),status
0,1959,1,4,-2,25,40.9,54.1,
1,1959,2,6.6,2.1,10,41.8,17.8,
2,1959,3,10.6,4.2,0,50.9,85.7,
3,1959,4,13,5.2,0,76.3,125.1,
4,1959,5,18.1,7.9,0,24,222,
...,...,...,...,...,...,...,...,...
741,2020,10,12.9*,7.1*,0*,185.3*,76.8*,Provisional
742,2020,11,10.6*,6.0*,0*,142.4*,29.3*,Provisional
743,2020,12,6.9*,2.6*,8*,131.0*,31.6*,Provisional
744,2021,1,4.9*,-0.2*,14*,132.2*,51.0*,Provisional


## From an Excel file hosted on Github.com
---

The code below reads the data table from a sheet in an Excel file.  

Excel spreadsheets often have more than one sheet.  If you don't specify a sheet then it will assume that you want to read the data from the first sheet in the Excel workbook (sheet_name = 0).  If you don't know the sheet name but know it is the second sheet, you can use sheet_name = 1, or 2 for the third sheet, etc.

The Excel file is readable by the code ONLY if it is in its raw format (which is not the format we normally see it in). This is the [original file](https://docs.google.com/spreadsheets/d/1JnGkdYpYdr1hsr_ALCPBduxljQxoF1FK/edit?usp=share_link&ouid=109124845535182996296&rtpof=true&sd=true) in a form we can read.  Have a look at it so you know what to expect.

(**Again**: If you were using Jupyter Notebooks on your device, the url could be replaced with the path to the Excel file).

In [None]:
import pandas as pd

def get_excel_data():
  url = "https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true"
  #df = pd.read_excel(url,sheet_name="Industry Migration")
  df = pd.read_excel(url,sheet_name="Country Migration")
  #df = pd.read_excel(url,sheet_name="Notes")
  return df

migration_df = get_excel_data()
display(migration_df)

Unnamed: 0,base_country_code,base_country_name,base_lat,base_long,base_country_wb_income,base_country_wb_region,target_country_code,target_country_name,target_lat,target_long,target_country_wb_income,target_country_wb_region,net_per_10K_2015,net_per_10K_2016,net_per_10K_2017,net_per_10K_2018,net_per_10K_2019
0,ae,United Arab Emirates,23.424076,53.847818,High Income,Middle East & North Africa,af,Afghanistan,33.939110,67.709953,Low Income,South Asia,0.19,0.16,0.11,-0.05,-0.02
1,ae,United Arab Emirates,23.424076,53.847818,High Income,Middle East & North Africa,dz,Algeria,28.033886,1.659626,Upper Middle Income,Middle East & North Africa,0.19,0.25,0.57,0.55,0.78
2,ae,United Arab Emirates,23.424076,53.847818,High Income,Middle East & North Africa,ao,Angola,-11.202692,17.873887,Lower Middle Income,Sub-Saharan Africa,-0.01,0.04,0.11,-0.02,-0.06
3,ae,United Arab Emirates,23.424076,53.847818,High Income,Middle East & North Africa,ar,Argentina,-38.416097,-63.616672,High Income,Latin America & Caribbean,0.16,0.18,0.04,0.01,0.23
4,ae,United Arab Emirates,23.424076,53.847818,High Income,Middle East & North Africa,am,Armenia,40.069099,45.038189,Upper Middle Income,Europe & Central Asia,0.10,0.05,0.03,-0.01,0.02
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4143,zw,Zimbabwe,-19.015438,29.154857,Low Income,Sub-Saharan Africa,za,South Africa,-30.559482,22.937506,Upper Middle Income,Sub-Saharan Africa,-2.98,-11.79,-9.10,-12.08,-20.76
4144,zw,Zimbabwe,-19.015438,29.154857,Low Income,Sub-Saharan Africa,ae,United Arab Emirates,23.424076,53.847818,High Income,Middle East & North Africa,-2.50,-2.49,-2.21,-1.68,-3.19
4145,zw,Zimbabwe,-19.015438,29.154857,Low Income,Sub-Saharan Africa,gb,United Kingdom,55.378051,-3.435973,High Income,Europe & Central Asia,3.91,4.66,0.74,-0.66,-1.97
4146,zw,Zimbabwe,-19.015438,29.154857,Low Income,Sub-Saharan Africa,us,United States,37.090240,-95.712891,High Income,North America,38.60,37.76,10.09,6.06,5.25


## From an API which delivers the data in JSON format
---

The code below requests the data from a URL.  This is a bit trickier than the other ways of getting data, because how you access the data depends on how it is organised.

In general, the data will be retrieved as a dictionary (not a table), which will contain a key called 'data' under which the actual data is stored.  In the example, the data has been taken from the 'data' key and is stored in json_data.

**Try these to help understand the data**:

1.  The code below gets the data from the URL and stores it in a variable called **json_data**.  Run the code to see what the original data looks like.

2.  ```json_data``` is a list of records but it only has one record in the list.  **data_table** is the first record in the ```json_data``` list.   
 * comment out the line ```print(json_data)```
 * un-comment the line that assigns data_table the value json_data[0]
 * un-comment the line that will print data table.

3.  In this example, data_table has three keys, 'from', 'to' and 'regions'.  Take a look at the regions data on its own.  
  * change the line ```print(data_table)``` to print just the regions part of it: ```print(data_table['regions'])```

4. The 'regions' value is the data we want to use in our dataframe, so the rest of the code normalizes this json data into a pandas dataframe (df), which you can see as the output.  To see this:  
  *  comment out the line  ```print(data_table['regions'])```  
  *  un-comment the rest of the code to see what the data looks like

Each API is likely to deliver its data in a different format, so you will need to be confident reading the documentation and inspecting the data to see which keys and indexes you need to access.

For information on the format of the data here, see https://carbon-intensity.github.io/api-definitions/#regional
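
The inspection steps above can be sketched offline. The dictionary below is a hand-trimmed copy of one response from the regional endpoint (two regions, shortened `generationmix` lists, `dnoregion` left out), so no request is made; `response_json` is a hypothetical name for what `requests.get(url).json()` would return.

```python
import pandas as pd

# Hand-trimmed sample of the response, not a live request
response_json = {
    "data": [
        {
            "from": "2024-07-28T18:00Z",
            "to": "2024-07-28T18:30Z",
            "regions": [
                {
                    "regionid": 1,
                    "shortname": "North Scotland",
                    "intensity": {"forecast": 131, "index": "moderate"},
                    "generationmix": [{"fuel": "gas", "perc": 33.4}, {"fuel": "wind", "perc": 53.6}],
                },
                {
                    "regionid": 2,
                    "shortname": "South Scotland",
                    "intensity": {"forecast": 86, "index": "low"},
                    "generationmix": [{"fuel": "gas", "perc": 21.4}, {"fuel": "wind", "perc": 51.3}],
                },
            ],
        }
    ]
}

# Step 1: the 'data' key holds a list of records
json_data = response_json["data"]
print(type(json_data), len(json_data))

# Step 2: the first record is a dictionary; its keys show what it contains
data_table = json_data[0]
print(list(data_table.keys()))

# Step 3: flatten the list of regions, one row per region
df = pd.json_normalize(data_table["regions"])
print(df.columns.tolist())
```

Nested dictionaries such as `intensity` become dotted column names, while lists such as `generationmix` are left as they are in a single cell.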

In [None]:
import pandas as pd
import requests

def get_api_data():
  url = "https://api.carbonintensity.org.uk/regional"
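  # the response is a dictionary; the 'data' key holds a list of records
  json_data = requests.get(url).json()['data']
  print(json_data)
  print(len(json_data))  # shows how many records are in the list
  # the first record has the keys 'from', 'to' and 'regions'
  data_table = json_data[0]
  print(data_table)
  # flatten the list of regions into a dataframe, one row per region
  df = pd.json_normalize(data_table['regions'])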
  return df

generation_df = get_api_data()
display(generation_df)

[{'from': '2024-07-28T18:00Z', 'to': '2024-07-28T18:30Z', 'regions': [{'regionid': 1, 'dnoregion': 'Scottish Hydro Electric Power Distribution', 'shortname': 'North Scotland', 'intensity': {'forecast': 131, 'index': 'moderate'}, 'generationmix': [{'fuel': 'biomass', 'perc': 0}, {'fuel': 'coal', 'perc': 0}, {'fuel': 'imports', 'perc': 0}, {'fuel': 'gas', 'perc': 33.4}, {'fuel': 'nuclear', 'perc': 0}, {'fuel': 'other', 'perc': 0}, {'fuel': 'hydro', 'perc': 13.1}, {'fuel': 'solar', 'perc': 0}, {'fuel': 'wind', 'perc': 53.6}]}, {'regionid': 2, 'dnoregion': 'SP Distribution', 'shortname': 'South Scotland', 'intensity': {'forecast': 86, 'index': 'low'}, 'generationmix': [{'fuel': 'biomass', 'perc': 1.3}, {'fuel': 'coal', 'perc': 0}, {'fuel': 'imports', 'perc': 0}, {'fuel': 'gas', 'perc': 21.4}, {'fuel': 'nuclear', 'perc': 16.5}, {'fuel': 'other', 'perc': 0}, {'fuel': 'hydro', 'perc': 7.4}, {'fuel': 'solar', 'perc': 2.1}, {'fuel': 'wind', 'perc': 51.3}]}, {'regionid': 3, 'dnoregion': 'Electri

Unnamed: 0,regionid,dnoregion,shortname,generationmix,intensity.forecast,intensity.index
0,1,Scottish Hydro Electric Power Distribution,North Scotland,"[{'fuel': 'biomass', 'perc': 0}, {'fuel': 'coa...",131,moderate
1,2,SP Distribution,South Scotland,"[{'fuel': 'biomass', 'perc': 1.3}, {'fuel': 'c...",86,low
2,3,Electricity North West,North West England,"[{'fuel': 'biomass', 'perc': 2.5}, {'fuel': 'c...",73,low
3,4,NPG North East,North East England,"[{'fuel': 'biomass', 'perc': 5.1}, {'fuel': 'c...",12,very low
4,5,NPG Yorkshire,Yorkshire,"[{'fuel': 'biomass', 'perc': 52.3}, {'fuel': '...",142,moderate
5,6,SP Manweb,North Wales & Merseyside,"[{'fuel': 'biomass', 'perc': 1.4}, {'fuel': 'c...",78,low
6,7,WPD South Wales,South Wales,"[{'fuel': 'biomass', 'perc': 0}, {'fuel': 'coa...",314,very high
7,8,WPD West Midlands,West Midlands,"[{'fuel': 'biomass', 'perc': 7.8}, {'fuel': 'c...",103,low
8,9,WPD East Midlands,East Midlands,"[{'fuel': 'biomass', 'perc': 27}, {'fuel': 'co...",192,high
9,10,UKPN East,East England,"[{'fuel': 'biomass', 'perc': 7.9}, {'fuel': 'c...",89,low


### Exercise - upload a csv file to your github repository and create a data table from it

Visit the Kent and Medway Air Quality site: https://kentair.org.uk/

Collect a data file containing data on Ozone levels in Dover:

1.  Open the site
2.  Go to the Data page
3.  Launch the data selector tool
4.  Select:
*  Automatic monitoring data
*  Measurement data and simple statistics
*  Ozone
*  Daily mean
*  This month
*  Thurrock
*  Thurrock

Click on Download CSV  (This should be downloaded into your Downloads folder).

**NEXT**

Add the file to your Github repository.
* rename the file dover-ozone-daily-mean.csv
* sign in to your Github account
* open your repository
* click on Add a file
* upload the air data file

To be able to open the file from github, you will need to get the link to the raw file.  
* open the file on Github
* find the button 'Raw' and click on it
* copy the URL    
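
The two GitHub link formats used earlier in this worksheet (a `raw.githubusercontent.com` address, and a `github.com/.../blob/...` address ending in `?raw=true`) point at the same file. As a sketch, the hypothetical helper `to_raw_url` below rewrites the second form into the first using only string handling. It assumes the link follows the usual `github.com/<user>/<repo>/blob/<branch>/<path>` layout and does no checking beyond that.

```python
def to_raw_url(blob_url):
    """Convert a github.com 'blob' link into a raw.githubusercontent.com link."""
    # Drop any query string such as ?raw=true
    url = blob_url.split("?", 1)[0]
    prefix = "https://github.com/"
    if not url.startswith(prefix) or "/blob/" not in url:
        raise ValueError(f"not a GitHub blob link: {blob_url}")
    # user/repo comes before /blob/, branch/path comes after it
    repo_part, file_part = url[len(prefix):].split("/blob/", 1)
    return f"https://raw.githubusercontent.com/{repo_part}/{file_part}"

blob = "https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/Paisley-Weather-Data.csv?raw=true"
raw = to_raw_url(blob)
print(raw)
```

Copying the address from the 'Raw' button remains the simplest route; the helper is only useful when a list of links has to be converted.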

**NEXT**  

Write some code to display the dataframe and compare the contents with the output on the site you took the data from.





In [None]:
# add code here to read the csv file and display the dataframe (see above for help - From a csv hosted on Github)
# "https://raw.githubusercontent.com/StanStarishko/python-programming-for-data/main/Data/dover-ozone-daily-mean.csv"
import pandas as pd

def get_csv_data(url=""):
  # url must be a non-empty string
  if url == "" or not isinstance(url,str):
    return False

  df = pd.read_csv(url)
  return df

url = "https://raw.githubusercontent.com/StanStarishko/python-programming-for-data/main/Data/dover-ozone-daily-mean.csv"
#url = 234
display(get_csv_data(url))



Unnamed: 0.1,Unnamed: 0,Thurrock,Unnamed: 2
0,Date,Ozone,Status
1,01/07/2024,50.97588,P µg/m³
2,02/07/2024,44.81178,P µg/m³
3,03/07/2024,45.28576,P µg/m³
4,04/07/2024,60.33666,P µg/m³
5,05/07/2024,40.68733,P µg/m³
6,06/07/2024,48.12963,P µg/m³
7,07/07/2024,49.00275,P µg/m³
8,08/07/2024,48.72834,P µg/m³
9,09/07/2024,37.07013,P µg/m³


### Exercise - upload a csv file directly into your notebook

Use the same file you created in the previous exercise

In this workbook, in the navigation bar on the left, click on the Folder icon
* click on Upload to session storage (select the file from your computer and click on Open to upload it)
* you should see the file in the storage section

To be able to open the file from the session storage, you should just need to use the filename.      
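
A small sketch of reading by filename, assuming nothing about Colab itself: it writes a two-row CSV into a temporary folder and reads it back with `pd.read_csv`. The file name `example.csv` is a placeholder and the two rows are copied from the ozone output above. In the notebook the uploaded file sits in the working directory, so the bare filename is enough.

```python
import os
import tempfile
import pandas as pd

with tempfile.TemporaryDirectory() as folder:
    # Placeholder file standing in for an uploaded CSV
    path = os.path.join(folder, "example.csv")
    with open(path, "w", encoding="utf-8") as f:
        f.write("Date,Ozone\n01/07/2024,50.97588\n02/07/2024,44.81178\n")

    # Checking that the file exists first gives a clearer message than a traceback
    if os.path.exists(path):
        local_df = pd.read_csv(path)
    else:
        local_df = None
        print(f"{path} not found - has it been uploaded?")

print(local_df)
```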

**NEXT**  

Write some code to display the dataframe and compare the contents with the output from the last exercise.

In [None]:
# the 'url' is now the name of a local file in the working directory
url = "dover-ozone-daily-mean.csv"
display(get_csv_data(url))

False

### Exercise - read from an Excel spreadsheet
---
Open the datasets list: [here](https://docs.google.com/document/d/1cijDOCDixsYu-Rr9pC8STPPXado3xoFpgBAZgdDTLHs/edit?usp=sharing)  

* Find a dataset that is an Excel file

* Copy the code above (for Excel files on Github) into the code cell below  

* Copy the URL of the Excel file you have chosen in the datasets list  

* Change the line
```
df = pd.read_excel(url,sheet_name="Industry Migration")
```
to
```
df = pd.read_excel(url)
```
This will then open the first sheet in the Excel file, rather than a named sheet.

* Run the code to open the data.

In [2]:
import pandas as pd

def is_valid_link(link="",link_name="",autotest=False):
  # link must be a non-empty string
  return_value = link != "" and isinstance(link, str)

  if not return_value and not autotest: # not print if autotest
    print(f"{link_name} is not valid")

  return return_value

def get_excel_data(url="",sheet_name="default"):
  # url and sheet name must both be non-empty strings
  is_not_valid_url = not is_valid_link(url,"url")
  is_not_valid_sheet_name = not is_valid_link(sheet_name,"sheet name")
  if is_not_valid_url or is_not_valid_sheet_name:
    return False

  if sheet_name == "default":
    df = pd.read_excel(url)
  else:
    df = pd.read_excel(url,sheet_name)

  return df

url = "https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/Income-Data.xlsx?raw=true"
income_df = get_excel_data(url)
display(income_df)

Unnamed: 0,State,County,Population,Age,Income
0,TX,1,72,34,65
1,TX,2,33,42,45
2,TX,5,25,23,46
3,TX,6,54,36,65
4,TX,7,11,42,53
5,TX,8,28,25,62
6,TX,9,82,35,66
7,TX,10,5,40,75
8,MD,11,61,27,22
9,MD,2,5,23,69


### Exercise
---
*  Copy the code from the cell above -  API delivered in JSON format
*  Change the URL to add /scotland to the end of it
*  get the data_table as before
*  create a new variable, **generation_data** to hold ```data_table['data']```
*  take a look at ```generation_data[0]``` by printing it
*  normalize ```generation_data[0]```  
*  display the resulting dataframe

You will notice that there is only one row of data but that the column headed ```generationmix``` has a list of items in that one row.  

You can use ```df = df.explode('generationmix')``` to expand the table so that it has a row for each item in the ```generationmix``` column.

**Extension**:  

You will notice that the first column (which is generally the index column) has only 0s.  This is because the index of the original single row was 0, and every exploded row has kept it.  To re-index, tell the explode function to ignore the original index, like this:
```
df = df.explode('generationmix', ignore_index=True)
```
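
The difference the `ignore_index` argument makes can be seen on a one-row dataframe built by hand. The values are a shortened sample of the kind of record the `/scotland` endpoint returns; no request is made.

```python
import pandas as pd

# One row whose generationmix cell holds a list of two items
df = pd.DataFrame({
    "from": ["2024-07-28T20:30Z"],
    "generationmix": [[{"fuel": "gas", "perc": 15.5}, {"fuel": "wind", "perc": 64.3}]],
})

# Without ignore_index every new row keeps the index of the row it came from
kept = df.explode("generationmix")
# With ignore_index=True the rows are renumbered from 0
renumbered = df.explode("generationmix", ignore_index=True)

print(kept.index.tolist())        # [0, 0]
print(renumbered.index.tolist())  # [0, 1]
```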

There is still a way to go to get data like this ready for use but you can start to see what can be done.
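
One possible next step, sketched under the assumption that a separate column for the fuel name and its percentage is wanted: `pd.json_normalize` can unpack the `generationmix` list directly through its `record_path` and `meta` arguments, instead of leaving each item as a dictionary in a cell. The record below is a shortened, hand-copied sample rather than a live response.

```python
import pandas as pd

# Shortened sample record, typed in by hand
generation_data = [{
    "from": "2024-07-28T20:30Z",
    "to": "2024-07-28T21:00Z",
    "intensity": {"forecast": 63, "index": "low"},
    "generationmix": [
        {"fuel": "gas", "perc": 15.5},
        {"fuel": "nuclear", "perc": 15},
        {"fuel": "wind", "perc": 64.3},
    ],
}]

# record_path names the list to turn into rows;
# meta lists the outer values to repeat on every row
mix_df = pd.json_normalize(
    generation_data,
    record_path="generationmix",
    meta=["from", "to", ["intensity", "forecast"], ["intensity", "index"]],
)
print(mix_df)
```

This gives one row per fuel with `fuel` and `perc` as ordinary columns, which is easier to sort, filter or plot than a column of dictionaries.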

In [8]:
import pandas as pd
import requests

def get_api_data(url=""):

  # is_valid_link() is defined in the Excel exercise cell above, so run that cell first
  if not is_valid_link(url,"url"):
    return False

  json_data = requests.get(url).json()['data']
  generation_data = json_data[0]['data']
  print(generation_data)
  df = pd.json_normalize(generation_data).explode('generationmix', ignore_index=True)
  return df


url = "https://api.carbonintensity.org.uk/regional/scotland"
generation_df = get_api_data(url)
display(generation_df)

[{'from': '2024-07-28T20:30Z', 'to': '2024-07-28T21:00Z', 'intensity': {'forecast': 63, 'index': 'low'}, 'generationmix': [{'fuel': 'biomass', 'perc': 1.2}, {'fuel': 'coal', 'perc': 0}, {'fuel': 'imports', 'perc': 0}, {'fuel': 'gas', 'perc': 15.5}, {'fuel': 'nuclear', 'perc': 15}, {'fuel': 'other', 'perc': 0}, {'fuel': 'hydro', 'perc': 3.7}, {'fuel': 'solar', 'perc': 0.1}, {'fuel': 'wind', 'perc': 64.3}]}]


Unnamed: 0,from,to,generationmix,intensity.forecast,intensity.index
0,2024-07-28T20:30Z,2024-07-28T21:00Z,"{'fuel': 'biomass', 'perc': 1.2}",63,low
1,2024-07-28T20:30Z,2024-07-28T21:00Z,"{'fuel': 'coal', 'perc': 0}",63,low
2,2024-07-28T20:30Z,2024-07-28T21:00Z,"{'fuel': 'imports', 'perc': 0}",63,low
3,2024-07-28T20:30Z,2024-07-28T21:00Z,"{'fuel': 'gas', 'perc': 15.5}",63,low
4,2024-07-28T20:30Z,2024-07-28T21:00Z,"{'fuel': 'nuclear', 'perc': 15}",63,low
5,2024-07-28T20:30Z,2024-07-28T21:00Z,"{'fuel': 'other', 'perc': 0}",63,low
6,2024-07-28T20:30Z,2024-07-28T21:00Z,"{'fuel': 'hydro', 'perc': 3.7}",63,low
7,2024-07-28T20:30Z,2024-07-28T21:00Z,"{'fuel': 'solar', 'perc': 0.1}",63,low
8,2024-07-28T20:30Z,2024-07-28T21:00Z,"{'fuel': 'wind', 'perc': 64.3}",63,low
