# Downloading and acquiring datasets

This notebook describes how to download or access the datasets used by the other notebooks.

Where a dataset only needs to be downloaded once, we give manual download instructions rather than API calls.

**Note: to run the other notebooks, you must first download the data listed here.**

## Direct/manual downloads

Currently, both the Local Authority District (LAD) and Ordnance Survey Open Greenspace datasets are downloaded manually.

These can be accessed here:

- [May 2024 Local Authority District (LAD) data](https://geoportal.statistics.gov.uk/datasets/ons::local-authority-districts-may-2024-boundaries-uk-bfe-2/about)
- [OS Open Greenspace Data](https://osdatahub.os.uk/downloads/open/OpenGreenspace)
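
Because these files are fetched by hand, it can help to confirm they are in place before running the other notebooks. The sketch below is a minimal check under stated assumptions: `missing_downloads` is a hypothetical helper, the folder `../data/data_downloads` is assumed to hold the manual downloads (it is the folder this notebook later saves its outputs to), and the two file names are placeholders to be replaced with the names of the files actually downloaded.

```python
from pathlib import Path


def missing_downloads(data_dir, expected_files):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in expected_files if not (data_dir / name).exists()]


# Hypothetical file names: replace with the names of the files you downloaded
EXPECTED_FILES = [
    "local_authority_districts_may_2024.geojson",
    "os_open_greenspace.gpkg",
]

# Assumed location of the manual downloads
missing = missing_downloads("../data/data_downloads", EXPECTED_FILES)
if missing:
    print("Missing manual downloads:", ", ".join(missing))
else:
    print("All manual downloads found.")
```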

## In-script downloads

The OpenStreetMap data is downloaded within the [01 Processing parks data notebook](01_Processing_parks_data.ipynb). The area downloaded can be changed by altering the LAD code used (Bradford, in this case).

## Webscraping park data

The web-scraping script and processing code below were developed by Fran Pontin to pull information about Bradford city parks from the official [Bradford District Parks website](https://bradforddistrictparks.org/), which provides data on formally recognised parks and greenspaces in Bradford.

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import geopandas as gpd
import numpy as np

# Function to get all links from a single page
def get_links_from_page(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    links = [a['href'] for a in soup.find_all('a', href=True, rel='bookmark')]
    return links

# Base URL of the webpage to scrape
base_url = 'https://bradforddistrictparks.org/park/page/'

# Initialize an empty list to store all links
all_links = []

# Loop through all pages (assuming there are 10 pages)
for page_num in range(1, 11):
    page_url = f'{base_url}{page_num}/'
    links = get_links_from_page(page_url)
    all_links.extend(links)

parks_data = []

# Loop through each park link and extract information
for link in all_links:
    # Send a GET request to the park webpage
    response = requests.get(link)
    
    # Parse the park webpage content
    soup = BeautifulSoup(response.content, 'html.parser')
    

    # Look up each element once; a missing element falls back to 'NA'
    title_tag = soup.find('h1', class_='entry-title')
    location_tag = soup.find('li', class_='location')
    calendar_tag = soup.find('li', class_='calendar')

    # Extract park name, location and opening hours
    park_name = title_tag.text.strip() if title_tag else 'NA'
    location = location_tag.text.strip() if location_tag else 'NA'
    opening_hours = calendar_tag.text.strip() if calendar_tag else 'NA'

    # Find the unordered list containing the opening hours and location
    parent_ul = calendar_tag.find_parent('ul') if calendar_tag else None

    # Find the next unordered list after the parent unordered list
    next_ul = parent_ul.find_next('ul') if parent_ul else None
    
    # Extract elements that appear in the next unordered list
    elements = []
    if next_ul:
        for li in next_ul.find_all('li'):
            elements.append(li.text.strip())
    else:
        elements.append('NA')
    
    # Extract latitude and longitude from data attributes
    map_element = soup.find('div', {'data-lat': True, 'data-lng': True})
    latitude = map_element['data-lat'] if map_element else 'NA'
    longitude = map_element['data-lng'] if map_element else 'NA'
    
    # Append the extracted data to the parks_data list
    parks_data.append({
        'Park Name': park_name,
        'Park URL': link,
        'Location': location,
        'Opening Hours': opening_hours,
        'Latitude': latitude,
        'Longitude': longitude,
        'Park features': elements
    })

# Create a pandas DataFrame from the parks_data list
parks_df = pd.DataFrame(parks_data)
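
The page loop above assumes the listing has exactly 10 pages. A minimal sketch of one alternative is to keep requesting pages until one returns no links, with a cap as a safety net. `collect_all_links`, `fetch_links` and `max_pages` are hypothetical names introduced here; `fetch_links` stands in for `get_links_from_page`. The sketch assumes that a page past the end of the listing yields no `rel='bookmark'` links, which has not been checked against the live site.

```python
def collect_all_links(fetch_links, base_url, max_pages=50):
    """Collect links page by page until a page yields none.

    fetch_links: callable taking a page URL and returning a list of links
                 (for example, get_links_from_page).
    max_pages:   safety cap so the loop always terminates.
    """
    all_links = []
    for page_num in range(1, max_pages + 1):
        links = fetch_links(f"{base_url}{page_num}/")
        if not links:
            # Assumed end of the listing: stop at the first empty page
            break
        all_links.extend(links)
    return all_links


# Offline demonstration with a fake fetcher standing in for the website
fake_site = {
    "https://example.org/park/page/1/": ["https://example.org/park/a/"],
    "https://example.org/park/page/2/": ["https://example.org/park/b/"],
}
demo_links = collect_all_links(
    lambda url: fake_site.get(url, []), "https://example.org/park/page/"
)
print(demo_links)
```

With the scraper above, the equivalent call would be `collect_all_links(get_links_from_page, base_url)`.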



In [18]:
# Assign the result back to the column: calling replace(..., inplace=True) on a
# selected column is deprecated and will not update parks_df in pandas 3.0
parks_df['Latitude'] = parks_df['Latitude'].replace('NA', np.nan)
parks_df['Longitude'] = parks_df['Longitude'].replace('NA', np.nan)
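
The `data-lat` and `data-lng` attributes are read as text, so the coordinate columns hold strings rather than numbers. A possible alternative to replacing `'NA'` by hand, sketched here on made-up values, is `pd.to_numeric` with `errors='coerce'`, which converts valid strings to floats and turns anything it cannot parse (including `'NA'`) into `NaN`. The `toy` DataFrame is illustrative only and is not part of the scraped data.

```python
import pandas as pd

# Made-up values standing in for the scraped coordinate columns
toy = pd.DataFrame({
    "Latitude": ["53.79", "NA", "53.84"],
    "Longitude": ["-1.75", "NA", "-1.82"],
})

# Convert text to floats; values that cannot be parsed become NaN
for column in ["Latitude", "Longitude"]:
    toy[column] = pd.to_numeric(toy[column], errors="coerce")

print(toy.dtypes)
print(toy["Latitude"].isna().sum(), "row(s) without coordinates")
```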


In [19]:
parks_gdf = gpd.GeoDataFrame(
    parks_df,
    geometry=gpd.points_from_xy(parks_df.Longitude, parks_df.Latitude),
    crs="EPSG:4326",
)

In [22]:
unique_park_features = list(parks_gdf['Park features'].explode().unique())
# Create columns for each unique item and populate with 1 or 0
for feature in unique_park_features:
    parks_gdf[feature] = parks_gdf['Park features'].apply(lambda x: 1 if feature in x else 0)

# Join each list of features into one comma-separated string
# (this keeps apostrophes that appear inside feature names)
parks_gdf['Park features str'] = parks_gdf['Park features'].apply(', '.join)
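
The indicator columns can also be built without an explicit loop. The following is a sketch on made-up data, not the notebook's method: it explodes the feature lists, one-hot encodes them with `pd.get_dummies`, and collapses the result back to one row per park. It also drops the `'NA'` placeholder so that it does not become a feature column; whether that is desirable for the real data is an assumption.

```python
import pandas as pd

# Made-up rows standing in for the scraped 'Park features' column
toy = pd.DataFrame({
    "Park Name": ["Park A", "Park B", "Park C"],
    "Park features": [["Cafe", "Toilets"], ["Toilets"], ["NA"]],
})

# One row per (park, feature) pair; the original row index is repeated
exploded = toy["Park features"].explode()

# One-hot encode, then take the max per park to get a 0/1 flag per feature
indicators = pd.get_dummies(exploded).groupby(level=0).max().astype(int)

# Drop the 'NA' placeholder column if it is present
indicators = indicators.drop(columns="NA", errors="ignore")

toy = toy.join(indicators)
print(toy[["Park Name", "Cafe", "Toilets"]])
```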

In [24]:
parks_gdf.drop(columns='Park features', inplace=True)

### Save out datasets

In [27]:
## Save out datasets

parks_df.to_csv('../data/data_downloads/bradford_scraped_parks_data.csv', index=False)
parks_gdf.to_file('../data/data_downloads/bradford_district_parks.geojson')