# Weather Forecast Data Collection
### Required packages

In [49]:
from fake_useragent import UserAgent
import requests, json
import re
import os
from time import sleep
from datetime import datetime, timedelta, date as dt_date
from random import uniform
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

## Location Data
We start by obtaining the 150 most popular cities on Accuweather. Their free API does not permit more than a few days of forecast information (and only 50 requests are permitted per day). An API key is required to run the following code the first time, after which the recorded data loads from a saved file, which I have included.

In [6]:
if os.path.exists('data/accuweather_cities.json'):
    with open("data/accuweather_cities.json","r") as f:
        response=json.loads(f.read())
else:
    with open("data/accuweather_api.txt", "r") as f:
        api_key = f.readline().strip()  # drop any trailing newline from the key file
        
    response = requests.get("http://dataservice.accuweather.com/locations/v1/topcities/150",
                                params = {'apikey':api_key })

    response = response.json()

    with open("data/accuweather_cities.json","w") as f:
        f.write(json.dumps(response))


In [7]:
len(response) #150 locations as expected

150

## Weather Data Collection
### Exploration
Given the locations, we wish to gather the available weather data from July 2020 until the end of August 2020. We will need to explore how to scrape this data, given that the forecast window provided by the API is too short. Take the following examples of URLs, which were obtained by navigating through the browser, and compare them to the data available via the API.

https://www.accuweather.com/en/bd/dhaka/28143/july-weather/28143

https://www.accuweather.com/en/bd/dhaka/28143/august-weather/28143

https://www.accuweather.com/en/gb/london/ec4a-2/july-weather/328328

https://www.accuweather.com/en/gb/london/ec4a-2/august-weather/328328

We note that the URLs for London also work if we substitute ec4a%202 for ec4a-2 (the former being consistent with the API). There are some differences between the two formats: Dhaka repeats the string 28143 at two points of the URL, whereas London does not have a repeated numerical string. Let's explore why.

In [42]:
print(response[0])

{'Version': 1, 'Key': '28143', 'Type': 'City', 'Rank': 10, 'LocalizedName': 'Dhaka', 'EnglishName': 'Dhaka', 'PrimaryPostalCode': '', 'Region': {'ID': 'ASI', 'LocalizedName': 'Asia', 'EnglishName': 'Asia'}, 'Country': {'ID': 'BD', 'LocalizedName': 'Bangladesh', 'EnglishName': 'Bangladesh'}, 'AdministrativeArea': {'ID': 'C', 'LocalizedName': 'Dhaka', 'EnglishName': 'Dhaka', 'Level': 1, 'LocalizedType': 'Division', 'EnglishType': 'Division', 'CountryID': 'BD'}, 'TimeZone': {'Code': 'BDT', 'Name': 'Asia/Dhaka', 'GmtOffset': 6.0, 'IsDaylightSaving': False, 'NextOffsetChange': None}, 'GeoPosition': {'Latitude': 23.71, 'Longitude': 90.407, 'Elevation': {'Metric': {'Value': 5.0, 'Unit': 'm', 'UnitType': 5}, 'Imperial': {'Value': 16.0, 'Unit': 'ft', 'UnitType': 0}}}, 'IsAlias': False, 'SupplementalAdminAreas': [], 'DataSets': ['AirQualityCurrentConditions', 'AirQualityForecasts']}


In [45]:
print(response[0]['Key'])
print(response[0]['PrimaryPostalCode'])
print(response[0]['Country']['ID'])

28143

BD


We obtain the two-letter country code and numerical key present in the Dhaka URL. There is no value for `PrimaryPostalCode`, but I've included this as it will be relevant in a second. We can do the same for London, finding its position in the list first and then printing the relevant entries.

In [43]:
regex = re.compile('London')

for i in range(150):
    if regex.match(response[i]['EnglishName']):
        print(i)
        break

8


In [46]:
print(response[8]['Key'])
print(response[8]['PrimaryPostalCode'])
print(response[8]['Country']['ID'])

328328
EC4A 2
GB


It appears that the URL makes use of the city name, country ID, primary postal code (if it exists), and key. If the primary postal code is an empty string, the key is simply used again in its place. There are two names, `LocalizedName` and `EnglishName`. We can check if there is ever a mismatch (since it's not clear which is used in the URL). According to the below there is not: for all cities in the top 150 list, the localised and English names are the same, as alleged by Accuweather.

In [44]:
for i in range(150):
    if response[i]['EnglishName'] != response[i]['LocalizedName']:
        print(i)
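
The URL rule inferred above can be written as one small function. This is only a sketch: `build_month_url` is a hypothetical helper that does not appear elsewhere in this notebook, and lower-casing and percent-encoding the path segments with `urllib.parse.quote` are assumptions based on the observation that `ec4a%202` works for London.

```python
from urllib.parse import quote

BASE = 'https://www.accuweather.com/en/{}/{}/{}/{}-weather/{}'

def build_month_url(country_id, name, postal_code, key, month):
    """Hypothetical helper: format a monthly forecast URL from API fields."""
    # Fall back to the location key when the postal code is an empty string
    code = postal_code or key
    # Assumption: segments are lower-cased and spaces are percent-encoded
    return BASE.format(country_id.lower(), quote(name.lower()),
                       quote(code.lower()), month, key)

dhaka_july = build_month_url('BD', 'Dhaka', '', '28143', 'july')
london_august = build_month_url('GB', 'London', 'EC4A 2', '328328', 'august')
print(dhaka_july)
print(london_august)
```

For Dhaka the key appears twice, while for London the postal code fills the third slot, which matches the two URL shapes seen in the browser.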

### Data Scraping
We now know where to find the data we need, so we will start to scrape. We define some auxiliary functions.

In [3]:
def month_in_string(string):
    """
    Returns first occurrence of a month name in a string. Filters to ensure
    that it is a complete name and not part of another word.
    """
    months = ['january', 'february', 'march', 'april', 'may', 'june', 'july', 
              'august', 'september', 'october','november', 'december']
    # Raw strings so that \b is a word boundary (not a backspace character);
    # the group makes the boundaries apply to every month, not just the ends.
    month_reg = re.compile(r'\b(?:' + '|'.join(months) + r')\b')
    url_month = month_reg.search(string).group()
    return url_month
    
def accudate_to_datetime(accudate, month):
    """
    Accuweather dates on month display will be of the form m/d or just d, depending on
    if the date is in the current calendar month: e.g. during July we get 6/28, 6/29, 6/30,
    1, 2, ... , 30, 31, 8/1, 8/2... Converts to datetime object.
    """
    if '/' in accudate:
        formatted_date = datetime.strptime(accudate + ' 2020', '%m/%d %Y')
    else:
        formatted_date = datetime.strptime(accudate +' '+ month +' 2020', '%d %B %Y')
        
    return formatted_date.date()
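
As a quick check of the two date formats described in the docstring, the sketch below parses one date of each kind. `MONTH_RE` and `parse_accudate` are hypothetical standalone names used only for this illustration, the sample labels are made up, and `%B` assumes an English locale.

```python
import re
from datetime import datetime

MONTHS = ['january', 'february', 'march', 'april', 'may', 'june', 'july',
          'august', 'september', 'october', 'november', 'december']
# Word boundaries around a group, so only whole month names match
MONTH_RE = re.compile(r'\b(?:' + '|'.join(MONTHS) + r')\b')

def parse_accudate(accudate, month, year=2020):
    """Hypothetical helper: convert 'm/d' or 'd' labels to a date object."""
    if '/' in accudate:
        # Label belongs to a neighbouring month and carries its own month
        return datetime.strptime(f'{accudate} {year}', '%m/%d %Y').date()
    # Label is a bare day number in the month named in the URL
    return datetime.strptime(f'{accudate} {month} {year}', '%d %B %Y').date()

url = 'https://www.accuweather.com/en/bd/dhaka/28143/july-weather/28143'
month = MONTH_RE.search(url).group()
previous_month_day = parse_accudate('6/28', month)
current_month_day = parse_accudate('15', month)
print(month, previous_month_day, current_month_day)
```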

We start by creating a list of tuples, each containing the country ID, English name, postal code, and key.

In [4]:
def make_tuples(data = response):
    tuple_list = []
    for i in data:
        tup = (i['Country']['ID'], i['EnglishName'], i['PrimaryPostalCode'], i['Key'])
        tuple_list.append(tup)
    return tuple_list

url_entries = make_tuples()

We will then populate a list with the URLs obtained by formatting the July and August URL templates with these tuples.

In [5]:
def append_url(tup, url_list):
    #handle absent post codes and adjust to lower case
    new_tup = (tup[0].lower(), tup[1].lower(), tup[2] or tup[3], tup[3]) 
    
    url_list.append('https://www.accuweather.com/en/{}/{}/{}/july-weather/{}'.format(*new_tup))
    url_list.append('https://www.accuweather.com/en/{}/{}/{}/august-weather/{}'.format(*new_tup))

In [6]:
url_list = []
for i in url_entries:
    append_url(i, url_list)

The timezones of the cities in the list fall within [GMT - 11, GMT + 12]. If we record between 11.00 and 12.00 GMT every day, then this will ensure that we do not skip any days due to change of date because of timezone. Ideally I'd have a better way of scheduling the data collection task, but I only have a small laptop at my disposal. I've checked that in July and August we don't have to worry about DST ruining this strategy, so if I'm diligent, there should not be any missing data. For a safer margin, I'll sort the URLs by GMT offset in descending order, so that the locations closest to changing date are collected first.

In [7]:
tz = [] # Timezone list
for i in range(150):
    tz.append((response[i]['TimeZone']['GmtOffset'],response[i]['EnglishName']))
    tz.append((response[i]['TimeZone']['GmtOffset'],response[i]['EnglishName'])) # Append for each of 2 months
    
name_url_list = [(name,url) for (tz,name),url in sorted(zip(tz,url_list), reverse = True)]
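
The timezone argument can be checked numerically. The sketch below assumes the 11.00 to 12.00 window is measured in GMT and uses whole-hour offsets only (an illustrative simplification); it confirms that a collection time in the middle of the window falls on the same calendar date at every offset from GMT - 11 to GMT + 12.

```python
from datetime import datetime, timedelta, timezone

# Assumed collection moment: halfway through the 11.00-12.00 GMT window
collection_utc = datetime(2020, 7, 21, 11, 30, tzinfo=timezone.utc)

local_dates = {}
for offset in range(-11, 13):  # GMT-11 ... GMT+12, whole hours only
    local = collection_utc.astimezone(timezone(timedelta(hours=offset)))
    local_dates[offset] = local.date()

# GMT+12 is at 23:30 and GMT-11 at 00:30, both still on the same date
same_date_everywhere = len(set(local_dates.values())) == 1
print(same_date_everywhere)
```

Locations with the largest positive offsets are the ones about to roll over to the next day, which is why collecting them first gives the widest margin.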

Finally, we come to the task of collecting and parsing the data. The below function will return dates, hi/lo temperature and an icon code indicating precipitation or lack thereof.

In [8]:
# When testing, requests were denied unless I used a real user agent
my_user = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19041'}

def forecast_parser(url):
    # Get raw data
    month = month_in_string(url)
    response = requests.get(url, headers = my_user).text
    soup = BeautifulSoup(response, 'html.parser')  # name the parser explicitly
    # Divide into days (including weather info)
    all_dates = soup.find_all("a", class_="monthly-daypanel")
    
    # If we just want temperatures: date_info = [date.get_text().split() for date in all_dates]   
    # If we want precipitation information, it is a bit more complicated, since this is only
    # provided by image icons. See appendix. Below implements this correctly
    
    regex = re.compile(r'\d+')
    date_info = []
    for date in all_dates:
        text = date.get_text().split() # Contains temperature and date information
        if date.img: # Only exists for present+future dates
            img = regex.search(date.img['data-src']).group() # Labels precipitation icons
            text.append(img)
        date_info.append(text)
    
    # Clean up list. 
    # 1. Change date to datetime object, temperatures to ints
    for date in date_info:
        date[0] = accudate_to_datetime(date[0], month)
        for i in range(1,len(date)):
            date[i]=date[i].replace('°','')
            try:
                date[i]=int(date[i])
            except ValueError:
                pass  # leave non-numeric text as a string; it is filtered out below
    # 2. Only keep data which is not a string (deletes historical weather data and text)
    # We do, however, append an empty string as a null value for absent precipitation data
    date_info_clean = [[x for x in date if not isinstance(x,str)]+[''] for date in date_info]
    
    return date_info_clean
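
The clean-up step is easier to follow on a single day. The sketch below reproduces it on made-up tokens (the exact token layout on the page and the word `'Actual'` are assumptions for illustration), showing why the fourth entry of each cleaned row is either an icon code or the empty-string placeholder. `clean_row` is a hypothetical name.

```python
from datetime import date

def clean_row(tokens, day):
    """Hypothetical helper: mimic the clean-up for one day panel.

    tokens: raw strings for the panel (date label first); day: parsed date.
    """
    row = [day]
    for tok in tokens[1:]:
        try:
            row.append(int(tok.replace('°', '')))
        except ValueError:
            continue  # non-numeric text is dropped
    return row + ['']  # trailing '' acts as the null precipitation value

# A future day: date label, high, low, and an icon code taken from the image
with_icon = clean_row(['21', '30°', '25°', '12'], date(2020, 7, 21))
# A day without an icon: index 3 falls through to the '' placeholder
without_icon = clean_row(['20', '29°', '24°', 'Actual'], date(2020, 7, 20))
print(with_icon[3], repr(without_icon[3]))
```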

We perform the main data collection task using the above function and the following two steps:

1. Create a CSV file with a header row covering all the dates from 21 July to the end of August. The columns represent collection date, location, data type (hi temp, lo temp or precipitation info) and then one column per prediction date.

2. Parse the URLs. Concatenate the information for July and August, then write 3 new rows per location to the CSV file.

In [107]:
# Creates file if it does not yet exist and formats its headers.

if not os.path.exists('data/weather_data.csv'):
    # Create CSV. Column labels. Probably some more efficient way to do with pandas.
    date1 = '2020-07-21'
    date2 = '2020-08-31'
    start = datetime.strptime(date1, '%Y-%m-%d')
    end = datetime.strptime(date2, '%Y-%m-%d')
    step = timedelta(days=1)
    date_string_list = []
    while start <= end:
        date_string_list.append(start.date())
        start += step
    date_string = ','.join(map(str,date_string_list))
    first_row = 'Collected,'+'Location,'+'Type,'+date_string+'\n'

    with open('data/weather_data.csv', 'w+') as f:
        f.write(first_row)
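
The same header can be built more compactly with a list comprehension over the date range. This is a sketch using only the standard library; the variable names are illustrative and nothing is written to disk.

```python
from datetime import date, timedelta

start, end = date(2020, 7, 21), date(2020, 8, 31)
# One entry per day, inclusive of both endpoints
days = [start + timedelta(days=n) for n in range((end - start).days + 1)]

header = ['Collected', 'Location', 'Type'] + [d.isoformat() for d in days]
first_row = ','.join(header) + '\n'
print(len(days), header[3], header[-1])
```

This gives 42 prediction-date columns after the three label columns.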

In [9]:
# Create three strings for each location:
def string_tuple(location, date_info_jul_raw, date_info_aug_raw):
    now = datetime.now()
    now_str = str(now)
    
    
    # First date to be recorded is jul21. Filter dates to ensure no double counting.
    jul21 = dt_date(2020,7,21)
    date_info_jul = [x for x in date_info_jul_raw if x[0].month == 7 and x[0]>= jul21]
    date_info_aug = [x for x in date_info_aug_raw if x[0].month == 8]

    date_info_comb = date_info_jul + date_info_aug
            
    hi_temps = ','.join([now_str, location, 'high']+[str(x[1]) for x in date_info_comb])
    lo_temps = ','.join([now_str, location, 'low']+[str(x[2]) for x in date_info_comb])
    precip = ','.join([now_str, location, 'precipitation']+[str(x[3]) for x in date_info_comb])
    return(hi_temps, lo_temps, precip)
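
Joining fields with `','` works as long as no field contains a comma. As a hedged alternative, the sketch below builds the same three rows and lets the standard `csv` module handle quoting. `wide_rows` is a hypothetical helper, and the location name and values are made up to show the quoting.

```python
import csv
import io
from datetime import date

def wide_rows(collected, location, day_rows):
    """Hypothetical helper: day_rows holds [date, high, low, icon] lists."""
    labels = [('high', 1), ('low', 2), ('precipitation', 3)]
    return [[collected, location, label] + [r[idx] for r in day_rows]
            for label, idx in labels]

sample = [[date(2020, 7, 21), 30, 25, 12], [date(2020, 7, 22), 31, 26, '']]
buffer = io.StringIO()  # stands in for the open CSV file
writer = csv.writer(buffer, lineterminator='\n')
writer.writerows(wide_rows('2020-07-21 11:30:00', 'Washington, D.C.', sample))
lines = buffer.getvalue().splitlines()
print('\n'.join(lines))
```

The name containing a comma is quoted automatically, and the missing icon code is written as an empty field.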
    
    

In [21]:
# Put it all together
missed_locations = []
with open('data/weather_data.csv', 'a') as f:
    for i in range(150):
        location = name_url_list[2*i][0]
        url_jul = name_url_list[2*i][1]
        url_aug = name_url_list[2*i + 1][1]

        sleep(uniform(3,10)) # Reduces the chance of rate-limiting.

        date_info_jul_raw = forecast_parser(url_jul)
        sleep(uniform(2,5))
        date_info_aug_raw = forecast_parser(url_aug)
        try:
            rows = string_tuple(location, date_info_jul_raw, date_info_aug_raw)
            for row in rows:  # separate name, so the loop index i is not shadowed
                f.write(row+'\n')
        except Exception:
            missed_locations.append(location)

In [22]:
#Check missing locations
missed_locations

['Wellington', 'Auckland', 'Pago Pago']

In [23]:
# Some places in extreme time zones are missing. We collect manually later in the day.
date_info_jul_raw = forecast_parser(url_jul)
sleep(uniform(2,5))
date_info_aug_raw = forecast_parser(url_aug)
rows = string_tuple(location, date_info_jul_raw, date_info_aug_raw)
with open('data/weather_data.csv', 'a') as f:
    for i in rows:
        f.write(i+'\n')

In case of crash, resume from point:

In [17]:
missed_locations = []
def data_from_point(location):
    # Find the position of the first location that still needs collecting
    start = None
    for i in range(len(name_url_list)):
        if name_url_list[i][0] == location:
            start = i//2
            break
    if start is None:
        raise ValueError('Unknown location: {}'.format(location))
    with open('data/weather_data.csv', 'a') as f:
        for i in range(start, 150):
            location = name_url_list[2*i][0]
            url_jul = name_url_list[2*i][1]
            url_aug = name_url_list[2*i + 1][1]

            sleep(uniform(3,10))

            date_info_jul_raw = forecast_parser(url_jul)
            sleep(uniform(2,5))
            date_info_aug_raw = forecast_parser(url_aug)
            try:
                rows = string_tuple(location, date_info_jul_raw, date_info_aug_raw)
                for row in rows:  # separate name, so the loop index i is not shadowed
                    f.write(row+'\n')
            except Exception:
                missed_locations.append(location)

In [18]:
data_from_point(location)
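
Resuming could also be driven by the CSV itself rather than by remembering where the run stopped. The sketch below is one possible approach, not something used for the collected data: `locations_collected_on` is a hypothetical helper, and the sample rows and values are made up. It reports which locations already have all three rows for a given collection day, so the rest can be re-queued.

```python
import csv
import io

def locations_collected_on(csv_text, day):
    """Hypothetical helper: locations with all three row types for `day`."""
    seen = {}
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) < 3:
            continue  # ignore blank or truncated lines
        collected, location, kind = row[:3]
        if collected.startswith(day):
            seen.setdefault(location, set()).add(kind)
    return {loc for loc, kinds in seen.items()
            if kinds == {'high', 'low', 'precipitation'}}

sample = (
    'Collected,Location,Type,2020-07-21\n'
    '2020-07-21 11:05:00.000000,Dhaka,high,33\n'
    '2020-07-21 11:05:00.000000,Dhaka,low,27\n'
    '2020-07-21 11:05:00.000000,Dhaka,precipitation,12\n'
    '2020-07-21 11:06:00.000000,London,high,24\n'  # run stopped mid-location
)
done = locations_collected_on(sample, '2020-07-21')
print(done)
```

Here only Dhaka counts as complete; London has a single row and would be collected again.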

## Weaknesses

- Accuweather's top 150 cities list is biased towards larger cities (but does include some smaller ones).
- The data only spans about 40 days, and hence doesn't span across multiple seasons for any given location.
- The data was collected manually using a laptop, so although I tried to be as consistent as possible, it was not always possible to collect at the same time every day, and occasionally some data was missed.
- There would be better ways to automate collecting missed data, and resuming after crashes. This should be investigated if I want to improve data quality.

# Appendix

Precipitation data can be deduced from the icon number codes. Accuweather icons, their corresponding numbers and text summary interpretations are available here:
https://developer.accuweather.com/weather-icons. Below I collect these and write them to a file.

In [9]:
my_user = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19041'}
response = requests.get('https://developer.accuweather.com/weather-icons', headers = my_user).text
soup = BeautifulSoup(response, 'html.parser')  # name the parser explicitly

In [46]:
def preciptation_parser(soup):
    table_rows = soup.find_all('tr')[1:]

    icon_weather_dict = {}
    for row in table_rows:
        columns = row.find_all('td')
        icon_num = int(columns[0].text.strip())
        weather_text = columns[-1].text.strip()
        icon_weather_dict[icon_num] = weather_text

    return icon_weather_dict


In [48]:
precip_json = json.dumps(preciptation_parser(soup))
# Write to data/precip.json (only if it does not already exist)
if not os.path.exists('data/precip.json'):
    with open("data/precip.json", "w") as f:
        f.write(precip_json)
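
One detail to keep in mind when reading this file back: JSON object keys are always strings, so the integer icon codes are saved as `"1"`, `"12"`, and so on. The sketch below uses two illustrative sample entries and an in-memory round trip instead of the real file.

```python
import json

icon_weather = {1: 'Sunny', 12: 'Showers'}  # illustrative sample entries

saved = json.dumps(icon_weather)
loaded = json.loads(saved)
loaded_keys = list(loaded)  # keys come back as strings, not ints

# Convert the keys back to ints before looking up scraped icon codes
restored = {int(code): text for code, text in loaded.items()}
print(loaded_keys, restored[12])
```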