# Lab 2.1 - Weather Data Around Winona

In this lab, we will download and combine a decade's worth of weather data from NOAA, focusing on weather stations within 500 miles of Winona.

Here is the outline of the basic process.

1. Install and investigate useful packages.
2. Find all weather stations in proximity to Winona.
3. Use a single station to prototype our tools.
4. Automate the process of downloading and uncompressing data from all stations of interest.
5. Output the results to a CSV file.

## Problem 1 - Install and investigate useful tools.

First, you should install and investigate the following tools.

1. **`wget`** is a tool for programmatically downloading data files from the web on the command line. There is a Python wrapper for this tool that you can install with `pip` as shown below.
2. **`geopy`** is a package that, among other things, implements a function for computing distances between two lat-long pairs. Again, install this package with `pip` as shown below.
3. **`gzip`** is part of the standard Python library and reads and writes gzip-compressed (`.gz`) files, so it needs no installation.

In [36]:
%pip install wget

Note: you may need to restart the kernel to use updated packages.


In [37]:
%pip install geopy

Note: you may need to restart the kernel to use updated packages.


#### Task 1.1 - Investigate using `wget` to download a file.

Read the help/documentation on `wget` to figure out how to download the following data file [Some random data file from STAT 210] into the `./data` sub-folder.

[https://github.com/yardsale8/STAT_210/raw/refs/heads/main/data/sars1.csv](https://github.com/yardsale8/STAT_210/raw/refs/heads/main/data/sars1.csv)

In [38]:
import wget
url = 'https://github.com/yardsale8/STAT_210/raw/refs/heads/main/data/sars1.csv'
file_path = wget.download(url, out='./data')

#### Task 1.2 - Investigate using `geopy.distance.distance` to compute a distance in miles.

1. Import the `distance` function from the `geopy.distance` submodule.
2. Use Wikipedia to find the lat-long coordinates of Winona and Rochester MN.
3. Use `distance` to compute the distance between Winona and Rochester.
4. Use some other source (e.g., Google Maps) to check the answer.

In [39]:
from geopy.distance import distance

In [40]:
winona_coords = (44.0554, -91.6664)
rochester_coords = (44.0121, -92.4802)

In [41]:
(distance_miles := distance(winona_coords, rochester_coords).miles)

40.64494286306752
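
As a rough cross-check on the value above, the haversine (great-circle) formula can be computed with the standard library alone. This is a sketch, not what `geopy` does internally: `geopy.distance.distance` uses an ellipsoidal model, while the code below assumes a spherical Earth with an assumed mean radius of 3958.8 miles, so the two results should agree only to within a fraction of a percent. The name `haversine_miles` is made up for this illustration.

```python
from math import asin, cos, radians, sin, sqrt

# Assumed mean Earth radius in miles (spherical approximation, not geopy's ellipsoid).
EARTH_RADIUS_MILES = 3958.8


def haversine_miles(coord_a, coord_b):
    """Great-circle distance in miles between two (lat, lon) pairs given in degrees."""
    lat1, lon1 = map(radians, coord_a)
    lat2, lon2 = map(radians, coord_b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(h))


winona_coords = (44.0554, -91.6664)
rochester_coords = (44.0121, -92.4802)

check_miles = haversine_miles(winona_coords, rochester_coords)
print(round(check_miles, 1))
```

If the printed value is far from the `geopy` result, the most likely culprit is swapped latitude and longitude.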

#### Task 1.3 - Investigate `gzip`

The yearly NOAA data is compressed as `.gz` files, which need to be uncompressed using `gzip`.  Explore the `gzip` module by

1. Exploring the documentation/help for the `gzip` module,
2. Using `wget` to download the following link into the `./data` folder, and
3. Using `gzip` to uncompress this file.
4. Inspect the data in your list, which should be of type `bytes`.  Use a comprehension with the expression `l.decode('utf-8')` to convert this to a list of strings.
5. Write the uncompressed lines to an output file using `with open(path, 'w') as out` and the `writelines` method of `out`.  

**Link.** [https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year/1750.csv.gz](https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year/1750.csv.gz)

In [42]:
weather_url = 'https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year/1750.csv.gz'
weather_path = wget.download(weather_url, out='./data')

In [43]:
import gzip
with gzip.open(weather_path, 'rb') as f_in:
    uncompressed_data = f_in.readlines()
uncompressed_data[:10]

[b'ASN00002061,17500201,PRCP,56,,,a,\n',
 b'ASN00003014,17500201,PRCP,0,,,a,\n',
 b'ASN00003059,17500201,PRCP,0,,,a,\n',
 b'ASN00003088,17500201,PRCP,0,,,a,\n',
 b'ASN00009015,17500201,PRCP,0,,,a,\n',
 b'ASN00009193,17500201,TMIN,187,,,a,\n',
 b'ASN00009193,17500201,PRCP,0,,,a,\n',
 b'ASN00009500,17500201,DATX,2,,,a,\n',
 b'ASN00009500,17500201,MDTX,210,,,a,\n',
 b'ASN00009592,17500201,DATX,4,,,a,\n']

In [44]:
(weather_lines := [l.decode('utf-8') for l in uncompressed_data])[:10]

['ASN00002061,17500201,PRCP,56,,,a,\n',
 'ASN00003014,17500201,PRCP,0,,,a,\n',
 'ASN00003059,17500201,PRCP,0,,,a,\n',
 'ASN00003088,17500201,PRCP,0,,,a,\n',
 'ASN00009015,17500201,PRCP,0,,,a,\n',
 'ASN00009193,17500201,TMIN,187,,,a,\n',
 'ASN00009193,17500201,PRCP,0,,,a,\n',
 'ASN00009500,17500201,DATX,2,,,a,\n',
 'ASN00009500,17500201,MDTX,210,,,a,\n',
 'ASN00009592,17500201,DATX,4,,,a,\n']

In [45]:
with open('./data/weather_data.csv', 'w') as out:
    out.writelines(weather_lines)
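
The read, decode, and write steps above can be tried offline on a tiny file. The sketch below builds its own two-row `.gz` file in a temporary folder (the rows are copied from the output shown earlier; the file name `sample.csv.gz` is made up) and compares binary mode plus a manual `decode` against text mode (`'rt'`), where `gzip` does the decoding.

```python
import gzip
import os
import tempfile

sample_rows = [
    "ASN00002061,17500201,PRCP,56,,,a,\n",
    "ASN00003014,17500201,PRCP,0,,,a,\n",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    gz_path = os.path.join(tmp_dir, "sample.csv.gz")  # hypothetical file name

    # Create a small compressed file so the example needs no network access.
    with gzip.open(gz_path, "wb") as f_out:
        f_out.writelines(row.encode("utf-8") for row in sample_rows)

    # Binary mode: each line comes back as bytes and must be decoded.
    with gzip.open(gz_path, "rb") as f_in:
        byte_lines = f_in.readlines()
    decoded_lines = [line.decode("utf-8") for line in byte_lines]

    # Text mode: gzip decodes while reading and returns str directly.
    with gzip.open(gz_path, "rt", encoding="utf-8") as f_in:
        text_lines = f_in.readlines()

print(type(byte_lines[0]).__name__, type(text_lines[0]).__name__)
```

Text mode saves the comprehension step; binary mode is useful when you want to inspect the raw bytes first, as this task asks.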

## Problem 2 - Find all stations within 500 miles of Winona, MN.

The file linked below contains information about all stations tracked by NOAA.  

*Main folder:* https://www.ncei.noaa.gov/pub/data/ghcn/daily/

*Station txt file:* https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt

*Note.* While it would be easier to use the CSV version of the station file, you should use the TXT version here (for practice).

**Your tasks** Our goal is to get a list of stations that are within 500 miles of Winona.  Do this as follows.

1. Use `wget` to download the station information into the `./data` folder.
2. Use `with` to read the lines of this file.
3. At this point, the lines are strings in a fixed-width format separated by whitespace.  Use a list comprehension with the string split method to split the raw lines (strings) into a list of entries.
4. There are three entries of interest, the station ID and the lat-long coordinates of the station.  Inspect the file to determine the index for these three entries.
5. We want to transform the lines (currently a list of strings) into a record, which is a `dict` with good names for the entries as keys and the values representing the data in an appropriate type (string for station ID, `float` for the lat-long).  Use a comprehension to create a list of records as described.
6. Use another comprehension to apply a filter to the stations, keeping only those within 500 miles of Winona.

In [46]:
station_url = 'https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt'
station_path = wget.download(station_url, out='./data')

In [47]:
with open(station_path, 'r') as file:
    lines = file.readlines()
lines[:10]

['ACW00011604  17.1167  -61.7833   10.1    ST JOHNS COOLIDGE FLD                       \n',
 'ACW00011647  17.1333  -61.7833   19.2    ST JOHNS                                    \n',
 'AE000041196  25.3330   55.5170   34.0    SHARJAH INTER. AIRP            GSN     41196\n',
 'AEM00041194  25.2550   55.3640   10.4    DUBAI INTL                             41194\n',
 'AEM00041217  24.4330   54.6510   26.8    ABU DHABI INTL                         41217\n',
 'AEM00041218  24.2620   55.6090  264.9    AL AIN INTL                            41218\n',
 'AF000040930  35.3170   69.0170 3366.0    NORTH-SALANG                   GSN     40930\n',
 'AFM00040938  34.2100   62.2280  977.2    HERAT                                  40938\n',
 'AFM00040948  34.5660   69.2120 1791.3    KABUL INTL                             40948\n',
 'AFM00040990  31.5000   65.8500 1010.0    KANDAHAR AIRPORT                       40990\n']

In [48]:
(split_lines := [line.split() for line in lines if line.strip()])[:10]

[['ACW00011604',
  '17.1167',
  '-61.7833',
  '10.1',
  'ST',
  'JOHNS',
  'COOLIDGE',
  'FLD'],
 ['ACW00011647', '17.1333', '-61.7833', '19.2', 'ST', 'JOHNS'],
 ['AE000041196',
  '25.3330',
  '55.5170',
  '34.0',
  'SHARJAH',
  'INTER.',
  'AIRP',
  'GSN',
  '41196'],
 ['AEM00041194', '25.2550', '55.3640', '10.4', 'DUBAI', 'INTL', '41194'],
 ['AEM00041217',
  '24.4330',
  '54.6510',
  '26.8',
  'ABU',
  'DHABI',
  'INTL',
  '41217'],
 ['AEM00041218', '24.2620', '55.6090', '264.9', 'AL', 'AIN', 'INTL', '41218'],
 ['AF000040930',
  '35.3170',
  '69.0170',
  '3366.0',
  'NORTH-SALANG',
  'GSN',
  '40930'],
 ['AFM00040938', '34.2100', '62.2280', '977.2', 'HERAT', '40938'],
 ['AFM00040948', '34.5660', '69.2120', '1791.3', 'KABUL', 'INTL', '40948'],
 ['AFM00040990',
  '31.5000',
  '65.8500',
  '1010.0',
  'KANDAHAR',
  'AIRPORT',
  '40990']]
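
The split output above is enough for this lab because the ID, latitude, and longitude are always the first three tokens, but it breaks station names into pieces. If the name were needed too, slicing by column position is one alternative. The sketch below is only an illustration: the offsets in `FIELD_SLICES` are an assumption based on the fixed-width layout described in the GHCN-Daily readme and were checked only against the sample lines shown earlier, and `parse_station_line` is a made-up helper name.

```python
# 0-based slices; ASSUMED from the GHCN-Daily readme layout, verify before relying on them.
FIELD_SLICES = {
    "station_id": slice(0, 11),
    "latitude": slice(12, 20),
    "longitude": slice(21, 30),
    "name": slice(41, 71),
}


def parse_station_line(line):
    """Turn one fixed-width station line into a record with typed values."""
    raw = {key: line[cols].strip() for key, cols in FIELD_SLICES.items()}
    return {
        "station_id": raw["station_id"],
        "latitude": float(raw["latitude"]),
        "longitude": float(raw["longitude"]),
        "name": raw["name"],
    }


# Two lines copied from the station file output shown earlier.
sample_lines = [
    "ACW00011604  17.1167  -61.7833   10.1    ST JOHNS COOLIDGE FLD                       \n",
    "AEM00041194  25.2550   55.3640   10.4    DUBAI INTL                             41194\n",
]

records = [parse_station_line(line) for line in sample_lines]
print(records[0]["name"])
```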

In [49]:
(stations := [
    {
        'station_id': parts[0],  
        'latitude': float(parts[1]),  
        'longitude': float(parts[2])
    }
    for parts in split_lines
])[:10]

[{'station_id': 'ACW00011604', 'latitude': 17.1167, 'longitude': -61.7833},
 {'station_id': 'ACW00011647', 'latitude': 17.1333, 'longitude': -61.7833},
 {'station_id': 'AE000041196', 'latitude': 25.333, 'longitude': 55.517},
 {'station_id': 'AEM00041194', 'latitude': 25.255, 'longitude': 55.364},
 {'station_id': 'AEM00041217', 'latitude': 24.433, 'longitude': 54.651},
 {'station_id': 'AEM00041218', 'latitude': 24.262, 'longitude': 55.609},
 {'station_id': 'AF000040930', 'latitude': 35.317, 'longitude': 69.017},
 {'station_id': 'AFM00040938', 'latitude': 34.21, 'longitude': 62.228},
 {'station_id': 'AFM00040948', 'latitude': 34.566, 'longitude': 69.212},
 {'station_id': 'AFM00040990', 'latitude': 31.5, 'longitude': 65.85}]

In [50]:
(nearby_stations := [
    station for station in stations 
    if distance(winona_coords, (station['latitude'], station['longitude'])).miles <= 25])[:10]

[{'station_id': 'US1MNHS0001', 'latitude': 43.835, 'longitude': -91.314},
 {'station_id': 'US1MNHS0006', 'latitude': 43.742, 'longitude': -91.4369},
 {'station_id': 'US1MNHS0007', 'latitude': 43.8349, 'longitude': -91.3138},
 {'station_id': 'US1MNHS0008', 'latitude': 43.8381, 'longitude': -91.3079},
 {'station_id': 'US1MNHS0009', 'latitude': 43.8387, 'longitude': -91.3044},
 {'station_id': 'US1MNHS0012', 'latitude': 43.8253, 'longitude': -91.3209},
 {'station_id': 'US1MNHS0013', 'latitude': 43.7817, 'longitude': -91.3882},
 {'station_id': 'US1MNHS0022', 'latitude': 43.7921, 'longitude': -91.5856},
 {'station_id': 'US1MNHS0023', 'latitude': 43.7122, 'longitude': -91.6541},
 {'station_id': 'US1MNOL0038', 'latitude': 44.0762, 'longitude': -92.0979}]

In [51]:
(num_stations := len(nearby_stations))

63
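
Computing a geodesic distance for every station in the world is the slow part of the filter. One optional speed-up, sketched below, is a cheap latitude prefilter: a station whose latitude differs from Winona's by more than `radius / 68` degrees cannot be within the radius, so it can be dropped before calling `distance`. The constant 68 miles per degree is an assumed conservative lower bound, and `could_be_within` is a hypothetical helper, not part of `geopy`. Stations that pass still need the exact `distance` check.

```python
# Assumed conservative lower bound on miles per degree of latitude.
MIN_MILES_PER_DEGREE_LAT = 68.0


def could_be_within(center, station, radius_miles):
    """Necessary (not sufficient) condition for a station to lie within the radius."""
    max_lat_gap = radius_miles / MIN_MILES_PER_DEGREE_LAT
    return abs(station["latitude"] - center[0]) <= max_lat_gap


winona_coords = (44.0554, -91.6664)

# Three records copied from the outputs shown earlier.
sample_stations = [
    {"station_id": "US1MNHS0001", "latitude": 43.835, "longitude": -91.314},
    {"station_id": "ACW00011604", "latitude": 17.1167, "longitude": -61.7833},
    {"station_id": "AFM00040938", "latitude": 34.21, "longitude": 62.228},
]

# Same radius as the filter cell above; only these candidates need the exact check.
candidates = [s for s in sample_stations if could_be_within(winona_coords, s, 25)]
print([s["station_id"] for s in candidates])
```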

In [52]:
# data_path = "./data/ghcnd-stations-reference.txt"
# with open(data_path, 'r') as file:
#     lines = file.readlines()
# split_lines = [line.split() for line in lines if line.strip()]
# stations = [
#     {
#         'station_id': parts[0],
#         'latitude': float(parts[1]),
#         'longitude': float(parts[2])
#     }
#     for parts in split_lines
# ]
# nearby_stations = [
#     station for station in stations
#     if distance(winona_coords, (station['latitude'], station['longitude'])).miles <= 25
# ]
# (num_stations := len(nearby_stations))

***This code produced the required 78 nearby stations until NOAA changed the entries in this file on 2025-01-30 16:33 (the timestamp shown on the website). The output now includes only 63 stations. For reference, the old station file is included in the data folder as `./data/ghcnd-stations-reference.txt`, and the commented-out code above reruns the analysis against it if needed.***

## Problem 3 - Prototype downloading and uncompressing a station file.

Before we download and uncompress all the stations of interest, let's practice on one station file.


1. Copy the URL for some station and store it as a variable named `url`.
2. Write `lambda` functions that extract each of the following from the station `url`: compressed file name, compressed file path (e.g., `./data/...`), and uncompressed file path (e.g., `./data/...`).
3. Write a `lambda` function that extracts
4. Use `wget` to download this station's data.
5. Use `gzip` to uncompress the data.
6. Write the data to an output file.

Your code should have the following shape:

```{Python}
wget.download(...)
with gzip.open(...) as f:
    with open(..., 'w') as out:
        lines = f.readlines()
        out.writelines(lines)
```

You should be using your helper functions to, in part, fill in the `...`

In [53]:
url = 'https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/ACW00011604.csv.gz'

In [54]:
compressed_filename = lambda url: url.split('/')[-1]  
compressed_filepath = lambda url: f"./data/{compressed_filename(url)}" 
uncompressed_filepath = lambda url: compressed_filepath(url).replace('.gz', '') 

In [55]:
compressed_filename(url)

'ACW00011604.csv.gz'

In [56]:
compressed_filepath(url)

'./data/ACW00011604.csv.gz'

In [57]:
uncompressed_filepath(url)

'./data/ACW00011604.csv'
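
The same three helpers can also be written as named functions using the standard library's URL tools. This is one possible alternative sketch, not a required form for the lab: the function names and `DATA_DIR` are made up here, and `str.removesuffix` needs Python 3.9 or later. Unlike `.replace('.gz', '')`, `removesuffix` strips the extension only at the end of the path.

```python
import posixpath
from urllib.parse import urlparse

DATA_DIR = "./data"  # same folder the lab uses


def compressed_name(url):
    """File name at the end of the URL path, e.g. 'ACW00011604.csv.gz'."""
    return posixpath.basename(urlparse(url).path)


def compressed_path(url):
    """Local path for the downloaded .gz file."""
    return f"{DATA_DIR}/{compressed_name(url)}"


def uncompressed_path(url):
    """Local path for the extracted CSV (requires Python 3.9+ for removesuffix)."""
    return compressed_path(url).removesuffix(".gz")


url = "https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/ACW00011604.csv.gz"
print(compressed_name(url), compressed_path(url), uncompressed_path(url))
```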

In [58]:
headers = "ID,YEAR/MONTH/DAY,ELEMENT,DATA VALUE,M-FLAG,Q-FLAG,S-FLAG,OBS-TIME\n"

In [59]:
wget.download(url, compressed_filepath(url))
with gzip.open(compressed_filepath(url), 'rt') as f: 
    with open('./data/output_file_example_station.csv', 'w') as out: 
        out.write(headers)
        lines = f.readlines()
        out.writelines(lines) 
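
For large station files, holding every line in a list is not strictly necessary. A possible variation, sketched below under the assumption that the file is plain UTF-8 text, streams the decompressed text straight into the output with `shutil.copyfileobj`. The helper name `extract_with_header` and the temporary file names are made up, and the demo builds its own small `.gz` file (rows copied from the 1750 output shown earlier) so it runs without a download.

```python
import gzip
import os
import shutil
import tempfile

headers = "ID,YEAR/MONTH/DAY,ELEMENT,DATA VALUE,M-FLAG,Q-FLAG,S-FLAG,OBS-TIME\n"


def extract_with_header(gz_path, csv_path, header_line):
    """Write header_line, then stream the decompressed text into csv_path."""
    with gzip.open(gz_path, "rt", encoding="utf-8") as f_in:
        with open(csv_path, "w", encoding="utf-8", newline="") as out:
            out.write(header_line)
            shutil.copyfileobj(f_in, out)  # copies in chunks, no list of lines


sample_rows = "ASN00002061,17500201,PRCP,56,,,a,\nASN00003014,17500201,PRCP,0,,,a,\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    gz_path = os.path.join(tmp_dir, "station.csv.gz")  # hypothetical file names
    csv_path = os.path.join(tmp_dir, "station.csv")
    with gzip.open(gz_path, "wb") as f_out:
        f_out.write(sample_rows.encode("utf-8"))

    extract_with_header(gz_path, csv_path, headers)
    with open(csv_path, encoding="utf-8", newline="") as f:
        result = f.read()

print(result.splitlines()[0])
```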

## Problem 4 - Build the station URLs and download the files.

**Tasks.** Now you need to build URLs for all stations of interest and download their data.

1. Use a comprehension to extract the stations of interest into a list.
2. Investigate the structure of the files stored in the `by_station` folder (see main folder link above).
3. Use a comprehension and an `f`-string to build a list of URLs for all stations of interest.
4. Use `wget` to download the data for the stations of interest into the data folder.
5. Use `gzip` to uncompress the files.
6. Decode the `bytes` to `str` using `utf-8`.
7. Use the append mode `"a"` of `open` with `writelines` to append the data in each file to your output file.

While we usually avoid using a `for` loop, we make an exception for lengthy I/O code.  To accomplish steps 4 through 7, use a `for` loop with the following shape.

```{Python}
for url in station_urls:
    wget.download(...)
    with gzip.open(...) as f:
        with open(..., 'a') as out:
            lines = f.readlines()
            ... # Convert lines to strings here
            out.writelines(lines)
    print(f"Downloaded and extracted the data for {url}")
```

Note that the code inside the loop should resemble the code from the previous step.

In [60]:
(stations_of_interest := [station["station_id"] for station in nearby_stations])[:10]

['US1MNHS0001',
 'US1MNHS0006',
 'US1MNHS0007',
 'US1MNHS0008',
 'US1MNHS0009',
 'US1MNHS0012',
 'US1MNHS0013',
 'US1MNHS0022',
 'US1MNHS0023',
 'US1MNOL0038']

In [61]:
base_url = "https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/"
(station_urls := [f"{base_url}{station}.csv.gz" for station in stations_of_interest])[:10]

['https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNHS0001.csv.gz',
 'https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNHS0006.csv.gz',
 'https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNHS0007.csv.gz',
 'https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNHS0008.csv.gz',
 'https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNHS0009.csv.gz',
 'https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNHS0012.csv.gz',
 'https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNHS0013.csv.gz',
 'https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNHS0022.csv.gz',
 'https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNHS0023.csv.gz',
 'https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNOL0038.csv.gz']

In [62]:
output_file = "./data/stations_of_interest_winona_weather_data.csv"
with open(output_file, 'w', encoding='utf-8') as out:
    out.write(headers)

In [63]:
import urllib.error

for url in station_urls: 
    try:
        compressed_path = compressed_filepath(url)
        wget.download(url, compressed_path)
        with gzip.open(compressed_path, 'rt', encoding='utf-8') as f:
            with open(output_file, 'a', encoding='utf-8') as out:
                lines = f.readlines()
                out.writelines(lines)
        print(f"Downloaded and extracted the data for {url}")
    
    except urllib.error.HTTPError as e:
        print(f"HTTP Error {e.code} for {url}: {e.reason}")
    except urllib.error.URLError as e:
        print(f"URL Error for {url}: {e.reason}")
    except gzip.BadGzipFile:
        print(f"Error: The file downloaded from {url} is not a valid gzip file.")
    except Exception as e:
        print(f"An unexpected error occurred for {url}: {e}")

Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNHS0001.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNHS0006.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNHS0007.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNHS0008.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNHS0009.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNHS0012.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNHS0013.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNHS0022.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_s

***The two failed downloads (station IDs US1WIBF0010 and US1WILC0038) may be due to reporting or data-availability issues at NOAA, since both files are also missing from the `by_station` folder on the official website.***
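
One way to confirm which stations actually made it into the combined file is to count rows per station ID and compare against the list of expected IDs. The sketch below is a hedged illustration: `rows_per_station` and `missing_stations` are made-up helper names, and the demo file contains placeholder rows invented for the example rather than real NOAA observations.

```python
import csv
import os
import tempfile
from collections import Counter


def rows_per_station(csv_path):
    """Count data rows per station ID (first column), skipping the header row."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header line
        return Counter(row[0] for row in reader if row)


def missing_stations(expected_ids, csv_path):
    """Expected station IDs that have no rows in the combined file."""
    counts = rows_per_station(csv_path)
    return [station_id for station_id in expected_ids if station_id not in counts]


headers = "ID,YEAR/MONTH/DAY,ELEMENT,DATA VALUE,M-FLAG,Q-FLAG,S-FLAG,OBS-TIME\n"
# Placeholder rows invented for this demo, not real observations.
demo_rows = [
    "US1MNHS0001,20200101,PRCP,0,,,N,\n",
    "US1MNHS0001,20200102,PRCP,5,,,N,\n",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "combined.csv")  # hypothetical file name
    with open(demo_path, "w", encoding="utf-8", newline="") as out:
        out.write(headers)
        out.writelines(demo_rows)

    counts = rows_per_station(demo_path)
    missing = missing_stations(["US1MNHS0001", "US1WIBF0010"], demo_path)

print(missing)
```

In the lab itself, the same two calls would take `stations_of_interest` and `output_file` as arguments.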

### Code not used (it was already provided in notebook):

In [64]:
fake_station = "A123456789"
make_fake_url = lambda s: f"https://my_fake_website.cool/{s}"

make_fake_url(fake_station)

'https://my_fake_website.cool/A123456789'

In [None]:
my_fake_stations =[f'A{i}' for i in range(10)]

(my_fake_urls := [make_fake_url(s) for s in my_fake_stations])

['https://my_fake_website.cool/A0',
 'https://my_fake_website.cool/A1',
 'https://my_fake_website.cool/A2',
 'https://my_fake_website.cool/A3',
 'https://my_fake_website.cool/A4',
 'https://my_fake_website.cool/A5',
 'https://my_fake_website.cool/A6',
 'https://my_fake_website.cool/A7',
 'https://my_fake_website.cool/A8',
 'https://my_fake_website.cool/A9']