# Lab 2.1 - Weather Data Around Winona

In this lab, we will download and combine a decade's worth of weather data from NOAA, focusing on weather stations within 500 miles of Winona.

Here is the outline of the basic process.

1. Install and investigate useful packages.
2. Find all weather stations in proximity to Winona.
3. Use a single station to prototype our tools.
4. Automate the process of downloading and uncompressing data from all stations of interest.
5. Output the results to a CSV file.

## Problem 1 - Install and investigate useful tools.

First, you should install and investigate the following tools.

1. **`wget`** is a tool for programmatically downloading data files from the web on the command line.  There is a Python package of the same name that you can install with `pip` as shown below.
2. **`geopy`** is a package that, among other things, implements a function for computing distances between two lat-long pairs. Again, install this package with `pip` as shown below.
3. **`gzip`** is part of the standard Python library and provides functions for compressing and uncompressing `.gz` files.

In [1]:
%pip install wget

Note: you may need to restart the kernel to use updated packages.


In [2]:
%pip install geopy

Note: you may need to restart the kernel to use updated packages.


In [4]:
import wget
import geopy

#### Task 1.1 - Investigate using `wget` to download a file.

Read the help/documentation on `wget` to figure out how to download the following data file (a sample data file from STAT 210) into the `./data` sub-folder.

[https://github.com/yardsale8/STAT_210/raw/refs/heads/main/data/sars1.csv](https://github.com/yardsale8/STAT_210/raw/refs/heads/main/data/sars1.csv)

In [5]:
# help(wget)

In [6]:
# Your code here

In [7]:
# !dir

In [8]:
url = "https://github.com/yardsale8/STAT_210/raw/refs/heads/main/data/sars1.csv"
output_path = "./data/sars1.csv"  # My file path

filename = wget.download(url, out=output_path)

print(f"File downloaded to: {filename}")

File downloaded to: ./data/sars1.csv


#### Task 1.2 - Investigate using `geopy.distance.distance` to compute a distance in miles.

1. Import the `distance` function from the `geopy.distance` submodule.
2. Use Wikipedia to find the lat-long coordinates of Winona and Rochester MN.
3. Use `distance` to compute the distance between Winona and Rochester.
4. Use some other source (e.g., Google Maps) to check the answer.
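
One way to carry out the check in step 4 without leaving Python is the haversine formula, which gives the great-circle distance on a sphere. The sketch below is only a sanity check: the function name `haversine_miles` and the radius constant are assumptions of this example and are not part of `geopy`. Because it treats the Earth as a sphere, while `geopy.distance.distance` uses an ellipsoidal model by default, expect the two answers to differ slightly.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumed constant for this sketch


def haversine_miles(p, q):
    """Great-circle distance in miles between two (lat, lon) pairs given in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


winona = (44.050556, -91.668333)
rochester = (44.023333, -92.461389)

check_miles = haversine_miles(winona, rochester)
print(round(check_miles, 1))
```

If this value and the `geopy` value disagree by more than a mile or so, the most likely culprit is a swapped latitude and longitude.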

In [9]:
# Your code here

In [10]:
from geopy import distance

In [11]:
# Winona: 44.050556, -91.668333
# Rochester: 44.023333, -92.461389

In [12]:
#help(distance)

In [13]:
winona = (44.050556, -91.668333)
rochester = (44.023333, -92.461389)

print(distance.distance(winona, rochester).miles)

39.54418575388878


#### Task 1.3 - Investigate `gzip`

The yearly NOAA data is compressed as `.gz` files, which need to be uncompressed using `gzip`.  Explore the `gzip` module by

1. Exploring the documentation/help for the `gzip` module,
2. Using `wget` to download the following link into the `./data` folder, and
3. Using `gzip` to uncompress this file.
4. Inspect the data in your list, whose entries should be of type `bytes`.  Use a comprehension with the expression `l.decode('utf-8')` to convert this to a list of strings.
5. Write the uncompressed lines to an output file using `with open(path, 'w') as out` and the `writelines` method of `out`.  

**Link.** [https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year/1750.csv.gz](https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year/1750.csv.gz)
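
Steps 3 through 5 can be rehearsed offline before touching the NOAA file. The sketch below is a minimal example under stated assumptions: the CSV rows are made up, the data is compressed in memory with `gzip.compress` as a stand-in for a downloaded `.gz` file, and the output goes to a temporary folder rather than `./data`.

```python
import gzip
import os
import tempfile

# Stand-in for a downloaded .gz file: compress two made-up CSV rows in memory.
raw_text = "STATION_A,17500101,TMAX,10\nSTATION_A,17500102,TMAX,12\n"
compressed = gzip.compress(raw_text.encode("utf-8"))

# gzip.decompress returns one bytes object; split it into a list of bytes lines.
byte_lines = gzip.decompress(compressed).splitlines(keepends=True)

# Decode each bytes line to str with a comprehension.
str_lines = [l.decode("utf-8") for l in byte_lines]

# Write the decoded lines to a temporary output file.
out_path = os.path.join(tempfile.mkdtemp(), "sample_uncompressed.csv")
with open(out_path, "w", encoding="utf-8", newline="") as out:
    out.writelines(str_lines)

# Read the file back to confirm the round trip preserved the text.
with open(out_path, encoding="utf-8", newline="") as check:
    round_trip = check.read()

print(round_trip == raw_text)
```

Keeping the line endings (`keepends=True`) matters here because `writelines` does not add newlines on its own.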

In [14]:
# Your code here

In [15]:
import gzip

In [16]:
#help(gzip)

In [18]:
# Download the file into the ./data folder
url = "https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year/1750.csv.gz"
output_path = "./data/1750.csv.gz"  # My file path

filename = wget.download(url, out=output_path)

print(f"File downloaded to: {filename}")

File downloaded to: ./data/1750.csv (1).gz


In [19]:
# Define input and output file paths
file_to_uncompress = './data/1750.csv.gz'
output_file = './data/1750_uncompressed.csv'

# Step 1: Read the compressed file as bytes
# .gz file is a binary file. 
# So, you have to use 'rb' instead of 'r'
with open(file_to_uncompress, 'rb') as f: 
    compressed_data = f.read()  # Read the entire file in binary mode

# Step 2: Decompress the data
decompressed_data = gzip.decompress(compressed_data)  # Returns bytes

# Step 3: Convert bytes to string and split into lines
string_lines = decompressed_data.decode('utf-8').splitlines(keepends=True)

# Step 4: Write the uncompressed lines to an output file
with open(output_file, 'w') as out:
    out.writelines(string_lines)

print(f"Uncompressed file saved as: {output_file}")

Uncompressed file saved as: ./data/1750_uncompressed.csv


## Problem 2 - Find all stations within 500 miles of Winona, MN.

The file linked below contains information about all stations tracked by NOAA.  

*Main folder:* https://www.ncei.noaa.gov/pub/data/ghcn/daily/

*Station txt file:* https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt

*Note.* While it would be easier to use the CSV version of the station file, you should use the TXT version here (for practice).

**Your tasks.** Our goal is to get a list of stations that are within 500 miles of Winona.  Do this as follows.

1. Use `wget` to download the station information into the `./data` folder.
2. Use `with` to read the lines of this file.
3. At this point, the lines are strings in a fixed-width format, with the first few entries separated by whitespace.  Use a list comprehension with the string `split` method to split each raw line (a string) into a list of entries.
4. There are three entries of interest: the station ID and the lat-long coordinates of the station.  Inspect the file to determine the index of each of these three entries.
5. We want to transform each line (currently a list of strings) into a record, which is a `dict` with good names for the entries as keys and values of an appropriate type (`str` for the station ID, `float` for the lat-long).  Use a comprehension to create a list of records as described.
6. Use another comprehension to filter the stations, keeping only those within 500 miles of Winona.
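
Because the file is fixed-width, the indices found in step 4 can be cross-checked by slicing character columns instead of splitting on whitespace. The sketch below is a rough illustration, not the required approach for this lab: the column slices are assumptions based on the layout of the first lines of the file and should be verified against the NOAA readme, and the sample line is rebuilt with a format string so the example runs without the download.

```python
# Build one sample line in the assumed fixed-width layout (values from the first station).
station_id, lat, lon, elev, state, name = (
    "ACW00011604", 17.1167, -61.7833, 10.1, "", "ST JOHNS COOLIDGE FLD",
)
sample_line = f"{station_id:<11} {lat:>8.4f} {lon:>9.4f} {elev:>6.1f} {state:<2} {name:<30}\n"

# Assumed 0-based column slices for the entries of interest.
ID_COLS = slice(0, 11)
LAT_COLS = slice(12, 20)
LON_COLS = slice(21, 30)
NAME_COLS = slice(41, 71)


def parse_station(line):
    """Turn one fixed-width line into a record (dict) with typed values."""
    return {
        "station_id": line[ID_COLS],
        "latitude": float(line[LAT_COLS]),
        "longitude": float(line[LON_COLS]),
        "name": line[NAME_COLS].strip(),
    }


record = parse_station(sample_line)

# The whitespace split agrees on the first three entries but breaks the name apart.
split_record = sample_line.split()
print(record)
print(split_record)
```

Splitting on whitespace scatters a multi-word station name across several entries, but the ID, latitude, and longitude always land at indices 0, 1, and 2, which is all this problem needs.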

In [20]:
# Your code here (add cells as needed)

In [21]:
# Download the file into the ./data folder
url = "https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt"
output_path = "./data/ghcnd-stations.txt"  # My file path

filename = wget.download(url, out=output_path)

print(f"File downloaded to: {filename}")

File downloaded to: ./data/ghcnd-stations (1).txt


In [22]:
with open('./data/ghcnd-stations.txt') as f:
    lines = f.readlines()
lines[:5]

['ACW00011604  17.1167  -61.7833   10.1    ST JOHNS COOLIDGE FLD                       \n',
 'ACW00011647  17.1333  -61.7833   19.2    ST JOHNS                                    \n',
 'AE000041196  25.3330   55.5170   34.0    SHARJAH INTER. AIRP            GSN     41196\n',
 'AEM00041194  25.2550   55.3640   10.4    DUBAI INTL                             41194\n',
 'AEM00041217  24.4330   54.6510   26.8    ABU DHABI INTL                         41217\n']

In [23]:
# By default, split() handles multiple consecutive whitespace characters as a single delimiter.
split_lines = [line.split() for line in lines]
split_lines[:3]

[['ACW00011604',
  '17.1167',
  '-61.7833',
  '10.1',
  'ST',
  'JOHNS',
  'COOLIDGE',
  'FLD'],
 ['ACW00011647', '17.1333', '-61.7833', '19.2', 'ST', 'JOHNS'],
 ['AE000041196',
  '25.3330',
  '55.5170',
  '34.0',
  'SHARJAH',
  'INTER.',
  'AIRP',
  'GSN',
  '41196']]

In [24]:
# List comprehension to create records (dicts) with keys: 'station_id', 'latitude', 'longitude'
stations = [
    {
        'station_id': line[0],        # Keep this as string
        'latitude': float(line[1]),   # Convert to float for latitude
        'longitude': float(line[2]),  # Convert to float for longitude
    }
    for line in split_lines
]

# Print the first 3 records
print(stations[:3])

[{'station_id': 'ACW00011604', 'latitude': 17.1167, 'longitude': -61.7833}, {'station_id': 'ACW00011647', 'latitude': 17.1333, 'longitude': -61.7833}, {'station_id': 'AE000041196', 'latitude': 25.333, 'longitude': 55.517}]


In [25]:
print(distance.distance(winona, rochester).miles)

39.54418575388878


In [26]:
# Winona coordinates
winona = (44.050556, -91.668333)

# Filter stations within 500 miles of Winona
nearby_stations = [
    station
    for station in stations
    if distance.distance(winona, (station['latitude'], station['longitude'])).miles <= 500
]

# Print filtered stations
print(nearby_stations[:3])

[{'station_id': 'CA005012710', 'latitude': 49.45, 'longitude': -98.6167}, {'station_id': 'CA005020036', 'latitude': 49.55, 'longitude': -98.2}, {'station_id': 'CA005020040', 'latitude': 49.1, 'longitude': -97.55}]


## Problem 3 - Prototype downloading and uncompressing a station file.

Before we download and uncompress all the stations of interest, let's practice on one station file.


1. Copy the URL for some station and store it as a variable named `url`.
2. Write `lambda` functions that extract each of the following from the station `url`: compressed file name, compressed file path (e.g., `./data/...`), and uncompressed file path (e.g., `./data/...`).
3. Write a `lambda` function that extracts
4. Use `wget` to download this station's data.
5. Use `gzip` to uncompress the data.
6. Write the data to an output file.

Your code should have the following shape:

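```{Python}
wget.download(...)
with gzip.open(..., 'rt') as f:  # 'rt' reads the uncompressed data as text
    with open(..., 'w') as out:
        lines = f.readlines()    # Store the lines; f is exhausted after this call
        out.writelines(lines)
```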

Use your helper functions to fill in some of the `...` placeholders.
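
As a rough sketch of what such helpers might look like, the lambdas below derive each name and path from a station URL. The helper names and the flat folder layout are assumptions of this example, and a temporary folder stands in for `./data` so the code runs anywhere. Creating the target folder first avoids a failed download when the folder does not yet exist.

```python
import os
import tempfile

url = "https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/ACW00011604.csv.gz"
data_dir = os.path.join(tempfile.mkdtemp(), "data")  # stand-in for ./data

# Hypothetical helper lambdas: each derives one name or path from the station URL.
comp_file_name = lambda u: u.rsplit("/", 1)[-1]
comp_file_path = lambda u: os.path.join(data_dir, comp_file_name(u))
uncomp_file_path = lambda u: comp_file_path(u)[: -len(".gz")]

# Create the target folder up front so a later download has somewhere to land.
os.makedirs(os.path.dirname(comp_file_path(url)), exist_ok=True)

print(comp_file_name(url))  # prints ACW00011604.csv.gz
print(uncomp_file_path(url))
```

Slicing off the last three characters removes only the trailing `.gz`, whereas `replace('.gz', '')` would also remove a `.gz` that happened to appear earlier in the path.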

In [27]:
# Your code here

In [28]:
# Review lambda function 
add_lambda = lambda x, y: x + y

# Usage
print(add_lambda(3, 5))  # Output: 8

8


In [29]:
# ACW00011604.csv.gz
url = 'https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/ACW00011604.csv.gz'

In [31]:
# lambda to extract compressed file name

extract_comp_file_name = lambda url: url.split('/')[-1]
print(extract_comp_file_name(url))

ACW00011604.csv.gz


In [35]:
# lambda to extract compressed file path

extract_comp_file_path = lambda url: '.' + url.split('pub')[1]
print(extract_comp_file_path(url))

./data/ghcn/daily/by_station/ACW00011604.csv.gz


In [36]:
# lambda to extract uncompressed file path

extract_uncomp_file_path = lambda url: '.' + url.split('pub')[1].replace('.gz', '')
print(extract_uncomp_file_path(url))

./data/ghcn/daily/by_station/ACW00011604.csv


In [37]:
# ACW00011604.csv.gz
url = 'https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/ACW00011604.csv.gz'

# Download the compressed file
compressed_file_path = extract_comp_file_path(url)
uncompressed_file_path = extract_uncomp_file_path(url)

wget.download(url, compressed_file_path)

# Uncompress the file and write to an output file
with gzip.open(compressed_file_path, 'rt') as f:
    with open(uncompressed_file_path, 'w') as out:
        out.writelines(f.readlines())

print(f"Uncompressed file downloaded to: {uncompressed_file_path}")

Uncompressed file downloaded to: ./data/ghcn/daily/by_station/ACW00011604.csv


## Problem 4 - Build the station URLs and download the files.

**Tasks.** Now build URLs for all stations of interest and download their data:

1. Use a comprehension to extract the IDs of the stations of interest into a list.
2. Investigate the structure of the files stored in the `by_station` folder (see main folder link above).
3. Use a comprehension and an `f` string to build a list of URLs for all stations of interest.
4. Use `wget` to download the data for the stations of interest into the data folder.
5. Use `gzip` to uncompress the files.
6. Convert the `bytes` to `str` using the `utf-8` encoding.
7. Use the append mode `"a"` of `open` with `writelines` to append the data in each file to your output file.

While we usually avoid using a `for` loop, we make an exception for code that performs lengthy IO.  To accomplish steps 4 through 7, use a `for` loop with the following shape.

```{Python}
for url in station_urls:
    wget.download(...)
    with gzip.open(...) as f:
        with open(..., 'a') as out:
            lines = f.readlines()  # Store the lines; f is exhausted after this call
            ... # Convert lines to strings here
            out.writelines(lines)
    print(f"Downloaded and extracted the data for {url}")
```

Note that the code inside the loop should resemble the code from the previous step.
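
The uncompress, decode, and append steps of the loop can be tested offline before starting the long download. The sketch below is a minimal example under stated assumptions: two tiny `.gz` files with made-up rows stand in for downloaded station files, the `wget` call is omitted, and everything is written to a temporary folder.

```python
import gzip
import os
import tempfile

work_dir = tempfile.mkdtemp()

# Stand-ins for downloaded station files: two tiny .gz files with made-up rows.
fake_rows = {
    "A0.csv.gz": "A0,20200101,TMAX,10\n",
    "A1.csv.gz": "A1,20200101,TMAX,12\n",
}
for name, text in fake_rows.items():
    with gzip.open(os.path.join(work_dir, name), "wb") as gz:
        gz.write(text.encode("utf-8"))

combined_path = os.path.join(work_dir, "combined.csv")

for name in fake_rows:
    # In binary mode ('rb'), readlines() yields bytes objects.
    with gzip.open(os.path.join(work_dir, name), "rb") as f:
        with open(combined_path, "a", encoding="utf-8", newline="") as out:
            byte_lines = f.readlines()
            str_lines = [line.decode("utf-8") for line in byte_lines]
            out.writelines(str_lines)
    print(f"Extracted and appended {name}")

# Read the combined file back to confirm both stations were appended in order.
with open(combined_path, encoding="utf-8", newline="") as check:
    combined = check.read()
```

Because the output file is opened in append mode, rerunning the loop adds the same rows again, so delete the combined file before a fresh run.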

In [38]:
# Example
fake_station = "A123456789"
make_fake_url = lambda s: f"https://my_fake_website.cool/{s}"

make_fake_url(fake_station)

'https://my_fake_website.cool/A123456789'

In [39]:
# Example
my_fake_stations =[f'A{i}' for i in range(10)]

(my_fake_urls := [make_fake_url(s) for s in my_fake_stations])

['https://my_fake_website.cool/A0',
 'https://my_fake_website.cool/A1',
 'https://my_fake_website.cool/A2',
 'https://my_fake_website.cool/A3',
 'https://my_fake_website.cool/A4',
 'https://my_fake_website.cool/A5',
 'https://my_fake_website.cool/A6',
 'https://my_fake_website.cool/A7',
 'https://my_fake_website.cool/A8',
 'https://my_fake_website.cool/A9']

In [40]:
# Your code here.

In [41]:
# nearby_stations from Problem 2
nearby_station_ids = [station['station_id'] for station in nearby_stations]
print(nearby_station_ids[:3])

['CA005012710', 'CA005020036', 'CA005020040']


In [42]:
# My code
station = "CA005012710"
make_station_url = lambda s: f"https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/{s}.csv.gz"

make_station_url(station)

'https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/CA005012710.csv.gz'

In [43]:
# Generate URLs for each station
station_urls = [make_station_url(s) for s in nearby_station_ids]
station_urls[:3]

['https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/CA005012710.csv.gz',
 'https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/CA005020036.csv.gz',
 'https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/CA005020040.csv.gz']

In [44]:
# Your code here

In [45]:
# lambda to extract compressed file name
extract_comp_file_name = lambda url: url.split('/')[-1]
print(f'Compressed File Name: {extract_comp_file_name(url)}')

# lambda to extract compressed file path
extract_comp_file_path = lambda url: '.' + url.split('pub')[1]
print(f'Compressed File Path: {extract_comp_file_path(url)}')

# lambda to extract uncompressed file path
extract_uncomp_file_path = lambda url: '.' + url.split('pub')[1].replace('.gz', '')
print(f'Uncompressed File Path: {extract_uncomp_file_path(url)}')

Compressed File Name: ACW00011604.csv.gz
Compressed File Path: ./data/ghcn/daily/by_station/ACW00011604.csv.gz
Uncompressed File Path: ./data/ghcn/daily/by_station/ACW00011604.csv


In [None]:
for url in station_urls:
    wget.download(url, extract_comp_file_path(url))

    with gzip.open(extract_comp_file_path(url), 'rt', encoding='utf-8') as f:  # Ensure UTF-8 decoding
        with open(extract_uncomp_file_path(url), 'w', encoding='utf-8') as out:  # Use 'w' to overwrite
            lines = f.readlines()  
            lines = [line.strip() + '\n' for line in lines]  # Strip unwanted spaces & ensure newline
            out.writelines(lines)  # Write cleaned lines to file

    print(f"Downloaded and extracted the data for {url}")

Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/CA005012710.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/CA005020036.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/CA005020040.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/CA005020050.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/CA005020054.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/CA005020069.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/CA005020224.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/CA005020551.csv.gz


In [None]:
# Download and extract data for each station
for url in station_urls:
    wget.download(url, extract_comp_file_path(url))

    with gzip.open(extract_comp_file_path(url), 'rb') as f:  # 'rb' = read the uncompressed data as bytes
        with open(extract_uncomp_file_path(url), 'a', encoding='utf-8') as out:
            byte_lines = f.readlines()  # Read and store lines (bytes); f is exhausted after this
            lines = [line.decode('utf-8') for line in byte_lines]  # Convert to strings, keeping newlines
            out.writelines(lines)

    print(f"Downloaded and extracted the data for {url}")

Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/CA005012710.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/CA005020036.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/CA005020040.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/CA005020050.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/CA005020054.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/CA005020069.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/CA005020224.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/CA005020551.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_s

In [None]:
#help(open)