# Lab 2.1 - Weather Data Around Winona

In this lab, we will download and combine a decade's worth of weather data from NOAA, focusing on weather stations within 500 miles of Winona.

Here is the outline of the basic process.

1. Install and investigate useful packages.
2. Find all weather stations in proximity to Winona.
3. Use a single station to prototype our tools.
4. Automate the process of downloading and uncompressing data from all stations of interest.
5. Output the results to a CSV file.

## Problem 1 - Install and investigate useful tools.

First, you should install and investigate the following tools.

1. **`wget`** is a command-line tool for programmatically downloading data files from the web.  There is a Python wrapper for this tool that you can install with `pip` as shown below.
2. **`geopy`** is a package that, among other things, implements a function for computing distances between two lat-long pairs. Again, install this package with `pip` as shown below.
3. **`gzip`** is part of the standard Python library and reads and writes gzip-compressed (`.gz`) files, so it needs no installation.

In [1]:
%pip install wget

Note: you may need to restart the kernel to use updated packages.

Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: wget
  Building wheel for wget (setup.py): started
  Building wheel for wget (setup.py): finished with status 'done'
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9680 sha256=b86ba220f1cf415dea476e92618e94c3d59e17881ef0bde563cd68b65d52af85
  Stored in directory: c:\users\mp5667di\appdata\local\pip\cache\wheels\01\46\3b\e29ffbe4ebe614ff224bad40fc6a5773a67a163251585a13a9
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [12]:
%pip install geopy

Note: you may need to restart the kernel to use updated packages.


In [14]:
import wget
import geopy

#### Task 1.1 - Investigate using `wget` to download a file.

Read the help/documentation on `wget` to figure out how to download the following data file (a sample data file from STAT 210) into the `./data` sub-folder.

[https://github.com/yardsale8/STAT_210/raw/refs/heads/main/data/sars1.csv](https://github.com/yardsale8/STAT_210/raw/refs/heads/main/data/sars1.csv)

In [15]:
help(wget)

Help on module wget:

NAME
    wget - Download utility as an easy way to get file from the net

DESCRIPTION
      python -m wget <URL>
      python wget.py <URL>

    Downloads: http://pypi.python.org/pypi/wget/
    Development: http://bitbucket.org/techtonik/python-wget/

    wget.py is not option compatible with Unix wget utility,
    to make command line interface intuitive for new people.

    Public domain by anatoly techtonik <techtonik@gmail.com>
    Also available under the terms of MIT license
    Copyright (c) 2010-2015 anatoly techtonik

FUNCTIONS
    bar_adaptive(current, total, width=80)
        Return progress bar string for given values in one of three
        styles depending on available width:

            [..  ] downloaded / total
            downloaded / total
            [.. ]

        if total value is unknown or <= 0, show bytes counter using two
        adaptive styles:

            %s / unknown
            %s

        if there is not enough space on the screen,

In [16]:
# Your code here

In [17]:
!dir

 Volume in drive C has no label.
 Volume Serial Number is 182D-87E4

 Directory of C:\Users\mp5667di\OneDrive - Minnesota State\Desktop\Spring_2025\DSCI_330\lab_2.1_Weather_data_around_Winona

01/28/2025  08:31 AM    <DIR>          .
01/28/2025  08:11 AM    <DIR>          ..
01/28/2025  08:18 AM                66 .gitattributes
01/28/2025  08:23 AM    <DIR>          .ipynb_checkpoints
01/28/2025  08:19 AM    <DIR>          data
01/28/2025  08:31 AM            19,779 lab2_1_weather_data_around_winona.ipynb
01/28/2025  08:18 AM                39 README.md
01/28/2025  08:29 AM             1,519 sars1.csv
               4 File(s)         21,403 bytes
               4 Dir(s)  50,200,256,512 bytes free


In [18]:
url = "https://github.com/yardsale8/STAT_210/raw/refs/heads/main/data/sars1.csv"
output_path = "./data/sars1.csv"  # My file path

filename = wget.download(url, out=output_path)

print(f"File downloaded to: {filename}")

File downloaded to: ./data/sars1.csv


#### Task 1.2 - Investigate using `geopy.distance.distance` to compute a distance in miles.

1. Import the `distance` function from the `geopy.distance` submodule.
2. Use Wikipedia to find the lat-long coordinates of Winona, MN and Rochester, MN.
3. Use `distance` to compute the distance between Winona and Rochester.
4. Use some other source (e.g., Google Maps) to check the answer.

In [19]:
# Your code here

In [34]:
from geopy import distance

In [35]:
# Winona: 44.050556, -91.668333
# Rochester: 44.023333, -92.461389

In [38]:
#help(distance)

In [37]:
winona = (44.050556, -91.668333)
rochester = (44.023333, -92.461389)

print(distance.distance(winona, rochester).miles)

39.54418575388878
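As one more way to check the answer (step 4), the distance can be recomputed by hand with the haversine formula. This is a minimal sketch, not part of the lab's required tools: the helper name `haversine_miles` and the mean Earth radius constant are assumptions introduced here. The formula treats the Earth as a sphere, while geopy's `distance.distance` uses an ellipsoidal model, so the two results should agree closely but not exactly.

```python
from math import asin, cos, radians, sin, sqrt

# Assumed constant for this sketch: mean Earth radius in miles.
EARTH_RADIUS_MILES = 3958.8


def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(h))


winona = (44.050556, -91.668333)
rochester = (44.023333, -92.461389)

# Compare this value with the geopy result printed above.
print(haversine_miles(winona, rochester))
```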


#### Task 1.3 - Investigate `gzip`

The yearly NOAA data is compressed as `.gz` files, which need to be uncompressed using `gzip`.  Explore the `gzip` module as follows.

1. Explore the documentation/help for the `gzip` module.
2. Use `wget` to download the following link into the `./data` folder.
3. Use `gzip` to uncompress this file and read its lines into a list.
4. Inspect the data in your list, where each entry should be of type `bytes`.  Use a comprehension with the expression `l.decode('utf-8')` to convert this to a list of strings.
5. Write the uncompressed lines to an output file using `with open(path, 'w') as out` and the `writelines` method of `out`.

**Link.** [https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year/1750.csv.gz](https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year/1750.csv.gz)

In [41]:
# Your code here

In [42]:
import gzip

In [44]:
help(gzip)

Help on module gzip:

NAME
    gzip - Functions that read and write gzipped files.

MODULE REFERENCE
    https://docs.python.org/3.12/library/gzip.html

    The following documentation is automatically generated from the Python
    source files.  It may be incomplete, incorrect or include features that
    are considered implementation detail and may vary between Python
    implementations.  When in doubt, consult the module reference at the
    location listed above.

DESCRIPTION
    The user of the file doesn't have to worry about the compression,
    but random access is not allowed.

CLASSES
    _compression.BaseStream(io.BufferedIOBase)
        GzipFile
    builtins.OSError(builtins.Exception)
        BadGzipFile

    class BadGzipFile(builtins.OSError)
     |  Exception raised in some cases for invalid gzip files.
     |
     |  Method resolution order:
     |      BadGzipFile
     |      builtins.OSError
     |      builtins.Exception
     |      builtins.BaseException
     |   

In [49]:
url = "https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year/1750.csv.gz"
output_path = "./data/1750.csv.gz"  # My file path

filename = wget.download(url, out=output_path)

print(f"File downloaded to: {filename}")

File downloaded to: ./data/1750.csv (1).gz


In [50]:
# The 'rb' mode opens the gzipped file for reading bytes
with gzip.open('./data/1750.csv.gz', 'rb') as f:
  file_content = f.read()
 
print(file_content)

IOPub data rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_data_rate_limit`.

Current values:
ServerApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
ServerApp.rate_limit_window=3.0 (secs)
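The rate-limit message above most likely appears because the whole decompressed file is printed at once. The sketch below walks through steps 3 to 5 of Task 1.3 while inspecting only a few lines. To stay runnable without a download, it builds a tiny `.gz` file in a temporary folder; the two rows and the file names are made-up placeholders, not real NOAA records.

```python
import gzip
import os
import tempfile

# Build a small gzip file to work on (placeholder rows, not real NOAA data).
sample_rows = [b"station_a,17500101,TMAX,10\n", b"station_b,17500102,TMIN,-5\n"]
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "sample.csv.gz")
csv_path = os.path.join(tmp_dir, "sample.csv")
with gzip.open(gz_path, "wb") as f:
    f.write(b"".join(sample_rows))

# Step 3: gzip.open decompresses while reading; 'rb' mode yields bytes.
with gzip.open(gz_path, "rb") as f:
    raw_lines = f.readlines()

# Inspect only the first few entries instead of printing everything.
print(raw_lines[:2])

# Step 4: decode each bytes line into a str.
lines = [l.decode("utf-8") for l in raw_lines]

# Step 5: write the decoded lines to a plain-text output file.
# newline="" keeps the line endings exactly as they were read.
with open(csv_path, "w", newline="") as out:
    out.writelines(lines)
```

For the real file, the same pattern should apply with `gz_path` pointing at the downloaded `.gz` file in `./data`.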



In [48]:
# Using the gzip.decompress(data) function.
# It expects gzip-compressed bytes, not a file path string,
# so read the compressed bytes from disk first.
file_to_uncompress = './data/1750.csv.gz'
with open(file_to_uncompress, 'rb') as f:
    compressed = f.read()

t = gzip.decompress(compressed)
print(t[:500])  # Print only the first 500 bytes to keep the output small

## Problem 2 - Find all stations within 500 miles of Winona, MN.

The file linked below contains information about all stations tracked by NOAA.  

*Main folder:* https://www.ncei.noaa.gov/pub/data/ghcn/daily/

*Station txt file:* https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt

*Note.* While it would be easier to use the CSV version of the station file, you should use the TXT version here (for practice).

**Your tasks.** Our goal is to get a list of stations that are within 500 miles of Winona.  Do this as follows.

1. Use `wget` to download the station information into the `./data` folder.
2. Use `with` to read the lines of this file.
3. At this point, the lines are strings in a fixed-width format separated by whitespace.  Use a list comprehension with the string `split` method to split each raw line (a string) into a list of entries.
4. There are three entries of interest: the station ID and the lat-long coordinates of the station.  Inspect the file to determine the index for these three entries.
5. We want to transform each split line (currently a list of strings) into a record, which is a `dict` with good names for the entries as keys and the values representing the data in an appropriate type (`str` for the station ID, `float` for the lat-long).  Use a comprehension to create a list of records as described.
6. Use another comprehension to apply a filter to the stations, keeping only those within 500 miles of Winona.
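Steps 3 to 6 can be sketched on a couple of lines before running them on the full file. The two lines below are made up for illustration (hypothetical station IDs and coordinates) and only mimic a whitespace-separated layout with the ID first, then latitude and longitude; check the real file to confirm the indexes. The key names and the helper `stations_within` are also assumptions, and the distance function is passed in as a parameter so the sketch does not depend on any particular package.

```python
# Made-up lines for illustration only; these are not real stations.
raw_lines = [
    "ZZ000000001  44.0000  -92.0000  200.0 MN HYPOTHETICAL NEARBY\n",
    "ZZ000000002 -30.0000  150.0000   10.0    HYPOTHETICAL FARAWAY\n",
]

# Step 3: split each raw line into a list of entries.
split_lines = [line.split() for line in raw_lines]

# Steps 4-5: in this sample layout the ID, latitude, and longitude
# are entries 0, 1, and 2; build one dict (record) per station.
records = [
    {"station_id": e[0], "lat": float(e[1]), "lon": float(e[2])}
    for e in split_lines
]


def stations_within(records, center, radius_miles, miles_between):
    """Step 6: keep the records at most radius_miles away from center."""
    return [
        r for r in records
        if miles_between(center, (r["lat"], r["lon"])) <= radius_miles
    ]


print(records)
```

In the lab, `miles_between` could be `lambda a, b: distance.distance(a, b).miles`, using the `distance` module imported from `geopy` in Task 1.2.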

In [None]:
# Your code here (add cells as needed)

## Problem 3 - Prototype downloading and uncompressing a station file.

Before we download and uncompress all the stations of interest, let's practice on one station file.


1. Copy the URL for some station and store it as a variable named `url`.
2. Write `lambda` functions that extract each of the following from the station `url`: compressed file name, compressed file path (e.g., `./data/...`), and uncompressed file path (e.g., `./data/...`).
3. Write a `lambda` function that extracts
4. Use `wget` to download this station's data.
5. Use `gzip` to uncompress the data.
6. Write the data to an output file.
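The helper functions in step 2 might look like the following sketch. The URL is hypothetical: `STATION_ID` is a placeholder, and the `by_station` file-name pattern is an assumption that should be checked against the NOAA folder. The helper names are likewise made up for illustration.

```python
# Hypothetical URL: 'STATION_ID' is a placeholder, and the file-name
# pattern is an assumption to verify against the NOAA by_station folder.
url = "https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/STATION_ID.csv.gz"

# Compressed file name: everything after the last slash.
get_file_name = lambda u: u.split("/")[-1]

# Compressed file path inside the ./data folder.
get_gz_path = lambda u: f"./data/{get_file_name(u)}"

# Uncompressed file path: the same path without the trailing '.gz'.
get_csv_path = lambda u: get_gz_path(u).removesuffix(".gz")

print(get_file_name(url))  # STATION_ID.csv.gz
print(get_gz_path(url))    # ./data/STATION_ID.csv.gz
print(get_csv_path(url))   # ./data/STATION_ID.csv
```

Note that `str.removesuffix` requires Python 3.9 or later.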

Your code should have the following shape:

```{Python}
wget.download(...)
with gzip.open(...) as f:
    with open(..., 'w') as out:
        lines = f.readlines()  # a list of bytes, one entry per line
        out.writelines(l.decode('utf-8') for l in lines)
```

You should use your helper functions to fill in some of the `...` placeholders.

In [None]:
# Your code here

## Problem 4 - Build the station URLs and download the files.

**Tasks.** Now you need to build URLs for all stations of interest and download their data.

1. Use a comprehension to extract the stations of interest into a list.
2. Investigate the structure of the files stored in the `by_station` folder (see the main folder link above).
3. Use a comprehension and an f-string to build a list of URLs for all stations of interest.
4. Use `wget` to download the data for the stations of interest into the `./data` folder.
5. Use `gzip` to uncompress the files.
6. Decode the `bytes` to `str` using `utf-8`.
7. Use the append mode `"a"` of `open` with `writelines` to append the data in each file to your output file.

While we usually avoid `for` loops, we make an exception for code that performs lengthy IO.  To accomplish steps 4 through 7, use a `for` loop with the following shape.

```{Python}
for url in station_urls:
    wget.download(...)
    with gzip.open(...) as f:
        with open(..., 'a') as out:
            lines = f.readlines()  # a list of bytes
            str_lines = ... # Convert lines to strings here
            out.writelines(str_lines)
    print(f"Downloaded and extracted the data for {url}")
```

Note that the code inside the loop should resemble the code from Problem 3.
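The append pattern can be tried without the network first. In this sketch, small `.gz` files written to a temporary folder stand in for the downloaded station files; the station IDs, row contents, and file names are all made-up placeholders.

```python
import gzip
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
output_path = os.path.join(tmp_dir, "all_stations.csv")

# Stand-ins for downloaded files: one small .gz per hypothetical station.
station_ids = [f"A{i}" for i in range(3)]
gz_paths = []
for s in station_ids:
    path = os.path.join(tmp_dir, f"{s}.csv.gz")
    with gzip.open(path, "wb") as f:
        f.write(f"{s},placeholder_row\n".encode("utf-8"))
    gz_paths.append(path)

# Append each station's decoded lines to one shared output file.
for path in gz_paths:
    with gzip.open(path, "rb") as f:
        with open(output_path, "a", newline="") as out:
            lines = f.readlines()
            out.writelines(l.decode("utf-8") for l in lines)
    print(f"Extracted {os.path.basename(path)}")

# Read the combined file back to confirm every station was appended.
with open(output_path) as f:
    combined = f.readlines()
print(combined)
```

Because the output file is opened in append mode, re-running the loop adds the same rows again, so it may help to delete the output file before each full run.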

In [None]:
# Your code here.

In [None]:
fake_station = "A123456789"
make_fake_url = lambda s: f"https://my_fake_website.cool/{s}"

make_fake_url(fake_station)

'https://my_fake_website.cool/A123456789'

In [None]:
my_fake_stations =[f'A{i}' for i in range(10)]

(my_fake_urls := [make_fake_url(s) for s in my_fake_stations])

['https://my_fake_website.cool/A0',
 'https://my_fake_website.cool/A1',
 'https://my_fake_website.cool/A2',
 'https://my_fake_website.cool/A3',
 'https://my_fake_website.cool/A4',
 'https://my_fake_website.cool/A5',
 'https://my_fake_website.cool/A6',
 'https://my_fake_website.cool/A7',
 'https://my_fake_website.cool/A8',
 'https://my_fake_website.cool/A9']

In [None]:
# Your code here