# Scrape sale price documents for homes in one borough

## Build a list of documents we would like to download

Visit https://www.nyc.gov/site/finance/taxes/property-annualized-sales-update.page and peek under "Detailed Annual Sales Reports by Borough." We want to build a list of links to all of the Excel files for **one borough**. It's your choice: Manhattan, Brooklyn, Staten Island, etc.

**You only need to go back to 2007!!!**

* _**Tip:** You can basically cut and paste from the end of class on this one_
* _**Tip:** 2017 and earlier files are `.xls`, not `.xlsx`_

**...but if you want to go before 2007**, you have a few options.

One is to pick extra stuff on the page, and then use [list slicing](https://www.learnbyexample.org/python-list-slicing/) to remove what you don't want. For example:

```python
numbers = [1, 2, 3, 4, 5]
# only gives [4, 5]
numbers[3:]
```

If you'd like to pick only the right things, though, another option is to combine conditions with `AND` and `OR` in your CSS selectors. [Reading about the different attribute selectors available](https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors) might help.

* `"a[href*='mountains'][href$='.csv']"` grabs everything that contains "mountains" *and* ends with `.csv`
* `"a[href*='mountains'], a[href*='streams']"` grabs everything that contains "mountains" or contains "streams"
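To see how the two combinations behave, here is a small sketch run against a made-up HTML snippet. The `example_html` string and its file names are invented for illustration and do not come from the NYC page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this example
example_html = """
<a href="/data/mountains.csv">Mountains (CSV)</a>
<a href="/data/mountains.xls">Mountains (Excel)</a>
<a href="/data/streams.csv">Streams (CSV)</a>
<a href="/data/lakes.csv">Lakes (CSV)</a>
"""
example_doc = BeautifulSoup(example_html, "html.parser")

# AND: stack attribute selectors on the same tag, with no space between them
both = [a["href"] for a in example_doc.select("a[href*='mountains'][href$='.csv']")]

# OR: write two complete selectors separated by a comma
either = [a["href"] for a in example_doc.select("a[href*='mountains'], a[href*='streams']")]

print(both)
print(either)
```

The `AND` version keeps only the mountains CSV, while the `OR` version keeps both mountains files plus the streams file.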

And a third is to filter your matches after you grab them from the page. It's kind of a combination of the first and second options - it involves selecting more than you really need, and then filtering them based on filenames. You could use either regex or "normal" Python.

```python
import re

urls = [
    "https://pokeapi.co/api/v2/pokemon/rayquaza",
    "https://pokeapi.co/api/v2/pokemon/snorlax",
    "https://pokeapi.co/api/v2/pokemon/22",
    "https://pokeapi.co/api/v2/pokemon/55",
    "https://pokeapi.co/api/v2/type/electric",
    "https://pokeapi.co/api/v2/type/fire"
]
# Use regex to find all urls that have pokemon/xx in them, where xx is one or more digits
# (the r"..." raw string keeps Python from treating \d as a string escape)
poke_id_urls = [url for url in urls if re.search(r"pokemon/\d+", url)]

# Use "normal" Python to find all urls with 'type' in them that start with https://
type_urls = [url for url in urls if 'type' in url and url.startswith("https://")]
```
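Applied to this assignment, the same filter-afterwards idea could pull the year out of each filename and keep only 2007 and later. This is a sketch under assumptions: the `hrefs` list is hypothetical (modeled loosely on the page's naming pattern, not scraped from it), and `file_year` is a helper name made up for the example.

```python
import re

# Hypothetical hrefs, stand-ins for whatever you selected from the page
hrefs = [
    "/downloads/2021/2021_bronx.xlsx",
    "/downloads/2010/2010_bronx.xls",
    "/downloads/2005/2005_bronx.xls",
    "/downloads/2021/2021_queens.xlsx",
]

def file_year(href):
    """Return the first four-digit number in the filename, or None if there isn't one."""
    filename = href.split("/")[-1]
    match = re.search(r"\d{4}", filename)
    return int(match.group()) if match else None

# Keep one borough, and only years from 2007 onward
recent_bronx = [
    href for href in hrefs
    if "bronx" in href and (file_year(href) or 0) >= 2007
]
print(recent_bronx)
```

Looking only at the filename (the part after the last `/`) avoids accidentally matching a year that appears in a folder name.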



In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
response = requests.get("https://www.nyc.gov/site/finance/taxes/property-annualized-sales-update.page")
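# Naming the parser avoids BeautifulSoup's "no parser was explicitly specified" warning
doc = BeautifulSoup(response.text, "html.parser")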

In [10]:
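links = doc.find_all('a')
for link in links:
    try:
        # Note: 'bronx.xls' is also a substring of 'bronx.xlsx', so every
        # .xlsx link passes both checks below and is printed twice
        if 'bronx.xlsx' in link['href']:
            print("https://www.nyc.gov/" + link['href'])
        if 'bronx.xls' in link['href']:
            print("https://www.nyc.gov/" + link['href'])
    except KeyError:
        # Some <a> tags have no href attribute; skip them
        pass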

https://www.nyc.gov//assets/finance/downloads/pdf/rolling_sales/annualized-sales/2021/2021_bronx.xlsx
https://www.nyc.gov//assets/finance/downloads/pdf/rolling_sales/annualized-sales/2021/2021_bronx.xlsx
https://www.nyc.gov//assets/finance/downloads/pdf/rolling_sales/annualized-sales/2020/2020_bronx.xlsx
https://www.nyc.gov//assets/finance/downloads/pdf/rolling_sales/annualized-sales/2020/2020_bronx.xlsx
https://www.nyc.gov//assets/finance/downloads/pdf/rolling_sales/annualized-sales/2019/2019_bronx.xlsx
https://www.nyc.gov//assets/finance/downloads/pdf/rolling_sales/annualized-sales/2019/2019_bronx.xlsx
https://www.nyc.gov//assets/finance/downloads/pdf/rolling_sales/annualized-sales/2018/2018_bronx.xlsx
https://www.nyc.gov//assets/finance/downloads/pdf/rolling_sales/annualized-sales/2018/2018_bronx.xlsx
https://www.nyc.gov//assets/finance/downloads/pdf/rolling_sales/annualized-sales/2017/2017_bronx.xls
https://www.nyc.gov//assets/finance/downloads/pdf/rolling_sales/annualized-sales/20

## Use Python to make a list of the URLs to be downloaded, and save them to a file.

The format is a _little_ different from what we did in class: a `/` at the beginning of a URL means "start from the top of the domain" instead of "start relative to the page you're on now." Look at the `href` values you scraped and you'll see they all begin with `/`.

_**Tip:** If you want to google around at other ways to do this, the `'\n'.join(urls)` method might be an interesting one to look at._
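One way to handle that leading `/` without gluing strings together by hand is `urljoin` from Python's standard library, which applies the same rule a browser does. A minimal sketch, using the page URL from above and one of the scraped `href` values (the second `href`, `example.xls`, is made up to show the relative case):

```python
from urllib.parse import urljoin

page_url = "https://www.nyc.gov/site/finance/taxes/property-annualized-sales-update.page"

# An href that starts with "/" is resolved from the top of the domain
absolute_href = "/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2021/2021_bronx.xlsx"
from_domain = urljoin(page_url, absolute_href)

# A hypothetical href with no leading "/" is resolved relative to the page's folder
from_page = urljoin(page_url, "example.xls")

print(from_domain)
print(from_page)
```

Because `urljoin` handles the slash itself, the result has a single `/` between the domain and `assets`.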

In [14]:
doc.select("a[href*='bronx.xlsx'], a[href*='bronx.xls']")

[<a href="/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2021/2021_bronx.xlsx" target="_blank">Download</a>,
 <a href="/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2020/2020_bronx.xlsx" target="_blank">Download</a>,
 <a href="/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2019/2019_bronx.xlsx" target="_blank">Download</a>,
 <a href="/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2018/2018_bronx.xlsx" target="_blank">Download</a>,
 <a href="/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2017/2017_bronx.xls" target="_blank">Download</a>,
 <a href="/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2016/2016_bronx.xls" target="_blank">Download</a>,
 <a href="/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2015/2015_bronx.xls" target="_blank">Download</a>,
 <a href="/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2014/2014_bronx.xls" target="_blank">Download</a>,
 <a href="/assets/fi

In [15]:
links = doc.select("a[href*='bronx.xlsx'], a[href*='bronx.xls']")
for link in links:
    print(link['href'])

/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2021/2021_bronx.xlsx
/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2020/2020_bronx.xlsx
/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2019/2019_bronx.xlsx
/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2018/2018_bronx.xlsx
/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2017/2017_bronx.xls
/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2016/2016_bronx.xls
/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2015/2015_bronx.xls
/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2014/2014_bronx.xls
/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2013/2013_bronx.xls
/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2012/2012_bronx.xls
/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2011/2011_bronx.xls
/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2010/2010_bronx.xls
/assets/finance/download

In [16]:
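# Each href already starts with "/", so this yields "https://www.nyc.gov//assets/...".
# The double slash still downloads fine from this server.
urls = ["https://www.nyc.gov/" + link['href'] for link in links]
urls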

['https://www.nyc.gov//assets/finance/downloads/pdf/rolling_sales/annualized-sales/2021/2021_bronx.xlsx',
 'https://www.nyc.gov//assets/finance/downloads/pdf/rolling_sales/annualized-sales/2020/2020_bronx.xlsx',
 'https://www.nyc.gov//assets/finance/downloads/pdf/rolling_sales/annualized-sales/2019/2019_bronx.xlsx',
 'https://www.nyc.gov//assets/finance/downloads/pdf/rolling_sales/annualized-sales/2018/2018_bronx.xlsx',
 'https://www.nyc.gov//assets/finance/downloads/pdf/rolling_sales/annualized-sales/2017/2017_bronx.xls',
 'https://www.nyc.gov//assets/finance/downloads/pdf/rolling_sales/annualized-sales/2016/2016_bronx.xls',
 'https://www.nyc.gov//assets/finance/downloads/pdf/rolling_sales/annualized-sales/2015/2015_bronx.xls',
 'https://www.nyc.gov//assets/finance/downloads/pdf/rolling_sales/annualized-sales/2014/2014_bronx.xls',
 'https://www.nyc.gov//assets/finance/downloads/pdf/rolling_sales/annualized-sales/2013/2013_bronx.xls',
 'https://www.nyc.gov//assets/finance/downloads/pdf

In [17]:
with open("urls.txt", 'w') as fp:
    for url in urls:
        fp.write(url + "\n")
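The same kind of file can also be produced in a single write with `'\n'.join(...)`. This sketch uses a hypothetical `example_urls` list and writes to a temporary folder so it doesn't touch the real `urls.txt`.

```python
import tempfile
from pathlib import Path

# Hypothetical URLs, stand-ins for the scraped list
example_urls = [
    "https://example.com/2021_bronx.xlsx",
    "https://example.com/2020_bronx.xlsx",
]

# Written to a temporary folder so the real urls.txt is left alone
out_path = Path(tempfile.mkdtemp()) / "urls.txt"

# join() puts a newline between items; the extra "\n" ends the last line
out_path.write_text("\n".join(example_urls) + "\n")

print(out_path.read_text())
```

`join` only inserts the separator *between* items, which is why the trailing newline is added separately.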

## Download the Excel files with `wget` or `curl`

You can see what I did in class, but `wget` also has an option (`-i`) that takes the name of a file containing a list of URLs to download.

In [18]:
!wget -i urls.txt

--2022-11-20 09:00:36--  https://www.nyc.gov//assets/finance/downloads/pdf/rolling_sales/annualized-sales/2021/2021_bronx.xlsx
Resolving www.nyc.gov (www.nyc.gov)... 104.70.72.36
Connecting to www.nyc.gov (www.nyc.gov)|104.70.72.36|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 875625 (855K) [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]
Saving to: '2021_bronx.xlsx'

     0K .......... .......... .......... .......... ..........  5% 3.06M 0s
    50K .......... .......... .......... .......... .......... 11% 7.29M 0s
   100K .......... .......... .......... .......... .......... 17% 2.79M 0s
   150K .......... .......... .......... .......... .......... 23% 5.80M 0s
   200K .......... .......... .......... .......... .......... 29% 7.17M 0s
   250K .......... .......... .......... .......... .......... 35% 8.08M 0s
   300K .......... .......... .......... .......... .......... 40% 33.2M 0s
   350K .......... .......... .......... .........
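If `wget` isn't installed, a rough Python stand-in is possible with the standard library. This is only a sketch and has not been run against the NYC site: `local_name` and `download_all` are names made up for the example, and it assumes `urls.txt` holds one URL per line as written above.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

def local_name(url):
    """Use the last piece of the URL path as the local filename."""
    return urlparse(url).path.rsplit("/", 1)[-1]

def download_all(list_path="urls.txt"):
    """Download every URL listed in list_path (one per line) into the current folder."""
    for line in Path(list_path).read_text().splitlines():
        url = line.strip()
        if url:  # skip blank lines
            urlretrieve(url, local_name(url))

# Nothing is downloaded until download_all() is called
print(local_name("https://www.nyc.gov//assets/finance/downloads/pdf/rolling_sales/annualized-sales/2021/2021_bronx.xlsx"))
```

Calling `download_all()` would then fetch each file in turn, saving it under the name taken from the end of its URL.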

In [19]:
!ls

01 - Data Acquisition.ipynb
02 - Data Compilation.ipynb
03 - Data Analysis.ipynb
04 - Data Exploration.ipynb
2009_bronx.xls
2009_brooklyn.xls
2010_bronx.xls
2010_brooklyn.xls
2011_bronx.xls
2011_brooklyn.xls
2012_bronx.xls
2012_brooklyn.xls
2013_bronx.xls
2013_brooklyn.xls
2014_bronx.xls
2014_brooklyn.xls
2015_bronx.xls
2015_brooklyn.xls
2016_bronx.xls
2016_brooklyn.xls
2017_bronx.xls
2017_brooklyn.xls
2018_bronx.xlsx
2018_brooklyn.xlsx
2019_bronx.xlsx
2019_brooklyn.xlsx
2020_bronx.xlsx
2020_brooklyn.xlsx
2021_bronx.xlsx
2021_brooklyn.xlsx
cleaned.csv
Juhana -01 - Data Acquisition.ipynb
merged.csv
sales_2007_bronx.xls
sales_2007_brooklyn.xls
sales_2008_bronx.xls
sales_2008_brooklyn.xls
urls.txt
