# School Board Minutes

Scrape all of the school board minutes from http://www.mineral.k12.nv.us/pages/School_Board_Minutes

Save a CSV called `minutes.csv` with the date and the URL to the file. The date should be formatted as YYYY-MM-DD.

**Bonus:** Download the PDF files

**Bonus 2:** Use [PDF OCR X](https://solutions.weblite.ca/pdfocrx/index.php) on one of the PDF files and see if it can be converted into text successfully.

* **Hint:** If you're just looking for links, there are a lot of other links on that page! Can you look at the link to know whether it links to minutes or not? You'll want to use an "if" statement.
* **Hint:** You could also filter out bad links later on using pandas instead of when scraping
* **Hint:** If you get a weird error that you can't really figure out, you can always tell Python to just ignore it using `try` and `except`, like below. Python will try to do the stuff inside of 'try', but if it hits an error it will skip right out.
* **Hint:** Remember the codes at http://strftime.org
* **Hint:** If you have a date that you've parsed, you can use `.dt.strftime` to turn it into a specially-formatted string. You use the same codes (like %B etc) that you use for converting strings into dates.

```python
try:
    ...  # blah blah your code
    ...  # your code
    ...  # your code
except Exception:
    pass
```
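
The first hint can be sketched as a small helper that decides from the link alone whether it points at minutes. The name `is_minutes_link` and the sample list are assumptions made for illustration; the hrefs are ones that appear on the page.

```python
def is_minutes_link(href):
    """Return True when an href looks like a minutes PDF (hypothetical helper)."""
    # Compare in lowercase so a differently-cased file name still matches
    return href.lower().endswith("_minutes.pdf")


# A few hrefs of the kind found on the page
hrefs = [
    "http://www.ets.org/parapro",
    "/files/6.4.19_minutes.pdf",
    "http://www.mineral.k12.nv.us/files/diploma_requesst.pdf",
    "/files/5.28.19_minutes.pdf",
]

minutes_links = []
for href in hrefs:
    # The "if" statement keeps only the links that look like minutes
    if is_minutes_link(href):
        minutes_links.append(href)

print(minutes_links)
```

The same check could be applied inside the scraping loop, or later as a pandas filter, as the second hint suggests.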

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
import numpy as np

In [2]:
response = requests.get("http://www.mineral.k12.nv.us/pages/School_Board_Minutes")
# Name the parser explicitly so BeautifulSoup does not have to guess one
doc = BeautifulSoup(response.text, "html.parser")

In [3]:
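spans = doc.find_all("span")
rows = []

for span in spans:
    row = {}
    try:
        # find("a") returns None for a span without a link (TypeError);
        # a link without an href raises KeyError
        href = span.find("a")["href"]
        row["url"] = href
        # "date" starts as a copy of the URL and is cleaned up in later cells
        row["date"] = href
    except (TypeError, KeyError):
        pass
    rows.append(row)

print(rows)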

[{}, {}, {}, {'url': 'https://www2.ed.gov/policy/gen/guid/fpco/ferpa/index.html', 'date': 'https://www2.ed.gov/policy/gen/guid/fpco/ferpa/index.html'}, {}, {}, {'url': 'http://www.doe.nv.gov/Assessments/', 'date': 'http://www.doe.nv.gov/Assessments/'}, {}, {}, {'url': 'https://online.nvdoe.org/#/VerifyLicense', 'date': 'https://online.nvdoe.org/#/VerifyLicense'}, {}, {'url': 'http://www.ets.org/parapro', 'date': 'http://www.ets.org/parapro'}, {}, {}, {}, {'url': 'http://www.mineral.k12.nv.us/files/diploma_requesst.pdf', 'date': 'http://www.mineral.k12.nv.us/files/diploma_requesst.pdf'}, {}, {}, {'url': 'http://www.mineral.k12.nv.us/files/diploma_requesst.pdf', 'date': 'http://www.mineral.k12.nv.us/files/diploma_requesst.pdf'}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {'url': '/files/6.4.19_minutes.pdf', 'date': '/files/6.4.19_minutes.pdf'}, {}, {'url': '/files/5.28.19_minutes.pdf', 'date': '/files/5.28.19_minutes.pdf'}, {}, {'url': '/files/5.7.19_minutes.pdf', 'date': '/files/5.

In [4]:
df = pd.DataFrame(rows)

In [5]:
df = df.dropna()

In [6]:
# Keep only the rows whose link points at a minutes PDF
df = df[df["date"].str.contains(r"_minutes\.pdf", na=False, flags=re.IGNORECASE)]

In [7]:
df = df.dropna()

In [8]:
df

Unnamed: 0,date,url
31,/files/6.4.19_minutes.pdf,/files/6.4.19_minutes.pdf
33,/files/5.28.19_minutes.pdf,/files/5.28.19_minutes.pdf
35,/files/5.7.19_minutes.pdf,/files/5.7.19_minutes.pdf
37,/files/4.23.19_minutes.pdf,/files/4.23.19_minutes.pdf
39,/files/4.8.19_minutes.pdf,/files/4.8.19_minutes.pdf
41,/files/3.5.19_minutes.pdf,/files/3.5.19_minutes.pdf
45,/files/2.26.19_minutes.pdf,/files/2.26.19_minutes.pdf
47,/files/2.5.19_minutes.pdf,/files/2.5.19_minutes.pdf
68,/files/8.7.18_minutes.pdf,/files/8.7.18_minutes.pdf
70,/files/7.24.18_minutes.pdf,/files/7.24.18_minutes.pdf
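
The links in the table above are relative to the site root, while the assignment asks for the URL to the file. A minimal sketch of resolving them against the page address with the standard library follows; `PAGE_URL` and `absolute_url` are names introduced here for illustration.

```python
from urllib.parse import urljoin

# The page the links were scraped from
PAGE_URL = "http://www.mineral.k12.nv.us/pages/School_Board_Minutes"


def absolute_url(href):
    """Resolve a scraped href against the page URL (hypothetical helper)."""
    # urljoin leaves absolute URLs unchanged and resolves relative ones
    return urljoin(PAGE_URL, href)


print(absolute_url("/files/6.4.19_minutes.pdf"))
# http://www.mineral.k12.nv.us/files/6.4.19_minutes.pdf
```

With the DataFrame used here, this could be applied as `df["url"].apply(absolute_url)`.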


In [9]:
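# Two file names spell out the month, so rewrite them into the m.d.yy pattern first
df["date"] = df["date"].str.replace("/files/april_2,_2018_minutes.pdf", "/files/4.2.18_minutes.pdf", regex=False)
df["date"] = df["date"].str.replace("/files/march_6,_2018_minutes.pdf", "/files/3.6.18_minutes.pdf", regex=False)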


In [10]:
# Drop everything from the first underscore onward ("_minutes.pdf")
regex_pat = re.compile(r'_..*', flags=re.IGNORECASE)
df["date"] = df["date"].str.replace(regex_pat, "", regex=True)

In [11]:
# Drop the leading "/files/" so only the date part remains
regex_pat = re.compile(r'/files/', flags=re.IGNORECASE)
df["date"] = df["date"].str.replace(regex_pat, "", regex=True)

In [12]:
df

Unnamed: 0,date,url
31,6.4.19,/files/6.4.19_minutes.pdf
33,5.28.19,/files/5.28.19_minutes.pdf
35,5.7.19,/files/5.7.19_minutes.pdf
37,4.23.19,/files/4.23.19_minutes.pdf
39,4.8.19,/files/4.8.19_minutes.pdf
41,3.5.19,/files/3.5.19_minutes.pdf
45,2.26.19,/files/2.26.19_minutes.pdf
47,2.5.19,/files/2.5.19_minutes.pdf
68,8.7.18,/files/8.7.18_minutes.pdf
70,7.24.18,/files/7.24.18_minutes.pdf
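
The date part of each file name can also be pulled out in one step with a capture group. This is a sketch under the assumption that every minutes link has the form `/files/<date>_minutes.pdf`; `MINUTES_PATTERN` and `date_token` are illustrative names.

```python
import re

# Capture whatever sits between "/files/" and "_minutes.pdf"
MINUTES_PATTERN = re.compile(r"/files/(.+)_minutes\.pdf$", flags=re.IGNORECASE)


def date_token(url):
    """Return the date part of a minutes URL, or None if it does not match."""
    match = MINUTES_PATTERN.search(url)
    return match.group(1) if match else None


print(date_token("/files/6.4.19_minutes.pdf"))         # 6.4.19
print(date_token("/files/april_2,_2018_minutes.pdf"))  # april_2,_2018
print(date_token("http://www.ets.org/parapro"))        # None
```

Because the group is anchored on `_minutes.pdf`, file names with extra underscores in the date, such as `april_2,_2018`, keep their full date part.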


In [13]:
df['date'] = df['date'].apply(lambda value: f'{value[4:]}-{value[:2]}-{value[2:4]}')

In [14]:
df

Unnamed: 0,date,url
31,19-6.-4.,/files/6.4.19_minutes.pdf
33,.19-5.-28,/files/5.28.19_minutes.pdf
35,19-5.-7.,/files/5.7.19_minutes.pdf
37,.19-4.-23,/files/4.23.19_minutes.pdf
39,19-4.-8.,/files/4.8.19_minutes.pdf
41,19-3.-5.,/files/3.5.19_minutes.pdf
45,.19-2.-26,/files/2.26.19_minutes.pdf
47,19-2.-5.,/files/2.5.19_minutes.pdf
68,18-8.-7.,/files/8.7.18_minutes.pdf
70,.18-7.-24,/files/7.24.18_minutes.pdf


In [15]:
# Remove the leftover dots
regex_pat = re.compile(r'[.]')
df["date"] = df["date"].str.replace(regex_pat, "", regex=True)

In [16]:
df

Unnamed: 0,date,url
31,19-6-4,/files/6.4.19_minutes.pdf
33,19-5-28,/files/5.28.19_minutes.pdf
35,19-5-7,/files/5.7.19_minutes.pdf
37,19-4-23,/files/4.23.19_minutes.pdf
39,19-4-8,/files/4.8.19_minutes.pdf
41,19-3-5,/files/3.5.19_minutes.pdf
45,19-2-26,/files/2.26.19_minutes.pdf
47,19-2-5,/files/2.5.19_minutes.pdf
68,18-8-7,/files/8.7.18_minutes.pdf
70,18-7-24,/files/7.24.18_minutes.pdf
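
The assignment asks for YYYY-MM-DD, while the table above holds two-digit years and unpadded months and days. A minimal sketch of getting the padded form by parsing the date with the strftime codes from the hints, instead of slicing by position, is shown below. The helper name `to_iso_date` and the list of formats are assumptions based on the file names seen on the page.

```python
from datetime import datetime

# Formats seen in the file names: "6.4.19" and "april_2,_2018"
KNOWN_FORMATS = ("%m.%d.%y", "%B_%d,_%Y")


def to_iso_date(token):
    """Parse a date token and return it as YYYY-MM-DD, or None if unparseable."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(token, fmt).strftime("%Y-%m-%d")
        except ValueError:
            # This format did not fit, so try the next one
            continue
    return None


print(to_iso_date("6.4.19"))         # 2019-06-04
print(to_iso_date("5.28.19"))        # 2019-05-28
print(to_iso_date("april_2,_2018"))  # 2018-04-02
```

In pandas, the equivalent for the `m.d.yy` tokens would be along the lines of `pd.to_datetime(df["date"], format="%m.%d.%y").dt.strftime("%Y-%m-%d")`.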


In [20]:
df.to_csv("minutes.csv", index=False)
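
For the first bonus, a rough sketch of downloading the PDFs with the standard library follows. The folder name `minutes_pdfs` and the helpers `local_filename` and `download_minutes` are hypothetical, and the download function is only defined here, not called, since it needs network access.

```python
import os
from urllib.parse import urljoin, urlparse
from urllib.request import urlretrieve

# The page the links were scraped from
PAGE_URL = "http://www.mineral.k12.nv.us/pages/School_Board_Minutes"


def local_filename(href):
    """Return the file name at the end of a URL path."""
    return os.path.basename(urlparse(href).path)


def download_minutes(href, folder="minutes_pdfs"):
    """Download one minutes PDF into `folder` and return the saved path."""
    os.makedirs(folder, exist_ok=True)
    target = os.path.join(folder, local_filename(href))
    # Resolve relative links such as "/files/..." against the page URL
    urlretrieve(urljoin(PAGE_URL, href), target)
    return target


print(local_filename("/files/6.4.19_minutes.pdf"))  # 6.4.19_minutes.pdf
```

Looping over the `url` column and calling `download_minutes(url)` for each row would fetch every file; wrapping the call in `try` and `except` keeps one broken link from stopping the rest.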