# School Board Minutes

Scrape all of the school board minutes from http://www.mineral.k12.nv.us/pages/School_Board_Minutes

Save a CSV called `minutes.csv` with the date and the URL to the file. The date should be formatted as YYYY-MM-DD.

**Bonus:** Download the PDF files

**Bonus 2:** Use [PDF OCR X](https://solutions.weblite.ca/pdfocrx/index.php) on one of the PDF files and see if it can be converted into text successfully.

* **Hint:** If you're just looking for links, there are a lot of other links on that page! Can you tell from the link itself whether it points to minutes or not? You'll want to use an "if" statement.
* **Hint:** You could also filter out bad links later on using pandas instead of while scraping.
* **Hint:** If you get a weird error that you can't really figure out, you can always tell Python to just ignore it using `try` and `except`, like below. Python will try to do the stuff inside of 'try', but if it hits an error it will skip right out.
* **Hint:** Remember the codes at http://strftime.org
* **Hint:** If you have a date that you've parsed, you can use `.dt.strftime` to turn it into a specially-formatted string. You use the same codes (like %B etc) that you use for converting strings into dates.

```python
try:
    # your code that might hit an error goes here
    pass
except Exception:
    # ignore the error and move on
    pass
```
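
The two link-filtering hints above can be sketched as follows. This is only an illustration: `is_minutes_link` and the rule it applies (minutes are PDFs stored under `/files/`) are assumptions, and `hrefs` is a small hand-made sample rather than a real scrape. Note that not every minutes file has `_minutes` in its name, so matching on that word alone would miss some.

```python
import pandas as pd


def is_minutes_link(href):
    """Hypothetical rule: treat PDFs stored under /files/ as minutes."""
    return bool(href) and "/files/" in href and href.lower().endswith(".pdf")


# Hand-made sample: two minutes PDFs and the listing page itself
hrefs = [
    "http://www.mineral.k12.nv.us/files/6.4.19_minutes.pdf",
    "http://www.mineral.k12.nv.us/pages/School_Board_Minutes",
    "http://www.mineral.k12.nv.us/files/3.5.19.pdf",
]

# Option 1: filter with an "if" statement while looping
kept = []
for href in hrefs:
    if is_minutes_link(href):
        kept.append(href)

# Option 2: keep everything while scraping, then filter later with pandas
df = pd.DataFrame({"Link": hrefs})
df_minutes = df[df["Link"].str.contains(r"/files/.*\.pdf$", case=False, na=False)]

print(kept)
print(df_minutes["Link"].tolist())
```

Both options keep the same two links here; which one is more convenient depends on whether you want to inspect the rejected links afterwards.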

In [1]:
from selenium import webdriver
import time
import pandas as pd

In [2]:
driver = webdriver.Chrome()
driver.get("http://www.mineral.k12.nv.us/pages/School_Board_Minutes")
# find_elements_by_xpath was removed in Selenium 4; the generic form works in both versions
p = driver.find_elements("xpath", "//div[3]/p[position()>3]")

In [3]:
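def is_valid_date(strdate):
    # True only if the text is exactly a date like "June 4, 2019"
    try:
        time.strptime(strdate, "%B %d, %Y")
        return True
    except ValueError:
        # strptime raises ValueError when the text does not match the format
        return False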

In [4]:
from selenium.common.exceptions import NoSuchElementException

linkList = []
dateList = []

# The link may sit directly in the <p> or be wrapped in one or two <span>s
link_paths = ["a", "span/a", "span/span/a"]

for i in p:
    date = i.text.strip()
    if is_valid_date(date):
        # A plain date: try each candidate location for the minutes link
        link = 'None'
        for path in link_paths:
            try:
                link = i.find_element("xpath", path).get_attribute('href')
                break
            except NoSuchElementException:
                continue
        linkList.append(link)
        dateList.append(date)
    elif "CANCELLED" in date or "Canceled" in date:
        # Cancelled meetings have no minutes: keep the date, drop the label.
        # rstrip() strips a set of characters, not a suffix, so use replace().
        date = date.replace("CANCELLED", "").replace("Canceled", "").strip()
        linkList.append('None')
        dateList.append(date)


In [9]:
print(linkList)

['http://www.mineral.k12.nv.us/files/6.4.19_minutes.pdf', 'http://www.mineral.k12.nv.us/files/5.28.19_minutes.pdf', 'None', 'http://www.mineral.k12.nv.us/files/5.7.19_minutes.pdf', 'http://www.mineral.k12.nv.us/files/4.23.19_minutes.pdf', 'http://www.mineral.k12.nv.us/files/4.8.19_minutes.pdf', 'http://www.mineral.k12.nv.us/files/3.5.19_minutes.pdf', 'http://www.mineral.k12.nv.us/files/3.5.19.pdf', 'http://www.mineral.k12.nv.us/files/2.26.19_minutes.pdf', 'http://www.mineral.k12.nv.us/files/2.5.19_minutes.pdf', 'http://www.mineral.k12.nv.us/files/January_22_minutes.pdf', 'http://www.mineral.k12.nv.us/files/January_8_minutes.pdf', 'http://www.mineral.k12.nv.us/files/12.20.18_minutes.pdf', 'http://www.mineral.k12.nv.us/files/12.4.18_minutes.pdf', 'http://www.mineral.k12.nv.us/files/11.20.18.pdf', 'None', 'None', 'http://www.mineral.k12.nv.us/files/9.25.18_minutes.pdf', 'http://www.mineral.k12.nv.us/files/9.13.18_minutes.pdf', 'http://www.mineral.k12.nv.us/files/9.4.18.pdf', 'http://www.m

In [10]:
print(dateList)

['June 4, 2019', 'May 28, 2019', 'May 21, 2019', 'May 7, 2019', 'April 23, 2019', 'April 8, 2019', 'March 19, 2019', 'March 5, 2019', 'February 26, 2019', 'February 5, 2019', 'January 22, 2019', 'January 8, 2019', 'December 20, 2018', 'December 4, 2018', 'November 20, 2018', 'November 7, 2018', 'October 16, 2018', 'September 25, 2018', 'September 13, 2018', 'September 4, 2018', 'August 21, 2018', 'August 7, 2018', 'July 24, 2018', 'July 10, 2018', 'June 28, 2018', 'June 22, 2018', 'June 21, 2018', 'June 19, 2108', 'June 6, 2018', 'May 29, 2018', 'May 22, 2018', 'May 15, 2018', 'May 1, 2018', 'April 17, 2018', 'April 2, 2018', 'March 27, 2018', 'March 22, 2018', 'March 20, 2018', 'March 8, 2018', 'March 6, 2018', 'February 20, 2018', 'February 6, 2018', 'January 16, 2018', 'January 2, 2018', 'January 5, 2017', 'January 26, 2017', 'February 2, 2017', 'February 16, 2017', 'March 2, 2017', 'March 16, 2017', 'April 12, 2017', 'May 1, 2017', 'May 4, 2017']


In [6]:
# Build the table directly from the two parallel lists
table = pd.DataFrame({'Date': dateList, 'Link': linkList})
table.head()

|   | Date | Link |
|---|------|------|
| 0 | June 4, 2019 | http://www.mineral.k12.nv.us/files/6.4.19_minu... |
| 1 | May 28, 2019 | http://www.mineral.k12.nv.us/files/5.28.19_min... |
| 2 | May 21, 2019 | |
| 3 | May 7, 2019 | http://www.mineral.k12.nv.us/files/5.7.19_minu... |
| 4 | April 23, 2019 | http://www.mineral.k12.nv.us/files/4.23.19_min... |
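
The assignment asks for dates formatted as YYYY-MM-DD, while the `Date` column still holds strings like "June 4, 2019". A minimal sketch of the conversion, following the `.dt.strftime` hint, is below. The `sample` DataFrame is a hypothetical stand-in built from the first three rows so that the snippet runs on its own; the same two lines could be applied to `table` before saving.

```python
import pandas as pd

# Hypothetical stand-in for `table`, using the first three scraped rows
sample = pd.DataFrame({
    "Date": ["June 4, 2019", "May 28, 2019", "May 21, 2019"],
    "Link": [
        "http://www.mineral.k12.nv.us/files/6.4.19_minutes.pdf",
        "http://www.mineral.k12.nv.us/files/5.28.19_minutes.pdf",
        "None",
    ],
})

# Parse with the same codes used in is_valid_date, then write them back out
parsed = pd.to_datetime(sample["Date"], format="%B %d, %Y")
sample["Date"] = parsed.dt.strftime("%Y-%m-%d")

print(sample["Date"].tolist())
# → ['2019-06-04', '2019-05-28', '2019-05-21']
```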


In [None]:
# The assignment asks for the file to be called minutes.csv
table.to_csv("minutes.csv", index=False)
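
For the first bonus, one possible way to download the PDFs is sketched below using only the standard library. The function name `download_pdfs`, the folder name `minutes_pdfs`, and the `fetch` parameter are all assumptions made for this sketch; `fetch` exists only so the function can be tried with a stub instead of hitting the network.

```python
import os
import urllib.request
from urllib.parse import urlparse


def download_pdfs(links, dest_dir="minutes_pdfs", fetch=urllib.request.urlretrieve):
    """Save each real link into dest_dir and return the list of saved paths."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for link in links:
        # The scraping loop stores the string 'None' for meetings without minutes
        if not link or link == "None":
            continue
        # Name the local file after the last part of the URL path
        filename = os.path.basename(urlparse(link).path)
        target = os.path.join(dest_dir, filename)
        try:
            fetch(link, target)
        except OSError as err:
            # urllib's URLError and HTTPError are both subclasses of OSError
            print(f"Skipping {link}: {err}")
            continue
        saved.append(target)
    return saved


# Usage with the list built earlier (left commented out so nothing downloads on import):
# saved_files = download_pdfs(linkList)
```

Catching only `OSError` keeps a single dead link from stopping the whole run while still surfacing unexpected bugs.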