# School Board Minutes

Scrape all of the school board minutes from http://www.mineral.k12.nv.us/pages/School_Board_Minutes

Save a CSV called `minutes.csv` with the date and the URL to the file. The date should be formatted as YYYY-MM-DD.

**Bonus:** Download the PDF files

**Bonus 2:** Use [PDF OCR X](https://solutions.weblite.ca/pdfocrx/index.php) on one of the PDF files and see if it can be converted into text successfully.

* **Hint:** If you're just looking for links, there are a lot of other links on that page! Can you look at the link to know whether it links to minutes or not? You'll want to use an "if" statement.
* **Hint:** You could also filter out bad links later on using pandas instead of when scraping
* **Hint:** If you get a weird error that you can't really figure out, you can always tell Python to just ignore it using `try` and `except`, like below. Python will try to do the stuff inside of 'try', but if it hits an error it will skip right out.
* **Hint:** Remember the codes at http://strftime.org
* **Hint:** If you have a date that you've parsed, you can use `.dt.strftime` to turn it into a specially-formatted string. You use the same codes (like %B etc) that you use for converting strings into dates.

```python
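try:
    # Your code goes here. If any line raises an error,
    # Python skips the rest of the block and jumps to except.
    value = int("not a number")
except Exception:
    # Ignore the error and move on
    pass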
```
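The first hint, deciding with an "if" statement whether a link points to minutes, can be sketched as a small helper. This is only an illustration: `is_minutes_link` is a hypothetical name, `/pages/Home` is a made-up example of an unrelated link, and the rule that minutes live under `/files/` and end in `.pdf` is an assumption based on the links that show up on the page.

```python
def is_minutes_link(href):
    """Guess whether an href points at a minutes PDF (assumed rule, not guaranteed by the site)."""
    if href is None:
        return False
    href = href.lower()
    if href.startswith('/files/') and href.endswith('.pdf'):
        return True
    return False

# A mix of minutes links, an unrelated link, and a tag with no href at all
sample_hrefs = ['/files/6.4.19_minutes.pdf', '/pages/Home', None, '/files/5.1.17_Minutes.pdf']
minutes_links = [href for href in sample_hrefs if is_minutes_link(href)]
print(minutes_links)
```

Lowercasing the copy used for the check keeps links such as `5.1.17_Minutes.pdf` from being missed, while the original href is what ends up in the list.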

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
from datetime import date
import re

In [2]:
url = 'http://www.mineral.k12.nv.us/pages/School_Board_Minutes'
response = requests.get(url)
doc = BeautifulSoup(response.text, "html.parser")

In [3]:
doc.find('div', attrs= {"id" : "livesite-page-content-left"}).find_all('p')[4].find('a').text.strip().split()
   

['June', '4,', '2019']

In [4]:
doc.find('div', attrs= {"id" : "livesite-page-content-left"}).find_all('p')[4].find('a')['href']

'/files/6.4.19_minutes.pdf'

In [5]:
board_minutes = []
minutes = doc.find('div', attrs= {"id" : "livesite-page-content-left"})

for minute in minutes.find_all('p')[4:]:
    row = {}
    
    # Paragraphs without an <a> tag leave the row empty; empty rows are dropped later
    link = minute.find('a')
    if link is not None:
        row['Date'] = link.text.strip()
        row['Link'] = link.get('href')
    board_minutes.append(row)
board_minutes
     

[{'Date': 'June 4, 2019', 'Link': '/files/6.4.19_minutes.pdf'},
 {'Date': 'May 28, 2019', 'Link': '/files/5.28.19_minutes.pdf'},
 {},
 {'Date': 'May 7, 2019', 'Link': '/files/5.7.19_minutes.pdf'},
 {'Date': 'April 23, 2019', 'Link': '/files/4.23.19_minutes.pdf'},
 {'Date': 'April 8, 2019', 'Link': '/files/4.8.19_minutes.pdf'},
 {'Date': 'March 19, 2019', 'Link': '/files/3.5.19_minutes.pdf'},
 {'Date': 'March 5, 2019', 'Link': '/files/3.5.19.pdf'},
 {'Date': 'February 26, 2019', 'Link': '/files/2.26.19_minutes.pdf'},
 {'Date': 'February 5, 2019', 'Link': '/files/2.5.19_minutes.pdf'},
 {'Date': 'January 22, 2019', 'Link': '/files/January_22_minutes.pdf'},
 {'Date': 'January 8, 2019', 'Link': '/files/January_8_minutes.pdf'},
 {},
 {},
 {'Date': 'December 20, 2018', 'Link': '/files/12.20.18_minutes.pdf'},
 {'Date': 'December 4, 2018', 'Link': '/files/12.4.18_minutes.pdf'},
 {'Date': 'November 20, 2018', 'Link': '/files/11.20.18.pdf'},
 {},
 {},
 {'Date': 'September 25, 2018', 'Link': '/fil

In [6]:
df = pd.DataFrame(board_minutes)
df = df.dropna(how='all')
df.shape

(43, 2)
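As the second hint suggests, unwanted links can also be filtered in pandas after scraping instead of during it. A minimal sketch, assuming the rows look like the scraped ones; `rows`, `sample_df`, and the `/pages/Home` entry are made up for illustration.

```python
import pandas as pd

# Hypothetical scrape result: one minutes link, one empty row, one unrelated link
rows = [
    {'Date': 'June 4, 2019', 'Link': '/files/6.4.19_minutes.pdf'},
    {},
    {'Date': 'Home', 'Link': '/pages/Home'},
]
sample_df = pd.DataFrame(rows)

# Keep only rows whose link ends in .pdf; na=False treats missing links as non-matches
is_pdf = sample_df['Link'].str.lower().str.endswith('.pdf', na=False)
filtered = sample_df[is_pdf]
print(filtered)
```

Because missing links count as non-matches, this filter removes the empty rows as well, so it could stand in for the `dropna` step.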

In [7]:
df['Date'].dtypes

dtype('O')

In [8]:
df['Date'] = pd.to_datetime(df['Date']).dt.strftime('%Y-%m-%d')
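To see what the format codes do on a single value, here is a small sketch with the standard library's `datetime` rather than pandas. It assumes an English locale, since `%B` matches the month name in the current locale; `raw_date`, `parsed`, and `formatted` are illustrative names.

```python
from datetime import datetime

raw_date = 'June 4, 2019'

# String -> date: %B = full month name, %d = day of month, %Y = four-digit year
parsed = datetime.strptime(raw_date, '%B %d, %Y')

# Date -> string: the same family of codes, arranged as YYYY-MM-DD
formatted = parsed.strftime('%Y-%m-%d')
print(formatted)
```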

In [9]:
df

Unnamed: 0,Date,Link
0,2019-06-04,/files/6.4.19_minutes.pdf
1,2019-05-28,/files/5.28.19_minutes.pdf
3,2019-05-07,/files/5.7.19_minutes.pdf
4,2019-04-23,/files/4.23.19_minutes.pdf
5,2019-04-08,/files/4.8.19_minutes.pdf
6,2019-03-19,/files/3.5.19_minutes.pdf
7,2019-03-05,/files/3.5.19.pdf
8,2019-02-26,/files/2.26.19_minutes.pdf
9,2019-02-05,/files/2.5.19_minutes.pdf
10,2019-01-22,/files/January_22_minutes.pdf


In [10]:
df.shape

(43, 2)

In [11]:
# Save the dates and links to minutes.csv, as the assignment asks
df.to_csv("minutes.csv", index=False)
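The assignment asks for the URL to each file, while the `Link` column holds site-relative paths. One possible way to build full URLs is `urllib.parse.urljoin`, sketched here on two of the scraped paths; `page_url`, `relative_links`, and `full_links` are illustrative names.

```python
from urllib.parse import urljoin

page_url = 'http://www.mineral.k12.nv.us/pages/School_Board_Minutes'
relative_links = ['/files/6.4.19_minutes.pdf', '/files/5.28.19_minutes.pdf']

# urljoin resolves each relative path against the page it was scraped from
full_links = [urljoin(page_url, link) for link in relative_links]
print(full_links)
```

In the notebook the same idea could be applied to the whole column with `df['Link'].apply(lambda link: urljoin(url, link))` before writing the CSV.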

In [13]:
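for path in df['Link']:
    link = 'http://www.mineral.k12.nv.us' + path
    book_name = "/Users/suhailbhat/desktop/2018pdfs/" + link.split('/')[-1]
    # Request the file first so a failed download does not leave a bogus PDF behind
    a = requests.get(link, stream=True)
    if a.status_code != 200:
        print("Skipping", link, a.status_code)
        continue
    with open(book_name, 'wb') as book:
        # Write the response to disk in 512-byte chunks
        for block in a.iter_content(512):
            book.write(block)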

In [14]:
pdf_names = [url.split('/')[-1] for url in df['Link']]
pdf_names

['6.4.19_minutes.pdf',
 '5.28.19_minutes.pdf',
 '5.7.19_minutes.pdf',
 '4.23.19_minutes.pdf',
 '4.8.19_minutes.pdf',
 '3.5.19_minutes.pdf',
 '3.5.19.pdf',
 '2.26.19_minutes.pdf',
 '2.5.19_minutes.pdf',
 'January_22_minutes.pdf',
 'January_8_minutes.pdf',
 '12.20.18_minutes.pdf',
 '12.4.18_minutes.pdf',
 '11.20.18.pdf',
 '9.25.18_minutes.pdf',
 '9.13.18_minutes.pdf',
 '9.4.18.pdf',
 '8.21.18_minutes.pdf',
 '8.7.18_minutes.pdf',
 '7.24.18_minutes.pdf',
 '7.10.18_minutes.pdf',
 '6.28.18.pdf',
 '6.22.18_minutes.pdf',
 '6.21.18.pdf',
 '6.19.18_minutes.pdf',
 '5.29.18_minutes.pdf',
 '4.17.18.pdf',
 'april_2,_2018_minutes.pdf',
 '3.8.18.pdf',
 'march_6,_2018_minutes.pdf',
 'feb_20,_2108_minutes.pdf',
 '2.6.18_minutes.pdf',
 'january_16,_2018_minutes.pdf',
 '2.6.18_minutes.pdf',
 '1.5.17_minutes.pdf',
 '1.26.17_minutes.pdf',
 '2.2.17_minutes.pdf',
 '2.16.17_minutes.pdf',
 '3.2.17__minutes.pdf',
 '3.16.17_minutes.pdf',
 '4.12.17_minutes.pdf',
 '5.1.17_Minutes.pdf',
 '5.4.17_minutes.pdf']

In [15]:
import subprocess
def pdf_to_text(name):
    folder = "/Users/suhailbhat/desktop/2018pdfs/"
    input1 = folder + name
    txt_name = name.replace(".pdf", ".txt")
    output1 = folder + txt_name
    # Pass the arguments as a list so file names with spaces, commas, or quotes are handled safely
    subprocess.run(["pdftotext", input1, output1])

In [16]:
for pdf_file in pdf_names:
    pdf_to_text(pdf_file)

In [17]:
with open('/Users/suhailbhat/desktop/2018pdfs/2.26.19_minutes.txt', 'r') as f:
    sample_transcript = f.read()

In [18]:
sample_transcript

'\x0c\x0c\x0c'
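The result is nothing but three form-feed characters (`\x0c`), which `pdftotext` writes as page breaks. That usually means the PDF has no text layer, most likely because the pages are scanned images, so OCR would be needed to get real text. A small sketch for spotting such files; `has_no_text` is a hypothetical helper and the second sample string is made up.

```python
def has_no_text(extracted):
    """Return True when the extracted text is only whitespace, such as form feeds."""
    return extracted.strip() == ''

# The output seen above: page breaks only
print(has_no_text('\x0c\x0c\x0c'))

# A made-up example of a page that does contain text
print(has_no_text('Regular meeting called to order\x0c'))
```

Running a check like this over every converted `.txt` file would show which PDFs are worth sending through OCR.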