## Gather documents from district websites

The goal of this script is to create a single data frame containing every candidate link to a District of Innovation (DOI) plan. Once run, the data frame has four columns: title (the district name), level, type (the file type), and link.

In [2]:
import gather_documents
import os
import pandas as pd
from start import data_path

In [3]:
test_sample = False
sample_size = 5

## First Level Links

First level links are document URLs (identified by .doc, .pdf, or drive.google) linked directly from the TEA website. They are almost certainly DOI plans.
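The identifier rule above can be sketched as a small classifier. This is only an illustration under stated assumptions: `classify_link` is a hypothetical helper, not part of `gather_documents`, and both the `'doc'` label and the handling of `.docx` are guesses (only `pdf` and `google` appear as type values in this notebook's output).

```python
from urllib.parse import urlparse

def classify_link(url):
    """Return a document type for a URL, or None if it has no identifier.

    Hypothetical helper: the real logic inside gather_documents is not shown.
    """
    parsed = urlparse(url.lower())
    if 'drive.google' in parsed.netloc:
        return 'google'
    if parsed.path.endswith('.pdf'):
        return 'pdf'
    if parsed.path.endswith(('.doc', '.docx')):  # .docx is an assumption
        return 'doc'
    return None  # no document identifier, so not a first level link

for url in ['https://drive.google.com/open?id=abc',       # made-up id
            'http://www.alvinisd.net/innovation']:
    print(url, '->', classify_link(url))
```

Checking the parsed host and path, rather than searching the whole string, avoids false matches such as a `/pdf/` directory in the middle of a URL.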

In [21]:
tea_url = 'https://tea.texas.gov/Texas_Schools/District_Initiatives/Districts_of_Innovation/'

In [22]:
def pipe_dataframe(df):
    df = df.copy()
    df = df.reset_index()
    df = df.rename(columns={'index': 'title'})
    df['level'] = 'First'
    return df

In [23]:
first_level_links = gather_documents.FirstLevelLinks(tea_url, print_interim=False)
first_level_links_df = first_level_links.docs_df.pipe(pipe_dataframe)
print('There are {} first level links.'.format(len(first_level_links_df)))
first_level_links_df.tail()



 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "lxml")

  markup_type=markup_type))


There are 435 first level links.


Unnamed: 0,title,link,type,level
430,Burkburnett ISD,https://1.cdn.edl.io/NmmgpAdINIeQ8MlcQQx6MB0zP...,pdf,First
431,Malone ISD,http://www.maloneisd.org/pdf/doi/MaloneISDInno...,pdf,First
432,Ricardo ISD,http://www.ricardoisd.us/UserFiles/Servers/Ser...,pdf,First
433,Raymondville ISD,https://s3.amazonaws.com/scschoolfiles/1444/ri...,pdf,First
434,Mount Enterprise ISD,http://www.meisd.org/PDFs/District%20of%20Inno...,pdf,First


## Seed Links

Seed links are district URLs from the TEA website which do not have a document identifier (like .pdf). These are the websites we will search for additional documents.
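As a rough sketch of that definition, the filter below keeps only the district URLs that carry none of the document identifiers. `DOC_MARKERS` and `is_seed_link` are hypothetical names, and plain substring matching is an assumption; `gather_documents.SeedLinks` may decide this differently.

```python
# Identifiers named in this notebook; substring matching is an assumption.
DOC_MARKERS = ('.pdf', '.doc', 'drive.google')

def is_seed_link(url):
    """True when the URL contains none of the document identifiers."""
    lowered = url.lower()
    return not any(marker in lowered for marker in DOC_MARKERS)

tea_links = {
    'Alvin ISD': 'http://www.alvinisd.net/innovation',
    'Albany ISD': 'http://www.albanyisd.net/district-of-innovation.html',
    'Broaddus ISD': 'https://core-docs.s3.amazonaws.com/documents/asset/'
                    'uploaded_file/283241/Broaddus_ISD_Dist_of_Innov_2018-2019.pdf',
}

# Keep only the districts whose link is a page to crawl, not a document.
seeds = {district: url for district, url in tea_links.items() if is_seed_link(url)}
print(list(seeds))
```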

In [24]:
seed_links = gather_documents.SeedLinks(tea_url, print_interim = False).seed_links
print("There are", len(seed_links), "seed links.")
seed_links





There are 401 seed links.


{'Abbott ISD': 'https://www.abbottisd.org/apps/bbmessages/show_bbm.jsp?REC_ID=110632',
 'Abernathy ISD': 'http://www.abernathyisd.com/apps/pages/index.jsp?uREC_ID=284545&type=d&pREC_ID=1134133',
 'Academy ISD': 'https://www.academyisd.net/apps/pages/index.jsp?uREC_ID=1218559&type=d&pREC_ID=1452890',
 'Agua Dulce ISD': 'https://www.adisd.net/domain/47',
 'Alamo Heights ISD': 'http://www.ahisd.net/news/what_s_new/district_of_innovation_plan_2016-2021',
 'Albany ISD': 'http://www.albanyisd.net/district-of-innovation.html',
 'Aldine ISD': 'http://www.aldineisd.org/cms/one.aspx?portalId=750&pageId=10168641',
 'Aledo ISD': 'https://www.aledoisd.org/Domain/2005',
 'Alief ISD': 'http://www.aliefisd.net/Page/8915',
 'Allen ISD': 'http://www.allenisd.org/Page/47067',
 'Alvin ISD': 'http://www.alvinisd.net/innovation',
 'Anahuac ISD': 'https://sites.google.com/aisdpanthers.com/anahuacisd/about/district-of-innovation?authuser=0',
 'Andrews ISD': 'https://www.andrews.esc18.net',
 'Angleton ISD': 'h

If we only want to test the crawler, we should use a sample of the seed links; a full crawl can take up to half an hour.

In [25]:
if test_sample:
    # Keep only the first `sample_size` districts (dicts preserve insertion order)
    seed_links = {district: seed_links[district] for district in list(seed_links)[:sample_size]}

## Second Level Links

In [26]:
print("Crawling", len(seed_links), "seed links.")
second_level_links = gather_documents.SecondLevelLinks(seed_links)
second_level_links_df = second_level_links.docs_df
second_level_links_df = second_level_links_df.reset_index()
second_level_links_df['level'] = "Second"
print("There are", len(second_level_links_df), "second level links.")
second_level_links_df.head()

Crawling 401 seed links.






error: Merkel ISD 'src'
error: Merkel ISD 'src'
error: Merkel ISD 'src'
error: Galena Park ISD <urlopen error [Errno 8] nodename nor servname provided, or not known>
error: Galena Park ISD <urlopen error [Errno 8] nodename nor servname provided, or not known>
error: Galena Park ISD <urlopen error [Errno 8] nodename nor servname provided, or not known>
error: Kress ISD [Errno 54] Connection reset by peer
error: Lake Dallas ISD <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:719)>
error: Lake Dallas ISD <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:719)>
error: Lake Dallas ISD <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:719)>
error: Lindale ISD 'src'
error: Lindale ISD 'src'
error: Lindale ISD 'src'
error: Granbury ISD <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:719)>
error: Granbury ISD <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certifica

Unnamed: 0,title,link,type,level
0,Coolidge ISD,https://drive.google.com/open?id=1uIeBVaCcOaqN...,google,Second
1,Celina ISD,https://www.celinaisd.com/wp-documents/Require...,pdf,Second
2,Hooks ISD,http://www.hooksisd.net/users/STAAR%20Informat...,pdf,Second
3,Albany ISD,http://www.albanyisd.net/uploads/4/4/4/1/44419...,pdf,Second
4,Lago Vista ISD,http://www.lagovistaisd.net/upload/page/0026/d...,pdf,Second
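The crawl output above shows individual districts failing while the run still completes. A minimal sketch of that log-and-continue pattern follows. `crawl_seed_links`, `fetch_links`, and `fake_fetch` are hypothetical names and the example.org URLs are placeholders; nothing here is taken from `gather_documents.SecondLevelLinks`, whose internals are not shown.

```python
def crawl_seed_links(seed_links, fetch_links):
    """Collect links per district, logging failures instead of stopping."""
    found, failed = {}, {}
    for district, url in seed_links.items():
        try:
            found[district] = fetch_links(url)
        except Exception as exc:  # deliberately broad: one bad site must not end the crawl
            print('error:', district, exc)
            failed[district] = str(exc)
    return found, failed

def fake_fetch(url):
    """Stand-in for a real page fetcher so the sketch runs offline."""
    if 'merkel' in url:
        raise KeyError('src')  # e.g. reading an attribute a tag does not have
    return [url + '/plan.pdf']

found, failed = crawl_seed_links(
    {'Albany ISD': 'https://example.org/albany',    # placeholder URL
     'Merkel ISD': 'https://example.org/merkel'},   # placeholder URL
    fake_fetch,
)
```

A `KeyError('src')` prints as `'src'`, which is consistent with the `error: Merkel ISD 'src'` lines in the log, although the actual cause in the crawler is not confirmed here. Returning the `failed` mapping makes it easy to retry or inspect those districts afterwards.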


## HTML Links

A few DOI plans are published as plain HTML pages rather than as files, so we also keep each seed link itself as a possible document.

In [27]:
html_links_df = pd.DataFrame.from_dict(seed_links, orient = 'index', columns = ['link'])
html_links_df = html_links_df.reset_index()
html_links_df = html_links_df.rename(index=str, columns={'index': 'title'})
# A scalar is broadcast to every row, so no helper lists are needed
html_links_df['type'] = 'html'
html_links_df['level'] = 'html'
print("We added", len(seed_links), "HTML links as possible documents.")
html_links_df.head(10)

We added 401 HTML links as possible documents.


Unnamed: 0,title,link,type,level
0,Dickinson ISD,http://www.dickinsonisd.org/upload/page/0016/d...,html,html
1,Knippa ISD,https://docs.google.com/viewer?a=v&pid=sites&s...,html,html
2,Weatherford ISD,https://www.weatherfordisd.com/apps/pages/inde...,html,html
3,Banquete ISD,https://www.banqueteisd.esc2.net/domain/18,html,html
4,Alamo Heights ISD,http://www.ahisd.net/news/what_s_new/district_...,html,html
5,Round Rock ISD,https://roundrockisd.org/district-of-innovation/,html,html
6,La Vernia ISD,https://www.lvisd.org/Page/1083,html,html
7,Port Aransas ISD,http://www.paisd.net/home/district-of-innovation,html,html
8,Angleton ISD,https://www.angletonisd.net//site/default.aspx...,html,html
9,Alvin ISD,http://www.alvinisd.net/innovation,html,html


## Combine first, second, and HTML links

In [28]:
# DataFrame.append was removed in pandas 2.0, so combine the frames with pd.concat
links_scraped = pd.concat(
    [first_level_links_df, second_level_links_df, html_links_df],
    ignore_index=True, sort=True)
# Keep only the four columns of interest, in a fixed order
links_scraped = links_scraped[['title', 'level', 'type', 'link']]
print("We have", len(links_scraped), "scraped links which may be DOI documents.")
links_scraped.head()

We have 4415 scraped links which may be DOI documents.


Unnamed: 0,title,level,type,link
0,Abilene ISD,First,pdf,https://www.abileneisd.org/wp-content/uploads/...
1,Carlisle ISD,First,pdf,https://s3.amazonaws.com/scschoolfiles/83/carl...
2,Lockney ISD,First,pdf,https://s3.amazonaws.com/scschoolfiles/777/doi...
3,Arlington ISD,First,pdf,http://w4.aisd.net/pdf/District-of-Innovation-...
4,Paris ISD,First,pdf,https://s3.amazonaws.com/scschoolfiles/167/doi...


Some of these links are duplicates, so we keep only one row per URL.

In [29]:
links_scraped = links_scraped.drop_duplicates(subset = 'link')
print(len(links_scraped), "after dropping duplicates.")

3995 after dropping duplicates.
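It is worth noting which row survives deduplication. `drop_duplicates` keeps the first occurrence by default (`keep='first'`), so when the same URL appears at several levels, the row from the frame that was combined earliest wins. The toy data below is made up for illustration and is not taken from the scraped results.

```python
import pandas as pd

# Made-up rows: the same URL found by the crawler and again as an HTML seed link.
links = pd.DataFrame({
    'title': ['Alvin ISD', 'Alvin ISD', 'Albany ISD'],
    'level': ['Second', 'html', 'html'],
    'link': ['http://www.alvinisd.net/innovation',
             'http://www.alvinisd.net/innovation',
             'http://www.albanyisd.net/district-of-innovation.html'],
})

deduped = links.drop_duplicates(subset='link')  # keep='first' is the default
print(deduped[['title', 'level']])
```

Because rows are compared on `link` alone, two districts that share one URL would also collapse into a single row.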


## Save

In [35]:
if not test_sample:
    links_scraped.to_csv(os.path.join(data_path, 'links_scraped.csv'))

In [4]:
links = pd.read_csv(os.path.join(data_path, 'links_scraped.csv'))
links.head()

Unnamed: 0.1,Unnamed: 0,title,level,type,link
0,0,Abilene ISD,First,pdf,https://www.abileneisd.org/wp-content/uploads/...
1,1,Carlisle ISD,First,pdf,https://s3.amazonaws.com/scschoolfiles/83/carl...
2,2,Lockney ISD,First,pdf,https://s3.amazonaws.com/scschoolfiles/777/doi...
3,3,Arlington ISD,First,pdf,http://w4.aisd.net/pdf/District-of-Innovation-...
4,4,Paris ISD,First,pdf,https://s3.amazonaws.com/scschoolfiles/167/doi...
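The `Unnamed: 0` column in the table above is the saved row index coming back as data: `to_csv` writes the index by default and `read_csv` reads it as an ordinary, unnamed column. A small sketch of the effect, using an in-memory buffer and a one-row frame in place of the real file:

```python
import io
import pandas as pd

df = pd.DataFrame({'title': ['Abilene ISD'], 'level': ['First']})

# Default: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)

# index=False leaves the index out, so the columns round-trip unchanged.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```

If the file has already been written with its index, passing `index_col=0` to `read_csv` is an alternative way to avoid the extra column.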


In [32]:
links[(links.title == "Abbott ISD")][['title', 'link']].sort_values(by = 'title').to_csv(os.path.join(data_path, 'links_scraped_snippet.csv'))

In [12]:
for link in links[(links.title == "Broaddus ISD")].link:
    print(link)

https://core-docs.s3.amazonaws.com/documents/asset/uploaded_file/283241/Broaddus_ISD_Dist_of_Innov_2018-2019.pdf


In [34]:
len(links)

3995