## Gather documents from district websites

This notebook builds a single data frame containing every candidate link to a District of Innovation (DOI) plan. Once run, the data frame has four columns: title (the district name), level, type (the file type), and link.

In [1]:
import gather_documents
import os
import pandas as pd
from start import data_path

In [2]:
test_sample = True
sample_size = 5

## First Level Links

First-level links are document URLs (identified by .doc, .pdf, or drive.google) linked directly from the TEA website. They are almost certainly DOI plans.
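As a rough illustration of how a URL might be tagged with one of these document types, here is a minimal sketch. The function name `classify_link` and the exact matching rules are assumptions made for illustration; the real logic lives in `gather_documents` and may differ. Treating `.docx` as a Word document is also an assumption, since the text only names `.doc`.

```python
from urllib.parse import urlparse

# Hypothetical mapping from file extension to document type.
# The text names .doc and .pdf; .docx is an added assumption.
DOC_EXTENSIONS = {".pdf": "pdf", ".doc": "doc", ".docx": "doc"}


def classify_link(url):
    """Return 'pdf', 'doc', 'google', or None if the URL has no document marker."""
    parsed = urlparse(url)
    # Google Drive links carry no file extension, so check the host instead.
    if parsed.netloc.lower() == "drive.google.com":
        return "google"
    path = parsed.path.lower()
    for extension, file_type in DOC_EXTENSIONS.items():
        if path.endswith(extension):
            return file_type
    return None
```

A link for which this returns `None` would not count as a first-level link under this sketch.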

In [3]:
tea_url = 'https://tea.texas.gov/Texas_Schools/District_Initiatives/Districts_of_Innovation/'

In [6]:
def pipe_dataframe(df):
    """Move the index into a 'title' column and tag every row as first level."""
    df = df.copy()
    df = df.reset_index()
    df = df.rename(columns={'index': 'title'})
    df['level'] = 'First'
    return df

In [7]:
first_level_links = gather_documents.FirstLevelLinks(tea_url, print_interim=False)
first_level_links_df = first_level_links.docs_df.pipe(pipe_dataframe)
print('There are {} first level links.'.format(len(first_level_links_df)))
first_level_links_df.tail()





There are 397 first level links.


Unnamed: 0,title,link,type,level
392,Connally ISD,https://core-docs.s3.amazonaws.com/documents/a...,pdf,First
393,Hull-Daisetta ISD,http://www.hdisd.net/UserFiles/Servers/Server_...,pdf,First
394,Eula ISD,http://www.eulaisd.us/Documents/Eula%20ISD%20L...,pdf,First
395,River Road ISD,http://www.rrisd.net/UserFiles/Servers/Server_...,pdf,First
396,Snyder ISD,https://1.cdn.edl.io/7b21ktMGjSuFIU9LdTLlk0PHz...,pdf,First


## Seed Links

Seed links are district URLs from the TEA website which do not have a document identifier (like .pdf). These are the websites we will search for additional documents.
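The distinction between document links and seed links can be sketched as a simple partition. This is only an illustration under stated assumptions: `split_links` and `DOCUMENT_MARKERS` are hypothetical names, and the crude substring match below is not necessarily what `gather_documents.SeedLinks` does.

```python
# Markers named in the text; substring matching is an assumption of this sketch.
DOCUMENT_MARKERS = (".pdf", ".doc", "drive.google")


def split_links(links):
    """Split a {district: url} dict into (document_links, seed_links)."""
    document_links, seed_links = {}, {}
    for district, url in links.items():
        is_document = any(marker in url.lower() for marker in DOCUMENT_MARKERS)
        target = document_links if is_document else seed_links
        target[district] = url
    return document_links, seed_links
```

For example, `split_links({'A ISD': 'http://example.org/plan.pdf', 'B ISD': 'http://example.org/doi'})` would place the first entry among the documents and the second among the seeds.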

In [8]:
seed_links = gather_documents.SeedLinks(tea_url, print_interim = False).seed_links
print("There are", len(seed_links), "seed links.")
seed_links





There are 395 seed links.


{'Abbott ISD': 'https://www.abbottisd.org/apps/bbmessages/show_bbm.jsp?REC_ID=110632',
 'Abernathy ISD': 'http://www.abernathyisd.com/apps/pages/index.jsp?uREC_ID=284545&type=d&pREC_ID=1134133',
 'Academy ISD': 'https://www.academyisd.net/apps/pages/index.jsp?uREC_ID=1218559&type=d&pREC_ID=1452890',
 'Agua Dulce ISD': 'https://www.adisd.net/domain/47',
 'Alamo Heights ISD': 'http://www.ahisd.net/news/what_s_new/district_of_innovation_plan_2016-2021',
 'Albany ISD': 'http://www.albanyisd.net/district-of-innovation.html',
 'Aldine ISD': 'http://www.aldineisd.org/cms/one.aspx?portalId=750&pageId=10168641',
 'Alief ISD': 'http://www.aliefisd.net/Page/8915',
 'Allen ISD': 'http://www.allenisd.org/Page/47067',
 'Alto ISD': 'https://www.alto.esc7.net/apps/pages/index.jsp?uREC_ID=296073&type=d&pREC_ID=1401973',
 'Alvin ISD': 'http://www.alvinisd.net/innovation',
 'Anahuac ISD': 'https://sites.google.com/aisdpanthers.com/anahuacisd/about/district-of-innovation?authuser=0',
 'Angleton ISD': 'htt

If we only want to test the crawler, we should use a sample of the seed links. Otherwise, crawling can take up to half an hour.

In [9]:
if test_sample:
    # Keep only the first `sample_size` seed links (dicts preserve insertion order).
    seed_links = dict(list(seed_links.items())[:sample_size])

## Second Level Links

In [10]:
print("Crawling", len(seed_links), "seed links.")
second_level_links = gather_documents.SecondLevelLinks(seed_links)
second_level_links_df = second_level_links.docs_df
second_level_links_df = second_level_links_df.reset_index()
second_level_links_df['level'] = "Second"
print("There are", len(second_level_links_df), "second level links.")
second_level_links_df.head()

Crawling 5 seed links.






There are 46 second level links.


Unnamed: 0,title,link,type,level
0,Crowley ISD,http://www.crowleyisdtx.org/cms/lib5/TX0191778...,pdf,Second
1,Tyler ISD,https://www.tylerisd.org/cms/lib/TX01918383/Ce...,pdf,Second
2,Crowley ISD,http://www.crowleyisdtx.org/cms/lib5/TX0191778...,pdf,Second
3,Crowley ISD,http://sww.crowleyisdtx.org/cms/lib5/TX0191778...,pdf,Second
4,White Oak ISD,https://drive.google.com/drive/folders/0B-YcNq...,google,Second
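To make the idea of a second-level link concrete, the sketch below pulls document links out of one seed page's HTML. It is an assumption-laden illustration, not the `SecondLevelLinks` implementation: it uses the standard library's `html.parser` instead of BeautifulSoup, it does not fetch or follow pages, and the names `DocumentLinkParser` and `find_document_links` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Markers named earlier in the text; substring matching is an assumption.
DOCUMENT_MARKERS = (".pdf", ".doc", "drive.google")


class DocumentLinkParser(HTMLParser):
    """Collect absolute URLs of anchors that look like documents."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.document_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the seed page's URL.
        absolute_url = urljoin(self.base_url, href)
        if any(marker in absolute_url.lower() for marker in DOCUMENT_MARKERS):
            self.document_links.append(absolute_url)


def find_document_links(base_url, html):
    """Return the document links found in `html`, in page order."""
    parser = DocumentLinkParser(base_url)
    parser.feed(html)
    return parser.document_links
```

Resolving with `urljoin` matters because district pages often link to their plans with relative paths such as `/files/plan.pdf`.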


## HTML Links

A few DOI plans are published as plain HTML pages rather than as files, so we also treat each seed link itself as a possible document.

In [114]:
# Treat every seed page itself as a candidate document.
html_links_df = pd.DataFrame.from_dict(seed_links, orient='index', columns=['link'])
html_links_df = html_links_df.reset_index().rename(columns={'index': 'title'})
html_links_df['type'] = 'html'
html_links_df['level'] = 'html'
print("We added", len(html_links_df), "HTML links as possible documents.")
html_links_df.head(10)

We added 395 HTML links as possible documents.


Unnamed: 0,title,link,type,level
0,Throckmorton ISD,https://throck.socs.net/vnews/display.v/ART/48...,html,html
1,La Grange ISD,http://www.lgisd.net/district-of-innovation--4,html,html
2,Whitney ISD,http://www.whitney.k12.tx.us/,html,html
3,Industrial ISD,http://www.industrialisd.org/district-of-innov...,html,html
4,Dickinson ISD,http://www.dickinsonisd.org/upload/page/0016/d...,html,html
5,Nocona ISD,http://www.noconaisd.net/141152_2,html,html
6,Wall ISD,http://www.wallisd.net/DocumentCenter/View/10597,html,html
7,Glen Rose ISD,http://www.grisd.net/required-postings/,html,html
8,Evant ISD,http://www.evantisd.org/332142_3,html,html
9,Vega ISD,http://vegalonghorn.com/333456_3,html,html


## Combine First, Second, and HTML Links

In [119]:
# DataFrame.append was removed in pandas 2.0, so stack the frames with pd.concat.
links_scraped = pd.concat(
    [first_level_links_df, second_level_links_df, html_links_df],
    ignore_index=True,
    sort=False,
)
# Keep only the four columns of interest, in a fixed order.
links_scraped = links_scraped[['title', 'level', 'type', 'link']]
print("We have", len(links_scraped), "scraped links which may be DOI documents.")
links_scraped.head()

We have 3747 scraped links which may be DOI documents.


Unnamed: 0,title,level,type,link
0,Sands CISD,First,pdf,http://sands.esc17.net/upload/page/0019/docs/S...
1,Ricardo ISD,First,pdf,http://www.ricardoisd.us/UserFiles/Servers/Ser...
2,Stanton ISD,First,pdf,http://www.stanton.esc18.net/site/handlers/fil...
3,Gold-Burg ISD,First,pdf,http://images.pcmac.org/SiSFiles/Schools/TX/Go...
4,Joaquin ISD,First,pdf,http://www.joaquinisd.net/upload/page/0025/Joa...


Some of these links are duplicates, so we keep only the first occurrence of each URL.

In [120]:
links_scraped = links_scraped.drop_duplicates(subset = 'link')
print(len(links_scraped), "after dropping duplicates.")

3736 after dropping duplicates.
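As a small illustration of what `drop_duplicates(subset='link')` does, the sketch below uses made-up rows (the districts and URLs are hypothetical). By default pandas keeps the first row for each repeated link, so when first-level rows come before second-level and HTML rows, the first-level copy is the one that survives.

```python
import pandas as pd

# Hypothetical rows: the same URL appears as a first-level and a second-level link.
links = pd.DataFrame({
    "title": ["Example ISD", "Example ISD", "Sample ISD"],
    "level": ["First", "Second", "html"],
    "link": [
        "http://example.org/plan.pdf",
        "http://example.org/plan.pdf",
        "http://example.org/doi",
    ],
})

# keep='first' is the default: the earliest row for each link is retained.
deduplicated = links.drop_duplicates(subset="link")
print(deduplicated["level"].tolist())
```

Note that this only catches exact string matches; two URLs that differ in scheme or a trailing slash are still treated as distinct.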


## Save

In [124]:
# Only save when the full set of seed links was crawled.
if not test_sample:
    links_scraped.to_csv(os.path.join(data_path, 'links_scraped.csv'))
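When the saved file is read back later, it may help to remember that `to_csv` writes the row index as an unnamed first column by default. The sketch below shows the round trip on a made-up one-row frame, using an in-memory buffer instead of the real `links_scraped.csv`; the row contents are hypothetical.

```python
import io

import pandas as pd

# Hypothetical one-row frame with the same four columns as links_scraped.
links = pd.DataFrame({
    "title": ["Example ISD"],
    "level": ["First"],
    "type": ["pdf"],
    "link": ["http://example.org/plan.pdf"],
})

buffer = io.StringIO()
links.to_csv(buffer)  # the index is written as a first column with an empty header

buffer.seek(0)
naive = pd.read_csv(buffer)  # the index comes back as a column named 'Unnamed: 0'

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # use that first column as the index again
```

Passing `index=False` to `to_csv` is an alternative when the index carries no information.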