# Web Scraping and API Pulls

In this notebook we will collect data from a variety of sources. We will use the NOAA API to pull in weather alerts and the News API to pull in news articles matching keywords related to road closures, and we will scrape road closure alerts from the ND and MN 511 sites.

Finally, we will apply spaCy named entity recognition to these data sets, with the exception of ND 511. We will not apply spaCy to the ND 511 alerts here because that set is going to be used as a training set for our model. The other data sources are being used to locate weather alerts in Minnesota.

## Table of Contents:
* [Import Packages](#first-bullet)
* [Weather API](#second-bullet)
* [News API](#third-bullet)
* [ND 511](#fourth-bullet)
* [MN 511](#fifth-bullet)
* [Spacy Named Entities](#sixth-bullet)

# Import Packages <a class="anchor" id="first-bullet"></a>

In [2]:
import json
import requests
import pandas as pd

In [3]:
from bs4 import BeautifulSoup

In [4]:
import en_core_web_sm
nlp = en_core_web_sm.load()
#https://github.com/explosion/spaCy/issues/1283

In [5]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

In [6]:
import spacy
from spacy import displacy
from collections import Counter

In [7]:
from pprint import pprint

# Weather API  <a class="anchor" id="second-bullet"></a>

Pull in data from this page: https://alerts.weather.gov/cap/us.php?x=1

The page describes itself as follows: "Watches, Warnings or Advisories for the United States. This page shows alerts currently in effect for the United States and is normally updated every two-three minutes."

API documentation: https://www.weather.gov/documentation/services-web-api

In [129]:
api_token = 'ga_nyc_students'
api_url_base = 'https://api.weather.gov/alerts/active?message_type=alert'

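url = "https://api.weather.gov"

# Check that the API is reachable. The result is stored under a name other
# than `json` so that the imported json module is not shadowed.
response = requests.get(url)
status = response.json()
status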

{'status': 'OK'}

In [130]:
response = requests.get(api_url_base)

In [132]:
print(response.status_code)

200


In [133]:
print(response.content)






In [134]:
json_obj = response.json()

In [135]:
json_obj

{'@context': ['https://raw.githubusercontent.com/geojson/geojson-ld/master/contexts/geojson-base.jsonld',
  {'wx': 'https://api.weather.gov/ontology#',
   '@vocab': 'https://api.weather.gov/ontology#'}],
 'type': 'FeatureCollection',
 'features': [{'id': 'https://api.weather.gov/alerts/NWS-IDP-PROD-KEEPALIVE-37966',
   'type': 'Feature',
   'geometry': None,
   'properties': {'@id': 'https://api.weather.gov/alerts/NWS-IDP-PROD-KEEPALIVE-37966',
    '@type': 'wx:Alert',
    'id': 'NWS-IDP-PROD-KEEPALIVE-37966',
    'areaDesc': 'Montgomery',
    'geocode': {'UGC': ['MDC031'], 'SAME': ['024031']},
    'affectedZones': ['https://api.weather.gov/zones/county/MDC031'],
    'references': [],
    'sent': '2019-04-21T02:16:25+00:00',
    'effective': '2019-04-21T02:16:25+00:00',
    'onset': None,
    'expires': '2019-04-21T02:26:25+00:00',
    'ends': None,
    'status': 'Test',
    'messageType': 'Alert',
    'category': 'Met',
    'severity': 'Unknown',
    'certainty': 'Unknown',
    'urgen

In [136]:
print(response.headers)

{'Server': 'nginx/1.12.2', 'Content-Type': 'application/geo+json', 'Last-Modified': 'Sun, 21 Apr 2019 02:16:38 GMT', 'Access-Control-Allow-Origin': '*', 'X-Server-ID': 'vm-lnx-nids-apiapp9.ncep.noaa.gov', 'X-Correlation-ID': '56a8c68b-77d4-45f8-812d-7be5b45ae40c', 'X-Request-ID': '56a8c68b-77d4-45f8-812d-7be5b45ae40c', 'Content-Encoding': 'gzip', 'Content-Length': '25750', 'Cache-Control': 'public, max-age=30, s-maxage=30', 'Expires': 'Sun, 21 Apr 2019 02:26:05 GMT', 'Date': 'Sun, 21 Apr 2019 02:25:35 GMT', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding, Accept,Feature-Flags', 'Strict-Transport-Security': 'max-age=31536000 ; includeSubDomains ; preload'}


In [137]:
weather_df = pd.DataFrame(json_obj['features'])

In [138]:
weather_df.head()

Unnamed: 0,geometry,id,properties,type
0,,https://api.weather.gov/alerts/NWS-IDP-PROD-KE...,{'@id': 'https://api.weather.gov/alerts/NWS-ID...,Feature
1,"{'type': 'Polygon', 'coordinates': [[[-105.84,...",https://api.weather.gov/alerts/NWS-IDP-PROD-34...,{'@id': 'https://api.weather.gov/alerts/NWS-ID...,Feature
2,"{'type': 'Polygon', 'coordinates': [[[-86.13, ...",https://api.weather.gov/alerts/NWS-IDP-PROD-34...,{'@id': 'https://api.weather.gov/alerts/NWS-ID...,Feature
3,"{'type': 'Polygon', 'coordinates': [[[-85.83, ...",https://api.weather.gov/alerts/NWS-IDP-PROD-34...,{'@id': 'https://api.weather.gov/alerts/NWS-ID...,Feature
4,"{'type': 'Polygon', 'coordinates': [[[-85.83, ...",https://api.weather.gov/alerts/NWS-IDP-PROD-34...,{'@id': 'https://api.weather.gov/alerts/NWS-ID...,Feature


In [139]:
# Collect one dict per alert, then build the DataFrame once.
# (DataFrame.append was removed in pandas 2.0.)
rows = []
for feature in json_obj['features']:
    props = feature['properties']
    rows.append({'area': props['areaDesc'],
                 'expires': props['expires'],
                 'headline': props['headline'],
                 'description': props['description']})

weather_df = pd.DataFrame(rows, columns=['area', 'expires', 'headline', 'description'])

In [140]:
weather_df.head()

Unnamed: 0,area,expires,headline,description
0,Montgomery,2019-04-21T02:26:25+00:00,,Monitoring message only. Please disregard.
1,Southern Campbell,2019-04-20T20:30:00-06:00,Special Weather Statement issued April 20 at 7...,"At 758 PM MDT, Doppler radar was tracking a st..."
2,"Martin, IN; Lawrence, IN",2019-04-21T21:42:00-04:00,Flood Warning issued April 20 at 9:43PM EDT un...,Big Blue River...Driftwood River...East Fork W...
3,"Lawrence, IN; Jackson, IN; Washington, IN",2019-04-21T21:42:00-04:00,Flood Warning issued April 20 at 9:43PM EDT un...,Big Blue River...Driftwood River...East Fork W...
4,"Bartholomew, IN",2019-04-21T21:42:00-04:00,Flood Warning issued April 20 at 9:43PM EDT un...,Big Blue River...Driftwood River...East Fork W...


In [141]:
len(weather_df)

60
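The flattening step above indexes each key directly, which raises a `KeyError` if an alert lacks one of the fields. As a minimal sketch of a more tolerant version, the helper below (a hypothetical name, `flatten_alert`) uses `dict.get` so that missing fields become `None`. The sample features are hand-written stand-ins shaped like the API response, not real alerts.

```python
def flatten_alert(feature):
    """Return the four fields of interest, with None for any that are missing."""
    props = feature.get("properties") or {}
    return {
        "area": props.get("areaDesc"),
        "expires": props.get("expires"),
        "headline": props.get("headline"),
        "description": props.get("description"),
    }


# Hypothetical, hand-written features shaped like the response shown above
sample_features = [
    {"properties": {"areaDesc": "Montgomery",
                    "expires": "2019-04-21T02:26:25+00:00",
                    "description": "Monitoring message only. Please disregard."}},
    {"properties": {"areaDesc": "Martin, IN; Lawrence, IN",
                    "expires": "2019-04-21T21:42:00-04:00",
                    "headline": "Flood Warning",
                    "description": "Big Blue River..."}},
]

rows = [flatten_alert(f) for f in sample_features]
```

The resulting list of dicts could be passed straight to `pd.DataFrame(rows)`.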

Below are some of the alerts that may be relevant to us. Several of them warn of road closures. More specifically to our current interest (road closures in Minnesota), there is an alert regarding flooding in the state.

In [144]:
print(weather_df['description'][weather_df['description'].str.contains("road closure")])

Name: description, dtype: object


In [148]:
print(weather_df['description'][weather_df['area'].str.contains("MN")])

Name: description, dtype: object
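Filtering on the `area` text after the download works, but the request itself can probably be narrowed instead. The sketch below builds an alerts URL restricted to one state, assuming the `area` query parameter described in the API documentation linked above; `build_alerts_url` is a hypothetical helper, not part of any library.

```python
from urllib.parse import urlencode

ALERTS_URL = "https://api.weather.gov/alerts/active"


def build_alerts_url(state=None, message_type="alert"):
    """Build an active-alerts URL, optionally limited to one two-letter state code."""
    params = {"message_type": message_type}
    if state:
        # Assumption: the API accepts `area=<state code>` as a filter
        params["area"] = state
    return ALERTS_URL + "?" + urlencode(params)


mn_url = build_alerts_url("MN")
```

The resulting URL can be passed to `requests.get` in the same way as `api_url_base` above.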


# News API  <a class="anchor" id="third-bullet"></a>

The News API has both a free and a paid version. We are working with the free version, which only allows users to pull in one page of results per search term; the paid version allows multiple pages per search term. The function defined below works with both versions of the API. Free users should always set num_pages to 1; paid users may set it higher. If a free user sets num_pages higher than 1, the API will return an error message:

"You have requested too many results. Developer accounts are limited to a max of 100 results. You are trying to request results 300 to 400.Please upgrade to a paid plan if you need more results."

The function will not break, however; only one page of results per search term will be returned.
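As a small sketch of the paging described above, the helper below builds one request URL per page, numbering pages from 1 through `num_pages`. `page_urls` and the `YOUR_KEY` placeholder are hypothetical; `urlencode` is used so that phrases containing spaces are encoded safely.

```python
from urllib.parse import urlencode

BASE_URL = "https://newsapi.org/v2/everything"


def page_urls(phrase, num_pages, api_key, page_size=100):
    """Build one request URL per page, for pages 1 through num_pages inclusive."""
    urls = []
    for page in range(1, num_pages + 1):
        query = urlencode({
            "q": phrase,
            "apiKey": api_key,
            "pageSize": page_size,
            "page": page,
            "language": "en",
        })
        urls.append(BASE_URL + "?" + query)
    return urls


# Placeholder key for illustration only
urls = page_urls("road closure", 1, api_key="YOUR_KEY")
```

With `num_pages=1` this produces exactly one URL, which matches what the free tier allows.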

In [29]:
news_api_key = 'a4125c9be11c4b9d8edc5e827379c2fd'

In [87]:
def pull_news(phrase, num_pages):
    """Pull up to num_pages pages (100 articles each) of results for one search phrase."""
    df_news_all = pd.DataFrame(columns = ['author', 'content', 'description', 'publishedAt', 'source', 'title', 'url', 'urlToImage'])    
    # pages are numbered from 1; range() excludes its upper bound, hence num_pages + 1
    for i in range(1, num_pages + 1):
        url = ('https://newsapi.org/v2/everything?q='+ phrase + 
               '&apiKey=' + news_api_key + '&pageSize=100&page=' + str(i) + '&language=en')
        print(url)
        response = requests.get(url)
        print(response)
        if response.status_code == 200:
            json_obj_news = response.json()
            df_news = pd.DataFrame(json_obj_news['articles'])
            print(len(df_news))
            df_news_all = pd.concat([df_news_all.reset_index(drop=True), df_news.reset_index(drop=True)], axis=0)
            print(len(df_news_all))
        # any other status (e.g. the free-tier page limit) is skipped
    return df_news_all

We are interested in a variety of search terms.  We will loop through the function above to pull in multiple search terms.

In [109]:
# loop through multiple news topics and combine in one df

news_topics = ['road', 'road closure', 'closure', 'highway' , 'flood', 'snow', 'storm', 
               'flood', 'floods', 'flooding', 'storms', 'minnesota', 'landslide', 'blizzard', 'ice']
df_news_all_topics = pd.DataFrame(columns = ['author', 'content', 'description', 'publishedAt', 'source', 'title', 'url', 'urlToImage'])    

for i in news_topics:
    print(i)
    # free tier: one page per search term
    news_all_df = pull_news(i, 1)
    df_news_all_topics = pd.concat([df_news_all_topics.reset_index(drop=True), news_all_df.reset_index(drop=True)], axis=0)

road
https://newsapi.org/v2/everything?q=road&apiKey=a4125c9be11c4b9d8edc5e827379c2fd&pageSize=100&page=1&language=en
<Response [200]>
100
100
road closure
https://newsapi.org/v2/everything?q=road closure&apiKey=a4125c9be11c4b9d8edc5e827379c2fd&pageSize=100&page=1&language=en
<Response [200]>
100
100
closure
https://newsapi.org/v2/everything?q=closure&apiKey=a4125c9be11c4b9d8edc5e827379c2fd&pageSize=100&page=1&language=en
<Response [200]>
100
100
highway
https://newsapi.org/v2/everything?q=highway&apiKey=a4125c9be11c4b9d8edc5e827379c2fd&pageSize=100&page=1&language=en
<Response [200]>
100
100
flood
https://newsapi.org/v2/everything?q=flood&apiKey=a4125c9be11c4b9d8edc5e827379c2fd&pageSize=100&page=1&language=en
<Response [200]>
100
100
snow
https://newsapi.org/v2/everything?q=snow&apiKey=a4125c9be11c4b9d8edc5e827379c2fd&pageSize=100&page=1&language=en
<Response [200]>
100
100
storm
https://newsapi.org/v2/everything?q=storm&apiKey=a4125c9be11c4b9d8edc5e827379c2fd&pageSize=100&page=1&lang

In [110]:
len(df_news_all_topics)

1500
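Because the topic list contains overlapping terms (and lists 'flood' twice), the combined frame likely holds the same article more than once. A minimal sketch of de-duplicating by URL, using a hypothetical helper and hand-written sample articles, might look like this:

```python
def dedupe_by_url(articles):
    """Keep the first article seen for each URL, preserving order."""
    seen = set()
    unique = []
    for article in articles:
        url = article.get("url")
        if url in seen:
            continue
        seen.add(url)
        unique.append(article)
    return unique


# Hypothetical sample records; real ones come from the 'articles' list in the response
sample_articles = [
    {"title": "A", "url": "https://example.com/a"},
    {"title": "A (again)", "url": "https://example.com/a"},
    {"title": "B", "url": "https://example.com/b"},
]

unique_articles = dedupe_by_url(sample_articles)
```

On the DataFrame itself, `df_news_all_topics.drop_duplicates(subset='url')` should have the same effect.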

In [7]:
# https://newsapi.org/docs/endpoints/everything

# North Dakota DOT 511 Web Scraping  <a class="anchor" id="fourth-bullet"></a>

In this section, we will scrape the ND 511 site for road closure alerts. This site has a large number of alerts (as of April 2019). Because North Dakota borders Minnesota and its site contains more alerts than MN 511, it will be used to train our model to recognize likely road closure alerts.

In [47]:
url = 'https://www.dot.nd.gov/dotnet/news/Public/Index'

res = requests.get(url)

In [48]:
res.status_code

200

In [49]:
soup = BeautifulSoup(res.content, 'lxml')

In [50]:
soup.find()

<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="NDDOT" name="author"/>
<meta content="NDDOT Internet Forms" name="North Dakota Department of Transportation Internet Forms"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>NDDOT - News Releases</title>
<link href="https://cdn.datatables.net/1.10.15/css/dataTables.bootstrap.min.css" rel="stylesheet"/>
<link href="/assets/css/lightslider.css" rel="stylesheet" type="text/css"/>
<!-- Bootstrap core CSS -->
<link href="/assets/css/bootstrap.css" media="all" rel="stylesheet"/>
<!-- Custom styles for this template -->
<link href="/assets/css/app.css" media="handheld, screen" rel="stylesheet"/>
<link href="/assets/css/xs-device.css?ver=1.2" media="handheld, screen" rel="stylesheet"/>
<link href="/assets/css/sm-device.css?ver=1.1" media="handheld, screen" rel="stylesheet"/>
<link href="/assets/css/md-device.css" media="handheld, screen" rel="styles

In [59]:
table = soup.find('table', {'class': "table"})

In [60]:
print(table)

<table class="table table-bordered table-striped table-condensed">
<caption>Table containing a list of news releases.</caption>
<thead>
<tr>
<th class="control-label" style="width: 12%;">Category</th>
<th>Headline</th>
<th>Publish Date</th>
</tr>
</thead>
<tbody>
<tr>
<td>Public Meetings</td>
<td><a href="/dotnet/news/Public/View/8337" target="_blank">Public Input Meeting on April 25, to discuss proposed improvements to Main Street in Mandan</a></td>
<td style="white-space: nowrap;">4/18/2019 11:42 AM</td>
</tr>
<tr>
<td>Construction</td>
<td><a href="/dotnet/news/Public/View/8336" target="_blank">Construction starts Monday on I-29 near exit 44</a></td>
<td style="white-space: nowrap;">4/18/2019 10:18 AM</td>
</tr>
<tr>
<td>Road Conditions</td>
<td><a href="/dotnet/news/Public/View/8335" target="_blank">Water receded on I-29, 20 miles north of Grand Forks</a></td>
<td style="white-space: nowrap;">4/16/2019 10:39 AM</td>
</tr>
<tr>
<td>Road Conditions</td>
<td><a href="/dotnet/news/Publ

</table>


In [142]:
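# build one record per table row, skipping the header row
alerts = []
for row in table.find_all('tr')[1:]:
    alert = {}
    alert['alert_type'] = row.find('td').text   # first cell holds the category
    alert['headline'] = row.find('a').text      # link text is the headline
    alert['url'] = row.find('a').attrs          # dict of the link's attributes (href, target)
    alerts.append(alert)

df_nd = pd.DataFrame(alerts)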

In [143]:
df_nd['url'] = df_nd['url'].astype(str)

In [144]:
df_nd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173 entries, 0 to 172
Data columns (total 3 columns):
alert_type    173 non-null object
headline      173 non-null object
url           173 non-null object
dtypes: object(3)
memory usage: 4.1+ KB


In [146]:
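# strip the stringified dict wrapper so only the href path remains (literal match, not regex)
df_nd['url'] = df_nd['url'].str.replace("{'href': '", "", regex=False)

In [150]:
df_nd['url'] = df_nd['url'].str.replace("', 'target': '_blank'}", "", regex=False)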

In [156]:
df_nd.head()

Unnamed: 0,alert_type,headline,url
0,Public Meetings,"Public Input Meeting on April 25, to discuss proposed improvements to Main Street in Mandan",/dotnet/news/Public/View/8337
1,Construction,Construction starts Monday on I-29 near exit 44,/dotnet/news/Public/View/8336
2,Road Conditions,"Water receded on I-29, 20 miles north of Grand Forks",/dotnet/news/Public/View/8335
3,Road Conditions,ND Hwy 5 temporarily closed from I-29 to the Red River,/dotnet/news/Public/View/8334
4,Road Conditions,Water on I-29 north of Grand Forks,/dotnet/news/Public/View/8333


The site contains more than just road closure alerts; some entries are for public meetings, construction notices, etc. We do not want these in our set, so we will keep only the rows whose category is Road Conditions.

In [159]:
mask = df_nd['alert_type'] == 'Road Conditions'
df_nd_conditions = df_nd[mask]

In [160]:
df_nd_conditions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 134 entries, 2 to 170
Data columns (total 3 columns):
alert_type    134 non-null object
headline      134 non-null object
url           134 non-null object
dtypes: object(3)
memory usage: 4.2+ KB


In [161]:
df_nd_conditions.head()

Unnamed: 0,alert_type,headline,url
2,Road Conditions,"Water receded on I-29, 20 miles north of Grand Forks",/dotnet/news/Public/View/8335
3,Road Conditions,ND Hwy 5 temporarily closed from I-29 to the Red River,/dotnet/news/Public/View/8334
4,Road Conditions,Water on I-29 north of Grand Forks,/dotnet/news/Public/View/8333
5,Road Conditions,Water on I-29 north of Grand Forks,/dotnet/news/Public/View/8332
6,Road Conditions,Southbound on ramp on I-29 Exit 164 north of Grand Forks temporarily closed,/dotnet/news/Public/View/8331


In [198]:
full_text_list = []
for i in df_nd_conditions.url: 
    url_base = 'https://www.dot.nd.gov'
    url_combine = url_base + i
    res = requests.get(url_combine)
    soup = BeautifulSoup(res.content, 'lxml')
    full_text = soup.find('div', {'class': "display-content"}).text.strip()
    full_text_list.append(full_text)
print(full_text_list)
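The loop above builds each article URL by string concatenation, which works because every scraped href starts with a slash. As a slightly more defensive sketch, `urllib.parse.urljoin` from the standard library handles relative and already-absolute links alike; `absolute_url` is a hypothetical helper name.

```python
from urllib.parse import urljoin

BASE = "https://www.dot.nd.gov"


def absolute_url(href, base=BASE):
    """Resolve a scraped href against the site root."""
    return urljoin(base, href)


full_url = absolute_url("/dotnet/news/Public/View/8335")
# An href that is already absolute is returned unchanged
external_url = absolute_url("https://ndresponse.gov/")
```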






In [199]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df_nd_conditions = df_nd_conditions.assign(post=full_text_list)


In [200]:
df_nd_conditions.head()

Unnamed: 0,alert_type,headline,url,post
2,Road Conditions,"Water receded on I-29, 20 miles north of Grand Forks",/dotnet/news/Public/View/8335,"Water has receded on northbound I-29 approximately 20 miles north of Grand Forks. The roadway is open to two lanes of traffic and speeds are normal.\nMotorists should drive with caution as flooding continues to affect area highways and should check road conditions before traveling due to rapidly changing conditions. For road information, call 511 from any type of phone or go to the website: www.dot.nd.gov.\nFluctuating water levels make it difficult to predict when and where water will go over a roadway or recede from the roadway. NDDOT warns motorists that driving through water is dangerous and should not drive around barricades or into flooded areas as vehicles that leave the roadway may become immersed in high water.\nFor statewide flooding information, please go to https://ndresponse.gov/"
3,Road Conditions,ND Hwy 5 temporarily closed from I-29 to the Red River,/dotnet/news/Public/View/8334,"The North Dakota Department of Transportation has temporarily closed ND Hwy 5 from I-29 to the Red River (Exit 203 near Joliette) due to water on the roadway. This road will remain closed until river levels recede. Motorists will need to take an alternate route.\nThe NDDOT urges motorists to check road conditions before traveling due to rapidly changing conditions. Fluctuating water levels make it difficult to predict when and where water will go over a roadway or recede from the roadway. For updated road information, call 511 from any type of phone or go to the Travel Information Map on our website at www.dot.nd.gov.\nFor statewide flooding information, please go to https://ndresponse.gov/"
4,Road Conditions,Water on I-29 north of Grand Forks,/dotnet/news/Public/View/8333,"There is water on the southbound lanes of I-29 approximately 25 miles north of Grand Forks. The roadway is reduced to one lane. Traffic is allowed with traffic speeds reduced and traffic control is in place.\nNDDOT encourages motorists to check road conditions before traveling. For updated road information, call 511 from any type of phone or go to the Travel Information Map on our website at www.dot.nd.gov"
5,Road Conditions,Water on I-29 north of Grand Forks,/dotnet/news/Public/View/8332,"There is water on the northbound lanes of I-29 approximately 20 miles north of Grand Forks. The roadway is reduced to one lane. Traffic is allowed with traffic speeds reduced and traffic control is in place.\nNDDOT encourages motorists to check road conditions before traveling. For updated road information, call 511 from any type of phone or go to the Travel Information Map on our website at www.dot.nd.gov."
6,Road Conditions,Southbound on ramp on I-29 Exit 164 north of Grand Forks temporarily closed,/dotnet/news/Public/View/8331,"The North Dakota Department of Transportation has temporarily closed the southbound on ramp on I-29 Exit 164, approximately 20 miles north of Grand Forks, due to water on the roadway. The northbound off ramp on I-29 at Exit 164 remains closed. The ramps will be closed until river levels recede. Motorists will need to take an alternate route.\nThe NDDOT urges motorists to check road conditions before traveling due to rapidly changing conditions. Fluctuating water levels make it difficult to predict when and where water will go over a roadway or recede from the roadway. For updated road information, call 511 from any type of phone or go to the Travel Information Map on our website at www.dot.nd.gov.\nFor statewide flooding information, please go to https://ndresponse.gov/"


In [201]:
#df_nd_conditions.to_csv('./data/nd_511.csv')

# Minnesota DOT 511 Web Scraping  <a class="anchor" id="fifth-bullet"></a>

In this section, we will scrape the MN 511 DOT site for road closure alerts.

In [12]:
url = 'https://lb.511mn.org//mnlb/roadreports/menu.jsf?view=state&text=m&textOnly=true&current=true'

res = requests.get(url)

In [13]:
res.status_code

200

In [14]:
soup = BeautifulSoup(res.content, 'lxml')

In [15]:
soup.find()

<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Minnesota Department of Transportation - Low Bandwidth Web - Critical Alerts</title>
<link href="/mnlb/css/crc-common.css" rel="stylesheet"/>
<link href="/mnlb/css/mediumtext.css?2.2.10" rel="stylesheet"/>
<link href="/mnlb/css/base.css?2.2.10" rel="stylesheet"/>
<link href="/mnlb/css/current.css?2.2.10" rel="stylesheet"/>
<link href="/mnlb/css/panels-largeright.css?2.2.10" rel="stylesheet"/>
<link href="/mnlb/css/crc-event.css?2.2.10" rel="stylesheet"/>
<script src="/mnlb/js/lbweb.js?2.2.10" type="text/javascript"></script>
</head>
<body>
<style type="text/css">
      a#skip
      {
         position: absolute;
         left: auto;
         top: -1000px;
         width: 1px;
         height: 1px;
         overflow: hidden;
      }
      a:active#skip, a:focus#skip
      {
         position: static;
         width: auto;
         height: auto;
      }
   </style>
<div class="pageWidth">
<a href="#mainCont

In [16]:
soup.find('div', {'class': "reportBodyText"}).text.strip()

'Between MN 93; North 5th Street and US 169 (near Henderson). Look out for flooding. The road is closed.'

In [17]:
soup.find('table', {'class': "reportTitleCritical"}).text.strip()

'MN 19: Flooding.'

In [18]:
soup.find_all('table', {'class': "reportTitleCritical"})

[<table class="reportTitleCritical" style="width:100%">
 <tbody>
 <tr>
 <td class="verticalAlignMiddle"></td>
 <td class="verticalAlignMiddle fullWidth">
 <div><a href="/mnlb/roadreports/critical_reports.jsf;jsessionid=-DF2z0iaJ6SoxlkU-gTPhGLBzoVfm4Argi3-_hDj.ip-10-4-73-183?route=27%3A43&amp;view=state&amp;text=m&amp;textOnly=true">MN 19: Flooding.</a>
 </div></td>
 </tr>
 </tbody>
 </table>, <table class="reportTitleCritical" style="width:100%">
 <tbody>
 <tr>
 <td class="verticalAlignMiddle"></td>
 <td class="verticalAlignMiddle fullWidth">
 <div><a href="/mnlb/roadreports/critical_reports.jsf;jsessionid=-DF2z0iaJ6SoxlkU-gTPhGLBzoVfm4Argi3-_hDj.ip-10-4-73-183?route=27%3A177&amp;view=state&amp;text=m&amp;textOnly=true">MN 93: Flooding.</a>
 </div></td>
 </tr>
 </tbody>
 </table>, <table class="reportTitleCritical" style="width:100%">
 <tbody>
 <tr>
 <td class="verticalAlignMiddle"></td>
 <td class="verticalAlignMiddle fullWidth">
 <div><a href="/mnlb/roadreports/critical_reports.jsf;j

In [19]:
headers = []
for li in soup.find_all('table', {'class': "reportTitleCritical"}):
    print(li.text.strip())
    headers.append(li.text.strip())

MN 19: Flooding.
MN 93: Flooding.
MN 95: Road closed.
US 75 in both directions: Flooding.
MN 1: Flooding.
MN 41: Flooding.
MN 317: Flooding.
MN 220: Flooding.
US 75 in both directions: Flooding.
US 75 in both directions: Flooding.
MN 67: Road closed to traffic.
MN 74 in both directions: Flooding.
MN 60: Flooding.
MN 67: Flooding.


In [20]:
text = []
for li in soup.find_all('div', {'class': "reportBodyText"}):
    print(li.text.strip())
    text.append(li.text.strip())

Between MN 93; North 5th Street and US 169 (near Henderson). Look out for flooding. The road is closed.
Between MN 93; North 5th Street and US 169 (near Henderson). Look out for flooding. The road is closed. Web Comment: Detour where possible
Last updated on March 19
Between US 169 (Le Sueur) and MN 19; North 5th Street (Henderson). Look out for flooding. The road is closed.
Between US 169 (Le Sueur) and MN 19; North 5th Street (Henderson). Look out for flooding. The road is closed. Web Comment: Highway 93 under water. Closed from Highway 19 in Henderson to Highway 169 in both directions.
Last updated on March 21
At Fern Street North (Cambridge). The road is closed. There is a broken water main.
At Fern Street North (Cambridge). The road is closed. There is a broken water main.
Last updated today at 11:30am CDT
Between 220th Avenue and 230th Avenue (Halstad). Look out for flooding. The road is closed. A detour is in operation.
Between 220th Avenue and 230th Avenue (Halstad). Look out f

In [21]:
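# each report contributes three entries to the list (the report text, the text again with any web comment, and a "Last updated" line); keep one entry per report by taking every third item, starting with the first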
text_full = text[0::3]

In [24]:
d = {'Header':headers,'Text':text_full}

In [27]:
df_mn_511 = pd.DataFrame(d)
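The slicing step above relies on every report contributing exactly three entries. A minimal sketch of the same idea with an explicit check is shown below; `first_of_each_group` is a hypothetical helper, and the sample strings are copied from the scraped output above.

```python
def first_of_each_group(items, group_size=3):
    """Return the first item of every consecutive group of group_size items."""
    if len(items) % group_size != 0:
        raise ValueError(
            f"expected a multiple of {group_size} items, got {len(items)}"
        )
    return items[0::group_size]


sample_text = [
    "Between MN 93; North 5th Street and US 169 (near Henderson). Look out for flooding. The road is closed.",
    "Between MN 93; North 5th Street and US 169 (near Henderson). Look out for flooding. The road is closed. Web Comment: Detour where possible",
    "Last updated on March 19",
    "At Fern Street North (Cambridge). The road is closed. There is a broken water main.",
    "At Fern Street North (Cambridge). The road is closed. There is a broken water main.",
    "Last updated today at 11:30am CDT",
]

report_text = first_of_each_group(sample_text)
```

Failing loudly when the count is off seems safer than silently pairing headers with the wrong text if the page layout changes.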

# spaCy Location Detection  <a class="anchor" id="sixth-bullet"></a>

Below is a function that applies spaCy named entity recognition to text columns. spaCy is a package (imported above) that uses a neural network to extract named entities such as people, places, and companies, among others. In this case we are looking to extract location names. These named locations will be used both in the data preparation step of our modeling and in the final mapping stage.

In [17]:
# apply spacy function
def apply_spacy(input_column):
    # run the spaCy pipeline on each text and collect (entity text, label) pairs
    locations = []
    for i in input_column:
        doc = nlp(i)
        locations.append([(X.text, X.label_) for X in doc.ents])
    # doc.ents contains more entity types than we are interested in. We really are only
    # looking for locations (GPE) and dates (DATE), so we filter the list to those two
    # labels before we append it to our dataframe.
    locations_and_dates = []
    for i in locations:
        locs = []
        for j in i:
            if j[1] in ('DATE', 'GPE'):
                locs.append(j)
        locations_and_dates.append(locs)
    print(locations_and_dates)
    return locations_and_dates
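The filtering half of the function can be looked at in isolation. The sketch below is a hypothetical standalone version with a configurable label set, since spaCy's English models also use labels such as `LOC` (non-GPE locations) and `FAC` (facilities such as highways and bridges) that may matter for road alerts. The sample tuples are hand-written, not real model output.

```python
def filter_entities(entities, keep=("GPE", "DATE")):
    """Keep only the (text, label) pairs whose label is in `keep`."""
    return [(text, label) for text, label in entities if label in keep]


# Hand-written (text, label) pairs for illustration only
sample_entities = [
    ("Henderson", "GPE"),
    ("March 19", "DATE"),
    ("NDDOT", "ORG"),
    ("the Red River", "LOC"),
]

default_kept = filter_entities(sample_entities)
wider_kept = filter_entities(sample_entities, keep=("GPE", "LOC", "FAC", "DATE"))
```

Whether widening the label set helps here is an open question; it would need checking against the actual alerts.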

## Apply Spacy to Weather API

In [18]:
weather_df['location'] = apply_spacy(weather_df['description'] )

[[], [('Savageton', 'GPE')], [('Indiana', 'GPE'), ('Indiana', 'GPE'), ('Thursday', 'DATE'), ('Tuesday', 'DATE'), ('tomorrow', 'DATE'), ('the week', 'DATE'), ('Rivervale', 'GPE'), ('Sunday', 'DATE'), ('Monday April 29', 'DATE'), ('Saturday', 'DATE'), ('Sunday', 'DATE'), ('Wednesday', 'DATE'), ('Sunday April 28', 'DATE')], [('Indiana', 'GPE'), ('Indiana', 'GPE'), ('Thursday', 'DATE'), ('Tuesday', 'DATE'), ('tomorrow', 'DATE'), ('the week', 'DATE'), ('Seymour', 'GPE'), ('Thursday', 'DATE'), ('Saturday', 'DATE'), ('Sunday', 'DATE'), ('June 2010', 'DATE'), ('a few days earlier', 'DATE')], [('Indiana', 'GPE'), ('Indiana', 'GPE'), ('Thursday', 'DATE'), ('Tuesday', 'DATE'), ('tomorrow', 'DATE'), ('the week', 'DATE'), ('Columbus', 'GPE'), ('Saturday', 'DATE'), ('Sunday', 'DATE'), ('Monday', 'DATE'), ('April 6, 2011', 'DATE')], [('Indiana', 'GPE'), ('Indiana', 'GPE'), ('Thursday', 'DATE'), ('Tuesday', 'DATE'), ('tomorrow', 'DATE'), ('the week', 'DATE'), ('Ridgeville', 'GPE'), ('Saturday', 'DATE'

In [15]:
#weather_df.to_csv('./data/weather_api.csv')

## Apply Spacy to News API

In [111]:
# some rows have null content; these should be removed
df_news_all_topics.content.isnull().sum()

119

In [112]:
df_news_all_topics = df_news_all_topics.dropna(subset=['content'])

In [113]:
df_news_all_topics.content.isnull().sum()

0

In [None]:
df_news_all_topics['locations_and_times']  = apply_spacy(df_news_all_topics['content'] )

In [120]:
df_news_export = df_news_all_topics[['publishedAt', 'content', 'url', 'locations_and_times']]

In [None]:
#df_news_export.to_csv('./data/news_api.csv')

## Apply Spacy to MN 511

In [28]:
df_mn_511.head()

Unnamed: 0,Header,Text
0,MN 19: Flooding.,Between MN 93; North 5th Street and US 169 (ne...
1,MN 93: Flooding.,Between US 169 (Le Sueur) and MN 19; North 5th...
2,MN 95: Road closed.,At Fern Street North (Cambridge). The road is ...
3,US 75 in both directions: Flooding.,Between 220th Avenue and 230th Avenue (Halstad...
4,MN 1: Flooding.,Between North Dakota State Line (Oslo) and 470...


In [41]:
df_mn_511['locations_and_times'] = apply_spacy(df_mn_511['Text'] )

In [42]:
df_mn_511

Unnamed: 0,Header,Text,locations_and_times
0,MN 19: Flooding.,Between MN 93; North 5th Street and US 169 (ne...,"[(North 5th Street, GPE), (Henderson, GPE)]"
1,MN 93: Flooding.,Between US 169 (Le Sueur) and MN 19; North 5th...,"[(US, GPE)]"
2,MN 95: Road closed.,At Fern Street North (Cambridge). The road is ...,[]
3,US 75 in both directions: Flooding.,Between 220th Avenue and 230th Avenue (Halstad...,[]
4,MN 1: Flooding.,Between North Dakota State Line (Oslo) and 470...,"[(North Dakota, GPE), (Oslo, GPE), (Alvarado, ..."
5,MN 41: Flooding.,Between US 169; Chestnut Boulevard (near Carve...,"[(US, GPE), (Carver, GPE), (Chaska, GPE), (tod..."
6,MN 317: Flooding.,Between North Dakota State Line and MN 220. Lo...,"[(North Dakota State Line, GPE)]"
7,MN 220: Flooding.,Between MN 1; 490th Avenue Northwest and 390th...,"[(Oslo, GPE)]"
8,US 75 in both directions: Flooding.,At Northeast 3rd Street (Climax). Look out for...,[]
9,US 75 in both directions: Flooding.,Between MN 200; 3rd Street East and 250th Aven...,[]
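The table above suggests that route designations are not captured well by the entity tagger: "US 169" comes back as just `(US, GPE)`, and "MN 93" is not tagged at all. One possible complement, sketched here under the assumption that routes are written like "MN 93", "US 169", "ND Hwy 5", or "I-29" as in the scraped text, is a simple regular expression. `ROUTE_RE` and `extract_routes` are hypothetical names.

```python
import re

# Assumed route formats, based only on the examples visible in the scraped alerts
ROUTE_RE = re.compile(r"\b(?:MN|US|ND Hwy)\s\d+\b|\bI-\d+\b")


def extract_routes(text):
    """Return route designations found in the text, in order of appearance."""
    return ROUTE_RE.findall(text)


mn_routes = extract_routes(
    "Between MN 93; North 5th Street and US 169 (near Henderson). "
    "Look out for flooding. The road is closed."
)
nd_routes = extract_routes("ND Hwy 5 temporarily closed from I-29 to the Red River")
```

The pattern would likely need extending for county roads or other naming styles not seen in these samples.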


In [44]:
#df_mn_511.to_csv('./data/mn_511.csv')

Now that we have pulled in our data, applied spaCy named entity recognition to it, and exported all needed files to CSV, we will begin modeling our data to filter it down to only relevant alerts. Finally, we will map these relevant road closures.