# 03. Extracting Locations with spaCy   

spaCy is an open-source software library for advanced natural language processing, written in Python and Cython. 
spaCy’s models can be installed as Python packages.  

## Table of Contents
- [Installation](#Installation)
- [Loading Libraries & Data](#Loading-Libraries-&-Data)  
- [Preprocessing](#Preprocessing)
- [Training spaCy](#Training-spaCy)
- [Location Extracting](#Location-Extracting)
- [Source](#Source) 

## Installation  
  
In your terminal, with your environment activated:
```Terminal
# Install spaCy itself first
$ pip install spacy

# Download best-matching version of specific model for your spaCy installation
$ python -m spacy download en_core_web_sm

# Out-of-the-box: download best-matching default model and create shortcut link
# (shortcut links were removed in spaCy 3.x)
$ python -m spacy download en

# Download exact model version (doesn't create shortcut link)
$ python -m spacy download en_core_web_sm-2.2.0 --direct
```

For more details, see the [spaCy models documentation](https://spacy.io/usage/models).
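
The download commands above and the training code later in this notebook follow the spaCy 2.x API (for example, the `en` shortcut link and `nlp.create_pipe`), so it can help to confirm which version is installed before continuing. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of spaCy:

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

spacy_version = installed_version('spacy')
if spacy_version is None:
    print('spaCy is not installed in this environment')
else:
    # The major version decides which training API applies (2.x vs 3.x)
    print('spaCy', spacy_version, '-> major version', spacy_version.split('.')[0])
```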

## Loading Libraries & Data

In [10]:
# Import libraries: 
import pandas as pd
import numpy as np
import re
import spacy
import random
import string
import datetime
import os
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

from spacy import displacy

%matplotlib inline

In [2]:
# Read data:
closures = pd.read_csv('../datasets/account_tweets.csv')

## Preprocessing  
Before applying spaCy, we need to preprocess our closures data.

In [3]:
# Drop NaN values and unwanted columns;
# Set 'id' as index:

closures.dropna(subset=['text'], inplace=True)
closures.drop(columns=['hashtags', 'geo'], inplace=True)
closures.set_index('id', inplace=True)

In [4]:
abbreviation_dict = {'US ' : 'US-', 
                     'I-' : 'Interstate ',  
                     'CR ' : 'County Route ',
                     'CR-' : 'County Route ',
                     'St ' : 'Street ',
                     'St. ' : 'Street ',
                     'Rt ' : 'Route ',
                     'Rte ' : 'Route ',
                     'Rd ' : 'Road ',
                     'Rd. ' : 'Road ',
                     'Twp' : 'Township ',
                     'Av ': 'Avenue ',
                     'Av. ': 'Avenue ',                
                     'SB ' : 'southbound ',
                     'WB ' : 'westbound ',
                     'EB ' : 'eastbound ',
                     'NB ' : 'northbound ',
                     'Hwy ' : 'Highway ',
                     'MM ':'Mile Marker ',
                     'Pkwy ': 'Parkway ',
                     'SR-': 'State Route '}

In [5]:
# Replace all location related abbreviations:
expanded_tweets = []
for cleaned_tweet in closures['text']:
    for key in abbreviation_dict:
        cleaned_tweet = cleaned_tweet.replace(key, abbreviation_dict[key])
    expanded_tweets.append(cleaned_tweet)
closures['expanded_tweet'] = expanded_tweets
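
Plain `str.replace` matches anywhere in the text, so a key such as `'US '` also fires inside a word like `BUS `. The following is a hedged sketch of one alternative, not the approach used in this notebook: a single regular-expression pass that only expands a key when it starts a word. The function name `expand_abbreviations` and the small `ABBREVIATIONS` subset are hypothetical, chosen for illustration:

```python
import re

# Hypothetical subset of the abbreviation dictionary, for illustration only
ABBREVIATIONS = {
    'I-': 'Interstate ',
    'US ': 'US-',
    'WB ': 'westbound ',
    'MM ': 'Mile Marker ',
    'Rd ': 'Road ',
}

def expand_abbreviations(text, mapping):
    """Expand abbreviations only where they start a word, in a single pass."""
    # Longest keys first so that overlapping keys resolve predictably
    keys = sorted(mapping, key=len, reverse=True)
    # (?<!\w) rejects matches preceded by a letter, digit, or underscore
    pattern = re.compile(r'(?<!\w)(?:' + '|'.join(re.escape(k) for k in keys) + ')')
    return pattern.sub(lambda m: mapping[m.group(0)], text)

example = 'Crash on I-10 WB at MM 5, BUS lane on US 90'
print(expand_abbreviations(example, ABBREVIATIONS))
# Crash on Interstate 10 westbound at Mile Marker 5, BUS lane on US-90
```

Because the text is scanned once, an expansion is never re-expanded by a later key, which the nested loop above cannot guarantee.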

In [6]:
# Checking:
closures.head()

| id | username | date | text | traffic | expanded_tweet |
|---|---|---|---|---|---|
| 1051600805755846657 | WJHG_TV | 2018-10-14 22:29:04+00:00 | A lot of disaster assistance info... as well a... | 1 | A lot of disaster assistance info... as well a... |
| 1051595380369096704 | fl511_panhandl | 2018-10-14 22:07:31+00:00 | Cleared: Traffic congestion in Bay on US-231 s... | 1 | Cleared: Traffic congestion in Bay on US-231 s... |
| 1051580211702243333 | WJHG_TV | 2018-10-14 21:07:14+00:00 | Jessica and Ryan are about to handle our storm... | 1 | Jessica and Ryan are about to handle our storm... |
| 1051575221671682054 | fl511_panhandl | 2018-10-14 20:47:24+00:00 | Cleared: Object on roadway in Okaloosa on I-10... | 1 | Cleared: Object on roadway in Okaloosa on Inte... |
| 1051574997964255232 | fl511_panhandl | 2018-10-14 20:46:31+00:00 | New: Object on roadway in Okaloosa on I-10 wes... | 1 | New: Object on roadway in Okaloosa on Intersta... |


## Training spaCy

The built-in spaCy model does not perform very well on these tweets, since it has only three location-related entity types (listed below). For the full list of entity types, see the [spaCy annotation specification](https://spacy.io/api/annotation#named-entities).

#### Named Entity Recognition:  

| TYPE | DESCRIPTION |  
|---|---|  
| `GPE` | Countries, cities, states. |  
| `LOC` | Non-GPE locations, mountain ranges, bodies of water. |  
| `FAC` | Buildings, airports, highways, bridges, etc. |

Therefore, we train our own spaCy model on annotated examples drawn from our tweets.  
![](https://spacy.io/training-73950e71e6b59678754a87d6cf1481f9.svg)

Code:   
```Python
test_col = closures['expanded_tweet'].sample(n = 40, replace = False)
np.savetxt(r'spacy_annotator_test.txt', test_col, fmt='%s')
```
We then load the `txt` file into the [spaCy NER Annotator](https://manivannanmurugavel.github.io/annotating-tool/spacy-ner-annotator/) created by Manivannan Murugavel and label the location entities.

The code below, borrowed from Manivannan's post, converts our labels from the annotator's JSON format into the format spaCy expects for training.

In [9]:
import json

# Path to the labels exported from the annotator:
filename = "../datasets/road_entity_labels.json"
print(filename)


with open(filename) as train_data:
    train = json.load(train_data)

# Convert each record to spaCy's (text, {'entities': [...]}) training format:
TRAIN_DATA = []
for data in train:
    ents = [tuple(entity) for entity in data['entities']]
    TRAIN_DATA.append((data['content'],{'entities':ents}))


# Save the converted data next to the JSON file, with a .txt extension:
with open(filename.replace('.json', '.txt'), 'w') as write:
    write.write(str(TRAIN_DATA))

../datasets/road_entity_labels.json
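
To make the conversion concrete, here is a small sketch that runs the same transformation on an in-memory record instead of a file. The record shape (a `content` string plus `entities` as `[start, end, label]` triples) is an assumption inferred from the conversion code and the resulting training data, and the tweet is shortened for readability:

```python
import json

# One record in the assumed annotator shape (shortened sample tweet)
raw = '''[{"content": "Cleared: Crash in Walton on Interstate 10 east at Mile Marker 78, left lane blocked.",
           "entities": [[18, 24, "TWN"], [28, 46, "INS"], [50, 64, "MM"], [66, 75, "LN"]]}]'''

records = json.loads(raw)
# Same transformation as above: JSON lists become tuples inside an 'entities' dict
train_data = [(r['content'], {'entities': [tuple(e) for e in r['entities']]})
              for r in records]

text, annotations = train_data[0]
for start, end, label in annotations['entities']:
    # Offsets are character positions: start inclusive, end exclusive
    print(label, repr(text[start:end]))
```

Slicing the text with each pair of offsets is a quick way to confirm that a label covers the words it was meant to cover.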


## Location Extracting  

We now use the annotated `TRAIN_DATA` to train a custom spaCy model, then apply that model to every tweet to extract locations.

In [12]:
TRAIN_DATA = [
    ('Updated: Planned construction in Washington on Interstate 10 east at Mile Marker 117, right lane blocked. Last updated at 12:58:47AM. http://fl511.com/EventDetails/District%203-CHP/63064', {'entities': [(33, 43, 'TWN'), (47, 65, 'INS'), (69, 84, 'MM'), (86, 96, 'LN')]}),
    ('Cleared: Object on roadway in Gadsden on Interstate 10 west at Mile Marker 168, all lanes blocked. Last updated at 03:09:38PM.', {'entities': [(30, 37, 'TWN'), (41, 59, 'INS'), (63, 78, 'MM'), (80, 89, 'LN')]}),
    ('New: Disabled vehicle in Santa Rosa on Pensacola Bay Bridge north at Pensacola Bay Bridge, right lane blocked....http://fl511.com/EventDetails/District%203-CHP/63918', {'entities': [(39, 65, 'BRDG'), (91, 101, 'LN'), (25, 35, 'TWN')]}),
    ('ROAD CLOSURE: State Road 30A from West Rutherford Street south to Cape San Blas Road is closed and County Road 30A from Cape San Blas Road to the Franklin County Line is closed.', {'entities': [(13, 28, 'RD'), (34, 62, 'ST')]}),
    ('Update to Motor Vehicle Accident (MVA) on 23rd Street at 23rd Street Plaza. The roadway obstructions have been removed. Emergency Response personnel are on scene. Use caution.', {'entities': [(42, 53, 'ST')]}),
    ('Cleared: Crash in Walton on Interstate 10 east at Mile Marker 78, left lane blocked. Last updated at 10:17:49AM.', {'entities': [(18, 24, 'TWN'), (28, 46, 'INS'), (50, 64, 'MM'), (66, 75, 'LN')]}),
    ('Updated: Planned construction in Washington on Interstate 10 east at Mile Marker 117, right lane blocked. Last updated at 11:09:44PM. http://fl511.com/EventDetails/District%203-CHP/62845', {'entities': [(33, 43, 'TWN'), (47, 65, 'INS'), (69, 84, 'MM'), (86, 96, 'LN')]}),
    ('Cleared: Crash in Jackson on US-231 south beyond Interstate 10, right shoulder blocked. Last updated at 11:38:37AM.', {'entities': [(18, 25, 'TWN'), (29, 41, 'RD'), (49, 62, 'INS'), (64, 78, 'LN')]}),
    ('New: Planned construction in Holmes on Interstate 10 east at Mile Marker 115, right lane blocked. Last updated at 09:33:26PM. \n#fl511 http://fl511.com/EventDetails/District%203-CHP/62960', {'entities': [(29, 35, 'TWN'), (39, 57, 'INS'), (61, 76, 'MM'), (78, 88, 'LN')]}),
    ('New: Planned construction in Washington on Interstate 10 east at Mile Marker 117, right lane blocked. Last updated at 09:17:00PM....http://fl511.com/EventDetails/District%203-CHP/63063', {'entities': [(29, 39, 'TWN'), (43, 61, 'INS'), (65, 80, 'MM'), (82, 92, 'LN')]}),
    ('New: Object on roadway in Walton on Interstate 10 west at Mile Marker 74, right lane blocked. Last updated at 09:56:29AM. #fl511 http://fl511.com/EventDetails/District%203-CHP/62998', {'entities': [(26, 32, 'TWN'), (36, 54, 'INS'), (58, 72, 'MM'), (74, 84, 'LN')]}),
    ('Cleared: Disabled vehicle in Holmes on Interstate 10 west at Mile Marker 115, left lane blocked. Last updated at 07:21:36PM.', {'entities': [(29, 35, 'TWN'), (39, 57, 'INS'), (61, 76, 'MM'), (78, 87, 'LN')]}),
]
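
Hand-labelled offsets are easy to get slightly wrong, and spans that include stray whitespace or overlap each other may be ignored or rejected during training, depending on the spaCy version. The sketch below is a hypothetical sanity check (the name `check_annotations` is not a spaCy function) that flags such spans before training. The sample texts are shortened copies of two entries from the training data:

```python
def check_annotations(train_data):
    """Return (text_index, span, problem) tuples for suspicious entity spans."""
    problems = []
    for i, (text, annotations) in enumerate(train_data):
        previous_end = 0
        # Sort by start offset so overlaps can be detected in one pass
        for start, end, label in sorted(annotations['entities']):
            snippet = text[start:end]
            if not (0 <= start < end <= len(text)):
                problems.append((i, (start, end, label), 'out of range'))
            elif snippet != snippet.strip():
                problems.append((i, (start, end, label), 'leading or trailing whitespace'))
            if start < previous_end:
                problems.append((i, (start, end, label), 'overlaps previous entity'))
            previous_end = max(previous_end, end)
    return problems

sample = [
    ('ROAD CLOSURE: State Road 30A from West Rutherford Street south to Cape San Blas Road is closed.',
     {'entities': [(13, 28, 'RD'), (34, 62, 'ST')]}),
    ('Cleared: Crash in Walton on Interstate 10 east at Mile Marker 78, left lane blocked.',
     {'entities': [(18, 24, 'TWN'), (28, 46, 'INS'), (50, 64, 'MM'), (66, 75, 'LN')]}),
]
for problem in check_annotations(sample):
    print(problem)
# (0, (13, 28, 'RD'), 'leading or trailing whitespace')
```

Here the `RD` span starts on the space before `State`, so shifting its start offset to 14 would label exactly `State Road 30A`.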

In [13]:
def train_spacy(data,iterations):
    TRAIN_DATA = list(data)  # copy, so shuffling does not reorder the caller's list
    nlp = spacy.blank('en')  # create blank Language class
    
    # Create the built-in NER component and add it to the pipeline (spaCy 2.x API)
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner') # nlp.create_pipe is for built-ins of spaCy
        nlp.add_pipe(ner, last=True)
    else:
        ner = nlp.get_pipe('ner')

    # Add in labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    # Get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes): 
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update(
                    [text],         # batch of texts
                    [annotations],  # batch of annotations
                    drop=0.2,       # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)
    return nlp


prdnlp = train_spacy(TRAIN_DATA, 20)

# Save our trained model to the 'closure_labels' directory:
modelfile = "closure_labels"
prdnlp.to_disk(modelfile)


In [18]:
# Load the labels:
nlp = spacy.load('closure_labels')

# Map each custom entity label to its output column:
label_to_col = {'RD': 'road', 'ST': 'streets', 'INS': 'interstate',
                'MM': 'mile_marker', 'BRDG': 'bridge', 'LN': 'lanes', 'TWN': 'town'}

# Extracting locations: collect the entity texts per label for each tweet
extracted = {col: [] for col in label_to_col.values()}
for doc in nlp.pipe(closures['expanded_tweet']):
    found = {col: [] for col in label_to_col.values()}
    for ent in doc.ents:
        if ent.label_ in label_to_col:
            found[label_to_col[ent.label_]].append(ent.text)
    for col, texts in found.items():
        # Join multiple matches with ', '; empty string when nothing was found
        extracted[col].append(', '.join(texts))

# Assign whole columns at once (avoids chained .iloc assignment):
for col, values in extracted.items():
    closures[col] = values
    
    
# Create 'address' column:
closures['address'] = closures['road']+' '+closures['streets']+' '+closures['interstate']+' '+closures['mile_marker']+' '+closures['town']+' '+closures['bridge']+' FL'
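
Concatenating with fixed `' '` separators leaves runs of spaces wherever a column is empty, which is why the next cell has to compare against a string of six spaces followed by `FL`. As a hedged alternative sketch (the helper `build_address` is hypothetical and not part of the original pipeline), the non-empty parts can be joined explicitly:

```python
def build_address(parts, state='FL'):
    """Join the non-empty location parts with single spaces and append the state."""
    kept = [part.strip() for part in parts if part and part.strip()]
    if not kept:
        return ''  # nothing was extracted for this tweet
    return ' '.join(kept + [state])

# Order mirrors the concatenation above: road, streets, interstate, mile_marker, town, bridge
print(build_address(['', '', 'Interstate 10 west', 'Mile Marker 51', 'Okaloosa', '']))
# Interstate 10 west Mile Marker 51 Okaloosa FL
print(repr(build_address(['', '', '', '', '', ''])))
# ''
```

With this variant, rows without any extracted location end up with an empty address, so they could be filtered with a simple `address != ''` test instead of matching an exact number of spaces.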

In [19]:
# Get a cleaned address data frame:
fl511 = closures.loc[closures['username']=='fl511_panhandl']
address = fl511.drop(fl511.loc[fl511['address']=='      FL'].index)[['text', 'address']]
address.head()

| id | text | address |
|---|---|---|
| 1051595380369096704 | Cleared: Traffic congestion in Bay on US-231 s... | US-231 south Bay FL |
| 1051575221671682054 | Cleared: Object on roadway in Okaloosa on I-10... | Interstate 10 west Mile Marker 51 Okaloosa FL |
| 1051574997964255232 | New: Object on roadway in Okaloosa on I-10 wes... | Interstate 10 west Mile Marker 51 Okaloosa FL |
| 1051564919219515399 | Updated: Traffic congestion in Bay on US-231 s... | US-231 south Bay FL |
| 1051564637832007680 | Updated: Traffic congestion in Bay on US-231 s... | US-231 south Bay FL |


In [20]:
# Export data:
address.to_csv('../datasets/address.csv', index=False)

## Source  

- [Named Entity Recognition for spaCy](https://spacy.io/api/annotation#named-entities)  
- [Road Closures Evacuation Routes](https://github.com/fmanon/Road_Closures_Evacuation_Routes)  
- [spaCy NER Annotator](https://github.com/ManivannanMurugavel/spacy-ner-annotator)  
- [Training Basics for spaCy](https://spacy.io/usage/training)  