<a href="https://colab.research.google.com/github/EtzionR/NLP4GeoAI/blob/main/Text_to_Geo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP for GeoAI
### created by Etzion Harari | RFL

[**https://github.com/EtzionR/NLP4GeoAI**](https://github.com/EtzionR/NLP4GeoAI)

## Imports

In [1]:
from geopy.geocoders import Nominatim
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer
from time import sleep as wait
from tqdm import tqdm

import networkx as nx
import pandas as pd
import folium

## Clone git Repo
[https://github.com/EtzionR/NLP4GeoAI](https://github.com/EtzionR/NLP4GeoAI)

In [2]:
%%bash
rm -rf NLP4GeoAI
git clone https://github.com/EtzionR/NLP4GeoAI.git

Cloning into 'NLP4GeoAI'...


## Load Data

In [3]:
df = pd.read_csv('NLP4GeoAI/data.csv')

print(f'Dataframe shape: {df.shape}')

df.head()

Dataframe shape: (23072, 2)


| | Text | Source |
|---|---|---|
| 0 | Last week, Sen. Malcolm Wallop -LRB- R., Wyo. ... | ontonotes5 |
| 1 | Rules that set standards for products or gover... | ontonotes5 |
| 2 | Determining when handicapped access is require... | ontonotes5 |
| 3 | ``It's very costly and time-consuming ,'' says... | ontonotes5 |
| 4 | Next to medical insurance, ``costs of complian... | ontonotes5 |


## Activate NER model
[https://huggingface.co/dslim/bert-base-NER](https://huggingface.co/dslim/bert-base-NER)

In [4]:
MODEL = "dslim/bert-base-NER"
TEXT = "Sam Altman visited OpenAI in San Francisco."

ner = pipeline("ner",
               model=MODEL,
               aggregation_strategy='average')

output = ner(TEXT)

print('\n\n\nNER output:\n\n')

pd.DataFrame(output)

Device set to use cuda:0





NER output:




| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | PER | 0.999328 | Sam Altman | 0 | 10 |
| 1 | ORG | 0.664299 | OpenAI | 19 | 25 |
| 2 | LOC | 0.999392 | San Francisco | 29 | 42 |
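
The scores in the table above vary a lot (0.66 for `OpenAI` versus 0.999 for the other two), so a downstream step may want to drop low-confidence entities. The sketch below is one possible way to do that on the list of dicts returned by the pipeline. The helper name `filter_entities` and the 0.8 threshold are illustrative assumptions, not part of the original notebook.

```python
def filter_entities(entities, min_score=0.8, groups=None):
    """Keep entities scoring at least min_score, optionally restricted to some groups.

    `entities` is a list of dicts shaped like the pipeline output above.
    The default threshold of 0.8 is an arbitrary illustration value.
    """
    kept = []
    for ent in entities:
        if ent["score"] < min_score:
            continue
        if groups is not None and ent["entity_group"] not in groups:
            continue
        kept.append(ent)
    return kept


# Values copied from the NER output table above.
sample = [
    {"entity_group": "PER", "score": 0.999328, "word": "Sam Altman", "start": 0, "end": 10},
    {"entity_group": "ORG", "score": 0.664299, "word": "OpenAI", "start": 19, "end": 25},
    {"entity_group": "LOC", "score": 0.999392, "word": "San Francisco", "start": 29, "end": 42},
]

confident = filter_entities(sample)
confident_locations = filter_entities(sample, groups={"LOC"})

print([e["word"] for e in confident])
print([e["word"] for e in confident_locations])
```

With the sample above, `OpenAI` falls below the threshold and is dropped. The right cutoff depends on the corpus and should be tuned rather than taken from this sketch.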


## Run NER on the corpus (first 10% of the rows)

In [5]:
P = .1 # Fraction of df rows to run NER on (0.1 = the first 10%)

outputs = []

for text, source in tqdm(df[:int(len(df)*P)].values):
    entities = ner(text)

    for entity in entities:

        entity['source'] = source
        entity['text'] = text

        outputs.append(entity)

outputs = pd.DataFrame(outputs)

print(f'\n\nNER output shape: {outputs.shape}\n')

outputs

  0%|          | 6/2307 [00:00<00:40, 57.18it/s]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
100%|██████████| 2307/2307 [00:29<00:00, 79.44it/s]



NER output shape: (6644, 7)






| | entity_group | score | word | start | end | source | text |
|---|---|---|---|---|---|---|---|
| 0 | PER | 0.999573 | Malcolm Wallop | 16 | 30 | ontonotes5 | Last week, Sen. Malcolm Wallop -LRB- R., Wyo. ... |
| 1 | ORG | 0.500653 | LRB | 32 | 35 | ontonotes5 | Last week, Sen. Malcolm Wallop -LRB- R., Wyo. ... |
| 2 | LOC | 0.879631 | R | 37 | 38 | ontonotes5 | Last week, Sen. Malcolm Wallop -LRB- R., Wyo. ... |
| 3 | LOC | 0.428096 | Wyo | 41 | 44 | ontonotes5 | Last week, Sen. Malcolm Wallop -LRB- R., Wyo. ... |
| 4 | ORG | 0.499855 | RRB | 47 | 50 | ontonotes5 | Last week, Sen. Malcolm Wallop -LRB- R., Wyo. ... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 6639 | LOC | 0.995469 | Western Europe | 152 | 166 | ontonotes5 | It raises the long-cherished hopes of many Ger... |
| 6640 | PER | 0.466073 | Krenz | 4 | 9 | ontonotes5 | Mr. Krenz, 52, was named the new party chief j... |
| 6641 | ORG | 0.960657 | Party | 68 | 73 | ontonotes5 | Mr. Krenz, 52, was named the new party chief j... |
| 6642 | ORG | 0.998935 | Central Committee | 87 | 104 | ontonotes5 | Mr. Krenz, 52, was named the new party chief j... |
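
The progress output above includes a warning that the pipeline is being called sequentially on the GPU. One possible response is to pass texts to the pipeline in batches: a `transformers` pipeline accepts a list of inputs and a `batch_size` argument, and for a list it returns one list of entities per input. The sketch below takes the pipeline as a parameter; the helper name `run_ner_in_batches`, the batch size, and the stand-in `fake_ner` (used only so the example runs without a model) are assumptions for illustration.

```python
def run_ner_in_batches(ner, texts, sources, batch_size=32):
    """Run an NER callable over `texts` in batches and flatten the result.

    `ner` is expected to behave like the pipeline above when given a list:
    it returns one list of entity dicts per input text.
    """
    rows = []
    for start in range(0, len(texts), batch_size):
        batch_texts = texts[start:start + batch_size]
        batch_sources = sources[start:start + batch_size]
        batch_entities = ner(batch_texts, batch_size=batch_size)
        for text, source, entities in zip(batch_texts, batch_sources, batch_entities):
            for entity in entities:
                # Same extra columns as in the loop above
                rows.append({**entity, "source": source, "text": text})
    return rows


def fake_ner(batch, batch_size=1):
    """Hypothetical stand-in for the real pipeline: tags every capitalised word."""
    return [
        [{"entity_group": "MISC", "word": w} for w in text.split() if w[:1].isupper()]
        for text in batch
    ]


rows = run_ner_in_batches(
    fake_ner,
    ["Paris is big", "no names here", "Rome and Oslo"],
    ["a", "b", "c"],
    batch_size=2,
)
print([(r["word"], r["source"]) for r in rows])
```

With the real pipeline, the call would look like `run_ner_in_batches(ner, df.Text[:n].tolist(), df.Source[:n].tolist())` followed by `pd.DataFrame(rows)`. Whether batching actually speeds things up depends on the hardware and on text lengths, so it is worth timing both variants.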


## Display the Top K locations from the corpus

In [6]:
k = 10

top_k_locations = outputs[outputs.entity_group=='LOC'].word.value_counts().head(k).reset_index()
top_k_locations

| | word | count |
|---|---|---|
| 0 | U. S. | 387 |
| 1 | New York | 154 |
| 2 | Japan | 69 |
| 3 | California | 64 |
| 4 | China | 55 |
| 5 | London | 45 |
| 6 | Britain | 41 |
| 7 | Poland | 41 |
| 8 | Washington | 40 |
| 9 | Los Angeles | 38 |


## Use Nominatim (via geopy) to geocode an example place name

In [7]:
example = "Tel Aviv, Israel"

geolocator = Nominatim(user_agent="GeoAI_Course_Geocoder")

location = geolocator.geocode(example)

print(f'Location full address:\n{location.address}\n')
print(f'WGS84 GEO X = {round(location.longitude,6)}, Y = {round(location.latitude,6)}')

Location full address:
תל־אביב–יפו, נפת תל אביב, מחוז תל אביב, ישראל

WGS84 GEO X = 34.781806, Y = 32.0853


## Geocode the Top K place names

In [8]:
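time_gap = 1.25 # Seconds between requests (the public Nominatim service allows at most 1 request per second)

x_coords = []
y_coords = []

for placename in tqdm(top_k_locations.word):
    loc = geolocator.geocode(placename)

    # geocode() returns None when no match is found, so guard before reading coordinates
    x_coords.append(loc.longitude if loc is not None else None)
    y_coords.append(loc.latitude if loc is not None else None)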

    wait(time_gap)

top_k_locations['x'] = x_coords
top_k_locations['y'] = y_coords

top_k_locations

100%|██████████| 10/10 [00:15<00:00,  1.59s/it]


| | word | count | x | y |
|---|---|---|---|---|
| 0 | U. S. | 387 | -100.445882 | 39.78373 |
| 1 | New York | 154 | -74.006015 | 40.712728 |
| 2 | Japan | 69 | 139.239418 | 36.574844 |
| 3 | California | 64 | -118.755997 | 36.701463 |
| 4 | China | 55 | 104.999927 | 35.000074 |
| 5 | London | 45 | -0.127765 | 51.507446 |
| 6 | Britain | 41 | -1.918153 | 54.315159 |
| 7 | Poland | 41 | 19.134422 | 52.215933 |
| 8 | Washington | 40 | -77.036543 | 38.895037 |
| 9 | Los Angeles | 38 | -118.242766 | 34.053691 |
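
When the list of place names grows or contains repeats, it can help to geocode each distinct name only once and to tolerate names the geocoder cannot resolve. The sketch below shows one way to do that around any geocoding callable with the same shape as `geolocator.geocode` (returns an object with `latitude` and `longitude`, or `None`). The helper `geocode_all` and the stand-in `fake_geocode` are hypothetical names used only for illustration.

```python
import time
from types import SimpleNamespace


def geocode_all(names, geocode, delay=1.0):
    """Geocode each distinct name once; return {name: (lat, lon) or None}."""
    coords = {}
    for name in names:
        if name in coords:      # already looked up, so no new request
            continue
        loc = geocode(name)
        coords[name] = None if loc is None else (loc.latitude, loc.longitude)
        time.sleep(delay)       # pause between requests to respect rate limits
    return coords


# Hypothetical stand-in geocoder so the example runs offline.
# Coordinates are copied from the table above.
KNOWN = {"London": (51.507446, -0.127765), "Poland": (52.215933, 19.134422)}
calls = []


def fake_geocode(name):
    calls.append(name)
    if name not in KNOWN:
        return None
    lat, lon = KNOWN[name]
    return SimpleNamespace(latitude=lat, longitude=lon)


result = geocode_all(["London", "Poland", "London", "Atlantis"], fake_geocode, delay=0)
print(result)
print(calls)
```

With the real geocoder, the call would be `geocode_all(top_k_locations.word, geolocator.geocode, delay=time_gap)`. Names that map to `None` can then be dropped before plotting.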


## Create a folium map of the Top K places in the corpus

In [9]:

fmap = folium.Map(location=[0, 0], zoom_start=3)

places = []
place_to_xy = {}

for name,x,y in zip(top_k_locations.word, top_k_locations.x, top_k_locations.y):
    folium.Marker([y,x], popup=name, tooltip=name).add_to(fmap)
    places.append(name)
    place_to_xy[name] = (y, x)

fmap
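
The map above is centered on `[0, 0]` with a fixed zoom. An alternative is to compute the bounding box of the markers and let the map fit it. The sketch below computes that box from `(lat, lon)` pairs in plain Python; the helper name `bounding_box` is an assumption, and the simple min/max approach does not handle point sets that straddle the antimeridian.

```python
def bounding_box(points):
    """Return [[min_lat, min_lon], [max_lat, max_lon]] for (lat, lon) pairs."""
    points = list(points)
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return [[min(lats), min(lons)], [max(lats), max(lons)]]


# (lat, lon) pairs copied from the geocoding table above: London, Poland, New York
points = [(51.507446, -0.127765), (52.215933, 19.134422), (40.712728, -74.006015)]

bounds = bounding_box(points)
print(bounds)
```

In the notebook, `place_to_xy` already stores `(lat, lon)` tuples, so `fmap.fit_bounds(bounding_box(place_to_xy.values()))` should zoom the map to the plotted places.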

## Construct a graph from pairs of top-K place names that co-occur in the same text

In [10]:
topk_places = set(places)

edge_weight = {}

G = nx.Graph()

sub = outputs[['text', 'entity_group', 'word']].copy()  # copy, so adding a column does not trigger SettingWithCopyWarning
sub['merged'] = [(entity, typ) for _, typ, entity in sub.values]

# One row per text, holding the set of (word, entity_group) pairs found in it
sub = pd.pivot_table(sub[['text', 'merged']], index='text', aggfunc=set)

for entity_set in sub.merged[sub.merged.str.len()>1]:

    entity_list = [*entity_set]

    for i in range(len(entity_list)):
        for j in range(i+1, len(entity_list)):
            placei = entity_list[i][0]
            placej = entity_list[j][0]
            if placei in topk_places and placej in topk_places:
                G.add_edge(placei,
                           placej)
                # Note: keys are ordered pairs, so (a, b) and (b, a) are counted separately
                edge_weight[(placei, placej)] = edge_weight.get((placei, placej), 0) + 1


print(f'\n\n\nConnections Graph created!\n|V| = {len(G.nodes)}\n|E| = {len(G.edges)}')




Connections Graph created!
|V| = 9
|E| = 15
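
The nested index loops above can also be written with `itertools.combinations` and a `Counter`. The sketch below is a variant, not the notebook's exact logic: it works on plain place names rather than `(word, entity_group)` pairs, and it sorts each pair so that `(a, b)` and `(b, a)` share one key. The function name `cooccurrence_counts` and the toy input are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(entity_sets, allowed):
    """Count, per unordered pair of allowed names, how many texts mention both."""
    counts = Counter()
    for names in entity_sets:
        # Deduplicate, keep only allowed names, and fix the order of each pair
        present = sorted(set(names) & allowed)
        for a, b in combinations(present, 2):
            counts[(a, b)] += 1
    return counts


# Toy input: one set of entity names per text (made up for this example)
texts_entities = [
    {"Japan", "U. S.", "Toyota"},
    {"U. S.", "Japan"},
    {"London", "Britain", "Japan"},
    {"China"},
]
allowed = {"U. S.", "Japan", "London", "Britain", "China"}

counts = cooccurrence_counts(texts_entities, allowed)
print(counts)
```

The resulting counts can be loaded into the graph in one call, for example `G.add_weighted_edges_from((a, b, w) for (a, b), w in counts.items())`, which stores each count as the `weight` attribute of its edge.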


## Display the graph on the map

In [11]:
fmap = folium.Map(location=[0, 0], zoom_start=3)

for name,x,y in zip(top_k_locations.word, top_k_locations.x, top_k_locations.y):
    folium.Marker([y,x], popup=name, tooltip=name, icon=folium.Icon(color="blue")).add_to(fmap)

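for i,j in G.edges:
    # edge_weight is keyed by ordered pairs while the graph is undirected,
    # so sum the counts stored under both orderings of the pair
    count = edge_weight.get((i,j), 0)
    if i != j:
        count += edge_weight.get((j,i), 0)
    if count:
        folium.PolyLine(locations=[place_to_xy[j],
                                   place_to_xy[i]],
                        color="blue",
                        opacity=0.3,
                        tooltip=f'Side 1: {i}<br>Side 2: {j}<br>Connections: {count}',
                        weight=count**.5).add_to(fmap)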

fmap
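
The map cell above sets each line's width to the square root of its connection count, which keeps heavily connected pairs from dominating the map. A small sketch of that scaling with an added clamp is shown below. The function name `line_width` and the minimum and maximum widths are illustrative assumptions.

```python
import math


def line_width(count, min_width=1.0, max_width=8.0):
    """Square-root scaling of a connection count, clamped to [min_width, max_width]."""
    return max(min_width, min(max_width, math.sqrt(count)))


widths = {c: line_width(c) for c in (1, 4, 25, 400)}
print(widths)
```

Square-root scaling compresses large counts (a count of 25 is drawn only five times wider than a count of 1), and the clamp bounds the width for outliers.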
