<font color = "blue">
Contents:
    
1. [Problem](#1)
2. [Libraries](#2)
3. [Load Data](#3)
4. [Extract Text](#4)
5. [Named-Entity Recognition](#5)
6. [Clean-Filter Data](#6)
7. [Find The Coordinate](#7)
8. [Outputs](#8)
9. [References](#9)
    
Written by Kemal Gunay

<a id = "1"></a><br>
# 1. Problem

Task description: A big part of our daily data collection is creating algorithms that extract key pieces of information from a wide range of data sources. In this task, you are required to create an algorithm that takes as input a PDF file corresponding to a research publication (provided by us) and outputs a list of all geographical locations mentioned in the publication. For each geographical location, the algorithm must additionally identify the country that the location belongs to and return a latitude-longitude pair corresponding to the centroid of that country.

Example output:
* 1) Location: Russia, Country: Russia, Centroid: (61.52401, 105.318756)
* 2) Location: Alberta, Country: Canada, Centroid: (56.130366, -106.346771)
* 3) Location: Scottish Highlands, Country: UK, Centroid: (55.378051, -3.435973)
* 4) Location: Northern Alaska, Country: US, Centroid: (37.09024, -95.712891)

Submission requirements: a Jupyter notebook file that takes as input any PDF file provided and outputs a CSV file containing each location detected in the text of the publication in the format described above, plus a CSV file containing the output for each of the sample PDFs provided.

The notebook file should have one example PDF run through each cell, providing the outputs along the way for easy inspection. However, the code should be completely independent of the PDF file provided (if we re-run the notebook on a new PDF file, the output should still be correct).


<a id = "2"></a><br>
# 2. Libraries

In [1]:
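# Installation
# Pin PyPDF2 to the 2.x release shown in the output below:
# the PdfFileReader API used in this notebook was removed in PyPDF2 3.0
!pip install PyPDF2==2.6.0
!pip install openpyxl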

# Libraries
import pandas as pd
import numpy as np
from PyPDF2 import PdfFileReader
import spacy
from spacy import displacy
from geopy.exc import GeocoderTimedOut
from geopy.geocoders import Nominatim

pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
pd.set_option('display.width', 500)

Collecting PyPDF2
  Downloading PyPDF2-2.6.0-py3-none-any.whl (201 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-2.6.0
Collecting openpyxl
  Downloading openpyxl-3.0.10-py2.py3-none-any.whl (242 kB)
Collecting et-xmlfile
  Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.0.10

<a id = "3"></a><br>
# 3. Load Data

In [2]:
# Read PDF Data
target_file = "../input/nlpgeo/Moore1995-1.pdf"
opened_file = open(target_file, "rb")
pdf = PdfFileReader(opened_file)



<a id = "4"></a><br>
# 4. Extract Text

In [3]:
# get the number of pages
num_pages = pdf.getNumPages()
num_pages

16

In [4]:
# extract the text of every page and join the pages with spaces
text = ""
for i in range(num_pages):
    page = pdf.getPage(i)
    text = text + " " + page.extractText()
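
Text extracted from a PDF usually keeps the layout's line breaks and end-of-line hyphenation, which can split a place name before it reaches the entity model. The following is a minimal sketch of a clean-up step, not part of the original notebook; the helper name `normalize_pdf_text` is hypothetical. It assumes that a hyphen directly before a line break marks a split word, so genuinely hyphenated compounds that happen to wrap at the hyphen would lose it.

```python
import re


def normalize_pdf_text(raw: str) -> str:
    """Tidy text extracted from a PDF before passing it to an NER model."""
    # Re-join words hyphenated across a line break: "emis-\nsions" -> "emissions"
    # (assumption: a hyphen right before a newline is a layout artifact)
    joined = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    # Collapse newlines, tabs, and repeated spaces into single spaces
    return re.sub(r"\s+", " ", joined).strip()


sample = "methane emis-\nsions from  the Hudson\nBay Lowlands"
print(normalize_pdf_text(sample))
```

In this notebook the helper could be applied once to `text` before calling `nlp(text)`.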

<a id = "5"></a><br>
# 5. Named-Entity Recognition

In [5]:
# load the small spaCy English model for named-entity recognition
nlp = spacy.load("en_core_web_sm")
# nlp = spacy.load("en_core_web_lg") # a larger English model

In [6]:
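# Run the spaCy pipeline on the extracted text
doc = nlp(text)

# Collect every entity with its label and character offsets
entities = []
labels = []
position_start = []
position_end = []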

for ent in doc.ents:
    entities.append(ent)
    labels.append(ent.label_)
    position_start.append(ent.start_char)
    position_end.append(ent.end_char)

<a id = "6"></a><br>
# 6. Clean-Filter Data

In [7]:
# Keep only entities labelled as geopolitical entities (GPE) or other locations (LOC)
places = []  # candidate places; the model also mislabels some non-place nouns
for ent in doc.ents:
    if ent.label_ in ("GPE", "LOC"):
        places.append(ent.text)

In [8]:
# Places found in the document
places

['Manitoba',
 'Canada',
 'Montreal',
 'Quebec',
 'Canada',
 'lagg',
 'Kcorr',
 'pH',
 'Whalen',
 'Reeburgh',
 'Harriss',
 'Sundh',
 'Chanton',
 'Dacey',
 'Bay Lowland',
 'Alaska',
 'Thompson',
 'Manitoba',
 'Thompson',
 'Manitoba',
 'spp.',
 'Zoltai',
 'communities',
 'palsas',
 'Roulet',
 'Thompson',
 'pH',
 'Riley',
 'Wilkinson',
 'Braak',
 'pH',
 's.d.',
 'the Zo)tai fen',
 'pH',
 'pH',
 'lagg',
 'TF',
 'Palsa',
 'Calliergon',
 'Meesia',
 'fens',
 'palsas',
 'Chamaedaphne',
 'Betula',
 'lagg',
 'Reeburgh',
 'Vaccinium',
 'Calliergon',
 'Betula',
 'Meesia',
 'pH',
 'Canada',
 'pH',
 'Ontario',
 'Dalva',
 'pH',
 'Kcorr',
 'pH',
 'Schefferville',
 'Quebec',
 'the Hudson Bay Lowlands',
 'pH',
 'Ontario',
 'Quebec',
 'Clay Belt',
 'Ontario',
 'Schefferville',
 'Quebec',
 'Chanton',
 'Dacey',
 'Chanton',
 'Chanton',
 'Warnstorfia',
 'Roulet',
 'lagg',
 'lagg',
 'Habenaria',
 'intermedia',
 'Calliergon',
 'Sphagnum',
 'North America',
 'Mexico',
 'North America',
 'Mexico',
 'Alaska',
 'Re
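
The list above mixes real places with noise such as 'pH', 'lagg', and 'spp.', and it repeats names many times. As a rough illustration only, the sketch below counts each candidate once and drops the most obvious noise before any gazetteer lookup. The function name `summarize_places` and the heuristic itself (strip a leading "the", then require an uppercase initial and more than two characters) are assumptions, not part of the original notebook; surnames such as 'Whalen' still pass this filter and have to be removed by the reference-list check.

```python
from collections import Counter


def summarize_places(places):
    """Count mentions of each candidate place, dropping obvious noise."""
    counts = Counter()
    for name in places:
        cleaned = name.strip()
        # Drop a leading article, e.g. "the Hudson Bay Lowlands"
        if cleaned.lower().startswith("the "):
            cleaned = cleaned[4:]
        # Heuristic (an assumption): place names start with an uppercase
        # letter and are longer than two characters
        if len(cleaned) > 2 and cleaned[0].isupper():
            counts[cleaned] += 1
    return counts


sample = ["Manitoba", "Canada", "pH", "lagg", "Canada", "spp.",
          "the Hudson Bay Lowlands", "TF"]
print(summarize_places(sample))
```

The counts can also serve as a simple relevance signal: a name mentioned once is more likely to be a tagging error than one mentioned many times.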

In [9]:
# Validate the extracted places against a reference list of world cities
# (the file is in the helpers folder)
df = pd.read_excel("../input/worldcities/world_cities.xlsx")

In [10]:
# keep only the reference rows whose first column (the location name) matches an extracted place
df1 = df[df[df.columns[0]].isin(places)][['location', 'country']]
df1

| | location | country |
|---|---|---|
| 12 | New York | United States |
| 126 | Toronto | Canada |
| 234 | San Diego | United States |
| 366 | Portland | United States |
| 2801 | Portland | United States |
| ... | ... | ... |
| 85505 | Alberta | Canada |
| 85542 | Quebec | Canada |
| 85568 | Quebec | Canada |
| 85600 | Quebec | Canada |


In [11]:
# drop duplicate (location, country) rows; reassigning avoids modifying a slice of df in place
df1 = df1.drop_duplicates()
df1

| | location | country |
|---|---|---|
| 12 | New York | United States |
| 126 | Toronto | Canada |
| 234 | San Diego | United States |
| 366 | Portland | United States |
| 3002 | San Diego | Venezuela |
| 3033 | Ontario | United States |
| 3183 | Mexico | Philippines |
| 14368 | Wageningen | Netherlands |
| 17685 | Ocean | United States |
| 24318 | Thompson | United States |
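
The table shows that matching on the name alone is ambiguous: the same location name can appear under several countries (San Diego under both United States and Venezuela), and some matches look unlikely for this paper (Mexico listed under Philippines). One possible tie-breaker, sketched below under stated assumptions, is to prefer a candidate country that the document itself mentions. The function `pick_country` and the sample rows are hypothetical and only illustrate the idea; they are not taken from the reference file.

```python
def pick_country(candidates, mentioned):
    """Choose one country per location name.

    candidates: list of (location, country) pairs, possibly several per location
    mentioned:  set of place names found in the document
    """
    chosen = {}
    for location, country in candidates:
        if location not in chosen:
            # Fall back to the first candidate seen for this location
            chosen[location] = country
        elif country in mentioned and chosen[location] not in mentioned:
            # Prefer a country that the document itself mentions
            chosen[location] = country
    return chosen


# Hypothetical gazetteer rows, for illustration only
candidates = [
    ("Ontario", "United States"),
    ("Ontario", "Canada"),
    ("Wageningen", "Netherlands"),
]
mentioned = {"Canada", "Ontario", "Quebec"}
print(pick_country(candidates, mentioned))
```

This rule cannot help when the correct country is never named in the text, so ambiguous rows may still need a second signal such as city population.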


<a id = "7"></a><br>
# 7. Find The Coordinate

In [12]:
# Lists that collect the longitude and latitude
# of every value in the location column
longitude = []
latitude = []

# Create the geolocator once; Nominatim requires a non-empty
# user_agent that identifies your application
geolocator = Nominatim(user_agent="your_app_name")


# function to find the coordinates of a given city
def findGeocode(city, retries=3):
    # Return the geopy Location for the city, or None if nothing is found
    try:
        return geolocator.geocode(city)
    except GeocoderTimedOut:
        # Retry a limited number of times instead of recursing forever
        if retries > 0:
            return findGeocode(city, retries - 1)
        return None

In [13]:
# Geocode every location in df1 with the function above
for i in df1["location"]:
    # Call the geocoder once per location and reuse the result
    loc = findGeocode(i)

    if loc is not None:
        # Store the returned coordinates in the two separate lists
        latitude.append(loc.latitude)
        longitude.append(loc.longitude)
    else:
        # No coordinates found: insert NaN to mark the missing value
        latitude.append(np.nan)
        longitude.append(np.nan)
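
The loop above geocodes each location name, whereas the task statement asks for the centroid of the country that the location belongs to. A minimal sketch of that variant follows: it looks up each distinct country once and reuses the result, which also reduces the number of requests. The names `country_centroids` and `fake_geocode` are hypothetical; `geocode` is assumed to be any callable that returns an object with `latitude` and `longitude` attributes or `None`, such as `findGeocode`. The point a geocoder returns for a country is a representative point and may differ from a true geometric centroid.

```python
import math
from types import SimpleNamespace


def country_centroids(countries, geocode):
    """Map each distinct country to a (latitude, longitude) pair."""
    cache = {}
    for country in countries:
        if country in cache:
            continue  # already looked up, skip the extra request
        loc = geocode(country)
        if loc is None:
            # No result: mark the coordinates as missing
            cache[country] = (math.nan, math.nan)
        else:
            cache[country] = (loc.latitude, loc.longitude)
    return cache


# Stand-in geocoder so the sketch runs without network access;
# the Canada coordinates are the ones from the task's example output
calls = []


def fake_geocode(name):
    calls.append(name)
    table = {"Canada": SimpleNamespace(latitude=56.130366, longitude=-106.346771)}
    return table.get(name)


result = country_centroids(["Canada", "United States", "Canada"], fake_geocode)
print(result["Canada"], len(calls))
```

In the notebook, the same function could be called as `country_centroids(df1["country"], findGeocode)` and the resulting dictionary mapped back onto the `country` column.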

<a id = "8"></a><br>
# 8. Outputs

In [14]:
# add the new "Longitude" and "Latitude" columns to df1
df1["Longitude"] = longitude
df1["Latitude"] = latitude

df1

| | location | country | Longitude | Latitude |
|---|---|---|---|---|
| 12 | New York | United States | -74.006 | 40.713 |
| 126 | Toronto | Canada | -79.384 | 43.653 |
| 234 | San Diego | United States | -117.163 | 32.717 |
| 366 | Portland | United States | -122.674 | 45.52 |
| 3002 | San Diego | Venezuela | -117.163 | 32.717 |
| 3033 | Ontario | United States | -86.001 | 50.001 |
| 3183 | Mexico | Philippines | -102.008 | 23.659 |
| 14368 | Wageningen | Netherlands | 5.668 | 51.969 |
| 17685 | Ocean | United States | -74.332 | 39.978 |
| 24318 | Thompson | United States | -97.863 | 55.743 |


In [15]:
# export to CSV (uncomment the next line to write the file)
# df1.to_csv("moore1995.csv")
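
The task statement describes each output record as a location, its country, and a centroid pair. As a hedged sketch of one way to produce that layout, the helper below writes rows with a single `Centroid` column formatted as `(lat, lon)`. The function name `write_locations_csv` and the column names are assumptions based on the example in the problem description, not a format confirmed by the original notebook.

```python
import csv
import io


def write_locations_csv(rows, handle):
    """Write (location, country, latitude, longitude) rows as CSV."""
    writer = csv.writer(handle)
    writer.writerow(["Location", "Country", "Centroid"])
    for location, country, lat, lon in rows:
        # The centroid contains a comma, so csv.writer quotes this field
        writer.writerow([location, country, f"({lat}, {lon})"])


# Example row taken from the task description
rows = [("Alberta", "Canada", 56.130366, -106.346771)]
buffer = io.StringIO()
write_locations_csv(rows, buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, pass a handle opened with `open(path, "w", newline="")`.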

<a id = "9"></a><br>
# 9. References

* https://www.youtube.com/watch?v=N6Su4Hk8_-g

* https://stackoverflow.com/questions/52686159/how-to-extract-the-location-name-country-name-city-name-tourist-places-by-usi

* https://www.youtube.com/watch?v=gJMHbW3MK2w&t=843s

* https://www.geeksforgeeks.org/how-to-find-longitude-and-latitude-for-a-list-of-regions-or-country-using-python/

* https://www.youtube.com/watch?v=dIUTsFT2MeQ