# GIR19 analysis: data preparation

To reproduce the analysis in our GIR19 paper, you first need to prepare the data.

**Three files** from three different sources are required to reproduce the GIR19 figures:

- **GB1900:** `gb1900_gazetteer_complete_july_2018.csv`
- **WikiGazetteer:** `wikigazetteer.pkl`
- **Annotated data:** `gold_standard.pkl`

**Directory structure:**

```
├── GIR19_figs.ipynb
└── files
    ├── wikigazetteer.pkl
    ├── gb1900_gazetteer_complete_july_2018.csv
    └── gold_standard.pkl
```
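Before running the notebook, it can help to confirm that the layout above is in place. The following is a minimal sketch, assuming the three file names listed above and a `files/` directory next to the notebook; `missing_files` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path

# File names taken from the directory structure above
REQUIRED_FILES = [
    "wikigazetteer.pkl",
    "gb1900_gazetteer_complete_july_2018.csv",
    "gold_standard.pkl",
]

def missing_files(data_dir="files"):
    """Return the required file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (data_dir / name).is_file()]

# An empty list means every required file is in place
print(missing_files())
```

Note that `wikigazetteer.pkl` is produced by the steps below, so it is expected to be reported as missing until those steps have been run.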

The following steps will get your data ready to run the `GIR19_figs.ipynb` notebook.

In [1]:
import mysql.connector
from mysql.connector import Error
import pandas as pd

## Annotated data

A dataframe containing 950 manually annotated locations from 100 articles published between 1820 and 1870, randomly subsampled from the following newspaper titles:
* _The Dorchester and Sherborne Journal, and Western Advertiser_ (Dorchester, Dorset)
* _Dorset County Chronicle_ (Dorchester, Dorset)
* _The Poole and South-Western Herald_ (Poole, Dorset)
* _Poole and Bournemouth Herald_ (Poole, Dorset)

The dataframe contains the following information for all instances of locations in the subsampled data:
* Wikipedia id (`wiki_id` in WikiGazetteer)
* Wikipedia title (`wiki_title` in WikiGazetteer)
* Wikipedia lat (`wiki_lat` in WikiGazetteer)
* Wikipedia lon (`wiki_lon` in WikiGazetteer)

In [2]:
# File name as given in the list of required files and the directory structure above
gsdf = pd.read_pickle("files/gold_standard.pkl")
gsdf.head()

| | wiki_id | wiki_title | wiki_lat | wiki_lon |
|---|---|---|---|---|
| 0 | 48593 | Poole | 50.7167 | -1.98333 |
| 1 | 805 | Jersey | 49.19 | -2.11 |
| 2 | 236 | Battle_of_Waterloo | 50.68 | 4.412 |
| 3 | 979 | London | 51.5072 | -0.1275 |
| 4 | 979 | London | 51.5072 | -0.1275 |
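As a quick sanity check after loading, the sketch below verifies that the four columns listed above are present and counts the mentions and the distinct Wikipedia entries. `check_annotations` is a hypothetical helper, not part of the original notebook, and the sample rows are copied from the preview shown above so that the snippet runs without the real file.

```python
import pandas as pd

# Columns described in the list above
EXPECTED_COLUMNS = ["wiki_id", "wiki_title", "wiki_lat", "wiki_lon"]

def check_annotations(df):
    """Raise if an expected column is missing, otherwise return simple counts."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    return {
        "mentions": len(df),                          # one row per annotated mention
        "unique_places": df["wiki_id"].nunique(),     # distinct Wikipedia entries
    }

# Sample rows copied from the preview above
sample = pd.DataFrame({
    "wiki_id": [48593, 805, 236, 979, 979],
    "wiki_title": ["Poole", "Jersey", "Battle_of_Waterloo", "London", "London"],
    "wiki_lat": [50.7167, 49.19, 50.68, 51.5072, 51.5072],
    "wiki_lon": [-1.98333, -2.11, 4.412, -0.1275, -0.1275],
})
summary = check_annotations(sample)
print(summary)
```

On the real data, `check_annotations(gsdf)` would be the equivalent call.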


## GB1900

GB1900 (citation below) is a crowdsourced transcription of the text labels on a late nineteenth-century edition of Ordnance Survey (OS) maps of Britain. The GB1900 Complete Gazetteer dataset includes ca. 2.55 million features, both named and unnamed, covering any text label on the maps.

> Humphrey Southall and Paula Aucott. 2019. Locating Past Places in Britain: Creating and evaluating the GB1900 Gazetteer. International Journal of Humanities & Arts Computing 13, 2 (2019), 1–8.

Download GB1900 from http://www.visionofbritain.org.uk/data/ ("complete GB1900 gazetteer"). Store it in the `files/` folder. The version we used was `gb1900_gazetteer_complete_july_2018.csv`.

In [3]:
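# The GB1900 CSV is UTF-16 encoded. low_memory=False makes pandas infer each
# column's dtype from the whole file rather than chunk by chunk.
with open("files/gb1900_gazetteer_complete_july_2018.csv", encoding='UTF-16') as f:
    df = pd.read_csv(f, low_memory=False)
df.head()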


| | pin_id | final_text | nation | local_authority | parish | osgb_east | osgb_north | latitude | longitude | notes |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52b34d8b695fe90005004e1e | F. P. | Wales | Powys | Llansilin | 320836.712742 | 327820.182715 | 52.84205 | -3.176744 | |
| 1 | 5800a6b92c66dcab3d061796 | Parly. & Munl Boro. By. | England | City of London | | 531794.825962 | 180705.741898 | 51.509918 | -0.102246 | |
| 2 | 5800a6782c66dcab3d061786 | S. Ps. | England | City of London | | 531736.217116 | 180725.02773 | 51.510105 | -0.103083 | |
| 3 | 57f684f42c66dcab3d01c0dd | Southwark Bridge Stairs | England | City of London | | 532199.584123 | 180696.934434 | 51.509744 | -0.09642 | |
| 4 | 57f685002c66dcab3d01c0e9 | St. Paul's Pier | England | City of London | | 531987.486097 | 180745.664556 | 51.510232 | -0.099456 | |
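Because the file is large, it may be worth checking the encoding and the header before loading everything. The sketch below reads only the first row with the standard library; `read_header` is a hypothetical helper, and the expected column names are taken from the preview above on the assumption that they match the CSV header. The demo writes a tiny UTF-16 file so that the snippet runs without the real download.

```python
import csv
import os
import tempfile

# Column names as they appear in the preview above (assumed to match the CSV header)
GB1900_COLUMNS = ["pin_id", "final_text", "nation", "local_authority", "parish",
                  "osgb_east", "osgb_north", "latitude", "longitude", "notes"]

def read_header(path, encoding="utf-16"):
    """Return the first row of a comma-separated text file as a list of names."""
    with open(path, encoding=encoding, newline="") as f:
        return next(csv.reader(f))

# Self-contained demo: write a tiny UTF-16 file, then read its header back
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="utf-16", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(GB1900_COLUMNS)
        writer.writerow(["57f685002c66dcab3d01c0e9", "St. Paul's Pier", "England",
                         "City of London", "", "531987.486097", "180745.664556",
                         "51.510232", "-0.099456", ""])
    header = read_header(demo_path)

print(header == GB1900_COLUMNS)
```

On the real data, the equivalent call would be `read_header("files/gb1900_gazetteer_complete_july_2018.csv")`.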


## WikiGazetteer

WikiGazetteer is a gazetteer based on Wikipedia and enhanced with information from Geonames. Before continuing, make sure you have built it by following the instructions at https://github.com/alan-turing-institute/lwm_GIR19_resolving_places/blob/master/gazetteer_construction/README.md.

### Start MySQL connection

Before running the next cell, replace the credentials in it with your own so that you can connect to the database:

In [4]:
# None marks "not connected yet" until the connection succeeds
gazDB = None
cursorGaz = None
try:
    gazDB = mysql.connector.connect(
        host='localhost',
        database='wikiGazetteer',
        user='testGazetteer',
        password='1234')
    if gazDB.is_connected():
        # dictionary=True makes each fetched row a dict keyed by column name
        cursorGaz = gazDB.cursor(dictionary=True)
except Error as e:
    print("Error while connecting to MySQL", e)
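One way to avoid editing credentials directly in the notebook is to read them from environment variables. The following is a minimal sketch of that idea; the `GAZ_DB_*` variable names and the `load_db_config` helper are hypothetical, and the defaults simply mirror the values in the cell above.

```python
import os

def load_db_config(env=None):
    """Build keyword arguments for mysql.connector.connect() from a mapping.

    The GAZ_DB_* names are hypothetical; adapt them to your own setup.
    """
    env = os.environ if env is None else env
    return {
        "host": env.get("GAZ_DB_HOST", "localhost"),
        "database": env.get("GAZ_DB_NAME", "wikiGazetteer"),
        "user": env.get("GAZ_DB_USER", "testGazetteer"),
        "password": env.get("GAZ_DB_PASSWORD", ""),
    }

# Demo with an explicit mapping instead of the real environment
config = load_db_config({"GAZ_DB_USER": "alice", "GAZ_DB_PASSWORD": "s3cret"})
print(sorted(config))
```

The resulting dictionary could then be passed on as `mysql.connector.connect(**load_db_config())`.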

### Create dataframe from gazetteer

The dataframe contains the following information for all locations in WikiGazetteer:
* Wikipedia id (`wiki_id` in WikiGazetteer)
* Wikipedia title (`wiki_title` in WikiGazetteer)
* Wikipedia lat (`wiki_lat` in WikiGazetteer)
* Wikipedia lon (`wiki_lon` in WikiGazetteer)

In [5]:
def find_locations(cursorGaz):
    """Return every WikiGazetteer location as a dataframe with wiki_* columns."""
    cursorGaz.execute("""
            SELECT id, wiki_title, lat, lon FROM location
        """)
    # The cursor was created with dictionary=True, so each row is a dict
    results = cursorGaz.fetchall()

    # Build the dataframe directly from the rows, then rename to the wiki_* names
    df = pd.DataFrame(results, columns=['id', 'wiki_title', 'lat', 'lon'])
    return df.rename(columns={'id': 'wiki_id', 'lat': 'wiki_lat', 'lon': 'wiki_lon'})

In [6]:
wikidf = find_locations(cursorGaz)
wikidf.head()

| | wiki_id | wiki_title | wiki_lat | wiki_lon |
|---|---|---|---|---|
| 0 | 1 | İstasyon_Cyprus | 35.1103 | 32.8797 |
| 1 | 2 | List_of_Cultural_Properties_of_the_Philippines... | 14.0252 | 121.593 |
| 2 | 3 | List_of_Cultural_Properties_of_the_Philippines... | 14.0242 | 121.594 |
| 3 | 4 | List_of_Cultural_Properties_of_the_Philippines... | 14.0239 | 121.594 |
| 4 | 5 | List_of_Cultural_Properties_of_the_Philippines... | 14.0284 | 121.594 |


In [7]:
wikidf.to_pickle("files/wikigazetteer.pkl")

### Close MySQL connection

In [8]:
# Only close if a connection object was actually created and is still open
if gazDB and gazDB.is_connected():
    cursorGaz.close()
    gazDB.close()
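An alternative to closing the connection in a separate cell is to wrap it in `contextlib.closing`, which calls `close()` even if a query raises. The sketch below uses the standard-library `sqlite3` module purely as a stand-in so that it runs without a MySQL server; the same pattern should apply to `mysql.connector` connections and cursors, since both expose `close()`. The table and row are illustrative only.

```python
import sqlite3
from contextlib import closing

# sqlite3 is a stand-in here so the sketch runs without a MySQL server
with closing(sqlite3.connect(":memory:")) as conn, closing(conn.cursor()) as cur:
    # Illustrative table with the same columns queried in find_locations
    cur.execute("CREATE TABLE location (id INTEGER, wiki_title TEXT, lat REAL, lon REAL)")
    cur.execute("INSERT INTO location VALUES (1, 'Poole', 50.7167, -1.98333)")
    cur.execute("SELECT id, wiki_title, lat, lon FROM location")
    rows = cur.fetchall()

# Both the cursor and the connection are closed at this point
print(rows)
```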