# GIR19 analysis: data preparation

To reproduce the analysis in our GIR19 paper, you first need to prepare the data.

**Three files** from three different sources are required to reproduce the GIR19 figures:

- **GB1900:** `gb1900_gazetteer_complete_july_2018.csv`
- **WikiGazetteer:** `wikigazetteer.pkl`
- **Annotated data:** `gold_standard.pkl`

**Directory structure:**

```
├── GIR19_figs.ipynb
└── files
    ├── wikigazetteer.pkl
    ├── gb1900_gazetteer_complete_july_2018.csv
    └── gold_standard.pkl
```

The following steps will get your data ready to run the `GIR19_figs.ipynb` notebook.
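
As a quick sanity check of the directory structure above, the following minimal sketch lists which of the three required files are not yet in place. The helper name `missing_files` and the constant `REQUIRED_FILES` are hypothetical and not part of the original notebook. The file names are taken from the list above. Note that `wikigazetteer.pkl` will be reported as missing until it has been created in the WikiGazetteer section of this notebook.

```python
from pathlib import Path

# File names as listed above; REQUIRED_FILES and missing_files are
# hypothetical names introduced only for this illustration.
REQUIRED_FILES = [
    "gb1900_gazetteer_complete_july_2018.csv",
    "wikigazetteer.pkl",
    "gold_standard.pkl",
]


def missing_files(data_dir="files"):
    """Return the required file names that are not present in data_dir."""
    base = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


print("Missing files:", missing_files())
```

An empty list means all three files are in place and the `GIR19_figs.ipynb` notebook should be able to find them.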

In [None]:
import mysql.connector
from mysql.connector import Error
import pandas as pd

## Annotated data

Dataframe containing 950 manually annotated locations from 100 articles, published between 1820 and 1870, that were randomly subsampled from the following newspaper titles:
* _The Dorchester and Sherborne Journal, and Western Advertiser_ (Dorchester, Dorset)
* _Dorset County Chronicle_ (Dorchester, Dorset)
* _The Poole and South-Western Herald_ (Poole, Dorset)
* _Poole and Bournemouth Herald_ (Poole, Dorset)

The dataframe contains the following information for all instances of locations in the subsampled data:
* Wikipedia id (`wiki_id` in WikiGazetteer)
* Wikipedia title (`wiki_title` in WikiGazetteer)
* Wikipedia lat (`wiki_lat` in WikiGazetteer)
* Wikipedia lon (`wiki_lon` in WikiGazetteer)

In [None]:
# File name as given in the list of required files and the directory structure above
gsdf = pd.read_pickle("files/gold_standard.pkl")
gsdf.head()

## GB1900

GB1900 (citation below) is a crowdsourced transcription of text labels on a late nineteenth-century edition of Ordnance Survey (OS) maps of Britain. The GB1900 Complete Gazetteer dataset includes ca. 2.55 million named and unnamed features (i.e. any text label).

> Humphrey Southall and Paula Aucott. 2019. Locating Past Places in Britain: Creating and evaluating the GB1900 Gazetteer. International Journal of Humanities & Arts Computing 13, 2 (2019), 1–8.

Download GB1900 from http://www.visionofbritain.org.uk/data/ ("complete GB1900 gazetteer"). Store it in the `files/` folder. The version we used was `gb1900_gazetteer_complete_july_2018.csv`.

In [None]:
with open("files/gb1900_gazetteer_complete_july_2018.csv", encoding='UTF-16') as f:
    df = pd.read_csv(f)
df.head()
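
The cell above decodes the file as UTF-16 before parsing it. As a minimal sketch of the same idea, `pd.read_csv` also accepts an `encoding` argument directly. The helper name `read_utf16_csv` is hypothetical, and the toy file with its `name` and `value` columns is invented for illustration only; it does not reflect the actual GB1900 column names.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_utf16_csv(path):
    """Read a UTF-16 encoded CSV file into a dataframe (hypothetical helper)."""
    return pd.read_csv(path, encoding="utf-16")


# Toy data, invented for this example: write a small UTF-16 CSV and read it back.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_text("name,value\nDorchester,1\nPoole,2\n", encoding="utf-16")
    sample_df = read_utf16_csv(sample)

print(sample_df)
```

If the real file were read without the UTF-16 encoding, pandas would most likely fail with a decoding error, so it is worth keeping the encoding explicit.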

## WikiGazetteer

WikiGazetteer is a gazetteer based on Wikipedia and enhanced with information from GeoNames. Before continuing, make sure you have built it by following the instructions at https://github.com/alan-turing-institute/lwm_GIR19_resolving_places/blob/master/gazetteer_construction/README.md.

### Start MySQL connection

Before running the next cell, replace the placeholder credentials with your own so that the connection to the database can be established:

In [None]:
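# Initialise both handles to None so that a failed connection is easy to detect later.
gazDB = None
cursorGaz = None
try:
    gazDB = mysql.connector.connect(
            host='localhost',
            database='wikiGazetteer',
            user='xxxxxxxx',      # replace with your MySQL user
            password='xxxxxxx')   # replace with your MySQL password
    if gazDB.is_connected():
        # dictionary=True makes each fetched row a dict keyed by column name
        cursorGaz = gazDB.cursor(dictionary=True)
except Error as e:
    print("Error while connecting to MySQL", e)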
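
If you would rather not type credentials into the notebook, one option is to read them from environment variables. The following is only a sketch under stated assumptions: the helper `db_config_from_env` and the variable names `GAZ_DB_HOST`, `GAZ_DB_NAME`, `GAZ_DB_USER` and `GAZ_DB_PASSWORD` are hypothetical and not part of the original setup. The returned keys match the keyword arguments used in the connection cell above.

```python
import os


def db_config_from_env(env=None):
    """Build the connection keyword arguments from environment variables.

    The variable names are hypothetical. Host and database fall back to the
    values used in this notebook; user and password must be set explicitly,
    otherwise a KeyError is raised.
    """
    env = os.environ if env is None else env
    return {
        "host": env.get("GAZ_DB_HOST", "localhost"),
        "database": env.get("GAZ_DB_NAME", "wikiGazetteer"),
        "user": env["GAZ_DB_USER"],
        "password": env["GAZ_DB_PASSWORD"],
    }


# Example with an explicit mapping instead of the real environment:
example_config = db_config_from_env({"GAZ_DB_USER": "me", "GAZ_DB_PASSWORD": "secret"})
print(sorted(example_config))
```

With the variables set in your shell, the connection call would then become `mysql.connector.connect(**db_config_from_env())`.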

### Create dataframe from gazetteer

The dataframe contains the following information for all locations in WikiGazetteer (read from the `location` table):
* Wikipedia id (`wiki_id`, from column `id`)
* Wikipedia title (`wiki_title`, from column `wiki_title`)
* Wikipedia latitude (`wiki_lat`, from column `lat`)
* Wikipedia longitude (`wiki_lon`, from column `lon`)

In [None]:
def find_locations(cursorGaz):
    """Return a dataframe with the id, title and coordinates of every WikiGazetteer location."""
    cursorGaz.execute("""
            SELECT id, wiki_title, lat, lon FROM location
        """)
    # The cursor was created with dictionary=True, so each row is a dict keyed by column name.
    results = cursorGaz.fetchall()

    # Fix the column order explicitly, then rename to the names used in the rest of the analysis.
    df = pd.DataFrame(results, columns=['id', 'wiki_title', 'lat', 'lon'])
    return df.rename(columns={'id': 'wiki_id', 'lat': 'wiki_lat', 'lon': 'wiki_lon'})

In [None]:
wikidf = find_locations(cursorGaz)
wikidf.head()

In [None]:
wikidf.to_pickle("files/wikigazetteer.pkl")
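
Before closing the database connection, it can be worth confirming that the pickle file reads back as the same dataframe. The following is a minimal sketch of such a round-trip check. The helper name `pickle_round_trip_ok` is hypothetical, and the toy rows are invented for illustration; they are not taken from WikiGazetteer.

```python
import tempfile
from pathlib import Path

import pandas as pd


def pickle_round_trip_ok(df, path):
    """Write df to path as a pickle and check that reading it back gives an equal dataframe."""
    df.to_pickle(path)
    return bool(pd.read_pickle(path).equals(df))


# Toy dataframe with the same column names as wikidf (values are invented).
toy = pd.DataFrame({
    "wiki_id": [1, 2],
    "wiki_title": ["Place_A", "Place_B"],
    "wiki_lat": [50.5, 51.5],
    "wiki_lon": [-2.5, -1.5],
})

with tempfile.TemporaryDirectory() as tmp:
    ok = pickle_round_trip_ok(toy, Path(tmp) / "toy.pkl")

print("Round trip preserved the dataframe:", ok)
```

The same check could be applied to `wikidf` and `files/wikigazetteer.pkl`, at the cost of writing the file a second time.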

### Close MySQL connection

In [None]:
# Only close the handles if the connection was actually established.
if gazDB and gazDB.is_connected():
    cursorGaz.close()
    gazDB.close()