# Disambiguation 2

**KEY OUTPUT**: `matches.csv`, which (1) includes the lat/lng of each CD record, (2) has a confidence score that accounts for census conflicts, and (3) is sorted by census record index.

Improves on the confidence score generation process in `disambiguation_analysis.ipynb`.  
Summary of actions:
- Add the lat/lng of each CD record back into the matched dataset.
- Add a column with the number of potential CD matches for each census record ("census conflicts").
- Extend the confidence score to include age and the number of **census conflicts**.
    - Difference between CD and census conflicts: if one CD record is matched to 2 census records, its CD conflict count is 2. For those two census records, however, this CD record might be the only record matched to them, so each of them might have a census conflict count of 1.  
    - We use census records as anchors for spatial disambiguation, where confidence = 1. Hence, the confidence score should count census conflicts rather than CD conflicts.  
    - The number of CD conflicts is used in the disambiguation process instead.
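
The difference between the two conflict counts can be illustrated with a small sketch. The IDs and the column names `CENSUS_ID`, `CD_ID`, `cd_conflicts` and `census_conflicts` below are made up for illustration and are not the columns of the real dataset. The counting uses the same `groupby(...).transform('count')` pattern as the notebook.

```python
import pandas as pd

# Toy match table: one row per (census record, CD record) candidate pair.
# All IDs are hypothetical.
toy = pd.DataFrame({
    "CENSUS_ID": [1, 2, 3, 3],
    "CD_ID": [10, 10, 11, 12],
})

# CD conflicts: how many census records each CD record is matched to.
toy["cd_conflicts"] = toy.groupby("CD_ID")["CENSUS_ID"].transform("count")

# Census conflicts: how many CD records each census record is matched to.
toy["census_conflicts"] = toy.groupby("CENSUS_ID")["CD_ID"].transform("count")

print(toy)
```

CD record 10 has a CD conflict count of 2, yet census records 1 and 2 each have a census conflict count of 1. Census record 3 has two candidate CD records, so its census conflict count is 2.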

**_Note_**: This notebook has not been run, because the process was already carried out in another notebook. It nonetheless accurately documents the steps taken to produce `matches.csv`.

In [3]:
import pandas as pd

## Joining lat lng

In [4]:
match = pd.read_csv("../data/match_results_confidence_score.csv")

In [5]:
latlng = pd.read_csv("../data/cd_1880.csv")
latlng = latlng[['OBJECTID', 'LONG', 'LAT']]

In [6]:
# left join keeps every match row; validate checks that each OBJECTID appears only once in latlng
match = match.merge(latlng, how='left', on='OBJECTID', validate='many_to_one')
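
As a minimal sketch of what `validate='many_to_one'` guards against, the toy frames below (made-up IDs and coordinates, hypothetical names `matches`, `coords` and `dup_coords`) show a clean join and a join where the coordinate table contains a duplicated `OBJECTID`. In the second case pandas raises `MergeError` instead of silently duplicating match rows.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical match rows: several matches may share one CD OBJECTID.
matches = pd.DataFrame({"OBJECTID": [10, 10, 11], "CENSUS_ID": [1, 2, 3]})

# Hypothetical coordinate table: one row per CD OBJECTID.
coords = pd.DataFrame({
    "OBJECTID": [10, 11],
    "LONG": [-73.99, -74.00],
    "LAT": [40.71, 40.72],
})

# Clean case: every match row picks up exactly one coordinate pair.
merged = matches.merge(coords, how="left", on="OBJECTID", validate="many_to_one")

# Broken case: OBJECTID 10 appears twice on the right-hand side.
dup_coords = pd.concat([coords, coords.iloc[[0]]], ignore_index=True)
try:
    matches.merge(dup_coords, how="left", on="OBJECTID", validate="many_to_one")
    duplicate_detected = False
except MergeError:
    duplicate_detected = True

print(len(merged), duplicate_detected)
```

Without the `validate` argument, the second merge would succeed and return extra rows, which would later inflate the conflict counts.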

In [7]:
# sort by original census order (OBJECTID.x is the census record ID)
match = match.sort_values(by='OBJECTID.x')

## Reconstructing the confidence score
#### Using the following weights (arbitrarily decided) in the confidence score
1. **50%** - Jaro-Winkler distance
2. **20%** - No. of CD matches (conflicts)
3. **20%** - No. of census matches
4. **5%** - Absence of occupation in the census (*)
5. **5%** - Whether age is greater than 12 (the score is 0 for ages 12 and under)

In [8]:
# recalculate the scores to include age and census conflicts
# age_score: 0 for ages 12 and under, 1 otherwise
match['age_score'] = match['CENSUS_AGE'].apply(lambda x: 0 if x <= 12 else 1)
# census_count: number of CD records matched to each census record (census conflicts)
match['census_count'] = match.groupby('OBJECTID.x')['OBJECTID'].transform('count')
match['confidence_score'] = (
    .5 * match.jaro_winkler_aggr_score
    + .2 * (1 / match.num_matches)
    + .2 * (1 / match.census_count)
    + .05 * match.census_occupation_listed
    + .05 * match.age_score
)
match['confidence_score'] = match['confidence_score'].round(decimals=2)
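
To make the weighting concrete, here is a single-row sketch of the same formula in plain Python. The function name `confidence_score` and the input values are hypothetical. It assumes `occupation_listed` is a 0/1 flag, as the weighted sum suggests.

```python
def confidence_score(jw_score, num_matches, census_count, occupation_listed, age):
    """Single-row version of the weighted sum used in the notebook (illustrative)."""
    age_score = 0 if age <= 12 else 1
    score = (
        0.5 * jw_score
        + 0.2 / num_matches          # fewer matches -> higher score
        + 0.2 / census_count         # fewer census conflicts -> higher score
        + 0.05 * occupation_listed   # assumed to be a 0/1 flag
        + 0.05 * age_score
    )
    return round(score, 2)


# An adult with a single, unambiguous match: 0.45 + 0.2 + 0.2 + 0.05 + 0.05
unique_adult = confidence_score(0.9, 1, 1, 1, 30)      # 0.95

# A child with two conflicts on each side and no flag set: 0.45 + 0.1 + 0.1
contested_child = confidence_score(0.9, 2, 2, 0, 8)    # 0.65

print(unique_adult, contested_child)
```

With the same Jaro-Winkler score, the two conflict terms and the two 5% flags together move the result by 0.30. A score of 1 additionally requires a Jaro-Winkler score of 1.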

## Export

In [10]:
# rename the two OBJECTID columns so the CD and census IDs cannot be confused
match = match.rename(columns={'OBJECTID': 'CD_ID', 'OBJECTID.x': 'CENSUS_ID'})

In [11]:
match['CD_ID'] = 'CD_' + match['CD_ID'].astype(str)
match['CENSUS_ID'] = 'CENSUS_' + match['CENSUS_ID'].astype(str)

In [12]:
match.to_csv('../data/matches.csv')