In [1]:
%run resources/library.py

In [2]:
style_notebook()

Elimination of Maternal-to-Child Transmission in Thailand, 2013-2016

# Notebook 3: Process Data

## Reusing the pickled dataframes

In [3]:
import pandas as pd
pd.__version__

'0.24.2'

In [4]:
%run resources/pandas.py

To reuse the dataframes we pickled in Notebook 2, we will use the `pandas` function `pd.read_pickle()`. After reading each pickled file into its similarly named dataframe, we can display its first five records using `.head()`.

In [5]:
mtct_df2 = pd.read_pickle('data/mtct_df2.pickle')
mtct2016_df1 = pd.read_pickle("data/mtct2016_df.pickle")
nb_by_province_df2 = pd.read_pickle('data/nb_by_province_df2.pickle')
translate_province_df1 = pd.read_pickle("data/translate_province_df.pickle")
th_geo_df2 = pd.read_pickle('data/th_geo_df2.pickle')

> If the default for `.head()` is 5 records, how do you display the first 10 records?

In [6]:
mtct2016_df1.head(10)

Unnamed: 0,Year_Recorded,Region,Province_TH,HIVpos_children,2PCR_children,TRpct,livebirths_100k,pregnancies,HIVpos_pregwomen_labor_room,product
231,2016,11,กระบี่,0,15,0.00%,0.0,6482,28,0.0
232,2016,13,กรุงเทพฯ,5,188,2.66%,15.36,32556,193,0.026596
233,2016,4,กาญจนบุรี,0,41,0.00%,0.0,8431,47,0.0
234,2016,6,กาฬสินธุ์,0,39,0.00%,0.0,7277,42,0.0
235,2016,8,กำแพงเพชร,1,33,3.03%,18.62,5371,36,0.030303
236,2016,6,ขอนแก่น,1,66,1.52%,6.26,15979,101,0.015152
237,2016,3,จันทบุรี,0,44,0.00%,0.0,5200,41,0.0
238,2016,3,ฉะเชิงเทรา,0,21,0.00%,0.0,4813,30,0.0
239,2016,3,ชลบุรี,1,197,0.51%,3.59,27826,194,0.005076
240,2016,8,ชัยนาท,0,4,0.00%,0.0,2282,17,0.0


In [10]:
translate_province_df1.head(10)

Unnamed: 0,Province_TH,Province_EN
1,ทั่วราชอาณาจักร,Whole Kingdom
2,กรุงเทพมหานคร,Bangkok
3,จังหวัดอำนาจเจริญ,Amnat Charoen Province
4,จังหวัดอ่างทอง,Ang Thong Province
5,จังหวัดบึงกาฬ,Bueng Kan Province
6,จังหวัดบุรีรัมย์,Buri Ram Province
7,จังหวัดฉะเชิงเทรา,Chachoengsao Province
8,จังหวัดชัยนาท,Chai Nat Province
9,จังหวัดชัยภูมิ,Chaiyaphum Province
10,จังหวัดจันทบุรี,Chanthaburi Province


## Begin Translation Process

In the previous notebook, Notebook 2, we created two dataframes we will use in this and subsequent notebooks:
1. __`mtct2016_df1`__: 2016 subset of MTCT data from 2013-2016 which has only Thai province names.
2. __`translate_province_df1`__: Data from the file `NB-by-Province.csv`, which has Thai and English province names.

### Step 1: Import required python packages

To translate Thai province names in `mtct2016_df1` using `translate_province_df1`, we use two Python packages:
1. __FuzzyWuzzy__ - You can learn more about the `fuzzywuzzy` package from its GitHub repository [here](https://github.com/seatgeek/fuzzywuzzy).
2. __GoogleTrans__ - You can learn more about the `googletrans` package from the PyPI documentation [here](https://pypi.org/project/googletrans/).

In [70]:
import googletrans

googletrans.__version__

'2.3.0'

In [14]:
from googletrans import Translator

translator = Translator(service_urls=[
      'translate.google.co.uk',
      'translate.google.co.kr',
      'translate.google.co.th',
      'translate.google.co.in',
      'translate.google.com'
    ])
#translator = Translator()

Let's test the Google Translate package.

In [None]:
test = await translator.translate("สมุทรสาคร", dest='en')
test.text

'Samut Sakhon'

You should see the value "`Samut Sakhon`" in the Output Cell.

### Important Note: The CSV from the Thailand shapefile has an error.

When we use the CSV derived from the `THA_adm1` shapefile, the Thai name recorded for 'Bangkok Metropolis' is wrong: it translates to 'Chiangmai Province'.

In [17]:
th_geo_df2

Unnamed: 0,ISO_CODE,Province_EN,Province_TH
0,TH-37,Amnat Charoen,จังหวัดอำนาจเจริ
1,TH-15,Ang Thong,จังหวัดอ่างทอง
2,TH-10,Bangkok Metropolis,จังหวัดเชียงใหม่
3,TH-38,Bueng Kan,บึงกาฬ
4,TH-31,Buri Ram,จังหวัดบุรีรัมย์
5,TH-24,Chachoengsao,จังหวัดฉะเชิงเทร
6,TH-18,Chai Nat,จังหวัดชัยนาท
7,TH-36,Chaiyaphum,จังหวัดชัยนาท
8,TH-22,Chanthaburi,จันทบุรี
9,TH-50,Chiang Mai,จังหวัดเชียงใหม่


Let's use the `.loc[]` indexer that we used in Notebook 2 to display the dataframe record with index label 2, which is the third row.

In [19]:
print(th_geo_df2.loc[2])

ISO_CODE       TH-10             
Province_EN    Bangkok Metropolis
Province_TH    จังหวัดเชียงใหม่  
Name: 2, dtype: object
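
Because `.loc[]` selects by index label while `.iloc[]` selects by position, the two agree only when the index is the default 0, 1, 2, ... range. The following is a minimal sketch on a toy dataframe; the name `toy_df` and its index values are made up for illustration and are not part of `th_geo_df2`.

```python
import pandas as pd

# Toy frame with a non-default index so that labels and positions differ.
toy_df = pd.DataFrame(
    {"Province_EN": ["Amnat Charoen", "Ang Thong", "Bangkok Metropolis"]},
    index=[10, 11, 12],
)

by_label = toy_df.loc[12]      # row whose index label is 12
by_position = toy_df.iloc[2]   # third row, counting positions from 0

print(by_label["Province_EN"])
print(by_position["Province_EN"])
```

Both lookups return the same row here, but `toy_df.loc[2]` would raise a `KeyError` because no row carries the label 2.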


We can validate that error with the `googletrans` package that we tested above. We can copy-paste the `Province_TH` value from the `print()` output above into the `translator.translate()` function below. The `test` variable should display the value `Chiangmai Province` in the output cell.

We can print just the Thai name by selecting the `Province_TH` element...

In [21]:
print(th_geo_df2.loc[2]['Province_TH'])

จังหวัดเชียงใหม่


...and substitute it for the input text for the `.translate()` dot function below:

In [None]:
test = await translator.translate(th_geo_df2.loc[2]['Province_TH'], dest='en')

test.text

'Chiangmai Province'

In [None]:
test = await translator.translate("จังหวัดเชียงใหม่", dest='en')
test.text

'Chiangmai Province'

### Step 2: Define variables to use in the translation process.

Let's create a `pandas` dataframe, `fuzzy_df1`, which we will use to store fuzzy matching metrics and Google Translate translations. We will use the following variables as column names for `fuzzy_df1`.

* `prov_th_1`: variable to store original province name in Thai from `mtct2016_df1`
* `prov_en_1`: variable to store English province name from Google Translate
* `prov_th_2`: original province name in Thai from `translate_province_df1` (lookup table)
* `prov_en_2`: original province name in English from `translate_province_df` (lookup table) - This is selected algorithmically using `fuzzywuzzy` from pair-wise matching with the `Province_TH` column from `mtct2016_df1`.
* `sm_prov_th_fr`: similarity measure between `prov_th_1` and `prov_th_2` using `fuzz.ratio` from `fuzzywuzzy`. Read more about various similarity measures generated by `fuzzywuzzy` for two pieces of text [here](https://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/).
* `sm_prov_th_fpr`: similarity measure between `prov_th_1` and `prov_th_2` using `fuzz.partial_ratio` from `fuzzywuzzy`
* `sm_prov_en_fr`: similarity measure between `prov_en_1` and `prov_en_2` using `fuzz.ratio`
* `sm_prov_en_fpr`: similarity measure between `prov_en_1` and `prov_en_2` using `fuzz.partial_ratio`
* `mean`: mean value of `sm_prov_th_fr`, `sm_prov_th_fpr`, `sm_prov_en_fr`, `sm_prov_en_fpr`; for province names that Google Translate is known to mistranslate, `sm_prov_th_fpr` alone is used instead
* `maxval`: maximum `mean` value for each loop / sampling
* `bestmatch`: use this to easily view which are the highest scoring translations
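
To make the difference between the two kinds of similarity measure concrete, here is a rough sketch that uses only the standard library. It is an approximation written for illustration, not the `fuzzywuzzy` implementation: the helper names `simple_ratio` and `simple_partial_ratio` are hypothetical, and `fuzzywuzzy` chooses its candidate substrings differently and may use a faster backend, so its scores can differ.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Similarity of the two full strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def simple_partial_ratio(a: str, b: str) -> int:
    """Best score of the shorter string against any equal-length window of the longer one."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0
    windows = (
        longer[i:i + len(shorter)]
        for i in range(len(longer) - len(shorter) + 1)
    )
    return max(simple_ratio(shorter, w) for w in windows)


full = simple_ratio("ayutthaya", "phra nakhon si ayutthaya")
partial = simple_partial_ratio("ayutthaya", "phra nakhon si ayutthaya")
print(full, partial)
```

The full-string score is pulled down by the extra words in the longer name, while the partial score reaches 100 because the shorter name appears verbatim inside the longer one. This is why the notebook records both measures.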

### Step 3: Create new dataframe to hold values for translation process.

Let's create a new, empty dataframe, `fuzzy_df1`, from a dictionary of empty columns, `row_data`.

In [None]:
row_data = {
        #"iso": [],
        "prov_th_1": [],
        "prov_en_1": [],
        "prov_th_2": [],
        "prov_en_2": [],
        "sm_prov_th_fr": [],
        "sm_prov_th_fpr": [],
        "sm_prov_en_fr": [],
        "sm_prov_en_fpr": [],
        "mean": [],
        "maxval": [],
        "bestmatch": []
    }
# build the empty dataframe from row_data so it has the columns listed in Step 2
fuzzy_df1 = pd.DataFrame(row_data)


We should see an empty dataframe, `fuzzy_df1`, below.

In [27]:
fuzzy_df1

Unnamed: 0,prov_th_1,prov_en_1,prov_th_2,prov_en_2,sm_prov_th_fr,sm_prov_th_fpr,sm_prov_en_fr,sm_prov_en_fpr,mean,maxval,bestmatch


Column `bestmatch` should be type `object` (a variable length string).

In [29]:
fuzzy_df1.dtypes

prov_th_1         float64
prov_en_1         float64
prov_th_2         float64
prov_en_2         float64
sm_prov_th_fr     float64
sm_prov_th_fpr    float64
sm_prov_en_fr     float64
sm_prov_en_fpr    float64
mean              float64
maxval            float64
bestmatch         object 
dtype: object

### Step 4: Run Nested Loop Algorithm

This step takes a while because we deliberately slow down our requests to the Google Translate API with the `time.sleep()` function.

The two loops work according to the pseudo-code that follows:

**A. Loop 1 - Outer Loop**

1. Iterate through `mtct2016_df1` using `.iterrows()` and store the current record in variable `x`
2. Load `prov_th_1` from the `Province_TH` column of each record of `mtct2016_df1`  
3. Normalize and translate `prov_th_1` to English as `prov_en_1`  
4. Proceed to Loop 2  
  
  
**B. Loop 2 - Inner Loop**

1. Iterate through the `translate_province_df1` dataframe using `.iterrows()` and store the current record of that dataframe in variable `y`
2. Load `prov_th_2` from the `Province_TH` column of each record of `translate_province_df1`
3. Load the English name from the `Province_EN` column, remove the word "Province", and normalize it as `prov_en_2`
4. Compute fuzzy features between `prov_th_1` and `prov_th_2`, and between `prov_en_1` and `prov_en_2` (fuzzy features explained below)
5. If any of the fuzzy features exceeds the assigned threshold (the `threshold` variable below equals 60), let's `print()` that row so we can examine the values of the different `fuzzywuzzy` metrics.

The process entails fuzzy matching the two `Province_TH` columns from the `mtct2016_df1` and `translate_province_df1` dataframes. We then translate the `Province_TH` column from `mtct2016_df1` to English using Google Translate and fuzzy match this with the English province name in the `translate_province_df1` DataFrame. The highest fuzzy match scores (using ratio and partial ratio of the `fuzzywuzzy` package) are the likely correct matches. 

We `print()` key variables and metrics as output so we can see how the algorithm processes the data. This output will be a lengthy one. You can review the output for errors.
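
The core of the double loop, keeping a running maximum while comparing one name against every lookup entry, can be sketched without any API calls. This is a simplified illustration under stated assumptions: the name lists are hypothetical and already normalized, and `score` uses the standard library's `difflib` in place of the four `fuzzywuzzy` metrics.

```python
from difflib import SequenceMatcher


def score(a: str, b: str) -> float:
    """Stand-in similarity score on a 0-100 scale (not a fuzzywuzzy metric)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


# Hypothetical, already-normalized English names (no translation step here).
source_names = ["krabi", "chanthaburi", "rayong"]
lookup_names = ["krabi", "saraburi", "chanthaburi", "kanchanaburi", "ranong", "rayong"]

best_matches = {}
for name in source_names:                  # outer loop
    best_score, best_candidate = 0.0, None
    for candidate in lookup_names:         # inner loop
        s = score(name, candidate)
        if s > best_score:                 # keep the running maximum
            best_score, best_candidate = s, candidate
    best_matches[name] = (best_candidate, best_score)

for name, (candidate, s) in best_matches.items():
    print(name, "->", candidate, round(s, 1))
```

Near misses such as `ranong` for `rayong` score well but stay below the exact match, which mirrors what the printed output of the real loop shows.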
  
**IMPORTANT:** 
1. If you get an error `StreamResetError: Stream forcefully closed`, adjust the `api_sleep_interval` and run the double loop cell again.
2. If you get a JSON decode error, it is likely that the API stopped working (Google possibly blocked your access to the API). You can try adjusting the `api_sleep_interval` and run the loop again.
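
Instead of rerunning the whole cell by hand, one option is to retry a failed call with a growing delay. The sketch below is an assumption-laden illustration, not part of this notebook's workflow: `call_with_retry` and `flaky_translate` are hypothetical names, the failure is simulated, and the helper is synchronous, so an awaited translator call would need an `async` variant that uses `asyncio.sleep`.

```python
import time


def call_with_retry(func, *args, retries=3, base_delay=1.0, **kwargs):
    """Call func, sleeping longer after each failure; re-raise after the last attempt."""
    for attempt in range(retries):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == retries - 1:
                raise
            # delays grow as base_delay, 2*base_delay, 4*base_delay, ...
            time.sleep(base_delay * 2 ** attempt)


# Stand-in for a flaky remote call: fails twice, then succeeds.
attempts = {"count": 0}


def flaky_translate(text):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated API failure")
    return text.upper()


result = call_with_retry(flaky_translate, "krabi", base_delay=0.01)
print(result, attempts["count"])
```

A longer `base_delay` plays the same role as a larger `api_sleep_interval`: it lowers the request rate after the remote service starts refusing calls.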

In [69]:
import fuzzywuzzy

fuzzywuzzy.__version__

'0.17.0'

In [None]:
import time
from fuzzywuzzy import process, fuzz
import numpy as np

# Let's time the execution of the double loop algorithm
# t0 is start time
t0 = time.time()

# FIRST LOOP
# initialize variables for first loop
prov_th_1 = ''
index = 0
# this is the API sleep interval
api_sleep_interval = 0.8
 
for x in mtct2016_df1.iterrows():
    if prov_th_1 != x[1]['Province_TH']:
        print('===========================================================================')

    # thai province name from mtct data
    prov_th_1 = x[1]['Province_TH'] 

    # sleep a little to slow "hit" rate on google translate API
    time.sleep(api_sleep_interval) 

    # translate prov_th_1 using Google Translate
    trans = await translator.translate(prov_th_1, dest='en')
    n1o = trans.text

    # normalize translated name by converting to lower() case and 
    #   strip() invisible characters
    prov_en_1 = n1o.lower().strip() 
    
    # SECOND LOOP
    # initialize variables for second loop
    maxval = 0
    iso = ''
    
    for y in translate_province_df1.iterrows():
        index = y[0]

        # thai province name from lookup df
        prov_th_2 = y[1]['Province_TH'] 
        # remove "Province" from the english name in lookup df
        n2o = y[1]['Province_EN'].replace("Province","") 
        # normalize english name in lookup df
        prov_en_2 = n2o.lower().strip() 

        # compute fuzzy features using fuzzywuzzy
        sm_prov_th_fr = fuzz.ratio(prov_th_1, prov_th_2)
        sm_prov_th_fpr = fuzz.partial_ratio(prov_th_1, prov_th_2)
        sm_prov_en_fr = fuzz.ratio(prov_en_1, prov_en_2)
        sm_prov_en_fpr = fuzz.partial_ratio(prov_en_1, prov_en_2)

        # if any value exceeds the threshold value print the row
        threshold = 60
        if sm_prov_th_fr > threshold or sm_prov_th_fpr > threshold \
            or sm_prov_en_fr > threshold or sm_prov_en_fpr > threshold:
            # some province names are not translated correctly by Google Translate
            exceptions = ['market', 'at all', 'spread', 'dry']
            exception_values = {'market':'Trat','at all':'Amnat Charoen',\
                                'spread':'Phrae','dry':'Tak'}
            if prov_en_1 in exceptions:
                print('exception')
                maxval = mean = sm_prov_th_fpr # just use the Thai partial ratio for this
                #n1 = exception_values[n1]
            else:
                mean = \
                  np.mean([sm_prov_th_fr, sm_prov_th_fpr, sm_prov_en_fr, sm_prov_en_fpr])

            # get the maximum value for mean while in this loop
            if maxval < mean:
                maxval = mean
                #iso = y[1]['ISO_CODE']
        
            #print index, n1, p2, n2, spr, spp, snr, snp, mean, maxval, iso
            print("index: ", index, "\n", \
                  "prov_th_1: ", prov_th_1, "\n", \
                  "prov_en_1: ", prov_en_1, "\n", 
                  "n1o: ", n1o, "\n",
                  "prov_th_2: ", prov_th_2, "\n",
                  "prov_en_2: ", prov_en_2, "\n", \
                  "n2o: ", n2o, "\n")
            print("sm_prov_th_fr: ", sm_prov_th_fr, "\n",
                  "sm_prov_th_fpr: ", sm_prov_th_fpr, "\n",
                  "sm_prov_en_fr: ", sm_prov_en_fr, "\n",
                  "sm_prov_en_fpr: ", sm_prov_en_fpr, "\n",
                  "mean: ", mean, "\n",
                  "maxval: ", maxval, "\n")
            row_data = {
                #"iso": [iso],
                "prov_th_1": [prov_th_1],
                "prov_en_1": [prov_en_1],
                "prov_th_2": [prov_th_2],
                "prov_en_2": [prov_en_2],
                "sm_prov_th_fr": [sm_prov_th_fr],
                "sm_prov_th_fpr": [sm_prov_th_fpr],
                "sm_prov_en_fr": [sm_prov_en_fr],
                "sm_prov_en_fpr": [sm_prov_en_fpr],
                "mean": [mean],
                "maxval": [maxval]
            }
            #fuzzy_df1 = fuzzy_df1.append(pd.DataFrame(row_data))
            fuzzy_df1 = pd.concat([fuzzy_df1, pd.DataFrame(row_data)], ignore_index=True)

print("Loop completed.")
# end time            
t1 = time.time()
t_total = t1 - t0

index:  19 
 prov_th_1:   กระบี่ 
 prov_en_1:  krabi 
 n1o:  Krabi 
 prov_th_2:  จังหวัดกระบี่ 
 prov_en_2:  krabi 
 n2o:  Krabi  

sm_prov_th_fr:  60 
 sm_prov_th_fpr:  86 
 sm_prov_en_fr:  100 
 sm_prov_en_fpr:  100 
 mean:  86.5 
 maxval:  86.5 

index:  36 
 prov_th_1:   กระบี่ 
 prov_en_1:  krabi 
 n1o:  Krabi 
 prov_th_2:  จังหวัดหนองคาย 
 prov_en_2:  nong khai 
 n2o:  Nong Khai  

sm_prov_th_fr:  0 
 sm_prov_th_fpr:  0 
 sm_prov_en_fr:  43 
 sm_prov_en_fpr:  67 
 mean:  27.5 
 maxval:  86.5 

index:  61 
 prov_th_1:   กระบี่ 
 prov_en_1:  krabi 
 n1o:  Krabi 
 prov_th_2:  จังหวัดสระบุรี 
 prov_en_2:  saraburi 
 n2o:  Saraburi  

sm_prov_th_fr:  38 
 sm_prov_th_fpr:  57 
 sm_prov_en_fr:  62 
 sm_prov_en_fpr:  60 
 mean:  54.25 
 maxval:  86.5 

index:  2 
 prov_th_1:   กรุงเทพฯ 
 prov_en_1:  bangkok 
 n1o:  Bangkok 
 prov_th_2:  กรุงเทพมหานคร 
 prov_en_2:  bangkok 
 n2o:  Bangkok 

sm_prov_th_fr:  64 
 sm_prov_th_fpr:  78 
 sm_prov_en_fr:  100 
 sm_prov_en_fpr:  100 
 mean:  85.5

index:  8 
 prov_th_1:   จันทบุรี 
 prov_en_1:  chanthaburi 
 n1o:  Chanthaburi 
 prov_th_2:  จังหวัดชัยนาท 
 prov_en_2:  chai nat 
 n2o:  Chai Nat  

sm_prov_th_fr:  36 
 sm_prov_th_fpr:  33 
 sm_prov_en_fr:  53 
 sm_prov_en_fpr:  62 
 mean:  46.0 
 maxval:  46.0 

index:  10 
 prov_th_1:   จันทบุรี 
 prov_en_1:  chanthaburi 
 n1o:  Chanthaburi 
 prov_th_2:  จังหวัดจันทบุรี 
 prov_en_2:  chanthaburi 
 n2o:  Chanthaburi  

sm_prov_th_fr:  67 
 sm_prov_th_fpr:  89 
 sm_prov_en_fr:  100 
 sm_prov_en_fpr:  100 
 mean:  89.0 
 maxval:  89.0 

index:  13 
 prov_th_1:   จันทบุรี 
 prov_en_1:  chanthaburi 
 n1o:  Chanthaburi 
 prov_th_2:  จังหวัดชลบุรี 
 prov_en_2:  chon buri 
 n2o:  Chon Buri  

sm_prov_th_fr:  55 
 sm_prov_th_fpr:  56 
 sm_prov_en_fr:  70 
 sm_prov_en_fpr:  56 
 mean:  59.25 
 maxval:  89.0 

index:  17 
 prov_th_1:   จันทบุรี 
 prov_en_1:  chanthaburi 
 n1o:  Chanthaburi 
 prov_th_2:  จังหวัดกาญจนบุรี 
 prov_en_2:  kanchanaburi 
 n2o:  Kanchanaburi  

sm_prov_th_fr:  56 
 

index:  8 
 prov_th_1:   ชัยภูมิ 
 prov_en_1:  chaiyaphum 
 n1o:  Chaiyaphum 
 prov_th_2:  จังหวัดชัยนาท 
 prov_en_2:  chai nat 
 n2o:  Chai Nat  

sm_prov_th_fr:  29 
 sm_prov_th_fpr:  40 
 sm_prov_en_fr:  56 
 sm_prov_en_fpr:  62 
 mean:  46.75 
 maxval:  46.75 

index:  9 
 prov_th_1:   ชัยภูมิ 
 prov_en_1:  chaiyaphum 
 n1o:  Chaiyaphum 
 prov_th_2:  จังหวัดชัยภูมิ 
 prov_en_2:  chaiyaphum 
 n2o:  Chaiyaphum  

sm_prov_th_fr:  64 
 sm_prov_th_fpr:  88 
 sm_prov_en_fr:  100 
 sm_prov_en_fpr:  100 
 mean:  88.0 
 maxval:  88.0 

index:  42 
 prov_th_1:   ชัยภูมิ 
 prov_en_1:  chaiyaphum 
 n1o:  Chaiyaphum 
 prov_th_2:  จังหวัดพะเยา 
 prov_en_2:  phayao 
 n2o:  Phayao  

sm_prov_th_fr:  20 
 sm_prov_th_fpr:  25 
 sm_prov_en_fr:  50 
 sm_prov_en_fpr:  67 
 mean:  40.5 
 maxval:  88.0 

index:  7 
 prov_th_1:   ชุมพร 
 prov_en_1:  chumphon 
 n1o:  Chumphon 
 prov_th_2:  จังหวัดฉะเชิงเทรา 
 prov_en_2:  chachoengsao 
 n2o:  Chachoengsao  

sm_prov_th_fr:  17 
 sm_prov_th_fpr:  33 
 sm_pro

index:  18 
 prov_th_1:   นครนายก 
 prov_en_1:  nakhon nayok 
 n1o:  Nakhon Nayok 
 prov_th_2:  จังหวัดขอนแก่น 
 prov_en_2:  khon kaen 
 n2o:  Khon Kaen  

sm_prov_th_fr:  18 
 sm_prov_th_fpr:  25 
 sm_prov_en_fr:  57 
 sm_prov_en_fpr:  67 
 mean:  41.75 
 maxval:  41.75 

index:  27 
 prov_th_1:   นครนายก 
 prov_en_1:  nakhon nayok 
 n1o:  Nakhon Nayok 
 prov_th_2:  จังหวัดนครนายก 
 prov_en_2:  nakhon nayok 
 n2o:  Nakhon Nayok  

sm_prov_th_fr:  64 
 sm_prov_th_fpr:  88 
 sm_prov_en_fr:  100 
 sm_prov_en_fpr:  100 
 mean:  88.0 
 maxval:  88.0 

index:  28 
 prov_th_1:   นครนายก 
 prov_en_1:  nakhon nayok 
 n1o:  Nakhon Nayok 
 prov_th_2:  จังหวัดนครปฐม 
 prov_en_2:  nakhon pathom 
 n2o:  Nakhon Pathom  

sm_prov_th_fr:  29 
 sm_prov_th_fpr:  40 
 sm_prov_en_fr:  72 
 sm_prov_en_fpr:  75 
 mean:  54.0 
 maxval:  88.0 

index:  29 
 prov_th_1:   นครนายก 
 prov_en_1:  nakhon nayok 
 n1o:  Nakhon Nayok 
 prov_th_2:  จังหวัดนครพนม 
 prov_en_2:  nakhon phanom 
 n2o:  Nakhon Phanom  

sm_p

index:  8 
 prov_th_1:   นครราชสีมา 
 prov_en_1:  nakhon ratchasima 
 n1o:  Nakhon Ratchasima 
 prov_th_2:  จังหวัดชัยนาท 
 prov_en_2:  chai nat 
 n2o:  Chai Nat  

sm_prov_th_fr:  17 
 sm_prov_th_fpr:  18 
 sm_prov_en_fr:  40 
 sm_prov_en_fpr:  62 
 mean:  34.25 
 maxval:  34.25 

index:  18 
 prov_th_1:   นครราชสีมา 
 prov_en_1:  nakhon ratchasima 
 n1o:  Nakhon Ratchasima 
 prov_th_2:  จังหวัดขอนแก่น 
 prov_en_2:  khon kaen 
 n2o:  Khon Kaen  

sm_prov_th_fr:  8 
 sm_prov_th_fpr:  9 
 sm_prov_en_fr:  46 
 sm_prov_en_fpr:  67 
 mean:  32.5 
 maxval:  34.25 

index:  27 
 prov_th_1:   นครราชสีมา 
 prov_en_1:  nakhon ratchasima 
 n1o:  Nakhon Ratchasima 
 prov_th_2:  จังหวัดนครนายก 
 prov_en_2:  nakhon nayok 
 n2o:  Nakhon Nayok  

sm_prov_th_fr:  32 
 sm_prov_th_fpr:  42 
 sm_prov_en_fr:  55 
 sm_prov_en_fpr:  67 
 mean:  49.0 
 maxval:  49.0 

index:  28 
 prov_th_1:   นครราชสีมา 
 prov_en_1:  nakhon ratchasima 
 n1o:  Nakhon Ratchasima 
 prov_th_2:  จังหวัดนครปฐม 
 prov_en_2:  nakho

index:  10 
 prov_th_1:   นนทบุรี 
 prov_en_1:  nonthaburi 
 n1o:  Nonthaburi 
 prov_th_2:  จังหวัดจันทบุรี 
 prov_en_2:  chanthaburi 
 n2o:  Chanthaburi  

sm_prov_th_fr:  52 
 sm_prov_th_fpr:  75 
 sm_prov_en_fr:  76 
 sm_prov_en_fpr:  80 
 mean:  70.75 
 maxval:  70.75 

index:  13 
 prov_th_1:   นนทบุรี 
 prov_en_1:  nonthaburi 
 n1o:  Nonthaburi 
 prov_th_2:  จังหวัดชลบุรี 
 prov_en_2:  chon buri 
 n2o:  Chon Buri  

sm_prov_th_fr:  38 
 sm_prov_th_fpr:  50 
 sm_prov_en_fr:  63 
 sm_prov_en_fpr:  67 
 mean:  54.5 
 maxval:  70.75 

index:  17 
 prov_th_1:   นนทบุรี 
 prov_en_1:  nonthaburi 
 n1o:  Nonthaburi 
 prov_th_2:  จังหวัดกาญจนบุรี 
 prov_en_2:  kanchanaburi 
 n2o:  Kanchanaburi  

sm_prov_th_fr:  42 
 sm_prov_th_fpr:  62 
 sm_prov_en_fr:  64 
 sm_prov_en_fpr:  70 
 mean:  59.5 
 maxval:  70.75 

index:  33 
 prov_th_1:   นนทบุรี 
 prov_en_1:  nonthaburi 
 n1o:  Nonthaburi 
 prov_th_2:  จังหวัดน่าน 
 prov_en_2:  nan 
 n2o:  Nan  

sm_prov_th_fr:  21 
 sm_prov_th_fpr:  25 
 

index:  5 
 prov_th_1:   บึงกาฬ 
 prov_en_1:  bueng kan 
 n1o:  Bueng Kan 
 prov_th_2:  จังหวัดบึงกาฬ 
 prov_en_2:  bueng kan 
 n2o:  Bueng Kan  

sm_prov_th_fr:  60 
 sm_prov_th_fpr:  86 
 sm_prov_en_fr:  100 
 sm_prov_en_fpr:  100 
 mean:  86.5 
 maxval:  86.5 

index:  33 
 prov_th_1:   บึงกาฬ 
 prov_en_1:  bueng kan 
 n1o:  Bueng Kan 
 prov_th_2:  จังหวัดน่าน 
 prov_en_2:  nan 
 n2o:  Nan  

sm_prov_th_fr:  22 
 sm_prov_th_fpr:  14 
 sm_prov_en_fr:  50 
 sm_prov_en_fpr:  67 
 mean:  38.25 
 maxval:  86.5 

index:  6 
 prov_th_1:   บุรีรัมย์ 
 prov_en_1:  buri ram 
 n1o:  Buri Ram 
 prov_th_2:  จังหวัดบุรีรัมย์ 
 prov_en_2:  buri ram 
 n2o:  Buri Ram  

sm_prov_th_fr:  69 
 sm_prov_th_fpr:  90 
 sm_prov_en_fr:  100 
 sm_prov_en_fpr:  100 
 mean:  89.75 
 maxval:  89.75 

index:  35 
 prov_th_1:   บุรีรัมย์ 
 prov_en_1:  buri ram 
 n1o:  Buri Ram 
 prov_th_2:  จังหวัดหนองบัวลำภู 
 prov_en_2:  nong bua lam phu 
 n2o:  Nong Bua Lam Phu  

sm_prov_th_fr:  14 
 sm_prov_th_fpr:  20 
 sm_p

index:  42 
 prov_th_1:   พระนครศรีอยุธยา 
 prov_en_1:  ayutthaya 
 n1o:  Ayutthaya 
 prov_th_2:  จังหวัดพะเยา 
 prov_en_2:  phayao 
 n2o:  Phayao  

sm_prov_th_fr:  29 
 sm_prov_th_fpr:  17 
 sm_prov_en_fr:  53 
 sm_prov_en_fpr:  73 
 mean:  43.0 
 maxval:  43.0 

index:  47 
 prov_th_1:   พระนครศรีอยุธยา 
 prov_en_1:  ayutthaya 
 n1o:  Ayutthaya 
 prov_th_2:  จังหวัดพระนครศรีอยุธยา 
 prov_en_2:  phra nakhon si ayutthaya 
 n2o:  Phra Nakhon Si Ayutthaya  

sm_prov_th_fr:  79 
 sm_prov_th_fpr:  94 
 sm_prov_en_fr:  55 
 sm_prov_en_fpr:  100 
 mean:  82.0 
 maxval:  82.0 

index:  9 
 prov_th_1:   พะเยา 
 prov_en_1:  phayao 
 n1o:  Phayao 
 prov_th_2:  จังหวัดชัยภูมิ 
 prov_en_2:  chaiyaphum 
 n2o:  Chaiyaphum  

sm_prov_th_fr:  10 
 sm_prov_th_fpr:  17 
 sm_prov_en_fr:  50 
 sm_prov_en_fpr:  67 
 mean:  36.0 
 maxval:  36.0 

index:  29 
 prov_th_1:   พะเยา 
 prov_en_1:  phayao 
 n1o:  Phayao 
 prov_th_2:  จังหวัดนครพนม 
 prov_en_2:  nakhon phanom 
 n2o:  Nakhon Phanom  

sm_prov_th_fr

index:  41 
 prov_th_1:   เพชรบูรณ์ 
 prov_en_1:  phetchabun 
 n1o:  Phetchabun 
 prov_th_2:  จังหวัดพัทลุง 
 prov_en_2:  phatthalung 
 n2o:  Phatthalung  

sm_prov_th_fr:  9 
 sm_prov_th_fpr:  10 
 sm_prov_en_fr:  67 
 sm_prov_en_fpr:  70 
 mean:  39.0 
 maxval:  39.0 

index:  43 
 prov_th_1:   เพชรบูรณ์ 
 prov_en_1:  phetchabun 
 n1o:  Phetchabun 
 prov_th_2:  จังหวัดเพชรบูรณ์ 
 prov_en_2:  phetchabun 
 n2o:  Phetchabun  

sm_prov_th_fr:  69 
 sm_prov_th_fpr:  90 
 sm_prov_en_fr:  100 
 sm_prov_en_fpr:  100 
 mean:  89.75 
 maxval:  89.75 

index:  44 
 prov_th_1:   เพชรบูรณ์ 
 prov_en_1:  phetchabun 
 n1o:  Phetchabun 
 prov_th_2:  จังหวัดเพชรบุรี 
 prov_en_2:  phetchaburi 
 n2o:  Phetchaburi  

sm_prov_th_fr:  48 
 sm_prov_th_fpr:  63 
 sm_prov_en_fr:  86 
 sm_prov_en_fpr:  90 
 mean:  71.75 
 maxval:  89.75 

index:  49 
 prov_th_1:   เพชรบูรณ์ 
 prov_en_1:  phetchabun 
 n1o:  Phetchabun 
 prov_th_2:  จังหวัดภูเก็ต 
 prov_en_2:  phuket 
 n2o:  Phuket  

sm_prov_th_fr:  9 
 sm_pro

index:  52 
 prov_th_1:   ระยอง 
 prov_en_1:  rayong 
 n1o:  Rayong 
 prov_th_2:  จังหวัดระนอง 
 prov_en_2:  ranong 
 n2o:  Ranong  

sm_prov_th_fr:  44 
 sm_prov_th_fpr:  67 
 sm_prov_en_fr:  83 
 sm_prov_en_fpr:  83 
 mean:  69.25 
 maxval:  69.25 

index:  54 
 prov_th_1:   ระยอง 
 prov_en_1:  rayong 
 n1o:  Rayong 
 prov_th_2:  จังหวัดระยอง 
 prov_en_2:  rayong 
 n2o:  Rayong  

sm_prov_th_fr:  56 
 sm_prov_th_fpr:  83 
 sm_prov_en_fr:  100 
 sm_prov_en_fpr:  100 
 mean:  84.75 
 maxval:  84.75 

index:  71 
 prov_th_1:   ระยอง 
 prov_en_1:  rayong 
 n1o:  Rayong 
 prov_th_2:  จังหวัดตรัง 
 prov_en_2:  trang 
 n2o:  Trang  

sm_prov_th_fr:  24 
 sm_prov_th_fpr:  33 
 sm_prov_en_fr:  73 
 sm_prov_en_fpr:  60 
 mean:  47.5 
 maxval:  84.75 

index:  10 
 prov_th_1:   ราชบุรี 
 prov_en_1:  ratchaburi 
 n1o:  Ratchaburi 
 prov_th_2:  จังหวัดจันทบุรี 
 prov_en_2:  chanthaburi 
 n2o:  Chanthaburi  

sm_prov_th_fr:  35 
 sm_prov_th_fpr:  50 
 sm_prov_en_fr:  76 
 sm_prov_en_fpr:  80 
 mea

exception
index:  3 
 prov_th_1:   เลย 
 prov_en_1:  at all 
 n1o:  at all 
 prov_th_2:  จังหวัดอำนาจเจริญ 
 prov_en_2:  amnat charoen 
 n2o:  Amnat Charoen  

sm_prov_th_fr:  10 
 sm_prov_th_fpr:  25 
 sm_prov_en_fr:  42 
 sm_prov_en_fpr:  67 
 mean:  25 
 maxval:  25 

exception
index:  22 
 prov_th_1:   เลย 
 prov_en_1:  at all 
 n1o:  at all 
 prov_th_2:  จังหวัดเลย 
 prov_en_2:  loei 
 n2o:  Loei  

sm_prov_th_fr:  43 
 sm_prov_th_fpr:  75 
 sm_prov_en_fr:  20 
 sm_prov_en_fpr:  25 
 mean:  75 
 maxval:  75 

exception
index:  41 
 prov_th_1:   เลย 
 prov_en_1:  at all 
 n1o:  at all 
 prov_th_2:  จังหวัดพัทลุง 
 prov_en_2:  phatthalung 
 n2o:  Phatthalung  

sm_prov_th_fr:  12 
 sm_prov_th_fpr:  25 
 sm_prov_en_fr:  47 
 sm_prov_en_fpr:  67 
 mean:  25 
 maxval:  25 

exception
index:  68 
 prov_th_1:   เลย 
 prov_en_1:  at all 
 n1o:  at all 
 prov_th_2:  จังหวัดสุราษฎร์ธานี 
 prov_en_2:  surat thani 
 n2o:  Surat Thani  

sm_prov_th_fr:  0 
 sm_prov_th_fpr:  0 
 sm_prov_en_fr: 

index:  56 
 prov_th_1:   สระแก้ว 
 prov_en_1:  sa kaeo 
 n1o:  Sa Kaeo 
 prov_th_2:  จังหวัดสระแก้ว 
 prov_en_2:  sa kaeo 
 n2o:  Sa Kaeo  

sm_prov_th_fr:  64 
 sm_prov_th_fpr:  88 
 sm_prov_en_fr:  100 
 sm_prov_en_fpr:  100 
 mean:  88.0 
 maxval:  88.0 

index:  63 
 prov_th_1:   สระแก้ว 
 prov_en_1:  sa kaeo 
 n1o:  Sa Kaeo 
 prov_th_2:  จังหวัดศรีสะเกษ 
 prov_en_2:  si sa ket 
 n2o:  Si Sa Ket  

sm_prov_th_fr:  26 
 sm_prov_th_fpr:  40 
 sm_prov_en_fr:  62 
 sm_prov_en_fpr:  71 
 mean:  49.75 
 maxval:  88.0 

index:  70 
 prov_th_1:   สระแก้ว 
 prov_en_1:  sa kaeo 
 n1o:  Sa Kaeo 
 prov_th_2:  จังหวัดตาก 
 prov_en_2:  tak 
 n2o:  Tak  

sm_prov_th_fr:  11 
 sm_prov_th_fpr:  12 
 sm_prov_en_fr:  40 
 sm_prov_en_fpr:  67 
 mean:  32.5 
 maxval:  88.0 

index:  10 
 prov_th_1:   สระบุรี 
 prov_en_1:  saraburi 
 n1o:  Saraburi 
 prov_th_2:  จังหวัดจันทบุรี 
 prov_en_2:  chanthaburi 
 n2o:  Chanthaburi  

sm_prov_th_fr:  35 
 sm_prov_th_fpr:  50 
 sm_prov_en_fr:  63 
 sm_prov_en_fp

index:  33 
 prov_th_1:   สุราษฎร์ธานี 
 prov_en_1:  surat thani 
 n1o:  Surat Thani 
 prov_th_2:  จังหวัดน่าน 
 prov_en_2:  nan 
 n2o:  Nan  

sm_prov_th_fr:  17 
 sm_prov_th_fpr:  18 
 sm_prov_en_fr:  29 
 sm_prov_en_fpr:  67 
 mean:  32.75 
 maxval:  32.75 

index:  38 
 prov_th_1:   สุราษฎร์ธานี 
 prov_en_1:  surat thani 
 n1o:  Surat Thani 
 prov_th_2:  จังหวัดปทุมธานี 
 prov_en_2:  pathum thani 
 n2o:  Pathum Thani  

sm_prov_th_fr:  36 
 sm_prov_th_fpr:  38 
 sm_prov_en_fr:  70 
 sm_prov_en_fpr:  73 
 mean:  54.25 
 maxval:  54.25 

index:  39 
 prov_th_1:   สุราษฎร์ธานี 
 prov_en_1:  surat thani 
 n1o:  Surat Thani 
 prov_th_2:  จังหวัดปัตตานี 
 prov_en_2:  pattani 
 n2o:  Pattani  

sm_prov_th_fr:  22 
 sm_prov_th_fpr:  23 
 sm_prov_en_fr:  67 
 sm_prov_en_fpr:  71 
 mean:  45.75 
 maxval:  54.25 

index:  53 
 prov_th_1:   สุราษฎร์ธานี 
 prov_en_1:  surat thani 
 n1o:  Surat Thani 
 prov_th_2:  จังหวัดราชบุรี 
 prov_en_2:  ratchaburi 
 n2o:  Ratchaburi  

sm_prov_th_fr:  30 


index:  4 
 prov_th_1:   อ่างทอง 
 prov_en_1:  ang thong 
 n1o:  Ang Thong 
 prov_th_2:  จังหวัดอ่างทอง 
 prov_en_2:  ang thong 
 n2o:  Ang Thong  

sm_prov_th_fr:  64 
 sm_prov_th_fpr:  88 
 sm_prov_en_fr:  100 
 sm_prov_en_fpr:  100 
 mean:  88.0 
 maxval:  88.0 

index:  24 
 prov_th_1:   อ่างทอง 
 prov_en_1:  ang thong 
 n1o:  Ang Thong 
 prov_th_2:  จังหวัดแม่ฮ่องสอน 
 prov_en_2:  mae hong son 
 n2o:  Mae Hong Son  

sm_prov_th_fr:  24 
 sm_prov_th_fpr:  38 
 sm_prov_en_fr:  57 
 sm_prov_en_fpr:  67 
 mean:  46.5 
 maxval:  88.0 

index:  52 
 prov_th_1:   อ่างทอง 
 prov_en_1:  ang thong 
 n1o:  Ang Thong 
 prov_th_2:  จังหวัดระนอง 
 prov_en_2:  ranong 
 n2o:  Ranong  

sm_prov_th_fr:  30 
 sm_prov_th_fpr:  25 
 sm_prov_en_fr:  67 
 sm_prov_en_fpr:  50 
 mean:  43.0 
 maxval:  88.0 

index:  78 
 prov_th_1:   อ่างทอง 
 prov_en_1:  ang thong 
 n1o:  Ang Thong 
 prov_th_2:  จังหวัดยโสธร 
 prov_en_2:  yasothon 
 n2o:  Yasothon  

sm_prov_th_fr:  10 
 sm_prov_th_fpr:  0 
 sm_prov_en_f

In [36]:
print("Total elapsed time, seconds: ", t_total)

Total elapsed time, seconds:  79.97034573554993


In [37]:
fuzzy_df1.reset_index().drop(['index'],axis=1)

Unnamed: 0,bestmatch,maxval,mean,prov_en_1,prov_en_2,prov_th_1,prov_th_2,sm_prov_en_fpr,sm_prov_en_fr,sm_prov_th_fpr,sm_prov_th_fr
0,,86.5,86.5,krabi,krabi,กระบี่,จังหวัดกระบี่,100.0,100.0,86.0,60.0
1,,86.5,27.5,krabi,nong khai,กระบี่,จังหวัดหนองคาย,67.0,43.0,0.0,0.0
2,,86.5,54.25,krabi,saraburi,กระบี่,จังหวัดสระบุรี,60.0,62.0,57.0,38.0
3,,85.5,85.5,bangkok,bangkok,กรุงเทพฯ,กรุงเทพมหานคร,100.0,100.0,78.0,64.0
4,,85.5,32.0,bangkok,,กรุงเทพฯ,จังหวัดน่าน,67.0,40.0,11.0,10.0
5,,34.75,34.75,kanchanaburi,chai nat,กาญจนบุรี,จังหวัดชัยนาท,62.0,50.0,10.0,17.0
6,,67.0,67.0,kanchanaburi,chanthaburi,กาญจนบุรี,จังหวัดจันทบุรี,82.0,78.0,60.0,48.0
7,,67.0,38.25,kanchanaburi,chiang mai,กาญจนบุรี,จังหวัดเชียงใหม่,63.0,55.0,20.0,15.0
8,,67.0,36.0,kanchanaburi,chiang rai,กาญจนบุรี,จังหวัดเชียงราย,63.0,55.0,10.0,16.0
9,,67.0,57.0,kanchanaburi,chon buri,กาญจนบุรี,จังหวัดชลบุรี,78.0,67.0,40.0,43.0


In [38]:
import numpy as np

#new_df['bestmatch'] = np.where((new_df['maxval']==new_df['mean']), 
#                                           'yes', 'no')
fuzzy_df1['bestmatch'] = np.where((fuzzy_df1['prov_en_1']==fuzzy_df1['prov_en_2']), 
                                           'yes', 'no')

In [39]:
fuzzy_df1

Unnamed: 0,bestmatch,maxval,mean,prov_en_1,prov_en_2,prov_th_1,prov_th_2,sm_prov_en_fpr,sm_prov_en_fr,sm_prov_th_fpr,sm_prov_th_fr
0,yes,86.5,86.5,krabi,krabi,กระบี่,จังหวัดกระบี่,100.0,100.0,86.0,60.0
0,no,86.5,27.5,krabi,nong khai,กระบี่,จังหวัดหนองคาย,67.0,43.0,0.0,0.0
0,no,86.5,54.25,krabi,saraburi,กระบี่,จังหวัดสระบุรี,60.0,62.0,57.0,38.0
0,yes,85.5,85.5,bangkok,bangkok,กรุงเทพฯ,กรุงเทพมหานคร,100.0,100.0,78.0,64.0
0,no,85.5,32.0,bangkok,,กรุงเทพฯ,จังหวัดน่าน,67.0,40.0,11.0,10.0
0,no,34.75,34.75,kanchanaburi,chai nat,กาญจนบุรี,จังหวัดชัยนาท,62.0,50.0,10.0,17.0
0,no,67.0,67.0,kanchanaburi,chanthaburi,กาญจนบุรี,จังหวัดจันทบุรี,82.0,78.0,60.0,48.0
0,no,67.0,38.25,kanchanaburi,chiang mai,กาญจนบุรี,จังหวัดเชียงใหม่,63.0,55.0,20.0,15.0
0,no,67.0,36.0,kanchanaburi,chiang rai,กาญจนบุรี,จังหวัดเชียงราย,63.0,55.0,10.0,16.0
0,no,67.0,57.0,kanchanaburi,chon buri,กาญจนบุรี,จังหวัดชลบุรี,78.0,67.0,40.0,43.0


In [None]:
#fuzzy_df1.loc[fuzzy_df1['maxval'].equals(fuzzy_df1['mean'])]
fuzzy_df1.loc[fuzzy_df1['maxval'] == fuzzy_df1['mean']]

Unnamed: 0,bestmatch,maxval,mean,prov_en_1,prov_en_2,prov_th_1,prov_th_2,sm_prov_en_fpr,sm_prov_en_fr,sm_prov_th_fpr,sm_prov_th_fr
0,yes,86.5,86.5,krabi,krabi,กระบี่,จังหวัดกระบี่,100.0,100.0,86.0,60.0
0,no,86.5,27.5,krabi,nong khai,กระบี่,จังหวัดหนองคาย,67.0,43.0,0.0,0.0
0,no,86.5,54.25,krabi,saraburi,กระบี่,จังหวัดสระบุรี,60.0,62.0,57.0,38.0
0,yes,85.5,85.5,bangkok,bangkok,กรุงเทพฯ,กรุงเทพมหานคร,100.0,100.0,78.0,64.0
0,no,85.5,32.0,bangkok,,กรุงเทพฯ,จังหวัดน่าน,67.0,40.0,11.0,10.0
0,no,34.75,34.75,kanchanaburi,chai nat,กาญจนบุรี,จังหวัดชัยนาท,62.0,50.0,10.0,17.0
0,no,67.0,67.0,kanchanaburi,chanthaburi,กาญจนบุรี,จังหวัดจันทบุรี,82.0,78.0,60.0,48.0
0,no,67.0,38.25,kanchanaburi,chiang mai,กาญจนบุรี,จังหวัดเชียงใหม่,63.0,55.0,20.0,15.0
0,no,67.0,36.0,kanchanaburi,chiang rai,กาญจนบุรี,จังหวัดเชียงราย,63.0,55.0,10.0,16.0
0,no,67.0,57.0,kanchanaburi,chon buri,กาญจนบุรี,จังหวัดชลบุรี,78.0,67.0,40.0,43.0


In [41]:
fuzzy_df1

Unnamed: 0,bestmatch,maxval,mean,prov_en_1,prov_en_2,prov_th_1,prov_th_2,sm_prov_en_fpr,sm_prov_en_fr,sm_prov_th_fpr,sm_prov_th_fr
0,yes,86.5,86.5,krabi,krabi,กระบี่,จังหวัดกระบี่,100.0,100.0,86.0,60.0
0,no,86.5,27.5,krabi,nong khai,กระบี่,จังหวัดหนองคาย,67.0,43.0,0.0,0.0
0,no,86.5,54.25,krabi,saraburi,กระบี่,จังหวัดสระบุรี,60.0,62.0,57.0,38.0
0,yes,85.5,85.5,bangkok,bangkok,กรุงเทพฯ,กรุงเทพมหานคร,100.0,100.0,78.0,64.0
0,no,85.5,32.0,bangkok,,กรุงเทพฯ,จังหวัดน่าน,67.0,40.0,11.0,10.0
0,no,34.75,34.75,kanchanaburi,chai nat,กาญจนบุรี,จังหวัดชัยนาท,62.0,50.0,10.0,17.0
0,no,67.0,67.0,kanchanaburi,chanthaburi,กาญจนบุรี,จังหวัดจันทบุรี,82.0,78.0,60.0,48.0
0,no,67.0,38.25,kanchanaburi,chiang mai,กาญจนบุรี,จังหวัดเชียงใหม่,63.0,55.0,20.0,15.0
0,no,67.0,36.0,kanchanaburi,chiang rai,กาญจนบุรี,จังหวัดเชียงราย,63.0,55.0,10.0,16.0
0,no,67.0,57.0,kanchanaburi,chon buri,กาญจนบุรี,จังหวัดชลบุรี,78.0,67.0,40.0,43.0


### Step 5: Create sorted `fuzzy_df2`

Let's sort this new dataframe on `prov_en_1` (the Google Translate column), then on `maxval` and `mean`, all in descending order. Note in the listing above that where `maxval` and `mean` are equal, that row is likely the correct match. We want those matches to "bubble up" among similar rows so we can eliminate the low-scoring rows later.
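
As a small illustration of the multi-column sort, here is a sketch on a handful of made-up rows shaped like `fuzzy_df1`. The name `toy_df` and its values are assumptions for demonstration only, not the notebook's data.

```python
import pandas as pd

# Made-up rows that mimic the shape of fuzzy_df1 (illustrative values only)
toy_df = pd.DataFrame({
    'prov_en_1': ['krabi', 'krabi', 'yala', 'yala'],
    'prov_en_2': ['nong khai', 'krabi', 'kalasin', 'yala'],
    'maxval':    [86.5, 86.5, 82.5, 82.5],
    'mean':      [27.5, 86.5, 40.0, 82.5],
})

# Sort by prov_en_1, then maxval, then mean, all in descending order
toy_sorted = toy_df.sort_values(['prov_en_1', 'maxval', 'mean'],
                                ascending=[False, False, False])

# Within each prov_en_1 group, the row with the highest mean now comes first
print(toy_sorted[['prov_en_1', 'prov_en_2', 'mean']])
```

Because `mean` is the last sort key, the most plausible match sits at the top of each `prov_en_1` group.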

In [42]:
fuzzy_df2 = fuzzy_df1.sort_values(['prov_en_1','maxval','mean'], ascending=[False,False,False])

In [43]:
fuzzy_df2.reset_index()

Unnamed: 0,index,bestmatch,maxval,mean,prov_en_1,prov_en_2,prov_th_1,prov_th_2,sm_prov_en_fpr,sm_prov_en_fr,sm_prov_th_fpr,sm_prov_th_fr
0,0,yes,84.75,84.75,yasothon,yasothon,ยโสธร,จังหวัดยโสธร,100.0,100.0,83.0,56.0
1,0,no,30.25,30.25,yasothon,ang thong,ยโสธร,จังหวัดอ่างทอง,62.0,59.0,0.0,0.0
2,0,yes,82.5,82.5,yala,yala,ยะลา,จังหวัดยะลา,100.0,100.0,80.0,50.0
3,0,no,40.0,40.0,yala,kalasin,ยะลา,จังหวัดกาฬสินธุ์,75.0,55.0,20.0,10.0
4,0,no,40.0,33.5,yala,nong bua lam phu,ยะลา,จังหวัดหนองบัวลำภู,75.0,30.0,20.0,9.0
5,0,yes,89.75,89.75,uttaradit,uttaradit,อุตรดิตถ์,จังหวัดอุตรดิตถ์,100.0,100.0,90.0,69.0
6,0,no,42.75,42.75,uttaradit,trat,อุตรดิตถ์,จังหวัดตราด,50.0,62.0,30.0,29.0
7,0,no,35.0,35.0,uttaradit,tak,อุตรดิตถ์,จังหวัดตาก,67.0,33.0,20.0,20.0
8,0,yes,89.75,89.75,uthai thani,uthai thani,อุทัยธานี,จังหวัดอุทัยธานี,100.0,100.0,90.0,69.0
9,0,no,59.5,59.5,uthai thani,udon thani,อุทัยธานี,จังหวัดอุดรธานี,60.0,67.0,63.0,48.0


Note: We reported the wrong translations to Google Translate, so check whether they have since been corrected.

### Step 6: Create new dataframe, `fuzzy_df3`, with unique rows using `maxval` as a guide to the correct translation

In [44]:
idx = \
    fuzzy_df2.groupby(['prov_en_2'], sort=False)\
    ['mean'].transform(max) == fuzzy_df2['mean']
idx

0    True 
0    False
0    True 
0    False
0    False
0    True 
0    False
0    False
0    True 
0    False
0    False
0    False
0    False
0    False
0    False
0    False
0    True 
0    False
0    False
0    False
0    False
0    False
0    False
0    False
0    False
0    True 
0    False
0    False
0    False
0    False
0    False
0    False
0    False
0    False
0    False
0    False
0    True 
0    False
0    False
0    False
0    False
0    False
0    True 
0    False
0    False
0    False
0    False
0    False
0    False
0    False
0    False
0    False
0    True 
0    False
0    False
0    False
0    False
0    False
0    False
0    False
0    False
0    True 
0    False
0    False
0    False
0    False
0    False
0    False
0    False
0    False
0    False
0    True 
0    False
0    False
0    True 
0    True 
0    False
0    False
0    True 
0    False
0    False
0    False
0    False
0    False
0    False
0    True 
0    False
0    False
0    True 
0    True 
0    False

In [45]:
# .copy() makes fuzzy_df3 an independent dataframe, so adding columns later is safe
fuzzy_df3 = fuzzy_df2[idx][['prov_en_1','prov_en_2','prov_th_1','prov_th_2','maxval']].copy()
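
The boolean mask `idx` relies on a common pandas idiom: broadcast each group's maximum back onto every row with `transform`, then keep the rows that equal it. Below is a minimal sketch on made-up rows; `toy_df`, `group_max`, `keep` and `best` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Made-up candidate pairs: each prov_en_2 appears with a good and a poor match
toy_df = pd.DataFrame({
    'prov_en_1': ['krabi', 'krabi', 'saraburi', 'saraburi'],
    'prov_en_2': ['krabi', 'saraburi', 'krabi', 'saraburi'],
    'mean':      [86.5, 54.25, 54.25, 88.0],
})

# For every row, look up the highest mean within its prov_en_2 group
group_max = toy_df.groupby('prov_en_2', sort=False)['mean'].transform('max')

# True only where the row itself holds that group maximum
keep = group_max == toy_df['mean']

best = toy_df[keep]
print(best)
```

Note that ties within a group would keep more than one row, so it is worth checking the result for duplicates.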

Let's create a `Province_TH` column for this dataframe as well.

In [46]:
fuzzy_df3['Province_TH'] = fuzzy_df3['prov_th_1']

In [47]:
fuzzy_df3

Unnamed: 0,prov_en_1,prov_en_2,prov_th_1,prov_th_2,maxval,Province_TH
0,yasothon,yasothon,ยโสธร,จังหวัดยโสธร,84.75,ยโสธร
0,yala,yala,ยะลา,จังหวัดยะลา,82.5,ยะลา
0,uttaradit,uttaradit,อุตรดิตถ์,จังหวัดอุตรดิตถ์,89.75,อุตรดิตถ์
0,uthai thani,uthai thani,อุทัยธานี,จังหวัดอุทัยธานี,89.75,อุทัยธานี
0,udon thani,udon thani,อุดรธานี,จังหวัดอุดรธานี,89.0,อุดรธานี
0,ubon ratchathani,ubon ratchathani,อุบลราชธานี,จังหวัดอุบลราชธานี,91.25,อุบลราชธานี
0,trang,trang,ตรัง,จังหวัดตรัง,82.5,ตรัง
0,surin,surin,สุรินทร์,จังหวัดสุรินทร์,89.0,สุรินทร์
0,surat thani,surat thani,สุราษฎร์ธานี,จังหวัดสุราษฎร์ธานี,91.75,สุราษฎร์ธานี
0,suphan buri,suphan buri,สุพรรณบุรี,จังหวัดสุพรรณบุรี,90.5,สุพรรณบุรี


### Step 7: Get `iso2_df`

Let's implement this extra step, which we missed because the Thailand shapefile may contain some wrong translations. We will link `fuzzy_df3` with our old `iso2_df` dataframe from the Maternal and Child Health Case Study. We will load `iso2_df` from the pickle file we saved before.

In [48]:
iso2_df = pd.read_pickle('data/iso2_df.pickle')
iso2_df

Unnamed: 0,fuzzymatch,location_category,location_code,location_name,location_name_match,fuzzymatch1
0,,metropolitan administration,TH-10,Bangkok,bangkok,
1,,special administrative city,TH-S,Phatthaya,phatthaya,
2,"(amnatchareon, 92, 18)",province,TH-37,Amnat Charoen,amnatcharoen,amnatchareon
3,"(angthong, 100, 68)",province,TH-15,Ang Thong,angthong,angthong
4,"(buengkan, 100, 19)",province,TH-38,Bueng Kan,buengkan,buengkan
5,"(buriram, 100, 13)",province,TH-31,Buri Ram,buriram,buriram
6,"(chachoengsao, 100, 52)",province,TH-24,Chachoengsao,chachoengsao,chachoengsao
7,"(chainat, 100, 67)",province,TH-18,Chai Nat,chainat,chainat
8,"(chaiyaphum, 100, 9)",province,TH-36,Chaiyaphum,chaiyaphum,chaiyaphum
9,"(chanthaburi, 100, 55)",province,TH-22,Chanthaburi,chanthaburi,chanthaburi


### Step 8: Prepare for fuzzy match between `fuzzy_df3` and `iso2_df`

Let's use column `prov_en_2` in `fuzzy_df3` as the basis for our English matching column `location_name_match` and prepare (normalize) it accordingly. We normalize it by applying a `lambda` function that makes every character lower case, strips leading and trailing whitespace, and removes all spaces.

In [49]:
fuzzy_df3['location_name_match'] = \
    fuzzy_df3['prov_en_2'].apply(lambda x:x.lower().strip().replace(' ',''))

fuzzy_df3

Unnamed: 0,prov_en_1,prov_en_2,prov_th_1,prov_th_2,maxval,Province_TH,location_name_match
0,yasothon,yasothon,ยโสธร,จังหวัดยโสธร,84.75,ยโสธร,yasothon
0,yala,yala,ยะลา,จังหวัดยะลา,82.5,ยะลา,yala
0,uttaradit,uttaradit,อุตรดิตถ์,จังหวัดอุตรดิตถ์,89.75,อุตรดิตถ์,uttaradit
0,uthai thani,uthai thani,อุทัยธานี,จังหวัดอุทัยธานี,89.75,อุทัยธานี,uthaithani
0,udon thani,udon thani,อุดรธานี,จังหวัดอุดรธานี,89.0,อุดรธานี,udonthani
0,ubon ratchathani,ubon ratchathani,อุบลราชธานี,จังหวัดอุบลราชธานี,91.25,อุบลราชธานี,ubonratchathani
0,trang,trang,ตรัง,จังหวัดตรัง,82.5,ตรัง,trang
0,surin,surin,สุรินทร์,จังหวัดสุรินทร์,89.0,สุรินทร์,surin
0,surat thani,surat thani,สุราษฎร์ธานี,จังหวัดสุราษฎร์ธานี,91.75,สุราษฎร์ธานี,suratthani
0,suphan buri,suphan buri,สุพรรณบุรี,จังหวัดสุพรรณบุรี,90.5,สุพรรณบุรี,suphanburi


You can see that the `lambda` function has converted `prov_en_2` values to lower case with `lower()`, stripped leading and trailing whitespace with `strip()`, and removed spaces with `replace()`. These functions are connected using an approach called **method chaining**.

Method chaining allows you to feed the output of one function call to the next (compare with `bash` scripting "pipe" from Notebook 0). You can learn more about method chaining [here](https://en.wikipedia.org/wiki/Method_chaining). Learn more about Python lambda functions [here](https://www.w3schools.com/python/python_lambda.asp).
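
To make the chaining concrete, here is a small sketch on a made-up Series of province names (`names` is an illustrative variable). It applies the same `lambda` as above and, as an alternative, the equivalent chain of pandas `.str` methods.

```python
import pandas as pd

# Made-up names with mixed case and stray whitespace
names = pd.Series(['Chiang Mai', ' Nong Bua Lam Phu ', 'Yala'])

# Chained string methods inside a lambda, as in the cell above
normalized = names.apply(lambda x: x.lower().strip().replace(' ', ''))

# The same normalization using pandas' vectorized string methods
normalized_str = names.str.lower().str.strip().str.replace(' ', '', regex=False)

print(normalized.tolist())
```

Each call returns a new string (or Series), which is what allows the next call to be attached directly to it.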

Next, let's fuzzy match `location_name_match` from `iso2_df` with `location_name_match` of `fuzzy_df3`. The column `fuzzymatch` will hold the fuzzy matching results: the matched name, its score, and the index label of the matched row.

In [50]:
iso2_df['fuzzymatch'] = \
    iso2_df['location_name_match'].\
    apply(lambda x: process.extractOne(x,fuzzy_df3['location_name_match'],\
                                       scorer=fuzz.ratio,score_cutoff=80))
iso2_df

Unnamed: 0,fuzzymatch,location_category,location_code,location_name,location_name_match,fuzzymatch1
0,"(bangkok, 100, 0)",metropolitan administration,TH-10,Bangkok,bangkok,
1,,special administrative city,TH-S,Phatthaya,phatthaya,
2,"(amnatcharoen, 100, 0)",province,TH-37,Amnat Charoen,amnatcharoen,amnatchareon
3,"(angthong, 100, 0)",province,TH-15,Ang Thong,angthong,angthong
4,"(buengkan, 100, 0)",province,TH-38,Bueng Kan,buengkan,buengkan
5,"(buriram, 100, 0)",province,TH-31,Buri Ram,buriram,buriram
6,"(chachoengsao, 100, 0)",province,TH-24,Chachoengsao,chachoengsao,chachoengsao
7,"(chainat, 100, 0)",province,TH-18,Chai Nat,chainat,chainat
8,"(chaiyaphum, 100, 0)",province,TH-36,Chaiyaphum,chaiyaphum,chaiyaphum
9,"(chanthaburi, 100, 0)",province,TH-22,Chanthaburi,chanthaburi,chanthaburi


Let's transfer matches to the `fuzzymatch1` column. The `lambda` function here receives the value of the column `fuzzymatch` as `x`, then checks it. If `x` is `None` (no candidate reached the score cutoff), it assigns the string `'None'` to column `fuzzymatch1`; otherwise, it assigns the value of `x[0]`. Note that for every record, `x` is either a tuple containing the fuzzy matched province name with its corresponding fuzzy match score, or `None`.
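
The "tuple or `None`" pattern can be sketched without `fuzzywuzzy` installed. The example below is a stand-in built on the standard library's `difflib`, not the `fuzzywuzzy` API: `best_match` is a hypothetical helper, its scores are only roughly analogous to `fuzz.ratio`, and it returns a 2-tuple rather than the 3-tuple shown in the output above.

```python
from difflib import SequenceMatcher

def best_match(query, choices, score_cutoff=80):
    """Hypothetical helper: return (choice, score) for the best choice, or None."""
    best = None
    for choice in choices:
        # Similarity on a 0-100 scale, rounded to an integer
        score = round(100 * SequenceMatcher(None, query, choice).ratio())
        if score >= score_cutoff and (best is None or score > best[1]):
            best = (choice, score)
    return best

choices = ['bangkok', 'amnatcharoen', 'angthong']
results = [best_match(q, choices) for q in ['bangkok', 'phatthaya']]

# Same unpacking logic as the lambda: keep the name, or the string 'None'
names = ['None' if r is None else r[0] for r in results]
print(results, names)
```

The score cutoff is what produces `None` for names with no close counterpart, which is why the unpacking step must handle that case explicitly.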

In [51]:
iso2_df['fuzzymatch1'] = iso2_df['fuzzymatch'].apply(lambda x:'None' if x is None else x[0])

iso2_df

Unnamed: 0,fuzzymatch,location_category,location_code,location_name,location_name_match,fuzzymatch1
0,"(bangkok, 100, 0)",metropolitan administration,TH-10,Bangkok,bangkok,bangkok
1,,special administrative city,TH-S,Phatthaya,phatthaya,
2,"(amnatcharoen, 100, 0)",province,TH-37,Amnat Charoen,amnatcharoen,amnatcharoen
3,"(angthong, 100, 0)",province,TH-15,Ang Thong,angthong,angthong
4,"(buengkan, 100, 0)",province,TH-38,Bueng Kan,buengkan,buengkan
5,"(buriram, 100, 0)",province,TH-31,Buri Ram,buriram,buriram
6,"(chachoengsao, 100, 0)",province,TH-24,Chachoengsao,chachoengsao,chachoengsao
7,"(chainat, 100, 0)",province,TH-18,Chai Nat,chainat,chainat
8,"(chaiyaphum, 100, 0)",province,TH-36,Chaiyaphum,chaiyaphum,chaiyaphum
9,"(chanthaburi, 100, 0)",province,TH-22,Chanthaburi,chanthaburi,chanthaburi


### Step 9: Finally, merge `fuzzy_df3` and `iso2_df`.

Be sure to check the listing above to see whether all our Thai province names have been matched. Remember that `iso2_df` has duplicate entries because one province may have several English names. Based on the results of the fuzzy match above, let's merge `iso2_df` and `fuzzy_df3`, joining `fuzzymatch1` in `iso2_df` to `location_name_match` in `fuzzy_df3`. (Think of it as a linkage ID.) The fuzzy match scores should be 100 or close to 100. The `process.extractOne` method from `fuzzywuzzy` returns a single result for each query: the candidate with the highest score.
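
Joining on differently named key columns can be sketched on a few made-up rows. Here `iso_toy`, `fuzzy_toy` and `merged` are illustrative names; the column names follow the notebook.

```python
import pandas as pd

# Two spellings of one province plus a row that found no fuzzy match
iso_toy = pd.DataFrame({
    'location_code': ['TH-31', 'TH-31', 'TH-S'],
    'location_name': ['Buri Ram', 'Buriram', 'Phatthaya'],
    'fuzzymatch1':   ['buriram', 'buriram', 'None'],
})
fuzzy_toy = pd.DataFrame({
    'location_name_match': ['buriram', 'yala'],
    'Province_TH':         ['บุรีรัมย์', 'ยะลา'],
})

# Inner join: only keys present on both sides survive
merged = pd.merge(iso_toy, fuzzy_toy, how='inner',
                  left_on='fuzzymatch1', right_on='location_name_match')
print(merged)
```

The unmatched rows on either side drop out, and both spellings of the same province are kept, which is why duplicates have to be removed afterwards.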

In [52]:
iso2_fuzzy_df1 = pd.merge(iso2_df, fuzzy_df3, how='inner', on=None, \
        left_on='fuzzymatch1', right_on='location_name_match',
        left_index=False, right_index=False, sort=True,
        suffixes=('_x', '_y'), copy=True, indicator=False)

Let's view the merged data frame.

In [53]:
iso2_fuzzy_df1

Unnamed: 0,fuzzymatch,location_category,location_code,location_name,location_name_match_x,fuzzymatch1,prov_en_1,prov_en_2,prov_th_1,prov_th_2,maxval,Province_TH,location_name_match_y
0,"(amnatcharoen, 100, 0)",province,TH-37,Amnat Charoen,amnatcharoen,amnatcharoen,amnat charoen,amnat charoen,อำนาจเจริญ,จังหวัดอำนาจเจริญ,90.5,อำนาจเจริญ,amnatcharoen
1,"(angthong, 100, 0)",province,TH-15,Ang Thong,angthong,angthong,ang thong,ang thong,อ่างทอง,จังหวัดอ่างทอง,88.0,อ่างทอง,angthong
2,"(bangkok, 100, 0)",metropolitan administration,TH-10,Bangkok,bangkok,bangkok,bangkok,bangkok,กรุงเทพฯ,กรุงเทพมหานคร,85.5,กรุงเทพฯ,bangkok
3,"(buengkan, 100, 0)",province,TH-38,Bueng Kan,buengkan,buengkan,bueng kan,bueng kan,บึงกาฬ,จังหวัดบึงกาฬ,86.5,บึงกาฬ,buengkan
4,"(buriram, 100, 0)",province,TH-31,Buri Ram,buriram,buriram,buri ram,buri ram,บุรีรัมย์,จังหวัดบุรีรัมย์,89.75,บุรีรัมย์,buriram
5,"(buriram, 100, 0)",province,TH-31,Buriram,buriram,buriram,buri ram,buri ram,บุรีรัมย์,จังหวัดบุรีรัมย์,89.75,บุรีรัมย์,buriram
6,"(chachoengsao, 100, 0)",province,TH-24,Chachoengsao,chachoengsao,chachoengsao,chachoengsao,chachoengsao,ฉะเชิงเทรา,จังหวัดฉะเชิงเทรา,90.5,ฉะเชิงเทรา,chachoengsao
7,"(chainat, 100, 0)",province,TH-18,Chai Nat,chainat,chainat,chai nat,chai nat,ชัยนาท,จังหวัดชัยนาท,86.5,ชัยนาท,chainat
8,"(chainat, 100, 0)",province,TH-18,Chainat,chainat,chainat,chai nat,chai nat,ชัยนาท,จังหวัดชัยนาท,86.5,ชัยนาท,chainat
9,"(chaiyaphum, 100, 0)",province,TH-36,Chaiyaphum,chaiyaphum,chaiyaphum,chaiyaphum,chaiyaphum,ชัยภูมิ,จังหวัดชัยภูมิ,88.0,ชัยภูมิ,chaiyaphum


### Step 10: Create `lookup_df1` from `iso2_fuzzy_df1`

Let's drop the duplicates (based on duplicate values in the `location_code` column) from `iso2_fuzzy_df1` and save the result as `lookup_df1`.
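
A minimal sketch of what `drop_duplicates` does with a `subset`, on made-up rows (`merged_toy` and `lookup_toy` are illustrative names). By default pandas keeps the first row it sees for each `location_code`.

```python
import pandas as pd

# Some provinces appear twice because of alternative English spellings
merged_toy = pd.DataFrame({
    'location_code': ['TH-31', 'TH-31', 'TH-18', 'TH-18', 'TH-36'],
    'location_name': ['Buri Ram', 'Buriram', 'Chai Nat', 'Chainat', 'Chaiyaphum'],
})

# Keep one row per location_code (the first occurrence)
lookup_toy = merged_toy.drop_duplicates(subset=['location_code'])
print(lookup_toy)
```

The original index labels are preserved, which explains the gaps in the index of the listing that follows.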

In [54]:
lookup_df1 = iso2_fuzzy_df1.drop_duplicates(subset=['location_code'])

lookup_df1

Unnamed: 0,fuzzymatch,location_category,location_code,location_name,location_name_match_x,fuzzymatch1,prov_en_1,prov_en_2,prov_th_1,prov_th_2,maxval,Province_TH,location_name_match_y
0,"(amnatcharoen, 100, 0)",province,TH-37,Amnat Charoen,amnatcharoen,amnatcharoen,amnat charoen,amnat charoen,อำนาจเจริญ,จังหวัดอำนาจเจริญ,90.5,อำนาจเจริญ,amnatcharoen
1,"(angthong, 100, 0)",province,TH-15,Ang Thong,angthong,angthong,ang thong,ang thong,อ่างทอง,จังหวัดอ่างทอง,88.0,อ่างทอง,angthong
2,"(bangkok, 100, 0)",metropolitan administration,TH-10,Bangkok,bangkok,bangkok,bangkok,bangkok,กรุงเทพฯ,กรุงเทพมหานคร,85.5,กรุงเทพฯ,bangkok
3,"(buengkan, 100, 0)",province,TH-38,Bueng Kan,buengkan,buengkan,bueng kan,bueng kan,บึงกาฬ,จังหวัดบึงกาฬ,86.5,บึงกาฬ,buengkan
4,"(buriram, 100, 0)",province,TH-31,Buri Ram,buriram,buriram,buri ram,buri ram,บุรีรัมย์,จังหวัดบุรีรัมย์,89.75,บุรีรัมย์,buriram
6,"(chachoengsao, 100, 0)",province,TH-24,Chachoengsao,chachoengsao,chachoengsao,chachoengsao,chachoengsao,ฉะเชิงเทรา,จังหวัดฉะเชิงเทรา,90.5,ฉะเชิงเทรา,chachoengsao
7,"(chainat, 100, 0)",province,TH-18,Chai Nat,chainat,chainat,chai nat,chai nat,ชัยนาท,จังหวัดชัยนาท,86.5,ชัยนาท,chainat
9,"(chaiyaphum, 100, 0)",province,TH-36,Chaiyaphum,chaiyaphum,chaiyaphum,chaiyaphum,chaiyaphum,ชัยภูมิ,จังหวัดชัยภูมิ,88.0,ชัยภูมิ,chaiyaphum
10,"(chanthaburi, 100, 0)",province,TH-22,Chanthaburi,chanthaburi,chanthaburi,chanthaburi,chanthaburi,จันทบุรี,จังหวัดจันทบุรี,89.0,จันทบุรี,chanthaburi
11,"(chiangmai, 100, 0)",province,TH-50,Chiang Mai,chiangmai,chiangmai,chiang mai,chiang mai,เชียงใหม่,จังหวัดเชียงใหม่,89.75,เชียงใหม่,chiangmai


What we have just created through that multi-step process is a lookup table or dictionary, aptly called `lookup_df1`. Let's now proceed to match the Thai province names in `mtct_df2` to the Thai province names in `lookup_df1`. This step yields matched records with English province names and, more importantly, ISO 3166-2 province codes, which we can match with corresponding values in the Thai SHP files.

### Step 11: Merge `lookup_df1` and `mtct_df2`

Let's do a merge (inner join) of `lookup_df1` and the full `mtct_df2` dataframe (all years) on the `Province_TH` column.
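
An inner join silently drops province names that do not match. One way to check for them, sketched here on made-up rows, is an outer join with `indicator=True`. The names `lookup_toy`, `mtct_toy`, `check` and `unmatched` are illustrative, and `'UNKNOWN'` is a placeholder for a name missing from the lookup table.

```python
import pandas as pd

lookup_toy = pd.DataFrame({
    'Province_TH':   ['อำนาจเจริญ', 'อ่างทอง'],
    'location_code': ['TH-37', 'TH-15'],
})
mtct_toy = pd.DataFrame({
    'Province_TH':   ['อำนาจเจริญ', 'อำนาจเจริญ', 'อ่างทอง', 'UNKNOWN'],
    'Year_Recorded': [2013, 2014, 2013, 2013],
})

# One lookup row fans out to one row per recorded year
inner = pd.merge(lookup_toy, mtct_toy, on='Province_TH', how='inner')

# The _merge column flags rows found on only one side
check = pd.merge(lookup_toy, mtct_toy, on='Province_TH', how='outer', indicator=True)
unmatched = check[check['_merge'] != 'both']
print(unmatched)
```

An empty `unmatched` dataframe would indicate that every province name found its counterpart.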

In [57]:
mtct_df2

Unnamed: 0,Year_Recorded,Region,Province_TH,HIVpos_children,2PCR_children,TRpct,livebirths_100k,pregnancies,HIVpos_pregwomen_labor_room,product
0,2013,11,พังงา,1,12,8.33%,25.9,3861,20,0.083333
1,2013,4,กาญจนบุรี,0,34,0.00%,0.0,9030,57,0.0
2,2013,3,สมุทรปราการ,1,48,2.08%,8.7,11495,93,0.020833
3,2013,4,ประจวบคีรีขันธ์,1,109,0.92%,14.12,7080,59,0.009174
4,2013,9,เพชรบูรณ์,1,31,3.23%,11.81,8470,44,0.032258
5,2013,6,กาฬสินธุ์,1,46,2.17%,12.84,7788,34,0.021739
6,2013,3,จันทบุรี,0,31,0.00%,0.0,5696,40,0.0
7,2013,8,ชัยนาท,0,0,0.00%,0.0,491,6,
8,2013,3,ระยอง,1,60,1.67%,9.72,10290,106,0.016667
9,2013,11,ชุมพร,0,24,0.00%,0.0,7607,48,0.0


In [59]:
mtct_series_df1 = pd.merge(lookup_df1, mtct_df2, on='Province_TH', how='inner')

In [60]:
mtct_series_df1

Unnamed: 0,fuzzymatch,location_category,location_code,location_name,location_name_match_x,fuzzymatch1,prov_en_1,prov_en_2,prov_th_1,prov_th_2,maxval,Province_TH,location_name_match_y,Year_Recorded,Region,HIVpos_children,2PCR_children,TRpct,livebirths_100k,pregnancies,HIVpos_pregwomen_labor_room,product
0,"(amnatcharoen, 100, 0)",province,TH-37,Amnat Charoen,amnatcharoen,amnatcharoen,amnat charoen,amnat charoen,อำนาจเจริญ,จังหวัดอำนาจเจริญ,90.5,อำนาจเจริญ,amnatcharoen,2013,7,0,19,0.00%,0.0,3542,14,0.0
1,"(amnatcharoen, 100, 0)",province,TH-37,Amnat Charoen,amnatcharoen,amnatcharoen,amnat charoen,amnat charoen,อำนาจเจริญ,จังหวัดอำนาจเจริญ,90.5,อำนาจเจริญ,amnatcharoen,2014,7,1,11,9.09%,28.67,3488,12,0.090909
2,"(amnatcharoen, 100, 0)",province,TH-37,Amnat Charoen,amnatcharoen,amnatcharoen,amnat charoen,amnat charoen,อำนาจเจริญ,จังหวัดอำนาจเจริญ,90.5,อำนาจเจริญ,amnatcharoen,2015,7,0,21,0.00%,0.0,3367,13,0.0
3,"(amnatcharoen, 100, 0)",province,TH-37,Amnat Charoen,amnatcharoen,amnatcharoen,amnat charoen,amnat charoen,อำนาจเจริญ,จังหวัดอำนาจเจริญ,90.5,อำนาจเจริญ,amnatcharoen,2016,7,0,8,0.00%,0.0,3112,9,0.0
4,"(angthong, 100, 0)",province,TH-15,Ang Thong,angthong,angthong,ang thong,ang thong,อ่างทอง,จังหวัดอ่างทอง,88.0,อ่างทอง,angthong,2013,1,0,50,0.00%,0.0,2606,22,0.0
5,"(angthong, 100, 0)",province,TH-15,Ang Thong,angthong,angthong,ang thong,ang thong,อ่างทอง,จังหวัดอ่างทอง,88.0,อ่างทอง,angthong,2014,1,1,22,4.55%,39.23,2549,18,0.045455
6,"(angthong, 100, 0)",province,TH-15,Ang Thong,angthong,angthong,ang thong,ang thong,อ่างทอง,จังหวัดอ่างทอง,88.0,อ่างทอง,angthong,2015,1,1,16,6.25%,42.75,2339,6,0.0625
7,"(angthong, 100, 0)",province,TH-15,Ang Thong,angthong,angthong,ang thong,ang thong,อ่างทอง,จังหวัดอ่างทอง,88.0,อ่างทอง,angthong,2016,1,1,27,3.70%,44.15,2265,22,0.037037
8,"(bangkok, 100, 0)",metropolitan administration,TH-10,Bangkok,bangkok,bangkok,bangkok,bangkok,กรุงเทพฯ,กรุงเทพมหานคร,85.5,กรุงเทพฯ,bangkok,2013,13,1,103,0.97%,5.55,18022,141,0.009709
9,"(bangkok, 100, 0)",metropolitan administration,TH-10,Bangkok,bangkok,bangkok,bangkok,bangkok,กรุงเทพฯ,กรุงเทพมหานคร,85.5,กรุงเทพฯ,bangkok,2014,13,5,117,4.27%,19.76,25300,185,0.042735


### Step 12: Clean up the merged dataframe

Let's do a few things to this merged DataFrame:
1. Eliminate all the columns we don't need.
2. Convert the `Year_Recorded` values to `Timestamp` objects, stored in a new column, `ds`.
3. Convert `TRpct` from a percent string to a float and copy it to a new column, `y`. We will use `ds` and `y` later for time series analysis.
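
The two conversions can be sketched on a tiny made-up dataframe (`toy_df` is an illustrative name). This version uses `pd.to_datetime` with a year-only format as an alternative to applying `Timestamp` row by row.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'Year_Recorded': [2013, 2014],
    'TRpct': ['0.00%', '9.09%'],
})

# A year-only string parses to January 1 of that year
toy_df['ds'] = pd.to_datetime(toy_df['Year_Recorded'].astype(str), format='%Y')

# Remove the literal percent sign, then convert the text to float
toy_df['y'] = toy_df['TRpct'].str.replace('%', '', regex=False).astype(float)

print(toy_df.dtypes)
```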

In [62]:
from pandas import Timestamp

# Keep only the columns we need; .copy() avoids SettingWithCopyWarning
mtct_series_df2 = mtct_series_df1[['location_code', 'location_name',
                                   'Year_Recorded', 'Region', 'TRpct',
                                   'livebirths_100k', 'pregnancies']].copy()
# Build a Timestamp for January 1 of each recorded year
mtct_series_df2['ds'] = \
    mtct_series_df2['Year_Recorded'].astype(str).\
    apply(lambda x: Timestamp(x+'-01-01 00:00:00'))
# Convert percent strings such as '9.09%' to floats
mtct_series_df2['TRpct'] = mtct_series_df2['TRpct'].str.replace('%','').astype(float)
mtct_series_df2['y'] = mtct_series_df2['TRpct']
mtct_series_df2.dtypes

location_code      object        
location_name      object        
Year_Recorded      int64         
Region             int64         
TRpct              float64       
livebirths_100k    float64       
pregnancies        int64         
ds                 datetime64[ns]
y                  float64       
dtype: object

Let's inspect our new DataFrame. Each English province name should have an entry for 2013, 2014, 2015 and 2016.
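
Rather than scanning the listing by eye, the four-years-per-province expectation can also be checked programmatically. Here is a sketch on made-up rows, where `toy_df`, `years_per_province` and `incomplete` are illustrative names.

```python
import pandas as pd

# Made-up rows: Bangkok is deliberately missing 2015
toy_df = pd.DataFrame({
    'location_name': ['Ang Thong'] * 4 + ['Bangkok'] * 3,
    'Year_Recorded': [2013, 2014, 2015, 2016, 2013, 2014, 2016],
})

# Count the distinct years recorded for each province
years_per_province = toy_df.groupby('location_name')['Year_Recorded'].nunique()

# Provinces that do not have exactly four years of data
incomplete = years_per_province[years_per_province != 4]
print(incomplete)
```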

In [63]:
mtct_series_df2

Unnamed: 0,location_code,location_name,Year_Recorded,Region,TRpct,livebirths_100k,pregnancies,ds,y
0,TH-37,Amnat Charoen,2013,7,0.0,0.0,3542,2013-01-01,0.0
1,TH-37,Amnat Charoen,2014,7,9.09,28.67,3488,2014-01-01,9.09
2,TH-37,Amnat Charoen,2015,7,0.0,0.0,3367,2015-01-01,0.0
3,TH-37,Amnat Charoen,2016,7,0.0,0.0,3112,2016-01-01,0.0
4,TH-15,Ang Thong,2013,1,0.0,0.0,2606,2013-01-01,0.0
5,TH-15,Ang Thong,2014,1,4.55,39.23,2549,2014-01-01,4.55
6,TH-15,Ang Thong,2015,1,6.25,42.75,2339,2015-01-01,6.25
7,TH-15,Ang Thong,2016,1,3.7,44.15,2265,2016-01-01,3.7
8,TH-10,Bangkok,2013,13,0.97,5.55,18022,2013-01-01,0.97
9,TH-10,Bangkok,2014,13,4.27,19.76,25300,2014-01-01,4.27


Let's rename some columns. We will use `ISO_CODE` to link with the `ISO_CODE` column from the SHP file.

In [64]:
mtct_series_df3 = mtct_series_df2.rename(columns={\
                'location_code':'ISO_CODE',\
                'location_name':'province',\
                'Year_Recorded':'year',\
                'Region':'region'})

In [65]:
mtct_series_df3

Unnamed: 0,ISO_CODE,province,year,region,TRpct,livebirths_100k,pregnancies,ds,y
0,TH-37,Amnat Charoen,2013,7,0.0,0.0,3542,2013-01-01,0.0
1,TH-37,Amnat Charoen,2014,7,9.09,28.67,3488,2014-01-01,9.09
2,TH-37,Amnat Charoen,2015,7,0.0,0.0,3367,2015-01-01,0.0
3,TH-37,Amnat Charoen,2016,7,0.0,0.0,3112,2016-01-01,0.0
4,TH-15,Ang Thong,2013,1,0.0,0.0,2606,2013-01-01,0.0
5,TH-15,Ang Thong,2014,1,4.55,39.23,2549,2014-01-01,4.55
6,TH-15,Ang Thong,2015,1,6.25,42.75,2339,2015-01-01,6.25
7,TH-15,Ang Thong,2016,1,3.7,44.15,2265,2016-01-01,3.7
8,TH-10,Bangkok,2013,13,0.97,5.55,18022,2013-01-01,0.97
9,TH-10,Bangkok,2014,13,4.27,19.76,25300,2014-01-01,4.27


In [66]:
mtct_series_df3.dtypes

ISO_CODE           object        
province           object        
year               int64         
region             int64         
TRpct              float64       
livebirths_100k    float64       
pregnancies        int64         
ds                 datetime64[ns]
y                  float64       
dtype: object

Let's pickle this new DataFrame and also create a CSV file.

In [67]:
mtct_series_df3.to_pickle('data/mtct_series_df3.pickle')

In [68]:
mtct_series_df3.to_csv('data/mtct_series_df3.csv')
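
One practical difference between the two formats is that a pickle preserves column dtypes, while a CSV stores everything as text. The sketch below uses a made-up dataframe and an in-memory buffer instead of a file (`toy_df`, `buffer`, `plain` and `parsed` are illustrative names). It shows that the `ds` column must be re-parsed when reading the CSV back.

```python
import io
import pandas as pd

toy_df = pd.DataFrame({
    'ISO_CODE': ['TH-37', 'TH-15'],
    'ds': pd.to_datetime(['2013-01-01', '2014-01-01']),
    'y': [0.0, 4.55],
})

# Write to an in-memory buffer; index=False leaves out the row index column
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)

# Read it back twice: once as-is, once asking pandas to parse the dates
buffer.seek(0)
plain = pd.read_csv(buffer)
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=['ds'])

print(plain.dtypes)
print(parsed.dtypes)
```

Without `parse_dates`, `ds` comes back as text rather than as a datetime column.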

## Congratulations! 

### You finished Notebook 3 for this data management exercise for MTCT data.

Let's proceed to Notebook 4.