*(Steps in italics are not so important for now)*

### Main idea
1. Use Tesseract on each of the lines (i.e. images). Keep track of the number of words that have been guessed for each line, for the last step.
2. Combine the "non-numeric" part of the first line with the second line, to get a single array of words.
3. Start by finding the district and province by looking at the last words.
4. Use an API to correct the rest of the Thai address.
5. Use the number of words that had been recognized in the first step, to split the address string correctly into the "line 1" and "line 2" fields.
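
Steps 1, 2 and 5 can be sketched roughly as follows. This is only an illustration under simplifying assumptions: `split_lines` is a hypothetical helper, the OCR strings are made up, the split of the first line into numeric and non-numeric parts is skipped, and the correction in steps 3 and 4 is assumed to keep the number of words unchanged.

```python
def split_lines(words, n_words_line1):
    """Split a flat list of words back into the two address lines."""
    line1 = " ".join(words[:n_words_line1])
    line2 = " ".join(words[n_words_line1:])
    return line1, line2

# Hypothetical Tesseract output for the two line images
ocr_line1 = "99/1 หมู่ที่ 4 ต.บ้านธิ"
ocr_line2 = "อ.บ้านธิ จ.ลำพูน"

# Step 1: remember how many words were recognized on the first line
n_words_line1 = len(ocr_line1.split())

# Step 2 (simplified): a single array of words
words = ocr_line1.split() + ocr_line2.split()

# Steps 3 and 4 would correct the entries of `words` here.

# Step 5: restore the "line 1" / "line 2" fields
line1, line2 = split_lines(words, n_words_line1)
```

If the correction step can add or remove words, the stored count would have to be adjusted before the final split.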


### Conceptual
(This was by no means written by a Thai expert, so it might be wrong)
1. Fields starting with "จ." refer to the province (exhaustive list)
2. Fields starting with "อ." refer to the district (almost? exhaustive list)
3. Fields starting with "ต." refer to the township or ตำบล (see https://www.wikiwand.com/th/%E0%B8%95%E0%B8%B3%E0%B8%9A%E0%B8%A5)
4. Fields starting with "ถ." refer to the street/road or ถนน
5. Fields starting with "ซ." refer to the soi (side street or lane) or ซอย
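
A minimal sketch of how these prefixes could be turned into a lookup. The field names and the `tag_word` helper are hypothetical, and reading "ซ." as soi (ซอย) is an assumption of this sketch, subject to the same caveat as the list above.

```python
# Prefix -> field name, following the list above (field names are illustrative)
PREFIX_TO_FIELD = {
    "จ.": "province",
    "อ.": "district",
    "ต.": "township",
    "ถ.": "road",
    "ซ.": "soi",
}

def tag_word(word):
    """Return (field, value) for a prefixed word, or (None, word) otherwise."""
    for prefix, field in PREFIX_TO_FIELD.items():
        if word.startswith(prefix):
            return field, word[len(prefix):]
    return None, word

# Made-up example words
tagged = [tag_word(w) for w in "99/1 อ.บ้านธิ จ.ลำพูน".split()]
```

Words without a known prefix are returned untouched, so they can be handled by other rules later.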

TODO: Check if หมู่ที่ is compatible with ซ. and ต.

Many addresses start with a (sequence) number, then "หมู่ที่" (literally "group number"; in addresses this appears to be the village number), then a number.

### Begin of address TODOs (not started)
1. Parse the number at the beginning: in most cases it has the form `^[0-9]+(/[0-9]+)*` (e.g. `12/345`)
2. Take into account fields 3-5 from **Conceptual** to create rules
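
Since this part is not started yet, here is a small sketch of the first TODO. It assumes the leading number consists of digits optionally followed by `/digits` groups; `split_house_number` is a hypothetical helper name.

```python
import re

# Leading number: digits, optionally followed by "/digits" groups (e.g. "99/1")
HOUSE_NUMBER_RE = re.compile(r"^[0-9]+(?:/[0-9]+)*")

def split_house_number(address):
    """Return (house_number, rest); house_number is "" when nothing matches."""
    match = HOUSE_NUMBER_RE.match(address)
    if match is None:
        return "", address
    return match.group(0), address[match.end():].strip()

number, rest = split_house_number("99/1 หมู่ที่ 4")
```

Addresses that do not start with a digit fall through unchanged, which keeps the rule safe to apply to every row.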

### End of address TODOs (WIP)

1. Search for the `num_p` most likely provinces (taking the 2 with the highest similarity to the most frequent Tesseract output)
2. Search for the `num_d` most likely districts (taking the 2 with the highest similarity to the most frequent Tesseract output)
3. Create all possible `num_p` x `num_d` pairs and keep only the pairs that exist in the csv file
4. *Compute a "pair-likelihood", based on the individual similarities of the district and the province, to take the most likely pair in the case when several pairs make it through the filtering*
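
The four steps can be sketched as below. This is not the final implementation: it assumes `difflib.SequenceMatcher` as the similarity measure and the product of the two similarities as the "pair-likelihood", and `top_matches` and `best_pair` are hypothetical names. The reference pairs are toy data in ASCII; the real ones would come from the csv file.

```python
from difflib import SequenceMatcher
from itertools import product

def top_matches(guess, candidates, n=2):
    """Return the n candidates most similar to guess as (candidate, score) pairs."""
    scored = [(c, SequenceMatcher(None, guess, c).ratio()) for c in sorted(candidates)]
    return sorted(scored, key=lambda cs: cs[1], reverse=True)[:n]

def best_pair(district_guess, province_guess, valid_pairs, num_d=2, num_p=2):
    """Pick the most likely (district, province) pair that exists in valid_pairs."""
    districts = {d for d, _ in valid_pairs}
    provinces = {p for _, p in valid_pairs}
    top_d = top_matches(district_guess, districts, num_d)
    top_p = top_matches(province_guess, provinces, num_p)
    # Step 3: keep only combinations that exist in the reference data.
    # Step 4: score each surviving pair by the product of both similarities.
    scored = [((d, p), sd * sp)
              for (d, sd), (p, sp) in product(top_d, top_p)
              if (d, p) in valid_pairs]
    if not scored:
        return None
    return max(scored, key=lambda ps: ps[1])[0]

# Toy reference data (assumption of this sketch)
valid_pairs = {
    ("Ban Thi", "Lamphun"),
    ("Mueang Lamphun", "Lamphun"),
    ("Rong Kwang", "Phrae"),
}
result = best_pair("Ban Thl", "Lamphnu", valid_pairs)
```

Returning `None` when no candidate pair survives the filter makes it explicit that `num_d` or `num_p` may need to be increased for that row.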


# Exploration and csv files creation

In [1]:
import pandas as pd

### Client data 

In [2]:
PATH = "../resources/"

In [3]:
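# Read true labels from the xlsx file
# (`sheet_name` replaces the `sheetname` argument, which newer pandas versions no longer accept)
true_address_df = pd.read_excel(
    PATH + "DIE_Train_Label_3Scenario.xlsx",
    sheet_name='Address')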
true_address_l1 = true_address_df['Address line 1']
true_address_l2 = true_address_df['Address line 2']

# define tags
address_tag  = "ที่อยู่"
district_tag = "อ"
#district_tag_bangkok = "เขต" # assume it is always there for Bangkok
province_tag = "จ"
city_tag     = "เมือง" # the district may be this, followed by a province

In [4]:
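def matcher(words):
    """Split the words of address line 2 into (rest, district, province).

    The last word is taken as the province and the one before it as the district.
    """
    words = [w for w in words if w != ""]
    if len(words) == 0:
        return "", "", ""
    elif len(words) == 1:
        return "", "", clean_province(words[0])
    elif len(words) == 2:
        return "", words[0], words[1]
    else:
        # Everything before the last two words is the remainder of the address
        return " ".join(words[:-2]), words[-2], words[-1]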

def clean_province(province):
    if province.startswith("จ."):
        return province[2:]
    else:
        return province
    
def clean_district(district):
    if district.startswith("อ."):
        return district[2:]
    else:
        return district

def splitDataFrameList(df,target_column,separator):
    ''' df = dataframe to split,
    target_column = the column containing the values to split
    separator = the symbol used to perform the split
    returns: a dataframe with each entry for the target column separated, with each element moved into a new row.
    The values in the other columns are duplicated across the newly divided rows.
    
    Taken from https://gist.github.com/jlln/338b4b0b55bd6984f883
    '''
    def splitListToRows(row,row_accumulator,target_column,separator):
        split_row = row[target_column].split(separator)
        for s in split_row:
            new_row = row.to_dict()
            new_row[target_column] = s
            row_accumulator.append(new_row)
    new_rows = []
    df.apply(splitListToRows,axis=1,args = (new_rows,target_column,separator))
    new_df = pd.DataFrame(new_rows)
    return new_df

districts_df = pd.read_csv(PATH+"districts_provinces_reference.csv")
splitted_dist_df = splitDataFrameList(districts_df, 'districts', ',')

In [5]:
observed = pd.DataFrame([matcher(l2.split(" ")) for l2 in true_address_l2])
len([x for x in observed[1] if x==""])

124

### csv files creation 

These files are derived from `districts_provinces_reference.csv`, so regenerate them whenever that file changes.

In [6]:
with open(PATH+"districts.csv", mode='wt', encoding='utf-8') as districts_file:
    districts_file.write('district\n'+'\n'.join(splitted_dist_df.districts.unique()))

In [7]:
with open(PATH+"provinces.csv", mode='wt', encoding='utf-8') as provinces_file:
    provinces_file.write('province\n'+'\n'.join(splitted_dist_df.province.unique()))

In [8]:
splitted_dist_df.rename(columns={"districts":"district"}).to_csv(PATH+"province_district.csv", index=False)

### Explore provinces
Result: all provinces are found, except for a few typos. This also takes into account that Bangkok is represented in two ways.

In [9]:
# all rows have a province
assert len([p for p in observed[2] if p != ""]) == 200

In [10]:
# client data
unique_provinces = list(set(observed[2]))
print(len(unique_provinces))

unique_clean_provinces = list(set([clean_province(x) for x in unique_provinces]))
print(len(unique_clean_provinces))

87
65


The ones not found are actually typos:

In [11]:
# info from csv
parsed_provinces = list(districts_df['province'].unique())

not_found = [p for p in unique_clean_provinces if p not in parsed_provinces]
not_found

['อ.แพร่', 'กรุงเมพมหานคร', 'กรุงเทพมหานนคร']

In [12]:
print(len(list(districts_df['province'].unique())))
print(len(list(districts_df['province'])))

78
78


### Now districts

In [13]:
# client data
unique_districts = list(set(observed[1]))
print(len(unique_districts))

unique_clean_districts = list(set([clean_district(x) for x in unique_districts]))
print(len(unique_clean_districts))

unique_clean_districts[:5]

65
65


['', 'เขตคลองเตย', 'เมืองอุบลราชธานี', 'บ้านธิ', 'ร้องกวาง']

In [14]:
# info from csv
parsed_districts = list(splitted_dist_df['districts'].unique())

not_found = [d for d in unique_clean_districts if d not in parsed_districts]
not_found

['', 'เมืองเชี่ยงใหม่']

In [15]:
len(splitted_dist_df.districts.unique())

922

### Check if all pairs in client data are actually pairs in the generated csv files