<a href="https://colab.research.google.com/github/AutisMaxima/Address-Standardisation/blob/main/Address_standard.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Address Standardization of Indonesian Addresses

The objective of this notebook is to determine whether a sequence-to-sequence model can be trained on *imperfect* data such that the model can accurately convert "dirty" addresses to "clean" addresses.

## Data Store Mounting

We store the data for this project in a personal Google Drive. To set up, we mount the drive in the Colab runtime.

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


**Optional**: We can create a folder to store our data by code. This saves us the hassle of creating one manually.

In [None]:
import os

# Create a folder in the root directory
!mkdir -p "/content/drive/My Drive/Projects/Address Standardization"

## Dataset Preparation

**Optional**: Install the required packages if not already installed.

In [None]:
! pip install pandas googlemaps requests numpy datasets evaluate
! pip install accelerate -U
! pip install transformers -U

The dataset is a list of shipments in Indonesia recorded over a fixed period (1 to 31 December 2020, according to the file name).

In [None]:
import pandas as pd
import googlemaps
import numpy as np

In [None]:
sample = pd.read_excel('/content/drive/My Drive/Projects/Address Standardization/SHIPMENTS_RAW_1-31 DESEMBER_2020_UPDATE 01-FEB-2021.xlsx')

In [None]:
sample.head(5)

Unnamed: 0,number_reference,pickup_address,origin_city,delivery_address,destination_city,pickup_latitude,pickup_longitude,delivery_latitude,delivery_longitude
0,8983110710093660,"BELM Graha Dipta <br/>Jalan Raya Bekasi KM.18,...",KOTA JAKARTA TIMUR,JL. CIPINANG MUARA RT.2 RW.11 NO.7 JATINEGARA ...,KOTA JAKARTA TIMUR,-6.190889,106.904495,-6.226711,106.889037
1,8983110710093740,"BELM Graha Dipta <br/>Jalan Raya Bekasi KM.18,...",KOTA JAKARTA TIMUR,"JALAN HARAPAN, KOMPLEK AMANILA RESIDEN NO A4, ...",KOTA DEPOK,-6.190889,106.904495,-6.398684,106.789913
2,8983110710094090,"BELM Graha Dipta <br/>Jalan Raya Bekasi KM.18,...",KOTA JAKARTA TIMUR,JALAN KASWARI 2 BLOK E36 NO.5 RT.12 RW.10 KUTA...,KABUPATEN TANGERANG,-6.190889,106.904495,-6.184688,106.502142
3,8983110710093680,"BELM Graha Dipta <br/>Jalan Raya Bekasi KM.18,...",KOTA JAKARTA TIMUR,"JL. LETJEND SUPRAPTO NO.14, RT.10/RW.7, EAST C...",KOTA JAKARTA PUSAT,-6.190889,106.904495,-6.176012,106.871683
4,8983110710094080,"BELM Graha Dipta <br/>Jalan Raya Bekasi KM.18,...",KOTA JAKARTA TIMUR,"JALAN C2, NO 60D, RT.001/008, DURI KEPA, KEBON...",KOTA JAKARTA BARAT,-6.190889,106.904495,-6.17597,106.773792


We will train on a subset of this data to accommodate limits on processing power and API quota.

In [None]:
sample_small = sample.sample(n=5000)
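
Because `sample()` draws a different subset on every run, the rows shown later in this notebook will not be reproduced exactly. A minimal sketch of pinning the draw with `random_state`, using a small made-up DataFrame (the toy data and the seed value 42 are assumptions, not part of the original run):

```python
import pandas as pd

# Toy stand-in for the shipment table (hypothetical data)
toy = pd.DataFrame({"delivery_address": [f"address {i}" for i in range(100)]})

# Fixing random_state makes the draw repeatable across runs
first_draw = toy.sample(n=10, random_state=42)
second_draw = toy.sample(n=10, random_state=42)

print(first_draw.index.equals(second_draw.index))
```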

We only need one column of dirty addresses together with their coordinates. We use the delivery addresses rather than the pickup addresses because they cover many more distinct locations.

In [None]:
# Select the three columns and rename in one step; returning a new DataFrame
# (instead of renaming in place on a slice) avoids pandas' SettingWithCopyWarning
needed = sample_small[['delivery_address', 'delivery_latitude', 'delivery_longitude']].rename(
    columns={'delivery_address': 'dirty_address'}
)


To obtain the clean addresses, we pass the coordinates provided in the dataset to Google's reverse geocoding API.

In [None]:
API_KEY = 'Provide API Key here'

In [None]:
import requests

def reverse_geocode(latitude, longitude):
    """Return Google's formatted address for a coordinate pair, or a failure marker."""
    # Make a request to the Geocoding API for reverse geocoding
    url = 'https://maps.googleapis.com/maps/api/geocode/json'
    params = {'latlng': f'{latitude},{longitude}', 'key': API_KEY}
    response = requests.get(url, params=params, timeout=10)
    data = response.json()
    if data['status'] == 'OK':
        # Use the first result returned for this coordinate pair
        return data['results'][0]['formatted_address']
    return 'Reverse Geocoding Failed'
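
The response handling can be exercised without calling the API. A minimal sketch, assuming a hypothetical helper `extract_formatted_address` and hand-written payloads that mimic only the two fields the function above reads (`status` and the first entry of `results`):

```python
FAILURE_MARKER = 'Reverse Geocoding Failed'  # same sentinel string as reverse_geocode

def extract_formatted_address(data):
    """Hypothetical helper: pull the first formatted address out of a decoded response."""
    if data.get('status') == 'OK' and data.get('results'):
        return data['results'][0]['formatted_address']
    return FAILURE_MARKER

# Hand-written payloads (not real API output)
ok_payload = {'status': 'OK', 'results': [{'formatted_address': 'Jl. Contoh No.1, Jakarta'}]}
empty_payload = {'status': 'ZERO_RESULTS', 'results': []}

print(extract_formatted_address(ok_payload))
print(extract_formatted_address(empty_payload))
```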

In [None]:
# Reverse-geocode every row; building a list avoids the repeated copying of np.append
address = [
    reverse_geocode(lat, lng)
    for lat, lng in zip(needed['delivery_latitude'], needed['delivery_longitude'])
]
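
Several shipments may share the same coordinates, and each API call counts against the quota. One possible way to cut repeated requests is to cache results per coordinate pair. The sketch below uses a stand-in `fake_reverse_geocode` (an assumption for illustration) so that it runs without an API key:

```python
from functools import lru_cache

calls = []  # records every lookup that actually reaches the "API"

def fake_reverse_geocode(latitude, longitude):
    """Stand-in for reverse_geocode; returns a made-up address."""
    calls.append((latitude, longitude))
    return f'address at {latitude},{longitude}'

@lru_cache(maxsize=None)
def cached_reverse_geocode(latitude, longitude):
    # Identical coordinate pairs are looked up only once
    return fake_reverse_geocode(latitude, longitude)

coords = [(-6.19, 106.90), (-6.22, 106.88), (-6.19, 106.90)]
addresses = [cached_reverse_geocode(lat, lng) for lat, lng in coords]

print(len(addresses), len(calls))  # 3 addresses from 2 lookups
```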

We attach the clean addresses as a new column next to the dirty addresses and rename the coordinate columns.

In [None]:
comparison = needed.assign(clean_address = address)
comparison.rename(columns={'delivery_latitude':'latitude', 'delivery_longitude':'longitude'}, inplace=True)

In [None]:
comparison.head(5)

Unnamed: 0,dirty_address,latitude,longitude,clean_address
294389,"RT.03/RW.12, Curug, Bogor City, West Java, Ind...",-6.548546,106.770179,"Jl. Bogor Raya Permai No.Kel, RT.04/RW.13, Cur..."
166226,"Jl. Manyar Sambongan No.28, RT.002/RW.10, Kert...",-7.280967,112.756994,"Jl. Manyar Sambongan No.28, RT.002/RW.10, Kert..."
441040,"Taman, Jl. Tytyan Indah No.10B, RT.002/RW.012,...",-6.217911,106.98873,"Taman, Jl. Tytyan Indah No.10B, RT.002/RW.012,..."
260946,"RT.11/RW.4, Petogogan, Kota Jakarta Selatan, D...",-6.241507,106.811732,"Jl. Cipaku II No.11, RT.11/RW.4, Petogogan, Ke..."
511777,"Jl. Kalibata Utara II No.18MS, RT.11/RW.7, Kal...",-6.261241,106.830124,"Jl. Kalibata Utara II No.18MS, RT.11/RW.7, Kal..."


We then save the modified dataset to a separate file for future use.

In [None]:
comparison.to_csv('/content/drive/My Drive/Projects/Address Standardization/dataset.csv')

## Preprocessing of Data

We load the dataset which contains both the dirty addresses and clean addresses.

In [None]:
df = pd.read_csv('/content/drive/My Drive/Projects/Address Standardization/dataset.csv')

In [None]:
df.head(5)

Unnamed: 0.1,Unnamed: 0,dirty_address,latitude,longitude,clean_address
0,294389,"RT.03/RW.12, Curug, Bogor City, West Java, Ind...",-6.548546,106.770179,"Jl. Bogor Raya Permai No.Kel, RT.04/RW.13, Cur..."
1,166226,"Jl. Manyar Sambongan No.28, RT.002/RW.10, Kert...",-7.280967,112.756994,"Jl. Manyar Sambongan No.28, RT.002/RW.10, Kert..."
2,441040,"Taman, Jl. Tytyan Indah No.10B, RT.002/RW.012,...",-6.217911,106.98873,"Taman, Jl. Tytyan Indah No.10B, RT.002/RW.012,..."
3,260946,"RT.11/RW.4, Petogogan, Kota Jakarta Selatan, D...",-6.241507,106.811732,"Jl. Cipaku II No.11, RT.11/RW.4, Petogogan, Ke..."
4,511777,"Jl. Kalibata Utara II No.18MS, RT.11/RW.7, Kal...",-6.261241,106.830124,"Jl. Kalibata Utara II No.18MS, RT.11/RW.7, Kal..."


We need to get rid of columns that are not used in training. In this case, we create a new dataset with only the dirty and clean addresses.

In [None]:
df_new = df.get(["dirty_address", "clean_address"])
df_new = df_new.dropna() # drops all rows with empty cells present
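
`dropna()` only removes missing values. Rows where `reverse_geocode` returned its `'Reverse Geocoding Failed'` string are not missing, so they would survive this step and end up as training targets. A minimal sketch of filtering them out, shown on a made-up three-row frame (the toy rows are assumptions):

```python
import pandas as pd

FAILURE_MARKER = 'Reverse Geocoding Failed'

toy = pd.DataFrame({
    'dirty_address': ['jl contoh 1', 'jl contoh 2', None],
    'clean_address': ['Jl. Contoh No.1', FAILURE_MARKER, 'Jl. Contoh No.3'],
})

# Drop rows with missing cells, then rows whose target is the failure marker
cleaned = toy.dropna()
cleaned = cleaned[cleaned['clean_address'] != FAILURE_MARKER]

print(len(cleaned))
```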

In [None]:
df_new

Unnamed: 0,dirty_address,clean_address
0,"RT.03/RW.12, Curug, Bogor City, West Java, Ind...","Jl. Bogor Raya Permai No.Kel, RT.04/RW.13, Cur..."
1,"Jl. Manyar Sambongan No.28, RT.002/RW.10, Kert...","Jl. Manyar Sambongan No.28, RT.002/RW.10, Kert..."
2,"Taman, Jl. Tytyan Indah No.10B, RT.002/RW.012,...","Taman, Jl. Tytyan Indah No.10B, RT.002/RW.012,..."
3,"RT.11/RW.4, Petogogan, Kota Jakarta Selatan, D...","Jl. Cipaku II No.11, RT.11/RW.4, Petogogan, Ke..."
4,"Jl. Kalibata Utara II No.18MS, RT.11/RW.7, Kal...","Jl. Kalibata Utara II No.18MS, RT.11/RW.7, Kal..."
...,...,...
4995,"Blok Q, Jl. Lobi-lobi No.17, RT.9/RW.6, Kaliba...","Jl. Lobi-lobi No.17, RT.9/RW.6, Kalibata, Kec...."
4996,"Cinangka, Sawangan, Depok City, Jawa Barat, In...","Jl. Parakan No.32, Cinangka, Kec. Sawangan, Ko..."
4997,28 Jl. Kemang 1 rt.02 rw.04 no.28 belakang war...,"Jl. Mawar Blok A. 3 No.52, RT.005/RW.011, Jati..."
4998,foresta cluster placido k1/7 bsd serpong,"Jl. Puspitek Blok 5. J No.20, Pagedangan, Kec...."


We convert the DataFrame to a Hugging Face `Dataset` so that we can use the Hugging Face libraries and models.

In [None]:
from datasets import Dataset, DatasetDict

dataset = Dataset.from_pandas(df_new)

In [None]:
dataset

Dataset({
    features: ['dirty_address', 'clean_address', '__index_level_0__'],
    num_rows: 4996
})

We split the dataset into training (90%), test (5%), and validation (5%) sets.

In [None]:
# 90% train, 10% test + validation
train_testvalid = dataset.train_test_split(test_size=0.1)
# Split the 10% test + valid in half test, half valid
test_valid = train_testvalid['test'].train_test_split(test_size=0.5)
# Gather the three splits into a single DatasetDict
split_datasets = DatasetDict({
    'train': train_testvalid['train'],
    'test': test_valid['test'],
    'validation': test_valid['train']})
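
As a sanity check on the resulting sizes, the arithmetic can be worked out by hand. The sketch assumes the test share is rounded up and the train split takes the remainder, which matches the 4496/250/250 split reported by the progress output further down for 4996 rows; `split_sizes` is a hypothetical helper, not part of the `datasets` API.

```python
import math

def split_sizes(n_rows, test_size):
    """Hypothetical helper: (train, test) sizes, rounding the test share up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

n_train, n_holdout = split_sizes(4996, 0.1)      # 90% train, 10% held out
n_valid, n_test = split_sizes(n_holdout, 0.5)    # held-out rows split in half

print(n_train, n_valid, n_test)
```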

Preprocessing data with Hugging Face starts with a tokenizer. The tokenizer splits text into sub-word units called *tokens* and maps each one to an integer ID, which is the form the model consumes. We will fine-tune the T5 model on this data and evaluate it with the BLEU metric, since our task resembles a translation task.

**Note**: To account for limited computational capacity, we use `t5-small`, the smallest T5 checkpoint (about 60 million parameters).

In [None]:
model_checkpoint = 't5-small'

In [None]:
from transformers import AutoTokenizer
from evaluate import load

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
metric = load('bleu')

In [None]:
metric.inputs_description

'\nComputes BLEU score of translated segments against one or more references.\nArgs:\n    predictions: list of translations to score.\n    references: list of lists of or just a list of references for each translation.\n    tokenizer : approach used for tokenizing `predictions` and `references`.\n        The default tokenizer is `tokenizer_13a`, a minimal tokenization approach that is equivalent to `mteval-v13a`, used by WMT.\n        This can be replaced by any function that takes a string as input and returns a list of tokens as output.\n    max_order: Maximum n-gram order to use when computing BLEU score.\n    smooth: Whether or not to apply Lin et al. 2004 smoothing.\nReturns:\n    \'bleu\': bleu score,\n    \'precisions\': geometric mean of n-gram precisions,\n    \'brevity_penalty\': brevity penalty,\n    \'length_ratio\': ratio of lengths,\n    \'translation_length\': translation_length,\n    \'reference_length\': reference_length\nExamples:\n\n    >>> predictions = ["hello ther
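
To make the metric concrete, here is a simplified, unsmoothed sentence-level BLEU in plain Python. It is an illustration only: it splits on whitespace rather than using the `tokenizer_13a` tokenizer, handles a single reference, and `simple_bleu` is a hypothetical name, so its scores will not match the `evaluate` implementation exactly.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(prediction, reference, max_order=4):
    """Simplified BLEU: clipped n-gram precisions, geometric mean, brevity penalty."""
    pred, ref = prediction.split(), reference.split()
    if not pred:
        return 0.0
    log_precisions = []
    for n in range(1, max_order + 1):
        pred_counts, ref_counts = ngrams(pred, n), ngrams(ref, n)
        # Clip each predicted n-gram count by its count in the reference
        overlap = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
        total = sum(pred_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, one empty order zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Penalise predictions shorter than the reference
    brevity_penalty = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return brevity_penalty * math.exp(sum(log_precisions) / max_order)

reference = "Jl. Cipaku II No.11, RT.11/RW.4, Petogogan"
print(simple_bleu(reference, reference))
print(simple_bleu("Petogogan", reference))
```

Because no smoothing is applied, a prediction with no matching 4-gram scores zero, which is one reason short or partially correct addresses can receive low BLEU values.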

We create a preprocessing function that tokenizes both the dirty and the clean addresses. The tokenized dirty addresses are the model inputs and the tokenized clean addresses are the labels; a seq2seq model needs both during training.

In [None]:
max_input_length = 256
max_target_length = 256

def preprocess_function(examples):
    # Setup the tokenizer for inputs
    inputs = [addr for addr in examples["dirty_address"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    labels = tokenizer(text_target=examples["clean_address"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [None]:
tokenized_datasets = split_datasets.map(preprocess_function, batched=True, remove_columns=split_datasets["train"].column_names)

Map:   0%|          | 0/4496 [00:00<?, ? examples/s]

Map:   0%|          | 0/250 [00:00<?, ? examples/s]

Map:   0%|          | 0/250 [00:00<?, ? examples/s]

## Fine-tuning the model

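We load T5 through `AutoModelForSeq2SeqLM`, which attaches the language-modeling head needed for text generation. Training is handled by `Seq2SeqTrainer`, which can generate predictions and compute metrics such as BLEU during evaluation.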

In [None]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Training arguments are stored in a `Seq2SeqTrainingArguments` object. We use a batch size of 4; common choices are 4, 8, 16, or 32, depending on available memory.

In [None]:
batch_size = 4
model_name = model_checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    "/content/drive/My Drive/Projects/Address Standardization/address-standardization-indonesia",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.05,
    save_total_limit=3,
    num_train_epochs=7,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)
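
The batch size and epoch count determine how many optimizer steps the run takes. A quick check, assuming a single device and no gradient accumulation (both are assumptions about the runtime), using the 4496 training examples from the tokenization step:

```python
import math

train_examples = 4496   # size of the training split
batch_size = 4
num_train_epochs = 7

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch, total_steps)
```

The total agrees with the `global_step=7868` reported at the end of training.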

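The data collator applies dynamic padding: every sequence in a batch is padded to the length of the longest sequence in that **batch**, rather than to a global maximum. It also pads the labels with `-100` so that padded positions are ignored by the loss.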

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
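
The idea behind dynamic padding can be sketched in plain Python. This is an illustration, not the `DataCollatorForSeq2Seq` implementation: `pad_batch` is a hypothetical helper, the token IDs are made up, and it assumes a pad token ID of 0 for inputs and `-100` for labels.

```python
def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]

# Made-up token IDs for a batch of two examples
input_ids = [[37, 12, 9, 1], [37, 1]]
labels = [[55, 8, 1], [55, 8, 21, 3, 1]]

padded_inputs = pad_batch(input_ids, pad_value=0)     # assumed pad token ID
padded_labels = pad_batch(labels, pad_value=-100)     # assumed label padding value

print(padded_inputs)
print(padded_labels)
```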

In [None]:
import numpy as np


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # In case the model returns more than the prediction logits
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100s in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    results = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": results["bleu"]}
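
The `-100` replacement step is easiest to see on a small array. A minimal sketch with made-up label IDs, assuming a pad token ID of 0 (the value is an assumption here; in the function above it comes from `tokenizer.pad_token_id`):

```python
import numpy as np

pad_token_id = 0  # assumed value for illustration

# Made-up label IDs; -100 marks padded positions
labels = np.array([[55, 8, 1, -100, -100],
                   [55, 8, 21, 3, 1]])

# Keep real IDs and swap every -100 for the pad token so decoding works
decodable = np.where(labels != -100, labels, pad_token_id)

print(decodable.tolist())
```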


In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Bleu
1,1.2613,1.111524,0.136211
2,1.2589,1.099608,0.134771
3,1.2181,1.085227,0.135734
4,1.2086,1.077921,0.134489
5,1.194,1.073573,0.133821
6,1.1608,1.06943,0.133962
7,1.192,1.067053,0.133821


TrainOutput(global_step=7868, training_loss=1.2209065082901072, metrics={'train_runtime': 1035.4753, 'train_samples_per_second': 30.394, 'train_steps_per_second': 7.598, 'total_flos': 396010111107072.0, 'train_loss': 1.2209065082901072, 'epoch': 7.0})