In [1]:
import pandas as pd
import re

In [None]:
sample_address_input_1 = {
    'Address': "A 406, Siddhivinayak Apartment, S. V. Road, Opp. Police station, Malad (west), Mumbai 400064"
}

In [None]:
sample_address_input_2 = {'address_line_1': 'C-302, Oberoi Splendour apartment'
                          , 'address_line_2': 'Jogeshwari Vikhroli Link Road, Andheri (E), Mumbai 400072'}

In [None]:
sample_address_input_3 = {
    'House': '1023, Rajesh Tower'
    , 'Address': 'Mahavir Nagar, Borivali West'
    , 'Landmark': 'near J B Kot School'
    , 'City': 'Mumbai'
    , 'State': 'Maharashtra'
    , 'Pincode': 400092
}

Assumptions
1. The data contains only Indian addresses.
2. Addresses from other countries can differ significantly in their general format, so they are not considered.
3. Only the three input forms shared in the assignment PDF (sampled above) are considered.
4. An existing database of addresses is available, in both the required parsed format and the raw format.
5. Area is ignored for now because it is unclear what it represents.


The following problems can come up while processing addresses:
- Misspelled words
- Missing or extra spaces
- Incorrect labels
- Use of abbreviations

Some preprocessing steps help with these:
- Replace common abbreviations with a standard name, e.g. "Street" instead of "St"
- Convert all addresses to lowercase
- Remove extra spaces

In [19]:
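def handle_common_abbreviations(address: str) -> str:
    # Keys are lowercase and have no trailing dot, so 'Opp.', 'opp' and 'OPP' all match.
    # Some entries are ambiguous (e.g. 'St' may mean 'Saint', 'Dr' may mean 'Doctor').
    abbreviations_dict = {
        'opp': 'Opposite',
        'st': 'Street',
        'ave': 'Avenue',
        'blvd': 'Boulevard',
        'rd': 'Road',
        'dr': 'Drive',
        'ln': 'Lane',
        'cir': 'Circle',
        'ct': 'Court',
        'ter': 'Terrace',
        'pl': 'Place',
        'pkwy': 'Parkway',
        'hwy': 'Highway',
    }
    words = address.split()
    for i, word in enumerate(words):
        # Separate trailing punctuation (e.g. 'Rd,' or 'Opp.') from the word itself.
        core = word.rstrip('.,')
        trailing = word[len(core):]
        replacement = abbreviations_dict.get(core.lower())
        if replacement:
            # Drop the abbreviation dot but keep a trailing comma, which the later splits rely on.
            words[i] = replacement + trailing.replace('.', '')
    return ' '.join(words)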

In [20]:
def convert_address_to_lowercase(address:str)->str:
    return address.lower()

In [21]:
def remove_extra_spaces(address: str) -> str:
    # Collapse runs of whitespace into one space and trim both ends.
    return re.sub(r'\s+', ' ', address).strip()

In [41]:
def preprocess_addresses(address: str) -> str:
    temp_address = handle_common_abbreviations(address)
    temp_address = convert_address_to_lowercase(temp_address)
    temp_address = remove_extra_spaces(temp_address)
    return temp_address

In [60]:
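def extract_house(address: str) -> str:
    # Assumes the first two comma-separated segments hold the flat number and building name.
    segments = [segment.strip() for segment in address.split(',')[:2]]
    return ', '.join(segments)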

In [61]:
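def extract_locality(address: str):
    # Assumes the third comma-separated segment is the locality; returns None if it is missing.
    segments = address.split(',')
    return segments[2].strip() if len(segments) > 2 else None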

In [83]:
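def extract_landmark(address: str):
    # Return the first comma-separated segment containing a landmark keyword, else None.
    # Expects a lowercased address, as produced by preprocess_addresses.
    keywords = {'near', 'opposite', 'behind', 'beside'}
    for segment in address.split(','):
        # Compare whole words so that e.g. 'linear' does not match 'near'.
        if keywords & set(segment.split()):
            return segment.strip()
    return None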

In [3]:
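def extract_city(address):
    # Note: this downloads the Wikipedia table on every call; cache the list when processing many rows.
    cities_data = pd.read_html('https://en.wikipedia.org/wiki/List_of_cities_in_India_by_population')[0]
    # Lowercase the names because the preprocessed address is lowercase.
    cities = {str(city).lower() for city in cities_data['City'].unique()}
    for segment in address.split(','):
        # Drop digits so that a segment such as 'mumbai 400064' still matches 'mumbai'.
        candidate = re.sub(r'\d+', '', segment).strip().lower()
        if candidate in cities:
            return candidate
    return None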

In [4]:
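def extract_state(address):
    # Note: this downloads the Wikipedia table on every call; cache the list when processing many rows.
    cities_data = pd.read_html('https://en.wikipedia.org/wiki/List_of_cities_in_India_by_population')[0]
    # Lowercase the names because the preprocessed address is lowercase.
    states = {str(state).lower() for state in cities_data['State'].unique()}
    for segment in address.split(','):
        # Drop digits so that a segment such as 'maharashtra 400064' still matches 'maharashtra'.
        candidate = re.sub(r'\d+', '', segment).strip().lower()
        if candidate in states:
            return candidate
    return None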

In [5]:
def process_address_of_single_entry_form(address):
    address = preprocess_addresses(address)
    house = extract_house(address)
    locality = extract_locality(address)
    landmark = extract_landmark(address)
    city = extract_city(address)
    state = extract_state(address)
    # Indian pincodes have exactly six digits; return None instead of raising when none is found.
    postal_code_match = re.search(r'\b(\d{6})\b', address)
    postal_code = postal_code_match.group(1) if postal_code_match else None
    country = 'India'
    return house, locality, landmark, city, state, postal_code, country
    

Pseudocode for processing addresses of type 1
- Apply the preprocessing steps above
- Extract each component (house, locality, landmark, city, state, postal code) from the preprocessed string

*Note: all functions are applied to the dataset through the pandas apply function*

Pseudocode for processing addresses of type 2
- Join the strings
- Apply similar operations as done for addresses of type 1
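
A minimal sketch of the join step, assuming the keys are named `address_line_1` and `address_line_2` as in `sample_address_input_2`. The helper name `join_address_lines` is hypothetical; its output would then go through the same processing as a type 1 address.

```python
def join_address_lines(record: dict) -> str:
    """Join the non-empty address_line_* values into one comma-separated string."""
    # Sorting the keys puts address_line_1 before address_line_2
    # (assumes fewer than ten lines, since the sort is lexical).
    keys = sorted(key for key in record if key.startswith("address_line_"))
    # Trim whitespace and stray commas so the join does not produce ',,'.
    parts = [str(record[key]).strip().strip(",").strip() for key in keys]
    return ", ".join(part for part in parts if part)


sample = {
    "address_line_1": "C-302, Oberoi Splendour apartment",
    "address_line_2": "Jogeshwari Vikhroli Link Road, Andheri (E), Mumbai 400072",
}
joined = join_address_lines(sample)
print(joined)
```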

Pseudocode for processing addresses of type 3
- Read the fields directly, as the data is already split into components
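
A minimal sketch of this step, assuming the input keys match `sample_address_input_3`. The mapping `FIELD_MAP` and the function name are hypothetical, and treating the `Address` key as the locality is an assumption.

```python
# Hypothetical mapping from the type 3 input keys to the output field names.
FIELD_MAP = {
    "House": "house",
    "Address": "locality",  # assumption: this key holds the locality
    "Landmark": "landmark",
    "City": "city",
    "State": "state",
    "Pincode": "postal_code",
}


def process_structured_address(record: dict) -> dict:
    """Copy the labelled fields into the output format, normalising whitespace and case."""
    result = {field: None for field in FIELD_MAP.values()}
    for key, field in FIELD_MAP.items():
        value = record.get(key)
        if value is not None:
            # str() handles a numeric pincode; split/join collapses extra spaces.
            result[field] = " ".join(str(value).split()).lower()
    result["country"] = "India"
    return result


parsed = process_structured_address({
    "House": "1023, Rajesh Tower",
    "Address": "Mahavir Nagar, Borivali West",
    "Landmark": "near J B Kot School",
    "City": "Mumbai",
    "State": "Maharashtra",
    "Pincode": 400092,
})
print(parsed)
```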

Pseudocode for the whole processing flow
- Apply the functions to a given dataset (let's assume it is a pandas DataFrame)
- Add one column for each required field using the pandas apply function
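
The flow above can be sketched as follows. This is an illustration only: the column names `raw_address`, `clean_address` and `postal_code` and the helper `extract_pincode` are hypothetical, and only one output field is derived to keep the example short.

```python
import re

import pandas as pd


def extract_pincode(address: str):
    """Return the first standalone 6-digit number in the address, or None."""
    match = re.search(r"\b(\d{6})\b", address)
    return match.group(1) if match else None


df = pd.DataFrame({
    "raw_address": [
        "A 406, Siddhivinayak Apartment, S. V. Road, Malad (west), Mumbai 400064",
        "240 A dum dum park KOLkata 55",
    ]
})

# One apply call per derived column: first the cleaned text, then a parsed field.
df["clean_address"] = df["raw_address"].apply(
    lambda address: re.sub(r"\s+", " ", address).strip().lower()
)
df["postal_code"] = df["clean_address"].apply(extract_pincode)
print(df[["clean_address", "postal_code"]])
```

The second row has no valid 6-digit pincode, so its `postal_code` is left missing rather than guessed.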

To handle misspelled cities, for example, we could use the following function:

In [None]:
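import difflib

def get_correct_cities(misspelled_city, indian_cities_list):
    closest_cities = difflib.get_close_matches(misspelled_city, indian_cities_list, n=1, cutoff=0.6)
    # Return None when no city is similar enough, instead of raising an IndexError.
    corrected_city = closest_cities[0] if closest_cities else None
    return corrected_city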

Similar functions can be used for the other components as well.

If we have existing addresses, we could fuzzy match against them as shown below. This would handle most spelling-related issues.

In [None]:
# Algorithm 2
# Using fuzzy matching
from fuzzywuzzy import fuzz
# The addresses must be plain strings (not sets) for fuzz.token_set_ratio.
address_1 = '240/A Dum Dum Park, Kolkata 700055'
address_2 = '240 A dum dum park KOLkata 55'

In [None]:
similarity_score = fuzz.token_set_ratio(address_1, address_2)
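
To match a raw address against the existing database, the same idea can be sketched with only the standard library. This swaps fuzzywuzzy for `difflib`, so the scores will differ from `fuzz.token_set_ratio`; the helper names `normalize` and `find_best_match` and the cutoff of 0.6 are assumptions for illustration.

```python
import difflib
import re


def normalize(address: str) -> str:
    """Lowercase, replace punctuation with spaces, and sort the tokens."""
    tokens = re.sub(r"[^a-z0-9]+", " ", address.lower()).split()
    return " ".join(sorted(tokens))


def find_best_match(raw_address, known_addresses, cutoff=0.6):
    """Return the known address most similar to raw_address, or None below the cutoff."""
    # Map each normalised form back to the original known address.
    normalized = {normalize(known): known for known in known_addresses}
    matches = difflib.get_close_matches(
        normalize(raw_address), list(normalized), n=1, cutoff=cutoff
    )
    return normalized[matches[0]] if matches else None


known = [
    "240/A Dum Dum Park, Kolkata 700055",
    "1023, Rajesh Tower, Mahavir Nagar, Borivali West, Mumbai 400092",
]
best = find_best_match("240 A dum dum park KOLkata 55", known)
print(best)
```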

We could also use NLP models or the Google Maps Geocoding API to solve these problems
1. Geocoding API
- The Geocoding API converts an address into latitude and longitude coordinates
- Addresses that resolve to the same coordinates can then be collapsed together
- This can also handle spelling-related issues
2. NLP
- One common approach to address component extraction is named entity recognition (NER), a subtask of natural language processing that identifies and classifies entities such as people, organizations, and locations in text.
- To extract address components, we would first train a NER model on a dataset of labeled addresses. The dataset needs to cover a variety of formats, as addresses can be written in many different ways. Once trained, the model can identify and extract address components from new text.

In some cases the pincode can also be used to infer the state or city, as pincodes follow a fixed format

Ref: https://en.wikipedia.org/wiki/Postal_Index_Number#Postal_zones

In [None]:
# General info
# A typical Indian pincode has 6 digits:
# 1. Zone (1st digit)
# 2. Sub-zone (2nd digit)
# 3. Sorting district (3rd digit, read together with the first two)
# 4. Post office (last 3 digits)
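
A small sketch of how this structure could be used, assuming the first-digit-to-region grouping listed on the Wikipedia page referenced above. The names `PIN_REGIONS` and `parse_pincode` are hypothetical. Resolving an exact state or city would need a fuller lookup table than the one shown here.

```python
import re

# First digit of the pincode -> postal region (grouping taken from the referenced page).
PIN_REGIONS = {
    "1": "Northern", "2": "Northern",
    "3": "Western", "4": "Western",
    "5": "Southern", "6": "Southern",
    "7": "Eastern", "8": "Eastern",
    "9": "Army Postal Service",
}


def parse_pincode(pincode):
    """Split a 6-digit pincode into its parts, or return None if it is malformed."""
    text = str(pincode).strip()
    # A valid pincode has six digits and does not start with 0.
    if not re.fullmatch(r"[1-9]\d{5}", text):
        return None
    return {
        "zone": text[0],
        "sub_zone": text[:2],
        "sorting_district": text[:3],
        "post_office": text[3:],
        "region": PIN_REGIONS[text[0]],
    }


print(parse_pincode(400092))
print(parse_pincode("55"))
```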

Issues that can come up while processing

- Missing locality
- Missing city, state or zip code
- Incorrect formatting
- Unrecognized abbreviations
- Typographical errors

Possible solutions

- Using latitude and longitude instead
- Using the pincode to infer the state or city
- Using existing or third-party data sources to validate addresses
