<a href="https://www.kaggle.com/code/jayeshdahiwale/loginext-case-study?scriptVersionId=136727753" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# About LogiNext:

### LogiNext is a global technology firm that offers a SaaS-based Delivery Automation Platform. The software helps brands across Food & Beverage, Courier, Express and Parcel, eCommerce & Retail, and Transportation (3PLs, 4PLs, etc.) to digitize, optimize and automate deliveries across the supply chain. Growing at an average rate of 120% YoY, LogiNext has 200+ enterprise clients in 50+ countries, with headquarters in New Jersey, USA and regional offices in Mumbai, Jakarta, Delhi and Dubai. We’re backed by 49.5 million dollars across three rounds of private equity investment from Tiger Global Management, Steadview Capital and Alibaba Group of companies. The majority of our workforce is in Mumbai, and we’re a bunch of people interested in technology and working at the forefront of innovation in the logistics automation industry. With smaller teams distributed across the globe, the entire team gets together during our annual workation. The vision of building a global enterprise company and going public is what drives us to achieve more, every day!


## Problem Statement
*Incorrect addresses are one of the critical challenges in order delivery for eCommerce and
courier companies. In this case study we explore how to convert unreliable address
information into a reliable base.


Some ecommerce portals allow addresses to be input in one single line while some portals have multiple
address lines for different address components. The system may receive input address in various formats
as listed below.


Example1:
Address: “A 406, Siddhivinayak Apartment, S. V. Road, Opp. Police station, Malad (west), Mumbai 400064”


Example2:
Address Line1: “C-302, Oberoi Splendour apartment”

Address Line2: “Jogeshwari Vikhroli Link Road, Andheri (E), Mumbai 400072”


Example3:
House: “1023, Rajesh Tower”,
Address: “Mahavir Nagar, Borivali West”,
Landmark: “near J B Kot School”,
City: “Mumbai”,
State: “Maharashtra”
Pincode: 400092


All inputs are in JSON format as key-value pairs.
The goal of the system is to translate the given inputs into the following fields:*****
- House
- Locality
- Landmark
- Area
- City
- State
- Country
- Pincode/Zipcode
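
The target fields listed above can be pictured as one record per address. The following is a minimal sketch, assuming a hypothetical `StructuredAddress` dataclass; the class name is not part of the case study, and the hand-made split of Example 3 into locality and area (as well as the country value) is an assumption made only for illustration.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class StructuredAddress:
    """Hypothetical container for the eight output fields listed above."""
    house: Optional[str] = None
    locality: Optional[str] = None
    landmark: Optional[str] = None
    area: Optional[str] = None
    city: Optional[str] = None
    state: Optional[str] = None
    country: Optional[str] = None
    pincode: Optional[str] = None


# Example 3 from the problem statement, mapped by hand for illustration only.
# The locality/area split and the country are assumptions, not given in the input.
example = StructuredAddress(
    house="1023, Rajesh Tower",
    locality="Mahavir Nagar",
    landmark="near J B Kot School",
    area="Borivali West",
    city="Mumbai",
    state="Maharashtra",
    country="India",
    pincode="400092",
)
record = asdict(example)
print(record)
```

Keeping the pincode as a string avoids losing information if the value is later compared with text extracted from an address.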

# Github Link

https://github.com/Jayeshdahiwale/Loginext_Case_Study

# Please visit this link to access the live project - https://www.kaggle.com/jayeshdahiwale/loginext-case-study

In [1]:
!pip install fuzzywuzzy



In [2]:
# importing the important libraries
import json
import pandas as pd
import string
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from fuzzywuzzy import fuzz
import re



# The dataset is taken from a public government resource.

The link is : https://data.gov.in/resource/villagelocality-based-pin-mapping-16th-march-2017

In [3]:
path = '/kaggle/input/village-locations/Locality_village_pincode_final_mar-2017.csv'
df = pd.read_csv(path, encoding='latin-1')

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 906267 entries, 0 to 906266
Data columns (total 6 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   Village/Locality name   906267 non-null  object
 1   Officename ( BO/SO/HO)  906267 non-null  object
 2   Pincode                 906267 non-null  int64 
 3   Sub-distname            906267 non-null  object
 4   Districtname            906267 non-null  object
 5   StateName               906267 non-null  object
dtypes: int64(1), object(5)
memory usage: 41.5+ MB


In [5]:
# add a 1-based serial number column
df['Sr.No.'] = np.arange(1, len(df) + 1)

In [6]:
# removing the 'H.O', 'S.O' and 'B.O' office-type suffixes
df['Officename ( BO/SO/HO)'] = df['Officename ( BO/SO/HO)'].apply(lambda x:x.replace('H.O','').replace('S.O','').replace('B.O',' '))

In [7]:


# function to replace direction abbreviations with the full direction names
# (the parameter is called `text` so it does not shadow the imported `string` module)
def replace_directions(text):
    directions = {'(E)':'east','(e)':'east','(N)':'north','(n)':'north','(W)':'west','(w)':'west','(S)':'south','(s)':'south'}
    for key, value in directions.items():
        text = text.replace(key, value)
    return text

# split on punctuation and rejoin the non-empty pieces with spaces
def split_punct(text):
    splitted = re.split(r'[^\w\s]', text)
    splitted = ' '.join([s for s in splitted if s.strip()])
    return splitted
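
To see what these two helpers do to one of the sample addresses, here is a small self-contained sketch. It restates the same steps in a hypothetical `normalize` function (not part of the notebook) and additionally collapses repeated spaces, which splitting on punctuation can leave behind when a comma is followed by a space.

```python
import re

# Lowercase keys only, because the text is lowercased first.
DIRECTIONS = {"(e)": "east", "(w)": "west", "(n)": "north", "(s)": "south"}


def normalize(text):
    """Lowercase, expand bracketed direction abbreviations, drop punctuation."""
    text = text.lower()
    for abbreviation, direction in DIRECTIONS.items():
        text = text.replace(abbreviation, direction)
    # Split on anything that is neither a word character nor whitespace.
    pieces = re.split(r"[^\w\s]", text)
    # split() with no argument collapses the doubled spaces left by ", ".
    return " ".join(" ".join(pieces).split())


cleaned = normalize("Jogeshwari Vikhroli Link Road, Andheri (E), Mumbai 400072")
print(cleaned)
```

A spelled-out direction such as `(west)` is not in the lookup table, but it still comes out clean because the brackets are removed by the punctuation split.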

In [8]:
# Cleaning every text column (numeric columns are skipped)
numeric_cols = df.describe().columns
# translation table that strips any punctuation left after splitting
translator = str.maketrans("", "", string.punctuation)
for col in df.columns:
    if col not in numeric_cols:
        df[col] = df[col].apply(lambda x: x.lower())
        df[col] = df[col].apply(replace_directions)
        df[col] = df[col].apply(split_punct)
        # Remove remaining punctuation using the translation table
        df[col]= df[col].apply(lambda x: x.translate(translator))

In [9]:
# let's make a tags column from ['Village/Locality name', 'Officename ( BO/SO/HO)', 'Sub-distname', 'Districtname', 'StateName']
df['tags'] = df['Village/Locality name']+' ' + df['Officename ( BO/SO/HO)']+ " " + df['Sub-distname'] + " "+ df['Districtname']+' '+df['StateName']

In [10]:
# remove duplicate words within each tags string (note: set() does not preserve word order)
df['tags'] = df['tags'].apply(lambda x: ' '.join(set(x.split())))

In [11]:
df.head()

Unnamed: 0,Village/Locality name,Officename ( BO/SO/HO),Pincode,Sub-distname,Districtname,StateName,Sr.No.,tags
0,aliganj,lodi road,110003,defence colony,south east delhi,delhi,1,colony south defence aliganj road east delhi lodi
1,kasturba nagar,lodi road,110003,defence colony,south east delhi,delhi,2,colony south defence kasturba road east nagar ...
2,jeewan nagar,jungpura,110014,defence colony,south east delhi,delhi,3,colony jungpura south defence east nagar jeewa...
3,tehkhand,okhla industrial estate,110020,defence colony,south east delhi,delhi,4,industrial colony estate south defence tehkhan...
4,zakir nagar so,new friends colony,110025,defence colony,south east delhi,delhi,5,delhi colony so friends south defence east nag...


In [12]:
#total unique terms in tags
unique_words = set(df['tags'].str.cat(sep=' ').split())
total_unique_words = len(unique_words)
print(total_unique_words)

457686


In [13]:
df = df.reset_index(drop=True)


In [14]:
df.head()

Unnamed: 0,Village/Locality name,Officename ( BO/SO/HO),Pincode,Sub-distname,Districtname,StateName,Sr.No.,tags
0,aliganj,lodi road,110003,defence colony,south east delhi,delhi,1,colony south defence aliganj road east delhi lodi
1,kasturba nagar,lodi road,110003,defence colony,south east delhi,delhi,2,colony south defence kasturba road east nagar ...
2,jeewan nagar,jungpura,110014,defence colony,south east delhi,delhi,3,colony jungpura south defence east nagar jeewa...
3,tehkhand,okhla industrial estate,110020,defence colony,south east delhi,delhi,4,industrial colony estate south defence tehkhan...
4,zakir nagar so,new friends colony,110025,defence colony,south east delhi,delhi,5,delhi colony so friends south defence east nag...


In [15]:
# total number of unique words in tags
unique_words = set(df['tags'].str.cat(sep=' ').split())
total_unique_words = len(unique_words)
print(total_unique_words)

457686


In [16]:
#lets split the dataset on the basis of districtname
district_splits = {district: df[df['Districtname'] == district].copy() for district in df['Districtname'].unique()}
print('Total unique districts:',len(list(set(df['Districtname'].values))))

Total unique districts: 614


In [17]:
# keep the districts whose tags contain fewer than 15000 unique words
# (any district at or above that limit is printed instead)
districts = []
for district in district_splits.keys():
    unique_words = set(district_splits[district]['tags'].str.cat(sep=' ').split())
    total_unique_words = len(unique_words)
    if total_unique_words < 15000:
        districts.append(district)
    else:
        print(district)

len(districts)

614

In [18]:


# create one TF-IDF vectorizer and one document matrix per district
tfidf_vectors_dict = dict()

for district in districts:
    # create the TF-IDF vectorizer object
    tfidf = TfidfVectorizer(lowercase=False, max_features = 15000)   # cap the vocabulary at 15000 terms to prevent the system from crashing
    # convert the sparse matrix into a dense array for the similarity search
    vector = tfidf.fit_transform(district_splits[district]['tags']).toarray()
    tfidf_vectors_dict[district] = [tfidf,vector]




In [19]:
#lets fetch any district name
district_splits['bhandara']

Unnamed: 0,Village/Locality name,Officename ( BO/SO/HO),Pincode,Sub-distname,Districtname,StateName,Sr.No.,tags
361508,pahuni,khat,441106,mohadi,bhandara,maharashtra,361509,bhandara khat maharashtra pahuni mohadi
364325,salai kh,bhivkhidki,441702,sakoli,bhandara,maharashtra,364326,sakoli kh salai bhandara bhivkhidki maharashtra
364421,salai bk,bhivkhidki,441702,sakoli,bhandara,maharashtra,364422,sakoli bk salai bhandara bhivkhidki maharashtra
364435,kesalwada,bhivkhidki,441702,sakoli,bhandara,maharashtra,364436,sakoli bhandara bhivkhidki kesalwada maharashtra
364525,amgaon bk,amgaon bk,441802,sakoli,bhandara,maharashtra,364526,sakoli bk bhandara maharashtra amgaon
...,...,...,...,...,...,...,...,...
366003,shrinagar,pahela,441924,bhandara,bhandara,maharashtra,366004,maharashtra pahela bhandara shrinagar
366004,silli,silli,441924,bhandara,bhandara,maharashtra,366005,silli maharashtra bhandara
366005,sitepar,matora,441924,bhandara,bhandara,maharashtra,366006,matora sitepar maharashtra bhandara
366006,wakeshwar,pahela,441924,bhandara,bhandara,maharashtra,366007,pahela maharashtra wakeshwar bhandara


# Testing the Recommendation System based on Cosine Similarity

In [20]:
string_to_compare = 'a 406 siddhivinayak apartment s v road opp police station malad west mumbai 400064'
# Transform the string to a TF-IDF vector
string_vector = tfidf_vectors_dict['mumbai'][0].transform([string_to_compare])

# Calculate the cosine similarity between the string vector and matrix vectors
from sklearn.metrics.pairwise import cosine_similarity

similarities = cosine_similarity(string_vector,tfidf_vectors_dict['mumbai'][1] )

# # Get the index of the vector with the highest similarity
most_similar_index = np.argmax(similarities)

print("Most similar document index:", most_similar_index)
similarity_score = similarities[0, most_similar_index]
print(f"Similarity score is: {similarity_score}")
temp =district_splits['mumbai'].reset_index(drop= True)
temp.iloc[most_similar_index]

Most similar document index: 54
Similarity score is: 0.48793802389885704


Village/Locality name                                 mumbai
Officename ( BO/SO/HO)                      malad west dely 
Pincode                                               400064
Sub-distname                                          mumbai
Districtname                                          mumbai
StateName                                        maharashtra
Sr.No.                                                321234
tags                      mumbai dely maharashtra west malad
Name: 54, dtype: object
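
The ranking idea behind this test can be sketched without any third-party library. The snippet below is a simplified stand-in and not the notebook's method: it uses raw word counts instead of TF-IDF weights, and the three candidate tag strings are made up for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical tag strings standing in for rows of the Mumbai split.
candidates = [
    "mumbai maharashtra andheri east",
    "mumbai maharashtra malad west dely",
    "mumbai maharashtra borivali west",
]
query = Counter("opp police station malad west mumbai 400064".split())

scores = [cosine(query, Counter(tags.split())) for tags in candidates]
best_index = max(range(len(scores)), key=scores.__getitem__)
print(best_index, candidates[best_index])
```

The row that shares the most words with the query wins, which is why words such as the house number or the pincode, which never occur in the tags, add nothing to the match.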

# Testing the recommendation system based on Manhattan Distance

In [21]:
# implementing the recommendation system on the basis of Manhattan distance
from scipy.spatial.distance import cityblock


# String to compare
string_to_compare = 'a 406 siddhivinayak apartment s v road opp police station malad west mumbai 400064'
string_vector = tfidf_vectors_dict['mumbai'][0].transform([string_to_compare])

# Calculate the Manhattan distances
distances = [cityblock(string_vector.toarray().flatten(), vector) for vector in tfidf_vectors_dict['mumbai'][1]]

# Get the index of the vector with the minimum distance
most_similar_index = np.argmin(distances)

print("Most similar document index:", most_similar_index)
temp =district_splits['mumbai'].reset_index(drop= True)
temp.iloc[most_similar_index]

Most similar document index: 54


Village/Locality name                                 mumbai
Officename ( BO/SO/HO)                      malad west dely 
Pincode                                               400064
Sub-distname                                          mumbai
Districtname                                          mumbai
StateName                                        maharashtra
Sr.No.                                                321234
tags                      mumbai dely maharashtra west malad
Name: 54, dtype: object

In [22]:
# Function for truncating the input address: it keeps only the words that follow
# the first street-type keyword (road, station, lane, ...)
def truncate_address(lst):
    lst = lst.split()
    address_words = [
    "road","station", "street", "lane", "avenue", "square", "circle",
    "park", "way", "terrace", "close", "crescent", "gardens", "heights", "manor", "path", "plaza"
]
    for i,word in enumerate(lst):
        if word in address_words:
            return ' '.join(lst[i+1:])
    return ' '.join(lst)
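
A quick check of how this truncation behaves on the two cleaned sample addresses. The logic is repeated under a hypothetical name, `truncate_after_first_keyword`, only so that the snippet runs on its own.

```python
ADDRESS_WORDS = {
    "road", "station", "street", "lane", "avenue", "square", "circle",
    "park", "way", "terrace", "close", "crescent", "gardens", "heights",
    "manor", "path", "plaza",
}


def truncate_after_first_keyword(address):
    """Return the words that follow the first street-type keyword."""
    words = address.split()
    for i, word in enumerate(words):
        if word in ADDRESS_WORDS:
            return " ".join(words[i + 1:])
    # No keyword found: the address is returned unchanged.
    return " ".join(words)


first = truncate_after_first_keyword(
    "a 406 siddhivinayak apartment s v road opp police station malad west mumbai 400064"
)
second = truncate_after_first_keyword(
    "c 302 oberoi splendour apartment jogeshwari vikhroli link road andheri east mumbai 400072"
)
print(first)   # opp police station malad west mumbai 400064
print(second)  # andheri east mumbai 400072
```

Because the cut happens at the first keyword only, a later keyword such as `station` stays in the truncated text, as the first result shows.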

In [23]:
# Now our next step is to clean these inputs
import string
import re
def remove_duplicates(string):
    words = string.split()
    seen = set()
    unique_words = []
    for word in words:
        if word not in seen:
            seen.add(word)
            unique_words.append(word)
    return ' '.join(unique_words)

def clean_address(address_lst):
    cleaned_address = []
    # clean each address in the list
    for address in address_lst:
        # replace the direction abbreviations with the full direction
        address = replace_directions(address)
        # split on punctuation and rejoin the non-empty pieces
        address = re.split(r'[^\w\s]', address)
        address = ' '.join([s for s in address if s.strip()])

        # remove any remaining punctuation from the address
        translator = str.maketrans("", "", string.punctuation)

        # Remove punctuation using the translation table
        text_without_punctuation = address.translate(translator)

        # lowercase every word
        tokens = ' '.join(map(lambda word: word.lower(), text_without_punctuation.split(' ')))

        # drop repeated words (this also collapses repeated spaces)
        tokens = remove_duplicates(tokens)
        # append the cleaned tokens to the cleaned address list
        cleaned_address.append(tokens)
        
    return cleaned_address

In [24]:
# To extract the structured address we first have to fetch the pincode from the address,
# because the pincode narrows the search down to a handful of localities
# get the pincode from the string
import re

def extract_pincode(text):
    pattern = r'\b\d{6}\b'  # Regex pattern to match a 6-digit number

    # Search for the pincode in the text
    matches = re.findall(pattern, text)

    if matches:
        return matches[0]  # Return the first match found
    else:
        return None  # Return None if no pincode is found
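
One possible refinement, offered as a sketch and not as the notebook's method: Indian PIN codes do not begin with 0, and in free-text addresses the pincode usually comes last, so a hypothetical `extract_last_pincode` could tighten the pattern and prefer the final match. Both choices are assumptions about the input data.

```python
import re

# Six digits, first digit 1-9, not embedded in a longer run of digits.
PINCODE_PATTERN = re.compile(r"(?<!\d)[1-9]\d{5}(?!\d)")


def extract_last_pincode(text):
    """Return the last pincode-like token in text, or None if there is none."""
    matches = PINCODE_PATTERN.findall(text)
    return matches[-1] if matches else None


print(extract_last_pincode("plot 110045 sector 9 malad west mumbai 400064"))
print(extract_last_pincode("ashok nagar layout adyal pauni bhandara"))
```

Taking the last match keeps a six-digit plot or phone fragment earlier in the address from being mistaken for the pincode.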


In [25]:
# Once we have the input, we fetch the data about the locality.
# The primary source is the Postal API of India; mapping against the DataFrame is the other option.
# We use the DataFrame only when the pincode cannot be obtained from the input

# function to get the information using pincode
import requests
def get_data_using_pincode(pincode):
    url = f"https://api.postalpincode.in/pincode/{pincode}"
    # a timeout keeps the notebook from hanging if the service does not respond
    response = requests.get(url, timeout=10)

    if response.status_code == 200:
        data = response.json()
        # Process the data as per your requirements
        return data[0]['PostOffice']
    else:
        print("Error occurred:", response.status_code)
        return None
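
The function above indexes `data[0]['PostOffice']` directly. For a pincode the service does not know, that field may be empty instead of a list; this is an assumption about the API's error payload and is not shown in this notebook. A defensive parser, sketched here as a hypothetical `parse_post_offices` helper that works on the already-decoded JSON, would return an empty list in that case so that later loops do not fail.

```python
def parse_post_offices(payload):
    """Return the list of post-office dicts from a decoded API response.

    Falls back to an empty list when the payload is missing, malformed,
    or carries no post offices.
    """
    if not isinstance(payload, list) or not payload:
        return []
    first = payload[0]
    if not isinstance(first, dict):
        return []
    offices = first.get("PostOffice")
    return offices if isinstance(offices, list) else []


# Illustrative payloads; only the 'PostOffice' and 'Name' keys are taken from the notebook.
ok_payload = [{"PostOffice": [{"Name": "Sakinaka"}, {"Name": "Vihar Road"}]}]
empty_payload = [{"PostOffice": None}]

print([office["Name"] for office in parse_post_offices(ok_payload)])
print(parse_post_offices(empty_payload))
```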

In [26]:
# getting the area is one of the most difficult tasks
def get_locality_block_city_state_country(input_address, post_info, data):
    
    input_address = truncate_address(input_address)
  #  get the area name from text
  # first check on the basis of postoffice name
    match_confidence = float('-inf')
    locality_name = None
    locality_length = float('-inf')
  #merge the Block and Postoffice name
  
    localities = list(post_info['Name'])
    for locality in localities:
        if len(locality) < locality_length:
            continue
        if locality in input_address:
            locality_name = locality
            locality_length = len(locality_name)
            match_confidence = 100
        else:
            locality_len = len(locality.split(' '))
            split_input = input_address.split(' ')
            for i in range(len(split_input)):
                match_score = fuzz.ratio(locality.split(' ')[0], split_input[i])
                if match_score >=80 and match_score >= match_confidence:
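                    # compare the whole locality with an address window of the same number of words
                    window_fits = (i + locality_len) <= len(split_input)
                    supplementary = fuzz.ratio(locality," ".join(split_input[i:i+locality_len])) if window_fits else None
                    if supplementary is not None and supplementary >= 85: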
                        locality_name = locality
                        locality_length = len(locality_name)
                        match_confidence = supplementary
                    elif len(locality.split(' ')[0]) > locality_length:
                        locality_name =locality.split(' ')[0]
                        locality_length = len(locality_name)
                        match_confidence = match_score
            
    # now we have to take the block(area) which is corresponding to locality
    match_locality_score = float('-inf')
    for post in data:
        score = fuzz.ratio(locality_name,post['Name'])
        if score > match_locality_score:
            block = post['Block']
            city = post['District']
            state = post['State']
            country = post['Country']
            pincode = str(post['Pincode'])
            match_locality_score = score
        
    
    
    return [locality_name,block,city,state,country,pincode]

In [27]:
# If the locality comes back as None, the pincode may be wrong or the data
# corresponding to that pincode may be missing.
# Our second option is to look into the pandas DataFrame
def check_if_not_found_through_api(pin_code_info):
    locality = None
    pincode = pin_code_info[1]
    office_names = df[df['Pincode']==int(pincode)]['Officename ( BO/SO/HO)'].values
    locality_names = df[df['Pincode']==int(pincode)]['Village/Locality name'].values
    max_score = 65
    row_num =None
    address = truncate_address(pin_code_info[0])
    address_split = address.split(' ')
    
    for i,office_name in enumerate(office_names):
        office_len = len(office_name.strip().split(' '))
        
        if len(address_split) > office_len:
            check_locality = []
            for j in range(len(address_split)-office_len+1):
                check_locality.append(' '.join(address_split[j:j+office_len]))
        else:
            check_locality = [address]
        
        for place in check_locality:
            score = fuzz.ratio(office_name,place)
            if score >= max_score:
                locality = office_name
                max_score = score
                row_num = i
    if row_num == None:           
        for i,locality_name in enumerate(locality_names):

            locality_len = len(locality_name.strip().split(' '))
            if len(address_split) > locality_len:
                check_locality = []
                for j in range(len(address_split)-locality_len+1):
                    check_locality.append(' '.join(address_split[j:j+ locality_len]))
            else:
                check_locality = [address]
            for place in check_locality:
                score = fuzz.ratio(locality_name,place)
                if score >= max_score:
                    locality = locality_name
                    max_score = score
                    row_num = i

               
    if locality != None:
        status = df[df['Pincode']==int(pincode)].values[row_num]
        return [locality,status[3],status[4],status[5],'india',str(status[2])]
    else:
        return None
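
Both loops above rely on the same idea: slide a window with as many words as the candidate name over the address and keep the best fuzzy score. A minimal sketch of that idea follows, using `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio` (scores may differ slightly from fuzzywuzzy's). `best_window_match` is a hypothetical helper name, and the misspelt address is made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity on a 0-100 scale, analogous to fuzz.ratio."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_window_match(candidate, address):
    """Return (window, score) for the address window most similar to candidate."""
    words = address.split()
    size = len(candidate.split())
    if len(words) <= size:
        # The address is no longer than the candidate: compare it as a whole.
        windows = [address]
    else:
        windows = [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]
    scored = ((window, similarity(candidate, window)) for window in windows)
    return max(scored, key=lambda pair: pair[1])


window, score = best_window_match("andheri east", "andheri eest mumbai 400072")
print(window, score)
```

Comparing like-sized windows keeps a long address from diluting the score of a short locality name, which is what makes a fixed threshold such as 65 workable.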

In [28]:
df.columns

Index(['Village/Locality name', 'Officename ( BO/SO/HO)', 'Pincode',
       'Sub-distname', 'Districtname', 'StateName', 'Sr.No.', 'tags'],
      dtype='object')

In [29]:
# When the lookup through the DataFrame does not give the correct address for the pincode,
# we apply the recommender system and use the similarity score to fetch the closest row from the dataset.
# We use this when the pincode is not available or when the pincode does not lead to the correct data
def get_address_using_recommender_system(input_address):
    
    input_address = truncate_address(input_address)
    # as we have made the different database copies for every district
    unique_districts = list(set(df['Districtname'].values))
    
    # iterate through the districts and check whether one of them is present in the input address
    match_threshold = 65
    matched_district = None
    for district in unique_districts:
        # first split the district because district can be of two or three word
        len_dist = len(district.split(' '))
        
        # lets split the sentence into the strings with length len_dist
        splitted_input = input_address.split(' ')
        check_district = []
        for i in range(len(splitted_input)-len_dist+1):
            check_district.append(' '.join(splitted_input[i:i+len_dist]))
        for j in check_district:
            score = fuzz.ratio(j, district)
            if score > match_threshold:
                matched_district = district
                match_threshold = score
    # if we get a matched district we will run the recommender system on that district only
    # we will use cosine similarity to get the closest possible row
    if matched_district:
        # Transform the input to  a TF-IDF vector
        address_vector = tfidf_vectors_dict[matched_district][0].transform([input_address])
        similarities = cosine_similarity(address_vector,tfidf_vectors_dict[matched_district][1] )
        # # Get the index of the vector with the highest similarity
        most_similar_index = np.argmax(similarities)
        
        ## Get the similarity score
        similarity_score = similarities[0, most_similar_index]
        temp =district_splits[matched_district].reset_index(drop= True)
        result = temp.iloc[most_similar_index]
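        # use the post office name when the locality is just the district name
        locality = result.iloc[1] if result.iloc[0] == result.iloc[4] else result.iloc[0]
        
        return [locality,result.iloc[3],result.iloc[4],result.iloc[5],'india',str(result.iloc[2])]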
    else:
        match_score = float('-inf')
        matched_district = None
        index = None
        truncate_the_address = ' '.join(splitted_input)
        for district in unique_districts:
            # Transform the input to  a TF-IDF vector
            address_vector = tfidf_vectors_dict[district][0].transform([truncate_the_address])
            similarities = cosine_similarity(address_vector,tfidf_vectors_dict[district][1] )
            # # Get the index of the vector with the highest similarity
            most_similar_index = np.argmax(similarities)

            ## Get the similarity score
            similarity_score = similarities[0, most_similar_index]
            if similarity_score > match_score:
                matched_district = district
                match_score = similarity_score
                index = most_similar_index
        temp =district_splits[matched_district].reset_index(drop= True)
        result = temp.iloc[index]
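        # use the post office name when the locality is just the district name
        locality = result.iloc[1] if result.iloc[0] == result.iloc[4] else result.iloc[0]
        return [locality,result.iloc[3],result.iloc[4],result.iloc[5],'india',str(result.iloc[2])]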
        

In [30]:
def get_landmark_house(address,location_info):
    # initialise house and landmark as None
    house = None
    landmark = None

    # first collect the unique lowercase words of the location info
    s = set()
    unique_info = ' '.join([s.add(i.lower()) or i.lower() for i in location_info if i.lower() not in s])
    unique_info = [i for i in unique_info.split(' ') if i !='']

    # now we will split the address
    address = address.split(' ')

    # keep the words before the first location word; they hold the house and the landmark.
    # default to the whole address so the variable is defined even when nothing matches
    leading_words = address
    for i,word in enumerate(address):
        if word in unique_info:
            leading_words = address[:i]
            break
    # common words used for houses
    address_keywords = [
        "apartment", "flat", "unit", "suite", "building", "block", "tower", "complex",
        "condominium", "residency", "housing", "society", "colony", "house", "bhavan",
        "plaza", "estate", "mansions", "enclave", "club", "hotel", "lounge", "lodge",
        "resort", "inn", "motel", "hostel", "retreat", "guesthouse", "lounge", "bar",
        "tavern", "pub", "saloon", "bistro", "café", "diner", "eatery", "restaurant"
    ]

    # split the leading words into the house part and the landmark part
    max_score = 60
    split_at_index = None
    for j,token in enumerate(leading_words):
        for k in address_keywords:
            score = fuzz.ratio(k,token)
            if score > max_score:
                max_score= score
                split_at_index= j
    if split_at_index is None:
        house = " "
        landmark = ' '.join(leading_words)
    else:
        house = ' '.join(leading_words[:split_at_index+1])
        landmark = ' '.join(leading_words[split_at_index+1:])
    return [house,landmark]

In [31]:
# now let us design the input pipeline
# A company would typically store the input in a JSON file, so let's create custom JSON input files
# and treat them as the JSON input we received


# Let's store the one-line addresses in a JSON file
addresses = {'Addresses': ['A 406, Siddhivinayak Apartment, S. V. Road, Opp. Police station, Malad (west), Mumbai 400064','A 1247, Jayesh House, Ashok Nagar Layout Adyal,Pauni, Bhandara']}
file_path = 'one_line_address.json'

# Convert dictionary to JSON string
json_data = json.dumps(addresses)

# Write JSON string to a file
with open(file_path, "w") as json_file:
    json_file.write(json_data)

print("JSON file created successfully.")

JSON file created successfully.


In [32]:
# let's store the two-line addresses in a JSON file
addresses = {'Addresses':[{'First Line':'C-302, Oberoi Splendour apartment','Second Line':'Jogeshwari Vikhroli Link Road, Andheri (E), Mumbai 400072'}]}
file_path = 'two_line_address.json'

# Convert dictionary to JSON string
json_data = json.dumps(addresses)

# Write JSON string to a file
with open(file_path, "w") as json_file:
    json_file.write(json_data)

print("JSON file created successfully.")

JSON file created successfully.


In [33]:
# Let's start implementing the model
# First decide which kind of input the company is accepting: one line or two line
# Let's generate a random input type
# 1 denotes the one-line input
# 2 denotes the two-line input

input_type = np.random.randint(1, 3)
    
print(input_type)   
if input_type == 1:
    #load the oneline address json input
    # getting one line json input
    with open('one_line_address.json','r') as file:
        data = json.load(file)
else:
    with open('two_line_address.json','r') as file:
        data = json.load(file)
    

2


In [34]:
# convert the inputs in the list
inputs = []
for row in data["Addresses"]:
    if input_type == 1:
        inputs.append(row)
    else:
        inputs.append(row['First Line']+' ' + row['Second Line'])

In [35]:
# see the inputs
inputs

['C-302, Oberoi Splendour apartment Jogeshwari Vikhroli Link Road, Andheri (E), Mumbai 400072']
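
The loop above covers the one-line and two-line formats. The multi-field format from Example 3 of the problem statement could be folded into the same single string. A minimal sketch, assuming a hypothetical `flatten_address` helper that simply joins the values of a record in the order they appear:

```python
def flatten_address(record):
    """Collapse a one-line string or a dict of address parts into one string."""
    if isinstance(record, str):
        return record
    # Dicts preserve insertion order, so parts are joined as they were given.
    parts = [str(value).strip() for value in record.values() if value not in (None, "")]
    return " ".join(parts)


one_line = "A 406, Siddhivinayak Apartment, S. V. Road, Opp. Police station, Malad (west), Mumbai 400064"
two_line = {
    "First Line": "C-302, Oberoi Splendour apartment",
    "Second Line": "Jogeshwari Vikhroli Link Road, Andheri (E), Mumbai 400072",
}
multi_field = {
    "House": "1023, Rajesh Tower",
    "Address": "Mahavir Nagar, Borivali West",
    "Landmark": "near J B Kot School",
    "City": "Mumbai",
    "State": "Maharashtra",
    "Pincode": 400092,
}

flattened = [flatten_address(r) for r in (one_line, two_line, multi_field)]
for line in flattened:
    print(line)
```

Flattening discards the field labels, so a fuller version might keep already-labelled values such as `City` or `Pincode` and skip inference for them; that choice is left open here.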

In [36]:
# lets clean the input
inputs = clean_address(inputs)
inputs

['c 302 oberoi splendour apartment jogeshwari vikhroli link road andheri east mumbai 400072']

In [37]:
#lets extract the pincode for given inputs
pincodes = []
for address in inputs:
    pincodes.append(extract_pincode(address))

    
# let's divide the data on the basis of whether we got a pincode or not
got_pin_code = {1:[(inputs[i],pincodes[i] ) for i in range(len(pincodes)) if pincodes[i]!= None],0:[inputs[i] for i in range(len(pincodes)) if pincodes[i]== None]}
# get the post office data
data = []
for code in got_pin_code[1]:
    # Data stores all the information about particular Post office get through API
    data.append(get_data_using_pincode(code[1]))
# post_office_info stores the set of post office names found for each address
post_office_info = []
for results in data:
    required_keys = {'Name'}
    d = dict()
    for post in results:
        for post_key, post_value in post.items():
            if post_value != None:
                post_value = re.sub(r'\(.*?\)', '', post_value).strip()
            if post_key in required_keys and post_key not in d:
                d[post_key] = set([post_value.lower()])
            elif post_key in required_keys and post_key in d:
                d[post_key].add(post_value.lower())
    post_office_info.append(d)

In [38]:
got_pin_code[1]

[('c 302 oberoi splendour apartment jogeshwari vikhroli link road andheri east mumbai 400072',
  '400072')]

In [39]:
got_pin_code[0]

[]

In [40]:
# let's merge the corresponding post_office_info, address and data in one place
# these are the inputs used to get the structured address through the postal API
inputs_to_get_data_using_api= []
for i in range(len(got_pin_code[1])):
    # we will store like (address, post_office_info, data)
    inputs_to_get_data_using_api.append([got_pin_code[1][i],post_office_info[i],data[i]])

In [41]:
inputs_to_get_data_using_api

[[('c 302 oberoi splendour apartment jogeshwari vikhroli link road andheri east mumbai 400072',
   '400072'),
  {'Name': {'sakinaka', 'vihar road'}},
  [{'Name': 'Sakinaka',
    'Description': None,
    'BranchType': 'Sub Post Office',
    'DeliveryStatus': 'Delivery',
    'Circle': 'Maharashtra',
    'District': 'Mumbai',
    'Division': 'Mumbai  North East',
    'Region': 'Mumbai',
    'Block': 'NA',
    'State': 'Maharashtra',
    'Country': 'India',
    'Pincode': '400072'},
   {'Name': 'Vihar Road',
    'Description': None,
    'BranchType': 'Sub Post Office',
    'DeliveryStatus': 'Non-Delivery',
    'Circle': 'Maharashtra',
    'District': 'Mumbai',
    'Division': 'Mumbai  North East',
    'Region': 'Mumbai',
    'Block': 'NA',
    'State': 'Maharashtra',
    'Country': 'India',
    'Pincode': '400072'}]]]

In [42]:
# What if we don't get a pincode in the address?
# Then we no longer use the postal API
# let's make the input list for addresses without a pincode
inputs_without_pincode = []
for i in range(len(got_pin_code[0])):
    inputs_without_pincode.append(got_pin_code[0][i])
    

In [43]:
inputs_without_pincode

[]

In [44]:
final_results = []
for i,data in enumerate(inputs_to_get_data_using_api):
    result =get_locality_block_city_state_country(data[0][0],data[1],data[2])
    
    # if we do not get the result using API
    if result[0]==None or result[0]=='NA':
        # we will check the results using our recommendation system
        result = get_address_using_recommender_system(data[0][0])
        
            
    # Its time to fetch landmark and house        
    house_landmark = get_landmark_house(data[0][0],result)
    house_landmark.extend(result)
    
    final_results.append([data[0][0],house_landmark])
    
# For addresses that do not contain a pincode, we use the recommender system directly
for data in inputs_without_pincode:
    result = get_address_using_recommender_system(data)
    # Its time to fetch landmark and house        
    house_landmark = get_landmark_house(data,result)
    house_landmark.extend(result)
    
    final_results.append([data,house_landmark])
    
    
    
    
    

In [45]:
## Final result time
for i,result in enumerate(final_results):
    print(f"Result for Query {i+1}")
    print()
    print()
    print(f'Cleaned Address for given query: {result[0]} is')
    print(f"House: {' '.join(word.upper() for word in result[1][0].split())}")
    print(f"Landmark: {' '.join(word.upper() for word in result[1][1].split())}")
    print(f"Locality/Village: {' '.join(word.upper() for word in result[1][2].split())}")
    print(f"Area/Block/Tehsil: {' '.join(word.upper() for word in result[1][3].split())}")
    print(f"City/District: {' '.join(word.upper() for word in result[1][4].split())}")
    print(f"State : {' '.join(word.upper() for word in result[1][5].split())}")
    print(f"Country : {' '.join(word.upper() for word in result[1][6].split())}")
    print(f"Pincode : {' '.join(word.upper() for word in result[1][7].split())}")
    
    if extract_pincode(result[0])!= result[1][7]:
        print()
        print()
        print(f"The pincode in the input does not appear to match the address. Entered pincode: {extract_pincode(result[0])}; suggested pincode: {result[1][7]}.")
        
    
    print("====================================================================")
    

Result for Query 1


Cleaned Address for given query: c 302 oberoi splendour apartment jogeshwari vikhroli link road andheri east mumbai 400072 is
House: C 302 OBEROI SPLENDOUR APARTMENT
Landmark: JOGESHWARI VIKHROLI LINK ROAD
Locality/Village: ANDHERI EAST
Area/Block/Tehsil: MUMBAI
City/District: MUMBAI
State : MAHARASHTRA
Country : INDIA
Pincode : 400069


The pincode in the input does not appear to match the address. Entered pincode: 400072; suggested pincode: 400069.


## Conclusion : 
   1 . I have built an application that restructures a free-form address into a fixed set of fields.
   
   2 . While structuring the address, the input flows through the following steps:

    First Step - Check whether a pincode is present in the address.

    Second Step - If a pincode is present, the address is fetched using the Postal API of India.

    If the API does not return a matching address -> the second option is to look up the rows for that pincode in the DataFrame and match them with the fuzzy-matching routine (`check_if_not_found_through_api`). In the driver loop above, the recommender is called directly when the API gives no match.

    If the DataFrame lookup also fails, the recommendation system is used.

    Third Step - If no pincode is present in the input, the recommendation system is used directly.

    On the sample address, the final result was reasonable.

    The model can supply the correct spelling for a location name.

    It suggests a different pincode when the provided pincode does not match the address.
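
The fallback order described above can be sketched as a small dispatcher. This is an illustration under stated assumptions and not the notebook's driver code: `resolve_address` and its three callable parameters are hypothetical, and each resolver is expected to return a result or `None`.

```python
import re


def resolve_address(address, via_api, via_database, via_recommender):
    """Try each resolver in the order described above; return (source, result)."""
    match = re.search(r"\b\d{6}\b", address)
    if match:
        pincode = match.group()
        for source, resolver in (("api", via_api), ("database", via_database)):
            result = resolver(address, pincode)
            if result is not None:
                return source, result
    # No pincode, or neither pincode-based lookup produced a result.
    return "recommender", via_recommender(address)


# Stub resolvers standing in for the real lookups.
def api_stub(address, pincode):
    return None  # pretend the postal API had no matching locality


def database_stub(address, pincode):
    return {"pincode": pincode} if pincode == "400064" else None


def recommender_stub(address):
    return {"pincode": None}


with_pin = resolve_address("malad west mumbai 400064", api_stub, database_stub, recommender_stub)
without_pin = resolve_address("adyal pauni bhandara", api_stub, database_stub, recommender_stub)
print(with_pin)
print(without_pin)
```

Returning the source alongside the result makes it easy to report which step produced each structured address.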
