## EIRCODE database creation

Several data sources and approaches are reviewed below. The long-term aim is to develop an EIRCODE database with enough detail to plot each property's location on a map.

Data sources:
- Property Price Register (PPR) [website](https://www.propertypriceregister.ie/), which provides an option to download the full database as a zip file
- EIRCODE API [website](https://services.vision-net.ie/eircode.jsp). Output is in JSON format and includes the An Post Geo ID.
- Geocoding APIs
> - Google API [website](https://developers.google.com/maps/documentation/geocoding/overview)
> - MapQuest [website](https://www.mapquest.com/) with a developer API option for geocoding [website](https://developer.mapquest.com/documentation/geocoding-api). Note that pricing is based on monthly transaction volume, and a personal API key is required to make requests.

Objectives:
- Review the PPR database to understand the volume of properties with a non-null EIRCODE
- Compare a recent sample of PPR properties against property websites to see whether EIRCODE data is still available
>- Develop an automated search function
- Location details [LAT, LNG]
>- Take a sample of properties and extract LAT, LNG from a range of sources, e.g. the Google Maps API, to understand what can be extracted
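
The LAT, LNG objective could start from something like the sketch below. It assumes the JSON endpoint and response layout (`results[0].geometry.location`) described in the Google Geocoding API documentation linked above. `build_geocode_url` and `extract_lat_lng` are hypothetical helper names, `YOUR_API_KEY` is a placeholder, and the coordinates in `sample_payload` are made-up values rather than a real lookup. No request is sent.

```python
from urllib.parse import urlencode

# Endpoint taken from the Google Geocoding API documentation
GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"

def build_geocode_url(address, api_key, region="ie"):
    """Build the request URL for one address (hypothetical helper)."""
    params = {"address": address, "region": region, "key": api_key}
    return f"{GEOCODE_ENDPOINT}?{urlencode(params)}"

def extract_lat_lng(payload):
    """Return (lat, lng) from a decoded JSON response, or None if no result."""
    results = payload.get("results", [])
    if not results:
        return None
    location = results[0]["geometry"]["location"]
    return location["lat"], location["lng"]

url = build_geocode_url("69 CABRA RD, PHIBSBOROUGH, DUBLIN 7", "YOUR_API_KEY")
print(url)

# Made-up payload in the documented shape, used only to exercise the parser
sample_payload = {
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 53.0, "lng": -6.0}}}],
}
print(extract_lat_lng(sample_payload))
```

Keeping URL construction and response parsing separate from the network call makes it easier to test the logic without spending API quota.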

Backlog:
- Refactor to use the Modin API to allow for parallel computation

### Additional NLP processing

The aim was to review a number of algorithms that could help benchmark the address matching, using the formatted EIRCODE dataset to understand what is possible.

Next steps:
- Create a sample dataset with fake data entry issues, e.g. misspellings, missing information, different formats
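
One possible way to build such a sample is sketched below. The three corruption functions, the `ABBREVIATIONS` map and the `add_noise` helper are all hypothetical names chosen for illustration, and the set of corruptions is an assumption about which data entry issues matter most.

```python
import random
import re

# Hypothetical abbreviation map used to imitate format differences
ABBREVIATIONS = {"ROAD": "RD", "STREET": "ST", "AVENUE": "AVE"}

def swap_adjacent_chars(text, rng):
    """Imitate a misspelling by swapping two neighbouring characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def drop_last_part(text, rng):
    """Imitate missing information by removing the final comma-separated part."""
    parts = [part.strip() for part in text.split(",")]
    return ", ".join(parts[:-1]) if len(parts) > 1 else text

def abbreviate(text, rng):
    """Imitate a format difference by abbreviating common street words."""
    for long_form, short_form in ABBREVIATIONS.items():
        text = re.sub(rf"\b{long_form}\b", short_form, text)
    return text

# All corruptions share the (text, rng) signature so one can be picked at random
CORRUPTIONS = [swap_adjacent_chars, drop_last_part, abbreviate]

def add_noise(address, rng):
    """Apply one randomly chosen corruption to an address."""
    return rng.choice(CORRUPTIONS)(address, rng)

rng = random.Random(42)  # fixed seed so the sample is reproducible
clean = ["UNIT 6 69 CABRA ROAD, PHIBSBOROUGH, DUBLIN 7"] * 3
noisy = [add_noise(address, rng) for address in clean]
for before, after in zip(clean, noisy):
    print(f"{before!r} -> {after!r}")
```

Pairing each noisy address with its clean original gives a labelled set against which the matching algorithms can be scored.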

In [None]:
# importing required modules
from zipfile import ZipFile
  
# specifying the zip file name - keep the name as downloaded from the PPR website.
# The code below unzips the archive into the working directory
file_name = "PPR-ALL.zip"
  
# opening the zip file in READ mode
with ZipFile(file_name, 'r') as zip_file:
    # printing all the contents of the zip file
    zip_file.printdir()
  
    # extracting all the files
    print('Extracting all the files now...')
    zip_file.extractall()
    print('Done!')

In [None]:
# Import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
import sys
import polars as pl
import plotly.express as px
import os
import subprocess

In [None]:
# Review the number of CPUs available - introducing Modin could use these for parallel processing
print(os.cpu_count())

In [None]:
# Notebook setting updates
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Adjust options for displaying the float columns
pd.options.display.float_format = '{:,.2f}'.format

# Warning settings
import warnings
warnings.filterwarnings(action='ignore')

In [None]:
# import ppr data
df = pd.read_csv('PPR-ALL.csv', encoding = "ISO-8859-1")

In [None]:
df.info(memory_usage='deep')

In [None]:
# Clean the imported data so it can be explored. The steps come from the analysis completed in section 1 of this notebook;
# re-run those cells to inspect each step interactively.
# .assign : updates column values and dtypes
def tweak_jb(df):
    return (
        df
        .rename(columns=lambda c:c.replace(' ','_'))
        .rename(columns={'Date_of_Sale_(dd/mm/yyyy)':'Date_of_Sale',
                        'Price_()':'Price'
                       })
        .assign(Price=lambda df_:df_.Price.str[1:].str.replace(',','').astype(float),
                Date_of_Sale=lambda df_:pd.to_datetime(df_.Date_of_Sale, format='%d/%m/%Y'),  # source dates are dd/mm/yyyy
                Not_Full_Market_Price=lambda df_:df_.Not_Full_Market_Price.astype('category'),
                VAT_Exclusive=lambda df_:df_.VAT_Exclusive.astype('category'),
                Description_of_Property=lambda df_:df_.Description_of_Property.astype('category'),
                Property_Size_Description=lambda df_:df_.Property_Size_Description.astype('category'),
               )
        .drop(columns=['Property_Size_Description'])
    )

In [None]:
# Run function to create the updated DataFrame for analysis
df1 = tweak_jb(df)
df1.head()

In [None]:
# Reviewing frequency for each feature
(
    df1
    # .Property_Size_Description # remove not enough info
    .Description_of_Property
    # .VAT_Exclusive
    # .Not_Full_Market_Price
    .value_counts(dropna=False)
)

In [None]:
# Review high level details from DataFrame
df1.describe(include='all').T # check for cardinality

In [None]:
# Check whether columns can be converted to categories. A low cardinality (few unique values)
# makes a column a good candidate for the category dtype
cardinality = df1.nunique() # Display the cardinality for each column
cardinality

# Extract the names of the columns whose cardinality is at or below the threshold N
N = 6
cat_val = [n for n in cardinality if n <= N]
cat_cols = [col for col, n in cardinality.items() if n <= N] # .items() pairs each column name with its cardinality
cat_val
cat_cols

In [None]:
df1.info(memory_usage='deep')

### Missing value review

In [None]:
# Understand the missing values by column
df1.isnull().sum()

# Review the proportion of missing values in each column
def missing_columns(df):
    # isnull().sum() counts missing values per column; dividing by the row count gives the proportion
    return df.isnull().sum() / len(df)

missing_columns(df1)

### Review EIRCODEs

In [None]:
# Review number of non-null EIRCODEs within DataFrame
# .copy() so that columns added later do not trigger SettingWithCopyWarning
df_e = df1.loc[df1.Eircode.notna()].copy()
df_e.shape
df_e.head()

In [None]:
# include='all' adds summary output for the non-numeric features
df_e.describe(include='all')

In [None]:
# Review price variable
df_e['Price'].describe(percentiles=[.1, .2, .3, .4, .5, .6 , .7, .8, .9, .95, .99, 1])

In [None]:
df_e.hist(figsize=(10,10))

In [None]:
# Plotly review provides interactive options
fig = px.histogram(df_e, x="Date_of_Sale")
fig.show()

In [None]:
# Exclude larger outliers
fig = px.histogram(df_e.loc[(df_e.Price < 800_000)], x="Price", nbins=20)
fig.show()

In [None]:
# check for duplicates
dups_check = df_e.Eircode.is_unique
dups_check

In [None]:
# Review duplicates by EIRCODE feature
df_e['duplicate'] = df_e.duplicated(keep=False, subset=['Eircode'])
df_e.head()
dup_count = df_e['duplicate'].value_counts().to_string()
dup_count

In [None]:
# Understand volume of duplicates 
duplicates = (
    df_e
    .loc[df_e.duplicate == True]
    # .head()
    .sort_values(['Eircode'])
    # .head(20)
    .groupby(['Eircode'])['Eircode']
    .count()
    .value_counts() # include to check numbers by duplicate category
    .sort_values(ascending=False)
)

duplicates
# check for highest number

In [None]:
# Single sale date, multiple entries
df_e_check = df_e.loc[df_e.Eircode == 'D07F6K5']
df_e_check
df_e_check.groupby('Eircode').agg({'Price': 'sum'})

In [None]:
# Separate sale dates
df_e_check1 = df_e.loc[df_e.Eircode == 'N39TP27']
df_e_check1

In [None]:
sample = (
    df_e
    .loc[df_e.Eircode == 'D11XE43']
)
sample

In [None]:
sample1 = (
    df1
    .loc[df1.Address.str.contains('BELCLARE PARK', case=False)]
)
sample1

### Next steps

- Try geocoding APIs to understand how additional EIRCODEs could be retrieved
- Use the analysis to review recent property sales (e.g. the most recent 3 months) that may have EIRCODEs missing, and work out how to start filling in those details. Understanding what details are shown on property websites, e.g. daft or estate agents, could help with this review

### N-grams

In [None]:
pip install ngram

In [None]:
import ngram

In [None]:
from nltk.util import ngrams

# Example sentence
address = "UNIT 1 69 CABRA RD, PHIBSBOROUGH, DUBLIN 7"

# Tokenize the sentence into words
words = address.split()

# Create bigrams from the list of words (stored as a list so they can be reused later)
bigrams = list(ngrams(words, 2))

# Print the bigrams
for bigram in bigrams:
    print(bigram)

In [None]:
G = ngram.NGram(['joe','joseph','jon','john','sally'])
G.search('jon')

In [None]:
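check_str = 'dublin'

# ngrams() returns a one-shot generator, so build the bigrams again here.
# Each tuple is joined into a lower-case string because NGram.compare expects two strings
for bigram in ngrams(words, 2):
    print(ngram.NGram.compare(' '.join(bigram).lower(), check_str))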
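
To make the idea behind n-gram matching concrete, here is a minimal sketch of character trigram similarity using only the standard library. It uses plain Jaccard overlap on padded trigram sets, which is an assumption made for clarity and is not the exact scoring used by the `ngram` package. `char_ngrams` and `ngram_similarity` are hypothetical helper names.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Jaccard overlap of the two n-gram sets: shared n-grams / all n-grams."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

pairs = [
    ("CABRA RD", "CABRA ROAD"),
    ("PHIBSBOROUGH", "PHIBSBORO"),
    ("DUBLIN 7", "dublin 7"),
]
for left, right in pairs:
    print(f"{left!r} vs {right!r}: {ngram_similarity(left, right):.2f}")
```

Working at character level means abbreviations and misspellings still share most of their trigrams, whereas word-level bigrams treat `RD` and `ROAD` as entirely different tokens.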

### Search Engine review

Working with BM25
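
As a rough illustration of what BM25 ranking does, the sketch below scores a small list of addresses against a query using only the standard library. `SimpleBM25` and `tokenize` are hypothetical names, the addresses are made up for the example, and the IDF formula is one common non-negative variant, so scores will not match `rank_bm25` exactly.

```python
import math
from collections import Counter

def tokenize(text):
    """Lower-case, treat commas as spaces, split on whitespace."""
    return text.lower().replace(",", " ").split()

class SimpleBM25:
    """Minimal BM25 scorer (illustrative; not the rank_bm25 implementation)."""

    def __init__(self, corpus, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(doc) for doc in corpus]
        self.term_freqs = [Counter(doc) for doc in self.docs]
        self.avg_len = sum(len(doc) for doc in self.docs) / len(self.docs)
        # Document frequency: number of documents containing each term
        doc_freq = Counter(term for doc in self.docs for term in set(doc))
        n_docs = len(self.docs)
        self.idf = {
            term: math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            for term, df in doc_freq.items()
        }

    def scores(self, query):
        """Return one BM25 score per document for the query."""
        query_terms = tokenize(query)
        out = []
        for doc, tf in zip(self.docs, self.term_freqs):
            # Length normalisation: longer-than-average documents are penalised
            norm = self.k1 * (1 - self.b + self.b * len(doc) / self.avg_len)
            score = 0.0
            for term in query_terms:
                freq = tf.get(term, 0)
                if freq:
                    score += self.idf[term] * freq * (self.k1 + 1) / (freq + norm)
            out.append(score)
        return out

# Made-up addresses for illustration only
addresses = [
    "UNIT 1 69 CABRA RD, PHIBSBOROUGH, DUBLIN 7",
    "12 MAIN STREET, NAVAN, CO MEATH",
    "4 EXAMPLE PARK, DUBLIN 11",
]
bm25 = SimpleBM25(addresses)
scores = bm25.scores("cabra road dublin")
best = scores.index(max(scores))
print(addresses[best], round(scores[best], 3))
```

Because BM25 works on whole tokens, `road` contributes nothing to the match with `RD`; this is the kind of gap the fuzzy and character n-gram approaches in this notebook aim to cover.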

In [None]:
# import libraries
# import installPack

# Function to review and install package if missing
def installPackage(package):
    p = subprocess.run([sys.executable, "-m", "pip", "install", "-U", package], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    print(p.stdout.decode())

In [None]:
# List of libraries to import
requirements = ["spacy", "rank_bm25"]
for requirement in requirements:
    # installPack.installPackage(requirement)
    installPackage(requirement)

In [None]:
import spacy
from rank_bm25 import BM25Okapi
from tqdm import tqdm

In [None]:
# Initialise SpaCy - getting errors so can't compute these steps
# nlp = spacy.load("en_core_web_sm")

In [None]:
# Address will be the string
df_e.Address.head()

In [None]:
text_list = df_e.Address.str.lower().values
tok_text = [] # tokenised corpus

In [None]:
text_list

In [None]:
# Tokenising using SpaCy
# for doc in tqdm(nlp.pipe(text_list, disable=[])):
#     tok = [t.text for t in doc if t.is_alpha]
#     tok_text.append(tok)

In [None]:
pip install fuzzywuzzy

In [None]:
# Fuzzy Matching
from fuzzywuzzy import fuzz

similarity = fuzz.ratio("UNIT 6 69 CABRA RD, PHIBSBOROUGH, DUBLIN 7, D07F6K5", "UNIT 6 69 CABRA ROAD, PHIBSBOROUGH, DUBLIN 7, D07F6K5")
similarity

In [None]:
# Create a class statement to work with on sample addresses
class FuzzyStringMatcher:
    def __init__(self, df):
        self.df = df
    
    def calculate_similarity_score(self, source_column, target_column, new_column_name):
        self.df[new_column_name] = self.df.apply(lambda row: fuzz.ratio(row[source_column], row[target_column]), axis=1)
    
if __name__ == "__main__":
    # Sample DataFrame
    data = {'Address1': ["UNIT 6 69 CABRA RD, PHIBSBOROUGH, DUBLIN 7, D07F6K5", "UNIT 1 69 CABRA RD, PHIBSBOROUGH, DUBLIN 7, D07F6K5", "UNIT 5 69 CABRA RD, PHIBSBOROUGH, DUBLIN 7, D07F6K5"],
            'Address2': ["UNIT 4 69 CABRA RoaD, PHIBSBOROUGH, DUBLIN 7, D07F6K5", "UNIT 3 69 CABRA RD, PHIBSBORO, DUBLIN 7, D07F6K5", "UNIT 7 69 CABRA RD, PHIBSBOROUGH, DUBLIN 7, D07F6K5"]}
    df = pd.DataFrame(data)

    # Create an instance of the FuzzyStringMatcher class
    fuzzy_matcher = FuzzyStringMatcher(df)
    
    # Calculate similarity score
    fuzzy_matcher.calculate_similarity_score(source_column='Address1', target_column='Address2', new_column_name='SimilarityScore')
    
    # Display DF (display() is used because a bare expression inside an if block is not rendered)
    display(fuzzy_matcher.df)

In [None]:
# Create a class that works with larger dataframes
from fuzzywuzzy import fuzz, process

class FuzzyStringMatcherBest:
    def __init__(self, review_df, target_df, review_column, target_column):
        self.review_df = review_df
        self.target_df = target_df
        self.review_column = review_column
        self.target_column = target_column
        self.best_matches = {}
    
    def calculate_best_match(self):
        # Wrap the column (not enumerate) in tqdm so the progress bar knows the total
        for count, review_item in enumerate(tqdm(self.review_df[self.review_column])):
            # Find the best match for the review item in the target DataFrame
            best_match, score, _ = process.extractOne(review_item, self.target_df[self.target_column])
            self.best_matches[count] = {'Review_item': review_item, 'Best_Match': best_match, 'Score': score}
        
        # Create a DataFrame to store the results
        result_df = pd.DataFrame.from_dict(self.best_matches, orient='index')
        
        return result_df
        
if __name__ == "__main__":
    # Sample DataFrame
    review_data = {'Address1': ["UNIT 6 69 CABRA RD, PHIBSBOROUGH, DUBLIN 7, D07F6K5", "UNIT 1 69 CABRA RD, PHIBSBOROUGH, DUBLIN 7, D07F6K5", "UNIT 5 69 CABRA RD, PHIBSBOROUGH, DUBLIN 7, D07F6K5"]}
    target_data = {'Address2': ["UNIT 1 69 CABRA RoaD, PHIBSBOROUGH, DUBLIN 7, D07F6K5", "UNIT 3 69 CABRA RD, PHIBSBORO, DUBLIN 7, D07F6K5", "UNIT 7 69 CABRA RD, PHIBSBOROUGH, DUBLIN 7, D07F6K5"]}
    review_df = pd.DataFrame(review_data)
    target_df = pd.DataFrame(target_data)

    # Create an instance of the FuzzyStringMatcherBest class
    fuzzy_matcher = FuzzyStringMatcherBest(review_df, target_df, review_column='Address1', target_column='Address2')
    
    # Calculate best score
    best_matches_df = fuzzy_matcher.calculate_best_match()
    
    # Display DF (display() is used because a bare expression inside an if block is not rendered)
    display(best_matches_df)

In [None]:
df_e1 = (
    df_e
    .assign(address_eir = df_e.Address + ', ' + df_e.Eircode)
)

In [None]:
if __name__ == "__main__":
    # Sample DataFrame
    review_data = df_e1.sample(n=100, replace=True, random_state=1)
    target_data = df_e1.sample(n=500, replace=True, random_state=2)
    review_df = pd.DataFrame(review_data)
    target_df = pd.DataFrame(target_data)

    # Create an instance of the FuzzyStringMatcherBest class
    fuzzy_matcher = FuzzyStringMatcherBest(review_df, target_df, review_column='address_eir', target_column='address_eir')
    
    # Calculate best score
    best_matches_df = fuzzy_matcher.calculate_best_match()
    
    # Display DF (display() is used because a bare expression inside an if block is not rendered)
    display(best_matches_df)

In [None]:
best_matches_df.Score.value_counts(ascending=True)
best_matches_df.loc[(best_matches_df.Score >= 80),:]

In [None]:
review_data.head()

In [None]:
# Tokenization and Text Similarity
from nltk.tokenize import word_tokenize
from nltk.metrics import jaccard_distance
import nltk

# The 'punkt' tokenizer models had to be downloaded for word_tokenize to work
nltk.download('punkt')

In [None]:
# Perform review
address1_tokens = set(word_tokenize("UNIT 6 69 CABRA RD, PHIBSBOROUGH, DUBLIN 7, D07F6K5"))
address2_tokens = set(word_tokenize("UNIT 6 69 CABRA ROAD, PHIBSBOROUGH, DUBLIN 7, D07F6K5"))

similarity = 1 - jaccard_distance(address1_tokens, address2_tokens)
similarity

### TF-IDF

This analysis reviews how effectively TF-IDF can speed up searching through a list of addresses to perform matching.

The code was originally used by Tim Black to match IMDB movie titles with a MovieLens dataset [article](https://medium.com/tim-black/fuzzy-string-matching-at-scale-41ae6ac452c2). Tim references work completed by Chris van den Berg on the TF-IDF approach.

We take that code and apply it to the EIRCODE dataset.
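
Before the full implementation, the core idea can be sketched with the standard library alone: turn each address into a TF-IDF weighted vector of character trigrams, normalise it, and compare vectors by cosine similarity. This is a simplified illustration, not the code from the article. The helper names are hypothetical, the whole string is padded rather than each word (unlike scikit-learn's `char_wb` analyzer), and a single IDF table is fitted on both lists together so that identical strings score 1.

```python
import math
from collections import Counter

def char_trigrams(text):
    """List of character trigrams of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def fit_idf(corpus):
    """Smoothed IDF per trigram: ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(corpus)
    doc_freq = Counter(g for doc in corpus for g in set(char_trigrams(doc)))
    return {g: math.log((1 + n_docs) / (1 + df)) + 1 for g, df in doc_freq.items()}

def tfidf_vector(text, idf):
    """L2-normalised sparse TF-IDF vector stored as a dict."""
    counts = Counter(char_trigrams(text))
    vec = {g: count * idf[g] for g, count in counts.items() if g in idf}
    norm = math.sqrt(sum(weight * weight for weight in vec.values()))
    return {g: weight / norm for g, weight in vec.items()} if norm else {}

def cosine(u, v):
    """Dot product of two normalised sparse vectors."""
    return sum(weight * v.get(g, 0.0) for g, weight in u.items())

# Made-up addresses for illustration only
sample = ["69 CABRA RD, DUBLIN 7"]
targets = ["69 CABRA ROAD, DUBLIN 7", "12 MAIN STREET, NAVAN"]

idf = fit_idf(sample + targets)  # one shared IDF table for both lists
query_vec = tfidf_vector(sample[0], idf)
scores = [cosine(query_vec, tfidf_vector(target, idf)) for target in targets]
best = scores.index(max(scores))
print(targets[best], round(scores[best], 3))
```

The approach scales because the pairwise comparison reduces to a sparse matrix product, which is the part `sparse_dot_topn` accelerates by keeping only the top candidates per row.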

In [None]:
# Producing errors so can't get it working correctly
# spacy.cli.download("en_core_web_sm")

In [None]:
pip install sparse_dot_topn

In [None]:
# Load libraries
import re
import time
import operator

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from scipy.sparse import csr_matrix
import pandas as pd

import sparse_dot_topn.sparse_dot_topn as ct

In [None]:
class StringMatch():
    
    def __init__(self, source_names, target_names):
        self.source_names = source_names
        self.target_names = target_names
        self.ct_vect      = None
        self.tfidf_vect   = None
        self.vocab        = None
        self.sprse_mtx    = None
        
        
    def tokenize(self, analyzer='char_wb', n=3):
        '''
        Tokenizes the list of strings, based on the selected analyzer

        :param str analyzer: Type of analyzer ('char_wb', 'word'). Default is 'char_wb'
        :param int n: The n-gram length. Default is 3 (trigrams)
        '''
        # Create initial count vectorizer & fit it on both lists to get vocab
        self.ct_vect = CountVectorizer(analyzer=analyzer, ngram_range=(n, n))
        self.vocab   = self.ct_vect.fit(self.source_names + self.target_names).vocabulary_
        
        # Create tf-idf vectorizer
        self.tfidf_vect  = TfidfVectorizer(vocabulary=self.vocab, analyzer=analyzer, ngram_range=(n, n))
        
        
    def match(self, ntop=1, lower_bound=0, output_fmt='df'):
        '''
        Main match function. Default settings return only the top candidate for every source string.
        
        :param int ntop: The number of top-n candidates that should be returned
        :param float lower_bound: The lower-bound threshold for keeping a candidate, between 0-1.
                                   Default set to 0, so consider all candidates
        :param str output_fmt: The output format. Either dataframe ('df') or dict ('dict')
        '''
        self._awesome_cossim_top(ntop, lower_bound)
        
        if output_fmt == 'df':
            match_output = self._make_matchdf()
        elif output_fmt == 'dict':
            match_output = self._make_matchdict()
            
        return match_output
        
        
    def _awesome_cossim_top(self, ntop, lower_bound):
        ''' https://gist.github.com/ymwdalex/5c363ddc1af447a9ff0b58ba14828fd6#file-awesome_sparse_dot_top-py '''
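        # To CSR Matrix, if needed
        # Note: each fit_transform call refits the IDF weights on its own list, so A and B are
        # weighted differently and an identical address can score slightly below 1
        A = self.tfidf_vect.fit_transform(self.source_names).tocsr()
        B = self.tfidf_vect.fit_transform(self.target_names).transpose().tocsr()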
        M, _ = A.shape
        _, N = B.shape

        idx_dtype = np.int32

        nnz_max = M * ntop

        indptr = np.zeros(M+1, dtype=idx_dtype)
        indices = np.zeros(nnz_max, dtype=idx_dtype)
        data = np.zeros(nnz_max, dtype=A.dtype)

        ct.sparse_dot_topn(
            M, N, np.asarray(A.indptr, dtype=idx_dtype),
            np.asarray(A.indices, dtype=idx_dtype),
            A.data,
            np.asarray(B.indptr, dtype=idx_dtype),
            np.asarray(B.indices, dtype=idx_dtype),
            B.data,
            ntop,
            lower_bound,
            indptr, indices, data)

        self.sprse_mtx = csr_matrix((data,indices,indptr), shape=(M,N))
    
    
    def _make_matchdf(self):
        ''' Build dataframe for result return '''
        # CSR matrix -> COO matrix
        cx = self.sprse_mtx.tocoo()

        # COO matrix to list of tuples
        match_list = []
        for row, col, val in zip(cx.row, cx.col, cx.data):
            match_list.append((row, self.source_names[row], col, self.target_names[col], val))

        # List of tuples to dataframe
        colnames = ['Row_Idx', 'Sample_Address', 'Source_Idx', 'Source_Address', 'Score']
        match_df = pd.DataFrame(match_list, columns=colnames)

        return match_df

    
    def _make_matchdict(self):
        ''' Build dictionary for result return '''
        # CSR matrix -> COO matrix
        cx = self.sprse_mtx.tocoo()

        # dict value should be tuple of values
        match_dict = {}
        for row, col, val in zip(cx.row, cx.col, cx.data):
            if match_dict.get(row):
                match_dict[row].append((col,val))
            else:
                match_dict[row] = [(col, val)]

        return match_dict   

In [None]:
# Make sample list for review
def sample_test_df(n=1_000):
    return df_e.sample(n=n, random_state=1)

# df_e1 = sample_test_df(10_000)
df_e1 = sample_test_df(25_000)
df_e1

In [None]:
from datetime import datetime

# Match the sample address to EIRCODE addresses (and time it)
t0 = datetime.now()
titlematch = StringMatch(df_e1.Address.tolist(), df_e.Address.tolist()) # first param: sample list, second param: target list
titlematch.tokenize()
match_df = titlematch.match()
t1 = datetime.now()
full_time_tfidf = (t1-t0).total_seconds()
full_time_tfidf

# Performance:
# n = 1_000; time = 4.5s
# n = 10_000; time = 13s
# n = 25_000; time = 28s

In [None]:
match_df.sample(10)

In [None]:
# Should result in a perfect match (score of 1) as every sample address also appears in the source list
match_df.groupby(round(match_df.Score,2)).agg({'Sample_Address':['count']})

In [None]:
# As the sample input list increased in size the scores moved towards 1, so the 0.98 filter below
# only returns rows for the smaller sample size (n=1_000)
# review = (
#     match_df.loc[(round(match_df.Score,2) == 0.98),:]
# )
# review

In [None]:
# Trying to display the matrix output. As sample size has increased this output has increased in size
plt.spy(titlematch.sprse_mtx);