### Customer Match (Kunden Match) <a class="anchor" id="chapter1"></a>

In [1]:
import pandas as pd
import numpy as np

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

Questions for understanding the structure and quality of both datasets, and the insights they can provide.

Data Quality

(1) Data Completeness

- Are there missing values in the dataset? If so, how many and in which columns?
- What is the proportion of missing values?

(2) Data Consistency

- Are there any inconsistencies in the data entries, such as different formats for the same type of data?
- Are there duplicate rows in the dataset?

(3) Data Types

- What are the data types of each column? Are they appropriate for the type of data they hold?
- Do any columns need to be converted to different data types for analysis (e.g., integer columns might be stored as a string)?
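
The questions above can be answered with one small summary table per sheet. The following is a minimal sketch, not part of the original notebook: `quality_summary` is a hypothetical helper and `toy` is made-up data standing in for the real sheets.

```python
import pandas as pd

def quality_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing-value count and share (hypothetical helper)."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "null_count": df.isna().sum(),
        "null_pct": (df.isna().mean() * 100).round(2),
    })

# Made-up data for illustration only; the real sheets are loaded further down.
toy = pd.DataFrame({
    "name": ["A", "B", "B", None],
    "zip": ["88299", "66440", "66440", "W1F 7LP"],
})

summary = quality_summary(toy)
duplicate_rows = int(toy.duplicated().sum())  # rows identical in every column

print(summary)
print(f"Duplicate rows: {duplicate_rows}")
```

Applied to the real data, the same helper would be called once per DataFrame, which avoids redefining the counting functions for each sheet.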


In [2]:
file_path = 'Fuzzy Match.xlsx'

#Salesforce Data

df_salesforce = pd.read_excel(file_path, sheet_name = 'Salesforce')

df_salesforce.head()


Unnamed: 0,Salesforce ID,Account Name,Account Owner,Parent Account ID,Parent Account,Rating - Number Of Funerals p.a.,Steps ID,Billing Country,Billing City,Billing ZIP,Billing Street,Street Number,Phone,Steps ID.1
0,0010X000045oU62QAE,Bestattungen Gredler,Maximilian Witt,,,700.0,,Germany,Leutkirch,88299,Storchenstraße,15/1,07561 / 5009,
1,0010X000045okZGQAY,Toussaint Bestattungen,Jasmin Ouali Turki,,,100.0,330860.0,Germany,Blieskastel,66440,Alte Pfarrgasse,17,+49 6842 4563,
2,0010X000045om5rQAA,Hermann Janßen KG,Christian Leppert,,,150.0,,Germany,Schortens Heidmühle,26419,Oldenburger Straße,32,+49 4461 8802,
3,0010X000045peOdQAI,Wonnemann e.K. Bestattungen,Florian Walzer,,,200.0,,Germany,Ennigerloh,59320,Elmstraße,12,+49 2524 93310,
4,0010X000045piDPQAY,Müther e. K. Bestattungen,Maximilian Witt,,,80.0,,Germany,Gütersloh,33335,Hirschweg,11-13,+49 5241 78033,


In [3]:
df_salesforce.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4469 entries, 0 to 4468
Data columns (total 14 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Salesforce ID                     4469 non-null   object 
 1   Account Name                      4469 non-null   object 
 2   Account Owner                     4469 non-null   object 
 3   Parent Account ID                 286 non-null    object 
 4   Parent Account                    286 non-null    object 
 5   Rating - Number Of Funerals p.a.  2774 non-null   float64
 6   Steps ID                          1067 non-null   float64
 7   Billing Country                   4469 non-null   object 
 8   Billing City                      4423 non-null   object 
 9   Billing ZIP                       4413 non-null   object 
 10  Billing Street                    4411 non-null   object 
 11  Street Number                     4306 non-null   object 
 12  Phone 

### Data Manipulation: Salesforce <a class="anchor" id="chapter1"></a>

In [4]:
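# 1) Both numeric columns load as float64 because they contain NaN. Convert them to integers, filling missing values with 0.
# Note: the first assignment writes to a new column 'Rating Number Of Funerals p.a.' (no hyphen), so the original rating column stays unchanged.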

df_salesforce['Rating Number Of Funerals p.a.'] = pd.to_numeric(df_salesforce['Rating - Number Of Funerals p.a.'], errors='coerce').fillna(0).astype(int)
df_salesforce['Steps ID'] = pd.to_numeric(df_salesforce['Steps ID'], errors='coerce').fillna(0).astype(int)
#df_salesforce['Billing ZIP'] = pd.to_numeric(df_salesforce['Billing ZIP'], errors='coerce').fillna(0).astype(int)

df_salesforce.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4469 entries, 0 to 4468
Data columns (total 15 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Salesforce ID                     4469 non-null   object 
 1   Account Name                      4469 non-null   object 
 2   Account Owner                     4469 non-null   object 
 3   Parent Account ID                 286 non-null    object 
 4   Parent Account                    286 non-null    object 
 5   Rating - Number Of Funerals p.a.  2774 non-null   float64
 6   Steps ID                          4469 non-null   int64  
 7   Billing Country                   4469 non-null   object 
 8   Billing City                      4423 non-null   object 
 9   Billing ZIP                       4413 non-null   object 
 10  Billing Street                    4411 non-null   object 
 11  Street Number                     4306 non-null   object 
 12  Phone 

In [5]:
#df_salesforce['Billing ZIP'] = df_salesforce['Billing ZIP'].astype(int)

#df_salesforce['Billing ZIP'].dtype
#-> ValueError: invalid literal for int() with base 10: 'W1F 7LP'
# 'Billing ZIP' contains non-numeric postcodes such as 'W1F 7LP', so it is kept as text.


In [6]:
# 2) A few customers have already been matched between the systems, i.e. a Steps ID was assigned to a Salesforce ID.
# -> Rename column 'Steps ID' to 'Manual_StepsID' to distinguish it from the second Steps ID column (column N of the sheet, loaded as 'Steps ID.1').

df_salesforce.rename(columns={'Steps ID': 'Manual_StepsID'},inplace=True)
#df_salesforce['Manual_StepsID']


Handling null values

In [7]:
# Function to count null values in a DataFrame column
def count_null(column):
    return column.isnull().sum()

# Function to count empty values in a DataFrame column
def count_empty(column):
    return (column == '').sum()

# Function to count zero values in a DataFrame column
def count_zeros(column):
    # Check if the column is numeric to avoid type errors
    if pd.api.types.is_numeric_dtype(column):
        return (column == 0).sum()
    return 0

# Function to count blank (whitespace-only) values in a DataFrame column
def count_blank(column):
    return column.apply(lambda x: isinstance(x, str) and x.strip() == '').sum()

# Apply the functions to each column and create Series with the results
null_counts = df_salesforce.apply(count_null)
empty_counts = df_salesforce.apply(count_empty)
zero_counts = df_salesforce.apply(count_zeros)
blank_counts = df_salesforce.apply(count_blank)

# Display the counts
print("Null value counts for each column:")
print(null_counts)

print("\nEmpty value counts for each column:")
print(empty_counts)

print("\nZero value counts for each column:")
print(zero_counts)

print("\nBlank value counts for each column:")
print(blank_counts)

Null value counts for each column:
Salesforce ID                          0
Account Name                           0
Account Owner                          0
Parent Account ID                   4183
Parent Account                      4183
Rating - Number Of Funerals p.a.    1695
Manual_StepsID                         0
Billing Country                        0
Billing City                          46
Billing ZIP                           56
Billing Street                        58
Street Number                        163
Phone                                171
Steps ID.1                          4469
Rating Number Of Funerals p.a.         0
dtype: int64

Empty value counts for each column:
Salesforce ID                       0
Account Name                        0
Account Owner                       0
Parent Account ID                   0
Parent Account                      0
Rating - Number Of Funerals p.a.    0
Manual_StepsID                      0
Billing Country                   

In [8]:
# Calculating the percentage of null values in each column
total_rows = df_salesforce.shape[0]
null_percentage = (null_counts / total_rows) * 100
	

# Combine both counts and percentages into a single DataFrame for better readability
null_summary = pd.DataFrame({'Null Count': null_counts, 'Null Percentage': null_percentage})
null_summary


Unnamed: 0,Null Count,Null Percentage
Salesforce ID,0,0.0
Account Name,0,0.0
Account Owner,0,0.0
Parent Account ID,4183,93.600358
Parent Account,4183,93.600358
Rating - Number Of Funerals p.a.,1695,37.927948
Manual_StepsID,0,0.0
Billing Country,0,0.0
Billing City,46,1.029313
Billing ZIP,56,1.253077


In [9]:
# How many matches do we already have?

# Count the values in 'Manual_StepsID' that are non-null integers
matches_count = df_salesforce['Manual_StepsID'].apply(lambda x: pd.notna(x) and isinstance(x, (int, np.integer))).sum()
matches_count

# Total number of rows in the DataFrame
total_rows = df_salesforce.shape[0]
total_rows

# Calculate percentages
matches_percentage = (matches_count / total_rows) * 100
non_matches_percentage = 100 - matches_percentage

matches_percentage
non_matches_percentage

# Conclusion: counting non-null or non-empty values does not work here. Every row counts as a match (0% non-matches),
# because fillna(0) in In [4] replaced the missing IDs with the integer 0. Unmatched rows must be identified by the value 0.


np.float64(0.0)

In [10]:
# Issue: 'Manual_StepsID' shows 4469 entries, all non-null and non-empty, because missing IDs were filled with 0.

df_salesforce['Manual_StepsID'].head()

0         0
1    330860
2         0
3         0
4         0
Name: Manual_StepsID, dtype: int64
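
The zeros shown here are placeholders introduced by `fillna(0)`, not real Steps IDs. One way to keep missing IDs distinguishable is pandas' nullable integer dtype `Int64`. The sketch below uses a made-up Series in place of the real column; it shows an alternative and is not what this notebook does.

```python
import pandas as pd

# Made-up stand-in for the raw 'Steps ID' column (float64 because of NaN).
raw_ids = pd.Series([None, 330860.0, None, None, 330520.0])

# 'Int64' (capital I) is pandas' nullable integer dtype: missing values stay <NA>.
ids = pd.to_numeric(raw_ids, errors="coerce").astype("Int64")

matched = int(ids.notna().sum())
unmatched = int(ids.isna().sum())

print(ids)
print(f"matched: {matched}, unmatched: {unmatched}")
```

With this representation, `notna()` counts the existing matches directly and no sentinel value has to be remembered.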

In [11]:
# Counting zero and non-zero values in 'Manual_StepsID'

zero_count = (df_salesforce['Manual_StepsID'] == 0).sum()
non_zero_count = (df_salesforce['Manual_StepsID'] != 0).sum()

# What is the total number of rows in the DF?
total_rows = df_salesforce.shape[0]

# Calculating percentages
zero_percentage = (zero_count / total_rows) * 100
non_zero_percentage = (non_zero_count / total_rows) * 100

#Printing results
print(f"Total entries: {total_rows}")
print(f"Number of zero values in 'Manual StepsID': {zero_count}")
print(f"Number of non-zero values in 'Manual StepsID': {non_zero_count}")
print(f"Percentage of zero values: {zero_percentage:.2f}%")
print(f"Percentage of non-zero values: {non_zero_percentage:.2f}%")


Total entries: 4469
Number of zero values in 'Manual StepsID': 3402
Number of non-zero values in 'Manual StepsID': 1067
Percentage of zero values: 76.12%
Percentage of non-zero values: 23.88%


In [12]:
# Filter non-zero values in 'Manual_StepsID'
non_zero_values = df_salesforce[df_salesforce['Manual_StepsID'] != 0]

# Count unique non-zero values
unique_non_zero_count = non_zero_values['Manual_StepsID'].nunique()

# Print results
print(f"Number of unique non-zero values in 'Manual_StepsID': {unique_non_zero_count}")

Number of unique non-zero values in 'Manual_StepsID': 1067


Conclusion:
- About a quarter of the rows (1,067 of 4,469, or 23.88%) already have a Steps ID in 'Manual_StepsID'.
- The total number of entries and the counts of zero and non-zero values in 'Manual_StepsID' were cross-checked manually against the .xlsx file.

In [13]:
df_salesforce['Manual_StepsID']

0            0
1       330860
2            0
3            0
4            0
         ...  
4464    330520
4465         0
4466    330242
4467    326215
4468         0
Name: Manual_StepsID, Length: 4469, dtype: int64

Checking for duplicates in Salesforce

In [14]:
total_values = df_salesforce['Account Name'].count()


unique_Account_Name = df_salesforce['Account Name'].nunique()

print(f"Total number of values in 'Account Name': {total_values}")
print(f"Number of unique values in 'Account Name': {unique_Account_Name}")

Total number of values in 'Account Name': 4469
Number of unique values in 'Account Name': 4269


In [15]:
# Count total values and unique values in 'Account Name'
total_values = df_salesforce['Account Name'].count()
unique_Account_Name = df_salesforce['Account Name'].nunique()

# Identify duplicate values in 'Account Name'
duplicate_Account_Names = df_salesforce[df_salesforce.duplicated(subset=['Account Name'], keep=False)]

# Print results
print(f"Total number of values in 'Account Name': {total_values}")
print(f"Number of unique values in 'Account Name': {unique_Account_Name}")

print("\nDuplicate values in 'Account Name':")
print(duplicate_Account_Names[['Account Name']].drop_duplicates())

Total number of values in 'Account Name': 4469
Number of unique values in 'Account Name': 4269

Duplicate values in 'Account Name':
                                  Account Name
7                      Grieneisen Bestattungen
69                           Bestattungen Rien
71              Asgard Bestattungshaus Rostock
79                    Bestattungen Stangl GmbH
148   Deutsche Bestattungsvorsorge Treuhand AG
...                                        ...
3654                 Erich Müller Bestattungen
3670        Bestattungsinstitut Hans von Holdt
3677               Helmut Schmidt Bestattungen
3689                           Volksbestattung
3695                 Pietät Halle Bestattungen

[75 rows x 1 columns]


-> Next step: check whether accounts with a duplicate name also share the same address, to verify consistency.
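
A possible way to run this consistency check is sketched below. The rows are made up for illustration; only the column names follow the Salesforce sheet, and the choice of ZIP plus street as "the address" is an assumption.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the Salesforce sheet.
toy = pd.DataFrame({
    "Account Name": ["Muster Bestattungen", "Muster Bestattungen",
                     "Beispiel Bestattungen", "Beispiel Bestattungen",
                     "Einzel GmbH"],
    "Billing ZIP": ["10115", "10115", "20095", "80331", "50667"],
    "Billing Street": ["Hauptstraße", "Hauptstraße",
                       "Am Markt", "Ringstraße", "Domplatz"],
})
address_cols = ["Billing ZIP", "Billing Street"]  # assumption: these define an address

# Keep only rows whose account name occurs more than once.
dupes = toy[toy.duplicated(subset=["Account Name"], keep=False)]

# Number of distinct addresses per duplicated account name.
distinct_addresses = (
    dupes.drop_duplicates(subset=["Account Name"] + address_cols)
    .groupby("Account Name")
    .size()
)

same_address = distinct_addresses[distinct_addresses == 1].index.tolist()
different_address = distinct_addresses[distinct_addresses > 1].index.tolist()

print("Same address:", same_address)
print("Different addresses:", different_address)
```

Names that appear with a single address are likely true duplicates, while names with several addresses may be branches of the same business.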

### Data Manipulation: Steps <a class="anchor" id="chapter1"></a>

In [16]:
df_steps = pd.read_excel(file_path, sheet_name = 'Steps')

df_steps.head()

Unnamed: 0,Group ID,Group Name,Steps ID,Customer Name,Zip Code,City,Street,Phone,E-Mail,Cases
0,132,Keine,329984,DBU Bestattungs-Union GmbH,,,,,,
1,132,Keine,329358,?Bestattungen Potthoff,48703.0,Stadtlohn,Vredener Straße 41,+49 2563 7933,info@bestattungen-potthoff.de,
2,132,Keine,501395,„Pietät“ Berthold Wiesel,60311.0,Frankfurt am Main,Kirchnerstraße 4,+49 (69) 9207160,info@bestattungen-wiesel.de,100.0
3,132,Keine,501076,1. Brühler Bestattungsinstitut,68782.0,Brühl,Stuttgarter Straße 26,+49 (6202) 71528,info@bestattungsinstitut-gredel.de,0.0
4,297,ASV,326034,ASV Deutschland GmbH,22089.0,Hamburg,Eilbeker Weg 16,+49 (40) 258055,info@asv-deutschland.de,


In [17]:
df_steps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6593 entries, 0 to 6592
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Group ID       6593 non-null   int64  
 1   Group Name     6593 non-null   object 
 2   Steps ID       6593 non-null   int64  
 3   Customer Name  6593 non-null   object 
 4   Zip Code       6538 non-null   float64
 5   City           6542 non-null   object 
 6   Street         6521 non-null   object 
 7   Phone          6385 non-null   object 
 8   E-Mail         4583 non-null   object 
 9   Cases          4357 non-null   float64
dtypes: float64(2), int64(2), object(6)
memory usage: 515.2+ KB


Conclusion: Steps has 6593 entries, whereas Salesforce has 4469 (Steps has 2124 more entries)

In [18]:
# Number of entries in each DataFrame
entries_steps = len(df_steps)
entries_salesforce = len(df_salesforce)

# Calculate the difference in entries
difference = entries_steps - entries_salesforce

# Calculate the percentage difference, relative to the Salesforce row count
percentage_difference = (difference / entries_salesforce) * 100

# Print the results
print(f"Entries in Steps: {entries_steps}")
print(f"Entries in Salesforce: {entries_salesforce}")
print(f"Difference in entries: {difference}")
print(f"Percentage difference: {percentage_difference:.2f}%")

Entries in Steps: 6593
Entries in Salesforce: 4469
Difference in entries: 2124
Percentage difference: 47.53%


In [19]:
# 1) 'Zip Code' loads as float64 because it contains NaN. Converting it to an integer was tried and left disabled.
#df_steps['Zip Code'] = pd.to_numeric(df_steps['Zip Code'], errors='coerce').fillna(0).astype(int)
#df_steps['Zip Code']

In [20]:
# Counting null values in each column
null_counts = df_steps.isnull().sum()


# Function to count null values in a DataFrame column
def count_null(column):
    return column.isnull().sum()

# Function to count empty values in a DataFrame column
def count_empty(column):
    return (column == '').sum()

# Function to count zero values in a DataFrame column
def count_zeros(column):
    # Check if the column is numeric to avoid type errors
    if pd.api.types.is_numeric_dtype(column):
        return (column == 0).sum()
    return 0

# Function to count blank (whitespace-only) values in a DataFrame column
def count_blank(column):
    return column.apply(lambda x: isinstance(x, str) and x.strip() == '').sum()

# Function to count 'NULL' string values in a DataFrame column
def count_null_string(column):
    return (column == 'NULL').sum()

# Apply the functions to each column and create Series with the results
null_counts_steps = df_steps.apply(count_null)
empty_counts_steps = df_steps.apply(count_empty)
zero_counts_steps = df_steps.apply(count_zeros)
blank_counts_steps = df_steps.apply(count_blank)
null_string_counts_steps = df_steps.apply(count_null_string)

# Display the counts
print("Null value counts for each column in df_steps:")
print(null_counts_steps)

print("\nEmpty value counts for each column in df_steps:")
print(empty_counts_steps)

print("\nZero value counts for each column in df_steps:")
print(zero_counts_steps)

print("\nBlank value counts for each column in df_steps:")
print(blank_counts_steps)

print("\n'NULL' string value counts for each column in df_steps:")
print(null_string_counts_steps)

Null value counts for each column in df_steps:
Group ID            0
Group Name          0
Steps ID            0
Customer Name       0
Zip Code           55
City               51
Street             72
Phone             208
E-Mail           2010
Cases            2236
dtype: int64

Empty value counts for each column in df_steps:
Group ID         0
Group Name       0
Steps ID         0
Customer Name    0
Zip Code         0
City             0
Street           0
Phone            0
E-Mail           0
Cases            0
dtype: int64

Zero value counts for each column in df_steps:
Group ID            0
Group Name          0
Steps ID            0
Customer Name       0
Zip Code            0
City                0
Street              0
Phone               0
E-Mail              0
Cases            1006
dtype: int64

Blank value counts for each column in df_steps:
Group ID         0
Group Name       0
Steps ID         0
Customer Name    0
Zip Code         0
City             0
Street           0
Phone

In [21]:
# Calculating the percentage of null values in each column
total_rows = df_steps.shape[0]
null_percentage = (null_counts / total_rows) * 100
	

# Combine both counts and percentages into a single DataFrame for better readability
null_summary = pd.DataFrame({'Null Count': null_counts, 'Null Percentage': null_percentage})
null_summary

Unnamed: 0,Null Count,Null Percentage
Group ID,0,0.0
Group Name,0,0.0
Steps ID,0,0.0
Customer Name,0,0.0
Zip Code,55,0.834218
City,51,0.773548
Street,72,1.092067
Phone,208,3.154861
E-Mail,2010,30.48688
Cases,2236,33.914758


Fuzzy String Matching 

1) Creating a new address column for both Salesforce ("Salesforce_Address") and Steps ("Steps_Address")

In [22]:
df_steps['Street'] = df_steps['Street'].astype(str)
df_steps['Zip Code'] = df_steps['Zip Code'].astype(str)
df_steps['City'] = df_steps['City'].astype(str)

# Verify the conversion
print(df_steps['Street'].dtype)
print(df_steps['Zip Code'].dtype)
print(df_steps['City'].dtype)

object
object
object


In [23]:
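# (1) Build a concatenated address field for the fuzzy string match: Street, Zip Code and City are combined into the new column 'Steps_Address'.
# The parts are joined without spaces, on the assumption that spaces add no value to this analysis.
# Caveat: astype(str) turns missing values into the text 'nan' and float ZIP codes into e.g. '48703.0', as the output shows.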
df_steps['Steps_Address'] = df_steps['Street'] + df_steps['Zip Code'] + df_steps['City']
df_steps['Steps_Address']

0                                       nannannan
1              Vredener Straße 4148703.0Stadtlohn
2        Kirchnerstraße 460311.0Frankfurt am Main
3               Stuttgarter Straße 2668782.0Brühl
4                   Eilbeker Weg 1622089.0Hamburg
                          ...                    
6588    Mannheimer Straße 22955543.0Bad Kreuznach
6589                   Lindenstr. 3917389.0Anklam
6590                      Worthweg 529693.0Ahlden
6591          Donnerburgweg 4038106.0Braunschweig
6592                    Liebenau 364252.0Liebenau
Name: Steps_Address, Length: 6593, dtype: object
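
The output above contains `nannannan` for rows without an address and a `.0` suffix on every ZIP code. A minimal sketch of a cleaner construction follows. `normalize_zip` and `build_address` are hypothetical helpers, and the optional `pad_to` argument assumes that numeric codes lost leading zeros when they were read as floats, which holds for German five-digit codes but not necessarily for other countries.

```python
import math

def _is_missing(value) -> bool:
    """True for None and float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def normalize_zip(value, pad_to=None) -> str:
    """Return a ZIP code as a clean string (hypothetical helper).

    Floats such as 48703.0 become '48703'; missing values become ''.
    Non-numeric codes such as 'W1F 7LP' are stripped but otherwise unchanged.
    """
    if _is_missing(value):
        return ""
    if isinstance(value, float) and value.is_integer():
        value = int(value)
    text = str(value).strip()
    if pad_to is not None and text.isdigit():
        text = text.zfill(pad_to)  # assumption: restores dropped leading zeros
    return text

def build_address(street, zip_code, city) -> str:
    """Join the non-missing parts with single spaces."""
    parts = [street, normalize_zip(zip_code), city]
    cleaned = [str(p).strip() for p in parts if not _is_missing(p)]
    return " ".join(p for p in cleaned if p)

print(build_address("Vredener Straße 41", 48703.0, "Stadtlohn"))
print(repr(build_address(float("nan"), float("nan"), float("nan"))))
```

Keeping single spaces between the parts is a deliberate difference from the cell above: token-based scorers such as `fuzz.token_sort_ratio` split their input on whitespace, so separators let street, ZIP code and city be compared as separate tokens.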

In [24]:
steps_address_dtype = df_steps['Steps_Address'].dtype
steps_address_dtype

dtype('O')

In [25]:
#df_salesforce['Billing ZIP'] = df_salesforce['Billing ZIP'].astype(int)

In [26]:
df_salesforce['Street Number'].dtype
df_salesforce['Street Number'].head()

0     15/1
1       17
2       32
3       12
4    11-13
Name: Street Number, dtype: object

In [27]:
df_salesforce['Billing Street'].dtype
df_salesforce['Billing Street'].head()

0        Storchenstraße
1       Alte Pfarrgasse
2    Oldenburger Straße
3             Elmstraße
4             Hirschweg
Name: Billing Street, dtype: object

In [28]:
df_salesforce['Billing ZIP'].dtype
df_salesforce['Billing ZIP'].head()

0    88299
1    66440
2    26419
3    59320
4    33335
Name: Billing ZIP, dtype: object

In [29]:
df_salesforce['Billing City'].dtype
df_salesforce['Billing City'].head()

0              Leutkirch
1            Blieskastel
2    Schortens Heidmühle
3             Ennigerloh
4              Gütersloh
Name: Billing City, dtype: object

In [30]:
df_salesforce['Billing Country'].dtype
df_salesforce['Billing Country'].head()

0    Germany
1    Germany
2    Germany
3    Germany
4    Germany
Name: Billing Country, dtype: object

Converting the Salesforce address columns to strings (dtype object)

In [31]:
df_salesforce['Billing Street'] = df_salesforce['Billing Street'].astype(str)
df_salesforce['Street Number'] = df_salesforce['Street Number'].astype(str)
df_salesforce['Billing ZIP'] = df_salesforce['Billing ZIP'].astype(str)
df_salesforce['Billing City'] = df_salesforce['Billing City'].astype(str)
# Verify the conversion
print(df_salesforce['Billing Street'].dtype)
print(df_salesforce['Street Number'].dtype)
print(df_salesforce['Billing ZIP'].dtype)
print(df_salesforce['Billing City'].dtype)
# Preview the first rows (.head() must be called with parentheses)
print(df_salesforce['Billing Street'].head())
print(df_salesforce['Street Number'].head())
print(df_salesforce['Billing ZIP'].head())
print(df_salesforce['Billing City'].head())

object
object
object
object
0        Storchenstraße
1       Alte Pfarrgasse
2    Oldenburger Straße
3             Elmstraße
4             Hirschweg
Name: Billing Street, dtype: object
0     15/1
1       17
2       32
3       12
4    11-13
Name: Street Number, dtype: object
0    88299
1    66440
2    26419
3    59320
4    33335
Name: Billing ZIP, dtype: object
0              Leutkirch
1            Blieskastel
2    Schortens Heidmühle
3             Ennigerloh
4              Gütersloh
Name: Billing City, dtype: object

In [32]:
df_salesforce['Billing Street'].head()
df_salesforce['Street Number'].head()
df_salesforce['Billing ZIP'].head()
df_salesforce['Billing City'].head()

0              Leutkirch
1            Blieskastel
2    Schortens Heidmühle
3             Ennigerloh
4              Gütersloh
Name: Billing City, dtype: object

In [33]:
df_salesforce['Billing Street'].head()

0        Storchenstraße
1       Alte Pfarrgasse
2    Oldenburger Straße
3             Elmstraße
4             Hirschweg
Name: Billing Street, dtype: object

In [34]:
df_salesforce['Salesforce_Address'] = df_salesforce['Billing Street'] + df_salesforce['Street Number'] + df_salesforce['Billing ZIP'] + df_salesforce['Billing City'] 
df_salesforce['Salesforce_Address'].head()

0                Storchenstraße15/188299Leutkirch
1               Alte Pfarrgasse1766440Blieskastel
2    Oldenburger Straße3226419Schortens Heidmühle
3                      Elmstraße1259320Ennigerloh
4                    Hirschweg11-1333335Gütersloh
Name: Salesforce_Address, dtype: object

What is known:

(1) Some Steps IDs have already been added to Salesforce.
Conclusion: These "prefilled" rows need to be excluded before matching.

-> If a Salesforce row already carries a Steps ID that exists in Steps, filter it out.

In [35]:
# Exclude Salesforce rows whose Steps ID is already filled (currently disabled)

#df_salesforce_filtered = df_salesforce[df_salesforce['Manual_StepsID'] == 0]

#print("Filtered Salesforce DataFrame (excluding filled 'Manual_StepsID'):")
#print(df_salesforce_filtered)
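
The exclusion can be applied on both sides: Salesforce rows that already have a Steps ID need no match, and the Steps rows they point to are no longer candidates. The sketch below uses made-up frames with the column names from this notebook and keeps the convention that 0 means "no manual match".

```python
import pandas as pd

# Made-up frames; 0 marks "no manual match", as in this notebook.
sf = pd.DataFrame({
    "Salesforce ID": ["a", "b", "c"],
    "Manual_StepsID": [0, 330860, 0],
})
steps = pd.DataFrame({
    "Steps ID": [329984, 330860, 501395],
    "Customer Name": ["X", "Y", "Z"],
})

# Salesforce rows that still need a match.
sf_open = sf[sf["Manual_StepsID"] == 0]

# Steps rows whose ID has not been assigned manually yet.
assigned = set(sf.loc[sf["Manual_StepsID"] != 0, "Manual_StepsID"])
steps_open = steps[~steps["Steps ID"].isin(assigned)]

print(sf_open)
print(steps_open)
```

Removing assigned IDs from the candidate pool also prevents a second Salesforce account from being matched to a Steps customer that is already taken.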

Approach Outline


(1) What is known: Some Steps IDs have already been added to Salesforce. Conclusion: These "prefilled" rows need to be excluded before matching.

-> If a Salesforce row already carries a Steps ID that exists in Steps, filter it out.

(2) Load both sheets -> compare the names and addresses of both sheets -> evaluate techniques
- Candidate techniques: pick the most probable match (Levenshtein distance; FuzzyWuzzy)

Check the highest ratio for:
- Name
- ZIP code (group by it)
- Street

Filter ratio: 95% / 90%

-> Create a DataFrame for each sheet and compare the two
-> Check each row of Steps against the rows from Salesforce

- Remove empty spaces

Further steps:
- Remove common tokens, e.g. 


Limitation: This is not NLP. There is no "intelligence" behind the matching; it only compares tokens.

process.extract()
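
The outline above (group by ZIP code, compare names, keep matches above a threshold) can be sketched as follows. This is an illustration under stated assumptions: the records are made up, and `similarity` uses Python's standard `difflib` as a stand-in for a fuzzywuzzy scorer so that the example runs without extra packages.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> int:
    """0-100 similarity; difflib stand-in for a fuzzywuzzy scorer."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

# Made-up records: (id, name, zip)
salesforce = [("sf1", "Bestattungen Muster", "48703"),
              ("sf2", "Beispiel Bestattungshaus", "60311")]
steps = [(101, "Bestattungen Muster", "48703"),
         (102, "Pietät Anders", "48703"),
         (103, "Ganz Anderes Institut", "22089")]

# Blocking: index Steps candidates by ZIP so only same-ZIP pairs are compared.
by_zip = defaultdict(list)
for steps_id, name, zip_code in steps:
    by_zip[zip_code].append((steps_id, name))

THRESHOLD = 90  # one of the cut-offs considered in the outline

matches = {}
for sf_id, name, zip_code in salesforce:
    candidates = by_zip.get(zip_code, [])
    if not candidates:
        matches[sf_id] = None  # no Steps customer in this ZIP code
        continue
    best_id, best_name = max(candidates, key=lambda c: similarity(name, c[1]))
    score = similarity(name, best_name)
    matches[sf_id] = (best_id, score) if score >= THRESHOLD else None

print(matches)
```

Grouping by ZIP code first reduces the number of comparisons considerably compared with scoring every Salesforce row against all Steps rows, at the cost of missing pairs whose ZIP codes differ or are absent.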


Converting Billing ZIP to string (dtype object)

In [36]:
df_salesforce['Billing ZIP'].dtype

dtype('O')

In [37]:
df_salesforce['Billing ZIP'] = df_salesforce['Billing ZIP'].astype(str)

# Verify the conversion
print(df_salesforce['Billing ZIP'].dtype)



object


Finding the best match for each Salesforce address in Steps DataFrame

Snippet 1: Issue with this code: the elements used to build Salesforce_Address and Steps_Address contain NaN and non-string values (e.g. ZIP codes). This raised a TypeError, because process.extractOne() expects a string or bytes-like object.

In [38]:

# Function to find the best match for each Salesforce address in Steps DataFrame
#def find_best_match(salesforce_address, steps_addresses):
 #   best_match = process.extractOne(salesforce_address, steps_addresses, scorer=fuzz.token_sort_ratio)
  #  return best_match

# Apply the function to find matches
#matches = df_salesforce['Salesforce_Address'].apply(find_best_match, steps_addresses=df_steps['Steps_Address'])

# Extract matched addresses and scores
#df_salesforce['Best Match Address'] = matches.apply(lambda x: x[0] if x[1] >= 95 else None)
#df_salesforce['Match Score'] = matches.apply(lambda x: x[1])

# Filter rows with matches having at least 95% match score
#matched_rows = df_salesforce[df_salesforce['Match Score'] >= 95]

# Display the DataFrame with matched addresses
#print("Salesforce DataFrame with best matches:")
#print(df_salesforce)


#-> TypeError: expected string or bytes-like object, got 'float'

# Some Salesforce_Address or Steps_Address values are NaN or otherwise not strings (e.g. ZIP codes),
# while process.extractOne() expects a string or bytes-like object.

What does this code add compared to Snippet 1?
- It handles missing or None results by checking that x is not None before accessing x[0] and x[1]. This ensures that NaN (Not a Number) and non-string values do not cause errors when the results are unpacked.

Attempt 1

In [39]:
def find_best_match(salesforce_address, steps_addresses):
    best_match = process.extractOne(salesforce_address, steps_addresses, scorer=fuzz.token_sort_ratio)
    return best_match
# Apply the function to find matches
matches = df_salesforce['Salesforce_Address'].apply(find_best_match, steps_addresses=df_steps['Steps_Address'])

# Extract matched addresses and scores
df_salesforce['Best Match Address'] = matches.apply(lambda x: x[0] if x and x[1] >= 95 else None)
df_salesforce['Match Score'] = matches.apply(lambda x: x[1] if x else None)

# Filter rows with matches having at least 95% match score
matched_rows = df_salesforce[df_salesforce['Match Score'] >= 95]

# Display the DataFrame with matched addresses
print("Salesforce DataFrame with best matches:")
print(df_salesforce)

Salesforce DataFrame with best matches:
           Salesforce ID                             Account Name  \
0     0010X000045oU62QAE                     Bestattungen Gredler   
1     0010X000045okZGQAY                   Toussaint Bestattungen   
2     0010X000045om5rQAA                        Hermann Janßen KG   
3     0010X000045peOdQAI              Wonnemann e.K. Bestattungen   
4     0010X000045piDPQAY                Müther e. K. Bestattungen   
...                  ...                                      ...   
4464  001b0000044cAQgAAM  Vergissmeinnicht® Abschied & Bestattung   
4465  001b0000044cmhlAAA                    Bestattungshaus Dunas   
4466  001b0000044dG3zAAE               Bestattungshaus Engelhardt   
4467  001b0000044dac3AAA          Bestattungsinstitut Hans Müller   
4468  001b0000044ddYjAAI        Bestattungshaus Michael Jürgensen   

           Account Owner Parent Account ID  \
0        Maximilian Witt               NaN   
1     Jasmin Ouali Turki               

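In the printed preview, the 'Best Match Address' column of df_salesforce is empty (None).
This indicates that almost no addresses reach a score of 95% or higher; the frequency table in the troubleshooting step shows a single row with a score of 95.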

Troubleshooting

In [40]:
match_score_mode = df_salesforce['Match Score'].mode()

print("Mode of Match Scores:")
print(match_score_mode)

Mode of Match Scores:
0    51
Name: Match Score, dtype: int64


In [41]:
frequency_table = df_salesforce['Match Score'].value_counts().sort_index()

# Display the frequency table
print("Frequency table of Match Scores:")
print(frequency_table)


Frequency table of Match Scores:
Match Score
34      3
40      3
41      2
42      4
43     17
44     34
45     48
46     88
47    187
48    270
49    314
50    235
51    357
52    352
53    261
54    216
55    247
56    154
57    137
58    121
59     82
60     90
61     79
62     77
63     60
64     56
65     58
66     38
67     69
68     75
69     51
70     72
71     71
72     62
73     56
74     53
75     64
76     53
77     46
78     33
79     40
80     19
81     16
82     13
83     11
84      8
85      3
86     47
87      3
88      6
89      2
90      3
91      2
95      1
Name: count, dtype: int64
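
The low scores are consistent with how the address strings were built. Token-based scorers split their input on whitespace and punctuation, so joining fields without separators and leaving `.0` on float ZIP codes produces different tokens for the same address. The sketch below approximates a token-sort comparison with Python's standard `difflib`. It is a stand-in for illustration, not fuzzywuzzy's implementation, and `salesforce_raw` is a hypothetical counterpart to a Steps address shown earlier in this notebook.

```python
import re
from difflib import SequenceMatcher

def token_sort_score(a: str, b: str) -> int:
    """Approximate token-sort similarity (0-100) using only the standard library."""
    def normalise(text: str) -> str:
        # Replace non-word characters with spaces, lowercase, sort the tokens.
        tokens = re.sub(r"\W", " ", text).lower().split()
        return " ".join(sorted(tokens))
    return round(100 * SequenceMatcher(None, normalise(a), normalise(b)).ratio())

# Steps-style address as built in this notebook (street + ZIP as float + city).
steps_raw = "Vredener Straße 4148703.0Stadtlohn"
# Hypothetical Salesforce-style concatenation of the same address.
salesforce_raw = "Vredener Straße4148703Stadtlohn"

# The same address with separators kept and the ZIP written without '.0'.
steps_clean = "Vredener Straße 41 48703 Stadtlohn"
salesforce_clean = "Vredener Straße 41 48703 Stadtlohn"

raw_score = token_sort_score(salesforce_raw, steps_raw)
clean_score = token_sort_score(salesforce_clean, steps_clean)
print(f"raw: {raw_score}, clean: {clean_score}")
```

In the raw strings the house number, ZIP code and city fuse into tokens such as `straße4148703stadtlohn` and `0stadtlohn`, which sort into different positions, so an identical address scores well below the 95% threshold.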


Attempt 2: Using Pandas to remove spaces, hyphens, periods, and slashes.

Attempt 2: Steps

In [42]:
def remove_special_characters(street_name):
    return street_name.replace(" ", "").replace("-", "").replace(".", "").replace("/", "")

# Apply the function to the 'Street' column and create a new column 'Street_without_special_characters'
df_steps['Street_without_special_characters'] = df_steps['Street'].apply(remove_special_characters)

# Display the DataFrame
print(df_steps['Street_without_special_characters'])

0                       nan
1          VredenerStraße41
2           Kirchnerstraße4
3       StuttgarterStraße26
4             EilbekerWeg16
               ...         
6588    MannheimerStraße229
6589            Lindenstr39
6590              Worthweg5
6591        Donnerburgweg40
6592             Liebenau36
Name: Street_without_special_characters, Length: 6593, dtype: object


Attempt 2: Salesforce

In [43]:
df_salesforce['Billing_Street_without_special_characters'] = df_salesforce['Billing Street'].str.replace(r'[ \-./]', '', regex=True)

# Display the DataFrame
df_salesforce['Billing_Street_without_special_characters'].head()

0       Storchenstraße
1       AltePfarrgasse
2    OldenburgerStraße
3            Elmstraße
4            Hirschweg
Name: Billing_Street_without_special_characters, dtype: object

In [44]:
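# (1) Build the cleaned Steps address. The preview below shows only the cleaned street column, not the new address column.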
df_steps['Steps_Address_without_special_characters'] = df_steps['Street_without_special_characters'] + df_steps['Zip Code'] + df_steps['City']
df_steps['Street_without_special_characters'].head()

0                    nan
1       VredenerStraße41
2        Kirchnerstraße4
3    StuttgarterStraße26
4          EilbekerWeg16
Name: Street_without_special_characters, dtype: object

In [45]:
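# (2) Build the cleaned Salesforce address. This overwrites the cleaned street column with the full address.
# 'Street Number' and 'Billing City' are not cleaned, so '/', '-' and spaces remain in the result.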
df_salesforce['Billing_Street_without_special_characters'] = df_salesforce['Billing_Street_without_special_characters'] + df_salesforce['Street Number'] + df_salesforce['Billing ZIP'] + df_salesforce['Billing City'] 
df_salesforce['Billing_Street_without_special_characters'].head()

0               Storchenstraße15/188299Leutkirch
1               AltePfarrgasse1766440Blieskastel
2    OldenburgerStraße3226419Schortens Heidmühle
3                     Elmstraße1259320Ennigerloh
4                   Hirschweg11-1333335Gütersloh
Name: Billing_Street_without_special_characters, dtype: object
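
The combined key above joins a cleaned street name with the raw street number and city, so values such as "15/1", "11-13" and "Schortens Heidmühle" keep their separators. A minimal sketch of one way to clean every part the same way, assuming the same four separators should be dropped everywhere; `build_address_key` is a hypothetical helper and not part of the notebook:

```python
import re

# Separators the notebook strips from street names: space, hyphen, dot, slash.
SEPARATORS = re.compile(r"[ \-./]")

def build_address_key(*parts):
    """Join address parts into one key, dropping separators and missing values.

    Hypothetical helper for illustration. None, float NaN and the literal
    string "nan" are treated as missing.
    """
    cleaned = []
    for part in parts:
        if part is None:
            continue
        text = str(part)
        if text.lower() == "nan":
            continue
        cleaned.append(SEPARATORS.sub("", text))
    return "".join(cleaned)

print(build_address_key("Storchenstraße", "15/1", "88299", "Leutkirch"))
print(build_address_key("Oldenburger Straße", "32", "26419", "Schortens Heidmühle"))
```

Because the dot is removed as well, a zip code stored as a float-like string (for example "48703.0") would need to be normalized before it goes through this helper.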

2: Matching Function Using the Cleaned Addresses

In [46]:
def find_best_match(cleaned_salesforce_address, cleaned_steps_addresses):
    # Find the best match for the cleaned Salesforce address among the Steps addresses.
    # The result is a tuple whose first two items are the matched string and its score.
    best_match = process.extractOne(cleaned_salesforce_address, cleaned_steps_addresses, scorer=fuzz.token_sort_ratio)
    return best_match
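
The `token_sort_ratio` scorer compares strings after sorting their whitespace-separated tokens, which only helps when the strings still contain separate tokens. The sketch below is a standard-library approximation of that idea, not fuzzywuzzy's implementation; `sort_tokens` and `approx_token_sort_score` are hypothetical names, and the scores will not match fuzzywuzzy's exactly:

```python
from difflib import SequenceMatcher

def sort_tokens(text):
    """Lowercase, split on whitespace, and rejoin the tokens in sorted order."""
    return " ".join(sorted(text.lower().split()))

def approx_token_sort_score(a, b):
    """Rough 0-100 similarity on token-sorted strings (an approximation only)."""
    matcher = SequenceMatcher(None, sort_tokens(a), sort_tokens(b))
    return round(100 * matcher.ratio())

# With spaces, word order does not affect the score.
print(approx_token_sort_score("Alte Pfarrgasse 17", "17 Pfarrgasse Alte"))

# Without spaces the address is a single token, so sorting changes nothing.
print(sort_tokens("AltePfarrgasse17"))
```

For keys that have had all spaces removed, token sorting has little left to reorder, so the comparison behaves much like a plain character-level ratio.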


3: Apply the Matching Function to Find Matches

In [47]:
# Apply the function to find matches
matches = df_salesforce['Billing_Street_without_special_characters'].apply(find_best_match, cleaned_steps_addresses=df_steps['Steps_Address_without_special_characters'])

# Extract matched addresses and scores
df_salesforce['Second_Best_Match_Address'] = matches.apply(lambda x: x[0] if x and x[1] >= 95 else None)
df_salesforce['Second_Match_Score'] = matches.apply(lambda x: x[1] if x else None)

# Select the rows whose best match scores at least 95 (stored in matched_rows, not printed here)
matched_rows = df_salesforce[df_salesforce['Second_Match_Score'] >= 95]

# Display the full DataFrame, including rows below the threshold
print("Salesforce DataFrame with best matches:")
print(df_salesforce)


Salesforce DataFrame with best matches:
           Salesforce ID                             Account Name  \
0     0010X000045oU62QAE                     Bestattungen Gredler   
1     0010X000045okZGQAY                   Toussaint Bestattungen   
2     0010X000045om5rQAA                        Hermann Janßen KG   
3     0010X000045peOdQAI              Wonnemann e.K. Bestattungen   
4     0010X000045piDPQAY                Müther e. K. Bestattungen   
...                  ...                                      ...   
4464  001b0000044cAQgAAM  Vergissmeinnicht® Abschied & Bestattung   
4465  001b0000044cmhlAAA                    Bestattungshaus Dunas   
4466  001b0000044dG3zAAE               Bestattungshaus Engelhardt   
4467  001b0000044dac3AAA          Bestattungsinstitut Hans Müller   
4468  001b0000044ddYjAAI        Bestattungshaus Michael Jürgensen   

           Account Owner Parent Account ID  \
0        Maximilian Witt               NaN   
1     Jasmin Ouali Turki               

In [48]:
match_score_mode = df_salesforce['Second_Match_Score'].mode()

print("Second_Match_Score:")
print(match_score_mode)

Second_Match_Score:
0    62
Name: Second_Match_Score, dtype: int64
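
The mode only reports the most frequent score. To see how many rows actually pass the 95 cutoff, a small summary can be more informative. A minimal sketch on a plain list of scores; `summarize_scores` is a hypothetical helper, and the sample values are made up for illustration rather than taken from the notebook:

```python
from statistics import median, multimode

def summarize_scores(scores, threshold=95):
    """Return count, matches above the threshold, their share, median, and modes."""
    scores = list(scores)
    if not scores:
        return {"n": 0, "n_matched": 0, "share_matched": 0.0, "median": None, "modes": []}
    n_matched = sum(score >= threshold for score in scores)
    return {
        "n": len(scores),
        "n_matched": n_matched,
        "share_matched": n_matched / len(scores),
        "median": median(scores),
        "modes": multimode(scores),
    }

# Made-up scores for illustration only, not results from the notebook.
print(summarize_scores([62, 62, 70, 95, 100]))
```

With the notebook's data, the scores could be passed in as `df_salesforce['Second_Match_Score'].dropna().tolist()`.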


Third attempt: remove the street-type substrings ("str", "straße", "strasse") from the addresses in both DataFrames, then apply the matching function again.

In [49]:
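# Function to remove special characters and specified substrings
def clean_address(text):
    # Real NaN values become an empty string; the literal string "nan" passes through
    if pd.isna(text):
        return ""
    # Strip spaces, hyphens, dots and slashes
    text = text.replace(" ", "").replace("-", "").replace(".", "").replace("/", "")
    # str.replace is case-sensitive: lowercase "straße"/"strasse"/"str" are removed,
    # while capitalized forms such as "Straße" are left in place
    text = text.replace("straße", "").replace("strasse", "").replace("str", "")
    return text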

# Apply the function to create new columns in both DataFrames
df_salesforce['Billing_Street3'] = df_salesforce['Billing Street'].apply(clean_address)
df_steps['Street3'] = df_steps['Street'].apply(clean_address)

# Combine with other relevant columns
df_salesforce['Billing_Street3'] += df_salesforce['Street Number'] + df_salesforce['Billing ZIP'] + df_salesforce['Billing City']
df_steps['Steps_Address3'] = df_steps['Street3'] + df_steps['Zip Code'] + df_steps['City']

# Display the DataFrames to verify
print("Salesforce DataFrame with cleaned Billing Street:")
print(df_salesforce[['Billing Street', 'Billing_Street3']].head())

print("Steps DataFrame with cleaned Street:")
print(df_steps[['Street', 'Steps_Address3']].head())

Salesforce DataFrame with cleaned Billing Street:
       Billing Street                              Billing_Street3
0      Storchenstraße                   Storchen15/188299Leutkirch
1     Alte Pfarrgasse             AltePfarrgasse1766440Blieskastel
2  Oldenburger Straße  OldenburgerStraße3226419Schortens Heidmühle
3           Elmstraße                         Elm1259320Ennigerloh
4           Hirschweg                 Hirschweg11-1333335Gütersloh
Steps DataFrame with cleaned Street:
                  Street                     Steps_Address3
0                    nan                          nannannan
1     Vredener Straße 41   VredenerStraße4148703.0Stadtlohn
2       Kirchnerstraße 4  Kirchner460311.0Frankfurt am Main
3  Stuttgarter Straße 26    StuttgarterStraße2668782.0Brühl
4        Eilbeker Weg 16        EilbekerWeg1622089.0Hamburg
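
The output above shows three things that are likely to lower the match scores: capitalized "Straße" survives the cleaning, Steps zip codes carry a trailing ".0", and missing values end up as "nannannan". A minimal sketch of one way to address these, under stated assumptions: the street type is removed only when it directly precedes the house number or ends the string (a narrower rule than the notebook's replace-anywhere approach), and postal codes are assumed to be five-digit German codes. `clean_street` and `normalize_zip` are hypothetical helpers:

```python
import re

SEPARATORS = re.compile(r"[ \-./]")
# Street type in any capitalization, only when followed by a digit or the end of the string.
STREET_TYPE = re.compile(r"(?:straße|strasse|str)(?=\d|$)", flags=re.IGNORECASE)

def is_missing(value):
    """Treat None, float NaN and the literal string "nan" as missing."""
    return value is None or str(value).strip().lower() == "nan"

def clean_street(text):
    """Strip separators, then drop the street type regardless of capitalization."""
    if is_missing(text):
        return ""
    return STREET_TYPE.sub("", SEPARATORS.sub("", str(text)))

def normalize_zip(value):
    """Turn 48703.0 or "48703.0" into "48703"; missing values become ""."""
    if is_missing(value):
        return ""
    text = str(value).strip()
    if text.endswith(".0"):
        text = text[:-2]
    return text.zfill(5)  # assumption: five-digit German postal codes

print(clean_street("Oldenburger Straße"))
print(clean_street("Vredener Straße 41"))
print(normalize_zip("48703.0"))
```

Both helpers work on single values, so they could be used with `.apply(...)` on the street and zip code columns before the parts are concatenated.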


In [50]:
# Function to find the best match using cleaned addresses
def find_best_match(cleaned_salesforce_address, cleaned_steps_addresses):
    best_match = process.extractOne(cleaned_salesforce_address, cleaned_steps_addresses, scorer=fuzz.token_sort_ratio)
    return best_match

# Apply the function to find matches
matches = df_salesforce['Billing_Street3'].apply(find_best_match, cleaned_steps_addresses=df_steps['Steps_Address3'])

# Extract matched addresses and scores
df_salesforce['Third_Best_Match_Address'] = matches.apply(lambda x: x[0] if x and x[1] >= 95 else None)
df_salesforce['Third_Match_Score'] = matches.apply(lambda x: x[1] if x else None)

# Select the rows whose best match scores at least 95 (stored in matched_rows, not printed here)
matched_rows = df_salesforce[df_salesforce['Third_Match_Score'] >= 95]

# Display the full DataFrame, including rows below the threshold
print("Salesforce DataFrame with best matches:")
print(df_salesforce)

Salesforce DataFrame with best matches:
           Salesforce ID                             Account Name  \
0     0010X000045oU62QAE                     Bestattungen Gredler   
1     0010X000045okZGQAY                   Toussaint Bestattungen   
2     0010X000045om5rQAA                        Hermann Janßen KG   
3     0010X000045peOdQAI              Wonnemann e.K. Bestattungen   
4     0010X000045piDPQAY                Müther e. K. Bestattungen   
...                  ...                                      ...   
4464  001b0000044cAQgAAM  Vergissmeinnicht® Abschied & Bestattung   
4465  001b0000044cmhlAAA                    Bestattungshaus Dunas   
4466  001b0000044dG3zAAE               Bestattungshaus Engelhardt   
4467  001b0000044dac3AAA          Bestattungsinstitut Hans Müller   
4468  001b0000044ddYjAAI        Bestattungshaus Michael Jürgensen   

           Account Owner Parent Account ID  \
0        Maximilian Witt               NaN   
1     Jasmin Ouali Turki               

In [51]:
match_score_mode = df_salesforce['Third_Match_Score'].mode()

print("Third_Match_Score:")
print(match_score_mode)

Third_Match_Score:
0    62
Name: Third_Match_Score, dtype: int64


Account Name (Salesforce) / Customer Name (Steps)
-> Clean the names by removing the specified characters from the "Account Name" column in df_salesforce and the "Customer Name" column in df_steps.

Apply the Function to Clean the Columns

In [54]:
# Define the function to remove specified characters and substrings
def clean_text(text):
    if pd.isna(text):
        return ""
    text = text.replace(" ", "").replace("-", "").replace(".", "").replace("&", "").replace("", "")
    text = text.replace("?", "").replace("*", "").replace("«", "").replace("»", "")
    text = text.replace("/", "").replace("+", "").replace("·", "").replace("(", "").replace(")", "")
    return text

# Apply the function to clean the 'Account Name' and 'Customer Name' columns
df_salesforce['Cleaned_Account_Name'] = df_salesforce['Account Name'].apply(clean_text)
df_steps['Cleaned_Customer_Name'] = df_steps['Customer Name'].apply(clean_text)

# Function to find the best-matching Salesforce account name for each cleaned Steps customer name
def find_best_match_for_steps(cleaned_customer_name, cleaned_account_names):
    best_match = process.extractOne(cleaned_customer_name, cleaned_account_names, scorer=fuzz.token_sort_ratio)
    return best_match

# Apply the function to find matches
matches = df_steps['Cleaned_Customer_Name'].apply(find_best_match_for_steps, cleaned_account_names=df_salesforce['Cleaned_Account_Name'])

# Extract matched account names and scores
df_steps['Fourth_Best_Match_Account_Name'] = matches.apply(lambda x: x[0] if x and x[1] >= 95 else None)
df_steps['Fourth_Match_Score'] = matches.apply(lambda x: x[1] if x else None)

# Select the rows whose best match scores at least 95 (stored in matched_rows, not printed here)
matched_rows = df_steps[df_steps['Fourth_Match_Score'] >= 95]

# Display the name and match columns for all rows, including those below the threshold
print("Steps DataFrame with best matches:")
print(df_steps[['Customer Name', 'Cleaned_Customer_Name', 'Fourth_Best_Match_Account_Name', 'Fourth_Match_Score']])

Steps DataFrame with best matches:
                       Customer Name        Cleaned_Customer_Name  \
0         DBU Bestattungs-Union GmbH      DBUBestattungsUnionGmbH   
1             ?Bestattungen Potthoff         BestattungenPotthoff   
2           „Pietät“ Berthold Wiesel       „Pietät“BertholdWiesel   
3     1. Brühler Bestattungsinstitut  1BrühlerBestattungsinstitut   
4               ASV Deutschland GmbH           ASVDeutschlandGmbH   
...                              ...                          ...   
6588                       Zorn GmbH                     ZornGmbH   
6589          Zotner Bestattungshaus        ZotnerBestattungshaus   
6590           Zur Ruhe Bestattungen          ZurRuheBestattungen   
6591      Zur Ruhe Bestattungen GmbH      ZurRuheBestattungenGmbH   
6592              Zwölfer, Christine            Zwölfer,Christine   

     Fourth_Best_Match_Account_Name  Fourth_Match_Score  
0                              None                  80  
1                   
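
In the sample output, the typographic quotes in „Pietät“ and the comma in "Zwölfer, Christine" survive the cleaning. A minimal sketch of a single-pass cleaner built on `str.translate`; `clean_name` is a hypothetical helper, and removing the quotes and the comma is an assumption based on that output rather than something the notebook does:

```python
# Characters removed by the notebook's clean_text.
NOTEBOOK_CHARS = " -.&?*«»/+·()"
# Assumption: typographic quotes and the comma should be removed as well.
EXTRA_CHARS = "„“,"
REMOVE_TABLE = str.maketrans("", "", NOTEBOOK_CHARS + EXTRA_CHARS)

def clean_name(text):
    """Remove the listed characters in one pass; missing values become ""."""
    if text is None:
        return ""
    if isinstance(text, float) and text != text:  # float NaN is not equal to itself
        return ""
    return str(text).translate(REMOVE_TABLE)

print(clean_name("„Pietät“ Berthold Wiesel"))
print(clean_name("Zwölfer, Christine"))
```

A translation table keeps the list of removed characters in one place, which makes it easier to extend than a chain of `.replace(...)` calls.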

In [55]:
# The Fourth_Match_Score column was added to df_steps, not df_salesforce
match_score_mode = df_steps['Fourth_Match_Score'].mode()

print("Fourth_Match_Score:")
print(match_score_mode)

In [52]:
output_file = r'C:\Users\Katha\Documents\Job application documents\Data Analyst application\Case Studies\Rapid Data\df_salesforce.xlsx'

# Export the DataFrame to an Excel file
df_salesforce.to_excel(output_file, index=False, engine='openpyxl')

print(f"DataFrame has been exported to {output_file}")

DataFrame has been exported to C:\Users\Katha\Documents\Job application documents\Data Analyst application\Case Studies\Rapid Data\df_salesforce.xlsx
