# Leads Data - Cleaning & Processing

The dataset contains information about leads from various European countries, distributed across four separate Excel sheets. The first three sheets correspond to Germany, Austria, and Switzerland, while the fourth sheet encompasses leads from mixed countries. Each dataset primarily includes details such as leads, their addresses, zip codes, cities, countries, and telephone numbers. Although the mixed countries dataset contains additional fields for names, surnames, and salutations, a significant portion of this data is missing. Common issues within the datasets include incorrect data formatting, missing values, the presence of duplicates, and numerous uncleaned phone numbers.

### Import statements

<span style = "color:red" > [Note] If the phonenumbers package is not installed, install it by running 'pip install phonenumbers' in your terminal. </span>

In [1]:
import pandas as pd
import numpy as np
import re
import phonenumbers
from phonenumbers import NumberParseException
import warnings

In [2]:
# Avoid warnings
warnings.filterwarnings("ignore")

### Custom function to clean phone number

In [3]:
def clean_phone_number(row):
    """
    Cleans the phone number based on the specified country, removing leading and trailing country codes and any leading zeros.
    
    Parameters:
    row (pd.Series): A row from the DataFrame containing 'telefon' and 'country'.
    
    Returns:
    str: The cleaned phone number without country code or leading zeros, or NaN if input is NaN.
    """
    
    phone = row['telefon']
    country = row['country']
    
    # If phone number is NaN, return NaN
    if pd.isna(phone):
        return np.nan
    
    # Step 1: Remove all non-numeric characters
    cleaned_phone = re.sub(r'\D', '', phone)  # Remove all non-digit characters

    # Country-specific country code removal
    country_codes = {
        'DE': ['0049', '49'],
        'AT': ['0043', '43'],
        'CH': ['0041', '41'],
        'IT': ['0039', '39'],
        'GB': ['0044', '44']
    }

    # Step 2: Remove leading or trailing country codes based on country
    if country in country_codes:
        for code in country_codes[country]:
            # Remove from the start
            if cleaned_phone.startswith(code):
                cleaned_phone = cleaned_phone[len(code):]  # Remove the country code from the start
                break  # Exit after removing the first found country code

            # Remove from the end
            if cleaned_phone.endswith(code):
                cleaned_phone = cleaned_phone[:-len(code)]  # Remove the country code from the end; len(code) is 4 or 2,
                break                                       # so that many characters are dropped from the end

    # Step 3: Remove leading zeros after country code removal
    cleaned_phone = cleaned_phone.lstrip('0')

    return cleaned_phone

<span style = "color:blue" >Note: There are instances where country codes may appear either at the beginning or the end of phone numbers. I developed a function to effectively handle these cases after removing all non-digit characters. In this process, I did not utilize the Python "phonenumbers" library directly for creating clean phone number formats, as I encountered some inconsistent formats that were still considered valid by the library.</span>
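To make the three cleaning steps concrete, here is a minimal standalone sketch applied to a few raw values taken from the samples further down. The helper name `strip_country_code` and the shortened `COUNTRY_CODES` table are hypothetical and only mirror the logic of `clean_phone_number` for a single string.

```python
import re

# Same prefixes as in clean_phone_number; the helper itself is hypothetical.
COUNTRY_CODES = {"DE": ["0049", "49"], "AT": ["0043", "43"], "CH": ["0041", "41"]}

def strip_country_code(phone: str, country: str) -> str:
    """Mirror the three cleaning steps on a single raw string."""
    digits = re.sub(r"\D", "", phone)            # Step 1: keep digits only
    for code in COUNTRY_CODES.get(country, []):  # Step 2: drop one country code
        if digits.startswith(code):
            digits = digits[len(code):]
            break
        if digits.endswith(code):
            digits = digits[:-len(code)]
            break
    return digits.lstrip("0")                    # Step 3: drop leading zeros

examples = [
    ("+49174-6599330", "DE"),
    ("00491600795434", "DE"),
    ("0043/650545 8500", "AT"),
    ("+4176abc7352-651", "CH"),
]
cleaned = [strip_country_code(p, c) for p, c in examples]
print(cleaned)
```

Because a trailing match is also removed, a local number that happens to end in the country code (for example a German number ending in 49) loses its last two digits under this rule, which is worth keeping in mind when reviewing flagged rows.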

<span style="color:blue">[Important] The function "validate_phone_length" is designed to validate phone numbers after they have been cleaned. For each country, the typical length of local phone numbers (excluding leading zeros and country codes) has been predefined. Once the phone numbers are cleaned, this function checks each number's digit length and updates the "flag" column accordingly, indicating "valid" or "bad data." For example, if a cleaned German phone number has 12 digits, the flag will be set to "bad data," as standard German phone numbers typically have 10 to 11 digits.</span>
<br>

<span style="color:blue">The "validate_phone_length" function is kept in a raw (non-executed) cell so that it can be switched on if desired. Validation with Python's "phonenumbers" library may yield different results, because the library accounts for certain exceptions (e.g., some 15-digit numbers are considered valid). For a stricter check, the self-written function can be applied instead. In this notebook, validation is carried out with "phonenumbers".</span>
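As a rough illustration of the length check described above, a sketch might look like the following. The body of the notebook's own `validate_phone_length` cell is not shown here, so this version is an assumption. Only the German range of 10 to 11 digits comes from the text; the Austrian and Swiss ranges are placeholders that would need to be adjusted.

```python
import math

# Assumed digit-length ranges for cleaned numbers (no country code, no leading zero).
# Only the German range is stated in the text; the others are placeholders.
EXPECTED_LENGTHS = {
    "DE": (10, 11),
    "AT": (10, 11),  # assumption
    "CH": (9, 9),    # assumption
}

def validate_phone_length(row):
    """Return 'valid' if the cleaned number's digit count fits the country's range, else 'bad data'."""
    phone = row["cleaned phone number"]
    country = row["country"]

    # Missing phone numbers (None or NaN) cannot be validated
    if phone is None or (isinstance(phone, float) and math.isnan(phone)):
        return "bad data"

    bounds = EXPECTED_LENGTHS.get(country)
    if bounds is None:  # unknown or missing country
        return "bad data"

    low, high = bounds
    return "valid" if low <= len(phone) <= high else "bad data"

print(validate_phone_length({"cleaned phone number": "1746599330", "country": "DE"}))
```

The function accepts anything that supports key lookup, so it works both with a plain dict and with a row passed by `DataFrame.apply(..., axis=1)`.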

### Function to Validate Cleaned Phone Numbers using "phonenumbers"

In [4]:
def validate_phone_number(row):
    """
    Validates the phone number using the phonenumbers library based on the country.
    
    Parameters:
    row (pd.Series): A row from the DataFrame containing 'cleaned phone number' and 'country'.
    
    Returns:
    str: 'bad data' if the phone number is invalid, otherwise 'valid'.
    """
    
    phone_number = row['cleaned phone number']
    country = row['country']
    
    # If phone number is NaN or missing, return 'bad data'
    if pd.isna(phone_number):
        return 'bad data'
    
    try:
        # Parse the phone number using phonenumbers
        parsed_number = phonenumbers.parse(phone_number, country)
        
        # Check if the phone number is valid for the given country
        if phonenumbers.is_valid_number(parsed_number):
            return 'valid'
        else:
            return 'bad data'
    
    except phonenumbers.NumberParseException:
        # Handle cases where the number cannot be parsed
        return 'bad data'

<span style = "color:blue" >[Note] The function above is applied to the "flag" column </span>

In [5]:
# Disable scientific notation globally in pandas
pd.set_option('display.float_format', lambda x: '%.0f' % x)

<span style = "color:red;" >[Note] In the mixed countries sheet, in the 'telefonnr' column, the values are in scientific notation format. </span>
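The display option above only changes how floats are rendered; it does not alter stored values. If the phone numbers arrive as strings in exponent form (an assumption about how the sheet is read), they could be expanded to plain digits with the standard library. The helper name `expand_scientific` is hypothetical.

```python
from decimal import Decimal, InvalidOperation

def expand_scientific(value: str) -> str:
    """Return plain digits for a string such as '1.7670975872E10'; leave other input unchanged."""
    # Only touch strings that could be in exponent notation, so leading zeros elsewhere survive
    if "e" not in value.lower():
        return value
    try:
        number = Decimal(value)
    except InvalidOperation:
        return value  # free text that merely contains the letter 'e'
    # Skip NaN/Infinity and values that are not whole numbers
    if not number.is_finite() or number != number.to_integral_value():
        return value
    return format(number.to_integral_value(), "f")

print(expand_scientific("1.7670975872E10"))
```

Using `Decimal` rather than `float` avoids rounding for long digit strings.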

### Reading Datasets

In [6]:
# File path relative to the notebook's directory structure
data_path = '../Inputs/data cleanup assignment.xlsx'

In [7]:
sheets = ["DE", "AT", "CH", "mixed"]
dataframes = {}

for name in sheets:
    df = pd.read_excel(
        data_path,
        sheet_name=name,
        dtype=str,                # Convert all columns to strings
        na_filter=True            # Convert empty strings to NaN
    )

    # Normalise string cells while preserving NaN:
    # 1. If the value is a string that is empty or contains only whitespace, it is replaced with NaN.
    # 2. If the value is a string that ends with '.0', that suffix is removed. Slicing is used instead of
    #    rstrip('.0'), which strips every trailing '.' and '0' character (e.g. '33100.0' would become '331').
    # 3. If neither condition is met, the original value is retained.
    dataframes[name] = df.applymap(lambda x: np.nan if isinstance(x, str) and x.strip() == '' else (x[:-2] if isinstance(x, str) and x.endswith('.0') else x))


# Access each DataFrame by its sheet name
data_de = dataframes["DE"]
data_at = dataframes["AT"]
data_ch = dataframes["CH"]
data_mix = dataframes["mixed"]
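When trimming a trailing '.0' from stringified numbers, note that `str.rstrip` takes a set of characters rather than a suffix. The short sketch below contrasts the two behaviours; the helper name `strip_float_suffix` is hypothetical.

```python
def strip_float_suffix(value):
    """Remove one trailing '.0' from a string; return anything else unchanged."""
    if isinstance(value, str) and value.endswith(".0"):
        return value[:-2]
    return value

# rstrip removes every trailing '.' and '0' character, not just the suffix
print("33100.0".rstrip(".0"))         # 331
print(strip_float_suffix("33100.0"))  # 33100
```

This matters for zip codes and phone numbers that legitimately end in zeros.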


**First, we look at each dataset separately to get an overview and to make small adjustments before merging them.**

## 1. Data for Germany

In [8]:
data_de.sample(7, random_state = 43)

Unnamed: 0,firma,street,plz,city,old_ctry,telefon
270,Salzstangerlheuriger,Badener Straße 11,2544,Leobersdorf,DE,+49174-6599330
709,"Bohlen Elektrowärme, Inh. Stefan Urbanek",Freiberger Straße 55,9669,Frankenberg,DE,9/5405657
860,WERBUNG aktuell,Dessauer Straße 6a,6886,Lutherstadt Wittenberg,DE,+49175abc2494-023
11,König Wilhelm Autoteile - Einbrodt & Schubert GbR,Dahlweg 122,48153,Münster,DE,00491600795434
134,Prenzl´berger Friseure,Stubbenkammerstraße 4,10437,Berlin,DE,00491773478809
514,"Bäckerei und Café Reif, Inhaber: Jürgen Reif",Bärwalder Straße 9,1471,Radeburg,DE,+491780760340
345,Praxis für Ergotherapie Kathrin Kenzler,Am Margaretenhof 26,19057,Schwerin,DE,0049176-3343742


In [9]:
data_de.shape

(865, 6)

In [10]:
data_de.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 865 entries, 0 to 864
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   firma     865 non-null    object
 1   street    865 non-null    object
 2   plz       865 non-null    object
 3   city      865 non-null    object
 4   old_ctry  865 non-null    object
 5   telefon   865 non-null    object
dtypes: object(6)
memory usage: 40.7+ KB


In [11]:
data_de.isna().sum()

firma       0
street      0
plz         0
city        0
old_ctry    0
telefon     0
dtype: int64

In [12]:
# Changing the column 'old_ctry' name to match the data in other dataframes
data_de.rename(columns = {'old_ctry':'country'}, inplace = True)

In [13]:
# Check whether other countries are present as well
data_de['country'].value_counts()

country
DE    865
Name: count, dtype: int64

## 2. Data for Austria 

In [14]:
data_at.sample(7, random_state = 43)

Unnamed: 0,firma,street,plz,city,country,telefon
56,Niedermayer Steuerberatung GmbH,Passauer Straße 13,4780,Schärding,AT,"Für Hilfe, wählen Sie: +ZZ66-ERROR 676 1911048"
37,Almrausch Planai,Planaistraße 2,8971,Schladming,AT,Telefon: 00 914 CALL 3938 / 676
67,VIDEBIS GmbH,Schlosshofer Straße 6,1210,Wien,AT,664.2502286
79,Naturpraxis Inge Nedved-Kempf,Königsklostergasse 7/3 Mezzanin,1060,Wien,AT,0043/650545 8500
80,Orthopädie Schuhtechnik Nindl GmbH,Kirchenstraße 36,5733,Bramberg am Wildkogel,AT,
188,Weinschloss Koarl Thaller GmbH,Maierhofbergen 24,8263,Großwilfersdorf,AT,+43/676217 5111
183,Buschenschank Luttenberger,Seibersdorf 20,8423,St. Veit in der Südsteiermark,AT,


In [15]:
data_at.shape

(200, 6)

In [16]:
data_at.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   firma    200 non-null    object
 1   street   200 non-null    object
 2   plz      200 non-null    object
 3   city     200 non-null    object
 4   country  200 non-null    object
 5   telefon  189 non-null    object
dtypes: object(6)
memory usage: 9.5+ KB


In [17]:
data_at.isna().sum()

firma       0
street      0
plz         0
city        0
country     0
telefon    11
dtype: int64

In [18]:
# Check whether other countries are present as well
data_at['country'].value_counts()

country
AT    200
Name: count, dtype: int64

## 3. Data for Switzerland

In [19]:
data_ch.sample(7, random_state = 43)

Unnamed: 0,firma,street,plz,city,country,telefon
2,Nocera & Strub AG,Hirzenstrasse 1,9244,Niederuzwil,CH,Telefon: 00 037 CALL 9506 / 79
62,Imfeld Uhren + Schmuck GmbH,Hofstrasse 2,6061,Sarnen,CH,+41783764144
78,Benediktinerkloster Mariastein,Klosterplatz 2,4115,Mariastein,CH,Telefon: 00 954 CALL 8410 / 79
15,ECOTENT Flümoart,Seetalstrasse 72,8280,Kreutzlingen TG,CH,/78/3585206
11,Tierferienheim Höfli,Höfli 66,4574,Nennigkofen,CH,0041/79688 5332
45,"Prinzli Spielwaren, Inh.: Doris Kälin",Bergstrasse 1,8712,Stäfa,CH,0041/906 6268
63,Go Fish GmbH Fischerei- & Outdoorartikel,Hinterbergstrasse 56,6312,Steinhausen,CH,+4176abc7352-651


In [20]:
data_ch.shape

(92, 6)

In [21]:
data_ch.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92 entries, 0 to 91
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   firma    92 non-null     object
 1   street   92 non-null     object
 2   plz      92 non-null     object
 3   city     92 non-null     object
 4   country  92 non-null     object
 5   telefon  85 non-null     object
dtypes: object(6)
memory usage: 4.4+ KB


In [22]:
data_ch.isna().sum()

firma      0
street     0
plz        0
city       0
country    0
telefon    7
dtype: int64

In [23]:
# Check whether other countries are present as well
data_ch['country'].value_counts()

country
CH    92
Name: count, dtype: int64

## 4. Data for Mixed Countries

In [24]:
data_mix.sample(7, random_state = 43)

Unnamed: 0,firma,street,plz,city,landesvorwahl,telefonnr,anrede,vorname,nachname
1334,Khorasan Market ( خراسان مارکت ),Marienstraße 9,33098.0,Paderborn,49,17670975872,,,
2304,Pizza Star Inh.: Nilofar Wali,Detmolder Str. 74,33100.0,Paderborn,49,52511438283,,,
465,Haus Gabriele,Silge 26,36433.0,Bad Salzungen,49,3695853779,,,
667,Hotel Luzern Engel,Luzernerstrasse 1,6285.0,Hitzkirch,41,419112050,,,
3582,AL HAMDO GmbH,Martin-Luther-King-Str. 2,63452.0,Hanau,49,1628504560,Herr,Abdul,Kader Al Hamdo
1895,Begemann Wellness Fitness Center,,,Extertal,49,5262993325,,,
2860,ERGO Versicherung Christian Aner,Normannenstraße 3,10367.0,Berlin,49,3030343411,,,


In [25]:
data_mix.shape

(3618, 9)

In [26]:
data_mix.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3618 entries, 0 to 3617
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   firma          3072 non-null   object
 1   street         3025 non-null   object
 2   plz            3062 non-null   object
 3   city           3026 non-null   object
 4   landesvorwahl  3617 non-null   object
 5   telefonnr      3581 non-null   object
 6   anrede         354 non-null    object
 7   vorname        468 non-null    object
 8   nachname       395 non-null    object
dtypes: object(9)
memory usage: 254.5+ KB


In [27]:
data_mix.isna().sum()

firma             546
street            593
plz               556
city              592
landesvorwahl       1
telefonnr          37
anrede           3264
vorname          3150
nachname         3223
dtype: int64

In [28]:
# Change 'landesvorwahl','telefonnr','anrede','vorname', 'nachname' column names
new_column_names = {'landesvorwahl':'country code','telefonnr':'telefon','anrede':'salutation','vorname':'first name','nachname':'surname'}

data_mix.rename(columns = new_column_names, inplace = True)

In [29]:
data_mix['country code'].value_counts()

country code
49    3429
41     114
43      70
39       1
12       1
44       1
4        1
Name: count, dtype: int64

#### Creating "country code" column in mixed data (e.g. 0049)

In [30]:
# Prefix every value with '00' unless it already starts with '00' or is missing
data_mix['country code'] = data_mix['country code'].apply(lambda x: '00' + str(x) if pd.notna(x) and not str(x).startswith('00') else x)

[Note] The code above prepends a double zero to each value that is not NA and does not already start with '00'.

In [31]:
data_mix['country code'].unique()

array(['0049', '0041', '0043', '0039', '0012', '0044', nan, '004'],
      dtype=object)

<span style = "color:red"> [Note] The country codes '0012' and '004', as well as the missing (NA) value, need further investigation since they are invalid. </span>

In [32]:
# Filter rows where 'country code' is NaN, '004', or '0012'
invalid_rows = data_mix[data_mix['country code'].isna() | 
                         data_mix['country code'].isin(['004', '0012'])]

# Display the filtered rows
invalid_rows

Unnamed: 0,firma,street,plz,city,country code,telefon,salutation,first name,surname
655,BLUME2000 im Famila Bad Segeberg,Eutiner Straße 45,23795.0,Klein Rönnau,12.0,623301212,,,
741,,,31.0,,,455281144,,,
871,,,,,4.0,22897479018,,,


<span style = "color:red; font-size:16px" > [Note] Row 655 can be corrected manually using the city: Klein Rönnau is located in Germany, so the German country code is assigned. Rows 741 and 871 cannot be recovered and will be kept as bad data. </span>

In [33]:
# Replacing country code with '0049' where city is 'Klein Rönnau'
data_mix.loc[data_mix['city'] == 'Klein Rönnau', 'country code'] = '0049'

In [34]:
# Add a flag column to data_mix
data_mix['flag'] = np.where(
    data_mix['country code'].isin(['004', '0012']) | data_mix['country code'].isna(),
    'bad data',
    ''  # Default value if the condition is not met
)

<span style = "color:blue" >[Note] The "flag" column is created to differentiate "bad" from "valid" data. For example, rows with an unrecoverable or missing value in the phone number column receive the value "bad data". </span>

#### Creating Country column

The country column is missing in data_mix, so it is derived from the country codes.

In [35]:
# Define mapping from country code to country
country_mapping = {
    '0049': 'DE',
    '0041': 'CH',
    '0043': 'AT',
    '0039': 'IT',
    '0044': 'GB'
}

In [36]:
# Function to map country codes to countries
def map_country(code):
    ''' Map a country code such as '0049' to its two-letter country abbreviation '''
    return country_mapping.get(code, np.nan)  # Return NaN for invalid or missing codes

In [37]:
# Add the country column
data_mix['country'] = data_mix['country code'].apply(map_country)

In [38]:
data_mix['country'].isna().sum()

2

## Merging all 4 dataframes

In [39]:
# Outer joining all 4 dataframes based on 'country', 'firma', 'street', 'plz', 'city', and 'telefon' columns
merge_on_columns = ['country', 'firma', 'street', 'plz', 'city', 'telefon']

merged_data = data_de.merge(data_at, on=merge_on_columns, how='outer') \
                       .merge(data_ch, on=merge_on_columns, how='outer') \
                       .merge(data_mix, on=merge_on_columns, how='outer')

merged_data.sample(7, random_state = 43)

Unnamed: 0,firma,street,plz,city,country,telefon,country code,salutation,first name,surname,flag
997,"Pohl Johann, Architekt - Ingenieur - Baumeister",Gewerbestraße 2a,6430.0,Ötztal-Bahnhof,AT,436993772837,,,,,
2338,Kai Murek Photographie,Margaretenhof 10-12,45888.0,Gelsenkirchen,DE,15752381270,49.0,,Kai Murek,,
2783,Lale Tour,Bahnhofstr.20,74366.0,Kirchheim am Neckar,DE,71439692727,49.0,,,,
4612,A & P Modevertriebs GmbH,Fasanenweg 57,15711.0,Königs Wusterhausen,DE,3375292930,49.0,Frau,Anja,Wagner,
3156,HDS GmbH Projekte im Bau,Hainstr. 105,9130.0,Chemnitz,DE,371431340,49.0,,,,
1967,Mobile-Hundepflege.com,Am Weilerweg 14,64665.0,Alsbach-Hähnlein,DE,17642491934,49.0,,,,
1561,,,,,DE,8954882664,49.0,,,,
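With `how='outer'` and every shared column used as a key, rows that agree on all keys are combined into a single row, while all remaining rows from either side are kept and padded with NaN. The toy frames below are invented for illustration and use a shortened key list.

```python
import pandas as pd

keys = ["country", "firma", "telefon"]  # shortened key list for illustration

left = pd.DataFrame({
    "country": ["DE", "DE"],
    "firma": ["A GmbH", "B GmbH"],
    "telefon": ["111", "222"],
})
right = pd.DataFrame({
    "country": ["DE", "CH"],
    "firma": ["B GmbH", "C AG"],
    "telefon": ["222", "333"],
    "surname": ["Muster", "Beispiel"],
})

# 'B GmbH' matches on all keys and appears once; the other rows are kept as they are
merged = left.merge(right, on=keys, how="outer")
print(merged)
```

A lead that appears both in a country sheet and in the mixed sheet with identical key values therefore shows up only once after the merge, whereas any difference in a key column (for example an uncleaned phone number) keeps both rows.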


In [40]:
merged_data['country code'].value_counts()

country code
0049    3430
0041     114
0043      70
0039       1
0044       1
004        1
Name: count, dtype: int64

## Data Pre-Processing for the Merged Dataframe

The DE, CH, and AT data lack country codes (the opposite of the mixed countries data, which lacked the country column), so the codes are generated from the country values.

In [41]:
# Define mapping from country to country code (this replaces the earlier code-to-country mapping)
country_mapping = {
    'DE': '0049',
    'CH': '0041',
    'AT': '0043',
    'IT': '0039',
    'GB': '0044'
}

In [42]:
# Function to map countries to country codes
def map_country(country):
    ''' Map country to country code '''
    return country_mapping.get(country, np.nan)  # Return NaN for invalid or missing codes

In [43]:
# Map values to country code column
merged_data['country code'] = merged_data['country'].apply(map_country)

### The Custom function to ***clean phone number*** is applied to merged data

In [44]:
# Apply cleaning function 
merged_data['cleaned phone number'] = merged_data.apply(clean_phone_number, axis=1)

If you would like to use my custom validation code (as outlined at the beginning), please follow these steps:

1. Change the cell for the validate_phone_length function at the start of this file from raw to code, then run it.
2. Change the cell containing the validate_phone_number function at the beginning from code back to raw.
3. Comment out the line where the validate_phone_number function is applied to the "flag" column below.
4. Uncomment the code line located below this cell.

In [45]:
#merged_data['flag'] = merged_data.apply(validate_phone_length, axis=1) 

In [46]:
# apply validation function to the dataframe
merged_data['flag'] = merged_data.apply(validate_phone_number, axis=1)

In [47]:
merged_data['flag'].value_counts()

flag
valid       4374
bad data     401
Name: count, dtype: int64

### Creating "digit length" Column

In [48]:
# Calculate the lengths of cleaned_phone
merged_data['digit length'] = merged_data['cleaned phone number'].str.len()

<span style = "color:red" > [Note] The column is created based on the number of digits in "cleaned phone number" column in order to have more insights </span>

### Removing duplicate values from relevant columns

In [49]:
# creating list of all columns in the dataframe
columns = merged_data.columns.tolist()

# Print the number of duplicated values in each column
for col in columns:
    duplicates = merged_data[col].duplicated().sum()
    print(f"{col} - {duplicates} duplicates ")

firma - 669 duplicates 
street - 639 duplicates 
plz - 2043 duplicates 
city - 2850 duplicates 
country - 4769 duplicates 
telefon - 95 duplicates 
country code - 4769 duplicates 
salutation - 4767 duplicates 
first name - 4394 duplicates 
surname - 4395 duplicates 
flag - 4773 duplicates 
cleaned phone number - 95 duplicates 
digit length - 4751 duplicates 


***To ensure data integrity, we must first remove any duplicate entries in the "firma" column across the DE, CH, AT, and Mix sheets. However, since it is possible for different companies to share the same country, city, and country code, we will not remove duplicates from these columns. This approach allows us to maintain relevant data without losing important distinctions between companies operating in the same locations.***

In [50]:
# Remove duplicates from the 'firma' column and reset the index
merged_data_cleaned = merged_data.drop_duplicates(subset=['firma']).reset_index(drop=True)
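One detail worth knowing here is that pandas treats missing values as duplicates of one another, so `drop_duplicates(subset=['firma'])` keeps only a single row with a missing company name. The toy frame below is invented to show the effect, and the alternative mask is a hypothetical variant, not what the notebook does.

```python
import pandas as pd

demo = pd.DataFrame({
    "firma": ["A GmbH", "A GmbH", None, None, "B GmbH"],
    "telefon": ["111", "111", "222", "333", "444"],
})

# Missing names count as equal, so only one nameless row survives
deduped = demo.drop_duplicates(subset=["firma"])

# Hypothetical alternative: drop repeated names but keep every nameless row
keep_mask = demo["firma"].isna() | ~demo["firma"].duplicated()
deduped_keep_missing = demo[keep_mask].reset_index(drop=True)

print(len(deduped), len(deduped_keep_missing))
```

Which variant is appropriate depends on whether leads without a company name are still considered contactable through their phone number.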

***Duplicates in the phone number column can be considered "bad data" because they may lead to confusion and inefficiencies in data processing. Typically, each unique business should have a distinct phone number to facilitate accurate communication. However, there are instances where different companies might share the same phone number, particularly in certain sectors like shared offices or co-working spaces.***

***To ensure data integrity, we will mark these duplicate phone number entries as "bad data" in the flag column for further analysis. This will help us to identify and address potential inconsistencies while also allowing us to distinguish between legitimate duplicates and errors in our dataset. By doing so, we can refine our data quality and improve the overall accuracy of our analysis.***

In [51]:
# Update the 'flag' column for duplicates in the 'cleaned phone number' column
merged_data_cleaned.loc[merged_data_cleaned['cleaned phone number'].duplicated(keep=False), 'flag'] = 'bad data'

***We assume that contacting a business requires at least the following information: the company name (firma), the country in which it operates, the country code, and the phone number. If all of these values are missing at the same time, the row is categorized as "bad" or "unrecoverable" in the flag column. The code below checks whether such rows exist.***

In [52]:
# Show rows where 'firma', 'country code', 'country', and 'cleaned phone number' are all missing
missing_data = merged_data_cleaned[merged_data_cleaned[['firma', 'country code', 'country', 'cleaned phone number']].isnull().all(axis=1)]

# Display the result
print(missing_data)

Empty DataFrame
Columns: [firma, street, plz, city, country, telefon, country code, salutation, first name, surname, flag, cleaned phone number, digit length]
Index: []


<span style = "color:red" >[Note] The results indicate that the simultaneous absence of the firma name, country, country code, and phone number does not occur. </span>

***The next assumption is that, despite the significant number of missing values in the salutation, first name, and last name columns, these omissions do not hinder our ability to contact the businesses. Therefore, instead of removing rows with missing data in these fields, we will replace any empty values in the salutation, first name, and last name columns with the phrase "No data." This approach allows us to retain all records while providing a clear indication of missing information.***

In [53]:
# Replace missing values with 'No data' in specified columns using .loc
merged_data_cleaned.loc[:, ['first name', 'surname', 'salutation']] = merged_data_cleaned[['first name', 'surname', 'salutation']].fillna('No data')

In [54]:
merged_data_cleaned['flag'].value_counts()

flag
valid       3724
bad data     382
Name: count, dtype: int64

***The company names in the 'firma' column are checked for validity by analyzing the string lengths of their values. This process involves calculating the length of each string in the column and displaying the top 15 entries with the shortest string lengths in ascending order. This helps identify any potential anomalies or unusually short entries in the data.***

***If anomalies are found in the string lengths of the 'firma' column and address data is missing for those entries at the same time, such data will be flagged as "bad data" in the flag column for further review.***

In [55]:
# Calculating the length of values in the 'firma' column
merged_data_cleaned['firma length'] = merged_data_cleaned['firma'].str.len()

# Sort the DataFrame by the length of 'firma' in ascending order and select the 15 shortest entries
top_15_firma_lengths = merged_data_cleaned[['firma', 'firma length', 'street', 'flag']].sort_values(by='firma length').head(15)

# Display the result
print(top_15_firma_lengths)

      firma  firma length                      street   flag
1433      L             1                         NaN  valid
1681      t             1                         NaN  valid
3239     NN             2              Sulzerring 24   valid
1813    MÔC             3             Grafenstraße 20  valid
1848    LBS             3                         NaN  valid
1924    Eni             3                         NaN  valid
3177   Wiko             4          Elisenhofstraße 5   valid
3316   Home             4          Paul-Klee-Str. 34   valid
3287   Jahn             4                Weidestr. 6   valid
2576   Büro             4        Rothäuserbergweg 12   valid
1764   rosi             4              Weberstraße 18  valid
1744   Café             4      Heddesdorfer Straße 33  valid
2978  Ravel             5              Ludwigstr. 91   valid
2557  KONAK             5        Hamburger Straße 10   valid
3730  Härer             5  Wilhelm-Bahmüller-Str. 59   valid


<span style = "color:red;font-size:14px" > As shown in the results, the first two entries in the 'firma' column contain only a single letter, and both are missing corresponding address information. This makes it difficult to identify or locate these businesses, presenting a challenge for contacting them. Consequently, these entries will be labeled as "bad data" in the flag column for further analysis.</span>

In [56]:
# Update the 'flag' column where 'firma' length is less than 3 and 'street' is NaN
merged_data_cleaned.loc[
    (merged_data_cleaned['firma'].str.len() < 3) & 
    (merged_data_cleaned['street'].isna()), 
    'flag'
] = 'bad data'

***Reordering the columns to improve readability and streamline data analysis***

In [57]:
# Define column name order
new_column_order = [
    'firma', 'street', 'plz', 'city', 'telefon', 
    'country', 'country code', 'cleaned phone number', 
    'flag', 'salutation', 'first name', 'surname', 
    'digit length', 'firma length'
]

# Reorder the columns
merged_data_cleaned = merged_data_cleaned[new_column_order]

## Creating Bad and Clean Data

Based on the entries in the 'flag' column, two distinct datasets are created. Rows marked as 'bad data' are transferred to the bad_data dataset, while those flagged as 'valid' are moved to the clean_data dataset. This allows for a clear separation of invalid or problematic records from the clean, usable data for further processing.

In [58]:
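# Copy the filtered frames so that columns can be added later without writing to a slice of merged_data_cleaned
bad_data = merged_data_cleaned[merged_data_cleaned['flag'] == 'bad data'].copy()
clean_data = merged_data_cleaned[merged_data_cleaned['flag'] == 'valid'].copy()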

**The function below adds a unique_id column to a DataFrame so that each row has a unique identifier. These IDs matter during database insertion: they prevent conflicts and allow each record to be reliably referenced and managed within the database.**

In [59]:
# Function to add a sequential unique_id column
def add_sequential_unique_id_column(df, prefix):
    df['unique_id'] = [f"{prefix}{i+1}" for i in range(len(df))]
    return df

In [60]:
# Add unique_id to clean_data and bad_data
clean_data = add_sequential_unique_id_column(clean_data, prefix='cleaned_')
bad_data = add_sequential_unique_id_column(bad_data, prefix='bad_')

## Writing Both the Bad and Clean DataFrames to Excel Files

In [61]:
# Uncomment the lines below if you need to write the Excel files again

# bad_data.to_excel('../Inputs/leads_in_review_data.xlsx', index=False)  # Write bad_data to an Excel file
# clean_data.to_excel('../Inputs/leads_cleaned_data.xlsx', index=False)  # Write clean_data to a separate Excel file

<span style = "color:blue" >[Note] A helper function could also be used so that, if a file already exists at the target location, each run writes a new version with a suffix (_v2, _v3, and so on) instead of overwriting it. </span>

In conclusion, I successfully cleaned the merged dataset, focusing primarily on validating phone numbers using the Python 'phonenumbers' library. This process also involved addressing duplicated and missing values across the dataset. The cleaned dataset contains approximately 3,700 entries, with no missing values in key columns, while the review dataset holds around 380 entries, some of which still have missing values.

I removed 669 rows with duplicated values in the 'firma' column, significantly enhancing data quality. Additionally, I manually filled in the country code data where feasible. A flag column was introduced to distinguish between 'valid' and 'bad' data. After cleaning, I separated the datasets based on the flag values, saving the 'bad data' entries into leads_in_review_data.xlsx for further analysis and the 'valid' entries into leads_cleaned_data.xlsx.