<h1>Dataset Cleaning and Preparation</h1>
<hr/>
<em>Predicting Propensity to Purchase a Vehicle | Motus Hackathon Competition 2025</em>
<author>Sifiso Rimana</author> <br/> 
Student Number: 222138688 <br/>
Email: 222138688@student.uj.ac.za / rimanasifiso@gmail.com <br/>


In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
plt.style.use('ggplot')
import re

In [2]:
# constants
DATA_URL = "https://www.mxhackathon.co.za/docs/TrainData.csv"
EMAIL_REGEX = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
DOMAIN_REGEX = r'^@[a-zA-Z0-9][a-zA-Z0-9.-]*\.[a-zA-Z]{2,}$'

In [3]:
train_df = pd.read_csv(DATA_URL)
train_df.head()

Unnamed: 0,CustomerID,LeadID,DTLeadCreated,DTLeadAllocated,Dealer,LeadSource,LeadType,Seek,InterestMake,InterestModel,...,OBSEmail,Domain,CellPrefix,CellPhoneNoLength,HourOfEnquiry,DayOfEnquiry,InFinanceProcessSystemApp,FinanceApplied,FinanceApproved,VehicleSold
0,D3B8E20E-F798-4CA9-94EE-FF6D0E7E99F2,11187465,2025-02-15 14:34:00,2025-02-17 11:47:00,Motus Chery Germiston,DEALER FACEBOOK,Facebook,New,Chery,Tiggo 4 Pro 1.5 LIT MIT,...,XXXXXXXXXXXXXXXXXXX5@gmail.com,@gmail.com,61,10,14,15,0,0,0,0
1,372AB00B-6A1C-45D1-8D29-29287C627D3B,11187550,2025-02-15 14:50:00,2025-02-15 14:50:00,Renault Northcliff - 70045,RR-Dealer Facebook-Renault,Facebook,New,Renault,Triber,...,XXXXXXXX@gmail.com,@gmail.com,83,10,14,15,0,0,0,0
2,74262064-94E9-4E7E-9C01-CA8405243929,11187564,2025-02-15 14:51:00,2025-02-17 09:01:00,Durban Multifranchise - 70111,KIA-CCPPG-SONET,Other,New,KIA,Sonet,...,XXXXXXXXX39@gmail.com,@gmail.com,79,10,14,15,0,0,0,0
3,F4D65935-310B-455D-B414-1759C3B807F7,11187583,2025-02-15 14:55:00,2025-02-15 14:57:00,Lindsay Saker VW Germiston,Group Website VW: Specials,Website,New,Volkswagen,Polo Vivo,...,XXXXXXXXXX@gmail.com,@gmail.com,71,10,14,15,1,0,0,1
4,A732E82D-B4BE-438C-9690-07AB3EFD9CD4,11187615,2025-02-15 15:04:00,2025-02-15 15:04:00,Durban Multifranchise - 70111,TikTok,TikTok,New,Renault,Kwid,...,XXXXXXXXXX12@gmail.com,@gmail.com,71,10,15,15,0,0,0,0


In [4]:
# dropping fields which are not to be used in modelling
train_df = train_df.drop(['InFinanceProcessSystemApp', 'FinanceApplied', 'FinanceApproved'], axis=1)
train_df.columns = [col.lower() for col in train_df.columns]
df = train_df.copy()
df.head()

Unnamed: 0,customerid,leadid,dtleadcreated,dtleadallocated,dealer,leadsource,leadtype,seek,interestmake,interestmodel,obsfullname,obsemail,domain,cellprefix,cellphonenolength,hourofenquiry,dayofenquiry,vehiclesold
0,D3B8E20E-F798-4CA9-94EE-FF6D0E7E99F2,11187465,2025-02-15 14:34:00,2025-02-17 11:47:00,Motus Chery Germiston,DEALER FACEBOOK,Facebook,New,Chery,Tiggo 4 Pro 1.5 LIT MIT,XXXXXXXXXX XXXXXXXXX,XXXXXXXXXXXXXXXXXXX5@gmail.com,@gmail.com,61,10,14,15,0
1,372AB00B-6A1C-45D1-8D29-29287C627D3B,11187550,2025-02-15 14:50:00,2025-02-15 14:50:00,Renault Northcliff - 70045,RR-Dealer Facebook-Renault,Facebook,New,Renault,Triber,XXXXX XXXXXX,XXXXXXXX@gmail.com,@gmail.com,83,10,14,15,0
2,74262064-94E9-4E7E-9C01-CA8405243929,11187564,2025-02-15 14:51:00,2025-02-17 09:01:00,Durban Multifranchise - 70111,KIA-CCPPG-SONET,Other,New,KIA,Sonet,XXXXXX XXXXX,XXXXXXXXX39@gmail.com,@gmail.com,79,10,14,15,0
3,F4D65935-310B-455D-B414-1759C3B807F7,11187583,2025-02-15 14:55:00,2025-02-15 14:57:00,Lindsay Saker VW Germiston,Group Website VW: Specials,Website,New,Volkswagen,Polo Vivo,XXXXX-XXX XX XXXXX,XXXXXXXXXX@gmail.com,@gmail.com,71,10,14,15,1
4,A732E82D-B4BE-438C-9690-07AB3EFD9CD4,11187615,2025-02-15 15:04:00,2025-02-15 15:04:00,Durban Multifranchise - 70111,TikTok,TikTok,New,Renault,Kwid,XXXXX XXXXXXXX XXXXX X/X,XXXXXXXXXX12@gmail.com,@gmail.com,71,10,15,15,0


# 1. Cleaning and Preparation

## 1.1 Email Analysis

- **`OBSEmail`** and **`domain`** $\rightarrow$ **`is_valid_email`, `is_email_provided`**

An email is deemed provided when:
1) The email field is not empty

Email is valid when:
1) Email is deemed provided
2) Email passes the regular expression match
3) The domain associated with the email is valid 

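The domain associated with the email is valid when:
1) The domain passes the regular expression 
2) The domain TLD is valid
3) The domain DNS is valid, i.e. it has MX records (`has_mx_records` is defined for this check, although `is_valid_domain` does not call it)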
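
The regular-expression checks in the rules above can be tried on a few strings. This is a minimal sketch: the two patterns are copied from the constants cell, while the `matches` helper and the sample values are hypothetical and do not come from the dataset.

```python
import re

# Patterns copied from the constants cell of this notebook.
EMAIL_REGEX = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
DOMAIN_REGEX = r'^@[a-zA-Z0-9][a-zA-Z0-9.-]*\.[a-zA-Z]{2,}$'


def matches(pattern: str, value: str) -> bool:
    """Hypothetical helper: True when the whole string matches the pattern."""
    return bool(re.fullmatch(pattern, value, flags=re.IGNORECASE))


# Hypothetical sample values, not taken from the dataset.
email_checks = {
    "someone@gmail.com": matches(EMAIL_REGEX, "someone@gmail.com"),
    "someone@gmail": matches(EMAIL_REGEX, "someone@gmail"),          # no TLD
    "someone gmail.com": matches(EMAIL_REGEX, "someone gmail.com"),  # no @
}
domain_checks = {
    "@gmail.com": matches(DOMAIN_REGEX, "@gmail.com"),
    "gmail.com": matches(DOMAIN_REGEX, "gmail.com"),  # pattern expects the leading @
    "@gmail": matches(DOMAIN_REGEX, "@gmail"),        # no TLD
}

print(email_checks)
print(domain_checks)
```

Note that `DOMAIN_REGEX` expects the leading `@`, which is how the `domain` column stores its values.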

We can also flag emails from free providers with `is_personal_email`. An email is deemed a personal email if:
1) The domain is one of $\{\text{gmail.com}, \text{yahoo.com}, \text{hotmail.com}, \text{outlook.com}, \text{icloud.com}\}$, OR
2) The domain is close to one of the domains in 1), which usually indicates a typo. Closeness is measured with the Levenshtein distance; see this [Wikipedia article](https://en.wikipedia.org/wiki/Levenshtein_distance) for more information.
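
To make the closeness rule concrete, here is a pure-Python sketch of the standard Levenshtein distance (insertions, deletions and substitutions each cost 1). It is assumed to agree with `Levenshtein.distance` at its default weights. The names `levenshtein` and `nearest_provider_distance` and the sample domains are hypothetical.

```python
FREE_PROVIDERS = {"gmail.com", "yahoo.com", "hotmail.com", "outlook.com", "icloud.com"}


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b using the classic row-by-row DP."""
    previous = list(range(len(b) + 1))  # distance from '' to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ''
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def nearest_provider_distance(domain: str) -> int:
    """Smallest distance from the domain (without '@') to any free provider."""
    cleaned = domain.lstrip('@').lower()
    return min(levenshtein(cleaned, provider) for provider in FREE_PROVIDERS)


# Hypothetical sample domains.
for sample in ["@gmail.com", "@gmial.com", "@hotmal.com", "@mail.com"]:
    print(sample, nearest_provider_distance(sample))
```

The last sample is worth a second look: `@mail.com` is a single edit away from `gmail.com`, so a generous threshold such as 4 would treat it as a mistyped free provider even though it is a different domain. A smaller threshold may be a safer choice.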



In [5]:
import tldextract
import dns.resolver
from Levenshtein import distance
free_providers = {"gmail.com", "yahoo.com", "hotmail.com", "outlook.com", "icloud.com"}

# function to check if domain DNS is valid
def has_mx_records(domain):
    domain_clean = domain.lstrip('@')
    try:
        mx_records = dns.resolver.resolve(domain_clean, 'MX')
        return len(mx_records) > 0
    except Exception:  # e.g. NXDOMAIN, no answer or timeout; avoids swallowing KeyboardInterrupt
        return False

# function to detect typo distances
def typo_distance_of_popular(domain):
    return min([distance(domain.lstrip('@').lower(), provider) for provider in free_providers])
    
# function to check if domain overall is valid
def is_valid_domain(domain):
    if not isinstance(domain, str): return 0
    domain = domain.strip()
    # a near miss of a free provider (distance 1 to 4) is treated as a mistyped domain
    if 0 < typo_distance_of_popular(domain) <= 4: return 0  
    return int(
        (bool(re.fullmatch(DOMAIN_REGEX, domain, flags=re.IGNORECASE)) and # check if domain passes the regex match
        bool(tldextract.extract(domain.lstrip('@').lower()).suffix))   # check if TLD is valid 
    ) 

# function to check if the email is provided by free providers
def is_free_email_provider(domain, max_distance=4):
    if not isinstance(domain, str): return 0
    return int(0 <= typo_distance_of_popular(domain.strip()) <= max_distance)

# function to check if email is valid
def is_valid_email(email: str, domain:str, is_email_provided: int):
    return (is_email_provided and 
            is_valid_domain(domain) and 
            int(bool(re.fullmatch(EMAIL_REGEX, email, flags=re.IGNORECASE))))



In [6]:
df['is_free_email_provider'] = df['domain'].apply(is_free_email_provider)

## 1.2 Cellphone Analysis

- A valid `cellprefix` must have two digits
- A valid cellphone number must have 10 digits AND a valid `cellprefix`
- A personal phone has `cellprefix` beginning with 6 or 7 or 8
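
The three rules can be sketched as one small function. This variant is an assumption-laden illustration rather than the notebook's implementation: `normalise_prefix` and `check_cell` are hypothetical names, and the handling of numeric or missing prefixes is added here only to show how such values could be treated.

```python
import re


def normalise_prefix(prefix) -> str:
    """Return the prefix as a stripped string, or '' when it is missing."""
    if prefix is None or prefix != prefix:  # NaN is the only value unequal to itself
        return ''
    if isinstance(prefix, float) and prefix.is_integer():
        prefix = int(prefix)  # 61.0 -> 61 so that it prints as '61'
    return str(prefix).strip()


def check_cell(prefix, length) -> dict:
    """Apply the rules above: two-digit prefix, 10-digit number, 6/7/8 start."""
    prefix = normalise_prefix(prefix)
    valid_prefix = bool(re.fullmatch(r'[0-9]{2}', prefix))
    return {
        'is_valid_cell': int(valid_prefix and length == 10),
        'is_personal_cell': int(valid_prefix and prefix[0] in '678'),
    }


# Hypothetical inputs: string prefix, integer prefix, short number, missing prefix.
print(check_cell('61', 10))
print(check_cell(83, 10))
print(check_cell('71', 9))
print(check_cell(float('nan'), 10))
```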

In [7]:
def is_valid_cell(cell_prefix: str, cell_length: int):
    return int(isinstance(cell_length, int) and cell_length == 10 and len(cell_prefix) == 2)

In [8]:
# for _, row in df.iterrows():
#     print(f"{row['cellprefix']}\t {row['cellphonenolength']} \t {'✅' if is_valid_cell(row['cellprefix'], row['cellphonenolength']) else '❌'}")

In [9]:
def is_personal_cell(cellprefix):
    return int(len(cellprefix) == 2 and (6 <= int(cellprefix[0]) <= 8))

In [10]:
df['is_personal_cell'] = df['cellprefix'].apply(is_personal_cell)

In [11]:
df['is_valid_cell']=df.apply(lambda row: is_valid_cell(row['cellprefix'], row['cellphonenolength']), axis=1)

## 1.3 `HourOfEnquiry` and `DayOfEnquiry`

In [12]:
tod_bins = [0, 4, 7, 12, 17, 21, 24]
tod_labels = ['night', 'dawn', 'morning', 'afternoon', 'evening', 'night']

df['time_of_day'] = pd.cut(df['hourofenquiry'], bins=tod_bins, labels=tod_labels, right=False,ordered=False)
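
With `right=False`, each bin is closed on the left and open on the right, so hour 4 falls in `dawn` and hour 12 in `afternoon`. The sketch below reproduces that half-open lookup in plain Python with `bisect`. It is meant as an illustration of the interval logic, not a replacement for `pd.cut`; the `time_of_day` function name is hypothetical.

```python
from bisect import bisect_right

# Same edges and labels as the cell above.
TOD_BINS = [0, 4, 7, 12, 17, 21, 24]
TOD_LABELS = ['night', 'dawn', 'morning', 'afternoon', 'evening', 'night']


def time_of_day(hour):
    """Label an hour using left-closed, right-open bins: [0, 4), [4, 7), ..."""
    if not TOD_BINS[0] <= hour < TOD_BINS[-1]:
        return None  # outside every bin; pd.cut leaves such values as NaN
    # bisect_right counts the edges <= hour; subtracting 1 gives the bin index
    return TOD_LABELS[bisect_right(TOD_BINS, hour) - 1]


for hour in [0, 4, 14, 17, 23, 24]:
    print(hour, time_of_day(hour))
```

The last edge matters: a value equal to it (24 here) belongs to no bin, so the final edge has to be strictly greater than the largest value expected in the column.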

In [13]:
dom_bins = [1, 15, 25, 32]  # bins are right-open, so the last edge must exceed 31 or days 30 and 31 become NaN
dom_labels = ['Beginning of Month', 'Mid Month', 'Month End']

df['month_period'] = pd.cut(df['dayofenquiry'], bins=dom_bins, labels=dom_labels, right=False,ordered=False)

## 1.4 `InterestMake` and `InterestModel`

In [14]:
df['interestmake'] = df['interestmake'].fillna('NO INTEREST MAKE SELECTED')
df['interestmodel'] = df['interestmodel'].fillna('NO INTEREST MODEL SELECTED')

In [25]:
def clean_interestmodel(text):
    if pd.isna(text): return ''
    text = text.lower()
    text = re.sub(r'[^a-z0-9./ ]+', '', text)
    return text.strip()

df['model_clean'] = df['interestmodel'].apply(clean_interestmodel)
df['engine_size'] = df['model_clean'].str.extract(r'(\d\.\d)').astype(float)
df['transmission'] = df['model_clean'].apply(lambda x: 'auto' if 'a/t' in x or 'automatic' in x else ('manual' if 'manual' in x else 'Unspecified'))
df['is_promotional'] = df['model_clean'].str.contains(r'from only|per month|pm|only').apply(lambda x: int(x))
df['body_type'] = df["model_clean"].str.extract(r'(hatch|sedan|suv|pickup)')[0]
df['variant'] = df["model_clean"].str.extract(r'(pro|premium|comfortline|zen|dynamique|motion|trim level)')[0]
df['drive_type'] = df["model_clean"].str.extract(r'(fwd|2wd|4wd|rwd|awd)')[0]
df['model_description_length'] = df['model_clean'].apply(lambda x: len(x.split(' ')))

df['variant'] = df['variant'].fillna('Unspeficied')
df['body_type'] = df['body_type'].fillna('Unspeficied')
df['drive_type'] = df['drive_type'].fillna('Unspeficied')
df['engine_size'] = df['engine_size'].fillna('Unspeficied')

In [16]:
df[['model_clean', 'engine_size', 'transmission', 'is_promotional', 'body_type', 'variant', 'drive_type', 'model_description_length']].head(10)

Unnamed: 0,model_clean,engine_size,transmission,is_promotional,body_type,variant,drive_type,model_description_length
0,tiggo 4 pro 1.5 lit mit,1.5,Unspecified,0,Unspeficied,pro,Unspeficied,6
1,triber,Unspeficied,Unspecified,0,Unspeficied,Unspeficied,Unspeficied,1
2,sonet,Unspeficied,Unspecified,0,Unspeficied,Unspeficied,Unspeficied,1
3,polo vivo,Unspeficied,Unspecified,0,Unspeficied,Unspeficied,Unspeficied,2
4,kwid,Unspeficied,Unspecified,0,Unspeficied,Unspeficied,Unspeficied,1
5,grand i10 1.2 premium a/t,1.2,auto,0,Unspeficied,premium,Unspeficied,5
6,triber,Unspeficied,Unspecified,0,Unspeficied,Unspeficied,Unspeficied,1
7,i20 1.2 motion,1.2,Unspecified,0,Unspeficied,motion,Unspeficied,3
8,urban cruiser 1.5 xr manual,1.5,manual,0,Unspeficied,Unspeficied,Unspeficied,5
9,tiggo 4 pro,Unspeficied,Unspecified,0,Unspeficied,pro,Unspeficied,3


In [40]:
def get_interestmake_mapper(is_test: bool = False):
    mapper_url =  '../data/car_makes_test_clean.csv' if is_test else '../data/car_makes.csv'
    sep = ';' if is_test else ','
    car_make_df = pd.read_csv(mapper_url, sep=sep)
    return dict(zip(car_make_df['interestmake'].str.strip().str.lower(), 
                               car_make_df['clean_interestmake'].str.strip().str.lower()))

In [41]:
interestmake_mapper = get_interestmake_mapper()
df['clean_interestmake'] = df['interestmake'].str.strip().str.lower().map(interestmake_mapper)

df[['interestmake', 'clean_interestmake']].iloc[200:210]

Unnamed: 0,interestmake,clean_interestmake
200,Kia,kia
201,NO INTEREST MAKE SELECTED,no interest make selected
202,Mercedes-Benz,mercedes-benz
203,Hyundai,hyundai
204,AUDI,audi
205,NO INTEREST MAKE SELECTED,no interest make selected
206,Volkswagen,volkswagen
207,Volkswagen,volkswagen
208,VOLKSWAGEN,volkswagen
209,Nissan,nissan


## 1.5 `Dealer`

- `dealer_city`
- `dealer_group`
- `is_motus_group`

In [19]:
def extract_dealer_city(dealer: str) -> str:
    if 'Multifranchise' in dealer:
        # build a city name
        dealer_parts = dealer.split(" ")
        city_name = ''
        for word in dealer_parts:
            if word == 'Multifranchise':
                break
            else:
                city_name += ' ' + word

        return city_name.strip()
    
    match = re.findall(r'([A-Za-z ]+)[-–]?\s*(\d{5})?$', dealer)
    if match:
        return match[0][0].split()[-1]
    return dealer.split()[-1]


def extract_brand(dealer: str) -> str:
    known_brands = ['Chery', 'Renault', 'Toyota', 'VW', 'Hyundai', 'Kia', 'Mazda',
                    'Mercedes-Benz', 'Nissan', 'Ford', 'Audi', 'Peugeot', 'Citroën',
                    'Isuzu', 'Honda', 'MG', 'BMW', 'Subaru', 'Jaguar', 'Land Rover',
                    'LDV', 'Opel', 'Haval', 'Mitsubishi', 'Volvo', 'Fiat', 'Jeep', 'JETOUR']
    
    for brand in known_brands:
        if brand.lower() in dealer.lower():
            return brand
        
    if 'Lindsay Saker' in dealer:
        return 'Lindsay Saker VW'
    if 'Mercurius' in dealer:
        return 'Mercurius'
    if 'Cargo Motors' in dealer:
        return 'Mercedes-Benz'
    if 'Omoda' in dealer:
        return 'Omoda & Jaecoo'
    return 'General'


def is_motus(dealer):
    return 'Yes' if 'Motus' in dealer else 'No'

df['dealer_city'] = df['dealer'].apply(extract_dealer_city)
df['dealer_group'] = df['dealer'].apply(extract_brand)
df['is_motus_group'] = df['dealer'].apply(is_motus)


In [20]:
df[['dealer', 'dealer_city', 'is_motus_group', 'dealer_group']].head(20)

Unnamed: 0,dealer,dealer_city,is_motus_group,dealer_group
0,Motus Chery Germiston,Germiston,Yes,Chery
1,Renault Northcliff - 70045,Northcliff,No,Renault
2,Durban Multifranchise - 70111,Durban,No,General
3,Lindsay Saker VW Germiston,Germiston,No,VW
4,Durban Multifranchise - 70111,Durban,No,General
5,Hyundai Menlyn,Menlyn,No,Hyundai
6,Renault Northcliff - 70045,Northcliff,No,Renault
7,Alberton Multifranchise -70051,Alberton,No,General
8,Motus Toyota Cape Gate,Gate,Yes,Toyota
9,Motus Chery Polokwane,Polokwane,Yes,Chery


## 1.6 Name Analysis

- `name_length`
- `is_valid_name`
- `is_full_name`

- A name is considered valid if it contains only letters and spaces (no symbols or digits) and has more than two characters
- A name is considered a full name if it has at least two parts separated by a space, e.g. 'abc def'
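
A short sketch of how these rules behave on a few names. The `describe_name` helper and the sample names are hypothetical (the dataset masks real names); the regular expression is the one used in the code cell that follows.

```python
import re


def describe_name(name: str) -> dict:
    """Compute the three name features for a single, non-missing name."""
    cleaned = name.strip()
    only_letters_and_spaces = bool(re.fullmatch(r'[A-Za-z ]+', cleaned))
    return {
        'name_length': len(cleaned),
        'is_valid_name': int(len(cleaned) > 2 and only_letters_and_spaces),
        'is_full_name': int(len(cleaned.split()) >= 2),
    }


# Hypothetical names chosen to hit each rule.
for sample in ['Thabo Mokoena', 'Anne-Marie Nel', 'Thabo', 'Al']:
    print(sample, describe_name(sample))
```

Under this rule a hyphenated name such as the second sample counts as a full name but not as a valid one, because the hyphen is treated as a symbol.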

In [21]:
def name_length(name):
    # convert to str before stripping so non-string values do not raise
    return len(str(name).strip()) if pd.notnull(name) else 0

def is_valid_name(name):
    if pd.isna(name) or len(str(name).strip()) <= 2:
        return 0
    # only ASCII letters and spaces are accepted
    return int(bool(re.fullmatch(r'[A-Za-z ]+', str(name).strip())))

def is_full_name(name):
    if pd.isna(name):
        return 0
    return int(len(str(name).strip().split()) >= 2)

In [22]:
df['name_length'] = df['obsfullname'].apply(name_length)
df['is_valid_name'] = df['obsfullname'].apply(is_valid_name)
df['is_full_name'] = df['obsfullname'].apply(is_full_name)

In [23]:
df[['obsfullname', 'name_length', 'is_full_name', 'is_valid_name']].head(10)

Unnamed: 0,obsfullname,name_length,is_full_name,is_valid_name
0,XXXXXXXXXX XXXXXXXXX,20,1,1
1,XXXXX XXXXXX,12,1,1
2,XXXXXX XXXXX,12,1,1
3,XXXXX-XXX XX XXXXX,18,1,0
4,XXXXX XXXXXXXX XXXXX X/X,24,1,0
5,XXXXXX XXXXXXX,14,1,1
6,XXXXXX XXXXX,12,1,1
7,XXXXXXX XXXXXX XXXXXXX,22,1,1
8,XXXXXXX,7,0,1
9,XXXXXX XXXXXXX,14,1,1


## 1.7 `DTLeadCreated` and `DTLeadAllocated`

- `lead_response_time` - whole hours between lead creation and lead allocation
- `is_weekend_lead` - whether the lead was created on a Saturday or Sunday
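
As a worked example, the first lead in the data was created at 2025-02-15 14:34 and allocated at 2025-02-17 11:47. The sketch below recomputes both features for it using only the standard library; the `lead_features` helper is a hypothetical name, and the arithmetic mirrors the timedelta-based calculation used in this section.

```python
from datetime import datetime


def lead_features(created: str, allocated: str) -> dict:
    """Whole hours between creation and allocation, plus a weekend flag."""
    fmt = '%Y-%m-%d %H:%M:%S'
    created_dt = datetime.strptime(created, fmt)
    allocated_dt = datetime.strptime(allocated, fmt)
    delta = allocated_dt - created_dt
    return {
        # days and seconds are separate timedelta fields; minutes are discarded
        'lead_response_time': delta.days * 24 + delta.seconds // 3600,
        # weekday(): Monday is 0, so Saturday and Sunday are 5 and 6
        'is_weekend_lead': int(created_dt.weekday() >= 5),
    }


first_lead = lead_features('2025-02-15 14:34:00', '2025-02-17 11:47:00')
quick_lead = lead_features('2025-02-15 14:55:00', '2025-02-15 14:57:00')
print(first_lead)
print(quick_lead)
```

The gap for the first lead is 1 day, 21 hours and 13 minutes, which gives 45 whole hours. A lead allocated within the same hour, like the second one, gets a response time of 0.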

In [27]:
df['dtleadcreated'] = pd.to_datetime(df['dtleadcreated'])
df['dtleadallocated'] = pd.to_datetime(df['dtleadallocated'])

df['is_weekend_lead'] = df['dtleadcreated'].apply(lambda x: int(x.day_of_week >= 5))

In [28]:
df['lead_response_time'] = (df['dtleadallocated'] - df['dtleadcreated']).apply(lambda t: t.days * 24 + (t.seconds // 3600))

In [29]:
df[['dtleadallocated', 'dtleadcreated', 'is_weekend_lead', 'lead_response_time']].head(10)

Unnamed: 0,dtleadallocated,dtleadcreated,is_weekend_lead,lead_response_time
0,2025-02-17 11:47:00,2025-02-15 14:34:00,1,45
1,2025-02-15 14:50:00,2025-02-15 14:50:00,1,0
2,2025-02-17 09:01:00,2025-02-15 14:51:00,1,42
3,2025-02-15 14:57:00,2025-02-15 14:55:00,1,0
4,2025-02-15 15:04:00,2025-02-15 15:04:00,1,0
5,2025-02-17 11:12:00,2025-02-15 15:04:00,1,44
6,2025-02-15 15:06:00,2025-02-15 15:06:00,1,0
7,2025-02-15 17:31:00,2025-02-15 15:06:00,1,2
8,2025-02-17 10:12:00,2025-02-15 15:10:00,1,43
9,2025-02-17 11:43:00,2025-02-15 15:23:00,1,44


## 1.8 `CustomerID`

- `customer_leads_count` - number of leads in the dataset that share the same `CustomerID`
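
The idea behind this feature is that every row receives the total number of rows with the same customer. A plain-Python sketch with `collections.Counter` shows the intended result; the customer IDs are hypothetical placeholders.

```python
from collections import Counter

# Hypothetical customer IDs, one per lead.
customer_ids = ['A', 'B', 'A', 'C', 'A']

# Count leads per customer, then broadcast the count back to each lead.
leads_per_customer = Counter(customer_ids)
customer_leads_count = [leads_per_customer[cid] for cid in customer_ids]

print(customer_leads_count)
```

Because the count is taken over whichever dataframe is passed in, train and test counts are computed independently of each other.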

In [30]:
df['customer_leads_count'] = df.groupby('customerid')['customerid'].transform('count')

df[['customerid', 'customer_leads_count']].head()

Unnamed: 0,customerid,customer_leads_count
0,D3B8E20E-F798-4CA9-94EE-FF6D0E7E99F2,2
1,372AB00B-6A1C-45D1-8D29-29287C627D3B,1
2,74262064-94E9-4E7E-9C01-CA8405243929,1
3,F4D65935-310B-455D-B414-1759C3B807F7,1
4,A732E82D-B4BE-438C-9690-07AB3EFD9CD4,1


## Final Function for Data Preparation

In [46]:
def prepare_df(df: pd.DataFrame = train_df, is_test_df: bool = False) -> pd.DataFrame:
    """
    Cleans and prepares the dataframe into a format ready for transformation
    """
    interestmake_mapper = get_interestmake_mapper(is_test=is_test_df)

    df = df.copy()
    df.columns = [col.lower() for col in df.columns]
    df['is_email_provided'] = df['obsemail'].apply(lambda email: int(isinstance(email, str) and len(email) > 0))
    df['is_valid_email'] = df.apply(lambda row: is_valid_email(email=row['obsemail'], 
                                                                domain=row['domain'], 
                                                                is_email_provided=row['is_email_provided']), axis=1)
    df['is_personal_email'] = df['domain'].apply(is_free_email_provider)
    df['is_personal_cell'] = df['cellprefix'].apply(is_personal_cell)
    df['is_valid_cell']=df.apply(lambda row: is_valid_cell(row['cellprefix'], row['cellphonenolength']), axis=1)
    df['time_of_day'] = pd.cut(df['hourofenquiry'], bins=tod_bins, labels=tod_labels, right=False,ordered=False)
    df['month_period'] = pd.cut(df['dayofenquiry'], bins=dom_bins, labels=dom_labels, right=False,ordered=False)
    df['interestmake'] = df['interestmake'].fillna('NO INTEREST MAKE SELECTED')
    df['interestmodel'] = df['interestmodel'].fillna('NO INTEREST MODEL SELECTED')
    df['clean_interestmake'] = df['interestmake'].str.strip().str.lower().map(interestmake_mapper)
    df['model_clean'] = df['interestmodel'].apply(clean_interestmodel)
    df['engine_size'] = df['model_clean'].str.extract(r'(\d\.\d)').astype(float)
    df['transmission'] = df['model_clean'].apply(lambda x: 'auto' if 'a/t' in x or 'automatic' in x else ('manual' if 'manual' in x else 'Unspecified'))
    df['is_promotional'] = df['model_clean'].str.contains(r'from only|per month|pm').apply(lambda x: int(x))
    df['body_type'] = df["model_clean"].str.extract(r'(hatch|sedan|suv|pickup)')[0]
    df['variant'] = df["model_clean"].str.extract(r'(pro|premium|comfortline|zen|dynamique|motion|trim level)')[0]
    df['drive_type'] = df["model_clean"].str.extract(r'(fwd|2wd|4wd|rwd|awd)')[0]
    df['model_description_length'] = df['model_clean'].apply(lambda x: len(x.split(' ')))
    df['variant'] = df['variant'].fillna('Unspeficied')
    df['body_type'] = df['body_type'].fillna('Unspeficied')
    df['drive_type'] = df['drive_type'].fillna('Unspeficied')
    df['engine_size'] = df['engine_size'].fillna('Unspeficied')
    df['dealer_city'] = df['dealer'].apply(extract_dealer_city)
    df['dealer_group'] = df['dealer'].apply(extract_brand)
    df['is_motus_group'] = df['dealer'].apply(is_motus)
    df['name_length'] = df['obsfullname'].apply(name_length)
    df['is_valid_name'] = df['obsfullname'].apply(is_valid_name)
    df['is_full_name'] = df['obsfullname'].apply(is_full_name)
    df['dtleadcreated'] = pd.to_datetime(df['dtleadcreated'])
    df['dtleadallocated'] = pd.to_datetime(df['dtleadallocated'])
    df['is_weekend_lead'] = df['dtleadcreated'].apply(lambda x: int(x.day_of_week >= 5))
    df['lead_response_time'] = (df['dtleadallocated'] - df['dtleadcreated']).apply(lambda t: t.days * 24 + (t.seconds // 3600))
    df['customer_leads_count'] = df.groupby('customerid')['customerid'].transform('count')
    return df






In [43]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [33]:
cols_to_drop = ['dtleadcreated', 'dtleadallocated', 'leadsource', 'interestmake', 'interestmodel',
                'obsfullname', 'obsemail', 'domain', 'cellprefix', 'cellphonenolength', 'hourofenquiry', 'dayofenquiry', 'dealer', 'model_clean']

clean_train_df = prepare_df().drop(columns=cols_to_drop)
clean_train_df.drop(['customerid'], axis=1).head(10)

Unnamed: 0,leadid,leadtype,seek,vehiclesold,is_email_provided,is_valid_email,is_personal_email,is_personal_cell,is_valid_cell,time_of_day,month_period,engine_size,transmission,is_promotional,body_type,variant,drive_type,model_description_length,clean_interestmake,dealer_city,dealer_group,is_motus_group,name_length,is_valid_name,is_full_name,is_weekend_lead,lead_response_time,customer_leads_count
0,11187465,Facebook,New,0,1,1,1,1,1,afternoon,Mid Month,1.5,Unspecified,0,Unspeficied,pro,Unspeficied,6,chery,Germiston,Chery,Yes,20,1,1,1,45,2
1,11187550,Facebook,New,0,1,1,1,1,1,afternoon,Mid Month,Unspeficied,Unspecified,0,Unspeficied,Unspeficied,Unspeficied,1,renault,Northcliff,Renault,No,12,1,1,1,0,1
2,11187564,Other,New,0,1,1,1,1,1,afternoon,Mid Month,Unspeficied,Unspecified,0,Unspeficied,Unspeficied,Unspeficied,1,kia,Durban,General,No,12,1,1,1,42,1
3,11187583,Website,New,1,1,1,1,1,1,afternoon,Mid Month,Unspeficied,Unspecified,0,Unspeficied,Unspeficied,Unspeficied,2,volkswagen,Germiston,VW,No,18,0,1,1,0,1
4,11187615,TikTok,New,0,1,1,1,1,1,afternoon,Mid Month,Unspeficied,Unspecified,0,Unspeficied,Unspeficied,Unspeficied,1,renault,Durban,General,No,24,0,1,1,0,1
5,11187617,Other,New,0,1,1,1,1,1,afternoon,Mid Month,1.2,auto,0,Unspeficied,premium,Unspeficied,5,hyundai,Menlyn,Hyundai,No,14,1,1,1,44,1
6,11187631,Facebook,New,0,1,1,1,1,1,afternoon,Mid Month,Unspeficied,Unspecified,0,Unspeficied,Unspeficied,Unspeficied,1,renault,Northcliff,Renault,No,12,1,1,1,0,2
7,11187632,Other,Used,0,1,1,1,1,1,afternoon,Mid Month,1.2,Unspecified,0,Unspeficied,motion,Unspeficied,3,hyundai,Alberton,General,No,22,1,1,1,2,1
8,11187650,CPC,Used,0,1,1,1,1,1,afternoon,Mid Month,1.5,manual,0,Unspeficied,Unspeficied,Unspeficied,5,toyota,Gate,Toyota,Yes,7,1,0,1,43,1
9,11187731,Facebook,New,0,1,1,1,1,1,afternoon,Mid Month,Unspeficied,Unspecified,0,Unspeficied,pro,Unspeficied,3,chery,Polokwane,Chery,Yes,14,1,1,1,44,1


In [34]:
clean_train_df.to_csv('../data/clean_train.csv', index=False, header=True)

In [44]:
test_df = pd.read_csv('../data/TestData.csv')

In [47]:
clean_test_df = prepare_df(df=test_df, is_test_df=True).drop(columns=cols_to_drop)
clean_test_df.drop(['customerid'], axis=1).head(20)

Unnamed: 0,leadid,leadtype,seek,is_email_provided,is_valid_email,is_personal_email,is_personal_cell,is_valid_cell,time_of_day,month_period,clean_interestmake,engine_size,transmission,is_promotional,body_type,variant,drive_type,model_description_length,dealer_city,dealer_group,is_motus_group,name_length,is_valid_name,is_full_name,is_weekend_lead,lead_response_time,customer_leads_count
0,11187548,Other,New,1,1,1,1,1,afternoon,Mid Month,kia,Unspeficied,Unspecified,0,Unspeficied,Unspeficied,Unspeficied,1,Menlyn,Kia,No,16,1,1,1,45,1
1,11187684,Facebook,New,1,1,1,1,1,afternoon,Mid Month,renault,Unspeficied,Unspecified,0,Unspeficied,Unspeficied,Unspeficied,1,Northcliff,Renault,No,13,1,1,1,0,1
2,11187917,Facebook,New,1,1,1,1,1,afternoon,Mid Month,mg,Unspeficied,Unspecified,0,Unspeficied,Unspeficied,Unspeficied,2,Roodepoort,MG,Yes,6,1,0,1,0,1
3,11187948,AutoTrader,Used,1,1,1,1,1,afternoon,Mid Month,volkswagen,Unspeficied,Unspecified,0,Unspeficied,Unspeficied,Unspeficied,2,Mall,VW,No,13,1,1,1,42,1
4,11188151,Other,Used,1,1,1,1,1,afternoon,Mid Month,hyundai,1.2,Unspecified,0,Unspeficied,motion,Unspeficied,3,Nelspruit,General,No,21,1,1,1,0,1
5,11188453,Facebook,Used,1,1,1,1,1,evening,Mid Month,renault,Unspeficied,Unspecified,0,Unspeficied,Unspeficied,Unspeficied,1,Northcliff,Renault,No,16,1,1,1,0,2
6,11188503,Facebook,New,1,1,1,1,1,evening,Mid Month,renault,Unspeficied,Unspecified,0,Unspeficied,Unspeficied,Unspeficied,1,Roodepoort,General,No,11,1,1,1,0,1
7,11188652,Website,Used,1,1,1,1,1,evening,Mid Month,toyota,Unspeficied,Unspecified,0,Unspeficied,Unspeficied,Unspeficied,1,Alberton,General,No,15,1,1,1,0,1
8,11188720,Facebook,New,1,1,1,1,1,evening,Mid Month,renault,Unspeficied,Unspecified,0,Unspeficied,Unspeficied,Unspeficied,1,Fourways,General,No,22,1,1,1,0,2
9,11188802,Cars.co.za,Used,1,1,1,1,1,evening,Mid Month,volkswagen,Unspeficied,Unspecified,0,Unspeficied,Unspeficied,Unspeficied,1,Bloemfontein,VW,No,7,1,0,1,39,1


In [48]:
clean_test_df.to_csv('../data/clean_test.csv', index=False, header=True)