# Data Preprocessing

## Import Libraries and Dataset

We first import the libraries we need right now; others will be imported as and when they are needed.

In [8]:
import pandas as pd # for data manipulation
import joblib # for saving tools

Import dataset

In [9]:
lagos = pd.read_csv("clean/lagos.csv")

In [10]:
# show first 5 rows
lagos.head()

Unnamed: 0,Title,Location,Currency,Price,Serviced,Newly Built,Furnished,Bedrooms,Bathrooms,Toilets,Neighborhood
0,Shops,Allen Avenue Roundabout Allen Avenue Ikeja Lagos,₦,"750,000/sqm",1,0,0,0 beds,0 baths,0 Toilets,Allen Avenue
1,Newly Built 5 Bedrooms Detached Triplex,"Eso Close, Off Oduduwa Crescent, Gra Ikeja, La...",₦,"280,000,000/year",0,1,0,5 beds,5 baths,6 Toilets,GRA
2,1800m2 Land,Off Adedayo Banjo Street Opebi Ikeja Lagos,₦,220000000,0,0,0,0 beds,0 baths,0 Toilets,Opebi
3,5 Bedroom Fully Detached Duplex,Magodo Ikeja Lagos,₦,300000000,0,1,0,5 beds,6 baths,6 Toilets,Other Ikeja
4,Luxury Built 4bedroom Fully Detached Duplex,Ikeja Lagos,₦,295000000,1,1,0,4 beds,5 baths,5 Toilets,Other Ikeja


In [11]:
# show basic information of the dataset
lagos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66383 entries, 0 to 66382
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Title         66382 non-null  object
 1   Location      66383 non-null  object
 2   Currency      66383 non-null  object
 3   Price         66383 non-null  object
 4   Serviced      66383 non-null  int64 
 5   Newly Built   66383 non-null  int64 
 6   Furnished     66383 non-null  int64 
 7   Bedrooms      66383 non-null  object
 8   Bathrooms     66383 non-null  object
 9   Toilets       66383 non-null  object
 10  Neighborhood  66383 non-null  object
dtypes: int64(3), object(8)
memory usage: 5.6+ MB


From the output above, the dataset has 11 columns: 10 features and our target variable, **Price**.

In [12]:
lagos.columns

Index(['Title', 'Location', 'Currency', 'Price', 'Serviced', 'Newly Built',
       'Furnished', 'Bedrooms', 'Bathrooms', 'Toilets', 'Neighborhood'],
      dtype='object')

## Feature Engineering

#### Extracting Type of Property from Title

Since we are selling homes, let's parse the three main property types out of the Title column: land, duplex and apartment.

In [13]:
def house_type(serie):
    final = []
    
    # Extract property name from title
    for title in list(serie):
        # convert all titles to lowercase strings
        title = str(title).lower()
        if "land" in title and "island" not in title:
            final.append("land")
        elif ("duplex" in title) or ("massionette" in title):
            final.append("duplex")
        else:
            final.append("apartment")
    return final
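
The substring check above labels any title containing `land` as land unless it also contains `island`, so a title that mentions, say, `Landmark` would be labelled land as well. Here is a minimal sketch of a stricter variant, assuming whole-word matching suits this dataset. `classify_title` is a hypothetical helper and is not part of the original notebook.

```python
import re

# Whole-word match for "land"; "massionette" mirrors the spelling used in the notebook
LAND = re.compile(r"\bland\b")
DUPLEX = re.compile(r"duplex|massionette")

def classify_title(title):
    """Hypothetical variant of house_type() that works on a single title."""
    title = str(title).lower()
    if LAND.search(title):
        return "land"
    if DUPLEX.search(title):
        return "duplex"
    return "apartment"

titles = [
    "1800m2 Land",
    "5 Bedroom Duplex Near Landmark Beach",
    "3 Bedroom Flat Banana Island",
]
print([classify_title(t) for t in titles])
```

With this variant, "landmark" and "island" no longer count as land because neither contains `land` as a separate word.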

In [14]:
lagos["Property Type"] = house_type(lagos["Title"])

In [15]:
lagos[["Title","Property Type"]].sample(5)

Unnamed: 0,Title,Property Type
30663,5 Bedroom Detached Duplex With Bq,duplex
6195,4 Bedroom Terrace House Atisheri North,apartment
58861,4 Bedroom Fully Detached Ikota Villa Estate,apartment
20356,5 Bedroom Semi Detached Duplex With Swimming P...,duplex
34257,4 Units Of 3 Bedroom Terraced Duplex,duplex


We only need duplex properties, so we filter the dataset to keep only those

In [16]:
lagos = lagos[lagos["Property Type"] == "duplex"].copy()

#### Extracting Type of Duplex from Title

The titles mention different types of duplex, such as detached, semi detached and terraced. Let's extract that.

In [17]:
lagos["Title"].value_counts()

4 Bedroom Semi Detached Duplex                                                    1479
5 Bedroom Detached Duplex                                                         1088
4 Bedroom Terrace Duplex                                                           767
5 Bedroom Fully Detached Duplex                                                    761
4 Bedroom Detached Duplex                                                          467
                                                                                  ... 
Executive Super Luxury 4 Bedroom Fully Detached Duplex With 1 Room Bq For Sale       1
Beautiful 5 Bedroom Fully Detached Duplex In Prime Location                          1
Classic 5 Bedroom Fully Detached Duplex                                              1
Distinguished 5 Bedroom Contemporary Duplex                                          1
7 Bedroom Duplex + 2 Room Bq                                                         1
Name: Title, Length: 14667, dtype: int64

In [18]:
# Separate Detached Duplex houses from Semi Detached houses
lagos.loc[
    (lagos["Title"] == "Detached Duplex") | (lagos["Title"] == "5 Bedroom Detached Duplex") |
    (lagos["Title"] == "4 Bedroom Detached Duplex") | (lagos["Title"] == "6 Bedroom Detached Duplex"),
    "Title"
] = "ONLY DETACHED DUPLEX"

This function labels each home as detached, semi-detached, terraced or maisonette (spelled `Massionette` in the code)

In [19]:
def type_house(col):
    final = []
    
    for value in col:
        if ("ONLY DETACHED DUPLEX" in value.upper()) or ('bedroom duplex' in value.lower()):
            final.append("Detached Duplex")
        elif ('Fully Detached' in value) or ('bedroom detached duplex' in value.lower()):
            final.append("Detached Duplex")
        elif ('bedrooms duplex' in value.lower()) or ('bedrooms detached' in value.lower()):
            final.append("Detached Duplex")
        elif ('bed room' in value.lower()) or ('luxury' in value.lower()) or ('contemporary' in value.lower()):
            final.append("Detached Duplex")
        elif "semi" in value.lower():
            final.append("Semi Detached Duplex")
        elif "terra" in value.lower():
            final.append("Terraced Duplex")
        else:
            final.append("Massionette")
            
    return final
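
Because the rules in `type_house` are checked in order, a title such as `Luxury 4 Bedroom Semi Detached Duplex` matches the `luxury` rule before it reaches the `semi` rule and is labelled `Detached Duplex`. The sketch below checks the more specific keywords first. It rests on an assumption about the intended behaviour and is not the notebook's method; `duplex_type` and `DETACHED_HINTS` are hypothetical names.

```python
# Keywords the notebook treats as signs of a detached duplex,
# matched case-insensitively here (the original checks 'Fully Detached' with exact case)
DETACHED_HINTS = (
    "only detached duplex", "bedroom duplex", "fully detached",
    "bedroom detached duplex", "bedrooms duplex", "bedrooms detached",
    "bed room", "luxury", "contemporary",
)

def duplex_type(title):
    """Hypothetical reordering of type_house() for a single title."""
    text = str(title).lower()
    # Specific keywords first, so "luxury ... semi detached" stays semi detached
    if "semi" in text:
        return "Semi Detached Duplex"
    if "terra" in text:
        return "Terraced Duplex"
    if any(hint in text for hint in DETACHED_HINTS):
        return "Detached Duplex"
    # Same fallback label as the original function
    return "Massionette"

print(duplex_type("Luxury 4 Bedroom Semi Detached Duplex"))
```

Reordering the rules would change the counts shown in the next cells, so compare the two labelings on a sample before swapping one for the other.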

In [20]:
# Applying the function
lagos["Type"] = type_house(lagos["Title"])

In [21]:
# Checking count of type of duplexes
lagos["Type"].value_counts()

Detached Duplex         19297
Semi Detached Duplex     7557
Terraced Duplex          5447
Massionette               877
Name: Type, dtype: int64

We no longer need the Title column, so we drop it

In [22]:
lagos.drop("Title", axis=1, inplace=True)

We also don't need the maisonette rows (labelled `Massionette`)

In [23]:
mask = lagos.loc[lagos['Type'] == 'Massionette'].index
lagos.drop(mask, inplace=True)

#### Cleaning and Casting of the Price column

It seems our price column is a string, as it contains commas and also includes the price period of the properties

Let's sort that out

In [24]:
# A sample of price values
lagos["Price"].sample(5)

30412     90,000,000
14242    280,000,000
18867     58,000,000
33699    500,000,000
57243    120,000,000
Name: Price, dtype: object

Function to differentiate whether a price is fixed or per sqm, per year etc

In [25]:
# Extract the price period (the part after "/") from the price
def price_split(serie):
    final = []
    for value in serie:
        # if value has a price period, extract it, else let its price period be fixed
        try:
            final.append(value.split("/")[1])
        except (IndexError, AttributeError):
            final.append("fixed")
    return final

In [26]:
# applying function
lagos["Price Period"] = price_split(lagos["Price"])

In [27]:
# checking samples
lagos[["Price", "Price Period"]].sample(5)

Unnamed: 0,Price,Price Period
29936,180000000,fixed
20306,80000000,fixed
21169,270000000,fixed
30620,58000000,fixed
44722,100000000,fixed


Now that we have parsed out the price period, let's remove the commas from the prices and convert them to numbers

In [28]:
def comma_remove(feature):
    final = []
    for sample in feature:
        # Get the prices
        sample = sample.split("/")[0]
        
        # Remove the commas
        if "," in sample:
            final.append(sample.replace(",",""))
        else:
            final.append(sample)
    
    # Convert them to numbers
    int_final = [float(item) for item in final]
    return int_final
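
The two steps above (extracting the period, then cleaning the amount) can also be done in one pass. This is a minimal sketch that assumes each price contains at most one `/`. `parse_price` is a hypothetical helper, not part of the original notebook.

```python
def parse_price(raw):
    """Hypothetical helper: split a raw price into (amount, period)."""
    # partition() returns an empty period when there is no "/"
    amount, _, period = str(raw).partition("/")
    return float(amount.replace(",", "")), period or "fixed"

print(parse_price("280,000,000/year"))
print(parse_price("220000000"))
```

It could be applied to the whole column with `lagos["Price"].map(parse_price)`, which yields one `(amount, period)` tuple per row.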


In [29]:
# applying functions
lagos["Price"] = comma_remove(lagos["Price"])

In [30]:
lagos[["Price", "Price Period"]].sample(5)

Unnamed: 0,Price,Price Period
11478,1200000000.0,fixed
35316,120000000.0,fixed
31582,115000000.0,fixed
44284,250000000.0,fixed
31012,145000000.0,fixed


Extracting houses only with fixed prices

In [31]:
lagos = lagos[lagos['Price Period'] == 'fixed'].copy()

We can now perform numerical operations on the Price column

Minimum price

In [32]:
lagos["Price"].min()

0.0

Median price

In [33]:
lagos["Price"].median()

100000000.0

Max Price

In [34]:
lagos["Price"].max()

140000000000000.0

Some of the other columns that should be numeric (Bedrooms, Bathrooms and Toilets) are still stored as strings, so we take care of them next

#### Cleaning and Casting of the Bedrooms, Bathrooms and Toilets Features

In [35]:
lagos[["Bedrooms","Bathrooms","Toilets"]].sample(5)

Unnamed: 0,Bedrooms,Bathrooms,Toilets
26649,4 beds,5 baths,5 Toilets
24980,5 beds,baths,Toilets
52451,5 beds,5 baths,6 Toilets
31369,4 beds,4 baths,5 Toilets
56427,4 beds,4 baths,Toilets


Function to clean and cast bathroom, bedroom and toilets features

In [36]:
def _strip_unit(serie, unit):
    final = []
    # If value doesn't have a number, replace it with a zero,
    # else replace it with the number it contains
    for value in serie:
        number = value.split(unit)[0].strip()
        final.append(int(number) if number else 0)
    return final

def remove_bed_strip(serie):
    return _strip_unit(serie, "beds")

def remove_toilet_strip(serie):
    return _strip_unit(serie, "Toilets")

def remove_bath_strip(serie):
    return _strip_unit(serie, "baths")
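
The three helpers above differ only in the unit word they strip. As an alternative, a regular expression can pull out the leading number regardless of the unit. This is a sketch that assumes every count is a plain run of digits. `leading_count` is a hypothetical name.

```python
import pandas as pd

def leading_count(column):
    """Hypothetical helper: extract the first run of digits, defaulting to 0."""
    # Values without digits (e.g. "baths") become NaN after extraction
    digits = column.str.extract(r"(\d+)", expand=False)
    return pd.to_numeric(digits).fillna(0).astype(int)

sample = pd.Series(["4 beds", "beds", "6 Toilets", "baths"])
print(leading_count(sample).tolist())
```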

Applying functions to three features

In [37]:
lagos["Bedrooms"] = remove_bed_strip(lagos["Bedrooms"])

In [38]:
lagos["Bathrooms"] = remove_bath_strip(lagos["Bathrooms"])

In [39]:
lagos["Toilets"] = remove_toilet_strip(lagos["Toilets"])

In [40]:
lagos[["Bedrooms","Bathrooms","Toilets"]].sample(5)

Unnamed: 0,Bedrooms,Bathrooms,Toilets
9018,4,4,5
17241,0,4,5
49419,3,0,0
12379,6,6,7
1786,4,5,5


#### Extracting City of property

We also need to parse the city out of the Location column

In [41]:
# Extract the city from the Location values
def location_parser(serie):
    # Replace all Spaces with a comma in the values
    location = [location.replace(" ", ",") for location in serie]
    
    # Split the values by commas, and get the second to the last item, also replace items that
    # contain Island with Victoria Island
    area = [area.split(",")[-2].replace("Island", "Victoria Island") for area in location]
    
    return area
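
One thing to watch in `location_parser`: when a location already contains ", " (for example `Gra Ikeja, Lagos`), replacing spaces with commas produces an empty token, so `[-2]` may return an empty string. A hedged sketch of a more tolerant parser follows. It assumes every location ends with the city followed by `Lagos`; `city_from_location` and the `"Unknown"` fallback are hypothetical.

```python
import re

def city_from_location(location):
    """Hypothetical variant of location_parser() for a single value."""
    # Split on any run of commas and whitespace, dropping empty tokens
    tokens = [tok for tok in re.split(r"[,\s]+", str(location)) if tok]
    if len(tokens) < 2:
        return "Unknown"  # placeholder label, an assumption
    city = tokens[-2]
    # "Victoria Island Lagos" leaves "Island" as the second-to-last token
    return "Victoria Island" if city == "Island" else city

print(city_from_location("Eso Close, Off Oduduwa Crescent, Gra Ikeja, Lagos"))
print(city_from_location("Oniru Victoria Island Lagos"))
```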

Applying function

In [42]:
lagos["City"] = location_parser(lagos["Location"])

In [43]:
lagos["City"].sample(5)

2883     Ikeja
55808    Lekki
6633     Ikeja
53840    Lekki
42399    Lekki
Name: City, dtype: object

We won't need the Location feature again since City has been extracted

In [44]:
lagos.drop("Location", axis=1, inplace=True)

In [45]:
lagos.sample(10)

Unnamed: 0,Currency,Price,Serviced,Newly Built,Furnished,Bedrooms,Bathrooms,Toilets,Neighborhood,Property Type,Type,Price Period,City
24952,₦,140000000.0,0,1,0,0,0,0,Ikota,duplex,Detached Duplex,fixed,Lekki
27719,₦,510000000.0,0,1,0,6,6,7,Osapa London,duplex,Detached Duplex,fixed,Lekki
23592,₦,150000000.0,0,1,0,4,4,5,Ikate,duplex,Detached Duplex,fixed,Lekki
48314,₦,50000000.0,0,1,0,4,4,5,Lekki Phase 2,duplex,Semi Detached Duplex,fixed,Lekki
65743,₦,155000000.0,0,0,0,5,0,0,Oniru,duplex,Semi Detached Duplex,fixed,Victoria Island
34070,₦,100000000.0,0,1,0,4,4,5,Chevron,duplex,Detached Duplex,fixed,Lekki
32730,₦,45000000.0,1,1,1,3,4,4,Chevron,duplex,Detached Duplex,fixed,Lekki
49053,₦,85000000.0,0,1,0,5,5,5,Chevron,duplex,Detached Duplex,fixed,Lekki
43652,₦,125000000.0,0,1,0,5,5,5,Chevron,duplex,Detached Duplex,fixed,Lekki
59836,₦,85000000.0,0,0,0,5,5,6,Chevron,duplex,Detached Duplex,fixed,Lekki


#### Handling Missing Data

Analysis of the data showed that most properties have the same number of bathrooms as bedrooms, while the number of toilets is one more than the number of bedrooms. For example, a property with 4 bedrooms will usually have 4 bathrooms and 5 toilets. Let's turn this idea into functions that fill in values recorded as 0

In [46]:
# Imputing bathrooms from bedrooms
def bed_bath(cols):
    bedroom = cols.iloc[0]
    bathroom = cols.iloc[1]
    
    if bathroom == 0:

        if bedroom == 4:
            return 4

        elif bedroom == 5:
            return 5
        
        elif bedroom == 6:
            return 6

        else:
            return 0
    else:
        return bathroom

In [47]:
# Imputing toilets from bedrooms
def bed_toilet(cols):
    bedroom = cols.iloc[0]
    toilet = cols.iloc[1]
    
    if toilet == 0:

        if bedroom == 4:
            return 5

        elif bedroom == 5:
            return 6

        elif bedroom == 6:
            return 7
        
        else:
            return 0
    else:
        return toilet
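
The same rule can be written without a row-wise `apply`, which tends to be faster on a dataset of this size. The sketch below covers only the bathroom and toilet rules, on a made-up three-row frame. `impute_rooms` is a hypothetical name.

```python
import pandas as pd

def impute_rooms(df):
    """Hypothetical vectorised version of bed_bath() and bed_toilet()."""
    out = df.copy()
    # Only 4, 5 and 6 bedroom homes are imputed, as in the functions above
    known = out["Bedrooms"].isin([4, 5, 6])

    bath_gap = known & (out["Bathrooms"] == 0)
    out.loc[bath_gap, "Bathrooms"] = out.loc[bath_gap, "Bedrooms"]

    toilet_gap = known & (out["Toilets"] == 0)
    out.loc[toilet_gap, "Toilets"] = out.loc[toilet_gap, "Bedrooms"] + 1
    return out

demo = pd.DataFrame({
    "Bedrooms": [4, 5, 3],
    "Bathrooms": [0, 5, 0],
    "Toilets": [0, 0, 0],
})
print(impute_rooms(demo))
```

The 3 bedroom row is left untouched because the rule only applies to 4, 5 and 6 bedroom homes.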

Applying the functions

In [48]:
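# bed_bath() fills its second column from its first, so passing the columns
# in reverse order imputes missing Bedrooms from Bathrooms
lagos['Bedrooms'] = lagos[['Bathrooms','Bedrooms']].apply(bed_bath,axis=1)
# Impute missing Bathrooms from Bedrooms
lagos['Bathrooms'] = lagos[['Bedrooms','Bathrooms']].apply(bed_bath,axis=1)
# Impute missing Toilets from Bedrooms
lagos['Toilets'] = lagos[['Bedrooms','Toilets']].apply(bed_toilet,axis=1)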

#### Handling duplicate values

Counting Number of duplicate values

In [49]:
lagos.duplicated().sum()

16399

Removing duplicate values

In [50]:
lagos.drop_duplicates(inplace=True)

#### Removing Inconsistent rows and features

Keeping only properties priced in naira

In [51]:
lagos = lagos.loc[lagos["Currency"] != "$"].copy()

Removing rows that have 0 bedrooms, 0 bathrooms and 0 toilets

In [52]:
mask = lagos.loc[(lagos['Bedrooms'] == 0) & (lagos['Bathrooms'] == 0) & (lagos['Toilets'] == 0)].index
lagos.drop(mask, inplace=True)

Removing features that no longer carry information after the filtering above

In [53]:
lagos.drop(['Currency', 'Property Type', 'Price Period'], axis=1, inplace=True)

##### Price Cap


We assume that a typical buyer in 2022 would not purchase a property priced below 10 million naira or above 500 million naira, so rows outside this range are treated as outliers and removed

In [54]:
mask = lagos.loc[(lagos["Price"] < 1e7) | (lagos["Price"] > 5e8)].index
lagos.drop(mask, inplace=True)
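
The same cap can be expressed as a single filter with `Series.between`, which keeps both bounds by default. This is a small sketch on made-up prices, not the notebook's data.

```python
import pandas as pd

LOW, HIGH = 1e7, 5e8  # the 10 million and 500 million naira bounds from the text

prices = pd.DataFrame({"Price": [0.0, 1e7, 1.2e8, 5e8, 1.4e14]})
# Keep rows whose price lies inside the inclusive range [LOW, HIGH]
capped = prices[prices["Price"].between(LOW, HIGH)]
print(capped["Price"].tolist())
```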

In [55]:
# shuffle the rows
lagos = lagos.sample(frac=1).copy()

Saving preprocessed data for EDA

In [56]:
lagos.to_csv('clean/eda.csv', index=False)

## Data Preprocessing

#### Data Split

Splitting our data before data preprocessing to avoid data leakage

In [57]:
# Splitting to extract 20% test data
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(lagos, lagos["Neighborhood"]):
    train_set = lagos.iloc[train_index]
    test_set = lagos.iloc[test_index]

In [58]:
# Splitting to extract 20% of the remaining training data for validation
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(train_set, train_set["Neighborhood"]):
    train = train_set.iloc[train_index]
    validation_set = train_set.iloc[test_index]
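
Since the validation split is taken from what is left after the test split, the final proportions are not 60/20/20. A quick arithmetic check:

```python
test_fraction = 0.2
validation_fraction = 0.2  # taken from what remains after the test split

remaining = 1 - test_fraction
shares = {
    "train": remaining * (1 - validation_fraction),
    "validation": remaining * validation_fraction,
    "test": test_fraction,
}
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

So about 64% of the rows end up in training, 16% in validation and 20% in test.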

In [59]:
lagos = train.drop("Price", axis=1)
price = train["Price"]

Let's separate our numerical columns from our categorical ones

In [60]:
categorical_col = [column for column in lagos.columns if lagos[column].dtype == "O"]
numerical_col = ['Bathrooms', 'Bedrooms', 'Toilets']

In [61]:
lagos[categorical_col].sample(4)

Unnamed: 0,Neighborhood,Type,City
53429,Ikota,Semi Detached Duplex,Lekki
31650,Ikota,Detached Duplex,Lekki
56865,Chevron,Semi Detached Duplex,Lekki
22066,Osapa London,Detached Duplex,Lekki


In [62]:
lagos[numerical_col].sample(4)

Unnamed: 0,Bathrooms,Bedrooms,Toilets
48006,4,4,4
6744,5,4,5
16075,6,5,6
37931,6,6,5


#### Handling Missing Data

Our dataset does not have missing values, but we still need to set a foundation in case new data contains some

In [63]:
# import library for missing data 
from sklearn.impute import SimpleImputer

Load the imputer and replace missing categorical values with the mode of each column

In [64]:
categorical_imputer = SimpleImputer(strategy="most_frequent")
categorical_imputer.fit(lagos[categorical_col])
lagos[categorical_col] = categorical_imputer.transform(lagos[categorical_col])
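
To illustrate what the `most_frequent` strategy does, here is a sketch in plain pandas: the mode is learned from the training column and then reused on unseen data. The two small Series are made-up examples, and this stands in for `SimpleImputer` only to show the idea.

```python
import pandas as pd

train_types = pd.Series(["Detached Duplex", "Detached Duplex", "Terraced Duplex"])
new_types = pd.Series(["Terraced Duplex", None])

# "Fit": remember the most frequent value of the training column
fill_value = train_types.mode().iloc[0]
# "Transform": reuse that remembered value on unseen data
imputed = new_types.fillna(fill_value)
print(imputed.tolist())
```

The fill value comes from the training data only, which is why the fitted imputer is saved for later use.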

Saving Imputer for future use in model deployment

In [65]:
joblib.dump(categorical_imputer, "tools/imputer_joblib")

['tools/imputer_joblib']

#### Feature Scaling

Standardising our numerical features so they share the same scale (zero mean and unit variance); most values then fall roughly between -3 and 3

In [66]:
# import library for scaling
from sklearn.preprocessing import StandardScaler

In [67]:
# load the scaler and scale the values
scaler = StandardScaler()
scaler.fit(lagos[numerical_col])
lagos[numerical_col] = scaler.transform(lagos[numerical_col])
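
Standardisation subtracts the column mean and divides by the column standard deviation, z = (x - mean) / std. The sketch below reproduces that by hand on a made-up list of bedroom counts. `standardise` is a hypothetical helper; `StandardScaler` uses the population standard deviation, which is what `pstdev` computes.

```python
from statistics import fmean, pstdev

def standardise(values):
    """Hypothetical helper: z-score each value using the population std."""
    mean = fmean(values)
    std = pstdev(values)  # divides by n, not n - 1
    return [(v - mean) / std for v in values]

bedrooms = [3, 4, 4, 5, 6]
scaled = standardise(bedrooms)
print([round(z, 3) for z in scaled])
```

The scaled values have mean 0 and standard deviation 1, but they are not clipped to any fixed range.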

Saving Scaler for future use in model deployment

In [68]:
joblib.dump(scaler, "tools/scaler_joblib")

['tools/scaler_joblib']

#### Feature Encoding

We need to perform one-hot encoding on the nominal categorical variables (variables with no inherent order, e.g. Nigeria, Congo, Ghana)

In [69]:
# import library for encoding
from feature_engine.encoding import OneHotEncoder

In [70]:
encoder = OneHotEncoder()
encoder.fit(lagos)
lagos = encoder.transform(lagos)
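
To show what one-hot encoding produces, here is a small sketch using `pandas.get_dummies` on a made-up frame. It stands in for the `feature_engine` encoder purely for illustration; unlike a fitted encoder, `get_dummies` does not remember the categories seen during training.

```python
import pandas as pd

demo = pd.DataFrame({
    "Bedrooms": [4, 5, 4],
    "Type": ["Detached Duplex", "Terraced Duplex", "Detached Duplex"],
})
# Each category of "Type" becomes its own 0/1 column
encoded = pd.get_dummies(demo, columns=["Type"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

A fitted encoder is preferable for deployment because new data must end up with exactly the same columns as the training data.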

Saving Encoder for future use in model deployment

In [71]:
joblib.dump(encoder, "tools/encoder_joblib")

['tools/encoder_joblib']

In [72]:
lagos

Unnamed: 0,Serviced,Newly Built,Furnished,Bedrooms,Bathrooms,Toilets,Neighborhood_Lekki Phase 1,Neighborhood_Lekki Phase 2,Neighborhood_Ikate,Neighborhood_Ikota,...,Neighborhood_Waziri Adeola Odeku,Neighborhood_Awolowo Way,Neighborhood_Agidingbi,Type_Semi Detached Duplex,Type_Detached Duplex,Type_Terraced Duplex,City_Lekki,City_Ikeja,City_Ikoyi,City_Victoria Island
29575,0,0,0,-0.483533,-0.568038,-1.045317,1,0,0,0,...,0,0,0,1,0,0,1,0,0,0
54080,0,0,0,-0.483533,-0.568038,-1.045317,0,1,0,0,...,0,0,0,1,0,0,1,0,0,0
51234,0,0,0,0.765681,-2.449226,0.731838,0,0,1,0,...,0,0,0,0,1,0,1,0,0,0
44934,1,1,1,-0.483533,1.313149,0.731838,0,0,0,1,...,0,0,0,0,0,1,1,0,0,0
33096,0,1,0,0.765681,-0.568038,-0.156740,0,0,0,0,...,0,0,0,1,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14104,1,1,0,0.765681,0.372556,0.731838,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
63029,1,0,1,0.765681,2.253743,1.620415,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
27980,0,1,0,-0.483533,-0.568038,0.731838,1,0,0,0,...,0,0,0,0,0,1,1,0,0,0
38099,0,1,0,0.765681,1.313149,0.731838,0,0,1,0,...,0,0,0,0,1,0,1,0,0,0


Save preprocessed data for Model Building

In [73]:
lagos = pd.concat([lagos, price], axis=1)

In [74]:
lagos.to_csv('clean/train.csv', index=False)
test_set.to_csv('clean/test.csv', index=False)
validation_set.to_csv('clean/validation.csv', index=False)