## Table of contents

1. [Assessment](#assessment) -> [Report summary](#assessment-report)
1. [Cleaning](#cleaning)

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from googletrans import Translator, constants # https://www.thepythoncode.com/article/translate-text-in-python
from slugify import slugify
import os

%matplotlib inline

In [2]:
# init the Google API translator
translator = Translator()

# translate a spanish text to english text (by default)
translation = translator.translate("Hola Mundo")
print(f"{translation.origin} ({translation.src}) --> {translation.text} ({translation.dest})")

Hola Mundo (es) --> Hello World (en)


In [3]:
translation2 = translator.translate("ফ্ল্যাটে উঠেও কিস্তি চালাতে পারবেন")
print(f"{translation2.origin} ({translation2.src}) --> {translation2.text} ({translation2.dest})")

ফ্ল্যাটে উঠেও কিস্তি চালাতে পারবেন (bn) --> You can pay installments even in the flat (en)


In [None]:
"""
Text translation: 
    - https://www.thepythoncode.com/article/translate-text-in-python
    - https://huggingface.co/course/chapter1/3?fw=pt
"""

In [4]:
# CSV folders

raw_data_folder="../../../data/Raw_Data"
cleaned_data_folder="../../../data/CLeaned_Data"

bikrisohoj_folder= f"{raw_data_folder}/bikrisohoj_spider"
cleaned_bikrisohoj_folder= f"{cleaned_data_folder}/bikrisohoj"

<span id="bproperty-assessment"> </span>

## Assessing `bikrisohoj`

In [5]:
df=pd.read_csv(f"{bikrisohoj_folder}/mohammedadnan_bikrisohoj.csv")
df.head()

Unnamed: 0,Name,Location,Description,Ad posted time,Price,AD URL
0,ফ্ল্যাটে উঠেও কিস্তি চালাতে পারবেন,"Dhaka, Dhaka, Keraniganj",\n ...,26 Mar 2023 02:13 am,3150,https://www.bikrisohoj.com/details/%e0%a6%ab%e...
1,প্লটন থেকে ৪ কি.মি দূরত্বে কিস্তি প্রায় রেডি ফ...,"Dhaka, Dhaka, Keraniganj",\n ...,26 Mar 2023 02:21 am,3850,https://www.bikrisohoj.com/details/%e0%a6%aa%e...
2,Apartment for rent,"Dhaka, Dhaka, Uttara",\n ...,22 Mar 2023 10:42 pm,50000,https://www.bikrisohoj.com/details/apartment-f...
3,"A modern well decorated flat at Banasree,Rampu...","Dhaka, Dhaka, Rampura",\n ...,26 Mar 2023 02:21 am,12000000,https://www.bikrisohoj.com/details/a-modern-we...
4,ইন্ডাস্ট্রিয়াল সেড / INDUSTRIAL SHED FOR RENT,"Dhaka, Narayanganj",\n ...,26 Mar 2023 02:16 am,150000,https://www.bikrisohoj.com/details/%e0%a6%87%e...


Some `Name` values are not in English. They should be translated to English. (quality issue)

In [6]:
df.shape

(940, 6)

In [7]:
df.loc[0]

Name                             ফ্ল্যাটে উঠেও কিস্তি চালাতে পারবেন
Location                                   Dhaka, Dhaka, Keraniganj
Description       \n                                            ...
Ad posted time                                26 Mar 2023 02:13 am 
Price                                                         3,150
AD URL            https://www.bikrisohoj.com/details/%e0%a6%ab%e...
Name: 0, dtype: object

In [8]:
df.loc[0,"Description"]

'\n                                            ## মতিঝিল থেকে মাত্র ৭ কিলোমিটার\n\n• সামনে প্রশস্ত রাস্তা\n\n## ১০০০, ১১৫০, ১২৫০ ও ১৪৫০ বর্গফুটের রেডি ফ্ল্যাট\n\n* ৩টি/৪টি বেড, ৩টি বাথ, ডাইনিং, ড্রইং ও বারান্দা সহ রেডি ফ্ল্যাট\n\n** প্রতি বর্গফুটের মূল্য মাত্র ৩১৫০/- টাকা ।\n\n## আপনি মাত্র ডাউন পেমেন্ট দিয়েই ফ্ল্যাটে উঠে যেতে পারবেন ।\n\nবাকি টাকা ২ বছরের কিস্তিতে পরিশোধ করার সুযোগ পাবেন (দীর্ঘমেয়াদী কিস্তির সুযোগ ও রয়েছে)। ফ্ল্যাট কেনার ইচ্ছা থাকলে সুযোগটি আপনিও নিতে পারেন ।\n\n## ফ্ল্যাটে ও ফ্ল্যাটের আশে পাশে নাগরিক সকল সুবিধা বিদ্যমান । তাই থাকার মত আবাসিক পরিবেশের জন্য আগে সরাসরি ফ্ল্যাটটি দেখুন ।\n\nতাই আর দেরি কেন বিস্তারিত তথ্যের জন্য আজই যোগাযোগ করুন >>>\nমোবাইল – 01855-646432\n– 01710-690820\n\nOnline Contact ( imo, WhatsApp, Viver ) – 01855-646432\nঅথবা\nআপনার মোবাইল নাম্বার দিয়ে ইনবক্স করুন, ধন্যবাদ ।\n\n### বি: দ্র: এছাড়াও আমাদের ছোট থেকে শুরু করে বিভিন্ন সাইজের রেডি ফ্ল্যাট আছে এবং বিভিন্ন সাইজের রেডি প্লট এককালিন ও সওজ কিস্তিতে বিক্রয় করি।                               

Some `Description` values are not in English. They should be translated to English. (quality issue)

In [9]:
slugify(df.loc[0,"Description"])

'mtijhil-theke-maatr-7-kilomittaar-saamne-prshst-raastaa-1000-1150-1250-o-1450-brgphutter-reddi-phlyaatt-3tti-4tti-bedd-3tti-baath-ddaainin-ddrin-o-baaraandaa-sh-reddi-phlyaatt-prti-brgphutter-muuly-maatr-3150-ttaakaa-aapni-maatr-ddaaun-pementt-diyei-phlyaatte-utthe-yete-paarben-baaki-ttaakaa-2-bchrer-kistite-prishodh-kraar-suyog-paaben-diirghmeyyaadii-kistir-suyog-o-ryyeche-phlyaatt-kenaar-icchaa-thaakle-suyogtti-aapnio-nite-paaren-phlyaatte-o-phlyaatter-aashe-paashe-naagrik-skl-subidhaa-bidymaan-taai-thaakaar-mt-aabaasik-pribesher-jny-aage-sraasri-phlyaatttti-dekhun-taai-aar-deri-ken-bistaarit-tthyer-jny-aaji-yogaayog-krun-mobaail-01855-646432-01710-690820-online-contact-imo-whatsapp-viver-01855-646432-athbaa-aapnaar-mobaail-naambaar-diyye-inbks-krun-dhnybaad-bi-dr-echaarraao-aamaader-chott-theke-shuru-kre-bibhinn-saaijer-reddi-phlyaatt-aache-ebn-bibhinn-saaijer-reddi-pltt-ekkaalin-o-soj-kistite-bikryy-kri'

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Name            940 non-null    object
 1   Location        940 non-null    object
 2   Description     940 non-null    object
 3   Ad posted time  940 non-null    object
 4   Price           940 non-null    object
 5   AD URL          940 non-null    object
dtypes: object(6)
memory usage: 44.2+ KB


* `Ad posted time` should be of type datetime, not str. (quality issue)
* `Price` should be of type float, not str. (quality issue)
* `Location` should be split into `city` and `address`. (tidiness issue)
* Column names should be lowercase, for consistency with the other cleaned datasets. (tidiness issue)

<span id="assessment-report"> </span>

#### Assessment report summary

##### Quality issues
1. Some `Name` values are not in English. They should be translated to English.
1. Some `Description` values are not in English. They should be translated to English.
1. `Ad posted time` should be of type datetime, not str.
1. `Price` should be of type float, not str.


##### Tidiness issues
1. `Location` should be split into `city` and `address`.
1. Column names should be lowercase, for consistency with the other cleaned datasets.

<span id="cleaning"> </span>

## Cleaning

###  `Name` and `Description` have some samples not in English ( [quality issues #1 & #2](#assessment-report) )

Some `Name` and `Description` values are not in English. They should be translated to English.

In [11]:
df["Name"].head()

0                   ফ্ল্যাটে উঠেও কিস্তি চালাতে পারবেন
1    প্লটন থেকে ৪ কি.মি দূরত্বে কিস্তি প্রায় রেডি ফ...
2                                   Apartment for rent
3    A modern well decorated flat at Banasree,Rampu...
4        ইন্ডাস্ট্রিয়াল সেড / INDUSTRIAL SHED FOR RENT
Name: Name, dtype: object

In [12]:
df.loc[0,"Description"]

'\n                                            ## মতিঝিল থেকে মাত্র ৭ কিলোমিটার\n\n• সামনে প্রশস্ত রাস্তা\n\n## ১০০০, ১১৫০, ১২৫০ ও ১৪৫০ বর্গফুটের রেডি ফ্ল্যাট\n\n* ৩টি/৪টি বেড, ৩টি বাথ, ডাইনিং, ড্রইং ও বারান্দা সহ রেডি ফ্ল্যাট\n\n** প্রতি বর্গফুটের মূল্য মাত্র ৩১৫০/- টাকা ।\n\n## আপনি মাত্র ডাউন পেমেন্ট দিয়েই ফ্ল্যাটে উঠে যেতে পারবেন ।\n\nবাকি টাকা ২ বছরের কিস্তিতে পরিশোধ করার সুযোগ পাবেন (দীর্ঘমেয়াদী কিস্তির সুযোগ ও রয়েছে)। ফ্ল্যাট কেনার ইচ্ছা থাকলে সুযোগটি আপনিও নিতে পারেন ।\n\n## ফ্ল্যাটে ও ফ্ল্যাটের আশে পাশে নাগরিক সকল সুবিধা বিদ্যমান । তাই থাকার মত আবাসিক পরিবেশের জন্য আগে সরাসরি ফ্ল্যাটটি দেখুন ।\n\nতাই আর দেরি কেন বিস্তারিত তথ্যের জন্য আজই যোগাযোগ করুন >>>\nমোবাইল – 01855-646432\n– 01710-690820\n\nOnline Contact ( imo, WhatsApp, Viver ) – 01855-646432\nঅথবা\nআপনার মোবাইল নাম্বার দিয়ে ইনবক্স করুন, ধন্যবাদ ।\n\n### বি: দ্র: এছাড়াও আমাদের ছোট থেকে শুরু করে বিভিন্ন সাইজের রেডি ফ্ল্যাট আছে এবং বিভিন্ন সাইজের রেডি প্লট এককালিন ও সওজ কিস্তিতে বিক্রয় করি।                               

#### Define
- Translate `Name` and `Description` to English
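
Since many samples are already in English, translating every row wastes requests. Below is a minimal sketch of one way to translate only the values that need it. `needs_translation` and `translate_values` are hypothetical helpers, not part of `googletrans`, and the non-ASCII check is only a heuristic: it detects the script, not the language, so unaccented text in another Latin-script language would be skipped.

```python
from typing import Callable, Dict, Iterable, List


def needs_translation(text: str) -> bool:
    """Heuristic: True when the text contains at least one non-ASCII letter."""
    return any(ch.isalpha() and not ch.isascii() for ch in text)


def translate_values(values: Iterable[str],
                     translate_fn: Callable[[str], str]) -> List[str]:
    """Translate only the values that need it, reusing results for repeated strings."""
    cache: Dict[str, str] = {}
    result: List[str] = []
    for value in values:
        if not needs_translation(value):
            result.append(value)  # ASCII-only text is kept as-is
            continue
        if value not in cache:
            cache[value] = translate_fn(value)  # one call per distinct string
        result.append(cache[value])
    return result
```

With the `translator` object created above, this could be used as `df["Name"] = translate_values(df["Name"], lambda s: translator.translate(s).text)`.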

#### Code

In [15]:
# init the Google API translator
translator = Translator()


# translate a spanish text to english text (by default)
translation = translator.translate("Hola Mundo")
print(translation)
print(f"{translation.origin} ({translation.src}) --> {translation.text} ({translation.dest})")

Translated(src=es, dest=en, text=Hello World, pronunciation=Hello World, extra_data="{'translat...")
Hola Mundo (es) --> Hello World (en)


In [20]:
tr = translator.translate(df["Name"].to_list()) 
print(f"{tr.origin} ({tr.src}) --> {tr.text} ({tr.dest})")

AttributeError: 'list' object has no attribute 'origin'

In [23]:
tr = translator.translate(df["Name"].to_list())
tr[:5]

[<googletrans.models.Translated at 0x290e3eccd90>,
 <googletrans.models.Translated at 0x290e3ec7ac0>,
 <googletrans.models.Translated at 0x290e3ec9b80>,
 <googletrans.models.Translated at 0x290e3eeadf0>,
 <googletrans.models.Translated at 0x290e5941a90>]

In [25]:
# Print the first five translations
for t in tr[:5]:
    print(f"{t.origin} ({t.src}) --> {t.text} ({t.dest})")

ফ্ল্যাটে উঠেও কিস্তি চালাতে পারবেন (bn) --> You can pay installments even in the flat (en)
প্লটন থেকে ৪ কি.মি দূরত্বে কিস্তি প্রায় রেডি ফ্ল্যাট (bn) --> Kisthi is almost ready flat at a distance of 4 km from Platon (en)
Apartment for rent (en) --> Apartment for rent (en)
A modern well decorated flat at Banasree,Rampura,Dhaka (mr) --> A modern well decorated flat at Banasree,Rampura,Dhaka (en)
ইন্ডাস্ট্রিয়াল সেড / INDUSTRIAL SHED FOR RENT (bn) --> Industrial Shed / INDUSTRIAL SHED FOR RENT (en)


In [None]:
"""
    Loop through the samples. For each one, translate the Name and Description columns to English
"""

for index, row in df.iterrows(): # loop through each sample
    
    # The code may take time, log in the console to keep track of things
    if index % 100 == 0:
        print(f"Currently processing sample {index}...")
        
    # retrieve the Name and Description
    name = row["Name"]
    description = row["Description"]
    
    # translate text to English (the default destination language)
    translated_name = translator.translate(name).text
    translated_description = translator.translate(description).text

    # updating the relevant columns of the sample in the dataframe
    df.loc[index, "Name"] = translated_name
    df.loc[index, "Description"] = translated_description

print("Processing has come to an end")

#### Testing

###  `Ad posted time` is of type str ( [quality issue #3](#assessment-report) )

`Ad posted time` should be of type datetime, not str.

#### Define
* Rename `Ad posted time` to `posted_time` (tidiness issue)
* Convert `Ad posted time` to datetime
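
The raw values look like `26 Mar 2023 02:13 am`, sometimes with a trailing space. The sketch below shows an explicit format string for that layout, using only the standard library. `POSTED_TIME_FORMAT` and `parse_posted_time` are hypothetical names, and the sketch assumes an English locale, since `%b` and `%p` are locale-dependent.

```python
from datetime import datetime

# Day, abbreviated month, year, 12-hour clock with am/pm, e.g. "26 Mar 2023 02:13 am"
POSTED_TIME_FORMAT = "%d %b %Y %I:%M %p"


def parse_posted_time(raw: str) -> datetime:
    """Parse one 'Ad posted time' value, ignoring surrounding whitespace."""
    return datetime.strptime(raw.strip(), POSTED_TIME_FORMAT)
```

The same format string can be passed to pandas, as in `pd.to_datetime(df["posted_time"].str.strip(), format=POSTED_TIME_FORMAT)`, so that the parser does not have to guess the layout.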

#### Code

In [23]:
# Rename column
df.rename(columns={
    "Ad posted time":"posted_time"
}, inplace=True)

df.head()

Unnamed: 0,Name,Location,Description,posted_time,Price,AD URL
0,ফ্ল্যাটে উঠেও কিস্তি চালাতে পারবেন,"Dhaka, Dhaka, Keraniganj",\n ...,26 Mar 2023 02:13 am,3150,https://www.bikrisohoj.com/details/%e0%a6%ab%e...
1,প্লটন থেকে ৪ কি.মি দূরত্বে কিস্তি প্রায় রেডি ফ...,"Dhaka, Dhaka, Keraniganj",\n ...,26 Mar 2023 02:21 am,3850,https://www.bikrisohoj.com/details/%e0%a6%aa%e...
2,Apartment for rent,"Dhaka, Dhaka, Uttara",\n ...,22 Mar 2023 10:42 pm,50000,https://www.bikrisohoj.com/details/apartment-f...
3,"A modern well decorated flat at Banasree,Rampu...","Dhaka, Dhaka, Rampura",\n ...,26 Mar 2023 02:21 am,12000000,https://www.bikrisohoj.com/details/a-modern-we...
4,ইন্ডাস্ট্রিয়াল সেড / INDUSTRIAL SHED FOR RENT,"Dhaka, Narayanganj",\n ...,26 Mar 2023 02:16 am,150000,https://www.bikrisohoj.com/details/%e0%a6%87%e...


In [25]:
# Converting from str to datetime
df["posted_time"] = pd.to_datetime(df["posted_time"])

#### Testing

In [28]:
df.info()  # posted_time should now be datetime64[ns]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Name         940 non-null    object        
 1   Location     940 non-null    object        
 2   Description  940 non-null    object        
 3   posted_time  940 non-null    datetime64[ns]
 4   Price        940 non-null    object        
 5   AD URL       940 non-null    object        
dtypes: datetime64[ns](1), object(5)
memory usage: 44.2+ KB


### `Price` is of type str ( [quality issue #4](#assessment-report) )

`Price` should be of type float, not str. (quality issue)

#### Define
* Remove `,` from `Price`
* Convert `Price` from str to float
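
`astype(float)` raises on the first value it cannot parse. As a hedged alternative, the sketch below parses one value at a time and returns `None` for anything that is not a number, so problematic rows can be inspected first. `parse_price` is a hypothetical helper, and the `"Negotiable"` value in the usage note is an invented example, not a value observed in the dataset.

```python
from typing import Optional


def parse_price(raw: str) -> Optional[float]:
    """Turn a price such as '3,150' into 3150.0; return None if it is not numeric."""
    cleaned = raw.replace(",", "").strip()  # drop thousands separators and padding
    try:
        return float(cleaned)
    except ValueError:
        return None
```

For example, `parse_price("12,000,000")` gives `12000000.0`, while `parse_price("Negotiable")` gives `None`.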

#### Code

In [30]:
df["Price"] = df["Price"].str.replace(",","")

df["Price"] = df["Price"].astype(float)
df["Price"].head()

0        3150.0
1        3850.0
2       50000.0
3    12000000.0
4      150000.0
Name: Price, dtype: float64

#### Testing

In [31]:
df["Price"].dtype

dtype('float64')

### Split `Location` column content into adequate columns ( [tidiness issue #1](#assessment-report) )

`Location` should be split into `city` and `address`. (tidiness issue)

In [38]:
df["Location"]

0       Dhaka, Dhaka, Keraniganj
1       Dhaka, Dhaka, Keraniganj
2           Dhaka, Dhaka, Uttara
3          Dhaka, Dhaka, Rampura
4             Dhaka, Narayanganj
                 ...            
935         Dhaka, Dhaka, Uttara
936     Dhaka, Dhaka, Keraniganj
937           Dhaka, Narayanganj
938       Chittagong, Chattogram
939     Dhaka, Dhaka, Keraniganj
Name: Location, Length: 940, dtype: object

#### Define
* Retrieve the city (first component) and the address (remaining components) from each `Location` value
* Create the new columns `city` and `address` from those values, then drop `Location`
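
In this dataset the city comes first in `Location` (for example `Dhaka, Dhaka, Keraniganj`). The sketch below splits on the first comma only. `split_location` is a hypothetical helper name, and a value with no comma is assumed to be a city with an empty address.

```python
from typing import Tuple


def split_location(location: str) -> Tuple[str, str]:
    """Split 'city, rest of address' on the first comma."""
    city, _, address = location.partition(",")
    return city.strip(), address.strip()
```

For example, `split_location("Dhaka, Dhaka, Keraniganj")` gives `("Dhaka", "Dhaka, Keraniganj")`.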

#### Code

In [36]:
# Retrieve city in location
df["city"] = df["Location"].apply(lambda x: x.split(",")[0].strip() )

# Retrieve address in location
df["address"] = df["Location"].apply(lambda x: ",".join(x.split(",")[1:]).strip() )

In [39]:
# Checking the content of location, city, and address
df[ ["Location","city","address"] ] 

Unnamed: 0,Location,city,address
0,"Dhaka, Dhaka, Keraniganj",Dhaka,"Dhaka, Keraniganj"
1,"Dhaka, Dhaka, Keraniganj",Dhaka,"Dhaka, Keraniganj"
2,"Dhaka, Dhaka, Uttara",Dhaka,"Dhaka, Uttara"
3,"Dhaka, Dhaka, Rampura",Dhaka,"Dhaka, Rampura"
4,"Dhaka, Narayanganj",Dhaka,Narayanganj
...,...,...,...
935,"Dhaka, Dhaka, Uttara",Dhaka,"Dhaka, Uttara"
936,"Dhaka, Dhaka, Keraniganj",Dhaka,"Dhaka, Keraniganj"
937,"Dhaka, Narayanganj",Dhaka,Narayanganj
938,"Chittagong, Chattogram",Chittagong,Chattogram


In [40]:
df.shape

(940, 8)

In [41]:
# Drop location column
df.drop(["Location"], axis=1, inplace=True)

df.shape

(940, 7)

### Column names should become lowercase  ( [tidiness issues #2](#assessment-report) )

Column names should become lowercase for consistency with the other datasets cleaned. (tidiness issue)
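
A minimal sketch of how the column names could be normalized. It assumes that spaces should become underscores, which this notebook does not state explicitly. `normalize_column_name` is a hypothetical helper.

```python
import re


def normalize_column_name(name: str) -> str:
    """Lowercase a column name and replace runs of other characters with '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")
```

Since `DataFrame.rename` accepts a function for `columns`, this could be applied with `df = df.rename(columns=normalize_column_name)`, turning `AD URL` into `ad_url` and `Name` into `name`.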

#### Code

In [23]:
# Replacing values of commercial_type column
bproperty_df.loc[ bproperty_df["commercial_type"]==True, ["commercial_type"] ] = "Commercial"
bproperty_df.loc[ bproperty_df["commercial_type"]==False, ["commercial_type"] ] = "Residential"

# Making sure values were updated
bproperty_df["commercial_type"].unique()

array(['Residential', 'Commercial'], dtype=object)

In [24]:
# Renaming column
bproperty_df.rename(columns={
    "commercial_type":"building_nature"
}, inplace=True)

# Confirming rename was done
bproperty_df.columns.to_list()

['amenities',
 'area',
 'building_type',
 'building_nature',
 'location',
 'num_bath_rooms',
 'num_bed_rooms',
 'price',
 'property_description',
 'property_overview',
 'property_url',
 'purpose']

In [25]:
# Taking a look at content (for general confirmation)
bproperty_df.head(2).T

Unnamed: 0,0,1
amenities,"{'Flooring': 'yes', 'Parking Spaces': ' 1', 'B...",
area,1265.0,4400.0
building_type,Apartment,Apartment
building_nature,Residential,Residential
location,"Baridhara DOHS, Dhaka","Gulshan 2, Gulshan, Dhaka"
num_bath_rooms,3 Baths,4 Baths
num_bed_rooms,3 Beds,4 Beds
price,1.25 Crore,7.04 Crore
property_description,Ready Flat Of 1265 Sq Ft Is Now Up For Sale In...,You Can Move Into This Well Planned And Comfor...
property_overview,Looking for a luxurious apartment with top-not...,"Amicable environment, appropriate commuting sy..."


### `num_bath_rooms` and `num_bed_rooms` should be integers, not strings ( [quality issue #3](#bproperty-assessment-report) )

In [26]:
bproperty_df["num_bath_rooms"].dtype

dtype('O')

In [27]:
bproperty_df["num_bath_rooms"].unique()

array(['3 Baths', '4 Baths', nan, '2 Baths', '10 Baths', '5 Baths',
       '8 Baths', '1 Bath', '7 Baths', '6 Baths', '9 Baths'], dtype=object)

In [28]:
bproperty_df["num_bed_rooms"].dtype

dtype('O')

In [29]:
bproperty_df["num_bed_rooms"].unique()

array(['3 Beds', '4 Beds', '2 Beds', nan, '21 Beds', '5 Beds', '7 Beds',
       '1 Bed', '6 Beds', '19 Beds', '24 Beds', '33 Beds', '56 Beds',
       '10 Beds', '13 Beds', '48 Beds', '12 Beds', '60 Beds', '18 Beds',
       '40 Beds', '29 Beds', '23 Beds', '8 Beds', '75 Beds', '14 Beds',
       '50 Beds', '42 Beds', '16 Beds', '36 Beds', '15 Beds', '25 Beds',
       '22 Beds', '46 Beds', '32 Beds', '30 Beds', '11 Beds', '94 Beds',
       '17 Beds', '20 Beds'], dtype=object)

#### Define
* Replace `NaN` values with `0` (this makes sense here: it means the sample has no bathroom or bedroom)
* Remove `Bed`, `Beds`, `Bath` and `Baths` from the values of `num_bed_rooms` and `num_bath_rooms`
* Convert `num_bed_rooms` and `num_bath_rooms` to integer
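
The three steps above can also be sketched as a single function applied to each value. `parse_room_count` is a hypothetical helper. Returning `0` for text without a leading number is an assumption made here for simplicity.

```python
import re
from typing import Optional


def parse_room_count(raw: Optional[str]) -> int:
    """Extract the leading integer from values such as '3 Baths' or '1 Bed'."""
    if not isinstance(raw, str):
        return 0  # NaN or None: treated as no room of that kind
    match = re.match(r"\s*(\d+)", raw)
    return int(match.group(1)) if match else 0
```

It could be used as `bproperty_df["num_bed_rooms"].apply(parse_room_count)`.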

#### Code

In [30]:
# Replace NaN values by 0 in num_bed_rooms and num_bath_rooms
bproperty_df["num_bed_rooms"] = bproperty_df["num_bed_rooms"].fillna("0")
bproperty_df["num_bath_rooms"] = bproperty_df["num_bath_rooms"].fillna("0")

# Check that NaN values were replaced
bproperty_df["num_bed_rooms"].isnull().sum(), bproperty_df["num_bath_rooms"].isnull().sum()

(0, 0)

In [31]:
# Removing the units (bed, bath, ...) in num_bed_rooms and num_bath_rooms
bproperty_df["num_bed_rooms"] = bproperty_df["num_bed_rooms"].apply(lambda x: x.split(" ")[0] )
bproperty_df["num_bath_rooms"] = bproperty_df["num_bath_rooms"].apply(lambda x: x.split(" ")[0] )

In [32]:
# Converting num_bed_rooms and num_bath_rooms to integer
bproperty_df["num_bed_rooms"] = bproperty_df["num_bed_rooms"].astype(int)
bproperty_df["num_bath_rooms"] = bproperty_df["num_bath_rooms"].astype(int)


#### Testing

In [33]:
# Checking type conversion was successful
bproperty_df["num_bed_rooms"].dtype, bproperty_df["num_bath_rooms"].dtype

(dtype('int32'), dtype('int32'))

### `price` content is not uniform across the dataset ( [quality issue #4 & #5](#bproperty-assessment-report) )

`price` content is not uniform across the dataset: some values are in `Lakh`, others in `Crore`, etc. The unit used for the price should be standardized. Special attention should be paid to prices that have no unit.

Furthermore, `price` should be decimal, not string.

In [34]:
bproperty_df["price"].unique()

array(['1.25 Crore', '7.04 Crore', '62 Lakh', ..., '13.98 Lakh',
       '96.25 Lakh', '92.1 Lakh'], dtype=object)

#### Define
* Convert all prices to the same unit (BDT)
* Replace `Thousand` by triple `0`
* Convert the column to float
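
A minimal sketch of the unit conversion as a lookup table, using the multipliers 1 thousand = 1,000, 1 lakh = 100,000, 1 crore = 10,000,000 and 1 arab = 1,000,000,000 BDT. `UNIT_MULTIPLIERS` and `price_to_bdt` are hypothetical names. Treating a price without a unit as already expressed in BDT is an assumption that should be checked against the data.

```python
UNIT_MULTIPLIERS = {
    "thousand": 1_000,
    "lakh": 100_000,
    "crore": 10_000_000,
    "arab": 1_000_000_000,
}


def price_to_bdt(raw: str) -> float:
    """Convert a price such as '1.25 Crore' to a plain BDT amount."""
    parts = raw.split()
    if len(parts) == 1:
        # Assumption: a bare number is already expressed in BDT
        return float(parts[0].replace(",", ""))
    if len(parts) != 2:
        raise ValueError(f"Unexpected price format: {raw!r}")
    value, unit = parts
    try:
        multiplier = UNIT_MULTIPLIERS[unit.lower()]
    except KeyError:
        raise ValueError(f"Unit not taken into account: {raw!r}") from None
    return float(value) * multiplier
```

It could be applied with `bproperty_df["price"].apply(price_to_bdt)`, which raises a `ValueError` on the first unexpected value.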

#### Code

In [35]:
"""
    Loop through `price` column, while:
        * Converting all prices to BDT currency
        * Replacing `Thousand` by triple `0`
"""

for index, row in bproperty_df.iterrows(): # loop through each sample
    
    # The code may take time, log in the console to keep track of things
    if index==0 or index%1000==0:
        print(f"Currently processing sample {index}...")
        
    # retrieve the price
    sample_price = bproperty_df.loc[index, "price"]
    splitted_sample_price= sample_price.split()
    
    # making sure there are only the value and unit in sample price
    if len(splitted_sample_price)>2:
        print(f"Sample of index {index} has a suspicious value as price: {sample_price}")
        break
        
    price = float( splitted_sample_price[0] ) # will contain the price; eg: 1345
    price_unit = splitted_sample_price[1].lower() # will contain the unit; eg: Lakh, Crore
    
    # making sure all units are taken into account
    if price_unit not in ["arab","crore","lakh","thousand"]:
        print(f"Sample of index {index} has a unit not taken into account for its price: {sample_price}")
        break
    
    # converting all price unit to BDT : 1 lakh=100000 BDT,1 crore=10000000 BDT, 1 Arab= 1000000000 BDT (Thanks @Al Momin Faruk)
    if price_unit=="arab":
        price *= 1000000000
    elif price_unit=="crore":
        price *= 10000000
    elif price_unit=="lakh":
        price *= 100000
    elif price_unit=="thousand":
        price *= 1000
    else:
        raise Exception(f"Currency {price_unit} not taken to account")
    
    # updating the price of the sample in the dataframe
    bproperty_df.loc[index, "price"] = price

print("Processing has come to an end")
    
# Converting price to float
bproperty_df["price"] = bproperty_df["price"].astype(float)

Currently processing sample 0...
Currently processing sample 1000...
Currently processing sample 2000...
Currently processing sample 3000...
Currently processing sample 4000...
Currently processing sample 5000...
Currently processing sample 6000...
Currently processing sample 7000...
Currently processing sample 8000...
Currently processing sample 9000...
Currently processing sample 10000...
Currently processing sample 11000...
Currently processing sample 12000...
Currently processing sample 13000...
Currently processing sample 14000...
Currently processing sample 15000...
Currently processing sample 16000...
Currently processing sample 17000...
Processing has come to an end


#### Testing

In [36]:
bproperty_df["price"].dtype

dtype('float64')

### Set `purpose` values to `Rent` or `Sale` ( [quality issue #6](#bproperty-assessment-report) )

`purpose` should take `Rent` or `Sale` as values. This is not really an issue; the goal is only to keep values consistent across all cleaned datasets.

In [37]:
bproperty_df["purpose"].unique()

array(['For Sale', 'For Rent'], dtype=object)

#### Define
* Replace `For Sale` by `Sale`, and `For Rent` by `Rent` 

#### Code

In [38]:
bproperty_df["purpose"] = bproperty_df["purpose"].apply(lambda x: x.split(" ")[1] )

#### Testing

In [39]:
bproperty_df["purpose"].unique()

array(['Sale', 'Rent'], dtype=object)

### Split `location` column content into adequate columns ( [tidiness issue #1](#bproperty-assessment-report) )

`location` holds concatenated information: city, district, sector, etc. It will be split into `city` and `address`.

In [40]:
bproperty_df["location"]

0                  Baridhara DOHS, Dhaka
1              Gulshan 2, Gulshan, Dhaka
2                        Khilgaon, Dhaka
3                        Khilgaon, Dhaka
4                        Khilgaon, Dhaka
                      ...               
17251          Darussalam, Mirpur, Dhaka
17252           Meradia, Khilgaon, Dhaka
17253    Block J, Bashundhara R-A, Dhaka
17254    Block G, Bashundhara R-A, Dhaka
17255           Block H, Banasree, Dhaka
Name: location, Length: 17256, dtype: object

#### Define
* Split content of `location` to `city` and `address`
* Remove `location` column
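
Here the city is the last component of `location` (for example `Gulshan 2, Gulshan, Dhaka`), so the split happens on the last comma. A minimal sketch follows. `split_city_last` is a hypothetical helper, and a value with no comma is assumed to be a city with an empty address.

```python
from typing import Tuple


def split_city_last(location: str) -> Tuple[str, str]:
    """Split 'address parts, city' on the last comma and return (city, address)."""
    address, _, city = location.rpartition(",")
    return city.strip(), address.strip()
```

For example, `split_city_last("Gulshan 2, Gulshan, Dhaka")` gives `("Dhaka", "Gulshan 2, Gulshan")`.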

#### Code

In [41]:
# Retrieve city in location
bproperty_df["city"] = bproperty_df["location"].apply(lambda x: x.split(",")[-1].strip() )

# Retrieve address in location
bproperty_df["address"] = bproperty_df["location"].apply(lambda x: ",".join(x.split(",")[:-1]).strip() )

In [42]:
# Checking the content of location, city, and address
bproperty_df[ ["location","city","address"] ] 

Unnamed: 0,location,city,address
0,"Baridhara DOHS, Dhaka",Dhaka,Baridhara DOHS
1,"Gulshan 2, Gulshan, Dhaka",Dhaka,"Gulshan 2, Gulshan"
2,"Khilgaon, Dhaka",Dhaka,Khilgaon
3,"Khilgaon, Dhaka",Dhaka,Khilgaon
4,"Khilgaon, Dhaka",Dhaka,Khilgaon
...,...,...,...
17251,"Darussalam, Mirpur, Dhaka",Dhaka,"Darussalam, Mirpur"
17252,"Meradia, Khilgaon, Dhaka",Dhaka,"Meradia, Khilgaon"
17253,"Block J, Bashundhara R-A, Dhaka",Dhaka,"Block J, Bashundhara R-A"
17254,"Block G, Bashundhara R-A, Dhaka",Dhaka,"Block G, Bashundhara R-A"


In [43]:
bproperty_df.shape

(17256, 14)

In [44]:
# Drop location column
bproperty_df.drop(["location"], axis=1, inplace=True)

In [45]:
# Making sure removal was successful
bproperty_df.shape

(17256, 13)

### Cleaning `amenities` feature ( [tidiness issue #2](#bproperty-assessment-report) )

In the `amenities` feature, each key of the dictionary stored in a sample should become a column, and the key's value should become that sample's value for the column.

In [46]:
bproperty_df["amenities"][0]

"{'Flooring': 'yes', 'Parking Spaces': ' 1', 'Balcony or Terrace': 'yes', 'Floor Level': 'yes', 'View': 'yes', 'Elevators in Building': ' 1', 'Lobby in Building': 'yes'}"

In [47]:
bproperty_df["amenities"][12]

"{'View': 'yes', 'Parking Spaces': ' 1', 'Floor Level': 'yes', 'Balcony or Terrace': 'yes', 'Lobby in Building': 'yes', 'Electricity Backup': 'yes', 'Flooring': 'yes', 'Elevators in Building': ' 1', 'Maintenance Staff': 'yes', 'Cleaning Services': 'yes'}"

#### Define
* Keys in the dictionaries of `amenities` will become new columns in the dataset; the values of the keys will become the new columns values for the corresponding sample.
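
A minimal sketch of the per-sample step: parse the dictionary string safely and derive the new column names. `amenity_columns` is a hypothetical helper. The regular expression only approximates what `slugify` does for simple ASCII keys, and stripping whitespace from the values (for example `' 1'`) is an extra step assumed here.

```python
import ast
import re
from typing import Dict


def amenity_columns(raw: str) -> Dict[str, str]:
    """Map one `amenities` string to {new column name: value}."""
    parsed = ast.literal_eval(raw)  # parses the dict literal without running code
    columns: Dict[str, str] = {}
    for key, value in parsed.items():
        slug = re.sub(r"[^0-9a-z]+", "-", key.lower()).strip("-")
        columns[f"{slug}-amenity"] = str(value).strip()
    return columns
```

Applying this to every non-null value and building a DataFrame from the resulting list of dictionaries would create all the amenity columns at once, without adding them one by one inside a row loop.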

#### Code

In [48]:
"""
    Loop through `amenities` column, while:
         * Converting the dictionary keys to new columns; the values of the keys become
             the new columns' values for the corresponding sample
"""
import ast  # literal_eval parses the dictionary strings without executing arbitrary code

for index, row in bproperty_df.iterrows(): # loop through each sample
    
    # The code may take time, log in the console to keep track of things
    if index==0 or index%1000==0:
        print(f"Currently processing sample {index}...")
        
    # If the current sample doesn't have amenities, go to the next one
    if pd.isna(bproperty_df.loc[index, "amenities"]):
        continue
    
    # retrieve the amenities and parse the dictionary literal
    amenities_dict = ast.literal_eval(str(bproperty_df.loc[index, "amenities"]))
    
    # Go through each key in the amenities dictionary
    for key, value in amenities_dict.items():
        
        # put a suffix to the new column name, so that collaborators know it was generated from amenities feature
        column_name = slugify(key)+"-amenity"
        
        # Create new column based on the key if not already existing
        if column_name not in bproperty_df.columns:
            bproperty_df[column_name] = np.nan # Giving NaN as the default value for the column
        
        # Assigning the value of the dictionary's key to the new column, for the current sample
        bproperty_df.loc[index, column_name] = value
        

Currently processing sample 0...
Currently processing sample 1000...
Currently processing sample 2000...
Currently processing sample 3000...
Currently processing sample 4000...
Currently processing sample 5000...
Currently processing sample 6000...
Currently processing sample 7000...
Currently processing sample 8000...
Currently processing sample 9000...
Currently processing sample 10000...
Currently processing sample 11000...
Currently processing sample 12000...
Currently processing sample 13000...
Currently processing sample 14000...
Currently processing sample 15000...
Currently processing sample 16000...
Currently processing sample 17000...


In [49]:
# Checking columns
bproperty_df.head(3).T

Unnamed: 0,0,1,2
amenities,"{'Flooring': 'yes', 'Parking Spaces': ' 1', 'B...",,"{'View': 'yes', 'Balcony or Terrace': 'yes', '..."
area,1265.0,4400.0,1160.0
building_type,Apartment,Apartment,Apartment
building_nature,Residential,Residential,Residential
num_bath_rooms,3,4,0
num_bed_rooms,3,4,3
price,12500000.0,70400000.0,6200000.0
property_description,Ready Flat Of 1265 Sq Ft Is Now Up For Sale In...,You Can Move Into This Well Planned And Comfor...,"Buy This 1160 Sq Ft Flat In Khilgaon, South Goran"
property_overview,Looking for a luxurious apartment with top-not...,"Amicable environment, appropriate commuting sy...","A lively area to live, lovely home to settle a..."
property_url,https://www.bproperty.com/en/property/details-...,https://www.bproperty.com/en/property/details-...,https://www.bproperty.com/en/property/details-...


In [50]:
# Drop amenities column
bproperty_df.drop(["amenities"],axis=1, inplace=True)

# Check that removal was effective
"amenities" in bproperty_df.columns.to_list()

False

### Save cleaned dataset

In [51]:
# Create folder in which to save cleaned dataset
if not os.path.exists(cleaned_bproperty_folder):
    os.makedirs(cleaned_bproperty_folder)
    print(f"Create folder '{cleaned_bproperty_folder}'")
else:
    print(f"Folder '{cleaned_bproperty_folder}' already exists")

Create folder '../../../data/CLeaned_Data/bproperty'


In [52]:
# Save cleaned dataset to csv
bproperty_df.to_csv(f"{cleaned_bproperty_folder}/cleaned_bproperty.csv", index=False)

In [53]:
# Load saved csv (to make sure it was successfully saved)
clean_bproperty_df = pd.read_csv(f"{cleaned_bproperty_folder}/cleaned_bproperty.csv")
clean_bproperty_df.head(3).T


Unnamed: 0,0,1,2
area,1265.0,4400.0,1160.0
building_type,Apartment,Apartment,Apartment
building_nature,Residential,Residential,Residential
num_bath_rooms,3,4,0
num_bed_rooms,3,4,3
price,12500000.0,70400000.0,6200000.0
property_description,Ready Flat Of 1265 Sq Ft Is Now Up For Sale In...,You Can Move Into This Well Planned And Comfor...,"Buy This 1160 Sq Ft Flat In Khilgaon, South Goran"
property_overview,Looking for a luxurious apartment with top-not...,"Amicable environment, appropriate commuting sy...","A lively area to live, lovely home to settle a..."
property_url,https://www.bproperty.com/en/property/details-...,https://www.bproperty.com/en/property/details-...,https://www.bproperty.com/en/property/details-...
purpose,Sale,Sale,Sale
