# Rental Condo Data Preparation

## Data Cleaning and Transformation
This part covers the following steps:
- Data Preparation
  - Adding Necessary Columns
  - Removing Unwanted Columns
  - Handling Missing Data
  - Changing Data Types

In [127]:
# Import the necessary libraries

import pandas as pd
import numpy as np
import re

In [128]:
df = pd.read_csv('rental_condo_data_bkk_jan_2024.csv', encoding='utf-8-sig')

df.head()

Unnamed: 0,name,price_p_month,address,station,condo_details,room_size,bedroom,bathroom,price_per_sqm,agency,link
0,"Supalai Loft @ Talat Phlu Station, Bangkok","฿12,000 /mo","Ratchadaphisek Road, Talat Plu, Thon Buri, Ban...",6 mins (480 m) to S10 Talat Phlu BTS,Condo\r\nPartially Furnished\r\nBuilt: 2015,43,1,1.0,279.07,Land Property Management,https://www.ddproperty.com/en/property/supalai...
1,"ASHTON Morph 38, Bangkok","฿60,000 /mo","88 Soi Sukhumvit 38, Phra Kanong, Khlong Toei,...",5 mins (360 m) to E6 Thong Lo BTS,Condo\r\nFully Furnished\r\nBuilt: 2012,75,2,2.0,800.0,Usanisa Mahanukul (PARN),https://www.ddproperty.com/en/property/ashton-...
2,"28 Chidlom, Bangkok","฿95,000 /mo","28 Chit Lom Alley, Lumphini, Pathum Wan, Bangkok",4 mins (330 m) to E1 Chit Lom BTS,Condo\r\nBuilt: 2019,75,1,2.0,1266.67,อาภรณ์ เปี่ยมปัญญา,https://www.ddproperty.com/en/property/28-chid...
3,"Merlin Tower Condominium, Bangkok","฿15,000 /mo","Soi Narathiwat 14 Sathon Road, Thung Wat Don, ...",5 mins (350 m) to B3 Technic Krungthep BRT,Condo\r\nFully Furnished\r\nBuilt: 2012,80,2,2.0,187.5,ณัฐพัชร์ โชติอัครสินทบ,https://www.ddproperty.com/en/property/merlin-...
4,"Life Sathorn Sierra, Bangkok","฿13,000 /mo","Ratchaphruek Rd, Talat Plu, Thon Buri, Bangkok",1 mins (90 m) to B12 Ratchaphruek BRT,Condo\r\nFully Furnished\r\nBuilt: 2022,28,1,1.0,464.29,Theeradon Chaopaknam,https://www.ddproperty.com/en/property/life-sa...


In [129]:
# Identify data types

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66874 entries, 0 to 66873
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   name           66874 non-null  object 
 1   price_p_month  66871 non-null  object 
 2   address        66874 non-null  object 
 3   station        57916 non-null  object 
 4   condo_details  66874 non-null  object 
 5   room_size      66870 non-null  object 
 6   bedroom        66855 non-null  object 
 7   bathroom       64898 non-null  float64
 8   price_per_sqm  66870 non-null  object 
 9   agency         66874 non-null  object 
 10  link           66874 non-null  object 
dtypes: float64(1), object(10)
memory usage: 5.6+ MB


A quick look at the data shows that it is not ready for analysis yet. Almost every column is stored as `object` (string) rather than a numeric type, and several columns (`price_p_month`, `condo_details`, `station`) pack more than one piece of information into a single string that has to be extracted or cleaned first.

## Data Cleaning

### Removing duplicate values

In [130]:
df

Unnamed: 0,name,price_p_month,address,station,condo_details,room_size,bedroom,bathroom,price_per_sqm,agency,link
0,"Supalai Loft @ Talat Phlu Station, Bangkok","฿12,000 /mo","Ratchadaphisek Road, Talat Plu, Thon Buri, Ban...",6 mins (480 m) to S10 Talat Phlu BTS,Condo\r\nPartially Furnished\r\nBuilt: 2015,43,1,1.0,279.07,Land Property Management,https://www.ddproperty.com/en/property/supalai...
1,"ASHTON Morph 38, Bangkok","฿60,000 /mo","88 Soi Sukhumvit 38, Phra Kanong, Khlong Toei,...",5 mins (360 m) to E6 Thong Lo BTS,Condo\r\nFully Furnished\r\nBuilt: 2012,75,2,2.0,800.0,Usanisa Mahanukul (PARN),https://www.ddproperty.com/en/property/ashton-...
2,"28 Chidlom, Bangkok","฿95,000 /mo","28 Chit Lom Alley, Lumphini, Pathum Wan, Bangkok",4 mins (330 m) to E1 Chit Lom BTS,Condo\r\nBuilt: 2019,75,1,2.0,1266.67,อาภรณ์ เปี่ยมปัญญา,https://www.ddproperty.com/en/property/28-chid...
3,"Merlin Tower Condominium, Bangkok","฿15,000 /mo","Soi Narathiwat 14 Sathon Road, Thung Wat Don, ...",5 mins (350 m) to B3 Technic Krungthep BRT,Condo\r\nFully Furnished\r\nBuilt: 2012,80,2,2.0,187.5,ณัฐพัชร์ โชติอัครสินทบ,https://www.ddproperty.com/en/property/merlin-...
4,"Life Sathorn Sierra, Bangkok","฿13,000 /mo","Ratchaphruek Rd, Talat Plu, Thon Buri, Bangkok",1 mins (90 m) to B12 Ratchaphruek BRT,Condo\r\nFully Furnished\r\nBuilt: 2022,28,1,1.0,464.29,Theeradon Chaopaknam,https://www.ddproperty.com/en/property/life-sa...
...,...,...,...,...,...,...,...,...,...,...,...
66869,"Life Ladprao, Bangkok","฿15,000 /mo","992 Ladprao Road, Jom Phon, Chatuchak, Bangkok",2 mins (160 m) to N9 Ha Yaek Lat Phrao BTS,Condo\r\nBuilt: 2020,26 sqm,Studio,,฿576.92 / sqm,Incube Realty,https://www.ddproperty.com/en/property/life-la...
66870,"The Waterford Diamond Tower Sukhumvit, Bangkok","฿18,000 /mo","758/18 Soi Sukhumvit 30/1, Sukhumvit Road, Kh...",8 mins (570 m) to E6 Thong Lo BTS,Condo\r\nBuilt: 1999,52 sqm,1,1.0,฿346.15 / sqm,Incube Realty,https://www.ddproperty.com/en/property/the-wat...
66871,"Supalai Premier Place Asok, Bangkok","฿20,000 /mo","60 Asoke Montri Road, Khlongtoei Nua, Watthana...",5 mins (410 m) to BL21 Phetchaburi MRT,Condo\r\nBuilt: 2014,65 sqm,1,1.0,฿307.69 / sqm,Incube Realty,https://www.ddproperty.com/en/property/supalai...
66872,"The River by Raimon Land, Bangkok","฿80,000 /mo","Soi Charoen Nakorn 13, Charoen Nakorn Road, Kh...",7 mins (520 m) to S6 Saphan Taksin BTS,Condo\r\nBuilt: 2009,185 sqm,1,2.0,฿432.43 / sqm,Incube Realty,https://www.ddproperty.com/en/property/the-riv...


In [131]:
# Data in "link" column should not be duplicated.

df[df[['link']].duplicated()]

Unnamed: 0,name,price_p_month,address,station,condo_details,room_size,bedroom,bathroom,price_per_sqm,agency,link
309,"28 Chidlom, Bangkok","฿95,000 /mo","28 Chit Lom Alley, Lumphini, Pathum Wan, Bangkok",4 mins (330 m) to E1 Chit Lom BTS,Condo\r\nBuilt: 2019,75,1,2.0,1266.67,อาภรณ์ เปี่ยมปัญญา,https://www.ddproperty.com/en/property/28-chid...
925,"Supalai Loft @ Talat Phlu Station, Bangkok","฿12,000 /mo","Ratchadaphisek Road, Talat Plu, Thon Buri, Ban...",6 mins (480 m) to S10 Talat Phlu BTS,Condo\r\nPartially Furnished\r\nBuilt: 2015,43,1,1.0,279.07,Land Property Management,https://www.ddproperty.com/en/property/supalai...
2083,"Cooper Siam, Bangkok","฿28,500 /mo","Soi Rong Mueang 5, Rong Muang, Pathum Wan, Ban...",9 mins (670 m) to W1 National Stadium BTS,Condo\r\nFully Furnished\r\nBuilt: 2021,36,1,1.0,791.67,Agentbkk,https://www.ddproperty.com/en/property/cooper-...
2084,"CONNER Ratchathewi, Bangkok","฿45,000 /mo","312 Soi Phetchaburi 7, Thanon Phetchaburi, Rat...",5 mins (350 m) to N1 Ratchathewi BTS,Condo\r\nPartially Furnished\r\nBuilt: 2021,51,1,1.0,882.35,พชรธรรม์ พลอัครวัตน์,https://www.ddproperty.com/en/property/conner-...
2085,"ASHTON Asoke - Rama 9, Bangkok","฿30,000 /mo","469 Asoke-Dindaeng Road, Din Daeng, Din Daeng,...",3 mins (230 m) to BL20 Phra Ram 9 MRT,Condo\r\nFully Furnished\r\nBuilt: 2020,43,1,1.0,697.67,Kim Thailand Property,https://www.ddproperty.com/en/property/ashton-...
...,...,...,...,...,...,...,...,...,...,...,...
66824,"Baan Ratchadamri Condominium, Bangkok","฿160,000 /mo","185 Ratchadamri Road, Lumphini, Pathum Wan, Ba...",7 mins (500 m) to S1 Rachadamri BTS,Condo\r\nBuilt: 2014,267 sqm,3,4.0,฿599.99 / sqm,"Accom Asia Co.,Ltd .",https://www.ddproperty.com/en/property/baan-ra...
66826,"Casa Condo Asoke - Dindaeng, Bangkok","฿9,000 /mo","5801 Din Daeng Road, Din Daeng, Din Daeng, Ba...",,Condo\r\nFully Furnished\r\nBuilt: 2013,26 sqm,Studio,,฿346.15 / sqm,Bangkok Prime Property,https://www.ddproperty.com/en/property/casa-co...
66851,"Dcondo Panaa, Bangkok","฿8,500 /mo","188 Liap Thang Rotfai Taling Chan Rd, Bang Khu...",,New Project: 2023\r\nCondo\r\nFully Furnished,26 sqm,1,1.0,฿326.92 / sqm,พัทธนันท์ เชาวลิต,https://www.ddproperty.com/en/property/dcondo-...
66852,"Dcondo Tann Charan, Bangkok","฿8,500 /mo","Liap Thang Rotfai Taling Chan Road, Bang Khun ...",,Condo\r\nFully Furnished\r\nBuilt: 2020,27 sqm,Studio,,฿314.81 / sqm,พัทธนันท์ เชาวลิต,https://www.ddproperty.com/en/property/dcondo-...


In [132]:
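# Drop duplicate listings, keeping the first occurrence of each link.
# .copy() makes the result an independent DataFrame, so later column assignments do not act on a slice.
df = df.drop_duplicates(subset=['link'], keep='first').copy()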
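
As a minimal sketch of what this step does, the toy frame below (hypothetical names and `example.com` links, not rows from the dataset) counts repeated links and then keeps the first row for each one.

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the link of the first.
listings = pd.DataFrame({
    "name": ["Condo A", "Condo B", "Condo A"],
    "link": [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/a",
    ],
})

# duplicated() flags every occurrence after the first one.
n_dupes = int(listings["link"].duplicated().sum())

# keep="first" retains the earliest row per link; .copy() detaches the result.
deduped = listings.drop_duplicates(subset=["link"], keep="first").copy()
```

The original index labels are preserved, so `deduped` keeps rows 0 and 1 here.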

### Correcting errors, adding necessary columns, removing unwanted columns

In [133]:
# Extract the pieces needed for the new columns

# Strip leading punctuation, then extract province and district from address
# ([A-Za-z] is used because [A-z] also matches the characters [ \ ] ^ _ `)
df['address'] = df['address'].str.lstrip('-.,')
df['province'] = df['address'].str.extract(r'([A-Za-z]+$)')
df['district'] = df['address'].str.extract(r'[A-Za-z].+,\s([A-Za-z].+),\s[A-Za-z].+$')

# Create rental_price from price_p_month, dropping the currency symbol, the '/mo' suffix and the thousands separators
df['rental_price'] = df['price_p_month'].str.extract(r'([0-9].+)\W')
df['rental_price'] = df['rental_price'].str.replace(',', '')

# Create new columns built_year, is_furnished and type using condo_details
df['built_year'] = df['condo_details'].str.extract(r'Built: (\d{4})')
df['is_furnished'] = df['condo_details'].str.extract(r'(Fully Furnished|Unfurnished|Partially Furnished)')
df['type'] = df['condo_details'].str.extract(r'^(\w+)')

# Create new columns from station
df['transportation'] = df['station'].str.extract(r'\b((?:MRT|BTS|BRT|Airport Link))\b')
df['station_names'] = df['station'].str.extract(r'\bto\s(.+)\s')
df['distance_from_station'] = df['station'].str.extract(r'\((\d+\s+m)\)')

# Clean the bedroom column: replace 'Studio' with '0'
df['bedroom'] = df['bedroom'].str.strip()
df['bedroom'] = df['bedroom'].str.replace('Studio', '0')
# Assume a studio has 1 bathroom
df.loc[df['bedroom'] == '0', 'bathroom'] = 1

# Remove the 'sqm' suffix from room_size
df['room_size'] = df['room_size'].str.replace('sqm', '', regex=False).str.strip()

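# Strip the currency symbol and the '/ sqm' suffix from price_per_sqm
# Note: the pattern needs a non-word character on both sides of the number, so plain values such as 279.07 do not match and become NaN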
df['price_per_sqm'] = df['price_per_sqm'].str.extract(r'\W(\d+.\d+)\W')
df['price_per_sqm'] = df['price_per_sqm'].str.replace(',', '')
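
To see what these patterns capture, here is a small stand-alone sketch that applies them with Python's `re` module to strings copied from the previews in this notebook. `first_group` is a hypothetical helper, not part of the notebook; it mimics how `str.extract` returns the first capture group of the first match. The character class is written as `[A-Za-z]`.

```python
import re

def first_group(pattern, text):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

# Sample values copied from the data previews in this notebook.
address = "28 Chit Lom Alley, Lumphini, Pathum Wan, Bangkok"
station = "6 mins (480 m) to S10 Talat Phlu BTS"
details = "Condo\r\nPartially Furnished\r\nBuilt: 2015"

province = first_group(r"([A-Za-z]+$)", address)
district = first_group(r"[A-Za-z].+,\s([A-Za-z].+),\s[A-Za-z].+$", address)
transportation = first_group(r"\b(MRT|BTS|BRT|Airport Link)\b", station)
station_name = first_group(r"\bto\s(.+)\s", station)
distance = first_group(r"\((\d+\s+m)\)", station)
built_year = first_group(r"Built: (\d{4})", details)
furnishing = first_group(r"(Fully Furnished|Unfurnished|Partially Furnished)", details)
listing_type = first_group(r"^(\w+)", details)

# Details that start with "New Project" behave differently.
new_project = "New Project: 2023\r\nCondo\r\nFully Furnished"
new_type = first_group(r"^(\w+)", new_project)           # first word is "New"
new_built = first_group(r"Built: (\d{4})", new_project)  # no "Built:" label
```

For the "New Project" string, `type` comes out as `New` rather than `Condo` and `built_year` is missing, so those listings may need separate handling.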

In [134]:
# Drop the raw columns that have been replaced by extracted ones
df = df.drop(columns=['price_p_month', 'condo_details', 'station'])


In [135]:
# Reindex columns
cols = ['name', 'type', 'address', 'province', 'district', 'agency', 'link',
        'built_year', 'room_size', 'bedroom', 'bathroom', 'is_furnished',
        'transportation', 'station_names', 'distance_from_station', 'price_per_sqm', 'rental_price']

df = df.reindex(columns=cols)

### Handling missing values

In [136]:
# These columns are assumed to be required, so rows missing any of them are dropped
df = df.dropna(subset=['rental_price', 'bedroom', 'bathroom', 'room_size', 'price_per_sqm'])
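
Before dropping rows, it can be worth checking why a value is missing. The previews above show `price_per_sqm` in two raw formats, `279.07` and `฿576.92 / sqm`. The sketch below applies the notebook's extraction pattern to both using the standard `re` module. `parse_price_per_sqm` is a hypothetical, more tolerant alternative rather than the notebook's method, and `฿1,266.67 / sqm` is a made-up value used only to show the effect of a thousands separator.

```python
import re

NOTEBOOK_PATTERN = r"\W(\d+.\d+)\W"  # pattern used for price_per_sqm earlier

def notebook_extract(text):
    """Apply the notebook's pattern; None stands in for NaN."""
    match = re.search(NOTEBOOK_PATTERN, text)
    return match.group(1) if match else None

def parse_price_per_sqm(text):
    """Hypothetical tolerant parser: first number, thousands commas removed."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    return float(match.group(0).replace(",", "")) if match else None

plain = notebook_extract("279.07")               # no surrounding non-word characters
labelled = notebook_extract("฿576.92 / sqm")
thousands = notebook_extract("฿1,266.67 / sqm")  # '.' in the pattern matches the comma

tolerant = [
    parse_price_per_sqm(value)
    for value in ("279.07", "฿576.92 / sqm", "฿1,266.67 / sqm")
]
```

Here `plain` is `None`, so rows in the plain format would be removed by `dropna`. That may be why the later `df.info()` output starts at index 10003, although this is an inference from the outputs shown here, not something the notebook states.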

### Changing data types

In [137]:
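# Map furnishing labels to Boolean values
mapping = {'Partially Furnished': True, 'Fully Furnished': True, 'Unfurnished': False}

# Apply the mapping and convert to bool
# Caution: listings without furnishing info are NaN here, and astype('bool') turns NaN into True
df['is_furnished'] = df['is_furnished'].replace(mapping).astype('bool')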
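
One thing to watch in this conversion is how missing furnishing labels are treated. The sketch below is an assumption-labelled alternative, not what the notebook does: it contrasts the plain `bool` dtype with pandas' nullable `boolean` dtype on a three-value toy series.

```python
import numpy as np
import pandas as pd

mapping = {"Partially Furnished": True, "Fully Furnished": True, "Unfurnished": False}

# Toy series: the middle listing has no furnishing information.
furnishing = pd.Series(["Fully Furnished", np.nan, "Unfurnished"])

# Plain bool cannot represent "unknown", so NaN is coerced to True.
as_plain_bool = furnishing.map(mapping).astype("bool")

# The nullable boolean dtype keeps the missing value as <NA>.
as_nullable = furnishing.map(mapping).astype("boolean")
```

With the nullable dtype, `as_nullable.sum()` counts only listings that are actually labelled as furnished.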

In [138]:
# Change data types to proper type

df.loc[:, :'link'] = df.loc[:, :'link'].astype(str)
df['price_per_sqm'] = df['price_per_sqm'].astype(float)
df[['room_size', 'bedroom', 'bathroom', 'rental_price']] = df[['room_size', 'bedroom', 'bathroom', 'rental_price']].astype(int)
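
`astype(int)` raises an error on the first value it cannot parse. A more forgiving variant, sketched here on made-up values with `pd.to_numeric`, turns unparseable entries into missing values so they can be inspected first.

```python
import pandas as pd

# Made-up values: two clean numbers and one entry that is not a number.
raw_sizes = pd.Series(["43", "26", "n/a"])

# errors="coerce" replaces anything unparseable with NaN instead of raising.
as_float = pd.to_numeric(raw_sizes, errors="coerce")

# The nullable Int64 dtype keeps integers and still allows missing values.
as_int = as_float.astype("Int64")

# Rows that failed to parse can be listed before deciding what to do with them.
unparsed = raw_sizes[as_float.isna()]
```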

In [139]:
# Check data types again

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 41289 entries, 10003 to 66873
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   name                   41289 non-null  object 
 1   type                   41289 non-null  object 
 2   address                41289 non-null  object 
 3   province               41289 non-null  object 
 4   district               41289 non-null  object 
 5   agency                 41289 non-null  object 
 6   link                   41289 non-null  object 
 7   built_year             39292 non-null  object 
 8   room_size              41289 non-null  int64  
 9   bedroom                41289 non-null  int64  
 10  bathroom               41289 non-null  int64  
 11  is_furnished           41289 non-null  bool   
 12  transportation         35520 non-null  object 
 13  station_names          35647 non-null  object 
 14  distance_from_station  35647 non-null  object 
 15

In [140]:
# .describe() shows implausible maxima for price_per_sqm (614,973) and rental_price (about 17.1 billion baht), which point to data-entry errors

df.describe()

Unnamed: 0,room_size,bedroom,bathroom,price_per_sqm,rental_price
count,41289.0,41289.0,41289.0,41289.0,41289.0
mean,70.681102,1.503984,1.50224,2843.668538,630964.2
std,79.765572,0.748078,0.773834,22307.219903,84398370.0
min,2.0,0.0,1.0,2.04,1900.0
25%,35.0,1.0,1.0,460.0,20000.0
50%,50.0,1.0,1.0,609.76,32000.0
75%,78.0,2.0,2.0,800.0,55000.0
max,9330.0,7.0,9.0,614973.0,17140920000.0


### Removing outliers

In [141]:
# Treat a listing as mispriced if rent exceeds 300,000 baht with fewer than 2 bedrooms, price per sqm is 40,000 or more, or rent is 800,000 baht or more
final_df = df.loc[((df['rental_price'] <= 300000) | (df['bedroom'] >= 2)) & (df['price_per_sqm'] < 40000) & (df['rental_price'] < 800000)]
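
The combined condition can be easier to read when applied to one listing at a time. The sketch below restates the same rule as a plain Python function. `keep_listing` is a hypothetical helper and the example values are made up for illustration.

```python
def keep_listing(rental_price, bedroom, price_per_sqm):
    """Mirror the filter above for a single listing (hypothetical helper)."""
    # Rent above 300,000 baht is accepted only for units with 2+ bedrooms.
    plausible_for_size = rental_price <= 300_000 or bedroom >= 2
    return plausible_for_size and price_per_sqm < 40_000 and rental_price < 800_000

# Made-up listings covering each branch of the rule.
examples = {
    "typical 1-bedroom": keep_listing(32_000, 1, 609.76),
    "1-bedroom above 300,000": keep_listing(350_000, 1, 700.0),
    "3-bedroom above 300,000": keep_listing(350_000, 3, 700.0),
    "rent of 800,000 or more": keep_listing(800_000, 3, 700.0),
    "price per sqm of 40,000 or more": keep_listing(32_000, 1, 40_000.0),
}
```

Only the first and third examples are kept; the other three are treated as outliers.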

In [142]:
# Finally, run .describe() again to confirm that the extreme values are gone

final_df.describe()

Unnamed: 0,room_size,bedroom,bathroom,price_per_sqm,rental_price
count,40797.0,40797.0,40797.0,40797.0,40797.0
mean,70.640831,1.502316,1.500846,654.838965,45616.063289
std,79.909081,0.747533,0.772404,302.630312,46089.174544
min,2.0,0.0,1.0,2.04,1900.0
25%,35.0,1.0,1.0,457.14,20000.0
50%,50.0,1.0,1.0,606.06,32000.0
75%,78.0,2.0,2.0,791.37,55000.0
max,9330.0,7.0,9.0,13333.0,750000.0


In [None]:
# Export the cleaned data to a .csv file

final_df.to_csv('final_rental_condo_data_bkk_jan_2024.csv', index=False, encoding='utf-8')