# Part 1 : Data Cleaning

## Introduction

This is a self-initiated regression project in which I scrape listings from a website called __rumah123.com__. To simplify data collection, I scraped the data with a Chrome extension called __Web Scraper by webscraper.io__. The scraped dataset is saved in an Excel file named __rumah123_scraper.xlsx__. It contains details of houses for sale in Cimahi and its surroundings (within a radius of 20 kilometers) as of November 2021, so the data discussed here might not be up to date.
<br>

Problem statement: I want to build a machine learning model that predicts the selling price of houses in Cimahi and its surroundings. I also want to know which features have the most impact on a house's selling price.
<br>

At this stage, I load the raw dataset and perform data wrangling to clean it. The goal is to have a cleaned dataset by the end of this notebook, which I can then use to draw insights.
<br>

This project is motivated by my dream of buying land on which to build my own house later, because I want to live near my parents in Cimahi. Amen, haha.

## Importing Library

In [1]:
import pandas as pd

## Importing Dataset

In [2]:
df = pd.read_excel('rumah123_scraper.xlsx')
df

Unnamed: 0,web-scraper-order,web-scraper-start-url,pages,house,house-href,address,building_area,surface_area,bedroom,bathroom,parking_area,tenure,price
0,1636548696-51,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,RUMAH DI CIMAHI,https://www.rumah123.com/properti/cimahi/hos89...,"Cimahi Utara, Cimahi",-,-,-,-,-,SHM - Sertifikat Hak Milik,Rp. 650.000.000
1,1636549200-148,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,RUMAH STRATEGIS DI CIMAHI UTARA CITEUREUP SEBR...,https://www.rumah123.com/properti/cimahi/hos62...,"Cimahi Utara, Cimahi",170 m²,205 m²,2,1,-,SHM - Sertifikat Hak Milik,Rp. 2.500.000.000
2,1636551582-631,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,Jual rumah di Lembang perumahan rose garden vi...,https://www.rumah123.com/properti/bandung/hos9...,"Lembang, Bandung",50 m²,85 m²,3,2,1,SHM - Sertifikat Hak Milik,Rp. 998.000.000
3,1636548513-8,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,RUMAH VILLA SYARIAH CIMAHI TANPA BANK & BI CHE...,https://www.rumah123.com/properti/cimahi/hos91...,"Cimahi Utara, Cimahi",65 m²,120 m²,3,2,2,SHM - Sertifikat Hak Milik,Rp. 1.040.000.000
4,1636551508-608,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,Rumah Siap Huni Terawat di Jalan Pesantren Cim...,https://www.rumah123.com/properti/cimahi/hos86...,"Cimahi Utara, Cimahi",200 m²,150 m²,4,3,1,SHM - Sertifikat Hak Milik,Rp. 1.650.000.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...
970,1636550603-428,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,Rumah Mewah luas siap huni bonus kolam renang ...,https://www.rumah123.com/properti/bandung/hos8...,"Lembang, Bandung",590 m²,756 m²,11,6,-,SHM - Sertifikat Hak Milik,Rp. 5.500.000.000
971,1636553127-946,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,RUMAH VILLA DI CIMAHI PAKUHAJI,https://www.rumah123.com/properti/cimahi/hos91...,"Cimahi Utara, Cimahi",55 m²,80 m²,2,2,1,SHM - Sertifikat Hak Milik,Rp. 700.000.000
972,1636550330-380,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,rumah murah strategis di katumiri grend hill,https://www.rumah123.com/properti/bandung/hos8...,"Cihanjuang, Bandung",60 m²,90 m²,3,1,1,SHM - Sertifikat Hak Milik,Rp. 750.000.000
973,1636551414-588,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,RUMAH DEKAT PEMKOT CIMAHI CIAWITALI,https://www.rumah123.com/properti/cimahi/hos82...,"Cimahi Tengah, Cimahi",100 m²,102 m²,4,2,-,SHM - Sertifikat Hak Milik,Rp. 760.000.000


## Data Understanding

In [3]:
# Data shape
df.shape

(975, 13)

So we have 975 rows and 13 columns. Later we can drop the columns that are not useful for model prediction.

In [4]:
# Column names
df.columns

Index(['web-scraper-order', 'web-scraper-start-url', 'pages', 'house',
       'house-href', 'address', 'building_area', 'surface_area', 'bedroom',
       'bathroom', 'parking_area', 'tenure', 'price'],
      dtype='object')

In [5]:
# data type
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 975 entries, 0 to 974
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   web-scraper-order      975 non-null    object
 1   web-scraper-start-url  975 non-null    object
 2   pages                  975 non-null    object
 3   house                  975 non-null    object
 4   house-href             975 non-null    object
 5   address                975 non-null    object
 6   building_area          975 non-null    object
 7   surface_area           975 non-null    object
 8   bedroom                975 non-null    object
 9   bathroom               975 non-null    object
 10  parking_area           975 non-null    object
 11  tenure                 975 non-null    object
 12  price                  975 non-null    object
dtypes: object(13)
memory usage: 99.1+ KB


In [6]:
# Unique values per column
for item in df.columns:
    print(item + '=',(df[item].unique()))

web-scraper-order= ['1636548696-51' '1636549200-148' '1636551582-631' '1636548513-8'
 '1636551508-608' '1636552281-782' '1636552194-756' '1636550044-326'
 '1636553199-971' '1636551938-699' '1636548954-91' '1636553202-972'
 '1636552438-813' '1636548678-44' '1636549063-123' '1636550638-440'
 '1636553143-952' '1636552262-775' '1636548699-52' '1636550341-383'
 '1636549547-220' '1636548688-48' '1636548568-27' '1636548976-97'
 '1636548541-18' '1636552856-907' '1636548518-10' '1636551797-675'
 '1636551551-621' '1636550511-421' '1636550347-385' '1636551416-589'
 '1636548968-96' '1636549916-303' '1636552641-854' '1636549956-317'
 '1636553167-959' '1636551471-600' '1636551961-707' '1636552615-846'
 '1636551239-558' '1636549966-321' '1636553064-933' '1636550763-461'
 '1636551983-713' '1636552175-749' '1636551794-674' '1636552649-857'
 '1636550306-372' '1636552382-794' '1636551326-565' '1636552375-792'
 '1636549457-193' '1636552677-866' '1636551951-703' '1636549816-282'
 '1636552643-855' '16365520

In [7]:
# Detect duplicate rows (duplicated() compares whole rows, so one total is enough)
print('Total Duplicate Data :', df.duplicated().sum())

Total Duplicate Data : 0
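
Fully duplicated rows and repeated listings are different questions. The sketch below uses a small made-up frame (the sample values are hypothetical, not taken from the dataset) to show how `DataFrame.duplicated` can be restricted to a subset of columns, for example to catch the same listing URL scraped twice under different scraper IDs.

```python
import pandas as pd

# Hypothetical sample: two rows share a listing URL but have different scraper IDs
sample = pd.DataFrame({
    'web-scraper-order': ['1-1', '1-2', '1-3'],
    'house-href': ['url_a', 'url_a', 'url_b'],
    'price': ['Rp. 650.000.000', 'Rp. 650.000.000', 'Rp. 700.000.000'],
})

# Rows that are identical in every column
full_row_dupes = sample.duplicated().sum()

# Rows that repeat an earlier listing URL, regardless of the other columns
same_listing_dupes = sample.duplicated(subset=['house-href']).sum()

print('Fully duplicated rows :', full_row_dupes)
print('Repeated listing URLs :', same_listing_dupes)
```

Which columns identify "the same listing" is a judgment call; `house-href` is only one plausible choice.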


In [8]:
# Unique Region 
pd.DataFrame({'Unique Region : ' : df['address'].unique()})

Unnamed: 0,Unique Region :
0,"Cimahi Utara, Cimahi"
1,"Lembang, Bandung"
2,"Cimahi, Cimahi"
3,"Padalarang, Bandung"
4,"Parongpong, Bandung Barat"
5,"Ngamprah, Bandung"
6,"Cimahi Selatan, Cimahi"
7,"Cisarua, Bandung"
8,"Cimahi Tengah, Cimahi"
9,"Cimahi, Bandung"


As we can see, many areas appear under more than one name or with a misspelled city name, so we need to standardize them. We can also drop regions that are far from Cimahi and its surroundings (outside the 20-kilometer radius).

In [9]:
df[df['address'] == 'Cimahi, Bandung']

Unnamed: 0,web-scraper-order,web-scraper-start-url,pages,house,house-href,address,building_area,surface_area,bedroom,bathroom,parking_area,tenure,price
16,1636553143-952,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,RUMAH CILAME NGAMPRAH DEKAT TOL PADALARANG,https://www.rumah123.com/properti/bandung/hos9...,"Cimahi, Bandung",48 m²,100 m²,2,1,1,SHM - Sertifikat Hak Milik,Rp. 579.600.000
19,1636550341-383,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,"Rumah Komplek Melong Green Garden,Cimahi Selatan.",https://www.rumah123.com/properti/bandung/hos8...,"Cimahi, Bandung",168 m²,137 m²,3,2,2,SHM - Sertifikat Hak Milik,Rp. 1.250.000.000
21,1636548688-48,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,Dijual rumah di komp taman mutiara,https://www.rumah123.com/properti/bandung/hos8...,"Cimahi, Bandung",175 m²,135 m²,5,2,-,SHM - Sertifikat Hak Milik,Rp. 1.400.000.000
32,1636548968-96,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,Rmh Siap Huni Komp. Kavling CiawitaLi,https://www.rumah123.com/properti/bandung/hos8...,"Cimahi, Bandung",35 m²,72 m²,2,1,2,SHM - Sertifikat Hak Milik,Rp. 495.000.000
34,1636552641-854,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,Rumah Siap Huni dan Nyaman di Taman Bumi Prima...,https://www.rumah123.com/properti/bandung/hos8...,"Cimahi, Bandung",100 m²,144 m²,2,1,-,SHM - Sertifikat Hak Milik,Rp. 1.600.000.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...
901,1636551450-597,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,RUMAH CANTIK MEEAH MURAH GRAHA CILAME BANDUNG ...,https://www.rumah123.com/properti/bandung/hos8...,"Cimahi, Bandung",60 m²,93 m²,2,2,-,SHM - Sertifikat Hak Milik,Rp. 635.000.000
914,1636550062-333,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,Rumah terawat taman mutiara cimahi,https://www.rumah123.com/properti/bandung/hos9...,"Cimahi, Bandung",110 m²,103 m²,3,2,-,SHM - Sertifikat Hak Milik,Rp. 950.000.000
921,1636551701-648,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,RUMAH SIAP HUNI BEBAS BANJIR MURAH LINGKUNGAN ...,https://www.rumah123.com/properti/bandung/hos8...,"Cimahi, Bandung",45 m²,146 m²,2,1,1,SHM - Sertifikat Hak Milik,Rp. 1.400.000.000
967,1636551964-708,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,Jual Rumah Minimalis Bagus di Komplek Griya As...,https://www.rumah123.com/properti/bandung/hos9...,"Cimahi, Bandung",80 m²,106 m²,3,2,1,SHM - Sertifikat Hak Milik,Rp. 1.200.000.000


Because "Cimahi, Bandung" is ambiguous, I have to clean these rows manually, one by one, using the link in each row. I will identify the specific area from the 'house' column and confirm it on a map.

In [10]:
df[df['address'] == 'Cimahi, Cimahi']

Unnamed: 0,web-scraper-order,web-scraper-start-url,pages,house,house-href,address,building_area,surface_area,bedroom,bathroom,parking_area,tenure,price
6,1636552194-756,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,Rumah Mewah Luas di Cimahi Terusan Agronomi Ta...,https://www.rumah123.com/properti/cimahi/hos85...,"Cimahi, Cimahi",521 m²,280 m²,8,3,1,SHM - Sertifikat Hak Milik,Rp. 2.600.000.000
796,1636549647-250,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,Rumah masih di huni bangunan 2014 ahir,https://www.rumah123.com/properti/cimahi/hos86...,"Cimahi, Cimahi",80 m²,60 m²,2,2,-,SHM - Sertifikat Hak Milik,Rp. 950.000.000


In the __house__ column, we can see Agronomi street, which is in fact located in "Cimahi Tengah". We can therefore assume that these rows belong to the Cimahi Tengah area and replace "Cimahi, Cimahi" with "Cimahi Tengah, Cimahi".

In [11]:
# Detect unidentified null values
print ('Total unidentified null values ("-") :')
for item in df :
     print ('•', item + '=', (df[item][df[item] == "-"]).count())

Total unidentified null values ("-") :
• web-scraper-order= 0
• web-scraper-start-url= 0
• pages= 0
• house= 0
• house-href= 0
• address= 0
• building_area= 1
• surface_area= 14
• bedroom= 21
• bathroom= 12
• parking_area= 488
• tenure= 52
• price= 0
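
The counts above treat "-" as a missing-value marker. As a minimal sketch on made-up sample rows (the frame below is hypothetical), the marker could be converted into real missing values so that the usual `isna()` tooling can count them. Only cells that are exactly "-" are affected, so values such as "SHM - Sertifikat Hak Milik" stay intact.

```python
import pandas as pd

# Hypothetical sample that mimics three of the raw columns
sample = pd.DataFrame({
    'bedroom': ['2', '-', '3'],
    'parking_area': ['-', '-', '1'],
    'tenure': ['SHM - Sertifikat Hak Milik', '-', 'HGB - Hak Guna Bangunan'],
})

# mask() replaces cells where the condition is True with NaN;
# the comparison is an exact match, so text that merely contains a hyphen is untouched
as_missing = sample.mask(sample == '-')

missing_per_column = as_missing.isna().sum()
print(missing_per_column)
```

An alternative, assuming the marker is used consistently, would be to pass `na_values='-'` to `pd.read_excel` when loading the file.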


In [12]:
df[df['building_area'] == '-']

Unnamed: 0,web-scraper-order,web-scraper-start-url,pages,house,house-href,address,building_area,surface_area,bedroom,bathroom,parking_area,tenure,price
0,1636548696-51,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,RUMAH DI CIMAHI,https://www.rumah123.com/properti/cimahi/hos89...,"Cimahi Utara, Cimahi",-,-,-,-,-,SHM - Sertifikat Hak Milik,Rp. 650.000.000


In [13]:
df[df['surface_area'] == '-']

Unnamed: 0,web-scraper-order,web-scraper-start-url,pages,house,house-href,address,building_area,surface_area,bedroom,bathroom,parking_area,tenure,price
0,1636548696-51,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,RUMAH DI CIMAHI,https://www.rumah123.com/properti/cimahi/hos89...,"Cimahi Utara, Cimahi",-,-,-,-,-,SHM - Sertifikat Hak Milik,Rp. 650.000.000
134,1636549477-200,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,STOK TERBATAS APARTEMEN TERLARIS FULL FURNISH ...,https://www.rumah123.com/properti/bandung/aps2...,"Sudirman, Bandung",59 m²,-,3,2,-,"Lainnya (PPJB,Girik,Adat,dll)",Rp. 1.200.000.000
136,1636549575-228,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,"Dijual Apartemen The Edge Baros, Bandung",https://www.rumah123.com/properti/sukabumi/aps...,"Baros, Sukabumi",30 m²,-,2,1,-,"Lainnya (PPJB,Girik,Adat,dll)",Rp. 289.000.000
349,1636550862-492,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,Apartemen The Edge Cimahi Type Executive,https://www.rumah123.com/properti/bandung/aps1...,"Cimahi, Bandung",36 m²,-,2,1,-,SHM - Sertifikat Hak Milik,Rp. 390.000.000
393,1636552985-928,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,Gemeshin Amad Sih.. Apartement View Poll! Furn...,https://www.rumah123.com/properti/bandung/aps2...,"Cicendo, Bandung",33 m²,-,1,1,-,"Lainnya (PPJB,Girik,Adat,dll)",Rp. 440.000.000
436,1636551104-533,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,Bukan Sulap Bukan Sihir! Tapi Bikin Nyengir! A...,https://www.rumah123.com/properti/cimahi/aps24...,"Cimahi Tengah, Cimahi",36 m²,-,2,1,-,HGB - Hak Guna Bangunan,Rp. 300.000.000
493,1636552572-834,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,Apartemen EDGE Baros Cimahi dekat Gerbang Tol,https://www.rumah123.com/properti/cimahi/aps24...,"Cimahi Tengah, Cimahi",36 m²,-,2,1,-,-,Rp. 350.000.000
607,1636549943-312,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,Apartemen satu- satunya yg dekat dengan tol Ba...,https://www.rumah123.com/properti/bandung/aps1...,"Cimahi, Bandung",36 m²,-,2,1,-,HGB - Hak Guna Bangunan,Rp. 270.000.000
657,1636552692-872,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,(06JO) DAPAT APARTEMEN VIEW CANTIK FULL FURNIS...,https://www.rumah123.com/properti/cimahi/aps24...,"Leuwi Gajah, Cimahi",36 m²,-,2,1,-,-,Rp. 240.000.000
707,1636551837-688,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,APARTEMEN MURAH DI CIMAHI MILIKI SEKARANG JUGA,https://www.rumah123.com/properti/cimahi/aps24...,"Cimahi Selatan, Cimahi",36 m²,-,2,1,-,HGB - Hak Guna Bangunan,Rp. 240.000.000


We can drop these rows, because a house cannot lack a surface area or a building area. Judging by the __house__ column, most of them are apartment listings rather than houses.

In [14]:
df[df['tenure'] == '-']

Unnamed: 0,web-scraper-order,web-scraper-start-url,pages,house,house-href,address,building_area,surface_area,bedroom,bathroom,parking_area,tenure,price
17,1636552262-775,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,Rumah 2 Lantai di Kolonel Masturi Cimahi,https://www.rumah123.com/properti/cimahi/hos87...,"Cimahi Utara, Cimahi",90 m²,105 m²,3,2,-,-,Rp. 1.244.000.000
35,1636549956-317,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,Rumah Derah Sayap Pasteur Bandung,https://www.rumah123.com/properti/bandung/hos4...,"Pasteur, Bandung",150 m²,135 m²,4,2,-,-,Rp. 2.300.000.000
43,1636550763-461,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,Rumah Lux dan luas di Setramas - Cimahi,https://www.rumah123.com/properti/bandung/hos7...,"Cimahi, Bandung",260 m²,280 m²,5,4,2,-,Rp. 3.500.000.000
95,1636549848-290,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,"RUMAH at Jl CIMAHI, MENTENG T/B. 763/800 (FOR ...",https://www.rumah123.com/properti/jakarta-pusa...,"Menteng, Jakarta Pusat",800 m²,763 m²,8,6,-,-,Rp. 68.000.000.000
96,1636550349-386,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,Rumah Kost Jalan Sirnarasa Cibabat Cimahi,https://www.rumah123.com/properti/cimahi/hos82...,"Cimahi Tengah, Cimahi",816 m²,820 m²,19,15,-,-,Rp. 8.000.000.000
104,1636550458-402,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,"Rumah siap huni di Kotamas, Cimahi",https://www.rumah123.com/properti/bandung/hos6...,"Cimahi, Bandung",100 m²,133 m²,4,2,1,-,Rp. 1.400.000.000
146,1636551482-604,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,Villa Cantik Seperti Calon Pemilik! Suasanapun...,https://www.rumah123.com/properti/cimahi/hos86...,"Cimahi Tengah, Cimahi",300 m²,1062 m²,2,2,1,-,Rp. 2.490.000.000
153,1636550287-367,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,Rumah Cluster Di Komplek Graha Indah Cimindi C...,https://www.rumah123.com/properti/bandung/hos8...,"Cimindi, Bandung",80 m²,130 m²,2,2,-,-,Rp. 900.000.000
231,1636550852-488,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,RUMAH 2 LANTAI CITY VIEW 10 MENIT DARI MCDONAL...,https://www.rumah123.com/properti/bandung/hos3...,"Cimahi, Bandung",160 m²,363 m²,5,3,1,-,Rp. 1.200.000.000
262,1636552033-730,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,"Rumah di Melong Green Garden, Cimahi",https://www.rumah123.com/properti/cimahi/hos60...,"Cimahi Selatan, Cimahi",150 m²,164 m²,3,3,1,-,Rp. 1.000.000.000


We can assume that these sellers did not want to disclose the tenure status, so we can replace "-" with "Tidak Ada Penjelasan" (Indonesian for "no explanation").

In [15]:
df[df['surface_area'] == '1 m²']

Unnamed: 0,web-scraper-order,web-scraper-start-url,pages,house,house-href,address,building_area,surface_area,bedroom,bathroom,parking_area,tenure,price
47,1636552649-857,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,Villa Cimahi Utara Nyaman Harga Nego,https://www.rumah123.com/properti/cimahi/hos85...,"Cimahi Utara, Cimahi",300 m²,1 m²,-,-,-,SHM - Sertifikat Hak Milik,Rp. 2.500.000.000


In [16]:
df[df['surface_area'] == '2 m²']

Unnamed: 0,web-scraper-order,web-scraper-start-url,pages,house,house-href,address,building_area,surface_area,bedroom,bathroom,parking_area,tenure,price
56,1636552643-855,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,Dijual Rumah di Kolonel Masturi Cimahi,https://www.rumah123.com/properti/bandung/hos8...,"Cimahi, Bandung",300 m²,2 m²,3,3,-,SHM - Sertifikat Hak Milik,Rp. 6.250.000.000


We can drop these rows too, because a house cannot plausibly have a building area or surface area under 21 square meters.
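
Instead of checking suspicious values one string at a time, the threshold can be applied numerically. The following is a sketch under assumptions: the sample frame is made up, and `MIN_AREA` and `to_square_meters` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Hypothetical sample in the raw 'NNN m²' format
sample = pd.DataFrame({
    'building_area': ['300 m²', '300 m²', '65 m²', '-'],
    'surface_area': ['1 m²', '2 m²', '120 m²', '-'],
})

MIN_AREA = 21  # threshold in square meters, taken from the note above


def to_square_meters(column):
    """Extract the leading number from strings such as '120 m²'; '-' becomes NaN."""
    digits = column.str.extract(r'(\d+)', expand=False)
    return pd.to_numeric(digits, errors='coerce')


surface = to_square_meters(sample['surface_area'])
building = to_square_meters(sample['building_area'])

# NaN compares as False, so rows with missing areas are not flagged here
too_small = (surface < MIN_AREA) | (building < MIN_AREA)
print(sample[too_small])
```

Rows with a missing area are not flagged by this check and still need the separate "-" handling.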

## Data Cleaning

In [17]:
house = df.copy()

In [18]:
# Manually clean "Cimahi, Bandung": show every row so each listing can be checked
pd.set_option('display.max_rows',df.shape[0]+1)
house[house['address'] == 'Cimahi, Bandung']

Unnamed: 0,web-scraper-order,web-scraper-start-url,pages,house,house-href,address,building_area,surface_area,bedroom,bathroom,parking_area,tenure,price
16,1636553143-952,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,RUMAH CILAME NGAMPRAH DEKAT TOL PADALARANG,https://www.rumah123.com/properti/bandung/hos9...,"Cimahi, Bandung",48 m²,100 m²,2,1,1,SHM - Sertifikat Hak Milik,Rp. 579.600.000
19,1636550341-383,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,"Rumah Komplek Melong Green Garden,Cimahi Selatan.",https://www.rumah123.com/properti/bandung/hos8...,"Cimahi, Bandung",168 m²,137 m²,3,2,2,SHM - Sertifikat Hak Milik,Rp. 1.250.000.000
21,1636548688-48,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,Dijual rumah di komp taman mutiara,https://www.rumah123.com/properti/bandung/hos8...,"Cimahi, Bandung",175 m²,135 m²,5,2,-,SHM - Sertifikat Hak Milik,Rp. 1.400.000.000
32,1636548968-96,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,Rmh Siap Huni Komp. Kavling CiawitaLi,https://www.rumah123.com/properti/bandung/hos8...,"Cimahi, Bandung",35 m²,72 m²,2,1,2,SHM - Sertifikat Hak Milik,Rp. 495.000.000
34,1636552641-854,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,Rumah Siap Huni dan Nyaman di Taman Bumi Prima...,https://www.rumah123.com/properti/bandung/hos8...,"Cimahi, Bandung",100 m²,144 m²,2,1,-,SHM - Sertifikat Hak Milik,Rp. 1.600.000.000
43,1636550763-461,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,Rumah Lux dan luas di Setramas - Cimahi,https://www.rumah123.com/properti/bandung/hos7...,"Cimahi, Bandung",260 m²,280 m²,5,4,2,-,Rp. 3.500.000.000
55,1636549816-282,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,Rumah Lux Asri Siap Huni di Bumi Sariwangi Cimahi,https://www.rumah123.com/properti/bandung/hos7...,"Cimahi, Bandung",300 m²,525 m²,4,3,2,SHM - Sertifikat Hak Milik,Rp. 5.500.000.000
56,1636552643-855,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,Dijual Rumah di Kolonel Masturi Cimahi,https://www.rumah123.com/properti/bandung/hos8...,"Cimahi, Bandung",300 m²,2 m²,3,3,-,SHM - Sertifikat Hak Milik,Rp. 6.250.000.000
80,1636549206-149,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,BIKIN PEDE! HARGA BISA NEGO? SIKAT AJA SAMPAI ...,https://www.rumah123.com/properti/bandung/hos9...,"Cimahi, Bandung",400 m²,394 m²,4,4,-,SHM - Sertifikat Hak Milik,Rp. 3.980.000.000
81,1636551698-647,https://www.rumah123.com/jual/residensial/?q=c...,https://www.rumah123.com/jual/residensial/?q=c...,Rumah Baru Lokasi Stratgis Diperumahan Kolmas ...,https://www.rumah123.com/properti/bandung/hos8...,"Cimahi, Bandung",90 m²,102 m²,3,2,-,SHM - Sertifikat Hak Milik,Rp. 1.128.000.000


In [19]:
# Remove the hyphen from each scraper ID so it matches the ID lists below
house['web-scraper-order'] = house['web-scraper-order'].str.replace('-', '', regex=False)

cmh_utr = ['163654896896','1636552641854','1636550763461','1636552643855','1636551698647','1636551061518','1636550458402','1636549545219','1636552829897'
,'1636551543618','1636551305559','1636551366577','1636550479410','1636550352387','1636551933697','1636551714653','1636551789672','1636549528215','1636551360575',
'1636550059332','1636551346572','1636552533821','1636550852488','1636552633851','1636549569226','1636551832686','1636552022726','1636551371579',
'163654867342','1636550747455','1636549228157','1636551311561','163654869149','1636551208546','1636551768667','1636552017724','1636551548620',
'1636550215364','163654869450','163654870553','1636549889293','1636551943701','1636549961319','1636550192355','1636552934912','1636552559829',
'1636549933308','1636549700265','1636551202545','163654867041','163654881770','1636549964320','1636549188145','1636549194147','1636550937496',
'163654866840','1636553204973','1636549556222','1636551827684','1636551587633','1636551810680','1636549842288','1636550464404','1636550173349',
'1636550860491','1636551456598','1636548992102','1636549914302','1636550506419','1636551223552','1636552042733','1636550333381','1636549593233','1636550660447',
'1636550328379','1636552278781','1636550834483','1636551701648','1636551964708','1636550770463']

cmh_tgh = ['163654868848','1636552111734','1636549206149','1636549559223','1636551791673','1636550966507','163654883476','1636550862492','1636552178750'
,'1636551766666','1636549850291','1636550786467','1636552548825','1636549475199','1636549943312','1636551674640','1636552958921','1636551363576',
'1636550496415','1636553196970','1636550947500','1636550062333']

cmh_stn = ['1636550341383','1636549601235','1636550587424','1636550361390','1636549523213','1636551799676','1636550358389','1636549094134',
'1636550755458','163654882573','1636552960922']

Ng = ['1636553143952','1636553148954','1636553141951','1636549909300','1636551553622','1636551709651','163654878961','1636550801472','1636552768877',
'1636553130947','1636549225156','1636550652444','163654852111','1636551450597']

ch = ['1636552803889','1636549630245','1636553061932','1636552778879','1636552792884','163654896093','1636549000104','1636551577629','1636549361182'
,'1636551704649','1636552801888','163654853115','1636551824683','1636550620433']

pdl = ['1636550484412','1636553146953','1636552816894','163654852914','1636550818477','1636553052929','163654852613','1636549712269']

cbr = ['1636549347178','1636549350179']

sr = ['1636549816282']

pst = ['1636551668638']

sk = ['1636552308791']

pr = ['1636549329171']

cs = ['1636550622434']

for item in cmh_utr:
    house.loc[house['web-scraper-order'] == item, 'address'] = 'Cimahi Utara, Cimahi'

for item in cmh_tgh:
    house.loc[house['web-scraper-order'] == item, 'address'] = 'Cimahi Tengah, Cimahi'

for item in cmh_stn:
    house.loc[house['web-scraper-order'] == item, 'address'] = 'Cimahi Selatan, Cimahi'

for item in Ng:
    house.loc[house['web-scraper-order'] == item, 'address'] = 'Ngamprah, Bandung Barat'

for item in ch:
    house.loc[house['web-scraper-order'] == item, 'address'] = 'Cihanjuang, Bandung Barat'

for item in pdl:
    house.loc[house['web-scraper-order'] == item, 'address'] = 'Padalarang, Bandung Barat'

for item in cbr:
    house.loc[house['web-scraper-order'] == item, 'address'] = 'Cibeureum, Bandung'

for item in sr:
    house.loc[house['web-scraper-order'] == item, 'address'] = 'Sariwangi, Bandung Barat'

for item in sk:
    house.loc[house['web-scraper-order'] == item, 'address'] = 'sukajadi, Bandung Barat'

for item in pr:
    house.loc[house['web-scraper-order'] == item, 'address'] = 'Parongpong, Bandung Barat'

for item in cs:
    house.loc[house['web-scraper-order'] == item, 'address'] = 'Cisarua, Bandung Barat'

for item in pst:
    house.loc[house['web-scraper-order'] == item, 'address'] = 'Pasteur, Bandung'
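
The repeated loops above all follow the same pattern: a list of IDs is assigned one corrected address. A more compact way to express this, shown as a sketch on a made-up frame, is a single dictionary from corrected address to ID list combined with `isin`. The IDs and the name `ids_by_address` are placeholders, not values from the dataset.

```python
import pandas as pd

# Hypothetical sample; the IDs are placeholders, not real scraper IDs
sample = pd.DataFrame({
    'web-scraper-order': ['id_1', 'id_2', 'id_3', 'id_4'],
    'address': ['Cimahi, Bandung', 'Cimahi, Bandung', 'Cimahi, Bandung', 'Lembang, Bandung'],
})

# One dictionary instead of one loop per area: corrected address -> list of IDs
ids_by_address = {
    'Cimahi Utara, Cimahi': ['id_1'],
    'Ngamprah, Bandung Barat': ['id_2', 'id_3'],
}

for address, ids in ids_by_address.items():
    # isin() builds one boolean mask for the whole ID list
    mask = sample['web-scraper-order'].isin(ids)
    sample.loc[mask, 'address'] = address

print(sample)
```

Rows whose ID is not listed, such as `id_4` here, keep their original address.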

In [20]:
# Drop columns that are not useful for the model
house.drop(['web-scraper-order','web-scraper-start-url','pages','house'],axis = 1, inplace = True)

In [21]:
# Rename column
house = house.rename(columns={'house-href':'url'})

In [22]:
# Drop rows with implausible surface areas
house.drop(house[house['surface_area'] == '1 m²'].index, inplace = True) 
house.drop(house[house['surface_area'] == '2 m²'].index, inplace = True) 

In [23]:
# Drop rows with missing values ("-"), then strip units, currency prefix, and thousands separators
var = ['building_area', 'surface_area','price']

for item in var :
    house.drop(house[house[item] == '-'].index, inplace = True)
    # regex=False makes every pattern literal, so '.' is not read as a regex wildcard
    house[item] = house[item].str.replace('m²','', regex = False)
    house[item] = house[item].str.replace('Rp.','', regex = False)
    house[item] = house[item].str.replace('.','', regex = False)
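
To make the string clean-up concrete, here is a sketch of the same idea written as two small helper functions. `parse_rupiah` and `parse_area` are hypothetical names, and the sample values follow the format seen in the raw data. Plain Python `str.replace` always treats its pattern literally, so the dot cannot be misread as a regular-expression wildcard.

```python
import pandas as pd


def parse_rupiah(text):
    """Turn 'Rp. 650.000.000' into 650000000.0 (dots are thousands separators)."""
    cleaned = text.replace('Rp.', '').replace('.', '').strip()
    return float(cleaned)


def parse_area(text):
    """Turn '170 m²' into 170.0."""
    return float(text.replace('m²', '').strip())


# Sample values in the raw format
sample = pd.DataFrame({
    'surface_area': ['205 m²', '85 m²'],
    'price': ['Rp. 2.500.000.000', 'Rp. 998.000.000'],
})

sample['surface_area'] = sample['surface_area'].map(parse_area)
sample['price'] = sample['price'].map(parse_rupiah)

print(sample)
```

These helpers assume that rows containing "-" have already been dropped; otherwise `float` would raise a `ValueError`.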


In [24]:
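# Treat a missing facility count ("-") as zero
facilities = ['bedroom','bathroom','parking_area']

for item in facilities :
    # Assign the result back; an in-place replace on a selected column may not update the DataFrame
    house[item] = house[item].replace('-','0')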

In [25]:
house['tenure'] = house['tenure'].replace('-','Tidak Ada Penjelasan')

In [26]:
# Drop areas that are far from Cimahi
far_cmh = ('Menteng, Jakarta Pusat'
,'Cibiru, Bandung'
,'Kutawaringin, Bandung'
,'Margaasih, Bandung'
,'Antapani, Bandung'
,'Dago, Bandung'
,'Rancabali, Bandung'
,'Buah Batu, Bandung'
,'Cicendo, Bandung'
,'Sayap Dago, Bandung'
,'Cihampelas, Bandung Barat'
,'Bandung Kota, Bandung'
,'Bandung Wetan, Bandung'
,'Batununggal, Bandung'
,'Cilengkrang, Bandung'
,'Kopo, Bandung'
,'Arjasari, Bandung'
,'Cihampelas, Bandung'
,'Surya Sumantri, Bandung'
,'Cibeunying, Bandung'
,'Kiaracondong, Bandung')

for item in far_cmh :
    house.drop(house[house['address'] == item].index, inplace = True)
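
The same filtering can be written without a loop. A minimal sketch on a made-up frame, with a shortened exclusion list for illustration, uses `isin` with a negated mask:

```python
import pandas as pd

# Hypothetical sample of addresses
sample = pd.DataFrame({
    'address': [
        'Cimahi Utara, Cimahi',
        'Menteng, Jakarta Pusat',
        'Dago, Bandung',
        'Lembang, Bandung',
    ],
})

# Shortened exclusion list, for illustration only
far_from_cimahi = ['Menteng, Jakarta Pusat', 'Dago, Bandung']

# Keep only the rows whose address is NOT in the exclusion list
nearby = sample[~sample['address'].isin(far_from_cimahi)].reset_index(drop=True)
print(nearby)
```

Unlike the loop, this version returns a new frame instead of modifying the original in place.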

In [27]:
# Standardize the area names (old name -> corrected name)
area_names = {
    'Cihanjuang, Bandung': 'Cihanjuang, Bandung Barat',
    'Cimahi, Cimahi': 'Cimahi Tengah, Cimahi',
    'Cisarua, Bandung': 'Cisarua, Bandung Barat',
    'Lembang, Bandung': 'Lembang, Bandung Barat',
    'Ngamprah, Bandung': 'Ngamprah, Bandung Barat',
    'Padalarang, Bandung': 'Padalarang, Bandung Barat',
    'Parongpong, Bandung': 'Parongpong, Bandung Barat',
    'Sariwangi, Bandung': 'Sariwangi, Bandung Barat',
    'Kota Baru Parahyangan, Bandung': 'Kota Baru Parahyangan, Bandung Barat',
    'sukajadi, Bandung Barat': 'Sukajadi, Bandung',
    'Geger Kalong, Bandung': 'Gegerkalong, Bandung',
    'Leuwi Gajah, Cimahi': 'Cimahi Selatan, Cimahi',
}

house['address'] = house['address'].replace(area_names)

In [28]:
house['address'].value_counts()

Cimahi Utara, Cimahi                    393
Cimahi Tengah, Cimahi                   121
Cihanjuang, Bandung Barat                71
Cimahi Selatan, Cimahi                   60
Ngamprah, Bandung Barat                  44
Parongpong, Bandung Barat                42
Sariwangi, Bandung Barat                 30
Padalarang, Bandung Barat                26
Lembang, Bandung Barat                   24
Cisarua, Bandung Barat                   23
Pasteur, Bandung                         13
Gegerkalong, Bandung                      8
Gunung Batu, Bandung                      7
Setra Duta, Bandung                       6
Setiabudi, Bandung                        6
Cijerah, Bandung                          5
Cibeureum, Bandung                        5
Batujajar, Bandung Barat                  4
Sukajadi, Bandung                         4
Sarijadi, Bandung                         3
Cimindi, Bandung                          2
Kota Baru Parahyangan, Bandung Barat      2
Sudirman, Bandung               

In [29]:
# Convert feature data type to numeric
Integer = ['bedroom','bathroom','parking_area']
Float = ['price','surface_area','building_area']

for item in Integer :
    house[item] = pd.to_numeric(house[item])

house[Float] = house[Float].astype(float)

In [30]:
# duplicated() compares whole rows, so one total is enough
print('Total duplicate rows =', house.duplicated().sum())

Total duplicate rows = 0


In [31]:
# Drop duplicate values
house.drop_duplicates(inplace = True)

## Export Clean Dataset

In [32]:
house.to_csv("house_clean.csv",index=False)
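
As a final sanity check, the exported file can be read back to confirm that the columns and numeric types survive the round trip. The sketch below uses an in-memory buffer and a one-row made-up frame in place of the real `house_clean.csv`, so it runs without the dataset.

```python
import io

import pandas as pd

# Hypothetical one-row stand-in for the cleaned dataset
cleaned = pd.DataFrame({
    'address': ['Cimahi Utara, Cimahi'],
    'bedroom': [2],
    'price': [650000000.0],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)  # index=False avoids an extra unnamed index column
buffer.seek(0)

reloaded = pd.read_csv(buffer)
print(reloaded.dtypes)
```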