## Load the dataset

**Dataset Link:**
**https://www.kaggle.com/datasets/taeefnajib/house-rent-in-dhaka-city/data**

In [23]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [24]:
df=pd.read_csv('houserentdhaka.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Location,Area,Bed,Bath,Price
0,0,"Block H, Bashundhara R-A, Dhaka","1,600 sqft",3,3,20 Thousand
1,1,"Farmgate, Tejgaon, Dhaka",900 sqft,2,2,20 Thousand
2,2,"Block B, Nobodoy Housing Society, Mohammadpur,...","1,250 sqft",3,3,18 Thousand
3,3,"Gulshan 1, Gulshan, Dhaka","2,200 sqft",3,4,75 Thousand
4,4,"Baridhara, Dhaka","2,200 sqft",3,3,75 Thousand


**We drop the `Unnamed: 0` column because it only repeats the row index and has no effect on the prediction.**

In [25]:
df.drop(['Unnamed: 0'], axis='columns', inplace=True)
df.head()

Unnamed: 0,Location,Area,Bed,Bath,Price
0,"Block H, Bashundhara R-A, Dhaka","1,600 sqft",3,3,20 Thousand
1,"Farmgate, Tejgaon, Dhaka",900 sqft,2,2,20 Thousand
2,"Block B, Nobodoy Housing Society, Mohammadpur,...","1,250 sqft",3,3,18 Thousand
3,"Gulshan 1, Gulshan, Dhaka","2,200 sqft",3,4,75 Thousand
4,"Baridhara, Dhaka","2,200 sqft",3,3,75 Thousand


**The `Price` and `Area` columns hold strings such as `20 Thousand` and `1,600 sqft`, but we need only the numbers, so we clean them next.**

In [26]:
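# Map each unit word to its numeric multiplier.
multiplier = {
    'Thousand': 1000,
    'Lakh': 100000,
    'Million': 1000000,
    'Crore': 10000000
}

def clean_price(price):
    # Split e.g. '20 Thousand' into ['20', 'Thousand'].
    parts = str(price).split()

    if len(parts) == 2:
        num, unit = parts
        # Unknown unit words fall back to a multiplier of 1.
        return float(num) * multiplier.get(unit, 1)
    else:
        # No unit word: treat the first token as the plain number.
        return float(parts[0])

df['Price'] = df['Price'].apply(clean_price)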
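The cell above assumes every price looks like `<number> <unit>` with no thousands separator. A slightly more defensive variant is sketched below: it strips commas and returns NaN instead of raising when a value cannot be parsed. The name `parse_price` and the sample strings are hypothetical and not part of the notebook.

```python
import math

# Same unit table as the notebook's `multiplier` dict.
MULTIPLIER = {
    "Thousand": 1_000,
    "Lakh": 100_000,
    "Million": 1_000_000,
    "Crore": 10_000_000,
}

def parse_price(price):
    """Hypothetical helper: convert '20 Thousand' style strings to a float."""
    # Remove thousands separators before splitting on whitespace.
    parts = str(price).replace(",", "").split()
    if not parts:
        return math.nan
    try:
        number = float(parts[0])
    except ValueError:
        # Not a number at all: report it as missing instead of raising.
        return math.nan
    unit = parts[1] if len(parts) > 1 else None
    # Missing or unknown unit words fall back to a multiplier of 1.
    return number * MULTIPLIER.get(unit, 1)

# Made-up inputs, chosen to exercise each branch.
samples = ["20 Thousand", "1.2 Lakh", "18,500", "n/a"]
parsed = [parse_price(s) for s in samples]
print(parsed)
```

Returning NaN keeps the bad rows visible to a later `isnull()` check rather than stopping the whole `apply` call.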

In [27]:
df.head()

Unnamed: 0,Location,Area,Bed,Bath,Price
0,"Block H, Bashundhara R-A, Dhaka","1,600 sqft",3,3,20000.0
1,"Farmgate, Tejgaon, Dhaka",900 sqft,2,2,20000.0
2,"Block B, Nobodoy Housing Society, Mohammadpur,...","1,250 sqft",3,3,18000.0
3,"Gulshan 1, Gulshan, Dhaka","2,200 sqft",3,4,75000.0
4,"Baridhara, Dhaka","2,200 sqft",3,3,75000.0


In [28]:
df['Area']=df['Area'].str.replace(',','').str.replace('sqft','').astype(int)

In [29]:
df.head()

Unnamed: 0,Location,Area,Bed,Bath,Price
0,"Block H, Bashundhara R-A, Dhaka",1600,3,3,20000.0
1,"Farmgate, Tejgaon, Dhaka",900,2,2,20000.0
2,"Block B, Nobodoy Housing Society, Mohammadpur,...",1250,3,3,18000.0
3,"Gulshan 1, Gulshan, Dhaka",2200,3,4,75000.0
4,"Baridhara, Dhaka",2200,3,3,75000.0


In [30]:
df['Area'].dtype

dtype('int64')
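The `astype(int)` call works here because every `Area` value is well formed. If some rows were malformed, a more tolerant conversion could look like the sketch below, which uses `pd.to_numeric` with `errors="coerce"`. The sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample; the last value is deliberately malformed.
area = pd.Series(["1,600 sqft", "900 sqft", "unknown"])

# Drop thousands separators, pull out the first run of digits, then coerce.
digits = area.str.replace(",", "", regex=False).str.extract(r"(\d+)", expand=False)
area_clean = pd.to_numeric(digits, errors="coerce")

print(area_clean.tolist())
print(int(area_clean.isna().sum()))  # number of rows that failed to parse
```

The result is a float column, because NaN cannot be stored in a plain `int64` column.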

**Next, we work on the `Location` column.**

**First, we check whether every location ends with `Dhaka`.**

In [32]:
# True means every location ends with 'Dhaka';
# False means at least one location does not.
df['Location'].str.strip().str.endswith('Dhaka').all()

np.True_

**Every location ends with `Dhaka`, so we can now remove that suffix.**

In [33]:
df['Location']=df['Location'].str.strip().str.removesuffix(", Dhaka")
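The earlier check tests for `Dhaka`, while the removal targets `, Dhaka`. A value that is exactly `Dhaka` passes the check but is left unchanged by `removesuffix`. The sketch below shows this on a hypothetical three-row sample; `Series.str.removesuffix` needs pandas 1.4 or newer.

```python
import pandas as pd

# Hypothetical sample: the last row has no comma before the city name.
loc = pd.Series(["Farmgate, Tejgaon, Dhaka", "Baridhara, Dhaka ", "Dhaka"])

stripped = loc.str.strip()
ends_with_city = stripped.str.endswith("Dhaka")
has_suffix = stripped.str.endswith(", Dhaka")

# Only rows ending in ', Dhaka' are shortened.
cleaned = stripped.str.removesuffix(", Dhaka")
print(cleaned.tolist())

# Rows that pass the first check but keep their text unchanged.
print(stripped[ends_with_city & ~has_suffix].tolist())
```

Listing the rows that match the first test but not the second is a quick way to see which values the removal leaves untouched.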

In [34]:
df.head()

Unnamed: 0,Location,Area,Bed,Bath,Price
0,"Block H, Bashundhara R-A",1600,3,3,20000.0
1,"Farmgate, Tejgaon",900,2,2,20000.0
2,"Block B, Nobodoy Housing Society, Mohammadpur",1250,3,3,18000.0
3,"Gulshan 1, Gulshan",2200,3,4,75000.0
4,Baridhara,2200,3,3,75000.0


**Before going further, we check the current dataframe for missing values.**

In [36]:
df.Area.isnull().sum()

np.int64(0)

In [37]:
df.Bed.isnull().sum()

np.int64(0)

In [38]:
df.Bath.isnull().sum()

np.int64(0)

In [39]:
df.Price.isnull().sum()

np.int64(0)
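The four checks above can also be done in a single call. The sketch below runs on a small hypothetical frame with the same column names; on the real data it would be applied to `df` directly.

```python
import pandas as pd

# Hypothetical sample with the notebook's numeric columns.
sample = pd.DataFrame({
    "Area": [1600, 900, 1250],
    "Bed": [3, 2, 3],
    "Bath": [3, 2, 3],
    "Price": [20000.0, 20000.0, 18000.0],
})

# One null count per column, returned as a Series indexed by column name.
null_counts = sample[["Area", "Bed", "Bath", "Price"]].isnull().sum()
print(null_counts)
```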

**None of these columns has missing values, so we move on to making the `Location` column more usable.**

**We now take a coarser view of location: we keep the area name, such as `Bashundhara R-A`, rather than the full address, such as `Block H, Bashundhara R-A`, because rents within the same area are roughly similar.**

In [40]:
df['Location']=df['Location'].apply(lambda x:x.split(',')[-1].strip())
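The same "keep the last comma-separated part" step can be written with pandas string methods instead of `apply`. This is a sketch on sample values taken from the table above, not a replacement the notebook requires.

```python
import pandas as pd

# Sample addresses in the form they have after ', Dhaka' is removed.
loc = pd.Series(["Block H, Bashundhara R-A", "Farmgate, Tejgaon", "Baridhara"])

# Split on commas, take the last piece, and trim surrounding whitespace.
area_name = loc.str.split(",").str[-1].str.strip()
print(area_name.tolist())
```

A value with no comma, such as `Baridhara`, is returned unchanged because the split yields a single piece.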

In [41]:
df.head()

Unnamed: 0,Location,Area,Bed,Bath,Price
0,Bashundhara R-A,1600,3,3,20000.0
1,Tejgaon,900,2,2,20000.0
2,Mohammadpur,1250,3,3,18000.0
3,Gulshan,2200,3,4,75000.0
4,Baridhara,2200,3,3,75000.0


**How many unique values does the `Location` column have, and what are they?**

In [49]:
df['Location'].unique()

array(['Bashundhara R-A', 'Tejgaon', 'Mohammadpur', 'Gulshan',
       'Baridhara', 'Hazaribag', 'Mirpur', 'Nikunja', 'Uttara',
       'Khilgaon', 'Ibrahimpur', 'Badda', 'Adabor', 'Jatra Bari',
       'Malibagh', 'Banani', 'Kakrail', 'Dhanmondi', 'Maghbazar',
       'Kalachandpur', 'Niketan', 'Eskaton', 'Banasree', 'Bashabo',
       'Baridhara DOHS', 'Aftab Nagar', 'Lalmatia', 'Dakshin Khan',
       'Mohakhali DOHS', 'Sutrapur', 'Hatirpool', 'Agargaon', 'Rampura',
       'Cantonment', 'Shahbagh', 'Khilkhet', 'Motijheel', 'Shantinagar',
       'Shegunbagicha', 'Kathalbagan', 'Shyamoli', 'Kalabagan', 'Demra',
       'Kuril', 'Mohakhali', 'Lalbagh', 'New Market', 'Kafrul',
       'Kachukhet', 'Turag', 'Dhaka', 'Nadda', 'Shyampur', 'Maniknagar',
       'Banani DOHS', 'Shiddheswari', 'Bangshal', 'Paribagh',
       'Joar Sahara', 'Mugdapara', 'North Shahjahanpur', 'Kotwali',
       'Shahjahanpur', 'Uttar Khan', 'Taltola', 'Sadarghat',
       'Banglamotors', 'Zafrabad', 'Keraniganj'], dtype=object)

In [50]:
df['Location'].nunique()

69
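Alongside `unique()` and `nunique()`, `value_counts()` shows how many rows fall under each location, which helps spot areas with very few listings. The sketch below uses a made-up sample rather than the real column.

```python
import pandas as pd

# Hypothetical sample of area names.
loc = pd.Series(["Gulshan", "Mirpur", "Gulshan", "Uttara", "Gulshan", "Mirpur"])

# Rows per location, sorted from most to least frequent.
counts = loc.value_counts()
print(counts)

# Number of distinct locations, as in the cell above.
print(loc.nunique())
```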