In [46]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [47]:
#reading df
df = pd.read_csv('no_outlier_df.csv')
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| area | 1100.0 | 1300.0 | 1500.0 | 1350.0 | 1825.0 |
| building_type | Apartment | Apartment | Apartment | Apartment | Apartment |
| building_nature | Residential | Residential | Residential | Residential | Residential |
| image_url | https://images-cdn.bproperty.com/thumbnails/80... | https://images-cdn.bproperty.com/thumbnails/13... | https://images-cdn.bproperty.com/thumbnails/15... | https://images-cdn.bproperty.com/thumbnails/15... | https://images-cdn.bproperty.com/thumbnails/15... |
| num_bath_rooms | 3.0 | 3.0 | 3.0 | 3.0 | 4.0 |
| num_bed_rooms | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 |
| price | 22000.0 | 40000.0 | 35000.0 | 20000.0 | 60000.0 |
| property_description | 1150 Square Feet Apartment For Rent In Mohamma... | Grab This Lovely Flat For Rent In Bashundhara ... | 1500 Square Feet Apartment With Necessary Resi... | Wow! This 1350 Sq Ft Apartment For Rent In Bas... | This 1825 Sq. Ft Apartment Will Ensure Your Go... |
| property_overview | The apartment of 1100 Square Feet is located ... | This lovely apartment is located in a great lo... | Make this beautiful 1,500 Sft apartment in Utt... | In a very busy city like Dhaka, everyone tranc... | In a very busy city like Dhaka, everyone tranc... |
| property_url | https://www.bproperty.com/en/property/details-... | https://www.bproperty.com/en/property/details-... | https://www.bproperty.com/en/property/details-... | https://www.bproperty.com/en/property/details-... | https://www.bproperty.com/en/property/details-... |


## Feature Engineering

In [48]:
#listing the amenity-count columns and other columns not needed for modelling, so they can be dropped
amenities = ['relaxation_amenity_count','security_amenity_count','maintenance_or_cleaning_amenity_count',
             'social_amenity_count','expendable_amenity_count','service_staff_amenity_count','unclassify_amenity_count']
not_required_cols = ['property_description', 'property_overview', 'property_url', 'image_url','id']
working_df = df.drop(columns=amenities+not_required_cols)
working_df.isna().sum()

area                  0
building_type         0
building_nature       0
num_bath_rooms        0
num_bed_rooms         0
price                 0
purpose               0
city                  0
locality              0
address            4680
division              1
zone                 80
dtype: int64

Next, I go through each feature and modify it according to the insights from the task #3 EDA.

### area

In [49]:
df['area'].describe()
#nothing to do here

count    29813.000000
mean      1657.573696
std       1215.077508
min         93.000000
25%       1050.000000
50%       1350.000000
75%       2000.000000
max      17000.000000
Name: area, dtype: float64

### building_type

1. Nearly 80% of our properties are `Apartment`, for a total of nearly 27000 samples. We also have some `Office`, `Building`, `Shop`, `Floor`, and `Residential Plot` properties, whose counts are under 10% of the total dataset, that is, under 2500.
2. There are other types of properties, in negligible numbers.

❗ **Recommendation**:
* We expect our future models to perform well on `Apartment` and to give acceptable results for `Office`, `Building`, `Shop`, `Floor`, and `Residential Plot`. They are expected to perform poorly on other types of properties.
* Types not listed in (1) should be dropped to avoid adding noise to our future models.


In [50]:
req_building_type = ['Apartment','Office', 'Building', 'Shop', 'Floor', 'Residential Plot']
df = df[df['building_type'].isin(req_building_type)]
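
The hardcoded list above follows the EDA counts. As a minimal sketch of an alternative, the same kind of list could be derived from `value_counts()` with a count threshold. The helper name `frequent_categories`, the `min_count` value, and the toy data are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def frequent_categories(series: pd.Series, min_count: int) -> list:
    """Return the category labels that occur at least `min_count` times."""
    counts = series.value_counts()  # sorted from most to least frequent
    return counts[counts >= min_count].index.tolist()

# Toy data standing in for df['building_type']; the counts are made up.
toy = pd.DataFrame({
    "building_type": ["Apartment"] * 6 + ["Office"] * 3 + ["Duplex"],
})

# Keep only the types that appear at least twice in the toy data.
keep = frequent_categories(toy["building_type"], min_count=2)
filtered = toy[toy["building_type"].isin(keep)]
```

On the real dataset the threshold would have to be chosen from the EDA, and the resulting list should be checked against the six types named in (1).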

### building_nature

In [51]:
df['building_nature'].describe()
#nothing to do here

count           29637
unique              2
top       Residential
freq            23985
Name: building_nature, dtype: object

### num_bath_rooms & num_bed_rooms

In [52]:
df[['num_bath_rooms','num_bed_rooms']].describe()
#nothing to do here

| | num_bath_rooms | num_bed_rooms |
|---|---|---|
| count | 29637.0 | 29637.0 |
| mean | 1.660694 | 2.3168 |
| std | 1.55106 | 1.268307 |
| min | 0.0 | 0.0 |
| 25% | 0.0 | 2.0 |
| 50% | 2.0 | 3.0 |
| 75% | 3.0 | 3.0 |
| max | 10.0 | 10.0 |


### price

In [53]:
df['price'].describe()
#non-linear relation was observed with every feature
#nothing to do here

count    2.963700e+04
mean     3.572601e+06
std      5.929573e+06
min      4.200000e+03
25%      2.500000e+04
50%      1.300000e+05
75%      6.000000e+06
max      1.200000e+08
Name: price, dtype: float64
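
The summary above spans several orders of magnitude (minimum 4,200, median 130,000, maximum 120,000,000). One possible reason, which is an assumption not checked in this notebook, is that rent and sale listings share the same `price` column. A minimal sketch on toy data of how the price could be summarized per `purpose`; the `Sale` label and all values below are invented for illustration.

```python
import pandas as pd

# Toy data standing in for df[['purpose', 'price']]; values are made up.
toy = pd.DataFrame({
    "purpose": ["Rent", "Rent", "Rent", "Sale", "Sale"],
    "price": [20000.0, 35000.0, 60000.0, 6000000.0, 12000000.0],
})

# One row of summary statistics per purpose instead of one mixed summary.
per_purpose = toy.groupby("purpose")["price"].describe()

# The median is less sensitive to extreme listings than the mean.
median_by_purpose = toy.groupby("purpose")["price"].median()
```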

### purpose

In [54]:
#nothing to do here
df['purpose'].describe()

count     29637
unique        2
top        Rent
freq      17727
Name: purpose, dtype: object

### city

1. Most of our properties are in `Dhaka`, for a total of nearly 28,000 properties. We also have nearly 4000 properties in `Chattogram`.
2. A negligible number of properties are in `Narayanganj City`, `Barishal`, and `Gazipur`, each with a count below 500 properties.
3. As for the other cities, their property counts are too insignificant.

❗ **Recommendation**:
* We do not expect our future models to perform well on the cities mentioned in (2). We should consider dropping samples from those cities when building models, since their low counts mean the models will not predict well on them.
* Cities not part of (1) and (2) should definitely be dropped to avoid adding noise to our future models.

In [55]:
req_city = ['Dhaka','Chattogram','Narayanganj City', 'Barishal','Gazipur']
df = df[df['city'].isin(req_city)]
df.shape

(29346, 24)

### locality

In [56]:
#nothing to do here
df['locality'].describe()

count      29346
unique       160
top       Mirpur
freq        4966
Name: locality, dtype: object

### address

In [57]:
df.shape

(29346, 24)

In [58]:
df['address'].value_counts(ascending=False).sum()
#note: this sum is the number of non-null addresses, not the number of distinct ones
#address values are mostly different from one another, so the column won't be much help to an ML algorithm
#so better dropping it

24727
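
As a minimal sketch, the cardinality of a free-text column can be checked directly with `nunique()`, alongside the non-null and missing counts. The toy series below stands in for `df['address']`; its values are made up, and `distinct_ratio` is a name introduced here for illustration.

```python
import pandas as pd

# Toy column standing in for df['address']; values are made up.
address = pd.Series(["Road 1", "Road 2", "Road 2", None, "Road 3"])

non_null = address.value_counts().sum()  # rows that have an address (same as address.count())
distinct = address.nunique()             # number of different addresses
missing = address.isna().sum()           # rows without an address

# Share of non-null rows that carry a distinct address; a value close to 1
# means almost every row has its own address.
distinct_ratio = distinct / non_null
```

A ratio close to 1 on the real column would support dropping it, since a near-unique text feature carries little signal a model can generalize from.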

In [59]:
df.drop(columns='address', inplace=True)

### division

In [60]:
df['division'].value_counts()

Dhaka         25780
Chattogram     3344
Barisal         222
Name: division, dtype: int64

In [61]:
df['division'].isnull().sum()

0

### zone

In [62]:
df['zone'].isnull().sum()

80

In [63]:
df[df['zone'].isna()]['division'].value_counts()
#all missing zones belong to the Dhaka division

Dhaka    80
Name: division, dtype: int64

In [64]:
#it won't be wrong to fill them with the mode
#mode() must be called and returns a Series, so take its first value
df['zone'] = df['zone'].fillna(df['zone'].mode()[0])
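
Since every missing `zone` belongs to the Dhaka division, one alternative is to fill each gap with the most frequent zone of its own division rather than a dataset-wide value. The sketch below runs on toy data; the helper name `fill_with_group_mode` and the zone labels are hypothetical.

```python
import pandas as pd

# Toy data standing in for df[['division', 'zone']]; zone labels are made up.
toy = pd.DataFrame({
    "division": ["Dhaka", "Dhaka", "Dhaka", "Dhaka", "Chattogram", "Chattogram"],
    "zone": ["Zone A", "Zone A", "Zone B", None, "Zone C", "Zone C"],
})

def fill_with_group_mode(frame, group_col, target_col):
    """Fill missing target values with the most frequent value of the same group."""
    # Series.mode() returns a Series because ties are possible, so take the
    # first entry; groups with no known value are left as missing.
    modes = frame.groupby(group_col)[target_col].agg(
        lambda s: s.mode().iloc[0] if s.notna().any() else None
    )
    out = frame.copy()
    # Map each row's group to its mode and use it only where the target is missing.
    out[target_col] = out[target_col].fillna(out[group_col].map(modes))
    return out

filled = fill_with_group_mode(toy, "division", "zone")
```

Here the helper returns a new frame and leaves the input untouched, which makes it easy to compare the column before and after filling.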

### Handling Categorical Columns

### Feature Scaling

### Exporting ready data