# Project Description
**PROBLEM STATEMENT -**
To predict house prices in Bangalore, using attributes of each house and attributes of its locality.

**Data Source -**
1. Kaggle - https://www.kaggle.com/amitabhajoy/bengaluru-house-price-data
2. Foursquare, used to get the locality data.

**Goal of the project -**
To model housing prices in Bangalore and identify the key drivers of those prices.

**To begin, let's import the libraries we will need for this project.**  
(If we miss a library here, we can always import it later on in the project.)

In [18]:
import sys

# library to handle vectorized data 
import numpy as np 
# library for data analysis and manipulation
import pandas as pd 
pd.set_option('display.max_columns', 1000) #don't think that there will be more than a 1000 of those describing an apartment!!
pd.set_option('display.max_rows', 100000)

# for stats visualisation
import seaborn as sns

%matplotlib inline 
import matplotlib as mpl
import matplotlib.pyplot as plt
# check for latest version of Matplotlib
print ('Matplotlib version: ', mpl.__version__) # >= 2.0.0

# folium for data visualisation of maps
import folium

# geopy for getting the geographical latitudes and longitudes of a location
import geopy
# Now let's import 'Nominatim' from geopy to convert an address into latitude and longitude values.
from geopy.geocoders import Nominatim

# importing json to handle json files as we are expecting the file type from Foursquare to be json
import json

import csv

Matplotlib version:  2.2.3


# Loading the data
Source = https://www.kaggle.com/amitabhajoy/bengaluru-house-price-data

In [19]:
data = pd.read_csv('Bengaluru_House_Data.csv')
data.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0


**Let's check the features of our dataset.**

In [20]:
data.columns

Index(['area_type', 'availability', 'location', 'size', 'society',
       'total_sqft', 'bath', 'balcony', 'price'],
      dtype='object')

**So we have 9 features in total in our dataset (including the TARGET feature); they are:**
    1. area_type : type of area mentioned for the unit
    2. availability : construction status of the unit
    3. location : locality of the unit in Bangalore
    4. size : unit type in terms of number of BHK(s)
    5. society : name of the society of the unit
    6. total_sqft : total area of the unit in square feet
    7. bath : number of washrooms in the unit
    8. balcony : number of balconies in the unit
    9. price : price of the unit

# Data cleaning  
**Okay let's take a look at the data type of the features.**

In [21]:
data.dtypes

area_type        object
availability     object
location         object
size             object
society          object
total_sqft       object
bath            float64
balcony         float64
price           float64
dtype: object

**So we have some categorical features and some numerical features in our data.**  
**The categorical features are as follows :**  
    1. area_type
    2. availability
    3. location
    4. size
    5. society
**The numerical features are as follows :**
    1. bath
    2. balcony
    3. price
**The only feature with an unexpected data type is <b>'total_sqft'</b>: it is stored as object but should be float64; so let's explore it.**

In [22]:
# to see what kind of values are present in the feature 'total_sqft'
(data['total_sqft'].unique())

array(['1056', '2600', '1440', ..., '1133 - 1384', '774', '4689'],
      dtype=object)

**We can see that our feature of interest currently holds several different kinds of values.**  
**So let's do further data cleaning.**
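One way to locate the offending entries, sketched here on a small hypothetical sample rather than the full dataset, is to let `pd.to_numeric` turn anything it cannot parse into NaN and then select those rows. The `sample` Series is an illustrative stand-in for `data['total_sqft']`.

```python
import pandas as pd

# hypothetical sample standing in for data['total_sqft']
sample = pd.Series(['1056', '1133 - 1384', '34.46Sq. Meter', '774'])

# entries that cannot be parsed as numbers become NaN
as_number = pd.to_numeric(sample, errors='coerce')

# keep only the original strings that failed to parse
non_numeric = sample[as_number.isna()]
print(non_numeric.tolist())
```

This gives the same selection as a row-wise `try`/`except` check, without writing a helper function.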

In [23]:
# returns True if the value in our feature 'total_sqft' can be converted into float data type, and False otherwise.
def float_or_not(x) :
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True

# applying the float_or_not function on our feature 'total_sqft', and retrieving the index of the rows having non-float values 
non_float_values_index = data[~data['total_sqft'].apply(float_or_not)].index.values.tolist()
# printing the data in rows of 'total_sqft' having non-float values 
# to visually understand the variations present in the data representation in the feature.
for i in non_float_values_index:
    print(data['total_sqft'][i])

2100 - 2850
3010 - 3410
2957 - 3450
3067 - 8156
1042 - 1105
1145 - 1340
1015 - 1540
1520 - 1740
34.46Sq. Meter
1195 - 1440
1200 - 2400
4125Perch
1120 - 1145
4400 - 6640
3090 - 5002
4400 - 6800
1160 - 1195
1000Sq. Meter
4000 - 5249
1115 - 1130
1100Sq. Yards
520 - 645
1000 - 1285
3606 - 5091
650 - 665
633 - 666
5.31Acres
30Acres
1445 - 1455
884 - 1116
850 - 1093
1440 - 1884
716Sq. Meter
547.34 - 827.31
580 - 650
3425 - 3435
1804 - 2273
3630 - 3800
660 - 670
4000 - 5249
1500Sq. Meter
620 - 933
142.61Sq. Meter
2695 - 2940
2000 - 5634
1574Sq. Yards
3450 - 3472
1250 - 1305
670 - 980
1005.03 - 1252.49
3630 - 3800
1004 - 1204
361.33Sq. Yards
645 - 936
2710 - 3360
2249.81 - 4112.19
3436 - 3643
2830 - 2882
596 - 804
1255 - 1863
1300 - 1405
1200 - 2400
1500 - 2400
117Sq. Yards
934 - 1437
980 - 1030
1564 - 1850
1446 - 1506
2249.81 - 4112.19
1070 - 1315
3040Sq. Meter
500Sq. Yards
2806 - 3019
613 - 648
1430 - 1630
704 - 730
1482 - 1846
2805 - 3565
3293 - 5314
1210 - 1477
3369 - 3464
1125 - 1500
167S

**So we can see the following variations in the data representation of the feature 'total_sqft':**  

    1. in a range delimited by a '-'
    2. in 'Sq. Meter'
    3. in 'Perch'
    4. in 'Sq. Yards'
    5. in 'Acres'
    6. in 'Grounds'
    7. in 'Guntha'
    8. in 'Cents'
    
**We need to transform these string values to float values, and we also need to express all of them in Sq. feet.**  

* 1 Sq. Meter = 10.7639 Sq. feet
* 1 Perch = 272.25 Sq. feet
* 1 Sq. Yard = 9 Sq. feet
* 1 Acre = 43560 Sq. feet
* 1 Ground = 2400 Sq. feet
* 1 Guntha = 1089 Sq. feet
* 1 Cent = 435.6 Sq. feet
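The conversion rules above can also be sketched as a single regex-based parser. This is only an illustration under stated assumptions: the helper name `to_sqft` and the table `SQFT_PER_UNIT` are hypothetical, units are matched case-insensitively, and anything that is not understood is returned as NaN rather than guessed.

```python
import math
import re

# Conversion factors to square feet, copied from the list above.
SQFT_PER_UNIT = {
    "sq. meter": 10.7639,
    "perch": 272.25,
    "sq. yards": 9.0,
    "acres": 43560.0,
    "grounds": 2400.0,
    "guntha": 1089.0,
    "cents": 435.6,
}

NUMBER = r"(\d+(?:\.\d+)?)"

def to_sqft(raw):
    """Return the area in square feet, or NaN when the text is not understood."""
    text = str(raw).strip()
    # Case 1: a range such as '1133 - 1384' becomes the mean of its two ends.
    range_match = re.fullmatch(NUMBER + r"\s*-\s*" + NUMBER, text)
    if range_match:
        low, high = map(float, range_match.groups())
        return (low + high) / 2
    # Case 2: a number with an optional trailing unit, e.g. '34.46Sq. Meter'.
    unit_match = re.fullmatch(NUMBER + r"\s*(.*)", text)
    if unit_match:
        number = float(unit_match.group(1))
        unit = unit_match.group(2).lower()
        if unit == "":
            return number
        if unit in SQFT_PER_UNIT:
            return number * SQFT_PER_UNIT[unit]
    return math.nan

print(to_sqft("1133 - 1384"))  # mean of the two ends of the range
print(to_sqft("1000Sq. Meter"))
```

Returning NaN for unrecognised text keeps such rows visible to the later missing-value step instead of silently assigning them a wrong area.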

**Let's construct a function to transform the different representations of the values in the feature 'total_sqft'**   
**into float, and into the same unit, i.e. <u>'sqft'</u>.** 

In [24]:
# takes as input a raw 'total_sqft' string and returns its value in square feet as a float.
def float_and_sqft_transformation(value):
    # conversion factors to square feet, keyed by the unit spelling used in the data
    area_units = {'Sq. Meter': 10.7639,
            'Perch': 272.25,
            'Sq. Yards': 9,
            'Acres': 43560,
            'Grounds': 2400,
            'Guntha': 1089,
            'Cents': 435.6}
    # a range such as '1133 - 1384' is replaced by the mean of its two ends
    if '-' in value:
        values = value.split("-")
        if len(values) == 2:
            return (float(values[0]) + float(values[1])) / 2
    else:
        for unit, factor in area_units.items():
            if unit in value:
                return float(value.replace(unit, "")) * factor
    # unrecognised representation: mark it as missing instead of reusing an earlier result
    return np.nan

In [25]:
# applying the float_and_sqft_transformation to our feature 'total_sqft'
# (.loc writes into the DataFrame itself, which avoids chained assignment)
for index in non_float_values_index:
    data.loc[index, 'total_sqft'] = float_and_sqft_transformation(data['total_sqft'][index])

In [26]:
# checking the data type of the values in the feature 'total_sqft'
(data.total_sqft).dtype

dtype('O')

In [27]:
# changing the data type of the values in feature 'total_sqft' to float
data['total_sqft'] = pd.to_numeric(data['total_sqft'], errors='coerce', downcast='float')

In [28]:
data.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056.0,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600.0,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440.0,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521.0,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200.0,2.0,1.0,51.0


**Now let's re-check the data types of our features.**  

In [29]:
data.dtypes

area_type        object
availability     object
location         object
size             object
society          object
total_sqft      float32
bath            float64
balcony         float64
price           float64
dtype: object

**Dropping the feature <u>'society'</u>, because it won't add value to our analysis.**  
**The information for this feature is incomplete, and to use it we would need additional data about the various societies, from which we could define and score them.**  

In [30]:
data.drop("society", axis = 1, inplace = True)
data.head()

Unnamed: 0,area_type,availability,location,size,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,1056.0,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,2600.0,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,1440.0,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,1521.0,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,1200.0,2.0,1.0,51.0


**Okay, now let's explore and clean the feature 'size'.**

In [31]:
data['size'].unique()

array(['2 BHK', '4 Bedroom', '3 BHK', '4 BHK', '6 Bedroom', '3 Bedroom',
       '1 BHK', '1 RK', '1 Bedroom', '8 Bedroom', '2 Bedroom',
       '7 Bedroom', '5 BHK', '7 BHK', '6 BHK', '5 Bedroom', '11 BHK',
       '9 BHK', nan, '9 Bedroom', '27 BHK', '10 Bedroom', '11 Bedroom',
       '10 BHK', '19 BHK', '16 BHK', '43 Bedroom', '14 BHK', '8 BHK',
       '12 Bedroom', '13 BHK', '18 Bedroom'], dtype=object)

**As we can clearly see, it has a mixed representation of values, i.e. a number combined with a string label.**
**Let's convert them to numeric values only.**

In [32]:
# first converting the 'nan' string values to 'numpy.nan' values; to efficiently filter the feature.
data['size'] = data['size'].replace(to_replace = 'nan', value = np.nan)

In [None]:
'''
Missing values (NaN) are left untouched,
in order to address them later in the data cleaning process.

Values that are already numeric are also left unaltered,
as they are in line with our desired outcome for this feature.
Only string values such as '2 BHK' are converted to their leading number.
'''
for i in range(0, len(data)):
    size_value = data.at[i, 'size']
    if isinstance(size_value, str):
        data.at[i, 'size'] = int(size_value.split(' ')[0])
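As an alternative to a row-by-row loop, the leading number can be pulled out in one vectorised step. This is a minimal sketch on a hypothetical `sizes` sample. Note that comparing against `np.nan` with `==` or `!=` cannot detect missing values, so the sketch relies on pandas' own NaN handling instead.

```python
import numpy as np
import pandas as pd

# hypothetical sample standing in for data['size']
sizes = pd.Series(['2 BHK', '4 Bedroom', '1 RK', np.nan])

# NaN never compares equal to anything, itself included
nan_differs_from_itself = (np.nan != np.nan)

# take the leading digits of each string; missing entries stay NaN
rooms = sizes.str.extract(r'^(\d+)', expand=False).astype(float)
print(rooms.tolist())
```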

In [None]:
# changing the data type of the values in feature 'size' to a numeric type
data['size'] = pd.to_numeric(data['size'], errors='coerce', downcast='integer')
data['size'].dtype

**Now let's explore the feature 'bath' for any anomaly in the data representation.**

In [None]:
data['bath'].unique()

**We can observe from above that at least one value in this feature is not a whole number;**  
**the problem is that the number of bathrooms in an apartment/house cannot logically be fractional;**  
**i.e. it can only be an integer.**  

**Because only one value in this feature is represented as a decimal number,**  
**we can treat it as a data-entry error.**

**So let's round the values in this feature to the nearest whole number, and convert the data type to integer.**  
**Doing so transforms the data representation to the correct form without any information loss, as we are not discarding the inconsistent entry.**

In [None]:
# rounding to the nearest whole number; NaN values stay NaN and are imputed later,
# so the conversion to an integer data type has to wait until no values are missing.
data['bath'] = data['bath'].round(0)
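A plain `astype(int)` raises an error while the column still contains NaN. One option, assuming pandas 0.24 or later, is the nullable `Int64` dtype, which can hold whole numbers and missing values side by side. The sketch below uses a hypothetical `baths` sample, not the real column.

```python
import numpy as np
import pandas as pd

# hypothetical sample standing in for data['bath']
baths = pd.Series([2.0, 3.0, np.nan])

# the plain integer cast fails because NaN has no integer representation
try:
    baths.round(0).astype(int)
    plain_cast_failed = False
except ValueError:
    plain_cast_failed = True

# the nullable integer dtype keeps the missing entry as <NA>
baths_int = baths.round(0).astype('Int64')
print(plain_cast_failed, baths_int.dtype)
```

Whether a later imputation step accepts this dtype depends on the library version in use, so keeping the column as float until imputation is done is the more conservative choice.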

In [None]:
# let's re-check the data representation in this feature to ensure our operation on it was successful. 
data['bath'].unique()

**Okay now let's explore the feature 'balcony' for any anomaly in the data representation.**

In [None]:
data['balcony'].unique()

**We can observe that the same problem exists in this feature as was found in the feature 'bath'.**   
**Hence we will handle it in exactly the same manner as above.**  

In [None]:
# rounding to the nearest whole number; NaN values stay NaN until they are imputed
data['balcony'] = data['balcony'].round(0)

In [None]:
# let's re-check the data representation in this feature to ensure our operation on it was successful. 
data['balcony'].unique()

In [None]:
data.head(10)

**Now let's explore the feature 'area_type'.**

In [None]:
data['area_type'].unique()

**So the area mentioned in each instance of our housing data is represented in one of the following types -**  

* Carpet Area
* Plot Area
* Built-up Area
* Super built-up Area.

**This means that the representation of values in the feature 'total_sqft' is not uniform.**  
**To address this, we can convert all the area types into one (or a few) area types whose values are internally uniform in their representation.**  

**Let's see the relation between these four types of representation of the area of an apartment/house -**   
**( According to <u>Housing.com</u>, ( source = https://housing.com/news/real-estate-basics-part-1-carpet-area-built-up-area-super-built-up-area/ ))**

* **<u>Carpet Area</u>** - **It is the area that can actually be covered by a carpet, or the area of the apartment excluding the thickness of inner walls.** Carpet area does not include the space covered by common areas such as lobby, lift, stairs, play area, etc. Carpet area is the actual area you get for use in a housing unit. **Carpet area is usually around 70% of the built-up area**. 


* **<u>Built-up Area</u>** - **It is the area that comes after adding carpet area and wall area.** The wall area means the thickness of the inner walls of a unit. **The area constituting the walls is around 20% of the built-up area. The built-up area also consists of other areas mandated by the authorities, such as a dry balcony, flower beds, etc., that add up to 10% of the built-up area.** So the usable area is only 70% of the built-up area.


* **<u>Super built-up Area</u>** - **It is the area calculated by adding the built-up area and common area that includes the corridor, lift lobby, lift, etc.** In some cases, builders even include amenities such as a pool, garden and clubhouse in the common area. **A developer/builder charges you on the basis of the super built-up area which is why it is also known as ‘saleable’ area.**


* **<u>Plot Area</u>** - **It is the total area of the plot on which the building has been constructed.**


**So we can summarize the above relationships as follows -**
* Carpet Area = 70% of Built-up Area
* Built-up Area = Carpet Area + 30% of Carpet Area (a working approximation; inverting the 70% figure exactly would give Carpet Area / 0.70)
* Super Built-up Area = Built-up Area × 1.25, where 1.25 is taken here as the 'loading factor' for the common spaces in the plot that are shared by all  
                   = (Carpet Area + 30% of Carpet Area) × 1.25
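A worked example of this arithmetic, as a sketch only. It assumes the 30% figure from the summary and reads the loading factor as a multiplier of 1.25 applied to the built-up area. The function name `to_super_built_up` and both constants are hypothetical.

```python
# Assumed factors, taken from the summary above.
BUILT_UP_PER_CARPET = 1.30   # built-up area = carpet area + 30%
LOADING_FACTOR = 1.25        # super built-up area = built-up area x 1.25

def to_super_built_up(area_sqft, area_type):
    """Return the super built-up equivalent, or None when there is none."""
    # collapse repeated spaces and ignore case before comparing labels
    kind = " ".join(area_type.split()).lower()
    if kind == "carpet area":
        return area_sqft * BUILT_UP_PER_CARPET * LOADING_FACTOR
    if kind == "built-up area":
        return area_sqft * LOADING_FACTOR
    if kind == "super built-up area":
        return area_sqft
    return None  # plot area (or an unrecognised label) cannot be converted

print(to_super_built_up(1000, "Carpet Area"))    # about 1625 sqft
print(to_super_built_up(1000, "Built-up Area"))  # 1250.0
```

So a 1000 sqft carpet area corresponds to roughly 1625 sqft of super built-up area under these assumptions.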
    
**So we will convert the Carpet Area and Built-up Area into Super Built-up Area, because the price of a unit is the price for its Super Built-up Area.**   
                    
**Plot Area cannot be converted into any of the above three types of area representation, because it does not represent the same thing; so we will resolve this by discarding the instances whose area_type is Plot Area.**


In [None]:
for index_no in range(0, len(data)):
    # collapse repeated spaces so that e.g. 'Plot  Area' and 'Plot Area' compare equal
    area_type = ' '.join(str(data.at[index_no, 'area_type']).split())
    sqft = data.at[index_no, 'total_sqft']
    if area_type == 'Carpet Area':
        # carpet -> built-up (+30%), then built-up -> super built-up (loading factor 1.25)
        data.at[index_no, 'total_sqft'] = (sqft * 1.3) * 1.25
    elif area_type == 'Built-up Area':
        data.at[index_no, 'total_sqft'] = sqft * 1.25
    elif area_type == 'Plot Area':
        data.at[index_no, 'total_sqft'] = np.nan
    # 'Super built-up Area' values are already in the target representation
        

In [None]:
data.head(20)

In [None]:
keys = ['Neighborhood', 'Latitude', 'Longitude']  # keys of our dictionary to contain the lat-long data of the localities.
coordinates_dict = dict.fromkeys(keys)

neigh_list = []  # empty list
lat_list = []    # empty list
long_list = []   # empty list

geolocator = Nominatim(user_agent="Bangalore_area_lat_long")

for locality in data['location'].unique() :
    area = geolocator.geocode("{}".format(locality), timeout = 2)
    if area is not None :
        lat = area.latitude
        long = area.longitude
        neigh_list.append(locality) 
        lat_list.append(lat)
        long_list.append(long)
    else:
        neigh_list.append(locality)
        lat_list.append(np.nan)
        long_list.append(np.nan)
coordinates_dict.update(Neighborhood = neigh_list, Latitude = lat_list, Longitude = long_list)
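Locality names on their own can be ambiguous, and a geocoder may return nothing for some of them. The sketch below isolates that logic so it can be run offline. `fake_geocode` is a hypothetical stand-in with dummy coordinates; in the notebook the callable would be `geolocator.geocode`. Appending ", Bangalore, India" to each query is an assumption of this sketch, not something the loop above does.

```python
import math
from collections import namedtuple

Location = namedtuple("Location", ["latitude", "longitude"])

def fake_geocode(query, timeout=None):
    """Hypothetical offline geocoder: knows one place, returns None otherwise."""
    known = {"Kothanur, Bangalore, India": Location(12.0, 77.0)}  # dummy coordinates
    return known.get(query)

def build_coordinates(localities, geocode, suffix=", Bangalore, India"):
    """Return (locality, latitude, longitude) rows, with NaN for unresolved names."""
    rows = []
    for locality in localities:
        place = geocode(f"{locality}{suffix}", timeout=2)
        if place is None:
            rows.append((locality, math.nan, math.nan))
        else:
            rows.append((locality, place.latitude, place.longitude))
    return rows

rows = build_coordinates(["Kothanur", "Unknown Layout"], fake_geocode)
print(rows)
```

Passing the geocoder in as an argument also makes it easy to wrap the real one with a delay between requests, which public geocoding services generally expect.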

In [None]:
coordinates = pd.DataFrame(coordinates_dict)
coordinates.head()

In [None]:
coordinates['Latitude'].isnull().sum()

**Some rows have <u>NaN</u> in the <u>Latitude</u> and <u>Longitude</u> columns, because the Nominatim geocoder (used through geopy) could not resolve those locality names; so let's drop those rows from our final data.**

In [None]:
print(coordinates.shape)  # shape of our data BEFORE dropping the above mentioned rows.
coordinates.dropna(axis = 0, how ='any', inplace = True)
print(coordinates.shape)  # shape of our data AFTER dropping the above mentioned rows.

**Merging the two dataframes, <i>data</i> & <i>coordinates</i>, to make our <i>final_data</i> dataframe.**

In [None]:
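# the locality is called 'location' in data and 'Neighborhood' in coordinates
final_data = pd.merge(data, coordinates, left_on='location', right_on='Neighborhood')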
final_data.head()
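When the key column has a different name in each dataframe, `pd.merge` takes `left_on` and `right_on`. The sketch below uses two small hypothetical frames (the third price and the coordinates are made up) and adds `indicator=True` with a left join to show which houses found no coordinates.

```python
import pandas as pd

# hypothetical miniature versions of the two dataframes
houses = pd.DataFrame({'location': ['Kothanur', 'Uttarahalli', 'Kothanur'],
                       'price': [51.0, 62.0, 48.0]})
coords = pd.DataFrame({'Neighborhood': ['Kothanur'],
                       'Latitude': [12.0], 'Longitude': [77.0]})  # dummy coordinates

# a left join keeps every house; '_merge' records whether a match was found
merged = pd.merge(houses, coords, how='left',
                  left_on='location', right_on='Neighborhood', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched['location'].tolist())
```

With the default inner join, the unmatched rows would be dropped without any message, so checking `_merge` first shows how much data the merge would cost.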

# Handling missing values in our features.

**Let's count the number of missing values in each feature of our data.**

In [None]:
data.isnull().sum()

**Let's visualize the missing values in our data.**

In [None]:
!{sys.executable} -m pip install missingno
import missingno as msno

In [None]:
msno.matrix(data)

**Clearly we can see that missing values are present in the following features -**  
    location : 1 missing value  
    size : 16 missing values  
    bath : 73 missing values  
    balcony : 609 missing values  

**We will be using two strategies based on the data type in a particular feature,**  
**for the imputation of these missing values.**

In [None]:
# imputing the missing value in the feature 'location' with the mode, as it is a categorical feature
data["location"] = data["location"].fillna(data["location"].mode()[0])
print('The number of missing values in the feature after imputation using the "mode" are: {}'.format((data['location']).isnull().sum()))

In [None]:
# importing the required library for this operation
from sklearn.impute import SimpleImputer
# imputing the missing values in the rest of the features with the 'mean', as they are numerical features.
num_imputer =  SimpleImputer(missing_values=np.nan, strategy="mean")

num_imputer_size = num_imputer.fit(data[['size']])
data['size'] = num_imputer_size.transform(data[['size']])

num_imputer_bath = num_imputer.fit(data[['bath']])
data['bath'] = num_imputer_bath.transform(data[['bath']])

num_imputer_balcony = num_imputer.fit(data[['balcony']])
data['balcony'] = num_imputer_balcony.transform(data[['balcony']])

data.isnull().sum()
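Mean imputation on count-like features such as `bath` and `balcony` can produce fractional values. One alternative, offered only as a sketch on a hypothetical `sample` frame, is to fill with the column median, which stays a whole number whenever the middle of the observed values is one.

```python
import numpy as np
import pandas as pd

# hypothetical sample with one missing value per column
sample = pd.DataFrame({'bath': [2.0, 2.0, 3.0, np.nan],
                       'balcony': [1.0, np.nan, 3.0, 1.0]})

# each column is filled with its own median
filled = sample.fillna(sample.median())
print(filled)
```

`SimpleImputer(strategy="median")` would be the scikit-learn equivalent of this choice.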

In [None]:
msno.matrix(data)

# Outlier detection and imputation

**Using business logic to recognise anomalies in the data.**

**Let's check the distribution of the area in sqft available per BHK across the apartments.**

In [None]:
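# area (in sqft) available per BHK for every unit,
# assuming 'size' (the number of BHK) is the intended divisor
area_under_1bhk = []
for i in range(0, len(data)):
    area = data['total_sqft'][i] / data['size'][i]
    area_under_1bhk.append(area)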

In [None]:
data['area_type'].unique()

In [None]:
'''
Function: TUKEY_outlier_detector(qu_dataset, qu_field)
Print out the following information about the data
   - interquartile range
   - upper_inner_fence
   - lower_inner_fence
   - upper_outer_fence
   - lower_outer_fence
   - percentage of records out of inner fences
   - percentage of records out of outer fences
 Input: 
   - pandas dataframe (qu_dataset)
   - name of the column to analyze (qu_field)
 Output:
   None
'''

def TUKEY_outlier_detector(qu_dataset, qu_field):
    a = qu_dataset[qu_field].describe()
    
    q3 = a["75%"]
    q1 = a["25%"]
    
    iqr = q3 - q1
    print("interquartile range:", iqr)
    
    upper_inner_fence = q3 + 1.5 * iqr
    lower_inner_fence = q1 - 1.5 * iqr
    print("upper_inner_fence:", upper_inner_fence)
    print("lower_inner_fence:", lower_inner_fence)
    
    upper_outer_fence = q3 + 3 * iqr
    lower_outer_fence = q1 - 3 * iqr
    print("upper_outer_fence:", upper_outer_fence)
    print("lower_outer_fence:", lower_outer_fence)
    
    count_over_upper = len(qu_dataset[qu_dataset[qu_field]>upper_inner_fence])
    count_under_lower = len(qu_dataset[qu_dataset[qu_field]<lower_inner_fence])
    percentage = 100 * (count_under_lower + count_over_upper) / a["count"]
    print("percentage of records out of inner fences for "  +qu_field+ " is: %.2f"% (percentage))
    
    count_over_upper = len(qu_dataset[qu_dataset[qu_field]>upper_outer_fence])
    count_under_lower = len(qu_dataset[qu_dataset[qu_field]<lower_outer_fence])
    percentage = 100 * (count_under_lower + count_over_upper) / a["count"]
    print("percentage of records out of outer fences for "  +qu_field+ " is: %.2f"% (percentage))
    
'''  
Function: TUKEY_outlier_remover(qu_dataset, qu_field, qu_fence)
   1- Remove outliers according to the given fence value and return new dataframe.
   2- Print out the following information about the data
      - interquartile range
      - upper_inner_fence
      - lower_inner_fence
      - upper_outer_fence
      - lower_outer_fence
      - percentage of records out of inner fences
      - percentage of records out of outer fences
 Input: 
   - pandas dataframe (qu_dataset)
   - name of the column to analyze (qu_field)
   - inner (1.5*iqr) or outer (3.0*iqr) (qu_fence) values: "inner" or "outer"
 Output:
   - new pandas dataframe (output_dataset)
'''
def TUKEY_outlier_remover(qu_dataset, qu_field, qu_fence):
    a = qu_dataset[qu_field].describe()
    
    q3 = a["75%"]
    q1 = a["25%"]
    
    iqr = q3 - q1
    print("interquartile range:", iqr)
    
    upper_inner_fence = q3 + 1.5 * iqr
    lower_inner_fence = q1 - 1.5 * iqr
    print("upper_inner_fence:", upper_inner_fence)
    print("lower_inner_fence:", lower_inner_fence)
    
    upper_outer_fence = q3 + 3 * iqr
    lower_outer_fence = q1 - 3 * iqr
    print("upper_outer_fence:", upper_outer_fence)
    print("lower_outer_fence:", lower_outer_fence)
    
    count_over_upper = len(qu_dataset[qu_dataset[qu_field]>upper_inner_fence])
    count_under_lower = len(qu_dataset[qu_dataset[qu_field]<lower_inner_fence])
    percentage = 100 * (count_under_lower + count_over_upper) / a["count"]
    print("percentage of records out of inner fences: %.2f"% (percentage))
    
    count_over_upper = len(qu_dataset[qu_dataset[qu_field]>upper_outer_fence])
    count_under_lower = len(qu_dataset[qu_dataset[qu_field]<lower_outer_fence])
    percentage = 100 * (count_under_lower + count_over_upper) / a["count"]
    print("percentage of records out of outer fences: %.2f"% (percentage))
    
    if qu_fence == "inner":
        output_dataset = qu_dataset[qu_dataset[qu_field]<=upper_inner_fence]
        output_dataset = output_dataset[output_dataset[qu_field]>=lower_inner_fence]
    elif qu_fence == "outer":
        output_dataset = qu_dataset[qu_dataset[qu_field]<=upper_outer_fence]
        output_dataset = output_dataset[output_dataset[qu_field]>=lower_outer_fence]
    else:
        output_dataset = qu_dataset
    
    print("length of input dataframe:", len(qu_dataset))
    print("length of new dataframe after outlier removal:", len(output_dataset))
    
    return output_dataset
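The fence calculation shared by both functions can be condensed into a few lines. The helper `tukey_fences` below is a hypothetical sketch, and the `sample` Series is made up for illustration; `k=1.5` gives the inner fences and `k=3` the outer ones.

```python
import pandas as pd

def tukey_fences(values, k=1.5):
    """Return (lower, upper) Tukey fences at k times the interquartile range."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# made-up sample with one obvious outlier
sample = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
lower, upper = tukey_fences(sample)
outliers = sample[(sample < lower) | (sample > upper)]
print(lower, upper, outliers.tolist())
```

Here the quartiles are 3.25 and 7.75, so the interquartile range is 4.5 and the inner fences sit at -3.5 and 14.5, leaving only the value 100 outside.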

In [None]:
a = data['balcony'].describe()

In [None]:
q3 = a["75%"]
q1 = a["25%"]

iqr = q3 - q1
print("interquartile range:", iqr)

upper_inner_fence = q3 + 1.5 * iqr
lower_inner_fence = q1 - 1.5 * iqr
print("upper_inner_fence:", upper_inner_fence)
print("lower_inner_fence:", lower_inner_fence)

upper_outer_fence = q3 + 3 * iqr
lower_outer_fence = q1 - 3 * iqr
print("upper_outer_fence:", upper_outer_fence)
print("lower_outer_fence:", lower_outer_fence)

In [None]:
count_over_upper = len(data[data['balcony']>upper_inner_fence])
count_under_lower = len(data[data['balcony']<lower_inner_fence])
percentage = 100 * (count_under_lower + count_over_upper) / a["count"]
print("percentage of records out of inner fences for balcony feature is: %.2f"% (percentage))

count_over_upper = len(data[data['balcony']>upper_outer_fence])
count_under_lower = len(data[data['balcony']<lower_outer_fence])
percentage = 100 * (count_under_lower + count_over_upper) / a["count"]
print("percentage of records out of outer fences for balcony feature is: %.2f"% (percentage))

In [None]:
len(data[data['balcony']>upper_inner_fence])

In [None]:
plt.figure(figsize=(20,10))
sns.boxplot(x=data['balcony'], y=data['price'], data=data)
plt.xticks(rotation=45, horizontalalignment='right')
plt.xlabel('Number of balconies')
plt.ylabel('Price')
plt.title('Price distribution by number of balconies')
plt.show()

In [None]:
data.balcony.unique()

In [None]:
data.bath.unique()