# **Kenya Housing Value Details**

Real estate was among the sectors worst hit by Covid-19. According to data provided by the [Kenya Property Center](https://kenyapropertycentre.com/market-trends/average-prices), Nairobi tops the list of counties with an average high of KSh 70,000 per month for rental houses, followed by Mombasa and Kilifi with an average of KSh 40,000 per month. By town, Lavington, Karen, and Westlands top the list with average monthly rental charges of KSh 210K, KSh 150K, and KSh 130K respectively. The dataset of interest is available [here](https://www.kaggle.com/datasets/iamasteriix/rental-apartments-in-kenya).

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_style('whitegrid')

apt_rents = pd.read_csv('rent_apts.csv')
apt_rents.head()

| | Agency | Neighborhood | Price | link | sq_mtrs | Bedrooms | Bathrooms |
|---|---|---|---|---|---|---|---|
| 0 | Buy Rent Shelters | General Mathenge, Westlands | KSh 155,000 | /listings/4-bedroom-apartment-for-rent-general... | 4.0 | 4.0 | 4.0 |
| 1 | Kenya Classic Homes | Kilimani, Dagoretti North | KSh 100,000 | /listings/3-bedroom-apartment-for-rent-kiliman... | 300.0 | 3.0 | 4.0 |
| 2 | Absolute Estate Agents | Hatheru Rd,, Lavington, Dagoretti North | KSh 75,000 | /listings/3-bedroom-apartment-for-rent-lavingt... | 3.0 | 3.0 | 5.0 |
| 3 | A1 Properties Limited | Kilimani, Dagoretti North | KSh 135,000 | /listings/3-bedroom-apartment-for-rent-kiliman... | 227.0 | 3.0 | 4.0 |
| 4 | Pmc Estates Limited | Imara Daima, Embakasi | KSh 50,000 | /listings/3-bedroom-apartment-for-rent-imara-d... | 3.0 | 3.0 | NaN |


# **Data Cleaning & Selection** 

From the data, the variables of interest are the neighborhood (`Neighborhood`), the price (`Price`), the house size (`sq_mtrs`), and the number of bedrooms and bathrooms (`Bedrooms`, `Bathrooms`). However, some cleaning is in order before the analysis. As a starting point, only two variables are considered: `Neighborhood` and `Price`. The remaining adjustments are made as the analysis proceeds.

In [2]:
apt_rents.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1848 entries, 0 to 1847
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Agency        1848 non-null   object 
 1   Neighborhood  1848 non-null   object 
 2   Price         1848 non-null   object 
 3   link          1848 non-null   object 
 4   sq_mtrs       1846 non-null   float64
 5   Bedrooms      1845 non-null   float64
 6   Bathrooms     1557 non-null   float64
dtypes: float64(3), object(4)
memory usage: 101.2+ KB


## Neighborhood

The `Neighborhood` variable contains the location of these houses. A look at its values shows that there are over 600 unique places. However, the interest here is in a smaller set of specific locations.

In [3]:
len(apt_rents.Neighborhood.unique())

647

Here, we create a column `region` from the last comma-separated part of each `Neighborhood` value, which in most cases is the constituency, with a few exceptions such as 'Mombasa Rd'. This reduces the 647 unique neighborhoods to only 37 unique regions, stored in the column `region`. However, the column's data type is `object`, which presents some challenges during analysis. Therefore, it is converted to categorical data.

In [4]:
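# Keep the text after the last comma (the whole string if there is no comma).
get_region = lambda x: x.rpartition(',')[-1]
# Trim the surrounding whitespace.
remove_space = lambda x: x.strip()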

apt_rents['region'] = apt_rents.Neighborhood\
                .apply(get_region)\
                .apply(remove_space)

apt_rents['region'] = apt_rents.region.astype('category')

assert apt_rents.region.dtype == 'category'
print('number of regions:', len(apt_rents.region.unique()), '\n')
print(apt_rents.region.head())

number of regions: 37 

0          Westlands
1    Dagoretti North
2    Dagoretti North
3    Dagoretti North
4           Embakasi
Name: region, dtype: category
Categories (37, object): ['Athi River', 'Dagoretti North', 'Dagoretti South', 'Eldoret North', ..., 'Thika East', 'Thika Road', 'Thindigua', 'Westlands']
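The same extraction can also be written with pandas' vectorized string methods instead of `apply`. The following is a minimal sketch on a few made-up `Neighborhood` values (the sample series is an assumption for illustration, not the dataset itself):

```python
import pandas as pd

# Hypothetical sample values that mimic the `Neighborhood` column.
neighborhoods = pd.Series([
    "General Mathenge, Westlands",
    "Hatheru Rd,, Lavington, Dagoretti North",
    "Imara Daima, Embakasi",
    "Westlands",  # no comma: the whole string is kept
])

# Split once from the right, keep the last piece, and trim whitespace.
regions = (
    neighborhoods
    .str.rsplit(",", n=1)
    .str[-1]
    .str.strip()
    .astype("category")
)

print(regions.tolist())
```

Splitting only once from the right keeps the behavior close to `rpartition(',')`, including for values that contain no comma at all.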


## Rental Prices

The `Price` variable also needs some adjustment to convert it from an `object` data type to an integer. To do this, the commas and the `KSh` currency symbol are removed.

In [None]:
# Delete every ',', 'K', 'S' and 'h' character from the price string.
remove_character = lambda x: x.translate({ord(c): None for c in ",KSh"})

apt_rents['rental_price'] = apt_rents.Price.apply(remove_character)
apt_rents['rental_price'] = apt_rents['rental_price'].astype('int64')

print(apt_rents.rental_price.head())
assert apt_rents.rental_price.dtype == 'int64'
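Note that `translate` removes the characters `,`, `K`, `S`, and `h` one by one rather than the token `KSh`. A possible alternative, sketched here on made-up price strings, is to strip everything that is not a digit with a regular expression (the sample series and the variable names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical price strings in the same "KSh 155,000" style as the data.
prices = pd.Series(["KSh 155,000", "KSh 100,000", "KSh 75,000"])

# Drop every non-digit character, then convert to integers.
rental_price = (
    prices
    .str.replace(r"\D", "", regex=True)
    .astype("int64")
)

print(rental_price.tolist())
```

This variant also discards the space after the currency symbol, so the conversion does not rely on whitespace being tolerated.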

## Data Selection

The variables `Agency` and `link` won't be of use in this analysis, although they provide some useful information to buyers. From here on, the data for analysis is stored in the `rental_aprt` DataFrame.

In [34]:
rental_aprt = apt_rents[['region', 'rental_price', 'sq_mtrs', 'Bedrooms', 'Bathrooms']]
print(rental_aprt.head())

            region  rental_price  sq_mtrs  Bedrooms  Bathrooms
0        Westlands        155000      4.0       4.0        4.0
1  Dagoretti North        100000    300.0       3.0        4.0
2  Dagoretti North         75000      3.0       3.0        5.0
3  Dagoretti North        135000    227.0       3.0        4.0
4         Embakasi         50000      3.0       3.0        NaN


## Further Selection

This analysis will be based on regions. Therefore, to use parametric analysis, each `region` should have at least 30 entries across the variables. Going by this assumption, only 6 regions meet the requirement.

In [31]:
# Number of regions whose listing count is at least 30.
thirty_more = (rental_aprt.region.value_counts() >= 30).sum()
print(f'Count of regions with at least 30 entries: {thirty_more}')

Count of regions with at least 30 entries: 6


Therefore, sub-setting is in order. Because `value_counts()` sorts the regions by frequency in descending order, the first `thirty_more` labels of its index are exactly the regions that meet the 30-entry threshold. These names are stored in the list `thirty_entries`, which is then used to subset the DataFrame.

The subset keeps 1,698 of the original 1,848 rows, so 150 entries belong to regions that did not meet the minimum size of 30.

In [33]:
thirty_entries = rental_aprt.region\
                .value_counts()\
                .index\
                .to_list()[:thirty_more]

rental_aprt = rental_aprt[rental_aprt.region.isin(thirty_entries)]
print(rental_aprt.shape)
print(rental_aprt.head())

(1698, 5)
            region  rental_price  sq_mtrs  Bedrooms  Bathrooms
0        Westlands        155000      4.0       4.0        4.0
1  Dagoretti North        100000    300.0       3.0        4.0
2  Dagoretti North         75000      3.0       3.0        5.0
3  Dagoretti North        135000    227.0       3.0        4.0
5        Westlands        150000      2.0       2.0        NaN
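The slice above relies on `value_counts()` returning the regions in descending order of frequency. A sketch of an alternative that selects the labels with a boolean mask, and so does not depend on that ordering, is shown below on toy data (the frame `listings` and the threshold `min_entries` are hypothetical stand-ins):

```python
import pandas as pd

# Toy data: region "A" has 3 listings, "B" has 2, "C" has 1.
listings = pd.DataFrame({
    "region": ["A", "A", "A", "B", "B", "C"],
    "rental_price": [10, 20, 30, 40, 50, 60],
})

min_entries = 2  # stands in for the threshold of 30 used above

counts = listings["region"].value_counts()
# Boolean indexing keeps only the labels whose count meets the threshold,
# so the result does not depend on how value_counts() orders its output.
kept_regions = counts[counts >= min_entries].index

subset = listings[listings["region"].isin(kept_regions)]
print(subset.shape)
```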


Now the `region` values will be transformed to nominal integer codes, one per retained region (0 to 5 for the six regions). This will be useful in the later stages of the analysis.
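One way to derive such integer codes, sketched here on made-up values, is through the categorical accessor. A filtered categorical column still remembers all of its original categories, so the unused ones are dropped first to keep the codes contiguous (the series below and the name `region_code` are assumptions for illustration):

```python
import pandas as pd

# Toy categorical column that still remembers a category ("Thika East")
# that no longer appears in the data, as happens after filtering rows.
region = pd.Series(
    pd.Categorical(
        ["Westlands", "Kilimani", "Westlands", "Embakasi"],
        categories=["Embakasi", "Kilimani", "Thika East", "Westlands"],
    )
)

# Drop unused categories so the codes run from 0 to n_regions - 1.
region = region.cat.remove_unused_categories()
region_code = region.cat.codes

print(region_code.tolist())
```

The codes follow the order of the remaining categories, so they are labels only and carry no ranking between regions.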

## Handling Missing Values

Checking the missing values shows that `sq_mtrs` and `Bedrooms` have 2 and 3 missing values respectively. However, `Bathrooms` has a much larger number of missing values: 291.

In [39]:
rental_aprt.isna().sum()

region            0
rental_price      0
sq_mtrs           2
Bedrooms          3
Bathrooms       291
dtype: int64
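Raw counts are easier to judge relative to the number of rows. A minimal sketch of computing the share of missing values per column, using a small made-up frame with the same column names (the values are not from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names; the values are made up.
sample = pd.DataFrame({
    "sq_mtrs": [4.0, 300.0, np.nan, 227.0],
    "Bedrooms": [4.0, 3.0, 3.0, 3.0],
    "Bathrooms": [4.0, np.nan, np.nan, 4.0],
})

# isna().mean() gives the fraction of missing values per column;
# multiplying by 100 turns it into a percentage.
missing_share = sample.isna().mean().mul(100).round(1)
print(missing_share)
```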

### Splitting the Data

To address this, we'll first split the data into training and testing sets. This way, the data used for prediction stays representative of real-world data, with the missing values accounted for. Note that the data is limited to the regions selected above.

In [57]:
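from sklearn.model_selection import train_test_split

# Assumption: rental_price is the target and the other columns are features.
X = rental_aprt.drop(columns='rental_price')
y = rental_aprt['rental_price']
# test_size and random_state are illustrative choices, not from the source.
X_train, X_test, y_train, y_test =\
    train_test_split(X, y, test_size=0.2, random_state=42)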
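Splitting first matters because any fill value for the missing entries should be learned from the training rows only and then reused on the test rows. The sketch below illustrates that idea with median imputation on made-up data; the median strategy, the pandas-only split, and all names are assumptions for illustration rather than the method used in this analysis:

```python
import numpy as np
import pandas as pd

# Toy feature table; the values are made up for illustration.
features = pd.DataFrame({
    "Bedrooms": [1.0, 2.0, 3.0, 4.0, 2.0, 3.0, 1.0, 4.0, 2.0, 3.0],
    "Bathrooms": [1.0, np.nan, 3.0, 4.0, 2.0, np.nan, 1.0, 4.0, np.nan, 3.0],
})

# Simple 80/20 split using pandas only (a stand-in for train_test_split).
train = features.sample(frac=0.8, random_state=0)
test = features.drop(train.index)

# Learn the fill value from the training rows only...
bathroom_median = train["Bathrooms"].median()

# ...and apply that same value to both subsets.
train = train.fillna({"Bathrooms": bathroom_median})
test = test.fillna({"Bathrooms": bathroom_median})

print(train.isna().sum().sum(), test.isna().sum().sum())
```

Computing the median before the split would let information from the test rows leak into the training data.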
