# Predicting Rent Prices Based on Unit/House Data
*Exploratory Data Analysis, Data Processing, Model Development, and Model Deployment by Charles Selden*
***

## Importing Packages

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker  
from sklearn.preprocessing import OneHotEncoder
from pandas.api.types import CategoricalDtype
from sklearn.preprocessing import scale 
from sklearn import model_selection
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from statsmodels.stats.stattools import durbin_watson
from scipy.stats import chi2_contingency
import researchpy as rp
import math
from sklearn.linear_model import Lasso

***

## Importing the Dataset

In [2]:
data = pd.read_csv("~/Documents/data/house_data/House_Rent_Dataset.csv")
columns = list(data.columns)

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Charlie/Documents/data/house_data/House_Rent_Dataset.csv'

In [None]:
column_to_description = {}

descriptions_temp = ["Posted On: The date on which the house listing was posted.",
"BHK: Number of Bedrooms, Hall, Kitchen.",
"Rent: Rent of the Houses/Apartments/Flats.",
"Size: Size of the Houses/Apartments/Flats in Square Feet.",
"Floor: Houses/Apartments/Flats situated in which Floor and Total Number of Floors (Example: Ground out of 2, 3 out of 5, etc.)",
"Area Type: Size of the Houses/Apartments/Flats calculated on either Super Area or Carpet Area or Build Area.",
"Area Locality: Locality of the Houses/Apartments/Flats.",
"City: City where the Houses/Apartments/Flats are Located.",
"Furnishing Status: Furnishing Status of the Houses/Apartments/Flats, either it is Furnished or Semi-Furnished or Unfurnished.",
"Tenant Preferred: Type of Tenant Preferred by the Owner or Agent.",
"Bathroom: Number of Bathrooms.",
"Point of Contact: Whom should you contact for more information regarding the Houses/Apartments/Flats."]

for i in range(len(list(data.columns))):
    column_to_description[columns[i]] = descriptions_temp[i]
                                  
for i in columns:
    print(column_to_description[i])

***

# Exploring the Dataset

In [None]:
data.info()

## Posted On

In [None]:
print(column_to_description["Posted On"],"\n")
print(data["Posted On"],"\n")
posted_counts = data["Posted On"].value_counts()
print(posted_counts)

#### Maybe convert dates to int representing time since first listed date/start of 2020

## BHK

In [None]:
print(column_to_description["BHK"],"\n")
print(data["BHK"],"\n")
print(data["BHK"].value_counts())

#### Simple, works in current form

## Rent

In [None]:
print(column_to_description["Rent"],"\n")
print(data["Rent"])

In [None]:
plt.scatter(data["Size"],np.sort(data["Rent"]))
plt.gca().yaxis.set_major_formatter(mticker.FormatStrFormatter('%.0f Rupees'))

*Lets put a limit on the y axis to ignore the extreme outliers.*

In [None]:
plt.scatter(data["Size"],np.sort(data["Rent"]))
plt.gca().yaxis.set_major_formatter(mticker.FormatStrFormatter('%.0f Rupees'))
plt.ylim(0,600000)

*As a scalar value, it's easy to use. Our y value, what we want to predict using the model. A few extremely high outliers, but the majority values are under 100,000 and an even larger majority are under 400,000*

## Size

In [None]:
print(column_to_description["Size"],"\n")
print(data["Size"])

*In square feet, makes using much easier than a categorical version.*

## Floor

In [None]:
print(column_to_description["Floor"],"\n")
print(data["Floor"],"\n")
print(data["Floor"].value_counts())

*Actual values are scalar but in a string format, split into two seperate columns for the floor the room is on as well as the total number of floors the place has.*

## Area Type

In [None]:
print(column_to_description["Area Type"],"\n")
print(data["Area Type"],"\n")
print(data["Area Type"].value_counts())

*Super area includes sq feet for areas the tenant will have access to outside his apartment/house (stairways, public areas, hallways), while carpet area is just the apartment or house itself.*

## Area Locality

In [None]:
print(column_to_description["Area Locality"],"\n")
print(data["Area Locality"],"\n")
print(data["Area Locality"].value_counts())

In [None]:
plt.hist(data["Area Locality"])

*Too many categorical variables with only a few data points to one-hot encode or label encode, just use city instead.*

## City

In [None]:
print(column_to_description["City"],"\n")
print(data["City"],"\n")
print(data["City"].value_counts())

*To be used instead of Locality, the issue of how to encode this categorical data is still a concern. Forgot that one-hot encoding created a column for each value which have binary values, this definitely is the best encoding scheme to use here.*

## Furnishing Status

In [None]:
print(column_to_description["Furnishing Status"],"\n")
print(data["Furnishing Status"],"\n")
print(data["Furnishing Status"].value_counts())

*Categorical Data but seems like it would likely perform well if label-encoded as the categories are a scale from Unfurnished  through Furnished. Just need to make sure that unfurnished = 0, semi-furnished = 1, and furnished = 2.*

## Tenant Preferred

In [None]:
print(column_to_description["Tenant Preferred"],"\n")
print(data["Tenant Preferred"],"\n")
print(data["Tenant Preferred"].value_counts())

*Potentially able to use one-hot encoding for this, but might not give as much information as it potentially could.*

## Bathroom

In [None]:
print(column_to_description["Bathroom"],"\n")
print(data["Bathroom"],"\n")
print(data["Bathroom"].value_counts())

*No problem to use, simple scalar integer value.*

## Point of Contact

In [None]:
print(column_to_description["Point of Contact"],"\n")
print(data["Point of Contact"], "\n")
print(data["Point of Contact"].value_counts())

*Maybe remove contact builder since with just one data point we can't learn how this factors into the regression robustly.*
***

# Checking for Missing Values

In [None]:
for i in columns:
    print(str(i) + ":",data[i].isnull().sum())

*There are no missing values in this dataset.*

***

# Data Engineering and Processing

## Steps to be taken based on EDA:
1. `Posted On`: Convert Date to time since start of 2022.
2. `Size`: Maybe scale this using *Area Type* if it doesn't work well in the regression elsewise. For now having *Area Type* as a label-encoded variable which learns a flat value to adjust the outcome by might be fine, but the difference between different *Size*s will vary as some super-areas will severely overestimate *Size* and some will only minimally overestimate *Size*.
3. `Floor`: One of two options seems optimal. Both involve splitting the active floor and the total floors given the building first. The first option is just having both of those as seperate columns, the second option involves the floor the unit is on as the first column, and the ratio of that to the total floors in the building as the second.
The first option seems like the data is presented more clearly, but the second allows the relationship between the two to be passed directly in, as learning that might be suboptimal in terms of limiting learning other relationships. Call these columns *On Floor* and *Building Floors* in either case.
4. `Area Type`: If used to scale *Size* then don't use this column, otherwise use binary one-hot encoding for *Carpet Area* and *Super Area*, and throw out *Built Area* as that only has two data points.
5. `City`: This needs to be one-hot encoded, definitely good in this case as there are only a small number of cities with decent data population for each. 
6. `Furnishing Status`: Label encode with the following system. *Unfurnished* = 0, *Semi-Furnished* = 1, *Furnished* = 2.
7. `Tenant Preferred`: Do a kind of custom one-hot encoding. Normally each value would get its own column, but here we can just make two instead of three as one just represents both of the previous options simultaneously.
8. `Point of Contact`: Just drop the *Contact Builder* as it only has a single data point, the other two can be label encoded with 0 and 1.

***
## Taking these Steps

### Initial Data Shape

In [None]:
data.shape

### Step One - Making a Function to Update Posted On

In [None]:
posted_on_temp = data["Posted On"]
posted_on_temp

*First we can check whether all the dates start in 2022, this same method can be applied using whatever year is the earliest date in the dataset.*

In [None]:
split_dates = []

for i in range(len(posted_on_temp)):
    split_dates.append(posted_on_temp[i].split("-"))
    
    if split_dates[i][0] != "2022":
        print("diff date",split_dates[i])

*Since all the data is from 2022, we will set 2022-01-01 to be 0, and every day past will increment.*

*First we need some way to tell the number of days in a month given the month and year.*

In [None]:
def numberOfDays(y, m):
      leap = 0
      if y% 400 == 0:
         leap = 1
      elif y % 100 == 0:
         leap = 0
      elif y% 4 == 0:
         leap = 1
      if m==2:
         return 28 + leap
      list = [1,3,5,7,8,10,12]
      if m in list:
         return 31
      return 30

*Now we can build our update function.*

In [None]:
def updatePostedOnToScalar(data):
    list_of_date_num_equivalent = []    
    
    for i in range(len(split_dates)):
        date_num_equivalent = 0
        
        for j in range(int(split_dates[i][1])):
            date_num_equivalent = date_num_equivalent + numberOfDays(int(split_dates[i][0]),j)
            
        date_num_equivalent = date_num_equivalent + int(split_dates[i][2])
        
        list_of_date_num_equivalent.append(date_num_equivalent)
        
    list_of_date_num_equivalent = np.array(list_of_date_num_equivalent)
        
    data["Posted On"] = list_of_date_num_equivalent

    return data

### Step Two - Making a Function to Adjust Size if Necessary
*Nothing necessary here so far.*

### Step Three - Making a Function to Split Floor Column

In [None]:
def splitFloorIntoTwo(data):
    original_floor = data["Floor"]
    floor_on = []
    floor_out_of = []
    for i in range(len(original_floor)):
        split = original_floor[i].split(" out of ")
        if len(split) == 1:
            if split[0] == 'Ground':
                floor_on.append(0)
                floor_out_of.append(0)
            else:
                floor_on.append(split[0])
                floor_out_of.append(split[0])
        else:  
            #Ground is 0, others are basement and such so all other non-int convertables go to -1
            if split[0] == 'Ground':
                floor_on.append(0)
            else:
                try:
                    floor_on.append(int(split[0]))
                except:
                    floor_on.append(-1)
            floor_out_of.append(int(split[1]))
    
    floor_on = np.array(floor_on)
    floor_out_of = np.array(floor_out_of)
    
    data["Floor On"] = floor_on
    data["Floor Out Of"] = floor_out_of

    data.drop("Floor",axis=1,inplace=True)

    return data

### Step Four - Making a Function to Label Encode a Given Column

In [None]:
def labelEncodeColumn(data,column_name):
    if column_name == "Furnishing Status":
        cat_type = CategoricalDtype(categories=["Unfurnished", "Semi-Furnished", "Furnished"], ordered=True)
        data[column_name] = data[column_name].astype(cat_type)
        data[column_name] = data[column_name].cat.codes
    else:
        data[column_name] = data[column_name].astype('category')
        data[column_name] = data[column_name].cat.codes

    return data

### Step Five - Making a Function to One-Hot Encode a Given Column

In [None]:
def oneHotEncodeColumn(data,column_name):
    encoder = OneHotEncoder(handle_unknown='ignore',sparse=False)
    if column_name == "Tenant Preferred":
        bachelors = []
        family = []
        for i in range(len(data["Tenant Preferred"])):
            if data["Tenant Preferred"][i] == "Bachelors":
                bachelors.append(1)
                family.append(0)
            elif data["Tenant Preferred"][i] == "Family":
                bachelors.append(0)
                family.append(1)
            elif data["Tenant Preferred"][i] == "Bachelors/Family":
                bachelors.append(1)
                family.append(1)
            else:
                print("issue w tenant preferred encoding")
        bachelors = pd.Series(bachelors)
        bachelors.name = "Bachelors"
        family = pd.Series(family)
        family.name = "Family"

        data = pd.concat([data,bachelors,family],axis=1)

        return data

    else:
        encoded_data = pd.DataFrame(encoder.fit_transform(data[[column_name]]))

        feature_names = encoder.get_feature_names_out()
        for i in range(len(feature_names)):
            feature_names[i] = feature_names[i].split("_")
        for i in range(len(feature_names)):
            feature_names[i] = feature_names[i][1]

        encoded_data.columns = feature_names

        data = pd.concat([data,encoded_data],axis=1)

    return data

***
## Processing the Data for Training our Model

In [None]:
data.shape

In [None]:
data.columns

## Step One - Update Posted On to Scalar Values

In [None]:
data["Posted On"]

In [None]:
data = updatePostedOnToScalar(data)

In [None]:
data["Posted On"]

*Check that the update didn't create any null values unintentionally.*

In [None]:
columns = data.columns
for i in columns:
    print(str(i) + ":",data[i].isnull().sum())

### Step Two - Adjust Size *(if necessary)*

### Step Three - Split Floor Column into Floor On and Floor Out Of

In [None]:
data["Floor"]

In [None]:
data = splitFloorIntoTwo(data)

In [None]:
data["Floor On"].value_counts()

In [None]:
data["Floor Out Of"].value_counts()

*Check that the update didn't create any null values unintentionally.*

In [None]:
columns = data.columns
for i in columns:
    print(str(i) + ":",data[i].isnull().sum())

### Step Four - Encode Area Type

*First we need to drop Built Area as it has too few data points.*

In [None]:
data['Area Type']

In [None]:
data = data.drop(data[data['Area Type'] == 'Built Area'].index)
data['Area Type'].value_counts()

In [None]:
data = data.reset_index()
data

In [None]:
data.shape

*Check that dropping rows didn't create any null values unintentionally.*

In [None]:
columns = data.columns
for i in columns:
    print(str(i) + ":",data[i].isnull().sum())

*We can label encode into two categories after having dropped Built Area. Since there are only two potential categories, we don't need to worry about ordinality of the categories.*

In [None]:
data['Area Type']

In [None]:
data = labelEncodeColumn(data,'Area Type')
data['Area Type']

*Check that the update didn't create any null values unintentionally.*

In [None]:
columns = data.columns
for i in columns:
    print(str(i) + ":",data[i].isnull().sum())

### Step Five - One-Hot Encoding for City

In [None]:
data['City']

In [None]:
data = oneHotEncodeColumn(data,'City')

In [None]:
data = data.drop("City",axis=1)
data = data.drop("Area Locality",axis=1)

In [None]:
display(data)

*Check that the update didn't create any null values unintentionally.*

In [None]:
columns = data.columns
for i in columns:
    print(str(i) + ":",data[i].isnull().sum())

### Step Six - Label *(or maybe ordinal)* Encode Furnishing Status

In [None]:
data['Furnishing Status']

In [None]:
data = labelEncodeColumn(data,'Furnishing Status')
data['Furnishing Status']

*Check that the update didn't create any null values unintentionally.*

In [None]:
columns = data.columns
for i in columns:
    print(str(i) + ":",data[i].isnull().sum())

### Step Seven - One-Hot Encode Tenant Preferred *(with bachelor and family both binary so they can include bachelor/family with just those columns)*

In [None]:
data["Tenant Preferred"]

In [None]:
data = oneHotEncodeColumn(data,"Tenant Preferred")
columns = data.columns
columns

In [None]:
data = data.drop("Tenant Preferred",axis=1)
display(data)

*Check that the update didn't create any null values unintentionally.*

In [None]:
columns = data.columns
for i in columns:
    print(str(i) + ":",data[i].isnull().sum())

### Step Eight - Label Encode Point of Contact

In [None]:
data['Point of Contact']

*First I need to drop Contact Builder given the scarcity of data using that value.*

In [None]:
data = data.drop(data[data['Point of Contact'] == 'Contact Builder'].index)
data['Point of Contact'].value_counts()

In [None]:
data = data.reset_index()
display(data)

*Check that dropping rows didn't create any null values unintentionally.*

In [None]:
columns = data.columns
for i in columns:
    print(str(i) + ":",data[i].isnull().sum())

*Next I just label encode the Point of Contact column.*

In [None]:
data = labelEncodeColumn(data,'Point of Contact')
display(data)

*Now after we remove the columns representing the indices at points in the process, we are finished processing our data.*

In [None]:
data = data.drop("level_0",axis=1)
data = data.drop("index",axis=1)
display(data)

*Now we just need to make sure that all of our data types are numbers (just not strings)*

In [None]:
columns = data.columns
for i in columns:
    data[i] = data[i].astype(int)

*Check that the update didn't create any null values unintentionally.*

In [None]:
columns = data.columns
for i in columns:
    print(str(i) + ":",data[i].isnull().sum())

# Testing for Linear Relationships in Our Variables with Rent

In [None]:
def check_linear_relationship(compare_from):
    compare_col = compare_from
    theta = np.polyfit(data[compare_col], data['Rent'],1)
    y_line = theta[1] + theta[0] * data[compare_col]
    plt.scatter(data[compare_col], data['Rent'], color='red')
    plt.plot(data[compare_col], y_line, 'b')
    plt.title('Rent Vs ' + str(compare_col), fontsize=14)
    plt.xlabel(compare_col, fontsize=14)
    plt.ylabel('Rent', fontsize=14)
    plt.grid(True)
    plt.show()
    print("The slope of the best fit for", compare_col, "is " + str(theta[0]))

### Posted On

In [None]:
check_linear_relationship("Posted On")

### BHK

In [None]:
check_linear_relationship("BHK")

### Size

In [None]:
check_linear_relationship("Size")

In [None]:
check_linear_relationship("Area Type")

In [None]:
check_linear_relationship("Furnishing Status")

In [None]:
check_linear_relationship("Bathroom")

In [None]:
check_linear_relationship("Point of Contact")

In [None]:
check_linear_relationship("Floor On")
check_linear_relationship("Floor Out Of")

In [None]:
check_linear_relationship("Bachelors")
check_linear_relationship("Family")

In [None]:
check_linear_relationship("Bangalore")
check_linear_relationship("Chennai")
check_linear_relationship("Delhi")
check_linear_relationship("Hyderabad")
check_linear_relationship("Kolkata")
check_linear_relationship("Mumbai")

### We know now that despite not directly appearing linear, each of our variables has an acceptably strong linear relationship. While the clustered data points make these difficult to see, the best fit lines' slopes indicate the actual trend.

### Along with accomplishing initial testing for a basic linear regression model, we can perform PCA to determine if we are using too much information along with get a baseline score for validation purposes given each model we want to test.

In [None]:
def validate(data,reg_type="linear_regression",repeats=3):
    columns = data.columns
    X = data[columns]
    y = np.ravel(data[["Rent"]])
    X = X.drop("Rent",axis=1)
    num_points = len(data)
    
    pca = PCA()
    X_reduced = pca.fit_transform(scale(X))
    
    cv = RepeatedKFold(n_splits=10, n_repeats=repeats, random_state=.33)
    
    if reg_type == 'linear':
        regr = LinearRegression()
    elif reg_type == 'random_forest':
        regr = RandomForestRegressor()
    elif reg_type == 'lasso':
        regr = Lasso()
    else:
        print("you need a valid regression type for pca")
        return
    mse = []
    
    score = -1*model_selection.cross_val_score(regr,
           np.ones((len(X_reduced),1)), y, cv=cv,
           scoring='neg_mean_squared_error').mean()  
    mse.append(score/num_points)
    
    for i in np.arange(1, len(data.columns)):
        score = -1*model_selection.cross_val_score(regr,
               X_reduced[:,:i], y, cv=cv, scoring='neg_mean_squared_error').mean()
        mse.append(score/num_points)
        
    plt.plot(mse)
    plt.xlabel('Number of Used Components')
    plt.ylabel('MSE / Sample Size')
    plt.title('Rent')
    display(mse)

In [None]:
columns = data.columns
X = data[columns]
y = data[["Rent"]]
X = X.drop("Rent",axis=1)
display(X)

*First we scale our predictor variables between 0 and 1.*

In [None]:
pca = PCA()
X_reduced = pca.fit_transform(scale(X))
X_reduced

*Next we define the cross validation method we will use in our evaluation.*

In [None]:
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

In [None]:
regr = LinearRegression()
#regr = RandomForestRegressor()
mse = []

*Next we calculate MSE with only the intercept.*

In [None]:
score = -1*model_selection.cross_val_score(regr,
           np.ones((len(X_reduced),1)), y, cv=cv,
           scoring='neg_mean_squared_error').mean()  
mse.append(score/len(data))

*Finally we calculate MSE using cross-validation, adding one component at a time.*

In [None]:
for i in np.arange(1, len(data.columns)):
    score = -1*model_selection.cross_val_score(regr,
               X_reduced[:,:i], y, cv=cv, scoring='neg_mean_squared_error').mean()
    mse.append(score/len(data))

*Plot cross-validation results (found using multiple linear regression with MSE)*

In [None]:
plt.plot(mse)
plt.xlabel('Number of Used Components')
plt.ylabel('MSE / Sample Size')
plt.title('Rent')
mse

*We can clearly see that each of the principal components reduces the overall MSE of the model except for the last two, meaning we likely have some correlation between some of our variables. This means that next up is testing for correlation or multicollinearity, if the variables are all correlated or if each correlated variable is correlated to a different variable.*

## Testing Variables for Correlation

In [None]:
def get_residuals(data,skl_model,trained):
    y = data["Rent"]
    X = data.drop("Rent",axis=1)
    X_train = X[:int(len(X)*.7)]
    y_train = y[:int(len(X)*.7)]
    X_test = X[int(len(X)*.7):]
    y_test = y[int(len(X)*.7):]
    
    
    X_train = X_train.reset_index()
    X_test = X_test.reset_index()
    y_train = y_train.reset_index()
    y_test = y_test.reset_index()
    
    
    if not trained:
        skl_model.fit(X_train,y_train)
    
    full_predictions = skl_model.predict(X_test)
    predictions = []
    residuals = []
    
    y_test = y_test.drop("index",axis=1)
    
    for i in range(len(full_predictions)):
        predictions.append(full_predictions[i][1])
    
    for i in range(len(predictions)):
        residuals.append(abs(predictions[i] - y_test["Rent"][i]))
           
    return residuals
                         
resids = get_residuals(data,regr,False)

In [None]:
durbin_watson(resids)

**The durbin_watson statistic being under 1.5 indicates a relatively strong autocorrelation within the used variables.**

This means we now need to test for which of our variables strongly correlate with each other.

In [None]:
#chisquare(data).statistic

#covariance = SUM((xi - avgi)(yj - avgj))/n

def areScalarVariablesCorrelated(data,varA,varB):
    var1 = scale(data[varA])
    var2 = scale(data[varB])
    
    avg1 = var1.mean()
    avg2 = var2.mean()
    
    if len(var1) != len(var2):
        print("Please use variables with the same number of entries.")
        return None
    
    total_sum = 0
    for i in range(len(var1)):
        total_sum = total_sum + (var1[i] - avg1)*(var2[i] - avg2)
    
    return total_sum/len(var1)

In [None]:
display(data)

In [None]:
def scalarCorrelationMatrix(data):
    scalar_columns = ['Posted On','BHK','Size','Bathroom','Floor On','Floor Out Of']

    data_columns = data.columns
    
    for i in scalar_columns:
        if not i in data.columns:
            scalar_columns.remove(i)
    
    correlation_columns = []

    for i in scalar_columns:
        list_column = []
        for j in scalar_columns:
            list_column.append(areScalarVariablesCorrelated(data,i,j))
        correlation_columns.append(pd.Series(list_column,name=i,dtype='float64'))

    correlation_frame = pd.concat(correlation_columns,axis=1)
    correlation_frame.index = scalar_columns
    
    return correlation_frame

In [None]:
display(scalarCorrelationMatrix(data))

**Here the correlation matrix shows the level of correlation between any of the two scalar variables.**

The strongest correlations are as follows:

1. Floor on and Floor Out Of - This correlation is pretty simple, Floor Out Of is the maximum value that Floor On could be, meaning that as Floor Out Of increases, the potential size of Floor On does as well.

    To fix this we could make Floor On into a ratio between current Floor On and Floor Out Of, making the Floor Out Of represent the building size while Floor On will say where in the building it is without scaling linearly along with Floor Out Of.


2. Bathroom and BHK - This correlation makes a lot of sense as well, given that Bathroom tells the number of available bathrooms while BHK tells the number of bathroons, hallways, and kitchens. 

3. Size and Bathroom - The correlation makes sense as the larger a place is, generally the more bathrooms it has and apparently that trend is strong enough to show correlation here.

4. Size and BHK - This is part of the same multicollinear issue with the previous two. 

    To try to fix all three of these we can try this:

    First, scale BHK and Bathroom by Size (BHK/Size,Bathroom/Size).

    Next, update BHK to be itself minus Bathroom.

Check for collinearity after this, should remove correlation with Size for Bathroom and BHK, but still need to see if the processing will affect how the model reads the data.

## Floor

In [None]:
def scaleFloor(data):
    data["Floor On"].astype("float32",copy=False)
    floor_on = data["Floor On"]
    floor_out_of = data["Floor Out Of"]
    for i in range(len(floor_on)):
        if floor_on[i] == 0:
            floor_on[i] = 0
        else:
            floor_on[i] = floor_on[i] / floor_out_of[i]
    data["Floor On"] = floor_on
    
    return data

In [None]:
display(data)

In [None]:
data = scaleFloor(data)
display(data)

In [None]:
display(scalarCorrelationMatrix(data))

**Dropped from .86 to .084, successfully removed significant correlation from Floor On and Floor Out Of.**

In [None]:
def sizeBathroomBHKScale(data):
    #First scale Bathroom and BHK by Size
    data["Bathroom"].astype("float32",copy=False)
    data["BHK"].astype("float32",copy=False)
    
    bathroom = data["Bathroom"]
    bhk = data["BHK"]
    size = data["Size"]
    
    for i in range(len(bathroom)):
        if size[i] == 0:
            size[i] = 1
        bathroom[i] = bathroom[i] + bhk[i]
        bathroom[i] = bathroom[i] / size[i]
            
    data["Bathroom"] = bathroom
    data = data.drop("BHK",axis=1)
    
    return data

In [None]:
data = sizeBathroomBHKScale(data)
display(data)

In [None]:
display(scalarCorrelationMatrix(data))

**Now some of our correlation is negative, but the degree is significantly smaller indicating much less correlation between the three variables.**

In [None]:
resids = get_residuals(data,LinearRegression(),False)

In [None]:
durbin_watson(resids)

**We've improved our durbin watson statistic from .95 to 1.19, however our goal is still 1.5.**

We have looked close at our scalar variables, now it's time to check our categorical vars.

In [None]:
cat_vars = ["Area Type","Furnishing Status","Point of Contact","Bachelors","Family",
            "Bangalore","Chennai","Delhi","Hyderabad","Kolkata","Mumbai"]

In [None]:
def getCatCorrelation(data):
    result_holder = []
    for i in cat_vars:
        #cat_holder.append(data[i].value_counts())
        for j in cat_vars:
            crosstab, test_results, expected = rp.crosstab(data[i], data[j],
                                               test= "chi-square",
                                               expected_freqs= True,
                                               prop= "cell")
            result_holder.append([crosstab,test_results,expected])
        
    return result_holder
    
result = getCatCorrelation(data)

**Phi and Cramer's V	Interpretation**

0.25:	Very strong

0.15:	Strong

0.10:	Moderate

0.05:	Weak

0:	No or very weak

    

## Correlation Strength per Categorical Variable

In [None]:
def catCorrMatrix(catCorrResults):
    corr_strength = pd.DataFrame()
    corr_strength_constructor = []

    #n is the starting point (index wise) for the current var
    n = 0
    interp = False
    for i in range(len(cat_vars)): 
        corr_strength_helper = []
        for j in range(len(cat_vars)):
            if interp:
                if abs(result[n+j][1]['results'][2]) > .25:
                    #print("Very Strong")
                    corr_strength_helper.append('VS')
                elif abs(result[n+j][1]['results'][2]) > .15:
                    #print("Strong")
                    corr_strength_helper.append('S')
                elif abs(result[n+j][1]['results'][2]) > .1:
                    #print("Moderate")
                    corr_strength_helper.append('M')
                elif abs(result[n+j][1]['results'][2]) > .05:
                    #print("Weak")
                    corr_strength_helper.append('W')
                else:
                    #print("None or Very Weak")
                    corr_strength_helper.append('N')
            else:
                corr_strength_helper.append(result[n+j][1]['results'][2])

        corr_strength_helper = pd.Series(corr_strength_helper,name=cat_vars[i])
        corr_strength = pd.concat([corr_strength,corr_strength_helper],axis=1)

        n = n + len(cat_vars)

    corr_strength.index = cat_vars
    
    return corr_strength

In [None]:
corr_strength = catCorrMatrix(result)

In [None]:
def highlightS(x,color):
    ones = np.where(x > .99, "color: white;", None)
    s = np.where(x > .2, f"color: {color};", None)
    vs = np.where(x > .5, "color: blue;", None)
    for i in range(len(vs)):
        if ones[i] == None:
            if not s[i] == None:
                if vs[i] == None:
                    vs[i] = s[i]
        else:
            vs[i] = ones[i]
    return vs

def highlightN(x,color):
    return np.where((x == "N"), f"color: {color};", None)

In [None]:
display(corr_strength.style.apply(highlightS,color="green"))

From these we can see a lot of strong and very strong correlation between our categorical variables. The main problematic correlations seem to fall into two groups.

The first main group is as follows:

1. Point of Contact and Area Type
2. Point of Contact and Mumbai
3. Mumbai and Area Type
4. Family and Area Type

In addition to these there is a lot of correlation between the different one-hot encoded cities.

5. Bangalore and Mumbai
6. Bangalore and Chennai
7. Bangalore and Hyderabad
8. Chennai and Hyderabad
9. Chennai and Mumbai
10. Hyderabad and Mumbai

**First, we can see that Area Type and Point of Contact are collinear, along with Area Type being collinear with Mumbai and Family as well. This seems to indicate that Area Type doesn't present any information not within those other vars, so we can just drop it [IF THIS DOESN'T WORK WE MAY NEED TO SCALE SIZE BY AREA TYPE BEFORE DROPPING IT]**

In [None]:
data = data.drop("Area Type",axis=1)

In [None]:
cat_vars = ["Furnishing Status","Point of Contact","Bachelors","Family",
            "Bangalore","Chennai","Delhi","Hyderabad","Kolkata","Mumbai"]

In [None]:
result = getCatCorrelation(data)
corr_strength = catCorrMatrix(result)
display(corr_strength.style.apply(highlightS,color="green"))

**Especially when it comes to the cities there seems to be tons of multicollinearity between all of them other than Delhi and Kolkata. Probably drop all other cities other than one.**

In [None]:
#combine bangalore, chennai, hyderabad, and mumbai into a column with 
#0 meaning the apartment is not in any of the listed cities, 
#while 1 means that it is in ANY of the listed cities

In [None]:
for i in range(len(data["Delhi"])):
    in_one_of_cities = False
    for j in ["Bangalore","Chennai","Hyderabad","Delhi"]:
        if data[j][i] == 1:
            in_one_of_cities = True
    if in_one_of_cities:
        data["Delhi"][i] = 1

In [None]:
data = data.drop("Bangalore",axis=1)
data = data.drop("Chennai",axis=1)
data = data.drop("Hyderabad",axis=1)
data = data.drop("Kolkata",axis=1)

In [None]:
cat_vars = ["Furnishing Status","Point of Contact","Bachelors","Family","Delhi","Mumbai"]

In [None]:
result = getCatCorrelation(data)
corr_strength = catCorrMatrix(result)
display(corr_strength.style.apply(highlightS,color="green"))

We finally have removed most strong collinearity from our categorical variables. Lets check the durbin watson score again now.

In [None]:
resids = get_residuals(data,LinearRegression(),False)
durbin_watson(resids)

It seems that we are still a little outside of the range considered normal, as we want at least ~1.5. Looking through these variables we can think that seperating Bachelors and Family may have just contributed to our autocorrelation. Removing one should leave most of the relevant information with the other.

In [None]:
#cat_vars = ["Delhi","Furnishing Status","Family","Mumbai"]
cat_vars = ["Furnishing Status","Point of Contact","Family","Delhi","Mumbai"]

In [None]:
data = data.drop("Bachelors",axis=1)

In [None]:
result = getCatCorrelation(data)
corr_strength = catCorrMatrix(result)
display(corr_strength.style.apply(highlightS,color="green"))

In [None]:
resids = get_residuals(data,LinearRegression(),False)
durbin_watson(resids)

In [None]:
data = data.drop("Point of Contact",axis=1)

cat_vars = ["Furnishing Status","Family","Delhi","Mumbai"]

In [None]:
result = getCatCorrelation(data)
corr_strength = catCorrMatrix(result)
display(corr_strength.style.apply(highlightS,color="green"))

In [None]:
resids = get_residuals(data,LinearRegression(),False)
durbin_watson(resids)

In [None]:
data = data.drop("Furnishing Status",axis=1)

cat_vars = ["Family","Delhi","Mumbai"]

In [None]:
result = getCatCorrelation(data)
corr_strength = catCorrMatrix(result)
display(corr_strength.style.apply(highlightS,color="green"))

In [None]:
resids = get_residuals(data,LinearRegression(),False)
durbin_watson(resids)

In [None]:
data = data.rename(columns={"Delhi":"InSimilarCities"})

In [None]:
#data = data.drop("Mumbai",axis=1)

In [None]:
cat_vars = ["Family","InSimilarCities","Mumbai"]

In [None]:
result = getCatCorrelation(data)
corr_strength = catCorrMatrix(result)
display(corr_strength.style.apply(highlightS,color="green"))

In [None]:
resids = get_residuals(data,LinearRegression(),False)
durbin_watson(resids)

**After all this we still only have 1.3. Seeing as we have removed essentially all autocorrelation from our categorical variables by now we can only attribute this to the continuous variables.**

# FIX CONTINUOUS AUTOCORRELATION NOW

In [None]:
data

In [None]:
validate(data)

### After we've removed most of the autocorrelation from our data we just need to remove extreme outlier points and we will have a useable dataset.

In [None]:
col = "Rent"
sample_mean = np.mean(data[col],axis=0)
sample_std_dev = np.std(data[col],axis=0)
row_pointer = 0
for row in range(len(data)):
    safe = True
    val = data.iloc[row_pointer][col]
    if val <= sample_mean - 2 * sample_std_dev:
        safe = False
    elif val >= sample_mean + 2 * sample_std_dev:
        safe = False

    if not safe:
        data = data.drop(row,axis=0)
        row_pointer = row_pointer - 1

    row_pointer = row_pointer + 1
data.reset_index(drop=True)

In [None]:
data

# Model Usability Experimentation and Model Development

### To start model development lets begin by looking at the main types of regression techniques we could potentially use.

1. Multiple Linear Regression - By this I mean a regression done with the simple/multiple linear regression model ((((EQUATION -   Y = XB + epsilon)))). Uses OLS or Ordinary Least Squares [SUM:(yHat - y)^2]

2. Neural network regression - This type of model has its advantages, but before testing I think our data sample size is too small from which to learn anything meaningful.

3. Lasso Regression - Uses Least Absolute Shrinkage and Selection Operator in place of the classic OLS. Good for datasets with highly correlated variables which you're having a hard time seperating.

4. Decision Tree or Random Forest Regression - Using either a single main decision tree or a forest of small trees built from random sampling, either case uses these trees to evaluate a sum as the nodes are evaluated which works in the end as a regression.

5. KNN (or other clustering) - Cluster the data using a KNN or other clustering algorithm, and evaluate the value of any newly presented point by taking the mean of the nearby points after the new point is plotted.

6. SVM - SVM's are excellent for both classification and regression due to their use of the kernel trick to create arbitrarily complex domains in which to perform the regression without knowing what those domains are. 

7. Gaussian Process Regression - Models the probability distribution for the domain of all inputs to the function used rather than for any specific inputs. Sounds really nice, and each prediction comes with a uncertainty measure. 

8. Polynomial Regression - Like simple linear regression, but instead of y = x0 + b(x1), the series is increasing polynomials. Therefore, y = x0 + a(x1) + b(x1)^2 + c(x1)^3 + ... n(x1)^n. Seems a little useful, but I have too many variables to use a simple regression model.

### Now we just need to decide actual models to test based on these techniques.

1. Multiple Linear Regression

In [None]:
validate(data,reg_type='linear',repeats=50)

2. Neural Network Regression

3. Lasso Regression

In [None]:
validate(data,reg_type='lasso',repeats=50)

4. Random Forest Regression

In [None]:
validate(data,reg_type='random_forest',repeats=5)

5. K-Means Clustering with an SVM Regression per Cluster
https://blog.paperspace.com/svr-kmeans-clustering-for-regression/

6. SVM with Linear Kernel

7. SVM with Gaussian Kernel

8. SVM with Linear Kernel