<h1> Further "experimental" EDA of the "merged.csv" file </h1>

Before jumping to conclusions, it would be useful to explore the data a bit more and see how the features relate to each other.

First, let's load the file:

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')  # Silences all warnings: convenient in a notebook, but it can hide real problems

merged = pd.read_csv('data/merged.csv')

<h2> Define a function that plots a simple scatter between a feature and "logerror"</h2>

At first glance, the `"area_live_finished - logerror"` plot shows that the logerror converges to 0 as "area_live_finished" grows large.

In [None]:
def plot_feature_logerror(feature, df=merged):
    """This function plots a scatter x-y style with the feature in xAxis and logerror in yAxis."""
    def is_numerical():
        return df[feature].dtype in ['float64', 'int64']
    def is_label():
        return feature in ['logerror', 'ID']
    def is_constant():
        return df[feature].max() == df[feature].min()
    
    if not (is_numerical() and not is_label() and not is_constant()):
        return
    plt.scatter(df[feature].values, df["logerror"].values)
    plt.xlabel(feature); plt.ylabel("logerror"); plt.title(feature + " - logerror")
    plt.show()
    

# The loop below plots every feature that can be plotted.
for feature in merged.columns:
    plot_feature_logerror(feature)

<h2> Find all non-numerical columns </h2>


In [None]:
def is_numerical(feature, df=merged):
    return df[feature].dtype in ['float64', 'int64']
        
categorical_features = [col for col in merged.columns if not is_numerical(col)]
categorical_features
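
As an alternative to the dtype check above, pandas can select columns by dtype directly. The following is a minimal sketch on a small toy frame; `toy` and all of its values are made up for illustration and are not taken from "merged.csv".

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `merged`; the values are invented for illustration.
toy = pd.DataFrame({
    "area_live_finished": [1200.0, 1850.0, 950.0],
    "flag_tub": [True, np.nan, np.nan],
    "transactiondate": ["2016-01-02", "2016-05-17", "2016-12-30"],
    "logerror": [0.01, -0.03, 0.02],
})

# exclude="number" keeps every column whose dtype is not numeric.
non_numerical = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numerical)
```

Note that `"number"` also covers other numeric widths such as int32, which the float64/int64 check would report as non-numerical, so the two approaches can disagree on such columns.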

<h2> The non-numerical features </h2>

These are: 
['flag_tub', 'zoning_landuse_county', 'zoning_property', 'flag_fireplace', 'tax_delinquency', 'transactiondate']
Let's investigate more what these values are:<br />
<b>flag_tub</b>              --> only "True" and NaN. <br />
<b>zoning_landuse_county</b> --> values like: 0100, 010C, 96, ...<br />
<b>zoning_property</b>       --> values like: 1NR1*, AH RM-CD*, WVRPD4OOOO, ...<br />
<b>flag_fireplace</b>        --> only "True" and NaN.<br />
<b>tax_delinquency</b>       --> only "Y" and NaN<br />
<b>transactiondate</b>       --> date from "2016-01-02" to "2016-12-30" <br />

<h2> So... </h2>
These features need special treatment. We could simply map them to integers (e.g. True --> 1 and NaN --> 0, or 1NR1* --> 1, WVRPD4OOOO --> 2, etc.) using a <b>dict</b>. 
However, this may not be appropriate, because the resulting numbers "won't make sense" as quantities. For example, the mean value of such a feature is meaningless: if the codes range from 1 to 10, a mean of 5.5 says nothing about the underlying categories. Therefore, a different approach may be needed.
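
One possible "different approach" (an assumption here, not something this notebook settles on) is one-hot encoding, where every category gets its own 0/1 column. The sketch below uses a made-up toy frame and the hypothetical prefix `zone`; the mean of each resulting column is then simply the share of rows in that category, which is a meaningful number.

```python
import pandas as pd

# Toy data; the zone strings mirror the examples above, the rows are invented.
toy = pd.DataFrame({"zoning_property": ["1NR1*", "AH RM-CD*", "1NR1*", "WVRPD4OOOO"]})

# One 0/1 column per distinct zone.
one_hot = pd.get_dummies(toy["zoning_property"], prefix="zone", dtype=int)
print(one_hot)

# Column means are the fraction of rows that belong to each zone.
print(one_hot.mean())
```

With a feature that has many distinct values this produces many columns, so it may not be practical for every feature here.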

<h2> Analysis for "flag_tub" </h2>

The valid values for this feature are "True" and "NaN", which is enough to make it a binary feature.

From the results we can see that the True values make up only 2.62% of the total samples, so it is not a major factor. Nevertheless, the True values have a lower mean logerror (although the std is higher), so we could try to take this feature into account in our final model.

In [None]:
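# Let's investigate "flag_tub". [True, NaN] --> [1, 0]. Warning! Ugly code
# The result is assigned back to the column: an in-place call on merged["flag_tub"]
# may operate on a temporary copy and leave merged unchanged in newer pandas versions.
merged["flag_tub"] = merged["flag_tub"].replace(to_replace=True, value=1)

# We fill the NaNs here because "flag_tub" only has the value True, so we make it binary.
# If it also had the value False, then True and False would form the binary data.
# The cast to int64 lets plot_feature_logerror treat the column as numerical.
merged["flag_tub"] = merged["flag_tub"].fillna(value=0).astype("int64")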

test_column = merged[['flag_tub','logerror']]
total_samples = test_column.count().get("logerror")

# mean and std for flag_tub == 1
mean = test_column.where(test_column["flag_tub"] > 0).dropna().mean().get("logerror")
std = test_column.where(test_column["flag_tub"] > 0).dropna().std().get("logerror")
samples = test_column.where(test_column["flag_tub"] > 0).dropna().count().get("logerror")
#for the above there must be a more elegant way but this is not the point here
print("For flag_tub == 'True', the logerror's mean and std are: {0:06.5f} , {1:06.5f}. Total samples ratio {2:04.2f}%"\
      .format(mean, std, 100.*samples/total_samples))

# mean and std for flag_tub == 0
mean = test_column.where(test_column["flag_tub"] < 1).dropna().mean().get("logerror")
std = test_column.where(test_column["flag_tub"] < 1).dropna().std().get("logerror")
samples = test_column.where(test_column["flag_tub"] < 1).dropna().count().get("logerror")
#for the above there must be a more elegant way but this is not the point here
print("For flag_tub == 'NaN', the logerror's mean and std are: {0:06.5f} , {1:06.5f}. Total samples ratio {2:04.2f}%"\
      .format(mean, std, 100.*samples/total_samples))
# Let's plot it
plot_feature_logerror("flag_tub")
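
The per-group mean, std and sample ratio computed above can also be obtained in one step with `groupby`. This is only a sketch: `summarize_by_flag` is a hypothetical helper, and the toy frame holds invented numbers rather than values from "merged.csv".

```python
import pandas as pd

def summarize_by_flag(df, flag, target="logerror"):
    """Return mean, std, count and sample share of `target` for each value of `flag`."""
    stats = df.groupby(flag)[target].agg(["mean", "std", "count"])
    # Share of the total number of samples that falls into each group, in percent.
    stats["share_pct"] = 100.0 * stats["count"] / stats["count"].sum()
    return stats

# Toy data: a binary flag (already converted to 1/0) and a made-up logerror.
toy = pd.DataFrame({
    "flag_tub": [1, 0, 0, 0, 1, 0],
    "logerror": [0.02, -0.01, 0.03, 0.00, 0.04, -0.02],
})
summary = summarize_by_flag(toy, "flag_tub")
print(summary)
```

The same helper would apply unchanged to any other binary column, for example `summarize_by_flag(merged, "flag_fireplace")`, assuming the column has already been converted to 1/0.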

<h2> Analysis for "flag_fireplace" </h2>

The same steps as for "flag_tub".
Here the True values are only 0.25% of the data, a very small fraction. We probably shouldn't take this feature into account, but we observe the following: if there is a fireplace (True), the mean logerror is larger and the std is smaller. This suggests that a fireplace goes with a larger logerror, and the small std makes the observation look more consistent. But again, the True values are only 0.25% of the whole data, so the conclusion is tentative.

In [None]:
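# Let's investigate "flag_fireplace". [True, NaN] --> [1, 0]. Warning! Ugly code
# Assign the result back rather than replacing in place on the selected column.
merged["flag_fireplace"] = merged["flag_fireplace"].replace(to_replace=True, value=1)

# We fill the NaNs here because "flag_fireplace" only has the value True, so we make it binary.
# If it also had the value False, then True and False would form the binary data.
# The cast to int64 lets plot_feature_logerror treat the column as numerical.
merged["flag_fireplace"] = merged["flag_fireplace"].fillna(value=0).astype("int64")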

test_column = merged[['flag_fireplace','logerror']]
total_samples = test_column.count().get("logerror")

# mean and std for flag_fireplace == 1
mean = test_column.where(test_column["flag_fireplace"] > 0).dropna().mean().get("logerror")
std = test_column.where(test_column["flag_fireplace"] > 0).dropna().std().get("logerror")
samples = test_column.where(test_column["flag_fireplace"] > 0).dropna().count().get("logerror")
#for the above there must be a more elegant way but this is not the point here
print("For flag_fireplace == 'True', the logerror's mean and std are: {0:06.5f} , {1:06.5f}. Total samples ratio {2:04.2f}%"\
      .format(mean, std, 100.*samples/total_samples))

# mean and std for flag_fireplace == 0
mean = test_column.where(test_column["flag_fireplace"] < 1).dropna().mean().get("logerror")
std = test_column.where(test_column["flag_fireplace"] < 1).dropna().std().get("logerror")
samples = test_column.where(test_column["flag_fireplace"] < 1).dropna().count().get("logerror")
#for the above there must be a more elegant way but this is not the point here
print("For flag_fireplace == 'NaN', the logerror's mean and std are: {0:06.5f} , {1:06.5f}. Total samples ratio {2:04.2f}%"\
      .format(mean, std, 100.*samples/total_samples))
# Let's plot it
plot_feature_logerror("flag_fireplace")
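
To judge how far a mean computed from such a small group can be trusted, one rough option is the standard error of the mean, std / sqrt(n). The sketch below uses randomly generated numbers, not the notebook's data, and `standard_error` is a hypothetical helper. It illustrates that two groups with the same spread can have very different uncertainty in their means once the group sizes differ.

```python
import numpy as np

def standard_error(values):
    """Standard error of the mean: sample std (ddof=1) divided by sqrt(n)."""
    values = np.asarray(values, dtype=float)
    return values.std(ddof=1) / np.sqrt(values.size)

# Synthetic data: both groups are drawn from the same distribution.
rng = np.random.default_rng(0)
small_group = rng.normal(loc=0.0, scale=0.1, size=25)       # stands in for the few rows with the flag set
large_group = rng.normal(loc=0.0, scale=0.1, size=10_000)   # stands in for the rest of the data

print(standard_error(small_group), standard_error(large_group))
```

A small std within a tiny group therefore does not by itself make the group's mean reliable; the number of samples matters as well.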

<h2> Analysis for "tax_delinquency" </h2>

Similar analysis to "flag_fireplace".
Here again the value "Y" makes the logerror harder to pin down: it increases both the mean and the std, which makes the "guess" for the correct logerror very difficult.

In [None]:
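# Let's investigate "tax_delinquency". ["Y", NaN] --> [1, 0]. Warning! Ugly code
# Assign the result back rather than replacing in place on the selected column.
merged["tax_delinquency"] = merged["tax_delinquency"].replace(to_replace="Y", value=1)

# We fill the NaNs here because "tax_delinquency" only has the value "Y", so we make it binary.
# If it also had the value "N", then "Y" and "N" would form the binary data.
# The cast to int64 lets plot_feature_logerror treat the column as numerical.
merged["tax_delinquency"] = merged["tax_delinquency"].fillna(value=0).astype("int64")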

test_column = merged[['tax_delinquency','logerror']]
total_samples = test_column.count().get("logerror")

# mean and std for tax_delinquency == 1
mean = test_column.where(test_column["tax_delinquency"] > 0).dropna().mean().get("logerror")
std = test_column.where(test_column["tax_delinquency"] > 0).dropna().std().get("logerror")
samples = test_column.where(test_column["tax_delinquency"] > 0).dropna().count().get("logerror")
#for the above there must be a more elegant way but this is not the point here
print("For tax_delinquency == 'Y', the logerror's mean and std are: {0:06.5f} , {1:06.5f}. Total samples ratio {2:04.2f}%"\
      .format(mean, std, 100.*samples/total_samples))

# mean and std for tax_delinquency == 0
mean = test_column.where(test_column["tax_delinquency"] < 1).dropna().mean().get("logerror")
std = test_column.where(test_column["tax_delinquency"] < 1).dropna().std().get("logerror")
samples = test_column.where(test_column["tax_delinquency"] < 1).dropna().count().get("logerror")
#for the above there must be a more elegant way but this is not the point here
print("For tax_delinquency == 'NaN', the logerror's mean and std are: {0:06.5f} , {1:06.5f}. Total samples ratio {2:04.2f}%"\
      .format(mean, std, 100.*samples/total_samples))
# Let's plot it
plot_feature_logerror("tax_delinquency")

<h2> Analysis for "zoning_property" </h2>

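This feature contains many distinct strings (zones), and we want to map them to incrementing integers. The final plot shows that we got legitimate data!!! Enjoy. Keep in mind that the integer order is arbitrary, so positions along the x axis carry no meaning of their own.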

In [None]:
# Let's investigate "zoning_property". Warning! Ugly code
# We want to map every unique string in "zoning_property" to an incrementing integer.
temp_df = merged[["zoning_property", "logerror"]].dropna()  # copy of the two columns without NaN rows

possible_values = temp_df["zoning_property"].drop_duplicates()
# Now that we have all possible_values (e.g. 1NR1, AH RM-CD, WVRPD4OOOO) we want to convert them to integers,
# starting from 1 up to len(possible_values). Don't use 0 because in the future we might want to map NaN
# values to it. TODO

zone_codes = dict(zip(possible_values, range(1, len(possible_values) + 1)))
# Series.map does one dictionary lookup per row. Series.replace with the list of values
# (or the equivalent nested dict) also works, but runs very slowly.
# Values without an entry in zone_codes, such as NaN, come out as NaN.
merged["zoning_property"] = merged["zoning_property"].map(zone_codes)

# If you want to fill NaN with the value 0, uncomment the following line of code
#merged["zoning_property"] = merged["zoning_property"].fillna(value=0)


# Let's see our legitimate data
plot_feature_logerror("zoning_property")

del temp_df
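
For turning strings into integer codes, pandas also offers `pd.factorize`, which assigns codes in order of first appearance. The following is a minimal sketch on an invented toy frame; the column name `zoning_property_code` is hypothetical. Unlike the cell above, it maps NaN to 0 right away, which matches the idea in the TODO of reserving 0 for missing values.

```python
import numpy as np
import pandas as pd

# Toy data; the zone strings mirror the examples in this notebook, the rows are invented.
toy = pd.DataFrame({"zoning_property": ["1NR1*", np.nan, "AH RM-CD*", "1NR1*", "WVRPD4OOOO"]})

# factorize returns integer codes starting at 0; missing values get the code -1.
codes, uniques = pd.factorize(toy["zoning_property"])

# Shift by one so that real zones start at 1 and NaN lands on 0.
toy["zoning_property_code"] = codes + 1
print(toy)
print(list(uniques))
```

`uniques` keeps the original strings, so a code `k >= 1` can be translated back with `uniques[k - 1]`.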

