<h1> This is a further "experimental" EDA of the "merged.csv" file. </h1>

Before jumping to conclusions, it would be nice to play with the data a bit more and see how the features relate to each other.

First, let's load the file:

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')  # Silence warnings to keep the notebook output readable

merged = pd.read_csv('data/merged.csv')

<h2> Define a function that plots a simple scatter between a feature and "logerror"</h2>

At first glance, the `"area_live_finished - logerror"` plot shows that as "area_live_finished" grows very large, the logerror converges to 0.

In [None]:
def plot_feature_logerror(feature, df=merged):
    """This function plots a scatter x-y style with the feature in xAxis and logerror in yAxis."""
    def is_numerical():
        return df[feature].dtype in ['float64', 'int64']
    def is_label():
        return feature in ['logerror', 'ID']
    def is_constant():
        return df[feature].max() == df[feature].min()
    
    if not (is_numerical() and not is_label() and not is_constant()):
        return
    plt.scatter(df[feature].values, df["logerror"].values)
    plt.xlabel(feature); plt.ylabel("logerror"); plt.title(feature + " - logerror")
    plt.show()
    

# The loop below plots every feature that can be plotted.
for feature in merged.columns:
    plot_feature_logerror(feature)

<h2> Find all non numerical columns </h2>


In [None]:
def is_numerical(feature, df=merged):
    return df[feature].dtype in ['float64', 'int64']
        
categorical_features = [col for col in merged.columns if not is_numerical(col)]
categorical_features
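
As a cross-check, pandas can list such columns directly with `DataFrame.select_dtypes`. The sketch below runs on a small made-up frame (`toy` is hypothetical and its values are invented, it is not the real "merged.csv"). Note that it excludes every numeric dtype, not only `float64` and `int64`, so on the real data the result could differ slightly from the function above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for merged.csv; all values are invented for illustration.
toy = pd.DataFrame({
    "area_live_finished": [1200.0, 850.0, 2300.0],
    "flag_tub": [True, np.nan, np.nan],
    "tax_delinquency": ["Y", np.nan, np.nan],
    "logerror": [0.01, -0.03, 0.002],
})

# Keep only the columns whose dtype is not numeric.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```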

<h2> So we found the non-numerical columns! </h2>

These are: 
['flag_tub', 'zoning_landuse_county', 'zoning_property', 'flag_fireplace', 'tax_delinquency', 'transactiondate']
Let's investigate more what these values are:<br />
<b>flag_tub</b>              --> only "True" and NaN. <br />
<b>zoning_landuse_county</b> --> values like: 0100, 010C, 96, ...<br />
<b>zoning_property</b>       --> values like: 1NR1*, AH RM-CD*, WVRPD4OOOO, ...<br />
<b>flag_fireplace</b>        --> only "True" and NaN.<br />
<b>tax_delinquency</b>       --> only "Y" and NaN<br />
<b>transactiondate</b>       --> date from "2016-01-02" to "2016-12-30" <br />

<h2> So... </h2>
These features need special treatment. We could simply map them to integers (e.g. True --> 1 and NaN --> 0, or 1NR1* --> 1, WVRPD4OOOO --> 2 etc...) using a <b>dict</b>. 
However, this may not be appropriate because the resulting numbers "won't make sense". For example, the mean value of such a feature is a meaningless number (imagine that the codes range from 0 to 10 and the mean is 5.5: it says nothing about the categories). Therefore, a different approach may be needed.
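
To illustrate why the mean of such codes is irrelevant, the sketch below encodes the same made-up column with two equally arbitrary dictionaries. The labels are borrowed from the examples above, but the series itself and both mappings (`mapping_a`, `mapping_b`) are hypothetical.

```python
import pandas as pd

# Hypothetical zoning labels, invented for illustration.
zones = pd.Series(["1NR1*", "AH RM-CD*", "AH RM-CD*", "WVRPD4OOOO"])

# Two equally valid label-to-integer dictionaries.
mapping_a = {"1NR1*": 1, "AH RM-CD*": 2, "WVRPD4OOOO": 3}
mapping_b = {"1NR1*": 3, "AH RM-CD*": 1, "WVRPD4OOOO": 2}

# The same data gives a different mean under each encoding.
mean_a = zones.map(mapping_a).mean()
mean_b = zones.map(mapping_b).mean()
print(mean_a, mean_b)
```

Since the mean changes with the arbitrary choice of codes, it carries no information about the zones themselves.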

<h2> Treating binary features </h2>
A quick inspection shows that many of these features are essentially binary. However, different notations 
are used for the True and False values. Let's unify the notation and study their distribution. Lastly,
we would like to include these features in an ML model, which will probably be restricted to numerical values.
We should therefore convert them to a [0, 1] representation.

In [None]:
def treat_binary(feature, oldTrue = True, oldFalse = None):
    """ 
    This function will display and plot statistics regarding any Binary feature of the merged table.
    It will also transform various binary notation to a [0, 1] representation
    """
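    # Assign the result back instead of using inplace=True on a column selection,
    # which may act on a temporary copy in recent pandas versions.
    merged[feature] = merged[feature].replace(to_replace=oldTrue, value=1)
    if oldFalse is None:
        merged[feature] = merged[feature].fillna(value=0)
    else:
        merged[feature] = merged[feature].replace(to_replace=oldFalse, value=0)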
    
    true_part = merged[merged[feature] == 1]["logerror"]
    false_part = merged[merged[feature] == 0]["logerror"]
    true_ratio = len(true_part) * 100 / len(merged)
    false_ratio = 100 - true_ratio
    
    mean_true, mean_false = true_part.mean(), false_part.mean()
    std_true, std_false = true_part.std(), false_part.std()
    
    def print_stats(value = True):
        mean = mean_true if value else mean_false
        std = std_true if value else std_false
        ratio = true_ratio if value else false_ratio
        print("For {0} == {1}, the logerror's mean and std are: {2:06.5f} , {3:06.5f}. Total samples ratio {4:04.2f}%"
              .format(feature, value, mean, std, ratio)) 
    print_stats()
    print_stats(False)
    plot_feature_logerror(feature)
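
The same statistics can also be obtained with a `groupby`. The sketch below is only an illustration on made-up data (`toy` and its values are hypothetical), and it assumes that every non-missing entry means True, which matches the "only True and NaN" columns described above but not columns with an explicit False notation.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a flag stored as True/NaN plus a target column.
toy = pd.DataFrame({
    "flag_tub": [True, np.nan, np.nan, True, np.nan],
    "logerror": [0.02, -0.01, 0.03, 0.04, 0.01],
})

# Unify the notation: any non-missing entry -> 1, missing -> 0.
toy["flag_tub"] = toy["flag_tub"].notna().astype(int)

# Mean, std and sample count of logerror for each flag value.
stats = toy.groupby("flag_tub")["logerror"].agg(["mean", "std", "count"])
# Share of all samples that falls into each group, in percent.
stats["ratio_pct"] = 100 * stats["count"] / len(toy)
print(stats)
```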

<h2> Analysis for "flag_tub" </h2>

The valid values for this feature are "True" and "NaN". These are sufficient to make it a binary feature.

From the results we can see that the True values make up only 2.62% of the total samples, so this is not a major factor. Nevertheless, the True values have a lower mean logerror (although the std is higher). So, we could try to take this feature into account in our final model.

In [None]:
treat_binary("flag_tub")

<h2> Analysis for "flag_fireplace" </h2>

Same treatment as for "flag_tub".
Here the True values make up 0.25% of the samples, a very small fraction. We probably shouldn't take this feature into account, but we observe the following: if there is a fireplace (True), then the mean logerror is bigger and the std is smaller. This means that houses with a fireplace have a bigger logerror, and the small std suggests that this observation is more consistent. But again, the True values are only 0.25% of the whole data, so these conclusions may be unreliable.

In [None]:
treat_binary("flag_fireplace")

<h2> Analysis for "tax_delinquency" </h2>

Similar analysis to "flag_fireplace".
We can observe here that the value "Y" again makes the logerror harder to predict. It increases both the mean value and the std, which makes the "guess" for the correct logerror very difficult. 

In [None]:
treat_binary("tax_delinquency", oldTrue="Y")

<h2> Treating non binary features </h2>

A quick inspection shows that many of these features are categorical, with strings or integers that denote types (like IDs). So we are going to define a function that takes a categorical feature and maps its categories to integer serial numbers.

In [None]:
def treat_categorial(feature, correspond_NaN = True):
    """
    This function will display and plot statistics regarding any categorical feature of the merged table.
    It will convert strings or integers that denote types (like IDs) to integer serial numbers
    [1, 2, ..., number_of_different_categories]. The serial numbers start from 1. The number 0 is
    reserved for NaN, and the user can convert NaN to 0 or leave it as NaN with the flag correspond_NaN.
    """
    # Collect each distinct non-NaN category once, in order of first appearance
    # (e.g. 1NR1*, AH RM-CD*, WVRPD4OOOO, ...).
    possible_values = merged[feature].dropna().unique()

    # Map every category to a serial integer starting at 1. Don't use 0 because it is reserved for NaN.
    mapping = {value: code for code, value in enumerate(possible_values, start=1)}

    # Series.map with a dict is much faster than Series.replace with long lists.
    # NaN entries are not in the mapping and therefore stay NaN.
    merged[feature] = merged[feature].map(mapping)

    # Fill NaN with the value 0 according to the correspond_NaN flag
    if correspond_NaN:
        merged[feature] = merged[feature].fillna(value=0)

    plot_feature_logerror(feature)
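
For comparison, `pd.factorize` produces the same kind of numbering in a single call. This is only a sketch on made-up labels (the `zones` series is hypothetical), not a replacement for the function above: `factorize` numbers the categories from 0 in order of appearance and marks NaN as -1, so adding 1 gives the "0 is reserved for NaN" scheme.

```python
import numpy as np
import pandas as pd

# Hypothetical zoning labels with a missing value, invented for illustration.
zones = pd.Series(["1NR1*", "AH RM-CD*", np.nan, "1NR1*", "WVRPD4OOOO"])

# factorize labels categories 0..n-1 in order of appearance and marks NaN as -1.
codes, uniques = pd.factorize(zones)

# Shift by one so that NaN becomes 0 and the categories start at 1.
serial = pd.Series(codes + 1, index=zones.index)
print(serial.tolist())
```

The `uniques` array keeps the original labels, so the integers can be translated back to zone names later if needed.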

<h2> Analysis for "zoning_property" </h2>

This feature contains a lot of strings (zones), and we want to map them to incrementing integers. The final plot shows that we got legitimate data!

In [None]:
treat_categorial("zoning_property")