<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Prep" data-toc-modified-id="Prep-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Prep</a></span></li><li><span><a href="#Housing-price-distribution" data-toc-modified-id="Housing-price-distribution-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Housing price distribution</a></span></li><li><span><a href="#Numerical-Data-Distribution" data-toc-modified-id="Numerical-Data-Distribution-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Numerical Data Distribution</a></span></li><li><span><a href="#Correlation" data-toc-modified-id="Correlation-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Correlation</a></span></li><li><span><a href="#Feature-to-feature-relationship" data-toc-modified-id="Feature-to-feature-relationship-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Feature to feature relationship</a></span></li><li><span><a href="#Q--&gt;-Q-(Quantitative-to-Quantitative-relationship)" data-toc-modified-id="Q-->-Q-(Quantitative-to-Quantitative-relationship)-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Q -&gt; Q (Quantitative to Quantitative relationship)</a></span></li><li><span><a href="#C--&gt;-Q-(Categorical-to-Quantitative-relationship)" data-toc-modified-id="C-->-Q-(Categorical-to-Quantitative-relationship)-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>C -&gt; Q (Categorical to Quantitative relationship)</a></span></li></ul></div>

__File Info:__

Date: 20181023

Author: Stephanie Langeland 

File Name: 04_EDA_tutorial.ipynb

Version: 01

Previous Version/File: None

Dependencies: Data dictionary: ".\data_info\data_description.txt"

Purpose: Detailed exploratory data analysis with Python

Input File(s): train.csv

Output File(s): None

Required by: 
 - A beginner's guide to Python.
 - Tutorial: https://www.kaggle.com/ekami66/detailed-exploratory-data-analysis-with-python/notebook

Status: Complete

Machine: Dell Latitude - Windows 10

Python Version: Python 3

# Prep

In [None]:
## Import packages:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import operator ## to use Standard operators as functions

In [None]:
# Comment this out if the data visualizations don't work on your side
%matplotlib inline

plt.style.use('bmh')

In [None]:
## Import training data:
df = pd.read_csv("C:/Users/stephanie.langeland/OneDrive - Slalom/bit_bucket/a_beginners_guide_to_python/input_output_files/train.csv")

df.head()

In [None]:
df.info()

Let's remove `Id` and every feature that is mostly missing, keeping only the columns in which at least 30% of the values are non-NaN:

In [None]:
df2 = df[ ## new dataframe holding only the selected columns
    [
        column for column in df if df[
            column
        ].count() / len(df) >= 0.3
    ]
] ## keep columns in df where at least 30% of the values are non-NaN

del df2['Id'] ## delete this column

print(
    "List of dropped columns:",
    end = " "
) 

for c in df.columns:
    if c not in df2.columns:
        print(
            c,
            end = ", "
        )
print("\n") ## list the previously identified columns

df = df2 ## overwrite df as df2
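
The same column filter can also be written in terms of the share of non-missing values per column. The sketch below runs on a small made-up frame (the column names `mostly_missing` and `mostly_present` are illustrative and do not come from train.csv) and assumes the 0.3 threshold used above:

```python
import numpy as np
import pandas as pd

# Toy data, not the housing data: one sparse column, one dense column
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0],  # 20% non-NaN
    "mostly_present": [1.0, 2.0, np.nan, 4.0, 5.0],           # 80% non-NaN
})

# notna() marks present values; the column mean is the non-NaN share
non_null_share = toy.notna().mean()

# Keep columns with at least 30% non-NaN values, then drop the identifier
kept = toy.loc[:, non_null_share >= 0.3].drop(columns="Id")

dropped = [c for c in toy.columns if c not in kept.columns]
print("Dropped columns:", dropped)
```

Printing `non_null_share` before filtering is a quick way to check how close each column sits to the threshold.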

# Housing price distribution

In [None]:
print(
    df["SalePrice"].describe()
)

plt.figure(
    figsize = (9, 8)
)

sns.distplot(
    df["SalePrice"],
    color = "g",
    bins = 100,
    hist_kws = {"alpha": 0.4}
)

# Numerical Data Distribution

For this part, let's look at the distribution of all of the features by plotting them.  

To do so, let's first list all the data types in our data set and keep only the numerical ones:

In [None]:
list(
    set(
        df.dtypes.tolist()
    )
)

In [None]:
df_num = df.select_dtypes(include = ["float64", "int64"])

df_num.head()
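
Listing the dtypes first matters because `select_dtypes` matches the named dtypes exactly. As a minimal sketch on made-up data (the columns `area`, `rooms` and `zone` are hypothetical), an `int32` column is skipped by `["float64", "int64"]` but picked up by the generic `"number"` selector:

```python
import pandas as pd

# Toy frame with three different dtypes
toy = pd.DataFrame({
    "area": pd.Series([850.0, 1200.5], dtype="float64"),
    "rooms": pd.Series([3, 4], dtype="int32"),
    "zone": ["RL", "RM"],
})

# Exact dtype names: only float64 and int64 columns are selected
explicit = toy.select_dtypes(include=["float64", "int64"])

# "number" selects every numeric dtype, whatever its width
generic = toy.select_dtypes(include="number")

print(explicit.columns.tolist())
print(generic.columns.tolist())
```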

Plot all numerical features' distributions:

In [None]:
df_num.hist(
    figsize = (16, 20),
    bins = 50,
    xlabelsize = 8,
    ylabelsize = 8
); ## use the ";" to avoid verbose matplotlib information displayed 

# Correlation

Now we'll try to find which features are strongly correlated with `SalePrice`. We'll store them in a var called `golden_features_list`. We'll reuse our `df_num` data set to do so.

In [None]:
df_num.head()

In [None]:
df_num_corr = df_num.corr()["SalePrice"][:-1] ## correlations of each variable with SalePrice (which is the last column)

golden_features_list = df_num_corr[
    abs(
        df_num_corr
    ) > 0.5 ## identify which variables have a correlation of more than 0.5 with SalePrice
].sort_values(
    ascending = False ## sort them in descending order
)

print(
    "There are {} strongly correlated values with SalePrice:\n{}".format(
        len(
            golden_features_list
        ), ## insert this length in the {} after "There are" above
        golden_features_list ## display this object 
    )
)
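
The filter above can be tried on a tiny made-up frame to see what it keeps. This is only a sketch: the columns `pos`, `neg` and `flat` are hypothetical, and the target is dropped by name rather than by position so the result does not depend on column order:

```python
import pandas as pd

# Toy data: one positively related, one negatively related, one unrelated column
toy = pd.DataFrame({
    "pos": [1, 2, 3, 4, 5],
    "neg": [5, 4, 3, 2, 1],
    "flat": [1, -1, 0, -1, 1],
    "SalePrice": [2, 4, 6, 8, 10],
})

# Correlation of every column with the target, without the target itself
corr_with_target = toy.corr()["SalePrice"].drop("SalePrice")

# Keep features whose absolute correlation exceeds 0.5, strongest positive first
strong = corr_with_target[corr_with_target.abs() > 0.5].sort_values(ascending=False)

print(strong)
```

Filtering on the absolute value keeps strongly negative features such as `neg`, which a plain `> 0.5` test would discard.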

Explore the effect of outliers on the above correlations:

 - Plot the numerical features - which variables have very few or explainable outliers?

 - Remove the outliers from these features - which variables still have strong correlations with SalePrice without outliers?

In [None]:
for i in range(
    0, ## start
    len(df_num.columns), ## stop
    5 ## step
):
    sns.pairplot(
        data = df_num,
        x_vars = df_num.columns[i:i + 5], ## step by 5
        y_vars = ["SalePrice"]
    )
    

We see many points at x = 0 in various graphs, which denotes the absence of that feature in a home, e.g. fireplaces, pool area, etc. Remove these 0 values and redo the correlation values:
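
To see why the zeros matter, here is a minimal sketch on made-up numbers (the columns `feature` and `price` are hypothetical). Homes without the feature sit at 0 with unrelated prices, which can hide a relationship that holds among the homes that do have it:

```python
import pandas as pd

# Toy data: three homes without the feature (0) and three with it
toy = pd.DataFrame({
    "feature": [0, 0, 0, 10, 20, 30],
    "price": [100, 200, 150, 110, 120, 130],
})

# Correlation over all rows, zeros included
full_corr = toy["feature"].corr(toy["price"])

# Correlation over the rows where the feature is present
nonzero = toy[toy["feature"] != 0]
nonzero_corr = nonzero["feature"].corr(nonzero["price"])

print("with zeros:   ", full_corr)
print("without zeros:", nonzero_corr)
```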

In [None]:
individual_features_df = [] ## create an empty list 

for i in range(
    0, ## start
    len(df_num.columns) - 1 ## stop; all columns except SalePrice which is the last column
):
    tmpDf = df_num[
        [
            df_num.columns[i],
            "SalePrice"
        ]
    ] ## pair the i-th feature with SalePrice
    
    tmpDf = tmpDf[
        tmpDf[
            df_num.columns[i]
        ] != 0
    ] ## keep only the rows where that feature is not 0
    
    individual_features_df.append(tmpDf)
    

all_correlations = {
    feature.columns[0]: feature.corr()["SalePrice"].iloc[0] for feature in individual_features_df
} ## each variable's correlation with SalePrice; .iloc[0] selects the feature's row by position


all_correlations = sorted(
    all_correlations.items(), ## items() method returns a view obj that displays a list of dictionary's (key, value) tuple pairs
    key = operator.itemgetter(1) ##  operator.itemgetter(1): A function that grabs the nth (here is 1) item from a list-like obj
) 


for (key, value) in all_correlations:
    print(
        "{:>15}: {:>15}".format(
            key, ## feature name, right-aligned in 15 characters
            value ## that feature's correlation with SalePrice
        )
    )
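
The sorting step can be looked at in isolation. In this sketch the feature names and values are invented for illustration only; `operator.itemgetter(1)` picks the second element of each `(key, value)` tuple, so the pairs are ordered by value:

```python
import operator

# Made-up correlations, for illustration only
correlations = {"feature_a": 0.62, "feature_b": -0.01, "feature_c": 0.79}

# Ascending by value, as in the cell above
by_value = sorted(correlations.items(), key=operator.itemgetter(1))

# reverse=True puts the largest value first instead
largest_first = sorted(
    correlations.items(), key=operator.itemgetter(1), reverse=True
)

for name, value in by_value:
    print("{:>15}: {:>8.3f}".format(name, value))
```

Note that `sorted` returns a list of tuples, so after this step the result is no longer a dictionary.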

Now our golden_features_list var looks like this:

In [None]:
golden_features_list = [
    key for key, value in all_correlations if abs(value) >= 0.5
]

print(
    "There are {} strongly correlated values with SalePrice:\n{}".format(
        len(golden_features_list),
        golden_features_list
    )
)

Therefore, there are 11 features that are strongly correlated with SalePrice.

# Feature to feature relationship

Rather than plotting all of the numerical features against each other using seaborn (which is very time consuming and difficult to interpret), explore whether some variables have relationships with each other:

In [None]:
corr = df_num.drop("SalePrice", axis = 1).corr() ## drop SalePrice because we already explored it above

plt.figure(figsize = (12, 10))

sns.heatmap(
    corr[
        (corr >= 0.5) | (corr <= -0.4) ## only plot these correlations
    ],
    cmap = "viridis", ## color map
    vmax = 1.0, ## value to anchor the top of the color map
    linewidths = 0.1, 
    annot = True, ## if True, write the data value in each cell
    annot_kws = {"size": 8}, ## keyword arguments for the annotation text (font size here)
    square = True ## if True, set the Axes aspect to "equal" so each cell is square-shaped
);
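
Indexing a dataframe with a boolean dataframe, as done for the heatmap, keeps the matching cells and turns every other cell into NaN, which the heatmap then leaves blank. A minimal sketch on a hand-written 3 x 3 matrix (the labels `a`, `b`, `c` and the values are made up):

```python
import pandas as pd

# Hand-written symmetric matrix standing in for a correlation matrix
corr = pd.DataFrame(
    [[1.0, 0.7, -0.1],
     [0.7, 1.0, -0.45],
     [-0.1, -0.45, 1.0]],
    index=["a", "b", "c"],
    columns=["a", "b", "c"],
)

# Cells outside the two thresholds become NaN
masked = corr[(corr >= 0.5) | (corr <= -0.4)]

print(masked)
```

The thresholds are asymmetric (0.5 on the positive side, -0.4 on the negative side), so a correlation of -0.45 stays visible while +0.45 would not.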

# Q -> Q (Quantitative to Quantitative relationship)

Examine the quantitative features of our dataframe and how they relate to the `SalePrice`.

Separate the categorical from quantitative features (refer to the following data dictionary: "C:\Users\stephanie.langeland\OneDrive - Slalom\Misc\Personal\Coding Reference Files\Python\a_beginners_guide_to_python\data_info\data_description.txt"):

In [None]:
quantitative_features_list = [
    'LotFrontage', 
    'LotArea', 
    'MasVnrArea', 
    'BsmtFinSF1', 
    'BsmtFinSF2', 
    'TotalBsmtSF', 
    '1stFlrSF',
    '2ndFlrSF', 
    'LowQualFinSF', 
    'GrLivArea', 
    'BsmtFullBath', 
    'BsmtHalfBath', 
    'FullBath', 
    'HalfBath',
    'BedroomAbvGr', 
    'KitchenAbvGr', 
    'TotRmsAbvGrd', 
    'Fireplaces', 
    'GarageCars', 
    'GarageArea', 
    'WoodDeckSF', 
    'OpenPorchSF', 
    'EnclosedPorch', 
    '3SsnPorch', 
    'ScreenPorch', 
    'PoolArea', 
    'MiscVal', 
    'SalePrice'
]


df_quantitative_values = df[quantitative_features_list] ## subset this list of columns

df_quantitative_values.head()

Look at the strongly correlated features:

In [None]:
features_to_analyse = [
    x for x in quantitative_features_list if x in golden_features_list ## extract variables in both lists
]


features_to_analyse.append("SalePrice") ## add this variable name to the list we created

features_to_analyse

Distribution of `features_to_analyse`:

In [None]:
fig, ax = plt.subplots(
    round(
        len(features_to_analyse) / 3
    ),
    3,
    figsize = (18,12)
) ## this creates the blank plots with just the grey background 


for i, ax in enumerate(fig.axes):
    if i < len(features_to_analyse) - 1:
        sns.regplot(
            x = features_to_analyse[i],
            y = "SalePrice",
            data = df[features_to_analyse],
            ax = ax
        ) ## for each variable in features_to_analyse, plot it against SalePrice
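
When a subplot grid is sized from a feature count, `round` can come up short: with 10 plots and 3 columns, `round(10 / 3)` gives 3 rows, or only 9 axes. The grid above works because one entry (`SalePrice`) is not plotted. As a general sketch, rounding up always leaves enough axes; the helper name `grid_rows` is hypothetical:

```python
import math


def grid_rows(n_plots, n_cols=3):
    """Return the number of rows needed to fit n_plots in n_cols columns."""
    return math.ceil(n_plots / n_cols)


for n in (9, 10, 11):
    print(n, "plots ->", grid_rows(n), "rows;", "round gives", round(n / 3))
```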

# C -> Q (Categorical to Quantitative relationship)

Subset the categorical variables: remove the features in `quantitative_features_list` from the dataframe, then keep only the non-numerical columns:

In [None]:
## keep the columns of df that are not quantitative features:
categorical_features = [
    a for a in df.columns.tolist() ## tolist() converts the column index into a list
    if a not in quantitative_features_list[:-1] ## [:-1] keeps SalePrice, the last entry, in the result
]


df_categ = df[categorical_features]

df_categ.head()

In [None]:
## keep only the non-numerical (object dtype) features:
df_not_num = df_categ.select_dtypes(include = ['O'])


print(
    "There are {} non-numerical features:\n{}".format(
        len(df_not_num.columns),
        df_not_num.columns.tolist()
    )
)

Plot some categorical features:

In [None]:
plt.figure(figsize = (10, 6))

ax = sns.boxplot(
    x = "BsmtExposure",
    y = "SalePrice",
    data = df_categ
)


plt.setp(
    ax.artists, 
    alpha = 0.5,
    linewidth = 2, 
    edgecolor = "k"
)


plt.xticks(rotation = 45)

In [None]:
plt.figure(figsize = (12, 6 ))

ax = sns.boxplot(
    x = "SaleCondition",
    y = "SalePrice",
    data = df_categ
)

plt.setp(
    ax.artists,
    alpha = 0.5,
    linewidth = 2,
    edgecolor = "k"
)

plt.xticks(rotation = 45)
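
A box plot per category can be paired with a numeric summary of the same groups. The sketch below uses made-up rows (the columns `condition` and `price` are hypothetical stand-ins for a categorical feature and `SalePrice`) and reports the count and median per category:

```python
import pandas as pd

# Toy data standing in for one categorical feature and the sale price
toy = pd.DataFrame({
    "condition": ["Normal", "Normal", "Partial", "Partial", "Abnorml"],
    "price": [100, 200, 300, 500, 50],
})

# Count and median per category, highest median first
summary = (
    toy.groupby("condition")["price"]
    .agg(["count", "median"])
    .sort_values("median", ascending=False)
)

print(summary)
```

The count column is worth reading alongside the median, since a box drawn from very few rows says little about its category.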

Distributions of Categorical Features:

In [None]:
import math ## for math.ceil

fig, axes = plt.subplots(
    math.ceil(
        len(df_not_num.columns) / 3
    ), ## round up so that every feature gets its own axes
    3,
    figsize = (12, 30)
)


for i, ax in enumerate(fig.axes):
    if i < len(df_not_num.columns):
        sns.countplot(
            x = df_not_num.columns[i],
            alpha = 0.7,
            data = df_not_num,
            ax = ax
        )
        
        ax.tick_params(
            axis = "x",
            labelrotation = 45
        ) ## rotate the category labels after the plot has been drawn
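
The numbers behind a count plot can be read directly with `value_counts`. This is a minimal sketch on a made-up column (the name `zone` and its values are illustrative). By default missing values are left out, and `dropna=False` counts them as well:

```python
import pandas as pd

# Toy categorical column with one missing value
toy = pd.DataFrame({"zone": ["RL", "RL", "RM", None, "RL"]})

# Default: missing values are excluded from the counts
counts = toy["zone"].value_counts()

# dropna=False adds a row for the missing values
counts_with_missing = toy["zone"].value_counts(dropna=False)

print(counts)
print(counts_with_missing)
```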