# `Project 2: Demonstrating Exploratory Data Analysis`

### <font color='blue'> __Purpose:__</font>
- This project demonstrates why Exploratory Data Analysis (EDA) matters and the kinds of insight it can yield.

### <font color='blue'> __Objectives:__</font>
1. Provide appropriately-labelled graphical visuals to enhance understanding of data.
2. Interpret descriptive statistics.
3. Formulate sound assumptions & justify observations with concrete statistical hypothesis testing.

### <font color='blue'> __Why EDA?__</font>

1. Gives you an opportunity to familiarise yourself with the data.
2. Helps you understand the distribution of the data and detect any unusual behaviour (outliers).
3. Allows you to explore relationships between variables.
4. Lets you test these observed relationships/hypotheses for statistical significance, so as to provide meaningful business insights.
5. Provides preliminary ideas as to how we can potentially employ various Machine Learning models.

### <font color='blue'> __What does it entail?__</font>

1. Get a high-level overview of the distribution of the dataset & detect any outliers.
> Examples: .describe( ) and boxplots
2. Derive & identify meaningful observations/relationships between variables.
> Examples: .corr( ), Seaborn's pairplot & heatmap
3. Test whether the relationships observed are statistically significant.
> Examples: t-test, chi-square test, etc.

### <font color='blue'> __Code is written in Python using the following libraries:__</font>
1. Pandas
2. NumPy
3. SciPy
4. Seaborn
5. Matplotlib

### <font color='blue'> __Problem Statement:__</font>

Imagine you are a data scientist working for a real estate company.
The company's main aim is to purchase houses at a low price that, given their current condition, will not cost much to renovate.
Your team lead has divided the project into 2 parts:
1. Predicting the value of housing prices based on fixed characteristics (i.e. characteristics which cannot be renovated).
2. Identifying renovatable characteristics which bear a lower cost.

You and your team are tasked with working on the _first part_.
<br>The complete dataset can be found at __[Kaggle: House Prices - Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques)__

***
***

In [11]:
import pandas as pd
import numpy as np
import scipy.stats as stats
from scipy.stats import t

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline
%config InlineBackend.figure_format ='retina'

import seaborn as sns
sns.set_style('whitegrid')

from ipywidgets import *
from IPython.display import display

__Note:__
<br> The dataset we are about to import has been "cleaned".
<br> If you would like to see how I carried out the data munging process for this dataset,
please refer to `Project 1: Demonstrating Data Munging`.

In [None]:
# Importing the dataset.


## <font color=pink> __Step 1:__</font>

In [None]:
# Step 1: Get a high-level overview of the dataset.

# Use .describe() and boxplots to check whether each distribution looks normal,
# then explain any outliers: how they are interpreted and dealt with.
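
As a rough illustration of Step 1, the sketch below runs `.describe()` on a single numeric column and flags outliers with the 1.5 × IQR rule, which is the rule a default boxplot uses to draw its fliers. The data is synthetic and the `price` column is a hypothetical stand-in, not the project's dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a numeric column such as a sale price (hypothetical data).
rng = np.random.default_rng(0)
prices = pd.Series(rng.normal(loc=180_000, scale=40_000, size=500), name="price")
prices.iloc[:3] = [900_000, 950_000, 1_000_000]  # inject three obvious outliers

# Summary statistics: count, mean, std, min, quartiles, max.
summary = prices.describe()

# Tukey's rule: points more than 1.5 * IQR beyond the quartiles are flagged.
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = prices[(prices < lower) | (prices > upper)]

# Skewness far from 0 suggests the distribution is not symmetric.
skewness = prices.skew()

print(summary)
print(f"{len(outliers)} outliers outside [{lower:,.0f}, {upper:,.0f}]")
print(f"skewness: {skewness:.2f}")
```

Whether flagged points are dropped, capped, or kept is a judgment call that depends on what they represent; the rule only says they are unusual relative to the bulk of the data.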

In [None]:
# Step 2: Derive & identify meaningful relationships between variables.

# .corr, .pairplot, heatmap
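
A minimal sketch of Step 2, assuming a small synthetic DataFrame: it computes the Pearson correlation matrix, ranks variables by their correlation with a target, and builds the triangular mask that can be passed to a heatmap. The column names (`living_area`, `price`, `lot_noise`) are illustrative and not taken from the project's dataset.

```python
import numpy as np
import pandas as pd

# Synthetic, hypothetical housing-style data; column names are illustrative only.
rng = np.random.default_rng(1)
n = 300
area = rng.normal(1500, 300, n)
df = pd.DataFrame({
    "living_area": area,
    "price": 100 * area + rng.normal(0, 15_000, n),  # built to depend on area
    "lot_noise": rng.normal(0, 1, n),                # built to be unrelated
})

# Pearson correlation matrix (the default method of DataFrame.corr).
corr = df.corr()

# Rank the other variables by absolute correlation with the target.
ranked = corr["price"].drop("price").abs().sort_values(ascending=False)

# Boolean mask covering the diagonal and upper triangle, so that a heatmap
# given this mask shows each pair only once.
mask = np.triu(np.ones_like(corr, dtype=bool))

print(corr.round(2))
print(ranked)
```

Correlation only captures linear association, so a pairplot is still worth inspecting for curved or clustered relationships that a single coefficient would hide.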

In [None]:
# Step 3: Test whether the relationships observed are statistically significant.
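
One possible shape for Step 3, sketched on synthetic data: Welch's t-test for a difference in means between two groups, and a chi-square test of independence on a contingency table. The groups, counts, and the 0.05 threshold are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical example 1: do houses with a garage sell for more on average?
with_garage = rng.normal(200_000, 30_000, 120)
without_garage = rng.normal(170_000, 30_000, 80)

# Welch's t-test (equal_var=False) does not assume equal group variances.
t_stat, t_p = stats.ttest_ind(with_garage, without_garage, equal_var=False)

# Hypothetical example 2: are two categorical variables associated?
table = pd.DataFrame(
    [[60, 20], [25, 45]],
    index=["urban", "rural"],
    columns=["renovated", "not_renovated"],
)
chi2, chi_p, dof, expected = stats.chi2_contingency(table.to_numpy())

alpha = 0.05  # assumed significance level
print(f"t-test: t = {t_stat:.2f}, p = {t_p:.4g}, reject H0: {t_p < alpha}")
print(f"chi-square: chi2 = {chi2:.2f}, dof = {dof}, p = {chi_p:.4g}, reject H0: {chi_p < alpha}")
```

A small p-value indicates the observed difference would be unlikely under the null hypothesis; it does not by itself measure how large or how useful the effect is.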

***
***
***

In [None]:
# Summary statistics for every numeric column.
boston.describe()

In [None]:
# Boxplot for outliers
fig = plt.figure(figsize=(6,4))
ax = fig.gca()

# Pass the column as y so the box is drawn vertically.
ax = sns.boxplot(y=boston['rate_of_crime'],
                 fliersize=8, linewidth=1.5, notch=True,
                 saturation=0.5, ax=ax)

ax.set_ylabel('rate_of_crime', fontsize=16)
ax.set_title('Rate of crime boxplot\n', fontsize=20)

plt.show()

In [None]:
# Plotting all standardised boxplots for outliers.
boston_stand = (boston - boston.mean()) / boston.std()

fig = plt.figure(figsize=(15, 7))
ax = fig.gca()

ax = sns.boxplot(data=boston_stand, orient='h', fliersize=5, 
                 linewidth=3, notch=True, saturation=0.5, ax=ax)
plt.show()
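
The standardisation used above is a z-score: each column is centred on its mean and scaled by its standard deviation, which puts variables with very different units on a common scale. A small sketch on synthetic data (the column names are hypothetical) shows the effect:

```python
import numpy as np
import pandas as pd

# Two synthetic columns on very different scales (hypothetical names and values).
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "area_sqft": rng.normal(1500, 300, 400),
    "price": rng.normal(180_000, 40_000, 400),
})

# z-score: subtract each column's mean and divide by its sample std (pandas uses ddof=1).
z = (df - df.mean()) / df.std()

# Share of values lying more than 3 standard deviations from the mean, per column.
extreme_share = (z.abs() > 3).mean()

print(z.describe().loc[["mean", "std"]].round(3))
print(extreme_share)
```

After this transformation every column has mean 0 and standard deviation 1, so boxplots drawn side by side can be compared directly.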

In [None]:
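# Correlation matrix, drawn as a heatmap showing only the lower half.
corr = boston.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))  # hide the diagonal and upper triangle
sns.heatmap(corr, mask=mask, annot=True)
plt.show()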

In [None]:
# sns.pairplot
# sns.heatmap

In [None]:
# Plotting a few things in one go.
fig, axes = plt.subplots(2, 2, figsize=(16, 8))
# The figure size is set once in plt.subplots; a second figsize here would resize the whole figure.
df['col2'].plot(color='purple', fontsize=21, ax=axes[0][0])