<hr>

##### Mount Drive - **Google Colab Only Step**

When using Google Colab, we need to mount Google Drive before we can access the files stored on it. Run the Python cell below, open the link it generates, and paste the authorization code into the prompt in the cell output.



In [0]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


##### Change Directory To Access The Dependent Files - **Google Colab Only Step**

In [0]:
directory = "student"
if (directory == "student"):
  %cd drive/Colab\ Notebooks/data-science-track/
else:
  %cd drive/Shared\ drives/Rubrik/Data\ Science/Course/Data-Science-Track

/content/drive/Shared drives/Rubrik/Data Science/Course/Data-Science-Track


<hr>

<br>

# Exploratory Analysis

## In this lesson...

In this lesson, we'll go through the essential exploratory analysis steps:
1. [Understanding the data](#basic)
2. [Distributions of numeric features](#numeric)
3. [Distributions of categorical features](#categorical)
4. [Segmentations](#segmentations)
5. [Correlations](#correlations)

<hr>

## Import libraries

In general, it's good practice to keep all of your library imports at the top of your notebook or program.

In [0]:
# Data
import numpy as np
import pandas as pd

# Plotting
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns

# Library Configurations: 
sns.set() # make seaborn override the styling of matplotlib graphs
pd.set_option('display.max_columns', None) # display all columns
pd.set_option('display.max_rows', None) # display all rows

## Import the real estate dataset
- Use pandas' `read_csv()` function 
- Provide the following path for the data 
```python 
path = './data/real_estate_data.csv'
```
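
As a rough sketch of the loading step, the block below reads a tiny in-memory CSV with `pd.read_csv()`. The toy rows and the `toy_csv` name are hypothetical stand-ins so the snippet runs without the course file; with the real data you would pass `path` instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for './data/real_estate_data.csv' (toy values, not course data)
toy_csv = io.StringIO(
    "tx_price,sqft,roof\n"
    "200000,1000,Shingle\n"
    "350000,1800,Metal\n"
)

# read_csv accepts a file path or a file-like object and returns a DataFrame
df = pd.read_csv(toy_csv)
print(df)
```

With the course file in place, the equivalent call would be `df = pd.read_csv(path)`.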

In [0]:
# Load real estate data from CSV


<br id="basic">

## Understand the data

First, always look at basic information about the dataset.

### Display the dimensions of the dataset.
- Use the `.shape` property of the DataFrame to find out the shape of the dataset

In [0]:
# Dataframe dimensions


### Display the data types of our features
- Use the DataFrame's `info()` method to find out more about the DataFrame, such as the column data types and column names 
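
The two inspection steps above can be sketched on a small toy DataFrame (the values are made up for illustration and are not the course data). Note that `.shape` is an attribute, so it takes no parentheses, while `info()` is a method.

```python
import pandas as pd

# Toy DataFrame (hypothetical values) standing in for the real estate data
df = pd.DataFrame({
    "tx_price": [200000, 350000, 275000],
    "sqft": [1000, 1800, 1400],
    "roof": ["Shingle", "Metal", "Shingle"],
})

# .shape is a (rows, columns) tuple
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# info() prints column names, non-null counts, and data types
df.info()
```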

In [0]:
# Column datatypes


#### Which columns are text, or classified as categorical data?

- property_type
- exterior_walls
- roof
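
One possible way to list the text columns programmatically is `select_dtypes()`. The sketch below uses a toy DataFrame with made-up values; excluding numeric dtypes is an assumption that works when every non-numeric column is text.

```python
import pandas as pd

# Toy DataFrame (hypothetical values) with three text columns and one numeric column
df = pd.DataFrame({
    "property_type": ["Single-Family", "Condo"],
    "exterior_walls": ["Brick", "Wood Siding"],
    "roof": ["Shingle", "Metal"],
    "sqft": [1800, 900],
})

# Keep only the columns that are not numeric, then collect their names
text_columns = df.select_dtypes(exclude="number").columns.tolist()
print(text_columns)
```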

#### Display the first 5 rows to see example observations

In [0]:
# Display first 5 rows of df


## Histograms
### Create a histogram plot using the sqft feature

In [0]:
# Import the matplotlib.pyplot submodule and name it plt


# Create a Figure and an Axes with plt.subplots

# Plot a histogram on the figure we just created


# Add the title


# Customize the x-axis label


# Customize the y-axis label


# Call the show function to show the result
# Note: call plt.show(), not fig.show()


### Create multiple histograms on one figure

Create two histograms on one figure
- figure has one row 
- figure has two columns
- plot a histogram using sqft feature
- plot a histogram using tx_price feature 

#### `plt.subplots()` parameters:
- nrows (int) number of rows the figure will have
- ncols (int) number of columns the figure will have  
- (optional) figsize: (float, float) width, height in inches

#### `ax.hist()` parameters :
- `x:` takes in an array of data as input values to plot as the first argument
- `bins:` splits the data into groups based on the number specified
- `color:` colors the histogram with one of these values {'b', 'g', 'r', 'c', 'm', 'y', 'k', 'w'}
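
To illustrate how these parameters fit together, here is a minimal sketch that draws two histograms side by side. The data are randomly generated placeholders (not the course dataset), and the titles and axis labels are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated placeholder data (hypothetical, not the course dataset)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "sqft": rng.normal(2000, 500, size=200),
    "tx_price": rng.normal(400_000, 100_000, size=200),
})

# One row, two columns: axes is a 1-D array holding two Axes objects
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))

# Left panel: sqft
axes[0].hist(df["sqft"], bins=20, color="b")
axes[0].set_title("Square footage")
axes[0].set_xlabel("sqft")
axes[0].set_ylabel("Number of properties")

# Right panel: tx_price
axes[1].hist(df["tx_price"], bins=20, color="g")
axes[1].set_title("Transaction price")
axes[1].set_xlabel("tx_price")
axes[1].set_ylabel("Number of properties")

plt.show()
```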


In [0]:
# Import the matplotlib.pyplot submodule and name it plt

 
# Create a Figure and an Axes with plt.subplots

 
# Plot a histogram on the figure we just created

 
# Add the title for axis 0

 
# Customize the x-axis label for axis 0 

 
# Customize the y-axis label for axis 0

 
# Add the title for axis 1

 
# Customize the x-axis label for axis 1

 
# Customize the y-axis label for axis 1

 
 
# Call the show function to show the result


<hr> 

### Using the pandas DataFrame `.hist()` method to create histogram on all features


#### Arguments to consider passing in:
- (optional) `bins:` splits the data into groups based on the number specified 
- (optional) `xrot:` rotates x-axis labels counter-clockwise; <span style="color:red"> really useful for long x index labels </span>
- (optional) `figsize`: (float, float) width, height in inches.
- (optional) `color:` colors the histogram with one of these values {'b', 'g', 'r', 'c', 'm', 'y', 'k', 'w'}


#### Plot the histogram grid, but make it larger, and rotate the x-axis labels clockwise by 45 degrees.
- <code style="color:steelblue">df.hist()</code> has a <code style="color:steelblue">figsize=</code> argument that takes a tuple for the figure size; try making the figure size 20 x 20.
- <code style="color:steelblue">df.hist()</code> has an <code style="color:steelblue">xrot=</code> argument that rotates x-axis labels **counter-clockwise**; let's rotate them 45 degrees clockwise.
- The [documentation](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.hist.html) is useful for learning more about the arguments to the <code style="color:steelblue">.hist()</code> function.
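
A minimal sketch of the histogram grid, using randomly generated placeholder columns rather than the course data. Following the description above, a negative `xrot` value is used here to turn the labels clockwise; the `bins=10` choice is arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Placeholder data (hypothetical): three numeric columns and one text column
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "sqft": rng.normal(2000, 500, size=100),
    "tx_price": rng.normal(400_000, 100_000, size=100),
    "beds": rng.integers(1, 6, size=100),
    "roof": ["Shingle"] * 100,
})

# df.hist() draws one histogram per numeric column and skips text columns
axes = df.hist(bins=10, figsize=(20, 20), xrot=-45)

plt.show()
```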


In [0]:
# Plot histogram grid



### Countplot 

One way to analyze categorical data is by creating a countplot, which will show the counts of observations in each categorical bin using bars.

We will use the `sns.countplot()` function.

#### sns.countplot() parameters:
- (optional) `x` (string: series name): specify the values for the x axis
- (optional) `y` (string: series name): specify the values for the y axis
- `data` (DataFrame, array, or list of arrays): Dataset for plotting
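
As an illustration of these parameters, the sketch below counts a toy `roof` column (made-up values). Passing the column as `y` draws horizontal bars, which is one way to keep long category names readable; passing it as `x` would draw vertical bars instead.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy categorical column (hypothetical values)
df = pd.DataFrame({
    "roof": ["Shingle", "Metal", "Shingle", "Tile", "Shingle", "Metal"],
})

# One bar per category; the bar length is the number of rows in that category
ax = sns.countplot(y="roof", data=df)

plt.show()
```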

#### Create a countplot using the <code style="color:steelblue">'roof'</code> feature.


In [0]:
# count plot for 'roof'


## Segmentations

Next, let's create some segmentations. Segmentations are powerful ways to cut the data to observe the relationship between **categorical features** and **numeric features**.

### Boxplots

#### Using a seaborn boxplot, plot the resulting distributions by segmenting <code style="color:steelblue">'tx_price'</code> by <code style="color:steelblue">'property_type'</code>


#### sns.boxplot() parameters:
- `y` (string): name of the categorical column (Series) to place on the y-axis
- `x` (string): name of the numerical column (Series) to place on the x-axis
- `data` (DataFrame): dataset for plotting
<br>
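
A minimal sketch of a segmented boxplot, assuming a toy DataFrame with made-up prices and two hypothetical property types. Each category on the y-axis gets its own box summarizing the distribution of the numeric column on the x-axis.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy data (hypothetical values): a numeric column segmented by a categorical one
df = pd.DataFrame({
    "property_type": ["Single-Family"] * 4 + ["Condo"] * 4,
    "tx_price": [300000, 420000, 380000, 510000, 200000, 260000, 240000, 310000],
})

# Categorical column on y, numeric column on x gives horizontal boxes
ax = sns.boxplot(y="property_type", x="tx_price", data=df)

plt.show()
```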


In [0]:
# Segment tx_price by property_type and plot distributions


### Groupby 
#### Using the pandas groupby method, segment by property_type and display the means and standard deviations within each class
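
One way to sketch this step is `groupby()` followed by `agg()`. The toy values and the two property types below are hypothetical; selecting the numeric columns explicitly is a precaution so that text columns are not aggregated.

```python
import pandas as pd

# Toy data (hypothetical values)
df = pd.DataFrame({
    "property_type": ["Single-Family", "Single-Family", "Condo", "Condo"],
    "tx_price": [100, 300, 200, 400],
    "sqft": [1000, 2000, 1500, 2500],
})

# Mean and (sample) standard deviation of each numeric column within each class
summary = df.groupby("property_type")[["tx_price", "sqft"]].agg(["mean", "std"])
print(summary)
```

The result has one row per class and a two-level column index of (feature, statistic).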


### Seaborn lmplot()
#### Remember:
Seaborn's `lmplot` creates scatterplots, which compare two numerical features for each data point. A data point is a single entry (row) with multiple columns, also referred to as features. The plot can also encode additional information about each data point, such as a category shown by color, which makes it easier to see how the data cluster relative to the feature values.

#### `seaborn.lmplot()` parameters: 
- (optional) x (string: series name): specify numerical values for the x axis
- (optional) y (string: series name): specify numerical values for the y axis
- (optional) hue (string: series name): specify categorical values for the data point which helps us to group our data into clusters
- (optional) fit_reg (boolean): if `True`, estimate and plot a regression model relating the x and y variables
- (optional) height (float): height (in inches) of each facet (a facet is one panel of the plot)
- (optional) aspect (float): aspect ratio of each facet, so that `aspect * height` gives the width of each facet in inches
- data (DataFrame): data set for plotting 

[Seaborn lmplot Docs](https://seaborn.pydata.org/generated/seaborn.lmplot.html)

#### Example: 
```python
sns.lmplot(data=df, x='numerical_feature_one', y='numerical_feature_two', hue='categorical_feature')
plt.show()
```

#### Create a lmplot with the following parameter values:
  - `x` = 'tx_price'
  - `y` = 'sqft'
  - `hue` = 'property_type'
  - `height` = 10 

#### Note 
- The default value of `fit_reg` is `True`. To hide the estimated regression lines, set `fit_reg` to `False`.
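
A minimal sketch of the parameter values listed above, using a toy DataFrame with made-up numbers and two hypothetical property types. It also sets `fit_reg=False` to show the plain scatterplot without regression lines.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy data (hypothetical values)
df = pd.DataFrame({
    "tx_price": [300000, 420000, 510000, 200000, 260000, 310000],
    "sqft": [1800, 2400, 3000, 800, 1100, 1300],
    "property_type": ["Single-Family"] * 3 + ["Condo"] * 3,
})

# lmplot returns a FacetGrid; each hue level gets its own color
grid = sns.lmplot(
    data=df,
    x="tx_price",
    y="sqft",
    hue="property_type",
    height=10,
    fit_reg=False,  # hide the regression lines
)

plt.show()
```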

## Correlations

Finally, let's take a look at the relationships between **numeric features** and **other numeric features**.

<br>

#### Create a <code style="color:steelblue">correlations</code> dataframe from <code style="color:steelblue">df</code>.

- Use pandas' DataFrame `.corr()` method to compute the pairwise correlations between the numeric columns of the DataFrame.
- Save this correlations DataFrame into a variable called `correlations`

**Note:** By default, `.corr()` uses the Pearson correlation coefficient.
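
The correlation step can be sketched on a toy DataFrame (made-up values). Selecting the numeric columns first is an assumption made here for portability: on recent pandas versions, calling `df.corr()` directly on a DataFrame that contains text columns may raise an error unless `numeric_only=True` is passed.

```python
import pandas as pd

# Toy data (hypothetical values) with one text column
df = pd.DataFrame({
    "tx_price": [200000, 260000, 390000, 450000],
    "sqft": [1000, 1500, 2000, 2500],
    "beds": [2, 1, 4, 3],
    "roof": ["Shingle", "Metal", "Shingle", "Tile"],
})

# Pearson correlation between every pair of numeric columns
correlations = df.select_dtypes(include="number").corr()
print(correlations)
```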


In [0]:
# Calculate correlations between numeric features


#### Visualize the correlation grid with a heatmap to make it easier to digest.

#### Seaborn heatmap()
A **heat map** is a graphical representation of data where the individual values contained in a matrix are represented as colors. 

#### Create a `seaborn.heatmap` with the following parameters:
- `data`: `correlations * 100`
- `annot`: `True`
- `fmt`: `'.0f'`, to format the annotations as whole numbers
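
A minimal sketch of the heatmap with these parameters, built from a toy correlation matrix (made-up values). Creating the figure with `plt.subplots(figsize=(10, 10))` is one possible way to get a 10 x 10 figure.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy numeric data (hypothetical values)
df = pd.DataFrame({
    "tx_price": [200000, 260000, 390000, 450000],
    "sqft": [1000, 1500, 2000, 2500],
    "beds": [2, 1, 4, 3],
})
correlations = df.corr()

# 10 x 10 inch figure
fig, ax = plt.subplots(figsize=(10, 10))

# Multiply by 100 so the annotations read as whole-number percentages
sns.heatmap(correlations * 100, annot=True, fmt=".0f", ax=ax)

plt.show()
```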

In [0]:
# Make the figsize 10 x 10


# Plot heatmap of annotated correlations
