##### Mount Drive - **Google Colab Only Step**

When using Google Colab, we need to mount Google Drive to access the files stored there. Run the Python cell below, click the link it generates, and paste the authorization code back into the cell's input field.



In [1]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


##### Change Directory To Access The Dependent Files - **Google Colab Only Step**

In [2]:
directory = "student"
if (directory == "student"):
  %cd drive/Colab\ Notebooks/data-science-track/
else:
  %cd drive/Shared\ drives/Rubrik/Data\ Science/Course/Data-Science-Track

/content/drive/Shared drives/Rubrik/Data Science/Course/Data-Science-Track


<hr>

<br>

# Exploratory Analysis

## In this lesson...

In this lesson, we'll go through the essential exploratory analysis steps:
1. [Understanding the data](#basic)
2. [Distributions of numeric features](#numeric)
3. [Distributions of categorical features](#categorical)
4. [Segmentations](#segmentations)
5. [Correlations](#correlations)

<hr>

## Import libraries

In general, it's good practice to keep all of your library imports at the top of your notebook or program.

In [0]:
# Data
import numpy as np
import pandas as pd

# Plotting
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns

# Library Configurations: 
sns.set() # make seaborn override the styling of matplotlib graphs
pd.set_option('display.max_columns', None) # display all columns
pd.set_option('display.max_rows', None) # display all rows

## Import the real estate dataset
- Use pandas' `read_csv()` function 
- Provide the following path for the data 
```python 
path = './data/real_estate_data.csv'
```

In [0]:
# Load real estate data from CSV
df = pd.read_csv('./data/real_estate_data.csv')

<br id="basic">

## Understand the data

First, always look at basic information about the dataset.

### Display the dimensions of the dataset.
- Use the `.shape` property of the DataFrame to find out the shape of the dataset

In [0]:
# Dataframe dimensions
df.shape

### Display the data types of our features
- Use the DataFrame's `info()` method to find out more about the DataFrame, such as the column data types and column names 

In [0]:
# Column datatypes
df.info()

#### What columns are text, or classified as categorical data? 

- property_type
- exterior_walls
- roof

#### Display the first 5 rows to see example observations

In [0]:
# Display first 5 rows of df
df.head()

<br><hr style="border-color:royalblue;background-color:royalblue;height:1px;">
## <span style="color:RoyalBlue">Exercise 1.1</span>

Before moving on, let's dig a bit deeper into some of these functionalities. Getting some extra practice right now will set you up for smoother success as you continue through the project.
<br>
#### Print The DataFrame's data types using the `.dtypes` DataFrame Property


In [0]:
df.dtypes

#### What is the data type of `df.dtypes`?
- Use python's `type()` function and pass in `df.dtypes` as an argument

In [9]:
type(df.dtypes)

pandas.core.series.Series

#### Filter df.dtypes to only categorical variables:

#### Tips:
- How does knowing that data type help you out? 
- Remember the boolean filtering we've been talking about?

In [0]:
# Filter and display only df.dtypes that are 'object'
df.dtypes[df.dtypes == 'object']

#### Iterate through the categorical feature names and print each name.

**Tips:** 
- `df.dtypes` is a Series whose index holds the column names; you can access them with the `.index` property
- Use a `for` loop to iterate only through the index entries whose data type is `object`

In [0]:
# Loop through categorical feature names and print each one
for cat_feature in df.dtypes[df.dtypes == 'object'].index:
    print(cat_feature)

As you'll see later, the ability to select feature names based on some condition (instead of manually typing out each one) will be quite useful.
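
The same condition-based selection can be sketched on a tiny stand-in DataFrame. The `toy` frame and its values are made up for illustration; only the column names mirror the lesson. `select_dtypes()` is shown as one possible alternative to boolean filtering on `.dtypes`.

```python
import pandas as pd

# Hypothetical stand-in data; the lesson itself uses df loaded from the CSV
toy = pd.DataFrame({
    "tx_price": [295850, 216500, 279900],
    "beds": [1, 1, 2],
    "property_type": pd.Series(
        ["Single-Family", "Apartment / Condo / Townhouse", "Single-Family"],
        dtype="object",
    ),
    "roof": pd.Series(["Composition Shingle", None, "Asphalt"], dtype="object"),
})

# Approach used in this lesson: boolean filter on the dtypes Series
cat_from_dtypes = list(toy.dtypes[toy.dtypes == "object"].index)

# Alternative: let pandas select the columns by dtype
cat_from_select = list(toy.select_dtypes(include="object").columns)

print(cat_from_dtypes)
print(cat_from_select)
```

Both approaches should list the same column names, so either can drive a loop over the categorical features.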

#### **Next**, look at a few more examples by displaying the first 10 rows of data, instead of just the first 5

In [0]:
# Display the first 10 rows of data
df.head(10)

#### Finally, it's also helpful to look at the last 5 rows of data.
- Sometimes datasets will have **corrupted data** hiding at the very end (depending on the data source).
- It never hurts to double-check.

In [0]:
# Display last 5 rows of data
df.tail()

<hr> 

<br id="numeric">

# Distributions of numeric features

One of the most enlightening data exploration tasks is plotting the distributions of your features.

### Plot a histogram grid

A histogram is a plot that lets you discover, and show, the underlying frequency distribution (shape) of a set of continuous data. This allows the inspection of the data for its underlying distribution (e.g., normal distribution), outliers, skewness, etc.

#### Note: 
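- When a histogram is normalized to show density, the frequency within a bin is indicated by the bar's area (height multiplied by bin width). With the default settings used in this lesson, the height of each bar is simply the count of observations in that bin.
- Only works with numerical data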
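
How height, width, and frequency relate depends on whether the histogram is normalized. A minimal check with NumPy, using made-up values rather than the real estate data: the default heights are raw counts, while with `density=True` the bar areas (height times bin width) add up to 1.

```python
import numpy as np

# Made-up sample values for illustration only
values = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 7.0, 9.0])

# Default: heights are raw counts per bin
counts, edges = np.histogram(values, bins=4)

# density=True: heights are rescaled so the total bar area equals 1
density, _ = np.histogram(values, bins=4, density=True)
widths = np.diff(edges)

print(counts.sum())              # number of observations
print((density * widths).sum())  # total area of the bars
```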

#### When to use: 
- Shows the distribution of values of a single numeric feature (one column of the DataFrame)

#### Cons: 
- Can't see spikes in individual values (consider a probability density function (PDF) plot instead)

<hr>

### Creating A Histogram graph

#### Import the matplotlib.pyplot submodule and name it plt

```python
# Import the matplotlib.pyplot submodule and name it plt
import matplotlib.pyplot as plt
```

#### Create a blank figure with axes
This step is the setup to start drawing plots. We can think of this step as getting paper to draw on (creating a figure), as well as defining an area for us to draw in (creating an Axes).
```python
# Create a Figure and an Axes with plt.subplots
fig, ax = plt.subplots()
```

#### plt.subplots() parameters:
- (optional) `figsize`: (float, float) width, height in inches.

#### Show the figure
```python
# Call the show function to show the result
plt.show() 
```


We get a blank figure (plot) with x and y axes. This makes sense because we have not told the figure to plot any data yet; we have only set up and shown the figure.

```python 
# Import the matplotlib.pyplot submodule and name it plt
import matplotlib.pyplot as plt

# Create a Figure and an Axes with plt.subplots
fig, ax = plt.subplots(figsize=(5,5))

# Call the show function to show the result
#fig.show() # We don't invoke show off of figure
plt.show()
```

#### Try it out

In [0]:
# Import the matplotlib.pyplot submodule and name it plt
import matplotlib.pyplot as plt
 
# Create a Figure and an Axes with plt.subplots
fig, ax = plt.subplots(figsize=(5,5))
 
# Call the show function to show the result
# fig.show() # We don't invoke show off of figure
plt.show()

#### Plotting data on the figure
We can now use a method on the axes object to help us plot some data. 

The method we will use is `.hist()`. You can find the documentation [here](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.axes.Axes.hist.html#matplotlib.axes.Axes.hist)  for more guidance.


The method takes an array of data as the input values to plot.
#### `ax.hist()` parameters :
- `x:` takes in an array of data as input values to plot as the first argument
- `bins:` splits the data into groups based on the number specified
- `color:` colors the histogram; accepts any matplotlib color, for example one of the single-letter codes {'b', 'g', 'r', 'c', 'm', 'y', 'k', 'w'}
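
To see what `bins` does, it can help to inspect what `ax.hist()` returns. The sketch below uses made-up prices, not the course dataset, and switches to a non-interactive backend so it also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line inside a notebook
import matplotlib.pyplot as plt

# Made-up prices for illustration only
prices = [100, 120, 150, 180, 200, 220, 400, 450]

fig, ax = plt.subplots()

# ax.hist returns the bin counts, the bin edges, and the drawn bars
counts, edges, patches = ax.hist(prices, bins=4, color="g")

print(counts)  # observations per bin
print(edges)   # 4 bins are delimited by 5 edges
plt.close(fig)
```

Every observation falls into exactly one bin, so the counts add up to the number of values passed in.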

#### Example without specifying bins parameter: 
```python
# Import the matplotlib.pyplot submodule and name it plt
import matplotlib.pyplot as plt

# Create a Figure and an Axes with plt.subplots
fig, ax = plt.subplots()

# Plot a histogram on the figure we just created
ax.hist(df["tx_price"])

# Call the show function to show the result
plt.show()
```

#### Try it out:

In [0]:
# Import the matplotlib.pyplot submodule and name it plt
import matplotlib.pyplot as plt
 
# Create a Figure and an Axes with plt.subplots
fig, ax = plt.subplots()
 
# Plot a histogram on the figure we just created
ax.hist(df["tx_price"])
 
# Call the show function to show the result
plt.show()

#### Example with specifying bins and color parameter: 
```python
# Import the matplotlib.pyplot submodule and name it plt
import matplotlib.pyplot as plt

# Create a Figure and an Axes with plt.subplots
fig, ax = plt.subplots()

# Plot a histogram on the figure we just created
ax.hist(df["tx_price"], bins=50, color='g')

# Call the show function to show the result
plt.show()
```

#### Try it out:

In [0]:
# Import the matplotlib.pyplot submodule and name it plt
import matplotlib.pyplot as plt
 
# Create a Figure and an Axes with plt.subplots
fig, ax = plt.subplots()
 
# Plot a histogram on the figure we just created
ax.hist(df["tx_price"], bins=50, color='g')
 
# Call the show function to show the result
plt.show()

#### Adding labels to the graph
```python
# Add the title
ax.set_title("Histogram of Transaction Prices")

# Customize the x-axis label
ax.set_xlabel("Price")

# Customize the y-axis label
ax.set_ylabel("Frequency")
```

#### Putting it all together 

```python 
# Import the matplotlib.pyplot submodule and name it plt
import matplotlib.pyplot as plt

# Create a Figure and an Axes with plt.subplots
fig, ax = plt.subplots()

# Plot a histogram on the figure we just created
ax.hist(df["tx_price"], bins=50)

# Add the title
ax.set_title("Histogram of Transaction Prices")

# Customize the x-axis label
ax.set_xlabel("Price")

# Customize the y-axis label
ax.set_ylabel("Frequency")

# Call the show function to show the result
#fig.show() # We don't invoke show off of figure
plt.show()
```

#### Try it out:


In [0]:
# Import the matplotlib.pyplot submodule and name it plt
import matplotlib.pyplot as plt
 
# Create a Figure and an Axes with plt.subplots
fig, ax = plt.subplots()
 
# Plot a histogram on the figure we just created
ax.hist(df["tx_price"], bins=50)
 
# Add the title
ax.set_title("Histogram of Transaction Prices")
 
# Customize the x-axis label
ax.set_xlabel("Price")
 
# Customize the y-axis label
ax.set_ylabel("Frequency")
 
# Call the show function to show the result
# fig.show() # We don't invoke show off of figure
plt.show()

### Plotting multiple plots on one figure

Calling `plt.subplots()` without parameters creates one subplot.  

If we call `plt.subplots()` with parameters, such as `plt.subplots(num_rows, num_cols)`, we can draw multiple plots on one figure.

#### `plt.subplots()` parameters:
- nrows (int) number of rows the figure will have
- ncols (int) number of columns the figure will have  
- (optional) figsize: (float, float) width, height in inches

##### Give these parameters their respective values

Example:
```python 
plt.subplots(nrows=2, ncols=1)
```

#### When we specify two dimensions with multiple rows and columns at the same time: 
When we do this the variable ax is no longer only one Axes object, instead it is an array of Axes objects with a shape of num_rows by num_cols. 
- ax[0,0] - references the first row first column
- ax[0,1] - references the first row second column
- ax[1,0] - references the second row first column 
- ax[1,1] - references the second row second column 

#### When you specify either one row with multiple columns or multiple rows with one column
This is a special case in which you have only one row or only one column of plots. The resulting array is one-dimensional, so you only need to provide one index to access its elements.
##### Example with `plt.subplots(nrows=2, ncols=1)`
- ax[0] - references the first row first column
- ax[1] - references the second row first column
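
The indexing rules above can be confirmed by checking the shape of the returned array. This is a small sketch; the variable names are illustrative, and `squeeze=False` is shown as an optional way to always get a two-dimensional array.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line inside a notebook
import matplotlib.pyplot as plt

# 2 x 2 grid: ax is a two-dimensional array of Axes
fig_grid, ax_grid = plt.subplots(nrows=2, ncols=2)
print(ax_grid.shape)

# 2 x 1 grid: the array is squeezed to one dimension
fig_col, ax_col = plt.subplots(nrows=2, ncols=1)
print(ax_col.shape)

# squeeze=False keeps two dimensions regardless of the grid size
fig_raw, ax_raw = plt.subplots(nrows=2, ncols=1, squeeze=False)
print(ax_raw.shape)

plt.close("all")
```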

#### For Example:

```python 
# Import the matplotlib.pyplot submodule and name it plt
import matplotlib.pyplot as plt

# Create a Figure and an Axes with plt.subplots
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 5))


# Plot a histogram on the figure we just created
ax[0].hist(df["tx_price"], color='g')
ax[1].hist(df["tx_price"], bins=50, color='c')

# Call the show function to show the result
plt.show()
```

#### Try it out:



In [0]:
# Import the matplotlib.pyplot submodule and name it plt
import matplotlib.pyplot as plt
 
# Create a Figure and an Axes with plt.subplots
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 5))
 
 
# Plot a histogram on the figure we just created
ax[0].hist(df["tx_price"], color='g')
ax[1].hist(df["tx_price"], bins=50, color='c')
 
# Call the show function to show the result
plt.show()


#### Adding labels to the graph

When we add labels to a multi-dimensional plot on one figure we will have to be more specific, specifying the axis we are talking about. 

#### For example:
```python
# Add the title for axis 0
ax[0].set_title("Histogram of Transaction Prices")

# Customize the x-axis label for axis 0 
ax[0].set_xlabel("Price")

# Customize the y-axis label for axis 0
ax[0].set_ylabel("Frequency")

# Add the title for axis 1
ax[1].set_title("Histogram of Transaction Prices With 50 Bins")

# Customize the x-axis label for axis 1
ax[1].set_xlabel("Price")

# Customize the y-axis label for axis 1
ax[1].set_ylabel("Frequency")
```

#### Putting it all together:

```python
# Import the matplotlib.pyplot submodule and name it plt
import matplotlib.pyplot as plt

# Create a Figure and an Axes with plt.subplots
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 5))

# Plot a histogram on the figure we just created
ax[0].hist(df["tx_price"], color="r")
ax[1].hist(df["tx_price"], bins=50)

# Add the title for axis 0
ax[0].set_title("Histogram of Transaction Prices")

# Customize the x-axis label for axis 0 
ax[0].set_xlabel("Price")

# Customize the y-axis label for axis 0
ax[0].set_ylabel("Frequency")

# Add the title for axis 1
ax[1].set_title("Histogram of Transaction Prices With 50 Bins")

# Customize the x-axis label for axis 1
ax[1].set_xlabel("Price")

# Customize the y-axis label for axis 1
ax[1].set_ylabel("Frequency")


# Call the show function to show the result
plt.show()
```

#### Try it out:

In [0]:
# Import the matplotlib.pyplot submodule and name it plt
import matplotlib.pyplot as plt
 
# Create a Figure and an Axes with plt.subplots
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 5))
 
# Plot a histogram on the figure we just created
ax[0].hist(df["tx_price"], color="r")
ax[1].hist(df["tx_price"], bins=50)
 
# Add the title for axis 0
ax[0].set_title("Histogram of Transaction Prices")
 
# Customize the x-axis label for axis 0 
ax[0].set_xlabel("Price")
 
# Customize the y-axis label for axis 0
ax[0].set_ylabel("Frequency")
 
# Add the title for axis 1
ax[1].set_title("Histogram of Transaction Prices With 50 Bins")

# Customize the x-axis label for axis 1
ax[1].set_xlabel("Price")

# Customize the y-axis label for axis 1
ax[1].set_ylabel("Frequency")
 
 
# Call the show function to show the result
plt.show()

### Resources
- [matplotlib subplots method](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.subplots.html)
- [matplotlib `Axes.hist` method](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.axes.Axes.hist.html#matplotlib.axes.Axes.hist)
- [matplotlib `pyplot.hist` function](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.hist.html)


<hr> 

### Using the pandas DataFrame `.hist()` method to create histogram on all features


#### Arguments to consider passing in:
- (optional) `bins:` splits the data into groups based on the number specified 
- (optional) `xrot:` rotates x-axis labels counter-clockwise; <span style="color:red"> really useful for long x index labels </span>
- (optional) `figsize`: (float, float) width, height in inches.
- (optional) `color:` colors the histogram with one of these values {'b', 'g', 'r', 'c', 'm', 'y', 'k', 'w'}


#### Plot the histogram grid, but make it larger, and rotate the x-axis labels clockwise by 45 degrees.
- <code style="color:steelblue">df.hist()</code> has a <code style="color:steelblue">figsize=</code> argument that takes a tuple for the figure size; try making the figure size 20 x 20
- <code style="color:steelblue">df.hist()</code> has an <code style="color:steelblue">xrot=</code> argument that rotates x-axis labels **counter-clockwise**, so pass a negative value to rotate them 45 degrees clockwise.
- The [documentation](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.hist.html) is useful for learning more about the arguments to the <code style="color:steelblue">.hist()</code> function.

**Tip:** It's ok to arrive at the answer through **trial and error** (this is often easier than memorizing the various arguments).


```python 
# Plot histogram grid
df.hist(xrot=-45, figsize=(20, 20), color='g')
plt.show()
```

#### Try it out:

In [0]:
# Plot histogram grid
df.hist(xrot=-45, figsize=(20, 20), color='g')
plt.show()

#### Display summary statistics for the numerical features.

In [0]:
# Summarize numerical features
df.describe()

<hr style="border-color:royalblue;background-color:royalblue;height:1px;">

<div style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
    
[Back to Contents](#toc)
</div>

<br id="categorical">

# Distributions of categorical features

Next, let's take a look at the distributions of our categorical features.

<br>
Display summary statistics for categorical features.

- Use the dataframe's `describe()` method
- The `describe()` method accepts a parameter called `include`. Passing `include="object"` shows only information about the categorical features.

In [0]:
# Summarize categorical features
df.describe(include="object")
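
To make the rows of that summary concrete, here is a small sketch on made-up data (the `toy` frame is an assumption, not the course dataset):

```python
import pandas as pd

# Made-up categorical column for illustration only
toy = pd.DataFrame({
    "roof": pd.Series(["Shingle", "Shingle", "Metal", None], dtype="object"),
    "sqft": [584, 612, 1500, 2100],
})

# Only the object column is summarized; the numeric column is left out
summary = toy.describe(include="object")
print(summary)
```

In the result, `count` excludes missing values, `unique` is the number of distinct classes, `top` is the most frequent class, and `freq` is how often that class occurs.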


### Countplot 

One way to analyze categorical data is by creating a countplot, which will show the counts of observations in each categorical bin using bars.

We will use the `sns.countplot()` function.

#### sns.countplot() parameters:
- (optional) `x` (string: series name): specify the values for the x axis
- (optional) `y` (string: series name): specify the values for the y axis
- `data` (DataFrame, array, or list of arrays): Dataset for plotting

#### Plot a countplot of the 'exterior_walls' feature

```python
# count plot for 'exterior_walls'
sns.countplot(y="exterior_walls", data=df)
plt.show()
```

#### Try it out: 

In [0]:
# count plot for 'exterior_walls'
sns.countplot(y="exterior_walls", data=df)
plt.show()

<br><hr style="border-color:royalblue;background-color:royalblue;height:1px;">
## <span style="color:RoyalBlue">Exercise 1.3</span>

**Write a <code style="color:steelblue">for</code> loop to plot bar plots of each of the categorical features.**
* Write the loop to be able to handle any number of categorical features (borrow from your answer to <span style="color:royalblue">Exercise 1.1</span>).
* Invoke <code style="color:steelblue">plt.show()</code> after each bar plot to display all 3 plots in one output.

In [0]:
# Plot bar plot for each categorical feature
for features in df.dtypes[df.dtypes == "object"].index: 
    sns.countplot(y=features, data=df)
    plt.show()

#### Which features suffer from sparse classes? That is, which features have many unique values, some of them with very few observations?
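
One way to check for sparse classes, read here as classes with very few observations, is to count the rows per class with `value_counts()`. The values and the cutoff `min_rows` below are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Made-up values for illustration only
walls = pd.Series(
    ["Brick"] * 6 + ["Siding"] * 5 + ["Stucco", "Block"],
    dtype="object",
    name="exterior_walls",
)

# Count observations per class
class_counts = walls.value_counts()

# Hypothetical cutoff: treat classes with fewer than 3 rows as sparse
min_rows = 3
sparse_classes = class_counts[class_counts < min_rows].index.tolist()

print(class_counts)
print(sorted(sparse_classes))
```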

<hr>

<br id="segmentations">

# Segmentations

Next, let's create some segmentations. Segmentations are powerful ways to cut the data to observe the relationship between **categorical features** and **numeric features**.

### Boxplots
Drawing a box plot will allow you to show distributions with respect to categories.

#### sns.boxplot() parameters:
- `y` (string): axis parameter provide the categorical column (series) name
- `x` (string): axis parameter provide the numerical column (series) name
- data (DataFrame): Dataset for plotting
<br>


```python
# Segment numerical_feature by categorical_feature and plot distributions
sns.boxplot(x= "numerical_feature", y="categorical_feature", data=df)
plt.show()
```

#### Using a seaborn boxplot, plot the resulting distributions by segmenting <code style="color:steelblue">'tx_price'</code> by <code style="color:steelblue">'property_type'</code>



In [0]:
# Segment tx_price by property_type and plot distributions
sns.boxplot(x="tx_price", y="property_type", data=df)
plt.show()

### Groupby
The `groupby()` method allows you to group rows of data together. After you have grouped rows of data together you can then call aggregate functions. 

Aggregation functions allow you to compute a summary statistic (or statistics) for each group. 

[Groupby Documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html)
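
As a small sketch of the split-then-aggregate idea, using made-up rows rather than the course dataset:

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "property_type": ["Single-Family", "Apartment", "Single-Family", "Apartment"],
    "tx_price": [400000, 250000, 500000, 350000],
    "sqft": [3000, 1200, 2600, 1400],
})

# Split rows by property_type, then average each numeric column per group
group_means = toy.groupby("property_type")[["tx_price", "sqft"]].mean()
print(group_means)
```

The result has one row per group and one column per aggregated feature.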

As mentioned above, you can use the `.groupby()` method to group rows together based on a column name.

For instance let's group based off of property_type column (series). This will create a DataFrameGroupBy object:

```python
df.groupby('property_type')
```

#### Try it yourself to see the output

In [26]:
df.groupby('property_type')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f0c8c572940>

#### We can now calculate the average value for each feature within each class 

**Note:** 
- Here, a `class` is one unique value of the `property_type` column (Series)

Segment by <code style="color:steelblue">'property_type'</code> and calculate the average value of each feature within each class by calling the `mean()` method on the `groupby` object.

```python 
# Segment by property_type and display the means within each class
df.groupby("property_type").mean(numeric_only=True)
``` 

#### Try it yourself to see the output

In [0]:
df.groupby('property_type').mean(numeric_only=True)

<br><hr style="border-color:royalblue;background-color:royalblue;height:1px;">
## <span style="color:RoyalBlue">Exercise 1.4</span>

On average, it looks like single family homes are more expensive.

How else do the different property types differ? Let's see:
<br>

#### Using a seaborn boxplot, plot the resulting distributions by segmenting <code style="color:steelblue">'sqft'</code> by <code style="color:steelblue">'property_type'</code> 
- Invoke the seaborn `boxplot()` function
- For the `y` axis parameter provide the categorical column (series) name
- For the `x` axis parameter provide the numerical column (series) name
- For the `data` parameter provide the DataFrame
- Show the graph using matplotlib's `show()` function


In [0]:
# Segment sqft by property_type and plot distributions
sns.boxplot(y="property_type", x="sqft", data=df)
plt.show()

#### Which type of property is larger, on average?

Single-Family properties

#### Which type of property sees greater variance in sizes?


Single-Family properties

#### Does the difference in distributions between classes make intuitive sense?


Yes. Single-family homes tend to have more square footage, and apartments generally have less space to work with.


<br>

#### Display the standard deviations of each feature alongside their means after performing a groupby

This will give you a better idea of the variation within each feature, by class.

**Tip:** Pass a list of metrics into the <code style="color:steelblue">.agg()</code> function, after performing your groupby.

<br>

**Some Common metrics you can pass in:**
- np.mean
- np.std

**Note:** The metrics are passed as function objects, without parentheses, so they are not invoked (called) here; pandas calls them for each group.

Check out the [documentation](http://pandas.pydata.org/pandas-docs/stable/groupby.html#applying-multiple-functions-at-once) for more help.

```python
# Segment by property_type and display the means and standard deviations within each class
df.groupby("property_type").agg([np.mean, np.std])
```

In [29]:
# Segment by property_type and display the means and standard deviations within each class
df.groupby("property_type").agg([np.mean, np.std])

| Feature | Apartment / Condo / Townhouse (mean) | Apartment / Condo / Townhouse (std) | Single-Family (mean) | Single-Family (std) |
|---|---|---|---|---|
| tx_price | 366614.034869 | 121784.490486 | 464644.711111 | 157758.739013 |
| beds | 2.601494 | 0.81022 | 4.02963 | 0.795639 |
| baths | 2.200498 | 0.815009 | 2.862037 | 0.937551 |
| sqft | 1513.727273 | 556.28665 | 2935.865741 | 1422.871169 |
| year_built | 1988.936488 | 15.51364 | 1978.523148 | 22.210582 |
| lot_size | 3944.239103 | 44284.168767 | 20417.666667 | 44165.529302 |
| basement | 1.0 | 0.0 | 1.0 | 0.0 |
| restaurants | 58.418431 | 54.343594 | 26.672222 | 34.726416 |
| groceries | 5.919054 | 4.645774 | 3.453704 | 4.067285 |
| nightlife | 7.855542 | 10.643816 | 3.007407 | 5.543822 |
| cafes | 8.03736 | 9.077038 | 3.308333 | 5.325053 |
| shopping | 57.631382 | 61.852299 | 28.289815 | 42.292313 |
| arts_entertainment | 4.840598 | 5.234834 | 2.318519 | 3.929691 |
| beauty_spas | 32.087173 | 26.910443 | 16.97037 | 22.872112 |
| active_life | 22.410959 | 21.058178 | 10.946296 | 12.599296 |
| median_age | 37.199253 | 6.906584 | 39.643519 | 6.225732 |
| married | 57.534247 | 20.372706 | 77.685185 | 13.868205 |
| college_grad | 66.372354 | 17.095874 | 64.128704 | 16.790347 |
| property_tax | 346.261519 | 142.292282 | 556.383333 | 244.351559 |
| insurance | 105.652553 | 47.118015 | 166.32963 | 77.816022 |
| median_school | 6.382316 | 1.941998 | 6.592593 | 2.031663 |
| num_schools | 2.83188 | 0.45537 | 2.764815 | 0.537959 |
| tx_year | 2007.941469 | 4.099487 | 2006.494444 | 5.807059 |
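
On recent pandas releases, applying numeric metrics to text columns can raise an error instead of silently skipping them. One option, sketched here on made-up data, is to select the numeric columns first and pass the metrics by name. The `toy` frame is an assumption, not the course dataset.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "property_type": ["Single-Family", "Apartment", "Single-Family", "Apartment"],
    "roof": ["Shingle", "Metal", "Shingle", "Shingle"],
    "sqft": [3000, 1200, 2600, 1400],
    "beds": [4, 2, 4, 3],
})

# Keep only numeric columns, but still group by the property_type values
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
stats = toy.groupby("property_type")[numeric_cols].agg(["mean", "std"])
print(stats)
```

The result has a two-level column index: the feature name on top and the metric (`mean`, `std`) underneath.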


<hr>

<br>

### Seaborn lmplot()
Seaborn's `lmplot()` lets you create scatterplots, which compare two numerical features for each data point. A data point is a single entry, or row, whose columns are otherwise referred to as features. This plot also lets you attach additional information to each data point (for example, a category shown by color), so you can see how the data clusters relative to the feature values.

#### `seaborn.lmplot()` parameters: 
- (optional) x (string: series name): specify numerical values for the x axis
- (optional) y (string: series name): specify numerical values for the y axis
- (optional) hue (string: series name): specify categorical values for the data point which helps us to group our data into clusters
- (optional) fit_reg (boolean): if `True`, estimate and plot a regression model relating the x and y variables
- (optional) height (float): height (in inches) of each facet (each individual plot in the grid)
- (optional) aspect (float): aspect ratio of each facet, so that `aspect * height` gives the width of each facet in inches
- data (DataFrame): data set for plotting 

[Seaborn lmplot Docs](https://seaborn.pydata.org/generated/seaborn.lmplot.html)

#### Example: 
```python
sns.lmplot(data=df, x='numerical_feature_one', y='numerical_feature_two', hue='categorical_feature')
plt.show()
```

#### Create a lmplot with the following parameter values:
  - `x` = 'tx_price'
  - `y` = 'sqft'
  - `hue` = 'property_type'
  - `height` = 10 

#### Note 
- The default value for `fit_reg` is `True`. To hide the estimated regression lines, set `fit_reg` to `False`.

In [0]:
sns.lmplot(data=df, x='tx_price', y='sqft', hue='property_type', height=10, fit_reg=False)
plt.show()

<hr> 

<br id="correlations">

# Correlations

Finally, let's take a look at the relationships between **numeric features** and **other numeric features**.

<br>

#### Create a <code style="color:steelblue">correlations</code> dataframe from <code style="color:steelblue">df</code>.

- Use pandas' DataFrame `.corr()` method to show you all of the correlations between all the columns of the DataFrame.
- Save this correlations DataFrame into a variable called `correlations`

**Note:** By default, `.corr()` uses the Pearson correlation coefficient.
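
As a sanity check on what the Pearson coefficient measures, the sketch below computes it by hand for two made-up columns and compares the result with pandas. The data values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "sqft": [1000.0, 1500.0, 2000.0, 2500.0, 3000.0],
    "tx_price": [200.0, 260.0, 330.0, 390.0, 480.0],
})

x = toy["sqft"].to_numpy()
y = toy["tx_price"].to_numpy()

# Pearson r: co-variation of x and y divided by the product of their spreads
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# pandas computes the same quantity for every pair of columns
r_pandas = toy.corr().loc["sqft", "tx_price"]

print(r_manual, r_pandas)
```

Values close to 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.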


In [0]:
# Calculate correlations between numeric features (text columns are excluded first)
correlations = df.select_dtypes(include="number").corr()

#### Visualize the correlation grid with a heatmap to make it easier to digest.

#### Seaborn heatmap()
A **heat map** is a graphical representation of data where the individual values contained in a matrix are represented as colors. 
#### seaborn.heatmap() parameters:
- data : rectangular dataset, 2D dataset

Example:
```python 
# Make the figsize 10 x 10
plt.figure(figsize=(10,10))

# Plot heatmap of correlations
sns.heatmap(correlations)

# Show plot
plt.show()
```
#### Try it yourself:

In [0]:
# Make the figsize 10 x 10
plt.figure(figsize=(10,10))

# Plot heatmap of correlations
sns.heatmap(correlations)

# Show plot
plt.show()

<br><hr style="border-color:royalblue;background-color:royalblue;height:1px;">
## <span style="color:RoyalBlue">Exercise 1.5</span>

When plotting a heatmap of correlations, it's often helpful to do three things:
1. Annotate the cells with their correlation values
2. Mask the top triangle (less visual noise)
3. Drop the legend (colorbar on the side)

<br>

### Original Heatmap

In [0]:
# Make the figsize 10 x 10
plt.figure(figsize=(10,10))

# Plot heatmap of correlations
sns.heatmap(correlations)
plt.show()

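See how the cells for <code style="color:steelblue">'basement'</code> are white? That's expected: <code style="color:steelblue">'basement'</code> has zero variance (its standard deviation is 0 in the groupby output above), so its correlations cannot be calculated and are left blank.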
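
A constant column produces missing correlation values, which the heatmap leaves blank. A minimal sketch with made-up values, where `basement` is deliberately constant:

```python
import pandas as pd

# Made-up values; 'basement' never varies
toy = pd.DataFrame({
    "basement": [1.0, 1.0, 1.0, 1.0],
    "sqft": [1000.0, 1500.0, 2000.0, 2600.0],
    "beds": [2.0, 3.0, 3.0, 4.0],
})

corr = toy.corr()
print(corr)
```

The correlations between `basement` and the other columns come out as `NaN`, because the Pearson formula divides by the column's spread, which is zero here.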

<br>

**Next, display the correlation values in each cell.**

* The <code style="color:steelblue">annot=</code> argument controls whether to annotate each cell with its value. By default, it's <code style="color:crimson">False</code>.
* To make the chart cleaner, multiply the <code style="color:steelblue">correlations</code> DataFrame by 100 before passing it to the heatmap function.
* Pass in the argument <code style="color:steelblue">fmt=<span style="color:crimson">'.0f'</span></code> to format the annotations to a whole number.
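
The effect of scaling by 100 and formatting with `'.0f'` can be previewed with plain Python string formatting. The correlation values below are made up for illustration.

```python
# Made-up correlation values for illustration only
raw_values = [0.4567, -0.031, 1.0]

# Scale to a -100..100 range, then format with no decimal places
labels = [format(value * 100, ".0f") for value in raw_values]
print(labels)
```

Short whole-number labels are easier to read inside small heatmap cells than long decimals.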


In [0]:
# Make the figsize 10 x 10
plt.figure(figsize=(10,10))

# Plot heatmap of annotated correlations
correlations = correlations * 100
sns.heatmap(correlations, annot=True, fmt='.0f')
plt.show()

#### Next, we'll generate a mask for the top triangle. Run this code:

This mask will help us cut out the duplicate correlation values.

In [0]:
# Generate a mask for the upper triangle
bool_mask = np.zeros_like(correlations, dtype=bool)
bool_mask[np.triu_indices_from(bool_mask)] = True

print(bool_mask)
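
On a small 3 x 3 stand-in, the structure of the mask is easier to see. This sketch also shows `np.triu` as an alternative way to build the same mask; the array size is arbitrary.

```python
import numpy as np

# Small stand-in with the same square shape as a correlation matrix
small_mask = np.zeros((3, 3), dtype=bool)

# Mark the diagonal and everything above it
small_mask[np.triu_indices_from(small_mask)] = True
print(small_mask)

# np.triu builds the same mask in one step
same_mask = np.triu(np.ones((3, 3), dtype=bool))
print(same_mask)
```

Cells where the mask is `True` are hidden by the heatmap, which leaves only the lower triangle visible.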

<br>

**Plot the heatmap again, this time using that mask.**

* <code style="color:steelblue">sns.heatmap()</code> has a <code style="color:steelblue">mask=</code> argument.
* Keep all of the other styling changes you've made up to now.

In [0]:
# Make the figsize 10 x 10
plt.figure(figsize=(10, 10))


# Plot heatmap of correlations
sns.heatmap(correlations, annot=True, fmt='.0f', mask=bool_mask)
plt.show()

<br>

**Finally, remove the colorbar on the side.**

* <code style="color:steelblue">sns.heatmap()</code> has a <code style="color:steelblue">cbar=</code> argument. By default, it's <code style="color:crimson">True</code>.
* Keep all of the other styling changes you've made up to now.

In [0]:
# Make the figsize 10 x 10
plt.figure(figsize=(10, 10))


# Plot heatmap of correlations
sns.heatmap(correlations, annot=True, fmt='.0f', mask=bool_mask, cbar=False)
plt.show()

<hr style="border-color:royalblue;background-color:royalblue;height:1px;">
<div style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
    
[Back to Contents](#toc)
</div>