## Correlation Analysis

- What is correlation analysis?
- How does correlation analysis help with data cleaning?
- Coding Example

### Correlation Analysis
Correlation analysis is a statistical technique for examining the strength and direction of the relationship
between two or more variables.

It measures the degree to which changes in one variable are associated with changes in another variable.

### How to do it?
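Use a `correlation coefficient`, a number that measures the strength and the direction of the relationship between the two variables. For the coefficients covered here it ranges from -1 (perfect negative relationship) to +1 (perfect positive relationship), with values near 0 indicating little or no relationship.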

$$
\{Y,X\}
$$
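
To make the idea of a coefficient concrete, here is a minimal sketch that computes the Pearson coefficient for one pair of variables directly from its definition (the sum of products of deviations from the mean, divided by the square root of the product of the summed squared deviations). The helper `pearson_r` and the `hours_studied` / `exam_score` values are made up for illustration and are not part of the hotel dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_dev = x - x.mean()  # deviations of X from its mean
    y_dev = y - y.mean()  # deviations of Y from its mean
    return float((x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum()))

# Hypothetical data, for illustration only
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 55, 61, 70, 72]

r = pearson_r(hours_studied, exam_score)
# Cross-check against NumPy's built-in correlation matrix
r_numpy = float(np.corrcoef(hours_studied, exam_score)[0, 1])
print(r, r_numpy)
```

Both numbers should agree, and a value close to +1 means the two variables tend to increase together.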


### Popular Correlation Coefficients

#### Pearson 
- Use for continuous data
- Measures the strength of the `linear relationships` between the variables
- Sensitive to outliers

#### Spearman
- Use for ordinal or ranked data
- Measures the strength of the `monotonic relationship` between the variables, which can be linear or non-linear
- More robust to outliers than Pearson
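
The difference between the two coefficients can be sketched with pandas' `DataFrame.corr`, which accepts a `method` argument. The data below is hypothetical: `y` always increases with `x` (monotonic) but not along a straight line.

```python
import pandas as pd

# Hypothetical data: y grows monotonically with x, but not linearly (y = x**3)
df = pd.DataFrame({"x": range(1, 11)})
df["y"] = df["x"] ** 3

pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]

# Spearman is the Pearson coefficient computed on the ranks of the values
pearson_on_ranks = df.rank().corr(method="pearson").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Spearman reports a perfect monotonic relationship here, while Pearson comes out lower because the relationship is not linear.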

Correlation analysis can identify variables that are highly correlated with each other.

When two variables are highly correlated, they carry much the same information, so the analysis may indicate that one of them is `redundant` and can be eliminated.
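
One possible way to flag redundant features is to scan the upper triangle of the absolute correlation matrix for values above a cut-off. This is only a sketch: the `features` table and the `threshold` of 0.9 are assumptions chosen for illustration, and the right cut-off depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical feature table: temp_f is just temp_c expressed in different units
features = pd.DataFrame({
    "temp_c": [10.0, 15.0, 20.0, 25.0, 30.0],
    "temp_f": [50.0, 59.0, 68.0, 77.0, 86.0],
    "humidity": [80.0, 60.0, 75.0, 50.0, 65.0],
})

corr = features.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is ignored
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cut-off
redundant = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = features.drop(columns=redundant)

print("Redundant:", redundant)
print("Remaining:", list(reduced.columns))
```

Which column of a correlated pair gets dropped is arbitrary in this sketch (the later one); in practice that choice is usually made with domain knowledge.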

### Dealing with Categorical Data
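- Data often has many `non-numeric` features. Most learning models cannot take them directly, so they need to be converted to numbers first.
- Use the `dtypes` attribute of the DataFrame (for example `data_df.dtypes`; it is an attribute, not a function call) to see the data types.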
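
A short sketch of inspecting the types and collecting the non-numeric columns. The small `df` below is a stand-in with a few column names borrowed from the hotel bookings data; the values in it are illustrative only.

```python
import pandas as pd

# Stand-in frame; values are illustrative, not taken from the real file
df = pd.DataFrame({
    "hotel": ["Resort Hotel", "Resort Hotel"],
    "lead_time": [342, 7],
    "adr": [0.0, 75.0],
    "customer_type": ["Transient", "Transient"],
})

# dtypes is an attribute, so there are no parentheses
print(df.dtypes)

# Everything that is not numeric is a candidate for encoding
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)
```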

### Two main types of encoding
- Label encoding
- One-hot encoding

#### Label Encoding
Each `unique` category in the categorical variable is assigned a numerical label, typically starting at 0 and counting up (0, 1, 2, ...).
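
As a sketch, label encoding can be done with `pd.factorize`, which numbers categories in order of first appearance. The `class_year` series below is a made-up example.

```python
import pandas as pd

# Made-up example column
years = pd.Series(["Fr", "So", "Ju", "Se", "So", "Fr"], name="class_year")

# factorize returns integer codes plus the unique categories they refer to
codes, categories = pd.factorize(years)

print(list(codes))
print(list(categories))
```

Keep in mind that the integer labels imply an order (0 < 1 < 2 ...), which a model may treat as meaningful even when the categories have no natural ranking.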

#### One-hot Encoding
A new binary feature is created `for each` category. Its value is 1 if the observation belongs to that category and 0 otherwise.

If my data was Fr, So, Ju, Se, then one-hot encoding would look like this:
- Fr : 1 0 0 0
- So : 0 1 0 0 
- Ju : 0 0 1 0
- Se : 0 0 0 1
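
The same table can be reproduced with `pd.get_dummies`. This is a minimal sketch; the `class_year` column name is an assumption, and the explicit `pd.Categorical` step is only there to keep the columns in Fr, So, Ju, Se order.

```python
import pandas as pd

years = pd.DataFrame({"class_year": ["Fr", "So", "Ju", "Se"]})

# Fix the category order; without this, get_dummies sorts the columns alphabetically
years["class_year"] = pd.Categorical(
    years["class_year"], categories=["Fr", "So", "Ju", "Se"]
)

# dtype=int gives 1/0 instead of True/False
one_hot = pd.get_dummies(years["class_year"], dtype=int)
print(one_hot)
```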

### Dealing with Categorical Data

In [3]:
import pandas as pd
# Load dataset
data_df = pd.read_csv('../data/hotel_bookings.csv')

data_df.head()

| Unnamed: 0 | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | | | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| 1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | | | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| 2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | | | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| 3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | 304.0 | | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| 4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | No Deposit | 240.0 | | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |


In [4]:
# See all Dataset dtypes
data_df.dtypes

# Task: Create a list of those columns/features

hotel                              object
is_canceled                         int64
lead_time                           int64
arrival_date_year                   int64
arrival_date_month                 object
arrival_date_week_number            int64
arrival_date_day_of_month           int64
stays_in_weekend_nights             int64
stays_in_week_nights                int64
adults                              int64
children                          float64
babies                              int64
meal                               object
country                            object
market_segment                     object
distribution_channel               object
is_repeated_guest                   int64
previous_cancellations              int64
previous_bookings_not_canceled      int64
reserved_room_type                 object
assigned_room_type                 object
booking_changes                     int64
deposit_type                       object
agent                             