## Correlation Analysis
- What is correlation analysis?
- How does correlation analysis help with data cleaning?
- Coding Example

## Correlation Analysis
Correlation analysis is a statistical technique used to examine the strength and direction of the relationship
between two or more variables.

It measures the degree to which changes in one variable are associated with changes in another variable.

### How to do it?
Use `Correlation Coefficients`, each of which measures the strength and the direction of the relationship between two variables
$$
\{Y,X\}
$$

### Popular Correlation Coefficients

### Pearson 
- Use for continuous data
- Measures the strength of the linear relationship between the variables
- Sensitive to outliers
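
As a minimal sketch of what the Pearson coefficient computes, the snippet below implements it directly with NumPy as the covariance of the two variables divided by the product of their standard deviations. The helper name `pearson_r` and the toy data are illustrative assumptions and are not taken from the hotel dataset. Replacing a single point with an extreme value is enough to change the coefficient drastically, which illustrates the sensitivity to outliers.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r: covariance of x and y divided by the product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_c = x - x.mean()   # center both variables
    y_c = y - y.mean()
    return float((x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum()))

# Hypothetical toy data: y is an exact linear function of x
x = np.arange(1, 11)
y = 2 * x + 1
r_clean = pearson_r(x, y)

# Same data, but the last value is replaced with an extreme outlier
y_outlier = y.astype(float)
y_outlier[-1] = -100.0
r_outlier = pearson_r(x, y_outlier)

print(f"clean: {r_clean:.3f}, with outlier: {r_outlier:.3f}")
```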

### Spearman
- Use for ordinal or ranked data
- Measures the strength of the `monotonic relationship` between the variables, which can be linear or non-linear
- More robust towards outliers
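
The difference between the two coefficients can be sketched on toy data (assumed here, not from the hotel dataset). Spearman is equivalent to computing Pearson on the ranks of the values, so a relationship that is monotonic but not linear still gets a perfect Spearman score while Pearson falls below 1.

```python
import pandas as pd

# Hypothetical toy data: y grows monotonically with x, but not linearly
x = pd.Series(range(1, 11))
y = x ** 3

# Pearson on the raw values measures linear association only
pearson = x.corr(y)

# Spearman = Pearson computed on the ranks of the values
spearman = x.rank().corr(y.rank())

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```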

Correlation analysis can identify variables that are highly correlated with each other.

If two variables carry almost the same information, the analysis may indicate that one of them is `redundant` and can be eliminated.
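
One possible way to flag redundant features is sketched below. The column names, the synthetic data, and the `0.95` threshold are all assumptions made for illustration; a suitable threshold depends on the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: "a_scaled" is almost an exact multiple of "a", "b" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "a_scaled": 3 * a + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})

threshold = 0.95  # assumed cut-off for "highly correlated"
corr = df.corr().abs()

# Keep only the upper triangle so each pair of features is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the second feature of every pair whose absolute correlation exceeds the threshold
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced_df = df.drop(columns=to_drop)
print(to_drop)
```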

### Dealing with Categorical Data
- Datasets often contain `non-numeric` features. Most learning models cannot take them as input directly, so they need to be converted to numbers.
- Use the `dtypes` attribute of the DataFrame (for example `data_df.dtypes`) to see the data types

### Two main types
- Label encoding
- One-hot encoding


### Label Encoding
Each `unique` category in the categorical variable is assigned a numerical label, typically starting at 0 and counting up (0, 1, 2, etc.).
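
A minimal sketch of label encoding with pandas, using a hypothetical `meal` feature. Casting to the `category` dtype and reading `.cat.codes` assigns integer labels in sorted category order, starting at 0.

```python
import pandas as pd

# Hypothetical categorical feature
meals = pd.Series(["BB", "HB", "SC", "BB", "FB"], name="meal")

# Categories are sorted alphabetically: BB -> 0, FB -> 1, HB -> 2, SC -> 3
codes = meals.astype("category").cat.codes
print(codes.tolist())
```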

### One-hot Encoding
A new binary feature is created `for each` category, and the value of that feature is set to 1 if the observation belongs to that category and 0 otherwise.
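
A minimal sketch of one-hot encoding with `pd.get_dummies`, again on a hypothetical `meal` feature. Each category becomes its own 0/1 column, and every row has exactly one column set to 1.

```python
import pandas as pd

# Hypothetical categorical feature
df = pd.DataFrame({"meal": ["BB", "HB", "SC", "BB"]})

# One binary column per category; dtype=int gives 0/1 instead of True/False
one_hot = pd.get_dummies(df, columns=["meal"], dtype=int)
print(one_hot)
```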

## Dealing with Categorical Data

In [None]:
import pandas as pd
# Load dataset
# Use the hotel_bookings_wip.csv from module3/my_notes.ipynb
data_df = pd.read_csv('../data/hotels_booking_wip.csv')
data_df.head()

In [None]:
# See dataset dtypes
#data_df.dtypes

# Task: Create a list of the columns/features that are of type 'object'
# Using a list comprehension with a predicate
cat_cols = [col for col in data_df.columns if data_df[col].dtype == 'object']
cat_cols

In [None]:
# Create a new DataFrame from only those columns
cat_df = data_df[cat_cols]
cat_df.head()

In [None]:
# Task: Print unique values of each feature in cat_df
for col in cat_df.columns:
    print(f'{col}: \n{cat_df[col].unique()}')

In [None]:
# Task: Convert 'arrival_date_month' from month names to month numbers
# January -> 1, February -> 2, etc.
month_map = {'January': 1, 'February': 2, 'March': 3, 'April': 4, 'May': 5, 'June': 6, 'July': 7, 'August': 8, 'September': 9, 'October': 10, 'November': 11, 'December': 12}

# Direct assignment can trigger a SettingWithCopyWarning when cat_df is a slice of data_df
#cat_df['arrival_date_month'] = cat_df['arrival_date_month'].map(month_map)

# Assigning through .loc[:, col] writes into the existing object column, which can leave its dtype as object.
# assign() returns a new DataFrame instead, so there is no warning and the column becomes numeric.
cat_df = cat_df.assign(arrival_date_month=cat_df['arrival_date_month'].map(month_map))


In [1]:
# Now do the 'country' feature
# Use label encoding from sklearn: conda install scikit-learn
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()     # create an object of type LabelEncoder

In [None]:
# Encode country feature
# cat_df['country'] = le.fit_transform(cat_df['country']) # this can give you a SettingWithCopyWarning
# assign() returns a new DataFrame, so the encoded column gets an integer dtype
cat_df = cat_df.assign(country=le.fit_transform(cat_df['country']))
cat_df.head()

### One-hot Encoding

In [None]:
# print all your categories
cat_df.columns

In [None]:
# One-hot encoding for ALL features as one, except: 'arrival_date_month' and 'country'
processed_cols = ['arrival_date_month', 'country']
# Create a new list of columns to be converted
one_hot_cols = [col for col in cat_df.columns if col not in processed_cols]
dummy_df = pd.get_dummies(data=cat_df, columns=one_hot_cols)

dummy_df.head()
# NOTE: TODO: Fix, 'reservation_status_date' feature

In [None]:
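# get_dummies() already keeps the columns it did not encode ('arrival_date_month', 'country'),
# so dummy_df can replace cat_df directly; concatenating the two would duplicate those columns
# and bring back the original 'object' columns
cat_df = dummy_df
cat_df.head()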

In [None]:
# Drop the original 'object' columns from the full dataset, keeping only the numeric ones
num_df = data_df.drop(columns=cat_cols)
num_df.head()

In [None]:
# Now, create your final df
final_df = pd.concat([num_df, cat_df], axis=1)
final_df.head()

In [None]:
# Save it as a new file
final_df.to_csv('../data/hotel_booking_new.csv', index=False)

## Correlation Analysis

In [None]:
# Create a correlation matrix
corr_matrix = final_df.corr(numeric_only=True)
corr_matrix

### Correlation and Causation

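**Correlation**: a measure of the extent of interdependence between variables

**Causation**: the relationship between cause and effect between two variables

Correlation does not imply causation: two variables can move together without one causing the other.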

### Pearson Correlation
Measures the linear dependence between two variables. The resulting coefficient is a value between `-1 and 1`, where
- `1`: Perfect positive linear correlation
- `0`: No linear correlation
- `-1`: Perfect negative linear correlation

Note: `Pearson` correlation is the default method in pandas for the `corr()` method
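
The three reference values can be reproduced on toy data (assumed here for illustration). The last case shows that a coefficient of `0` only rules out a linear relationship: `y_sym` depends entirely on `x`, yet its Pearson correlation with `x` is zero.

```python
import pandas as pd

# Hypothetical toy data, symmetric around zero
x = pd.Series([-2, -1, 0, 1, 2])

y_pos = 2 * x + 1    # increasing straight line
y_neg = -3 * x       # decreasing straight line
y_sym = x ** 2       # depends on x, but not linearly

# Series.corr() uses Pearson by default, like DataFrame.corr()
r_pos = x.corr(y_pos)
r_neg = x.corr(y_neg)
r_sym = x.corr(y_sym)

print(f"{r_pos:.2f}, {r_neg:.2f}, {r_sym:.2f}")
```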

In [None]:
# Compute the correlation between the 'is_canceled' and 'children' features
corr_matrix_two = final_df[['is_canceled', 'children']].corr()
corr_matrix_two

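Sometimes we want to know the significance of the correlation estimate.

#### P-Value
The p-value is the probability of observing a correlation at least as strong as the one measured if the two variables were in fact uncorrelated. A small p-value suggests that the correlation is statistically significant.

Normally we choose a significance level of `0.05`: we treat the correlation as statistically significant when the p-value is below `0.05`, which corresponds to a `95%` confidence level.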

P-Values
- $<$ 0.001: we say there is `strong` evidence that the correlation is significant
- $<$ 0.05: we say there is `moderate` evidence
- $<$ 0.1: we say there is `weak` evidence
- $>$ 0.1: we say there is `no` evidence
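
The thresholds above can be sketched as a small helper. The function name `evidence_label` is hypothetical, and the cut-offs are the rule-of-thumb values from the list rather than a formal statistical standard.

```python
def evidence_label(p_value):
    """Map a p-value to the rule-of-thumb evidence labels listed above."""
    if p_value < 0.001:
        return "strong"
    if p_value < 0.05:
        return "moderate"
    if p_value < 0.1:
        return "weak"
    return "no"

for p in (0.0002, 0.03, 0.08, 0.4):
    print(f"p = {p}: {evidence_label(p)} evidence")
```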

In [2]:
# conda install scipy
from scipy.stats import pearsonr, spearmanr

# pearsonr returns the correlation coefficient and its p-value
correlation, p_value = pearsonr(final_df['is_canceled'], final_df['children'])
print(f'Pearson Correlation:[{correlation}]\n P_value:[{p_value}]')
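
`spearmanr` is called the same way and also returns a coefficient together with a p-value. The sketch below uses synthetic data (an assumption for illustration, not the hotel dataset) with a monotonic but non-linear relationship.

```python
import numpy as np
from scipy.stats import spearmanr

# Synthetic data: y increases monotonically with x, but not linearly
x = np.arange(20)
y = np.exp(x / 5)

# spearmanr returns the rank correlation coefficient and its p-value
rho, p_value = spearmanr(x, y)
print(f'Spearman Correlation:[{rho}]\n P_value:[{p_value}]')
```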