# Exploratory Data Analysis

## Import   

In [1]:
## Import
import os
import sys

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

## Exploring data

### Read data from file

In [3]:
cleaned_data = pd.read_csv('./data/cleaned_data.csv')
cleaned_data

Unnamed: 0,company,company_size,job_title,level,domain,yoe_total,yoe_at_company,base,stock,bonus,total_compensation,location
0,Logitech,7250,Software Engineer,I4,Testing (SDET),10,5,190000,10000,0,200000,San Francisco Bay Area
1,Logitech,7250,Software Engineer,I2,ML / AI,4,3,126000,0,7000,133000,"Vancouver, WA"
2,Logitech,7250,Software Engineer,I3,Testing (SDET),11,11,120000,5000,12000,137000,"San Francisco, CA"
3,Logitech,7250,Software Engineer,I4,Production,8,8,100000,10000,0,110000,"Hsin-chu, TP, Taiwan"
4,Logitech,7250,Software Engineer,I4,Android,13,1,185000,15000,18500,218500,"San Francisco, CA"
...,...,...,...,...,...,...,...,...,...,...,...,...
1713,Snap,6250,Marketing,L3,Marketing,7,3,159000,60000,24000,243000,"New York, NY"
1714,Snap,6250,Marketing,L4,Design,6,0,150000,76000,10000,236000,"Los Angeles, CA"
1715,Snap,6250,Marketing,L3,Sales,8,3,134000,10000,20000,164000,"Los Angeles, CA"
1716,Snap,6250,Marketing,L3,Analyst role,4,4,120000,15000,20000,155000,"Los Angeles, CA"


In [4]:
cleaned_data.shape

(1718, 12)

In [5]:
cleaned_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1718 entries, 0 to 1717
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   company             1718 non-null   object
 1   company_size        1718 non-null   int64 
 2   job_title           1718 non-null   object
 3   level               1718 non-null   object
 4   domain              1718 non-null   object
 5   yoe_total           1718 non-null   int64 
 6   yoe_at_company      1718 non-null   int64 
 7   base                1718 non-null   int64 
 8   stock               1718 non-null   int64 
 9   bonus               1718 non-null   int64 
 10  total_compensation  1718 non-null   int64 
 11  location            1718 non-null   object
dtypes: int64(7), object(5)
memory usage: 161.2+ KB


> **Observation:**
> - The data has 12 columns and 1718 rows.
> - No column has missing values.
> - With more than 1000 rows, the dataset is large enough for basic statistical analysis.
> - Seven columns are numerical (`int64`) and five are text (`object`). Statistical methods apply directly to the numerical columns, while the text columns (`company`, `job_title`, `level`, `domain`, `location`) need grouping or encoding first.
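
The observations above can be checked programmatically rather than read off the `info()` output. The snippet below is a minimal sketch: `sample` is a small stand-in frame built from a few of the rows shown earlier (an assumption made so the snippet runs on its own), and in the notebook you would apply the same calls to `cleaned_data`.

```python
import pandas as pd

# Stand-in data: three of the rows shown in the table above.
# In the notebook, use `cleaned_data` instead of `sample`.
sample = pd.DataFrame({
    "company": ["Logitech", "Logitech", "Logitech"],
    "level": ["I4", "I2", "I3"],
    "yoe_total": [10, 4, 11],
    "base": [190000, 126000, 120000],
    "total_compensation": [200000, 133000, 137000],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()

# How many columns there are of each dtype
dtype_counts = sample.dtypes.value_counts()

# Split column names into numerical and non-numerical groups
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print(dtype_counts)
print("numerical:", numeric_cols)
print("categorical:", categorical_cols)
```

Separating the numerical and text column names early is convenient because most of the statistics that follow only make sense for the numerical group.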


### Numerical analysis using descriptive statistics

Descriptive statistics summarize the characteristics of numerical features. They include:
- The mean (and you can go further with variants such as the arithmetic mean, geometric mean, and harmonic mean)
- The median
- The mode
- Quantiles (Quartiles, Percentiles, Deciles, Crocodiles?)
- Range and IQR (Interquartile Range), which is what the box in a box plot represents
- Variance and Standard deviation (std dev)
- Coefficient of Variation
- Skewness
- Kurtosis
- Standard Error (of the sample mean)
- Moments
- Covariance and Correlation

Given the scope of this lab, you only need basic pandas functions to calculate basic descriptive statistics and draw insights from them.
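
As a starting point, most of the measures listed above are available as single pandas calls. The snippet below is a sketch, not a complete analysis: `sample` holds only the five Logitech rows shown earlier (an assumption made so the snippet runs on its own), so the numbers it prints describe those five rows and not the full dataset. In the notebook, replace `sample` with `cleaned_data`.

```python
import pandas as pd

# Stand-in data: the five Logitech rows shown in the table above.
# In the notebook, replace `sample` with `cleaned_data`.
sample = pd.DataFrame({
    "yoe_total": [10, 4, 11, 8, 13],
    "base": [190000, 126000, 120000, 100000, 185000],
    "total_compensation": [200000, 133000, 137000, 110000, 218500],
})

# count, mean, std, min, quartiles, and max for every numerical column
summary = sample.describe()

# A few measures that describe() does not report, for one column
comp = sample["total_compensation"]
extra = {
    "median": comp.median(),
    "iqr": comp.quantile(0.75) - comp.quantile(0.25),
    "coef_of_variation": comp.std() / comp.mean(),
    "skewness": comp.skew(),
    "kurtosis": comp.kurt(),  # excess kurtosis (a normal distribution gives 0)
    "std_error": comp.sem(),  # standard error of the sample mean
}

print(summary)
print(pd.Series(extra))
```

Comparing the mean with the median is a quick first check for skew: compensation data often has a long right tail, in which case the mean sits above the median.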

### Describe the correlation between features in the dataset

Given the features available in the dataset, we want to identify and analyse the relationships between them, and then determine which features contribute most to our analysis goal. A correlation matrix is a table of the correlation coefficients between pairs of variables, and Python lets us both compute and visualise it.

Now make a correlation matrix, visualize it, and describe the insights you observe.

In [None]:
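data_copy = cleaned_data.copy()

# Correlation is only defined for numerical columns. In pandas 2.0 and later,
# corr() raises an error if text columns such as `company` are included.
numeric_data = data_copy.select_dtypes(include='number')

fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(numeric_data.corr(), cmap='RdBu', center=0, ax=ax)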
plt.show()
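
A heatmap shows the overall pattern, but a ranked list is often easier to describe in words. The snippet below is one possible sketch: it keeps only the numerical columns, computes the Pearson correlation matrix, and sorts the features by the strength of their correlation with `total_compensation`. The choice of `total_compensation` as the target and the five-row `sample` frame (taken from the rows shown earlier) are assumptions for illustration. With so few rows the coefficients are not meaningful, so run the same steps on `cleaned_data` in the notebook.

```python
import pandas as pd

# Stand-in data: the five Logitech rows shown in the table above.
# In the notebook, replace `sample` with `cleaned_data`.
sample = pd.DataFrame({
    "job_title": ["Software Engineer"] * 5,  # text column, excluded below
    "yoe_total": [10, 4, 11, 8, 13],
    "base": [190000, 126000, 120000, 100000, 185000],
    "stock": [10000, 0, 5000, 10000, 15000],
    "bonus": [0, 7000, 12000, 0, 18500],
    "total_compensation": [200000, 133000, 137000, 110000, 218500],
})

# Pearson correlation (the pandas default) over numerical columns only
corr = sample.select_dtypes(include="number").corr()

# Rank the other features by absolute correlation with the target
target = "total_compensation"
ranked = (
    corr[target]
    .drop(target)  # a column's correlation with itself is always 1
    .sort_values(key=lambda s: s.abs(), ascending=False)
)

print(ranked)
```

Note that `base`, `stock`, and `bonus` are components of `total_compensation`, so a strong correlation between them and the total is expected by construction and should be interpreted accordingly.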