# EDA Bivariate Analysis: pandas exercises

Load the diamonds dataset from seaborn with the following snippet.

```python
import seaborn as sns

data = sns.load_dataset('diamonds')
data
```

In [1]:
import seaborn as sns
import pandas as pd
from scipy.stats import kruskal, shapiro, normaltest, f_oneway, chi2_contingency
data = sns.load_dataset('diamonds')
print(data.head(),'\n','---')
data.info()

   carat      cut color clarity  depth  table  price     x     y     z
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75 
 ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   carat    53940 non-null  float64 
 1   cut      53940 non-null  category
 2   color    53940 non-null  category
 3   clarity  53940 non-null  category
 4   depth    53940 non-null  float64 
 5   table    53940 non-null  float64 
 6   price    53940 non-null  int64   
 7   x        53940 non-null  float64 
 8   y        53940 non-null  float64 
 9   z        53940 non-null  

## 1. Calculate the correlation of all the numeric variables and analyze the results.

In [2]:
print('Pearson correlation coefficient','\n','---','\n', data.corr(numeric_only=True),'\n','---','\n') # default method='pearson' (parametric)
print('Spearman rank correlation coefficient','\n','---','\n', data.corr(method='spearman', numeric_only=True)) # (not parametric)

Pearson correlation coefficient 
 --- 
           carat     depth     table     price         x         y         z
carat  1.000000  0.028224  0.181618  0.921591  0.975094  0.951722  0.953387
depth  0.028224  1.000000 -0.295779 -0.010647 -0.025289 -0.029341  0.094924
table  0.181618 -0.295779  1.000000  0.127134  0.195344  0.183760  0.150929
price  0.921591 -0.010647  0.127134  1.000000  0.884435  0.865421  0.861249
x      0.975094 -0.025289  0.195344  0.884435  1.000000  0.974701  0.970772
y      0.951722 -0.029341  0.183760  0.865421  0.974701  1.000000  0.952006
z      0.953387  0.094924  0.150929  0.861249  0.970772  0.952006  1.000000 
 --- 

Spearman rank correlation coefficient 
 --- 
           carat     depth     table     price         x         y         z
carat  1.000000  0.030104  0.194980  0.962883  0.996117  0.995572  0.993183
depth  0.030104  1.000000 -0.245061  0.010020 -0.023442 -0.025425  0.103498
table  0.194980 -0.245061  1.000000  0.171784  0.202231  0.195734  0.1

The correlations between the numeric variables of the dataset ('carat', 'depth', 'table', 'price', 'x', 'y' and 'z') were estimated with the .corr() method. By default it computes the Pearson correlation coefficient, which measures linear association and whose inference assumes approximately normal variables. Since that assumption was not checked in this exercise, the non-parametric Spearman rank correlation coefficient was also computed. Both methods give the same overall picture, although the Spearman coefficients for the strongly correlated pairs are somewhat higher (for example, 0.96 versus 0.92 for 'carat' and 'price'), which suggests that these relationships are monotonic but not strictly linear.
Based on the Spearman coefficients, 'depth' and 'table' show at most weak correlations with the other variables (the largest in magnitude among the values shown is -0.25, between 'depth' and 'table'). The remaining numeric variables are strongly and positively correlated:
- 'carat' correlated with 'price', 'x', 'y' and 'z'.
- 'price' also correlated with 'x', 'y' and 'z'.
- 'x' also correlated with 'y' and 'z'.
- 'y' also correlated with 'z'.
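
Reading strong pairs off a full correlation matrix by eye is error-prone, so the listing above can also be produced programmatically. The following is a minimal sketch, not part of the original solution: the helper name `strong_pairs`, the 0.8 threshold, and the synthetic `demo` frame are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def strong_pairs(df, method="spearman", threshold=0.8):
    """Return variable pairs whose |correlation| is at least `threshold` (hypothetical helper)."""
    corr = df.corr(method=method, numeric_only=True)
    # Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (if any survive stacking) compare as False and are dropped here.
    strong = pairs[pairs.abs() >= threshold]
    return strong.sort_values(key=abs, ascending=False)


# Synthetic stand-in for the diamonds data:
# 'b' is a monotonic function of 'a', 'c' is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({"a": a, "b": a ** 3, "c": rng.normal(size=200)})

result = strong_pairs(demo)
print(result)
```

On the real data, calling `strong_pairs(data)` should list the same pairs as the bullets above, provided the chosen threshold matches what is considered "strong".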

## 2. Analyze the correlation of the variable *color* and the variable *price*. Use the Kruskal-Wallis test.

In [3]:
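# Two equivalent ways to build one 'price' sample per 'color' level
print(kruskal(*[data.price[data.color == c] for c in data.color.unique()]),'\n','---')
print(kruskal(*[group.price for name,group in data.groupby('color', observed=True)])) # observed=True avoids the pandas FutureWarning for categorical keys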

KruskalResult(statistic=1335.570626350983, pvalue=2.1580813998043093e-285) 
 ---
KruskalResult(statistic=1335.570626350983, pvalue=2.1580813998043093e-285)


The non-parametric Kruskal-Wallis test was used to check whether 'price' differs across the categories of 'color'. The result (H ≈ 1335.6, p < 0.05) indicates statistically significant differences, meaning that at least one 'color' category has a 'price' distribution that differs from the others. The test does not identify which categories differ.
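
Since the test alone does not show how the groups differ, it can help to look at the group medians next to it. The sketch below is an illustration under stated assumptions: the helper `kruskal_by_group` and the synthetic `demo` data are hypothetical and not part of the original exercise.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal


def kruskal_by_group(df, group_col, value_col):
    """Run Kruskal-Wallis across the levels of `group_col` and return the group medians."""
    # observed=True skips category levels with no rows, which would yield empty samples.
    grouped = df.groupby(group_col, observed=True)[value_col]
    samples = [values.to_numpy() for _, values in grouped]
    statistic, pvalue = kruskal(*samples)
    return statistic, pvalue, grouped.median()


# Synthetic, skewed data with three groups shifted by 0, 1 and 2 units.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "grade": np.repeat(["low", "mid", "high"], 100),
    "value": np.concatenate([
        rng.exponential(1.0, 100),
        rng.exponential(1.0, 100) + 1.0,
        rng.exponential(1.0, 100) + 2.0,
    ]),
})

h, p, medians = kruskal_by_group(demo, "grade", "value")
print(f"H = {h:.2f}, p = {p:.3g}")
print(medians)
```

For the exercise itself, the equivalent call would be `kruskal_by_group(data, 'color', 'price')`.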

## 3. Analyze the correlation of the variable *table* and the variable *clarity*. Can we use the ANOVA method? If so, use it instead of Kruskal-Wallis.

In [4]:
print(kruskal(*[group.table for name,group in data.groupby('clarity', observed=True)]),'\n','---')
print(shapiro(data.table),'\n','---')

KruskalResult(statistic=1508.77325361572, pvalue=0.0) 
 ---
ShapiroResult(statistic=0.9539790153503418, pvalue=0.0) 
 ---




In [5]:
print(normaltest(data.table)) # if normal: f_oneway(*[group.table for name,group in data.groupby('clarity')])

NormaltestResult(statistic=8034.751738354047, pvalue=0.0)


To check whether the continuous variable 'table' follows a normal distribution, the Shapiro-Wilk test was used. It emits a warning related to the sample size (the p-value may not be accurate for N > 5000), so normality was also tested with the D'Agostino-Pearson test (normaltest). Both tests reject normality for 'table' (p < 0.05), so a parametric ANOVA was not used to compare 'table' across groups. Note that both tests were applied to 'table' as a whole, whereas ANOVA's normality assumption strictly concerns the values within each group (the residuals).
The Kruskal-Wallis test was therefore used to check whether 'table' differs across the categories of 'clarity'. It shows statistically significant differences (p < 0.05) in 'table' among the 'clarity' groups.
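
The decision between ANOVA and Kruskal-Wallis can be sketched as a small helper that checks the assumptions per group. This is only one possible approach and not the original solution: the function name `compare_groups`, the `alpha` parameter, and the use of Levene's test for equal variances are assumptions added here. With very large samples such as this dataset, normality tests tend to reject even for small deviations, so the outcome should be read with care.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway, kruskal, levene, normaltest


def compare_groups(df, group_col, value_col, alpha=0.05):
    """Pick one-way ANOVA if its assumptions look plausible, else Kruskal-Wallis."""
    grouped = df.groupby(group_col, observed=True)[value_col]
    samples = [values.to_numpy() for _, values in grouped]
    # Normality is checked within each group (D'Agostino-Pearson test).
    normal = all(normaltest(s).pvalue >= alpha for s in samples)
    # Homogeneity of variances across groups (Levene's test).
    equal_var = levene(*samples).pvalue >= alpha
    if normal and equal_var:
        return "anova", f_oneway(*samples)
    return "kruskal", kruskal(*samples)


# Synthetic, strongly skewed groups: the normality check should fail.
rng = np.random.default_rng(2)
demo = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 200),
    "value": np.concatenate([
        rng.exponential(1.0, 200),
        rng.exponential(1.0, 200) + 1.0,
        rng.exponential(1.0, 200) + 2.0,
    ]),
})

chosen, outcome = compare_groups(demo, "group", "value")
print(chosen, outcome)
```

For the exercise, the call would be `compare_groups(data, 'clarity', 'table')`.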

## 4. Measure the association between *cut* and *color* variables.

In [6]:
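contingency = pd.crosstab(data.cut,data.color)
print(contingency,'\n','---')
# chi2_contingency returns (statistic, p-value, degrees of freedom, expected frequencies)
chi2_statistic, _, _, expected = chi2_contingency(contingency)
# The expected frequencies sum to the sample size n and have the same shape as the contingency table
cramers_v = (chi2_statistic / (expected.sum() * (min(expected.shape) - 1))) **0.5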
print(f"Chi-squared statistic: {chi2_statistic} \nCramér's V: {cramers_v}")

color         D     E     F     G     H     I    J
cut                                               
Ideal      2834  3903  3826  4884  3115  2093  896
Premium    1603  2337  2331  2924  2360  1428  808
Very Good  1513  2400  2164  2299  1824  1204  678
Good        662   933   909   871   702   522  307
Fair        163   224   312   314   303   175  119 
 ---
Chi-squared statistic: 310.3179005211542 
Cramér's V: 0.03792433266357063


The association between 'cut' and 'color' was measured with Cramér's V, which ranges from 0 (no association) to 1 (perfect association). Its value (0.038) is close to 0, so the association between the two variables is negligible.
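
The calculation above follows the formula V = sqrt(chi2 / (n * (min(r, c) - 1))), where n is the sample size and r and c are the numbers of rows and columns of the contingency table. As a hedged sketch, it can be wrapped in a reusable function and sanity-checked on two small tables whose answer is known in advance. The function name `cramers_v` and the example tables are illustrative assumptions, not part of the original solution.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(table):
    """Cramér's V for a contingency table of observed counts."""
    observed = np.asarray(table)
    # Note: chi2_contingency applies Yates' correction by default for 2x2 tables.
    chi2, _, _, _ = chi2_contingency(observed)
    n = observed.sum()                 # total sample size
    k = min(observed.shape) - 1        # min(rows, columns) - 1
    return float(np.sqrt(chi2 / (n * k)))


# Perfect association: each row category maps to exactly one column category.
perfect = pd.DataFrame(np.diag([10, 10, 10]))
# Independence: the second row is exactly twice the first.
independent = np.array([[10, 20, 30], [20, 40, 60]])

v_perfect = cramers_v(perfect)
v_independent = cramers_v(independent)
print(v_perfect, v_independent)
```

On the exercise data, `cramers_v(pd.crosstab(data.cut, data.color))` should reproduce the value printed above.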