## Hypothesis Testing of diamonds

We might want to know whether our sample and sub-samples are representative of diamonds in general. We might also want to draw conclusions about how certain diamond features influence price. To that end, you are asked to perform two statistical tests:

Test 1 - one sample vs constant hypothesis test. The available literature puts the average price of a diamond at around 4000 USD. The aim is to test whether the mean price in our sample is significantly different from the literature value. Give some conclusions about the implications of your test results.

Test 2 - two independent samples. Our sample includes diamonds with different features (carat, cut, color, clarity, etc.). It seems clear that carat plays an important role in price. However, it is less clear whether the prices of some "sub-groups" are significantly different from each other. These are the "sub-groups" you might be suspicious about:

Sub-test assigned: Sub-Test2: Good cut + color E vs. Good cut + color F

In [1]:
import numpy as np
import pandas as pd

from scipy.stats import t, norm, ttest_1samp, ttest_ind

In [2]:
path_database = '../data/diamonds_train.csv'

In [3]:
diamonds = pd.read_csv(path_database)
diamonds.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,1.21,Premium,J,VS2,62.4,58.0,4268,6.83,6.79,4.25
1,0.32,Very Good,H,VS2,63.0,57.0,505,4.35,4.38,2.75
2,0.71,Fair,G,VS1,65.5,55.0,2686,5.62,5.53,3.65
3,0.41,Good,D,SI1,63.8,56.0,738,4.68,4.72,3.0
4,1.02,Ideal,G,SI1,60.5,59.0,4882,6.55,6.51,3.95


In [4]:
diamonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40455 entries, 0 to 40454
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   carat    40455 non-null  float64
 1   cut      40455 non-null  object 
 2   color    40455 non-null  object 
 3   clarity  40455 non-null  object 
 4   depth    40455 non-null  float64
 5   table    40455 non-null  float64
 6   price    40455 non-null  int64  
 7   x        40455 non-null  float64
 8   y        40455 non-null  float64
 9   z        40455 non-null  float64
dtypes: float64(6), int64(1), object(3)
memory usage: 3.1+ MB


In [5]:
diamonds.shape

(40455, 10)

### Test 1: One sample vs constant hypothesis test

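In this test, I use two different methods to test the same hypothesis: whether the mean price in the sample is significantly different from the literature value of 4000 USD. The null hypothesis (H0) is that the mean price equals 4000 USD; the alternative hypothesis is that it does not.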

#### First method: scipy

In [6]:
test_1 = ttest_1samp(diamonds['price'], 4000)
print(f'The statistic value returns: {test_1.statistic}')
print(f'The scipy pvalue returns: {test_1.pvalue}')

The statistic value returns: -3.604902369125729
The scipy pvalue returns: 0.00031264532833074845


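To decide, I compare the p-value with a significance level (alpha) of 5%. If the p-value is below alpha, I reject the null hypothesis that the mean price equals 4000 USD:

In [7]:
print(f'Can I reject the null hypothesis at alpha = 0.05?: {test_1.pvalue < 0.05}')

Can I reject the null hypothesis at alpha = 0.05?: True

The p-value is well below 0.05, so the mean price in this sample is significantly different from 4000 USD. The negative statistic shows that the sample mean lies below the literature value.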


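*Note: by default, scipy's ttest_1samp returns a two-tailed p-value. Dividing it by two gives the one-tailed p-value in the direction of the observed statistic, which here is the lower tail (mean price below 4000 USD) because the statistic is negative.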

In [8]:
p_value_onetail = test_1.pvalue / 2
print(f'The pvalue corresponds to: {p_value_onetail}')

The pvalue corresponds to: 0.00015632266416537423
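As an alternative to halving the p-value by hand, recent SciPy versions (1.6 and later) accept an `alternative` argument in `ttest_1samp`. The sketch below illustrates this on synthetic data; the `prices` array is a made-up stand-in for `diamonds['price']`, not the real dataset, so the printed numbers will differ from the ones above.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Synthetic, right-skewed stand-in for diamonds['price'] (assumption, not the real data)
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=8.0, sigma=0.7, size=5000)

# Default: two-sided test, H1 is "population mean != 4000"
two_sided = ttest_1samp(prices, 4000)

# One-sided test, H1 is "population mean < 4000" (requires SciPy >= 1.6)
one_sided = ttest_1samp(prices, 4000, alternative='less')

print(f'statistic:         {two_sided.statistic}')
print(f'two-sided p-value: {two_sided.pvalue}')
print(f'one-sided p-value: {one_sided.pvalue}')
```

When the statistic is negative, the `alternative='less'` p-value equals half of the two-sided one. When the statistic is positive, it equals one minus that half, which is why blind halving is only valid in the direction of the observed statistic.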


#### Second method: manual calculation

In [9]:
# Gather the quantities needed for the manual calculation
mu = 4000
mu_hat = diamonds['price'].mean()
n = diamonds.shape[0]
std_hat = diamonds['price'].std()

With these values I can compute the test statistic, which I store as t_test.
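Written out, this is the usual one-sample t statistic. The symbols mirror the variable names above: $\hat{\mu}$ is mu_hat, $\mu$ is the literature value, $n$ is the sample size, and $\hat{\sigma}$ is std_hat, the sample standard deviation (pandas computes it with $n - 1$ in the denominator by default).

$$
t = \frac{\hat{\mu} - \mu}{\hat{\sigma} / \sqrt{n}}
$$

Under the null hypothesis, $t$ follows a Student's t distribution with $n - 1$ degrees of freedom.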

In [10]:
t_test = (mu_hat - mu) / (std_hat / np.sqrt(n))
print(f'The statistic value returns: {t_test}')

The statistic value returns: -3.604902369125729


In [11]:
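# Student's t distribution with n - 1 degrees of freedom
rv = t(df=n-1)
# Lower-tail (one-tailed) p-value; it matches the halved scipy p-value above
p_value = rv.cdf(t_test)
print(f'The pvalue corresponds to: {p_value}')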

The pvalue corresponds to: 0.00015632266416537423
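The manual calculation above gives the lower-tail p-value. To reproduce scipy's default two-tailed p-value by hand, the tail area beyond |t| can be doubled. The following is a minimal sketch on synthetic data; `prices` is a hypothetical stand-in for `diamonds['price']`.

```python
import numpy as np
from scipy.stats import t, ttest_1samp

# Synthetic stand-in for diamonds['price'] (assumption, not the real data)
rng = np.random.default_rng(1)
prices = rng.lognormal(mean=8.0, sigma=0.7, size=5000)

mu = 4000
n = prices.size
# One-sample t statistic, using the sample standard deviation (ddof=1)
t_stat = (prices.mean() - mu) / (prices.std(ddof=1) / np.sqrt(n))

rv = t(df=n - 1)
p_lower = rv.cdf(t_stat)              # one-tailed, H1: mean < mu
p_two_sided = 2 * rv.sf(abs(t_stat))  # two-tailed, H1: mean != mu

# Cross-check against scipy's own two-sided result
reference = ttest_1samp(prices, mu)
print(f'manual statistic: {t_stat}, scipy statistic: {reference.statistic}')
print(f'manual two-tailed p-value: {p_two_sided}, scipy p-value: {reference.pvalue}')
```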


### Test 2: Two independent samples

In this second test I check whether two independent samples differ significantly in price. I was assigned the following test:
- Sub-Test 2: Good cut + color E vs. Good cut + color F

In [12]:
sample1 = diamonds[(diamonds['cut'] == 'Good') & (diamonds['color'] == 'E')]
sample1.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
150,1.01,Good,E,SI1,64.3,59.0,5379,6.23,6.28,4.02
169,1.19,Good,E,VS2,57.4,60.0,6937,6.9,7.0,3.99
196,0.9,Good,E,SI1,63.9,57.0,3795,6.12,6.08,3.9
369,0.54,Good,E,SI2,63.8,54.0,1163,5.17,5.18,3.3
390,0.4,Good,E,VVS2,58.5,60.0,1059,4.82,4.86,2.83


In [13]:
sample2 = diamonds[(diamonds['cut'] == 'Good') & (diamonds['color'] == 'F')]
sample2.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
122,0.52,Good,F,SI2,58.7,64.0,1350,5.24,5.26,3.08
373,0.7,Good,F,SI1,64.3,63.0,2059,5.58,5.55,3.58
376,0.9,Good,F,VS1,63.3,59.0,4508,6.03,6.07,3.83
448,1.01,Good,F,VS1,59.0,62.0,6794,6.56,6.6,3.88
479,0.52,Good,F,SI1,63.6,62.0,1357,5.07,4.96,3.2


In [14]:
print(f'The size of sample one is: {sample1.shape[0]}')
print(f'The size of sample two is: {sample2.shape[0]}')

The size of sample one is: 690
The size of sample two is: 664


These are two independent samples, and I do not assume that they have equal variances. I therefore use Welch's t-test, which is more robust than Student's t-test because it does not require the equal-variance assumption. The null hypothesis (H0) is that the mean price of the first sample equals the mean price of the second sample; the alternative hypothesis is that the two means differ.
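One way to look at the variances before choosing a test is to compare the sample variances directly and, optionally, run Levene's test from `scipy.stats`. This is only a sketch on synthetic data: `group_e` and `group_f` are hypothetical stand-ins for `sample1['price']` and `sample2['price']`, and this check is not part of the original analysis.

```python
import numpy as np
from scipy.stats import levene

# Synthetic stand-ins for the two price samples (assumption, not the real data)
rng = np.random.default_rng(2)
group_e = rng.lognormal(mean=8.0, sigma=0.8, size=690)
group_f = rng.lognormal(mean=8.0, sigma=0.9, size=664)

# Sample variances (ddof=1) for a first, informal comparison
print(f'variance E: {np.var(group_e, ddof=1):.1f}')
print(f'variance F: {np.var(group_f, ddof=1):.1f}')

# Levene's test, H0: both groups have equal variances.
# center='median' is the default and is more robust for skewed data.
result = levene(group_e, group_f, center='median')
print(f'Levene statistic: {result.statistic}, p-value: {result.pvalue}')
```

Welch's test remains a safe choice whichever way this check comes out, since it does not rely on equal variances.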

In [15]:
?ttest_ind

In [17]:
ttest_ind(sample1['price'], sample2['price'], equal_var=False)

Ttest_indResult(statistic=-0.4406178833837438, pvalue=0.6595600994188809)

The statistic is close to 0 and the p-value (about 0.66) is far greater than the 5% significance level, so I fail to reject the null hypothesis. The data do not provide evidence of a significant difference in mean price between Good-cut diamonds of color E and those of color F.
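For reference, Welch's statistic and its degrees of freedom (the Welch-Satterthwaite approximation) can also be computed by hand, in the same spirit as the manual version of Test 1. The sketch below uses synthetic data; `a` and `b` are hypothetical stand-ins for `sample1['price']` and `sample2['price']`.

```python
import numpy as np
from scipy.stats import t, ttest_ind

# Synthetic stand-ins for the two price samples (assumption, not the real data)
rng = np.random.default_rng(3)
a = rng.lognormal(mean=8.0, sigma=0.8, size=690)
b = rng.lognormal(mean=8.0, sigma=0.9, size=664)

# Squared standard error of each sample mean
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size

# Welch's t statistic
t_stat = (a.mean() - b.mean()) / np.sqrt(va + vb)

# Welch-Satterthwaite approximation of the degrees of freedom
df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))

# Two-tailed p-value
p_value = 2 * t.sf(abs(t_stat), df)

# Cross-check against scipy's Welch test
reference = ttest_ind(a, b, equal_var=False)
print(f'manual: statistic={t_stat}, p-value={p_value}, df={df}')
print(f'scipy:  statistic={reference.statistic}, p-value={reference.pvalue}')
```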