# Student's test


In [0]:
from __future__ import division

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

from scipy import stats
from statsmodels.stats.weightstats import CompareMeans, DescrStatsW

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

The calcium level in the blood of healthy young women averages 9.5 milligrams per deciliter, with a characteristic standard deviation of 0.4 mg/dl. In a rural hospital in Guatemala, calcium levels were measured for 160 healthy pregnant women at their first prenatal visit; the sample mean was 9.57 mg/dl. Can we claim that the mean calcium level in this population differs from 9.5?

Calculate the achieved significance level (p-value). Since only the sample mean and the variance are known, and not the sample itself, the standard test functions cannot be used; the formula for the achieved significance level has to be implemented by hand.

Round the answer to four digits after the decimal point.
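With a known standard deviation, the natural tool here is a one-sample Z-test: the statistic is the difference between the sample mean and the hypothesized mean divided by sigma / sqrt(n), and the two-sided p-value is the probability that a standard normal variable is at least as large in absolute value. The following is a minimal sketch using only the standard library; the helper name `two_sided_z_p_value` is illustrative and not part of the original notebook.

```python
import math


def two_sided_z_p_value(sample_mean, mu0, sigma, n):
    """Return the Z-statistic and two-sided p-value of a one-sample Z-test.

    Assumes sigma is the known population standard deviation.
    """
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Numbers from the problem statement above.
z, p_value = two_sided_z_p_value(sample_mean=9.57, mu0=9.5, sigma=0.4, n=160)
print(f"z = {z:.4f}, p-value = {p_value:.4f}")
```

Using `math.erfc` avoids the loss of precision that `1 - cdf(z)` can suffer for large statistics.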

In [0]:
mean = 9.5
sigma = 0.4
num = 160
mean_sample = 9.57

In [0]:
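z = (mean_sample - mean) / (sigma / np.sqrt(num))  # Z-statistic
p = 2 * (1 - stats.norm.cdf(np.abs(z)))  # two-sided achieved significance level

In [10]:
print(f'Z-statistic: {z:.4f}')
print(f'p-value: {p:.4f}')

Z-statistic: 2.2136
p-value: 0.0269

At the conventional 0.05 level, the hypothesis that the mean calcium level equals 9.5 mg/dl is rejected.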


## Diamond's case

The dataset contains the price and size characteristics of 53,940 diamonds.

Set aside a random 25% of the observations as a test set using the sklearn.model_selection.train_test_split function (fix random_state = 1). On the training set, fit two regression models:

* linear regression using LinearRegression without parameters

* random forest using RandomForestRegressor with random_state = 1.

Which model predicts the price of diamonds better? Make predictions on the test set and compute the absolute deviations of the predictions from the true prices. Test the hypothesis that the two models have the same average prediction quality and calculate the achieved significance level. Is the hypothesis of equal model quality rejected against the two-sided alternative at the significance level α = 0.05?
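Because both models are evaluated on the same test observations, the comparison is a paired one: take the per-observation difference of absolute errors and test whether its mean is zero. The sketch below illustrates the mechanics on synthetic numbers with the standard library only. It swaps Student's t distribution for a large-sample normal approximation, and the function name `paired_abs_error_test` and the toy data are assumptions made for illustration, not part of the notebook.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_abs_error_test(y_true, pred_a, pred_b, alpha=0.05):
    """Paired comparison of two models' absolute errors.

    Uses a normal approximation to the t distribution, which is only
    reasonable for large samples. Returns the statistic, the two-sided
    p-value, and a (1 - alpha) confidence interval for the mean difference.
    """
    # Positive difference: model A has the larger absolute error.
    diffs = [abs(a - y) - abs(b - y) for y, a, b in zip(y_true, pred_a, pred_b)]
    d_mean = mean(diffs)
    se = stdev(diffs) / math.sqrt(len(diffs))
    stat = d_mean / se
    p_value = 2 * (1 - NormalDist().cdf(abs(stat)))
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * se
    return stat, p_value, (d_mean - half_width, d_mean + half_width)


# Synthetic example: model A misses by 3 or 5, model B by 1 or 2.
y_true = list(range(100))
pred_a = [y + 3 if i % 2 == 0 else y - 5 for i, y in enumerate(y_true)]
pred_b = [y + 1 if i % 2 == 0 else y - 2 for i, y in enumerate(y_true)]

stat, p_value, ci = paired_abs_error_test(y_true, pred_a, pred_b)
print(f"statistic = {stat:.2f}, p-value = {p_value:.3g}, 95% CI = {ci}")
```

A confidence interval that lies entirely above zero indicates that model A has the larger average absolute error, which is the same reading applied to the interval for the two regressors.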

In [0]:
diamonds = pd.read_csv('https://raw.githubusercontent.com/OzmundSedler/100-Days-Of-ML-Code/master/week_11/datasets/diamonds.txt', sep="\t")

In [15]:
diamonds.describe()

| | carat | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|
| count | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 |
| mean | 0.79794 | 61.749405 | 57.457184 | 3932.799722 | 5.731157 | 5.734526 | 3.538734 |
| std | 0.474011 | 1.432621 | 2.234491 | 3989.439738 | 1.121761 | 1.142135 | 0.705699 |
| min | 0.2 | 43.0 | 43.0 | 326.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.4 | 61.0 | 56.0 | 950.0 | 4.71 | 4.72 | 2.91 |
| 50% | 0.7 | 61.8 | 57.0 | 2401.0 | 5.7 | 5.71 | 3.53 |
| 75% | 1.04 | 62.5 | 59.0 | 5324.25 | 6.54 | 6.54 | 4.04 |
| max | 5.01 | 79.0 | 95.0 | 18823.0 | 10.74 | 58.9 | 31.8 |


In [16]:
diamonds.head()

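| | carat | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |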


In [17]:
diamonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 7 columns):
carat    53940 non-null float64
depth    53940 non-null float64
table    53940 non-null float64
price    53940 non-null int64
x        53940 non-null float64
y        53940 non-null float64
z        53940 non-null float64
dtypes: float64(6), int64(1)
memory usage: 2.9 MB


In [18]:
diamonds.columns

Index(['carat', 'depth', 'table', 'price', 'x', 'y', 'z'], dtype='object')

In [19]:
X_diam = diamonds.drop(['price'], axis=1)
X_diam.shape
y_diam = diamonds.loc[:, diamonds.columns == 'price']
np.ravel(y_diam).shape

(53940, 6)

(53940,)

In [0]:
# test_size=0.25 is the default; it is stated explicitly to match the task.
X_diam_train, X_diam_test, y_diam_train, y_diam_test = train_test_split(X_diam, y_diam, test_size=0.25, random_state=1)

In [21]:
clf_lr = LinearRegression()
clf_lr.fit(X_diam_train, y_diam_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [0]:
pred_price_lr = clf_lr.predict(X_diam_test)

In [36]:
pred_price_lr_sub = pred_price_lr - y_diam_test
pred_price_lr_sub.describe()

| | price |
|---|---|
| count | 13485.0 |
| mean | 19.229235 |
| std | 1463.058136 |
| min | -12455.940789 |
| 25% | -342.670547 |
| 50% | 63.649682 |
| 75% | 652.518106 |
| max | 18239.84636 |


In [24]:
clf_rf = RandomForestRegressor(random_state=1)
clf_rf.fit(X_diam_train, y_diam_train.values.ravel())



RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=1, verbose=0,
                      warm_start=False)

In [0]:
pred_price_rf = clf_rf.predict(X_diam_test)

In [26]:
pred_price_rf_sub = pred_price_rf[:, np.newaxis] - y_diam_test
pred_price_rf_sub.describe()

| | price |
|---|---|
| count | 13485.0 |
| mean | 47.970728 |
| std | 1404.570512 |
| min | -12443.6 |
| 25% | -247.525 |
| 50% | 33.4 |
| 75% | 402.4 |
| max | 8878.7 |


In [27]:
stats.ttest_rel(np.abs(pred_price_lr_sub), np.abs(pred_price_rf_sub))

Ttest_relResult(statistic=array([13.01772978]), pvalue=array([1.65517458e-38]))

In [39]:
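print(f"95% confidence interval: {DescrStatsW(np.abs(pred_price_lr_sub) - np.abs(pred_price_rf_sub)).tconfint_mean()}")

95% confidence interval: (array([74.28724533]), array([100.62452099]))

The achieved significance level (about 1.7e-38) is far below 0.05, so the hypothesis of equal average quality is rejected. The interval for the mean difference in absolute errors (linear regression minus random forest) lies entirely above zero, so the random forest predicts prices more accurately on the test set.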
