# WIP: Diabetes Dataset Analysis 

In [16]:
import pandas as pd
import pandas_profiling
from sklearn import datasets

In [2]:
dataset = datasets.load_diabetes()

What attributes does the object have?

In [12]:
dir(dataset)

['DESCR', 'data', 'feature_names', 'target']

These are the standard attributes of a scikit-learn dataset. Let’s see what the notes on the dataset say:

In [3]:
print(dataset.DESCR)

Diabetes dataset

Notes
-----

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

Data Set Characteristics:

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attributes:
    :Age:
    :Sex:
    :Body mass index:
    :Average blood pressure:
    :S1:
    :S2:
    :S3:
    :S4:
    :S5:
    :S6:

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani

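To summarise the data:
- All ten features are numeric, although `sex` takes only two values
- The features have been centred and scaled
- The target `y` is also quantitative
- Six of the features are blood serum measurements (`s1` to `s6`)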
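The scaling note in the description can be checked by hand. The sketch below uses plain Python and a handful of made-up ages (hypothetical values, not rows from the dataset) to centre a column and divide it by the square root of its sum of squares, which equals the population standard deviation times the square root of `n_samples`. That reading of the note is an assumption: it is the scaling that makes each column's squares sum to 1, and for 442 rows it implies a sample variance of 1 / 441, which agrees with the variance of 0.0022676 and standard deviation of 0.047619 that the profile reports for the scaled numeric features.

```python
import math


def centre_and_scale(values):
    """Centre a column on its mean and scale it so that its squares sum to 1."""
    n = len(values)
    mean = sum(values) / n
    centred = [v - mean for v in values]
    # sqrt(sum of squares) is the population std multiplied by sqrt(n)
    norm = math.sqrt(sum(c * c for c in centred))
    return [c / norm for c in centred]


# Hypothetical raw ages, used only to illustrate the transformation.
raw_ages = [59.0, 48.0, 72.0, 24.0, 50.0]
scaled = centre_and_scale(raw_ages)

col_mean = sum(scaled) / len(scaled)      # close to 0 after centring
col_sum_sq = sum(s * s for s in scaled)   # close to 1 after scaling

# With 442 rows and a unit sum of squares, the sample variance is 1 / (442 - 1).
expected_variance = 1 / (442 - 1)
expected_std = math.sqrt(expected_variance)
```

This also explains why the reported means are tiny numbers such as -3.6396e-16 rather than exactly zero: they are floating-point rounding residue left over from the centring.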

In [6]:
dataset.feature_names

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

Convert the NumPy array to a pandas DataFrame:

In [13]:
df = pd.DataFrame(
    data=dataset.data, 
    index=None, 
    columns=dataset.feature_names
)

Add the target column:

In [14]:
df['y'] = dataset.target
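As an aside, scikit-learn 0.23 and later should be able to build the same frame in one call through the `as_frame=True` option. A minimal sketch, assuming such a version is installed. The name `df_alt` is only illustrative, and because the combined frame calls its target column `target`, it is renamed to `y` to match the frame built above:

```python
from sklearn import datasets

# as_frame=True asks scikit-learn to return pandas objects instead of NumPy arrays.
bunch = datasets.load_diabetes(as_frame=True)

# bunch.frame holds the ten feature columns plus the target in one DataFrame.
df_alt = bunch.frame.rename(columns={"target": "y"})
```

Building the frame by hand, as done above, works on older releases too and makes the column naming explicit.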

In [19]:
df.head(n=5)

index,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,y
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019908,-0.017646,151.0
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.06833,-0.092204,75.0
2,0.085299,0.05068,0.044451,-0.005671,-0.045599,-0.034194,-0.032356,-0.002592,0.002864,-0.02593,141.0
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022692,-0.009362,206.0
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031991,-0.046641,135.0


What does the profile of the data look like?

In [18]:
pf = pandas_profiling.ProfileReport(df)

In [20]:
pf

Dataset info:

Statistic,Value
Number of variables,11
Number of observations,442
Total Missing (%),0.0%
Total size in memory,38.0 KiB
Average record size in memory,88.1 B

Variable types:

Type,Count
Numeric,10
Categorical,0
Boolean,1
Date,0
Text (Unique),0
Rejected,0
Unsupported,0

Variable: age

Statistic,Value
Distinct count,58
Unique (%),13.1%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

Statistic,Value
Mean,-3.6396e-16
Minimum,-0.10723
Maximum,0.11073
Zeros (%),0.0%

Statistic,Value
Minimum,-0.10723
5-th percentile,-0.08543
Q1,-0.037299
Median,0.0053831
Q3,0.038076
95-th percentile,0.070769
Maximum,0.11073
Range,0.21795
Interquartile range,0.075375

Statistic,Value
Standard deviation,0.047619
Coef of variation,-130840000000000
Kurtosis,-0.67122
Mean,-3.6396e-16
MAD,0.039295
Skewness,-0.23138
Sum,-1.6087e-13
Variance,0.0022676
Memory size,3.5 KiB
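The enormous "Coef of variation" in this table is not informative. The coefficient of variation is the standard deviation divided by the mean, and for a centred feature the mean is only floating-point residue, so the ratio blows up. A quick check using the figures printed in the profile (the helper name is just for illustration):

```python
def coefficient_of_variation(std, mean):
    """Return the standard deviation divided by the mean."""
    return std / mean


# Figures as printed in the profile: a centred feature, then the target y.
cv_feature = coefficient_of_variation(0.047619, -3.6396e-16)
cv_target = coefficient_of_variation(77.093, 152.13)

print(f"{cv_feature:.4e}")
print(f"{cv_target:.4f}")
```

The same statistic is meaningful for the uncentred target `y`, where the profile reports about 0.507. For the scaled features it can simply be ignored.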

Value,Count,Frequency (%),Unnamed: 3
0.0162806757273067,19,4.3%,
0.0417084448844436,17,3.8%,
0.00901559882526763,16,3.6%,
-0.0273097856849279,15,3.4%,
0.0453409833354632,14,3.2%,
0.0126481372762872,14,3.2%,
-0.0527375548420648,14,3.2%,
-0.00188201652779104,14,3.2%,
0.00538306037424807,13,2.9%,
0.0671362140415805,13,2.9%,

Value,Count,Frequency (%),Unnamed: 3
-0.107225631607358,3,0.7%,
-0.103593093156339,3,0.7%,
-0.099960554705319,2,0.5%,
-0.0963280162542995,4,0.9%,
-0.0926954778032799,4,0.9%,

Value,Count,Frequency (%),Unnamed: 3
0.0852989062966783,1,0.2%,
0.0889314447476978,1,0.2%,
0.0925639831987174,1,0.2%,
0.096196521649737,2,0.5%,
0.110726675453815,2,0.5%,

Variable: bmi

Statistic,Value
Distinct count,163
Unique (%),36.9%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

Statistic,Value
Mean,-8.014e-16
Minimum,-0.090275
Maximum,0.17056
Zeros (%),0.0%

Statistic,Value
Minimum,-0.090275
5-th percentile,-0.066563
Q1,-0.034229
Median,-0.0072838
Q3,0.031248
95-th percentile,0.085408
Maximum,0.17056
Range,0.26083
Interquartile range,0.065477

Statistic,Value
Standard deviation,0.047619
Coef of variation,-59420000000000
Kurtosis,0.095094
Mean,-8.014e-16
MAD,0.038358
Skewness,0.59815
Sum,-3.5422e-13
Variance,0.0022676
Memory size,3.5 KiB

Value,Count,Frequency (%),Unnamed: 3
-0.0245287593917836,8,1.8%,
-0.030995631835069,8,1.8%,
-0.0460850008694016,7,1.6%,
-0.00836157828357004,7,1.6%,
-0.0256065714656645,7,1.6%,
0.0142724752679289,6,1.4%,
-0.0331512559828308,6,1.4%,
-0.0234509473179027,6,1.4%,
0.00133873038135806,6,1.4%,
-0.02021751109626,6,1.4%,

Value,Count,Frequency (%),Unnamed: 3
-0.0902752958985185,1,0.2%,
-0.0891974838246376,1,0.2%,
-0.084886235529114,1,0.2%,
-0.0838084234552331,1,0.2%,
-0.0816527993074713,2,0.5%,

Value,Count,Frequency (%),Unnamed: 3
0.127442743025423,1,0.2%,
0.128520555099304,1,0.2%,
0.137143051690352,1,0.2%,
0.160854917315731,1,0.2%,
0.17055522598066,1,0.2%,

Variable: bp

Statistic,Value
Distinct count,100
Unique (%),22.6%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

Statistic,Value
Mean,1.2898e-16
Minimum,-0.1124
Maximum,0.13204
Zeros (%),0.0%

Statistic,Value
Minimum,-0.1124
5-th percentile,-0.074356
Q1,-0.036656
Median,-0.0056706
Q3,0.035644
95-th percentile,0.083672
Maximum,0.13204
Range,0.24444
Interquartile range,0.0723

Statistic,Value
Standard deviation,0.047619
Coef of variation,369190000000000
Kurtosis,-0.53278
Mean,1.2898e-16
MAD,0.039282
Skewness,0.29066
Sum,5.701e-14
Variance,0.0022676
Memory size,3.5 KiB

Value,Count,Frequency (%),Unnamed: 3
-0.00567061055493425,21,4.8%,
-0.0400993174922969,21,4.8%,
-0.0263278347173518,20,4.5%,
0.0218723549949558,15,3.4%,
-0.0332135761048244,14,3.2%,
-0.0228849640236156,13,2.9%,
-0.0125563519424068,11,2.5%,
0.0494153205448459,11,2.5%,
-0.015999222636143,11,2.5%,
0.0081008722200108,11,2.5%,

Value,Count,Frequency (%),Unnamed: 3
-0.112399602060758,1,0.2%,
-0.108956731367022,1,0.2%,
-0.10207098997955,1,0.2%,
-0.100923366426447,1,0.2%,
-0.0986281192858133,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
0.10105838095089,1,0.2%,
0.104501251644626,2,0.5%,
0.107944122338362,3,0.7%,
0.125158475807044,1,0.2%,
0.132044217194516,1,0.2%,

Variable: s1

Statistic,Value
Distinct count,141
Unique (%),31.9%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

Statistic,Value
Mean,-9.0425e-17
Minimum,-0.12678
Maximum,0.15391
Zeros (%),0.0%

Statistic,Value
Minimum,-0.12678
5-th percentile,-0.073119
Q1,-0.034248
Median,-0.0043209
Q3,0.028358
95-th percentile,0.083671
Maximum,0.15391
Range,0.28069
Interquartile range,0.062606

Statistic,Value
Standard deviation,0.047619
Coef of variation,-526610000000000
Kurtosis,0.23295
Mean,-9.0425e-17
MAD,0.037367
Skewness,0.37811
Sum,-3.9968e-14
Variance,0.0022676
Memory size,3.5 KiB

Value,Count,Frequency (%),Unnamed: 3
-0.00707277125301585,10,2.3%,
-0.0373437341334407,10,2.3%,
0.0204462859110067,9,2.0%,
0.0121905687618,9,2.0%,
0.00118294589619092,8,1.8%,
-0.00294491267841247,8,1.8%,
-0.0249601584096305,8,1.8%,
-0.00432086553661359,8,1.8%,
0.0245741444856101,8,1.8%,
-0.00569681839481472,7,1.6%,

Value,Count,Frequency (%),Unnamed: 3
-0.126780669916514,1,0.2%,
-0.108893282759899,1,0.2%,
-0.104765424185296,1,0.2%,
-0.103389471327095,1,0.2%,
-0.100637565610693,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
0.126394655992494,1,0.2%,
0.127770608850695,2,0.5%,
0.133274420283499,1,0.2%,
0.152537760298315,1,0.2%,
0.153913713156516,1,0.2%,

Variable: s2

Statistic,Value
Distinct count,302
Unique (%),68.3%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

Statistic,Value
Mean,1.3011e-16
Minimum,-0.11561
Maximum,0.19879
Zeros (%),0.0%

Statistic,Value
Minimum,-0.11561
5-th percentile,-0.072712
Q1,-0.030358
Median,-0.0038191
Q3,0.029844
95-th percentile,0.079463
Maximum,0.19879
Range,0.3144
Interquartile range,0.060203

Statistic,Value
Standard deviation,0.047619
Coef of variation,365980000000000
Kurtosis,0.60138
Mean,1.3011e-16
MAD,0.037488
Skewness,0.43659
Sum,5.751e-14
Variance,0.0022676
Memory size,3.5 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0162224364339952,5,1.1%,
-0.00100072896442909,5,1.1%,
-0.0248000120604336,4,0.9%,
-0.0470335528474903,4,0.9%,
-0.0138398158977999,4,0.9%,
0.0566185880048449,4,0.9%,
-0.00381906512053488,3,0.7%,
-0.0232342697514859,3,0.7%,
-0.0157187066685371,3,0.7%,
0.00620168565673016,3,0.7%,

Value,Count,Frequency (%),Unnamed: 3
-0.115613065979398,1,0.2%,
-0.112794729823292,1,0.2%,
-0.106844909049291,1,0.2%,
-0.104339721354975,1,0.2%,
-0.10089508827529,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
0.128016437292858,1,0.2%,
0.130208476525385,1,0.2%,
0.131461070372543,1,0.2%,
0.155886650392127,1,0.2%,
0.198787989657293,1,0.2%,

Variable: s3

Statistic,Value
Distinct count,63
Unique (%),14.3%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

Statistic,Value
Mean,-4.564e-16
Minimum,-0.10231
Maximum,0.18118
Zeros (%),0.0%

Statistic,Value
Minimum,-0.10231
5-th percentile,-0.065491
Q1,-0.035117
Median,-0.0065845
Q3,0.029312
95-th percentile,0.077909
Maximum,0.18118
Range,0.28349
Interquartile range,0.064429

Statistic,Value
Standard deviation,0.047619
Coef of variation,-104340000000000
Kurtosis,0.98151
Mean,-4.564e-16
MAD,0.037518
Skewness,0.79926
Sum,-2.0173e-13
Variance,0.0022676
Memory size,3.5 KiB

Value,Count,Frequency (%),Unnamed: 3
-0.0139477432193303,22,5.0%,
-0.0434008456520269,19,4.3%,
-0.0397192078479398,18,4.1%,
-0.0029028298070691,15,3.4%,
-0.0323559322397657,15,3.4%,
0.0081420836051921,15,3.4%,
-0.0213110188275045,15,3.4%,
-0.0286742944356786,15,3.4%,
-0.00658446761115617,14,3.2%,
0.0155053592133662,14,3.2%,

Value,Count,Frequency (%),Unnamed: 3
-0.10230705051742,1,0.2%,
-0.098625412713333,1,0.2%,
-0.0912621371051588,1,0.2%,
-0.0802172236928976,2,0.5%,
-0.0765355858888105,5,1.1%,

Value,Count,Frequency (%),Unnamed: 3
0.151725957964588,1,0.2%,
0.159089233572762,1,0.2%,
0.17381578478911,1,0.2%,
0.177497422593197,1,0.2%,
0.181179060397284,1,0.2%,

Variable: s4

Statistic,Value
Distinct count,66
Unique (%),14.9%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

Statistic,Value
Mean,3.8632e-16
Minimum,-0.076395
Maximum,0.18523
Zeros (%),0.0%

Statistic,Value
Minimum,-0.076395
5-th percentile,-0.076395
Q1,-0.039493
Median,-0.0025923
Q3,0.034309
95-th percentile,0.080767
Maximum,0.18523
Range,0.26163
Interquartile range,0.073802

Statistic,Value
Standard deviation,0.047619
Coef of variation,123260000000000
Kurtosis,0.4444
Mean,3.8632e-16
MAD,0.037103
Skewness,0.73537
Sum,1.7075e-13
Variance,0.0022676
Memory size,3.5 KiB

Value,Count,Frequency (%),Unnamed: 3
-0.0394933828740919,128,29.0%,
-0.00259226199818282,108,24.4%,
0.0343088588777263,68,15.4%,
0.0712099797536354,33,7.5%,
-0.076394503750001,28,6.3%,
0.108111100629544,13,2.9%,
0.145012221505454,2,0.5%,
-0.0214118336448964,2,0.5%,
-0.0376483268302965,2,0.5%,
0.0158582984397717,2,0.5%,

Value,Count,Frequency (%),Unnamed: 3
-0.076394503750001,28,6.3%,
-0.0708593356186146,1,0.2%,
-0.0693832907835783,1,0.2%,
-0.0535158088069373,1,0.2%,
-0.0516707527631419,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
0.13025177315509,1,0.2%,
0.141322109417863,1,0.2%,
0.145012221505454,2,0.5%,
0.155344535350708,1,0.2%,
0.185234443260194,1,0.2%,

Variable: s5

Statistic,Value
Distinct count,184
Unique (%),41.6%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

Statistic,Value
Mean,-3.8481e-16
Minimum,-0.1261
Maximum,0.1336
Zeros (%),0.0%

Statistic,Value
Minimum,-0.1261
5-th percentile,-0.072128
Q1,-0.033249
Median,-0.0019476
Q3,0.032433
95-th percentile,0.079047
Maximum,0.1336
Range,0.2597
Interquartile range,0.065682

Statistic,Value
Standard deviation,0.047619
Coef of variation,-123750000000000
Kurtosis,-0.13437
Mean,-3.8481e-16
MAD,0.038733
Skewness,0.29177
Sum,-1.7009e-13
Variance,0.0022676
Memory size,3.5 KiB

Value,Count,Frequency (%),Unnamed: 3
-0.0181182673078967,11,2.5%,
-0.0307512098645563,10,2.3%,
-0.0411803851880079,8,1.8%,
-0.0259524244351894,7,1.6%,
-0.0514005352605825,7,1.6%,
-0.0332487872476258,7,1.6%,
-0.0236445575721341,6,1.4%,
-0.0109044358473771,6,1.4%,
-0.0611765950943345,6,1.4%,
0.0155668445407018,6,1.4%,

Value,Count,Frequency (%),Unnamed: 3
-0.126097385560409,1,0.2%,
-0.104364820832166,1,0.2%,
-0.101643547945512,1,0.2%,
-0.096433222891784,4,0.9%,
-0.0939356455087147,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
0.129019411600168,1,0.2%,
0.130080609521753,1,0.2%,
0.132372649338676,1,0.2%,
0.133395733837469,1,0.2%,
0.133598980013008,2,0.5%,

Variable: s6

Statistic,Value
Distinct count,56
Unique (%),12.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

Statistic,Value
Mean,-3.3985e-16
Minimum,-0.13777
Maximum,0.13561
Zeros (%),0.0%

Statistic,Value
Minimum,-0.13777
5-th percentile,-0.075636
Q1,-0.033179
Median,-0.0010777
Q3,0.027917
95-th percentile,0.081764
Maximum,0.13561
Range,0.27338
Interquartile range,0.061096

Statistic,Value
Standard deviation,0.047619
Coef of variation,-140120000000000
Kurtosis,0.23692
Mean,-3.3985e-16
MAD,0.037041
Skewness,0.20792
Sum,-1.5021e-13
Variance,0.0022676
Memory size,3.5 KiB

Value,Count,Frequency (%),Unnamed: 3
0.00306440941436832,22,5.0%,
0.0196328370737072,20,4.5%,
0.00720651632920303,20,4.5%,
-0.00107769750046639,19,4.3%,
-0.0176461251598052,16,3.6%,
-0.0135040182449705,16,3.6%,
-0.0383566597339788,15,3.4%,
-0.0093619113301358,14,3.2%,
-0.0052198044153011,14,3.2%,
0.0154907301588724,14,3.2%,

Value,Count,Frequency (%),Unnamed: 3
-0.137767225690012,1,0.2%,
-0.129483011860342,2,0.5%,
-0.104630370371334,2,0.5%,
-0.0963461565416647,2,0.5%,
-0.09220404962683,4,0.9%,

Value,Count,Frequency (%),Unnamed: 3
0.106617082285236,4,0.9%,
0.11904340302974,2,0.5%,
0.12732761685941,1,0.2%,
0.131469723774244,2,0.5%,
0.135611830689079,3,0.7%,

Variable: sex

Statistic,Value
Distinct count,2
Unique (%),0.5%
Missing (%),0.0%
Missing (n),0

Statistic,Value
Mean,1.3099e-16

Value,Count
-0.044641636506989,235
0.0506801187398187,207

Value,Count,Frequency (%),Unnamed: 3
-0.044641636506989,235,53.2%,
0.0506801187398187,207,46.8%,
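The two values of `sex` are what a two-level code looks like after the centring and scaling described in the dataset notes. A small check in plain Python, assuming the raw column was coded 0/1 (an assumption: the original coding, and which level is which, is not stated here) and using the counts from the table above:

```python
import math

n_low, n_high = 235, 207    # counts reported by the profile
n = n_low + n_high          # 442 observations

# Hypothetical raw coding: 0 for the first level, 1 for the second.
mean = n_high / n

# Centre both levels, then divide by the square root of the column's sum of squares.
centred_low, centred_high = 0 - mean, 1 - mean
norm = math.sqrt(n_low * centred_low ** 2 + n_high * centred_high ** 2)

scaled_low = centred_low / norm
scaled_high = centred_high / norm
print(round(scaled_low, 6), round(scaled_high, 6))
```

The result should agree with the two values in the table to the displayed precision. Any other two-level coding (for example 1/2) gives the same scaled values, since centring and scaling cancel the offset and the spacing.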

Variable: y

Statistic,Value
Distinct count,214
Unique (%),48.4%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

Statistic,Value
Mean,152.13
Minimum,25
Maximum,346
Zeros (%),0.0%

Statistic,Value
Minimum,25.0
5-th percentile,51.0
Q1,87.0
Median,140.5
Q3,211.5
95-th percentile,282.9
Maximum,346.0
Range,321.0
Interquartile range,124.5
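The "Range" and "Interquartile range" rows follow directly from the other quantile statistics. The sketch below recomputes them from the values printed for the target and, as an extra step that the report itself does not perform, applies the common 1.5 × IQR rule of thumb to see whether any value would be flagged as an outlier:

```python
# Quantile statistics for the target, as printed in the profile.
minimum, maximum = 25.0, 346.0
q1, q3 = 87.0, 211.5

value_range = maximum - minimum   # spread between the extremes
iqr = q3 - q1                     # spread of the middle 50% of the data

# Rule-of-thumb outlier fences (an added convention, not part of the report).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
has_outliers = minimum < lower_fence or maximum > upper_fence
```

Under that rule the fences sit at -99.75 and 398.25, so neither the minimum nor the maximum of `y` falls outside them.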

Statistic,Value
Standard deviation,77.093
Coef of variation,0.50675
Kurtosis,-0.88306
Mean,152.13
MAD,65.765
Skewness,0.44056
Sum,67243
Variance,5943.3
Memory size,3.5 KiB

Value,Count,Frequency (%),Unnamed: 3
200.0,6,1.4%,
72.0,6,1.4%,
178.0,5,1.1%,
90.0,5,1.1%,
71.0,5,1.1%,
202.0,4,0.9%,
85.0,4,0.9%,
131.0,4,0.9%,
59.0,4,0.9%,
65.0,4,0.9%,

Value,Count,Frequency (%),Unnamed: 3
25.0,1,0.2%,
31.0,1,0.2%,
37.0,1,0.2%,
39.0,2,0.5%,
40.0,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
321.0,1,0.2%,
332.0,1,0.2%,
336.0,1,0.2%,
341.0,1,0.2%,
346.0,1,0.2%,

