## Descriptive Statistics

 Import **NumPy**, **SciPy**, and **Pandas**

In [135]:
import numpy as np
import pandas as pd
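import scipy.stats  # load the stats subpackage explicitly
import scipy as stats  # note: this aliases the top-level scipy package, so functions are reached as stats.stats.<name>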

Randomly generate 1,000 samples from a normal distribution (mean = 100, standard deviation = 15) using `np.random.normal()`.

In [136]:
samples = np.random.normal(loc=100, scale=15, size=1000)
samples

array([129.41946908,  84.33649302, 122.05125098, 112.20297336,
       101.53908621, 111.56729319,  97.96273742, 107.39700378,
       108.43009449,  76.83248604, 100.58612657,  98.05766459,
        98.0693224 ,  83.88193251,  69.78703346,  97.30062829,
       113.45985665, 124.81838758, 107.51774576, 106.34143128,
       100.46295794, 101.16499822,  72.64347405, 109.59997657,
        97.07235778, 102.59090716, 119.23834124, 124.65818872,
       108.93478968, 113.16959152, 144.83421612, 110.72286987,
       102.2334923 , 123.95714325, 107.60121399,  99.34337675,
        96.31893709, 126.01172264,  92.16875671,  94.42007405,
       100.99638656, 105.25866725,  79.07390276,  82.23264963,
       102.29012339, 119.51740912, 100.39475002, 109.88354387,
       121.48911278, 100.45457493,  74.10457751,  97.03167946,
       102.22226563,  98.9475067 , 119.58511324, 110.48916216,
       106.42577637, 104.00940843, 103.24198413,  72.38889338,
       103.92278642, 109.38392648,  95.46309758, 117.45
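Because `np.random.normal()` draws from NumPy's global random state, the values above change on every run. A minimal sketch of a reproducible draw, assuming NumPy 1.17 or later for the `Generator` API; the seed `42` and the name `seeded_samples` are illustrative choices, not part of the original notebook:

```python
import numpy as np

# A seeded Generator produces the same draw every time it is re-created
rng = np.random.default_rng(42)
seeded_samples = rng.normal(loc=100, scale=15, size=1000)

# Re-creating the generator with the same seed reproduces the array exactly
repeat = np.random.default_rng(42).normal(loc=100, scale=15, size=1000)
print(np.array_equal(seeded_samples, repeat))  # True
```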

Compute the **mean**, **median**, and **mode**

In [137]:
mean = np.mean(samples)
median = np.median(samples)
mode = stats.stats.mode(samples)  # assign the result directly; `mode = %timeit ...` would store None
%timeit stats.stats.mode(samples)

The slowest run took 6.47 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 5: 172 µs per loop


In [138]:
mean

99.88733450698744

In [139]:
median

100.02360970849514
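For continuous data the mode is rarely informative: almost every drawn value is unique, so each one occurs exactly once. A rough sketch of that effect and of one common workaround (rounding before counting) follows. The seed and the names `demo` and `rounded_mode` are illustrative assumptions, and `demo` is only a stand-in for `samples`:

```python
import numpy as np

rng = np.random.default_rng(0)    # arbitrary seed, for reproducibility only
demo = rng.normal(100, 15, 1000)  # hypothetical stand-in for `samples`

# Raw continuous values: count how often the most frequent value occurs
_, raw_counts = np.unique(demo, return_counts=True)
raw_max_count = raw_counts.max()

# Round to whole numbers first, then pick the most frequent value
rounded = np.round(demo).astype(int)
r_values, r_counts = np.unique(rounded, return_counts=True)
rounded_mode = r_values[np.argmax(r_counts)]

print(raw_max_count, rounded_mode, r_counts.max())
```

The rounded mode lands near the mean, which is what one would expect for a symmetric, bell-shaped distribution.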

Compute the **min**, **max**, **Q1**, **Q3**, and **interquartile range**

In [140]:
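# Use names that do not shadow Python's built-in min() and max()
min_value = np.min(samples)
max_value = np.max(samples)
# 'midpoint' averages the two nearest data points; NumPy 1.22+ renames this parameter to `method`
q1 = np.percentile(samples, 25, interpolation='midpoint')
q3 = np.percentile(samples, 75, interpolation='midpoint')
iqr = q3 - q1

In [141]:
min_value

49.467386226802944

In [142]:
max_value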

145.1987155607604

In [143]:
q1

89.67267405161475

In [144]:
q3

109.58231505911704

In [145]:
iqr

19.909641007502287
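The `'midpoint'` option matters only when the requested percentile falls between two data points. The sketch below works this out by hand on a small, made-up array, so it avoids the keyword whose name differs between NumPy versions. The array `data` and all variable names are illustrative:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8])  # already sorted

# Fractional index of the 25th percentile: 0.25 * (n - 1) = 1.75
pos = 0.25 * (len(data) - 1)
lower = data[int(np.floor(pos))]  # value at index 1 -> 2
upper = data[int(np.ceil(pos))]   # value at index 2 -> 3

q1_linear = np.percentile(data, 25)  # default linear interpolation -> 2.75
q1_midpoint = (lower + upper) / 2    # midpoint rule -> 2.5

print(q1_linear, q1_midpoint)
```

With 1,000 samples the two neighbours are usually very close together, so the choice of rule changes the quartiles only slightly.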

Compute the **variance** and **standard deviation**

In [146]:
variance = np.var(samples)  # population variance (NumPy's default, ddof=0)
std_dev = np.std(samples)   # population standard deviation (ddof=0)

In [147]:
variance

232.72233932136308

In [148]:
std_dev

15.255239733329761
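`np.var()` and `np.std()` divide by n by default (population formulas), whereas pandas divides by n - 1 (sample formulas). A small sketch of the difference on a made-up array; `data` and the variable names are illustrative:

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
n = len(data)

pop_var = np.var(data)             # divides by n      -> 4.0
sample_var = np.var(data, ddof=1)  # divides by n - 1  -> 32 / 7
pop_std = np.std(data)             # sqrt(4.0)         -> 2.0

# The two variances differ only by the factor n / (n - 1)
print(pop_var, sample_var, np.isclose(sample_var, pop_var * n / (n - 1)))
```

For 1,000 samples the factor is 1000/999, so the two conventions agree to about three significant digits.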

Compute the **skewness** and **kurtosis**

In [149]:
skewness = stats.stats.skew(samples)
skewness


-0.007196495052755583

In [150]:
kurtosis = stats.stats.kurtosis(samples)  # Fisher definition by default: a normal distribution scores 0
kurtosis

-0.014375826111582946
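Both statistics are ratios of central moments. The following sketch computes them from the definitions with NumPy alone, on a small made-up array. The formulas correspond to SciPy's default settings (biased estimators, Fisher kurtosis); the helper name `central_moment` is hypothetical:

```python
import numpy as np

def central_moment(values, k):
    """Return the k-th central moment, dividing by n (biased estimator)."""
    values = np.asarray(values, dtype=float)
    return np.mean((values - values.mean()) ** k)

data = [1, 2, 3, 4, 5]
m2 = central_moment(data, 2)  # 2.0
m3 = central_moment(data, 3)  # 0.0 for symmetric data
m4 = central_moment(data, 4)  # 6.8

skew_manual = m3 / m2 ** 1.5            # 0.0
excess_kurtosis_manual = m4 / m2 ** 2 - 3  # 6.8 / 4 - 3 = -1.3

print(skew_manual, excess_kurtosis_manual)
```

Values near 0 for both, as in the outputs above, are what a sample from a normal distribution should give.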

## NumPy Correlation Calculation

Create an array `x` of integers from 10 (inclusive) to 20 (exclusive) using `np.arange()`.

In [75]:
x = np.arange(10, 20)
x

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

Then use `np.array()` to create a second array y containing 10 arbitrary integers.

In [77]:
y = np.array([5, 3, 7, 6, 10, 14, 19, 35, 94, 58])
y

array([ 5,  3,  7,  6, 10, 14, 19, 35, 94, 58])

Once you have two arrays of the same length, you can compute the **correlation coefficient** between x and y. `np.corrcoef()` returns the full 2×2 correlation matrix; the off-diagonal entries hold r.

In [78]:
r = np.corrcoef(x, y)
r

array([[1.        , 0.80323888],
       [0.80323888, 1.        ]])
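To work with r as a single number, index the matrix. The sketch below also recomputes r from the Pearson formula as a cross-check; the names `r_xy` and `r_manual` are illustrative:

```python
import numpy as np

x = np.arange(10, 20)
y = np.array([5, 3, 7, 6, 10, 14, 19, 35, 94, 58])

# Off-diagonal element of the 2x2 matrix is the x-y correlation
r_xy = np.corrcoef(x, y)[0, 1]

# Pearson's r from its definition: covariance over the product of spreads
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(round(r_xy, 8), np.isclose(r_xy, r_manual))  # 0.80323888 True
```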

## Pandas Correlation Calculation

Run the code below to create two pandas Series of equal length.

In [79]:
x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

In [80]:
x

0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int64

In [81]:
y

0     2
1     1
2     4
3     5
4     8
5    12
6    18
7    25
8    96
9    48
dtype: int64

Call the relevant method to calculate Pearson's r correlation.

In [86]:
r = x.corr(y)  
r

0.7586402890911867

In [87]:
r = y.corr(x)  # correlation is symmetric; the last digit differs only through floating-point rounding
r

0.7586402890911869

OPTIONAL. Call the relevant method to calculate Spearman's rho correlation.

In [88]:
rho = x.corr(y, method='spearman')
rho

0.9757575757575757
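Spearman's rho is Pearson's r applied to the ranks of the data, which is why it is so much higher here: `y` rises almost monotonically with `x` even though the relationship is far from linear. A short sketch of that equivalence, using only pandas; the variable names are illustrative, and the closed-form formula assumes there are no ties (true for this data):

```python
import pandas as pd

x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

# Pearson correlation of the ranks
rho_from_ranks = x.rank().corr(y.rank())

# Closed form for untied data: 1 - 6 * sum(d^2) / (n * (n^2 - 1))
d = x.rank() - y.rank()
n = len(x)
rho_formula = 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))

print(rho_from_ranks, rho_formula)
```

Only two pairs of ranks are swapped (the first two and the last two values of `y`), so the sum of squared rank differences is 4 and rho is 1 - 24/990.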

## Seaborn Dataset Tips

Import the Seaborn library

In [89]:
import seaborn as sns

Load the "tips" dataset from Seaborn

In [90]:
tips = sns.load_dataset("tips")
tips

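|     | total_bill | tip  | sex    | smoker | day | time   | size |
|-----|------------|------|--------|--------|-----|--------|------|
| 0   | 16.99      | 1.01 | Female | No     | Sun | Dinner | 2    |
| 1   | 10.34      | 1.66 | Male   | No     | Sun | Dinner | 3    |
| 2   | 21.01      | 3.50 | Male   | No     | Sun | Dinner | 3    |
| 3   | 23.68      | 3.31 | Male   | No     | Sun | Dinner | 2    |
| 4   | 24.59      | 3.61 | Female | No     | Sun | Dinner | 4    |
| ... | ...        | ...  | ...    | ...    | ... | ...    | ...  |
| 239 | 29.03      | 5.92 | Male   | No     | Sat | Dinner | 3    |
| 240 | 27.18      | 2.00 | Female | Yes    | Sat | Dinner | 2    |
| 241 | 22.67      | 2.00 | Male   | Yes    | Sat | Dinner | 2    |
| 242 | 17.82      | 1.75 | Male   | No     | Sat | Dinner | 2    |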


Generate descriptive statistics, including those that summarize the central tendency and dispersion of each column

In [92]:
tips.describe(include='all')

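|        | total_bill | tip      | sex  | smoker | day | time   | size     |
|--------|------------|----------|------|--------|-----|--------|----------|
| count  | 244.0      | 244.0    | 244  | 244    | 244 | 244    | 244.0    |
| unique | NaN        | NaN      | 2    | 2      | 4   | 2      | NaN      |
| top    | NaN        | NaN      | Male | No     | Sat | Dinner | NaN      |
| freq   | NaN        | NaN      | 157  | 151    | 87  | 176    | NaN      |
| mean   | 19.785943  | 2.998279 | NaN  | NaN    | NaN | NaN    | 2.569672 |
| std    | 8.902412   | 1.383638 | NaN  | NaN    | NaN | NaN    | 0.9511   |
| min    | 3.07       | 1.0      | NaN  | NaN    | NaN | NaN    | 1.0      |
| 25%    | 13.3475    | 2.0      | NaN  | NaN    | NaN | NaN    | 2.0      |
| 50%    | 17.795     | 2.9      | NaN  | NaN    | NaN | NaN    | 2.0      |
| 75%    | 24.1275    | 3.5625   | NaN  | NaN    | NaN | NaN    | 3.0      |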
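With `include='all'`, numeric and categorical statistics share one table, so every column has rows that do not apply to it. A sketch of summarizing the two kinds of column separately is shown here on a five-row excerpt of the data above, so it runs without downloading the dataset. The frame `excerpt` is an illustrative stand-in that stores the categories as plain strings:

```python
import numpy as np
import pandas as pd

# First five rows of the tips table shown above (subset of columns)
excerpt = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "day": ["Sun"] * 5,
    "size": [2, 3, 3, 2, 4],
})

# Default describe() keeps only the numeric columns of a mixed frame
numeric_summary = excerpt.describe()

# Excluding numeric dtypes leaves count / unique / top / freq
categorical_summary = excerpt.describe(exclude=[np.number])

print(numeric_summary)
print(categorical_summary)
```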


Call the relevant method to calculate pairwise Pearson's r correlation of columns

In [102]:
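# Keep numeric columns only; pandas 2.0+ raises an error if corr() meets non-numeric columns
tips.select_dtypes(include='number').corr()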

|            | total_bill | tip      | size     |
|------------|------------|----------|----------|
| total_bill | 1.0        | 0.675734 | 0.598315 |
| tip        | 0.675734   | 1.0      | 0.489299 |
| size       | 0.598315   | 0.489299 | 1.0      |
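Individual coefficients can be read from the correlation matrix by label. The sketch below uses a five-row excerpt of the table shown earlier rather than the full dataset, so its numbers will not match the output above. The names `excerpt`, `corr_matrix`, and `bill_tip_r` are illustrative:

```python
import pandas as pd

# First five rows of the tips table (subset of columns)
excerpt = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "size": [2, 3, 3, 2, 4],
})

# Drop the non-numeric column, then compute pairwise Pearson correlations
corr_matrix = excerpt.select_dtypes(include="number").corr()

# Look up one pair by row and column label
bill_tip_r = corr_matrix.loc["total_bill", "tip"]

# The same value comes from correlating the two columns directly
direct_r = excerpt["total_bill"].corr(excerpt["tip"])
print(corr_matrix.shape, abs(bill_tip_r - direct_r) < 1e-12)
```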
