## Descriptive Statistics

 Import **NumPy**, **SciPy**, and **Pandas**

In [1]:
import numpy as np
import pandas as pd
from scipy import stats

Randomly generate 1,000 samples from the normal distribution (mean = 100, standard deviation = 15) using `np.random.normal()`.

In [7]:
samples = np.random.normal(100, 15, 1000)
# Signature: np.random.normal(loc=0.0, scale=1.0, size=None)
#   loc   : float or array_like of floats. Mean ("centre") of the distribution.
#   scale : float or array_like of floats. Standard deviation (spread or "width")
#           of the distribution. Must be non-negative.
#   size  : int or tuple of ints, optional. Output shape. If the given shape is,
#           e.g., (m, n, k), then m * n * k samples are drawn. If size is None
#           (default), a single value is returned if loc and scale are both
#           scalars. Otherwise, np.broadcast(loc, scale).size samples are drawn.
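
Because `np.random.normal()` draws new values on every run, the numbers printed in this notebook will not match a rerun. The sketch below shows one way to make the draw repeatable with NumPy's `Generator` API. The seed value 42 and the name `seeded_samples` are arbitrary choices for illustration, not part of the original exercise.

```python
import numpy as np

# Assumption: a fixed seed (42 is arbitrary) so the draw is repeatable.
rng = np.random.default_rng(seed=42)

# loc = mean, scale = standard deviation, size = number of samples
seeded_samples = rng.normal(loc=100, scale=15, size=1000)

print(seeded_samples.shape)   # (1000,)
print(seeded_samples[:5])     # first five values, repeatable for a given seed
```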


In [6]:
samples

array([ 97.21434455,  99.99721673,  93.36021263,  99.21958022,
       112.41679103,  80.31050295,  97.07465561, 112.90360349,
        78.51992549, 111.52491726, 104.70340324, 115.30214803,
        93.36831067,  93.36353011, 102.70520135, 110.88489553,
        93.34216753, 106.81654649, 101.7659317 ,  92.17841895,
       131.39259761, 127.57562024,  80.29667304, 104.96686265,
       100.08953698, 110.39353078,  71.81721233,  93.92540013,
       101.93673299,  79.07027516, 129.97118155,  89.62768494,
        99.16909446,  96.44294554, 103.74210402, 104.12375192,
       105.95668522,  99.92006789,  87.06070343, 110.79019052,
        91.37525245, 116.04535485, 105.15671069,  86.64392888,
        62.44473474,  93.79522828, 116.90974738, 117.05068258,
        98.00937246, 100.10180599,  87.83123518, 104.01093183,
        90.90451382,  91.47790101, 103.25048919,  82.60521258,
        81.63107063,  94.42055767,  88.91414503, 104.07282053,
        89.68757916,  90.24690536,  96.70085526,  59.19

Compute the **mean**, **median**, and **mode**

In [10]:
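from scipy import stats  # binds the scipy.stats submodule itself to the name `stats`

mean = np.mean(samples)
median = np.median(samples)
# The samples are continuous, so almost every value occurs exactly once
# and the reported mode (with a count of 1) is not informative.
mode = stats.mode(samples)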

In [None]:
mean

99.88733450698744

In [None]:
median

100.02360970849514
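
For continuous samples the mode only becomes meaningful once values are grouped. The following is a minimal sketch, assuming that rounding to whole numbers is an acceptable grouping. It uses `np.unique()` rather than SciPy, and the names `demo_samples`, `rounded`, `mode_value`, and `mode_count` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=0)          # arbitrary seed, for repeatability only
demo_samples = rng.normal(100, 15, 1000)

# Round to whole numbers so that values can actually repeat.
rounded = np.round(demo_samples).astype(int)

# np.unique returns the sorted distinct values and how often each occurs.
values, counts = np.unique(rounded, return_counts=True)
mode_value = values[np.argmax(counts)]       # argmax picks the smallest value among ties
mode_count = counts.max()

print(mode_value, mode_count)
```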

Compute the **min**, **max**, **Q1**, **Q3**, and **interquartile range**

In [None]:
min_val = np.min(samples)  # named min_val/max_val to avoid shadowing the built-in min() and max()
max_val = np.max(samples)
# numpy.percentile(a, q, axis=None, out=None, overwrite_input=False,
# method='linear', keepdims=False, *, interpolation=None)
# NumPy 1.22+ uses `method`; the older `interpolation` keyword is deprecated.
q1 = np.percentile(samples, 25, method='midpoint')
q3 = np.percentile(samples, 75, method='midpoint')
iqr = q3 - q1

In [None]:
min_val

49.467386226802944

In [None]:
max_val

145.1987155607604

In [None]:
q1

89.67267405161475

In [None]:
q3

109.58231505911704

In [None]:
iqr

19.909641007502287
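
The choice of percentile method changes the quartiles, and therefore the interquartile range, most visibly on small arrays. This sketch compares the default `'linear'` method with `'midpoint'` on a four-element array. It assumes NumPy 1.22 or later for the `method` keyword, and `data` is a made-up example.

```python
import numpy as np

data = np.array([1, 2, 3, 4])

# 'linear' (the default) interpolates between the two nearest data points.
q1_linear = np.percentile(data, 25)                    # 1.75
q3_linear = np.percentile(data, 75)                    # 3.25

# 'midpoint' takes the average of the two nearest data points.
q1_mid = np.percentile(data, 25, method='midpoint')    # 1.5
q3_mid = np.percentile(data, 75, method='midpoint')    # 3.5

print(q3_linear - q1_linear)   # 1.5
print(q3_mid - q1_mid)         # 2.0
```

On 1,000 samples the two methods usually differ only slightly, because neighbouring order statistics are close together.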

Compute the **variance** and **standard deviation**

In [None]:
# np.var() and np.std() default to ddof=0, i.e. the population formulas (divide by n).
variance = np.var(samples)
std_dev = np.std(samples)

In [None]:
variance

232.72233932136308

In [None]:
std_dev

15.255239733329761
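
`np.var()` and `np.std()` divide by n unless told otherwise. If the data are treated as a sample from a larger population, passing `ddof=1` gives the unbiased sample variance (divide by n - 1). A small sketch on made-up data follows; the array `data` is illustrative.

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])   # mean is 5, squared deviations sum to 32

pop_var = np.var(data)             # 32 / 8 = 4.0
pop_std = np.std(data)             # sqrt(4.0) = 2.0

sample_var = np.var(data, ddof=1)  # 32 / 7, roughly 4.571
sample_std = np.std(data, ddof=1)  # square root of the sample variance

print(pop_var, pop_std)
print(sample_var, sample_std)
```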

Compute the **skewness** and **kurtosis**

In [None]:
from scipy import stats

skewness = stats.skew(samples)
skewness

-0.007196495052755583

In [None]:
# Fisher's definition is the default, so a normal distribution has kurtosis 0.
kurtosis = stats.kurtosis(samples)
kurtosis

-0.014375826111582946
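
Both statistics can be reproduced from central moments, which makes it clearer what the SciPy functions report. The sketch below uses the biased (population) moment definitions, which is what `scipy.stats.skew()` and `scipy.stats.kurtosis()` use by default (`bias=True`, `fisher=True`). The array `data` and the `*_manual` names are illustrative.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
deviations = data - data.mean()

# Central moments: averages of the powered deviations from the mean.
m2 = np.mean(deviations**2)   # 2.0 (population variance)
m3 = np.mean(deviations**3)   # 0.0 (data are symmetric)
m4 = np.mean(deviations**4)   # 6.8

skewness_manual = m3 / m2**1.5            # 0.0
excess_kurtosis_manual = m4 / m2**2 - 3   # roughly -1.3 (flatter than a normal curve)

print(skewness_manual, excess_kurtosis_manual)
```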

## NumPy Correlation Calculation

Use `np.arange()` to create an array x of integers between 10 (inclusive) and 20 (exclusive).

In [None]:
x = np.arange(10,20)
x

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

Then use `np.array()` to create a second array y containing 10 arbitrary integers.

In [None]:
y = np.array([5,3,7,6,10,14,19,35,94,58])
y

array([ 5,  3,  7,  6, 10, 14, 19, 35, 94, 58])

Once you have two arrays of the same length, you can compute the **correlation coefficient** between x and y. `np.corrcoef()` returns a 2 x 2 matrix whose off-diagonal entries are Pearson's r.

In [None]:
r = np.corrcoef(x,y)
r

array([[1.        , 0.80323888],
       [0.80323888, 1.        ]])
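
To see where the off-diagonal value comes from, Pearson's r can be computed directly from its definition: the sum of products of deviations divided by the square root of the product of the summed squared deviations. This is a sketch using the same `x` and `y` as above; `r_manual` and `r_numpy` are illustrative names.

```python
import numpy as np

x = np.arange(10, 20)
y = np.array([5, 3, 7, 6, 10, 14, 19, 35, 94, 58])

# Deviations of each value from its own mean.
dx = x - x.mean()
dy = y - y.mean()

# Pearson's r from the definition.
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# The same number sits off the diagonal of the matrix np.corrcoef returns.
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)   # both approximately 0.8032
```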

## Pandas Correlation Calculation

Run the code below

In [None]:
x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

In [None]:
x

0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int64

In [None]:
y

0     2
1     1
2     4
3     5
4     8
5    12
6    18
7    25
8    96
9    48
dtype: int64

Call the relevant method to calculate Pearson's r correlation.

In [None]:
r = x.corr(y)  
r

0.7586402890911867

In [None]:
r = y.corr(x)  # Pearson's r is symmetric; any last-digit difference from x.corr(y) is floating-point rounding
r

0.7586402890911869

OPTIONAL. Call the relevant method to calculate Spearman's rho correlation.

In [None]:
rho = x.corr(y, method='spearman')
rho

0.9757575757575757
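
Spearman's rho is Pearson's r applied to the ranks of the data. The sketch below checks this for the same two Series, and also uses the shortcut formula 1 - 6 Σd² / (n(n² - 1)), which is valid here because neither Series contains ties. The names `rho_from_ranks`, `rho_formula`, and `d` are illustrative.

```python
import pandas as pd

x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

# Pearson's r computed on the ranks rather than on the raw values.
rho_from_ranks = x.rank().corr(y.rank())

# Shortcut formula for data without ties: d is the rank difference per pair.
d = x.rank() - y.rank()
n = len(x)
rho_formula = 1 - 6 * (d**2).sum() / (n * (n**2 - 1))

print(rho_from_ranks, rho_formula)   # both approximately 0.9758
```

Only two pairs swap rank order (the values 96 and 48), which is why rho is much closer to 1 than Pearson's r for the same data.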

## Seaborn Dataset Tips

Import the Seaborn library

In [None]:
import seaborn as sns

Load the "tips" dataset from Seaborn

In [None]:
tips = sns.load_dataset("tips")
tips

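| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |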


Generate descriptive statistics, including those that summarize central tendency and dispersion

In [None]:
tips.describe(include='all')

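| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| count | 244.0 | 244.0 | 244 | 244 | 244 | 244 | 244.0 |
| unique | | | 2 | 2 | 4 | 2 | |
| top | | | Male | No | Sat | Dinner | |
| freq | | | 157 | 151 | 87 | 176 | |
| mean | 19.785943 | 2.998279 | | | | | 2.569672 |
| std | 8.902412 | 1.383638 | | | | | 0.9511 |
| min | 3.07 | 1.0 | | | | | 1.0 |
| 25% | 13.3475 | 2.0 | | | | | 2.0 |
| 50% | 17.795 | 2.9 | | | | | 2.0 |
| 75% | 24.1275 | 3.5625 | | | | | 3.0 |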


Call the relevant method to calculate pairwise Pearson's r correlation of columns

In [None]:
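# Non-numeric columns (sex, smoker, day, time) must be excluded;
# pandas 2.0+ no longer drops them silently, so pass numeric_only=True.
tips.corr(numeric_only=True)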

| | total_bill | tip | size |
|---|---|---|---|
| total_bill | 1.0 | 0.675734 | 0.598315 |
| tip | 0.675734 | 1.0 | 0.489299 |
| size | 0.598315 | 0.489299 | 1.0 |
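
Only numeric columns can enter a Pearson correlation matrix. One way to make that explicit is to select them with `select_dtypes()` before calling `corr()`. The sketch below uses just the first five rows of the table shown earlier, typed in by hand so that it runs without downloading the dataset. The name `tips_head` is illustrative, and the resulting coefficients will differ from the full-dataset values above.

```python
import pandas as pd

# First five rows of the tips table, with one text column kept on purpose.
tips_head = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "size": [2, 3, 3, 2, 4],
})

# Keep only numeric columns, then compute pairwise Pearson's r.
numeric_cols = tips_head.select_dtypes(include="number")
corr_matrix = numeric_cols.corr()

print(corr_matrix.round(3))
```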
