## Descriptive Statistics

Import **NumPy**, **SciPy**, and **Pandas**, along with **Matplotlib** and **Seaborn** for plotting.

In [2]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy import stats

Randomly generate 1,000 samples from the normal distribution (mean = 100, standard deviation = 15) using `np.random.normal()`.

In [31]:
samples = np.random.normal(100, 15, 1000)  # mean = 100, standard deviation = 15, 1,000 draws
samples

array([ 86.79899896, 110.4460147 ,  87.89310758, 104.55604269,
        99.25645679,  92.99927418,  94.72214438,  74.60947062,
       121.07955702,  84.73848177, 105.28170187, 106.33011858,
        92.20574021,  92.82271393,  94.6411572 , 110.64422739,
       100.62339312, 110.23083203,  79.10496649,  72.76888647,
        99.53672171,  74.67971011,  96.03517188, 106.7066129 ,
       114.69721677, 104.14574489, 103.00071949, 110.11721693,
       125.67981915, 115.2423908 , 109.88992765, 112.49033389,
       111.57196165, 101.20823147, 120.68992609,  77.60151507,
        85.98445774, 111.76623689, 108.85075412, 118.31418409,
       113.16584378, 109.91343372,  73.66408643, 134.27278725,
       100.69889974, 106.31360648,  74.23751889, 114.02501325,
       114.78808938,  84.42241139, 118.74778451, 114.21005111,
       102.08368185,  83.76538717,  76.64780302, 104.71611583,
       112.57950376,  97.17814499,  80.30255595,  75.60157188,
        99.85561132, 121.73520096, 100.24762453, 110.71
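Because `np.random.normal()` is called without a seed, the values above change on every run. A minimal sketch of a reproducible alternative, assuming NumPy's `default_rng` generator is acceptable here; the seed `42` and the name `seeded_samples` are arbitrary choices, not part of the original notebook:

```python
import numpy as np

# Seeded generator so repeated runs give identical draws (seed 42 is arbitrary).
rng = np.random.default_rng(42)

# loc is the mean, scale is the standard deviation, size is the number of draws.
seeded_samples = rng.normal(loc=100, scale=15, size=1000)

print(seeded_samples.shape)
print(seeded_samples[:5])
```

Re-creating the generator with the same seed yields the same 1,000 values, which makes the statistics in the following cells repeatable.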

Compute the **mean**, **median**, and **mode**

In [34]:
mean = np.mean(samples)
print(mean)
median = np.median(samples)
print(median)
mode = stats.mode(samples)
print(mode)

99.92623122054934
100.3110570892002
ModeResult(mode=array([44.02769544]), count=array([1]))
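For continuous data, nearly every value is unique, so `stats.mode()` reports a count of 1 and simply returns the smallest sample, as the output above shows. One possible workaround is to estimate the mode from a histogram. The sketch below is an illustration under assumptions: the seed, the choice of 20 bins, and the names `data` and `mode_estimate` are not from the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
data = rng.normal(100, 15, 1000)

# Bin the data and take the midpoint of the fullest bin as a rough mode estimate.
counts, edges = np.histogram(data, bins=20)  # 20 bins is an arbitrary choice
fullest = np.argmax(counts)
mode_estimate = (edges[fullest] + edges[fullest + 1]) / 2

print(mode_estimate)
```

The estimate depends on the bin width, so it should be read as approximate rather than as an exact statistic.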


Compute the **min**, **max**, **Q1**, **Q3**, and **interquartile range**

In [42]:
samples_pd = pd.Series(samples)

In [44]:
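min_val = samples_pd.min()  # named min_val to avoid shadowing the built-in min()
print(min_val)
max_val = samples_pd.max()  # named max_val to avoid shadowing the built-in max()
print(max_val)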
q1 = samples_pd.quantile(.25)
print(q1)
q3 = samples_pd.quantile(.75)
print(q3)
iqr = q3-q1
print(iqr)

44.027695442107444
146.6826701310997
89.50647653074716
110.45836219673744
20.951885665990275
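The quartiles and interquartile range can also be computed directly with NumPy, and the IQR is commonly used to flag outliers with the 1.5 × IQR rule. The sketch below uses a small made-up array rather than the notebook's samples, so the numbers are easy to check by hand; the variable names are illustrative.

```python
import numpy as np

# Made-up values; 40 is an obvious outlier.
data = np.array([2, 4, 4, 5, 7, 9, 10, 12, 40])

# np.percentile uses linear interpolation by default, as Series.quantile does.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, q3, iqr, outliers)
```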


Compute the **variance** and **standard deviation**

In [45]:
variance = samples_pd.var()
print(variance)
std_dev = samples_pd.std()
print(std_dev)

237.47563353741828
15.410244434707007
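Note that pandas and NumPy use different defaults here: `Series.var()` and `Series.std()` divide by n - 1 (`ddof=1`, the sample estimate), while `np.var()` and `np.std()` divide by n (`ddof=0`). A short sketch on made-up values shows the difference:

```python
import numpy as np
import pandas as pd

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up values

population_var = np.var(data)        # ddof=0: divide by n
sample_var = np.var(data, ddof=1)    # ddof=1: divide by n - 1
pandas_var = pd.Series(data).var()   # pandas defaults to ddof=1

print(population_var, sample_var, pandas_var)
```

With 1,000 samples the two conventions differ only slightly, but the gap is noticeable for small data sets like this one.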


Compute the **skewness** and **kurtosis**

In [46]:
skewness = samples_pd.skew()
print(skewness)
kurtosis = samples_pd.kurtosis()
print(kurtosis)

-0.11809672199056959
0.07546700673003093
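The pandas methods return bias-corrected estimates, and `kurtosis()` reports excess kurtosis (a normal distribution scores about 0, not 3). SciPy's defaults differ, so the sketch below passes `bias=False` to make the two libraries comparable. The seed and variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed
data = rng.normal(100, 15, 1000)
series = pd.Series(data)

# bias=False applies the same sample-size correction that pandas uses.
scipy_skew = stats.skew(data, bias=False)
# fisher=True returns excess kurtosis (normal distribution -> 0).
scipy_kurt = stats.kurtosis(data, fisher=True, bias=False)

print(series.skew(), scipy_skew)
print(series.kurtosis(), scipy_kurt)
```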


## NumPy Correlation Calculation

Use `np.arange()` to create an array x of integers between 10 (inclusive) and 20 (exclusive).

In [24]:
x = np.arange(10,20)
x

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

Then use `np.array()` to create a second array y containing 10 arbitrary integers.

In [22]:
y = np.array([1,15,2,8,5,9,22,15,9,13])
y

array([ 1, 15,  2,  8,  5,  9, 22, 15,  9, 13])

Once you have two arrays of the same length, you can compute the **correlation coefficient** between x and y

In [27]:
r = np.corrcoef(x,y)
r

array([[1.        , 0.50055752],
       [0.50055752, 1.        ]])
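`np.corrcoef()` returns the full 2 × 2 correlation matrix: the diagonal holds each array's correlation with itself (always 1), and the off-diagonal entries hold Pearson's r between x and y. The sketch below extracts that single value and checks it against the textbook formula; `r_xy` and `r_manual` are illustrative names.

```python
import numpy as np

x = np.arange(10, 20)
y = np.array([1, 15, 2, 8, 5, 9, 22, 15, 9, 13])

r_matrix = np.corrcoef(x, y)
r_xy = r_matrix[0, 1]  # off-diagonal entry is the correlation between x and y

# Pearson's r from its definition: covariance over the product of spreads.
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(r_xy, r_manual)
```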

## Pandas Correlation Calculation

Run the code below

In [5]:
x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

Call the relevant method to calculate Pearson's r correlation.

In [6]:
r = stats.pearsonr(x, y)
r

(0.7586402890911869, 0.010964341301680832)
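The cell above uses SciPy, which returns both the coefficient and a p-value. Since x and y are pandas Series, the coefficient alone is also available through `Series.corr()`, whose default method is Pearson. A brief sketch comparing the two (the names `r_pandas`, `r_scipy`, and `p_value` are illustrative):

```python
import pandas as pd
from scipy import stats

x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

r_pandas = x.corr(y)                      # Pearson by default, coefficient only
r_scipy, p_value = stats.pearsonr(x, y)   # coefficient plus two-sided p-value

print(r_pandas, r_scipy, p_value)
```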

OPTIONAL. Call the relevant method to calculate Spearman's rho correlation.

In [54]:
rho = stats.spearmanr(x, y)
rho

SpearmanrResult(correlation=0.9757575757575757, pvalue=1.4675461874042197e-06)
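Spearman's rho is Pearson's r computed on the ranks of the data, which is why it is much higher than Pearson's r here: y rises almost monotonically with x, but not linearly. A sketch that makes the equivalence explicit using pandas (variable names are illustrative):

```python
import pandas as pd

x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

rho_pandas = x.corr(y, method="spearman")

# Rank each series, then take the ordinary Pearson correlation of the ranks.
rho_from_ranks = x.rank().corr(y.rank())

print(rho_pandas, rho_from_ranks)
```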

## Seaborn Dataset Tips

Import Seaborn Library

In [55]:
import seaborn as sns

Load "tips" dataset from Seaborn

In [56]:
tips = sns.load_dataset("tips")
tips

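| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |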


Generate descriptive statistics, including those that summarize central tendency and dispersion.

In [66]:
tips.describe(include="all")

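| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| count | 244.0 | 244.0 | 244 | 244 | 244 | 244 | 244.0 |
| unique | | | 2 | 2 | 4 | 2 | |
| top | | | Male | No | Sat | Dinner | |
| freq | | | 157 | 151 | 87 | 176 | |
| mean | 19.785943 | 2.998279 | | | | | 2.569672 |
| std | 8.902412 | 1.383638 | | | | | 0.9511 |
| min | 3.07 | 1.0 | | | | | 1.0 |
| 25% | 13.3475 | 2.0 | | | | | 2.0 |
| 50% | 17.795 | 2.9 | | | | | 2.0 |
| 75% | 24.1275 | 3.5625 | | | | | 3.0 |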


In [67]:
tips.describe().T

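| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| total_bill | 244.0 | 19.785943 | 8.902412 | 3.07 | 13.3475 | 17.795 | 24.1275 | 50.81 |
| tip | 244.0 | 2.998279 | 1.383638 | 1.0 | 2.0 | 2.9 | 3.5625 | 10.0 |
| size | 244.0 | 2.569672 | 0.9511 | 1.0 | 2.0 | 2.0 | 3.0 | 6.0 |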


Call the relevant method to calculate pairwise Pearson's r correlation of columns

In [72]:
a = tips[["total_bill", "tip"]]
a.corr(method="pearson")  # only the last expression in a cell is displayed


| | total_bill | tip |
|---|---|---|
| total_bill | 1.0 | 0.675734 |
| tip | 0.675734 | 1.0 |


In [73]:
b = tips[["total_bill", "size"]]
b.corr(method="pearson")  # only the last expression in a cell is displayed

| | total_bill | size |
|---|---|---|
| total_bill | 1.0 | 0.598315 |
| size | 0.598315 | 1.0 |


In [81]:
c = tips[["tip", "size"]]
c.corr(method="pearson")  # only the last expression in a cell is displayed

| | tip | size |
|---|---|---|
| tip | 1.0 | 0.489299 |
| size | 0.489299 | 1.0 |
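The three pairwise calls can be collapsed into one: selecting all numeric columns and calling `corr()` once returns the full correlation matrix. To stay self-contained, the sketch below builds a tiny DataFrame from the first five rows of the tips output shown earlier instead of downloading the dataset, so its coefficients will not match the full-data values above. The names `tips_head` and `corr_matrix` are illustrative.

```python
import pandas as pd

# First five rows of the tips dataset, copied from the output shown earlier.
tips_head = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "size": [2, 3, 3, 2, 4],
})

# One call returns every pairwise Pearson correlation at once.
corr_matrix = tips_head[["total_bill", "tip", "size"]].corr(method="pearson")

print(corr_matrix)
```

On the full dataset, the same call on `tips[["total_bill", "tip", "size"]]` should reproduce the three pairwise results in a single 3 × 3 table.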
