## Descriptive Statistics

Import NumPy, SciPy, and pandas.

In [1]:
import numpy as np
from scipy import stats
import pandas as pd

Randomly generate 1,000 samples from a normal distribution with mean 100 and standard deviation 15 using `np.random.normal()`.

In [2]:
samples = np.random.normal(100,15,1000)
samples

array([103.97525755, 124.24617161, 106.1178233 ,  97.60956774,
       117.65149113,  81.09600208,  86.69863052,  95.72849095,
        95.01305346,  81.04125866, 116.85741034,  89.61567273,
        98.71207116, 109.51540641, 116.15664911,  92.85949444,
        86.76279575,  97.30639847, 101.44880394,  88.91961942,
       116.30009406, 123.4743661 ,  78.79330232, 110.31896286,
       113.97065254, 112.60459801,  85.45009415, 103.60417329,
       129.67546377,  93.30613707,  95.95001941, 113.6288859 ,
       106.76850669,  79.351616  ,  86.64882961, 112.62320003,
        94.68542871,  82.17098683,  93.8454212 , 105.9552516 ,
        97.18248392, 115.57089945, 113.67189232, 106.15899269,
       125.86937201, 126.50508992,  74.55677996,  91.6876108 ,
        83.90031121,  71.17778303,  98.5351047 , 121.55212231,
       108.79742533, 106.18275465, 102.3025788 , 113.78805324,
        90.27594511,  80.6726139 ,  86.96346418,  94.78479589,
        87.93198052, 100.95272994, 103.58638537,  98.64
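Because `np.random.normal()` draws fresh values on every run, the numbers above will differ each time the cell is executed. A minimal sketch of one way to make the draw reproducible, assuming NumPy's `default_rng` generator is available (the seed 42 and the names `seeded_samples` and `repeat_samples` are arbitrary choices for illustration, not part of the original notebook):

```python
import numpy as np

# Seeded generator: the seed value 42 is an arbitrary, illustrative choice.
rng = np.random.default_rng(42)
seeded_samples = rng.normal(loc=100, scale=15, size=1000)

# Re-creating the generator with the same seed reproduces the same draws.
rng_again = np.random.default_rng(42)
repeat_samples = rng_again.normal(loc=100, scale=15, size=1000)

print(seeded_samples.shape)
print(np.array_equal(seeded_samples, repeat_samples))
```

The statistics computed in the following cells would then stay the same from run to run.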

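Compute the mean, median, and mode. Because the samples are continuous, every value occurs only once, so `stats.mode` reports the smallest sample with a count of 1 and is not informative here.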

In [3]:
mean = np.mean(samples)
median = np.median(samples)
mode = stats.mode(samples)
print("mean :" , mean)
print("median :", median)
print("mode :", mode)

mean : 100.06291523698387
median : 99.75097917068027
mode : ModeResult(mode=array([57.38807292]), count=array([1]))
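The `ModeResult` has a count of 1 because no value repeats in a continuous sample. One possible workaround, sketched here under the assumption that a rough "most common region" is enough, is to bin the data and take the midpoint of the fullest bin. The seed, the 20 bins, and the names `data` and `modal_mid` are illustrative choices, not part of the original notebook:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for the illustration
data = rng.normal(100, 15, 1000)

# Continuous draws are almost surely all distinct, so the largest count is 1.
values, counts = np.unique(data, return_counts=True)
print(counts.max())

# A histogram gives an approximate modal region instead.
hist, edges = np.histogram(data, bins=20)  # 20 bins is an arbitrary choice
peak = np.argmax(hist)
modal_mid = 0.5 * (edges[peak] + edges[peak + 1])
print(modal_mid)
```

The result depends on the number of bins, so it should be read as an estimate rather than an exact mode.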


Compute the min, max, Q1, Q3, and interquartile range

In [4]:
# Use names that do not shadow Python's built-in min() and max()
min_val = samples.min()
max_val = samples.max()
q1 = np.percentile(samples, 25)
q3 = np.percentile(samples, 75)
iqr = stats.iqr(samples)
print("min :", min_val)
print("max :", max_val)
print("q1 :", q1)
print("q3 :", q3)
print("iqr :", iqr)

min : 57.38807292043404
max : 158.48367930031378
q1 : 89.2523118286846
q3 : 110.85246762444105
iqr : 21.60015579575645
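The interquartile range is simply Q3 minus Q1, which the output above confirms (110.852 − 89.252 ≈ 21.600). The sketch below checks that identity on a fresh sample and then applies the common 1.5 × IQR rule of thumb for flagging outliers. The seed, the 1.5 multiplier, and the names `data`, `lower`, `upper`, and `outliers` are assumptions for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed for the illustration
data = rng.normal(100, 15, 1000)

q1, q3 = np.percentile(data, [25, 75])
iqr_manual = q3 - q1
print(np.isclose(iqr_manual, stats.iqr(data)))

# Rule-of-thumb fences: points beyond 1.5 * IQR from the quartiles.
lower = q1 - 1.5 * iqr_manual
upper = q3 + 1.5 * iqr_manual
outliers = data[(data < lower) | (data > upper)]
print(len(outliers))
```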


Compute the variance and standard deviation

In [5]:
# np.var and np.std use ddof=0 by default (population formulas)
variance = np.var(samples)
std_dev = np.std(samples)
print("variance :", variance)
print("standard deviation :", std_dev)

variance : 230.67594159332535
standard deviation : 15.188019673193914
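`np.var` and `np.std` divide by n by default (`ddof=0`), which gives the population variance and standard deviation. If the 1,000 values are treated as a sample from a larger population, the unbiased estimator divides by n − 1 instead. A minimal sketch of the difference, using an illustrative seeded sample (`data`, `pop_var`, and `sample_var` are names introduced here):

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed for the illustration
data = rng.normal(100, 15, 1000)
n = data.size

pop_var = np.var(data)             # ddof=0: divide by n
sample_var = np.var(data, ddof=1)  # ddof=1: divide by n - 1

# The two estimates differ only by the factor n / (n - 1).
print(np.isclose(sample_var, pop_var * n / (n - 1)))
print(np.isclose(np.std(data, ddof=1), np.sqrt(sample_var)))
```

With n = 1,000 the two versions differ by about 0.1%, so the choice matters mostly for small samples.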


Compute the skewness and kurtosis

In [6]:
# skew and kurtosis are reached through the stats module imported earlier
print("skewness :", stats.skew(samples))
print("kurtosis :", stats.kurtosis(samples))

skewness : 0.10891819091198572
kurtosis : -0.19516609901657178
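By default `stats.kurtosis` returns excess (Fisher) kurtosis, so a normal distribution scores about 0 rather than 3, which is why the value above is close to zero. The sketch below reproduces both statistics from standardized moments on an illustrative seeded sample (`data`, `z`, `skew_manual`, and `excess_kurt_manual` are names introduced here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)  # arbitrary seed for the illustration
data = rng.normal(100, 15, 1000)

# Standardize with the population standard deviation (ddof=0).
z = (data - data.mean()) / data.std()
skew_manual = np.mean(z ** 3)
excess_kurt_manual = np.mean(z ** 4) - 3

print(np.isclose(skew_manual, stats.skew(data)))
print(np.isclose(excess_kurt_manual, stats.kurtosis(data)))

# fisher=False switches to Pearson kurtosis, which is 3 for a normal distribution.
print(np.isclose(stats.kurtosis(data, fisher=False), stats.kurtosis(data) + 3))
```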


## NumPy Correlation Calculation

Use `np.arange()` to create an array `x` of integers from 10 (inclusive) to 20 (exclusive).

In [7]:
x = np.arange(10,20)
x

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

Then create a second array `y` containing 10 arbitrary integers. Here they are drawn at random from 0 to 49 with `np.random.randint()` and wrapped in `np.array()`.

In [8]:
y = np.array(np.random.randint(50, size=10))
y

array([49, 24, 14, 14, 11, 10, 33, 47, 28, 34])

Once you have two arrays of the same length, you can compute the correlation coefficient between `x` and `y`. `np.corrcoef()` returns a 2×2 matrix; the off-diagonal entries hold the coefficient.

In [9]:
r = np.corrcoef(x, y)
r

array([[1.        , 0.14557215],
       [0.14557215, 1.        ]])
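The off-diagonal entry of the matrix is the Pearson coefficient itself. The sketch below extracts it and recomputes it from the definition (the sum of products of deviations divided by the square root of the product of the sums of squared deviations), reusing the `y` values printed above. The names `r_xy` and `r_manual` are introduced for illustration:

```python
import numpy as np

x = np.arange(10, 20)
y = np.array([49, 24, 14, 14, 11, 10, 33, 47, 28, 34])  # values from the output above

r_matrix = np.corrcoef(x, y)
r_xy = r_matrix[0, 1]  # correlation between x and y

# Pearson's r computed directly from deviations about the means.
dx = x - x.mean()
dy = y - y.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

print(r_xy, r_manual)
```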

## Pandas Correlation Calculation

Run the code below

In [10]:
x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

Call `stats.pearsonr()` to calculate Pearson's r. It returns the correlation coefficient together with a two-sided p-value.

In [11]:
r = stats.pearsonr(x,y)
r

(0.758640289091187, 0.010964341301680813)
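The cell above uses SciPy even though `x` and `y` are pandas Series. As an alternative sketch, pandas offers `Series.corr()`, which computes Pearson's r by default but does not return a p-value. The names `r_pandas`, `r_scipy`, and `p_value` are introduced here for illustration:

```python
import pandas as pd
from scipy import stats

x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

r_pandas = x.corr(y)  # Pearson is the default method
r_scipy, p_value = stats.pearsonr(x, y)

print(r_pandas, r_scipy)
```

Both calls should agree on the coefficient. Depending on the SciPy version, `stats.pearsonr` may display a result object rather than a plain tuple, but it can be unpacked the same way.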

OPTIONAL. Call the relevant method to calculate Spearman's rho correlation.

In [12]:
rho = stats.spearmanr(x,y)
rho

SpearmanrResult(correlation=0.9757575757575757, pvalue=1.4675461874042197e-06)
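Spearman's rho is Pearson's r applied to the ranks of the data, which is why it is much higher here: `y` increases almost monotonically with `x` even though the relationship is far from linear. A minimal sketch of that equivalence, with `rho_pandas`, `rho_ranks`, and `rho_scipy` as illustrative names:

```python
import pandas as pd
from scipy import stats

x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

rho_pandas = x.corr(y, method="spearman")
rho_ranks = x.rank().corr(y.rank())  # Pearson's r on the ranks
rho_scipy, p_value = stats.spearmanr(x, y)

print(rho_pandas, rho_ranks, rho_scipy)
```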

## Seaborn Dataset Tips

Import Seaborn Library

In [13]:
import seaborn as sns 

In [14]:
tips = sns.load_dataset("tips")
tips

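| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |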


Generate descriptive statistics, including those that summarize central tendency and dispersion.

In [15]:
tips.describe()

| | total_bill | tip | size |
|---|---|---|---|
| count | 244.0 | 244.0 | 244.0 |
| mean | 19.785943 | 2.998279 | 2.569672 |
| std | 8.902412 | 1.383638 | 0.9511 |
| min | 3.07 | 1.0 | 1.0 |
| 25% | 13.3475 | 2.0 | 2.0 |
| 50% | 17.795 | 2.9 | 2.0 |
| 75% | 24.1275 | 3.5625 | 3.0 |
| max | 50.81 | 10.0 | 6.0 |


Call the relevant method to calculate Pearson's r correlation between columns. Here `stats.pearsonr()` is applied to `total_bill` and `tip`.

In [16]:
stats.pearsonr(tips["total_bill"],tips["tip"])

(0.6757341092113642, 6.6924706468640476e-34)
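`stats.pearsonr()` handles one pair of columns at a time. For all pairwise correlations at once, pandas provides `DataFrame.corr()`. The sketch below is self-contained, so it uses only the first five rows shown earlier (numeric columns only) rather than the full dataset; `tips_head` and `corr_matrix` are illustrative names and the resulting values will not match those of the full 244-row table:

```python
import pandas as pd

# First five rows of the tips table shown earlier, numeric columns only.
# The real analysis would use the full DataFrame from sns.load_dataset("tips").
tips_head = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "size": [2, 3, 3, 2, 4],
})

corr_matrix = tips_head.corr()  # pairwise Pearson's r by default
print(corr_matrix)
```

On the full `tips` DataFrame, which also contains categorical columns, recent pandas versions may need `tips.corr(numeric_only=True)` or an explicit selection such as `tips[["total_bill", "tip", "size"]].corr()`.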