## Descriptive Statistics

 Import **NumPy**, **SciPy**, and **Pandas**

In [None]:
import numpy as np
import pandas as pd
import scipy.stats as stats

Randomly generate 1,000 samples from a normal distribution (mean = 100, standard deviation = 15) using `np.random.normal()`.

In [None]:
samples = np.random.normal(100, 15, 1000)
samples

array([ 87.91726001, 107.98888874, 117.46476323,  95.75621992,
       119.10248728, 113.69188432, 127.11366942, 131.51209742,
       112.52180285, 110.25888697,  92.32013484,  79.88191954,
        96.87743598, 104.02296144, 104.68151007,  67.04559184,
        95.51061134, 103.13882622, 112.71911029, 104.34889841,
        93.37477193, 106.20744087, 108.43683432,  81.5795588 ,
       103.11461993, 115.34547618,  75.037193  , 117.73231862,
       105.06684903, 101.83178609, 116.45908154, 107.62667526,
        72.38615338,  90.70139425, 109.93867694, 108.58807611,
        70.81538539, 114.50739252,  88.72518153,  70.32299113,
        82.77567562,  79.4695614 ,  70.86827136,  87.55575657,
       103.15382635, 103.70377516,  82.25410059,  90.1158126 ,
        95.85290678,  93.19593773, 113.58187296, 128.87053947,
       108.20363966, 108.56201794, 114.21050018,  94.18300273,
       118.05664756,  82.56929822, 102.43726336, 121.48739587,
       105.89896072, 130.60153297,  73.31792863,  80.37
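Because `np.random.normal()` draws fresh values on every run, the numbers printed above will differ each time the cell is executed. A minimal sketch of a reproducible alternative, assuming NumPy's `default_rng` generator; the seed value 42 and the names `seeded_samples` and `repeat_samples` are arbitrary choices for illustration, not part of the original exercise:

```python
import numpy as np

# The seed (42) is an arbitrary illustrative value; any fixed integer works.
rng = np.random.default_rng(42)
seeded_samples = rng.normal(loc=100, scale=15, size=1000)

# A second generator created with the same seed reproduces the same draws.
rng_again = np.random.default_rng(42)
repeat_samples = rng_again.normal(loc=100, scale=15, size=1000)

print(seeded_samples.shape)
print(np.array_equal(seeded_samples, repeat_samples))
```

Fixing the seed makes the downstream statistics repeatable, which helps when comparing results between runs.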

Compute the **mean**, **median**, and **mode**

In [None]:
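mean = np.mean(samples)
median = np.median(samples)
# With continuous data almost every value is unique, so the reported mode
# is simply the smallest sample, with a count of 1.
mode = stats.mode(samples)
print(mean)
print()
print(median)
print()
print(mode)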

99.672050151899

99.76152867465606

ModeResult(mode=array([37.07078231]), count=array([1]))
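The mode reported above has a count of 1 because continuous samples rarely repeat. One common workaround, sketched here under the assumption that a binned estimate is acceptable, is to take the midpoint of the tallest histogram bin. The bin count of 20, the seed, and the names `data` and `mode_estimate` are illustrative choices:

```python
import numpy as np

# Illustrative data drawn the same way as in the exercise (seed is arbitrary).
rng = np.random.default_rng(0)
data = rng.normal(100, 15, 1000)

# Count distinct values: for continuous draws this normally equals the sample size.
n_unique = np.unique(data).size

# Bin the data and take the midpoint of the most populated bin as a rough mode.
counts, edges = np.histogram(data, bins=20)
peak = np.argmax(counts)
mode_estimate = (edges[peak] + edges[peak + 1]) / 2

print(n_unique)
print(mode_estimate)
```

The estimate depends on the number of bins, so it should be read as an approximation rather than an exact statistic.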


Compute the **min**, **max**, **Q1**, **Q3**, and **interquartile range**

In [None]:
# Use min_value/max_value so Python's built-in min() and max() are not shadowed
min_value = samples.min()
max_value = samples.max()
q1 = np.percentile(samples, 25)
q3 = np.percentile(samples, 75)
iqr = q3 - q1
iqr1 = stats.iqr(samples)  # SciPy's helper, should match q3 - q1

print("min = ", min_value)
print()
print("max = ", max_value)
print()
print("q1 = ", q1)
print()
print("q3 = ", q3)
print()
print("iqr =", iqr)
print()
print("iqr1 = ", iqr1)

min =  37.07078230906858

max =  160.44666195584608

q1 =  89.57782299156159

q3 =  110.03044381628847

iqr = 20.45262082472688

iqr1 =  20.45262082472688
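The five values above can also be obtained in a single call, since `np.percentile` accepts a list of percentiles. The following is a small sketch on freshly generated data (the seed and the variable names are assumptions for illustration), which also checks that the manual IQR agrees with `stats.iqr`:

```python
import numpy as np
import scipy.stats as stats

# Illustrative data; the seed is arbitrary.
rng = np.random.default_rng(1)
data = rng.normal(100, 15, 1000)

# Five-number summary in one call: min, Q1, median, Q3, max.
minimum, q1, med, q3, maximum = np.percentile(data, [0, 25, 50, 75, 100])

iqr_manual = q3 - q1
iqr_scipy = stats.iqr(data)

print(minimum, q1, med, q3, maximum)
print(np.isclose(iqr_manual, iqr_scipy))
```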


Compute the **variance** and **standard deviation**

In [None]:
# np.var and np.std default to ddof=0, i.e. the population formulas
variance = np.var(samples)
std_dev = np.std(samples)
print("variance =", variance)
print()
print("std_dev = ", std_dev)
print()
print(np.sqrt(np.var(samples)))  # same value as std_dev

variance = 233.4833218964766

std_dev =  15.280161055973089

15.280161055973089
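`np.var` and `np.std` divide by n by default (population formulas). When the data are a sample from a larger population, the usual estimate divides by n - 1, which NumPy exposes through the `ddof` argument. A short sketch with a small hand-made array (the values are chosen for illustration so the arithmetic is easy to follow):

```python
import numpy as np

# Mean is 5; squared deviations sum to 32.
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_var = np.var(data)             # 32 / 8 = 4.0
pop_std = np.std(data)             # sqrt(4.0) = 2.0
sample_var = np.var(data, ddof=1)  # 32 / 7
sample_std = np.std(data, ddof=1)  # sqrt(32 / 7)

print(pop_var, pop_std)
print(sample_var, sample_std)
```

With 1,000 samples the two versions differ only slightly, but the gap grows as the sample gets smaller.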


Compute the **skewness** and **kurtosis**

In [None]:
skewness = stats.skew(samples)
kurtosis = stats.kurtosis(samples)
print("skewness =",skewness)
print()
print("kurtosis =", kurtosis)

skewness = -0.06590028797647153

kurtosis = 0.2846214179409432
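With their default arguments, `stats.skew` and `stats.kurtosis` return the biased moment-based estimates, and the kurtosis is reported as excess kurtosis (Fisher definition, so a normal distribution scores about 0). A sketch that recomputes both from central moments, using illustrative data and variable names:

```python
import numpy as np
import scipy.stats as stats

# Illustrative data; the seed is arbitrary.
rng = np.random.default_rng(2)
data = rng.normal(100, 15, 1000)

# Central moments of order 2, 3 and 4.
centered = data - data.mean()
m2 = np.mean(centered ** 2)
m3 = np.mean(centered ** 3)
m4 = np.mean(centered ** 4)

skew_manual = m3 / m2 ** 1.5
excess_kurtosis_manual = m4 / m2 ** 2 - 3  # subtract 3 for the Fisher definition

print(np.isclose(skew_manual, stats.skew(data)))
print(np.isclose(excess_kurtosis_manual, stats.kurtosis(data)))
```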


## NumPy Correlation Calculation

Create an array x of integers between 10 (inclusive) and 20 (exclusive) using `np.arange()`.

In [None]:
x = np.arange(10,20)
x

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

Then create a second array y containing 10 arbitrary integers, either by hand with `np.array()` or, as here, randomly with `np.random.randint()`.

In [None]:
y = np.random.randint(10, size=10)
y

array([4, 1, 4, 8, 4, 6, 1, 8, 2, 7])

Once you have two arrays of the same length, you can compute the **correlation coefficient** between x and y

In [None]:
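# x and y are redefined here as two short, perfectly linear arrays
x = np.arange(5)
y = np.arange(5,10)
print(x)
print(y)
# Note: np.correlate returns the cross-correlation (here the sum of the
# element-wise products), not Pearson's r. Use np.corrcoef(x, y)[0, 1]
# to obtain the correlation coefficient.
r = np.correlate(x, y)
r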

[0 1 2 3 4]
[5 6 7 8 9]


array([80])
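The value 80 above is the sum of the element-wise products of x and y, which is what `np.correlate` computes in its default mode. For the Pearson correlation coefficient, NumPy provides `np.corrcoef`, which returns a 2×2 correlation matrix. A minimal sketch, reusing the x and y values from the Pandas section below (the names `x_demo`, `y_demo`, and `r_demo` are illustrative):

```python
import numpy as np

x_demo = np.arange(10, 20)
y_demo = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

# corrcoef returns [[r_xx, r_xy], [r_yx, r_yy]]; the off-diagonal entry is Pearson's r.
corr_matrix = np.corrcoef(x_demo, y_demo)
r_demo = corr_matrix[0, 1]

# The same coefficient from its definition: covariance over the product of spreads.
dx = x_demo - x_demo.mean()
dy = y_demo - y_demo.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

print(r_demo)
print(r_manual)
```

Both values should agree with the Pearson result that Pandas reports for the same data.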

## Pandas Correlation Calculation

Run the code below

In [None]:
x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

Call the relevant method to calculate Pearson's r correlation.

In [None]:
r = x.corr(y, method="pearson")
r

0.7586402890911867

OPTIONAL. Call the relevant method to calculate Spearman's rho correlation.

In [None]:
rho = x.corr(y, method="spearman")
rho

0.9757575757575757
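Spearman's rho is Pearson's r applied to the ranks of the data, which explains why it is much higher here: y increases almost monotonically with x even though the relationship is far from linear. A short sketch that checks this equivalence on the same data (the `_demo` and `rho_` names are illustrative):

```python
import pandas as pd

x_demo = pd.Series(range(10, 20))
y_demo = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

# Spearman computed directly by pandas.
rho_direct = x_demo.corr(y_demo, method="spearman")

# Spearman computed as Pearson's r on the ranked values.
rho_via_ranks = x_demo.rank().corr(y_demo.rank(), method="pearson")

print(rho_direct)
print(rho_via_ranks)
```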

## Seaborn Dataset Tips

Import the Seaborn library

In [None]:
import seaborn as sns

Load the "tips" dataset from Seaborn

In [None]:
tips = sns.load_dataset("tips")
tips

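|     | total_bill | tip  | sex    | smoker | day | time   | size |
|-----|------------|------|--------|--------|-----|--------|------|
| 0   | 16.99      | 1.01 | Female | No     | Sun | Dinner | 2    |
| 1   | 10.34      | 1.66 | Male   | No     | Sun | Dinner | 3    |
| 2   | 21.01      | 3.50 | Male   | No     | Sun | Dinner | 3    |
| 3   | 23.68      | 3.31 | Male   | No     | Sun | Dinner | 2    |
| 4   | 24.59      | 3.61 | Female | No     | Sun | Dinner | 4    |
| ... | ...        | ...  | ...    | ...    | ... | ...    | ...  |
| 239 | 29.03      | 5.92 | Male   | No     | Sat | Dinner | 3    |
| 240 | 27.18      | 2.00 | Female | Yes    | Sat | Dinner | 2    |
| 241 | 22.67      | 2.00 | Male   | Yes    | Sat | Dinner | 2    |
| 242 | 17.82      | 1.75 | Male   | No     | Sat | Dinner | 2    |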


Generate descriptive statistics, including those that summarize the central tendency and dispersion of each numeric column.

In [None]:
summary = tips.describe()
summary

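|       | total_bill | tip      | size     |
|-------|------------|----------|----------|
| count | 244.0      | 244.0    | 244.0    |
| mean  | 19.785943  | 2.998279 | 2.569672 |
| std   | 8.902412   | 1.383638 | 0.9511   |
| min   | 3.07       | 1.0      | 1.0      |
| 25%   | 13.3475    | 2.0      | 2.0      |
| 50%   | 17.795     | 2.9      | 2.0      |
| 75%   | 24.1275    | 3.5625   | 3.0      |
| max   | 50.81      | 10.0     | 6.0      |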


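Call the relevant method to calculate the pairwise Pearson's r correlation of the numeric columns.

In [None]:
# numeric_only=True skips the categorical columns (sex, smoker, day, time),
# which recent pandas versions no longer drop automatically
r = tips.corr(method="pearson", numeric_only=True)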
r

|            | total_bill | tip      | size     |
|------------|------------|----------|----------|
| total_bill | 1.0        | 0.675734 | 0.598315 |
| tip        | 0.675734   | 1.0      | 0.489299 |
| size       | 0.598315   | 0.489299 | 1.0      |
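The correlation matrix only covers the numeric columns. One way to make that selection explicit, sketched here on a small hand-built DataFrame so it runs without downloading the dataset, is `select_dtypes`. The frame `demo` copies the first five rows shown earlier and keeps only a few columns, so its coefficients will not match the full-dataset values:

```python
import pandas as pd

# Illustrative subset: first five rows of the tips data, a few columns only.
demo = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "size": [2, 3, 3, 2, 4],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
numeric = demo.select_dtypes(include="number")
corr = numeric.corr(method="pearson")

print(corr)
```

Selecting the columns first keeps the intent visible and works the same way regardless of how `corr` treats non-numeric columns by default.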
