## Descriptive Statistics

 Import **NumPy**, **SciPy**, and **Pandas**

In [1]:
import numpy as np
from scipy import stats
import pandas as pd

Randomly generate 1,000 samples from a normal distribution (mean = 100, standard deviation = 15) using `np.random.normal()`.

Docstring:
normal(loc=0.0, scale=1.0, size=None)

Draw random samples from a normal (Gaussian) distribution.

The probability density function of the normal distribution, first
derived by De Moivre and 200 years later by both Gauss and Laplace
independently [2]_, is often called the bell curve because of
its characteristic shape (see the example below).

The normal distribution occurs often in nature.  For example, it
describes the commonly occurring distribution of samples influenced
by a large number of tiny, random disturbances, each with its own
unique distribution [2]_.

.. note::
    New code should use the ``normal`` method of a ``default_rng()``
    instance instead; please see the :ref:`random-quick-start`.

Parameters

loc : float or array_like of floats
    Mean ("centre") of the distribution.
scale : float or array_like of floats
    Standard deviation (spread or "width") of the distribution. Must be
    non-negative.

In [2]:
np.random.seed(100)
samples = np.random.normal(loc=100, scale=15, size=1000)
samples

array([ 73.7535179 , 105.14020605, 117.29553704,  96.21345945,
       114.7198118 , 107.71328262, 103.31769504,  83.94935004,
        97.15756254, 103.82502166,  93.12959522, 106.52745232,
        91.24607425, 112.25270608, 110.09081209,  98.43383285,
        92.03079435, 115.44599028,  93.42796566,  83.22522631,
       124.28472491, 123.12407762,  96.22181291,  87.36346393,
       102.76778036, 114.05623302, 110.96500516, 120.42334188,
        95.10642911, 100.83514022, 103.33599413,  78.35174507,
        88.65471542, 112.24681017, 111.25667142,  93.16079609,
       117.84433402,  74.6407476 ,  79.65401427,  81.51348229,
        91.83341257,  89.97742395, 100.10971845,  90.80591897,
       119.49622112,  74.00356565,  85.25034851, 105.3626163 ,
        75.79632246, 122.060708  ,  82.17973604,  91.7538071 ,
        85.89930758,  87.58101453, 101.63295202, 107.61714386,
        87.0665898 , 118.74204614,  98.80583131,  86.65402778,
        86.77302416, 100.27958424, 103.56766933, 100.20
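
The note in the docstring recommends the `Generator` API for new code. A minimal sketch of the same draw with `np.random.default_rng()` follows; the names `rng` and `samples_rng` are illustrative, and the values will not match the array above because the generator produces a different random stream than the legacy `np.random.seed()` approach.

```python
import numpy as np

# Seeded Generator instance; reusing the seed value 100 does NOT reproduce
# the legacy np.random.seed(100) stream.
rng = np.random.default_rng(100)

# Same distribution parameters as the cell above: mean 100, std dev 15.
samples_rng = rng.normal(loc=100, scale=15, size=1000)

print(samples_rng.shape)
print(samples_rng[:5])
```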

Compute the **mean**, **median**, and **mode**

In [3]:
mean = samples.mean()
median = np.median(samples)

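from scipy import stats
# For continuous data nearly every value is unique, so stats.mode() simply
# returns the smallest value with a count of 1.
mode = stats.mode(samples)

print("The mean of the samples: ", mean)
print("Median: ", median)
print("Mode: ", mode)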

The mean of the samples:  99.74841763984136
Median:  99.605602838632
Mode:  ModeResult(mode=array([51.85066927]), count=array([1]))
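
A count of 1 means no value repeats, so the reported mode is not informative for continuous data. One possible workaround, sketched here under the assumption that rounding to the nearest integer is an acceptable bin width, is to bin the values first and then take the most frequent bin. The data and the names `rounded` and `binned_mode` are illustrative, not part of the original notebook.

```python
import numpy as np

# Illustrative data with the same distribution parameters as the notebook.
rng = np.random.default_rng(0)
data = rng.normal(loc=100, scale=15, size=1000)

# Bin by rounding to the nearest integer (an assumed bin width of 1).
rounded = np.round(data).astype(int)

# Count occurrences of each bin and pick the most frequent one.
values, counts = np.unique(rounded, return_counts=True)
binned_mode = values[np.argmax(counts)]
mode_count = counts.max()

print("Binned mode:", binned_mode, "count:", mode_count)
```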


Compute the **min**, **max**, **Q1**, **Q3**, and **interquartile range**

In [4]:
# Use min_val / max_val so the built-in min() and max() are not shadowed.
min_val = samples.min()
max_val = samples.max()
q1 = np.percentile(samples, [25])
q3 = np.percentile(samples, [75])
iqr = q3 - q1

print("min: ", min_val, "\nmax: ", max_val, "\nQ1: ", q1, "\nQ3: ", q3, "\nIQR: ", iqr)

min:  51.85066927032931 
max:  157.8690951023446 
Q1:  [89.40898926] 
Q3:  [110.36147675] 
IQR:  [20.9524875]
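
The quartiles above print as one-element arrays because the percentile was passed as a list. As a small sketch on a hypothetical array `data` (not the notebook's samples), both quartiles can be requested in one call and unpacked into scalars:

```python
import numpy as np

# Hypothetical data: the integers 1 through 9.
data = np.arange(1, 10)

# One call returns both quartiles; unpacking yields scalars, not arrays.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

print(q1, q3, iqr)  # 3.0 7.0 4.0
```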


Compute the **variance** and **standard deviation**

In [5]:
variance = samples.var()
std_dev = samples.std()

print("Samples variance: ", variance, "\nSamples standard deviation: ", std_dev)

Samples variance:  246.10207359530654 
Samples standard deviation:  15.687640791250498
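
Note that `ndarray.var()` and `ndarray.std()` default to `ddof=0`, which divides by n (the population formulas). The sketch below, on a small hypothetical array, shows how passing `ddof=1` switches to the n - 1 divisor commonly used for sample estimates.

```python
import numpy as np

# Hypothetical data with mean 5 and squared deviations summing to 32.
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

pop_var = data.var()            # divides by n      -> 32 / 8
pop_std = data.std()            # sqrt of the above
sample_var = data.var(ddof=1)   # divides by n - 1  -> 32 / 7
sample_std = data.std(ddof=1)

print(pop_var, pop_std)         # 4.0 2.0
print(sample_var, sample_std)
```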


Compute the **skewness** and **kurtosis**

In [6]:
skewness = stats.skew(samples)
kurtosis = stats.kurtosis(samples)
print("Skewness: ", skewness, "\nKurtosis: ", kurtosis)

Skewness:  0.13210251958658759 
Kurtosis:  0.21772990682551496
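
With their default arguments, `stats.skew()` and `stats.kurtosis()` return the moment-based skewness and the excess (Fisher) kurtosis, which is 0 for a normal distribution. The following sketch recomputes both from central moments with NumPy alone; the helper `central_moment` and the tiny array are illustrative assumptions.

```python
import numpy as np

def central_moment(a, k):
    """Return the k-th central moment of a (dividing by n)."""
    return np.mean((a - a.mean()) ** k)

# Hypothetical symmetric data, so the skewness should be 0.
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

m2 = central_moment(data, 2)
m3 = central_moment(data, 3)
m4 = central_moment(data, 4)

skewness = m3 / m2 ** 1.5             # third standardized moment
excess_kurtosis = m4 / m2 ** 2 - 3.0  # subtract 3 so a normal gives 0

print(skewness, excess_kurtosis)
```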


## NumPy Correlation Calculation

Create an array x of integers between 10 (inclusive) and 20 (exclusive). Use `np.arange()`

In [7]:
x = np.arange(10,20)
x

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

Then use `np.array()` to create a second array y containing 10 arbitrary integers.

In [8]:
y = np.array(np.random.randint(15, size=10))
y

array([ 1,  3,  5, 10,  8, 14,  0,  8,  2,  3])

Once you have two arrays of the same length, you can compute the **correlation coefficient** between x and y with `np.corrcoef()`, which returns the full 2×2 correlation matrix.

In [9]:
print(len(x))
print(len(y))
r = np.corrcoef(x,y)
print(r)

10
10
[[1.       0.008197]
 [0.008197 1.      ]]
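
The off-diagonal entries of the matrix are Pearson's r. As a sketch of what `np.corrcoef()` computes, the cell below derives r directly from the deviations about the means, reusing the x and y values printed above (y is hard-coded here because the original was drawn at random).

```python
import numpy as np

x = np.arange(10, 20)
y = np.array([1, 3, 5, 10, 8, 14, 0, 8, 2, 3])  # values from the output above

# Deviations from the means.
dx = x - x.mean()
dy = y - y.mean()

# Pearson's r: covariance term divided by the product of the spreads.
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# The same value sits in the off-diagonal of the correlation matrix.
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```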


## Pandas Correlation Calculation

Run the code below

In [10]:
x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

Call the relevant method to calculate Pearson's r correlation.

In [11]:
r = x.corr(y)
print("Pearson's correlation: ", r)

Pearson's correlation:  0.7586402890911867


OPTIONAL. Call the relevant method to calculate Spearman's rho correlation.

In [12]:
rho = x.corr(y, method="spearman")  # Series.corr(other, method='pearson', min_periods=None)
rho

0.9757575757575757
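
Spearman's rho is Pearson's r applied to the ranks of the data, which is why it is much higher here: y increases almost monotonically with x even though the relationship is far from linear. A short sketch that reproduces the value by ranking first (the names `x_rank`, `y_rank`, and `rho_from_ranks` are illustrative):

```python
import pandas as pd

x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

# Replace each value by its rank (1 = smallest).
x_rank = x.rank()
y_rank = y.rank()

# Pearson's r on the ranks equals Spearman's rho.
rho_from_ranks = x_rank.corr(y_rank)

print(rho_from_ranks)
```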

## Seaborn Dataset Tips

Import Seaborn Library

In [13]:
import seaborn as sns

Load "tips" dataset from Seaborn

In [14]:
tips = sns.load_dataset("tips")
tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.3 KB


Generate descriptive statistics, including those that summarize central tendency and dispersion.

In [15]:
tips.describe()

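|       | total_bill | tip      | size     |
|-------|------------|----------|----------|
| count | 244.0      | 244.0    | 244.0    |
| mean  | 19.785943  | 2.998279 | 2.569672 |
| std   | 8.902412   | 1.383638 | 0.9511   |
| min   | 3.07       | 1.0      | 1.0      |
| 25%   | 13.3475    | 2.0      | 2.0      |
| 50%   | 17.795     | 2.9      | 2.0      |
| 75%   | 24.1275    | 3.5625   | 3.0      |
| max   | 50.81      | 10.0     | 6.0      |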


Call the relevant method to calculate pairwise Pearson's r correlation of columns

In [16]:
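# Restrict to numeric columns (needed in pandas 2.0+, where numeric_only defaults to False).
tips.corr(numeric_only=True)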

|            | total_bill | tip      | size     |
|------------|------------|----------|----------|
| total_bill | 1.0        | 0.675734 | 0.598315 |
| tip        | 0.675734   | 1.0      | 0.489299 |
| size       | 0.598315   | 0.489299 | 1.0      |
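
Only the numeric columns appear in the correlation matrix; the categorical columns (`sex`, `smoker`, `day`, `time`) are left out. One way to make that selection explicit is `select_dtypes()`, sketched here on a small hypothetical DataFrame so that it runs without downloading the tips dataset.

```python
import pandas as pd

# Hypothetical stand-in for the tips data: two numeric columns, one categorical.
df = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.0, 2.0, 3.0, 4.0],
    "day": pd.Categorical(["Thur", "Fri", "Sat", "Sun"]),
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
numeric = df.select_dtypes(include="number")
corr_matrix = numeric.corr()

print(corr_matrix)
```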
