## Descriptive Statistics

Import **NumPy**, **SciPy**, and **Pandas**.

In [12]:
import numpy as np
import pandas as pd
from scipy import stats

Randomly generate 1,000 samples from a normal distribution (mean = 100, standard deviation = 15) using `np.random.normal()`.

In [9]:
samples = np.random.normal(100, 15, 1000)
len(samples)

1000
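
Because `np.random.normal()` draws from NumPy's global random state, the figures in the outputs that follow change on every run. A minimal sketch of a reproducible draw, assuming NumPy's `Generator` API (`np.random.default_rng`) is available; the seed 42 is an arbitrary illustrative value, not one used in this notebook:

```python
import numpy as np

# A seeded generator yields the same samples on every run.
rng = np.random.default_rng(42)  # 42 is an arbitrary, illustrative seed
samples_seeded = rng.normal(loc=100, scale=15, size=1000)

# Re-creating the generator with the same seed reproduces the draw exactly.
samples_again = np.random.default_rng(42).normal(loc=100, scale=15, size=1000)
print(len(samples_seeded), np.array_equal(samples_seeded, samples_again))
```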

Compute the **mean**, **median**, and **mode**

In [13]:
mean = samples.mean()
median = np.median(samples)
mode = stats.mode(samples)
print(mean)
print(median)
print(mode)

100.02625433026519
100.03212252580461
ModeResult(mode=array([50.98753339]), count=array([1]))
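
With continuous data almost every value is unique, so `stats.mode()` has nothing to count: the result above has a count of 1 and is simply the smallest observation (it equals the minimum reported further down). One common workaround, sketched here under the assumption that 20 equal-width bins are adequate for this sample, is to report the midpoint of the fullest histogram bin. The seed and the bin count are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed so the sketch is repeatable
samples = rng.normal(100, 15, 1000)

# Count observations in 20 equal-width bins (the bin count is an assumption).
counts, edges = np.histogram(samples, bins=20)

# The modal bin is the one with the highest count; report its midpoint.
k = np.argmax(counts)
binned_mode = (edges[k] + edges[k + 1]) / 2
print(binned_mode)
```

The binned mode depends on the bin width, so it is best read as a rough location of the peak rather than an exact statistic.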


Compute the **min**, **max**, **Q1**, **Q3**, and **interquartile range**

In [19]:
samples.min()

50.987533390636145

In [20]:
samples.max()

145.86783215240283

In [21]:
q1 = np.percentile(samples, 25)
q1

90.39384870898836

In [22]:
q3 = np.percentile(samples, 75)
q3

109.77265628080403

In [23]:
iqr = q3 - q1
iqr

19.37880757181567
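
As a cross-check, SciPy provides `stats.iqr()`, which computes the same quantity in one call. The sketch below uses its own seeded sample (the seed is an arbitrary assumption) and compares the manual and SciPy results:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
samples = rng.normal(100, 15, 1000)

q1, q3 = np.percentile(samples, [25, 75])  # both quartiles in one call
iqr_manual = q3 - q1
iqr_scipy = stats.iqr(samples)             # same quantity computed by SciPy
print(iqr_manual, iqr_scipy)
```

For a normal distribution the interquartile range is about 1.349 standard deviations, roughly 20.2 here, which is consistent with the value of 19.38 obtained above.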

Compute the **variance** and **standard deviation**

In [24]:
np.var(samples)  # population variance (ddof=0)

217.90041399603254

In [25]:
np.std(samples)  # population standard deviation (ddof=0)

14.76145026736982
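
By default `np.var()` and `np.std()` use `ddof=0`, that is, they divide by n (the population formulas). To estimate the variance of the underlying distribution from a sample, the usual choice is to divide by n - 1 by passing `ddof=1`. A short sketch on a seeded sample (the seed is an arbitrary assumption):

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
samples = rng.normal(100, 15, 1000)
n = len(samples)

pop_var = np.var(samples)             # ddof=0: divides by n
sample_var = np.var(samples, ddof=1)  # ddof=1: divides by n - 1
sample_std = np.std(samples, ddof=1)
print(pop_var, sample_var, sample_std)
```

pandas defaults to `ddof=1` in `Series.var()` and `Series.std()`, so its results differ slightly from NumPy's defaults on the same data.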

Compute the **skewness** and **kurtosis**

In [27]:
stats.skew(samples) # skewness

-0.012774921670748033

In [29]:
stats.kurtosis(samples)  # excess kurtosis (Fisher definition)

-0.09362823881761573
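
`stats.kurtosis()` returns excess kurtosis by default (Fisher's definition), for which a normal distribution scores 0; that is why the value above is close to zero. Passing `fisher=False` gives Pearson's definition, where a normal distribution scores 3. A sketch on a seeded sample (the seed is an arbitrary assumption):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
samples = rng.normal(100, 15, 1000)

excess = stats.kurtosis(samples)                 # Fisher (default): normal -> 0
pearson = stats.kurtosis(samples, fisher=False)  # Pearson: normal -> 3
print(excess, pearson)
```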

## NumPy Correlation Calculation

Create an array x of integers between 10 (inclusive) and 20 (exclusive) using `np.arange()`.

In [30]:
x = np.arange(10,20)
x

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

Then use `np.array()` to create a second array y containing 10 arbitrary integers.

In [31]:
y = np.array([43, 35, 34, 67, 89, 21, 48, 56, 78, 23])
y

array([43, 35, 34, 67, 89, 21, 48, 56, 78, 23])

Once you have two arrays of the same length, you can compute the **correlation coefficient** between x and y with `np.corrcoef()`, which returns the full 2×2 correlation matrix.

In [33]:
r = np.corrcoef(x, y)
r

array([[1.        , 0.08483986],
       [0.08483986, 1.        ]])
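
The diagonal of this matrix is always 1 (each array correlated with itself), and the two off-diagonal entries both hold the coefficient between x and y. A minimal sketch that extracts the scalar and checks it against the textbook definition; the names `r_xy` and `r_manual` are illustrative:

```python
import numpy as np

x = np.arange(10, 20)
y = np.array([43, 35, 34, 67, 89, 21, 48, 56, 78, 23])

# The off-diagonal entry of the 2x2 matrix is the correlation between x and y.
r_xy = np.corrcoef(x, y)[0, 1]

# Same value from the definition: co-deviation sum over the root of the
# product of the squared-deviation sums.
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
print(r_xy, r_manual)
```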

## Pandas Correlation Calculation

Run the code below

In [None]:
x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

Call the relevant method to calculate Pearson's r correlation coefficient.

In [34]:
r = stats.pearsonr(x, y)
r

(0.08483986373978589, 0.8157428589968816)
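
The section asks for a pandas calculation, while the cell above uses SciPy's `stats.pearsonr()`. Note also that the value shown (0.0848...) equals the NumPy result from the previous section, which suggests the cell defining the two Series (prompt `In [None]`) had not been run when it was produced. A minimal sketch of the pandas route on the Series as defined above, using `Series.corr()`; the name `r_pandas` is illustrative:

```python
import pandas as pd

x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

# Series.corr() computes Pearson's r by default (method='pearson').
r_pandas = x.corr(y)
print(round(r_pandas, 4))
```

For these two Series the coefficient comes out at roughly 0.76, a much stronger relationship than the 0.08 obtained for the arbitrary integers used earlier.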

Optional: call the relevant method to calculate Spearman's rho correlation coefficient.

In [36]:
rho = stats.spearmanr(x, y)
rho

SpearmanrResult(correlation=0.07878787878787878, pvalue=0.8287173946974606)
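
The value shown above (0.0788) appears to have been computed from the NumPy arrays of the previous section, not from the two Series. Spearman's rho is Pearson's r applied to the ranks of the data, so it can be sketched with pandas alone as follows; `rho_ranks` is an illustrative name:

```python
import pandas as pd

x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

# Spearman's rho is Pearson's r computed on the ranks of the data.
rho_ranks = x.rank().corr(y.rank())
print(round(rho_ranks, 4))
```

pandas also accepts `x.corr(y, method='spearman')` for the same calculation. Because y increases almost monotonically with x, rho for these Series is close to 1.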

## Seaborn Dataset Tips

Import the Seaborn library.

In [39]:
import seaborn as sns

Load the "tips" dataset from Seaborn.

In [40]:
tips = sns.load_dataset("tips")

Generate descriptive statistics, including those that summarize central tendency and dispersion.

In [44]:
tips.head()

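|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |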


In [45]:
tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.3 KB


In [46]:
tips.describe().T

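|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| total_bill | 244.0 | 19.785943 | 8.902412 | 3.07 | 13.3475 | 17.795 | 24.1275 | 50.81 |
| tip | 244.0 | 2.998279 | 1.383638 | 1.0 | 2.0 | 2.9 | 3.5625 | 10.0 |
| size | 244.0 | 2.569672 | 0.9511 | 1.0 | 2.0 | 2.0 | 3.0 | 6.0 |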


Call the relevant method to calculate the pairwise Pearson's r correlation between the numeric columns.

In [79]:
tips_num = tips[['total_bill','tip','size']]

In [80]:
tips_num

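|   | total_bill | tip | size |
|---|---|---|---|
| 0 | 16.99 | 1.01 | 2 |
| 1 | 10.34 | 1.66 | 3 |
| 2 | 21.01 | 3.50 | 3 |
| 3 | 23.68 | 3.31 | 2 |
| 4 | 24.59 | 3.61 | 4 |
| ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | 3 |
| 240 | 27.18 | 2.00 | 2 |
| 241 | 22.67 | 2.00 | 2 |
| 242 | 17.82 | 1.75 | 2 |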
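
The cells above select the numeric columns but stop short of computing the correlation. In pandas the pairwise calculation is `DataFrame.corr()`, so on the full dataset the call would presumably be `tips_num.corr()`. The sketch below avoids downloading the dataset by using only the nine rows visible in the truncated display, so its numbers are not those of the full 244-row table; `tips_sample` is an illustrative name:

```python
import pandas as pd

# The nine rows visible in the truncated display above (not the full dataset).
tips_sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 29.03, 27.18, 22.67, 17.82],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 5.92, 2.00, 2.00, 1.75],
    "size": [2, 3, 3, 2, 4, 3, 2, 2, 2],
})

# Pairwise Pearson correlation of every pair of numeric columns.
corr_matrix = tips_sample.corr()  # method='pearson' is the default
print(corr_matrix)
```

The result is a symmetric matrix with ones on the diagonal, in the same layout as the `np.corrcoef()` output but labelled by column name.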


In [50]:
np.median(tips['total_bill'])

17.795

In [51]:
np.median(tips['tip'])

2.9

In [52]:
np.median(tips['size'])

2.0
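
The three calls above can be collapsed into one, since `DataFrame.median()` returns the median of every column at once. A sketch using only the numeric columns of the five rows shown by `tips.head()`, so the values differ from the full-dataset medians above; `tips_head_num` is an illustrative name:

```python
import pandas as pd

# Numeric columns of the first five rows, as shown by tips.head().
tips_head_num = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "size": [2, 3, 3, 2, 4],
})

# One call returns a Series of medians indexed by column name.
medians = tips_head_num.median()
print(medians)
```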

In [53]:
stats.mode(tips)

ModeResult(mode=array([[13.42, 2.0, 'Male', 'No', 'Sat', 'Dinner', 2]], dtype=object), count=array([[  3,  33, 157, 151,  87, 176, 156]]))
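
Passing a mixed-type DataFrame to `stats.mode()` worked in the SciPy version used here, but newer SciPy releases may reject non-numeric input. pandas offers `DataFrame.mode()` and `Series.value_counts()`, which handle categorical columns directly. A sketch on the five rows shown by `tips.head()`, so the results are not those of the full dataset; `tips_head` is an illustrative name:

```python
import pandas as pd

# First five rows of the tips dataset, as shown by tips.head().
tips_head = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "smoker": ["No"] * 5,
    "day": ["Sun"] * 5,
    "time": ["Dinner"] * 5,
    "size": [2, 3, 3, 2, 4],
})

# DataFrame.mode() accepts mixed dtypes; row 0 holds the first mode per column.
modes = tips_head.mode().iloc[0]
print(modes)

# value_counts() also reports how often each category occurs.
sex_counts = tips_head["sex"].value_counts()
print(sex_counts.idxmax(), sex_counts.max())
```

When a column has several equally frequent values, `DataFrame.mode()` returns all of them in additional rows, so taking row 0 keeps only the smallest.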

In [113]:
tips.columns

Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size'], dtype='object')

### Skewness

In [110]:
# Iterate over the column names directly instead of indexing by position.
for col in tips_num.columns:
    print('Skewness of {}: {}'.format(col, stats.skew(tips_num[col])))

Skewness of total_bill: 1.1262346334818638
Skewness of tip: 1.4564266884221506
Skewness of size: 1.4389653841920984


### Kurtosis

In [109]:
# Iterate over the column names directly instead of indexing by position.
for col in tips_num.columns:
    print('Kurtosis of {}: {}'.format(col, stats.kurtosis(tips_num[col])))

Kurtosis of total_bill: 1.1691681323851366
Kurtosis of tip: 3.5495519893455114
Kurtosis of size: 1.6719276263625504
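
pandas has its own `skew()` and `kurt()` methods, but they apply a small-sample bias correction, so they do not reproduce SciPy's default output exactly. Passing `bias=False` to the SciPy functions gives matching values. A sketch on the nine rows visible in the truncated `tips_num` display, so the numbers are not those of the full dataset; `tips_sample` is an illustrative name:

```python
import pandas as pd
from scipy import stats

# The nine rows visible in the truncated tips_num display (not the full dataset).
tips_sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 29.03, 27.18, 22.67, 17.82],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 5.92, 2.00, 2.00, 1.75],
    "size": [2, 3, 3, 2, 4, 3, 2, 2, 2],
})

for col in tips_sample.columns:
    biased = stats.skew(tips_sample[col])                 # SciPy default (bias=True)
    corrected = stats.skew(tips_sample[col], bias=False)  # bias-corrected estimate
    # pandas' skew() agrees with the bias-corrected SciPy value.
    print(col, biased, corrected, tips_sample[col].skew())
```

The same relationship holds between `stats.kurtosis(..., bias=False)` and `Series.kurt()`. On the full dataset of 244 rows the correction is small, so the two conventions give similar figures.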
