# Data Science I Topic 3.1 Descriptive Statistics

## Non-programming Exercise

<u>In this non-programming ("pen and paper") exercise, feel free to answer the questions and perform the calculations on paper, with your calculator, or in code cells used the way you would use a calculator during the exam (i.e. standard arithmetic, up to whatever your scientific calculator can do).</u>

**Q1:** <u>Briefly (1-2 sentences), explain the difference between:</u>

1. Population vs. Sample

2. Parameter vs. Statistic

3. Variable vs. Constant

4. Qualitative vs. Quantitative Variables

**Q2:** Given the following lifetimes (in hours) of 12 transistors:<br>
113, 121, 140, 106, 132, 134, 118, 117, 108, 122, 127, 138

1. Find the sample mean, median, variance, and standard deviation.
2. Assuming the lifetime of transistors follows a normal distribution, calculate the standard scores of the data points: 106, 117, 132, and 140.
3. We want to transform this dataset so that its mean is 100 and its standard deviation is 10. Find the transformed dataset.

**Q3**: <u>Describe the relationship between the mean and the median in a:</u>
* Normal distribution
* Positively skewed distribution
* Negatively skewed distribution

**Q4**: Calculate the Pearson correlation coefficient between the numbers of **Gold** and **Total** medals won at the 2016 Olympics by the 12 countries shown below.

<img src="/content/medal.png" width="500">
(source: Wiki)

In [None]:
# when editing with colab, the "<img src..." doesn't work, so you can do this instead:
# from IPython.display import Image
# Image("/content/medal.png")

## Programming Exercise

<u>Descriptive statistics is concerned with the description and summarization of data. We need to present data in a meaningful way. We've seen before in T2 how to extract summary statistics and visual information from a data frame. In this tutorial we'll practise what we know from EDA and see more data description techniques.</u>

In [None]:
!pip install pydataset

In [None]:
# Run this cell
from pydataset import data
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

For this tutorial, we are going to use the `birthwt` data set from `pydataset`. Follow the instructions below.

In [None]:
# Run this cell to show the dataset documentation
birthwt = data('birthwt', show_doc=True)

In [None]:
# Run this cell to display the first 5 entries of birthwt
birthwt = data('birthwt')
print(birthwt.head())

In [None]:
# Optional: To see all available data from pydataset, run this cell.
pd.set_option("display.max_rows", None, "display.max_columns", None)
data()


In [None]:
# If you ran the cell directly above, also run this cell to reset the two options it changed
pd.reset_option("display.max_rows")
pd.reset_option("display.max_columns")

***

### Describing data set

#### Reading a stem and leaf graph

<u>Run the following cell and answer the questions.</u>

In [None]:
import stemgraphic # may need to install first: pip install stemgraphic
stemgraphic.stem_graphic(birthwt.bwt)
plt.show()

<u>Note that the leaves are rounded to the nearest ten (see the Key). Compare with the following list.</u>

In [None]:
# Run this cell to see the sorted values of all birth weights.
print(birthwt.bwt.sort_values().to_list())

<u>From the graph, with values rounded to the nearest ten grams,</u>
1. What's the median weight?
2. How many babies were born with a birth weight of 3.2 kg?
3. How many babies were born weighing more than 4 kg? Less than 1.5 kg?

**Ans:**






1.
2.
3.

***

#### Frequency tables using `pandas.crosstab()`

<u>Run the following cell and answer the questions.</u>

In [None]:
# Recall low=1 when birth weight is <2.5kg (see the dataset description)
pd.crosstab(index=birthwt.age, # set index based on birthwt.age
            columns=[birthwt.low], # values to group by in the columns
            margins=True) # add marginal totals (an 'All' row and an 'All' column)
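To see what `margins=True` adds, here is a minimal sketch on made-up data (the frame `toy` and its columns `group` and `flag` are hypothetical, not part of `birthwt`). It also shows the `normalize="index"` option, which is one possible way to turn counts into row-wise proportions.

```python
import pandas as pd

# Made-up data: 'group' and 'flag' are hypothetical column names
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b", "b", "c"],
    "flag":  [0,   1,   0,   1,   1,   0,   0,   0],
})

# Counts, with an extra 'All' row and 'All' column holding the totals
counts = pd.crosstab(index=toy["group"], columns=toy["flag"], margins=True)
print(counts)

# Row-wise proportions: each row sums to 1
props = pd.crosstab(index=toy["group"], columns=toy["flag"], normalize="index")
print(props)
```

The `All` column is the sum across each row and the `All` row is the sum down each column, so the bottom-right cell is the total number of observations.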

1. What proportion of all babies were born with a low birth weight?
2. Among women younger than 18, what proportion of babies were born with a low birth weight?

**Ans:**

<u>We can also set multiple column levels to group the values.</u>

In [None]:
# Try it: complete the following to get multiple column levels
pd.crosstab(__________, #set index based on the race
            columns=[birthwt.smoke, birthwt.low], # tabulate based on two column levels
            __________) # display the marginal totals

* Among smokers, what proportion of babies had a low birth weight?
* Among non-smokers, what proportion of babies had a low birth weight?

**Ans:**

#### Class Intervals

<u>We can group individual scores into class intervals, which is useful for summarizing a large dataset.</u>

In [None]:
# Run this cell
import numpy as np
bins = np.arange(1,6)*1000 #[1000,...,5000]

# add a new column to contain the interval groups
birthwt["wtint"] = np.digitize(birthwt.bwt.to_list(),
                               bins,
                               right=True) # right-inclusive
birthwt.head()

Interval groups:
* 0: 0 < birth weight $\leq$ 1000
* 1: 1000 < birth weight $\leq$ 2000
* 2: 2000 < birth weight $\leq$ 3000
* 3: 3000 < birth weight $\leq$ 4000
* 4: 4000 < birth weight $\leq$ 5000
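The effect of `right=True` is easiest to see on values that sit exactly on a bin edge. The sketch below reuses the bin edges from the cell above on a handful of made-up weights (the names `edges` and `weights` are illustrative only).

```python
import numpy as np

# Same bin edges as in the cell above: [1000, 2000, 3000, 4000, 5000]
edges = np.arange(1, 6) * 1000

# Made-up weights in grams, chosen to sit on and around the edges
weights = [500, 1000, 1001, 2500, 4000, 5000]

# right=True: a value equal to an edge falls into the interval ending at that edge
groups_right = np.digitize(weights, edges, right=True)
print(groups_right.tolist())   # [0, 0, 1, 2, 3, 4]

# right=False (the default): a value equal to an edge starts the next interval
groups_left = np.digitize(weights, edges, right=False)
print(groups_left.tolist())    # [0, 1, 1, 2, 4, 5]
```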

<u>Using `.groupby()` then `.count()`, display the count per interval.</u>

In [None]:
#Ans:


<u>Using `.groupby()` and `.agg()`, display the mean, median, and count of the birth weights (`bwt`) per interval group.</u>

In [None]:
#Ans:


#### Graphs

<u>Display the histograms of birth weights for smokers and non-smokers in one plot.</u>

In [None]:
#Ans:


<u>You can get a normalized (density) plot by using `sns.kdeplot`, or `sns.histplot` with `stat='density'`. The older `sns.distplot` is deprecated since seaborn 0.11. Run the cell below and compare it to the histogram you've just created.</u>

In [None]:
# Try swapping sns.kdeplot for sns.histplot(..., bins=20, stat='density', kde=True) and see the difference

sns.kdeplot(birthwt.bwt[birthwt.smoke==0], color='b', label='non-smokers')
sns.kdeplot(birthwt.bwt[birthwt.smoke==1], color='r', label='smokers')
plt.legend() # show the labels given above

plt.show()

### Summarizing data set

#### Percentile and Percentile Rank

<u>How do you interpret percentile, quantile, and percentile rank? Run the following cells. What do the numbers show?</u>

In [None]:
# with scipy.stats
from scipy import stats

stats.scoreatpercentile(birthwt.bwt,25)

In [None]:
# with numpy
np.percentile(birthwt.bwt,25)

In [None]:
# with pandas
birthwt.bwt.quantile(0.25)
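Percentile and percentile rank answer opposite questions: one maps a percentage to a value, the other maps a value to a percentage. A minimal sketch on made-up scores (the list `scores` is hypothetical; `kind="weak"` is chosen here so that the rank means "percentage of values at or below the score"):

```python
import numpy as np
from scipy import stats

scores = [10, 20, 30, 40, 50]  # made-up values

# Percentile: which value sits at the 25% mark?
p25 = np.percentile(scores, 25)

# Percentile rank: what percentage of values is at or below 20?
rank_of_20 = stats.percentileofscore(scores, 20, kind="weak")

print(p25, rank_of_20)
```

Note that `stats.percentileofscore` defaults to `kind="rank"`, which treats ties differently from `"weak"` and `"strict"`, so the result can change when the score occurs several times in the data.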

**Ans:**




<u>What is the percentile rank of a birth weight of 2500 grams? How do you interpret this number?</u>

In [None]:
# Complete this
stats.percentileofscore(birthwt.bwt, _______)

**Ans:**




#### Measure of Central Tendency

<u>Using `.groupby()` and `.agg()`, display the mean and median birth weight for:</u>
* each weight interval
* each race group
* smokers/ non-smokers

In [None]:
#Ans:


#### Box plots

In [None]:
# run this cell
sns.boxplot(y='bwt', data=birthwt)
plt.yticks(range(500,5001,500))
plt.grid(True)
plt.show()

##### Inter-quartile range

<u>From the boxplot, estimate:</u>
* Min and max weights?
* First quartile, Q1?
* Third quartile, Q3?
* Inter-quartile range, IQR?

Compare your answers to the numbers obtained by using `pandas`, `numpy`, or `scipy.stats`.

In [None]:
#Ans:


In [None]:
# programmatically
print('min:', ___________,'max:', ____________)

Q1 = _____________________
print(Q1)

Q3 = _____________________
print(Q3)

IQR = ____________________
print(IQR)

##### Outliers identification

<u>Define RUB (reasonable upper boundary) to be RUB=Q3+1.5\*IQR and RLB to be RLB=Q1-1.5\*IQR.<br>
Find the outliers, which fall either below the RLB or above the RUB.</u>
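The boundary rule can be sketched on a small made-up series (the names `values`, `lower`, and `upper` are illustrative; `lower` and `upper` play the roles of RLB and RUB):

```python
import pandas as pd

values = pd.Series([10, 12, 13, 14, 15, 16, 18, 40])  # made-up data

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr   # plays the role of RLB
upper = q3 + 1.5 * iqr   # plays the role of RUB

# Keep only the values outside [lower, upper]
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers.tolist())
```

The parentheses around each comparison are needed because `|` binds more tightly than `<` and `>` in Python.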

**Ans:**

In [None]:
RUB = __________
RLB = __________

birthwt.bwt[____________________]

#### Measure of variation

##### Range

<u>The range of a set of data is the difference between the highest and lowest values in the set.
<br>
Display the range of the birth weights for:</u>
* the whole set
* grouped by smokers/ non-smokers

**Ans:**

In [None]:
# range of birthweight of the whole set
print('range = ', __________, 'grams')

In [None]:
# range of birthweight, grouped
print('grouped range:')
birthwt.__________

##### Deviation score

<u>The deviation score is the difference between a given score and the mean, $$x_{i}=\left(X_{i}-\bar{X}\right)$$</u>

<u>Add a new column, `devscore`, that contains the deviation scores of the birth weights, to the `birthwt` DataFrame.</u>

**Ans:**

In [None]:
birthwt["devscore"] = ______________________________
birthwt.head(3)

##### Mean absolute deviation

The mean absolute deviation is the average of the absolute values of the deviation scores, $$\text{MAD}=\frac{\sum\left|X_{i}-\bar{X}\right|}{n}=\frac{\sum\left|x_{i}\right|}{n}$$
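A small worked example on made-up scores shows why the absolute value matters: the signed deviation scores always cancel out, while their absolute values do not. The array `X` is hypothetical.

```python
import numpy as np

X = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # made-up scores, mean is 5

dev = X - X.mean()           # deviation scores x_i
print(dev.sum())             # signed deviations cancel out: 0.0
mad = np.abs(dev).mean()     # mean absolute deviation
print(mad)                   # 1.5
```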

***
Display the mean absolute deviation by either:
* `.abs()` followed by `.mean()` on `devscore`
* `.mad()` on `birthwt.bwt` (pandas versions before 2.0 only; `Series.mad()` was removed in pandas 2.0)

**Ans:**

In [None]:
print('Mean deviation =', __________)

##### Variance and standard deviation

<u>Display the variance and standard deviation of the birth weights per weight interval `wtint`.</u>

In [None]:
#Ans:


#### Measure of Symmetry

##### Skewness

<u>Skewness is a measure of (the lack of) symmetry. You can use `.skew()` on pandas series.</u>
* less than -1 or greater than +1: highly skewed.
* between -1 and -0.5 or between +0.5 and +1: moderately skewed.
* between -0.5 and 0.5: approximately symmetric.
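To get a feel for the sign of the skewness, the sketch below compares three made-up series (all names are illustrative): a symmetric one, one with a long right tail, and its mirror image.

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])
right_tail = pd.Series([1, 1, 2, 2, 3, 10])   # one large value stretches the right tail
left_tail = -right_tail                        # mirror image of right_tail

print(symmetric.skew(), right_tail.skew(), left_tail.skew())
```

A long right tail gives a positive value, a long left tail gives a negative value of the same size, and the symmetric series gives zero.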
<br>
<br>
<u>Display the skewness of all the birth weights. What can you conclude?</u>

In [None]:
#Ans:


**Ans:**

##### Kurtosis

<u>Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.</u>
* excess kurtosis of a normal distribution: exactly 0
* excess kurtosis<0 : platykurtic (shorter and thinner tails, central peak is lower and broader)
* excess kurtosis>0 : leptokurtic (longer and fatter tails, central peak is higher and sharper)
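As a rough illustration of the two cases, the sketch below uses two made-up series (names are hypothetical): one spread evenly with no tails, and one concentrated at the centre with a few extreme values.

```python
import pandas as pd

flat = pd.Series(range(1, 11))                          # evenly spread, no tails
heavy = pd.Series([-10, -1, 0, 0, 0, 0, 0, 0, 1, 10])   # mostly central, two extreme values

print(flat.kurt(), heavy.kurt())
```

The evenly spread series has negative excess kurtosis (platykurtic) and the series with extreme values has positive excess kurtosis (leptokurtic).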
<br><br>
<u>Using `.kurt()`, display the (excess) kurtosis of all the birth weights. What can you conclude?</u>

In [None]:
#Ans:


**Ans:**

In [None]:
# Run this cell
sns.histplot(birthwt.bwt, bins=20, stat='density', kde=True) # replaces sns.distplot, deprecated since seaborn 0.11
plt.show()

#### Standard scores

<u>A dataset is approximately normal if its scores are clustered symmetrically around its mean, with fewer data points the farther you go from the mean.<br>
Empirical rule: if the dataset is normal, then</u>
* 68% of data lies within $\mu\pm s$
* 95% of data lies within $\mu\pm 2s$
* 99.7% of data lies within $\mu\pm 3s$
<br>
<u>When a dataset is normally distributed, we can find out the probability of a score occurring by standardising the scores.</u>
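As a rough sanity check of the empirical rule, the sketch below draws simulated normal data and measures the share of values within one, two, and three standard deviations of the mean. The seed, the mean of 100, and the standard deviation of 15 are arbitrary choices for illustration, not values from this tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)                  # fixed seed, arbitrary choice
sample = rng.normal(loc=100, scale=15, size=100_000)

mu, s = sample.mean(), sample.std(ddof=1)
for k in (1, 2, 3):
    inside = np.abs(sample - mu) <= k * s       # True where a value is within k std devs
    print(k, round(inside.mean() * 100, 1))     # percentage of values inside
```

For simulated normal data the three percentages should land close to 68, 95, and 99.7; real data will usually deviate more.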

<u>Using `stats.percentileofscore()`, `.mean()`, and `.std()`, check what percentage of the birth weights fall within $\mu\pm s$, $\mu\pm 2s$, and $\mu\pm 3s$. Is this close enough to approximate a normal distribution?</u>

**Ans:**

In [None]:
mu = _______________
s = ________________

#check +/-s
#

In [None]:
#check +/-2s
#

In [None]:
#check +/-3s
#

##### Z-scores

<u>The Z-score, or standard score, is the number of standard deviations a given data point lies above or below the mean, $$z=\frac{X-\bar{X}}{s},$$ where $s$ is the standard deviation.</u>

<u>Add a new column, `zscore`, to the `birthwt` DataFrame.</u>
<br><br>
Hint: either use the formula and the familiar `.mean()` and `.std()` or find out how to use `stats.zscore()`.
<br><br>
Note that `stats.zscore` divides by $n$ by default (`ddof=0`), while the pandas `.std()` divides by $n-1$ (`ddof=1`), so the numbers will differ slightly unless we pass `ddof=1` to `stats.zscore`.
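The `ddof` difference can be seen on a small made-up series (the name `x` and its values are hypothetical):

```python
import numpy as np
import pandas as pd
from scipy import stats

x = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])  # made-up scores

z_manual = (x - x.mean()) / x.std()          # pandas .std() divides by n-1
z_default = stats.zscore(x)                  # scipy default ddof=0, divides by n
z_ddof1 = stats.zscore(x, ddof=1)            # now matches the pandas result

print(np.allclose(z_manual, z_ddof1), np.allclose(z_manual, z_default))
```

Only the version with `ddof=1` agrees with the manual formula. The gap between the two conventions shrinks as the sample size grows.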

In [None]:
birthwt["zscore"] = _____________________________________

# Display a few entries
__________

#### Transformed standard scores

Say we want to transform our standardized score $z$ into a distribution with a new mean $\bar{X}^{\prime}$ and standard deviation $s^{\prime}$. The transformed score is then given by $$X^{\prime}=\left(s^{\prime}\right)\left(z\right)+\bar{X}^{\prime}$$
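A minimal sketch of the transformation on made-up scores, using an arbitrary target mean of 50 and standard deviation of 10 (these targets and the names `x`, `new_mean`, and `new_std` are chosen for illustration only):

```python
import pandas as pd

x = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])  # made-up scores

z = (x - x.mean()) / x.std()      # standardise: mean 0, std 1
new_mean, new_std = 50, 10        # arbitrary illustrative targets
x_new = new_std * z + new_mean

print(x_new.mean(), x_new.std())
```

Because the transformation is linear, the order of the scores and the shape of the distribution are unchanged; only the location and scale move.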

<u>Add a new column `trfwt` to contain the transformed data with new mean of 3000 grams and standard deviation of 500 grams. Check the new mean, median, and standard deviation.</u>

**Ans:**

In [None]:
birthwt["trfwt"] = __________________________

# display mean, median, std
#

### Bivariate Analysis

<u>Find the Pearson's r and Spearman's rho correlation coefficients between `age` and `bwt`</u>
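For intuition on how the two coefficients can differ, the sketch below uses made-up data in which `y` rises with `x` but not along a straight line. It calls `scipy.stats` directly; the pandas `.corr()` method with its `method` argument is another option.

```python
import numpy as np
from scipy import stats

x = np.arange(1, 11)
y = x ** 3                      # monotonic but clearly non-linear

r, _ = stats.pearsonr(x, y)     # measures linear association
rho, _ = stats.spearmanr(x, y)  # measures monotonic association via ranks

print(round(r, 3), round(rho, 3))
```

Spearman's rho is 1 because the ranks of `x` and `y` agree perfectly, while Pearson's r is below 1 because the points do not fall on a straight line.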

In [None]:
# Pearson's r
__________

# Spearman's rho
__________

<u>What can you conclude?</u>