## Day 7 : Statistics all the Way...

**REFERENCE:** https://medium.datadriveninvestor.com/day-7-60-days-of-data-science-and-machine-learning-6bc9cc2ceb0b

### **First of all, Why Statistics?**

There are numerous questions which stats can help you answer, like:
- **Consumer behaviour and insights:** How likely is someone to purchase a product? Which payment system are they going to use, based on their purchase history?
- **Ads targeting and optimization:** Which ad is more effective at getting people to purchase a product?
- **Optimization:** How can you optimize occupancy in a hotel based on its occupancy history?
- **Patterns:** What is the best-fitting t-shirt size, based on what 95% of the population wears?
- **Medicine:** Is a medicine likely to cure a disease? Is a vaccine effective?

The field of statistics is about collecting, summarizing and analyzing data.

### Terminology you should know

- **Population:** It’s the entire group of individuals or items you want to draw conclusions about.
- **Variable:** It’s a measurable factor or condition that exists in different amounts or types.
- **Effect Size:** It conveys how large a difference is, for example the difference between the averages of two groups.
- **Random Sampling:** It’s a way of selecting individuals from a population that gives every individual an equal probability of being selected.
- **Point Estimate:** It’s a single value, computed from a sample, that estimates some value in a population, such as an average.
    - ***For example, the sample mean x̄ is a point estimate of the population mean μ.***
- **Confidence Interval:** It’s a range of values around a point estimate that is likely to contain the true value of a population parameter, at a stated confidence level such as 95%.
- **Margin of Error:** It’s the amount added to and subtracted from a point estimate to form a confidence interval; it expresses how far off the estimate could be because of sampling.
- **Standard Deviation:** It measures how far data points typically lie from the mean; it’s the square root of the variance (the average squared distance from the mean).
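
To make the last few terms concrete, here is a minimal sketch that computes a point estimate, a margin of error and a 95% confidence interval for a mean. The sample values are made up for illustration, and the interval relies on the normal approximation (z ≈ 1.96), which is an assumption; for a sample this small, a t-based interval would be somewhat wider.

```python
import numpy as np

# Hypothetical sample of 10 measurements (illustrative values only)
sample = np.array([62, 70, 68, 75, 66, 71, 69, 73, 64, 72])

point_estimate = sample.mean()                    # sample mean, estimates the population mean
sample_std = sample.std(ddof=1)                   # sample standard deviation (n - 1 denominator)
standard_error = sample_std / np.sqrt(len(sample))

z = 1.96                                          # approximate z value for 95% confidence (assumption)
margin_of_error = z * standard_error
ci_lower = point_estimate - margin_of_error
ci_upper = point_estimate + margin_of_error

print("Point estimate:", point_estimate)
print("Margin of error:", round(margin_of_error, 2))
print("95% confidence interval:", (round(ci_lower, 2), round(ci_upper, 2)))
```

The confidence interval is simply the point estimate plus and minus the margin of error.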

### Branches of statistics

Knowing the type of statistics you need to answer your question will help you choose the appropriate methods to get the most accurate answer possible. There are two branches of statistics:

- **Descriptive statistics:** It describes and summarizes the data at hand.
**Example:**
How do the students get to their college? A descriptive answer is that 40% of them drive, 35% ride the bus, and 25% bike.

- **Inferential statistics:** It uses sample data to make inferences about a larger population.
**Example:**
What percentage of all students drive to college, estimated from a sample of students?


#### There are two types of data

<p style="text-align:center;">
    <img src='https://miro.medium.com/max/421/0*pwaDQmqZXuUB7SXa.png'>
</p>

- **Numeric / Quantitative Data:** It’s made up of numeric values and is further divided into Discrete (counted items) and Continuous (measured quantities). For this type of data we can use summary statistics such as the mean, and plots such as scatter plots.
    - ***Examples of Quantitative Data:***
        - Discrete: number of schools, number of children, number of births per hour
        - Continuous: weight, height, volume
<br></br>
- **Categorical / Qualitative Data:** It’s made up of values that belong to distinct groups and is further divided into Nominal (no natural order) and Ordinal (ordered categories). For this type of data we use summary statistics such as counts, and plots such as bar plots.
    - ***Examples of Qualitative Data:*** marital status, type of disease, eye color

It’s important to understand your data, as this helps you choose the right summary statistics and visualizations.
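
As a rough illustration of matching the summary to the data type, the sketch below builds a small pandas DataFrame with made-up values (the column names are hypothetical) and summarizes the numeric columns with the mean and the categorical column with counts.

```python
import pandas as pd

# Hypothetical records mixing quantitative and categorical columns
people = pd.DataFrame({
    "height_cm": [160.5, 172.0, 168.3, 181.2, 175.4],            # continuous
    "children": [0, 2, 1, 3, 2],                                 # discrete
    "eye_color": ["brown", "blue", "brown", "green", "brown"],   # nominal
})

numeric_summary = people[["height_cm", "children"]].mean()   # mean suits numeric data
category_counts = people["eye_color"].value_counts()         # counts suit categorical data

print(numeric_summary)
print(category_counts)
```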

### Measures of Central Tendency

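A measure of central tendency is a single value that describes the center or middle of the data set. There are three common measures:
1. **Mean**
2. **Median**
3. **Mode**

The range (maximum minus minimum) is sometimes listed alongside them, but it describes spread rather than center.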

<p style="text-align:center;">
    <img src='https://miro.medium.com/max/467/1*6UAMMlQUKNGNnCKrE_69Jg.png'>
</p>

<p style="text-align:center;">
    <img src='https://miro.medium.com/max/658/0*gvJueqTvVX2OH0O7'>
</p>
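
As a quick sketch, the three measures of center can be computed with Python's built-in `statistics` module. The values are made up for illustration.

```python
import statistics

values = [2, 4, 4, 5, 7, 8, 12]   # illustrative values

mean_value = statistics.mean(values)       # sum of values / number of values
median_value = statistics.median(values)   # middle value after sorting
mode_value = statistics.mode(values)       # most frequent value

print("Mean:", mean_value)
print("Median:", median_value)
print("Mode:", mode_value)
```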

#### Measure of Spread/ Measure of Variation

Variability is central to statistics, so when you are describing data, never rely on the center alone. Measures of spread describe how far the values are dispersed. There are four common measures of spread/variation:

1. **Standard Deviation**
2. **Inter-Quartile Range (IQR)**
3. **Variance**
4. **Range**

<p style="text-align:center;">
    <img src='https://miro.medium.com/max/313/1*puWqQJcb4e7MNdTCgX9iCA.png'>
</p>

<p style="text-align:center;">
    <img src='https://miro.medium.com/max/343/1*0uMyMPHCwgngO0f_guVUTA.png'>
</p>
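
The sketch below computes these measures of spread with NumPy on a small made-up data set. Note that `np.var()` and `np.std()` use the population formulas by default (`ddof=0`); passing `ddof=1` gives the sample versions.

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])   # illustrative values

variance = np.var(data)                  # population variance (ddof=0 by default)
std_dev = np.std(data)                   # population standard deviation
sample_std = np.std(data, ddof=1)        # sample standard deviation (n - 1 denominator)
q1, q3 = np.percentile(data, [25, 75])   # first and third quartiles
iqr = q3 - q1                            # interquartile range
value_range = data.max() - data.min()    # range: maximum minus minimum

print("Variance:", variance)
print("Standard deviation:", std_dev)
print("Sample standard deviation:", sample_std)
print("IQR:", iqr)
print("Range:", value_range)
```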

**Measure of frequency:** Shows how often a value occurs

#### Which Measure to use?
- If the distribution is normal or roughly symmetrical, use the mean and standard deviation. The mean works well for symmetrical data, but it is sensitive to extreme values.
- If the distribution is skewed or has large outliers, use the median and IQR. Both are far less affected by extreme values than the mean, the standard deviation or the range.
- If the distribution is bimodal, use the modes to work out whether the two peaks represent different groups; the range can also help show how far apart the values lie.
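
To see why the median is preferred for skewed data, the sketch below adds a single extreme value to a small made-up data set and compares how the mean and the median react.

```python
import numpy as np

incomes = np.array([30, 32, 35, 36, 38])   # illustrative values, in thousands
with_outlier = np.append(incomes, 500)     # add one extreme value

print("Mean without / with outlier:  ", np.mean(incomes), np.mean(with_outlier))
print("Median without / with outlier:", np.median(incomes), np.median(with_outlier))
```

The single extreme value more than triples the mean, while the median barely moves.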

### Numpy Statistical Functions

The functions below compute summary statistics for a data set, either over the whole array or along a specified axis.

- **np.mean():** To determine the mean value.
- **np.median():** To determine the median value.
- **np.std():** To determine the standard deviation.
- **np.var():** To determine the variance.
- **np.average():** To determine the weighted average (the plain mean if no weights are given).
- **np.percentile():** To determine the nth percentile.
- **np.amin():** To determine the minimum value.
- **np.amax():** To determine the maximum value.
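
As a small sketch of what "along an axis" means, the example below uses a made-up 2×3 array: with no `axis` the function reduces over all elements, `axis=0` reduces down the rows (one result per column), and `axis=1` reduces across the columns (one result per row).

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])   # illustrative values

overall_mean = np.mean(grid)           # over all six elements
column_means = np.mean(grid, axis=0)   # one mean per column
row_means = np.mean(grid, axis=1)      # one mean per row

print("Overall:", overall_mean)
print("Per column:", column_means)
print("Per row:", row_means)
```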

#### IMPLEMENTATION

In [1]:
import numpy as np

arr1= np.array([[12, 43, 56], [78, 88, 95], [79, 89, 43], [101, 34, 67]])
arr2 = np.array([5, 6, 7, 12, 34, 67, 89])

In [2]:
### Mean Function

print("Mean of Array 1:", np.mean(arr1))
print("Mean of Array 2:", np.mean(arr2))

Mean of Array 1: 65.41666666666667
Mean of Array 2: 31.428571428571427


In [3]:
### Average Function

print("Average of Array 1:",np.average(arr1))
print("Average of Array 2:",np.average(arr2))

Average of Array 1: 65.41666666666667
Average of Array 2: 31.428571428571427


#### Question 1: What is the difference between Mean and Average?
In everyday use, "average" and "mean" refer to the same thing: the sum of all values divided by the number of values. In NumPy, np.mean() always computes this arithmetic mean, while np.average() also accepts optional weights and computes a weighted average. Without weights the two functions return the same result, as the outputs above show.
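
The difference only shows up once weights are supplied. The sketch below uses made-up scores and weights to compare the two functions.

```python
import numpy as np

scores = np.array([80, 90, 100])   # illustrative values
weights = np.array([1, 1, 2])      # the last score counts twice as much

plain_mean = np.mean(scores)                         # (80 + 90 + 100) / 3
unweighted_avg = np.average(scores)                  # same as np.mean without weights
weighted_avg = np.average(scores, weights=weights)   # (80*1 + 90*1 + 100*2) / 4

print("Mean:", plain_mean)
print("Average without weights:", unweighted_avg)
print("Average with weights:", weighted_avg)
```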

In [4]:
### Median Function

print("Median of Array 1:",np.median(arr1))
print("Median of Array 2:",np.median(arr2))

Median of Array 1: 72.5
Median of Array 2: 12.0


In [5]:
### Standard Deviation Function

print("Standard Deviation of Array 1:", np.std(arr1))
print("Standard Deviation of Array 2:", np.std(arr2))

Standard Deviation of Array 1: 26.59404173034922
Standard Deviation of Array 2: 31.409084867994768


In [6]:
### Variance Function

print("Variance of Array 1:",np.var(arr1))
print("Variance of Array 2:",np.var(arr2))

Variance of Array 1: 707.2430555555557
Variance of Array 2: 986.530612244898


In [7]:
### Percentile Function

print("Percentile of Array 1:", np.percentile(arr1, 50, axis=0))
print("Percentile of Array 2:", np.percentile(arr2, 50))

Percentile of Array 1: [78.5 65.5 61.5]
Percentile of Array 2: 12.0


In [8]:
### Minimum Function

print("Minimum element of Array 1:", np.amin(arr1))
print("Minimum element of Array 2:", np.amin(arr2))

Minimum element of Array 1: 12
Minimum element of Array 2: 5


In [9]:
### Maximum Function

print("Maximum element of Array 1:", np.amax(arr1))
print("Maximum element of Array 2:", np.amax(arr2))

Maximum element of Array 1: 101
Maximum element of Array 2: 89


### Groupby and Aggregate Function

When you want to compare summary statistics between groups, it’s much easier to use .groupby() and .agg().
- The groupby function can be combined with one or more aggregation functions to easily summarize data.

In [10]:
import pandas as pd

df = pd.DataFrame({'A': [1, 10, 20, 2],
                   'B': [1, 2, 30, 40],
                   'C': np.random.randn(4)})

df.groupby('B').agg(['min', 'max'])

| B | A (min) | A (max) | C (min) | C (max) |
|---|---|---|---|---|
| 1 | 1 | 1 | -1.513781 | -1.513781 |
| 2 | 10 | 10 | -1.424833 | -1.424833 |
| 30 | 20 | 20 | -1.008311 | -1.008311 |
| 40 | 2 | 2 | -2.276786 | -2.276786 |
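
In the example above every value of B is unique, so each group holds a single row and the minimum equals the maximum. The sketch below uses a made-up data set (the column names are hypothetical) in which the grouping key repeats, so the aggregates actually summarize several rows per group.

```python
import pandas as pd

# Hypothetical sales records; "region" repeats, so each group has several rows
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south", "north"],
    "units": [10, 4, 7, 9, 1],
})

summary = sales.groupby("region")["units"].agg(["min", "max", "mean"])
print(summary)
```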


### Measure of Spread: Quantiles

Quantiles are a great way of summarizing numerical data since they can be used to measure center and spread, as well as to get a sense of where a data point stands in relation to the rest of the data set. For example, you might want to give a discount to the 20% most active users on a website.

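- **np.quantile(arr, q, axis):** Computes the q-th quantile of the given data, where q is a fraction between 0 and 1 (for example, 0.25 for the first quartile), optionally along an axis.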

<p style="text-align:center;">
    <img src='https://miro.medium.com/max/431/0*zLtERtPW5j9c6TR3.png'>
</p>
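
The discount example can be sketched as follows. The session counts are made up, and "most active" is taken here to mean strictly above the 80th percentile, which is an assumption about how the cutoff is defined.

```python
import numpy as np

# Hypothetical weekly session counts for ten users
sessions = np.array([3, 15, 7, 22, 9, 30, 12, 5, 18, 25])

threshold = np.quantile(sessions, 0.80)       # 80th percentile of activity
top_users = sessions[sessions > threshold]    # users above the cutoff

print("Activity threshold:", threshold)
print("Session counts of the most active users:", top_users)
```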

In [11]:
arr2 = np.array([5,6,7,12,34,67,89]) 

print("Q2/Median Quantile of Array : ", np.quantile(arr2, .50))
print("Q1 Quantile of Array : ", np.quantile(arr2, .25))
print("Q3 Quantile of Array : ", np.quantile(arr2, .75))
print("10th Percentile of Array : ", np.quantile(arr2, .1))

Q2/Median Quantile of Array :  12.0
Q1 Quantile of Array :  6.5
Q3 Quantile of Array :  50.5
10th Percentile of Array :  5.6


The interquartile range (IQR = Q3 − Q1) is a measure of spread that is less influenced by outliers, and it is also used to find them.
- ***If a value is less than Q1−1.5×IQR or greater than Q3+1.5×IQR then it’s considered an outlier.***

Calculate the lower and upper cutoffs for outliers:
- lower = q1 - 1.5 * iqr
- upper = q3 + 1.5 * iqr


In [12]:
### Implementation
arr = [31, 35, 45, 49, 59, 69, 74, 79, 80, 81, 89, 94, 96, 99, 101, 104, 112, 117,119,127,134]
  
## First quartile (Q1)
Q1 = np.quantile(arr, .25)
  
## Third quartile (Q3)
Q3 = np.quantile(arr, .75)
  
## Interquartile range (IQR)
IQR = Q3 - Q1
  
print(IQR)

35.0
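
The cell above stops at the IQR. As a sketch of the next step, the example below reuses the same values, appends one extreme value (250, added purely for illustration), and applies the 1.5 × IQR rule to flag outliers.

```python
import numpy as np

# Same values as above, plus one extreme value (250) added for illustration
values = np.array([31, 35, 45, 49, 59, 69, 74, 79, 80, 81, 89, 94, 96, 99,
                   101, 104, 112, 117, 119, 127, 134, 250])

q1 = np.quantile(values, 0.25)
q3 = np.quantile(values, 0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr   # lower cutoff
upper = q3 + 1.5 * iqr   # upper cutoff

# Keep only the values that fall outside the cutoffs
outliers = values[(values < lower) | (values > upper)]
print("Cutoffs:", lower, upper)
print("Outliers:", outliers)
```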


### Continuous Distribution Function

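A continuous distribution describes a variable that can take any value in a range. It can be uniform, where all values in an interval are equally likely, or take forms where some values have a higher probability than others. In scipy.stats, distributions are shifted and stretched with the location (loc) and scale (scale) parameters: uniform with loc=4 and scale=5 is uniform on the interval [4, 9], and its cdf(x) returns the probability of observing a value less than or equal to x.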

In [13]:
from scipy.stats import uniform

arr2 = np.array([5,6,7,12,34,67,89]) 
print(uniform.cdf(arr2, loc= 4 , scale= 5))

[0.2 0.4 0.6 1.  1.  1.  1. ]
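
To see where these numbers come from, the sketch below recomputes the uniform CDF by hand as (x − loc) / scale, clipped to the range [0, 1], and compares it with SciPy's result.

```python
import numpy as np
from scipy.stats import uniform

x = np.array([5, 6, 7, 12])
loc, scale = 4, 5   # uniform on the interval [loc, loc + scale] = [4, 9]

manual_cdf = np.clip((x - loc) / scale, 0, 1)      # fraction of the interval at or below x
scipy_cdf = uniform.cdf(x, loc=loc, scale=scale)

print("Manual:", manual_cdf)
print("SciPy: ", scipy_cdf)
```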


#### Normal Distribution

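The normal distribution is a continuous, bell-shaped distribution that is symmetric about its mean; a normally distributed variable can take any real value. The norm.ppf() function used below is the percent point function (the inverse of the CDF): given a probability p, it returns the value z for which a standard normal variable is less than or equal to z with probability p.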

In [14]:
from scipy.stats import norm
import numpy as np

arr2 = np.array([0.91,0.17,0.99996833, 0.81, 0.97,0.54])
print(norm.ppf(arr2))

[ 1.34075503 -0.95416525  4.00000928  0.8778963   1.88079361  0.10043372]
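
As a quick check that `norm.ppf()` is the inverse of `norm.cdf()`, the sketch below converts a few probabilities to z-scores and back again.

```python
import numpy as np
from scipy.stats import norm

probabilities = np.array([0.17, 0.54, 0.81, 0.97])

z_scores = norm.ppf(probabilities)   # probability -> z-score (inverse CDF)
recovered = norm.cdf(z_scores)       # z-score -> probability (CDF)

print("z-scores:", z_scores)
print("Recovered probabilities:", recovered)
```

Probabilities below 0.5 map to negative z-scores and probabilities above 0.5 map to positive ones, because the standard normal distribution is symmetric around 0.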
