# <ins style="color:#9bdeac"> Concepts/Methods </ins>


## <ins style="color:#f2aaaa"> Descriptive Statistics </ins>
Descriptive statistics is the term given to the analysis of data that helps describe, show or summarize data in a meaningful way such that, for example, patterns might emerge from the data. Descriptive statistics do not, however, allow us to make conclusions beyond the data we have analysed or reach conclusions regarding any hypotheses we might have made. They are simply a way to describe our data.

Descriptive statistics are important because raw data on its own is hard to interpret, especially when there is a lot of it. They let us present the data in a more meaningful way, which makes it simpler to interpret.

For example, if we had the results of 100 pieces of students' coursework, we may be interested in the overall performance of those students. We would also be interested in the distribution or spread of the marks. Descriptive statistics allow us to do this.

_Descriptive statistics are broken down into two categories: measures of central tendency and measures of variability or spread._

### <span style="color:#f2aaaa"> Measures Of Central Tendency </span>
Central tendency refers to the idea that there is one number that best summarizes the entire set of measurements, a number that is in some way “central” to the set.

### <span style="color:#f2aaaa"> Mean </span>
The mean, or average, is a measure of the central tendency of the data, i.e. a number around which the whole data set is spread out. In a way, it is a single number that can stand in for the whole data set.

It's calculated by taking the sum of the observations divided by the total number of observations.

$$\mu =\dfrac {\sum x}{n}$$
Where:
* $\mu$ is the mean 
* ${\sum x}$ is the sum of all values
* ${n}$ is the number of observations
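
As a small illustration of the formula above, the sketch below computes the mean by hand and with NumPy. The list `scores` is made-up example data, not something prescribed by these notes.

```python
import numpy as np

# Illustrative data (an assumption for this example)
scores = [1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8]

# Sum of the observations divided by the number of observations
manual_mean = sum(scores) / len(scores)

# NumPy computes the same quantity
numpy_mean = np.mean(scores)

print(manual_mean, numpy_mean)
```

Both values should agree up to floating-point rounding.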
 
### <span style="color:#f2aaaa"> Median </span>
The median is the middle number in a sorted list of numbers (ascending or descending) and can be more descriptive of the data set than the average. When the list has an even number of values, the median is the average of the two middle values. The median is often used instead of the mean when there are outliers that might skew the average of the values.

![median](https://s3-us-west-2.amazonaws.com/courses-images/wp-content/uploads/sites/277/2017/04/24221812/CNX_BMath_Figure_05_05_008_img.png)

__Note__: When the values are in arithmetic progression (the difference between consecutive terms is constant; here it is 2), the median is always equal to the mean.
![median](https://miro.medium.com/max/155/1*risOxB9tig15o1b7pE8xHw.png)
Both the mean and the median are 6.
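
To make the two points above concrete, here is a minimal sketch using Python's standard `statistics` module. The lists are assumed example values: `progression` is an arithmetic progression with common difference 2, and `with_outlier` replaces its last value with a large outlier.

```python
import statistics

progression = [2, 4, 6, 8, 10]     # assumed values, common difference of 2
with_outlier = [2, 4, 6, 8, 100]   # same list with one large outlier

# In an arithmetic progression the mean and the median coincide
progression_mean = statistics.mean(progression)
progression_median = statistics.median(progression)

# The outlier pulls the mean up, while the median stays put
outlier_mean = statistics.mean(with_outlier)
outlier_median = statistics.median(with_outlier)

print(progression_mean, progression_median)  # 6 6
print(outlier_mean, outlier_median)          # 24 6
```

This is why the median is often preferred as a summary when a few extreme values are present.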

### <span style="color:#f2aaaa"> Mode </span>
The mode is the value that appears most often in a data set, i.e. the value with the highest frequency.
``` python
[1,2,3,4,4,4,5,6,7,7,8]
```
In this example the mode is 4 because it appears more frequently than any of the other values.

A data set can also have no mode at all, if every value appears the same number of times. If two values tie for the highest frequency, the data set is bimodal; if three values tie, it is trimodal; in general, a data set with several modes is multimodal.
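
One way to find the mode or modes in Python is `statistics.multimode` (available from Python 3.8), which returns every value tied for the highest frequency. The lists below are illustrative assumptions.

```python
import statistics

unimodal = [1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8]
bimodal = [1, 1, 2, 2, 3]
all_tied = [1, 2, 3]  # every value appears once, so there is no single mode

unimodal_modes = statistics.multimode(unimodal)
bimodal_modes = statistics.multimode(bimodal)
all_tied_modes = statistics.multimode(all_tied)

print(unimodal_modes)   # [4]
print(bimodal_modes)    # [1, 2]
print(all_tied_modes)   # [1, 2, 3]
```

When every value is returned, as in the last case, it is usually more natural to say the data set has no mode.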

### <span style="color:#f2aaaa"> Measures Of Spread/Dispersion </span>
Measures of spread describe how similar or varied the observed values are for a particular variable.

### <span style="color:#f2aaaa"> Standard Deviation </span>
Standard deviation measures the typical distance between each value and the mean, that is, how far the data is spread out from the mean. A low standard deviation indicates that the data points tend to be close to the mean of the data set, while a high standard deviation indicates that the data points are spread out over a wider range of values.

The population standard deviation is calculated by subtracting the mean from each value, squaring the results, summing them, dividing by the size of the population, and taking the square root.

$$\sigma =\sqrt {\dfrac {\sum \left( x_{i}-\mu \right) ^{2}}{N}}$$

Where:
* $\sigma$ is the population standard deviation
* $\mu$ is the population mean
* $x_{i}$ is each value from the population
* ${N}$ is the size of the population

When we want to find the standard deviation of a sample (a segment of the population), we have to change the formula a bit: we use the sample mean and divide by $n-1$ instead of $N$.

$$s =\sqrt {\dfrac {\sum \left( x_{i}-\overline {x} \right) ^{2}}{n-1}}$$

Where:
* $s$ is the sample standard deviation
* $\overline {x}$ is the sample mean
* $x_{i}$ is each value from the sample
* ${n}$ is the size of the sample
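
The difference between the two formulas can be sketched in NumPy as follows. The array `data` is assumed example data; `np.std` divides by the number of values by default (`ddof=0`), and `ddof=1` switches to dividing by one less than the number of values.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8])  # illustrative data
squared_deviations = (data - data.mean()) ** 2

# Population formula: divide by the number of values
population_std = np.sqrt(squared_deviations.sum() / len(data))

# Sample formula: divide by one less than the number of values
sample_std = np.sqrt(squared_deviations.sum() / (len(data) - 1))

# NumPy equivalents
numpy_population_std = np.std(data)        # ddof=0 is the default
numpy_sample_std = np.std(data, ddof=1)

print(population_std, sample_std)
```

The sample value is always slightly larger than the population value for the same data, and the gap shrinks as the number of values grows.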

##### <span style="color:#e36387"> What do you consider a good standard deviation? </span> 
For an approximate answer, estimate the coefficient of variation (CV = standard deviation / mean). As a rule of thumb, a CV >= 1 indicates relatively high variation, while a CV < 1 can be considered low.

__For Example__:
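``` python
import numpy as np

l = [1,2,3,4,4,4,5,6,7,7,8]
mean = np.mean(l)
std = np.std(l)    # population standard deviation (ddof=0)
cv = std / mean    # coefficient of variation = standard deviation / mean

print((mean, std, cv))
        # mean            # STD              # CV
# (4.636363636363637, 2.10076727423479, 0.4531066669918174)
```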
In this example the coefficient of variation is less than 1, so we can say that this data has relatively low variation, and in fact it does. If we change a few of the data points to very large numbers, we can see how it affects the standard deviation and CV.

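``` python
import numpy as np

l = [1,2,3,45,4,4,5,6,73,7,8]
mean = np.mean(l)
std = np.std(l)    # population standard deviation (ddof=0)
cv = std / mean    # coefficient of variation = standard deviation / mean

print((mean, std, cv))
        # mean            # STD              # CV
# (14.363636363636363, 21.95976787123848, 1.528844598630527)
```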
### <span style="color:#f2aaaa"> Variance </span>
Variance ($\sigma ^{2}$) in statistics is a measurement of the spread between numbers in a data set. It is the average of the squared distances between each number in the set and the mean.

It's also equal to the standard deviation squared:

$$\sigma ^{2}=\dfrac {\sum \left( x_{i}-\mu \right) ^{2}}{N}=\left( \sigma \right) ^{2}$$

Where:
* $\sigma ^{2}$ is the variance
* $\sigma$ is the standard deviation
* $\mu$ is the population mean
* $x_{i}$ is each value from the population
* ${N}$ is the size of the population

##### <span style="color:#e36387"> Should I use the variance or the standard deviation? </span> 
The standard deviation is the square root of the variance. It is expressed in the same units as the mean, whereas the variance is expressed in squared units. For looking at a distribution you can use either, as long as you are clear about which one you are using.
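
The relationship between the two measures can be checked with a short sketch. The array `data` is assumed example data, and both calls use NumPy's default population formulas.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8])  # illustrative data

variance = np.var(data)   # average squared distance from the mean
std = np.std(data)        # square root of the variance

print(variance, std ** 2)
```

The two printed values should match up to floating-point rounding.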

### <span style="color:#f2aaaa"> Range </span>
Range is one of the simplest techniques of descriptive statistics. It is the difference between the highest and lowest values.

![range](https://miro.medium.com/max/330/1*xP8y25SjrptUu958fSL_-A.png)

Range is 99–12 = 87
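
In code, the range is simply the maximum minus the minimum. The list below is a hypothetical data set chosen so that its lowest and highest values match the 12 and 99 in the calculation above; the values in between are made up.

```python
# Hypothetical data with minimum 12 and maximum 99
values = [12, 25, 37, 48, 64, 99]

value_range = max(values) - min(values)
print(value_range)  # 87
```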

### <span style="color:#f2aaaa"> Percentile </span>
A percentile is a way to represent the position of a value in a data set. To calculate percentiles, the values in the data set must be sorted in ascending order (smallest to largest).

``` python
[1,2,3,4,5,6,7,8,9,10]
```

Say we ask the question "what is the value at the 80th percentile?" The value would be 8 because, out of the 10 numbers, 8 is the position where 80% of the data is less than or equal to the value. In general, if __k__ is the __nth__ percentile, roughly __n%__ of the values fall at or below __k__ (some definitions count only the values strictly below __k__, so results can differ slightly between tools).
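
The sketch below checks the example in two ways. The helper `percent_at_or_below` is a hypothetical function written for this illustration; `np.percentile` is NumPy's built-in, which by default interpolates linearly between neighbouring values and therefore returns a number between 8 and 9 here rather than exactly 8.

```python
import numpy as np

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

def percent_at_or_below(data, k):
    """Percentage of values in data that are less than or equal to k."""
    return 100 * sum(1 for x in data if x <= k) / len(data)

share_at_or_below_8 = percent_at_or_below(values, 8)

# NumPy's default method interpolates linearly between 8 and 9
numpy_p80 = np.percentile(values, 80)

print(share_at_or_below_8)  # 80.0
print(numpy_p80)
```

Different percentile conventions give slightly different answers on small data sets; the differences shrink as the data set grows.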


### <span style="color:#f2aaaa"> Quartiles </span>
In statistics and probability, quartiles are values that divide your data into quarters, provided the data is sorted in ascending order.

![quartiles](https://miro.medium.com/max/612/1*y8wvnRzOTkDTpDHBQG3yHA.gif)

There are three quartile values. The first quartile (Q1) is at the 25th percentile, the second quartile (Q2) is at the 50th percentile, and the third quartile (Q3) is at the 75th percentile. Q2 is the median of the whole data set, Q1 is the median of the lower half of the data, and Q3 is the median of the upper half of the data.
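
A minimal sketch of the "median of each half" approach is shown below. The function `quartiles_by_halves` is a hypothetical helper, and it assumes one common convention: when the number of values is odd, the overall median is left out of both halves. Library functions often interpolate instead, so their results can differ slightly.

```python
import statistics

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the median of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, skip the middle value so it belongs to neither half
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

q1, q2, q3 = quartiles_by_halves([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
print(q1, q2, q3)  # 3 6 9
```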

### <span style="color:#f2aaaa"> Interquartile range (IQR) </span>
The interquartile range is a measure of statistical dispersion, equal to the difference between the 75th and 25th percentiles, i.e. between the upper and lower quartiles: IQR = Q3 − Q1.
 
![IQR](https://i2.wp.com/makemeanalyst.com/wp-content/uploads/2017/05/IQR-1.png?resize=431%2C460)
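
As a quick sketch, the IQR can be computed from the 25th and 75th percentiles with NumPy. The data is an assumed example, and `np.percentile` uses its default linear interpolation.

```python
import numpy as np

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  # illustrative data

q25, q75 = np.percentile(data, [25, 75])
iqr = q75 - q25

print(q25, q75, iqr)
```

Because the IQR only looks at the middle half of the data, it is not affected by a few extreme values at either end.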

### <span style="color:#f2aaaa"> [Skewness](https://www.youtube.com/watch?v=XSSRrVMOqlQ) </span>
Skewness refers to distortion or asymmetry relative to a symmetrical bell curve, or normal distribution, in a set of data. If one tail of the curve is stretched out to the left or to the right, the distribution is said to be skewed.

When a distribution is skewed to the left, the tail on the curve’s left-hand side is longer than the tail on the right-hand side, and the mean is less than the mode. This situation is also called negative skewness.

When a distribution is skewed to the right, the tail on the curve’s right-hand side is longer than the tail on the left-hand side, and the mean is greater than the mode. This situation is also called positive skewness.

![Skewness](https://miro.medium.com/max/1200/1*kIjrjUM73-K8agpGRdQ33w.jpeg)

##### <span style="color:#f2aaaa"> Skewness coefficient </span>
The coefficient of skewness measures the skewness of a distribution. The general definition is based on the third moment of the distribution, but two simpler coefficients due to Pearson are commonly used.

There are two methods for finding the skewness coefficient:
* Pearson First Coefficient of Skewness (Mode skewness)
* Pearson Second Coefficient of Skewness (Median skewness)


__Mode Skewness__
![mode skew](https://miro.medium.com/max/272/1*tRwBmawQTvlrmOwTOVJFrQ.png)

__Median Skewness__
![median skew](https://miro.medium.com/max/299/1*y8s4DFbZ_rJbv0WJTTTrLw.png)

__Interpretation__
* The direction of skewness is given by the sign.
* The coefficient compares the sample distribution with a normal distribution. The larger the magnitude, the more the distribution differs from a normal distribution.
* A value of zero means no skewness at all.
* A large negative value means the distribution is negatively skewed.
* A large positive value means the distribution is positively skewed.
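
The two Pearson coefficients can be sketched as below, using their standard textbook definitions: (mean − mode) / standard deviation for mode skewness and 3 × (mean − median) / standard deviation for median skewness. It is an assumption that the formula images above show these same definitions. The data and the use of the population standard deviation are also choices made for this illustration.

```python
import statistics
import numpy as np

data = [1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8]  # illustrative data

mean = np.mean(data)
median = np.median(data)
mode = statistics.mode(data)   # most frequent value
std = np.std(data)             # population standard deviation (assumed choice)

mode_skewness = (mean - mode) / std
median_skewness = 3 * (mean - median) / std

print(mode_skewness, median_skewness)
```

Both coefficients come out positive for this data, which matches the mean sitting above the mode and the median.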


#### <span style="color:#f2aaaa"> Coefficients </span>
A coefficient is a number used to multiply a variable. For Example, 6x means 6 times x, where x is the variable so 6 is a coefficient. Variables with no number have a coefficient of 1. Example: x is really 1x. Sometimes a letter stands in for the number.

#### <span style="color:#f2aaaa"> Correlation Coefficients </span>
A correlation coefficient is a value that describes how well a change in one variable predicts a change in another variable. In positively correlated variables, as the value of one variable increases the other increases as well, and as one decreases the other decreases as well. Negatively correlated variables move in opposite directions: as one increases, the other decreases.

Correlation coefficients are expressed as values between -1 and +1. A coefficient of +1 means there is a perfect positive correlation and a value of -1 means that there is a perfect negative correlation. A value of 0 means that there is no linear correlation.

![Perfect correlations](https://diagrammm.com/img/diagrams/scatter-plot-correlations.svg)
![Correlations](https://diagrammm.com/img/diagrams/scatter-plot-correlations-high-low.svg)
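
The three cases can be reproduced with `np.corrcoef`, which returns a correlation matrix whose off-diagonal entry is the Pearson correlation coefficient. The arrays below are assumed example data.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y_positive = 2 * x + 1        # rises exactly in step with x
y_negative = -x               # falls exactly in step with x
y_symmetric = (x - 3) ** 2    # depends on x, but not linearly

r_positive = np.corrcoef(x, y_positive)[0, 1]
r_negative = np.corrcoef(x, y_negative)[0, 1]
r_symmetric = np.corrcoef(x, y_symmetric)[0, 1]

print(r_positive, r_negative, r_symmetric)
```

The last case shows that a coefficient near 0 rules out a linear relationship only; the two variables can still be related in some other way.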


## <ins style="color:#AAA1C8"> Inferential Statistics </ins>
With inferential statistics, you are trying to reach conclusions that extend beyond the immediate data alone. For instance, we use inferential statistics to try to infer from the sample data what the population might think. Or, we use inferential statistics to make judgments of the probability that an observed difference between groups is a dependable one or one that might have happened by chance in this study. Thus, we use inferential statistics to make inferences from our data to more general conditions; we use descriptive statistics simply to describe what’s going on in our data.

Inferential statistics helps us with the following:
* Making inferences about a population from a sample.
* Concluding whether a sample is significantly different from the population.
* Deciding whether adding or removing a feature from a model will help improve it.
* Deciding whether one model is significantly different from another.
* Hypothesis testing.

#### <span style="color:#AAA1C8"> Normal distribution </span>
The normal distribution is a probability distribution that describes how the values of a variable are distributed. It is a symmetric distribution where most of the observations cluster around the central peak and the probabilities for values further away from the mean taper off equally in both directions.

Properties of the normal distribution
* mean = median = mode.
* The curve is symmetric with half of the values on the left and half of the values on the right.
* The area under the curve is 1.

In a normal distribution:
* about 68% of the data falls within one standard deviation of the mean (μ ± σ, where μ is the arithmetic mean)
* about 95% of the data falls within two standard deviations of the mean (μ ± 2σ)
* about 99.7% of the data falls within three standard deviations of the mean (μ ± 3σ)

This is known as the 68-95-99.7 rule, or the empirical rule. It also holds approximately for any data distribution that is approximately normal.
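
These percentages can be checked directly. For a normal distribution, the probability of falling within k standard deviations of the mean is erf(k / √2), which the sketch below evaluates with Python's `math` module. The helper name `fraction_within` is made up for this example.

```python
import math

def fraction_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sd: {fraction_within(k):.4f}")
# within 1 sd: 0.6827
# within 2 sd: 0.9545
# within 3 sd: 0.9973
```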

#### <span style="color:#AAA1C8"> Z-Score </span>
Simply put, a z-score (also called a standard score) gives you an idea of how far from the mean a data point is. But more technically it’s a measure of how many standard deviations below or above the population mean a raw score is.

A z-score can be placed on a normal distribution curve. Almost all z-scores fall between -3 standard deviations (the far left of the normal distribution curve) and +3 standard deviations (the far right of the curve), although more extreme values are possible. In order to calculate a z-score, you need to know the population mean μ and the population standard deviation σ.

In general, z-scores allow us to compare things that are not on the same scale, as long as they're normally distributed.

__For example__:

Suppose we want to compare the score of a student who took the ACT with that of a student who took the SAT to see which student did better. ACT scores range from 1-36 and SAT scores range from 400-1600, so if student 1 got a 25 on the ACT and student 2 got a 1200 on the SAT, it's difficult to know which student did better relative to the other, since the scores are on different scales. The z-score allows us to see how many standard deviations above or below the mean each raw score is.

The z-score is calculated by dividing the difference between the raw value and the mean by the standard deviation:

$$z=\dfrac {x_{i}-\mu }{\sigma }$$

Where:
* ${z}$ is the z-score
* ${x_i}$ is the raw value
* ${\mu}$ is the mean
* ${\sigma }$ is the standard deviation

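So let's calculate the z-score for each student, using the following means and standard deviations:
* ACT mean: 21, standard deviation: 4.8
* SAT mean: 1000, standard deviation: 200

``` python
# Student 1: ACT score of 25
student1_z = (25 - 21) / 4.8        # ≈ 0.83

# Student 2: SAT score of 1200
student2_z = (1200 - 1000) / 200    # = 1.0
```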
Visually this looks like

![](https://github.com/Gabe-flomo/DS-Notebooks/blob/master/Notes/images/sat%20vs%20act.PNG?raw=true)

We can see that the scores are pretty similar, which wouldn't be as obvious if we were looking at just the raw scores.

Now we can look up the corresponding z-scores in a z-table which will return the percentile that the value is in. A percentile is a value __k__, where __n%__ of the data is below it, therefore it's in the __nth__ percentile.

__Positive score__
![](http://www.z-table.com/uploads/2/1/7/9/21795380/8573955.png?759)

__Negative score__
![](http://www.z-table.com/uploads/2/1/7/9/21795380/9340559_orig.png)

According to this table: 
* Student 1 is in the 79th percentile
* Student 2 is in the 84th percentile
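
Instead of a printed z-table, the same lookup can be sketched with `statistics.NormalDist` from the Python standard library (Python 3.8+), whose `cdf` method gives the share of a standard normal distribution that lies below a given z-score. The means and standard deviations are the ones used in the calculation above.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

student1_z = (25 - 21) / 4.8       # ACT student
student2_z = (1200 - 1000) / 200   # SAT student

# Share of the distribution below each z-score
student1_share = standard_normal.cdf(student1_z)
student2_share = standard_normal.cdf(student2_z)

print(f"student 1: {student1_share:.2%}")
print(f"student 2: {student2_share:.2%}")
```

The results may differ slightly from a table lookup, because the table rounds the z-score to two decimal places first.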