# STATISTICAL ANALYSIS

## Data & Information

The terms 'data' and 'information' are often used interchangeably, but they are not the same. They differ subtly in what they are and in what they are used for.


<b>Data:</b>  
Data is defined as individual facts; it is the raw form of knowledge. Information is the organization and interpretation of those facts.  
Data can be in various forms:
- Text
- Observations
- Figures
- Numbers
- Graphs, etc.  
eg: Individual prices, weights, addresses, ages, names, temperatures, dates or distances.

<u>Types of Data:</u>  
- Quantitative Data: Dealing with numerical form.  
eg: Salary, Weight, etc

- Qualitative Data: Dealing with non-numerical data.  
eg: Name, Gender, etc.


<b>Information:</b>    
Information is defined as knowledge gained through study, communication, research or instruction.  
It is the result of analyzing and interpreting pieces of data.  
eg:  
- Temperature readings in a location -> Data
- Determine seasonal temperature patterns -> Information


<b>Difference between data & information:</b>    
- Data is a collection of facts, while information puts those facts into context.  
- While data is raw and unorganised, information is organised.  
- Data points are individual and sometimes unrelated. Information maps out that data to provide a big-picture view of how it all fits together.  
- Data, on its own, is meaningless. When it's analyzed and interpreted, it becomes meaningful information.  
- Data does not depend on information; however, information depends on data.  
- Data typically comes in the form of graphs, numbers, figures or statistics. Information is typically presented through words, languages, thoughts and ideas.  
- Data alone isn't sufficient for decision-making, but you can make decisions based on information.


<b>Examples:</b>  
- At a restaurant, a single customer's bill amount is data. However, when the restaurant owners collect and interpret multiple bills over a range of time, they can produce valuable information, such as what menu items are most popular and whether the prices are sufficient to cover supplies, overhead and wages.  
- A customer's response to an individual customer service survey is a point of data. But when you compile that customer's responses over time - and, on a grander scale, multiple customers' responses over time - you can develop insights around areas for improvement within your customer service team.

Data and Information are both critical elements in business decision-making. By understanding how these components work together, you can move your business toward a more data- and insight-driven culture.

## Statistics

Statistics is the practice or science of collecting and analyzing numerical data in large quantities, especially for the purpose of inferring proportions in a whole from those in a representative sample.

<b>Statistical Analysis:</b>  
It is the science of collecting, exploring and presenting large amounts of data to discover underlying patterns and trends. We take raw data and convert it into information.  
Steps in Statistical Analysis are:
- Analyze: Analyze a product's sales data in relation to user demographics.
- Apply: Apply Statistical Learning on the data.
- Predict: Predict the sales of a new product when it's launched in a specific region.

<b>Importance of Statistical Analysis:</b>
- Helps us to collect the data using proper methods and employ the correct analysis.
- Helps us to conduct research and present the result effectively.
- Find the structures in data and make the relevant predictions.
- Apply statistical methods to build machine learning models.

<b>Types of Statistical Analysis:</b>  
- Descriptive Statistics: Helps us to generate summary of the data.
- Inferential Statistics: Helps us to make inferences and predictions about a population based on a sample.

## Types of Statistical Analysis - Descriptive Statistics

Descriptive statistics are brief descriptive coefficients that summarize a given data set.
- Measures of Central Tendency: focus on the average or middle values of data sets.
- Measures of Variability: focus on the dispersion of data.

<b>i. Measures of Central Tendency:</b>  
Measures of central tendency describe the center position of a distribution for a data set.  
A person analyzes the frequency of each data point in the distribution and describes it using the mean, median or mode, which measures the most common patterns of the analyzed data set.  
eg: The sum of the following data set is 60: (5, 10, 10, 15, 20).
- The mean is the sum divided by the number of values i.e. 12; (60/5).
- The mode is the value appearing most often i.e. 10.
- The median is the value positioned at the middle of the sorted data set i.e. 10; the (5+1)/2 = 3rd value.
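
The example above can be checked with Python's standard `statistics` module. This is only a minimal sketch; the variable names are illustrative.

```python
import statistics

data = [5, 10, 10, 15, 20]

mean_value = statistics.mean(data)      # sum of the values / number of values = 60 / 5
median_value = statistics.median(data)  # middle value of the sorted data
mode_value = statistics.mode(data)      # most frequently occurring value

print(mean_value, median_value, mode_value)
```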


<b>ii. Measures of Variability:</b>  
Measures of variability (or the measures of spread) aid in analyzing how dispersed the distribution is for a set of data.  
It includes standard deviation, variance, minimum and maximum values, kurtosis and skewness.  
The measures of central tendency may give the average of a data set, but they don't describe how the data is distributed within the set.  
eg: The average of the data may be 65 out of 100, yet there can still be data points at both 1 and 100. Measures of variability help communicate this by describing the shape and spread of the data set.  
Consider the following data set: 5, 19, 24, 62, 91, 100. The range of the data set is 95, which is calculated by subtracting the lowest number (5) in the data set from the highest (100).


<b>iii. Measures of Shape:</b>  
Measures of shape describe the distribution (or pattern) of the data within a dataset. The distribution shape of quantitative data can be described because the values have a logical order, with identifiable 'low' and 'high' end values on the x-axis of the histogram. The distribution shape of qualitative data cannot be described in this way, as the data are not numeric.  
The histogram can give us a general idea of the shape, but these two numerical measures of shape give a more precise evaluation:
- Skewness: Tells us the amount and direction of skew (departure from horizontal symmetry). [width]
- Kurtosis: Tells us how tall and sharp the central peak is, relative to a standard bell curve. [height]


<b>iv. Measures of Position:</b>  
Measures of position determine the position of a single value in relation to other values in a sample or a population data set. Unlike the mean and the standard deviation, descriptive measures based on quantiles are not sensitive to the influence of a few extreme observations. For this reason, descriptive measures based on quantiles are often preferred over those based on the mean and standard deviation.

<br>

<b>Importance of Descriptive Statistics:</b>  
Descriptive statistics are used to describe or summarize the characteristics of a sample or data set, such as a variable's mean, standard deviation or frequency.  
Knowing the sample mean, variance and distribution of a variable can help us understand the world around us.

<br>

<b>Case Study:</b>  
The government collects information about characteristics of the population like age. By organizing this information, they can describe what the population looks like, what the mean age is, which age group has the most people, how much the ages vary, and so on.  
Descriptive analysis helps us to describe the data and the patterns in the collected data (Population). On its own, it does not let us draw conclusions beyond the data that was collected.

Descriptive statistics help us to understand data attributes, whereas inferential statistical techniques - a separate branch of statistics - are required to understand how variables interact with one another in a data set.  

Descriptive Statistics is performed on the whole dataset.

## Types of Statistical Analysis - Inferential Statistics

 <b>i. Population: </b>  
Population is the collection of individuals or events whose properties are to be analyzed and relationships to be identified.  

<b>ii. Sample: </b>  
Sample is a subset of the population. A sample is selected using one of various sampling methods. A correctly chosen sample will carry most of the information about the population.

<b>Inferential Statistics:</b>  
- Infer means to draw a conclusion.
- Inferential Statistics is the branch of statistics that processes sample data in order to make decisions or draw conclusions about the population.
- It is widely used because a well-chosen sample can give reasonably accurate estimates at a relatively affordable cost.

<b>Steps in Inferential Analysis:</b>  
- Determine Population
- Sampling
- Data Analysis
- Decision making for the entire population.

<b>Advantages of Inferential Statistics:</b>  
- A precise tool for estimation of population.
- Highly structured analytical method.


<b>Uses of Inferential Statistics:</b>  
i. <u>Regression Analysis:</u>  
- Regression analysis is used to model the relationship between independent (feature) and dependent (target) variables, and to predict the target from the features.
- eg: For factors influencing the decline in poverty (target), we use variables (features) such as road length, economic growth, electrification ratio, number of teachers, number of medical personnel, etc.

ii. <u>Hypothesis Test:</u>  
- Hypothesis testing helps us to assess whether the sample data support or contradict an opinion or claim we believe to be true.
- eg: Women are more addicted to Instagram than men.

iii. <u>Confidence Intervals:</u> 
- Estimate the population by using samples.
- eg: Estimate the average expenditure for the entire city.

iv. <u>Time Series Analysis:</u>
- Predict future events on the basis of pre-existing data.
- eg: Estimating the economic growth in the future.
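
As an illustration of the confidence interval idea, the sketch below estimates a 95% interval for a population mean from a sample, using the normal approximation. The function name `mean_confidence_interval` and the expenditure figures are hypothetical; for small samples a t-distribution is usually preferred over the normal quantile used here.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    # z is the standard normal quantile for the two-sided confidence level.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return sample_mean - z * standard_error, sample_mean + z * standard_error

# Hypothetical monthly expenditures (arbitrary units) from sampled households.
expenditures = [210, 185, 240, 198, 225, 260, 175, 230, 205, 215]
low, high = mean_confidence_interval(expenditures)
print(f"95% CI for the mean: ({low:.1f}, {high:.1f})")
```

A wider interval signals more uncertainty; collecting a larger sample shrinks the standard error and narrows the interval.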


<b>Case Study:</b>  
Suppose a regional head claims that the poverty rate in his area is very low. To test this claim, he conducted a household income and expenditure survey that was able to produce poverty estimates. Considering the survey period and budget, 10,000 households were sampled from a total of 1,000,000 households in the district. Based on the survey result, it was found that there were still 5,000 poor people. Of course, this number is only an estimate, since a survey always has errors. Confidence intervals were constructed to express the uncertainty around the survey result.  

Inferential Statistics is performed on the sample of the dataset.

## Relationship between Statistics and Machine Learning

<b>i. Statistical Learning:</b>
- Involves building of statistical models based on explicitly programmed instructions.
- Prior knowledge in the data is required.

<b>ii. Machine Learning:</b>
- Involves building the systems which can learn from data.
- No prior knowledge is required about the relationships in data.

<b> Applications of Statistical Analysis in Machine Learning & Data Science:</b>
- Overcome Uncertainty:  
    - Statistics is used in reducing uncertainty by conducting hypothesis testing.

- Data Preparation:  
    - Statistics helps in data understanding, selection, data manipulation as well as data cleaning activity.

- Modelling:  
    - Apply right model on the basis of statistical relationship in data.

- Pattern Recognition:  
    - Used in recognition of patterns in the given data.

- Knowledge Discovery:  
    - Improve the model accuracy by applying statistical analysis on data.


<b>Examples of Statistical Analysis applied in Machine Learning:</b>
- Problem Framing: Exploratory Data Analysis & Data Mining.
- Data Understanding: Summary Statistics, Data Visualization.
- Data Cleaning: Outlier Detection, Imputation.
- Data Selection: Data Sample, Feature Sample.
- Data Preparation: Scaling, Encoding, Transforms.
- Model Evaluation: Resampling Methods.
- Model Configuration, Selection & Presentation: Hypothesis test, Estimation & CI.
- Model Prediction: Inferential Statistics, Estimation Statistics.

## Understanding the Types of Data

<b>i. Qualitative Data:</b>
   - This takes on values in a set of categories.
   - Can be used as labels.
   - Mathematical Computation cannot be performed on these values.
   - The data is categorized based on the level of measurement.  
   - eg: Country, Gender, etc.
   - Levels of Measurement are:  
       - Nominal:  
            - Collection of names.  
            - eg: Favourite Color, Gender. 
       - Ordinal:  
            - Non-numerical data with a set order or scale to it.
            - eg: Rating of a Restaurant, Likert scale.
   
<b>ii. Quantitative Data:</b>  
   - This takes on numerical values.
   - The values are something which can be measured/counted.
   - Mathematical computations are possible on this data type.
   - Levels of Measurement are:  
       - Discrete:
           - Data that only takes certain values.
           - eg: Number of children.
       - Continuous:
           - Data which has values that are not fixed and have an infinite number of possible values.
           - eg: Salary.

<br>

<b>Variables in Statistics:</b>    
- Variable represents an unknown value or a value which could vary.  

- Dependent Variables:
    - The variable that depends on other variables or factors.
    - Dependent variable is the effect in research study.
    - The outcome of an experiment, outcome column in a dataset, typically denoted by 'y'.
    - eg: Whether the person is affected by COVID19 or not.  
    
- Independent Variables:
    - The variable that does not depend on other variables or factors.
    - Independent variable is the cause in research study.
    - A variable typically denoted by 'x', input columns in a dataset.
    - eg: RBC count, O2 level, Weight, etc.  
    
- Categorical Variable:
    - Variable holding qualitative data in the dataset.  
    
- Continuous Variable:
    - Variable holding quantitative data.  

## Statistical Methods - Sampling Techniques

- Sampling is a technique of selecting individual members or a subset of the population to make statistical inferences from them and estimate characteristics of the whole population.

- Types:
    - Probability Sampling:
        - Probability sampling is a sampling technique where a researcher sets a selection of a few criteria and chooses members of a population randomly. All the members have an equal opportunity to be a part of the sample with this selection parameter.
        
    - Non-probability Sampling:
        - In non-probability sampling, the researcher chooses members based on convenience or judgement rather than at random. This sampling method does not follow a fixed or predefined selection process, so not all elements of the population have an equal opportunity to be included in a sample.



<b>Sampling Framework:</b>  
- Sample Goal:
    - Population property that we wish to estimate using the sample.

- Population:
    - Scope or domain from which the observations could theoretically be made.
    
- Selection Criteria:
    - Methodology used to accept or reject the observations in our sample.
    
- Sample Size:
    - The number of observations included in the study (or sample).


<b>Steps Involved in Sampling Framework:</b>  
1. Identify & define target population.
2. Select sampling frame: List of items or people which are forming a population from which the sample is taken.
3. Choose sampling methods.
4. Determine sample size.
5. Collect the required data.


<br>

<b>Types of Probability Sampling:</b>  
- <u>Simple random sampling:</u>
    - Every individual is chosen entirely by chance and each member of the population has an equal chance of being selected.
    - Simple random sampling reduces selection bias.
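    - With a sufficiently large sample, it also reduces the chance of sampling error.
    - Sampling error is generally low and straightforward to estimate with this method, although it is not guaranteed to be the lowest of all the sampling methods.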
    <img src='assets/Simple-random-sampling.png' width="500">
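
A minimal sketch of simple random sampling with Python's standard `random` module. The population of member IDs and the seed are made up for illustration.

```python
import random

population = list(range(1, 101))  # hypothetical member IDs 1..100

rng = random.Random(42)           # fixed seed so the draw is repeatable
# Draw 10 distinct members; every member is equally likely to be chosen.
simple_sample = rng.sample(population, k=10)
print(sorted(simple_sample))
```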

<br>

- <u>Cluster sampling:</u>
    - We use the subgroups of the population as the sampling unit rather than individuals.
    - The population is divided into subgroups, known as clusters and a whole cluster is randomly selected to be included in the study.
    - This type of sampling is used when we focus on a specific region or area.
    <img src='assets/Cluster-sampling.png' width="500">
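
Cluster sampling can be sketched in the same way: whole clusters are drawn at random and every member of a chosen cluster enters the sample. The cluster names and members below are hypothetical.

```python
import random

# Hypothetical population grouped into clusters (for example, by region).
clusters = {
    "north": ["n1", "n2", "n3"],
    "south": ["s1", "s2"],
    "east": ["e1", "e2", "e3", "e4"],
    "west": ["w1", "w2", "w3"],
}

rng = random.Random(0)
# The sampling unit is the cluster, not the individual.
chosen_clusters = rng.sample(sorted(clusters), k=2)
# Every member of a chosen cluster is included in the sample.
cluster_sample = [member for name in chosen_clusters for member in clusters[name]]
print(chosen_clusters, cluster_sample)
```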
    
<br>

- <u>Systematic sampling:</u>
    - Systematic sampling is a statistical method that researchers use to zero in on the desired population they want to research.
    - Researchers calculate the sampling interval by dividing the entire population size by the desired sample size.
    - Systematic sampling is an extended implementation of probability sampling in which members of the group are selected at regular intervals to form a sample.
    <img src='assets/Systematic-sampling.png' width='500' height='500'>
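
The sampling interval described above (population size divided by sample size) can be sketched as follows. The helper `systematic_sample` is a hypothetical name, and the sketch assumes the sample size does not exceed the population size.

```python
import random

def systematic_sample(population, sample_size, rng=random):
    """Pick every k-th member after a random start, with k = N // n."""
    interval = len(population) // sample_size   # sampling interval k
    start = rng.randrange(interval)             # random start within the first interval
    return population[start::interval][:sample_size]

population = list(range(1, 101))  # hypothetical member IDs 1..100
systematic = systematic_sample(population, sample_size=10, rng=random.Random(1))
print(systematic)
```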
    
<br>

- <u>Stratified random sampling:</u>  
    - Divide the population into subgroups (called strata) based on different traits like gender, age, etc.
    - Select the sample(s) from each subgroup.
    - This type of sampling is used when we want representation from all the subgroups of the population.
    <img src='assets/Stratified-random-sampling.png' width='300' height='300'>
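
A minimal sketch of stratified random sampling, assuming the same fraction is drawn from every stratum (proportional allocation). The function name, the `(id, age_group)` records and the 10% fraction are illustrative choices, not part of the method itself.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, rng=random):
    """Draw the same fraction of members from every stratum."""
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)      # group records by stratum
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one member per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population of (id, age_group) pairs: 40 in "18-30", 20 in "31-50".
people = [(i, "18-30" if i % 3 else "31-50") for i in range(60)]
stratified = stratified_sample(people, key=lambda p: p[1], fraction=0.1,
                               rng=random.Random(7))
print(stratified)
```

Because each stratum is sampled separately, both age groups are guaranteed to appear in the result.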


<br><br>

<b>Types of Non-Probability Sampling:</b>
- <u>Convenience Sampling:</u>
    - Relies on data collection from population members who are conveniently available to participate in the study.
    - Easy to carry out at a relatively low cost in a timely manner.
    - Social media polls are an example.
    <img src='assets/Convenience-sampling.png' width='500' height='500'>
    
<br>

- <u>Judgemental or Purposive or Selective Sampling:</u>  
    - Based on the assessment of experts in the field when choosing who to ask to be included in the sample.
    - Let's say you are selecting from a group of men aged 30-35 and the experts decide that only the men who have a college degree will be best suited to be included in the sample. This would be judgemental sampling.
    <img src='assets/Judgmental-sampling.png' width='500' height='500'>
    
<br>

- <u>Snowball Sampling:</u>
    - Existing people are asked to nominate further people known to them so that the sample increases in size like a rolling snowball.
    - Effective when a sampling frame is difficult to identify.
    <img src='assets/Snowball-sampling.png' width='500' height='500'>
    
<br>

- <u>Quota Sampling:</u>
    - Choosing the items based on predetermined characteristics of the population.
    - Researchers form a sample of individuals who are representative of a larger population.
    - eg: All men in a company, voters in the age range of 18-22 yrs in a region, etc.
    <img src='assets/Quota-Sampling.png' width='500' height='500'>


<br>

<b>Understanding the Data Sampling Errors:</b>
- Sampling requires us to draw statistical conclusions about the population from a limited series of observations, which introduces error.
- There are two types of errors:
    - Selection Bias:
        - Occurs when participants in the sample are 'not equally balanced' or not objectively represented.
    - Sampling Error:
        - Occurs because the sample does not include 'all members' of the population, so sample estimates differ from the true population values.

## Descriptive Statistics - Measures of Central Tendency

- A measure of central tendency is a single value that represents the center point of a dataset. This value can also be referred to as "the central location" of a dataset.
- Three common measures of central tendency are:
    - Mean
    - Median
    - Mode
    
<b>Use Case of Measures of Central Tendency:</b>
- A young couple is trying to decide where to buy their first home in a new city and the most they can spend is \\$140k. Some neighbourhoods in the city have expensive houses, some have cheap houses and others have medium-priced houses. They want to easily narrow down their search to specific neighbourhoods that are within their budget.
- Neighbourhood A home prices: \\$140k, \\$190k, \\$265k, \\$115k, \\$270k, \\$240k, \\$250k, \\$180k, \\$160k, \\$200k, \\$240k, \\$280k, ... 
- Neighbourhood B home prices: \\$140k, \\$290k, \\$155k, \\$165k, \\$280k, \\$220k, \\$155k, \\$185k, \\$160k, \\$200k, \\$190k, \\$140k, \\$145k, ...
- Neighbourhood C home prices: \\$140k, \\$130k, \\$165k, \\$115k, \\$170k, \\$100k, \\$150k, \\$180k, \\$190k, \\$120k, \\$110k, \\$130k, \\$120k, ...

Average:
- Average Neighbourhood A home price: \\$200k.  
- Average Neighbourhood B home price: \\$190k.
- Average Neighbourhood C home price: \\$140k.  
Therefore, the average house price in Neighbourhood C is within their budget.

<br>

<b>i. Mean:</b>
- The mean represents the average value of the dataset. It can be calculated as the sum of the values in the dataset divided by the number of values.
- Mean (x̄) = $\frac{Σx}{N}$

<b>ii. Median:</b>
- The median is the middle value in a dataset.
- You can find the median by arranging all the individual values in a dataset from smallest to largest and finding the middle value.
- If there are an odd number of values, the median is the middle value.
    - Median (Md) = $\frac{N+1}{2}$ ᵗʰ  item
- If there are an even number of values, the median is the average of two middle values.
    - Median (Md) = ( $\frac{N}{2}$ ᵗʰ item + $(\frac{N}{2}+1)$ ᵗʰ item ) / 2
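
The positional rules for the median can be sketched in code as below. `median_by_position` is a hypothetical helper written for illustration; in practice `statistics.median` does the same job.

```python
def median_by_position(values):
    """Median using the item-position rules for odd and even N."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd N: the ((N + 1) / 2)-th item; subtract 1 for 0-based indexing.
        return ordered[(n + 1) // 2 - 1]
    # Even N: average of the (N / 2)-th and (N / 2 + 1)-th items.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

odd_median = median_by_position([5, 10, 10, 15, 20])
even_median = median_by_position([42, 63, 64, 64, 70, 73, 76, 77, 81, 81])
print(odd_median, even_median)
```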
    
<b>iii. Mode:</b>
- The mode is the value that occurs most often in a dataset.
- A dataset can have no mode (if no value repeats), one mode or multiple modes.

<br>

<b>When to use Mean?</b>
- When the distribution of the data is symmetrical and there are no outliers.
- Suppose we have the following distribution that shows the salaries of individuals in a certain town.
- Distribution is Symmetrical.
<img src="assets/Salary-Distribution-Mean.png" width="400" height="400">

<br>

<b>When to use Median?</b>
- When the distribution is skewed (distorted).
- Suppose we have the following distribution that shows the salaries of individuals in a certain town.
- Median can capture the typical salary.
<img src="assets/Salary-Distribution-Median.png" width="400" height="400">

<br>

<b>When to use Mode?</b>
- When working with Categorical data.
- Finding which category occurs most frequently.
<img src="assets/Survival-Distribution-Mode.png" width="500" height="500">

## Descriptive Statistics - Measures of Dispersion

- Measures of dispersion (spread) help to describe the variability in the data. Dispersion is a statistical term that can be used to describe the extent to which the data is scattered.
- Hence, measures of dispersion are certain types of measures that are used to quantify the dispersion of data.
- We measure spread using:
    - Range
    - Interquartile Range
    - Variance
    - Standard Deviation

<b>Use Case of Measures of Dispersion:</b>
- A young couple is trying to decide where to buy their first home in a new city and the most they can spend is \\$140k. Some neighbourhoods in the city have expensive houses, some have cheap houses and others have medium-priced houses. They want to easily narrow down their search to specific neighbourhoods that are within their budget.
- Neighbourhood A home prices: \\$140k, \\$190k, \\$265k, \\$115k, \\$270k, \\$240k, \\$250k, \\$180k, \\$160k, \\$200k, \\$240k, \\$280k, ... 
- Neighbourhood B home prices: \\$140k, \\$290k, \\$155k, \\$165k, \\$280k, \\$220k, \\$155k, \\$185k, \\$160k, \\$200k, \\$190k, \\$140k, \\$145k, ...
- Neighbourhood C home prices: \\$140k, \\$130k, \\$165k, \\$115k, \\$170k, \\$100k, \\$150k, \\$180k, \\$190k, \\$120k, \\$110k, \\$130k, \\$120k, ...

Average:
- Average Neighbourhood A home price: \\$200k.  
- Average Neighbourhood B home price: \\$190k.
- Average Neighbourhood C home price: \\$140k.  
The averages alone are not complete information: the range (spread) of prices within each neighbourhood is missing.

<br>

<b>i. Range:</b>
- The range is the difference between the largest and the smallest value in a dataset.
- If the largest value is 98, and the smallest value is 58, the range is 98 - 58 = 40.

<br>

<b>ii. Interquartile Range:</b>
- The interquartile range is the difference between the first quartile and the third quartile in a dataset.
- Three quartiles divide a dataset into four equal parts: Q<sub>1</sub>, Q<sub>2</sub> & Q<sub>3</sub>.
- IQR is useful to determine whether a given data point (value) is an outlier or not.
- If any of these conditions are true, then we can classify the value as an outlier.
    - value > Q<sub>3</sub> + 1.5*IQR
    - value < Q<sub>1</sub> - 1.5*IQR

Steps to calculate IQR:
- Consider the price for a cup of coffee in a city.
- Q<sub>2</sub> = Median of the dataset
- Q<sub>1</sub> = Lower Quartile
- Q<sub>3</sub> = Upper Quartile  
- IQR = Q<sub>3</sub> - Q<sub>1</sub> = 87 - 52 = 35
<img src="assets/InterQuartile-Range.png" width="500" height="500">

<br>

Activity: Find IQR  
Find the value of IQR for the following data:  
42, 63, 64, 64, 70, 73, 76, 77, 81, 81  
solution:  
    here, N = 10 (even)  
    - Q<sub>2</sub> = Md =  $\frac{70 + 73}{2}$ = 71.5  
    - Q<sub>1</sub> = Md of the lower half [42, 63, 64, 64, 70] = 64  
    - Q<sub>3</sub> = Md of the upper half [73, 76, 77, 81, 81] = 77
    - IQR = Q<sub>3</sub> - Q<sub>1</sub> = 77 - 64 = 13
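
The activity can be reproduced with a short sketch that takes the median of the lower and upper halves of the sorted data. `quartiles_by_halves` is a hypothetical helper; library routines such as `statistics.quantiles` use interpolation conventions that may give slightly different quartiles.

```python
import statistics

def quartiles_by_halves(values):
    """Q1, Q2, Q3 using the median-of-halves method from the activity."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]                   # values below the median position
    upper = ordered[len(ordered) - half:]    # values above the median position
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

data = [42, 63, 64, 64, 70, 73, 76, 77, 81, 81]
q1, q2, q3 = quartiles_by_halves(data)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```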

<b>IQR vs Range:</b>
- The interquartile range is more resistant to outliers than the range, which can make it a better metric to measure "spread".
- Consider the following dataset:
- Range: \\$47,000
- IQR: \\$34,000
<img src="assets/IQR-vs-Range.png" width="200" height="300">

<br>

<b>iii. Variance:</b>
- A common way to measure how spread-out the data values are.
- Simply put, a variance is a measure of how far a set of data (numbers) are spread out from their mean (average) value.
- The more the value of variance, the more the data is scattered from its mean and if the value of variance is low or minimum, then it is less scattered from its mean.
- Formulae to calculate variance (Var):
    - Population Variance: σ² = $\frac{ Σ(x - μ)² }{ N }$

    - Sample Variance: s² = $\frac{ Σ(x - x̄)² }{ n - 1 }$  
    where,  
        N = total number of elements (subjects) in a population  
        n = total number of elements (subjects) in a sample  
        x = specific data point  
        μ = population mean  
        x̄ = sample mean
        
Steps to calculate Variance:
- Find the mean of the given data set. Calculate the average of a given set of values.
- Subtract the mean from each value and square them.
- Find the average of these squared values, that will result in variance.

<br>

Activity: Find Variance  
Find variance for the following data:  
Weights(in gm) : 610, 450, 160, 420, 310.  
solution:  
here, N = 5  
- Mean(x̄) = (610 + 450 + 160 + 420 + 310) / 5 = 390.  
- Compute the difference of each value from the mean, square it and find the average of the squares.    
    i.e. (220² + 60² + (-230)² + 30² + (-80)²) / 5 = 22,440.  
- Variance = 22,440.  
    
But, Range = 610 - 160 = 450. Here, Variance >> Range, because variance is expressed in squared units (gm²). So, we take the square root of the variance, which gives the 'Standard Deviation', a measure of spread in the same units as the data.  
i.e. Standard Deviation (S.D) = $\sqrt{Var}$ = $\sqrt{22,440}$ ≈ 149.8 gm
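
The steps of the activity can be followed in code. This is a minimal sketch using the weights given above; it also shows the sample variance (dividing by n - 1) for comparison with the population formula.

```python
import math

weights = [610, 450, 160, 420, 310]  # weights in gm, from the activity

# Step 1: mean of the data.
mean = sum(weights) / len(weights)
# Step 2: squared deviation of each value from the mean.
squared_deviations = [(w - mean) ** 2 for w in weights]
# Step 3: average the squared deviations (population variance, divide by N).
population_variance = sum(squared_deviations) / len(weights)
# Dividing by n - 1 instead gives the sample variance.
sample_variance = sum(squared_deviations) / (len(weights) - 1)
# The standard deviation is the square root of the variance.
population_std = math.sqrt(population_variance)

print(mean, population_variance, sample_variance, round(population_std, 1))
```

The standard library offers the same calculations as `statistics.pvariance`, `statistics.variance` and `statistics.pstdev`.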

<br>

<b>Key Things to Remember about Variance:</b>
- In statistics, the variance is used to understand how far the values in a dataset differ from their mean and from each other.
- Variance treats all deviations from the mean the same regardless of their direction, because the deviations are squared.
- One of the disadvantages of variance is that it gives extra weight to extreme values,  
    i.e. the numbers that are far from the mean (outliers).  
    When these numbers are squared, there is a chance that they may dominate and skew the result.
- Another disadvantage of variance is that it can sometimes involve complex calculations.

<br>

<b>iv. Standard Deviation:</b>
- Square root of Variance.  
    i.e. Standard Deviation (S.D) = $\sqrt{Var}$ 
- Measures the spread in the same units as the data.
- Most common way to measure the spread of the data.
<img src="assets/Standard-Deviation-Formula.png" width="500" height="500">

<br>

Steps to calculate Standard Deviation:
- Compute the mean for the given dataset.
- Subtract the mean from each observation and calculate the square in each instance.
- Find the mean of those squared deviations.
- Finally, take the square root of the result to obtain the Standard Deviation.

## Descriptive Statistics - Measures of Shape

- Measures of shape describe the distribution (or pattern) of the data within a dataset.
- Helps us to understand the patterns that may be concealed and can be understood once the data is plotted on the graph.
- Only Quantitative data can be analyzed using the measures of shape, not the Qualitative data.
- Two common measures of shape are:
    - Skewness: Tells us the amount and direction of skew (departure from horizontal symmetry). [width]
    - Kurtosis: Tells us how tall and sharp the central peak is, relative to a standard bell curve. [height]


<b>Distribution of Data:</b>
- The shape of the data can be understood by considering how the data points are distributed in the space.
- Distribution can be categorized as:
    - Symmetrical Distribution (eg: Normal Distribution)
    - Asymmetrical Distribution (eg: Skewed Distribution)


<b>1) Symmetric Distribution:</b>
- A distribution can be called symmetrical when the two sides of the distribution are mirror images of each other.
- Most symmetric distributions are "unimodal" i.e. they have a single peak, while others are "bimodal".
- The mean and median are always the same in a symmetric distribution. For a Normal distribution, the mean, median and mode are the same. 
- eg: U-shaped distribution, Rectangular distribution, Normal distribution, etc.


<b>i. Normal Distribution:</b>
- A normal distribution is a type of symmetric distribution where the mean, median and mode coincide with each other.
- Mean, Median and Mode coincide at the center. The distribution is highest in the middle.
- A bell-shaped curve can be observed when the data is plotted over a histogram.
- Normal Distribution is uni-modal as it has only one peak.
- Tails of the curve never touch the baseline.
- Normal distribution helps to compute the probability and plays a major role during inferential statistics.
<img src="assets/Normal-Distribution.png" width="500">


<b>When the characteristics are considered, we can draw conclusions:</b>
- Most of the data will be clustered around the center.
- Extreme values lie far from the center and occur rarely.

<b>Empirical Rule/Three Sigma Rule:</b>
- About 68% of values lie within one standard deviation of the mean.  
    i.e. Mean ± 1*S.D.
- About 95% of values lie within two standard deviations of the mean.  
    i.e. Mean ± 2*S.D.
- About 99.7% of values lie within three standard deviations of the mean.  
    i.e. Mean ± 3*S.D.

<img src="assets/Normal-Distribution-Division.png" width="500">
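
The three percentages can be checked numerically with `statistics.NormalDist` from the standard library. This sketch uses a standard normal distribution (mean 0, S.D. 1); the rule holds for any normal distribution.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Probability that a value lies within k standard deviations of the mean.
coverage_by_k = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, coverage in coverage_by_k.items():
    print(f"within {k} S.D.: {coverage:.1%}")
```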

<br>

<b>2) Skewed Distribution:</b>
- Skew is a characteristic often used to describe the distribution of values in Asymmetric Distributions.
- The distribution can be skewed either Positively or Negatively; this happens when the more frequent values are clustered around the low or high end of the x-axis and a long tail stretches towards the other end.  
    Right skew  : Positive skew (tail towards high values)  
    Left skew : Negative skew (tail towards low values)
- Skewed Distribution is observed when distribution deviates from the Normal Distribution.
<img src="assets/Skewed-Distribution.png" width="400">
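
One common way to put a number on skew is the moment-based coefficient: the average cubed deviation from the mean divided by the cube of the standard deviation. The sketch below assumes that definition (other definitions exist), and the datasets are made up for illustration.

```python
def skewness(values):
    """Moment-based skewness: mean cubed deviation divided by S.D. cubed."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n   # population variance
    third_moment = sum((x - mean) ** 3 for x in values) / n
    return third_moment / variance ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]    # long tail towards high values
left_skewed = [-x for x in right_skewed]    # mirror image: tail towards low values
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed), skewness(left_skewed), skewness(symmetric))
```

A positive result indicates a right (positive) skew, a negative result a left (negative) skew, and a value near zero a roughly symmetric dataset.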

<br>

<b>3) Kurtosis:</b>
- Kurtosis is a descriptor of shape and it describes the shape of the distribution in terms of height or flatness.
- Types of Kurtosis are:
    - Platykurtic (Negative Kurtosis)
    - Mesokurtic (Normal Distribution)
    - Leptokurtic (Positive Kurtosis)
<img src="assets/Kurtosis-Types.png" width="400">
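
Kurtosis is often reported as "excess kurtosis": the fourth standardized moment minus 3, so that a normal distribution scores 0. The sketch below assumes that convention; the two small datasets are hypothetical.

```python
def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n   # population variance
    fourth_moment = sum((x - mean) ** 4 for x in values) / n
    return fourth_moment / variance ** 2 - 3

flat = list(range(1, 11))        # evenly spread values: platykurtic (negative)
peaked = [0] * 8 + [-5, 5]       # sharp peak with two far values: leptokurtic (positive)

print(round(excess_kurtosis(flat), 3), round(excess_kurtosis(peaked), 3))
```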

<br>

<b>Bimodal & Multimodal Distributions:</b>
- A distribution which has no mode at all is called a "Uniform Distribution". 
- A distribution which has a single mode is called a "Unimodal Distribution".  
    A typical normal distribution is unimodal.
- A distribution which has two modes is called a "Bimodal Distribution".
- A distribution which has more than two modes is called a "Multimodal Distribution".
<img src="assets/Bimodal-&-Multimodal-Distribution.png" width="500">

## Descriptive Statistics - Measures of Position

- Measures of position give us a way to see where a certain data point or value falls in a sample or population.
- Used on quantitative data and on categorical data that is ordinal (i.e. has a natural order).
- Common Measures of Position are:
    - Box & Whisker Plot
    - Deciles
    - Five Number Summary
    - IQR
    - Outliers
    - Percentiles
    - Quartiles
    - Standard Scores
    
<b>i. Box & Whisker Plot:</b>
- Shows the spread and center of the data.
- It is a graphical representation of the Five Number Summary:
    - Minimum
    - First Quartile
    - Median
    - Third Quartile
    - Maximum
<img src="assets/Boxplot.png" width="400">

<br>

<b>ii. Deciles:</b>
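- Deciles are like quartiles, but they divide the data into ten parts instead of four.
- Nine deciles split the sorted data into ten equal parts:  
    the cut points are the 10ᵗʰ, 20ᵗʰ, 30ᵗʰ, 40ᵗʰ, 50ᵗʰ, 60ᵗʰ, 70ᵗʰ, 80ᵗʰ & 90ᵗʰ percentiles (the 100ᵗʰ percentile is simply the maximum).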
<img src="assets/Deciles.png" width="400">
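
A minimal sketch of computing deciles with Python's standard `statistics.quantiles`. The data (the integers 0 to 100) is made up, and `method="inclusive"` is one of several interpolation conventions, so other tools may return slightly different cut points.

```python
import statistics

data = list(range(101))  # hypothetical sample: the integers 0..100

# n=10 asks for the 9 cut points that split the data into 10 equal parts.
deciles = statistics.quantiles(data, n=10, method="inclusive")

print(len(deciles))  # 9
print(deciles)       # [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0]
```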

<br>

<b>iii. Five Number Summary:</b>
- A Five Number summary consists of five values:  
    the most extreme values in the data set (the maximum and minimum values), the lower and upper quartiles, and the median.
- A Boxplot is used to visualize the five number summary.
<img src="assets/Five-Number-Summary.png" width="400">
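
The five values can be computed with the standard library. This sketch assumes the "inclusive" quartile convention of `statistics.quantiles`; the function name and the sample are invented for illustration.

```python
import statistics

def five_number_summary(values):
    """Return minimum, Q1, median, Q3 and maximum of a sample."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }

print(five_number_summary([7, 1, 9, 3, 5, 2, 8, 4, 6]))
# {'min': 1, 'q1': 3.0, 'median': 5.0, 'q3': 7.0, 'max': 9}
```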

<br>

<b>iv. Interquartile Range:</b>
- IQR is a measure of statistical dispersion, i.e. the spread of the data.
- It is the difference between the third and first quartiles: IQR = Q3 – Q1.
- The IQR may also be called the midspread, middle 50%, fourth spread, or H‑spread.
- Tells us how widely the middle 50% of the dataset is spread, i.e. where the bulk of the central values lie.
<img src="assets/InterQuartile-Range-2.png" width="300">

<br>

<b>v. Outliers:</b>
- Unusual values that fall outside of an expected range of values.

Determine Outliers using IQR:
- Sort the data from low to high
- Identify the first quartile (Q1), the median, and the third quartile (Q3).
- Calculate IQR = Q3 – Q1
- Calculate upper fence = Q3 + (1.5 * IQR)
- Calculate lower fence = Q1 – (1.5 * IQR)
- Use the fences to flag outliers.  
Outliers are any values greater than the upper fence or less than the lower fence.
<img src="assets/Outliers.png" width="300">
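
The steps above can be sketched as follows. The quartile convention (`method="inclusive"`), the function name and the sample data are assumptions made for illustration; other quartile conventions can move the fences slightly.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR rule."""
    data = sorted(values)                       # sort from low to high
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1                               # IQR = Q3 - Q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return lower_fence, upper_fence, outliers

sample = [2, 4, 5, 6, 7, 8, 9, 10, 12, 50]
print(iqr_outliers(sample))  # (-1.5, 16.5, [50])
```

Here Q1 = 5.25 and Q3 = 9.75, so IQR = 4.5 and only the value 50 falls outside the fences.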

<br>

<b>vi. Percentiles:</b>
- A percentile is a value below which a given percentage of the scores fall.
- eg: the 90ᵗʰ percentile marks the cut-off point below which 90% of the values fall.
<img src="assets/Percentiles.png" width="400">
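
A small sketch of both directions: finding the 90ᵗʰ percentile cut-off of a sample, and finding what percentage of values fall below a given score. The helper `percentile_rank` and the scores 1 to 100 are hypothetical, and the cut-off depends on the interpolation convention (here `method="inclusive"`).

```python
import statistics

def percentile_rank(values, score):
    """Percentage of values that fall strictly below `score`."""
    below = sum(1 for v in values if v < score)
    return 100 * below / len(values)

scores = list(range(1, 101))  # hypothetical scores: 1..100

# n=100 returns the 99 cut points for the 1st..99th percentiles.
cut_points = statistics.quantiles(scores, n=100, method="inclusive")
p90 = cut_points[89]  # index 89 holds the 90th percentile

print(p90)                           # 90.1
print(percentile_rank(scores, p90))  # 90.0
```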

<br>

<b>vii. Quartiles:</b>
- Quartiles divide the sorted data into four equal parts using three cut points (Q1, the median Q2, and Q3): the lowest quarter, two middle quarters and the highest quarter.
<img src="assets/Quartiles.png" width="500">

<br>

<b>viii. Standard Scores:</b>
- Standard scores are also known as "z-scores".
- z-scores are a way to compare results from a test to a "normal" population.
- A z-score describes how many standard deviations an element is from the mean.
- For a particular element (value):
    - If z-score < 0, the element is less than the mean.
    - If z-score = 0, the element is equal to the mean.
    - If z-score > 0, the element is greater than the mean.
- The formula to calculate z-score is:  
    $z = \frac{x - \bar{x}}{S}$  
    where,  
        S = standard deviation of the sample  
        x = each value in the sample dataset  
        $\bar{x}$ = mean of the sample
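
The formula translates directly into code. This is a minimal sketch using the sample standard deviation (`statistics.stdev`, with an n - 1 denominator) to match S above; the sample values are made up.

```python
import statistics

def z_scores(sample):
    """Return the z-score (x - mean) / S for every value in the sample."""
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    return [(x - mean) / s for x in sample]

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5
for x, z in zip(sample, z_scores(sample)):
    print(f"x = {x}: z = {z:+.2f}")
```

Values below the mean of 5 get negative z-scores, the value 5 gets a z-score of 0, and values above it get positive z-scores. The resulting z-scores themselves have mean 0 and standard deviation 1.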

<b>Why use Standard Score?</b>
- Standard scores (z-scores) put different sets of scores on a common scale with the same mean and standard deviation, so they can be compared.
- Using the standard score allows us to:
    - Calculate the probability of a score occurring within our normal distribution.
    - Compare two scores that are from different normal distributions.
    - Compare two scores that are from different normal distributions.
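
Both uses can be illustrated with the standard library's `statistics.NormalDist`. The exam names, means and standard deviations below are invented, and the probabilities are only meaningful if the scores really are normally distributed.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

# Hypothetical results from two differently scaled exams.
z_math = z_score(85, mean=70, sd=10)    # 1.5
z_physics = z_score(70, mean=60, sd=5)  # 2.0

# The physics score is further above its own average, despite the lower raw mark.
print(z_physics > z_math)  # True

# Probability of scoring below each value under a standard normal distribution.
standard_normal = NormalDist()  # mean 0, standard deviation 1
p_math = standard_normal.cdf(z_math)
p_physics = standard_normal.cdf(z_physics)
print(f"{p_math:.4f} {p_physics:.4f}")  # 0.9332 0.9772
```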

When we apply a standard scaler (z-score) to a column of our dataset, the column is rescaled so that its mean is 0 and its standard deviation is 1.

<b>Advantages of z-score:</b>
- z-score can be used to compare raw scores that are taken from different tests.
- z-score considers both the mean value and the variability in a set of raw scores.

<b>Disadvantages of z-score:</b>
- Interpreting z-scores as probabilities assumes that the dataset is normally distributed.
- If the data is skewed, the distribution to the left and right of the mean is not equal, so those probabilities can be misleading.

<b>How is z-score used in real life?</b>
- We can use the z-table and the normal distribution graph to visualize how a z-score of 2.0 means "higher than average".
- Let's say we have a person's weight (140 kg), and we know their z-score is 2.0. We know that 2.0 is above average (because of its high placement on the normal distribution curve), but we want to know how far above average this weight is.
- Looking up z = 2.0 in the z-table gives about 0.9772, so this weight is higher than roughly 97.7% of the population, assuming weights are normally distributed.
<img src="assets/Z-score.png" width="500">
