# Statistics For Data Science

## Introduction

### What is Statistics?

- Statistics is the branch of mathematics concerned with collecting, organizing, analyzing, interpreting, and presenting data. It is used in a wide variety of fields, including science, engineering, medicine, and business, and it provides tools for making decisions in the face of uncertainty.

### Importance of Statistics in Data Science

- To analyze and make decisions based on data
- To identify patterns and trends in data
- To make predictions based on data
- To validate the accuracy and reliability of data
- To identify relationships between variables in data
- To identify errors in data

![image.png](attachment:image.png)

### Types of Data


##### 1. Nominal Data

This type of data consists of categorical variables that do not have any natural order. Examples include gender, race, and eye color.

##### 2. Ordinal Data

This type of data consists of categorical variables that can be ordered or ranked. Examples include education levels (elementary, high school, college) and customer ratings (poor, fair, good).

##### 3. Interval Data

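This type of data consists of numerical variables that have a constant interval between values but no true zero point, so differences are meaningful while ratios are not. Examples include temperature (measured in Celsius or Fahrenheit) and calendar dates or clock times. Note that a duration, such as the number of minutes a task takes, does have a true zero and is ratio data.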

##### 4. Ratio Data

This type of data consists of numerical variables that have a true zero point and a constant interval between them. Examples include weight, height, and income.
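
The practical difference between interval and ratio data is whether ratios make sense. The sketch below illustrates this with made-up temperatures and weights; the helper `celsius_to_kelvin` and all values are assumptions introduced only for this example.

```python
# Interval scale: Celsius has an arbitrary zero, so ratios are not meaningful.
celsius_a, celsius_b = 10.0, 20.0


def celsius_to_kelvin(c):
    """Convert Celsius to Kelvin, a scale with a true zero."""
    return c + 273.15


# Dividing Celsius values gives 2.0, but 20 °C is not "twice as hot" as 10 °C.
naive_ratio = celsius_b / celsius_a

# On the Kelvin scale the same two temperatures differ by only a few percent.
kelvin_ratio = celsius_to_kelvin(celsius_b) / celsius_to_kelvin(celsius_a)

# Differences are meaningful on both scales and agree with each other.
diff_celsius = celsius_b - celsius_a
diff_kelvin = celsius_to_kelvin(celsius_b) - celsius_to_kelvin(celsius_a)

# Ratio scale: weight has a true zero, so 80 kg really is twice 40 kg.
weight_ratio = 80.0 / 40.0

print(naive_ratio, round(kelvin_ratio, 3), diff_celsius, round(diff_kelvin, 2), weight_ratio)
```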

#### Population and Sample

![image.png](attachment:image.png)

##### Population

- A population refers to the entire group of individuals, objects, or events that we are interested in studying. For example, if we want to study the heights of all the students in a school, then the population would be all the students in that school.

##### Sample

- A sample is a subset of the population that is selected for analysis. In our example, if we choose a group of 50 students from the school and measure their heights, then this group of 50 students would be our sample.

*Samples are often used in statistics because it is usually impractical or impossible to study an entire population. By studying a sample, we can make inferences about the population as a whole. However, it is important to ensure that the sample is representative of the population so that our inferences are valid.*
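
As a minimal sketch of the school example, the code below simulates a hypothetical population of 1,000 student heights and draws a simple random sample of 50. The population size, the mean of 160 cm, and the spread of 10 cm are assumptions chosen only for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: heights (cm) of 1,000 students.
population = [random.gauss(160, 10) for _ in range(1000)]

# Simple random sample of 50 students, drawn without replacement.
sample = random.sample(population, 50)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.1f} cm")
print(f"sample mean:     {sample_mean:.1f} cm")
```

Because every student has the same chance of being selected, the sample mean should land close to the population mean, although it will rarely match it exactly.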

![image.png](attachment:image.png)

### Types of Statistics


#### 1. Descriptive Statistics

##### What?

- Descriptive statistics refers to the branch of statistics that deals with summarizing and describing the main features of a dataset. 
- This includes measures such as mean, median, mode, variance, standard deviation, range, and quartiles. 
- The aim of descriptive statistics is to provide a clear and concise summary of the data, allowing for better understanding and interpretation of the dataset.

##### Why?

- Descriptive statistics provide insight into the data's distribution, central tendency, and variability, which can be useful in making data-driven decisions.
- Descriptive statistics can also help identify potential outliers, which may require further investigation.

##### Measures of Central Tendency

**1. Mean**

- The mean is the average of all the values in a dataset: the sum of the values divided by the number of values.
- The mean is the most commonly used measure of central tendency.
- The mean is sensitive to outliers: a single extreme value can pull it away from the bulk of the data.

**2. Median**
- The median is the middle value in a dataset once the values are sorted. With an even number of values, it is the average of the two middle values.
- The median is less sensitive to outliers than the mean.

**3. Mode**
- The mode is the most frequently occurring value in a dataset.
- The mode is not always informative: a dataset may have several modes, or no repeated values at all.

![image.png](attachment:image.png)
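
The effect of an outlier on each measure can be checked with Python's standard `statistics` module. The salary figures below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import statistics

# Hypothetical salaries in thousands; the values are made up for illustration.
salaries = [30, 32, 32, 35, 38, 40, 45]
with_outlier = salaries + [400]  # one extreme value added

mean_before = statistics.mean(salaries)          # 252 / 7 = 36
median_before = statistics.median(salaries)      # middle of 7 sorted values = 35
mode_before = statistics.mode(salaries)          # 32 appears twice

mean_after = statistics.mean(with_outlier)       # 652 / 8 = 81.5
median_after = statistics.median(with_outlier)   # (35 + 38) / 2 = 36.5

# multimode returns every value tied for most frequent,
# which covers datasets that have more than one mode.
modes = statistics.multimode([1, 1, 2, 2, 3])    # [1, 2]

print(mean_before, median_before, mode_before)
print(mean_after, median_after, modes)
```

The single outlier more than doubles the mean, while the median moves only from 35 to 36.5.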

##### Measures of Dispersion

**1. Range**
- The range is the difference between the largest and smallest values in a dataset.
- The range is the simplest measure of dispersion, but it depends only on the two most extreme values, so it is highly sensitive to outliers and says nothing about how the remaining values are spread.

**2. Variance**
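- The variance is the average of the squared differences between each value and the mean. For a sample, the sum of squared differences is usually divided by n - 1 rather than n.
- The variance is a measure of the spread of the data around the mean.
- The variance is expressed in squared units (for example, cm² for heights measured in cm), which makes it harder to interpret directly.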

**3. Standard Deviation**
- The standard deviation is the square root of the variance.
- The standard deviation is a measure of the spread of the data around the mean.
- The standard deviation is usually easier to interpret than the variance because it is expressed in the same units as the data.

**SD v/s Variance**

- Both measure how spread out the data is around the mean. The standard deviation is simply the square root of the variance, so the two always rise and fall together.

- The variance is in squared units and is convenient in mathematical derivations, while the standard deviation is in the original units of the data and is the one usually reported alongside the mean.
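
The measures of dispersion can be computed with the standard `statistics` module, as sketched below. The dataset is a small made-up example; note that `pvariance` and `pstdev` divide by n (population formulas), while `variance` and `stdev` divide by n - 1 (sample formulas).

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values with mean 5

# Range: largest value minus smallest value.
data_range = max(data) - min(data)                 # 9 - 2 = 7

# Population formulas divide the sum of squared deviations (32) by n = 8.
population_variance = statistics.pvariance(data)   # 32 / 8 = 4
population_sd = statistics.pstdev(data)            # sqrt(4) = 2.0

# Sample formulas divide by n - 1 = 7 instead.
sample_variance = statistics.variance(data)        # 32 / 7
sample_sd = statistics.stdev(data)                 # sqrt(32 / 7)

print(data_range, population_variance, population_sd)
print(round(sample_variance, 3), round(sample_sd, 3))
```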

##### Measures of Shape (Skewness and Kurtosis)

**1. Skewness**

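- Skewness is a measure of the asymmetry of a distribution.
- A distribution is said to be skewed if it is not symmetric: it is positively (right) skewed when the longer tail is on the right, and negatively (left) skewed when the longer tail is on the left.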

![image.png](attachment:image-2.png)

![image.png](attachment:image.png)
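
One common way to quantify skewness is the moment-based coefficient, the third central moment divided by the second central moment raised to the power 3/2. The function below is a minimal sketch of that population formula; the name `skewness` and the sample lists are assumptions for illustration, and statistical libraries may apply additional bias corrections.

```python
def skewness(values):
    """Moment-based (population) skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]           # long tail on the right
left_skewed = [-x for x in right_skewed]     # mirror image, long tail on the left

print(skewness(symmetric))      # zero for a symmetric dataset
print(skewness(right_skewed))   # positive
print(skewness(left_skewed))    # negative
```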

**2. Kurtosis**

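- Kurtosis is a measure of how heavy the tails of a distribution are compared with a normal distribution. It is often loosely described in terms of peakedness.
- A distribution is said to be leptokurtic if it has heavier tails than the normal distribution, usually together with a sharper peak.
- A distribution is said to be platykurtic if it has lighter tails than the normal distribution, usually together with a flatter peak.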

![image.png](attachment:image.png)

![image.png](attachment:image.png)
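
Kurtosis is commonly reported as excess kurtosis, the fourth central moment divided by the squared second central moment, minus 3, so that a normal distribution scores 0. The sketch below uses that population formula; the function name `excess_kurtosis` and both datasets are assumptions made up for illustration.

```python
def excess_kurtosis(values):
    """Moment-based (population) excess kurtosis: m4 / m2 ** 2 - 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3


# Most values near the centre plus two extreme values: heavy tails.
heavy_tailed = [-10, -1, -1, 0, 0, 0, 0, 1, 1, 10]

# Evenly spread values with no extremes: light tails.
flat = list(range(1, 10))

print(excess_kurtosis(heavy_tailed))  # positive (leptokurtic)
print(excess_kurtosis(flat))          # negative (platykurtic)
```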

#### 2. Inferential Statistics

##### What?

- Inferential statistics refers to the branch of statistics that deals with making inferences about a population based on a sample.
- This includes techniques such as estimation with confidence intervals and hypothesis testing, where p-values quantify the evidence against a hypothesis.
- The aim is to draw conclusions that extend beyond the observed sample while quantifying the uncertainty in those conclusions.

##### Why?

- Inferential statistics can be used to make predictions about a population based on a sample.
- Inferential statistics can also be used to test hypotheses about a population.

##### Probability & Probability Distributions

**Probability**

- Probability is the measure of the likelihood of an event occurring. It is expressed as a number between 0 and 1, where 0 represents an impossible event and 1 represents a certain event.
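
When all outcomes are equally likely, a probability is the number of favourable outcomes divided by the total number of outcomes. The sketch below enumerates two fair six-sided dice, an example chosen here for illustration, and shows an impossible event, a certain event, and one in between.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))


def probability(event):
    """Probability of an event as favourable outcomes / total outcomes."""
    favourable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favourable, len(outcomes))


p_seven = probability(lambda o: sum(o) == 7)         # 6 of 36 outcomes
p_impossible = probability(lambda o: sum(o) == 13)   # cannot happen
p_certain = probability(lambda o: sum(o) >= 2)       # always happens

print(p_seven, p_impossible, p_certain)
```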

**Probability Distributions**

*What?*

- A probability distribution is a mathematical function that describes how likely each possible value, or range of values, of a random variable is.

*Importance*

1. Describing data: Probability distributions can be used to describe the characteristics of a dataset, such as the mean, variance, and skewness. By fitting a distribution to the data, we can gain insights into its underlying properties and make inferences about its behavior.

2. Generating random samples: Many statistical and machine learning models rely on generating random samples from a distribution to make predictions. For example, in Monte Carlo simulations, we might simulate the behavior of a system by drawing samples from a distribution of possible outcomes.

3. Estimating parameters: Probability distributions can also be used to estimate the parameters of a model. For example, we might assume that the data follow a normal distribution and estimate its mean and variance from a sample of observations.

4. Predictive modeling: In machine learning, we often use probability distributions to model the likelihood of different outcomes. For example, in a classification problem, we might model the probability of each class label given the input features, and use this information to make predictions.
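
As a small illustration of the random-sampling use case, the sketch below estimates a probability by Monte Carlo simulation and compares it with the exact value. The target quantity, the probability that a standard normal variable exceeds 1, and the number of draws are assumptions chosen only for this example.

```python
import random
from statistics import NormalDist

random.seed(0)  # fixed seed so the sketch is reproducible

n_draws = 100_000

# Draw samples from a standard normal distribution (mean 0, SD 1).
draws = [random.gauss(0, 1) for _ in range(n_draws)]

# Monte Carlo estimate: fraction of draws that exceed 1.
estimate = sum(1 for x in draws if x > 1) / n_draws

# Exact value from the normal cumulative distribution function.
exact = 1 - NormalDist(mu=0, sigma=1).cdf(1)

print(f"Monte Carlo estimate: {estimate:.4f}")
print(f"Exact probability:    {exact:.4f}")
```

The estimate gets closer to the exact value as the number of draws grows, which is the basic idea behind Monte Carlo methods.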

**Types of Probability Distributions**

![image.png](attachment:image.png)

**1. Discrete Probability Distributions**

- Discrete probability distributions are used to model random variables that can take only a countable set of distinct values, such as counts.



- Binomial Distribution
    - The binomial distribution models the number of successes in n independent trials that each have a binary outcome, such as the number of heads in 10 coin flips.
    - The binomial distribution is parameterized by the number of trials, n, and the probability of success, p, which is assumed to be the same for every trial.
    - The mean of the binomial distribution is np and its variance is np(1 - p).

![image.png](attachment:image-2.png)
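
The binomial probability of exactly k successes in n trials is C(n, k) · p^k · (1 - p)^(n - k), where C(n, k) counts the ways to choose which k trials succeed. The sketch below applies this to the coin-flip example; the function name `binomial_pmf` is an assumption introduced for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


n, p = 10, 0.5  # 10 flips of a fair coin

# Probability of exactly 5 heads: 252 / 1024.
p_five_heads = binomial_pmf(5, n, p)

# The probabilities over all possible counts 0..n sum to 1.
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))

# The mean of the distribution equals n * p.
mean = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))

print(p_five_heads, total, mean)
```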

- Poisson Distribution
    - The Poisson distribution models the number of events occurring in a fixed interval of time or space, such as the number of customers arriving at a store in an hour.
    - It assumes that events occur independently of one another and at a constant average rate.
    - The Poisson distribution is parameterized by λ, the average number of events per interval, which is both its mean and its variance.

![image.png](attachment:image.png)
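
The Poisson probability of exactly k events is e^(-λ) · λ^k / k!. The sketch below implements this formula and checks numerically that the mean and the variance both come out equal to λ; the function name `poisson_pmf` and the value λ = 4 are assumptions chosen for illustration.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k events when the average count is lam."""
    return exp(-lam) * lam ** k / factorial(k)


lam = 4  # hypothetical average number of events per interval

# Counts above 60 have negligible probability for lam = 4,
# so truncating the infinite sum there is accurate enough for a check.
counts = range(60)
probs = [poisson_pmf(k, lam) for k in counts]

total = sum(probs)
mean = sum(k * p for k, p in zip(counts, probs))
variance = sum((k - mean) ** 2 * p for k, p in zip(counts, probs))

print(round(total, 6), round(mean, 6), round(variance, 6))
```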

**Example**

A call center that receives an average of 10 calls per hour can use the Poisson distribution (with λ = 10) to estimate the probability of receiving a certain number of calls in an hour. The binomial distribution answers a different kind of question: if the call center has a fixed number of lines, n, and each line is independently busy with probability p at a given moment, then the number of busy lines at that moment is binomial.
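
The call center scenario can be worked through numerically, as sketched below. The rate of 10 calls per hour comes from the example; the 30-minute window, the 8 lines, and the 0.6 probability that a line is busy are hypothetical values, and treating lines as independently busy is an additional assumption made for this sketch.

```python
from math import comb, exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k events when the average count is lam."""
    return exp(-lam) * lam ** k / factorial(k)


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


calls_per_hour = 10

# Probability of more than 15 calls in one hour (1 minus P(at most 15)).
p_more_than_15 = 1 - sum(poisson_pmf(k, calls_per_hour) for k in range(16))

# For a 30-minute window the expected count scales with the length: 10 * 0.5 = 5.
lam_half_hour = calls_per_hour * 0.5
p_no_calls_half_hour = poisson_pmf(0, lam_half_hour)

# Hypothetical: 8 lines, each independently busy with probability 0.6.
n_lines, p_busy = 8, 0.6
p_all_lines_busy = binomial_pmf(n_lines, n_lines, p_busy)

print(f"P(more than 15 calls in an hour): {p_more_than_15:.4f}")
print(f"P(no calls in 30 minutes):        {p_no_calls_half_hour:.4f}")
print(f"P(all 8 lines busy):              {p_all_lines_busy:.4f}")
```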

**2. Continuous Probability Distributions**

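- Continuous probability distributions are used to model random variables that can take any value in a continuous range. Probabilities are assigned to intervals of values through a probability density function, and the probability of any single exact value is zero.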
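
For a continuous random variable, probabilities come from areas under the density curve rather than from the density itself. The sketch below uses the normal distribution from Python's standard `statistics` module as an illustrative choice; the specific means and standard deviations are assumptions.

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)  # standard normal distribution

# The density at a point is a height of the curve, not a probability.
density_at_zero = z.pdf(0)

# Probability of falling in an interval: difference of the CDF at its ends.
p_within_one_sd = z.cdf(1) - z.cdf(-1)

# An interval of zero width has zero probability.
p_exact_point = z.cdf(0) - z.cdf(0)

# A density can exceed 1 when the distribution is narrow,
# which confirms it cannot be read as a probability.
narrow_density = NormalDist(mu=0, sigma=0.1).pdf(0)

print(round(density_at_zero, 4), round(p_within_one_sd, 4), p_exact_point)
print(round(narrow_density, 4))
```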

##### Confidence Intervals