# Population & Sample

The first step of any statistical analysis is to determine whether the data we are dealing with is a **Population** or a **Sample**.

> A population is the collection of all items of interest. Its size is denoted by **'N'**.
> The numbers we obtain from a population are called **parameters**.

> A sample is a subset of the population. Its size is denoted by **'n'**.
> The numbers we obtain from a sample are called **statistics**.

Let's say you want to count the total number of students in a university. You can't just walk onto campus and get the correct total, because some students might be absent, studying remotely or abroad, or enrolled part time. Hence, ***Populations are hard to define and hard to observe in real life.***

A sample, however, is much easier to gather. It is less time-consuming and less costly. E.g. we can go to the university canteen and survey 50 students. This sample is drawn from the population of the university.

***Populations are hard to observe & contact. Samples are easy to observe and contact.***

You will almost always be working with sample data and making data-driven decisions based on it. Samples are key to statistical insights. They have 2 defining characteristics:
- Randomness : A random sample is collected when each member of the sample is chosen from the population strictly by chance.
- Representativeness : A representative sample is a subset of the population that accurately reflects the members of the entire population.

A sample must be both random and representative for an insight to be precise.
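
As a minimal sketch of drawing a random sample, the snippet below uses Python's standard `random` module. The population of student IDs, the seed, and the sample size of 50 are all made-up assumptions for illustration.

```python
import random

# Hypothetical population: student IDs 1..10000 (made-up data).
population = list(range(1, 10_001))

rng = random.Random(42)                # fixed seed so the draw is reproducible
sample = rng.sample(population, k=50)  # every student is equally likely to be picked

N = len(population)  # population size
n = len(sample)      # sample size
print(N, n)
```

Drawing strictly by chance gives randomness. Whether the result is also representative still depends on the list we draw from covering the whole population.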

# Types of Data

***We classify Data in 2 main ways :- 1. Type  2. Measurement Level***

> **CATEGORICAL Data** : Describes categories or groups. E.g. car brands, answers to Yes/No questions.

> **NUMERICAL Data** : Represents meaningful numbers. Has 2 subsets ->
> - **Discrete**   ->  Discrete data can usually be counted in a finite manner. E.g. the number of children I want, or an exam score. We know the range of values to expect.
> - **Continuous** ->  Infinite and impossible to count. E.g. my weight can take any value within some range. It may be 85.364 kg, but if I eat something it will change by some fraction.

Discrete examples : Grades at university, number of objects, physical money.

Continuous examples : Height, area, distance, time. All of these can vary by infinitely small amounts. Time on a clock is discrete, but time in general isn't.

# Levels of Measurement

> **QUALITATIVE** :
> - **Nominal**  ->  These variables are categories like car brands, seasons etc. They aren't numbers and cannot be ordered.
> - **Ordinal**  ->  Consists of categories that follow a strict order. E.g. rating the taste of food from Disgusting to Delicious: the labels aren't numbers, but we know they run from negative to positive.

> **QUANTITATIVE** :
> - **Interval**  ->  Represented by numbers but doesn't have a true 0. E.g. temperature in °C/°F: 0° does not mean "no temperature", so ratios between values are not meaningful.
> - **Ratio**  ->  Represented by numbers and has a true 0. E.g. temperature in Kelvin, where 0 K really is zero.
>
> Numbers like 2, 3, 4, 10.5 etc. can be either interval or ratio, depending on what they measure.
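
To see why a true 0 matters, here is a small sketch comparing the ratio of two temperatures in Celsius and in Kelvin. The two temperatures are arbitrary example values and `celsius_to_kelvin` is a helper introduced only for this illustration.

```python
def celsius_to_kelvin(c):
    """Convert degrees Celsius to Kelvin."""
    return c + 273.15

a_c, b_c = 10.0, 20.0  # two example temperatures in Celsius

# On the interval scale the ratio is 2.0, but 20 C is not "twice as hot" as 10 C.
ratio_celsius = b_c / a_c

# On the ratio scale (Kelvin) the same two temperatures differ by only about 3.5%.
ratio_kelvin = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)

print(ratio_celsius, round(ratio_kelvin, 4))
```

The ratio changes when the zero point changes, which is why ratios are only meaningful on a scale with a true 0.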

# Data Visualization

It's much easier to visualize data once we know its type and measurement level.

### ***Categorical Variables Representation***  ->

- Frequency Distribution Tables.
- Bar Charts.
- Pie Charts.
- Pareto Diagrams.

> 1. A Frequency Distribution Table consists of 2 columns: the categories and their frequency. Frequency is the number of occurrences of each item.
> 2. A Bar Chart or Column Chart has 2 axes, where the vertical axis shows the frequency and the horizontal axis shows the categories.
> 3. A Pie Chart requires each category's percentage of the total. These percentages are also known as relative frequencies. E.g. market shares are shown by pie charts.
> 4. A Pareto Diagram is a special type of bar chart where categories are shown in descending order of frequency. A curve is then added showing the cumulative frequency, which is the running sum of the relative frequencies.
>
> Pareto Diagram combines the strong sides of the bar and pie charts.
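
The numbers behind these charts can be sketched in a few lines of Python. The car-sales data below is invented for illustration; the code builds the frequency table, the relative frequencies a pie chart would need, and the cumulative frequencies a Pareto curve would plot.

```python
from collections import Counter

# Made-up sample of car brands sold (illustrative data only).
sales = ["Audi"] * 124 + ["BMW"] * 98 + ["Mercedes"] * 113

freq = Counter(sales)        # frequency distribution table
total = sum(freq.values())

# most_common() orders categories by descending frequency, as a Pareto diagram does.
pareto_rows = []
running = 0.0
for brand, count in freq.most_common():
    relative = count / total  # relative frequency (share of the pie)
    running += relative       # cumulative frequency (Pareto curve)
    pareto_rows.append((brand, count, relative, running))

for brand, count, relative, cum in pareto_rows:
    print(f"{brand:<10} {count:>4} {relative:6.1%} {cum:6.1%}")
```

The last cumulative value is always 100%, which is a quick sanity check on the table.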

### ***Numerical Variables Representation***  ->

- Frequency Distribution Tables.
- Histogram
- Cross Table
- Scatter Plot

When working with numerical data, it's better to group the values into intervals; the frequency of each interval is then easy to plot on charts.

Interval width  =  (Largest value - Smallest value) / Number of desired intervals

Relative Frequency  =  Frequency of an interval / Total Frequency
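
The grouping described above can be sketched as follows. The data and the choice of 5 intervals are assumptions made for the example, and rounding the width up to a whole number is a convenience, not a rule.

```python
import math

# Made-up numerical data for illustration.
data = [1, 9, 22, 24, 32, 41, 44, 48, 57, 66, 70, 73, 75, 76, 79, 82, 87, 89, 95, 100]
num_intervals = 5  # desired number of intervals (an assumption)

raw_width = (max(data) - min(data)) / num_intervals  # (100 - 1) / 5 = 19.8
width = math.ceil(raw_width)                         # round up for readable bounds

# Count how many values fall into each interval [start, start + width).
counts = [0] * num_intervals
for x in data:
    idx = min((x - min(data)) // width, num_intervals - 1)
    counts[idx] += 1

relative = [c / len(data) for c in counts]  # relative frequency per interval
print(width, counts, relative)
```

The `counts` list is what a histogram plots on its vertical axis; `relative` is the same information expressed as a share of the total.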

The difference between a bar chart and a histogram is that in a histogram both axes are numerical. Usually the horizontal axis shows the intervals and the vertical axis shows the frequency, which can be absolute or relative.

To show the relationship between two variables, use Cross Tables & Scatter Plots.
- Categorical Variables : Use a Cross Table, commonly visualized as a side-by-side bar chart. E.g. showing how much money 3 investors have invested in 3 different stocks.
- Numerical Variables : Use a Scatter Plot. E.g. the scores 100 students got in English and Maths.

# Measures of Central Tendency

Measures that describe the data through 'averages'. The most common are the mean, median and mode. There are also the geometric mean, harmonic mean, weighted mean, etc.

> #### ***MEAN***

> - Also known as the 'Simple Average': the sum of all values divided by the number of values. Denoted by **µ** for a population and **x̄** for a sample.
> - Easily affected by outliers.
> - The mean alone is not enough for definite conclusions.

> #### ***MEDIAN***

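> - We have to sort our data in ascending order first.
> - The median is the value at position (n + 1) / 2 of the sorted data. When n is even, that position falls between two values and the median is their average.
> - Not affected by outliers.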

> #### ***MODE***

> - The mode is the value that occurs most often.
> - There may be NO mode at all in a dataset (when every value occurs equally often), or there may be more than one.

Mean, median and mode should be used together to get a good understanding of the data.
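
A quick sketch with Python's standard `statistics` module shows how the three measures react to an outlier. The price list is made up, and the last value is deliberately extreme.

```python
import statistics

# Made-up prices; the last value (66) is an outlier.
prices = [1, 2, 3, 3, 5, 6, 7, 8, 9, 11, 66]

mean = statistics.mean(prices)      # pulled upward by the outlier
median = statistics.median(prices)  # value at position (n + 1) / 2 of the sorted data
mode = statistics.mode(prices)      # most frequent value

print(mean, median, mode)
```

Here the mean lands well above the median because of the single large value, which is exactly the situation where reporting the mean alone would mislead.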

# Measures of Asymmetry

Measures that describe the data through the level of symmetry that is observed. The most common are skewness and kurtosis.

> #### ***SKEWNESS***

> - Skewness indicates whether the data is concentrated on one side.
> - **Mean > Median** : Positive Skew or Right Skew. The skew is named after the side the tail points to, so Right Skew means the outliers are on the right.
> - **Mean = Median = Mode** : No Skew / Zero Skew / Symmetrical.
> - **Mean < Median** : Negative Skew or Left Skew. The outliers are on the left.

> Skewness is important because it tells us where the data is situated.

![skew](https://www.zarantech.com/blog/wp-content/uploads/2019/03/skewed-curves.png)
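
The notes above do not give a formula for skewness, so the sketch below assumes one common definition: the third central moment divided by the variance raised to the power 1.5. The helper `skewness` and both datasets are illustrative only.

```python
import statistics

def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]  # made-up data with one large outlier
left_skewed = [-v for v in right_skewed]  # mirror image of the same data

skew_right = skewness(right_skewed)
skew_left = skewness(left_skewed)

# Mean vs. median agrees with the sign of the skewness in both cases.
right_mean_above_median = statistics.mean(right_skewed) > statistics.median(right_skewed)
left_mean_below_median = statistics.mean(left_skewed) < statistics.median(left_skewed)
print(skew_right, skew_left, right_mean_above_median, left_mean_below_median)
```

A positive result corresponds to a right skew and a negative result to a left skew, matching the mean/median comparison described above.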

# Measures of Variability

Measures that describe the data through the level of dispersion (variability). The most common ones are variance and standard deviation.

> ### ***VARIANCE***

> Variance measures the dispersion of a set of data points around their mean.

> **Population Variance** : $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
> - The closer a number is to the mean, the smaller its squared difference.
> - The further a number is from the mean, the larger its squared difference.

> **Sample Variance** : $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
> - Sample variance is used when our set of observations is a sample drawn from a bigger population.
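
The difference between the two formulas is only the denominator, which a short sketch makes visible. The five observations are made up; `statistics.pvariance` divides by N and `statistics.variance` divides by n - 1.

```python
import statistics

data = [1, 2, 3, 4, 5]  # made-up observations

# Manual computation, following the formulas above.
mean = sum(data) / len(data)
squared_diffs = [(x - mean) ** 2 for x in data]
pop_var_manual = sum(squared_diffs) / len(data)           # divide by N
sample_var_manual = sum(squared_diffs) / (len(data) - 1)  # divide by n - 1

# The standard library gives the same results.
pop_var = statistics.pvariance(data)
sample_var = statistics.variance(data)

print(pop_var, sample_var)
```

The sample variance is always the larger of the two for the same data, since the same sum is divided by a smaller number.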


> ### ***STD Deviation***

> Variance is a common measure of dispersion, but because it is expressed in squared units, the figure we obtain is often large and hard to compare with the data. Taking the square root brings it back to the original units.

> - **Population Std Deviation** : Square root of the population variance, $\sigma = \sqrt{\sigma^2}$
> - **Sample Std Deviation** : Square root of the sample variance, $s = \sqrt{s^2}$

> Std Deviation is the most common measure of variability for a SINGLE dataset.

> ### ***Coefficient of Variation (CV)***
>
> - **Relative Std Dev = Std Dev / Mean**
> - **Population Formula** : $c_v = \sigma / \mu$
> - **Sample Formula** : $\hat{c}_v = s / \bar{x}$

>  The CV has no unit of measurement, which makes it suitable for comparing the variability of TWO or MORE datasets.
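
As a rough illustration of why the CV is useful for comparisons, the sketch below expresses the same made-up prices in two currencies. The exchange rate of 18.81 is an arbitrary assumption, and `coefficient_of_variation` is a helper defined only for this example.

```python
import statistics

def coefficient_of_variation(sample):
    """Sample CV: sample standard deviation divided by the sample mean."""
    return statistics.stdev(sample) / statistics.mean(sample)

# Made-up prices in dollars, and the same prices converted at an assumed rate.
dollars = [1, 2, 3, 3, 5, 6, 7, 8, 9, 11]
pesos = [d * 18.81 for d in dollars]

std_dollars = statistics.stdev(dollars)
std_pesos = statistics.stdev(pesos)  # much larger, only because of the unit

cv_dollars = coefficient_of_variation(dollars)
cv_pesos = coefficient_of_variation(pesos)  # same as cv_dollars up to rounding

print(std_dollars, std_pesos, cv_dollars, cv_pesos)
```

The standard deviations differ by the exchange rate, while the two CVs agree, showing that the CV removes the effect of the unit.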

# ***STEPS to Follow*** :
- Figure out whether the data is a SAMPLE or a POPULATION.
- Find the mean.
- Find the variance, using the sample or population formula accordingly.
- Find the standard deviation.
- Find the coefficient of variation.

Professionals usually choose ***Std Deviation*** as their preferred measure of variability.

# Covariance

So far we have covered univariate measures. Now we will work with measures that involve more than one variable. E.g. house price depends on size.

On a **Scatter Plot** we can see whether two variables are related; the main statistic for measuring this relationship is called covariance.

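> Covariance gives a sense of direction :
> - `> 0`, the two variables move together
> - `< 0`, the two variables move in opposite directions
> - `= 0`, the two variables have no linear relationship (independent variables always have a covariance of 0, but a covariance of 0 does not by itself prove independence)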
>
>
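> The problem with covariance is that its value depends on the units of the variables, so it can land on a completely different scale from one dataset to the next and is hard to interpret.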
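
The notes do not spell out the covariance formula, so this sketch assumes the usual sample version: the sum of the products of deviations from each mean, divided by n - 1. The house sizes and prices are invented, and `sample_covariance` is a helper written for this example.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

size = [650, 785, 1200, 720, 975]             # made-up house sizes (sq. ft.)
price_thousands = [772, 998, 1200, 800, 895]  # made-up prices in thousands
price_units = [p * 1000 for p in price_thousands]  # same prices, different unit

cov_thousands = sample_covariance(size, price_thousands)
cov_units = sample_covariance(size, price_units)

print(cov_thousands, cov_units)
```

Both results are positive, so size and price move together, but changing the unit of price multiplies the covariance by 1000. That is the scale problem described above.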

# Correlation

Correlation adjusts covariance so that the relationship between two variables becomes easy and intuitive to interpret.

$\text{cor}(x, y) = \frac{\text{cov}(x, y)}{\text{std}(x) \cdot \text{std}(y)}$

The correlation value always lies between -1 and 1.

> correlation coeff. = 1
> - The entire variability of one variable is explained by the other. Also known as Perfect Positive Correlation. Values close to 1 indicate a strong positive relationship, e.g. the size of a house and its price.
>
> correlation coeff. = 0
> - No linear relationship between the variables, as we expect from independent variables. E.g. house prices in London and coffee prices in Brazil.
>
> correlation coeff. in [-1, 0)
> - Negative correlation between the variables, with -1 being a Perfect Negative Correlation. E.g. ice cream and warm jackets: one sells more in summer while the other sells less.

cov(x,y) = cov(y,x), so covariance, and therefore correlation, is symmetrical with respect to both variables.
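
A minimal sketch of the correlation coefficient, assuming sample covariance and sample standard deviations (both using n - 1). The data is constructed to be exactly linear so that the extreme values appear; `correlation` is a helper written for this example.

```python
import statistics

def correlation(xs, ys):
    """Sample covariance divided by the product of the sample standard deviations."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    return cov / (statistics.stdev(xs) * statistics.stdev(ys))

x = [1, 2, 3, 4, 5]
rising = [2 * v + 1 for v in x]    # exactly linear, increasing with x
falling = [10 - 3 * v for v in x]  # exactly linear, decreasing with x

perfect_positive = correlation(x, rising)
perfect_negative = correlation(x, falling)
swapped = correlation(rising, x)   # arguments swapped to check symmetry

print(perfect_positive, perfect_negative, swapped)
```

Swapping the arguments gives the same value, which is the symmetry noted above. Real data rarely sits exactly on a line, so values strictly between -1 and 1 are the norm.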

CAUSALITY :- ***It's important to understand the direction of causal relationships***
- It's an asymmetric relation (x causes y is different from y causes x)