## Statistics Fundamentals

Many significant developments in statistics occurred in the past century. Probability theory, a foundational component of statistics, was developed between the 17th and 19th centuries, notably through the work of Thomas Bayes, Pierre-Simon Laplace, and Carl Friedrich Gauss. Unlike probability theory, which is largely theoretical, statistics is an applied branch of science focused on data analysis.

- The first step of any data science project is to explore data.
- Classical statistics focuses almost exclusively on inference, which is a set of procedures for drawing conclusions about large populations by studying small samples.
- New technologies, increased access to large datasets, and expanded use of quantitative analysis across various disciplines have driven the growth of the field.

## What is Statistics?
- Statistics is a discipline that deals with methodologies to collect, prepare, and analyze data and to draw conclusions from it. You can mine the raw data to find patterns using statistical concepts.
- By combining statistical analysis with domain expertise, you can interpret these patterns and use the findings for decision-making in real-world situations. The ultimate goal is to create value for an organization.

## Common Statistics Terms
- Population and Sample: The population is the complete data pool from which a sample is drawn for further analysis. The sample is a subset of the population.
- Measurement and Sample Data: A measurement is a number or attribute calculated for each member of the population or sample. The measurements of the sample members are collectively called sample data. 
- Parameter: It is a characteristic of the population that you want to estimate or test, such as the population mean. The corresponding value computed from a sample, such as the sample mean, is called a statistic.
- Variable: A variable is something that can take on different values in the dataset.
- Distribution: It refers to how sample data is spread across a range of values.
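
The terms above can be illustrated with a minimal sketch. The population here is simulated, and the names `population`, `sample`, `parameter`, and `statistic` are illustrative choices rather than anything defined in this notebook.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 simulated measurements
population = [random.gauss(50, 10) for _ in range(1000)]

# A sample is a subset of the population
sample = random.sample(population, 50)

parameter = statistics.mean(population)  # population mean (usually unknown in practice)
statistic = statistics.mean(sample)      # sample mean, used to estimate the parameter

print(f"Population mean (parameter): {parameter:.2f}")
print(f"Sample mean (statistic):     {statistic:.2f}")
```

The two values are usually close but not identical, which is why inference is needed to reason about the population from the sample.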

## Statistics Types 

- Descriptive Statistics: It is a branch of statistics that involves organizing, displaying, and describing data. It uses numbers (numerical facts, figures, or information) to describe phenomena.

- Inferential Statistics: It is a branch of statistics that involves drawing conclusions about a population based on the information obtained from a sample taken from that population.

- Example: Consider all automobiles produced in a month as the population. A subset of these cars is inspected for quality characteristics such as mileage per gallon of gasoline, and this subset forms a sample. The average mileage of all cars is a parameter, while the average mileage of the inspected cars is a statistic.

    - Descriptive statistics here would include the selection of a sample, the presentation of sample data as diagrams or tables, and the computation of the value of a statistic.

    - Inferential statistics is utilized for making generalizations about a population based on a sample. For example, it can be used to determine whether the average fuel efficiency of all cars in a population is at least 23 miles per gallon based on data from a studied sample.

- Predictive Statistics: It is defined as the science of extracting information from data and using it to predict trends, behavior patterns, or relationships between characteristics.

## Types of Data

- Categorical Data: It represents characteristics such as a person’s gender, marital status, or the types of movies they like. It can also take numerical values, such as 1 indicating male and 2 indicating female, but these numbers don’t have mathematical meaning.

- Numerical Data: It represents the data as a measurement, such as a person’s height, weight, IQ, or blood pressure. It also includes data that can be counted, such as the number of stocks owned by a person.
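
As a small illustration of the difference, the sketch below counts categories for categorical data and averages values for numerical data. The records are made up for the example.

```python
from collections import Counter
import statistics

# Hypothetical records: marital status is categorical, height is numerical
marital_status = ["single", "married", "married", "single", "married"]
height_cm = [162.0, 175.5, 168.2, 181.0, 170.3]

# Categorical data: counting how often each category occurs is meaningful
counts = Counter(marital_status)
print(counts)

# Numerical data: arithmetic such as averaging is meaningful
mean_height = statistics.mean(height_cm)
print(mean_height)
```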

## Measures of Central Tendency

- A measure of central tendency is a summary that describes the central position in a dataset. These measures indicate the central location of a distribution, revealing where most values fall. In statistics, the three most common measures of central tendency are the mean, median, and mode. Each of these measures calculates the location of the central point using a different method.

## Mean

- It is the most frequently utilized measure of central tendency and is applicable to both continuous and discrete datasets.
- To calculate the mean, add all the numbers and divide the result by the number of data points.
- The mean is sensitive to outliers and skewed data.
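
A short sketch of that sensitivity, using made-up values: adding a single extreme value moves the mean a long way.

```python
import statistics

values = [10, 12, 11, 13, 12]
with_outlier = values + [100]  # one extreme value added

mean_plain = statistics.mean(values)          # sum is 58, divided by 5 points
mean_outlier = statistics.mean(with_outlier)  # sum is 158, divided by 6 points

print(mean_plain)
print(mean_outlier)
```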

## Median
- The median is the middle number obtained by arranging the data in ascending or descending order.
- In datasets with an odd number of points, the median is the exact middle value.
- In datasets with an even number of points, the median is the average of the two central values.
- In both cases, the median is notably less sensitive to outliers and skewed data than the mean.

    - Example 1: Consider a dataset with an odd number of values, x = 5, 76, 98, 32, 1, -6, 34, 3, -65

        - Step 1: Arrange the numbers in ascending order, that is, -65, -6, 1, 3, 5, 32, 34, 76, 98.
        - Step 2: The middle number is the fifth number, as there are nine numbers in total. The median of the given dataset is 5.
    
    - Example 2: Consider a dataset with an even number of values, x = 5, 76, 98, 32, 1, -6, 34, 3, -65, 99

        - Step 1: Arrange the numbers in ascending order, that is, -65, -6, 1, 3, 5, 32, 34, 76, 98, 99.
        - Step 2: Identify the middle numbers 5 and 32.
        - Step 3: To calculate the median, take the average of two middle values, that is, 5 and 32. So, the median of the given dataset is (5+32)/2 = 18.5
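
Both worked examples can be checked with `statistics.median` from the standard library, which sorts the data internally:

```python
import statistics

odd = [5, 76, 98, 32, 1, -6, 34, 3, -65]  # nine values
even = odd + [99]                         # ten values

median_odd = statistics.median(odd)    # middle value of the sorted data
median_even = statistics.median(even)  # average of the two middle values

print(median_odd)   # 5
print(median_even)  # 18.5
```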

## Mode 

- The mode is the most frequently occurring value in the set.
- The mode is applicable to both numerical and categorical data.
- The mode can be misleading: the most frequent value may lie far from the actual center of the distribution.
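
The sketch below uses made-up data to show a mode that sits far from the center, and `statistics.multimode` for the case where several values tie for most frequent.

```python
import statistics

# Hypothetical scores: the most frequent value sits at the low extreme
scores = [1, 1, 1, 50, 52, 55, 58, 60, 61]
mode_score = statistics.mode(scores)      # most frequent value
median_score = statistics.median(scores)  # middle value, for comparison
print(mode_score, median_score)

# Categorical data with two equally frequent values
brands = ["Samsung", "Apple", "Samsung", "Apple", "Nokia"]
modes = statistics.multimode(brands)  # every value tied for most frequent
print(modes)
```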

## Measures of Dispersion
- Measures of dispersion, also known as measures of variability, characterize the extent of spread or diversity in a dataset. Relying exclusively on measures of central tendency often falls short of providing a thorough understanding of a dataset's distribution.

## Range 
- Range is the difference between the largest and smallest data points in the set. Because it depends only on the maximum and minimum values, it is sensitive to outliers and does not use every data point in the set.
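
A minimal sketch with made-up values, showing how a single outlier dominates the range:

```python
data = [4, 8, 15, 16, 23, 42]
data_range = max(data) - min(data)  # 42 - 4

with_outlier = data + [1000]
outlier_range = max(with_outlier) - min(with_outlier)  # 1000 - 4

print(data_range)     # 38
print(outlier_range)  # 996
```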

## Percentile

- A percentile is a statistical measure used to indicate the value below which a given percentage of observations falls in a dataset. In simpler terms, it tells you how a particular value compares to the rest of the data.
- Example: If a student scores in the 80th percentile on a standardized test, their score is higher than 80% of all other test takers.
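
One way to sketch this idea is a percentile rank: the percentage of observations that fall strictly below a given value. The helper `percentile_rank` and the scores are hypothetical, and other definitions treat ties differently.

```python
def percentile_rank(scores, value):
    """Percentage of scores strictly below the given value."""
    below = sum(1 for s in scores if s < value)
    return 100 * below / len(scores)

# Hypothetical test scores
scores = [55, 60, 62, 68, 70, 75, 80, 85, 90, 95]

rank_of_90 = percentile_rank(scores, 90)  # 8 of the 10 scores are below 90
print(rank_of_90)  # 80.0
```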

## Quartile

- Quartiles are statistical measures used to divide a dataset into four equal parts or quarters. They are calculated by arranging the data in ascending order and then dividing it into four equal-sized groups.

- There are three quartiles, which are:
    - First quartile (Q1): This is the value below which 25% of the data fall. In other words, 25% of the data points are less than or equal to Q1.
    - Second quartile (Q2): This is the median of the dataset. It divides the data into two halves, with 50% of the data points falling below it and 50% above it.
    - Third quartile (Q3): This is the value below which 75% of the data fall. 75% of the data points are less than or equal to Q3.
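
As a sketch, `statistics.quantiles` from the standard library returns the three cut points. The `method="inclusive"` setting is a choice made for this example; quartile conventions differ between textbooks and libraries, so other tools may report slightly different values.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 asks for the three cut points that split the data into quarters
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print(q1, q2, q3)
print(q2 == statistics.median(data))  # Q2 is the median
```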

## Interquartile Range

- Interquartile range (IQR) is the difference between the 75th and 25th percentiles, that is, Q3 - Q1.
- It describes the middle 50% of the observations, and if they are spaced widely apart, their interquartile range will be large.
- It is useful even if the extreme values are not accurate, as it is insensitive to them.
- It is not amenable to mathematical manipulation.

    - Example: Consider the following dataset, where the values are arranged in ascending order: [10, 15, 20, 25, 30, 35, 40, 50, 70, 100]

        - The 25th percentile = average of 2nd and 3rd values = (15 + 20)/2 = 17.5.
        - The 75th percentile = average of the 7th and 8th values = (40 + 50)/2 = 45.
        - The interquartile range = 45 - 17.5 = 27.5.
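
The worked example can be reproduced with a small helper. It assumes the convention the example appears to use: the p-th percentile sits at 1-based position n * p / 100 in the sorted data, with linear interpolation between neighbours. The name `percentile_at_np` is hypothetical. Percentile conventions vary, so libraries such as NumPy may give slightly different quartiles by default.

```python
def percentile_at_np(sorted_values, p):
    """Percentile at 1-based position n * p / 100, linearly interpolated."""
    n = len(sorted_values)
    position = n * p / 100       # e.g. 10 * 25 / 100 = 2.5
    lower = int(position)        # 1-based index of the lower neighbour
    fraction = position - lower  # how far to move toward the next value
    if lower < 1:
        return sorted_values[0]
    if lower >= n:
        return sorted_values[-1]
    low_value = sorted_values[lower - 1]
    high_value = sorted_values[lower]
    return low_value + fraction * (high_value - low_value)

data = [10, 15, 20, 25, 30, 35, 40, 50, 70, 100]  # already sorted

q1 = percentile_at_np(data, 25)
q3 = percentile_at_np(data, 75)
iqr = q3 - q1

print(q1, q3, iqr)  # 17.5 45.0 27.5
```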

## Standard Deviation

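- Standard Deviation (SD) is the most widely used measure of dispersion. It measures the spread of data around the mean. It is defined as the square root of the average squared deviation from the mean: the sum of the squared deviations is divided by the number of observations, and the square root of the result is taken. For a sample, the sum is commonly divided by one less than the number of observations instead.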
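
The definition can be followed step by step in a short sketch. The data are made up, and the result is compared with `statistics.pstdev`, the population standard deviation in the standard library.

```python
import math
import statistics

data = [1, 2, 3, 4, 5]

mean = sum(data) / len(data)                          # 3.0
squared_deviations = [(x - mean) ** 2 for x in data]  # [4.0, 1.0, 0.0, 1.0, 4.0]

# Divide the sum of squared deviations by the number of observations,
# then take the square root
population_sd = math.sqrt(sum(squared_deviations) / len(data))

print(population_sd)
print(statistics.pstdev(data))  # same value from the standard library
```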

## Variance 

- Variance is defined as the average of the squared differences from the mean.
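
A brief sketch with made-up data, contrasting the population variance (dividing by n) with the sample variance (dividing by n - 1), and showing that the variance is the square of the standard deviation:

```python
import math
import statistics

data = [1, 2, 3, 4, 5]  # squared deviations from the mean sum to 10

population_variance = statistics.pvariance(data)  # 10 / 5
sample_variance = statistics.variance(data)       # 10 / 4

print(population_variance)  # 2
print(sample_variance)      # 2.5

# The variance is the square of the standard deviation
print(math.isclose(statistics.pstdev(data) ** 2, population_variance))
```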

## Skewness

- Skewness is defined as the amount and direction of deviation from horizontal symmetry.
- For many statistical inferences, it's ideal for the distribution to be normal or nearly normal. Skewness is useful as one check for normality. In a normal distribution, skewness is 0. So, a skewness close to 0 is consistent with a nearly normal distribution, although it only indicates symmetry and does not confirm normality on its own.

    - In a normal distribution, the graph appears as a classical, symmetrical bell-shaped curve. The mean, median, and mode (the maximum point on the curve) are equal, and the tails on either side of the curve are exact mirror images of each other.

    - When a distribution is skewed to the left, the tail on the curve's left side is longer than the tail on the right side, and the mean is less than the mode. This situation is called negative skewness.

    - When a distribution is skewed to the right, the tail on the curve's right side is longer than the tail on the left side, and the mean is greater than the mode. This situation is called positive skewness.
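
The sign of the skewness can be sketched with the moment coefficient of skewness, which is the third central moment divided by the second central moment raised to the power 1.5. This is one of several definitions, chosen here as an assumption, and the helper `skewness` and the data are illustrative.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 20]     # long tail on the right
left_skewed = [-20, -4, -3, -2, -1]  # long tail on the left

print(skewness(symmetric))     # 0.0
print(skewness(right_skewed))  # positive
print(skewness(left_skewed))   # negative
```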

## Measures of Shape (Kurtosis)
- Kurtosis measures how heavy-tailed or light-tailed the distribution is relative to a normal distribution.
- Data with high kurtosis tend to have heavy tails or outliers.
- Data with low kurtosis tend to have light tails or few outliers.
- A uniform distribution is an extreme case of low kurtosis.
- Positive excess kurtosis means a heavy-tailed distribution, and negative excess kurtosis means a light-tailed distribution.
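
A minimal sketch of excess kurtosis, assuming the moment-based definition (fourth central moment divided by the squared second central moment, minus 3 so that a normal distribution scores 0). The helper `excess_kurtosis` and both datasets are illustrative.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2 ** 2 - 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

flat = [1, 2, 3, 4, 5]                              # evenly spread, light tails
heavy_tailed = [-10, -1, 0, 0, 0, 0, 0, 0, 1, 10]   # two extreme values

print(excess_kurtosis(flat))          # negative
print(excess_kurtosis(heavy_tailed))  # positive
```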

## Covariance and Correlation
- Covariance and correlation measure the relationship and dependency between two variables. While covariance gives the direction of the linear relationship, correlation gives both direction and strength. Therefore, correlation is a function of covariance. Furthermore, correlation values are standardized, while covariance values are not.

## Correlation

- The correlation coefficient is often referred to as the Pearson correlation coefficient.
- The correlation coefficient between two variables is calculated by dividing their covariance by the product of their individual standard deviations. Since standard deviation measures the absolute variability (or spread) of a data distribution, this division by the product of the standard deviations normalizes the correlation coefficient, ensuring it falls within the range of -1 to +1.
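
The calculation described above can be sketched by hand. The helpers `sample_covariance` and `pearson_correlation` are illustrative names, and the sketch assumes the sample convention of dividing by n - 1. Rescaling one variable changes the covariance but leaves the correlation unchanged, which is the effect of the normalization.

```python
import math

def sample_covariance(xs, ys):
    """Average product of deviations from the means, dividing by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / (n - 1)

def pearson_correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    sd_x = math.sqrt(sample_covariance(xs, xs))  # variance is cov(x, x)
    sd_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (sd_x * sd_y)

x = [1, 2, 3, 4, 5]
y = [5, 4, 3, 2, 1]
y_scaled = [10 * v for v in y]  # same relationship, different units

print(sample_covariance(x, y), pearson_correlation(x, y))
print(sample_covariance(x, y_scaled), pearson_correlation(x, y_scaled))
```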

```python
import pandas as pd
import statistics
import numpy as np
from statsmodels.stats.stattools import medcouple, robust_kurtosis

# Mean
print(statistics.mean([4, 89, 54, -7, -9, 27, 5]))

# Median
x = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
print(statistics.median(x))

# Mode
x = ['Nokia', 'Samsung', 'Samsung', 'Apple', 'Oppo', 'Vivo']
print(statistics.mode(x))

# Interquartile range (IQR)
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
q1 = np.percentile(data, 25)  # 25th percentile (NumPy's default linear interpolation)
q3 = np.percentile(data, 75)  # 75th percentile
iqr = q3 - q1

print("Q1:", q1)
print("Q3:", q3)
print("Interquartile Range (IQR):", iqr)

# Standard deviation (statistics.stdev is the sample version, dividing by n - 1)
x = [1, 2, 3, 4, 5]
print(statistics.stdev(x))

# Variance (statistics.variance is the sample version, dividing by n - 1)
print(statistics.variance(x))

# Skewness, measured with the medcouple (a robust skewness statistic)
x = np.array([1, 2, 3, 4, 7, 8])
skewness = medcouple(x)
print(skewness)

# Kurtosis: robust_kurtosis returns four robust estimators (excess kurtosis by default)
x = np.array([2, 4, 5, 7, 8, 9, 11, 15])
kurtosis = robust_kurtosis(x)
print(kurtosis)

# DataFrame with two columns, x and y
df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [5, 4, 3, 2, 1]})

# Covariance between x and y (off-diagonal entry of the covariance matrix)
cov_xy = df.cov().iloc[0, 1]
print(cov_xy)

# Correlation between x and y (off-diagonal entry of the correlation matrix)
correlation_xy = df.corr().iloc[0, 1]
print(correlation_xy)
```