# An introduction to Statistics as a skillset for Data Science

#### Statistics is one of the pillars of data science: it enables us to transform raw data into actionable insights.

### How Statistics is used in Data Science:
- **Data Understanding**: Statistics helps you summarize and describe the main features of a dataset (e.g., average, variability).
- **Uncertainty Handling**: Real-world data is messy and incomplete. Statistics helps quantify uncertainty and make informed decisions despite it.
- **Pattern Detection**: You can identify trends, correlations, and anomalies that might not be obvious just by looking at raw data.
- **Modeling and Prediction**: Statistical methods are the foundation of machine learning models (especially supervised learning).
- **Hypothesis Testing**: You can test assumptions or claims about data, which is essential for experiments (A/B testing, clinical trials, etc.).

<hr> 

## Understanding the kind of data you’re working with is crucial.

### Types of Data

Importance: 
- Determines what statistical tools and visualizations can be applied.
- Helps avoid incorrect analysis (e.g., calculating the average of nominal categories such as country names makes no sense).
  
Use Cases:
- Choosing encoding strategies for machine learning (e.g., one-hot encoding for nominal data).
- Selecting appropriate models: regression when the target variable is numerical, classification when it is categorical.

| Type            | Description                          | Examples                           |
|-----------------|--------------------------------------|------------------------------------|
| **Numerical**    | Quantitative values                 | Age, salary, temperature           |
| - Continuous     | Any value in a range                | 5.5, 3.14159, 100.01               |
| - Discrete       | Countable values                    | Number of students, cars           |
| **Categorical**  | Qualitative values (labels)         | Gender, country, product category  |
| - Nominal        | No inherent order                   | Red, Blue, Green                   |
| - Ordinal        | Ordered categories                  | Low, Medium, High; Star Ratings    |
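
The two categorical subtypes call for different encodings. The sketch below is a minimal pure-Python illustration: nominal labels get one 0/1 column per category, while ordinal labels are mapped to integer ranks. The helper names `one_hot` and `ordinal_encode` and the sample values are assumptions made for this example; in practice a data-frame or machine learning library would usually do this for you.

```python
def one_hot(values):
    """One-hot encode nominal labels: one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows


def ordinal_encode(values, order):
    """Map ordered categories to integer ranks, following the given order."""
    rank = {label: i for i, label in enumerate(order)}
    return [rank[v] for v in values]


# Nominal data: no inherent order, so each category gets its own column.
colors = ["Red", "Blue", "Green", "Blue"]
columns, encoded_colors = one_hot(colors)
# columns        -> ["Blue", "Green", "Red"]
# encoded_colors -> [[0, 0, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0]]

# Ordinal data: the order carries meaning, so integer ranks preserve it.
levels = ["Low", "High", "Medium", "Low"]
encoded_levels = ordinal_encode(levels, order=["Low", "Medium", "High"])
# encoded_levels -> [0, 2, 1, 0]
```

Using integer ranks for nominal data would invent an order that does not exist, which is why one-hot encoding is the usual choice there.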

<hr>

### Shape of a Distribution

Importance:
- Reveals how data is spread.
- Helps detect skewness, outliers, and the need for transformations.

Use Cases:
- Approximately normal data justifies many parametric tests (e.g., t-test).
- Skewed data may need transformation (e.g., log or Box-Cox) before modeling.

| Shape            | What it Means                          |
|------------------|----------------------------------------|
| **Symmetrical**   | Mean ≈ Median ≈ Mode (e.g., normal)    |
| **Skewed Left**   | Long tail on the left (typically mean < median) |
| **Skewed Right**  | Long tail on the right (typically mean > median) |
| **Uniform**       | All values equally likely              |
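
A quick way to check shape numerically is to compare the mean with the median and to compute a skewness coefficient. The sketch below uses the population (Fisher-Pearson) skewness, i.e. the average of the cubed z-scores. The `skewness` helper and the sample data are made up for illustration.

```python
import math
import statistics


def skewness(data):
    """Population skewness: the mean of cubed z-scores.

    Positive -> long right tail, negative -> long left tail, ~0 -> symmetric.
    """
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return sum(((x - mu) / sigma) ** 3 for x in data) / len(data)


# Hypothetical right-skewed sample: one large value stretches the right tail.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]

mean = statistics.fmean(right_skewed)      # 4.75, pulled up by the value 20
median = statistics.median(right_skewed)   # 3.0, unaffected by the tail
skew = skewness(right_skewed)              # positive for a right-skewed sample

# A log transform compresses large values and reduces the skew here.
log_skew = skewness([math.log(x) for x in right_skewed])
```

The log transform only works for strictly positive values; Box-Cox has the same restriction.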

<hr> 

### Variance and Standard Deviation

Importance:
- Measures **spread** or **dispersion** in data.
- Low spread = consistent data; High spread = more variability.

Use Cases:
- Feature scaling (e.g., in SVM or KNN, where distance matters).
- Risk analysis in finance (e.g., high standard deviation = high volatility).
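
Variance is the average squared deviation from the mean, and the standard deviation is its square root, which puts the spread back in the original units. The sketch below, on made-up numbers, shows the population versions (divide by n), the sample versions (divide by n - 1), and z-score standardization as one common form of feature scaling.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements, mean = 5

# Population statistics: treat the data as the whole population (divide by n).
pop_var = statistics.pvariance(values)   # 4
pop_std = statistics.pstdev(values)      # 2.0

# Sample statistics: treat the data as a sample (divide by n - 1).
sample_var = statistics.variance(values)  # 32 / 7, slightly larger
sample_std = statistics.stdev(values)

# Standardization: rescale so the feature has mean 0 and standard deviation 1.
mu = statistics.fmean(values)
z_scores = [(x - mu) / pop_std for x in values]
```

After standardization, features measured in different units contribute comparably to distance-based models.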


<hr>

## Percentiles and Quartiles

Importance:
- Provide robust measures of spread (much less affected by extreme values than the mean and standard deviation).
- IQR is useful for outlier detection.

Use Cases:
- Setting grading curves in education.
- Understanding salary distribution (e.g., top 25% earners).
- Detecting data anomalies (e.g., fraud detection).

| Term             | Meaning                                 |
|------------------|------------------------------------------|
| **Percentile**    | Value below which a % of data falls     |
| **Quartiles**     | Divide data into four parts             |
| - Q1 (25%)        | Lower quartile                          |
| - Q2 (50%)        | Median                                  |
| - Q3 (75%)        | Upper quartile                          |
| **IQR**           | Interquartile range = Q3 - Q1           |
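
The terms in the table can be computed with Python's standard `statistics.quantiles` function. The salary figures below are invented for illustration, and the `"inclusive"` interpolation method is one of several conventions, so other tools may return slightly different quartiles for the same data.

```python
import statistics

# Hypothetical salaries in thousands; note the single very high value.
salaries = [32, 35, 38, 41, 45, 48, 52, 60, 75, 120]

# n=4 returns the three cut points that split the data into four parts.
q1, q2, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
iqr = q3 - q1
# q1 = 38.75, q2 = 46.5 (the median), q3 = 58.0, iqr = 19.25

# n=10 returns the nine deciles; the last one is the 90th percentile.
p90 = statistics.quantiles(salaries, n=10, method="inclusive")[-1]
# p90 = 79.5
```

The value 120 barely moves the quartiles, which is what makes the IQR a robust measure of spread.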


<hr>

## Outlier Detection (Z-score / IQR)

Importance:
- Outliers can distort statistical models and mean values.
- Important for cleaning and preprocessing.

Use Cases:
- Data cleaning before ML modeling.
- Detecting fraudulent transactions or sensor failures.
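
Both methods named in the heading can be sketched in a few lines. The Z-score rule flags points more than a chosen number of standard deviations from the mean, and the IQR rule flags points beyond 1.5 × IQR outside the quartiles. The thresholds of 3 and 1.5 are common conventions rather than fixed rules, and the function names and sensor readings are assumptions for this example.

```python
import statistics


def zscore_outliers(data, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    if sigma == 0:  # all values identical: nothing can be an outlier
        return []
    return [x for x in data if abs(x - mu) / sigma > threshold]


def iqr_outliers(data, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]


# Hypothetical sensor readings with one faulty value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 10, 11, 13, 95]

z_flagged = zscore_outliers(readings)    # [95]
iqr_flagged = iqr_outliers(readings)     # [95]
```

The Z-score rule relies on the mean and standard deviation, which the outliers themselves inflate, so on small samples it can miss extreme points that the IQR rule still catches.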


<hr>


## Correlation and Covariance

Importance:
- Identifies relationships between variables.
- Helps you avoid redundant or highly collinear features in models.

Use Cases:
- Feature selection (e.g., removing multicollinearity before linear regression).
- Recommender systems (e.g., item similarity).
- Market basket analysis (e.g., product A correlates with product B).

| Concept       | Meaning                                      | Range          |
|---------------|----------------------------------------------|----------------|
| **Covariance** | Direction of a linear relationship (magnitude depends on the units) | -∞ to +∞ |
| **Correlation**| Strength & direction of a linear relationship (normalized covariance) | -1 to +1 |
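
To make "normalized covariance" concrete, the sketch below computes the sample covariance by hand and divides it by the product of the two standard deviations to get Pearson's correlation. The helper names and the study-hours data are invented for illustration.

```python
import statistics


def covariance(x, y):
    """Sample covariance: average product of deviations (divides by n - 1)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def pearson_r(x, y):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(x, y) / (statistics.stdev(x) * statistics.stdev(y))


# Hypothetical data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]

cov = covariance(hours, score)   # 11.25: positive, but the size depends on units
r = pearson_r(hours, score)      # close to +1: strong positive linear relationship

# Rescaling a variable changes the covariance but not the correlation.
score_x10 = [s * 10 for s in score]
cov_scaled = covariance(hours, score_x10)   # ten times larger
r_scaled = pearson_r(hours, score_x10)      # unchanged
```

This is why correlation, not covariance, is the usual choice when comparing relationships across features with different units. Note that both only capture linear relationships.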


<hr>

## Probability Basics

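**Probability** = Likelihood of an event (from 0 to 1)

Key Concepts:
- Independent events: the outcome of one does not change the probability of the other.
- Dependent events: the outcome of one changes the probability of the other.
- Conditional probability: the probability of A given that B has occurred, written P(A | B).
- Bayes’ Theorem: updates the probability of a hypothesis after seeing evidence, P(A | B) = P(B | A) · P(A) / P(B).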

Importance:
- Foundation of inferential statistics and machine learning.
- Quantifies uncertainty.

Use Cases:
- Predictive modeling (e.g., Naive Bayes, probabilistic models).
- Spam detection, risk scoring, recommendation engines.
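
As a small worked example of conditional probability and Bayes’ Theorem in the spam-detection setting, the sketch below computes the probability that an email is spam given that it contains a particular word. All probabilities are hypothetical numbers chosen for the example, and `bayes_posterior` is an illustrative helper, not a library function.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """P(H | E) via Bayes' Theorem for a binary hypothesis H.

    prior               = P(H)
    likelihood          = P(E | H)
    false_positive_rate = P(E | not H)
    """
    # Total probability of the evidence, P(E), over both cases.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical figures: 20% of mail is spam; the word appears in
# 60% of spam emails and in 5% of legitimate emails.
p_spam_given_word = bayes_posterior(
    prior=0.20, likelihood=0.60, false_positive_rate=0.05
)
# 0.12 / (0.12 + 0.04) = 0.75
```

Seeing the word raises the probability of spam from the 20% prior to 75%. A Naive Bayes classifier repeats this update across many words under the assumption that they are independent given the class.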


<hr>


## Inferential Statistics

Importance:
- Allows generalizing from samples to populations.
- Helps test hypotheses and estimate parameters.

Use Cases:
- A/B testing for product changes or marketing strategies.
- Clinical trials for drug effectiveness.
- Determining if observed differences are statistically significant.

| Tool/Concept            | Use Case                                 |
|--------------------------|-------------------------------------------|
| **Confidence Intervals** | Estimate range of a population parameter |
| **Hypothesis Testing**   | Test claims using data                   |
| - Null vs. Alternative   | \( H_0 \) vs. \( H_1 \)                   |
| - p-value                | Probability of a result at least as extreme as the one observed, assuming \( H_0 \) is true |
| **t-tests / z-tests**    | Compare means                            |
| **Chi-Square Test**      | Test independence or goodness of fit for categorical data |
| **ANOVA**                | Compare means across three or more groups |
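
For the A/B testing use case, a two-proportion z-test and a confidence interval can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: the samples are large enough for the normal approximation, the conversion counts are invented, and the function names are illustrative. The p-value is the probability of a difference at least this extreme if the null hypothesis of equal rates were true.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for H0: both groups share the same success rate."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)  # rate under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a proportion."""
    p = successes / n
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z_crit * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin


# Hypothetical experiment: variant A converts 200/2000, variant B 250/2000.
z, p_value = two_proportion_z_test(200, 2000, 250, 2000)
low, high = proportion_ci(250, 2000)

significant = p_value < 0.05  # compare against a pre-chosen significance level
```

For small samples or means with unknown variance, a t-test would be the more appropriate tool than this z-based approximation.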


<hr>

## Distributions

Importance:
- Underpin models and statistical assumptions.
- Choosing a distribution that matches how the data was generated can improve model fit and prediction accuracy.

Use Cases:
- Poisson: modeling website traffic or call center events.
- Binomial: success/failure outcomes like click-through rates.
- Normal: quality control, measurement errors.
- Exponential: modeling wait times or survival analysis.

| Distribution     | Use Case                          |
|------------------|-----------------------------------|
| **Normal**        | Many natural measurements (e.g., heights, measurement errors) |
| **Binomial**      | Number of successes in a fixed number of yes/no trials |
| **Poisson**       | Count of events in a fixed interval of time or space |
| **Uniform**       | Equal chance outcomes             |
| **Exponential**   | Time between events               |
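
The probabilities behind these distributions are short formulas. The sketch below writes out the binomial and Poisson probability mass functions and the exponential cumulative distribution function directly, and uses `statistics.NormalDist` for the normal case. The function names and parameter values are assumptions chosen to mirror the use cases above.

```python
import math
from statistics import NormalDist


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success prob p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(exactly k events in an interval with average rate lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def exponential_cdf(t, rate):
    """P(waiting time <= t) when events occur at the given average rate."""
    return 1 - math.exp(-rate * t)


# Binomial: 3 clicks out of 10 impressions at a 20% click-through rate.
p_3_clicks = binomial_pmf(3, n=10, p=0.2)          # about 0.201

# Poisson: no calls in an hour when the average is 4 calls per hour.
p_no_calls = poisson_pmf(0, lam=4)                 # exp(-4), about 0.018

# Exponential: next event within 1 minute at 0.5 events per minute.
p_wait_under_1 = exponential_cdf(1, rate=0.5)      # about 0.393

# Normal: share of values within one standard deviation of the mean.
std_normal = NormalDist(mu=0, sigma=1)
p_within_1sd = std_normal.cdf(1) - std_normal.cdf(-1)   # about 0.683
```

The Poisson and exponential cases are two views of the same process: the first counts events per interval, the second measures the time between them.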