<a href="https://colab.research.google.com/github/Tanu-N-Prabhu/Python/blob/master/Data%20Analysis/Level%202/descriptive_statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Level 2: Exploration

_Understand the Data_

**Goal**: Use Exploratory Data Analysis (EDA) to explore and summarize the dataset, helping you understand patterns, anomalies, and insights before modeling.

## What to Learn: Descriptive Statistics
Descriptive statistics summarize and describe the features of a dataset. These statistics give you a quick overview of the data distribution and help you make sense of what you're working with.

## 1. Basic Descriptive Methods

`.describe()` - Summary of numeric columns

In [2]:
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 32, 40, 29],
    'Score': [88.5, 92.0, 79.5, 85.0]
})

df.describe()

|       | Age      | Score    |
| ----- | -------- | -------- |
| count | 4.0      | 4.0      |
| mean  | 31.5     | 86.25    |
| std   | 6.350853 | 5.330729 |
| min   | 25.0     | 79.5     |
| 25%   | 28.0     | 83.625   |
| 50%   | 30.5     | 86.75    |
| 75%   | 34.0     | 89.375   |
| max   | 40.0     | 92.0     |


### Output includes

- `count`: total non-null values

- `mean`: average value

- `std`: standard deviation

- `min`, `25%`, `50%`, `75%`, `max`: the smallest value, the three quartiles (`50%` is the median), and the largest value
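
Because `.describe()` returns an ordinary DataFrame, individual statistics can be read back out of it by label. The sketch below rebuilds the same `df` and assumes the names `summary`, `median_score`, and `iqr_score`, which are illustrative and not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 32, 40, 29],
    'Score': [88.5, 92.0, 79.5, 85.0]
})

summary = df.describe()  # a regular DataFrame indexed by statistic name

# Pull out one statistic by row label and column name
median_score = summary.loc['50%', 'Score']
print(median_score)  # 86.75

# Interquartile range (IQR): distance between the 75th and 25th percentiles
iqr_score = summary.loc['75%', 'Score'] - summary.loc['25%', 'Score']
print(iqr_score)     # 5.75
```

The IQR is one common way to turn the quartile rows into a single measure of spread.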



`.info()` - Data types and non-null counts

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Age     4 non-null      int64  
 1   Score   4 non-null      float64
dtypes: float64(1), int64(1)
memory usage: 196.0 bytes


Use it to:
- Check each column's data type
- Spot missing values: a non-null count lower than the number of entries means the column contains nulls
- See how much memory the DataFrame uses
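
The example DataFrame has no nulls, so here is a minimal sketch of what a missing value looks like in practice. The `df_missing` data is hypothetical, and `.isna().sum()` is shown as one possible complement to `.info()`, not as something the original notebook uses.

```python
import pandas as pd

# Hypothetical data with one missing score (None is stored as NaN)
df_missing = pd.DataFrame({
    'Age': [25, 32, 40, 29],
    'Score': [88.5, None, 79.5, 85.0]
})

# Count the missing values in each column
missing_per_column = df_missing.isna().sum()
print(missing_per_column)

# Data types per column, the same information .info() lists under Dtype
print(df_missing.dtypes)
```

Here `missing_per_column` reports 0 for `Age` and 1 for `Score`, which matches what `.info()` would show as 4 and 3 non-null values.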

## 2. Central Tendency

Mean, Median, Mode

In [4]:
df['Score'].mean()   # Average score
df['Score'].median() # Middle value
df['Score'].mode()   # Most frequent value(s)

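|     | Score |
| --- | ----- |
| 0   | 79.5  |
| 1   | 85.0  |
| 2   | 88.5  |
| 3   | 92.0  |

A notebook cell displays only its last expression, so this output is the result of `.mode()`. No score repeats, so every value ties as the most frequent and all four are returned. The mean (86.25) and median (86.75) match the `mean` and `50%` rows of `.describe()`.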
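
To see all three measures from one cell, each result can be printed explicitly. This is a minimal sketch: `scores` and `repeated` are illustrative names, and `repeated` is hypothetical data added to show a mode with a single winner.

```python
import pandas as pd

scores = pd.Series([88.5, 92.0, 79.5, 85.0])

print("Mean:", scores.mean())      # Mean: 86.25
print("Median:", scores.median())  # Median: 86.75

# .mode() returns a Series because several values can tie
print(scores.mode().tolist())      # [79.5, 85.0, 88.5, 92.0]

# Hypothetical series where 85.0 appears twice
repeated = pd.Series([85.0, 92.0, 85.0, 79.5])
print(repeated.mode().tolist())    # [85.0]
```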


## 3. Spread / Variability

Range, Standard Deviation, Variance

In [5]:
score_range = df['Score'].max() - df['Score'].min()
std_dev = df['Score'].std()
variance = df['Score'].var()

print("Range:", score_range)
print("Standard Deviation:", std_dev)
print("Variance:", variance)

Range: 12.5
Standard Deviation: 5.330728530573158
Variance: 28.416666666666668
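
The variance printed above can be reproduced by hand, which also shows what pandas does by default: `.var()` and `.std()` divide by `n - 1` (the sample formulas, `ddof=1`) rather than by `n`. A minimal sketch, with `scores`, `sample_var`, and `population_var` as illustrative names:

```python
import pandas as pd

scores = pd.Series([88.5, 92.0, 79.5, 85.0])

n = len(scores)
# Squared distance of each score from the mean (86.25)
squared_deviations = (scores - scores.mean()) ** 2

sample_var = squared_deviations.sum() / (n - 1)  # pandas default, ddof=1
population_var = squared_deviations.sum() / n    # same as ddof=0

print(sample_var, scores.var())            # both ≈ 28.4167
print(population_var, scores.var(ddof=0))  # both ≈ 21.3125
print(sample_var ** 0.5, scores.std())     # both ≈ 5.3307
```

The standard deviation is the square root of the variance, so it is expressed in the same units as the scores themselves.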


> **Why does this matter?**
>
> Variability tells us how spread out the data is. A high variance means scores are scattered widely, while low variance means they're clustered near the mean.

## 4. Value Counts & Frequency

`.value_counts()` - Frequency of categorical values

In [6]:
df = pd.DataFrame({'Status': ['Pass', 'Fail', 'Pass', 'Pass']})
df['Status'].value_counts()

| Status | count |
| ------ | ----- |
| Pass   | 3     |
| Fail   | 1     |


Use this to:

- See how many entries belong to each category
- Detect class imbalance (e.g. 95% Pass, 5% Fail)
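
Class imbalance is usually easier to judge as proportions than as raw counts. The sketch below uses the `normalize=True` option of `.value_counts()`. The names `df_status` and `shares` are illustrative.

```python
import pandas as pd

df_status = pd.DataFrame({'Status': ['Pass', 'Fail', 'Pass', 'Pass']})

# normalize=True returns each category's share of the rows instead of a raw count
shares = df_status['Status'].value_counts(normalize=True)
print(shares)
```

For this data, `Pass` accounts for 0.75 of the rows and `Fail` for 0.25.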

## 5. Correlation Between Columns

`.corr()` - Pearson correlation coefficient

In [7]:
df = pd.DataFrame({
    'Age': [25, 32, 40, 29],
    'Score': [88.5, 92.0, 79.5, 85.0]
})

df.corr()

|       | Age       | Score     |
| ----- | --------- | --------- |
| Age   | 1.0       | -0.649836 |
| Score | -0.649836 | 1.0       |


Use `.corr()` to measure how strongly two numeric columns are linearly related:

- Values range from -1 (perfect inverse relationship) to +1 (perfect direct relationship)
- Values near 0 indicate little or no linear relationship
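
When only one pair of columns matters, `Series.corr()` returns a single number instead of the full matrix. The sketch below also builds two hypothetical series to illustrate the extremes of the range. The names `r` and `x` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 32, 40, 29],
    'Score': [88.5, 92.0, 79.5, 85.0]
})

# Correlation between two specific columns (Pearson by default)
r = df['Age'].corr(df['Score'])
print(round(r, 6))  # -0.649836

# Hypothetical series built to hit the two extremes
x = pd.Series([1, 2, 3, 4])
print(x.corr(2 * x + 1))  # ≈ 1.0, perfect direct relationship
print(x.corr(-3 * x))     # ≈ -1.0, perfect inverse relationship
```

With only four rows, the -0.65 here is a toy result; a coefficient from so few points says little about the real relationship.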

## Summary Table

| Task                   | Method                        | Use It For                            |
| ---------------------- | ----------------------------- | ------------------------------------- |
| Summary of dataset     | `df.describe()`               | Get mean, std, min, quartiles         |
| Data types & nulls     | `df.info()`                   | Understand structure and missing data |
| Mean / Median / Mode   | `.mean()`, `.median()`, `.mode()` | Measure central tendency              |
| Spread (std/var/range) | `.std()`, `.var()`, `.max() - .min()` | See how spread out the values are     |
| Frequency counts       | `.value_counts()`             | Check class balance or group sizes    |
| Correlation            | `.corr()`                     | See how features relate to each other |
