<a href="https://colab.research.google.com/github/SURESHBEEKHANI/Statistics-For-Data-Science-learining/blob/main/Intermediate_Statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Intermediate Statistics**

This notebook covers key concepts in intermediate statistics: measures of central tendency, measures of dispersion, skewness, Z-scores, and standardization/normalization.

### **1. Measures of Central Tendency**

Measures of central tendency summarize a dataset by identifying a central or typical value.

- **Mean**: Represents the average value of the data points. It is calculated by summing all the values and dividing by the total number of observations.

- **Median**: Refers to the middle value in a dataset when the data points are arranged in ascending or descending order. For an even number of observations, it is the average of the two middle values.

- **Mode**: Indicates the most frequently occurring value(s) in the dataset. A dataset may have one mode (unimodal), more than one mode (multimodal), or no mode if all values occur with the same frequency.


In [None]:
# Import the NumPy library for numerical operations, aliasing it as 'np'
import numpy as np

# Import the 'stats' module from the SciPy library for statistical functions
from scipy import stats

# Define a sample dataset as a list of integers
data = [2, 4, 6, 8, 10, 10, 12]

# Calculate the mean (average) of the dataset using NumPy's mean function
mean = np.mean(data)

# Calculate the median (middle value) of the dataset using NumPy's median function
median = np.median(data)

# Calculate the mode (most frequent value) of the dataset using SciPy's stats.mode function
# The result is stored in a structured object that contains the mode value and its count
mode_result = stats.mode(data)

# Print the calculated mean value in a formatted string
print(f"Mean: {mean}")

# Print the calculated median value in a formatted string
print(f"Median: {median}")

# Print the mode result (both the mode value and its frequency) in a formatted string
print(f"Mode: {mode_result}")


Mean: 7.428571428571429
Median: 8.0
Mode: ModeResult(mode=10, count=2)
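
The dataset above has an odd number of values and a single mode. As a minimal sketch of the two other cases described earlier (an even number of observations, and more than one mode), the following uses Python's standard-library `statistics` module. The dataset `even_data` is made up for illustration. `scipy.stats.mode` reports only one value when several are tied, so `statistics.multimode` is used here to list them all.

```python
import statistics

# Hypothetical dataset: six values (even count), with 4 and 8 each appearing twice
even_data = [2, 4, 4, 6, 8, 8]

# Median of an even-length dataset: average of the two middle values (4 and 6)
median_even = statistics.median(even_data)

# multimode returns every value tied for the highest frequency
modes = statistics.multimode(even_data)

# When all values occur equally often, every value is returned
all_tied = statistics.multimode([1, 2, 3])

print(f"Median: {median_even}")
print(f"Modes: {modes}")
print(f"All tied: {all_tied}")
```

Whether a dataset in which every value is tied counts as having "no mode" is a matter of convention; `multimode` simply returns all of the tied values.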


### **2. Measures of Dispersion**

Dispersion quantifies the spread or variability of a dataset.

- **Range**: The difference between the maximum and minimum values in the dataset.

- **Variance**: The average of the squared deviations from the mean. Larger values indicate that the data points lie farther from the mean.

- **Standard Deviation**: Represents the square root of variance, providing a measure of spread in the same units as the data.

- **Interquartile Range (IQR)**: The difference between the 75th percentile (Q3) and the 25th percentile (Q1). It measures the spread of the middle 50% of the data.
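
For reference, the population and sample variance can be written as below. The notation is introduced here for illustration and is not taken from the notebook: $x_i$ are the observations, $N$ and $n$ are the number of observations, and $\mu$ and $\bar{x}$ are the population and sample means.

$$
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2,
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2
$$

The corresponding standard deviations are $\sigma = \sqrt{\sigma^2}$ and $s = \sqrt{s^2}$.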



In [None]:
import numpy as np

# Sample dataset
data = [15, 20, 35, 40, 50, 60, 75]

# 1. Range
data_range = max(data) - min(data)

# 2. Variance
# Population variance
variance_population = np.var(data)
# Sample variance
variance_sample = np.var(data, ddof=1)

# 3. Standard Deviation
# Population standard deviation
std_dev_population = np.sqrt(variance_population)
# Sample standard deviation
std_dev_sample = np.sqrt(variance_sample)

# 4. Interquartile Range (IQR)
Q1 = np.percentile(data, 25)  # 25th percentile
Q3 = np.percentile(data, 75)  # 75th percentile
IQR = Q3 - Q1

# Display the results
print("Measures of Dispersion:")
print(f"Range: {data_range}")
print(f"Population Variance: {variance_population:.2f}")
print(f"Sample Variance: {variance_sample:.2f}")
print(f"Population Standard Deviation: {std_dev_population:.2f}")
print(f"Sample Standard Deviation: {std_dev_sample:.2f}")
print(f"Interquartile Range (IQR): {IQR:.2f}")


Measures of Dispersion:
Range: 60
Population Variance: 391.84
Sample Variance: 457.14
Population Standard Deviation: 19.79
Sample Standard Deviation: 21.38
Interquartile Range (IQR): 27.50
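
The `ddof` argument controls the divisor: `np.var(data)` divides the sum of squared deviations by `n`, while `ddof=1` divides by `n - 1`. A minimal sketch that reproduces both variances by hand in plain Python, on the same dataset (the variable names are illustrative):

```python
# Same dataset as in the cell above
data = [15, 20, 35, 40, 50, 60, 75]

n = len(data)
mean = sum(data) / n

# Sum of squared deviations from the mean
squared_deviations = sum((x - mean) ** 2 for x in data)

# Population variance divides by n; sample variance divides by n - 1
variance_population = squared_deviations / n
variance_sample = squared_deviations / (n - 1)

print(f"Population Variance: {variance_population:.2f}")
print(f"Sample Variance: {variance_sample:.2f}")
```

The sample variance is always the larger of the two for the same data, and the gap shrinks as `n` grows.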


### **Percentiles and Quartiles**

**Percentiles**

- Indicate the value below which a given percentage of the data falls.
- **Examples**:
  - 25th Percentile (P25): 25% of the data is below this value.
  - 90th Percentile (P90): 90% of the data is below this value.
- Used in:
  - Test scores
  - Statistical trends

**Quartiles**

- Three cut points that divide the sorted data into four equal parts:
  1. **Q1**: 25th percentile (lower quartile).
  2. **Q2**: 50th percentile (median).
  3. **Q3**: 75th percentile (upper quartile).
- **Interquartile Range (IQR)**: Q3 - Q1, the spread of the middle 50% of the data.

**Key Differences**

| Feature         | Percentiles              | Quartiles               |
|------------------|--------------------------|-------------------------|
| Division         | 100 equal parts          | 4 equal parts           |
| Examples         | P10, P25, P90            | Q1, Q2 (Median), Q3     |
| Focus            | Any percentage           | 25%, 50%, 75%           |


In [3]:
import numpy as np

# Sample dataset
data = [15, 20, 35, 40, 50, 60, 65, 70, 80, 90]

# Sort the data for display (np.percentile does not require sorted input)
data.sort()

# Calculate Percentiles
percentile_25 = np.percentile(data, 25)  # 25th percentile (Q1)
percentile_50 = np.percentile(data, 50)  # 50th percentile (Median or Q2)
percentile_75 = np.percentile(data, 75)  # 75th percentile (Q3)

# Calculate Quartiles
Q1 = percentile_25
Q2 = percentile_50
Q3 = percentile_75

# Calculate Interquartile Range (IQR)
IQR = Q3 - Q1

# Calculate bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers
outliers = [x for x in data if x < lower_bound or x > upper_bound]

# Display results
print(f"Dataset: {data}")
print(f"Q1 (25th Percentile): {Q1}")
print(f"Q2 (Median, 50th Percentile): {Q2}")
print(f"Q3 (75th Percentile): {Q3}")
print(f"Interquartile Range (IQR): {IQR}")
print(f"Lower Bound for Outliers: {lower_bound}")
print(f"Upper Bound for Outliers: {upper_bound}")
print(f"Outliers: {outliers}")

# Calculate any other percentile (e.g., 90th)
percentile_90 = np.percentile(data, 90)
print(f"90th Percentile: {percentile_90}")


Dataset: [15, 20, 35, 40, 50, 60, 65, 70, 80, 90]
Q1 (25th Percentile): 36.25
Q2 (Median, 50th Percentile): 55.0
Q3 (75th Percentile): 68.75
Interquartile Range (IQR): 32.5
Lower Bound for Outliers: -12.5
Upper Bound for Outliers: 117.5
Outliers: []
90th Percentile: 81.0
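
Q1 comes out as 36.25 even though that value is not in the dataset, because `np.percentile` interpolates linearly between neighbouring sorted values by default. The following is a rough sketch of that default rule in plain Python. `percentile_linear` is a hypothetical helper written for illustration, not a NumPy function.

```python
import math

def percentile_linear(values, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    ordered = sorted(values)
    # Fractional 0-based position of the percentile in the sorted data
    position = (q / 100) * (len(ordered) - 1)
    lower = math.floor(position)
    # Clamp so the upper neighbour never runs past the last element
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    # Move `fraction` of the way from the lower neighbour to the upper one
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

data = [15, 20, 35, 40, 50, 60, 65, 70, 80, 90]

for q in (25, 50, 75, 90):
    print(f"{q}th percentile: {percentile_linear(data, q):.2f}")
```

For the 25th percentile the position is 0.25 × 9 = 2.25, so the result lies a quarter of the way from the third value (35) to the fourth (40).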


In [2]:
import numpy as np

# Dataset
data = [10, 15, 20, 25, 30, 35, 40, 100]

# Sort the data (optional: np.percentile does not require sorted input)
data.sort()

# Calculate Quartiles and IQR
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1

# Calculate bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Check if 30 is within bounds
is_30_outlier = not (lower_bound <= 30 <= upper_bound)

# Results
print(f"Q1 (25th Percentile): {Q1}")
print(f"Q3 (75th Percentile): {Q3}")
print(f"IQR: {IQR}")
print(f"Lower Bound: {lower_bound}")
print(f"Upper Bound: {upper_bound}")
print(f"Is 30 an Outlier? {'Yes' if is_30_outlier else 'No'}")


Q1 (25th Percentile): 18.75
Q3 (75th Percentile): 36.25
IQR: 17.5
Lower Bound: -7.5
Upper Bound: 62.5
Is 30 an Outlier? No
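
The check above looks only at the value 30. As a sketch of how the same 1.5 × IQR bounds could be applied to every value in the dataset, the hypothetical helper `iqr_outliers` below (not part of the original notebook) returns all points outside the bounds. It uses the standard-library `statistics.quantiles` with `method="inclusive"`, which gives the same quartiles as `np.percentile` for this dataset (18.75 and 36.25).

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # n=4 yields the three quartile cut points; "inclusive" interpolates
    # linearly between the sorted data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    return [x for x in values if x < lower_bound or x > upper_bound]

data = [10, 15, 20, 25, 30, 35, 40, 100]
outliers = iqr_outliers(data)
print(f"Outliers: {outliers}")
```

For this dataset the only value flagged is 100, which exceeds the upper bound of 62.5.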
