# Descriptive Statistics 

**Goal:** Understand and compute descriptive statistics (mean, median, mode, range, variance, standard deviation, IQR, skewness, kurtosis).

**Data:** Marks and ages of 10 students

---

## 1: Import Libraries

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Set up plotting style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

print("All libraries imported successfully!")

All libraries imported successfully!


## 2: Create Sample Dataset

We'll use marks and ages of 10 students as our example data.

In [7]:
# Create a dataset
data = {
    "Student_ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Marks": [45, 56, 67, 89, 76, 90, 34, 55, 68, 72],
    "Age": [15, 16, 16, 15, 17, 16, 15, 14, 17, 16]
}

df = pd.DataFrame(data)
print("Dataset:")
print(df)
print(f"\nShape: {df.shape}")
print(f"Data types:\n{df.dtypes}")

Dataset:
   Student_ID  Marks  Age
0           1     45   15
1           2     56   16
2           3     67   16
3           4     89   15
4           5     76   17
5           6     90   16
6           7     34   15
7           8     55   14
8           9     68   17
9          10     72   16

Shape: (10, 3)
Data types:
Student_ID    int64
Marks         int64
Age           int64
dtype: object


## 3: Measures of Central Tendency

### 3.1 Mean (Average)
The arithmetic average: the sum of all values divided by the number of values.

In [8]:
# Calculate mean
marks = df["Marks"]
mean_marks = marks.mean()

print(f"Marks: {marks.tolist()}")
print(f"Sum: {marks.sum()}")
print(f"Count: {len(marks)}")
print(f"\nMean = Sum / Count = {marks.sum()} / {len(marks)} = {mean_marks}")
print(f"\nUsing NumPy: {np.mean(marks)}")
print(f"Using Pandas: {marks.mean()}")

Marks: [45, 56, 67, 89, 76, 90, 34, 55, 68, 72]
Sum: 652
Count: 10

Mean = Sum / Count = 652 / 10 = 65.2

Using NumPy: 65.2
Using Pandas: 65.2


### 3.2 Median (Middle Value)
The middle value of the sorted data; with an even count, it is the average of the two middle values. Unlike the mean, it is barely affected by outliers.

In [9]:
# Calculate median
median_marks = marks.median()
marks_sorted = sorted(marks)

print(f"Marks (sorted): {marks_sorted}")
print(f"Count: {len(marks)} (even number)")
print(f"Middle two values: {marks_sorted[4]} and {marks_sorted[5]}")
print(f"\nMedian = ({marks_sorted[4]} + {marks_sorted[5]}) / 2 = {median_marks}")
print(f"\nUsing Pandas: {marks.median()}")

Marks (sorted): [34, 45, 55, 56, 67, 68, 72, 76, 89, 90]
Count: 10 (even number)
Middle two values: 67 and 68

Median = (67 + 68) / 2 = 67.5

Using Pandas: 67.5
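
To illustrate the robustness claim, the sketch below appends one hypothetical extreme score (a made-up value of 500 that is not part of the dataset) and compares how far the mean and the median move. The name `marks_with_outlier` is introduced here purely for illustration.

```python
import pandas as pd

marks = pd.Series([45, 56, 67, 89, 76, 90, 34, 55, 68, 72])

# Hypothetical outlier for illustration only; it is not in the original data
marks_with_outlier = pd.concat([marks, pd.Series([500])], ignore_index=True)

# How much does each measure move when the extreme value is added?
mean_shift = marks_with_outlier.mean() - marks.mean()
median_shift = marks_with_outlier.median() - marks.median()

print(f"Mean:   {marks.mean():.2f} -> {marks_with_outlier.mean():.2f} (shift {mean_shift:.2f})")
print(f"Median: {marks.median():.2f} -> {marks_with_outlier.median():.2f} (shift {median_shift:.2f})")
```

A single extreme value drags the mean by dozens of points, while the median only moves to the next sorted value.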


### 3.3 Mode (Most Frequent Value)
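The value that appears most frequently. If every value occurs exactly once, the data has no meaningful mode; `Series.mode()` then returns all of the values, because they are all tied for the highest count.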

In [10]:
# Calculate mode
mode_marks = marks.mode()
has_mode = marks.value_counts().max() > 1  # a mode needs at least one repeated value

print(f"Marks: {marks.tolist()}")
print("\nValue counts:")
print(marks.value_counts().sort_index())
print(f"\nMode (most frequent value): {mode_marks.tolist() if has_mode else 'No mode (all values appear once)'}")
print(f"\nUsing Pandas: {marks.mode().tolist()}")

Marks: [45, 56, 67, 89, 76, 90, 34, 55, 68, 72]

Value counts:
Marks
34    1
45    1
55    1
56    1
67    1
68    1
72    1
76    1
89    1
90    1
Name: count, dtype: int64

Mode (most frequent value): No mode (all values appear once)

Using Pandas: [34, 45, 55, 56, 67, 68, 72, 76, 89, 90]
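
Because no mark repeats, the Marks column is not a good showcase for the mode. As a hedged illustration, the sketch below applies the same idea to the Age column of the dataset, where values do repeat. The variable names `counts`, `top_count`, and `modes` are introduced here for illustration.

```python
import pandas as pd

# Age column from the sample dataset
ages = pd.Series([15, 16, 16, 15, 17, 16, 15, 14, 17, 16], name="Age")

counts = ages.value_counts()   # frequency of each distinct age
top_count = counts.max()       # highest frequency observed

# Only report a mode when at least one value repeats
if top_count > 1:
    modes = sorted(counts[counts == top_count].index.tolist())
else:
    modes = []

print(counts.sort_index())
print(f"Mode(s): {modes} (each appears {top_count} times)")
print(f"Using Pandas: {ages.mode().tolist()}")
```

Returning a list keeps the door open for multimodal data, where several values tie for the highest count.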


### Summary: Central Tendency
All three measures in one place:

In [11]:
print("\n" + "="*50)
print("MEASURES OF CENTRAL TENDENCY - Summary")
print("="*50)
print(f"Mean:   {marks.mean():.2f}")
print(f"Median: {marks.median():.2f}")
print(f"Mode:   {marks.mode().tolist() if marks.value_counts().max() > 1 else 'None (all values appear once)'}")
print("="*50)

print("\nInterpretation:")
print(f"- On average, students scored {marks.mean():.1f} marks")
print(f"- Half the students scored below {marks.median():.1f} marks and half above")
print("- Mean and median are close, suggesting a roughly symmetric distribution")


==================================================
MEASURES OF CENTRAL TENDENCY - Summary
==================================================
Mean:   65.20
Median: 67.50
Mode:   None (all values appear once)
==================================================

Interpretation:
- On average, students scored 65.2 marks
- Half the students scored below 67.5 marks and half above
- Mean and median are close, suggesting a roughly symmetric distribution


## 4: Measures of Dispersion

### 4.1 Range
The difference between the maximum and minimum values. It is the simplest measure of spread, but it depends only on the two most extreme values.

In [12]:
# Calculate range
max_marks = marks.max()
min_marks = marks.min()
range_marks = max_marks - min_marks

print(f"Marks: {marks.tolist()}")
print(f"\nMax: {max_marks}")
print(f"Min: {min_marks}")
print(f"Range = Max - Min = {max_marks} - {min_marks} = {range_marks}")
print(f"\nInterpretation: Marks vary by {range_marks} points")

Marks: [45, 56, 67, 89, 76, 90, 34, 55, 68, 72]

Max: 90
Min: 34
Range = Max - Min = 90 - 34 = 56

Interpretation: Marks vary by 56 points


### 4.2 Variance
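The average of the squared deviations from the mean. The sample variance divides the sum of squared deviations by n - 1; the population variance divides it by n.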

In [13]:
# Calculate variance
variance_marks = marks.var()  # Pandas uses n-1 (sample variance) by default
variance_marks_pop = marks.var(ddof=0)  # Population variance (divide by n)

# Manual calculation to show steps
mean = marks.mean()
deviations = marks - mean
squared_deviations = deviations ** 2
variance_manual = squared_deviations.sum() / (len(marks) - 1)  # Sample variance

print(f"Marks: {marks.tolist()}")
print(f"Mean: {mean:.2f}")
print(f"\nDeviations from mean: {deviations.tolist()}")
print(f"Squared deviations: {[f'{x:.2f}' for x in squared_deviations]}")
print(f"\nVariance (Sample, n-1): {variance_marks:.2f}")
print(f"Variance (Population, n): {variance_marks_pop:.2f}")
print(f"\nManual calculation: {variance_manual:.2f}")
print(f"Using Pandas: {marks.var():.2f}")

Marks: [45, 56, 67, 89, 76, 90, 34, 55, 68, 72]
Mean: 65.20

Deviations from mean: [-20.200000000000003, -9.200000000000003, 1.7999999999999972, 23.799999999999997, 10.799999999999997, 24.799999999999997, -31.200000000000003, -10.200000000000003, 2.799999999999997, 6.799999999999997]
Squared deviations: ['408.04', '84.64', '3.24', '566.44', '116.64', '615.04', '973.44', '104.04', '7.84', '46.24']

Variance (Sample, n-1): 325.07
Variance (Population, n): 292.56

Manual calculation: 325.07
Using Pandas: 325.07
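
One practical pitfall follows from the n versus n - 1 distinction: pandas and NumPy use different defaults. The sketch below is a minimal check, assuming the marks are converted to a plain NumPy array first; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

marks = pd.Series([45, 56, 67, 89, 76, 90, 34, 55, 68, 72])
values = marks.to_numpy()  # plain NumPy array

pandas_default = marks.var()            # pandas default: ddof=1 (sample variance)
numpy_default = np.var(values)          # NumPy default: ddof=0 (population variance)
numpy_sample = np.var(values, ddof=1)   # ask NumPy for the sample variance explicitly

print(f"pandas .var():         {pandas_default:.2f}")
print(f"np.var(values):        {numpy_default:.2f}")
print(f"np.var(values, ddof=1): {numpy_sample:.2f}")
```

Passing `ddof` explicitly in both libraries avoids silently mixing the two definitions.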


### 4.3 Standard Deviation
The square root of the variance. It is easier to interpret than the variance because it is in the same units as the original data.

In [14]:
# Calculate standard deviation
std_marks = marks.std()  # Sample std dev (uses n-1)
std_marks_pop = marks.std(ddof=0)  # Population std dev

print(f"Variance: {variance_marks:.2f}")
print(f"\nStandard Deviation = √Variance")
print(f"Standard Deviation (Sample, n-1): {std_marks:.2f}")
print(f"Standard Deviation (Population, n): {std_marks_pop:.2f}")
print(f"\nManual check: √{variance_marks:.2f} = {np.sqrt(variance_marks):.2f}")
print(f"\nInterpretation:")
print(f"A typical mark lies roughly {std_marks:.2f} points from the mean ({mean:.1f})")

Variance: 325.07

Standard Deviation = √Variance
Standard Deviation (Sample, n-1): 18.03
Standard Deviation (Population, n): 17.10

Manual check: √325.07 = 18.03

Interpretation:
A typical mark lies roughly 18.03 points from the mean (65.2)
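
The standard deviation is often read as a typical distance from the mean, but it is not the arithmetic average of the distances. As a hedged comparison, the sketch below computes the mean absolute deviation alongside the population standard deviation; `mean_abs_dev` is a name introduced here for illustration.

```python
import pandas as pd

marks = pd.Series([45, 56, 67, 89, 76, 90, 34, 55, 68, 72])

# Average of the absolute distances from the mean
abs_dev = (marks - marks.mean()).abs()
mean_abs_dev = abs_dev.mean()

# Population standard deviation: square root of the average squared distance
std_pop = marks.std(ddof=0)

print(f"Mean absolute deviation:       {mean_abs_dev:.2f}")
print(f"Standard deviation (ddof=0):   {std_pop:.2f}")
```

Squaring weights large deviations more heavily, so the standard deviation is never smaller than the mean absolute deviation.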


### 4.4 Quartiles and Interquartile Range (IQR)
Quartiles split the sorted data into four equal parts. IQR = Q3 - Q1 is the width of the middle 50% of the data.

In [15]:
# Calculate quartiles
Q1 = marks.quantile(0.25)
Q2 = marks.quantile(0.50)  # Same as median
Q3 = marks.quantile(0.75)
IQR = Q3 - Q1

# Using numpy percentile (alternative)
Q1_np = np.percentile(marks, 25)
Q3_np = np.percentile(marks, 75)

print(f"Marks (sorted): {sorted(marks.tolist())}")
print(f"\nQuartiles:")
print(f"Q1 (25th percentile): {Q1:.2f}")
print(f"Q2 (50th percentile): {Q2:.2f} [This is the Median]")
print(f"Q3 (75th percentile): {Q3:.2f}")
print(f"\nIQR = Q3 - Q1 = {Q3:.2f} - {Q1:.2f} = {IQR:.2f}")
print(f"\nInterpretation:")
print(f"The middle 50% of marks range from {Q1:.1f} to {Q3:.1f}")
print(f"This spread is {IQR:.1f} marks")

Marks (sorted): [34, 45, 55, 56, 67, 68, 72, 76, 89, 90]

Quartiles:
Q1 (25th percentile): 55.25
Q2 (50th percentile): 67.50 [This is the Median]
Q3 (75th percentile): 75.00

IQR = Q3 - Q1 = 75.00 - 55.25 = 19.75

Interpretation:
The middle 50% of marks range from 55.2 to 75.0
This spread is 19.8 marks
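
Q1 = 55.25 is not one of the marks because the quartile falls between two sorted values and is interpolated. The sketch below reproduces the linear interpolation that pandas uses by default; `linear_quantile` is a hypothetical helper written for this illustration, not a library function.

```python
import math
import pandas as pd

def linear_quantile(values, q):
    """Quantile by linear interpolation between the two nearest sorted values."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)   # fractional 0-based position in the sorted data
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower             # how far between the two neighbours
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

marks = [45, 56, 67, 89, 76, 90, 34, 55, 68, 72]

q1_manual = linear_quantile(marks, 0.25)   # position 2.25: between 55 and 56
q3_manual = linear_quantile(marks, 0.75)   # position 6.75: between 72 and 76

print(f"Manual Q1: {q1_manual}, pandas Q1: {pd.Series(marks).quantile(0.25)}")
print(f"Manual Q3: {q3_manual}, pandas Q3: {pd.Series(marks).quantile(0.75)}")
```

Other interpolation rules exist and can give slightly different quartiles on small samples, so it helps to state which one was used.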


### 4.5 Detecting Outliers using IQR
Values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are considered outliers.

In [16]:
# Calculate outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = marks[(marks < lower_bound) | (marks > upper_bound)]

print(f"Q1: {Q1:.2f}")
print(f"Q3: {Q3:.2f}")
print(f"IQR: {IQR:.2f}")
print(f"\nOutlier thresholds:")
print(f"Lower bound = Q1 - 1.5×IQR = {Q1:.2f} - 1.5×{IQR:.2f} = {lower_bound:.2f}")
print(f"Upper bound = Q3 + 1.5×IQR = {Q3:.2f} + 1.5×{IQR:.2f} = {upper_bound:.2f}")
print(f"\nValues below {lower_bound:.2f} or above {upper_bound:.2f} are outliers")
print(f"\nOutliers in our data: {outliers.tolist() if len(outliers) > 0 else 'None'}")

Q1: 55.25
Q3: 75.00
IQR: 19.75

Outlier thresholds:
Lower bound = Q1 - 1.5×IQR = 55.25 - 1.5×19.75 = 25.62
Upper bound = Q3 + 1.5×IQR = 75.00 + 1.5×19.75 = 104.62

Values below 25.62 or above 104.62 are outliers

Outliers in our data: None
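
Since the sample data contains no outliers, the rule is easier to see with one added. The sketch below wraps the rule in a hypothetical helper, `iqr_outliers`, and applies it to the marks plus one made-up score of 150 that is not part of the dataset.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return (outliers, lower_bound, upper_bound) using the k × IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    outliers = series[(series < lower) | (series > upper)]
    return outliers, lower, upper

marks = pd.Series([45, 56, 67, 89, 76, 90, 34, 55, 68, 72])

# Hypothetical extreme score for illustration only
extended = pd.concat([marks, pd.Series([150])], ignore_index=True)

outliers_orig, _, _ = iqr_outliers(marks)
outliers_ext, lower_ext, upper_ext = iqr_outliers(extended)

print(f"Original data outliers: {outliers_orig.tolist()}")
print(f"Extended data bounds:   [{lower_ext:.2f}, {upper_ext:.2f}]")
print(f"Extended data outliers: {outliers_ext.tolist()}")
```

Note that the added value also shifts the quartiles, so the bounds are recomputed from the extended data rather than reused.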


### Summary: Dispersion
All dispersion measures in one place:

In [17]:
print("\n" + "="*50)
print("MEASURES OF DISPERSION - Summary")
print("="*50)
print(f"Range:           {range_marks}")
print(f"Variance:        {variance_marks:.2f}")
print(f"Std Deviation:   {std_marks:.2f}")
print(f"Q1:              {Q1:.2f}")
print(f"Q2 (Median):     {Q2:.2f}")
print(f"Q3:              {Q3:.2f}")
print(f"IQR:             {IQR:.2f}")
print("="*50)

print("\nInterpretation:")
print(f"- Data spreads from {min_marks} to {max_marks} (range: {range_marks})")
print(f"- A typical mark lies roughly {std_marks:.2f} marks from the mean (std dev)")
print(f"- Middle 50% of students scored between {Q1:.0f} and {Q3:.0f}")


==================================================
MEASURES OF DISPERSION - Summary
==================================================
Range:           56
Variance:        325.07
Std Deviation:   18.03
Q1:              55.25
Q2 (Median):     67.50
Q3:              75.00
IQR:             19.75
==================================================

Interpretation:
- Data spreads from 34 to 90 (range: 56)
- A typical mark lies roughly 18.03 marks from the mean (std dev)
- Middle 50% of students scored between 55 and 75
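
Many of the figures above can also be obtained in a single call. The sketch below uses `Series.describe()`, which reports the count, mean, sample standard deviation, minimum, quartiles, and maximum; the IQR is then derived from the reported quartiles.

```python
import pandas as pd

marks = pd.Series([45, 56, 67, 89, 76, 90, 34, 55, 68, 72], name="Marks")

# count, mean, std (ddof=1), min, 25%, 50%, 75%, max
summary = marks.describe()
print(summary)

# IQR is not part of describe(), but follows from the quartiles
iqr = summary["75%"] - summary["25%"]
print(f"\nIQR from describe(): {iqr:.2f}")
```

`describe()` does not report the variance, mode, or range, so those still need the explicit calls shown earlier.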


## 5: Shape of Distribution

### 5.1 Skewness
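Measures the asymmetry of the distribution. A negative value indicates a longer left tail, a positive value a longer right tail, and a value near zero an approximately symmetric distribution.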

In [18]:
# Calculate skewness
skewness = stats.skew(marks)

print(f"Skewness: {skewness:.4f}")
print(f"\nInterpretation:")
if abs(skewness) < 0.5:
    print("The distribution is approximately SYMMETRIC")
elif skewness >= 0.5:
    print("The distribution is RIGHT-SKEWED (positive skew)")
    print("Long tail on the right side")
    print("Typically Mean > Median > Mode")
else:
    print("The distribution is LEFT-SKEWED (negative skew)")
    print("Long tail on the left side")
    print("Typically Mean < Median < Mode")

print(f"\nComparison:")
print(f"Mean:   {marks.mean():.2f}")
print(f"Median: {marks.median():.2f}")
print(f"Difference: {abs(marks.mean() - marks.median()):.2f}")

Skewness: -0.2022

Interpretation:
The distribution is approximately SYMMETRIC

Comparison:
Mean:   65.20
Median: 67.50
Difference: 2.30
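
To make the number less of a black box, the sketch below recomputes skewness from the central moments as m3 / m2^1.5 and compares it with `stats.skew`. It also shows the small-sample adjusted version (`bias=False`), which is the variant pandas' `Series.skew()` reports; the moment variable names are illustrative.

```python
import numpy as np
import pandas as pd
from scipy import stats

marks = pd.Series([45, 56, 67, 89, 76, 90, 34, 55, 68, 72])
x = marks.to_numpy(dtype=float)

dev = x - x.mean()
m2 = np.mean(dev ** 2)   # second central moment (population variance)
m3 = np.mean(dev ** 3)   # third central moment
g1 = m3 / m2 ** 1.5      # moment-based (biased) skewness

scipy_biased = stats.skew(x)                # default bias=True, matches g1
scipy_adjusted = stats.skew(x, bias=False)  # small-sample adjustment
pandas_skew = marks.skew()                  # pandas reports the adjusted version

print(f"Manual g1:               {g1:.4f}")
print(f"stats.skew (default):    {scipy_biased:.4f}")
print(f"stats.skew (bias=False): {scipy_adjusted:.4f}")
print(f"pandas .skew():          {pandas_skew:.4f}")
```

The two conventions differ noticeably for n = 10, so it is worth stating which one a reported skewness uses.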


### 5.2 Kurtosis
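Measures how heavy the tails are, that is, how prone the distribution is to extreme values. `stats.kurtosis` returns excess kurtosis by default, for which a normal distribution scores 0.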

In [19]:
# Calculate kurtosis (excess kurtosis)
kurtosis = stats.kurtosis(marks)

print(f"Excess Kurtosis: {kurtosis:.4f}")
print(f"\nInterpretation:")
if abs(kurtosis) < 0.5:
    print("Similar to normal distribution (Mesokurtic)")
elif kurtosis >= 0.5:
    print("Heavy tails (Leptokurtic)")
    print("More extreme values than normal distribution")
else:
    print("Light tails (Platykurtic)")
    print("Fewer extreme values than normal distribution")

Excess Kurtosis: -0.8421

Interpretation:
Light tails (Platykurtic)
Fewer extreme values than normal distribution
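
The word "excess" refers to subtracting 3, the kurtosis of a normal distribution. As a hedged check, the sketch below recomputes kurtosis from the central moments as m4 / m2^2 and compares both conventions with `stats.kurtosis`; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

x = np.array([45, 56, 67, 89, 76, 90, 34, 55, 68, 72], dtype=float)

dev = x - x.mean()
m2 = np.mean(dev ** 2)        # second central moment
m4 = np.mean(dev ** 4)        # fourth central moment

pearson_kurtosis = m4 / m2 ** 2          # normal distribution scores 3
excess_kurtosis = pearson_kurtosis - 3   # normal distribution scores 0

print(f"Manual Pearson kurtosis: {pearson_kurtosis:.4f}")
print(f"Manual excess kurtosis:  {excess_kurtosis:.4f}")
print(f"stats.kurtosis (fisher=True, default): {stats.kurtosis(x):.4f}")
print(f"stats.kurtosis (fisher=False):         {stats.kurtosis(x, fisher=False):.4f}")
```

With only 10 observations, the estimate is noisy, so the label attached to it should be read as a rough indication.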


### Summary: Shape
Both shape measures in one place:

In [20]:
print("\n" + "="*50)
print("SHAPE OF DISTRIBUTION - Summary")
print("="*50)
print(f"Skewness:  {skewness:.4f}")
print(f"Kurtosis:  {kurtosis:.4f}")
print("="*50)


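==================================================
SHAPE OF DISTRIBUTION - Summary
==================================================
Skewness:  -0.2022
Kurtosis:  -0.8421
==================================================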
