# CSU1658 Statistical Foundation of Data Sciences - Assignment 1

**Student Information:**
- **Name:** Aryan Dhiman
- **Subject Code:** CSU1658
- **Assignment:** Practical 1
- **Date:** September 16, 2025

---

## Assignment Overview

This assignment demonstrates statistical computations and data manipulation techniques using synthetic datasets with NaN values. The practical covers:

- Descriptive statistics with missing data handling
- Standardization and outlier detection
- Data binning and group analysis
- Multi-dimensional array operations

**Learning Objectives:**
- Handle missing data appropriately in statistical computations
- Apply standardization techniques for outlier detection
- Create meaningful data groupings and summarizations
- Demonstrate array operations and linear algebra concepts


In [16]:
# Environment Setup - Import Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Set random seed for reproducibility
np.random.seed(42)

# Display options for better output formatting
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)

print("Libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")


Libraries imported successfully!
NumPy version: 1.26.4
Pandas version: 2.2.2


---

## Dataset Creation

### Synthetic Dataset Generation

Creating a synthetic dataset with the following specifications:
- **Size:** 100 observations
- **Age Range:** 18-59 years (`np.random.randint` excludes the upper bound of 60)
- **Income Range:** $20,000 - $149,999 (the upper bound of $150,000 is excluded)
- **Missing Values:** 10 NaN values in income column


In [17]:
# Dataset parameters
DATA_SIZE = 100
AGE_RANGE = (18, 60)
INCOME_RANGE = (20000, 150000)
NAN_COUNT = 10

# Generate base dataset
np.random.seed(42)  # Ensure reproducibility
df = pd.DataFrame({
    'age': np.random.randint(AGE_RANGE[0], AGE_RANGE[1], DATA_SIZE),
    'income': np.random.randint(INCOME_RANGE[0], INCOME_RANGE[1], DATA_SIZE).astype(float)
})

# Introduce NaN values randomly in income column
nan_indices = np.random.choice(DATA_SIZE, size=NAN_COUNT, replace=False)
df.loc[nan_indices, 'income'] = np.nan

# Dataset overview
print("Dataset Shape:", df.shape)
print(f"Missing values in income: {df['income'].isnull().sum()}")
print("\nFirst 10 rows:")
display(df.head(10))


Dataset Shape: (100, 2)
Missing values in income: 10

First 10 rows:


| | age | income |
|---|---|---|
| 0 | 56 | 28392.0 |
| 1 | 46 | 50535.0 |
| 2 | 32 | 98603.0 |
| 3 | 25 | NaN |
| 4 | 38 | 72256.0 |
| 5 | 56 | 109135.0 |
| 6 | 36 | 147478.0 |
| 7 | 40 | NaN |
| 8 | 28 | 97373.0 |
| 9 | 28 | 99575.0 |
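
One detail of the generation step is worth checking: `np.random.randint` treats its upper bound as exclusive, so a call with bounds `(18, 60)` can never return 60. The sketch below illustrates this on a separate sample. The `endpoint=True` variant is shown only as an assumed alternative for an inclusive range; it is not what this notebook uses, and the seed `0` is arbitrary.

```python
import numpy as np

# Legacy-style generator, same semantics as np.random.randint:
# the upper bound (60) is exclusive.
legacy = np.random.RandomState(0)  # local state, leaves the global seed untouched
ages_exclusive = legacy.randint(18, 60, 10_000)

# Newer Generator API: endpoint=True makes the upper bound inclusive.
rng = np.random.default_rng(0)
ages_inclusive = rng.integers(18, 60, size=10_000, endpoint=True)

print("exclusive:", ages_exclusive.min(), ages_exclusive.max())
print("inclusive:", ages_inclusive.min(), ages_inclusive.max())
```

If ages up to and including 60 were required, passing `61` as the upper bound to `randint` would be another option.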


In [18]:
print("\nDataset Statistics:")
display(df.describe())



Dataset Statistics:


| | age | income |
|---|---|---|
| count | 100.0 | 90.0 |
| mean | 37.91 | 90708.92 |
| std | 12.22 | 37998.75 |
| min | 18.0 | 20206.0 |
| 25% | 26.75 | 62653.0 |
| 50% | 38.0 | 95608.0 |
| 75% | 46.25 | 125866.0 |
| max | 59.0 | 148376.0 |


---

## Problem 1: Descriptive Statistics

### Task:
Compute (a) **mean**, (b) **median**, and (c) **age-weighted mean** of income. Ignore NaNs where appropriate and explain when a weighted mean is preferable.

### Approach:
- Use pandas `mean()` and `median()`, which skip NaNs by default (`skipna=True`)
- Calculate age-weighted mean using proportional weights
- Provide interpretation of when weighted means are useful


In [19]:
# (a) Mean income (ignoring NaNs)
mean_income = df['income'].mean()

# (b) Median income (ignoring NaNs)
median_income = df['income'].median()

# (c) Age-weighted mean income
valid_data = df.dropna(subset=['income'])
age_weights = valid_data['age'] / valid_data['age'].sum()
weighted_mean_income = np.sum(valid_data['income'] * age_weights)

# Display results
print("DESCRIPTIVE STATISTICS RESULTS")
print("=" * 40)
print(f"Mean Income: ${mean_income:,.2f}")
print(f"Median Income: ${median_income:,.2f}")
print(f"Age-Weighted Mean Income: ${weighted_mean_income:,.2f}")


DESCRIPTIVE STATISTICS RESULTS
Mean Income: $90,708.92
Median Income: $95,608.00
Age-Weighted Mean Income: $89,948.01


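### When to Use Weighted Mean:

**Weighted means are preferable when:**
- Observations differ in importance or reliability (for example, each row summarizes a different number of people)
- The sample composition differs from the target population, so rows must be reweighted to correct for sampling bias or unequal representation

Here the weights are proportional to age, so older respondents pull the result more strongly. The age-weighted mean ($89,948.01) sits slightly below the plain mean ($90,708.92), which indicates a slightly negative association between age and income in this sample. Age is used only to illustrate the mechanics; it is not a survey weight and does not by itself correct for demographic representation.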
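
As a small illustration of the weighted-mean calculation, the sketch below compares the normalized-weights form used in the cell above with `np.average`. The three income and age values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical toy values, not taken from the dataset.
income = np.array([30_000.0, 60_000.0, 90_000.0])
age = np.array([20, 30, 50])

# Normalized-weights form: sum(x_i * w_i) with w_i = age_i / sum(age).
manual = np.sum(income * age / age.sum())

# Equivalent built-in: np.average normalizes the weights internally.
builtin = np.average(income, weights=age)

print(f"plain mean:    {income.mean():,.2f}")
print(f"weighted mean: {manual:,.2f} (manual), {builtin:,.2f} (np.average)")
```

Because the highest income belongs to the oldest person in this toy example, the weighted mean lands above the plain mean.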


---

## Problem 2: Standardization and Outlier Detection

### Task:
Standardize income using z-score and report how many incomes are outliers using rule |z| > 3. Handle NaNs correctly without dropping entire rows unnecessarily.

### Approach:
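- Calculate z-scores as (x - μ) / σ, with μ and σ estimated from the non-NaN incomes (pandas `std()` uses the sample standard deviation, `ddof=1`)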
- Count outliers where |z-score| > 3
- Preserve NaN values in standardized column


In [20]:
# Calculate z-scores: mean() and std() skip NaNs, and NaN incomes stay NaN in the result
mean_val = df['income'].mean()
std_val = df['income'].std()
df['income_zscore'] = (df['income'] - mean_val) / std_val

# Count outliers using |z| > 3 rule
outlier_mask = df['income_zscore'].abs() > 3
outlier_count = outlier_mask.sum()
non_nan_count = df['income_zscore'].notna().sum()

print("OUTLIER DETECTION RESULTS")
print("=" * 40)
print(f"Total non-NaN observations: {non_nan_count}")
print(f"Number of outliers (|z| > 3): {outlier_count}")
print(f"Outlier percentage: {(outlier_count/non_nan_count)*100:.2f}%")

# Display standardized data sample
print("\nSample of standardized data:")
display(df[['age', 'income', 'income_zscore']].head(10))


OUTLIER DETECTION RESULTS
Total non-NaN observations: 90
Number of outliers (|z| > 3): 0
Outlier percentage: 0.00%

Sample of standardized data:


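| | age | income | income_zscore |
|---|---|---|---|
| 0 | 56 | 28392.0 | -1.64 |
| 1 | 46 | 50535.0 | -1.06 |
| 2 | 32 | 98603.0 | 0.21 |
| 3 | 25 | NaN | NaN |
| 4 | 38 | 72256.0 | -0.49 |
| 5 | 56 | 109135.0 | 0.48 |
| 6 | 36 | 147478.0 | 1.49 |
| 7 | 40 | NaN | NaN |
| 8 | 28 | 97373.0 | 0.18 |
| 9 | 28 | 99575.0 | 0.23 |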


In [21]:
if outlier_count > 0:
    print("\nOutlier observations:")
    display(df.loc[outlier_mask, ['age', 'income', 'income_zscore']])
else:
    print("No outliers detected using |z| > 3 criterion.")


No outliers detected using |z| > 3 criterion.
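
A count of zero is what one would expect here, because the incomes are drawn uniformly from a bounded range. For a continuous uniform distribution on [a, b], the standard deviation is (b - a)/√12 and the largest distance from the mean is (b - a)/2, so |z| cannot exceed √3 ≈ 1.73 in theory. The sketch below checks this on a separate, larger sample. The seed, sample size, and number of injected NaNs are arbitrary assumptions, and sample estimates can push the observed maximum slightly past √3.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, independent of the notebook's data
x = rng.integers(20_000, 150_000, size=1_000).astype(float)
x[rng.choice(x.size, size=100, replace=False)] = np.nan  # inject missing values

# NaN-aware z-scores: NaNs are skipped in the estimates and preserved in the output.
z = (x - np.nanmean(x)) / np.nanstd(x, ddof=1)

max_abs_z = np.nanmax(np.abs(z))
print(f"largest |z| in sample: {max_abs_z:.2f}")
print(f"theoretical bound for a uniform distribution: {np.sqrt(3):.2f}")
print(f"NaNs preserved: {np.isnan(z).sum()}")
```

The |z| > 3 rule is more informative for heavier-tailed data, where extreme values can sit several standard deviations from the mean.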


---

## Problem 3: Age Binning and Group Statistics

### Task:
Create age bins [18, 25), [25, 35), [35, 45), [45, 60) and compute for each bin:
- Count of observations
- Mean income  
- Median income

Show result as a tidy DataFrame sorted by age bin.

### Approach:
- Use pandas `cut()` function for binning
- Apply `groupby()` with aggregation functions
- Let the aggregations skip NaN incomes, so `count` reports non-missing incomes per bin rather than total rows
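
To make the left-closed interval convention concrete, the sketch below runs `pd.cut` with `right=False` on a few hypothetical boundary ages. It suggests that an age equal to a bin edge falls into the bin that starts at that edge, and that a value equal to the final edge (60) falls outside every bin.

```python
import pandas as pd

bins = [18, 25, 35, 45, 60]
labels = ['18-25', '25-35', '35-45', '45-60']

# Hypothetical edge-case ages, not taken from the dataset.
ages = pd.Series([18, 24, 25, 59, 60])

# right=False gives intervals [18, 25), [25, 35), [35, 45), [45, 60).
binned = pd.cut(ages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({'age': ages, 'age_bin': binned}))
```

In this notebook the generated ages never reach 60, so no row ends up unbinned.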


In [22]:
# Define age bins and labels
age_bins = [18, 25, 35, 45, 60]
age_labels = ['18-25', '25-35', '35-45', '45-60']

# Create age bins
df['age_bin'] = pd.cut(df['age'], bins=age_bins, labels=age_labels, right=False)

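# Compute statistics for each age group.
# 'count' tallies non-NaN incomes only; observed=False keeps every bin and
# states the current default explicitly (avoids a pandas FutureWarning).
bin_statistics = df.groupby('age_bin', observed=False).agg({
    'income': ['count', 'mean', 'median']
}).round(2)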

# Flatten column names for clarity
bin_statistics.columns = ['Count_Observations', 'Mean_Income', 'Median_Income']

# Sort by age bin (already sorted due to categorical ordering)
bin_statistics_sorted = bin_statistics.sort_index()

print("AGE GROUP STATISTICS")
print("=" * 50)
display(bin_statistics_sorted)


AGE GROUP STATISTICS


| age_bin | Count_Observations | Mean_Income | Median_Income |
|---|---|---|---|
| 18-25 | 15 | 99426.13 | 107939.0 |
| 25-35 | 22 | 84921.27 | 84107.5 |
| 35-45 | 25 | 97352.6 | 99309.0 |
| 45-60 | 28 | 84654.57 | 85629.0 |


In [23]:
# Additional insights
print("\nAge Distribution:")
age_dist = df['age_bin'].value_counts().sort_index()
for bin_name, count in age_dist.items():
    print(f"{bin_name}: {count} individuals ({count/len(df)*100:.1f}%)")



Age Distribution:
18-25: 17 individuals (17.0%)
25-35: 25 individuals (25.0%)
35-45: 27 individuals (27.0%)
45-60: 31 individuals (31.0%)
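
The per-bin totals here (17, 25, 27, 31) are larger than the `Count_Observations` values in the group statistics (15, 22, 25, 28) because the `count` aggregation skips NaN incomes while `value_counts()` on the bin column counts every row. A minimal sketch of the difference, using a hypothetical five-row dataset and illustrative column names:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset: group 'a' has one missing income.
toy = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b'],
    'income': [10.0, np.nan, 30.0, 40.0, 50.0],
})

summary = toy.groupby('group').agg(
    rows=('income', 'size'),          # all rows, NaN included
    non_missing=('income', 'count'),  # non-NaN values only
    mean_income=('income', 'mean'),   # NaN skipped
)
print(summary)
```

Reporting both columns side by side is one way to make the amount of missing data per group visible.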


---

## Problem 4: Multi-dimensional Array Operations

### Task:
Create an array (not 1-dimensional) and showcase:
- **Shape Operations:** shape, size, transpose, flatten
- **Indexing:** negative indexing and slicing errors
- **Arithmetic Operations:** broadcasting, dot product
- **Linear Algebra:** determinant, inverse

### Approach:
- Create 2D array for comprehensive demonstrations
- Show proper error handling for invalid operations
- Use numpy linear algebra functions with validation


In [24]:
# Create multi-dimensional array (not 1D)
np.random.seed(42)
array_2d = np.random.randint(1, 10, (4, 5))

print("ARRAY OPERATIONS DEMONSTRATION")
print("=" * 45)

# Shape and size
print("Original Array:")
print(array_2d)
print(f"Shape: {array_2d.shape}")
print(f"Size: {array_2d.size}")

print("\nTranspose:")
print(array_2d.T)

print("\nFlattened:")
print(array_2d.flatten())

# Reshaping: the same 20 elements laid out as 5 rows x 4 columns
resized = array_2d.reshape(5, 4)
print("\nReshaped to (5,4):")
print(resized)


ARRAY OPERATIONS DEMONSTRATION
Original Array:
[[7 4 8 5 7]
 [3 7 8 5 4]
 [8 8 3 6 5]
 [2 8 6 2 5]]
Shape: (4, 5)
Size: 20

Transpose:
[[7 3 8 2]
 [4 7 8 8]
 [8 8 3 6]
 [5 5 6 2]
 [7 4 5 5]]

Flattened:
[7 4 8 5 7 3 7 8 5 4 8 8 3 6 5 2 8 6 2 5]

Reshaped to (5,4):
[[7 4 8 5]
 [7 3 7 8]
 [5 4 8 8]
 [3 6 5 2]
 [8 6 2 5]]


In [25]:
# Negative indexing examples
print("\nINDEXING OPERATIONS")
print("=" * 30)

print("Last row (negative indexing):")
print(array_2d[-1])

print("Last element:")
print(array_2d[-1, -1])

# Demonstrate an out-of-bounds integer index (raises IndexError; an out-of-range slice would not)
print("\nDemonstrating slicing error:")
try:
    result = array_2d[:, 10]  # Invalid column index
    print(result)
except IndexError as e:
    print(f"IndexError: {e}")

try:
    result = array_2d[5, :]  # Invalid row index
    print(result)
except IndexError as e:
    print(f"IndexError: {e}")



INDEXING OPERATIONS
Last row (negative indexing):
[2 8 6 2 5]
Last element:
5

Demonstrating slicing error:
IndexError: index 10 is out of bounds for axis 1 with size 5
IndexError: index 5 is out of bounds for axis 0 with size 4
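
Both errors above come from out-of-range integer indices. Out-of-range slices behave differently: NumPy clips them silently instead of raising. The sketch below contrasts the two cases on a stand-in 4x5 array (`arr` is a hypothetical name, not the notebook's `array_2d`).

```python
import numpy as np

arr = np.arange(20).reshape(4, 5)  # stand-in for a 4x5 array

# Out-of-range slices are clipped silently instead of raising.
clipped = arr[:, 3:10]   # only columns 3 and 4 exist
empty = arr[10:, :]      # no rows at or beyond index 10

print(clipped.shape)  # (4, 2)
print(empty.shape)    # (0, 5)

# An out-of-range integer index does raise.
try:
    arr[:, 10]
    raised = False
except IndexError:
    raised = True
print(raised)  # True
```

Silent clipping means a mistyped slice can return an empty array without any warning, so checking the result's shape is a cheap safeguard.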


In [26]:
# Broadcasting demonstration
print("\n ARITHMETIC OPERATIONS")
print("=" * 35)

broadcast_array = np.arange(5)
print(f"Broadcasting with: {broadcast_array}")
broadcast_result = array_2d + broadcast_array
print("Broadcasting result:")
print(broadcast_result)

# Dot product
dot_product = np.dot(array_2d, array_2d.T)
print("\nDot product (4x5 × 5x4 = 4x4):")
print(dot_product)



 ARITHMETIC OPERATIONS
Broadcasting with: [0 1 2 3 4]
Broadcasting result:
[[ 7  5 10  8 11]
 [ 3  8 10  8  8]
 [ 8  9  5  9  9]
 [ 2  9  8  5  9]]

Dot product (4x5 × 5x4 = 4x4):
[[203 166 177 139]
 [166 163 154 140]
 [177 154 198 135]
 [139 140 135 133]]
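
The broadcast above works because the 1D array has length 5, matching the last axis of the 4x5 array. The sketch below shows, on stand-in arrays with hypothetical names, what happens when the shapes are incompatible and how adding an axis resolves it.

```python
import numpy as np

a = np.ones((4, 5), dtype=int)   # stand-in for the 4x5 array
row = np.arange(5)               # shape (5,): matches the last axis
col = np.arange(4)               # shape (4,): does not match the last axis

print((a + row).shape)           # (4, 5)

# Shapes (4, 5) and (4,) are compared from the right: 5 vs 4 is incompatible.
try:
    a + col
    failed = False
except ValueError:
    failed = True
print(failed)                    # True

# Adding an axis turns (4,) into (4, 1), which broadcasts across the columns.
print((a + col[:, np.newaxis]).shape)  # (4, 5)
```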


In [27]:
# Linear algebra operations (requires square matrix)
print("\n LINEAR ALGEBRA OPERATIONS")
print("=" * 40)

# Create square matrix for determinant and inverse
square_matrix = array_2d[:4, :4]  # Extract 4x4 submatrix
print("Square matrix (4x4):")
print(square_matrix)

# Determinant
det = np.linalg.det(square_matrix)
print(f"\nDeterminant: {det:.4f}")

# Inverse (handle potential singular matrix)
try:
    inverse_matrix = np.linalg.inv(square_matrix)
    print("\nInverse matrix:")
    print(inverse_matrix.round(4))
    
    # Verify inverse
    identity_check = np.dot(square_matrix, inverse_matrix)
    print("\nVerification (A × A⁻¹ ≈ I):")
    print(identity_check.round(4))
    
except np.linalg.LinAlgError:
    print("Matrix is singular (determinant ≈ 0), inverse does not exist")



 LINEAR ALGEBRA OPERATIONS
Square matrix (4x4):
[[7 4 8 5]
 [3 7 8 5]
 [8 8 3 6]
 [2 8 6 2]]

Determinant: 928.0000

Inverse matrix:
[[ 0.1789 -0.2759  0.0453  0.1067]
 [-0.0948 -0.0345  0.0603  0.1422]
 [ 0.125   0.     -0.125   0.0625]
 [-0.1746  0.4138  0.0884 -0.3631]]

Verification (A × A⁻¹ ≈ I):
[[ 1.  0.  0.  0.]
 [ 0.  1.  0. -0.]
 [ 0. -0.  1.  0.]
 [ 0.  0.  0.  1.]]
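
Rounding to four decimals hides small floating-point residue, which is why some zeros print as `-0.`. A tolerance-based check with `np.allclose` is a common alternative. The sketch below applies it to the same 4x4 matrix and also shows `np.linalg.solve`, which is generally preferred over forming an explicit inverse when the goal is to solve a linear system. The right-hand side `b` and the 2x2 singular matrix are hypothetical examples.

```python
import numpy as np

# Same 4x4 matrix as printed above.
A = np.array([[7, 4, 8, 5],
              [3, 7, 8, 5],
              [8, 8, 3, 6],
              [2, 8, 6, 2]], dtype=float)

# Tolerance-based verification that A @ A_inv is the identity.
A_inv = np.linalg.inv(A)
is_identity = np.allclose(A @ A_inv, np.eye(4))
print(is_identity)

# Solving A x = b directly, without forming the inverse.
b = np.array([1.0, 2.0, 3.0, 4.0])  # hypothetical right-hand side
x = np.linalg.solve(A, b)
solves_system = np.allclose(A @ x, b)
print(solves_system)

# An exactly singular matrix (second row is twice the first) raises LinAlgError.
singular = np.array([[1.0, 2.0], [2.0, 4.0]])
try:
    np.linalg.inv(singular)
    singular_raised = False
except np.linalg.LinAlgError:
    singular_raised = True
print(singular_raised)
```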


---

## Summary and Conclusions

### Key Findings:
- **Dataset:** Created a 100-row synthetic dataset with 10 NaN values injected into `income`
- **Statistics:** Computed the mean ($90,708.92), median ($95,608.00), and age-weighted mean ($89,948.01) of income over the 90 non-missing values
- **Outliers:** Standardized income with z-scores; no observation exceeded |z| > 3
- **Grouping:** Binned age into four left-closed intervals and summarized income per bin
- **Arrays:** Demonstrated shape operations, indexing, broadcasting, the dot product, the determinant, and the inverse on a 2D array

### Technical Skills Demonstrated:
1. Proper NaN handling in statistical computations
2. Z-score standardization and outlier detection
3. Data binning and categorical analysis  
4. Multi-dimensional array manipulation
5. Linear algebra operations with error handling

**Note:** All code cells above should be run sequentially to reproduce the complete analysis. The random seed ensures reproducibility across different execution environments.