# Problem 1:
### Compute (a) mean, (b) median, and (c) age-weighted mean of income. Ignore NaNs where appropriate. Explain when a weighted mean is preferable. 

#### A synthetic dataset is generated with `income` and `age` columns, including some NaN values. The goal is to compute three key statistics for the `income` data while properly handling missing values.

- **Mean**: The average income is calculated, with NaN values excluded.
- **Median**: The middle value of income is determined, again ignoring NaNs.
- **Age-Weighted Mean**: Each income value is weighted by the corresponding age, with the weights normalized to sum to 1 across all rows. NaN income values add nothing to the sum, so the weights of the valid rows total slightly less than 1.

#### Weighted means are preferable when some data points are more influential or representative than others.
> e.g., weighting incomes by age could reflect the greater relevance of certain age groups in the population.
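
To make this concrete, here is a minimal sketch with made-up numbers (two hypothetical groups, not taken from the dataset below). When group averages come from groups of very different sizes, the plain mean of the averages misrepresents the population, while a size-weighted mean does not.

```python
import numpy as np

# Hypothetical group summaries (illustrative values only).
group_means = np.array([30000.0, 60000.0])  # average income per group
group_sizes = np.array([10, 90])            # number of people per group

plain_mean = group_means.mean()
weighted_mean = np.average(group_means, weights=group_sizes)

print("Plain mean of group means:", plain_mean)
print("Size-weighted mean:", weighted_mean)
```

The weighted result sits much closer to the larger group's average, which is the behavior one usually wants when groups differ in size.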

In [34]:
import numpy as np
import pandas as pd

In [35]:
np.random.seed(42)

In [36]:
n = 100
ages = np.random.randint(18, 60, n)
incomes = np.random.normal(50000, 15000, n)
incomes[np.random.choice(n, 10, replace=False)] = np.nan

In [37]:
df = pd.DataFrame({'age': ages, 'income': incomes})
df

    age        income
0    56  47476.923679
1    46  67471.529683
2    32  46313.784689
3    25  38454.983395
4    38  68182.585132
..  ...           ...
95   59  42163.077807
96   56  50195.508496
97   58  55666.543880
98   45           NaN


### (a) mean

In [38]:
mean_income = df['income'].mean()
print("Mean Income:", mean_income)

Mean Income: 50101.079120515446


### (b) median

In [39]:
median_income = df['income'].median()
print("Median Income:", median_income)

Median Income: 49719.76839237093


### (c) age-weighted mean of income

In [40]:
age_weights = df['age'] / df['age'].sum()
weighted_mean_income = np.nansum(df['income'] * age_weights)
print("Weighted Mean Income:", weighted_mean_income)

Weighted Mean Income: 45826.22405033739
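
Note that `age_weights` in the cell above is normalized over every row, including the rows whose income is NaN, so the weights that actually enter the sum total less than 1. A hedged sketch of a variant that renormalizes over the valid rows only is shown below. The helper name `nan_weighted_mean` is an assumption introduced here for illustration, and it is demonstrated on a small made-up frame rather than on `df`.

```python
import numpy as np
import pandas as pd

def nan_weighted_mean(values: pd.Series, weights: pd.Series) -> float:
    """Weighted mean over rows where both value and weight are present."""
    valid = values.notna() & weights.notna()
    # np.average divides by the sum of the supplied weights,
    # so the weights are renormalized over the valid rows only.
    return float(np.average(values[valid], weights=weights[valid]))

# Small illustrative frame (made-up numbers).
demo = pd.DataFrame({"age": [20, 40, 60], "income": [30000.0, np.nan, 50000.0]})

renormalized = nan_weighted_mean(demo["income"], demo["age"])
all_row_weights = np.nansum(demo["income"] * demo["age"] / demo["age"].sum())

print("Renormalized over valid rows:", renormalized)
print("Normalized over all rows:", all_row_weights)
```

With the notebook's data the call would be `nan_weighted_mean(df['income'], df['age'])`; the result would differ from the value printed above because the divisor changes.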


***

# Problem 2:
### Standardize income (z-score). Report how many incomes are outliers using rule |z| > 3. Handle NaNs correctly (do not drop entire rows unnecessarily).


#### `income` values are standardized using Z-scores, and outliers are identified. Any NaN (missing) values in the dataset are handled correctly, so no rows are dropped unnecessarily.

#### Steps and Explanation

- **Z-Score Calculation**: Each income value is standardized by subtracting the mean and dividing by the standard deviation. Missing values (NaN) are ignored for these calculations.
- **Outlier Detection**: Values are considered outliers if the absolute value of their Z-score exceeds 3 (i.e., $|z| > 3$).
- **Handling NaNs**: No rows are dropped from the DataFrame. The mean and standard deviation are computed from the valid incomes only, and rows with a missing income keep a NaN Z-score.

> Z-score standardization puts values on a common scale, which makes unusually high or low values easy to identify relative to the rest of the data. The common threshold $|z| > 3$ flags values more than three standard deviations from the mean.


In [41]:
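# Problem 2: Z-score standardization
# Mean and standard deviation are computed from the non-missing incomes only.
income_clean = df['income'].dropna()
mean_inc = income_clean.mean()
std_inc = income_clean.std()  # sample standard deviation (pandas default, ddof=1)
# Apply to the full column so every row is kept; missing incomes stay NaN.
z_scores = (df['income'] - mean_inc) / std_inc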

In [42]:
# Outlier detection (|z| > 3)
outliers = z_scores.abs() > 3
num_outliers = outliers.sum()
print("Number of outliers:", num_outliers)

Number of outliers: 0
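
The two cells above can be folded into one reusable helper. The sketch below is one possible shape for it, not the notebook's own method: the name `zscore_outliers` and the made-up test series are assumptions for illustration. It relies on pandas skipping NaN in `mean()` and `std()` by default, so no explicit `dropna()` is needed.

```python
import numpy as np
import pandas as pd

def zscore_outliers(values: pd.Series, threshold: float = 3.0):
    """Return (z_scores, outlier_mask) without dropping any rows."""
    # Series.mean() and Series.std() skip NaN by default; std uses ddof=1.
    z = (values - values.mean()) / values.std()
    # Comparisons with NaN evaluate to False, so missing values are never flagged.
    mask = z.abs() > threshold
    return z, mask

# Made-up data: 19 zeros, one extreme value, one missing value.
demo = pd.Series([0.0] * 19 + [100.0, np.nan])
z, mask = zscore_outliers(demo)

print("Rows kept:", len(z))
print("Outliers flagged:", int(mask.sum()))
```

All 21 rows survive, the missing entry keeps a NaN Z-score, and only the extreme value is flagged.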


***

# Problem 3:
### Create age bins [18, 25), [25, 35), [35, 45), [45, 60) and compute for each bin:
- count of observations,
- mean income,
- median income.
> Show result as a tidy DataFrame sorted by age bin.



#### This problem involves grouping observations into specific age bins and calculating key statistics for each group. The dataset contains some missing income values (NaNs), which are handled appropriately during aggregation.

## Tasks

- Create age bins with intervals:  
  - [18, 25)  
  - [25, 35)  
  - [35, 45)  
  - [45, 60)
- For each bin, compute:
  - Count of observations (non-NaN incomes)
  - Mean income
  - Median income
- Present the result as a tidy DataFrame sorted by age bin.


In [43]:
bins = [18, 25, 35, 45, 60]
labels = ["18-25", "25-35", "35-45", "45-60"]

In [44]:
df['age_bin'] = pd.cut(df['age'], bins=bins, labels=labels, right=False)

In [45]:
agg_df = df.groupby('age_bin', observed=False).agg(
    count=('income', 'count'),
    mean_income=('income', 'mean'),
    median_income=('income', 'median')
).reset_index().sort_values('age_bin')

print(agg_df)

  age_bin  count   mean_income  median_income
0   18-25     17  46078.737214   46127.228022
1   25-35     22  47393.526164   48999.847420
2   35-45     22  49184.025233   49235.238427
3   45-60     29  55208.705431   55666.543880
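
Because `right=False` makes every bin closed on the left and open on the right, it can help to check where boundary ages land. The sketch below reuses the same `bins` and `labels` on a handful of made-up ages; `boundary_ages` is a name introduced here for illustration.

```python
import pandas as pd

bins = [18, 25, 35, 45, 60]
labels = ["18-25", "25-35", "35-45", "45-60"]

# Made-up boundary ages to check which bin each one falls into.
boundary_ages = pd.Series([18, 24, 25, 44, 45, 59, 60])
binned = pd.cut(boundary_ages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"age": boundary_ages, "age_bin": binned}))
```

An age of exactly 25 falls in `25-35`, and an age of 60 falls outside the last bin and becomes NaN. That is harmless for this dataset, since `np.random.randint(18, 60, n)` never produces 60.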


***

# Problem 4: 
### Create an array with more than one dimension, then demonstrate the following operations:
- Shape and Resize → shape, size, Transpose, Flatten
- Negative indexing, and displaying the error raised by an out-of-bounds access while slicing
- Arithmetic Operations → Broadcasting, Dot Product
- Linear Algebra → Determinant, Inverse

### This problem demonstrates various operations on NumPy arrays, including working with multi-dimensional arrays, shape manipulation, indexing, arithmetic operations, and linear algebra functions.

## Tasks and Operations

### 1. Create a Multi-Dimensional Array
> Create an array with more than one dimension (e.g., 2D array):

In [46]:
import numpy as np
arr = np.random.randint(1, 10, (3, 4))

print("array:\n", arr)

array:
 [[7 7 3 2]
 [9 8 7 9]
 [4 4 1 8]]


### 2. Shape and Resize Operations
> Display the shape, size, transpose, and flattened versions of the array:

In [47]:
print("Shape:", arr.shape)
print("Size:", arr.size)
print("Transposed:", arr.T)
print("Flattened:", arr.flatten())

Shape: (3, 4)
Size: 12
Transposed: [[7 9 4]
 [7 8 4]
 [3 7 1]
 [2 9 8]]
Flattened: [7 7 3 2 9 8 7 9 4 4 1 8]
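
The cell above covers shape, size, transpose, and flatten. As a complement, the sketch below shows `reshape` and the difference between a copy and a view, using a small illustrative array `a` (an assumption, not the notebook's `arr`).

```python
import numpy as np

a = np.arange(12).reshape(3, 4)   # illustrative 3x4 array

reshaped = a.reshape(4, 3)        # same 12 elements, new shape
flat_copy = a.flatten()           # always a copy
flat_view = a.ravel()             # a view when the memory layout allows it

print("Reshaped shape:", reshaped.shape)
print("flatten shares memory:", np.shares_memory(a, flat_copy))
print("ravel shares memory:", np.shares_memory(a, flat_view))
print("transpose shares memory:", np.shares_memory(a, a.T))
```

The distinction matters when modifying results in place: writing into a view changes the original array, while writing into a copy does not.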


### 3. Negative Indexing and Slicing Error
> Use negative indexing to access elements from the end.
> Then deliberately use an out-of-bounds integer index inside a slicing expression to trigger an error (an out-of-range slice on its own would not raise):

In [48]:
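print("Last row:", arr[-1])  # -1 selects the last row

try:
    # Column index 10 does not exist (axis 1 has size 4), so this raises IndexError.
    slice_error = arr[:, 10]
except IndexError as e:
    print("Slicing error:", e)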

Last row: [4 4 1 8]
Slicing error: index 10 is out of bounds for axis 1 with size 4
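
The error above comes from the integer index `10`, not from the slice `:`. A short sketch on an illustrative array `a` (an assumption, not the notebook's `arr`) contrasts the two cases:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)   # illustrative 3x4 array

# Slices that run past the end are clipped silently.
clipped = a[:, 2:10]      # columns 2 and 3 only
empty = a[:, 10:]         # no columns at all

print("Clipped shape:", clipped.shape)
print("Empty shape:", empty.shape)

# An integer index past the end raises IndexError.
try:
    a[:, 10]
    raised = False
except IndexError as exc:
    raised = True
    print("IndexError:", exc)
```

Silent clipping is convenient but can hide off-by-one mistakes, so checking the resulting shape is a reasonable habit.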


### 4. Arithmetic Operations
> Broadcasting example: adding a 3x1 array to the original 3x4 array, so each row is offset by its own value.
> Dot product of the original array (3x4) with its transpose (4x3), giving a 3x3 matrix:

In [49]:
arr2 = np.random.randint(1, 10, (3, 1))
broadcasted = arr + arr2 
print("Broadcasted:\n", broadcasted)
dot_product = np.dot(arr, arr.T)
print("Dot Product:\n", dot_product)

Broadcasted:
 [[10 10  6  5]
 [16 15 14 16]
 [ 6  6  3 10]]
Dot Product:
 [[111 158  75]
 [158 275 147]
 [ 75 147  97]]
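
Broadcasting compares shapes from the trailing axis backwards, and each pair of sizes must either match or include a 1. The sketch below illustrates that rule on an illustrative array `a` with made-up offsets (all names here are assumptions), and also shows that `@` gives the same result as `np.dot` for 2-D arrays.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)       # illustrative 3x4 array
col = np.array([[10], [20], [30]])    # shape (3, 1): one offset per row
row = np.array([1, 2, 3, 4])          # shape (4,): one offset per column

print("a + col shape:", (a + col).shape)
print("a + row shape:", (a + row).shape)

# A length-3 vector does not line up with the trailing axis of size 4.
try:
    a + np.array([1, 2, 3])
    compatible = True
except ValueError as exc:
    compatible = False
    print("Broadcast error:", exc)

# For 2-D arrays, np.dot and the @ operator compute the same matrix product.
gram = a @ a.T
print("Matrix product shape:", gram.shape)
```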


### 5. Linear Algebra: Determinant and Inverse
> Compute determinant and inverse of a square matrix (3x3):

In [50]:
square_arr = np.random.randint(1, 10, (3, 3))
det = np.linalg.det(square_arr)
inverse = np.linalg.inv(square_arr)

print("Square Array:\n", square_arr)
print("Determinant:", det)
print("Inverse:\n", inverse)


Square Array:
 [[2 7 6]
 [3 9 6]
 [6 1 4]]
Determinant: -78.0
Inverse:
 [[-0.38461538  0.28205128  0.15384615]
 [-0.30769231  0.35897436 -0.07692308]
 [ 0.65384615 -0.51282051  0.03846154]]
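
Since `square_arr` is random, it could in principle be singular, in which case `np.linalg.inv` raises `LinAlgError`. The sketch below is one way to guard against that and to verify an inverse; `safe_inverse` is a hypothetical helper name, and the singular matrix is made up for the demonstration.

```python
import numpy as np

def safe_inverse(matrix: np.ndarray):
    """Return the inverse, or None if NumPy reports the matrix as singular."""
    try:
        return np.linalg.inv(matrix)
    except np.linalg.LinAlgError:
        return None

# The square matrix printed above.
m = np.array([[2, 7, 6], [3, 9, 6], [6, 1, 4]], dtype=float)
m_inv = safe_inverse(m)
print("m @ m_inv is identity:", np.allclose(m @ m_inv, np.eye(3)))

# A made-up singular matrix: the second row is twice the first.
singular = np.array([[1.0, 2.0], [2.0, 4.0]])
print("Inverse of singular matrix:", safe_inverse(singular))
```

Comparing with `np.allclose` rather than `==` allows for the small floating-point error that matrix inversion introduces.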


***