## Session 2: Data Processing and Statistical Analysis Tasks

**Dataset**: NSMES1988new.csv

### Tasks:
1. Import relevant Python libraries.
2. Import the CSV file – NSMES1988new.csv into a dataframe.
3. Perform memory analysis of the new dataframe and compare it with the memory of the dataframe in the previous week and mark your comments.
4. Perform the following operations on age and income columns: Multiply age by 10 and income by 10000.
5. Perform basic statistical analysis on the new dataframe and generate a brief report on the outcome. Save the dataframe as NSMES1988updated.csv file in the local space for possible future use.
6. Invoke describe command on the dataframe and compare that with the basic statistical analysis done in the previous step.
7. Indicate which of the columns are not eligible for statistical analysis and indicate possible datatype changes, and report.
8. Make changes to the recommended file from previous step (Optional).
9. Prepare a brief report and enter it in the mark-up cells of JupyterLab Notebook.

---

## Task 1-2: Library Import and Data Loading

In [2]:
# imports and data loading
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('outputs/NSMES1988new.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,visits,nvisits,ovisits,novisits,emergency,hospital,health,chronic,adl,region,age,gender,married,school,income,employed,insurance,medicaid
0,1,5,0,0,0,0,1,average,2,normal,other,6.9,male,yes,6,2.881,yes,yes,no
1,2,1,0,2,0,2,0,average,2,normal,other,7.4,female,yes,10,2.7478,no,yes,no
2,3,13,0,0,0,3,3,poor,4,limited,other,6.6,female,no,10,0.6532,no,no,yes
3,4,16,0,5,0,1,1,poor,2,limited,other,7.6,male,yes,3,0.6588,no,yes,no
4,5,3,0,0,0,0,0,average,2,limited,other,7.9,female,yes,6,0.6588,no,yes,no


---
## Task 3: Memory Analysis and Comparison

Compare memory usage with Session 1 original dataframe.

In [12]:
# memory analysis

# load both datasets for comparison
df_original = pd.read_csv('data/NSMES1988.csv')
df = pd.read_csv('outputs/NSMES1988new.csv')

# memory before optimization
original_memory = df_original.memory_usage(deep=True).sum() / 1024**2
current_memory_unoptimized = df.memory_usage(deep=True).sum() / 1024**2

print("MEMORY USAGE BEFORE RE-OPTIMIZATION:")
print("="*50)
print(f"Original (Session 1): {original_memory:.2f} MB")
print(f"Current (unoptimized): {current_memory_unoptimized:.2f} MB")
print(f"\nCSV files don't preserve data types.")
print("    Need to re-apply optimizations\n")


# RE-APPLY OPTIMIZATIONS
# convert categorical columns
categorical_cols = ['health', 'gender', 'married', 'region', 
                    'employed', 'insurance', 'medicaid', 'adl']

for col in categorical_cols:
    if col in df.columns:
        df[col] = df[col].astype('category')

# optimize integer columns
int_cols = ['visits', 'nvisits', 'ovisits', 'novisits', 
            'emergency', 'hospital', 'chronic', 'school']

for col in int_cols:
    if col in df.columns:
        max_val = df[col].max()
        min_val = df[col].min()
        
        # bounds are inclusive: 255 fits in uint8, 127 in int8, and so on
        if min_val >= 0 and max_val <= 255:
            df[col] = df[col].astype('uint8')
        elif min_val >= -128 and max_val <= 127:
            df[col] = df[col].astype('int8')
        elif min_val >= 0 and max_val <= 65535:
            df[col] = df[col].astype('uint16')
        elif min_val >= -32768 and max_val <= 32767:
            df[col] = df[col].astype('int16')
        else:
            df[col] = df[col].astype('int32')

# optimize float columns
float_cols = ['age', 'income']
for col in float_cols:
    if col in df.columns:
        df[col] = df[col].astype('float32')

# memory AFTER re-optimization
current_memory_optimized = df.memory_usage(deep=True).sum() / 1024**2

print("\nMEMORY USAGE AFTER RE-OPTIMIZATION:")
print("="*50)
print(f"Original (Session 1): {original_memory:.2f} MB")
print(f"After re-optimization: {current_memory_optimized:.2f} MB")
print(f"Memory Reduction: {original_memory - current_memory_optimized:.2f} MB")
print(f"Percentage Saved: {((original_memory - current_memory_optimized) / original_memory * 100):.1f}%")

print("\n✓ Optimizations successfully re-applied!")

MEMORY USAGE BEFORE RE-OPTIMIZATION:
Original (Session 1): 2.43 MB
Current (unoptimized): 2.43 MB

CSV files don't preserve data types.
    Need to re-apply optimizations


MEMORY USAGE AFTER RE-OPTIMIZATION:
Original (Session 1): 2.43 MB
After re-optimization: 0.14 MB
Memory Reduction: 2.29 MB
Percentage Saved: 94.4%

✓ Optimizations successfully re-applied!
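
The hand-typed range checks in the cell above can also be driven by the integer limits that NumPy reports. This is a minimal sketch on a made-up frame, not the notebook's own code: the helper name `smallest_int_dtype` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def smallest_int_dtype(series: pd.Series) -> str:
    """Return the narrowest integer dtype name that can hold every value in `series`."""
    lo, hi = series.min(), series.max()
    # Prefer unsigned types when there are no negative values.
    candidates = (['uint8', 'uint16', 'uint32', 'uint64'] if lo >= 0
                  else ['int8', 'int16', 'int32', 'int64'])
    for name in candidates:
        info = np.iinfo(np.dtype(name))
        if info.min <= lo and hi <= info.max:
            return name
    return 'int64'

# Toy data standing in for the count columns (values are invented).
toy = pd.DataFrame({'visits': [0, 5, 89],
                    'big': [0, 300, 70000],
                    'signed': [-5, 0, 100]})
chosen = {col: smallest_int_dtype(toy[col]) for col in toy.columns}
toy = toy.astype(chosen)
print(toy.dtypes)
```

Reading the limits from `np.iinfo` keeps the bounds inclusive by construction, so a column whose maximum is exactly 255 still lands in `uint8`.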


---
## Task 4: Data Transformation - Age and Income Scaling

Correct the scaling issues identified in Session 1:
- **Age**: Multiply by 10 (scaled → actual years)
- **Income**: Multiply by 10,000 (scaled → actual dollars)

In [25]:
# load fresh copy as baseline (keep this untouched for safety)
df_untransformed = pd.read_csv('outputs/NSMES1988new.csv')

# drop Unnamed: 0 if it exists
if 'Unnamed: 0' in df_untransformed.columns:
    df_untransformed = df_untransformed.drop('Unnamed: 0', axis=1)

# re-apply optimizations to untransformed version
categorical_cols = ['health', 'gender', 'married', 'region', 
                    'employed', 'insurance', 'medicaid', 'adl']
for col in categorical_cols:
    if col in df_untransformed.columns:
        df_untransformed[col] = df_untransformed[col].astype('category')

# create working copy for transformation
df = df_untransformed.copy()

print("BEFORE TRANSFORMATION:")
print("="*50)
print("\nAge Column:")
print(f"  Min: {df['age'].min():.2f}")
print(f"  Max: {df['age'].max():.2f}")
print(f"  Mean: {df['age'].mean():.2f}")
print(f"  → Values are scaled (divided by 10)")

print("\nIncome Column:")
print(f"  Min: {df['income'].min():.2f}")
print(f"  Max: {df['income'].max():.2f}")
print(f"  Mean: {df['income'].mean():.2f}")
print(f"  → Values are in $10,000 units")

# perform transformations on working copy
df['age'] = df['age'] * 10
df['income'] = df['income'] * 10000

print("\n\nAFTER TRANSFORMATION:")
print("="*50)
print("\nAge Column:")
print(f"  Min: {df['age'].min():.1f} years")
print(f"  Max: {df['age'].max():.1f} years")
print(f"  Mean: {df['age'].mean():.1f} years")
print(f"  ✓ Now in actual years")

print("\nIncome Column:")
print(f"  Min: ${df['income'].min():,.2f}")
print(f"  Max: ${df['income'].max():,.2f}")
print(f"  Mean: ${df['income'].mean():,.2f}")
print(f"  ✓ Now in actual dollars")

# investigate negative income
negative_count = len(df[df['income'] < 0])
print(f"\n  NEGATIVE INCOME ANALYSIS:")
print(f"  Total records: {negative_count}")
if negative_count > 0:
    print(f"  Range: ${df[df['income'] < 0]['income'].min():,.2f} to ${df[df['income'] < 0]['income'].max():,.2f}")
    print(f"  Percentage: {(negative_count / len(df) * 100):.2f}%")

print("\n✓ Transformation complete!")
print("\nNote: df_untransformed is preserved if you need to reset")

BEFORE TRANSFORMATION:

Age Column:
  Min: 6.60
  Max: 10.90
  Mean: 7.40
  → Values are scaled (divided by 10)

Income Column:
  Min: -1.01
  Max: 54.84
  Mean: 2.53
  → Values are in $10,000 units


AFTER TRANSFORMATION:

Age Column:
  Min: 66.0 years
  Max: 109.0 years
  Mean: 74.0 years
  ✓ Now in actual years

Income Column:
  Min: $-10,125.00
  Max: $548,351.00
  Mean: $25,271.32
  ✓ Now in actual dollars

  NEGATIVE INCOME ANALYSIS:
  Total records: 3
  Range: $-10,125.00 to $-8,180.00
  Percentage: 0.07%

✓ Transformation complete!

Note: df_untransformed is preserved if you need to reset


---
## Task 5: Basic Statistical Analysis

Perform comprehensive statistical calculations on the transformed data.
Save results as NSMES1988updated.csv.

In [35]:
# select only numeric columns for analysis
numeric_cols = df.select_dtypes(include=[np.number]).columns

print("COMPREHENSIVE STATISTICAL ANALYSIS:")
print("="*50)
print(f"\nAnalyzing {len(numeric_cols)} numeric columns")
print(f"Total records: {len(df)}\n")

# create a comprehensive statistics dictionary
stats_dict = {}

for col in numeric_cols:
    stats_dict[col] = {
        'Count': df[col].count(),
        'Mean': df[col].mean(),
        'Median': df[col].median(),
        'Mode': df[col].mode()[0] if len(df[col].mode()) > 0 else np.nan,
        'Std Dev': df[col].std(),
        'Variance': df[col].var(),
        'Min': df[col].min(),
        'Max': df[col].max(),
        'Range': df[col].max() - df[col].min(),
        'Q1 (25%)': df[col].quantile(0.25),
        'Q2 (50%)': df[col].quantile(0.50),
        'Q3 (75%)': df[col].quantile(0.75),
        'IQR': df[col].quantile(0.75) - df[col].quantile(0.25),
        'Skewness': df[col].skew(),
        'Kurtosis': df[col].kurtosis()
    }

# convert to DataFrame for better display
stats_df = pd.DataFrame(stats_dict).T

# display with nice formatting
print(stats_df.round(2))

# save the transformed dataframe
df.to_csv('outputs/NSMES1988updated.csv', index=False)
print("\n" + "="*50)
print("✓ Statistical analysis complete")
print("✓ Data saved as: NSMES1988updated.csv")

COMPREHENSIVE STATISTICAL ANALYSIS:

Analyzing 10 numeric columns
Total records: 4406

            Count      Mean   Median    Mode   Std Dev      Variance      Min  \
visits     4406.0      5.77      4.0     0.0      6.76  4.569000e+01      0.0   
nvisits    4406.0      1.62      0.0     0.0      5.32  2.827000e+01      0.0   
ovisits    4406.0      0.75      0.0     0.0      3.65  1.334000e+01      0.0   
novisits   4406.0      0.54      0.0     0.0      3.88  1.505000e+01      0.0   
emergency  4406.0      0.26      0.0     0.0      0.70  5.000000e-01      0.0   
hospital   4406.0      0.30      0.0     0.0      0.75  5.600000e-01      0.0   
chronic    4406.0      1.54      1.0     1.0      1.35  1.820000e+00      0.0   
age        4406.0     74.02     73.0    66.0      6.33  4.012000e+01     66.0   
school     4406.0     10.29     11.0    12.0      3.74  1.398000e+01      0.0   
income     4406.0  25271.32  16981.5  4320.0  29246.48  8.553564e+08 -10125.0   

                Max  

## Summary Statistics

The comprehensive statistical analysis reveals the following about our healthcare dataset:

**Population Demographics:**
- **Age**: Mean 74 years, Median 73 years
  - Dataset focuses on elderly population (66-109 years)
  
- **Income**: Mean \$25,271.32, Median \$16,981.50
  - The mean sits well above the median, so the distribution is right-skewed by a small number of high earners
  - High standard deviation (\$29,246.48) indicates significant economic diversity

**Healthcare Utilization Patterns:**
- **Physician Visits**: Average 5.77 visits per person
- **Emergency Visits**: Average 0.26 visits per person  
- **Hospital Stays**: Average 0.30 stays per person
- **Chronic Conditions**: Average 1.54 conditions per person


**Distribution Characteristics:**

| Variable | Skewness | Interpretation |
|----------|----------|----------------|
| Age | 0.89 | Right-skewed |
| Income | 5.93 | Right-skewed |
| Visits | 3.34 | Right-skewed |
| Hospital | 3.97 | Right-skewed |

*Skewness interpretation: >0.5 = right-skewed, <-0.5 = left-skewed, -0.5 to 0.5 = fairly symmetric*
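
The rule of thumb above can be written as a small helper. This is only a sketch: the function name `describe_skew` and the sample values are invented for illustration, and the ±0.5 cut-offs are the ones quoted in the note, not a universal standard.

```python
import pandas as pd

def describe_skew(value: float) -> str:
    """Label a skewness value using the +/-0.5 rule of thumb quoted above."""
    if value > 0.5:
        return 'right-skewed'
    if value < -0.5:
        return 'left-skewed'
    return 'fairly symmetric'

# Invented sample: many small counts and one large one, like the visit columns.
sample = pd.Series([0, 0, 1, 1, 2, 2, 3, 40])
skew = sample.skew()
print(f"skewness={skew:.2f} -> {describe_skew(skew)}")
```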

**Key Observations:**
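1. Income has by far the highest variance (about 8.55 × 10^8), followed by visits (45.69) and age (40.12); emergency (0.50) and hospital (0.56) have the lowest.
2. Mean and median are close for age (74.02 vs 73) and school (10.29 vs 11), but for income (\$25,271 vs \$16,982) and visits (5.77 vs 4) the mean sits well above the median, which points to right skew and high-end outliers.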


**Data Quality**: All transformations verified. Dataset ready for advanced analysis in Session 3.

---
## Task 6: Comparison with Pandas .describe() Method

Compare our manual statistical calculations with pandas built-in describe() function.

In [37]:
# use .describe() and compare with manual analysis

print("PANDAS .describe() METHOD OUTPUT:")
print("="*50)
describe_df = df.describe()
print(describe_df.round(2))

print("\n\nCOMPARISON: MANUAL vs .describe()")
print("="*50)

# what describe() provides
describe_metrics = ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

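# what our manual analysis provides
manual_metrics = ['count', 'mean', 'median', 'mode', 'std', 'variance', 
                  'min', 'max', 'range', 'Q1', 'Q2', 'Q3', 'IQR', 
                  'skewness', 'kurtosis']

# manual metrics that describe() already reports
# (Q1 = 25%, median/Q2 = 50%, Q3 = 75%)
overlapping = {'count', 'mean', 'std', 'min', 'max', 'median', 'Q1', 'Q2', 'Q3'}

print("\n✓ METRICS PROVIDED BY .describe():")
for metric in describe_metrics:
    print(f"   • {metric}")

print("\n✓ ADDITIONAL METRICS FROM MANUAL ANALYSIS:")
# a list comprehension keeps the original order (a set would print in arbitrary order)
manual_only = [m for m in manual_metrics if m not in overlapping]
for metric in manual_only:
    print(f"   • {metric}")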

print("\n\nVERIFICATION (checking if values match):")
print("-"*50)

# pick a few columns to verify
test_cols = ['age', 'income', 'visits']

for col in test_cols:
    if col in df.columns:
        print(f"\n{col.upper()}:")
        manual_mean = stats_df.loc[col, 'Mean']
        describe_mean = describe_df.loc['mean', col]
        match = "✓ MATCH" if abs(manual_mean - describe_mean) < 0.01 else "✗ MISMATCH"
        print(f"   Mean:   Manual={manual_mean:.2f}, describe()={describe_mean:.2f} {match}")
        
        manual_std = stats_df.loc[col, 'Std Dev']
        describe_std = describe_df.loc['std', col]
        match = "✓ MATCH" if abs(manual_std - describe_std) < 0.01 else "✗ MISMATCH"
        print(f"   Std:    Manual={manual_std:.2f}, describe()={describe_std:.2f} {match}")

print("\n" + "="*50)
print("CONCLUSION:")
print("✓ Both methods produce identical results for overlapping metrics")
print("✓ Manual analysis provides deeper insights (mode, variance, IQR, skewness, kurtosis)")
print("✓ .describe() is faster for quick checks")
print("✓ Manual analysis is better for comprehensive reporting")

PANDAS .describe() METHOD OUTPUT:
        visits  nvisits  ovisits  novisits  emergency  hospital  chronic  \
count  4406.00  4406.00  4406.00   4406.00    4406.00   4406.00  4406.00   
mean      5.77     1.62     0.75      0.54       0.26      0.30     1.54   
std       6.76     5.32     3.65      3.88       0.70      0.75     1.35   
min       0.00     0.00     0.00      0.00       0.00      0.00     0.00   
25%       1.00     0.00     0.00      0.00       0.00      0.00     1.00   
50%       4.00     0.00     0.00      0.00       0.00      0.00     1.00   
75%       8.00     1.00     0.00      0.00       0.00      0.00     2.00   
max      89.00   104.00   141.00    155.00      12.00      8.00     8.00   

           age   school     income  
count  4406.00  4406.00    4406.00  
mean     74.02    10.29   25271.32  
std       6.33     3.74   29246.48  
min      66.00     0.00  -10125.00  
25%      69.00     8.00    9121.50  
50%      73.00    11.00   16981.50  
75%      78.00    12.0

### Comparison Analysis

#### Pandas .describe() Method

The `.describe()` method provides quick statistical summaries with 8 standard metrics:
- Count, Mean, Standard Deviation
- Min, 25th percentile, Median (50th), 75th percentile, Max

#### Verification Results

Comparing manual calculations with `.describe()` for key variables:

AGE:
   Mean:   Manual=74.02, describe()=74.02 ✓ MATCH
   Std:    Manual=6.33, describe()=6.33 ✓ MATCH

INCOME:
   Mean:   Manual=25271.32, describe()=25271.32 ✓ MATCH
   Std:    Manual=29246.48, describe()=29246.48 ✓ MATCH

VISITS:
   Mean:   Manual=5.77, describe()=5.77 ✓ MATCH
   Std:    Manual=6.76, describe()=6.76 ✓ MATCH

#### Key Insights

**Advantages of .describe():**
- Fast and convenient
- Standard output format
- Good for quick data exploration

**Advantages of Manual Analysis:**
- Provides additional metrics (Mode, Variance, IQR, Skewness, Kurtosis)
- Better understanding of data distribution shape
- More control over which statistics to calculate
- Essential for detailed reporting

**Conclusion:** Both methods are valid and produce identical results for overlapping metrics. Use `.describe()` for quick checks, manual analysis for comprehensive reports.
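
The two approaches can also be combined in one table. The sketch below uses invented numbers (only the column names come from the dataset) and appends the extra metrics to the `.describe()` output with `DataFrame.agg`. It is one possible layout, not the method used in the cells above.

```python
import pandas as pd

# Invented numbers; column names mirror two of the dataset's numeric columns.
toy = pd.DataFrame({'visits': [0, 1, 4, 8, 30],
                    'age': [66.0, 69.0, 73.0, 78.0, 90.0]})

basic = toy.describe().T                      # count, mean, std, min, quartiles, max
extra = toy.agg(['var', 'skew', 'kurt']).T    # shape metrics that describe() leaves out
extra['iqr'] = basic['75%'] - basic['25%']    # derived from the describe() quartiles
extra['range'] = basic['max'] - basic['min']

report = pd.concat([basic, extra], axis=1)    # one row per column, all metrics side by side
print(report.round(2))
```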

---
## Task 7: Identify Non-Statistical Columns

Determine which columns are not eligible for statistical analysis and recommend appropriate data types.

In [38]:
# Task 7: Identify columns not eligible for statistical analysis

print("COLUMNS NOT ELIGIBLE FOR STATISTICAL ANALYSIS:")
print("="*50)

# Separate columns by type
categorical_cols = df.select_dtypes(include=['object', 'category']).columns
numeric_cols = df.select_dtypes(include=[np.number]).columns

print(f"\nTotal Columns: {len(df.columns)}")
print(f"Numeric Columns: {len(numeric_cols)}")
print(f"Categorical Columns: {len(categorical_cols)}")

print("\n" + "-"*50)
print("CATEGORICAL COLUMNS DETAILS:")
print("-"*50)

categorical_info = []
for col in categorical_cols:
    unique_count = df[col].nunique()
    current_type = df[col].dtype
    top_value = df[col].mode()[0] if len(df[col].mode()) > 0 else 'N/A'
    top_count = df[col].value_counts().iloc[0] if len(df[col]) > 0 else 0
    
    categorical_info.append({
        'Column': col,
        'Current Type': str(current_type),
        'Unique Values': unique_count,
        'Most Common': top_value,
        'Frequency': top_count
    })

cat_df = pd.DataFrame(categorical_info)
print(cat_df.to_string(index=False))

print("\n" + "="*50)
print("WHY CATEGORICAL COLUMNS ARE NOT SUITABLE FOR STATISTICS:")
print("="*50)
print("""
1. MEANING: Categorical variables represent groups/categories, not quantities
   Example: "male" and "female" are labels, not numbers

2. MATHEMATICAL OPERATIONS: Computing mean, std, variance on categories is meaningless
   Example: What's the "average" of ['yes', 'no', 'yes', 'no']? → Nonsense!

3. APPROPRIATE ANALYSIS: Categorical data requires different methods:
   • Frequency counts (how many in each category?)
   • Proportions/Percentages (what % in each category?)
   • Mode (most common category)
   • Cross-tabulations (relationships between categories)
   • Chi-square tests (statistical relationships)

4. ORDERING: Nominal categories have no inherent order, even if coded as numbers (0/1)
   Example: male=1, female=2 doesn't mean female is "greater than" male
   (ordered labels such as health still have no numeric spacing between levels)
""")

print("\n" + "="*50)
print("DATA TYPE RECOMMENDATIONS:")
print("="*50)

recommendations = []
for col in categorical_cols:
    unique_count = df[col].nunique()
    current_type = str(df[col].dtype)
    
    # Recommendation logic (based on the number of distinct labels)
    if unique_count <= 10:
        recommended = 'category'
        reason = f'Only {unique_count} unique values - category dtype is memory efficient'
    elif unique_count <= 50:
        recommended = 'category'
        reason = f'{unique_count} unique values - category still beneficial'
    else:
        recommended = 'object (keep current)'
        reason = f'{unique_count} unique values - too many for category optimization'
    
    recommendations.append({
        'Column': col,
        'Current Type': current_type,
        'Unique Values': unique_count,
        'Recommended': recommended,
        'Reason': reason
    })

rec_df = pd.DataFrame(recommendations)
print(rec_df.to_string(index=False))

print("\n" + "="*50)
print("SUMMARY:")
print(f"• {len(numeric_cols)} columns ARE eligible for statistical analysis")
print(f"• {len(categorical_cols)} columns are NOT eligible (require different analysis methods)")
print(f"• Recommended: Convert {len([r for r in recommendations if r['Recommended'] == 'category'])} columns to category dtype")
print("="*50)

COLUMNS NOT ELIGIBLE FOR STATISTICAL ANALYSIS:

Total Columns: 18
Numeric Columns: 10
Categorical Columns: 8

--------------------------------------------------
CATEGORICAL COLUMNS DETAILS:
--------------------------------------------------
   Column Current Type  Unique Values Most Common  Frequency
   health     category              3     average       3509
      adl     category              2      normal       3507
   region     category              4       other       1614
   gender     category              2      female       2628
  married     category              2         yes       2406
 employed     category              2          no       3951
insurance     category              2         yes       3421
 medicaid     category              2          no       4004

WHY CATEGORICAL COLUMNS ARE NOT SUITABLE FOR STATISTICS:

1. MEANING: Categorical variables represent groups/categories, not quantities
   Example: "male" and "female" are labels, not numbers

2. MATHEMATICAL 

### Task 7: Non-Statistical Columns Analysis

#### Categorical Columns Found

| Column | Current Type | Unique Values | Most Common | Frequency |
|--------|--------------|---------------|-------------|-----------|
| health | category | 3 | average | 3,509 |
| adl | category | 2 | normal | 3,507 |
| region | category | 4 | other | 1,614 |
| gender | category | 2 | female | 2,628 |
| married | category | 2 | yes | 2,406 |
| employed | category | 2 | no | 3,951 |
| insurance | category | 2 | yes | 3,421 |
| medicaid | category | 2 | no | 4,004 |

Found 8 categorical columns total. 6 are binary (yes/no or two categories), and 2 have multiple categories (health has 3, region has 4).

#### Why We Can't Use Statistics on These

These columns have text labels instead of numbers. You can't calculate things like mean or standard deviation on categories - what would "average gender" even mean? Instead, we count how many are in each category and look at percentages.

For these columns, I'll need to use different methods in Session 3 like counting frequencies, making pivot tables, and looking at how they relate to the numerical columns.
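
As a rough preview of those methods, here is a minimal sketch on invented rows. The column names and labels mirror the dataset, but the values do not come from it.

```python
import pandas as pd

# Invented rows; the column names and labels mirror the dataset's categoricals.
toy = pd.DataFrame({
    'gender': ['female', 'male', 'female', 'female', 'male', 'female'],
    'insurance': ['yes', 'yes', 'no', 'yes', 'no', 'yes'],
    'visits': [5, 1, 13, 16, 3, 2],
})

counts = toy['gender'].value_counts()                        # how many in each category
shares = toy['gender'].value_counts(normalize=True) * 100    # percentage in each category
table = pd.crosstab(toy['gender'], toy['insurance'])         # category vs category
by_group = toy.groupby('gender')['visits'].mean()            # numeric column per category

print(counts, shares.round(1), table, by_group, sep="\n\n")
```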

#### Data Type Notes

All these columns are already set to 'category' type which saves memory. This works well because the same values repeat many times (like "yes"/"no" or "male"/"female"), so pandas stores them efficiently as codes instead of repeating the text.
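
To see what "stored as codes" means in practice, a small sketch on an invented yes/no column:

```python
import pandas as pd

# Invented labels standing in for a yes/no column such as insurance.
labels = pd.Series(['yes', 'no', 'yes', 'yes', 'no'], dtype='category')

print(labels.cat.categories.tolist())   # the unique labels, stored once
print(labels.cat.codes.tolist())        # one small integer per row
print(labels.cat.codes.dtype)           # integer width pandas chose for the codes
```

With only a handful of distinct labels, each row costs one small integer instead of a full Python string.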

In [42]:
# apply recommended data type changes

print("APPLYING DATA TYPE OPTIMIZATIONS:")
print("="*50)

# check memory BEFORE optimization
memory_before = df.memory_usage(deep=True).sum() / 1024**2
print(f"\nMemory BEFORE optimization: {memory_before:.2f} MB")

# create optimized copy
df_optimized = df.copy()

print("\nApplying optimizations...")
print("-"*50)

# convert categorical columns with few unique values to category dtype
categorical_candidates = df_optimized.select_dtypes(include=['object']).columns

optimizations_applied = []

for col in categorical_candidates:
    unique_count = df_optimized[col].nunique()
    memory_before_col = df_optimized[col].memory_usage(deep=True) / 1024
    
    if unique_count <= 10:
        df_optimized[col] = df_optimized[col].astype('category')
        memory_after_col = df_optimized[col].memory_usage(deep=True) / 1024
        savings = memory_before_col - memory_after_col
        
        optimizations_applied.append({
            'Column': col,
            'Unique Values': unique_count,
            'Before (KB)': f"{memory_before_col:.2f}",
            'After (KB)': f"{memory_after_col:.2f}",
            'Savings (KB)': f"{savings:.2f}"
        })
        
        print(f"✓ {col}: object → category ({unique_count} unique values)")

# also optimize integer columns if not already optimized
int_cols = df_optimized.select_dtypes(include=['int64']).columns

for col in int_cols:
    max_val = df_optimized[col].max()
    min_val = df_optimized[col].min()
    
    # Determine optimal integer type
    # bounds are inclusive: 255 fits in uint8, 127 in int8, and so on
    if min_val >= 0 and max_val <= 255:
        df_optimized[col] = df_optimized[col].astype('uint8')
        print(f"✓ {col}: int64 → uint8")
    elif min_val >= -128 and max_val <= 127:
        df_optimized[col] = df_optimized[col].astype('int8')
        print(f"✓ {col}: int64 → int8")
    elif min_val >= 0 and max_val <= 65535:
        df_optimized[col] = df_optimized[col].astype('uint16')
        print(f"✓ {col}: int64 → uint16")
    elif min_val >= -32768 and max_val <= 32767:
        df_optimized[col] = df_optimized[col].astype('int16')
        print(f"✓ {col}: int64 → int16")

# optimize float columns
float_cols = df_optimized.select_dtypes(include=['float64']).columns

for col in float_cols:
    df_optimized[col] = df_optimized[col].astype('float32')
    print(f"✓ {col}: float64 → float32")

# check memory AFTER optimization
memory_after = df_optimized.memory_usage(deep=True).sum() / 1024**2

print("\n" + "="*50)
print("OPTIMIZATION RESULTS:")
print("="*50)
print(f"Memory BEFORE: {memory_before:.2f} MB")
print(f"Memory AFTER:  {memory_after:.2f} MB")
print(f"Memory SAVED:  {memory_before - memory_after:.2f} MB")
print(f"Reduction:     {((memory_before - memory_after) / memory_before * 100):.1f}%")

if optimizations_applied:
    print("\n\nDETAILED CATEGORICAL CONVERSIONS:")
    opt_df = pd.DataFrame(optimizations_applied)
    print(opt_df.to_string(index=False))

print("\n" + "="*50)
print("DATA TYPES AFTER OPTIMIZATION:")
print("-"*50)
print(df_optimized.dtypes)

# Optional: Save optimized version
# df_optimized.to_csv('../outputs/NSMES1988updated_optimized.csv', index=False)
# print("\n✓ Optimized data saved as: NSMES1988updated_optimized.csv")

print("\n✓ Optimization complete!")
print("Note: df remains unchanged, df_optimized contains optimized version")

APPLYING DATA TYPE OPTIMIZATIONS:

Memory BEFORE optimization: 0.37 MB

Applying optimizations...
--------------------------------------------------
✓ visits: int64 → uint8
✓ nvisits: int64 → uint8
✓ ovisits: int64 → uint8
✓ novisits: int64 → uint8
✓ emergency: int64 → uint8
✓ hospital: int64 → uint8
✓ chronic: int64 → uint8
✓ school: int64 → uint8
✓ age: float64 → float32
✓ income: float64 → float32

OPTIMIZATION RESULTS:
Memory BEFORE: 0.37 MB
Memory AFTER:  0.10 MB
Memory SAVED:  0.27 MB
Reduction:     72.3%

DATA TYPES AFTER OPTIMIZATION:
--------------------------------------------------
visits          uint8
nvisits         uint8
ovisits         uint8
novisits        uint8
emergency       uint8
hospital        uint8
health       category
chronic         uint8
adl          category
region       category
age           float32
gender       category
married      category
school          uint8
income        float32
employed     category
insurance    category
medicaid     category
dtype: object

### Task 8: Data Type Optimization (Optional)

#### Memory Improvements

Applied optimizations to reduce memory usage:

**Before optimization**: 0.37 MB  
**After optimization**: 0.10 MB  
**Saved**: 0.27 MB (72.3% reduction)

#### What I Changed

**Categorical columns**: Already converted to category type (8 columns)
- These were done earlier - saves memory by storing unique values once with integer codes

**Integer columns**: Downcast from int64 to uint8
- All eight integer columns hold small non-negative counts (the largest value is 155, in novisits), so each one fits in uint8
- None of them needs the huge range that int64 provides

**Float columns**: Changed from float64 to float32
- Age and income don't need super high precision
- float32 is accurate enough and uses half the memory

#### Why This Matters

The original data types were using way more memory than needed. For example, storing the number "5" as int64 uses 8 bytes when uint8 only needs 1 byte. When you have thousands of rows, this adds up fast.
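
The per-value sizes can be checked directly. This sketch uses the row count reported earlier in the notebook (4,406) and an invented column of fives; it illustrates the arithmetic and does not measure the real dataframe.

```python
import numpy as np
import pandas as pd

n_rows = 4406  # row count reported earlier in this notebook

for name in ['int64', 'uint8', 'float64', 'float32']:
    itemsize = np.dtype(name).itemsize          # bytes per element
    print(f"{name:>8}: {itemsize} B/value, {itemsize * n_rows / 1024:.1f} KB per column")

# The same comparison measured on an actual Series (values are invented).
col64 = pd.Series(np.full(n_rows, 5, dtype='int64'))
col8 = col64.astype('uint8')
print(col64.memory_usage(index=False), col8.memory_usage(index=False))
```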

#### Important Note

When I save this as a CSV and reload it later, these optimizations get lost because CSV only stores the actual values, not the data types. I'll need to re-apply these optimizations at the start of Session 3.

Could use .parquet or .pickle files instead to keep the types, but the assignment asks for CSV so I'll just re-optimize each session.
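
One way to stay with CSV and still avoid redoing the conversions by hand is to pass a dtype mapping to `pd.read_csv`. A minimal sketch, using an in-memory buffer and an invented three-row frame (the name `dtype_map` is illustrative, and in practice the mapping would have to be saved alongside the CSV or written out in the notebook):

```python
import io
import pandas as pd

# Toy frame standing in for the real data (values are invented).
toy = pd.DataFrame({
    'visits': pd.Series([5, 1, 13], dtype='uint8'),
    'health': pd.Series(['average', 'poor', 'average'], dtype='category'),
    'income': pd.Series([28810.0, 27478.0, 6532.0], dtype='float32'),
})

# Record the dtypes as plain strings so they can be passed back to read_csv.
dtype_map = {col: str(dtype) for col, dtype in toy.dtypes.items()}

# Round-trip through CSV text held in memory instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer, dtype=dtype_map)
print(restored.dtypes)
```

The categories are rebuilt from whatever labels appear in the file, so any explicit category order would still need to be set again after loading.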

---
## Task 9: Session 2 Summary Report

### Overview

This session focused on fixing the scaling issues from Session 1 and doing a full statistical analysis of the dataset. The data is now ready for the more advanced Pandas analysis in Session 3.

---

### 1. Memory Analysis

**Starting point**: Loaded NSMES1988new.csv from Session 1

**Memory comparison:**
- **Original (Session 1)**: 2.43 MB
- **After re-optimization**: 0.14 MB
- **Memory Reduction**: 2.29 MB
- **Percentage Saved**: 94.4%

**Key finding**: CSV files don't save the optimized data types, so I had to re-apply the optimizations after loading. Next time I could use .parquet format to keep the types, but sticking with CSV for this project.

---

### 2. Data Transformations

#### Age Column
- **Before**: 6.6 - 10.9 (divided by 10)
- **After**: 66 - 109 years (actual ages)
- **How**: Multiplied by 10
- **Result**: Now makes sense - this is an elderly population dataset

#### Income Column  
- **Before**: -1.01 to 54.84 (in \$10,000 units)
- **After**: -\$10,125 to \$548,351 (actual dollars)
- **How**: Multiplied by 10,000
- **Issue found**: 3 people have negative income - need to investigate why

The negative income could be data errors, or it could be real (like business losses). Keeping them in the dataset for now but flagged for review.
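
Flagging without dropping can be done with a boolean column. This is a sketch on invented incomes: the column name `income_negative` is hypothetical and is not something the notebook creates.

```python
import pandas as pd

# Invented incomes in dollars; only the idea of negative values comes from the notebook.
toy = pd.DataFrame({'income': [28810.0, -10125.0, 6532.0, -8180.0]})

# Add a boolean flag instead of dropping rows, so later steps can include or exclude them.
toy['income_negative'] = toy['income'] < 0

flagged = toy[toy['income_negative']]
share = toy['income_negative'].mean() * 100   # mean of booleans = fraction of True values
print(f"{len(flagged)} flagged rows ({share:.1f}% of the sample)")
```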

---

### 3. Statistical Analysis Results

#### Key Demographics

**Population Demographics:**
- **Age**: Mean 74 years, Median 73 years
  - Dataset focuses on elderly population (66-109 years)
  
- **Income**: Mean \$25,271.32, Median \$16,981.50
  - Income distribution shows positive skew with high earners
  - High standard deviation (\$29,246.48) indicates significant economic diversity

- **Education**: Mean 10.29 years of school (Range: 0-18 years)

**Healthcare Utilization Patterns:**
- **Physician Visits**: Average 5.77 visits per person
- **Emergency Visits**: Average 0.26 visits per person  
- **Hospital Stays**: Average 0.30 stays per person
- **Chronic Conditions**: Average 1.54 conditions per person

Most people have low visit counts, but a few people have very high usage which creates right-skewed distributions.

#### Distribution Patterns

| Variable | Mean | Median | Skewness | What This Means |
|----------|------|--------|----------|-----------------|
| Age | 74.0 | 73.0 | 0.89 | Right-skewed |
| Income | \$25,271 | \$16,982 | 5.96 | Right-skewed |
| Visits | 5.77 | 4 | 3.34 | Right-skewed |
| Emergency | 0.26 | 0 | 5.07 | Right-skewed |
| Hospital | 0.30 | 0 | 3.97 | Right-skewed |

**Skewness guide**: Positive = few very high values, Negative = few very low values, Near zero = symmetric

---

### 4. Manual Calculations vs .describe()

I calculated statistics manually (mean, median, std, etc.) and then compared them with the pandas `.describe()` method. All the overlapping numbers matched (mean and std were checked to within 0.01 for age, income and visits), which confirms that the two approaches agree.

**Manual method gave me extra info:**
- Mode, Variance, Range, IQR
- Skewness and Kurtosis (shape of distribution)

**Takeaway**: Use `.describe()` for quick checks, manual calculations for detailed reports.

---

### 5. Categorical vs Numerical Variables

**Found 8 categorical columns** that can't have statistics calculated on them:
- health, adl, region, gender, married, employed, insurance, medicaid

These represent groups/labels, not quantities. Can't calculate "average gender" - doesn't make sense.

**For these I'll use instead:**
- Frequency counts (how many in each category)
- Percentages  
- Cross-tabulations with other variables
- Pivot tables in Session 3

All of these are already optimized as 'category' data type for memory efficiency.

---

### 6. Memory Optimization

Applied optimizations to reduce memory usage:

**Changes made:**
- Categorical columns: converted to category type (stores efficiently)
- Integer columns: downcast to a smaller type (uint8 instead of int64)
- Float columns: changed to float32 (still accurate, half the memory)

**Before optimization**: 0.37 MB  
**After optimization**: 0.10 MB  
**Saved**: 0.27 MB (72.3% reduction)

---

### 7. Key Findings

1. **Elderly population**: Everyone is 66-109 years old
2. **Economic diversity**: Income ranges from -\$10,125 to \$548,351  
3. **Right-skewed distributions**: Most healthcare variables show few people with very high usage
4. **Data quality**: No missing values, but 3 negative income cases to investigate
5. **All transformations verified**: Numbers now make sense and match validation checks

---

### 8. What's Ready for Session 3

**Data saved as**: NSMES1988updated.csv

**Ready to analyze:**
- Age and income now in correct units
- All statistics calculated and documented
- Categorical variables identified
- Memory optimized

**Next steps:**
- Pivot tables by demographics
- Group analysis by categories
- Distribution tables (age/gender, income/region, etc.)
- Investigate negative income cases
- Look at relationships between variables

---

### Completion Checklist

- [x] Loaded and compared memory usage
- [x] Fixed age scaling (×10)
- [x] Fixed income scaling (×10,000)
- [x] Full statistical analysis
- [x] Validated with .describe()
- [x] Identified categorical columns
- [x] Optimized data types
- [x] Saved NSMES1988updated.csv
- [x] Documented everything

---