# Session 1: Data Import and Cleaning Tasks

**Dataset**: NSMES1988.csv

## Tasks:
1. Import the Python libraries needed for the analysis, including NumPy for numerical operations.
2. Import the CSV file – NSMES1988.csv into a dataframe.
3. Inspect the data and report what you find: number of rows and columns, data types, etc.
4. Find out if the data is clean or if the data has missing values.
5. Comment on the data types, their values and range, specifically on age and income columns.
6. Export the data to JSON as NSMES1988.json, view the file, and add your comments.
7. Report the memory usage of the dataframe and recommend non-default data types that would reduce it.
8. What changes would you recommend on the dataframe before attempting a detailed data analysis?
9. Export the dataframe as a new CSV file, NSMES1988new.csv, and store it locally for use in later assignments.
10. Write a short report on the visual observations of the data.

In [9]:
# import libraries
import numpy as np
import pandas as pd
import os

# load dataset
df = pd.read_csv('notebooks/data/NSMES1988.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,visits,nvisits,ovisits,novisits,emergency,hospital,health,chronic,adl,region,age,gender,married,school,income,employed,insurance,medicaid
0,1,5,0,0,0,0,1,average,2,normal,other,6.9,male,yes,6,2.881,yes,yes,no
1,2,1,0,2,0,2,0,average,2,normal,other,7.4,female,yes,10,2.7478,no,yes,no
2,3,13,0,0,0,3,3,poor,4,limited,other,6.6,female,no,10,0.6532,no,no,yes
3,4,16,0,5,0,1,1,poor,2,limited,other,7.6,male,yes,3,0.6588,no,yes,no
4,5,3,0,0,0,0,0,average,2,limited,other,7.9,female,yes,6,0.6588,no,yes,no


In [30]:
# inspect the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4406 entries, 0 to 4405
Data columns (total 19 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  4406 non-null   int64  
 1   visits      4406 non-null   int64  
 2   nvisits     4406 non-null   int64  
 3   ovisits     4406 non-null   int64  
 4   novisits    4406 non-null   int64  
 5   emergency   4406 non-null   int64  
 6   hospital    4406 non-null   int64  
 7   health      4406 non-null   object 
 8   chronic     4406 non-null   int64  
 9   adl         4406 non-null   object 
 10  region      4406 non-null   object 
 11  age         4406 non-null   float64
 12  gender      4406 non-null   object 
 13  married     4406 non-null   object 
 14  school      4406 non-null   int64  
 15  income      4406 non-null   float64
 16  employed    4406 non-null   object 
 17  insurance   4406 non-null   object 
 18  medicaid    4406 non-null   object 
dtypes: float64(2), int64(9), object(8)

In [31]:
df.describe()

Unnamed: 0.1,Unnamed: 0,visits,nvisits,ovisits,novisits,emergency,hospital,chronic,age,school,income
count,4406.0,4406.0,4406.0,4406.0,4406.0,4406.0,4406.0,4406.0,4406.0,4406.0,4406.0
mean,2203.5,5.774399,1.618021,0.750794,0.536087,0.263504,0.29596,1.541988,7.402406,10.290286,2.527132
std,1272.046972,6.759225,5.317056,3.652759,3.879506,0.703659,0.746398,1.349632,0.633405,3.738736,2.924648
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.6,0.0,-1.0125
25%,1102.25,1.0,0.0,0.0,0.0,0.0,0.0,1.0,6.9,8.0,0.91215
50%,2203.5,4.0,0.0,0.0,0.0,0.0,0.0,1.0,7.3,11.0,1.69815
75%,3304.75,8.0,1.0,0.0,0.0,0.0,0.0,2.0,7.8,12.0,3.17285
max,4406.0,89.0,104.0,141.0,155.0,12.0,8.0,8.0,10.9,18.0,54.8351


In [25]:
# more detailed value counts

categorical_cols = df.select_dtypes(include=['object']).columns

for col in categorical_cols:
    print(f"\n{col.upper()}")
    print("="*50)
    
    # Create a DataFrame from value_counts
    counts_df = df[col].value_counts().reset_index()
    counts_df.columns = ['Category', 'Count']
    counts_df['Percentage'] = (counts_df['Count'] / len(df) * 100).round(2)
    
    display(counts_df)


HEALTH


Unnamed: 0,Category,Count,Percentage
0,average,3509,79.64
1,poor,554,12.57
2,excellent,343,7.78



ADL


Unnamed: 0,Category,Count,Percentage
0,normal,3507,79.6
1,limited,899,20.4



REGION


Unnamed: 0,Category,Count,Percentage
0,other,1614,36.63
1,midwest,1157,26.26
2,northeast,837,19.0
3,west,798,18.11



GENDER


Unnamed: 0,Category,Count,Percentage
0,female,2628,59.65
1,male,1778,40.35



MARRIED


Unnamed: 0,Category,Count,Percentage
0,yes,2406,54.61
1,no,2000,45.39



EMPLOYED


Unnamed: 0,Category,Count,Percentage
0,no,3951,89.67
1,yes,455,10.33



INSURANCE


Unnamed: 0,Category,Count,Percentage
0,yes,3421,77.64
1,no,985,22.36



MEDICAID


Unnamed: 0,Category,Count,Percentage
0,no,4004,90.88
1,yes,402,9.12


In [8]:
# check for missing values in each column
df.isnull().sum()

Unnamed: 0    0
visits        0
nvisits       0
ovisits       0
novisits      0
emergency     0
hospital      0
health        0
chronic       0
adl           0
region        0
age           0
gender        0
married       0
school        0
income        0
employed      0
insurance     0
medicaid      0
dtype: int64
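
`isnull()` only reports `NaN`/`None`, so a zero count does not rule out blank strings or duplicated rows. The sketch below shows two extra checks on a small made-up frame; the `sample` data is illustrative and is not taken from NSMES1988.csv.

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only)
sample = pd.DataFrame({
    "health": ["average", "poor", " ", "excellent"],
    "income": [2.881, 0.6532, 0.6588, 0.6588],
})

# isnull() only catches NaN/None; whitespace-only strings slip through
null_counts = sample.isnull().sum()
blank_counts = sample["health"].str.strip().eq("").sum()

# Fully duplicated rows are another quick sanity check
duplicate_rows = sample.duplicated().sum()

print(null_counts.to_dict(), blank_counts, duplicate_rows)
```

On the real dataframe the same idea would be applied to each text column in turn.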

# Data Observations

## Age column:

Min: 6.6, Max: 10.9

Age is stored in units of 10 years (actual age divided by 10).

Actual ages: 66 to 109 years

This is an elderly population dataset.

## Income column:

Min: -1.01, Max: 54.8

Income is stored in units of \$10,000.

Actual income: about -\$10,100 to \$548,000

Negative income is unusual. It may be a data entry error, or it may represent debt or losses, so these rows need investigation.

## Missing values:

No missing values: `df.isnull().sum()` is zero for every column, and all 19 columns have 4406 non-null entries.
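
The unit conversions for age and income can be sketched as follows. The two rows use the minimum and maximum values reported by `df.describe()`; the column names `age_years` and `income_usd` are hypothetical and chosen only for this example.

```python
import pandas as pd

# Toy rows built from the min/max reported by df.describe()
sample = pd.DataFrame({"age": [6.6, 10.9], "income": [-1.0125, 54.8351]})

# Hypothetical column names for the rescaled values
sample["age_years"] = (sample["age"] * 10).round().astype(int)
sample["income_usd"] = (sample["income"] * 10_000).round(2)

print(sample)
```

Rounding before the integer cast avoids floating-point results such as 65.999... being truncated to 65.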

In [14]:
# export to JSON
df.to_json('notebooks/outputs/NSMES1988.json', orient='records', indent=2)
df_json = pd.read_json('notebooks/outputs/NSMES1988.json')
df_json.head()

Unnamed: 0.1,Unnamed: 0,visits,nvisits,ovisits,novisits,emergency,hospital,health,chronic,adl,region,age,gender,married,school,income,employed,insurance,medicaid
0,1,5,0,0,0,0,1,average,2,normal,other,6.9,male,yes,6,2.881,yes,yes,no
1,2,1,0,2,0,2,0,average,2,normal,other,7.4,female,yes,10,2.7478,no,yes,no
2,3,13,0,0,0,3,3,poor,4,limited,other,6.6,female,no,10,0.6532,no,no,yes
3,4,16,0,5,0,1,1,poor,2,limited,other,7.6,male,yes,3,0.6588,no,yes,no
4,5,3,0,0,0,0,0,average,2,limited,other,7.9,female,yes,6,0.6588,no,yes,no
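
One way to comment on the JSON export is to confirm that the values survive a round trip. The sketch below does this in memory on a few illustrative rows rather than on the real file; `sample` and `restored` are names made up for the example.

```python
import io
import pandas as pd

sample = pd.DataFrame({
    "visits": [5, 1, 13],
    "health": ["average", "average", "poor"],
    "age": [6.9, 7.4, 6.6],
    "income": [2.881, 2.7478, 0.6532],
})

# With no path given, to_json returns the JSON text as a string
json_text = sample.to_json(orient="records", indent=2)

# Wrap the string in StringIO so read_json receives a file-like object
restored = pd.read_json(io.StringIO(json_text))

# check_dtype=False: only the values need to match, not the exact dtypes
pd.testing.assert_frame_equal(sample, restored, check_dtype=False)
print(f"{len(json_text)} characters of JSON for {len(sample)} rows")
```

With `orient='records'` the column names are repeated in every record, so the JSON text is usually noticeably larger than the equivalent CSV.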


## Memory Analysis and Optimization

Goal: Understand how much memory our DataFrame uses and optimize it.
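
Before optimizing, it helps to see which columns use the most memory. The following is a minimal sketch on illustrative data (not the real dataframe) using `memory_usage(deep=True)` per column.

```python
import pandas as pd

sample = pd.DataFrame({
    "visits": [5, 1, 13, 16],
    "gender": ["male", "female", "female", "male"],
    "income": [2.881, 2.7478, 0.6532, 0.6588],
})

# deep=True also counts the Python string objects behind text columns
per_column = sample.memory_usage(deep=True)
print(per_column)

# The same text column stored as a category, for comparison
as_category = sample["gender"].astype("category").memory_usage(deep=True)
print(f"gender as category: {as_category} bytes")
```

On a frame this small the category version may not be smaller, because the category labels carry a fixed overhead; the saving shows up when a few labels repeat over thousands of rows.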

In [36]:
# Create optimized copy
df_optimized = df.copy()

# Convert object columns to category (if they have limited unique values)
# Category is efficient for columns with repeated values like gender, region
categorical_candidates = ['health', 'gender', 'married', 'region', 
                          'employed', 'insurance', 'medicaid', 'adl']

for col in categorical_candidates:
    if col in df_optimized.columns:
        df_optimized[col] = df_optimized[col].astype('category')

# Optimize integer columns based on their ranges
int_cols = ['visits', 'nvisits', 'ovisits', 'novisits', 
            'emergency', 'hospital', 'chronic', 'school']

for col in int_cols:
    if col in df_optimized.columns:
        max_val = df_optimized[col].max()
        min_val = df_optimized[col].min()
        
        # Choose the smallest integer type whose range covers the column
        # (bounds are inclusive: uint8 holds 0..255, int8 holds -128..127)
        if min_val >= 0 and max_val <= 255:
            df_optimized[col] = df_optimized[col].astype('uint8')
        elif min_val >= -128 and max_val <= 127:
            df_optimized[col] = df_optimized[col].astype('int8')
        elif min_val >= 0 and max_val <= 65535:
            df_optimized[col] = df_optimized[col].astype('uint16')
        elif min_val >= -32768 and max_val <= 32767:
            df_optimized[col] = df_optimized[col].astype('int16')
        else:
            df_optimized[col] = df_optimized[col].astype('int32')

# Optimize float columns (age and income can use float32 instead of float64)
float_cols = ['age', 'income']
for col in float_cols:
    if col in df_optimized.columns:
        df_optimized[col] = df_optimized[col].astype('float32')

# Compare memory usage
print("MEMORY COMPARISON:")
original_memory = df.memory_usage(deep=True).sum() / 1024**2
optimized_memory = df_optimized.memory_usage(deep=True).sum() / 1024**2

print(f"Original: {original_memory:.2f} MB")
print(f"Optimized: {optimized_memory:.2f} MB")
print(f"Reduction: {((original_memory - optimized_memory) / original_memory * 100):.1f}%")

MEMORY COMPARISON:
Original: 2.43 MB
Optimized: 0.14 MB
Reduction: 94.4%
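
As an alternative to hand-written range checks, pandas can pick the smallest numeric type itself through `pd.to_numeric(..., downcast=...)`. This is a sketch on illustrative values, not the method used in the cell above.

```python
import pandas as pd

sample = pd.DataFrame({
    "visits": [5, 1, 13, 89],
    "novisits": [0, 0, 0, 155],
    "income": [2.881, 2.7478, 0.6532, -1.0125],
})

compact = sample.copy()
for col in ["visits", "novisits"]:
    # downcast="unsigned" picks the smallest unsigned type that fits the values
    compact[col] = pd.to_numeric(compact[col], downcast="unsigned")

# downcast="float" goes no smaller than float32
compact["income"] = pd.to_numeric(compact["income"], downcast="float")

print(compact.dtypes)
```

The trade-off is less explicit control: the chosen type depends on the data present, so a later value outside the range would need a wider type again.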


# RECOMMENDATIONS FOR DATA CLEANING:

## 1. **Negative Income: 3 rows**
   → Decision needed: Remove or investigate?
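
Either choice starts from locating the affected rows. The sketch below, on illustrative data, builds a boolean mask and shows both options; the column name `income_negative` is hypothetical.

```python
import pandas as pd

sample = pd.DataFrame({
    "age": [6.9, 7.4, 6.6],
    "income": [2.881, -1.0125, 0.6532],
})

# Boolean mask of rows with negative income
negative_mask = sample["income"] < 0
print(f"Rows with negative income: {negative_mask.sum()}")

# Option 1: keep the rows but mark them for later inspection
flagged = sample.assign(income_negative=negative_mask)

# Option 2: work on a copy without those rows
non_negative = sample.loc[~negative_mask].reset_index(drop=True)
```

Flagging keeps the decision reversible, which may be preferable while the cause of the negative values is still unknown.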


## 2. **Age Scaling: Currently divided by 10 (range: 6.6-10.9)**
   → Recommendation: Multiply by 10 to get actual ages


## 3. **Income Scaling: Currently in $10,000 units (range: -1.0125-54.8351)**
   → Recommendation: Multiply by 10,000 to get actual dollars


## 4. **Categorical Variables: Check for consistency**

   health: ['average' 'poor' 'excellent']

   
   adl: ['normal' 'limited']

   
   region: ['other' 'midwest' 'northeast' 'west']

   
   gender: ['male' 'female']

   
   married: ['yes' 'no']

   
   employed: ['yes' 'no']

   
   insurance: ['yes' 'no']

   
   medicaid: ['no' 'yes']
   
   → Recommendation: The values above show no inconsistent spellings; standardizing categories (lowercase, trim spaces) is still a useful safeguard
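
A minimal sketch of that standardization, using deliberately messy made-up values to show the effect (the real columns may not need any change):

```python
import pandas as pd

sample = pd.DataFrame({
    "gender": ["male", " Female", "female ", "MALE"],
    "married": ["yes", "No", "no", "yes"],
})

for col in ["gender", "married"]:
    # Trim surrounding whitespace, then lowercase every label
    sample[col] = sample[col].str.strip().str.lower()

print({col: sorted(sample[col].unique()) for col in sample.columns})
```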


## 5. **Data Types: Use optimized types for memory efficiency**
   → Recommendation: Convert to category/smaller int types


## 6. **Potential Outliers:**
   visits: 43 values above 99th percentile (31.0)

   
   emergency: 25 values above 99th percentile (3.0)

   
   hospital: 42 values above 99th percentile (3.0)
   
   → Recommendation: Investigate extreme values
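
Counts like these can be obtained by comparing each column against its own 99th percentile. The sketch below uses made-up count data and is only one possible way to do it.

```python
import pandas as pd

# Illustrative count data: mostly small values plus one extreme value
sample = pd.DataFrame({"visits": [0] * 60 + [1] * 30 + [2] * 10 + [40]})

# 99th percentile of the column, then the values strictly above it
threshold = sample["visits"].quantile(0.99)
above = sample.loc[sample["visits"] > threshold, "visits"]

print(f"threshold={threshold}, values above={len(above)}")
```

A percentile cut-off always flags roughly the top 1% of rows, so it marks candidates for inspection rather than confirmed errors.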

# Export Optimized Version

In [39]:
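# Export the optimized version
# Note: CSV is plain text, so the optimized dtypes are not stored in the file
df_optimized.to_csv('notebooks/outputs/NSMES1988new.csv', index=False)

# Verify export by reading the file back and comparing shapes
df_verify = pd.read_csv('notebooks/outputs/NSMES1988new.csv')
assert df_verify.shape == df_optimized.shape
print(f"Exported successfully (in-memory size: {optimized_memory:.2f} MB)")
print("File saved to: notebooks/outputs/NSMES1988new.csv")

Exported successfully (in-memory size: 0.14 MB)
File saved to: notebooks/outputs/NSMES1988new.csv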
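
When the CSV is loaded again, pandas infers default types, so the compact dtypes have to be re-applied. A sketch of doing that with the `dtype` argument of `read_csv`, using an in-memory buffer and illustrative rows instead of the real file:

```python
import io
import pandas as pd

optimized = pd.DataFrame({
    "visits": pd.Series([5, 1, 13], dtype="uint8"),
    "gender": pd.Series(["male", "female", "female"], dtype="category"),
    "income": pd.Series([2.881, 2.7478, 0.6532], dtype="float32"),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
optimized.to_csv(buffer, index=False)

# Plain re-read: pandas infers default types again
buffer.seek(0)
plain = pd.read_csv(buffer)

# Re-read with an explicit dtype mapping to restore the compact types
buffer.seek(0)
dtype_map = {"visits": "uint8", "gender": "category", "income": "float32"}
restored = pd.read_csv(buffer, dtype=dtype_map)

print(plain.dtypes, restored.dtypes, sep="\n\n")
```

Keeping such a mapping next to the exported file is one way to carry the optimization into later sessions.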


## Observations Report

### Dataset Overview
- **Total Records**: 4406 rows
- **Total Features**: 19 columns
- **Missing Values**: None detected
- **Memory Usage**: Original: 2.43 MB | Optimized: 0.14 MB | Reduction: 94.4%

### Key Findings

#### 1. **Age Distribution**
- Range: 6.6 to 10.9 (scaled, actual: 66-109 years)
- This is an elderly population dataset
- All individuals are senior citizens

#### 2. **Income Distribution**  
- Range: -1.01 to 54.8 (scaled, actual: -\$10,100 to \$548,000)
- Negative values detected - requires investigation
- Wide income variance in population

#### 3. **Data Quality**
- No missing values in any of the 19 columns
- Age and income are stored in scaled units (age ÷ 10, income in \$10,000 units) and should be converted before analysis
- Data types can be optimized for memory efficiency

#### 4. **Categorical Variables**
- All eight categorical columns have 2 to 4 distinct values; see the summary table below

### Recommendations for Next Steps
1. Correct age and income scaling in Session 2
2. Investigate negative income values
3. Apply memory-optimized data types
4. Perform statistical analysis on corrected values
5. Explore relationships between demographics and healthcare usage

In [29]:
# categorical variables
categorical_cols = df.select_dtypes(include=['object']).columns

summary_data = []
for col in categorical_cols:
    summary_data.append({
        'Column': col,
        'Unique Values': df[col].nunique(),
        'Categories': ', '.join(map(str, df[col].unique()))
    })

summary_df = pd.DataFrame(summary_data)
summary_df

Unnamed: 0,Column,Unique Values,Categories
0,health,3,"average, poor, excellent"
1,adl,2,"normal, limited"
2,region,4,"other, midwest, northeast, west"
3,gender,2,"male, female"
4,married,2,"yes, no"
5,employed,2,"yes, no"
6,insurance,2,"yes, no"
7,medicaid,2,"no, yes"
