# Setup: Generate Sample Dataset

This cell creates the `data/raw/` and `data/processed/` folders one level above the notebook's directory (the code uses `../data/raw` and `../data/processed`) and builds a small sample dataset that contains missing values.
The dataset is saved to `data/raw/sample_data.csv` so the cleaning functions have input to work on. If the file already exists, it is left untouched.

In [1]:
import os
import pandas as pd
import numpy as np

# Define folder paths relative to this notebook
raw_dir = '../data/raw'
processed_dir = '../data/processed'

# Create folders if they don't exist
os.makedirs(raw_dir, exist_ok=True)
os.makedirs(processed_dir, exist_ok=True)

# Define the sample data
data = {
    'age': [34, 45, 29, 50, 38, np.nan, 41],
    'income': [55000, np.nan, 42000, 58000, np.nan, np.nan, 49000],
    'score': [0.82, 0.91, np.nan, 0.76, 0.88, 0.65, 0.79],
    'zipcode': ['90210', '10001', '60614', '94103', '73301', '12345', '94105'],
    'city': ['Beverly', 'New York', 'Chicago', 'SF', 'Austin', 'Unknown', 'San Francisco'],
    'extra_data': [np.nan, 42, np.nan, np.nan, np.nan, 5, np.nan]
}

# Create DataFrame
df = pd.DataFrame(data)

# Save to CSV in raw data folder
csv_path = os.path.join(raw_dir, 'sample_data.csv')
if not os.path.exists(csv_path):
    df.to_csv(csv_path, index=False)
    print(f'Sample dataset created and saved to {csv_path}')
else:
    print(f'File already exists at {csv_path}. Skipping CSV creation to avoid overwrite.')


File already exists at ../data/raw/sample_data.csv. Skipping CSV creation to avoid overwrite.


# Homework Stage 6: Data Preprocessing
In this notebook, we load the sample data, preprocess it, and save the results to the processed data folder.

In [1]:
import os
import sys

import pandas as pd

project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
sys.path.append(project_root)
from src import cleaning


## Load Raw Dataset
The raw data is read in and its first rows are examined.

In [2]:
df = pd.read_csv('../data/raw/sample_data.csv')
df.head()

    age   income  score  zipcode      city  extra_data
0  34.0  55000.0   0.82    90210   Beverly         NaN
1  45.0      NaN   0.91    10001  New York        42.0
2  29.0  42000.0    NaN    60614   Chicago         NaN
3  50.0  58000.0   0.76    94103        SF         NaN
4  38.0      NaN   0.88    73301    Austin         NaN
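
Beyond `head()`, counting missing values per column and per row shows where the gaps are. The sketch below rebuilds the sample data inline so it runs on its own; the names `sample`, `na_per_column`, and `na_per_row` are illustrative and not part of the notebook. In the notebook, the same calls would be made on the loaded `df`.

```python
import numpy as np
import pandas as pd

# Rebuild the sample data inline so this sketch is self-contained
sample = pd.DataFrame({
    'age': [34, 45, 29, 50, 38, np.nan, 41],
    'income': [55000, np.nan, 42000, 58000, np.nan, np.nan, 49000],
    'score': [0.82, 0.91, np.nan, 0.76, 0.88, 0.65, 0.79],
    'zipcode': ['90210', '10001', '60614', '94103', '73301', '12345', '94105'],
    'city': ['Beverly', 'New York', 'Chicago', 'SF', 'Austin', 'Unknown', 'San Francisco'],
    'extra_data': [np.nan, 42, np.nan, np.nan, np.nan, 5, np.nan],
})

# Number of missing values in each column
na_per_column = sample.isna().sum()
print(na_per_column)

# Number of missing values in each row (useful when choosing a row-drop limit)
na_per_row = sample.isna().sum(axis=1)
print(na_per_row)
```

The per-row counts are the quantity a row-dropping rule compares against its limit.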


## Apply Cleaning Functions and Compare Data

Now we apply three functions from our `src/cleaning.py` module:
- `fill_missing_median`: replaces NAs in each numeric column with the median of that column's non-NA values
- `drop_missing`: drops rows whose NA count exceeds a limit set by the user
- `normalize_data`: standardizes the columns passed in by the user, and prints an error if a non-numeric column is passed
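
The contents of `src/cleaning.py` are not shown here, so the following is only a minimal sketch of how three functions with these behaviors could be written using pandas alone. The signatures, the optional `columns` parameter, and the use of the population standard deviation (`ddof=0`, which is what scikit-learn's `StandardScaler` uses) are assumptions, not the module's actual code.

```python
import numpy as np
import pandas as pd


def fill_missing_median(df, columns=None):
    """Return a copy with NaNs in numeric columns replaced by each column's median."""
    out = df.copy()
    if columns is None:
        # Assumption: default to every numeric column
        columns = out.select_dtypes(include='number').columns
    for col in columns:
        out[col] = out[col].fillna(out[col].median())
    return out


def drop_missing(df, max_na):
    """Return a copy keeping only rows with at most `max_na` missing values."""
    return df[df.isna().sum(axis=1) <= max_na].copy()


def normalize_data(df, columns):
    """Return a copy with the given numeric columns converted to z-scores."""
    out = df.copy()
    for col in columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            print(f'Error: column {col!r} is not numeric; skipping.')
            continue
        # Assumption: ddof=0 (population std), as StandardScaler does
        out[col] = (out[col] - out[col].mean()) / out[col].std(ddof=0)
    return out


# Small demonstration on three of the sample columns
demo = pd.DataFrame({
    'age': [34, 45, 29, 50, 38, np.nan, 41],
    'income': [55000, np.nan, 42000, 58000, np.nan, np.nan, 49000],
    'city': ['Beverly', 'New York', 'Chicago', 'SF', 'Austin', 'Unknown', 'San Francisco'],
})
filled = fill_missing_median(demo)
kept = drop_missing(demo, 1)
scaled = normalize_data(demo, ['age', 'income'])
print(filled)
print(kept)
print(scaled)
```

Each function returns a copy, so the original DataFrame can be reused for the next step, which is how the cell below calls all three on the same `df`.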

In [13]:
df_medianfill = cleaning.fill_missing_median(df)
print(df_medianfill)

limitNA = 1 # This variable sets the max number of NAs allowed in any given row
df_missing = cleaning.drop_missing(df, limitNA)
print(df_missing)

df_normal = cleaning.normalize_data(df, ['age','income'])
print(df_normal)

print(df)


    age   income  score  zipcode           city  extra_data
0  34.0  55000.0  0.820    90210        Beverly        23.5
1  45.0  52000.0  0.910    10001       New York        42.0
2  29.0  42000.0  0.805    60614        Chicago        23.5
3  50.0  58000.0  0.760    94103             SF        23.5
4  38.0  52000.0  0.880    73301         Austin        23.5
5  39.5  52000.0  0.650    12345        Unknown         5.0
6  41.0  49000.0  0.790    94105  San Francisco        23.5
    age   income  score  zipcode           city  extra_data
0  34.0  55000.0   0.82    90210        Beverly         NaN
1  45.0      NaN   0.91    10001       New York        42.0
3  50.0  58000.0   0.76    94103             SF         NaN
6  41.0  49000.0   0.79    94105  San Francisco         NaN
        age    income  score  zipcode           city  extra_data
0 -0.797325  0.653197   0.82    90210        Beverly         NaN
1  0.797325       NaN   0.91    10001       New York        42.0
2 -1.522165 -1.469694    

## Save Cleaned Dataset
The cleaned datasets are then saved to `data/processed/`, one file per cleaning method.

In [11]:
df_medianfill.to_csv('../data/processed/sample_data_medianfilled.csv', index=False)
df_missing.to_csv('../data/processed/sample_data_missingremoved.csv', index=False)
df_normal.to_csv('../data/processed/sample_data_normalized.csv', index=False)
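
One way to confirm a file was written as intended is to read it back and compare it with the DataFrame in memory. The sketch below is illustrative only: it writes a tiny hypothetical frame (`cleaned`) to a temporary directory instead of `../data/processed/`, and it passes `dtype={'zipcode': str}` on the assumption that ZIP codes should stay as strings.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for one of the cleaned DataFrames
cleaned = pd.DataFrame({
    'age': [34.0, 45.0, 39.5],
    'zipcode': ['90210', '10001', '12345'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'sample_data_medianfilled.csv')
    cleaned.to_csv(out_path, index=False)
    # Read ZIP codes back as strings so any leading zeros would be preserved
    reloaded = pd.read_csv(out_path, dtype={'zipcode': str})

# Raises AssertionError if the round trip changed the data
pd.testing.assert_frame_equal(cleaned, reloaded)
print(reloaded.dtypes)
```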

## Assumptions 

- Removing or imputing values assumes that the fact that a value is missing carries no information of its own
- `StandardScaler()` does not require normally distributed data, but it relies on the mean and standard deviation, which skew and outliers can distort, so the distribution should be checked first
- How many values you drop also matters: removing too many rows can bias the results
- Future datasets need to be in the same format
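
Some of these assumptions can be checked in code before cleaning. The sketch below is one possible approach, not part of `src/cleaning.py`: the helper names (`check_schema`, `fraction_dropped`, `add_missing_flags`) and the `EXPECTED_COLUMNS` list are hypothetical, with the column names taken from the sample dataset.

```python
import numpy as np
import pandas as pd

# Column names taken from the sample dataset; treated here as the expected format
EXPECTED_COLUMNS = ['age', 'income', 'score', 'zipcode', 'city', 'extra_data']


def check_schema(df, expected=EXPECTED_COLUMNS):
    """Return the expected columns that are absent from df (empty list if none)."""
    return [col for col in expected if col not in df.columns]


def fraction_dropped(before, after):
    """Return the share of rows removed by a row-dropping step."""
    return 1 - len(after) / len(before)


def add_missing_flags(df, columns):
    """Return a copy with a boolean `<col>_was_missing` column per listed column."""
    out = df.copy()
    for col in columns:
        out[f'{col}_was_missing'] = out[col].isna()
    return out


demo = pd.DataFrame({
    'income': [55000, np.nan, 42000, 58000, np.nan, np.nan, 49000],
})

# Record which values were missing before any imputation overwrites them
flags = add_missing_flags(demo, ['income'])

# Columns a new dataset would need in order to match the sample format
missing_cols = check_schema(demo)

# How much data a strict drop would remove
dropped = fraction_dropped(demo, demo.dropna())

# Sample skewness as a rough check before standardizing
skew = demo['income'].skew()

print(flags)
print(missing_cols)
print(f'fraction dropped: {dropped:.2f}, skewness: {skew:.2f}')
```

Keeping the `_was_missing` flags alongside imputed values lets a later analysis test whether missingness itself is related to the outcome.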