# Setup: Generate Sample Dataset

This cell creates the required folder structure (`data/raw/` and `data/processed/`, one level above the notebook's directory) and builds a sample dataset with missing values.
If `../data/raw/sample_data.csv` does not already exist, the dataset is written there, ready for the cleaning functions; an existing file is left untouched.

In [8]:
import os
import pandas as pd
import numpy as np

# Define folder paths relative to this notebook
raw_dir = '../data/raw'
processed_dir = '../data/processed'

# Create folders if they don't exist
os.makedirs(raw_dir, exist_ok=True)
os.makedirs(processed_dir, exist_ok=True)

# Define the sample data
data = {
    'age': [34, 45, 29, 50, 38, np.nan, 41],
    'income': [55000, np.nan, 42000, 58000, np.nan, np.nan, 49000],
    'score': [0.82, 0.91, np.nan, 0.76, 0.88, 0.65, 0.79],
    'zipcode': ['90210', '10001', '60614', '94103', '73301', '12345', '94105'],
    'city': ['Beverly', 'New York', 'Chicago', 'SF', 'Austin', 'Unknown', 'San Francisco'],
    'extra_data': [np.nan, 42, np.nan, np.nan, np.nan, 5, np.nan]
}

# Create DataFrame
df = pd.DataFrame(data)

# Save to CSV in raw data folder
csv_path = os.path.join(raw_dir, 'sample_data.csv')
if not os.path.exists(csv_path):
    df.to_csv(csv_path, index=False)
    print(f'Sample dataset created and saved to {csv_path}')
else:
    print(f'File already exists at {csv_path}. Skipping CSV creation to avoid overwrite.')


File already exists at ../data/raw\sample_data.csv. Skipping CSV creation to avoid overwrite.


# Homework Starter — Stage 6: Data Preprocessing
Use this notebook to apply your cleaning functions and save processed data.

In [11]:
import sys
import os

# Add the src folder to Python path
sys.path.append(os.path.abspath('../src'))

from cleaning import fill_missing_median, drop_missing, normalize_data
import pandas as pd

# Load raw dataset
df = pd.read_csv('../data/raw/sample_data.csv')
print("Original DataFrame:\n", df)

# Fill missing numeric values (age, income, score, extra_data)
df = fill_missing_median(df, [...])
print("\nAfter filling missing values:\n", df)

# Drop rows with >50% missing values
df = drop_missing(df, threshold=0.5)
print("\nAfter dropping rows with too many missing values:\n", df)

df = normalize_data(df, ['age', 'income', 'score', 'extra_data'])
print("\nAfter normalization:\n", df)

df.to_csv('../data/processed/sample_data_cleaned.csv', index=False)
print("Cleaned dataset saved to data/processed/")


Original DataFrame:
     age   income  score  zipcode           city  extra_data
0  34.0  55000.0   0.82    90210        Beverly         NaN
1  45.0      NaN   0.91    10001       New York        42.0
2  29.0  42000.0    NaN    60614        Chicago         NaN
3  50.0  58000.0   0.76    94103             SF         NaN
4  38.0      NaN   0.88    73301         Austin         NaN
5   NaN      NaN   0.65    12345        Unknown         5.0
6  41.0  49000.0   0.79    94105  San Francisco         NaN

After filling missing values:
     age   income  score  zipcode           city  extra_data
0  34.0  55000.0   0.82    90210        Beverly         NaN
1  45.0      NaN   0.91    10001       New York        42.0
2  29.0  42000.0    NaN    60614        Chicago         NaN
3  50.0  58000.0   0.76    94103             SF         NaN
4  38.0      NaN   0.88    73301         Austin         NaN
5   NaN      NaN   0.65    12345        Unknown         5.0
6  41.0  49000.0   0.79    94105  San Francisco

## Load Raw Dataset

In [12]:
# df = pd.read_csv('../data/raw/sample_data.csv')
# df.head()

import pandas as pd

# Original raw data
original_df = pd.read_csv('../data/raw/sample_data.csv')

# Cleaned data
cleaned_df = pd.read_csv('../data/processed/sample_data_cleaned.csv')


In [13]:
print("Original Data Info:")
print(original_df.info())

print("\nCleaned Data Info:")
print(cleaned_df.info())


Original Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   age         6 non-null      float64
 1   income      4 non-null      float64
 2   score       6 non-null      float64
 3   zipcode     7 non-null      int64  
 4   city        7 non-null      object 
 5   extra_data  2 non-null      float64
dtypes: float64(4), int64(1), object(1)
memory usage: 464.0+ bytes
None

Cleaned Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   age         6 non-null      float64
 1   income      4 non-null      float64
 2   score       6 non-null      float64
 3   zipcode     7 non-null      int64  
 4   city        7 non-null      object 
 5   extra_data  2 non-null      float64
dtypes: float64(4), int64(1), ob

In [14]:
print("Original Data Summary:")
print(original_df.describe())

print("\nCleaned Data Summary:")
print(cleaned_df.describe())


Original Data Summary:
             age        income     score      zipcode  extra_data
count   6.000000      4.000000  6.000000      7.00000    2.000000
mean   39.500000  51000.000000  0.801667  62097.00000   23.500000
std     7.556454   7071.067812  0.092826  36869.63632   26.162951
min    29.000000  42000.000000  0.650000  10001.00000    5.000000
25%    35.000000  47250.000000  0.767500  36479.50000   14.250000
50%    39.500000  52000.000000  0.805000  73301.00000   23.500000
75%    44.000000  55750.000000  0.865000  92156.50000   32.750000
max    50.000000  58000.000000  0.910000  94105.00000   42.000000

Cleaned Data Summary:
            age    income     score      zipcode  extra_data
count  6.000000  4.000000  6.000000      7.00000    2.000000
mean   0.500000  0.562500  0.583333  62097.00000    0.500000
std    0.359831  0.441942  0.357023  36869.63632    0.707107
min    0.000000  0.000000  0.000000  10001.00000    0.000000
25%    0.285714  0.328125  0.451923  36479.50000    0.2

In [15]:
comparison = pd.concat([original_df.describe(), cleaned_df.describe()], axis=1, keys=['Original', 'Cleaned'])
print(comparison)


        Original                                                   Cleaned  \
             age        income     score      zipcode extra_data       age   
count   6.000000      4.000000  6.000000      7.00000   2.000000  6.000000   
mean   39.500000  51000.000000  0.801667  62097.00000  23.500000  0.500000   
std     7.556454   7071.067812  0.092826  36869.63632  26.162951  0.359831   
min    29.000000  42000.000000  0.650000  10001.00000   5.000000  0.000000   
25%    35.000000  47250.000000  0.767500  36479.50000  14.250000  0.285714   
50%    39.500000  52000.000000  0.805000  73301.00000  23.500000  0.500000   
75%    44.000000  55750.000000  0.865000  92156.50000  32.750000  0.714286   
max    50.000000  58000.000000  0.910000  94105.00000  42.000000  1.000000   

                                                   
         income     score      zipcode extra_data  
count  4.000000  6.000000      7.00000   2.000000  
mean   0.562500  0.583333  62097.00000   0.500000  
std    0.44

## Apply Cleaning Functions

In [16]:
# TODO: Apply your functions here
# Example:
# df = cleaning.fill_missing_median(df, ['col1','col2'])
# df = cleaning.drop_missing(df, threshold=0.5)
# df = cleaning.normalize_data(df, ['col1','col2'])

In [18]:
import sys
import os

sys.path.append(os.path.abspath('../src'))

from cleaning import fill_missing_median, drop_missing, normalize_data
import pandas as pd


df = pd.read_csv('../data/raw/sample_data.csv')  # adjust path if notebook location differs
print("Original DataFrame:\n", df)


numeric_cols = ['age', 'income', 'score', 'extra_data']
df = fill_missing_median(df, numeric_cols)
print("After filling missing values:\n", df)

df = drop_missing(df, threshold=0.5)
print("After dropping rows with >50% missing values:\n", df)


df = normalize_data(df, numeric_cols)
print("After normalization:\n", df)


Original DataFrame:
     age   income  score  zipcode           city  extra_data
0  34.0  55000.0   0.82    90210        Beverly         NaN
1  45.0      NaN   0.91    10001       New York        42.0
2  29.0  42000.0    NaN    60614        Chicago         NaN
3  50.0  58000.0   0.76    94103             SF         NaN
4  38.0      NaN   0.88    73301         Austin         NaN
5   NaN      NaN   0.65    12345        Unknown         5.0
6  41.0  49000.0   0.79    94105  San Francisco         NaN
After filling missing values:
     age   income  score  zipcode           city  extra_data
0  34.0  55000.0  0.820    90210        Beverly        23.5
1  45.0  52000.0  0.910    10001       New York        42.0
2  29.0  42000.0  0.805    60614        Chicago        23.5
3  50.0  58000.0  0.760    94103             SF        23.5
4  38.0  52000.0  0.880    73301         Austin        23.5
5  39.5  52000.0  0.650    12345        Unknown         5.0
6  41.0  49000.0  0.790    94105  San Francisco 

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_copy[col].fillna(median_value, inplace=True)
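
The warning above points at the `df_copy[col].fillna(median_value, inplace=True)` line and suggests assigning the result back to the column instead. A minimal sketch of a median fill written that way is shown below. `fill_missing_median_sketch` is a hypothetical name used for illustration; it is not the actual implementation in `src/cleaning.py`, and it assumes the function is meant to work on a copy of the input.

```python
import numpy as np
import pandas as pd


def fill_missing_median_sketch(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy of df with NaNs in `columns` replaced by each column's median."""
    df_copy = df.copy()
    for col in columns:
        median_value = df_copy[col].median()  # NaNs are skipped by default
        # Assign the filled column back instead of using fillna(..., inplace=True)
        df_copy[col] = df_copy[col].fillna(median_value)
    return df_copy


# The 'age' column from the sample dataset
sample = pd.DataFrame({"age": [34, 45, 29, 50, 38, np.nan, 41]})
filled = fill_missing_median_sketch(sample, ["age"])
print(filled["age"].tolist())  # [34.0, 45.0, 29.0, 50.0, 38.0, 39.5, 41.0]
```

Because the result is assigned back to `df_copy[col]`, the fill does not depend on whether the intermediate column object is a view or a copy.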


**DOCUMENTATION**

1️⃣ Number of rows

Original: 7 rows

Cleaned: 7 rows

Observation:

No rows were dropped because none had >50% missing values.
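
The row count can be checked directly against the raw values. The sketch below assumes `threshold` is the largest fraction of missing values a row may have and still be kept; `drop_missing_sketch` is a hypothetical stand-in, not the function from `src/cleaning.py`.

```python
import numpy as np
import pandas as pd


def drop_missing_sketch(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Keep rows whose fraction of missing values is at most `threshold`."""
    missing_fraction = df.isna().mean(axis=1)  # per-row share of NaN cells
    return df.loc[missing_fraction <= threshold].copy()


# Sample dataset as defined in the setup cell
raw = pd.DataFrame({
    "age": [34, 45, 29, 50, 38, np.nan, 41],
    "income": [55000, np.nan, 42000, 58000, np.nan, np.nan, 49000],
    "score": [0.82, 0.91, np.nan, 0.76, 0.88, 0.65, 0.79],
    "zipcode": ["90210", "10001", "60614", "94103", "73301", "12345", "94105"],
    "city": ["Beverly", "New York", "Chicago", "SF", "Austin", "Unknown", "San Francisco"],
    "extra_data": [np.nan, 42, np.nan, np.nan, np.nan, 5, np.nan],
})

worst = raw.isna().mean(axis=1).max()  # highest missing fraction in any row
kept = drop_missing_sketch(raw, threshold=0.5)
print(len(raw), len(kept))  # 7 7
```

Even before filling, the worst rows are missing 2 of 6 values, so a 0.5 threshold keeps all 7 rows under this definition.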

2️⃣ Columns with missing values filled

age: 1 missing → filled with median (39.5)

income: 3 missing → filled with median (52000)

score: 1 missing → filled with median (0.805)

extra_data: 5 missing → filled with median (23.5)

Observation:

All numeric columns with missing values were filled using the median.

3️⃣ Range of numeric columns after normalization

age: 0 → 1

income: 0 → 1

score: 0 → 1

extra_data: 0 → 1

Observation:

Numeric columns were scaled to 0–1 using Min-Max normalization.
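
Min-Max normalization maps each value `x` to `(x - min) / (max - min)`. The sketch below illustrates this on the filled `age` column; `normalize_min_max_sketch` is a hypothetical name, and the real `normalize_data` in `src/cleaning.py` may differ in details such as how it handles a constant column.

```python
import pandas as pd


def normalize_min_max_sketch(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy of df with each listed column rescaled to the range 0 to 1."""
    df_copy = df.copy()
    for col in columns:
        col_min = df_copy[col].min()
        col_max = df_copy[col].max()
        # Assumes col_max > col_min; a constant column would divide by zero
        df_copy[col] = (df_copy[col] - col_min) / (col_max - col_min)
    return df_copy


# 'age' after the median fill (39.5 replaces the missing value)
filled = pd.DataFrame({"age": [34.0, 45.0, 29.0, 50.0, 38.0, 39.5, 41.0]})
scaled = normalize_min_max_sketch(filled, ["age"])
print(scaled["age"].min(), scaled["age"].max())  # 0.0 1.0
```

The youngest age (29) maps to 0 and the oldest (50) maps to 1; every other value falls in between.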

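4️⃣ Columns left untouched

zipcode and city were not passed to the cleaning functions, so they were not modified. Note that zipcode is read back from the CSV as int64 rather than text, which is why it still appears in describe().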

Their counts, types, and values remain unchanged.
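
The `info()` output lists `zipcode` as `int64` even though the setup cell defines it as strings, because `pd.read_csv` infers integer types by default. If ZIP codes should stay as text, one option is to pass `dtype` when reading. The sketch below uses a small in-memory CSV; the `02134` row is a hypothetical value that is not in the sample dataset, added only to show what happens to a leading zero.

```python
import io

import pandas as pd

# In-memory CSV; the second row is hypothetical and not part of sample_data.csv
csv_text = "zipcode,city\n90210,Beverly\n02134,Hypothetical\n"

default_read = pd.read_csv(io.StringIO(csv_text))
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"zipcode": str})

print(default_read["zipcode"].tolist())  # [90210, 2134]
print(as_text["zipcode"].tolist())       # ['90210', '02134']
```

None of the seven ZIP codes in the sample dataset starts with a zero, so the values in this notebook are unaffected either way.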

5️⃣ Summary statistics comparison

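Original describe() reports counts below 7 for age, income, score, and extra_data (6, 4, 6, and 2), which reflects their missing values.

The cleaned describe() output captured above still reports those same counts, because it was produced from a processed file saved by the earlier run in which the fill had no visible effect. The DataFrame printed by the final cleaning cell has its missing values filled, so re-running the comparison after the last save should show count = 7 for every column.

Mean, min, max, and std for age, income, score, and extra_data now correspond to normalized values (0–1); zipcode is unchanged.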

Observation:

Cleaning removed missing data issues and normalized numeric columns, improving data quality for analysis or modeling.
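
A quick way to confirm these observations after each run is a small report of missing counts and value ranges. The sketch below is one possible check, not part of the original notebook; `summarize_cleaning` is a hypothetical helper, and the demo data are the filled `age` and `income` columns from the output above, normalized inline.

```python
import pandas as pd


def summarize_cleaning(df: pd.DataFrame, numeric_cols: list[str]) -> pd.DataFrame:
    """Report missing-value count, minimum, and maximum for each listed column."""
    subset = df[numeric_cols]
    return pd.DataFrame({
        "missing": subset.isna().sum(),
        "min": subset.min(),
        "max": subset.max(),
    })


# Filled values for two columns, taken from the "After filling" output
filled = pd.DataFrame({
    "age": [34.0, 45.0, 29.0, 50.0, 38.0, 39.5, 41.0],
    "income": [55000.0, 52000.0, 42000.0, 58000.0, 52000.0, 52000.0, 49000.0],
})
scaled = (filled - filled.min()) / (filled.max() - filled.min())  # Min-Max per column

report = summarize_cleaning(scaled, ["age", "income"])
print(report)
```

For a correctly cleaned dataset, every row of the report should show 0 missing values, a minimum of 0, and a maximum of 1.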

## Save Cleaned Dataset

In [19]:
df.to_csv('../data/processed/sample_data_cleaned.csv', index=False)
print("Cleaned dataset saved to data/processed/")


Cleaned dataset saved to data/processed/
