# DataCleanser Example Usage

This notebook demonstrates how to use the DataCleanser utility for exploratory data analysis (EDA) and data preprocessing, based on techniques from the kaggle-courses repository.

In [None]:
# Import the DataCleanser class
from data_cleanser import DataCleanser

# Other necessary imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## 1. Load Sample Data

In [None]:
# For demonstration, let's create a sample dataset with common issues
np.random.seed(42)
n = 1000

# Create a sample dataset
data = {
    'age': np.random.normal(35, 10, n),  # Numeric feature
    'income': np.random.exponential(50000, n),  # Skewed numeric feature
    'gender': np.random.choice(['M', 'F', None], n, p=[0.48, 0.48, 0.04]),  # Categorical with missing values
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD', None], n, p=[0.3, 0.4, 0.2, 0.05, 0.05]),  # Categorical
    'signup_date': pd.date_range(start='2020-01-01', periods=n, freq='D')  # Date feature
}

# Create dataframe
df = pd.DataFrame(data)

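# Add some outliers (replace=False so exactly 20 distinct rows are affected)
df.loc[np.random.choice(n, 20, replace=False), 'age'] = np.random.uniform(70, 100, 20)
df.loc[np.random.choice(n, 20, replace=False), 'income'] = np.random.uniform(200000, 1000000, 20)

# Add some missing values (these may overwrite a few of the outliers above)
df.loc[np.random.choice(n, 50, replace=False), 'age'] = np.nan
df.loc[np.random.choice(n, 100, replace=False), 'income'] = np.nan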

# Show the raw data
print("Sample Dataset:")
df.head()

## 2. Initialize DataCleanser with Our Dataset

In [None]:
# Create a DataCleanser instance with our dataframe
cleanser = DataCleanser(df=df)

## 3. Exploratory Data Analysis

In [None]:
# Get basic information about the dataset
cleanser.get_basic_info()
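
This notebook does not show what `get_basic_info()` does internally. For comparison, a similar overview can be assembled with plain pandas. The miniature `demo` frame and the choice of summary columns below are assumptions for illustration, not DataCleanser's actual output.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the sample dataset
demo = pd.DataFrame({
    "age": [25.0, 31.0, np.nan, 47.0],
    "gender": ["M", "F", None, "F"],
})

# Per-column dtype, missing count, and missing percentage
overview = pd.DataFrame({
    "dtype": demo.dtypes.astype(str),
    "missing": demo.isna().sum(),
    "missing_pct": demo.isna().mean() * 100,
})

print("Shape:", demo.shape)
print(overview)
```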

In [None]:
# Visualize distributions of numeric features
cleanser.visualize_distributions()

In [None]:
# Plot correlations between numeric features
cleanser.plot_correlations()
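
As a rough sketch of the matrix a correlation plot is typically built on, the Pearson correlations between numeric columns can be computed directly in pandas. The `demo` frame is made up for illustration, and whether `plot_correlations()` uses Pearson correlation is an assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)

# Hypothetical frame: y is strongly tied to x, label is non-numeric
demo = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.1, size=200),
    "label": ["a", "b"] * 100,
})

# Keep numeric columns only, then compute the Pearson correlation matrix
corr = demo.select_dtypes(include="number").corr()
print(corr.round(2))
```

A matrix like `corr` can then be drawn as a heatmap, for example with `sns.heatmap(corr, annot=True)`.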

## 4. Data Preprocessing

Now let's clean the data with our DataCleanser utility.

In [None]:
# Handle missing values automatically
cleanser.handle_missing_values(strategy='auto')
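
The rule behind `strategy='auto'` is not documented in this notebook. One common convention, assumed here purely for illustration, is to fill numeric columns with the median and non-numeric columns with the most frequent value. A minimal pandas sketch of that convention on a made-up frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap in each column
demo = pd.DataFrame({
    "income": [40000.0, np.nan, 52000.0, 1000000.0],
    "education": ["Bachelor", None, "Bachelor", "PhD"],
})

filled = demo.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        # The median is less sensitive than the mean to skew and outliers
        filled[col] = filled[col].fillna(filled[col].median())
    else:
        # Most frequent category for non-numeric columns
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
```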

In [None]:
# Handle outliers using IQR method
cleanser.handle_outliers(method='iqr', threshold=1.5)
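
The IQR rule flags values that fall more than `threshold` times the interquartile range below the first quartile or above the third. The sketch below applies it to a made-up series and clips flagged values to the fences. Whether `handle_outliers` clips, drops, or otherwise treats flagged rows is not stated in this notebook, so clipping is an assumption.

```python
import pandas as pd

# Hypothetical ages with one obvious outlier
ages = pd.Series([22, 25, 27, 30, 31, 33, 35, 38, 95], name="age", dtype=float)

threshold = 1.5
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1

# Fences beyond which a value counts as an outlier
lower = q1 - threshold * iqr
upper = q3 + threshold * iqr

is_outlier = (ages < lower) | (ages > upper)
clipped = ages.clip(lower=lower, upper=upper)  # assumed treatment: clip to the fences

print(f"fences: [{lower}, {upper}], outliers flagged: {int(is_outlier.sum())}")
```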

In [None]:
# Create date features from signup_date
cleanser.create_date_features('signup_date')
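
Date features are usually derived from a datetime column through the pandas `.dt` accessor. The sketch below shows the general idea. The generated column names are hypothetical, and the set of features that `create_date_features` actually produces may differ.

```python
import pandas as pd

# Hypothetical three-row frame with a daily datetime column
demo = pd.DataFrame({
    "signup_date": pd.date_range("2020-01-01", periods=3, freq="D"),
})

dates = demo["signup_date"]
demo["signup_date_year"] = dates.dt.year
demo["signup_date_month"] = dates.dt.month
demo["signup_date_day"] = dates.dt.day
demo["signup_date_dayofweek"] = dates.dt.dayofweek  # Monday=0, Sunday=6

print(demo)
```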

In [None]:
# Encode categorical variables
cleanser.encode_categorical(method='onehot')
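
One-hot encoding replaces each categorical column with one indicator column per category. A minimal sketch with `pd.get_dummies` on a made-up frame follows. That `encode_categorical` relies on `get_dummies` internally is an assumption.

```python
import pandas as pd

# Hypothetical frame with one categorical and one numeric column
demo = pd.DataFrame({"gender": ["M", "F", "F"], "age": [30, 41, 25]})

# Each category of gender becomes its own 0/1 column; age is left untouched
encoded = pd.get_dummies(demo, columns=["gender"], dtype=int)

print(encoded)
```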

In [None]:
# Scale numerical features
cleanser.scale_features(method='standard')
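
Standard scaling subtracts each column's mean and divides by its standard deviation, so every scaled column has mean 0 and unit variance. The pandas sketch below uses the population standard deviation (`ddof=0`), the same convention as scikit-learn's `StandardScaler`. Which convention `scale_features` follows is an assumption.

```python
import pandas as pd

# Hypothetical numeric frame
demo = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "income": [30000.0, 50000.0, 100000.0],
})

# z = (x - mean) / std, computed column by column
scaled = (demo - demo.mean()) / demo.std(ddof=0)

print(scaled.round(3))
```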

## 5. Get the Processed Data

In [None]:
# Get the processed dataframe
processed_df = cleanser.get_data()

# View the cleaned and processed data
print("Processed Dataset:")
processed_df.head()

In [None]:
# Check for any remaining issues
print("Missing values:", processed_df.isnull().sum().sum())

# Summary statistics of processed data
processed_df.describe()

## 6. Reset to Original Data (if needed)

If you want to try different preprocessing approaches, you can reset to the original data.

In [None]:
# Reset to original data
cleanser.reset_to_original()

# Verify we're back to the original data (the missing values should be present again)
original_df = cleanser.get_data()
print("Missing values after reset:", original_df.isnull().sum().sum())
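
A reset like this is usually possible because the object keeps an untouched copy of the input alongside the working copy. The `SnapshotFrame` class below is a hypothetical stand-in that sketches the pattern. It is not the DataCleanser implementation.

```python
import numpy as np
import pandas as pd

class SnapshotFrame:
    """Hypothetical helper: keeps a pristine copy so changes can be undone."""

    def __init__(self, df):
        self._original = df.copy()  # never modified after construction
        self.df = df.copy()         # working copy that preprocessing mutates

    def fill_missing(self, value):
        self.df = self.df.fillna(value)

    def reset_to_original(self):
        # Copy again so later edits cannot leak into the snapshot
        self.df = self._original.copy()

raw = pd.DataFrame({"age": [25.0, np.nan, 40.0]})
frame = SnapshotFrame(raw)

frame.fill_missing(0)
missing_after_fill = int(frame.df.isna().sum().sum())

frame.reset_to_original()
missing_after_reset = int(frame.df.isna().sum().sum())

print(missing_after_fill, missing_after_reset)
```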

## 7. Alternative Preprocessing Pipeline

Let's try a different approach using method chaining.

In [None]:
# Create a complete pipeline using method chaining
processed_df = (
    cleanser.handle_missing_values(strategy='median')
    .handle_outliers(method='zscore', threshold=3)
    .create_date_features('signup_date', drop_original=True)
    .encode_categorical(method='label')
    .scale_features(method='minmax')
    .get_data()
)

# Show results
processed_df.head()
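
Method chaining works when each step returns the object itself, so the next call has something to operate on. The `MiniCleaner` class below is a hypothetical stand-in with two steps (median fill and min-max scaling) that sketches the pattern. It does not reproduce DataCleanser's code.

```python
import numpy as np
import pandas as pd

class MiniCleaner:
    """Hypothetical stand-in used only to illustrate chaining."""

    def __init__(self, df):
        self.df = df.copy()  # leave the caller's frame untouched

    def fill_median(self):
        numeric = self.df.select_dtypes(include="number").columns
        self.df[numeric] = self.df[numeric].fillna(self.df[numeric].median())
        return self  # returning self is what makes chaining possible

    def minmax_scale(self):
        numeric = self.df.select_dtypes(include="number").columns
        col_min = self.df[numeric].min()
        col_max = self.df[numeric].max()
        # Rescale each numeric column to the [0, 1] range
        self.df[numeric] = (self.df[numeric] - col_min) / (col_max - col_min)
        return self

    def get_data(self):
        return self.df  # final step returns the frame, ending the chain

raw = pd.DataFrame({"age": [20.0, np.nan, 40.0, 30.0]})

result = (
    MiniCleaner(raw)
    .fill_median()
    .minmax_scale()
    .get_data()
)
print(result)
```

Wrapping the chain in parentheses lets each step sit on its own line without backslash continuations.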

## Conclusion

In this notebook, we used the DataCleanser utility to explore a sample dataset and preprocess it: handling missing values and outliers, creating date features, encoding categorical variables, and scaling numeric features. The approach follows practices drawn from the kaggle-courses repository and is intended to make data preparation for machine learning models simpler and more consistent.