# Dataruns Basic Usage Examples

This notebook demonstrates the basic functionality of the dataruns library including:
- Data extraction from CSV files
- Building simple pipelines
- Basic data transformations
- Integration with pandas DataFrames

## Prerequisites

Make sure you have the dataruns library available in your Python environment. The examples use sample data that will be generated in the notebook.

## Setup and Imports

First, import the required libraries.

In [4]:
import os
import numpy as np
import pandas as pd
import dataruns

# Import dataruns components
from dataruns.core.pipeline import Pipeline, Make_Pipeline
from dataruns.source.datasource import CSVSource

print("✓ All imports successful!")
print(f"✓ NumPy version: {np.__version__}")
print(f"✓ Pandas version: {pd.__version__}")
print(f"✓ Dataruns version: {dataruns.__version__}")

✓ All imports successful!
✓ NumPy version: 2.2.5
✓ Pandas version: 2.2.3
✓ Dataruns version: 1.0.dev


## 1. Basic Pipeline Example

Let's start with a simple pipeline that wraps a single transformation function. The pipeline is callable: passing data to it runs the function and returns the result, which is the building block for chaining operations in dataruns.

In [5]:
# Define a simple transformation function
def transform_function(data):
    """Simple transformation that doubles the values."""
    if isinstance(data, pd.DataFrame):
        return data * 2
    elif isinstance(data, (list, np.ndarray)):
        return np.array(data) * 2
    else:
        return data * 2

# Create pipeline
pipeline = Pipeline(transform_function)

# Test with list
input_list = [1, 2, 3, 4, 5]
result_list = pipeline(input_list)
print(f"List input: {input_list}")
print(f"List result: {result_list}")

# Test with DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
result_df = pipeline(df)
print(f"\nDataFrame input:")
print(df)
print(f"\nDataFrame result:")
print(result_df)

List input: [1, 2, 3, 4, 5]
List result: [ 2  4  6  8 10]

DataFrame input:
   A  B
0  1  4
1  2  5
2  3  6

DataFrame result:
   A   B
0  2   8
1  4  10
2  6  12
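
Conceptually, a pipeline built from several functions passes the output of each function to the next one. The following is a minimal sketch of that idea in plain Python. It is not the dataruns implementation, and `compose`, `double`, and `add_one` are hypothetical names used only for illustration.

```python
from functools import reduce


def compose(*funcs):
    """Return a callable that applies funcs from left to right."""
    def run(data):
        # Feed the result of each function into the next one
        return reduce(lambda acc, func: func(acc), funcs, data)
    return run


def double(values):
    return [v * 2 for v in values]


def add_one(values):
    return [v + 1 for v in values]


chained = compose(double, add_one)
chained_result = chained([1, 2, 3])
print(chained_result)
```

Order matters here: `compose(double, add_one)` doubles first and then adds one, so swapping the arguments would give a different result.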


## 2. Advanced Pipeline with Multiple Functions

Now let's create a more complex pipeline that chains multiple transformation functions together.

In [6]:
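# Seed NumPy's random generator; the random CSV data generated in section 4 depends on it
np.random.seed(42)

# Create sample data with one missing value in column A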
df = pd.DataFrame({
    'A': [1, 2, None, 4, 5], 
    'B': [10, 20, 30, 40, 50],
    'C': [0.1, 0.2, 0.3, 0.4, 0.5]
})

print("Original data:")
print(df)
print(f"\nData shape: {df.shape}")
print(f"Missing values per column:")
print(df.isnull().sum())

Original data:
     A   B    C
0  1.0  10  0.1
1  2.0  20  0.2
2  NaN  30  0.3
3  4.0  40  0.4
4  5.0  50  0.5

Data shape: (5, 3)
Missing values per column:
A    1
B    0
C    0
dtype: int64


In [7]:
# Define transformation functions
def clean_data(df):
    """Drop rows that contain missing values."""
    print(f"  → Cleaning data: {df.isnull().sum().sum()} missing values found")
    return df.dropna()

def normalize_columns(df):
    """Normalize columns to have mean 0 and std 1."""
    print(f"  → Normalizing {len(df.columns)} columns")
    return (df - df.mean()) / df.std()

def add_sum_column(df):
    """Add a column with the sum of all other columns."""
    print(f"  → Adding sum column")
    df = df.copy()
    df['sum'] = df.sum(axis=1)
    return df

# Create pipeline with multiple functions
print("Creating pipeline with 3 transformation steps:")
pipeline = Pipeline(clean_data, normalize_columns, add_sum_column)
print(f"Pipeline created: {pipeline}")

# Apply the pipeline
print("\nApplying pipeline...")
result = pipeline(df)

print(f"\nFinal result:")
print(result)
print(f"Result shape: {result.shape}")

Creating pipeline with 3 transformation steps:
Pipeline created: Pipeline(
    Function[clean_data]
    Function[normalize_columns]
    Function[add_sum_column]
)

Applying pipeline...
  → Cleaning data: 1 missing values found
  → Normalizing 3 columns
  → Adding sum column

Final result:
          A         B         C       sum
0 -1.095445 -1.095445 -1.095445 -3.286335
1 -0.547723 -0.547723 -0.547723 -1.643168
3  0.547723  0.547723  0.547723  1.643168
4  1.095445  1.095445  1.095445  3.286335
Result shape: (4, 4)
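
The three normalized columns above are identical because, after the row with the missing value is dropped, `B` is 10 times `A` and `C` is 0.1 times `A`, and z-scores do not change under positive scaling. The sketch below reproduces the arithmetic by hand with the standard library, assuming the sample standard deviation (n - 1 in the denominator), which is what pandas' `std()` uses by default. The helper name `zscores` is hypothetical.

```python
import statistics

col_a = [1.0, 2.0, 4.0, 5.0]   # column A after the NaN row is dropped
col_b = [10, 20, 40, 50]       # column B for the same rows


def zscores(values):
    """Standardize values using the mean and the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample std (n - 1), like pandas' default
    return [(v - mean) / std for v in values]


z_a = zscores(col_a)
z_b = zscores(col_b)

# With three identical normalized columns, the row sum is 3 times one z-score
first_row_sum = 3 * z_a[0]
print(round(z_a[0], 6), round(z_b[0], 6), round(first_row_sum, 6))
```

For the first row this gives (1 - 3) / sqrt(10 / 3), which matches the value shown in the `A`, `B`, and `C` columns of the result.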


## 3. Pipeline Builder Example

The `Make_Pipeline` class allows you to build pipelines step by step, which is useful for dynamic pipeline construction.

In [8]:
# Create data for the builder example
data = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [10, 20, 30, 40, 50]
})
print("Original data:")
print(data)

# Build pipeline using Make_Pipeline
print("\nBuilding pipeline step by step:")
builder = Make_Pipeline()

# Add transformations one by one
builder.add(lambda df: df * 2)  # Double values
print("  → Added: Double values")

builder.add(lambda df: df + 10)  # Add 10
print("  → Added: Add 10")

builder.add(lambda df: df.round(2))  # Round to 2 decimal places
print("  → Added: Round to 2 decimals")

# Build and apply pipeline
pipeline = builder.build()
print(f"\nBuilt pipeline: {pipeline}")

result = pipeline(data)
print(f"\nFinal result:")
print(result)

Original data:
   x   y
0  1  10
1  2  20
2  3  30
3  4  40
4  5  50

Building pipeline step by step:
  → Added: Double values
  → Added: Add 10
  → Added: Round to 2 decimals

Built pipeline: Pipeline(
    Function[<lambda>]
    Function[<lambda>]
    Function[<lambda>]
)

Final result:
    x    y
0  12   30
1  14   50
2  16   70
3  18   90
4  20  110
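
Dynamic construction usually means choosing the steps at run time, for example from a list of step names. The sketch below illustrates that pattern in plain Python rather than with `Make_Pipeline` itself. `STEPS`, `build_steps`, and `run_steps` are hypothetical names, and the scalar input stands in for a DataFrame. Named functions are used instead of lambdas because, as the pipeline printout above shows, lambdas all display as `<lambda>`.

```python
from functools import reduce


def double(x):
    return x * 2


def add_ten(x):
    return x + 10


def round_two(x):
    return round(x, 2)


# Registry of available steps, keyed by name (hypothetical)
STEPS = {"double": double, "add_ten": add_ten, "round_two": round_two}


def build_steps(names):
    """Look up each requested step by name, preserving the given order."""
    return [STEPS[name] for name in names]


def run_steps(steps, value):
    """Apply each step in order to the value."""
    return reduce(lambda acc, step: step(acc), steps, value)


selected = build_steps(["double", "add_ten", "round_two"])
step_names = [step.__name__ for step in selected]
dynamic_result = run_steps(selected, 1)
print(step_names, dynamic_result)
```

With `Make_Pipeline`, the same loop over `selected` could call `builder.add(step)` for each entry before `builder.build()`, using only the calls shown in the cell above.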


## 4. CSV Data Extraction Example

Let's create a sample CSV file and demonstrate how to extract data using the `CSVSource` class.

In [9]:
# Create sample CSV data
sample_data = pd.DataFrame({
    'feature1': np.random.normal(0, 1, 20),
    'feature2': np.random.normal(5, 2, 20),
    'target': np.random.randint(0, 2, 20),
    'category': np.random.choice(['A', 'B', 'C'], 20)
})

# Save to CSV file
csv_path = 'sample_data.csv'
sample_data.to_csv(csv_path, index=False)
print(f"✓ Created sample CSV file: {csv_path}")
print(f"Sample data shape: {sample_data.shape}")
print("\nFirst 5 rows of sample data:")
print(sample_data.head())

✓ Created sample CSV file: sample_data.csv
Sample data shape: (20, 4)

First 5 rows of sample data:
   feature1  feature2  target category
0  0.496714  7.931298       0        A
1 -0.138264  4.548447       1        B
2  0.647689  5.135056       1        C
3  1.523030  2.150504       1        A
4 -0.234153  3.911235       1        B


In [10]:
# Extract data using CSVSource
try:
    source = CSVSource(file_path=csv_path)
    extracted_data = source.extract_data()
    print(f"✓ Successfully extracted data using CSVSource")
    print(f"Extracted data shape: {extracted_data.shape}")
    print("\nFirst 5 rows of extracted data:")
    print(extracted_data.head())
    
    # Apply a simple pipeline to the CSV data
    def process_csv_data(df):
        """Process CSV data by selecting numeric columns and filling missing values."""
        print("  → Selecting numeric columns...")
        numeric_df = df.select_dtypes(include=[np.number])
        print(f"  → Found {len(numeric_df.columns)} numeric columns: {list(numeric_df.columns)}")
        print("  → Filling missing values with mean...")
        return numeric_df.fillna(numeric_df.mean())
    
    # Create and apply pipeline
    pipeline = Pipeline(process_csv_data)
    processed_data = pipeline(extracted_data)
    
    print(f"\n✓ Pipeline processing completed")
    print(f"Processed data shape: {processed_data.shape}")
    print("\nProcessed data (first 5 rows):")
    print(processed_data.head())
    
except Exception as e:
    print(f"✗ Error: {e}")
finally:
    # Clean up the sample CSV file whether or not an error occurred
    if os.path.exists(csv_path):
        os.remove(csv_path)
        print(f"\n✓ Cleaned up: Removed {csv_path}")

✓ Successfully extracted data using CSVSource
Extracted data shape: (20, 4)

First 5 rows of extracted data:
               feature1            feature2 target category
0    0.4967141530112327   7.931297537843108      0        A
1  -0.13826430117118466   4.548447399026928      1        B
2    0.6476885381006925  5.1350564093758475      1        C
3    1.5230298564080254  2.1505036275730864      1        A
4  -0.23415337472333597  3.9112345509496347      1        B
  → Selecting numeric columns...
  → Found 0 numeric columns: []
  → Filling missing values with mean...

✓ Pipeline processing completed
Processed data shape: (20, 0)

Processed data (first 5 rows):
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]

✓ Cleaned up: Removed sample_data.csv
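
In the output above, the extracted values print at full precision and `select_dtypes` reports 0 numeric columns, which suggests that `CSVSource` returned every column as strings in this run. As a result, the processed data is empty. One possible workaround, sketched below under that assumption, is to convert columns to numeric types before selecting them. The helper `coerce_numeric_columns` is hypothetical and not part of dataruns; it relies only on `pandas.to_numeric`.

```python
import pandas as pd


def coerce_numeric_columns(df):
    """Convert string columns to numeric where every value parses cleanly."""
    out = df.copy()
    for col in out.columns:
        converted = pd.to_numeric(out[col], errors="coerce")
        # Keep the conversion only if no non-missing value was lost
        if converted.notna().sum() == out[col].notna().sum():
            out[col] = converted
    return out


# Small stand-in for the extracted data, with every value stored as a string
raw = pd.DataFrame({
    "feature1": ["0.5", "-0.1"],
    "feature2": ["7.9", "4.5"],
    "target": ["0", "1"],
    "category": ["A", "B"],
})

typed = coerce_numeric_columns(raw)
numeric_columns = list(typed.select_dtypes(include="number").columns)
print(numeric_columns)
```

A function like this could be placed before `process_csv_data` in the pipeline, for example `Pipeline(coerce_numeric_columns, process_csv_data)`, so that the numeric selection has columns to work with.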


## Conclusion

This notebook demonstrated the core functionality of the dataruns library:

1. **Basic Pipeline**: A single transformation function wrapped in a `Pipeline` and applied to a list and a DataFrame
2. **Advanced Pipeline**: Multiple transformation steps with data cleaning and normalization
3. **Pipeline Builder**: Dynamic pipeline construction using `Make_Pipeline`
4. **CSV Extraction**: Data extraction from CSV files using `CSVSource`

### Key Takeaways:
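- ✅ Dataruns pipelines are built from plain Python functions that take data and return data
- ✅ Functions can be chained up front with `Pipeline(f, g, h)` or added step by step with `Make_Pipeline`
- ✅ Pipelines accept pandas DataFrames as well as lists and NumPy arrays, as long as the wrapped functions handle them
- ✅ `CSVSource` loads a CSV file in one call; in the run above the extracted columns were not numeric, so convert dtypes before applying numeric steps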

### Next Steps:
- Explore the transform examples notebook for advanced data transformation techniques
- Check out the comprehensive examples for more complex use cases
- Review the documentation for additional features and capabilities