# Iris Dataset Data Exploration

**Author:** Data Science Essentials Project  
**Date:** September 22, 2025  
**Purpose:** Basic data loading and exploration of the Iris dataset

This notebook demonstrates how to use our custom `PandasSource` class to load and explore the famous Iris dataset.

## Prerequisites

**Before running this notebook**, make sure you have set up the project structure and downloaded the required data:

```bash
# From the project root directory
python setup.py
```

This will:
- Create the `data/` directory structure  
- Download the Iris dataset to `data/raw/iris.csv`
- Install required dependencies

---

## 1. Setup and Data Loading
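
The setup cell in this section begins by locating the project root, checking the current directory and its parent for a `notebooks/exploratory` folder. A more general version of that search walks up through every parent directory. The sketch below is one way to write it; `find_project_root` and its `marker` parameter are hypothetical names chosen for illustration and are not part of the project.

```python
from pathlib import Path
import tempfile


def find_project_root(start: Path, marker: Path = Path("notebooks") / "exploratory") -> Path:
    """Return the nearest directory at or above `start` that contains `marker`.

    Hypothetical helper for illustration; not part of the project code.
    """
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No directory containing {marker} found at or above {start}")


# Demonstrate on a throwaway directory tree that mimics the project layout
with tempfile.TemporaryDirectory() as tmp:
    fake_root = Path(tmp).resolve() / "project"
    notebook_dir = fake_root / "notebooks" / "exploratory"
    notebook_dir.mkdir(parents=True)

    found_root = find_project_root(notebook_dir)
    root_matches = found_root == fake_root
    print(f"Found project root: {found_root}")
```

Raising `FileNotFoundError` when no marker is found, rather than silently falling back to a guess, makes a misconfigured working directory fail early instead of at the later import.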

In [None]:
# Standard-library imports for path handling and downloading the dataset
import sys
import os
import urllib.request
from pathlib import Path

# Find the project root - handle both local and CI environments
notebook_dir = Path(os.getcwd())
if notebook_dir.name == 'exploratory' and notebook_dir.parent.name == 'notebooks':
    # We're running the notebook directly in its folder
    project_root = notebook_dir.parent.parent
else:
    # We're probably in a CI environment or another directory
    # Look for a directory structure that suggests we're in the project
    for possible_root in [Path(os.getcwd()), Path(os.getcwd()).parent]:
        if (possible_root / 'notebooks' / 'exploratory').exists():
            project_root = possible_root
            break
    else:
        # Fall back to two levels above the current working directory
        project_root = Path('.').absolute().parent.parent

# Add to Python path if not already there
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

print(f"Project root: {project_root}")

# Import PandasSource from project
from src.data.sources import PandasSource

# Set up paths
data_dir = project_root / 'data' / 'raw'
iris_path = data_dir / 'iris.csv'

print(f"Looking for Iris dataset at: {iris_path}")

# Create directories if they don't exist
if not data_dir.exists():
    os.makedirs(data_dir, exist_ok=True)
    print(f"Created directory: {data_dir}")

# Download iris data if it doesn't exist
if not iris_path.exists():
    print(f"Downloading Iris dataset to {iris_path}")
    iris_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
    
    try:
        # Download the file (data_dir was created above if it was missing)
        urllib.request.urlretrieve(iris_url, iris_path)
        print(f"Successfully downloaded Iris dataset to {iris_path}")
    except Exception as e:
        print(f"Error downloading Iris dataset: {e}")
        raise
else:
    print(f"Iris dataset found at: {iris_path}")

# Verify file exists
if not iris_path.exists():
    raise FileNotFoundError(f"Data file not found: {iris_path}")

# Load Iris dataset from data/raw/ directory
data_source = PandasSource(
    file_path=str(iris_path),  # Convert Path to string
    separator=',',
    header=False,
    names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'target']
)
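
`PandasSource` is a project-specific class whose implementation is not shown in this notebook. Assuming it is a thin wrapper around `pandas.read_csv`, the load above roughly corresponds to the plain-pandas call sketched here. To stay runnable without the project or a network connection, the sketch reads four rows copied from the Iris data file out of an in-memory buffer instead of `data/raw/iris.csv`.

```python
import io

import pandas as pd

# Four rows in the same headerless, comma-separated layout as iris.data
sample_csv = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'target']

# header=None tells pandas the file has no header row, so `names` supplies the labels
iris_df = pd.read_csv(sample_csv, sep=',', header=None, names=column_names)

print(iris_df.head())
```

Note that plain pandas expects `header=None` for a headerless file; the `header=False` argument in the cell above belongs to the `PandasSource` signature, whose exact semantics are an assumption here.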

## 2. Basic Data Exploration

In [None]:
# Display first 5 rows of the dataset
data_source.head()

In [None]:
# Display last 5 rows of the dataset
data_source.tail()

In [None]:
# Display first 2 rows of the dataset
data_source.head(2)

In [None]:
# Display column names
data_source.df.columns.tolist()

In [None]:
# Generate descriptive statistics
data_source.describe()
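
Assuming `PandasSource.head`, `tail`, and `describe` delegate to the underlying DataFrame exposed as `data_source.df` (the notebook does not show the implementation), the same summaries can be reproduced with plain pandas. The sketch below uses a four-row sample taken from the Iris data so that it runs without the project.

```python
import pandas as pd

# Small sample with the same column names used when loading the dataset
sample_df = pd.DataFrame(
    {
        'sepal_length': [5.1, 4.9, 7.0, 6.3],
        'sepal_width': [3.5, 3.0, 3.2, 3.3],
        'petal_length': [1.4, 1.4, 4.7, 6.0],
        'petal_width': [0.2, 0.2, 1.4, 2.5],
        'target': ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],
    }
)

first_rows = sample_df.head(2)   # first two rows
last_rows = sample_df.tail(2)    # last two rows
summary = sample_df.describe()   # count, mean, std, min, quartiles, max

print(summary)
```

By default `describe()` summarizes only the numeric columns, so the `target` label column does not appear in the result.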

## 3. Metadata Information

Inspect the dataset's metadata through the `metadata` attribute of `PandasSource`.

In [None]:
# Display dataset metadata
data_source.metadata
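
The contents of `data_source.metadata` depend on the `PandasSource` implementation, which is not shown here. As a rough sketch of the kind of information such an attribute might hold, the hypothetical helper `summarize_frame` below collects the shape, column names, dtypes, and missing-value counts of a DataFrame. The key names are assumptions made for illustration.

```python
import pandas as pd


def summarize_frame(df: pd.DataFrame) -> dict:
    """Collect basic metadata about a DataFrame.

    Hypothetical helper for illustration; not the PandasSource API.
    """
    return {
        'n_rows': len(df),
        'n_columns': df.shape[1],
        'columns': df.columns.tolist(),
        'dtypes': {name: str(dtype) for name, dtype in df.dtypes.items()},
        'missing_values': {name: int(count) for name, count in df.isna().sum().items()},
    }


# Two numeric columns from a four-row Iris sample are enough to show the output shape
sample_df = pd.DataFrame(
    {
        'sepal_length': [5.1, 4.9, 7.0, 6.3],
        'petal_length': [1.4, 1.4, 4.7, 6.0],
    }
)

frame_metadata = summarize_frame(sample_df)
print(frame_metadata)
```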