# Data Exploration

This notebook performs exploratory data analysis (EDA) on the antimicrobial resistance (AMR) dataset.

## Objectives
- Load and inspect the raw dataset
- Understand data structure and types
- Identify missing values and outliers
- Visualize data distributions
- Explore relationships between features
- Generate initial insights about antibiotic resistance patterns

## 1. Setup and Imports

In [None]:
# Standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Custom modules
import sys
sys.path.append('..')
from src.data.preprocessing import load_raw_data
from src.visualization.plots import plot_data_distribution, plot_correlation_matrix

# Settings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8')
sns.set_palette('husl')

%matplotlib inline
%load_ext autoreload
%autoreload 2

## 2. Load Data

In [None]:
# Load the raw data from ../data/raw/rawdata.csv
df = pd.read_csv('../data/raw/rawdata.csv')

# Display basic information
print(f"Dataset shape: {df.shape}")
print(f"\nColumn names: {df.columns.tolist()}")

## 3. Data Overview

In [None]:
# Display the first 10 rows of the dataset
df.head(10)

In [None]:
# Data types and non-null counts per column
df.info()

In [None]:
# Descriptive statistics for the numerical columns
df.describe()
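
The objectives list outliers, which `df.describe()` only hints at through its min, max, and quartile rows. One common way to count them is the interquartile range (IQR) rule, sketched below. This is an illustration and not part of the project code: the helper name `count_iqr_outliers`, the multiplier `k=1.5`, and the toy column names are all assumptions.

```python
import pandas as pd


def count_iqr_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame columns
    outside = (numeric < (q1 - k * iqr)) | (numeric > (q3 + k * iqr))
    return outside.sum().sort_values(ascending=False)


# Hypothetical toy data; in the notebook, pass df instead
toy = pd.DataFrame({
    "mic_value": [0.5, 1.0, 1.0, 2.0, 2.0, 4.0, 64.0],
    "patient_age": [34, 45, 51, 38, 47, 42, 40],
    "organism": ["E. coli"] * 7,
})
outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

The IQR rule is a rough screen. For heavily skewed measurements, a log transform before applying it may give more useful flags.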

## 4. Missing Values Analysis

In [None]:
# Count missing values per column and express them as a percentage of rows
# TODO: Visualize missing values
missing_values = df.isnull().sum()
missing_percent = (missing_values / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing_values,
    'Percentage': missing_percent
}).sort_values('Missing Count', ascending=False)

print(missing_df[missing_df['Missing Count'] > 0])
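
The table above can also be drawn as a horizontal bar chart, which makes the most affected columns easy to spot. The sketch below uses a small made-up frame so it runs on its own. The column names are hypothetical, and in the notebook `toy` would be replaced by `df`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy frame; in the notebook, use df instead
toy = pd.DataFrame({
    "organism": ["E. coli", "K. pneumoniae", None, "E. coli"],
    "ampicillin": ["R", None, None, "S"],
    "patient_age": [34.0, 51.0, np.nan, 47.0],
    "sample_id": [1, 2, 3, 4],
})

# Percentage of missing values per column, keeping only affected columns
missing_pct = toy.isnull().mean().mul(100)
missing_pct = missing_pct[missing_pct > 0].sort_values()

# Horizontal bars, largest share at the top
fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(missing_pct.index, missing_pct.values)
ax.set_xlabel("Missing values (%)")
ax.set_title("Missing values by column")
fig.tight_layout()
```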

## 5. Data Distributions

In [None]:
# TODO: Plot distributions of numerical features
# Example: df.hist(figsize=(15, 10), bins=30)

In [None]:
# TODO: Visualize categorical features
# Example: Plot antibiotic resistance distributions
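
For categorical features, frequency tables are a reasonable first step before plotting. The sketch below assumes a hypothetical layout with an `organism` column and one column per antibiotic holding `R`/`S` labels. The actual column names in `rawdata.csv` may differ.

```python
import pandas as pd

# Hypothetical toy data; the real column names may differ
toy = pd.DataFrame({
    "organism": ["E. coli", "E. coli", "K. pneumoniae",
                 "K. pneumoniae", "E. coli", "S. aureus"],
    "ampicillin": ["R", "S", "R", "R", "R", "S"],
})

# Share of each resistance label; dropna=False keeps missing values visible
label_share = toy["ampicillin"].value_counts(normalize=True, dropna=False)

# Resistance labels per organism, as row-wise proportions
by_organism = pd.crosstab(toy["organism"], toy["ampicillin"], normalize="index")

print(label_share)
print(by_organism)
```

Calling `by_organism.plot(kind="bar", stacked=True)` would turn the cross-tabulation into a stacked bar chart.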

## 6. Correlation Analysis

In [None]:
# TODO: Calculate and visualize correlations
# Example: sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
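
A heatmap gets hard to read once there are many columns, so it can help to list the strongest pairwise correlations as well. The sketch below is one way to do that under stated assumptions: the column names are hypothetical. It restricts the calculation to numeric columns, because `DataFrame.corr` in pandas 2.x raises an error on text columns unless `numeric_only=True` is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; in the notebook, use df instead
toy = pd.DataFrame({
    "patient_age": [30, 40, 50, 60, 70],
    "length_of_stay": [3, 5, 6, 9, 10],
    "mic_value": [8.0, 1.0, 4.0, 0.5, 2.0],
    "organism": ["E. coli"] * 5,
})

# Pearson correlation over numeric columns only
corr = toy.corr(numeric_only=True)

# Keep the upper triangle (excluding the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank pairs by absolute correlation, strongest first
pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(pairs)
```

Pearson correlation only captures linear relationships. Passing `method="spearman"` to `corr` is an option for monotonic but non-linear ones.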

## 7. Key Insights and Observations

TODO: Document key findings from the exploratory analysis:
- Dataset characteristics
- Data quality issues
- Patterns observed
- Recommendations for preprocessing