{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Sierra Leone Data Profiling and Cleaning\n",
    "\n",
    "This notebook profiles and cleans `sierraleone-bumbuna.csv` for the Solar Data Challenge.\n",
    "\n",
    "## Objectives\n",
    "- **Profile**: Generate summary statistics, count missing values, check GHI for negative readings, and confirm the Timestamp dtype.\n",
    "- **Clean**: Remove negative GHI/DNI/DHI readings and impute missing values with the median.\n",
    "- **Validate**: Ensure data quality for downstream EDA and comparison.\n",
    "\n",
    "## Approach\n",
    "Use the `SolarDataProcessor` class to encapsulate data handling. Steps:\n",
    "1. Load data and convert Timestamp to datetime.\n",
    "2. Compute statistics for key columns (GHI, DNI, DHI, etc.).\n",
    "3. Check for missing values and negative GHI.\n",
    "4. Clean data by removing invalid entries and imputing missing values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
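    "import sys\n",
    "from pathlib import Path\n",
    "\n",
    "# Assumption: this notebook runs one level below the repository root, so add the root to sys.path to make `src` importable\n",
    "sys.path.insert(0, str(Path.cwd().parent))\n",
    "\n",
    "from src.data_processor import SolarDataProcessor\n",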
    "\n",
    "# Initialize processor\n",
    "processor = SolarDataProcessor('../data/sierraleone-bumbuna.csv')\n",
    "\n",
    "# Load data\n",
    "if not processor.load_data():\n",
    "    print('Failed to load data')\n",
    "else:\n",
    "    print('Data loaded successfully')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Profiling Results\n",
    "Generate summary statistics, count missing values per column, and run the validation checks (negative GHI count and Timestamp dtype)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Profile data\n",
    "profile = processor.profile_data()\n",
    "print('Summary Statistics:')\n",
    "for col, stats in profile['statistics'].items():\n",
    "    print(f'{col}:\\n{stats}\\n')\n",
    "print('Missing Values:')\n",
    "for col, count in profile['missing_values'].items():\n",
    "    print(f'{col}: {count}')\n",
    "print(f'\\nNegative GHI Count: {profile[\"negative_ghi_count\"]}')\n",
    "print(f'Timestamp Dtype: {profile[\"timestamp_dtype\"]}')"
   ]
  },
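  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The checks above can also be reproduced directly in pandas. The cell below is a minimal sketch, not the `SolarDataProcessor` implementation: it runs on a small made-up frame (the values are illustrative, not taken from the dataset) and assumes the column names `Timestamp`, `GHI`, `DNI`, and `DHI`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Toy frame standing in for the CSV; the values are made up for illustration\n",
    "toy = pd.DataFrame({\n",
    "    'Timestamp': ['2021-10-30 00:01', '2021-10-30 00:02', '2021-10-30 00:03'],\n",
    "    'GHI': [-1.2, 5.0, None],\n",
    "    'DNI': [0.0, 3.1, 2.4],\n",
    "    'DHI': [0.1, None, 1.9],\n",
    "})\n",
    "toy['Timestamp'] = pd.to_datetime(toy['Timestamp'])\n",
    "\n",
    "key_cols = ['GHI', 'DNI', 'DHI']  # assumed column names\n",
    "print(toy[key_cols].describe())  # summary statistics per column\n",
    "print(toy.isna().sum())  # missing values per column\n",
    "print('Negative GHI count:', (toy['GHI'] < 0).sum())\n",
    "print('Timestamp dtype:', toy['Timestamp'].dtype)"
   ]
  },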
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Cleaning\n",
    "Remove negative GHI/DNI/DHI readings and impute missing values with the median."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Clean data\n",
    "processor.clean_data()\n",
    "print('Data after cleaning:')\n",
    "print(processor.data.head())"
   ]
  },
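  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, the cleaning rule described above can be sketched in plain pandas. This is one possible interpretation under stated assumptions, not necessarily what `clean_data()` does: the hypothetical helper `clean_irradiance` drops any row with a negative GHI, DNI, or DHI value and then fills the remaining gaps with each column's median. The toy values are made up."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "def clean_irradiance(df, cols=('GHI', 'DNI', 'DHI')):\n",
    "    # Hypothetical helper (assumption): drop rows with negative irradiance, then median-impute\n",
    "    cols = list(cols)\n",
    "    # NaN < 0 evaluates to False, so rows with missing values are kept for imputation\n",
    "    has_negative = (df[cols] < 0).any(axis=1)\n",
    "    out = df[~has_negative].copy()\n",
    "    # Fill remaining gaps with the median of each column, computed after the negative rows are dropped\n",
    "    out[cols] = out[cols].fillna(out[cols].median())\n",
    "    return out\n",
    "\n",
    "# Made-up values for illustration only\n",
    "toy = pd.DataFrame({\n",
    "    'GHI': [-1.0, 4.0, None, 6.0],\n",
    "    'DNI': [0.0, 2.0, 3.0, None],\n",
    "    'DHI': [1.0, 1.5, None, 2.5],\n",
    "})\n",
    "print(clean_irradiance(toy))"
   ]
  },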
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Next Steps\n",
    "- Generate time series plots for GHI, DNI, DHI.\n",
    "- Analyze correlations and wind patterns.\n",
    "- Export cleaned data to `data/sierra_leone_clean.csv`."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}