## Data Overview

### Purpose
This notebook provides a general overview of the final dataset before detailed analysis.  
We’ll confirm its structure, completeness, and key characteristics for each variable.

### Objectives
- Load the final dataset.
- Inspect structure (shape, columns, and data types).
- Check for missing values and duplicates.
- Generate basic descriptive statistics.
- Summarize the year and regional coverage.
- Record initial observations.


In [2]:
# Import libraries
import pandas as pd

# Load final dataset
df = pd.read_csv("../1_datasets/Final_dataset/final_merged_dataset.csv")  # adjust name/path if needed

# Quick look
display(df.head())

# Basic structure
print("Shape:", df.shape)
print("\nData types:")
print(df.dtypes)

# Check for missing values
print("\nMissing values:")
print(df.isnull().sum().sum(), "missing values in total")


,REGION,YEAR,JAN_RAIN,FEB_RAIN,MAR_RAIN,APR_RAIN,MAY_RAIN,JUN_RAIN,JUL_RAIN,AUG_RAIN,...,APR_TEMP,MAY_TEMP,JUN_TEMP,JUL_TEMP,AUG_TEMP,SEP_TEMP,OCT_TEMP,NOV_TEMP,DEC_TEMP,ANN_TEMP
0,Central,1990,0.0,0.0,0.0,0.0,0.002,0.026667,1.387333,0.429333,...,30.844667,33.960222,33.907,32.037444,32.395111,32.977667,32.436444,29.210333,27.665444,28.028111
1,Central,1991,0.0,0.0,0.0,0.047333,0.216,0.036667,0.713333,1.087333,...,33.137556,35.484,34.458778,32.991222,31.911333,33.060556,32.269778,27.483222,22.874889,28.717444
2,Central,1992,0.0,0.0,0.0,0.003,0.292333,0.145,1.118,2.190333,...,30.957444,32.862667,33.887667,32.620111,30.592222,31.677111,31.273667,26.608444,21.895444,26.553889
3,Central,1993,0.0,0.0,0.000333,0.119333,0.646333,0.173667,3.025,2.957667,...,30.436333,32.704222,32.951444,31.131556,30.119111,30.464778,30.612667,28.435111,24.960222,26.655667
4,Central,1994,0.0,0.0,0.0,0.0,0.389333,0.144667,2.592,1.556667,...,31.737556,33.064222,33.141778,30.683111,30.624778,31.021778,31.593778,25.942444,22.093889,27.520667


Shape: (175, 28)

Data types:
REGION       object
YEAR          int64
JAN_RAIN    float64
FEB_RAIN    float64
MAR_RAIN    float64
APR_RAIN    float64
MAY_RAIN    float64
JUN_RAIN    float64
JUL_RAIN    float64
AUG_RAIN    float64
SEP_RAIN    float64
OCT_RAIN    float64
NOV_RAIN    float64
DEC_RAIN    float64
ANN_RAIN    float64
JAN_TEMP    float64
FEB_TEMP    float64
MAR_TEMP    float64
APR_TEMP    float64
MAY_TEMP    float64
JUN_TEMP    float64
JUL_TEMP    float64
AUG_TEMP    float64
SEP_TEMP    float64
OCT_TEMP    float64
NOV_TEMP    float64
DEC_TEMP    float64
ANN_TEMP    float64
dtype: object

Missing values:
0 missing values in total
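The total above confirms that nothing is missing here, but a single number would not show where gaps sit if a later version of the file had any. A minimal sketch of a per-column breakdown follows; the `toy` frame and the `missing_by_column` helper are hypothetical and only illustrate the idea with made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `df`; the values are made up.
toy = pd.DataFrame({
    "REGION": ["Central", "East", "North"],
    "YEAR": [1990, 1990, 1990],
    "JAN_RAIN": [0.0, np.nan, 0.1],
    "JAN_TEMP": [25.1, 24.3, np.nan],
})

def missing_by_column(frame: pd.DataFrame) -> pd.Series:
    """Count missing values per column, keeping only columns that have gaps."""
    counts = frame.isnull().sum()
    return counts[counts > 0].sort_values(ascending=False)

missing = missing_by_column(toy)
print(missing)
```

On the real dataset, `missing_by_column(df)` would be expected to return an empty Series, matching the total of 0 reported above.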


In [3]:
# Check for duplicates
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

# Check region and year coverage
print("\nUnique regions:", df['REGION'].unique())
print("Year range:", df['YEAR'].min(), "to", df['YEAR'].max())


Number of duplicate rows: 0

Unique regions: ['Central' 'East' 'North' 'South' 'West']
Year range: 1990 to 2024
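`df.duplicated()` only flags rows that are identical in every column, and the min/max year does not show whether each region has every year in between. A sketch of a stricter check on the (`REGION`, `YEAR`) pairs is given below; `check_panel` and the `toy` frame are hypothetical names used for illustration.

```python
import pandas as pd

def check_panel(frame: pd.DataFrame, region_col: str = "REGION", year_col: str = "YEAR") -> dict:
    """Check that each region has exactly one row for every year in the overall range."""
    # Rows that repeat a (region, year) pair, even if other columns differ
    dup_pairs = int(frame.duplicated(subset=[region_col, year_col]).sum())
    # Number of distinct years observed in each region
    years_per_region = frame.groupby(region_col)[year_col].nunique()
    # Number of years a gap-free series would contain
    expected_years = int(frame[year_col].max() - frame[year_col].min() + 1)
    balanced = dup_pairs == 0 and bool((years_per_region == expected_years).all())
    return {
        "duplicate_pairs": dup_pairs,
        "years_per_region": years_per_region.to_dict(),
        "balanced": balanced,
    }

# Hypothetical toy frame: two regions, three consecutive years each
toy = pd.DataFrame({
    "REGION": ["A", "A", "A", "B", "B", "B"],
    "YEAR": [2000, 2001, 2002, 2000, 2001, 2002],
})

print(check_panel(toy))
```

Applied to `df`, a result with `balanced` set to `True` would confirm that the 175 rows are exactly five regions times 35 years.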


## Descriptive Statistics

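Now we summarize the main characteristics of the dataset.  
This includes the mean, standard deviation, minimum, maximum, and range (max minus min) for every numeric variable. `YEAR` is stored as an integer, so it appears in the table too, although its mean and standard deviation are not meaningful.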


In [6]:
desc = df.describe().T
desc['range'] = desc['max'] - desc['min']
desc[['mean', 'std', 'min', 'max', 'range']].round(2)


,mean,std,min,max,range
YEAR,2007.0,10.13,1990.0,2024.0,34.0
JAN_RAIN,0.01,0.04,0.0,0.34,0.34
FEB_RAIN,0.01,0.02,0.0,0.21,0.21
MAR_RAIN,0.02,0.06,0.0,0.54,0.54
APR_RAIN,0.14,0.29,0.0,1.84,1.84
MAY_RAIN,0.69,1.14,0.0,6.9,6.9
JUN_RAIN,1.12,1.56,0.0,6.53,6.53
JUL_RAIN,2.51,2.44,0.0,10.99,10.99
AUG_RAIN,2.88,2.31,0.01,9.17,9.16
SEP_RAIN,1.79,2.01,0.0,8.21,8.21
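The summary can also be restricted to the climate variables so that `YEAR` does not appear among them. The following is a sketch under that assumption; the `summarize` helper, its `exclude` parameter, and the `toy` frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for `df`; the values are made up.
toy = pd.DataFrame({
    "REGION": ["Central", "Central", "East", "East"],
    "YEAR": [1990, 1991, 1990, 1991],
    "JAN_RAIN": [0.0, 0.2, 0.1, 0.3],
    "JAN_TEMP": [25.0, 27.0, 24.0, 26.0],
})

def summarize(frame: pd.DataFrame, exclude=("YEAR",)) -> pd.DataFrame:
    """Describe numeric columns, skipping identifier-like columns such as YEAR."""
    numeric = frame.select_dtypes(include="number")
    numeric = numeric.drop(columns=list(exclude), errors="ignore")
    desc = numeric.describe().T
    desc["range"] = desc["max"] - desc["min"]
    return desc[["mean", "std", "min", "max", "range"]].round(2)

summary = summarize(toy)
print(summary)
```

The same statistics are computed as in the cell above; only the set of columns differs.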


## Observations

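- The dataset includes **175 records** and **28 columns**: `REGION`, `YEAR`, and 13 rainfall and 13 temperature columns (twelve monthly values plus one annual value each).
- It covers **1990–2024** (35 years) across **five regions** (North, South, East, West, Central); 5 × 35 = 175, so the record count is consistent with one row per region per year.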
- No missing values or duplicate records were found.
- Annual temperature ranges between X–Y °C, rainfall between X–Y mm.
- Data appears consistent and ready for deeper EDA (monthly and regional trends).
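
The annual temperature and rainfall ranges in the observations are still placeholders. One way to obtain them is sketched below, assuming the `ANN_TEMP` and `ANN_RAIN` columns listed in the data types; the `annual_ranges` helper and the `toy` values are hypothetical. The units in the bullet (°C and mm) come from the text above and are not confirmed by the data itself.

```python
import pandas as pd

# Hypothetical toy frame; real values would come from `df`.
toy = pd.DataFrame({
    "ANN_RAIN": [0.5, 1.2, 0.8],
    "ANN_TEMP": [26.5, 28.0, 27.3],
})

def annual_ranges(frame: pd.DataFrame, columns=("ANN_TEMP", "ANN_RAIN")) -> dict:
    """Return (min, max) for each annual column."""
    return {col: (float(frame[col].min()), float(frame[col].max())) for col in columns}

ranges = annual_ranges(toy)
for col, (low, high) in ranges.items():
    print(f"{col}: {low:.2f} to {high:.2f}")
```

Running `annual_ranges(df)` on the loaded dataset would give the figures needed to complete that bullet.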


### Next Steps
In the next notebook (`2_monthly_trends.ipynb`), we’ll analyze how temperature and rainfall vary across **months** and **regions** to identify seasonal patterns.
