
# Session 2 – **Exploratory Data Analysis (EDA)**

This notebook guides you through a practical, end‑to‑end EDA workflow.  
It is a warm-up: once you have completed it successfully, repeat the same steps with your own data.

**What you'll do**
1. Load and inspect your dataset
2. Summarize numeric & categorical variables
3. Explore missingness and duplicates
4. Visualize distributions, relationships, and correlations
5. Capture insights in short Markdown notes
6. Draft 2–3 KPI ideas informed by your EDA

**Note:** This notebook intentionally focuses on *EDA only*.  
> Cleaning/feature engineering strategies (e.g., imputation/outlier handling) will be done in the next session.


You might need to download files from:
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/GTNEJD&version=1.2


## 0) Setup
- Update `DATA_PATH` to your chosen dataset (CSV).  
- If the file is not found, a small **synthetic demo dataset** will be generated so you can still run the notebook end‑to‑end.


In [1]:

import os
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Display options
pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 120)

# ---- Path to your CSV ----
# If you already have the Harvard PeopleSuN dataset in last session's data folder,
# copy it into this week's data folder and set the path, for example:
DATA_PATH = "../data/peoplesun_hh_anon.csv"  

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)



## 1) Load Data
- Try to read the CSV at `DATA_PATH`.
- If not found, we build a small **synthetic dataset** to let you proceed.


In [3]:

def make_synthetic_demo(n=1000):
    # Toy dataset with mixed types: numeric, categorical, date, and some missingness
    dates = pd.date_range("2021-01-01", periods=n, freq="D")
    region = np.random.choice(["North","South","East","West"], size=n, p=[0.25,0.25,0.25,0.25])
    income = np.random.lognormal(mean=10, sigma=0.5, size=n)  # heavily skewed
    pop_density = np.random.gamma(shape=2.0, scale=50, size=n)
    electrified = np.random.binomial(n=1, p=np.clip(0.3 + 0.001*(pop_density) + 0.00001*(income), 0.05, 0.95))
    # insert some missingness
    income[np.random.choice(n, size=n//15, replace=False)] = np.nan
    pop_density[np.random.choice(n, size=n//20, replace=False)] = np.nan
    df = pd.DataFrame({
        "date": dates,
        "region": region,
        "income": income,
        "population_density": pop_density,
        "electrified": electrified
    })
    # a few duplicates
    df = pd.concat([df, df.sample(5, random_state=RANDOM_SEED)], ignore_index=True)
    return df

if os.path.exists(DATA_PATH):
    df = pd.read_csv(DATA_PATH, low_memory=False)
    source_note = f"Loaded dataset from: {DATA_PATH}"
else:
    # Fallback so that df and source_note are always defined
    df = make_synthetic_demo(n=1200)
    source_note = "Using synthetic demo dataset (file not found)."

source_note, df.head()


('Loaded dataset from: ../data/peoplesun_hh_anon.csv',
             zone  state    eaid          lga                  urca_cat                                  hhid  q102  \
 0  North Central  Niger  NI_197      mashegu  <1hr to small city/town+  d8af8ab5-30ab-4d9f-bfe4-81231dbe5dbf     1   
 1  North Central  Niger  NI_197      mashegu  <1hr to small city/town+  e8245d5c-8130-4e78-b4b0-1053b7ecbc9b     1   
 2     North West   Kano  KN_104  garun_malam        <1hr to large city  435c8e27-517a-46b9-af04-48830e086d7a     1   
 3     North West   Kano  KN_104  garun_malam        <1hr to large city  9303fa53-9fd2-41a9-9f0d-9567dbe5168e     1   
 4     North West   Kano  KN_104  garun_malam        <1hr to large city  c62cc5a5-29c5-423b-9543-a7b05bda454b     1   
 
    q103  q104  q105  q106  q107  q108  q109  q110  q201  q202  q203_1  q203_2  q203_3  q203_4  q204  q205  q206  \
 0   NaN     1     1    36     2   NaN   NaN   NaN     5     4      13       4       1       0     2     8     6 


## 2) First Look
**Questions**
- What do rows represent? What does each column mean?
- Do we recognize numeric vs categorical vs datetime columns?


In [5]:

print(source_note)
print("\nShape:", df.shape)
print("\nColumns:", list(df.columns))
print("\nInfo:")
df.info()  # prints its own report; wrapping it in print() would add a stray "None"
#display(df.head(10))


Loaded dataset from: ../data/peoplesun_hh_anon.csv

Shape: (3599, 293)

Columns: ['zone', 'state', 'eaid', 'lga', 'urca_cat', 'hhid', 'q102', 'q103', 'q104', 'q105', 'q106', 'q107', 'q108', 'q109', 'q110', 'q201', 'q202', 'q203_1', 'q203_2', 'q203_3', 'q203_4', 'q204', 'q205', 'q206', 'q207_1', 'q207_2', 'q207_3', 'q208', 'q209', 'q210', 'q210_1', 'q211', 'q211__1', 'q211__2', 'q211__3', 'q211__4', 'q211__5', 'q211__6', 'q211__7', 'q211__8', 'q211__9', 'q211__10', 'q211__11', 'q211__12', 'q211__0', 'q211__96', 'q211_1', 'q212', 'q212_1', 'q213', 'q213_1', 'q213_1__1', 'q213_1__2', 'q213_1__96', 'q214', 'q214__1', 'q214__2', 'q214__3', 'q214__4', 'q214__5', 'q214__6', 'q214__7', 'q214__8', 'q214__9', 'q214__10', 'q214__11', 'q214__12', 'q214__13', 'q214__96', 'q215', 'q215_1', 'q215_2', 'q216', 'q216__1', 'q216__2', 'q216__3', 'q216__4', 'q216__5', 'q216__6', 'q216__7', 'q216__8', 'q216__9', 'q216__10', 'q216__11', 'q216__12', 'q216__13', 'q217', 'q217__1', 'q217__2', 'q217__3', 'q217__

To make your EDA more interpretable:

### a) Load and inspect the codebook

If you have a codebook file such as `peoplesun_hh_odk_codebook.xlsx`, load it (reading `.xlsx` files with `pd.read_excel` requires the `openpyxl` package):

In [10]:
codebook = pd.read_excel("../data/peoplesun_hh_odk_codebook.xlsx")
display(codebook.head())

Unnamed: 0,type_kobo,type,name,label
0,select_one state,select_one,states,State
1,select_one lga,select_one,lgas,Local Government Areas
2,select_one yesnolist,logical,q102,102. Would you like to participate in the survey?
3,select_one rejectlist,select_one,q103,103. Record the reason from the following and ...
4,select_one rellist,select_one,q104,104. What is your relationship to the househol...


In [12]:
mapping = dict(zip(codebook["name"], codebook["label"]))
df_renamed = df.rename(columns=mapping)
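
The direct `dict(zip(...))` mapping can produce `NaN` or repeated column names if the codebook contains empty or duplicated labels. Below is a minimal sketch of a more defensive rename. It uses a small made-up codebook and frame, and the helper name `build_label_mapping` is hypothetical, not part of the course material.

```python
import pandas as pd

def build_label_mapping(codebook: pd.DataFrame, columns) -> dict:
    """Map column names to codebook labels, skipping empty or duplicated labels."""
    cb = codebook.dropna(subset=["name", "label"])
    cb = cb[cb["name"].isin(columns)]
    # Keep only labels that occur once, so renamed columns stay unique
    cb = cb[~cb["label"].duplicated(keep=False)]
    return dict(zip(cb["name"], cb["label"]))

# Toy data standing in for the real codebook and survey frame
toy_codebook = pd.DataFrame({
    "name": ["q105", "q106", "q107", "q999"],
    "label": ["105. What is your gender?", "106. What is your age?", None, "Unused"],
})
toy_df = pd.DataFrame({"q105": [1, 2], "q106": [36, 60], "q107": [2, 2]})

toy_mapping = build_label_mapping(toy_codebook, toy_df.columns)
toy_renamed = toy_df.rename(columns=toy_mapping)
print(list(toy_renamed.columns))
```

Columns without a usable label (here `q107`) keep their original code, so nothing is lost in the rename.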


In [13]:
df_renamed

Unnamed: 0,zone,state,eaid,lga,urca_cat,hhid,102. Would you like to participate in the survey?,103. Record the reason from the following and abort the survey,104. What is your relationship to the household head?,105. What is your gender?,106. What is your age?,107. What is your marital status?,108. What is the gender of the household head?,109. What is the age of the household head?,110. What is the marital status of the household head?,201. How many adults live in this household permanently?,202. How many chairs do you have in your home?,203.1. How many household members are less than 18 years of age?,203.2. How many household members are between 18 to 30 years of age?,203.3. How many household members are between 31 to 55 years of age?,203.4. How many household members are over 55 years of age?,204. How many separate roofs make-up your home plot/compound?,205. How many separate rooms are in all buildings of your home plot/compound?,206. How would you describe the majority of the roofs? [PROMPT],207.1. How many [bicycle] do you own?,207.2. How many [Motorbike / scooter] do you own?,207.3. How many [Car / van / truck] do you own?,208. What is the highest level of education achieved by a male currently living in this household?,209. What is the highest level of education achieved by a female currently living in this household?,210. What is the MAIN income source of the household?,210.1 Other (specify),211. Can you list ALL of the income sources of your household? [MC],q211__1,q211__2,q211__3,q211__4,q211__5,q211__6,q211__7,q211__8,q211__9,q211__10,q211__11,q211__12,q211__0,q211__96,211.1 Other (specify),212. Does anyone in the household have a bank account at a formal institution?,212.1. At which institution is this account or savings? [PROMPT],213. Does anyone in the household use an informal savings groups (adashi/esusu/ajo) to save money?,213.1. What type of informal savings group do household members use? [MC] [PROMPT],q213_1__1,q213_1__2,q213_1__96,"214. 
If you can get a loan/credit, what are the sources of credit/loans? [MC]",q214__1,q214__2,q214__3,q214__4,q214__5,q214__6,q214__7,q214__8,q214__9,q214__10,q214__11,q214__12,q214__13,q214__96,"215. Do you have a mobile money account (e.g. Opay, Firstmonie, Paga, MoMo) [ENUMERATOR: if respondant does not know what this is, select ""no""]",215.1. Do you use mobile money to make payments over the mobile phone?,215.2. Have you used the account in the past 90 days?,216. In which months of the year does the household have the least money or is it the same over the whole year? [MC] [ENUMERATOR: make sure you check option “same over the whole year” if this is the case],q216__1,q216__2,q216__3,q216__4,q216__5,q216__6,q216__7,q216__8,q216__9,q216__10,q216__11,q216__12,q216__13,217. In which months of the year does the household have the most money? [MC] [ENUMERATOR: make sure you check option “same over the whole year” if this is the case],q217__1,q217__2,q217__3,q217__4,q217__5,q217__6,q217__7,q217__8,q217__9,q217__10,q217__11,q217__12,q217__13,...,q307_2__1,q307_2__2,q307_2__3,q307_2__4,q307_2__5,q307_2__6,q307_2__7,q307_2__8,q307_2__9,q307_2__10,q307_2__11,q307_2__12,q307_2__13,"307.3. On average, how many hours per day over the last 7 days did you have access to electricity from the NATIONAL GRID ONLY? [ENUMERATOR: explain that a full day is 24 hours]",307.4. How many days did you have an unscheduled outage or blackout over the last 7 days from the NATIONAL GRID ONLY?,"307.5. On average, what was the duration of the outage or blackout each day over the last 7 days, in HOURS, from the NATIONAL GRID ONLY? [ENUMERATOR: If the outage was the entire day each time, or it was out for the entire week, enter 24].","307.6. Do you have a meter installed, for the NATIONAL GRID?","307.7. Are you sharing a meter, for the NATIONAL GRID?","307.8. What was the connection fee (NOT wiring), for the NATIONAL GRID?","307.9. 
What did you pay for wiring and materials, for the NATIONAL GRID?","307.10. What is the typical monthly bill, for the NATIONAL GRID?",307.11. How much do you agree that your monthly bill for the NATIONAL GRID is affordable? [PROMPT],307.12. How much do you trust your monthly bill for the NATIONAL GRID? [PROMPT],"308. Do you regularly use basic or traditional lighting sources, e.g. battery torches, candles and fuel lamps? [ENUMERATOR: clarify that this does NOT include generators / generator fuel]","309. What is the total amount that this household normally spends, per month, on fuels, candles and batteries for lighting only? (Not including other sources of electricity)",401. Who decides / whose opinions are valued when deciding on the purchase of durable goods like electric appliances? [MC] [PROMPT] [ENUMERATOR: make sure it is clear this is multiple choice],q401__1,q401__2,q401__3,q401__4,q401__5,q401__6,q401__99,q401__96,q401__98,401.1 Other (specify),402. Which of these common appliances does your household own? [MC] [PROMPT],q402__1,q402__2,q402__3,q402__4,q402__5,q402__6,404. Which of these other appliances does your household own? [MC] [PROMPT],q404__1,q404__2,q404__3,q404__4,q404__5,q404__6,q404__7,q404__8,q404__9,q404__10,q404__11,q404__0,q404__96,404.1. Other (specify).,405. Has anyone in your family taken a loan to buy a household appliance?,405.1. Who has provided a loan for appliance purchase to your household?,q405_1__1,q405_1__2,q405_1__3,q405_1__4,q405_1__5,q405_1__6,q405_1__7,q405_1__8,q405_1__9,q405_1__10,q405_1__11,q405_1__12,q405_1__13,q405_1__96,405.2. How much was the largest loan your household has taken for buying an appliance?,501. Which of the following cooking stoves does your household use for cooking or boiling water? [MC] [PROMPT],q501__1,q501__2,q501__3,q501__4,q501__5,q501__96,503. Who decides / whose opinion is valued when deciding on which stove you buy and which fuels you cook with? 
[MC] [PROMPT] [ENUMERATOR: make sure it is clear this is multiple choice],q503__1,q503__2,q503__3,q503__4,q503__5,q503__6,q503__99,q503__96,q503__98,503.1. Other (specify),604.1. How much do you agree that everyone can breathe air free of smoke or pollution in the home all day? [PROMPT],604.2. How much do you agree that everyone in the household has adequate electrical light when they need it? [PROMPT],604.3. How much do you agree that you are able to keep your house at a comfortable temperature for everyone? [PROMPT],604.4. How much do you agree that you are able to store perishable foods over long periods of time? [PROMPT],604.5. How much do you agree that every adult in the household has regular access to a communications device? [PROMPT],604.6. How much do you agree that every adult in the household has regular access to digital information and entertainment sources in the home? [PROMPT],natweight
0,North Central,Niger,NI_197,mashegu,<1hr to small city/town+,d8af8ab5-30ab-4d9f-bfe4-81231dbe5dbf,1,,1,1,36,2,,,,5,4,13,4,1,0,2,8,6,0,2,0,22,14,96,Politics,2 96,0,1,0,0,0,0,0,0,0,0,0,0,0,1,Politics and farming for consumption,1,1.0,1,1,1.0,0.0,0.0,1 11,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,,,6 7,0,0,0,0,0,1,1,0,0,0,0,0,0,12 1 2,1,1,0,0,0,0,0,0,0,0,0,1,0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,4.0,3.0,20.0,0.0,0.0,3000.0,15000.0,3000.0,2.0,2.0,1,2000.0,1,1,0,0,0,0,0,,0,0,,1 3 4 6,1,0,1,1,0,1,2 11,0,1,0,0,0,0,0,0,0,0,1,0,0,,0,,,,,,,,,,,,,,,,,4,0,0,0,1,0,0,1,1,0,0,0,0,0,,0,0,,4,2,3,2,3,3,0.039141
1,North Central,Niger,NI_197,mashegu,<1hr to small city/town+,e8245d5c-8130-4e78-b4b0-1053b7ecbc9b,1,,1,1,60,2,,,,6,2,4,3,2,1,3,9,6,0,0,0,14,13,1,,1 3 96,1,0,1,0,0,0,0,0,0,0,0,0,0,1,Farming for consumption,0,,1,1,1.0,0.0,0.0,11,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,,,8,0,0,0,0,0,0,0,1,0,0,0,0,0,11 12,0,0,0,0,0,0,0,0,0,0,1,1,0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,6.0,3.0,18.0,0.0,0.0,5000.0,8000.0,3000.0,1.0,2.0,1,1000.0,1,1,0,0,0,0,0,,0,0,,1 2 3 4 6,1,1,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,,0,,,,,,,,,,,,,,,,,4,0,0,0,1,0,0,1,1,0,0,0,0,0,,0,0,,4,2,3,1,3,3,0.039141
2,North West,Kano,KN_104,garun_malam,<1hr to large city,435c8e27-517a-46b9-af04-48830e086d7a,1,,1,1,45,1,,,,3,6,3,2,1,0,1,5,6,1,1,0,14,8,2,,2 8,0,1,0,0,0,0,0,1,0,0,0,0,0,0,,1,1.0,1,1,1.0,0.0,0.0,11,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,,,8 9,0,0,0,0,0,0,0,1,1,0,0,0,0,10 11 1,1,0,0,0,0,0,0,0,0,1,1,0,0,...,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,12.0,7.0,12.0,0.0,0.0,1000.0,3000.0,2000.0,3.0,3.0,0,,1,1,0,0,0,0,0,0.0,0,0,,1 3 6 4 2,1,1,1,1,0,1,11,0,0,0,0,0,0,0,0,0,0,1,0,0,,0,,,,,,,,,,,,,,,,,4,0,0,0,1,0,0,1,1,0,0,0,0,0,0.0,0,0,,4,3,4,2,4,1,0.109779
3,North West,Kano,KN_104,garun_malam,<1hr to large city,9303fa53-9fd2-41a9-9f0d-9567dbe5168e,1,,1,1,47,1,,,,3,15,7,2,1,0,1,10,6,3,1,0,14,8,2,,2 7,0,1,0,0,0,0,1,0,0,0,0,0,0,0,,1,1.0,0,,,,,11,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,,,1 4 3 8,1,0,1,1,0,0,0,1,0,0,0,0,0,12 11 10,0,0,0,0,0,0,0,0,0,1,1,1,0,...,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,12.0,7.0,12.0,0.0,0.0,2500.0,5000.0,2000.0,4.0,3.0,0,,1,1,0,0,0,0,0,0.0,0,0,,1 2 4 5 6 3,1,1,1,1,1,1,2 11 1,1,1,0,0,0,0,0,0,0,0,1,0,0,,0,,,,,,,,,,,,,,,,,4,0,0,0,1,0,0,1 2,1,1,0,0,0,0,0.0,0,0,,4,3,4,3,4,2,0.109779
4,North West,Kano,KN_104,garun_malam,<1hr to large city,c62cc5a5-29c5-423b-9543-a7b05bda454b,1,,1,1,45,1,,,,2,1,4,1,1,0,2,3,6,0,1,0,24,24,2,,2,0,1,0,0,0,0,0,0,0,0,0,0,0,0,,0,,0,,,,,11,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,,,8 9,0,0,0,0,0,0,0,1,1,0,0,0,0,10 11 12,0,0,0,0,0,0,0,0,0,1,1,1,0,...,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,7.0,19.0,0.0,0.0,2000.0,4500.0,1000.0,4.0,2.0,1,4000.0,1 2,1,1,0,0,0,0,0.0,0,0,,1 3,1,0,1,0,0,0,11,0,0,0,0,0,0,0,0,0,0,1,0,0,,0,,,,,,,,,,,,,,,,,4,0,0,0,1,0,0,1 2,1,1,0,0,0,0,0.0,0,0,,4,2,4,2,4,2,0.109779
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3594,North Central,Niger,NI_194,chanchaga,<1hr to int. city,9303a8af-9dc3-466f-9fa1-65dbc93c0990,1,,4,1,25,4,1.0,50.0,1.0,6,3,0,4,2,0,1,4,6,0,0,1,22,20,5,,5 6,0,0,0,0,1,1,0,0,0,0,0,0,0,0,,1,1.0,0,,,,,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1.0,1.0,1 2,1,1,0,0,0,0,0,0,0,0,0,0,0,11 12 10,0,0,0,0,0,0,0,0,0,1,1,1,0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,5.0,2.0,19.0,1.0,0.0,99.0,99.0,4000.0,2.0,3.0,1,1000.0,1,1,0,0,0,0,0,0.0,0,0,,1 3 4 6,1,0,1,1,0,1,2 6 8,0,1,0,0,0,1,0,1,0,0,0,0,0,,0,,,,,,,,,,,,,,,,,1 4,1,0,0,1,0,0,98,0,0,0,0,0,0,0.0,0,1,,4,2,4,2,3,2,0.080206
3595,North West,Kaduna,KD_81,kagarko,<1hr to small city/town+,22c88a1a-4e0d-49ec-bfff-8f6f70282e07,1,,1,1,50,1,,,,5,5,2,3,2,0,3,7,6,0,2,0,11,20,7,,7,0,0,0,0,0,0,1,0,0,0,0,0,0,0,,1,1.0,0,,,,,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,8,0,0,0,0,0,0,0,1,0,0,0,0,0,11 12 1 2 3 4 5 6 7,1,1,1,1,1,1,1,0,0,0,1,1,0,...,,,,,,,,,,,,,,,,,,,,,,,,0,,1 2,1,1,0,0,0,0,0.0,0,0,,1 2 3 4 5 6,1,1,1,1,1,1,2 96,0,1,0,0,0,0,0,0,0,0,0,0,1,Stabilizer,0,,,,,,,,,,,,,,,,,2 4,0,1,0,1,0,0,1 2,1,1,0,0,0,0,0.0,0,0,,2,2,2,2,3,3,0.069566
3596,North West,Kaduna,KD_82,jaba,<1hr to small city/town+,e3ffc39e-5b54-4de0-857f-d69a9f5de4a9,1,,1,1,42,1,,,,2,4,3,0,2,0,2,4,96,0,1,0,17,14,5,,1 5,1,0,0,0,1,0,0,0,0,0,0,0,0,0,,1,1.0,1,1,1.0,0.0,0.0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1.0,1.0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,12 11,0,0,0,0,0,0,0,0,0,0,1,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,7.0,5.0,17.0,0.0,0.0,3000.0,5000.0,2000.0,3.0,2.0,1,1500.0,1,1,0,0,0,0,0,0.0,0,0,,1 2 3 4 6,1,1,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,,0,,,,,,,,,,,,,,,,,5 4,0,0,0,1,1,0,2,0,1,0,0,0,0,0.0,0,0,,3,2,3,1,3,2,0.069566
3597,North West,Katsina,KT_146,bindawa,Town,d3fb5d54-0e8c-4a67-bbfc-b0ea12ae9438,1,,2,2,35,2,1.0,45.0,2.0,3,8,8,1,2,0,3,11,6,0,1,1,22,20,5,,5 1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,,1,1.0,0,,,,,1 5 6,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,,,8 7 6,0,0,0,0,0,1,1,1,0,0,0,0,0,2 3 4 1 12 11,1,1,1,1,0,0,0,0,0,0,1,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,9.0,7.0,15.0,0.0,0.0,2000.0,5000.0,3000.0,4.0,4.0,1,1200.0,1 2,1,1,0,0,0,0,0.0,0,0,,1 3 4 6,1,0,1,1,0,1,2 4 11 7,0,1,0,1,0,0,1,0,0,0,1,0,0,,0,,,,,,,,,,,,,,,,,1 2 4,1,1,0,1,0,0,1 2,1,1,0,0,0,0,0.0,0,0,,4,4,4,4,4,4,0.064779



## 3) Data Types & Basic Hygiene (EDA-only)
- Identify numeric, categorical, datetime columns.
- Parse obvious datetimes if necessary (**no** cleaning beyond parsing).


In [None]:

# Try to parse columns that look like datetimes (non-destructive: works on a copy)
def coerce_datetimes(frame: pd.DataFrame, candidates=None):
    frame = frame.copy()
    if candidates is None:
        # Skip numeric columns: pd.to_datetime would silently read numbers as epoch offsets
        candidates = [
            c for c in frame.columns
            if ("date" in c.lower() or "time" in c.lower())
            and not pd.api.types.is_numeric_dtype(frame[c])
        ]
    for c in candidates:
        try:
            frame[c] = pd.to_datetime(frame[c], errors="raise")
        except Exception:
            pass  # leave the column unchanged if it does not parse
    return frame

df_dt = coerce_datetimes(df)

# Identify column types
numeric_cols = df_dt.select_dtypes(include=[np.number]).columns.tolist()
# "datetime" and "datetimetz" cover naive and timezone-aware columns at any resolution
datetime_cols = df_dt.select_dtypes(include=["datetime", "datetimetz"]).columns.tolist()
categorical_cols = [c for c in df_dt.columns if c not in numeric_cols + datetime_cols]

print("Numeric:", numeric_cols)
print("Datetime:", datetime_cols)
print("Categorical:", categorical_cols[:10], "... ({} total)".format(len(categorical_cols)))
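
In coded survey data, many columns that pandas reads as numeric are really category codes (the codebook lists `q104` as `select_one`, for example). One rough way to flag such columns is sketched below on toy data. The helper name and the cutoff on distinct values are assumptions chosen for illustration.

```python
import pandas as pd

def low_cardinality_numeric(frame: pd.DataFrame, max_unique: int = 10) -> list:
    """Return numeric columns with at most `max_unique` distinct non-null values."""
    numeric = frame.select_dtypes(include="number")
    n_unique = numeric.nunique(dropna=True)
    return n_unique[n_unique <= max_unique].index.tolist()

# Toy frame: one coded column, one genuinely numeric column, one text column
toy = pd.DataFrame({
    "q104": [1, 1, 2, 4, 1, 2],            # relationship code
    "q106": [36, 60, 45, 47, 25, 50],      # age in years
    "state": ["Niger", "Kano", "Kano", "Kaduna", "Niger", "Kano"],
})
coded_like = low_cardinality_numeric(toy, max_unique=5)
print(coded_like)
```

Columns flagged this way are candidates to check against the codebook before reading their mean or standard deviation as meaningful.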



## 4) Descriptive Statistics
Use `.describe()` for numeric and a custom summary for categorical columns.


In [None]:

numeric_summary = df_dt[numeric_cols].describe().T if numeric_cols else pd.DataFrame()
categorical_summary = pd.DataFrame({
    "n_unique": [df_dt[c].nunique(dropna=True) for c in categorical_cols],
    "top_values": [df_dt[c].value_counts(dropna=False).head(5).to_dict() for c in categorical_cols]
}, index=categorical_cols) if categorical_cols else pd.DataFrame()

display(numeric_summary)
display(categorical_summary.head(15))



## 5) Missingness Overview
- Which columns have missing values? (EDA awareness only — **no imputation yet**)


In [None]:

missing = df_dt.isna().mean().sort_values(ascending=False)
missing = (missing * 100).round(2).rename("%missing")
display(missing.to_frame().head(30))
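
Beyond the per-column ranking, it can help to list columns above a missingness threshold and to look at missingness per row. The sketch below uses toy data, and the 50% threshold is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, np.nan, 4.0],
    "c": [1, 2, 3, 4],
})

col_missing = toy.isna().mean()              # share of missing values per column
mostly_empty = col_missing[col_missing > 0.5].index.tolist()
row_missing = toy.isna().mean(axis=1)        # share of missing values per row

print(mostly_empty)
print(row_missing.round(2).tolist())
```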



## 6) Duplicate Awareness
- How many duplicated rows exist?


In [None]:

dup_count = int(df_dt.duplicated().sum())
total = len(df_dt)
print(f"Duplicated rows: {dup_count} / {total} ({dup_count/total*100:.2f}%)")
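
`duplicated()` with no arguments only flags rows that are identical in every column. If the data has an identifier column (the PeopleSuN file has `hhid`), repeated IDs are worth checking too. The sketch below uses toy data and assumes `hhid` is meant to be unique per household.

```python
import pandas as pd

toy = pd.DataFrame({
    "hhid": ["a1", "b2", "b2", "c3"],
    "q106": [36, 60, 61, 45],
})

full_dups = int(toy.duplicated().sum())                   # identical in every column
key_dups = int(toy.duplicated(subset=["hhid"]).sum())     # repeated identifier
repeated_rows = toy[toy.duplicated(subset=["hhid"], keep=False)]

print(full_dups, key_dups)
print(repeated_rows)
```

Here no row is a full duplicate, yet one household ID appears twice with different values. The plain count above would miss that kind of issue.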



## 7) Univariate Distributions (Numeric)
- Histograms help reveal shape, skew, and potential outliers.
> Tip: For skewed variables (e.g., income), inspect both linear and log scales.


In [None]:

def plot_histograms(frame, columns, bins=30, logscale_candidates=None):
    for col in columns:
        data = frame[col].dropna()
        if data.empty:
            continue
        plt.figure(figsize=(6,4))
        plt.hist(data, bins=bins)
        plt.title(f"Histogram: {col}")
        plt.xlabel(col)
        plt.ylabel("Count")
        plt.tight_layout()
        plt.show()

        # Optional log view for strongly positive & skewed vars
        if logscale_candidates and col in logscale_candidates and (data>0).all():
            plt.figure(figsize=(6,4))
            plt.hist(np.log1p(data), bins=bins)
            plt.title(f"Histogram (log1p): {col}")
            plt.xlabel(f"log1p({col})")
            plt.ylabel("Count")
            plt.tight_layout()
            plt.show()

log_candidates = [c for c in numeric_cols if c.lower().startswith(("inc","rev","sales","price"))]
plot_histograms(df_dt, numeric_cols[:6], bins=30, logscale_candidates=log_candidates)
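
Choosing log-scale candidates by name prefix only works when column names are descriptive. An alternative is to pick them by sample skewness, sketched below on synthetic data. The cutoff of 1 is a hypothetical rule of thumb, not a fixed standard.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=1.0, size=2000),  # right-skewed
    "age": rng.normal(loc=40, scale=10, size=2000),          # roughly symmetric
})

skewness = toy.skew(numeric_only=True)
# Hypothetical rule: skewness above 1 and strictly positive values
skewed_positive = [
    c for c in toy.columns
    if skewness[c] > 1 and (toy[c].dropna() > 0).all()
]
print(skewness.round(2))
print(skewed_positive)
```

The resulting list could be passed as `logscale_candidates` to `plot_histograms`.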



## 8) Categorical Frequencies
- Bar plots of the **top categories** help you understand distribution of labels/classes.


In [None]:

def plot_top_categories(frame, col, top_n=10):
    vc = frame[col].value_counts(dropna=False).head(top_n)
    labels = vc.index.astype(str)
    values = vc.values
    plt.figure(figsize=(7,4))
    plt.bar(labels, values)
    plt.title(f"Top {top_n} categories: {col}")
    plt.xticks(rotation=45, ha="right")
    plt.ylabel("Count")
    plt.tight_layout()
    plt.show()

for c in categorical_cols[:3]:
    plot_top_categories(df_dt, c, top_n=10)
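
Raw counts are easier to compare across columns when shown next to percentages. Below is a small sketch of a frequency table on a toy series. The values are made up, and missing entries are kept as their own category.

```python
import pandas as pd

zone = pd.Series(
    ["North West", "North West", "North Central", None, "North West"],
    name="zone",
)

counts = zone.value_counts(dropna=False)
freq_table = counts.to_frame(name="count")
freq_table["percent"] = (freq_table["count"] / freq_table["count"].sum() * 100).round(1)
print(freq_table)
```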



## 9) Bivariate Relationships (Numeric ↔ Numeric)
- Scatter plots help spot trends, clusters, heteroscedasticity.


In [None]:

def scatter_pair(frame, x, y, sample=5000):
    sub = frame[[x, y]].dropna()
    if len(sub) > sample:
        sub = sub.sample(sample, random_state=RANDOM_SEED)
    plt.figure(figsize=(6,5))
    plt.scatter(sub[x], sub[y], s=8, alpha=0.6)
    plt.xlabel(x)
    plt.ylabel(y)
    plt.title(f"Scatter: {x} vs {y}")
    plt.tight_layout()
    plt.show()

# Try a few pairs automatically
if len(numeric_cols) >= 2:
    for x in numeric_cols[:2]:
        for y in numeric_cols[2:4]:
            if x != y:
                scatter_pair(df_dt, x, y)
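
A scatter plot can show a clear trend that Pearson correlation understates when the relationship is monotone but curved. A rank-based (Spearman) correlation, computed here as the Pearson correlation of the ranks, is one way to quantify that. The data below is synthetic and only illustrates the idea.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=500)
toy = pd.DataFrame({"x": x, "y": np.exp(x)})  # monotone but strongly non-linear

pearson = toy["x"].corr(toy["y"])                  # linear association
spearman = toy["x"].rank().corr(toy["y"].rank())   # Pearson on ranks = Spearman
print(round(pearson, 3), round(spearman, 3))
```

A large gap between the two values suggests a monotone but non-linear relationship.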



## 10) Numeric vs Categorical
- Compare distributions of a numeric variable **across categories** (group summaries).


In [None]:

def grouped_summary(frame, num_col, cat_col):
    g = frame[[num_col, cat_col]].dropna().groupby(cat_col)[num_col]
    summary = pd.DataFrame({
        "count": g.count(),
        "mean": g.mean(),
        "median": g.median(),
        "std": g.std()
    }).sort_values("mean", ascending=False)
    return summary

# Print grouped summaries for the first few combinations
for cat in categorical_cols[:2]:
    for num in numeric_cols[:3]:
        print(f"\n--- {num} by {cat} ---")
        display(grouped_summary(df_dt, num, cat).head(10))



## 11) Correlation Matrix (Numeric Only)
- Use with care: correlation ≠ causation.
- Still, helpful for spotting redundant features or strong linear relationships.


In [None]:

if numeric_cols:
    corr = df_dt[numeric_cols].corr(numeric_only=True)
    plt.figure(figsize=(6,5))
    plt.imshow(corr, interpolation="nearest")
    plt.title("Correlation Heatmap")
    plt.colorbar()
    tick_marks = range(len(corr.columns))
    plt.xticks(tick_marks, corr.columns, rotation=90)
    plt.yticks(tick_marks, corr.columns)
    plt.tight_layout()
    plt.show()
    display(corr.round(3))
else:
    print("No numeric columns to correlate.")
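
With many numeric columns, a heatmap quickly becomes unreadable, and a ranked list of the strongest pairs is often more useful. The sketch below builds such a list on synthetic data. The column names are placeholders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
a = rng.normal(size=300)
toy = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.1, size=300),  # nearly a copy of a
    "c": rng.normal(size=300),                      # unrelated noise
})

corr = toy.corr(numeric_only=True)
# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().rename("r")
top_pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(top_pairs.head(5))
```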



## 12) Time Trends (if you have a datetime column)
- Aggregate by day/week/month and plot trends.


In [None]:

# Note: pandas >= 2.2 deprecates the "M" alias in favour of "ME" (month end)
def plot_time_series(frame, date_col, value_col, freq="M"):
    sub = frame[[date_col, value_col]].dropna().copy()
    if sub.empty:
        print(f"No data to plot for {value_col}.")
        return
    sub = sub.set_index(date_col).sort_index().resample(freq).mean()
    plt.figure(figsize=(7,4))
    plt.plot(sub.index, sub[value_col])
    plt.title(f"Time Trend ({value_col}, resampled {freq})")
    plt.xlabel("Date")
    plt.ylabel(value_col)
    plt.tight_layout()
    plt.show()

if datetime_cols and numeric_cols:
    date_col = datetime_cols[0]
    for v in numeric_cols[:3]:
        plot_time_series(df_dt, date_col, v, freq="M")



## 13) (Optional) Geographic Hints
If you have **region/country** columns, consider:  
- Choropleth (later in Streamlit), or  
- Per‑region summary tables now.


In [None]:

geo_like = [c for c in df_dt.columns if any(k in c.lower() for k in ["region","country","state","province"])]
for geo in geo_like[:1]:
    for num in numeric_cols[:2]:
        print(f"\nAverage {num} by {geo}:")
        display(df_dt.groupby(geo, dropna=False)[num].mean().sort_values(ascending=False).to_frame().head(10))



## 14) Write Down 2–3 Insights (Markdown)
Use this cell to note **what surprised you**, **potential data issues**, and **early hypotheses**.

- Insight 1: …  
- Insight 2: …  
- Hypothesis / Next question: …  



## 15) KPI Drafts (Driven by EDA)
Fill in a first draft of KPIs **based on variables that are present and meaningful** in your dataset.

| KPI Name | Definition | Formula (words) | Python Expression (sketch) |
|---|---|---|---|
| Example: Electrification Rate | % of electrified households (`electrified` is a 0/1 flag per household) | electrified households / total households × 100 | `df['electrified'].mean() * 100` |
|  |  |  |  |
|  |  |  |  |

> Keep KPIs **simple and measurable**. We’ll refine them after cleaning/feature engineering.
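
As a sketch of how a KPI row could be turned into code: the share of electrified households is the mean of a 0/1 indicator. If the data carries survey weights (the PeopleSuN file has a `natweight` column, which appears to be one), a weighted mean may be more representative. The `electrified` column and all values below are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "electrified": [1, 0, 1, 1],          # hypothetical 0/1 flag per household
    "natweight": [0.1, 0.4, 0.1, 0.4],    # hypothetical survey weights
})

unweighted_rate = toy["electrified"].mean() * 100
weighted_rate = np.average(toy["electrified"], weights=toy["natweight"]) * 100
print(f"Unweighted: {unweighted_rate:.1f}%  Weighted: {weighted_rate:.1f}%")
```

Whether to report the weighted or unweighted figure depends on how the weights were constructed, so check the dataset documentation before using them.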



## 16) What We'll Tackle Next (Feature Engineering Preview)
- Data type fixes (categorical, datetime)  
- Missing value strategies (when to impute vs. drop)  
- Outlier detection & robust statistics  
- Scaling/encoding and derived features for your KPIs  
> These are **not performed here** — this is a pure EDA notebook.
