# Analysis and Visualization of the NEET Population (15-24 years)

**Author:** Nina and Ligia

**Data Source:** `dados_brutos.csv`

**Objective:** This notebook analyzes microdata from the PNAD Contínua for Q4 2024, calculates the total NEET population and breaks it down by gender, and generates a chart to visualize the results.

ano — interview year 2024
Source: IBGE

trimestre — 4
Source: IBGE

id_uf — numeric code of the Federative Unit (state) (11=RO … 53=DF). For abbreviation/name, join with br_bd_diretorios_brasil.uf.
Source: Base dos Dados

V1022 – Dwelling location (urban/rural): 1=Urban, 2=Rural.
Source: IBGE (FTP docs)

V2007 – Sex: 1=Male, 2=Female.
Source: IBGE

V2009 – Age: completed years.
Source: IBGE

V2010 – Race/Color: 1=White, 2=Black, 3=Asian (Yellow), 4=Brown (Pardo), 5=Indigenous, 9=Ignored/Not declared.
Source: IBGE

V3002 – Attends school/course? 1=Yes, 2=No (basis for the “E” in NEET).
Source: IBGE

VD4002 – Labor force status in the reference week (derived): 1=Employed, 2=Unemployed, 3=Out of the labor force (basis for the “T” in NEET: not employed).
Source: IBGE (FTP docs)

V4032 – Contributes to a social security institute for this job? 1=Yes, 2=No (asked of employed; “not applicable” if not employed).
Source: IBGE (FTP docs)

VD4019 – Usual earnings from all jobs (derived): monthly nominal income (currency values).
Source: IBGE (FTP docs)

V1028 – Sample weight: historical “household/person weight” with corrections and post-stratification.
Note: for person-level analyses, the more common weight is V1032 (final weight) (and its replicate weights for variance).
Source: IBGE

In [26]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


# Load labeled dataset
INPUT = "dados_brutos.csv"
data = pd.read_csv(INPUT)

# Quick peek
display(data.head())
print("Columns:", list(data.columns))

Unnamed: 0,ano,trimestre,id_uf,area_code,sex_code,age,race_code,school_code,education_code,labor_force_code,occupation_code,discouraged_code,children_under14_code,weight
0,2024,1,12,2,2,27,4,2.0,5.0,2.0,,,,42.102464
1,2024,1,12,2,1,0,4,,,,,,,47.496029
2,2024,1,12,2,2,28,4,2.0,7.0,1.0,2.0,,,47.496029
3,2024,1,12,2,1,24,4,2.0,5.0,1.0,1.0,,,57.921004
4,2024,1,12,2,2,0,4,,,,,,,51.784148


Columns: ['ano', 'trimestre', 'id_uf', 'area_code', 'sex_code', 'age', 'race_code', 'school_code', 'education_code', 'labor_force_code', 'occupation_code', 'discouraged_code', 'children_under14_code', 'weight']


Statistics


In [27]:
missing = data.isnull().sum()
missing_pct = (missing / len(data)) * 100
missing_table = pd.DataFrame({
    "Missing": missing,
    "Missing %": missing_pct.round(2)
}).sort_values("Missing", ascending=False)
display(missing_table)

Unnamed: 0,Missing,Missing %
discouraged_code,3728075,97.81
children_under14_code,3284683,86.18
occupation_code,2029284,53.24
labor_force_code,682910,17.92
education_code,217069,5.7
school_code,217069,5.7
age,0,0.0
sex_code,0,0.0
area_code,0,0.0
id_uf,0,0.0


In [28]:
display(data.describe().T)

for col in data.select_dtypes(include="object").columns:
    print(f"\nColumn: {col}")
    print(data[col].value_counts(dropna=False).head(10))


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ano,3811436.0,2023.501241,0.499999,2023.0,2023.0,2024.0,2024.0,2024.0
trimestre,3811436.0,2.495888,1.116032,1.0,1.0,2.0,3.0,4.0
id_uf,3811436.0,30.949518,10.98236,11.0,23.0,31.0,41.0,53.0
area_code,3811436.0,1.267603,0.44271,1.0,1.0,1.0,2.0,2.0
sex_code,3811436.0,1.51696,0.499712,1.0,1.0,2.0,2.0,2.0
age,3811436.0,37.323538,22.218444,0.0,18.0,37.0,54.0,119.0
race_code,3811436.0,2.628823,1.433083,1.0,1.0,4.0,4.0,9.0
school_code,3594367.0,1.751398,0.432203,1.0,2.0,2.0,2.0,2.0
education_code,3594367.0,3.643639,1.929314,1.0,2.0,3.0,5.0,7.0
labor_force_code,3128526.0,1.430354,0.495126,1.0,1.0,1.0,2.0,2.0


### 2. Organizing Raw Columns  

To make the dataset more intuitive, we will create cleaner versions of some raw columns.  
- **Sex:** Instead of `sex_code` (1=Male, 2=Female), we will create a binary variable called `female`:  
  - `1` if the person is Female  
  - `0` if the person is Male  
This will simplify later analysis, especially when calculating gender-specific statistics.  


In [29]:
# Replace sex_code with binary female indicator (1=Female, 0=Male)
data["sex_code"] = data["sex_code"].map({1: 0, 2: 1})

# Rename column to 'female'
data.rename(columns={"sex_code": "female"}, inplace=True)

# Quick check
print("Column 'female' distribution (1=Female, 0=Male):")
print(data["female"].value_counts(dropna=False))


Column 'female' distribution (1=Female, 0=Male):
female
1    1970360
0    1841076
Name: count, dtype: int64


In [30]:
# Drop children_under14_code column if it exists
data.drop(columns=["children_under14_code"])


Unnamed: 0,ano,trimestre,id_uf,area_code,female,age,race_code,school_code,education_code,labor_force_code,occupation_code,discouraged_code,weight
0,2024,1,12,2,1,27,4,2.0,5.0,2.0,,,42.102464
1,2024,1,12,2,0,0,4,,,,,,47.496029
2,2024,1,12,2,1,28,4,2.0,7.0,1.0,2.0,,47.496029
3,2024,1,12,2,0,24,4,2.0,5.0,1.0,1.0,,57.921004
4,2024,1,12,2,1,0,4,,,,,,51.784148
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3811431,2023,1,15,1,0,49,4,2.0,2.0,1.0,1.0,,1330.390534
3811432,2023,1,15,1,1,49,1,2.0,5.0,1.0,1.0,,794.985492
3811433,2023,4,15,1,0,50,4,2.0,2.0,1.0,1.0,,1323.351904
3811434,2023,4,15,1,1,40,4,2.0,5.0,1.0,1.0,,1601.243374


In [31]:
print(data)

          ano  trimestre  id_uf  area_code  female  age  race_code  \
0        2024          1     12          2       1   27          4   
1        2024          1     12          2       0    0          4   
2        2024          1     12          2       1   28          4   
3        2024          1     12          2       0   24          4   
4        2024          1     12          2       1    0          4   
...       ...        ...    ...        ...     ...  ...        ...   
3811431  2023          1     15          1       0   49          4   
3811432  2023          1     15          1       1   49          1   
3811433  2023          4     15          1       0   50          4   
3811434  2023          4     15          1       1   40          4   
3811435  2023          4     15          1       0   28          4   

         school_code  education_code  labor_force_code  occupation_code  \
0                2.0             5.0               2.0              NaN   
1        

### 3. Filter Data for Analysis  

In this step, we narrow down the dataset to focus only on the population relevant to our NEET study:  
1. **Period:** Q4 2024 (year = 2024, quarter = 4).  
2. **Age group:** Youth aged between 15 and 24 years old.  


In [32]:
mask_period = (data["ano"] == 2024) & (data["trimestre"] == 4)
mask_age = (data["age"].between(14, 25, inclusive="both"))

data_scope = data[mask_period & mask_age].copy()

print(f"Filtered dataset shape: {data_scope.shape[0]:,} rows × {data_scope.shape[1]:,} columns")
display(data_scope.head())


Filtered dataset shape: 77,145 rows × 14 columns


Unnamed: 0,ano,trimestre,id_uf,area_code,female,age,race_code,school_code,education_code,labor_force_code,occupation_code,discouraged_code,children_under14_code,weight
1002,2024,4,12,2,1,25,5,1.0,6.0,2.0,,,,53.395149
1018,2024,4,12,2,0,25,4,1.0,6.0,1.0,1.0,,1.0,66.173231
1019,2024,4,12,2,0,22,2,2.0,4.0,1.0,1.0,,,74.279937
1020,2024,4,12,2,1,22,4,2.0,2.0,2.0,,,,82.500457
1022,2024,4,12,2,1,18,4,2.0,4.0,2.0,,,,77.519537


### 4. Identify NEET Population and Calculate Totals

In this step, we apply the NEET definition to filter individuals who are not studying and not working. Then, we use the sample weights to estimate the actual population and print the results.

The column contains the sample weights, or expansion factors. Since the PNAD survey is based on a statistical sample, not the entire population, each record is assigned a weight that indicates how many people it represents in the total population. By summing these weights for our filtered group (the NEETs), we are extrapolating from the sample to get a statistically valid estimate of the total number of NEETs nationwide. Simply counting the rows would only give us the number of NEETs in the sample itself, not the population estimate we need.

In [34]:
print("Identifying the NEET population...")
# A person is NEET if they are NOT studying (school_code == 2) AND they are NOT occupied (occupation_code != 1).
df_neet = data_scope[
    (data_scope['school_code'] == 2) &
    (data_scope['occupation_code'] != 1)
].copy()

print("Calculating population estimates using weights...")
# Total youth population from the age-and-period-filtered dataframe
total_youth = data_scope['weight'].sum()

# Total NEET population from the NEET-filtered dataframe
total_neet = df_neet['weight'].sum()

# NEET population broken down by gender, using your new 'female' column (0=Male, 1=Female)
male_neet = df_neet[df_neet['female'] == 0]['weight'].sum()
female_neet = df_neet[df_neet['female'] == 1]['weight'].sum()

# Calculate the NEET rate as a percentage
neet_percentage = (total_neet / total_youth) * 100 if total_youth > 0 else 0

# Print the final results
print("\n--- Analysis Results for Q4 2024 (15-24 years old) ---")
print(f"Estimated Total Youth Population: {total_youth:,.0f}")
print(f"Estimated Total NEET Population: {total_neet:,.0f}")
print(f"NEET Rate: {neet_percentage:.1f}% of the youth population")
print("-" * 50)
print(f"Breakdown of NEET Population:")
print(f"  - Men: {male_neet:,.0f}")
print(f"  - Women: {female_neet:,.0f}")

Identifying the NEET population...
Calculating population estimates using weights...

--- Analysis Results for Q4 2024 (15-24 years old) ---
Estimated Total Youth Population: 36,802,649
Estimated Total NEET Population: 6,538,266
NEET Rate: 17.8% of the youth population
--------------------------------------------------
Breakdown of NEET Population:
  - Men: 2,419,931
  - Women: 4,118,335
