# Course 4 - Cleaning the WDI Data

1. Profiling: check types, missing values, and suspicious values
2. Standardization: create year, keep the useful columns
3. Missing values: keep NaN; we do not impute WDI indicators
4. Cleaning: remove pseudo-countries and unneeded rows
5. Transformations: normalize indicators, pivot to an analysis-ready format
6. Validation: check unique keys, positive values, and the final structure

In [1]:
import pandas as pd
import numpy as np

### Load the data

In [3]:
wdi = pd.read_csv("../data/wdi_data.csv")
print(wdi.shape)
print(wdi.columns)
wdi.head(3)

(224770, 11)
Index(['countryiso3code', 'date', 'value', 'unit', 'obs_status', 'decimal',
       'indicator.id', 'indicator.value', 'country.id', 'country.value',
       'indicator_name'],
      dtype='object')


Unnamed: 0,countryiso3code,date,value,unit,obs_status,decimal,indicator.id,indicator.value,country.id,country.value,indicator_name
0,AFE,2024,1567.635839,,,1,NY.GDP.PCAP.CD,GDP per capita (current US$),ZH,Africa Eastern and Southern,gdp_per_capita
1,AFE,2023,1510.742951,,,1,NY.GDP.PCAP.CD,GDP per capita (current US$),ZH,Africa Eastern and Southern,gdp_per_capita
2,AFE,2022,1628.318944,,,1,NY.GDP.PCAP.CD,GDP per capita (current US$),ZH,Africa Eastern and Southern,gdp_per_capita


# Step 1. Profiling and understanding the data

Profiling answers three questions:
- What data types do I have?
- How clean are they?
- What problems can I anticipate before cleaning?

What I check about data types (illustrated on WDI):
- Overall structure: df.info()
- Detected types: df.dtypes
- Mixed types within the same column: apply(type)
- Potentially numeric columns: to_numeric()
- Potentially categorical columns: nunique()
- Validating types against a schema
- Detecting unparseable values (dates)
- select_dtypes() for classifying columns
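
Several items on this checklist (to_numeric(), nunique(), schema validation, select_dtypes()) are not run in the cells that follow. A minimal sketch of how they might look, on a made-up toy frame rather than the WDI file; the column names and the `schema` dictionary are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical toy frame standing in for a freshly loaded CSV.
df = pd.DataFrame({
    "code": ["ROU", "FRA", "ROU", "DEU"],
    "year": ["2020", "2021", "20x2", "2023"],  # numbers stored as text, one bad entry
    "value": [1.5, 2.0, None, 4.0],
})

# to_numeric(errors="coerce") turns unparseable entries into NaN, so entries
# that are NaN after conversion but present before it are the bad ones.
as_num = pd.to_numeric(df["year"], errors="coerce")
bad_year = df.loc[as_num.isna() & df["year"].notna(), "year"].tolist()

# Few distinct values relative to the row count hints at a categorical column.
unique_ratio = df["code"].nunique() / len(df)

# A minimal "schema": one dtype check per column.
schema = {
    "code": pd.api.types.is_string_dtype,
    "year": pd.api.types.is_integer_dtype,
    "value": pd.api.types.is_float_dtype,
}
mismatches = [col for col, check in schema.items() if not check(df[col])]

# select_dtypes splits the columns into numeric and non-numeric groups.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
text_cols = df.select_dtypes(exclude="number").columns.tolist()

print(bad_year, unique_ratio, mismatches, numeric_cols, text_cols)
```

Here `year` fails the schema check because it was read as text, which is the kind of problem profiling is meant to surface before any cleaning starts.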

In [12]:
wdi.head(3)  # show the first 3 rows

Unnamed: 0,countryiso3code,date,value,unit,obs_status,decimal,indicator.id,indicator.value,country.id,country.value,indicator_name
0,AFE,2024,1567.635839,,,1,NY.GDP.PCAP.CD,GDP per capita (current US$),ZH,Africa Eastern and Southern,gdp_per_capita
1,AFE,2023,1510.742951,,,1,NY.GDP.PCAP.CD,GDP per capita (current US$),ZH,Africa Eastern and Southern,gdp_per_capita
2,AFE,2022,1628.318944,,,1,NY.GDP.PCAP.CD,GDP per capita (current US$),ZH,Africa Eastern and Southern,gdp_per_capita


In [13]:
wdi.shape  # dataset dimensions (rows, columns)

(224770, 11)

In [14]:
wdi.columns  # list the columns in the dataset

Index(['countryiso3code', 'date', 'value', 'unit', 'obs_status', 'decimal',
       'indicator.id', 'indicator.value', 'country.id', 'country.value',
       'indicator_name'],
      dtype='object')

In [4]:
wdi.info()  # overall structure, dtypes, and non-null counts


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 224770 entries, 0 to 224769
Data columns (total 11 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   countryiso3code  220545 non-null  object 
 1   date             224770 non-null  int64  
 2   value            123660 non-null  float64
 3   unit             0 non-null       float64
 4   obs_status       0 non-null       float64
 5   decimal          224770 non-null  int64  
 6   indicator.id     224770 non-null  object 
 7   indicator.value  224770 non-null  object 
 8   country.id       223925 non-null  object 
 9   country.value    224770 non-null  object 
 10  indicator_name   224770 non-null  object 
dtypes: float64(3), int64(2), object(6)
memory usage: 18.9+ MB


In [5]:
wdi.dtypes  # dtypes detected by pandas


countryiso3code     object
date                 int64
value              float64
unit               float64
obs_status         float64
decimal              int64
indicator.id        object
indicator.value     object
country.id          object
country.value       object
indicator_name      object
dtype: object

In [7]:
wdi['country.value'].apply(type).value_counts().head()  # check whether the column holds mixed Python types


country.value
<class 'str'>    224770
Name: count, dtype: int64

In [17]:
wdi.describe()  # summary statistics for the numeric columns (mean, min, max, etc.)


Unnamed: 0,date,value,unit,obs_status,decimal
count,224770.0,123660.0,0.0,0.0,224770.0
mean,1992.0,30347610.0,,,0.384615
std,18.761705,276821900.0,,,0.624927
min,1960.0,-2.590877,,,0.0
25%,1976.0,10.0,,,0.0
50%,1992.0,67.911,,,0.0
75%,2008.0,847.2674,,,1.0
max,2024.0,8142056000.0,,,2.0


In [18]:
wdi.describe(include='object')  # summary for the text columns (count, unique, top)


Unnamed: 0,countryiso3code,indicator.id,indicator.value,country.id,country.value,indicator_name
count,220545,224770,224770,223925,224770,224770
unique,261,13,13,265,266,13
top,AFE,NY.GDP.PCAP.CD,GDP per capita (current US$),ZH,Africa Eastern and Southern,gdp_per_capita
freq,845,17290,17290,845,845,17290


In [19]:
wdi.isna().mean().sort_values(ascending=False)  # share of missing values per column


unit               1.000000
obs_status         1.000000
value              0.449838
countryiso3code    0.018797
country.id         0.003759
date               0.000000
decimal            0.000000
indicator.id       0.000000
indicator.value    0.000000
country.value      0.000000
indicator_name     0.000000
dtype: float64

In [20]:
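# check that every non-missing code has exactly 3 characters
# (aggregates / pseudo-countries also use 3-letter codes, so they are caught later against an ISO list)
wdi['countryiso3code'].str.len().value_counts()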


countryiso3code
3.0    220545
Name: count, dtype: int64

In [22]:
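# count values that cannot be parsed as a year
# (to_datetime on plain integers would read them as nanoseconds since 1970, so we parse the text form with format='%Y')
pd.to_datetime(wdi['date'].astype(str), format='%Y', errors='coerce').isna().sum()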


np.int64(0)

### Value counts

In [24]:
wdi['indicator_name'].value_counts()  # list the indicators present


indicator_name
gdp_per_capita          17290
population_total        17290
life_expectancy         17290
urbanization_rate       17290
energy_use_pc           17290
renewables_share        17290
electricity_access      17290
gov_effectiveness       17290
reg_quality             17290
rule_of_law             17290
control_corruption      17290
gini                    17290
mobile_subscriptions    17290
Name: count, dtype: int64

In [27]:
wdi['obs_status'].value_counts()  # confirm the column carries no information


Series([], Name: count, dtype: int64)

In [26]:
wdi['unit'].value_counts(dropna=False)  # shows the column is entirely NaN


unit
NaN    224770
Name: count, dtype: int64

# Step 2. Standardization

- convert to the correct types
- align formats
- clean and make the columns uniform
- choose the table's standard structure

In [29]:
print(wdi.columns)
wdi.head(3)

Index(['countryiso3code', 'date', 'value', 'unit', 'obs_status', 'decimal',
       'indicator.id', 'indicator.value', 'country.id', 'country.value',
       'indicator_name'],
      dtype='object')


Unnamed: 0,countryiso3code,date,value,unit,obs_status,decimal,indicator.id,indicator.value,country.id,country.value,indicator_name
0,AFE,2024,1567.635839,,,1,NY.GDP.PCAP.CD,GDP per capita (current US$),ZH,Africa Eastern and Southern,gdp_per_capita
1,AFE,2023,1510.742951,,,1,NY.GDP.PCAP.CD,GDP per capita (current US$),ZH,Africa Eastern and Southern,gdp_per_capita
2,AFE,2022,1628.318944,,,1,NY.GDP.PCAP.CD,GDP per capita (current US$),ZH,Africa Eastern and Southern,gdp_per_capita


#### Change the data types

In [30]:
wdi['year'] = wdi['date'].astype(int)  # create the year column as an integer


In [31]:
wdi['value'] = pd.to_numeric(wdi['value'], errors='coerce')  # guarantee a numeric dtype


#### Keep only the columns we need
We know that:
- unit: 0 non-null values
- obs_status: 0 non-null values
- decimal: not used in the analysis
- indicator.value: overlaps with indicator_name, but kept as the full human-readable label
- country.id: redundant with countryiso3code
- country.value: only the country name (not strictly needed)

In [None]:
# Keep only the columns we need
cols_keep = [
    'countryiso3code',   # ISO3 country code
    'year',              # numeric year
    'indicator.id',      # indicator code
    'indicator_name',    # short, cleaned indicator name
    'indicator.value',   # original full indicator name
    'value'              # numeric value
]

wdi_std = wdi[cols_keep].copy()


#### Methods for cleaning text


In [None]:
# Standardize indicator_name

wdi_std['indicator_name'] = wdi_std['indicator_name'].str.strip().str.lower().str.replace(' ', '_').str.replace(r'[^0-9a-zA-Z_]', '', regex=True)
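
The indicator_name values in WDI are already clean, so the chain above changes nothing visible. A small sketch on made-up messy labels shows what each step does. It is a variation, not the cell above: it uses `\s+` instead of a literal space (an assumption made here so that repeated spaces collapse into one underscore).

```python
import pandas as pd

# Hypothetical messy labels, invented for illustration.
raw = pd.Series([" GDP per capita (current US$) ", "Life  Expectancy", "urban-pop %"])

clean = (
    raw.str.strip()                                 # drop leading/trailing spaces
       .str.lower()                                 # lowercase everything
       .str.replace(r"\s+", "_", regex=True)        # each run of whitespace -> one underscore
       .str.replace(r"[^0-9a-z_]", "", regex=True)  # drop every other character
)
print(clean.tolist())
# ['gdp_per_capita_current_us', 'life_expectancy', 'urbanpop_']
```

The last label shows a side effect worth checking by eye: removing punctuation can leave a trailing underscore or merge words.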


### Rename the columns

In [43]:
wdi_std = wdi_std.rename(columns={
    'countryiso3code': 'country_iso3',
    'indicator.id': 'indicator_code',
    'indicator.value': 'indicator_label'
})

#### Validation

In [44]:
wdi_std.head(3)

Unnamed: 0,country_iso3,year,indicator_code,indicator_name,indicator_label,value
0,AFE,2024,NY.GDP.PCAP.CD,gdp_per_capita,GDP per capita (current US$),1567.635839
1,AFE,2023,NY.GDP.PCAP.CD,gdp_per_capita,GDP per capita (current US$),1510.742951
2,AFE,2022,NY.GDP.PCAP.CD,gdp_per_capita,GDP per capita (current US$),1628.318944


# Step 3. Missing values analysis

In [49]:
wdi_std.isna().sum()  # number of NaN values per column


country_iso3         4225
year                    0
indicator_code          0
indicator_name          0
indicator_label         0
value              101110
dtype: int64

### Share of missing values per column

In [45]:
wdi_std.isna().mean().sort_values(ascending=False)  # share of missing values per column


value              0.449838
country_iso3       0.018797
year               0.000000
indicator_code     0.000000
indicator_name     0.000000
indicator_label    0.000000
dtype: float64

### Missing values by indicator

In [57]:
# use grouping to see the share of missing values per indicator
(wdi_std['value'].isna()
        .groupby(wdi_std['indicator_code'])
        .mean()
        .sort_values())


indicator_code
SP.POP.TOTL          0.005495
SP.URB.TOTL.IN.ZS    0.011278
SP.DYN.LE00.IN       0.021053
NY.GDP.PCAP.CD       0.158994
IT.CEL.SETS.P2       0.257201
EG.FEC.RNEW.ZS       0.523771
EG.ELC.ACCS.ZS       0.545460
EG.USE.PCAP.KG.OE    0.620069
RL.EST               0.706304
CC.EST               0.711510
RQ.EST               0.712782
GE.EST               0.712898
SI.POV.GINI          0.861076
Name: value, dtype: float64

In [58]:
# the same analysis, broken into steps for clarity
missing_flag = wdi_std['value'].isna()
grouped = missing_flag.groupby(wdi_std['indicator_code'])
missing_rate = grouped.mean()
missing_rate_sorted = missing_rate.sort_values()
missing_rate_sorted


indicator_code
SP.POP.TOTL          0.005495
SP.URB.TOTL.IN.ZS    0.011278
SP.DYN.LE00.IN       0.021053
NY.GDP.PCAP.CD       0.158994
IT.CEL.SETS.P2       0.257201
EG.FEC.RNEW.ZS       0.523771
EG.ELC.ACCS.ZS       0.545460
EG.USE.PCAP.KG.OE    0.620069
RL.EST               0.706304
CC.EST               0.711510
RQ.EST               0.712782
GE.EST               0.712898
SI.POV.GINI          0.861076
Name: value, dtype: float64

#### Using apply and lambda



In [60]:
# Dataset → group by indicator → series of values → True/False → mean → share missing → sort → top 10

wdi_std.groupby('indicator_code')['value'].apply(lambda x: x.isna().mean()).sort_values(ascending=False).head(10)  # the 10 indicators with the most missing values


indicator_code
SI.POV.GINI          0.861076
GE.EST               0.712898
RQ.EST               0.712782
CC.EST               0.711510
RL.EST               0.706304
EG.USE.PCAP.KG.OE    0.620069
EG.ELC.ACCS.ZS       0.545460
EG.FEC.RNEW.ZS       0.523771
IT.CEL.SETS.P2       0.257201
NY.GDP.PCAP.CD       0.158994
Name: value, dtype: float64

In [61]:
# the same thing, using a named function with apply
def missing_rate(series):
    return series.isna().mean()  # share of NaN values
wdi_std.groupby('indicator_code')['value'].apply(missing_rate).sort_values(ascending=False).head(10)  # the 10 indicators with the most missing values

indicator_code
SI.POV.GINI          0.861076
GE.EST               0.712898
RQ.EST               0.712782
CC.EST               0.711510
RL.EST               0.706304
EG.USE.PCAP.KG.OE    0.620069
EG.ELC.ACCS.ZS       0.545460
EG.FEC.RNEW.ZS       0.523771
IT.CEL.SETS.P2       0.257201
NY.GDP.PCAP.CD       0.158994
Name: value, dtype: float64
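
The three approaches used so far (grouped boolean mask, apply with a lambda, apply with a named function) compute the same number. A minimal sketch on invented data, which also adds a count()/size() variant as a cross-check; the frame and its values are assumptions for illustration:

```python
import pandas as pd

# Hypothetical long-format sample: two indicators, some missing values.
df = pd.DataFrame({
    "indicator_code": ["A", "A", "A", "A", "B", "B"],
    "value": [1.0, None, 3.0, 4.0, 5.0, None],
})

# 1) Boolean mask grouped by the key column (vectorized).
by_mask = df["value"].isna().groupby(df["indicator_code"]).mean()

# 2) groupby + apply with a lambda (more flexible, usually slower on large data).
by_apply = df.groupby("indicator_code")["value"].apply(lambda s: s.isna().mean())

# 3) 1 - count/size: count() skips NaN, size() counts every row.
g = df.groupby("indicator_code")["value"]
by_count = 1 - g.count() / g.size()

print(by_mask.to_dict())
# {'A': 0.25, 'B': 0.5}
```

All three return the same series, so the choice is mostly about readability; the grouped mask avoids calling a Python function once per group.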

### Missing values by country

In [48]:
wdi_std.groupby('country_iso3')['value'].apply(lambda x: x.isna().mean()).sort_values().head()


country_iso3
USA    0.326627
GBR    0.331361
CAN    0.342012
BRA    0.349112
LUX    0.349112
Name: value, dtype: float64

### Missing values by year

In [65]:
wdi_std.groupby('year')['value'].apply(lambda x: x.isna().mean()).sort_values(ascending=False).head(10)


year
2024    0.775304
1961    0.726721
1964    0.726142
1963    0.726142
1962    0.726142
1966    0.723829
1967    0.722672
1969    0.722094
1968    0.722094
1972    0.715442
Name: value, dtype: float64

### Groupby on multiple columns

In [63]:
wdi_std.groupby(['country_iso3', 'year'])['value'].apply(lambda x: x.isna().mean())


country_iso3  year
ABW           1960    0.692308
              1961    0.769231
              1962    0.769231
              1963    0.769231
              1964    0.769231
                        ...   
ZWE           2020    0.076923
              2021    0.076923
              2022    0.153846
              2023    0.230769
              2024    0.769231
Name: value, Length: 16965, dtype: float64

### Pivot table

In [68]:
wdi_std.pivot_table(values='value', index='country_iso3', columns='year', aggfunc='count')


country_iso3,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024
ABW,4,3,3,3,3,4,3,3,3,3,...,11,10,10,10,11,11,11,11,9,2
AFE,5,4,4,4,4,5,4,4,4,4,...,8,8,8,8,8,8,7,7,6,3
AFG,4,3,3,3,3,4,3,3,3,3,...,11,11,11,11,11,11,11,11,10,2
AFW,5,4,4,4,4,5,4,4,4,4,...,8,8,8,8,8,8,7,7,6,3
AGO,4,3,3,3,3,4,3,3,3,3,...,12,12,12,13,12,12,12,11,10,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
XKX,2,2,2,2,2,2,2,2,2,2,...,8,8,8,8,8,8,8,9,7,2
YEM,4,3,3,3,3,4,3,3,3,3,...,12,12,12,12,11,11,11,10,9,2
ZAF,5,4,4,4,4,5,4,4,4,4,...,12,12,12,12,12,12,12,11,11,3
ZMB,5,4,4,4,4,5,4,4,4,4,...,13,12,12,12,12,12,12,12,10,3
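
In the table above, each cell is the number of non-missing indicator values for that country and year, because aggfunc='count' ignores NaN. A small sketch on invented data makes this concrete and turns the counts into a coverage ratio; the frame and the division by the number of indicators are assumptions added here:

```python
import pandas as pd

# Hypothetical long table: 2 indicators, 2 countries, 2 years, some gaps.
df = pd.DataFrame({
    "country_iso3": ["ROU", "ROU", "ROU", "FRA", "FRA", "FRA"],
    "year": [2020, 2020, 2021, 2020, 2021, 2021],
    "indicator_code": ["A", "B", "A", "A", "A", "B"],
    "value": [1.0, 2.0, 3.0, 4.0, None, 6.0],
})

# aggfunc="count" counts the non-null values in each (country, year) cell.
counts = df.pivot_table(values="value", index="country_iso3",
                        columns="year", aggfunc="count")

# Dividing by the number of indicators gives the share of indicators available.
coverage = counts / df["indicator_code"].nunique()
print(coverage)
```

A coverage ratio is easier to compare across datasets than raw counts, since it does not depend on how many indicators were downloaded.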


# Step 4. Cleaning

In [85]:
wdi_std.columns

Index(['country_iso3', 'year', 'indicator_code', 'indicator_name',
       'indicator_label', 'value'],
      dtype='object')

### Years: filtering

In [None]:
# take another look at the years
missing_by_year = wdi_std.groupby('year')['value'].apply(lambda x: x.isna().mean())  # share of missing values per year

years_over_40 = missing_by_year[missing_by_year > 0.40]  # keep only the years with more than 40% missing
years_over_40

In [82]:
years_keep = list(range(2000, 2024))  # 2000 through 2023 inclusive
# years_keep = missing_by_year[missing_by_year < 0.50].index
wdi_std = wdi_std[wdi_std['year'].isin(years_keep)].copy()
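
The commented-out alternative in the cell above picks the years from the data instead of hard-coding a range. A minimal sketch of that idea on invented data; the 0.50 threshold and the toy values are assumptions:

```python
import pandas as pd

# Hypothetical sample with a different missing share in each year.
df = pd.DataFrame({
    "year":  [1999, 1999, 2000, 2000, 2001, 2001, 2002, 2002],
    "value": [None, None, 1.0, None, 2.0, 3.0, 4.0, 5.0],
})

# Share of missing values per year: 1999 -> 1.0, 2000 -> 0.5, 2001 and 2002 -> 0.0
missing_by_year = df["value"].isna().groupby(df["year"]).mean()

threshold = 0.50  # assumed cut-off; a year is kept only if strictly below it
years_keep = missing_by_year[missing_by_year < threshold].index

filtered = df[df["year"].isin(years_keep)].copy()
print(list(years_keep))
# [2001, 2002]
```

A threshold adapts to the data but can leave gaps in the time series if one middle year is sparse, whereas a fixed range always gives a contiguous panel.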


### Countries

In [None]:
# we can use an external package to get the list of valid ISO codes
#%pip install pycountry
import pycountry 
iso3_list = [country.alpha_3 for country in pycountry.countries]
iso3 = set(iso3_list)
len(iso3_list)  # number of valid ISO3 codes (countries and territories)

249

In [None]:
# the same thing with a for loop instead of a list comprehension
iso3_list = []
for country in pycountry.countries:
    iso3_list.append(country.alpha_3)

In [91]:
codes_wdi = set(wdi_std['country_iso3'])  # unique codes in the dataset
len(codes_wdi)  # number of unique codes in the dataset (NaN counts as one)

262

In [93]:
# the difference between the two sets
invalid_codes = codes_wdi - iso3
print(invalid_codes)

{'NAC', 'TMN', 'EAR', 'TEC', 'CSS', 'OSS', 'LTE', 'CHI', 'LAC', 'AFW', 'SAS', 'MIC', 'EUU', 'FCS', 'MEA', 'IDA', 'XKX', 'PRE', 'WLD', 'EMU', 'EAP', 'ARB', 'ECS', 'TLA', 'LMY', 'SSA', 'PSS', 'PST', 'LCN', 'OED', 'MNA', 'ECA', 'AFE', 'IDX', 'IBD', 'LDC', 'IDB', 'HPC', 'TSA', 'SSF', 'SST', nan, 'TSS', 'CEB', 'TEA', 'EAS', 'IBT'}
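
Note that nan appears in the set above: building a set directly from a column keeps the missing value as an element. A sketch of the same set difference with the missing values dropped first; the three-code reference set is a stand-in for the pycountry list and is an assumption:

```python
import numpy as np
import pandas as pd

valid_iso3 = {"ROU", "FRA", "DEU"}  # hypothetical reference list
codes = pd.Series(["ROU", "FRA", "WLD", np.nan, "EUU", "ROU"])

# Straight set difference: the missing value shows up as an "invalid code".
invalid_with_nan = set(codes) - valid_iso3

# Dropping missing values first leaves only real, unrecognized codes.
invalid_codes = set(codes.dropna()) - valid_iso3
print(sorted(invalid_codes))
# ['EUU', 'WLD']
```

Keeping missing codes separate from unrecognized codes helps later, because the two usually need different handling (drop versus classify).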


In [94]:
other_country_codes = {
    'AFE':'Africa E/S','AFW':'Africa W/C','ARB':'Arab World','CEB':'Central Europe+Baltics','CHI':'Channel Isl',
    'CSS':'Caribbean SS','EAP':'E Asia/Pacific (excl HI)','EAR':'Early-demographic dividend','EAS':'E Asia/Pacific','ECA':'Europe+CA (excl HI)',
    'ECS':'Europe+CA','EMU':'Euro area','EUU':'EU','FCS':'Fragile/Conflict','HPC':'Heavily Indebted Poor',
    'IBD':'IBRD Only','IBT':'IDA+IBRD Total','IDA':'IDA Total','IDB':'IDA Blend','IDX':'IDA Only',
    'LAC':'LatAm/Carib (excl HI)','LCN':'LatAm/Carib','LDC':'Least Dev','LMY':'Low+Mid Income','LTE':'Late-demographic dividend',
    'MEA':'M East+N Afr','MIC':'Middle Inc','MNA':'M East+N Afr (excl HI)','NAC':'North America','OED':'OECD',
    'OSS':'Other Small States','PRE':'Pre-demographic dividend','PSS':'Pacific Island SS','PST':'Post-demographic dividend','SAS':'S Asia',
    'SSA':'Sub-Saharan (excl HI)','SSF':'Sub-Saharan','SST':'Small States','TEA':'E Asia/Pacific (IDA+IBRD)','TEC':'Europe+CA (IDA+IBRD)',
    'TLA':'LatAm/Carib (IDA+IBRD)','TMN':'M East+N Afr (IDA+IBRD)','TSA':'S Asia (IDA+IBRD)','TSS':'Sub-Saharan (IDA+IBRD)','WLD':'World',
    'XKX':'Kosovo', None:'NaN'
}


### a) Removal approach

In [95]:
wdi_only_countries = wdi_std[wdi_std['country_iso3'].isin(iso3)]


### b) Classification approach

In [99]:
pseudo_type = {
    'AFE':'region','AFW':'region','ARB':'region','CEB':'region','CHI':'other',
    'CSS':'region','EAP':'region','EAR':'region','EAS':'region','ECA':'region',
    'ECS':'region','EMU':'bloc','EUU':'bloc','FCS':'income','HPC':'income',
    'IBD':'income','IBT':'income','IDA':'income','IDB':'income','IDX':'income',
    'LAC':'region','LCN':'region','LDC':'income','LMY':'income','LTE':'region',
    'MEA':'region','MIC':'income','MNA':'region','NAC':'region','OED':'bloc',
    'OSS':'region','PRE':'region','PSS':'region','PST':'region','SAS':'region',
    'SSA':'region','SSF':'region','SST':'region','TEA':'bloc','TEC':'bloc',
    'TLA':'bloc','TMN':'bloc','TSA':'bloc','TSS':'bloc','WLD':'world',
    'XKX':'country', None:'unknown'
}

wdi_std['type'] = wdi_std['country_iso3'].map(pseudo_type).fillna('country')
wdi_std.groupby('type')['country_iso3'].nunique()

type
bloc         9
country    216
income      10
other        1
region      24
world        1
Name: country_iso3, dtype: int64
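
One detail of map(...).fillna('country'): any code that is not a key in the dictionary becomes 'country', and that can include rows whose code is missing. The notebook handles this by dropping rows without a code in the next cell. A sketch of the effect and of one possible guard, on invented data with a shortened, hypothetical mapping:

```python
import numpy as np
import pandas as pd

codes = pd.Series(["ROU", "WLD", np.nan, "EUU"])
pseudo_type = {"WLD": "world", "EUU": "bloc"}  # hypothetical, shortened mapping

# Unmapped codes and the missing code both fall through to "country".
naive = codes.map(pseudo_type).fillna("country")

# Guard: overwrite the label with "unknown" wherever the code itself is missing.
labelled = naive.where(codes.notna(), "unknown")

print(naive.tolist())
# ['country', 'world', 'country', 'bloc']
print(labelled.tolist())
# ['country', 'world', 'unknown', 'bloc']
```

Either approach works; what matters is that rows without a code do not silently end up in the country subset.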

In [125]:
wdi_std = wdi_std.dropna(subset=['country_iso3']).copy()
# or
# wdi_std.dropna(subset=['country_iso3'], inplace=True)


### Indicator code

In [126]:
wdi_std['indicator_code'].unique()


array(['NY.GDP.PCAP.CD', 'SP.POP.TOTL', 'SP.DYN.LE00.IN',
       'SP.URB.TOTL.IN.ZS', 'EG.USE.PCAP.KG.OE', 'EG.FEC.RNEW.ZS',
       'EG.ELC.ACCS.ZS', 'GE.EST', 'RQ.EST', 'RL.EST', 'CC.EST',
       'SI.POV.GINI', 'IT.CEL.SETS.P2'], dtype=object)

In [127]:
missing_rate_indicator = (
    wdi_std.groupby('indicator_code')['value']
             .apply(lambda x: x.isna().mean())
)

missing_rate_indicator[missing_rate_indicator > 0.7]


indicator_code
SI.POV.GINI    0.709291
Name: value, dtype: float64

In [128]:
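# .copy() makes wdi_clean an independent frame, so later column assignments do not trigger SettingWithCopyWarning
wdi_clean = wdi_std[wdi_std['indicator_code'] != 'SI.POV.GINI'].copy()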


### Indicator name

In [129]:
wdi_clean['indicator_name'].unique()

array(['gdp_per_capita', 'population_total', 'life_expectancy',
       'urbanization_rate', 'energy_use_pc', 'renewables_share',
       'electricity_access', 'gov_effectiveness', 'reg_quality',
       'rule_of_law', 'control_corruption', 'mobile_subscriptions'],
      dtype=object)

In [130]:
wdi_clean['indicator_name'].isna().sum()


np.int64(0)

### Indicator_label

In [131]:
wdi_clean['indicator_label'].unique()


array(['GDP per capita (current US$)', 'Population, total',
       'Life expectancy at birth, total (years)',
       'Urban population (% of total population)',
       'Energy use (kg of oil equivalent per capita)',
       'Renewable energy consumption (% of total final energy consumption)',
       'Access to electricity (% of population)',
       'Government Effectiveness: Estimate',
       'Regulatory Quality: Estimate', 'Rule of Law: Estimate',
       'Control of Corruption: Estimate',
       'Mobile cellular subscriptions (per 100 people)'], dtype=object)

In [133]:
# Check whether any indicator_label maps to more than one indicator_code
duplicates = (
    wdi_clean.groupby('indicator_label')['indicator_code']
             .nunique()
             .sort_values(ascending=False)
)

duplicates[duplicates > 1]


Series([], Name: indicator_code, dtype: int64)

In [134]:
wdi_clean.loc[:, 'indicator_label'] = (
    wdi_clean['indicator_label']
        .str.strip()
        .str.replace(r'\s+', ' ', regex=True)
)



In [135]:
wdi_clean.columns

Index(['country_iso3', 'year', 'indicator_code', 'indicator_name',
       'indicator_label', 'value', 'type'],
      dtype='object')

### Value

In [136]:
# look at the dtype of the value column
wdi_clean['value'].dtype


dtype('float64')

In [137]:
# If a dataset stores numbers as strings (not the case in WDI, but common elsewhere):
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning that
# direct column assignment raises when wdi_clean is a filtered slice of wdi_std
wdi_clean = wdi_clean.assign(value=pd.to_numeric(wdi_clean['value'], errors='coerce'))


In [138]:
# check the missing values again
wdi_clean['value'].isna().mean()


np.float64(0.12718177948063006)

In [139]:
wdi_clean.groupby('indicator_code')['value'].apply(lambda x: x.isna().mean())


indicator_code
CC.EST               0.262612
EG.ELC.ACCS.ZS       0.011654
EG.FEC.RNEW.ZS       0.101852
EG.USE.PCAP.KG.OE    0.271711
GE.EST               0.265326
IT.CEL.SETS.P2       0.059227
NY.GDP.PCAP.CD       0.028416
RL.EST               0.252395
RQ.EST               0.265326
SP.DYN.LE00.IN       0.000000
SP.POP.TOTL          0.000000
SP.URB.TOTL.IN.ZS    0.007663
Name: value, dtype: float64

In [140]:
# check for negative values
negative_values = wdi_clean[wdi_clean['value'] < 0]
negative_values.indicator_name.unique()


array(['gov_effectiveness', 'reg_quality', 'rule_of_law',
       'control_corruption'], dtype=object)
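
Only the governance indicators come back, and for those negative values are expected, so nothing needs fixing here. The same idea can be generalized into per-indicator plausible ranges. A minimal sketch on invented data; the `rules` table and its bounds are assumptions, not official WDI limits:

```python
import pandas as pd

# Hypothetical long-format sample with a few deliberately implausible values.
df = pd.DataFrame({
    "indicator_name": ["rule_of_law", "rule_of_law", "electricity_access",
                       "electricity_access", "population_total"],
    "value": [-1.2, 3.1, 99.5, 104.0, -5.0],
})

# Assumed plausible range per indicator (lo, hi); inf means no upper bound.
rules = pd.DataFrame(
    {"lo": [-2.5, 0.0, 0.0], "hi": [2.5, 100.0, float("inf")]},
    index=["rule_of_law", "electricity_access", "population_total"],
)

# Look up the bounds for each row's indicator, then flag values outside them.
lo = df["indicator_name"].map(rules["lo"])
hi = df["indicator_name"].map(rules["hi"])
out_of_range = df[(df["value"] < lo) | (df["value"] > hi)]
print(out_of_range)
```

Flagged rows are candidates for inspection rather than automatic deletion, since a value outside the assumed range may still be legitimate.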

## Duplicates

In [141]:
wdi_clean.duplicated(subset=['country_iso3','year','indicator_code']).sum()


np.int64(0)

In [143]:
# drop duplicates (a no-op here, since the check above found none)
wdi_clean = wdi_clean.drop_duplicates(
    subset=['country_iso3', 'year', 'indicator_code'],
    keep='first'
)
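
When duplicates do exist, it helps to look at them before dropping, because keep='first' silently discards the other values. A sketch on invented data:

```python
import pandas as pd

# Hypothetical long table where one key appears twice with different values.
df = pd.DataFrame({
    "country_iso3": ["ROU", "ROU", "FRA", "FRA"],
    "year": [2020, 2020, 2020, 2021],
    "indicator_code": ["A", "A", "A", "A"],
    "value": [1.0, 1.5, 2.0, 3.0],
})
key = ["country_iso3", "year", "indicator_code"]

# keep=False marks every member of a duplicated group, not only the repeats,
# so conflicting values can be compared side by side.
conflicts = df[df.duplicated(subset=key, keep=False)]

# keep="first" retains the first row of each key and drops the rest.
deduped = df.drop_duplicates(subset=key, keep="first")
print(conflicts)
print(deduped["value"].tolist())
# [1.0, 2.0, 3.0]
```

If the conflicting rows carry different values, choosing which one to keep is a data decision, not just a technical one.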


# Step 5. Transformations

Transformations adjust the structure of the data for analysis. In this step we reorganize the information, reshape the table (for example from long to wide), sort it, and prepare the variables so that the final dataset is coherent, easy to use, and compatible with the later modeling and visualization steps.

### Unit conversion

Not needed here.

The selected WDI indicators do not mix units within an indicator:
- GDP per capita: always current US$
- Population: persons
- Life expectancy: years
- Urban population %: percent
- Energy use: kg of oil equivalent per capita
- Renewable energy consumption %: percent
- Access to electricity %: percent
- Worldwide Governance Indicators: estimate on a scale of roughly -2.5 to +2.5
- Mobile subscriptions: per 100 people

In [5]:
wdi.columns

Index(['country_iso3', 'year', 'indicator_code', 'indicator_name',
       'indicator_label', 'value', 'type'],
      dtype='object')

In [None]:
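# keep only the countries
# note: wdi here is the cleaned long table that already has the type column
# (see the columns printed above), not the raw frame loaded at the start
wdi_country = wdi[wdi['type'] == 'country'].copy()
# drop the type column; we no longer need it
wdi_country = wdi_country.drop(columns=['type'])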



### Pivoting

In [None]:
# pivot from long to wide: one row per country-year, one column per indicator
wdi_wide = (
    wdi_clean
        .pivot(
            index=['country_iso3','year'],
            columns='indicator_name',
            values='value'
        )
        .reset_index()
)
wdi_wide.head(2)

indicator_name,country_iso3,year,control_corruption,electricity_access,energy_use_pc,gdp_per_capita,gov_effectiveness,life_expectancy,mobile_subscriptions,population_total,reg_quality,renewables_share,rule_of_law,urbanization_rate
0,ABW,2000,,91.7,,20681.023027,,72.939,16.8993,90588.0,,0.2,,46.717
1,ABW,2001,,100.0,,20740.132583,,73.044,58.69,91439.0,,0.2,,46.339
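
pivot() only works when each (country, year, indicator) combination appears once, which is why the duplicate check comes first. A sketch on invented data showing the long-to-wide reshape, the reverse reshape with melt(), and the error pivot() raises on a duplicated key; all values are made up:

```python
import pandas as pd

# Hypothetical long table: 2 countries, 1 year, 2 indicators.
long_df = pd.DataFrame({
    "country_iso3": ["ROU", "ROU", "FRA", "FRA"],
    "year": [2020, 2020, 2020, 2020],
    "indicator_name": ["gdp_per_capita", "life_expectancy"] * 2,
    "value": [10.0, 70.0, 20.0, 80.0],
})
key = ["country_iso3", "year"]

# Long -> wide: one column per indicator.
wide = long_df.pivot(index=key, columns="indicator_name", values="value").reset_index()
wide.columns.name = None  # drop the leftover "indicator_name" axis label

# Wide -> long: melt is the inverse reshape.
back = wide.melt(id_vars=key, var_name="indicator_name", value_name="value")

# pivot refuses duplicated keys instead of silently picking a value.
dup = pd.concat([long_df, long_df.iloc[[0]]], ignore_index=True)
try:
    dup.pivot(index=key, columns="indicator_name", values="value")
    pivot_failed = False
except ValueError:
    pivot_failed = True

print(wide)
print(pivot_failed)
```

pivot_table() would accept the duplicated key and aggregate it (mean by default), which is convenient but can hide a data problem.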


# Step 6. Final validation

In [16]:
# ============================================================
# FINAL WDI VALIDATION (everything in a single cell)
# ============================================================

# 1. Duplicate country-year pairs
# wdi_country is in long format (one row per country, year, and indicator),
# so this counts the extra indicator rows within each country-year pair
print("Duplicate country-year:", 
      wdi_country.duplicated(subset=['country_iso3','year']).sum())

# 2. Missing values per column
print("\nShare of missing values per column (%):")
print((wdi_country.isna().mean() * 100).round(2))

# 3. Impossible percentage values (>100)
# matches on column names, so it only finds something on a wide table
pct_cols = [c for c in wdi_country.columns 
            if 'percent' in c or '_zs' in c or '_pct' in c]
print("\nValues >100% (percentage indicators):")
for col in pct_cols:
    print(col, (wdi_country[col] > 100).sum())

# 4. Negative values for indicators that cannot be negative
# the names below must exist as columns; none of them do in the long table
cols_no_negative = [
    'population_total',
    'gdp_per_capita_current_usd',
    'mobile_cellular_subscriptions_per_100_people',
    'access_to_electricity_percent_of_population',
]
print("\nNegative values for indicators that cannot be negative:")
for col in cols_no_negative:
    if col in wdi_country.columns:
        print(col, (wdi_country[col] < 0).sum())

# 5. Temporal coverage (number of rows per year)
print("\nNumber of observations per year:")
print(wdi_country.groupby('year').size())

# 6. Number of countries
print("\nNumber of unique countries:", wdi_country['country_iso3'].nunique())

# 7. Final table size
print("\nFinal table size:", wdi_country.shape)


Duplicate country-year: 57024

Share of missing values per column (%):
country_iso3       0.00
year               0.00
indicator_code     0.00
indicator_name     0.00
indicator_label    0.00
value              7.89
dtype: float64

Values >100% (percentage indicators):

Negative values for indicators that cannot be negative:

Number of observations per year:
year
2000    2592
2001    2592
2002    2592
2003    2592
2004    2592
2005    2592
2006    2592
2007    2592
2008    2592
2009    2592
2010    2592
2011    2592
2012    2592
2013    2592
2014    2592
2015    2592
2016    2592
2017    2592
2018    2592
2019    2592
2020    2592
2021    2592
2022    2592
2023    2592
dtype: int64

Number of unique countries: 216

Final table size: (62208, 6)
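
In the output above, checks 3 and 4 print nothing because they look for indicator columns, which exist only in a wide table, and the duplicate count of 57024 reflects the long format (62208 rows minus 216 × 24 distinct country-year pairs). A sketch of the same checks written for a wide table; the `validate_wide` helper, its parameters, and the toy values are all hypothetical:

```python
import pandas as pd

def validate_wide(df, key, pct_cols, non_negative_cols):
    """Return a dict of simple validation counts for a wide table."""
    report = {"duplicate_keys": int(df.duplicated(subset=key).sum())}
    for col in pct_cols:
        # percentages above 100 are impossible
        report[f"{col}_over_100"] = int((df[col] > 100).sum())
    for col in non_negative_cols:
        # these indicators cannot be negative
        report[f"{col}_negative"] = int((df[col] < 0).sum())
    return report

# Hypothetical wide table: one row per (country, year), one column per indicator.
wide = pd.DataFrame({
    "country_iso3": ["ROU", "ROU", "FRA"],
    "year": [2020, 2021, 2020],
    "population_total": [1000.0, 1010.0, 2000.0],
    "electricity_access": [100.0, 100.0, 100.0],
    "rule_of_law": [-0.5, 0.4, 1.3],  # may legitimately be negative, so not checked
})

report = validate_wide(
    wide,
    key=["country_iso3", "year"],
    pct_cols=["electricity_access"],
    non_negative_cols=["population_total", "electricity_access"],
)
print(report)
```

Listing the columns explicitly avoids the silent no-op seen above, where a name pattern that matches nothing produces an empty and misleadingly clean report.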


# Save the data


In [None]:
wdi_clean.to_csv("data/wdi_clean.csv", index=False)
wdi_wide.to_csv("data/wdi_wide.csv", index=False)

In [20]:
wdi.to_csv("../data/wdi_all_data.csv", index = False)
wdi_wide.to_csv("../data/wdi_wide.csv", index =False)
wdi_country.to_csv("../data/wdi_country.csv", index = False)