# **SDG Prediction**

## **Dependencies**

In [1]:
from transformers import pipeline
import pandas as pd
from tqdm import tqdm

## **SDG Classifier**

### Load Model

Model predicts first 15 sdg

https://huggingface.co/jonas/sdg_classifier_osdg

In [2]:
pipe = pipeline("text-classification", model="jonas/bert-base-uncased-finetuned-sdg")

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/348 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

### Load CSV

In [3]:
df = pd.read_csv("../../../src/merged_orgas.csv")
df.head(1)

Unnamed: 0,iati_id,iati_orga_id,orga_abbreviation,orga_full_name,client,title_en,title_other,title_main,organization,country_code_list,...,actual_end,last_update,crs_5_code,crs_5_name,crs_3_code,crs_3_name,docs,title_and_description,sgd_pred_code,sgd_pred_str
0,DE-1-201822287-0,DE-1,bmz,Bundesministerium für wirtschaftliche Zusammen...,BMZ,Strengthening quality infrastructure for trade...,Stärkung der Qualitätsinfrastruktur für den Ha...,Strengthening quality infrastructure for trade...,Bundesministerium für wirtschaftliche Zusammen...,,...,2016-03-14T00:00:00Z,2024-02-29T00:00:00Z,33130;,Regional trade agreements (RTAs);,331;,Trade Policies & Regulations;,,Strengthening quality infrastructure for trade...,9,"8 9. Build resilient infrastructure, promot..."


In [4]:
df.columns

Index(['iati_id', 'iati_orga_id', 'orga_abbreviation', 'orga_full_name',
       'client', 'title_en', 'title_other', 'title_main', 'organization',
       'country_code_list', 'country', 'country_name', 'country_flag',
       'region', 'location', 'description_en', 'description_other',
       'description_main', 'status', 'planned_start', 'actual_start',
       'planned_end', 'actual_end', 'last_update', 'crs_5_code', 'crs_5_name',
       'crs_3_code', 'crs_3_name', 'docs', 'title_and_description',
       'sgd_pred_code', 'sgd_pred_str'],
      dtype='object')

### Load SDG CSV

In [6]:
sdg_df = pd.read_csv("../../../src/codelists/sdg_goals.csv")
sdg_df.head(16)

Unnamed: 0,code,name,description,language,category,category-name,category-description
0,1,1. End poverty in all its forms everywhere,,en,,,
1,2,"2. End hunger, achieve food security and impro...",,en,,,
2,3,3. Ensure healthy lives and promote well-being...,,en,,,
3,4,4. Ensure inclusive and equitable quality educ...,,en,,,
4,5,5. Achieve gender equality and empower all wom...,,en,,,
5,6,6. Ensure availability and sustainable managem...,,en,,,
6,7,"7. Ensure access to affordable, reliable, sust...",,en,,,
7,8,"8. Promote sustained, inclusive and sustainabl...",,en,,,
8,9,"9. Build resilient infrastructure, promote inc...",,en,,,
9,10,10. Reduce inequality within and among countries,,en,,,


### Apply Model

In [8]:
# Define sdg columns 
df["sgd_pred_code"] = "NaN"
df["sgd_pred_str"] = "NaN"

len_df = len(df)

for index, row in tqdm(df.iterrows(), total=len_df, desc="Processing"):
    if index % 500 == 0:
        print(f" Debugger: {index} / {len_df}")
    descr_row = row['description_main']
    try:
        # nan in pandas is type float
        # check if nan 
            if isinstance(descr_row, float):
                df["sgd_pred_code"][index] = "NaN"
                df["sgd_pred_str"][index] = "NaN"
            else:
                if len(descr_row) > 512:
                    descr_row = descr_row[:512]
                # use clf with description and predict sgd 
                pred = pipe(descr_row)
                pred_str = pred[0]["label"]
                pred_int = int(pred_str)

                # Map sgd codes to names
                sdg_translation = sdg_df.loc[sdg_df['code'] == pred_int, 'name']

                df["sgd_pred_code"][index] = pred_int
                df["sgd_pred_str"][index] = sdg_translation
    except Exception as e:
        print(f"Error {e}: {descr_row}")

df.head()

Processing:   0%|          | 1/27397 [00:00<1:20:00,  5.71it/s]

 Debugger: 0 / 27397


Processing:   2%|▏         | 499/27397 [01:04<43:24, 10.33it/s]  

 Debugger: 500 / 27397


Processing:   4%|▎         | 1002/27397 [02:16<44:11,  9.95it/s] 

 Debugger: 1000 / 27397


Processing:   5%|▌         | 1500/27397 [03:44<1:14:52,  5.76it/s]

 Debugger: 1500 / 27397


Processing:   7%|▋         | 2001/27397 [04:56<36:23, 11.63it/s]  

 Debugger: 2000 / 27397


Processing:   9%|▉         | 2500/27397 [06:10<1:03:05,  6.58it/s]

 Debugger: 2500 / 27397


Processing:  11%|█         | 3001/27397 [07:38<50:31,  8.05it/s]  

 Debugger: 3000 / 27397


Processing:  13%|█▎        | 3502/27397 [09:12<52:10,  7.63it/s]  

 Debugger: 3500 / 27397


Processing:  15%|█▍        | 4001/27397 [10:38<56:30,  6.90it/s]  

 Debugger: 4000 / 27397


Processing:  16%|█▋        | 4501/27397 [12:01<1:04:13,  5.94it/s]

 Debugger: 4500 / 27397


Processing:  18%|█▊        | 5001/27397 [13:22<1:00:23,  6.18it/s]

 Debugger: 5000 / 27397


Processing:  20%|██        | 5502/27397 [14:40<30:24, 12.00it/s]  

 Debugger: 5500 / 27397


Processing:  22%|██▏       | 6001/27397 [15:37<38:04,  9.37it/s]  

 Debugger: 6000 / 27397


Processing:  24%|██▎       | 6501/27397 [16:40<40:01,  8.70it/s]  

 Debugger: 6500 / 27397


Processing:  26%|██▌       | 7001/27397 [17:54<41:32,  8.18it/s]  

 Debugger: 7000 / 27397


Processing:  27%|██▋       | 7501/27397 [19:16<42:49,  7.74it/s]  

 Debugger: 7500 / 27397


Processing:  29%|██▉       | 8000/27397 [20:41<59:33,  5.43it/s]  

 Debugger: 8000 / 27397


Processing:  31%|███       | 8501/27397 [21:56<40:07,  7.85it/s]  

 Debugger: 8500 / 27397


Processing:  33%|███▎      | 9000/27397 [23:10<35:40,  8.60it/s]  

 Debugger: 9000 / 27397


Processing:  35%|███▍      | 9502/27397 [24:09<32:24,  9.20it/s]  

 Debugger: 9500 / 27397


Processing:  37%|███▋      | 10001/27397 [25:13<41:38,  6.96it/s] 

 Debugger: 10000 / 27397


Processing:  38%|███▊      | 10501/27397 [26:06<29:32,  9.53it/s]  

 Debugger: 10500 / 27397


Processing:  40%|████      | 11002/27397 [27:06<31:10,  8.76it/s]  

 Debugger: 11000 / 27397


Processing:  42%|████▏     | 11500/27397 [28:05<46:44,  5.67it/s]  

 Debugger: 11500 / 27397


Processing:  44%|████▍     | 12002/27397 [29:02<24:18, 10.56it/s]

 Debugger: 12000 / 27397


Processing:  46%|████▌     | 12503/27397 [30:08<23:52, 10.40it/s]  

 Debugger: 12500 / 27397


Processing:  47%|████▋     | 13001/27397 [31:02<47:42,  5.03it/s]

 Debugger: 13000 / 27397


Processing:  49%|████▉     | 13501/27397 [32:10<18:44, 12.36it/s]  

 Debugger: 13500 / 27397


Processing:  51%|█████     | 14001/27397 [33:15<38:51,  5.75it/s]

 Debugger: 14000 / 27397


Processing:  53%|█████▎    | 14500/27397 [34:45<54:26,  3.95it/s]  

 Debugger: 14500 / 27397


Processing:  55%|█████▍    | 15000/27397 [36:25<38:54,  5.31it/s]  

 Debugger: 15000 / 27397


Processing:  57%|█████▋    | 15500/27397 [38:04<52:02,  3.81it/s]  

 Debugger: 15500 / 27397


Processing:  58%|█████▊    | 16000/27397 [40:03<46:21,  4.10it/s]

 Debugger: 16000 / 27397


Processing:  60%|██████    | 16500/27397 [42:09<42:28,  4.28it/s]  

 Debugger: 16500 / 27397


Processing:  62%|██████▏   | 17000/27397 [44:14<37:20,  4.64it/s]  

 Debugger: 17000 / 27397


Processing:  64%|██████▍   | 17501/27397 [46:15<30:05,  5.48it/s]  

 Debugger: 17500 / 27397


Processing:  66%|██████▌   | 18000/27397 [48:20<40:58,  3.82it/s]

 Debugger: 18000 / 27397


Processing:  68%|██████▊   | 18500/27397 [51:02<1:16:33,  1.94it/s]

 Debugger: 18500 / 27397


Processing:  69%|██████▉   | 19000/27397 [55:35<39:49,  3.51it/s]  

 Debugger: 19000 / 27397


Processing:  71%|███████   | 19501/27397 [57:38<24:29,  5.37it/s]  

 Debugger: 19500 / 27397


Processing:  73%|███████▎  | 20000/27397 [59:19<26:56,  4.58it/s]

 Debugger: 20000 / 27397


Processing:  75%|███████▍  | 20502/27397 [1:01:19<12:36,  9.12it/s]

 Debugger: 20500 / 27397


Processing:  77%|███████▋  | 21001/27397 [1:02:40<20:01,  5.32it/s]

 Debugger: 21000 / 27397


Processing:  78%|███████▊  | 21500/27397 [1:04:16<20:14,  4.85it/s]  

 Debugger: 21500 / 27397


Processing:  80%|████████  | 22001/27397 [1:05:47<25:33,  3.52it/s]

 Debugger: 22000 / 27397


Processing:  82%|████████▏ | 22500/27397 [1:07:36<24:42,  3.30it/s]

 Debugger: 22500 / 27397


Processing:  84%|████████▍ | 23000/27397 [1:08:56<08:58,  8.17it/s]

 Debugger: 23000 / 27397


Processing:  86%|████████▌ | 23501/27397 [1:10:22<11:51,  5.48it/s]

 Debugger: 23500 / 27397


Processing:  88%|████████▊ | 24000/27397 [1:11:59<10:34,  5.36it/s]

 Debugger: 24000 / 27397


Processing:  89%|████████▉ | 24501/27397 [1:13:39<08:31,  5.66it/s]

 Debugger: 24500 / 27397


Processing:  91%|█████████▏| 25001/27397 [1:15:32<07:23,  5.41it/s]

 Debugger: 25000 / 27397


Processing:  93%|█████████▎| 25500/27397 [1:16:54<04:54,  6.44it/s]

 Debugger: 25500 / 27397


Processing:  95%|█████████▍| 26001/27397 [1:18:30<03:33,  6.52it/s]

 Debugger: 26000 / 27397


Processing:  97%|█████████▋| 26500/27397 [1:19:43<01:54,  7.85it/s]

 Debugger: 26500 / 27397


Processing:  99%|█████████▊| 27000/27397 [1:21:01<01:15,  5.26it/s]

 Debugger: 27000 / 27397


Processing: 100%|██████████| 27397/27397 [1:22:03<00:00,  5.56it/s]


Unnamed: 0,iati_id,iati_orga_id,orga_abbreviation,orga_full_name,client,title_en,title_other,title_main,organization,country_code_list,...,actual_end,last_update,crs_5_code,crs_5_name,crs_3_code,crs_3_name,docs,title_and_description,sgd_pred_code,sgd_pred_str
0,DE-1-201822287-0,DE-1,bmz,Bundesministerium für wirtschaftliche Zusammen...,BMZ,Strengthening quality infrastructure for trade...,Stärkung der Qualitätsinfrastruktur für den Ha...,Strengthening quality infrastructure for trade...,Bundesministerium für wirtschaftliche Zusammen...,,...,2016-03-14T00:00:00Z,2024-02-29T00:00:00Z,33130;,Regional trade agreements (RTAs);,331;,Trade Policies & Regulations;,,Strengthening quality infrastructure for trade...,9,"8 9. Build resilient infrastructure, promot..."
1,DE-1-201920016-0,DE-1,bmz,Bundesministerium für wirtschaftliche Zusammen...,BMZ,Strengthening of Metrology for the Improvement...,Stärkung des Messwesens in Ägypten zur Verbess...,Strengthening of Metrology for the Improvement...,Bundesministerium für wirtschaftliche Zusammen...,['AG'],...,2016-03-14T00:00:00Z,2024-02-29T00:00:00Z,14010;,Water sector policy and administrative managem...,140;,Water Supply & Sanitation;,,Strengthening of Metrology for the Improvement...,9,"8 9. Build resilient infrastructure, promot..."
2,DE-1-201721877-0,DE-1,bmz,Bundesministerium für wirtschaftliche Zusammen...,BMZ,Strengthening regional integration and coopera...,Stärkung der regionalen Integration und Zusamm...,Strengthening regional integration and coopera...,Bundesministerium für wirtschaftliche Zusammen...,,...,2016-03-14T00:00:00Z,2024-02-29T00:00:00Z,33130;,Regional trade agreements (RTAs);,331;,Trade Policies & Regulations;,,Strengthening regional integration and coopera...,9,"8 9. Build resilient infrastructure, promot..."
3,DE-1-201276351-0,DE-1,bmz,Bundesministerium für wirtschaftliche Zusammen...,BMZ,Strengthening Non-Violent Popular Movements in...,Kapazitätsentwicklung für gewaltfreie Basisbew...,Strengthening Non-Violent Popular Movements in...,Bundesministerium für wirtschaftliche Zusammen...,['VU'],...,2016-03-14T00:00:00Z,2024-03-20T00:00:00Z,15160;,Human rights;,151;,Government & Civil Society-general;,,Strengthening Non-Violent Popular Movements in...,16,15 16. Promote peaceful and inclusive socie...
4,DE-1-201676584-0,DE-1,bmz,Bundesministerium für wirtschaftliche Zusammen...,BMZ,Rebuilding Further Arts after cyclone,Wiederaufbau von Further Arts nach Wirbelsturm,Rebuilding Further Arts after cyclone,Bundesministerium für wirtschaftliche Zusammen...,['VU'],...,2016-03-14T00:00:00Z,2024-03-20T00:00:00Z,73010;,Immediate post-emergency reconstruction and re...,730;,Reconstruction Relief & Rehabilitation;,,Rebuilding Further Arts after cyclone. Rebuild...,11,10 11. Make cities and human settlements in...


### With Batch processing

>> Not faster!

In [None]:
df["sgd_pred_code"] = "NaN"
df["sgd_pred_str"] = "NaN"

batch_size = 8
n_batches = len(df) // batch_size + (len(df) % batch_size > 0)

for batch_n in tqdm(range(n_batches), desc="Processing batches"):
    batch_start = batch_n * batch_size
    batch_end = (batch_n + 1) * batch_size
    df_batch = df.iloc[batch_start:batch_end]

    for index, row in df_batch.iterrows():
        descr_row = row['description_main']
        try:
            if isinstance(descr_row, float):  
                continue  
            else:
                if len(descr_row) > 512:
                    descr_row = descr_row[:512]
                pred = pipe(descr_row)  
                pred_str = pred[0]["label"]
                pred_int = int(pred_str)
                
                sdg_translation = sdg_df.loc[sdg_df['code'] == pred_int, 'name'].values[0] if not sdg_df.loc[sdg_df['code'] == pred_int, 'name'].empty else "NaN"

                df.loc[index, "sgd_pred_code"] = pred_int
                df.loc[index, "sgd_pred_str"] = sdg_translation
        except Exception as e:
            print(f"Error at index {index}: {e}")

    if (batch_n + 1) % 1 == 0 or batch_n == n_batches - 1:
        tqdm.write(f"Processed batch {batch_n + 1}/{n_batches}")

In [11]:
df["sgd_pred_code"].value_counts()

sgd_pred_code
8      3638
2      3071
11     2981
9      2313
16     2163
3      2134
4      2126
7      1588
1      1547
6      1542
13     1398
5      1344
15      822
12      287
14      214
10      144
NaN      85
Name: count, dtype: int64