# Assignment 
**for the 2024 komex Course on Social Data Science with Python, with Prof. David Garcia**

*name:* Cecilia Natalie Strom

*mail:* cecilia-natalie.strom@giga-hamburg.de


------------------------------------------------------------------------------------------------------------------------------------

*This Assignment contains code from my project related work on individual targeted Sanctions at the German Institute for Global and Area Studies. The task is, to extract data on individual targeted Sanctions, specifically Individual Sanctions by the US Government that are found on the Specially Designated Nationals list by the Office of Foreign Asset Controls (OFAC). The Assignment will demonstrate my skills in both accessing the data from an external provider via url as well as extracting information from text files for the year 2023. Step 2 might take some time to load. The Assignment is structured as follows*:

1. Querying of the data from the OpenSantionsDefault Dataset(url:https://www.opensanctions.org/datasets/default/)

2. Descriptive Analysis of the data above

3. Systematic extraction of information from the OFAC sDN file archives (url: https://ofac.treasury.gov/specially-designated-nationals-list-sdn-list/archive-of-changes-to-the-sdn-list)

4. Descriptive Analysis of the data above

5. Comparison and Face Validity Test of both sources


------------------------------------------------------------------------------------------------------------------------------------

In [10]:
#loading of the packages
import pandas as pd
import os
import requests
import datetime
import re
import seaborn as sns

# 1.Dataquery from OpenSanctions.org

In [2]:
#create urls for all relevant dates
date_list = pd.date_range(start='20230101',end='20231231',freq='D').strftime('%Y%m%d')
date_list

Index(['20230101', '20230102', '20230103', '20230104', '20230105', '20230106',
       '20230107', '20230108', '20230109', '20230110',
       ...
       '20231222', '20231223', '20231224', '20231225', '20231226', '20231227',
       '20231228', '20231229', '20231230', '20231231'],
      dtype='object', length=365)

In [3]:
#get all the urls for all the dates we need the data for
#make each day an individual list so we can compare and match them
websites = []

for i in date_list:
    test = 'https://data.opensanctions.org/datasets/'+(i)+'/us_ofac_sdn/targets.simple.csv'
    websites.append(test)
print(websites)

['https://data.opensanctions.org/datasets/20230101/us_ofac_sdn/targets.simple.csv', 'https://data.opensanctions.org/datasets/20230102/us_ofac_sdn/targets.simple.csv', 'https://data.opensanctions.org/datasets/20230103/us_ofac_sdn/targets.simple.csv', 'https://data.opensanctions.org/datasets/20230104/us_ofac_sdn/targets.simple.csv', 'https://data.opensanctions.org/datasets/20230105/us_ofac_sdn/targets.simple.csv', 'https://data.opensanctions.org/datasets/20230106/us_ofac_sdn/targets.simple.csv', 'https://data.opensanctions.org/datasets/20230107/us_ofac_sdn/targets.simple.csv', 'https://data.opensanctions.org/datasets/20230108/us_ofac_sdn/targets.simple.csv', 'https://data.opensanctions.org/datasets/20230109/us_ofac_sdn/targets.simple.csv', 'https://data.opensanctions.org/datasets/20230110/us_ofac_sdn/targets.simple.csv', 'https://data.opensanctions.org/datasets/20230111/us_ofac_sdn/targets.simple.csv', 'https://data.opensanctions.org/datasets/20230112/us_ofac_sdn/targets.simple.csv', 'ht

In [4]:
#now loop over the batches to get all the data and append the dataframes to one another
#this gives us a list of dataframes
#the loop must contain a date identifyer as a new column so each dataframe is marked with the retrieval date 
#that is then needed to match the different days against each other
entities_list = []
date_pattern = r'/datasets/(\d{8})/'
for site in websites:
    response = requests.get(site)
    if response.status_code != 200:
        continue
    data = pd.read_csv(site, low_memory=False)
    match = re.search(date_pattern, site) #extract the date from the url
    if match:
        date = match.group(1)
        data['date_stamp'] = date #set an individual date stamp for each dataframe matching the url date
        #ata =  data.set_index('date_stamp') #set as an index
    entities_list.append(data)
res = pd.concat(entities_list)  # concatenate list of dataframes

In [5]:
#now match the different dates against each other and return an indicator per row for a new listing or delisting
# Sort the DataFrame based on date_stamp and id
res.sort_values(by=['id', 'date_stamp'], inplace=True)

# Identify new entries and deletions based on the index and columns you want to compare
new_entries = ~res.duplicated(subset=['id'], keep='first')
deletions = ~res.duplicated(subset=['id'], keep='last')

# Identify unchanged rows by checking for duplicates based on the id and date_stamp
unchanged_rows = ~new_entries & ~deletions

# Set the values of new_entry, deletion, and unchanged columns
res['new_entry'] = new_entries
res['deletion'] = deletions
res['unchanged'] = unchanged_rows

# Reset index for the final result
res.reset_index(drop=True, inplace=True)

In [6]:
#test the matching function
test = res.query('new_entry == True')
print(test)

                                id        schema  \
0        NK-22HtK7WrxZ2sU3rmhz6PuZ        Person   
364      NK-22oMG6jqPQknWaMjzTn4hK       Company   
672      NK-23p2d4vMT5sJtQ845GyzJt  Organization   
1036     NK-23rgYEXa9AHtupZKgS8Tbc        Person   
1400     NK-24KmksG96rQedGYXzm4xHU  Organization   
...                            ...           ...   
4561328                 ofac-47042  Organization   
4561329                 ofac-47068  Organization   
4561333                 ofac-47069        Person   
4561337                 ofac-47088  Organization   
4561341                 ofac-47089  Organization   

                                                      name  \
0                                          Michael Kuajien   
364                    Limited Liability Company Garantiya   
672      Scientific and Production Association of Measu...   
1036                                  Bahram Ali SHAYESTEH   
1400                    CONSTRUCCIONES E INVERSIONES LTDA.   
...

In [32]:
# Add a 'month' column
res['month'] = pd.to_datetime(res['date_stamp']).dt.to_period('M')
#get a count of all listings per month
res['listing_count'] = 0  # Initialize the 'listing_count' column with zeros
res['listing_count'] = res.groupby('month')['id'].transform('nunique')
res

Unnamed: 0,id,schema,name,aliases,birth_date,countries,addresses,identifiers,sanctions,phones,...,dataset,first_seen,last_seen,date_stamp,last_change,new_entry,deletion,unchanged,month,listing_count
0,NK-22HtK7WrxZ2sU3rmhz6PuZ,Person,Michael Kuajien,Michael Kuajian;Michael Kuajien Duer Mayok,1979-01-01,ke;ss,"Nairobi, 248-00100;South Sudan",,SDN List - Program - Block - Executive Order 1...,,...,US OFAC Specially Designated Nationals (SDN) List,2021-09-30 11:39:21,2023-01-01 18:18:26,20230101,,True,False,False,2023-01,11559
1,NK-22HtK7WrxZ2sU3rmhz6PuZ,Person,Michael Kuajien,Michael Kuajian;Michael Kuajien Duer Mayok,1979-01-01,ke;ss,Nairobi;South Sudan,,SDN List - Block - Program - Executive Order 1...,,...,US OFAC Specially Designated Nationals (SDN) List,2021-09-30 11:39:21,2023-01-02 18:16:42,20230102,,False,False,True,2023-01,11559
2,NK-22HtK7WrxZ2sU3rmhz6PuZ,Person,Michael Kuajien,Michael Kuajian;Michael Kuajien Duer Mayok,1979-01-01,ke;ss,"Nairobi, 248-00100;South Sudan",,Block - Program - SDN List - Executive Order 1...,,...,US OFAC Specially Designated Nationals (SDN) List,2021-09-30 11:39:21,2023-01-03 18:17:28,20230103,,False,False,True,2023-01,11559
3,NK-22HtK7WrxZ2sU3rmhz6PuZ,Person,Michael Kuajien,Michael Kuajian;Michael Kuajien Duer Mayok,1979-01-01,ke;ss,Nairobi;South Sudan,,Block - Program - SDN List - Executive Order 1...,,...,US OFAC Specially Designated Nationals (SDN) List,2021-09-30 11:39:21,2023-01-04 18:16:55,20230104,,False,False,True,2023-01,11559
4,NK-22HtK7WrxZ2sU3rmhz6PuZ,Person,Michael Kuajien,Michael Kuajian;Michael Kuajien Duer Mayok,1979-01-01,ke;ss,"Nairobi, 248-00100;South Sudan",,Program - SDN List - Block - Executive Order 1...,,...,US OFAC Specially Designated Nationals (SDN) List,2021-09-30 11:39:21,2023-01-05 18:18:04,20230105,,False,False,True,2023-01,11559
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4561340,ofac-47088,Organization,Al Rawda Exchange and Money Transfers Company,Al Rawda Exchange and Transfers Co.;Al Rawdah ...,,ye,"Airport Line, Al-Jumna Roundabout, Sana'a;Al-H...",,SDN List - Executive Order 13224 (Terrorism),,...,US OFAC Specially Designated Nationals (SDN) List,2023-12-28T16:10:01,2023-12-31T22:10:01,20231231,2023-12-28T16:10:01,False,True,False,2023-12,14229
4561341,ofac-47089,Organization,Al Aman Kargo Ithalat Ihracat Ve Nakliyat Limi...,Al Aman Co Kargo,,tr,"11 Eylul Cd., No. 32, Yavus Selim, Bursa;Cakma...",919198;921643-0,SDN List - Executive Order 13224 (Terrorism),,...,US OFAC Specially Designated Nationals (SDN) List,2023-12-28T16:10:01,2023-12-28T22:10:01,20231228,2023-12-28T16:10:01,True,False,False,2023-12,14229
4561342,ofac-47089,Organization,Al Aman Kargo Ithalat Ihracat Ve Nakliyat Limi...,Al Aman Co Kargo,,tr,"11 Eylul Cd., No. 32, Yavus Selim, Bursa;Cakma...",919198;921643-0,SDN List - Executive Order 13224 (Terrorism),,...,US OFAC Specially Designated Nationals (SDN) List,2023-12-28T16:10:01,2023-12-29T22:10:29,20231229,2023-12-28T16:10:01,False,False,True,2023-12,14229
4561343,ofac-47089,Organization,Al Aman Kargo Ithalat Ihracat Ve Nakliyat Limi...,Al Aman Co Kargo,,tr,"11 Eylul Cd., No. 32, Yavus Selim, Bursa;Cakma...",919198;921643-0,SDN List - Executive Order 13224 (Terrorism),,...,US OFAC Specially Designated Nationals (SDN) List,2023-12-28T16:10:01,2023-12-30T22:10:01,20231230,2023-12-28T16:10:01,False,False,True,2023-12,14229


In [33]:
res_deduplicated = res.copy()
res_deduplicated.sort_values(by=['id', 'date_stamp'], inplace=True)

# Identify new entries and deletions based on the index and columns you want to compare
new_entries = ~res_deduplicated.duplicated(subset=['id'], keep='first')
deletions = ~res_deduplicated.duplicated(subset=['id'], keep='last')

# Identify unchanged rows by checking for duplicates based on the id and date_stamp
unchanged_rows = ~new_entries & ~deletions
# Create new columns for first seen and last seen dates
res_deduplicated['listing_date'] = res_deduplicated['date_stamp'].where(new_entries)
res_deduplicated['delisting_date'] = res_deduplicated['date_stamp'].where(deletions)

# Deduplicate the entries (keep the first occurrence for each entity)
res_deduplicated = res_deduplicated.drop_duplicates(subset=['id'], keep='first')

# Reset index for the final result
res_deduplicated=res_deduplicated.reset_index(drop=True)

In [34]:
res_deduplicated['new_listing_count'] = 0  # Initialize the 'listing_count' column with zeros
res_deduplicated['new_listing_count'] = res_deduplicated.groupby('month')['new_entry'].transform('nunique')
res_deduplicated

Unnamed: 0,id,schema,name,aliases,birth_date,countries,addresses,identifiers,sanctions,phones,...,date_stamp,last_change,new_entry,deletion,unchanged,month,listing_count,listing_date,delisting_date,new_listing_count
0,NK-22HtK7WrxZ2sU3rmhz6PuZ,Person,Michael Kuajien,Michael Kuajian;Michael Kuajien Duer Mayok,1979-01-01,ke;ss,"Nairobi, 248-00100;South Sudan",,SDN List - Program - Block - Executive Order 1...,,...,20230101,,True,False,False,2023-01,11559,20230101,,1
1,NK-22oMG6jqPQknWaMjzTn4hK,Company,Limited Liability Company Garantiya,Garantiya OOO,,ru,"bulvar Tverskoi, d. 15 str. 2, Moscow",5067746901426;7703610362,Block - SDN List - Program - Executive Order 1...,,...,20230226,,True,False,False,2023-02,12137,20230226,,1
2,NK-23p2d4vMT5sJtQ845GyzJt,Organization,Scientific and Production Association of Measu...,Aktsionernoe Obschestvo Nauchno Proizvodstvenn...,,ru,"2k4 Pionerskaya Str., Korolyov, Moscow Region,...",1095018006555;5018139517,SDN List - Program - Block - Executive Order 1...,,...,20230101,,True,False,False,2023-01,11559,20230101,,1
3,NK-23rgYEXa9AHtupZKgS8Tbc,Person,Bahram Ali SHAYESTEH,Bahrami Ali JADALI;Bahrami Ali SHAYESTEH,1958-06-13;1963-05-06;1963-08-06,de,"80331 Muenchen, Bayern",,SDN List - Program - Block - Unknown,,...,20230101,,True,False,False,2023-01,11559,20230101,,1
4,NK-24KmksG96rQedGYXzm4xHU,Organization,CONSTRUCCIONES E INVERSIONES LTDA.,,,co,"Calle 15 No. 10-52, La Union, Valle",800154939-3,SDN List - Program - Block - Unknown,,...,20230101,,True,False,False,2023-01,11559,20230101,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16441,ofac-47042,Organization,BELLATRIX ENERGY LIMITED,,,cn;hk,"Unit 601, 6/F of Mill 5 of the Mills, 45 Pak T...",3000934,SDN List - Executive Order 14024 (Russia),,...,20231220,2023-12-20T16:10:01,True,True,False,2023-12,14229,20231220,20231220,1
16442,ofac-47068,Organization,Nabco Money Exchange and Remittance Co.,NABCO MONEY EXCHANGE & REMITTANCE CO.;Nabako M...,,ye,"Al-Khamis Street, Lebanese University Neighbor...",,SDN List - Executive Order 13224 (Terrorism),,...,20231228,2023-12-28T16:10:01,True,False,False,2023-12,14229,20231228,,1
16443,ofac-47069,Person,Nabil Ali Ahmed Al-Hadha,Nabil al-Haza';نبیل علي أحمد الحظا,1975-02-02,ye,,08928715,SDN List - Executive Order 13224 (Terrorism),,...,20231228,2023-12-28T16:10:01,True,False,False,2023-12,14229,20231228,,1
16444,ofac-47088,Organization,Al Rawda Exchange and Money Transfers Company,Al Rawda Exchange and Transfers Co.;Al Rawdah ...,,ye,"Airport Line, Al-Jumna Roundabout, Sana'a;Al-H...",,SDN List - Executive Order 13224 (Terrorism),,...,20231228,2023-12-28T16:10:01,True,False,False,2023-12,14229,20231228,,1


In [35]:
#save to csv
res_deduplicated.to_csv("us_ofac_sdn.csv")

# 2. Get Descriptives of OpenSanctions Data

In [36]:
data = res_deduplicated.copy()
#get column names
print(data.columns)
print(data.describe())
print(data.schema.unique())
print(data.month.unique())
print(data.month.nunique())

print(data.date_stamp.unique())
print(data.date_stamp.nunique())
# receive information on missing values/ null values
is_null = data.isnull().sum()
print(is_null)

Index(['id', 'schema', 'name', 'aliases', 'birth_date', 'countries',
       'addresses', 'identifiers', 'sanctions', 'phones', 'emails', 'dataset',
       'first_seen', 'last_seen', 'date_stamp', 'last_change', 'new_entry',
       'deletion', 'unchanged', 'month', 'listing_count', 'listing_date',
       'delisting_date', 'new_listing_count'],
      dtype='object')
       listing_count  new_listing_count
count   16446.000000            16446.0
mean    12031.462483                1.0
std       839.141150                0.0
min     11559.000000                1.0
25%     11559.000000                1.0
50%     11559.000000                1.0
75%     12137.000000                1.0
max     14229.000000                1.0
['Person' 'Company' 'Organization' 'Airplane' 'Vessel']
<PeriodArray>
['2023-01', '2023-02', '2023-03', '2023-09', '2023-11', '2023-12', '2023-05',
 '2023-04', '2023-10', '2023-07', '2023-08', '2023-06']
Length: 12, dtype: period[M]
12
['20230101' '20230226' '20230314' '20

In [37]:
#create a monthly listing column for plotting the data
data = data.sort_values(by=['month'])
data['listing_month'] = data['new_listing_count'].groupby(data['month']).transform('sum')
print(data.dtypes)

id                   object
schema               object
name                 object
aliases              object
birth_date           object
countries            object
addresses            object
identifiers          object
sanctions            object
phones               object
emails               object
dataset              object
first_seen           object
last_seen            object
date_stamp           object
last_change          object
new_entry              bool
deletion               bool
unchanged              bool
month                object
listing_count         int64
listing_date         object
delisting_date       object
new_listing_count     int64
listing_month         int64
dtype: object


In [38]:
data.month

0        2023-01
9101     2023-01
9102     2023-01
9103     2023-01
9104     2023-01
          ...   
2179     2023-12
2157     2023-12
2152     2023-12
11418    2023-12
16445    2023-12
Name: month, Length: 16446, dtype: object

In [16]:
#get the top shema
data['schema'].value_counts().nlargest(5)

schema
Person          7529
Organization    5490
Company         2267
Vessel           782
Airplane         378
Name: count, dtype: int64

In [17]:
#get the top sanctions regime
data['sanctions'].value_counts().nlargest(15)

sanctions
SDN List - Program - Block - Unknown                                      2162
SDN List - Program - Block - Executive Order 14024                        1745
SDN List - Executive Order 14024 (Russia)                                 1673
SDN List - Program - Block - Executive Order 13224 (Terrorism)            1260
SDN List - Program - Block - Foreign Narcotics Kingpin Designation Act     789
SDN List - Executive Order 14024                                           686
SDN List - Program - Block - Executive Order 13818 (Global Magnitsky)      605
SDN List - Program - Block - Executive Order 13599 (Iran)                  486
SDN List - Program - Block - Executive Order 13582 (Syria)                 415
SDN List - Executive Order 13224 (Terrorism)                               376
SDN List - Program - Block - Executive Order 13382 (Non-proliferation)     345
SDN List - Block - Program - Executive Order 14024                         270
Block - SDN List - Program - Executive Ord

In [18]:
#get the dates with the most new listings
data['listing_date'].value_counts().nlargest(10)

listing_date
20230101    11394
20230519      324
20231212      277
20230224      253
20231102      229
20231106      226
20230226      226
20231120      185
20230522      169
20230914      168
Name: count, dtype: int64

# 3. Extract data from OFAC 2023 SDN arcive file


#patterns in the txt file:

The following [RUSSIA-EO14024] [UKRAINE-EO13661] entries have been
changed: 

    
The following [SDGT] entries have been added to OFAC's SDN List: 


The following [IRAN-HR] entries have been added to OFAC's SDN List:

The following [SDGT] entries have been removed: 

In [5]:
#get the filecontents from the OFAC website
url = 'https://www.treasury.gov/ofac/downloads/sdnnew23.txt'
response = requests.get(url)
print(response.status_code)
data = response.text
data

200


'This publication of Treasury\'s Office of Foreign Assets Control\n("OFAC") is designed as a reference tool providing actual notice of\nactions by OFAC with respect to Specially Designated Nationals and\nother entities whose property is blocked, to assist the public in\ncomplying with the various sanctions programs administered by OFAC.\nThe latest changes may appear here prior to their publication in\nthe Federal Register, and it is intended that users rely on changes\nindicated in this document that post-date the most recent Federal\nRegister publication with respect to a particular sanctions program\nin the appendices to chapter V of Title 31, Code of Federal\nRegulations.  Such changes reflect official actions of OFAC, and\nwill be reflected as soon as practicable in the Federal Register\nunder the index heading "Foreign Assets Control."  New Federal\nRegister notices with regard to Specially Designated Nationals or\nblocked entities may be published at any time.  Users are advised

In [None]:
#tokenize
##access a huggingface library on tokenizers
#identify patterns

In [13]:
#need to access an online corpus so i can use it to tokenize
import requests
from tokenizers import SentencePieceBPETokenizer

def load_text_from_url(url):
    try:
        response = requests.get(url)
        # Check if request was successful
        response.raise_for_status()
        # Decode the content assuming it's UTF-8
        text = response.text
        return text
    except requests.exceptions.RequestException as e:
        print("Error loading text from URL:", e)
        return None

def tokenize_into_batches(text):
    tokenizer = SentencePieceBPETokenizer()

    # Load pre-trained tokenizer
    tokenizer.train([text])

    # Tokenize text into batches
    batches = []
    current_batch = []
    for sentence in tokenizer.encode(text).tokens:
        sentence = sentence.replace('▁', ' ').strip()  # Remove special tokens
        # Check if the sentence ends with a period
        if sentence.endswith('.'):
            # Append the sentence to the current batch
            current_batch.append(sentence)
            # If the current batch is not empty, add it to batches
            if current_batch:
                batches.append(current_batch)
            # Start a new batch
            current_batch = []
        else:
            # Append the sentence to the current batch
            current_batch.append(sentence)

    return batches

# Example usage:
url = "https://www.treasury.gov/ofac/downloads/sdnnew23.txt"
text = load_text_from_url(url)

if text:
    batches = tokenize_into_batches(text)
    for i, batch in enumerate(batches):
        print(f"Batch {i+1}:")
        for sentence in batch:
            print(sentence)
        print()
else:
    print("Failed to load text from the URL.")


Exception: Der Dateiname oder die Erweiterung ist zu lang. (os error 206)

In [None]:
# Define regular expressions
name_pattern = r'Name: ([A-Za-z\s]+)'
date_pattern = r'Date: (\d{2}/\d{2}/\d{4})'
address_pattern = r'Address: (.+)'

# Find matches
names = re.findall(name_pattern, text)
dates = re.findall(date_pattern, text)
addresses = re.findall(address_pattern, text)

# Create a dictionary to store the extracted information
data = {
    'Name': names,
    'Date': dates,
    'Address': addresses
}


In [None]:
#parse into df format
# Create a DataFrame
df = pd.DataFrame(data)

# Print the DataFrame
print(df)

# 4. Descriptive analysis of OFAC SDN files

# 5. Comparison

In [None]:
list1 = data["Name"]
list2=data2["name"]

set(list1) & set(list2)
set(list1).intersection(list2)
