# **Dataset Pre-Processing**: **`FHI_2020_strat_sampling`** part 2

This second part preprocesses data from the `FHI_2022` project, which contributes to the `FHI_2020_strat_sampling` dataset.\
More specifically, we use data from two `FHI_2022` classes: `Blank` and `Swiped-but-not-run`.

---

## Import main libs

In [1]:
import sys
#import ipyplot
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from functools import partial
import grequests

sys.path.append("../../../src/")
from utils import *

## Parameters

In [2]:
DATASET_NAME = 'DATASET__FHI_2020_strat_sampling'
STORAGE_DIR = f"../../../data/intermediate/{DATASET_NAME}/"
PROJECT_ID = 12
DATA_FNAME = f"{STORAGE_DIR}/data_project-id-{PROJECT_ID}.csv"
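
`DataFrame.to_csv` does not create missing parent directories, so the storage directory has to exist before the final save. A minimal sketch of one way to prepare it, using a temporary directory as a stand-in for the real storage location (the `base` path and the lowercase names below are hypothetical, not part of the notebook):

```python
import tempfile
from pathlib import Path

DATASET_NAME = "DATASET__FHI_2020_strat_sampling"
PROJECT_ID = 12

# Hypothetical base location; the notebook itself uses "../../../data/intermediate/".
base = Path(tempfile.mkdtemp())
storage_dir = base / DATASET_NAME

# Create the directory and any missing parents; no error if it already exists.
storage_dir.mkdir(parents=True, exist_ok=True)

# Joining with "/" on Path objects avoids doubled or missing separators.
data_fname = storage_dir / f"data_project-id-{PROJECT_ID}.csv"
print(data_fname.name)
```

`Path` objects can be passed directly to `to_csv`, so the rest of the notebook would not need to change.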

## Load data from the project using the PAD API

In [3]:
project_id = PROJECT_ID
df = get_project_data(project_id)
df

Unnamed: 0,id,sample_name,test_name,user_name,date_of_creation,raw_file_location,processed_file_location,processing_date,camera_type_1,notes,...,project.neutral_filler,project.qpc20,project.qpc50,project.qpc80,project.qpc100,project.notes,issue.id,issue.name,issue.description,issue
0,42275,Ampicillin,12LanePADKenya2015,api-5NWT4K7IS60WMLR3J2LV,2022-04-28T12:28:36,/var/www/html/images/padimages/raw/40000/42275...,/var/www/html/images/padimages/processed/40000...,2022-04-28T12:28:36,Google Pixel 3a,"{""Predicted drug"": ""ampicillin"", ""User"": ""Unkn...",...,Lactose,1,1,1,1,FHI360 validation study with real and mock dos...,3.0,Stuck,,
1,42276,Ampicillin,12LanePADKenya2015,api-5NWT4K7IS60WMLR3J2LV,2022-04-28T12:28:55,/var/www/html/images/padimages/raw/40000/42276...,/var/www/html/images/padimages/processed/40000...,2022-04-28T12:28:55,Google Pixel 3a,"{""Predicted drug"": ""ampicillin"", ""User"": ""Unkn...",...,Lactose,1,1,1,1,FHI360 validation study with real and mock dos...,3.0,Stuck,,
2,42277,Ampicillin,12LanePADKenya2015,api-5NWT4K7IS60WMLR3J2LV,2022-04-28T12:29:13,/var/www/html/images/padimages/raw/40000/42277...,/var/www/html/images/padimages/processed/40000...,2022-04-28T12:29:13,Google Pixel 3a,"{""Predicted drug"": ""ampicillin"", ""User"": ""Unkn...",...,Lactose,1,1,1,1,FHI360 validation study with real and mock dos...,3.0,Stuck,,
3,42278,Ampicillin,12LanePADKenya2015,api-5NWT4K7IS60WMLR3J2LV,2022-04-28T12:30:54,/var/www/html/images/padimages/raw/40000/42278...,/var/www/html/images/padimages/processed/40000...,2022-04-28T12:30:54,Google Pixel 3a,"{""Predicted drug"": ""ampicillin"", ""User"": ""Unkn...",...,Lactose,1,1,1,1,FHI360 validation study with real and mock dos...,,,,
4,42279,Ampicillin,12LanePADKenya2015,api-5NWT4K7IS60WMLR3J2LV,2022-04-28T12:31:39,/var/www/html/images/padimages/raw/40000/42279...,/var/www/html/images/padimages/processed/40000...,2022-04-28T12:31:39,Google Pixel 3a,"{""Predicted drug"": ""ampicillin"", ""User"": ""Unkn...",...,Lactose,1,1,1,1,FHI360 validation study with real and mock dos...,3.0,Stuck,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3861,47168,,12 Lane PAD Kenya 2014,api-5NWT4K7IS60WMLR3J2LV,2023-06-06T14:54:40,/var/www/html/images/padimages/raw_local/40000...,/var/www/html/images/padimages/processed/40000...,2023-06-06T14:54:40,iPhone,"{""Safe"":""Suspected unsafe"",""Prediction score"":...",...,Lactose,1,1,1,1,FHI360 validation study with real and mock dos...,,,,
3862,47170,Amoxicillin,12 Lane PAD Kenya 2014,api-5NWT4K7IS60WMLR3J2LV,2023-06-08T10:14:12,/var/www/html/images/padimages/raw_local/40000...,/var/www/html/images/padimages/processed/40000...,2023-06-08T10:14:12,iPhone,"{""Safe"":""Suspected safe"",""Prediction score"":1,...",...,Lactose,1,1,1,1,FHI360 validation study with real and mock dos...,,,,
3863,47171,Amoxicillin,12 Lane PAD Kenya 2014,api-5NWT4K7IS60WMLR3J2LV,2023-06-08T10:15:27,/var/www/html/images/padimages/raw_local/40000...,/var/www/html/images/padimages/processed/40000...,2023-06-08T10:15:27,iPhone,"{""Safe"":""Suspected safe"",""Prediction score"":0....",...,Lactose,1,1,1,1,FHI360 validation study with real and mock dos...,,,,
3864,47172,Amoxicillin,12 Lane PAD Kenya 2014,api-5NWT4K7IS60WMLR3J2LV,2023-06-08T10:16:01,/var/www/html/images/padimages/raw_local/40000...,/var/www/html/images/padimages/processed/40000...,2023-06-08T10:16:01,iPhone,"{""Safe"":""Suspected safe"",""Prediction score"":1,...",...,Lactose,1,1,1,1,FHI360 validation study with real and mock dos...,,,,


In [4]:
df.columns.to_list()

['id',
 'sample_name',
 'test_name',
 'user_name',
 'date_of_creation',
 'raw_file_location',
 'processed_file_location',
 'processing_date',
 'camera_type_1',
 'notes',
 'sample_id',
 'quantity',
 'deleted',
 'project.id',
 'project.user_name',
 'project.project_name',
 'project.annotation',
 'project.test_name',
 'project.sample_names.sample_names',
 'project.neutral_filler',
 'project.qpc20',
 'project.qpc50',
 'project.qpc80',
 'project.qpc100',
 'project.notes',
 'issue.id',
 'issue.name',
 'issue.description',
 'issue']

## Normalize columns

In [5]:
# Normalize 'sample_name': lowercase and replace spaces with dashes
df['sample_name'] = df['sample_name'].str.lower().str.replace(' ', '-', regex=False)

# Count unique values to identify the categories in 'sample_name'
print(df['sample_name'].value_counts())

sample_name
acetominophen                 499
ceftriaxone                   468
doxycycline                   432
ampicillin                    401
amoxicillin                   371
ciprofloxacin                 353
blank                         208
caco3-starch                  137
lactose                        83
swiped-but-not-run             78
ripe                           75
ethambutol                     67
promethazine-hydrochloride     67
rifampicin                     65
ferrous-sulfate                64
pyrazinamide                   63
albendazole                    61
isoniazid                      60
sulfamethoxazole               60
chloroquine                    60
azithromycin                   56
ampicillin-starch              53
doxycycline-starch             42
ciprofloxacin-starch           32
distractor                      6
                                5
Name: count, dtype: int64
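
The normalization step can be illustrated on a small, hypothetical series. The sketch below also shows that empty strings and missing values pass through unchanged, which is why the empty-name checks later in the notebook are still needed:

```python
import pandas as pd

# Hypothetical raw labels, including an empty string and a missing value.
raw = pd.Series(["Swiped but not run", "Blank", "", None])

# Same transformation as in the notebook: lowercase, then spaces to dashes.
normalized = raw.str.lower().str.replace(" ", "-", regex=False)
print(normalized.tolist())

# value_counts() drops missing values by default, so they never show up in the summary.
print(normalized.value_counts())
```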


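## Select Classes of Interest

In [6]:
# Keep only the rows that belong to the two classes of interest
classes = ['blank', 'swiped-but-not-run']
# reset_index() keeps the old row labels in a new 'index' column
df = df[df.sample_name.isin(classes)].reset_index()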

## Check for deleted samples

In [7]:
# Drop cards flagged as deleted
num_cards = len(df.index)
df = df[~df['deleted']]
print(f"Filtered {num_cards} cards down to {len(df.index)} cards")

Filtered 286 cards down to 286 cards


## Add `url` column to the data

In [8]:
# Add url to dataframe
df['url'] = df['processed_file_location'].apply(lambda x: f"https://pad.crc.nd.edu/{x}")

## Show number of samples by `sample_name`

In [9]:
df.value_counts(['sample_name']).reset_index(name='counts')

Unnamed: 0,sample_name,counts
0,blank,208
1,swiped-but-not-run,78


## Exclude samples with issues

In [10]:
# Select cards that have no issues
size_before = len(df.index)
df = df[df['issue'].isnull()].copy()
size_after = len(df.index)

print(f"Samples with issues: {size_before-size_after} samples")

Samples with issues: 0 samples


## Find and check samples with `sample_name`==`unknown` or empty

In [11]:


# Return the rows where column_name is null or an empty string
def filter_by_empty_column(df, column_name):
    if column_name not in df.columns:
        raise ValueError(f"The column '{column_name}' is not in the dataframe")
    return df[(df[column_name].isnull()) | (df[column_name] == "")].copy()
    

In [12]:
column_name = "sample_name"

print(f"Total samples: {len(df.index)}")

empty_name = filter_by_empty_column(df, column_name)
print(f"Total num of samples with empty {column_name}: {len(empty_name.index)} samples")

unknown_name = filter_by_unknown_column(df, column_name)
print(f"Total num of samples with unknown sample_name: {len(unknown_name.index)} samples")

Total samples: 286
Total num of samples with empty sample_name: 0 samples
Total num of samples with unknown sample_name: 0 samples
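
`filter_by_unknown_column` is not defined in this notebook; it is presumably provided by the wildcard import from `utils`. The following is only a sketch of what such a helper could look like, assuming that "unknown" means a case-insensitive match on the literal string `unknown`. The real implementation may differ, and the function is named `filter_by_unknown_column_sketch` to avoid shadowing it:

```python
import pandas as pd

def filter_by_unknown_column_sketch(df, column_name, unknown_label="unknown"):
    """Return the rows whose value in column_name equals unknown_label, ignoring case."""
    if column_name not in df.columns:
        raise ValueError(f"The column '{column_name}' is not in the dataframe")
    # Missing values are not counted as "unknown"; the empty-value filter covers them.
    values = df[column_name].astype("string").str.strip().str.lower()
    mask = values.eq(unknown_label).fillna(False).astype(bool)
    return df[mask].copy()

# Hypothetical rows for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "sample_name": ["blank", "Unknown", None, " unknown "],
})
unknown_rows = filter_by_unknown_column_sketch(toy, "sample_name")
print(unknown_rows["id"].tolist())
```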


In [13]:
if len(unknown_name.index) > 0:
    print(f"Unknown sample_name: {unknown_name['sample_name'].unique()}")
    unknown_name[['id','sample_id','sample_name','quantity','test_name','user_name','date_of_creation','url']
                    ].to_csv('../intermediate/data/FHI2020_analysis/check_samples_with_unknown_sample_name.csv', index=False)
    
if len(empty_name.index) > 0:
    print(f"Samples with empty sample_name: {empty_name['sample_name'].unique()}")
    empty_name[['id','sample_id','sample_name','quantity','test_name','user_name','date_of_creation','url']
                    ].to_csv('../intermediate/data/FHI2020_analysis/check_samples_with_empty_sample_name.csv', index=False)

## Check for missing or unknown values in the `processed_file_location` column

In [14]:
column_name = "processed_file_location"

empty_name = filter_by_empty_column(df, column_name)
print(f"Total num of samples with empty '{column_name}': {len(empty_name.index)} samples")

unknown_name = filter_by_unknown_column(df, column_name)
print(f"Total num of samples with unknown '{column_name}': {len(unknown_name.index)} samples")

Total num of samples with empty 'processed_file_location': 0 samples
Total num of samples with unknown 'processed_file_location': 0 samples


In [15]:
if len(unknown_name.index) > 0:
    print(f"Unknown '{column_name}': {unknown_name['sample_name'].unique()}")
    unknown_name[['id','sample_id','sample_name','quantity','test_name','user_name','date_of_creation','url']
                    ].to_csv('../data/intermediate/FHI2020_analysis/check_samples_with_unknown_sample_location.csv', index=False)
    
if len(empty_name.index) > 0:
    print(f"Samples with empty '{column_name}': {empty_name['sample_name'].unique()}")
    empty_name[['id','sample_id','sample_name','quantity','test_name','user_name','date_of_creation','url']
                    ].to_csv('../data/intermediate/FHI2020_analysis/check_samples_with_empty_sample_location.csv', index=False)

# Check the URLs for the `processed_file_location` column

In [16]:
column_name = "url"

bad_urls_df = check_url(df)
print(f"Samples with bad urls: {len(bad_urls_df.index)} samples")

# save the samples that have a status code different from 200 in a new csv file called check_samples_with_bad_urls.csv
if len(bad_urls_df.index) > 0:
    bad_urls_df.to_csv('../data/intermediate/FHI2020_analysis/check_samples_with_bad_urls.csv', index=False)

Samples with bad urls: 0 samples
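
`check_url` comes from `utils` and its internals are not shown here. As a rough sketch of the idea, assuming a "bad" URL is one that does not answer with HTTP status 200, the helper below takes the status lookup as a parameter so it can be demonstrated offline. The names `http_status` and `find_bad_urls` and the example URLs are hypothetical:

```python
import urllib.error
import urllib.request

import pandas as pd

def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response was received."""
    # HEAD avoids downloading the image; some servers may not support it.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered with an error status, e.g. 404
    except (OSError, ValueError):
        return None  # connection failure, timeout, or malformed URL

def find_bad_urls(df, get_status=http_status, url_column="url"):
    """Return the rows whose URL does not answer with status 200."""
    out = df.assign(url_status_code=df[url_column].map(get_status))
    return out[out["url_status_code"] != 200]

# Offline demonstration with a fake status lookup.
fake_status = {"https://example.org/a.png": 200, "https://example.org/b.png": 404}
toy = pd.DataFrame({"id": [1, 2], "url": list(fake_status)})
bad = find_bad_urls(toy, get_status=fake_status.get)
print(bad[["id", "url_status_code"]])
```

Calling `find_bad_urls(df)` without `get_status` would issue one sequential request per row, which is slower than a batched approach for large tables.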


# Hash

- Download the images and calculate the hash of the processed files

In [17]:
save_dir = '../intermediate_data/' 
hash_codes = get_hash_all(df, save_dir)

In [18]:
hash_codes_df = pd.DataFrame(hash_codes, columns=['id', 'url_status_code', 'hashlib_md5'])
hash_codes_df

Unnamed: 0,id,url_status_code,hashlib_md5
0,44047,200,f517e0f8f1c1acb13e2a3137e005758a
1,44050,200,c8de1e0506648f835ca165713b4fe5f7
2,44048,200,353054b5f7e3b5d80e6c07006eef719a
3,44049,200,17ee78855434eb9baf6f05eb9281e87e
4,44051,200,3a2ba067a2bfa6a83a4583d8379c8955
...,...,...,...
281,46173,200,76cd5adc83f3893be6f5d67a19bd324a
282,46174,200,91c9233e4a18a8970c247125efd85776
283,46175,200,060e8a91f123f8a599c9a020a07ffb83
284,46176,200,02ed26b7f059f341848b333457436621
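
`get_hash_all` is also imported from `utils`. Judging by the column name `hashlib_md5`, it most likely applies `hashlib.md5` to the downloaded image bytes, but that is an assumption. A minimal sketch of the hashing step alone, with hypothetical helper names:

```python
import hashlib
import os
import tempfile

def md5_hex(content: bytes) -> str:
    """Return the hexadecimal MD5 digest of a bytes payload."""
    return hashlib.md5(content).hexdigest()

def md5_of_file(path, chunk_size=1 << 20) -> str:
    """Hash a file in chunks so a large image need not be loaded at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(md5_hex(b"hello"))  # 5d41402abc4b2a76b9719d911017c592

# The chunked variant gives the same digest as hashing the bytes in one go.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
file_digest = md5_of_file(tmp.name)
os.remove(tmp.name)
print(file_digest == md5_hex(b"hello"))
```

MD5 is used here only to detect byte-identical images, not for security purposes.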


- Check that all images have a hash


In [19]:
no_hash = hash_codes_df[hash_codes_df.hashlib_md5.isnull()]
print(f"Samples with null hash: {len(no_hash)}")

no_image = hash_codes_df[hash_codes_df.url_status_code != 200]
print(f"Samples with no image: {len(no_image)}")

Samples with null hash: 0
Samples with no image: 0


- Finally, add a new column called `hashlib_md5` containing the hash of each image

In [20]:
# drop 'url_status_code' column from dataframes
hash_codes_df.drop(columns=['url_status_code'], inplace=True)
#if 'url_status_code' in df.columns: df.drop(columns=['url_status_code'], inplace=True)

# merge for adding the 'hashlib_md5' column
df = pd.merge(df, hash_codes_df, on='id')

df.columns

Index(['index', 'id', 'sample_name', 'test_name', 'user_name',
       'date_of_creation', 'raw_file_location', 'processed_file_location',
       'processing_date', 'camera_type_1', 'notes', 'sample_id', 'quantity',
       'deleted', 'project.id', 'project.user_name', 'project.project_name',
       'project.annotation', 'project.test_name',
       'project.sample_names.sample_names', 'project.neutral_filler',
       'project.qpc20', 'project.qpc50', 'project.qpc80', 'project.qpc100',
       'project.notes', 'issue.id', 'issue.name', 'issue.description', 'issue',
       'url', 'url_status_code', 'hashlib_md5'],
      dtype='object')

- Check whether any samples share the same hash


In [21]:
num_samples = len(df)
data = df.groupby(['hashlib_md5']).size().reset_index(name='counts')
one_sample_hash = data[data['counts']==1]
two_more_sample_hash = data[data['counts']>1]

print('Summary:')
print(f"Total unique hash codes : {len(data.index)}")
print(f"Total of hash code with one sample: {len(one_sample_hash.index)}")
print(f"Total of hash code with two or more samples: {len(two_more_sample_hash.index)}")

print('')
print(f"Total of samples: {num_samples}")
print(f"Total of samples without duplicates: {len(data.index)}")
print(f"Total of samples in some duplicate case (will be deleted): {num_samples-len(data.index)}")

Summary:
Total unique hash codes : 286
Total of hash code with one sample: 286
Total of hash code with two or more samples: 0

Total of samples: 286
Total of samples without duplicates: 286
Total of samples in some duplicate case (will be deleted): 0
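
No duplicates were found in this project, so nothing has to be removed. If some had been found, one possible way to inspect and drop them is sketched below on hypothetical rows, keeping the first row of each hash group (which row to keep is a choice the notebook does not specify):

```python
import pandas as pd

# Hypothetical rows: ids 2 and 3 point at byte-identical images.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "hashlib_md5": ["aaa", "bbb", "bbb"],
})

# Flag every row that shares its hash with at least one other row.
in_duplicate_group = toy.duplicated(subset="hashlib_md5", keep=False)
print(toy[in_duplicate_group])

# Keep the first row of each hash group and drop the rest.
deduplicated = toy.drop_duplicates(subset="hashlib_md5", keep="first")
print(len(toy) - len(deduplicated), "row(s) removed")
```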


# Check unique samples

In [31]:
data = df.copy()


In [32]:
samples_unique = data[['sample_id','sample_name']].value_counts().reset_index(name='counts')
print(f"Total of Unique samples {samples_unique.shape[0]}")

samples_unique_grp = samples_unique.groupby(['sample_name']).size().reset_index(name='counts')
samples_unique_grp.sort_values(by=['counts'], ascending=False)

Total of Unique samples 76


Unnamed: 0,sample_name,counts
0,blank,47
1,swiped-but-not-run,29
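
The 286 image rows map to only 76 distinct (`sample_id`, `sample_name`) pairs, so a `sample_id` can appear on several rows. The same per-class count can be obtained more directly with `nunique`, sketched here on hypothetical rows:

```python
import pandas as pd

# Hypothetical rows: sample_id 10 appears on two image rows.
toy = pd.DataFrame({
    "sample_id": [10, 10, 11, 20],
    "sample_name": ["blank", "blank", "blank", "swiped-but-not-run"],
})

# Count distinct sample_ids within each class, largest class first.
unique_per_class = (
    toy.groupby("sample_name")["sample_id"]
    .nunique()
    .sort_values(ascending=False)
    .reset_index(name="counts")
)
print(unique_per_class)
```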


# Fixing quantities

In [33]:
counts_per_group = data.groupby(['sample_name', 'quantity']).agg({
    'sample_id': 'nunique',  # Counts unique sample_ids
    'id': 'count'            # Counts total ids
})

print("Counts of 'sample_id's and 'id's by 'sample_name' and 'quantity' group:")
counts_per_group

Counts of 'sample_id's and 'id's by 'sample_name' and 'quantity' group:


Unnamed: 0_level_0,Unnamed: 1_level_0,sample_id,id
sample_name,quantity,Unnamed: 2_level_1,Unnamed: 3_level_1
blank,0,28,152
blank,20,8,24
blank,50,10,30
blank,80,1,2
swiped-but-not-run,20,8,23
swiped-but-not-run,50,20,54
swiped-but-not-run,80,1,1


# Remove the quantity for the blank and swiped-but-not-run samples

The quantity label should be ignored for the blank (with or without reagents) and swiped-but-not-run samples when training the classification model or demonstrating its capabilities.

For the blank PADs, the quantity is irrelevant because they carry no sample and undergo no reaction. For the swiped-but-not-run PADs, the quantity label is not informative: because the applied powder contains a significant amount of lactose, its composition and appearance do not correspond directly to the nominal quantity. The quantity label is therefore removed from these samples by setting it to 0.

In [34]:
data['quantity'] = 0
data.columns

Index(['index', 'id', 'sample_name', 'test_name', 'user_name',
       'date_of_creation', 'raw_file_location', 'processed_file_location',
       'processing_date', 'camera_type_1', 'notes', 'sample_id', 'quantity',
       'deleted', 'project.id', 'project.user_name', 'project.project_name',
       'project.annotation', 'project.test_name',
       'project.sample_names.sample_names', 'project.neutral_filler',
       'project.qpc20', 'project.qpc50', 'project.qpc80', 'project.qpc100',
       'project.notes', 'issue.id', 'issue.name', 'issue.description', 'issue',
       'url', 'url_status_code', 'hashlib_md5'],
      dtype='object')
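
Assigning 0 to the whole column is safe here because `data` only contains the two classes in question. A more defensive variant, sketched on hypothetical rows, restricts the assignment to those classes so that it cannot overwrite quantities if other classes are ever present (the constant name `NO_QUANTITY_CLASSES` is an assumption):

```python
import pandas as pd

NO_QUANTITY_CLASSES = ["blank", "swiped-but-not-run"]

# Hypothetical rows, including a drug class that must keep its quantity.
toy = pd.DataFrame({
    "sample_name": ["blank", "swiped-but-not-run", "amoxicillin"],
    "quantity": [20, 50, 80],
})

# Reset the quantity only where the class has no meaningful quantity.
mask = toy["sample_name"].isin(NO_QUANTITY_CLASSES)
toy.loc[mask, "quantity"] = 0
print(toy)
```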

In [35]:
_ = check_duplicates_by_hash(data)

There is no duplicates.


## Save Cleaned Data

At this point, the dataframe `data` should contain the cleaned samples to include in the dataset.


In [38]:
# save cleaned dataframe to csv
data = data[['id','sample_id','sample_name', 'quantity', 'camera_type_1', 'project.id',  'url', 'hashlib_md5']]
data.to_csv(DATA_FNAME, index=False)
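
When the saved file is read back, `read_csv` infers column types, and a hash made up only of digits would be parsed as a number and lose its leading zeros. A small round-trip sketch with an in-memory buffer and entirely hypothetical values, reading the hash column explicitly as text:

```python
import io

import pandas as pd

EXPECTED_COLUMNS = ["id", "sample_id", "sample_name", "quantity",
                    "camera_type_1", "project.id", "url", "hashlib_md5"]

# Hypothetical row; the hash is deliberately all digits with leading zeros.
toy = pd.DataFrame([{
    "id": 1, "sample_id": 100, "sample_name": "blank", "quantity": 0,
    "camera_type_1": "example-camera", "project.id": 12,
    "url": "https://example.org/a.png",
    "hashlib_md5": "00112233445566778899001122334455",
}])

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy[EXPECTED_COLUMNS].to_csv(buffer, index=False)
buffer.seek(0)

# Force the hash column to be read as a string.
reloaded = pd.read_csv(buffer, dtype={"hashlib_md5": str})
print(list(reloaded.columns) == EXPECTED_COLUMNS)
print(reloaded.loc[0, "hashlib_md5"])
```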