# **Data Collection Notebook - CBC Thalassemia Screening**

## Objectives

* Collect CBC (Complete Blood Count) data with thalassemia phenotype labels
* Load and inspect the dataset with hemoglobin electrophoresis results
* Save raw data for further processing
* Perform initial data quality assessment

## Inputs

* CSV files from the Kaggle dataset "letslive/alpha-thalassemia-dataset" (twoalphas.csv, alphanorm.csv) with columns: sex, hb, pcv, rbc, mcv, mch, mchc, rdw, wbc, neut, lymph, plt, hba, hba2, hbf, phenotype

## Outputs

* Raw dataset saved to: outputs/datasets/collection/alphanorm.csv
* Data inspection report
* Initial data quality assessment

## Dataset Structure

* **sex**: Gender (male/female)
* **hb**: Hemoglobin level (g/dL)
* **pcv**: Packed Cell Volume/Hematocrit (%)
* **rbc**: Red Blood Cell count (million/μL)
* **mcv**: Mean Corpuscular Volume (fL)
* **mch**: Mean Corpuscular Hemoglobin (pg)
* **mchc**: Mean Corpuscular Hemoglobin Concentration (g/dL)
* **rdw**: Red Cell Distribution Width (%)
* **wbc**: White Blood Cell count (thousand/μL)
* **neut**: Neutrophils (%)
* **lymph**: Lymphocytes (%)
* **plt**: Platelet count (thousand/μL)
* **hba**: Hemoglobin A (%)
* **hba2**: Hemoglobin A2 (%)
* **hbf**: Hemoglobin F (%)
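* **phenotype**: Thalassemia classification (carrier vs. normal; carrier labels seen in the raw files include `alpha trait`, `silent carrier` and `alpha carrier`)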


---

# Change working directory

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/Users/nour/Desktop/DiplomaProjects/ThalassemiaPredictor/thalassemia_predictor/jupyter_notebooks'

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


In [3]:
current_dir = os.getcwd()
current_dir

'/Users/nour/Desktop/DiplomaProjects/ThalassemiaPredictor/thalassemia_predictor'
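The cells above move one level up on the assumption that the notebook was started from the `jupyter_notebooks` folder, so re-running cell [2] would move up a second time. A minimal sketch of a guard against that, assuming the folder name shown in the printed path; `project_root` is a hypothetical helper, not part of the original notebook:

```python
from pathlib import Path


def project_root(start: Path, notebooks_dir: str = "jupyter_notebooks") -> Path:
    """Return the parent of `start` only if `start` is the notebooks folder.

    `notebooks_dir` is an assumption taken from the path printed above.
    Any other folder is returned unchanged, so calling this repeatedly is safe.
    """
    return start.parent if start.name == notebooks_dir else start


# Hypothetical usage inside the notebook:
# os.chdir(project_root(Path.cwd()))
```

Because the function only inspects the folder name, it can be checked without touching the real working directory.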

# Fetch Data from sources

### Fetch Data from "letslive/alpha-thalassemia-dataset"

In [4]:
%pip install kaggle

Note: you may need to restart the kernel to use updated packages.


In [10]:
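import kaggle  # requires Kaggle API credentials (e.g. ~/.kaggle/kaggle.json)

dataset = "letslive/alpha-thalassemia-dataset"
data_path = os.path.join(current_dir, "inputs")

# Download the dataset and unzip it into the inputs folder
kaggle.api.dataset_download_files(dataset, path=data_path, unzip=True)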

Dataset URL: https://www.kaggle.com/datasets/letslive/alpha-thalassemia-dataset
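The download only works when Kaggle API credentials are available. As a rough sketch, the check below looks in the two places the Kaggle client commonly reads from: the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, and a `kaggle.json` file in `~/.kaggle` (or in the folder named by `KAGGLE_CONFIG_DIR`). The helper name `kaggle_credentials_available` is hypothetical, and newer client versions may accept other credential sources that this sketch does not cover.

```python
import os
from pathlib import Path


def kaggle_credentials_available(env=None, config_dir=None) -> bool:
    """Best-effort check for Kaggle credentials; it does not validate them."""
    env = os.environ if env is None else env

    # Option 1: credentials passed through environment variables
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True

    # Option 2: a kaggle.json file in the configuration folder
    if config_dir is None:
        config_dir = env.get("KAGGLE_CONFIG_DIR", Path.home() / ".kaggle")
    return (Path(config_dir) / "kaggle.json").is_file()


# Hypothetical usage before the download cell:
# if not kaggle_credentials_available():
#     print("Kaggle credentials not found; the download is likely to fail.")
```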


In [11]:
import pandas as pd

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

---

### Check Kaggle data

In [12]:
files = os.listdir(data_path)
files

['twoalphas.csv', 'alphanorm.csv']

## Inspect files content

In [13]:
# Preview every CSV file found in the inputs folder
for file in files:
    if file.endswith('.csv'):
        df = pd.read_csv(os.path.join(data_path, file))
        print(f"File: {file}")
        display(df.head())
        print(f"Shape: {df.shape}")
        print("-" * 50)


File: twoalphas.csv


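| | sex | hb | pcv | rbc | mcv | mch | mchc | rdw | wbc | neut | lymph | plt | hba | hba2 | hbf | phenotype |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | female | 10.8 | 35.2 | 5.12 | 68.7 | 21.2 | 30.8 | 13.4 | 9.6 | 53.0 | 33.0 | 309.0 | 88.5 | 2.6 | 0.11 | alpha trait |
| 1 | male | 10.8 | 26.6 | 4.28 | 62.1 | 25.3 | 40.8 | 19.8 | 10.3 | 49.4 | 43.1 | 687.0 | 87.8 | 2.4 | 0.9 | alpha trait |
| 2 | female | 10.8 | 35.2 | 5.12 | 68.7 | 21.2 | 30.8 | 13.4 | 9.6 | 53.0 | 33.0 | 309.0 | 88.5 | 2.6 | 0.1 | silent carrier |
| 3 | male | 14.5 | 43.5 | 5.17 | 84.0 | 28.0 | 33.4 | 12.1 | 11.9 | 31.0 | 50.0 | 334.0 | 86.8 | 2.8 | 0.3 | silent carrier |
| 4 | male | 11.5 | 34.4 | 5.02 | 68.7 | 22.9 | 33.4 | 15.7 | 20.4 | 67.0 | 30.0 | 596.0 | 86.3 | 2.4 | 1.3 | silent carrier |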


Shape: (147, 16)
--------------------------------------------------
File: alphanorm.csv


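| | sex | hb | pcv | rbc | mcv | mch | mchc | rdw | wbc | neut | lymph | plt | hba | hba2 | hbf | phenotype |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | female | 10.8 | 35.2 | 5.12 | 68.7 | 21.2 | 30.8 | 13.4 | 9.6 | 53.0 | 33.0 | 309.0 | 88.5 | 2.6 | 0.11 | alpha carrier |
| 1 | male | 10.8 | 26.6 | 4.28 | 62.1 | 25.3 | 40.8 | 19.8 | 10.3 | 49.4 | 43.1 | 687.0 | 87.8 | 2.4 | 0.9 | alpha carrier |
| 2 | female | 10.8 | 35.2 | 5.12 | 68.7 | 21.2 | 30.8 | 13.4 | 9.6 | 53.0 | 33.0 | 309.0 | 88.5 | 2.6 | 0.1 | alpha carrier |
| 3 | male | 14.5 | 43.5 | 5.17 | 84.0 | 28.0 | 33.4 | 12.1 | 11.9 | 31.0 | 50.0 | 334.0 | 86.8 | 2.8 | 0.3 | alpha carrier |
| 4 | male | 11.5 | 34.4 | 5.02 | 68.7 | 22.9 | 33.4 | 15.7 | 20.4 | 67.0 | 30.0 | 596.0 | 86.3 | 2.4 | 1.3 | alpha carrier |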


Shape: (203, 16)
--------------------------------------------------
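The previews above show the same CBC values in the first rows of both files, with different phenotype labels, so a first quality pass could count missing values, duplicate rows, label frequencies and rows shared between the two files. The sketch below is one way to do that with pandas. The helper names `quality_summary` and `overlapping_rows` are hypothetical, and the label column is assumed to be `phenotype` as listed under Dataset Structure.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame, label_col: str = "phenotype") -> dict:
    """Collect a few basic quality indicators for one raw file."""
    return {
        "rows": len(df),
        # Total number of missing cells across all columns
        "missing_values": int(df.isna().sum().sum()),
        # Rows that repeat an earlier row in every column
        "duplicate_rows": int(df.duplicated().sum()),
        # Frequency of each label, including missing labels
        "label_counts": df[label_col].value_counts(dropna=False).to_dict(),
    }


def overlapping_rows(left: pd.DataFrame, right: pd.DataFrame,
                     label_col: str = "phenotype") -> int:
    """Count distinct non-label rows of `left` that also occur in `right`."""
    features = [c for c in left.columns if c != label_col]
    shared = left[features].drop_duplicates().merge(
        right[features].drop_duplicates(), on=features, how="inner"
    )
    return len(shared)
```

In the notebook, `quality_summary(df)` could be called inside the inspection loop, and `overlapping_rows` on the two loaded files would indicate how many measurements appear in both before they are used together.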


## Save files to the repo

In [14]:
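# Create the output folder if it does not already exist
os.makedirs('outputs/datasets/collection', exist_ok=True)

# Load the file explicitly instead of relying on the df left over from the inspection loop
df = pd.read_csv(os.path.join(data_path, 'alphanorm.csv'))
df.to_csv("outputs/datasets/collection/alphanorm.csv", index=False)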
