<div style="text-align:center;">
    <h1>Data-driven Estimation of Pharmacological Prescription Duration using Python</h1>
    <h3>Assignment 3 - Implementation and EDA</h3>
    <h4>Authors:</h4>
    <ul style="list-style:none;">
        <li>👤 Elgen Mar Arinasa</li>
        <li>👤 Shawn Jurgen Mayol</li>
    </ul>
    <hr>
</div>


<h2 style="color:#4B0082;">📖 Introduction</h2>
<p>
    Accurate estimates of prescription duration are essential in pharmacoepidemiological research for assessing medication adherence, efficacy, and patient safety. Because prescription data are often missing or imprecise, data-driven methods such as the Sessa Empirical Estimator (SEE), which relies on clustering, become valuable. This notebook first preprocesses and explores the provided dataset (<code>medeventsATC.csv</code>) before applying SEE with clustering methods such as K-Means and DBSCAN.
</p>


<h2 style="color:#4B0082;">🎯 Objectives</h2>
<p>This notebook covers the following tasks:</p>
<ol>
    <li><strong>Load and inspect</strong> the provided dataset for preliminary insights.</li>
    <li><strong>Clean and preprocess</strong> the dataset to prepare for robust analysis.</li>
    <li>Conduct <strong>Exploratory Data Analysis (EDA)</strong> to understand the patterns and distributions in the data.</li>
    <li><strong>Implement</strong> the <em>Sessa Empirical Estimator (SEE)</em> using <strong>K-Means clustering</strong>.</li>
    <li><strong>Implement</strong> the <em>Sessa Empirical Estimator (SEE)</em> using <strong>DBSCAN clustering</strong> as an alternative approach.</li>
    <li><strong>Perform comparative analyses and visualizations</strong> to evaluate and interpret the effectiveness of both clustering techniques.</li>
    <li><strong>Discuss the findings and draw conclusions</strong>, highlighting the strengths and weaknesses of each clustering method.</li>
</ol>


<h2 style="color:#4B0082;">📂 Data Loading and Initial Inspection</h2>
<p>First, we load the provided dataset and perform a basic inspection to understand its structure and contents.</p>


In [1]:
# Importing necessary libraries
import pandas as pd

# Load dataset
data = pd.read_csv('medeventsATC.csv')

# Quick inspection
print("Dataset Dimensions:", data.shape)
data.head()


Dataset Dimensions: (1564, 7)


,PATIENT_ID,DATE,DURATION,PERDAY,CATEGORY,CATEGORY_L1,CATEGORY_L2
0,1,2057-09-04,28.0,20.0,A02BC02,ALIMENTARY TRACT AND METABOLISM,DRUGS FOR ACID RELATED DISORDERS
1,1,2058-06-03,28.0,20.0,A02BC02,ALIMENTARY TRACT AND METABOLISM,DRUGS FOR ACID RELATED DISORDERS
2,1,2058-07-09,28.0,20.0,A02BC02,ALIMENTARY TRACT AND METABOLISM,DRUGS FOR ACID RELATED DISORDERS
3,1,2056-10-09,41.666667,36000.0,A09AA02,ALIMENTARY TRACT AND METABOLISM,"DIGESTIVES, INCL. ENZYMES"
4,1,2056-12-10,40.0,36000.0,A09AA02,ALIMENTARY TRACT AND METABOLISM,"DIGESTIVES, INCL. ENZYMES"
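
<p>The preview rows all belong to one patient, span two ATC codes, and are not in chronological order. A per-patient, per-category summary makes this kind of structure easier to see. The block below is a minimal sketch, not part of the original notebook: <code>sample</code> is a hypothetical in-memory frame holding only the five preview rows, and <code>summarize_events</code> is an illustrative helper name.</p>

```python
import pandas as pd

# Hypothetical in-memory sample: only the five rows shown in the preview.
sample = pd.DataFrame({
    "PATIENT_ID": [1, 1, 1, 1, 1],
    "DATE": ["2057-09-04", "2058-06-03", "2058-07-09", "2056-10-09", "2056-12-10"],
    "DURATION": [28.0, 28.0, 28.0, 41.666667, 40.0],
    "PERDAY": [20.0, 20.0, 20.0, 36000.0, 36000.0],
    "CATEGORY": ["A02BC02", "A02BC02", "A02BC02", "A09AA02", "A09AA02"],
})


def summarize_events(df):
    """Count events and report date range and median duration per patient and ATC code."""
    parsed = df.assign(DATE=pd.to_datetime(df["DATE"]))
    return (
        parsed.groupby(["PATIENT_ID", "CATEGORY"])
        .agg(
            n_events=("DATE", "size"),
            first_date=("DATE", "min"),
            last_date=("DATE", "max"),
            median_duration=("DURATION", "median"),
        )
        .reset_index()
    )


summary = summarize_events(sample)
print(summary)
```

<p>On the full dataset the same helper could be called as <code>summarize_events(data)</code>, assuming the column names shown in the preview.</p>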


<h2 style="color:#4B0082;">🛠️ Data Preprocessing</h2>
<p>We preprocess the data by handling missing values, converting date formats, and ensuring correct data types.</p>


In [2]:
# Convert DATE column to datetime; unparseable values become NaT
data['DATE'] = pd.to_datetime(data['DATE'], errors='coerce')

# Check for missing values (after conversion, so NaT dates are counted too)
missing_data = data.isnull().sum()
print(missing_data)

# Drop rows missing any field the analysis depends on
data.dropna(subset=['PATIENT_ID', 'DATE', 'DURATION', 'CATEGORY'], inplace=True)

# Verify data types after cleaning
data.info()


PATIENT_ID     0
DATE           0
DURATION       0
PERDAY         0
CATEGORY       0
CATEGORY_L1    0
CATEGORY_L2    0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1564 entries, 0 to 1563
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   PATIENT_ID   1564 non-null   int64         
 1   DATE         1564 non-null   datetime64[ns]
 2   DURATION     1564 non-null   float64       
 3   PERDAY       1564 non-null   float64       
 4   CATEGORY     1564 non-null   object        
 5   CATEGORY_L1  1564 non-null   object        
 6   CATEGORY_L2  1564 non-null   object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(3)
memory usage: 85.7+ KB
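
<p>No rows were dropped here because the dataset has no missing values. To illustrate what the cleaning step does when the input is less tidy, the sketch below runs the same two operations (date coercion, then <code>dropna</code>) on a small toy frame. The frame <code>toy</code> and the helper <code>clean_events</code> are hypothetical and serve only as an illustration.</p>

```python
import pandas as pd


def clean_events(df, required=("PATIENT_ID", "DATE", "DURATION", "CATEGORY")):
    """Coerce DATE to datetime, then drop rows missing any required field.

    Returns the cleaned frame and the number of rows removed.
    """
    out = df.copy()
    # Strings that cannot be parsed as dates become NaT instead of raising.
    out["DATE"] = pd.to_datetime(out["DATE"], errors="coerce")
    n_before = len(out)
    out = out.dropna(subset=list(required)).reset_index(drop=True)
    return out, n_before - len(out)


# Toy data: one unparseable date and one missing duration.
toy = pd.DataFrame({
    "PATIENT_ID": [1, 1, 2],
    "DATE": ["2057-09-04", "not a date", "2058-06-03"],
    "DURATION": [28.0, 28.0, None],
    "CATEGORY": ["A02BC02", "A02BC02", "A09AA02"],
})

# Before coercion the bad date is an ordinary string, so it is not counted as missing.
missing_dates_before = int(toy["DATE"].isnull().sum())

cleaned, n_dropped = clean_events(toy)
print("Missing dates before coercion:", missing_dates_before)
print("Rows dropped:", n_dropped)
print(cleaned)
```

<p>The bad date only shows up as missing once it has been coerced, which is why a null check on raw string dates can under-report problems.</p>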
