# Chapter 1 - Data Preparation: Pulling and Cleaning the Imaging Dataset

## Summary

This notebook is the first step of our project: getting the imaging data and cleaning it up so we can actually use it. We'll load the dataset, take a look at what we're working with, and fix any messy parts like missing data or inconsistent formatting. Then we'll save a clean version that's ready to go.

We're keeping all the charts and analysis for another notebook so this one stays focused on just prepping the data.


Pulling the Data
---

Imports 

In [5]:
import pandas as pd

<br><br>
Reading in the Raw Data

In [6]:
# Use a raw string for the Windows path. In a normal string, '\r' is read as a
# carriage return, which raised: OSError: [Errno 22] Invalid argument
df = pd.read_excel(r'Data\raw_imaging_data.xlsx')

df.head()
df.info()
df.describe()
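
Backslashes in a Windows path can collide with Python escape sequences: in a normal string, `\r` is a carriage return, not a backslash followed by `r`. Below is a minimal sketch of one way to sidestep this with `pathlib`, assuming the file sits in a `Data` folder next to the notebook. `raw_data_path` is a hypothetical helper, not something from the original notebook.

```python
from pathlib import Path


def raw_data_path(folder="Data", filename="raw_imaging_data.xlsx"):
    """Hypothetical helper: join the parts so no backslash escapes are involved."""
    return Path(folder) / filename


path = raw_data_path()

# The escaped string literal really does contain a carriage return
bad_literal = 'Data\raw_imaging_data.xlsx'
print("carriage return in literal:", "\r" in bad_literal)
print("carriage return in Path:", "\r" in str(path))

# Only read the file if it is actually there (assumed location)
if path.exists():
    import pandas as pd

    df = pd.read_excel(path)
```

`pd.read_excel` accepts a `Path` object directly, so the same path works on Windows, macOS, and Linux.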

<br><br>
### Exploratory Data Analysis

#### **Dataset Size & Coverage**
- We've got **405 procedures** to work with
- The data spans from **2012 to 2025**
- Most procedures happened around **2021**


#### **Quick Observations**
- No missing data in our main columns (Date of Procedure, Patient ID) 
- Patient IDs are fairly evenly distributed, which suggests good coverage across different patients
- The date column is already in a datetime format
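
The observations above can be checked with a few one-liners. This is a rough sketch on made-up stand-in rows (the `sample` frame is hypothetical, not the real dataset). In the notebook the same calls would run against `df`.

```python
import pandas as pd

# Made-up stand-in rows; the real notebook would use the loaded df instead
sample = pd.DataFrame({
    "Date of Procedure": pd.to_datetime(
        ["2012-03-01", "2021-06-15", "2021-09-30", "2025-01-10"]
    ),
    "Patient ID": [100001, 100002, 100003, 100004],
})

dates = sample["Date of Procedure"]

n_procedures = len(sample)                       # dataset size
first_year = dates.dt.year.min()                 # start of coverage
last_year = dates.dt.year.max()                  # end of coverage
busiest_year = dates.dt.year.mode().iloc[0]      # most common year
is_datetime = pd.api.types.is_datetime64_any_dtype(dates)
missing_main = sample[["Date of Procedure", "Patient ID"]].isna().sum().sum()

print(n_procedures, first_year, last_year, busiest_year, is_datetime, missing_main)
```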

Cleaning the Data
---

Quick cleanup steps:
- Check for missing values
- Fix any weird formatting
- Make sure everything looks good before we start analyzing



<br><br>
1. First we'll check for any data that might be missing, starting with a look at the column names


In [None]:
# Accessing the age column raised an error because the name was wrong, so we viewed the columns and found a trailing space after 'Age'
df.columns



Index(['Date of Procedure', 'Patient ID', 'Age ', 'Gender',
       'Surgical Findings', 'Surgical Cure', 'SPECT/CT', 'Ultrasound',
       '4D CT Scan', 'Sestamibi', 'MRI'],
      dtype='object')
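
A trailing space like the one in `'Age '` is easy to miss in the printed index. Here is a small sketch of one way to flag such names automatically, using a shortened copy of the column list above (`columns` and `padded` are illustrative names):

```python
import pandas as pd

# Shortened copy of the column index shown above
columns = pd.Index(['Date of Procedure', 'Patient ID', 'Age ', 'Gender'])

# A name has stray whitespace if stripping it changes it
padded = [name for name in columns if name != name.strip()]

# repr() makes the trailing space visible
print([repr(name) for name in padded])
```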

<br><br>
2. Strip any whitespace from the column names

In [None]:
# Fix the column names by stripping any leading or trailing whitespace
df.columns = df.columns.str.strip()
print(df.columns)

Index(['Date of Procedure', 'Patient ID', 'Age', 'Gender', 'Surgical Findings',
       'Surgical Cure', 'SPECT/CT', 'Ultrasound', '4D CT Scan', 'Sestamibi',
       'MRI'],
      dtype='object')
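
To confirm the cleanup worked, one option is to compare the stripped names against the columns we expect. This is only a sketch: `expected_columns` is a hypothetical list copied from the output above, and `demo` is an empty stand-in frame rather than the real data.

```python
import pandas as pd

# Hypothetical list of the columns we expect, copied from the printed index
expected_columns = [
    'Date of Procedure', 'Patient ID', 'Age', 'Gender', 'Surgical Findings',
    'Surgical Cure', 'SPECT/CT', 'Ultrasound', '4D CT Scan', 'Sestamibi', 'MRI',
]

# Empty stand-in frame with the raw names, including the padded 'Age '
raw_names = [name if name != 'Age' else 'Age ' for name in expected_columns]
demo = pd.DataFrame(columns=raw_names)

missing_before = sorted(set(expected_columns) - set(demo.columns))

demo.columns = demo.columns.str.strip()
missing_after = sorted(set(expected_columns) - set(demo.columns))

print("missing before strip:", missing_before)
print("missing after strip:", missing_after)
```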


<br><br>
3. Check for Columns that have Null values

In [None]:
# Then we checked for null values
print("\n")
print(df.isna().sum())

# Then we found the rows with null ages
print("\n\nRows with null ages:")
print(df[df['Age'].isna()][['Patient ID']])



Date of Procedure      0
Patient ID             0
Age                    2
Gender                 0
Surgical Findings      0
Surgical Cure          0
SPECT/CT             175
Ultrasound           260
4D CT Scan           177
Sestamibi            231
MRI                  390
dtype: int64


Rows with null ages:
     Patient ID
156      212628
239      281633
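
Raw counts are easier to compare as a share of all rows. The sketch below turns the counts printed above into percentages, assuming the 405 procedures mentioned earlier is the total row count. On the real frame, `df.isna().mean() * 100` would give the same figures directly.

```python
import pandas as pd

# Null counts copied from the output above (columns with zero nulls left out)
null_counts = pd.Series({
    "Age": 2,
    "SPECT/CT": 175,
    "Ultrasound": 260,
    "4D CT Scan": 177,
    "Sestamibi": 231,
    "MRI": 390,
})

total_rows = 405  # assumed to equal len(df)

# Percentage of rows missing in each column, highest first
pct_missing = (null_counts / total_rows * 100).round(1).sort_values(ascending=False)
print(pct_missing)
```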


<br><br>
4. Check if there are any unnecessary columns

In [None]:
# Sometimes Excel exports unnamed columns, so we drop them if they exist

# Dropping 'Unnamed: 0' directly raises a KeyError here because this file has no unnamed columns,
# so errors='ignore' keeps the check safe to run either way
df = df.drop(columns='Unnamed: 0', errors='ignore')
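
The check above only covers a column named exactly `'Unnamed: 0'`. A slightly more general sketch, assuming every stray export column starts with `Unnamed` and all column names are strings, filters on the prefix instead. The `demo` frame and its values are made up for illustration.

```python
import pandas as pd

# Made-up frame with two stray export columns
demo = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Patient ID": [1, 2],
    "Unnamed: 5": [None, None],
})

# Keep only the columns whose name does not start with 'Unnamed'
keep = ~demo.columns.str.startswith("Unnamed")
cleaned = demo.loc[:, keep]

print(list(cleaned.columns))
```

This is a no-op when there are no unnamed columns, so it never raises.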