# Data Exploration and Preprocessing

This notebook handles the data loading, preprocessing, and preparation for the machine learning task. We'll work with development and validation datasets, perform necessary cleaning operations, and prepare the data for model training.

## 1. Import Required Libraries

First, let's import the necessary Python libraries for data manipulation and analysis.

In [1]:
import pandas as pd
import numpy as np
import os

## 2. Define File Paths

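Set up the paths for input and output files. Here `a` and `b` are the input CSVs (development and validation sets), `c` is the data directory, and `d` and `e` are the output files.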

In [2]:
# Input files
a = "../data/assignment1_dev_set.csv"  # development set
b = "../data/assignment1_val_set.csv"  # validation set
# Output directory and files
c = "../data"
d = os.path.join(c, "development_final_data.csv")
e = os.path.join(c, "evaluation_final_data.csv")

# Create the output directory if it does not exist yet
os.makedirs(c, exist_ok=True)

print("Loading from", a, "and", b)

Loading from data/assignment1_dev_set.csv and data/assignment1_val_set.csv


## 3. Load Data

Load the development and validation datasets, using the first CSV column as the row index.

In [3]:
try:
    x = pd.read_csv(a, index_col=0)
    y = pd.read_csv(b, index_col=0)
    print("Data loaded.")
except FileNotFoundError as err:
    print("Error:", err)
    print("Check that assignment1_dev_set.csv and assignment1_val_set.csv are in", c)
    raise  # stop this cell; exit() would shut down the notebook kernel

print("Shapes:")
print(" Dev:", x.shape)
print(" Val:", y.shape)

Data loaded.
Shapes:
 Dev: (489, 140)
 Val: (211, 140)
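
Both files report 140 columns, but equal counts do not guarantee equal column names, and `pd.concat` fills any column missing from one frame with NaN. A minimal sketch of a name check on toy frames follows; `dev_demo`, `val_demo`, and `column_mismatch` are hypothetical names used only for illustration and are not part of this notebook:

```python
import pandas as pd


def column_mismatch(left: pd.DataFrame, right: pd.DataFrame) -> dict:
    """Return the columns that appear in only one of the two frames."""
    left_cols, right_cols = set(left.columns), set(right.columns)
    return {
        "only_left": sorted(left_cols - right_cols),
        "only_right": sorted(right_cols - left_cols),
    }


# Toy stand-ins for the development and validation frames (made-up values)
dev_demo = pd.DataFrame({"Host age": [30, 41], "BMI": [22.1, 27.4]})
val_demo = pd.DataFrame({"Host age": [55], "Sex": ["Female"]})

diff = column_mismatch(dev_demo, val_demo)
print(diff)
```

Two empty lists would indicate that the frames can be concatenated without introducing new NaN columns.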


## 4. Data Preprocessing

### 4.1 Merge Datasets

Tag each row with its source in a `src` column, then concatenate the development and validation sets so that the same preprocessing is applied to both.

In [4]:
x["src"] = "dev"
y["src"] = "val"

print("Merging data...")
z = pd.concat([x, y], ignore_index=True)
print("Merged shape:", z.shape)

Merging data...
Merged shape: (700, 141)
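
The tag-and-concatenate pattern can be sketched on toy data. One side effect worth knowing is that `ignore_index=True` replaces the existing row labels with a fresh 0..n-1 range, so any identifiers read through `index_col=0` are not carried into the merged frame. The frames and the `s1`..`s3` labels below are invented for illustration:

```python
import pandas as pd

# Toy frames with made-up sample labels as the index
dev_demo = pd.DataFrame({"BMI": [22.1, 27.4]}, index=["s1", "s2"])
val_demo = pd.DataFrame({"BMI": [30.0]}, index=["s3"])

# assign() returns tagged copies, leaving the original frames unchanged
merged = pd.concat(
    [dev_demo.assign(src="dev"), val_demo.assign(src="val")],
    ignore_index=True,  # row labels become 0, 1, 2
)
print(merged)

# The src marker is enough to recover each subset later
recovered_dev = merged[merged["src"] == "dev"].drop(columns="src")
```

If the original labels are needed downstream, one option is to call `reset_index()` on each frame before concatenating so the labels become an ordinary column.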


### 4.2 Feature Selection

Drop the identifier and metadata columns (`Project ID`, `Experiment type`, `Disease MESH ID`) that are not used as model features.

In [5]:
# Drop identifier and metadata columns that are not used as features
cols = ['Project ID', 'Experiment type', 'Disease MESH ID']
print("Dropping columns", cols)
z = z.drop(columns=cols)

Dropping columns ['Project ID', 'Experiment type', 'Disease MESH ID']
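
`DataFrame.drop` raises a `KeyError` when a listed column is absent. If the input files may vary, a tolerant variant is sketched below on a toy frame; `demo`, `to_drop`, and `missing` are hypothetical names, and whether silent tolerance is desirable here is an assumption:

```python
import pandas as pd

# Toy frame that lacks one of the columns we intend to drop
demo = pd.DataFrame({"Project ID": ["P1"], "BMI": [22.1]})
to_drop = ["Project ID", "Experiment type"]

# Report what is absent before dropping, so nothing disappears silently
missing = [col for col in to_drop if col not in demo.columns]
print("Not found:", missing)

# errors="ignore" skips labels that do not exist instead of raising
demo = demo.drop(columns=to_drop, errors="ignore")
print(demo.columns.tolist())
```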


### 4.3 Handle Missing Values

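Impute missing values in the numerical columns `Host age` and `BMI` with the column median. The median is computed over the merged frame, so it includes both development and validation rows.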

In [6]:
# Impute missing values for numerical columns
nums = ['Host age', 'BMI']
print("Checking missing in", nums)
m = z[nums].isnull().sum()
if m.sum() > 0:
    print("Missing values:", m[m > 0])
    for col in nums:
        med = z[col].median()
        # Assign the result back; fillna(inplace=True) on a column selection may not update z
        z[col] = z[col].fillna(med)
    print("After fill:", z[nums].isnull().sum())
else:
    print("No missing values in", nums)

Checking missing in ['Host age', 'BMI']
No missing values in ['Host age', 'BMI']
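
No values were missing in this run, so the imputation branch did not execute. Had it run, the medians would have been taken over development and validation rows together. If keeping validation statistics out of preprocessing matters, one alternative (not what this notebook does) is to compute the median from development rows only. A minimal sketch on invented data:

```python
import numpy as np
import pandas as pd

# Toy merged frame with a src marker (made-up values)
demo = pd.DataFrame({
    "BMI": [20.0, np.nan, 24.0, np.nan, 40.0],
    "src": ["dev", "dev", "dev", "val", "val"],
})

# Median of the development rows only: NaN is skipped, so this is median([20.0, 24.0])
dev_median = demo.loc[demo["src"] == "dev", "BMI"].median()

# Fill every missing value, in both subsets, with the development median
demo["BMI"] = demo["BMI"].fillna(dev_median)
print(demo)
```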


### 4.4 Categorical Encoding

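Convert the categorical `Sex` column to a numeric indicator using one-hot encoding. With `drop_first=True` the first category is dropped, so a two-level column becomes a single indicator column.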

In [7]:
# One-hot encode 'Sex'; drop_first=True drops the first category's column
print("Encoding categorical column 'Sex'...")
z = pd.get_dummies(z, columns=['Sex'], drop_first=True)
print("Encoding done.")

Encoding categorical column 'Sex'...
Encoding done.
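
The effect of `drop_first=True` is easiest to see on a toy frame. The category labels `"Female"` and `"Male"` are assumed for illustration, since the actual labels in the dataset are not shown here. `dtype=int` is an optional addition that yields 0/1 integers rather than booleans:

```python
import pandas as pd

# Toy frame; the Sex labels are assumptions, not taken from the dataset
demo = pd.DataFrame({
    "Sex": ["Female", "Male", "Female"],
    "BMI": [22.1, 27.4, 30.0],
})

# Categories are sorted, the first one ("Female") is dropped,
# and the remaining indicator is appended after the untouched columns
encoded = pd.get_dummies(demo, columns=["Sex"], drop_first=True, dtype=int)
print(encoded)
```

The dropped category becomes the reference level: a row with `Sex_Male == 0` is a `"Female"` row in this toy example.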


## 5. Split Data Back

Separate the preprocessed data back into development and validation sets using the `src` marker, then drop the marker column.

In [8]:
print("Splitting data back...")
dev = z[z['src'] == 'dev'].copy()
val = z[z['src'] == 'val'].copy()
dev = dev.drop(columns=['src'])
val = val.drop(columns=['src'])

print("Final shapes:")
print(" Dev:", dev.shape)
print(" Val:", val.shape)

Splitting data back...
Final shapes:
 Dev: (489, 137)
 Val: (211, 137)


## 6. Save Processed Data

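Save the preprocessed datasets to CSV files. The row index is not written (`index=False`), and no feature scaling has been applied at this stage.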

In [9]:
print("Saving files...")
try:
    dev.to_csv(d, index=False)
    val.to_csv(e, index=False)
except OSError as ex:
    print("Error saving files:", ex)
else:
    # Report success only when both files were written
    print("Saved to", d, "and", e)
    print("Done. Note: Data not scaled.")

Saving files...
Saved to data/development_final_data.csv and data/evaluation_final_data.csv
Done. Note: Data not scaled.
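
Since the saved data is unscaled, a later modelling step may need to standardize the numeric features. One common approach, sketched here under the assumption that z-score scaling is wanted, is to compute the mean and standard deviation on the development set and reuse them for the validation set. The frames below are toy data and the variable names are hypothetical:

```python
import pandas as pd

# Toy stand-ins for the saved development and validation frames
dev_demo = pd.DataFrame({"BMI": [20.0, 24.0, 28.0]})
val_demo = pd.DataFrame({"BMI": [24.0, 32.0]})

# Statistics come from the development set only
mean = dev_demo.mean()
std = dev_demo.std(ddof=0)  # population standard deviation

# Apply the same transformation to both sets
dev_scaled = (dev_demo - mean) / std
val_scaled = (val_demo - mean) / std
print(val_scaled)
```

A column with zero variance would produce a division by zero here, so constant columns need separate handling.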
