
*Phase 1: Alien Pet Health, Data Preparation*

# Project Context


# Report: Alien Pet Health Data Preparation

This report documents the data preparation process for the Alien Pet Health dataset. It covers data loading, missing-value handling, categorical attribute normalization, removal of non-informative attributes, distribution analysis, class balance evaluation, and export of the cleaned data.

## 1. Data

The dataset for Phase 1 can be found here:


In your notebook, you can access and read the data directly from this GitHub repository.


## 2. Tasks

1. **Load the dataset**

	- Read the CSV file from the provided GitHub URL.
	- Show the shape of the data, as well as the first five rows.

In [None]:
# Task 2.1: Load the dataset
import pandas as pd
import numpy as np

url = "data/alien_pet_health.csv"
df = pd.read_csv(url)

print(f"Dataset shape: {df.shape}")
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")

print("First 5 rows of the dataset:")

display(df.head())

2. **Missing values**

	- Examine the dataset to identify and assess missing values in various attributes. Missing values may be represented by symbols such as ‘?’, empty strings, or other placeholders.
	- List the attribute or attributes with missing values.
	- Describe the methodology used for this investigation, and provide the corresponding code, if applicable.
	- Convert missing tokens (e.g., empty strings, `n/a`, `?`) to `NaN`.
	- Coerce numeric-like columns to numeric (errors→`NaN`).

	Following this step, each attribute will be populated with either specific values or `NaN`.

In [None]:
print("Overall data structure:")
df.info()  # info() prints its summary directly and returns None, so it is not wrapped in print()
print("\n" + "="*60)

print("Missing values:")
missing_counts = df.isnull().sum()
print(missing_counts[missing_counts > 0])
print("\n" + "="*60)

for col in df.columns:
    unique_vals = df[col].unique()
    print(f"\n{col}: {len(unique_vals)} unique values")
    if len(unique_vals) <= 20:
        print(f"Values: {unique_vals}")
    else:
        print(f"Sample values: {unique_vals[:10]}")


df_clean = df.copy()

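# Treat empty or whitespace-only strings and common placeholder tokens as missing.
# The match is case-insensitive, so variants such as 'N/a' are caught as well.
missing_pattern = r'(?i)^\s*(?:n/a|na|\?|null)?\s*$'
df_clean = df_clean.replace(missing_pattern, np.nan, regex=True)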

numeric_cols = ['thermoreg_reading', 'enzyme_activity_index', 'dual_lobe_signal', 
                'stress_variability', 'activity_score', 'fasting_flag', 
                'health_outcome', 'ingest_marker', 'diagnostic_noise', 
                'thermoreg_reading_fahrenheit']

for col in numeric_cols:
    if col in df_clean.columns:
        df_clean[col] = pd.to_numeric(df_clean[col], errors='coerce')

final_missing = df_clean.isnull().sum()
print(final_missing[final_missing > 0])

df = df_clean

### Results and Analysis - Task 2.2

- Used `df.info()` to examine the data structure
- Analyzed the unique values of each column
- Identified missing-value tokens: `'', ' ', 'n/a', 'N/A', 'na', 'NA', '?', 'null', 'NULL'`
- Applied `pd.to_numeric()` with `errors='coerce'` to the numeric-like columns

- All columns except `health_outcome` contain missing values
- Missing counts include `thermoreg_reading_fahrenheit` (949) and `record_id` (283)
- Some missing values were represented as text tokens ('?', 'N/a') rather than `NaN`
- Several numeric columns contained non-numeric values that needed coercion

All missing values are now standardized as `NaN`.
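
The two steps described above (token replacement, then numeric coercion) can be illustrated on a small toy frame. This is a minimal sketch with made-up values, not the real dataset; the column names are borrowed from the report only for readability.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) mixing real readings with placeholder tokens.
toy = pd.DataFrame({
    "thermoreg_reading": ["36.5", "?", " ", "n/a", "37.1"],
    "habitat_zone": ["c1", "NULL", "c2", "", "c1"],
})

missing_tokens = ['', ' ', 'n/a', 'N/A', 'na', 'NA', '?', 'null', 'NULL']

# Replace every placeholder token with NaN in a single call.
toy_clean = toy.replace(missing_tokens, np.nan)

# Coerce the numeric-like column; anything unparseable becomes NaN.
toy_clean["thermoreg_reading"] = pd.to_numeric(
    toy_clean["thermoreg_reading"], errors="coerce"
)

print(toy_clean.isna().sum())
print(toy_clean.dtypes)
```

After both steps, every cell holds either a usable value or `NaN`, which is the state the task asks for.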

3. **Categorical attributes**

	- Analyze the dataset to detect potential issues with categorical attributes. For example, you may encounter instances where the same category is inconsistently represented using both lowercase and uppercase letters, or where extraneous spaces are included.
	- Describe the methodology used for this investigation, and provide the corresponding code, if applicable.
	- Normalize the values of categorical attributes.

In [None]:
categorical_cols = ['record_id', 'habitat_zone', 'station_code', 'calibration_tag']

print("Columns analysis:")
for col in categorical_cols:
    if col in df.columns:
        unique_vals = df[col].dropna().unique()
        print(f"{col}: {unique_vals}")

print("\nNormalizing categorical values:")
# Strip extraneous whitespace before folding case, so ' c1' and 'C1 ' map to the same category.
df['habitat_zone'] = df['habitat_zone'].str.strip().str.lower()
df['station_code'] = df['station_code'].str.strip().str.upper()
df['calibration_tag'] = df['calibration_tag'].str.strip().str.upper()

print("After normalization:")
for col in ['habitat_zone', 'station_code', 'calibration_tag']:
    if col in df.columns:
        unique_vals = df[col].dropna().unique()
        print(f"{col}: {unique_vals}")

### Results and Analysis - Task 2.3

- Identified categorical columns: `record_id`, `habitat_zone`, `station_code`, `calibration_tag`
- Examined unique values to detect inconsistencies in formatting
- Applied string normalization: `.str.strip()` to remove extraneous spaces, then `.str.lower()` (`habitat_zone`) or `.str.upper()` (`station_code`, `calibration_tag`)

#### Issues
- Mixed case representation (c1, C1, c2, C2, etc.)
- Inconsistent capitalization (z-eat vs Z-EAT)
- Mixed case values (A, a, B, b)
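
As a small illustration of how the issues listed above collapse once values are normalized, the sketch below uses a hypothetical series (the values are invented, not taken from the dataset) and compares the number of distinct categories before and after stripping whitespace and folding case.

```python
import pandas as pd

# Hypothetical raw values showing case and whitespace inconsistencies.
raw = pd.Series(["c1", "C1 ", " c2", "C2", None], name="habitat_zone")

# Strip surrounding whitespace first, then fold to a single case.
normalized = raw.str.strip().str.lower()

print("distinct before:", raw.nunique())
print("distinct after: ", normalized.nunique())
```

Missing entries stay missing: the `.str` accessor passes them through, so normalization does not turn them into spurious categories.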

4. **Remove non-informative attributes**

	- Eliminate the following types of attributes from the dataset, if applicable:
	  - Unique identifiers (IDs)
	  - Constant and quasi-constant features
	  - High-cardinality quasi-identifiers
	  - Scaled linear duplicates
	- Provide the list of the specific attributes being removed.
	- For each attribute listed, offer a brief justification for its exclusion.

In [None]:
print(f"record_id: {df['record_id'].nunique()} unique values out of {len(df)} rows")
print(f"station_code: {df['station_code'].nunique()} unique values")
print(f"ingest_marker value counts:\n{df['ingest_marker'].value_counts(dropna=False)}")

correlation_matrix = df[['thermoreg_reading', 'thermoreg_reading_fahrenheit']].corr()
print(f"Correlation between thermoreg_reading and fahrenheit: {correlation_matrix.iloc[0,1]:.4f}")

to_remove = ['record_id', 'station_code', 'ingest_marker', 'thermoreg_reading_fahrenheit']

print("record_id: Unique identifier")
print("station_code: High-cardinality quasi-identifier") 
print("ingest_marker: Quasi-constant (mostly 1.0)")
print("thermoreg_reading_fahrenheit: Scaled duplicate of thermoreg_reading")

df_filtered = df.drop(columns=to_remove)
print(f"\nDataset shape after removal: {df_filtered.shape}")

df = df_filtered

### Results and Analysis - Task 2.4

- Analyzed cardinality and variability of each attribute
- Computed correlation between potential duplicate features

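1. `record_id` - (4,714 unique values out of 5,000 rows) - Unique identifier; it labels a record rather than describing the pet
2. `station_code` - (4,161 unique values) - High-cardinality quasi-identifier
3. `ingest_marker` - (4,748 values equal to 1.0) - Quasi-constant feature
4. `thermoreg_reading_fahrenheit` - (Correlation with `thermoreg_reading`: 0.6836) - Scaled linear duplicate. The correlation is well below the 1.0 expected of an exact rescaling, so the removal rests on the column being the Fahrenheit version of `thermoreg_reading` rather than on the correlation alone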
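
A Pearson correlation can understate a scaled-duplicate relationship when a few rows are corrupted, so a direct check of the conversion formula may be more informative. The sketch below uses invented readings and assumes the standard conversion F = 1.8 × C + 32 (an assumption suggested by the column names, not confirmed by the dataset). It illustrates one way such a gap can arise; it is not an explanation of the 0.6836 reported above.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: Celsius values and their Fahrenheit conversion,
# with one corrupted Fahrenheit entry and one missing entry.
celsius = pd.Series([36.0, 37.5, 38.2, 39.0, 40.1, 36.8])
fahrenheit = celsius * 9 / 5 + 32
fahrenheit.iloc[2] = 500.0   # corrupted value
fahrenheit.iloc[5] = np.nan  # missing value

pair = pd.DataFrame({"c": celsius, "f": fahrenheit}).dropna()

# Pearson correlation is dragged down by the single corrupted row.
pearson_r = pair["c"].corr(pair["f"])

# Direct check: how many rows satisfy F = 1.8 * C + 32 within a tolerance?
matches = np.isclose(pair["f"], pair["c"] * 1.8 + 32, atol=0.05)
share_matching = matches.mean()

print(f"Pearson r: {pearson_r:.3f}")
print(f"Share of rows matching the conversion: {share_matching:.0%}")
```

If most complete rows match the formula, the column carries no information beyond the original reading, regardless of what the correlation coefficient suggests.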

5. **Characterize distributions**

	- For each numerical attribute, provide a detailed characterization of its value distribution. 
		- Evaluate whether the distribution exhibits normality or skewness.
		- Determine if it is unimodal or multimodal.
		- Identify the presence of any outliers.
		- Justify your answers.
	- Create histograms to visually support your findings.

In [None]:
import matplotlib.pyplot as plt
from scipy import stats

numeric_cols = ['thermoreg_reading', 'enzyme_activity_index', 'dual_lobe_signal', 
                'stress_variability', 'activity_score', 'fasting_flag', 'diagnostic_noise']

fig, axes = plt.subplots(3, 3, figsize=(15, 12))
axes = axes.flatten()

for i, col in enumerate(numeric_cols):
    data = df[col].dropna()
    
    axes[i].hist(data, bins=30, alpha=0.7, edgecolor='black')
    axes[i].set_title(f'{col}')
    axes[i].set_ylabel('Frequency')
    
    skewness = stats.skew(data)
    kurtosis = stats.kurtosis(data)
    
    q1, q3 = data.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    outliers = data[(data < lower_bound) | (data > upper_bound)]
    
    print(f"{col}:")
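    # |skewness| < 0.5 indicates approximate symmetry; it does not by itself establish normality.
    print(f"  Skewness: {skewness:.3f} ({'right' if skewness > 0.5 else 'left' if skewness < -0.5 else 'approximately symmetric'})")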
    print(f"  Kurtosis: {kurtosis:.3f} ({'heavy-tailed' if kurtosis > 0 else 'light-tailed'})")
    print(f"  Outliers: {len(outliers)} ({len(outliers)/len(data)*100:.1f}%)")

for j in range(len(numeric_cols), len(axes)):
    axes[j].remove()

plt.tight_layout()
plt.show()

### Results and Analysis - Task 2.5


- Computed skewness and kurtosis statistics for each numerical attribute
- IQR-based outlier detection
- Created histograms
- Classified distributions based on statistical thresholds

- `thermoreg_reading`: left-skewed (skewness -2.102), heavy-tailed
- `enzyme_activity_index`: right-skewed (skewness 1.839), heavy-tailed
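
Skewness and kurtosis do not say whether a distribution is unimodal or multimodal. One possible way to support that judgment numerically is to count the prominent peaks of a kernel density estimate. The sketch below is an illustration on synthetic samples; the helper `count_modes` and its `min_prominence` threshold are hypothetical choices, not part of the analysis above, and the result should always be checked against the histogram.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde


def count_modes(values, grid_size=512, min_prominence=0.1):
    """Estimate the number of modes from peaks in a Gaussian KDE.

    min_prominence is expressed as a fraction of the highest density value
    (a hypothetical default chosen for this sketch).
    """
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]
    grid = np.linspace(values.min(), values.max(), grid_size)
    density = gaussian_kde(values)(grid)
    peaks, _ = find_peaks(density, prominence=min_prominence * density.max())
    return len(peaks)


# Synthetic samples: one bell-shaped, one with two well-separated clusters.
rng = np.random.default_rng(0)
unimodal = rng.normal(0, 1, 2000)
bimodal = np.concatenate([rng.normal(-4, 1, 1000), rng.normal(4, 1, 1000)])

print("unimodal sample:", count_modes(unimodal))
print("bimodal sample: ", count_modes(bimodal))
```

The prominence threshold trades sensitivity for robustness: a lower value picks up smaller bumps but also more sampling noise.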

6. **Class balance**

	- Report target proportions; include a simple bar chart.

In [None]:
class_counts = df['health_outcome'].value_counts().sort_index()
class_props = df['health_outcome'].value_counts(normalize=True).sort_index()

print("Target class distribution:")
for class_val, count in class_counts.items():
    prop = class_props[class_val]
    print(f"Class {class_val}: {count} samples ({prop:.1%})")

plt.figure(figsize=(8, 6))
bars = plt.bar(class_counts.index, class_counts.values, color=['lightcoral', 'lightblue'], edgecolor='black')
plt.xlabel('Health Outcome')
plt.ylabel('Count')
plt.title('Class Distribution')
plt.xticks([0, 1], ['Unhealthy (0)', 'Healthy (1)'])

# Annotate each bar with its count, placed slightly above the bar.
for bar, count in zip(bars, class_counts.values):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 25, 
             str(count), ha='center', va='bottom')

plt.tight_layout()
plt.show()

### Results and Analysis - Task 2.6

1. Computed value counts and proportions for `health_outcome`
2. Created a bar chart
3. Evaluated balance

- Class 0 (Unhealthy): 2,501 samples
- Class 1 (Healthy): 2,499 samples
- Difference: 2 samples

The two classes are almost perfectly balanced: they differ by 2 samples out of 5,000, or 0.04 percentage points.
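
The balance can also be summarized as an imbalance ratio (majority count divided by minority count), where 1.0 means perfectly balanced. The short calculation below uses only the two counts reported above.

```python
# Counts reported above for health_outcome.
counts = {0: 2501, 1: 2499}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

# Imbalance ratio: majority count divided by minority count (1.0 = perfectly balanced).
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in proportions.items():
    print(f"Class {label}: {share:.2%}")
print(f"Imbalance ratio: {imbalance_ratio:.4f}")
```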

7. **Save the clean data**

	- Keep the core features plus `health_outcome`.
	- Ensure correct dtypes (numeric/ordinal/binary).
	- Save as `alien_pet_health_cleaned.csv`.

In [None]:
print("Current data types:")
print(df.dtypes)
print("\nCleaned dataset shape:", df.shape)
print("Columns:", list(df.columns))

# Note: missing fasting_flag values are imputed as 0 here so the column can be cast to int8.
df['fasting_flag'] = df['fasting_flag'].fillna(0).astype('int8')

core_features = ['thermoreg_reading', 'enzyme_activity_index', 'dual_lobe_signal', 
                'stress_variability', 'habitat_zone', 'activity_score', 
                'fasting_flag', 'calibration_tag', 'diagnostic_noise', 'health_outcome']

df_final = df[core_features].copy()

print("\nFinal dataset info:")
print(f"Shape: {df_final.shape}")
print(f"Data types:\n{df_final.dtypes}")

df_final.to_csv('alien_pet_health_cleaned.csv', index=False)
print("\nSaved as 'alien_pet_health_cleaned.csv'")
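
Because CSV stores no dtype information, a narrow type such as `int8` is not preserved on disk and has to be requested again when the file is read back. The sketch below demonstrates this with a hypothetical two-row stand-in for the cleaned frame, written to a temporary directory so that the real output file is not touched.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row stand-in for the cleaned frame.
sample = pd.DataFrame({
    "thermoreg_reading": [36.5, 37.1],
    "habitat_zone": ["c1", "c2"],
    "fasting_flag": pd.Series([0, 1], dtype="int8"),
    "health_outcome": [1, 0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_cleaned.csv")
    sample.to_csv(path, index=False)
    # CSV stores no dtype information, so narrow types must be requested again.
    reloaded = pd.read_csv(path, dtype={"fasting_flag": "int8"})

print(reloaded.dtypes)
```

The same `dtype` mapping could be reused by whichever later phase loads `alien_pet_health_cleaned.csv`.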