# Data Cleaning Notebook

## Objectives
- Evaluate missing data to determine the need for cleaning.
- Clean data by addressing missing values, outliers, and irrelevant features.

## Inputs
- outputs/datasets/collection/Fahrraddiebstahl.csv (collected dataset).

## Outputs
- Cleaned train and test sets saved under outputs/datasets/cleaned/.

## 1. Change the Working Directory

In [1]:
import os

# Get the current directory
current_dir = os.getcwd()
print(f"Current directory: {current_dir}")

# Set the parent directory as the new working directory
os.chdir(os.path.dirname(current_dir))
print(f"New working directory: {os.getcwd()}")

Current directory: /workspace/bicycle_thefts_berlin/jupyter_notebooks
New working directory: /workspace/bicycle_thefts_berlin
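
Re-running the cell above moves the working directory up one more level each time. A minimal sketch of a re-run-safe variant follows; the helper name `resolve_project_root` is hypothetical, and it assumes the notebooks folder is named `jupyter_notebooks`, as the output above suggests.

```python
import os
from pathlib import Path


def resolve_project_root(current: Path, notebooks_dirname: str = "jupyter_notebooks") -> Path:
    """Return the parent folder only when `current` is the notebooks folder.

    The folder name is an assumption taken from the printed output above.
    """
    return current.parent if current.name == notebooks_dirname else current


# Running this cell twice leaves the working directory unchanged the second time
project_root = resolve_project_root(Path.cwd())
os.chdir(project_root)
print(f"Working directory: {Path.cwd()}")
```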


## 2. Load Collected Data

In [2]:
import pandas as pd

# Path to the raw dataset
df_raw_path = "outputs/datasets/collection/Fahrraddiebstahl.csv"

# Load the dataset
df = pd.read_csv(df_raw_path, encoding='latin1')

# Preview the first few rows
df.head(3)

ModuleNotFoundError: No module named 'pandas'
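
The `ModuleNotFoundError` above suggests pandas is not installed in the environment the kernel runs in. As a rough sketch, the check below lists which of the packages imported in this notebook cannot be found; the helper name `find_missing_packages` is hypothetical, and the `pip install` hint assumes pip manages the kernel's environment.

```python
import importlib.util


def find_missing_packages(names):
    """Return the package names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages imported in this notebook
required = ["pandas", "numpy", "matplotlib", "seaborn", "ppscore"]
missing = find_missing_packages(required)

if missing:
    print("Missing packages:", ", ".join(missing))
    # Assumption: pip is the package manager for this environment
    print("Possible fix: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```

After installing, restart the kernel and re-run the cells from the top so that `df` is defined for the later steps.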

## 3. Data Exploration
Evaluate variables with missing data:

In [3]:
# List of variables with missing data
vars_with_missing_data = df.columns[df.isna().sum() > 0].to_list()

# Display missing variables
vars_with_missing_data

NameError: name 'df' is not defined
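
The list of column names alone does not show how much data is missing. One possible extension, sketched below, reports the count and percentage of missing values per column. The function name `summarize_missing` and the toy columns are hypothetical and are not taken from the Fahrraddiebstahl dataset.

```python
import pandas as pd


def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize columns that contain missing values (count, percentage, dtype)."""
    missing_count = df.isna().sum()
    summary = pd.DataFrame({
        "missing_count": missing_count,
        "missing_pct": (missing_count / len(df) * 100).round(2),
        "dtype": df.dtypes.astype(str),
    })
    # Keep only columns with at least one missing value, worst first
    return summary[summary["missing_count"] > 0].sort_values("missing_pct", ascending=False)


# Toy data for illustration only
toy = pd.DataFrame({
    "x": [1, None, 3, None],
    "y": ["a", "b", None, "d"],
    "z": [1, 2, 3, 4],
})
report = summarize_missing(toy)
print(report)
```

On the real data, the call would be `summarize_missing(df)` once the dataset loads successfully.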

## 4. Correlation and PPS Analysis
Analyze the correlation and Predictive Power Score (PPS) between features to evaluate how they relate to one another.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ppscore as pps

# Helper functions for the correlation and PPS analysis
def heatmap_corr(df, threshold, figsize=(20, 12), font_annot=8):
    # Build a mask of cells to hide; the np.bool alias was removed in NumPy 1.24, so use bool
    mask = np.zeros_like(df, dtype=bool)
    # Hide the upper triangle (including the diagonal)
    mask[np.triu_indices_from(mask)] = True
    # Hide correlations whose absolute value is below the threshold
    mask[abs(df) < threshold] = True
    fig, axes = plt.subplots(figsize=figsize)
    sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True, mask=mask, cmap='viridis',
                annot_kws={"size": font_annot}, ax=axes, linewidth=0.5)
    axes.set_yticklabels(df.columns, rotation=0)
    plt.ylim(len(df.columns), 0)
    plt.show()

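def CalculateCorrAndPPS(df):
    # numeric_only=True (pandas >= 1.5) skips text columns, which would otherwise raise in pandas >= 2.0
    df_corr_spearman = df.corr(method="spearman", numeric_only=True)
    df_corr_pearson = df.corr(method="pearson", numeric_only=True)

    # PPS for every feature pair, reshaped into a y-by-x matrix
    pps_matrix_raw = pps.matrix(df)
    pps_matrix = pps_matrix_raw.filter(['x', 'y', 'ppscore']).pivot(columns='x', index='y', values='ppscore')

    return df_corr_pearson, df_corr_spearman, pps_matrix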

# Calculate correlation and PPS
df_corr_pearson, df_corr_spearman, pps_matrix = CalculateCorrAndPPS(df)

# Display the Spearman and Pearson correlation heatmaps (pps_matrix is computed but not plotted here)
heatmap_corr(df_corr_spearman, threshold=0.4)
heatmap_corr(df_corr_pearson, threshold=0.4)
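
To illustrate what the mask in `heatmap_corr` hides, and why both Pearson and Spearman are computed, here is a small sketch on toy data. The helper `build_heatmap_mask` and the columns `a`, `b`, `c` are hypothetical; the sketch only mirrors the masking logic and skips plotting.

```python
import numpy as np
import pandas as pd


def build_heatmap_mask(corr: pd.DataFrame, threshold: float) -> np.ndarray:
    """Return a boolean array where True marks cells the heatmap would hide."""
    mask = np.zeros(corr.shape, dtype=bool)
    mask[np.triu_indices_from(mask)] = True            # upper triangle and diagonal
    mask[np.abs(corr.to_numpy()) < threshold] = True   # weak correlations
    return mask


# Toy data for illustration only: b is a monotonic but nonlinear function of a
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [1, 8, 27, 64, 125],
    "c": [2, 1, 2, 1, 2],
})

pearson = toy.corr(method="pearson")
spearman = toy.corr(method="spearman")

# Spearman works on ranks, so it treats the a-b relationship as perfect;
# Pearson measures linear association and comes out lower
print(spearman.loc["a", "b"], pearson.loc["a", "b"])

# Only the a-b cell in the lower triangle stays visible at threshold 0.4
print(build_heatmap_mask(spearman, threshold=0.4))
```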