# Data Cleaning and Preprocessing for Advanced Statistical Analysis

This notebook cleans and preprocesses the raw survey data so that it can be used for statistical analysis. The steps shown here are: merging split columns, dropping empty columns, setting new headers, correcting typos in categorical columns, recalculating BMI and age, converting columns to numeric types, rescaling the blink-count measurement, and flagging values that do not match their expected type.

### Importing libraries and modules
This section sets up the environment. It adds `../src` to the import path, imports the project modules, and reloads them so that edits made since the last import take effect without restarting the kernel.

In [2]:
import sys
import os
import importlib

sys.path.insert(0, os.path.abspath('../src'))

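# Import the submodules explicitly so that both are registered in sys.modules
import scripts.data_processor
import configs.config

# Reload them so that recent edits are picked up without restarting the kernel
importlib.reload(sys.modules['scripts.data_processor'])
importlib.reload(sys.modules['configs.config'])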

from scripts.data_processor import DataProcessor
from configs import config

import pandas as pd
import numpy as np
from IPython.display import display

### Data Processor Initialization and Data Loading

Purpose: Initialize the DataProcessor with the specified file path from the configuration, load the dataset into a DataFrame, and display the first few rows to provide an initial overview.

In [3]:
# Initialize the DataProcessor with the file path from the configuration
processor = DataProcessor(config.file_path)

# Load the data using the DataProcessor
processor.load_data()

# Display the first few rows of the loaded dataframe
processor.df.head()

INFO:root:Loading raw data from Excel file ../data/raw/Urliste_Datenerhebung_WS23_24.xlsx.


Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Messwert 1: Puls (in Schlägen pro Minute),...,Messwert 4: Priorisierte Hand zum Schreiben?,Unnamed: 15,Unnamed: 16,Messwert 5: Wassermenge (in ml pro Tag),Unnamed: 18,Messwert 6: Stamina mit Glas (in Sekunden),Unnamed: 20,Messwert 7: Luftanhalten (in Sekunden),Unnamed: 22,Messwert 8: Video
0,Gesamtgruppe,,Gruppe,Geschlecht,Geb.-Datum,"Alter, Jahre","Körpergröße, cm","Gewicht, Kg",BMI,Ruhepuls,...,(Linkshänder/Rechtshänder/Beidhändig),,,,,Rechts,Links,,,Gesammthäufigkeit des Blinzelns während des 0...
1,1,Studierende,1,männlich,2005-01-20 00:00:00,18,160,58,22.7,56,...,Rechtshänder,,,1400.0,,180,160,53.0,,66
2,2,,1,männlich,2001-08-26 00:00:00,22,184,69,20.4,60,...,Rechtshänder,,,1000.0,,365,275,120.0,,88
3,3,,1,weiblich,2003-12-17 00:00:00,19,178,69,21.8,56,...,Rechtshänder,,,1500.0,,160,140,60.0,,20
4,4,,1,männlich,2004-01-01 00:00:00,19,187,92,26.3,72,...,Rechtshänder,,,3000.0,,,,85.0,,


### Data Cleaning and Transformation

Purpose: Perform a series of data cleaning and transformation steps, including column combination, header setting, typo corrections, data type assignments, BMI calculation, age correction, column standardization, and conversion to numeric data types.

In [4]:
# Combine columns 17 and 18 in the dataset
processor.combine_columns(17, 18)

# Drop columns that are entirely empty
processor.drop_empty_columns()

INFO:root:Combining columns: 17 and 18
INFO:root:Dropping empty columns.
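
The two calls above can be pictured with plain pandas. The following is only a sketch of what such helpers might do, not the actual `DataProcessor` code: the function bodies, the 0-based positions, and the toy column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def combine_columns(df: pd.DataFrame, first: int, second: int) -> pd.DataFrame:
    """Fill gaps in column `first` with values from column `second`, then drop `second`.

    Hypothetical stand-in for DataProcessor.combine_columns; positions are 0-based
    and column names are assumed to be unique.
    """
    out = df.copy()
    first_name, second_name = out.columns[first], out.columns[second]
    out[first_name] = out[first_name].combine_first(out[second_name])
    return out.drop(columns=second_name)


def drop_empty_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Remove columns in which every value is missing."""
    return df.dropna(axis=1, how="all")


# Toy data: one measurement split over two columns, plus an empty column
raw = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 2.0, np.nan],
    "empty": [np.nan, np.nan, np.nan],
})
cleaned = drop_empty_columns(combine_columns(raw, 0, 1))
print(cleaned)
```

`combine_first` keeps the value from the first column wherever one exists, so it is only a safe merge when the two columns never hold conflicting values for the same row.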


In [5]:
# Set new headers for the DataFrame
processor.set_headers(config.new_headers)

processor.df.head()

INFO:root:Setting new headers.


Unnamed: 0,Gesamtgruppe,Type,Gruppe,Geschlecht,Geb.-Datum,"Alter, Jahre","Körpergröße, cm","Gewicht, Kg",BMI,Ruhepuls,Fußlänge Rechts (cm),Fußlänge Links (cm),Handlänge Rechts (cm),Handlänge Links (cm),Priorisierte Hand,Wassermenge (ml/Tag),Stamina Rechts (s),Stamina Links (s),Luftanhalten (s),Häufigkeit Blinzeln (/min)
1,1,Studierende,1,männlich,2005-01-20 00:00:00,18,160,58,22.7,56,24.0,24.0,17.0,17.0,Rechtshänder,1400.0,180.0,160.0,53.0,66.0
2,2,,1,männlich,2001-08-26 00:00:00,22,184,69,20.4,60,26.0,26.0,18.0,18.0,Rechtshänder,1000.0,365.0,275.0,120.0,88.0
3,3,,1,weiblich,2003-12-17 00:00:00,19,178,69,21.8,56,27.0,27.0,18.0,18.0,Rechtshänder,1500.0,160.0,140.0,60.0,20.0
4,4,,1,männlich,2004-01-01 00:00:00,19,187,92,26.3,72,28.0,28.0,20.0,20.0,Rechtshänder,3000.0,,,85.0,
5,5,,1,weiblich,2004-06-03 00:00:00,19,169,73,25.6,68,25.8,26.3,18.2,18.4,Rechtshänder,2500.0,150.0,115.0,51.5,17.0
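
In the preview above, the sub-header row with index 0 is gone and the columns carry descriptive names. A minimal sketch of such a step follows, assuming that the header list has one entry per column and that the first row holds the old sub-headers. The function is a hypothetical stand-in, not the actual `set_headers` implementation.

```python
import pandas as pd


def set_headers(df: pd.DataFrame, new_headers: list) -> pd.DataFrame:
    """Drop the first row (assumed to hold sub-headers) and rename all columns."""
    if len(new_headers) != df.shape[1]:
        raise ValueError(
            f"expected {df.shape[1]} headers, got {len(new_headers)}"
        )
    out = df.iloc[1:].copy()          # keep the original index labels (1, 2, ...)
    out.columns = list(new_headers)
    return out


# Toy data mimicking an Excel sheet whose real labels sit in the first data row
raw = pd.DataFrame(
    [["Gruppe", "Geschlecht"], [1, "männlich"], [1, "weiblich"]],
    columns=["Unnamed: 0", "Unnamed: 1"],
)
renamed = set_headers(raw, ["Gruppe", "Geschlecht"])
print(renamed)
```

Checking the header count up front turns a silent misalignment into an explicit error, which matters here because empty columns were dropped just before the rename.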


In [6]:

# Apply typo corrections to the 'Geschlecht' column
processor.apply_corrections('Geschlecht', config.typos)

# Apply typo corrections to the 'Priorisierte Hand' column
processor.apply_corrections('Priorisierte Hand', config.typos)

# Correct specific columns in the DataFrame
processor.correct_column()

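# Rows up to and including index label 75 are student measurements (group 1);
# the remaining rows are simulated data. .loc slices are label-based and inclusive.
processor.df.loc[:75, 'Type'] = 'Studierende'
processor.df.loc[76:, 'Type'] = 'simulierte Daten'
processor.df.loc[:75, 'Gruppe'] = 1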

processor.df.head()

INFO:root:Applying typo corrections to Geschlecht.
INFO:root:Applying typo corrections to Priorisierte Hand.


Unnamed: 0,Gesamtgruppe,Type,Gruppe,Geschlecht,Geb.-Datum,"Alter, Jahre","Körpergröße, cm","Gewicht, Kg",BMI,Ruhepuls,Fußlänge Rechts (cm),Fußlänge Links (cm),Handlänge Rechts (cm),Handlänge Links (cm),Priorisierte Hand,Wassermenge (ml/Tag),Stamina Rechts (s),Stamina Links (s),Luftanhalten (s),Häufigkeit Blinzeln (/min)
1,1,Studierende,1,männlich,2005-01-20 00:00:00,18,160,58,22.7,56,24.0,24.0,17.0,17.0,rechsthänder,1400.0,180.0,160.0,53.0,66.0
2,2,Studierende,1,männlich,2001-08-26 00:00:00,22,184,69,20.4,60,26.0,26.0,18.0,18.0,rechsthänder,1000.0,365.0,275.0,120.0,88.0
3,3,Studierende,1,weiblich,2003-12-17 00:00:00,19,178,69,21.8,56,27.0,27.0,18.0,18.0,rechsthänder,1500.0,160.0,140.0,60.0,20.0
4,4,Studierende,1,männlich,2004-01-01 00:00:00,19,187,92,26.3,72,28.0,28.0,20.0,20.0,rechsthänder,3000.0,,,85.0,
5,5,Studierende,1,weiblich,2004-06-03 00:00:00,19,169,73,25.6,68,25.8,26.3,18.2,18.4,rechsthänder,2500.0,150.0,115.0,51.5,17.0
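
A typo-correction step of this kind is usually a normalization followed by a dictionary lookup. The sketch below assumes that approach. The mapping is invented for illustration and is not the content of `config.typos`, and the function is not the actual `apply_corrections` method.

```python
import pandas as pd


def apply_corrections(series: pd.Series, corrections: dict) -> pd.Series:
    """Trim whitespace, lower-case, then replace known misspellings."""
    normalized = series.str.strip().str.lower()
    return normalized.replace(corrections)


# Hypothetical mapping from misspelling to canonical label
typos = {"mänlich": "männlich", "weibl.": "weiblich"}

geschlecht = pd.Series([" Männlich", "mänlich", "weibl.", None])
fixed = apply_corrections(geschlecht, typos)
print(fixed.tolist())
```

Because every value is rewritten to the mapping's target, a misspelled target spreads to the whole column. The preview above shows `rechsthänder` in place of the raw `Rechtshänder`, so the corresponding entry in `config.typos` may be worth checking.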


In [8]:

# Calculate and update BMI values in the DataFrame
processor.calculate_and_update_bmi()

# Calculate or correct age values in the DataFrame
processor.calculate_or_correct_age()

# Convert specified columns to numeric data types
processor.convert_columns_to_numeric(config.numeric_columns)

# Rescale the blink counts, observed over 116 seconds, to a rate per 60 seconds
processor.standardize_data('Häufigkeit Blinzeln (/min)', std_unit=60, rel_unit=116)

INFO:root:Calculating and updating BMI.
INFO:root:Missing or incorrect ages calculated/corrected using 'Geb.-Datum'.
INFO:root:Converting specified columns to numeric values.
INFO:root:Standardizing data in column Häufigkeit Blinzeln (/min) to per 60 seconds, relative to 116 seconds.
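
The BMI and age steps can be sketched with the standard formulas: BMI is weight in kg divided by the square of height in m, and age is the number of completed years since the birth date. The functions below are illustrative stand-ins, not the `DataProcessor` methods. The reference date `2024-07-01` is an assumption chosen because it reproduces the ages in the processed preview; the date actually used by `calculate_or_correct_age` is not shown in this notebook.

```python
import pandas as pd


def bmi(weight_kg: pd.Series, height_cm: pd.Series) -> pd.Series:
    """Body mass index in kg/m^2, rounded to one decimal place."""
    height_m = height_cm / 100
    return (weight_kg / height_m**2).round(1)


def age_in_years(birth_dates: pd.Series, on: pd.Timestamp) -> pd.Series:
    """Completed years between each birth date and the reference date `on`."""
    birth = pd.to_datetime(birth_dates)
    had_birthday = (birth.dt.month < on.month) | (
        (birth.dt.month == on.month) & (birth.dt.day <= on.day)
    )
    return on.year - birth.dt.year - (~had_birthday).astype(int)


# First three rows of the preview shown earlier in this notebook
people = pd.DataFrame({
    "Geb.-Datum": ["2005-01-20", "2001-08-26", "2003-12-17"],
    "Körpergröße, cm": [160, 184, 178],
    "Gewicht, Kg": [58, 69, 69],
})
people["BMI"] = bmi(people["Gewicht, Kg"], people["Körpergröße, cm"])
people["Alter, Jahre"] = age_in_years(
    people["Geb.-Datum"], on=pd.Timestamp("2024-07-01")  # assumed reference date
)
print(people)
```

Deriving age from the birth date rather than trusting the recorded value keeps the column consistent with a single reference date, at the cost of making the result depend on when the notebook is run.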


### Identifying and Handling Invalid Data

Purpose: Identify rows with invalid data types, mark invalid data with a placeholder, and prepare the DataFrame for saving.

- Invalid Data Identification: Uses get_invalid_rows() to find rows with data types that do not match the expected types.
- Marking Invalid Data: Replaces invalid data with a placeholder value (NaN) for consistency.
- Saving Data: Prepares the DataFrame for saving to an Excel file (the save function is currently commented out).

In [None]:
df_invalid = processor.get_invalid_rows(config.expected_types)

processor.mark_invalid_data(config.expected_types, placeholder=np.nan)


# Save the DataFrame
#processor.save_to_excel()
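
One way to implement this kind of check is to coerce each column to its expected type and treat every value that was present but could not be converted as invalid. The sketch below follows that idea. The function bodies, the `"numeric"`/`"datetime"` labels, and the sample value `"k.A."` are assumptions for illustration, not the structure of `config.expected_types` or the actual `DataProcessor` methods.

```python
import numpy as np
import pandas as pd


def invalid_mask(df: pd.DataFrame, expected_types: dict) -> pd.DataFrame:
    """Boolean frame that is True where a present value fails type coercion."""
    mask = pd.DataFrame(False, index=df.index, columns=df.columns)
    for column, kind in expected_types.items():
        values = df[column]
        if kind == "numeric":
            converted = pd.to_numeric(values, errors="coerce")
        elif kind == "datetime":
            converted = pd.to_datetime(values, errors="coerce")
        else:
            raise ValueError(f"unsupported expected type: {kind!r}")
        # Missing values are not invalid; only failed conversions are
        mask[column] = values.notna() & converted.isna()
    return mask


def get_invalid_rows(df: pd.DataFrame, expected_types: dict) -> pd.DataFrame:
    """Rows that contain at least one invalid value."""
    return df[invalid_mask(df, expected_types).any(axis=1)]


def mark_invalid_data(df, expected_types, placeholder=np.nan) -> pd.DataFrame:
    """Copy of `df` with invalid values replaced by `placeholder`."""
    return df.mask(invalid_mask(df, expected_types), placeholder)


sample = pd.DataFrame({"Ruhepuls": [56, "k.A.", 60], "Type": ["a", "b", "c"]})
expected = {"Ruhepuls": "numeric"}   # hypothetical type specification
bad_rows = get_invalid_rows(sample, expected)
marked = mark_invalid_data(sample, expected)
print(bad_rows)
print(marked)
```

Collecting the invalid rows before overwriting them, as the cell above does, keeps a record of what was replaced.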

### Preview of the Processed DataFrame

Purpose: Display the first few rows of the processed DataFrame.

In [10]:

display(processor.df.head())

Unnamed: 0,Gesamtgruppe,Type,Gruppe,Geschlecht,Geb.-Datum,"Alter, Jahre","Körpergröße, cm","Gewicht, Kg",BMI,Ruhepuls,Fußlänge Rechts (cm),Fußlänge Links (cm),Handlänge Rechts (cm),Handlänge Links (cm),Priorisierte Hand,Wassermenge (ml/Tag),Stamina Rechts (s),Stamina Links (s),Luftanhalten (s),Häufigkeit Blinzeln (/min)
1,1,Studierende,1,männlich,2005-01-20,19,160,58.0,22.7,56.0,24.0,24.0,17.0,17.0,rechsthänder,1400.0,180.0,160.0,53.0,17.657551
2,2,Studierende,1,männlich,2001-08-26,22,184,69.0,20.4,60.0,26.0,26.0,18.0,18.0,rechsthänder,1000.0,365.0,275.0,120.0,23.543401
3,3,Studierende,1,weiblich,2003-12-17,20,178,69.0,21.8,56.0,27.0,27.0,18.0,18.0,rechsthänder,1500.0,160.0,140.0,60.0,5.350773
4,4,Studierende,1,männlich,2004-01-01,20,187,92.0,26.3,72.0,28.0,28.0,20.0,20.0,rechsthänder,3000.0,,,85.0,
5,5,Studierende,1,weiblich,2004-06-03,20,169,73.0,25.6,68.0,25.8,26.3,18.2,18.4,rechsthänder,2500.0,150.0,115.0,51.5,4.548157
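
The blink-rate column in this preview can be checked by hand. According to the log, counts observed over 116 seconds are rescaled to 60 seconds, that is, multiplied by 60/116. The helper below is a hypothetical re-implementation of that arithmetic, not the `standardize_data` method itself.

```python
def rescale_count(count: float, observed_seconds: float,
                  target_seconds: float = 60.0) -> float:
    """Convert a count observed over `observed_seconds` to a per-`target_seconds` rate."""
    return count * target_seconds / observed_seconds


raw_count = 66                      # first row of the raw preview
once = rescale_count(raw_count, observed_seconds=116)
twice = rescale_count(once, observed_seconds=116)

print(round(once, 2))               # about 34.14
print(round(twice, 6))              # about 17.657551, the value in the preview
```

A single rescale of the raw count 66 gives about 34.14 blinks per minute, whereas the preview shows 17.657551, which is the count multiplied by 60/116 twice. The other rows follow the same pattern (for example 88 becomes 23.543401). One plausible explanation is that the standardization cell was executed twice in the same session. This is an inference from the numbers rather than something the notebook states, so the column should be verified before it is used.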


## Conclusion

Summary: The cleaning steps turn the raw Excel export into a structured dataset with descriptive headers, corrected category labels, recalculated BMI and age values, and numeric column types. Values that do not match their expected type can be replaced with NaN. Measurements that were missing in the raw data remain missing, so any later statistical modeling still has to decide how to treat them.