# Exploratory Data Analysis (EDA) Notebook
 
## Objectives:
- Understand the distribution of the data.
- Identify patterns, correlations, and trends.
- Detect potential outliers and missing values.
- Generate visualizations to summarize key insights.

## 1. Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Configure visualization settings
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)

## 2. Load Data

In [2]:
# Load datasets
train_data_path = '/workspace/bicycle_thefts_berlin/outputs/datasets/featured/TrainSet_Featured.csv'
test_data_path = '/workspace/bicycle_thefts_berlin/outputs/datasets/featured/TestSet_Featured.csv'

# Read CSV files
train_data = pd.read_csv(train_data_path)
test_data = pd.read_csv(test_data_path)

# Display data overview
print("Train Data Overview:")
train_data.info()  # info() prints its report itself and returns None
print(train_data.head())

print("\nTest Data Overview:")
test_data.info()  # wrapping this call in print() would also print "None"
print(test_data.head())

Train Data Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34508 entries, 0 to 34507
Data columns (total 16 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   ANGELEGT_AM                          34508 non-null  object 
 1   TATZEIT_ANFANG_STUNDE                34508 non-null  float64
 2   TATZEIT_ENDE_DATUM                   34508 non-null  object 
 3   TATZEIT_ENDE_STUNDE                  34508 non-null  int64  
 4   VERSUCH                              34508 non-null  object 
 5   ERFASSUNGSGRUND                      34508 non-null  object 
 6   ART_DES_FAHRRADS_Fahrrad             34508 non-null  bool   
 7   ART_DES_FAHRRADS_Herrenfahrrad       34508 non-null  bool   
 8   ART_DES_FAHRRADS_Kinderfahrrad       34508 non-null  bool   
 9   ART_DES_FAHRRADS_Lastenfahrrad       34508 non-null  bool   
 10  ART_DES_FAHRRADS_Mountainbike        34508 non-null  bool   
 11  ART_DES


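The overview above lists `ANGELEGT_AM` and `TATZEIT_ENDE_DATUM` with dtype `object`, which suggests they were read as plain strings. Below is a minimal sketch of converting such columns to datetimes before any time-based analysis. The helper name `parse_date_columns`, the sample values, and the `%Y-%m-%d` format are assumptions for illustration, since the actual date format in the CSV files is not shown here.

```python
import pandas as pd


def parse_date_columns(df, columns, date_format=None):
    """Return a copy of df with the given columns converted to datetime.

    Values that cannot be parsed become NaT instead of raising an error,
    so they show up in a later missing-value check.
    """
    out = df.copy()
    for col in columns:
        out[col] = pd.to_datetime(out[col], format=date_format, errors="coerce")
    return out


# Hypothetical sample rows; the real values in the dataset may look different.
sample = pd.DataFrame({
    "ANGELEGT_AM": ["2022-01-03", "2022-02-14", "not a date"],
    "TATZEIT_ENDE_DATUM": ["2022-01-02", "2022-02-13", "2022-03-01"],
})

parsed = parse_date_columns(
    sample,
    ["ANGELEGT_AM", "TATZEIT_ENDE_DATUM"],
    date_format="%Y-%m-%d",  # assumed format, adjust to match the data
)

print(parsed.dtypes)
print(parsed.isna().sum())
```

Using `errors="coerce"` trades strictness for visibility: malformed entries do not stop the conversion, but they should be counted afterwards so they are not silently dropped.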
## 3. Data Summary

In [3]:
# Check for missing values
print("Missing Values in Train Data:")
print(train_data.isnull().sum())

print("\nMissing Values in Test Data:")
print(test_data.isnull().sum())

# Summary statistics
print("\nSummary Statistics for Train Data:")
print(train_data.describe())

# Check the unique values in the target column
TARGET_COLUMN = 'VERSUCH'
print(f"\nUnique values in target column ({TARGET_COLUMN}):", train_data[TARGET_COLUMN].unique())

Missing Values in Train Data:
ANGELEGT_AM                            0
TATZEIT_ANFANG_STUNDE                  0
TATZEIT_ENDE_DATUM                     0
TATZEIT_ENDE_STUNDE                    0
VERSUCH                                0
ERFASSUNGSGRUND                        0
ART_DES_FAHRRADS_Fahrrad               0
ART_DES_FAHRRADS_Herrenfahrrad         0
ART_DES_FAHRRADS_Kinderfahrrad         0
ART_DES_FAHRRADS_Lastenfahrrad         0
ART_DES_FAHRRADS_Mountainbike          0
ART_DES_FAHRRADS_Rennrad               0
ART_DES_FAHRRADS_diverse FahrrÃ¤der    0
DELIKT_Keller- und Bodeneinbruch       0
TATZEIT_ANFANG_YEAR                    0
TATZEIT_ANFANG_MONTH                   0
dtype: int64
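
The label `ART_DES_FAHRRADS_diverse FahrrÃ¤der` in the listing above appears to be UTF-8 text that was decoded as Latin-1 somewhere upstream (the intended word is presumably "Fahrräder"). Below is a minimal sketch of repairing such labels. The helper name `repair_mojibake` is hypothetical, and the Latin-1 assumption about how the text was mis-decoded is a guess based on the characters shown.

```python
import pandas as pd


def repair_mojibake(label):
    """Re-decode a label that was UTF-8 but read as Latin-1.

    Labels that do not fit this pattern are returned unchanged.
    """
    try:
        return label.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return label


# Empty frame that only carries column labels, for illustration.
sample = pd.DataFrame(columns=["ART_DES_FAHRRADS_diverse FahrrÃ¤der", "VERSUCH"])
sample = sample.rename(columns=repair_mojibake)

print(list(sample.columns))
```

If the labels were garbled when the CSV was read, passing the correct `encoding` to `pd.read_csv` at the source may be the cleaner fix. Renaming afterwards is only a workaround.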

Missing Values in Test Data:
ANGELEGT_AM                            0
TATZEIT_ANFANG_STUNDE                  0
TATZEIT_ENDE_DATUM                     0
TATZEIT_ENDE_STUNDE                    0
VERSUCH                                0
ERFASSUNGSGRUND                        0
ART_DES_FAHRRADS_Fahrrad 