# Initial Data Load and Inspection

## Objectives:
- Load the train and test datasets.
- Inspect column data types.
- Check for missing values.
- Verify column alignment between train and test datasets.
- Identify unique values in the target column.
- Check for duplicate rows.

1. Import Libraries and Define Paths

In [3]:
import pandas as pd
import streamlit as st  # Streamlit can be used for testing output interactively.

# Define file paths
train_data_path = '/workspace/bicycle_thefts_berlin/outputs/datasets/featured/TrainSet_Featured.csv'
test_data_path = '/workspace/bicycle_thefts_berlin/outputs/datasets/featured/TestSet_Featured.csv'

2. Load Datasets

In [4]:
try:
    train_data = pd.read_csv(train_data_path)
    test_data = pd.read_csv(test_data_path)
    print("Train and test data loaded successfully.")
except Exception as e:
    print(f"Error loading data: {e}")
    raise  # Stop here so later cells do not run without the DataFrames.

Train and test data loaded successfully.
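
The target column `VERSUCH` later shows a mix of string and integer values, which is consistent with pandas inferring a column's type chunk by chunk when reading a large file. A minimal sketch, assuming the column should be read as text throughout; the helper name `load_featured_csv` and the inline sample CSV are hypothetical and not part of the original notebook:

```python
import io

import pandas as pd


def load_featured_csv(source, text_columns=("VERSUCH",)):
    """Read a CSV while forcing the listed columns to be parsed as text."""
    # Assumption: these columns hold labels such as '0', '1', 'Unbekannt'.
    dtype_map = {col: str for col in text_columns}
    return pd.read_csv(source, dtype=dtype_map)


# Hypothetical in-memory sample standing in for TrainSet_Featured.csv.
sample = io.StringIO("VERSUCH,TATZEIT_ENDE_STUNDE\n0,11\n1,7\nUnbekannt,16\n")
demo_load = load_featured_csv(sample)
print(demo_load["VERSUCH"].tolist())
```

Passing `train_data_path` instead of the sample would apply the same rule to the real file; whether text is the right type for the target is a modelling choice the notebook does not state.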


3. Display Data Overview

In [5]:
print("### Train Data Overview:")
print(train_data.head())

print("### Test Data Overview:")
print(test_data.head())

### Train Data Overview:
  ANGELEGT_AM  TATZEIT_ANFANG_STUNDE TATZEIT_ENDE_DATUM  TATZEIT_ENDE_STUNDE  \
0  2023-05-04               0.812322         2023-05-04                   11   
1  2023-08-09               1.371716         2023-08-09                    7   
2  2022-05-02               0.252928         2022-05-02                   16   
3  2023-10-11               0.066463         2023-07-03                   14   
4  2022-09-05              -0.865860         2022-09-05                   15   

  VERSUCH                               ERFASSUNGSGRUND  \
0       0   Sonstiger schwerer Diebstahl von Fahrrädern   
1       0   Sonstiger schwerer Diebstahl von Fahrrädern   
2       0   Sonstiger schwerer Diebstahl von Fahrrädern   
3       0   Sonstiger schwerer Diebstahl von Fahrrädern   
4       0   Sonstiger schwerer Diebstahl von Fahrrädern   

   ART_DES_FAHRRADS_Fahrrad  ART_DES_FAHRRADS_Herrenfahrrad  \
0                     False                            True   
1            

4. Check Data Types

In [6]:
print("### Data Types:")
print(train_data.dtypes)

### Data Types:
ANGELEGT_AM                             object
TATZEIT_ANFANG_STUNDE                  float64
TATZEIT_ENDE_DATUM                      object
TATZEIT_ENDE_STUNDE                      int64
VERSUCH                                 object
ERFASSUNGSGRUND                         object
ART_DES_FAHRRADS_Fahrrad                  bool
ART_DES_FAHRRADS_Herrenfahrrad            bool
ART_DES_FAHRRADS_Kinderfahrrad            bool
ART_DES_FAHRRADS_Lastenfahrrad            bool
ART_DES_FAHRRADS_Mountainbike             bool
ART_DES_FAHRRADS_Rennrad                  bool
ART_DES_FAHRRADS_diverse Fahrräder        bool
DELIKT_Keller- und Bodeneinbruch          bool
TATZEIT_ANFANG_YEAR                      int64
TATZEIT_ANFANG_MONTH                     int64
dtype: object
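
The date columns `ANGELEGT_AM` and `TATZEIT_ENDE_DATUM` are reported as `object`, so they are still plain strings. A possible conversion is sketched below, assuming the `YYYY-MM-DD` format seen in the overview holds for every row; the helper name `parse_date_columns` and the sample frame are hypothetical:

```python
import pandas as pd


def parse_date_columns(df, columns, date_format="%Y-%m-%d"):
    """Return a copy of df with the given columns converted to datetime64."""
    result = df.copy()
    for col in columns:
        # errors="coerce" turns unparseable entries into NaT instead of raising.
        result[col] = pd.to_datetime(result[col], format=date_format, errors="coerce")
    return result


# Hypothetical sample; the second ANGELEGT_AM value is deliberately invalid.
demo_dates = pd.DataFrame({
    "ANGELEGT_AM": ["2023-05-04", "not a date"],
    "TATZEIT_ENDE_DATUM": ["2023-05-04", "2023-08-09"],
})
parsed_dates = parse_date_columns(demo_dates, ["ANGELEGT_AM", "TATZEIT_ENDE_DATUM"])
print(parsed_dates.dtypes)
```

Counting `NaT` values after the conversion shows how many rows did not match the assumed format.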


5. Check Missing Values

In [7]:
print("### Missing Values in Train Data:")
print(train_data.isnull().sum())

print("### Missing Values in Test Data:")
print(test_data.isnull().sum())

### Missing Values in Train Data:
ANGELEGT_AM                            0
TATZEIT_ANFANG_STUNDE                  0
TATZEIT_ENDE_DATUM                     0
TATZEIT_ENDE_STUNDE                    0
VERSUCH                                0
ERFASSUNGSGRUND                        0
ART_DES_FAHRRADS_Fahrrad               0
ART_DES_FAHRRADS_Herrenfahrrad         0
ART_DES_FAHRRADS_Kinderfahrrad         0
ART_DES_FAHRRADS_Lastenfahrrad         0
ART_DES_FAHRRADS_Mountainbike          0
ART_DES_FAHRRADS_Rennrad               0
ART_DES_FAHRRADS_diverse Fahrräder     0
DELIKT_Keller- und Bodeneinbruch       0
TATZEIT_ANFANG_YEAR                    0
TATZEIT_ANFANG_MONTH                   0
dtype: int64
### Missing Values in Test Data:
ANGELEGT_AM                            0
TATZEIT_ANFANG_STUNDE                  0
TATZEIT_ENDE_DATUM                     0
TATZEIT_ENDE_STUNDE                    0
VERSUCH                                0
ERFASSUNGSGRUND                        0
ART_DES_FAHRRADS_F
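
`isnull()` only counts true nulls, so a placeholder such as `Unbekannt` (which appears in `VERSUCH`) is not reported above. A minimal sketch that counts both, assuming `Unbekannt` is the only placeholder in use; the helper name `missing_summary` and the sample frame are hypothetical:

```python
import pandas as pd


def missing_summary(df, placeholders=("Unbekannt",)):
    """Count nulls and placeholder values per column, keeping only affected columns."""
    summary = pd.DataFrame({
        "n_null": df.isnull().sum(),
        "n_placeholder": df.isin(list(placeholders)).sum(),
    })
    affected = (summary["n_null"] > 0) | (summary["n_placeholder"] > 0)
    return summary[affected]


# Hypothetical sample with one null and one placeholder in VERSUCH.
demo_missing = pd.DataFrame({
    "VERSUCH": ["0", "Unbekannt", None],
    "TATZEIT_ENDE_STUNDE": [11, 7, 16],
})
missing_report = missing_summary(demo_missing)
print(missing_report)
```

Running the same function on `train_data` and `test_data` keeps the report short, since columns without any gaps are filtered out.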

6. Ensure Column Alignment

In [8]:
if set(train_data.columns) == set(test_data.columns):
    print("Train and test datasets have matching columns.")
else:
    print("Train and test datasets do not have matching columns!")
    print("Columns in Train Data:", train_data.columns)
    print("Columns in Test Data:", test_data.columns)

Train and test datasets have matching columns.
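
Comparing sets confirms that both datasets contain the same column names, but it does not cover column order or data types. A stricter check could look like the sketch below; the helper name `compare_columns` and the sample frames are hypothetical:

```python
import pandas as pd


def compare_columns(train, test):
    """Report differences in column names, order, and dtypes between two frames."""
    train_cols, test_cols = list(train.columns), list(test.columns)
    train_set, test_set = set(train_cols), set(test_cols)
    shared = [c for c in train_cols if c in test_set]
    return {
        "only_in_train": [c for c in train_cols if c not in test_set],
        "only_in_test": [c for c in test_cols if c not in train_set],
        "same_order": train_cols == test_cols,
        "dtype_mismatches": {
            c: (str(train[c].dtype), str(test[c].dtype))
            for c in shared
            if train[c].dtype != test[c].dtype
        },
    }


# Hypothetical frames: same names, different order, one dtype mismatch.
demo_train = pd.DataFrame({"a": [1], "b": [True]})
demo_test = pd.DataFrame({"b": [False], "a": [1.5]})
column_report = compare_columns(demo_train, demo_test)
print(column_report)
```

A dtype mismatch on a shared column is worth checking before modelling, since a column parsed as `object` in one split and as `int64` in the other would pass the name comparison.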


7. Unique Values in Target Column

In [9]:
TARGET_COLUMN = 'VERSUCH'
if TARGET_COLUMN in train_data.columns:
    print(f"### Unique values in the target column `{TARGET_COLUMN}`:")
    print(train_data[TARGET_COLUMN].unique())
else:
    print(f"Target column `{TARGET_COLUMN}` not found in train data.")

### Unique values in the target column `VERSUCH`:
['0' '1' 'Unbekannt' 0 1]
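
The output mixes the strings `'0'` and `'1'` with the integers `0` and `1`, plus the label `Unbekannt`, so the same class appears under two different types. One way to unify them is sketched below, assuming `Unbekannt` should become a missing value; whether such rows are dropped, imputed, or kept as their own class is a decision the notebook does not make. The helper name `normalize_target` is hypothetical:

```python
import pandas as pd


def normalize_target(series, unknown_label="Unbekannt"):
    """Map '0'/'1'/0/1 to nullable integers and the unknown label to <NA>."""
    as_text = series.astype(str).str.strip()
    # Mask the unknown label first so it becomes missing rather than an error.
    masked = as_text.where(as_text != unknown_label)
    numeric = pd.to_numeric(masked, errors="coerce")
    return numeric.astype("Int64")


# Sample mirroring the unique values printed above.
demo_target = pd.Series(["0", "1", "Unbekannt", 0, 1], dtype=object)
clean_target = normalize_target(demo_target)
print(clean_target.tolist())
```

The nullable `Int64` dtype keeps the known values as integers while still allowing missing entries.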


8. Check for Duplicate Rows

In [10]:
print("### Duplicate Rows in Train Data:")
print(train_data.duplicated().sum())

print("### Duplicate Rows in Test Data:")
print(test_data.duplicated().sum())

### Duplicate Rows in Train Data:
566
### Duplicate Rows in Test Data:
40
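
With 566 duplicate rows in the train set and 40 in the test set, a follow-up step might remove them. A minimal sketch, assuming exact duplicates are unwanted; with features this coarse, identical rows could also be distinct thefts that happen to share all attributes, so dropping them is a judgment call. The helper name `drop_exact_duplicates` and the sample frame are hypothetical:

```python
import pandas as pd


def drop_exact_duplicates(df):
    """Drop fully identical rows, keeping the first occurrence of each."""
    before = len(df)
    deduped = df.drop_duplicates(keep="first").reset_index(drop=True)
    return deduped, before - len(deduped)


# Hypothetical sample in which the first and third rows are identical.
demo_dupes = pd.DataFrame({
    "TATZEIT_ENDE_STUNDE": [11, 7, 11],
    "VERSUCH": ["0", "1", "0"],
})
deduped_demo, n_removed = drop_exact_duplicates(demo_dupes)
print(f"Removed {n_removed} duplicate row(s); {len(deduped_demo)} remain.")
```

Inspecting the rows with `df[df.duplicated(keep=False)]` before removing them shows whether they look like repeated records or genuinely separate cases.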
