# 1. Step: Raw Data Collection
Corresponds to Chapter 2 of the course "Introduction to digital twin" by Romain CHASSAGNE - https://rlchassagne.github.io/

At this stage, the data is "messy." It comes from different sources (forms, manual entry) with inconsistent formats, typos, and missing values.

In [1]:
import pandas as pd
import numpy as np

# Creating the "Messy" Raw Dataset
data = {
    'Name': ['mathieu Durand', 'Alice Bernard', 'MATHIEU DURAND', 'Jean Petit', None],
    'Enrollment_Date': ['22/03/24', '2024-03-21', '22-03-2024', 'April 15th', '20/03/24'],
    'City': ['Paris', 'lyon', 'PARIS', 'Marseille', 'Lille'],
    'Phone': ['0601020304', '+33 7 88 99 00', '06.01.02.03.04', '0140', '0611223344']
}

df = pd.DataFrame(data)
print("--- RAW DATA ---")
df

--- RAW DATA ---


Unnamed: 0,Name,Enrollment_Date,City,Phone
0,mathieu Durand,22/03/24,Paris,0601020304
1,Alice Bernard,2024-03-21,lyon,+33 7 88 99 00
2,MATHIEU DURAND,22-03-2024,PARIS,06.01.02.03.04
3,Jean Petit,April 15th,Marseille,0140
4,,20/03/24,Lille,0611223344
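
Before cleaning anything, it can help to put numbers on how "messy" the data is. The following is a minimal sketch, not part of the course material: it rebuilds the same raw records so that it runs on its own, counts missing values per column, and compares the number of distinct values with and without letter case. The names `raw` and `distinct_counts` are hypothetical and introduced only for this illustration.

```python
import pandas as pd

# Same raw records as in the cell above, repeated so the sketch is self-contained
raw = pd.DataFrame({
    'Name': ['mathieu Durand', 'Alice Bernard', 'MATHIEU DURAND', 'Jean Petit', None],
    'Enrollment_Date': ['22/03/24', '2024-03-21', '22-03-2024', 'April 15th', '20/03/24'],
    'City': ['Paris', 'lyon', 'PARIS', 'Marseille', 'Lille'],
    'Phone': ['0601020304', '+33 7 88 99 00', '06.01.02.03.04', '0140', '0611223344'],
})

# Number of missing values in each column
missing_per_column = raw.isna().sum()


def distinct_counts(column: pd.Series) -> tuple:
    """Hypothetical helper: distinct values as typed vs. ignoring letter case."""
    values = column.dropna()
    return values.nunique(), values.str.lower().nunique()


name_counts = distinct_counts(raw['Name'])
city_counts = distinct_counts(raw['City'])

print(missing_per_column)
print("Name (as typed, case-insensitive):", name_counts)
print("City (as typed, case-insensitive):", city_counts)
```

A gap between the two counts for a column suggests that some entries differ only by capitalization, which is one of the issues the harmonization step addresses.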


### 2. Step: Sorting
Sorting organizes the data to make visual inspection easier and prepares it for the next step. It helps identify potential duplicates or obvious gaps in the dataset. Here we simply sort by name in alphabetical order.

In [4]:
# Sorting by Name to group similar entries together
df_sorted = df.sort_values(by='Name').copy()

print("--- SORTED DATA ---")
df_sorted

--- SORTED DATA ---


Unnamed: 0,Name,Enrollment_Date,City,Phone
1,Alice Bernard,2024-03-21,lyon,+33 7 88 99 00
3,Jean Petit,April 15th,Marseille,0140
2,MATHIEU DURAND,22-03-2024,PARIS,06.01.02.03.04
0,mathieu Durand,22/03/24,Paris,0601020304
4,,20/03/24,Lille,0611223344
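
Note that the default sort compares characters by code point, so uppercase letters come before lowercase ones. In this dataset the two "Mathieu Durand" rows still end up next to each other, but a name starting with "Z" would land between "MATHIEU DURAND" and "mathieu Durand". A possible variant, shown here only as a sketch, sorts on a lowercased key through the `key` parameter of `sort_values` (available in pandas 1.1 and later). The frame name `people` is an assumption made for this example.

```python
import pandas as pd

# Subset of the raw records, repeated so the sketch is self-contained
people = pd.DataFrame({
    'Name': ['mathieu Durand', 'Alice Bernard', 'MATHIEU DURAND', 'Jean Petit', None],
    'City': ['Paris', 'lyon', 'PARIS', 'Marseille', 'Lille'],
})

# Compare names in lowercase so that capitalization does not affect the order.
# kind='stable' keeps the original relative order of rows whose keys are equal,
# and na_position='last' sends the missing name to the end.
people_sorted = people.sort_values(
    by='Name',
    key=lambda names: names.str.lower(),
    kind='stable',
    na_position='last',
)

print(people_sorted)
```

The stored values are left untouched. Only the comparison is case-insensitive, so the actual harmonization of names still happens in the next step.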


### 3. Step: Data Harmonization (Standardization)
In this phase, we apply uniform formatting rules. This ensures that we can compare data that "looks" different but means the same thing.

In [3]:
# 1. Standardize Names
df_sorted['Name'] = df_sorted['Name'].str.upper()

# 2. Standardize Cities
df_sorted['City'] = df_sorted['City'].str.capitalize()

# 3. Standardize Dates (format='mixed' requires pandas >= 2.0; unparseable values become NaT)
df_sorted['Enrollment_Date'] = pd.to_datetime(df_sorted['Enrollment_Date'], errors='coerce', format='mixed')

# 4. Standardize Phone Numbers
df_sorted['Phone'] = df_sorted['Phone'].str.replace(r'[.\s+]', '', regex=True)

print("--- HARMONIZED DATA ---")
df_sorted

--- HARMONIZED DATA ---


Unnamed: 0,Name,Enrollment_Date,City,Phone
1,ALICE BERNARD,2024-03-21,Lyon,337889900
3,JEAN PETIT,NaT,Marseille,0140
2,MATHIEU DURAND,2024-03-22,Paris,0601020304
0,MATHIEU DURAND,2024-03-22,Paris,0601020304
4,,2024-03-20,Lille,0611223344


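Enrollment date: "April 15th" could not be parsed and became `NaT`. A single standard for the date format (for example ISO 8601, `YYYY-MM-DD`) needs to be defined.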
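
One way to make such a standard explicit is to list the accepted input formats and try them in order, instead of letting the parser guess. The sketch below assumes day-first input (the French convention) and ISO 8601 as the target. The list `ACCEPTED_FORMATS` and the variable names are assumptions made for this illustration, not something defined in the course.

```python
import pandas as pd

# Raw date strings from the dataset, repeated so the sketch is self-contained
raw_dates = pd.Series(['22/03/24', '2024-03-21', '22-03-2024', 'April 15th', '20/03/24'])

# Assumed list of accepted input formats, all unambiguous about day vs. month
ACCEPTED_FORMATS = ['%Y-%m-%d', '%d/%m/%y', '%d-%m-%Y']

# Try each format in turn; a value keeps the first format that parses it
parsed = pd.to_datetime(raw_dates, format=ACCEPTED_FORMATS[0], errors='coerce')
for fmt in ACCEPTED_FORMATS[1:]:
    parsed = parsed.fillna(pd.to_datetime(raw_dates, format=fmt, errors='coerce'))

# Target standard: ISO 8601 text (YYYY-MM-DD); unparsed values stay missing
iso_dates = parsed.dt.strftime('%Y-%m-%d')

# Values that match none of the accepted formats need a manual decision
needs_review = raw_dates[parsed.isna()]

print(iso_dates)
print(needs_review)
```

With an explicit list, a value such as "April 15th" is set aside for review rather than silently guessed, and an ambiguous string like "03/04/24" is always read as 3 April under the day-first assumption.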

##### Is there anything else to harmonize or clean?
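
One possible direction, sketched under stated assumptions: the phone numbers still mix an international prefix with national numbers, two rows now describe the same person, and one name is missing. The rules below (rewriting a leading `33` as `0`, treating a valid number as exactly 10 digits starting with `0`) and the names `harmonized`, `to_national`, `Phone_Valid`, `Is_Duplicate`, and `Name_Missing` are hypothetical choices for illustration, not the course's prescribed answer.

```python
import pandas as pd

# Values as the harmonization code above produces them (phones are strings)
harmonized = pd.DataFrame({
    'Name': ['ALICE BERNARD', 'JEAN PETIT', 'MATHIEU DURAND', 'MATHIEU DURAND', None],
    'City': ['Lyon', 'Marseille', 'Paris', 'Paris', 'Lille'],
    'Phone': ['337889900', '0140', '0601020304', '0601020304', '0611223344'],
}, index=[1, 3, 2, 0, 4])


def to_national(phone: str) -> str:
    """Hypothetical rule: rewrite a leading country code 33 as a national 0."""
    return '0' + phone[2:] if phone.startswith('33') else phone


harmonized['Phone'] = harmonized['Phone'].map(to_national)

# Assumed validity rule: exactly 10 digits, starting with 0
harmonized['Phone_Valid'] = harmonized['Phone'].str.fullmatch(r'0\d{9}')

# Flag rows that repeat an earlier (Name, Phone) pair
harmonized['Is_Duplicate'] = harmonized.duplicated(subset=['Name', 'Phone'], keep='first')

# Flag rows with no name so they can be completed or discarded later
harmonized['Name_Missing'] = harmonized['Name'].isna()

print(harmonized)
```

Flagging rows instead of deleting them keeps the decision (correct, merge, or drop) visible for a later cleaning step.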