#### Data set cleaning

Dataset: 
 
 - _logs_exp_us.csv_

Author: Luis Sergio Pastrana Lemus  
Date: 20250909

# Data Cleaning – Food Supplier Dataset

## __1. Libraries__.

In [1]:
from pathlib import Path
import sys

# Define project root dynamically: take the notebook's current working directory and move one level up
project_root = Path.cwd().parent

# Add src to sys.path if it is not already
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

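# Import only the helpers this notebook uses (more controlled than import *)
from src import (
    check_existing_missing_values,
    detect_implicit_duplicates_fuzzy,
    format_notebook,
    load_dataset_from_csv,
    normalize_columns_headers_format,
    normalize_datetime,
    normalize_string_format,
)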


from IPython.display import display, HTML
import os
import pandas as pd

## __2. Path to Data file__.

In [2]:
data_file_path = project_root / "data" / "raw"

df_fs = load_dataset_from_csv(data_file_path, "logs_exp_us.csv", sep='\t', header='infer', keep_default_na=False)


##### `LSPL`

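**Note:** `keep_default_na=False` keeps pandas from converting its default missing-value markers (empty strings, `NA`, `NaN`, and so on) to `NaN` while loading, assuming the helper passes the argument through to `pd.read_csv`. Missing values can then be converted explicitly to `pd.NA` later.  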
This is beneficial because `pd.NA` provides:

- Consistency across data types  
- Type integrity preservation  
- Cleaner logical operations  
- Improved control over missing data

Since high performance or heavy computation is not required here, using `pd.NA` is appropriate.
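
The effect described in the note can be checked on a small sample. This is only a sketch: the inline data is made up, and it calls `pd.read_csv` directly rather than the project's `load_dataset_from_csv` helper, whose internals are not shown here.

```python
import io

import pandas as pd

# Tiny tab-separated sample with two empty fields (hypothetical data).
raw_text = "eventname\tnote\nmain\t\n\tok\n"

# With keep_default_na=False, empty fields stay as empty strings instead of NaN.
raw = pd.read_csv(io.StringIO(raw_text), sep="\t", keep_default_na=False)
missing_before = int(raw.isna().sum().sum())

# Convert the chosen marker (here, the empty string) to pd.NA explicitly.
clean = raw.replace("", pd.NA)
missing_after = int(clean.isna().sum().sum())

print(missing_before, missing_after)
```

Loading first and converting afterwards keeps the decision about what counts as "missing" in the notebook rather than in the parser defaults.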

In [3]:
# Format notebook output
format_notebook()

## __3. Data set cleaning__.

In [4]:
df_fs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244126 entries, 0 to 244125
Data columns (total 4 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   EventName       244126 non-null  object
 1   DeviceIDHash    244126 non-null  int64 
 2   EventTimestamp  244126 non-null  int64 
 3   ExpId           244126 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 7.5+ MB


### 3.1. Standardizing String values using "snake case".

#### 3.1.1 Standardizing Column Labels.

In [5]:
# Standardize column labels with snake_case format
df_fs = normalize_columns_headers_format(df_fs)
df_fs.columns

Index(['eventname', 'deviceidhash', 'eventtimestamp', 'expid'], dtype='object')

#### 3.1.2 Standardizing Dataframe String values.

In [6]:
# Standardize data frame string values with snake_case format
df_fs = normalize_string_format(df_fs, include=['eventname'])
df_fs

| | eventname | deviceidhash | eventtimestamp | expid |
|---|---|---|---|---|
| 0 | mainscreenappear | 4575588528974610257 | 1564029816 | 246 |
| 1 | mainscreenappear | 7416695313311560658 | 1564053102 | 246 |
| 2 | paymentscreensuccessful | 3518123091307005509 | 1564054127 | 248 |
| 3 | cartscreenappear | 3518123091307005509 | 1564054127 | 248 |
| 4 | paymentscreensuccessful | 6217807653094995999 | 1564055322 | 248 |
| ... | ... | ... | ... | ... |
| 244121 | mainscreenappear | 4599628364049201812 | 1565212345 | 247 |
| 244122 | mainscreenappear | 5849806612437486590 | 1565212439 | 246 |
| 244123 | mainscreenappear | 5746969938801999050 | 1565212483 | 246 |
| 244124 | mainscreenappear | 5746969938801999050 | 1565212498 | 246 |


##### `LSPL`

**Note:** 

The column names and string values did not follow a consistent format: they contained capital letters (for example `EventName` and `DeviceIDHash`) and, potentially, spaces, which made them difficult to manipulate.

__Solution__: Column names and string values were standardized to lowercase, with spaces removed, following the snake_case convention. Labels without word separators, such as `EventName`, end up as `eventname`.  
__Impact__: Data access and manipulation are simpler, which improves readability and reduces errors in the analysis.
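
As a rough illustration of this kind of normalization, the sketch below lowercases labels and collapses separators into underscores. The function `to_snake_like` and the sample frame are hypothetical; the project's `normalize_columns_headers_format` and `normalize_string_format` helpers may work differently.

```python
import re

import pandas as pd


def to_snake_like(label: str) -> str:
    """Lowercase a label and collapse runs of non-alphanumerics into '_'."""
    collapsed = re.sub(r"[^0-9A-Za-z]+", "_", str(label).strip())
    return collapsed.strip("_").lower()


# Hypothetical frame with inconsistent labels and values.
demo = pd.DataFrame(
    {"EventName": ["MainScreenAppear", "Cart Screen Appear"], " Exp Id ": [246, 248]}
)
demo.columns = [to_snake_like(c) for c in demo.columns]
demo["eventname"] = demo["eventname"].map(to_snake_like)

print(list(demo.columns))
print(demo["eventname"].tolist())
```

Because the sketch only replaces existing separators, a CamelCase label with no separators is simply lowercased, which matches the column names shown in the output above.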

### 3.2 Explicit duplicates.

In [7]:
# Show the number of explicit duplicate rows
display(HTML(f"> Explicit duplicate rows in DataFrame <i>'df_fs'</i>: <b>{df_fs.duplicated().sum()}</b>"))

In [8]:
# Delete explicit duplicate rows
df_fs = df_fs.drop_duplicates().reset_index(drop=True)

display(HTML(f"> Explicit duplicate rows remaining in DataFrame <i>'df_fs'</i>: <b>{df_fs.duplicated().sum()}</b>"))

##### `LSPL`

**Note:** 

413 explicit duplicate rows were found and removed, reducing the dataset from 244,126 to 243,713 rows.
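
The same two pandas calls can be tried on a toy frame (made-up data), together with the row arithmetic for this dataset taken from the two `info()` outputs:

```python
import pandas as pd

# Hypothetical frame: the second row repeats the first one exactly.
demo = pd.DataFrame(
    {"eventname": ["main", "main", "cart", "main"], "expid": [246, 246, 248, 247]}
)

n_dupes = int(demo.duplicated().sum())  # rows identical to an earlier row
deduped = demo.drop_duplicates().reset_index(drop=True)

# Row counts reported in this notebook: before cleaning and duplicates removed.
rows_before, dupes_removed = 244126, 413
rows_after = rows_before - dupes_removed

print(n_dupes, len(deduped), rows_after)
```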

### 3.3 Missing values.

#### 3.3.1 Missing values check.

In [9]:
# Show missing values
check_existing_missing_values(df_fs)




#### 3.3.2 Replacing missing values (pd.NA).

In [10]:
# Replace missing values with pd.NA
# No need

#### 3.3.3 Preview missing values.

In [11]:
# Show missing values heatmap
# No need

In [12]:
# Show pd.NA missing values for 'columns' column
# No need

#### 3.3.4  Missing values data imputation.

In [13]:
# No need

##### `LSPL`

**Note:** 

1. No missing values were found, so no replacement or imputation was needed.
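
Because the file was loaded with `keep_default_na=False`, a check based only on `isna()` would not see blank strings. The sketch below counts both kinds per column. The helper `count_missing_like` and its marker list are assumptions for illustration, not the implementation of `check_existing_missing_values`.

```python
import pandas as pd


def count_missing_like(df, markers=("", "na", "n/a", "null", "none")):
    """Count real NA values and text markers that usually mean 'missing'."""
    rows = {}
    for name in df.columns:
        col = df[name]
        true_na = int(col.isna().sum())
        # Compare as trimmed lowercase text; skip real NA so it is not counted twice.
        as_text = col[col.notna()].astype(str).str.strip().str.lower()
        marker_hits = int(as_text.isin(list(markers)).sum())
        rows[name] = {"na": true_na, "marker": marker_hits}
    return pd.DataFrame.from_dict(rows, orient="index")


# Hypothetical frame with one real NA and two text markers.
demo = pd.DataFrame({"eventname": ["main", "", " NA "], "expid": [246, None, 248]})
report = count_missing_like(demo)
print(report)
```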


### 3.4 Implicit duplicates.

#### 3.4.1 Implicit duplicates check.

In [14]:
# Show implicit duplicates in df_fs 'eventname'
detect_implicit_duplicates_fuzzy(df_fs, 'eventname')

> Scanning for duplicates ...: 100%|██████████| 5/5 [00:00<00:00, 12018.06it/s]
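
One possible way to look for implicit duplicates is to compare every pair of unique values and flag near-matches. The sketch below uses the standard library's `difflib.SequenceMatcher`. The function name, the threshold, and the sample values are assumptions, and `detect_implicit_duplicates_fuzzy` may rely on a different similarity measure.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_similar_values(values, threshold=0.9):
    """Return pairs of distinct values whose similarity ratio meets the threshold."""
    unique_values = sorted(set(values))
    pairs = []
    for left, right in combinations(unique_values, 2):
        score = SequenceMatcher(None, left, right).ratio()
        if score >= threshold:
            pairs.append((left, right, round(score, 3)))
    return pairs


# Hypothetical event names, one of them misspelled.
events = ["mainscreenappear", "mainscreenapear", "cartscreenappear", "tutorial"]
similar = find_similar_values(events)
print(similar)
```

With only a handful of unique event names, the pairwise comparison is cheap. For columns with many distinct values the number of pairs grows quadratically.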


#### 3.4.2 Implicit duplicates data imputation.

In [15]:
# No need

### 3.5 Casting data types.

#### 3.5.1 Casting to string data type.

In [16]:
# No need

#### 3.5.2 Casting to numeric data type.

In [17]:
# No need

#### 3.5.3 Casting to category data type.

In [18]:
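# df_fs 'eventname' to category
# NOTE: assigning through .loc[:, col] writes the values into the existing column,
# so the dtype stays object (see the output below); assigning with
# df_fs['eventname'] = ... would replace the column and change its dtype
df_fs.loc[:, 'eventname'] = df_fs['eventname'].astype('category')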
df_fs['eventname'].dtype

dtype('O')
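
The `dtype('O')` output shows that the column is still of object dtype after the cast. A minimal sketch on made-up data contrasts the two assignment styles. The behavior of the `.loc` form depends on the pandas version (recent versions write into the existing column in place), so only the direct assignment should be relied on to change the dtype.

```python
import pandas as pd

demo = pd.DataFrame({"eventname": ["main", "cart", "main"]}, dtype="object")

# Assigning through .loc[:, col] sets values in the existing column;
# on recent pandas versions the column keeps its object dtype.
via_loc = demo.copy()
via_loc.loc[:, "eventname"] = via_loc["eventname"].astype("category")

# Assigning to the column label replaces the column, so the dtype changes.
direct = demo.copy()
direct["eventname"] = direct["eventname"].astype("category")

print(via_loc["eventname"].dtype)
print(direct["eventname"].dtype)
```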

#### 3.5.4 Casting to boolean data type.

In [19]:
# No need

#### 3.5.5 Casting to datetime data type.

In [22]:
# df_fs 'eventtimestamp' to datetime
df_fs.loc[:, :] = normalize_datetime(df_fs, include=['eventtimestamp'], unix_unit='s')
df_fs = df_fs.rename(columns={'eventtimestamp': 'datetime'})
df_fs.dtypes

eventname                    object
deviceidhash                  int64
datetime        datetime64[ns, UTC]
expid                         int64
dtype: object
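
For reference, converting Unix timestamps in seconds to timezone-aware datetimes can be sketched with plain pandas. This uses `pd.to_datetime` directly on the first two timestamps from the dataset preview. The project's `normalize_datetime` helper is not shown here and may do more.

```python
import pandas as pd

# First two 'eventtimestamp' values from the preview above.
demo = pd.DataFrame({"eventtimestamp": [1564029816, 1564053102]})

# unit="s" interprets the integers as seconds since the epoch; utc=True makes the result tz-aware.
demo["datetime"] = pd.to_datetime(demo["eventtimestamp"], unit="s", utc=True)
demo = demo.drop(columns="eventtimestamp")

first_event = demo.loc[0, "datetime"]
print(first_event)
```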

## __4. Final cleaning dataframe review__.

In [23]:
df_fs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243713 entries, 0 to 243712
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype              
---  ------        --------------   -----              
 0   eventname     243713 non-null  object             
 1   deviceidhash  243713 non-null  int64              
 2   datetime      243713 non-null  datetime64[ns, UTC]
 3   expid         243713 non-null  int64              
dtypes: datetime64[ns, UTC](1), int64(2), object(1)
memory usage: 7.4+ MB


## __5. Generate a new clean Data set .csv file__.

In [24]:
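project_root = Path.cwd().parent
processed_path = project_root / "data" / "processed" / "clean" / "fs_clean.csv"

# Make sure the target folder exists before writing
processed_path.parent.mkdir(parents=True, exist_ok=True)
df_fs.to_csv(processed_path, index=False)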
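
CSV files store plain text, so dtypes such as datetime or category are not preserved and have to be restored when the file is read again. The round trip below is a sketch on made-up rows in a temporary folder; the file name `demo_clean.csv` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

demo = pd.DataFrame(
    {
        "eventname": ["mainscreenappear", "cartscreenappear"],
        "datetime": pd.to_datetime([1564029816, 1564054127], unit="s", utc=True),
        "expid": [246, 248],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "processed" / "clean" / "demo_clean.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    demo.to_csv(out_path, index=False)

    # Restore the dtypes explicitly while loading.
    restored = pd.read_csv(
        out_path, parse_dates=["datetime"], dtype={"eventname": "category"}
    )

print(restored.dtypes)
```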