# <u>This is our Mini project for the course "Fundamental Data Concepts" (4DACF) at SUPINFO Lyon :</u>

## <u>Evaluation Project - Data Processing and Visualization :</u>

## <u>Project Objective :</u>
You will design a complete data processing pipeline that includes several key steps: anonymization, transformation, cleaning, and data visualization.


The goal is to leverage multiple technologies to produce a high-quality pipeline that adheres to best practices.

This project must be carried out in groups of up to three students.

## <u>Context :</u>
A fictional e-commerce company aims to leverage its customer and transaction data while complying with GDPR regulations.

The company has a dataset containing sensitive information and seeks to obtain:

- [ ] An automated pipeline for anonymizing, transforming, and cleaning the data in Python.
- [ ] A final output optimized for direct use in Power BI.

## <u>BONUS :</u>

- [ ] A set of visualizations in Python that provide insights into the data.

# <u>Step 1:</u> Pipeline Preparation: Python code for anonymization, cleaning, and transformation

In [2]:
# Importing the necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

## 1. Exploratory Data Analysis (EDA)

The first step in any data processing pipeline is to understand the data. This involves exploring the data to identify patterns, trends, and potential issues. EDA is a critical step that helps data engineers understand the data and make informed decisions about how to process it.

In [3]:
# Load the dataset : Mini_Projet_Evaluation.csv

dataset_file_path: str = 'data/raw/Mini_Projet_Evaluation.csv'
dataset = pd.read_csv(dataset_file_path)

In [4]:
# Explore the data : Structure of the data, data types, etc.

# Display the first few rows of the dataset

print(f"The dataset head is : \n {dataset.head()}") #to see a quick view of the dataset

The dataset head is : 
                                ClientID       Nom       Prénom  \
0  d34f9cab-5d14-469f-aa80-c0146f3b93c7   Walters  Christopher   
1  d9b374f9-8cec-4ae7-9137-c1d930d0aae0    Weaver        Linda   
2  72855e63-d98e-42e9-a10f-d9e4fac6e82f  Odonnell        Julie   
3  3fcb2796-9692-4fcf-affb-0100d9a74ae1     Clark      Charles   
4  50b21cc8-6f68-45c5-9b75-335ef55b41b2  Martinez        David   

                       Email              Téléphone  \
0       vickie68@hotmail.com     818-767-2351x61325   
1    mackrenee@rodriguez.com      892-112-2129x2425   
2         alexis55@gmail.com  001-505-122-4709x1134   
3  jenniferschmidt@yahoo.com          (101)867-7119   
4     hmiddleton@mendoza.com     (820)441-6404x9218   

                                             Adresse          Ville  \
0                   Unit 4018 Box 5177, DPO AA 69318      Lauratown   
1           19114 Ryan Grove, East Miranda, MO 40887    Herreraview   
2           610 Donna Neck, Lake Pa

In [5]:
# Display the infos of the dataset

dataset.info() # prints the structure of the dataset directly; info() returns None, so wrapping it in print() would also print "None"

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ClientID                 1000 non-null   object 
 1   Nom                      1000 non-null   object 
 2   Prénom                   1000 non-null   object 
 3   Email                    1000 non-null   object 
 4   Téléphone                1000 non-null   object 
 5   Adresse                  1000 non-null   object 
 6   Ville                    1000 non-null   object 
 7   CodePostal               1000 non-null   int64  
 8   Pays                     1000 non-null   object 
 9   DateNaissance            1000 non-null   object 
 10  Âge                      1000 non-null   int64  
 11  Sexe                     1000 non-null   object 
 12  NuméroCarteCrédit        1000 non-null   int64  
 13  TypeCarteCrédit          1000 non-null   object 
 14  DateExpirationCarte      

In [6]:
# Display the shape of the dataset

print(f"This dataset has {dataset.shape[0]} entries and {dataset.shape[1]} columns.") # number of rows and columns, as a quick sanity check on the dataset

This dataset has 1000 entries and 31 columns.


In [7]:
# Display the descriptive statistics of the dataset

print(f"The dataset describe is : \n {dataset.describe()}") # statistical summary of the numeric columns, useful for spotting potential problems (outliers, etc.)

The dataset describe is : 
          CodePostal          Âge  NuméroCarteCrédit  SoldeCompte  \
count   1000.000000  1000.000000       1.000000e+03  1000.000000   
mean   50576.295000    55.466000       3.992066e+17  2619.975090   
std    29093.854764    21.021816       1.288862e+18  1443.421728   
min      525.000000    18.000000       6.042067e+10     2.680000   
25%    23961.250000    37.000000       1.800195e+14  1385.430000   
50%    52513.000000    56.000000       3.505393e+15  2734.705000   
75%    76447.000000    73.000000       4.610821e+15  3865.872500   
max    99876.000000    91.000000       4.998787e+18  4994.880000   

       NombreAchats  MontantTotalAchats  FréquenceAchatMensuel  PanierMoyen  \
count   1000.000000          1000.00000            1000.000000  1000.000000   
mean      10.487000          2608.90697               4.879000   238.349630   
std        5.890036          2287.38721               3.260213   150.407597   
min        0.000000             0.00000    

In [8]:
# Display the data types in the dataset

print(f"The dataset data types are : \n {dataset.dtypes}") #to see the data types of the dataset

The dataset data types are : 
 ClientID                    object
Nom                         object
Prénom                      object
Email                       object
Téléphone                   object
Adresse                     object
Ville                       object
CodePostal                   int64
Pays                        object
DateNaissance               object
Âge                          int64
Sexe                        object
NuméroCarteCrédit            int64
TypeCarteCrédit             object
DateExpirationCarte         object
SoldeCompte                float64
TypeClient                  object
NombreAchats                 int64
MontantTotalAchats         float64
DernierAchat                object
ProduitPréféré              object
CatégorieProduitPréféré     object
FréquenceAchatMensuel        int64
PanierMoyen                float64
ScoreFidélité                int64
NombreRemboursements         int64
MontantTotalRemboursé      float64
AvisClient              

## 2. Data Cleaning

Data cleaning is the process of identifying and correcting errors in the data. This step is essential for ensuring the quality of the data and the accuracy of the analysis. Data cleaning involves several key tasks, including:
- [X] Handling missing values
- [X] Removing duplicates
- [X] Correcting errors
- [X] Standardizing data
- [ ] Handling outliers
- [ ] Encoding categorical variables
- [ ] Feature engineering
- [ ] Handling skewed data
- [ ] Handling time series data
- [ ] Standardizing data types
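
The "Handling outliers" item above is not yet implemented in this notebook. One common way to approach it is the interquartile range (IQR) rule, sketched below. The function name `flag_iqr_outliers`, the factor `k = 1.5`, and the sample values are assumptions made for illustration; they are not taken from the project dataset.

```python
import pandas as pd

def flag_iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)

# Toy data standing in for a numeric column such as 'PanierMoyen' (values are made up)
sample = pd.Series([120.0, 135.5, 150.0, 142.3, 128.9, 5000.0])
outlier_mask = flag_iqr_outliers(sample)
print(sample[outlier_mask])
```

Flagging rather than deleting keeps the decision (drop, cap, or keep) separate from the detection step.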


In [9]:
# Data Cleaning : Cleaning the dataset

# Handling missing values

missing_values_count = dataset.isnull().sum()
print(f"The missing values count is : \n {missing_values_count}") #to see the number of missing values in each column of the dataset

The missing values count is : 
 ClientID                   0
Nom                        0
Prénom                     0
Email                      0
Téléphone                  0
Adresse                    0
Ville                      0
CodePostal                 0
Pays                       0
DateNaissance              0
Âge                        0
Sexe                       0
NuméroCarteCrédit          0
TypeCarteCrédit            0
DateExpirationCarte        0
SoldeCompte                0
TypeClient                 0
NombreAchats               0
MontantTotalAchats         0
DernierAchat               0
ProduitPréféré             0
CatégorieProduitPréféré    0
FréquenceAchatMensuel      0
PanierMoyen                0
ScoreFidélité              0
NombreRemboursements       0
MontantTotalRemboursé      0
AvisClient                 0
AbonnementNewsletter       0
TypePaiementFavori         0
StatutCompte               0
dtype: int64


In [10]:
# Removing duplicates

duplicates_values_count = dataset.duplicated().sum()
print(f"The number of duplicates in the dataset is : {duplicates_values_count}") #to see the number of duplicates in the dataset

The number of duplicates in the dataset is : 0
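
The cell above only counts duplicates, and none were found here. If the count were non-zero, the removal itself could look like the following sketch. The `toy` frame and its values are made up for illustration.

```python
import pandas as pd

# Toy frame with one repeated row (made-up values, not from the project dataset)
toy = pd.DataFrame({
    "ClientID": ["a1", "b2", "a1"],
    "Ville": ["Lyon", "Paris", "Lyon"],
})

duplicate_count = int(toy.duplicated().sum())
if duplicate_count > 0:
    # keep="first" retains the first occurrence of each duplicated row
    toy = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(f"Removed {duplicate_count} duplicate row(s); {len(toy)} rows remain.")
```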


In [11]:
# Correcting errors

columns_to_verify = ['Sexe', 'CatégorieProduitPréféré', 'AvisClient', 'AbonnementNewsletter', 'TypePaiementFavori', 'StatutCompte'] #to see the columns that we want to verify

for col in columns_to_verify:
    print(f"\n🔹 {col} : {dataset[col].nunique()} unique values") #to see the number of unique values in each column of the dataset
    print(dataset[col].unique())  # to see the unique values in each column of the dataset


🔹 Sexe : 2 unique values
['M' 'F']

🔹 CatégorieProduitPréféré : 5 unique values
['Sport' 'Électronique' 'Alimentation' 'Mode' 'Maison']

🔹 AvisClient : 5 unique values
['Neutre' 'Satisfait' 'Mécontent' 'Très mécontent' 'Très satisfait']

🔹 AbonnementNewsletter : 2 unique values
[ True False]

🔹 TypePaiementFavori : 4 unique values
['Paypal' 'Carte bancaire' 'Cryptomonnaie' 'Virement']

🔹 StatutCompte : 3 unique values
['Inactif' 'Actif' 'Suspendu']


In [23]:
# Standardizing data

# Standardizing the float columns
columns_to_round = ['MontantTotalAchats', 'SoldeCompte', 'PanierMoyen', 'MontantTotalRemboursé'] # monetary columns to round

for col in columns_to_round:
    dataset[col] = dataset[col].round(2) # round every value in the column to two decimal places

# Standardizing the countries and cities columns
standardize_name_columns = ['Pays', 'Ville'] # name columns to standardize
for col in standardize_name_columns:
    dataset[col] = dataset[col].str.title() # capitalize the first letter of each word

# Standardizing the date columns
standardize_date_columns = ['DateNaissance', 'DernierAchat'] # date columns to standardize
def standardise_date(date):
    """Convert a YYYY-MM-DD string to DD/MM/YYYY; return any other value as a stripped string."""
    date = str(date).strip()
    match = re.match(r"^(\d{4})-(\d{2})-(\d{2})$", date)
    if match:
        return f"{match.group(3)}/{match.group(2)}/{match.group(1)}"  # DD/MM/YYYY
    return date

for col in standardize_date_columns:
    dataset[col] = dataset[col].apply(standardise_date) # apply the conversion to every value in the column

                                ClientID        Nom       Prénom  \
0   d34f9cab-5d14-469f-aa80-c0146f3b93c7    Walters  Christopher   
1   d9b374f9-8cec-4ae7-9137-c1d930d0aae0     Weaver        Linda   
2   72855e63-d98e-42e9-a10f-d9e4fac6e82f   Odonnell        Julie   
3   3fcb2796-9692-4fcf-affb-0100d9a74ae1      Clark      Charles   
4   50b21cc8-6f68-45c5-9b75-335ef55b41b2   Martinez        David   
5   1149c70b-e75d-4398-a784-ff40152810db      Black      Stephen   
6   20dea61b-c087-482c-9a71-2def299ed020    Spencer       Alexis   
7   3faa9f83-0a50-4657-96cc-401944b5708c      Baker      Stephen   
8   cd0594c6-c289-4fd6-92ad-495772743f5a   Williams      Timothy   
9   6bed1a9f-2ee0-402e-8907-9fad020e36f7  Macdonald        Kevin   
10  0c2cab59-6989-4832-aa97-84c398e67a5d       Haas       Rhonda   
11  79eb32ae-12ca-4009-b8ec-b558d039caf8      Gould     Jennifer   
12  073fdfca-a8df-4e28-ae42-460761586770     Martin      Tiffany   
13  ae2fe7cd-0387-4e2b-abb6-9e6bce679b74       W
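
As an alternative to the regex-based `standardise_date` function, the same reformatting could be sketched with `pd.to_datetime`, which also checks that each value is a real calendar date. The sample strings below are made up. Unlike the regex version, this variant turns unparseable values into missing values instead of leaving them unchanged, which may or may not be the desired behavior.

```python
import pandas as pd

# Made-up sample values: two valid ISO dates (one with stray spaces) and one invalid entry
raw_dates = pd.Series(["1985-03-07", " 2001-12-25 ", "not a date"])

# errors="coerce" turns unparseable values into NaT instead of raising an exception
parsed = pd.to_datetime(raw_dates.str.strip(), format="%Y-%m-%d", errors="coerce")

# Format as DD/MM/YYYY; entries that failed to parse stay missing
formatted = parsed.dt.strftime("%d/%m/%Y")
print(formatted.tolist())
```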

## 3. Data Transformation (Anonymization, pseudonymization, column selection)
Data transformation is the process of converting raw data into a format that is suitable for analysis. This step involves several key tasks, including:
- [ ] Anonymization
- [ ] Pseudonymization
- [ ] Aggregation
- [ ] Encoding
- [ ] Data discretization
- [ ] Data imputation
- [ ] Data integration
- [ ] Data reduction
- [ ] Data wrangling
- [ ] Data munging
- [ ] Data fusion
- [ ] Data harmonization

## 4. Validation
Data validation is the process of ensuring that the data is accurate, complete, and consistent. This step involves several key tasks, including:
- [ ] Data profiling
- [ ] Data quality assessment
- [ ] Data integrity checks
- [ ] Data validation rules
- [ ] Data validation checks
- [ ] Data validation methods
- [ ] Data validation techniques
- [ ] Data validation tools
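
A few of these checks can be sketched as plain pandas tests. The function name `validate`, the assumed 18-120 age range, and the toy data are illustrative choices; the allowed `Sexe` values follow the unique values observed during cleaning.

```python
import pandas as pd

def validate(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems; an empty list means every check passed."""
    problems = []
    if df["ClientID"].duplicated().any():
        problems.append("ClientID contains duplicates")
    if df.isnull().any().any():
        problems.append("missing values found")
    if not df["Âge"].between(18, 120).all():
        problems.append("Âge outside the assumed 18-120 range")
    if not df["Sexe"].isin(["M", "F"]).all():
        problems.append("unexpected value in Sexe")
    return problems

# Toy frame with deliberate errors (made-up values)
toy = pd.DataFrame({
    "ClientID": ["a1", "b2", "b2"],
    "Âge": [34, 17, 56],
    "Sexe": ["M", "F", "X"],
})
for problem in validate(toy):
    print(problem)
```

Returning a list of problems instead of raising on the first failure makes it possible to report every issue in a single run.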

## 5. Data Export (To CSV or Excel)
The final step in the data processing pipeline is to export the cleaned and transformed data to a file format that can be used for analysis. This step involves exporting the data to a CSV or Excel file, which can then be imported into a data visualization tool for further analysis.
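
The export step could be sketched as follows. The file name `dataset_clean.csv`, the temporary directory, and the toy values are assumptions; the real pipeline would write the full cleaned `dataset` to a project folder.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned dataset (made-up values)
toy = pd.DataFrame({"Ville": ["Lyon", "Évian"], "SoldeCompte": [1385.43, 2734.71]})

# Hypothetical output location; a temporary directory keeps the sketch self-contained
output_path = os.path.join(tempfile.mkdtemp(), "dataset_clean.csv")

# index=False leaves out the pandas row index; utf-8-sig writes a byte order mark,
# which helps tools such as Excel recognize accented characters
toy.to_csv(output_path, index=False, encoding="utf-8-sig")

# Read the file back as a quick check that the export is usable
reloaded = pd.read_csv(output_path, encoding="utf-8-sig")
print(reloaded)
```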

## 6. Data Visualization (BONUS)

Data visualization is the process of representing data graphically to help data engineers and analysts understand the data and identify patterns and trends. Data visualization is a critical step in the data processing pipeline, as it helps to communicate the results of the analysis to stakeholders and decision-makers. Data visualization involves several key tasks, including: