# Data profiling

**Data profiling** is the systematic process of examining, analyzing, and summarizing data to understand its structure, quality, and content. It helps uncover data issues, assess readiness for processing, and inform decisions in data integration, cleansing, or analytics projects.

In general, data profiling can help us to:
- **Assess Data Quality**: Detect nulls, duplicates, outliers, inconsistent formats, or unexpected patterns.
- **Understand Schema & Structure**: Analyze data types, column lengths, key constraints, and relationships.
- **Discover Relationships**: Identify foreign key candidates, overlaps, and referential integrity between datasets.
- **Generate Metadata**: Produce statistics (e.g. cardinality, min/max, frequency) to build a data dictionary.
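
As a rough illustration of the quality checks listed above, the sketch below counts nulls and duplicate rows and flags outliers with the 1.5 × IQR rule in pandas. The toy DataFrame and the `iqr_outliers` helper are hypothetical, made up for this example, and the IQR rule is only one common outlier heuristic among several.

```python
import pandas as pd

# Hypothetical toy data: one missing age, one duplicated row, one extreme age.
toy = pd.DataFrame({
    "age": [39, 50, 38, 38, 53, 28, 250, None],
    "sex": ["Male", "Male", "Male", "Male", "Male", "Female", "Male", "Female"],
})


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the non-null values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = series.dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values[(values < q1 - k * iqr) | (values > q3 + k * iqr)]


null_counts = toy.isna().sum()                 # missing values per column
duplicate_rows = int(toy.duplicated().sum())   # rows identical to an earlier row
age_outliers = iqr_outliers(toy["age"])        # suspicious ages

print(null_counts)
print("duplicate rows:", duplicate_rows)
print("age outliers:", age_outliers.tolist())
```

The multiplier `k` controls how aggressive the outlier flagging is; larger values flag fewer points.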

## Types of Profiling
- **Column Profiling**: Statistics on individual columns (e.g., null %, distinct count)
- **Cross-Column Profiling**: Detecting dependencies or correlations between columns
- **Cross-Table Profiling**: Matching keys across tables to validate joins or relationships
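
Column profiling is demonstrated with real data later on, so the sketch below focuses on the other two types. It assumes two hypothetical tables, `customers` and `orders`, invented purely for illustration: the cross-column check tests whether `zip_code` determines `city`, and the cross-table check looks for `customer_id` values in `orders` that have no match in `customers`.

```python
import pandas as pd

# Hypothetical tables, invented for illustration only.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "country": ["FR", "US", "US"],
})
orders = pd.DataFrame({
    "order_id": [10, 11, 12, 13],
    "customer_id": [1, 2, 2, 4],  # 4 has no matching customer
    "zip_code": ["75001", "10001", "10001", "94105"],
    "city": ["Paris", "New York", "New York", "San Francisco"],
})

# Cross-column: zip_code determines city if every zip_code maps to
# exactly one distinct city (a functional dependency).
zip_determines_city = bool(
    (orders.groupby("zip_code")["city"].nunique() == 1).all()
)

# Cross-table: orders whose customer_id is absent from customers.
has_match = orders["customer_id"].isin(customers["customer_id"])
orphans = orders[~has_match]

# Share of order rows whose key finds a match in customers.
match_rate = float(has_match.mean())

print(zip_determines_city, orphans["order_id"].tolist(), match_rate)
```

A match rate below 1.0 suggests that a join on this key would drop rows or that the relationship is not a clean foreign key.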



## Key Metrics in Data Profiling
| Metric                       | Description                                              |
|------------------------------|----------------------------------------------------------|
| Null count                   | Number of missing values                                 |
| Unique values (cardinality)  | How many distinct values exist                           |
| Value distribution           | Frequency of each value (useful for categorical columns) |
| Pattern recognition          | Common formats, e.g. YYYY-MM-DD, email patterns          |
| Min/Max/Mean                 | For numerical columns                                    |
| Length analysis              | Min/Max/Avg string lengths                               |
| Referential integrity        | Whether values match across related tables               |
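
Most of the single-column metrics in this table can be computed with a few pandas calls. The following is a minimal sketch, not how any profiling tool implements them: `profile_column` is a hypothetical helper, the sample values are invented, and the pattern check only tests for a `YYYY-MM-DD` shape without validating the date itself.

```python
import pandas as pd


def profile_column(series: pd.Series) -> dict:
    """Compute a few basic profiling metrics for one column.

    Assumes the column holds at least one non-null value.
    """
    metrics = {
        "null_count": int(series.isna().sum()),
        "cardinality": int(series.nunique()),  # distinct non-null values
        "top_values": series.value_counts().head(3).to_dict(),
    }
    if pd.api.types.is_numeric_dtype(series):
        metrics.update(min=series.min(), max=series.max(), mean=series.mean())
    else:
        text = series.dropna().astype(str)
        lengths = text.str.len()
        metrics.update(
            min_len=int(lengths.min()),
            max_len=int(lengths.max()),
            avg_len=float(lengths.mean()),
            # Share of non-null values shaped like YYYY-MM-DD.
            iso_date_share=float(
                text.str.fullmatch(r"\d{4}-\d{2}-\d{2}").mean()
            ),
        )
    return metrics


# Hypothetical sample values, invented for illustration.
dates = pd.Series(["2024-01-31", "2024-02-01", "31/01/2024", None, "2024-02-01"])
hours = pd.Series([40, 13, 40, 40, 45])

date_metrics = profile_column(dates)
hour_metrics = profile_column(hours)
print(date_metrics)
print(hour_metrics)
```

An `iso_date_share` below 1.0 signals mixed date formats in the column, which is exactly the kind of inconsistency profiling is meant to surface.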


## 1. Example with ydata-profiling

In this section, we use a tool called ydata-profiling. You can visit their GitHub [page](https://github.com/ydataai/ydata-profiling) for more details.

Installation is straightforward:

```shell
# via pip
pip install ydata-profiling

# via conda
conda install -c conda-forge ydata-profiling
```

If you want to use Spark as the backend, install ydata-profiling with the `spark` extra.

```shell
# the quotes keep shells such as zsh from expanding the square brackets
pip install "ydata-profiling[spark]"
```

> The official docs use `pip install ydata-profiling[pyspark]`, which is wrong. You can check the source at https://github.com/ydataai/ydata-profiling/blob/develop/pyproject.toml: the extra is named `spark`, not `pyspark`.
>
> The Spark mode was introduced in version **v4.0.0**.
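
Since the Spark mode depends on the installed version, it can be worth checking it before going further. The sketch below uses only the standard library; `has_spark_backend` is a hypothetical helper that simply encodes the v4.0.0 threshold stated above, and it assumes a plain `major.minor.patch` version string.

```python
from importlib.metadata import PackageNotFoundError, version


def has_spark_backend(version_string: str) -> bool:
    """Return True if the major version is 4 or later (threshold from the note above)."""
    major = int(version_string.lstrip("v").split(".")[0])
    return major >= 4


try:
    installed = version("ydata-profiling")
    print(f"ydata-profiling {installed}, Spark mode expected: {has_spark_backend(installed)}")
except PackageNotFoundError:
    print("ydata-profiling is not installed in this environment")
```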


```python
from pyspark.sql import SparkSession
import pandas as pd
from ydata_profiling import ProfileReport

file_path = "../data/csv/us_census_1994.csv"
```

### 1.1 Use pandas

```python
# header=0 discards the file's own header line; names= supplies the column names.
columns = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation", "relationship",
           "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"]
df = pd.read_csv(file_path, names=columns, header=0)
df.head()
```

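|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|-----|-----------|--------|-----------|---------------|----------------|------------|--------------|------|-----|--------------|--------------|----------------|----------------|--------|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |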


```python
print(df.describe())
```

```text
                age        fnlwgt  education-num  capital-gain  capital-loss  \
count  32561.000000  3.256100e+04   32561.000000  32561.000000  32561.000000
mean      38.581647  1.897784e+05      10.080679   1077.648844     87.303830
std       13.640433  1.055500e+05       2.572720   7385.292085    402.960219
min       17.000000  1.228500e+04       1.000000      0.000000      0.000000
25%       28.000000  1.178270e+05       9.000000      0.000000      0.000000
50%       37.000000  1.783560e+05      10.000000      0.000000      0.000000
75%       48.000000  2.370510e+05      12.000000      0.000000      0.000000
max       90.000000  1.484705e+06      16.000000  99999.000000   4356.000000

       hours-per-week
count    32561.000000
mean        40.437456
std         12.347429
min          1.000000
25%         40.000000
50%         40.000000
75%         45.000000
max         99.000000
```


```python
# Build the report from the pandas DataFrame and export it as a standalone HTML file.
profile = ProfileReport(df, title="Profiling Report")
profile.to_file("my_report.html")
```


### 1.2 Use spark

```python
# Local Spark session with 4 worker threads.
spark = (
    SparkSession.builder.master("local[4]")
    .appName("spark data profiling")
    .getOrCreate()
)
```

```python
from pyspark.sql.types import StructType, IntegerType, StringType

# Explicit schema: every column is nullable; numeric columns are read as integers.
schema = (
    StructType()
    .add("age", IntegerType(), True)
    .add("workclass", StringType(), True)
    .add("fnlwgt", IntegerType(), True)
    .add("education", StringType(), True)
    .add("education-num", IntegerType(), True)
    .add("marital-status", StringType(), True)
    .add("occupation", StringType(), True)
    .add("relationship", StringType(), True)
    .add("race", StringType(), True)
    .add("sex", StringType(), True)
    .add("capital-gain", IntegerType(), True)
    .add("capital-loss", IntegerType(), True)
    .add("hours-per-week", IntegerType(), True)
    .add("native-country", StringType(), True)
    .add("income", StringType(), True)
)
```

```python
# header=False: Spark reads the first line of the file as data, not as column names.
df = spark.read.csv(file_path, header=False, schema=schema)

df.printSchema()
```

```text
root
 |-- age: integer (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: integer (nullable = true)
 |-- education: string (nullable = true)
 |-- education-num: integer (nullable = true)
 |-- marital-status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital-gain: integer (nullable = true)
 |-- capital-loss: integer (nullable = true)
 |-- hours-per-week: integer (nullable = true)
 |-- native-country: string (nullable = true)
 |-- income: string (nullable = true)
```



```python
df.show(5)
```

```text
+----+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+----+------------+------------+--------------+--------------+------+
| age|       workclass|fnlwgt|education|education-num|    marital-status|       occupation| relationship| race| sex|capital-gain|capital-loss|hours-per-week|native-country|income|
+----+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+----+------------+------------+--------------+--------------+------+
|NULL|       workclass|  NULL|education|         NULL|     marial-status|       occupation| relationship| race| sex|        NULL|        NULL|          NULL|native-country|income|
|  39|       State-gov| 77516|Bachelors|           13|     Never-married|     Adm-clerical|Not-in-family|White|Male|        2174|           0|            40| United-States| <=50K|
|  50|Self-emp-not-inc| 83311|Bachelors|           13|Married-civ-spouse|  Exec-managerial|      Hus
```

The first row shown is the file's own header line, parsed as data because the file was read with `header=False`: its text cannot be cast to an integer, so every integer column is `NULL` in that row. Reading the file with `header=True` would skip that line.

```python
# This does not work
a = ProfileReport(df)
a.to_file("../spark_profile.html")
```