# **Module `pepper.univar`**

✔ Review of the **type hints** and **docstrings**.

We start from the existing code and rework it.

Test examples should be built on both dummy datasets and real-world datasets.

For each function, document its forward dependencies, i.e. where it is used and for what purpose.

**TODO** Unit tests at several scales: hand-written micro cases, a synthetic dataset built with a `make_*` generator, a real case such as Iris, and a full-scale run on the current project. The full-scale run produces an additional generated table, which can be loaded on hidden pages of the Streamlit application.

**TODO** Inseparable from `from pepper.data_dict import _load_struct`. The refactoring will be complete once that part has been reintegrated and `univar.py` has been emptied in favor of `pepper.data_dict` on the one hand and a `pepper.corr` module on the other.

**TODO** An interesting evolution of this feature would be to back it with an SQLite database.
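As a rough illustration of the SQLite idea, the sketch below stores a small report in a database and queries it back. It assumes the report is a plain `pd.DataFrame` with scalar columns; the `report` content and the table name `application` are hypothetical stand-ins, not output of `pepper`.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Hypothetical stand-in for a report: one row per variable, scalar columns only.
report = pd.DataFrame({
    "idx": [0, 1],
    "name": ["A", "B"],
    "dtype": ["int64", "object"],
    "n_na": [0, 2],
})

# An in-memory database keeps the sketch self-contained;
# a file path would persist the reports between sessions.
with closing(sqlite3.connect(":memory:")) as conn:
    # One SQL table per source table (the name "application" is illustrative).
    report.to_sql("application", conn, if_exists="replace", index=False)
    # Example query: variables that contain missing values.
    with_na = pd.read_sql_query(
        "SELECT name, n_na FROM application WHERE n_na > 0", conn
    )
```

List-valued columns such as `modalities` would presumably need to be serialized (for example as JSON text) before being written this way.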

In [None]:
from pepper.data_dict import _load_struct
s = _load_struct()
display(s)

# Data Analysis and Reporting

- `series_infos(s: pd.Series, idx: int) -> Dict[str, Union[str, int, float, bool, List[float], List[str]]]`:
    - Analyze a Pandas Series and return information about it.
- `dataframe_infos(df: pd.DataFrame) -> pd.DataFrame`:
    - Analyze a Pandas DataFrame and return information about all its Series.
- `data_report(data, csv_filename) -> pd.DataFrame`:
    - Generate a data report, perform data reduction, and save it as a CSV file.
- `data_report_to_gsheet(data_on_data: pd.DataFrame, spread, sheet_name: str) -> None`:
    - Export a data report to a Google Sheets spreadsheet.

## **`series_infos`**`(s, idx)`

This exploratory analysis function returns the most exhaustive univariate analysis possible of a single variable.

In this project, it is used only as a helper for **`dataframe_infos`**`(data)`, which applies it to every variable of a _dataframe_ to produce a complete univariate exploratory analysis report.
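The relationship between the two functions can be sketched as follows. This is not the `pepper` implementation: `mini_series_infos` and `mini_dataframe_infos` are hypothetical, heavily reduced stand-ins that compute only a handful of the indicators, to show the "one dict per series, one report row per column" pattern.

```python
import pandas as pd


def mini_series_infos(s, idx: int = 0) -> dict:
    """Hypothetical reduced stand-in for `series_infos`."""
    s = pd.Series(s)
    n = len(s)
    n_notna = int(s.notna().sum())
    return {
        "idx": idx,
        "name": s.name,
        "dtype": s.dtype,
        "n": n,
        "n_unique": s.nunique(),  # NaN values are not counted
        "n_na": n - n_notna,
        "filling_rate": n_notna / n if n else None,
    }


def mini_dataframe_infos(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the per-series analysis to each column: one report row per column."""
    rows = [mini_series_infos(df[col], idx) for idx, col in enumerate(df.columns)]
    return pd.DataFrame(rows)


data = pd.DataFrame({"A": [1, 2, 3], "B": ["x", None, "x"]})
report = mini_dataframe_infos(data)
```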

**Note** Several **TODO** items remain in the first section, "Identification, group, and type".

### Numerical examples

#### Basic numerical example

In [1]:
from pepper.univar import series_infos

display(series_infos([1, 2, 3, 2, 3]))

{'idx': 0,
 'group': '<NYI>',
 'subgroup': '<NYI>',
 'name': None,
 'domain': '<NYI>',
 'format': '<NYI>',
 'dtype': dtype('int64'),
 'astype': '<NYI>',
 'unity': '<NYI>',
 'is_numeric': True,
 'n': 5,
 'hasnans': False,
 'n_unique': 3,
 'n_notna': 5,
 'n_na': 0,
 'filling_rate': 1.0,
 'uniqueness': 0.6,
 'val_min': 1,
 'val_max': 3,
 'val_mode': '[2, 3]',
 'val_mean': 2.2,
 'val_trim_mean_10pc': 2.2,
 'val_med': 2.0,
 'val_std': 0.837,
 'val_interq_range': 1.0,
 'val_med_abs_dev': 1.482602218505602,
 'val_skew': -0.512,
 'val_kurt': -0.612,
 'interval': [1, 3],
 'modalities': None,
 'mod_counts': None,
 'mod_freqs': None,
 'shape': (5,),
 'ndim': 1,
 'empty': False,
 'size': 5,
 'nbytes': 40,
 'memory_usage': <bound method Series.memory_usage of 0    1
 1    2
 2    3
 3    2
 4    3
 dtype: int64>,
 'flags': <Flags(allows_duplicate_labels=True)>,
 'array container type': pandas.core.arrays.numpy_.PandasArray,
 'values container type': numpy.ndarray}

#### Generated numerical example

In [3]:
from sklearn.datasets import make_regression
from pepper.univar import series_infos

X, y = make_regression(n_samples=100, n_features=1, noise=0.1)
display(series_infos(y))

{'idx': 0,
 'group': '<NYI>',
 'subgroup': '<NYI>',
 'name': None,
 'domain': '<NYI>',
 'format': '<NYI>',
 'dtype': dtype('float64'),
 'astype': '<NYI>',
 'unity': '<NYI>',
 'is_numeric': True,
 'n_elts': 100,
 'hasnans': False,
 'n_unique': 100,
 'n_notna': 100,
 'n_na': 0,
 'filling_rate': 1.0,
 'uniqueness': 1.0,
 'val_min': -153.31906658871696,
 'val_max': 224.1884608573641,
 'val_mode': '[-153.32, -146.74, -142.53, -137.68, -130.85, -119.1, -118.11, -116.93, -97.25, -93.19]',
 'val_mean': 3.846,
 'val_trim_mean_10pc': 3.9113807395561997,
 'val_med': -0.012583910254141656,
 'val_std': 72.694,
 'val_interq_range': 95.40169228596248,
 'val_med_abs_dev': 70.50775852561647,
 'val_skew': 0.126,
 'val_kurt': 0.257,
 'interval': [-153.31906658871696, 224.1884608573641],
 'modalities': None,
 'mod_counts': None,
 'mod_freqs': None,
 'shape': (100,),
 'ndim': 1,
 'empty': False,
 'size': 100,
 'nbytes': 800,
 'memory_usage': <bound method Series.memory_usage of 0      64.432040
 1     -42.

#### Iris numerical example

In [4]:
from sklearn.datasets import load_iris
from pepper.univar import series_infos

iris = load_iris(as_frame=True)
display(series_infos(iris.data["sepal length (cm)"]))

{'idx': 0,
 'group': '<NYI>',
 'subgroup': '<NYI>',
 'name': 'sepal length (cm)',
 'domain': '<NYI>',
 'format': '<NYI>',
 'dtype': dtype('float64'),
 'astype': '<NYI>',
 'unity': '<NYI>',
 'is_numeric': True,
 'n': 150,
 'hasnans': False,
 'n_unique': 35,
 'n_notna': 150,
 'n_na': 0,
 'filling_rate': 1.0,
 'uniqueness': 0.233,
 'val_min': 4.3,
 'val_max': 7.9,
 'val_mode': '[5.0]',
 'val_mean': 5.843,
 'val_trim_mean_10pc': 5.808,
 'val_med': 5.8,
 'val_std': 0.828,
 'val_interq_range': 1.3000000000000007,
 'val_med_abs_dev': 1.0378215529539216,
 'val_skew': 0.315,
 'val_kurt': -0.552,
 'interval': [4.3, 7.9],
 'modalities': None,
 'mod_counts': None,
 'mod_freqs': None,
 'shape': (150,),
 'ndim': 1,
 'empty': False,
 'size': 150,
 'nbytes': 1200,
 'memory_usage': <bound method Series.memory_usage of 0      5.1
 1      4.9
 2      4.7
 3      4.6
 4      5.0
       ... 
 145    6.7
 146    6.3
 147    6.5
 148    6.2
 149    5.9
 Name: sepal length (cm), Length: 150, dtype: float64>,


#### Home Credit numerical example

In [5]:
from home_credit.load import get_table
from pepper.univar import series_infos

application = get_table("application")
display(series_infos(application.AMT_INCOME_TOTAL))

load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\application_train.pqt
load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\application_test.pqt


{'idx': 0,
 'group': '<NYI>',
 'subgroup': '<NYI>',
 'name': 'AMT_INCOME_TOTAL',
 'domain': '<NYI>',
 'format': '<NYI>',
 'dtype': dtype('float64'),
 'astype': '<NYI>',
 'unity': '<NYI>',
 'is_numeric': True,
 'n': 356255,
 'hasnans': False,
 'n_unique': 2741,
 'n_notna': 356255,
 'n_na': 0,
 'filling_rate': 1.0,
 'uniqueness': 0.008,
 'val_min': 25650.0,
 'val_max': 117000000.0,
 'val_mode': '[135000.0]',
 'val_mean': 170116.06,
 'val_trim_mean_10pc': 156770.066,
 'val_med': 153000.0,
 'val_std': 223506.819,
 'val_interq_range': 90000.0,
 'val_med_abs_dev': 73388.8098160273,
 'val_skew': 403.65,
 'val_kurt': 209715.939,
 'interval': [25650.0, 117000000.0],
 'modalities': None,
 'mod_counts': None,
 'mod_freqs': None,
 'shape': (356255,),
 'ndim': 1,
 'empty': False,
 'size': 356255,
 'nbytes': 2850040,
 'memory_usage': <bound method Series.memory_usage of 0         202500.0
 1         270000.0
 2          67500.0
 3         135000.0
 4         121500.0
             ...   
 356250    1

### Categorical examples

#### Basic categorical example

In [6]:
from pepper.univar import series_infos
display(series_infos(['A', 'B', 'C', 'A', 'B']))

{'idx': 0,
 'group': '<NYI>',
 'subgroup': '<NYI>',
 'name': None,
 'domain': '<NYI>',
 'format': '<NYI>',
 'dtype': dtype('O'),
 'astype': '<NYI>',
 'unity': '<NYI>',
 'is_numeric': False,
 'n': 5,
 'hasnans': False,
 'n_unique': 3,
 'n_notna': 5,
 'n_na': 0,
 'filling_rate': 1.0,
 'uniqueness': 0.6,
 'val_min': None,
 'val_max': None,
 'val_mode': None,
 'val_mean': None,
 'val_trim_mean_10pc': None,
 'val_med': None,
 'val_std': None,
 'val_interq_range': None,
 'val_med_abs_dev': None,
 'val_skew': None,
 'val_kurt': None,
 'interval': None,
 'modalities': ['A', 'B', 'C'],
 'mod_counts': [2, 2, 1],
 'mod_freqs': [0.4, 0.4, 0.2],
 'shape': (5,),
 'ndim': 1,
 'empty': False,
 'size': 5,
 'nbytes': 40,
 'memory_usage': <bound method Series.memory_usage of 0    A
 1    B
 2    C
 3    A
 4    B
 dtype: object>,
 'flags': <Flags(allows_duplicate_labels=True)>,
 'array container type': pandas.core.arrays.numpy_.PandasArray,
 'values container type': numpy.ndarray}

#### Generated categorical example

In [7]:
from sklearn.datasets import make_classification
from pepper.univar import series_infos
X, y = make_classification(
    n_samples=100, n_features=1, n_redundant=0,
    n_informative=1, n_classes=2, n_clusters_per_class=1
)
classes = ['Class_A', 'Class_B']
display(series_infos([classes[yi] for yi in y]))

{'idx': 0,
 'group': '<NYI>',
 'subgroup': '<NYI>',
 'name': None,
 'domain': '<NYI>',
 'format': '<NYI>',
 'dtype': dtype('O'),
 'astype': '<NYI>',
 'unity': '<NYI>',
 'is_numeric': False,
 'n': 100,
 'hasnans': False,
 'n_unique': 2,
 'n_notna': 100,
 'n_na': 0,
 'filling_rate': 1.0,
 'uniqueness': 0.02,
 'val_min': None,
 'val_max': None,
 'val_mode': None,
 'val_mean': None,
 'val_trim_mean_10pc': None,
 'val_med': None,
 'val_std': None,
 'val_interq_range': None,
 'val_med_abs_dev': None,
 'val_skew': None,
 'val_kurt': None,
 'interval': None,
 'modalities': ['Class_B', 'Class_A'],
 'mod_counts': [50, 50],
 'mod_freqs': [0.5, 0.5],
 'shape': (100,),
 'ndim': 1,
 'empty': False,
 'size': 100,
 'nbytes': 800,
 'memory_usage': <bound method Series.memory_usage of 0     Class_B
 1     Class_A
 2     Class_B
 3     Class_A
 4     Class_A
        ...   
 95    Class_A
 96    Class_A
 97    Class_A
 98    Class_A
 99    Class_B
 Length: 100, dtype: object>,
 'flags': <Flags(allows_dupl

#### Titanic categorical example

In [8]:
from seaborn import load_dataset
from pepper.univar import series_infos

titanic = load_dataset("titanic")
display(series_infos(titanic.embark_town))

{'idx': 0,
 'group': '<NYI>',
 'subgroup': '<NYI>',
 'name': 'embark_town',
 'domain': '<NYI>',
 'format': '<NYI>',
 'dtype': dtype('O'),
 'astype': '<NYI>',
 'unity': '<NYI>',
 'is_numeric': False,
 'n': 891,
 'hasnans': True,
 'n_unique': 3,
 'n_notna': 889,
 'n_na': 2,
 'filling_rate': 0.998,
 'uniqueness': 0.003,
 'val_min': None,
 'val_max': None,
 'val_mode': None,
 'val_mean': None,
 'val_trim_mean_10pc': None,
 'val_med': None,
 'val_std': None,
 'val_interq_range': None,
 'val_med_abs_dev': None,
 'val_skew': None,
 'val_kurt': None,
 'interval': None,
 'modalities': ['Southampton', 'Cherbourg', 'Queenstown'],
 'mod_counts': [644, 168, 77],
 'mod_freqs': [0.7244094488188977, 0.1889763779527559, 0.08661417322834646],
 'shape': (891,),
 'ndim': 1,
 'empty': False,
 'size': 891,
 'nbytes': 7128,
 'memory_usage': <bound method Series.memory_usage of 0      Southampton
 1        Cherbourg
 2      Southampton
 3      Southampton
 4      Southampton
           ...     
 886    Southa

#### Home Credit categorical example

In [9]:
from home_credit.load import get_table
from pepper.univar import series_infos

application = get_table("application")
display(series_infos(application.NAME_CONTRACT_TYPE))

{'idx': 0,
 'group': '<NYI>',
 'subgroup': '<NYI>',
 'name': 'NAME_CONTRACT_TYPE',
 'domain': '<NYI>',
 'format': '<NYI>',
 'dtype': dtype('O'),
 'astype': '<NYI>',
 'unity': '<NYI>',
 'is_numeric': False,
 'n': 356255,
 'hasnans': False,
 'n_unique': 2,
 'n_notna': 356255,
 'n_na': 0,
 'filling_rate': 1.0,
 'uniqueness': 0.0,
 'val_min': None,
 'val_max': None,
 'val_mode': None,
 'val_mean': None,
 'val_trim_mean_10pc': None,
 'val_med': None,
 'val_std': None,
 'val_interq_range': None,
 'val_med_abs_dev': None,
 'val_skew': None,
 'val_kurt': None,
 'interval': None,
 'modalities': ['Cash loans', 'Revolving loans'],
 'mod_counts': [326537, 29718],
 'mod_freqs': [0.9165822234073908, 0.08341777659260922],
 'shape': (356255,),
 'ndim': 1,
 'empty': False,
 'size': 356255,
 'nbytes': 2850040,
 'memory_usage': <bound method Series.memory_usage of 0              Cash loans
 1              Cash loans
 2         Revolving loans
 3              Cash loans
 4              Cash loans
        

## **`dataframe_infos`**`(data)`

This function is essentially the iterated version of the previous one. It produces a _dataframe_ that serves as an exhaustive exploratory analysis report.

**Note 05/09/2023** Importing this function, which dates back to my first DSIA projects, into the Home Credit project signals an intention: to revisit this feature in light of my latest developments, in particular the automatic qualification of variables, both in terms of nature (inference of technical types and business types) and in terms of business groupings (semantic discovery, grouping of variables by family, hierarchy, etc.). The aim remains to consolidate my craftsman's toolbox into a professional-grade library.

#### Basic example

In [10]:
import pandas as pd
from pepper.univar import dataframe_infos

data = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': ['apple', 'banana', 'cherry', 'date', 'elderberry'],
    'C': [0.1, 0.2, 0.3, 0.4, 0.5]
})

display(dataframe_infos(data))

Unnamed: 0,idx,group,subgroup,name,domain,format,dtype,astype,unity,is_numeric,...,mod_freqs,shape,ndim,empty,size,nbytes,memory_usage,flags,array container type,values container type
0,0,<NYI>,<NYI>,A,<NYI>,<NYI>,int64,<NYI>,<NYI>,True,...,,"(5,)",1,False,5,40,<bound method Series.memory_usage of 0 1\n1...,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.numpy_.PandasArray'>,<class 'numpy.ndarray'>
1,1,<NYI>,<NYI>,B,<NYI>,<NYI>,object,<NYI>,<NYI>,False,...,"[0.2, 0.2, 0.2, 0.2, 0.2]","(5,)",1,False,5,40,<bound method Series.memory_usage of 0 ...,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.numpy_.PandasArray'>,<class 'numpy.ndarray'>
2,2,<NYI>,<NYI>,C,<NYI>,<NYI>,float64,<NYI>,<NYI>,True,...,,"(5,)",1,False,5,40,<bound method Series.memory_usage of 0 0.1\...,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.numpy_.PandasArray'>,<class 'numpy.ndarray'>


#### Iris example

In [11]:
from sklearn.datasets import load_iris
from pepper.univar import dataframe_infos

iris = load_iris(as_frame=True)
display(dataframe_infos(iris.data).T)

Unnamed: 0,0,1,2,3
idx,0,1,2,3
group,<NYI>,<NYI>,<NYI>,<NYI>
subgroup,<NYI>,<NYI>,<NYI>,<NYI>
name,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
domain,<NYI>,<NYI>,<NYI>,<NYI>
format,<NYI>,<NYI>,<NYI>,<NYI>
dtype,float64,float64,float64,float64
astype,<NYI>,<NYI>,<NYI>,<NYI>
unity,<NYI>,<NYI>,<NYI>,<NYI>
is_numeric,True,True,True,True


#### Titanic example

In [1]:
from seaborn import load_dataset
from pepper.univar import dataframe_infos

titanic = load_dataset("titanic")
display(dataframe_infos(titanic))

Unnamed: 0,idx,group,subgroup,name,domain,format,dtype,astype,unity,is_numeric,...,mod_freqs,shape,ndim,empty,size,nbytes,memory_usage,flags,array container type,values container type
0,0,<NYI>,<NYI>,survived,<NYI>,<NYI>,int64,<NYI>,<NYI>,True,...,,"(891,)",1,False,891,7128,<bound method Series.memory_usage of 0 0\...,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.numpy_.PandasArray'>,<class 'numpy.ndarray'>
1,1,<NYI>,<NYI>,pclass,<NYI>,<NYI>,int64,<NYI>,<NYI>,True,...,,"(891,)",1,False,891,7128,<bound method Series.memory_usage of 0 3\...,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.numpy_.PandasArray'>,<class 'numpy.ndarray'>
2,2,<NYI>,<NYI>,sex,<NYI>,<NYI>,object,<NYI>,<NYI>,False,...,"[0.6475869809203143, 0.35241301907968575]","(891,)",1,False,891,7128,<bound method Series.memory_usage of 0 ...,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.numpy_.PandasArray'>,<class 'numpy.ndarray'>
3,3,<NYI>,<NYI>,age,<NYI>,<NYI>,float64,<NYI>,<NYI>,True,...,,"(891,)",1,False,891,7128,<bound method Series.memory_usage of 0 22...,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.numpy_.PandasArray'>,<class 'numpy.ndarray'>
4,4,<NYI>,<NYI>,sibsp,<NYI>,<NYI>,int64,<NYI>,<NYI>,True,...,,"(891,)",1,False,891,7128,<bound method Series.memory_usage of 0 1\...,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.numpy_.PandasArray'>,<class 'numpy.ndarray'>
5,5,<NYI>,<NYI>,parch,<NYI>,<NYI>,int64,<NYI>,<NYI>,True,...,,"(891,)",1,False,891,7128,<bound method Series.memory_usage of 0 0\...,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.numpy_.PandasArray'>,<class 'numpy.ndarray'>
6,6,<NYI>,<NYI>,fare,<NYI>,<NYI>,float64,<NYI>,<NYI>,True,...,,"(891,)",1,False,891,7128,<bound method Series.memory_usage of 0 7...,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.numpy_.PandasArray'>,<class 'numpy.ndarray'>
7,7,<NYI>,<NYI>,embarked,<NYI>,<NYI>,object,<NYI>,<NYI>,False,...,"[0.7244094488188977, 0.1889763779527559, 0.086...","(891,)",1,False,891,7128,<bound method Series.memory_usage of 0 S\...,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.numpy_.PandasArray'>,<class 'numpy.ndarray'>
8,8,<NYI>,<NYI>,class,<NYI>,<NYI>,category,<NYI>,<NYI>,False,...,"[0.5510662177328844, 0.24242424242424243, 0.20...","(891,)",1,False,891,915,<bound method Series.memory_usage of 0 T...,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.categorical.Categor...,<class 'pandas.core.arrays.categorical.Categor...
9,9,<NYI>,<NYI>,who,<NYI>,<NYI>,object,<NYI>,<NYI>,False,...,"[0.6026936026936027, 0.3041526374859708, 0.093...","(891,)",1,False,891,7128,<bound method Series.memory_usage of 0 ...,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.numpy_.PandasArray'>,<class 'numpy.ndarray'>


#### Home Credit example

In [9]:
from home_credit.load import get_table
from pepper.univar import dataframe_infos

application = get_table("application")
display(dataframe_infos(application))

Unnamed: 0,idx,group,subgroup,name,domain,format,dtype,astype,unity,is_numeric,...,mod_freqs,shape,ndim,empty,size,nbytes,memory_usage,flags,array container type,values container type
0,0,<NYI>,<NYI>,SK_ID_CURR,<NYI>,<NYI>,int64,<NYI>,<NYI>,True,...,,"(356255,)",1,False,356255,2850040,<bound method Series.memory_usage of 0 ...,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.numpy_.PandasArray'>,<class 'numpy.ndarray'>
1,1,<NYI>,<NYI>,TARGET,<NYI>,<NYI>,int64,<NYI>,<NYI>,True,...,,"(356255,)",1,False,356255,2850040,<bound method Series.memory_usage of 0 ...,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.numpy_.PandasArray'>,<class 'numpy.ndarray'>
2,2,<NYI>,<NYI>,NAME_CONTRACT_TYPE,<NYI>,<NYI>,object,<NYI>,<NYI>,False,...,"[0.9165822234073908, 0.08341777659260922]","(356255,)",1,False,356255,2850040,<bound method Series.memory_usage of 0 ...,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.numpy_.PandasArray'>,<class 'numpy.ndarray'>
3,3,<NYI>,<NYI>,CODE_GENDER,<NYI>,<NYI>,object,<NYI>,<NYI>,False,...,"[0.6599935439502603, 0.33999522813714894, 1.12...","(356255,)",1,False,356255,2850040,<bound method Series.memory_usage of 0 ...,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.numpy_.PandasArray'>,<class 'numpy.ndarray'>
4,4,<NYI>,<NYI>,FLAG_OWN_CAR,<NYI>,<NYI>,object,<NYI>,<NYI>,False,...,"[0.660299504568357, 0.33970049543164305]","(356255,)",1,False,356255,2850040,<bound method Series.memory_usage of 0 ...,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.numpy_.PandasArray'>,<class 'numpy.ndarray'>
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
117,117,<NYI>,<NYI>,AMT_REQ_CREDIT_BUREAU_DAY,<NYI>,<NYI>,float64,<NYI>,<NYI>,True,...,,"(356255,)",1,False,356255,2850040,<bound method Series.memory_usage of 0 ...,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.numpy_.PandasArray'>,<class 'numpy.ndarray'>
118,118,<NYI>,<NYI>,AMT_REQ_CREDIT_BUREAU_WEEK,<NYI>,<NYI>,float64,<NYI>,<NYI>,True,...,,"(356255,)",1,False,356255,2850040,<bound method Series.memory_usage of 0 ...,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.numpy_.PandasArray'>,<class 'numpy.ndarray'>
119,119,<NYI>,<NYI>,AMT_REQ_CREDIT_BUREAU_MON,<NYI>,<NYI>,float64,<NYI>,<NYI>,True,...,,"(356255,)",1,False,356255,2850040,<bound method Series.memory_usage of 0 ...,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.numpy_.PandasArray'>,<class 'numpy.ndarray'>
120,120,<NYI>,<NYI>,AMT_REQ_CREDIT_BUREAU_QRT,<NYI>,<NYI>,float64,<NYI>,<NYI>,True,...,,"(356255,)",1,False,356255,2850040,<bound method Series.memory_usage of 0 ...,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.numpy_.PandasArray'>,<class 'numpy.ndarray'>


In [10]:
report = dataframe_infos(application)
display(report.iloc[0])

idx                                                                      0
group                                                                <NYI>
subgroup                                                             <NYI>
name                                                            SK_ID_CURR
domain                                                               <NYI>
format                                                               <NYI>
dtype                                                                int64
astype                                                               <NYI>
unity                                                                <NYI>
is_numeric                                                            True
n                                                                   356255
hasnans                                                              False
n_unique                                                            356255
n_notna                  

## **`data_report`**`(data, csv_filename)`

Used, without going beyond the prototype stage, in `z_experiment/analyse_expl.ipynb`.

This function delegates to **`dataframe_infos`**`(data)` and exports the result to a CSV file, after some post-processing to reduce the size of certain list-valued data.
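One plausible form of that post-processing is sketched here, under the assumption that "reducing" means truncating long lists and serializing them as text before the CSV export. The helper `shorten_lists`, its `max_items` parameter, and the sample `report` are hypothetical; the actual reduction performed by `data_report` may differ.

```python
import io

import pandas as pd


def shorten_lists(value, max_items: int = 3):
    """Serialize a list as text, keeping at most `max_items` elements."""
    if not isinstance(value, list):
        return value  # scalars and None pass through unchanged
    head = ", ".join(str(v) for v in value[:max_items])
    suffix = ", ..." if len(value) > max_items else ""
    return f"[{head}{suffix}]"


# Hypothetical stand-in for a report with one list-valued column.
report = pd.DataFrame({
    "name": ["A", "B"],
    "modalities": [None, ["a", "b", "c", "d", "e"]],
})
report["modalities"] = report["modalities"].map(shorten_lists)

# Writing to an in-memory buffer keeps the sketch free of file-system side effects.
buffer = io.StringIO()
report.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```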

**TODO** Unit tests at several scales: hand-written micro cases, a synthetic dataset built with a `make_*` generator, a real case such as Iris, and a full-scale run on the current project.

In [1]:
from pepper.univar import data_report
from home_credit.load import get_application

app = get_application()

# Build the report dataframe
report = data_report(app)   #, 'application_data_report.csv')
display(report)

load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\application_train.pqt
load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\application_test.pqt


Unnamed: 0,idx,group,subgroup,name,domain,format,dtype,astype,unity,is_numeric,...,mod_counts,mod_freqs,shape,ndim,empty,size,nbytes,flags,array container type,values container type
0,0,<NYI>,<NYI>,SK_ID_CURR,<NYI>,<NYI>,int64,<NYI>,<NYI>,True,...,,,"(356255,)",1,False,356255,2850040,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.numpy_.PandasArray'>,<class 'numpy.ndarray'>
1,1,<NYI>,<NYI>,TARGET,<NYI>,<NYI>,int64,<NYI>,<NYI>,True,...,,,"(356255,)",1,False,356255,2850040,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.numpy_.PandasArray'>,<class 'numpy.ndarray'>
2,2,<NYI>,<NYI>,NAME_CONTRACT_TYPE,<NYI>,<NYI>,object,<NYI>,<NYI>,False,...,,,"(356255,)",1,False,356255,2850040,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.numpy_.PandasArray'>,<class 'numpy.ndarray'>
3,3,<NYI>,<NYI>,CODE_GENDER,<NYI>,<NYI>,object,<NYI>,<NYI>,False,...,,,"(356255,)",1,False,356255,2850040,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.numpy_.PandasArray'>,<class 'numpy.ndarray'>
4,4,<NYI>,<NYI>,FLAG_OWN_CAR,<NYI>,<NYI>,object,<NYI>,<NYI>,False,...,,,"(356255,)",1,False,356255,2850040,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.numpy_.PandasArray'>,<class 'numpy.ndarray'>
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
117,117,<NYI>,<NYI>,AMT_REQ_CREDIT_BUREAU_DAY,<NYI>,<NYI>,float64,<NYI>,<NYI>,True,...,,,"(356255,)",1,False,356255,2850040,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.numpy_.PandasArray'>,<class 'numpy.ndarray'>
118,118,<NYI>,<NYI>,AMT_REQ_CREDIT_BUREAU_WEEK,<NYI>,<NYI>,float64,<NYI>,<NYI>,True,...,,,"(356255,)",1,False,356255,2850040,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.numpy_.PandasArray'>,<class 'numpy.ndarray'>
119,119,<NYI>,<NYI>,AMT_REQ_CREDIT_BUREAU_MON,<NYI>,<NYI>,float64,<NYI>,<NYI>,True,...,,,"(356255,)",1,False,356255,2850040,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.numpy_.PandasArray'>,<class 'numpy.ndarray'>
120,120,<NYI>,<NYI>,AMT_REQ_CREDIT_BUREAU_QRT,<NYI>,<NYI>,float64,<NYI>,<NYI>,True,...,,,"(356255,)",1,False,356255,2850040,<Flags(allows_duplicate_labels=True)>,<class 'pandas.core.arrays.numpy_.PandasArray'>,<class 'numpy.ndarray'>


In [17]:
display(report[report.name == "SK_ID_CURR"].T)

Unnamed: 0,1
idx,1
group,<NYI>
subgroup,<NYI>
name,SK_ID_CURR
domain,<NYI>
format,<NYI>
dtype,int64
astype,<NYI>
unity,<NYI>
is_numeric,True


### All tables

In [5]:
from home_credit.utils import get_table_names
from home_credit.load import get_table
from pepper.univar import data_report

table_names = get_table_names()
tables = [get_table(table_name) for table_name in table_names]
data_reports = [data_report(table) for table in tables]

load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\previous_application.pqt
load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\bureau.pqt
load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\bureau_balance.pqt
load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\POS_CASH_balance.pqt
load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\credit_card_balance.pqt
load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\installments_payments.pqt


## **`data_report_to_csv_file`**`(data, csv_filename)`

In [3]:
from home_credit.load import get_application
from pepper.univar import data_report, data_report_to_csv_file

# Retrieve the dataframe
app = get_application()

# Build the report dataframe
report = data_report(app)

# Save it to a CSV file
data_report_to_csv_file(report, "data_dicts/application_data_dict.csv")

### All tables

In [6]:
from pepper.univar import data_report_to_csv_file

for table_name, report in zip(table_names, data_reports):
    data_report_to_csv_file(report, f"data_dicts/{table_name}_data_dict.csv")

## **`data_report_to_gsheet`**`(report, spread, sheet_name)`

In [16]:
from gspread_pandas import Spread
from home_credit.load import get_table
from pepper.univar import data_report, data_report_to_gsheet

# Target GSheet
spread = Spread("1KP0iX6YxZO-GS0DLeqrhdRB4l1unkF2nRn0onoOlS7s")
table_name = "installments_payments"
table = get_table(table_name)
report = data_report(table)
# Export to 'table_name' sheet
data_report_to_gsheet(report, spread, table_name)

### All tables

In [18]:
from gspread_pandas import Spread
from pepper.univar import data_report_to_gsheet

# Target GSheet
spread = Spread("1KP0iX6YxZO-GS0DLeqrhdRB4l1unkF2nRn0onoOlS7s")
for table_name, report in zip(table_names, data_reports):
    data_report_to_gsheet(report, spread, table_name)

# Weighted Classes Correlation

- `agg_value_counts(s: pd.Series, agg: Union[None, bool, float, int] = .01, dropna: bool = True) -> pd.DataFrame`:
    - Compute value counts and relative frequencies of a Pandas Series.
- `target_weights(target: np.ndarray) -> Tuple[float, float]`:
    - Calculate the weights for each class in a binary target array.
- `get_sample_weights(target: np.ndarray) -> np.ndarray`:
    - Compute sample weights for each element in a binary target array.
- `wmcc(target: np.ndarray, var: np.ndarray) -> float`:
    - Compute the weighted Matthews correlation coefficient between two arrays.
- `weighted_kendall_tau(target: np.ndarray, var: np.ndarray) -> float`:
    - Calculate the weighted Kendall's Tau rank correlation between two arrays.
- `show_correlations(data: pd.DataFrame, title: str, method: str = "pearson", ratio: float = 1) -> pd.DataFrame`:
    - Compute pairwise correlations between columns in a Pandas DataFrame and
    display the correlation matrix as a heatmap.
- `test_wmcc(target) -> None`:
    - Test the `wmcc` function using various scenarios.
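To make the idea of a class-weighted correlation concrete, here is a minimal NumPy sketch. It is not the code of `target_weights` or `wmcc`: the helper names are hypothetical, and the weighting convention (class weight `n / (2 * n_c)`, so that both classes carry the same total weight) is an assumption. The Matthews coefficient is then computed on the sample-weighted confusion matrix. Both classes are assumed to be present in `target`.

```python
from typing import Tuple

import numpy as np


def balanced_class_weights(target: np.ndarray) -> Tuple[float, float]:
    """Assumed convention: weight of class c is n / (2 * n_c)."""
    target = np.asarray(target).astype(bool)
    n = target.size
    n_pos = int(target.sum())
    n_neg = n - n_pos
    return n / (2 * n_neg), n / (2 * n_pos)


def weighted_mcc(target: np.ndarray, var: np.ndarray) -> float:
    """Matthews correlation computed on a sample-weighted confusion matrix."""
    target = np.asarray(target).astype(bool)
    var = np.asarray(var).astype(bool)
    w_neg, w_pos = balanced_class_weights(target)
    w = np.where(target, w_pos, w_neg)  # per-sample weights
    tp = w[target & var].sum()
    tn = w[~target & ~var].sum()
    fp = w[~target & var].sum()
    fn = w[target & ~var].sum()
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return float((tp * tn - fp * fn) / denom) if denom else 0.0


# Imbalanced binary target: 8 negatives, 2 positives.
target = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
weights = balanced_class_weights(target)
perfect = weighted_mcc(target, target)
inverted = weighted_mcc(target, 1 - target)
```

With these weights, the 8 negatives and the 2 positives each contribute a total weight of 5, so the minority class is not drowned out by the majority class.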

## **`agg_value_counts`**`(s, agg, dropna)`

## **`target_weights`**`(target)`

## **`get_sample_weights`**`(target)`

## **`wmcc`**`(target, var)`

## **`weighted_kendall_tau`**`(target, var)`

## **`show_correlations`**`(data, title, method, ratio)`

## **`test_wmcc`**`(target)`