👑 Multivariate exploratory data analysis in Python — PCA, CA, MCA, MFA, FAMD, GPA

# MaxHalford/prince

Prince is a Python library for multivariate exploratory data analysis. It includes a variety of methods for summarizing tabular data, including principal component analysis (PCA) and correspondence analysis (CA). Prince provides efficient implementations that follow the scikit-learn API.

## Example usage

```python
>>> import prince

>>> dataset = prince.datasets.load_decathlon()
>>> decastar = dataset.query('competition == "Decastar"')

>>> pca = prince.PCA(n_components=5)
>>> pca = pca.fit(decastar, supplementary_columns=['rank', 'points'])
>>> pca.eigenvalues_summary
          eigenvalue % of variance % of variance (cumulative)
component
0              3.114        31.14%                     31.14%
1              2.027        20.27%                     51.41%
2              1.390        13.90%                     65.31%
3              1.321        13.21%                     78.52%
4              0.861         8.61%                     87.13%

>>> pca.transform(dataset).tail()
component                       0         1         2         3         4
competition athlete
OlympicG    Lorenzo      2.070933  1.545461 -1.272104 -0.215067 -0.515746
            Karlivans    1.321239  1.318348  0.138303 -0.175566 -1.484658
            Korkizoglou -0.756226 -1.975769  0.701975 -0.642077 -2.621566
            Uldal        1.905276 -0.062984 -0.370408 -0.007944 -2.040579
            Casarsa      2.282575 -2.150282  2.601953  1.196523 -3.571794
```
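The percentage columns in the summary follow directly from the eigenvalues. As a minimal sketch, assuming the ten decathlon events are the only active columns and are standardized (so the total variance is 10, which is consistent with the figures above), they can be reproduced by hand. The name `total_inertia` is illustrative and not part of Prince.

```python
# Eigenvalues copied from the eigenvalues summary above.
eigenvalues = [3.114, 2.027, 1.390, 1.321, 0.861]

# Assumption: 10 standardized active columns, each contributing a variance of 1.
total_inertia = 10

# Share of the total variance carried by each component, in percent.
pct = [round(100 * ev / total_inertia, 2) for ev in eigenvalues]

# Running total of those shares, in percent.
cumulative = []
running = 0.0
for ev in eigenvalues:
    running += ev
    cumulative.append(round(100 * running / total_inertia, 2))

print(pct)
print(cumulative)
```

Under that assumption, the five retained components account for 87.13% of the variance, matching the last row of the summary.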
`>>> chart = pca.plot(dataset)`

This chart is interactive, which doesn't show on GitHub. The green points are the column loadings.

```python
>>> chart = pca.plot(
...     dataset,
...     show_row_labels=True,
...     show_row_markers=False,
...     row_labels_column='athlete',
...     color_rows_by='competition'
... )
```

## Installation

`pip install prince`

🎨 Prince uses Altair for making charts.

## Methods

```mermaid
flowchart TD
cat?(Categorical data?) --> |"✅"| num_too?(Numerical data too?)
num_too? --> |"✅"| FAMD
num_too? --> |"❌"| multiple_cat?(More than two columns?)
multiple_cat? --> |"✅"| MCA
multiple_cat? --> |"❌"| CA
cat? --> |"❌"| groups?(Groups of columns?)
groups? --> |"✅"| MFA
groups? --> |"❌"| shapes?(Analysing shapes?)
shapes? --> |"✅"| GPA
shapes? --> |"❌"| PCA
```
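The same decision path can be written as a small helper. This is only a sketch of the flowchart: `suggest_method` and its parameters are hypothetical names, not part of Prince.

```python
def suggest_method(
    has_categorical: bool,
    has_numerical: bool = False,
    n_categorical_columns: int = 0,
    has_column_groups: bool = False,
    analysing_shapes: bool = False,
) -> str:
    """Return the method name the flowchart points to (hypothetical helper)."""
    if has_categorical:
        # Mixed categorical and numerical data.
        if has_numerical:
            return "FAMD"
        # Categorical only: more than two columns leads to MCA, otherwise CA.
        return "MCA" if n_categorical_columns > 2 else "CA"
    # Numerical only: grouped columns lead to MFA.
    if has_column_groups:
        return "MFA"
    # Shape analysis leads to GPA; plain numerical data leads to PCA.
    return "GPA" if analysing_shapes else "PCA"


print(suggest_method(has_categorical=True, n_categorical_columns=5))
print(suggest_method(has_categorical=False))
```

The returned string matches a class name on the `prince` module for the methods shown in the usage example, such as `prince.PCA`.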

## Correctness

Prince is tested against scikit-learn and FactoMineR. For the latter, rpy2 runs the R code and converts the results to Python, which makes automated testing possible. See the `tests` directory for details.
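The general idea of checking one implementation against a reference can be illustrated without Prince or R. The sketch below is not taken from the project's test suite: it uses only NumPy and synthetic data, and it compares PCA eigenvalues obtained two ways, from the correlation matrix and from an SVD of the standardized data.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 50 rows, 4 columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[:, 1] += 0.8 * X[:, 0]  # introduce some correlation between two columns

# Reference route: eigenvalues of the correlation matrix, largest first.
corr = np.corrcoef(X, rowvar=False)
eig_reference = np.sort(np.linalg.eigvalsh(corr))[::-1]

# Second route: SVD of the standardized data (population standard deviation).
Z = (X - X.mean(axis=0)) / X.std(axis=0)
singular_values = np.linalg.svd(Z, compute_uv=False)
eig_svd = singular_values**2 / len(Z)

# Both routes should agree up to floating-point error.
np.testing.assert_allclose(eig_svd, eig_reference, rtol=1e-8)
```

An automated test in this style fails loudly as soon as the two routes drift apart, which is what makes a reference implementation useful.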

## Citation

Please use this citation if you use this software as part of a scientific publication.

```bibtex
@software{Halford_Prince,
    author = {Halford, Max},
    title = {{Prince}},
    url = {https://github.com/MaxHalford/prince}
}
```

## Support

I made Prince when I was at university, back in 2016. I've had very little time over the years to maintain this package. I spent a significant amount of time in 2022 revamping the entire package. Prince has now been downloaded over 1 million times. I would be grateful to anyone willing to sponsor me. Sponsorships allow me to spend more time working on open source software, including Prince.
