#**Factor Analysis of Mixed Data(FAMD)**

In the previous 4 steps we have examined how we can use principal component and multiple correspondence analysis to help us search for latent factors for both continous and categorical data respectively. The problem we have is that most of our datasets have both categorical and quantative data.

FAMD is a method dedicated to explore data with both continuous and categorical variables. It can be seen roughly as a mixed between PCA and MCA. More precisely, the continuous variables are scaled to unit variance and the categorical variables are transformed into a set of dummy variables and then scaled using the specific scaling of MCA. This ensures to balance the influence of both continous and categorical variables in the analysis. It means that both variables are on a equal foot to determine the dimensions of variability. This method allows one to study the similarities between individuals taking into account mixed variables and to study the relationships between all the variables. It also provides graphical outputs such as the representation of the individuals, the correlation circle for the continuous variables and representations of the categories of the categorical variables, and also specific graphs to visulaize the associations between both type of variables,[RDocumentation](https://www.rdocumentation.org/packages/FactoMineR/versions/2.0/topics/FAMD).

Now the issue we have is that the libraries that are available in Python are not as well established as those provided by R, the statistical programming language. You can find out how to install R [here](https://cran.r-project.org/bin/windows/base/), and this [video](http://www.sthda.com/english/articles/22-principal-component-methods-videos/72-famd-in-r-using-factominer-quick-scripts-and-videos/) shows you how to do FAMD using the R FactoMineR library. There is an implementation of a FAMD using python's "prince" library.


In [2]:
!pip install light_famd



We will create another toy dataset with both categorical and continous data.

In [3]:
from light_famd import FAMD
import pandas as pd
import numpy as np
#from sklearn.preprocessing import OneHotEncoder
X_n = pd.DataFrame(data=np.random.randint(0,100,size=(100,2)),columns=list('AB'))
X_c =pd.DataFrame(np.random.choice(list('abcde'),size=(100,4),replace=True),columns =list('CDEF'))
X=pd.concat([X_n,X_c],axis=1)
print(X)
X.to_csv('FAMD_dataset.csv')

     A   B  C  D  E  F
0    2  98  d  a  d  d
1    2  66  b  b  c  a
2    2  21  a  d  d  b
3   22   8  b  b  d  d
4   47  36  a  a  a  a
..  ..  .. .. .. .. ..
95  43   2  e  e  a  d
96  96  87  d  c  e  d
97  65   1  e  d  c  e
98  69  17  d  a  c  d
99  65  16  b  d  b  a

[100 rows x 6 columns]


Now let use to FAMD to build a 4 component structure on the X dataset.

In [4]:
!pip install prince

Collecting prince
  Downloading prince-0.16.2-py3-none-any.whl.metadata (6.4 kB)
Collecting altair<6.0.0,>=5.0.0 (from prince)
  Downloading altair-5.5.0-py3-none-any.whl.metadata (11 kB)
Collecting narwhals>=1.14.2 (from altair<6.0.0,>=5.0.0->prince)
  Downloading narwhals-2.13.0-py3-none-any.whl.metadata (12 kB)
Downloading prince-0.16.2-py3-none-any.whl (179 kB)
Downloading altair-5.5.0-py3-none-any.whl (731 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m731.2/731.2 kB[0m [31m46.6 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading narwhals-2.13.0-py3-none-any.whl (426 kB)
Installing collected packages: narwhals, altair, prince
Successfully installed altair-5.5.0 narwhals-2.13.0 prince-0.16.2


In [6]:
import pandas as pd
import numpy as np
import prince

# Read the dataset
X = pd.read_csv('FAMD_dataset.csv')

# Convert specific columns to float if needed
X = X.astype({'A': 'float', 'B': 'float'})
print(type(X['A'][0]))

# Create and fit the FAMD model
famd = prince.FAMD(
    n_components=5,
    n_iter=5,
    copy=True,
    check_input=True,
    random_state=42,
    engine="sklearn",
    handle_unknown="error"
)

famd = famd.fit(X)

<class 'numpy.float64'>


In [7]:
print(famd.eigenvalues_summary)

print(famd.transform(X))


          eigenvalue % of variance % of variance (cumulative)
component                                                    
0              6.404         2.21%                      2.21%
1              6.175         2.13%                      4.34%
2              6.098         2.11%                      6.45%
3              5.805         2.01%                      8.46%
4              5.542         1.91%                     10.37%
component         0         1         2         3         4
0         -5.172401  0.627090  0.459032  3.836371  0.149998
1         -0.332211 -2.634102 -2.361540 -1.664587 -4.547124
2          3.957964  0.800320  0.393471  5.177911  0.250136
3         -2.329728 -1.082533 -0.883503  2.472405 -4.930956
4         -1.119615 -1.614853 -1.642344  1.854448  5.899648
..              ...       ...       ...       ...       ...
95        -2.382739 -2.717907 -0.137299 -2.488238  1.630764
96        -1.976280  1.251917  6.109324  1.309447 -0.095182
97         1.300916 -1.457

#**Conclusions**

We have implemented a FAMD using  "prince".

