## Latent Personal Analysis in Python
by Uri Alon

This short tutorial demonstrates how to use the Python implementation of Latent Personal Analysis (LPA), as described in the article https://arxiv.org/pdf/2004.02346.pdf.
The first implementation of LPA was written in SQL and can be found at https://github.com/hagitbenshoshan/text_distance/
For very large datasets, users of this package are encouraged to switch to the SQL implementation on cloud infrastructure, where the results are calculated much faster.

Dependencies:
- pandas
- numpy
- scipy 

Dependencies for the further analysis demonstrated in this notebook:
- scikit-learn
- plotly

In [1]:
import pandas as pd
import numpy as np
from scipy.spatial import distance
import LPA
import plotly.express as px
from sklearn.decomposition import PCA

The algorithm assumes input in the format (element, category_id, frequency_in_category). Data in any other form will lead to erroneous results. In our case, elements are clone IDs, categories are samples, and frequencies are instance counts.

In [2]:
df = pd.read_csv('clones_D145.csv').iloc[:,0:3]
df.columns = ['element','category_id','frequency_in_category']
df['category_id'] = df['category_id'].astype(str)
df['element'] = df['element'].astype(str)

In [3]:
df.head()

,element,category_id,frequency_in_category
0,917504,657,8
1,917505,158,1
2,917505,159,2
3,917505,647,1
4,917506,657,6
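
Because incorrectly shaped input leads to erroneous results rather than an obvious failure, it can help to check a frame before passing it on. The following is a minimal sketch and not part of the LPA package: `check_lpa_input` is a hypothetical helper, and the specific checks (exact column names, no repeated element/category pairs, positive frequencies) are assumptions about what "correct form" means here.

```python
import pandas as pd

EXPECTED_COLUMNS = ['element', 'category_id', 'frequency_in_category']


def check_lpa_input(frame):
    """Hypothetical helper: return a list of problems found in the frame."""
    problems = []
    if list(frame.columns) != EXPECTED_COLUMNS:
        problems.append(
            f'columns should be {EXPECTED_COLUMNS}, got {list(frame.columns)}'
        )
        return problems  # the remaining checks rely on the column names
    # Assumption: each (element, category_id) pair should appear only once.
    if frame.duplicated(subset=['element', 'category_id']).any():
        problems.append('duplicate (element, category_id) rows')
    # Assumption: frequencies are positive counts.
    if (frame['frequency_in_category'] <= 0).any():
        problems.append('non-positive frequencies')
    return problems


# Toy data, made up for illustration (not taken from clones_D145.csv).
toy = pd.DataFrame(
    [('a', 's1', 3), ('b', 's1', 1), ('a', 's2', 2)],
    columns=EXPECTED_COLUMNS,
)
print(check_lpa_input(toy))
```

An empty list means none of the assumed checks failed; it does not guarantee that the data is meaningful for LPA.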


Once the data is loaded, LPA can be applied to it in several ways. The following examples demonstrate them.

#### Distance from the world
The following method returns the distance of every category from the domain (the "world"), as a value between 0 and 1.

In [4]:
LPA.distance_from_world(df)

category_id,distance_summary
148,0.908867
149,0.954583
150,0.872462
151,0.664769
152,0.651656
153,0.624774
154,0.671368
155,0.654299
156,0.688119
157,0.535183
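
To give a feel for what a distance from the domain measures, here is a plain-Python sketch on made-up data. It assumes a symmetrized KL-style divergence of the form (p - q) * log(p / q), with a small weight `epsilon` standing in for elements that a category lacks. The function names, the value of `epsilon`, and the absence of any normalization are assumptions for illustration, so the numbers will not match the output of `LPA.distance_from_world`.

```python
import math
from collections import defaultdict

# Toy (element, category_id, frequency_in_category) rows, made up for illustration.
rows = [
    ('a', 's1', 8), ('b', 's1', 2),
    ('a', 's2', 1), ('c', 's2', 9),
]


def to_distribution(counts):
    """Normalize raw counts so that they sum to 1."""
    total = sum(counts.values())
    return {element: n / total for element, n in counts.items()}


# Per-category counts and the aggregated domain ("world") counts.
per_category = defaultdict(dict)
world_counts = defaultdict(int)
for element, category, freq in rows:
    per_category[category][element] = freq
    world_counts[element] += freq

world = to_distribution(world_counts)


def divergence_from_world(category_counts, world, epsilon):
    """Assumed form: sum over all elements of (p - q) * log(p / q)."""
    dist = to_distribution(category_counts)
    total = 0.0
    for element, q in world.items():
        # Elements missing from the category get the small weight epsilon.
        p = dist.get(element, epsilon)
        total += (p - q) * math.log(p / q)
    return total


# Assumption: one over twice the number of distinct elements.
epsilon = 1 / (len(world) * 2)
for category, counts in per_category.items():
    print(category, round(divergence_from_world(counts, world, epsilon), 4))
```

Every term in the sum is non-negative, so a category that matches the domain exactly scores 0 and the score grows as the category departs from it.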


#### Creating Signatures
Another prominent use of LPA is creating a signature for every category, made up of the terms most meaningful to that category, whether through their prominence or their absence. By default the signature length is 500 and epsilon is set to 1/(corpus size * 2), but both parameters can be changed when calling the method.
`epsilon_frac`: a number greater than 1 decreases epsilon (the default weight given to missing terms), while a number between 0 and 1 increases it.

In [5]:
LPA.create_signatures(df)

,category_id,element,KL,existing_element_flag
0,148,912073,0.092313,0.0
1,148,910708,0.057108,1.0
2,148,937487,0.041921,1.0
3,148,906708,0.035227,1.0
4,148,913922,0.029040,0.0
...,...,...,...,...
495,663,898718,0.000442,1.0
496,663,916272,0.000442,1.0
497,663,909442,0.000442,1.0
498,663,887844,0.000442,1.0


In [7]:
LPA.create_signatures(df, epsilon_frac=5, sig_length=20)

,category_id,element,KL,existing_element_flag
0,148,912073,0.102636,0.0
1,148,910708,0.057108,1.0
2,148,937487,0.041921,1.0
3,148,906708,0.035227,1.0
4,148,913922,0.032759,0.0
...,...,...,...,...
15,663,889373,0.006784,1.0
16,663,907782,0.006750,1.0
17,663,909584,0.006199,0.0
18,663,886894,0.006065,0.0
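
The effect of `epsilon_frac` and `sig_length` can be illustrated with a small plain-Python sketch. It is not the package's code: the `signature` function, the per-element term (p - q) * log(p / q), and the form epsilon = 1 / (number of elements * epsilon_frac) are assumptions, and the distributions are made up.

```python
import math


def signature(category_dist, world, epsilon_frac=2, sig_length=3):
    """Rank elements by an assumed per-element divergence term."""
    # Assumed form of epsilon: a larger epsilon_frac gives a smaller epsilon.
    epsilon = 1 / (len(world) * epsilon_frac)
    terms = []
    for element, q in world.items():
        present = element in category_dist
        p = category_dist[element] if present else epsilon
        kl = (p - q) * math.log(p / q)
        # The flag mirrors existing_element_flag: 1.0 if present, else 0.0.
        terms.append((element, kl, 1.0 if present else 0.0))
    # Keep the sig_length elements with the largest terms.
    terms.sort(key=lambda t: t[1], reverse=True)
    return terms[:sig_length]


# Toy distributions, made up for illustration; 'c' is absent from the category.
world = {'a': 0.45, 'b': 0.10, 'c': 0.45}
category = {'a': 0.8, 'b': 0.2}

for epsilon_frac in (2, 5):
    print(epsilon_frac, signature(category, world, epsilon_frac=epsilon_frac, sig_length=2))
```

In this sketch, raising `epsilon_frac` changes only the term of the missing element, which grows, while the terms of existing elements stay the same. The two outputs above show the same pattern: the rows with `existing_element_flag` 0.0 change between the calls, and the rows flagged 1.0 do not.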


#### Distances between pairs of categories (Sockpuppet Distance)
Finally, one can use the signatures created to calculate the L1 distance between every pair of categories. The method is described in more detail at https://github.com/hagitbenshoshan/text_distance/blob/master/Step3.md
Different signature lengths and epsilons can have dramatic effects on the results.

In [8]:
LPA.SockPuppetDistance(LPA.create_signatures(df), df)

,user1,user2,distance_between_users
0,657,657,0.000000
1,657,158,0.886054
2,657,159,0.909390
3,657,647,0.911908
4,657,157,0.902767
...,...,...,...
2911,170,164,0.489834
2912,170,166,0.315109
2913,170,167,0.289273
2914,170,168,0.461529


In [9]:
LPA.SockPuppetDistance(LPA.create_signatures(df, epsilon_frac=10, sig_length=300), df)

,user1,user2,distance_between_users
0,657,657,0.000000
1,657,158,0.841226
2,657,159,0.840656
3,657,647,0.812306
4,657,157,0.826292
...,...,...,...
2911,170,164,0.629986
2912,170,166,0.442135
2913,170,167,0.429387
2914,170,168,0.492493
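
As a reminder of what an L1 distance between two signatures involves, here is a minimal plain-Python sketch. The signatures are made-up element-to-weight mappings, `l1_distance` is a hypothetical helper, and treating an element missing from one signature as weight 0 is an assumption; the package may handle missing elements and scaling differently.

```python
from itertools import combinations


def l1_distance(sig_a, sig_b):
    """Sum of absolute weight differences over the union of elements."""
    elements = set(sig_a) | set(sig_b)
    # Assumption: an element absent from a signature contributes weight 0.
    return sum(abs(sig_a.get(e, 0.0) - sig_b.get(e, 0.0)) for e in elements)


# Toy signatures mapping element -> weight (values made up for illustration).
signatures = {
    's1': {'a': 0.20, 'c': 0.28},
    's2': {'a': 0.05, 'b': 0.10},
    's3': {'a': 0.20, 'c': 0.28},
}

# Compare every unordered pair of categories once.
for first, second in combinations(signatures, 2):
    print(first, second, round(l1_distance(signatures[first], signatures[second]), 4))
```

Because the distance is computed only over the elements that made it into the signatures, changing the signature length or epsilon changes which elements are compared, which is one reason the results can shift so much.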


### Examples of Further analysis - PCA
Once we have calculated the distances between every pair of categories, we can perform further analysis on the results. The following example demonstrates this using scikit-learn's PCA method.

In [18]:
df = pd.read_csv('D145_PCA.csv')

In [22]:
tissue_colors = {
    'PBL': '#67001f',
    'BM': '#b2182b',
    'SPL': '#d6604d',
    'Lung': '#f4a582',
    'MLN': '#515151',
    'Duodenum': '#a1daf7',
    'Jejunum': '#92c5de',
    'Ileum': '#4393c3',
    'Colon': '#2166ac',
}


In [23]:
fig = px.scatter_3d(df, x='x', y='y', z='z', color='tissue', color_discrete_map=tissue_colors)


In [24]:
fig.show()
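
The plot above reads ready-made `x`, `y`, `z` columns from a file. As a sketch of one way such coordinates could be derived from a table of pairwise distances, the example below pivots a toy table (same column names as the `SockPuppetDistance` output, values made up) into a square matrix and reduces it to three components with scikit-learn's `PCA`. Whether `D145_PCA.csv` was produced this way is an assumption, and the `tissue` labels would have to be joined in from sample metadata that is not shown here.

```python
import pandas as pd
from sklearn.decomposition import PCA

# Toy symmetric distances between four categories (made up for illustration).
labels = ['s1', 's2', 's3', 's4']
upper = {
    ('s1', 's2'): 0.9, ('s1', 's3'): 0.8, ('s1', 's4'): 0.3,
    ('s2', 's3'): 0.2, ('s2', 's4'): 0.7, ('s3', 's4'): 0.6,
}
records = []
for a in labels:
    for b in labels:
        d = 0.0 if a == b else upper.get((a, b), upper.get((b, a)))
        records.append((a, b, d))
pairs = pd.DataFrame(records, columns=['user1', 'user2', 'distance_between_users'])

# One row per category, one column per category it is compared with.
matrix = pairs.pivot(index='user1', columns='user2', values='distance_between_users')

# Treat each row of distances as a feature vector and keep three components.
coords = PCA(n_components=3).fit_transform(matrix.to_numpy())
pca_df = pd.DataFrame(coords, columns=['x', 'y', 'z'], index=matrix.index).reset_index()
print(pca_df)
```

The resulting frame has one row per category, so after adding a label column it could be passed to `px.scatter_3d` in the same way as `df` above.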

## Good Luck!