## Latent Personal Analysis in Python
by Uri Alon

This short tutorial demonstrates how to use the Python implementation of LPA, as described in the article https://arxiv.org/pdf/2004.02346.pdf.
The first implementation of LPA was written in SQL and can be found at https://github.com/hagitbenshoshan/text_distance/
For very large datasets, users of this package are encouraged to switch to the SQL implementation on cloud infrastructure, where the results are calculated much faster.

Dependencies:
- pandas
- numpy
- scipy 

Dependencies for the further analysis demonstrated in this notebook:
- scikit-learn
- plotly

In [7]:
import pandas as pd
import numpy as np
from scipy.spatial import distance
import LPA
import plotly.express as px
from sklearn.decomposition import PCA

The algorithm expects its input as rows of the form (element, category_id, frequency_in_category). Data in any other form will lead to erroneous results.
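
For data that is not yet in this shape, here is a minimal sketch of one way to build such rows from raw text. The `raw_texts` dictionary and the `to_lpa_rows` helper are hypothetical names introduced for illustration, and plain whitespace tokenization is an assumption; real preprocessing will likely be more involved.

```python
from collections import Counter

# Hypothetical raw input: one free-text document per category.
raw_texts = {
    "Barcelona": "sand sea sand crowded",
    "Haifa": "sea quiet sea sea",
}

def to_lpa_rows(texts):
    """Count how often each whitespace-separated token occurs in each category."""
    rows = []
    for category_id, text in texts.items():
        counts = Counter(text.lower().split())
        for element, frequency in sorted(counts.items()):
            rows.append({
                "element": element,
                "category_id": category_id,
                "frequency_in_category": frequency,
            })
    return rows

rows = to_lpa_rows(raw_texts)
# pd.DataFrame(rows) would turn these records into a frame with the three expected columns.
```

Each record holds one element, the category it was counted in, and its count there, which is the triple the algorithm expects.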

In [2]:
df = pd.read_csv('TripAdvisor_beaches2609.csv')
df['category_id'] = df['category_id'].astype(str)
df['element'] = df['element'].astype(str)

In [3]:
df.head()

|  | element | category_id | frequency_in_category |
|---|---|---|---|
| 0 | abit | Al-Iskandariyah (Alexandria) | 4 |
| 1 | accepted | Al-Iskandariyah (Alexandria) | 5 |
| 2 | activity | Al-Iskandariyah (Alexandria) | 13 |
| 3 | afraid | Al-Iskandariyah (Alexandria) | 5 |
| 4 | age | Al-Iskandariyah (Alexandria) | 5 |


Once the data is loaded, LPA can be used to compute several kinds of results, as the following examples demonstrate.

#### Distance from the world
The `distance_from_world` method returns the distance of every category from the domain, as a measure between 0 and 1.
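
To make the idea of a bounded distance from the domain concrete, here is a minimal sketch on toy data. It uses the Jensen-Shannon divergence with base-2 logarithms purely as a stand-in that happens to lie in [0, 1]; this is an assumption for illustration, and the measure the library actually computes may differ. The toy `rows` and the helper names are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical toy data in the (element, category_id, frequency_in_category) format.
rows = [
    ("sand", "A", 6), ("sea", "A", 4),
    ("sand", "B", 1), ("sea", "B", 9),
    ("sand", "C", 5), ("sea", "C", 5),
]

def normalize(counts):
    """Turn raw counts into a probability distribution."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, bounded in [0, 1]."""
    def kl(a, m):
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)
    keys = set(p) | set(q)
    p = {k: p.get(k, 0.0) for k in keys}
    q = {k: q.get(k, 0.0) for k in keys}
    m = {k: 0.5 * (p[k] + q[k]) for k in keys}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

per_category = defaultdict(dict)
world_counts = defaultdict(int)
for element, category, freq in rows:
    per_category[category][element] = freq
    world_counts[element] += freq  # the "world" aggregates all categories

world = normalize(world_counts)
distances = {c: js_divergence(normalize(f), world) for c, f in per_category.items()}
```

In this toy example category B, whose word usage deviates most from the aggregate, ends up farthest from the world, and C, the most typical, ends up closest.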

In [4]:
LPA.distance_from_world(df)

| category_id | distance_summary |
|---|---|
| Al-Iskandariyah (Alexandria) | 0.865214 |
| Alicante | 0.442441 |
| Antalya | 0.417821 |
| Athínai (Athens) | 0.581777 |
| Barcelona | 0.430834 |
| Bari | 0.620462 |
| Cagliari | 0.43587 |
| Catania | 0.606696 |
| Hefa (Haifa) | 0.572605 |
| Istanbul | 0.630152 |


#### Creating Signatures
Another prominent use of LPA is creating a signature for every category: the terms that are most meaningful for that category, whether through their prominence or their absence. By default the signature length is 500 and epsilon is set to 1/(corpus size * 2); both parameters can be changed when calling the method.
`epsilon_frac`: a number greater than 1 decreases the weight of epsilon (the default weight given to missing terms), while a number between 0 and 1 increases it.
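
The role of epsilon can be illustrated with a minimal sketch. It assumes each element is scored by the symmetrized KL summand (p - q) * log2(p / q), where q is the element's probability in the whole domain and p its probability in the category, with p replaced by epsilon when the category never uses the element. Both the scoring rule and the `signature` helper are assumptions made for illustration; the library's exact computation may differ.

```python
import math

def signature(category_probs, world_probs, epsilon, sig_length):
    """Return the `sig_length` highest-scoring (element, score, present) triples.

    Elements the category never uses get probability `epsilon` (an assumed
    smoothing rule), so a notable absence can also enter the signature.
    """
    scored = []
    for element, q in world_probs.items():
        present = element in category_probs
        p = category_probs[element] if present else epsilon
        score = (p - q) * math.log2(p / q)  # symmetrized KL summand, never negative
        scored.append((element, score, present))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:sig_length]

# Hypothetical probabilities, chosen only for illustration.
world = {"beach": 0.5, "bar": 0.3, "winter": 0.2}
category = {"beach": 0.6, "winter": 0.4}  # "bar" never appears in this category

top = signature(category, world, epsilon=0.01, sig_length=2)
top_smaller_eps = signature(category, world, epsilon=0.001, sig_length=2)
```

Under these assumptions the missing term "bar" tops the toy signature, and shrinking epsilon raises its score further, which is one way an absent term can count as meaningful.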

In [7]:
LPA.create_signatures(df)

|  | category_id | element | KL | existing_element_flag |
|---|---|---|---|---|
| 0 | Al-Iskandariyah (Alexandria) | place | 0.019829 | 1.0 |
| 1 | Al-Iskandariyah (Alexandria) | bar | 0.015515 | 0.0 |
| 2 | Al-Iskandariyah (Alexandria) | winter | 0.013617 | 1.0 |
| 3 | Al-Iskandariyah (Alexandria) | alex | 0.012069 | 1.0 |
| 4 | Al-Iskandariyah (Alexandria) | summer | 0.010023 | 1.0 |
| ... | ... | ... | ... | ... |
| 495 | Valencia | laaarge | 0.000073 | 1.0 |
| 496 | Valencia | tulum | 0.000073 | 1.0 |
| 497 | Valencia | lamarina | 0.000073 | 1.0 |
| 498 | Valencia | trouser | 0.000073 | 1.0 |


In [8]:
LPA.create_signatures(df,epsilon_frac=5,sig_length=20)

|  | category_id | element | KL | existing_element_flag |
|---|---|---|---|---|
| 0 | Al-Iskandariyah (Alexandria) | place | 0.019829 | 1.0 |
| 1 | Al-Iskandariyah (Alexandria) | bar | 0.018494 | 0.0 |
| 2 | Al-Iskandariyah (Alexandria) | winter | 0.013617 | 1.0 |
| 3 | Al-Iskandariyah (Alexandria) | alex | 0.012069 | 1.0 |
| 4 | Al-Iskandariyah (Alexandria) | summer | 0.010023 | 1.0 |
| ... | ... | ... | ... | ... |
| 15 | Valencia | pebble | 0.001089 | 1.0 |
| 16 | Valencia | tayelet | 0.001044 | 0.0 |
| 17 | Valencia | crystal | 0.001044 | 1.0 |
| 18 | Valencia | promenade | 0.001028 | 1.0 |


#### Distances between pairs of categories (Sockpuppet Distance)
Finally, the signatures can be used to calculate the L1 distance between every pair of categories. The method is described in more detail at https://github.com/hagitbenshoshan/text_distance/blob/master/Step3.md
Different signature lengths and epsilon values can have dramatic effects on the results.
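
As a reminder of what an L1 distance between two signatures means, here is a minimal sketch. It represents each signature as a hypothetical dictionary from element to score and treats an element missing from one signature as 0; the library may normalize or handle missing elements differently, so this shows the plain definition only.

```python
def l1_distance(sig_a, sig_b):
    """Sum of absolute score differences over the union of elements."""
    elements = set(sig_a) | set(sig_b)
    return sum(abs(sig_a.get(e, 0.0) - sig_b.get(e, 0.0)) for e in elements)

# Hypothetical signatures mapping element -> score.
sig_x = {"place": 0.5, "bar": 0.25}
sig_y = {"place": 0.25, "winter": 0.125}

d_xy = l1_distance(sig_x, sig_y)  # |0.5-0.25| + |0.25-0| + |0-0.125|
d_xx = l1_distance(sig_x, sig_x)  # a signature is at distance 0 from itself
```

Because the sum runs over the union of elements, changing the signature length changes which terms take part in the comparison, which is one reason the results are sensitive to that parameter.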

In [9]:
LPA.SockPuppetDistance(LPA.create_signatures(df),df)

|  | user1 | user2 | distance_between_users |
|---|---|---|---|
| 0 | Al-Iskandariyah (Alexandria) | Al-Iskandariyah (Alexandria) | 0.000000 |
| 1 | Al-Iskandariyah (Alexandria) | Alicante | 0.991202 |
| 2 | Al-Iskandariyah (Alexandria) | Antalya | 0.972759 |
| 3 | Al-Iskandariyah (Alexandria) | Athínai (Athens) | 0.899410 |
| 4 | Al-Iskandariyah (Alexandria) | Barcelona | 0.979442 |
| ... | ... | ... | ... |
| 395 | Valencia | Palma | 0.854393 |
| 396 | Valencia | Tel Aviv-Yafo (Tel Aviv-Jaffa) | 0.853495 |
| 397 | Valencia | Thessaloniki | 0.980272 |
| 398 | Valencia | Toulon | 0.957916 |


In [3]:
LPA.SockPuppetDistance(LPA.create_signatures(df, epsilon_frac=10, sig_length=300),df)

|  | user1 | user2 | distance_between_users |
|---|---|---|---|
| 0 | Al-Iskandariyah (Alexandria) | Al-Iskandariyah (Alexandria) | 0.000000 |
| 1 | Al-Iskandariyah (Alexandria) | Alicante | 0.980263 |
| 2 | Al-Iskandariyah (Alexandria) | Antalya | 0.976128 |
| 3 | Al-Iskandariyah (Alexandria) | Athínai (Athens) | 0.917338 |
| 4 | Al-Iskandariyah (Alexandria) | Barcelona | 0.992403 |
| ... | ... | ... | ... |
| 395 | Valencia | Palma | 0.840990 |
| 396 | Valencia | Tel Aviv-Yafo (Tel Aviv-Jaffa) | 0.851789 |
| 397 | Valencia | Thessaloniki | 0.974089 |
| 398 | Valencia | Toulon | 0.957929 |


### Example of Further Analysis - PCA
Once we have calculated the distances between every pair of categories, we can perform further analysis on the results. The following example demonstrates this using scikit-learn's PCA method.
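
PCA needs a matrix rather than a list of pairs, so the long-form distance rows are first reshaped into a square table with one row and one column per category. Here is a minimal plain-Python sketch of that reshaping on hypothetical toy pairs, with pairs that are absent from the data defaulting to 0; the names `pairs`, `labels` and `matrix` are introduced only for illustration.

```python
# Hypothetical long-form rows: (user1, user2, distance_between_users).
pairs = [
    ("a", "a", 0.0), ("a", "b", 0.4),
    ("b", "a", 0.4), ("b", "b", 0.0),
    ("c", "a", 0.9), ("c", "c", 0.0),  # ("a", "c"), ("b", "c"), ("c", "b") are absent
]

# One row and one column per distinct label, in sorted order.
labels = sorted({u for u, _, _ in pairs} | {v for _, v, _ in pairs})
index = {label: i for i, label in enumerate(labels)}

# Start from zeros so that absent pairs default to 0.
matrix = [[0.0] * len(labels) for _ in labels]
for user1, user2, dist in pairs:
    matrix[index[user1]][index[user2]] = dist
```

Each row of `matrix` is then one category described by its distances to all the others, which is the kind of input a dimensionality-reduction method works on.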

In [5]:
df = pd.read_csv('jane_austen_for_pca.csv')

In [6]:
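# Keep columns 1-3, which the pivot below expects to be user1, user2, distance_between_users
X = df.iloc[:, 1:4].copy()
# Reshape the long-form pairs into a square user1 x user2 distance matrix
X = pd.pivot_table(X, values='distance_between_users', index='user1', columns='user2')
X = X.fillna(0)  # pairs missing from the file are treated as distance 0

# Project each row of the distance matrix onto three principal components
pca = PCA(n_components=3, svd_solver='full')
clusters = pca.fit_transform(X)

sample_clusters = pd.DataFrame(index=X.index, data=clusters, columns=['x', 'y', 'z'])
sample_clusters = sample_clusters.reset_index()

# The text before the first comma in each user1 label is used as the book name
sample_clusters['Book'] = sample_clusters['user1'].apply(lambda s: s.split(',')[0])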

In [10]:
fig = px.scatter_3d(sample_clusters, x='x', y='y', z='z', color='Book')


In [11]:
fig.show()

## Good Luck!