## Latent Personal Analysis in Python
by Uri Alon and Alex Abbey

This short tutorial demonstrates how to use the Python implementation of Latent Personal Analysis (LPA), as described in the [article](https://link.springer.com/article/10.1007/s11257-021-09295-7) published in User Modeling and User-Adapted Interaction.

LPA helps analyze a corpus of text, or any other set of data, by taking into account both the popular elements that are missing from a category and the elements that a category uses frequently even though they are infrequent in general. We use the terms element, category and domain throughout. It can be helpful to think of an element as a word, a category as a chapter of a book, and the domain as the book.

The first implementation of LPA was written in SQL and can be found [here](https://github.com/hagitbenshoshan/text_distance/). For very large datasets, users of this package are encouraged to switch to the SQL implementation on cloud infrastructure, as the results will be calculated much faster.

Dependencies:
- pandas
- numpy
- scipy 

In [1]:
import pandas as pd
from LPA import LPA

In this tutorial we demonstrate LPA using a small portion of the LOCO dataset (Miani et al., 2021), whose unprocessed data are available at https://osf.io/snpcg/. Your data should consist of a count of elements per category. For instance, in this dataset each category is an article, each element is a (tokenized) word, and the frequency is the number of times the word appears in the article. The input should be a CSV file with the columns `element`, `category` and `frequency_in_category`.
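
If your data starts out as tokenized text rather than counts, a table in this format can be built with a few lines of pandas. The following is a minimal sketch on a hypothetical two-document corpus; the names `docs` and `freq` are illustrative and not part of the LPA package.

```python
from collections import Counter

import pandas as pd

# Hypothetical toy corpus: category id -> list of already-tokenized words.
docs = {
    "doc_a": ["world", "peopl", "world", "use"],
    "doc_b": ["peopl", "one", "one", "one"],
}

# One row per (category, element) pair, holding the count of that element.
rows = [
    {"category": category, "element": element, "frequency_in_category": count}
    for category, tokens in docs.items()
    for element, count in Counter(tokens).items()
]
freq = pd.DataFrame(rows, columns=["category", "element", "frequency_in_category"])

# index=False keeps exactly the three expected columns in the CSV text.
csv_text = freq.to_csv(index=False)
print(csv_text)
```

Passing a file path to `to_csv` instead of leaving it empty writes the same content to disk, ready to be loaded with `pd.read_csv`.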

In [2]:
df = pd.read_csv('./frequency.csv')
df.describe(include="all")

Unnamed: 0,category,element,frequency_in_category
count,350874,350869,350874.0
unique,1000,24709,
top,C0029e,one,
freq,2047,727,
mean,,,1.82995
std,,,2.434249
min,,,1.0
25%,,,1.0
50%,,,1.0
75%,,,2.0


#### Creating the domain
The domain (DVR) consists of the frequency of all elements from all categories together. 

In [3]:
dvr = LPA.create_dvr(df)
dvr

Unnamed: 0,element,frequency_in_category,global_weight
0,one,3023.0,0.004708
1,peopl,2981.0,0.004643
2,us,2763.0,0.004303
3,world,2728.0,0.004249
4,use,2476.0,0.003856
...,...,...,...
24704,kultur,1.0,0.000002
24705,kung,1.0,0.000002
24706,kushcyyenko,1.0,0.000002
24707,kutless,1.0,0.000002
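
The `global_weight` column above appears to be each element's summed count divided by the total count over all elements (for `one`, 3023 out of roughly 642,000). The sketch below reproduces that idea on a hypothetical toy table; it is an illustration under that assumption, not the code of `create_dvr`.

```python
import pandas as pd

# Hypothetical toy frequency table in the tutorial's input format.
freq = pd.DataFrame(
    {
        "category": ["doc_a", "doc_a", "doc_a", "doc_b", "doc_b"],
        "element": ["world", "peopl", "use", "peopl", "one"],
        "frequency_in_category": [2, 1, 1, 1, 3],
    }
)

# Sum each element's counts over all categories, most frequent first.
toy_dvr = (
    freq.groupby("element", as_index=False)["frequency_in_category"]
    .sum()
    .sort_values("frequency_in_category", ascending=False, ignore_index=True)
)

# Normalise the counts so that the weights sum to 1 (assumed definition).
toy_dvr["global_weight"] = (
    toy_dvr["frequency_in_category"] / toy_dvr["frequency_in_category"].sum()
)
print(toy_dvr)
```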


#### Creating an instance of `LPA`
To create an instance of the `LPA` object you must first have a `dvr`, either created ahead of time and loaded from a file, or built with the static method `create_dvr`. Epsilon, the default weight given to missing terms, is set to 1 / (domain size * `epsilon_frac`): an `epsilon_frac` greater than 1 decreases epsilon, while a value between 0 and 1 increases it. By default, `epsilon_frac` is set to 2.

In [4]:
lpa = LPA(dvr, categories=1000, epsilon_frac=2)
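
To see how `epsilon_frac` moves epsilon, the formula can be evaluated directly. This sketch assumes that "domain size" means the number of distinct elements in the domain (24,709 according to the `describe` output above); the helper `epsilon` is hypothetical and not part of the package.

```python
def epsilon(domain_size: int, epsilon_frac: float = 2) -> float:
    """Return 1 / (domain_size * epsilon_frac)."""
    return 1 / (domain_size * epsilon_frac)


# Assumption: domain size = number of distinct elements in the domain.
domain_size = 24709

# Larger epsilon_frac values give a smaller epsilon, and vice versa.
for frac in (0.5, 1, 2, 4):
    print(f"epsilon_frac={frac}: epsilon={epsilon(domain_size, frac):.3e}")
```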

#### Creating Signatures
A prominent use of LPA is creating a signature for every category, made up of the terms that are most meaningful for that category, whether because of their prominence or their absence. Usually one does not need the full signature, but rather signatures of a certain length, as shown in the paper. The function `create_and_cut()` therefore creates the signatures and keeps only the most prominent elements of each, according to `sig_length`, which defaults to 500.

In [5]:
sigs = lpa.create_and_cut(df, sig_length=500)
sigs

Unnamed: 0,category,element,KL,missing
0,C00001,aquino,0.053742,False
1,C00001,one,0.025547,True
2,C00001,peopl,0.025126,True
3,C00001,us,0.022955,True
4,C00001,world,0.022609,True
...,...,...,...,...
499995,C004c4,onlin,0.001876,False
499996,C004c4,least,0.001876,True
499997,C004c4,men,0.001876,True
499998,C004c4,crimin,0.001876,True
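
The scores in the `KL` column appear consistent with a symmetrized KL term, (p - q) * log(p / q), where q is the element's global weight and p is its weight in the category, with missing elements given p = epsilon. For the missing element `one`, q = 0.004708 and epsilon = 1 / (24709 * 2) give about 0.0255, in line with the output above. The sketch below applies that idea to hypothetical toy data. The function `toy_signature` and the way present elements are rescaled are assumptions made for illustration, not the package's implementation.

```python
import math


def toy_signature(category_counts, global_weights, eps, sig_length):
    """Rank elements by how strongly a category deviates from the domain."""
    total = sum(category_counts.values())
    n_missing = sum(1 for e in global_weights if e not in category_counts)
    # Assumption: present elements are rescaled so the vector still sums to 1
    # after every missing element receives weight eps.
    scale = 1 - n_missing * eps
    rows = []
    for element, q in global_weights.items():
        missing = element not in category_counts
        p = eps if missing else scale * category_counts[element] / total
        score = (p - q) * math.log(p / q)  # symmetrized KL term, never negative
        rows.append((element, score, missing))
    # Keep the sig_length elements with the largest scores.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:sig_length]


# Hypothetical domain weights and one category's raw counts.
global_weights = {"one": 0.375, "peopl": 0.25, "world": 0.25, "use": 0.125}
category_counts = {"world": 2, "peopl": 1, "use": 1}
eps = 1 / (len(global_weights) * 2)

for element, score, missing in toy_signature(category_counts, global_weights, eps, 2):
    print(element, round(score, 4), missing)
```

In this toy category the top entry is `one`, which never appears in the category at all: a popular element that is absent can characterize a category more strongly than the elements it does use.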


To get a feel for the dataset as a whole, you can check how far each category (here, each article) is from the domain by summing the distances of its signature elements from the `dvr`.

In [6]:
lpa.distance_summary(df).head()

category,KL
C00001,0.732932
C00003,1.3922
C00004,0.941718
C00005,0.820699
C00007,1.291429
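
The summed distance can be pictured as a group-by over a signature table. The sketch below uses a hypothetical table with the same columns as `sigs`; it illustrates the summation only and is not the code behind `distance_summary`.

```python
import pandas as pd

# Hypothetical signature table with the same columns as `sigs`.
toy_sigs = pd.DataFrame(
    {
        "category": ["doc_a", "doc_a", "doc_b", "doc_b"],
        "element": ["one", "world", "use", "peopl"],
        "KL": [0.27, 0.10, 0.40, 0.05],
        "missing": [True, False, True, False],
    }
)

# Sum the per-element distances of each category's signature.
summary = toy_sigs.groupby("category")[["KL"]].sum()

# Categories with the largest sums are the furthest from the domain.
print(summary.sort_values("KL", ascending=False))
```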


#### Distances between pairs of categories (Sockpuppet Distance)
Finally, one can use the signatures created to calculate the L1 distance between every pair of categories. 
`sockpuppet_distance` accepts two DataFrames of $n$ and $m$ signatures respectively and compares them, returning a matrix sized $n \times m$. Different signature lengths and epsilons can have dramatic effects on the results.

In [7]:
lpa.sockpuppet_distance(signatures1=sigs.iloc[:5000], signatures2=sigs.iloc[:5000])

func: sockpuppet_distance took: 0:00:00.616931


category,C00001,C00003,C00004,C00005,C00007,C00008,C00009,C0000a,C0000b,C0000c
C00001,0.0,1.769578,1.294044,0.848502,1.500809,1.512502,1.080796,1.44174,1.735578,1.218679
C00003,1.769578,0.0,2.085275,1.820153,2.466488,2.264003,1.865359,2.108518,2.662046,2.117797
C00004,1.294044,2.085275,0.0,1.343928,1.962349,1.811959,1.369492,1.752249,2.226319,1.610398
C00005,0.848502,1.820153,1.343928,0.0,1.595416,1.564651,1.121952,1.485686,1.854744,1.249274
C00007,1.500809,2.466488,1.962349,1.595416,0.0,2.169357,1.779474,2.122418,2.370993,1.887602
C00008,1.512502,2.264003,1.811959,1.564651,2.169357,0.0,1.601261,1.953664,2.444378,1.845919
C00009,1.080796,1.865359,1.369492,1.121952,1.779474,1.601261,0.0,1.539143,2.018666,1.432241
C0000a,1.44174,2.108518,1.752249,1.485686,2.122418,1.953664,1.539143,0.0,2.370622,1.779703
C0000b,1.735578,2.662046,2.226319,1.854744,2.370993,2.444378,2.018666,2.370622,0.0,2.183852
C0000c,1.218679,2.117797,1.610398,1.249274,1.887602,1.845919,1.432241,1.779703,2.183852,0.0
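
As an illustration of the L1 comparison, the sketch below turns two hypothetical signature tables into aligned vectors and computes the full distance matrix with SciPy. It assumes that an element absent from a signature counts as 0 and it ignores the `missing` flag; how the package itself handles these cases is not shown in this tutorial, so `toy_l1_matrix` should be read as a simplified stand-in rather than as `sockpuppet_distance`.

```python
import pandas as pd
from scipy.spatial.distance import cdist


def toy_l1_matrix(sigs1: pd.DataFrame, sigs2: pd.DataFrame) -> pd.DataFrame:
    """L1 distance between every signature in sigs1 and every one in sigs2."""
    a = sigs1.pivot(index="category", columns="element", values="KL")
    b = sigs2.pivot(index="category", columns="element", values="KL")
    # Align both tables on the union of elements; absent elements count as 0.
    elements = a.columns.union(b.columns)
    a = a.reindex(columns=elements).fillna(0.0)
    b = b.reindex(columns=elements).fillna(0.0)
    return pd.DataFrame(
        cdist(a.to_numpy(), b.to_numpy(), metric="cityblock"),
        index=a.index,
        columns=b.index,
    )


# Hypothetical signatures for two categories.
toy_sigs = pd.DataFrame(
    {
        "category": ["doc_a", "doc_a", "doc_b", "doc_b"],
        "element": ["one", "world", "use", "peopl"],
        "KL": [0.27, 0.10, 0.40, 0.05],
    }
)

dist = toy_l1_matrix(toy_sigs, toy_sigs)
print(dist)
```

Comparing a table with itself, as in the tutorial call above, gives a symmetric matrix with zeros on the diagonal.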


### Further analysis
Once we have calculated the distances between every pair of categories, we can perform further analysis on the results, for instance by clustering the sockpuppet distances to find similar categories.
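
One possible way to do this is hierarchical clustering with SciPy, which is already a dependency. The sketch below uses the first five rows and columns of the distance matrix shown above. Average linkage and the choice of two clusters are illustrative assumptions, not recommendations from the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

labels = ["C00001", "C00003", "C00004", "C00005", "C00007"]

# Upper-left 5 x 5 block of the sockpuppet distance matrix from the tutorial.
dist = np.array([
    [0.0, 1.769578, 1.294044, 0.848502, 1.500809],
    [1.769578, 0.0, 2.085275, 1.820153, 2.466488],
    [1.294044, 2.085275, 0.0, 1.343928, 1.962349],
    [0.848502, 1.820153, 1.343928, 0.0, 1.595416],
    [1.500809, 2.466488, 1.962349, 1.595416, 0.0],
])

# linkage expects the condensed (upper-triangle) form of the matrix.
condensed = squareform(dist)
tree = linkage(condensed, method="average")

# Cut the tree into at most two clusters.
clusters = fcluster(tree, t=2, criterion="maxclust")
print(dict(zip(labels, clusters)))
```

With real data, the same steps apply to the full matrix returned by `sockpuppet_distance`, provided it is square and symmetric.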

## Good Luck!

### References
Miani, A., Hills, T. & Bangerter, A. LOCO: The 88-million-word language of conspiracy corpus. Behav Res (2021). https://doi.org/10.3758/s13428-021-01698-z

Mokryn, O., Ben-Shoshan, H. Domain-based Latent Personal Analysis and its use for impersonation detection in social media. User Model User-Adap Inter 31, 785–828 (2021). https://doi.org/10.1007/s11257-021-09295-7