In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import re

I have three "cuts" of the data that I want to perform machine learning on:

- all the usable data (i.e. excluding entries that don't have reports)

- the "Auburn" cut, which should remove Vultology reports from authors that have 20 or fewer reports (since they are less experienced)

- the "Lapis" cut, layered over the Auburn cut, which should take into account this analysis:
"I'm seeing something rather suspicious in the data. If I calculate P-axis coordinates with the formula (Suspended - Grounded)/(Suspended + Grounded), which ranges from -1 (completely Grounded) to +1 (completely Suspended), then there are 90 samples with -1, i.e., Grounded with zero signal mixing. However, there are only 8 samples in the (-1, -0.9] interval. In the (-0.9, -0.8] interval there are 40 samples. IMO that strongly suggests that many of those Grounded samples with zero P-axis signal mixing have neglected Suspended signals. The other end of the spectrum also looks suspicious, though it's less extreme.
There are 36 samples with P-axis coordinates of 1 (Suspended with zero signal mixing), only 3 samples in the [0.9, 1) interval, but 26 samples in the [0.8, 0.9) interval. As for J-axis coordinates, (Measured - Candid)/(Measured + Candid), there are 55 samples with coordinates of 1 (Measured with zero signal mixing), 23 samples in the [0.9, 1) interval, and 35 in the [0.8, 0.9) interval.  There's still an odd dip, but it's a lot less suspicious. At the other end, there are 23 samples with -1 (Candid with zero signal mixing), 12 samples in the (-1, -0.9] interval and 18 samples in the (-0.9, -0.8] interval. It might be worth repeating the statistical analysis reported earlier in this channel just on samples in the (-0.9, 0.9) intervals."
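The coordinate formula in the quoted analysis can be sketched as follows. This is a minimal illustration, not the code used for the Lapis cut: the column names `Suspended` and `Grounded`, the helper `axis_coordinate`, and the toy totals are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

def axis_coordinate(positive, negative):
    """Return (positive - negative) / (positive + negative), in [-1, 1].

    Rows where both totals are zero have no defined coordinate and become NaN.
    """
    positive = pd.Series(positive, dtype=float)
    negative = pd.Series(negative, dtype=float)
    total = positive + negative
    # Replace zero totals with NaN so the division does not produce inf.
    return (positive - negative) / total.where(total != 0)

# Toy totals, NOT real data: hypothetical per-sample Suspended and Grounded sums.
toy = pd.DataFrame({
    "Suspended": [0, 2, 10, 12, 5, 0],
    "Grounded": [14, 12, 10, 0, 15, 0],
})
toy["P coordinate"] = axis_coordinate(toy["Suspended"], toy["Grounded"])

# Samples the quoted analysis would keep: strictly inside (-0.9, 0.9).
lapis_mask = toy["P coordinate"].abs() < 0.9
print(toy[lapis_mask])
```

The same helper would apply to the J axis by passing Measured and Candid totals instead. Samples with an undefined coordinate fall outside the mask, which may or may not be the desired behaviour.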

We will see if these different versions of the data give different statistical results or different results in machine learning algorithms.

This notebook is devoted to the Auburn cut of the data. Please read the notebook "Full Data EDA" if you want a primer on the data and on the modifications I have already made to get to this point, since that is what I will be starting from.

## Statistics on the Auburn Cut

In [2]:
auburn_df = pd.read_csv("../Data/Vultology_Database_2024-02-24_-_PlusAuthors_FullData.csv", index_col=0)
auburn_df.head()

Unnamed: 0,Sample Name,Vultologist,Type,Development,Emotions,Fallen Affect,J Signal Mixing,P Signal Mixing,Sex,Age Range,...,SU8 Quirky Skits,EU1 Responsive Nodding,EU2 Polite Smiling,EU3 Bashful Body Movements,EG1 Upset Mouth Tension,EG2 Assertive Pushing,EG3 Stern Expressions,sum,Lead Energetic,Quadra
0,Cœur de Pirate,Miriam Greenfield,sefi,iii-,unguarded,0.0,low,medium,female,1980s,...,2.0,4.0,2.0,2.0,2.0,0.0,2.0,207.0,Pe,Gamma
1,Michael Gervais,Calin Copil,fesi,i---,neutral,1.0,low,low,male,-1,...,0.0,4.0,4.0,2.0,2.0,4.0,2.0,209.0,Je,Alpha
2,Joan Jett,Calin Copil,seti,i-i-,guarded,2.0,low,low,female,1950s,...,2.0,2.0,2.0,0.0,2.0,4.0,4.0,205.0,Pe,Beta
3,Ben Stein,Calin Copil,site,i---,guarded,2.0,low,low,male,1940s,...,0.0,0.0,0.0,0.0,2.0,2.0,2.0,114.0,Pi,Delta
4,Billie Joe Armstrong,Peter Foy,fise,ii-i,neutral,0.0,low,medium,male,1970s,...,0.0,2.0,4.0,2.0,4.0,2.0,4.0,211.0,Ji,Gamma


In [3]:
auburn_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 502 entries, 0 to 501
Data columns (total 85 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Sample Name                    502 non-null    object 
 1   Vultologist                    502 non-null    object 
 2   Type                           502 non-null    object 
 3   Development                    502 non-null    object 
 4   Emotions                       502 non-null    object 
 5   Fallen Affect                  502 non-null    float64
 6   J Signal Mixing                291 non-null    object 
 7   P Signal Mixing                290 non-null    object 
 8   Sex                            502 non-null    object 
 9   Age Range                      502 non-null    object 
 10  Geography                      502 non-null    object 
 11  Ethnicity                      502 non-null    object 
 12  R1 Rigid Posture               502 non-null    flo

It's interesting to me that J Axis and P Axis have almost exactly the same number of samples with signal mixing labels (291 and 290 non-null entries). I might want to briefly investigate that:

In [4]:
auburn_df['J Signal Mixing'].value_counts()

low       231
medium     47
high       13
Name: J Signal Mixing, dtype: int64

In [5]:
auburn_df['P Signal Mixing'].value_counts()

low       228
medium     48
high       14
Name: P Signal Mixing, dtype: int64

In [6]:
auburn_df['J Signal Mixing'].head(55)

0        low
1        low
2        low
3        low
4        low
5        low
6        low
7        low
8        low
9     medium
10       low
11       low
12       low
13       low
14       low
15       low
16       low
17      high
18       low
19    medium
20       low
21       low
22       low
23    medium
24       low
25       low
26       low
27    medium
28       low
29    medium
30       low
31       low
32       low
33       low
34       low
35       low
36       low
37    medium
38       low
39       low
40       low
41       low
42       NaN
43       low
44       low
45       low
46       low
47       low
48    medium
49       NaN
50      high
51    medium
52       low
53       low
54       low
Name: J Signal Mixing, dtype: object

So it would seem that most of the recent samples have signal mixing labels, which suggests that it wasn't common practice to record this information in older samples. I'm not sure how important that is.
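Looking at the first 55 rows only shows one end of the table. A rough way to see where the labels are concentrated is to measure the labelled share in consecutive blocks of rows. The sketch below assumes that row order reflects when a sample was added; the helper `labelled_share_by_block` and the toy column are hypothetical.

```python
import pandas as pd

def labelled_share_by_block(labels, block_size=50):
    """Share of non-missing labels in consecutive blocks of rows."""
    labels = labels.reset_index(drop=True)
    # Integer block id per row: rows 0..block_size-1 are block 0, and so on.
    block = labels.index.to_numpy() // block_size
    return labels.notna().groupby(block).mean()

# Toy column, not the real data: labelled early rows, unlabelled later rows.
toy_labels = pd.Series(["low"] * 6 + [None] * 4)
share = labelled_share_by_block(toy_labels, block_size=5)
print(share)
```

On the real data this would be called as `labelled_share_by_block(auburn_df['J Signal Mixing'])`. A share that drops sharply after some block would support the idea that labelling started at a particular point.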

The main alteration that we want to make to reach the final Auburn cut is to exclude Vultology reports written by Vultologists with 20 or fewer reports, so we can get right to that.

In [7]:
auburn_df['Vultologist'].value_counts()

Calin Copil          207
Juan E. Sandoval     147
Peter Foy             47
Ash Rose              28
Sierra Schwartz       26
Hila Hershkoviz       15
Jacquelyn Scott       14
Nathaniel Vetter      10
Mitchell Newman        4
Ahmad Aldroubi         2
Miriam Greenfield      1
Kyle O’Reilly          1
Name: Vultologist, dtype: int64

Thus, Vultology reports by myself (Mitchell Newman), Ahmad, Miriam, and Kyle, as well as Hila, Jacquelyn, and Nathaniel, will be excluded from the data for now.

In [8]:
# Keep only Vultologists with more than 20 reports
report_counts = auburn_df['Vultologist'].value_counts()
vultologists_included = list(report_counts[report_counts > 20].index)
vultologists_included

['Calin Copil', 'Juan E. Sandoval', 'Peter Foy', 'Ash Rose', 'Sierra Schwartz']

In [9]:
auburn_df = auburn_df[auburn_df['Vultologist'].isin(vultologists_included)]
auburn_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 455 entries, 1 to 501
Data columns (total 85 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Sample Name                    455 non-null    object 
 1   Vultologist                    455 non-null    object 
 2   Type                           455 non-null    object 
 3   Development                    455 non-null    object 
 4   Emotions                       455 non-null    object 
 5   Fallen Affect                  455 non-null    float64
 6   J Signal Mixing                286 non-null    object 
 7   P Signal Mixing                286 non-null    object 
 8   Sex                            455 non-null    object 
 9   Age Range                      455 non-null    object 
 10  Geography                      455 non-null    object 
 11  Ethnicity                      455 non-null    object 
 12  R1 Rigid Posture               455 non-null    flo

That number of rows makes sense, since we removed 47 entries (502 - 455).
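The arithmetic can be double-checked from the per-Vultologist counts printed earlier. This is only a sanity check on those printed numbers, with the counts typed in by hand rather than read from the CSV.

```python
import pandas as pd

# Report counts copied from the value_counts() output above.
report_counts = pd.Series({
    "Calin Copil": 207, "Juan E. Sandoval": 147, "Peter Foy": 47,
    "Ash Rose": 28, "Sierra Schwartz": 26, "Hila Hershkoviz": 15,
    "Jacquelyn Scott": 14, "Nathaniel Vetter": 10, "Mitchell Newman": 4,
    "Ahmad Aldroubi": 2, "Miriam Greenfield": 1, "Kyle O’Reilly": 1,
})

kept = report_counts[report_counts > 20]      # Vultologists that stay in the cut
removed = report_counts[report_counts <= 20]  # Vultologists that are excluded
print(kept.sum(), removed.sum(), report_counts.sum())
```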

Now we can begin analyzing the statistics of this dataset to see whether it differs significantly from the full set.

In [10]:
auburn_df['sum'].describe()

count    455.000000
mean     173.615385
std       28.977193
min       74.000000
25%      154.000000
50%      174.000000
75%      193.000000
max      255.000000
Name: sum, dtype: float64

The average number of points in a vultology report moved up to 174 from 172. I doubt this is a significant change, but it does suggest that reports from more experienced vultologists are slightly richer.
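One rough way to gauge the size of that shift is to express it in units of the standard error of the Auburn mean. This is only a back-of-the-envelope sketch: the full-data mean is the rounded figure quoted in the text, and because the Auburn cut is a subset of the full data the two means are not independent, so this is not a formal significance test.

```python
import math

# Values copied from the describe() output above.
n = 455
std = 28.977193
auburn_mean = 173.615385

# Rounded full-data mean as quoted in the text (assumption: exact value not shown here).
full_mean = 172

# Standard error of the mean, and the shift measured in standard errors.
standard_error = std / math.sqrt(n)
shift_in_se = (auburn_mean - full_mean) / standard_error
print(round(standard_error, 2), round(shift_in_se, 2))
```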

Let's look at the vultology signals with the highest and lowest means

In [11]:
auburn_means = auburn_df.drop(['sum', 'Fallen Affect'], axis=1).describe().loc['mean', :]
auburn_means.describe()

count    70.000000
mean      2.466562
std       0.759358
min       0.804396
25%       1.871429
50%       2.539560
75%       3.035165
max       4.164835
Name: mean, dtype: float64

This is nearly identical to the corresponding distribution in the full data as well.

The 10 least marked vultology signals are:

In [12]:
auburn_means.sort_values().head(10)

SU6 Eye Head Trailing Motions    0.804396
SU8 Quirky Skits                 0.903297
CA9 Grasping Hands               0.982418
MS9 Puppeteer Hands              1.037363
EG1 Upset Mouth Tension          1.210989
CA5 Asymmetrical Smirks          1.450549
SU7 Levity Effect                1.465934
EU3 Bashful Body Movements       1.483516
RR3 Exerted Pushes               1.505495
GR8 Bodily Awareness             1.564835
Name: mean, dtype: float64

It's quite similar, but Bodily Awareness newly enters our bottom 10 here, whereas Intense Scowling exits.
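Spotting which signals enter or leave a ranking by eye is error-prone, so a set comparison can help. The sketch below assumes a second Series of means from the full data is available; the helper `rank_changes` and the toy values with placeholder signal names are hypothetical.

```python
import pandas as pd

def rank_changes(full_means, cut_means, n=10, lowest=True):
    """Return (entered, exited): signals that join or leave the n most
    extreme means when moving from the full data to the cut."""
    if lowest:
        full_set = set(full_means.nsmallest(n).index)
        cut_set = set(cut_means.nsmallest(n).index)
    else:
        full_set = set(full_means.nlargest(n).index)
        cut_set = set(cut_means.nlargest(n).index)
    return sorted(cut_set - full_set), sorted(full_set - cut_set)

# Toy means with placeholder signal names, not values from either dataset.
toy_full = pd.Series({"A": 0.5, "B": 0.9, "C": 1.4, "D": 2.0})
toy_cut = pd.Series({"A": 0.6, "B": 1.5, "C": 1.3, "D": 2.1})
entered, exited = rank_changes(toy_full, toy_cut, n=2)
print(entered, exited)
```

Passing `lowest=False` gives the same comparison for the most marked signals, and the function works unchanged on the standard deviation Series.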

10 most marked Vultology signals:

In [13]:
auburn_means.sort_values(ascending=False).head(10)

PR5 Projecting Hands       4.164835
PF2 Toggling Eyes          3.778022
PR4 Fluent Articulation    3.707692
GR3 Taut Outer Edges       3.654945
PR1 Head Pushes            3.580220
PR2 Head Shakes            3.389011
F2 Eye Centric             3.378022
F1 Fluid Posture           3.358242
F4 Horizontal Movements    3.347253
PF3 Body Swaying           3.331868
Name: mean, dtype: float64

This is also very similar, with just a few signals changing places, but it's the same 10.

Let's also look at std

In [14]:
auburn_std = auburn_df.drop(['sum', 'Fallen Affect'], axis=1).describe().loc['std', :]
auburn_std.describe()

count    70.000000
mean      2.311199
std       0.309176
min       1.417007
25%       2.158396
50%       2.365811
75%       2.540542
max       2.867484
Name: std, dtype: float64

Almost the same

Bottom 10 (the least varying signals)

In [15]:
auburn_std.sort_values().head(10)

SU6 Eye Head Trailing Motions    1.417007
MS9 Puppeteer Hands              1.611061
PR3 Shoulder Shrugs              1.665947
CA9 Grasping Hands               1.712779
EG1 Upset Mouth Tension          1.750862
SU8 Quirky Skits                 1.777084
RR3 Exerted Pushes               1.825985
RR4 Momentum Halting             1.847505
EU3 Bashful Body Movements       1.910801
CA5 Asymmetrical Smirks          1.929733
Name: std, dtype: float64

Basically the same

Top 10 (most varying)

In [16]:
auburn_std.sort_values(ascending=False).head(10)

MS2 Horizontal Curtain Smiles    2.867484
GR3 Taut Outer Edges             2.862805
R1 Rigid Posture                 2.776746
GR1 Taut Preseptal Area          2.757164
MS3 Two Point Pulling            2.684820
F2 Eye Centric                   2.666888
MS7 Parabolic Velocity           2.654051
R2 Face Centric                  2.647145
F5 Subordinate Rigidity          2.627519
MS1 Lax Nasolabial Area          2.625037
Name: std, dtype: float64

Also basically the same with just a few changing places, but the same 10

Now let's look at Lead Energetic

In [17]:
auburn_df['Lead Energetic'].value_counts()

Pe    150
Je    126
Pi    100
Ji     79
Name: Lead Energetic, dtype: int64

In [18]:
100 * auburn_df['Lead Energetic'].value_counts() / len(auburn_df)

Pe    32.967033
Je    27.692308
Pi    21.978022
Ji    17.362637
Name: Lead Energetic, dtype: float64

Nearly identical to the full data.

Now let's look at Quadra

In [19]:
auburn_df['Quadra'].value_counts()

Beta     147
Gamma    125
Delta     92
Alpha     91
Name: Quadra, dtype: int64

In [20]:
100 * auburn_df['Quadra'].value_counts() / len(auburn_df['Quadra'])

Beta     32.307692
Gamma    27.472527
Delta    20.219780
Alpha    20.000000
Name: Quadra, dtype: float64

Nearly identical to the full data.

Now let's look at Type

In [21]:
auburn_df['Type'].value_counts()

seti    54
sefi    41
feni    37
teni    36
nife    34
neti    30
fesi    28
nite    28
tesi    25
nefi    25
fine    22
tise    22
site    20
fise    20
sife    18
tine    15
Name: Type, dtype: int64

In [22]:
100 * auburn_df['Type'].value_counts() / len(auburn_df['Type'])

seti    11.868132
sefi     9.010989
feni     8.131868
teni     7.912088
nife     7.472527
neti     6.593407
fesi     6.153846
nite     6.153846
tesi     5.494505
nefi     5.494505
fine     4.835165
tise     4.835165
site     4.395604
fise     4.395604
sife     3.956044
tine     3.296703
Name: Type, dtype: float64

Very similar to the full data, though some of the types changed places in order of prominence and one more type now has fewer than 30 samples. That is something to keep in mind, though more data in the future may help with this.
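To make the sample-size concern concrete, the types under the 30-sample mark can be listed directly. The counts below are typed in from the output above rather than read from the CSV, and the threshold of 30 is simply the figure mentioned in the text.

```python
import pandas as pd

# Type counts copied from the value_counts() output above.
type_counts = pd.Series({
    "seti": 54, "sefi": 41, "feni": 37, "teni": 36, "nife": 34, "neti": 30,
    "fesi": 28, "nite": 28, "tesi": 25, "nefi": 25, "fine": 22, "tise": 22,
    "site": 20, "fise": 20, "sife": 18, "tine": 15,
})

# Types that fall below the threshold in the Auburn cut.
small_types = type_counts[type_counts < 30]
print(len(small_types), "of", len(type_counts), "types have fewer than 30 samples")
print(list(small_types.index))
```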

Now let's look at Development

In [23]:
auburn_df['Development'].value_counts()

i---    151
i-i-     81
ii--     68
i--i     55
iii-     33
ii-i     30
i-ii     20
iiii     17
Name: Development, dtype: int64

In [24]:
100 * auburn_df['Development'].value_counts() / len(auburn_df['Development'])

i---    33.186813
i-i-    17.802198
ii--    14.945055
i--i    12.087912
iii-     7.252747
ii-i     6.593407
i-ii     4.395604
iiii     3.736264
Name: Development, dtype: float64

It's the exact same phenomenon that the full data contained, with i-i- being more common than ii--, for reasons we already investigated in the Full Data EDA notebook (please see the explanation in that notebook if you are interested).

Finally we'll look at Emotional Attitude

In [25]:
auburn_df['Emotions'].value_counts()

unguarded    219
guarded      196
neutral       40
Name: Emotions, dtype: int64

In [26]:
100 * auburn_df['Emotions'].value_counts() / len(auburn_df['Emotions'])

unguarded    48.131868
guarded      43.076923
neutral       8.791209
Name: Emotions, dtype: float64

It's a nearly identical distribution to that of the full data, with slightly more neutral attitudes here. I'm not sure if it's a significant difference (probably not, since it's about 1%).

On the whole, I see no significant aggregate differences between the Auburn cut distributions and those of the full set of usable data. I'll keep it for now, though, since even minor differences in the signals could impact ML results, but that remains to be seen.
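The category comparisons in this notebook were done by eye against the Full Data EDA notebook. A side-by-side table makes the same check explicit. This is a minimal sketch that assumes both DataFrames are loaded; the helper `compare_shares` and the toy frames are hypothetical and do not reproduce any figures from the real data.

```python
import pandas as pd

def compare_shares(full, cut, column):
    """Percentage share of each category in both frames, plus the difference."""
    shares = pd.DataFrame({
        "full %": full[column].value_counts(normalize=True) * 100,
        "cut %": cut[column].value_counts(normalize=True) * 100,
    }).fillna(0.0)  # a category absent from one frame has a share of 0
    shares["difference"] = shares["cut %"] - shares["full %"]
    return shares.sort_values("cut %", ascending=False)

# Toy frames, not real data: the "cut" is the first 8 rows of the "full" frame.
toy_full = pd.DataFrame(
    {"Emotions": ["guarded"] * 5 + ["unguarded"] * 4 + ["neutral"]}
)
toy_cut = toy_full.iloc[:8]
result = compare_shares(toy_full, toy_cut, "Emotions")
print(result)
```

The same call would work for `Lead Energetic`, `Quadra`, `Type`, and `Development`, which keeps the comparison consistent across columns.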

In [27]:
auburn_df.to_csv('../Data/Vultology_Database_2024-02-24_-_PlusAuthors_AuburnCut.csv')