In [1]:
import pandas as pd
import numpy as np

### NaturDoc - TL BL WT 22-23

# Data clustering:

## Groundwork: Approaches

What approaches are possible? And while waiting for the input data to be generated, what data can we use to already test these approaches?

### Loading Embeddings Data:


In [2]:
symptoms_embeddings = pd.read_csv("../data/embeddings/word_embeddings_dataframe.csv")

The dataframe contains three columns: the symptom name (from the Duke dataset and the Google symptom data) and one column per embedding model, `Embedding1` for `all-MiniLM-L6-v2` and `Embedding2` for `average_word_embeddings_glove.840B.300d`.

In [3]:
print(symptoms_embeddings.shape)
symptoms_embeddings.head(2)

(2404, 3)


Unnamed: 0,Symptom,Embedding1,Embedding2
0,Abcess,[-9.81967244e-03 1.01662287e-02 3.75229940e-...,[ 2.1690e-02 -1.8056e-01 -8.5585e-02 -5.6702e-...
1,Abdomen,[ 5.98840415e-02 1.64022837e-02 -4.90665212e-...,[-0.73936 -0.18636 0.59149 0.47356 ...


Extracting a dictionary matching index to symptom name:

In [4]:
dict_symptom = symptoms_embeddings["Symptom"].to_dict()

#### Transforming:

After reading from the CSV, each embedding is no longer a list but a string that still contains brackets, newlines, and repeated spaces:

In [5]:
symptoms_embeddings.loc[0, "Embedding1"][:100]

'[-9.81967244e-03  1.01662287e-02  3.75229940e-02  1.75703913e-02\n -1.11436069e-01  3.83325890e-02  1'

In [6]:
type(symptoms_embeddings.loc[1, "Embedding1"])

str

In [7]:
test_list_1 = symptoms_embeddings.loc[0, "Embedding1"].replace("\n", "").replace("[", "").replace("]", "").split(" ")
test_list_2 = symptoms_embeddings.loc[0, "Embedding2"].replace("\n", "").replace("[", "").replace("]", "").split(" ")

In [8]:
test_list_1[:5]

['-9.81967244e-03', '', '1.01662287e-02', '', '3.75229940e-02']

Removing all empty strings:

In [9]:
test_list_1 = [x for x in test_list_1 if x]
test_list_2 = [x for x in test_list_2 if x]
test_list_1[:5]

['-9.81967244e-03',
 '1.01662287e-02',
 '3.75229940e-02',
 '1.75703913e-02',
 '-1.11436069e-01']
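The two steps above (stripping brackets and newlines, then dropping empty strings) can also be collapsed into one: `str.split()` without arguments splits on any run of whitespace, so no empty strings are produced. A minimal sketch; `parse_embedding` is a hypothetical helper name and not part of this notebook:

```python
import numpy as np


def parse_embedding(raw: str) -> np.ndarray:
    """Parse a string such as '[-9.8e-03  1.0e-02\n 3.7e-02]' into a float array."""
    # Remove the surrounding brackets, then split on any whitespace (spaces, newlines).
    tokens = raw.strip().strip("[]").split()
    return np.array(tokens, dtype=float)


example = "[-9.81967244e-03  1.01662287e-02  3.75229940e-02\n -1.11436069e-01]"
vector = parse_embedding(example)
print(vector.shape)  # (4,)
```

Unlike the manual approach, this returns floats directly, so no separate numeric conversion is needed for a single row.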

### Creating usable dataframes:

#### Embedding 1 column:

First, transform content of rows from strings to lists:

In [10]:
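def listify_df_values(df_series: pd.Series) -> pd.Series:
    # Strip newlines and brackets as literal characters.
    # regex=False is needed because "[" on its own is not a valid regular expression.
    df_series = df_series.str.replace("\n", "", regex=False)
    df_series = df_series.str.replace("[", "", regex=False).str.replace("]", "", regex=False)
    # Splitting on single spaces leaves empty strings; the caller filters them out.
    df_series = df_series.str.split(" ")
    return df_series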

#### Embedding 2 column:

Repeat the above steps for the _Embedding2_ column:

Creating the basic dataframe:

In [19]:
embeddings2_series = listify_df_values(symptoms_embeddings.loc[:, "Embedding2"])
embeddings2_series = embeddings2_series.apply(lambda row: [val for val in row if val])

embeddings2_df = pd.DataFrame(embeddings2_series)
embeddings2_df.head()

Unnamed: 0,Embedding2
0,"[2.1690e-02, -1.8056e-01, -8.5585e-02, -5.6702..."
1,"[-0.73936, -0.18636, 0.59149, 0.47356, 0.59297..."
2,"[0.58928, 0.24762, 0.5015, -0.31308, -0.029607..."
3,"[8.2946e-02, 1.6964e-01, -2.1112e-01, 2.1073e-..."
4,"[-3.7954e-01, 4.4132e-01, 3.6332e-02, 2.2410e-..."


Exploding the lists of values into their own columns so that every cell only contains a single value:

In [20]:
embeddings2_df = pd.concat(
    [embeddings2_df[c].apply(pd.Series).add_prefix(c + "_") for c in embeddings2_df], axis=1
)

embeddings2_df.head()

Unnamed: 0,Embedding2_0,Embedding2_1,Embedding2_2,Embedding2_3,Embedding2_4,Embedding2_5,Embedding2_6,Embedding2_7,Embedding2_8,Embedding2_9,...,Embedding2_290,Embedding2_291,Embedding2_292,Embedding2_293,Embedding2_294,Embedding2_295,Embedding2_296,Embedding2_297,Embedding2_298,Embedding2_299
0,0.02169,-0.18056,-0.085585,-0.56702,-0.37991,0.74952,0.27161,-0.20359,0.28772,-1.4985,...,0.7687,-0.57498,-0.10212,-0.0557,-0.45765,-0.26548,0.19396,0.38276,-0.015735,-0.036918
1,-0.73936,-0.18636,0.59149,0.47356,0.59297,-0.22319,0.066332,0.35977,0.063273,-1.5661,...,0.78603,0.54811,0.23896,-0.42036,-0.085291,0.64376,0.54307,0.42253,0.61038,-0.75482
2,0.58928,0.24762,0.5015,-0.31308,-0.029607,0.39451,-0.22913,0.57697,-0.76873,-1.3676,...,-0.42955,-0.14359,0.16626,0.3584,-0.10825,-0.1961,-0.15036,0.13764,-0.41586,-0.72983
3,0.082946,0.16964,-0.21112,0.21073,-0.0094237,0.34631,-0.25166,0.18472,-0.33269,-1.6627,...,0.33283,-0.15003,0.54558,-0.023841,-0.48079,0.51326,-0.2866,0.041394,-0.066671,-0.3077
4,-0.37954,0.44132,0.036332,0.2241,0.087512,-0.41484,-0.0060271,0.098966,-0.11458,-1.7897,...,0.10686,0.27241,-0.31783,0.13302,-0.17751,0.74856,0.36981,0.35658,0.13955,-0.54288


Converting the cell values to floats:

In [21]:
embeddings2_df = embeddings2_df.apply(pd.to_numeric, errors='coerce')
type(embeddings2_df.loc[0, "Embedding2_0"])

numpy.float64
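The three steps above (listify, explode, convert) could also be done in a single pass by parsing each string into a float array and stacking the arrays. This is only a sketch under the assumption that every row has the same number of values; `embeddings_to_frame` is a hypothetical helper, and the toy strings stand in for the real column:

```python
import numpy as np
import pandas as pd


def embeddings_to_frame(series: pd.Series, prefix: str) -> pd.DataFrame:
    """Turn a Series of bracketed, whitespace-separated strings into a float DataFrame."""
    rows = [np.array(s.strip().strip("[]").split(), dtype=float) for s in series]
    # np.vstack raises if rows differ in length, which surfaces malformed entries early.
    matrix = np.vstack(rows)
    columns = [f"{prefix}_{i}" for i in range(matrix.shape[1])]
    return pd.DataFrame(matrix, index=series.index, columns=columns)


toy = pd.Series(["[0.1 0.2\n 0.3]", "[ 1.0e-01 -2.0e+00 3.0]"], name="Embedding2")
toy_df = embeddings_to_frame(toy, "Embedding2")
print(toy_df.dtypes.unique())
```

One difference in behavior: `pd.to_numeric(..., errors='coerce')` silently turns unparseable tokens into `NaN`, whereas this sketch raises an error, which may be preferable while debugging the embeddings.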

### Creating the Distance Matrix:

In [22]:
# importing the library
from scipy.spatial import distance_matrix

In [23]:
import math

def generate_distance_matrix(df : pd.DataFrame,
                distance_metric : str = "euclidean") -> pd.DataFrame: # 2.5k x 2.5k
    # Map the metric name to the Minkowski order p expected by scipy's distance_matrix.
    if distance_metric == "manhattan":
        p = 1
    elif distance_metric == "euclidean":
        p = 2
    elif distance_metric == "chebychev":
        p = math.inf
    else:
        # Any unrecognized value (including non-string arguments) falls back to Euclidean.
        p = 2
    dis_matrix = distance_matrix(df.values, df.values, p)
    dis_df = pd.DataFrame(dis_matrix)
    return dis_df
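
The function above maps a metric name to a Minkowski order `p`. As a toy illustration of what the three orders compute, here is a NumPy-only sketch; `pairwise_minkowski` is a hypothetical helper and not the scipy routine used above. It builds an n × n × d intermediate array through broadcasting, so it is only meant for small inputs:

```python
import numpy as np


def pairwise_minkowski(points: np.ndarray, p: float) -> np.ndarray:
    """Pairwise Minkowski distances of order p between the rows of `points`."""
    diff = np.abs(points[:, None, :] - points[None, :, :])
    if np.isinf(p):
        # Chebyshev distance: the largest coordinate difference.
        return diff.max(axis=-1)
    return (diff ** p).sum(axis=-1) ** (1.0 / p)


toy = np.array([[0.0, 0.0], [3.0, 4.0]])
manhattan = pairwise_minkowski(toy, 1)
euclidean = pairwise_minkowski(toy, 2)
chebyshev = pairwise_minkowski(toy, np.inf)
print(manhattan[0, 1], euclidean[0, 1], chebyshev[0, 1])  # 7.0 5.0 4.0
```

The three metrics give different values for the same pair of points, so distance matrices computed with genuinely different `p` would not be expected to match each other exactly.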


### Creating the Dictionaries:

In [25]:
def generate_dict(df_dist : pd.DataFrame,
                threshold : float) -> dict:
    filt = (df_dist[:] > threshold)
    df_filt = df_dist.copy()
    df_filt[filt] = np.nan
    dict_dist = df_filt.to_dict('dict')
    for i, dic in dict_dist.items():
        to_pop = list()
        for key, value in dic.items():
            if np.isnan(value):
                to_pop.append(key)
            # elif value == 0.0:
            #     to_pop.append(key)
        for target_key in to_pop:
            dic.pop(target_key)
        dict_dist[i] = dic
    return dict_dist
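
The same thresholding can be written without the intermediate `NaN` mask and the pop loop. A minimal sketch on a toy 3 × 3 distance matrix; `neighbors_within` is a hypothetical name, and it is assumed that, as above, entries with a distance less than or equal to the threshold are kept:

```python
import pandas as pd


def neighbors_within(df_dist: pd.DataFrame, threshold: float) -> dict:
    """Map each column label to {row label: distance} for distances <= threshold."""
    result = {}
    for col in df_dist.columns:
        column = df_dist[col]
        # NaN compares as False here, so missing distances are dropped as well.
        kept = column[column <= threshold]
        result[col] = kept.to_dict()
    return result


toy_dist = pd.DataFrame([[0.0, 1.5, 6.0], [1.5, 0.0, 4.0], [6.0, 4.0, 0.0]])
toy_neighbors = neighbors_within(toy_dist, 2.0)
print(toy_neighbors)
# {0: {0: 0.0, 1: 1.5}, 1: {0: 1.5, 1: 0.0}, 2: {2: 0.0}}
```

Every index keeps its own zero self-distance, which matches the structure returned by `generate_dict`.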

In [26]:
def generate_dict_match(dict_dist: dict) -> dict:
    dict_match = dict()

    # Translate index keys to symptom names using the global dict_symptom lookup.
    for key, value in dict_dist.items():
        for sub_key in value.keys():
            dict_match.setdefault(dict_symptom[key], []).append(dict_symptom[sub_key])

    return dict_match

#### A Dictionary of Symptoms Only and their Related Terms from the Duke Activities:

For the sake of NaturDoc's proof of concept, we will focus on the 422 symptoms as they exist in the Google dataset:

In [27]:
activities_symptoms_df = pd.read_csv("../output/activities_symptoms_bool.csv")
activities_symptoms_df.drop(columns="Unnamed: 0", inplace=True)
activities_symptoms_df.head()

Unnamed: 0,symptomName,is_symptom,is_activity
0,Abcess,0,1
1,Abdomen,0,1
2,Abortifacient,0,1
3,Abortive,0,1
4,Abrasion,0,1


This dataframe contains all symptoms and activities from both the Google Dataset and Duke's Database. The _is_symptom_ and _is_activity_ columns indicate which of these sources they originate from.

In [28]:
filt_sym = (activities_symptoms_df["is_symptom"] == 1)
filt_sym_df = activities_symptoms_df[filt_sym]
filt_sym_list = filt_sym_df["symptomName"].values.tolist()

In general, symptoms originating from the Google Dataset should probably be removed from the dictionary values, as they might cause issues when querying the database.

To eventually exclude non-activities from the dictionary, we also create a list of entries that are not activities:

In [29]:
filt_not_act = (activities_symptoms_df["is_activity"] == 0)
filt_not_act_df = activities_symptoms_df[filt_not_act]
filt_not_act_list = filt_not_act_df["symptomName"].values.tolist()

Generate the dictionary while removing Google Symptoms from the values inside the dictionary:

In [30]:
def create_dict_sym(dict_dist):  
    dict_sym = dict()

    for sym, list_sym in dict_dist.items():
        if sym not in filt_sym_list:
            continue
        for sub_sym in list_sym:
            if sub_sym in filt_not_act_list:
                continue
            if sym not in dict_sym:
                dict_sym[sym] = [sub_sym]
            else:
                dict_sym[sym] = [*dict_sym.get(sym), sub_sym]
    
    return dict_sym
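
The filtering logic can also be expressed with explicit arguments and set lookups instead of the global lists. This is a sketch only; `filter_symptom_dict` and the toy inputs are hypothetical and stand in for `filt_sym_list` and `filt_not_act_list`:

```python
def filter_symptom_dict(dict_match: dict, symptoms: set, non_activities: set) -> dict:
    """Keep symptom keys only, and drop matched terms that are not activities."""
    filtered = {}
    for name, matches in dict_match.items():
        if name not in symptoms:
            continue
        kept = [m for m in matches if m not in non_activities]
        if kept:
            # Keys with no remaining matches are omitted, as in create_dict_sym.
            filtered[name] = kept
    return filtered


toy_match = {
    "Eye pain": ["Eye pain", "Eye", "Pain"],
    "Abcess": ["Abcess"],
}
toy_filtered = filter_symptom_dict(toy_match, {"Eye pain"}, {"Eye pain"})
print(toy_filtered)  # {'Eye pain': ['Eye', 'Pain']}
```

Membership tests on a set take constant time on average, which matters little for a few hundred symptoms but avoids repeated list scans in the inner loop.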

## Creating the Dictionary for the Second Embedding Model:

#### Distance Matrix:

In [31]:
df_dist_2 = generate_distance_matrix(embeddings2_df)
df_dist_2

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2394,2395,2396,2397,2398,2399,2400,2401,2402,2403
0,0.000000,9.942843,8.784920,8.232345,9.910264,9.184355,7.202011,9.933517,8.554601,9.013563,...,10.233912,9.314567,9.376453,9.180041,9.993439,10.348576,10.379647,8.481104,11.215959,9.881715
1,9.942843,0.000000,10.594719,9.092435,9.080341,9.760703,7.240622,9.523623,9.701527,7.085359,...,9.217572,8.534188,8.879009,8.650456,8.355311,11.036702,10.223542,8.380999,11.489306,10.246362
2,8.784920,10.594719,0.000000,8.911869,9.984212,9.694277,7.788598,10.527021,9.885447,9.483890,...,10.491074,9.849962,9.382018,9.452173,10.308570,10.837974,11.214281,9.132222,11.857727,9.906005
3,8.232345,9.092435,8.911869,0.000000,9.329214,8.618238,6.308059,9.776170,8.539223,8.399780,...,9.360827,8.679456,8.652150,8.572486,8.727339,9.806711,10.267550,8.039326,11.093683,9.261648
4,9.910264,9.080341,9.984212,9.329214,0.000000,9.532016,7.203297,8.390142,9.782737,8.664715,...,9.731884,9.167016,8.643785,8.095538,8.852319,11.230354,10.294182,8.781414,10.703237,9.869799
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2399,10.348576,11.036702,10.837974,9.806711,11.230354,10.804562,8.675304,11.606166,10.381668,10.743392,...,10.649502,10.544861,10.239899,10.334715,10.653345,0.000000,12.256771,10.208283,12.240513,10.836945
2400,10.379647,10.223542,11.214281,10.267550,10.294182,10.384463,8.632461,11.402102,10.012039,9.716490,...,10.498840,10.266239,9.820044,10.444798,10.856580,12.256771,0.000000,10.042034,12.678147,10.458506
2401,8.481104,8.380999,9.132222,8.039326,8.781414,8.460653,5.594245,8.909344,8.437277,7.433279,...,8.786142,7.539064,7.838269,7.730331,8.045926,10.208283,10.042034,0.000000,10.921098,8.689270
2402,11.215959,11.489306,11.857727,11.093683,10.703237,11.734419,9.532773,11.079050,11.446188,11.062574,...,12.057926,11.506828,10.885291,10.925064,10.904707,12.240513,12.678147,10.921098,0.000000,11.735542


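Several entries appear to match an implausibly large number of activities. In the value counts below, a single distance value occurs 830 times in column 1, and column 6 contains 830 zeros, which means that 830 rows (row 6 included) have exactly the same embedding vector: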

In [32]:
print("Value counts:", df_dist_2.loc[:, 1].value_counts())

Value counts: 7.240622     830
8.358644       3
9.079539       2
10.997180      2
10.270435      2
            ... 
9.093840       1
10.213916      1
11.291925      1
8.894402       1
10.246362      1
Name: 1, Length: 1554, dtype: int64


In [33]:
print("Value counts:", df_dist_2.loc[:, 6].value_counts())

Value counts: 0.000000    830
7.467473      3
7.259116      2
9.168405      2
7.789581      2
           ... 
8.068840      1
7.136250      1
8.165595      1
8.418919      1
7.147895      1
Name: 6, Length: 1554, dtype: int64
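
A distance of 0 between distinct rows means their vectors are identical, so the affected rows could be located directly in the embedding dataframe. The sketch below uses a toy frame in place of `embeddings2_df`; whether the shared vector in the real data is all zeros is an assumption that the second check would confirm or rule out:

```python
import pandas as pd

toy_embeddings = pd.DataFrame(
    [[0.0, 0.0], [0.5, -1.0], [0.0, 0.0], [2.0, 1.0], [0.0, 0.0]],
    columns=["Embedding2_0", "Embedding2_1"],
)

# Rows whose full vector also occurs elsewhere in the frame.
is_duplicate = toy_embeddings.duplicated(keep=False)
# Rows that consist entirely of zeros.
is_zero = (toy_embeddings == 0).all(axis=1)

print(int(is_duplicate.sum()), int(is_zero.sum()))  # 3 3
print(toy_embeddings.index[is_duplicate].tolist())  # [0, 2, 4]
```

Mapping the resulting indices back through `dict_symptom` would show which terms share a vector.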


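Trying other distance metrics gives the same counts. Note, however, that the calls below pass `1` and `3` instead of the metric names `"manhattan"` and `"chebychev"`, so `generate_distance_matrix` falls back to its default (`p = 2`) and both results are Euclidean distances again: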

In [34]:
df_dist_2_manhattan = generate_distance_matrix(embeddings2_df, 1)

In [35]:
df_dist_2_chebychev = generate_distance_matrix(embeddings2_df, 3)

In [36]:
print("Value counts:", df_dist_2_manhattan.loc[:, 1].value_counts())

Value counts: 7.240622     830
8.358644       3
9.079539       2
10.997180      2
10.270435      2
            ... 
9.093840       1
10.213916      1
11.291925      1
8.894402       1
10.246362      1
Name: 1, Length: 1554, dtype: int64


In [37]:
print("Value counts:", df_dist_2_chebychev.loc[:, 1].value_counts())

Value counts: 7.240622     830
8.358644       3
9.079539       2
10.997180      2
10.270435      2
            ... 
9.093840       1
10.213916      1
11.291925      1
8.894402       1
10.246362      1
Name: 1, Length: 1554, dtype: int64


### Final Dictionaries:

##### Trying Different Thresholds:

Threshold 2.0:

In [41]:
dict_dist_20 = generate_dict(df_dist_2, 2)
dict_20 = generate_dict_match(dict_dist_20)
dict_20_sym = create_dict_sym(dict_20)
print(len(dict_20_sym))
dict_20_sym

132


{'Acne': ['Acne'],
 'Alcoholism': ['Alcoholism'],
 'Allergy': ['Allergy'],
 'Amblyopia': ['Amblyopia'],
 'Amenorrhea': ['Amenorrhea'],
 'Amnesia': ['Amnesia'],
 'Anemia': ['Anemia'],
 'Anxiety': ['Anxiety'],
 'Arthralgia': ['Arthralgia'],
 'Arthritis': ['Arthritis', 'Arthritis?'],
 'Ascites': ['Ascites'],
 'Asthma': ['Asthma'],
 'Ataxia': ['Ataxia'],
 'Atheroma': ['Atheroma'],
 'Boil': ['Boil'],
 'Bronchitis': ['Bronchitis'],
 'Bruise': ['Bruise'],
 'Bunion': ['Bunion'],
 'Burn': ['Burn'],
 'Cataract': ['Cataract'],
 'Chancre': ['Chancre'],
 'Chills': ['Chills'],
 'Chorea': ['Chorea'],
 'Cirrhosis': ['Cirrhosis'],
 'Colitis': ['Colitis'],
 'Coma': ['Coma'],
 'Conjunctivitis': ['Conjunctivitis'],
 'Constipation': ['Constipation'],
 'Convulsion': ['Convulsion'],
 'Cough': ['Cough'],
 'Cramp': ['Cramp'],
 'Croup': ['Croup'],
 'Dandruff': ['Dandruff'],
 'Dementia': ['Dementia'],
 'Depression': ['Depression'],
 'Dermatitis': ['Dermatitis'],
 'Diabetes': ['Diabetes', 'Diabetes Mellitis'],
 '

In [42]:
print("Abdominal pain" in dict_20_sym.keys())
print("Eye pain" in dict_20_sym.keys())
print("Common cold" in dict_20_sym.keys())

False
False
False


Threshold 5.0:

In [43]:
dict_dist_50 = generate_dict(df_dist_2, 5)
dict_50 = generate_dict_match(dict_dist_50)
dict_50_sym = create_dict_sym(dict_50)
print(len(dict_50_sym))
dict_50_sym

226


{'Acne': ['Acne'],
 'Alcoholism': ['Alcoholism'],
 'Allergy': ['Allergy'],
 'Amblyopia': ['Amblyopia',
  'Anorectic',
  'Antacid',
  'Carcinogenic',
  'Dullness'],
 'Amenorrhea': ['Amenorrhea'],
 'Amnesia': ['Amnesia'],
 'Anemia': ['Anemia'],
 'Anxiety': ['Anxiety'],
 'Arthralgia': ['Arthralgia'],
 'Arthritis': ['Arthritis', 'Arthritis?'],
 'Ascites': ['Ascites'],
 'Asthma': ['Asthma', 'Asthma (Ivy)', 'Asthma (Hay)'],
 'Ataxia': ['Ataxia'],
 'Atheroma': ['Atheroma'],
 'Boil': ['Boil'],
 'Bronchitis': ['Bronchitis'],
 'Bruise': ['Bruise'],
 'Bunion': ['Bunion'],
 'Burn': ['Burn'],
 'Cataract': ['Cataract'],
 'Chancre': ['Chancre'],
 'Chills': ['Chills'],
 'Chorea': ['Chorea'],
 'Cirrhosis': ['Cirrhosis'],
 'Colitis': ['Colitis'],
 'Coma': ['Coma'],
 'Conjunctivitis': ['Conjunctivitis'],
 'Constipation': ['Constipation'],
 'Convulsion': ['Convulsion'],
 'Cough': ['Cough'],
 'Cramp': ['Cramp'],
 'Croup': ['Croup'],
 'Dandruff': ['Dandruff'],
 'Dementia': ['Dementia'],
 'Depression': ['Dep

In [44]:
print("Abdominal pain" in dict_50_sym.keys())
print("Eye pain" in dict_50_sym.keys())
print("Common cold" in dict_50_sym.keys())

False
True
True


Because of the erroneous values in the distance matrix, some symptoms match an implausibly large number of activities:

In [45]:
print(dict_50_sym["Eye pain"])
print(len(dict_50_sym["Common cold"]))
dict_50_sym["Common cold"]

['Ear drop', 'Evil eye', 'Eye', 'Eye drop', 'Pain', 'Cold sore']
827


['Abscess(Breast)',
 'Ache(Arm)',
 'Ache(Back)',
 'Ache(Body)',
 'Ache(Ear)',
 'Ache(Foot)',
 'Ache(Head)',
 'Ache(Leg)',
 'Ache(Limb)',
 'Ache(Loin)',
 'Ache(Rib)',
 'Ache(Side)',
 'Ache(Stomach)',
 "Addison's-Disease",
 'Aftosa',
 'Alactia',
 'Alexipharmic',
 'Alexiteric',
 'Amygdalitis',
 'Amygdalosis',
 'Anal-Eversion',
 'Ancylostomiasis',
 'Anecbolic',
 'Angina-Catarrhalis',
 'Anhydrosis',
 'Anhydrotic',
 'Bite(Animal)',
 'Anorexiac',
 'Antemetic',
 'Antiabortifacient',
 'Antibilious',
 'Anticathartic',
 'Anticonception',
 'Antidiarrheic',
 'Antidote(Antiaris)',
 'Antidote(Crab)',
 'Antidote(Datura)',
 'Antidote(Fish)',
 'Antidote(Ipoh)',
 'Antidote(Opium)',
 'Antidote(Pithecellobium)',
 'Antidote(Poison)',
 'Antidote(Aconite)',
 'Antidote(Alcohol)',
 'Antidote(Alkaloid)',
 'Antidote(Arrow)',
 'Antidote(Arsenic)',
 'Antidote(Atropine)',
 'Antidote(Belladonna)',
 'Antidote(Brassica)',
 'Antidote(Cantharid)',
 'Antidote(Capsicum)',
 'Antidote(Caterpillar)',
 'Antidote(Centipede)',
 

In [46]:
dict_50_sym["Avoidant personality disorder"]

['Abscess(Breast)',
 'Ache(Arm)',
 'Ache(Back)',
 'Ache(Body)',
 'Ache(Ear)',
 'Ache(Foot)',
 'Ache(Head)',
 'Ache(Leg)',
 'Ache(Limb)',
 'Ache(Loin)',
 'Ache(Rib)',
 'Ache(Side)',
 'Ache(Stomach)',
 "Addison's-Disease",
 'Aftosa',
 'Alactia',
 'Alexipharmic',
 'Alexiteric',
 'Amygdalitis',
 'Amygdalosis',
 'Anal-Eversion',
 'Ancylostomiasis',
 'Anecbolic',
 'Angina-Catarrhalis',
 'Anhydrosis',
 'Anhydrotic',
 'Bite(Animal)',
 'Anorexiac',
 'Antemetic',
 'Antiabortifacient',
 'Antibilious',
 'Anticathartic',
 'Anticonception',
 'Antidiarrheic',
 'Antidote(Antiaris)',
 'Antidote(Crab)',
 'Antidote(Datura)',
 'Antidote(Fish)',
 'Antidote(Ipoh)',
 'Antidote(Opium)',
 'Antidote(Pithecellobium)',
 'Antidote(Poison)',
 'Antidote(Aconite)',
 'Antidote(Alcohol)',
 'Antidote(Alkaloid)',
 'Antidote(Arrow)',
 'Antidote(Arsenic)',
 'Antidote(Atropine)',
 'Antidote(Belladonna)',
 'Antidote(Brassica)',
 'Antidote(Cantharid)',
 'Antidote(Capsicum)',
 'Antidote(Caterpillar)',
 'Antidote(Centipede)',
 

##### Bandaid Solution?:

As a stopgap, we tried modifying the dictionary-generating code: nested dictionaries with at most 100 entries are copied to a new dictionary unchanged, and the larger, problematic ones are processed further. This did not work well and is not a sound approach in any case:

In [52]:
def generate_dict_patch(df_dist : pd.DataFrame,
                threshold : float) -> dict:
    filt = (df_dist[:] > threshold)
    df_filt = df_dist.copy()
    df_filt[filt] = np.nan
    dict_dist = df_filt.to_dict('dict')
    for i, dic in dict_dist.items():
        to_pop = list()
        for key, value in dic.items():
            if np.isnan(value):
                to_pop.append(key)
        for target_key in to_pop:
            dic.pop(target_key)
        dict_dist[i] = dic
    dict_clear = dict()
    for key, dic in dict_dist.items():

        # first, add unproblematic dictionaries to dict_clear:
        if len(dic) <= 100:
            dict_clear[key] = dic
            continue

        # a new dict lists the indices for each distance value:
        count_dict = dict()
        for sub_key, val_dis in dic.items():
            if val_dis not in count_dict:
                count_dict[val_dis] = [sub_key]
            else:
                count_dict[val_dis] = [*count_dict.get(val_dis), sub_key]

        # checking the count_dict:
        # if a distance value is shared by too many indices, it is skipped:
        for val, val_i in count_dict.items():
            if len(val_i) > 100:
                continue

            # in reverse:
            # remaining values are written to dict_clear under the matched index.
            # NOTE: this stores val as dict_clear[i][i], i.e. as the distance of
            # index i to itself, not as dict_clear[key][i]. This is consistent with
            # entries such as 57: {57: 4.875...} in the output below.
            for i in val_i:
                if i not in dict_clear:
                    dict_clear[i] = {i: val}
                else:
                    dict_clear[i][i] = val

    # in case the self-referencing 0.0 value was removed, add it again.
    # NOTE: this assigns a float in place of the nested dict. Every entry written
    # above contains its own key, so target_i is expected to stay empty.
    target_i = list()
    for key, value in dict_clear.items():
        if key not in value:
            target_i.append(key)
    for i in target_i:
        dict_clear[i] = 0.0

    return dict_clear
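
One reading of the fallback branch is that, for an index with too many matches, it should keep only those neighbors whose exact distance value is not shared by a large group of indices. Under that assumption (it is a guess at the intent, not what the function above does), the pruning step for a single nested dictionary could be sketched as follows; `prune_shared_distances` and `max_shared` are hypothetical names:

```python
from collections import defaultdict


def prune_shared_distances(neighbors: dict, max_shared: int = 100) -> dict:
    """Drop neighbors whose exact distance value is shared by more than max_shared indices."""
    by_distance = defaultdict(list)
    for index, dist in neighbors.items():
        by_distance[dist].append(index)
    return {
        index: dist
        for dist, indices in by_distance.items()
        if len(indices) <= max_shared
        for index in indices
    }


toy_neighbors = {6: 0.0, 10: 0.0, 11: 0.0, 57: 4.875}
pruned = prune_shared_distances(toy_neighbors, max_shared=2)
print(pruned)  # {57: 4.875}
```

Because the index's own zero self-distance belongs to the oversized group in this toy case, it is dropped too; a caller would have to add it back under the original key afterwards.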

In [53]:
dict_dist_50[6]

{6: 0.0,
 10: 0.0,
 11: 0.0,
 12: 0.0,
 13: 0.0,
 14: 0.0,
 15: 0.0,
 16: 0.0,
 17: 0.0,
 18: 0.0,
 19: 0.0,
 20: 0.0,
 21: 0.0,
 26: 0.0,
 32: 0.0,
 36: 0.0,
 39: 0.0,
 40: 0.0,
 51: 0.0,
 52: 0.0,
 53: 0.0,
 57: 4.875748031052368,
 59: 0.0,
 61: 0.0,
 67: 0.0,
 69: 0.0,
 70: 0.0,
 71: 0.0,
 76: 0.0,
 79: 0.0,
 82: 0.0,
 83: 0.0,
 85: 0.0,
 88: 0.0,
 89: 0.0,
 91: 0.0,
 93: 0.0,
 94: 0.0,
 96: 0.0,
 98: 0.0,
 99: 0.0,
 100: 0.0,
 102: 0.0,
 103: 0.0,
 104: 0.0,
 105: 0.0,
 106: 0.0,
 107: 0.0,
 108: 0.0,
 110: 0.0,
 112: 0.0,
 113: 0.0,
 114: 0.0,
 115: 0.0,
 116: 0.0,
 117: 0.0,
 118: 0.0,
 119: 0.0,
 120: 0.0,
 121: 0.0,
 122: 0.0,
 123: 0.0,
 124: 0.0,
 125: 0.0,
 126: 0.0,
 127: 0.0,
 128: 0.0,
 129: 0.0,
 130: 0.0,
 131: 0.0,
 132: 0.0,
 133: 0.0,
 134: 0.0,
 135: 0.0,
 136: 0.0,
 137: 0.0,
 138: 0.0,
 139: 0.0,
 140: 0.0,
 141: 0.0,
 142: 0.0,
 143: 0.0,
 144: 0.0,
 145: 0.0,
 147: 0.0,
 148: 0.0,
 149: 0.0,
 150: 0.0,
 151: 0.0,
 152: 0.0,
 153: 0.0,
 154: 0.0,
 155: 0.0,
 156:

In [54]:
dict_dist_patch_50 = generate_dict_patch(df_dist_2, 5)

In [56]:
dict_dist_patch_50

{0: {0: 0.0},
 1: {1: 0.0},
 2: {2: 0.0},
 3: {3: 0.0, 2097: 0.0},
 4: {4: 0.0},
 5: {5: 0.0},
 57: {57: 4.875748031052368},
 315: {315: 4.807588894425033},
 345: {345: 4.779970515124296},
 519: {519: 4.719025126728991},
 525: {525: 4.820558842842306},
 617: {617: 4.374457595769542},
 675: {675: 4.556676871097907},
 688: {688: 4.346432889709969},
 842: {842: 4.866545534119251},
 972: {972: 4.809214478254774},
 973: {973: 4.5797691258042725},
 1041: {1041: 4.492894780162099},
 1118: {1118: 4.677415587338868},
 1145: {1145: 4.955651878925598},
 1198: {1198: 4.695847538646427},
 1205: {1205: 4.559299357543582},
 1206: {1206: 4.3926189097771005},
 1297: {1297: 4.902182190217483},
 1361: {1361: 4.814437430507499},
 1543: {1543: 4.918255729720203},
 1830: {1830: 4.497542816205251},
 1851: {1851: 4.95383177460237},
 1921: {1921: 4.183974805448398},
 1923: {1923: 4.847278496109896},
 1924: {1924: 4.709420950127583},
 1951: {1951: 4.566508583897752},
 1969: {1969: 4.52908892737472},
 1984: {198

In [63]:
dict_patch_50 = generate_dict_match(dict_dist_patch_50)
dict_dist_patch_50_sym = create_dict_sym(dict_patch_50)
print(len(dict_dist_patch_50_sym))
dict_dist_patch_50_sym

183


{'Acne': ['Acne'],
 'Alcoholism': ['Alcoholism'],
 'Allergy': ['Allergy'],
 'Amblyopia': ['Amblyopia',
  'Anorectic',
  'Antacid',
  'Carcinogenic',
  'Dullness'],
 'Amenorrhea': ['Amenorrhea'],
 'Amnesia': ['Amnesia'],
 'Anemia': ['Anemia'],
 'Anxiety': ['Anxiety'],
 'Arthralgia': ['Arthralgia'],
 'Arthritis': ['Arthritis', 'Arthritis?'],
 'Ascites': ['Ascites'],
 'Asthma': ['Asthma', 'Asthma (Ivy)', 'Asthma (Hay)'],
 'Ataxia': ['Ataxia'],
 'Atheroma': ['Atheroma'],
 'Blood in stool': ['Blood', 'Bloody stool'],
 'Boil': ['Boil'],
 'Bronchitis': ['Bronchitis'],
 'Bruise': ['Bruise'],
 'Bunion': ['Bunion'],
 'Burn': ['Burn'],
 'Cataract': ['Cataract'],
 'Chancre': ['Chancre'],
 'Chills': ['Chills'],
 'Chorea': ['Chorea'],
 'Cirrhosis': ['Cirrhosis'],
 'Colitis': ['Colitis'],
 'Coma': ['Coma'],
 'Conjunctivitis': ['Conjunctivitis'],
 'Constipation': ['Constipation'],
 'Convulsion': ['Convulsion'],
 'Hair loss': ['Cosmetic (Grey hair)',
  'Grey Hair',
  'Hair',
  'Preventative (Gray Hair)

In [58]:
print("Abdominal pain" in dict_dist_patch_50_sym.keys())
print("Eye pain" in dict_dist_patch_50_sym.keys())
print("Common cold" in dict_dist_patch_50_sym.keys())

False
True
False


Raising the threshold removes keys from the result, presumably because more entries then exceed the 100-match limit and end up in the fallback branch:

In [64]:
dict_dist_patch_80 = generate_dict_patch(df_dist_2, 8)
dict_patch_80 = generate_dict_match(dict_dist_patch_80)
dict_dist_patch_80_sym = create_dict_sym(dict_patch_80)
print(len(dict_dist_patch_80_sym))
print("Abdominal pain" in dict_dist_patch_80_sym.keys())
print("Eye pain" in dict_dist_patch_80_sym.keys())
print("Common cold" in dict_dist_patch_80_sym.keys())

128
False
False
False


In [65]:
dict_dist_patch_80_sym

{'Inflammation': ['Inflammation'],
 'Nausea': ['Nausea'],
 'Pain': ['Pain'],
 'Swelling': ['Swelling'],
 'Hyperglycemia': ['Hyperglycemia'],
 'Rhinitis': ['Rhinitis'],
 'Alcoholism': ['Alcoholism'],
 'Allergy': ['Allergy'],
 'Amblyopia': ['Amblyopia'],
 'Amnesia': ['Amnesia'],
 'Anemia': ['Anemia'],
 'Anxiety': ['Anxiety'],
 'Arthritis': ['Arthritis'],
 'Ascites': ['Ascites'],
 'Asthma': ['Asthma'],
 'Ataxia': ['Ataxia'],
 'Boil': ['Boil'],
 'Bronchitis': ['Bronchitis'],
 'Bruise': ['Bruise'],
 'Bunion': ['Bunion'],
 'Burn': ['Burn'],
 'Cataract': ['Cataract'],
 'Chills': ['Chills'],
 'Chorea': ['Chorea'],
 'Cirrhosis': ['Cirrhosis'],
 'Colitis': ['Colitis'],
 'Coma': ['Coma'],
 'Conjunctivitis': ['Conjunctivitis'],
 'Constipation': ['Constipation'],
 'Convulsion': ['Convulsion'],
 'Cough': ['Cough'],
 'Cramp': ['Cramp'],
 'Dandruff': ['Dandruff'],
 'Dementia': ['Dementia'],
 'Depression': ['Depression'],
 'Dermatitis': ['Dermatitis'],
 'Diabetes': ['Diabetes'],
 'Diarrhea': ['Diarrhea

After more cautious threshold changes, 5.1 seems the best so far:

In [75]:
dict_dist_patch_51 = generate_dict_patch(df_dist_2, 5.1)
dict_patch_51 = generate_dict_match(dict_dist_patch_51)
dict_dist_patch_51_sym = create_dict_sym(dict_patch_51)
print(len(dict_dist_patch_51_sym))
print("Abdominal pain" in dict_dist_patch_51_sym.keys())
print("Eye pain" in dict_dist_patch_51_sym.keys())
print("Common cold" in dict_dist_patch_51_sym.keys())

186
True
True
False


In [76]:
dict_dist_patch_51_sym

{'Acne': ['Acne'],
 'Alcoholism': ['Alcoholism'],
 'Allergy': ['Allergy'],
 'Amblyopia': ['Amblyopia',
  'Anorectic',
  'Antacid',
  'Carcinogenic',
  'Dullness'],
 'Amenorrhea': ['Amenorrhea'],
 'Amnesia': ['Amnesia'],
 'Anemia': ['Anemia'],
 'Anxiety': ['Anxiety'],
 'Arthralgia': ['Arthralgia'],
 'Arthritis': ['Arthritis', 'Arthritis?'],
 'Ascites': ['Ascites'],
 'Asthma': ['Asthma', 'Asthma (Ivy)', 'Asthma (Hay)'],
 'Ataxia': ['Ataxia'],
 'Atheroma': ['Atheroma'],
 'Kidney stone': ['Bladder stone', 'Kidney', 'Kidney stones'],
 'Blood in stool': ['Blood', 'Bloody stool'],
 'Boil': ['Boil'],
 'Bronchitis': ['Bronchitis'],
 'Bruise': ['Bruise'],
 'Bunion': ['Bunion'],
 'Burn': ['Burn'],
 'Cataract': ['Cataract'],
 'Chancre': ['Chancre'],
 'Chills': ['Chills'],
 'Chorea': ['Chorea'],
 'Cirrhosis': ['Cirrhosis'],
 'Colitis': ['Colitis'],
 'Coma': ['Coma'],
 'Conjunctivitis': ['Conjunctivitis'],
 'Constipation': ['Constipation'],
 'Convulsion': ['Convulsion'],
 'Eye pain': ['Cosmetic (Gre

In [77]:
print(dict_dist_patch_51_sym["Abdominal pain"])
print(dict_dist_patch_51_sym["Eye pain"])

['Pain']
['Cosmetic (Grey hair)', 'Ear drop', 'Evil eye', 'Eye', 'Eye drop', 'Pain', 'Skin diseases', 'Cold sore']


### Closing thoughts:

In any case, as the results above show, the dictionaries built from the second embedding model are not reliable, so there are probably issues in how we created or processed the second set of embeddings.

We should proceed with the _Embedding1_ values for now and see if we can pinpoint the issues in the second set of vectors.