In this notebook, we take a look at the label files and add label information to the dataframe we built in Notebook 1.

# Imports

In [1]:
import numpy as np

In [2]:
from pathlib import Path
output_path = Path('Output')

# A look at the labels

## Importing labels and comparing them to our slides

In [3]:
import csv

tabDict = {}

# Corticotroph label file: column 1 is the case ID, column 2 holds the label codes
kortiFile = 'KortikotropHA_gelabled.txt'
with open(kortiFile, newline='') as tsv:
    for line in csv.reader(tsv, delimiter="\t"):  # tab-separated, so no dialect is needed
        tabDict[line[0]] = line[1].split()

# Gonadotroph label file, same layout
gonaFile = 'GonadotropeHA_gelabled.txt'
with open(gonaFile, newline='') as tsv:
    for line in csv.reader(tsv, delimiter="\t"):
        tabDict[line[0]] = line[1].split()
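
Both files are read in exactly the same way, so the loop could be wrapped in a small helper. The following is only a sketch: `read_label_file` is a hypothetical name, and the sample IDs and codes are made up to stand in for a real label file.

```python
import csv
import io

def read_label_file(handle):
    """Map case ID (column 1) to the list of label codes (column 2)."""
    labels = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:  # skip blank or malformed lines
            continue
        labels[row[0]] = row[1].split()
    return labels

# Hypothetical stand-in for the contents of a label file
sample = io.StringIO("0001-20\t0\n0002-20\t8 9\n\n0003-20\t7 9\n")
sample_labels = read_label_file(sample)
print(sample_labels)
```

With the real files, one would open each file and call `tabDict.update(read_label_file(tsv))`.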

 - corticotroph=0
 - silent=3
 - gonadotroph=7
 - LH=8
 - FSH=9

In [4]:
dict(list(tabDict.items())[:10]), dict(list(tabDict.items())[-10:])

(classified)


In [5]:
len(tabDict)

843

In [5]:
# importing the dataframe we built in Notebook 1
import pandas as pd
pair_df = pd.read_csv(output_path/'pairlist_df_after_notebook_1.csv')
pair_df.head(3)

(classified)


In [6]:
import re
# testing regular expressions (raw string, so the pattern contains no invalid escape sequences)
m2 = re.search(
    r'(.+-.+)-.+-.+',
    're-example-II-HE')
m2.group(1)

're-example'
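
The test pattern above is greedy, while the cells below use a lazy variant with `+?`. The sketch below compares the two, written as raw strings. The longer slide name is a hypothetical example (only the ID format `1929-13` is taken from this notebook); it shows why the lazy form is the safer choice when a name contains extra hyphens.

```python
import re

GREEDY = re.compile(r'(.+-.+)-.+-.+')   # pattern from the test cell
LAZY = re.compile(r'(.+?-.+?)-.+')      # pattern used in the following cells

name_short = 're-example-II-HE'
name_long = '1929-13-II-ACTH-x'  # hypothetical name with an extra hyphen

greedy_short = GREEDY.search(name_short).group(1)
lazy_short = LAZY.search(name_short).group(1)
greedy_long = GREEDY.search(name_long).group(1)
lazy_long = LAZY.search(name_long).group(1)

print(greedy_short, lazy_short)  # both stop after the second field
print(greedy_long, lazy_long)    # the greedy pattern swallows a third field
```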

In [8]:
# are all names available in the label dictionary ("tabDict")?
idsFromTable = []
pair_df.reset_index(drop=True, inplace=True)
for index, pair in pair_df.iterrows():
    idHE = re.search(r'(.+?-.+?)-.+', pair["he name"]).group(1)
#     print(idHE) # e.g. '1929-13'
    # report pairs whose HE and IHC slides belong to different cases
    if idHE != re.search(r'(.+?-.+?)-.+', pair["ihc name"]).group(1):
        print(str(index) + ": HE='" + pair["he name"] + "', IHC='" + pair["ihc name"] + "'")
    # report cases that have no entry in the label dictionary
    if idHE not in tabDict:
        print(str(index) + ": '" + idHE + "'")
    idsFromTable.append(idHE)
idsFromTable = list(set(idsFromTable))

How many cases are labelled?

In [9]:
len(tabDict)

843

How many cases do we have as slides?

In [10]:
len(idsFromTable)

409

Since `tabDict` contains more cases than we have slides for, let's keep only the IDs that are relevant:

In [11]:
relevantIDs = {x:tabDict[x] for x in tabDict if x in idsFromTable}
list(relevantIDs)[:5]+['...']

(classified)


In [12]:
numOfCases = len(relevantIDs)
numOfCases

409
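
The same filtering can be sketched with sets, which also makes it easy to list slide IDs that have no label. All IDs and codes below are made up for illustration, and the variable names are hypothetical.

```python
# Hypothetical label dictionary and slide IDs
tab = {'0001-20': ['0'], '0002-20': ['8', '9'], '0003-20': ['7']}
slide_ids = ['0002-20', '0001-20', '0004-20']

slide_id_set = set(slide_ids)  # set membership tests are fast
relevant = {case_id: codes for case_id, codes in tab.items() if case_id in slide_id_set}
missing = sorted(slide_id_set - set(tab))  # slide IDs without any label

print(relevant)
print(missing)
```

If `missing` is empty, the number of relevant IDs equals the number of slide IDs, which is the situation seen above (409 in both cases).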

## Making a label dataframe

We now want to translate the labels and prepare them for a dataframe:

In [13]:
labelMeanings = {0: 'ACTH', 8: 'LH', 9: 'FSH'}
arrayForDF = []
for rId in relevantIDs:
    arrayForDF.append([rId, " ".join([labelMeanings[x] for x in labelMeanings if str(x) in relevantIDs[rId]])])
    
#     print(rId, relevantIDs[rId])
arrayForDF[:5]
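
To make the translation step easier to follow, here is a minimal sketch of the same logic as a function. `translate` is a hypothetical helper and the code lists are made-up inputs. Codes that are not in the mapping (3 and 7 here) are dropped, so a case that carries only such codes would end up with an empty label string.

```python
LABEL_MEANINGS = {0: 'ACTH', 8: 'LH', 9: 'FSH'}

def translate(codes):
    """Join the names of all known codes found in `codes` (a list of strings)."""
    return " ".join(name for code, name in LABEL_MEANINGS.items() if str(code) in codes)

both = translate(['8', '9'])
fsh_with_extra = translate(['7', '9'])   # 7 is not translated
untranslated = translate(['3'])          # no known code at all

print(repr(both), repr(fsh_with_extra), repr(untranslated))
```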

(classified)


In [14]:
df = pd.DataFrame(np.array(arrayForDF),
                   columns=['ID', 'labels'])
df

(classified)


In [15]:
df['labels'].value_counts()

ACTH      180
LH FSH    179
LH         31
FSH        19
Name: labels, dtype: int64

Make sure the counts add up to the total number of cases:

In [16]:
acthCases = 180
lhfshCases = 179
lhCases = 31
fshCases = 19
acthCases + lhfshCases + lhCases + fshCases == numOfCases

True
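
The counts above are typed in by hand. As a sketch, they could instead be read from `value_counts()`, so the check stays valid if the data changes. The dataframe below is a small made-up stand-in for `df`.

```python
import pandas as pd

# Hypothetical stand-in for the label dataframe
demo_df = pd.DataFrame({
    'ID': ['c-1', 'c-2', 'c-3', 'c-4', 'c-5', 'c-6'],
    'labels': ['ACTH', 'ACTH', 'LH FSH', 'LH', 'FSH', 'LH FSH'],
})

counts = demo_df['labels'].value_counts()
acth_cases = int(counts.get('ACTH', 0))
lhfsh_cases = int(counts.get('LH FSH', 0))
lh_cases = int(counts.get('LH', 0))
fsh_cases = int(counts.get('FSH', 0))

# every case should fall into exactly one of the four label groups
all_accounted_for = (acth_cases + lhfsh_cases + lh_cases + fsh_cases) == len(demo_df)
print(all_accounted_for)
```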

## Counting WSI-pairs

Next, let's see how many WSI-pairs there are.

In [17]:
def getCaseIDFromName(slideName):
    # the case ID consists of the first two hyphen-separated fields, e.g. '1929-13'
    return re.search(r'(.+?-.+?)-.+', slideName).group(1)

### ACTH pairs

In [18]:
df[:5]

(classified)


Let's see if some IDs are overrepresented in our dataset:

In [19]:
for i in range(len(df)):
    if df['labels'][i] == 'ACTH':
        # number of distinct IHC slides that belong to this case
        howManySame = len({ihc_n for ihc_n in pair_df["ihc name"] if getCaseIDFromName(ihc_n) == df['ID'][i]})
        if howManySame != 1:
            print("WARNING: " + str(i) + ": " + str(howManySame))
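
The loop above rescans all slide names for every case. A possible alternative, sketched here with `collections.Counter`, counts the distinct IHC slides per case in a single pass. The slide names are made up and `case_id_from_name` is a hypothetical stand-in for `getCaseIDFromName`.

```python
import re
from collections import Counter

CASE_ID = re.compile(r'(.+?-.+?)-.+')

def case_id_from_name(name):
    return CASE_ID.search(name).group(1)

# Hypothetical IHC slide names; the first one appears twice on purpose
ihc_names = ['0001-20-I-ACTH', '0001-20-II-ACTH', '0002-20-I-ACTH', '0001-20-I-ACTH']

# count distinct slides per case, ignoring repeated names
slides_per_case = Counter(case_id_from_name(n) for n in set(ihc_names))
multi = {case: n for case, n in slides_per_case.items() if n != 1}
print(multi)
```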



Let's pick one index out of this list:

In [20]:
oddIndex = 131
pair_df.reset_index(drop=True, inplace=True)
[(pair["ihc name"], pair["he name"]) for index, pair in pair_df.iterrows() if getCaseIDFromName(pair["ihc name"])==df['ID'][oddIndex]]

(classified)


In [21]:
pair_df.head(3)

(classified)


Conclusion: For some IDs, there is more than one WSI pair. How many ACTH WSI pairs are there, then?

In [22]:
acthPairs = len(list(set([ihc_n for ihc_n in pair_df["ihc name"] if 'ACTH' in ihc_n])))
acthPairs

202

From how many cases?

In [23]:
acthCases

180

202 pairs from 180 cases then.

Does the number of HE slides match the number of ACTH slides?

In [24]:
pair_df.reset_index(drop=True, inplace=True)
HEslidesACTH = [pair["he name"] for index, pair in pair_df.iterrows() if 'ACTH' in pair["ihc name"]]
uniqueHEslidesACTH = list(set(HEslidesACTH))
len(uniqueHEslidesACTH)

202

Yes it does. Wonderful.

### LH and FSH

In [25]:
df['labels'].value_counts()

ACTH      180
LH FSH    179
LH         31
FSH        19
Name: labels, dtype: int64

In [26]:
lhfshCases,lhCases,fshCases

(179, 31, 19)

In [27]:
lhfshCases+lhCases+fshCases

229

So there are 229 gonadotroph cases.

How many slide pairs?

In [28]:
gona_slide_pairs = len(list(set([ihc_n for ihc_n in pair_df["ihc name"] if 'LH' in ihc_n or 'FSH' in ihc_n])))
gona_slide_pairs

414

How many unique HE slides?

In [29]:
gona_he_slides = len(list(set([pair["he name"] for index, pair in pair_df.iterrows() if 'LH' in pair["ihc name"] or 'FSH' in pair["ihc name"]])))
gona_he_slides

229

Yes, that makes sense: it matches the 229 gonadotroph cases.

Does it add up though?

In [30]:
gona_slide_pairs+acthPairs

616

In [31]:
len(list(set([ihc_n for ihc_n in pair_df["ihc name"]])))

616

Yes, it does.

#### FSH vs. LH vs. FSH+LH

For how many cases do we have FSH only, for how many do we have LH only, for how many do we have both?

FSH:

In [32]:
fsh_list = list(set([ihc_n for ihc_n in pair_df["ihc name"] if 'FSH' in ihc_n]))
fsh_num = len(fsh_list)
fsh_num

202

In [33]:
fshcaselist = list(set([
    getCaseIDFromName(fsh)
    for fsh in fsh_list]))
len(fshcaselist)

202

LH:

In [34]:
lh_list = list(set([ihc_n for ihc_n in pair_df["ihc name"] if 'LH' in ihc_n]))
lh_num = len(lh_list)
lh_num

212

In [35]:
lhcaselist = list(set([
    getCaseIDFromName(lh)
    for lh in lh_list]))
len(lhcaselist)

212

BOTH LH AND FSH:

In [36]:
lh_and_fsh = len([x for x in fshcaselist if x in lhcaselist])
lh_and_fsh

185

FSH ONLY:

In [37]:
fsh_only = len([x for x in fshcaselist if x not in lhcaselist])
fsh_only

17

LH ONLY:

In [38]:
lh_only = len([x for x in lhcaselist if x not in fshcaselist])
lh_only

27
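
The three groups are an intersection and two set differences, so the split can also be sketched with set operators. The case IDs below are made up for illustration.

```python
# Hypothetical case IDs with an FSH slide and with an LH slide
fsh_cases = {'a-1', 'a-2', 'a-3'}
lh_cases = {'a-2', 'a-3', 'a-4', 'a-5'}

both = fsh_cases & lh_cases          # cases with both stains
fsh_only_set = fsh_cases - lh_cases  # FSH but no LH
lh_only_set = lh_cases - fsh_cases   # LH but no FSH

# the three groups are disjoint and together cover every gonadotroph case
total = len(both) + len(fsh_only_set) + len(lh_only_set)
print(len(both), len(fsh_only_set), len(lh_only_set), total)
```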

Let's compare sums:

In [39]:
lh_and_fsh+fsh_only+lh_only

229

In [40]:
#from above
lhfshCases+lhCases+fshCases

229

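Good, the totals are equal. Note that the individual counts differ between the two approaches: 185/17/27 (both/FSH only/LH only) from the slide names versus 179/19/31 from the labels.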

### HE

How many HE slides do we have?

In [41]:
he_wsinum = len(list(set([he_n for he_n in pair_df["he name"]])))
he_wsinum

431

### Sum

In [42]:
ihc_wsinum = len(list(set([ihc_n for ihc_n in pair_df["ihc name"]])))
ihc_wsinum

616

In [43]:
wsinum = he_wsinum + ihc_wsinum
wsinum

1047

In [44]:
wsinum2 = lh_and_fsh*3+fsh_only*2+lh_only*2 + acthPairs*2
wsinum2

1047
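
One way to read the multipliers in `wsinum2`: a case with both LH and FSH contributes three WSIs (one HE and two IHC), while every other pair contributes two (one HE and one IHC). This reading is an assumption, but it reproduces the HE and IHC totals counted above, as the following arithmetic sketch shows.

```python
# Counts taken from the cells above
lh_and_fsh, fsh_only, lh_only, acth_pairs = 185, 17, 27, 202

# assumption: one HE slide per gonadotroph case and one per ACTH pair
he_slides = lh_and_fsh + fsh_only + lh_only + acth_pairs
# two IHC slides for LH+FSH cases, one IHC slide otherwise
ihc_slides = 2 * lh_and_fsh + fsh_only + lh_only + acth_pairs

total_wsis = he_slides + ihc_slides
print(he_slides, ihc_slides, total_wsis)
```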

In [45]:
print("We used a set of "+str(wsinum)+" WSIs!")

We used a set of 1047 WSIs!


# Adding labels to `pair_df`

To do this, we reuse a regex from above:

In [46]:
pair_df.reset_index(drop=True, inplace=True)
for index, pair in pair_df.iterrows():
    idHE = re.search(r'(.+?-.+?)-.+', pair["he name"]).group(1)
    thisLabels = df.loc[df['ID'] == idHE]["labels"].values[0]
    pair_df.loc[index, 'case'] = idHE
    pair_df.loc[index, 'labels'] = thisLabels
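
As an alternative to the row-by-row loop, the same two columns could be filled with vectorized pandas operations. This is only a sketch on made-up data: `pairs` and `labels` are hypothetical stand-ins for `pair_df` and `df`. One difference in behavior is that a case without a label ends up as `NaN` here, whereas `.values[0]` in the loop raises an `IndexError`.

```python
import pandas as pd

# Hypothetical stand-ins for pair_df and df
pairs = pd.DataFrame({
    'he name': ['0001-20-I-HE', '0002-20-I-HE', '0002-20-I-HE'],
    'ihc name': ['0001-20-I-ACTH', '0002-20-I-LH', '0002-20-I-FSH'],
})
labels = pd.DataFrame({'ID': ['0001-20', '0002-20'], 'labels': ['ACTH', 'LH FSH']})

# extract the case ID from the HE slide name (same regex as in the loop)
pairs['case'] = pairs['he name'].str.extract(r'(.+?-.+?)-.+', expand=False)
# look up each case's label string by ID
pairs['labels'] = pairs['case'].map(labels.set_index('ID')['labels'])

print(pairs)
```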

In [47]:
pair_df

(classified)


We are happy with this result. Let's save the dataframe again.

In [49]:
pair_df.to_csv(output_path/'pairlist_df.csv', index=False)
# as before, we save a copy we won't overwrite any more:
pair_df.to_csv(output_path/'pairlist_df_after_notebook_2.csv', index=False)