In [73]:
import pandas as pd
import os

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

from collections import Counter

### Classifying Scientists
Aim: To classify Scientists into those with links to Germany and those without links to Germany.

<b> First: loading each Excel sheet into one dataframe </b>

Loading the excel sheets into dataframes, fixing the column names and adding a column to indicate the source file.

[ The column name Ocr_Index is a guess: the value is not defined for manually inserted rows, so I am assuming it comes from the OCR process. It won't matter either way. ]

In [74]:
cols = ['Ocr_Index', 'Page', 'Surname', 'Other_Names', 'Affiliation', 'Field', 'Full_Text', 'Indicator', 'sheet_name', 'EMPTY', 'Cleaning_Comments']
# get all xls in the folder
sheet_paths = [f for f in os.listdir(
    '.') if os.path.isfile(f) and f.endswith('.xls')]
dfs = []
for sheet_path in sheet_paths:
    df = pd.read_excel(sheet_path)

    df.loc[-1] = df.columns.tolist() # the sheets have no header, so pandas is using the first row as header
    df.index = df.index + 1  # so we have to re-add the columns as the first row
    df.sort_index(inplace=True)

    df_cols = cols[:len(df.columns)] # then fill in the correct column names
    df.columns = df_cols

    dfs.append(df)
# concat all dataframes
df = pd.concat(dfs, ignore_index=True)
df.head(3)

Unnamed: 0,Ocr_Index,Page,Surname,Other_Names,Affiliation,Field,Full_Text,Indicator,sheet_name,EMPTY,Cleaning_Comments
0,22968.0,1200,Rogers,Prof. Charles E(dwin),33 Concord St,CIVIL ENGINEERING,"Saratoga Co, N. Y, June 5, 74. C.E, Rensselaer...",0.0,6_1200_1400,Unnamed: 9,Unnamed: 10
1,22969.0,1200,Rogers,Charles F(letcher),University Farm,BIOCHEMISTRY,"Denver, Colo, June 15, 02. A.B, Nebr. Wesleyan...",0.0,6_1200_1400,,
2,22970.0,1200,Rogers,Prof. C(harles) G(ardner),378 Reamer Place,PHYSIOLOGY,"Perry, N. Y, March 4, 75. A.B, Syracuse, 97, A...",0.0,6_1200_1400,,


Next, convert missing values to empty strings and make Full_Text lower case.

In [75]:
# Convert NaNs to empty strings so the string operations below work on every row
df = df.fillna('')

df['Full_Text'] = df['Full_Text'].str.lower()

<img src="https://raw.githubusercontent.com/FM-ds/FM-ds.github.io/main/misc_resource/men_of_science.png">

### 1. First Approach: Search for 'Germany', 'German', 'Deutschland', etc.
This is easy and should be a good start. There are likely scientists with German links whose entries never mention Germany explicitly, but it is not yet clear how many. The matches will also yield a list of places and institutions that can be searched for in further steps.

e.g. if "Gortmund, Germany" is found, then scientists whose entries mention only "Gortmund" can later be classified as having links to Germany.
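
A minimal sketch of that idea, assuming the place name is the single word directly before ", germany". The sample texts and the names `place_pattern`, `candidate_places` and `indirect_matches` are hypothetical and not taken from the dataset:

```python
import re
from collections import Counter

# Hypothetical lower-cased Full_Text values (illustrative only).
texts = [
    "berlin, germany, march 18, 86. edinburgh; freiburg, 08.",
    "gortmund, germany, june 5, 74. c.e, rensselaer, 96.",
    "gortmund, aug. 7, 05. a.b, stanford, 26.",
    "perry, n. y, march 4, 75. a.b, syracuse, 97.",
]

# Capture the single word that directly precedes ", germany".
place_pattern = re.compile(r"\b([a-z]+), germany\b")

place_counts = Counter(
    place for text in texts for place in place_pattern.findall(text)
)
candidate_places = set(place_counts)

# Rows that mention a harvested place without mentioning Germany itself.
indirect_matches = [
    text for text in texts
    if "germany" not in text
    and any(re.search(rf"\b{re.escape(place)}\b", text) for place in candidate_places)
]
```

Harvested places would still need a manual check before being added to the search terms, since a name found next to "germany" once can also be an ordinary place name elsewhere.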

In [76]:
search_terms = ["German", "Germany", "Deutschland"]
search_terms = [term.lower() for term in search_terms]

In [77]:
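def is_de(full_text, search_terms):
    """Return the search terms that occur as substrings of full_text."""
    global counter  # module-level progress counter, reset before each apply
    counter += 1
    if counter % 5000 == 0:
        print(f"Processed {counter} Rows")
    # Plain substring test, so 'german' also matches inside 'germany'
    return [term for term in search_terms if term in full_text]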

In [78]:
# Filter for rows where full_text contains any of the search terms
counter = 0
df['matched_terms'] = df.apply(lambda x: is_de(x['Full_Text'], search_terms), axis=1)
df['de'] = df.apply(lambda x: len(x['matched_terms']) > 0, axis=1)
df['de'].value_counts()

Processed 5000 Rows
Processed 10000 Rows
Processed 15000 Rows
Processed 20000 Rows
Processed 25000 Rows


False    27267
True       463
Name: de, dtype: int64
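
The row-wise `apply` can also be written as one vectorised call. A minimal sketch, assuming only the boolean flag is needed and not the list of matched terms; `demo_df` and `demo_terms` are hypothetical stand-ins for `df` and `search_terms`:

```python
import re
import pandas as pd

# Hypothetical miniature stand-in for the real dataframe.
demo_df = pd.DataFrame({"Full_Text": [
    "berlin, germany, march 18, 86. edinburgh; freiburg.",
    "perry, n. y, march 4, 75. a.b, syracuse, 97.",
    "",
]})
demo_terms = ["german", "germany", "deutschland"]

# Build one alternation pattern; re.escape guards against regex metacharacters.
pattern = "|".join(re.escape(term) for term in demo_terms)

# str.contains performs the same substring test as `term in full_text`.
demo_df["de"] = demo_df["Full_Text"].str.contains(pattern, regex=True)
```

This avoids the Python-level loop over rows, at the cost of losing the per-row list of which terms matched.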

Here we've matched only 463 of the 27,730 rows, around 1.7%, which is almost definitely an undercount.
However, if we compare term frequencies in matched rows with non-matched rows we can gain an insight into other terms to consider.

We can do this by using term frequency-inverse document frequency (tf-idf), which is a measure of how important a term is to a document in a collection or corpus. If we compute tf-idf for each Full_Text row, we can compare the mean tf-idf of every term in the matched rows with its mean in the non-matched rows.
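
For reference, a sketch of the weighting as I understand scikit-learn's `TfidfVectorizer` to apply it with its default settings (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`). Here $n$ is the number of rows, $\mathrm{df}(t)$ is the number of rows containing term $t$, and $\mathrm{tf}(t, d)$ is the raw count of $t$ in row $d$:

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
w(t, d) = \mathrm{tf}(t, d)\,\mathrm{idf}(t),
\qquad
\text{tf-idf}(t, d) = \frac{w(t, d)}{\sqrt{\sum_{t'} w(t', d)^2}}
$$

Because each row is normalised to unit length, the per-term means of the two groups are comparable even though the entries differ a lot in length.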

In [79]:
# Compute tf-idf for every row (matched and non-matched alike);
# the two groups are compared in the next cell
v = TfidfVectorizer()
tf_idf = v.fit_transform(df['Full_Text'])

Now we can put this in its own dataframe to find the terms whose mean tf-idf differs most between the matched and the non-matched rows. This takes about 40 seconds to run on my laptop.

In [80]:
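# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tf_idf_df = pd.DataFrame(tf_idf.toarray(), columns=v.get_feature_names_out())
# Merge the 'de' flag from df into tf_idf_df. The vocabulary already has a 'de' token column,
# so pandas suffixes the pair: 'de_x' is the token's tf-idf and 'de_y' is our flag.
tf_idf_df = tf_idf_df.merge(df[['de']], left_index=True, right_index=True)

tf_grouped_df = tf_idf_df.groupby('de_y').mean().T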
tf_grouped_df.columns = ['Not Matched', 'Matched']

tf_grouped_df['diff'] = tf_grouped_df['Matched']-tf_grouped_df['Not Matched']
tf_grouped_df = tf_grouped_df.sort_values(by=['diff'], ascending=False)
tf_grouped_df.head(25)



There aren't any huge surprises here, but it is sensible to start grabbing terms from this list to build a bigger one.

### 2. Expanding our Term List with the tf-idf Insights

From this list I've selected ~35 terms that are unambiguous and are likely to be good indicators of German scientists. Let's expand our search to include these terms and see how many more we can find.

In [83]:
new_terms = ['germany',  'berlin',  'german',  'munich',  'gottingen',  'deuts',  'hamburg',  'heidelberg',  'hochschule',  'freiburg',  'breslau',  'kiel',  'karlsruhe',  'stuttgart',  'darmstadt',  'leipzig',  'baden',  'tubingen',  'dresden',  'cologne',  'erlangen',  'montefiore',  'strassburg',  'wiesbaden',  'charlottenburg',  'hanover',  'chemnitz',  'bielefeld',  'konigsberg',  'gnissau',  'bavarian',  'eberswalde',  'schoeneberg',  'braunschweig'] 
search_terms = list(set(search_terms + new_terms))

In [84]:
# Filter for rows where full_text contains any of the search terms
counter = 0
df['matched_terms'] = df.apply(lambda x: is_de(x['Full_Text'], search_terms), axis=1)
df['de_2'] = df.apply(lambda x: len(x['matched_terms']) > 0, axis=1)
df['de_2'].value_counts()

Processed 5000 Rows
Processed 10000 Rows
Processed 15000 Rows
Processed 20000 Rows
Processed 25000 Rows


False    25803
True      1927
Name: de_2, dtype: int64

1927 matches is a lot better! About 6.9% of the rows is much higher than 1.7%. Let's look at a random sample of these rows to check the matches:

In [85]:
df[df['de_2'] == True][["Surname", "Other_Names", "Affiliation", "Full_Text"]].sample(20)

Unnamed: 0,Surname,Other_Names,Affiliation,Full_Text
19336,Alsberg,Dr. C(arl) L(ucas),Calif. ‘. ^@,"new york, n. y, april 2. 77. a ll, colrnn- .'k..."
15837,Mumford,Dr. F(rederick) B(lackmar),812 College Ave,"moscow, mich, may 28, 68. b.s, mich. state col..."
19821,Babcock,Prof. H(arold) L(ester),Woodleigh Road,"holliston, mass, may 30, 86. worcester polytec..."
6842,Koffka,Prof. K(urt),57 Crescent St,"berlin, germany, march 18, 86. edinburgh; frei..."
26106,Garrey,Prof. W(alter) E(ugene),Vanderbilt University School of Medicine,"reeds- ville, wis, april 7, 73. b.s, lawrence ..."
1219,Senior,Dr. James K(uhn),5612 Kenwood Ave,"cincinnati, ohio, sept. 15, 89. a.b, harvard, ..."
13283,Miles,Dr. Catharine Cox (Mrs. Walter R(ichard) Miles),Yale University,"san jose, calif, may 20, 90. a.b, stanford, 12..."
25076,Fernelius,Dr. W(illis) Conrad,Ohio State University,"riverdale, utah, aug. 7, 05. a.b, stanford, 26..."
21406,Cavett,Dr. J(esse) W(illiam),Dr. Salsbury’s Laboratories,"kent. ind, march 6, 00. a.b, hanover col. (ind..."
27050,Griffiths,Dr. Francis P(riday),Oregon State Agricultural College,"seattle, wash, july- 12, 04. b.s, washington (..."


And just the Full_Text values:

In [86]:
list(df[df['de_2'] == True][["Surname", "Other_Names", "Affiliation", "Full_Text"]].sample(20)["Full_Text"])

['wirballen, russia, aug. 18, 83. m.d, n. y. univ, 08, fellow, 08-09; munich, 11. instr. pharmacol, univ. and bellevue hosp. med. col, 08-09; asst, physiol, med. col, cornell, 09-11; instr. physiol, chem, pennsylvania, 11-13, asst, prof, 13-15; prof. clin. med, fordham, 17-20; attending physician, montefiore hosp. lecturer, med. col, cornell, 18-19; consulting physician, lenox hill hosp. am. med. asn; physiol. soc; soc. biol. chem; soc. exp. biol; harvey soc; n. y. acad. med. diseases of metabolism; physiology of the kidney; kidney and adrenal products; blood pressure; influence of adrenalin; gluconeogenesis; nephritis; psoriasis; diabetes; carbohydrate and protein metabolism; diagnosis of diseases of metabolism; insulin in treatment of, and surgery in diabetes.',
 'boissevain, man, can, aug. 10, 96. b.a, sa.katche wan, 23, m.a, 25; california; berlin; ph.d, cornell, 30 a«t rust research, saskatchewan, 25-26; hot. california. 2h28. independent research, 28- research grant. am. arad. sc

It looks good! 1927 is a lot better than 463. We can use the tf-idf method again to see if there are any more terms we can add to our list. It's unlikely to be as fruitful as the first time, but there's no harm in trying.

In [89]:
tf_idf_df = tf_idf_df.merge(df[['de_2']], left_index=True, right_index=True)

In [92]:
tf_grouped_df = tf_idf_df.groupby('de_2').mean().T
tf_grouped_df.columns = ['Not Matched', 'Matched']

tf_grouped_df['diff'] = tf_grouped_df['Matched']-tf_grouped_df['Not Matched']
tf_grouped_df = tf_grouped_df.sort_values(by=['diff'], ascending=False)
tf_grouped_df.head(25)

# Keep only the rows whose index (the term) is not already in search_terms
tf_grouped_df[~tf_grouped_df.index.isin(search_terms)].head(25)

Unnamed: 0,Not Matched,Matched,diff
de_y,0.0,0.24027,0.24027
chem,0.030998,0.054219,0.023221
oberlin,0.0,0.019609,0.019609
soc,0.036316,0.054307,0.017991
gesell,0.000437,0.018112,0.017675
prof,0.03618,0.051634,0.015454
and,0.044615,0.058622,0.014007
med,0.018712,0.03218,0.013469
pres,0.012078,0.025321,0.013242
of,0.036965,0.050118,0.013153


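There isn't anything particularly interesting here: most of the remaining terms are generic abbreviations (chem, soc, prof, med). The de_y row is the flag from the first search, not a term. Note that oberlin has a mean of 0 in the non-matched rows, which suggests that every row mentioning Oberlin was matched through the substring 'berlin'.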

<h3> 3. Expanding our Term List with a list of German Cities - <b style="color:darkred"> Poor Solution </b> </h3>

<h5> Constructing the Term List </h5>

There is a (modern and incomplete) list of some German cities in the german-gov-domains repository <a href="https://raw.githubusercontent.com/robbi5/german-gov-domains/master/data/domains.cities.csv">here</a>

In [93]:
de_cities_df = pd.read_csv("https://raw.githubusercontent.com/robbi5/german-gov-domains/master/data/domains.cities.csv")

In [99]:
# Get an ascii version of the cities (non-ascii characters are dropped, not transliterated)
de_cities_df['ascii_cities']=de_cities_df.City.apply(lambda x: x.encode('ascii', 'ignore').decode('ascii'))
# Append the ascii version and the standard version to the search terms
search_terms.extend(de_cities_df.ascii_cities.tolist())
search_terms.extend(de_cities_df.City.tolist())
search_terms = [term.lower() for term in search_terms]
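
Encoding with `'ignore'` deletes characters such as ü and ö entirely, whereas the hand-picked terms earlier use folded spellings such as 'gottingen' and 'tubingen'. A minimal sketch of a folding alternative using the standard-library `unicodedata` module; the helper names `to_ascii_ignore` and `to_ascii_folded` are hypothetical:

```python
import unicodedata

def to_ascii_ignore(name: str) -> str:
    """Same approach as the cell above: non-ASCII characters are dropped."""
    return name.encode("ascii", "ignore").decode("ascii")

def to_ascii_folded(name: str) -> str:
    """Decompose accented letters first so that the base letter survives."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# München, Göttingen, Köln (written with escapes to avoid encoding ambiguity)
examples = ["M\u00fcnchen", "G\u00f6ttingen", "K\u00f6ln"]
dropped = [to_ascii_ignore(city).lower() for city in examples]
folded = [to_ascii_folded(city).lower() for city in examples]
```

Here `dropped` contains strings such as 'mnchen', while `folded` contains 'munchen'. The letter ß has no decomposition, so it would still be dropped and would need separate handling (for example replacing it with 'ss' first).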

In [100]:
# Filter for rows where full_text contains any of the search terms
counter = 0
df['de_city_list'] = df.apply(lambda x: is_de(x['Full_Text'], search_terms), axis=1)

Processed 5000 Rows
Processed 10000 Rows
Processed 15000 Rows
Processed 20000 Rows
Processed 25000 Rows


An issue here is that some German city names are also English words, e.g. "March" or "Bell", while others are short enough to occur inside many unrelated words.
This limits the usefulness of searching across the whole Full_Text.
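
One way to reduce the substring problem is to match whole words only and to skip very short names. A minimal sketch, assuming a hypothetical helper `match_whole_words` and a hypothetical `min_length` cut-off of 4 characters; the sample text is made up:

```python
import re

def match_whole_words(full_text, terms, min_length=4):
    """Return the terms that occur in full_text as whole words.

    min_length is a hypothetical cut-off that drops very short names.
    """
    matched = []
    for term in terms:
        if len(term) < min_length:
            continue
        # \b requires a word boundary on both sides of the term.
        if re.search(rf"\b{re.escape(term)}\b", full_text):
            matched.append(term)
    return matched

sample_text = "augusta, ga, march 4, 75. a.b, oberlin, 97; heidelberg, 01."
sample_terms = ["au", "berg", "berlin", "heidelberg", "march"]
whole_word_hits = match_whole_words(sample_text, sample_terms)
```

In this sample 'berg' and 'berlin' no longer match inside 'heidelberg' and 'oberlin', but 'march' still matches because it appears as a real word. Names that collide with English words would need a stop list or a narrower search region.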

In [101]:
# Find the most common items in de_city_list
all_matches = [term for matches in df['de_city_list'] for term in matches]
Counter(all_matches).most_common(15)

[('au', 18784),
 ('burg', 12976),
 ('march', 7240),
 ('berg', 6520),
 ('vil', 4632),
 ('rain', 4624),
 ('bell', 4104),
 ('berlin', 3560),
 ('ering', 3344),
 ('riol', 2928),
 ('sen', 2892),
 ('rust', 2372),
 ('nebra', 2224),
 ('lam', 2200),
 ('lf', 1986)]
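
Since the sampled entries start with a birthplace followed by a birth date (e.g. "berlin, germany, march 18, 86."), one option is to search the city list only in the text before that date. A rough sketch under that assumption; `birthplace_segment` and the month pattern are hypothetical, and OCR noise will make some rows fall through:

```python
import re

# Month tokens as they appear in the lower-cased samples (e.g. "march", "sept.", "aug.").
MONTHS = r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*\.?"
# A comma, a month token, then a one- or two-digit day.
birth_date = re.compile(rf",\s*{MONTHS}\s*\d{{1,2}}")

def birthplace_segment(full_text):
    """Return the text before the first 'month day' pattern, or '' if none is found."""
    match = birth_date.search(full_text)
    return full_text[:match.start()] if match else ""

segment = birthplace_segment("berlin, germany, march 18, 86. edinburgh; freiburg")
```

Searching only `segment` keeps the month "march" out of the searched text, although it also ignores German universities listed later in the entry, so it would complement the full-text search, not replace it.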