# Analysing Topical Shifts in Migration Publications via Title Words

The titles of publications on migration management give us an overview of what the scientific community has focused on in its research over the decades.

The dataset consists of authors, titles and publication year of articles published in the journal International Migration (IM) over the period 1961-2011. 

We use Content Analysis as a method to investigate topical shifts in the discourse around migration management. “Content analysis is an approach to the analysis of documents and texts (which may consist of words and/or images and may be printed or online, written or spoken) that seeks to quantify content in terms of predetermined categories and in a systematic and replicable manner.” (Bryman, 2016, p.283).


### References 

Bryman, A. (2016). Social research methods (Fifth edition.). Oxford ; New York: Oxford University Press.



## Loading and inspecting the dataset

We start by loading the dataset in Pandas, which allows us to do basic analysis of the metadata.

In [1]:
%reload_ext autoreload
%autoreload 2

import pandas as pd
import re


In [94]:
# the name and location of the article records for the IM journal (in CSV format)
records_file = '../data/main-review-article-records.csv'

# load the csv data into a data frame
df = pd.read_csv(records_file)
# show the first five records of the dataset to demonstrate what the records look like
df.head(5)

Unnamed: 0.1,Unnamed: 0,article_title,article_doi,article_author,article_author_index_name,article_author_affiliation,article_page_range,article_pub_date,article_pub_year,issue_section,issue_number,issue_title,issue_page_range,issue_pub_date,issue_pub_year,volume,journal,publisher,article_type
0,9,The importance of emigration for the solution ...,,"Wander, H.","Wander, H.",,,1951,1951,article,,,,1951,1951,1,Publications of the research group for europea...,Staatsdrukkerij,main
1,11,European emigration overseas past and future,,"Citroen, H.A.","Citroen, H.A.",,,1951,1951,article,,,,1951,1951,2,Publications of the research group for europea...,Staatsdrukkerij,main
2,13,Some aspects of migration problems in the Neth...,,"Beijer, G. && Oudegeest, J.J.","Beijer, G. && Oudegeest, J.J.",&&,,1952,1952,article,,,,1952,1952,3_1,Publications of the research group for europea...,Staatsdrukkerij,main
3,15,Some quantitative aspects of future population...,,"Brink, van den, T.","Brink, van den, T.",,,1952,1952,article,,,,1952,1952,3_2,Publications of the research group for europea...,Staatsdrukkerij,main
4,17,"The refugees as a burden, a stimulus, and a ch...",,"Edding, F.","Edding, F.",,,1951,1951,article,,,,1951,1951,4,Publications of the research group for europea...,Staatsdrukkerij,main


In [None]:
df.article_author

In [96]:
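# replace missing author names with an empty string (assigning back avoids in-place edits on a column view)
df['article_author'] = df['article_author'].fillna('')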
def parse_surname(author_name: str):
    return author_name.split(',')[0].replace('ij', 'y').title()


def parse_surname_initial(author_name: str):
    try:
        if ',' not in author_name:
            return author_name
        surname = author_name.split(',')[0].replace('ij', 'y').title()
        initial = author_name.split(', ')[1][0]
        return f'{surname}, {initial}'
    except TypeError:
        print(author_name)


df['article_author'] = df['article_author'].str.title()
df['author_surname_initial'] = df.article_author.apply(parse_surname_initial)
df['issue_pub_decade'] = df.issue_pub_year.apply(lambda x: int(x/10)*10)

### Basic summary statistics

A count of the values in the journal name column reveals that the article metadata has some variation, but almost all records use the canonical _International Migration_. In the early years, the journal was called _Migration_, but _Migracion_ was also used. Some contributed articles are written in Spanish or French.



In [97]:
df[['publisher', 'article_type']].value_counts().sort_index()

publisher        article_type
Sage Publishing  main            1539
                 review          1842
Staatsdrukkerij  main             141
Wiley            main             719
                 review            43
dtype: int64

## Data Selection

Before performing a content analysis of the article titles, we want to make a selection of titles that is focused on the academic debate around migration, without any distracting non-topical titles that might obscure any topical shifts across the decades from 1950-2000.

Given the analysis above, we use the following three sets of articles for analysis:

1. main articles from REMP and IM, plus the IM review articles,
2. main articles from IMR,
3. review articles from IMR

The reason to combine the main and review articles of IM is that the number of review articles is too low for independent content analysis. 


In [98]:
def map_dataset(journal, article_type):
    if journal != 'International Migration Review' and article_type != 'supplementary':
        return 'REMP_IM'
    elif journal == 'International Migration Review' and article_type == 'main':
        return 'IMR_main'
    elif journal == 'International Migration Review' and article_type == 'review':
        return 'IMR_review'
    else:
        return None

df['dataset'] = df.apply(lambda x: map_dataset(x['journal'], x['article_type']), axis=1)
df.dataset.value_counts()
df_selected = df

This leaves 1183 out of 1690 article titles for analysis.

## Analysing the Article Titles

To analyse the topics of discourse, we use the article titles. To do a content analysis, some data transformations are needed:

- standardising the use of upper and lowercase characters,
- removing common stopwords, as they convey nothing about the topics discussed
- counting individual words and sequences of words as a quantitative signal for the attention to different topics.

We look specifically at:

- word unigram frequencies: how often individual words occur across titles
- word bigram frequencies: how often combinations of two words occur across titles.
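
The transformations and counts listed above can be sketched end to end on two of the titles. This is a minimal illustration, not the code used in the rest of the notebook: the stopword set `TOY_STOPWORDS` and the helper `count_terms` are hypothetical names introduced only for this example.

```python
from collections import Counter

# Toy stopword list, an assumption for this sketch only
TOY_STOPWORDS = {"the", "of", "in", "and", "some", "on"}


def count_terms(titles, stopwords):
    """Count content-word unigrams and bigrams across lower-cased titles."""
    unigrams, bigrams = Counter(), Counter()
    for title in titles:
        # split() without arguments splits on any whitespace and drops empty strings
        words = title.lower().split()
        unigrams.update(w for w in words if w not in stopwords)
        # pair each word with its right-hand neighbour, skipping pairs that contain a stopword
        bigrams.update(
            f"{first} {second}"
            for first, second in zip(words, words[1:])
            if first not in stopwords and second not in stopwords
        )
    return unigrams, bigrams


example_titles = [
    "Some aspects of migration problems in the Netherlands",
    "Some remarks on selective migration",
]
unigram_counts, bigram_counts = count_terms(example_titles, TOY_STOPWORDS)
print(unigram_counts.most_common(3))
print(bigram_counts.most_common(3))
```

In this toy example 'migration' is counted twice as a unigram, while the bigrams 'migration problems' and 'selective migration' each occur once and keep the two contexts apart.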

To demonstrate the need for the transformations described above, we look at the first 20 titles.

In [99]:
# Get a list of all the titles
titles = list(df.article_title)

# show the first 20 titles
titles[:20]



['The importance of emigration for the solution of population problems in Western Europe',
 'European emigration overseas past and future',
 'Some aspects of migration problems in the Netherlands',
 'Some quantitative aspects of future population development in the Netherlands',
 'The refugees as a burden, a stimulus, and a challenge to the West German economy  (eerder uitgegeven in Duits, Kieler Studien 12)',
 'The solution of the Karelian refugee problem in Finland',
 'Some factors influencing postwar emigration from the Netherlands',
 'Some remarks on selective migration',
 "L'immigration en France depuis 1945",
 'Economic impacts of immigration : the Brazilian immigration problem',
 'Industrialization-emigration; the consequences of the demographic development in the Netherlands',
 'The assimilation and integration of pre- and postwar refugees in the Netherlands',
 'Introduction',
 'The German exodus : a selective study on the post World War II expulsion of German populations and i

#### Inconsistent Case

The titles differ in their use of upper and lower case, so one step is to normalise all titles to be lower case. 

One consequence of this is that a capitalised word that is part of a name (like _Migration_ in the organisation name _Intergovernmental Committee for European Migration_) is merged with the regular noun _migration_, so a potentially meaningful distinction is lost. In most cases this is not a problem, as both forms represent the same concept. Moreover, titles of journal articles tend to use title casing for all words in the title, or at least for all content-bearing words (i.e. non-stopwords), so it is difficult to make this distinction with algorithmic processing anyway.
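
The effect of case folding on counts can be shown with a few tokens. The token list below is made up for illustration.

```python
from collections import Counter

# Made-up tokens: the same word in three different casings, plus one other word
tokens = ["Migration", "migration", "MIGRATION", "Europe"]

case_sensitive = Counter(tokens)                      # each casing is a separate key
case_folded = Counter(token.lower() for token in tokens)  # all casings share one key

print(case_sensitive["migration"])
print(case_folded["migration"])
```

Without lower-casing, the three variants are counted separately. After lower-casing they are merged into a single entry, which is the behaviour we want for frequency lists, at the cost of no longer separating names from regular nouns.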



In [100]:
# show the first 20 titles in lower case
[title.lower() for title in titles[:20]]



['the importance of emigration for the solution of population problems in western europe',
 'european emigration overseas past and future',
 'some aspects of migration problems in the netherlands',
 'some quantitative aspects of future population development in the netherlands',
 'the refugees as a burden, a stimulus, and a challenge to the west german economy  (eerder uitgegeven in duits, kieler studien 12)',
 'the solution of the karelian refugee problem in finland',
 'some factors influencing postwar emigration from the netherlands',
 'some remarks on selective migration',
 "l'immigration en france depuis 1945",
 'economic impacts of immigration : the brazilian immigration problem',
 'industrialization-emigration; the consequences of the demographic development in the netherlands',
 'the assimilation and integration of pre- and postwar refugees in the netherlands',
 'introduction',
 'the german exodus : a selective study on the post world war ii expulsion of german populations and i

For analysing in how many titles each word occurs, there are several other issues:

- some titles have footnote symbols like '*' and '1', which are not part of the words they are attached to. For example: "theoretical considerations and empirical evidence on brain drain grounding the review of albania's and bulgaria's experience 1"
    - **The normalisation step is to remove these symbols.**
- some words contain possessive forms like "Europe's". We only want to count the content word, e.g. 'Europe'. 
    - **The normalisation step is to remove the 's part to retain only the content word.**
- some titles contain acronyms with dots, like A.D., C.E.E. or U.S.A. 
    - **The normalisation step is to remove the dots.**
- We want to compare plain words without any attached punctuation, such that “mater et magistra” should become three words 'mater', 'et', 'magistra' without the opening and closing double quote characters
    - **The normalisation step is to replace all such punctuation symbols by a whitespace, so that their removal doesn't contract two words on opposite sides of the punctuation.**
    - **The exception is hyphenated words, which should remain intact, e.g. 'co-development' should be treated as a single word.**
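
The rules above can be checked against the examples they mention. The sketch below is self-contained and only illustrative: `normalise_example` is a hypothetical helper that compresses the rules into one function, and its regular expressions are one possible reading of them, not necessarily the ones used in the notebook.

```python
import re


def normalise_example(title: str) -> str:
    """Apply the normalisation rules listed above to a single title (illustrative only)."""
    # 1. drop a trailing footnote symbol: '*' or a stand-alone ' 1' at the end
    title = re.sub(r"(\s1|\*)$", "", title)
    # 2. remove the dots from every dotted acronym (U.S.A. -> USA)
    title = re.sub(r"\b(?:\w\.){2,}", lambda m: m.group(0).replace(".", ""), title)
    # 3. strip the possessive 's, after mapping the typographic apostrophe to a plain one
    title = re.sub(r"(\w)'s\b", r"\1", title.replace("’", "'"))
    # 4. replace punctuation by a space, keeping word characters, spaces and hyphens
    title = re.sub(r"[^\w -]", " ", title)
    # 5. collapse runs of spaces and trim the ends
    return re.sub(r" +", " ", title).strip().lower()


print(normalise_example("Europe's experience 1"))
print(normalise_example("Migration to the U.S.A. and the C.E.E."))
print(normalise_example("“mater et magistra”"))
print(normalise_example("co-development*"))
```

Note that rule 1 in this sketch would also strip a legitimate trailing ' 1' from a title, so it should be checked against the actual titles before being relied on.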
    
    

In [101]:
def remove_footnote_symbols(title):
    """Remove footnote symbols in the title like * and 1."""
    title = re.sub(r'([a-z]+)[12]', r'\1', title)
    if title.endswith('*'):
        return title[:-1]
    elif title.endswith(' 1'):
        return title[:-2]
    else:
        return title
    
    
def normalise_acronyms(title):
    """Remove dots in acronyms (A.D. -> AD , U.S.A. -> USA)"""
    # re.sub handles every acronym in the title, not only the first one that matches
    return re.sub(r'\b(?:\w\.){2,}', lambda match: match.group(0).replace('.', ''), title)


def resolve_apostrophes(title):
    """Remove 's from words."""
    # map typographic single quotes to a plain apostrophe before matching 's
    title = title.replace("‘", "'").replace("’", "'")
    title = re.sub(r"(\w)'s\b", r'\1', title)
    return title


def remove_punctuation(title):
    """Remove all non-alpha-numeric characters except whitespace and hyphens"""
    title = re.sub(r'[^\w -]', ' ', title)
    return title


def collapse_multiple_whitespace(title):
    """Reduce multiple whitespace to a single whitespace, e.g. '  ' -> ' '. """
    title = re.sub(r' +', ' ', title)
    return title


def normalise_title(title):
    title = remove_footnote_symbols(title)
    title = normalise_acronyms(title)
    title = resolve_apostrophes(title)
    title = remove_punctuation(title)
    title = collapse_multiple_whitespace(title)
    return title.lower()

def demonstrate_normalisation():
    titles = list(df_selected.article_title)
    for title in titles:
        print('Original:', title)
        title = normalise_title(title)
        print('Normalised:', title)
        print()

demonstrate_normalisation()


Original: The importance of emigration for the solution of population problems in Western Europe
Normalised: the importance of emigration for the solution of population problems in western europe

Original: European emigration overseas past and future
Normalised: european emigration overseas past and future

Original: Some aspects of migration problems in the Netherlands
Normalised: some aspects of migration problems in the netherlands

Original: Some quantitative aspects of future population development in the Netherlands
Normalised: some quantitative aspects of future population development in the netherlands

Original: The refugees as a burden, a stimulus, and a challenge to the West German economy  (eerder uitgegeven in Duits, Kieler Studien 12)
Normalised: the refugees as a burden a stimulus and a challenge to the west german economy eerder uitgegeven in duits kieler studien 12 

Original: The solution of the Karelian refugee problem in Finland
Normalised: the solution of the kare

### Word Frequency Lists

We start with a quick look at individual word frequencies for the first 20 titles, to get an insight into some easy-to-spot issues that need preprocessing.

In [102]:
from collections import Counter # import to count word frequencies


# count frequencies of individual words
uni_freq = Counter()

for title in titles[:20]:
    # normalise the title using the steps described above
    title = normalise_title(title)
    # .split(' ') splits the title into chunks wherever there is a whitespace
    terms = title.split(' ')
    # count each term in the sentence, excluding 'empty' words
    uni_freq.update([term for term in terms if term != ''])

# Show the 25 most common words and their frequencies
for term, freq in uni_freq.most_common(25):
    print(f'{term: <30}{freq: >5}')

the                              25
of                               16
in                               11
a                                 8
and                               7
netherlands                       5
some                              4
refugees                          4
study                             4
emigration                        3
migration                         3
to                                3
german                            3
immigration                       3
solution                          2
population                        2
problems                          2
western                           2
future                            2
aspects                           2
development                       2
problem                           2
postwar                           2
from                              2
on                                2


#### Stopwords and Content Words

Now we notice that the most frequent words are stopwords. We can use a standard stopword list provided by [NLTK](http://www.nltk.org) to remove those from the frequency lists to focus on the content words.

Since there are publications in English, French and Spanish, we use the stopword lists of all three languages.
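
When lists from several languages are combined, some words occur in more than one list (for instance 'de' is both a French and a Spanish stopword). The sketch below uses three tiny hand-written lists as stand-ins for the NLTK lists, which is an assumption made only to keep the example self-contained. It shows how a set removes such duplicates and can be used for filtering.

```python
# Small hand-written samples standing in for the NLTK stopword lists (assumption)
sample_en = ["the", "of", "a"]
sample_fr = ["de", "la", "des"]
sample_es = ["de", "la", "que"]

combined_list = sample_en + sample_fr + sample_es  # concatenation keeps duplicates
combined_set = set(combined_list)                   # distinct words only

print(len(combined_list), len(combined_set))

# filtering a tokenised title against the combined set
terms = ["politique", "de", "la", "migration"]
content_terms = [term for term in terms if term not in combined_set]
print(content_terms)
```

Membership tests on a set also take constant time on average, which helps when many thousands of title words are checked against the stopwords.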

In [103]:
from collections import Counter # import to count word frequencies
import re # import to remove punctuation
from nltk.corpus import stopwords # import to remove stopwords

stopwords_en = stopwords.words('english')
stopwords_fr = stopwords.words('french')
stopwords_sp = stopwords.words('spanish')
stopwords_all = stopwords_en + stopwords_fr + stopwords_sp

print('The first 10 English stopwords:', stopwords_en[:10])
print('The first 10 French stopwords:', stopwords_fr[:10])
print('The first 10 Spanish stopwords:', stopwords_sp[:10])
print('\nTotal number of stopwords across the three lists:', len(stopwords_all))

The first 10 English stopwords: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your']
The first 10 French stopwords: ['au', 'aux', 'avec', 'ce', 'ces', 'dans', 'de', 'des', 'du', 'elle']
The first 10 Spanish stopwords: ['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'del', 'se']

Total number of stopwords across the three lists: 621


In [104]:
# count frequencies of individual words
uni_freq = Counter()

for title in titles[:20]:
    # normalise the title using the steps described above
    title = normalise_title(title)
    # .split(' ') splits the title into chunks wherever there is a whitespace
    terms = title.split(' ')
    # remove stopwords
    terms = [term for term in terms if term not in stopwords_all]
    # count each term in the sentence, excluding 'empty' words
    uni_freq.update([term for term in terms if term != ''])

for term, freq in uni_freq.most_common(25):
    print(f'{term: <30}{freq: >5}')
    



netherlands                       5
refugees                          4
study                             4
emigration                        3
migration                         3
german                            3
immigration                       3
solution                          2
population                        2
problems                          2
western                           2
future                            2
aspects                           2
development                       2
problem                           2
postwar                           2
selective                         2
assimilation                      2
integration                       2
world                             2
war                               2
adjustment                        2
second                            2
immigrants                        2
return                            2


Now we have a list of mostly content words. Note that the stopword list only covers English, French and Spanish, so common stopwords in other languages are not filtered out. We assume these are not very common, so we leave them in for now. 

Now we repeat the frequency counting for all 1183 titles, instead of the first 20.

In [105]:
# count frequencies of individual words
uni_freq = Counter()

for title in titles:
    # normalise the title using the steps described above
    title = normalise_title(title)
    # .split(' ') splits the title into chunks wherever there is a whitespace
    terms = title.split(' ')
    # remove stopwords
    terms = [term for term in terms if term not in stopwords_all]
    # count each term in the sentence, excluding 'empty' words
    uni_freq.update([term for term in terms if term != ''])

for term, freq in uni_freq.most_common(25):
    print(f'{term: <30}{freq: >5}')
    



review                         1897
book                           1891
migration                       850
immigration                     367
immigrants                      319
international                   310
states                          278
united                          273
new                             243
america                         232
american                        210
ethnic                          190
immigrant                       173
labor                           146
social                          139
refugees                        137
policy                          136
migrants                        136
study                           130
workers                         126
emigration                      123
economic                        120
case                            118
australia                       113
canada                          111


#### Word Bigrams

Next, we look at combinations of two words. Individual words can have quite different meanings in different contexts. Two neighbouring words in a title tend to contextualise each other, and so convey more meaning.

We create word bigrams in the following way:

- we split the normalised title into individual words
- we create bigrams for each two adjacent words, so 'evidence on brain drain' results in three bigrams:
    - 'evidence on', 'on brain', 'brain drain'
- we remove all bigrams in which either the first or the second word is a stopword, assuming that stopwords convey little contextual information.
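
The three steps above can be sketched as a small function and applied to the example from the list. The function name `title_bigrams` is introduced here for illustration only.

```python
def title_bigrams(title: str, stopwords: set) -> list:
    """Return adjacent word pairs in which neither word is a stopword."""
    words = title.split()
    # zip pairs every word with its right-hand neighbour:
    # ('evidence', 'on'), ('on', 'brain'), ('brain', 'drain')
    pairs = zip(words, words[1:])
    return [f"{first} {second}" for first, second in pairs
            if first not in stopwords and second not in stopwords]


all_pairs = title_bigrams("evidence on brain drain", stopwords=set())
content_pairs = title_bigrams("evidence on brain drain", stopwords={"on"})
print(all_pairs)
print(content_pairs)
```

With an empty stopword set all three bigrams are returned. Once 'on' is treated as a stopword, only 'brain drain' remains.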

First we look at how this process transforms a title to a list of bigrams, and discuss how that affects frequencies.

In [106]:
for title in titles[:20]:
    # normalise the title using the steps described above
    title = normalise_title(title)
    print('Title:', title)
    # .split(' ') splits the title into chunks wherever there is a whitespace
    terms = title.split(' ')
    # get all pairs of subsequent title words
    bigrams = list(zip(terms[:-1], terms[1:]))
    print('\nAll bigrams:', [' '.join(bigram) for bigram in bigrams])
    # remove all bigrams for which the first or second word is a stopword
    bigram_terms = [' '.join(bigram) for bigram in bigrams if bigram[0] not in stopwords_all and bigram[1] not in stopwords_all]
    print('\nFiltered bigrams:', bigram_terms)
    print('\n')


Title: the importance of emigration for the solution of population problems in western europe

All bigrams: ['the importance', 'importance of', 'of emigration', 'emigration for', 'for the', 'the solution', 'solution of', 'of population', 'population problems', 'problems in', 'in western', 'western europe']

Filtered bigrams: ['population problems', 'western europe']


Title: european emigration overseas past and future

All bigrams: ['european emigration', 'emigration overseas', 'overseas past', 'past and', 'and future']

Filtered bigrams: ['european emigration', 'emigration overseas', 'overseas past']


Title: some aspects of migration problems in the netherlands

All bigrams: ['some aspects', 'aspects of', 'of migration', 'migration problems', 'problems in', 'in the', 'the netherlands']

Filtered bigrams: ['migration problems']


Title: some quantitative aspects of future population development in the netherlands

All bigrams: ['some quantitative', 'quantitative aspects', 'aspects of

Now we process all titles and count how often each word bigram occurs. The 25 most frequent bigrams are displayed for analysis.

In [107]:
# count frequencies of word bigrams
bi_freq = Counter()

for title in titles:
    # normalise the title using the steps described above
    title = normalise_title(title)
    # .split(' ') splits the title into chunks wherever there is a whitespace
    terms = title.split(' ')
    # get all pairs of subsequent title words
    bigrams = list(zip(terms[:-1], terms[1:]))
    # remove all bigrams for which the first or second word is a stopword
    bigram_terms = [' '.join(bigram) for bigram in bigrams if bigram[0] not in stopwords_all and bigram[1] not in stopwords_all]
    # count the occurrence of each bigram
    bi_freq.update(bigram_terms)


for term, freq in bi_freq.most_common(25):
    print(f'{term: <30}{freq: >5}')
    



book review                    1846
united states                   255
international migration         109
international newsletter        100
new york                         82
review ethnic                    42
return migration                 40
immigration policy               39
book reviews                     39
york city                        39
review migration                 33
internal migration               32
labor migration                  31
labor market                     28
puerto rican                     28
migrant workers                  26
case study                       24
latin america                    23
review immigration               23
review immigrants                23
review immigrant                 22
western europe                   21
new zealand                      21
labour migration                 21
brain drain                      21


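Apart from 'book review' and other review-related combinations, the most common two-word combinations are 'united states' and 'international migration', followed by topical combinations such as 'return migration', 'immigration policy', 'labor migration', 'labor market', 'migrant workers', 'case study' and 'brain drain'.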

It is no surprise that bigrams containing the word 'migration' or 'migrant' are frequent, but the bigrams result in meaningful distinctions like 'labour migration' versus 'return migration' and 'migrant workers' versus 'migrant women'.

#### Assessing the Impact of Normalisation

In [108]:
# count frequencies of word bigrams, this time without normalising the titles
bi_freq = Counter()

for title in titles:
    # .split(' ') splits the title into chunks wherever there is a whitespace
    terms = title.split(' ')
    # get all pairs of subsequent title words
    bigrams = list(zip(terms[:-1], terms[1:]))
    # remove all bigrams for which the first or second word is a stopword
    bigram_terms = [' '.join(bigram) for bigram in bigrams if bigram[0] not in stopwords_all and bigram[1] not in stopwords_all]
    # count the occurrence of each bigram
    bi_freq.update(bigram_terms)


for term, freq in bi_freq.most_common(25):
    print(f'{term: <30}{freq: >5}')
    



                               2338
Book Review:                   1812
Review: The                     366
United States                   184
International Newsletter        100
International Migration          86
New York                         67
The Case                         58
A Study                          50
Review: Ethnic                   42
Book Reviews                     39
United States:                   38
The Impact                       34
Book Review                      33
 
                               33

                                33
Return Migration                 32
Review: A                        30
Review: Migration                30
Review: From                     30
Puerto Rican                     28
York City                        27
 :                               27
A Comparative                    26
Immigration Policy               25
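
The blank entries in this list are consistent with how `str.split(' ')` treats consecutive spaces: it returns an empty string for every extra space, and these empty strings then end up in the bigrams. A minimal sketch with a made-up title containing a double space:

```python
# Made-up title with a double space before the bracket
raw_title = "The West German economy  (reprint)"

split_on_space = raw_title.split(" ")  # keeps an empty string between the two spaces
split_on_any = raw_title.split()       # splits on runs of whitespace, no empty strings

print(split_on_space)
print(split_on_any)
```

Collapsing whitespace during normalisation, or calling `split()` without an argument, avoids these empty terms.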


#### Assessing the impact of removing bigrams with stopwords

In [109]:
# count frequencies of word bigrams
bi_freq = Counter()

for title in titles:
    # normalise the title using the steps described above
    title = normalise_title(title)
    # .split(' ') splits the title into chunks wherever there is a whitespace
    terms = title.split(' ')
    # remove stopwords before making bigrams
    terms = [term for term in terms if term not in stopwords_all]
    # get all pairs of subsequent title words
    bigrams = list(zip(terms[:-1], terms[1:]))
    # no further filtering is needed: stopwords were already removed before pairing
    bigram_terms = [' '.join(bigram) for bigram in bigrams]
    # count the occurrence of each bigram
    bi_freq.update(bigram_terms)


for term, freq in bi_freq.most_common(25):
    print(f'{term: <30}{freq: >5}')
    



book review                    1846
united states                   255
international migration         109
international newsletter        100
newsletter migration             98
new york                         82
review ethnic                    54
return migration                 40
immigration policy               40
book reviews                     39
york city                        39
review migration                 36
review immigrant                 33
internal migration               32
labor migration                  31
review immigration               29
labor market                     28
puerto rican                     28
review american                  28
migrant workers                  26
review reviews                   26
review immigrants                25
case study                       24
immigrants united                24
latin america                    23
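
The difference between the two orders of operations can be made explicit on a single title. The title below is a hypothetical example, and both function names are introduced only for this sketch.

```python
def pair_then_filter(words, stopwords):
    """Build bigrams first, then drop every pair that contains a stopword."""
    return [f"{a} {b}" for a, b in zip(words, words[1:])
            if a not in stopwords and b not in stopwords]


def filter_then_pair(words, stopwords):
    """Drop stopwords first, then build bigrams from the remaining words."""
    kept = [word for word in words if word not in stopwords]
    return [f"{a} {b}" for a, b in zip(kept, kept[1:])]


words = "international newsletter on migration".split()  # hypothetical title
stop = {"on"}

print(pair_then_filter(words, stop))
print(filter_then_pair(words, stop))
```

Removing stopwords before pairing creates bigrams of words that were not adjacent in the original title. This would explain why 'newsletter migration' appears in the list above but not in the earlier one. Whether such pairs are meaningful or noise depends on the research question.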




### Analysing Title Words Per Decade

The articles are published over a period of several decades, and there might be shifts in the discourse over time. A next step is to group unigrams and bigrams per decade, to visualise shifts.

The first step is to group the article titles per decade. We derive the decade from the year that the issue was published. 
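
Deriving the decade and grouping titles by it can be sketched in plain Python. The records below are made up, and `year_to_decade` is a hypothetical helper. The point is the integer division and the explicit handling of missing years.

```python
import math
from collections import defaultdict


def year_to_decade(year):
    """Map a publication year to its decade (1968 -> 1960); missing years give None."""
    if year is None or (isinstance(year, float) and math.isnan(year)):
        return None
    return int(year) // 10 * 10


# Made-up (title, year) records, including one with a missing year
records = [
    ("Return migration", 1951),
    ("Labour migration", 1968.0),
    ("Brain drain", 1969),
    ("Untitled", float("nan")),
]

titles_per_decade = defaultdict(list)
for title, year in records:
    decade = year_to_decade(year)
    if decade is not None:  # records without a year cannot be assigned to a decade
        titles_per_decade[decade].append(title)

print(dict(titles_per_decade))
```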

In [110]:
# adding a column per article with the publication decade
df['issue_decade'] = df.issue_pub_year.apply(lambda x: int(x/10) * 10 if not pd.isnull(x) else x)
# recreate the selection of the dataset with the new column;
# filter on 'recurring_title' only if that column is present, otherwise keep the current selection
if 'recurring_title' in df.columns:
    df_selected = df[(df.recurring_title == False) & (df.article_author.str.len() > 0)]
else:
    df_selected = df

df_selected[['issue_pub_year', 'issue_decade']]

The number of articles per decade increases in every decade, from 102 in the 1950s to 1490 in the 1990s.

In [111]:
df_selected.issue_decade.value_counts().sort_index()

1950     102
1960     404
1970    1052
1980    1236
1990    1490
Name: issue_decade, dtype: int64

In [112]:
# make a list of all decades in the dataset
decades = sorted([int(decade) for decade in list(set(df_selected.issue_decade)) if not pd.isnull(decade)])
decades

[1950, 1960, 1970, 1980, 1990]

#### Analysing content words per decade

In [113]:
def make_title_unigram_term_list(title: str, stopwords):
    # normalise the title using the steps described above
    title = normalise_title(title)
    # .split(' ') splits the title into chunks wherever there is a whitespace
    terms = title.split(' ')
    # remove stopwords
    terms = [term for term in terms if term not in stopwords]
    return terms


for decade in decades:
    titles = list(df_selected[df_selected.issue_decade == decade].article_title)
    normalised_titles = [normalise_title(title) for title in titles]
    unigram_terms = [term for title in normalised_titles for term in make_title_unigram_term_list(title, stopwords_all)]
    unigram_freq = Counter(unigram_terms)
    print(decade)
    print('--------------------------')
    for term, freq in unigram_freq.most_common(25):
        print(f'{term: <30}{freq: >5}')
    print('\n\n')
    



1950
--------------------------
                                 55
migration                        20
immigration                      20
population                       13
emigration                       10
aspects                          10
economic                         10
netherlands                       9
western                           8
refugee                           8
europe                            7
problem                           7
problems                          6
refugees                          6
australia                         6
immigrants                        6
new                               5
dutch                             5
post-war                          5
germany                           5
european                          4
overseas                          4
development                       4
german                            4
settlement                        4



1960
--------------------------
review                          1

In [114]:
def make_title_bigram_term_list(title: str, stopwords):
    # first turn the title into a list of normalised words, but don't remove stopwords
    terms = make_title_unigram_term_list(title, [])
    # get all pairs of subsequent words in the title
    bigrams = list(zip(terms[:-1], terms[1:]))
    # remove all bigrams for which the first or second word is a stopword
    bigram_terms = [' '.join(bigram) for bigram in bigrams 
                    if bigram[0] not in stopwords and bigram[1] not in stopwords]
    return bigram_terms


for decade in decades:
    titles = list(df_selected[df_selected.issue_decade == decade].article_title)
    normalised_titles = [normalise_title(title) for title in titles]
    bigram_terms = [term for title in titles for term in make_title_bigram_term_list(title, stopwords_all)]
    bigram_freq = Counter(bigram_terms)
    print(decade)
    print('--------------------------')
    for term, freq in bigram_freq.most_common(25):
        print(f'{term: <30}{freq: >5}')
    print('\n\n')
    



1950
--------------------------
refugee problem                   4
western germany                   4
australia                         4
population problems               3
migration                         3
western australia                 3
quantitative aspects              2
future population                 2
factors influencing               2
influencing postwar               2
postwar emigration                2
economic impacts                  2
brazilian immigration             2
immigration problem               2
middle east                       2
group settlement                  2
1952                              2
reprint                           2
ethnic german                     2
german refugee                    2
austria 1945                      2
post-war migration                2
special reference                 2
new zealand                       2
zealand                           2



1960
--------------------------
book review                     1
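
The bigram step can be tried in isolation on a couple of titles. The following is a minimal, self-contained sketch: the stopword set, the two titles and the helper `title_bigrams` are made up for illustration, and plain whitespace tokenisation stands in for the notebook's own `make_title_unigram_term_list`, which may normalise titles differently.

```python
from collections import Counter

# Hypothetical stopword list and titles, for illustration only
stopwords = {'the', 'of', 'in', 'and', 'a'}
titles = [
    'the refugee problem in western germany',
    'population problems and the refugee problem',
]


def title_bigrams(title, stopwords):
    # simple whitespace tokenisation (an assumption; the notebook uses its own normalisation)
    terms = title.lower().split()
    # pair each word with its successor
    pairs = zip(terms[:-1], terms[1:])
    # keep only pairs in which neither word is a stopword
    return [f'{first} {second}' for first, second in pairs
            if first not in stopwords and second not in stopwords]


bigram_freq = Counter(bigram for title in titles for bigram in title_bigrams(title, stopwords))
for term, freq in bigram_freq.most_common(3):
    print(f'{term: <30}{freq: >5}')
```

Because stopwords are filtered after pairing, a bigram never bridges a removed stopword: `problem in western` yields neither `problem western` nor `problem in`.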

### Analysing Countries Mentioned in Titles

To aggregate mentions into larger geographic units, we map each country to its continent.

In some titles, a migrant group is identified by nationality rather than by country name. We therefore add a second analysis in which nationalities are also mapped to their respective countries and continents, e.g. `Polish` is mapped to `Poland` and `Europe`.

We started with the countries and continents listed on the [World Atlas](https://www.worldatlas.com/cntycont.htm) website, and extended these with several former countries (e.g. Soviet Union) and countries without full formal recognition (e.g. Kosovo, Palestine), as well as some larger regions (Caribbean, Latin America, Middle East). For the UK, we also included England, Northern Ireland, Scotland and Wales, as these are mentioned in some titles.

For the nationalities, we used a list provided by Wikipedia ([List of adjectival and demonymic forms for countries and nations](https://en.wikipedia.org/w/index.php?title=List_of_adjectival_and_demonymic_forms_for_countries_and_nations&oldid=1004136953)).
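
The mapping described above can be sketched with two small dictionaries. This is only an illustration under stated assumptions: the tables are heavily abbreviated, the function `count_mentions` is hypothetical, and the actual `CountryLookup` class in `scripts.countries` may match and count differently (for instance, this sketch counts each country at most once per title).

```python
import re
from collections import Counter

# Hypothetical, heavily abbreviated lookup tables (illustration only)
COUNTRY_TO_CONTINENT = {
    'Poland': 'Europe',
    'Netherlands': 'Europe',
    'Australia': 'Oceania',
}
NATIONALITY_TO_COUNTRY = {
    'Polish': 'Poland',
    'Dutch': 'Netherlands',
    'Australian': 'Australia',
}


def count_mentions(titles, include_nationalities=False):
    # every country name maps to itself; nationalities are optionally added
    names = {country: country for country in COUNTRY_TO_CONTINENT}
    if include_nationalities:
        names.update(NATIONALITY_TO_COUNTRY)
    country_count, continent_count = Counter(), Counter()
    for title in titles:
        # match whole words only, and count each country once per title
        found = {country for name, country in names.items()
                 if re.search(rf'\b{re.escape(name)}\b', title, flags=re.IGNORECASE)}
        for country in found:
            country_count[country] += 1
            continent_count[COUNTRY_TO_CONTINENT[country]] += 1
    return country_count, continent_count


example_titles = ['Polish immigrants in Australia', 'Dutch emigration after the war']
print(count_mentions(example_titles, include_nationalities=False))
print(count_mentions(example_titles, include_nationalities=True))
```

Without nationalities only `Australia` is found in these two titles; with nationalities, `Poland` and `Netherlands` are counted as well, which is the kind of shift visible between the two sets of counts in the notebook.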


In [115]:
from scripts.countries import CountryLookup, show_counts

lookup = CountryLookup()

titles = list(df_selected.article_title)

# First we count without nationalities
country_count, continent_count = lookup.count_countries_continents(titles, include_nationalities=False)


show_counts(country_count, continent_count)

Countries
----------------------------------
United States                  327
Australia                      154
Canada                         113
Israel                          61
Japan                           56
India                           55
Germany                         52
Mexico                          46
Korea                           37
Vietnam                         33
France                          33
Caribbean                       31
Latin America                   28
Cuba                            27
China                           26
Brazil                          26
New Zealand                     21
Netherlands                     20
Middle East                     20
Italy                           19


Continents
----------------------------------
North America                  571
Asia                           471
Europe                         423
Oceania                        178
Africa                         124
South America                  1

In [116]:
# Next we count with nationalities included
country_count, continent_count = lookup.count_countries_continents(titles, include_nationalities=True)

show_counts(country_count, continent_count)

Countries
----------------------------------
United States                  687
Canada                         161
Italy                          156
Australia                      154
Mexico                         151
Germany                        105
China                           81
Israel                          61
Macau                           60
Japan                           56
India                           55
Puerto Rico                     54
Greece                          50
France                          49
Netherlands                     47
Ireland                         39
Poland                          38
United Kingdom                  38
Korea                           37
Vietnam                         33


Continents
----------------------------------
North America                 1027
Europe                         860
Asia                           568
Oceania                        178
Africa                         129
South America                  1

In [117]:
decades = [1960, 1970, 1980, 1990, 2000, 2010]

for decade in decades:
    print(decade)
    print('\n')
    titles = list(df_selected[df_selected.issue_decade == decade].article_title)
    country_count, continent_count = lookup.count_countries_continents(titles, include_nationalities=True)
    show_counts(country_count, continent_count)
    print('\n\n')

1960


Countries
----------------------------------
United States                   42
Australia                       35
Italy                           24
Canada                          14
Puerto Rico                     12
Netherlands                      7
United Kingdom                   7
Israel                           7
Hungary                          6
Germany                          6
Spain                            6
Brazil                           6
Latin America                    6
Mexico                           5
France                           5
Japan                            4
China                            4
Poland                           4
Greece                           4
Norway                           4


Continents
----------------------------------
Europe                         101
North America                   67
Oceania                         37
Asia                            19
South America                   14
Africa                   

In [131]:
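# count the number of articles per author and dataset
g = df.loc[~df.author_surname_initial.isna()].groupby(['author_surname_initial', 'dataset']).dataset.count()
# turn the dataset level into columns; authors absent from a dataset get a count of 0
df_overlap = g.unstack('dataset').fillna(0.0)
# select authors with at least two articles in both journals
# (records with a blank author name form a single group and are included in this count)
overlappers = df_overlap[(df_overlap.REMP_IM > 1) & (df_overlap.IMR_main > 1)]
print('number of authors with at least two articles in both journals:', len(overlappers))

overlappers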

number of authors with at least two articles in both journals: 5


dataset,IMR_main,IMR_review,REMP_IM
author_surname_initial,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
,406.0,0.0,42.0
Charles B. Keely,7.0,1.0,2.0
Charles W. Stahl,2.0,1.0,2.0
Nasra M. Shah,3.0,1.0,3.0
W. R. Böhning,2.0,2.0,3.0
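
The group, unstack and filter pattern used for the overlap table can be seen more easily on a toy example. The sketch below uses made-up author names and the notebook's dataset labels; the column name `author` is an assumption chosen for brevity.

```python
import pandas as pd

# Hypothetical toy records; the author names are made up for illustration
toy = pd.DataFrame({
    'author': ['A. Author', 'A. Author', 'A. Author', 'A. Author',
               'B. Writer', 'B. Writer', 'C. Scholar'],
    'dataset': ['REMP_IM', 'REMP_IM', 'IMR_main', 'IMR_main',
                'REMP_IM', 'IMR_main', 'IMR_main'],
})

# count articles per (author, dataset) pair
counts = toy.groupby(['author', 'dataset']).size()
# pivot the dataset level into columns; missing combinations become 0
table = counts.unstack('dataset', fill_value=0)
# keep authors with at least two articles in both datasets
both = table[(table.REMP_IM >= 2) & (table.IMR_main >= 2)]
print(table)
print(list(both.index))
```

Passing `fill_value=0` to `unstack` keeps the counts as integers, whereas `fillna(0.0)` after unstacking, as in the cell above, produces floats.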


In [120]:
raw_authors = df.loc[df.author_surname_initial.isin(overlappers.index)][['article_author', 'author_surname_initial']].drop_duplicates()
raw_authors

Unnamed: 0,article_author,author_surname_initial
172,,
221,Nasra M. Shah,Nasra M. Shah
274,W. R. Böhning,W. R. Böhning
295,Charles B. Keely,Charles B. Keely
673,Charles W. Stahl,Charles W. Stahl


In [121]:
# na=False treats missing names as non-matches instead of propagating NaN into the mask
raw_authors.loc[raw_authors.author_surname_initial.str.contains('Beijer', na=False)]

Unnamed: 0,article_author,author_surname_initial


In [122]:
df_overlap

dataset,IMR_main,IMR_review,REMP_IM
author_surname_initial,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
,406.0,0.0,42.0
A. A. C. Cavelaars,0.0,0.0,1.0
A. A. Weinberg M.D.,0.0,0.0,1.0
A. Adepoju,0.0,0.0,1.0
A. Begag,0.0,0.0,2.0
...,...,...,...
"Zubrzycki, J",0.0,0.0,3.0
Zvi Halevy,0.0,0.0,1.0
Zvonimir Baletić,1.0,0.0,0.0
Øystein Opdahl,0.0,0.0,1.0


In [123]:
df.columns

Index(['Unnamed: 0', 'article_title', 'article_doi', 'article_author',
       'article_author_index_name', 'article_author_affiliation',
       'article_page_range', 'article_pub_date', 'article_pub_year',
       'issue_section', 'issue_number', 'issue_title', 'issue_page_range',
       'issue_pub_date', 'issue_pub_year', 'volume', 'journal', 'publisher',
       'article_type', 'author_surname_initial', 'issue_pub_decade', 'dataset',
       'issue_decade'],
      dtype='object')

As a reminder, the `dataset` column takes one of three values:

- REMP_IM
- IMR_main
- IMR_review

In [124]:
df_remp = df.loc[df.dataset=='REMP_IM']
len(df_remp)

903

In [125]:
# keep one row per unique combination of author, title and dataset
trunked_df = df[['article_author', 'article_title', 'dataset']].drop_duplicates()

In [126]:
# count the number of titles per dataset and author
author_grouped = trunked_df.groupby([df.dataset, df.article_author]).agg('count')[['dataset', 'article_title']]

In [127]:
author_grouped.sort_values(['article_title'], ascending=False).loc[author_grouped.article_title>5]

Unnamed: 0_level_0,Unnamed: 1_level_0,dataset,article_title
dataset,article_author,Unnamed: 2_level_1,Unnamed: 3_level_1
IMR_main,,232,232
IMR_review,Mary Elizabeth Brown,19,19
IMR_review,Mark J. Miller,13,13
IMR_review,Peter Kivisto,13,13
IMR_review,William S. Egelman,12,12
IMR_review,Peter I. Rose,12,12
IMR_review,Joseph Velikonja,11,11
IMR_review,Joseph S. Roucek,11,11
IMR_review,Robert V. Kemper,10,10
IMR_review,Leonard Dinnerstein,10,10


In [128]:
remp_trunk_df = df_remp[['article_author', 'article_title', 'dataset']].drop_duplicates()

In [129]:
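# NOTE: this groups trunked_df (all datasets) rather than remp_trunk_df,
# so the counts below cover all three datasets, not only REMP_IM
remp_author_grouped = trunked_df.groupby([df.article_author]).agg('count')
remp_author_grouped.loc[remp_author_grouped.article_title > 2].sort_values('article_title', ascending=False)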

Unnamed: 0_level_0,article_author,article_title,dataset
article_author,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
,236,236,236
Mary Elizabeth Brown,20,20,20
Daniel Kubat,16,16,16
Mark J. Miller,15,15,15
Peter Kivisto,14,14,14
...,...,...,...
John Moritsugu,3,3,3
John O. Oucho,3,3,3
John W. Briggs,3,3,3
Jonas Widgren,3,3,3


In [26]:
# Spreadsheets:
# - frequency lists 
# - titles + countries + direction + return yes/no
# 

In [27]:
df.columns

Index(['Unnamed: 0', 'article_title', 'article_doi', 'article_author',
       'article_author_index_name', 'article_author_affiliation',
       'article_page_range', 'article_pub_date', 'article_pub_year',
       'issue_section', 'issue_number', 'issue_title', 'issue_page_range',
       'issue_pub_date', 'issue_pub_year', 'volume', 'journal', 'publisher',
       'article_type', 'dataset', 'issue_decade'],
      dtype='object')

In [28]:
author_grouped.loc[author_grouped>10].sort_values(ascending=False)


article_author
Mary Elizabeth Brown    20
Daniel Kubat            16
Mark J. Miller          15
Peter Kivisto           14
Peter I. Rose           13
William S. Egelman      12
Joseph Velikonja        12
Robert V. Kemper        11
Joseph S. Roucek        11
Name: article_title, dtype: int64

In [29]:
author_grouped.loc[author_grouped.index.str.contains('Beijer', regex=True)]

article_author
Beijer, G.                           9
Beijer, G. && Beld, van den, C.A.    1
Beijer, G. && Oudegeest, J.J.        1
Beijer, G. && Zeegers, G.H.L.        1
G. Beijer                            4
Name: article_title, dtype: int64
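
The listing above shows the same author under several spellings (`Beijer, G.` and `G. Beijer`) and inside co-author strings joined by `&&`. One possible way to bring such variants into a single form is sketched below. The helpers `normalise_author_name` and `split_authors` are hypothetical and not part of the notebook, and names with more than one comma (such as `Beld, van den, C.A.`) are deliberately left unchanged rather than guessed at.

```python
def normalise_author_name(name):
    """Rewrite 'Surname, Initials' as 'Initials Surname'; leave other forms unchanged."""
    parts = [part.strip() for part in name.split(',')]
    if len(parts) == 2:
        surname, initials = parts
        return f'{initials} {surname}'
    # zero or several commas: only strip surrounding whitespace
    return name.strip()


def split_authors(author_field, separator='&&'):
    """Split a co-author string on the separator and normalise each name."""
    return [normalise_author_name(author) for author in author_field.split(separator)]


for field in ['Beijer, G.', 'G. Beijer', 'Beijer, G. && Zeegers, G.H.L.']:
    print(split_authors(field))
```

After this step, `Beijer, G.` and `G. Beijer` share one key, so their article counts could be added together before comparing authors across datasets.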