# __Step 7.5: Topic per country__

Goal:
- Plot overall pub volume in plant science per country
- Plot relative volume
- Plot relative pub volume over time

Additional analyses:
- Number of plant science publications
- Per country
- Normalized against a subset of [UNESCO](https://apiportal.uis.unesco.org/bdds) data
  - Education
    - `SDG` Global and Thematic Indicators
    - `OPRI` Other Policy Relevant Indicators
  - Science (SCN)
    - `SCN-SDG` Research and Development (R&D) SDG 9.5
    - `SCN-OPRI` Research and Development (R&D) – Other Policy Relevant Indicators
  - Culture
    - `SDG11` SDG 11.4
  - External
    - `DEM` Demographic and Socio-economic Indicators
  - Focus on:
    - Total population
    - GDP total
    - GDP per capita
    - Researcher number
    - Gross domestic expenditure on R&D (GERD) as percent GDP 

Resources:
- [Best Libraries for Geospatial Data Visualisation in Python](https://towardsdatascience.com/best-libraries-for-geospatial-data-visualisation-in-python-d23834173b35)
- [Plotly documentation](https://plotly.com/python/)
- [Interactive map in python](https://www.earthdatascience.org/courses/scientists-guide-to-plotting-data-in-python/plot-spatial-data/customize-raster-plots/interactive-maps/)
- [Analysing and Visualising the Country wise Population from 1955 to 2020 with Pandas, Matplotlib, Seaborn and Plotly](https://towardsdatascience.com/analysing-and-visualising-the-country-wise-population-from-1955-to-2020-with-pandas-matplotlib-70b3614eed6b)
- [How to plot and color world map based on data](https://medium.com/@petrica.leuca/how-to-plot-and-color-world-map-based-on-data-c367e4cda7d9)

## ___Setup___

### Module import

In conda env `base`

In [7]:
import pickle
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from pathlib import Path
from tqdm import tqdm
from datetime import datetime

### Key variables

In [2]:
# Reproducibility
seed = 20220609

# Setting working directory
proj_dir   = Path.home() / "projects/plant_sci_hist"
work_dir   = proj_dir / "7_countries/7_5_country_over_time"
work_dir.mkdir(parents=True, exist_ok=True)

# plant science corpus with topic assignment info
dir42      = proj_dir / "4_topic_model/4_2_outlier_assign"
corpus_file = dir42 / "table4_2_corpus_with_topic_assignment.tsv.gz"
#corpus_file = dir42 / "test.tsv"

# timestamp bins
dir44            = proj_dir / "4_topic_model/4_4_over_time"
ts_for_bins_file = dir44 / "table4_4_bin_timestamp_date.tsv"
file_topic_name   = dir44 / "fig4_4_tot_heatmap_weighted_xscaled_names.txt"

# country data
dir74            = proj_dir / "7_countries/7_4_consolidate_all"
ci_file          = dir74 / 'country_info_final_a3.txt'

# unesco data dir
unesco_dir       = work_dir / 'unesco'

# Embed fonts as TrueType (Type 42) so text in saved PDFs stays editable
mpl.rcParams['pdf.fonttype'] = 42
plt.rcParams["font.family"] = "sans-serif"

## ___Topic assignment___

### Read topic assignment

Lifted from script_5_3

In [4]:
# topic data-frame
tdf = pd.read_csv(corpus_file, sep='\t', compression='gzip', index_col=[0])

print("topic dataframe:", tdf.shape)

topic dataframe: (421658, 12)
topic assignments: (421658,)


In [5]:
tdf.head(2)

Unnamed: 0,Index_1385417,PMID,Date,Journal,Title,Abstract,Initial filter qualifier,Corpus,reg_article,Text classification score,Preprocessed corpus,Topic
0,3,61,1975-12-11,Biochimica et biophysica acta,Identification of the 120 mus phase in the dec...,After a 500 mus laser flash a 120 mus phase in...,spinach,Identification of the 120 mus phase in the dec...,1,0.716394,identification 120 mus phase decay delayed flu...,52
1,4,67,1975-11-20,Biochimica et biophysica acta,Cholinesterases from plant tissues. VI. Prelim...,Enzymes capable of hydrolyzing esters of thioc...,plant,Cholinesterases from plant tissues. VI. Prelim...,1,0.894874,cholinesterases plant tissues . vi . prelimina...,48


In [27]:
# Found that the original dataframe has some duplicated PMIDs
tdf.duplicated(subset=['PMID']).sum()

351

In [31]:
duplicated = tdf[tdf.duplicated(subset=['PMID'])]['PMID'].values
for pmid in duplicated:
    print(tdf[tdf['PMID'] == pmid][['PMID', 'Topic']])

            PMID  Topic
237916  24358894     -1
243715  24358894     35
            PMID  Topic
243717  24555064     51
243962  24555064     51
            PMID  Topic
254554  25165534     83
254742  25165534     83
            PMID  Topic
263498  25653837     23
263687  25653837     23
            PMID  Topic
308523  28232862     74
313258  28232862     74
            PMID  Topic
275391  26309725      6
329893  26309725      6
            PMID  Topic
329667  29188023     16
336639  29188023     16
            PMID  Topic
336707  29560253     55
337976  29560253     55
            PMID  Topic
325000  28928963     60
339161  28928963     60
            PMID  Topic
339625  29721312     72
339875  29721312     72
            PMID  Topic
371175  31031972     27
371798  31031972     27
            PMID  Topic
336119  29528046     59
381910  29528046     59
            PMID  Topic
383683  31633057     -1
387955  31633057     -1
            PMID  Topic
399763  32399197     60
399981  32399197

In [33]:
# Get rid of duplicated entries (keeps the first occurrence of each PMID)
tdf_nodup = tdf.drop_duplicates(subset=['PMID'])
tdf_nodup.to_csv(dir42 / 'table7_5_corpus_with_topic_assignment_nodup.tsv.gz', 
                 sep='\t', compression='gzip')
tdf_nodup.shape

(421307, 12)
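
The duplicate listing above includes at least one PMID whose two rows carry different topics (`-1` and `35` for 24358894). A minimal sketch, on toy data rather than the real `tdf`, of how such conflicting duplicates could be flagged before dropping them. The toy frame and the names `toy`, `conflicting`, and `kept_topic` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for `tdf`; the first two PMIDs mirror duplicate pairs printed above
toy = pd.DataFrame({
    "PMID":  [24358894, 24358894, 24555064, 24555064, 61],
    "Topic": [-1, 35, 51, 51, 52],
})

# Number of distinct topics per PMID; a value > 1 means the duplicates disagree
n_topics = toy.groupby("PMID")["Topic"].nunique()
conflicting = n_topics[n_topics > 1].index.tolist()

# drop_duplicates keeps the first occurrence by default (keep='first'),
# so for a conflicting PMID the topic of the first row is the one retained
nodup = toy.drop_duplicates(subset=["PMID"])
kept_topic = nodup.set_index("PMID").loc[24358894, "Topic"]

print("conflicting PMIDs:", conflicting)
print("topic kept for 24358894:", kept_topic)
```

In this toy case the retained topic is the unassigned one (`-1`), which may or may not be the desired outcome for the real data.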

### Get pmid, date, and topic

Code lifted from 4.4

In [34]:
# Create pmid-topic dataframe
pmid_topic = tdf_nodup[['PMID', 'Date', 'Topic']]
pmid_topic.head(2)

Unnamed: 0,PMID,Date,Topic
0,61,1975-12-11,52
1,67,1975-11-20,48


### Get topic indices and names

In [36]:
# topic assignments
toc_array = tdf_nodup['Topic'].values

# topic indices
tocs = np.unique(toc_array)

# exclude topic=-1 (unassigned) explicitly rather than relying on sort order
tocs_90 = tocs[tocs != -1]

# number of topic=-1
n_rec_toc_unassigned = int((toc_array == -1).sum())

# total number of docs, including those with topic=-1. Unassigned docs are
# kept because the total used for taxa is the number of all docs, so removing
# them here would make the totals inconsistent.
n_rec_total  = len(toc_array)

print("number of topic=-1:", n_rec_toc_unassigned)
print("number of docs with topic assignment:", n_rec_total)

number of topic=-1: 49192
number of docs with topic assignment: 421307


### Read topic names

In [37]:

toc_names = pd.read_csv(file_topic_name, sep='\t')
toc_names.head(2)

Unnamed: 0,Topic,Mod_name
0,22,enzyme | fatty acids | lipid | synthesis
1,18,protein | dna | rna | synthesis | mrna


## ___Process country data___




### Country data

In [38]:
ci = pd.read_csv(ci_file, sep='\t')
ci.shape

(330328, 3)

In [39]:
ci.head()

Unnamed: 0,PMID,A3,Confidence
0,400957,CAN,3
1,1279107,FRA,3
2,1279650,JPN,3
3,1280064,BRA,3
4,1280162,JPN,3


In [40]:
ci.duplicated(subset=['PMID']).sum()

0

### Join country data with pmid_topic

https://sparkbyexamples.com/pandas/pandas-join-dataframes-on-columns/

In [41]:
ci_pmid_topic = pd.merge(ci, pmid_topic, on='PMID')
ci_pmid_topic.shape

(330328, 5)
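
The merged shape matching `ci` suggests every country record found a topic row, but `pd.merge` defaults to an inner join and would silently drop unmatched PMIDs. A hedged sketch on toy data of how the join could be checked using the `validate` and `indicator` parameters of `pd.merge`. The frames `ci_toy` and `pt_toy` are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for `ci` and `pmid_topic`
ci_toy = pd.DataFrame({"PMID": [1, 2, 3], "A3": ["CAN", "FRA", "JPN"]})
pt_toy = pd.DataFrame({"PMID": [1, 2, 4], "Topic": [50, 12, 7]})

# how='left' keeps every country record; validate raises MergeError if PMID
# is not unique on both sides; indicator adds a `_merge` column
merged = pd.merge(ci_toy, pt_toy, on="PMID", how="left",
                  validate="one_to_one", indicator=True)

# Country records with no topic row show up as 'left_only'
n_unmatched = int((merged["_merge"] == "left_only").sum())
print("country records without a topic:", n_unmatched)
```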

In [42]:
ci_pmid_topic.head(2)

Unnamed: 0,PMID,A3,Confidence,Date,Topic
0,400957,CAN,3,1978-01-01,50
1,1279107,FRA,3,1992-11-01,12


### Add year column

In [46]:
# Year is the first field of the YYYY-MM-DD date string
ci_pmid_topic['Year'] = ci_pmid_topic['Date'].str.split('-').str[0].astype(int)
ci_pmid_topic.to_csv(work_dir / 'ci_pmid_topic.tsv', sep='\t', index=False)
ci_pmid_topic.head(2)

Unnamed: 0,PMID,A3,Confidence,Date,Topic,Year
0,400957,CAN,3,1978-01-01,50,1978
1,1279107,FRA,3,1992-11-01,12,1992
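
With a `Year` column available, the publication volumes named in the goals could be tabulated roughly as follows. This is a sketch on toy data, and the choice of "share of all records in a year" as the relative measure is an assumption rather than the notebook's stated definition.

```python
import pandas as pd

# Toy stand-in for `ci_pmid_topic` with only the columns needed here
toy = pd.DataFrame({
    "A3":   ["CAN", "FRA", "JPN", "JPN", "CAN"],
    "Year": [1978, 1992, 1992, 1992, 1992],
})

# Rows: country (A3), columns: year, values: number of records
counts = toy.groupby(["A3", "Year"]).size().unstack(fill_value=0)

# Relative volume: each country's share of all records in that year
share = counts / counts.sum(axis=0)

print(counts)
print(share)
```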


## ___Normalized country count___

### Total population

In [47]:
cpop_file = unesco_dir / 'DEMO_DS_13062023083742557-total_pop.csv'
cpop = pd.read_csv(cpop_file, sep=',')
cpop.head(2)

Unnamed: 0,DEMO_IND,Indicator,LOCATION,Country,TIME,Time,Value,Flag Codes,Flags
0,SP_DYN_TFRT_IN,"Fertility rate, total (births per woman)",AUS,Australia,2016,2016,1.8,,
1,SP_DYN_TFRT_IN,"Fertility rate, total (births per woman)",AUS,Australia,2017,2017,1.7,,
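
The first rows shown are fertility-rate records, so the file appears to hold more than one indicator and would need filtering before use. A minimal sketch, under stated assumptions, of normalizing publication counts by population: `POP_IND` is a placeholder and not a real UNESCO indicator code, all values are made up, and `counts_toy` stands in for a per-country, per-year count table.

```python
import pandas as pd

# Placeholder only; the real total-population code must be looked up in the file
POP_IND = "HYPOTHETICAL_TOTAL_POP"

# Toy stand-in for `cpop`, using column names from the header shown above
cpop_toy = pd.DataFrame({
    "DEMO_IND": ["SP_DYN_TFRT_IN", POP_IND, POP_IND],
    "LOCATION": ["AUS", "AUS", "CAN"],
    "TIME":     [2016, 2016, 2016],
    "Value":    [1.8, 24000.0, 36000.0],  # made-up numbers
})

# Toy per-country, per-year publication counts (made-up numbers)
counts_toy = pd.DataFrame({"A3": ["AUS", "CAN"],
                           "Year": [2016, 2016],
                           "n_pub": [120, 90]})

# Keep only the population indicator and align column names for the join
pop = (cpop_toy[cpop_toy["DEMO_IND"] == POP_IND]
       .rename(columns={"LOCATION": "A3", "TIME": "Year", "Value": "pop"})
       [["A3", "Year", "pop"]])

# Left join so countries without population data remain visible as NaN
norm = counts_toy.merge(pop, on=["A3", "Year"], how="left")
norm["pub_per_pop"] = norm["n_pub"] / norm["pop"]
print(norm)
```

The same pattern would apply to the other focus indicators (GDP, researcher number, GERD) once their codes are known.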


## ___Country representation among topics___

|                 | In topic T | Not in topic T |
|    ---          | ---        | ---            |
|In country X     | a          | b              |
|Not in country X | c          | d              |
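
The four cells of the table above can be counted directly from a frame with `A3` and `Topic` columns. The sketch below uses toy data and a hypothetical helper, `contingency`. The odds ratio is shown as one possible summary of the table and is an assumption, not necessarily the statistic used in the original analysis.

```python
import pandas as pd

# Toy stand-in for `ci_pmid_topic`
toy = pd.DataFrame({
    "A3":    ["JPN", "JPN", "JPN", "CAN", "CAN", "FRA", "FRA", "FRA"],
    "Topic": [52, 52, 48, 52, 48, 48, 48, 52],
})

def contingency(df, country, topic):
    """Return (a, b, c, d) for the country-by-topic 2x2 table."""
    in_c = df["A3"] == country
    in_t = df["Topic"] == topic
    a = int((in_c & in_t).sum())    # in country X, in topic T
    b = int((in_c & ~in_t).sum())   # in country X, not in topic T
    c = int((~in_c & in_t).sum())   # not in country X, in topic T
    d = int((~in_c & ~in_t).sum())  # not in country X, not in topic T
    return a, b, c, d

a, b, c, d = contingency(toy, "JPN", 52)

# Odds ratio > 1 suggests over-representation; undefined when b * c == 0
odds_ratio = (a * d) / (b * c)
print((a, b, c, d), odds_ratio)
```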
