# __Step 7.5: Topic per country__

Goal:
- Plot overall pub volume in plant science per country
- Plot relative volume
- Plot relative pub volume over time
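
The three quantities behind these goals can be sketched on toy data before any plotting. This is only an illustration under stated assumptions: the `Country` column and the toy `records` table are hypothetical (the corpus table read below carries `PMID` and `Date`, but a per-record country assignment is assumed here), and the actual notebook may derive these numbers differently.

```python
import pandas as pd

# Toy records; the "Country" column is a hypothetical per-record assignment
records = pd.DataFrame({
    "PMID": [1, 2, 3, 4, 5, 6],
    "Date": ["1975-12-11", "1975-11-20", "1976-01-05",
             "1976-03-02", "1976-07-19", "1976-09-30"],
    "Country": ["USA", "JPN", "USA", "USA", "JPN", "DEU"],
})
records["Year"] = pd.to_datetime(records["Date"]).dt.year

# Overall publication volume per country
counts = records["Country"].value_counts()

# Relative volume: each country's share of all records
rel = counts / counts.sum()

# Relative volume over time: each country's share within each year
by_year = records.groupby(["Year", "Country"]).size().unstack(fill_value=0)
rel_by_year = by_year.div(by_year.sum(axis=1), axis=0)

print(counts)
print(rel_by_year)
```

Each row of `rel_by_year` sums to 1, so a stacked area or line plot of it shows how country shares shift across years rather than how total output grows.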

Additional analyses:
- Plant sci pub #
- Per country
- Normalized against a subset of [UNESCO](https://apiportal.uis.unesco.org/bdds) data
  - Education
    - `SDG` Global and Thematic Indicators
    - `OPRI` Other Policy Relevant Indicators
  - Science (SCN)
    - `SCN-SDG` Research and Development (R&D) SDG 9.5
    - `SCN-OPRI` Research and Development (R&D) – Other Policy Relevant Indicators
  - Culture
    - `SDG11` SDG 11.4
  - External
    - `DEM` Demographic and Socio-economic Indicators
  - Focus on:
    - 


Resources:
- [Best Libraries for Geospatial Data Visualisation in Python](https://towardsdatascience.com/best-libraries-for-geospatial-data-visualisation-in-python-d23834173b35)
- [Plotly documentation](https://plotly.com/python/)
- [Interactive map in python](https://www.earthdatascience.org/courses/scientists-guide-to-plotting-data-in-python/plot-spatial-data/customize-raster-plots/interactive-maps/)
- [Analysing and Visualising the Country wise Population from 1955 to 2020 with Pandas, Matplotlib, Seaborn and Plotly](https://towardsdatascience.com/analysing-and-visualising-the-country-wise-population-from-1955-to-2020-with-pandas-matplotlib-70b3614eed6b)
- [How to plot and color world map based on data](https://medium.com/@petrica.leuca/how-to-plot-and-color-world-map-based-on-data-c367e4cda7d9)

## ___Setup___

### Module import

In conda env `base`

In [5]:
import pickle
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from pathlib import Path
from tqdm import tqdm

### Key variables

In [6]:
# Reproducibility
seed = 20220609

# Setting working directory
proj_dir   = Path.home() / "projects/plant_sci_hist"
work_dir   = proj_dir / "7_countries/7_3_topic_per_country"
work_dir.mkdir(parents=True, exist_ok=True)

# plant science corpus with topic assignment info
dir42      = proj_dir / "4_topic_model/4_2_outlier_assign"
corpus_file = dir42 / "table4_2_corpus_with_topic_assignment.tsv.gz"
#corpus_file = dir42 / "test.tsv"

# timestamp bins
dir44            = proj_dir / "4_topic_model/4_4_over_time"
ts_for_bins_file = dir44 / "table4_4_bin_timestamp_date.tsv"
file_topic_name   = dir44 / "fig4_4_tot_heatmap_weighted_xscaled_names.txt"

# country data
dir74            = proj_dir / "7_countries/7_4_consolidate_all"
ci_file          = dir74 / 'country_info_final_a3.txt'

# Use TrueType (Type 42) fonts so text in saved PDFs is embedded and stays editable
mpl.rcParams['pdf.fonttype'] = 42
plt.rcParams["font.family"] = "sans-serif"

## ___Topic assignment___

### Read topic assignment

Lifted from script_5_3

In [7]:
# topic data-frame
tdf = pd.read_csv(corpus_file, sep='\t', compression='gzip', index_col=[0])

# topic assignments
toc_array = tdf['Topic'].values

print("topic dataframe:", tdf.shape)
print("topic assignments:", toc_array.shape)

topic dataframe: (421658, 12)
topic assignments: (421658,)


In [8]:
tdf.head(2)

Unnamed: 0,Index_1385417,PMID,Date,Journal,Title,Abstract,Initial filter qualifier,Corpus,reg_article,Text classification score,Preprocessed corpus,Topic
0,3,61,1975-12-11,Biochimica et biophysica acta,Identification of the 120 mus phase in the dec...,After a 500 mus laser flash a 120 mus phase in...,spinach,Identification of the 120 mus phase in the dec...,1,0.716394,identification 120 mus phase decay delayed flu...,52
1,4,67,1975-11-20,Biochimica et biophysica acta,Cholinesterases from plant tissues. VI. Prelim...,Enzymes capable of hydrolyzing esters of thioc...,plant,Cholinesterases from plant tissues. VI. Prelim...,1,0.894874,cholinesterases plant tissues . vi . prelimina...,48


In [9]:
# Create pmid-topic dataframe
pmid_topic = tdf[['PMID', 'Topic']]
pmid_topic.head(2)

Unnamed: 0,PMID,Topic
0,61,52
1,67,48


### Get topic indices and names

In [10]:
# topic indices
tocs = np.unique(toc_array)

# exclude topic=-1 (np.unique returns sorted values, so -1 is the first entry)
tocs_90 = tocs[1:]

# number of topic=-1
n_rec_toc_unassigned = int((toc_array == -1).sum())

# total number of docs, including those with topic=-1. Originally considered
# subtracting the unassigned docs, but the total for taxa would be the number
# of all docs, so it does not make sense to remove the unassigned ones.
n_rec_total  = len(toc_array)

print("number of topic=-1:", n_rec_toc_unassigned)
print("total number of docs (including topic=-1):", n_rec_total)

number of topic=-1: 49228
total number of docs (including topic=-1): 421658
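
The choice of denominator discussed in the comment above can be illustrated on a toy array. This is a minimal sketch, not part of the original analysis: `toy_toc_array` and the other `toy_*` names are hypothetical stand-ins for `toc_array`, and the boolean mask is one alternative to slicing off the first entry of the sorted unique values.

```python
import numpy as np

# Toy stand-in for toc_array; -1 marks documents without a topic
toy_toc_array = np.array([-1, 0, 0, 1, 2, 2, 2, -1])

# Select assigned documents with a mask instead of relying on sort order
assigned_mask = toy_toc_array != -1
toy_tocs, toy_counts = np.unique(toy_toc_array[assigned_mask],
                                 return_counts=True)

n_unassigned = int((~assigned_mask).sum())
n_total = toy_toc_array.size

# Fractions use the full total, so they sum to 1 only together with
# the unassigned share
frac_per_topic = toy_counts / n_total

print(dict(zip(toy_tocs.tolist(), frac_per_topic.tolist())))
print("unassigned share:", n_unassigned / n_total)
```

Keeping unassigned documents in the denominator means per-topic fractions are comparable to any other per-document tally that is computed over the whole corpus.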


### Read topic names

In [11]:

toc_names = pd.read_csv(file_topic_name, sep='\t')
toc_names.head(2)

Unnamed: 0,Topic,Mod_name
0,22,enzyme | fatty acids | lipid | synthesis
1,18,protein | dna | rna | synthesis | mrna


## ___Basic country analysis___




### Country data

In [None]:
ci = pd.read_csv(ci_file, sep='\t', index_col=0)
ci.shape

In [None]:
ci.head()

## ___Country representation among topics___

|                 | In topic T | Not in topic T |
|    ---          | ---        | ---            |
|In country X     | a          | b              |
|Not in country X | c          | d              |
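
One way such a table could be filled and tested is sketched below. The source does not state which test is applied to the table, so the one-sided Fisher's exact test here is an assumption, and `contingency`, `fisher_greater`, and the toy country and topic lists are hypothetical names introduced only for illustration.

```python
from math import comb

def contingency(countries, topics, country, topic):
    """Count a, b, c, d of the 2x2 table for one country and one topic."""
    a = b = c = d = 0
    for ctry, toc in zip(countries, topics):
        in_country = ctry == country
        in_topic = toc == topic
        if in_country and in_topic:
            a += 1
        elif in_country:
            b += 1
        elif in_topic:
            c += 1
        else:
            d += 1
    return a, b, c, d

def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact p-value, P(X >= a), under the
    hypergeometric null of no country-topic association."""
    n_country = a + b          # records from country X
    n_topic = a + c            # records in topic T
    n = a + b + c + d          # all records
    upper = min(n_country, n_topic)
    tail = sum(comb(n_country, k) * comb(n - n_country, n_topic - k)
               for k in range(a, upper + 1))
    return tail / comb(n, n_topic)

# Toy data: one country code and one topic index per record
toy_countries = ["USA", "USA", "USA", "JPN", "JPN", "DEU", "DEU", "DEU"]
toy_topics = [52, 52, 48, 52, 48, 48, 48, 22]

a, b, c, d = contingency(toy_countries, toy_topics, "USA", 52)
p_value = fisher_greater(a, b, c, d)
print((a, b, c, d), p_value)
```

The exact sum is fine for small tables; for a corpus of several hundred thousand records, `scipy.stats.fisher_exact` with `alternative="greater"` would likely be a more practical choice, and testing many country-topic pairs would call for a multiple-testing correction.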
