# Metadata Analysis of Web of Science Data

This notebook analyzes the merged Web of Science (WoS) data file created in the previous notebook (notebook "01").

### 0. Import files if using Google Colab

If using Colab, uncomment the cell below and run it.

In [None]:
#!wget https://git.dartmouth.edu/lib-digital-strategies/RDS/projects/bibliometrics/-/archive/main/bibliometrics-main.zip
#!unzip bibliometrics-main.zip

In [None]:
import sys
sys.path.insert(0, '../code')
import wos_functions

import pandas as pd 
from pathlib import Path
import collections
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
inputdir = Path("../data/resilience/merged")
data = pd.read_csv(Path(inputdir, "merged_wos_files.csv"), encoding = 'utf-8', index_col=[0])
data.head()

In [None]:
data.info()

## Visualize Papers over Time

In [None]:
wosdf = pd.read_csv("../data/resilience/merged/merged-wos_subcols.csv", encoding = 'utf-8', index_col=[0])
print(wosdf.shape)
wosdf.head()

In [None]:
wos_yrs = wosdf.groupby("PY").size()
wos_yrs

In [None]:
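# Pass x and y explicitly so the plot does not depend on how a given seaborn version interprets a bare Series
sns.barplot(x=wos_yrs.index, y=wos_yrs.values)
plt.xticks(rotation=90);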

## Visualize Distribution of Papers by WoS Group

## Analysis of Web of Science Groups

The Web of Science places each article or paper into at least one, and often multiple, [WoS categories](https://jcr.clarivate.com/jcr/browse-category-list) (254 total available categories). If your Web of Science dataset is narrowly focused in one discipline, then an analysis of these categories could be fruitful.

However, if the search criteria you used to create your dataset are broad, you will want to aggregate these 254 possible categories into a smaller number of groups. Fortunately, the WoS assigns each of these 254 categories to one or more of [21 groups](https://jcr.clarivate.com/jcr/browse-categories). Unfortunately, the WoS links many of these 254 categories with multiple groups, so aggregating your dataset from categories to groups is not a straightforward process.

To handle this, I have created the following functions, each of which returns a new dataframe:
1. **wos_add_and_explode_groups()**: returns your original dataframe with a new column identifying the group(s) matching each paper / category
2. **wos_groupby_Groups()**: returns a summary dataframe grouping your original data by WoS Group.
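
To illustrate why a many-to-many link between categories and groups complicates aggregation, here is a minimal sketch on toy data. It is not the implementation inside `wos_functions`. The lookup table, the category and group labels, and the column names `UT` (record identifier) and `WC` (semicolon-separated categories) are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical category-to-group lookup (NOT the real Clarivate mapping).
# Note that one category can point to more than one group.
category_to_groups = {
    "Category 1": ["Group A", "Group B"],
    "Category 2": ["Group B", "Group C"],
    "Category 3": ["Group C"],
}

# Toy records; "UT" and "WC" are assumed column names.
papers = pd.DataFrame({
    "UT": ["paper-1", "paper-2"],
    "WC": ["Category 1; Category 2", "Category 3"],
})

# Step 1: one row per (paper, category).
exploded = (
    papers.assign(Category=papers["WC"].str.split("; "))
    .explode("Category")
    .reset_index(drop=True)
)

# Step 2: look up the list of groups for each category, then one row per (paper, group).
exploded["Group"] = exploded["Category"].map(category_to_groups)
exploded = exploded.explode("Group").reset_index(drop=True)

# Step 3: a paper can reach the same group through two categories,
# so keep each (paper, group) pair only once before counting.
exploded = exploded.drop_duplicates(subset=["UT", "Group"])

group_counts = exploded.groupby("Group")["UT"].nunique()
print(group_counts)
```

In this toy example, `paper-1` reaches "Group B" through both of its categories, which is why the de-duplication step matters before counting papers per group.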



In [None]:
data_explode = wos_functions.wos_add_and_explode_groups(data)

In [None]:
groupsdf = wos_functions.wos_groupby_Groups(data_explode)
groupsdf.head()

In [None]:
sns.barplot(groupsdf, y="Group", x="numitems_insample_per100kinWOS");

## Group by WoS Group *and* Year

In [None]:
data_yr_group = data_explode.groupby(["Group", "PY"])["PT"].count().reset_index(name="yr_ct")

In [None]:
# code for creating the faceted grid of area graphs below is 
## adapted from: https://python-graph-gallery.com/242-area-chart-and-faceting/

# Create a grid : initialize it
g = sns.FacetGrid(data_yr_group, col="Group", hue="Group", col_wrap = 3)

# Add the line over the area with the plot function
g = g.map(plt.plot, 'PY', 'yr_ct')

# Fill the area with fill_between
g = g.map(plt.fill_between, 'PY', 'yr_ct', alpha=0.2).set_titles("{col_name}")

# Add a title for the whole plot
plt.subplots_adjust(top=0.92)
# (do not reassign g here, so that it still refers to the FacetGrid afterwards)
g.figure.suptitle('Frequency of use of keyword "resilience" in Web of Science database')

What can you learn from the above graphs? What changes would you want to make to more clearly discern patterns in scholarship?
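
One possible change, offered only as a sketch: because the facets share a y-axis, groups with few papers can look flat next to the largest groups. Converting each yearly count to a share of its group's total puts every group on a comparable scale. The toy dataframe below is a hypothetical stand-in for `data_yr_group`, and the `yr_share` column name is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for data_yr_group (made-up counts).
data_yr_group = pd.DataFrame({
    "Group": ["A", "A", "B", "B"],
    "PY": [2020, 2021, 2020, 2021],
    "yr_ct": [10, 30, 1, 3],
})

# Total count per group, broadcast back onto every row of that group.
group_totals = data_yr_group.groupby("Group")["yr_ct"].transform("sum")

# Share of each group's papers that fall in each year.
data_yr_group["yr_share"] = data_yr_group["yr_ct"] / group_totals
print(data_yr_group)
```

Plotting `yr_share` instead of `yr_ct` in the grid would then compare the shape of each group's trend rather than its size. An alternative is to pass `sharey=False` to `sns.FacetGrid` so that each facet gets its own y-axis.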

In [None]:
#len(kwslist3)

In [None]:
"""
from itertools import combinations
from collections import Counter
d  = Counter()
for sub in kwslist3:
    print(len(sub))
    if len(sub) < 2:
        continue
    sub.sort()
    for sz in range(2, len(sub)+1):
        for comb in combinations(sub, sz):
            d[comb] += 1

print(d.most_common())
"""