# Do the Network reports have everything from the remaining reports?

In [1]:
import glob, os

In [2]:
import numpy as np
import pandas as pd

## Loading all `journals.csv` files at once

Each package has been unzipped into a directory named like `tabs_spa`,
where `spa` is a collection code
(i.e., there's just a leading `tabs_` prefix).
Let's load all `journals.csv` files!

In [3]:
journals = {os.path.split(fname)[0][5:]:
              pd.read_csv(fname, dtype=str, keep_default_na=False)
            for fname in glob.glob("tabs_*/journals.csv")}
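
To make the dictionary keys explicit, here is a minimal sketch of the same key derivation applied to two example paths. The helper name `collection_code` is hypothetical (it is not part of the notebook), and the paths are just illustrations of the `tabs_*` naming scheme:

```python
import os

def collection_code(fname):
    """Return the collection code for a path like 'tabs_spa/journals.csv'."""
    dirname = os.path.split(fname)[0]  # directory part, e.g. 'tabs_spa'
    return dirname[len("tabs_"):]      # drop the 5-character 'tabs_' prefix

# Hypothetical example paths, following the naming scheme described above
codes = [collection_code(f)
         for f in ["tabs_spa/journals.csv", "tabs_network/journals.csv"]]
```

Using `len("tabs_")` instead of the literal `5` is only a readability choice; both slice off the same prefix.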

The contents of `tabs_network/journals.csv` are now in:

In [4]:
network_journals = journals["network"]

## Are the rows from all `journals.csv` in `tabs_network/journals.csv`?

Yes, and every row from `tabs_network/journals.csv`
is in another `journals.csv` file.
To prove that,
let's join the rows from every `journals.csv` source
except the one from `network`:

In [5]:
all_journals = pd.concat([df for k, df in journals.items() if k != "network"])

This joined dataframe has the same shape as the network journals dataframe,
and neither dataframe has duplicated rows:

In [6]:
{
    "all_journals": all_journals.shape,
    "all_journals (unique)": all_journals.drop_duplicates().shape,
    "network_journals": network_journals.shape,
    "network_journals (unique)": network_journals.drop_duplicates().shape,
}

{'all_journals': (1721, 98),
 'all_journals (unique)': (1721, 98),
 'network_journals': (1721, 98),
 'network_journals (unique)': (1721, 98)}

The column names are all the same:

In [7]:
np.all(network_journals.columns.sort_values() == all_journals.columns.sort_values())

True
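
The same check can also be phrased as a set comparison, which ignores column order and does not need the two column indexes to be aligned elementwise. This is only a sketch on toy dataframes (the column names here are made up), not the check used above:

```python
import pandas as pd

# Toy frames with the same columns in a different order
df_a = pd.DataFrame({"collection": ["spa"], "title": ["A"]})
df_b = pd.DataFrame({"title": ["B"], "collection": ["scl"]})
# A frame with an extra column, for contrast
df_c = pd.DataFrame({"title": ["C"], "collection": ["scl"], "extra": ["x"]})

same_ab = set(df_a.columns) == set(df_b.columns)
same_ac = set(df_a.columns) == set(df_c.columns)
```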

Every row is in the intersection:

In [8]:
pd.merge(network_journals, all_journals).shape

(1721, 98)
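
As a reminder of why this works: when `pd.merge` gets no `on` argument, it joins on every column the two dataframes share, and the default `how="inner"` keeps only the rows present in both. A minimal sketch on toy data (the `issn`/`title` values are made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({"issn": ["a", "b", "c"], "title": ["A", "B", "C"]})
right = pd.DataFrame({"issn": ["c", "b", "d"], "title": ["C", "B", "D"]})

# Inner join on all shared columns: only rows found in both frames survive
common = pd.merge(left, right)
common_shape = common.shape
common_issns = sorted(common["issn"])
```

Here only the `b` and `c` rows appear in both frames, so the merged result has two rows. In the notebook, the merged shape equals the shape of both inputs, so no row is missing from either side.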

And the symmetric difference is empty:

In [9]:
pd.concat([network_journals, all_journals]).drop_duplicates(keep=False)

Unnamed: 0,extraction date,study unit,collection,ISSN SciELO,ISSN's,title at SciELO,title thematic areas,title is agricultural sciences,title is applied social sciences,title is biological sciences,...,google scholar h5 2016,google scholar h5 2015,google scholar h5 2014,google scholar h5 2013,google scholar m5 2018,google scholar m5 2017,google scholar m5 2016,google scholar m5 2015,google scholar m5 2014,google scholar m5 2013
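
The empty result above relies on a small trick: after concatenating two dataframes, `drop_duplicates(keep=False)` drops every row that appears more than once, so only the rows exclusive to one side remain. That equals the symmetric difference only when neither dataframe has internal duplicates, which the shape comparison earlier established. A minimal sketch on toy data (values are made up):

```python
import pandas as pd

left = pd.DataFrame({"issn": ["a", "b", "c"]})
right = pd.DataFrame({"issn": ["b", "c", "d"]})

# Rows present in both frames occur twice after concat and are all dropped
sym_diff = pd.concat([left, right]).drop_duplicates(keep=False)
sym_diff_values = sorted(sym_diff["issn"])
```

In this toy case the rows `a` and `d` remain; in the notebook nothing remains, so the two dataframes hold the same rows.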


Therefore, we can say the `journals.csv` in `tabs_network`
has exactly the same rows as the remaining `journals.csv` files
joined together.