# Data Separator Notebook

Import the necessary modules:

In [4]:
import pandas as pd
import shutil
import os
from pathlib import Path

Define paths to the notebook's working directory and to the JSTOR data, then load the prepared CSV into a dataframe that serves as a reference for the next step.

In [5]:
file_dir = Path(os.getcwd()).resolve()
jstor_dir = file_dir / 'jstor_data'

reference_df = pd.read_csv("articles_gender.csv")
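Before running the separation step, it can help to confirm that the reference data has the columns the later cell reads (`type`, `file_x`, `file_y`) and to see how many rows fall under each type. The following is a minimal sketch, not part of the original notebook: the helper names `missing_columns` and `count_types` and the sample values are hypothetical, and the helpers accept plain iterables so they run without the CSV.

```python
from collections import Counter

# Columns the separation cell reads from each row of the reference dataframe.
REQUIRED_COLUMNS = ("type", "file_x", "file_y")


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


def count_types(type_values):
    """Count how many rows fall under each article type."""
    return dict(Counter(type_values))


# Hypothetical stand-in data; in the notebook these would be
# reference_df.columns and reference_df["type"].
sample_columns = ["id", "type", "file_x", "file_y"]
sample_types = ["research-article", "book-review", "misc", "research-article"]

print(missing_columns(sample_columns))
print(count_types(sample_types))
```

In the notebook itself, the calls would presumably be `missing_columns(reference_df.columns)` and `count_types(reference_df["type"])`. An empty list from the first call means all three columns are present.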

The following code block does a lot of work. Simply put, it copies the contents of the ngram1 and metadata folders from the original JSTOR directory into separate folders for research articles and book reviews. Rows whose type is `misc` are skipped, and the original `jstor_data` directory is left unchanged.

To illustrate the changes it makes:

**Before:**

└── js-sotf/

    ├── LICENSE

    └── jstor_data/

        ├── metadata/

        │   ├── 1.xml

        │   ├── 2.xml

        │   └── etc.xml

        ├── ngram1/

        │   ├── 1.txt

        │   ├── 2.txt

        │   └── etc.txt

        ├── ngram2
        
        ├── ngram3

        └── ocr

**After:**

└── js-sotf/

    ├── LICENSE

    ├── jstor_data

    ├── metadata/

    │   ├── book-review/

    │   │   ├── 1.xml

    │   │   ├── 2.xml

    │   │   └── etc.xml

    │   └── research-article/

    │       ├── 1.xml

    │       ├── 2.xml

    │       └── etc.xml

    └── ngram1/

        ├── book-review/

        │   ├── 1.txt

        │   ├── 2.txt

        │   └── etc.txt

        └── research-article/

            ├── 1.txt

            ├── 2.txt

            └── etc.txt

In [6]:
# Create a clean output directory for each (folder, article type) pair.
# An existing directory is removed and then recreated, so the cell can be re-run safely.
for folder in ("metadata", "ngram1"):
    for article_type in ("research-article", "book-review"):
        out_dir = file_dir / folder / article_type
        if out_dir.is_dir():
            shutil.rmtree(out_dir)
        out_dir.mkdir(parents=True)

for _, row in reference_df.iterrows():
    article_type = row['type']  # named to avoid shadowing the built-in `type`
    if article_type == "misc":
        continue
    meta_file = row['file_x']
    ngram1_file = row['file_y']
    # Copy the metadata XML and the ngram1 text file into the folder for this article type.
    shutil.copyfile(jstor_dir / "metadata" / meta_file, file_dir / "metadata" / article_type / meta_file)
    shutil.copyfile(jstor_dir / "ngram1" / ngram1_file, file_dir / "ngram1" / article_type / ngram1_file)

print("The book review articles and research articles have been separated.")

The book review articles and research articles have been separated.


This separation makes subsequent analysis much easier when considering book reviews and research articles in isolation.
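One way to check the result of the separation is to count the files in each output folder. The sketch below is an illustration rather than part of the original notebook: `count_separated_files` is a hypothetical helper, and the demonstration runs on a temporary directory with made-up file names so that it does not touch the real data.

```python
import tempfile
from pathlib import Path

ARTICLE_TYPES = ("research-article", "book-review")
DATA_FOLDERS = ("metadata", "ngram1")


def count_separated_files(root):
    """Return {(folder, article_type): number of files} under `root`."""
    root = Path(root)
    counts = {}
    for folder in DATA_FOLDERS:
        for article_type in ARTICLE_TYPES:
            target = root / folder / article_type
            if target.is_dir():
                counts[(folder, article_type)] = sum(
                    1 for p in target.iterdir() if p.is_file()
                )
            else:
                # A missing directory counts as zero files rather than raising.
                counts[(folder, article_type)] = 0
    return counts


# Demonstration on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    tmp_root = Path(tmp)
    for folder, suffix in (("metadata", ".xml"), ("ngram1", ".txt")):
        for article_type in ARTICLE_TYPES:
            (tmp_root / folder / article_type).mkdir(parents=True)
        (tmp_root / folder / "research-article" / f"1{suffix}").touch()
        (tmp_root / folder / "research-article" / f"2{suffix}").touch()
        (tmp_root / folder / "book-review" / f"3{suffix}").touch()
    demo_counts = count_separated_files(tmp_root)

print(demo_counts)
```

In the notebook, the call would presumably be `count_separated_files(file_dir)`. Because each non-`misc` row copies one metadata file and one ngram1 file, the two counts for a given article type should match, assuming the file names in the CSV are unique.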