In [1]:
import os
import re
import pandas as pd

# Metadata

In research projects based on text and data mining, it is common to examine differences and similarities between separate groups of texts. To establish such groups, it can be useful to create a CSV file with metadata describing these texts. In addition to basic information such as the titles and the filenames of the texts, such metadata files ought to provide values for categorical variables, which are variables that can take only a limited number of values. This notebook offers instructions on how you can create such a CSV file containing metadata.


## Structure of the CSV file


The CSV file containing the metadata needs to contain at least a `title` field. This field will be used as an identifier for the text.

The CSV file below is an example.


```
path,title,year_of_publication,class
Corpus/Ulysses.txt,Ulysses,1920,A
Corpus/ThroughtheLookingGlass.txt,ThroughtheLookingGlass,1871,A
Corpus/HeartofDarkness.txt,HeartofDarkness,1899,A
Corpus/ARoomWithaView.txt,ARoomWithaView,1908,B
Corpus/ATaleofTwoCities.txt,ATaleofTwoCities,1859,B
Corpus/PrideandPrejudice.txt,PrideandPrejudice,1813,B
```
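
Once such a file exists, the categorical variables can be used to divide the corpus into groups. The following is a minimal sketch of that idea, assuming the example rows above are available as text (here embedded in a string and read via `io.StringIO` rather than from a file on disk) and that `class` is the grouping variable. The names `example_csv`, `metadata` and `groups` are chosen for illustration only.

```python
import io
import pandas as pd

# The example metadata shown above, embedded as a string for illustration
example_csv = """path,title,year_of_publication,class
Corpus/Ulysses.txt,Ulysses,1920,A
Corpus/ThroughtheLookingGlass.txt,ThroughtheLookingGlass,1871,A
Corpus/HeartofDarkness.txt,HeartofDarkness,1899,A
Corpus/ARoomWithaView.txt,ARoomWithaView,1908,B
Corpus/ATaleofTwoCities.txt,ATaleofTwoCities,1859,B
Corpus/PrideandPrejudice.txt,PrideandPrejudice,1813,B
"""

metadata = pd.read_csv(io.StringIO(example_csv))

# Collect the titles that belong to each value of the categorical variable
groups = metadata.groupby('class')['title'].apply(list).to_dict()
print(groups)
```

Each key of `groups` is one value of the categorical variable, and each value is the list of titles in that group.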


The CSV file can of course be created manually. The remainder of this notebook also contains some code which can help you to make such a file. 


## Collect all the file names

First, if you have made a directory containing all the files in your corpus, you can collect the paths to these files and save them in a list named `corpus`.

In [2]:
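# Named corpus_dir to avoid shadowing the built-in function dir()
corpus_dir = 'Corpus'
corpus = []

for file in os.listdir(corpus_dir):
    # Skip hidden files, whose names start with a dot
    if not file.startswith('.'):
        path = os.path.join(corpus_dir, file)
        corpus.append(path)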
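
If the directory may also contain subdirectories or files that are not plain text, the listing can be filtered more strictly. The sketch below is one possible variant based on `pathlib`. The helper `collect_paths` is hypothetical, and the assumption that all corpus files end in `.txt` is mine; the temporary directory only serves to make the example self-contained.

```python
import tempfile
from pathlib import Path

def collect_paths(directory, extension='.txt'):
    """Return the sorted paths of visible regular files with the given extension."""
    return sorted(
        str(p) for p in Path(directory).iterdir()
        if p.is_file() and p.suffix == extension and not p.name.startswith('.')
    )

# Demonstration on a temporary directory with one text file and two files to skip
with tempfile.TemporaryDirectory() as tmp:
    for name in ['Ulysses.txt', '.hidden.txt', 'notes.md']:
        (Path(tmp) / name).write_text('sample', encoding='utf-8')
    found = [Path(p).name for p in collect_paths(tmp)]

print(found)
```

Note that this variant sorts the paths, whereas `os.listdir` returns them in arbitrary order.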

## Collect all the titles

If the file names reflect the titles of your texts, these titles can be extracted using the function defined below.

In [3]:
def extract_title(path):
    title = os.path.basename(path)
    title = re.sub( r'[.]txt$' , '' , title )
    return title
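
To illustrate what the function returns, the short example below applies it to a single path. The function is repeated so that the block runs on its own, and the path `Corpus/Ulysses.txt` is taken from the example CSV file above. The second half shows `os.path.splitext` as a possible alternative that removes any extension, not only `.txt`.

```python
import os
import re

def extract_title(path):
    # Keep only the file name, then strip a trailing '.txt'
    title = os.path.basename(path)
    title = re.sub(r'[.]txt$', '', title)
    return title

# Build the path with os.path.join so that it uses the separator of the current system
example_path = os.path.join('Corpus', 'Ulysses.txt')
print(extract_title(example_path))

# Alternative: split the file name into a stem and an extension
stem, extension = os.path.splitext(os.path.basename(example_path))
print(stem, extension)
```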

Using the list named `corpus` that was created earlier, the first two columns of the CSV file can already be generated.

In [4]:
# The header
print('path,title')

# One line per text: the path, followed by the title derived from it
for path in corpus:
    print(f'{path},{extract_title(path)}')

path,title
Corpus\A_Confederate_Girls_Diary.txt,A_Confederate_Girls_Diary
Corpus\A_Diary_from_Dixie.txt,A_Diary_from_Dixie
Corpus\A_Virginia_Girl_in_the_Civil_War.txt,A_Virginia_Girl_in_the_Civil_War
Corpus\A_Womans_Wartime_Journal.txt,A_Womans_Wartime_Journal
Corpus\A_Womans_War_Record.txt,A_Womans_War_Record
Corpus\Belle_Boyd_in_Camp_and_Prison_Vol1.txt,Belle_Boyd_in_Camp_and_Prison_Vol1
Corpus\Belle_Boyd_in_Camp_and_Prison_Vol2.txt,Belle_Boyd_in_Camp_and_Prison_Vol2
Corpus\Diary_of_Belle_Edmondson.txt,Diary_of_Belle_Edmondson
Corpus\Reminiscences_of_the_Civil_War.txt,Reminiscences_of_the_Civil_War
Corpus\The_War-Time_Journal_of_a_Georgia_Girl.txt,The_War-Time_Journal_of_a_Georgia_Girl
Corpus\Two_Diaries_from_Middle_St._Johns.txt,Two_Diaries_from_Middle_St._Johns
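
Printing the values with a plain comma between them works as long as no path or title contains a comma itself. Where that cannot be guaranteed, the standard `csv` module can take care of the quoting. The sketch below writes to an in-memory buffer; the second row uses a hypothetical file name containing a comma to show the effect. The module is imported under the alias `csv_module` because this notebook later uses `csv` as a variable name.

```python
import csv as csv_module
import io

rows = [
    ('Corpus/A_Diary_from_Dixie.txt', 'A_Diary_from_Dixie'),
    # Hypothetical file name with a comma, for illustration only
    ('Corpus/Letters,_1861.txt', 'Letters,_1861'),
]

buffer = io.StringIO()
writer = csv_module.writer(buffer, lineterminator='\n')
writer.writerow(['path', 'title'])
writer.writerows(rows)

# Fields containing a comma are wrapped in double quotes
print(buffer.getvalue())
```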


## Adding additional fields

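The input prompts below can help you to add further fields. You first enter the number of columns, then a name for each column, and finally a value for each column of every text.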

In [5]:
nr_columns = int(input( "How many columns would you like to add?\n"))

How many columns would you like to add?
4
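
Calling `int()` directly on the typed text raises a `ValueError` when the input is not a whole number. A small, hypothetical helper such as `parse_column_count` below could be used to check the input first; this is only a sketch of one way to do it.

```python
def parse_column_count(text):
    """Return the text as a non-negative integer, or None if it is not one."""
    try:
        count = int(text.strip())
    except ValueError:
        return None
    return count if count >= 0 else None

print(parse_column_count('4'))
print(parse_column_count('four'))
```

A `None` result can then be used as a signal to ask for the number again.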


In [6]:
column_names = []
for column in range(1,nr_columns+1):
    column_name = input( f"Name of column {column}:\n")
    column_names.append(column_name)
    

Name of column 1:
first- vs. second-hand account
Name of column 2:
years described
Name of column 3:
year of publication
Name of column 4:
age of the author


In [7]:
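csv = []  # one row (a list of values) per text
for file in corpus:
    print(f'{extract_title(file)}:')
    # Each row starts with the path and the title
    row = []
    row.extend([file, extract_title(file)])
    # Ask for a value for every additional column
    for column_name in column_names:
        value = input(f"{column_name}: ")
        row.append(value)
    csv.append(row)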

A_Confederate_Girls_Diary:
first- vs. second-hand account: first-hand
years described: 1862-1865
year of publication: 1913
age of the author: 20-23
A_Diary_from_Dixie:
first- vs. second-hand account: first-hand
years described: 1860-1865
year of publication: 1905
age of the author: 37-42
A_Virginia_Girl_in_the_Civil_War:
first- vs. second-hand account: second-hand
years described: 1861-1865
year of publication: 1903
age of the author: 4-8 (during years described) / 46 (at time of publication)
A_Womans_Wartime_Journal:
first- vs. second-hand account: first-hand
years described: 1864-1865
year of publication: 1918
age of the author: 47-48
A_Womans_War_Record:
first- vs. second-hand account: first-hand
years described: 1861-1865
year of publication: 1889
age of the author: 19-23
Belle_Boyd_in_Camp_and_Prison_Vol1:
first- vs. second-hand account: first-hand
years described: 1860-1863
year of publication: 1865
age of the author: 16-19
Belle_Boyd_in_Camp_and_Prison_Vol2:
first- vs. second-ha
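
Typing every value at a prompt becomes laborious for larger corpora and is hard to repeat. A possible alternative, sketched here under the assumption that the values are already known, is to keep them in a dictionary keyed by title and to build the rows from it. The names `build_rows` and `known_values` are hypothetical; the values for `A_Diary_from_Dixie` are taken from the session above, and texts without known values receive empty strings.

```python
def build_rows(paths_and_titles, column_names, known_values):
    """Build one row per text: path, title, then one value per extra column."""
    rows = []
    for path, title in paths_and_titles:
        values = known_values.get(title, {})
        # Missing values are filled with an empty string
        row = [path, title] + [values.get(name, '') for name in column_names]
        rows.append(row)
    return rows

column_names = ['years described', 'year of publication']
known_values = {
    'A_Diary_from_Dixie': {
        'years described': '1860-1865',
        'year of publication': '1905',
    },
}
paths_and_titles = [
    ('Corpus/A_Diary_from_Dixie.txt', 'A_Diary_from_Dixie'),
    ('Corpus/A_Womans_War_Record.txt', 'A_Womans_War_Record'),
]

rows = build_rows(paths_and_titles, column_names, known_values)
for row in rows:
    print(row)
```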

In [8]:
column_names = ['path','title'] + column_names
print(column_names)

['path', 'title', 'first- vs. second-hand account', 'years described', 'year of publication', 'age of the author']


Finally, the values collected in this way are saved as a CSV file named `metadata.csv`.

In [9]:
df = pd.DataFrame(csv, columns = column_names )
df.to_csv('metadata.csv' , index=False)
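
To check the result, the file can be read back with `pd.read_csv`, after which a categorical column can be used to select a group of texts. The sketch below is self-contained: it writes three rows, taken from the session above, to a temporary directory instead of the working directory, and the names `rows`, `columns` and `first_hand` are chosen for illustration.

```python
import os
import tempfile
import pandas as pd

columns = ['path', 'title', 'first- vs. second-hand account']
rows = [
    ['Corpus/A_Confederate_Girls_Diary.txt', 'A_Confederate_Girls_Diary', 'first-hand'],
    ['Corpus/A_Diary_from_Dixie.txt', 'A_Diary_from_Dixie', 'first-hand'],
    ['Corpus/A_Virginia_Girl_in_the_Civil_War.txt', 'A_Virginia_Girl_in_the_Civil_War', 'second-hand'],
]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'metadata.csv')
    # Write the metadata without the DataFrame index, then read it back
    pd.DataFrame(rows, columns=columns).to_csv(csv_path, index=False)
    metadata = pd.read_csv(csv_path)

# Select the texts with a given value of the categorical variable
first_hand = metadata[metadata['first- vs. second-hand account'] == 'first-hand']
print(first_hand['title'].tolist())
```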