# Parsing file titles

This notebook shows how to parse file titles into simple metadata. The example uses a subset of the SIDtoday newsletters shared by The Intercept.

First we see what folders we have.

In [17]:
ls -d */

2003-04-SIDToday-Text/


Then we define the directory we want to operate on.

In [18]:
directory = "2003-04-SIDToday-Text"

Here we can check the file names.

In [70]:
import glob
glob.glob(directory + "/*.txt")[:5]

['2003-04-SIDToday-Text/2003-04-01_SIDToday_-_Deployed_SIGINT_Analysts--An_Urgent_Need.pdf.txt',
 '2003-04-SIDToday-Text/2003-04-01_SIDToday_-_Practical_Jokes_and_April_Fools.pdf.txt',
 '2003-04-SIDToday-Text/2003-04-02_SIDToday_-_Last_QUICKMASK_Training_Today.pdf.txt',
 '2003-04-SIDToday-Text/2003-04-02_SIDToday_-_New_Hire_Whats_On_Your_Mind_Session_--_Today.pdf.txt',
 '2003-04-SIDToday-Text/2003-04-02_SIDToday_-_SIGINT_Strategy__The_Importance_of_Common_Goals.pdf.txt']

## Create a list of file names

Now we create a list of file names.

In [26]:
fileList = sorted(glob.glob(directory + "/*.txt")) # glob.glob already returns a list; sorting makes the order predictable
    
fileList[:5]

['2003-04-SIDToday-Text/2003-04-01_SIDToday_-_Deployed_SIGINT_Analysts--An_Urgent_Need.pdf.txt',
 '2003-04-SIDToday-Text/2003-04-01_SIDToday_-_Practical_Jokes_and_April_Fools.pdf.txt',
 '2003-04-SIDToday-Text/2003-04-02_SIDToday_-_Last_QUICKMASK_Training_Today.pdf.txt',
 '2003-04-SIDToday-Text/2003-04-02_SIDToday_-_New_Hire_Whats_On_Your_Mind_Session_--_Today.pdf.txt',
 '2003-04-SIDToday-Text/2003-04-02_SIDToday_-_SIGINT_Strategy__The_Importance_of_Common_Goals.pdf.txt']
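As an alternative to `glob`, the same list could be built with `pathlib`. The following is only a sketch: the helper name `list_text_files` is hypothetical, and the demonstration runs on a throwaway folder with made-up file names so that it works without the newsletter files.

```python
import tempfile
from pathlib import Path


def list_text_files(directory):
    """Return the sorted paths of all .txt files directly inside `directory`."""
    # Path.glob yields matches in arbitrary order, so sort for a stable result.
    return sorted(str(path) for path in Path(directory).glob("*.txt"))


# Demonstration on a temporary folder with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["b_second.pdf.txt", "a_first.pdf.txt", "notes.md"]:
        (Path(tmp) / name).touch()
    demo_files = list_text_files(tmp)
    demo_names = [Path(f).name for f in demo_files]

print(demo_names)  # only the two .txt files, in alphabetical order
```

With the real data, `list_text_files(directory)` would play the role of `fileList`.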

## Strip information

Now we strip the folder name from the start and the ".pdf.txt" extension from the end.

In [39]:
justDateNName = []
for fileName in fileList:
    partsOfFilename = fileName.split("/") # splits on the forward slash
    dateNtitle = partsOfFilename[1].replace(".pdf.txt", "") # takes everything after the / and removes the extension; strip() would remove a set of characters and could eat into the title
    justDateNName.append(dateNtitle)
    
justDateNName[:5]

['2003-04-01_SIDToday_-_Deployed_SIGINT_Analysts--An_Urgent_Need',
 '2003-04-01_SIDToday_-_Practical_Jokes_and_April_Fools',
 '2003-04-02_SIDToday_-_Last_QUICKMASK_Training_Today',
 '2003-04-02_SIDToday_-_New_Hire_Whats_On_Your_Mind_Session_--_Today',
 '2003-04-02_SIDToday_-_SIGINT_Strategy__The_Importance_of_Common_Goals']

## Split the name and save parts

Now we split the rest on "_", keep the first part (the date), and join everything from the fourth part onward into the title. The second and third parts ("SIDToday" and "-") are dropped. We get a list of lists.

In [54]:
listOfDtsNNames = []
for string in justDateNName:
    partsOfString = string.split("_")
    row = [partsOfString[0]," ".join(partsOfString[3:])] # This keeps the date and joins the title words with spaces
    listOfDtsNNames.append(row)
    
listOfDtsNNames[:4]

[['2003-04-01', 'Deployed SIGINT Analysts--An Urgent Need'],
 ['2003-04-01', 'Practical Jokes and April Fools'],
 ['2003-04-02', 'Last QUICKMASK Training Today'],
 ['2003-04-02', 'New Hire Whats On Your Mind Session -- Today']]
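The two steps above can also be combined into one small function. This is a minimal sketch, assuming every file name follows the pattern `DATE_SIDToday_-_Title.pdf.txt`; the helper name `parse_file_name` and the constant `SUFFIX` are hypothetical. It also shows why `str.strip` is risky for removing an extension: it treats its argument as a set of characters, so a title ending in "d" loses that letter.

```python
import os

SUFFIX = ".pdf.txt"


def parse_file_name(path):
    """Split a path like 'folder/2003-04-01_SIDToday_-_Some_Title.pdf.txt'
    into [date, title]. Assumes the name has at least four "_"-separated parts."""
    name = os.path.basename(path)  # drop the folder part
    if name.endswith(SUFFIX):
        name = name[:-len(SUFFIX)]  # remove the exact suffix, not a character set
    # Split at most three times: date, "SIDToday", "-", and the rest of the title.
    date, _, _, title = name.split("_", 3)
    return [date, title.replace("_", " ")]


example = "2003-04-SIDToday-Text/2003-04-01_SIDToday_-_Deployed_SIGINT_Analysts--An_Urgent_Need.pdf.txt"

# str.strip removes any of the characters ".pdftx" from both ends,
# so the final "d" of "Need" is lost as well.
stripped = os.path.basename(example).strip(SUFFIX)
print(stripped.endswith("Need"))  # False

print(parse_file_name(example))
```

On Python 3.9 or later, `name.removesuffix(SUFFIX)` would do the same job as the `endswith` check and slice.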

## Write out to CSV

Finally, we write the list out to a CSV file, one row per newsletter.

In [72]:
import csv

with open("metadata.csv", 'w', newline='') as csvfile:
    resultsWriter = csv.writer(csvfile, delimiter=',')
    resultsWriter.writerows(listOfDtsNNames) # writes one row per [date, title] pair
        
print("Done")

Done
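If the CSV is meant to be opened in a spreadsheet or loaded by another tool, a header row may help. The sketch below is one possible variant, not part of the original notebook: the column names `date` and `title` and the file name `metadata_with_header.csv` are assumptions, and it writes two of the rows shown earlier to a temporary folder so that it can be run on its own. It then reads the file back to check the result.

```python
import csv
import os
import tempfile

rows = [
    ["2003-04-01", "Practical Jokes and April Fools"],
    ["2003-04-02", "Last QUICKMASK Training Today"],
]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "metadata_with_header.csv")

    # newline='' lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["date", "title"])  # assumed column names
        writer.writerows(rows)

    # Read the file back to confirm that the rows survive the round trip.
    with open(csv_path, newline="", encoding="utf-8") as csvfile:
        read_back = list(csv.reader(csvfile))

print(read_back[0])
print(read_back[1:] == rows)
```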
