# IS547 Project Jupyter Notebook

<details>
<summary>Project Overview</summary>

This project involves managing approximately 2,200 digital documents originating from an internal WordPress site migration at my workplace. As outlined in my Dataset Profile, the data consists of PDFs, Word documents, Excel spreadsheets, and occasionally PowerPoint presentations already archived in our Box storage. These were curated over a decade or more by our seventy-plus library committees, although the majority of the data comes from 10-15 committees. The documents include meeting minutes, agendas, and related institutional records. With the FAIR principles in mind, my curation goals are to enhance internal accessibility, maintain institutional memory and data provenance, and support governance through improved data organization and documentation. These documents were publicly available via our open staff site.

In this project, I will adopt the role of curator or archivist, actively managing the data curation lifecycle using the Digital Curation Centre’s (DCC) Curation Lifecycle Model (Higgins, 2008). This model provides structured guidance through the stages of creation and collection, processing and organization, storage and preservation, access and use, and disposition of data within the set.

</details>

<details>
<summary>Tentative Deliverables</summary>

- Consistent naming conventions applied across all documents
- Documentation of data governance and ethical compliance per our institutional policies; if none exist, resources from university-wide policies will be utilized
- Metadata enhancement to improve retrieval, searchability, and discoverability
- Documented provenance and fixity check to support institutional memory

</details>

<details>
<summary>High-Level Timeline</summary>

My high-level timeline anticipates scoping the naming and metadata enhancement portions of the project over the next few weeks, completing them by the end of March. Documentation for governance and ethical compliance will be ongoing as I work through the technical details, research, and creation. Provenance will be documented through the repository's version history and through documentation kept alongside the code. Fixity goals may be adjusted based on the complexities identified during the initial technical tasks of naming and working with metadata.

</details>

<details>
<summary>Known Gaps</summary>

Known gaps requiring further research include institutional data governance policies and the application of fixity measures. I have found a couple of internal resources for governance, but I may need to consult broader university-wide resources. The feasibility of renaming documents, adding metadata, linking, or managing fixity across this volume of documents will be assessed as the project progresses. I plan to use a Jupyter notebook and Python for the workflow, covering the technical documentation, code, and provenance information. Governance and ethics documentation will be written in Word, exported to PDF, and placed in the repo. All documentation will be accessible per WCAG 2.2 requirements.
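
Because fixity is still an open question for this project, here is a minimal sketch of one common approach: record a SHA-256 checksum for every file, then compare a later run against that baseline. The function names (`sha256_of`, `build_manifest`, `changed_files`) are hypothetical and are not part of the `data_pipeline` package; the choice of SHA-256 is also an assumption.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(baseline: dict[str, str], current: dict[str, str]) -> list[str]:
    """List paths that were added, removed, or altered since the baseline."""
    all_paths = set(baseline) | set(current)
    return sorted(p for p in all_paths if baseline.get(p) != current.get(p))
```

A baseline manifest could be saved as JSON next to the data and rebuilt after each curation step. An empty result from `changed_files` would then indicate that no file content changed.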

</details>

<details>
<summary>Anticipated Curatorial Actions</summary>

- Data collection: The set is already acquired and in organized Box folders. No further action needed.
- Ethical and legal: No explicit ethical or legal restrictions have been identified, yet further exploration of our institution’s policies is necessary to build documentation.
- Storage: Active curation will continue via our current infrastructure, Box.
- Quality assessment and cleaning: None anticipated.
- Reproducibility: Workflows will be documented and any code archived in Git so that my actions are reproducible.
- Provenance tracking: Basic provenance information may be implemented to document changes and any curation decisions.
- Metadata: Appropriate metadata standards will be applied if enhancement is done.
- Persistent identifiers: None will be implemented.
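
One lightweight way to capture the basic provenance information mentioned in the list above is an append-only log with one JSON record per curation action. This is a sketch under assumptions, not a chosen design: the record fields (`timestamp`, `action`, `target`, `note`) and the function names are hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_action(log_path: Path, action: str, target: str, note: str = "") -> dict:
    """Append one curation event to a JSON Lines log and return the record."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "target": target,
        "note": note,
    }
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record


def read_log(log_path: Path) -> list[dict]:
    """Load every recorded event, oldest first."""
    with log_path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Because the log is plain text, it can be committed to Git alongside the code, so the repository history and the curation decisions stay together.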

</details>

<details>
<summary>References</summary>

Higgins, S. (2008). The DCC Curation Lifecycle Model. International Journal of Digital Curation, 3(1), 134–140. https://doi.org/10.2218/ijdc.v3i1.48

</details>

<details>
<summary>Additional Resources</summary>

To learn more about Jupyter Notebooks in PyCharm, see the [PyCharm notebook help](https://www.jetbrains.com/help/pycharm/ipython-notebook-support.html).
For an overview of PyCharm, go to Help -> Learn IDE features or refer to the [PyCharm documentation](https://www.jetbrains.com/help/pycharm/getting-started.html).

</details>

In [3]:
from data_pipeline.data_explore import count_files
committees_directory = 'data/Committees'
total_files = count_files(committees_directory)
print(f"Total number of files in '{committees_directory}': {total_files}")

Total number of files in 'data/Committees': 2203


In [4]:
from data_pipeline.data_explore import find_file_types
file_types = find_file_types('data/Committees')
print(file_types)

{'': 6, '.docx': 1764, '.ppt': 26, '.doc': 53, '.pdf': 333, '.pptx': 21, '.xls': 4, '.xlsx': 2}
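
The implementation of `find_file_types` is not shown in this notebook. As a rough sketch of how such a tally could be produced, the hypothetical `tally_extensions` below walks a directory tree with `pathlib` and counts files by extension. Lowercasing the extension is an assumption of this sketch. Files with no extension, including dotfiles such as `.DS_Store`, fall under the empty-string key.

```python
from collections import Counter
from pathlib import Path


def tally_extensions(root: str) -> dict[str, int]:
    """Count files under root by lowercased extension ('' when there is none)."""
    counts = Counter(
        path.suffix.lower()
        for path in Path(root).rglob("*")
        if path.is_file()
    )
    return dict(counts)
```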


In [5]:
from data_pipeline.data_explore import list_committees_and_count
list_committees_and_count('data/Committees')


Research and Publication Committee
Reference Management Team
Promotion and Tenure Advisory Committee
The Library as Catalyst Project - Special Collections Research Center Working Group
Teaching and Learning Task Force
Graduate Student Survey Working Group
University Library Residency Program Working Group
Diversity Residency Advisory Committee
Awards and Recognition Committee
Academic Professional Promotion Implementation Team
Content Access Policy & Technology (CAPT)
Working Group on Library Grants, Outreach and Training (COMPLETED CHARGE)
Open Licensing Task Force
Academic Professional Peer Review Promotion Advisory Committee
Diversity, Equity, Inclusion, and Accessibility (DEIA) Task Force
Student-Focused Spaces Task Force
220 Exploratory Use Team
Marshall Gallery Task Force
Marketing and Communications Strategy Working Group
Reproduction and Use Fees Working Group
Library Faculty Meeting
Faculty Meeting Agenda Committee
CAPT Digital Production
CAPT Repositories, Preservation, and A

In [6]:
from data_pipeline.data_explore import list_files

list_files('data/Committees')


File: data/Committees/.DS_Store
File: data/Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.06.13.docx
File: data/Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.09.17.docx
File: data/Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.07.29.docx
File: data/Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.10.10.docx
File: data/Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.07.15.docx
File: data/Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.09.04.docx
File: data/Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.05.30.docx
File: data/Committees/The Library as Catalyst Project - Sp

In [8]:
from data_pipeline.data_cleaning import ensure_output_directory

ensure_output_directory()



In [9]:
from data_pipeline.data_cleaning import copy_files

copy_files()

Deleting: ./data/Committees/.DS_Store
Deleting: ./data/Committees/Administrative Council/.DS_Store
Deleting: ./data/Committees/Administrative Council/Minutes/.DS_Store
Deleting: ./data/Committees/Administrative Council/Agendas/.DS_Store
Deleting: ./data/Committees/Collection Development Committee/.DS_Store
Deleting: ./data/Committees/Collection Development Committee/Agendas/.DS_Store
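
The log above suggests that `.DS_Store` files are removed from the source tree itself. An alternative that leaves the originals untouched is to filter them out during the copy. The sketch below uses the standard library's `shutil.copytree` with `shutil.ignore_patterns`. The function name `copy_without_ds_store` is hypothetical, and this is not necessarily how `copy_files` works.

```python
import shutil
from pathlib import Path


def copy_without_ds_store(source: str, destination: str) -> Path:
    """Copy a directory tree, skipping macOS .DS_Store files; the source is untouched."""
    return Path(
        shutil.copytree(
            source,
            destination,
            ignore=shutil.ignore_patterns(".DS_Store"),
            dirs_exist_ok=True,  # requires Python 3.8 or newer
        )
    )
```

For example, `copy_without_ds_store("data/Committees", "data/Processed_Committees")` would populate the processed directory while keeping the archived copy exactly as it was received.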


In [10]:
from data_pipeline.data_explore import count_files
count_files('data/Processed_Committees')

2203

In [11]:
from data_pipeline.data_explore import find_file_types
file_types = find_file_types('data/Processed_Committees')
print(file_types)

{'.docx': 1764, '.ppt': 26, '.doc': 53, '.pdf': 333, '.pptx': 21, '.xls': 4, '.xlsx': 2}


In [12]:
from data_pipeline.file_naming_old import generate_names_csv

generate_names_csv()

Unnamed: 0,Committee,Document Type,Original File Name,Extracted Date,Proposed File Name
0,The Library as Catalyst Project - Special Coll...,Minutes,2019.06.13.docx,2019-06-13,The Library as Catalyst Project - Special Coll...
1,The Library as Catalyst Project - Special Coll...,Minutes,2019.09.17.docx,2019-09-17,The Library as Catalyst Project - Special Coll...
2,The Library as Catalyst Project - Special Coll...,Minutes,2019.07.29.docx,2019-07-29,The Library as Catalyst Project - Special Coll...
3,The Library as Catalyst Project - Special Coll...,Minutes,2019.10.10.docx,2019-10-10,The Library as Catalyst Project - Special Coll...
4,The Library as Catalyst Project - Special Coll...,Minutes,2019.07.15.docx,2019-07-15,The Library as Catalyst Project - Special Coll...
...,...,...,...,...,...
2198,The Library as Catalyst Project - Managing the...,Minutes,2019-February-5.docx,unknown,The Library as Catalyst Project - Managing the...
2199,The Library as Catalyst Project - Managing the...,Minutes,2019-May-15.docx,unknown,The Library as Catalyst Project - Managing the...
2200,The Library as Catalyst Project - Managing the...,Minutes,2019-March-7.docx,unknown,The Library as Catalyst Project - Managing the...
2201,The Library as Catalyst Project - Managing the...,Minutes,2019-January-7.docx,unknown,The Library as Catalyst Project - Managing the...
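
In the table above, names such as `2019.06.13.docx` yield a date, while names such as `2019-February-5.docx` come back as `unknown`. One possible way to cover both patterns is to try a short list of `strptime` formats in turn. This is a sketch only: `extract_date` and `DATE_FORMATS` are hypothetical names, the two formats are taken from the file names visible above, and `%B` assumes English month names in the active locale.

```python
from datetime import date, datetime
from pathlib import Path
from typing import Optional

# Candidate patterns, tried in order; extend as other naming styles turn up.
DATE_FORMATS = ("%Y.%m.%d", "%Y-%B-%d")


def extract_date(file_name: str) -> Optional[date]:
    """Parse a date from a file name whose stem is entirely a date, else None."""
    stem = Path(file_name).stem  # drops only the final extension
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(stem, fmt).date()
        except ValueError:
            continue
    return None
```

Returning `None` rather than guessing keeps unparsed names easy to list and review by hand. Names that embed a date inside other text would need a regular expression or a further format and are not handled here.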


In [13]:
from data_pipeline.data_cleaning import list_files

list_files("./data/Processed_Committees")

File: ./data/Processed_Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.06.13.docx
File: ./data/Processed_Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.09.17.docx
File: ./data/Processed_Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.07.29.docx
File: ./data/Processed_Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.10.10.docx
File: ./data/Processed_Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.07.15.docx
File: ./data/Processed_Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.09.04.docx
File: ./data/Processed_Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.05.30.docx
File: 