######Preliminary remarks######

The Python code in this repository was produced during my participation in the Learn IT, Girl! initiative (https://www.learnitgirl.com) in spring 2019. A short abstract of the project I am developing can be found below. The code was written to examine a bibliographic dataset that is part of my current research on the history of agricultural meteorology. My historical research is sponsored by the Deutsche Forschungsgemeinschaft (DFG) (Project No. 321660352). I sincerely thank my mentor, Dr Laura Fernández Gallardo, for the support I received during the Learn IT, Girl! initiative.

Berlin, April 2019
######Structure of the repository######

There are four subfolders: "1_data" for the data to be examined; "2_code_scripts" for the Python code; "3_printouts" for the lists generated from the data; "4_plots" for the graphs generated from the data. All the printouts and plots are generated by the code in "2_code_scripts". Each subfolder is further divided into Part1 (data analysis with Pandas) and Part2 (text analysis with SpaCy). The code is run using the "main" scripts. In each of them, a comment at the top of the file explains what the code does and where the methods it uses can be found.
######What you can do with this code######

The code has been written to extract information useful for my historical research on agricultural meteorology. It examines the distribution of the publications by year and language, and extracts author and co-author lists. Using SpaCy, it also analyses the full text of a few articles to extract geographical names, and analyses the titles of all the English-language articles to extract the most common words for each journal category.
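As a rough illustration of the Part1 analysis, here is a minimal sketch with Pandas. It is not the code from "2_code_scripts": the sample rows are invented, and the column names ("Publication Year", "Language", "Author") and the ";" separator between authors are assumptions based on a typical Zotero CSV export, so they may need adjusting for your own file.

```python
import io

import pandas as pd

# Invented sample rows standing in for "bibliography_data.csv".
# The column names are assumed, not taken from the repository's data.
sample_csv = io.StringIO(
    "Publication Year,Language,Author,Title\n"
    '1921,en,"Doe, Jane; Roe, Richard",Weather and wheat\n'
    '1921,de,"Muster, Erika",Wetter und Ernte\n'
    '1921,en,"Doe, Jane",Rainfall and crops\n'
    '1935,en,"Roe, Richard; Poe, Alex",Frost damage in orchards\n'
)
df = pd.read_csv(sample_csv)

# Number of publications per year (rows) and language (columns).
counts = (
    df.groupby(["Publication Year", "Language"])
    .size()
    .unstack(fill_value=0)
)

# Author and co-author list: split the combined field on ";",
# put each name on its own row, and count how often each name occurs.
authors = df["Author"].dropna().str.split(";").explode().str.strip()
author_counts = authors.value_counts()

print(counts)
print(author_counts)
```

To try this on a real export, replace `sample_csv` with the path to your CSV file and check the column names against the file's header row.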
######Potential for code reuse######

The dataset analysed has been exported from a Zotero library (https://www.zotero.org). It should not be difficult to reuse the code to analyse other datasets created with Zotero. Just pay attention to the naming conventions I used in the code (e.g. the data are read from the file "bibliography_data.csv").
######Project description######

Title: Digital History with Python: Data Analysis and Machine Learning with Bibliographic References

Abstract: The project will explore how data science methods and machine learning can contribute to my historical research by examining a large bibliographic data set of publications on agricultural meteorology that appeared in the years 1900-1950. The data set is being built as part of my research project on the history of agricultural meteorology in the first half of the twentieth century (https://agriculturalmeteorology.wordpress.com). It currently contains over 3,300 entries in several languages, ranging from English to Japanese, and belonging to multiple disciplines such as meteorology, agriculture, botany and geography. The data are collected using the reference management software Zotero and can be easily exported for analysis as CSV/JSON files.

During the project I will learn:
a) the basics of the Python programming language (through self-study of Z. Shaw’s ‘Learn Python the Hard Way’);
b) the use of tools based on Python (e.g. Pandas and Matplotlib) to visualise and analyse my bibliographic data set and discover trends in the data (e.g. whether the number of publications increased over time, who the most prolific authors were, which journals published the most contributions on agricultural meteorology);
c) how to apply machine learning methods to the publications in the data set that have an abstract available, and test whether classification algorithms can reliably attribute the publications to different knowledge domains (e.g. meteorology journals vs agricultural journals vs multi-discipline journals such as ‘Nature’ and ‘Science’) by using keywords, word frequency, etc.

By applying data science, I will be able to extract information from my bibliographic data set that is not otherwise available through traditional historical methods. This, I believe, will give me deeper insights into how agricultural meteorology developed as an interdisciplinary research field in the first half of the twentieth century.