MPP-E1180: Introduction to Collaborative Social Science Data Analysis
Version: 1 December 2016
Instructor: Christopher Gandrud
The objective of this course is to learn how to collaboratively and reproducibly gather social science data, analyse it, and effectively present the results.
The course is intended to be immediately useful for your academic work, as well as for work in the public and private sectors. The tools you learn and the final project you complete in this course will be directly useful for your thesis research. Academia is placing increasing emphasis on the skills needed to effectively gather, handle, and analyse data, and to present results to a range of audiences in highly reproducible ways, so this course will provide you with important tools for future academic study.

Governments and international institutions are increasingly adopting the technologies and methods of collaborative open data science. For example, see initiatives by the World Bank, Germany, New York City, the United Kingdom, and the United States. These and many other resources provide great new opportunities for open, evidence-based policymaking. This course is designed to enable you to take full advantage of these opportunities and actively contribute to these initiatives.

Finally, the skills we will learn in this course are also widely used in business. R programming skills in particular are highly valued in fields such as finance and information technology. Being able to effectively communicate results from statistical analyses in dynamic, often web-based formats is highly valued by businesses, and increasingly in government and academia.
A large part of the practice of social science data analysis is computer programming. Learning how to approach the analysis of data from a computer science perspective will allow you to take full advantage of state-of-the-art statistical tools and best-practice research methods for understanding social phenomena and effectively communicating your findings in multiple media.
The course will involve learning the fundamentals of widely used computer languages. The statistical language R will allow us to gather and analyse our data. The Markdown/HTML and LaTeX markup languages will allow us to present our results to a variety of audiences. We will use Git/GitHub to version control and store all of our files. This will enable collaboration and full reproducibility.
The focus of this course is active in-class participation and collaboration on realistic projects using the concepts and tools introduced in lectures and scholarly articles. All assignments and projects will be completed in teams. I encourage you to use pair programming and even collaborate across teams.
Alongside learning the details of how to use specific tools of collaborative and reproducible social science data analysis we will emphasise their general properties and how they fit together into a highly collaborative and reproducible research workflow. Languages and technologies come and go, so it is important to understand the fundamental principles underlying them so that you can adapt to new technologies and understand previous researchers' work.
The course assumes that you have a good basic understanding of descriptive and introductory inferential statistics (e.g. data types, ways of describing distributions, significance testing, linear models, and so on). Knowledge of particular software or computer programming is not assumed.
Patience is a key skill for computer programming. Computer languages are extremely literal. This can lead to 'communication problems' between you and the computer. It does not share your assumptions, so you have to be very explicit. This quality makes using these tools great for recording your research steps so that they are highly reproducible. But it can also be maddening and requires patience to deal with effectively.
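This literalness is easiest to see in practice. A tiny R illustration (R is the course's statistical language):

```r
# R does exactly what you type, not what you meant
x <- c(1, 2, 3)   # assign a vector to the object 'x'
mean(x)           # returns 2
# mean(X)         # Error: object 'X' not found --
#                 # 'X' and 'x' are different names to R,
#                 # even though a human reader sees the same word
```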
Gandrud, Christopher. 2015. Reproducible Research with R and RStudio. 2nd Edition. Chapman & Hall/CRC Press, Oxford. (RRRR)
A good reference text to have by your side when doing statistics with R is:
Crawley, Michael J. 2005. Statistics: An Introduction Using R. John Wiley and Sons Ltd., Chichester.
A great free resource for more advanced R programming is Hadley Wickham's aptly named Advanced R Programming.
If you ever get stuck, a good first place to turn for answers is StackExchange. If you are stuck on a coding problem, chances are someone else has had the same problem before, asked a question on StackExchange, and received answers.
Software and Computers
All of the software used in this course will be open source, i.e. free.
Please bring your own laptop to class. What we do in the course requires you to have administrator privileges on your computer. It's preferable that you have a computer with a Mac or (similar) Linux OS. Windows is also fine; there will just be a few extra steps, and it may take more time for me to help you resolve bugs.
Install LaTeX. This is a large installation, so dedicate some time to doing it.
All lecture materials and their source files will be hosted in the course's GitHub repository.
You are highly encouraged to suggest changes to the lecture material with a pull request (we'll learn about how to do this in the second lecture) if you think of improvements that can be made for clarity, relevance, and to fix typos.
| Name | Percent of Final Mark | Due |
|------|-----------------------|-----|
| Pair Assignment 1 | 10% | 7 October |
| Pair Assignment 2 | 10% | 28 October |
| Pair Assignment 3 | 10% | 11 November |
| Collaborative Research Project | 50% | Presentation: Final Class; Paper/Website: Final Exam Week |
The first pair assignment is designed to develop your understanding of file structures, version control, and basic R data structures and descriptive statistics. Each pair will create a new public GitHub repository. The repository should be fully documented, including with a descriptive README.md file. The repository will include R source code files that access at least two data sets from the R core data sets and/or fivethirtyeight, perform basic transformations on the data, and illustrate the data's distributions using a variety of relevant descriptive statistics. At least one file must dynamically link to another in a substantively meaningful way. Finally, another pair must make a pull request, which should be discussed and merged. Deadline 7 October, 10% of final grade.
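As a rough sketch of the kind of code this assignment involves (the data set and the transformation shown here are illustrative, not required):

```r
# Access an R core data set and compute descriptive statistics
data(swiss)                  # one of R's built-in data sets
str(swiss)                   # inspect the data's structure
summary(swiss$Fertility)     # descriptive statistics for one variable

# A basic transformation: standardise a variable
swiss$fert_std <- scale(swiss$Fertility)
mean(swiss$fert_std)         # approximately 0 after standardisation
```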
The second pair assignment is a proposal for your Collaborative Research Project. It is an opportunity for you to lay out your collaborative research paper's question, justify why it is interesting, provide a basic literature review (properly cited using BibTeX), and identify data sources/methodologies that you can access to help answer your question. You will also demonstrate your understanding of literate programming technologies. Deadline 28 October, 2,000 words maximum, 10% of final grade.
In the third pair assignment you will gather web-based data from at least two sources, merge the data sets, conduct basic descriptive and (some) inferential statistics on the data to address a relevant research question, and briefly describe the results, including with dynamically generated tables and figures. Students are encouraged to access data and perform statistical analyses with an eye to answering questions relevant for their Collaborative Research Project. Deadline 11 November, the write-up should be 1,500 words maximum and use literate programming, 10% of final grade.
For the Collaborative Research Project you will pose an interesting social science question and attempt to answer it using standard academic practices, including original data collection and statistical analysis. The project should be considered a ‘dry run’ for your thesis. The project has three presentation outputs designed to present your research to multiple audiences. The first is an oral presentation (10 minutes maximum) given in the final class. The second is a standard academic paper (5,000 words maximum) that is fully reproducible and dynamically generated. The third is a website designed to present key aspects of your research in an engaging way to a general audience. The paper and website are due in the Final Exam Week. The presentation and website are each worth 10% of your final mark. The paper is worth 30%.
All of the assignments for the course will be completed in pairs. All assignments must be developed using Git and submitted on GitHub. All assignments, including the version history, must be completely reproducible from the repository files. In general a single mark for the pair will be given. However, as all assignments are developed using Git, your contributor statistics will be taken into consideration. Major discrepancies between team members will result in scores reflecting individuals' contributions.
Assignments are due by midnight on the due date. When you have completed the assignment, email me the GitHub tag URL for the final version of your assignment.
Examination Requirement: The weighted average grade of all course assignments must be 50% or higher on the numerical scale.
Student Attendance/Participation: Students are expected to be present and prepared for every class session. Active participation during lectures and seminar discussions is essential. Participation involves both 'traditional participation' in terms of engaging in class discussions and non-traditional participation such as pair programming and actively contributing to both your team's projects and other teams' projects via Git pull requests. You are even encouraged to make pull requests to the main course material if you find an error or think of an improvement. As such, your GitHub contributor statistics will be used to partially evaluate your participation.
If unavoidable circumstances arise which prevent attendance or preparation, the instructor should be advised by email with as much advance notice as possible. Please note that students cannot miss more than two sessions. For further information please consult the examination rules §4.
Late assignments: For each day the assignment is turned in late, the grade will be reduced by 10% (e.g. 2 days after the deadline would result in a 20% grade deduction).
Part I: Motivation and getting started
Week 1: Introduction to the Course + Introduction to the R Programming Language (1)
In this week I will first give a general overview of the course objectives and key concepts. We will also make sure that you are able to install and load all of the necessary software required for the course.
Then, we will learn the basics of the R statistical programming language for data handling and simple descriptive statistics, as well as general computer science problem solving skills.
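As a taste of what this first look at R covers, here is a minimal sketch using made-up example values:

```r
# Vectors: the fundamental R data structure
ages <- c(23, 35, 41, 29, 52)
mean(ages)   # arithmetic mean: 36
sd(ages)     # standard deviation

# Data frames: tabular data, one observation per row
survey <- data.frame(
  age    = ages,
  income = c(21000, 44000, 52000, 30000, 61000)
)
summary(survey)                  # descriptive statistics for each column
cor(survey$age, survey$income)  # correlation between two variables
```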
Ch. 1-2 and Sections 3.1-3.2: RRRR.
Leek, Jeffrey T. and Roger D. Peng. 2015. ''P-values are Just the Tip of the Iceberg''. Nature. 520: 612.
Stodden, Victoria and Miguez, Sheila 2014. ''Best Practices for Computational Science: Software Infrastructure and Environments for Reproducible and Extensible Research''. Journal of Open Research Software 2(1):e21.
Donoho, David. 2010. ''An Invitation to Reproducible Computational Research''. Biostatistics. 11(3): 385-388.
Herndon, Thomas, Michael Ash, and Robert Pollin. 2014. ''Does High Public Debt Consistently Stifle Economic Growth? A critique of Reinhart and Rogoff''. The Cambridge Journal of Economics. 38(2): 257-279.
Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. ''The Parable of Google Flu: Traps in Big Data Analysis''. Science. 343(6176): 1203-1205.
Ch. 1: Spraul, V. Anton. 2012. Think like a Programmer. San Francisco: No Starch Press.
Week 2: Files, File Structures, Version Control, & Collaboration
We will complete our introduction to the R programming language.
Then we will learn about the importance and use of file structures for your research. Fundamentally, your research is a collection of files (preferably text files). Organising, manipulating, and storing files are at the heart of research practice. Well-organised and stored files are crucial for enabling collaboration and making your research reproducible. We will learn how file systems work, as well as how to organize, version control, and store research files to enable collaboration and reproducibility.
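As a preview, the core Git workflow we will practice looks like this from the command line (the file name, identity, and commit message are all illustrative):

```shell
# A minimal Git workflow sketch: create a repository,
# record a change, and inspect the history
mkdir demo-repo && cd demo-repo
git init -q
git config user.email "you@example.org"   # identity recorded with each commit
git config user.name  "Your Name"
echo "# My Project" > README.md           # a descriptive README
git add README.md                         # stage the change
git commit -q -m "Add project README"     # record it with a message
git log --oneline                         # view the history
```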
Ch. 2-4: Crawley, Michael J. 2005. Statistics: An Introduction Using R. John Wiley and Sons Ltd., Chichester.
Ch. 4-5: RRRR.
Wilson, Greg. 2014. ''Why Do Scientists Want to Learn About Code Review?''. Mozilla Science Lab.
Use the nice interactive introduction to Git from the Code School.
Making Your Code Citable. 2014. GitHub Guides.
King, Gary. 2007. "An Introduction to the Dataverse Network as an Infrastructure for Data Sharing". Sociological Methods and Research. 36(2):173-199.
If you additionally want to get really good at command line file management (pretty much what the command line is best at) a great book to use is:
- Shotts Jr., William E. 2012. ''The Linux Command Line: A Complete Introduction''. No Starch Press, San Francisco.
Style Guides: it generally doesn't matter which style guide you use for your code, but it is good to agree on a style with your team and stick to it. Otherwise it will take longer to figure out what your teammates are doing. If your teammates have difficulty understanding your code, other researchers will be even less able to figure out what you did. Two widely used style guides are Google's R Style Guide and Hadley Wickham's style guide from Advanced R.
Part II: Markup languages and literate programming
Week 3: Introduction to Markup Languages and Literate Programming (1)
A markup language is a set of instructions that allows you to take a text file and turn it into a formatted presentation document such as a PDF or webpage. Markup languages are a crucial tool for collaborative data science for at least two reasons. First, they enable literate programming--the combination of computer code and a human-readable description of that code--which is the foundation of highly reproducible research. Second, data is often embedded in markup languages, especially on websites. Understanding how markup languages work will enable you to gather this data more easily.
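As a minimal sketch, an R Markdown document interleaves prose and executable R chunks like this (the title and output format here are illustrative):

````markdown
---
title: "A Minimal Literate Document"
output: pdf_document
---

The mean speed in the built-in `cars` data is computed when
the document is compiled, so the text and the results never
fall out of sync:

```{r}
mean(cars$speed)
```
````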
Section 3.3 and Ch. 11, 13: RRRR.
RStudio. 2015. RMarkdown--Dynamic Documents for R.
RStudio. 2015. Pandoc Markdown.
RStudio. 2015. Presentations with ioslides.
Part III: Data gathering, transformation, & analysis
Week 4: Introduction to Markup Languages and Literate Programming (2) + Automatic Data Gathering via Curl, and API Packages
We will finish up our introduction to markup languages by venturing into two more advanced languages: HTML and LaTeX/BibTeX.
Then we will begin to examine how to access and clean social science data. Most social science data sets are now available for download online. This week we will learn how to programmatically access this data and clean it so that it can be used for statistical analysis. We will also consider the benefits and challenges of governments increasing the openness and accessibility of their data.
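The basic download-then-clean pattern looks like the following sketch. read.csv() accepts a URL directly (e.g. `read.csv("https://example.org/some-data.csv")`); to keep the example self-contained it uses a small local file instead, and the variable names are illustrative:

```r
# Write a small CSV to a temporary file to stand in for a download
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Country,GDP per capita",
             "A,1000",
             "B,NA",
             "C,3000"), csv_path)

# Read it in; check.names = FALSE keeps the original header text
gdp <- read.csv(csv_path, stringsAsFactors = FALSE, check.names = FALSE)

# Cleaning: code-friendly column names and dropping missing values
names(gdp) <- c("country", "gdp_pc")
gdp <- gdp[!is.na(gdp$gdp_pc), ]
mean(gdp$gdp_pc)   # 2000
```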
Ch. 6-7: RRRR
Wickham, Hadley. 2014. “Tidy Data”. Journal of Statistical Software 59(10): 1–23.
Janssen, Marijn, Yannis Charalabidis, and Anneke Zuiderwijk. 2012. ''Benefits, Adoption Barriers and Myths of Open Data and Open Government''. Information Systems Management. 29(4): 258-268.
Leeper, Thomas J. 2014. ''Archiving Reproducible Research with R and Dataverse''. The R Journal.
Week 5: Automatic Data Gathering Via Web Scraping + Automatic table and static visualisation
A considerable amount of social science data is not stored in traditional data table type formats. Instead it is embedded in webpages. To access this data we will learn the basics of web scraping. We will also be examining in more detail ways to transform data, particularly with the dplyr package.
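A small sketch of the kind of dplyr transformation we will cover, using R's built-in mtcars data (assumes the dplyr package is installed):

```r
library(dplyr)

# Filter, group, summarise, and sort in one readable pipeline
efficiency <- mtcars %>%
  filter(cyl %in% c(4, 6)) %>%        # keep 4- and 6-cylinder cars
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg),     # mean fuel efficiency per group
            n_cars   = n()) %>%
  arrange(desc(mean_mpg))             # most efficient group first
efficiency
```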
We will then learn how to automatically generate summary and results tables for multiple markup languages using stargazer. Next we will cover best practices for static descriptive and inferential data visualisation, including avoiding optical illusions that distort data presentations and accommodating readers with visual impairments. We will also cover specific R packages for creating static visualisations, primarily ggplot2, and ggmap for mapping.
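As a preview, a basic ggplot2 figure following these ideas might look like the sketch below (assumes the ggplot2 package is installed; the labels and file name are illustrative):

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                            # scatter: weight vs fuel efficiency
  geom_smooth(method = "lm", se = TRUE) +   # linear fit with confidence band
  labs(x = "Weight (1000 lbs)",
       y = "Miles per gallon",
       title = "Heavier cars are less fuel efficient")

# Save the figure for inclusion in a paper or website
ggsave(file.path(tempdir(), "weight-mpg.pdf"), p, width = 6, height = 4)
```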
Wickham, Hadley. 2010. ''stringr: modern, consistent string processing''. The R Journal. 2(2): 38-40.
Ch. 9: RRRR.
Gelman, Andrew. 2011. ''Tables as Graphs: the Ramanujan Principle''. Significance 8(4): 183.
Ch. 1, 4, and 9: Tufte, Edward R. 2001. The Visual Display of Quantitative Information. Cheshire, Connecticut: Graphics Press.
A key tool for scraping websites (and dealing with text in general) is Regular Expressions. You can think of these as patterns of text to search for. To prepare for class read the helpful regular expressions overview by Greg Bacon and practice using them with the RegexOne website.
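A few examples of regular expressions in R, using base functions such as grepl(), sub(), and gsub() (the strings are illustrative):

```r
urls <- c("http://example.org/data.csv",
          "https://example.org/page.html",
          "ftp://example.org/archive.zip")

grepl("^https?://", urls)   # which strings start with http(s)?
                            # returns TRUE TRUE FALSE

sub("^.*\\.", "", urls)     # strip everything up to the last dot,
                            # leaving the file extensions

gsub("[0-9]+", "#", "Room 404, Floor 3")   # replace digit runs with '#'
                                           # returns "Room #, Floor #"
```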
Ehrenberg, A S C. 1977. ''Rudiments of Numeracy''. Journal of the Royal Statistical Society. Series A General 140(3): 277–97.
Schenker, N., & Gentleman, J. F. 2001. ''On Judging the Significance of Difference by Examining the Overlap Between Confidence Intervals''. The American Statistician, 55(3), 182–186.
Wickham, H., Cook, D., Hofmann, H., & Buia, A. (2010). Graphical Inference for Infovis. IEEE Transactions on Visualization and Computer Graphics, 16(6): 973–979.
Gelman, Andrew, and Phillip N Price. 1999. ''All Maps of Parameter Estimates Are Misleading.'' Statistics in Medicine 18: 3221–34.
Raftery, Adrian. 2016. "Use and Communication of Probabilistic Forecasts." Statistical Analysis and Data Mining: The ASA Data Science Journal. 9(6): 397-410.
Fruehwald, Josef. 2012. AVML 2012: ggplot2.
Donahue, Rafe M. J. 2011. Fundamental Statistical Concepts in Presenting Data: Principles for Constructing Better Graphics. Version 2.11.
Part IV: Communicating results from statistical analyses
Week 6: Statistical Modeling with R
We will conclude our discussion of table and static figure generation in R.
We will then learn how to fit and evaluate a variety of statistical models using the R language, including simple linear models and logistic regression models for categorical dependent variables.
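A minimal sketch of both model types, using R's built-in mtcars data:

```r
# A simple linear model: fuel efficiency as a function of weight
lm_fit <- lm(mpg ~ wt, data = mtcars)
summary(lm_fit)$coefficients    # estimates, standard errors, p-values

# Logistic regression for a binary outcome:
# manual (1) vs automatic (0) transmission
glm_fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
coef(glm_fit)                   # coefficients on the log-odds scale
```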
Ch. 7-8, 11: Crawley, Michael J. 2005. Statistics: An Introduction Using R. John Wiley and Sons Ltd., Chichester.
Croissant, Yves and Giovanni Millo. 2008. ''Panel Data Econometrics in R: the plm Package''. Journal of Statistical Software. 27(2): 1-43.
King, Gary, Michael Tomz, and Jason Wittenberg. 2000. ''Making the Most of Statistical Analyses: Improving Interpretation and Presentation''. American Journal of Political Science. 44(2): 347-361.
Week 7: Dynamic visualisation + Prepare Collaborative Research Project
We will learn how to create dynamic, interactive visualisations. In addition, we will bring together all of the tools we have learned to conduct an original collaborative and reproducible research project. You will present the results from the project in multiple media, including as a paper, a presentation to the class, and a website. The project should ideally be the starting point of your thesis. This is an opportunity for you to work on your project and ask questions/get immediate feedback from the instructor and classmates.
Sigal, Matthew. 2011. Make it Pretty: Graphical Post-Processing with Adobe Illustrator. Presentation to York University Department of Psychology Quantitative Methods Brownbag.
Part V: Collaborative research project
Week 8: Present Results
Licensed under MIT