I am constantly trying to improve this course. Provide feedback.
My office hours are:
- Tuesdays 4pm-5pm
- Wednesdays 10:30-11:30am
You can book time here.
Note: While I have provided PDF versions of the lectures (in folders above), they are best viewed in the original HTML format.
The course is broken up into three rough sections.
- Part 1 covers the basics of organizing empirical projects and gathering data; these skills are not "big data" specific
- Part 2 covers data description, econometrics, and causal inference that are possible with big data
- Part 3 covers machine learning techniques that are possible with big data
Parts 2 and 3 will highlight examples of using big data to address social problems.
This is in progress and subject to change.
This class is about helping you build good habits for doing organized and reproducible empirical work. It is not about learning specific R packages or functions.
- Organize empirical projects that are replicable, reproducible, and collaborative using good programming practices
- Collect and clean big or novel datasets using APIs, web scraping, and other methods
- Use Big Data to generate key insights about economic opportunity, inequality, and other social problems
- Understand the differences between prediction, causality, and description, and when to apply each
- Explain what data science is, and how Big Data differs from other types of data
- All problem sets and lectures are linked above in the calendar
- The repository for each problem set, these course materials, and your class presentations are all linked in the organization page
This is an extremely challenging course. To help you succeed, I have outlined expectations for both you and me.
- Link to lecture slides and problem sets in the calendar
- Post any software you need to download or other materials you need to prep for class in the calendar with 24 hours' notice
- Outline learning goals at the top of each lecture
- Clearly explain the expectations for each problem set
- Provide examples of skill sets to be used on problem sets in class
- Grade your problem sets within two weeks (i.e. before the next problem set is due)
- Post all problem set solutions to the repository within a week of the problem set being due
- Check the GitHub Issues tab (for all repositories) at least once per day to answer questions
- Check the calendar within 24 hours of each lecture to see any materials you need to download/review
- Fork the main problem set repository within 48 hours of the problem set being posted
- Open problem set data and code within 48 hours of the problem set being posted
- Work on problem sets in groups, but turn in your own code
- Post questions about material or problem sets to GitHub Issues unless it is of a private matter (e.g. grades, extensions)
- There is a GitHub Issues tab within every problem set that I create; please post questions about problem sets directly to the tab for each problem set
- If I receive an email with a question that will benefit everyone, I will ask you to post it to GitHub Issues
- This is so that everyone can benefit from the answer
- Also, it will encourage collaboration
- Use computers in class for class-related activities only
- Seek out solutions to coding problems you run into
- Read error messages and see if you can immediately solve the problem
- Think before going to Google/ChatGPT: "How would I read a small portion of a large dataset in R?" (Use these services proactively, not reactively)
Most classes will be divided into a "lecture" and an "interactive" component. During the lecture, computers will be closed. During the interactive component, computers will be open for you to work through it.
This course is taught in R, but the goal is not for students to learn individual R functions and packages. That is something a person could do using generative AI, existing R vignettes and demos, and other online resources. With that in mind, I expect students in this course to make ample use of the countless free resources on the internet to learn R. Here are a few that I recommend:
- R For Data Science by Hadley Wickham and Garrett Grolemund
- Advanced R by Hadley Wickham
- Geocomputation with R by Robin Lovelace, Jakub Nowosad, and Jannes Muenchow
- Posit Cheatsheets
- R Programming for Data Science by Roger D. Peng
- Bates alumni Eli Mokas and Ian Ramsay's RStudio Tutorial
- RStudio Gallery
- R Markdown: The Definitive Guide by Yihui Xie, J. J. Allaire, and Garrett Grolemund
- An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani
- Data Science for Economists and Other Animals by Grant McDermott and Ed Rubin
- Causal Inference: The Mixtape by Scott Cunningham
- The Effect by Nick Huntington-Klein
- Spatial Data Science by Edzer Pebesma and Roger Bivand
- Data Visualization: A practical introduction by Kieran Healy
- Curated List by Nathan Tefft
- Library of Statistical Techniques (LOST)
- Code and Data for the Social Sciences: A Practitioner's Guide by Matthew Gentzkow and Jesse Shapiro
- Coding for Economists: A Language-Agnostic Guide
- happygitwithr by Jenny Bryan
You are actively encouraged to use generative AI assistants in this class. These can be used to improve your code, refine your writing, iterate on your ideas, and more.
- Sign-up for ChatGPT
- Sign-up for GitHub CoPilot (Note: you do not sign up through this organization; you sign up through your own personal GitHub account as a student.)
- Tips to get better with ChatGPT
- Integration of AI with R
Scheduled hours for R held in the Student Academic Support Center (SASC) of the Library are:
- Sunday | 7:30-9pm
- Monday | 12-1pm, 2:30pm-4pm
- Tuesday | 12-2:30pm, 6-7:30pm
- Wednesday | 11am-1pm, 6-7:30pm
- Thursday | 12-4pm, 6-7:30pm
- Friday | 11am-12pm
Chrissy Aman is our Course-Attached tutor. She will host office hours in the SASC and will be available for individual appointments. Her hours are:
- Fridays at 6:45pm
- Sundays at 10am-12pm in PGill 227
Chrissy can help you troubleshoot R. She does not have the solutions to the problem sets, but she can help you figure them out.
Having trouble with R on your computer?
To get you up and running and writing R code in no time, I have containerized this course so that you have a ready-to-use R coding environment out of the box.
Click the green "<> Code" button at the top right on this repository page, and then select "Create codespace on main". (GitHub CodeSpaces is available with GitHub Enterprise and GitHub Education.)
To open RStudio Server, click the Forwarded Ports "Radio" icon at the bottom of the VS Code Online window.
In the Ports tab, click the Open in Browser "World" icon that appears when you hover in the "Local Address" column for the RStudio row.
This will launch RStudio Server in a new window. Log in with the username and password `rstudio`/`rstudio`.
- NOTE: Sometimes, the RStudio window may fail to open with a timeout error. If this happens, try again, or restart the Codespace.
In RStudio, use the File menu to open the file `test.Rmd`. Use the "Knit" submenu to "Knit as HTML" and view the rendered "R Notebook" Markdown document.
- Note: You may be prompted to install an updated version of the `markdown` package. Select "Yes".
- Note: Pushing/pulling will work a bit differently. In practice, you will use the icon for "Source Control" on the RHS bar, where you can stage things, commit, and push them. You will need to do this to turn in your problem set. See documentation from GitHub on Source Control and Codespaces.
This is an undergraduate course taught by Kyle Coombs. Here is the course description, right out of the syllabus:
Economics is at the forefront of developing statistical methods for analyzing data collected from uncontrolled sources. Since econometrics addresses challenges in estimation such as sample selection bias and treatment effects identification, the discipline is well-suited for the analysis of large and unsystematically collected datasets. This course introduces statistical (machine) learning methods, which have been developed for analyzing such datasets but which have only recently been implemented in economic research. We will cover a variety of topics including data collection, data management, data description, causal inference, and data visualization. The course also explores how econometrics and statistical learning methods cross-fertilize and can be used to advance knowledge in the numerous domains where large volumes of data are rapidly accumulating. We will also cover the ethics of data collection and analysis. The course will be taught in R.
Component | Weight | Graded |
---|---|---|
7 × problem sets (10% each) | 50% | Top 5 |
1 × short presentation | 10% | Top 1 |
1 × final project | 40% | In parts |
Participation | Bonus up to 10% | End of course |
- Short presentations summarize either a key lecture reading, or an (approved) software package/platform.
- Sign up here
- Extensions: Each of you gets three "grace period" days to extend deadlines.
- You can use these days in any way you like, but once they're gone, they're gone.
There are several opportunities for bonus points during the semester:
- A 2.5% bonus on your final grade for issuing a pull request to any open source material -- including these lecture notes. This can be to fix a typo or to fix a bug in the code.
- A 2.5% participation bonus on your final grade that I will award at my discretion.
- I offer a bonus point for each typo corrected on problem sets and solutions. This is capped at 10 points per student per problem set. You must pull request and/or raise an Issue on the corresponding GitHub repository to get credit.
- Participation on GitHub as mentioned
I have given instructions on how to execute a pull request of a specific commit (instead of your entire commit history) in the FAQ.
Throughout the course you will engage in problem sets that deal with actual data. These may seem out of step with what we do in class, but they are designed to get you to think about how to apply the tools we learn in class to real data. As the class progresses, the problem sets will align more neatly with the material.
- Problem sets are coding assignments that get you to play with data using R
- They are extremely challenging, but also extremely rewarding
- With rare exceptions: You will not be given clear-cut code to copy and paste to accomplish these data cleaning tasks, but instead given a set of instructions and asked to figure out how to write code yourself
- You are encouraged to work together on problem sets, but you must write up your own answers (unless it is a group assignment)
- All problem sets will be completed and turned in as GitHub repositories
- Each problem set will be posted as a GitHub repository, which you will fork, set to private, and then clone to your computer (instructions provided in each problem set)
- You will then work on the problem set on your computer, and push your code to GitHub (push often!)
- For each problem set, you will turn in modular code (i.e. separate files do separate things) that accomplishes the tasks outlined in the problem set
- You will also turn in a `.Rmd` file that contains your answers to the questions in the problem set, along with a knitted `.html` or `.pdf` of your `.Rmd`
  - This `.Rmd` will "source" the code you wrote, so I can easily run your code from start to finish by knitting
- Your problem sets will have a sensible folder structure that is easy to navigate (name folders `code`, `data`, `output`, etc.)
- You will turn in your problem sets by pushing your code to GitHub.
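To see how the pieces fit together, here is a minimal sketch of such a `.Rmd` (the script names and question text are hypothetical; the paths assume the `code`/`data`/`output` folder structure above):

````markdown
---
title: "Problem Set 1"
output: html_document
---

```{r setup, message=FALSE}
# Run the modular scripts in order (hypothetical names; adjust to your repo)
source("code/01_clean_data.R")
source("code/02_analysis.R")
```

## Question 1

Your written answer here, referencing objects created by the sourced scripts.
````

Because the setup chunk sources every script, knitting this one file reruns the entire analysis from start to finish.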
Grading
Your problem sets are graded on three dimensions:
- Submission via GitHub (10%): Did you use GitHub to stage, commit, and push your code? Did you submit the assignment on time? Did you submit the assignment in the correct format?
- Quality of code (30%): Is it well-commented? Is it easy to follow? Can I run it?
  - Any scripts needed to run your code should be included in the repository and sourced in the `.Rmd` file
  - Write code that automates as much of the process as possible. For example, if you need to download a file, write code that downloads the file automatically
  - If you cannot figure out how to automate a step, you can write a comment explaining what I need to do to run your code (you will lose very few points)
- Quality of presentation of graphs and tables (30%): Are they well-labeled? Do they have titles? Do they have legends? Are they formatted well?
- Quality of answers (30%): Are they clear? Do they answer the question?
The solutions are made public within a week of the problem set being due.
In an effort to incentivize you to see coding as an ongoing process of learning and improvement, I will allow you to improve the coding and presentation quality portions of your grade on any problem set. However, you cannot just copy and paste the solutions.
Instead, you must provide carefully commented explanations of each step of the code -- whether from the solutions or of your own invention. This is a great way to learn, but it is also a lot of work.
Example. You might add a comment like this to the top of your code:
```r
# Create directories; suppress the warning that a directory already exists
suppressWarnings({
  dir.create("data")
  dir.create("documentation")
  dir.create("code")
  dir.create("output")
  dir.create("writing")
})
```
You must let me know that you are submitting for a regrade.
You will write a final project over the course of the semester. Further details are available here.
Participation on GitHub is a bonus worth up to 10% of your grade. I will rate participation based on the following criteria:
- Are you posting thoughtful questions to GitHub Issues? Follow the guidelines on stackoverflow for posting a good question.
- Are you replying to questions on GitHub Issues? Follow the guidelines on stackoverflow to write a good answer.
The goal is to encourage you to work together to solve problems. This is one of the most important skills you can take away from this, and really any, course. I also want to incentivize you to think carefully about how you post. Be kind and respectful, and endeavor to be clear, concise, and helpful.
For each problem set, use the Issues tab for that specific problem set's repository. For questions about course materials, use the Issues tab on the main course repository.
I will be monitoring the GitHub Issues tab for each repository and will award participation points to those who are actively engaging per the guidelines from stackoverflow.
Note: The main class page is a public repository so others can come and try to use these materials. Please do not post personal information here.
Please raise an issue or submit a pull request. For those taking this course, I offer a 2.5% bonus on your final grade for issuing a pull request to any open source material -- including these lecture notes. This can be to fix a typo or to fix a bug in the code.
Please note that this is a work in progress, with new material being added every week.
If you just want to read the lecture slides or HTML notebooks in your browser, then you should simply scroll up to the Lectures section at the top of this page. Completed lectures will be hyperlinked as soon as they have been added. Remember to check back in regularly to get any updates. Or, you can watch or star the repo to get notified automatically.
If you actually want to run the analysis and code on your own system (highly recommended), then you will need to download the material to your local machine. The best way to do this is to clone the repo via Git and then pull regularly to get updates. Please take a look at these slides if you are unfamiliar with Git or are unsure how to do any of that. Once that's done, you will find each lecture contained in a numbered folder (e.g. `01-intro`). The lectures themselves are written in R Markdown and then exported to HTML format. Click on the HTML files if you just want to view the slides or notebooks.
Please open a new issue. Better yet, please fork the repo and submit an upstream pull request. I'm very grateful for any contributions, but may be slow to respond while this course is still being developed. Similarly, I am unlikely to help with software troubleshooting or conceptual difficulties for non-enrolled students. Others may feel free to jump in, though.
Sure. I already borrowed half of it from Grant McDermott, Tyler Ransom, Raj Chetty, and Stephen Hansen. I have also kept everything publicly available. I ask two favours (like Grant McDermott): 1) Please let me know (email me) if you use material from this course, or have found it useful in other ways. 2) An acknowledgment somewhere in your own syllabus or notes would be much appreciated.
If you want to make a pull request of a specific commit (and not all changes you have made), you have two options:
Option 1: Manually create a new fork
- Create a separate fork of the upstream repository for each commit you want to make a pull request for.
- Clone this separate fork to your local machine, make the changes, commit, and push.
- Pull request from this separate fork.
This is the easiest option, but it does mean you will have to clone multiple forks to your local machine.
Option 2: Use the command line (Git Bash, WSL, Terminal)
- Create a fork of this repository (called the upstream repository) if you have not before
- Clone the forked repo to your local computer
- Add the original repo as a remote called `upstream` (enter `git remote add upstream <upstream-repo-url>`)
- Fetch the upstream repo (`git fetch upstream`)
- Create a branch off the upstream repo (`git checkout -b <pull-request-branch-name> upstream/main`)
- Either:
  - Make the changes you want to make to the code
  - Cherry-pick the specific commit you want to merge as a pull request by typing `git cherry-pick <commit-hash>` into the command line
    - A commit hash is a unique combination of letters and numbers that identifies a specific commit. You can find the commit hash by running `git log` and copying the hash of the commit you want to make a pull request for, OR by clicking on the commit history on GitHub and copying the SHA (the icon with two interlocked squares).
- Push this branch to the forked repository with `git push -u origin <pull-request-branch-name>`
- Return to your forked repo's main branch with `git checkout main`
- Navigate to your forked repository on GitHub and create a pull request from the branch you just pushed (you should see a banner that says "Compare & pull request" when you navigate to your forked repo)
- Make sure:
  - The base repository is the upstream repo and the base is the main branch
  - The head repository is your forked repo and the compare is the branch named `<pull-request-branch-name>`
- Optional: Delete the pull-request branch once it has served its purpose with `git branch -d <pull-request-branch-name>`
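The Option 2 steps can be rehearsed entirely offline before you try them against GitHub. The sketch below simulates the upstream repository and your fork as local directories (all repository names, file names, and commit messages are placeholders; on GitHub the clone URLs would be remote rather than local paths):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Stand-in "upstream" repository with a base commit and a typo-fix commit
git init -q upstream
cd upstream
git -c user.email=you@example.com -c user.name=demo commit -q --allow-empty -m "initial commit"
base=$(git rev-parse HEAD)
echo "fixed a typo" > notes.txt
git add notes.txt
git -c user.email=you@example.com -c user.name=demo commit -q -m "fix: correct a typo"
fix=$(git rev-parse HEAD)
cd ..

# "Fork": clone the upstream, add it as a remote, and fetch it
git clone -q upstream fork
cd fork
git remote add upstream "$tmp/upstream"
git fetch -q upstream

# Branch from the base commit and cherry-pick only the typo-fix commit
git checkout -q -b typo-only "$base"
git -c user.email=you@example.com -c user.name=demo cherry-pick "$fix"
# From here you would push the branch: git push -u origin typo-only
```

The branch `typo-only` now contains exactly one commit on top of the base, which is what you want your pull request to show. (Commits are identified by hash here rather than by branch name, so this works regardless of whether your git defaults to `main` or `master`.)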
The material in this repository is made available under the MIT license.