ECON368-fall2023-big-data-and-economics/big-data-class-materials

Class Materials for Bates ECON/DCS 368: Big Data and Economics

Full syllabus

Lectures | Details | FAQ | License

Feedback

I am constantly trying to improve this course. Provide feedback.

Office hours:

My office hours are:

  • Tuesdays 4pm-5pm
  • Wednesdays 10:30-11:30am

You can book time here.

Lectures

Note: While I have provided PDF versions of the lectures (in folders above), they are best viewed in the original HTML format.

The course is broken up into three rough sections.

  • Part 1 covers basics of empirical organization, data gathering, and organizing that are not "big data" specific
  • Part 2 covers data description, econometrics, and causal inference that are possible with big data
  • Part 3 covers machine learning techniques that are possible with big data

Parts 2 and 3 will highlight examples of using big data to address social problems.

This is in progress and subject to change.

Date Day Topic Do before class Due
Data Science Basics
2023-09-07 Th Introduction to Big Data (.html, .pdf, .Rmd) Read and Install Ch 1, 4-8 of happygitwithr Problem Set 1 due 9/25, solutions
2023-09-12 T Git slides (.html, .pdf, .Rmd) Work through Ch 9-19 of happygitwithr
2023-09-21 Th Empirical Organization slides (.html, .pdf, .Rmd), Data Tips (.html, .pdf, .Rmd) Read Code and Data for Social Sciences
2023-09-26 T R Basics (.html, .pdf, .Rmd) Watch basics of RStudio by Bates alumni Eli Mokas and Ian Ramsay
2023-09-28 Th Data Table (.html, .pdf, .pdf) Tidyverse (.html, .pdf, .Rmd) Ch 1 DS4E
2023-10-03 T CSS (.html, .pdf, .Rmd), Scraping Notes by Jesus Fernández Villaverde and Pablo Guerrón SelectorGadget (Chrome), ScrapeMate (Firefox) Problem Set 2 due 10/09 at 11:59:59, Solutions
2023-10-05 Th APIs (.html, .pdf, .Rmd) JSONView, Sign-up and register for Personal API Key
Causal Inference
2023-10-10 T Opportunity Atlas (.html, .pdf, .Rmd) and Spatial Analysis (.html, .pdf, .Rmd) Watch Geography of Upward Mobility in America starting at 39min Problem Set 3 due 10/30 at 11:59:59 (optional)
2023-10-12 Th Regression Review (.html, .pdf, .Rmd) Read Effect Ch 13 or Mixtape Ch 2, Watch Causal Effects of Neighborhoods
2023-10-17 T Causal Inference (.html, .pdf, .Rmd) Read Effect Ch 13 or Mixtape Ch 2, Watch Causal Effects of Neighborhoods
2023-10-24 T Data Tips (.html, .pdf, .Rmd)
2023-10-26 Th
2023-10-31 T Difference-in-differences (.html, .pdf, .Rmd), Panel data and two-way fixed effects (.html, .pdf, .Rmd) Watch first 40min of Teachers and Charter Schools
2023-11-02 Th Regression Discontinuity Design (.html, .pdf, .Rmd), RDD activity (.html, .pdf, .Rmd) Read Effect Ch 20 or Mixtape Ch 6
Machine Learning
2023-11-07 T Bootstrapping, Functions & Parallel Programming (.html, .pdf, .Rmd), Bootstrapping activity (.html, .pdf, .Rmd) Refer to Chapters 2-4 of DS4E, Chapter 9 of R for Data Science
2023-11-09 Th SQL (see work by Tyler Ransom)
2023-11-14 T Intro to Machine Learning (.html, .pdf, .Rmd), ISLR tidymodels lab (.html), Oregon Schools Decision Tree application by Cianna Bedford-Petersen, Christopher Loan, & Brendan Cullen (.html) Read Athey & Imbens (2019), Mullainathan and Spiess (2017), Refer to ISLR 8.1 Problem Set 4 due 11/13 at 11:59:59
2023-11-16 Th Machine Learning: Bias and Judicial Decisions (.pdf by Raj Chetty and Greg Bruich) Watch Improving Judicial Decisions
2023-11-28 T Causal Forests (.html, .pdf, .Rmd), Application: Causal forests with grf Refer to ISLR Ch 6.1, 6.2 Problem Set 5 due 11/27 at 11:59:59
2023-11-30 Th Regression regularization/penalization (.html, .pdf, .Rmd), Application (.html, .pdf, .Rmd) Read ISLR 8.2
2023-12-05 T Regular expressions, WordClouds (.html, .pdf, .Rmd), Tidy text activities (.html, .pdf, .Rmd) Read Gentzkow (2019): Text as Data
2023-12-07 Th Sentiment Analysis (.html, .pdf, .Rmd) Read Stephens-Davidowitz (2014) Problem Set 6 due 12/11 at 11:59:59
If time T Topic Modeling, LLMs Read Ash and Hansen (2023): Text Algorithms
If time Th AI and bias Read Rambachan et al (2020) and Cowgill et al. (2019)

Goals for this course

This class is about helping you build good habits for doing organized and reproducible empirical work. It is not about learning specific R packages or functions.

  • Organize empirical projects that are replicable, reproducible, and collaborative using good programming practices
  • Collect and clean big or novel datasets using APIs, web scraping, and other methods
  • Use Big Data to generate key insights about economic opportunity, inequality, and other social problems
  • Understand the differences between prediction, causality, and description, and when to apply each
  • Explain what data science is, and how Big Data differs from other types of data

Navigating the course

  • All problem sets and lectures are linked above in the calendar
  • The repository for each problem set, these course materials, and your class presentations are all linked in the organization page

Expectations

This is an extremely challenging course. To help you succeed, I have outlined expectations for both you and me.

For your professor

  • Link to lecture slides and problem sets in the calendar
  • Post any software you need to download or other materials you need to prep for class in the calendar with 24 hours notice
  • Outline learning goals at the top of each lecture
  • Clearly explain the expectations for each problem set
  • Provide examples of skill sets to be used on problem sets in class
  • Grade your problem sets within two weeks (i.e. before the next problem set is due)
    • Post all problem set solutions to the repository within a week of the problem set being due
  • Check the GitHub Issues tab (for all repositories) at least once per day to answer questions

For students

  • Check the calendar within 24 hours of each lecture to see any materials you need to download/review
  • Fork the main problem set repository within 48 hours of the problem set being posted
  • Open problem set data and code within 48 hours of the problem set being posted
  • Work on problem sets in groups, but turn in your own code
  • Post questions about material or problem sets to GitHub Issues unless it is of a private matter (e.g. grades, extensions)
    • There is a GitHub Issues tab within every problem set that I create, please post questions about problem sets directly to the tab for each problem set
    • If I receive an email with a question that will benefit everyone, I will ask you to post it to GitHub Issues
    • This is so that everyone can benefit from the answer
    • Also, it will encourage collaboration
  • Use computers in class for class-related activities only
  • Seek out solutions to coding problems you run into
    1. Read error messages and see if you can immediately solve the problem
    2. Think before going to Google/ChatGPT: "How would I read a small portion of a large dataset in R?" (Use these services proactively, not reactively)

How I will run class

Most classes will be divided into a "lecture" and an "interactive" component. During the lecture, computers will be closed. During the interactive component, computers will be open for you to work through it.

Resources to use for class

This course is taught in R, but the goal is not for students to learn individual R functions and packages. That is something a person could do using generative AI, existing R vignettes and demos, and other online resources. With that in mind, I expect students in this course to make ample use of the countless free resources on the internet to learn R. Here are a few that I recommend:

On R

On R Markdown

Econometrics, Statistics, Data Science with R examples

Staying organized

Large Language Models

You are actively encouraged to use generative AI assistants in this class. These can be used to improve your code, refine your writing, iterate on your ideas, and more.

Student Academic Support Center

Scheduled hours for R held in the Student Academic Support Center (SASC) of the Library are:

  • Sunday | 7:30-9pm
  • Monday | 12-1pm, 2:30pm-4pm
  • Tuesday | 12-2:30pm, 6-7:30pm
  • Wednesday | 11am-1pm, 6-7:30pm
  • Thursday | 12-4pm, 6-7:30pm
  • Friday | 11am-12pm

Course-Attached Tutor

Chrissy Aman is our Course-Attached tutor. She will host office hours in the SASC and will be available for individual appointments. Her hours are:

  • Fridays at 6:45pm
  • Sundays at 10am-12pm in PGill 227

Chrissy can help you troubleshoot R. She does not have the solutions to the problem sets, but she can help you figure them out.

GitHub Codespaces

Having trouble with R on your computer?

To get you up and running with R quickly, I have containerized this course so that you have an R coding environment that works out of the box.

Dev Containers in GitHub Codespaces

Click the green "<> Code" button at the top right on this repository page, and then select "Create codespace on main". (GitHub Codespaces is available with GitHub Enterprise and GitHub Education.)

To open RStudio Server, click the Forwarded Ports "Radio" icon at the bottom of the VS Code Online window.

In the Ports tab, click the Open in Browser "World" icon that appears when you hover in the "Local Address" column for the Rstudio row.

This will launch RStudio Server in a new window. Log in with the username and password rstudio/rstudio.

  • NOTE: Sometimes, the RStudio window may fail to open with a timeout error. If this happens, try again, or restart the Codespace.

In RStudio, use the File menu to open the file test.Rmd. Use the "Knit" submenu to "Knit as HTML" and view the rendered "R Notebook" Markdown document.

  • Note: You may be prompted to install an updated version of the markdown package. Select "Yes".

  • Note: Pushing/pulling will work a bit differently. In practice, you will use the "Source Control" icon on the RHS bar, where you can stage, commit, and push your changes. You will need to do this to turn in your problem set. See the GitHub documentation on Source Control and Codespaces.

Other details

This is an undergraduate course taught by Kyle Coombs. Here is the course description, right out of the syllabus:

Economics is at the forefront of developing statistical methods for analyzing data collected from uncontrolled sources. Since econometrics addresses challenges in estimation such as sample selection bias and treatment effects identification, the discipline is well-suited for the analysis of large and unsystematically collected datasets. This course introduces statistical (machine) learning methods, which have been developed for analyzing such datasets but which have only recently been implemented in economic research. We will cover a variety of topics including data collection, data management, data description, causal inference, and data visualization. The course also explores how econometrics and statistical learning methods cross-fertilize and can be used to advance knowledge in the numerous domains where large volumes of data are rapidly accumulating. We will also cover the ethics of data collection and analysis. The course will be taught in R.

Grading policy

Component Weight Graded
7 × problem sets (10% each) 50% Top 5
1 × short presentation 10% Top 1
1 × final project 40% In parts
Participation Bonus up to 10% End of course
  • Short presentations summarize either a key lecture reading, or an (approved) software package/platform.
  • Extensions: Each of you gets three "grace period" days to extend deadlines.
  • You can use these days in any way you like, but once they're gone, they're gone.

Bonus points:

There are several opportunities for bonus points during the semester:

  1. A 2.5% bonus on your final grade for issuing a pull request to any open source material -- including these lecture notes. This can be to fix a typo or to fix a bug in the code.
  2. A 2.5% participation bonus on your final grade that I will award at my discretion.
  3. I offer a bonus point for each typo corrected on problem sets and solutions. This is capped at 10 points per student per problem set. You must pull request and/or raise an Issue on the corresponding GitHub repository to get credit.
  4. Participation on GitHub as mentioned

I have given instructions on how to execute a pull request of a specific commit (instead of your entire commit history) in the FAQ.

Problem sets

Throughout the course you will engage in problem sets that deal with actual data. These may seem out of step with what we do in class, but they are designed to get you to think about how to apply the tools we learn in class to real data. As the class progresses, the problem sets will align more neatly with the material.

  • Problem sets are coding assignments that get you to play with data using R
  • They are extremely challenging, but also extremely rewarding
  • With rare exceptions: You will not be given clear-cut code to copy and paste to accomplish these data cleaning tasks, but instead given a set of instructions and asked to figure out how to write code yourself
  • You are encouraged to work together on problem sets, but you must write up your own answers (unless it is a group assignment)
  • All problem sets will be completed and turned in as GitHub repositories

What you will turn in:

  • Each problem set will be posted as a GitHub repository, which you will fork, set to private, and then clone to your computer (instructions provided in each problem set)
  • You will then work on the problem set on your computer, and push your code to GitHub (push often!)
  • For each problem set, you will turn in modular code (i.e. separate files do separate things) that accomplishes the tasks outlined in the problem set
  • You will also turn in a .Rmd file that contains your answers to the questions in the problem set along with a knitted .html or .pdf of your .Rmd
    • This .Rmd will "source" the code you wrote, so I can easily run your code from start to finish by knitting
  • Your problem sets will have a sensible folder structure that is easy to navigate (name folders code, data, output, etc.)
  • You will turn in your problem sets by pushing your code to GitHub.
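
The turn-in steps above can be sketched on the command line. This is only an illustration under stated assumptions: a local bare repository (my-fork.git) stands in for your private fork on GitHub, and the file and folder names (code/01-clean.R, answers.Rmd, and so on) are hypothetical placeholders, not the names a real problem set will use.

```shell
#!/usr/bin/env bash
# Sketch of the submission workflow. A local bare repository stands in for
# your private fork on GitHub; all file and folder names are hypothetical.
set -euo pipefail

# Throwaway identity so commits work on a machine without git configured
export GIT_AUTHOR_NAME="Student" GIT_AUTHOR_EMAIL="student@example.com"
export GIT_COMMITTER_NAME="Student" GIT_COMMITTER_EMAIL="student@example.com"

work="$(mktemp -d)"
cd "$work"

# Stand-in for the fork, then clone it (hide the "empty repository" warning)
git init -q --bare my-fork.git
git clone -q my-fork.git problem-set 2>/dev/null
cd problem-set
git symbolic-ref HEAD refs/heads/main

# Sensible folder structure; git does not track empty folders, so add placeholders
mkdir -p code data output
touch data/.gitkeep output/.gitkeep

# Modular code: separate files do separate things
printf '# Step 1: clean the raw data\n' > code/01-clean.R
printf '# Step 2: build tables and figures\n' > code/02-analyze.R
printf 'Answers go here; this file sources the scripts in code/.\n' > answers.Rmd

# Stage, commit, and push (push often!)
git add answers.Rmd code data output
git commit -q -m "Set up problem set folder structure"
git push -q -u origin main

# Count the files that reached the remote copy of main
tracked_count="$(git ls-tree -r --name-only origin/main | wc -l | tr -d ' ')"
echo "files pushed: $tracked_count"
```

With a real problem set, the clone URL would be your fork on GitHub, and you would repeat the stage, commit, and push steps each time you make progress.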

Grading

Your problem sets are graded on four dimensions:

  1. Submission via GitHub (10%): Did you use GitHub to stage, commit, and push your code? Did you submit the assignment on time? Did you submit the assignment in the correct format?
  2. Quality of code (30%): Is it well-commented? Is it easy to follow? Can I run it?
    • Any scripts needed to run your code should be included in the repository and sourced in the .Rmd file
    • Write code that automates as much of the process as possible. For example, if you need to download a file, write code that downloads the file automatically
    • If you cannot figure out how to automate a step, you can write a comment explaining what I need to do to run your code (you will lose very few points)
  3. Quality of presentation of graphs and tables (30%): Are they well-labeled? Do they have titles? Do they have legends? Are they formatted well?
  4. Quality of answers (30%): Are they clear? Do they answer the question?

Solutions

The solutions are made public within a week of the problem set being due.

Improving your grade

In an effort to incentivize you to see coding as an ongoing process of learning and improvement, I will allow you to improve the coding and presentation quality portions of your grade on any problem set. However, you cannot just copy and paste the solutions.

Instead, you must provide carefully commented explanations of each step of the code -- whether from the solutions or of your own invention. This is a great way to learn, but it is also a lot of work.

Example. You might add a comment like this to the top of your code:

# Create directories, suppress warning that the directory already exists
# (folder names are quoted strings; unquoted names would be read as R objects)
suppressWarnings({
    dir.create("data")
    dir.create("documentation")
    dir.create("code")
    dir.create("output")
    dir.create("writing")
})

You must let me know that you are submitting for a regrade.

Final Project

You will write a final project over the course of the semester. Further details are available here.

Participation

Participation on GitHub is a bonus worth up to 10% of your grade. I will rate participation based on the following criteria:

  • Are you posting thoughtful questions to GitHub Issues? Follow the guidelines on stackoverflow for posting a good question.
  • Are you replying to questions on GitHub Issues? Follow the guidelines on stackoverflow to write a good answer.

The goal is to encourage you to work together to solve problems. This is one of the most important skills you can take away from this, and really any, course. I also want to incentivize you to think carefully about how you post. Be kind and respectful, as much as you endeavor to be clear, concise, and helpful.

For each problem set, use the Issues tab for that specific problem set. For course materials, you can use that tab.

I will be monitoring the GitHub Issues tab for each repository and will award participation points to those who are actively engaging per the guidelines from stackoverflow.

Note: The main class page is a public repository so others can come and try to use these materials. Please do not post personal information here.

FAQ

If you find a typo in these lecture notes

Please raise an issue or submit a pull request. For those taking this course, I offer a 2.5% bonus on your final grade for issuing a pull request to any open source material -- including these lecture notes. This can be to fix a typo or to fix a bug in the code.

How do I download this material and keep up to date with any changes?

Please note that this is a work in progress, with new material being added every week.

If you just want to read the lecture slides or HTML notebooks in your browser, then you should simply scroll up to the Lectures section at the top of this page. Completed lectures will be hyperlinked as soon as they have been added. Remember to check back in regularly to get any updates. Or, you can watch or star the repo to get notified automatically.

If you actually want to run the analysis and code on your own system (highly recommended), then you will need to download the material to your local machine. The best way to do this is to clone the repo via Git and then pull regularly to get updates. Please take a look at these slides if you are unfamiliar with Git or are unsure how to do any of that. Once that's done, you will find each lecture contained in a numbered folder (e.g. 01-intro). The lectures themselves are written in R Markdown and then exported to HTML format. Click on the HTML files if you just want to view the slides or notebooks.

I've spotted a mistake or would like to contribute

Please open a new issue. Better yet, please fork the repo and submit an upstream pull request. I'm very grateful for any contributions, but may be slow to respond while this course is still being developed. Similarly, I am unlikely to help with software troubleshooting or conceptual difficulties for non-enrolled students. Others may feel free to jump in, though.

Can I use/adapt your material for a similar course that I'm teaching?

Sure. I already borrowed half of it from Grant McDermott, Tyler Ransom, Raj Chetty, and Stephen Hansen. I have also kept everything publicly available. I ask two favours (like Grant McDermott): 1) Please let me know (by email) if you do use material from this course, or have found it useful in other ways. 2) An acknowledgment somewhere in your own syllabus or notes would be much appreciated.

Pull Request of a Specific Commit

If you want to make a pull request of a specific commit (and not all changes you have made), you have two options:

Option 1: Manually create a new fork

  1. Create a separate fork of the upstream repository for each commit you want to make a pull request for.
  2. Clone this separate fork to your local machine, make the changes, commit, and push.
  3. Pull request from this separate fork.

This is the easiest option, but it does mean you will have to clone multiple forks to your local machine.

Option 2: Use the command line (Git Bash, WSL, Terminal)

  1. Create a fork of this repository (called the upstream repository) if you have not before
  2. Clone the forked repo to your local computer
  3. Add the original repo as a remote called upstream (enter git remote add upstream <upstream-repo-url>)
  4. Fetch the upstream repo (git fetch upstream)
  5. Create a branch of this upstream repo (git checkout -b <pull-request-branch-name> upstream/main)
  6. Either:
    • Make the changes you want to make to the code
    • Cherry pick the specific commit you want to merge as a pull request by typing git cherry-pick <commit-hash> into the command line
      • A commit hash is a unique combination of letters and numbers that identifies a specific commit. You can find the commit hash by running git log and copying the hash of the commit you want to make a pull request for OR by clicking on the commit history on GitHub and copying the SHA (the icon with two interlocked squares.)
  7. Push this branch to the forked repository with git push -u origin <pull-request-branch-name>
  8. Return to your local main branch with git checkout main
  9. Navigate to your forked repository on GitHub and create a pull request from the branch you just pushed (you should see a banner that says "Compare & pull request" when you navigate to your forked repo)
  10. Make sure:
    • The base repository is the upstream repo and the base is the main branch
    • The head repository is your forked repo and the compare is the branch named <pull-request-branch-name>
  11. Optional: Destroy the pull-request-branch once it has served its purpose with git branch -d <pull-request-branch-name>
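
The steps in Option 2 can be rehearsed without touching GitHub. The sketch below is an illustration, not the required procedure: it assumes throwaway local repositories (upstream.git and fork.git) standing in for the upstream repo and your fork, and the branch name fix-typo and the file names are hypothetical. Step 9 (opening the pull request) happens on GitHub and is not shown.

```shell
#!/usr/bin/env bash
# Rehearsal of Option 2 with local stand-ins for GitHub.
# upstream.git, fork.git, fix-typo, and the file names are hypothetical.
set -euo pipefail

# Throwaway identity so commits work on a machine without git configured
export GIT_AUTHOR_NAME="Student" GIT_AUTHOR_EMAIL="student@example.com"
export GIT_COMMITTER_NAME="Student" GIT_COMMITTER_EMAIL="student@example.com"

work="$(mktemp -d)"
cd "$work"

# Build a small upstream repository containing a typo
git init -q seed
git -C seed symbolic-ref HEAD refs/heads/main
printf 'Lecture notse\n' > seed/notes.md
git -C seed add notes.md
git -C seed commit -q -m "Add notes"
git clone -q --bare seed upstream.git

# Steps 1-2: "fork" the upstream repo and clone the fork
git clone -q --bare upstream.git fork.git
git clone -q fork.git local
cd local

# Your main branch has unrelated work plus the one commit you want to share
printf 'my unrelated experiment\n' > scratch.md
git add scratch.md
git commit -q -m "Unrelated work"
printf 'Lecture notes\n' > notes.md
git commit -q -am "Fix typo in notes"
fix_hash="$(git rev-parse HEAD)"   # in practice, copy this from git log

# Steps 3-5: add upstream, fetch it, and branch from upstream/main
git remote add upstream "$work/upstream.git"
git fetch -q upstream
git checkout -q -b fix-typo upstream/main

# Step 6: cherry-pick only the commit you want in the pull request
git cherry-pick "$fix_hash" > /dev/null

# Steps 7-8: push the branch to the fork, then return to main
git push -q -u origin fix-typo
git checkout -q main

# The pushed branch should differ from upstream/main by a single commit
ahead="$(git rev-list --count upstream/main..origin/fix-typo)"
echo "commits in pull request branch: $ahead"
```

The final check should report a single commit, meaning the unrelated work on main stays out of the pull request branch.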

License

The material in this repository is made available under the MIT license.
