Class: Tuesdays from 9am to 12pm in Meyerson Hall, room B2.
Office hours: Monday from 4pm to 7pm. Email galkamaxd at gmail to schedule a time.
Instructor: Max Galka (galkamaxd at gmail dot com)
TA: Evan Cernea (ecernea at sas dot upenn dot edu)
The purpose of this course is to familiarize students with the “pipeline” approach to data science. This involves the process of gathering data, storing the data, analyzing the data, and visualizing the data such that non-technical decision makers can make sense of it. The course is broken down accordingly into four sections.
- Data collection: Students will learn how to gather data by way of web scraping, APIs, and other unstructured sources.
- Databases: This part of the course teaches students how to store this data for efficient retrieval and analysis.
- Analytics: Students will learn a range of machine-driven techniques for analyzing structured and unstructured data.
- Data visualization: The last part of the course teaches students how to present the results of their analysis visually using R and the web application framework Shiny.
The course will be conducted in weekly sessions devoted to lectures, demonstrations, and in-class projects.
There is one required final project at the end of the semester. Homework will be assigned before the close of class and will be due the following Tuesday by the end of the day. Five of the homework assignments will be explicitly required. The remainder are optional but will count toward the participation component of your final grade.
For the final project, students will replicate the pipeline approach on a dataset (or datasets) of their choosing. The final deliverable will be a web-based data visualization and an accompanying description, including a summary of the results and the methods used in each step of the process (collection, storage, analysis, and visualization).
Assignment Q&A:
- If you get stuck, the first step should always be to see if you can find the answer to your question online. In particular, Stack Overflow, Stack Exchange: GIS, and the rest of the Stack Exchange family are great resources.
- You are encouraged to ask [and answer] questions via the Slack channel rather than email, in case other students have the same question.
- Evan and I are available for in-depth discussion about assignments during office hours.
The grading breakdown is as follows: 50% for homework, 40% for the final project, and 10% for participation.
There will be five required homework assignments, due at the beginning of class. Late homework will be accepted for up to one week after the deadline, with a 10% deduction. No credit will be given for homework that is more than one week late.
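As a rough illustration of how the weights and the late policy above combine, here is a minimal sketch. It is not the official grade calculation. The function names are hypothetical, scores are assumed to be on a 0 to 100 scale, the 10% deduction is assumed to mean 10 points off, and lateness is counted in whole days (the syllabus does not specify any of these details).

```python
from datetime import date

# Weights taken from the grading breakdown above.
WEIGHTS = {"homework": 0.50, "final_project": 0.40, "participation": 0.10}


def late_adjusted(score, due, submitted):
    """Apply the late policy to one homework score (0-100 scale).

    Assumption: "10%" is treated as 10 points off, and lateness is
    measured in whole calendar days after the due date.
    """
    days_late = (submitted - due).days
    if days_late <= 0:
        return score              # on time
    if days_late <= 7:
        return max(score - 10, 0)  # up to one week late
    return 0                      # more than one week late: no credit


def final_grade(homework_avg, final_project, participation):
    """Weighted course grade from the three component scores (0-100)."""
    return (WEIGHTS["homework"] * homework_avg
            + WEIGHTS["final_project"] * final_project
            + WEIGHTS["participation"] * participation)


# Example: a homework due Jan 30 and handed in three days late.
hw = late_adjusted(95, due=date(2018, 1, 30), submitted=date(2018, 2, 2))
print(hw, final_grade(homework_avg=90, final_project=80, participation=100))
```

The year in the example dates is arbitrary; only the gap between the due date and the submission date matters.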
This course relies on use of the R Statistical Package in conjunction with Shiny and other associated extensions. For geospatial topics, we will also use QGIS.
Class # | Date | Topic | Homework* |
---|---|---|---|
Week 1 | Jan 16 | ggplot2, QGIS, data visualization fundamentals | |
Week 2 | Jan 23 | Data frames, tidyverse, map projections | Assign HW 1 |
Week 3 | Jan 30 | Geocoding/mapping: ggmap, sf (simple features) package | |
Week 4 | Feb 6 | Databases: Postgres, SQL | |
Week 5 | Feb 13 | Databases: PostGIS, spatial queries | Assign HW 2 |
Week 6 | Feb 20 | Web scraping 1: The DOM, web inspector | |
Week 7 | Feb 27 | Web scraping 2: CSS selectors, scraping dynamic pages | Assign HW 3 |
Spring Break | | | |
Week 8 | Mar 13 | Unstructured data: Twitter API | |
Week 9 | Mar 20 | Natural language processing: sentiment analysis | Assign HW 4 |
Week 10 | Mar 27 | Advanced data visualization | |
Week 11 | Apr 3 | Interactive maps: Leaflet | |
Week 12 | Apr 10 | Shiny | Assign HW 5 |
Week 13 | Apr 17 | Shiny | |
Week 14 | Apr 24 | In-class work on final projects | |
- Homework assignment dates are tentative and subject to change.