# Welcome to MUSA 550:<br>Geospatial Data Science in Python

Aug. 30, 2023

## Today

- Course logistics
- Using Jupyter Notebooks and Jupyter Lab
- Introduction to Python & Pandas

## Who am I?



### My day job

* My name is Nick Hand
* For the past 5+ years, I have led a small data science team in the City Controller's Office
* What did we do?: 
    - Objective, data-driven analysis of financial policies impacting Philadelphia 
    - Increasing transparency through data releases and interactive reports
* In two weeks, I start as a data scientist at the Consumer Financial Protection Bureau in the federal government
    

In my time at the Controller's Office, we covered a range of policy issues in the city:

- [Analysis of the fairness and accuracy of property assessments](https://controller.phila.gov/philadelphia-audits/property-assessment-review/)
- [Interactive reports for the City's cash levels](https://controller.phila.gov/philadelphia-audits/cash-report-fiscal-year-2022-q3/)
- [Analysis of the 10-Year Tax Abatement program](https://controller.phila.gov/philadelphia-audits/an-analysis-of-tax-abatements-in-philadelphia/)
- [Interactive dashboard of Soda Tax spending](https://controller.phila.gov/philadelphia-audits/data-release-beverage-tax/)
- [Visualization of paving & potholes](https://controller.phila.gov/philadelphia-audits/data-release-paving-and-potholes/)
- [Series on the impact of COVID-19 on Philadelphia's small businesses and neighborhoods](https://controller.phila.gov/philadelphia-audits/the-impact-of-covid-19-across-philadelphias-neighborhoods/)
- [Visualization of the City's most recent budget](https://controller.phila.gov/philadelphia-audits/the-adopted-fy24-budget)
- [Interactive report on redlining in Philadelphia](https://controller.phila.gov/philadelphia-audits/mapping-the-legacy-of-structural-racism-in-philadelphia/)
- [Analysis of the impact of gun violence on housing prices](https://controller.phila.gov/philadelphia-audits/economic-impact-of-homicides/)
- [Interactive dashboard of shooting victims in Philadelphia](https://controller.phila.gov/philadelphia-audits/mapping-gun-violence/#/)
- [Interactive dashboard of neighborhood well-being](https://controller.phila.gov/philadelphia-audits/progressphl)


![](imgs/phila-controller-homepage.png)

::: {.callout-note}
To see more of our work, check out: https://controller.phila.gov/policy-analysis
:::

### Previously:<br>Astrophysics Ph.D. at Berkeley


![](imgs/uc-berkeley.jpg){width=600 fig-align="center"}

### How did I get here?

* The path from astrophysics/physics to data science is becoming increasingly common
* Landed a job through Twitter: https://www.parkingjawn.com
    * Dashboard visualization of monthly parking tickets in Philadelphia
    * Data from [OpenDataPhilly](https://opendataphilly.org/datasets/parking-violations/)

![](imgs/parking-jawn.png)

A nice example of how exploratory analysis + a well-designed dashboard can lead to insights:
- The power of cross-filtering: different views of the same data across multiple dimensions
- See drop in parking tickets over Jan 24-26, 2016 due to snowstorm


Parking Jawn is not Python-based, but it dovetails nicely with one of the **main goals of the course:**
> How can we effectively explore and extract insight from complex datasets?

## Course logistics

### General Info

* Two 90-minute lectures per week — mix of lecturing, interactive demos, and in-class lab time
* My email: nhand@design.upenn.edu
  - Office Hours: 
    - 2 hours during the week
    - Office hours will be by appointment and remote via Zoom. You will be able to sign up for one or more 15-minute time slots via the Canvas calendar. 
    - Time is to be determined
* Teaching Assistant: Teresa Chang
    - Email: thchang@design.upenn.edu
    - Office hours: TBD

### Course Websites

The course has four websites (sorry!). They are:

- Main Course: https://musa-550-fall-2023.github.io
- Github: https://github.com/MUSA-550-Fall-2023
- Canvas: https://canvas.upenn.edu/courses/1740535
- Ed Discussion: https://edstem.org/us/courses/42616/discussion/

Each will have its own purpose:


#### Main course website:

- Course schedule with links to weekly slides 
- Resources for learning Python, setting up software, and dealing with common issues
- General course info and policies
- Quick links to the other websites for the course

#### Github

- Github organization set up for the course
- Each week and assignment will have its own Github repository
- Assignments will also be submitted through Github

#### Canvas

- Will be used to sign up for remote office hours and to provide Zoom links for office hours
- Grading will also be tracked here

#### Ed Discussion

- Will be used as a question & answer forum for course materials and assignments
- Announcements will also be made here so make sure you check frequently or turn on your notifications!
- Main method of communication will be through announcements on this site
- Participation grade (5% of total grade) will also be determined by user activity on the forum

## Main course website 

<center>
    <a href="https://musa-550-fall-2023.github.io/" target='blank_'>
<img src="imgs/main-course-website.png" width=800></img>
    </a>
</center>

### Highlights

- [Syllabus](https://musa-550-fall-2023.github.io/syllabus.html)
- [Schedule](https://musa-550-fall-2023.github.io/schedule.html)
- Resources & Guides:
  - [Python resources](https://musa-550-fall-2023.github.io/resource/python.html)
  - [Initial installation guide](https://musa-550-fall-2023.github.io/resource/python.html)
- Quick links to Canvas, Ed Discussion, GitHub homepage


## Policies

- Available at: https://musa-550-fall-2022.github.io/syllabus/policies/

## Course Github

<center>
    <a href="https://github.com/MUSA-550-Fall-2022" target='blank_'>
<img src="attachment:Screen%20Shot%202022-08-29%20at%204.41.06%20PM.png" width=800></img>
    </a>
</center>

## The goals of this course

- Provide students with the knowledge and tools to turn data into meaningful insights and stories
- Focus on the modern data science tools within the Python ecosystem
- The pipeline approach to data science:
    - gathering, storing, analyzing, and visualizing data to tell stories
- Real-world applications of analysis techniques in the urban planning and public policy realm 

## What we'll cover

### Module 1

**Exploratory Data Science:** Students will be introduced to the main tools needed to get started analyzing and visualizing data using Python

### Module 2

**Introduction to Geospatial Data Science:** Building on the previous set of tools, this module will teach students how to work with geospatial datasets using a range of modern Python toolkits.

### Module 3

**Data Ingestion & Big Data:** Students will learn how to collect new data through web scraping and APIs, as well as how to work effectively with the large datasets often encountered in real-world applications.

### Module 4

**Geospatial Data Science in the Wild:** Armed with the necessary data science tools, students will be introduced to a range of advanced analytic and machine learning techniques using a number of innovative examples from modern researchers.

### Module 5

**From Exploration to Storytelling:** The final module will teach students to present their analysis results using web-based formats to transform their insights into interactive stories.

## Assignments and grading

- Grading: 
    - 50% homework
    - 40% final project
    - 10% participation (based on class and Ed Discussion participation)
- Late policy: assignments will be accepted late, but with a penalty

Homeworks will be assigned (roughly) every two weeks. You must complete five of the seven homework assignments. Four of the assignments are required, and you are allowed to choose the last assignment to complete (out of the remaining three options).

![Screen%20Shot%202022-08-29%20at%204.42.29%20PM.png](attachment:Screen%20Shot%202022-08-29%20at%204.42.29%20PM.png)

![Screen%20Shot%202022-08-29%20at%204.42.59%20PM.png](attachment:Screen%20Shot%202022-08-29%20at%204.42.59%20PM.png)

![Screen%20Shot%202022-08-29%20at%204.43.10%20PM.png](attachment:Screen%20Shot%202022-08-29%20at%204.43.10%20PM.png)

## Final Project

The final project is to replicate the pipeline approach on a dataset (or datasets) of your choosing.

Students will be required to use several of the analysis techniques taught in the class and produce a web-based data visualization that effectively communicates the empirical results to a non-technical audience. 

More info will be posted here: https://github.com/MUSA-550-Fall-2022/final-project

## Any questions so far?

## Initial survey

https://www.surveymonkey.com/r/TPTM6J3

### Okay, let's get started...

## The Incredible Growth of Python


[A StackOverflow analysis](https://stackoverflow.blog/2017/09/06/incredible-growth-python/)

<center>
    <img src="attachment:Screen%20Shot%202019-08-28%20at%2010.22.13%20PM.png" width=700></img>
</center>

<center>
<img src=attachment:Screen%20Shot%202019-08-28%20at%2010.23.51%20PM.png width=700></img>
</center>

## The rise of the Jupyter notebook

## The engine of collaborative data science

- First started by a physics grad student around 2001
- Known as the IPython notebook originally
- Started getting popular in ~2011
- First funding received in 2015 $\rightarrow$ the Jupyter notebook was born


### Google searches for Jupyter notebook
<br>
<center>
    <img src="attachment:Screen%20Shot%202020-08-31%20at%208.52.32%20PM.png" height=500></img>
</center>

## Key features 

- Aimed at "computational narratives" — telling stories with data
- interactive, reproducible, shareable, user-friendly, visualization-focused

**Very versatile: good for both exploratory data analysis and polished finished products**

## Recommended reading on the Jupyter notebook

The official documentation for the Jupyter notebook is a good intro to the basics of the notebook:

  - [Introduction](https://jupyter-notebook.readthedocs.io/en/stable/notebook.html#introduction)
  - [Starting the notebook server](https://jupyter-notebook.readthedocs.io/en/stable/notebook.html#starting-the-notebook-server)
  - [Creating a new document](https://jupyter-notebook.readthedocs.io/en/stable/notebook.html#creating-a-new-notebook-document)
  - [Opening notebooks](https://jupyter-notebook.readthedocs.io/en/stable/notebook.html#opening-notebooks)
  - [Notebook user interface](https://jupyter-notebook.readthedocs.io/en/stable/notebook.html#notebook-user-interface)
  - [Structure of a notebook document](https://jupyter-notebook.readthedocs.io/en/stable/notebook.html#structure-of-a-notebook-document)
  - [User interface components](https://jupyter-notebook.readthedocs.io/en/stable/ui_components.html)
  
 
*Links available on the week-1 GitHub repository as well*

## Beyond the Jupyter notebook

### Google's Colaboratory
<br>

<center>
    <img src=attachment:Screen%20Shot%202019-01-22%20at%206.14.54%20PM.png width=1000
         ></img>
</center>

- A fancier notebook experience built on top of Jupyter notebook
- Running in the cloud on Google's servers
- An internal Google product that was recently released publicly
- Very popular for Python-based machine learning
- We won't need to use it much in this course

See https://colab.research.google.com/notebooks/welcome.ipynb

## Binder: https://mybinder.org

![Screen%20Shot%202020-08-31%20at%208.54.43%20PM.png](attachment:Screen%20Shot%202020-08-31%20at%208.54.43%20PM.png)

### Allows you to launch a repository of Jupyter notebooks on GitHub in the cloud 

Note: as a free service, it can be a bit slow sometimes

### Weekly lectures are available on Binder
<br>
<center>
    <img src=attachment:Screen%20Shot%202019-08-28%20at%2010.34.04%20PM.png width=600></img>
</center>    

## Weekly Workflow

- Set up a local Python environment as part of the first homework assignment (posted next week on Wed. 9/6) 
- Each week, you will have two options to follow along with lectures:
    1. Using Binder in the cloud, launching via the button on the week's repository
    1. Downloading the week's repository to your laptop and launching the notebook locally
- Work on homeworks locally on your laptop — Binder is only a *temporary* environment (no save features)

To follow along today, go to https://github.com/MUSA-550-Fall-2022/week-1

## Now to the fun stuff...

These slides are a Jupyter notebook.

A mix of *code* cells and **text** cells in Markdown. Change the type of cell in the top menu bar.

In [2]:
# Comments begin with a "#" character in Python
# A simple code cell
# SHIFT-ENTER to execute


x = 10
print(x)

10


### Python data types

In [3]:
# integer
a = 10

# float
b = 10.5

# string
c = "this is a test string"

# lists
d = list(range(0, 10))

# booleans
e = True

# dictionaries
f = {"key1": 1, "key2": 2}

In [4]:
print(a)
print(b)
print(c)
print(d)
print(e)
print(f)

10
10.5
this is a test string
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
True
{'key1': 1, 'key2': 2}


**Note:** unlike `R`, you'll need to use quotes more often in Python, particularly around strings and keys of dictionaries
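To check which type a value has, Python's built-in `type()` function can be used. The short sketch below uses an illustrative list named `examples` (not part of the lecture code) holding one value of each type introduced above:

```python
# One example value per data type (illustrative values only)
examples = [10, 10.5, "text", [0, 1], True, {"key1": 1}]

# type(value).__name__ gives the short name of each value's type
type_names = [type(value).__name__ for value in examples]

print(type_names)
# → ['int', 'float', 'str', 'list', 'bool', 'dict']
```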

### Alternative method for creating a dictionary

In [5]:
f = dict(key1 = 1, key2=2, key3=3)

f

{'key1': 1, 'key2': 2, 'key3': 3}

### Accessing dictionary values

In [6]:
# access the value with key 'key1'
f['key1']

1
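Square brackets raise a `KeyError` when a key is missing. As a minimal sketch of two other common dictionary operations, the example below uses `.get()` to supply a fallback value and `.items()` to loop over keys and values together. The dictionary name `counts` and its contents are made up for illustration:

```python
# Illustrative dictionary (name and values are assumptions for this example)
counts = {"key1": 1, "key2": 2, "key3": 3}

# .get() returns the fallback (here 0) instead of raising a KeyError
missing = counts.get("key4", 0)

# Assigning to a new key adds it to the dictionary
counts["key4"] = 4

# .items() yields (key, value) pairs
total = 0
for key, value in counts.items():
    total = total + value

print(missing, total)
# → 0 10
```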

### Accessing list values

In [7]:
d

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [8]:
# access the second list entry (0 is the first index)
d[1]  

1
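Lists also support negative indices and slices. The sketch below rebuilds the same ten-element list under the illustrative name `values`:

```python
values = list(range(0, 10))

# Negative indices count backwards from the end of the list
last = values[-1]

# A slice start:stop includes the start index and excludes the stop index
first_three = values[:3]

# A third number in the slice sets the step size
every_other = values[::2]

print(last, first_three, every_other)
# → 9 [0, 1, 2] [0, 2, 4, 6, 8]
```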

### Accessing characters of a string

In [9]:
c

'this is a test string'

In [10]:
# the first character
c[0]

't'
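Strings can be sliced the same way as lists, and they come with built-in methods. A brief sketch, using the illustrative variable name `text` for the same test string:

```python
text = "this is a test string"

# Slicing works just like it does for lists
first_word = text[:4]

# .split() breaks the string into a list of words on whitespace
words = text.split()

# len() returns the number of characters, including spaces
n_chars = len(text)

print(first_word, words, n_chars)
# → this ['this', 'is', 'a', 'test', 'string'] 21
```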

### Iterators and for loops

In [25]:
# Python code
result = 0
for i in range(10):
    print(i)
    result = result + i

0
1
2
3
4
5
6
7
8
9


In [26]:
print(result)

45
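A `for` loop can run over any sequence, not just a `range`. The sketch below loops over a list of strings with `enumerate()`, which supplies a counter alongside each item, and then reproduces the running total from the loop above with the built-in `sum()`. The list `names` is an illustrative example:

```python
# Illustrative list of strings
names = ["first", "second", "third"]

# enumerate() yields (index, item) pairs, starting from index 0
labels = []
for i, name in enumerate(names):
    labels.append(f"{i}: {name}")

# sum() adds up the numbers 0 through 9 without an explicit loop
total = sum(range(10))

print(labels)
print(total)
# → ['0: first', '1: second', '2: third']
# → 45
```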


### Python's inline syntax

In [27]:
a = range(10) # this is a lazy "range" object (an iterable), not a list

In [28]:
print(a)

range(0, 10)


In [29]:
# convert it to a list explicitly
a = list(range(10))
print(a)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


In [30]:
# or use the INLINE syntax (known as a "list comprehension"); this is the SAME
a = [i for i in range(10)]
print(a)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
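The inline syntax can also transform or filter values as it builds the list, and a similar form with curly braces builds a dictionary. A short sketch with illustrative variable names:

```python
# Transform each value: the squares of 0 through 9
squares = [i * i for i in range(10)]

# Filter values: keep only the even numbers
evens = [i for i in range(10) if i % 2 == 0]

# The same idea builds a dictionary, mapping each number to its square
square_lookup = {i: i * i for i in range(4)}

print(squares)
print(evens)
print(square_lookup)
# → [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
# → [0, 2, 4, 6, 8]
# → {0: 0, 1: 1, 2: 4, 3: 9}
```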


### Python functions

```python
def function_name(arg1, arg2, arg3):
    # The function body is one or more indented lines of code
    result = ...  # placeholder: compute something from the arguments
    return result
```

In [32]:
def compute_square(x):
    return x * x

In [33]:
sq = compute_square(5)
print(sq)

25


### Keyword arguments: arguments with a default value!

In [34]:
def compute_product(x, y=5):
    return x * y

In [35]:
# use the default value for y
print(compute_product(5))

25


In [36]:
# specify a y value other than the default
print(compute_product(5, 10))

50


In [37]:
# can also explicitly tell Python which arguments are which
print(compute_product(5, y=2))
print(compute_product(y=2, x=5))

10
10


In [38]:
print(compute_product(x=5, y=4))

20


In [39]:
# argument names must match the function signature though!
print(compute_product(x=5, z=5))

TypeError: compute_product() got an unexpected keyword argument 'z'
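An uncaught error like this stops the cell from running. If the rest of the code should keep going, the call can be wrapped in a `try`/`except` block. This goes slightly beyond what the lecture covers, so treat it as an optional sketch:

```python
def compute_product(x, y=5):
    return x * y

try:
    # "z" is not in the function signature, so this raises a TypeError
    compute_product(x=5, z=5)
except TypeError as err:
    # The error message is available as a string
    message = str(err)
    print("Caught an error:", message)
```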

### Getting help in the notebook

Use tab auto-completion and the ? and ?? operators

In [40]:
this_variable_has_a_long_name = 5

In [41]:
# try hitting tab after typing this_ 
this_variable_has_a_long_name

5

In [44]:
# Forget how to create a range? --> use the help message
range?

### Peeking at the source code for a function

Use the ?? operator

In [45]:
# Let's re-define compute_product() and add a docstring between """ """
def compute_product(x, y=5):
    """
    This computes the product of x and y
    
    
    This is all part of the docstring.
    """
    return x * y

In [None]:
compute_product??

The question mark operator gives you access to the help message for any variable or function. **I use this frequently, and it is the primary way I figure out what functions do.**
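The `?` and `??` operators are features of IPython and Jupyter, so they will not work in a plain `.py` script. As a rough sketch of the standard-Python equivalents, the built-in `help()` function, the `__doc__` attribute, and `inspect.signature()` expose similar information:

```python
import inspect


def compute_product(x, y=5):
    """Compute the product of x and y."""
    return x * y


# The docstring is stored on the function itself
doc = compute_product.__doc__

# inspect.signature() reports the argument names and defaults
signature = str(inspect.signature(compute_product))

print(doc)
print(signature)

# help() prints a formatted help message built from the same information
help(compute_product)
```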

## Getting more Python help

This was a **very** brief introduction. Additional Python tutorials are listed on our course website under "Resources"

https://musa-550-fall-2022.github.io/resources/python/

![Screen%20Shot%202022-08-29%20at%204.38.46%20PM.png](attachment:Screen%20Shot%202022-08-29%20at%204.38.46%20PM.png)

Recommended tutorial for students with little Python background:

- [Practical Python Programming](https://dabeaz-course.github.io/practical-python/Notes/Contents.html)

There are also a few good resources from the Berkeley Institute for Data Science:

- https://bids.github.io/2016-01-14-berkeley/python/00-python-intro.html ([notebook version](https://bids.github.io/2016-01-14-berkeley/python/00-python-intro.ipynb))
- [Python for Social Science, a free online book](https://gawron.sdsu.edu/python_for_ss/course_core/book_draft/index.html)
- Many more resources are listed here: http://python.berkeley.edu/resources/

## The Data Science Handbook

[The Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) is a free, online textbook covering the Python basics needed in this course. In particular, the first four chapters are excellent:


- [Chapter 1: IPython/Jupyter](https://jakevdp.github.io/PythonDataScienceHandbook/01.00-ipython-beyond-normal-python.html)
- [Chapter 2: Numpy](https://jakevdp.github.io/PythonDataScienceHandbook/02.00-introduction-to-numpy.html)
- [Chapter 3: Pandas](https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html)
- [Chapter 4: matplotlib](https://jakevdp.github.io/PythonDataScienceHandbook/04.00-introduction-to-matplotlib.html)

Note that you can click on the "Open in Colab" button for each chapter and run the examples interactively using [Google Colab](https://colab.research.google.com).

## One more thing: working outside the notebook

In this class, we will almost exclusively work inside Jupyter notebooks — you'll be writing Python code and doing data analysis directly in the notebook. 

The more traditional method of using Python is to put your code into a `.py` file and execute it via the command line (known as the Anaconda Prompt on Windows or the Terminal app on macOS).

See [this section](https://dabeaz-course.github.io/practical-python/Notes/01_Introduction/02_Hello_world.html) of the Practical Python Programming tutorial for more info.

There is a file called [`hello_world.py`](https://github.com/MUSA-550-Fall-2020/week-1/blob/master/hello_world.yml) in the repository for week 1. If we execute it, it should print out "Hello, World" to the command line.

Let's try it out.

### Notebook tip

You can run terminal commands directly in the Jupyter notebook's "code" cell by starting the line with a "!"

To list all of the files in the current folder (the "current working directory"), use the `ls` command:

In [None]:
! ls

We see the `hello_world.py` file listed. Now let's execute it on the command line by using the `python` command:

In [None]:
# We can run the same code right in the browser!
print("Hello World!")

In [None]:
! python hello_world.py

Success!
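The contents of `hello_world.py` are not shown in these slides. As an assumption about what such a script might contain, not the actual file, a minimal version could look like the sketch below. The `if __name__ == "__main__":` check is a common convention that runs `main()` only when the file is executed directly with `python hello_world.py`:

```python
# hello_world.py (hypothetical contents, for illustration only)


def main():
    message = "Hello, World"
    print(message)
    return message


# This block runs only when the file is executed as a script,
# not when it is imported from another Python file
if __name__ == "__main__":
    main()
```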

## Code editors

When writing software outside the notebook, it's useful to have an application known as a "code editor". This will provide a nice interface for writing Python code and some even have fancy features, like real-time syntax checking and syntax highlighting.

My recommended option is [Visual Studio Code](https://code.visualstudio.com/download).

## See you next week!

- No lecture on Monday next week due to Labor Day
- Next class a week from today on Wednesday 9/6