# Intro to Python for Data Science

## Jupyter Notebooks

This file is a Jupyter notebook, formerly known as an IPython (Interactive Python) notebook, which is where the .ipynb file extension comes from. As you can see, you can interact with it in a web browser, so it doesn't have to be a static document like a PDF.

Jupyter notebooks are similar to R Markdown files. They contain cells of text (like this one) for explanations and well-formatted comments, and cells for code and results. The difference is that Jupyter notebooks are much more interactive: you can execute and modify code and see the results right away, all within the notebook. This makes them very useful for projects where you want to collaborate and share your thinking, as well as for research work where you want your results to be easily reproducible.

In [2]:
# Demo code
print('Hello, World!')

Hello, World!


## Python fundamentals

Unlike R, Python isn't used only for quantitative work; it's a general-purpose programming language. You can get a lot done just by importing the right library, but here we'll work through a few of the building blocks so you understand how things fit together.

### Data Types

Just like in R, variables in Python can hold text (called "strings", type str, rather than R's "character"), numbers, or True/False boolean values.

In [2]:
city = "New York"
pop = 8500000
metro = True

In [3]:
print(city)

New York


In [4]:
print(type(city))

<class 'str'>

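The other two variables can be inspected the same way. A minimal sketch, reusing the values defined above, that also shows converting between types with the built-in `str()` and `int()` functions:

```python
# Same values as the cells above, redefined so the block runs on its own
city = "New York"
pop = 8500000
metro = True

# type() reports the class of each value
print(type(pop))    # <class 'int'>
print(type(metro))  # <class 'bool'>

# Built-in constructors convert between types
pop_text = str(pop)          # number -> text
pop_millions = pop / 1e6     # dividing by a float gives a float
print(pop_text + " people")  # 8500000 people
print(pop_millions)          # 8.5
print(int("42") + 1)         # 43
```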

### Lists

In R, you frequently group values of the same type into a vector. In Python, there are many different "collections" that let you group values together. Lists are perhaps the simplest case. They are:
* Mutable in size (you can add or drop values easily)
* Ordered
* Mixable by type (the elements don't all have to be strings, booleans, or numbers)

They are signified in Python by square brackets: [ ]

In [10]:
city_names = ["New York", "Washington", "Seattle"]

In [11]:
print(city_names)

['New York', 'Washington', 'Seattle']


In [12]:
# Adding another city
city_names.append("Chicago")
print(city_names)

['New York', 'Washington', 'Seattle', 'Chicago']


In [13]:
# Adding a number
city_names.append(1000)

In [14]:
print(city_names)

['New York', 'Washington', 'Seattle', 'Chicago', 1000]


In [15]:
# Selecting from an index
print(city_names[1])

Washington


In [16]:
# Slicing
print(city_names[0:2])

['New York', 'Washington']


In [17]:
# Negative indexing
print(city_names[-1])

1000


In [18]:
# Combining with full slicing
print(city_names[:-1])

['New York', 'Washington', 'Seattle', 'Chicago']


In [19]:
city_names = city_names[:-1]
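Dropping values works in a few other ways too. A short sketch using the standard list methods `remove()`, `pop()`, and `insert()` on a fresh copy of the city list (the extra city added at the end is only for illustration):

```python
city_names = ["New York", "Washington", "Seattle", "Chicago"]

# len() gives the number of elements
print(len(city_names))        # 4

# remove() drops the first element equal to the given value
city_names.remove("Seattle")

# pop() removes the last element and hands it back
last_city = city_names.pop()
print(last_city)              # Chicago

# insert() adds a value at a specific position
city_names.insert(0, "Baltimore")
print(city_names)             # ['Baltimore', 'New York', 'Washington']
```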

One useful thing to note is that in many ways, strings behave like lists in which each element is a character. So you can access characters directly by index or slice.

In [30]:
text = "Python is great for data science, but I like R better"
print(text[-1])
print(text[:32])

r
Python is great for data science
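One difference from lists is worth knowing: strings are immutable, so indexing and slicing work for reading but not for assignment. A small sketch of what happens, and of the usual workaround of building a new string:

```python
text = "Python is great for data science, but I like R better"

# Reading by index or slice works just like a list
print(text[0])        # P
print(len(text))      # number of characters

# Assigning to an index does not: strings cannot be changed in place
try:
    text[0] = "J"
except TypeError as err:
    print("TypeError:", err)

# Build a new string instead
new_text = "J" + text[1:]
print(new_text[:6])   # Jython
```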


### Dictionaries

Dictionaries are a very useful and fast tool for when you want to be able to "look up" related values, but don't care so much about the position of each entry in the collection.

A dictionary is a collection of *key-value pairs*. Each *key* has a *value* associated with it, and you always reach a value through its key rather than through its position. (Since Python 3.7 a dictionary does remember the order in which keys were inserted, but lookups are still by key.) This is helpful when you have a large collection of mappings and want lookups to be very fast.

Dictionaries are signified in Python with curly braces: { }

Lookups are performed with brackets, just like you would with a numeric index.

In [20]:
# Create a dictionary, using {}
city_pops = {"New York": 8500000,
             "Washington": 700000,
             "Chicago": 2700000}

In [21]:
# Referencing a value is as simple as supplying its key
city_pops["New York"]

8500000

In [22]:
# You can access the collection of keys (or values); since Python 3.7 they come back in insertion order
print(city_pops.keys())

dict_keys(['New York', 'Washington', 'Chicago'])


In [23]:
# Reassigning or adding new key-value pairs are both simple tasks
city_pops["Washington"] = 690000
print(city_pops["Washington"])

690000


In [24]:
city_pops["Baltimore"] = 620000
print(city_pops["Baltimore"])

620000


In [25]:
# Dictionaries are looked up by key, not by position, so an index number only works if that number is itself a key
city_pops[1]

KeyError: 1
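If you're not sure a key exists, there are safer ways to look it up than brackets. A minimal sketch using the `in` test, the `get()` method, and `items()` for looping over pairs (the dictionary is rebuilt here with a few of the rough figures from above so the block runs on its own):

```python
city_pops = {"Washington": 690000,
             "Chicago": 2700000,
             "Baltimore": 620000}

# Test for a key before using it
print("Seattle" in city_pops)          # False

# get() returns a default instead of raising KeyError
print(city_pops.get("Seattle", 0))     # 0
print(city_pops.get("Chicago", 0))     # 2700000

# items() yields each key together with its value
for name, population in city_pops.items():
    print(name, population)
```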

## Loops, Functions, and Conditionals

Fortunately, both R and Python stick to most programming conventions for these fundamental programming concepts. If you know how to write a function or a loop in R, you could probably look at that function in Python and understand what's going on.

Unfortunately, the formatting differences are just small enough that it's *very* easy to get confused, and it *can* break your code if you're not careful. Test your work often!

### Loops

Like R, Python has *while* and *for* loops. For loops are still the most useful in most contexts. The first line defines what the loop runs over (or, for a while loop, the condition for continuing), and the indented lines that follow say what steps are performed in each iteration.

In [48]:
# Print the numbers 1 through 10

i = 1

while i < 11:
    print(i)
    i+=1

1
2
3
4
5
6
7
8
9
10


In [49]:
# A for loop runs its indented body once for each element of a sequence

# Print each character on a separate line
for letter in "Hello, World!":
    print(letter)

H
e
l
l
o
,
 
W
o
r
l
d
!
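For counting, a `for` loop over `range()` is usually tidier than the `while` version above, and `enumerate()` gives you a position alongside each element. A short sketch (the city list is repeated here so the block runs on its own):

```python
# range(1, 11) produces 1 through 10; the stop value is excluded
total = 0
for i in range(1, 11):
    total += i
print(total)  # 55

# enumerate() pairs each element with its index, starting at 0
city_names = ["New York", "Washington", "Seattle"]
for position, name in enumerate(city_names):
    print(position, name)
```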


### Conditionals

As in most programming languages, logical checks are at the core of Python, and they work about as you'd expect.

In [50]:
# Check if two values are equal
print(1 == 2)

False


In [51]:
# Greater than, less than, or equal
print(1 > 2)
print(3 >= 2)
print("a" < "b")

False
True
True


In [53]:
# Logical combinations
print(True and False)
print(True and not True)
print(True or False)
print(1 != 2)

False
False
True
True


In [55]:
# Easily check for something in a collection
print("a" in ["a", "b", "c"])

True


In [69]:
# Now use these with if statements

if 4 / 2 == 2:
    print("The math works!")
else:
    print("Something has gone horribly wrong")

The math works!
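When there are more than two cases, `elif` chains additional checks between `if` and `else`. A minimal sketch; the population cutoffs and labels are made up for illustration:

```python
pop = 700000  # the rough Washington figure used elsewhere in this notebook

# Checks run top to bottom; the first one that is True wins
if pop >= 1000000:
    label = "large"
elif pop >= 500000:
    label = "medium"
else:
    label = "small"

print(label)  # medium
```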


### Functions

Write functions! Even little functions! Good programming etiquette is important in all programming languages, but it's a core tenet of good Python code. The first line defines the function and its arguments, and the following indented lines lay out what the function does and what (if anything) it returns.

In [3]:
def myFunc(arg):
    """
    Functions should start with a docstring like this one that explains what they do and what values they use.
    
    This function takes a number arg and returns its squared value.
    """
    
    output = arg ** 2
    return output

print(myFunc(5))

25
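Functions can also take more than one argument, and arguments can have default values. A small sketch; `power` and its `exponent` parameter are illustrative names, not part of the original notebook:

```python
def power(base, exponent=2):
    """Return base raised to exponent; squares by default."""
    return base ** exponent

print(power(5))               # 25
print(power(5, 3))            # 125
print(power(2, exponent=10))  # 1024
```

Calling with a keyword (as in the last line) makes it clear at the call site which argument is which.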


In [None]:
# Built-in functions are called the same way; max() needs at least one argument
max(1, 5, 3)

### Function challenge

Write a function that takes a string and prints (not returns!) the vowels.

In [7]:
# First create a list of the 5 vowels
vowels = ['a', 'e', 'i', 'o', 'u']

In [8]:
# Define your function here
# Remember it should have one argument, a string, and it should not have a return statement
# You'll need to use a loop, a logical test, and your vowels list

def print_vowels(x):
    for char in x:
        if char in vowels:
            print(char)

In [9]:
print_vowels("Oh man this is good code!")

a
i
i
o
o
o
e
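Notice that the capital "O" in "Oh" was not printed, because the vowels list only holds lowercase letters. One possible variation, assuming you want uppercase vowels counted too, lowercases each character before the test (`print_vowels_any_case` is a made-up name):

```python
vowels = ['a', 'e', 'i', 'o', 'u']

def print_vowels_any_case(x):
    """Print every vowel in x, matching upper- and lowercase letters."""
    for char in x:
        # lower() makes the membership test case-insensitive
        if char.lower() in vowels:
            print(char)

print_vowels_any_case("Oh man")
# O
# a
```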


## Data sets and pandas for analysis

You *could* combine lists and dictionaries in all sorts of ways to store and reference your data. Lists can contain lists (or lists of lists, or...). Dictionaries can contain dictionaries as their values (but not their keys).

But you know what R is really good at? Data frames. Rows and columns, observations and variables. To mimic that functionality in Python, you'll want to use the *pandas* library. By convention, data scientists will usually refer to pandas as "pd" in their code, because it is so often used and typed out.

In [32]:
# How to import a library in Python
import pandas as pd

In [34]:
# Creating a data frame from scratch
df = pd.DataFrame([["New York", 8500000, True],
                  ["Washington", 700000, True],
                  ["Chicago", 2700000, False]])

In [36]:
df

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | New York | 8500000 | True |
| 1 | Washington | 700000 | True |
| 2 | Chicago | 2700000 | False |


In [38]:
# It works better from dictionaries, where the key is the variable name
df = pd.DataFrame({"City": ["New York", "Washington", "Chicago"],
                   "Population": [8500000, 700000, 2700000],
                   "East Coast": [True, True, False]})

In [39]:
df

| | City | East Coast | Population |
|---|---|---|---|
| 0 | New York | True | 8500000 |
| 1 | Washington | True | 700000 |
| 2 | Chicago | False | 2700000 |


In [40]:
# Lookups
df["City"]

0      New York
1    Washington
2       Chicago
Name: City, dtype: object

In [41]:
# Filter rows with a logical condition on a column
df[df["East Coast"] == True]

| | City | East Coast | Population |
|---|---|---|---|
| 0 | New York | True | 8500000 |
| 1 | Washington | True | 700000 |


In [42]:
# Some basic statistical operations are available automatically
df["Population"].sum()

11900000

In [43]:
# Booleans count as 1 or 0 for most math purposes
df["East Coast"].mean()

0.66666666666666663
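Conditions can be combined to filter on several columns at once. A sketch on the same small table, using pandas' `&` operator (each condition needs its own parentheses), `~` for "not", and `.loc` to pick a column from the matching rows:

```python
import pandas as pd

df = pd.DataFrame({"City": ["New York", "Washington", "Chicago"],
                   "Population": [8500000, 700000, 2700000],
                   "East Coast": [True, True, False]})

# Rows that are on the East Coast AND have more than a million people
big_east = df[(df["East Coast"]) & (df["Population"] > 1000000)]
print(big_east["City"].tolist())   # ['New York']

# .loc[rows, column] filters rows and selects a column in one step
mean_non_east = df.loc[~df["East Coast"], "Population"].mean()
print(mean_non_east)               # 2700000.0
```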

### Loading DataFrames

Creating DataFrames by hand is possible, but clearly not ideal! The good news is that they can be read in from or saved out to many different file formats, just like in R. Common formats include CSV and JSON; for simplicity we'll work with CSVs here.

In [56]:
dc_pums = pd.read_csv("https://raw.githubusercontent.com/GeorgetownMcCourt/data-science/master/lecture-12/data/ss15pdc.csv")

In [58]:
dc_pums.head()

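| | RT | SERIALNO | SPORDER | PUMA | ST | ADJINC | PWGTP | AGEP | CIT | CITWP | ... | RACSOR | RACWHT | RC | SCIENGP | SCIENGRLP | SFN | SFR | SOCP | VPS | WAOB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | P | 1091 | 1 | 105 | 11 | 1001264 | 91 | 57 | 1 | | ... | 0 | 1 | 0 | 1.0 | 2.0 | | | 119151 | | 1 |
| 1 | P | 1213 | 1 | 104 | 11 | 1001264 | 156 | 34 | 1 | | ... | 0 | 0 | 0 | | | | | 435051 | | 1 |
| 2 | P | 1213 | 2 | 104 | 11 | 1001264 | 167 | 34 | 1 | | ... | 0 | 0 | 0 | | | | | 37201X | | 1 |
| 3 | P | 1213 | 3 | 104 | 11 | 1001264 | 129 | 7 | 1 | | ... | 0 | 0 | 1 | | | | | | | 1 |
| 4 | P | 1213 | 4 | 104 | 11 | 1001264 | 217 | 2 | 1 | | ... | 0 | 0 | 1 | | | | | | | 1 |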
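Saving works much the same way in reverse. A minimal sketch of a CSV round trip using `to_csv()` and `read_csv()`; it writes to an in-memory buffer (`io.StringIO`) instead of a real file so that it runs anywhere, and the small city table stands in for real data:

```python
import io
import pandas as pd

df = pd.DataFrame({"City": ["New York", "Washington", "Chicago"],
                   "Population": [8500000, 700000, 2700000]})

# Write the DataFrame as CSV text; index=False leaves out the row numbers
buffer = io.StringIO()
df.to_csv(buffer, index=False)
print(buffer.getvalue())

# Read it back in; with a real file you would pass a path instead of a buffer
buffer.seek(0)
df_again = pd.read_csv(buffer)
print(df_again.shape)  # (3, 2)
```

Leaving out `index=False` would write the row numbers as an extra unnamed column, which then shows up as "Unnamed: 0" when the file is read back.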


In [64]:
# What's the highest reported wage in the data?

dc_pums["WAGP"].max()

637000.0

In [65]:
# What are the descriptive statistics for age?

dc_pums["AGEP"].describe()

count    6610.000000
mean       38.352496
std        21.845732
min         0.000000
25%        23.000000
50%        34.000000
75%        55.000000
max        94.000000
Name: AGEP, dtype: float64

### DataFrame practice

How would you find the mean wage (WAGP) for adults aged 18 to 25 (AGEP)?