# Agenda

1. Data frames (2D data)
2. Reading (and writing) files -- real-world data!

To download: https://files.lerner.co.il/data-science-exercise-files.zip

In [2]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

When we import a module, we're basically asking Python to do the following:

1. Find the module (typically a file ending in ".py") on disk
2. Load it into memory
3. Cache it, so that we don't need to load it a second time
4. Define the module as a variable in our global namespace

The second time we use import, we just jump directly to step 4.
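The caching step can be checked directly. This is a minimal sketch, assuming the standard-library module `json` as a stand-in for any module; the name `cached` is just for illustration.

```python
import sys

import json                    # import the module and bind the name "json"
cached = sys.modules['json']   # the entry Python keeps in its module cache

import json                    # importing again reuses the cached module object

print(json is cached)          # True: both refer to the same module object
```

Because the second `import` only looks up the cache and rebinds the name, repeating an import costs almost nothing.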



In [4]:
import sys
sys.modules['pandas']  # sys.modules is the cache that Python uses for modules

<module 'pandas' from '/usr/local/lib/python3.11/site-packages/pandas/__init__.py'>

What about "from ... import"?

In that case, it takes a slightly different route:

1. Find the module (typically a file ending in ".py") on disk
2. Load it into memory
3. Cache it, so that we don't need to load it a second time
4. Define only the names we've specified in our global namespace


In [5]:
from random import randint

In [6]:
sys.modules['random']

<module 'random' from '/usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/random.py'>
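To see which names actually get defined, here is a small sketch that runs the same statement in a fresh, empty namespace. Using `exec` with a dictionary called `namespace` is only an illustrative device, so that the result doesn't depend on what has already been imported in the notebook.

```python
import sys

namespace = {}                                  # stands in for an empty global namespace
exec("from random import randint", namespace)   # run the import inside that namespace

print('randint' in namespace)    # True: the requested name was defined
print('random' in namespace)     # False: the module name itself was not defined
print('random' in sys.modules)   # True: the whole module was still loaded and cached
```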

In [7]:
# If I want to create a data frame...

# list of lists
df = DataFrame([[10, 20, 30, 40],
               [50, 60, 70, 80],
               [90, 100, 110, 120]])
df

    0    1    2    3
0  10   20   30   40
1  50   60   70   80
2  90  100  110  120


As with a series, we have an index -- along the left column, labeling our rows.

We also have column names, along the top row, labeling our columns.

By default, both are numbered starting at 0.
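The default numbering can be inspected through the `index` and `columns` attributes. A minimal sketch, using a separate variable named `plain` (a name chosen here for illustration) with the same values as above:

```python
from pandas import DataFrame

plain = DataFrame([[10, 20, 30, 40],
                   [50, 60, 70, 80],
                   [90, 100, 110, 120]])

print(list(plain.index))     # [0, 1, 2]     (row labels)
print(list(plain.columns))   # [0, 1, 2, 3]  (column labels)
print(plain.shape)           # (3, 4)        (rows, columns)
```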

We can set one or both by passing "index=" or "columns=" when we create the data frame. And yes, we can modify them later on.

In [8]:
df = DataFrame([[10, 20, 30, 40],
               [50, 60, 70, 80],
               [90, 100, 110, 120]],
              index=list('abc'),
              columns=list('wxyz'))
df

    w    x    y    z
a  10   20   30   40
b  50   60   70   80
c  90  100  110  120
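Changing the labels after creation might look like the sketch below. It works on a separate variable, `demo`, so that `df` stays as defined above; the new labels (`r1`, `width`, and so on) are made up for illustration.

```python
from pandas import DataFrame

demo = DataFrame([[10, 20, 30, 40],
                  [50, 60, 70, 80],
                  [90, 100, 110, 120]],
                 index=list('abc'),
                 columns=list('wxyz'))

# Assigning to .index replaces all of the row labels at once
demo.index = ['r1', 'r2', 'r3']

# rename() returns a new data frame with only the selected labels changed
renamed = demo.rename(columns={'w': 'width'})

print(list(demo.index))        # ['r1', 'r2', 'r3']
print(list(renamed.columns))   # ['width', 'x', 'y', 'z']
print(list(demo.columns))      # ['w', 'x', 'y', 'z']  (demo itself keeps its columns)
```

Assignment changes the data frame in place, whereas `rename` leaves the original alone and hands back a modified copy.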


In [9]:
# retrieving a row -- we use .loc and .iloc

df.loc['a']

w    10
x    20
y    30
z    40
Name: a, dtype: int64

In [10]:
df.loc[['a', 'c']]   # fancy indexing -- request more than one row

    w    x    y    z
a  10   20   30   40
c  90  100  110  120


In [11]:
# I can also use .iloc

df.iloc[1]

w    50
x    60
y    70
z    80
Name: b, dtype: int64
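The difference between the two accessors is that `.loc` looks rows up by index label, while `.iloc` looks them up by position. A short sketch, rebuilding the same data frame under the illustrative name `demo`:

```python
from pandas import DataFrame

demo = DataFrame([[10, 20, 30, 40],
                  [50, 60, 70, 80],
                  [90, 100, 110, 120]],
                 index=list('abc'),
                 columns=list('wxyz'))

by_label = demo.loc['b']       # the row whose index label is 'b'
by_position = demo.iloc[1]     # the row at position 1, i.e. the second row

print(by_label.equals(by_position))                      # True: same row either way

# Fancy indexing works with positions, too
print(demo.iloc[[0, 2]].equals(demo.loc[['a', 'c']]))    # True
```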

In [12]:
# how can I retrieve one or more columns? Use []
df['w']

a    10
b    50
c    90
Name: w, dtype: int64

In [13]:
# can I get more than one column? Yes!
df[['w', 'y']]

    w    y
a  10   30
b  50   70
c  90  110
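Note what comes back in each case: a single column name gives a series, while a list of names gives a data frame, even when the list has only one element. A minimal sketch, again using an illustrative copy named `demo`:

```python
from pandas import DataFrame

demo = DataFrame([[10, 20, 30, 40],
                  [50, 60, 70, 80],
                  [90, 100, 110, 120]],
                 index=list('abc'),
                 columns=list('wxyz'))

one = demo['w']               # a string selects one column
several = demo[['w', 'y']]    # a list of strings selects several columns
one_as_frame = demo[['w']]    # a one-element list still gives a data frame

print(type(one).__name__)       # Series
print(type(several).__name__)   # DataFrame
print(one_as_frame.shape)       # (3, 1)
```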


In [None]:
# what about numbering our columns? In practice, columns almost always have names rather than numbers.