In [27]:
import pandas as pd

# 5.1 Introduction to pandas Data Structures

### Series

In [28]:
obj = pd.Series([4, 7, -5, 3])

In [29]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

Getting the array representation and the index

In [30]:
obj.array


<PandasArray>
[4, 7, -5, 3]
Length: 4, dtype: int64

In [31]:
obj.index


RangeIndex(start=0, stop=4, step=1)

A Series can be created with an index that identifies each data point with a label.

In [32]:
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

Notice that `index` returns an Index object holding the labels, not an iterator.

In [33]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

So I can pull values out by label, as if it were a dict

In [34]:
obj2["a"]

-5

I can easily update values too

In [36]:
obj2["d"] = 6

Passing a list of labels selects several values at once, in the order given

In [39]:
obj2[["c", "a", "d"]]

c    3
a   -5
d    6
dtype: int64

You can also do filtering.

In [42]:
obj2[obj2 > 0]

d    6
b    7
c    3
dtype: int64

Even math works, and the index labels stay attached to the results

In [43]:
obj2 * 2

d    12
b    14
a   -10
c     6
dtype: int64

Here we import NumPy under an alias, `np`. NumPy functions applied to a Series also keep the index labels attached to the results.

In [44]:
import numpy as np

In [45]:
np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

Hey look, it's our good friend the `in` operator. On a Series it checks the index labels.

In [46]:
"b" in obj2

True

In [47]:
"e" in obj2

False
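One thing worth spelling out: `in` tests the index labels, not the values, the same way it tests keys on a dict. A small sketch of the difference (the names `labeled`, `label_found`, `value_found`, and `value_in_data` are made up for this note):

```python
import pandas as pd

labeled = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

label_found = "b" in labeled               # membership is checked against the index
value_found = 7 in labeled                 # 7 is a value, not a label
value_in_data = 7 in labeled.to_numpy()    # check the underlying values instead

print(label_found, value_found, value_in_data)
```

To test for a value rather than a label, check against the underlying array as in the last line.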

Let's make a new Series from a dict

In [48]:
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}

obj3 = pd.Series(sdata)

obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

We can also reverse this by using to_dict()

In [49]:
obj3.to_dict()

{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

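Passing an index reorders the keys and selects which ones are kept.
- "California" has no entry in sdata, so it shows up as NaN, and "Utah" is dropped because it is not in the index.
- The dtype becomes float64 because NaN is a floating-point value and cannot be stored in an int64 Series.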

In [53]:
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
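To convince myself that the missing value is what changes the dtype, here is a minimal sketch comparing the two constructions. The last part assumes the nullable `"Int64"` extension dtype is available (pandas 1.0 or later); the names `complete`, `with_gap`, and `nullable` are just for illustration.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
states = ["California", "Ohio", "Oregon", "Texas"]

complete = pd.Series(sdata)                # no missing values: stays integer
with_gap = pd.Series(sdata, index=states)  # "California" has no entry, so NaN
print(complete.dtype, with_gap.dtype)

# Assumption: the nullable "Int64" extension dtype exists in this pandas version.
# It can hold a missing value without falling back to float.
nullable = pd.Series([35000, None], dtype="Int64")
print(nullable.dtype, nullable.isna().tolist())
```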

`isna` detects missing values, much like an `IS NULL` check (pandas also offers `isnull` as an alias)

In [55]:
pd.isna(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

`notna` is the inverse, basically `IS NOT NULL`

In [56]:
pd.notna(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

Series also have these instance methods, so we can shorten this to:

In [57]:
obj4.isna()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

Series automatically align by index label in arithmetic operations

In [62]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [63]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

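Notice that Utah becomes NaN because it has no matching label in obj4, and California is NaN because it is missing from obj3 (and already NaN in obj4). Any arithmetic involving a missing value yields NaN.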

In [64]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
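If the NaN for an unmatched label is not wanted, the `add` method with `fill_value` treats a label missing from one side as that value. A minimal sketch, rebuilding the two Series so it runs on its own (`total` is a name I made up):

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=["California", "Ohio", "Oregon", "Texas"])

# A label present on only one side is filled with 0 before adding.
# A label that is missing on both sides still comes out as NaN.
total = obj3.add(obj4, fill_value=0)
print(total)
```

Utah now keeps its value from obj3, while California stays NaN because neither Series has a value for it.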

Both the Series itself and its index have a `name` attribute

In [68]:
obj4.name = "population"
obj4.index.name = "state"
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

A Series's index can be altered in place by assignment

In [69]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [71]:
obj.index = ["Bob", "Steve", "Jeff", "Ryan"]
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
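The new index has to have the same number of labels as the Series has values. A quick sketch of what happens otherwise (the variable `error_message` is only there to capture the result):

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

error_message = None
try:
    obj.index = ["Bob", "Steve"]   # only 2 labels for 4 values
except ValueError as err:
    error_message = str(err)

print(error_message)
print(len(obj.index))              # the Series still has its 4 labels
```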

### DataFrame

A DataFrame has both a row index and a column index. Think of it as a dict of Series that all share the same index.

In [78]:
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2



"Hey that's pretty good"


`head` selects only the first five rows by default


In [79]:
frame.head()

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


And `tail` does the opposite, returning the last five rows

In [80]:
frame.tail()

    state  year  pop
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
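Five is only the default. Both methods take the number of rows as an argument, as in this small sketch (`first_two` and `last_three` are names I made up):

```python
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

first_two = frame.head(2)    # rows 0 and 1
last_three = frame.tail(3)   # rows 3, 4 and 5
print(first_two)
print(last_three)
```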


You can also specify the column order with `columns`.

In [82]:
pd.DataFrame(data, columns=["year", "state", "pop"])

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2


And if that column doesn't exist in the data? It still appears, filled with missing values (NaN).

In [83]:
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])
frame2

   year   state  pop debt
0  2000    Ohio  1.5  NaN
1  2001    Ohio  1.7  NaN
2  2002    Ohio  3.6  NaN
3  2001  Nevada  2.4  NaN
4  2002  Nevada  2.9  NaN
5  2003  Nevada  3.2  NaN


In [84]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

A column in a DataFrame can be retrieved as a Series either by dict-like notation or by dot attribute notation.

In [85]:
frame2["state"]

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object

In [86]:
frame2.year

0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64

Rows are retrieved by label with `loc` or by integer position with `iloc`

In [88]:
frame2.loc[1]

year     2001
state    Ohio
pop       1.7
debt      NaN
Name: 1, dtype: object

In [89]:
frame2.iloc[2]

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: 2, dtype: object
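With the default integer index, `loc` and `iloc` look the same, so here is a small sketch with string labels where the difference shows. The frame `labeled_frame` and its labels are made up for illustration.

```python
import pandas as pd

labeled_frame = pd.DataFrame(
    {"year": [2000, 2001, 2002], "pop": [1.5, 1.7, 3.6]},
    index=["one", "two", "three"],
)

by_label = labeled_frame.loc["two"]     # look up the row labeled "two"
by_position = labeled_frame.iloc[1]     # look up the second row (position 1)
print(by_label)
print(by_label.equals(by_position))
```

Both calls return the same row here, but `loc` would keep working if the rows were reordered, while `iloc` always returns whatever sits at that position.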

You can modify a whole column at once by assigning a scalar value.

In [91]:
frame2["debt"] = 16.5
frame2

   year   state  pop  debt
0  2000    Ohio  1.5  16.5
1  2001    Ohio  1.7  16.5
2  2002    Ohio  3.6  16.5
3  2001  Nevada  2.4  16.5
4  2002  Nevada  2.9  16.5
5  2003  Nevada  3.2  16.5


Or you can assign an array of values, one per row.

In [94]:
frame2["debt"] = np.arange(6.)
frame2

   year   state  pop  debt
0  2000    Ohio  1.5   0.0
1  2001    Ohio  1.7   1.0
2  2002    Ohio  3.6   2.0
3  2001  Nevada  2.4   3.0
4  2002  Nevada  2.9   4.0
5  2003  Nevada  3.2   5.0


Do keep in mind that when assigning lists or arrays to a column the length must match the length of the DataFrame.
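A minimal sketch of the length rule, using a smaller made-up frame: assigning a list that is too short raises a `ValueError` and the column is not created.

```python
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Ohio", "Nevada"],
                      "pop": [1.5, 1.7, 2.4]})

error_message = None
try:
    frame["debt"] = [1.0, 2.0]     # 2 values for a 3-row DataFrame
except ValueError as err:
    error_message = str(err)

print(error_message)
print("debt" in frame.columns)
```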

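If you assign a Series, its labels will be realigned to the DataFrame's index, inserting missing values at any index label not present in the Series. In the example below none of the labels in `val` ("two", "four", "five") match the DataFrame's integer index, so every value in `debt` ends up missing.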

In [96]:
val = pd.Series([-1.2, -1.5, -1.7], index=["two", "four", "five"])
frame2["debt"] = val
frame2

   year   state  pop  debt
0  2000    Ohio  1.5   NaN
1  2001    Ohio  1.7   NaN
2  2002    Ohio  3.6   NaN
3  2001  Nevada  2.4   NaN
4  2002  Nevada  2.9   NaN
5  2003  Nevada  3.2   NaN
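For comparison, here is a sketch where the Series labels do match some of the DataFrame's index. The integer labels 1, 3 and 4 are my own choice for illustration; only those rows receive values and the rest stay NaN.

```python
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
                      "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]})

# Labels 1, 3 and 4 exist in the frame's default integer index.
val = pd.Series([-1.2, -1.5, -1.7], index=[1, 3, 4])
frame["debt"] = val
print(frame)
```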


Assigning to a column that doesn't exist will create a new column. Dot notation CANNOT CREATE NEW COLUMNS; it only works for reading columns that already exist.
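A quick sketch of the difference, using a small made-up frame. Assigning through an attribute only sets a plain Python attribute on the object (pandas emits a warning about it, which is silenced here), while bracket notation actually adds the column.

```python
import warnings

import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada"]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")            # pandas warns about this pattern
    frame.eastern = frame["state"] == "Ohio"   # sets an attribute, not a column
attribute_made_column = "eastern" in frame.columns

frame["eastern"] = frame["state"] == "Ohio"    # bracket notation adds the column
print(attribute_made_column, list(frame.columns))
```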

The 'del' keyword will delete columns like with a dictionary. As an example, I first add a new column of Boolean values where the state column equals "Ohio":

In [100]:
frame2["eastern"] = frame2["state"] == "Ohio"
frame2

Unnamed: 0,year,state,pop,debt,eastern
0,2000,Ohio,1.5,,True
1,2001,Ohio,1.7,,True
2,2002,Ohio,3.6,,True
3,2001,Nevada,2.4,,False
4,2002,Nevada,2.9,,False
5,2003,Nevada,3.2,,False


In [101]:
del frame2["eastern"]
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

- The column returned from indexing a DataFrame is a view on the underlying data, not a copy. Thus, any in-place modifications to the Series will be reflected in the DataFrame. The column can be explicitly copied with the Series’s copy method.
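A minimal sketch of the safe route, on a made-up two-row frame: changes made to an explicit copy never reach the DataFrame.

```python
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada"], "pop": [1.5, 2.4]})

pop_copy = frame["pop"].copy()   # independent copy of the column
pop_copy.iloc[0] = 99.0          # modify the copy only

print(pop_copy.tolist())
print(frame["pop"].tolist())     # the DataFrame is unchanged
```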

Nested dicts are common too. If a nested dict is passed to DataFrame, pandas will interpret the outer dict keys as the columns and the inner keys as the row index.

In [103]:
populations = {"Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
               "Nevada": {2001: 2.4, 2002: 2.9}}
frame3 = pd.DataFrame(populations)
frame3

      Ohio  Nevada
2000   1.5     NaN
2001   1.7     2.4
2002   3.6     2.9


You can transpose the DataFrame (swap rows and columns) with similar syntax to a NumPy array:

- Note that transposing discards the column data types if the columns do not all have the same data type, so transposing and then transposing back may lose the previous type information. The columns become arrays of pure Python objects in this case.



In [104]:
frame3.T

        2000  2001  2002
Ohio     1.5   1.7   3.6
Nevada   NaN   2.4   2.9
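frame3 is all floats, so it survives a round trip. To see the dtype loss mentioned in the note, here is a sketch with a made-up frame that mixes integers and strings (`mixed` and `round_trip` are illustrative names):

```python
import pandas as pd

mixed = pd.DataFrame({"year": [2000, 2001], "state": ["Ohio", "Nevada"]})

round_trip = mixed.T.T           # transpose, then transpose back
print(mixed.dtypes)              # "year" starts out as an integer column
print(round_trip.dtypes)         # after the round trip it is a column of objects
```

The values are still there, but `year` would need an explicit conversion (for example with `astype`) to become an integer column again.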


The keys in the inner dictionaries are combined to form the index in the result. But you can give it an explicit index.

In [105]:
pd.DataFrame(populations, index=[2001, 2002, 2003])

      Ohio  Nevada
2001   1.7     2.4
2002   3.6     2.9
2003   NaN     NaN


Dictionaries of Series are treated in much the same way:

In [107]:
pdata = {"Ohio": frame3["Ohio"][:-1],
        "Nevada": frame3["Nevada"][:2]}
pd.DataFrame(pdata)

      Ohio  Nevada
2000   1.5     NaN
2001   1.7     2.4


If a DataFrame’s `index` and `columns` have their `name` attributes set, these will also be displayed:

In [108]:
frame3.index.name = "year"
frame3.columns.name = "state"
frame3


state  Ohio  Nevada
year
2000    1.5     NaN
2001    1.7     2.4
2002    3.6     2.9


Unlike Series, DataFrame does NOT have its own `name` attribute. To get the data out, use the `to_numpy` method, which returns it as a two-dimensional ndarray.
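A short sketch of both points. frame3 is rebuilt from the populations dict so the block runs on its own, and the second frame `mixed` is made up to show what happens when the columns have different dtypes.

```python
import pandas as pd

populations = {"Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
               "Nevada": {2001: 2.4, 2002: 2.9}}
frame3 = pd.DataFrame(populations)

has_name = hasattr(frame3, "name")    # a DataFrame has no `name` attribute
values = frame3.to_numpy()            # 2-D array: 3 rows by 2 columns
print(has_name, values.shape, values.dtype)

# With mixed column dtypes, the array dtype is chosen to hold everything.
mixed = pd.DataFrame({"year": [2000, 2001], "state": ["Ohio", "Nevada"]})
mixed_values = mixed.to_numpy()
print(mixed_values.dtype)
```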
