**5.1: Introduction to *pandas* data structures**

**Series**

In [2]:
import pandas as pd
obj = pd.Series([1, 2, 3, 4, 5])
obj

0    1
1    2
2    3
3    4
4    5
dtype: int64

If you do not specify an index, a Series gets a default index of the integers 0 through n - 1, where n is the length of the data

You can get the data and the index using *.array* and *.index* respectively

In [3]:
obj.array

<NumpyExtensionArray>
[1, 2, 3, 4, 5]
Length: 5, dtype: int64

In [4]:
obj.index

RangeIndex(start=0, stop=5, step=1)

You can also supply your own index labels for a Series by passing the *index=* argument

In [5]:
obj2 = pd.Series([4, 7, -5, 3], index=["a", "b", "c", "d"])
obj2

a    4
b    7
c   -5
d    3
dtype: int64

In [6]:
obj2.array

<NumpyExtensionArray>
[4, 7, -5, 3]
Length: 4, dtype: int64

In [7]:
obj2.index

Index(['a', 'b', 'c', 'd'], dtype='object')

Unlike NumPy arrays, a Series lets you use index labels to select a single value or, with a list of labels, several values

In [9]:
obj2['a']

4

In [10]:
obj2[["a", "b", "c"]]

a    4
b    7
c   -5
dtype: int64

You can also apply NumPy functions or NumPy-like operations (filtering with a Boolean array, scalar multiplication, math functions), and these preserve the index-value link

In [11]:
obj2[obj2 > 0]

a    4
b    7
d    3
dtype: int64

In [13]:
obj2 * 2

a     8
b    14
c   -10
d     6
dtype: int64

In [15]:
import numpy as np
np.exp(obj2)

a      54.598150
b    1096.633158
c       0.006738
d      20.085537
dtype: float64

A Series can also be thought of as a fixed-length, ordered dictionary, since it maps index labels to data values and can be used in many contexts where a dictionary would be used

In [16]:
'b' in obj2

True

In [17]:
'e' in obj2

False
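One caveat of the dictionary analogy is worth a short sketch: as with a dict, `in` tests the index labels (the keys), not the stored values. The snippet below rebuilds `obj2` so that it runs on its own; `get` and `items` are the Series counterparts of the dict methods of the same name.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=["a", "b", "c", "d"])

# `in` checks the index labels, just as it checks the keys of a dict
label_found = "b" in obj2
value_found = 7 in obj2          # 7 is a value, not a label

# To test membership among the values, look at the underlying data
value_in_data = 7 in obj2.to_numpy()

# Like dict.get, Series.get returns a default for a missing label
missing = obj2.get("e", 0)

# items() yields (label, value) pairs, like dict.items()
pairs = list(obj2.items())
```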

If you have a preexisting Python dictionary, you can pass it to *pd.Series* to get a Series

In [18]:
popdata = {'Ohio' : 35000, "Texas" : 71000, "Oregon" : 16000, "Utah" : 5000}
obj3 = pd.Series(popdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

This Series can be converted back into a dictionary by using *to_dict*

In [19]:
obj3.to_dict()

{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [21]:
states = ['California', 'Ohio', 'Oregon', "Texas"]
obj4 = pd.Series(popdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

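Passing an index together with a dictionary selects and orders the entries by those labels. Since no value for California was found, it appears as NaN (Not a Number), which pandas uses to mark missing values. Utah is left out because it is not in *states*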

The *isna* and *notna* functions can be used to find missing values

In [22]:
pd.isna(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [23]:
pd.notna(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

Series also have these as instance methods

In [24]:
obj4.isna()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

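You can do arithmetic with multiple Series; the operands are aligned by index label automatically, and a label that is missing from either Series produces NaN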

In [25]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
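If the NaN for a label present on only one side is not what you want, the `add` method with `fill_value` is one option. This is a minimal sketch that rebuilds `obj3` and `obj4` from above; the choice of `0` as the fill value is an assumption made for illustration.

```python
import pandas as pd

popdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(popdata)
obj4 = pd.Series(popdata, index=["California", "Ohio", "Oregon", "Texas"])

# Plain + aligns on labels; a label missing from either side gives NaN
total = obj3 + obj4

# add() with fill_value substitutes 0 where only one side has a value.
# California stays NaN because it is missing on both sides
# (absent from obj3, NaN in obj4).
filled = obj3.add(obj4, fill_value=0)
```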

Both the Series object itself and its index have a *name* attribute

In [27]:
obj4.name = 'population'
obj4.index.name = 'state'

In [28]:
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

A Series's index can be altered in place by assignment

In [29]:
obj

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [31]:
obj.index = ["bob", "steve", "jeff", "ryan", "bill"]
obj

bob      1
steve    2
jeff     3
ryan     4
bill     5
dtype: int64

**The Pandas DataFrame**

There are many ways to make a DataFrame; one of the most common is from a dictionary of equal-length lists or NumPy arrays

In [32]:
data = {"state" : ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year" : [2000, 2001, 2002, 2001, 2002, 2003],
        "pop" : [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
        }
frame = pd.DataFrame(data)
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


For large DataFrames, *.head()* returns only the first 5 rows by default, and *.tail()* returns the last 5 rows.

In [33]:
frame.head()

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


In [38]:
frame.tail()

    state  year  pop
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
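Both methods also accept a row count. A small sketch, rebuilding the same `frame` so the snippet is self-contained:

```python
import pandas as pd

frame = pd.DataFrame({
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
})

# Pass n to override the default of 5 rows
first_two = frame.head(2)     # rows labeled 0 and 1
last_three = frame.tail(3)    # rows labeled 3, 4 and 5
```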


The *columns* argument sets the order of the columns. If you pass a column name that is not a key in the dictionary, it will appear with missing values (NaN)

In [41]:
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])
frame2

   year   state  pop debt
0  2000    Ohio  1.5  NaN
1  2001    Ohio  1.7  NaN
2  2002    Ohio  3.6  NaN
3  2001  Nevada  2.4  NaN
4  2002  Nevada  2.9  NaN
5  2003  Nevada  3.2  NaN


You can get columns from DataFrames by using dictionary-like notation, or by using the dot-attribute method

In [42]:
frame2['state']

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object

In [43]:
frame2.state

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object

In some cases, the dot-attribute method cannot be used, for example when the column's name contains whitespace or symbols, or when it conflicts with a DataFrame method name. In those cases, use the dictionary notation, like obj[column_name]
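The `pop` column used in these notes is itself such a case, because DataFrame has a `pop` method. A short sketch of both failure modes; the `net worth` column is a hypothetical name added only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"pop": [1.5, 1.7], "net worth": [10.0, 20.0]})

# "pop" is also a DataFrame method, so attribute access returns
# the bound method rather than the column
attr_is_method = callable(df.pop)

# Dictionary notation always returns the column
pop_column = df["pop"]

# A name containing a space cannot be written as an attribute at all
net_worth = df["net worth"]
```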

Rows can be retrieved by using the *iloc* and *loc* attributes

In [47]:
frame2.loc[1]

year     2001
state    Ohio
pop       1.7
debt      NaN
Name: 1, dtype: object

*loc* and *iloc* differ in what they look up: *loc* selects rows by index label, while *iloc* selects rows by integer position. With the default integer index, labels and positions happen to coincide

In [48]:
frame2.iloc[1:3]

   year  state  pop debt
1  2001   Ohio  1.7  NaN
2  2002   Ohio  3.6  NaN
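The difference between the two is easier to see when the labels do not coincide with the positions. A minimal sketch, assuming a hypothetical frame whose index is 10, 20, 30:

```python
import pandas as pd

# Integer labels that deliberately differ from the row positions
df = pd.DataFrame({"pop": [1.5, 1.7, 3.6]}, index=[10, 20, 30])

by_label = df.loc[10]       # the row labeled 10 (the first row)
by_position = df.iloc[1]    # the second row, whatever its label

# Slices differ too: loc includes the end label,
# iloc excludes the end position
label_slice = df.loc[10:20]       # rows labeled 10 and 20
position_slice = df.iloc[0:1]     # only the first row
```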


Columns in a DataFrame can be modified by assignment, either by a single scalar value, or an array of values 

In [49]:
frame2['debt'] = 67
frame2

   year   state  pop  debt
0  2000    Ohio  1.5    67
1  2001    Ohio  1.7    67
2  2002    Ohio  3.6    67
3  2001  Nevada  2.4    67
4  2002  Nevada  2.9    67
5  2003  Nevada  3.2    67


In [51]:
frame2.debt = np.arange(6.)
frame2

   year   state  pop  debt
0  2000    Ohio  1.5   0.0
1  2001    Ohio  1.7   1.0
2  2002    Ohio  3.6   2.0
3  2001  Nevada  2.4   3.0
4  2002  Nevada  2.9   4.0
5  2003  Nevada  3.2   5.0


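When assigning a list or array to a column, its length must match the length of the DataFrame. If you assign a Series instead, its labels are realigned to the DataFrame's index, and NaN is inserted for any index value that is not present in the Series. In the example below, none of the labels in *val* match frame2's integer index, so every value in *debt* becomes NaN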

In [52]:
val = pd.Series([-1.2, -1.5, -1.7], index = ['two', 'four', 'five'])
frame2.debt = val
frame2

   year   state  pop  debt
0  2000    Ohio  1.5   NaN
1  2001    Ohio  1.7   NaN
2  2002    Ohio  3.6   NaN
3  2001  Nevada  2.4   NaN
4  2002  Nevada  2.9   NaN
5  2003  Nevada  3.2   NaN
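For contrast, here is a sketch in which some labels do match. The frame and its string index are hypothetical, chosen only to show how a Series is aligned on assignment and how a list of the wrong length is rejected.

```python
import pandas as pd

frame = pd.DataFrame(
    {"year": [2000, 2001, 2002], "pop": [1.5, 1.7, 3.6]},
    index=["one", "two", "three"],
)

# A Series is aligned on labels: "two" matches, "four" is ignored,
# and rows without a matching label get NaN
frame["debt"] = pd.Series([-1.2, -1.5], index=["two", "four"])

# A list has no labels, so its length must equal the number of rows
try:
    frame["debt2"] = [1.0, 2.0]
    length_error = None
except ValueError as exc:
    length_error = exc
```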


Assigning a column that does not exist will create a new column

In [53]:
frame2['eastern'] = frame2.state == 'Ohio'
frame2

   year   state  pop  debt  eastern
0  2000    Ohio  1.5   NaN     True
1  2001    Ohio  1.7   NaN     True
2  2002    Ohio  3.6   NaN     True
3  2001  Nevada  2.4   NaN    False
4  2002  Nevada  2.9   NaN    False
5  2003  Nevada  3.2   NaN    False


You can use the *del* keyword to delete columns, as you would with a dictionary

In [55]:
del frame2['eastern']

In [56]:
frame2

   year   state  pop  debt
0  2000    Ohio  1.5   NaN
1  2001    Ohio  1.7   NaN
2  2002    Ohio  3.6   NaN
3  2001  Nevada  2.4   NaN
4  2002  Nevada  2.9   NaN
5  2003  Nevada  3.2   NaN
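`del` modifies the DataFrame in place. If you would rather keep the original intact, the `drop` method is an alternative that returns a new object. A minimal sketch on a small stand-in frame:

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001], "state": ["Ohio", "Nevada"]})
frame["eastern"] = frame["state"] == "Ohio"

# drop returns a new DataFrame and leaves the original untouched
trimmed = frame.drop(columns=["eastern"])
still_in_original = "eastern" in frame.columns

# del removes the column from the original object itself
del frame["eastern"]
```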


Another common form of data is a nested dictionary of dictionaries

In [58]:
populations = {"Ohio" : {2000: 1.5, 2001: 1.7, 2002: 3.6},
               "Nevada" : {2001 : 2.4, 2002: 2.9}}
frame3 = pd.DataFrame(populations)
frame3

      Ohio  Nevada
2000   1.5     NaN
2001   1.7     2.4
2002   3.6     2.9


When using this, the outer dict keys are interpreted as the columns, and the inner keys are interpreted as the row indices

You can transpose a DataFrame (swap its rows and columns) with the *T* attribute. Note: transposing discards the column data types when the columns do not all share the same type, so transposing a DataFrame and then transposing it back may lose the original type information. When this happens, the columns become arrays of pure Python objects

In [59]:
frame3.T

        2000  2001  2002
Ohio     1.5   1.7   3.6
Nevada   NaN   2.4   2.9
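frame3 holds only floats, so its types survive transposition. The type loss mentioned in the note shows up with mixed columns, as in this sketch on a hypothetical two-column frame. `infer_objects` is one way to ask pandas to re-infer the types afterwards.

```python
import pandas as pd

df = pd.DataFrame({"year": [2000, 2001], "state": ["Ohio", "Nevada"]})

# Transposing mixes integers and strings in every column, so each
# column falls back to the generic object dtype
round_trip = df.T.T
year_dtype_after = round_trip["year"].dtype

# infer_objects() attempts to recover more specific dtypes
restored = round_trip.infer_objects()
```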


If a DataFrame's index and columns have their *name* attributes set, these will also be displayed when printing out the DataFrame

In [61]:
frame3.index.name = "year"
frame3.columns.name = "state"

frame3

state  Ohio  Nevada
year
2000    1.5     NaN
2001    1.7     2.4
2002    3.6     2.9


Unlike a Series, a DataFrame does not have a *name* attribute. Its *to_numpy()* method returns only the data contained in the DataFrame, as a two-dimensional ndarray

In [62]:
frame3.to_numpy()

array([[1.5, nan],
       [1.7, 2.4],
       [3.6, 2.9]])

If the columns of a DataFrame have different data types, the data type of the returned array is chosen to accommodate all of the columns (when numbers are mixed with strings, as in frame2, this is the *object* dtype)

In [63]:
frame2.to_numpy()

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, nan],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, nan],
       [2002, 'Nevada', 2.9, nan],
       [2003, 'Nevada', 3.2, nan]], dtype=object)
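A quick way to see which dtype was chosen is to inspect the `dtype` of the returned array. This sketch uses two small hypothetical frames: one purely numeric, one mixing numbers and strings.

```python
import pandas as pd

numeric = pd.DataFrame({"count": [1, 2], "ratio": [1.5, 2.5]})
mixed = pd.DataFrame({"count": [1, 2], "state": ["Ohio", "Nevada"]})

# Integers and floats can share a floating-point array
numeric_dtype = numeric.to_numpy().dtype

# Numbers and strings have no common numeric type, so the
# array falls back to object
mixed_dtype = mixed.to_numpy().dtype
```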

**Index Objects**

pandas Index objects hold the axis labels (including a DataFrame's column names) and other metadata, such as the axis name.

In [64]:
arr = pd.Series(np.arange(3), index=['a', 'b', 'c'])
index = arr.index
index

Index(['a', 'b', 'c'], dtype='object')

Index objects are immutable, so their contents cannot be modified after creation

In [65]:
index[0] = 9

TypeError: Index does not support mutable operations

Immutability makes it safer to share Index objects among data structures

In [66]:
labels = pd.Index(np.arange(3))
labels

Index([0, 1, 2], dtype='int32')

In [69]:
arr2 = pd.Series([1.2, -2.5, 0], index=labels)
arr2

0    1.2
1   -2.5
2    0.0
dtype: float64

In [72]:
arr2.index is labels

True

In addition to being array-like, an index also behaves like a fixed-size set

In [73]:
frame3

state  Ohio  Nevada
year
2000    1.5     NaN
2001    1.7     2.4
2002    3.6     2.9


In [74]:
frame3.columns

Index(['Ohio', 'Nevada'], dtype='object', name='state')

In [75]:
"Ohio" in frame3.columns

True

In [76]:
2003 in frame3.index

False
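Beyond membership tests, an Index also offers set-style methods. A brief sketch with two hypothetical label sets:

```python
import pandas as pd

left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])

common = left.intersection(right)    # labels present in both
combined = left.union(right)         # labels present in either
only_left = left.difference(right)   # labels in left but not in right
```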

But, unlike Python *sets*, pandas *Indexes* can contain duplicate labels

In [79]:
pd.Index(['foo', 'foo', 'bar', 'bar'])

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
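Duplicate labels affect selection: a label that occurs more than once selects every occurrence. A small sketch with a hypothetical Series:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["foo", "foo", "bar"])

# is_unique reports whether any label is repeated
labels_unique = s.index.is_unique

# A repeated label returns a Series holding every match...
foo_values = s["foo"]

# ...while a label that occurs once returns a scalar
bar_value = s["bar"]
```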