# Series

In [1]:
import pandas as pd

In [2]:
import numpy as np

In [3]:
from pandas import Series, DataFrame

In [4]:
obj = pd.Series([4,7,-5,3])

In [5]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

The result of the .array attribute is a PandasArray which usually wraps a NumPy array but can also contain special array types.

In [6]:
obj.array

<PandasArray>
[4, 7, -5, 3]
Length: 4, dtype: int64

In [7]:
obj.index

RangeIndex(start=0, stop=4, step=1)
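As a small sketch of how the two pieces relate, the values and the index can each be pulled out on their own. The to_numpy method is a standard Series method that the text above does not mention, and the variable names here are purely illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

# to_numpy() returns the underlying values as a plain NumPy array
values = obj.to_numpy()

# The default index is a RangeIndex; list() materializes its labels
index_labels = list(obj.index)

print(type(values).__name__, values.tolist())
print(index_labels)
```

Keeping the values and labels separate like this is mostly useful when handing data to code that expects bare NumPy arrays.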

Often, you'll want to create a Series with an index identifying each data point with a label:

In [8]:
obj2 = pd.Series([4, 7, -5, 3], index = ["d", "b", "a", "c"])

In [9]:
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [10]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

Compared with NumPy arrays, you can use labels in the index when selecting single values or a set of values:

In [11]:
obj2["a"]

-5

In [12]:
obj2["d"]=6

In [13]:
obj2[["c","a","d"]]

c    3
a   -5
d    6
dtype: int64

Here ["c","a","d"] is interpreted as a list of indices, even though it contains strings instead of integers.

In [14]:
obj2[obj2>0]

d    6
b    7
c    3
dtype: int64

In [15]:
obj2 * 2

d    12
b    14
a   -10
c     6
dtype: int64

In [16]:
np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

Another way to think about a Series is as a fixed-length, ordered dictionary, as it is a mapping of index values to data values. It can be used in many contexts where you might use a dictionary:

In [17]:
"b" in obj2

True

In [18]:
"e" in obj2

False
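The dictionary analogy can be pushed a little further. The sketch below assumes the get and items methods of Series, which behave much like their dict counterparts but are not shown in the text above; the variable names are illustrative.

```python
import pandas as pd

obj2 = pd.Series([6, 7, -5, 3], index=["d", "b", "a", "c"])

# Membership tests look at the index labels, not the values
has_b = "b" in obj2

# get() returns a default instead of raising KeyError for a missing label
missing_default = obj2.get("e", default=0)

# items() yields (label, value) pairs in index order
pairs = list(obj2.items())

print(has_b, missing_default)
print(pairs)
```

Note that, as with a dict, `7 in obj2` would test for a label named 7, not for the value 7.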

Should you have data contained in a Python dictionary, you can create a Series from it by passing the dictionary:

In [19]:
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}

In [20]:
obj3 = pd.Series(sdata)

In [21]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

A Series can be converted back to a dictionary with its to_dict method:

In [22]:
obj3.to_dict()

{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

When you pass only a dictionary, the index in the resulting Series will respect the order of the keys according to the dictionary's keys method, which depends on the key insertion order. You can override this by passing an index with dictionary keys in the order you want them to appear in the resulting Series:

In [23]:
states = ["California", "Ohio", "Oregon", "Texas"]

In [24]:
obj4 = pd.Series(sdata, index = states)

In [25]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

Here, three values found in sdata were placed in the appropriate locations, but since no value for "California" was found, it appears as NaN (Not a Number), which pandas uses to mark missing or NA values. Since "Utah" was not included in states, it is excluded from the resulting object.

The isna and notna functions in pandas should be used to detect missing data: 

In [26]:
pd.isna(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [27]:
pd.notna(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

Series also has these as instance methods: 

In [28]:
obj4.isna()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
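Once missing entries are detected, the Boolean result can be used like any other mask. The following is a minimal sketch; fillna is a standard Series method that the text above does not cover, and the choice of 0 as a replacement value is only an assumption for illustration.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj4 = pd.Series(sdata, index=["California", "Ohio", "Oregon", "Texas"])

# Boolean mask marking the missing entries
missing_mask = obj4.isna()

# Keep only the entries that have a value
present_only = obj4[obj4.notna()]

# Replace missing entries with a placeholder (0 is an arbitrary choice here)
filled = obj4.fillna(0)

print(present_only)
print(filled)
```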

A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations:

In [29]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [30]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [31]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
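Labels present in only one operand produce NaN in the aligned result. If that is not wanted, one option is the add method with a fill_value, sketched below. This method is not used in the text above, and treating an absent state as 0 is an assumption made only for this example.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=["California", "Ohio", "Oregon", "Texas"])

# The + operator: any label missing on either side becomes NaN
plain_sum = obj3 + obj4

# add() with fill_value: a label missing on one side is treated as 0,
# but a label missing on both sides still yields NaN
filled_sum = obj3.add(obj4, fill_value=0)

print(plain_sum)
print(filled_sum)
```

Here "Utah" keeps its value from obj3, while "California" stays missing because neither Series has data for it.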

Both the Series object itself and its index have a name attribute, which integrates with other areas of pandas functionality:

In [32]:
obj4.name = "population"

In [33]:
obj4.index.name = "state"

In [34]:
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

A Series's index can be altered in place by assignment: 

In [35]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [36]:
obj.index = ["Bob", "Steve", "Jeff", "Ryan"]

In [37]:
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

# DataFrame

There are many ways to construct a DataFrame, though one of the most common is from a dictionary of equal-length lists or NumPy arrays:

In [38]:
data = {"state": ["Ohio","ohio","ohio","Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
       }

In [39]:
frame = pd.DataFrame(data)

In [40]:
frame

    state  year  pop
0    Ohio  2000  1.5
1    ohio  2001  1.7
2    ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


For large DataFrames, the head method selects only the first five rows; similarly, tail returns the last five rows:

In [41]:
frame.tail()

    state  year  pop
1    ohio  2001  1.7
2    ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


If you specify a sequence of columns, the DataFrame's columns will be arranged in that order:

In [42]:
pd.DataFrame(data, columns=["year", "state", "pop"])

   year   state  pop
0  2000    Ohio  1.5
1  2001    ohio  1.7
2  2002    ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2


If you pass a column that isn't contained in the dictionary, it will appear with missing values in the result:

In [43]:
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])

In [44]:
frame2

   year   state  pop debt
0  2000    Ohio  1.5  NaN
1  2001    ohio  1.7  NaN
2  2002    ohio  3.6  NaN
3  2001  Nevada  2.4  NaN
4  2002  Nevada  2.9  NaN
5  2003  Nevada  3.2  NaN


In [45]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

A column in a DataFrame can be retrieved as a Series either by dictionary-like notation or by using the dot attribute notation:

In [46]:
frame2["state"]

0      Ohio
1      ohio
2      ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object

In [47]:
frame2.year

0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64
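The two notations are not always interchangeable. In the sketch below (a reduced version of the frame above, with illustrative variable names), the "pop" column happens to share its name with the DataFrame.pop method, so attribute access returns the method rather than the column, while bracket notation still works.

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001], "pop": [1.5, 1.7]})

# "year" does not clash with any DataFrame attribute, so both forms agree
year_by_attr = frame.year
year_by_key = frame["year"]

# "pop" is also the name of a DataFrame method, so the attribute
# lookup finds the method, not the column
pop_attr = frame.pop
pop_column = frame["pop"]

print(callable(pop_attr))
print(pop_column.tolist())
```

Bracket notation works for any column name, which makes it the safer default.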

Rows can also be retrieved by position or name with the special iloc and loc attributes: 

In [48]:
frame2.loc[1]

year     2001
state    ohio
pop       1.7
debt      NaN
Name: 1, dtype: object

In [49]:
frame2.iloc[2]

year     2002
state    ohio
pop       3.6
debt      NaN
Name: 2, dtype: object
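With the default integer index, loc and iloc are easy to confuse because labels and positions coincide. A minimal sketch with a non-default index (the years used as labels here are an assumption for illustration) makes the difference visible:

```python
import pandas as pd

frame = pd.DataFrame({"pop": [1.5, 1.7, 3.6]}, index=[2000, 2001, 2002])

# loc selects by index label
row_by_label = frame.loc[2001]

# iloc selects by integer position, regardless of the labels
row_by_position = frame.iloc[1]

# A position used as a label fails: there is no row labelled 1
try:
    frame.loc[1]
    label_one_exists = True
except KeyError:
    label_one_exists = False

print(row_by_label["pop"], row_by_position["pop"], label_one_exists)
```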

Columns can be modified by assignment.

In [50]:
frame2["debt"] = 16.5

In [51]:
frame2

   year   state  pop  debt
0  2000    Ohio  1.5  16.5
1  2001    ohio  1.7  16.5
2  2002    ohio  3.6  16.5
3  2001  Nevada  2.4  16.5
4  2002  Nevada  2.9  16.5
5  2003  Nevada  3.2  16.5


In [52]:
frame2["debt"] = np.arange(6.)

In [53]:
frame2

   year   state  pop  debt
0  2000    Ohio  1.5   0.0
1  2001    ohio  1.7   1.0
2  2002    ohio  3.6   2.0
3  2001  Nevada  2.4   3.0
4  2002  Nevada  2.9   4.0
5  2003  Nevada  3.2   5.0


When you are assigning lists or arrays to a column, the value's length must match the length of the DataFrame. If you assign a Series, its labels will be realigned exactly to the DataFrame's index, inserting missing values at any index labels not present in the Series:

In [54]:
val = pd.Series([-1.2, -1.5, -1.7], index=[2,4,5])

In [55]:
frame2["debt"] = val

In [56]:
frame2

   year   state  pop  debt
0  2000    Ohio  1.5   NaN
1  2001    ohio  1.7   NaN
2  2002    ohio  3.6  -1.2
3  2001  Nevada  2.4   NaN
4  2002  Nevada  2.9  -1.5
5  2003  Nevada  3.2  -1.7
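The contrast between the two assignment rules can be sketched on a smaller frame. The frame and values below are made up for illustration: a list of the wrong length is rejected with a ValueError, while a shorter Series is accepted and aligned by label.

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001, 2002]})

# A list must match the number of rows; two values for three rows fails
try:
    frame["debt"] = [1.0, 2.0]
    length_error = None
except ValueError as exc:
    length_error = str(exc)

# A Series is aligned on the index instead; unmatched rows become NaN
frame["debt"] = pd.Series([-1.2], index=[1])

print(length_error)
print(frame)
```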


Assigning a column that doesn't exist will create a new column.

The del keyword will delete columns, as with a dictionary.

In [57]:
# As an example, first add a new column of Boolean values marking where the state column equals "ohio" (the comparison is case-sensitive)
frame2["eastern"] = frame2["state"] == "ohio"

In [58]:
frame2

   year   state  pop  debt  eastern
0  2000    Ohio  1.5   NaN    False
1  2001    ohio  1.7   NaN     True
2  2002    ohio  3.6  -1.2     True
3  2001  Nevada  2.4   NaN    False
4  2002  Nevada  2.9  -1.5    False
5  2003  Nevada  3.2  -1.7    False


Caution: New columns cannot be created with the frame2.eastern dot attribute notation.
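A short sketch of what happens when attribute assignment is tried anyway, using a made-up two-row frame. pandas stores the value as an ordinary Python attribute on the object (and typically emits a warning, suppressed here), so no column appears; bracket assignment is what actually creates one.

```python
import warnings

import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada"]})

# Attribute assignment sets a plain attribute; pandas warns about it,
# so the warning is silenced for this demonstration
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    frame.eastern = frame["state"] == "Ohio"

created_by_attribute = "eastern" in frame.columns

# Bracket assignment creates the column as expected
frame["eastern"] = frame["state"] == "Ohio"
created_by_brackets = "eastern" in frame.columns

print(created_by_attribute, created_by_brackets)
```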

The del keyword can then be used to remove this column:

In [59]:
del frame2["eastern"]

In [60]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

Caution: The column returned from indexing a DataFrame is a view on the underlying data, not a copy. Thus, any in-place modifications to the Series will be reflected in the DataFrame. The column can be explicitly copied with the Series's copy method.
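The explicit copy can be sketched as follows, on a made-up frame. Only the copied case is demonstrated, because whether an uncopied column really behaves as a view may depend on the pandas version and its copy-on-write settings.

```python
import pandas as pd

frame = pd.DataFrame({"pop": [1.5, 1.7, 3.6]})

# copy() gives an independent Series that shares no data with the frame
pop_copy = frame["pop"].copy()

# Modifying the copy in place leaves the DataFrame untouched
pop_copy.iloc[0] = 99.0

print(pop_copy.tolist())
print(frame["pop"].tolist())
```

Copying up front is the simplest way to make sure later edits to the Series cannot leak back into the DataFrame.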

Another common form of data is a nested dictionary of dictionaries: 

In [61]:
populations = {"Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
               "Nevada": {2001: 2.4, 2002: 2.9}}

If the nested dictionary is passed to DataFrame, pandas will interpret the outer dictionary keys as the columns, and the inner keys as the row indices:

In [62]:
frame3 = pd.DataFrame(populations)

In [63]:
frame3

      Ohio  Nevada
2000   1.5     NaN
2001   1.7     2.4
2002   3.6     2.9


You can transpose the DataFrame (swap rows and columns) with syntax similar to a NumPy array:

In [64]:
frame3.T

        2000  2001  2002
Ohio     1.5   1.7   3.6
Nevada   NaN   2.4   2.9


Note that transposing discards the column data types if the columns do not all have the same data type, so transposing and then transposing back may lose the previous type information. The columns become arrays of pure Python objects in this case.
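A minimal sketch of that loss of type information, using a made-up frame with one integer column and one string column:

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001], "state": ["Ohio", "Nevada"]})

# Before transposing, the integer column has an integer dtype
dtype_before = frame["year"].dtype

# Transposing mixes integers and strings in each column, so the columns
# fall back to generic Python objects; transposing back does not undo this
round_trip = frame.T.T
dtype_after = round_trip["year"].dtype

print(dtype_before, dtype_after)
```

The values themselves survive the round trip; only the column's dtype has changed.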

The keys in the inner dictionaries are combined to form the index in the result. This isn't true if an explicit index is specified:

In [65]:
pd.DataFrame(populations, index=[2001,2002,2003])

      Ohio  Nevada
2001   1.7     2.4
2002   3.6     2.9
2003   NaN     NaN


Dictionaries of Series are treated in much the same way: 

In [66]:
pdata = { "ohio": frame3["Ohio"][:-1],
          "Nevada": frame3["Nevada"][:2]}

In [67]:
pd.DataFrame(pdata)

      ohio  Nevada
2000   1.5     NaN
2001   1.7     2.4


If a DataFrame's index and columns have their name attributes set, these will also be displayed: 

In [68]:
frame3.index.name = "year"

In [69]:
frame3.columns.name = "state"

In [70]:
frame3

state  Ohio  Nevada
year
2000    1.5     NaN
2001    1.7     2.4
2002    3.6     2.9


Unlike Series, DataFrame does not have a name attribute. DataFrame's to_numpy method returns the data contained in the DataFrame as a two-dimensional ndarray:

In [71]:
frame3.to_numpy()

array([[1.5, nan],
       [1.7, 2.4],
       [3.6, 2.9]])

If the DataFrame's columns are different data types, the data type of the returned array will be chosen to accommodate all of the columns:

In [72]:
frame2.to_numpy()

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'ohio', 1.7, nan],
       [2002, 'ohio', 3.6, -1.2],
       [2001, 'Nevada', 2.4, nan],
       [2002, 'Nevada', 2.9, -1.5],
       [2003, 'Nevada', 3.2, -1.7]], dtype=object)

# Index Objects

In [74]:
obj = pd.Series(np.arange(3), index = ["a","b","c"])

In [75]:
index = obj.index

In [76]:
index

Index(['a', 'b', 'c'], dtype='object')

In [78]:
index[1:]

Index(['b', 'c'], dtype='object')

Index objects are immutable and thus can't be modified by the user.

index[1] = "d"  # raises TypeError
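The failing assignment can be wrapped in a try/except to see the behaviour without stopping a notebook run. This is only an illustrative sketch; the mutated flag is a hypothetical helper variable.

```python
import pandas as pd

index = pd.Index(["a", "b", "c"])

# Item assignment on an Index raises TypeError because Index is immutable
try:
    index[1] = "d"
    mutated = True
except TypeError:
    mutated = False

print(mutated, list(index))
```

Immutability is what makes it safe to share one Index object among several data structures.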

In [79]:
labels = pd.Index(np.arange(3))

In [80]:
labels

Int64Index([0, 1, 2], dtype='int64')

In [81]:
obj2 = pd.Series([1.5, -2.5, 0], index = labels)

In [82]:
obj2 

0    1.5
1   -2.5
2    0.0
dtype: float64

In [84]:
obj2.index is labels

True

Caution: Some users will not often take advantage of the capabilities provided by an Index, but because some operations will yield results containing indexed data, it's important to understand how they work.

In [None]:
frame3

In [86]:
frame3.columns

Index(['Ohio', 'Nevada'], dtype='object', name='state')

In [87]:
"Ohio" in frame3.columns

True

In [88]:
2003 in frame3.index 

False

Unlike Python sets, a pandas Index can contain duplicate labels:

In [89]:
pd.Index(["foo","foo","bar","bar"])

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
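A brief sketch of what duplicates mean in practice, assuming a made-up Series built on the same labels. The is_unique attribute reports whether any label repeats, and selecting a repeated label returns every matching entry rather than a single value.

```python
import pandas as pd

dup_index = pd.Index(["foo", "foo", "bar", "bar"])

# is_unique is False as soon as any label occurs more than once
all_unique = dup_index.is_unique

# Selecting a duplicated label returns a Series with all matches
values = pd.Series([1, 2, 3, 4], index=dup_index)
foo_values = values["foo"]

print(all_unique)
print(foo_values.tolist())
```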

In [None]:
Selections wit