In [1]:
import numpy as np
import pandas as pd

# Series 

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call: s = pd.Series(data, index=index)

Here data can be many different things:
- a Python dict
- an ndarray
- a scalar value

In [2]:
#If data is an ndarray, index must be of the same length as data. If no index is passed, one will be created having
#values [0,...,len(data)-1].
s = pd.Series(np.random.randn(5),index=['a','b','c','d','e'])
print(s)
print(s.index)
print(s.values)
#pandas also supports non-unique index values. If an operation that doesn't support duplicate index values is
#attempted, an exception will be raised at that time.

a    0.772090
b    1.922059
c   -0.035424
d   -0.252012
e   -0.484047
dtype: float64
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
[ 0.77209047  1.9220594  -0.03542359 -0.2520119  -0.48404698]
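
The note on non-unique index values can be illustrated with a small sketch. The Series `dup` and its values are made up for illustration; `reindex()` is used here as one example of an operation that rejects duplicate labels.

```python
import pandas as pd

# A Series whose index contains the label 'a' twice (hypothetical data).
dup = pd.Series([1, 2, 3], index=['a', 'a', 'b'])

# Creating it is fine; looking up a duplicated label returns every match
# as a Series rather than a single scalar.
both_a = dup['a']
print(both_a)

# reindex() needs unique labels, so the error only appears at this point.
try:
    dup.reindex(['a', 'b'])
    reindex_failed = False
except ValueError as exc:
    reindex_failed = True
    print('reindex failed:', exc)
```

`dup.index.is_unique` can be checked up front when an operation is known to need unique labels.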


In [3]:
#Series can be instantiated from dicts.
dic = {'b':2,'a':5,'c':8}
s = pd.Series(dic)
print(s)

b    2
a    5
c    8
dtype: int64


+ When the data is a dict and an index is not passed, the Series index follows the dict's insertion order, provided you are using Python >= 3.6 and pandas >= 0.23.
+ If you are using Python < 3.6 or pandas < 0.23 and an index isn't passed, the Series index will be the lexically ordered list of dict keys.

In [4]:
#If an index is passed, the values in data corresponding to the labels in the index will be pulled out.
pd.Series(dic,index=['a','b','c','d']) #NaN is the standard missing data marker in Pandas

a    5.0
b    2.0
c    8.0
d    NaN
dtype: float64

+ If data is a scalar value, an index must be provided. The value will be repeated to match the length of the index.

In [5]:
pd.Series(5,index=['a','b','c'])

a    5
b    5
c    5
dtype: int64

+ While Series is ndarray-like, if you need an actual ndarray, then use Series.to_numpy().

In [6]:
s.to_numpy()

array([2, 5, 8], dtype=int64)

In [8]:
s=pd.Series(dic,index=['a','b','c'])
print(s)
print(s['a'])
#print(s['f']) #A KeyError is raised if the label isn't in the index. With the get() method, a missing label
#returns None or the specified default instead:
print(s.get('f'))
print(s.get('f',np.nan))

a    5
b    2
c    8
dtype: int64
5
None
nan
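
As a small sketch of the dict-like behaviour above, membership tests with `in` look at the index labels, which gives a third way to handle a possibly missing label alongside `[]` and `get()`. The variable names below are illustrative only.

```python
import pandas as pd

s = pd.Series({'a': 5, 'b': 2, 'c': 8})

# `in` checks the index labels, not the values.
has_a = 'a' in s
has_f = 'f' in s

# Bracket lookup of a missing label raises KeyError.
try:
    s['f']
    raised = False
except KeyError:
    raised = True

# get() returns the supplied default instead of raising.
fallback = s.get('f', 0)
print(has_a, has_f, raised, fallback)
```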


## Vectorized operations and label alignment with Series

+ When working with raw NumPy arrays, looping through value-by-value is usually not necessary. The same is true when working with Series in pandas. Series can also be passed into most NumPy methods expecting an ndarray.

In [9]:
s+s

a    10
b     4
c    16
dtype: int64

In [10]:
s*3

a    15
b     6
c    24
dtype: int64

In [11]:
np.exp(s)

a     148.413159
b       7.389056
c    2980.957987
dtype: float64

* A key difference between Series and ndarray is that operations between Series automatically align the data based on label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels.

In [12]:
s[1:]+s[:-1]

a    NaN
b    4.0
c    NaN
dtype: float64

+ The result of an operation between unaligned Series will have the union of the indexes involved. If a label is not found in one Series or the other, the result for that label will be marked as missing (NaN).
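
If the NaN values produced by alignment are not wanted, two common options are to fill the missing side before the operation or to drop the unmatched labels afterwards. The following is a minimal sketch using the same three-element Series as above; `left`, `right` and the other names are chosen here for illustration.

```python
import pandas as pd

s = pd.Series({'a': 5, 'b': 2, 'c': 8})
left, right = s.iloc[1:], s.iloc[:-1]   # labels b,c and a,b

# Plain addition: labels present on only one side become NaN.
aligned = left + right

# add() with fill_value treats a label missing on one side as 0.
filled = left.add(right, fill_value=0)

# dropna() keeps only the labels that were present on both sides.
overlap = aligned.dropna()

print(aligned, filled, overlap, sep='\n\n')
```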

## Name attribute

In [13]:
#Series can also have a name attribute.
s=pd.Series(np.random.randn(5),name='something')
s

0    0.188483
1   -0.924674
2    0.274417
3    2.726852
4   -0.858391
Name: something, dtype: float64

In [14]:
s.name

'something'

+ The Series name will be assigned automatically in many cases, in particular when taking 1D slices of DataFrame.
+ You can rename a Series with the pandas.Series.rename() method.

In [15]:
s2 = s.rename('different')
print(s2.name)
print(s.name)
#Note that s and s2 refer to different objects.

different
something
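
To illustrate the automatic naming mentioned above, here is a short sketch with a made-up two-column DataFrame called `frame`: a column taken out of it carries the column label as its name, and a row carries the row label.

```python
import pandas as pd

frame = pd.DataFrame({'one': [1, 2], 'two': [3, 4]}, index=['x', 'y'])

col = frame['one']      # 1D slice along a column
row = frame.loc['x']    # 1D slice along a row
print(col.name, row.name)

# rename() returns a new Series; the original keeps its name.
renamed = col.rename('uno')
print(renamed.name, col.name)
```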


# DataFrame

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

+ Dict of 1-D ndarrays, lists, dicts or Series
+ 2-D numpy.ndarray
+ Structured or record ndarray
+ A Series
+ Another DataFrame

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame.

## From dict of Series or dicts

In [16]:
d = {'one':pd.Series([1,2,3],index=['a','b','c']),'two':pd.Series([1,2,3,4],index=['a','b','c','d'])}
df = pd.DataFrame(d)
df

Unnamed: 0,one,two
a,1.0,1
b,2.0,2
c,3.0,3
d,,4


+ When no index is specified, the resulting index for a dict of Series is the union of the indexes of the various Series, as in the example above.

+ When the data is a dict and columns is not specified, the DataFrame columns follow the dict’s insertion order, provided you are using Python >= 3.6 and pandas >= 0.23, as in the example above.

+ If you are using Python < 3.6 or pandas < 0.23, and columns is not specified, the DataFrame columns will be the lexically ordered list of dict keys.

In [17]:
pd.DataFrame(d,index=['d','b','a'])

Unnamed: 0,one,two
d,,4
b,2.0,2
a,1.0,1


 + A dict of Series plus a specific index will discard all data not matching the passed index, as above (the row for 'c' has been discarded because the passed index doesn't contain 'c').

In [18]:
pd.DataFrame(d,index=['d','b','a'],columns=['two','three'])

Unnamed: 0,two,three
d,4,
b,2,
a,1,


+ When a particular set of columns is passed along with a dict of data, the passed columns override the keys in the dict as above.
+ The row and column labels can be accessed respectively by accessing the index and columns attributes.

In [19]:
df.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [20]:
df.columns

Index(['one', 'two'], dtype='object')

## From dict of ndarrays/lists

+ The ndarrays must all be the same length. 
+ If an index is passed, it must also be the same length as the arrays.
+ If no index is passed, the result will be range(n), where n is the array length.

In [21]:
d = {'one':[1,2,3,4],'two':[5,6,7,8]}
pd.DataFrame(d)

Unnamed: 0,one,two
0,1,5
1,2,6
2,3,7
3,4,8


In [22]:
pd.DataFrame(d,index=['a','b','c','d'])

Unnamed: 0,one,two
a,1,5
b,2,6
c,3,7
d,4,8
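
The length rules for a dict of arrays or lists can be checked directly. This is a sketch with made-up data; the exact error messages depend on the pandas version, so only the exception type is relied on here.

```python
import pandas as pd

# Lists of different lengths cannot be combined into columns.
try:
    pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5]})
    length_error = None
except ValueError as exc:
    length_error = str(exc)

# An explicit index must have one label per row.
try:
    pd.DataFrame({'one': [1, 2, 3]}, index=['a', 'b'])
    index_error = None
except ValueError as exc:
    index_error = str(exc)

print(length_error)
print(index_error)
```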


## From structured or record array

This case is handled identically to a dict of arrays.

In [23]:
data = np.zeros((2, ),dtype=[('A','i4'),('B','f4'),('C','S10')]) #'S10' is a 10-byte string field (the 'a10' spelling is deprecated in NumPy)
data[:] = [(1,2.3,'Hello'),(2,3.4,'World')]
data

array([(1, 2.3, b'Hello'), (2, 3.4, b'World')],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])

In [24]:
pd.DataFrame(data)

Unnamed: 0,A,B,C
0,1,2.3,b'Hello'
1,2,3.4,b'World'


In [25]:
pd.DataFrame(data,columns=['A','C','B'])

Unnamed: 0,A,C,B
0,1,b'Hello',2.3
1,2,b'World',3.4


## From a list of dicts

In [26]:
data2 = [{'a':1,'b':2},{'a':5,'b':7,'c':14}]
pd.DataFrame(data2)

Unnamed: 0,a,b,c
0,1,2,
1,5,7,14.0


In [27]:
pd.DataFrame(data2,index=['first','second'])

Unnamed: 0,a,b,c
first,1,2,
second,5,7,14.0


In [28]:
pd.DataFrame(data2,columns=['a','c'])

Unnamed: 0,a,c
0,1,
1,5,14.0


## From a dict of tuples

+ We can automatically create a multi-indexed frame by passing a dict whose keys (and nested keys) are tuples.

In [29]:
pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
              ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
              ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})

Unnamed: 0_level_0,Unnamed: 1_level_0,a,a,a,b,b
Unnamed: 0_level_1,Unnamed: 1_level_1,b,a,c,a,b
A,B,1.0,4.0,5.0,8.0,10.0
A,C,2.0,3.0,6.0,7.0,
A,D,,,,,9.0
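
Once a frame has tuple-based (MultiIndex) labels, selections can use either a partial key or a full tuple. The sketch below uses a reduced version of the dict above; the name `mi` is chosen here for illustration.

```python
import pandas as pd

mi = pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
                   ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4}})

# Both axes have two levels.
print(mi.columns.nlevels, mi.index.nlevels)

# A partial key selects everything under the outer column label 'a'.
sub = mi['a']

# Full tuples on both axes address a single cell.
cell = mi.loc[('A', 'B'), ('a', 'b')]
print(sub)
print(cell)
```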


## From a Series

+ The result will be a DataFrame with the same index as the input Series, and with one column whose name is the original name of the Series (only if no other column name provided).

In [30]:
s = pd.Series([1,2,3,4,5],index=['a','b','c','d','e'],name='something')
s

a    1
b    2
c    3
d    4
e    5
Name: something, dtype: int64

In [31]:
pd.DataFrame(s)

Unnamed: 0,something
a,1
b,2
c,3
d,4
e,5


In [32]:
s=pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
pd.DataFrame(s)

Unnamed: 0,0
a,1
b,2
c,3
d,4
e,5
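
An equivalent route from a Series to a one-column DataFrame is Series.to_frame(), which also accepts a replacement column name. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'], name='something')

# Without arguments the Series name becomes the column label.
as_frame = s.to_frame()

# Passing name= overrides it.
renamed = s.to_frame(name='value')

print(list(as_frame.columns), list(renamed.columns))
```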


## Alternate constructors

+ DataFrame.from_dict() - takes a dict of dicts or a dict of array-like sequences and returns a DataFrame. It operates like the DataFrame constructor except for the orient parameter, which is 'columns' by default but can be set to 'index' in order to use the dict keys as row labels.

In [33]:
pd.DataFrame.from_dict(dict([('A',[1,2,3]),('B',[4,5,6])]))

Unnamed: 0,A,B
0,1,4
1,2,5
2,3,6


In [34]:
pd.DataFrame.from_dict(dict([('A',[1,2,3]),('B',[4,5,6])]),orient='index',columns=['one','two','three'])
# When we pass orient='index', the keys will be the row labels. In this case, we can also pass the desired
#column names.

Unnamed: 0,one,two,three
A,1,2,3
B,4,5,6


+ DataFrame.from_records() - takes a list of tuples or an ndarray with structured dtype. It works analogously to the normal DataFrame constructor, except that the resulting DataFrame index may be a specific field of the structured dtype.

In [35]:
data

array([(1, 2.3, b'Hello'), (2, 3.4, b'World')],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])

In [36]:
pd.DataFrame.from_records(data,index='C')

Unnamed: 0_level_0,A,B
C,Unnamed: 1_level_1,Unnamed: 2_level_1
b'Hello',1,2.3
b'World',2,3.4
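
The list-of-tuples input mentioned above works in the same way, as long as column names are supplied so that the index field can be referred to by name. A sketch with made-up records:

```python
import pandas as pd

records = [(1, 2.3, 'Hello'), (2, 3.4, 'World')]

# columns= names the tuple positions; index= then picks one of those names.
by_c = pd.DataFrame.from_records(records, columns=['A', 'B', 'C'], index='C')
print(by_c)
```

The field used as the index is removed from the columns, as in the structured-array example.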


## Column selection, addition, deletion

In [37]:
d

{'one': [1, 2, 3, 4], 'two': [5, 6, 7, 8]}

In [38]:
df=pd.DataFrame(d,index=['a','b','c','d'])
df

Unnamed: 0,one,two
a,1,5
b,2,6
c,3,7
d,4,8


In [39]:
df['one']

a    1
b    2
c    3
d    4
Name: one, dtype: int64

In [40]:
#Addition of columns
df['three'] = df['one']*df['two']
df['flag'] = df['one']>2
df

Unnamed: 0,one,two,three,flag
a,1,5,5,False
b,2,6,12,False
c,3,7,21,True
d,4,8,32,True


In [41]:
#Columns can be deleted or popped like with a dict
del df['two']
df

Unnamed: 0,one,three,flag
a,1,5,False
b,2,12,False
c,3,21,True
d,4,32,True


In [42]:
df.pop('three')
df

Unnamed: 0,one,flag
a,1,False
b,2,False
c,3,True
d,4,True


In [43]:
#When inserting a scalar value, it will naturally be propagated to fill the column
df['foo']='bar'
df

Unnamed: 0,one,flag,foo
a,1,False,bar
b,2,False,bar
c,3,True,bar
d,4,True,bar


In [44]:
#When inserting a Series that does not have the same index as the DataFrame, it will be conformed to the 
#DataFrame’s index
df['trunc_one'] = df['one'][:2]
df

Unnamed: 0,one,flag,foo,trunc_one
a,1,False,bar,1.0
b,2,False,bar,2.0
c,3,True,bar,
d,4,True,bar,


+ You can insert raw ndarrays but their length must match the length of the DataFrame’s index.

+ By default, columns get inserted at the end. The insert() method is available to insert at a particular location in the columns.

In [45]:
df.insert(1,'bar',df['one'])
df

Unnamed: 0,one,bar,flag,foo,trunc_one
a,1,1,False,bar,1.0
b,2,2,False,bar,2.0
c,3,3,True,bar,
d,4,4,True,bar,
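
The rule about raw ndarrays can be sketched as follows. Unlike a Series, an ndarray has no labels to align on, so it is assigned by position and its length has to match. The frame and column names here are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'one': [1, 2, 3, 4]}, index=['a', 'b', 'c', 'd'])

# Same length as the index: assigned row by row, in order.
frame['arr'] = np.arange(4) * 10

# Wrong length: pandas refuses rather than padding with NaN.
try:
    frame['bad'] = np.arange(3)
    failed = False
except ValueError:
    failed = True

print(frame)
print(failed)
```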


## Assigning new columns in method chains

+ DataFrame has an assign() method that allows you to easily create new columns that are potentially derived from existing columns.

In [46]:
iris = pd.read_csv('iris.csv')

In [47]:
iris.head()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa


In [48]:
iris.assign(sepal_ratio=iris['sepal.length']/iris['sepal.width']).head()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety,sepal_ratio
0,5.1,3.5,1.4,0.2,Setosa,1.457143
1,4.9,3.0,1.4,0.2,Setosa,1.633333
2,4.7,3.2,1.3,0.2,Setosa,1.46875
3,4.6,3.1,1.5,0.2,Setosa,1.483871
4,5.0,3.6,1.4,0.2,Setosa,1.388889


+ In the example above, we inserted a precomputed value. We can also pass in a function of one argument to be evaluated on the DataFrame being assigned to.

In [49]:
iris.assign(sepal_ratio = lambda x: x['sepal.length']/x['sepal.width']).head()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety,sepal_ratio
0,5.1,3.5,1.4,0.2,Setosa,1.457143
1,4.9,3.0,1.4,0.2,Setosa,1.633333
2,4.7,3.2,1.3,0.2,Setosa,1.46875
3,4.6,3.1,1.5,0.2,Setosa,1.483871
4,5.0,3.6,1.4,0.2,Setosa,1.388889


+ assign always returns a copy of the data, leaving the original DataFrame untouched.
+ Passing a callable, as opposed to an actual value to be inserted, is useful when you don’t have a reference to the DataFrame at hand. This is common when using assign in a chain of operations.
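
A chain of this kind might look like the sketch below. It uses a small hand-made frame called `flowers` in place of iris.csv so that it runs on its own; the values are illustrative.

```python
import pandas as pd

flowers = pd.DataFrame({
    'sepal.length': [5.1, 4.9, 7.0, 6.4],
    'sepal.width': [3.5, 3.0, 3.2, 3.2],
    'variety': ['Setosa', 'Setosa', 'Versicolor', 'Versicolor'],
})

# The filtered frame has no name of its own, so the lambda receives it as x.
result = (
    flowers[flowers['sepal.length'] > 5]
    .assign(sepal_ratio=lambda x: x['sepal.length'] / x['sepal.width'])
)

print(result)
print('sepal_ratio' in flowers.columns)  # the original frame is unchanged
```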

## Indexing/Selection

In [50]:
df

Unnamed: 0,one,bar,flag,foo,trunc_one
a,1,1,False,bar,1.0
b,2,2,False,bar,2.0
c,3,3,True,bar,
d,4,4,True,bar,


+ Row selection returns a Series, whose index is the columns of the DataFrame.

In [51]:
#Select row by label
df.loc['b']

one              2
bar              2
flag         False
foo            bar
trunc_one        2
Name: b, dtype: object

In [52]:
#Select row by integer location
df.iloc[2]

one             3
bar             3
flag         True
foo           bar
trunc_one     NaN
Name: c, dtype: object
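
Besides loc and iloc, plain brackets cover a few more common selections. The sketch below uses a small stand-in frame rather than the df above, and the variable names are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'one': [1, 2, 3, 4],
                      'flag': [False, False, True, True]},
                     index=['a', 'b', 'c', 'd'])

col = frame['one']              # a column label returns a Series
rows = frame[1:3]               # an integer slice selects rows by position
masked = frame[frame['flag']]   # a boolean Series keeps the rows where it is True

print(col, rows, masked, sep='\n\n')
```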

## Data alignment and arithmetic

In [53]:
df = pd.DataFrame(np.random.randn(10,4),columns=['A','B','C','D'])
df

Unnamed: 0,A,B,C,D
0,0.690786,-2.161851,0.377159,-0.178422
1,-0.525254,1.103352,0.11587,0.358995
2,1.111591,0.467086,0.494078,-0.872052
3,-1.042472,-1.18153,-1.005766,-0.102105
4,0.81386,-2.114646,-1.067958,-0.87894
5,-0.41778,-0.005371,0.429936,-0.155165
6,-0.766942,-0.272505,0.994289,0.887096
7,0.935952,0.223097,0.416793,0.335424
8,0.756632,-0.816376,-0.649978,-0.503461
9,-0.495776,1.362582,-2.048625,0.280393


In [54]:
df2 = pd.DataFrame(np.random.randn(7,3),columns=['A','B','C'])
df2

Unnamed: 0,A,B,C
0,0.388984,1.507803,1.690719
1,1.841667,0.594258,-1.901147
2,0.544114,-0.410366,0.902389
3,0.416795,0.323185,1.181471
4,0.376626,0.207287,-0.255456
5,0.098193,0.385834,0.016263
6,1.52236,-1.032319,0.219276


In [55]:
df+df2

Unnamed: 0,A,B,C,D
0,1.07977,-0.654048,2.067878,
1,1.316413,1.69761,-1.785277,
2,1.655706,0.05672,1.396467,
3,-0.625677,-0.858344,0.175706,
4,1.190486,-1.907359,-1.323414,
5,-0.319587,0.380463,0.446199,
6,0.755417,-1.304824,1.213565,
7,,,,
8,,,,
9,,,,


+ Operations between DataFrame objects automatically align on both the columns and the index (row labels). Again, the resulting object will have the union of the column and row labels.

+ When doing an operation between DataFrame and Series, the default behavior is to align the Series index on the DataFrame columns, thus broadcasting row-wise. For example:

In [56]:
df-df.iloc[2]

Unnamed: 0,A,B,C,D
0,-0.420805,-2.628937,-0.11692,0.69363
1,-1.636845,0.636266,-0.378208,1.231048
2,0.0,0.0,0.0,0.0
3,-2.154063,-1.648616,-1.499844,0.769948
4,-0.297731,-2.581732,-1.562037,-0.006887
5,-1.529371,-0.472457,-0.064142,0.716888
6,-1.878534,-0.739591,0.50021,1.759148
7,-0.17564,-0.243989,-0.077286,1.207476
8,-0.354959,-1.283462,-1.144056,0.368592
9,-1.607368,0.895496,-2.542704,1.152445
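
To broadcast down the columns instead, matching the Series against the row labels, the arithmetic methods such as sub() take an axis argument. A minimal sketch with a small deterministic frame in place of the random df:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.0).reshape(3, 4), columns=['A', 'B', 'C', 'D'])

# Default: the Series index is matched to the columns (row-wise broadcast).
row_wise = frame - frame.iloc[0]

# axis=0: the Series index is matched to the row labels (column-wise broadcast).
col_wise = frame.sub(frame['A'], axis=0)

print(row_wise)
print(col_wise)
```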


In [57]:
1/df

Unnamed: 0,A,B,C,D
0,1.447627,-0.462567,2.651402,-5.604681
1,-1.903842,0.906329,8.630354,2.785551
2,0.899611,2.140933,2.02397,-1.14672
3,-0.959259,-0.84636,-0.994267,-9.793848
4,1.228712,-0.472892,-0.936366,-1.137734
5,-2.393603,-186.186023,2.325927,-6.444764
6,-1.303879,-3.669661,1.005744,1.127274
7,1.068431,4.482349,2.399273,2.981302
8,1.321646,-1.224925,-1.538514,-1.986252
9,-2.017038,0.733901,-0.488132,3.566425


+ Boolean operators work as well:

In [58]:
df1 = pd.DataFrame({'a':[0,1,1],'b':[1,0,1]},dtype=bool)
df2 = pd.DataFrame({'a':[1,0,0],'b':[0,1,0]},dtype= bool)
df1 & df2

Unnamed: 0,a,b
0,False,False
1,False,False
2,False,False


In [59]:
df1 | df2

Unnamed: 0,a,b
0,True,True
1,True,True
2,True,True


In [60]:
df1 ^ df2

Unnamed: 0,a,b
0,True,True
1,True,True
2,True,True


In [61]:
~df1 #logical NOT; ~ is the usual operator for inverting boolean data

Unnamed: 0,a,b
0,True,False
1,False,True
2,False,False


## Transposing

In [62]:
df

Unnamed: 0,A,B,C,D
0,0.690786,-2.161851,0.377159,-0.178422
1,-0.525254,1.103352,0.11587,0.358995
2,1.111591,0.467086,0.494078,-0.872052
3,-1.042472,-1.18153,-1.005766,-0.102105
4,0.81386,-2.114646,-1.067958,-0.87894
5,-0.41778,-0.005371,0.429936,-0.155165
6,-0.766942,-0.272505,0.994289,0.887096
7,0.935952,0.223097,0.416793,0.335424
8,0.756632,-0.816376,-0.649978,-0.503461
9,-0.495776,1.362582,-2.048625,0.280393


In [64]:
df[:5].T #.T returns the transpose, here of the first five rows

Unnamed: 0,0,1,2,3,4
A,0.690786,-0.525254,1.111591,-1.042472,0.81386
B,-2.161851,1.103352,0.467086,-1.18153,-2.114646
C,0.377159,0.11587,0.494078,-1.005766,-1.067958
D,-0.178422,0.358995,-0.872052,-0.102105,-0.87894


In [4]:
pd.read_csv('iris.csv').head()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa


In [2]:
pd.read_csv('iris.csv',header=[1]).head()
#header gives the (0-based) row number to use for the column names; the data starts on the row after it

Unnamed: 0,5.1,3.5,1.4,.2,Setosa
0,4.9,3.0,1.4,0.2,Setosa
1,4.7,3.2,1.3,0.2,Setosa
2,4.6,3.1,1.5,0.2,Setosa
3,5.0,3.6,1.4,0.2,Setosa
4,5.4,3.9,1.7,0.4,Setosa


In [9]:
pd.read_csv('iris.csv',index_col=0).head()
#index_col says which column to use as the row labels of the DataFrame

Unnamed: 0_level_0,sepal.width,petal.length,petal.width,variety
sepal.length,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
5.1,3.5,1.4,0.2,Setosa
4.9,3.0,1.4,0.2,Setosa
4.7,3.2,1.3,0.2,Setosa
4.6,3.1,1.5,0.2,Setosa
5.0,3.6,1.4,0.2,Setosa


In [13]:
pd.read_csv('iris.csv',usecols=['sepal.length','variety']).head()
#usecols restricts the result to the listed subset of columns

Unnamed: 0,sepal.length,variety
0,5.1,Setosa
1,4.9,Setosa
2,4.7,Setosa
3,4.6,Setosa
4,5.0,Setosa
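
The three read_csv options above can be tried without a file on disk by reading from an in-memory string. The CSV text below is a made-up three-row excerpt in the same layout as iris.csv, and the variable names are illustrative.

```python
import io
import pandas as pd

csv_text = (
    "sepal.length,sepal.width,variety\n"
    "5.1,3.5,Setosa\n"
    "4.9,3.0,Setosa\n"
    "4.7,3.2,Setosa\n"
)

# Default: the first line supplies the column names.
default = pd.read_csv(io.StringIO(csv_text))

# header=1: the second line supplies the names, so one data row is lost.
second_line = pd.read_csv(io.StringIO(csv_text), header=1)

# index_col accepts a column name as well as a position.
indexed = pd.read_csv(io.StringIO(csv_text), index_col='variety')

# usecols keeps only the listed columns.
subset = pd.read_csv(io.StringIO(csv_text), usecols=['sepal.length', 'variety'])

print(default.shape, second_line.shape, indexed.shape, subset.shape)
```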
