
A NumPy array has a single dtype for the entire array, while a pandas DataFrame has one dtype per column.

In [2]:
import numpy as np
import pandas as pd
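The dtype difference stated above can be sketched as follows. The names `arr` and `frame` and their values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# A NumPy array holds one dtype, so mixing ints and floats upcasts every element.
arr = np.array([[1, 2.5], [3, 4.5]])
print(arr.dtype)  # float64

# A DataFrame tracks a dtype per column, so the integer column stays integer.
frame = pd.DataFrame({"ints": [1, 3], "floats": [2.5, 4.5]})
print(frame.dtypes)
```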

A Series is a one-dimensional labeled array in pandas that can hold integers, strings, floats and more. Each value is paired with an index label; by default the index runs from 0 to n-1, where n is the number of values in the Series.

In [9]:
pd.Series(np.arange(12))

0      0
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9
10    10
11    11
dtype: int64
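A Series does not have to use the default integer index. A minimal sketch, with made-up labels and values, of a Series with a custom index next to one with the default index:

```python
import pandas as pd

# Custom string labels instead of the default 0..n-1 index.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s["b"])  # 20

# Without an explicit index, labels run from 0 to n-1.
default = pd.Series(["x", "y"])
print(list(default.index))  # [0, 1]
```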

A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. A pandas DataFrame consists of three principal components: the data, the rows, and the columns.

In [10]:
pd.DataFrame(np.arange(16).reshape(4,4))

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15


In a pandas DataFrame we can set the column names as well as the row names. pandas also supports different data types within one DataFrame.

In [12]:
pd.DataFrame(np.arange(16).reshape(4,4),columns=list('A2C4'))

Unnamed: 0,A,2,C,4
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15


In [14]:
pd.DataFrame(np.arange(16).reshape(4,4),columns=list('A2C4'),index=list('E2G4'))

Unnamed: 0,A,2,C,4
E,0,1,2,3
2,4,5,6,7
G,8,9,10,11
4,12,13,14,15
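The cells above set the labels at construction time. As a sketch of changing them afterwards, `rename` returns a new DataFrame with the mapped labels; the frame and the new names here are illustrative assumptions.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(4).reshape(2, 2), columns=["a", "b"])

# rename maps old labels to new ones and leaves unmapped labels untouched.
renamed = frame.rename(columns={"a": "A"}, index={0: "first"})
print(renamed)
```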


Creating a datetime index with date_range; it can then be passed as the index when building a DataFrame from a NumPy array with labeled columns

In [15]:
pd.date_range("20130101", periods=6)

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
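A minimal sketch of using such a datetime index when building a DataFrame from a NumPy array. The seeded generator and the column names are assumptions chosen only to keep the example reproducible.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)

# Seeded generator so the random values are reproducible.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))
print(frame.head(2))
```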

Creating a DataFrame by passing a dict of objects that can be converted to a series-like structure. Scalar values are broadcast to every row.

In [3]:
df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
        "G": pd.Series(np.arange(4))
    }
)
df2

Unnamed: 0,A,B,C,D,E,F,G
0,1.0,2013-01-02,1.0,3,test,foo,0
1,1.0,2013-01-02,1.0,3,train,foo,1
2,1.0,2013-01-02,1.0,3,test,foo,2
3,1.0,2013-01-02,1.0,3,train,foo,3


The columns of the resulting DataFrame have different dtypes

In [4]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
G             int64
dtype: object

See the top & bottom rows of the frame

In [5]:
df2.head(2)

Unnamed: 0,A,B,C,D,E,F,G
0,1.0,2013-01-02,1.0,3,test,foo,0
1,1.0,2013-01-02,1.0,3,train,foo,1


In [6]:
df2.tail(2)

Unnamed: 0,A,B,C,D,E,F,G
2,1.0,2013-01-02,1.0,3,test,foo,2
3,1.0,2013-01-02,1.0,3,train,foo,3



Display the index, the columns, and the underlying NumPy data

In [7]:
df2.index

Int64Index([0, 1, 2, 3], dtype='int64')

In [8]:
df2.columns

Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')

In [9]:
df2.values

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo', 0],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo', 1],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo', 2],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo', 3]],
      dtype=object)

Convert the DataFrame to a NumPy array.

In [10]:
df2.to_numpy()

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo', 0],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo', 1],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo', 2],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo', 3]],
      dtype=object)
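The object dtype above comes from mixing column types: `to_numpy()` has to pick one dtype that can hold every column. A small sketch with hypothetical frames `numeric` and `mixed`:

```python
import pandas as pd

# Only float columns: the result uses a common float dtype.
numeric = pd.DataFrame({"A": [1.0, 2.0], "C": pd.Series([1, 2], dtype="float32")})
print(numeric.to_numpy().dtype)  # float64

# Adding a string column forces the whole array to object dtype.
mixed = numeric.assign(F="foo")
print(mixed.to_numpy().dtype)  # object
```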


describe() shows a quick statistical summary of your data. By default only the numeric columns are included, which is why B, E and F are missing below.

In [11]:
df2.describe()

Unnamed: 0,A,C,D,G
count,4.0,4.0,4.0,4.0
mean,1.0,1.0,3.0,1.5
std,0.0,0.0,0.0,1.290994
min,1.0,1.0,3.0,0.0
25%,1.0,1.0,3.0,0.75
50%,1.0,1.0,3.0,1.5
75%,1.0,1.0,3.0,2.25
max,1.0,1.0,3.0,3.0
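To also summarize non-numeric columns, `describe` accepts an `include` argument. A minimal sketch on a made-up two-column frame:

```python
import pandas as pd

frame = pd.DataFrame({"num": [1, 2, 3], "label": ["a", "b", "a"]})

# Default: numeric columns only.
numeric_summary = frame.describe()

# include="all" adds count/unique/top/freq rows for non-numeric columns.
full_summary = frame.describe(include="all")
print(full_summary)
```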



Sorting by an axis

In [12]:
df2.sort_index(axis=0, ascending=False)

Unnamed: 0,A,B,C,D,E,F,G
3,1.0,2013-01-02,1.0,3,train,foo,3
2,1.0,2013-01-02,1.0,3,test,foo,2
1,1.0,2013-01-02,1.0,3,train,foo,1
0,1.0,2013-01-02,1.0,3,test,foo,0


Columns are sorted by label with axis=1

In [13]:
df2.sort_index(axis=1, ascending=False)

Unnamed: 0,G,F,E,D,C,B,A
0,0,foo,test,3,1.0,2013-01-02,1.0
1,1,foo,train,3,1.0,2013-01-02,1.0
2,2,foo,test,3,1.0,2013-01-02,1.0
3,3,foo,train,3,1.0,2013-01-02,1.0


Sorting by value

In [14]:
df2.sort_values(by="B")

Unnamed: 0,A,B,C,D,E,F,G
0,1.0,2013-01-02,1.0,3,test,foo,0
1,1.0,2013-01-02,1.0,3,train,foo,1
2,1.0,2013-01-02,1.0,3,test,foo,2
3,1.0,2013-01-02,1.0,3,train,foo,3
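Because every value in column B is identical, the sort above leaves the order unchanged. A sketch with made-up data shows `sort_values` on several columns with a separate direction for each:

```python
import pandas as pd

frame = pd.DataFrame({"dept": ["IT", "HR", "IT", "HR"], "sal": [300, 200, 100, 400]})

# Sort by dept ascending, then by sal descending within each dept.
ordered = frame.sort_values(by=["dept", "sal"], ascending=[True, False])
print(ordered)
```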


Object selection has had a number of user-requested additions in order to support more explicit location-based indexing. pandas supports multi-axis indexing through .loc (label based) and .iloc (integer position based); the older .ix indexer has been removed.

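.loc is primarily label based, but may also be used with a boolean array. .loc will raise KeyError when the items are not found. When a DataFrame is assigned through .loc, pandas aligns it on row and column labels before assigning, so the statement below leaves df2 unchanged rather than swapping columns A and B.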

In [15]:
df2.loc[:, ['B', 'A']] = df2[['A', 'B']]


In [16]:
df2[['A', 'B']]

Unnamed: 0,A,B
0,1.0,2013-01-02
1,1.0,2013-01-02
2,1.0,2013-01-02
3,1.0,2013-01-02
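If the goal is to actually swap two columns, one option is to assign the raw values so that no label alignment takes place. A sketch on a small hypothetical frame:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# Assigning a DataFrame aligns on column labels first, so nothing changes.
frame.loc[:, ["B", "A"]] = frame[["A", "B"]]
unchanged = frame.copy()

# Assigning the underlying array skips alignment and swaps the values.
frame.loc[:, ["B", "A"]] = frame[["A", "B"]].to_numpy()
print(frame)
```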


.iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. .iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing.

In [17]:
x = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 4, 5]})

In [18]:
x.iloc[1] = {'x': 9, 'y': 99}

In [19]:
x

Unnamed: 0,x,y
0,1,3
1,9,99
2,3,5


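NumPy-style multi-dimensional indexing such as df2[1:3, 2:5] is not supported by plain [] on a DataFrame.
With plain [], a slice selects rows; use .loc or .iloc to index rows and columns together.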

In [20]:
df2[1:3]

Unnamed: 0,A,B,C,D,E,F,G
1,1.0,2013-01-02,1.0,3,train,foo,1
2,1.0,2013-01-02,1.0,3,test,foo,2


Fetching rows by a list of index labels

In [21]:
df2.loc[[2,0,3]]

Unnamed: 0,A,B,C,D,E,F,G
2,1.0,2013-01-02,1.0,3,test,foo,2
0,1.0,2013-01-02,1.0,3,test,foo,0
3,1.0,2013-01-02,1.0,3,train,foo,3


Fetching rows by an index label slice and columns by label. With .loc both endpoints of the slice are included.

In [22]:
df2.loc[1:3,["C","D"]]

Unnamed: 0,C,D
1,1.0,3
2,1.0,3
3,1.0,3


Fetching rows and columns by position slices. With .iloc the end position is excluded.

In [23]:
df2.iloc[1:3, 2:5]

Unnamed: 0,C,D,E
1,1.0,3,train
2,1.0,3,test


In [24]:
df2.iloc[[1, 2, 3], [0, 2]]

Unnamed: 0,A,C
1,1.0,1.0
2,1.0,1.0
3,1.0,1.0


Getting a single value by row and column position

In [25]:
df2.iloc[1,1]

Timestamp('2013-01-02 00:00:00')

For fast access to a scalar, use .iat; it returns the same value as the .iloc call above

In [26]:
df2.iat[1, 1]

Timestamp('2013-01-02 00:00:00')
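.iat has a label-based counterpart, .at. A short sketch, assuming a small frame with string row labels, showing both accessors reading the same cell:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["r0", "r1"])

# .at takes a row label and a column label.
by_label = frame.at["r1", "B"]

# .iat takes a row position and a column position.
by_position = frame.iat[1, 1]

print(by_label, by_position)  # 4 4
```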

**Boolean Indexing**

Using a single column’s values to select data.

In [27]:
df2[df2["A"]>0]

Unnamed: 0,A,B,C,D,E,F,G
0,1.0,2013-01-02,1.0,3,test,foo,0
1,1.0,2013-01-02,1.0,3,train,foo,1
2,1.0,2013-01-02,1.0,3,test,foo,2
3,1.0,2013-01-02,1.0,3,train,foo,3


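Selecting values from a DataFrame where a boolean condition is met. Cells where the condition is not met, including every column that is missing from the mask, become NaN (or NaT for datetimes).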

In [28]:
df2[df2[["A","C"]]>0.0]

Unnamed: 0,A,B,C,D,E,F,G
0,1.0,NaT,1.0,,,,
1,1.0,NaT,1.0,,,,
2,1.0,NaT,1.0,,,,
3,1.0,NaT,1.0,,,,


In [29]:
df = pd.DataFrame(np.random.randn(6,4),columns=list("ABCD"))
df["E"] = list("ABABCB")
df

Unnamed: 0,A,B,C,D,E
0,-1.793103,2.944892,-1.175648,0.585947,A
1,-1.384861,-0.344272,0.2871,-1.22211,B
2,-0.460924,-1.212744,3.905303,0.553622,A
3,0.885628,0.198722,-1.272506,-0.711366,B
4,-0.184417,-0.872622,-1.2694,-0.751943,C
5,0.488902,1.084527,-0.304268,0.107403,B


Using isin() to keep the rows whose value appears in a given list

In [30]:
df[df["E"].isin(["A","C"])]

Unnamed: 0,A,B,C,D,E
0,-1.793103,2.944892,-1.175648,0.585947,A
2,-0.460924,-1.212744,3.905303,0.553622,A
4,-0.184417,-0.872622,-1.2694,-0.751943,C



**Setting data**


Setting a new column automatically aligns the data by index. The cell below first rebuilds the example frames.

In [31]:
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
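The alignment described above can be sketched as follows. The frame and the Series `s1`, whose index is shifted by one day, are illustrative assumptions.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
frame = pd.DataFrame({"A": np.zeros(6)}, index=dates)

# s1 starts one day later than the frame's index.
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))

# Values are matched by date: the first row gets NaN and
# the last value of s1 (2013-01-07) has no matching row, so it is dropped.
frame["F"] = s1
print(frame)
```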

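Reindexing lets you change, add, or delete the index on a specified axis; it returns a copy of the data

In [32]: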
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
df1

Unnamed: 0,A,B,C,D,E
2013-01-01,-0.133203,-0.887489,0.535943,0.832434,
2013-01-02,-1.367545,-0.525983,1.849544,-0.614434,
2013-01-03,1.361459,-1.564695,-1.338002,1.385231,
2013-01-04,0.37796,0.915489,0.588529,0.225879,



Setting values by label

In [33]:
df1.loc[dates[0] : dates[1], "E"] = 1
df1

Unnamed: 0,A,B,C,D,E
2013-01-01,-0.133203,-0.887489,0.535943,0.832434,1.0
2013-01-02,-1.367545,-0.525983,1.849544,-0.614434,1.0
2013-01-03,1.361459,-1.564695,-1.338002,1.385231,
2013-01-04,0.37796,0.915489,0.588529,0.225879,


Dropping any rows that have missing data

In [34]:
df1.dropna(how="any")

Unnamed: 0,A,B,C,D,E
2013-01-01,-0.133203,-0.887489,0.535943,0.832434,1.0
2013-01-02,-1.367545,-0.525983,1.849544,-0.614434,1.0


Filling missing data with a fixed value

In [35]:
df1.fillna(value=5)

Unnamed: 0,A,B,C,D,E
2013-01-01,-0.133203,-0.887489,0.535943,0.832434,1.0
2013-01-02,-1.367545,-0.525983,1.849544,-0.614434,1.0
2013-01-03,1.361459,-1.564695,-1.338002,1.385231,5.0
2013-01-04,0.37796,0.915489,0.588529,0.225879,5.0


Getting a boolean mask that is True where values are missing

In [36]:
pd.isna(df1)

Unnamed: 0,A,B,C,D,E
2013-01-01,False,False,False,False,False
2013-01-02,False,False,False,False,False
2013-01-03,False,False,False,False,True
2013-01-04,False,False,False,False,True


**Operations**

In [37]:
df = pd.DataFrame(np.random.randint(low=1, high=10, size=(4,6)),columns=list("ABCDEF"))
df

Unnamed: 0,A,B,C,D,E,F
0,6,2,2,2,8,9
1,9,9,9,9,9,1
2,4,3,7,7,4,7
3,4,1,5,3,6,3


Mean of each column (the default, axis=0)

In [38]:
df.mean()

A    5.75
B    3.75
C    5.75
D    5.25
E    6.75
F    5.00
dtype: float64

Mean of each row (axis=1)

In [39]:
df.mean(1)

0    4.833333
1    7.666667
2    5.333333
3    3.666667
dtype: float64
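Since the random frame above changes on every run, a tiny fixed example (values chosen here for illustration) makes the two axes easier to verify by hand:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 3], "B": [5, 9]})

# axis=0 (default): one mean per column.
column_means = frame.mean()

# axis=1: one mean per row.
row_means = frame.mean(axis=1)

print(column_means)
print(row_means)
```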

In [40]:
# Build a small employee table with random departments and salaries
emp = pd.DataFrame()
emp["id"] = np.arange(100, 110)
emp["dept"] = np.random.choice(["HR", "FIN", "MKT", "IT"], size=10)
emp["sal"] = np.random.randint(low=1000, high=10000, size=10)
emp

Unnamed: 0,id,dept,sal
0,100,MKT,9900
1,101,HR,1352
2,102,FIN,4807
3,103,FIN,7324
4,104,IT,9406
5,105,IT,8892
6,106,HR,8496
7,107,HR,3501
8,108,FIN,7765
9,109,FIN,8927
