# Pandas

pandas contains data structures and data manipulation tools designed to make **data cleaning and analysis** fast and easy in Python. pandas is often used in tandem with numerical computing tools like NumPy and SciPy, analytical libraries like statsmodels and scikit-learn, and data visualization libraries like matplotlib. pandas **adopts significant parts of NumPy’s idiomatic style of array-based computing,** especially array-based functions and a preference for data processing without for loops.

The biggest difference is that pandas is designed for working with **tabular or heterogeneous data**. NumPy, by contrast, is best suited for working with **homogeneous numerical array data.**
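
As a small illustration of that difference, the sketch below builds a NumPy array and a DataFrame from mixed values. The variable names are made up for this example; the point is only that an array settles on one dtype while a DataFrame keeps one dtype per column.

```python
import numpy as np
import pandas as pd

# A NumPy array has a single dtype for all elements; mixing numbers and
# a string forces everything into a common string dtype.
arr = np.array([1, 2.5, "x"])
print(arr.dtype)

# A DataFrame keeps a separate dtype for each column.
df = pd.DataFrame({"n": [1, 2], "x": [0.5, 1.5], "s": ["a", "b"]})
print(df.dtypes)
```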

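At its core, pandas is like working with a headless version of a spreadsheet such as Excel. Most of the datasets you work with will be so-called DataFrames. You may already know the term, since it is used in other languages too; if not, a DataFrame is much like a spreadsheet, with columns and rows, and that is all there is to it. From there, we can use pandas to manipulate our datasets at lightning speed.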

# Data Structure

To get started with pandas, you will need to get comfortable with its two workhorse data structures: **Series** and **DataFrame**. While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications.

## 1. Series

A Series is a **one-dimensional array-like object** containing a sequence of **values (of similar types to NumPy types)** and an associated array of **data labels**, called its **index.**

In [11]:
import pandas as pd
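# astype truncates toward zero; recent pandas versions refuse a lossy float-to-int cast in the constructor
data = pd.Series([0.25, 0.5, 0.75, 1.0]).astype('int32')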
print(data)
data = pd.Series([0.25, 0.5, 0.75, 1.0],dtype = complex)
print(data)
data = pd.Series([0.25, 0.5, 0.75, 1.0])
print(data)

0    0
1    0
2    0
3    1
dtype: int32
0    (0.25+0j)
1     (0.5+0j)
2    (0.75+0j)
3       (1+0j)
dtype: complex128
0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64


As we see in the preceding output, the Series wraps both a sequence of values and a sequence of indices, which we can access with the **values** and **index** attributes. The values are simply a familiar NumPy array:

In [15]:
print('values:',data.values)
print('index:',data.index)

values: [0.25 0.5  0.75 1.  ]
index: RangeIndex(start=0, stop=4, step=1)


### Create data label

Often it will be desirable to create a Series with an index identifying each data point with a label:

In [18]:
import pandas as pd
grade = pd.Series([12.00, 11.01, 9.99, 9.00],index = ['A','A-','B+','B'])
print(grade)
print(grade.values)
print(grade.index)

A     12.00
A-    11.01
B+     9.99
B      9.00
dtype: float64
[12.   11.01  9.99  9.  ]
Index(['A', 'A-', 'B+', 'B'], dtype='object')


## Constructing Series Objects

We’ve already seen a few ways of constructing a Pandas Series from scratch; all of them are some version of the following:

In [None]:
pd.Series(data, index=index)

where index is an optional argument, and data can be one of many entities.

|data object|result|
|-----------|------|
|a list/NumPy array|index defaults to an integer sequence|
|a scalar/a Python object|the value is repeated to fill the specified index|
|a dict|index defaults to the dict keys (sorted in older pandas versions; in insertion order since pandas 0.23 on Python 3.6+)|

In [30]:
import pandas as pd
import numpy as np

#a list
d1 = pd.Series([2,4,6])
print(d1)
#Numpy array
d1 = pd.Series(np.arange(4))
print(d1)

#a scalar/a python object
d1 = pd.Series('F',index = ['Alice','Bell','Candy'])
print(d1)

#a dict
d1 = pd.Series({'a':2, 'b':1, 'c':3})
print(d1)

0    2
1    4
2    6
dtype: int64
0    0
1    1
2    2
3    3
dtype: int64
Alice    F
Bell     F
Candy    F
dtype: object
a    2
b    1
c    3
dtype: int64


In each case, the index can be explicitly set if a different result is preferred. In this case, **the Series is populated only with the explicitly identified keys.**

In [26]:
import pandas as pd

data = pd.Series({'a':1, 'b':2, 'c':3}, index=['c', 'a'])
print(data)

c    3
a    1
dtype: int64
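
A related case worth seeing is an explicit index that names a key the dict does not contain. The sketch below uses a hypothetical label `'z'` for this; the entry comes out as NaN and the values are upcast to float to hold it.

```python
import pandas as pd

# 'z' is not a key of the dict, so its value is missing (NaN) and the
# integer values are upcast to float64 to make room for it.
s = pd.Series({"a": 1, "b": 2, "c": 3}, index=["c", "a", "z"])
print(s)
```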


### Series is dict-like

A Series is like a fixed-size dict in that you can get and set values by index label:

In [36]:
d1 = pd.Series({'a':2, 'b':1, 'c':3})
d1['d'] = 4
print(d1)

print('d' in d1)
print('f' in d1)

d1.get('f')  # returns None: get() does not raise for a missing label (d1['f'] would raise KeyError)
d1.get('f',6)
#d1.setdefault('e',5)  #'Series' object has no attribute 'setdefault'

a    2
b    1
c    3
d    4
dtype: int64
True
False


6

### Vectorized operations

In [40]:
import pandas as pd
import numpy as np

d1 = pd.Series(np.arange(4))
d2 = pd.Series(np.arange(1,5))
print(d1)
print(d2)
print(d1 + d2)
print(d2 - d1)
print(d1*2)

0    0
1    1
2    2
3    3
dtype: int64
0    1
1    2
2    3
3    4
dtype: int64
0    1
1    3
2    5
3    7
dtype: int64
0    1
1    1
2    1
3    1
dtype: int64
0    0
1    2
2    4
3    6
dtype: int64


### Label Alignment

A key difference between Series and ndarray is that operations between Series automatically align the data based on label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels.

In [43]:
import pandas as pd
import numpy as np

d1 = pd.Series(np.arange(4))
d2 = pd.Series(np.arange(1,5))
print(d1[1:])
print(d2[:-1])
print(d1[1:] + d2[:-1])

1    1
2    2
3    3
dtype: int64
0    1
1    2
2    3
dtype: int64
0    NaN
1    3.0
2    5.0
3    NaN
dtype: float64


The result of an operation between unaligned Series **will have the union of the indexes involved.** If a label is not found in one Series or the other, the result will be marked as missing NaN. Being able to write code without doing any explicit data alignment grants immense freedom and flexibility in interactive data analysis and research. The integrated data alignment features of the pandas data structures set pandas apart from the majority of related tools for working with labeled data.
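
When NaN is not the desired outcome for labels that appear on only one side, the arithmetic methods accept a `fill_value`. The following is a minimal sketch using `Series.add`; the names `a`, `b`, and `total` are chosen for this example only.

```python
import numpy as np
import pandas as pd

d1 = pd.Series(np.arange(4))
d2 = pd.Series(np.arange(1, 5))

a = d1.iloc[1:]    # labels 1, 2, 3
b = d2.iloc[:-1]   # labels 0, 1, 2

# Plain + leaves NaN wherever a label is missing on either side.
print(a + b)

# Series.add with fill_value treats a missing entry as 0 instead.
total = a.add(b, fill_value=0)
print(total)
```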

## 2. DataFrame

A DataFrame represents a rectangular table of data and **contains an ordered collection of columns, each of which can be a different value type (numeric, string,boolean, etc.).** The DataFrame has both a row and column index; it can be thought of as **a dict of Series all sharing the same index.** Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays.
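
The "dict of Series" view can be checked directly. In this sketch (with made-up sample data), selecting a column returns a Series whose index is the frame's row index.

```python
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada"],
                      "year": [2000, 2001],
                      "pop": [1.5, 2.4]})

# Selecting a column returns a Series that shares the frame's row index.
col = frame["pop"]
print(type(col).__name__)              # Series
print(col.index.equals(frame.index))   # True
```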

While a DataFrame is physically two-dimensional, you can use it to represent higher-dimensional data in a tabular format using hierarchical indexing, a subject discussed in *Advanced Tool* and an ingredient in some of the more advanced data-handling features in pandas.

### 2.1 Construct DataFrame

There are many ways to construct a DataFrame:
1. a dict of equal-length lists
2. NumPy arrays
3. Series
4. Structured or record array
5. a list of equal-length dicts

In [54]:
import pandas as pd
import numpy as np

#dict
print("From dict")
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
print(frame)
print(frame.index)
print(frame.columns)

#NumPy arrays
print("\nFrom Numpy arrays:")
f2 = pd.DataFrame(data = np.random.random((3,4)),columns = ['A','B','C','D'])
print(f2)

#Series
print("\nFrom Series:")
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)

#Structured or record array
print("\nFrom Structured or record array:")
# 'S10' is a 10-byte string field (the older 'a10' alias is deprecated in NumPy 2.0)
data = np.zeros((2,), dtype=[('A', 'i4'),('B', 'f4'),('C', 'S10')])
data[:] = [(1,2.,'Hello'), (2,3.,"World")]
print(pd.DataFrame(data))

#a list of dicts
print("\nFrom a list of dicts:")
data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
print(pd.DataFrame(data2, index=['first', 'second']))

From dict
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
RangeIndex(start=0, stop=6, step=1)
Index(['state', 'year', 'pop'], dtype='object')

From Numpy arrays:
          A         B         C         D
0  0.684692  0.899439  0.734535  0.305937
1  0.170435  0.055092  0.791338  0.412941
2  0.188886  0.419327  0.856403  0.713916

From Series:
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

From Structured or record array:
   A    B         C
0  1  2.0  b'Hello'
1  2  3.0  b'World'

From a list of dicts:
        a   b     c
first   1   2   NaN
second  5  10  20.0


If you specify a sequence of columns, the DataFrame’s columns will be arranged in that order:

In [45]:
print(pd.DataFrame(data,columns=['year', 'state', 'pop']))

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2


If you pass a column that isn’t contained in the dict, it will appear with missing values in the result:

In [46]:
print(pd.DataFrame(data,columns=['year', 'state', 'pop', 'debt']))

   year   state  pop debt
0  2000    Ohio  1.5  NaN
1  2001    Ohio  1.7  NaN
2  2002    Ohio  3.6  NaN
3  2001  Nevada  2.4  NaN
4  2002  Nevada  2.9  NaN
5  2003  Nevada  3.2  NaN


### 2.2 Column selection, addition, deletion

You can treat a DataFrame semantically like **a dict of like-indexed Series objects.** Getting, setting, and deleting columns works with the **same syntax as the analogous dict operations.**

In [94]:
import pandas as pd

data = {"one":[1,2,3,4],"two":[2,3,4,5]}
df = pd.DataFrame(data)

#Column Selection
print(df['one'])

#Column addition
df['three'] = df['one'] * df['two']
print(df)
df['flag'] = df['one'] > 2
print(df)
#inserting a scalar value
df['foo'] = 'bar'
df['one_trunc'] = df['one'][:2]
print(df)
#insert function
# By default, columns get inserted at the end. 
# The insert function is available to insert at a particular location 
# in the columns
df.insert(1,'bar',df['one'])
print(df)

#Column deletion
del df['flag']
three = df.pop('three')
print(df)

0    1
1    2
2    3
3    4
Name: one, dtype: int64
   one  two  three
0    1    2      2
1    2    3      6
2    3    4     12
3    4    5     20
   one  two  three   flag
0    1    2      2  False
1    2    3      6  False
2    3    4     12   True
3    4    5     20   True
   one  two  three   flag  foo  one_trunc
0    1    2      2  False  bar        1.0
1    2    3      6  False  bar        2.0
2    3    4     12   True  bar        NaN
3    4    5     20   True  bar        NaN
   one  bar  two  three   flag  foo  one_trunc
0    1    1    2      2  False  bar        1.0
1    2    2    3      6  False  bar        2.0
2    3    3    4     12   True  bar        NaN
3    4    4    5     20   True  bar        NaN
   one  bar  two  foo  one_trunc
0    1    1    2  bar        1.0
1    2    2    3  bar        2.0
2    3    3    4  bar        NaN
3    4    4    5  bar        NaN


## 3. Panel

Panel is a less commonly used container for 3-dimensional data. It was deprecated in pandas 0.20 and removed in pandas 0.25, so the example below only runs on older versions.

The recommended way to represent 3-dimensional data is with a MultiIndex on a DataFrame, which the Panel.to_frame() method produces.

In [95]:
import pandas as pd

wp = pd.Panel(np.random.randn(2, 5, 4), 
              items=['Item1', 'Item2'],
              major_axis=pd.date_range('1/1/2000', periods=5),
              minor_axis=['A', 'B', 'C', 'D'])

print(wp)

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D


Panel is deprecated and will be removed in a future version.
The recommended way to represent these types of 3-dimensional data are with a MultiIndex on a DataFrame, via the Panel.to_frame() method
Alternatively, you can use the xarray package http://xarray.pydata.org/en/stable/.
Pandas provides a `.to_xarray()` method to help automate this conversion.
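
The MultiIndex representation recommended above can be sketched without Panel at all. This is an illustrative layout rather than the output of `Panel.to_frame()`: the level names `item` and `date` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

items = ["Item1", "Item2"]
dates = pd.date_range("2000-01-01", periods=5)

# Two row levels (item, date) stand in for the first two Panel axes;
# the columns stand in for the third.
index = pd.MultiIndex.from_product([items, dates], names=["item", "date"])
df = pd.DataFrame(np.random.randn(10, 4), index=index, columns=list("ABCD"))

print(df.shape)                # (10, 4)
# Selecting one item returns an ordinary 5 x 4 DataFrame.
print(df.loc["Item1"].shape)   # (5, 4)
```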



# Data indexing and Selection

Object selection has had a number of user-requested additions in order to support more explicit location based indexing. Pandas now supports three types of multi-axis indexing.

Getting values from an object with multi-axes selection uses the following notation (using .loc as an example, but the following applies to .iloc as well). **Any of the axes accessors may be the null slice `:`. Axes left out of the specification are assumed to be `:`**, e.g. `df.loc['a']` is equivalent to `df.loc['a', :]`.

|Object Type|Indexers|
|-----------|--------|
|Series|s.loc[indexer]|
|DataFrame|df.loc[row_indexer,column_indexer]|
|Panel|p.loc[item_indexer,major_indexer,minor_indexer]|
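
The null-slice rule can be verified on a small frame. This sketch uses a hypothetical 3 x 4 frame of consecutive integers so the selected values are easy to read.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=list("abc"), columns=list("ABCD"))

row_short = df.loc["a"]       # column axis left out
row_full = df.loc["a", :]     # column axis given as the null slice
print(row_short.equals(row_full))   # True

# The row axis can be the null slice too: column 'B' for all rows.
print(df.loc[:, "B"].tolist())      # [1, 5, 9]
```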

## loc - select by label

`.loc` is primarily label based, but may also be used with a boolean array. `.loc` will raise KeyError when the items are not found. Allowed inputs are:
     
* A single label, e.g. `5` or `'a'` (`5` is interpreted as a **label** of the index, never as an integer position along the index)
* A list or array of labels `['a', 'b', 'c']`
* A slice object with labels `'a':'f'` (note that, contrary to usual Python slices, **both the start and the stop are included** when present in the index)
* A boolean array
* A callable function with one argument and that returns valid output for indexing

In [96]:
#Series
import pandas as pd
import numpy as np
s1 = pd.Series(np.random.randn(6),index=list('abcdef'))
print(s1)
print(s1.loc['c':])
print(s1.loc['b'])

a    0.445849
b    1.840048
c   -0.644921
d   -0.004764
e    0.671961
f    0.143648
dtype: float64
c   -0.644921
d   -0.004764
e    0.671961
f    0.143648
dtype: float64
1.8400483666694052


In [104]:
#DataFrame
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(6,4),
                   index=list('abcdef'),
                   columns=list('ABCD'))

print(df1)
print(df1.loc['a','A']) # equivalent to ``df1.at['a','A']``
print(df1.loc[['a','b','d'],:])
print(df1.loc['d':, 'A':'C']) #label slices
print(df1.loc['a'] > 0)
print(df1.loc[:,df1.loc['a'] > 0])

          A         B         C         D
a  0.708074  1.113480  0.692661 -1.236636
b -0.407881  0.622312  2.498805  0.037647
c  1.081152 -1.243035 -0.082764  0.707208
d -0.083208  0.078978 -1.144177 -0.296959
e  2.074780 -0.881652  0.400521  0.223618
f -0.438457 -1.006004 -0.283459 -2.567454
0.7080743220188984
          A         B         C         D
a  0.708074  1.113480  0.692661 -1.236636
b -0.407881  0.622312  2.498805  0.037647
d -0.083208  0.078978 -1.144177 -0.296959
          A         B         C
d -0.083208  0.078978 -1.144177
e  2.074780 -0.881652  0.400521
f -0.438457 -1.006004 -0.283459
A     True
B     True
C     True
D    False
Name: a, dtype: bool
          A         B         C
a  0.708074  1.113480  0.692661
b -0.407881  0.622312  2.498805
c  1.081152 -1.243035 -0.082764
d -0.083208  0.078978 -1.144177
e  2.074780 -0.881652  0.400521
f -0.438457 -1.006004 -0.283459


### Slicing with labels

When using .loc with slices, if both the start and the stop labels are present in the index, then elements located between the two (including them) are returned:

In [114]:
import pandas as pd
s = pd.Series(list('abcde'), index=list('03254'))
print(s)
# print(s.loc[3:5])  # raises an error: the index labels are strings, so integer bounds are not valid labels
print(s.loc['3':'5'])
print(s.sort_index())

# When no index is given, the default index is 0..n-1,
# so .loc interprets an integer as a label of that index
s = pd.Series(list('abcde'))
print(s)
print(s.loc[2:4])

0    a
3    b
2    c
5    d
4    e
dtype: object
3    b
2    c
5    d
dtype: object
0    a
2    c
3    b
4    e
5    d
dtype: object
0    a
1    b
2    c
3    d
4    e
dtype: object
2    c
3    d
4    e
dtype: object
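
To make the inclusive stop explicit, the sketch below applies the same slice bounds through `.loc` and `.iloc` on a Series with the default integer index, where labels and positions happen to coincide.

```python
import pandas as pd

s = pd.Series(list("abcde"))   # default index 0..4

# .loc slices by label and includes the stop label.
print(s.loc[1:3].tolist())     # ['b', 'c', 'd']

# .iloc slices by position and excludes the stop, like a Python list.
print(s.iloc[1:3].tolist())    # ['b', 'c']
```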


In [125]:
#Selection By Callable
print(df1.loc[lambda df: df.A > 0, :])
print(df1.loc[:, lambda df: ['A', 'B']])

          A         B         C         D
c  0.140788 -0.949210 -0.188980 -0.381409
d  0.827333 -1.110525  0.038322 -1.460787
e  1.910475  0.508625 -0.570836  2.037837
f  0.454442 -0.289554 -1.091092  1.294338
          A         B
a -0.796669  0.620480
b -0.211047 -2.163730
c  0.140788 -0.949210
d  0.827333 -1.110525
e  1.910475  0.508625
f  0.454442 -0.289554


## iloc - select by position

`.iloc` is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. `.iloc` will raise IndexError if a requested indexer is out-of-bounds, except for slice indexers, which allow out-of-bounds indexing (this conforms with Python/NumPy slice semantics). Allowed inputs are:

* An integer
* A list or array of integers [4, 3, 0].
* A slice object with ints `1:7`.
* A boolean array.
* A callable function with one argument and that returns valid output for indexing
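
The out-of-bounds behaviour described above can be sketched on a three-element Series (sample values chosen for this example): a slice past the end is truncated, while a single out-of-range integer raises.

```python
import pandas as pd

s = pd.Series([10, 20, 30])

# A slice that runs past the end is truncated, as with a Python list.
print(s.iloc[1:10].tolist())   # [20, 30]

# A single out-of-bounds integer raises IndexError.
try:
    s.iloc[10]
except IndexError as err:
    print("IndexError:", err)
```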

In [115]:
#Series
import pandas as pd
import numpy as np
s1 = pd.Series(np.random.randn(5), index=list(range(0,10,2)))
print(s1)
print(s1.iloc[:3])
print(s1.iloc[3])

0    1.157133
2   -1.612463
4    0.635788
6   -0.881173
8    0.000976
dtype: float64
0    1.157133
2   -1.612463
4    0.635788
dtype: float64
-0.8811734242071337


In [123]:
#DataFrame
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(6,4),
                   index=list(range(0,12,2)),
                   columns=list(range(0,8,2)))

print(df1)
print(df1.iloc[:3])
print(df1.iloc[1:4, 2:4])
print(df1.iloc[[1, 3, 5], [1, 3]])
print(df1.iloc[1:3,:])
print(df1.iloc[:,1:3])
print("df1[1,1]:",df1.iloc[1,1]) #equivalent to df1.iat[1,1]
#For getting a cross section using an integer position
print(df1.iloc[1])

           0         2         4         6
0   1.500689  0.420536 -0.589131 -1.407943
2  -0.481557  0.175910  0.070358 -1.392852
4   1.362997  1.465508  0.023781  0.932459
6  -0.105208 -0.926797 -0.978040  0.228733
8  -1.092197 -0.008328  1.010179  0.488455
10 -1.910341 -0.461666 -0.533768 -1.277837
          0         2         4         6
0  1.500689  0.420536 -0.589131 -1.407943
2 -0.481557  0.175910  0.070358 -1.392852
4  1.362997  1.465508  0.023781  0.932459
          4         6
2  0.070358 -1.392852
4  0.023781  0.932459
6 -0.978040  0.228733
           2         6
2   0.175910 -1.392852
6  -0.926797  0.228733
10 -0.461666 -1.277837
          0         2         4         6
2 -0.481557  0.175910  0.070358 -1.392852
4  1.362997  1.465508  0.023781  0.932459
           2         4
0   0.420536 -0.589131
2   0.175910  0.070358
4   1.465508  0.023781
6  -0.926797 -0.978040
8  -0.008328  1.010179
10 -0.461666 -0.533768
df1[1,1]: 0.17591044386559979
0   -0.481557
2    0.175910
4    0.070358
6   -1.392852
Name: 2, dtype: float64

In [126]:
#Selection By Callable
df1 = pd.DataFrame(np.random.randn(6, 4),
                   index=list('abcdef'),
                   columns=list('ABCD'))

print(df1)
print(df1.iloc[:, lambda df: [0, 1]])
print(df1.A.loc[lambda s: s > 0])

          A         B         C         D
a -0.694511 -1.424054  0.481187 -0.995856
b  0.306202 -1.540450 -0.182246 -0.120516
c -1.805706  0.137579 -1.167179 -1.078590
d -1.115251 -0.049192  0.259490  1.290598
e -2.379732  1.103528 -0.998576  0.244959
f  0.809272 -0.241203  0.105684 -0.366950
          A         B
a -0.694511 -1.424054
b  0.306202 -1.540450
c -1.805706  0.137579
d -1.115251 -0.049192
e -2.379732  1.103528
f  0.809272 -0.241203
b    0.306202
f    0.809272
Name: A, dtype: float64


## ix - mixed selection(Deprecated)

Starting in 0.20.0, the .ix indexer is deprecated in favor of the stricter .iloc and .loc indexers. It was removed entirely in pandas 1.0, so the example below only runs on older versions.

In [127]:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3],
                   'B':[2,3,4]},
                  index = list("abc"))

print(df.ix[[0,2],'A'])

a    1
c    3
Name: A, dtype: int64


.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
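
The mixed selection `df.ix[[0,2],'A']` (rows by position, column by label) can be rewritten with the strict indexers. The sketch below shows two possible translations; the names `by_label` and `by_position` are just for this example.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4]}, index=list("abc"))

# Option 1: turn the row positions into labels, then use .loc.
by_label = df.loc[df.index[[0, 2]], "A"]

# Option 2: turn the column label into a position, then use .iloc.
by_position = df.iloc[[0, 2], df.columns.get_loc("A")]

print(by_label)
print(by_label.equals(by_position))   # True
```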
  


## Boolean Indexing

Another common operation is the use of boolean vectors to filter the data. The operators are: `|` for **or**, `&` for **and**, and `~` for **not**. These must be grouped by using parentheses, since by default Python will evaluate an expression such as `df.A > 2 & df.B < 3` as `df.A > (2 & df.B) < 3`, while the desired evaluation order is `(df.A > 2) & (df.B < 3)`.

In [128]:
#Series
import pandas as pd

s = pd.Series(range(-3, 4))
print(s[s > 0])
print(s[(s < -1) | (s > 0.5)])
print(s[~(s < 0)])

4    1
5    2
6    3
dtype: int64
0   -3
1   -2
4    1
5    2
6    3
dtype: int64
3    0
4    1
5    2
6    3
dtype: int64


In [131]:
#DataFrame
import pandas as pd
import numpy as np

df = pd.DataFrame({'a' : ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
                    'b' : ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
                    'c' : np.random.randn(7)})
print(df)

criterion = df['a'].map(lambda x: x.startswith('t'))
print(df[criterion])
# equivalent but slower
print(df[[x.startswith('t') for x in df['a']]])

       a  b         c
0    one  x -0.322975
1    one  y  1.677633
2    two  y -0.547460
3  three  x -0.303123
4    two  y  2.303301
5    one  x -0.578096
6    six  x -0.069273
       a  b         c
2    two  y -0.547460
3  three  x -0.303123
4    two  y  2.303301
       a  b         c
2    two  y -0.547460
3  three  x -0.303123
4    two  y  2.303301
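
Two conditions on a DataFrame can be combined with `&`, following the parenthesization rule described earlier. This sketch uses fixed sample values instead of random ones so the result is predictable, and it uses the vectorized `.str.startswith` accessor as one alternative to `map` with a lambda.

```python
import pandas as pd

df = pd.DataFrame({"a": ["one", "two", "three", "two"],
                   "b": ["x", "y", "x", "y"],
                   "c": [1.0, -2.0, 3.0, 4.0]})

# The comparison is parenthesized because & binds more tightly than >.
mask = df["a"].str.startswith("t") & (df["c"] > 0)
print(df[mask])
```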



# IO Tools


The pandas I/O API is a set of top level reader functions accessed like `pandas.read_csv()` that generally return a pandas object. The corresponding writer functions are object methods that are accessed like `DataFrame.to_csv()`. Below is a table containing available readers and writers.

|Format Type|Data Description|Reader|Writer|
|-----------|----------------|------|------|
|text|CSV|read_csv|to_csv|
|text|JSON|read_json|to_json|
|text|HTML|read_html|to_html|
|text|Local clipboard|read_clipboard|to_clipboard|
|binary|MS Excel|read_excel|to_excel|
|binary|HDF5 Format|read_hdf|to_hdf|
|binary|Stata|read_stata|to_stata|
|binary|SAS|read_sas||
|binary|Python Pickle Format|read_pickle|to_pickle|
|SQL|SQL|read_sql|to_sql|
|SQL|Google Big Query|read_gbq|to_gbq|
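
The reader/writer pairing can be tried without touching the file system. This is a minimal round trip through CSV using an in-memory text buffer; the sample frame is invented for the example.

```python
import io
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada"],
                      "year": [2000, 2001],
                      "pop": [1.5, 2.4]})

# Writer: an object method. index=False leaves the row labels out of the output.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Reader: a top-level function that returns a new DataFrame.
buffer.seek(0)
loaded = pd.read_csv(buffer)
print(loaded)
```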

# Merge DataFrame

pandas provides various facilities for easily combining together Series, DataFrame, and Panel objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.

* concat
* append
* join

## 1. Concat

The `concat()` function (in the main pandas namespace) **does all of the heavy lifting of performing concatenation operations along an axis** while performing optional set logic (union or intersection) of the indexes (if any) on the other axes. Note that I say “if any” because **there is only a single possible axis of concatenation for Series.**

![](https://pandas.pydata.org/pandas-docs/stable/_images/merging_concat_basic.png)

In [None]:
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
          keys=None, levels=None, names=None, verify_integrity=False,
          copy=True)

* objs : **a sequence or mapping of Series, DataFrame, or Panel objects.** If a dict is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised.
* axis : {0, 1, …}, **default 0. The axis to concatenate along.**
* join : {‘inner’, ‘outer’}, **default ‘outer’.** How to handle indexes on other axis(es). **Outer for union and inner for intersection.**
* ignore_index : boolean, default False. If True, do not use the index values on the concatenation axis. The resulting axis will be labeled 0, …, n - 1. Note the index values on the other axes are still respected in the join.
* join_axes : list of Index objects. Specific indexes to use for the other n - 1 axes instead of performing inner/outer set logic.
* keys : sequence, default None. **Construct hierarchical index using the passed keys as the outermost level.** If multiple levels passed, should contain tuples.
* levels : list of sequences, default None. Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys.
* names : list, default None. Names for the levels in the resulting hierarchical index.
* verify_integrity : boolean, default False. Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation.
* copy : boolean, default True. If False, do not copy data unnecessarily.
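
The effect of `join` and `axis` is easiest to see on two small frames that share only one column. The sketch below is a reduced version of the frames used in this section, cut to two rows each.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
df3 = pd.DataFrame({"B": ["B4", "B5"], "C": ["C4", "C5"]})

# join='outer' (the default) keeps the union of the columns: A, B, C.
outer = pd.concat([df1, df3], ignore_index=True)
# join='inner' keeps only the columns present in every input: B.
inner = pd.concat([df1, df3], join="inner", ignore_index=True)
# axis=1 places the frames side by side, aligning on the row index.
side = pd.concat([df1, df3], axis=1)

print(outer.columns.tolist())   # ['A', 'B', 'C']
print(inner.columns.tolist())   # ['B']
print(side.shape)               # (2, 4)
```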

In [77]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                     'D': ['D0', 'D1', 'D2', 'D3']})

df3 = pd.DataFrame({'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7']})

result = pd.concat([df1,df2,df3],ignore_index=True,sort=False)
print(result)

      A    B    C    D
0    A0   B0  NaN  NaN
1    A1   B1  NaN  NaN
2    A2   B2  NaN  NaN
3    A3   B3  NaN  NaN
4   NaN  NaN   C0   D0
5   NaN  NaN   C1   D1
6   NaN  NaN   C2   D2
7   NaN  NaN   C3   D3
8   NaN   B4   C4  NaN
9   NaN   B5   C5  NaN
10  NaN   B6   C6  NaN
11  NaN   B7   C7  NaN


In [86]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                     'D': ['D0', 'D1', 'D2', 'D3']})

df3 = pd.DataFrame({'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7']})

# Note: ignore_index=True resets the row labels to 0..n-1, so the keys are discarded here
result = pd.concat([df1,df2,df3],ignore_index=True,sort=True, keys=['x', 'y', 'z'])
print(result)

      A    B    C    D
0    A0   B0  NaN  NaN
1    A1   B1  NaN  NaN
2    A2   B2  NaN  NaN
3    A3   B3  NaN  NaN
4   NaN  NaN   C0   D0
5   NaN  NaN   C1   D1
6   NaN  NaN   C2   D2
7   NaN  NaN   C3   D3
8   NaN   B4   C4  NaN
9   NaN   B5   C5  NaN
10  NaN   B6   C6  NaN
11  NaN   B7   C7  NaN



## 2. Append

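The append() instance methods on Series and DataFrame are a useful shortcut to concat(). These methods actually predated concat, but they were deprecated in pandas 1.4 and removed in pandas 2.0, so the examples below require an older version. They concatenate along axis=0, namely the index: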

![](https://pandas.pydata.org/pandas-docs/stable/_images/merging_append1.png)

In [None]:
DataFrame.append(other, ignore_index=False,
                 verify_integrity=False, sort=None)

Append rows of `other` to the end of this frame, returning a new object. Columns not in this frame are added as new columns.

#### Parameters

* other : DataFrame or Series/dict-like object, or list of these
       The data to append.
* ignore_index : boolean, default False
       If True, do not use the index labels.
* verify_integrity : boolean, default False
       If True, raise ValueError on creating index with duplicates.
* sort : boolean, default None
       Sort columns if the columns of `self` and `other` are not aligned. The default sorting is deprecated and will change to not-sorting in a future version of pandas. Explicitly pass `sort=True` to silence the warning and sort. Explicitly pass `sort=False` to silence the warning and not sort.


In [72]:
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
data2 = {'state': ['Ohio', 'Ohio', 'Nevada'],
'year': [2000, 2001, 2002],
'pop': [1.5, 1.7, 3.6]}
frame2 = pd.DataFrame(data2)

# append from DataFrame
frame = frame.append(frame2,ignore_index=True)

# append from Dict
frame = frame.append({"state":"Nevada","year":2005,"pop":1.7},ignore_index=True)

# append from Series
frame = frame.append(pd.Series({"state":"AAA","year":2006,"pop":1.8}),ignore_index=True)
print(frame)

     state  year  pop
0     Ohio  2000  1.5
1     Ohio  2001  1.7
2     Ohio  2002  3.6
3   Nevada  2001  2.4
4   Nevada  2002  2.9
5   Nevada  2003  3.2
6     Ohio  2000  1.5
7     Ohio  2001  1.7
8   Nevada  2002  3.6
9   Nevada  2005  1.7
10     AAA  2006  1.8
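
On pandas versions where `append()` is no longer available, the same row-wise appends can be written with `pd.concat`. This is one possible translation, not the only one; the sample frames are shortened for the example.

```python
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada"],
                      "year": [2000, 2001],
                      "pop": [1.5, 2.4]})
frame2 = pd.DataFrame({"state": ["Ohio"], "year": [2002], "pop": [3.6]})

# Appending a DataFrame.
frame = pd.concat([frame, frame2], ignore_index=True)

# Appending a single row given as a dict: wrap it in a one-row DataFrame first.
row = {"state": "Nevada", "year": 2005, "pop": 1.7}
frame = pd.concat([frame, pd.DataFrame([row])], ignore_index=True)

print(frame)
```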
