<h1>DataFrame Survival Handbook</h1>

@Author: SharkChilli_zx<br>
@Date: 2019

**Useful links**:

[pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html)

<h2>Intro</h2>

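Install pandas: `!pip install pandas`

Check the pandas version: `pd.__version__` (`pd.show_versions()` prints a fuller report that also lists the installed dependencies)

Upgrade pandas: `!pip install --upgrade pandas`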

In [1]:
import numpy as np
import pandas as pd

---
<h2><font color=green>Reading and Writing Data Files</font></h2>

<h3>CSV files</h3>

Read: `pd.read_csv()`

Write: `df.to_csv(path_or_buf=)`
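As a minimal sketch of the round trip, the example below writes a small DataFrame to an in-memory buffer and reads it back. The sample data and the use of `io.StringIO` in place of a real file path are assumptions made for illustration only.

```python
import io

import pandas as pd

# Hypothetical sample data, used only for this illustration.
df = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Bob', 'Cindy', 'Ann']})

# Write to an in-memory text buffer instead of a file on disk.
# index=False keeps the row index out of the CSV.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer and parse the CSV text back into a DataFrame.
buffer.seek(0)
df_back = pd.read_csv(buffer)
print(df_back)

# If the index IS written (the default) and read back without index_col,
# it comes back as an extra column named 'Unnamed: 0'.
# to_csv() with no path returns the CSV text as a string.
with_index = pd.read_csv(io.StringIO(df.to_csv()))
print(list(with_index.columns))
```

Passing `index_col=0` to `pd.read_csv` is one way to turn that extra column back into the index.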

---
<h2><font color=green>Working with DataFrame Columns</font></h2>

In [2]:
data_dict = {'b': ['Bob', 'Bob', 'Cindy', 'Cindy', 'Ann', 'Ann', 'Tom', 'Tom', 'Jerry', 'Jerry'],
             'a': [1,1,2,2,3,3,4,4,5,5]}

df_1 = pd.DataFrame(data_dict)

In [3]:
print('df_1 Shape:', str(df_1.shape),
      '\nColumn Names:', str(df_1.columns))

df_1 Shape: (10, 2) 
Column Names: Index(['b', 'a'], dtype='object')


<h3>Renaming columns</h3>

<h4>df.rename</h4>

> df.rename(columns={...}, inplace=True/False)

In [4]:
df_1.rename(columns={'b': 'Name',
                     'a': 'ID'},
            inplace=True)

print('Column Names(renamed):', df_1.columns)

Column Names(renamed): Index(['Name', 'ID'], dtype='object')
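For comparison, here is a small sketch of `rename` without `inplace=True`. The frame and the `'missing'` key are hypothetical; the point is that the call returns a new DataFrame and, with the default `errors='ignore'`, silently skips labels that do not exist.

```python
import pandas as pd

# Hypothetical two-column frame mirroring the original column names.
df = pd.DataFrame({'b': ['Bob', 'Cindy'], 'a': [1, 2]})

# Without inplace=True the original is untouched and a renamed copy is returned.
# The 'missing' key matches no column and is ignored by default.
renamed = df.rename(columns={'b': 'Name', 'a': 'ID', 'missing': 'Ignored'})

print(list(df.columns))       # original names are unchanged
print(list(renamed.columns))  # new names on the returned frame
```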


<h3>Reordering columns</h3>

<h4>df.reindex or df[[...]]</h4>

> df = df.reindex(columns=[...])

is equivalent, as long as every listed column already exists, to

> df = df[ [...] ]

In [5]:
df_1 = df_1.reindex(columns=['ID', 'Name'])

print('Column Names(reordered):', df_1.columns)

Column Names(reordered): Index(['ID', 'Name'], dtype='object')
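The two spellings differ once a requested column does not exist. The sketch below uses a hypothetical frame and a hypothetical missing column `'Score'`: `reindex` creates the column filled with NaN, while `df[[...]]` raises a `KeyError`.

```python
import pandas as pd

# Hypothetical frame for illustration.
df = pd.DataFrame({'Name': ['Bob', 'Cindy'], 'ID': [1, 2]})

# Both forms reorder existing columns and give the same result.
by_reindex = df.reindex(columns=['ID', 'Name'])
by_selection = df[['ID', 'Name']]
same = by_reindex.equals(by_selection)

# reindex adds an unknown column filled with NaN ...
with_extra = df.reindex(columns=['ID', 'Name', 'Score'])

# ... whereas list selection refuses unknown labels.
try:
    df[['ID', 'Name', 'Score']]
    raised = False
except KeyError:
    raised = True

print(same, with_extra['Score'].isna().all(), raised)
```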


<h3>Inserting a new column</h3>

> df['column_name'] = add_list

In [6]:
add_list = ['A', 'A', 'A', 'A', 'B', 'C', 'A', 'A', 'A', 'A']

df_1['Score'] = add_list

print('Column Names(with new column):', df_1.columns)

Column Names(with new column): Index(['ID', 'Name', 'Score'], dtype='object')
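Assignment always appends the new column at the end. When the position matters, `DataFrame.insert(loc, column, value)` is one option; the sketch below uses a hypothetical `'Group'` column and made-up values.

```python
import pandas as pd

# Hypothetical frame for illustration.
df = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Bob', 'Cindy', 'Ann']})

# Plain assignment appends the column on the right.
df['Score'] = ['A', 'B', 'C']

# insert() places the column at a given position (0-based).
# It modifies df in place and returns None.
df.insert(1, 'Group', ['x', 'y', 'x'])

print(list(df.columns))
```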


<h3>Sorting by column</h3>

<h4>df.sort_values</h4>
Sort by the values along either axis.
> df.**<font color=red>sort_values</font>**(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

* <font color=blue>by</font>:<br>
Name or list of names to sort by
    * if axis is 0 or 'index' then by may contain index levels and/or column labels
    * if axis is 1 or 'columns' then by may contain column levels and/or index labels
* <font color=blue>axis</font>:<br>
{0 or 'index', 1 or 'columns'}, default 0
* <font color=blue>ascending</font>:<br>
bool or list of bool, default True
    * Sort ascending vs descending. Specify list for multiple sort orders. If this is a list of bools, must match the length of the by.
* <font color=blue>inplace</font>:<br>
bool, default False
    * if True, perform operation in-place
* <font color=blue>kind</font>:<br>
{'quicksort', 'mergesort', 'heapsort'}, default 'quicksort'
    * Choice of sorting algorithm. See also `numpy.sort` for more information. *mergesort* is the only stable algorithm. For DataFrames, this option is **only** applied when sorting on a single column or label.
* <font color=blue>na_position</font>:<br>
{'first', 'last'}, default 'last'
    * *first* puts NaNs at the beginning, *last* puts NaNs at the end.

In [7]:
df_1.sort_values(by='ID', ascending=False)

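|   | ID | Name | Score |
|---|---|---|---|
| 8 | 5 | Jerry | A |
| 9 | 5 | Jerry | A |
| 6 | 4 | Tom | A |
| 7 | 4 | Tom | A |
| 4 | 3 | Ann | B |
| 5 | 3 | Ann | C |
| 2 | 2 | Cindy | A |
| 3 | 2 | Cindy | A |
| 0 | 1 | Bob | A |
| 1 | 1 | Bob | A |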
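The `ascending`, `na_position` and `kind` parameters described above can be combined as in the sketch below. The frame and its `'Points'` column are hypothetical and exist only to show a multi-column sort with a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
df = pd.DataFrame({'ID': [2, 1, 2, 1],
                   'Points': [90.0, np.nan, 80.0, 70.0]})

# Sort by ID descending, then by Points ascending within each ID,
# placing NaN before the other values.
multi = df.sort_values(by=['ID', 'Points'],
                       ascending=[False, True],
                       na_position='first')
print(multi)

# A stable single-column sort keeps tied rows in their original order.
stable = df.sort_values(by='ID', kind='mergesort')
print(stable)
```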


---
<h2><font color=green>Removing Duplicate Rows</font></h2>

In [8]:
df_1

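|   | ID | Name | Score |
|---|---|---|---|
| 0 | 1 | Bob | A |
| 1 | 1 | Bob | A |
| 2 | 2 | Cindy | A |
| 3 | 2 | Cindy | A |
| 4 | 3 | Ann | B |
| 5 | 3 | Ann | C |
| 6 | 4 | Tom | A |
| 7 | 4 | Tom | A |
| 8 | 5 | Jerry | A |
| 9 | 5 | Jerry | A |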


<h3>duplicated</h3>

> df.**duplicated**(subset=None, keep='first')

Inspect which rows are duplicates

In [9]:
df_1.duplicated(subset=['ID', 'Name'])

0    False
1     True
2    False
3     True
4    False
5     True
6    False
7     True
8    False
9     True
dtype: bool

In [10]:
df_1[df_1.duplicated()]

|   | ID | Name | Score |
|---|---|---|---|
| 1 | 1 | Bob | A |
| 3 | 2 | Cindy | A |
| 7 | 4 | Tom | A |
| 9 | 5 | Jerry | A |
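The `keep` argument decides which member of each duplicate group is flagged. A minimal sketch on a hypothetical, shortened version of the table:

```python
import pandas as pd

# Hypothetical, shortened version of the example table.
df = pd.DataFrame({'ID': [1, 1, 2, 3, 3],
                   'Name': ['Bob', 'Bob', 'Cindy', 'Ann', 'Ann'],
                   'Score': ['A', 'A', 'A', 'B', 'C']})

key = ['ID', 'Name']

# keep='first' (default): every occurrence after the first is flagged.
flag_first = df.duplicated(subset=key)

# keep='last': every occurrence before the last is flagged.
flag_last = df.duplicated(subset=key, keep='last')

# keep=False: all members of a duplicate group are flagged.
flag_all = df.duplicated(subset=key, keep=False)

# Summing the boolean Series counts the flagged rows.
n_duplicates = int(flag_first.sum())
print(flag_all.tolist(), n_duplicates)
```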


<h3>drop_duplicates</h3>

Return DataFrame with duplicate rows removed, optionally only considering certain columns.
> df.**<font color=red>drop_duplicates</font>**(subset=None, keep='first', inplace=False)

* <font color=blue>subset</font>:<br> column label or sequence of labels, optional
    * Only consider certain columns for identifying duplicates, by default use all of the columns.
* <font color=blue>keep</font>:<br> {‘first’, ‘last’, False}, default ‘first’
    * first : Drop duplicates except for the first occurrence.
    * last : Drop duplicates except for the last occurrence.
    * False : Drop all duplicates.
* <font color=blue>inplace</font>:<br> boolean, default False
    * Whether to drop duplicates in place or to return a copy

In [11]:
df_1.drop_duplicates()

|   | ID | Name | Score |
|---|---|---|---|
| 0 | 1 | Bob | A |
| 2 | 2 | Cindy | A |
| 4 | 3 | Ann | B |
| 5 | 3 | Ann | C |
| 6 | 4 | Tom | A |
| 8 | 5 | Jerry | A |


In [12]:
df_1.drop_duplicates(keep='last')

|   | ID | Name | Score |
|---|---|---|---|
| 1 | 1 | Bob | A |
| 3 | 2 | Cindy | A |
| 4 | 3 | Ann | B |
| 5 | 3 | Ann | C |
| 7 | 4 | Tom | A |
| 9 | 5 | Jerry | A |


In [13]:
df_1.drop_duplicates(subset=['ID', 'Name'])

|   | ID | Name | Score |
|---|---|---|---|
| 0 | 1 | Bob | A |
| 2 | 2 | Cindy | A |
| 4 | 3 | Ann | B |
| 6 | 4 | Tom | A |
| 8 | 5 | Jerry | A |


In [14]:
df_1.drop_duplicates(subset='Score')

|   | ID | Name | Score |
|---|---|---|---|
| 0 | 1 | Bob | A |
| 4 | 3 | Ann | B |
| 5 | 3 | Ann | C |


In [15]:
df_1.drop_duplicates(subset=['ID', 'Name'], inplace=True)

In [16]:
df_1

|   | ID | Name | Score |
|---|---|---|---|
| 0 | 1 | Bob | A |
| 2 | 2 | Cindy | A |
| 4 | 3 | Ann | B |
| 6 | 4 | Tom | A |
| 8 | 5 | Jerry | A |
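After deduplication the surviving rows keep their original index labels, which leaves gaps such as 0, 2, 4. The sketch below, on a hypothetical shortened table, shows one way to renumber them with `reset_index(drop=True)`, and also what `keep=False` leaves behind.

```python
import pandas as pd

# Hypothetical, shortened version of the example table.
df = pd.DataFrame({'ID': [1, 1, 2, 3, 3],
                   'Name': ['Bob', 'Bob', 'Cindy', 'Ann', 'Ann']})

# The first row of each (ID, Name) group survives, with its old label.
deduped = df.drop_duplicates(subset=['ID', 'Name'])
print(deduped.index.tolist())

# reset_index(drop=True) renumbers from 0 and discards the old labels.
renumbered = deduped.reset_index(drop=True)
print(renumbered.index.tolist())

# keep=False drops every row that has a duplicate, leaving only unique rows.
only_unique = df.drop_duplicates(subset=['ID', 'Name'], keep=False)
print(only_unique)
```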


---
<h2><font color=green>Indexing</font></h2>

<h3>MultiIndex DataFrame</h3>

* <font color=blue>levels</font>:<br>
sequence of arrays. The unique labels for each level.
* <font color=blue>codes</font>:<br>
sequence of arrays. Integers for each level designating which label at each location.

In [17]:
midx = pd.MultiIndex(levels=[['lama', 'cow', 'falcon'],
                             ['speed', 'weight', 'length']],
                     codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2],
                            [0, 1, 2, 0, 1, 2, 0, 1, 2]])

multiIndex_df = pd.DataFrame(index=midx, 
                             columns=['big', 'small'],
                             data=[[45, 30],[200, 100],[1.5, 1],
                                   [30, 20],[250,150], [1.5, 0.8], 
                                   [320, 250],[1, 0.8], [0.3,0.2]])

multiIndex_df

|   |   | big | small |
|---|---|---|---|
| lama | speed | 45.0 | 30.0 |
| lama | weight | 200.0 | 100.0 |
| lama | length | 1.5 | 1.0 |
| cow | speed | 30.0 | 20.0 |
| cow | weight | 250.0 | 150.0 |
| cow | length | 1.5 | 0.8 |
| falcon | speed | 320.0 | 250.0 |
| falcon | weight | 1.0 | 0.8 |
| falcon | length | 0.3 | 0.2 |
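Because the codes above enumerate every combination of the two levels, the same index can also be built with `pd.MultiIndex.from_product`. The sketch below rebuilds the frame and shows a few common ways to select from it; the variable names are chosen for illustration only.

```python
import pandas as pd

animals = ['lama', 'cow', 'falcon']
measures = ['speed', 'weight', 'length']

# Explicit construction from levels and codes, as in the cell above.
midx = pd.MultiIndex(levels=[animals, measures],
                     codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2],
                            [0, 1, 2, 0, 1, 2, 0, 1, 2]])

# from_product yields the same (animal, measure) tuples in the same order.
product_idx = pd.MultiIndex.from_product([animals, measures])
same_tuples = list(midx) == list(product_idx)

df = pd.DataFrame(index=midx,
                  columns=['big', 'small'],
                  data=[[45, 30], [200, 100], [1.5, 1],
                        [30, 20], [250, 150], [1.5, 0.8],
                        [320, 250], [1, 0.8], [0.3, 0.2]])

# Select all rows under one outer label.
cow = df.loc['cow']

# Select a single cell with a full (outer, inner) key.
falcon_speed_big = df.loc[('falcon', 'speed'), 'big']

# Cross-section on the inner level: the 'speed' row of every animal.
speeds = df.xs('speed', level=1)
print(cow, falcon_speed_big, speeds, sep='\n')
```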


<h3>Changing the index</h3>

Renumber the index so that it starts at 1:
> df_1.index = range(1,len(df_1) + 1)
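A short sketch of this on a hypothetical frame whose index has gaps, as it would after dropping duplicates:

```python
import pandas as pd

# Hypothetical frame with a gappy index.
df = pd.DataFrame({'Name': ['Bob', 'Cindy', 'Ann']}, index=[0, 2, 4])

# Assigning to .index replaces the labels in place; the length must match.
df.index = range(1, len(df) + 1)

print(df.index.tolist())
print(df.loc[1, 'Name'])  # label-based lookup now starts at 1
```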

---
<h2><font color=green>Filtering and Cleaning Data</font></h2>

In [18]:
raw_data = pd.DataFrame({'creativeID': [1, 2, 3, 4, 5],
                         'adID': [11, 22, 33, 44, 55],
                         'appID': [123, 123, 123, 124, 125],
                         'name':['A', 'A', 'B', 'C', 'D']})
raw_data

|   | creativeID | adID | appID | name |
|---|---|---|---|---|
| 0 | 1 | 11 | 123 | A |
| 1 | 2 | 22 | 123 | A |
| 2 | 3 | 33 | 123 | B |
| 3 | 4 | 44 | 124 | C |
| 4 | 5 | 55 | 125 | D |


<h3>Deleting rows by value</h3>

<h4>drop: delete rows by the index labels of the rows that match a condition</h4>

Drop specified labels from rows or columns.
> df.**<font color=red>drop</font>**(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

* <font color=blue>labels</font>:<br>
Index or column labels to drop
* <font color=blue>axis</font>:<br>
{0 or 'index', 1 or 'columns'}, default 0. Whether to drop labels from the index(0 or 'index') or columns(1 or 'columns')
* <font color=blue>index, columns</font>:<br>
Alternative to specifying axis (`labels, axis=1` is equivalent to `columns=labels`)
* <font color=blue>level</font>:<br>
int or level name, optional. For multiIndex, level from which the labels will be removed.
* <font color=blue>inplace</font>:<br>
bool, default False. If True, do operation inplace and return None.
* <font color=blue>errors</font>:<br>
{'ignore', 'raise'}, default 'raise'. If 'ignore', suppress error and only existing labels are dropped.

Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level.
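The parameters above can be sketched on a hypothetical three-row frame. The label `'missing'` is deliberately not a real column, to show the effect of `errors='ignore'`.

```python
import pandas as pd

# Hypothetical frame for illustration.
df = pd.DataFrame({'creativeID': [1, 2, 3],
                   'adID': [11, 22, 33],
                   'appID': [123, 123, 124]})

# Two equivalent ways to drop a column.
no_ad = df.drop(columns='adID')
no_ad_axis = df.drop('adID', axis=1)

# Drop rows by index label.
fewer_rows = df.drop(index=[0, 2])

# errors='ignore' skips labels that do not exist instead of raising KeyError.
safe = df.drop(columns=['adID', 'missing'], errors='ignore')

print(list(no_ad.columns), fewer_rows.index.tolist(), list(safe.columns))
```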

In [19]:
# condition on a single column
raw_data.drop(index=list(raw_data[raw_data['appID']==123].index))

|   | creativeID | adID | appID | name |
|---|---|---|---|---|
| 3 | 4 | 44 | 124 | C |
| 4 | 5 | 55 | 125 | D |


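When several columns are involved, the conditions here are combined with AND (an intersection): a row is dropped only if it satisfies all of them.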

In [20]:
# conditions on multiple columns
raw_data.drop(index=list(raw_data[(raw_data['appID']==123)&(raw_data['name']=='B')].index))


|   | creativeID | adID | appID | name |
|---|---|---|---|---|
| 0 | 1 | 11 | 123 | A |
| 1 | 2 | 22 | 123 | A |
| 3 | 4 | 44 | 124 | C |
| 4 | 5 | 55 | 125 | D |
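The same result can be reached without `drop`, by keeping the rows where the condition is false. The sketch below names the mask `matches` for illustration. The two approaches agree here because the index labels are unique; with repeated labels, `drop` would remove every row sharing a dropped label.

```python
import pandas as pd

raw_data = pd.DataFrame({'creativeID': [1, 2, 3, 4, 5],
                         'adID': [11, 22, 33, 44, 55],
                         'appID': [123, 123, 123, 124, 125],
                         'name': ['A', 'A', 'B', 'C', 'D']})

# Rows that satisfy BOTH conditions.
matches = (raw_data['appID'] == 123) & (raw_data['name'] == 'B')

# Approach 1: look up the matching index labels and drop them.
via_drop = raw_data.drop(index=raw_data[matches].index)

# Approach 2: keep the rows where the mask is False.
via_mask = raw_data[~matches]

print(via_drop.equals(via_mask))
print(via_mask.index.tolist())
```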


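<h4>True^df.isin([...])</h4>

This approach is best reserved for filtering on a single column. `True ^ mask` flips every value of a boolean mask, so it is equivalent to the more common `~mask`.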

In [21]:
raw_data[True^raw_data['adID'].isin([11])]

|   | creativeID | adID | appID | name |
|---|---|---|---|---|
| 1 | 2 | 22 | 123 | A |
| 2 | 3 | 33 | 123 | B |
| 3 | 4 | 44 | 124 | C |
| 4 | 5 | 55 | 125 | D |


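Drop every row whose 'appID' is 123 or 125, and also drop every row whose 'creativeID' is 3.

**<font color=blue>Note that the drop conditions here are not combined with AND (an intersection)! A row is dropped as soon as it satisfies any one of the single-column conditions</font>**.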

In [22]:
raw_data[(True^raw_data['appID'].isin([123, 125]))&(True^raw_data['creativeID'].isin([3]))]


|   | creativeID | adID | appID | name |
|---|---|---|---|---|
| 3 | 4 | 44 | 124 | C |


In [23]:
raw_data[(True^raw_data['appID'].isin([123]))&(True^raw_data['name'].isin(['C']))]


|   | creativeID | adID | appID | name |
|---|---|---|---|---|
| 4 | 5 | 55 | 125 | D |
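Why the drop conditions behave like OR follows from De Morgan's law: keeping rows where "not A and not B" holds is the same as dropping rows where "A or B" holds. A minimal sketch, with `drop_mask` as an illustrative name:

```python
import pandas as pd

raw_data = pd.DataFrame({'creativeID': [1, 2, 3, 4, 5],
                         'adID': [11, 22, 33, 44, 55],
                         'appID': [123, 123, 123, 124, 125],
                         'name': ['A', 'A', 'B', 'C', 'D']})

in_app = raw_data['appID'].isin([123, 125])
in_creative = raw_data['creativeID'].isin([3])

# True ^ mask and ~mask both invert a boolean Series.
same_inversion = (True ^ in_app).equals(~in_app)

# Keep rows that fail both tests ...
kept_and = raw_data[(~in_app) & (~in_creative)]

# ... which equals dropping rows that pass either test.
drop_mask = in_app | in_creative
kept_or = raw_data[~drop_mask]

print(same_inversion, kept_and.equals(kept_or), kept_or.index.tolist())
```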
