<a href="https://colab.research.google.com/github/smiledinisa/data_python_analysis/blob/master/pandas003.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
cd /content/drive/My Drive/examples

/content/drive/My Drive/examples


# Chapter 7: Data Cleaning and Preparation


During the course of doing data analysis and modeling, a significant amount of time
is spent on data preparation: loading, cleaning, transforming, and rearranging. Such
tasks are often reported to take up 80% or more of an analyst’s time. Sometimes the
way that data is stored in files or databases is not in the right format for a particular
task. Many researchers choose to do ad hoc processing of data from one form to
another using a general-purpose programming language, like Python, Perl, R, or Java,
or Unix text-processing tools like sed or awk. Fortunately, pandas, along with the
built-in Python language features, provides you with a high-level, flexible, and fast set
of tools to enable you to manipulate data into the right form.


In this chapter I discuss tools for **missing data**, **duplicate data**, **string manipulation**,
and **some other analytical data transformations**. In the next chapter, I focus on
combining and rearranging datasets in various ways.

## Handling Missing Data


For numeric data, pandas uses the floating-point
value ***NaN*** (Not a Number) to represent missing data. We call this a
***sentinel value*** that
can be easily detected:

In [5]:
import pandas as pd
from pandas import DataFrame
from pandas import Series
import numpy as np
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

In [6]:
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [9]:
string_data.isnull()  # NA stands for "not available"

0    False
1    False
2     True
3    False
dtype: bool

In [10]:
string_data[0] = None  # the built-in None is also treated as NA
string_data

0         None
1    artichoke
2          NaN
3      avocado
dtype: object

In [11]:
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

**Some functions related to missing data handling:**


![Missing-data handling functions](https://img-blog.csdnimg.cn/20200814104309972.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2dhb2N1aTg4Mw==,size_16,color_FFFFFF,t_70#pic_center)
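
As a quick illustration of the functions most often used for this purpose, here is a minimal sketch. It assumes the list above covers `isnull`, `notnull`, `dropna`, and `fillna`; the Series `s` is made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; None is converted to NaN in a float Series
s = pd.Series([1.0, np.nan, 3.5, None])

null_mask = s.isnull()     # True where a value is missing
valid_mask = s.notnull()   # the logical negation of isnull
dropped = s.dropna()       # keep only the non-missing values
filled = s.fillna(0.0)     # replace missing values with a constant

print(null_mask.tolist())  # [False, True, False, True]
print(dropped.tolist())    # [1.0, 3.5]
print(filled.tolist())     # [1.0, 0.0, 3.5, 0.0]
```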

### Filtering Out Missing Data


There are a few ways to filter out missing data. While you always have the option to
do it by hand using **isnull** and **boolean** indexing, the **dropna** method is usually more convenient.
On a Series, it returns the Series with only the non-null data and index values:

In [13]:
from numpy import nan as NA

# For a Series
data = pd.Series([1, NA, 3.5, NA, 9])
data.dropna()

0    1.0
2    3.5
4    9.0
dtype: float64

In [14]:
# Equivalent to:

data[data.notnull()]

0    1.0
2    3.5
4    9.0
dtype: float64

In [15]:
# For a DataFrame things are a bit more complex: by default, dropna drops every row that contains an NA

data = DataFrame([[1., 6.5, 3.], [1., NA, NA],[NA, NA, NA], [NA, 6.5, 3.]])
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [16]:
data.dropna()

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


data.dropna?

![dropna help output](https://img-blog.csdnimg.cn/20200814105608645.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2dhb2N1aTg4Mw==,size_16,color_FFFFFF,t_70#pic_center)

As the help output shows, the behavior of dropna can be controlled through its parameters.

In [19]:
data.dropna(axis=0, how= 'all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [20]:
data.dropna(axis= 1, how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [22]:
print(data)
data.dropna(thresh=2)  # keep only the rows that have at least 2 non-NA values

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0


Unnamed: 0,0,1,2
0,1.0,6.5,3.0
3,,6.5,3.0
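
Besides `how` and `thresh`, `dropna` also accepts a `subset` argument that restricts the NA check to particular columns. The following is a minimal sketch on a frame shaped like the one above; the variable names are chosen only for this example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1.0, 6.5, 3.0],
                      [1.0, np.nan, np.nan],
                      [np.nan, np.nan, np.nan],
                      [np.nan, 6.5, 3.0]])

# Drop a row only when column 0 is missing; NAs in other columns are ignored
by_subset = frame.dropna(subset=[0])
print(by_subset.index.tolist())  # [0, 1]

# thresh counts non-NA values: keep rows with at least 2 of them
by_thresh = frame.dropna(thresh=2)
print(by_thresh.index.tolist())  # [0, 3]
```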


### Filling In Missing Data


Rather than filtering out missing data (and potentially discarding other data along
with it), you may want to fill in the “holes” in any number of ways. Dropping always
removes an entire row, and that row may still hold valid values in other columns, so
filling is often the more useful choice. For most purposes, the ***fillna*** method is the
workhorse function to use. Calling fillna with a
constant replaces missing values with that value:



In [23]:
df = DataFrame(np.random.randn(7,3))
df.iloc[:4,1] = NA
df.iloc[:2,2] = NA
df

Unnamed: 0,0,1,2
0,2.598565,,
1,0.350325,,
2,1.537747,,-0.581092
3,0.419049,,-1.123682
4,-1.23684,0.439199,0.292599
5,1.447771,-1.378561,0.438132
6,0.002066,0.015012,-1.51247


In [24]:
df.fillna(0)  # fill NA values with 0

Unnamed: 0,0,1,2
0,2.598565,0.0,0.0
1,0.350325,0.0,0.0
2,1.537747,0.0,-0.581092
3,0.419049,0.0,-1.123682
4,-1.23684,0.439199,0.292599
5,1.447771,-1.378561,0.438132
6,0.002066,0.015012,-1.51247


![fillna screenshot](https://img-blog.csdnimg.cn/20200814111702964.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2dhb2N1aTg4Mw==,size_16,color_FFFFFF,t_70#pic_center)


![fillna screenshot](https://img-blog.csdnimg.cn/20200814112048496.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2dhb2N1aTg4Mw==,size_16,color_FFFFFF,t_70#pic_center)

In [26]:
df.fillna({1: 0.5, 2: 0})  # fill NA values in column 1 with 0.5 and in column 2 with 0

Unnamed: 0,0,1,2
0,2.598565,0.5,0.0
1,0.350325,0.5,0.0
2,1.537747,0.5,-0.581092
3,0.419049,0.5,-1.123682
4,-1.23684,0.439199,0.292599
5,1.447771,-1.378561,0.438132
6,0.002066,0.015012,-1.51247


In [28]:
_ = df.fillna(0,inplace=True)
df

Unnamed: 0,0,1,2
0,2.598565,0.0,0.0
1,0.350325,0.0,0.0
2,1.537747,0.0,-0.581092
3,0.419049,0.0,-1.123682
4,-1.23684,0.439199,0.292599
5,1.447771,-1.378561,0.438132
6,0.002066,0.015012,-1.51247


In [35]:
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:,1] = NA
df.iloc[4:,2] = np.nan
df

Unnamed: 0,0,1,2
0,-0.664088,1.844724,1.695235
1,-0.608011,-0.168809,1.517685
2,0.077567,,0.08068
3,0.690576,,-0.941624
4,-0.720014,,
5,0.776801,,


In [36]:
df.fillna(method='ffill')  # forward fill; pandas 2.1+ deprecates method= in favor of df.ffill()

Unnamed: 0,0,1,2
0,-0.664088,1.844724,1.695235
1,-0.608011,-0.168809,1.517685
2,0.077567,-0.168809,0.08068
3,0.690576,-0.168809,-0.941624
4,-0.720014,-0.168809,-0.941624
5,0.776801,-0.168809,-0.941624


In [37]:
df.fillna(method='ffill', limit= 2)

Unnamed: 0,0,1,2
0,-0.664088,1.844724,1.695235
1,-0.608011,-0.168809,1.517685
2,0.077567,-0.168809,0.08068
3,0.690576,-0.168809,-0.941624
4,-0.720014,,-0.941624
5,0.776801,,-0.941624


In [40]:
df.fillna(method='bfill', limit= 1 ,axis= 1)

Unnamed: 0,0,1,2
0,-0.664088,1.844724,1.695235
1,-0.608011,-0.168809,1.517685
2,0.077567,0.08068,0.08068
3,0.690576,-0.941624,-0.941624
4,-0.720014,,
5,0.776801,,


In [44]:
df.fillna(method='ffill', limit= 1, axis=1)

Unnamed: 0,0,1,2
0,-0.664088,1.844724,1.695235
1,-0.608011,-0.168809,1.517685
2,0.077567,0.077567,0.08068
3,0.690576,0.690576,-0.941624
4,-0.720014,-0.720014,
5,0.776801,0.776801,
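
The same forward and backward filling can also be written with the dedicated `ffill` and `bfill` methods, which recent pandas versions recommend over `fillna(method=...)`. This is a small sketch on made-up data, not a rerun of the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps of different lengths
frame = pd.DataFrame({"a": [1.0, np.nan, np.nan, np.nan],
                      "b": [np.nan, 2.0, np.nan, 4.0]})

forward = frame.ffill()         # propagate the last valid value downward
limited = frame.ffill(limit=1)  # fill at most one consecutive NA per gap
backward = frame.bfill()        # propagate the next valid value upward

print(forward["a"].tolist())    # [1.0, 1.0, 1.0, 1.0]
print(backward["b"].tolist())   # [2.0, 2.0, 4.0, 4.0]
```

Note that a leading NA (column `b`, row 0) has no earlier value to carry forward, so `ffill` leaves it missing.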


In [47]:
# With fillna you can get creative in how you fill, for example with the mean of the Series:

data = Series([1, np.nan, 3.5, np.nan, 5])
print(data)
data.fillna(data.mean())

0    1.0
1    NaN
2    3.5
3    NaN
4    5.0
dtype: float64


0    1.000000
1    3.166667
2    3.500000
3    3.166667
4    5.000000
dtype: float64
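
The same idea extends to a DataFrame: passing a Series of per-column statistics to `fillna` fills each column with its own value. A minimal sketch, with a small frame invented for the example:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"x": [1.0, np.nan, 3.0],
                      "y": [np.nan, 4.0, 8.0]})

col_means = frame.mean()          # NAs are skipped by default: x -> 2.0, y -> 6.0
filled = frame.fillna(col_means)  # each column is filled with its own mean

print(filled["x"].tolist())  # [1.0, 2.0, 3.0]
print(filled["y"].tolist())  # [6.0, 4.0, 8.0]
```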

## Data Transformation


So far in this chapter we’ve been concerned with handling missing data. Filtering, cleaning,
and other transformations are another class of important operations.



### Removing Duplicates

Datasets often contain redundant or duplicated rows. Consider this example:

In [49]:
data = DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
'k2': [1, 1, 2, 3, 3, 4, 4]})
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


The DataFrame method ***duplicated*** returns a boolean Series indicating whether each
row is a duplicate (has been observed in a previous row) or not:


![duplicated screenshot](https://img-blog.csdnimg.cn/20200814114327579.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2dhb2N1aTg4Mw==,size_16,color_FFFFFF,t_70#pic_center)

In [51]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [53]:
# drop_duplicates returns a DataFrame with the duplicate rows removed

data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


In [55]:
# Add a new column, then drop duplicates based on the k1 column only.
data['v'] = range(7)
print(data)
data.drop_duplicates(['k1'])

    k1  k2  v
0  one   1  0
1  two   1  1
2  one   2  2
3  two   3  3
4  one   3  4
5  two   4  5
6  two   4  6


Unnamed: 0,k1,k2,v
0,one,1,0
1,two,1,1


In [56]:
data.drop_duplicates(['k1','k2'], keep='last')

Unnamed: 0,k1,k2,v
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
6,two,4,6


In [57]:
data.drop_duplicates(['k1','k2'], keep = 'first')

Unnamed: 0,k1,k2,v
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5
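
In addition to `'first'` and `'last'`, the `keep` argument accepts `False`, which treats every member of a duplicated group as a duplicate. A short sketch on a smaller, made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({"k1": ["one", "two", "one", "two", "two"],
                      "k2": [1, 1, 2, 4, 4]})

# keep=False flags all rows that occur more than once, not just the repeats
all_dups = frame.duplicated(keep=False)
print(all_dups.tolist())            # [False, False, False, True, True]

# With keep=False, drop_duplicates removes every copy of a duplicated row
unique_only = frame.drop_duplicates(keep=False)
print(unique_only.index.tolist())   # [0, 1, 2]
```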


### Transforming Data Using a Function or Mapping

For many datasets, you may wish to perform some transformation based on the values
in an array, Series, or column in a DataFrame. Consider the following hypothetical
data collected about various kinds of meat:


In [58]:
data = DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
          'Pastrami', 'corned beef', 'Bacon',
          'pastrami', 'honey ham', 'nova lox'],
          'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

# Glossary: pulled pork = shredded pork; pastrami = spiced smoked beef; corned beef = salt-cured beef;
# honey ham = honey-glazed ham; nova lox = salt-cured salmon

data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


In [59]:
# Mapping of each distinct meat type to the kind of animal it comes from:
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}
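
One way to apply such a mapping is the Series `map` method. Because some entries in the `food` column are capitalized while the dictionary keys are all lowercase, the sketch below lowercases the values first. The `animal` column name is chosen here only for illustration.

```python
import pandas as pd

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}

# Normalize the case so that 'Pastrami' and 'Bacon' match the dictionary keys
lowercased = data['food'].str.lower()

# map looks up each value in the dict; keys that are absent would become NaN
data['animal'] = lowercased.map(meat_to_animal)
print(data)
```

An equivalent form passes a function instead of a dict: `data['food'].map(lambda x: meat_to_animal[x.lower()])`.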

### Replacing Values

### Renaming Axis Indexes

### Discretization and Binning

### Permutation and Random Sampling

### Computing Indicator/Dummy Variables

## String Manipulation

### String Object Methods

### Regular Expressions

### Vectorized String Functions in Pandas
