# Pandas: A Powerful Tool for Data Analysis


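## Introduction
pandas is a library built on NumPy and created for data analysis tasks.
pandas incorporates a number of libraries and some standard data models, and it provides the tools needed to work efficiently with large datasets. It offers many functions and methods for processing data quickly and conveniently. You will soon find that it is one of the key reasons Python is a powerful and efficient environment for data analysis.
Features of pandas:

* Fast analysis of spreadsheet-like data. Its most important concept is the DataFrame.
* Powerful input/output mechanisms that make it easy to load and write txt, csv, xls, or xlsx files.
* Support for many data types, along with conversion between them.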
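
To make the last point concrete, here is a minimal sketch of type conversion with `astype`. The column names and values are hypothetical sample data, chosen to mimic numbers that arrive as strings after loading a text file.

```python
import pandas as pd

# Hypothetical sample data: numeric values stored as strings.
df = pd.DataFrame({"id": ["1", "2", "3"], "score": ["3.5", "4.0", "2.5"]})
print(df.dtypes)  # both columns hold strings at this point

# astype converts a column to the requested type.
df["id"] = df["id"].astype(int)
df["score"] = df["score"].astype(float)
print(df.dtypes)

# Once the column is numeric, arithmetic works as expected.
total = df["score"].sum()
print(total)
```

After the conversion, numeric methods such as `sum` operate on the values rather than concatenating strings.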

## Installing pandas
Anaconda ships with pandas. In a standalone Python installation, install it with ``pip install pandas``.

## Creating DataFrames

## Series
* A Series is an array-like data structure in which every value carries a label; together the labels form the index.

In [4]:
import numpy as np
import pandas as pd
s1 = pd.Series([1, 2, 3, np.nan, 5, 6])
print(s1)  # index on the left, values on the right
print('--------------------------------------------------------')

s2 = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
print(s2)
print('--------------------------------------------------------')


0    1.0
1    2.0
2    3.0
3    NaN
4    5.0
5    6.0
dtype: float64
--------------------------------------------------------
a    1
b    2
c    3
d    4
dtype: int64
--------------------------------------------------------


In [6]:
# Access the element labeled "a"
print(s2["a"])

# Access the elements labeled "a" and "c"
print(s2[["a","c"]])

# Get all the values
print(s2.values)

# Get the index
print(s2.index)


1
a    1
c    3
dtype: int64
[1 2 3 4]
Index(['a', 'b', 'c', 'd'], dtype='object')
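
Because a Series carries labels, selection and arithmetic can work by label rather than by position. The sketch below rebuilds `s2` from the cell above; the second Series, `other`, is a hypothetical addition used only to illustrate label alignment.

```python
import pandas as pd

s2 = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

# Boolean mask: keep only the elements greater than 2.
large = s2[s2 > 2]
print(large)

# Label-based slicing with .loc includes both endpoints.
middle = s2.loc["b":"c"]
print(middle)

# Arithmetic aligns on labels, not positions; labels missing on one side give NaN.
other = pd.Series([10, 20], index=["a", "d"])
total = s2 + other
print(total)
```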


## DataFrame
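* A DataFrame is a data structure similar to a database table. It contains an ordered set of columns, and each column can hold a different value type. A DataFrame has both a row index and a column index, and it can be viewed as a dictionary of Series.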
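
The "dictionary of Series" view can be sketched directly. The column names and values below are hypothetical; each Series becomes one column, and the dictionary key becomes the column name.

```python
import pandas as pd

# Hypothetical columns that share the same row labels.
visitors = pd.Series([43, 34, 65], index=["mon", "tue", "wed"])
source = pd.Series(["ad", "search", "ad"], index=["mon", "tue", "wed"])

df = pd.DataFrame({"visitors": visitors, "source": source})
print(df)

# Selecting one column gives back a Series that shares the row index.
col = df["visitors"]
print(type(col).__name__)

# Each column keeps its own value type.
print(df.dtypes)
```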

In [None]:
import numpy as np
import pandas as pd

column_name = ['id','var1','var2','var3','var4','var5']  # column names given as a list
df = pd.DataFrame(np.arange(3600).reshape(600,6), columns=column_name)
print(df)


In [None]:
dates=pd.date_range('20180310',periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=['A','B','C','D'])  # 6 rows by 4 columns of random values
print(df)  # prints the 6x4 table


In [7]:
import pandas as pd

web_stats = {'Day':[1,2,3,4,5,6],
             'Visitors':[43,34,65,56,29,76],
             'Bounce Rate':[65,67,78,65,45,52]}

df = pd.DataFrame(web_stats)
print(df)


   Day  Visitors  Bounce Rate
0    1        43           65
1    2        34           67
2    3        65           78
3    4        56           65
4    5        29           45
5    6        76           52


In [9]:
df = pd.DataFrame(columns=['A', 'B', 'C', 'D'])
print(df)


Empty DataFrame
Columns: [A, B, C, D]
Index: []
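
An empty DataFrame like the one above is mostly useful as a placeholder. When rows arrive one at a time, one common approach (a sketch under assumed data, not the only option) is to collect them in a plain list and build the DataFrame once at the end. The row values here are hypothetical.

```python
import pandas as pd

columns = ["A", "B", "C", "D"]

# The placeholder from the cell above has columns but no rows.
empty = pd.DataFrame(columns=columns)
print(empty.empty)

# Hypothetical rows gathered one at a time, for example inside a loop.
rows = []
for i in range(3):
    rows.append([i, i * 2, i * 3, i * 4])

# Build the DataFrame in one step from the collected rows.
df = pd.DataFrame(rows, columns=columns)
print(df)
```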


## Pandas导入导出
* pandas can read and write data in formats such as CSV, Excel, JSON, HTML, and SQL databases.
* Official documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
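
The I/O functions follow a naming pattern: `pd.read_*` functions load data, and `to_*` methods write it. The sketch below uses a small hypothetical DataFrame and in-memory text instead of files, so it runs without any data directory.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a real file.
df = pd.DataFrame({"Name": ["Ada", "Grace"], "Age": [36, 85]})

# With no path argument, these writers return the text instead of writing a file.
csv_text = df.to_csv(index=False)
json_text = df.to_json(orient="records")
print(csv_text)
print(json_text)

# The readers accept file-like objects as well as paths.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))
print(from_csv)
print(from_json)
```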

![image.png](attachment:image.png)

In [13]:
# Read and write Series and DataFrame objects in pickle format

import pandas as pd
scientists = pd.read_csv('../data/scientists.csv')
names = scientists['Name']
print(names)

import os

if not os.path.exists('../output/scientists_names_series.pickle'):
    names.to_pickle('../output/scientists_names_series.pickle')
if not os.path.exists('../output/scientists_df.pickle'):
    scientists.to_pickle('../output/scientists_df.pickle')

# Read the pickle files back
scientists_names_from_pickle = pd.read_pickle('../output/scientists_names_series.pickle')
print(scientists_names_from_pickle)

scientists_from_pickle = pd.read_pickle('../output/scientists_df.pickle')
print(scientists_from_pickle)


0       Rosaline Franklin
1          William Gosset
2    Florence Nightingale
3             Marie Curie
4           Rachel Carson
5               John Snow
6             Alan Turing
7            Johann Gauss
Name: Name, dtype: object
0       Rosaline Franklin
1          William Gosset
2    Florence Nightingale
3             Marie Curie
4           Rachel Carson
5               John Snow
6             Alan Turing
7            Johann Gauss
Name: Name, dtype: object
                   Name        Born        Died  Age          Occupation
0     Rosaline Franklin  1920-07-25  1958-04-16   37             Chemist
1        William Gosset  1876-06-13  1937-10-16   61        Statistician
2  Florence Nightingale  1820-05-12  1910-08-13   90               Nurse
3           Marie Curie  1867-11-07  1934-07-04   66             Chemist
4         Rachel Carson  1907-05-27  1964-04-14   56           Biologist
5             John Snow  1813-03-15  1858-06-16   45           Physician
6           Alan Turi
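
One reason to choose pickle over a text format is that it stores the pandas object itself, including column types. The following sketch uses a hypothetical one-row DataFrame and a temporary directory (both assumptions, not part of the notebook's data) to compare a pickle round trip with a CSV round trip.

```python
import os
import tempfile
import pandas as pd

# Hypothetical sample with a real datetime column.
df = pd.DataFrame({"Name": ["Marie Curie"], "Born": pd.to_datetime(["1867-11-07"])})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "df.pickle")
    csv_path = os.path.join(tmp, "df.csv")
    df.to_pickle(pkl_path)
    df.to_csv(csv_path, index=False)
    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

# The pickle keeps the datetime type; the CSV reader sees plain text by default.
print(from_pickle.dtypes)
print(from_csv.dtypes)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.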

In [14]:
# Read and write Series and DataFrame objects in CSV format

import pandas as pd
scientists = pd.read_csv('../data/scientists.csv')
print(scientists)

names = scientists['Name']
print(names)

import os
if not os.path.exists('../output/scientists_names_series.csv'):
    names.to_csv('../output/scientists_names_series.csv')
if not os.path.exists('../output/scientists_names_series1.csv'):
    names.to_csv('../output/scientists_names_series1.csv',sep='*')
if not os.path.exists('../output/scientists_names_series2.csv'):
    names.to_csv('../output/scientists_names_series2.csv',index=False)

if not os.path.exists('../output/scientists_df1.csv'):
    scientists.to_csv('../output/scientists_df1.csv')
if not os.path.exists('../output/scientists_df2.csv'):
    scientists.to_csv('../output/scientists_df2.csv',index=False)
    

                   Name        Born        Died  Age          Occupation
0     Rosaline Franklin  1920-07-25  1958-04-16   37             Chemist
1        William Gosset  1876-06-13  1937-10-16   61        Statistician
2  Florence Nightingale  1820-05-12  1910-08-13   90               Nurse
3           Marie Curie  1867-11-07  1934-07-04   66             Chemist
4         Rachel Carson  1907-05-27  1964-04-14   56           Biologist
5             John Snow  1813-03-15  1858-06-16   45           Physician
6           Alan Turing  1912-06-23  1954-06-07   41  Computer Scientist
7          Johann Gauss  1777-04-30  1855-02-23   77       Mathematician
0       Rosaline Franklin
1          William Gosset
2    Florence Nightingale
3             Marie Curie
4           Rachel Carson
5               John Snow
6             Alan Turing
7            Johann Gauss
Name: Name, dtype: object
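
The `sep` and `index` arguments used above change what ends up in the file. The sketch below shows the effect on in-memory text, using a hypothetical two-name Series in place of the scientists data.

```python
import io
import pandas as pd

# Hypothetical stand-in for the Name column used above.
names = pd.Series(["Marie Curie", "Alan Turing"], name="Name")

with_index = names.to_csv()                # index written as an unnamed first column
without_index = names.to_csv(index=False)  # values only
starred = names.to_csv(sep="*")            # custom delimiter

print(with_index)
print(without_index)
print(starred)

# Reading the default output back turns the saved index into an extra column,
# unless index_col=0 tells read_csv to use it as the index again.
naive = pd.read_csv(io.StringIO(with_index))
restored = pd.read_csv(io.StringIO(with_index), index_col=0)
print(list(naive.columns))
print(list(restored.columns))
```

Writing with `index=False` avoids the extra column when the index is just the default row numbers.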


In [15]:
# Read and write Series and DataFrame objects in Excel format
# (writing .xlsx files requires an Excel engine such as openpyxl)

import pandas as pd
scientists = pd.read_csv('../data/scientists.csv')
names = scientists['Name']
print(names)

# pandas 2.0 removed the xlwt-based writer for legacy .xls files, so .xlsx is used here
names.to_excel('../output/scientists_names_series.xlsx')
names_df = names.to_frame()
names_df.to_excel('../output/scientists_names_series1.xlsx')

scientists.to_excel('../output/scientists_df.xlsx', sheet_name='scientists', index=False)


0       Rosaline Franklin
1          William Gosset
2    Florence Nightingale
3             Marie Curie
4           Rachel Carson
5               John Snow
6             Alan Turing
7            Johann Gauss
Name: Name, dtype: object


In [16]:
# Export a Series or DataFrame to other formats (JSON, HTML, SQLite, etc.)

import pandas as pd
scientists = pd.read_csv('../data/scientists.csv')
print(scientists)
print('---------------------------------------------------------------------------------------')

print(scientists.to_dict())
print('---------------------------------------------------------------------------------------')

print(scientists.to_html())
scientists.to_json('../output/scientists.json')



                   Name        Born        Died  Age          Occupation
0     Rosaline Franklin  1920-07-25  1958-04-16   37             Chemist
1        William Gosset  1876-06-13  1937-10-16   61        Statistician
2  Florence Nightingale  1820-05-12  1910-08-13   90               Nurse
3           Marie Curie  1867-11-07  1934-07-04   66             Chemist
4         Rachel Carson  1907-05-27  1964-04-14   56           Biologist
5             John Snow  1813-03-15  1858-06-16   45           Physician
6           Alan Turing  1912-06-23  1954-06-07   41  Computer Scientist
7          Johann Gauss  1777-04-30  1855-02-23   77       Mathematician
---------------------------------------------------------------------------------------
{'Name': {0: 'Rosaline Franklin', 1: 'William Gosset', 2: 'Florence Nightingale', 3: 'Marie Curie', 4: 'Rachel Carson', 5: 'John Snow', 6: 'Alan Turing', 7: 'Johann Gauss'}, 'Born': {0: '1920-07-25', 1: '1876-06-13', 2: '1820-05-12', 3: '1867-11-07', 4: '
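
The cell's comment mentions SQLite, but its code covers only dict, HTML, and JSON. A minimal sketch of the SQLite case follows, assuming an in-memory database and a hypothetical two-row sample; the table name `scientists` is also an assumption.

```python
import sqlite3
import pandas as pd

# Hypothetical two-row sample in place of the full scientists data.
df = pd.DataFrame({"Name": ["Marie Curie", "John Snow"], "Age": [66, 45]})

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
try:
    # Write the DataFrame to a table, then query it back with SQL.
    df.to_sql("scientists", conn, index=False)
    older = pd.read_sql("SELECT Name, Age FROM scientists WHERE Age > 50", conn)
finally:
    conn.close()

print(older)
```

Passing a file path to `sqlite3.connect` instead of `":memory:"` would persist the table to disk.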