# SciPy: a staple of data analysis
Official site: https://scipy.org/

- SciPy is an open-source software ecosystem built on Python
- It mainly serves mathematics, science, and engineering

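- NumPy
   - The foundational package for high-performance scientific computing and data analysis
   - The ndarray object resembles a matrix (but is not one; matrices have their own dedicated class, numpy.matrix)
   - ufunc functions
   - Well suited to scientific computing such as linear algebra and random number generation
- SciPy library
   - Built on NumPy; the core library for scientific computing
   - A collection of toolboxes for common scientific computing problems; its submodules cover different applications such as interpolation, integration, optimization, and image processing
- Matplotlib
   - Also built on NumPy
   - A 2-D plotting library that quickly produces line plots, histograms, scatter plots, and similar figures
   - The commonly used pyplot module provides a simple MATLAB-like interface
- Pandas
   - Built on NumPy
   - Efficient Series and DataFrame data structures
   - A powerful, extensible Python library for data manipulation and analysis
   - Efficient slicing and similar operations on large data sets
   - Optimized routines for reading and writing many file formats, such as CSV and HDF5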

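- Data structures in the SciPy ecosystem
   - ndarray (N-dimensional array)
   - Series (a labeled one-dimensional array that behaves much like a dictionary)
   - DataFrame (a two-dimensional table of data)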

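## NumPy
- Arrays in Python
   - Plain Python represents arrays with data structures such as list and tuple
   - ndarray, the multidimensional array
      - The basic data structure in NumPy
      - All elements have the same type
      - This saves memory and CPU time
      - Comes with a rich set of functions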
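
The "same type" rule above can be checked directly. The following is a minimal sketch (the variable names are illustrative, not from the original notes): a list may mix ints and floats, but converting it to an ndarray upcasts every element to one common dtype, and the data buffer then occupies exactly `size * itemsize` bytes.

```python
import numpy as np

values = [1, 2, 3.5]      # a Python list may mix int and float objects
arr = np.array(values)    # NumPy upcasts all elements to one common dtype

print(arr.dtype)          # float64: the ints were converted to floats
print(arr.itemsize)       # bytes used by each element
print(arr.nbytes)         # total bytes of the data buffer (size * itemsize)
```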

In [6]:
### One-dimensional array (as a list)
aLst = [1,2,3,4]

### Two-dimensional array (as a nested list)
bList = [[1,2,3],[4,5,6]]

### NumPy array
import numpy as np
np.array([1,3,4,5,6])

array([1, 3, 4, 5, 6])

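- N-dimensional arrays
   - Dimensions are called axes; the number of axes is called the rank
   - Operating along axis 0 and axis 1:
      - axis = 0 (column-wise: the operation runs down the rows and yields one result per column)
      - axis = 1 (row-wise: the operation runs across the columns and yields one result per row)
   - Basic attributes
      - ndarray.ndim (rank, i.e. number of axes)
      - ndarray.shape (size along each dimension)
      - ndarray.size (total number of elements)
      - ndarray.dtype (element type)
      - ndarray.itemsize (size of one element in bytes)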
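
As a small illustration of the attributes and the axis convention listed above, here is a sketch on a 2×3 array (the array `m` and the explicit `int64` dtype are choices made for this example only):

```python
import numpy as np

m = np.arange(6, dtype=np.int64).reshape(2, 3)   # [[0 1 2], [3 4 5]]

print(m.ndim)      # number of axes
print(m.shape)     # size along each axis
print(m.size)      # total number of elements
print(m.itemsize)  # bytes per element (8 for int64)

col_max = m.max(axis=0)   # collapse axis 0: one result per column
row_max = m.max(axis=1)   # collapse axis 1: one result per row
print(col_max, row_max)
```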

#### How to create an ndarray

In [1]:
import numpy as np
aArray = np.array([1,2,3])
aArray

array([1, 2, 3])

In [10]:
bArray = np.array([[1,2,3],[4,5,6]], dtype=float)
bArray

array([[1., 2., 3.],
       [4., 5., 6.]])

In [11]:
bArray.ndim, bArray.shape, bArray.dtype

(2, (2, 3), dtype('float64'))

In [12]:
### Various special arrays
np.zeros((2,2))  # 2×2 array of zeros


array([[0., 0.],
       [0., 0.]])

In [14]:
np.ones((2,3))   # 2×3 array of ones

array([[1., 1., 1.],
       [1., 1., 1.]])

In [15]:
np.full((3, 3), np.pi)

array([[3.14159265, 3.14159265, 3.14159265],
       [3.14159265, 3.14159265, 3.14159265],
       [3.14159265, 3.14159265, 3.14159265]])

In [16]:
x = np.array([[1, 2, 3], [4, 5, 6]], dtype = np.float32)
np.ones_like(x)  # array of ones with the same shape and dtype as x

array([[1., 1., 1.],
       [1., 1., 1.]], dtype=float32)

In [17]:
np.identity(3)  # identity array/matrix

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [23]:
np.eye(3, k = 1)  # Index of the diagonal: 0 (the default) refers to the main diagonal,
                  # a positive value refers to an upper diagonal, and a negative value
                  # to a lower diagonal.

array([[0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 0.]])

In [24]:
help(np.diag)

Help on function diag in module numpy.lib.twodim_base:

diag(v, k=0)
    Extract a diagonal or construct a diagonal array.
    
    See the more detailed documentation for ``numpy.diagonal`` if you use this
    function to extract a diagonal and wish to write to the resulting array;
    whether it returns a copy or a view depends on what version of numpy you
    are using.
    
    Parameters
    ----------
    v : array_like
        If `v` is a 2-D array, return a copy of its `k`-th diagonal.
        If `v` is a 1-D array, return a 2-D array with `v` on the `k`-th
        diagonal.
    k : int, optional
        Diagonal in question. The default is 0. Use `k>0` for diagonals
        above the main diagonal, and `k<0` for diagonals below the main
        diagonal.
    
    Returns
    -------
    out : ndarray
        The extracted diagonal or constructed diagonal array.
    
    See Also
    --------
    diagonal : Return specified diagonals.
    diagflat : Create a 2-D array with the 
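
The two behaviors described in the help text (construct from 1-D input, extract from 2-D input) and the meaning of `k` can be seen in a short sketch; the arrays used here are made up for illustration:

```python
import numpy as np

d = np.diag([1, 2, 3])           # 1-D input: build a 3×3 array with it on the main diagonal
m = np.arange(9).reshape(3, 3)   # [[0 1 2], [3 4 5], [6 7 8]]

main = np.diag(m)                # 2-D input: extract the main diagonal
upper = np.diag(m, k=1)          # first diagonal above the main one
lower = np.diag(m, k=-1)         # first diagonal below the main one

print(d)
print(main, upper, lower)
```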

In [25]:
### Create from a range generator
np.arange(1, 5, 0.5)  # arguments: start, stop (excluded), step

array([1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])

In [34]:
np.linspace(1,2,10,endpoint = False)
# endpoint controls whether the right end of the interval (stop) is included

array([1. , 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9])
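
To make the effect of `endpoint` concrete, the sketch below generates five samples on the same interval both ways and uses `retstep=True` to return the step size as well. The choice of five samples is arbitrary.

```python
import numpy as np

# endpoint=True (default): stop is the last sample, step = (stop - start) / (num - 1)
closed, step_closed = np.linspace(1, 2, 5, endpoint=True, retstep=True)

# endpoint=False: stop is excluded, step = (stop - start) / num
half_open, step_open = np.linspace(1, 2, 5, endpoint=False, retstep=True)

print(closed, step_closed)
print(half_open, step_open)
```

Note that the step changes between the two calls even though `start`, `stop`, and `num` are identical.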

In [33]:
help(np.linspace)

Help on function linspace in module numpy.core.function_base:

linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None)
    Return evenly spaced numbers over a specified interval.
    
    Returns `num` evenly spaced samples, calculated over the
    interval [`start`, `stop`].
    
    The endpoint of the interval can optionally be excluded.
    
    Parameters
    ----------
    start : scalar
        The starting value of the sequence.
    stop : scalar
        The end value of the sequence, unless `endpoint` is set to False.
        In that case, the sequence consists of all but the last of ``num + 1``
        evenly spaced samples, so that `stop` is excluded.  Note that the step
        size changes when `endpoint` is False.
    num : int, optional
        Number of samples to generate. Default is 50. Must be non-negative.
    endpoint : bool, optional
        If True, `stop` is the last sample. Otherwise, it is not included.
        Default is True.
    retstep : bo

In [37]:
### Random arrays

np.random.randn(3,3)   # samples from the standard normal distribution

array([[-0.56336627, -0.61448325,  0.81802523],
       [ 0.46630176, -0.39501427, -0.0312137 ],
       [ 1.52600126,  1.10886144,  0.10431984]])

In [39]:
np.random.random((3,4))  # random floats in the half-open interval [0.0, 1.0)

array([[0.00986366, 0.31943979, 0.62847388, 0.5794316 ],
       [0.780894  , 0.6214369 , 0.83350299, 0.97382051],
       [0.03901241, 0.93404439, 0.55422824, 0.87916022]])

In [43]:
np.random.normal(5,5,100)  # 100 samples from a normal distribution with mean 5 and standard deviation 5
# normal(loc=0.0, scale=1.0, size=None)
# loc : float or array_like of floats
        # Mean ("centre") of the distribution.
# scale : float or array_like of floats
        # Standard deviation (spread or "width") of the distribution.
#size : int or tuple of ints, optional
        # Output shape.  If the given shape is, e.g., ``(m, n, k)``, then
        # ``m * n * k`` samples are drawn.  If size is ``None`` (default),
        # a single value is returned if ``loc`` and ``scale`` are both scalars.
        # Otherwise, ``np.broadcast(loc, scale).size`` samples are drawn.

array([-3.51601322,  2.52702727,  3.64325159,  9.74511613,  7.84183103,
        7.38866364,  4.55829174,  7.51133724,  5.49475115,  9.9692057 ,
        7.52160454, 11.49660118, 10.06336819,  7.33276772, -7.55077129,
        5.08130529,  2.2797553 , 14.65870148,  0.41405662, -3.60821538,
        5.55635512,  7.66402628, 11.33021145,  6.4865978 ,  6.55877911,
        1.88492981, 10.06244849,  7.41336802,  4.06667991, 11.03781615,
       11.88212474,  3.59358349,  6.67134513,  9.09302833, 10.36489002,
        6.02483537,  6.17629934,  9.09719087,  8.64589446,  3.62077628,
        7.56977507,  2.54557433,  0.94704803,  4.05470126,  6.19682817,
        7.05314661,  9.7360808 ,  6.77627134, -4.13331926,  8.59952012,
       -7.7410454 ,  7.57244255,  3.47317596,  8.81485505,  2.24911508,
        9.51690709, -4.14153874, 15.39973322,  9.23415478,  4.46732843,
        1.48110331,  3.99696347,  6.53097717,  4.62772361, 11.90108167,
        7.8984473 , 11.89235031,  6.07083556,  7.52119596,  7.34
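
Random output differs on every run, which makes notes like these hard to reproduce. One option, sketched here, is NumPy's newer `Generator` interface (`np.random.default_rng`, available since NumPy 1.17) with a fixed seed. The seed 42 and the sample size are arbitrary choices for this example. It also confirms that the second argument of `normal` is the standard deviation, not the variance.

```python
import numpy as np

rng = np.random.default_rng(seed=42)          # arbitrary seed, for reproducibility only
samples = rng.normal(loc=5, scale=5, size=10_000)

print(samples.mean())   # close to loc = 5
print(samples.std())    # close to scale = 5 (a standard deviation)

# A generator created with the same seed reproduces the same sequence
again = np.random.default_rng(seed=42).normal(loc=5, scale=5, size=10_000)
print(np.array_equal(samples, again))
```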

In [48]:
### Create from a function
np.fromfunction(lambda i,j: (i+1)*(j+1), (9,9))

array([[ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.],
       [ 2.,  4.,  6.,  8., 10., 12., 14., 16., 18.],
       [ 3.,  6.,  9., 12., 15., 18., 21., 24., 27.],
       [ 4.,  8., 12., 16., 20., 24., 28., 32., 36.],
       [ 5., 10., 15., 20., 25., 30., 35., 40., 45.],
       [ 6., 12., 18., 24., 30., 36., 42., 48., 54.],
       [ 7., 14., 21., 28., 35., 42., 49., 56., 63.],
       [ 8., 16., 24., 32., 40., 48., 56., 64., 72.],
       [ 9., 18., 27., 36., 45., 54., 63., 72., 81.]])
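
`np.fromfunction` calls the function once with whole index arrays rather than once per element, so the multiplication table above is really a broadcasted product. A sketch of two equivalent formulations (the names `by_broadcast` and `by_outer` are illustrative):

```python
import numpy as np

table = np.fromfunction(lambda i, j: (i + 1) * (j + 1), (9, 9))

n = np.arange(1, 10)
by_broadcast = n[:, np.newaxis] * n    # shapes (9, 1) and (9,) broadcast to (9, 9)
by_outer = np.multiply.outer(n, n)     # the same table via the ufunc's outer method

print(np.array_equal(table, by_broadcast))
print(np.array_equal(table, by_outer))
```

The values match, although `fromfunction` produces floats by default while the other two keep the integer dtype of `n`.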

#### NumPy operations and arithmetic

In [56]:
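### Slicing --- [row selection, column selection]
aArray = np.array([(1, 2, 3), (4, 5, 6)])
print('** the second row')
print(aArray[1])
print('** a range of rows')
print(aArray[0:2])
print('** selected columns of every row')
print(aArray[:, [0,2]])
print('** a range of elements in the second row')
print(aArray[1,0:2])

** the second row
[4 5 6]
** a range of rows
[[1 2 3]
 [4 5 6]]
** selected columns of every row
[[1 3]
 [4 6]]
** a range of elements in the second row
[4 5]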
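
The cell above mixes two kinds of indexing that behave differently when written to. As a sketch: basic slicing (`0:2`) returns a view onto the original data, while indexing with a list of integers (`[0, 2]`) returns a copy. The values 100 and -7 are arbitrary markers.

```python
import numpy as np

a = np.array([(1, 2, 3), (4, 5, 6)])

view = a[:, 0:2]       # basic slicing: a view that shares memory with a
copy = a[:, [0, 2]]    # integer-list indexing: an independent copy

view[0, 0] = 100       # visible in a
copy[0, 1] = -7        # not visible in a

print(a)
print(np.shares_memory(a, view), np.shares_memory(a, copy))
```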


In [3]:
### Boolean indexing / conditional selection

aArray = np.arange(0,101)
bArray = aArray[aArray <= 50]
bArray

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50])

In [5]:
aArray[(aArray % 2 == 0) & (aArray>50)]

array([ 52,  54,  56,  58,  60,  62,  64,  66,  68,  70,  72,  74,  76,
        78,  80,  82,  84,  86,  88,  90,  92,  94,  96,  98, 100])

In [7]:
aArray[(aArray % 2 == 0)] = -1
aArray

array([-1,  1, -1,  3, -1,  5, -1,  7, -1,  9, -1, 11, -1, 13, -1, 15, -1,
       17, -1, 19, -1, 21, -1, 23, -1, 25, -1, 27, -1, 29, -1, 31, -1, 33,
       -1, 35, -1, 37, -1, 39, -1, 41, -1, 43, -1, 45, -1, 47, -1, 49, -1,
       51, -1, 53, -1, 55, -1, 57, -1, 59, -1, 61, -1, 63, -1, 65, -1, 67,
       -1, 69, -1, 71, -1, 73, -1, 75, -1, 77, -1, 79, -1, 81, -1, 83, -1,
       85, -1, 87, -1, 89, -1, 91, -1, 93, -1, 95, -1, 97, -1, 99, -1])

In [8]:
aArray = np.arange(1,101)
cArray = np.where(aArray % 2 == 0, -1, aArray)  # replace by condition: -1 where even, otherwise the original value
cArray

array([ 1, -1,  3, -1,  5, -1,  7, -1,  9, -1, 11, -1, 13, -1, 15, -1, 17,
       -1, 19, -1, 21, -1, 23, -1, 25, -1, 27, -1, 29, -1, 31, -1, 33, -1,
       35, -1, 37, -1, 39, -1, 41, -1, 43, -1, 45, -1, 47, -1, 49, -1, 51,
       -1, 53, -1, 55, -1, 57, -1, 59, -1, 61, -1, 63, -1, 65, -1, 67, -1,
       69, -1, 71, -1, 73, -1, 75, -1, 77, -1, 79, -1, 81, -1, 83, -1, 85,
       -1, 87, -1, 89, -1, 91, -1, 93, -1, 95, -1, 97, -1, 99, -1])

In [12]:
### Changing the shape of an array
aArray = np.array([(1,2,3),(4,5,6)])
aArray.shape

(2, 3)

In [13]:
bArray = aArray.reshape(3,2)   # returns a 3-row, 2-column array; aArray keeps its shape
aArray.resize(3,2)             # changes aArray in place and returns None
bArray

array([[1, 2],
       [3, 4],
       [5, 6]])
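
Two details of `reshape` are easy to miss: a dimension given as `-1` is inferred from the total size, and the result is usually a view on the same data rather than a copy. A minimal sketch with made-up values:

```python
import numpy as np

a = np.arange(6)          # [0 1 2 3 4 5]
b = a.reshape(2, -1)      # -1 is inferred as 3, so b has shape (2, 3)

b[0, 0] = 99              # b is a view here, so a[0] changes too
c = b.copy()              # an explicit copy is independent of a
c[0, 1] = -1

print(b.shape)
print(a)
print(np.shares_memory(a, b), np.shares_memory(a, c))
```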

In [16]:
### Arithmetic on ndarrays (element-wise)
a = np.array([(5,5,5), (5,5,5)])
b = np.array([(2,2,2), (2,2,2)])
c = a*b
c
a += b
a

array([[7, 7, 7],
       [7, 7, 7]])

In [18]:
### Broadcasting: the smaller array is broadcast to the size of the larger one so that their shapes are compatible
a = np.array([1,2,3])
b = np.array([(1,2,3),(5,6,7)])
a + b

array([[ 2,  4,  6],
       [ 6,  8, 10]])
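
Broadcasting compares shapes from the trailing dimension backwards; two dimensions are compatible when they are equal or when one of them is 1. The sketch below shows one compatible and one incompatible pair (the arrays are invented for this example):

```python
import numpy as np

col = np.array([[10], [20]])     # shape (2, 1)
row = np.array([1, 2, 3])        # shape (3,)
result = col + row               # (2, 1) with (3,) broadcasts to (2, 3)
print(result)

try:
    np.array([1, 2, 3]) + np.array([1, 2])   # (3,) vs (2,): neither equal nor 1
    incompatible = False
except ValueError as err:
    incompatible = True
    print(err)
```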

#### Simple statistics with NumPy

In [24]:
a = np.array([[3,2,1], [4,5,6]])
a.sum()   # sum of all elements

21

In [21]:
a.sum(axis=0)   # sum along axis 0: adds the two rows, giving one sum per column

array([5, 7, 9])

In [26]:
a.min()   # minimum value

1

In [25]:
a.argmin() # index of the minimum value in the flattened array

2

In [27]:
a.var() # variance (population variance, ddof=0 by default)

2.9166666666666665

In [28]:
a.std() # standard deviation

1.707825127659933

In [29]:
a.cumsum()  # cumulative sum over the flattened array

array([ 3,  5,  6, 10, 15, 21], dtype=int32)

In [30]:
a.cumprod() # cumulative product over the flattened array

array([  3,   6,   6,  24, 120, 720], dtype=int32)
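
Most of the statistics above also accept an `axis` argument, and a few defaults are worth spelling out. The sketch below reuses the same array `a`; `keepdims`, `ddof`, and `np.unravel_index` are standard NumPy features that the original cells do not use.

```python
import numpy as np

a = np.array([[3, 2, 1], [4, 5, 6]])

row_means = a.mean(axis=1)                       # one mean per row
centered = a - a.mean(axis=1, keepdims=True)     # keepdims keeps shape (2, 1) so it broadcasts
pos = np.unravel_index(a.argmin(), a.shape)      # flat index 2 becomes (row, column)
running = a.cumsum(axis=1)                       # cumulative sum within each row
sample_var = a.var(ddof=1)                       # sample variance (divides by n - 1)

print(row_means, pos, sample_var)
print(centered)
print(running)
```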

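#### ufuncs
- A ufunc (universal function) is a function that operates on every element of an array. Many of NumPy's built-in ufuncs are implemented in C, so they run very fast, which is a big advantage on large data sets
- Examples of ufuncs: add, ceil, conj ... Other frequently used NumPy functions, which are ordinary functions rather than ufuncs: all, any, arange, apply_along_axis, argmax, argmin, argsort, average, bincount, corrcoef, cov, cross ...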
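
Whether a given NumPy function is a ufunc can be checked with `isinstance(f, np.ufunc)`. True ufuncs also carry methods such as `reduce`, `accumulate`, and `outer`. A short sketch (the array `x` is made up for illustration):

```python
import numpy as np

x = np.array([1, 2, 3, 4])

print(isinstance(np.add, np.ufunc))      # np.add is a ufunc
print(isinstance(np.arange, np.ufunc))   # np.arange is an ordinary function

total = np.add.reduce(x)           # same result as x.sum()
running = np.add.accumulate(x)     # same result as x.cumsum()
table = np.multiply.outer(x, x)    # all pairwise products, shape (4, 4)

print(total, running)
print(table)
```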

In [31]:
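# Example: compare an element-wise Python loop with a vectorized ufunc call
import time
import numpy as np
import math

# Loop version: math.sin and math.pow are applied to one element at a time
x = np.arange(0,100,0.01)
t1 = time.process_time()
for i, t in enumerate(x):
    x[i] = math.pow((math.sin(t)), 2)

t2 = time.process_time()
print(t2-t1)

# Vectorized version: np.sin and np.power operate on the whole array at once
y = np.arange(0,100,0.01)
t1 = time.process_time()
y = np.power(np.sin(y), 2)
t2 = time.process_time()
print(t2-t1)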

0.015625
0.0
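
The `0.0` above probably reflects the coarse resolution of `time.process_time` on the machine used, not a truly zero cost. A variant of the same comparison, sketched here with `time.perf_counter` (a higher-resolution clock from the standard library), also checks that both approaches give the same numbers. Actual timings depend on the machine, so none are quoted.

```python
import math
import time
import numpy as np

x = np.arange(0, 100, 0.01)

# Pure-Python loop: one math.sin call per element
start = time.perf_counter()
loop_result = np.array([math.sin(t) ** 2 for t in x])
loop_seconds = time.perf_counter() - start

# Vectorized: a single ufunc call over the whole array
start = time.perf_counter()
vec_result = np.sin(x) ** 2
vec_seconds = time.perf_counter() - start

print(np.allclose(loop_result, vec_result))   # both approaches agree
print(loop_seconds, vec_seconds)
```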


#### Applications of ndarray in linear algebra
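
The notes leave this section empty. As a placeholder, here is a minimal sketch of what ndarray-based linear algebra typically looks like with `np.linalg`. The 2×2 system is made up for illustration and is not from the original notes.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

x = np.linalg.solve(A, b)     # solve the linear system A x = b
det = np.linalg.det(A)        # determinant: 2*3 - 1*1 = 5
A_inv = np.linalg.inv(A)      # matrix inverse
product = A @ A_inv           # @ is matrix multiplication; * would be element-wise

print(x)
print(det)
print(product)
```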

## Pandas
**GitHub pandas exercises: https://github.com/Aycrazy/pandas_excercises** (highly recommended)

- Pandas is built on NumPy