## Working With 2D Data
Table-like (two-dimensional) data, first as a NumPy array and then as a pandas DataFrame.

In [1]:
import math
import statistics
import numpy as np
import scipy.stats 
import pandas as pd

In [2]:
a = np.array([[1, 1, 1],
              [2, 3, 1],
              [4, 9, 2],
              [8, 27, 4],
              [16, 1, 1]])
a 

array([[ 1,  1,  1],
       [ 2,  3,  1],
       [ 4,  9,  2],
       [ 8, 27,  4],
       [16,  1,  1]])

In [5]:
print(np.mean(a))
print(a.mean())
print(np.median(a))
print(a.var(ddof=1))

5.4
5.4
2.0
53.40000000000001


### Axis
- **axis=None** calculates the statistic over all the data in the array, returning a single value. This is the NumPy default.
- **axis=0** calculates the statistic down the rows, returning one value for each column of the array.
- **axis=1** calculates the statistic across the columns, returning one value for each row of the array.
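The axis rules above can be checked by looking at the shape of each result. The following is a minimal sketch that reuses the array `a` from this section; the names `col_medians` and `row_medians` are illustrative only.

```python
import numpy as np

a = np.array([[1, 1, 1],
              [2, 3, 1],
              [4, 9, 2],
              [8, 27, 4],
              [16, 1, 1]])

# axis=0 collapses the row dimension: one result per column.
col_medians = np.median(a, axis=0)
# axis=1 collapses the column dimension: one result per row.
row_medians = np.median(a, axis=1)

print(a.shape)            # (5, 3)
print(col_medians.shape)  # (3,)
print(row_medians.shape)  # (5,)
print(col_medians)        # [4. 3. 1.]
print(row_medians)        # [1. 2. 4. 8. 1.]
```

The axis you name is the one that disappears from the result's shape, which is often easier to remember than "rows" versus "columns".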

In [6]:
a.mean(axis=0)  # down the rows: one mean per column

array([6.2, 8.2, 1.8])

In [7]:
a.mean(axis=1)  # across the columns: one mean per row

array([ 1.,  2.,  5., 13.,  6.])

In [9]:
# scipy.stats functions default to axis=0.
# For a statistic over the entire dataset, pass axis=None explicitly:
scipy.stats.gmean(a, axis=None)

2.829705017016332
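As a cross-check on the value above, the geometric mean can also be written as the exponential of the mean of the logarithms. This sketch compares that formula with `scipy.stats.gmean` and also shows the per-column result you get when `axis=None` is left out; the variable names are illustrative.

```python
import numpy as np
import scipy.stats

a = np.array([[1, 1, 1],
              [2, 3, 1],
              [4, 9, 2],
              [8, 27, 4],
              [16, 1, 1]])

# Geometric mean of all 15 values via logarithms.
log_based = np.exp(np.mean(np.log(a)))
library = scipy.stats.gmean(a, axis=None)
print(np.isclose(log_based, library))  # True

# Without axis=None, gmean uses axis=0 and returns one value per column.
per_column = scipy.stats.gmean(a)
print(per_column.shape)  # (3,)
```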

In [10]:
# describe() returns several summary statistics in a single call
scipy.stats.describe(a, axis=None, ddof=1, bias=False)

DescribeResult(nobs=15, minmax=(1, 27), mean=5.4, variance=53.40000000000001, skewness=2.264965290423389, kurtosis=5.212690982795767)

In [11]:
scipy.stats.describe(a, ddof=1, bias=False)  # Default: axis=0

DescribeResult(nobs=5, minmax=(array([1, 1, 1]), array([16, 27,  4])), mean=array([6.2, 8.2, 1.8]), variance=array([ 37.2, 121.2,   1.7]), skewness=array([1.32531471, 1.79809454, 1.71439233]), kurtosis=array([1.30376344, 3.14969121, 2.66435986]))

In [15]:
scipy.stats.describe(a, axis=1, ddof=1, bias=False)

DescribeResult(nobs=3, minmax=(array([1, 1, 2, 4, 1]), array([ 1,  3,  9, 27, 16])), mean=array([ 1.,  2.,  5., 13.,  6.]), variance=array([  0.,   1.,  13., 151.,  75.]), skewness=array([       nan, 0.        , 1.15206964, 1.52787436, 1.73205081]), kurtosis=array([ nan, -1.5, -1.5, -1.5, -1.5]))
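The `nan` entries for the first row come from its zero variance: the row `[1, 1, 1]` has no spread, so any statistic that divides by a power of the variance is undefined. The sketch below uses the plain moment-based skewness formula (third central moment divided by the second central moment to the power 1.5) to show the 0/0 case. This is a simplified illustration, not the exact bias-corrected computation that `scipy.stats.describe` performs with `bias=False`.

```python
import numpy as np

row = np.array([1, 1, 1], dtype=float)  # first row of the array a

dev = row - row.mean()   # deviations from the mean: all zeros
m2 = np.mean(dev ** 2)   # second central moment (biased variance)
m3 = np.mean(dev ** 3)   # third central moment

# Suppress the floating-point warning raised by the 0/0 division.
with np.errstate(invalid="ignore", divide="ignore"):
    skew = m3 / m2 ** 1.5

print(m2, m3, skew)  # 0.0 0.0 nan
```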

In [16]:
result = scipy.stats.describe(a, ddof=1, bias=False)
result.mean

array([6.2, 8.2, 1.8])
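The other fields of the result can be read the same way, by attribute name. A short sketch, assuming the same array `a`; the standard deviation is not a field of the result, so here it is derived from `variance`.

```python
import numpy as np
import scipy.stats

a = np.array([[1, 1, 1],
              [2, 3, 1],
              [4, 9, 2],
              [8, 27, 4],
              [16, 1, 1]])

result = scipy.stats.describe(a, ddof=1, bias=False)

# minmax is a pair of arrays: per-column minima and maxima.
col_min, col_max = result.minmax
print(result.nobs)  # 5
print(col_min)      # [1 1 1]
print(col_max)      # [16 27  4]

# Sample standard deviation per column, derived from the variance field.
col_std = np.sqrt(result.variance)
print(np.allclose(col_std, a.std(axis=0, ddof=1)))  # True
```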

### DataFrames
The class DataFrame is one of the fundamental pandas data types. It is convenient to work with because its rows and columns carry labels. Use the array a to create a DataFrame:

In [18]:
row_names = ['first', 'second', 'third', 'fourth', 'fifth']
col_names = ['A', 'B', 'c']
df = pd.DataFrame(a, index=row_names, columns=col_names)
df

         A   B  c
first    1   1  1
second   2   3  1
third    4   9  2
fourth   8  27  4
fifth   16   1  1
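Because the rows and columns are labeled, individual values can be looked up by name rather than by position. A minimal sketch using `.loc` (label-based) and `.iloc` (position-based) on the same DataFrame:

```python
import numpy as np
import pandas as pd

a = np.array([[1, 1, 1],
              [2, 3, 1],
              [4, 9, 2],
              [8, 27, 4],
              [16, 1, 1]])
row_names = ['first', 'second', 'third', 'fourth', 'fifth']
col_names = ['A', 'B', 'c']
df = pd.DataFrame(a, index=row_names, columns=col_names)

# Label-based lookup: row 'third', column 'B'.
print(df.loc['third', 'B'])  # 9

# Position-based lookup of the same cell (row 2, column 1).
print(df.iloc[2, 1])         # 9

# A whole row by label comes back as a Series indexed by the column names.
fourth_row = df.loc['fourth']
print(fourth_row.tolist())   # [8, 27, 4]
```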


In [19]:
# If you call a pandas statistics method without arguments,
# the DataFrame returns the result for each column:
df.mean()

A    6.2
B    8.2
c    1.8
dtype: float64

In [20]:
df.var()

A     37.2
B    121.2
c      1.7
dtype: float64

In [21]:
# If you want the results for each "row",
# then just specify the parameter axis=1:
df.mean(axis=1)

first      1.0
second     2.0
third      5.0
fourth    13.0
fifth      6.0
dtype: float64

In [22]:
df.var(axis=1)

first       0.0
second      1.0
third      13.0
fourth    151.0
fifth      75.0
dtype: float64
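One detail worth noting when comparing these results with NumPy: `DataFrame.var()` uses `ddof=1` (sample variance) by default, while `ndarray.var()` uses `ddof=0` (population variance). The sketch below shows how to make the two agree in either direction.

```python
import numpy as np
import pandas as pd

a = np.array([[1, 1, 1],
              [2, 3, 1],
              [4, 9, 2],
              [8, 27, 4],
              [16, 1, 1]])
df = pd.DataFrame(a,
                  index=['first', 'second', 'third', 'fourth', 'fifth'],
                  columns=['A', 'B', 'c'])

# pandas default (ddof=1) matches NumPy only when ddof=1 is passed explicitly.
sample_match = np.allclose(df.var().to_numpy(), a.var(axis=0, ddof=1))
print(sample_match)  # True

# NumPy default (ddof=0) matches pandas only when ddof=0 is passed explicitly.
population_match = np.allclose(df.var(ddof=0).to_numpy(), a.var(axis=0))
print(population_match)  # True
```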

In [23]:
# Select a single column by its label; the result is a Series
df['A']

first      1
second     2
third      4
fourth     8
fifth     16
Name: A, dtype: int64

In [24]:
df['A'].mean()

6.2

In [25]:
df['A'].var()

37.20000000000001

In [26]:
# Get the DataFrame's data as a NumPy array (the labels are dropped)
df.values

array([[ 1,  1,  1],
       [ 2,  3,  1],
       [ 4,  9,  2],
       [ 8, 27,  4],
       [16,  1,  1]])

In [27]:
df.to_numpy()

array([[ 1,  1,  1],
       [ 2,  3,  1],
       [ 4,  9,  2],
       [ 8, 27,  4],
       [16,  1,  1]])
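A quick way to confirm that the round trip preserves the data is to compare the result with the original array. The sketch below also passes the optional `dtype` argument of `to_numpy()` to request floats instead of the original integers.

```python
import numpy as np
import pandas as pd

a = np.array([[1, 1, 1],
              [2, 3, 1],
              [4, 9, 2],
              [8, 27, 4],
              [16, 1, 1]])
df = pd.DataFrame(a,
                  index=['first', 'second', 'third', 'fourth', 'fifth'],
                  columns=['A', 'B', 'c'])

arr = df.to_numpy()
print(type(arr).__name__)      # ndarray
print(np.array_equal(arr, a))  # True

# Ask for a specific dtype during the conversion.
arr_float = df.to_numpy(dtype=float)
print(arr_float.dtype)         # float64
```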

In [28]:
# Summary statistics for each column of the DataFrame
df.describe()

              A          B        c
count       5.0        5.0      5.0
mean        6.2        8.2      1.8
std     6.09918  11.009087  1.30384
min         1.0        1.0      1.0
25%         2.0        1.0      1.0
50%         4.0        3.0      1.0
75%         8.0        9.0      2.0
max        16.0       27.0      4.0


- count: the number of items in each column
- mean: the mean of each column
- std: the sample standard deviation of each column (ddof=1)
- min and max: the minimum and maximum values of each column
- 25%, 50%, and 75%: the 25th, 50th (median), and 75th percentiles of each column
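The rows listed above can be reproduced with the individual DataFrame methods, which is a useful check on what `describe()` reports. The sketch below compares the percentile rows with `df.quantile()` and the std row with `df.std()`, then requests different percentiles through the `percentiles` parameter; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

a = np.array([[1, 1, 1],
              [2, 3, 1],
              [4, 9, 2],
              [8, 27, 4],
              [16, 1, 1]])
df = pd.DataFrame(a,
                  index=['first', 'second', 'third', 'fourth', 'fifth'],
                  columns=['A', 'B', 'c'])

summary = df.describe()

# The percentile rows match df.quantile() for the same fractions.
quartiles = df.quantile([0.25, 0.5, 0.75])
same_quartiles = np.allclose(
    quartiles.to_numpy(),
    summary.loc[['25%', '50%', '75%']].to_numpy(),
)
print(same_quartiles)  # True

# The std row matches df.std(), which uses ddof=1 by default.
same_std = np.allclose(summary.loc['std'].to_numpy(), df.std().to_numpy())
print(same_std)  # True

# Request other percentiles, given as fractions between 0 and 1.
custom = df.describe(percentiles=[0.1, 0.9])
print(custom.index.tolist())
```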