# Chapter 03: Organizing Two-Dimensional Data

In [1]:
import numpy as np
import pandas as pd

%precision 3
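# Use the fully qualified option name; the bare 'precision' key is ambiguous in recent pandas
pd.set_option('display.precision', 3)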

In [2]:
path = '/Users/yanaichiharu/c_data/Learning_Math/python_stat_sample-master/'
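# path already ends with '/', so no leading slash is needed; 生徒番号 (student number) becomes the index
df = pd.read_csv(path + 'data/ch2_scores_em.csv', index_col='生徒番号')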

In [3]:
# English (英語) and mathematics (数学) scores of the first 10 students
en_scores = np.array(df['英語'])[:10]
ma_scores = np.array(df['数学'])[:10]

In [4]:
scores_df = pd.DataFrame({'英語': en_scores,
                         '数学': ma_scores},
                         index=pd.Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'], name='生徒'))

scores_df

| 生徒 | 英語 | 数学 |
|---|---|---|
| A | 42 | 65 |
| B | 69 | 80 |
| C | 56 | 63 |
| D | 41 | 63 |
| E | 57 | 76 |
| F | 48 | 60 |
| G | 65 | 81 |
| H | 49 | 66 |
| I | 65 | 78 |
| J | 58 | 82 |


## 3.1 | Measures of the Relationship Between Two Variables

### 3.1.1 Covariance

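A positive covariance indicates a positive correlation between the two variables, and a negative covariance indicates a negative correlation. A covariance close to 0 indicates that they are uncorrelated, that is, they have no linear relationship.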
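The three cases can be illustrated with a minimal sketch. The toy arrays and the helper `covariance` below are hypothetical and are not taken from the score data; the helper simply averages the products of deviations.

```python
import numpy as np

def covariance(x, y):
    """Population covariance: the mean of the products of deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((x - x.mean()) * (y - y.mean()))

# Hypothetical toy data (not from the score table)
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # rises as x rises
y_down = [10, 8, 6, 4, 2]  # falls as x rises
y_flat = [3, 1, 5, 1, 3]   # no linear trend in x

cov_up = covariance(x, y_up)      # positive
cov_down = covariance(x, y_down)  # negative
cov_flat = covariance(x, y_flat)  # zero up to floating-point error

print(cov_up, cov_down, cov_flat)
```

The sign tells the direction of the relationship; the magnitude depends on the units of the data, so it is not directly comparable across data sets.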

In [5]:
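summary_df = scores_df.copy()

# 英語の偏差: deviation of each English score from the English mean
summary_df['英語の偏差'] = \
    summary_df['英語'] - summary_df['英語'].mean()

# 数学の偏差: deviation of each mathematics score from the mathematics mean
summary_df['数学の偏差'] = \
    summary_df['数学'] - summary_df['数学'].mean()

# 偏差同士の積: product of the two deviations for each student
summary_df['偏差同士の積'] = \
    summary_df['英語の偏差'] * summary_df['数学の偏差']

summary_df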

| 生徒 | 英語 | 数学 | 英語の偏差 | 数学の偏差 | 偏差同士の積 |
|---|---|---|---|---|---|
| A | 42 | 65 | -13.0 | -6.4 | 83.2 |
| B | 69 | 80 | 14.0 | 8.6 | 120.4 |
| C | 56 | 63 | 1.0 | -8.4 | -8.4 |
| D | 41 | 63 | -14.0 | -8.4 | 117.6 |
| E | 57 | 76 | 2.0 | 4.6 | 9.2 |
| F | 48 | 60 | -7.0 | -11.4 | 79.8 |
| G | 65 | 81 | 10.0 | 9.6 | 96.0 |
| H | 49 | 66 | -6.0 | -5.4 | 32.4 |
| I | 65 | 78 | 10.0 | 6.6 | 66.0 |
| J | 58 | 82 | 3.0 | 10.6 | 31.8 |


In [6]:
summary_df['偏差同士の積'].mean()

62.8

Let $S_{xy}$ denote the covariance of $x$ and $y$. Then

$S_{xy} = \frac{1}{n} \sum_{i = 1}^{n} (x_{i} - \bar{x})(y_{i} - \bar{y})$

In [7]:
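# Compute the covariance matrix (ddof=0 divides by n)

cov_mat = np.cov(en_scores, ma_scores, ddof=0)

# The [0, 1] and [1, 0] entries are the covariance;
# the diagonal entries are the variances of each variable
cov_mat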

array([[86.  , 62.8 ],
       [62.8 , 68.44]])

In [8]:
print(cov_mat[0, 1], cov_mat[1, 0])
print(cov_mat[0, 0], cov_mat[1, 1])
print(np.var(en_scores, ddof=0), np.var(ma_scores, ddof=0))

62.800000000000004 62.800000000000004
86.0 68.44000000000001
86.0 68.44000000000001
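The cells above pass `ddof=0` explicitly. As a hedged illustration of why that matters, the sketch below re-enters the ten scores from the table and compares the result with `np.cov`'s default, which divides by `n - 1` instead of `n`. The names `cov_population` and `cov_sample` are introduced here for illustration only.

```python
import numpy as np

# The ten scores shown in the table above
en_scores = np.array([42, 69, 56, 41, 57, 48, 65, 49, 65, 58])
ma_scores = np.array([65, 80, 63, 63, 76, 60, 81, 66, 78, 82])
n = len(en_scores)

cov_population = np.cov(en_scores, ma_scores, ddof=0)[0, 1]  # divides by n
cov_sample = np.cov(en_scores, ma_scores)[0, 1]              # default: divides by n - 1

print(cov_population, cov_sample)

# The two values differ only by the factor n / (n - 1)
print(np.isclose(cov_sample, cov_population * n / (n - 1)))
```

Note that `np.var` and `np.std` default to `ddof=0`, whereas pandas' `Series.var` and `DataFrame.cov` default to `ddof=1`, so it is worth stating `ddof` explicitly when mixing the two libraries.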


### 3.1.2 Correlation Coefficient

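The covariance carries the product of the units of the two variables, so its magnitude depends on how the data are measured. Dividing it by the standard deviation of each variable yields a measure that does not depend on units.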

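Let $r_{xy}$ denote the correlation coefficient, and let $S_{x}$ and $S_{y}$ be the standard deviations of $x$ and $y$. Then

$\begin{eqnarray}
r_{xy} &=& \frac{S_{xy}}{S_{x} S_{y}}\\
       &=& \frac{1}{n} \sum_{i = 1}^{n} \left(\frac{x_{i} - \bar{x}}{S_{x}}\right) \left(\frac{y_{i} - \bar{y}}{S_{y}}\right)
\end{eqnarray}$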
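The second form of the formula says that the correlation coefficient is the mean of the products of standardized scores. A minimal sketch of that reading, re-entering the ten scores from the table; the names `z_en`, `z_ma`, `r_from_z`, and `r_numpy` are introduced here for illustration only.

```python
import numpy as np

# The ten scores shown in the table above
en_scores = np.array([42, 69, 56, 41, 57, 48, 65, 49, 65, 58])
ma_scores = np.array([65, 80, 63, 63, 76, 60, 81, 66, 78, 82])

# Standardize each variable; ndarray.std() uses ddof=0 by default,
# which matches the 1/n in the formula
z_en = (en_scores - en_scores.mean()) / en_scores.std()
z_ma = (ma_scores - ma_scores.mean()) / ma_scores.std()

# Mean of the products of the standardized scores
r_from_z = np.mean(z_en * z_ma)

# Reference value from NumPy for comparison
r_numpy = np.corrcoef(en_scores, ma_scores)[0, 1]

print(r_from_z, r_numpy)
```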

In [14]:
# np.std defaults to ddof=0, consistent with the ddof=0 covariance
S_xy = np.cov(en_scores, ma_scores, ddof=0)[0, 1]
S_x = np.std(en_scores)
S_y = np.std(ma_scores)

S_xy / (S_x * S_y)

0.8185692341186713

In [15]:
# Correlation matrix
np.corrcoef(en_scores, ma_scores)

array([[1.   , 0.819],
       [0.819, 1.   ]])

In [16]:
scores_df.corr()

|  | 英語 | 数学 |
|---|---|---|
| 英語 | 1.0 | 0.819 |
| 数学 | 0.819 | 1.0 |
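The unit argument that motivates the correlation coefficient can be checked numerically. The sketch below re-enters the ten scores from the table and rescales the English scores by a factor of 10. The rescaling is a hypothetical change of units and `cov_and_corr` is a helper introduced here; neither is part of the original analysis.

```python
import numpy as np

# The ten scores shown in the table above
en_scores = np.array([42, 69, 56, 41, 57, 48, 65, 49, 65, 58])
ma_scores = np.array([65, 80, 63, 63, 76, 60, 81, 66, 78, 82])

def cov_and_corr(x, y):
    """Return the population covariance and the correlation coefficient."""
    cov = np.cov(x, y, ddof=0)[0, 1]
    corr = cov / (np.std(x) * np.std(y))
    return cov, corr

cov_orig, corr_orig = cov_and_corr(en_scores, ma_scores)

# Hypothetical change of units: English scores multiplied by 10
cov_scaled, corr_scaled = cov_and_corr(en_scores * 10, ma_scores)

print(cov_orig, cov_scaled)    # the covariance scales with the units
print(corr_orig, corr_scaled)  # the correlation coefficient does not
```

The covariance grows by the same factor as the data, while the correlation coefficient is unchanged, which is why the latter is easier to compare across data sets.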
