# Predicting Home Runs with pybaseball Data

---

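## 1. Introduction

This notebook uses Statcast data retrieved with pybaseball to build a machine learning model that predicts whether a batted ball is a home run.

The model is a logistic regression classifier.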


## 2. Purpose of This Notebook

This notebook runs and verifies the following three steps:

- Training the model
- Evaluating the trained model
- Saving the trained model

---

## 3. Processing Flow

1. Load the dataset
2. Inspect the dataset
3. Train the logistic regression model
4. Evaluate the trained model
5. Save the trained model
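
The five steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: a synthetic dataset from scikit-learn's `make_classification` stands in for the Statcast data, the column names `f0` to `f5` are placeholders, and the file name `model.pkl` is hypothetical.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Load the dataset (assumption: synthetic data stands in for Statcast)
features, labels = make_classification(n_samples=500, n_features=6, random_state=0)
df = pd.DataFrame(features, columns=[f"f{i}" for i in range(6)])
df["result"] = labels

# 2. Inspect the dataset
print(df.head())
print(df["result"].value_counts())

# 3. Train the logistic regression model
x = df.drop(columns=["result"]).values
y = df["result"].values
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)
model = LogisticRegression()
model.fit(x_train, y_train)

# 4. Evaluate the trained model
test_accuracy = accuracy_score(y_test, model.predict(x_test))
print("accuracy =", test_accuracy)

# 5. Save the trained model, then load it back to confirm the round trip
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")  # hypothetical file name
with open(model_path, "wb") as f:
    pickle.dump(model, f)
with open(model_path, "rb") as f:
    restored = pickle.load(f)
```

A pickled model should be loaded with a compatible scikit-learn version, and pickle files should only be loaded from trusted sources.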

<br>

---

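## 4. Processing Details

### 4.1. Importing the Required Libraries

Import all the libraries needed for training in one place.

The table below lists each library and what it is used for.


|Name|Purpose|
|---|:--|
|statcast (pybaseball)|Retrieving Statcast data|
|pandas|Working with DataFrames|
|train_test_split|Splitting data into training and test sets|
|LogisticRegression|Logistic regression implementation|
|confusion_matrix|Confusion matrix|
|accuracy_score|Accuracy|
|precision_score|Precision|
|recall_score|Recall|
|f1_score|F1 score|
|pickle|Saving the model|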

In [170]:
import pickle  # listed in the table above; used for saving the model

import pandas as pd
from pybaseball import statcast
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split

### 4.2. Downloading the Dataset

- Reference: https://github.com/jldbc/pybaseball

In [172]:
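# Fetch Statcast data for 2022-08-01 through 2022-09-30.
# This is a large query; the library output recommends calling pybaseball.cache.enable() first.
data = statcast(start_dt='2022-08-01', end_dt='2022-09-30')
data.head()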

This is a large query, it may take a moment to complete


That's a nice request you got there. It'd be a shame if something were to happen to it.
We strongly recommend that you enable caching before running this. It's as simple as `pybaseball.cache.enable()`.
Since the Statcast requests can take a *really* long time to run, if something were to happen, like: a disconnect;
gremlins; computer repair by associates of Rudy Giuliani; electromagnetic interference from metal trash cans; etc.;
you could lose a lot of progress. Enabling caching will allow you to immediately recover all the successful
subqueries if that happens.
100%|██████████████████████████████████████████████████████████████████████████████████| 61/61 [01:24<00:00,  1.39s/it]


| |pitch_type|game_date|release_speed|release_pos_x|release_pos_z|player_name|batter|pitcher|events|description|...|fld_score|post_away_score|post_home_score|post_bat_score|post_fld_score|if_fielding_alignment|of_fielding_alignment|spin_axis|delta_home_win_exp|delta_run_exp|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|3001|CU|2022-09-30|74.9|-2.62|4.7|Herget, Jimmy|669701|623474|strikeout|swinging_strike|...|4|1|4|1|4|Standard|Standard|79|0.004|-0.07|
|3112|SL|2022-09-30|84.9|-2.37|5.02|Herget, Jimmy|669701|623474| |called_strike|...|4|1|4|1|4|Infield shift|Standard|198|0.0|-0.023|
|3127|CH|2022-09-30|86.8|-2.18|4.99|Herget, Jimmy|669701|623474| |ball|...|4|1|4|1|4|Infield shift|Standard|268|0.0|0.011|
|3225|SL|2022-09-30|85.0|-2.35|5.12|Herget, Jimmy|669701|623474| |called_strike|...|4|1|4|1|4|Infield shift|Standard|210|0.0|-0.017|
|3362|CU|2022-09-30|73.2|-2.58|3.99|Herget, Jimmy|673962|623474|field_out|hit_into_play|...|4|1|4|1|4|Standard|Standard|90|0.009|-0.112|


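### 4.3. Selecting the Explanatory Variables

|Variable|Description|
|:-:|:-:|
|hit_distance_sc|Batted-ball distance|
|launch_speed|Speed of the ball off the bat (mph)|
|launch_angle|Angle of the ball off the bat (degrees)|
|hc_x|Batted-ball x coordinate|
|hc_y|Batted-ball y coordinate|
|zone|Zone location of the pitch (1 to 9 are inside the strike zone)|
|result|Target label: 1 for a home run, 0 otherwise|

This analysis uses the first six variables as explanatory variables. `result` is the target variable, derived from the `events` column in section 4.4.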

In [161]:
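# Keep the six candidate features plus the raw outcome column `events`
df = data[['hit_distance_sc','launch_speed','launch_angle','hc_x','hc_y','zone','events']]
# Drop rows with a missing value in any of these columns;
# take an explicit copy so that adding columns later does not raise SettingWithCopyWarning
df = df.dropna().copy()
df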

| |hit_distance_sc|launch_speed|launch_angle|hc_x|hc_y|zone|events|
|---|---|---|---|---|---|---|---|
|3362|338|99.7|43|34.69|95.04|4|field_out|
|1641|17|99.2|-9|131.41|142.15|3|force_out|
|1760|135|94.1|7|119.31|149.41|2|force_out|
|2183|20|25.0|5|128.98|191.61|5|sac_bunt|
|2191|293|88.4|20|164.84|87.9|6|single|
|...|...|...|...|...|...|...|...|
|2559|131|101.7|7|89.72|103.85|1|single|
|1638|2|28.3|-47|116.3|186.94|8|field_out|
|2216|154|84.4|12|82.19|156.52|7|field_out|
|2474|64|97.2|2|184.93|110.52|1|single|


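### 4.4. Data Processing

Convert the string-valued `events` column into a binary `result` label: 1 for a home run and 0 for everything else.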

In [162]:
# 1 where the event is a home run, 0 for every other event
df['result'] = (df['events'] == 'home_run').astype(int)

# The raw event string is no longer needed once the label exists
df = df.drop(['events'], axis=1)
df

| |hit_distance_sc|launch_speed|launch_angle|hc_x|hc_y|zone|result|
|---|---|---|---|---|---|---|---|
|3362|338|99.7|43|34.69|95.04|4|0|
|1641|17|99.2|-9|131.41|142.15|3|0|
|1760|135|94.1|7|119.31|149.41|2|0|
|2183|20|25.0|5|128.98|191.61|5|0|
|2191|293|88.4|20|164.84|87.9|6|0|
|...|...|...|...|...|...|...|...|
|2559|131|101.7|7|89.72|103.85|1|0|
|1638|2|28.3|-47|116.3|186.94|8|0|
|2216|154|84.4|12|82.19|156.52|7|0|
|2474|64|97.2|2|184.93|110.52|1|0|
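
The label encoding in this section can be checked on a small example. The sketch below uses a made-up five-row `events` column (not Statcast data) and shows two equivalent ways to build the 0/1 label, followed by a quick class-balance check. All variable names are illustrative.

```python
import pandas as pd

# Toy stand-in for the `events` column (values made up for illustration)
toy = pd.DataFrame(
    {"events": ["single", "home_run", "field_out", "home_run", "double"]}
)

# Variant 1: collapse to two strings, then map them to integers
collapsed = toy["events"].where(toy["events"] == "home_run", "other")
label_via_map = collapsed.map({"home_run": 1, "other": 0})

# Variant 2: compare directly and cast the boolean result to int
label_via_compare = (toy["events"] == "home_run").astype(int)

# Class balance: the share of rows labelled as home runs
home_run_share = label_via_compare.mean()
print(label_via_compare.value_counts())
print("home run share =", home_run_share)
```

Checking the share of positive labels before training is useful because it sets the context for reading accuracy later on.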


### 4.5. Splitting the Data

Separate the data into the target variable y and the explanatory variables x.

|Variable|Data used|
|:-:|:-:|
|x|hit_distance_sc, launch_speed, launch_angle, hc_x, hc_y, zone|
|y|result|

In [163]:
y = df['result'].values
x = df.drop(labels=['result'], axis=1).values

Split the data into training and test sets at a 7:3 ratio.

In [164]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
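
Home runs make up only a small fraction of the rows, so a plain random split can leave the two sets with slightly different home-run rates. One option, which this notebook does not use, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both sets. A minimal sketch on synthetic labels follows; the 5% positive rate and the `_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: 1000 rows, 5% positives (assumed for illustration)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(1000, 6))
y_demo = np.array([1] * 50 + [0] * 950)

# stratify=y_demo preserves the positive rate in both splits
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```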

### 4.6. Implementing Logistic Regression

In [171]:
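# Fit a logistic regression classifier with scikit-learn's default settings
model = LogisticRegression()
model.fit(x_train, y_train)

# Predict labels (0 or 1) for the held-out test set
Y_pred = model.predict(x_test)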
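
The features here sit on very different scales (hit distances in the hundreds next to zone values below ten), and the default `LogisticRegression` solver can be slow to converge on unscaled inputs. A common variation, which is an assumption here rather than what this notebook does, is to standardize the features inside a pipeline. The sketch below uses synthetic data from `make_classification`. It also shows `predict_proba`, which returns class probabilities in addition to the hard labels from `predict`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real features (assumption)
x_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, weights=[0.9, 0.1], random_state=0
)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=0
)

# Standardize each feature, then fit logistic regression on the scaled values
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(x_tr, y_tr)

labels = scaled_model.predict(x_te)       # hard 0/1 predictions
probs = scaled_model.predict_proba(x_te)  # one column per class, rows sum to 1
print(labels[:5])
print(probs[:5])
```

Because the scaler is part of the pipeline, it is fitted on the training data only and then reused unchanged at prediction time.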

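## 5. Results and Evaluation

### Model Performance Evaluation

Evaluate the model with the confusion matrix, accuracy, precision, recall, and F1 score.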

In [168]:
print(confusion_matrix(y_true=y_test, 
                       y_pred=Y_pred))

print('accuracy = ', accuracy_score(y_true=y_test, y_pred=Y_pred))
print('precision = ', precision_score(y_true=y_test, y_pred=Y_pred))
print('recall = ', recall_score(y_true=y_test, y_pred=Y_pred))
print('f1 score = ', f1_score(y_true=y_test, y_pred=Y_pred))

[[11953    77]
 [   86   398]]
accuracy =  0.9869745884609238
precision =  0.8378947368421052
recall =  0.8223140495867769
f1 score =  0.8300312825860271
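
Each score above follows directly from the four cells of the confusion matrix, which scikit-learn lays out as `[[TN, FP], [FN, TP]]` for binary labels 0 and 1. The sketch below recomputes the scores from the printed counts. As a point of comparison that is not part of the original analysis, it also computes the accuracy of always predicting "not a home run".

```python
# Counts copied from the confusion matrix printed above
tn, fp = 11953, 77
fn, tp = 86, 398
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # of predicted home runs, how many were real
recall = tp / (tp + fn)     # of real home runs, how many were found
f1 = 2 * precision * recall / (precision + recall)

# Baseline for comparison: always predict class 0 (not a home run)
baseline_accuracy = (tn + fp) / total

print("accuracy =", accuracy)
print("precision =", precision)
print("recall =", recall)
print("f1 score =", f1)
print("baseline accuracy =", baseline_accuracy)
```

The baseline comes out at about 0.961, so the model's accuracy of 0.987 is a smaller gain than it first appears. Precision and recall describe performance on the home-run class more directly.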
