# A Brief Introduction to Machine Learning
### Machine learning can be broadly divided into supervised and unsupervised learning
<img src="unsuper.png">
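1. Unsupervised learning does not need training data that has been labeled by hand beforehand. Given the data, the machine groups items by how they relate to one another, uncovers latent rules and patterns, and forms clusters; it makes no judgment about whether a piece of information is correct or incorrect.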
<img src="supervised.png">
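2. Supervised learning is a way of learning in which the computer analyzes patterns in labeled data and then makes predictions. The labeled data acts like an answer key: during training the computer compares its output against the labels and corrects itself as it goes to reach more accurate predictions, which is why supervised learning tends to achieve high accuracy.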

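K-nearest neighbors (KNN) is a simple machine learning method that classifies according to the distance between feature values. It is a simple but lazy algorithm. Its training data is all labeled, that is, every training sample has its own category. KNN is mainly used to classify unknown items: it uses the Euclidean distance to determine which known category has features closest to those of the unknown item. It can also be used for regression: find the k nearest neighbors of a sample and assign the average of those neighbors' attribute to the sample.

Despite its simplicity, the nearest-neighbor approach has been successful in a large number of classification and regression problems.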
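The regression idea described above can be sketched in plain Python. This is only an illustration under assumptions, not the code used later in this article: the function name `knn_regress` and the toy data are made up for the example.

```python
import math

def knn_regress(train_points, train_targets, query, k=3):
    """Predict a value for `query` as the mean target of its k nearest neighbours."""
    # Euclidean distance from the query to every training point
    distances = [math.dist(point, query) for point in train_points]
    # Indices of the k smallest distances
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Average the targets of those neighbours
    return sum(train_targets[i] for i in nearest) / k

# Hypothetical toy data: two features per point, one numeric target
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0)]
targets = [10.0, 12.0, 20.0, 35.0, 24.0]

prediction = knn_regress(points, targets, query=(3.2, 4.5), k=3)
print(prediction)
```

Replacing the final average with a majority vote over the neighbors' labels turns the same routine into a classifier.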

# KNN Algorithm 

### Used here to predict MLB team wins

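* KNN, short for K-nearest neighbors, is one of the simplest machine learning algorithms based on supervised learning.

* KNN assumes that a new case is similar to the available cases and puts the new case into the category it resembles most.

* KNN stores all the available data and classifies a new data point based on similarity. This means that when new data appears, KNN can easily assign it to a well-suited category.

* KNN can be used for regression as well as classification, but it is mostly used for classification problems.

* KNN is a non-parametric algorithm, meaning it makes no assumptions about the underlying data.

* It is also called a lazy learner, because it does not learn from the training set right away; instead it stores the dataset and only operates on it at classification time.

* During the training phase KNN just stores the dataset; when it receives new data, it assigns that data to the category it most closely resembles.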
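The "lazy learner" behavior in the list above can be sketched as follows. The class name `LazyKNNClassifier` and the data are hypothetical and only meant to show that `fit` stores the data while all the work happens in `predict`.

```python
from collections import Counter
import math

class LazyKNNClassifier:
    """Illustrative lazy learner: fit() only stores the data."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, points, labels):
        # No model is built here; the training set is simply kept
        self.points = list(points)
        self.labels = list(labels)
        return self

    def predict(self, query):
        # All the work happens at prediction time: rank by distance, then vote
        ranked = sorted(zip(self.points, self.labels),
                        key=lambda pair: math.dist(pair[0], query))
        votes = Counter(label for _, label in ranked[:self.k])
        return votes.most_common(1)[0][0]

# Hypothetical two-feature data with two categories
model = LazyKNNClassifier(k=3).fit(
    [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)],
    ["A", "A", "A", "B", "B", "B"],
)
label_near_a = model.predict((1.5, 1.5))
label_near_b = model.predict((6.5, 6.5))
print(label_near_a, label_near_b)
```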


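Suppose we have an image of a creature that looks like both a cat and a dog, and we want to know which of the two it is. KNN suits this kind of identification because it works on a similarity measure. Our KNN model finds the features of the new image that are similar to those of the cat and dog images, and assigns it to the cat or the dog category based on the most similar features.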
<img src="dog.png">


## MLB Team Batting Regular Season Stats 
This article uses each team's season-level statistics
to predict the team's number of wins for that season (written as pW in this article).


## Data collection

I manually downloaded the team baseball statistics from ESPN:
https://www.espn.com/mlb/stats/team/_/season/2019/seasontype/2

In [223]:
import pandas as pd

features = ['R','H','HR','RBI','SO','TB','OBP','OPS','pW','pERA','pHR','pSO','pBB','pWHIP']
# keep only the useful features listed above

dc_listings = pd.read_csv('baseball (1).csv',low_memory=False)

dc_listings = dc_listings[features]

print(dc_listings.shape)

dc_listings.head()

(90, 14)


Unnamed: 0,R,H,HR,RBI,SO,TB,OBP,OPS,pW,pERA,pHR,pSO,pBB,pWHIP
0,729,1379,213,698,1435,2320,0.31,0.725,54,5.59,305,1248,561,1.46
1,901,1554,245,857,1382,2688,0.34,0.806,84,4.7,215,1633,605,1.38
2,769,1368,220,734,1276,2338,0.324,0.746,72,5.12,267,1404,576,1.38
3,769,1354,223,731,1332,2345,0.323,0.756,93,3.76,207,1508,450,1.22
4,691,1356,162,655,1405,2203,0.309,0.71,59,5.2,221,1230,582,1.48


The selected features (baseball (1).csv contains many columns that are not very useful, so only a few of the more useful ones are kept here)



<img src="newdatapoint.png">


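## How KNN works

## K is a constant: the number of neighboring cases whose features we compare against
As the figure shows, the three nearest neighbors all come from category A, so the new data point is assigned to category A.
How should the value of K be chosen in KNN?

Keep the following points in mind when choosing K:
* There is no fixed way to determine the best value of K, so we need to try several values and pick the best one.
* A very low value of K, such as K = 1 or K = 2, can be noisy and makes the model sensitive to outliers.
* A larger K is generally more stable, but too large a K can pull in neighbors from other categories and blur the boundaries.
<img src="K_is_three.png">
In this figure, K = 3.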
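Since there is no formula for the best K, one common approach is to try several values on held-out data and keep the one with the lowest error. The sketch below assumes a single feature and uses made-up data; `knn_predict` and `rmse_for_k` are hypothetical helper names, not functions from this article.

```python
import math

def knn_predict(train, query, k):
    """Mean target of the k training rows whose feature is closest to `query`."""
    nearest = sorted(train, key=lambda row: abs(row[0] - query))[:k]
    return sum(target for _, target in nearest) / k

def rmse_for_k(train, validation, k):
    """RMSE of the k-neighbour predictions on the validation rows."""
    errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in validation]
    return math.sqrt(sum(errors) / len(errors))

# Hypothetical (feature, target) pairs
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8),
         (5.0, 10.1), (6.0, 12.0), (7.0, 13.9), (8.0, 16.2)]
validation = [(2.5, 5.0), (4.5, 9.0), (6.5, 13.0)]

# Try a handful of K values and keep the one with the lowest validation error
scores = {k: rmse_for_k(train, validation, k) for k in (1, 2, 3, 5, 8)}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

On this toy data both the smallest and the largest K do worse than a value in between, which matches the points in the list above.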

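### Advantages of KNN:
* It is simple to implement.
* It is robust to noisy training data, provided K is not too small.
* It can be more effective when the training data is large.
### Disadvantages of KNN:
* A value of K always has to be chosen, which can sometimes be complicated.
* The computational cost is high, because the distance from the query to every training sample has to be computed.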

## Computing the distance


## distance = abs(pERA of the case - target value)

Suppose our ERA is 4; we compute the distance using the ERA feature alone.

In [224]:
import numpy as np

our__value = 4

dc_listings['distance'] = np.abs(dc_listings.pERA - our__value)
dc_listings.distance.value_counts().sort_index()

0.00    2
0.01    1
0.03    2
0.04    1
0.05    2
       ..
1.20    1
1.24    1
1.36    1
1.56    1
1.59    1
Name: distance, Length: 71, dtype: int64

The output above counts the cases at each distance from ERA = 4; two of them match exactly (distance 0).

In [225]:
dc_listings = dc_listings.sample(frac=1,random_state=0)  # frac=1 shuffles all rows
dc_listings = dc_listings.sort_values('distance')
dc_listings.pW.head()

67    86
58    82
83    83
8     97
89    80
Name: pW, dtype: int64

In [226]:
#dc_listings['pW'] = dc_listings.pW.str.replace("\$|,",'').astype(float)

mean_pW = dc_listings.pW.iloc[:5].mean()
mean_pW

85.6

Take the five closest cases and average their win totals.
## A team with pERA = 4 is predicted to win about 85.6 games

## Planning the model

<img src="2020-07-01_091209.png" style="width:600px;height:250px;float:left">

First, split the data into a training set and a test set.

In [227]:
dc_listings = dc_listings.sample(frac=1,random_state=22)

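# drop() returns a new DataFrame, so assign the result back
dc_listings = dc_listings.drop('distance',axis=1)

# first 60 rows for training, the remaining 30 for testing
train_df = dc_listings.iloc[:60].copy()
test_df = dc_listings.iloc[60:].copy()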

Predict using a single feature

In [228]:
def predict_pW(new_listing_value,feature_column):
    # work on a copy so train_df itself is not modified
    temp_df = train_df.copy()
    # absolute distance between each training row and the query value
    temp_df['distance'] = np.abs(temp_df[feature_column] - new_listing_value)
    temp_df = temp_df.sort_values('distance')
    # average pW of the 5 nearest training rows
    knn_5 = temp_df.pW.iloc[:5]
    predicted_pW = knn_5.mean()
    return predicted_pW

In [229]:
test_df['predicted_pW'] = test_df.pERA.apply(predict_pW,feature_column='pERA')
test_df[['pW','pERA','predicted_pW']].head(10)

Unnamed: 0,pW,pERA,predicted_pW
65,64,5.36,61.8
39,64,4.58,76.4
52,80,4.14,91.0
57,90,3.74,92.8
79,97,3.88,86.8
11,67,4.79,70.4
42,100,3.78,92.4
45,67,4.92,69.2
75,92,3.95,84.2
25,57,4.74,71.4


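A quick reading of the data above: pERA here is the pitching earned run average, i.e. the average number of earned runs allowed per nine innings.
Lower is better (a team that gives up fewer runs has a better chance of winning).
So a team with a lower pERA usually wins more games that season.
Both the predicted wins (predicted_pW) and the actual wins (pW) reflect this.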
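To put a number on that relationship, the sketch below computes the Pearson correlation between pERA and pW for the ten test rows shown in the table above. The helper `pearson_r` is written for this example only, and ten rows are far too few to generalize from.

```python
import math

# The ten test rows shown in the table above: (pERA, pW)
rows = [(5.36, 64), (4.58, 64), (4.14, 80), (3.74, 90), (3.88, 97),
        (4.79, 67), (3.78, 100), (4.92, 67), (3.95, 92), (4.74, 57)]

def pearson_r(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)

r = pearson_r(rows)
print(round(r, 2))
```

The result is strongly negative, which agrees with the reading that a lower pERA goes with more wins.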

## Root mean squared error (RMSE)
The lower this number, the closer the predictions produced from the training set are to the true values in the test set.

<img src="8.png" style="width:700px;height:100px;float:left">
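RMSE is the square root of the mean of the squared differences between predictions and true values. A minimal plain-Python sketch, with a hypothetical helper `rmse` and made-up numbers:

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical predictions and true values
predicted = [80.0, 90.0, 70.0]
actual = [82.0, 86.0, 73.0]
error = rmse(predicted, actual)
print(error)
```

Because the errors are squared before averaging, one large miss raises the RMSE more than several small ones.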

In [230]:
test_df['squared_error'] = (test_df['predicted_pW'] - test_df['pW'])**(2)
mse = test_df['squared_error'].mean()
rmse = mse ** (1/2)
rmse

9.450925880568528

This gives the RMSE for a single feature.

## Results for different features

In [231]:
for feature in ['SO','OBP','pERA','pWHIP']:
    test_df['predicted_pW'] = test_df[feature].apply(predict_pW,feature_column=feature)
    test_df['squared_error'] = (test_df['predicted_pW'] - test_df['pW'])**(2)
    mse = test_df['squared_error'].mean()
    rmse = mse ** (1/2)
    print("RMSE for the {} column: {}".format(feature,rmse))

RMSE for the SO column: 17.857659421099953
RMSE for the OBP column: 11.561660780355043
RMSE for the pERA column: 9.450925880568528
RMSE for the pWHIP column: 11.2653450901426


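The differences are fairly large, so the next step is to combine several features.
Each feature has a different range of values,
so the data is standardized to put all features on a common scale.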
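The sketch below shows why scale matters, using the SO and OBP values of the first rows in the table shown earlier. The `standardize` helper is written for this example; it uses the population standard deviation, which is also what `StandardScaler` uses. The z-scores it prints are computed from five rows only, so they will not match the values the notebook gets from all 90 rows.

```python
import math

# SO and OBP for the first two teams in the table shown earlier
team_a = {"SO": 1435, "OBP": 0.310}
team_b = {"SO": 1382, "OBP": 0.340}

raw_distance = math.dist((team_a["SO"], team_a["OBP"]),
                         (team_b["SO"], team_b["OBP"]))
so_gap = abs(team_a["SO"] - team_b["SO"])
# raw_distance is almost exactly the SO gap: OBP barely contributes
print(raw_distance, so_gap)

def standardize(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

# SO values of the first five rows shown earlier (five rows only, for illustration)
z_scores = standardize([1435, 1382, 1276, 1332, 1405])
print(z_scores)
```

After standardization every feature has mean 0 and standard deviation 1, so no single column dominates the distance.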

In [232]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
features = ['R','H','HR','RBI','SO','TB','OBP','OPS','pW','pERA','pHR','pSO','pBB','pWHIP']

dc_listings = pd.read_csv('baseball (1).csv',low_memory=False)

dc_listings = dc_listings[features]

dc_listings = dc_listings.dropna()  # drop rows with missing values
ss = StandardScaler()
# standardize every column (note: this includes the target pW)
dc_listings[features] = ss.fit_transform(dc_listings[features])
normalized_listings = dc_listings
print(dc_listings.shape)

# undo the standardization to recover the original values
origin = ss.inverse_transform(normalized_listings)

# rebuild a DataFrame of the original (un-standardized) values for reference
df = pd.DataFrame(origin , columns = features)

normalized_listings.head(10)

(90, 14)


Unnamed: 0,R,H,HR,RBI,SO,TB,OBP,OPS,pW,pERA,pHR,pSO,pBB,pWHIP
0,-0.286917,-0.18507,0.208574,-0.259182,0.483401,-0.13037,-0.929721,-0.53121,-1.938043,2.294354,2.395848,-0.311858,0.516825,0.74915
1,1.859759,2.323405,1.061831,1.784286,0.023707,2.123212,1.46099,1.626182,0.215338,0.670445,0.388777,1.185217,0.935273,0.41844
2,0.21231,-0.342745,0.395224,0.20349,-0.89568,-0.020141,0.185944,0.028114,-0.646014,1.436784,1.548418,0.294749,0.659478,0.41844
3,0.21231,-0.543423,0.475217,0.164934,-0.409966,0.022726,0.106254,0.294459,0.861352,-1.044694,0.210371,0.699154,-0.538804,-0.24298
4,-0.761183,-0.514755,-1.151304,-0.811818,0.223197,-0.846862,-1.009411,-0.930727,-1.579146,1.582754,0.522582,-0.381852,0.716539,0.831828
5,0.21231,-0.371414,1.195152,0.33201,1.593604,0.537131,0.584396,0.587438,0.574235,0.12306,0.611785,0.65638,0.602417,0.211746
6,2.334025,2.223066,2.715016,2.414033,-0.392619,3.005048,1.301609,2.318678,1.435587,-0.278355,0.009664,0.524171,-0.519784,0.08773
7,2.383948,1.449022,2.688352,2.388329,0.500748,2.411033,1.381299,2.238775,1.579146,-0.041155,1.124703,0.800255,0.003276,0.08773
8,1.160842,-0.113399,1.381802,1.051722,-0.357925,0.929058,0.425015,0.827148,1.14847,-0.661525,0.076566,-0.113544,-0.28203,-0.160303
9,0.075023,-1.245796,0.901845,0.152082,1.749727,-0.056884,-0.451579,-0.131693,-0.933132,1.199584,1.392313,-0.346855,-0.015745,0.41844


In [233]:
df.head(10)  # the un-standardized values, for reference

Unnamed: 0,R,H,HR,RBI,SO,TB,OBP,OPS,pW,pERA,pHR,pSO,pBB,pWHIP
0,729.0,1379.0,213.0,698.0,1435.0,2320.0,0.31,0.725,54.0,5.59,305.0,1248.0,561.0,1.46
1,901.0,1554.0,245.0,857.0,1382.0,2688.0,0.34,0.806,84.0,4.7,215.0,1633.0,605.0,1.38
2,769.0,1368.0,220.0,734.0,1276.0,2338.0,0.324,0.746,72.0,5.12,267.0,1404.0,576.0,1.38
3,769.0,1354.0,223.0,731.0,1332.0,2345.0,0.323,0.756,93.0,3.76,207.0,1508.0,450.0,1.22
4,691.0,1356.0,162.0,655.0,1405.0,2203.0,0.309,0.71,59.0,5.2,221.0,1230.0,582.0,1.48
5,769.0,1366.0,250.0,744.0,1563.0,2429.0,0.329,0.767,89.0,4.4,225.0,1497.0,570.0,1.33
6,939.0,1547.0,307.0,906.0,1334.0,2832.0,0.338,0.832,101.0,4.18,198.0,1463.0,452.0,1.3
7,943.0,1493.0,306.0,904.0,1437.0,2735.0,0.339,0.829,103.0,4.31,248.0,1534.0,507.0,1.3
8,845.0,1384.0,257.0,800.0,1338.0,2493.0,0.327,0.776,97.0,3.97,201.0,1299.0,477.0,1.24
9,758.0,1305.0,239.0,730.0,1581.0,2332.0,0.316,0.74,68.0,4.99,260.0,1239.0,505.0,1.38


In [234]:
norm_train_df = normalized_listings.copy().iloc[0:60]
norm_test_df = normalized_listings.copy().iloc[60:]

Distance with multiple features:

<img src="6.png" style="width:400px;height:80px;float:left">

SciPy provides ready-made distance functions.

In [235]:
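from scipy.spatial import distance  # euclidean() is only tried out here; the model below uses cdist()

first_listing = normalized_listings.iloc[0][['OBP', 'pERA']]
fifth_listing = normalized_listings.iloc[20][['OBP', 'pERA']]  # despite the name, this is the row at position 20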
first_fifth_distance = distance.euclidean(first_listing, fifth_listing)
first_fifth_distance

1.2778442268935872
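The Euclidean distance can also be written out by hand, which makes it clear what the SciPy call computes. The sketch below uses the standardized (OBP, pERA) values of rows 0, 1 and 2 from the table above; `euclidean` here is a local helper for illustration, not the SciPy function.

```python
import math

def euclidean(a, b):
    """Square root of the summed squared differences, feature by feature."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Standardized (OBP, pERA) of rows 0, 1 and 2 from the table above
row0 = (-0.929721, 2.294354)
row1 = (1.460990, 0.670445)
row2 = (0.185944, 1.436784)

# One query against several rows, which is what the model needs
query = row0
distances = [euclidean(query, row) for row in (row1, row2)]
print(distances)
```

Computing the distance from one query row to every training row is the step that `distance.cdist` performs in a single call.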

## Multivariate KNN model

In [236]:
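def predict_pW_multivariate(new_listing_value,feature_columns):
    # work on a copy so norm_train_df itself is not modified
    temp_df = norm_train_df.copy()
    # Euclidean distance from every training row to the query row;
    # cdist returns an (n, 1) array, so flatten it to one value per row
    temp_df['distance'] = distance.cdist(temp_df[feature_columns],[new_listing_value[feature_columns]]).ravel()
    temp_df = temp_df.sort_values('distance')
    # average (standardized) pW of the 5 nearest training rows
    knn_5 = temp_df.pW.iloc[:5]
    predicted_pW = knn_5.mean()
    return predicted_pW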

cols = ['OBP', 'pERA']
norm_test_df['predicted_pW'] = norm_test_df[cols].apply(predict_pW_multivariate,feature_columns=cols,axis=1)    
norm_test_df['squared_error'] = (norm_test_df['predicted_pW'] - norm_test_df['pW'])**(2)
mse = norm_test_df['squared_error'].mean()
rmse = mse ** (1/2)
print('rmse = ',rmse)
#print(norm_test_df)
norm_test_df.head(10)


rmse =  0.7098589312763492


Unnamed: 0,R,H,HR,RBI,SO,TB,OBP,OPS,pW,pERA,pHR,pSO,pBB,pWHIP,predicted_pW,squared_error
60,1.797356,2.710427,0.875181,1.74573,-2.534964,2.080345,1.939132,2.078968,-0.430676,1.163092,0.990898,-0.370186,0.688009,0.707812,0.617303,1.09826
61,0.898747,1.692703,-0.351376,0.961758,0.249218,0.696351,1.301609,0.96032,0.861352,-1.154171,-0.057239,0.979126,-0.396152,-0.118964,1.220249,0.128807
62,0.324636,1.506359,-0.298047,0.319158,-0.843639,0.433025,0.743777,0.427631,-0.071779,-0.241863,0.589484,-0.062994,-0.348601,0.005053,0.473744,0.297596
63,0.836344,1.219676,0.261902,1.000314,-0.453333,0.941306,0.823467,0.986955,-1.004911,0.816415,0.990898,-0.525726,1.192048,0.583795,0.143559,1.318983
64,0.823863,0.81832,0.18191,0.794682,-1.962516,0.824953,1.381299,1.146762,1.507367,-1.884018,-0.770864,1.111336,-0.957252,-0.491013,1.421232,0.007419
65,-0.249475,1.076335,-1.071312,-0.156366,-1.693638,-0.234476,0.345325,-0.184962,-1.220249,1.874692,0.455679,-0.49073,0.298091,0.914505,-0.689082,0.282139
66,1.32309,1.018998,0.955174,1.321614,0.058401,0.959678,1.381299,1.066858,-0.071779,0.50623,-0.034938,-0.436291,0.117398,0.459779,1.047979,1.253858
67,-0.112187,1.105003,0.715195,-0.066402,0.283911,0.714723,-0.77034,0.054749,0.358897,-0.606786,-0.280247,0.069215,0.440744,0.253085,0.043068,0.099748
68,0.786421,0.746649,0.021924,0.807534,-0.323231,0.420778,0.982848,0.614072,0.287117,0.469737,0.589484,-0.630716,-0.224969,0.377101,0.502456,0.046371
69,-0.623895,0.631976,-0.324712,-0.747558,-1.849761,-0.111999,-0.85003,-0.371403,0.717794,-1.117679,-0.124141,0.901356,-0.025255,-0.284319,0.071779,0.417335


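### The table above shows the first ten test rows (both columns in standardized units); the gap between predicted_pW and pW is not large, so the predictions look reasonably accurate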
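Because pW was standardized along with the features, the RMSE printed above is in standard-deviation units rather than wins. A rough conversion can be made from numbers already shown in this article: the raw and standardized pW of rows 0 and 1. This is a back-of-the-envelope sketch; the variable names are made up for the example.

```python
# Values read off the tables above (rows 0 and 1 of the pW column)
raw_a, raw_b = 54.0, 84.0
std_a, std_b = -1.938043, 0.215338

# StandardScaler applies z = (x - mean) / std, so a raw gap divided by the
# matching standardized gap recovers the standard deviation of pW
pw_std = (raw_b - raw_a) / (std_b - std_a)

rmse_standardized = 0.7098589312763492  # printed by the cell above
rmse_in_wins = rmse_standardized * pw_std
print(round(pw_std, 2), round(rmse_in_wins, 2))
```

This works out to roughly 9.9 wins. The comparison with the single-feature RMSE values is only approximate, because this experiment splits the rows without the shuffle used earlier.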


# References
* https://www.twblogs.net/a/5baab1662b7177781a0e63a0
* https://blog.csdn.net/weixin_41789707/article/details/80930274
* https://scikit-learn.org/stable/modules/neighbors.html#ball-tree
* https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning