# 50+ Python Exercises: A Data Science Study Handbook

> Supervised Learning

[數據交點](https://www.datainpoint.com) | 郭耀仁 <yaojenkuo@datainpoint.com>

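## Exercise Guide

- An exercise session that stays idle for more than 10 minutes disconnects automatically; click the exercise link again to restart it.
- The first code cell loads the modules that the exercises may need.
- If an exercise needs to load files, they are stored under the absolute path `/home/jovyan/data`
- Each exercise already provides the function or class name, the expected inputs, and the parameter names, so we only need to write the code body. Type hints are also given to describe the expected input and output types.
- The docstring describes how the test is run; reading it clarifies the relationship between the expected input and the expected output and helps us solve the exercise faster.
- Write the body of the function or class between the two comments `### BEGIN SOLUTION` and `### END SOLUTION`.
- Put the expected output after the `return` keyword; merely printing it with `print()` will not pass the test.
- Errors such as `SyntaxError` or `IndentationError` invalidate the test; before testing, call the function in the notebook and check that it behaves as the docstring describes.
- If you get stuck, consult the exercise solutions or review the course videos before continuing.
- Steps to run the tests:
    1. Choose File -> Save Notebook in the top menu to save exercises.ipynb.
    2. Choose File -> New -> Terminal in the top menu to open a terminal.
    3. In the terminal, enter `cd ~` to switch to the home directory.
    4. In the terminal, enter `python 20-supervised-learning/test_runner.py` and press Enter to run the tests.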

In [237]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error

## 181. Load `train.csv` and `test.csv` from `house-prices`

Define the function `import_house_prices()` to load `train.csv` and `test.csv` located at the path `/home/jovyan/data/house-prices`.

Source: <https://www.kaggle.com/c/house-prices-advanced-regression-techniques>

- Use absolute paths.
- Use the `pd.read_csv()` function.
- Write the expected output after `return`.

In [2]:
def import_house_prices() -> tuple:
    """
    >>> train, test = import_house_prices()
    >>> type(train)
    pandas.core.frame.DataFrame
    >>> type(test)
    pandas.core.frame.DataFrame
    >>> train.shape
    (1460, 81)
    >>> test.shape
    (1459, 80)
    """
    ### BEGIN SOLUTION
    # Absolute paths in the exercise environment, as stated in the exercise description
    path_train = '/home/jovyan/data/house-prices/train.csv'
    path_test = '/home/jovyan/data/house-prices/test.csv'
    train = pd.read_csv(path_train)
    test = pd.read_csv(path_test)
    return train, test
    ### END SOLUTION

In [25]:
# train, test = import_house_prices()
# type(train)
# type(test)
# train.shape
# test.shape
#train

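|  | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 1456 | 60 | RL | 62.0 | 7917 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | WD | Normal | 175000 |
| 1456 | 1457 | 20 | RL | 85.0 | 13175 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | MnPrv | NaN | 0 | 2 | 2010 | WD | Normal | 210000 |
| 1457 | 1458 | 70 | RL | 66.0 | 9042 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | GdPrv | Shed | 2500 | 5 | 2010 | WD | Normal | 266500 |
| 1458 | 1459 | 20 | RL | 68.0 | 9717 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 142125 |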


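## 182. Find the target array column of `house-prices`

Define the function `find_target_array_variable_house_prices()` to find the column that differs between `train.csv` and `test.csv`.

- Use the `import_house_prices()` function.
- Use the set-operation features of `DataFrame.columns`.
- Write the expected output after `return`.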

In [18]:
def find_target_array_variable_house_prices() -> pd.core.indexes.base.Index:
    """
    >>> target_array_variable_house_prices = find_target_array_variable_house_prices()
    >>> target_array_variable_house_prices
    Index(['SalePrice'], dtype='object')
    """
    ### BEGIN SOLUTION
    train, test = import_house_prices()
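    # The `^` operator on an Index is deprecated as a set operation; symmetric_difference() is the explicit equivalent
    target = train.columns.symmetric_difference(test.columns)
    return target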
    ### END SOLUTION

In [19]:
# target_array_variable_house_prices = find_target_array_variable_house_prices()
# target_array_variable_house_prices

Index(['SalePrice'], dtype='object')
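
As a small illustration of the set operations that an `Index` supports, the sketch below uses two made-up column sets (hypothetical labels, not the real files). `Index.difference()` keeps the labels found only in the left operand, while `Index.symmetric_difference()` keeps the labels found in exactly one of the two.

```python
import pandas as pd

# Hypothetical column labels standing in for train.csv and test.csv
train_columns = pd.Index(["Id", "OverallQual", "SalePrice"])
test_columns = pd.Index(["Id", "OverallQual"])

# Labels present in train_columns but absent from test_columns
only_in_train = train_columns.difference(test_columns)

# Labels present in exactly one of the two Index objects
in_exactly_one = train_columns.symmetric_difference(test_columns)

print(only_in_train.tolist())
print(in_exactly_one.tolist())
```

Both calls give the same answer here because the test columns are a subset of the train columns; they would differ if the test file carried a column that the train file lacks.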

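## 183. Select the target array and feature matrix of `house-prices`

Define the function `extract_target_array_feature_matrix_house_prices()` that uses `SalePrice` in `train.csv` as the target array $y$ and `OverallQual` as the feature matrix $X$

- Use the `import_house_prices()` function.
- Use column selection.
- Mind the shape of the feature matrix.
- Write the expected output after `return`.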

In [32]:
def extract_target_array_feature_matrix_house_prices() -> tuple:
    """
    >>> y, X = extract_target_array_feature_matrix_house_prices()
    >>> type(y)
    numpy.ndarray
    >>> type(X)
    numpy.ndarray
    >>> y.shape
    (1460,)
    >>> X.shape
    (1460, 1)
    """
    ### BEGIN SOLUTION
    train, test = import_house_prices()
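    y = train['SalePrice'].values  # .values returns the ndarray that the docstring expects
    # .values gives a 1-D ndarray; reshape(-1, 1) turns it into a 2-D array with exactly one column
    X = train['OverallQual'].values.reshape(-1, 1)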
    return y, X
    ### END SOLUTION

In [34]:
# y, X = extract_target_array_feature_matrix_house_prices()
# X
# X.shape

(1460, 1)
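
The shape requirement can be checked on a tiny example. The sketch below uses a made-up four-row frame (hypothetical values) to show that a single selected column is 1-D, and that either `reshape(-1, 1)` or double-bracket selection produces the 2-D, single-column shape that Scikit-Learn estimators expect for a feature matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of train.csv
df = pd.DataFrame({
    "OverallQual": [7, 6, 7, 8],
    "SalePrice": [208500, 181500, 223500, 250000],
})

y = df["SalePrice"].to_numpy()           # target array, shape (4,)
X_1d = df["OverallQual"].to_numpy()      # a single column is 1-D, shape (4,)
X = X_1d.reshape(-1, 1)                  # -1 lets NumPy infer the row count, shape (4, 1)
X_alt = df[["OverallQual"]].to_numpy()   # selecting with a list keeps two dimensions

print(y.shape, X_1d.shape, X.shape, X_alt.shape)
```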

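## 184. Split `house-prices` into training and validation data

Define the function `split_train_valid_house_prices()` that splits the $y$ and $X$ returned by `extract_target_array_feature_matrix_house_prices()` into training and validation data.

- Use the `train_test_split(test_size=0.3, random_state=42)` function.
- Write the expected output after `return`.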

In [40]:
def split_train_valid_house_prices(X: np.ndarray, y: np.ndarray) -> tuple:
    """
    >>> y, X = extract_target_array_feature_matrix_house_prices()
    >>> X_train, X_valid, y_train, y_valid = split_train_valid_house_prices(X, y)
    >>> X_train.shape
    (1022, 1)
    >>> X_valid.shape
    (438, 1)
    >>> y_train.shape
    (1022,)
    >>> y_valid.shape
    (438,)
    """
    ### BEGIN SOLUTION
    # A fixed random_state seeds the shuffle so the split is identical on every run (needed to pass the tests)
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=42)
    return X_train, X_valid, y_train, y_valid
    ### END SOLUTION

In [45]:
# y, X = extract_target_array_feature_matrix_house_prices()
# X_train, X_valid, y_train, y_valid = split_train_valid_house_prices(X, y)
# X_train.shape
# X_valid.shape
# y_train.shape
# y_valid.shape

(438,)
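
To see where the shapes `(1022, 1)` and `(438, 1)` come from, here is a minimal sketch of a seeded shuffle split written with NumPy only. It is not Scikit-Learn's implementation: `split_sizes()` and `shuffle_split()` are hypothetical helpers, and the rounding rule (validation count rounded up, the remainder used for training) is an assumption chosen because it reproduces the shapes in the docstring.

```python
import math
import numpy as np

def split_sizes(n_samples: int, test_size: float) -> tuple:
    """Return (n_train, n_valid), rounding the validation share up (assumed rule)."""
    n_valid = math.ceil(test_size * n_samples)
    return n_samples - n_valid, n_valid

def shuffle_split(X, y, test_size, seed):
    """Hand-rolled seeded shuffle split for illustration only."""
    n_train, n_valid = split_sizes(len(X), test_size)
    # The same seed always yields the same permutation, hence the same split
    order = np.random.default_rng(seed).permutation(len(X))
    valid_idx, train_idx = order[:n_valid], order[n_valid:]
    return X[train_idx], X[valid_idx], y[train_idx], y[valid_idx]

# Placeholder arrays with the same number of rows as the house-prices training file
X = np.arange(1460).reshape(-1, 1)
y = np.arange(1460)
X_train, X_valid, y_train, y_valid = shuffle_split(X, y, test_size=0.3, seed=42)
print(X_train.shape, X_valid.shape, y_train.shape, y_valid.shape)
```

The row assignment differs from what `train_test_split(random_state=42)` produces, since the two use different random number generators; only the sizes and the idea of a reproducible seed carry over.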

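## 185. Build a dummy model for `house-prices`

Define the class `DummyModelHousePrices` whose objects have two methods, `fit()` and `predict()`. It draws random integers between the minimum and maximum of the $y^{train}$ returned by `split_train_valid_house_prices()` to build a dummy model that predicts $\hat{y}$

- Use `self`
- Use attributes inside the class body via `self.attribute`.
- Use methods inside the class body via `self.method()`.
- Use `ndarray.min()` and `ndarray.max()`
- Use the `np.random.randint()` function.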

In [47]:
class DummyModelHousePrices:
    """
    >>> y, X = extract_target_array_feature_matrix_house_prices()
    >>> X_train, X_valid, y_train, y_valid = split_train_valid_house_prices(X, y)
    >>> dummy_model_house_prices = DummyModelHousePrices()
    >>> dummy_model_house_prices.fit(y_train)
    >>> y_hat = dummy_model_house_prices.predict(X_valid)
    >>> type(y_hat)
    numpy.ndarray
    >>> y_hat.shape
    (438,)
    """
    ### BEGIN SOLUTION
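    # The dummy model ignores the features: it draws random integers from the range of the training sale prices (y)
    # and uses them as predictions, which are later scored against the true prices.
    def fit(self, y_train):
        self.y_train_max = y_train.max()
        self.y_train_min = y_train.min()
    def predict(self, X_valid):
        # Arguments: low, high (exclusive), size = number of rows of X_valid (its shape is (m, n))
        y_hat = np.random.randint(self.y_train_min, self.y_train_max, X_valid.shape[0])
        return y_hat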
    ### END SOLUTION

In [54]:
# y, X = extract_target_array_feature_matrix_house_prices()
# X_train, X_valid, y_train, y_valid = split_train_valid_house_prices(X, y)
# dummy_model_house_prices = DummyModelHousePrices()
# dummy_model_house_prices.fit(y_train)
# y_hat = dummy_model_house_prices.predict(X_valid)
# type(y_hat)
# y_hat.shape

(438,)
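
One detail worth knowing: `np.random.randint(low, high)` never returns `high` itself, so the maximum training price can never be drawn unless `high + 1` is passed. The sketch below shows an alternative using NumPy's `Generator` API, where `endpoint=True` makes the upper bound inclusive and a seed makes the draws repeatable. The prices are made up for illustration.

```python
import numpy as np

# Hypothetical training prices
y_train = np.array([34900, 129500, 755000])
low, high = y_train.min(), y_train.max()

# A seeded Generator gives the same draws on every run
rng = np.random.default_rng(42)

# endpoint=True includes `high` in the range of possible values
y_hat = rng.integers(low, high, size=5, endpoint=True)

print(y_hat.shape)
```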

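## 186. Build an expert model for `house-prices`

Define the class `ExpertModelHousePrices` whose objects have two methods, `fit()` and `predict()`. It builds an expert model from the $X^{train}$ and $y^{train}$ returned by `split_train_valid_house_prices()`: group by `OverallQual`, aggregate the mean of `SalePrice`, and predict $\hat{y}$ from the resulting group-to-mean mapping

- Use `self`
- Use attributes inside the class body via `self.attribute`.
- Use methods inside the class body via `self.method()`.
- Use group-by aggregation.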

In [93]:
class ExpertModelHousePrices:
    """
    >>> y, X = extract_target_array_feature_matrix_house_prices()
    >>> X_train, X_valid, y_train, y_valid = split_train_valid_house_prices(X, y)
    >>> expert_model_house_prices = ExpertModelHousePrices()
    >>> expert_model_house_prices.fit(X_train, y_train)
    OverallQual
    1      50150.000000
    2      60000.000000
    3      85950.000000
    4     107983.750000
    5     133982.099617
    6     162075.261993
    7     206433.647826
    8     274112.188525
    9     355825.366667
    10    437396.454545
    Name: SalePrice, dtype: float64
    >>> y_hat = expert_model_house_prices.predict(X_valid)
    >>> type(y_hat)
    numpy.ndarray
    >>> y_hat.shape
    (438,)
    """
    ### BEGIN SOLUTION
    # The expert model groups the training data by quality (X), averages the sale price (y) within each group,
    # and predicts by looking up the group mean for each quality score.
    def fit(self, X_train, y_train):
        # ravel() flattens the (m, 1) feature matrix so that it fits in a single column
        dataf = pd.DataFrame({'OverallQual': X_train.ravel(), 'SalePrice': np.asarray(y_train)})
        self.data_mean = dataf.groupby('OverallQual')['SalePrice'].mean()
        return self.data_mean
    def predict(self, X_valid):
        X_valid_ravel = X_valid.ravel()  # flatten the (m, 1) ndarray to 1-D
        # Look up the group mean for every quality score in X_valid_ravel
        y_hat = [self.data_mean[x] for x in X_valid_ravel]
        return np.array(y_hat)
    ### END SOLUTION

In [95]:
# y, X = extract_target_array_feature_matrix_house_prices()
# X_train, X_valid, y_train, y_valid = split_train_valid_house_prices(X, y)
# expert_model_house_prices = ExpertModelHousePrices()
# expert_model_house_prices.fit(X_train, y_train)
# y_hat = expert_model_house_prices.predict(X_valid)
# type(y_hat)
# y_hat.shape

(438,)
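
The group-and-look-up idea can be traced by hand on a toy data set. This sketch uses made-up quality scores and prices, and replaces the per-element lookup with `Series.map()`, which is one possible vectorised variant rather than the solution above.

```python
import numpy as np
import pandas as pd

# Hypothetical training data: quality score and sale price
X_train = np.array([[5], [5], [7], [7], [8]])
y_train = np.array([100.0, 120.0, 200.0, 220.0, 300.0])

train_df = pd.DataFrame({"OverallQual": X_train.ravel(), "SalePrice": y_train})

# One mean price per quality score: 5 -> 110.0, 7 -> 210.0, 8 -> 300.0
mean_price_by_qual = train_df.groupby("OverallQual")["SalePrice"].mean()

# Replace each validation score by the mean price of its group
X_valid = np.array([[7], [5], [8]])
y_hat = pd.Series(X_valid.ravel()).map(mean_price_by_qual).to_numpy()

print(y_hat)
```

A design difference to keep in mind: with `Series.map()` a quality score that never appeared in the training data becomes `NaN`, whereas indexing the grouped means directly raises a `KeyError`.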

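## 187. Build a machine-learning-based model for `house-prices`

Define the class `MachineLearningModelHousePrices` whose objects have two methods, `fit()` and `predict()`. It builds a machine-learning-based model from the $X^{train}$ and $y^{train}$ returned by `split_train_valid_house_prices()`, directly using the `fit()` and `predict()` methods of the Scikit-Learn `LinearRegression` class to predict $\hat{y}$

- Use `self`
- Use attributes inside the class body via `self.attribute`.
- Use methods inside the class body via `self.method()`.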

In [108]:
class MachineLearningModelHousePrices:
    """
    >>> y, X = extract_target_array_feature_matrix_house_prices()
    >>> X_train, X_valid, y_train, y_valid = split_train_valid_house_prices(X, y)
    >>> machine_learning_model_house_prices = MachineLearningModelHousePrices()
    >>> machine_learning_model_house_prices.fit(X_train, y_train)
    LinearRegression()
    >>> y_hat = machine_learning_model_house_prices.predict(X_valid)
    >>> type(y_hat)
    numpy.ndarray
    >>> y_hat.shape
    (438,)
    """
    ### BEGIN SOLUTION
    def fit(self, X_train, y_train):
        lr = LinearRegression()
        lr.fit(X_train, y_train)  # fit() trains the estimator in place, so the result need not be reassigned
        self.model = lr
        return lr
    def predict(self, X_valid):
        y_hat = self.model.predict(X_valid)  # predict() applies the fitted model to new data
        return y_hat
    ### END SOLUTION

In [111]:
# y, X = extract_target_array_feature_matrix_house_prices()
# X_train, X_valid, y_train, y_valid = split_train_valid_house_prices(X, y)
# machine_learning_model_house_prices = MachineLearningModelHousePrices()
# machine_learning_model_house_prices.fit(X_train, y_train)
# y_hat = machine_learning_model_house_prices.predict(X_valid)
# type(y_hat)
# y_hat.shape

(438,)
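
For intuition about what a linear regression with one feature estimates, the sketch below solves an ordinary least squares problem with NumPy instead of Scikit-Learn. The data are made up so that $y = 2x + 1$ holds exactly, and `np.linalg.lstsq()` is used here only as an illustration of the underlying fit, not as a description of how `LinearRegression` is implemented.

```python
import numpy as np

# Hypothetical data lying exactly on the line y = 2x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Prepend a column of ones so the first coefficient acts as the intercept
A = np.column_stack([np.ones(len(X)), X])

# Least squares solution of A @ coef ~ y
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = coef

# Predictions are the fitted line evaluated at each row of X
y_hat = A @ coef

print(intercept, slope, y_hat.shape)
```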

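## 188. Validate the three `house-prices` models

Define the function `validate_model_performance_house_prices()` that evaluates the performance of the dummy model, the expert model, and the machine-learning-based model against the $y^{valid}$ returned by `split_train_valid_house_prices()`.

- Use the `mean_squared_error()` function.
- Write the expected output after `return`.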

In [112]:
def validate_model_performance_house_prices(dummy_y_hat: np.ndarray,
                                            expert_y_hat: np.ndarray,
                                            machine_learning_y_hat: np.ndarray,
                                            y_valid: np.ndarray) -> dict:
    """
    >>> y, X = extract_target_array_feature_matrix_house_prices()
    >>> X_train, X_valid, y_train, y_valid = split_train_valid_house_prices(X, y)
    >>> dummy_model_house_prices = DummyModelHousePrices()
    >>> dummy_model_house_prices.fit(y_train)
    >>> dummy_y_hat = dummy_model_house_prices.predict(X_valid)
    >>> expert_model_house_prices = ExpertModelHousePrices()
    >>> expert_model_house_prices.fit(X_train, y_train)
    OverallQual
    1      50150.000000
    2      60000.000000
    3      85950.000000
    4     107983.750000
    5     133982.099617
    6     162075.261993
    7     206433.647826
    8     274112.188525
    9     355825.366667
    10    437396.454545
    Name: SalePrice, dtype: float64
    >>> expert_y_hat = expert_model_house_prices.predict(X_valid)
    >>> machine_learning_model_house_prices = MachineLearningModelHousePrices()
    >>> machine_learning_model_house_prices.fit(X_train, y_train)
    LinearRegression()
    >>> machine_learning_y_hat = machine_learning_model_house_prices.predict(X_valid)
    >>> validate_model_performance_house_prices(dummy_y_hat, expert_y_hat, machine_learning_y_hat, y_valid)
    {'dummy': 95083333626.51826,
     'expert': 2023633279.0004246,
     'machine_learning': 2483429086.6514378}
    """
    ### BEGIN SOLUTION
    # mean_squared_error() takes (y_true, y_pred); the value is the same either way, but this order follows the documented signature
    dummy = mean_squared_error(y_valid, dummy_y_hat)
    expert = mean_squared_error(y_valid, expert_y_hat)
    machine_learning = mean_squared_error(y_valid, machine_learning_y_hat)
    result = {'dummy': dummy, 'expert': expert, 'machine_learning': machine_learning}
    return result
    ### END SOLUTION

In [115]:
# dummy_y_hat = dummy_model_house_prices.predict(X_valid)
# expert_y_hat = expert_model_house_prices.predict(X_valid)
# machine_learning_y_hat = machine_learning_model_house_prices.predict(X_valid)
# validate_model_performance_house_prices(dummy_y_hat, expert_y_hat, machine_learning_y_hat, y_valid)

{'dummy': 93644943451.6758,
 'expert': 2023633279.0004246,
 'machine_learning': 2483429086.651439}
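
The metric itself is short enough to write by hand: the mean squared error is $\frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2$. The sketch below computes it with NumPy on made-up numbers; `mse()` is a hypothetical helper, not part of Scikit-Learn. Taking the square root gives an error in the same unit as the target, which is often easier to interpret for prices.

```python
import numpy as np

def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean of the squared differences between targets and predictions."""
    return float(np.mean((y_true - y_pred) ** 2))

# Hypothetical targets and predictions
y_valid = np.array([100.0, 200.0, 300.0])
y_hat = np.array([110.0, 190.0, 330.0])

error = mse(y_valid, y_hat)   # (10**2 + 10**2 + 30**2) / 3
rmse = error ** 0.5           # back in the unit of y

print(error, rmse)
```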

## 189. Predict `test.csv` at the path `/home/jovyan/data/house-prices` with the `house-prices` expert model

Define the function `predict_sale_price()` that predicts the `SalePrice` of `test.csv` from `OverallQual` using the expert model

- Use the `import_house_prices()` function.
- Use the `extract_target_array_feature_matrix_house_prices()` function.
- Use the `split_train_valid_house_prices()` function.
- Use the `ExpertModelHousePrices` class.
- Write the expected output after `return`.

In [125]:
def predict_sale_price(X_test: pd.core.frame.DataFrame) -> pd.core.frame.DataFrame:
    """
    >>> train, test = import_house_prices()
    >>> X_test = test[["Id", "OverallQual"]]
    >>> predict_sale_price(X_test)
            Id      SalePrice
    0     1461  133982.099617
    1     1462  162075.261993
    2     1463  133982.099617
    3     1464  162075.261993
    4     1465  274112.188525
    ...    ...            ...
    1454  2915  107983.750000
    1455  2916  107983.750000
    1456  2917  133982.099617
    1457  2918  133982.099617
    1458  2919  206433.647826

    [1459 rows x 2 columns]
    """
    ### BEGIN SOLUTION
    # Refit the expert model on the training split, then predict from the OverallQual column of X_test
    y, X = extract_target_array_feature_matrix_house_prices()
    X_train, X_valid, y_train, y_valid = split_train_valid_house_prices(X, y)
    expert_model = ExpertModelHousePrices()
    expert_model.fit(X_train, y_train)
    X_test_data = X_test['OverallQual'].values.reshape(-1,1)
    X_test_predict = expert_model.predict(X_test_data)
    result = pd.DataFrame()
    result['Id'] = X_test['Id']
    result['SalePrice'] = X_test_predict
    return result
    ### END SOLUTION

In [126]:
# train, test = import_house_prices()
# X_test = test[["Id", "OverallQual"]]
# predict_sale_price(X_test)

|  | Id | SalePrice |
|---|---|---|
| 0 | 1461 | 133982.099617 |
| 1 | 1462 | 162075.261993 |
| 2 | 1463 | 133982.099617 |
| 3 | 1464 | 162075.261993 |
| 4 | 1465 | 274112.188525 |
| ... | ... | ... |
| 1454 | 2915 | 107983.750000 |
| 1455 | 2916 | 107983.750000 |
| 1456 | 2917 | 133982.099617 |
| 1457 | 2918 | 133982.099617 |


## 190. Export the `house-prices` expert model predictions as `submission_house_prices.csv`

Define the function `export_sale_price_as_submission()` that exports the output of `predict_sale_price()` to the working directory as `submission_house_prices.csv`.

- Use the `predict_sale_price()` function.
- Use `DataFrame.to_csv("submission_house_prices.csv", index=False)`

In [131]:
def export_sale_price_as_submission(X_test: pd.core.frame.DataFrame) -> None:
    """
    >>> train, test = import_house_prices()
    >>> X_test = test[["Id", "OverallQual"]]
    >>> export_sale_price_as_submission(X_test)
    >>> submission_csv = pd.read_csv("submission_house_prices.csv")
    >>> submission_csv.shape
    (1459, 2)
    """
    ### BEGIN SOLUTION
    data = predict_sale_price(X_test)
    data.to_csv("submission_house_prices.csv", index = False)
    ### END SOLUTION

In [134]:
# train, test = import_house_prices()
# X_test = test[["Id", "OverallQual"]]
# export_sale_price_as_submission(X_test)
# submission_csv = pd.read_csv("submission_house_prices.csv")
# submission_csv.shape

(1459, 2)
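
The `index=False` argument matters for the `(1459, 2)` shape check. The sketch below writes a made-up two-row frame to a temporary directory twice, once with the default settings and once with `index=False`, and reads both files back to compare their columns. The file names and values are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row submission
submission = pd.DataFrame({"Id": [1461, 1462], "SalePrice": [133982.1, 162075.3]})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = Path(tmp_dir) / "with_index.csv"
    without_index = Path(tmp_dir) / "without_index.csv"

    submission.to_csv(with_index)                  # default: the row index is written too
    submission.to_csv(without_index, index=False)  # only the data columns are written

    columns_with_index = pd.read_csv(with_index).columns.tolist()
    columns_without_index = pd.read_csv(without_index).columns.tolist()

print(columns_with_index)
print(columns_without_index)
```

With the default settings the row index is stored as an extra unnamed column, which shows up as `Unnamed: 0` when the file is read back.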

## 191. Load `train.csv` and `test.csv` from `titanic`

Define the function `import_titanic()` to load `train.csv` and `test.csv` located at the path `/home/jovyan/data/titanic`.

Source: <https://www.kaggle.com/c/titanic>

- Use absolute paths.
- Use the `pd.read_csv()` function.
- Write the expected output after `return`.

In [137]:
def import_titanic() -> tuple:
    """
    >>> train, test = import_titanic()
    >>> type(train)
    pandas.core.frame.DataFrame
    >>> type(test)
    pandas.core.frame.DataFrame
    >>> train.shape
    (891, 12)
    >>> test.shape
    (418, 11)
    """
    ### BEGIN SOLUTION
    # Absolute paths in the exercise environment, as stated in the exercise description
    path_train = '/home/jovyan/data/titanic/train.csv'
    path_test = '/home/jovyan/data/titanic/test.csv'
    train, test = pd.read_csv(path_train), pd.read_csv(path_test)
    return train, test
    ### END SOLUTION

In [140]:
# train, test = import_titanic()
# type(train)
# train.shape
# test.shape

(418, 11)

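## 192. Find the target array column of `titanic`

Define the function `find_target_array_variable_titanic()` to find the column that differs between `train.csv` and `test.csv`.

- Use the `import_titanic()` function.
- Use the set-operation features of `DataFrame.columns`.
- Write the expected output after `return`.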

In [141]:
def find_target_array_variable_titanic() -> pd.core.indexes.base.Index:
    """
    >>> target_array_variable_titanic = find_target_array_variable_titanic()
    >>> target_array_variable_titanic
    Index(['Survived'], dtype='object')
    """
    ### BEGIN SOLUTION
    train, test = import_titanic()
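    # The `^` operator on an Index is deprecated as a set operation; symmetric_difference() is the explicit equivalent
    return train.columns.symmetric_difference(test.columns)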
    ### END SOLUTION

In [143]:
# target_array_variable_titanic = find_target_array_variable_titanic()
# target_array_variable_titanic

Index(['Survived'], dtype='object')

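## 193. Select the target array and feature matrix of `titanic`

Define the function `extract_target_array_feature_matrix_titanic()` that uses `Survived` in `train.csv` as the target array $y$ and `Sex` and `Age` as the feature matrix $X$

- Use the `import_titanic()` function.
- Use column selection.
- Mind the shape of the feature matrix.
- Write the expected output after `return`.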

In [150]:
def extract_target_array_feature_matrix_titanic() -> tuple:
    """
    >>> y, X = extract_target_array_feature_matrix_titanic()
    >>> type(y)
    numpy.ndarray
    >>> type(X)
    numpy.ndarray
    >>> y.shape
    (891,)
    >>> X.shape
    (891, 2)
    """
    ### BEGIN SOLUTION
    train, test = import_titanic()
    y = train['Survived'].values
    # Selecting both columns with one list already yields a 2-D array of shape (891, 2), so no reshape is needed
    X = train[['Sex', 'Age']].values
    return y, X
    ### END SOLUTION

In [154]:
# y, X = extract_target_array_feature_matrix_titanic()
# type(y)
# y.shape
# X.shape

(891, 2)

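## 194. Wrangle the `titanic` feature matrix

Define the function `wrangle_feature_matrix_titanic()` that converts column 0 of the `X` returned by `extract_target_array_feature_matrix_titanic()` to integers and fills the missing values in column 1. The conversion and filling rules are as follows:

- `{'female': 0, 'male': 1}`
- Use `Series.map()`
- Use `Series.mean()`
- Use `Series.fillna()` with the mean as the fill value.
- Write the expected output after `return`.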

In [206]:
def wrangle_feature_matrix_titanic(X: np.ndarray) -> np.ndarray:
    """
    >>> y, X = extract_target_array_feature_matrix_titanic()
    >>> X_wrangled = wrangle_feature_matrix_titanic(X)
    >>> type(X_wrangled)
    numpy.ndarray
    >>> np.unique(X_wrangled[:, 0])
    array([0., 1.])
    >>> np.sum(np.isnan(X_wrangled[:, 1]))
    0
    """
    ### BEGIN SOLUTION
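    X_df = pd.DataFrame(X)
    gender_code = X_df[0].map({'female': 0, 'male': 1})  # encode Sex with the mapping given in the exercise
    age = X_df[1].astype(float)  # the mixed-type input array has object dtype, so cast Age to float first
    age = age.fillna(age.mean())  # fill missing ages with the mean age
    result = pd.DataFrame()
    result['Sex'] = gender_code
    result['Age'] = age
    return result.values.astype(float)  # convert back to a float ndarray so that np.isnan() works on it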
    ### END SOLUTION

In [210]:
# y, X = extract_target_array_feature_matrix_titanic()
# X_wrangled = wrangle_feature_matrix_titanic(X)
# X_wrangled
# type(X_wrangled)
# np.unique(X_wrangled[:, 0])
# np.sum(np.isnan(X_wrangled[:, 1]))

0
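
The two rules can be followed step by step on a handful of rows. The sketch below uses made-up passengers; the variable names are hypothetical and the values are chosen so the mean is easy to verify by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers: one age is missing
sex = pd.Series(["male", "female", "female", "male"])
age = pd.Series([22.0, np.nan, 26.0, 36.0])

# Rule 1: encode the categories with the dictionary from the exercise
sex_encoded = sex.map({"female": 0, "male": 1})

# Rule 2: Series.mean() skips NaN, so the fill value is (22 + 26 + 36) / 3 = 28.0
age_filled = age.fillna(age.mean())

# Stack the two columns side by side into a float feature matrix
X_wrangled = np.column_stack([sex_encoded.to_numpy(), age_filled.to_numpy()]).astype(float)

print(X_wrangled)
```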

## 195. Split `titanic` into training and validation data

Define the function `split_train_valid_titanic()` that splits the $y$ returned by `extract_target_array_feature_matrix_titanic()` and the `X_wrangled` returned by `wrangle_feature_matrix_titanic()` into training and validation data.

- Use the `train_test_split(test_size=0.3, random_state=42)` function.
- Write the expected output after `return`.

In [211]:
def split_train_valid_titanic(X: np.ndarray, y: np.ndarray) -> tuple:
    """
    >>> y, X = extract_target_array_feature_matrix_titanic()
    >>> X_wrangled = wrangle_feature_matrix_titanic(X)
    >>> X_train, X_valid, y_train, y_valid = split_train_valid_titanic(X_wrangled, y)
    >>> X_train.shape
    (623, 2)
    >>> X_valid.shape
    (268, 2)
    >>> y_train.shape
    (623,)
    >>> y_valid.shape
    (268,)
    """
    ### BEGIN SOLUTION
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=42)
    return X_train, X_valid, y_train, y_valid
    ### END SOLUTION

In [261]:
# y, X = extract_target_array_feature_matrix_titanic()
# X_wrangled = wrangle_feature_matrix_titanic(X)
# X_train, X_valid, y_train, y_valid = split_train_valid_titanic(X_wrangled, y)
# X_train.shape
# X_valid.shape

array([[ 1.        , 29.69911765],
       [ 1.        , 31.        ],
       [ 1.        , 20.        ],
       [ 0.        ,  6.        ],
       [ 0.        , 14.        ],
       [ 0.        , 26.        ],
       [ 0.        , 29.69911765],
       [ 1.        , 16.        ],
       [ 0.        , 16.        ],
       [ 0.        , 19.        ],
       [ 1.        , 37.        ],
       [ 1.        , 44.        ],
       [ 0.        , 29.69911765],
       [ 1.        , 30.        ],
       [ 1.        , 36.        ],
       [ 0.        , 16.        ],
       [ 1.        , 42.        ],
       [ 0.        , 29.69911765],
       [ 1.        , 27.        ],
       [ 1.        , 47.        ],
       [ 1.        , 24.        ],
       [ 1.        , 34.        ],
       [ 0.        , 19.        ],
       [ 1.        , 20.        ],
       [ 1.        , 29.69911765],
       [ 1.        , 10.        ],
       [ 1.        , 40.        ],
       [ 1.        , 31.        ],
       ...

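## 196. Build a dummy model for `titanic`

Define the class `DummyModelTitanic` whose objects have two methods, `fit()` and `predict()`. It randomly generates the integers 0 or 1 to build a dummy model that predicts $\hat{y}$

- Use `self`
- Use attributes inside the class body via `self.attribute`.
- Use methods inside the class body via `self.method()`.
- Use the `np.random.randint()` function.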

In [223]:
class DummyModelTitanic:
    """
    >>> y, X = extract_target_array_feature_matrix_titanic()
    >>> X_wrangled = wrangle_feature_matrix_titanic(X)
    >>> X_train, X_valid, y_train, y_valid = split_train_valid_titanic(X_wrangled, y)
    >>> dummy_model_titanic = DummyModelTitanic()
    >>> dummy_model_titanic.fit(y_train)
    >>> y_hat = dummy_model_titanic.predict(X_valid)
    >>> type(y_hat)
    numpy.ndarray
    >>> y_hat.shape
    (268,)
    """
    ### BEGIN SOLUTION
    def fit(self, y_train):
        self.y_train = y_train
    def predict(self, X_valid):
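        # Only the number of rows of X_valid matters; its values are ignored. The upper bound 2 is exclusive, so the draws are 0 or 1.
        y_hat = np.random.randint(0, 2, X_valid.shape[0])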
        return y_hat
    ### END SOLUTION

In [226]:
# y, X = extract_target_array_feature_matrix_titanic()
# X_wrangled = wrangle_feature_matrix_titanic(X)
# X_train, X_valid, y_train, y_valid = split_train_valid_titanic(X_wrangled, y)
# dummy_model_titanic = DummyModelTitanic()
# dummy_model_titanic.fit(y_train)
# y_hat = dummy_model_titanic.predict(X_valid)
# type(y_hat)
# y_hat.shape

(268,)

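## 197. Build an expert model for `titanic`

Define the class `ExpertModelTitanic` whose objects have two methods, `fit()` and `predict()`. It builds an expert model from the $X^{train}$ and $y^{train}$ returned by `split_train_valid_titanic()`: predict $\hat{y}$ as `1` when `Sex` is `0` or `Age` is below the mean, and as `0` otherwise

- Use `self`
- Use attributes inside the class body via `self.attribute`.
- Use methods inside the class body via `self.method()`.
- Use the `wrangle_feature_matrix_titanic()` function.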

In [262]:
class ExpertModelTitanic:
    """
    >>> y, X = extract_target_array_feature_matrix_titanic()
    >>> X_wrangled = wrangle_feature_matrix_titanic(X)
    >>> X_train, X_valid, y_train, y_valid = split_train_valid_titanic(X_wrangled, y)
    >>> expert_model_titanic = ExpertModelTitanic()
    >>> expert_model_titanic.fit(X_train)
    >>> y_hat = expert_model_titanic.predict(X_valid)
    >>> type(y_hat)
    numpy.ndarray
    >>> y_hat.shape
    (268,)
    """
    ### BEGIN SOLUTION
    def fit(self, X_train):
        self.X_train_sex = X_train[:,0]
        self.X_train_age = X_train[:,1]
        self.age_mean = self.X_train_age.mean()
    def predict(self, X_valid):
        is_female = X_valid[:,0] == 0
        is_younger = X_valid[:,1] < self.age_mean
        y_hat = np.where(np.logical_or(is_female, is_younger), 1, 0)
        return y_hat
    ### END SOLUTION

In [264]:
# y, X = extract_target_array_feature_matrix_titanic()
# X_wrangled = wrangle_feature_matrix_titanic(X)
# X_train, X_valid, y_train, y_valid = split_train_valid_titanic(X_wrangled, y)
# expert_model_titanic = ExpertModelTitanic()
# expert_model_titanic.fit(X_train)
# y_hat = expert_model_titanic.predict(X_valid)
# y_hat.shape
# type(y_hat)

numpy.ndarray
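
The rule "female or younger than the mean" maps naturally onto boolean masks. The sketch below applies it to three made-up passengers with a hypothetical training mean age of 30, so each outcome can be checked by eye.

```python
import numpy as np

# Hypothetical validation rows: [Sex, Age] with 0 = female, 1 = male
X_valid = np.array([
    [0.0, 40.0],   # female, older than the mean
    [1.0, 10.0],   # male, younger than the mean
    [1.0, 40.0],   # male, older than the mean
])
age_mean = 30.0  # assumed mean age learned from the training data

is_female = X_valid[:, 0] == 0
is_younger = X_valid[:, 1] < age_mean

# Element-wise OR of the two masks; np.where turns True/False into 1/0
y_hat = np.where(is_female | is_younger, 1, 0)

print(y_hat)
```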

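## 198. Build a machine-learning-based model for `titanic`

Define the class `MachineLearningModelTitanic` whose objects have two methods, `fit()` and `predict()`. It builds a machine-learning-based model from the $X^{train}$ and $y^{train}$ returned by `split_train_valid_titanic()`, directly using the `fit()` and `predict()` methods of the Scikit-Learn `LogisticRegression` class to predict $\hat{y}$

- Use `self`
- Use attributes inside the class body via `self.attribute`.
- Use methods inside the class body via `self.method()`.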

In [265]:
class MachineLearningModelTitanic:
    """
    >>> y, X = extract_target_array_feature_matrix_titanic()
    >>> X_wrangled = wrangle_feature_matrix_titanic(X)
    >>> X_train, X_valid, y_train, y_valid = split_train_valid_titanic(X_wrangled, y)
    >>> machine_learning_model_titanic = MachineLearningModelTitanic()
    >>> machine_learning_model_titanic.fit(X_train, y_train)
    LogisticRegression()
    >>> y_hat = machine_learning_model_titanic.predict(X_valid)
    >>> type(y_hat)
    numpy.ndarray
    >>> y_hat.shape
    (268,)
    """
    ### BEGIN SOLUTION
    def fit(self, X_train, y_train):
        self.lr = LogisticRegression()
        self.lr.fit(X_train, y_train)
        return self.lr
    def predict(self, X_valid):
        y_hat = self.lr.predict(X_valid)
        return y_hat
    ### END SOLUTION

In [270]:
# y, X = extract_target_array_feature_matrix_titanic()
# X_wrangled = wrangle_feature_matrix_titanic(X)
# X_train, X_valid, y_train, y_valid = split_train_valid_titanic(X_wrangled, y)
# machine_learning_model_titanic = MachineLearningModelTitanic()
# machine_learning_model_titanic.fit(X_train, y_train)
# y_hat = machine_learning_model_titanic.predict(X_valid)
# type(y_hat)
# y_hat.shape

(268,)

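## 199. Validate the three `titanic` models

Define the function `validate_model_performance_titanic()` that evaluates the performance of the dummy model, the expert model, and the machine-learning-based model against the $y^{valid}$ returned by `split_train_valid_titanic()`.

- Count the number of misclassifications.
- Write the expected output after `return`.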

In [279]:
def validate_model_performance_titanic(dummy_y_hat: np.ndarray,
                                       expert_y_hat: np.ndarray,
                                       machine_learning_y_hat: np.ndarray,
                                       y_valid: np.ndarray) -> dict:
    """
    >>> y, X = extract_target_array_feature_matrix_titanic()
    >>> X_wrangled = wrangle_feature_matrix_titanic(X)
    >>> X_train, X_valid, y_train, y_valid = split_train_valid_titanic(X_wrangled, y)
    >>> dummy_model_titanic = DummyModelTitanic()
    >>> dummy_model_titanic.fit(y_train)
    >>> dummy_y_hat = dummy_model_titanic.predict(X_valid)
    >>> expert_model_titanic = ExpertModelTitanic()
    >>> expert_model_titanic.fit(X_train)
    >>> expert_y_hat = expert_model_titanic.predict(X_valid)
    >>> machine_learning_model_titanic = MachineLearningModelTitanic()
    >>> machine_learning_model_titanic.fit(X_train, y_train)
    LogisticRegression()
    >>> machine_learning_y_hat = machine_learning_model_titanic.predict(X_valid)
    >>> validate_model_performance_titanic(dummy_y_hat, expert_y_hat, machine_learning_y_hat, y_valid)
    {'dummy': 118, 'expert': 100, 'machine_learning': 56}
    """
    ### BEGIN SOLUTION
    # Comparing the arrays element-wise yields booleans; summing them counts the misclassifications
    dummy_y_hat_error = int(np.sum(dummy_y_hat != y_valid))
    expert_y_hat_error = int(np.sum(expert_y_hat != y_valid))
    machine_learning_y_hat_error = int(np.sum(machine_learning_y_hat != y_valid))
    result = {'dummy': dummy_y_hat_error, 'expert': expert_y_hat_error, 'machine_learning': machine_learning_y_hat_error}
    return result
    ### END SOLUTION

In [280]:
# y, X = extract_target_array_feature_matrix_titanic()
# X_wrangled = wrangle_feature_matrix_titanic(X)
# X_train, X_valid, y_train, y_valid = split_train_valid_titanic(X_wrangled, y)
# dummy_model_titanic = DummyModelTitanic()
# dummy_model_titanic.fit(y_train)
# dummy_y_hat = dummy_model_titanic.predict(X_valid)
# expert_model_titanic = ExpertModelTitanic()
# expert_model_titanic.fit(X_train)
# expert_y_hat = expert_model_titanic.predict(X_valid)
# machine_learning_model_titanic = MachineLearningModelTitanic()
# machine_learning_model_titanic.fit(X_train, y_train)
# LogisticRegression()
# machine_learning_y_hat = machine_learning_model_titanic.predict(X_valid)
# validate_model_performance_titanic(dummy_y_hat, expert_y_hat, machine_learning_y_hat, y_valid)

{'dummy': 128, 'expert': 100, 'machine_learning': 56}
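
Counting misclassifications reduces to comparing two arrays. The sketch below does so on five made-up labels and also derives the accuracy from the count, which is an extra step not required by the exercise.

```python
import numpy as np

# Hypothetical true labels and predictions
y_valid = np.array([0, 1, 1, 0, 1])
y_hat = np.array([0, 0, 1, 1, 1])

# True where the prediction is wrong; True counts as 1 when summed
n_errors = int(np.sum(y_hat != y_valid))

# Share of correct predictions
accuracy = 1 - n_errors / y_valid.size

print(n_errors, accuracy)
```

Because the dummy model draws its predictions at random without a seed, its error count changes from run to run, which is why the value in the docstring and the value in the cell output above differ.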

## 200. Predict `test.csv` at the path `/home/jovyan/data/titanic` with the `titanic` machine learning model

Define the function `predict_survived()` that predicts the `Survived` column of `test.csv` from `Age` and `Sex` using the machine-learning-based model, and exports the result to the working directory as `submission_titanic.csv`.

- Use the `import_titanic()` function.
- Use the `extract_target_array_feature_matrix_titanic()` function.
- Use the `wrangle_feature_matrix_titanic()` function.
- Use the `split_train_valid_titanic()` function.
- Use the `MachineLearningModelTitanic` class.
- Use `DataFrame.to_csv("submission_titanic.csv", index=False)`
- Write the expected output after `return`.

In [330]:
def predict_survived(X_test: pd.core.frame.DataFrame) -> pd.core.frame.DataFrame:
    """
    >>> train, test = import_titanic()
    >>> X_test = test[["PassengerId", "Sex", "Age"]]
    >>> predict_survived(X_test)
         PassengerId  Survived
    0            892         0
    1            893         1
    2            894         0
    3            895         0
    4            896         1
    ..           ...       ...
    413         1305         0
    414         1306         1
    415         1307         0
    416         1308         0
    417         1309         0

    [418 rows x 2 columns]
    >>> submission_csv = pd.read_csv("submission_titanic.csv")
    >>> submission_csv.shape
    (418, 2)
    """
    ### BEGIN SOLUTION
    y, X = extract_target_array_feature_matrix_titanic()
    X = wrangle_feature_matrix_titanic(X)
    X_train, X_valid, y_train, y_valid = split_train_valid_titanic(X, y)
    # A class defined elsewhere must be instantiated before its methods can be used
    machine_learning_model_titanic = MachineLearningModelTitanic()
    machine_learning_model_titanic.fit(X_train, y_train)
    X_test_data = X_test[['Sex', 'Age']].values
    X_test_data = wrangle_feature_matrix_titanic(X_test_data)
    result = machine_learning_model_titanic.predict(X_test_data)
    result_df = pd.DataFrame({'PassengerId': X_test['PassengerId'], 'Survived':result})
    result_df.to_csv('submission_titanic.csv', index=False)
    return result_df
    ### END SOLUTION

In [333]:
# train, test = import_titanic()
# X_test = test[["PassengerId", "Sex", "Age"]]
# predict_survived(X_test)
# submission_csv = pd.read_csv("submission_titanic.csv")
# submission_csv.shape

(418, 2)