# Adaboost Lab

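## Preparation
### Environment setup
Make sure the following dependencies are installed, then run the code below to import them and verify the installation.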

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


### 数据集准备
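We will use the following dataset to train Adaboost.

This is the same dataset used in the decision tree part. It contains 7 features and one label, "whether the person is suited to pursuing a PhD". The features cover various conditions relevant to pursuing a PhD, such as "love doing research" and "I absolutely want to be a college professor".

Run the code below to load the dataset.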


In [2]:
# read the training and test datasets
train_data = pd.read_csv('train_phd_data.csv')
test_data = pd.read_csv('test_phd_data.csv')

# translate labels from {0, 1} to {-1, 1}
# 0 becomes -1, 1 stays 1
train_data.iloc[:, -1] = train_data.iloc[:, -1].map({0: -1, 1: 1})
test_data.iloc[:, -1] = test_data.iloc[:, -1].map({0: -1, 1: 1})

## Adaboost (15 pts)

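In the previous lab, you built a Decision Tree. In this part, you can reuse that code to learn about and train an Adaboost model.

In this Adaboost model, we use a one-level decision tree (a decision stump) as the weak learner and the Gini impurity as the splitting criterion.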
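
To make the splitting criterion concrete, here is a minimal sketch of a Gini impurity that takes sample weights into account. The helper names `weighted_gini` and `split_gini` and the toy data are assumptions made for illustration; they are not part of the lab skeleton.

```python
import numpy as np

def weighted_gini(labels, weights):
    """Gini impurity of one node, where each sample counts with its weight."""
    labels = np.asarray(labels)
    weights = np.asarray(weights, dtype=float)
    total = weights.sum()
    if total == 0:
        return 0.0
    impurity = 1.0
    for label in np.unique(labels):
        # weighted share of this class inside the node
        p = weights[labels == label].sum() / total
        impurity -= p ** 2
    return impurity

def split_gini(feature_values, labels, weights):
    """Impurity after splitting on one categorical feature:
    the child impurities averaged by each child's share of the total weight."""
    feature_values = np.asarray(feature_values)
    labels = np.asarray(labels)
    weights = np.asarray(weights, dtype=float)
    total = weights.sum()
    score = 0.0
    for value in np.unique(feature_values):
        mask = feature_values == value
        share = weights[mask].sum() / total
        score += share * weighted_gini(labels[mask], weights[mask])
    return score

# Toy data, made up for illustration only.
feature = ["yes", "yes", "no", "no"]
labels = [1, 1, -1, 1]
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.1, 0.1, 0.7, 0.1]

print(split_gini(feature, labels, uniform))
print(split_gini(feature, labels, skewed))
```

With uniform weights the "no" branch is an even mix of both classes. Once most of the weight sits on the single negative sample, the same branch looks purer, so the score of the split drops. This is how the sample weights steer which feature a weak learner picks.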

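Complete the following classes and implement the corresponding functions:

1. **weakClassifier()**: a one-level decision tree, including `best_split()` and `predict()`. You can refer to your code from the previous lab.

2. **Adaboost()**: includes the collection of weak learners, the fitting procedure `fit()`, and the prediction procedure `predict()`.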
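
Before filling in the classes, it may help to see the arithmetic of a single boosting round in isolation. The sketch below assumes labels and predictions in {-1, 1}. The function names `boosting_round` and `weighted_vote` and the small `eps` guard are illustrative assumptions, not part of the lab skeleton.

```python
import numpy as np

def boosting_round(y, pred, w, eps=1e-10):
    """One AdaBoost round: weighted error, classifier weight alpha,
    and the re-normalized sample weights."""
    y = np.asarray(y)
    pred = np.asarray(pred)
    w = np.asarray(w, dtype=float)
    # weighted error: total weight of the misclassified samples
    error = np.sum(w * (pred != y))
    # assumption: clip the error so that alpha stays finite when it is 0 or 1
    error = np.clip(error, eps, 1 - eps)
    alpha = 0.5 * np.log((1 - error) / error)
    # y * pred is +1 for correct samples and -1 for wrong ones,
    # so wrong samples are scaled up and correct ones scaled down
    new_w = w * np.exp(-alpha * y * pred)
    return alpha, new_w / new_w.sum()

def weighted_vote(alphas, preds):
    """Final label: sign of the alpha-weighted sum of the weak predictions."""
    score = np.dot(np.asarray(alphas), np.asarray(preds))
    return np.where(score > 0, 1, -1)

# Toy example: four samples, the weak learner gets the last one wrong.
y = np.array([1, 1, -1, -1])
pred = np.array([1, 1, -1, 1])
w = np.full(4, 0.25)

alpha, w_next = boosting_round(y, pred, w)
print(alpha)
print(w_next)

# Three hypothetical weak learners voting on two samples.
votes = weighted_vote([0.5, 0.2, 0.2], [[1, -1], [-1, 1], [-1, 1]])
print(votes)
```

In the toy example the misclassified sample ends up carrying half of the total weight after normalization, so the next weak learner is pushed to get it right.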


In [3]:
class weakClassifier:
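    def __init__(self):
        # one-level tree stored as {feature: {feature_value: label}}
        self.tree = None
        # weight of this classifier in the ensemble, set during Adaboost.fit()
        self.alpha = None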

    # Use the Gini impurity to find the best feature to split on.
    # Note: sample_weight must be taken into account when computing the Gini impurity.
    def best_split(self, X, y, sample_weight):
        '''
            find the best feature to split the data on, based on the weighted gini impurity

            Args:
                X: the features of the data
                y: the labels of the data
                sample_weight: the weight of each sample

            Returns:
                best_feature: the best feature to split the data (None if X has no columns)
                best_Series: Series, the data set after splitting
        '''

        # TODO: implement the function to find the best feature to split the data based on the gini impurity
        # copy X so that the caller's DataFrame is not modified
        data = pd.DataFrame(X).copy()
        data['sample_weight'] = np.asarray(sample_weight)
        data['label'] = np.asarray(y)
        labels = data['label'].unique()
        total_weight = data['sample_weight'].sum()

        best_gini = float('inf')
        best_feature, best_Series = None, None
        for i in range(X.shape[1]):
            series = self.split_data(data, i)
            total_gini = 0

            for j in range(len(series)):
                df = series.iloc[j]
                df_weight = df['sample_weight'].sum()
                # weighted gini impurity of this subset: 1 - sum_k p_k^2
                weighted_gini = 1
                for label in labels:
                    proportion = df[df['label'] == label]['sample_weight'].sum() / df_weight
                    weighted_gini -= proportion ** 2
                # weight each subset by its share of the total sample weight
                total_gini += weighted_gini * df_weight / total_weight

            if total_gini < best_gini:
                best_gini = total_gini
                best_feature = data.columns[i]
                best_Series = series

        return best_feature, best_Series

    def split_data(self, data, column):
        '''
            split the data set according to the feature column

            Args:
                data: the data set; the feature columns come first, followed by the sample_weight and label columns
                column: the positional index of the feature column
            Returns:
                splt_datas: Series, the data set after splitting (one DataFrame per feature value)
        '''
        # 1. construct an object-dtype Series to save the data set after splitting
        splt_datas = pd.Series(dtype=object)
        # 2. get the unique values of the feature column
        str_values = data.iloc[:, column].unique()
        # 3. find the data set corresponding to each unique value
        for i in range(len(str_values)):
            df = data.loc[data.iloc[:, column] == str_values[i]]
            splt_datas[str(i)] = df
        return splt_datas


    def fit(self, X, y, sample_weight):
        '''
            fit the data to the decision tree

            Args:
                X: the features of the data
                y: the labels of the data
                sample_weight: the weight of each sample

            Returns:
                None, but self.tree should be updated
        '''
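        best_feature, best_splits = self.best_split(X, y, sample_weight)

        if best_feature is None:
            return

        # TODO: Create the tree as a nested dictionary: {feature: {feature_value: label}}
        tree = {best_feature: {}}
        for j in range(len(best_splits)):
            split_data = best_splits.iloc[j]
            # every row in this subset shares the same value of best_feature
            value = split_data[best_feature].iloc[0]
            # label the leaf with the class that carries the largest total sample weight,
            # so that the weak learner actually responds to the boosting weights
            label_weights = split_data.groupby('label')['sample_weight'].sum()
            tree[best_feature][value] = label_weights.idxmax()

        self.tree = tree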


    def predict(self, x):
        '''
        predict the label of the data

        Args:
            x: the features of the data
        Return:
            predict_labels: the predicted labels of the data, each -1 or 1
        '''

        # Store the results
        predict_labels = []

        # a one-level tree has exactly one split feature
        feature = list(self.tree.keys())[0]
        branches = self.tree[feature]

        # predict the label of each sample
        for i in range(len(x)):
            sample = x.iloc[i, :]

            # TODO: predict the label of the sample
            # keep the label in {-1, 1}: Adaboost relies on y * prediction
            predict_labels.append(branches[sample[feature]])

        return predict_labels



In [4]:
class Adaboost:

    def __init__(self, n_estimators=10):

        # the number of weak classifiers
        self.n_estimators = n_estimators
        # the list of fitted weak classifiers
        self.clfs = []

    # AdaBoost training process
    def fit(self, X, y):
        n_samples, m_features = X.shape
        # plain array copy of the labels, indexed by position rather than by pandas index
        y_arr = np.asarray(y)

        # initialize weights uniformly
        w = np.ones(n_samples) / n_samples

        # for each weak classifier
        for _ in range(self.n_estimators):
            clf = weakClassifier()

            # 1. fit the weak classifier
            clf.fit(X, y, w)

            # TODO: 2. predict the label of the data using the weak classifier
            predict = np.array(clf.predict(X))

            # TODO: 3. Calculate the weighted error
            error = np.sum(w * (predict != y_arr))
            # keep the error away from 0 and 1 so that alpha stays finite
            error = np.clip(error, 1e-10, 1 - 1e-10)

            # TODO: 4. Calculate alpha
            alpha = np.log((1 - error) / error) / 2
            # TODO: 5. Update weights: misclassified samples gain weight
            w = w * np.exp(-alpha * y_arr * predict)
            # normalize to one
            w /= np.sum(w)

            # save classifier and weight
            clf.alpha = alpha
            self.clfs.append(clf)


    def predict(self, X):
        '''
        predict the label of the data

        Args:
            X: the features of the data
        Return:
            y_pred: the predict labels of the data
        '''

        #TODO: 1. compute the predict labels of the data using all weak classifiers
        wc_predict = np.array([clf.predict(X) for clf in self.clfs])

        #TODO: 2. compute the weighted sum of the predict labels
        alpha = np.array([clf.alpha for clf in self.clfs])
        predict = np.dot(alpha, wc_predict)

        #TODO: 3. get the label of the data by sign function (if x>0 return 1, else return -1)
        return np.where(predict > 0, 1, -1)


In [5]:
adaboost_model = Adaboost(n_estimators=10)
# fit the model
adaboost_model.fit(train_data.iloc[:, :-1], train_data.iloc[:, -1])

# TODO: predict the test data
predict = adaboost_model.predict(test_data.iloc[:, :-1])

# TODO: calculate the accuracy of test data
accuracy = ((predict == test_data.iloc[:, -1]).sum()) / len(test_data)
print("The accuracy of Adaboost is: ", accuracy)

The accuracy of Adaboost is:  1.0
