# AdaBoost
* ***AdaBoost***, short for Adaptive Boosting, is a machine learning meta-algorithm formulated by Yoav Freund and Robert Schapire, who won the 2003 Gödel Prize for their work. It can be used in conjunction with many other types of learning algorithms to improve performance.
* The three main ideas behind the AdaBoost algorithm are:
    * ***AdaBoost*** combines many "***weak learners***" to make classifications. The weak learners are almost always ***stumps***, that is, decision trees with a single split.
    * Some stumps get more say in the classification than others.
    * Each stump is made by taking the previous stump's mistakes into account.
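
The three ideas above can be sketched in a few lines. This is a minimal illustration rather than the notebook's implementation: `weighted_vote` and the example `(prediction, say)` pairs are hypothetical names and made-up values, chosen only to show how a stump with a larger say can outvote several weaker ones.

```python
from collections import defaultdict

def weighted_vote(stump_votes):
    """Return the label whose supporting stumps have the largest total say.

    stump_votes: list of (predicted_label, say) pairs, one pair per stump.
    """
    totals = defaultdict(float)
    for label, say in stump_votes:
        totals[label] += say
    return max(totals, key=totals.get)

# Hypothetical votes: one strong stump says 'YES', two weaker stumps say 'NO'.
votes = [("YES", 0.97), ("NO", 0.25), ("NO", 0.40)]
print(weighted_vote(votes))
```

Here the single 'YES' vote wins because its say (0.97) is larger than the combined say of the two 'NO' votes (0.65).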


## Imports
* `pandas` to read the CSV file.
* `matplotlib.pyplot` to draw plots.
* `numpy` for numerical work and math functions such as `log` and `exp`.
* `random` to generate random numbers.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import random

In [2]:
df = pd.read_csv('./dataset/data.csv')
df

Unnamed: 0,Chest Pain,Blocked Arteries,Patient Weight,Heart Disease
0,YES,YES,205,YES
1,NO,YES,180,YES
2,YES,NO,210,YES
3,YES,YES,167,YES
4,NO,YES,156,NO
5,NO,YES,125,NO
6,YES,NO,168,NO
7,YES,YES,172,NO


### Inserting the Sample Weight Column
* Each row gets a sample weight that determines how much it influences the classification steps that follow. All rows start with the same weight, $1 / N$ (here $1 / 8 = 0.125$).

In [3]:
default_value = float(1 / len(df.get('Chest Pain')))
df.insert(4, "Sample Weight", default_value)
df

Unnamed: 0,Chest Pain,Blocked Arteries,Patient Weight,Heart Disease,Sample Weight
0,YES,YES,205,YES,0.125
1,NO,YES,180,YES,0.125
2,YES,NO,210,YES,0.125
3,YES,YES,167,YES,0.125
4,NO,YES,156,NO,0.125
5,NO,YES,125,NO,0.125
6,YES,NO,168,NO,0.125
7,YES,YES,172,NO,0.125


In [4]:
class Stump():
    def __init__(self, df, col_name):
        self.is_chosen_stump = False
        self.df = df
        self.col_name = col_name
        self.incorrect_guess_row_indices = []
        self.TP, self.TN, self.FP, self.FN = self.__compute_elements()
        self.Gini_index = self.__compute_Gini_index()
        self.total_error = self.__compute_total_error()
        self.impact = self.__compute_impact(self.total_error)
        
    def __compute_elements(self):
        '''Count TP, TN, FP, and FN for this stump and record the misclassified rows.

        The stump predicts 'YES' when the feature value is 'YES'
        (for 'Patient Weight', when the weight is greater than 176).'''
        col_data = self.df.get(self.col_name)
        result_data = self.df.get('Heart Disease')

        TP, TN, FP, FN = 0, 0, 0, 0

        for idx, value in enumerate(col_data):
            if self.col_name != 'Patient Weight':
                predicted_yes = value == 'YES'
            else:
                predicted_yes = value > 176
            actual_yes = result_data[idx] == 'YES'

            if predicted_yes and actual_yes:
                TP += 1
            elif predicted_yes:
                # Predicted 'YES' but the patient has no heart disease.
                FP += 1
                self.incorrect_guess_row_indices.append(idx)
            elif actual_yes:
                # Predicted 'NO' but the patient has heart disease.
                FN += 1
                self.incorrect_guess_row_indices.append(idx)
            else:
                TN += 1

        return TP, TN, FP, FN

    def __compute_Gini_index(self):
        '''Compute the Gini index of the stump as the size-weighted average of its two leaves.'''
        # Leaf of rows predicted 'YES': TP are actually 'YES', FP are actually 'NO'.
        yes_leaf_size = self.TP + self.FP
        yes_leaf_prob_yes = float(self.TP / yes_leaf_size)
        yes_leaf_prob_no = float(self.FP / yes_leaf_size)
        yes_leaf_Gini_index = 1 - (yes_leaf_prob_yes ** 2) - (yes_leaf_prob_no ** 2)

        # Leaf of rows predicted 'NO': FN are actually 'YES', TN are actually 'NO'.
        no_leaf_size = self.FN + self.TN
        no_leaf_prob_yes = float(self.FN / no_leaf_size)
        no_leaf_prob_no = float(self.TN / no_leaf_size)
        no_leaf_Gini_index = 1 - (no_leaf_prob_yes ** 2) - (no_leaf_prob_no ** 2)

        Gini_index = (yes_leaf_Gini_index * yes_leaf_size + no_leaf_Gini_index *
                      no_leaf_size) / (yes_leaf_size + no_leaf_size)
        return round(Gini_index, 2)
    
    def __compute_total_error(self):
        '''Compute the total error of the stump: the sum of the sample
        weights of the rows whose prediction differs from the actual
        "Heart Disease" value.'''
        total_error = 0
        for idx in self.incorrect_guess_row_indices:
            total_error += self.df.get('Sample Weight')[idx]
        return total_error
    
    def __compute_impact(self, total_error):
        '''Compute the stump's impact (its say in the final classification).'''
        # Shift the error slightly so a perfect stump (total error 0) does not divide by zero.
        total_error += 0.001
        return round((1 / 2) * np.log((1 - total_error) / total_error), 2)
    
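    def __normalize_sample_weights_column(self):
        '''Rescale the sample weights so that they sum to approximately 1.'''
        # Work on self.df (the DataFrame this stump was built from), not on a global.
        total_sum = self.df.get('Sample Weight').sum()
        self.df['Sample Weight'] = round(self.df.get("Sample Weight") / total_sum, 2)
        return
    
    def update_stupms_samples_weights(self):
        '''Raise the weights of misclassified rows, lower the others, then normalize.'''
        for idx in range(len(self.df.get('Sample Weight'))):
            old_weight = self.df.get('Sample Weight')[idx]
            if idx in self.incorrect_guess_row_indices:
                new_weight = round(np.exp(self.impact) * old_weight, 2)
            else:
                new_weight = round(np.exp(-self.impact) * old_weight, 2)
            # .loc writes into the DataFrame itself and avoids chained assignment.
            self.df.loc[idx, 'Sample Weight'] = new_weight
                
        self.__normalize_sample_weights_column()
        return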

In [5]:
features_columns = ["Chest Pain", "Blocked Arteries", "Patient Weight"]
Gini_indices = []
stumpts_list = []

for col_name in features_columns:
    s = Stump(df, col_name)
    stumpts_list.append(s)
    Gini_index = s.Gini_index
    Gini_indices.append(Gini_index)
    print(f'Column "{col_name}" has Gini index: {Gini_index}')

Column "Chest Pain" has Gini index: 0.47
Column "Blocked Arteries" has Gini index: 0.5
Column "Patient Weight" has Gini index: 0.2


### Selecting the lowest Gini index
* Now that the Gini indices are computed, the feature with the lowest Gini index, which separates the data best, is selected as the first stump.
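
As a check on the numbers above, the Gini index of a stump can be recomputed by hand from the class counts in its two leaves. The sketch below is a standalone illustration; the helper names `gini_impurity` and `weighted_gini` are hypothetical and not part of the notebook, and the counts are read off the eight-row table.

```python
def gini_impurity(n_yes, n_no):
    """Gini impurity of one leaf, given how many rows of each class it holds."""
    total = n_yes + n_no
    if total == 0:
        return 0.0  # an empty leaf contributes nothing
    p_yes = n_yes / total
    p_no = n_no / total
    return 1 - p_yes ** 2 - p_no ** 2

def weighted_gini(leaf_a, leaf_b):
    """Size-weighted average impurity of two leaves, each given as (n_yes, n_no)."""
    size_a, size_b = sum(leaf_a), sum(leaf_b)
    return (gini_impurity(*leaf_a) * size_a
            + gini_impurity(*leaf_b) * size_b) / (size_a + size_b)

# Patient Weight > 176:  3 patients with heart disease, 0 without.
# Patient Weight <= 176: 1 patient with heart disease, 4 without.
print(round(weighted_gini((3, 0), (1, 4)), 2))

# Chest Pain YES: 3 with heart disease, 2 without; Chest Pain NO: 1 with, 2 without.
print(round(weighted_gini((3, 2), (1, 2)), 2))
```

The first leaf of the "Patient Weight" split is pure, which is why that feature ends up with the lowest index.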

In [6]:
min_Gini_index = min(Gini_indices)
min_Gini_index_col_name = features_columns[Gini_indices.index(min_Gini_index)]
print(f"Minimum Gini index belongs to '{min_Gini_index_col_name}' with index: {min_Gini_index}, So the first stump will be it.")

Minimum Gini index belongs to 'Patient Weight' with index: 0.2, So the first stump will be it.


In [7]:
def find_chosen_stump(stumpts_list, min_Gini_index_col_name):
    for stump in stumpts_list:
        if stump.col_name == min_Gini_index_col_name:
            return stump        

In [8]:
chosen_stump = find_chosen_stump(stumpts_list, min_Gini_index_col_name)
chosen_stump.is_chosen_stump = True

### Misclassified Data According to The First Stump

In [9]:
chosen_stump.total_error

0.125

### Stump Impact
* Now that the stump is chosen, we need to compute its impact, which requires the stump's total error ($\log$ is the natural logarithm):
    > $$ Impact = \frac{1}{2}\log\left(\frac{1 - Total\ Error}{Total\ Error}\right) $$
    
* The total error is the sum of the sample weights of the rows that the stump misclassified. The larger the total error, the smaller the stump's impact.

* The impact plot is shaped as below:

![impact plot](./plots/impact_plot.png)
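
To make the formula concrete, the sketch below evaluates it for the three total errors that appear in this notebook. The function name `stump_impact` and its `eps` parameter are illustrative; the `0.001` shift mirrors the one used in the `Stump` class to avoid dividing by zero when the total error is 0.

```python
import math

def stump_impact(total_error, eps=0.001):
    """0.5 * ln((1 - e) / e), where e is the total error shifted by eps."""
    e = total_error + eps
    return 0.5 * math.log((1 - e) / e)

errors = [("Patient Weight", 0.125), ("Chest Pain", 0.375), ("Blocked Arteries", 0.5)]
for name, err in errors:
    print(f"{name}: total error {err} -> impact {stump_impact(err):.2f}")
```

A total error near 0.5 means the stump is no better than a coin flip, so its impact is close to zero; a small total error gives a large positive impact.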

### Impact Plot

In [10]:
def impact(total_error):
    return (1 / 2) * np.log((1 - total_error) / total_error)

def plot_impact():
    x = np.asarray(list(range(1, 1000)))
    x = x / 1000.0
    y = impact(x)
    plt.title("Impact Function Plot")
    plt.plot(x, y)
    plt.xlabel("Total Error")
    plt.ylabel("Impact")
    plt.savefig('./plots/impact_plot.png')
    return

In [11]:
# plot_impact()

### Chosen Stump Impact
* Now that the total error is calculated for the first stump, it's time to calculate its impact.

In [12]:
chosen_stump_impact = chosen_stump.impact
print(f"Chosen stump impact is: {chosen_stump_impact}")

Chosen stump impact is: 0.97


#### Other Stumps Impact
* Although we know that the first stump has the most impact, computing the other stumps' impacts helps illustrate the concept better.

In [13]:
for stump in stumpts_list:
    if stump != chosen_stump:
        print(f"Stump {stump.col_name} has total error: {stump.total_error} and impact: {stump.impact}")

Stump Chest Pain has total error: 0.375 and impact: 0.25
Stump Blocked Arteries has total error: 0.5 and impact: -0.0


### New Sample Weights
* The sample weights of the incorrectly classified samples determine the impact each stump gets.
* In the first step all sample weights were equal, so no sample was emphasized over another. Now that we know which samples were misclassified and have calculated the impact, we update the sample weights for the next steps:

* Increase the sample weight of each sample that was incorrectly classified.
    > $$ Weight_{(new)} = Weight_{(old)} \times e^{(impact)} $$
* Decrease the sample weight of each sample that was correctly classified.
    > $$ Weight_{(new)} = Weight_{(old)} \times e^{(-impact)} $$
* Finally, divide every new weight by the sum of the new weights so that they add up to 1 again.

* New weight of an incorrectly classified sample as a function of impact ($e^{impact}$, old weight 1):

![chosen_update weight plot](./plots/update_chosen.png)


* New weight of a correctly classified sample as a function of impact ($e^{-impact}$, old weight 1):

![none_chosen_update weight plot](./plots/update_none_chosen.png)
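
The update rule can be tried on the first stump's numbers: eight equal weights of 0.125, one misclassified row (index 3), and an impact of 0.97. This is a minimal sketch with a hypothetical helper `update_weights`; unlike the `Stump` class it does not round intermediate values, so its results can differ from the notebook's table in the second decimal place.

```python
import math

def update_weights(weights, misclassified, impact):
    """Scale misclassified rows by e^impact and the rest by e^-impact, then normalize."""
    scaled = [
        w * math.exp(impact if i in misclassified else -impact)
        for i, w in enumerate(weights)
    ]
    total = sum(scaled)
    return [w / total for w in scaled]

weights = [0.125] * 8
new_weights = update_weights(weights, misclassified={3}, impact=0.97)
print([round(w, 2) for w in new_weights])
```

After the update the single misclassified row carries roughly half of the total weight, so the next stump is pushed towards classifying it correctly.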

In [14]:
def chosen_stump_update_weight(impact):
    return np.exp(impact) * 1

def unchosen_stump_update_weight(impact):
    return np.exp(-impact) * 1

def plot_update_weights():
    x = np.asarray(list(range(1, 1000)))
    x = x / 250.0
    y_chosen = chosen_stump_update_weight(x)
    y_not_chosen = unchosen_stump_update_weight(x)
    plt.plot(x, y_chosen)
    plt.xlabel("Impact with constant Old Weight 1")
    plt.ylabel("New Weight (Chosen Stump)")
    plt.savefig('./plots/update_chosen.png')
    plt.show()
    
    plt.plot(x, y_not_chosen)
    plt.xlabel("Impact with constant Old Weight 1")
    plt.ylabel("New Weight (not-Chosen Stump)")
    plt.savefig('./plots/update_none_chosen.png')
    plt.show()
    return

In [15]:
# plot_update_weights()

In [16]:
chosen_stump.update_stupms_samples_weights()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


### Updated Data

In [17]:
df

Unnamed: 0,Chest Pain,Blocked Arteries,Patient Weight,Heart Disease,Sample Weight
0,YES,YES,205,YES,0.07
1,NO,YES,180,YES,0.07
2,YES,NO,210,YES,0.07
3,YES,YES,167,YES,0.49
4,NO,YES,156,NO,0.07
5,NO,YES,125,NO,0.07
6,YES,NO,168,NO,0.07
7,YES,YES,172,NO,0.07


### Creating Second Stump
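* The first step is to generate a new dataset. To do so, we repeatedly generate a random number between 0 and 1 and pick the row whose cumulative Sample Weight range contains that number, treating the Sample Weights as a probability distribution. Rows with larger weights are therefore picked more often.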
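
The sampling idea can be sketched independently of the DataFrame code. The version below is an illustration under stated assumptions, not the notebook's function: `resample_indices` is a hypothetical helper, it uses NumPy's `cumsum` and `searchsorted` instead of explicit loops, and it rescales the cumulative sums so the last boundary is 1 even when rounded weights add up to slightly less (0.98 in the table above). Without that rescaling, a draw above the last boundary would match no row and the new dataset would come out shorter than the original.

```python
import numpy as np

def resample_indices(weights, rng):
    """Draw len(weights) row indices with replacement, proportionally to weights."""
    weights = np.asarray(weights, dtype=float)
    # Inclusive cumulative sums, rescaled so the final boundary is 1.
    boundaries = np.cumsum(weights) / weights.sum()
    draws = rng.random(len(weights))
    # For each draw, the index of the first boundary strictly greater than it.
    idx = np.searchsorted(boundaries, draws, side="right")
    # Guard against floating-point round-off at the upper end.
    return np.minimum(idx, len(weights) - 1)

rng = np.random.default_rng(0)  # fixed seed only to make the sketch repeatable
weights = [0.07, 0.07, 0.07, 0.49, 0.07, 0.07, 0.07, 0.07]
picked = resample_indices(weights, rng)
print(picked)
```

The returned indices could then be used to build the new dataset, for example with `df.iloc[picked].reset_index(drop=True)`.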

In [18]:
def generate_new_samples(df):
    sum_ranges = []
    selected_rows = []
    generated_numbers = []
    for idx, i in enumerate(df.get("Sample Weight")):
        sum_ranges.append(round(df.get("Sample Weight")[:idx].sum(), 2))
    for i in range(len(df.get("Sample Weight"))):
        generated_num = random.random()
        generated_numbers.append(generated_num)
    
    for j in generated_numbers:
        for idx, k in enumerate(sum_ranges):
            if j < k:
                selected_rows.append(df.iloc[idx - 1])
                break
            
    df = pd.DataFrame(columns=df.columns)
    
    for idx, i in enumerate(selected_rows):
        df = pd.DataFrame(np.insert(df.values, idx, list(i), axis=0), columns=df.columns)
    df['Sample Weight'] = 1 / len(df.get('Sample Weight'))
    return df

In [19]:
new_df = generate_new_samples(df)

In [20]:
new_df

Unnamed: 0,Chest Pain,Blocked Arteries,Patient Weight,Heart Disease,Sample Weight
0,YES,YES,205,YES,0.142857
1,NO,YES,125,NO,0.142857
2,YES,YES,167,YES,0.142857
3,YES,YES,167,YES,0.142857
4,YES,YES,167,YES,0.142857
5,YES,YES,167,YES,0.142857
6,NO,YES,125,NO,0.142857


### Choosing Second Stump

In [21]:
def create_stump_forest(df, num_stumps):
    features_columns = ["Chest Pain", "Blocked Arteries", "Patient Weight"]
    forest = []
    for t in range(num_stumps):
        Gini_indices = []
        stumpts_list = []

        for col_name in features_columns:
            s = Stump(df, col_name)
            stumpts_list.append(s)
            Gini_index = s.Gini_index
            Gini_indices.append(Gini_index)
            print(f'STEP {t}: ===> Column "{col_name}" has Gini index: {Gini_index}')

        min_Gini_index = min(Gini_indices)
        min_Gini_index_col_name = features_columns[Gini_indices.index(min_Gini_index)]
        print(f"Minimum Gini index belongs to '{min_Gini_index_col_name}' with index: {min_Gini_index}, So the first stump will be it.")

        chosen_stump = find_chosen_stump(stumpts_list, min_Gini_index_col_name)
        chosen_stump.is_chosen_stump = True
        print(f"Chosen stump total error: {chosen_stump.total_error}")

        chosen_stump_impact = chosen_stump.impact
        print(f"Chosen stump impact is: {chosen_stump_impact}\n\n")

        chosen_stump.update_stupms_samples_weights()
        
        forest.append(chosen_stump)
        
        df = generate_new_samples(df)
    return forest

In [28]:
df = pd.read_csv('./dataset/data.csv')
default_value = float(1 / len(df.get('Chest Pain')))
df.insert(4, "Sample Weight", default_value)
stump_forest = create_stump_forest(df, 3)

STEP 0: ===> Column "Chest Pain" has Gini index: 0.47
STEP 0: ===> Column "Blocked Arteries" has Gini index: 0.5
STEP 0: ===> Column "Patient Weight" has Gini index: 0.2
Minimum Gini index belongs to 'Patient Weight' with index: 0.2, So the first stump will be it.
Chosen stump total error: 0.125
Chosen stump impact is: 0.97


STEP 1: ===> Column "Chest Pain" has Gini index: 0.21
STEP 1: ===> Column "Blocked Arteries" has Gini index: 0.43
STEP 1: ===> Column "Patient Weight" has Gini index: 0.38
Minimum Gini index belongs to 'Chest Pain' with index: 0.21, So the first stump will be it.
Chosen stump total error: 0.14285714285714285
Chosen stump impact is: 0.89


STEP 2: ===> Column "Chest Pain" has Gini index: 0.34
STEP 2: ===> Column "Blocked Arteries" has Gini index: 0.34
STEP 2: ===> Column "Patient Weight" has Gini index: 0.38
Minimum Gini index belongs to 'Chest Pain' with index: 0.34, So the first stump will be it.
Chosen stump total error: 0.2857142857142857
Chosen stump impact is

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [29]:
for stump in stump_forest:
    print(stump.col_name)

Patient Weight
Chest Pain
Chest Pain


### Conclusion
* In this run the first stump splits on "Patient Weight", which has the lowest Gini index on the original data (0.2) and the largest impact (0.97). The two later stumps, built on resampled data, split on "Chest Pain". Because the resampling is random, the later stumps can differ from run to run.