****

**Outline**
- Hard voting
  - Explanation
  - Coding example
- Bagging
  - Explanation
  - Coding example
- Random Forest
  - Basic RF Algorithm
  - Randomizing the feature choice
  - Stroke prediction using RF

****

# Hard voting
## Explanation
Hard voting combines the predictions of multiple models into a single final prediction. Each model casts one equally weighted vote for a class label, and the ensemble predicts the label that receives the majority of the votes.  
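
The majority-vote rule itself can be sketched without any library. This is a minimal illustration rather than scikit-learn's implementation; the function name `hard_vote` and the toy predictions are assumptions made up for the example.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the majority class for each sample.

    `predictions` holds one list of predicted labels per model,
    all of the same length (one entry per sample).
    """
    voted = []
    # zip(*predictions) regroups the labels sample by sample
    for sample_votes in zip(*predictions):
        # most_common(1) returns [(label, count)] for the top label;
        # on a tie, Counter keeps the label that was counted first
        label, _ = Counter(sample_votes).most_common(1)[0]
        voted.append(label)
    return voted

# Toy predictions from three hypothetical models on four samples
model_a = [0, 1, 2, 1]
model_b = [0, 1, 1, 1]
model_c = [0, 2, 2, 0]

print(hard_vote([model_a, model_b, model_c]))  # [0, 1, 2, 1]
```

On a tie this sketch keeps whichever label it counted first. scikit-learn's `VotingClassifier` documents its own rule for ties (it selects by ascending sort order of the class labels), so the two can disagree in that case.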

## Coding example

In [8]:
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

In [5]:
# Load the Iris dataset
iris = load_iris()

# Define three different classifiers
clf1 = LogisticRegression()
clf2 = GaussianNB()
clf3 = DecisionTreeClassifier()

# Define a voting classifier that uses hard voting
voting_clf = VotingClassifier(estimators=[('lr', clf1), ('nb', clf2), ('dt', clf3)], voting='hard')

# Fit the voting classifier to the data
voting_clf.fit(iris.data, iris.target)

# Make predictions using the voting classifier
preds = voting_clf.predict(iris.data)

# Print the training accuracy of the voting classifier
# (it is scored on the same data it was fit on)
print("Accuracy:", voting_clf.score(iris.data, iris.target))


Accuracy: 0.98


Note: this run also emits a scikit-learn `ConvergenceWarning`, because `LogisticRegression` stopped at its iteration limit ("STOP: TOTAL NO. of ITERATIONS REACHED LIMIT"). The warning suggests increasing the number of iterations (`max_iter`) or scaling the data as shown in https://scikit-learn.org/stable/modules/preprocessing.html, and points to https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression for alternative solver options.


# Bagging

## Explanation
The term bagging refers to bootstrap aggregating. It is a machine learning technique that builds multiple models from bootstrap samples of the training data and combines their predictions through a voting or averaging mechanism. The idea behind bagging is to reduce the variance of the individual models and improve the overall performance of the ensemble model.

In bagging, each model in the ensemble is trained on a bootstrap sample: a random sample drawn from the training data with replacement. Repeating the draw produces multiple resampled versions of the training data, which can be used to train multiple models in parallel. Each model is trained on a different sample and produces its own set of predictions.
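
To make sampling with replacement concrete, here is a small sketch using only the standard library. The helper name `bootstrap_sample` and the ten-item stand-in dataset are assumptions for illustration.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in range(len(data))]

rng = random.Random(0)  # fixed seed so the run is repeatable
data = list(range(10))  # stand-in for ten training examples

for b in range(3):
    sample = bootstrap_sample(data, rng)
    unique = len(set(sample))
    # Sampling with replacement repeats some examples and leaves others out
    print(f"sample {b}: {sorted(sample)} ({unique} distinct examples)")
```

Each sample has the same size as the original data but usually contains duplicates. For large training sets, roughly 63% of the original examples are expected to appear in any one bootstrap sample, since the chance that a given example is never drawn approaches $1/e$.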

Once all the models are trained, their predictions are combined to make a final prediction. This can be done using a simple majority vote (for classification problems) or an average (for regression problems). By combining the predictions of multiple models, we can reduce the risk of overfitting and improve the generalization performance of the ensemble.

Bagging is commonly used with decision trees. Random Forest builds on bagged trees: in addition to training each tree on a random sample of the training data, it restricts the tree to a random subset of the features when choosing splits. This further helps to reduce the variance of the ensemble and improve its accuracy.  

## Coding example

In [6]:
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

In [7]:
# Load the Iris dataset
iris = load_iris()

# Define a decision tree classifier
clf = DecisionTreeClassifier()

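# Define a bagging classifier with 10 decision tree classifiers
# (the estimator is passed positionally: its keyword is `estimator` in
# scikit-learn 1.2+ and `base_estimator` in older releases)
bagging_clf = BaggingClassifier(clf, n_estimators=10)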

# Fit the bagging classifier to the data
bagging_clf.fit(iris.data, iris.target)

# Make predictions using the bagging classifier
preds = bagging_clf.predict(iris.data)

# Print the training accuracy of the bagging classifier
# (it is scored on the same data it was fit on)
print("Accuracy:", bagging_clf.score(iris.data, iris.target))


Accuracy: 0.98




# Random Forest

## Basic RF Algorithm

Given a training set of size $m$:  

```
for b = 1 to B: 
     use sampling with replacement to create a new training set of size m
     train a decision tree on the new training set
```

For example, given a dataset of 10 examples, we sample it with replacement to get a training set that also contains 10 examples. Some examples may be sampled more than once, but that is fine. We then train a decision tree on this new training set (e.g. one that splits on ear shape). Next, we resample and train another decision tree (e.g. one that splits on face shape). We do this $B$ times. $B$ is usually between 64 and 228. Finally, all the trees vote, and the majority vote becomes the final prediction.  
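
The loop described above can be sketched end to end with the standard library. To keep the example short, a one-split "stump" stands in for a full decision tree, which is a simplification and not what a real Random Forest trains. The names `train_stump`, `train_bagged_ensemble`, and `predict`, and the toy data, are assumptions made up for this sketch.

```python
import random
from collections import Counter

def train_stump(points):
    """Fit a one-split 'tree' on (x, label) pairs: predict the majority
    label on each side of the threshold with the fewest training errors."""
    best = None
    for threshold in sorted({x for x, _ in points}):
        left = [y for x, y in points if x <= threshold]
        right = [y for x, y in points if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        # If nothing falls to the right, reuse the left label
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label

def train_bagged_ensemble(points, B, rng):
    """For b = 1..B: bootstrap a training set of the same size, fit a stump."""
    m = len(points)
    return [train_stump([rng.choice(points) for _ in range(m)])
            for _ in range(B)]

def predict(ensemble, x):
    """Majority vote over all models in the ensemble."""
    return Counter(model(x) for model in ensemble).most_common(1)[0][0]

# Toy data (invented for this sketch): the label is 1 when x is large
points = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0), (2.5, 0),
          (3.0, 1), (3.5, 1), (4.0, 1), (4.5, 1), (5.0, 1)]

ensemble = train_bagged_ensemble(points, B=64, rng=random.Random(0))
print(predict(ensemble, 0.7), predict(ensemble, 4.8))
```

Individual stumps can differ because each sees a different resampled training set; the vote smooths out those differences.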



## Randomizing the feature choice

One modification further randomizes the feature choice at each node:  
at every node, when choosing a feature to split on, if $n$ features are available, rather than picking from all $n$ features we pick a random subset of $K < n$ features and allow the algorithm to choose only from that subset.  
A typical choice is $K = \sqrt{n}$.
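
A per-node feature draw might look like the sketch below. The helper `candidate_features` and the feature names are hypothetical, and rounding $\sqrt{n}$ down to an integer is an assumption of this example.

```python
import math
import random

def candidate_features(features, rng):
    """Pick a random subset of K = floor(sqrt(n)) features (at least one)
    that a node is allowed to consider for its split."""
    n = len(features)
    k = max(1, math.isqrt(n))
    return rng.sample(features, k)

rng = random.Random(0)
# Hypothetical feature names, invented for this example
features = ["age", "bmi", "glucose", "hypertension", "heart_disease",
            "gender", "smoking", "work", "residence"]

# A fresh subset is drawn at every node, so different nodes see different features
for node in range(3):
    subset = candidate_features(features, rng)
    print(f"node {node}: choose the best split among {subset}")
```

With 9 features, each node considers 3 of them. In scikit-learn, the `max_features` parameter of `RandomForestClassifier` controls this subset size, and `max_features="sqrt"` corresponds to the choice above.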

## Stroke prediction using RF

In [2]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

# Load the stroke prediction dataset
df = pd.read_csv("data/healthcare-dataset-stroke-data.csv")
# Create a LabelEncoder object
le = LabelEncoder()

# Encode each categorical column as integer labels
df['gender'] = le.fit_transform(df['gender'])
df['ever_married'] = le.fit_transform(df['ever_married'])
df['work_type'] = le.fit_transform(df['work_type'])
df['Residence_type'] = le.fit_transform(df['Residence_type'])
df['smoking_status'] = le.fit_transform(df['smoking_status'])

# Drop rows with missing values
df.dropna(inplace=True)

# Split the dataset into training and testing sets
X = df.drop(['stroke'], axis=1)
y = df['stroke']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=27)

# Create a random forest model with 70 trees
model = RandomForestClassifier(n_estimators=70)

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = model.predict(X_test)

# Evaluate the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)



Accuracy: 0.960285132382892


In [3]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

# Load the stroke prediction dataset
df = pd.read_csv("data/healthcare-dataset-stroke-data.csv")
# Create a LabelEncoder object
le = LabelEncoder()

# Encode each categorical column as integer labels
df['gender'] = le.fit_transform(df['gender'])
df['ever_married'] = le.fit_transform(df['ever_married'])
df['work_type'] = le.fit_transform(df['work_type'])
df['Residence_type'] = le.fit_transform(df['Residence_type'])
df['smoking_status'] = le.fit_transform(df['smoking_status'])

# Drop null values
df = df.dropna()

# Select only the 'age' and 'stroke' columns
df_num = df[['age', 'stroke']]

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df_num[['age']], df_num['stroke'], test_size=0.2, random_state=27)

# Create a Random Forest model with 100 trees
model = RandomForestClassifier(n_estimators=100)

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = model.predict(X_test)

# Evaluate the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Accuracy: 0.960285132382892
