## Homework Week 4

1. Summarize the basic idea of logistic regression, support vector machines (SVM), and decision tree.
1. Read the section of Random Forest, and summarize the major steps.
1. Why do we split samples into training and test set? What does the stratify mean?
1. Explain what are the bias-variance trade off and regularization.
1. When do we use SVM's kernel method?
1. Load scikit-learn's Wine recognition dataset; set aside 20% of the data as a test set; predict the wine class using the 3 methods (logistic regression, support vector machines, decision tree), and print the test-set accuracy of each method.


To load the data set:
```python
from sklearn import datasets
import numpy as np

df = datasets.load_wine()

X = df.data
y = df.target

```

You can check the targets and features by:
```python
print(df.target_names)
print(df.feature_names)
```

For more details on the wine data set, see: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html


**1. Summarize the basic idea of logistic regression and decision tree.**

**Logistic Regression**

Logistic regression is a linear model for binary classification. It performs well on linearly separable classes and can be extended to multiclass problems, for example with the one-versus-rest (OvR) method. It predicts the probability that a given sample belongs to a particular class.
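As a small illustration of the probability output, the sketch below fits a logistic regression on the Wine data and prints the class probabilities for the first five samples. Standardizing the features and using `max_iter=1000` are assumptions made here so the solver converges; they are not part of the homework answer.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Assumption: standardize the features so the default solver converges easily
X_std = StandardScaler().fit_transform(X)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_std, y)

# One row per sample, one column per class; each row sums to 1
proba = clf.predict_proba(X_std[:5])
print(proba.round(3))

# The predicted label is the class with the highest probability
print(clf.predict(X_std[:5]))
```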

**Decision Tree**

Decision trees are classification models that can handle classes that are not linearly separable. The data is split by a series of questions about the feature values until the nodes are pure, meaning all samples in a node belong to the same class. The tree can be pruned by setting a limit on its maximum depth.
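The effect of limiting the depth can be seen in a rough sketch like the one below, which compares a fully grown tree with a depth-limited one on the Wine data. The choice of `max_depth=2` is arbitrary and only for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# Fully grown tree: keeps splitting until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=1).fit(X, y)

# Depth-limited tree: stops after two levels of questions (arbitrary choice)
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=1).fit(X, y)

for name, model in [('full', full_tree), ('shallow', shallow_tree)]:
    print(name, 'depth:', model.get_depth(),
          'training accuracy: %.2f' % model.score(X, y))
```

The fully grown tree fits the training data perfectly, which is exactly the behavior that makes single trees prone to overfitting.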

**2. Read the section of Random Forest, and summarize the major steps.**

Since individual decision trees suffer from high variance, a random forest builds a more robust, better-performing model by averaging the predictions of multiple decision trees.

**Step 1:** Draw a random bootstrap sample of size *n*. Bootstrapping refers to a resampling method where *n* samples are selected from the training set with replacement.

**Step 2:** Grow a decision tree from the bootstrap sample. At each node:
  
> a. Randomly select *d* features without replacement.


> b. Split the node using the feature that provides the best split according to the objective function, e.g. maximizing the information gain (IG).

**Step 3:** Repeat steps 1 through 2 *k* times. Typically, the larger *k* is, the better the performance of the Random Forest is, but at the expense of an increase in computational cost.

**Step 4:** Aggregate the predictions of all trees by majority vote to assign the class label.
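The four steps can be sketched by hand as follows. This is a simplified illustration, not scikit-learn's `RandomForestClassifier` implementation: `k = 25` and `max_features='sqrt'` (which stands in for *d*) are assumed values, and the trees use scikit-learn's default Gini criterion rather than information gain.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

rng = np.random.default_rng(1)
k = 25                      # number of trees (assumed value)
n = X_train.shape[0]        # bootstrap sample size
trees = []

for _ in range(k):          # Step 3: repeat steps 1 and 2 k times
    # Step 1: bootstrap sample of size n, drawn with replacement
    idx = rng.integers(0, n, size=n)
    # Step 2: grow a tree; max_features='sqrt' picks a random feature
    # subset at every split
    tree = DecisionTreeClassifier(
        max_features='sqrt',
        random_state=int(rng.integers(0, 2**31 - 1)))
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 4: majority vote over the k trees for each test sample
votes = np.stack([t.predict(X_test) for t in trees])   # shape (k, n_test)
y_pred = np.array([np.bincount(col, minlength=3).argmax() for col in votes.T])

print('Accuracy: %.2f' % (y_pred == y_test).mean())
```

In practice, `sklearn.ensemble.RandomForestClassifier` wraps these steps in a single estimator.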


**3. Why do we split samples into training and test set? What does the stratify mean?**

We split samples into training and test sets because the training data is used to build (or train) the model, while unseen data is needed to evaluate the model's performance. Without a test set, there is no way to know how well the model predicts samples it has not seen before.

Stratification is when we split the data into a training and test set, but the data is split so that the training and test sets have the same proportion of class labels as the dataset. Stratified samples are used when the proportions of class labels are very different, or imbalanced.
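A quick check along these lines shows what `stratify=y` does on the Wine data: the class proportions in the training and test sets stay close to those of the full data set. The helper dictionary `props` is introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

# Fraction of samples per class in each subset
props = {
    'full': np.bincount(y) / len(y),
    'train': np.bincount(y_train) / len(y_train),
    'test': np.bincount(y_test) / len(y_test),
}

for name, p in props.items():
    print(name, p.round(2))
```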

**4. Explain what are the bias-variance trade off and regularization.**

Bias measures how far the predictions are from the correct values.
Variance measures the variability of the model predictions.

A simple model with few parameters may have a high bias and low variance, but a very complex model with many parameters may have a high variance and low bias. A simple model may suffer from underfitting while a complex model may suffer from overfitting. 

Regularization can be used to prevent overfitting. Regularization is a useful tool to handle a high correlation among features and filter out the noise. It introduces additional bias to penalize extreme parameter values. For instance, a very complex model suffers from high variance and low bias. Introducing additional bias is a way of finding a good bias-variance tradeoff.
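To make the penalty on extreme parameter values concrete, the sketch below fits logistic regression with three regularization strengths and prints the size of the weight vector. In scikit-learn, `C` is the inverse of the regularization strength, so a smaller `C` means a stronger penalty. The specific `C` values, the feature standardization, and `max_iter=5000` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

norms = {}
for C in [0.01, 1.0, 100.0]:
    # Smaller C = stronger L2 regularization = smaller weights
    clf = LogisticRegression(C=C, max_iter=5000).fit(X_std, y)
    norms[C] = np.linalg.norm(clf.coef_)
    print('C=%g  weight norm=%.2f' % (C, norms[C]))
```

The weight norm grows as `C` increases, which corresponds to moving from a high-bias, low-variance model toward a low-bias, high-variance one.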

**6. Load scikit-learn's Wine recognition dataset; set aside 20% of the data as a test set; predict the wine class using the 3 methods (logistic regression, support vector machines, decision tree), and print the test-set accuracy of each method.**

```python
from sklearn import datasets
import numpy as np

df = datasets.load_wine()  # this is the dataset

X = df.data    # features
y = df.target  # targets
```

```python
from sklearn.model_selection import train_test_split

# separate 20% of the data as a test data set
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=1,
    stratify=y
)
```

```python
from sklearn.linear_model import LogisticRegression

# logistic regression model
lr = LogisticRegression(random_state=1, solver='liblinear', multi_class='auto')

# fit the logistic regression model to the training data
lr.fit(X_train, y_train)

# use the test data to predict the test target
y_pred = lr.predict(X_test)
```

```python
from sklearn.metrics import accuracy_score

# find the accuracy using the target values in the test set versus the predicted target values
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))
```

```
Accuracy: 0.94
```


The accuracy of the logistic regression method was 94%.

```python
from sklearn.tree import DecisionTreeClassifier

# decision tree model with Gini impurity split criterion and a max depth of 4
tree = DecisionTreeClassifier(criterion='gini', max_depth=4, random_state=1)

# fit the decision tree to the training data
tree.fit(X_train, y_train)

# use the test data to predict the test target
# (stored in its own variable rather than as an attribute on the estimator)
y_pred_tree = tree.predict(X_test)
```

```python
from sklearn.metrics import accuracy_score

# find the accuracy using the target values in the test set versus the predicted target values
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred_tree))
```

```
Accuracy: 0.97
```


The accuracy of the decision tree method was 97%.
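The question also asks for a support vector machine, which is not covered by the cells above. A minimal sketch of how that third model could be fitted on the same split is given below. The RBF kernel, `C=1.0`, `gamma='scale'`, and the standardization step are assumptions rather than settings taken from the homework, and no accuracy is quoted because this cell was not part of the original run.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

# same split settings as the cells above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

# Assumption: scale the features first, since SVMs are sensitive to feature scale;
# the RBF kernel allows a nonlinear decision boundary
svm = make_pipeline(
    StandardScaler(),
    SVC(kernel='rbf', C=1.0, gamma='scale', random_state=1))

# fit the SVM to the training data
svm.fit(X_train, y_train)

# use the test data to predict the test target
y_pred_svm = svm.predict(X_test)

print('Accuracy: %.2f' % accuracy_score(y_test, y_pred_svm))
```

Switching to `kernel='linear'` would give a linear decision boundary instead, which may be enough if the classes are close to linearly separable after scaling.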