# SWS3009 Lab 6 Statistical Methods


|Name      |
|:----------|
| YANG SIMIN |
| PAN DEYU |

In this lab you will do some experiments to familiarize yourself with the linear regression, Naive Bayes and Support Vector Machine libraries in SciKit Learn.

Please work together as a team of 2 to complete this lab. You will need to submit ONE copy of this notebook per team, but please fill in the names of both team members above. This lab is worth 3 marks:

**0 marks**: No submission, empty submission or non-English submission.

**1 mark** : Poor submission.

**2 marks**: Acceptable submission.

**3 marks**: Good submission.

## SUBMISSION INSTRUCTIONS

Please submit this completed Jupyter Notebook (SWS3009Lab6.ipynb) to Canvas by **11.59 PM** on **TUESDAY 11 JULY 2023**. All submissions must be in English. Submissions that are not in English will not be marked.

Let's now begin using statistical techniques in SciKit Learn.

## 1. SciKit Learn Hands-on

We will now run some experiments to familiarize you with the statistical learning tools in SciKit Learn.

### 1.1 Linear Regression

Let's begin by playing around with the linear regression we did for the Boston Housing Dataset during the lecture.

#### 1.1.1 Finding Better Correlations

In the lecture we looked at correlating housing prices and poverty levels.  Using the code cell below:

    1. Recreate the regression example from the lecture.
    2. Add code to find the correlation between housing prices and the other independent variables in the dataset.
    3. As before save 33% of the data for testing.
    4. Create a new simple (single independent variable) regression model using the independent variable with the highest correlation. If poverty levels has the highest correlation, then choose the next highest.
    5. Compute and print the MSE for training data and testing data, and answer the questions after the code block.


In [7]:
"""
    Enter your code for part 1.1.1 here, and answer the questions
    after this code cell.
"""

import numpy as np
import pandas as pd
import scipy.stats as stats
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# Load the Boston Housing dataset
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']

# Create a DataFrame from the data
bos = pd.DataFrame(data, columns=feature_names)
bos['PRICE'] = target

# Find correlations between housing prices and other independent variables
correlations = bos.drop('PRICE', axis=1).apply(lambda x: bos['PRICE'].corr(x))
print("Correlation between housing prices and independent variables:")
print(correlations)

# Select the independent variable with the highest correlation (excluding PRICE itself)
highest_corr_feature = correlations.abs().idxmax()
print("Independent variable with the highest correlation:", highest_corr_feature)

# Prepare the data for regression
X = bos[highest_corr_feature].values.reshape(-1, 1)
Y = bos['PRICE'].values.reshape(-1, 1)

# Split the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=0)

# Create a simple linear regression model
lm = LinearRegression()
lm.fit(X_train, Y_train)

# Predict housing prices for training and testing data
Y_pred_train = lm.predict(X_train)
Y_pred_test = lm.predict(X_test)

# Compute and print the RMSE (square root of the MSE) for training and testing data
train_rmse = np.sqrt(metrics.mean_squared_error(Y_train, Y_pred_train))
test_rmse = np.sqrt(metrics.mean_squared_error(Y_test, Y_pred_test))
print("RMSE for training data: %3.4f" % train_rmse)
print("RMSE for testing data: %3.4f" % test_rmse)


Correlation between housing prices and independent variables:
CRIM      -0.388305
ZN         0.360445
INDUS     -0.483725
CHAS       0.175260
NOX       -0.427321
RM         0.695360
AGE       -0.376955
DIS        0.249929
RAD       -0.381626
TAX       -0.468536
PTRATIO   -0.507787
B          0.333461
LSTAT     -0.737663
dtype: float64
Independent variable with the highest correlation: LSTAT
RMSE for training data: 6.1922
RMSE for testing data: 6.2429
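
Because the cell above wraps `metrics.mean_squared_error` in `np.sqrt`, the two figures it prints are root-mean-squared errors (RMSE), expressed in the same units as PRICE, rather than raw MSE values. A minimal sketch of the relationship follows; the numbers are made up for illustration and are not taken from the Boston data.

```python
import numpy as np
from sklearn import metrics

# Hypothetical targets and predictions, chosen only for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

mse = metrics.mean_squared_error(y_true, y_pred)   # mean of the squared residuals
rmse = np.sqrt(mse)                                # square root of the MSE

print("MSE:  %3.4f" % mse)
print("RMSE: %3.4f" % rmse)
```

Both are error measures, so lower values indicate a better fit.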


Answer the following questions between the \*\* markdowns so that your answers appear in bold.

***
_Question 1: Which independent variable has the highest correlation? Did it have any effect on your training and test accuracy scores? Why or why not?_
---
***

LSTAT has the highest correlation with PRICE in absolute value (-0.7377), followed by RM (0.6954).

In general, using a highly correlated variable as the input to a regression model can improve the model's performance, because a stronger linear relationship leaves less of the target's variance unexplained.

The effect on the training and test scores depends on the dataset and on the relationship between the independent variable and the target variable (housing prices). Here the model fitted on LSTAT gives errors of 6.1922 on the training data and 6.2429 on the testing data.

Correlation alone does not guarantee predictive power. Other factors such as non-linearity, outliers, and the presence of other relevant variables also influence the model's accuracy, so the impact on the training and test scores cannot be determined from the correlation alone.
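
One way to make the link between correlation and regression quality concrete: for a simple (single-variable) least-squares regression, the R² score on the data used for fitting equals the square of the Pearson correlation. The sketch below checks this on synthetic data; the data-generating numbers are assumptions for illustration and this is not the Boston dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for one feature and a noisy, negatively related target
x = rng.normal(size=200)
y = -2.0 * x + rng.normal(scale=1.5, size=200)

r = np.corrcoef(x, y)[0, 1]                 # Pearson correlation

lm = LinearRegression().fit(x.reshape(-1, 1), y)
r2 = lm.score(x.reshape(-1, 1), y)          # R^2 on the data used for fitting

print("r = %.4f, r^2 = %.4f, R^2 = %.4f" % (r, r ** 2, r2))
```

The identity holds only on the data the model was fitted to; on a held-out test split the score can differ.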

#### 1.1.2 Creating Multivariate Linear Regressions ####

SciKit Learn can create linear regression models with multiple independent variables. In this section we are going to explore how to do this, and whether or not it makes a difference on our Boston dataset.

One way to create a multivariate model is to:

    1. Rank the independent variables by correlation, then create a linear model using the independent variable with the highest correlation. Measure the training and testing accuracy.
    2. Add in the independent variable with the next highest correlation and create a new linear model.  Measure the training and testing accuracy.
    3. Stop when either accuracy score levels off or goes down.
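
The steps above can be sketched as a small loop. This is only an illustration under stated assumptions: the DataFrame `df` and its columns "a", "b", "c" and "target" are hypothetical synthetic data, RMSE on the test split is used as the score, and the loop stops as soon as that score fails to improve.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: "a", "b", "c" are hypothetical features
df = pd.DataFrame(rng.normal(size=(300, 3)), columns=["a", "b", "c"])
df["target"] = 3.0 * df["a"] - 2.0 * df["b"] + rng.normal(scale=0.5, size=300)

# Step 1: rank the features by absolute correlation with the target
ranking = (
    df.drop(columns="target")
    .corrwith(df["target"])
    .abs()
    .sort_values(ascending=False)
    .index.tolist()
)

selected = []
best_test_rmse = np.inf
for feature in ranking:
    # Step 2: add the next-ranked feature and refit
    candidate = selected + [feature]
    X_train, X_test, y_train, y_test = train_test_split(
        df[candidate].values, df["target"].values, test_size=0.33, random_state=0
    )
    model = LinearRegression().fit(X_train, y_train)
    train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(candidate, "train RMSE %.4f, test RMSE %.4f" % (train_rmse, test_rmse))

    # Step 3: stop once the test error no longer improves
    if test_rmse >= best_test_rmse:
        break
    selected, best_test_rmse = candidate, test_rmse

print("Selected features:", selected)
```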

Answer the following questions to help you along with creating your multivariate model:

***

_Question 2: Explain what the following code fragment does. You may refer to NumPy and SciKit Learn documentation_
---
```
bos['PRICE'].values.reshape(-1, 1)
```

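The code fragment bos['PRICE'].values.reshape(-1, 1) takes the values of the 'PRICE' column of the bos DataFrame as a 1-dimensional NumPy array and reshapes them into a 2-dimensional array with a single column. The -1 tells NumPy to infer the number of rows from the length of the array, so the result has shape (n_samples, 1).

scikit-learn estimators expect feature arrays to be 2-dimensional, with shape (n_samples, n_features). Reshaping to a single column follows this layout and ensures compatibility with scikit-learn's linear regression models.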
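
A small sketch of the reshape on its own, using a short made-up array in place of the 'PRICE' column:

```python
import numpy as np

prices = np.array([24.0, 21.6, 34.7, 33.4])   # illustrative 1-D array, shape (4,)
column = prices.reshape(-1, 1)                # -1 lets NumPy infer the row count

print(prices.shape)   # (4,)
print(column.shape)   # (4, 1)
```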

_Question 3: Consult the NumPy documentation: What does the 'concatenate' function do? In particular what does 'axis=1' do?_
---
The NumPy concatenate function is used to concatenate or join arrays along a specified axis.

np.concatenate(arrays, axis=0): The concatenate function takes two or more arrays as input and returns a single array by concatenating them together along the specified axis.

arrays: A sequence of arrays to be concatenated. The arrays must have the same shape, except in the dimension corresponding to the axis along which they are concatenated.


axis: It specifies the axis along which the arrays will be concatenated. The default value is axis=0, which concatenates the arrays along the first dimension (rows). When axis=1, the arrays are concatenated along the second dimension (columns).


To clarify, axis=0 concatenates arrays vertically (stacking them one below the other), while axis=1 concatenates arrays horizontally (joining them side by side).
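
The difference between the two axes is easiest to see on two small arrays. The values below are arbitrary and serve only as an illustration:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])   # shape (2, 2)
b = np.array([[5, 6], [7, 8]])   # shape (2, 2)

rows = np.concatenate((a, b), axis=0)   # stacked vertically
cols = np.concatenate((a, b), axis=1)   # joined side by side

print(rows.shape)   # (4, 2)
print(cols.shape)   # (2, 4)
print(cols)
```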

_Question 4: Given your answers to Questions 2 and 3, what does the following code do?_
---
```
import numpy as np

... Other code here ...

X1 = bos['INDUS'].values.reshape(-1, 1)
X2 = bos['CRIM'].values.reshape(-1, 1)
X = np.concatenate((X1, X2), axis = 1)
```


The given code takes two independent variables, 'INDUS' and 'CRIM', from the bos DataFrame and creates a new array X by concatenating them along the horizontal axis (axis=1).

X1 = bos['INDUS'].values.reshape(-1, 1): Extracts the values of the 'INDUS' column from the bos DataFrame and reshapes it into a 2-dimensional array with a single column using the reshape function. This is done to prepare the data for concatenation.

X2 = bos['CRIM'].values.reshape(-1, 1): Extracts the values of the 'CRIM' column from the bos DataFrame and reshapes it into a 2-dimensional array with a single column.

X = np.concatenate((X1, X2), axis=1): Concatenates X1 and X2 arrays horizontally along axis=1, resulting in a new array X that contains both the 'INDUS' and 'CRIM' variables side by side.

The resulting X array will have a shape of (n_samples, 2), where n_samples is the number of data points in the dataset. Each row represents a data point, and the two columns represent the values of 'INDUS' and 'CRIM' for that data point, respectively.

***

Use the following code cell to follow the steps above to create models with one, two and three independent variables, printing the training and testing accuracy each time. Note that you have to run _train_test_split_ for each model. Set the _random_state_ parameter in _train_test_split_ to 0 each time.

In [9]:
"""
    Enter your code for part 1.1.2 here.
"""
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# Model with one independent variable (highest correlated feature)
X1 = bos[highest_corr_feature].values.reshape(-1, 1)
Y1 = bos['PRICE'].values.reshape(-1, 1)

X1_train, X1_test, Y1_train, Y1_test = train_test_split(X1, Y1, test_size=0.33, random_state=0)

lm1 = LinearRegression()
lm1.fit(X1_train, Y1_train)

Y1_pred_train = lm1.predict(X1_train)
Y1_pred_test = lm1.predict(X1_test)

# np.sqrt of the MSE is the RMSE: an error measure, so lower values are better
train_mse_1 = np.sqrt(metrics.mean_squared_error(Y1_train, Y1_pred_train))
test_mse_1 = np.sqrt(metrics.mean_squared_error(Y1_test, Y1_pred_test))

print("Model with one independent variable ({}):".format(highest_corr_feature))
print("Training Accuracy: {:.4f}".format(train_mse_1))
print("Testing Accuracy: {:.4f}".format(test_mse_1))
print()

# Model with two independent variables (highest correlated feature + next highest correlated feature)
second_corr_feature = correlations.drop(highest_corr_feature).abs().idxmax()
X2 = np.concatenate((X1, bos[second_corr_feature].values.reshape(-1, 1)), axis=1)
Y2 = bos['PRICE'].values.reshape(-1, 1)

X2_train, X2_test, Y2_train, Y2_test = train_test_split(X2, Y2, test_size=0.33, random_state=0)

lm2 = LinearRegression()
lm2.fit(X2_train, Y2_train)

Y2_pred_train = lm2.predict(X2_train)
Y2_pred_test = lm2.predict(X2_test)

train_mse_2 = np.sqrt(metrics.mean_squared_error(Y2_train, Y2_pred_train))
test_mse_2 = np.sqrt(metrics.mean_squared_error(Y2_test, Y2_pred_test))

print("Model with two independent variables ({} and {}):".format(highest_corr_feature, second_corr_feature))
print("Training Accuracy: {:.4f}".format(train_mse_2))
print("Testing Accuracy: {:.4f}".format(test_mse_2))
print()

# Model with three independent variables (highest correlated feature + next two highest correlated features)
third_corr_feature = correlations.drop([highest_corr_feature, second_corr_feature]).abs().idxmax()
X3 = np.concatenate((X2, bos[third_corr_feature].values.reshape(-1, 1)), axis=1)
Y3 = bos['PRICE'].values.reshape(-1, 1)

X3_train, X3_test, Y3_train, Y3_test = train_test_split(X3, Y3, test_size=0.33, random_state=0)

lm3 = LinearRegression()
lm3.fit(X3_train, Y3_train)

Y3_pred_train = lm3.predict(X3_train)
Y3_pred_test = lm3.predict(X3_test)

train_mse_3 = np.sqrt(metrics.mean_squared_error(Y3_train, Y3_pred_train))
test_mse_3 = np.sqrt(metrics.mean_squared_error(Y3_test, Y3_pred_test))

print("Model with three independent variables ({} and {} and {}):".format(highest_corr_feature, second_corr_feature, third_corr_feature))
print("Training Accuracy: {:.4f}".format(train_mse_3))
print("Testing Accuracy: {:.4f}".format(test_mse_3))
print()


Model with one independent variable (LSTAT):
Training Accuracy: 6.1922
Testing Accuracy: 6.2429

Model with two independent variables (LSTAT and RM):
Training Accuracy: 5.4770
Testing Accuracy: 5.6286

Model with three independent variables (LSTAT and RM and PTRATIO):
Training Accuracy: 4.9782
Testing Accuracy: 5.6952



### 1.2 Creating a Naive Bayes Classifier ###

We will now look at how to create a Naive Bayes Classifier, and later on a Support Vector Machine classifier. We will also explore the use of _GridSearchCV_ to optimize the choice of parameters for the SVC.

#### 1.2.1 The Irises Dataset ####

In this lab we will use the irises dataset to classify three categories of irises (a genus of flowering plants). We will consider four factors:

    1. Sepal length in cm
    2. Sepal width in cm
    3. Petal length in cm
    4. Petal width in cm

The image below shows what these mean:

![iris.png](attachment:image.png)

The code cell below loads up the Iris dataset, prints it out, then scales it.

In [None]:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
import numpy as np

iris_data = load_iris()
print("Iris Data:")
print(iris_data.data)
scaler = StandardScaler()
scaler.fit(iris_data.data)
X = scaler.transform(iris_data.data)
Y = iris_data.target


Iris Data:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.2]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.6 1.4 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.  3.3 1.4 0.2]
 [7.  3.2 4.7 1.4]
 [6.4 3.2 4.5 1.5]
 

Answer the following questions:

_Question 5: What does 'StandardScaler' do? What other types of scalers are available? What is the advantage of scaling your inputs?_
---
The StandardScaler in scikit-learn is a preprocessing class used for standardizing features by removing the mean and scaling to unit variance. It transforms the data such that each feature has a mean of 0 and a standard deviation of 1.

Other types of scalers available in scikit-learn include:

MinMaxScaler: Scales features to a given range (e.g., between 0 and 1) by shifting and scaling the data.

RobustScaler: Scales features using statistics that are robust to outliers, such as the median and interquartile range.

MaxAbsScaler: Scales features to the range [-1, 1] by dividing by the maximum absolute value in each feature.



The advantage of scaling your inputs is that it can help improve the performance and effectiveness of many machine learning algorithms. Some reasons for scaling include:

Normalization: Scaling ensures that all features have a similar scale, preventing some features from dominating others solely based on their magnitude. It helps to avoid biases in the model.

Gradient Descent: Scaling can help speed up the convergence of gradient-based optimization algorithms, such as gradient descent, by making the optimization process more efficient.

Distance-Based Algorithms: Many machine learning algorithms rely on calculating distances between data points, such as K-nearest neighbors (KNN) or support vector machines (SVM). Scaling ensures that the distances are computed consistently across all features.

Regularization: Some regularization techniques, like L1 and L2 regularization, assume that the features are on the same scale. Scaling helps in properly applying regularization to different features.

By scaling the inputs, we can make the data more suitable for modeling, improve algorithm performance, and ensure that the model is not biased towards certain features due to their scale.
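
To compare the scalers named above side by side, the sketch below applies each one to the same small array. The array is hypothetical: two features on very different scales, with an outlier in the last row.

```python
import numpy as np
from sklearn.preprocessing import (
    MaxAbsScaler, MinMaxScaler, RobustScaler, StandardScaler,
)

# Two hypothetical features on very different scales; the last row is an outlier
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0], [50.0, 500.0]])

scaled = {
    "StandardScaler": StandardScaler().fit_transform(X),   # mean 0, std 1 per column
    "MinMaxScaler": MinMaxScaler().fit_transform(X),       # range [0, 1] per column
    "RobustScaler": RobustScaler().fit_transform(X),       # median 0, scaled by IQR
    "MaxAbsScaler": MaxAbsScaler().fit_transform(X),       # divided by max |value|
}

for name, values in scaled.items():
    print(name)
    print(np.round(values, 3))
```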

#### 1.2.2 Creating a Naive Bayes Classifier Model

Recall that there are three major types of Naive Bayes classifiers:

    1. Gaussian
    2. Multinomial
    3. Bernoulli
    
_Question 6: What type of model should we use here? Why?_
---
For this type of dataset, the appropriate type of Naive Bayes classifier to use is the Gaussian Naive Bayes.

The Gaussian Naive Bayes classifier assumes that the features follow a Gaussian (normal) distribution. It calculates the probabilities using the mean and standard deviation of each feature for each class. Since the iris dataset features are continuous variables, Gaussian Naive Bayes is suitable for modeling the relationships between the features and the target classes.

Therefore, we should use the Gaussian Naive Bayes classifier in this case to classify the iris dataset based on sepal and petal measurements.
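
As a rough illustration of what "using the mean of each feature for each class" means in practice, the sketch below fits a `GaussianNB` on the unscaled iris data and compares its fitted `theta_` attribute with class means computed by hand. Fitting on the full, unscaled dataset is a simplification made for this example only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
X, y = iris.data, iris.target

nb = GaussianNB().fit(X, y)

# theta_ holds the per-class mean of each feature: shape (n_classes, n_features)
print("Fitted class means:\n", nb.theta_)

# The same means computed directly from the data
manual_means = np.array([X[y == c].mean(axis=0) for c in nb.classes_])
print("Means match:", np.allclose(nb.theta_, manual_means))
```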


Now complete the code in the code cell below, following these specifications:

    1. Set aside 20% of the data for testing.
    2. Use the appropriate type of Naive Bayes Classifier, adding in whatever import statements you require here.
    3. Print out the training and testing accuracies.
    

In [10]:
"""
    Enter your code for part 1.2.2 here.
"""
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load the Iris dataset
iris_data = load_iris()
X = iris_data.data
Y = iris_data.target

# Split the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Scale the features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create a Gaussian Naive Bayes classifier
nb_classifier = GaussianNB()

# Train the classifier on the training data
nb_classifier.fit(X_train_scaled, Y_train)

# Evaluate the classifier on the training and testing data
train_accuracy = nb_classifier.score(X_train_scaled, Y_train)
test_accuracy = nb_classifier.score(X_test_scaled, Y_test)

# Print the training and testing accuracies
print("Training Accuracy:", train_accuracy)
print("Testing Accuracy:", test_accuracy)


Training Accuracy: 0.95
Testing Accuracy: 1.0


#### 1.2.3 Using Pipelines ####

In the Naive Bayes Jupyter Notebook included with your Lecture 3 slides, we used a _Pipeline_ object to simplify our code. Using that example as a guide, rewrite your code above to use _Pipeline_. Some things to note:

    1. The code will not be exactly the same (it will be much simpler). For example we are not using a CountVectorizer nor a TfidfTransformer. So just follow the principle. Remember to put your StandardScaler into the Pipeline.
    2. When doing 'fit' on your model, you should input the _original_ data, not the scaled one, since we are incorporating the StandardScaler as part of our Pipeline.

**Hint: Section 1.3.2 below shows you how to create a Pipeline for SVM**

Use the code cell below to enter your new version using Pipelines. Remember to print out your training and testing accuracies.


In [11]:
"""
    Enter your code for part 1.2.3 here.
"""
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

# Load the Iris dataset
iris_data = load_iris()
X = iris_data.data
Y = iris_data.target

# Split the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Create a pipeline with StandardScaler and GaussianNB
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('nb', GaussianNB())
])

# Fit the pipeline on the training data
pipeline.fit(X_train, Y_train)

# Evaluate the pipeline on the training and testing data
train_accuracy = pipeline.score(X_train, Y_train)
test_accuracy = pipeline.score(X_test, Y_test)

# Print the training and testing accuracies
print("Training Accuracy:", train_accuracy)
print("Testing Accuracy:", test_accuracy)



Training Accuracy: 0.95
Testing Accuracy: 1.0
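
The Pipeline version reports the same accuracies as the manual version in 1.2.2. A minimal sketch that checks the two approaches produce identical test predictions, assuming the same split (`test_size=0.2`, `random_state=42`) as the cells above:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, Y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Manual version: fit the scaler on the training split only, then the classifier
scaler = StandardScaler().fit(X_train)
manual_nb = GaussianNB().fit(scaler.transform(X_train), Y_train)
manual_pred = manual_nb.predict(scaler.transform(X_test))

# Pipeline version: the same two steps chained, fed the unscaled data
pipe = Pipeline([("scaler", StandardScaler()), ("nb", GaussianNB())])
pipe.fit(X_train, Y_train)
pipe_pred = pipe.predict(X_test)

print("Predictions identical:", np.array_equal(manual_pred, pipe_pred))
```

Keeping the scaler inside the Pipeline also means it is fitted only on whatever data is passed to `fit`, which avoids accidentally scaling with statistics from the test split.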


### 1.3 Creating a Support Vector Machine Classifier ###

We will now create an SVM to perform our classification. There are two major SVM classifiers provided with SciKit Learn:

    1. LinearSVC: An SVM that uses a linear decision boundary to classify.
    2. SVC: An SVM that offers a wider variety of classification boundaries through so-called 'kernels': Radial Basis Function, sigmoid, polynomial, and of course linear.
    
#### 1.3.1 Creating a Linear SVM ####

Using your code from 1.2.3 as a guide, create a new Pipeline to train a LinearSVC with the following parameters:

    - max_iter: 100000
    - loss: hinge
    - penalty: l2      (Note: This is 'el-two', and not 'twelve')
    
Use the code cell below to implement your SVM, printing out your training and testing accuracies. Please consult the SciKit Learn documentation on what these parameters mean.


In [14]:
"""
    Enter your code for part 1.3.1 here.
"""
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

# Load the Iris dataset
iris_data = load_iris()
X = iris_data.data
Y = iris_data.target

# Split the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Create a pipeline with StandardScaler and LinearSVC
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', LinearSVC(max_iter=100000, loss='hinge', penalty='l2'))
])

# Fit the pipeline on the training data
pipeline.fit(X_train, Y_train)

# Evaluate the pipeline on the training and testing data
train_accuracy = pipeline.score(X_train, Y_train)
test_accuracy = pipeline.score(X_test, Y_test)

# Print the training and testing accuracies
print("Training Accuracy:", train_accuracy)
print("Testing Accuracy:", test_accuracy)


Training Accuracy: 0.9166666666666666
Testing Accuracy: 0.9666666666666667


***
_Question 7: Play around with the loss and penalty parameters. E.g. try an 'l1' penalty with hinge loss, or 'l1' penalty with squared hinge loss. Does 'l2' work with the squared hinge loss function? Record your training and testing accuracies below_
---

LinearSVC with hinge loss and L1 penalty:



```
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', LinearSVC(max_iter=100000, loss='hinge', penalty='l1'))
])

```

LinearSVC with squared hinge loss and L1 penalty:



```
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', LinearSVC(max_iter=100000, loss='squared_hinge', penalty='l1'))
])

```

LinearSVC with squared hinge loss and L2 penalty:



```
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', LinearSVC(max_iter=100000, loss='squared_hinge', penalty='l2'))
])

```


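LinearSVC with hinge loss and L1 penalty (as recorded; the scikit-learn documentation lists the combination of penalty='l1' with loss='hinge' as unsupported, so the configuration behind these figures should be re-checked):

Training Accuracy: 0.983

Testing Accuracy: 0.967

The 'l2' penalty does work with the squared hinge loss; that pairing is LinearSVC's default.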
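
One way to record every loss and penalty combination in a single run is sketched below. The choice of `dual` for each combination is an assumption made for this example (the dual problem for hinge loss, the primal otherwise), since not every pairing supports both formulations. Combinations that scikit-learn rejects are caught and reported rather than scored.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, Y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

results = {}
for loss in ("hinge", "squared_hinge"):
    for penalty in ("l1", "l2"):
        # Assumption: dual formulation for hinge loss, primal otherwise
        dual = loss == "hinge"
        pipe = Pipeline([
            ("scaler", StandardScaler()),
            ("svm", LinearSVC(max_iter=100000, loss=loss, penalty=penalty, dual=dual)),
        ])
        try:
            pipe.fit(X_train, Y_train)
            results[(loss, penalty)] = (pipe.score(X_train, Y_train),
                                        pipe.score(X_test, Y_test))
        except ValueError as err:
            # Unsupported combinations are reported instead of scored
            results[(loss, penalty)] = "unsupported: %s" % err

for combo, outcome in results.items():
    print(combo, outcome)
```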





***

#### 1.3.2 Autotuning Hyperparameters ####

In Question 7 you have played around with some of the hyperparameters for LinearSVC and may have found that it gives you different accuracy results. Selecting the right hyperparameters is always a challenge, but thankfully SciKit Learn gives us a very useful tool called "GridSearchCV". In the example below we see how to tune the 'C' parameter, which controls the strength of the regularization penalty applied to the SVM parameters, by trying the values 1 and 10. GridSearchCV will then select the C value that gives us the best cross-validated accuracy on the training data:

```
import numpy as np
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

params = {'C':[1,10]}

svm_pipe_2 = Pipeline([('scaler', StandardScaler()),
                    ('svm', GridSearchCV(svm.LinearSVC(max_iter = 100000), params)), ])
svm_pipe_2.fit(X_train_1, Y_train_1)

Y_train_pred_1 = svm_pipe_2.predict(X_train_1)
Y_test_pred_1 = svm_pipe_2.predict(X_test_1)

print("SVM Train Accuracy: %3.2f" % np.mean(Y_train_pred_1 == Y_train_1))
print("SVM Test Accuracy: %3.2f" % np.mean(Y_test_pred_1 == Y_test_1))
```

Note that the code above will not run because it's missing several variables, including X_train_1, etc. Notice that GridSearchCV is created in the Pipeline and takes svm.LinearSVC as a parameter.

The "params" variable is a dictionary that specifies which parameters to tune (in this case just simply 'C'), and what values to try (here \[1, 10\] means to try the two values 1 and 10). You can also specify labels instead of numeric values. E.g.:

```
params = {'kernel':('linear', 'poly')}
```

GridSearchCV will try 'linear' and 'poly', specified in the tuple after 'kernel', when tuning the SVM.
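
To see exactly which candidates a parameter dictionary expands to, scikit-learn's `ParameterGrid` can enumerate them. The sketch below combines the two example dictionaries above into one; combining them is an assumption made for illustration only.

```python
from sklearn.model_selection import ParameterGrid

params = {'C': [1, 10], 'kernel': ('linear', 'poly')}

# ParameterGrid enumerates every combination that a grid search would evaluate
combos = list(ParameterGrid(params))
for combo in combos:
    print(combo)
print("Number of candidates:", len(combos))
```

Each listed value is tried as-is, so two values of C and two kernels give four candidates in total.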

Use the code cell below to create a Pipeline that uses SVC (instead of LinearSVC), and applies GridSearchCV to tune the following hyperparameters:

    - C: From 1 to 10 as before
    - kernel: 'linear', 'poly', 'rbf', 'sigmoid'
    - decision_function_shape: 'ovr', 'ovo'
    
***
_Question 8: Consult the SVC documentation and write down below what each hyperparameter means. Also what is a 'decision function shape', and what is the difference between 'ovr' and 'ovo' in our decision function shape?_
---

Hyperparameter meanings:

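C: Regularization parameter that controls the trade-off between fitting the training data closely (low training error) and keeping a wide margin (a simpler model that tends to generalize better). The strength of the regularization is inversely proportional to C, so smaller values of C increase the regularization strength.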

kernel: Specifies the type of kernel to be used for the decision function. It can be 'linear', 'poly', 'rbf' (radial basis function), or 'sigmoid'.

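decision_function_shape: Determines the shape of the decision function that SVC returns for multi-class classification.

'ovr' (one-vs-rest) returns one score per class, giving a decision function of shape (n_samples, n_classes), while 'ovo' (one-vs-one) returns one score per pair of classes, giving shape (n_samples, n_classes * (n_classes - 1) / 2). According to the SVC documentation, the model is always trained internally with the one-vs-one strategy; the 'ovr' output is constructed from the one-vs-one scores.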
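
The effect of `decision_function_shape` on the output can be checked directly. The sketch below uses a hypothetical 4-class toy dataset from `make_blobs` rather than iris, because with three classes both settings happen to produce three columns.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical 4-class toy data (with 3 classes both shapes have 3 columns)
X, y = make_blobs(n_samples=200, centers=4, random_state=0)

ovr = SVC(kernel="linear", decision_function_shape="ovr").fit(X, y)
ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)

ovr_scores = ovr.decision_function(X)   # one column per class
ovo_scores = ovo.decision_function(X)   # one column per pair of classes

print("ovr shape:", ovr_scores.shape)   # (200, 4)
print("ovo shape:", ovo_scores.shape)   # (200, 6)
```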


***

Remember to print out the training and testing accuracies.

In [6]:
"""
    Enter your code for part 1.3.2 here.
"""
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Load the Iris dataset
iris_data = load_iris()
X = iris_data.data
Y = iris_data.target

# Split the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Define the hyperparameters to tune
params = {
    'svm__C': [1, 10],
    'svm__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'svm__decision_function_shape': ['ovr', 'ovo']
}

# Create a pipeline with StandardScaler and SVC
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(max_iter=100000))
])

# Apply GridSearchCV to find the best hyperparameters
grid_search = GridSearchCV(pipeline, params)
grid_search.fit(X_train, Y_train)

# Get the best estimator and evaluate on the training and testing data
best_estimator = grid_search.best_estimator_
train_accuracy = best_estimator.score(X_train, Y_train)
test_accuracy = best_estimator.score(X_test, Y_test)

# Print the training and testing accuracies
print("Training Accuracy:", train_accuracy)
print("Testing Accuracy:", test_accuracy)



Training Accuracy: 0.9666666666666667
Testing Accuracy: 1.0
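
The cell above reports only the final accuracies. A sketch of how the chosen hyperparameters and the cross-validated score could also be inspected, assuming the same pipeline, parameter grid and split as above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, Y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

pipeline = Pipeline([("scaler", StandardScaler()), ("svm", SVC(max_iter=100000))])
params = {
    "svm__C": [1, 10],
    "svm__kernel": ["linear", "poly", "rbf", "sigmoid"],
    "svm__decision_function_shape": ["ovr", "ovo"],
}

grid_search = GridSearchCV(pipeline, params)   # default: 5-fold cross-validation
grid_search.fit(X_train, Y_train)

# best_score_ is the mean cross-validated accuracy, not the training accuracy
print("Best parameters:", grid_search.best_params_)
print("Best cross-validated accuracy: %.4f" % grid_search.best_score_)
print("Candidates evaluated:", len(grid_search.cv_results_["params"]))
```

Placing the whole Pipeline inside GridSearchCV, as the cell above does, means the scaler is refitted within each cross-validation fold.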


### 1.4 Summary ###


***
_Question 9: Summarize in the table given below all the training and testing accuracies you've had in the previous section.  Give your thoughts on the performance of the various classifiers, and on using GridSearchCV to search for the right hyperparameters._
---
| Method                          | Training Accuracy | Testing Accuracy |
|:-------------------------------:|:-----------------:|:----------------:|
| Linear Regression (1 var, RMSE) | 6.1922            | 6.2429           |
| LR (2 var, RMSE)                | 5.4770            | 5.6286           |
| LR (3 var, RMSE)                | 4.9782            | 5.6952           |
| Naive Bayes                     | 0.95              | 1.0              |
| LinearSVC                       | 0.9167            | 0.9667           |
| SVC (GridSearch)                | 0.9667            | 1.0              |

***
