# Workshop 4

It is time to be a Machine Learning Engineer. Pay close attention to the instructions.

## Section 1

For this assignment, you will be using the _Breast Cancer Wisconsin_ (Diagnostic) Database to create a classifier that can help diagnose patients. First, read through the description of the dataset (below).


In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

print(cancer.DESCR)

### Problem 1.1

_Scikit-learn_ works with lists, NumPy arrays, SciPy sparse matrices, and pandas DataFrames, so converting the dataset to a DataFrame is not necessary for training this model. Using a _DataFrame_ does, however, make many tasks easier, such as data munging, so let's practice creating a classifier with a pandas DataFrame.


Convert the sklearn.dataset `cancer` to a DataFrame. 

_This function should return a_ `(569, 31)` _DataFrame with:_

```
columns = 
    ['mean radius', 'mean texture', 'mean perimeter', 'mean area',
    'mean smoothness', 'mean compactness', 'mean concavity',
    'mean concave points', 'mean symmetry', 'mean fractal dimension',
    'radius error', 'texture error', 'perimeter error', 'area error',
    'smoothness error', 'compactness error', 'concavity error',
    'concave points error', 'symmetry error', 'fractal dimension error',
    'worst radius', 'worst texture', 'worst perimeter', 'worst area',
    'worst smoothness', 'worst compactness', 'worst concavity',
    'worst concave points', 'worst symmetry', 'worst fractal dimension',
    'target']

index = RangeIndex(start=0, stop=569, step=1)
```
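
For comparison, recent scikit-learn releases can build an equivalent DataFrame themselves. The sketch below assumes scikit-learn 0.23 or later, where `load_breast_cancer` accepts `as_frame=True`; the name `cancer_frame` is illustrative only. Note that the `target` column comes back as integers here, whereas stacking the arrays with NumPy yields floats.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True asks scikit-learn to return pandas objects instead of arrays.
bunch = load_breast_cancer(as_frame=True)

# .frame holds the 30 feature columns followed by the 'target' column.
cancer_frame = bunch.frame
print(cancer_frame.shape)
print(cancer_frame.columns[-1])
```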

In [None]:
def answer_one():
    # YOUR CODE HERE
    # Stack the 30 feature columns and the target column side by side.
    cancer_df = pd.DataFrame(
        np.c_[cancer['data'], cancer['target']],
        columns=np.append(cancer['feature_names'], ['target']),
    )
    return cancer_df
answer_one()

In [None]:
# space for professor tests

### Problem 1.2

What is the class distribution? (i.e. how many instances of `malignant` and how many `benign`?)

_This function should return a Series named `target` of length 2 with integer values and index =_ `['malignant', 'benign']`
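
As a cross-check, the class counts can also be read directly from the raw label array. The sketch below assumes that label `0` corresponds to `malignant` and label `1` to `benign`, which is the order given by `cancer.target_names`; the names `counts` and `distribution` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Count how many samples carry each integer label (0 and 1).
counts = np.bincount(cancer.target)

# Attach the class names in label order, so position 0 maps to 'malignant'.
distribution = pd.Series(counts, index=cancer.target_names, name="target")
print(distribution)
```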

In [None]:
def answer_two():
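    # YOUR CODE HERE
    df = answer_one()
    # Label 0 is malignant and label 1 is benign (the order of cancer.target_names).
    counts = df["target"].astype(int).value_counts().reindex([0, 1])
    target = pd.Series(counts.values, index=["malignant", "benign"], name="target")
    return target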

In [None]:
# space for professor tests

### Problem 1.3

Split the DataFrame into `X` (the data) and `y` (the labels).

_This function should return a tuple of length 2: `(X, y)`, where:_

- _`X` has shape `(569, 30)`_
- _`y` has shape `(569,)`._

In [None]:
def answer_three():
    # YOUR CODE HERE
    df = answer_one()   
    X = df[df.columns[:-1]]
    y = df["target"]
    return X , y

In [None]:
# space for professor tests

### Problem 1.4

Using `train_test_split`, split `X` and `y` into training and test sets (`X_train`, `X_test`, `y_train`, and `y_test`).

__Set the random number generator state to 0 using `random_state=0` to make sure your results match the autograder!__

_This function should return a tuple of length 4: `(X_train, X_test, y_train, y_test)`, where:_

- _`X_train` has shape `(426, 30)`_
- _`X_test` has shape `(143, 30)`_
- _`y_train` has shape `(426,)`_
- _`y_test` has shape `(143,)`_
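
The expected shapes follow from the default behaviour of `train_test_split`: when neither `test_size` nor `train_size` is given, 25% of the rows (rounded up) go to the test set. A minimal sketch on placeholder data with the same number of rows; `X_demo` and `y_demo` are made-up arrays, not the cancer data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same row count as the cancer dataset.
X_demo = np.zeros((569, 30))
y_demo = np.zeros(569)

# No test_size given, so scikit-learn falls back to a 25% test split.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

print(X_tr.shape, X_te.shape)  # (426, 30) (143, 30)
print(y_tr.shape, y_te.shape)  # (426,) (143,)
```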

In [None]:
from sklearn.model_selection import train_test_split

def answer_four():
    # YOUR CODE HERE
    X,y = answer_three()
    X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=426,test_size=143,random_state=0)
    return (X_train,X_test,y_train,y_test)

In [None]:
# space for professor tests

### Problem 1.5

Using KNeighborsClassifier, fit a k-nearest neighbors (knn) classifier with `X_train`, `y_train` and using one nearest neighbor (`n_neighbors = 1`).

_This function should return a `sklearn.neighbors.KNeighborsClassifier`._

In [None]:
from sklearn.neighbors import KNeighborsClassifier

def answer_five():
    # YOUR CODE HERE
    X_train,X_test,y_train,y_test = answer_four()
    cl_kn = KNeighborsClassifier(n_neighbors=1)
    cl_kn.fit(X_train,y_train)
    return cl_kn

In [None]:
# space for professor tests

### Problem 1.6

Using your __knn classifier__, predict the class label using the mean value for each feature.

___Hint:___ _You can use `cancer_df.mean()[:-1].values.reshape(1, -1)`, which gets the mean value for each feature, ignores the target column, and reshapes the data from 1 dimension to 2 (necessary for the predict method of KNeighborsClassifier)._
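
To see what the reshape in the hint does, here is a small sketch on a stand-in vector. The array `feature_means` is hypothetical and simply has 30 entries, one per feature.

```python
import numpy as np

# Hypothetical feature vector standing in for the 30 per-feature means.
feature_means = np.arange(30, dtype=float)
print(feature_means.shape)   # (30,)

# reshape(1, -1): one row, and let NumPy infer the number of columns.
single_sample = feature_means.reshape(1, -1)
print(single_sample.shape)   # (1, 30)
```

The 2-dimensional result is read as one sample with 30 features, which is the layout scikit-learn estimators expect as input to `predict`.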

In [None]:
def answer_six():
    # YOUR CODE HERE
    df = answer_one()
    knn = answer_five()
    df_mean = df.mean()[:-1].values.reshape(1,-1)
    predict = knn.predict(df_mean)
    return predict

In [None]:
# space for professor tests

### Problem 1.7

Using your __knn classifier__, predict the class labels for the test set `X_test`.

_This function should return a numpy array with shape `(143,)` and values either `0.0` or `1.0`._

In [None]:
def answer_seven():
    # YOUR CODE HERE
    X_train,X_test,y_train,y_test = answer_four()
    knn = answer_five()
    predict = knn.predict(X_test)
    return predict

In [None]:
# space for professor tests

### Problem 1.8

Find the score (_mean accuracy_) of your __knn classifier__ using `X_test` and `y_test`.

_This function should return a float between $0$ and $1$._
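
Mean accuracy is simply the share of predictions that match the true labels. The sketch below illustrates this on a tiny made-up dataset (`X_demo`, `y_demo`, `X_new`, and `y_new` are invented for the example) and compares a manual computation with the classifier's `score` method.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Tiny made-up dataset: one feature, two well-separated classes.
X_demo = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

knn_demo = KNeighborsClassifier(n_neighbors=1).fit(X_demo, y_demo)

# New points; the last label deliberately disagrees with its neighbourhood.
X_new = np.array([[0.5], [1.5], [10.5], [11.5]])
y_new = np.array([0, 0, 1, 0])

# Fraction of predictions equal to the true labels.
manual_accuracy = np.mean(knn_demo.predict(X_new) == y_new)
print(manual_accuracy, knn_demo.score(X_new, y_new))
```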

In [None]:
def answer_eight():  
    # YOUR CODE HERE
    X_train,X_test,y_train,y_test = answer_four()
    knn = answer_five()
    score = knn.score(X_test,y_test)
    return score

In [None]:
# space for professor tests

### Problem 1.9

Use the plotting function below to visualize the different prediction scores between the _training_ and _test sets_, as well as for malignant and benign cells.

In [None]:
import matplotlib.pyplot as plt

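def answer_nine():
    # YOUR CODE HERE
    X_train, X_test, y_train, y_test = answer_four()
    knn = answer_five()
    # Accuracy per set, split by true class (0 = malignant, 1 = benign).
    scores = {
        "Malignant\nTraining": knn.score(X_train[y_train == 0], y_train[y_train == 0]),
        "Benign\nTraining": knn.score(X_train[y_train == 1], y_train[y_train == 1]),
        "Malignant\nTest": knn.score(X_test[y_test == 0], y_test[y_test == 0]),
        "Benign\nTest": knn.score(X_test[y_test == 1], y_test[y_test == 1]),
    }
    plt.bar(list(scores.keys()), list(scores.values()))
    plt.title("Prediction scores: training vs. test")
    plt.xlabel("Sets")
    plt.ylabel("Score")
    plt.show()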
    
answer_nine()

## Section 2

In this section, you will use a _.csv_ dataset to evaluate classifier performance.

In [None]:
# dependencies
import numpy as np
import pandas as pd

### Problem 2.1

Import the data from `assets/fraud_data.csv`. What fraction of the observations in the dataset are instances of fraud?

_This function should return a float between $0$ and $1$._
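
For a label column that contains only zeros and ones, the fraction of positive cases equals the column mean. A minimal sketch on made-up labels, assuming (as in the fraud data) a `Class` column where `1` marks fraud; the `demo` frame is invented for the example.

```python
import pandas as pd

# Made-up labels: 1 marks fraud, 0 marks a legitimate transaction.
demo = pd.DataFrame({"Class": [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]})

# For a 0/1 column, the mean equals (number of ones) / (number of rows).
fraud_fraction = demo["Class"].mean()
print(fraud_fraction)
```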

In [None]:
def answer_eleven():
    # YOUR CODE HERE    
    fraud_df = pd.read_csv("assets/fraud_data.csv")
    fraud_ist = fraud_df.Class.sum()/fraud_df.Class.count()
    return fraud_ist
answer_eleven()

In [None]:
# space for professor tests

In [None]:
# Use X_train, X_test, y_train, y_test for all of the following questions
from sklearn.model_selection import train_test_split

df = pd.read_csv('assets/fraud_data.csv')

X = df.iloc[:,:-1]
y = df.iloc[:,-1]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

### Problem 2.2

Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train a dummy classifier that classifies everything as the majority class of the training data. What is the accuracy of this classifier? What is the recall?

_This function should return a tuple with two floats, i.e. `(accuracy score, recall score)`._
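
On imbalanced data, a majority-class baseline can look accurate while never detecting the rare class. The sketch below shows this on made-up labels with 95 negatives and 5 positives; `X_demo` and `y_demo` are invented, and the classifier is scored on its own training data purely for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score

# Made-up imbalanced labels: 95 negatives, 5 positives.
y_demo = np.array([0] * 95 + [1] * 5)
# The dummy classifier ignores the features, so zeros are enough here.
X_demo = np.zeros((100, 1))

dummy = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
y_pred = dummy.predict(X_demo)

accuracy = dummy.score(X_demo, y_demo)   # share of correct predictions
recall = recall_score(y_demo, y_pred)    # share of true positives that were found
print(accuracy, recall)
```

Because every prediction is the majority class `0`, none of the five positives is recovered, which is why recall matters more than accuracy for this kind of problem.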

In [None]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score
def answer_twelve():
    # YOUR CODE HERE
    # Always predict the most frequent class seen in the training labels.
    cl_dumy = DummyClassifier(strategy="most_frequent")
    cl_dumy.fit(X_train,y_train)
    acc_score = cl_dumy.score(X_test,y_test)
    y_hat = cl_dumy.predict(X_test)
    recall = recall_score(y_test,y_hat)
    return (acc_score,recall)

In [None]:
# space for professor tests

### Problem 2.3

Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train an _XGBoost_ classifier using the default parameters. What is the accuracy, recall, precision, and F1 score of this classifier?

_This function should return a tuple with four floats, i.e. `(accuracy score, recall score, precision score, f1 score)`._
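
As a reminder of what the four metrics measure, here is a small worked sketch on made-up labels and predictions (`y_true` and `y_pred` are invented and unrelated to the fraud data). The counts in the comment can be verified by hand from the two lists.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels and predictions (1 = positive class).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# From these lists: TP = 3, FN = 1, FP = 1, TN = 5.
accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
print(accuracy, recall, precision, f1)
```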

In [None]:
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from xgboost import XGBClassifier

def answer_thirteen():
    # YOUR CODE HERE
    xgb_cl = XGBClassifier()
    xgb_cl.fit(X_train,y_train)
    y_hat = xgb_cl.predict(X_test)
    accuracy = accuracy_score(y_test, y_hat)
    recall = recall_score(y_test, y_hat)
    precision = precision_score(y_test, y_hat)
    f1 = f1_score(y_test, y_hat)
    return (accuracy,recall,precision,f1)

In [None]:
# space for professor tests