# Assignment: Decision Tree Model 

## Data: Use the Breast Cancer Wisconsin (Diagnostic) dataset 

### Steps:
- Load the Breast Cancer Wisconsin (Diagnostic) data into a Pandas DataFrame.
- Split the data into a training set and a test set.
- Create a DecisionTreeClassifier model.
- Fit the model to the training set.
- Evaluate the model on the test set.
- Report the accuracy of the model.

Explanation:

Load the data into a Pandas DataFrame. The first step is to load the dataset with scikit-learn's `load_breast_cancer()` and wrap it in a Pandas DataFrame, as in the code below. Enter the missing code.

In [6]:
# Load the data into a Pandas DataFrame.
import pandas as pd

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split


# The `load_breast_cancer()` function returns a dictionary-like `Bunch` object containing the data and the target labels.
# The Breast Cancer Wisconsin (Diagnostic) dataset is a collection of data from breast cancer patients.
# The data contains 569 instances, each with 30 features. The features are measurements of the tumor,
# such as the radius, texture, and perimeter. The target variable is whether the tumor is malignant (0) or benign (1).

breast_cancer_data = load_breast_cancer()
df = pd.DataFrame(breast_cancer_data.data, columns=breast_cancer_data.feature_names)
df['target'] = breast_cancer_data.target

print(df.head())

   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...  worst texture  worst perimeter  worst area  \
0             
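
The figures quoted in the loading step (569 instances, 30 features, a binary target) can be checked directly. The sketch below is illustrative only: `df_check` and `class_counts` are names introduced here, not part of the assignment, and it rebuilds the DataFrame so it runs on its own.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df_check = pd.DataFrame(data.data, columns=data.feature_names)
df_check["target"] = data.target

# 30 feature columns plus the added target column.
print(df_check.shape)

# Translate the integer labels into their class names and count each class.
label_names = {i: str(name) for i, name in enumerate(data.target_names)}
class_counts = df_check["target"].map(label_names).value_counts()
print(class_counts)
```

Counting the classes before splitting shows how balanced the dataset is, which helps when interpreting an accuracy figure later.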

Create a new DataFrame called `X` that contains all of the features in the original DataFrame, except for the `target` column. Use `.drop()`.
Create a Pandas Series called `y` that contains the target values from the original DataFrame.

Split the data into a training set and a test set. The training set will be used to train the model, and the test set will be used to evaluate the model. 

Here the `train_test_split()` function is called with four arguments:
- `X`: The DataFrame containing the features.
- `y`: The Series containing the target values.
- `test_size`: The fraction of the data to use for the test set.
- `random_state`: A seed for the random number generator, so the split is reproducible.

The function returns four objects, in this order. Because the inputs are Pandas objects, the feature sets come back as DataFrames and the target sets as Series:
- `X_train`: The training set features.
- `X_test`: The test set features.
- `y_train`: The training set target values.
- `y_test`: The test set target values.

In [8]:
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
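
A quick way to confirm the split behaves as described is to inspect the sizes and types of the returned objects. This is a minimal, self-contained sketch; the `_demo`, `_tr`, and `_te` names are introduced here only to avoid clashing with the notebook's variables.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_demo = pd.DataFrame(data.data, columns=data.feature_names)
y_demo = pd.Series(data.target, name="target")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Pandas inputs come back as Pandas objects.
print(type(X_tr).__name__, type(y_tr).__name__)

# With test_size=0.2, roughly 20% of the 569 rows go to the test set.
print(X_tr.shape, X_te.shape)

# Features and targets stay aligned row by row after the shuffle.
print(X_tr.index.equals(y_tr.index))
```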

Create a DecisionTreeClassifier model. Once the data is split, create a `DecisionTreeClassifier`. Passing `random_state` makes the fitted tree reproducible between runs.

In [10]:
# Create a DecisionTreeClassifier model.
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=42)

Fit the model to the training set. Once the model is created, fit it to the training set with the following code:

In [12]:
# Fit the model to the training set.
# The `fit()` method takes two arguments:
#   * The training features.
#   * The training target labels.
# It learns the tree from the training data and returns the fitted estimator.

clf.fit(X_train, y_train)
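
After fitting, the estimator exposes what it learned. The sketch below is one possible way to look at the result, not a required part of the assignment: it prints the depth and leaf count of the tree and the features it relied on most. The names `clf_demo` and `top_features` are introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_demo = pd.DataFrame(data.data, columns=data.feature_names)
y_demo = pd.Series(data.target, name="target")
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() returns the estimator itself, so the calls can be chained.
clf_demo = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

# Size of the learned tree.
print("Depth:", clf_demo.get_depth())
print("Leaves:", clf_demo.get_n_leaves())

# Importance scores are normalized to sum to 1 across all features.
importances = pd.Series(clf_demo.feature_importances_, index=X_demo.columns)
top_features = importances.sort_values(ascending=False).head(5)
print(top_features)
```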

Evaluate the model on the test set. Once the model is fitted, evaluate it on data it has not seen during training. The following code predicts labels for the test set:

In [14]:
# Evaluate the model on the test set.
# The `predict()` function takes one argument:
#   * The test data.
# The function returns a NumPy array with the predicted target labels.
y_pred = clf.predict(X_test)

In [15]:
# Calculate the accuracy of the model.
# The `accuracy_score()` function takes two arguments, in this order:
#   * The actual target labels (y_true).
#   * The predicted target labels (y_pred).
# The function returns the fraction of predictions that match the actual labels.

from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
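
Accuracy is simply the fraction of test samples whose predicted label equals the actual label. The self-contained sketch below computes it by hand and compares the result with `accuracy_score()` and the classifier's `score()` method. The variable names (`manual_accuracy` and the `_demo` suffixes) are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_demo = pd.DataFrame(data.data, columns=data.feature_names)
y_demo = pd.Series(data.target, name="target")
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

clf_demo = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
y_pred_demo = clf_demo.predict(X_te)

# Fraction of correct predictions: correct count divided by test set size.
manual_accuracy = np.mean(y_pred_demo == y_te.to_numpy())

library_accuracy = accuracy_score(y_te, y_pred_demo)
score_accuracy = clf_demo.score(X_te, y_te)

print(manual_accuracy, library_accuracy, score_accuracy)
```

All three values should agree, since `score()` on a classifier reports mean accuracy on the data it is given.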

Report the accuracy of the model. The accuracy can be reported from `accuracy_score()` or from the classifier's own `score()` method, which computes the same metric:


In [17]:
# Report the accuracy of the model.
print('Accuracy (from predictions):', accuracy)


Accuracy (from predictions): 0.9473684210526315


In [27]:
print(f'Train accuracy: {clf.score(X_train, y_train):.4f}')
print(f'Test accuracy:  {clf.score(X_test, y_test):.4f}')

Train accuracy: 1.0000
Test accuracy:  0.9474
