In this exercise, you will use your new knowledge to propose a solution to a real-world scenario.  To succeed, you will need to import data into Python, answer questions using the data, and generate **scatter plots**, **histograms**, and **KDE plots** to understand patterns in the data.

## Scenario

You work for a hospital ... that wants to begin preliminary work to design an algorithm that can classify a tumor as either benign or malignant. The hospital gives you this dataset ...

## Setup

Run the next cell to import and configure the Python libraries that you need to complete the exercise.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print("Setup Complete")

The questions below will give you feedback on your work. Run the following cell to set up our feedback system.

In [None]:
# Set up code checking
from learntools.core import binder
binder.bind(globals())
#from learntools.machine_learning.ex1 import *
print("Setup Complete")

## Step 1: Load the data

Read the cancer data file into a DataFrame called `cancer_data`.

In [None]:
# Path of the file to read
cancer_filepath = "../data/data-viz-easy/cancer.csv"

# Fill in the line below to read the file into a variable cancer_data
cancer_data = ____

# Run the line below with no changes to check that you've loaded the data correctly
# step_1.check()

In [None]:
#%%RM_IF(PROD)%%
cancer_data = pd.read_csv(cancer_filepath, index_col="id")
step_1.assert_check_passed()

In [None]:
# Lines below will give you a hint or solution code
#_COMMENT_IF(PROD)_
step_1.hint()
#_COMMENT_IF(PROD)_
step_1.solution()

## Step 2: Review the data

Use a Python command to print the first 5 rows of the data.

In [None]:
# Print the first five rows of the data 
cancer_data.head()

You will then be told briefly about the different columns ...

Use the first 5 rows of the data to answer the questions below.
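The same kind of question can also be answered programmatically rather than by eye. The sketch below uses a small made-up DataFrame as a stand-in for `cancer_data`; the values and the `"M"`/`"B"` labels in `diagnosis` are assumptions for illustration and may not match the real file.

```python
import pandas as pd

# Toy stand-in for cancer_data; values and labels are made up for illustration
toy = pd.DataFrame(
    {
        "diagnosis": ["M", "B", "M", "B", "B", "M"],
        "radius_worst": [25.4, 12.1, 23.6, 13.0, 11.8, 22.5],
        "texture_worst": [17.3, 20.2, 25.5, 18.9, 22.0, 26.5],
    }
)

first_five = toy.head()   # head() returns the first 5 rows by default
n_cols = toy.shape[1]     # shape is (number of rows, number of columns)

# Count rows in the first five whose (assumed) label is "M"
n_malig_first_five = (first_five["diagnosis"] == "M").sum()

print(n_cols, n_malig_first_five)
```

Note that a column used as the index (for example via `index_col`) is not counted in `shape[1]`.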

In [None]:
# Fill in the line below: How many columns does the data have?
num_cols = ____

# Fill in the line below: In the first five rows, how many tumors are malignant?
num_malig = ____

# step_2.check()

In [None]:
# Lines below will give you a hint or solution code
# step_2.hint()
# step_2.solution()

## Step 3: ...

The hospital would like your help in identifying whether `radius_worst` and `texture_worst` are useful for classifying tumors as benign or malignant.  

Create a scatter plot to show the relationship between `radius_worst` (on the horizontal "x" axis) and `texture_worst` (on the vertical "y" axis).  A colleague further suggests that you color-code the points based on `diagnosis`.  Use the cell below to create this plot.

In [None]:
sns.scatterplot(x=cancer_data.radius_worst, y=cancer_data.texture_worst, hue=cancer_data.diagnosis)
plt.show()

Another colleague recommends that you create two different plots to address this question, instead of just one.  In particular, 
- the **first plot** should use `sns.swarmplot()` (or `sns.stripplot()`) to show the relationship between `radius_worst` and `diagnosis`.  
- the **second plot** should use `sns.swarmplot()` (or `sns.stripplot()`) to show the relationship between `texture_worst` and `diagnosis`.

In [None]:
# first plot
sns.swarmplot(x=cancer_data.diagnosis, y=cancer_data.radius_worst)
plt.show()

# second plot
sns.swarmplot(x=cancer_data.diagnosis, y=cancer_data.texture_worst)
plt.show()

**Question 1**: Based on the plots you created above, what is your opinion: 
- Does `radius_worst` seem useful for classifying tumors?
- Does `texture_worst` seem useful for classifying tumors?
- Once you've decided the answers to the first two questions, which plots do you think most clearly support your reasoning?

In [None]:
# Line below will give you a hint or solution code
# step_3.hint()
# step_3.solution()

## Step 4: ...

One doctor would like your feedback on some preliminary theories.  She believes:
- tumors with larger values for `compactness_mean` also typically have larger values for `perimeter_worst`, and
- the value of `perimeter_worst` is typically greater than 150.  It's still possible (but unlikely!) to see a value for `perimeter_worst` less than 150.

Or, at least, that is what she has seen in her patients!  She is worried that she hasn't seen enough patients, so she is unsure whether her ideas apply more broadly to the human population.  

Use `sns.jointplot()` to create a plot that checks whether the data shows the same behavior.

In [None]:
# your code here
sns.jointplot(x=cancer_data.perimeter_worst, y=cancer_data.compactness_mean, kind="reg")
plt.show()

**Question 2**: Use the above plot to determine if the doctor's theories are correct:
- Do tumors with larger values for `compactness_mean` also typically have larger values for `perimeter_worst`?
- Is the value of `perimeter_worst` typically more than 150?
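A plot is the main tool here, but two simple numbers can back up a visual impression. The sketch below is one possible cross-check, not part of the exercise. It uses made-up values in a toy DataFrame and computes the Pearson correlation between the two columns and the share of rows with `perimeter_worst` above 150.

```python
import pandas as pd

# Made-up values for illustration only; not taken from the real dataset
toy = pd.DataFrame(
    {
        "compactness_mean": [0.05, 0.08, 0.10, 0.15, 0.20, 0.28],
        "perimeter_worst": [70.0, 85.0, 98.0, 120.0, 155.0, 180.0],
    }
)

# Series.corr uses the Pearson correlation by default
corr = toy["compactness_mean"].corr(toy["perimeter_worst"])

# The mean of a boolean Series is the fraction of True values
share_above_150 = (toy["perimeter_worst"] > 150).mean()

print(f"correlation: {corr:.2f}, share above 150: {share_above_150:.2f}")
```

A positive correlation would be consistent with the first theory, and a share well below one half would count against the second. Run on the real data, the numbers may of course look different.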

In [None]:
# Line below will give you a hint or solution code
# step_4.hint()
# step_4.solution()

## Step 5: ...

One doctor is excited about the possibility of using the value of `radius_worst` to gauge the likelihood of a tumor being benign or malignant.  He knows that larger values for `radius_worst` typically denote a malignant tumor, whereas smaller values usually signify that the tumor is benign.  He'd like your help with identifying a possible threshold value and has identified three possible options:
- We can classify tumors with `radius_worst` > 10 as malignant (where `radius_worst` < 10 is benign).
- We can classify tumors with `radius_worst` > 17 as malignant (where `radius_worst` < 17 is benign).
- We can classify tumors with `radius_worst` > 22 as malignant (where `radius_worst` < 22 is benign).

He knows that this system is unlikely to be perfect, but he'd love your help in determining the best option.  To address his question, we've broken the original CSV dataset into two separate datasets, which can be loaded into Python in the code cell below.  One dataset contains all of the information relating to the benign tumors, and the other contains all of the information relating to the malignant tumors.

In [None]:
cancer_b_file_path = "../data/data-viz-easy/cancer_b.csv"
cancer_m_file_path = "../data/data-viz-easy/cancer_m.csv"

cancer_data_b = pd.read_csv(cancer_b_file_path)
cancer_data_m = pd.read_csv(cancer_m_file_path)

In [None]:
cancer_data_b.head()

In [None]:
cancer_data_m.head()

emphasize that `cancer_data_m` and `cancer_data_b` only contain rows from `cancer_data` ... split into two dataframes ...
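For intuition, a split like this can be produced with boolean masks. The sketch below works on a toy DataFrame and assumes the labels in `diagnosis` are `"B"` and `"M"`; how the two provided CSV files were actually created is not stated here.

```python
import pandas as pd

# Toy stand-in for cancer_data; the "B"/"M" labels are an assumption
toy = pd.DataFrame(
    {
        "diagnosis": ["M", "B", "M", "B", "B"],
        "radius_worst": [25.4, 12.1, 23.6, 13.0, 11.8],
    }
)

# Boolean masks keep only the rows where the condition holds
toy_b = toy[toy["diagnosis"] == "B"]
toy_m = toy[toy["diagnosis"] == "M"]

# Every original row lands in exactly one of the two subsets
print(len(toy_b), len(toy_m), len(toy))
```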

In the cell below, use `sns.kdeplot()`, along with the two new DataFrames, to address the doctor's question.

In [None]:
# your code here
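# `fill=True` replaces the deprecated `shade=True` in recent seaborn versions
sns.kdeplot(data=cancer_data_b.radius_worst, fill=True, label="Benign")
sns.kdeplot(data=cancer_data_m.radius_worst, fill=True, label="Malignant")
plt.legend()  # show the "Benign" / "Malignant" labels
plt.show()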

**Question 3**: Which threshold for `radius_worst` makes the most sense: 10, 17, or 22?
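One way to make "makes the most sense" concrete is to count how many tumors each candidate rule would misclassify. The sketch below does this for made-up values, so its output says nothing about the real data. The helper `error_rate` is hypothetical and only illustrates the idea.

```python
import pandas as pd

# Made-up radius_worst values, for illustration only
benign = pd.Series([11.0, 12.5, 13.2, 14.8, 16.1])
malignant = pd.Series([16.5, 18.9, 21.0, 23.4, 25.7])


def error_rate(threshold, benign, malignant):
    """Share of tumors misclassified by 'radius_worst > threshold is malignant'."""
    false_pos = (benign > threshold).sum()      # benign tumors called malignant
    false_neg = (malignant <= threshold).sum()  # malignant tumors called benign
    return (false_pos + false_neg) / (len(benign) + len(malignant))


# Compare the three candidate thresholds on the toy values
rates = {t: error_rate(t, benign, malignant) for t in (10, 17, 22)}
print(rates)
```

To apply the same idea to the exercise data, pass `cancer_data_b.radius_worst` and `cancer_data_m.radius_worst` in place of the toy Series. In a medical setting the two kinds of error may not be equally costly, so the lowest overall error rate is only one possible criterion.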

In [None]:
# Line below will give you a hint or solution code
# step_5.hint()
# step_5.solution()