In this exercise, you will use your new knowledge to propose a solution to a real-world scenario.  To succeed, you will need to import data into Python, answer questions using the data, and generate **scatter plots**, **histograms**, and **density plots** to understand patterns in the data.

## Scenario

Over the last several years, machine learning professionals have made great progress in developing algorithms that detect cancer with the same accuracy as trained medical professionals.  To learn more about this fascinating research area, watch the video below!

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('9Mz84cwVmS0', width=800, height=450)

In this exercise, you'll use your data visualization skills to gain an understanding of how to build an algorithm that can accurately classify breast cancer tumors as either **benign** (_noncancerous_) or **malignant** (_cancerous_).  Your dataset contains information collected from 569 different images, similar to the image below.

<img src="images/cancer_image.png">

## Setup

Run the next cell to import and configure the Python libraries that you need to complete the exercise.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print("Setup Complete")

The questions below will give you feedback on your work. Run the following cell to set up our feedback system.

In [None]:
# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.data_viz_easy.ex3 import *
print("Setup Complete")

## Step 1: Load the data

Read the cancer data file into a DataFrame called `cancer_data`.  Use the `'Id'` column to label the rows.
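
As a minimal sketch of what "use a column to label the rows" means, the snippet below reads a tiny in-memory CSV with `pd.read_csv()` and its `index_col` argument.  The three rows and their values are hypothetical and are not taken from the real cancer file.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for a real file on disk
toy_csv = io.StringIO(
    "Id,Diagnosis,Radius (mean)\n"
    "101,M,18.2\n"
    "102,B,12.4\n"
    "103,B,11.9\n"
)

# index_col tells pandas which column should become the row labels
toy_data = pd.read_csv(toy_csv, index_col="Id")

print(toy_data.index.name)             # name of the row-label column
print(toy_data.loc[102, "Diagnosis"])  # look up a row by its Id label
```

Once the rows are labeled this way, `'Id'` no longer appears among the regular columns, and rows can be looked up by their Id with `.loc`.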

In [None]:
# Path of the file to read
cancer_filepath = "../input/cancer.csv"

# Fill in the line below to read the file into a variable cancer_data
cancer_data = ____

# Run the line below with no changes to check that you've loaded the data correctly
step_1.check()

In [None]:
#%%RM_IF(PROD)%%
cancer_data = pd.read_csv(cancer_filepath, index_col="Id")
step_1.assert_check_passed()

In [None]:
# Lines below will give you a hint or solution code
#_COMMENT_IF(PROD)_
step_1.hint()
#_COMMENT_IF(PROD)_
step_1.solution()

## Step 2: Review the data

Use a Python command to print the first 5 rows of the data.

In [None]:
# Print the first five rows of the data 
____ # Your code here

The dataset has 569 different rows, or one for each analyzed image.  It has 31 different columns, corresponding to:
- 1 column (`'Diagnosis'`) that classifies tumors as either benign (which appears in the dataset as **`B`**) or malignant (**`M`**), and
- 30 different columns containing distinct measurements collected from the images.

Use the first 5 rows of the data to answer the questions below.
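
If you would rather compute these answers than read them off the printed table, one possible approach is sketched below on a hypothetical six-row DataFrame.  The Ids and values are made up for illustration, and the names `toy_data`, `num_malig_toy`, and `radius_103` are not part of the exercise.

```python
import pandas as pd

# Hypothetical data with the same kind of layout as the exercise dataset
toy_data = pd.DataFrame(
    {
        "Diagnosis": ["M", "B", "M", "B", "B", "M"],
        "Radius (mean)": [18.2, 12.4, 20.1, 11.9, 13.0, 19.5],
    },
    index=pd.Index([101, 102, 103, 104, 105, 106], name="Id"),
)

# head() returns the first five rows by default
first_five = toy_data.head()

# Count how many of those rows are labeled malignant
num_malig_toy = (first_five["Diagnosis"] == "M").sum()

# Look up a single value by row label (Id) and column name
radius_103 = toy_data.loc[103, "Radius (mean)"]

print(num_malig_toy, radius_103)
```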

In [None]:
# Fill in the line below: In the first five rows, how many tumors are malignant?
num_malig = ____

# Fill in the line below: What is the value for 'Radius (mean)' for the tumor with Id 842517?
mean_radius = ____

step_2.check()

In [None]:
#%%RM_IF(PROD)%%
num_malig = 5
mean_radius = 20.57
step_2.assert_check_passed()

In [None]:
# Lines below will give you a hint or solution code
#_COMMENT_IF(PROD)_
step_2.hint()
#_COMMENT_IF(PROD)_
step_2.solution()

## Step 3: Detect useful columns

#### Part A
One useful first step towards understanding whether the measured values ... The hospital would like your help in identifying whether `'Radius (worst)'` and `'Texture (worst)'` are useful for classifying tumors as benign or malignant.

In the code cell below, create a scatter plot to show the relationship between `'Radius (worst)'` (on the horizontal x-axis) and `'Texture (worst)'` (on the vertical y-axis).  Use the `'Diagnosis'` column to color-code the points. 

In [None]:
# Scatter plot showing the relationship between 'Radius (worst)', 'Texture (worst)', and 'Diagnosis'
____ # Your code here

# Check your answer
step_3.check()

In [None]:
#%%RM_IF(PROD)%%
sns.scatterplot(x=cancer_data['Radius (worst)'], y=cancer_data['Texture (worst)'], hue=cancer_data['Diagnosis'])
step_3.assert_check_passed()

In [None]:
# Lines below will give you a hint or solution code
#_COMMENT_IF(PROD)_
step_3.hint()
#_COMMENT_IF(PROD)_
step_3.solution_plot()

#### Part B
Based on the plot you created above, which column is more useful for classifying tumors:  `'Radius (worst)'`, or `'Texture (worst)'`?

In [None]:
# get answer here

#### Part C

Use the code cell below to create a categorical scatter plot that shows the relationship between `'Diagnosis'` and the column that you identified in **Part B**.

In [None]:
# Scatter plot showing the relationship between 'Diagnosis' and (a useful column)
____ # Your code here

# Check your answer
# step_3c.check()

In [None]:
#%%RM_IF(PROD)%%
# Scatter plot showing the relationship between 'Diagnosis' and 'Radius (worst)'
sns.swarmplot(x=cancer_data['Diagnosis'], y=cancer_data['Radius (worst)'])

## Step 4: ...

One doctor would like your feedback on some preliminary theories.  She believes:
- tumors with larger values for `'Compactness (mean)'` also typically have larger values for `'Perimeter (worst)'`, and
- the value of `'Perimeter (worst)'` is typically greater than 150.  It's still possible (but unlikely!) to see a value for `'Perimeter (worst)'` less than 150.

Create a scatter plot with a regression line to assess the correlation ...

In [None]:
# your code here
sns.regplot(x=cancer_data['Perimeter (worst)'], y=cancer_data['Compactness (mean)'])

In [None]:
# Lines below will give you a hint or solution code
# step_4.hint()
# step_4.solution()

#### Part B 

Do tumors with larger values for `'Compactness (mean)'` also typically have larger values for `'Perimeter (worst)'`?
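
A regression plot answers this visually.  As a rough numerical cross-check, you could also compute the Pearson correlation between the two columns, and the share of values above 150 for the doctor's second theory.  The sketch below uses two short hypothetical Series rather than the real data, so the numbers it prints say nothing about the actual tumors.

```python
import pandas as pd

# Hypothetical measurements for five tumors (not real data)
compactness = pd.Series([0.05, 0.08, 0.10, 0.15, 0.20])
perimeter = pd.Series([80, 95, 100, 130, 160])

# Pearson correlation: values near +1 mean the two tend to rise together
corr = compactness.corr(perimeter)

# Fraction of tumors whose perimeter exceeds 150
share_above_150 = (perimeter > 150).mean()

print(round(corr, 3), share_above_150)
```

A clearly positive correlation would support the first theory, while a small share above 150 would count against the second.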

In [None]:
# answer

## Step 5: ...

In this step, you'll take a stab at creating a rough algorithm to see how well it performs as a baseline.

One doctor is excited about the possibility of using the value of `radius_worst` to gauge the likelihood of a tumor being benign or malignant.  He knows that larger values for `radius_worst` typically denote a malignant tumor, whereas smaller values usually signify that the tumor is benign.  He'd like your help with identifying a possible threshold value and has identified three possible options:
- We can classify tumors with `radius_worst` > 10 as malignant (where `radius_worst` < 10 is benign).
- We can classify tumors with `radius_worst` > 17 as malignant (where `radius_worst` < 17 is benign).
- We can classify tumors with `radius_worst` > 22 as malignant (where `radius_worst` < 22 is benign).

He knows that this system is unlikely to be perfect, but he'd love your help in determining the best option.  To address his question, we've broken the original CSV dataset into two separate datasets, which can be loaded into Python in the code cell below.  One dataset contains all of the information relating to the benign tumors, and the other contains all of the information relating to the malignant tumors.
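
For context, a split like this can be produced with boolean indexing on the diagnosis column.  The snippet below is only a sketch on a hypothetical four-row DataFrame; it is an assumption about how such a split could be done, not a description of how the provided files were actually created.

```python
import pandas as pd

# Hypothetical combined dataset (values are made up)
toy_data = pd.DataFrame(
    {
        "Diagnosis": ["M", "B", "B", "M"],
        "Radius (worst)": [24.9, 13.1, 12.5, 21.7],
    }
)

# Keep only the rows whose diagnosis matches each label
toy_b = toy_data[toy_data["Diagnosis"] == "B"]
toy_m = toy_data[toy_data["Diagnosis"] == "M"]

print(len(toy_b), len(toy_m))
```

Every row of the combined data lands in exactly one of the two pieces, so the two lengths add up to the original row count.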

In [None]:
cancer_b_file_path = "../input/cancer_b.csv"
cancer_m_file_path = "../input/cancer_m.csv"

cancer_data_b = pd.read_csv(cancer_b_file_path)
cancer_data_m = pd.read_csv(cancer_m_file_path)

In [None]:
cancer_data_b.head()

In [None]:
cancer_data_m.head()

emphasize that `cancer_data_m` and `cancer_data_b` only contain rows from `cancer_data` ... split into two dataframes ...

In the cell below, use `sns.kdeplot()`, along with the two new DataFrames, to address the doctor's question.

In [None]:
# your code here
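# fill=True replaces the deprecated shade=True in recent seaborn versions
sns.kdeplot(data=cancer_data_b.radius_worst, fill=True, label="Benign")
sns.kdeplot(data=cancer_data_m.radius_worst, fill=True, label="Malignant")
# Show the legend so the two curves can be told apart
plt.legend()
plt.show()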

**Question 3**: Which threshold for `radius_worst` makes the most sense: 10, 17, or 22?
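
Beyond reading the density plot, one way to compare candidate thresholds is to count how many tumors each rule would classify correctly.  The helper below, `threshold_accuracy`, is a hypothetical function written for illustration, and the two lists contain made-up values rather than the real measurements.  It also assumes that a value exactly equal to the threshold counts as benign, which the doctor's rules leave unspecified.

```python
def threshold_accuracy(benign_values, malignant_values, threshold):
    """Fraction of tumors classified correctly by a single cutoff.

    Assumption: values at or below the threshold are called benign,
    and values above it are called malignant.
    """
    correct_benign = sum(1 for v in benign_values if v <= threshold)
    correct_malignant = sum(1 for v in malignant_values if v > threshold)
    total = len(benign_values) + len(malignant_values)
    return (correct_benign + correct_malignant) / total


# Hypothetical radius values, NOT taken from the real datasets
toy_benign = [11.2, 12.8, 13.5, 14.9, 16.1, 17.4]
toy_malignant = [15.8, 18.3, 20.1, 23.6, 25.0, 27.9]

for t in (10, 17, 22):
    print(t, round(threshold_accuracy(toy_benign, toy_malignant, t), 3))
```

The same function accepts a pandas column in place of each list, so it could be applied to the `radius_worst` columns of the two DataFrames loaded above.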

In [None]:
# Lines below will give you a hint or solution code
# step_5.hint()
# step_5.solution()