In this exercise, you will use your new knowledge to propose a solution to a real-world scenario.  To succeed, you will need to import data into Python, answer questions using the data, and generate **scatter plots**, **histograms**, and **density plots** to understand patterns in the data.

## Scenario

Over the last several years, machine learning professionals have made great progress in developing algorithms that detect cancer with the same accuracy as trained medical professionals.  To learn more about this fascinating research area, watch the video below!

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('9Mz84cwVmS0', width=800, height=450)

In this exercise, you'll use your data visualization skills to think about how to build an algorithm that can accurately classify breast cancer tumors as either **benign** (_noncancerous_) or **malignant** (_cancerous_).  Your dataset contains information collected from 569 different images, similar to the image below.

<img src="images/cancer_image.png">

## Setup

Run the next cell to import and configure the Python libraries that you need to complete the exercise.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print("Setup Complete")

The questions below will give you feedback on your work. Run the following cell to set up our feedback system.

In [None]:
# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.data_viz_easy.ex3 import *
print("Setup Complete")

## Step 1: Load the data

Read the cancer data file into a DataFrame called `cancer_data`.  Use the `'Id'` column to label the rows.

In [None]:
# Path of the file to read
cancer_filepath = "../input/cancer.csv"

# Fill in the line below to read the file into a variable cancer_data
cancer_data = ____

# Run the line below with no changes to check that you've loaded the data correctly
step_1.check()

In [None]:
#%%RM_IF(PROD)%%
cancer_data = pd.read_csv(cancer_filepath, index_col="Id")
step_1.assert_check_passed()

In [None]:
# Lines below will give you a hint or solution code
#_COMMENT_IF(PROD)_
step_1.hint()
#_COMMENT_IF(PROD)_
step_1.solution()

## Step 2: Review the data

Use a Python command to print the first 5 rows of the data.

In [None]:
# Print the first five rows of the data 
____ # Your code here

The dataset has 569 different rows, one for each analyzed image.  It has 31 different columns, corresponding to:
- 1 column (`'Diagnosis'`) that classifies tumors as either benign (which appears in the dataset as **`B`**) or malignant (**`M`**), and
- 30 different columns containing distinct measurements collected from the images.
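
One way to confirm dimensions like these is to inspect the DataFrame's `shape` and count the labels in `'Diagnosis'`. The sketch below is illustrative only: it uses a tiny invented stand-in table (`toy_data`, a hypothetical name with made-up values) so that it runs without the CSV file. On the real data, the same calls could be applied to `cancer_data`.

```python
import pandas as pd

# Tiny invented stand-in for cancer_data (values are illustrative, not from the real file)
toy_data = pd.DataFrame(
    {
        "Diagnosis": ["M", "B", "B", "M"],
        "Radius (worst)": [25.4, 13.1, 14.9, 23.6],
        "Texture (worst)": [17.3, 22.0, 19.5, 25.5],
    },
    index=pd.Index([101, 102, 103, 104], name="Id"),
)

# shape is (number of rows, number of columns); the index is not counted as a column
n_rows, n_cols = toy_data.shape
print(n_rows, n_cols)

# Count how many rows carry each diagnosis label
diagnosis_counts = toy_data["Diagnosis"].value_counts()
print(diagnosis_counts)
```

Because `'Id'` is used as the index, it does not count toward the number of columns reported by `shape`.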

Use the first 5 rows of the data to answer the questions below.

In [None]:
# Fill in the line below: In the first five rows, how many tumors are malignant?
num_malig = ____

# Fill in the line below: What is the value for 'Radius (mean)' for the tumor with Id 842517?
mean_radius = ____

step_2.check()

In [None]:
#%%RM_IF(PROD)%%
num_malig = 5
mean_radius = 20.57
step_2.assert_check_passed()

In [None]:
# Lines below will give you a hint or solution code
#_COMMENT_IF(PROD)_
step_2.hint()
#_COMMENT_IF(PROD)_
step_2.solution()

## Step 3: Detect useful columns

#### Part A
We'll begin by determining whether `'Radius (worst)'` or `'Texture (worst)'` are useful columns for classifying tumors as benign or malignant.  

In the code cell below, create a scatter plot to show the relationship between `'Radius (worst)'` (on the horizontal x-axis) and `'Texture (worst)'` (on the vertical y-axis).  Use the `'Diagnosis'` column to color-code the points. 

In [None]:
# Scatter plot showing the relationship between 'Radius (worst)', 'Texture (worst)', and 'Diagnosis'
____ # Your code here

# Check your answer
step_3.check()

In [None]:
#%%RM_IF(PROD)%%
sns.scatterplot(x=cancer_data['Radius (worst)'], y=cancer_data['Texture (worst)'], hue=cancer_data['Diagnosis'])
step_3.assert_check_passed()

In [None]:
# Line below will give you a hint or solution code
#_COMMENT_IF(PROD)_
step_3.hint()
#_COMMENT_IF(PROD)_
step_3.solution_plot()

#### Part B
Based on the plot you created above, which column is more useful for classifying tumors:  `'Radius (worst)'`, or `'Texture (worst)'`?

In [None]:
# get answer here. go into how larger values for 'Radius (worst)' seem to suggest a tumor is malignant
# and this is something we'd ideally like the machine learning algorithm to learn ...

#### Part C

Use the code cell below to create a categorical scatter plot that shows the relationship between `'Diagnosis'` and the column that you identified in **Part B**.  Put `'Diagnosis'` on the horizontal axis.

In [None]:
# Scatter plot showing the relationship between 'Diagnosis' and (a useful column)
____ # Your code here

# Check your answer
# step_3c.check()

In [None]:
#%%RM_IF(PROD)%%
# Scatter plot showing the relationship between 'Diagnosis' and 'Radius (worst)'
sns.swarmplot(x=cancer_data['Diagnosis'], y=cancer_data['Radius (worst)'])

## Step 4: ...

While the ultimate goal is to determine how different columns in the dataset might affect whether a tumor is benign or malignant (in the `'Diagnosis'` column), it is useful to understand the relationships between all of the columns, to develop a more holistic view of the dataset.
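
A numerical complement to plotting is the pairwise Pearson correlation between columns. The sketch below is a rough illustration, not part of the exercise: it uses a small invented table (`toy_data`, a hypothetical name), so the numbers it prints say nothing about the real dataset. The same `corr()` calls could be applied to the numeric columns of `cancer_data`.

```python
import pandas as pd

# Invented values for illustration only (not taken from the cancer dataset)
toy_data = pd.DataFrame(
    {
        "Perimeter (worst)": [80.0, 95.0, 110.0, 150.0, 180.0],
        "Concavity (worst)": [0.10, 0.15, 0.30, 0.45, 0.60],
        "Texture (worst)": [25.0, 18.0, 30.0, 22.0, 27.0],
    }
)

# Pairwise Pearson correlation between every pair of numeric columns
corr_matrix = toy_data.corr()
print(corr_matrix.round(2))

# Correlation between two specific columns
perimeter_vs_concavity = toy_data["Perimeter (worst)"].corr(toy_data["Concavity (worst)"])
print(round(perimeter_vs_concavity, 2))
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate a weak one.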

#### Part A

Create a scatter plot with a regression line to show the correlation between `'Perimeter (worst)'` and `'Concavity (worst)'`.

In [None]:
# Scatter plot with regression line
____ # Your code here

# step_4.check()

In [None]:
#%%RM_IF(PROD)%%
# Scatter plot with regression line 
sns.regplot(x=cancer_data['Perimeter (worst)'], y=cancer_data['Concavity (worst)'])

In [None]:
# Line below will give you a hint or solution code
# step_4.hint()
# step_4.solution()

#### Part B 

Do tumors with larger values for `'Perimeter (worst)'` also typically have larger values for `'Concavity (worst)'`?

In [None]:
# yup

## Step 5: ...

#### Part A

For the next part of the exercise, we've broken the original CSV file into two separate datasets, which are loaded in the code cell below.  One dataset (`cancer_b_data`) contains all of the information about benign tumors, and the other (`cancer_m_data`) contains all of the information for the malignant tumors. 

In [None]:
cancer_b_filepath = "../input/cancer_b.csv"
cancer_m_filepath = "../input/cancer_m.csv"

cancer_b_data = pd.read_csv(cancer_b_filepath, index_col="Id")
cancer_m_data = pd.read_csv(cancer_m_filepath, index_col="Id")

When perusing the data, you notice that larger values for `'Radius (worst)'` typically denote a malignant tumor, whereas smaller values usually indicate that a tumor is benign.  This suggests that you might identify a threshold value to create an early prototype of an algorithm.  Say you identify three possible options:
- We can classify tumors with `'Radius (worst)'` > 12 as malignant (where `'Radius (worst)'` < 12 is benign).
- We can classify tumors with `'Radius (worst)'` > 17 as malignant (where `'Radius (worst)'` < 17 is benign).
- We can classify tumors with `'Radius (worst)'` > 22 as malignant (where `'Radius (worst)'` < 22 is benign).

This system is unlikely to be perfect, and it's certainly not ready to be released to hospitals as a diagnostic tool, but it's a good start!
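
A threshold rule of this kind can be sketched as a one-line function. The helper name `classify_by_radius` and the example values are hypothetical, chosen only for illustration. The options above do not say how to treat a value exactly equal to the threshold; this sketch assumes such values are labeled benign.

```python
def classify_by_radius(radius_worst, threshold):
    """Return 'M' (malignant) if the value exceeds the threshold, else 'B' (benign).

    Assumption: a value exactly equal to the threshold is treated as benign.
    """
    return "M" if radius_worst > threshold else "B"


# Invented example values of 'Radius (worst)', for illustration only
example_radii = [11.5, 14.2, 16.9, 18.3, 24.0]

# Apply each of the three candidate thresholds to the same values
for threshold in (12, 17, 22):
    labels = [classify_by_radius(r, threshold) for r in example_radii]
    print(threshold, labels)
```

Raising the threshold labels fewer tumors as malignant, which is why the choice of threshold matters.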

Use the code cell below to create a plot that helps with identifying the best threshold value.

In [None]:
# your code here
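# KDE plots of 'Radius (worst)' for benign and malignant tumors
# (fill=True replaces the deprecated shade=True; requires seaborn >= 0.11)
sns.kdeplot(data=cancer_b_data["Radius (worst)"], fill=True, label="Benign")
sns.kdeplot(data=cancer_m_data["Radius (worst)"], fill=True, label="Malignant")
plt.legend()  # show which curve is which
plt.show()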

In [None]:
# Line below will give you a hint or solution code
# step_5.hint()
# step_5.solution()

#### Part B

Which threshold for `'Radius (worst)'` makes the most sense: 12, 17, or 22?

In [None]:
# answer to this is quite complicated, as we want to catch as many malignant tumors as 
# possible. not as bad to accidentally classify benign tumors as malignant?? b/c then doctors
# could take a second look. might need to reword this question to be more specific re: what's wanted
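
The trade-off described in the notes above can be made concrete by counting, for each candidate threshold, how many malignant tumors would be missed and how many benign tumors would be flagged. This is a minimal sketch, assuming small invented lists of `'Radius (worst)'` values (`benign_radii` and `malignant_radii`, both hypothetical) in place of `cancer_b_data` and `cancer_m_data`. The helper `evaluate_threshold` is likewise an illustration and not part of the exercise.

```python
# Invented 'Radius (worst)' values, for illustration only
benign_radii = [11.0, 12.5, 13.4, 14.8, 16.1, 17.5]
malignant_radii = [15.9, 18.2, 20.7, 23.5, 25.1, 28.0]


def evaluate_threshold(threshold, benign, malignant):
    """Count errors when values above the threshold are called malignant."""
    # Malignant tumors at or below the threshold are missed (false negatives)
    missed_malignant = sum(1 for r in malignant if r <= threshold)
    # Benign tumors above the threshold are flagged (false positives)
    flagged_benign = sum(1 for r in benign if r > threshold)
    # Sensitivity: fraction of malignant tumors that are caught
    sensitivity = 1 - missed_malignant / len(malignant)
    return missed_malignant, flagged_benign, sensitivity


results = {t: evaluate_threshold(t, benign_radii, malignant_radii) for t in (12, 17, 22)}
for threshold, (missed, flagged, sensitivity) in results.items():
    print(f"threshold={threshold}: missed malignant={missed}, "
          f"benign flagged={flagged}, sensitivity={sensitivity:.2f}")
```

On data like this, a lower threshold misses fewer malignant tumors at the cost of flagging more benign ones, so the "best" threshold depends on which kind of error is considered more costly.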