In this exercise, you will use your new knowledge to propose a solution to a real-world scenario.  To succeed, you will need to import data into Python, answer questions using the data, and generate **scatter plots**, **histograms**, and **density plots** to understand patterns in the data.

## Scenario

You'll think about how to design an algorithm that can classify breast cancer tumors as either **benign** (_noncancerous_) or **malignant** (_cancerous_).  The dataset contains information collected from microscopic images of tumors, similar to the image below.

<img src="images/cancer_image.png">

To learn more about how these algorithms are currently used in medical settings, **watch the short video [at this link](https://www.youtube.com/watch?v=9Mz84cwVmS0)**!



## Setup

Run the next cell to import and configure the Python libraries that you need to complete the exercise.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print("Setup Complete")

The questions below will give you feedback on your work. Run the following cell to set up our feedback system.

In [None]:
# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.data_viz_easy.ex3 import *
print("Setup Complete")

## Step 1: Load the data

Read the cancer data file into a DataFrame called `cancer_data`.  Use the `"Id"` column to label the rows.
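
As a reminder of how row labels work, here is a minimal sketch on a made-up table rather than the exercise data. The CSV text, the `Code` column, and the variable names are hypothetical stand-ins used only for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk (not the exercise data)
toy_csv = io.StringIO(
    "Code,Label,Score\n"
    "a1,X,1.5\n"
    "b2,Y,2.5\n"
    "c3,X,3.5\n"
)

# index_col tells pandas which column to use as the row labels
toy_data = pd.read_csv(toy_csv, index_col="Code")

# Rows can now be looked up by their label instead of by position
print(toy_data.loc["b2", "Score"])
```

Because `Code` becomes the index, it no longer appears among the regular columns of `toy_data`.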

In [None]:
# Path of the file to read
cancer_filepath = "../input/cancer.csv"

# Fill in the line below to read the file into a variable cancer_data
cancer_data = ____

# Run the line below with no changes to check that you've loaded the data correctly
step_1.check()

In [None]:
#%%RM_IF(PROD)%%
cancer_data = pd.read_csv(cancer_filepath, index_col="Id")
step_1.assert_check_passed()

In [None]:
# Lines below will give you a hint or solution code
#_COMMENT_IF(PROD)_
step_1.hint()
#_COMMENT_IF(PROD)_
step_1.solution()

## Step 2: Review the data

Use a Python command to print the first 5 rows of the data.

In [None]:
# Print the first five rows of the data 
____ # Your code here

The dataset has 569 different rows, one for each analyzed image.  It has 31 different columns, corresponding to:
- 1 column (`'Diagnosis'`) that classifies tumors as either benign (which appears in the dataset as **`B`**) or malignant (**`M`**), and
- 30 columns containing different measurements collected from the images.

Use the first 5 rows of the data to answer the questions below.
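
You can answer questions like these by reading the printed table, but the same checks can also be done in code. The sketch below uses a small made-up table; the column names, labels, and index values are assumptions for illustration and do not come from the cancer dataset.

```python
import pandas as pd

# Hypothetical toy table (not the exercise data); the index plays the role of an Id column
toy_data = pd.DataFrame(
    {
        "Label": ["X", "Y", "X", "X", "Y", "Y"],
        "Score": [1.5, 2.5, 3.5, 4.5, 5.5, 6.5],
    },
    index=pd.Index([101, 102, 103, 104, 105, 106], name="Code"),
)

# head() returns the first five rows by default
first_rows = toy_data.head()

# Count how many of those rows carry the label "X"
num_x = (first_rows["Label"] == "X").sum()

# Look up a single cell by row label and column name
score_103 = toy_data.loc[103, "Score"]
```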

In [None]:
# Fill in the line below: In the first five rows, how many tumors are malignant?
num_malig = ____

# Fill in the line below: What is the value for 'Radius (mean)' for the tumor with Id 842517?
mean_radius = ____

step_2.check()

In [None]:
#%%RM_IF(PROD)%%
num_malig = 5
mean_radius = 20.57
step_2.assert_check_passed()

In [None]:
# Lines below will give you a hint or solution code
#_COMMENT_IF(PROD)_
step_2.hint()
#_COMMENT_IF(PROD)_
step_2.solution()

## Step 3: Detect useful measurements

The columns in the dataset are not equally important: as you'll explore below, some measurements are more useful than others for distinguishing benign from malignant tumors.
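
To make "more useful" concrete, one rough check is whether the values for the two classes overlap. The sketch below is only an illustration on made-up numbers: the `feature_a` and `feature_b` columns and the `ranges_overlap` helper are hypothetical, and comparing ranges is a crude stand-in for the visual comparison you will do with plots.

```python
import pandas as pd

# Hypothetical toy data: two measurements and a two-class label
toy = pd.DataFrame({
    "label":     ["B", "B", "B", "M", "M", "M"],
    "feature_a": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],   # classes far apart
    "feature_b": [4.0, 6.0, 5.0, 5.0, 4.0, 6.0],   # classes overlap completely
})

# Per-class minimum and maximum for each measurement
ranges = toy.groupby("label").agg(["min", "max"])

def ranges_overlap(column):
    """Return True if the B and M value ranges for `column` overlap."""
    b_min = ranges.loc["B", (column, "min")]
    b_max = ranges.loc["B", (column, "max")]
    m_min = ranges.loc["M", (column, "min")]
    m_max = ranges.loc["M", (column, "max")]
    return bool(b_min <= m_max and m_min <= b_max)

print(ranges_overlap("feature_a"), ranges_overlap("feature_b"))
```

A measurement whose class ranges do not overlap, like `feature_a` here, separates the classes cleanly; real data is rarely this tidy, which is why plots are more informative than a single overlap check.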

#### Part A

In the code cell below, create a scatter plot to show the relationship between `'Perimeter (mean)'` (on the horizontal x-axis) and `'Texture (worst)'` (on the vertical y-axis).  Use the `'Diagnosis'` column to color-code the points. 

In [None]:
# Scatter plot showing the relationship between 'Perimeter (mean)', 'Texture (worst)', and 'Diagnosis'
____ # Your code here

# Check your answer
step_3.a.check()

In [None]:
#%%RM_IF(PROD)%%
sns.scatterplot(x=cancer_data['Perimeter (mean)'], 
                y=cancer_data['Texture (worst)'], 
                hue=cancer_data['Diagnosis'])
step_3.a.assert_check_passed()

In [None]:
#%%RM_IF(PROD)%%
sns.scatterplot(x=cancer_data['Texture (worst)'],
                y=cancer_data['Perimeter (mean)'], 
                hue=cancer_data['Diagnosis'])
step_3.a.assert_check_passed()

In [None]:
#%%RM_IF(PROD)%%
sns.scatterplot(x=cancer_data['Perimeter (mean)'], 
                y=cancer_data['Texture (worst)'])
step_3.a.assert_check_failed()

In [None]:
#%%RM_IF(PROD)%%
sns.scatterplot(x=cancer_data['Diagnosis'],
                y=cancer_data['Texture (worst)'],
                hue=cancer_data['Perimeter (mean)'])
step_3.a.assert_check_failed()

In [None]:
# Lines below will give you a hint or solution code
#_COMMENT_IF(PROD)_
step_3.a.hint()
#_COMMENT_IF(PROD)_
step_3.a.solution_plot()

#### Part B

Based on the plot you created above, which column is relatively more useful for classifying tumors: `'Perimeter (mean)'` or `'Texture (worst)'`?

In [None]:
# TODO: write the answer here. Discuss how larger values of 'Perimeter (mean)' seem to suggest a tumor is malignant,
# which is something we'd ideally like the machine learning algorithm to learn ...

In [None]:
#_COMMENT_IF(PROD)_
step_3.b.hint()

In [None]:
#_COMMENT_IF(PROD)_
step_3.b.solution()

## Step 4: Communicate your findings

Use the code cell below to create a categorical scatter plot that shows the relationship between `'Diagnosis'` and the column that you identified in **Step 3**.  

In [None]:
# Scatter plot showing the relationship between 'Diagnosis' and (a useful column)
____ # Your code here

# Check your answer
step_4.check()

In [None]:
#%%RM_IF(PROD)%%
sns.swarmplot(x=cancer_data['Diagnosis'], y=cancer_data['Perimeter (mean)'])
step_4.assert_check_passed()

In [None]:
#%%RM_IF(PROD)%%
sns.swarmplot(x=cancer_data['Perimeter (mean)'], y=cancer_data['Diagnosis'])
step_4.assert_check_passed()

In [None]:
# Lines below will give you a hint or solution code
#_COMMENT_IF(PROD)_
step_4.hint()
#_COMMENT_IF(PROD)_
step_4.solution_plot()

## Step 5: Find a threshold

For the next part of the exercise, we've broken the original CSV file into two separate files.  
- One file (imported as `cancer_b_data`) contains all of the information about benign tumors, and 
- the other (`cancer_m_data`) contains all of the measurements from malignant tumors.

Run the code cell below without changes to load the files.

In [None]:
# Paths of the files to read
cancer_b_filepath = "../input/cancer_b.csv"
cancer_m_filepath = "../input/cancer_m.csv"

# Read the files into variables cancer_b_data, cancer_m_data
cancer_b_data = pd.read_csv(cancer_b_filepath, index_col="Id")
cancer_m_data = pd.read_csv(cancer_m_filepath, index_col="Id")

As expected, `cancer_b_data` contains only rows corresponding to benign tumors.  Run the next code cell without any changes.

In [None]:
# Print the first five rows of the data 
cancer_b_data.head()

Similarly, `cancer_m_data` contains only rows corresponding to malignant tumors.  Run the next code cell without any changes.

In [None]:
# Print the first five rows of the data 
cancer_m_data.head()

#### Part A

A trustworthy medical source tells you that `'Radius (worst)'` might be a useful column for diagnosing tumors.  Use the code cell below to create a KDE plot that shows the distribution of values for `'Radius (worst)'` for both benign and malignant tumors.

In [None]:
# KDE plot for benign and malignant tumors
____ # Your code here

# Check your answer
step_5.a.check()

In [None]:
#%%RM_IF(PROD)%%
# `fill=True` replaces the deprecated `shade=True` (requires seaborn 0.11 or later)
sns.kdeplot(data=cancer_b_data['Radius (worst)'], fill=True, label="Benign")
sns.kdeplot(data=cancer_m_data['Radius (worst)'], fill=True, label="Malignant")
step_5.a.assert_check_passed()

In [None]:
# Lines below will give you a hint or solution code
#_COMMENT_IF(PROD)_
step_5.a.hint()
#_COMMENT_IF(PROD)_
step_5.a.solution_plot()

#### Part B

Based on the KDE plot, do malignant tumors have higher or lower values for `'Radius (worst)'` (relative to benign tumors)?
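
The title of this step mentions a threshold: a cutoff value that assigns a sample to one class when its measurement lies above the cutoff and to the other class otherwise. The sketch below shows the idea on made-up numbers with neutral class names. The values, the labels, and the `accuracy` helper are all hypothetical, and trying each observed value as a cutoff is just one simple way to search.

```python
# Hypothetical measurements and true classes (made-up numbers, not the exercise data)
values = [10.0, 12.0, 13.0, 15.0, 18.0, 20.0, 22.0, 25.0]
labels = ["low", "low", "low", "low", "high", "low", "high", "high"]

def accuracy(threshold):
    """Fraction of points classified correctly by the rule: value > threshold -> 'high'."""
    predictions = ["high" if v > threshold else "low" for v in values]
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)

# Try each observed value as a candidate threshold and keep the first best one
best_threshold = max(values, key=accuracy)
print(best_threshold, accuracy(best_threshold))
```

Because the two classes overlap in this toy data, no cutoff classifies every point correctly; the search only finds the cutoff with the fewest mistakes.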

In [None]:
#_COMMENT_IF(PROD)_
step_5.b.hint()

In [None]:
#_COMMENT_IF(PROD)_
step_5.b.solution()

## What's next?

write later