
In [None]:

# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES
# TO THE CORRECT LOCATION (/kaggle/input) IN YOUR NOTEBOOK,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

import os
import sys
from tempfile import NamedTemporaryFile
from urllib.request import urlopen
from urllib.parse import unquote, urlparse
from urllib.error import HTTPError
from zipfile import ZipFile
import tarfile
import shutil

CHUNK_SIZE = 40960
DATA_SOURCE_MAPPING = ':https%3A%2F%2Fstorage.googleapis.com%2Fkaggle-data-sets%2F116573%2F3551030%2Fbundle%2Farchive.zip%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com%252F20240823%252Fauto%252Fstorage%252Fgoog4_request%26X-Goog-Date%3D20240823T170607Z%26X-Goog-Expires%3D259200%26X-Goog-SignedHeaders%3Dhost%26X-Goog-Signature%3D6567814323759295efa7ac6f713115d2b8337e1075895600d0127f5c5457f9b87a8efc5c547ad6fa67c2d5f19689bda1a3e86a5ff757f5f8e3d35042dd041afd3bbe2ecdfe2ee5e35fa8559616f97adbecc081c0c34c230f0a082cc16f94298e3529ec377cb7f7b6b3c79a84df35b06b7534b91a7ec098ac71f33b2496a4a3fab2805aab926433541db54f736be7b8d3fd5905345cf0e671257c97694752c9c7e1edfa116ab9eef6a661e5f2f34f4c7a768b0bd71d854738be4700a3fbfcb9d8e3f2d35989e2175fb8f306f4418d9bdab0497307bf4210f675f0a3c8f64a48b4829659a8a46d8c1f95baa6fb711d24fd71effb111c9efa46a673b358978a236f'

KAGGLE_INPUT_PATH='/kaggle/input'
KAGGLE_WORKING_PATH='/kaggle/working'
KAGGLE_SYMLINK='kaggle'

!umount /kaggle/input/ 2> /dev/null
shutil.rmtree('/kaggle/input', ignore_errors=True)
os.makedirs(KAGGLE_INPUT_PATH, 0o777, exist_ok=True)
os.makedirs(KAGGLE_WORKING_PATH, 0o777, exist_ok=True)

try:
  os.symlink(KAGGLE_INPUT_PATH, os.path.join("..", 'input'), target_is_directory=True)
except FileExistsError:
  pass
try:
  os.symlink(KAGGLE_WORKING_PATH, os.path.join("..", 'working'), target_is_directory=True)
except FileExistsError:
  pass

for data_source_mapping in DATA_SOURCE_MAPPING.split(','):
    directory, download_url_encoded = data_source_mapping.split(':')
    download_url = unquote(download_url_encoded)
    filename = urlparse(download_url).path
    destination_path = os.path.join(KAGGLE_INPUT_PATH, directory)
    try:
        with urlopen(download_url) as fileres, NamedTemporaryFile() as tfile:
            total_length = fileres.headers['content-length']
            print(f'Downloading {directory}, {total_length} bytes compressed')
            dl = 0
            data = fileres.read(CHUNK_SIZE)
            while len(data) > 0:
                dl += len(data)
                tfile.write(data)
                done = int(50 * dl / int(total_length))
                sys.stdout.write(f"\r[{'=' * done}{' ' * (50-done)}] {dl} bytes downloaded")
                sys.stdout.flush()
                data = fileres.read(CHUNK_SIZE)
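            # Flush buffered writes so the archive on disk is complete
            tfile.flush()
            if filename.endswith('.zip'):
                with ZipFile(tfile) as zfile:
                    zfile.extractall(destination_path)
            else:
                # Use a distinct name so the tarfile module is not shadowed
                with tarfile.open(tfile.name) as tar:
                    tar.extractall(destination_path)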
            print(f'\nDownloaded and uncompressed: {directory}')
    except HTTPError as e:
        print(f'Failed to load (likely expired) {download_url} to path {destination_path}')
        continue
    except OSError as e:
        print(f'Failed to load {download_url} to path {destination_path}')
        continue

print('Data source import complete.')


**This notebook is an exercise in the [Data Visualization](https://www.kaggle.com/learn/data-visualization) course.  You can reference the tutorial at [this link](https://www.kaggle.com/alexisbcook/scatter-plots).**

---


In this exercise, you will use your new knowledge to propose a solution to a real-world scenario.  To succeed, you will need to import data into Python, answer questions using the data, and generate **scatter plots** to understand patterns in the data.

## Scenario

You work for a major candy producer, and your goal is to write a report that your company can use to guide the design of its next product.  Soon after starting your research, you stumble across this [very interesting dataset](https://fivethirtyeight.com/features/the-ultimate-halloween-candy-power-ranking/) containing results from a fun survey to crowdsource favorite candies.

## Setup

Run the next cell to import and configure the Python libraries that you need to complete the exercise.

In [None]:
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print("Setup Complete")

The questions below will give you feedback on your work. Run the following cell to set up our feedback system.

In [None]:
# Set up code checking
import os
if not os.path.exists("../input/candy.csv"):
    os.symlink("../input/data-for-datavis/candy.csv", "../input/candy.csv")
from learntools.core import binder
binder.bind(globals())
from learntools.data_viz_to_coder.ex4 import *
print("Setup Complete")

## Step 1: Load the Data

Read the candy data file into `candy_data`.  Use the `"id"` column to label the rows.
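
As a minimal sketch of what labeling rows by a column means, the example below reads a small in-memory CSV with `index_col="id"`. The three rows and the candy names are hypothetical, invented for illustration, and are not taken from `candy.csv`.

```python
import io
import pandas as pd

# Hypothetical three-row CSV, invented purely for illustration
csv_text = """id,competitorname,winpercent
0,Candy A,60.0
1,Candy B,45.5
2,Candy C,52.25
"""

# index_col="id" turns the "id" column into the row labels
toy = pd.read_csv(io.StringIO(csv_text), index_col="id")

print(toy.index.name)     # name of the row-label column
print(list(toy.columns))  # remaining data columns
```

After loading this way, `"id"` no longer appears among the regular columns, and rows can be selected by label with `toy.loc[...]`.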

In [None]:
# Path of the file to read
candy_filepath = "../input/candy.csv"

# Fill in the line below to read the file into a variable candy_data
candy_data = pd.read_csv(candy_filepath, index_col = "id")

# Run the line below with no changes to check that you've loaded the data correctly
step_1.check()

In [None]:
# Lines below will give you a hint or solution code
#step_1.hint()
#step_1.solution()

## Step 2: Review the data

Use a Python command to print the first five rows of the data.
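
For reference, a short sketch of `DataFrame.head()` on a made-up eight-row table (the `value` column is hypothetical). With no argument it returns the first five rows.

```python
import pandas as pd

# Hypothetical eight-row table, invented for illustration only
toy = pd.DataFrame({"value": range(8)})

first_five = toy.head()     # default is n=5
first_three = toy.head(3)   # pass n to change the number of rows

print(first_five)
```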

In [None]:
# Print the first five rows of the data
candy_data.head()

The dataset contains 83 rows, where each corresponds to a different candy bar.  There are 13 columns:
- `'competitorname'` contains the name of the candy bar.
- the next **9** columns (from `'chocolate'` to `'pluribus'`) describe the candy.  For instance, rows with chocolate candies have `"Yes"` in the `'chocolate'` column (and candies without chocolate have `"No"` in the same column).
- `'sugarpercent'` provides some indication of the amount of sugar, where higher values signify higher sugar content.
- `'pricepercent'` shows the price per unit, relative to the other candies in the dataset.
- `'winpercent'` is calculated from the survey results; higher values indicate that the candy was more popular with survey respondents.

Use the first five rows of the data to answer the questions below.
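
Reading values off the printed table is enough for two candies, but the same comparison can also be made in code. The sketch below assumes a tiny made-up table. The candy names and scores are hypothetical and are not the survey results.

```python
import pandas as pd

# Hypothetical rows, invented for illustration; not taken from candy.csv
toy = pd.DataFrame({
    "competitorname": ["Candy A", "Candy B", "Candy C"],
    "winpercent": [60.0, 45.5, 52.25],
})

# Index by name, pull the two rows of interest, and compare one column
pair = toy.set_index("competitorname").loc[["Candy A", "Candy B"], "winpercent"]
higher = pair.idxmax()  # label of the row with the larger value

print(pair)
print("Higher winpercent:", higher)
```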

In [None]:
# Two side-by-side bar charts, one per question
fig, axes = plt.subplots(1, 2, figsize=(10, 11))

# Select the candies named in each question
data1 = candy_data.loc[candy_data['competitorname'].isin(['3 Musketeers', 'Almond Joy'])]
data2 = candy_data.loc[candy_data['competitorname'].str.contains('Air Heads|Baby Ruth')]

# Left: popularity (winpercent); right: sugar content (sugarpercent)
sns.barplot(data=data1, x='competitorname', y='winpercent', ax=axes[0])
sns.barplot(data=data2, x='competitorname', y='sugarpercent', ax=axes[1])

# Fill in the line below: Which candy was more popular with survey respondents:
# '3 Musketeers' or 'Almond Joy'?  (Please enclose your answer in single quotes.)
more_popular = '3 Musketeers'

# Fill in the line below: Which candy has higher sugar content: 'Air Heads'
# or 'Baby Ruth'? (Please enclose your answer in single quotes.)
more_sugar = 'Air Heads'

# Check your answers
step_2.check()

In [None]:
# Lines below will give you a hint or solution code
#step_2.hint()
#step_2.solution()

## Step 3: The role of sugar

Do people tend to prefer candies with higher sugar content?  

#### Part A

Create a scatter plot that shows the relationship between `'sugarpercent'` (on the horizontal x-axis) and `'winpercent'` (on the vertical y-axis).  _Don't add a regression line just yet -- you'll do that in the next step!_

In [None]:
# Scatter plot showing the relationship between 'sugarpercent' and 'winpercent'
sns.scatterplot(data = candy_data , x = 'sugarpercent', y = 'winpercent')

# Check your answer
step_3.a.check()

In [None]:
# Lines below will give you a hint or solution code
#step_3.a.hint()
#step_3.a.solution_plot()

#### Part B

Does the scatter plot show a **strong** correlation between the two variables?  If so, are candies with more sugar relatively more or less popular with the survey respondents?
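
One way to put a number on "strong" versus "weak" is the Pearson correlation coefficient, which `Series.corr` computes by default. The sketch below contrasts two made-up cases. The column names `feature`, `score_linear`, and `score_flat` are hypothetical, and none of the values come from the candy data.

```python
import pandas as pd

# Hypothetical values, invented for illustration only
toy = pd.DataFrame({
    "feature":      [1.0, 2.0, 3.0, 4.0],
    "score_linear": [3.0, 5.0, 7.0, 9.0],   # exactly 2 * feature + 1
    "score_flat":   [1.0, 2.0, 2.0, 1.0],   # no linear trend with feature
})

# Pearson r lies in [-1, 1]; values near 0 mean little linear relationship
r_strong = toy["feature"].corr(toy["score_linear"])
r_none = toy["feature"].corr(toy["score_flat"])

print(f"perfectly linear: r = {r_strong:.3f}")
print(f"no linear trend:  r = {r_none:.3f}")
```

A scatter plot whose points hug a line corresponds to `r` near 1 or -1, while a shapeless cloud corresponds to `r` near 0.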

In [None]:
#step_3.b.hint()

In [None]:
# Check your answer (Run this code cell to receive credit!)
step_3.b.solution()

## Step 4: Take a closer look

#### Part A

Create the same scatter plot you created in **Step 3**, but now with a regression line!

In [None]:
# Scatter plot w/ regression line showing the relationship between 'sugarpercent' and 'winpercent'
sns.regplot(data = candy_data , x = 'sugarpercent', y = 'winpercent')

# Check your answer
step_4.a.check()

In [None]:
# Lines below will give you a hint or solution code
#step_4.a.hint()
#step_4.a.solution_plot()

#### Part B

According to the plot above, is there a **slight** correlation between `'winpercent'` and `'sugarpercent'`?  What does this tell you about the candy that people tend to prefer?
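
To read the slope of a fitted line as a number rather than from the picture, one option is an ordinary least-squares fit with `numpy.polyfit`. This is a separate calculation on made-up data, not something the exercise checker requires, and the values below are hypothetical.

```python
import numpy as np

# Hypothetical points, invented for illustration only
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.5, 4.5, 7.0])

# Degree-1 least-squares fit returns [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

A positive slope means the fitted line rises from left to right. How close the slope is to zero, relative to the scatter of the points, indicates how slight the relationship is.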

In [None]:
#step_4.b.hint()

In [None]:
# Check your answer (Run this code cell to receive credit!)
step_4.b.solution()

## Step 5: Chocolate!

In the code cell below, create a scatter plot to show the relationship between `'pricepercent'` (on the horizontal x-axis) and `'winpercent'` (on the vertical y-axis). Use the `'chocolate'` column to color-code the points.  _Don't add any regression lines just yet -- you'll do that in the next step!_

In [None]:
# Scatter plot showing the relationship between 'pricepercent', 'winpercent', and 'chocolate'
sns.scatterplot(data=candy_data, x = 'pricepercent', y = 'winpercent', hue='chocolate')

# Check your answer
step_5.check()

In [None]:
# Lines below will give you a hint or solution code
#step_5.hint()
#step_5.solution_plot()

Can you see any interesting patterns in the scatter plot?  We'll investigate this plot further by adding regression lines in the next step!

## Step 6: Investigate chocolate

#### Part A

Create the same scatter plot you created in **Step 5**, but now with two regression lines, corresponding to (1) chocolate candies and (2) candies without chocolate.

In [None]:
# Color-coded scatter plot w/ regression lines
sns.lmplot(data = candy_data, x = 'pricepercent', y = 'winpercent', hue='chocolate')

# Check your answer
step_6.a.check()

In [None]:
# Lines below will give you a hint or solution code
#step_6.a.hint()
#step_6.a.solution_plot()

#### Part B

Using the regression lines, what conclusions can you draw about the effects of chocolate and price on candy popularity?
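
When comparing two regression lines, it can help to compute each group's slope separately. The sketch below fits one least-squares line per group on made-up data. The group labels `A` and `B` and the column names are hypothetical and do not represent the candy dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical two-group data, invented for illustration only
toy = pd.DataFrame({
    "group": ["A"] * 4 + ["B"] * 4,
    "x": [0.0, 1.0, 2.0, 3.0] * 2,
    "y": [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0],
})

# Fit a separate degree-1 least-squares line within each group
slopes = {}
for name, part in toy.groupby("group"):
    slope, intercept = np.polyfit(part["x"], part["y"], deg=1)
    slopes[name] = slope

print(slopes)
```

Slopes with opposite signs, as in this toy example, mean the horizontal variable relates to the vertical one in opposite directions for the two groups.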

In [None]:
#step_6.b.hint()

In [None]:
# Check your answer (Run this code cell to receive credit!)
step_6.b.solution()

## Step 7: Everybody loves chocolate.

#### Part A

Create a categorical scatter plot to highlight the relationship between `'chocolate'` and `'winpercent'`.  Put `'chocolate'` on the (horizontal) x-axis, and `'winpercent'` on the (vertical) y-axis.

In [None]:
# Scatter plot showing the relationship between 'chocolate' and 'winpercent'
sns.swarmplot(data = candy_data, x = 'chocolate', y = 'winpercent')

# Check your answer
step_7.a.check()

In [None]:
# Lines below will give you a hint or solution code
#step_7.a.hint()
#step_7.a.solution_plot()

#### Part B

You decide to dedicate a section of your report to the fact that chocolate candies tend to be more popular than candies without chocolate.  Which plot is more appropriate to tell this story: the plot from **Step 6**, or the plot from **Step 7**?
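
A claim that one category "tends to be more popular" can also be backed by a per-group summary. The sketch below compares group means on made-up data. The labels `A` and `B` and the `score` column are hypothetical. A mean is only a summary, so it complements a plot of the individual points rather than replacing it.

```python
import pandas as pd

# Hypothetical scores for two categories, invented for illustration only
toy = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "score": [40.0, 50.0, 60.0, 60.0, 70.0, 80.0],
})

# Average score within each category, then the gap between them
group_means = toy.groupby("group")["score"].mean()
gap = group_means["B"] - group_means["A"]

print(group_means)
print("Difference in means:", gap)
```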

In [None]:
#step_7.b.hint()

In [None]:
# Check your answer (Run this code cell to receive credit!)
step_7.b.solution()

## Keep going

Explore **[histograms and density plots](https://www.kaggle.com/alexisbcook/distributions)**.

---




*Have questions or comments? Visit the [course discussion forum](https://www.kaggle.com/learn/data-visualization/discussion) to chat with other learners.*