<a href="https://colab.research.google.com/github/BreakoutMentors/Data-Science-and-Machine-Learning/blob/main/basics/challenges/Data_Visualization_Challenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Visualization Challenge - Seaborn and Matplotlib
In the previous notebook, you learned about various visualization techniques using Seaborn and Matplotlib. You got familiar with line plots, scatter plots, bar plots, histograms, box plots, swarm plots, and heatmaps.

Now it's your turn to demonstrate your skills with this challenge! You'll be applying your knowledge of the different plots and trying to derive insights from the datasets provided.

Let's get started!

# Imports

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Datasets

In [None]:
students_url = 'https://raw.githubusercontent.com/BreakoutMentors/Data-Science-and-Machine-Learning/main/datasets/students.csv'
athletes_url = 'https://raw.githubusercontent.com/BreakoutMentors/Data-Science-and-Machine-Learning/main/datasets/student_athletes.csv'
academics_url = 'https://raw.githubusercontent.com/BreakoutMentors/Data-Science-and-Machine-Learning/main/datasets/student_academics.csv'

students = pd.read_csv(students_url, index_col='id')
athletes = pd.read_csv(athletes_url, index_col=0)
academics = pd.read_csv(academics_url, index_col=0)

In [None]:
students

In [None]:
athletes

In [None]:
academics

# Line Plot

A line plot visualizes the relationship between two sets of values. By connecting data points with a line, we can see trends over time or other relationships.

Now, try to make a line plot showing the relationship between an athlete's years of experience and their ranking. Don't forget to add a title and labels for the axes!

In [None]:
plt.figure(figsize=(10, 6))

# your code here

# Scatter Plot

In a scatter plot, we plot individual data points on an X-Y plane. This type of plot is great for spotting correlations between two variables.

Create a scatter plot showing the relationship between an athlete's hours of training and their ranking. Also, color-code the points by `'years experience'` to see whether athletes with different amounts of experience show different patterns.
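
As a refresher on color-coding, the sketch below uses a small invented dataset (the `toy` DataFrame and its columns are assumptions for illustration only). Passing a column name to `hue` tells Seaborn to give each group its own color and to add a legend.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up data for illustration only (not the challenge dataset).
toy = pd.DataFrame({
    "study_hours": [1, 2, 3, 4, 2, 3, 4, 5],
    "quiz_score": [55, 60, 68, 75, 70, 78, 85, 92],
    "tutoring": ["no", "no", "no", "no", "yes", "yes", "yes", "yes"],
})

fig, ax = plt.subplots(figsize=(10, 6))

# hue="tutoring" colors the points by group and adds a legend.
sns.scatterplot(data=toy, x="study_hours", y="quiz_score", hue="tutoring", ax=ax)

ax.set_title("Quiz Score vs. Study Hours (toy data)")
ax.set_xlabel("Study hours")
ax.set_ylabel("Quiz score")
```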

In [None]:
plt.figure(figsize=(10, 6))

# your code here

The output might look a bit confusing, but you can see that there are at least 4 distinct groups. These are caused by an underlying relationship with `'years experience'`.

What does the relationship between hours training and ranking look like for those with only 1 year of experience?

Generally, what relationship does hours training have with ranking?

# Bar Plot

Bar plots are used for comparing the quantities of different categories or groups.

For this task, create a bar plot showing the average athletic ranking by age.
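
The usual recipe is to group, average, and then plot one bar per group. Here is a minimal sketch of that recipe on invented data (the `toy` DataFrame, its columns, and the name `avg_hours` are hypothetical and only serve the illustration).

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data for illustration only (not the challenge dataset).
toy = pd.DataFrame({
    "club": ["chess", "chess", "robotics", "robotics", "drama", "drama"],
    "weekly_hours": [2, 4, 5, 7, 3, 6],
})

# Average the numeric column within each category, smallest first.
avg_hours = toy.groupby("club")["weekly_hours"].mean().sort_values()

fig, ax = plt.subplots(figsize=(10, 6))

# One bar per category: index values on the x-axis, averages as heights.
ax.bar(avg_hours.index, avg_hours.values)

ax.set_title("Average Weekly Hours by Club (toy data)")
ax.set_xlabel("Club")
ax.set_ylabel("Average weekly hours")
```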

In [None]:
avg_ranking_by_age = athletes.groupby('age')['ranking'].mean().sort_values()
plt.figure(figsize=(10, 6))

# your code here

Does there appear to be a relationship between age and ranking?

Why might the ranking be higher for 14-year-olds than 15-year-olds?

# Box Plot

Box plots provide a good visual summary of the data points and help to identify outliers.

Create a box plot showing the distribution of hours spent training for each age.
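
For reference, here is a minimal sketch on invented data (the `toy` DataFrame and its columns are assumptions for illustration). It draws one box per group and also computes each group's range (maximum minus minimum), which is one way to double-check what the whiskers and outliers suggest.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up data for illustration only (not the challenge dataset).
toy = pd.DataFrame({
    "grade": [9] * 5 + [10] * 5,
    "sleep_hours": [7, 8, 8, 9, 7, 5, 6, 7, 9, 10],
})

fig, ax = plt.subplots(figsize=(14, 8))

# One box per grade, summarizing the spread of sleep_hours.
sns.boxplot(data=toy, x="grade", y="sleep_hours", ax=ax)

ax.set_title("Sleep Hours by Grade (toy data)")
ax.set_xlabel("Grade")
ax.set_ylabel("Sleep hours")

# Range (max - min) per group, as a numeric check on the plot.
spread = toy.groupby("grade")["sleep_hours"].agg(lambda s: s.max() - s.min())
print(spread)
```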

In [None]:
plt.figure(figsize=(14, 8))

# your code here

Generally, do older athletes train longer?

Which age has the largest range of values?

# Histogram

Histograms are used to visualize the distribution of a continuous variable by counting how many values fall into each bin.

Try making a histogram showing the distribution of hours spent training. Also, plot a Kernel Density Estimate (KDE) on top of the histogram.

In [None]:
plt.figure(figsize=(10, 6))

# your code here

Why might there be fewer students training around 35 hours a week than 10?


# Swarm Plot

Swarm plots show every individual data point, nudging points sideways so that none of them overlap.

Create a swarm plot displaying athletes' rankings by years of experience. Sample only 100 examples.
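
Swarm plots get slow and crowded with many points, which is why sampling helps. The sketch below generates invented data (the `toy` DataFrame, its columns, and the fixed `random_state` are assumptions for illustration), samples 100 rows with `DataFrame.sample`, and plots only the sample.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up data for illustration only (not the challenge dataset).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "team": rng.choice(["red", "green", "blue"], size=300),
    "score": rng.normal(loc=75, scale=10, size=300).round(1),
})

# Keep 100 random rows; random_state makes the sample repeatable.
sampled = toy.sample(n=100, random_state=42)

fig, ax = plt.subplots(figsize=(10, 6))

# Each dot is one row of the sample; smaller markers reduce crowding.
sns.swarmplot(data=sampled, x="team", y="score", size=4, ax=ax)

ax.set_title("Scores by Team, 100 Sampled Rows (toy data)")
ax.set_xlabel("Team")
ax.set_ylabel("Score")
```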

In [None]:
plt.figure(figsize=(10,6))

# your code here

Are there more students with 1 year of experience or 7? Which has more students with a high ranking?

# Heatmap

Heatmaps use color to represent the values in a grid, which makes them a good fit for showing the correlations between numeric features in a dataset.

For this task, create a heatmap showing the correlations between the different subjects in the academics dataset. The reshaping is taken care of for you; just use `academics_pivot`.
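
As a reminder of the general idea, the sketch below works on invented numbers (the `toy` DataFrame, its columns, and the `coolwarm` color map are assumptions for illustration). It first computes the pairwise correlations with `DataFrame.corr` and then passes that table to `sns.heatmap`.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up data for illustration only (not the challenge dataset).
toy = pd.DataFrame({
    "reading": [60, 70, 80, 90, 100],
    "writing": [62, 68, 83, 88, 97],
    "sprint_time": [15.0, 14.2, 14.8, 13.9, 14.5],
})

# Pairwise Pearson correlations between the numeric columns.
corr = toy.corr()

fig, ax = plt.subplots(figsize=(10, 8))

# annot=True prints each value; vmin/vmax fix the color scale to [-1, 1].
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm", ax=ax)

ax.set_title("Correlation Heatmap (toy data)")
```

Fixing the color scale to the full correlation range keeps colors comparable between plots, though it is a design choice rather than a requirement.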

In [None]:
course_to_subject = {
    'Algebra 1': 'Math',
    'Algebra 2': 'Math',
    'Geometry': 'Math',
    'Precalculus': 'Math',
    'Calculus': 'Math',
    'Biology': 'Science',
    'Chemistry': 'Science',
    'Physics': 'Science',
    'Environmental Science': 'Science',
    'Astronomy': 'Science',
    'Anatomy': 'Science',
    'World Geography': 'History',
    'World History': 'History',
    'American History': 'History',
    'American Government': 'History',
    'Economics': 'History',
    'English 1': 'English',
    'English 2': 'English',
    'English 3': 'English',
    'English 4': 'English',
    'Physical Education': 'Elective',
    'Computer Science': 'Elective',
    'Cooking': 'Elective',
    'Yearbook': 'Elective',
    'Studio Art': 'Elective',
    'Music': 'Elective'
}
academics['subject'] = academics['course'].map(course_to_subject)
academics_pivot = academics.pivot_table(values='course score', index='id', columns='subject')
academics_pivot
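
To see what the cell above does, here is the same map-then-pivot pattern on a tiny invented table (the `scores` DataFrame, `subject_of`, and `wide` are hypothetical names used only for this illustration). Each course is mapped to a subject, and `pivot_table` then averages the scores so that there is one row per student and one column per subject.

```python
import pandas as pd

# Made-up long-format data for illustration only (not the challenge dataset).
scores = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "course": ["Algebra 1", "Geometry", "Biology",
               "Algebra 1", "Geometry", "Biology"],
    "course score": [80, 90, 70, 60, 70, 95],
})

# Map each course to its broader subject.
subject_of = {"Algebra 1": "Math", "Geometry": "Math", "Biology": "Science"}
scores["subject"] = scores["course"].map(subject_of)

# One row per student, one column per subject; pivot_table averages by default.
wide = scores.pivot_table(values="course score", index="id", columns="subject")
print(wide)
```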

In [None]:
plt.figure(figsize=(10, 8))

# your code here

Does there appear to be a correlation between math and science? Math and English?