<a href="https://colab.research.google.com/github/BreakoutMentors/Data-Science-and-Machine-Learning/blob/main/basics/Data_Visualization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome to Data Visualization with Seaborn and Matplotlib
Imagine you've got a ton of data and you want to see it in a way that makes sense. Meet Seaborn and Matplotlib, two Python libraries that are kind of like magic wands for your data.

Seaborn, which builds on Matplotlib, is like your friendly neighborhood artist. It takes complex data and, with a few lines of code, transforms it into neat, easy-to-understand, and good-looking graphs. No fuss, no stress, just clean and beautiful visuals.

Then there's Matplotlib, the older, more powerful cousin. It might be a bit more complex, but it allows you to get down to the nitty-gritty. Want to change the color of a line, the label on an axis, or animate your plot? Matplotlib is your go-to. It's the library that gives you the reins to fully customize your visuals.

So, as we go along, we'll get to know these tools better, learn how to make different kinds of plots, and pick up tips and tricks to make your data look as good as possible. Ready to dive in? Let's get started with Seaborn and Matplotlib!

# Necessary Imports

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Datasets

In [None]:
students_url = 'https://raw.githubusercontent.com/BreakoutMentors/Data-Science-and-Machine-Learning/main/datasets/students.csv'
athletes_url = 'https://raw.githubusercontent.com/BreakoutMentors/Data-Science-and-Machine-Learning/main/datasets/student_athletes.csv'
academics_url = 'https://raw.githubusercontent.com/BreakoutMentors/Data-Science-and-Machine-Learning/main/datasets/student_academics.csv'

students = pd.read_csv(students_url, index_col='id')
athletes = pd.read_csv(athletes_url, index_col=0)
academics = pd.read_csv(academics_url, index_col=0)

In [None]:
students

In [None]:
athletes

In [None]:
academics

# Line Plot

The line plot below visualizes the relationship between an athlete's years of experience and the number of hours they spend training each week by connecting data points with a line. When several athletes share the same years of experience, Seaborn plots the mean of their training hours and shades a 95% confidence interval around it. Line plots are ideal for showing how a value changes over time or, as in this case, how one variable changes with another.

In [None]:
plt.figure(figsize=(10, 6))
sns.lineplot(data=athletes, x='years experience', y='hours training')

plt.title('Athletes Years of Experience vs. Hours Training')
plt.xlabel('Years of Experience')
plt.ylabel('Weekly Hours Training')
plt.show()

Does there appear to be a relationship?

One reason the line looks weird is that other variables also determine how many hours an athlete trains. For example, older athletes generally train longer than younger athletes with the same years of experience.

# Scatter Plot

In the scatter plot, we are looking at whether there's a relationship between an athlete's years of experience in a sport and their ranking. Each dot represents an individual athlete, with years of experience on the x-axis and ranking on the y-axis. Scatter plots are good for spotting correlations between two variables.

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(data=athletes, x='years experience', y='ranking', hue='sport')

plt.title('Years of Experience vs Ranking')
plt.xlabel('Years of Experience')
plt.ylabel('Ranking')
plt.legend(title='Sport', bbox_to_anchor=(1.05, 1), loc=2) # Formats the legend to be on the right side
plt.show()

It's kind of hard to see a relationship with all that data! Let's reduce the amount of data we display to make it easier to tell whether one exists. Rather than plotting every ranking, let's visualize just the mean ranking, with error bars, for each sport at each number of years of experience. Seaborn makes this easy: we convert our scatter plot to a line plot, and Seaborn automatically computes the mean ranking for each combination of sport and years of experience. To add error bars, we pass `err_style="bars"` to draw them as bars and `errorbar=("se", 3)` to make each bar reach 3 standard errors above and below the mean. (To show standard deviations instead of standard errors, use `"sd"` in place of `"se"`.)
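
To make the aggregation concrete, here is a rough pandas sketch of what a line plot with `errorbar=("se", k)` summarizes for each group: the mean, and a bar reaching `k` standard errors either side of it. The `toy` table and its values are made up for illustration; only the column names are borrowed from `athletes`.

```python
import pandas as pd

# Hypothetical toy data; the real athletes table has many more rows.
toy = pd.DataFrame({
    "sport": ["soccer"] * 4 + ["tennis"] * 4,
    "years experience": [1, 1, 2, 2, 1, 1, 2, 2],
    "ranking": [10, 14, 20, 26, 5, 7, 9, 15],
})

# Mean and standard error of ranking for each (sport, years experience) group
summary = (
    toy.groupby(["sport", "years experience"])["ranking"]
    .agg(mean="mean", se="sem")
    .reset_index()
)

# Each error bar spans mean - k * se to mean + k * se
k = 3
summary["lower"] = summary["mean"] - k * summary["se"]
summary["upper"] = summary["mean"] + k * summary["se"]

print(summary)
```

Swapping `"sem"` for `"std"` in the sketch gives the standard-deviation version, which describes how spread out the rankings are rather than how precisely the mean is estimated.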

In [None]:
plt.figure(figsize=(10, 6))
sns.lineplot(data=athletes, x='years experience', y='ranking', hue='sport', style='sport', err_style="bars", errorbar=("se", 3))

plt.title('Years of Experience vs Ranking')
plt.xlabel('Years of Experience')
plt.ylabel('Ranking')
plt.legend(title='Sport', bbox_to_anchor=(1.05, 1), loc=2) # Formats the legend to be on the right side
plt.show()

Does there appear to be a correlation between years of experience and ranking?

What other patterns do you notice? For example, how does the range of values change as years of experience increase?

Do you think it was useful to visualize the data in two forms: scatter and line plots? Are we able to see trends more clearly with one style over the other? For example, which plot makes it easier to see if there is a relationship between years experience and ranking for each sport? Which plot makes it easier to see if the variance (i.e., how spread out the data is) of ranking changes in relation to years of experience?

# Bar Plot

The bar plot shows the average course score by grade level. It's a great way to compare different categories of data. In this case, we can see which grade levels, on average, score higher or lower on courses.

In [None]:
avg_scores_by_grade = academics.groupby('grade level')['course score'].mean().sort_values()
plt.figure(figsize=(10, 6))
sns.barplot(x=avg_scores_by_grade.index, y=avg_scores_by_grade.values)

plt.title('Average Course Score by Grade Level')
plt.xlabel('Grade Level')
plt.ylabel('Average Course Score')
plt.xticks(rotation=90)
plt.show()

# Histogram

The histogram provides a visual representation of the distribution of course scores. Each bar represents a range of course scores, and the height of the bar represents the frequency of scores in that range. It helps to understand the shape of the data and identify any outliers or unusual data points.

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(academics['course score'], bins=12, kde=True) # bins sets how many equal-width bars the scores are grouped into

plt.title('Distribution of Course Scores')
plt.xlabel('Course Score')
plt.ylabel('Frequency')
plt.show()

It appears that grades are pretty evenly spread except for 70-72.5 and 97.5-100. Why might this be?

Feel free to play around with the parameters and see how things change. What happens when you change bins to 5?
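
To see what the `bins` parameter does to the underlying counts, here is a small sketch using NumPy's `np.histogram`, which splits the data range into equal-width bins in the same way. The scores are made up for illustration and are not taken from the `academics` dataset.

```python
import numpy as np

# Made-up scores for illustration only
scores = np.array([70, 72, 75, 81, 84, 88, 90, 93, 97, 100])

# Fewer bins: each bar covers a wider range of scores
counts_5, edges_5 = np.histogram(scores, bins=5)
print("5 bins, edges:", edges_5.tolist())
print("5 bins, counts:", counts_5.tolist())

# More bins: each bar covers a narrower range, so more bars are empty
counts_12, edges_12 = np.histogram(scores, bins=12)
print("12 bins, width:", edges_12[1] - edges_12[0])
print("12 bins, counts:", counts_12.tolist())
```

With few bins the distribution looks smooth but coarse; with many bins you see more detail, along with more noise from bars that happen to hold only one or two scores.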

# Box Plot

Box plots summarize a distribution graphically: the box spans the first to third quartiles, a line inside it marks the median, whiskers show the range of typical values, and individual points mark outliers. The box plot of course scores by sport can provide insights into which sports tend to have higher or lower course scores, as well as the spread of scores within each sport.
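
The numbers behind a single box can be computed by hand. The sketch below assumes the common 1.5 × IQR rule for whiskers (Seaborn's default) and uses made-up scores for one hypothetical sport.

```python
import pandas as pd

# Made-up course scores for one hypothetical sport
scores = pd.Series([55, 78, 80, 82, 85, 88, 90, 91, 95])

# The box: first quartile, median, third quartile
q1, median, q3 = scores.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # interquartile range, the height of the box

# Points beyond 1.5 * IQR from the box are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
is_outlier = (scores < lower_fence) | (scores > upper_fence)
outliers = scores[is_outlier]

# Whiskers stop at the most extreme points that are not outliers
whisker_low = scores[~is_outlier].min()
whisker_high = scores[~is_outlier].max()

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print(f"outliers: {outliers.tolist()}")
```

Here the score of 55 falls below the lower fence of 65, so it would be drawn as a separate point rather than stretching the whisker down to it.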

In [None]:
sport_academics = pd.merge(athletes, academics, on='id')
plt.figure(figsize=(14, 8))
sns.boxplot(data=sport_academics, x='sport', y='course score')
plt.title('Course Scores for Different Sports')
plt.xlabel('Sport')
plt.ylabel('Course Score')
plt.xticks(rotation=90)
plt.show()

Do you think that the sport a student plays affects the expected course score?

# Swarm Plot

Swarm plots, like the one displaying the course scores by sport, draw every data point without any overlap. A swarm plot is a categorical scatter plot: it works like a strip plot, except that points are nudged sideways along the categorical axis so they don't cover each other. It shows the distribution of the data and can help us visualize both the values and the density of the observations.

In [None]:
plt.figure(figsize=(10,6))
sns.swarmplot(x="sport", y="course score", data=sport_academics.sample(100))
plt.title("Swarm plot of course scores by sport")
plt.show()

Does the sport a student plays seem to significantly affect the distribution of scores?

# Heatmap

A heatmap, like the one showing the correlations between different *numeric features*, is a data visualization technique that shows the magnitude of a phenomenon as colors in two dimensions. The variations in color intensity correlate with the magnitude of the phenomenon. In this case, it's used to depict the correlation between different numeric features in our dataset. A correlation close to 1 or -1 indicates a strong positive or negative relationship, while a correlation close to 0 indicates a weak or no relationship.

In pandas, the `corr()` method computes the pairwise correlation between the columns of a DataFrame (or between two Series).
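
As a rough illustration of what `corr()` returns, the sketch below builds a tiny made-up table (only the column names echo `athletes`) and checks one entry of the correlation matrix against the Pearson formula written out by hand. It assumes a pandas version that supports `numeric_only=True`, which tells `corr()` to skip text columns such as `sport`.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "sport": ["soccer", "tennis", "soccer", "tennis", "soccer"],
    "years experience": [1, 2, 3, 4, 5],
    "hours training": [4, 5, 9, 10, 12],
})

# Pairwise Pearson correlations between the numeric columns
corr_matrix = toy.corr(numeric_only=True)
print(corr_matrix)

# The same number by hand: covariance scaled by both spreads
x = toy["years experience"]
y = toy["hours training"]
dx, dy = x - x.mean(), y - y.mean()
manual_r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
print(manual_r)
```

The diagonal of the matrix is always 1, since every column is perfectly correlated with itself, and the matrix is symmetric, which is why a heatmap of it mirrors across the diagonal.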

In [None]:
plt.figure(figsize=(10, 8))
sns.heatmap(athletes.corr(numeric_only=True), annot=True, cmap='coolwarm') # numeric_only=True skips text columns such as sport; try other values for cmap ('BuPu', 'Blues', 'coolwarm', 'Greens')
plt.title('Correlation heatmap of numeric features')
plt.show()

Which values seem to have the highest correlations?

Why do you think that `years experience` and `ranking` have a correlation of 0.9?

What about `ranking` and `hours training` having a correlation of 0.67?

Why do you think `id` has such a low correlation to everything else?
