# Introduction to Data Visualization with Seaborn
Run the hidden code cell below to import the data used in this course.

In [1]:
# Importing the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Importing the course datasets
country_data = pd.read_csv('datasets/countries-of-the-world.csv', decimal=",")
mpg = pd.read_csv('datasets/mpg.csv')
student_data = pd.read_csv('datasets/student-alcohol-consumption.csv', index_col=0)
survey = pd.read_csv('datasets/young-people-survey-responses.csv', index_col=0)

## Take Notes

Add notes about the concepts you've learned and code cells with code you want to keep.

_Add your notes here_

In [2]:
# Add your code snippets here

## Explore Datasets
Use the DataFrames imported in the first cell to explore the data and practice your skills!
- From `country_data`, create a scatter plot to look at the relationship between GDP and Literacy. Use color to segment the data points by region.
- Use `mpg` to create a line plot with `model_year` on the x-axis and `weight` on the y-axis. Create differentiating lines for each country of origin (`origin`). 
- Create a box plot from `student_data` to explore the relationship between the number of failures (`failures`) and the average final grade (`G3`).
- Create a bar plot from `survey` to compare how `Loneliness` differs across values for `Internet usage`. Format it to have two subplots for gender.
- Make sure to add titles and labels to your plots and adjust their format for readability!

# **CHAPTER-3**

# Count plots


- Use sns.catplot() to create a count plot using the survey_data DataFrame with "Internet usage" on the x-axis.


In [None]:
# Create a count plot of internet usage
sns.catplot(x="Internet usage", data=survey_data,
            kind="count")

# Show plot
plt.show()
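As a rough illustration of what a count plot displays, the sketch below tallies categories by hand with the standard library. The `responses` list is made-up data standing in for the "Internet usage" column, not values from the survey.

```python
from collections import Counter

# Hypothetical responses standing in for the "Internet usage" column
responses = [
    "few hours a day", "most of the day", "few hours a day",
    "less than an hour a day", "few hours a day", "most of the day",
]

# A count plot draws one bar per category, with height equal to its frequency
counts = Counter(responses)

for category, n in counts.most_common():
    print(f"{category}: {n}")
```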

- Make the bars horizontal instead of vertical.

In [None]:
# Put the categories on the y-axis to make the bars horizontal
sns.catplot(y="Internet usage", data=survey_data,
            kind="count")

# Show plot
plt.show()

- Separate this plot into two side-by-side column subplots based on "Age Category", which splits respondents into those younger than 21 and those 21 and older.

In [None]:
# Separate into column subplots based on age category
sns.catplot(y="Internet usage", data=survey_data,
            kind="count",col="Age Category")

# Show plot
plt.show()

# Bar plots with percentages

Use the survey_data DataFrame and sns.catplot() to create a bar plot with "Gender" on the x-axis and "Interested in Math" on the y-axis.


In [None]:
# Create a bar plot of interest in math, separated by gender

sns.catplot(x="Gender", y="Interested in Math",
            data=survey_data, kind="bar")

# Show plot
plt.show()
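A bar plot shows the mean of the y variable for each category. Assuming "Interested in Math" holds True/False values, that mean is the share of respondents who answered True, which is why the bar heights read as percentages. The sketch below reproduces the arithmetic on made-up answers (the `answers` dictionary is hypothetical).

```python
# Hypothetical True/False answers standing in for "Interested in Math"
answers = {
    "Female": [True, False, False, True, False],
    "Male": [True, True, False, True],
}

# True counts as 1 and False as 0, so the mean is the share of True answers
shares = {
    gender: sum(values) / len(values)
    for gender, values in answers.items()
}

for gender, share in shares.items():
    print(f"{gender}: {share:.0%}")
```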

# Customizing bar plots

Use sns.catplot() to create a bar plot with "study_time" on the x-axis and final grade ("G3") on the y-axis, using the student_data DataFrame.

In [None]:
# Create bar plot of average final grade in each study category

sns.catplot(x="study_time", y="G3",
            data=student_data, kind="bar")



# Show plot
plt.show()

- Using the order parameter and the category_order list that is provided, rearrange the bars so that they are in order from lowest study time to highest.

In [None]:
# List of categories from lowest to highest
category_order = ["<2 hours", 
                  "2 to 5 hours", 
                  "5 to 10 hours", 
                  ">10 hours"]

# Rearrange the categories
sns.catplot(x="study_time", y="G3",
            data=student_data,
            kind="bar",order=category_order)

# Show plot
plt.show()

Update the plot so that it no longer displays confidence intervals.

In [None]:
# List of categories from lowest to highest
category_order = ["<2 hours", 
                  "2 to 5 hours", 
                  "5 to 10 hours", 
                  ">10 hours"]

# Turn off the confidence intervals
sns.catplot(x="study_time", y="G3",
            data=student_data,
            kind="bar",
            order=category_order,ci=None)

# Show plot
plt.show()
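For context on what is being switched off: by default, seaborn estimates the 95% confidence interval around each bar's mean by bootstrapping. The sketch below illustrates the idea with the standard library only. The `grades` values are made up, and the percentile indexing is a simplification rather than seaborn's exact implementation.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical final grades for one study-time category
grades = [8, 10, 11, 12, 12, 13, 14, 15, 9, 11]
n_boot = 1000

# Resample with replacement many times and record the mean of each resample
boot_means = sorted(
    statistics.mean(random.choices(grades, k=len(grades)))
    for _ in range(n_boot)
)

# The middle 95% of the resampled means approximates the confidence interval
lower = boot_means[n_boot * 25 // 1000]
upper = boot_means[n_boot * 975 // 1000 - 1]

print(f"mean = {statistics.mean(grades)}, approx. 95% CI = ({lower}, {upper})")
```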

# Create and interpret a box plot

Use sns.catplot() and the student_data DataFrame to create a box plot with "study_time" on the x-axis and "G3" on the y-axis. Set the ordering of the categories to study_time_order.

In [None]:
# Specify the category ordering
study_time_order = ["<2 hours", "2 to 5 hours", 
                    "5 to 10 hours", ">10 hours"]

# Create a box plot and set the order of the categories

sns.catplot(x="study_time", y="G3",
            data=student_data,
            kind="box",
            order=study_time_order)



# Show plot
plt.show()

**Question:** Which of the following is a correct interpretation of this box plot?

Possible answers (the correct one is in bold):

- The 75th percentile of grades is highest among students who study more than 10 hours a week.
- There are no outliers plotted for these box plots.
- The 5th percentile of grades among students studying less than 2 hours is 5.0.
- **The median grade among students studying less than 2 hours is 10.0.**

# Omitting outliers

- Use sns.catplot() to create a box plot with the student_data DataFrame, putting "internet" on the x-axis and "G3" on the y-axis.
- Add subgroups so each box plot is colored based on "location".
- Do not display the outliers.

In [None]:
# Create a box plot with subgroups and omit the outliers

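# showfliers=False hides the outlier points; sym="" does the same in
# older seaborn releases but may be rejected by newer ones
sns.catplot(x="internet", y="G3",
            data=student_data,
            kind="box",
            hue="location",
            showfliers=False)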




# Show plot
plt.show()

# Adjusting the whiskers

Adjust the code so that the box plot whiskers extend to 0.5 * IQR. Recall that the IQR is the interquartile range.


In [None]:
# Set the whiskers to 0.5 * IQR
sns.catplot(x="romantic", y="G3",
            data=student_data,
            kind="box",
            whis=0.5)

# Show plot
plt.show()
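To make the 0.5 * IQR rule concrete, here is a small sketch of how the whisker ends and outliers follow from the quartiles. The `grades` list is made-up data, and the quartiles use linear interpolation via `statistics.quantiles` with `method="inclusive"`, which is assumed here to approximate what the plotting library computes.

```python
import statistics

# Hypothetical final grades with one unusually low and one unusually high value
grades = [0, 8, 9, 10, 10, 11, 11, 12, 13, 19]

# Quartiles with linear interpolation between data points
q1, median, q3 = statistics.quantiles(grades, n=4, method="inclusive")
iqr = q3 - q1

whis = 0.5  # whisker length as a multiple of the IQR
low_limit = q1 - whis * iqr
high_limit = q3 + whis * iqr

# Whiskers stop at the most extreme observations inside the limits
in_range = [g for g in grades if low_limit <= g <= high_limit]
lower_whisker, upper_whisker = min(in_range), max(in_range)

# Anything beyond the limits is drawn as an individual outlier point
outliers = [g for g in grades if g < low_limit or g > high_limit]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {lower_whisker} to {upper_whisker}, outliers: {outliers}")
```

A smaller multiple shortens the whiskers, so more observations are classified as outliers.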

Change the code to set the whiskers to extend to the 5th and 95th percentiles.

In [None]:
# Extend the whiskers to the 5th and 95th percentiles
sns.catplot(x="romantic", y="G3",
            data=student_data,
            kind="box",
            whis=[5, 95])

# Show plot
plt.show()

Change the code to set the whiskers to extend to the min and max values.




In [None]:
# Set the whiskers at the min and max values
sns.catplot(x="romantic", y="G3",
            data=student_data,
            kind="box",
            whis=[0, 100])

# Show plot
plt.show()

# Customizing point plots

Use sns.catplot() and the student_data DataFrame to create a point plot with "famrel" on the x-axis and number of absences ("absences") on the y-axis.



In [None]:
# Create a point plot of family relationship vs. absences
sns.catplot(x="famrel", y="absences",
            data=student_data,
            kind="point")


            
# Show plot
plt.show()

Add "caps" to the end of the confidence intervals with size 0.2.



In [None]:
# Add caps to the ends of the confidence intervals
sns.catplot(x="famrel", y="absences",
            data=student_data,
            kind="point",
            capsize=0.2)
            
# Show plot
plt.show()

Remove the lines joining the points in each category.





In [None]:
# Remove the lines joining the points
sns.catplot(x="famrel", y="absences",
            data=student_data,
            kind="point",
            capsize=0.2,
            join=False)
            
# Show plot
plt.show()

# Point plots with subgroups

Use sns.catplot() and the student_data DataFrame to create a point plot with relationship status ("romantic") on the x-axis and number of absences ("absences") on the y-axis. Color the points based on the school that they attend ("school").




In [None]:
# Create a point plot that uses color to create subgroups

sns.catplot(x="romantic", y="absences",
            data=student_data,
            kind="point",
            hue="school")


# Show plot
plt.show()

Turn off the confidence intervals for the plot.


In [None]:
# Turn off the confidence intervals for this plot
sns.catplot(x="romantic", y="absences",
            data=student_data,
            kind="point",
            hue="school",
            ci=None)

# Show plot
plt.show()

Since there may be outliers of students with many absences, use the median function that we've imported from numpy to display the median number of absences instead of the average.



In [None]:
# Import median function from numpy
from numpy import median

# Plot the median number of absences instead of the mean
sns.catplot(x="romantic", y="absences",
            data=student_data,
            estimator=median,
            kind="point",
            hue="school",
            ci=None)

# Show plot
plt.show()
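The reason for swapping the estimator can be seen with a few numbers. In the sketch below, the `absences` list is made-up data with a single extreme value: the mean is pulled far upward while the median barely notices.

```python
import statistics

# Hypothetical absence counts with one extreme value
absences = [0, 2, 2, 3, 4, 4, 5, 56]

mean_abs = statistics.mean(absences)      # sensitive to the extreme value
median_abs = statistics.median(absences)  # middle of the sorted values

print(f"mean = {mean_abs}, median = {median_abs}")
```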