# Intermediate Python

**Course Description**

Learning Python is crucial for any aspiring data science practitioner. Learn to visualize real data with Matplotlib’s functions and get acquainted with data structures such as the dictionary and pandas DataFrame. This four-hour intermediate course will help you to build on your existing Python skills and explore new Python applications and functions that expand your repertoire and help you work more efficiently.

You'll discover how dictionaries offer an alternative to Python lists, and why the pandas DataFrame is the most popular way of working with tabular data. In the second chapter of this course, you’ll find out how you can create and manipulate datasets, and how to access them using these structures. Hands-on practice throughout the course will build your confidence in each area.

As you progress, you’ll look at logic, control flow, filtering and loops. These tools control decision-making in Python programs and help you to perform more operations with your data, including repeated statements. You’ll finish the course by applying all of your new skills, using hacker statistics to calculate your chances of winning a bet.

Once you’ve completed all of the chapters, you’ll be ready to apply your new skills in your job, new career, or personal project, and be prepared to move onto more advanced Python learning.

## 1 Matplotlib

Data visualization is a key skill for aspiring data scientists. Matplotlib makes it easy to create meaningful and insightful plots. In this chapter, you’ll learn how to build various types of plots, and customize them to be more visually appealing and interpretable.

### Basic plots with Matplotlib

1. Basic plots with Matplotlib

Hi! My name is Hugo, and I'm a data scientist and educator at DataCamp. I'm also the host of the weekly podcast DataFramed, which you need to check out to stay up to date with everything that's happening in data science. In this intermediate Python course, you're going to take your Python skills to the next level, specifically for data science.

2. Basic plots with Matplotlib

![image.png](attachment:image.png)

You will learn how to visualize data and to store data in new data structures. Along the way, you will master control structures, which you will need to customize the flow of your scripts and algorithms. These are the types of things data scientists use every day. We'll finish this chapter with a case study, where you'll blend together everything you've learned to solve an exciting challenge.

3. Data visualization

![image-2.png](attachment:image-2.png)

This first chapter is about data visualization, which is a very important part of data analysis. First of all, you will use it to explore your dataset. The better you understand your data, the better you'll be able to extract insights. And once you've found those insights, again, you'll need visualization to be able to share your valuable insights with other people.

4. Beautiful plot

![image-3.png](attachment:image-3.png)

As an example, have a look at this beautiful plot. It's made by the late, the great Swedish professor Hans Rosling. His talks about global development have been viewed millions of times. And what makes them so intriguing, is that by making beautiful plots, he allows the data to tell their own story. Here we see a bubble chart, where each bubble represents a country. The bigger the bubble, the bigger the country's population, so the two biggest bubbles here are China and India. There are 2 axes.

1 Source: GapMinder, Wealth and Health of Nations

5. Beautiful plot

![image-4.png](attachment:image-4.png)

The horizontal axis shows the GDP per capita, in US dollars.

6. Beautiful plot

![image-5.png](attachment:image-5.png)

The vertical axis shows life expectancy. We clearly see that people live longer in countries with a higher GDP per capita. Still, there is a huge difference in life expectancy between countries on the same income level. Now why am I telling you this? Well, because by the end of this chapter, you'll be able to build this beautiful plot yourself!

7. Matplotlib

![image-6.png](attachment:image-6.png)

There are many visualization packages in Python, but the mother of them all is matplotlib. You will need its subpackage pyplot. By convention, this subpackage is imported as plt, like this. For our first example, let's try to gain some insight into the evolution of the world population. I have a list with years here, year, and a list with the corresponding populations, expressed in billions, pop. In the year 1970, for example, 3.7 billion people lived on planet Earth. To plot this data as a line chart, we call plt.plot and use our two lists as arguments. The first argument corresponds to the horizontal axis, and the second one to the vertical axis. You might think that a plot will pop up right now, but Python's pretty lazy. It will wait for the show function to actually display the plot. This is because you might want to add some extra ingredients to your plot before actually displaying it, such as titles and label customizations. I'll talk about that some more later on. Just remember this: the plot function tells Python what to plot and how to plot it; show actually displays the plot.
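
The steps described above can be sketched as follows. The four values in `year` and `pop` are approximations chosen to match the figures mentioned in the video (about 2.5 billion in 1950, 3.7 billion in 1970, about 7 billion in 2010); the 1990 value is an assumption added for illustration.

```python
import matplotlib.pyplot as plt

# Illustrative data: four years and the world population in billions.
# These are rounded, approximate values, not an official dataset.
year = [1950, 1970, 1990, 2010]
pop = [2.5, 3.7, 5.3, 7.0]

# plot() only describes the chart: year on the horizontal axis,
# pop on the vertical axis. Nothing is displayed yet.
lines = plt.plot(year, pop)

# show() is the call that actually displays the figure.
plt.show()
```

Keeping `plt.show()` as the last call leaves room to add titles and labels between `plot()` and `show()`.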

8. Matplotlib

![image-7.png](attachment:image-7.png)

When we look at our plot, we see that the years are indeed shown on the horizontal axis, and the populations on the vertical axis.

9. Matplotlib

![image-8.png](attachment:image-8.png)

There are four data points, and Python draws a line between them. In 1950, the world population was around 2.5 billion. In 2010, it was around 7 billion. So the world population has almost tripled in sixty years. What if the population keeps on growing like that? Will the world become overpopulated? You'll find out in the exercises.

10. Scatter plot

![image-9.png](attachment:image-9.png)

Let me first introduce you to another type of plot: the scatter plot. To create it, we can start from the code from before. This time, though, you change the plot function to scatter.

11. Scatter plot

![image-10.png](attachment:image-10.png)

The resulting scatter plot simply plots all the individual data points; Python doesn't connect the dots with a line. For many applications, the scatter plot is a better choice than the line plot, so remember this scatter function well. You could also say that this is a more honest way of plotting your data, because you can clearly see that the plot is based on just four data points.
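
As a minimal sketch of that change, the only difference from the line plot is the function name. The data values are the same rounded, illustrative numbers as before and are assumptions, not the exact lists used in the video.

```python
import matplotlib.pyplot as plt

# Illustrative data: world population in billions (approximate values).
year = [1950, 1970, 1990, 2010]
pop = [2.5, 3.7, 5.3, 7.0]

# scatter() draws one marker per (year, pop) pair and no connecting line.
points = plt.scatter(year, pop)

plt.show()
```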

12. Let's practice!

Now that we've got the basics of matplotlib covered, it's your turn to build some awesome plots. Have fun! But make sure to come back here so we can plot even more together.

**Exercise**

**Line plot (1)**

With matplotlib, you can create a bunch of different plots in Python. The most basic plot is the line plot. A general recipe is given here.

    import matplotlib.pyplot as plt
    plt.plot(x,y)
    plt.show()

In the video, you already saw how much the world population has grown over the past years. Will it continue to do so? The World Bank has estimates of the world population for the years 1950 up to 2100. The years are loaded in your workspace as a list called year, and the corresponding populations as a list called pop.

This course touches on a lot of concepts you may have forgotten, so if you ever need a quick refresher, download the [Python for data science Cheat Sheet](https://datacamp-community-prod.s3.amazonaws.com/0eff0330-e87d-4c34-88d5-73e80cb955f2) and keep it handy!

**Instructions**

- print() the last item from both the year and the pop list to see what the predicted population for the year 2100 is. Use two print() functions.
- Before you can start, you should import matplotlib.pyplot as plt. pyplot is a sub-package of matplotlib, hence the dot.
- Use plt.plot() to build a line plot. year should be mapped on the horizontal axis, pop on the vertical axis. Don't forget to finish off with the plt.show() function to actually display the plot.

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

**Exercise**

**Line plot (2)**

![image.png](attachment:image.png)

    # df = pd.read_csv("gapminder.csv", encoding = 'unicode_escape')
    # df

**Exercise**

**Line plot (3)**

Now that you've built your first line plot, let's start working on the data that professor Hans Rosling used to build his beautiful bubble chart. It was collected in 2007. Two lists are available for you:

- life_exp which contains the life expectancy for each country and
- gdp_cap, which contains the GDP per capita (i.e. per person) for each country expressed in US Dollars.

GDP stands for Gross Domestic Product. It basically represents the size of the economy of a country. Divide this by the population and you get the GDP per capita.

matplotlib.pyplot is already imported as plt, so you can get started straight away.

**Instructions**

- Print the last item from both the list gdp_cap, and the list life_exp; it is information about Zimbabwe.
- Build a line chart, with gdp_cap on the x-axis, and life_exp on the y-axis. Does it make sense to plot this data on a line plot?
- Don't forget to finish off with a plt.show() command, to actually display the plot.

**Exercise**

**Scatter Plot (1)**

When you have a time scale along the horizontal axis, the line plot is your friend. But in many other cases, when you're trying to assess if there's a correlation between two variables, for example, the scatter plot is the better choice. Below is an example of how to build a scatter plot.

    import matplotlib.pyplot as plt
    plt.scatter(x,y)
    plt.show()

Let's continue with the gdp_cap versus life_exp plot, the GDP and life expectancy data for different countries in 2007. Maybe a scatter plot will be a better alternative?

Again, the matplotlib.pyplot package is available as plt.

**Instructions**

- Change the line plot that's coded in the script to a scatter plot.
- A correlation will become clear when you display the GDP per capita on a logarithmic scale. Add the line plt.xscale('log').
- Finish off your script with plt.show() to display the plot.
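
One way these instructions could look in code is sketched below. The short `gdp_cap` and `life_exp` lists are hypothetical placeholders standing in for the exercise's full 2007 data; they are not real country figures.

```python
import matplotlib.pyplot as plt

# Hypothetical placeholder data (not the real Gapminder values):
# GDP per capita in US dollars and life expectancy in years.
gdp_cap = [500, 1500, 5000, 12000, 40000]
life_exp = [48.0, 59.5, 68.2, 74.1, 80.3]

# A scatter plot instead of a line plot: one marker per country.
points = plt.scatter(gdp_cap, life_exp)

# Put the horizontal axis on a logarithmic scale so that the wide
# range of GDP values is spread out more evenly.
plt.xscale('log')

plt.show()
```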

**Exercise**

**Scatter plot (2)**

In the previous exercise, you saw that a higher GDP usually corresponds to a higher life expectancy. In other words, there is a positive correlation.

Do you think there's a relationship between population and life expectancy of a country? The list life_exp from the previous exercise is already available. In addition, now also pop is available, listing the corresponding populations for the countries in 2007. The populations are in millions of people.

**Instructions**

- Start from scratch: import matplotlib.pyplot as plt.
- Build a scatter plot, where pop is mapped on the horizontal axis, and life_exp is mapped on the vertical axis.
- Finish the script with plt.show() to actually display the plot. Do you see a correlation?

### Histogram

1. Histogram

Oh you are killing it! Now in this video, I'll introduce the histogram.

2. Histogram

![image.png](attachment:image.png)

The histogram is a type of visualization that's very useful to explore your data. It can help you to get an idea about the distribution of your variables. To see how it works, imagine 12 values between 0 and 6.

3. Histogram

![image-2.png](attachment:image-2.png)

I've put them along a number line here. To build a histogram for these values, you can divide the line into equal chunks, called bins.

4. Histogram

![image-3.png](attachment:image-3.png)

Suppose you go for 3 bins, that each have a width of 2. Next, you count how many data points sit inside each bin.

5. Histogram

![image-4.png](attachment:image-4.png)

There are 4 data points in the first bin,

6. Histogram

![image-5.png](attachment:image-5.png)

6 in the second bin

7. Histogram

![image-6.png](attachment:image-6.png)

and 2 in the third bin.

8. Histogram

![image-7.png](attachment:image-7.png)

Finally, you draw a bar for each bin. The height of the bar corresponds to the number of data points that fall in this bin. The result is a histogram, which gives us a nice overview of how the 12 values are distributed. Most values are in the middle, but there are more values below 2 than there are above 4. Of course, we can use matplotlib to build histograms as well.
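
The counting step can be reproduced by hand in plain Python. This is only a sketch: the 12 values below are assumptions, chosen so that they fall between 0 and 6 and give the 4, 6 and 2 counts described above.

```python
# Hypothetical values between 0 and 6, chosen to match the counts above.
values = [0, 0.6, 1.4, 1.6, 2.2, 2.5, 2.6, 3.2, 3.5, 3.9, 4.2, 6]

n_bins = 3
low, high = 0, 6
width = (high - low) / n_bins  # each bin is 2 units wide

counts = [0] * n_bins
for v in values:
    # Work out which bin the value belongs to. A value equal to the
    # upper limit (6) is placed in the last bin rather than a fourth one.
    index = min(int((v - low) // width), n_bins - 1)
    counts[index] += 1

print(counts)  # [4, 6, 2]
```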

9. Matplotlib

![image-8.png](attachment:image-8.png)

As before, you should start by importing the pyplot package that's inside matplotlib. Next, you can use the hist function. Let's open up its documentation. There's a bunch of arguments you can specify, but the first two here are the most important ones. x should be a list of values you want to build a histogram for. You can use the second argument, bins, to tell Python into how many bins the data should be divided. Based on this number, hist will automatically find appropriate boundaries for all bins, and calculate how many values are in each one. If you don't specify the bins argument, it will be 10 by default.

10. Matplotlib example

![image-9.png](attachment:image-9.png)

So to generate the histogram that you've seen before, let's start by building a list with the 12 values. Next, you simply call hist and pass this list as an input, so it's matched to the argument x. I also specified the bins argument to be 3, so that the values are divided in three bins. If you finally call the show function, you get a histogram. Histograms are really useful to give a bigger picture.
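
A minimal sketch of this example follows. The 12 values are assumptions picked to reproduce the 4, 6 and 2 counts from the earlier slides; the exact list used in the video may differ.

```python
import matplotlib.pyplot as plt

# Hypothetical values between 0 and 6 (12 in total).
values = [0, 0.6, 1.4, 1.6, 2.2, 2.5, 2.6, 3.2, 3.5, 3.9, 4.2, 6]

# hist() takes the data as its first argument (x) and the number of
# bins as the bins argument. It returns the counts per bin, the bin
# edges it chose, and the drawn bars.
counts, edges, bars = plt.hist(values, bins=3)

plt.show()
```

Here `counts` holds the number of values per bin and `edges` holds the four boundaries of the three bins, which is a convenient way to check what `hist` decided on your behalf.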

11. Population pyramid

![image-10.png](attachment:image-10.png)

As an example, have a look at this so-called population pyramid. The age distribution is shown, for both males and females, in the European Union. Notice that the histograms are flipped 90 degrees; the bins are horizontal now. The bins are largest for the ages 40 to 44, where there are 20 million males and 20 million females. They are the so-called baby boomers. These are figures of the year 2010. What do you think will have changed in 2050?

12. Population pyramid

![image-11.png](attachment:image-11.png)

Let's have a look. The distribution is flatter, and the baby boom generation has gotten older. In the blink of an eye, you can easily see how demographics will be changing over time. That's the true power of histograms at work here!

13. Let's practice!

Now head over to the exercises to experiment with histograms yourself!

**Exercise**

**Build a histogram (1)**

life_exp, the list containing data on the life expectancy for different countries in 2007, is available in your Python shell.

To see how life expectancy in different countries is distributed, let's create a histogram of life_exp.

matplotlib.pyplot is already available as plt.

**Instructions**

- Use plt.hist() to create a histogram of the values in life_exp. Do not specify the number of bins; matplotlib will set the number of bins to 10 by default for you.
- Add plt.show() to actually display the histogram. Can you tell which bin contains the most observations?

**Exercise**

**Build a histogram (2): bins**

In the previous exercise, you didn't specify the number of bins. By default, matplotlib sets the number of bins to 10 in that case. The number of bins is pretty important. Too few bins will oversimplify reality and won't show you the details. Too many bins will overcomplicate reality and won't show the bigger picture.

To control the number of bins to divide your data in, you can set the bins argument.

That's exactly what you'll do in this exercise. You'll be making two plots here. The code in the script already includes plt.show() and plt.clf() calls; plt.show() displays a plot; plt.clf() cleans it up again so you can start afresh.

As before, life_exp is available and matplotlib.pyplot is imported as plt.

**Instructions**

- Build a histogram of life_exp, with 5 bins. Can you tell which bin contains the most observations?
- Build another histogram of life_exp, this time with 20 bins. Is this better?
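
The two-plot pattern from this exercise might look like the sketch below. The `life_exp` list is a short hypothetical sample, not the real 2007 data, so the shapes of the histograms are only illustrative.

```python
import matplotlib.pyplot as plt

# Hypothetical life expectancies in years (placeholder values only).
life_exp = [43.8, 52.3, 56.7, 59.4, 62.1, 65.5, 68.0, 70.2, 71.9,
            72.4, 73.0, 74.2, 75.6, 76.4, 78.3, 79.4, 80.7, 82.6]

# First plot: 5 bins give a coarse overview.
coarse_counts, _, _ = plt.hist(life_exp, bins=5)
plt.show()
plt.clf()  # clear the figure so the next plot starts from scratch

# Second plot: 20 bins show more detail, but also more noise.
fine_counts, _, _ = plt.hist(life_exp, bins=20)
plt.show()
plt.clf()
```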

**Exercise**

**Build a histogram (3): compare**

In the video, you saw population pyramids for the present day and for the future. Because we were using a histogram, it was very easy to make a comparison.

Let's do a similar comparison. life_exp contains life expectancy data for different countries in 2007. You also have access to a second list now, life_exp1950, containing similar data for 1950. Can you make a histogram for both datasets?

You'll again be making two plots. The plt.show() and plt.clf() commands to render everything nicely are already included. Also matplotlib.pyplot is imported for you, as plt.

**Instructions**

- Build a histogram of life_exp with 15 bins.
- Build a histogram of life_exp1950, also with 15 bins. Is there a big difference with the histogram for the 2007 data?

**Exercise**

![image.png](attachment:image.png)

**Exercise**

![image.png](attachment:image.png)

### Customization

1. Customization

Wow, now you've got histograms under your datavis belt, let's figure out how to customize our plots. Creating a plot is one thing. Making the correct plot, that makes the message very clear -- that's the real challenge.

2. Data visualization

![image.png](attachment:image.png)

For each visualization, you have many options. First of all, there are the different plot types. And for each plot, you can do an infinite number of customizations. You can change colors, shapes, labels, axes, and so on. The choice depends on: one, the data, and two, the story you want to tell with this data. Since there are so many possible customizations, the best way to learn this is by example.

3. Basic plot

![image-2.png](attachment:image-2.png)

Let's start with the code in this script to build a simple line plot. It's similar to the line plot we've created in the first video, but this time the year and pop lists contain more data, including projections until the year 2100, forecasted by the United Nations. If we run this script, we already get a pretty nice plot: it shows that the population explosion that's going on will have slowed down by the end of the century. But some things can be improved. First, it should be clearer which data we are displaying, especially to people who are seeing the graph for the first time. And second, the plot really needs to draw the attention to the population explosion.

4. Axis labels

![image-3.png](attachment:image-3.png)

The first thing you always need to do is label your axes. Let's do this by adding the xlabel and ylabel functions. As inputs, we pass strings that should be placed alongside the axes. Make sure to call these functions before calling the show function, otherwise your customizations will not be displayed. If we run the script again,

5. Axis labels

![image-4.png](attachment:image-4.png)

this time the axes are annotated. Fantastic!

6. Title

![image-5.png](attachment:image-5.png)

We're also going to add a title to our plot, with the title function. We pass the actual title, 'World Population Projections', as an argument.

7. Title

![image-6.png](attachment:image-6.png)

And there's the title! So, using xlabel, ylabel and title, we can give the reader more information about the data on the plot: now they can at least tell what the plot is about. To put the population growth in perspective, I want to have the y-axis start from zero.
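
A short sketch of the labelling steps so far is given below. The title string comes from the video; the axis label strings 'Year' and 'Population' and the four data points are assumptions used for illustration.

```python
import matplotlib.pyplot as plt

# Illustrative data: world population in billions (approximate values).
year = [1950, 1970, 1990, 2010]
pop = [2.5, 3.7, 5.3, 7.0]

lines = plt.plot(year, pop)

# Label the axes and add a title. These calls must come before show(),
# otherwise the customizations are not displayed.
plt.xlabel('Year')
plt.ylabel('Population')
plt.title('World Population Projections')

plt.show()
```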

8. Ticks

![image-7.png](attachment:image-7.png)

You can do this with the yticks function. The first input is a list, in this example with the numbers zero up to ten, with intervals of 2.

9. Ticks

![image-8.png](attachment:image-8.png)

If we run this, the plot will change: the curve shifts up. Now it's clear that in 1950, there were already about 2.5 billion people on this planet.

10. Ticks (2)

![image-9.png](attachment:image-9.png)

Next, to make it clear we're talking about billions, we can add a second argument to the yticks function, which is a list with the display names of the ticks. This list should have the same length as the first list. The tick 0 gets the name 0, the tick 2 gets the name 2B, the tick 4 gets the name 4B and so on. By the way, B stands for Billions here. If we run this version of the script,

11. Ticks (2)

![image-10.png](attachment:image-10.png)

the labels will change accordingly. Awesome!
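
The two-argument form of yticks can be sketched like this. The tick positions and names follow the description above; the data points are rounded, illustrative values and the variable names `tick_positions` and `tick_names` are made up for this example.

```python
import matplotlib.pyplot as plt

# Illustrative data: world population in billions (approximate values).
year = [1950, 1970, 1990, 2010]
pop = [2.5, 3.7, 5.3, 7.0]

lines = plt.plot(year, pop)

# First list: where the ticks go (0 to 10 in steps of 2), which also
# makes the y-axis start at zero.
# Second list: the text shown at each tick; it must have the same length.
tick_positions = [0, 2, 4, 6, 8, 10]
tick_names = ['0', '2B', '4B', '6B', '8B', '10B']
plt.yticks(tick_positions, tick_names)

plt.show()
```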

12. Add historical data

![image-11.png](attachment:image-11.png)

Finally, let's add some more historical data to accentuate the population explosion in the last 60 years. On Wikipedia, I found the world population data for the years 1800, 1850 and 1900. I can write them in list form and join them to the front of the pop and year lists with the plus sign. If I now run the script once more,

13. Add historical data

![image-12.png](attachment:image-12.png)

three data points are added to the graph, giving a more complete picture.
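
Joining lists with the plus sign can be sketched as follows. All population values here are rough, rounded figures in billions used for illustration; they are assumptions and not the exact numbers from the video.

```python
import matplotlib.pyplot as plt

# Illustrative data: world population in billions (approximate values).
year = [1950, 1970, 1990, 2010]
pop = [2.5, 3.7, 5.3, 7.0]

# The + operator concatenates lists. Putting the historical values on
# the left keeps both lists in chronological order.
year = [1800, 1850, 1900] + year
pop = [1.0, 1.26, 1.65] + pop

lines = plt.plot(year, pop)
plt.show()
```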

14. Before vs. after

![image-13.png](attachment:image-13.png)

Now that's how you turn an average line plot into a visual that has a clear story to tell! Over to you now.

15. Let's practice!

Head over to the exercises, gradually customize the world development chart and become the next Hans Rosling!

**Exercise**

**Labels**

It's time to customize your own plot. This is the fun part: you will see your plot come to life!

You're going to work on the scatter plot with world development data: GDP per capita on the x-axis (logarithmic scale), life expectancy on the y-axis. The code for this plot is available in the script.

As a first step, let's add axis labels and a title to the plot. You can do this with the xlabel(), ylabel() and title() functions, available in matplotlib.pyplot. This sub-package is already imported as plt.

**Instructions**

- The strings xlab and ylab are already set for you. Use these variables to set the label of the x- and y-axis.
- The string title is also coded for you. Use it to add a title to the plot.
- After these customizations, finish the script with plt.show() to actually display the plot.

**Exercise**

**Ticks**

The customizations you've coded up to now are available in the script, in a more concise form.

In the video, Hugo has demonstrated how you could control the y-ticks by specifying two arguments:

- plt.yticks([0,1,2], ["one","two","three"])

In this example, the ticks corresponding to the numbers 0, 1 and 2 will be replaced by one, two and three, respectively.

Let's do a similar thing for the x-axis of your world development chart, with the xticks() function. The tick values 1000, 10000 and 100000 should be replaced by 1k, 10k and 100k. To this end, two lists have already been created for you: tick_val and tick_lab.

**Instructions**

- Use tick_val and tick_lab as inputs to the xticks() function to make the plot more readable.
- As usual, display the plot with plt.show() after you've added the customizations.
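
A possible sketch of this exercise is shown below. `tick_val` and `tick_lab` follow the values given above, while `gdp_cap` and `life_exp` are hypothetical placeholders rather than the real data.

```python
import matplotlib.pyplot as plt

# Hypothetical placeholder data (not the real Gapminder values).
gdp_cap = [500, 1500, 5000, 12000, 40000]
life_exp = [48.0, 59.5, 68.2, 74.1, 80.3]

tick_val = [1000, 10000, 100000]
tick_lab = ['1k', '10k', '100k']

points = plt.scatter(gdp_cap, life_exp)
plt.xscale('log')

# Set the tick positions and their display names. This is done after
# xscale(), because changing the scale resets the axis ticks.
plt.xticks(tick_val, tick_lab)

plt.show()
```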

**Exercise**

**Sizes**

Right now, the scatter plot is just a cloud of blue dots, indistinguishable from each other. Let's change this. Wouldn't it be nice if the size of the dots corresponds to the population?

To accomplish this, there is a list pop loaded in your workspace. It contains population numbers for each country expressed in millions. You can see that this list is added to the scatter method, as the argument s, for size.

**Instructions**

- Run the script to see how the plot changes.
- Looks good, but increasing the size of the bubbles will make things stand out more.
    - Import the numpy package as np.
    - Use np.array() to create a numpy array from the list pop. Call this NumPy array np_pop.
    - Double the values in np_pop by setting the value of np_pop equal to np_pop * 2. Because np_pop is a NumPy array, each array element will be doubled.
    - Change the s argument inside plt.scatter() to be np_pop instead of pop.
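
The steps in this list could be sketched as follows. The three data lists are hypothetical placeholders (populations in millions), not the real values loaded in the exercise.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical placeholder data (not the real Gapminder values).
gdp_cap = [500, 1500, 5000, 12000, 40000]
life_exp = [48.0, 59.5, 68.2, 74.1, 80.3]
pop = [31.9, 3.6, 33.3, 12.4, 40.3]  # populations in millions

# Convert the list to a NumPy array so that arithmetic is element-wise.
np_pop = np.array(pop)

# Each element is doubled. On a plain list, pop * 2 would instead
# repeat the list and give 10 elements.
np_pop = np_pop * 2

# s sets the marker size for each point.
points = plt.scatter(gdp_cap, life_exp, s=np_pop)
plt.xscale('log')
plt.show()
```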

![image.png](attachment:image.png)

![image.png](attachment:image.png)

**Exercise**

**Colors**

The code you've written up to now is available in the script.

The next step is making the plot more colorful! To do this, a list col has been created for you. It's a list with a color for each corresponding country, depending on the continent the country is part of.

How did we make the list col, you ask? The Gapminder data contains a list continent with the continent each country belongs to. A dictionary is constructed that maps continents onto colors:

    dict = {
        'Asia':'red',
        'Europe':'green',
        'Africa':'blue',
        'Americas':'yellow',
        'Oceania':'black'
    }

Nothing to worry about now; you will learn about dictionaries in the next chapter.

**Instructions**

- Add c = col to the arguments of the plt.scatter() function.
- Change the opacity of the bubbles by setting the alpha argument to 0.8 inside plt.scatter(). Alpha can be set from zero to one, where zero is totally transparent, and one is not at all transparent.
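
To make the construction of col concrete, here is a minimal sketch under stated assumptions. The mapping is the one shown above, but it is named `continent_colors` here to avoid shadowing Python's built-in `dict`; the `continent`, `gdp_cap`, `life_exp` and `pop` lists are hypothetical placeholders.

```python
import matplotlib.pyplot as plt

# Hypothetical placeholder data, one entry per country.
gdp_cap = [500, 1500, 5000, 12000, 40000]
life_exp = [48.0, 59.5, 68.2, 74.1, 80.3]
pop = [31.9, 3.6, 33.3, 12.4, 40.3]  # populations in millions
continent = ['Asia', 'Europe', 'Africa', 'Americas', 'Oceania']

# Mapping from continent to color, as described above.
continent_colors = {
    'Asia': 'red',
    'Europe': 'green',
    'Africa': 'blue',
    'Americas': 'yellow',
    'Oceania': 'black',
}

# Look up the color for each country's continent.
col = [continent_colors[c] for c in continent]

# c sets the color per point; alpha=0.8 makes the bubbles slightly
# transparent so that overlapping ones remain visible.
points = plt.scatter(gdp_cap, life_exp, s=[p * 2 for p in pop],
                     c=col, alpha=0.8)
plt.xscale('log')
plt.show()
```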

**Exercise**

**Additional Customizations**

If you have another look at the script, under # Additional Customizations, you'll see that there are two plt.text() functions now. They add the words "India" and "China" in the plot.

**Instructions**

- Add plt.grid(True) after the plt.text() calls so that gridlines are drawn on the plot.
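
A sketch of these last customizations is given below. The data points and the coordinates passed to `plt.text()` are placeholders chosen for illustration; they are not the real positions of India and China in the exercise's plot.

```python
import matplotlib.pyplot as plt

# Hypothetical placeholder data (not the real Gapminder values).
gdp_cap = [500, 1500, 5000, 12000, 40000]
life_exp = [48.0, 59.5, 68.2, 74.1, 80.3]

points = plt.scatter(gdp_cap, life_exp)
plt.xscale('log')

# text(x, y, s) places the string s at the data coordinates (x, y).
# The coordinates below are placeholders, not real country figures.
india = plt.text(1500, 59.5, 'India')
china = plt.text(5000, 68.2, 'China')

# Draw gridlines to make the values easier to read off.
plt.grid(True)

plt.show()
```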

**Exercise**

![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)