<a href="https://colab.research.google.com/github/HamidurRahman-GitH/github.io/blob/main/01_PythonIntro_(pre_class).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pre-class exercise: Introduction to Data Science
(Remixed from: [Data Science: A First Introduction--Python Edition](https://worksheets.python.datasciencebook.ca/))

Welcome to ***CRP 384: Regional Data Science for Mobility Justice***  

We'll sometimes have a pre-class assignment that you'll complete prior to meeting with the full class on Thursdays.

The idea behind these assignments is that you can't learn technical subjects without hands-on practice. These types of data exercises will be an important part of the course.

You are all adults. You are responsible for completing these exercises on your own.

Collaborating on these pre-class assignments is more than okay -- it's encouraged! You should rarely be stuck for more than a few minutes on questions in lecture or tutorial, so don't hesitate to ask a colleague or me for help (explaining things is beneficial, too -- the best way to solidify your knowledge of a subject is to explain it). You can also consult with AI-based LLMs, but I don't think it'll be that useful to have them just answer the question for you. Make sure you understand whatever solution they spit out.

### Learning objectives:

After completing this pre-class assignment, you will be able to:

* use a Jupyter notebook via Google Colab to execute provided Python code
* edit code and markdown cells in a Jupyter notebook
* create new code and markdown cells in a Jupyter notebook
* load the `pandas` package into Python
* create new variables and objects in Python
* use the help and documentation tools in Python
* perform the following operations from the `pandas` package:
    - read a standard .csv file using `read_csv`
    - subset rows and columns of a data frame using `loc[]`
    - select columns of a data frame using `[]`
    - create a new column of a data frame using `assign`

In this first notebook you will also learn how to test your answers to assess whether they're correct.

This worksheet covers parts of [the Introduction chapter](https://python.datasciencebook.ca/intro.html) of an online textbook called ***Data Science: A First Introduction (Python Edition)***. For most notebooks, I'll expect you to read the textbook chapters before completing the worksheet. I know that might not have been possible this time, so I've added a bit more context to assist as you work through the content.

## 1. Jupyter Notebooks
This webpage is called a "Jupyter notebook." A notebook is a place to write computer code for analysis, view the results of the analysis, as well as to narrate the analysis with rich formatted text.

### 1.1. Text Cells
In a notebook, each rectangle containing text or code is called a *cell*.

Text cells (like this one) can be edited by double-clicking on them. They're written in a simple format called [Markdown](http://daringfireball.net/projects/markdown/syntax) to add formatting and section headings.  You don't need to learn Markdown, but you might want to.

After you edit a text cell, confirm the changes by clicking out of the cell or hitting shift-enter. (Try not to delete the instructions of the lab.)

**Question 1.1.1**<br>
This paragraph is in its own text cell.  Try editing it so that all of the sentences following this one are deleted, then click out of the cell or hit shift-enter.  This sentence, for example, should be deleted.  So should this one.

### 1.2. Code Cells
Other cells contain code in the Python language. Running a code cell will execute all of the code it contains.

To run the code in a cell, first click on that cell to activate it. Next, either press Run ▶ or hold down the `shift` key and press `return` or `enter`.

Try running the next cell:

In [1]:
print("Hello, World!")

Hello, World!


The above code cell contains a single line of code, but cells can also contain multiple lines of code. When you run a cell, the lines of code are executed in the order in which they appear. Every `print` expression prints a line. Run the next cell and notice the order of the output.

In [None]:
print("First this line is printed,")
print("and then this one.")

**Question 1.2.1**
<br>

Change the cell above so that it prints out:

    First this line is printed,
    and then the next line,
    and then this one.

*Hint:* If you're stuck for more than a few minutes, ask someone. That's a good idea for any notebook problem.

### 1.3. Writing Jupyter Notebooks
You can use Jupyter notebooks for your own projects or documents.  When you make your own notebook, you'll need to create your own cells for text and code.

To add a cell, click the + code or + text button at the top of the window.

**Question 1.3.1**
<br>

Add a code cell below this one.  Write code in it that prints out:
   
    A whole new code cell!

Run your cell to verify that it works.

**Question 1.3.2**
<br>

Add a text/Markdown cell below this one. Write the text "A whole new Markdown cell" in it.

### 1.4. Comments
You will sometimes see lines like this in code cells:

    # Test cell; please do not change!

That is called a *comment*.  It doesn't make anything happen in Python; Python ignores anything on a line after a #.  Instead, it's there to communicate something about the code to you, the human reader.  Comments are extremely useful and can help increase how readable our code is.

<img src="http://imgs.xkcd.com/comics/future_self.png">

*Source: https://xkcd.com/1421/*

The below code cell contains comments (one at the start of a line, and one after some other code). Run the cell. You will see that everything after a comment symbol `#` is ignored by Python.

In [None]:
# you can use comments to document your code, or make Python ignore some code without deleting it entirely
# print("this is a commented line that Python will ignore. You won't see this text in the output!")

print("hello!") # you can also put comments at the end of a line of code

### 1.5. Errors
Python is a language, and like natural human languages, it has rules.  It differs from natural language in two important ways:
1. The rules are *simple*.  You can learn most of them in a few weeks and gain reasonable proficiency with the language in a semester.
2. The rules are *rigid*.  If you're proficient in a natural language, you can understand a non-proficient speaker, glossing over small mistakes.  A computer running Python code is not smart enough to do that.

Whenever you write code, you'll make mistakes (everyone who writes code does, even--and especially--me!).  When you run a code cell that has errors, Python will sometimes produce error messages to tell you what you did wrong.

Errors are okay; even experienced programmers make many errors.  When you make an error, you just have to find the source of the problem, fix it, and move on.

We have made an error in the next cell.  Uncomment the line below, run the cell and see what happens.

In [None]:
#print("This line is missing something."

There's a lot of terminology in programming languages, but you don't need to know it all in order to program effectively. If you see a cryptic message like this, you can often get by without deciphering it.  (Of course, if you're frustrated, ask someone for help.)

Try to fix the code above so that you can run the cell and see the intended message instead of an error.

### 1.6. Saving Your Work

It's important to save your work often so you don't lose your progress! When working in Google Colab, edits are saved automatically. When we transition to other platforms later, we'll have to take explicit steps to save our work.

## 2. Numbers
Quantitative information arises everywhere in data science. In addition to representing commands to print out lines, our Python code can represent numbers and methods of combining numbers. The expression `3.2500` evaluates to the number 3.25. (Run the cell and see.)

In [None]:
3.2500

Notice that we didn't have to write `print()`. When you run a notebook cell, Jupyter will helpfully print the last output for you. So in the below cell, the last statement just evaluates to `4`, so it prints `4` (remember -- each line in a cell is a separate line of code!)

In [None]:
2
3
4

If you want to print out results from earlier lines in the cell, you need to use the `print` function.

In [None]:
print(2)
print(3)
print(4)

### 2.1. Arithmetic
The line in the next cell subtracts.  Its value is what you'd expect.  Run it.

In [None]:
2.0 - 1.5

Same with the cell below. Run it.

In [None]:
2 * 2

Many basic arithmetic operations are built in to Python.  [This webpage](https://docs.python.org/3.9/library/operator.html) describes all the arithmetic operators used in the course.  You can refer back to this webpage as you need throughout the term.
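
As a quick reference, here is a small sketch of the arithmetic operators you are most likely to meet. The numbers and names are made up for illustration and are not taken from the linked page.

```python
# Common arithmetic operators, shown on two small example numbers.
total = 7 + 2             # addition
difference = 7 - 2        # subtraction
product = 7 * 2           # multiplication
quotient = 7 / 2          # division always gives a decimal number
floor_quotient = 7 // 2   # floor division drops the remainder
remainder = 7 % 2         # modulo keeps only the remainder
power = 7 ** 2            # exponentiation (7 squared)

print(total, difference, product, quotient, floor_quotient, remainder, power)
```

Running this prints `9 5 14 3.5 3 1 49`. Try changing the numbers to see how each result changes.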

## 3. Names
In natural language, we have terminology that lets us quickly reference very complicated concepts.  We don't say, "That's a large mammal with brown fur and sharp teeth!"  Instead, we just say, "Bear!"

Similarly, an effective strategy for writing code is to define names for data as we compute it, like a lawyer would define terms for complex ideas at the start of a legal document to simplify the rest of the writing.

In Python, we do this by creating named *objects* with an assignment statement: a name goes on the left side of an `=` sign and an expression to be evaluated goes on the right.

In [None]:
answer = 3 * 2 + 4

When you run that cell, Python first evaluates the first line.  It computes the value of the expression `3 * 2 + 4`, which is the number 10.  Then it gives that value the name `answer`.  At that point, the code in the cell is done running.

After you run that cell, the value 10 is bound to the name `answer`:

In [None]:
answer

We can name our objects anything we'd like. Above we called it `answer`, but we could have named it `value`, `data` or anything else we desired. A good rule of thumb is to name it something that has meaning to a human as it relates to what we are trying to accomplish with our Python code.

**Question 3.1**
<br>

Enter a new code cell. Try creating another object using `= 3 * 2 + 4` with a name different from `answer`.

A common pattern in Jupyter notebooks is to assign a value to a name and then immediately evaluate the name in the last line in the cell so that the value is displayed as output.

In [None]:
close_to_pi = 355 / 113
close_to_pi

Another common pattern is that a series of lines in a single cell will build up a complex computation in stages, naming the intermediate results.

In [None]:
bimonthly_salary = 840
monthly_salary = 2 * bimonthly_salary
number_of_months_in_a_year = 12
yearly_salary = number_of_months_in_a_year * monthly_salary
yearly_salary

When naming objects in Python there are some rules:
1. Names in Python can have letters (upper- and lower-case letters are both okay and count as different letters e.g. "Answer" and "answer" will be treated as different objects), underscores, and numbers.
2. The first character can't be a number (otherwise a name might look like a number).  
3. Names can't contain spaces, since spaces are used to separate pieces of code from each other. Instead, it is common to use an underscore character _ to replace each space.
4. Names can't contain other special characters such as -, +, #, $, %, ^ since some characters have special roles in Python. Take # for example, this character specifies a comment within written code.

Other than those rules, what you name something doesn't matter *to Python*.  For example, the next cell does the same thing as the above cell, except everything has a different name:

In [None]:
a = 840
b = 2 * a
c = 12
d = c * b
d

**However**, names are very important for making your code *readable* to yourself and others.  The cell above is shorter, but it's totally useless without an explanation of what it does.

There is also cultural style associated with different programming languages. In the modern Python style, object names should use only lowercase letters, numbers, and `_`. Underscores (`_`) are typically used to separate words within a name (*e.g.*, `answer_one`).
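
If you are ever unsure whether a name follows the naming rules listed earlier, one way to check is the string method `isidentifier`. The sketch below uses a handful of made-up candidate names; it only checks the rules, not whether a name is good style.

```python
# Candidate names to check against the naming rules (all hypothetical examples).
candidates = ["answer_one", "Answer", "2nd_answer", "my answer", "cost$", "_private"]

# Store True/False for each candidate so we can look at all the results together.
results = {}
for name in candidates:
    # str.isidentifier() reports whether the text is a valid Python name.
    results[name] = name.isidentifier()
    print(name, "->", results[name])
```

The names that start with a number, contain a space, or contain `$` come back as `False`; the others come back as `True`.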

**Question 3.2** <br>

Assign the name `seconds_in_an_hour` to the number of seconds in an hour. You should do this in two steps. First, calculate the number of seconds in a minute and assign that number the name `seconds_in_a_minute`. Next, calculate the number of seconds in an hour and assign that number the name `seconds_in_an_hour`.

*Hint: there are 60 seconds in a minute and 60 minutes in an hour.*

In [None]:
# your code here

# I've put this line in this cell so that it will print
# the value you've given to seconds_in_an_hour when you
# run it.  You don't need to change this.
seconds_in_an_hour

## 4. Calling Functions/Methods and Attributes

The most common way to combine or manipulate values in Python is by calling functions or methods. Python comes with many built-in functions and methods that perform common operations. You can think of functions and methods as verbs that do things, and of objects as nouns: the entities those verbs act on.

In the module, we explored examples of functions and methods such as `print`. Here, we'll demonstrate using another method `upper` that converts text to uppercase:

In [None]:
greeting = "Why, hello there!".upper()
greeting

> The `upper` method we used here is different from the functions we used previously (e.g. `print`). A method is called using dot notation (`string.upper()`) because it only works for the particular kind of object it was designed for. Here, the `upper` method was written to work only with string objects. The `print` function, however, was written to work with many kinds of objects, so we don't use dot notation to call it.
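
To make the difference concrete, here is a minimal sketch that calls one built-in function (`len`) and one string method (`upper`) on the same text. The name `greeting_text` is just an illustrative choice, and `hasattr` is a built-in function used here only to check whether an object has a given method.

```python
greeting_text = "Why, hello there!"

# A function is called on its own, with the object passed in as an argument.
length = len(greeting_text)

# A method is called with dot notation on the object it belongs to.
shouted = greeting_text.upper()

# Numbers have no upper method, because upper was written for strings only.
number_has_upper = hasattr(42, "upper")

print(length)
print(shouted)
print(number_has_upper)
```

The last line prints `False`, which is why `(42).upper()` would raise an error while `print(42)` works fine.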

**Question 4.0** <br>

Use the method `lower` to change all the words in the following movie title to lower case text: "The House with a Clock in Its Walls" and assign the lower case text the name `title`.

In [None]:
# your code here

title

### 4.1. Multiple Arguments
Some functions take multiple arguments, separated by commas. For example, the built-in `max` function returns the maximum argument passed to it.

In [None]:
biggest = max(2, 15, 4, 7)
biggest

**Question 4.1** <br>

Use the `min` function to find the minimum value of the numbers in the cell above.

Assign the value to an object called `smallest`.

In [None]:
# your code here

smallest

## 5. Packages
Python has many built-in functions, but we can also use functions that are stored within packages created by other Python users. We are going to use a package, called `pandas`, to load, modify and plot data.
This package has already been installed for you. Later in the course you will learn how to install packages so you are free to bring in other tools as you need them for your data analysis.

To use the functions from a package you first need to load it using the `import` statement. This needs to be done once per notebook (and a good rule of thumb is to do this at the very top of your notebook so it is easy to see what packages your Python code depends on).

Here we also give `pandas` a nickname of `pd`, formally called an alias. This lets us refer to the pandas package more efficiently by just typing `pd` instead of `pandas`. Referring to packages with aliases is very common in Python and you will see us do this with many of the packages we use in this course.

In [None]:
import pandas as pd

**Question 5.1** <br>

Use the `import` statement to load the `numpy` Python package as `np`.

In [None]:
import sys

# your code here

## 6. Looking for Help

No one, not even experienced professional programmers, remembers what every function does or every possible function argument/option. So both experienced and new programmers (like you!) need to look things up, A LOT!

### 6.1. Help Files
One of the most efficient places to look for help on how a function works is the Python documentation. Let’s say we wanted to pull up the documentation for the `read_csv` function in pandas. We can do this by typing the `?` character followed by the name we want more information about.

Run the cell below to find out more about the `read_csv` function from the `pandas` package.

In [None]:
?pd.read_csv

Google Colab doesn't format the help output very nicely. In this case you can also seek out the online help for the function, located in the overall [pandas documentation here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv).

At the very top of the output, you will see the function itself and its arguments. Next is a description of what the function does. The bottom of the output specifies the package it is in (in this case, it is pandas). You’ll find that the most helpful sections are “Parameters” and “Examples”.

- The **Signature** at the top gives you an idea of how you would use the function when coding: what the syntax would be and how the function itself is structured.
- **Parameters** tells you the different arguments that can be passed to the function to adjust how it works. Often the “Parameters” section doesn’t provide you with step-by-step instructions, because there are so many different ways that a person can incorporate a function into their code. Instead, it provides a general understanding of what the function can do and which arguments are available. At the end of the day, the user must interpret the help file and figure out how best to use the function and which parameters are most important for their particular task.
- **Returns** explains what to expect as an output.
- The **Examples** section is often the most useful part of the help file as it shows how a function could be used with real data. It provides a skeleton code that the users can work off of.
- Sometimes there is a **See Also** section which may suggest similar functions that could be of use to the user.
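
The `?` shortcut is a notebook feature. As a rough sketch of how to reach the same information with plain Python tools, the cell below reads a docstring directly and uses the standard `inspect` module to list a function's parameter names. The variable names are illustrative, and the exact parameter list depends on your installed pandas version.

```python
import inspect
import pandas as pd

# help(max) would print the full help text; here we look at just the first docstring line.
max_doc = max.__doc__
print(max_doc.splitlines()[0])

# inspect.signature lists a function's parameters without printing the whole help page.
read_csv_params = list(inspect.signature(pd.read_csv).parameters)
print(read_csv_params[:3])
```

Calling `help(pd.read_csv)` prints the same documentation that `?pd.read_csv` shows.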

Beyond the Python help files there are many resources that you can use to find help. [Stack Overflow](https://stackoverflow.com/), an online forum, is a great place to go and ask questions such as how to perform a complicated task in Python or why a specific error message is popping up. Oftentimes, a previous user will have already asked your question of interest and received helpful advice from fellow Python users.

**Question 6.1** Multiple Choice:
<br>

Use `?pd.read_csv` or the online pandas help to answer the multiple choice question below. To answer the question, assign the letter associated with the correct answer to a variable in the code cell below:

Which statement below is accurate?

A. `pd.read_csv` is useful for reading a comma-separated values (csv) file into a DataFrame.

B. It can accept a possible parameter of `warnings=True`.

C. The parameter `delimiter` is an alias for the parameter `squeeze`.

D. `pd.read_csv` is perfect for reading a table of fixed-width formatted lines into DataFrame.

*Assign your answer to an object called `answer6_1`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here



## 7. Pandas Functions

Now that we have learned a little about Jupyter notebooks and Python, let's load a real dataset into Python and explore it. As we do this we will learn more about key data loading, wrangling and visualization functions in Python.

### Exercise: Data about Runners!
Vickers and Vertosick performed [a study in 2016](https://bmcsportsscimedrehabil.biomedcentral.com/articles/10.1186/s13102-016-0052-y) that aimed to identify what factors had a relationship with race performance of recreational runners so that they could better predict future 5 km, 10 km and marathon race times for individual runners. Such predictions (and knowing what drives these predictions) can help runners by suggesting changes they could make to modifiable factors, such as training, to help them improve race time. Unmodifiable factors that contribute to the prediction, such as age or sex, allow for fair comparisons to be made between different runners.

Vickers and Vertosick reasoned that their study is important because all previous research done to predict race times has focused on data from elite athletes. This biased data set means that the predictions generated from it do not necessarily do a good job predicting race times for recreational runners (whose data was not in the dataset that was used to create the model that generates the predictions). Additionally, previous research focused on reporting/measuring factors that require special expertise or equipment that are not freely available to recreational runners. This means that recreational runners may not be able to put their characteristics/measurements for these factors in the race time prediction models and so they will not be able to obtain an accurate prediction, or a prediction at all (in the case of some models).

To make a better model, Vickers and Vertosick performed a large survey. They put their survey on the news website [Slate.com](https://slate.com/) attached to a news story about race time prediction. They were able to obtain 2,497 responses. The survey included questions that allowed them to collect a data set that included:
- age,
- sex,
- body mass index (BMI),
- whether they are an endurance runner or speed demon,
- what type of shoes they wear,
- what type of training they do,
- race time for 2-3 races they completed in the last 6 months,
- self-rated fitness for each race,
- and race difficulty for each race.


Let's now use this data to explore a question we might be interested in: is there a relationship between 10 km race time and body mass index (BMI) for male runners in this data set? This is an exploratory data analysis question because we are looking for a relationship between measurements within the single data set we have and are not yet interested in interpreting beyond it. We can answer this question by visualizing the data as a scatter plot using Python.

If, however, we were aiming to extend our findings to a broader population, make predictions, or analyze causes or mechanisms, we would need to state a different data analysis question and follow up with different analytical methods to answer that question.

To answer our exploratory question (is there a relationship between 10 km race time and body mass index (BMI) for male runners in this data set?), we will need to do the following things in Python:

1. load the data set into Python
2. subset the data we are interested in visualizing from the loaded dataset
3. create a new column to get the unit of time in minutes instead of seconds
4. create a scatter plot using this modified data

> *Note 1 - subsetting the data and converting from seconds to minutes is not absolutely required to answer our question, but it will give us practice manipulating data in Python, and make our data tables and figures more readable.*
>
> *Note 2 - many historical datasets treated sex as a variable where the possible values are only binary: male or female. This representation in this question reflects how the data were historically collected and is not meant to imply that we believe that sex is binary.*

### 7.1. Reading Data

Let's get started with our first step - loading the data set. The data set we are loading is called `marathon_small.csv` and it contains a subset of the data from the study described above. The file is in the same directory/folder as the file for this notebook. It is a comma separated file (meaning the columns are separated by the `,` character). We often refer to these files as `.csv`'s.


```
age,bmi,km5_time_seconds,km10_time_seconds,sex
25.0,21.6221160888672,NA,2798,female
41.0,23.905969619751,1210.0,NA,male
25.0,21.6407279968262,994.0,NA,male
35.0,23.5923233032227,1075.0,2135,male
34.0,22.7064037322998,1186.0,NA,male
45.0,42.0875434875488,3240.0,NA,female
33.0,22.5182952880859,1292.0,NA,male
58.0,25.2340793609619,NA,3420,male
29.0,24.505407333374,1440.0,3240,male
```

We can use the `pd.read_csv` function from the `pandas` package to do this.

*Note - the quotes around the filename are important and you will get an error if you forget them.*
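
To see what `pd.read_csv` does with this kind of text before you load the real file, here is a small sketch that reads three of the rows shown above from a string instead of a file. The names `csv_text` and `preview` are illustrative, and `io.StringIO` is a standard-library tool that lets a string stand in for a file.

```python
import io
import pandas as pd

# A few rows copied from the preview above, held in a string instead of a file.
csv_text = """age,bmi,km5_time_seconds,km10_time_seconds,sex
25.0,21.6221160888672,NA,2798,female
41.0,23.905969619751,1210.0,NA,male
35.0,23.5923233032227,1075.0,2135,male
"""

# io.StringIO lets read_csv treat the string as if it were a file.
preview = pd.read_csv(io.StringIO(csv_text))

print(preview)
print(preview.isna().sum())  # counts the missing values in each column
```

Notice that the `NA` entries are read in as missing values (displayed as `NaN`) rather than as text.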

**Question 7.1.1** <br>

Download the `marathon_small.csv` file using [this link](https://drive.google.com/file/d/1zWAwZHXpKRokDRemE4DEZ7V38muS4vEJ/view?usp=sharing). Add it to your workspace in Google Colab and then use the `pd.read_csv` function from the `pandas` package to load the data from the `marathon_small.csv` file into Python. Save the data to an object called `marathon_small`. If you need additional help, try `?pd.read_csv` and/or ask your neighbours or the Instructional team for help.

In [None]:
import pandas as pd

# your code here
marathon_small = pd.read_csv("marathon_small.csv")

marathon_small

### 7.2. Data frames

The functions from the `pandas` package give us a data frame and we can look at the structure of a data frame by simply writing its name to view the output.

In [None]:
marathon_small

This returns the first 5 and last 5 rows of the data frame, and hides the middle rows with an ellipsis (`...`).

By default, `pd.read_csv` treats the first row of the file as the **header** and uses it to label the columns. Therefore, the first row contains descriptive names while the rows below contain the actual data. The bolded column on the left without a header is called the index. For now, you can think of it as the row numbers of the data frame.

This only shows us a small portion of the data set. You can look at more of the data set by using the `head` method to specify the number of rows you want to print.

In [None]:
marathon_small.head(50)

This shows us the first 50 rows of the data set. We could look at even more of the data by increasing the `n` argument, but printing many rows produces very long output that is rarely necessary.
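
Here is a brief sketch of how the `n` argument behaves, using a made-up 100-row data frame called `numbers` rather than the marathon data. It also uses `tail`, which is the counterpart of `head` for the end of a data frame.

```python
import pandas as pd

# A hypothetical data frame with 100 rows, just for illustration.
numbers = pd.DataFrame({"value": range(100), "squared": [n ** 2 for n in range(100)]})

first_rows = numbers.head()      # no argument: the first 5 rows
first_ten = numbers.head(n=10)   # the first 10 rows
last_three = numbers.tail(3)     # tail works the same way, from the end

print(len(first_rows), len(first_ten), len(last_three))
```

This prints `5 10 3`, the number of rows in each result.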

**Question 7.2.1** <br>
To know how many rows and columns there are, use the attribute `shape` (an attribute is accessed with dot notation but without parentheses). Assign the number of rows and columns to the object `rows_and_columns`.

In [None]:
# your code here

print(rows_and_columns)

### 7.3. Obtaining a subset of rows OR columns with `[]`

One of the most common operations on a data frame is to *filter* its rows (observations) to keep only specific rows based on their entries in one or more columns. To do this we can use the `[]` operation on a `pandas` data frame.

For example, if we had a data frame (named `data`) that looked like this:

```
  colour size speed
1    red   15  12.3
2   blue   19  34.1
3   blue   20  23.2
4    red   22  21.9
5   blue   12  33.6
6   blue   23  28.8
```

We could use the first line of the code in the image below to filter for rows where the colour has the value of "blue". The second line of code below would let us filter for rows where the size has a value greater than 20.

![python snippet](https://drive.google.com/uc?id=1MJhV96jhnNDWSS8OYAn3EbS97Uy7H3jX)
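
In case the image does not load, the sketch below rebuilds the small `data` frame from the table by hand and applies both filters. It assumes the image uses the usual `data[condition]` pattern; the names `blue_rows` and `large_rows` are illustrative.

```python
import pandas as pd

# Rebuild the example table by hand.
data = pd.DataFrame(
    {
        "colour": ["red", "blue", "blue", "red", "blue", "blue"],
        "size": [15, 19, 20, 22, 12, 23],
        "speed": [12.3, 34.1, 23.2, 21.9, 33.6, 28.8],
    },
    index=range(1, 7),  # row labels 1 to 6, matching the table
)

# Keep only the rows where colour has the value "blue".
blue_rows = data[data["colour"] == "blue"]

# Keep only the rows where size is greater than 20.
large_rows = data[data["size"] > 20]

print(blue_rows)
print(large_rows)
```

The expression inside the brackets, such as `data["size"] > 20`, produces one `True` or `False` per row, and `[]` keeps the rows marked `True`.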

**Question 7.3.1** <br>

Use the `[]` operation to subset your data frame `marathon_small` so it only contains survey data from males. Assign your new filtered data frame to an object called `marathon_filtered_rows`.

In [None]:
# your code here

marathon_filtered_rows

**Question 7.3.2** <br>
The `[]` operation can also be used to subset columns via the syntax `data[['column1', 'column2']]`. Use the `[]` operation to subset your data frame `marathon_small` so it only contains the columns "bmi" and "km10_time_seconds". Assign your new filtered data frame to an object called `marathon_filtered_columns`.

In [None]:
# your code here

marathon_filtered_columns

### 7.4. Obtaining a subset of rows AND columns with `loc[]`

The `[]` operation is only used when you want to either filter rows **or** select columns; it cannot be used to do both operations at the same time. This is where `loc[]` comes in. When we use `loc` to select rows and columns by label in a data frame, we always specify the row condition first, and then the list of columns we want: `data.loc[data['column1'] == row_condition, ['column1', 'column2']]`.
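
As a sketch of this pattern on the small colour/size/speed example from the previous section (rebuilt here so the cell runs on its own), the code below keeps the rows with a size above 15 and only two of the columns. The name `subset` is illustrative.

```python
import pandas as pd

# The same hypothetical example table as before.
data = pd.DataFrame(
    {
        "colour": ["red", "blue", "blue", "red", "blue", "blue"],
        "size": [15, 19, 20, 22, 12, 23],
        "speed": [12.3, 34.1, 23.2, 21.9, 33.6, 28.8],
    }
)

# Row condition first, then the list of columns to keep, separated by a comma.
subset = data.loc[data["size"] > 15, ["colour", "speed"]]

print(subset)
```

The columns appear in the order you list them, so `["colour", "speed"]` and `["speed", "colour"]` give differently ordered results.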

**Question 7.4.1** <br>

Use `loc` to keep only the male runners and the columns `bmi` and `km10_time_seconds` from `marathon_small`, i.e. perform both the steps from the previous two questions in a single operation. Assign your new filtered data frame to an object called `marathon_male`.

*Make sure you select `bmi` first and then `km10_time_seconds`*!

In [None]:
# your code here

marathon_male

**Question 7.4.2** <br>

What are the units of the time taken to complete a run of 10 km? Assign your answer to an object called `answer7_4_2`. Write your answer in lower case. Place your answer between quotation marks.


*Hint: scroll up and look at the introduction to this exercise.*

In [None]:
# your code here


**Question 7.4.3**
<br>

What are the units for time (e.g., seconds, minutes, hours) that we would like to use when plotting BMI against time taken to run 10 km? Assign your answer to an object called `answer7_4_3`. Write your answer in lower case. Place your answer between quotation marks.

*Hint: scroll up and look at the introduction to this exercise.*

In [None]:
# your code here


### 7.5. Assign

The method `assign` is used to add columns to a dataset, typically by making use of existing columns to compute a new column.

![python snippet](https://drive.google.com/uc?id=1_s5D19D9mFxb0xm-bUaOh-PayESAnSN1)

In the example above, we are creating a new column named `new_column` that is equal to `old_column * 10` and saving the results to an object called `data_mutated`.
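
As a runnable sketch of the same idea, the cell below builds a made-up three-row `data` frame and adds a column with `assign`. The exact code in the image is not reproduced here, so treat the details as an assumption.

```python
import pandas as pd

# A hypothetical data frame with a single column.
data = pd.DataFrame({"old_column": [1, 2, 3]})

# assign returns a new data frame with the extra column; data itself is unchanged.
data_mutated = data.assign(new_column=data["old_column"] * 10)

print(data_mutated)
print(data.columns.tolist())  # the original still has only old_column
```

Because `assign` returns a new data frame instead of changing the original, you need to save its result to a name to keep it.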

**Question 7.5.1**<br>

Add a new column to our `marathon_male` dataset called `km10_time_minutes` that is equal to `km10_time_seconds / 60`. Assign your answer to an object called `marathon_minutes`.

In [None]:
# your code here

marathon_minutes

### 7.6. Visualization
`Altair` is a powerful visualization package for Python. The fundamental object in `Altair` is the `Chart`, which takes a data frame as a single argument: `alt.Chart(dataframe)`. With a chart object in hand, we can specify how we would like the data to be visualized. We first indicate what kind of graphical mark we want to use to represent the data by calling one of the `Chart.mark_*` methods. The `encode` method then builds a mapping between visual encoding channels (such as x, y, color, shape, size, etc.) and columns in the dataset.

![Image Description](https://drive.google.com/uc?id=1rCgEut4HnoFZp1koVprlknVE2vU8dUh-)


Let's plot a scatterplot with the `bmi` on the x axis and `km10_time_minutes` on the y axis.

Before we start plotting using `Altair`, we need to import the package. You'll see we give it the alias `alt`.

In [None]:
import altair as alt

In [None]:
# Run this cell to create a scatterplot of BMI against the time it took to run 10 km.

plot = alt.Chart(marathon_minutes).mark_point().encode(
    x="bmi",
    y="km10_time_minutes"
)
plot

**Question 7.6.1** Multiple Choice
<br>

Looking at the graph above, choose the statement below that best reflects what we see.

A. There appears to be no relationship between 10 km run time and body mass index. As the value for body mass index increases we see neither an increase nor decrease in the time it takes to run 10 km.

B. There may be a positive relationship between 10 km run time and body mass index. As the value for body mass index increases, so does the time it takes to run 10 km.

C. There may be a negative relationship between 10 km run time and body mass index. As the value for body mass index increases, the time it takes to run 10 km decreases.




*Assign your answer to an object called `answer7_6_1`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here


The visualization code above barely scratches the surface of what `Altair`, and Python as a whole, are capable of. Not only are there far more choices about the kinds of plots available, but there are many, many options for customizing the look and feel of each graph. You can choose the font, the font size, the colors, the style of the axes, etc.

Let’s dig a little deeper into just a couple of options that you can add to any of your graphs to make them look a little better. For example, you can change the text of the x-axis label or the y-axis label by calling `.title()` on `alt.X()` or `alt.Y()` inside `encode`. You can also change the font size using the `configure_axis` method. Let’s do that for the scatterplot to make the labels easier to read.

In [None]:
# Run this cell.
# You can replace the axes with whatever you wish to label.
# After running the cell once, try changing the axes to something else.

marathon_plot = alt.Chart(marathon_minutes).mark_point().encode(
    x=alt.X("bmi").title("Body Mass Index"),
    y=alt.Y("km10_time_minutes").title("10 km run time (minutes)"),
).configure_axis(
    labelFontSize=12,
    titleFontSize=12,
)
marathon_plot

## Attributions
- UC Berkeley [Data 8 Public Materials](https://github.com/data-8/data8assets)
- UBC [Key Capabilities in Data Science Programming in Python course](https://github.com/UBC-MDS/prog-python-data-science-students)