# Python beginners course - Level 2 - Pandas

The *pandas* package is the most important tool for Data Scientists and Analysts working in Python today. The powerful machine learning and glamorous visualization tools may get all the attention, but pandas is the backbone of most data projects. 

>\[*pandas*\] is derived from the term "**pan**el **da**ta", an econometrics term for data sets that include observations over multiple time periods for the same individuals. — [Wikipedia](https://en.wikipedia.org/wiki/Pandas_%28software%29)

If you're thinking about data science as a career, then it is imperative that one of the first things you do is learn pandas. In this course, we will go over the essentials of pandas.

## 1. What's Pandas for?

Pandas has so many uses that it might make sense to list the things it can't do instead of what it can do. It is essentially your data’s home. Through pandas, you get acquainted with your data by cleaning, transforming, and analyzing it. 

For example, if you want to explore a dataset stored in a CSV or Excel file on your computer, pandas will load the data from that file into what it calls a **DataFrame** and then let you do things like:

- Calculate statistics and answer questions about the data, like:


    - What's the average, median, max, or min of each column? 
    - Does column A correlate with column B?
    - What does the distribution of data in column C look like?


- Clean the data by doing things like removing missing values and filtering rows or columns by some criteria.


- Visualize the data with help from Matplotlib (see notebook 3). Plot bars, lines, histograms, bubbles, and more. 


- Store the cleaned, transformed data back into a CSV or Excel file for later use.



### How does pandas fit into the data science toolkit?

Not only is the pandas library a central component of the data science toolkit but it is used in conjunction with other libraries in that collection. 

Pandas is built on top of the **NumPy** package (see notebook level 1), meaning a lot of the structure of NumPy is used or replicated in Pandas. Data in pandas is often used to create visualisations in **Matplotlib** (see notebook level 3) and run machine learning algorithms in **Scikit-learn** (see notebook level 4).
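As a rough illustration of that relationship, the sketch below converts a pandas column into a NumPy array and applies a NumPy function to a column. The small DataFrame and its column names `a` and `b` are made up for this example.

```python
import numpy as np
import pandas as pd

# A tiny, made-up DataFrame (values are illustrative only)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

# A single column can be converted to a plain NumPy array
col_array = df['a'].to_numpy()
print(type(col_array))

# NumPy functions can be applied directly to a pandas column
print(np.sqrt(df['b']))
```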

### IMPORTANT: BEFORE STARTING, RUN THE CELL BELOW
Just like the previous notebook, we should always start by loading (importing) the packages. Run the cell below before you do anything else!

In [20]:
#import the pandas packages as 'pd'
import pandas as pd

### Core components of pandas: Series and DataFrames

The primary two components of pandas are the `Series` and `DataFrame`. 

A `Series` is essentially a column, and a `DataFrame` is a table made up of a collection of Series. 

<img src="../assets/series-and-dataframe.png" width=600px />

DataFrames and Series are quite similar in that many operations that you can do with one you can do with the other, such as filling in null values and calculating the mean.

You'll see how these components work when we start working with data below. 
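To make the relationship concrete, here is a minimal sketch that builds two Series and combines them into a DataFrame. The names `apples`, `oranges` and `fruit` are chosen for illustration, and `pd.concat` is just one of several ways to combine Series.

```python
import pandas as pd

# Two Series: each one is a single named column of values
apples = pd.Series([3, 2, 0, 1], name='apples')
oranges = pd.Series([0, 3, 7, 2], name='oranges')

# Placing the Series side by side (axis=1) gives a DataFrame
fruit = pd.concat([apples, oranges], axis=1)

print(type(fruit))             # the table is a DataFrame
print(type(fruit['apples']))   # one column of it is a Series

# Many methods, such as mean(), exist on both
print(apples.mean())   # a single number for a Series
print(fruit.mean())    # one number per column for a DataFrame
```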

## 2. Creating DataFrames from scratch
Creating DataFrames directly in Python is useful for practice and for testing new methods and functions.

There are *many* ways to create a DataFrame from scratch, but a great option is to just use a simple `dict` (dictionary). A dictionary is a mapping of unique keys to values.

Let's say we have a fruit stand that sells apples and oranges. We want to have a column for each fruit and a row for each customer purchase. To organize this as a dictionary for pandas we could do something like:

In [6]:
data = {
    'names': ['June', 'Robert', 'Lily', 'David'],
    'apples': [3, 2, 0, 1], 
    'oranges': [0, 3, 7, 2]
}

data

{'names': ['June', 'Robert', 'Lily', 'David'],
 'apples': [3, 2, 0, 1],
 'oranges': [0, 3, 7, 2]}

In the example above, *names*, *apples* and *oranges* are called the **keys** of the dictionary, while the lists on the right-hand side are called the **values** of the dictionary.

Now, we convert the dictionary into a DataFrame called *purchases*:

In [5]:
purchases = pd.DataFrame(data)

purchases

Unnamed: 0,names,apples,oranges
0,June,3,0
1,Robert,2,3
2,Lily,0,7
3,David,1,2


**How did that work?**

Each *(key, value)* item in the dictionary corresponds to a *column* in the resulting DataFrame.

In this case, the **index** of this DataFrame was assigned automatically by pandas as the numbers 0 to 3. However, we can also choose our own index, for example by promoting one of the columns to be the index.

Let's have customer names as our index: 

In [7]:
purchases = pd.DataFrame(data)
purchases = purchases.set_index('names')

purchases

names,apples,oranges
June,3,0
Robert,2,3
Lily,0,7
David,1,2


In practice, you'll find that most CSVs won't have a column suitable to be used as an index. That is no problem: in that case the numbers that pandas assigns by default (starting at 0) are just fine to work with.

Setting the *names* column as the index of the DataFrame allows us to **loc**ate a customer's order by using their name:

In [8]:
purchases.loc['Robert']

apples     2
oranges    3
Name: Robert, dtype: int64

There's more on locating and extracting data from the DataFrame later, but now you should be able to create a DataFrame with any random data to practice on.
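As an aside, the index can also be supplied when the DataFrame is created, using the `index` argument of `pd.DataFrame`. The sketch below assumes the same fruit stand data, with the customer names passed as the index instead of as a column.

```python
import pandas as pd

# Same fruit stand data, but without a 'names' column
data = {
    'apples': [3, 2, 0, 1],
    'oranges': [0, 3, 7, 2]
}

# Pass the customer names as the index at construction time
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

# Rows can now be located by customer name
print(purchases.loc['Lily'])
```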

Let's move on to some quick methods for creating DataFrames from various other sources.

### Exercise 1
Above we have seen how to manually create a DataFrame. Now let's try it ourselves! Complete the steps below by filling in the ___.

#### Step 1
Create a DataFrame of the following table:

| Account holder | Account number | Balance |
| --- | --- | --- |
| Kenneth Effe | 3421 1242 3232 5324 | 232.2 |
| James Chang | 7583 0274 7537 9613 | 405.37 |
| Ralph Mallets | 8534 7319 4697 1254 | 342.65 |

In [None]:
#step 1
data = {
    ___
}

bankaccounts = pd.DataFrame(___)

#### Step 2
Set the first column as the index column.

In [None]:
#step 2
bankaccounts = bankaccounts.___

#### Step 3
Locate Ralph's account number and balance.

In [None]:
#step 3
located_account = bankaccounts.loc___

print(located_account)

#### Step 4 (challenge / optional)
Calculate the sum and the average of all balances. Hint: use the `.sum()` and `.mean()` methods.

In [None]:
#step 4
total = bankaccounts_________

## 3. How to read in data

Manually typing out data is fine for practice, but in real projects we usually read data in from files such as spreadsheets. In the following examples we'll keep using our apples and oranges data, but this time it comes from a CSV file. With CSV files (plain-text files that hold a table, much like a simple spreadsheet) all you need is a single line to load the data:

In [10]:
purchases = pd.read_csv('../assets/purchases.csv')

purchases

Unnamed: 0,names,apples,oranges
0,June,3,0
1,Robert,2,3
2,Lily,0,7
3,David,1,2


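Again we want the names to be used as the index. Note that `set_index` returns a new DataFrame rather than changing the existing one, so we assign the result back to `df`:

In [13]:
df = pd.read_csv('../assets/purchases.csv')
df = df.set_index('names')

df

names,apples,oranges
June,3,0
Robert,2,3
Lily,0,7
David,1,2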
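The loading step can also be tried without a file on disk. The sketch below is an illustration under stated assumptions: the CSV text is typed inline and wrapped in `io.StringIO` so that pandas can read it like a file, and `index_col` is used to set the index while reading. It also shows `to_csv`, which writes a DataFrame back out as CSV.

```python
import io
import pandas as pd

# Inline CSV text standing in for a file (made up to match the fruit stand example)
csv_text = "names,apples,oranges\nJune,3,0\nRobert,2,3\nLily,0,7\nDavid,1,2\n"

# StringIO lets pandas read the text as if it were a file;
# index_col tells read_csv which column to use as the index
purchases = pd.read_csv(io.StringIO(csv_text), index_col='names')
print(purchases)

# Without a file path, to_csv() returns the CSV content as a string;
# with a path, e.g. purchases.to_csv('some_file.csv'), it writes a file instead
csv_out = purchases.to_csv()
print(csv_out)
```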


## 4. Exploring your data in pandas
Now that we have a basic idea of how to load data into DataFrames, let's move on to importing some real-world data and detailing a few of the operations you'll be using a lot.

DataFrames possess hundreds of methods and other operations that are crucial to any analysis. As a beginner, you should know the operations that perform simple transformations of your data and those that provide fundamental statistical analysis.

In this training, we use the Boston Housing Dataset. This dataset contains the median value of owner-occupied homes (column MEDV) in various towns in and around Boston. In addition, the dataset provides extra information such as the per capita crime rate of the town (column CRIM), the proportion of non-retail business acres in the town (INDUS), the proportion of owner-occupied homes built before 1940 (AGE), and many others. The data was originally published by D. Harrison and D.L. Rubinfeld in "Hedonic prices and the demand for clean air", J. Environ. Economics & Management, vol. 5, 81-102, 1978.

The dataset is relatively small in size with only 506 cases and has the following columns (**you don't have to remember this**):

| Column        | Represents                                      |
| :------------- |------------------------------------------------ | 
| CRIM | per capita crime rate by town |
| ZN | proportion of residential land zoned for lots over 25,000 sq.ft.|
| INDUS | proportion of non-retail business acres per town.|
| CHAS | Charles River dummy variable (1 if tract bounds river; 0 otherwise)|
| NOX | nitric oxides concentration (parts per 10 million)|
| RM | average number of rooms per dwelling|
| AGE | proportion of owner-occupied units built prior to 1940|
| DIS | weighted distances to five Boston employment centres|
| RAD | index of accessibility to radial highways|
| TAX | full-value property-tax rate per 10,000 dollars|
| PTRATIO | pupil-teacher ratio by town|
| B | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town|
| LSTAT | percentage lower status of the population|
| MEDV | median value of owner-occupied homes (in units of 1,000 dollars)|


### Exploring the data
We're loading this dataset from a CSV below. 

The first thing a data analyst does when opening a new dataset is print out a few rows to get a first impression of the data we are dealing with. We accomplish this with `.head()`:

In [15]:
boston = pd.read_csv("../data/boston_dataset.csv")
boston.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2


`.head()` outputs the **first** five rows of your DataFrame by default, but we can also pass a number: `boston.head(10)` would output the top ten rows, for example.

To see the **last** five rows use `.tail()`. `.tail()` also accepts a number; in this case we print the bottom two rows:

In [9]:
boston.tail(2)

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273,21.0,393.45,6.48,22.0
505,0.04741,0.0,11.93,0,0.573,6.03,80.8,2.505,1,273,21.0,396.9,7.88,11.9


Typically when we load in a dataset, we like to view the first five or so rows to see what's under the hood. Here we can see the names of each column, the index, and examples of values in each row.

### Exercise 2: Print the first five rows and the last five rows of the Boston data set

### Getting info about your data

Another command that is used frequently for exploring a new data set is `.info()`:

In [16]:
boston.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null int64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null int64
TAX        506 non-null int64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
MEDV       506 non-null float64
dtypes: float64(11), int64(3)
memory usage: 55.4 KB


`.info()` provides the essential details about your dataset, such as the number of rows and columns, the number of non-null values, what type of data is in each column, and how much memory your DataFrame is using. Note that in this notebook we only encounter numbers (int64 = whole numbers, float64 = numbers with decimals), but in reality there are more types of data.
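To give an impression of those other data types, the sketch below builds a small, made-up DataFrame with text, whole numbers, decimals, True/False values and dates, and prints the type pandas assigns to each column. The column names and values are assumptions for illustration; the exact name shown for the text column can differ between pandas versions.

```python
import pandas as pd

# A made-up DataFrame with a different kind of data in each column
df = pd.DataFrame({
    'name': ['June', 'Robert'],                            # text
    'apples': [3, 2],                                      # whole numbers
    'price': [0.5, 0.75],                                  # numbers with decimals
    'member': [True, False],                               # True/False values
    'date': pd.to_datetime(['2024-01-01', '2024-01-02'])   # dates
})

# dtypes lists the data type of every column
print(df.dtypes)
```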

### Understanding your variables

To get an idea of the data you are working with, `describe()` can be used on an entire DataFrame to get a summary of the distribution of continuous variables:

In [21]:
boston.describe()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,3.613524,11.363636,11.136779,0.06917,0.554695,6.284634,68.574901,3.795043,9.549407,408.237154,18.455534,356.674032,12.653063,22.532806
std,8.601545,23.322453,6.860353,0.253994,0.115878,0.702617,28.148861,2.10571,8.707259,168.537116,2.164946,91.294864,7.141062,9.197104
min,0.00632,0.0,0.46,0.0,0.385,3.561,2.9,1.1296,1.0,187.0,12.6,0.32,1.73,5.0
25%,0.082045,0.0,5.19,0.0,0.449,5.8855,45.025,2.100175,4.0,279.0,17.4,375.3775,6.95,17.025
50%,0.25651,0.0,9.69,0.0,0.538,6.2085,77.5,3.20745,5.0,330.0,19.05,391.44,11.36,21.2
75%,3.677082,12.5,18.1,0.0,0.624,6.6235,94.075,5.188425,24.0,666.0,20.2,396.225,16.955,25.0
max,88.9762,100.0,27.74,1.0,0.871,8.78,100.0,12.1265,24.0,711.0,22.0,396.9,37.97,50.0
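The individual numbers in this summary can also be computed one at a time. The following sketch uses a tiny, made-up DataFrame (the columns `rooms` and `price` are illustrative only) to show `.mean()`, `.median()` and `.max()`, how to read a single value out of the `describe()` result, and how `.corr()` measures whether two columns move together.

```python
import pandas as pd

# Made-up data in which price rises steadily with the number of rooms
df = pd.DataFrame({
    'rooms': [4, 5, 6, 7, 8],
    'price': [10.0, 14.0, 18.0, 22.0, 26.0]
})

# Single statistics for one column
print(df['price'].mean())     # 18.0
print(df['price'].median())   # 18.0
print(df['price'].max())      # 26.0

# describe() returns a DataFrame, so single values can be looked up with .loc
summary = df.describe()
print(summary.loc['50%', 'price'])   # the median again: 18.0

# Correlation between two columns (close to 1 means they rise together)
correlation = df['rooms'].corr(df['price'])
print(correlation)
```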


## 5. Selecting, slicing, extracting data in pandas

Up until now we've focused on some basic summaries of our data. Below we explain the methods of *selecting, slicing, and extracting* that you'll need to use constantly when working with data in Python. Let's look at working with columns first.

### Selecting a column
Using square brackets is the general way we select columns in a DataFrame:

In [22]:
#select a column by its column name
medv_col = boston['MEDV']

#use type() to view the type of this data
type(medv_col)

pandas.core.series.Series

Notice that the selected column `medv_col` is not a DataFrame like `boston` is. Instead, `medv_col` is what we call a *Series*. It's important to note that, although many methods are the same, DataFrames and Series have different attributes, so you'll need to be sure which type you are working with or else you will receive attribute errors.

To extract a column as a *DataFrame* instead, you use square brackets to pass a list of column names. In our case that's just a single column:

In [24]:
medv_col = boston[['MEDV']]

type(medv_col)

pandas.core.frame.DataFrame

However, this approach can be expanded to selecting any number of columns by adding additional column names:

In [23]:
subset = boston[['RM', 'MEDV']]

subset.head()

Unnamed: 0,RM,MEDV
0,6.575,24.0
1,6.421,21.6
2,7.185,34.7
3,6.998,33.4
4,7.147,36.2
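The difference between the two kinds of selection can be seen in a small sketch. The two-row DataFrame below is made up for illustration; it shows that single brackets give a one-dimensional Series, double brackets give a two-dimensional DataFrame, and asking a Series for a DataFrame-only attribute such as `.columns` raises an `AttributeError`.

```python
import pandas as pd

# A tiny, made-up DataFrame with two columns
df = pd.DataFrame({'RM': [6.5, 6.4], 'MEDV': [24.0, 21.6]})

as_series = df['MEDV']     # single brackets: a Series
as_frame = df[['MEDV']]    # double brackets: a DataFrame

print(as_series.shape)             # (2,)   one dimension
print(as_frame.shape)              # (2, 1) rows and columns
print(as_series.name)              # MEDV
print(as_frame.columns.tolist())   # ['MEDV']

# A Series has no .columns attribute, so this raises an AttributeError
try:
    as_series.columns
except AttributeError as err:
    print('AttributeError:', err)
```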


### Selecting a row
Now we can select columns. But what if you want to get a specific row?

For rows, we can use the `.iloc` indexer to **loc**ate a row by its **i**nteger position. So to get the row at position 222 (which, with the default index, is also the row labelled 222) we do:

In [34]:
row = boston.iloc[222]

print(row)
print(type(row))

CRIM         0.62356
ZN           0.00000
INDUS        6.20000
CHAS         1.00000
NOX          0.50700
RM           6.87900
AGE         77.70000
DIS          3.27210
RAD          8.00000
TAX        307.00000
PTRATIO     17.40000
B          390.39000
LSTAT        9.93000
MEDV        27.50000
Name: 222, dtype: float64
<class 'pandas.core.series.Series'>


### Exercise 3
Above we selected row 222 using `.iloc`. What we got is a Series. How can we edit the cell above to get a DataFrame instead?

Bonus question: how can we use `.iloc` to select both row 222 and row 223?

In [None]:
row = boston.iloc[____]

row

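We can also use a trick called *slicing* to select multiple rows. Slicing is done using square brackets with a start and a stop position, like `boston.iloc[20:30]`, which returns the rows at positions 20 up to, but not including, 30: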

In [37]:
boston.iloc[20:30]

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
20,1.25179,0.0,8.14,0,0.538,5.57,98.1,3.7979,4,307,21.0,376.57,21.02,13.6
21,0.85204,0.0,8.14,0,0.538,5.965,89.2,4.0123,4,307,21.0,392.53,13.83,19.6
22,1.23247,0.0,8.14,0,0.538,6.142,91.7,3.9769,4,307,21.0,396.9,18.72,15.2
23,0.98843,0.0,8.14,0,0.538,5.813,100.0,4.0952,4,307,21.0,394.54,19.88,14.5
24,0.75026,0.0,8.14,0,0.538,5.924,94.1,4.3996,4,307,21.0,394.33,16.3,15.6
25,0.84054,0.0,8.14,0,0.538,5.599,85.7,4.4546,4,307,21.0,303.42,16.51,13.9
26,0.67191,0.0,8.14,0,0.538,5.813,90.3,4.682,4,307,21.0,376.88,14.81,16.6
27,0.95577,0.0,8.14,0,0.538,6.047,88.8,4.4534,4,307,21.0,306.38,17.28,14.8
28,0.77299,0.0,8.14,0,0.538,6.495,94.4,4.4547,4,307,21.0,387.94,12.8,18.4
29,1.00245,0.0,8.14,0,0.538,6.674,87.3,4.239,4,307,21.0,380.23,11.98,21.0
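One detail worth knowing is that slicing behaves slightly differently for `.iloc` (positions) and `.loc` (labels). The sketch below uses the fruit stand names as a made-up index: an `.iloc` slice leaves out the stop position, while a `.loc` slice includes the stop label.

```python
import pandas as pd

# Fruit stand data with customer names as the index
df = pd.DataFrame(
    {'apples': [3, 2, 0, 1]},
    index=['June', 'Robert', 'Lily', 'David']
)

# Position-based slice: positions 1 and 2, position 3 is NOT included
by_position = df.iloc[1:3]
print(by_position)

# Label-based slice: from 'Robert' up to and INCLUDING 'David'
by_label = df.loc['Robert':'David']
print(by_label)
```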


#### Conditional selections
We’ve gone over how to select columns and rows, but what if we want to make a conditional selection? 

For example, what if we want to filter our DataFrame to show only the rows where the nitric oxides concentration exceeds 0.4 parts per 10 million?

To do that, we take a column from the DataFrame and apply a Boolean condition to it. Here's an example of a Boolean condition:

In [38]:
condition = (boston['NOX'] > 0.4)

condition.head()

0    True
1    True
2    True
3    True
4    True
Name: NOX, dtype: bool

The Boolean condition above returns a Series of True and False values: True for concentrations above 0.4 and False for concentrations of 0.4 or below.

We want to filter out concentrations of 0.4 or below; in other words, we don't want the rows that are False. To return only the rows where the condition is True, we pass the condition to the DataFrame inside square brackets:

In [39]:
boston[boston['NOX'] > 0.4].head(10)

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2
5,0.02985,0.0,2.18,0,0.458,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7
6,0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.6,12.43,22.9
7,0.14455,12.5,7.87,0,0.524,6.172,96.1,5.9505,5,311,15.2,396.9,19.15,27.1
8,0.21124,12.5,7.87,0,0.524,5.631,100.0,6.0821,5,311,15.2,386.63,29.93,16.5
9,0.17004,12.5,7.87,0,0.524,6.004,85.9,6.5921,5,311,15.2,386.71,17.1,18.9


We can also make some richer conditionals by using logical operators `|` for "or" and `&` for "and".

Let's filter the DataFrame to show only rows where NOX is greater than 0.4 OR where TAX is lower than 300:

In [31]:
boston[(boston['NOX'] > 0.4) | (boston['TAX'] < 300)].head(10)

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2
5,0.02985,0.0,2.18,0.0,0.458,6.43,58.7,6.0622,3.0,222.0,18.7,394.12,5.21,28.7
6,0.08829,12.5,7.87,0.0,0.524,6.012,66.6,5.5605,5.0,311.0,15.2,395.6,12.43,22.9
7,0.14455,12.5,7.87,0.0,0.524,6.172,96.1,5.9505,5.0,311.0,15.2,396.9,19.15,27.1
8,0.21124,12.5,7.87,0.0,0.524,5.631,100.0,6.0821,5.0,311.0,15.2,386.63,29.93,16.5
9,0.17004,12.5,7.87,0.0,0.524,6.004,85.9,6.5921,5.0,311.0,15.2,386.71,17.1,18.9


As you can see, we can easily filter our data to get only the values we are interested in using Boolean conditions. Also, notice how similar filtering is to what we saw in the NumPy notebook. 
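The example above used `|`; the sketch below shows `&` and the negation operator `~` as well, on a small made-up DataFrame that borrows the column names NOX and TAX. Each condition is wrapped in parentheses because `&` and `|` bind more tightly than comparisons such as `>`, so leaving the parentheses out typically leads to an error.

```python
import pandas as pd

# Four made-up rows (values are illustrative only)
df = pd.DataFrame({
    'NOX': [0.35, 0.45, 0.50, 0.38],
    'TAX': [250, 280, 400, 350]
})

# AND: both conditions must be True for a row to be kept
both = df[(df['NOX'] > 0.4) & (df['TAX'] < 300)]

# OR: at least one condition must be True
either = df[(df['NOX'] > 0.4) | (df['TAX'] < 300)]

# NOT: ~ flips True and False, keeping rows where NOX is 0.4 or below
not_high = df[~(df['NOX'] > 0.4)]

print(both)
print(either)
print(not_high)
```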


### Exercise 4: Print all rows where NOX is greater than 0.4.

## 6. Sorting a DataFrame

To sort a pandas DataFrame, use the `.sort_values()` method.

Below, we'll show a few examples that demonstrate how to sort the data in ascending and descending order based on a specific column. Then, we examine how to sort based on multiple columns.

Let's say that we want to sort our boston data set in an ascending order (low to high) by median value (column MEDV). We can do that as follows:

In [45]:
boston.sort_values(by=['MEDV']).head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
398,38.3518,0.0,18.1,0,0.693,5.453,100.0,1.4896,24,666,20.2,396.9,30.59,5.0
405,67.9208,0.0,18.1,0,0.693,5.683,100.0,1.4254,24,666,20.2,384.97,22.98,5.0
400,25.0461,0.0,18.1,0,0.693,5.987,100.0,1.5888,24,666,20.2,396.9,26.77,5.6
399,9.91655,0.0,18.1,0,0.693,5.852,77.8,1.5004,24,666,20.2,338.16,29.97,6.3
414,45.7461,0.0,18.1,0,0.693,4.519,100.0,1.6582,24,666,20.2,88.27,36.98,7.0


Easy, right? Note that unless specified otherwise, pandas will sort values in an ascending order by default.

Sorting in descending order is very similar. If we want to know the towns with the highest crime rate in the data set, we can sort the data in descending order by crime rate (column CRIM):

In [43]:
boston.sort_values(by=['CRIM'], ascending=False).head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
380,88.9762,0.0,18.1,0,0.671,6.968,91.9,1.4165,24,666,20.2,396.9,17.21,10.4
418,73.5341,0.0,18.1,0,0.679,5.957,100.0,1.8026,24,666,20.2,16.45,20.62,8.8
405,67.9208,0.0,18.1,0,0.693,5.683,100.0,1.4254,24,666,20.2,384.97,22.98,5.0
410,51.1358,0.0,18.1,0,0.597,5.757,100.0,1.413,24,666,20.2,2.6,10.11,15.0
414,45.7461,0.0,18.1,0,0.693,4.519,100.0,1.6582,24,666,20.2,88.27,36.98,7.0


### Sorting on multiple columns
Sorting on one column usually does the trick, but what if we want to sort our data according to the following description:

  - Sort rows by median value of houses (column MEDV) and sort rows with equal value by the proportion of older houses in the town (column AGE), all in descending order.

In [44]:
boston.sort_values(by=['MEDV', 'AGE'], ascending=False).head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
368,4.89822,0.0,18.1,0,0.631,4.97,100.0,1.3325,24,666,20.2,375.52,3.26,50.0
371,9.2323,0.0,18.1,0,0.631,6.216,100.0,1.1691,24,666,20.2,366.15,9.53,50.0
162,1.83377,0.0,19.58,1,0.605,7.802,98.2,2.0407,5,403,14.7,389.61,1.92,50.0
370,6.53876,0.0,18.1,1,0.631,7.016,97.5,1.2024,24,666,20.2,392.05,2.96,50.0
369,5.66998,0.0,18.1,1,0.631,6.683,96.8,1.3567,24,666,20.2,375.33,3.73,50.0
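Two further points about `sort_values` are sketched below on a small made-up DataFrame (the columns `price` and `age` are illustrative). First, `ascending` also accepts a list with one True/False value per sort column, so each column can have its own direction. Second, `sort_values` returns a new DataFrame and leaves the original untouched, so the result has to be stored in a variable if you want to keep it.

```python
import pandas as pd

# Made-up data with a tie in the 'price' column
df = pd.DataFrame({
    'price': [20, 50, 50, 10],
    'age': [30, 80, 60, 90]
})

# price from high to low; rows with equal price ordered by age from low to high
sorted_df = df.sort_values(by=['price', 'age'], ascending=[False, True])
print(sorted_df)

# The original DataFrame still has its rows in the original order
print(df)
```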


### Exercise 5: Sort the rows by crime rate (CRIM), then by tax (TAX) and finally by nitric oxides concentration (NOX), all in ascending order.

## 7. Final challenges (optional)

### Challenge 1: Find the highest median house value (MEDV) among areas that bound the Charles River (CHAS).
Tip: use filtering with conditionals.

### Challenge 2: Find the average nitric oxides concentration (NOX).

### Challenge 3:  
Sort the rows by:

  - highest median value (MEDV) (descending order)
  - then by lowest tax (TAX) (ascending order)
  - and finally by highest average number of rooms per house (RM) (descending order)

## Wrapping up

Exploring, cleaning, transforming, and visualizing data with pandas in Python is an essential skill in data science. Cleaning and wrangling data alone is often said to take up as much as 80% of a data scientist's job. After a few projects and some practice, you should be very comfortable with most of the basics. You can now continue with level 3, where we work on more practical applications.