# Data analysis with pandas

The name pandas comes from 'panel data' and is also a play on 'Python data analysis'. The library gives data scientists working in Python a set of powerful tools to load, transform, and process large-ish data sets. It has become something of a de facto standard, and many of the lessons that you can find online will make use of pandas at some point. The official documentation is [here](http://pandas.pydata.org/pandas-docs/stable/).

Let's use pandas to work with a table of data from a local CSV file. We'll use a small file (only 19 KB) so it's easy and quick to work with. The file shows an analysis of housing affordability by Laurence Troy. It was downloaded from City Futures Research Centre's CityData website: *[Sydney Dwelling Affordability Index](https://citydata.be.unsw.edu.au/layers/geonode%3Aaffordability_index)*.

Visit that page now to learn a little about the dataset.

First let's cheat by opening the file in Excel or other spreadsheet software to help understand what pandas is doing.

You'll find it in the `data` folder, which sits in the parent folder of the folder containing this notebook.

**Open `affordability_index.csv` now.**

It looks like this:

![image.png](attachment:image.png)

Pandas is not part of the base Python software, so you need to import it:

In [None]:
import pandas

We'll use pandas' `read_csv` function to open the file, read it and store it in a variable called `my_data`. All in one command!

We have to tell Python where to find the file. Rather than an absolute path, we'll use a relative file path, which is relative to the notebook's location. To explain this path (below):
* `..` means the parent directory.
* Windows uses backslash `\` to separate the components of a path. But backslash has special meaning in Python: it's an escape character, used to introduce special characters. So for example `\n` means newline. You have to use `\\` to indicate an actual backslash. (On macOS or Linux the separator is a forward slash `/` instead.)
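If the double backslashes look awkward, here is a small sketch of some equivalent ways to spell the same relative path. The variable names are only for illustration, and `pathlib` is part of Python's standard library.

```python
from pathlib import PureWindowsPath

# 1. Escaped backslashes, as used in this notebook
escaped = '..\\data\\affordability_index.csv'

# 2. A raw string (note the r prefix) switches off escape processing
raw = r'..\data\affordability_index.csv'

# 3. Forward slashes, which Python also accepts on Windows
forward = '../data/affordability_index.csv'

# The first two spellings produce exactly the same string
print(escaped == raw)

# Interpreted as Windows paths, the forward-slash version is equivalent too
print(PureWindowsPath(forward) == PureWindowsPath(escaped))
```

Any of these should work on Windows; the forward-slash version is also the one to use on macOS or Linux.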

In [None]:
my_data = pandas.read_csv('..\\data\\affordability_index.csv')
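To get a feel for what `read_csv` does without needing the file, here is a minimal sketch that reads a few lines of CSV text from memory. The suburb names and values are invented for illustration and are not taken from the real dataset.

```python
import io
import pandas

# Invented CSV text standing in for a file on disk
csv_text = "Sub_indic,N_15_100k\nSuburb A,12.5\nSuburb B,48.0\n"

# read_csv accepts a file path or a file-like object such as StringIO
toy = pandas.read_csv(io.StringIO(csv_text))

# The first line of the text became the column names
print(toy)
```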

What data type is `my_data`?

In [None]:
print(type(my_data))

It's a 'DataFrame'. That's pandas talk for a two-dimensional table of data.

What can we find out about the data? Let's start by counting the rows and columns, which we do with an attribute called `shape` (note that it takes no parentheses):

In [None]:
print(my_data.shape)

The output shows the number of records (rows) followed by the number of attributes (columns). Have a look at the spreadsheet. Do those numbers look right?
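Because `shape` is a pair of numbers, you can unpack it into two variables. Here is a small sketch on a made-up table (the names and values are invented, not from the real file):

```python
import pandas

toy = pandas.DataFrame({
    "Sub_indic": ["Suburb A", "Suburb B", "Suburb C"],  # made-up names
    "N_15_100k": [12.5, 48.0, 63.2],                    # made-up values
})

# shape is written without parentheses and holds (rows, columns)
n_rows, n_cols = toy.shape
print(n_rows, "rows and", n_cols, "columns")
```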

Let's see the column names.

In [None]:
print(my_data.columns)

All very well, but what do those columns mean? Fortunately Laurence provided metadata when he published the dataset on CityData. Go back to [that page](https://citydata.be.unsw.edu.au/layers/geonode%3Aaffordability_index) and click the **Attributes** tab below the map to find out what the column names mean.

Next let's read the first few rows with the `head` method. This is especially useful when dealing with a large dataset where you don't want to view the whole thing. By default it prints the first 5 rows of the dataframe.

In [None]:
my_data.head()
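You can also ask `head` for a different number of rows, and `tail` does the same from the other end. A quick sketch on a hypothetical ten-row table:

```python
import pandas

# A hypothetical table with ten rows, numbered 0 to 9
toy = pandas.DataFrame({"value": range(10)})

print(toy.head(3))   # the first 3 rows
print(toy.tail(2))   # the last 2 rows
```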

It looks like just the first 5 columns are numeric data. Let's check that with the `dtypes` property:

In [None]:
my_data.dtypes
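If you'd rather have pandas pick out the numeric columns for you, `select_dtypes` is one way to do it. This sketch uses a made-up table; the `count` column and all values are invented for illustration.

```python
import pandas

toy = pandas.DataFrame({
    "Sub_indic": ["Suburb A", "Suburb B"],  # text column
    "N_15_100k": [12.5, 48.0],              # floating-point column
    "count": [3, 7],                        # integer column (hypothetical)
})

# Keep only the columns whose data type is numeric
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```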

Now we can use the `describe` method to run some statistics on the numeric columns: 

In [None]:
my_data.describe()

Can you find the following from the statistics above?
* The average affordability in 2005
* The median affordability in 2015
* The worst decline in affordability in this period
* The best improvement in affordability
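The table that `describe` returns is itself a DataFrame, so you can pull single numbers out of it by row and column label. Here is a sketch on three made-up values (not the real data), so it won't give away the answers above:

```python
import pandas

toy = pandas.DataFrame({"N_15_100k": [10.0, 20.0, 60.0]})  # made-up values

stats = toy.describe()

# Read individual statistics out of the describe() table
print(stats.loc["mean", "N_15_100k"])   # the average
print(stats.loc["50%", "N_15_100k"])    # the "50%" row is the median

# The same median, computed directly on the column
print(toy["N_15_100k"].median())
```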

We can also use Boolean expressions to find rows matching certain conditions.

Which suburbs have more than 50% affordability in 2015?

In [None]:
my_data[my_data["N_15_100k"] > 50]
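It can help to see that filter as two separate steps: the comparison builds a column of `True`/`False` values, and the square brackets keep only the rows marked `True`. A minimal sketch on made-up data (names and values invented):

```python
import pandas

toy = pandas.DataFrame({
    "Sub_indic": ["Suburb A", "Suburb B", "Suburb C"],  # made-up names
    "N_15_100k": [12.5, 48.0, 63.2],                    # made-up values
})

# Step 1: the comparison gives one True/False per row
mask = toy["N_15_100k"] > 50
print(mask.tolist())

# Step 2: indexing with the mask keeps only the True rows
selected = toy[mask]
print(selected)
```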

How many rows match that condition? Use `shape` in the cell below to find out.

How would we find the most affordable suburbs in 2015?

We saw that the `describe` method presents a range of statistics including maximum.

Can you guess how to return just the maximum value from one column? Yes, it's Python's built-in `max` function:

In [None]:
max(my_data["N_15_100k"])
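As an aside, pandas columns also have their own `max` method. The two usually agree, but they can differ when a column has missing values (shown as `NaN`). This sketch uses made-up numbers with a deliberate gap:

```python
import pandas

# Made-up values, with a missing value (NaN) in first place
values = pandas.Series([float("nan"), 12.5, 63.2])

# The pandas method skips missing values by default
print(values.max())

# The built-in max compares items one by one, and comparisons
# with NaN are always False, so the result depends on the order
print(max(values))
```

If your data might contain gaps, the pandas method is the safer choice.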

Let's make a variable to hold that maximum value:

In [None]:
max_afford_2015 = max(my_data["N_15_100k"])

Now we can list the rows that match that value:

In [None]:
my_data[my_data["N_15_100k"] == max_afford_2015]

Just the one! To get just the suburb name:

In [None]:
my_data["Sub_indic"][my_data["N_15_100k"] == max_afford_2015]

And to return the *value* 'Budgewoi' rather than the pandas object (a Series) containing that value, append the `values` property and take the first (*zero-th*) value:

In [None]:
my_data["Sub_indic"][my_data["N_15_100k"] == max_afford_2015].values[0]
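There is more than one way to write that lookup. As a hedged alternative, this sketch uses `loc` to pick rows and a column in one step, and `idxmax` to jump straight to the row holding the maximum. The table is made up (names and values invented).

```python
import pandas

toy = pandas.DataFrame({
    "Sub_indic": ["Suburb A", "Suburb B", "Suburb C"],  # made-up names
    "N_15_100k": [12.5, 48.0, 63.2],                    # made-up values
})

# loc takes a row selector and a column name together;
# iloc[0] then picks the first matching value
is_max = toy["N_15_100k"] == toy["N_15_100k"].max()
name = toy.loc[is_max, "Sub_indic"].iloc[0]
print(name)

# idxmax returns the row label of the first maximum value
same_name = toy.loc[toy["N_15_100k"].idxmax(), "Sub_indic"]
print(same_name)
```

Note that `idxmax` only reports the first maximum, so the Boolean version is the better fit if several rows could tie.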

**Challenge:** Can you find the suburbs with the worst drop in affordability from 2005 to 2015? Write your code in the empty cell below.

**Congratulations on completing this introduction to data analysis with pandas!**

*The content and structure of this notebook were created by Jonathan Doig for City Futures Research Centre. It is licensed under the [Creative Commons Attribution-NonCommercial 4.0 license](https://creativecommons.org/licenses/by-nc/4.0/) (CC-BY-NC) and the code is licensed under [The MIT License](https://opensource.org/licenses/mit-license.php).*