# Introduction to DataFrames

In this workbook we introduce pandas, a popular Python library that makes loading and saving data in different formats straightforward. It's also very powerful for manipulating data and selecting subsets of it.

We'll start with an example that builds the data directly in Python, then look at loading and saving the common formats you're likely to encounter.

To use pandas we need to import it. By convention pandas is referred to by the abbreviation ``pd`` -- this isn't required but it's very common. 

In [None]:
import pandas as pd

## Pandas DataFrames

In pandas, data are stored in a structure it calls a DataFrame. 
Conceptually you can think of it as a table, with column headings and numbered rows, although DataFrames also function like databases.

DataFrame is a class in Python, so we create a DataFrame object by calling the constructor.
We'll use a few books for our data.
``df`` is a common variable name for a DataFrame.

In [None]:
df = pd.DataFrame({
    "title": ["Tangled Web",
              "Close Up",
              "Foundations",
              "Professional Secrets",
              "5 Times 5"],
    "author": ["Eric Mead",
               "David Stone",
               "Eberhard Riese",
               "Geoffrey Durham",
               "Richard Kaufman"],
    "price": [40, 23.1, 70, 295, 34.65]
})

df

You can clearly see the table-like structure above.
What looks like a column is actually a `Series` in `pandas`.
The column names are taken from the dictionary that was passed into the constructor function.
Notice that row numbers have been added automatically and start from 0. 
This is the index.
It's possible to specify different row labels when constructing the DataFrame by passing in a value for the optional parameter ``index``.
The index is an instance attribute of the DataFrame so it can be accessed using the dot notation.
Column names can be accessed in a similar way.

In [None]:
df.index

In [None]:
df.columns
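
As a minimal sketch of the optional ``index`` parameter mentioned above, the cell below builds a smaller DataFrame with custom row labels. The name `books` and the short codes used as labels are invented for this illustration; they are not part of the original data.

```python
import pandas as pd

# Hypothetical row labels: short codes invented for this example.
books = pd.DataFrame(
    {
        "title": ["Tangled Web", "Close Up", "Foundations"],
        "price": [40, 23.1, 70],
    },
    index=["tw", "cu", "fo"],
)

# The custom labels replace the automatic 0, 1, 2 row numbers.
print(books.index)
print(books.columns)
```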

A column Series is selected using square brackets and the column name.
For example:

In [None]:
df["title"]

Notice that the row labels are also extracted with the requested column.
To get a single item from the Series use the square bracket notation and the row label, for example:

In [None]:
df["title"][1]
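
The same lookup can also be written in one step with the `loc` and `iloc` indexers, which take a row and a column together. This is a small sketch on a cut-down copy of the book data (the name `books` is only used here): `loc` works with labels, `iloc` with integer positions.

```python
import pandas as pd

books = pd.DataFrame({
    "title": ["Tangled Web", "Close Up", "Foundations"],
    "price": [40, 23.1, 70],
})

# loc selects by label: the row labelled 1 and the column named "title".
by_label = books.loc[1, "title"]

# iloc selects by position: the second row and the first column.
by_position = books.iloc[1, 0]

print(by_label, by_position)
```

With the default index the row labels and row positions happen to coincide, so both lines return the same item.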

A few useful DataFrame methods for inspecting your data:
- ``head()`` shows the first few rows of data. By default it will show five rows or you can specify how many.
- ``tail()`` is like ``head()`` but shows the last few rows of data.
- ``describe()`` gives some summary statistics about the numerical columns.
- ``info()`` prints information such as the column names and their data types.

The ``shape`` of a DataFrame (or Series) gives its dimensions: the number of rows and columns it has (for a Series, just the number of rows).

In [None]:
df.head(2)

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.shape

## Iterating Over DataFrames
A Pandas Series and a Pandas DataFrame are both iterable. That means we can use them in loops. Let's look at a Series first. For the purposes of a loop, a Series can be treated like a list. 

In [None]:
title_series = df["title"]

for i in title_series:
    print('The title of this book is {}'.format(i))

Iterating over a DataFrame directly will only give us the column names.

In [None]:
for i in df:
    print(i)

The ``items()`` method allows us to loop through the columns in turn. Each column is provided as a Series, so we can also loop through the data in the column. 

In [None]:
for col_name, col_data in df.items():
    print('**New column: {}**'.format(col_name))
    for i in col_data:
        print(i)

To iterate through the rows instead of the columns use ``iterrows()``. 

In [None]:
for row_name, row_data in df.iterrows():
    print("**New row: {}**".format(row_name))
    for i in row_data:
        print(i)

Another way to iterate through the rows is with ``itertuples()``, which returns a named tuple for each row (remember that a tuple is like a list except that it can't be modified once it's created; a named tuple also lets you refer to each element by name).

In [None]:
for row in df.itertuples(name="Book"):
    print(row)

Now we can access a piece of data from each row using the column name. It could also be useful as an intermediate step to creating an object from each of the rows in the DataFrame. 

In [None]:
for row in df.itertuples(name="Book"):
    print(row.title)
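
To illustrate the idea of creating an object from each row, here is one possible sketch. The `Book` dataclass and the name `books_df` are hypothetical and exist only for this example; ``index=False`` leaves the row label out of each tuple so that only the column values remain.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class Book:  # hypothetical class, defined only for this illustration
    title: str
    author: str
    price: float


books_df = pd.DataFrame({
    "title": ["Tangled Web", "Close Up"],
    "author": ["Eric Mead", "David Stone"],
    "price": [40, 23.1],
})

# Build one Book per row, reading each value by its column name.
books = [
    Book(row.title, row.author, row.price)
    for row in books_df.itertuples(index=False)
]

print(books[0])
```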

## DataFrame Selecting

We've already looked at how to select a Series (`df["title"]`), but now we'll get fancy with our selects.

First, you can select any columns you'd like in any order (you can even repeat the columns selected); just pass a list of column titles to the indexing `[]` syntax.
This returns a new DataFrame.
The original DataFrame is not affected.

In [None]:
df[["author", "title"]]

To select specific rows, use `iloc` with a slicing syntax:

In [None]:
df.iloc[2:-1]

We can mix them too:

In [None]:
df.iloc[2:-1]["title"]

Rather than selecting blocks of rows, more often you'll find you'll need to filter them based upon some condition. We do this like so:

In [None]:
df[df["price"] > 50]

You can use any of the numerical comparators (`== != > >= < <=`). The Series also provides an `isin` function and a `notna` function. `isin` is a membership check; `notna` marks which values are not missing, so filtering with it drops the rows with `NA` values.

In [None]:
df[df["title"].isin(["Foundations", "Close Up"])]
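
The book data above has no missing values, so here is a small sketch of `notna` on a made-up copy (named `books` for this example) in which one price has been deliberately left out.

```python
import pandas as pd

# One price is missing on purpose; pandas stores it as NaN.
books = pd.DataFrame({
    "title": ["Tangled Web", "Close Up", "Foundations"],
    "price": [40, None, 70],
})

# notna() gives True for rows that have a value and False for missing ones.
mask = books["price"].notna()

# Filtering with the mask keeps only the rows that have a price.
priced = books[mask]
print(priced)
```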

Note that filtering returns a DataFrame, so you can stack further selectors or filters onto it like so:

In [None]:
df[df["price"] > 50]["author"]

In [None]:
expensive = df[df["price"] > 50]
not_too_expensive = expensive[expensive["price"] < 100]
not_too_expensive
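
Two conditions can also be combined in a single filter rather than applied one after the other. The sketch below uses a reduced copy of the book data (named `books` here). Each condition needs its own parentheses, and pandas uses `&` for "and", `|` for "or" and `~` for "not" instead of the Python keywords.

```python
import pandas as pd

books = pd.DataFrame({
    "title": ["Tangled Web", "Close Up", "Foundations",
              "Professional Secrets", "5 Times 5"],
    "price": [40, 23.1, 70, 295, 34.65],
})

# Keep rows where the price is above 50 AND below 100.
mid_range = books[(books["price"] > 50) & (books["price"] < 100)]
print(mid_range)
```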

## Exercise 1

### 1a
The next cell contains some information about the 20 highest-grossing movies. 
At the moment, we have a dictionary for each movie showing the rank, title, the money made (in US $), and the year that the film was released. 

In order to create a DataFrame, we need the data in a different format, namely, a dictionary of lists. 

Write some code to create lists for each type of information (ranking, title, etc.) and use these to create the DataFrame. Give the columns the same names as the current dictionary keys. 

Notice that the ranking, worldwide gross and year data are all presented here as strings. These columns should be converted to integers. 

In [None]:
movies = [
    {'Ranking': '1', 'Title': 'Avengers: Endgame', 'Worldwide gross ($)': '2797800564', 'Year': '2019'}, 
    {'Ranking': '2', 'Title': 'Avatar', 'Worldwide gross ($)': '2790439000', 'Year': '2009'}, 
    {'Ranking': '3', 'Title': 'Titanic', 'Worldwide gross ($)': '2194439542', 'Year': '1997'},
    {'Ranking': '4', 'Title': 'Star Wars: The Force Awakens', 'Worldwide gross ($)': '2068223624', 'Year': '2015'},
    {'Ranking': '5', 'Title': 'Avengers: Infinity War', 'Worldwide gross ($)': '2048359754', 'Year': '2018'},
    {'Ranking': '6', 'Title': 'Jurassic World', 'Worldwide gross ($)': '1671713208', 'Year': '2015'},
    {'Ranking': '7', 'Title': 'The Lion King', 'Worldwide gross ($)': '1656943394', 'Year': '2019'},
    {'Ranking': '8', 'Title': 'The Avengers', 'Worldwide gross ($)': '1518812988', 'Year': '2012'},
    {'Ranking': '9', 'Title': 'Furious 7', 'Worldwide gross ($)': '1516045911', 'Year': '2015'},
    {'Ranking': '10', 'Title': 'Frozen II', 'Worldwide gross ($)': '1450026933', 'Year': '2019'},
    {'Ranking': '11', 'Title': 'Avengers: Age of Ultron', 'Worldwide gross ($)': '1402805868', 'Year': '2015'},
    {'Ranking': '12', 'Title': 'Black Panther', 'Worldwide gross ($)': '1347280838', 'Year': '2018'},
    {'Ranking': '13', 'Title': 'Harry Potter and the Deathly Hallows – Part 2', 'Worldwide gross ($)': '1342025430', 'Year': '2011'},
    {'Ranking': '14', 'Title': 'Star Wars: The Last Jedi', 'Worldwide gross ($)': '1332539889', 'Year': '2017'},
    {'Ranking': '15', 'Title': 'Jurassic World: Fallen Kingdom', 'Worldwide gross ($)': '1309484461', 'Year': '2018'},
    {'Ranking': '16', 'Title': 'Frozen', 'Worldwide gross ($)': '1290000000', 'Year': '2013'},
    {'Ranking': '17', 'Title': 'Beauty and the Beast', 'Worldwide gross ($)': '1263521126', 'Year': '2017'},
    {'Ranking': '18', 'Title': 'Incredibles 2', 'Worldwide gross ($)': '1242805359', 'Year': '2018'},
    {'Ranking': '19', 'Title': 'The Fate of the Furious', 'Worldwide gross ($)': '1238764765', 'Year': '2017'},
    {'Ranking': '20', 'Title': 'Iron Man 3', 'Worldwide gross ($)': '1214811252', 'Year': '2013'}
]

### 1b
Select the rows for movies from 2017.


### 1c
Select the rows which made at least $1300000000 and were released before 2018.


### 1d
Create a new DataFrame by selecting the 'Title' and 'Worldwide gross ($)' columns for films released before 2018.

## Loading, Converting, and Saving Data

We've already seen in previous lessons how to load data from a CSV file, and pandas also provides a set of functions to load data from a CSV file directly into a DataFrame. 

The function `read_csv()` takes the name of a CSV file and creates a DataFrame. This DataFrame will take column names from the CSV file automatically, or you can override this behaviour by setting the optional parameters ``header`` to `0` and ``names`` to a list of the column names you would like to use instead. 

If you need to load data from an online source, just replace the file name with the URL of the data; pandas is smart enough to know what to do. 
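
As a small sketch of the ``header`` and ``names`` parameters, the cell below reads the same CSV text twice. To keep the example self-contained it reads from an in-memory string wrapped in `io.StringIO` rather than from a file; the CSV content and the variable names are made up for illustration.

```python
import io

import pandas as pd

# A tiny CSV with terse column headings, invented for this example.
csv_text = "t,p\nTangled Web,40\nClose Up,23.1\n"

# By default the first line of the file supplies the column names.
default_names = pd.read_csv(io.StringIO(csv_text))

# header=0 says the first line is a header row; names replaces it.
renamed = pd.read_csv(io.StringIO(csv_text), header=0,
                      names=["title", "price"])

print(default_names.columns)
print(renamed.columns)
```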

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/main/pandas/tests/io/data/csv/tips.csv')
df

This has downloaded a CSV file from the pandas GitHub repository. 
To save a DataFrame as a CSV use ``to_csv()``. We set ``index=False`` so that the row numbers aren't included in the new file. 

In [None]:
df.to_csv("tips.csv", index=False)

But what if you need that data in a completely different format? Well, `pandas` offers a surprising number of alternative formats. Let's convert it to JSON. If we don't provide a file name to write to, it'll just return the conversion as a string, which is handy for JSON:

In [None]:
df.to_json(orient="records")
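
To see what ``orient="records"`` produces, one option is to parse the returned string with Python's standard `json` module. The sketch below uses a two-row DataFrame made up for this example (named `books`); each row becomes one dictionary.

```python
import json

import pandas as pd

books = pd.DataFrame({
    "title": ["Tangled Web", "Close Up"],
    "price": [40.0, 23.1],
})

# With no file name, to_json returns the JSON as a string.
json_text = books.to_json(orient="records")

# Parsing the string gives a list with one dictionary per row.
records = json.loads(json_text)
print(records)
```

Passing a file name as the first argument to ``to_json()`` writes the JSON to that file instead of returning it.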

## Exercise 2
To follow this exercise you will need a CSV file containing data about the passengers aboard the famous Titanic on its first and only voyage. The data for this exercise is available at https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv.

### 2a
Load the CSV file into a DataFrame called 'titanic'.

### 2b 
Inspect the data - view the first 10 rows

### 2c
Create a DataFrame called 'titanic_age_sex' which only contains the age and sex of the passengers.

### 2d
Get the subset of data which relates to adults (age greater than or equal to 18), call the new DataFrame 'titanic_adults'.
Check how many rows are in this new DataFrame and compare to the original titanic DataFrame.

## How to Find Out More

The last thing you need to know is how to discover these functions and how to read their docs so you can read, convert, and write many more formats. There are [docs online](https://pandas.pydata.org/pandas-docs/stable/index.html), or you can make use of `dir` and `help` functions, which is handy to know so you can data-science on your Antarctic expedition, or while flying to meet a client, or when you just don't have a good internet connection.

`dir` lists the names of an object's attributes, including its methods; `help` shows the documentation.

In [None]:
# list all the functions in pandas that begin with "read_"
list(filter(lambda m: m.startswith("read_"), dir(pd)))

In [None]:
# list all the methods of DataFrame that begin with "to_"
list(filter(lambda m: m.startswith("to_"), dir(df)))

In [None]:
# print the documentation about the read_csv function in pandas
help(pd.read_csv)

In [None]:
# print the documentation about the to_csv method of DataFrame
help(df.to_csv)

# Solutions to Exercises
## 1
### 1a

In [None]:
movies = [
    {'Ranking': '1', 'Title': 'Avengers: Endgame', 'Worldwide gross ($)': '2797800564', 'Year': '2019'}, 
    {'Ranking': '2', 'Title': 'Avatar', 'Worldwide gross ($)': '2790439000', 'Year': '2009'}, 
    {'Ranking': '3', 'Title': 'Titanic', 'Worldwide gross ($)': '2194439542', 'Year': '1997'},
    {'Ranking': '4', 'Title': 'Star Wars: The Force Awakens', 'Worldwide gross ($)': '2068223624', 'Year': '2015'},
    {'Ranking': '5', 'Title': 'Avengers: Infinity War', 'Worldwide gross ($)': '2048359754', 'Year': '2018'},
    {'Ranking': '6', 'Title': 'Jurassic World', 'Worldwide gross ($)': '1671713208', 'Year': '2015'},
    {'Ranking': '7', 'Title': 'The Lion King', 'Worldwide gross ($)': '1656943394', 'Year': '2019'},
    {'Ranking': '8', 'Title': 'The Avengers', 'Worldwide gross ($)': '1518812988', 'Year': '2012'},
    {'Ranking': '9', 'Title': 'Furious 7', 'Worldwide gross ($)': '1516045911', 'Year': '2015'},
    {'Ranking': '10', 'Title': 'Frozen II', 'Worldwide gross ($)': '1450026933', 'Year': '2019'},
    {'Ranking': '11', 'Title': 'Avengers: Age of Ultron', 'Worldwide gross ($)': '1402805868', 'Year': '2015'},
    {'Ranking': '12', 'Title': 'Black Panther', 'Worldwide gross ($)': '1347280838', 'Year': '2018'},
    {'Ranking': '13', 'Title': 'Harry Potter and the Deathly Hallows – Part 2', 'Worldwide gross ($)': '1342025430', 'Year': '2011'},
    {'Ranking': '14', 'Title': 'Star Wars: The Last Jedi', 'Worldwide gross ($)': '1332539889', 'Year': '2017'},
    {'Ranking': '15', 'Title': 'Jurassic World: Fallen Kingdom', 'Worldwide gross ($)': '1309484461', 'Year': '2018'},
    {'Ranking': '16', 'Title': 'Frozen', 'Worldwide gross ($)': '1290000000', 'Year': '2013'},
    {'Ranking': '17', 'Title': 'Beauty and the Beast', 'Worldwide gross ($)': '1263521126', 'Year': '2017'},
    {'Ranking': '18', 'Title': 'Incredibles 2', 'Worldwide gross ($)': '1242805359', 'Year': '2018'},
    {'Ranking': '19', 'Title': 'The Fate of the Furious', 'Worldwide gross ($)': '1238764765', 'Year': '2017'},
    {'Ranking': '20', 'Title': 'Iron Man 3', 'Worldwide gross ($)': '1214811252', 'Year': '2013'}
]

ranking = [int(m["Ranking"]) for m in movies]
title = [m["Title"] for m in movies]
money = [int(m["Worldwide gross ($)"]) for m in movies]
year = [int(m["Year"]) for m in movies]

df = pd.DataFrame({
    "Ranking": ranking,
    "Title": title,
    "Worldwide gross ($)": money,
    "Year": year
})

df

### 1b

In [None]:
df[df["Year"] == 2017]

### 1c

In [None]:
df2 = df[df["Worldwide gross ($)"] >= 1300000000]
df2[df2["Year"] < 2018]

### 1d

In [None]:
df3 = df[df["Year"] < 2018][['Title', 'Worldwide gross ($)']]
df3

## 2
### 2a

In [None]:
titanic = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv')

### 2b

In [None]:
titanic.head(10)

### 2c

In [None]:
titanic_age_sex = titanic[['Age', 'Sex']]
titanic_age_sex.head()

### 2d

In [None]:
titanic_adults = titanic[titanic['Age'] >= 18]

# get the number of records in the new DF and original
titanic.shape, titanic_adults.shape