# Working with DataFrames

Having seen how to create a DataFrame with pandas, we now look at some ways to manipulate and modify a DataFrame.

Remember that we need to import pandas.
We also create a DataFrame to work with. 

In [1]:
import pandas as pd

df = pd.DataFrame({
    "title": ["Tangled Web",
              "Close Up",
              "Foundations",
              "Professional Secrets",
              "5 Times 5"],
    "author": ["Eric Mead",
               "David Stone",
               "Eberhard Riese",
               "Geoffrey Durham",
               "Richard Kaufman"],
    "price": [40, 23.1, 70, 295, 34.65]
})

df

Unnamed: 0,title,author,price
0,Tangled Web,Eric Mead,40.0
1,Close Up,David Stone,23.1
2,Foundations,Eberhard Riese,70.0
3,Professional Secrets,Geoffrey Durham,295.0
4,5 Times 5,Richard Kaufman,34.65


## Add a New Row or Column to a DataFrame
A row in a DataFrame is a `Series` so we first create a new pandas Series.
The new data can be passed in to the constructor as a dictionary or as a list (as with a DataFrame).
If the data is given as a list then we need to use the optional ``index`` parameter to specify which column each member of the list refers to. 
Both methods are shown below and both achieve the same thing. Which one to use depends entirely on the form your data is in; it is simply a matter of convenience.

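To insert the new row use the `append()` method on the existing DataFrame. (`append()` was deprecated in pandas 1.4 and removed in pandas 2.0, so the code below needs an older version; newer versions use `pd.concat()` instead.)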
Our DataFrame was given row numbers automatically. 
Setting the optional parameter `ignore_index` to `True` will allow this automatic numbering to continue.
If the rows required different labels then the parameter `name` may be set when constructing the Series. 
The value of this parameter will become the row label when the Series is added to the DataFrame. 

The `append()` method does *not* modify the original DataFrame but returns a new one.
The new DataFrame can be assigned to the existing variable if you do not want to keep the original DataFrame, otherwise assign it to a new variable.

In [2]:
spirit = pd.Series({"title" :"The Spirit of the Quakers", "author": "Geoffrey Durham", "price": 9.43})
df = df.append(spirit, ignore_index=True)

believe = pd.Series(["What do Quakers Believe?","Geoffrey Durham", 5.94], index=df.columns)
df = df.append(believe, ignore_index=True)

df

Unnamed: 0,title,author,price
0,Tangled Web,Eric Mead,40.0
1,Close Up,David Stone,23.1
2,Foundations,Eberhard Riese,70.0
3,Professional Secrets,Geoffrey Durham,295.0
4,5 Times 5,Richard Kaufman,34.65
5,The Spirit of the Quakers,Geoffrey Durham,9.43
6,What do Quakers Believe?,Geoffrey Durham,5.94
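
On pandas 2.0 and later, where `append()` is no longer available, the same rows can be added with `pd.concat()`. The following is a minimal sketch rather than a drop-in replacement for the cell above: the small `books` DataFrame and the `new_rows` name are stand-ins invented for illustration.

```python
import pandas as pd

# Small stand-in for the book DataFrame used above
books = pd.DataFrame({
    "title": ["Tangled Web", "Close Up"],
    "author": ["Eric Mead", "David Stone"],
    "price": [40, 23.1],
})

# Put the new rows (dictionaries here) into their own DataFrame
new_rows = pd.DataFrame([
    {"title": "The Spirit of the Quakers", "author": "Geoffrey Durham", "price": 9.43},
    {"title": "What do Quakers Believe?", "author": "Geoffrey Durham", "price": 5.94},
])

# concat returns a new DataFrame; ignore_index=True renumbers the rows from 0
books = pd.concat([books, new_rows], ignore_index=True)
print(books)
```

As with `append()`, the original DataFrame is not modified, so the result is assigned back to the variable.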


To delete a row use the method ``drop()``, passing in a list of row labels to the parameter ``index``.
```
drop(index=[<row labels>])
```
This returns a new DataFrame; `drop()` does *not* modify the original DataFrame.
To use the result later you need to assign it to a variable.
Notice also that the rest of the row labels do not change.

In [3]:
df.drop(index = [2,3])

Unnamed: 0,title,author,price
0,Tangled Web,Eric Mead,40.0
1,Close Up,David Stone,23.1
4,5 Times 5,Richard Kaufman,34.65
5,The Spirit of the Quakers,Geoffrey Durham,9.43
6,What do Quakers Believe?,Geoffrey Durham,5.94


Let's add a column with a rating for each book. 
A column of a DataFrame is also a pandas Series, which we can create from a list.
Adding the Series to the DataFrame uses similar syntax to extending a dictionary. 
Specify the name of the column in the DataFrame and assign the Series to it.
Of course, if an existing column name is specified then this column will be overwritten, so take care! 

In [4]:
rating = pd.Series([4.5, 3.7, 3.8, 4.9, 4.0, 3.1, 3.4])

df['ratting'] = rating

df

Unnamed: 0,title,author,price,ratting
0,Tangled Web,Eric Mead,40.0,4.5
1,Close Up,David Stone,23.1,3.7
2,Foundations,Eberhard Riese,70.0,3.8
3,Professional Secrets,Geoffrey Durham,295.0,4.9
4,5 Times 5,Richard Kaufman,34.65,4.0
5,The Spirit of the Quakers,Geoffrey Durham,9.43,3.1
6,What do Quakers Believe?,Geoffrey Durham,5.94,3.4
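
One detail worth knowing: when a Series is assigned as a column, pandas matches its values to rows by index label rather than by position, whereas a plain list is assigned by position. The sketch below uses a made-up DataFrame with string row labels to show the difference.

```python
import pandas as pd

demo = pd.DataFrame({"title": ["A", "B", "C"]}, index=["x", "y", "z"])

# A list is assigned by position, so it must have one entry per row
demo["by_list"] = [4.5, 3.7, 3.8]

# A Series is matched on its index labels; "z" has no match and becomes NaN
demo["by_series"] = pd.Series([1.0, 2.0], index=["y", "x"])

print(demo)
```

In the cell above the `rating` Series has the default labels 0 to 6, which match the DataFrame's row labels, so every value lands in the intended row.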


To add a new column with the same value for every row there is an even simpler method:

In [5]:
df['new column'] = 15

df

Unnamed: 0,title,author,price,ratting,new column
0,Tangled Web,Eric Mead,40.0,4.5,15
1,Close Up,David Stone,23.1,3.7,15
2,Foundations,Eberhard Riese,70.0,3.8,15
3,Professional Secrets,Geoffrey Durham,295.0,4.9,15
4,5 Times 5,Richard Kaufman,34.65,4.0,15
5,The Spirit of the Quakers,Geoffrey Durham,9.43,3.1,15
6,What do Quakers Believe?,Geoffrey Durham,5.94,3.4,15


Using this method always adds the new column at the end. To add a new column somewhere in the middle of the DataFrame use ``insert()`` instead and specify the numerical position that the new column should be in using the ``loc`` parameter. The name of the column is given in the ``column`` parameter. If a column with the same name exists this will cause an error. To allow a new column with the same name as an existing column use the optional parameter ``allow_duplicates`` with the value ``True``. 

In [6]:
subtitles = ['Everything you need to know about magic']*5+['Everything you need to know about Quakers']*2
df.insert(loc=1, column='sub-title', value=subtitles)

df

Unnamed: 0,title,sub-title,author,price,ratting,new column
0,Tangled Web,Everything you need to know about magic,Eric Mead,40.0,4.5,15
1,Close Up,Everything you need to know about magic,David Stone,23.1,3.7,15
2,Foundations,Everything you need to know about magic,Eberhard Riese,70.0,3.8,15
3,Professional Secrets,Everything you need to know about magic,Geoffrey Durham,295.0,4.9,15
4,5 Times 5,Everything you need to know about magic,Richard Kaufman,34.65,4.0,15
5,The Spirit of the Quakers,Everything you need to know about Quakers,Geoffrey Durham,9.43,3.1,15
6,What do Quakers Believe?,Everything you need to know about Quakers,Geoffrey Durham,5.94,3.4,15
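
To see the duplicate-name behaviour described above, here is a short sketch on an invented two-column DataFrame. The first `insert()` is expected to fail, the second succeeds because of ``allow_duplicates=True``.

```python
import pandas as pd

demo = pd.DataFrame({"title": ["A", "B"], "price": [10, 20]})

# Inserting a column whose name already exists raises a ValueError by default
try:
    demo.insert(loc=1, column="price", value=[1, 2])
except ValueError as err:
    print("insert failed:", err)

# allow_duplicates=True permits the repeated name
demo.insert(loc=1, column="price", value=[1, 2], allow_duplicates=True)
print(list(demo.columns))
```

Note that `insert()` modifies the DataFrame in place and returns nothing, unlike `drop()`.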


Those two newest columns are not really very useful so let's delete them again. The ``drop()`` method works for deleting columns as well as rows: just specify the ``columns`` parameter instead of the ``index`` parameter. Recall that ``drop()`` does not modify the original DataFrame but returns a new one.
The returned DataFrame can be assigned to the existing variable to overwrite it, or we can use the optional ``inplace`` parameter to tell pandas that we want to change the original DataFrame.
An optional ``inplace`` parameter is available on several DataFrame methods which return a new DataFrame by default. Check the documentation for the method you want to use if you're not sure.

In [7]:
df.drop(columns = ['sub-title', 'new column'], inplace=True)

df

Unnamed: 0,title,author,price,ratting
0,Tangled Web,Eric Mead,40.0,4.5
1,Close Up,David Stone,23.1,3.7
2,Foundations,Eberhard Riese,70.0,3.8
3,Professional Secrets,Geoffrey Durham,295.0,4.9
4,5 Times 5,Richard Kaufman,34.65,4.0
5,The Spirit of the Quakers,Geoffrey Durham,9.43,3.1
6,What do Quakers Believe?,Geoffrey Durham,5.94,3.4


To update many values in a column you could simply define a new column and delete the old one. To update a single value use the ``at`` accessor: give the row label and column name in square brackets, then assign the new value.

In [8]:
df.at[1, 'price'] = 10
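
When several values need changing at once, defining a whole new column is not the only option. As a sketch, the ``loc`` indexer accepts a Boolean condition to pick the rows and a column label to pick the column. The data and the replacement price below are invented for the example.

```python
import pandas as pd

demo = pd.DataFrame({"title": ["A", "B", "C"], "price": [40.0, 23.1, 70.0]})

# at: update exactly one cell, addressed by row label and column label
demo.at[1, "price"] = 10.0

# loc with a Boolean mask: update every row that meets the condition
demo.loc[demo["price"] >= 40, "price"] = 39.99

print(demo)
```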

## Update Index and Column Names

Whoops! There's a spelling mistake in one of the column names. We should change the column 'ratting' to 'rating'. To do this you can simply specify a list of all the correct column names and assign it to the ``columns`` attribute of the DataFrame. A quicker way, if only one or two column names need to be updated, is to use the ``rename()`` method.

In [9]:
df.rename(columns={'ratting':'rating'}, inplace=True)

df

Unnamed: 0,title,author,price,rating
0,Tangled Web,Eric Mead,40.0,4.5
1,Close Up,David Stone,10.0,3.7
2,Foundations,Eberhard Riese,70.0,3.8
3,Professional Secrets,Geoffrey Durham,295.0,4.9
4,5 Times 5,Richard Kaufman,34.65,4.0
5,The Spirit of the Quakers,Geoffrey Durham,9.43,3.1
6,What do Quakers Believe?,Geoffrey Durham,5.94,3.4
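
The first approach mentioned above, assigning a complete list of names to the ``columns`` attribute, might look like the following sketch. The list must contain one name for every column, in order. The DataFrame here is a cut-down stand-in.

```python
import pandas as pd

demo = pd.DataFrame({"title": ["A"], "price": [40.0], "ratting": [4.5]})

# Replace all column names at once; the list length must match the column count
demo.columns = ["title", "price", "rating"]

# rename returns a new DataFrame unless inplace=True is given
renamed = demo.rename(columns={"price": "cost"})

print(list(demo.columns))
print(list(renamed.columns))
```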


The row labels (index) can also be updated using either of these methods.

In [10]:
df.rename(index={1:'a'}, inplace=True)

df

Unnamed: 0,title,author,price,rating
0,Tangled Web,Eric Mead,40.0,4.5
a,Close Up,David Stone,10.0,3.7
2,Foundations,Eberhard Riese,70.0,3.8
3,Professional Secrets,Geoffrey Durham,295.0,4.9
4,5 Times 5,Richard Kaufman,34.65,4.0
5,The Spirit of the Quakers,Geoffrey Durham,9.43,3.1
6,What do Quakers Believe?,Geoffrey Durham,5.94,3.4


The row numbers can be reset to use a ``RangeIndex``, which is the index Pandas uses by default when creating a DataFrame.

In [11]:
df.index = pd.RangeIndex(start=0, stop=7, step=1)

df

Unnamed: 0,title,author,price,rating
0,Tangled Web,Eric Mead,40.0,4.5
1,Close Up,David Stone,10.0,3.7
2,Foundations,Eberhard Riese,70.0,3.8
3,Professional Secrets,Geoffrey Durham,295.0,4.9
4,5 Times 5,Richard Kaufman,34.65,4.0
5,The Spirit of the Quakers,Geoffrey Durham,9.43,3.1
6,What do Quakers Believe?,Geoffrey Durham,5.94,3.4
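
Building the ``RangeIndex`` by hand requires knowing the number of rows. As an alternative sketch, ``reset_index(drop=True)`` produces the same default numbering for any length; ``drop=True`` discards the old labels rather than keeping them as a new column. The DataFrame is again a small stand-in.

```python
import pandas as pd

demo = pd.DataFrame({"title": ["A", "B", "C"]}, index=[0, "a", 2])

# Returns a new DataFrame whose rows are numbered 0, 1, 2, ...
demo = demo.reset_index(drop=True)

print(demo.index)
```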


## Calculations Based on DataFrames

In the last workbook we looked at selecting rows based on their values, for example:

In [12]:
df[df['price'] >= 40]

Unnamed: 0,title,author,price,rating
0,Tangled Web,Eric Mead,40.0,4.5
2,Foundations,Eberhard Riese,70.0,3.8
3,Professional Secrets,Geoffrey Durham,295.0,4.9


Notice that ``df['price']`` is not a single number. It is a pandas Series. So how is it possible to make a numerical comparison here? The short answer is that pandas performs the comparison for each item in the Series and returns a Series of True/False values. Similarly, we can perform arithmetic with Series and pandas will apply the operation to each member of the Series.

In [13]:
df['price']*2

0     80.00
1     20.00
2    140.00
3    590.00
4     69.30
5     18.86
6     11.88
Name: price, dtype: float64

We can even add two Series together and pandas will add the corresponding members. Members are matched by index label, so this works as expected when both Series have the same index and contain numerical values; a label present in only one Series gives ``NaN`` in the result.

In [14]:
df['price'] + df['rating']

0     44.50
1     13.70
2     73.80
3    299.90
4     38.65
5     12.53
6      9.34
dtype: float64
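
Corresponding members are paired by index label, not by position. A small sketch with two invented Series of different lengths shows what happens when a label is missing from one of them:

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=[0, 1, 2])
b = pd.Series([10.0, 20.0], index=[1, 2])

# Label 0 exists only in `a`, so the result at that label is NaN
total = a + b
print(total)
```

Columns taken from the same DataFrame always share an index, which is why the addition above lines up row by row.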

There are also functions for finding the 
- max
- min
- mean
- median
- mode
- sum

of a Series or of every row/column in a DataFrame.

In [15]:
# max over a single Series
df['price'].max()

295.0

In [16]:
# max over a DataFrame
# axis=0 indicates to find the max for each column, comparing the different rows. 
# where the entries are strings the maximum value is the last one alphabetically. 
df.max(axis=0)

title     What do Quakers Believe?
author             Richard Kaufman
price                          295
rating                         4.9
dtype: object
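
The other functions in the list follow the same pattern. A brief sketch on invented numeric data, using ``axis=1`` to work across the columns of each row:

```python
import pandas as pd

demo = pd.DataFrame({"price": [40.0, 10.0, 70.0], "rating": [4.5, 3.7, 3.8]})

# One result per column (axis=0 is the default)
print(demo.mean())
print(demo.median())

# One result per row, combining the columns
row_sums = demo.sum(axis=1)
print(row_sums)

# mode returns a Series because several values can be equally common
print(demo["price"].mode())
```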

Now we have everything we need to define new columns based on the old ones. Select the information you're interested in, manipulate it, and assign the new values to a new column in the DataFrame. Let's create a column which, instead of the exact price, categorises a book as low, medium or high cost, and let's add this column immediately after the existing price column.

In [17]:
costs = []
for price in df["price"]:
    if price >= 40:
        costs.append('high')
    elif price >= 10:
        costs.append('medium')
    else:
        costs.append('low')

df.insert(loc=3, column='cost', value=costs)

df

Unnamed: 0,title,author,price,cost,rating
0,Tangled Web,Eric Mead,40.0,high,4.5
1,Close Up,David Stone,10.0,medium,3.7
2,Foundations,Eberhard Riese,70.0,high,3.8
3,Professional Secrets,Geoffrey Durham,295.0,high,4.9
4,5 Times 5,Richard Kaufman,34.65,medium,4.0
5,The Spirit of the Quakers,Geoffrey Durham,9.43,low,3.1
6,What do Quakers Believe?,Geoffrey Durham,5.94,low,3.4
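
The loop above is easy to follow. As an alternative sketch, ``pd.cut()`` assigns each value to a labelled interval in one call. The bin edges below mirror the thresholds in the loop, and ``right=False`` makes each interval include its lower edge, so 40 counts as 'high' and 10 as 'medium'.

```python
import pandas as pd

prices = pd.Series([40.0, 10.0, 70.0, 295.0, 34.65, 9.43, 5.94])

# Intervals: [-inf, 10) -> low, [10, 40) -> medium, [40, inf) -> high
costs = pd.cut(
    prices,
    bins=[float("-inf"), 10, 40, float("inf")],
    labels=["low", "medium", "high"],
    right=False,
)

print(costs)
```

The result is a Series with a categorical data type, which can be passed to ``insert()`` in the same way as the list.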


## Exercise 1

The next cell contains code to generate a DataFrame. The data is the number of cases of Tuberculosis (TB) in three countries over three years. 

Add columns for the total and average number of cases over the three years. 

In [18]:
df = pd.DataFrame(
    {
        '2011':[7000,5800,15000],
        '2012':[6900,6000,14000],
        '2013':[7000,6200,13000]
    },
    index=['France','Germany','United States of America']
)

df

Unnamed: 0,2011,2012,2013
France,7000,6900,7000
Germany,5800,6000,6200
United States of America,15000,14000,13000


### 1b

What if we had a DataFrame with some missing or invalid entries? 
The DataFrame below contains an entry ``pd.NA``, which is a special value in Pandas meaning 'not available'. 
It stands in for data which is missing. 

Having missing values could make it difficult to calculate a correct average. See if your method from part (a) still works. If it doesn't, then find a different way to calculate the average ignoring the missing value. For France the average should be $\frac{6900+7000}{2} = 6950$, but don't simply do this calculation! Find a method which will work when you don't know how many of the values are missing.

In [19]:
df = pd.DataFrame(
    {
        '2011':[pd.NA,5800,15000],
        '2012':[6900,6000,14000],
        '2013':[7000,6200,13000]
    },
    index=['France','Germany','United States of America']
)

df

Unnamed: 0,2011,2012,2013
France,,6900,7000
Germany,5800.0,6000,6200
United States of America,15000.0,14000,13000


### 1c

Finally, consider the possibility that the data you imported has some bad values in it. Here, for example, we know that the number of cases of TB can't be negative and must be a whole number. If any values are present which don't fit these criteria then we should highlight them. We might then choose to ignore these values in any calculations. 

Create a function to check the quality of data in the DataFrame, replacing bad values with the special value ``pd.NA``. Then add columns for the sum and average of each row as before. 

In [20]:
df = pd.DataFrame(
    {
        '2011':[pd.NA,5800,15000],
        '2012':[6900,60.3,14000],
        '2013':[7000,6200,-40]
    },
    index=['France','Germany','United States of America']
)

df

Unnamed: 0,2011,2012,2013
France,,6900.0,7000
Germany,5800.0,60.3,6200
United States of America,15000.0,14000.0,-40


## Modifying Data Types

When creating your DataFrame think about what data type each of the columns should have and, if possible, ensure that the data is passed into the constructor as the correct type. 
Recall that the data types of each column can be seen with ``info()``.

In [21]:
df = pd.DataFrame(
    {
        'number': ['1', '2', '1.1', '14', '5.32234', '6', '7'],
        'start_date': ['10-10-2019','4-3-2010','9-10-1999','10-01-2019','2-04-2018','17-11-2003','10-10-1981'],
        'year': ['2019', '2010', '1999', '2019', '2018', '2003', '1981'],
        'location': ['north', 'south', 'east', 'west', 'north', 'south', 'east']
    }
)
df

Unnamed: 0,number,start_date,year,location
0,1,10-10-2019,2019,north
1,2,4-3-2010,2010,south
2,1.1,9-10-1999,1999,east
3,14,10-01-2019,2019,west
4,5.32234,2-04-2018,2018,north
5,6,17-11-2003,2003,south
6,7,10-10-1981,1981,east


In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   number      7 non-null      object
 1   start_date  7 non-null      object
 2   year        7 non-null      object
 3   location    7 non-null      object
dtypes: object(4)
memory usage: 352.0+ bytes


See that the 'number' column has data type ``object``. For some reason the data was provided as strings when clearly it is numeric. The pandas function ``to_numeric()`` takes a Series and returns a new Series in which each element has been made numeric. The new Series can then be assigned to the correct column of the DataFrame. Note that if any value in the original Series (column) does not 'look like' a number, i.e. pandas doesn't know how to convert it, this function raises an exception by default. Also, pandas will decide whether the new data type should be an integer or a float based on the data in the Series.

In [23]:
df['number'] = pd.to_numeric(df['number'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   number      7 non-null      float64
 1   start_date  7 non-null      object 
 2   year        7 non-null      object 
 3   location    7 non-null      object 
dtypes: float64(1), object(3)
memory usage: 352.0+ bytes
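
If raising an exception is not what you want, ``to_numeric()`` has an ``errors`` parameter. A small sketch with an invented Series: ``errors='coerce'`` turns any value that cannot be converted into ``NaN`` instead of stopping.

```python
import pandas as pd

raw = pd.Series(["1", "2.5", "unknown"])

# Default behaviour (errors="raise"): the bad value stops the conversion
try:
    pd.to_numeric(raw)
except ValueError as err:
    print("conversion failed:", err)

# errors="coerce": unconvertible values become NaN
numbers = pd.to_numeric(raw, errors="coerce")
print(numbers)
```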


A similar function, ``to_datetime()``, converts a string representation of a date into a pandas datetime value (data type ``datetime64[ns]``). We've already seen that the Python datetime library can help when working with dates; now we see that pandas also has many methods and clever features for working with dates. You can find out more in the documentation, as it's not a topic we will go into in great detail here.

In [24]:
df['start_date'] = pd.to_datetime(df['start_date'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   number      7 non-null      float64       
 1   start_date  7 non-null      datetime64[ns]
 2   year        7 non-null      object        
 3   location    7 non-null      object        
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 352.0+ bytes


Now see what happens when we use the same method on the 'year' column.

In [25]:
df['year'] = pd.to_datetime(df['year'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   number      7 non-null      float64       
 1   start_date  7 non-null      datetime64[ns]
 2   year        7 non-null      datetime64[ns]
 3   location    7 non-null      object        
dtypes: datetime64[ns](2), float64(1), object(1)
memory usage: 352.0+ bytes


In [26]:
df

Unnamed: 0,number,start_date,year,location
0,1.0,2019-10-10,2019-01-01,north
1,2.0,2010-04-03,2010-01-01,south
2,1.1,1999-09-10,1999-01-01,east
3,14.0,2019-10-01,2019-01-01,west
4,5.32234,2018-02-04,2018-01-01,north
5,6.0,2003-11-17,2003-01-01,south
6,7.0,1981-10-10,1981-01-01,east
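
Look closely at the ``start_date`` column above: `4-3-2010` was read as 3 April (month first), while `17-11-2003` could only be read as 17 November (day first). Depending on the pandas version, mixed input like this may instead produce a warning or an error. If the strings are known to be day-month-year (an assumption here, since the origin of the data is not stated), the ``format`` parameter removes the guesswork, as in this sketch:

```python
import pandas as pd

dates = pd.Series(["10-10-2019", "4-3-2010", "17-11-2003"])

# %d-%m-%Y means day, month, four-digit year, separated by hyphens
parsed = pd.to_datetime(dates, format="%d-%m-%Y")

print(parsed)
```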


Finally, see that the 'location' column seems to only allow a few possible values. We can make this into a ``Categorical`` data type. If we had a larger DataFrame and couldn't conveniently look at all of the data directly there are a couple of useful methods for checking the possible values in a column.

In [27]:
df['location'].nunique()

4

In [28]:
df['location'].value_counts()

south    2
east     2
north    2
west     1
Name: location, dtype: int64

In [29]:
df['location'] = pd.Categorical(df["location"])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   number      7 non-null      float64       
 1   start_date  7 non-null      datetime64[ns]
 2   year        7 non-null      datetime64[ns]
 3   location    7 non-null      category      
dtypes: category(1), datetime64[ns](2), float64(1)
memory usage: 495.0 bytes
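
Once a column is categorical, the ``cat`` accessor gives access to the list of categories and to the integer codes pandas stores for each entry. A short sketch on an invented Series, using ``astype('category')`` as another way to make the conversion:

```python
import pandas as pd

locations = pd.Series(["north", "south", "east", "west", "north"])

# Convert the strings to a categorical data type
locations = locations.astype("category")

# The distinct categories, sorted alphabetically by default
print(list(locations.cat.categories))

# Each entry is stored internally as a small integer code
print(list(locations.cat.codes))
```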


# Solutions to Exercises
## 1
### 1a
In the first part it is acceptable to manually calculate the sum and average for each row.

In [30]:
df = pd.DataFrame(
    {
        '2011':[7000,5800,15000],
        '2012':[6900,6000,14000],
        '2013':[7000,6200,13000]
    },
    index=['France','Germany','United States of America']
)

total = df['2011'] + df['2012'] + df['2013']
df.insert(value=total, column='total', loc=3)

average = round(total/3,1)
df.insert(value=average, column='average', loc=4)
df

Unnamed: 0,2011,2012,2013,total,average
France,7000,6900,7000,20900,6966.7
Germany,5800,6000,6200,18000,6000.0
United States of America,15000,14000,13000,42000,14000.0


### 1b

Now we need to use the built-in functions for finding the sum and average so that the ``pd.NA`` value is handled correctly.

In [31]:
df = pd.DataFrame(
    {
        '2011':[pd.NA,5800,15000],
        '2012':[6900,6000,14000],
        '2013':[7000,6200,13000]
    },
    index=['France','Germany','United States of America']
)

# axis=1 indicates we're finding the sum for each row, across multiple columns 
# df.iloc selects the three columns we want to calculate the sum across (and all rows)
total = df.iloc[:, 0:3].sum(axis=1)
df.insert(value=total, column='total', loc=3)

# axis=1 indicates we're finding the mean for each row, across multiple columns
# df.iloc selects the three columns we want to calculate the mean across (and all rows)
average = round(df.iloc[:, 0:3].mean(axis=1),1)
df.insert(value=average, column='average', loc=4)
df

Unnamed: 0,2011,2012,2013,total,average
France,,6900,7000,13900.0,6950.0
Germany,5800.0,6000,6200,18000.0,6000.0
United States of America,15000.0,14000,13000,42000.0,14000.0


### 1c

In this part we need to check for any value which is negative or not a whole number and remove it from the data before calculating the sum and average.

In [32]:
df = pd.DataFrame(
    {
        '2011':[pd.NA,5800,15000],
        '2012':[6900,60.3,14000],
        '2013':[7000,6200,-40]
    },
    index=['France','Germany','United States of America']
)

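# Replace any value that is negative or not a whole number with pd.NA
for col_name, data in df.items():
    for i, d in enumerate(data):
        # Entries that are already missing are left untouched
        if d is not pd.NA and (int(d) != d or d < 0):
            df.at[df.index[i], col_name] = pd.NA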

total = df.iloc[:, 0:3].sum(axis=1)
df.insert(value=total, column='total', loc=3)

average = round(df.iloc[:, 0:3].mean(axis=1),1)
df.insert(value=average, column='average', loc=4)
df

Unnamed: 0,2011,2012,2013,total,average
France,,6900.0,7000.0,13900.0,6950.0
Germany,5800.0,,6200.0,12000.0,6000.0
United States of America,15000.0,14000.0,,29000.0,14500.0
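
The exercise asks for a function, so here is one possible vectorised sketch. It is not the loop above rewritten line for line: the hypothetical ``remove_bad_values`` helper uses ``DataFrame.where()``, and for simplicity the example marks missing data with ``NaN`` (via ``float('nan')``) rather than ``pd.NA``.

```python
import pandas as pd


def remove_bad_values(frame):
    """Return a copy in which negative or non-whole numbers are replaced by NaN."""
    # True where the value is a whole number that is zero or greater;
    # comparisons with NaN are False, so missing values stay missing
    valid = (frame >= 0) & (frame % 1 == 0)
    return frame.where(valid)


cases = pd.DataFrame(
    {
        "2011": [float("nan"), 5800, 15000],
        "2012": [6900, 60.3, 14000],
        "2013": [7000, 6200, -40],
    },
    index=["France", "Germany", "United States of America"],
)

cleaned = remove_bad_values(cases)

# Compute both summaries before adding columns so neither includes the other
total = cleaned.sum(axis=1)
average = cleaned.mean(axis=1).round(1)
cleaned["total"] = total
cleaned["average"] = average

print(cleaned)
```

Keeping the check in a function makes it easy to reuse on other DataFrames with the same rules.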
