# Modifying DataFrames

## Adding a Column

In [8]:
import pandas as pd
df = pd.DataFrame([
  [1, '3 inch screw', 0.5, 0.75],
  [2, '2 inch nail', 0.10, 0.25],
  [3, 'hammer', 3.00, 5.50],
  [4, 'screwdriver', 2.50, 3.00]
],
  columns=['Product ID', 'Description', 'Cost to Manufacture', 'Price']
)

In [9]:
# Add columns 
df['Sold in Bulk?'] = ['Yes', 'Yes', 'No', 'No']
df

,Product ID,Description,Cost to Manufacture,Price,Sold in Bulk?
0,1,3 inch screw,0.5,0.75,Yes
1,2,2 inch nail,0.1,0.25,Yes
2,3,hammer,3.0,5.5,No
3,4,screwdriver,2.5,3.0,No
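
When a column is assigned from a list, pandas expects one value per row. The sketch below uses a hypothetical two-row `inventory` table (not the `df` above) to show that a list of the wrong length is rejected with a `ValueError` rather than being padded or truncated:

```python
import pandas as pd

# Hypothetical two-row table, used only for illustration
inventory = pd.DataFrame({'Product ID': [1, 2], 'Price': [0.75, 0.25]})

# A list with exactly one value per row is accepted
inventory['Sold in Bulk?'] = ['Yes', 'No']

# A list with three values for two rows raises ValueError
try:
    inventory['In Stock?'] = ['Yes', 'No', 'Yes']
    length_error = None
except ValueError as err:
    length_error = err

print(length_error)
```

The failed assignment leaves `inventory` unchanged, so only the `Sold in Bulk?` column is added.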


We can also add a new column that is the same for all rows in the DataFrame. Let’s return to our inventory example:

In [10]:
df['Is taxed?'] = 'Yes'
df

,Product ID,Description,Cost to Manufacture,Price,Sold in Bulk?,Is taxed?
0,1,3 inch screw,0.5,0.75,Yes,Yes
1,2,2 inch nail,0.1,0.25,Yes,Yes
2,3,hammer,3.0,5.5,No,Yes
3,4,screwdriver,2.5,3.0,No,Yes


Finally, you can add a new column by performing a function on the existing columns.

Maybe we want to add a column to our inventory table with the margin on each item, that is, the price minus the cost to manufacture.

In [11]:
# Add a Margin column: price minus cost to manufacture
df['Margin'] = df.Price - df['Cost to Manufacture']
df

,Product ID,Description,Cost to Manufacture,Price,Sold in Bulk?,Is taxed?,Margin
0,1,3 inch screw,0.5,0.75,Yes,Yes,0.25
1,2,2 inch nail,0.1,0.25,Yes,Yes,0.15
2,3,hammer,3.0,5.5,No,Yes,2.5
3,4,screwdriver,2.5,3.0,No,Yes,0.5
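
Arithmetic between a column and a number is applied to every row, so other derived columns follow the same pattern. As a minimal sketch, here is a sales tax column computed on a hypothetical two-row `inventory` table; the 7.5% rate in `TAX_RATE` is an assumption made purely for illustration:

```python
import pandas as pd

# Hypothetical table, used only for illustration
inventory = pd.DataFrame({
    'Description': ['hammer', 'screwdriver'],
    'Price': [5.50, 3.00],
})

TAX_RATE = 0.075  # assumed rate, not taken from the inventory example

# Multiplying a column by a number multiplies every row's value
inventory['Sales Tax'] = inventory['Price'] * TAX_RATE
print(inventory)
```

Bracket notation is needed for `'Sales Tax'` because the name contains a space; attribute access such as `df.Price` only works for column names that are valid Python identifiers.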


*** 

## Performing Column Operations

Often, the column that we want to add is related to existing columns, but requires a calculation more complex than multiplication or addition.

We can use the `apply` method to apply a function to every value in a particular column.

In [13]:
df = pd.DataFrame([
  ['JOHN SMITH', 'john.smith@gmail.com'],
  ['Jane Doe', 'jdoe@yahoo.com'],
  ['joe schmo', 'joeschmo@hotmail.com']
],
columns=['Name', 'Email'])

# Apply str.lower to every name in the 'Name' column and store
# the results in a new column of df called 'Lowercase Name'
df['Lowercase Name'] = df.Name.apply(str.lower)
df

,Name,Email,Lowercase Name
0,JOHN SMITH,john.smith@gmail.com,john smith
1,Jane Doe,jdoe@yahoo.com,jane doe
2,joe schmo,joeschmo@hotmail.com,joe schmo
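
For common string operations, pandas also offers the `.str` accessor, which can stand in for `apply`. The sketch below uses a hypothetical `contacts` table to check that both approaches give the same result:

```python
import pandas as pd

# Hypothetical table, used only for illustration
contacts = pd.DataFrame({'Name': ['JOHN SMITH', 'Jane Doe', 'joe schmo']})

# Two ways to lowercase every value in the column
via_apply = contacts['Name'].apply(str.lower)
via_str = contacts['Name'].str.lower()

# True when both Series hold the same values
print(via_apply.equals(via_str))
```

One practical difference: `.str.lower()` passes missing values through as missing, whereas `apply(str.lower)` raises a `TypeError` when it meets a `NaN`.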


***

### Applying Lambda to a Column

In pandas, we often use lambda functions to perform complex operations on columns. A lambda function can include a conditional expression, which takes this form:

`lambda x: [OUTCOME IF TRUE] if [CONDITIONAL] else [OUTCOME IF FALSE]`

Create a lambda function `get_last_name` that takes a string with someone’s first and last name (e.g., `John Smith`) and returns just the last name (e.g., `Smith`).

In [16]:
df = pd.read_csv('employees.csv')

get_last_name = lambda x: x.split()[-1]

Use the lambda function `get_last_name` to create a new column `last_name` containing only each employee’s last name.

In [17]:
df['last_name'] = df.name.apply(get_last_name)

df

,id,name,hourly_wage,hours_worked,last_name
0,10310,Lauren Durham,19,43,Durham
1,18656,Grace Sellers,17,40,Sellers
2,61254,Shirley Rasmussen,16,30,Rasmussen
3,16886,Brian Rojas,18,47,Rojas
4,89010,Samantha Mosley,11,38,Mosley
5,87246,Louis Guzman,14,39,Guzman
6,20578,Denise Mcclure,15,40,Mcclure
7,12869,James Raymond,15,32,Raymond
8,53461,Noah Collier,18,35,Collier
9,14746,Donna Frederick,20,41,Frederick
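
`get_last_name` does not need a condition, so here is a minimal sketch of the conditional lambda form applied to a column. The `staff` table, the `label_hours` name, and the `'overtime'`/`'regular'` labels are all hypothetical and chosen only for illustration:

```python
import pandas as pd

# Hypothetical table, used only for illustration
staff = pd.DataFrame({
    'name': ['Ann Lee', 'Bo Chan', 'Cy Diaz'],
    'hours_worked': [43, 40, 30],
})

# [OUTCOME IF TRUE] if [CONDITIONAL] else [OUTCOME IF FALSE]
label_hours = lambda hours: 'overtime' if hours > 40 else 'regular'

# apply calls label_hours once for each value in the column
staff['hours_label'] = staff['hours_worked'].apply(label_hours)
print(staff)
```

Exactly 40 hours is labeled `'regular'` because the condition uses a strict `>`.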


### Applying Lambda to a Row

We can also operate on multiple columns at once. If we call `apply` on the DataFrame itself rather than on a single column and add the argument `axis=1`, the input to our lambda function will be an entire row rather than a single column value. To access particular values of the row, we use the syntax `row.column_name` or `row['column_name']`.

If an employee worked for more than 40 hours, she needs to be paid overtime (1.5 times the normal hourly wage).

For instance, if an employee worked for 43 hours and made $10/hour, she would receive $400 for the first 40 hours that she worked, and an additional $45 for the 3 hours of overtime, for a total of $445.

Create a lambda function `total_earned` that accepts an input row with keys `hours_worked` and `hourly_wage` and uses a conditional expression to calculate the total wages earned.

In [19]:
# Pay the first 40 hours at the normal wage and any extra hours at 1.5 times the wage
total_earned = lambda row: (
    row.hourly_wage * 40 + row.hourly_wage * 1.5 * (row.hours_worked - 40)
    if row.hours_worked > 40
    else row.hourly_wage * row.hours_worked
)

df['total_earned'] = df.apply(total_earned, axis=1)

df

,id,name,hourly_wage,hours_worked,last_name,total_earned
0,10310,Lauren Durham,19,43,Durham,845.5
1,18656,Grace Sellers,17,40,Sellers,680.0
2,61254,Shirley Rasmussen,16,30,Rasmussen,480.0
3,16886,Brian Rojas,18,47,Rojas,909.0
4,89010,Samantha Mosley,11,38,Mosley,418.0
5,87246,Louis Guzman,14,39,Guzman,546.0
6,20578,Denise Mcclure,15,40,Mcclure,600.0
7,12869,James Raymond,15,32,Raymond,480.0
8,53461,Noah Collier,18,35,Collier,630.0
9,14746,Donna Frederick,20,41,Frederick,830.0
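
When the logic gets long, a named function can be easier to read than a lambda, and `apply` accepts either. The sketch below is one possible equivalent that uses the `row['column_name']` syntax; the function name `total_earned_fn` and the small `shifts` table are hypothetical, and the first row reproduces the 43 hours at $10/hour example from above:

```python
import pandas as pd

def total_earned_fn(row):
    """Return total pay, with hours beyond 40 paid at 1.5 times the wage."""
    wage = row['hourly_wage']
    hours = row['hours_worked']
    if hours > 40:
        return wage * 40 + wage * 1.5 * (hours - 40)
    return wage * hours

# Hypothetical table, used only for illustration
shifts = pd.DataFrame({'hourly_wage': [10, 10], 'hours_worked': [43, 30]})

# axis=1 passes each row to the function, one at a time
shifts['total_earned'] = shifts.apply(total_earned_fn, axis=1)
print(shifts)
```

The first row comes to 400 + 45 = 445, matching the worked example, and the second row has no overtime, so it is simply 10 × 30 = 300.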


## Renaming Columns

In [20]:
df = pd.read_csv('imdb.csv')

df

,id,name,genre,year,imdb_rating
0,1,Avatar,action,2009,7.9
1,2,Jurassic World,action,2015,7.3
2,3,The Avengers,action,2012,8.1
3,4,The Dark Knight,action,2008,9.0
4,5,Star Wars: Episode I - The Phantom Menace,action,1999,6.6
...,...,...,...,...,...
215,216,Hannibal,drama,2001,6.7
216,217,Catch Me If You Can,drama,2002,8.0
217,218,Big Daddy,drama,1999,6.4
218,219,Se7en,drama,1995,8.6


In [21]:
# Rename columns here
df.columns = ['ID', 'Title', 'Category', 'Year Released', 'Rating']
df

,ID,Title,Category,Year Released,Rating
0,1,Avatar,action,2009,7.9
1,2,Jurassic World,action,2015,7.3
2,3,The Avengers,action,2012,8.1
3,4,The Dark Knight,action,2008,9.0
4,5,Star Wars: Episode I - The Phantom Menace,action,1999,6.6
...,...,...,...,...,...
215,216,Hannibal,drama,2001,6.7
216,217,Catch Me If You Can,drama,2002,8.0
217,218,Big Daddy,drama,1999,6.4
218,219,Se7en,drama,1995,8.6


You can also rename individual columns by using the `.rename` method. Pass a dictionary to the `columns` keyword argument that maps each old column name to its new name.

Using `rename` with only the `columns` keyword will create a __new__ DataFrame, leaving your original DataFrame unchanged. Passing the keyword argument `inplace=True` as well lets us edit the original DataFrame instead.

There are two reasons why `.rename` is preferable to `.columns`:

- You can rename just one column
- You can be specific about which column names are getting changed (with `.columns` you can accidentally switch column names if you’re not careful)

__Note:__ If you misspell one of the original column names, this command won’t fail. It just won’t change anything.

In [23]:
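# 'name' was already replaced by 'Title' in the previous cell, so this
# rename matches nothing and silently leaves df unchanged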
df.rename(columns={
    'name': 'movie_title'},
    inplace=True)
df

,ID,Title,Category,Year Released,Rating
0,1,Avatar,action,2009,7.9
1,2,Jurassic World,action,2015,7.3
2,3,The Avengers,action,2012,8.1
3,4,The Dark Knight,action,2008,9.0
4,5,Star Wars: Episode I - The Phantom Menace,action,1999,6.6
...,...,...,...,...,...
215,216,Hannibal,drama,2001,6.7
216,217,Catch Me If You Can,drama,2002,8.0
217,218,Big Daddy,drama,1999,6.4
218,219,Se7en,drama,1995,8.6
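
The two behaviors described above can be checked on a small example. This is a minimal sketch using a hypothetical `movies` table, and it assumes a pandas version whose `rename` supports the `errors` parameter; passing `errors='raise'` turns a misspelled column name into a `KeyError` instead of a silent no-op:

```python
import pandas as pd

# Hypothetical table, used only for illustration
movies = pd.DataFrame({'name': ['Avatar', 'Se7en'], 'year': [2009, 1995]})

# Without inplace=True, rename returns a new DataFrame
# and leaves the original untouched
renamed = movies.rename(columns={'name': 'movie_title'})

# 'nmae' is a deliberate misspelling: errors='raise' makes it fail loudly
try:
    movies.rename(columns={'nmae': 'movie_title'}, errors='raise')
    rename_error = None
except KeyError as err:
    rename_error = err

print(list(movies.columns))
print(list(renamed.columns))
print(type(rename_error).__name__)
```

Assigning the result to a new name, as with `renamed` here, is an alternative to `inplace=True` that keeps the original DataFrame available.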
