### How to modify an existing DataFrame


Modification involves:
======================

Adding columns to a DataFrame

Using lambda functions to calculate complex quantities

Renaming columns

In [None]:
# Question: When would we need to modify a dataframe in Pandas?

When working with Pandas, we often need to modify a dataframe after it has been created.

These are some of the most common reasons to modify a dataframe in Pandas.

    
1. Adding a new row to the dataframe.
==================================
One important reason for modification is when we need to add a new entry to the table, which is usually referred to as a row.

2. Adding a new column.
==================================
In Pandas, columns are similar to columns in an SQL table: each column holds one kind of value for every row. A common modification is adding a new column when we want the dataframe to carry more information.

3. Renaming a column.
==================================
We may need to rename a column to something else that makes the data more clear to users. For example, if we had a dataframe of information regarding movies, and the column name for the movie titles was simply called “name”, this might not be obvious. We might rename the column to something clearer like “movie_title”.

4. Modifying a specific row of data.
==================================
We sometimes need to update a specific row, or even multiple rows, in a dataframe.
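
As a rough sketch of the first and last cases in the list above, the snippet below adds a row with `pd.concat` and updates one row with `.loc`. The `movies` frame, its column names, and its values are made up for illustration; they are not part of the datasets used in this notebook.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
movies = pd.DataFrame({
    'movie_title': ['Movie A', 'Movie B'],
    'rating': [7.5, 8.0],
})

# Add a new row: build a one-row DataFrame and concatenate it.
new_row = pd.DataFrame({'movie_title': ['Movie C'], 'rating': [6.9]})
movies = pd.concat([movies, new_row], ignore_index=True)

# Modify a specific row: select it with a boolean mask and assign with .loc.
movies.loc[movies['movie_title'] == 'Movie B', 'rating'] = 8.5

print(movies)
```

`ignore_index=True` renumbers the rows 0, 1, 2 so the new row does not reuse index 0.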

    

In [2]:
import pandas as pd

In [7]:
orders_df = pd.read_csv(r"D:\GIT_Repositories\pandas\shoefly_orders_2.csv")

In [8]:
# Let's examine the first 10 rows of our data!

orders_df.head(10)

Unnamed: 0,id,first_name,last_name,email,shoe_type,shoe_material,shoe_color
0,74970,Albert,Crane,ACrane1998@gmail.com,ballet flats,fabric,white
1,28140,Lauren,Wise,LWise1972@gmail.com,Sandals,fabric,black
2,16841,Bryan,Maldonado,Bryan.Maldonado@gmail.com,STILLETTOS,fabric,navy
3,19695,Nancy,Talley,NTalley1988@gmail.com,wedges,fabric,black
4,61287,Catherine,Brown,CBrown1981@gmail.com,Boots,faux-leather,white
5,57141,Doris,Newton,DN6902@gmail.com,Sandals,fabric,navy
6,52132,Billy,Mcintyre,BM6854@hotmail.com,wedges,faux-leather,navy
7,25486,Sarah,Kaufman,SK8490@gmail.com,Sandals,leather,navy
8,12421,Susan,Leblanc,SusanLeblanc40@outlook.com,CLOGS,fabric,red
9,36234,Benjamin,Newton,BenjaminNewton59@aol.com,CLOGS,faux-leather,navy



Some of the shoe types are in all caps, some are capitalized, and others are all lower case. This is messy, and we would like to clean it up.
We can do this by applying str.lower to the shoe_type column, either through the .str accessor or with apply().


In [10]:
orders_df['shoe_type'].str.lower()

0    ballet flats
1         sandals
2      stillettos
3          wedges
4           boots
5         sandals
6          wedges
7         sandals
8           clogs
9           clogs
Name: shoe_type, dtype: object

In [11]:
orders_df['shoe_type'].apply(str.lower)

0    ballet flats
1         sandals
2      stillettos
3          wedges
4           boots
5         sandals
6          wedges
7         sandals
8           clogs
9           clogs
Name: shoe_type, dtype: object

In [13]:
# Update Original Dataframe

orders_df['shoe_type'] = orders_df['shoe_type'].apply(str.lower)

orders_df.head(10)

Unnamed: 0,id,first_name,last_name,email,shoe_type,shoe_material,shoe_color
0,74970,Albert,Crane,ACrane1998@gmail.com,ballet flats,fabric,white
1,28140,Lauren,Wise,LWise1972@gmail.com,sandals,fabric,black
2,16841,Bryan,Maldonado,Bryan.Maldonado@gmail.com,stillettos,fabric,navy
3,19695,Nancy,Talley,NTalley1988@gmail.com,wedges,fabric,black
4,61287,Catherine,Brown,CBrown1981@gmail.com,boots,faux-leather,white
5,57141,Doris,Newton,DN6902@gmail.com,sandals,fabric,navy
6,52132,Billy,Mcintyre,BM6854@hotmail.com,wedges,faux-leather,navy
7,25486,Sarah,Kaufman,SK8490@gmail.com,sandals,leather,navy
8,12421,Susan,Leblanc,SusanLeblanc40@outlook.com,clogs,fabric,red
9,36234,Benjamin,Newton,BenjaminNewton59@aol.com,clogs,faux-leather,navy


## Adding columns to a DataFrame

Our factory says that they are not able to stock fabric shoes anymore. 

Let's add a column in_stock, which is True for all non-fabric shoes and False for fabric shoes.

In [29]:
# Check the shoe_material column: False if the material is 'fabric', True otherwise.
# Save the result in a new column named 'in_stock'.

# Pattern: orders_df['in_stock'] = orders_df.<column name>.apply(lambda <value>: False if <value> == 'fabric' else True)


orders_df['in_stock'] = orders_df.shoe_material.apply(lambda x: False if x == 'fabric' else True)


In [30]:
orders_df.head(10)

Unnamed: 0,id,first_name,last_name,email,shoe_type,shoe_material,shoe_color,in_stock,description
0,74970,Albert,Crane,ACrane1998@gmail.com,ballet flats,fabric,white,False,Albert Crane bought white fabric ballet flats
1,28140,Lauren,Wise,LWise1972@gmail.com,sandals,fabric,black,False,Lauren Wise bought black fabric sandals
2,16841,Bryan,Maldonado,Bryan.Maldonado@gmail.com,stillettos,fabric,navy,False,Bryan Maldonado bought navy fabric stillettos
3,19695,Nancy,Talley,NTalley1988@gmail.com,wedges,fabric,black,False,Nancy Talley bought black fabric wedges
4,61287,Catherine,Brown,CBrown1981@gmail.com,boots,faux-leather,white,True,Catherine Brown bought white faux-leather boots
5,57141,Doris,Newton,DN6902@gmail.com,sandals,fabric,navy,False,Doris Newton bought navy fabric sandals
6,52132,Billy,Mcintyre,BM6854@hotmail.com,wedges,faux-leather,navy,True,Billy Mcintyre bought navy faux-leather wedges
7,25486,Sarah,Kaufman,SK8490@gmail.com,sandals,leather,navy,True,Sarah Kaufman bought navy leather sandals
8,12421,Susan,Leblanc,SusanLeblanc40@outlook.com,clogs,fabric,red,False,Susan Leblanc bought red fabric clogs
9,36234,Benjamin,Newton,BenjaminNewton59@aol.com,clogs,faux-leather,navy,True,Benjamin Newton bought navy faux-leather clogs
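
The in_stock cell above tests each value with a lambda. As a sketch of an alternative, comparing the column directly produces the same boolean values without `apply`. The `orders` frame below is a small stand-in with made-up rows, not the real shoefly_orders_2.csv data.

```python
import pandas as pd

# Small stand-in for orders_df; the values are illustrative only.
orders = pd.DataFrame({'shoe_material': ['fabric', 'leather', 'faux-leather', 'fabric']})

# Element-wise version, as in the in_stock cell.
via_apply = orders['shoe_material'].apply(lambda x: False if x == 'fabric' else True)

# Vectorized version: comparing a column to a value yields one boolean per row.
via_compare = orders['shoe_material'] != 'fabric'

orders['in_stock'] = via_compare
print((via_apply == via_compare).all())
```

Both spellings give the same column; the comparison is shorter and avoids calling a Python function once per row.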


Our marketing department wants to announce some purchases on our Twitter feed.

Let's add a description to each row that they can use. It will show up in a new column called "description".

description  <==  "{first_name} {last_name} bought {shoe_color} {shoe_material} {shoe_type}", built row by row with apply(..., axis=1)

## Using lambda functions to calculate complex quantities

In [26]:
# Add a new column: 'description'

# description  <==  "{first_name} {last_name} bought {shoe_color} {shoe_material} {shoe_type}"
# axis=1 passes each row to the lambda, so several columns can be combined.

orders_df['description'] = orders_df.apply(
    lambda row: "{} {} bought {} {} {}".format(
        row.first_name,
        row.last_name,
        row.shoe_color,
        row.shoe_material,
        row.shoe_type
    ),
    axis=1
)
orders_df.head(10)

Unnamed: 0,id,first_name,last_name,email,shoe_type,shoe_material,shoe_color,in_stock,description
0,74970,Albert,Crane,ACrane1998@gmail.com,ballet flats,fabric,white,False,Albert Crane bought white fabric ballet flats
1,28140,Lauren,Wise,LWise1972@gmail.com,sandals,fabric,black,False,Lauren Wise bought black fabric sandals
2,16841,Bryan,Maldonado,Bryan.Maldonado@gmail.com,stillettos,fabric,navy,False,Bryan Maldonado bought navy fabric stillettos
3,19695,Nancy,Talley,NTalley1988@gmail.com,wedges,fabric,black,False,Nancy Talley bought black fabric wedges
4,61287,Catherine,Brown,CBrown1981@gmail.com,boots,faux-leather,white,True,Catherine Brown bought white faux-leather boots
5,57141,Doris,Newton,DN6902@gmail.com,sandals,fabric,navy,False,Doris Newton bought navy fabric sandals
6,52132,Billy,Mcintyre,BM6854@hotmail.com,wedges,faux-leather,navy,True,Billy Mcintyre bought navy faux-leather wedges
7,25486,Sarah,Kaufman,SK8490@gmail.com,sandals,leather,navy,True,Sarah Kaufman bought navy leather sandals
8,12421,Susan,Leblanc,SusanLeblanc40@outlook.com,clogs,fabric,red,False,Susan Leblanc bought red fabric clogs
9,36234,Benjamin,Newton,BenjaminNewton59@aol.com,clogs,faux-leather,navy,True,Benjamin Newton bought navy faux-leather clogs


## pd.set_option()

display.max_columns : int
==========================
If max_cols is exceeded, switch to truncate view. Depending on large_repr, objects are either centrally truncated or printed as a summary view. 

‘None’ value means unlimited.

In case python/IPython is running in a terminal and large_repr equals ‘truncate’ this can be set to 0 and pandas will auto-detect the width of the terminal and print a truncated object which fits the screen width. 

The IPython notebook, IPython qtconsole, or IDLE do not run in a terminal and hence it is not possible to do correct auto-detection. [default: 20] [currently: 20]


display.max_colwidth : int
==========================
The maximum width in characters of a column in the repr of a pandas data structure. When the column overflows, a ”...” placeholder is embedded in the output. [default: 50] [currently: 50]

In [31]:
# pd.set_option('display.max_columns', None)  # show all columns
pd.set_option('display.max_colwidth', None)  # do not truncate long cell values

orders_df.head(10)

Unnamed: 0,id,first_name,last_name,email,shoe_type,shoe_material,shoe_color,in_stock,description
0,74970,Albert,Crane,ACrane1998@gmail.com,ballet flats,fabric,white,False,Albert Crane bought white fabric ballet flats
1,28140,Lauren,Wise,LWise1972@gmail.com,sandals,fabric,black,False,Lauren Wise bought black fabric sandals
2,16841,Bryan,Maldonado,Bryan.Maldonado@gmail.com,stillettos,fabric,navy,False,Bryan Maldonado bought navy fabric stillettos
3,19695,Nancy,Talley,NTalley1988@gmail.com,wedges,fabric,black,False,Nancy Talley bought black fabric wedges
4,61287,Catherine,Brown,CBrown1981@gmail.com,boots,faux-leather,white,True,Catherine Brown bought white faux-leather boots
5,57141,Doris,Newton,DN6902@gmail.com,sandals,fabric,navy,False,Doris Newton bought navy fabric sandals
6,52132,Billy,Mcintyre,BM6854@hotmail.com,wedges,faux-leather,navy,True,Billy Mcintyre bought navy faux-leather wedges
7,25486,Sarah,Kaufman,SK8490@gmail.com,sandals,leather,navy,True,Sarah Kaufman bought navy leather sandals
8,12421,Susan,Leblanc,SusanLeblanc40@outlook.com,clogs,fabric,red,False,Susan Leblanc bought red fabric clogs
9,36234,Benjamin,Newton,BenjaminNewton59@aol.com,clogs,faux-leather,navy,True,Benjamin Newton bought navy faux-leather clogs
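
`pd.set_option` changes the setting for the rest of the session. If the wider display is only needed for one output, `pd.option_context` can apply it temporarily. The sketch below assumes a throwaway one-column frame with a deliberately long value; the width of 20 is an arbitrary choice for the demonstration.

```python
import pandas as pd

df = pd.DataFrame({'description': ['x' * 80]})  # one deliberately long value

before = pd.get_option('display.max_colwidth')

# Inside each with-block the option is changed; it is restored on exit.
with pd.option_context('display.max_colwidth', 20):
    short_view = repr(df)   # long value is cut off with "..."

with pd.option_context('display.max_colwidth', None):
    full_view = repr(df)    # long value is shown in full

after = pd.get_option('display.max_colwidth')
print(short_view)
print(full_view)
```

After both blocks, `after` holds the same value as `before`, so nothing else in the notebook is affected.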


## Adding a Column I

One way that we can add a new column is by giving a list of the same length as the existing DataFrame.

Example:

Suppose we own a hardware store called The Handy Woman and have a DataFrame containing inventory information:

Product ID	Product Description	  Cost to Manufacture	Price
----------  -------------------   -------------------   -----
1	        3 inch screw	      0.50	                0.75
2	        2 inch nail	          0.10	                0.25
3	        hammer	              3.00	                5.50
4	        screwdriver	          2.50	                3.00

The table is missing the quantity of each product in our warehouse. We can add it as a new column:

df['Quantity'] = [100, 150, 50, 35]

In [34]:
Handy_Woman_df = pd.DataFrame([
    [1, '3 inch screw', 0.50, 0.75],
    [2, '2 inch nail', 0.10, 0.25],
    [3, 'hammer', 3.00, 5.50],
    [4, 'screwdriver', 2.50, 3.00]
], 
    columns = ['Product ID', 'Product Description', 'Cost to Manufacture', 'Price']
)

In [35]:
Handy_Woman_df

Unnamed: 0,Product ID,Product Description,Cost to Manufacture,Price
0,1,3 inch screw,0.5,0.75
1,2,2 inch nail,0.1,0.25
2,3,hammer,3.0,5.5
3,4,screwdriver,2.5,3.0


Now add the Quantity column to Handy_Woman_df:


In [36]:
Handy_Woman_df['Quantity'] = [100, 150, 50, 35]

Our new DataFrame looks like this:

In [37]:
Handy_Woman_df

Unnamed: 0,Product ID,Product Description,Cost to Manufacture,Price,Quantity
0,1,3 inch screw,0.5,0.75,100
1,2,2 inch nail,0.1,0.25,150
2,3,hammer,3.0,5.5,50
3,4,screwdriver,2.5,3.0,35
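
The list must have exactly one value per row. The sketch below, using a cut-down stand-in for the inventory table, shows what happens when the list is too short: pandas raises a `ValueError` and the DataFrame is left without the new column.

```python
import pandas as pd

inventory = pd.DataFrame({
    'Product ID': [1, 2, 3, 4],
    'Price': [0.75, 0.25, 5.50, 3.00],
})

error_message = None
try:
    # Only three values for a four-row DataFrame.
    inventory['Quantity'] = [100, 150, 50]
except ValueError as err:
    error_message = str(err)
    print('Could not add column:', error_message)

# A list with one value per row works.
inventory['Quantity'] = [100, 150, 50, 35]
print(inventory)
```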


In [None]:
Exercise:
=========

The DataFrame df contains information on products sold at a hardware store. 

Add a column to df called 'Sold in Bulk?', which indicates if the product is sold in bulk or individually. 

The final table should look like this:

Product ID	Product Description	  Cost to Manufacture	Price    Sold in Bulk?
----------  -------------------   -------------------   -----    -------------
1	        3 inch screw	      0.50	                0.75     Yes
2	        2 inch nail	          0.10	                0.25     Yes
3	        hammer	              3.00	                5.50     No
4	        screwdriver	          2.50	                3.00     No

In [39]:
test_df = pd.DataFrame([
    [1, '3 inch screw', 0.50, 0.75],
    [2, '2 inch nail', 0.10, 0.25],
    [3, 'hammer', 3.00, 5.50],
    [4, 'screwdriver', 2.50, 3.00]
],
 columns=['Product ID', 'Product Description', 'Cost to Manufacture', 'Price'])

In [40]:
test_df

Unnamed: 0,Product ID,Product Description,Cost to Manufacture,Price
0,1,3 inch screw,0.5,0.75
1,2,2 inch nail,0.1,0.25
2,3,hammer,3.0,5.5
3,4,screwdriver,2.5,3.0


Add a column to df called 'Sold in Bulk?', which indicates if the product is sold in bulk or individually.

In [41]:
test_df['Sold in Bulk?'] = ['Yes', 'Yes', 'No', 'No']

In [42]:
test_df

Unnamed: 0,Product ID,Product Description,Cost to Manufacture,Price,Sold in Bulk?
0,1,3 inch screw,0.5,0.75,Yes
1,2,2 inch nail,0.1,0.25,Yes
2,3,hammer,3.0,5.5,No
3,4,screwdriver,2.5,3.0,No


## Question: Can we add a new column at a specific position in a Pandas dataframe?


In [None]:
Yes. You can add a new column at a specific position by passing the target index to the insert() method.

By default, assigning a new column with df['name'] = ... adds it as the last column of the dataframe.

    
Example:
========
Suppose we have a dataframe with five columns and want to insert a new column at the third position (index 2).

# Third position is index 2, because of zero-based indexing.
df.insert(2, 'new-col', data)

After the insert, the columns from index 2 onward are shifted one position to the right: the column that was previously at index 2 is now at index 3, and so on for the columns that follow.
Note that insert() modifies the dataframe in place and returns None.
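
A runnable sketch of the insert() call described above. The five column names and the inserted values are placeholders chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6], 'd': [7, 8], 'e': [9, 10]})

# insert(loc, column, value) changes df in place and returns None,
# so the result is not assigned back to df.
df.insert(2, 'new-col', [0, 0])

print(list(df.columns))
# ['a', 'b', 'new-col', 'c', 'd', 'e']
```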

## Adding a Column II

We can also add a new column whose value is the same for every row in the DataFrame.


Example:

Suppose we know that all of our products are currently in-stock. We can add a column that says this:

test_df['In Stock?'] = True

In [45]:
test_df

Unnamed: 0,Product ID,Product Description,Cost to Manufacture,Price,Sold in Bulk?
0,1,3 inch screw,0.5,0.75,Yes
1,2,2 inch nail,0.1,0.25,Yes
2,3,hammer,3.0,5.5,No
3,4,screwdriver,2.5,3.0,No


In [46]:
test_df['Sold in Bulk?'] = True

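Note that the cell above assigns True to the existing 'Sold in Bulk?' column rather than creating 'In Stock?', so it overwrites the earlier Yes/No values. Every row now has True in that column. Assigning to a new name, such as 'In Stock?', would instead add a new column with the same value in every row.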

In [47]:
test_df

Unnamed: 0,Product ID,Product Description,Cost to Manufacture,Price,Sold in Bulk?
0,1,3 inch screw,0.5,0.75,True
1,2,2 inch nail,0.1,0.25,True
2,3,hammer,3.0,5.5,True
3,4,screwdriver,2.5,3.0,True


Exercise:

Add a column to df called 'Is taxed?', which indicates whether or not to collect sales tax on the product. 

It should be 'Yes' for all rows.


In [48]:
test_df['Is taxed?'] = 'Yes'

In [49]:
test_df

Unnamed: 0,Product ID,Product Description,Cost to Manufacture,Price,Sold in Bulk?,Is taxed?
0,1,3 inch screw,0.5,0.75,True,Yes
1,2,2 inch nail,0.1,0.25,True,Yes
2,3,hammer,3.0,5.5,True,Yes
3,4,screwdriver,2.5,3.0,True,Yes


# Question

When we store values such as True or False into a dataframe, are they stored as strings?


In [None]:
No. Although the values may look like strings when a dataframe is printed, True and False are stored as the bool type.

To see the data type that each column of your dataframe stores, use .info().

Remember:
---------

When inspecting the column data types using .info(), you will see types such as float64 and bool, as well as object.

The object data type means that the column can store any Python object, which is why the string columns in the output below are reported as object.

Columns that store more than one type of value, say a column that contains both numbers and strings, will also have a dtype of object.


In [50]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Product ID           4 non-null      int64  
 1   Product Description  4 non-null      object 
 2   Cost to Manufacture  4 non-null      float64
 3   Price                4 non-null      float64
 4   Sold in Bulk?        4 non-null      bool   
 5   Is taxed?            4 non-null      object 
dtypes: bool(1), float64(2), int64(1), object(2)
memory usage: 296.0+ bytes
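
The difference between real booleans and strings that only look like booleans can be checked directly with `.dtypes`. This is a small sketch with made-up column names; depending on the pandas version, the text column may be reported as object or as a dedicated string dtype.

```python
import pandas as pd

flags = pd.DataFrame({
    'as_bool': [True, False, True],        # real booleans
    'as_text': ['True', 'False', 'True'],  # strings that only look like booleans
    'mixed': [1, 'two', 3.0],              # more than one type of value
})

print(flags.dtypes)

# Boolean columns support boolean arithmetic, e.g. counting the True values.
true_count = flags['as_bool'].sum()
print(true_count)
```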


# Adding a Column III

Finally, you can add a new column by performing a calculation on the existing columns.

In [None]:
Example:

Maybe we want to add a column to our inventory table with the amount of sales tax that we need to charge for each item.
                                                                                                        
The following code multiplies each Price by 0.075, the sales tax for our state:
                                                                                                        
df['Sales Tax'] = df.Price * 0.075


In [52]:
test_df

Unnamed: 0,Product ID,Product Description,Cost to Manufacture,Price,Sold in Bulk?,Is taxed?
0,1,3 inch screw,0.5,0.75,True,Yes
1,2,2 inch nail,0.1,0.25,True,Yes
2,3,hammer,3.0,5.5,True,Yes
3,4,screwdriver,2.5,3.0,True,Yes


In [61]:
test_df['Sales Tax'] = test_df['Price'] * 0.075

In [54]:
test_df

Unnamed: 0,Product ID,Product Description,Cost to Manufacture,Price,Sold in Bulk?,Is taxed?,Sales Tax
0,1,3 inch screw,0.5,0.75,True,Yes,0.05625
1,2,2 inch nail,0.1,0.25,True,Yes,0.01875
2,3,hammer,3.0,5.5,True,Yes,0.4125
3,4,screwdriver,2.5,3.0,True,Yes,0.225


## .round(n)

In [62]:
test_df['Sales Tax'] = test_df['Sales Tax'].round(2)

In [63]:
test_df

Unnamed: 0,Product ID,Product Description,Cost to Manufacture,Price,Sold in Bulk?,Is taxed?,Sales Tax
0,1,3 inch screw,0.5,0.75,True,Yes,0.06
1,2,2 inch nail,0.1,0.25,True,Yes,0.02
2,3,hammer,3.0,5.5,True,Yes,0.41
3,4,screwdriver,2.5,3.0,True,Yes,0.22


Exercise:
=========

Add a column to df called 'Margin', which is equal to the difference between the Price and the Cost to Manufacture.


In [64]:
test_df['Margin'] = test_df['Price'] - test_df['Cost to Manufacture']

In [65]:
test_df

Unnamed: 0,Product ID,Product Description,Cost to Manufacture,Price,Sold in Bulk?,Is taxed?,Sales Tax,Margin
0,1,3 inch screw,0.5,0.75,True,Yes,0.06,0.25
1,2,2 inch nail,0.1,0.25,True,Yes,0.02,0.15
2,3,hammer,3.0,5.5,True,Yes,0.41,2.5
3,4,screwdriver,2.5,3.0,True,Yes,0.22,0.5


## Question

Can we perform operations between more than two columns?

In [None]:
Yes, you can perform operations between two or more columns. 

In fact, there is no limit to how many columns you can combine in a single expression.

For example, say we had a dataframe containing columns for
price, tax, and quantity.

We could perform an operation using all three of these columns:
df['total'] = (df.price + (df.price * df.tax)) * df.quantity

#### add quantity

In [68]:
test_df.insert(4, 'Quantity', [100, 150, 50, 35])

In [69]:
test_df

Unnamed: 0,Product ID,Product Description,Cost to Manufacture,Price,Quantity,Sold in Bulk?,Is taxed?,Sales Tax,Margin
0,1,3 inch screw,0.5,0.75,100,True,Yes,0.06,0.25
1,2,2 inch nail,0.1,0.25,150,True,Yes,0.02,0.15
2,3,hammer,3.0,5.5,50,True,Yes,0.41,2.5
3,4,screwdriver,2.5,3.0,35,True,Yes,0.22,0.5


#### compute total

In [73]:
test_df['total'] = ( test_df['Price'] + test_df['Sales Tax'] ) * test_df['Quantity']

In [74]:
test_df

Unnamed: 0,Product ID,Product Description,Cost to Manufacture,Price,Quantity,Sold in Bulk?,Is taxed?,Sales Tax,Margin,total
0,1,3 inch screw,0.5,0.75,100,True,Yes,0.06,0.25,81.0
1,2,2 inch nail,0.1,0.25,150,True,Yes,0.02,0.15,40.5
2,3,hammer,3.0,5.5,50,True,Yes,0.41,2.5,295.5
3,4,screwdriver,2.5,3.0,35,True,Yes,0.22,0.5,112.7


# Performing Column Operations

In [None]:

Often, the column that we want to add is related to existing columns, but requires a calculation more complex than multiplication or addition.

For example,
===========

Imagine that we have the following table of customers.
    
Name	     Email
JOHN SMITH	john.smith@gmail.com
Jane Doe	jdoe@yahoo.com
joe schmo	joeschmo@hotmail.com

It’s a little annoying that the capitalization is different for each row. 

Perhaps we’d like to make it more consistent by making all of the letters uppercase.

To do this, use apply(), which applies a function to every value in a particular column.

In [77]:
user_df = pd.DataFrame([
  ['JOHN SMITH', 'john.smith@gmail.com'],
  ['Jane Doe', 'jdoe@yahoo.com'],
  ['joe schmo', 'joeschmo@hotmail.com']
],
columns=['Name', 'Email'])

In [78]:
user_df

Unnamed: 0,Name,Email
0,JOHN SMITH,john.smith@gmail.com
1,Jane Doe,jdoe@yahoo.com
2,joe schmo,joeschmo@hotmail.com


In [None]:
Example:

The following pattern overwrites the existing 'Name' column by applying a string function, here str.upper, to every value in 'Name'.

user_df['Name'] = user_df['Name'].apply(<string function>)

In [79]:
user_df['Name'] = user_df['Name'].apply(str.upper)

In [80]:
user_df

Unnamed: 0,Name,Email
0,JOHN SMITH,john.smith@gmail.com
1,JANE DOE,jdoe@yahoo.com
2,JOE SCHMO,joeschmo@hotmail.com
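
`apply(str.upper)` and the `.str` accessor give the same result on clean data, but they behave differently when a value is missing. The sketch below assumes a small Series with one missing name, which is not present in the table above.

```python
import pandas as pd

names = pd.Series(['JOHN SMITH', 'Jane Doe', None])

# The .str accessor skips missing values and leaves them missing.
upper_names = names.str.upper()

# apply(str.upper) calls str.upper on every element, including the missing one,
# which is not a string and therefore raises a TypeError.
try:
    names.apply(str.upper)
    apply_failed = False
except TypeError:
    apply_failed = True

print(upper_names.tolist(), apply_failed)
```

For columns that may contain missing values, the `.str` accessor is usually the safer choice.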


## Exercise

In [None]:
Apply the function lower to all names in column 'Name' in df. Assign these new names to a new column of df called 'Lowercase Name'. 

The final DataFrame should look like this:

Name	     Email	                Lowercase Name
----         -----                  --------------
JOHN SMITH	john.smith@gmail.com	john smith
Jane Doe	jdoe@yahoo.com	        jane doe
joe schmo	joeschmo@hotmail.com	joe schmo

In [81]:
user_df['Lowercase Name'] = user_df['Name'].apply(str.lower)

In [82]:
user_df

Unnamed: 0,Name,Email,Lowercase Name
0,JOHN SMITH,john.smith@gmail.com,john smith
1,JANE DOE,jdoe@yahoo.com,jane doe
2,JOE SCHMO,joeschmo@hotmail.com,joe schmo


## Question

Can we utilize the apply() method in Pandas to update a dataframe in-place?


In [None]:
No. Some methods that modify a dataframe accept an inplace parameter, such as

df.drop(columns=['A'], inplace=True)

df.rename(columns={'B': 'C'}, inplace=True)

but the apply() method has no inplace parameter.

As a result, if you want the result of apply() to update the dataframe, you must assign it back, for example:

df = df.apply(my_lambda)
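
A minimal sketch of this behaviour, using a made-up two-row frame: calling apply() alone leaves the DataFrame untouched, and only the assignment changes it.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Bob']})

# apply() returns a new Series; df itself is not changed by this line.
lowered = df['name'].apply(str.lower)
unchanged = df['name'].tolist()

# Assigning the result back is what updates the DataFrame.
df['name'] = lowered
print(unchanged, df['name'].tolist())
```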


Note: 
=====
df['col_name'] = df['col_name'].lower() does not work: a Series has no lower() method, so this raises an AttributeError.
Use the .str accessor, df['col_name'].str.lower(), or apply with a str method, df['col_name'].apply(str.lower).


## Reviewing Lambda Function

A lambda function is a way of defining a function in a single line of code.

In the examples here, we assign each lambda to a variable so that it can be called by name.

Example: the following lambda function multiplies a number by 2 and then adds 3:

In [83]:
mylambda = lambda x: (x * 2) + 3
print(mylambda(5))

13


#### note: Lambda functions work with all types of variables, not just integers! 

In [85]:
stringlambda = lambda x: x.lower()
print(stringlambda("Oh Hi Mark!"))

oh hi mark!


## Exercise

Create a lambda function mylambda that returns the first and last letters of a string, assuming the string is at least 2 characters long. 

For example,

print(mylambda('This is a string'))

should produce 'Tg'.

In [87]:
mylambda = lambda my_str: my_str[0] + my_str[-1]
mylambda('This is a string')

'Tg'

## Question:

In the context of this exercise, how might the example lambda functions be rewritten in the regular form, using def?

A lambda function can always be rewritten as a normal Python function defined with def.

The cells below convert the single-line lambda functions into regular multi-line functions.

In [89]:
# Convert ====> mylambda = lambda x: (x * 2) + 3

def my_function(x):
  return (x * 2) + 3

print(my_function(5))

13


In [90]:
# Convert ====> stringlambda = lambda x: x.lower()

def string_function(x):
  return x.lower()

print(string_function("Oh Hi Mark!"))

oh hi mark!


### Note:

To generalize: the parameters of the function are the names that follow the keyword lambda and come before the colon.

For example, the parameters of this lambda are x and y:
sumlambda = lambda x, y: x + y

The return value of the function is the expression that follows the colon, here x + y.

As a regular function, this would be

def sum_function(x, y):
  return x + y

## Reviewing Lambda Function: If Statements

We can make our lambdas more complex by using a modified form of an if statement.

In [None]:
Example:

Suppose we want to pay workers time-and-a-half for overtime (any work above 40 hours per week). 

The following function will convert the number of hours into time-and-a-half hours using an if statement:

def myfunction(x):
    if x > 40:
        return 40 + (x - 40) * 1.50
    else:
        return x

## Syntax for an if/else expression in a lambda function

lambda x: [OUTCOME IF TRUE] if [CONDITIONAL] else [OUTCOME IF FALSE]

In [92]:
# Below is a lambda function that does the same thing:

myfunction = lambda x: (40 + (x - 40) * 1.50) if x > 40 else x

In [93]:
print(myfunction(50))

55.0


In [94]:
print(myfunction(40))

40


In [95]:
print(myfunction(20))

20


In [None]:
## Exercise

You are managing the webpage of a somewhat violent video game and you want to check that each user’s age is 13 or greater when they visit the site.

Write a lambda function that takes an age as input and returns 'Welcome to BattleCity!' if the user is 13 or older, or 'You must be 13 or older' if they are younger than 13.

Your lambda function should be called mylambda.


In [100]:
mylambda = lambda age: 'Welcome to BattleCity!' if age >= 13 else 'You must be 13 or older'

In [97]:
print(mylambda(5))

You must be 13 or older


In [98]:
print(mylambda(14))

Welcome to BattleCity!


In [101]:
print(mylambda(13))

Welcome to BattleCity!


## Applying a Lambda to a Column

In [None]:
In Pandas, we often use lambda functions to perform complex operations on columns.

For example, suppose that we want to create a column containing the email provider for each email address in the following table:

Name	    Email
JOHN SMITH	john.smith@gmail.com
Jane Doe	jdoe@yahoo.com
joe schmo	joeschmo@hotmail.com

In [109]:
user_df

Unnamed: 0,Name,Email,Lowercase Name
0,JOHN SMITH,john.smith@gmail.com,john smith
1,JANE DOE,jdoe@yahoo.com,jane doe
2,JOE SCHMO,joeschmo@hotmail.com,joe schmo


In [110]:
user_df.drop(['Lowercase Name'], inplace=True, axis = 1)

In [111]:
user_df

Unnamed: 0,Name,Email
0,JOHN SMITH,john.smith@gmail.com
1,JANE DOE,jdoe@yahoo.com
2,JOE SCHMO,joeschmo@hotmail.com


Split each value in the Email column at '@' and store the last part in a new column, Email Provider.

In [114]:
user_df['Email Provider'] = user_df.Email.apply(lambda x: x.split('@')[-1])

In [115]:
user_df

Unnamed: 0,Name,Email,Email Provider
0,JOHN SMITH,john.smith@gmail.com,gmail.com
1,JANE DOE,jdoe@yahoo.com,yahoo.com
2,JOE SCHMO,joeschmo@hotmail.com,hotmail.com
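
The same column can also be built without a lambda by chaining the `.str` accessor. This is a sketch on a fresh copy of the three email addresses rather than on user_df itself.

```python
import pandas as pd

users = pd.DataFrame({
    'Email': ['john.smith@gmail.com', 'jdoe@yahoo.com', 'joeschmo@hotmail.com']
})

# .str.split('@') gives a list per row; .str[-1] picks the last element of each list.
users['Email Provider'] = users['Email'].str.split('@').str[-1]

print(users)
```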


## Exercise

Create a lambda function get_last_name which takes a string with someone’s first and last name (e.g., John Smith), and 
returns just the last name (e.g., Smith).


In [116]:
get_last_name = lambda fullname: fullname.split(' ')[-1]

In [117]:
get_last_name('Santosh Kumar')

'Kumar'
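
One detail worth knowing: `split(' ')` and `split()` with no argument treat extra whitespace differently. The two helper names below are invented for this comparison.

```python
by_space = lambda fullname: fullname.split(' ')[-1]
by_whitespace = lambda fullname: fullname.split()[-1]

name = 'John Smith '  # note the trailing space

print(repr(by_space(name)))       # ''
print(repr(by_whitespace(name)))  # 'Smith'
```

With a trailing space, `split(' ')` produces an empty final piece, while `split()` ignores leading, trailing, and repeated whitespace. For names read from a file, `split()` is the more forgiving choice.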

## Exercise

The DataFrame df represents the hours worked by different employees over the course of the week. 

It contains the following columns:

'name'        : The employee’s name
'hourly_wage' : The employee’s hourly wage
'hours_worked': The number of hours worked this week

Use the lambda function get_last_name to create a new column last_name with only the employees’ last name.

In [119]:
employees_df = pd.read_csv(r'D:\GIT_Repositories\pandas\employees.csv')
employees_df.head(10)

Unnamed: 0,id,name,hourly_wage,hours_worked
0,10310,Lauren Durham,19,43
1,18656,Grace Sellers,17,40
2,61254,Shirley Rasmussen,16,30
3,16886,Brian Rojas,18,47
4,89010,Samantha Mosley,11,38
5,87246,Louis Guzman,14,39
6,20578,Denise Mcclure,15,40
7,12869,James Raymond,15,32
8,53461,Noah Collier,18,35
9,14746,Donna Frederick,20,41


Use the lambda function get_last_name to create a new column last_name with only the employees’ last name.

In [128]:
get_last_name = lambda x: x.split()[-1]

employees_df['last_name'] = employees_df.name.apply(get_last_name)

In [130]:
employees_df.head(10)

Unnamed: 0,id,name,hourly_wage,hours_worked,last_name
0,10310,Lauren Durham,19,43,Durham
1,18656,Grace Sellers,17,40,Sellers
2,61254,Shirley Rasmussen,16,30,Rasmussen
3,16886,Brian Rojas,18,47,Rojas
4,89010,Samantha Mosley,11,38,Mosley
5,87246,Louis Guzman,14,39,Guzman
6,20578,Denise Mcclure,15,40,Mcclure
7,12869,James Raymond,15,32,Raymond
8,53461,Noah Collier,18,35,Collier
9,14746,Donna Frederick,20,41,Frederick


## Applying a Lambda to a Row

In [None]:
We can also operate on multiple columns at once.

If we call apply on the whole DataFrame (without selecting a single column) and pass the argument axis=1, the input to our lambda function will be an entire row, not a column.

To access particular values of the row, we use the syntax row.column_name or row['column_name'].

## Example

In [None]:
Suppose we have a table representing a grocery list:

Item	        Price	Is taxed?
----            -----   ---------
Apple	        1.00	No
Milk	        4.20	No
Paper Towels	5.00	Yes
Light Bulbs	    3.75	Yes

If we want to add in the price with tax for each line, we’ll need to look at two columns: Price and Is taxed?.

If Is taxed? is Yes, then we’ll want to multiply Price by 1.075 (for 7.5% sales tax).

If Is taxed? is No, we’ll just have Price without multiplying it.

In [131]:
grocery_df = pd.DataFrame([
    ['Apple', 1.00, 'No'],
    ['Milk', 4.20, 'No'],
    ['Paper Towels', 5.00, 'Yes'],
    ['Light Bulbs', 3.75, 'Yes']
],
    columns=['Item', 'Price', 'Is taxed?']
)

In [132]:
grocery_df

Unnamed: 0,Item,Price,Is taxed?
0,Apple,1.0,No
1,Milk,4.2,No
2,Paper Towels,5.0,Yes
3,Light Bulbs,3.75,Yes


In [139]:
grocery_df['price with tax'] = grocery_df.apply( lambda row: 
    row['Price'] * 1.075 
    if row['Is taxed?'] == 'Yes' 
    else row['Price'],
    axis = 1)

In [140]:
grocery_df

Unnamed: 0,Item,Price,Is taxed?,price with tax
0,Apple,1.0,No,1.0
1,Milk,4.2,No,4.2
2,Paper Towels,5.0,Yes,5.375
3,Light Bulbs,3.75,Yes,4.03125
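
For a simple two-way condition like this one, a column-wise alternative to the row lambda is `Series.mask`. The sketch below rebuilds the grocery table so that it runs on its own; the name `taxed` is introduced only for this example.

```python
import pandas as pd

grocery = pd.DataFrame({
    'Item': ['Apple', 'Milk', 'Paper Towels', 'Light Bulbs'],
    'Price': [1.00, 4.20, 5.00, 3.75],
    'Is taxed?': ['No', 'No', 'Yes', 'Yes'],
})

# Boolean Series: True for the rows where tax applies.
taxed = grocery['Is taxed?'] == 'Yes'

# mask(cond, other) replaces the values where cond is True with `other`
# and keeps the original value everywhere else.
grocery['price with tax'] = grocery['Price'].mask(taxed, grocery['Price'] * 1.075)

print(grocery)
```

The row-wise lambda remains the more general tool; the column-wise form is mainly worth considering when the DataFrame is large.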


## Exercise

In [None]:
If an employee worked for more than 40 hours, she needs to be paid overtime (1.5 times the normal hourly wage).

For instance, if an employee worked for 43 hours and made $10/hour, she would receive $400 for the first 40 hours that she worked, 
and an additional $45 for the 3 hours of overtime, for a total of $445.

Create a lambda function total_earned that accepts an input row with keys hours_worked and hourly_wage and uses an if statement 
to calculate the total wages earned.

In [144]:
employees_df

Unnamed: 0,id,name,hourly_wage,hours_worked,last_name
0,10310,Lauren Durham,19,43,Durham
1,18656,Grace Sellers,17,40,Sellers
2,61254,Shirley Rasmussen,16,30,Rasmussen
3,16886,Brian Rojas,18,47,Rojas
4,89010,Samantha Mosley,11,38,Mosley
5,87246,Louis Guzman,14,39,Guzman
6,20578,Denise Mcclure,15,40,Mcclure
7,12869,James Raymond,15,32,Raymond
8,53461,Noah Collier,18,35,Collier
9,14746,Donna Frederick,20,41,Frederick


In [157]:
total_earned = lambda row: (
    (40 + (row['hours_worked'] - 40) * 1.5) * row['hourly_wage']
    if row['hours_worked'] > 40
    else row['hours_worked'] * row['hourly_wage']
)

In [158]:
employees_df['total_earned'] = employees_df.apply(total_earned, axis=1)

In [159]:
employees_df

Unnamed: 0,id,name,hourly_wage,hours_worked,last_name,total_earned
0,10310,Lauren Durham,19,43,Durham,845.5
1,18656,Grace Sellers,17,40,Sellers,680.0
2,61254,Shirley Rasmussen,16,30,Rasmussen,480.0
3,16886,Brian Rojas,18,47,Rojas,909.0
4,89010,Samantha Mosley,11,38,Mosley,418.0
5,87246,Louis Guzman,14,39,Guzman,546.0
6,20578,Denise Mcclure,15,40,Mcclure,600.0
7,12869,James Raymond,15,32,Raymond,480.0
8,53461,Noah Collier,18,35,Collier,630.0
9,14746,Donna Frederick,20,41,Frederick,830.0
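
To check the overtime rule against the worked example from the exercise (43 hours at $10/hour), the same logic can be written as a regular function and called on a single record. A plain dict stands in for a DataFrame row here, which is an assumption made to keep the sketch free of pandas.

```python
def total_earned(row):
    """Regular hours at the normal wage, hours above 40 at 1.5 times the wage."""
    if row['hours_worked'] > 40:
        overtime_hours = row['hours_worked'] - 40
        return 40 * row['hourly_wage'] + overtime_hours * 1.5 * row['hourly_wage']
    return row['hours_worked'] * row['hourly_wage']

# A dict supports row['key'] access just like a DataFrame row does.
example = {'hours_worked': 43, 'hourly_wage': 10}
print(total_earned(example))  # 445.0
```

Because the function only uses `row['key']` access, it can also be passed to `employees_df.apply(total_earned, axis=1)` unchanged.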


## Question

In Pandas, when do we apply lambda functions to rows as opposed to columns of a dataframe?

In [None]:
Generally, we apply a lambda to rows, as opposed to columns, when we want to perform functionality that needs to access more than 
one column at a time.

Take, for instance, the example function from the exercise:

lambda row: row['Price'] * 1.075 if row['Is taxed?'] == 'Yes' else row['Price']

As we can see, this lambda function is accessing multiple columns of the dataframe: Price and Is taxed?. 
Because it is accessing multiple columns, it would need to be able to access the entire row, instead of just a single column.

On the other hand, when applying a lambda function to a single column, the lambda will only apply to that column’s values. 
For example:
df['Email Provider'] = df.Email.apply(lambda x: x.split('@')[-1] )
will apply the lambda function only on the values of the column df.Email, and not to any other columns.
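
The two styles can be put side by side in a minimal sketch. The dataframe below is made up for illustration (the column names mirror the ones used in the text, the values are hypothetical).

```python
import pandas as pd

# Hypothetical data, only for illustrating the two kinds of apply.
df = pd.DataFrame({
    'Email': ['a@gmail.com', 'b@outlook.com'],
    'Price': [5.0, 4.2],
    'Is taxed?': ['Yes', 'No'],
})

# Column-wise: the lambda receives one value from df['Email'] at a time.
df['Email Provider'] = df['Email'].apply(lambda x: x.split('@')[-1])

# Row-wise: axis=1 passes each row, so the lambda can read several columns.
df['Price with tax'] = df.apply(
    lambda row: row['Price'] * 1.075 if row['Is taxed?'] == 'Yes' else row['Price'],
    axis=1,
)

print(df)
```

Without axis=1, df.apply would hand the lambda whole columns instead of rows, so the row['Price'] lookup would not behave as intended.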


## Renaming Columns   -- Rename all columns

When we get our data from other sources, we often want to change the column names. 

For example, we might want all of the column names to follow variable name rules, so that we can use df.column_name (which tab-completes) rather than df['column_name'] (which takes up extra space).

You can change all of the column names at once by setting the .columns property to a different list.

In [None]:
df.columns = []
================

This is great when you need to change all of the column names at once, but be careful! 

You can easily mislabel columns if you get the ordering wrong. 


In [160]:
# example:

temp_df = pd.DataFrame({
    'name': ['John', 'Jane', 'Sue', 'Fred'],
    'age': [23, 29, 21, 18]
})
temp_df.columns = ['First Name', 'Age']

In [161]:
temp_df

Unnamed: 0,First Name,Age
0,John,23
1,Jane,29
2,Sue,21
3,Fred,18
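
The ordering caveat is easy to reproduce. In the sketch below (hypothetical data and variable names), the new names are listed in the wrong order: pandas accepts the assignment without complaint, and the labels simply end up on the wrong columns. A list of the wrong length, by contrast, is rejected.

```python
import pandas as pd

people = pd.DataFrame({
    'name': ['John', 'Jane'],
    'age': [23, 29],
})

# Wrong order: no error is raised, but the labels are now swapped.
people.columns = ['Age', 'First Name']
print(people['Age'].tolist())  # ['John', 'Jane']

# A list with the wrong number of names is caught by pandas.
length_error = None
try:
    people.columns = ['only_one_name']
except ValueError as err:
    length_error = err

print(length_error is not None)  # True
```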


## Exercise

In [None]:

We want to present this data to some film producers. Right now, our column names are in lower case, and are not very descriptive. 

Let’s modify imdb_df using the .columns attribute to make the following changes to the columns:

Old            New
----           ----
id             ID
name           Title
genre          Category
year           Year Released
imdb_rating    Rating

In [162]:
imdb_df = pd.read_csv(r'D:\GIT_Repositories\pandas\imdb.csv')

In [163]:
imdb_df.head(10)

Unnamed: 0,id,name,genre,year,imdb_rating
0,1,Avatar,action,2009,7.9
1,2,Jurassic World,action,2015,7.3
2,3,The Avengers,action,2012,8.1
3,4,The Dark Knight,action,2008,9.0
4,5,Star Wars: Episode I - The Phantom Menace,action,1999,6.6
5,6,Star Wars,action,1977,8.7
6,7,Avengers: Age of Ultron,action,2015,7.9
7,8,The Dark Knight Rises,action,2012,8.5
8,9,Pirates of the Caribbean: Dead Mans Chest,action,2006,7.3
9,10,Iron Man 3,action,2013,7.3


In [164]:
imdb_df.columns = ['ID', 'Title', 'Category', 'Year Released', 'Rating']

In [165]:
imdb_df.head(10)

Unnamed: 0,ID,Title,Category,Year Released,Rating
0,1,Avatar,action,2009,7.9
1,2,Jurassic World,action,2015,7.3
2,3,The Avengers,action,2012,8.1
3,4,The Dark Knight,action,2008,9.0
4,5,Star Wars: Episode I - The Phantom Menace,action,1999,6.6
5,6,Star Wars,action,1977,8.7
6,7,Avengers: Age of Ultron,action,2015,7.9
7,8,The Dark Knight Rises,action,2012,8.5
8,9,Pirates of the Caribbean: Dead Mans Chest,action,2006,7.3
9,10,Iron Man 3,action,2013,7.3


## Question

In Pandas, should dataframe column names always be capitalized?

In [None]:
The short answer is, no. 

Column names in Pandas dataframes do not always have to be capitalized, and there is no strict requirement on how to case your column names.

However, there are a few important points you might keep in mind when you are naming columns for Pandas dataframes.

Be consistent
=============
Column name casing should usually be consistent, which is good practice in Python generally. 
If you decide to capitalize one column name, it is a good idea to capitalize all of them. 
The same applies when naming variables, functions and almost anything else in Python.

Use snake_case
==============
Also consider naming columns in “snake_case”: lowercase words joined by underscores, with no spaces. Because such names are valid Python identifiers, you can select a column using either format (provided the name does not clash with an existing DataFrame attribute or method):
df.column_name or df['column_name']
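
As a minimal sketch of this point, the block below converts column names with spaces into snake_case and then uses both access styles. The dataframe and the list comprehension used for the conversion are illustrative assumptions, not something prescribed by pandas.

```python
import pandas as pd

# Hypothetical frame whose column names contain spaces and capitals.
movies = pd.DataFrame({
    'Movie Title': ['Avatar', 'Star Wars'],
    'Year Released': [2009, 1977],
})

# One possible normalization: lower-case and replace spaces with underscores.
movies.columns = [name.lower().replace(' ', '_') for name in movies.columns]

print(list(movies.columns))         # ['movie_title', 'year_released']

# Both access styles now refer to the same column.
print(movies.movie_title.tolist())     # ['Avatar', 'Star Wars']
print(movies['movie_title'].tolist())  # ['Avatar', 'Star Wars']
```

Before the conversion, only the bracket form would work, since 'Movie Title' is not a valid Python identifier.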

## Renaming Columns II   -- Rename individual columns

You can also rename individual columns by using the .rename method.

Pass a dictionary like the one below to the columns keyword argument:

{'old_column_name1': 'new_column_name1', 
 'old_column_name2': 'new_column_name2'}


## Example

In [166]:
temp_df

Unnamed: 0,First Name,Age
0,John,23
1,Jane,29
2,Sue,21
3,Fred,18


In [167]:
temp_df.rename( 
    columns={
    'First Name': 'name',
    'Age': 'age'},
    inplace=True)

In [168]:
temp_df

Unnamed: 0,name,age
0,John,23
1,Jane,29
2,Sue,21
3,Fred,18


## Note:

Using rename with only the columns keyword will create a new DataFrame, leaving your original DataFrame unchanged. 

That’s why we also passed in the keyword argument inplace=True. 

Using inplace=True lets us edit the original DataFrame.
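
The difference is visible in a small sketch (the dataframe mirrors temp_df from the example, rebuilt here so the block runs on its own). Without inplace=True, rename returns a new DataFrame; with it, the call returns None and changes the original.

```python
import pandas as pd

temp_df = pd.DataFrame({
    'First Name': ['John', 'Jane'],
    'Age': [23, 29],
})
new_names = {'First Name': 'name', 'Age': 'age'}

# Without inplace: a renamed copy comes back, the original is untouched.
renamed = temp_df.rename(columns=new_names)
print(list(temp_df.columns))  # ['First Name', 'Age']
print(list(renamed.columns))  # ['name', 'age']

# With inplace=True: the original is modified and the call returns None.
result = temp_df.rename(columns=new_names, inplace=True)
print(result)                 # None
print(list(temp_df.columns))  # ['name', 'age']
```

Reassigning the returned value (temp_df = temp_df.rename(columns=new_names)) is an equivalent alternative to inplace=True.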

## Several reasons why .rename is preferable to .columns

In [None]:
1. You can rename just one column

2. You can be specific about which column names are getting changed 
   (with .columns you can accidentally switch column names if you’re not careful)

Note: If you misspell one of the original column names, this command won’t fail. It just won’t change anything.
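
The silent behaviour described in the note can be sketched as follows. The data is hypothetical. The second call uses the errors='raise' argument of DataFrame.rename, which is available in reasonably recent pandas versions; if your version is older, treat that part as an assumption.

```python
import pandas as pd

df = pd.DataFrame({'Title': ['Avatar'], 'Rating': [7.9]})

# 'Titel' is misspelled: by default nothing is renamed and no error appears.
unchanged = df.rename(columns={'Titel': 'movie_title'})
print(list(unchanged.columns))  # ['Title', 'Rating']

# Asking pandas to raise makes the typo visible as a KeyError.
raised = False
try:
    df.rename(columns={'Titel': 'movie_title'}, errors='raise')
except KeyError:
    raised = True

print(raised)  # True
```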

## Exercise

In [169]:
imdb_df.head()

Unnamed: 0,ID,Title,Category,Year Released,Rating
0,1,Avatar,action,2009,7.9
1,2,Jurassic World,action,2015,7.3
2,3,The Avengers,action,2012,8.1
3,4,The Dark Knight,action,2008,9.0
4,5,Star Wars: Episode I - The Phantom Menace,action,1999,6.6


In [None]:
If we didn’t know that imdb_df was a table of movie ratings, the column Title might be confusing.

To clarify, let’s rename Title to Movie_Title.

Use the keyword inplace=True so that you modify imdb_df rather than creating a new DataFrame!

In [172]:
imdb_df.rename(columns = {'Title': 'Movie_Title'}, inplace=True)

In [173]:
imdb_df.head()

Unnamed: 0,ID,Movie_Title,Category,Year Released,Rating
0,1,Avatar,action,2009,7.9
1,2,Jurassic World,action,2015,7.3
2,3,The Avengers,action,2012,8.1
3,4,The Dark Knight,action,2008,9.0
4,5,Star Wars: Episode I - The Phantom Menace,action,1999,6.6


## Question:

Are Pandas dataframe column names case sensitive?


In [None]:
Yes, column names for dataframes are case sensitive.

Dataframe column names are essentially string values, which are case sensitive in Python.

Because of this, you will need to be careful whenever you use column names, such as when renaming a column, 
accessing columns, or applying functions to them.

Example

# Given a dataframe with a column "name",
# this lookup fails with a KeyError
# because of the incorrect casing
print(df["Name"])

# The correct casing would be
print(df["name"])
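
A runnable version of this check might look like the sketch below, which uses a small made-up dataframe. It tests for the column name first and then shows that the wrongly cased lookup raises a KeyError.

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Jane']})

# Membership tests on df.columns are case sensitive too.
print('name' in df.columns)  # True
print('Name' in df.columns)  # False

# The wrongly cased lookup raises KeyError rather than returning data.
lookup_failed = False
try:
    df['Name']
except KeyError:
    lookup_failed = True

print(lookup_failed)  # True
```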


## Final exercise

### REQUIREMENT 1:

In [None]:
Many of our customers want to buy vegan shoes (shoes made from materials that do not come from animals). 

Add a new column called shoe_source, which is vegan if the material is not leather and animal otherwise.

In [178]:
shoefly_df = pd.read_csv(r'D:\GIT_Repositories\pandas\shoefly.csv')

In [179]:
shoefly_df.head()

Unnamed: 0,id,first_name,last_name,gender,email,shoe_type,shoe_material,shoe_color
0,54791,Rebecca,Lindsay,female,RebeccaLindsay57@hotmail.com,clogs,faux-leather,black
1,53450,Emily,Joyce,female,EmilyJoyce25@gmail.com,ballet flats,faux-leather,navy
2,91987,Joyce,Waller,female,Joyce.Waller@gmail.com,sandles,fabric,black
3,14437,Justin,Erickson,male,Justin.Erickson@outlook.com,clogs,faux-leather,red
4,79357,Andrew,Banks,male,AB4318@gmail.com,boots,leather,brown


In [181]:
shoe_source = lambda x: 'animal' if x == 'leather' else 'vegan'

orders_df['shoe_source'] = orders_df.shoe_material.apply(shoe_source)

In [182]:
orders_df

Unnamed: 0,id,first_name,last_name,email,shoe_type,shoe_material,shoe_color,in_stock,description,shoe_source
0,74970,Albert,Crane,ACrane1998@gmail.com,ballet flats,fabric,white,False,Albert Crane bought white fabric ballet flats,vegan
1,28140,Lauren,Wise,LWise1972@gmail.com,sandals,fabric,black,False,Lauren Wise bought black fabric sandals,vegan
2,16841,Bryan,Maldonado,Bryan.Maldonado@gmail.com,stillettos,fabric,navy,False,Bryan Maldonado bought navy fabric stillettos,vegan
3,19695,Nancy,Talley,NTalley1988@gmail.com,wedges,fabric,black,False,Nancy Talley bought black fabric wedges,vegan
4,61287,Catherine,Brown,CBrown1981@gmail.com,boots,faux-leather,white,True,Catherine Brown bought white faux-leather boots,vegan
5,57141,Doris,Newton,DN6902@gmail.com,sandals,fabric,navy,False,Doris Newton bought navy fabric sandals,vegan
6,52132,Billy,Mcintyre,BM6854@hotmail.com,wedges,faux-leather,navy,True,Billy Mcintyre bought navy faux-leather wedges,vegan
7,25486,Sarah,Kaufman,SK8490@gmail.com,sandals,leather,navy,True,Sarah Kaufman bought navy leather sandals,animal
8,12421,Susan,Leblanc,SusanLeblanc40@outlook.com,clogs,fabric,red,False,Susan Leblanc bought red fabric clogs,vegan
9,36234,Benjamin,Newton,BenjaminNewton59@aol.com,clogs,faux-leather,navy,True,Benjamin Newton bought navy faux-leather clogs,vegan


### REQUIREMENT 2:

In [None]:
Our marketing department wants to send out an email to each customer. 

Using the columns last_name and gender, create a column called salutation which contains Dear Mr. <last_name> for men and 
Dear Ms. <last_name> for women.


In [183]:
shoefly_df = pd.read_csv(r'D:\GIT_Repositories\pandas\shoefly.csv')

In [184]:
shoefly_df.head()

Unnamed: 0,id,first_name,last_name,gender,email,shoe_type,shoe_material,shoe_color
0,54791,Rebecca,Lindsay,female,RebeccaLindsay57@hotmail.com,clogs,faux-leather,black
1,53450,Emily,Joyce,female,EmilyJoyce25@gmail.com,ballet flats,faux-leather,navy
2,91987,Joyce,Waller,female,Joyce.Waller@gmail.com,sandles,fabric,black
3,14437,Justin,Erickson,male,Justin.Erickson@outlook.com,clogs,faux-leather,red
4,79357,Andrew,Banks,male,AB4318@gmail.com,boots,leather,brown


In [190]:
# Note the trailing space after the title, so the result reads "Dear Mr. <last_name>"
salutation = lambda row: ( 'Dear Mr. ' + row['last_name'] ) if row['gender'] == "male" else ( 'Dear Ms. ' + row['last_name'] )

In [191]:
shoefly_df['salutation'] = shoefly_df.apply(salutation, axis=1)

In [192]:
shoefly_df

Unnamed: 0,id,first_name,last_name,gender,email,shoe_type,shoe_material,shoe_color,salutation
0,54791,Rebecca,Lindsay,female,RebeccaLindsay57@hotmail.com,clogs,faux-leather,black,Dear Ms. Lindsay
1,53450,Emily,Joyce,female,EmilyJoyce25@gmail.com,ballet flats,faux-leather,navy,Dear Ms. Joyce
2,91987,Joyce,Waller,female,Joyce.Waller@gmail.com,sandles,fabric,black,Dear Ms. Waller
3,14437,Justin,Erickson,male,Justin.Erickson@outlook.com,clogs,faux-leather,red,Dear Mr. Erickson
4,79357,Andrew,Banks,male,AB4318@gmail.com,boots,leather,brown,Dear Mr. Banks
5,52386,Julie,Marsh,female,JulieMarsh59@gmail.com,sandles,fabric,black,Dear Ms. Marsh
6,20487,Thomas,Jensen,male,TJ5470@gmail.com,clogs,fabric,navy,Dear Mr. Jensen
7,76971,Janice,Hicks,female,Janice.Hicks@gmail.com,clogs,faux-leather,navy,Dear Ms. Hicks
8,21586,Gabriel,Porter,male,GabrielPorter24@gmail.com,clogs,leather,brown,Dear Mr. Porter
9,62083,Frances,Palmer,female,FrancesPalmer50@gmail.com,wedges,leather,white,Dear Ms. Palmer


## Remove Columns from Dataframe

#### Approach 1: 

Creating a new dataframe, and including just the columns you want to keep from the original dataframe. 

In [None]:
For example, if we only wanted to include these columns from a dataframe, it effectively “removes” all the other columns not included:
new_df = df[['col1', 'col4']]
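
A minimal sketch of this approach, using a hypothetical five-column dataframe, shows that the selection produces a separate, smaller dataframe while the original keeps all of its columns.

```python
import pandas as pd

# Hypothetical frame with placeholder column names.
df = pd.DataFrame({
    'col1': [1, 2],
    'col2': [3, 4],
    'col3': [5, 6],
    'col4': [7, 8],
    'col5': [9, 10],
})

# The double brackets pass a list of column names to keep.
new_df = df[['col1', 'col4']]

print(list(new_df.columns))  # ['col1', 'col4']
print(len(df.columns))       # 5
```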

#### Approach 2

Using the built-in drop() method

In [None]:
Example:
========

To drop a column, we must specify axis=1 (the default, axis=0, drops rows). 

df.drop('col3', axis=1, inplace=True)


## Drop multiple columns at once

Pass multiple column names as a list to drop()

In [None]:
Example:
========

df.drop(['col3', 'col5'], axis=1, inplace=True)
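
The drop() calls above can be tried on a small hypothetical dataframe, as in the sketch below. It also shows the columns= keyword, which is an equivalent way of writing axis=1, and the fact that drop() returns a new dataframe when inplace=True is omitted.

```python
import pandas as pd

# Hypothetical frame with placeholder column names.
df = pd.DataFrame({
    'col1': [1, 2],
    'col2': [3, 4],
    'col3': [5, 6],
    'col4': [7, 8],
    'col5': [9, 10],
})

# Without inplace=True, drop returns a new frame and leaves df as it was.
trimmed = df.drop(['col3', 'col5'], axis=1)

# columns=... is an equivalent spelling of axis=1.
same = df.drop(columns=['col3', 'col5'])

print(list(trimmed.columns))  # ['col1', 'col2', 'col4']
print(trimmed.equals(same))   # True

# With inplace=True, df itself loses the column.
df.drop('col2', axis=1, inplace=True)
print(list(df.columns))       # ['col1', 'col3', 'col4', 'col5']
```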