## NumPy
NumPy is an open-source Python library for efficient numerical operations on large quantities of data. Its main data structure is the NumPy array, `ndarray`, which can have any number of dimensions. The library provides many features for performing mathematical and logical operations on these arrays. NumPy is one of the core Python libraries used for scientific computing because it makes data analysis efficient.

## Pandas
Pandas is a data manipulation library that is built on top of the established NumPy library and extends its tools. Its objects are implemented with the NumPy array structure, so it shares many features with NumPy and is frequently used alongside it. Pandas is also one of the core libraries used for scientific computing.

## Installation
If you have Anaconda installed, NumPy and pandas may have been auto-installed as well! If they haven’t been, or if you want to update to the latest versions, you can open a terminal window and run the following commands:

`conda install numpy`

`conda install pandas`

If you don’t have Anaconda installed, you can alternatively install the libraries using pip by running the following commands from your terminal:

`pip install numpy`

`pip install pandas`

Once you’ve installed these libraries, you’re ready to open any Python coding environment (we recommend Jupyter Notebook). Before you can use these libraries, you’ll need to import them using the following lines of code. We’ll use the abbreviations np and pd, respectively, to simplify our function calls in the future.

`import numpy as np`

`import pandas as pd`

In [1]:
import numpy as np
import pandas as pd

## NumPy Arrays
NumPy arrays differ from normal Python lists in that every item in an array has the same data type, which allows for fast element access and efficient data manipulation. They are called ndarrays since they can have any number (n) of dimensions (d). An array can be a vector (one-dimensional), a matrix (two-dimensional), or have more dimensions.

In [2]:
list1 = [1,2,3,4]
array1 = np.array(list1)
print(array1)
list2 = [[1,2,3],[4,5,6]]
array2 = np.array(list2)
print(array2)

[1 2 3 4]
[[1 2 3]
 [4 5 6]]


Many operations can be performed on NumPy arrays, which makes them very helpful for manipulating data:

* Selecting array elements
* Slicing arrays
* Reshaping arrays
* Splitting arrays
* Combining arrays
* Numerical operations (min, max, mean, etc.)
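
The operations listed above can be sketched roughly as follows. The array `values` and the other variable names are made up for illustration and are not part of the original notebook.

```python
import numpy as np

values = np.arange(1, 13)              # hypothetical data: the integers 1 through 12

first_item = values[0]                 # selecting a single element
middle = values[3:6]                   # slicing: positions 3, 4 and 5
grid = values.reshape(3, 4)            # reshaping into 3 rows and 4 columns
halves = np.split(values, 2)           # splitting into two equal arrays
joined = np.concatenate(halves[::-1])  # combining, second half first

print(first_item, middle)
print(grid.shape)
print(joined)
print(values.min(), values.max(), values.mean())
```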

In [3]:
toyPrices = np.array([5,8,3,6])
print(toyPrices - 2)

[3 6 1 4]


In [4]:
toyPrices = [5,8,3,6]
# print(toyPrices - 2) would raise a TypeError: lists do not support element-wise arithmetic
for i in range(len(toyPrices)):
    toyPrices[i] -= 2
print(toyPrices)

[3, 6, 1, 4]


## Pandas Series and DataFrames
Just as the ndarray is the foundation of the NumPy library, the Series is the core object of the pandas library. A pandas Series is very similar to a one-dimensional NumPy array, but it has additional functionality that allows values in the Series to be indexed using labels. A NumPy array does not have the flexibility to do this. This labeling is useful when you are storing pieces of data that have other data associated with them. Say you want to store the ages of students in an online course to eventually figure out the average student age. If stored in a NumPy array, you could only access these ages with the internal ndarray indices 0,1,2.... With a Series object, the indices of values are set to 0,1,2... by default, but you can customize the indices to be other values such as student names so an age can be accessed using a name. Customized indices of a Series are established by sending values into the Series constructor, as you will see below.

A Series holds items of any one data type and can be created by passing a scalar value, Python list, dictionary, or ndarray to the pandas Series constructor. If a dictionary is passed in, its keys become the index labels.

In [5]:
# Create a Series using a NumPy array of ages with the default numerical indices
ages = np.array([13,25,19])
series1 = pd.Series(ages)
print(ages)
print(series1)

[13 25 19]
0    13
1    25
2    19
dtype: int32


In [6]:
# Create a Series using a NumPy array of ages but customize the indices to be the names that correspond to each age
ages = np.array([13,25,19])
series1 = pd.Series(ages,index=['Emma', 'Swetha', 'Serajh'])
print(series1)

Emma      13
Swetha    25
Serajh    19
dtype: int32
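
The dictionary and scalar cases mentioned earlier are not shown in the cells above. Here is a minimal sketch of both, reusing the same names and ages; `default_age` and its labels are invented for the example.

```python
import pandas as pd

# Dictionary keys become the index labels.
ages_by_name = pd.Series({'Emma': 13, 'Swetha': 25, 'Serajh': 19})
print(ages_by_name['Swetha'])   # look up an age by label
print(ages_by_name.iloc[0])     # positional access still works
print(ages_by_name.mean())      # average age

# A scalar is repeated once for every label in the index.
default_age = pd.Series(18, index=['a', 'b', 'c'])
print(default_age)
```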


Another important type of object in the pandas library is the DataFrame. This object is similar in form to a matrix as it consists of rows and columns. Both rows and columns can be indexed with integers or string labels. A DataFrame can contain many different data types, but all values within a column share one data type. A column of a DataFrame is essentially a Series. All columns must have the same number of elements (rows).

There are different ways to fill a DataFrame such as with a CSV file, a SQL query, a Python list, or a dictionary. 

In [7]:
dataf = pd.DataFrame([
    ['John Smith','123 Main St',34],
    ['Jane Doe', '456 Maple Ave',28],
    ['Joe Schmo', '789 Broadway',51]
    ],
    columns=['name','address','age'])
print(dataf)

         name        address  age
0  John Smith    123 Main St   34
1    Jane Doe  456 Maple Ave   28
2   Joe Schmo   789 Broadway   51


In [8]:
dataf.set_index('name')

                  address  age
name
John Smith    123 Main St   34
Jane Doe    456 Maple Ave   28
Joe Schmo    789 Broadway   51
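
Note that `set_index` returns a new DataFrame and leaves the original untouched unless the result is assigned. The sketch below rebuilds the same small table under the hypothetical name `people` and uses the new index for a label-based lookup with `.loc`.

```python
import pandas as pd

people = pd.DataFrame(
    [['John Smith', '123 Main St', 34],
     ['Jane Doe', '456 Maple Ave', 28],
     ['Joe Schmo', '789 Broadway', 51]],
    columns=['name', 'address', 'age'])

by_name = people.set_index('name')       # returns a new DataFrame
print(by_name.loc['Jane Doe', 'age'])    # label-based lookup by name
print(by_name.columns.tolist())          # 'name' is now the index, not a column
print(people.columns.tolist())           # the original still has a 'name' column
```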


DataFrames are useful because they make it much easier to select, manipulate, and summarize data. Their tabular format (a table with rows and columns) also makes it easier to label, simpler to read, and easier to export data to and from a spreadsheet. Understanding the power of these new data structures is the key to unlocking many new avenues for data manipulation, exploration, and analysis!

## DataFrames
A DataFrame is an object that stores data as rows and columns. You can think of a DataFrame as a spreadsheet or as a SQL table. You can manually create a DataFrame or fill it with data from a CSV, an Excel spreadsheet, or a SQL query.

Each column has a name, which is usually a string. Each row has an index, which is an integer by default. DataFrames can contain many different data types: strings, ints, floats, tuples, etc.

You can pass in a dictionary to pd.DataFrame(). Each key is a column name and each value is a list of column values. The columns must all be the same length or you will get an error. 

In [9]:
df1 = pd.DataFrame({
  'Product ID': [1, 2, 3, 4],
  'Product Name': ['t-shirt', 't-shirt', 'skirt', 'skirt'],
  'Color': ['blue', 'green', 'red', 'black']
})

print(df1)

   Product ID Product Name  Color
0           1      t-shirt   blue
1           2      t-shirt  green
2           3        skirt    red
3           4        skirt  black
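
To see the length requirement in action, the following sketch deliberately passes columns of different lengths and catches the resulting `ValueError`. The shortened `Color` list is an assumption made only to trigger the error.

```python
import pandas as pd

try:
    pd.DataFrame({
        'Product ID': [1, 2, 3],
        'Color': ['blue', 'green'],   # one value short on purpose
    })
    error_message = None
except ValueError as error:
    error_message = str(error)

print(error_message)
```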


You can also add data using lists.

For example, you can pass in a list of lists, where each one represents a row of data. Use the keyword argument columns to pass a list of column names.

In [10]:
df2 = pd.DataFrame([
  [1, 'San Diego', 100],
  [2, 'Los Angeles', 120],
  [3, 'San Francisco', 90],
  [4, 'Sacramento', 115]
],
  columns=[
    'Store ID', 'Location', 'Number of Employees'
  ])

print(df2)

   Store ID       Location  Number of Employees
0         1      San Diego                  100
1         2    Los Angeles                  120
2         3  San Francisco                   90
3         4     Sacramento                  115


## Comma-Separated Values (CSV)
We now know how to create our own DataFrame. However, most of the time, we’ll be working with datasets that already exist. One of the most common formats for big datasets is the CSV.

CSV (comma-separated values) is a text-only spreadsheet format. You can find CSVs in lots of places:

* Online datasets (for example, on data.gov)
* Exports from Excel or Google Sheets
* Exports from SQL

The first row of a CSV contains column headings. All subsequent rows contain values. Each column heading and each value is separated by a comma:

column1,column2,column3
value1,value2,value3

cupcakes.csv file:

name,cake_flavor,frosting_flavor,topping
Chocolate Cake,chocolate,chocolate,chocolate shavings
Birthday Cake,vanilla,vanilla,rainbow sprinkles
Carrot Cake,carrot,cream cheese,almonds

## Loading and Saving CSVs
When you have data in a CSV, you can load it into a DataFrame in pandas using `pd.read_csv()`:

`pd.read_csv('my-csv-file.csv')`

In the example above, the `pd.read_csv()` function is called with the name of the CSV file, `my-csv-file.csv`, as its argument.

We can also save data to a CSV, using `.to_csv()`.

`df.to_csv('new-csv-file.csv')`

In [11]:
df = pd.read_csv('sample.csv')
print(df)

            City  Population  Median Age
0      Maplewood      100000          40
1          Wayne      350000          33
2  Forrest Hills      300000          35
3        Paramus      400000          55
4     Hackensack      290000          39


In [12]:
df2 = pd.read_csv('cupcakes.csv')
print(df2)

             name cake_flavor frosting_flavor             topping
0  Chocolate Cake   chocolate       chocolate  chocolate shavings
1   Birthday Cake     vanilla         vanilla   rainbow sprinkles
2     Carrot Cake      carrot    cream cheese             almonds
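
A round trip through `.to_csv()` and `pd.read_csv()` can be sketched without touching the disk by using an in-memory text buffer. The use of `io.StringIO` and `index=False` here is a choice made for this example, not something the original notebook does.

```python
import io
import pandas as pd

cupcakes = pd.DataFrame({
    'name': ['Chocolate Cake', 'Birthday Cake'],
    'cake_flavor': ['chocolate', 'vanilla'],
})

# index=False leaves the row labels out of the output; by default they are
# written as an extra, unnamed first column.
csv_text = cupcakes.to_csv(index=False)   # with no path, to_csv returns a string
print(csv_text)

restored = pd.read_csv(io.StringIO(csv_text))
print(restored.columns.tolist())
```

When writing to a real file, the same `index=False` argument avoids an `Unnamed: 0` column appearing the next time the file is loaded.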


## Inspect a DataFrame
When we load a new DataFrame from a CSV, we want to know what it looks like.

If it’s a small DataFrame, you can display it by typing `print(df)`.

If it’s a larger DataFrame, it’s helpful to be able to inspect a few items without having to look at the entire DataFrame.

The method `.head()` gives the **first 5 rows** of a DataFrame. If you want to see more rows, you can pass in the positional argument `n`. For example, `df.head(10)` would show the first 10 rows.

The method **`df.info()`** prints a summary of each column: its name, the number of non-null values, and its data type.

In [13]:
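import pandas as pd
df3 = pd.read_csv('imdb.csv')
print(df3.head())
# info() prints its summary directly and returns None,
# so wrapping it in print() also prints a trailing "None"
print(df3.info())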

   id                                       name   genre  year  imdb_rating
0   1                                     Avatar  action  2009          7.9
1   2                             Jurassic World  action  2015          7.3
2   3                               The Avengers  action  2012          8.1
3   4                            The Dark Knight  action  2008          9.0
4   5  Star Wars: Episode I - The Phantom Menace  action  1999          6.6
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 220 entries, 0 to 219
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           220 non-null    int64  
 1   name         220 non-null    object 
 2   genre        220 non-null    object 
 3   year         220 non-null    int64  
 4   imdb_rating  220 non-null    float64
dtypes: float64(1), int64(2), object(2)
memory usage: 8.7+ KB
None
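
For a quick look at size and numeric statistics, `.shape` and `.describe()` complement `.head()` and `.info()`. The sketch below uses a small invented DataFrame called `scores` so that it runs without the `imdb.csv` file.

```python
import pandas as pd

scores = pd.DataFrame({
    'player': [f'player_{i}' for i in range(12)],   # made-up data
    'score': list(range(12)),
})

print(scores.shape)         # (number of rows, number of columns)
print(len(scores.head()))   # head() returns 5 rows by default
print(scores.head(3))       # only the first 3 rows
print(scores.describe())    # count, mean, std, min, quartiles, max of numeric columns
```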


## Select Columns
Now we know how to create and load data. Let’s select parts of those datasets that are interesting or important to our analyses.

Suppose you have the DataFrame called `customers`, which contains the ages of your customers:

| name | age |
| --- | --- |
| Rebecca Erikson | 35 |
| Thomas Roberson | 28 |
| Diane Ochoa | 42 |
| … | … |

Perhaps you want to take the average or plot a histogram of the ages. In order to do either of these tasks, you’d need to select the column.

There are two possible syntaxes for selecting all values from a column:

1. Select the column as if you were selecting a value from a dictionary using a key. In our example, we would type `customers['age']` to select the ages.
2. If the name of a column follows all of the rules for a variable name (doesn’t start with a number, doesn’t contain spaces or special characters, etc.), then you can select it using the following notation: `df.MySecondColumn`. In our example, we would type `customers.age`.

When we select a single column, the result is called a Series.

In [14]:
df4 = pd.DataFrame([
  ['January', 100, 100, 23, 100],
  ['February', 51, 45, 145, 45],
  ['March', 81, 96, 65, 96],
  ['April', 80, 80, 54, 180],
  ['May', 51, 54, 54, 154],
  ['June', 112, 109, 79, 129]],
  columns=['month', 'clinic_east',
           'clinic_north', 'clinic_south',
           'clinic_west']
)

clinic_north = df4.clinic_north
print(type(clinic_north))
print(type(df4))
print(clinic_north)

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
0    100
1     45
2     96
3     80
4     54
5    109
Name: clinic_north, dtype: int64
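
The two selection syntaxes can be compared side by side. In this sketch the `customers` rows come from the table above, while the `home city` column is a hypothetical addition to show a name that only works with brackets.

```python
import pandas as pd

customers = pd.DataFrame({
    'name': ['Rebecca Erikson', 'Thomas Roberson', 'Diane Ochoa'],
    'age': [35, 28, 42],
    'home city': ['Wayne', 'Paramus', 'Maplewood'],   # hypothetical column with a space
})

ages = customers['age']
print(ages.equals(customers.age))   # both syntaxes give the same Series
print(ages.mean())                  # average age
print(customers['home city'])       # a name with a space needs the bracket syntax
```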


## Selecting Multiple Columns
When you have a larger DataFrame, you might want to select just a few columns.

For instance, let’s return to a DataFrame of `orders` from ShoeFly.com:

| id | first_name | last_name | email | shoe_type | shoe_material | shoe_color |
| --- | --- | --- | --- | --- | --- | --- |
| 54791 | Rebecca | Lindsay | RebeccaLindsay57@hotmail.com | clogs | faux-leather | black |
| 53450 | Emily | Joyce | EmilyJoyce25@gmail.com | ballet flats | faux-leather | navy |
| 91987 | Joyce | Waller | Joyce.Waller@gmail.com | sandals | fabric | black |
| 14437 | Justin | Erickson | Justin.Erickson@outlook.com | clogs | faux-leather | red |

We might just be interested in the customer’s `last_name` and `email`. We want a DataFrame like this:

| last_name | email |
| --- | --- |
| Lindsay | RebeccaLindsay57@hotmail.com |
| Joyce | EmilyJoyce25@gmail.com |
| Waller | Joyce.Waller@gmail.com |
| Erickson | Justin.Erickson@outlook.com |

To select two or more columns from a DataFrame, we use a list of the column names. To create the DataFrame shown above, we would use:

`new_df = orders[['last_name', 'email']]`

*Note:* Make sure that you have a double set of brackets (`[[]]`), or this command won’t work!

In [15]:
clinic_north_south = df4[['clinic_north', 'clinic_south']]
print(type(clinic_north_south))
print(clinic_north_south)

<class 'pandas.core.frame.DataFrame'>
   clinic_north  clinic_south
0           100            23
1            45           145
2            96            65
3            80            54
4            54            54
5           109            79
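
The difference between single and double brackets is easiest to see from the type of the result. This is a small sketch using two rows of the `orders` table shown earlier.

```python
import pandas as pd

orders = pd.DataFrame({
    'last_name': ['Lindsay', 'Joyce'],
    'email': ['RebeccaLindsay57@hotmail.com', 'EmilyJoyce25@gmail.com'],
    'shoe_type': ['clogs', 'ballet flats'],
})

one_column_series = orders['email']            # single brackets: a Series
one_column_frame = orders[['email']]           # a list with one name: a DataFrame
two_columns = orders[['last_name', 'email']]   # a list with two names: a DataFrame

print(type(one_column_series).__name__)
print(type(one_column_frame).__name__)
print(two_columns.shape)
```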


## Select Rows
Let’s revisit our orders from ShoeFly.com:

| id | first_name | last_name | email | shoe_type | shoe_material | shoe_color |
| --- | --- | --- | --- | --- | --- | --- |
| 54791 | Rebecca | Lindsay | RebeccaLindsay57@hotmail.com | clogs | faux-leather | black |
| 53450 | Emily | Joyce | EmilyJoyce25@gmail.com | ballet flats | faux-leather | navy |
| 91987 | Joyce | Waller | Joyce.Waller@gmail.com | sandals | fabric | black |
| 14437 | Justin | Erickson | Justin.Erickson@outlook.com | clogs | faux-leather | red |
| … | … | … | … | … | … | … |

Maybe our Customer Service department has just received a message from Joyce Waller, so we want to know exactly what she ordered. We want to select this single row of data.

DataFrames are zero-indexed, meaning that we start counting rows at 0. Joyce Waller’s order is the third row in the table, so its index is 2.

We select it using the following command:

`orders.iloc[2]`

When we select a single row, the result is a Series (just like when we select a single column).

In [16]:
march = df4.iloc[2]
print(march)

month           March
clinic_east        81
clinic_north       96
clinic_south       65
clinic_west        96
Name: 2, dtype: object


## Selecting Multiple Rows
You can also select multiple rows from a DataFrame.

Here are some different ways of selecting multiple rows:

* `orders.iloc[3:7]` selects the rows at positions 3 up to, but not including, 7 (i.e., positions 3, 4, 5, and 6)
* `orders.iloc[:4]` selects all rows up to, but not including, position 4 (i.e., positions 0, 1, 2, and 3)
* `orders.iloc[-3:]` selects the last three rows, from the third-to-last row through the final row

In [17]:
april_may_june = df4.iloc[3:]
print(april_may_june)

   month  clinic_east  clinic_north  clinic_south  clinic_west
3  April           80            80            54          180
4    May           51            54            54          154
5   June          112           109            79          129
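
The three slices described above can be checked on a toy DataFrame. The ten-row `orders` table below is invented so that the selected positions are easy to read off.

```python
import pandas as pd

orders = pd.DataFrame({'order': [f'order_{i}' for i in range(10)]})  # positions 0 to 9

print(orders.iloc[3:7].index.tolist())   # [3, 4, 5, 6]
print(orders.iloc[:4].index.tolist())    # [0, 1, 2, 3]
print(orders.iloc[-3:].index.tolist())   # [7, 8, 9]

single_row = orders.iloc[2]              # one row comes back as a Series
print(single_row.name)                   # its name is the row's index label, 2
```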


## Select Rows with Logic I
You can select a subset of a DataFrame by using logical statements:

`df[df.MyColumnName == desired_column_value]`

Suppose we want to select all rows where the customer’s age is 30. We would use:

`df[df.age == 30]`

In Python, == is how we test if a value is exactly equal to another value.

We can use other logical statements, such as:

* Greater Than, > — Here, we select all rows where the customer’s age is greater than 30:

`df[df.age > 30]`

* Less Than, < — Here, we select all rows where the customer’s age is less than 30:

`df[df.age < 30]`

* Not Equal, != — This snippet selects all rows where the customer’s name is not Clara Oswald:

`df[df.name != 'Clara Oswald']`

In [18]:
january = df4[df4.month == 'January']
print(january)

     month  clinic_east  clinic_north  clinic_south  clinic_west
0  January          100           100            23          100
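
Under the hood, the comparison inside the brackets produces a Series of `True`/`False` values, one per row, and the DataFrame keeps the rows marked `True`. Here is a minimal sketch with made-up ages.

```python
import pandas as pd

customers = pd.DataFrame({
    'name': ['Clara Oswald', 'Martha Jones', 'Rose Tyler', 'Amy Pond'],
    'age': [35, 28, 42, 30],   # made-up ages
})

mask = customers.age > 30
print(mask.tolist())            # one True/False per row
over_30 = customers[mask]       # rows where the mask is True
print(over_30)
not_clara = customers[customers.name != 'Clara Oswald']
print(not_clara)
```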


## Select Rows with Logic II
You can also combine multiple logical statements, as long as each statement is in parentheses.

For instance, suppose we wanted to select all rows where the customer’s age was under 30 or the customer’s name was “Martha Jones”:

`df[(df.age < 30) | (df.name == 'Martha Jones')]`

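In pandas, `|` means “or” and `&` means “and” when combining conditions row by row; Python’s `or` and `and` keywords do not work on whole columns.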

In [19]:
march_april = df4[(df4.month == 'March') | (df4.month == 'April')]
print(march_april)

   month  clinic_east  clinic_north  clinic_south  clinic_west
2  March           81            96            65           96
3  April           80            80            54          180
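
The sketch below combines conditions with `|` and `&` on an invented `customers` table, and also shows what happens when the Python keyword `or` is used instead: pandas raises a `ValueError` because a whole column has no single truth value.

```python
import pandas as pd

customers = pd.DataFrame({
    'name': ['Clara Oswald', 'Martha Jones', 'Rose Tyler', 'Amy Pond'],
    'age': [35, 28, 42, 30],   # made-up ages
})

under_30_or_martha = customers[(customers.age < 30) | (customers.name == 'Martha Jones')]
over_30_not_clara = customers[(customers.age > 30) & (customers.name != 'Clara Oswald')]
print(under_30_or_martha)
print(over_30_not_clara)

try:
    customers[(customers.age < 30) or (customers.name == 'Martha Jones')]
    keyword_error = None
except ValueError as error:
    keyword_error = str(error)
print(keyword_error)
```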


## Select Rows with Logic III

Suppose we want to select the rows where the customer’s name is either “Martha Jones”, “Rose Tyler” or “Amy Pond”.

`df[df.name.isin(['Martha Jones', 'Rose Tyler', 'Amy Pond'])]`

In [20]:
january_february_march = df4[df4.month.isin(['January', 'February', 'March'])]
print(january_february_march)

      month  clinic_east  clinic_north  clinic_south  clinic_west
0   January          100           100            23          100
1  February           51            45           145           45
2     March           81            96            65           96
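
`.isin()` is equivalent to chaining several `==` tests with `|`, and it can be inverted with `~` to select everything else. A short sketch on an invented `customers` table:

```python
import pandas as pd

customers = pd.DataFrame({
    'name': ['Clara Oswald', 'Martha Jones', 'Rose Tyler', 'Amy Pond'],
    'age': [35, 28, 42, 30],   # made-up ages
})

wanted = ['Martha Jones', 'Rose Tyler', 'Amy Pond']
selected = customers[customers.name.isin(wanted)]
everyone_else = customers[~customers.name.isin(wanted)]   # ~ negates the mask

print(selected)
print(everyone_else)
```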


## Setting indices
When we select a subset of a DataFrame using logic, we end up with non-consecutive indices. This is inelegant, and because the index labels no longer match the row positions used by `.iloc[]`, it is easy to select the wrong row.

We can fix this using the method `.reset_index()`. For example, here is a DataFrame called df with non-consecutive indices:

| | First Name | Last Name |
| --- | --- | --- |
| 0 | John | Smith |
| 4 | Jane | Doe |
| 7 | Joe | Schmo |

If we use the command `df.reset_index()`, we get a new DataFrame with a new set of indices:

| | index | First Name | Last Name |
| --- | --- | --- | --- |
| 0 | 0 | John | Smith |
| 1 | 4 | Jane | Doe |
| 2 | 7 | Joe | Schmo |

Note that the old indices have been moved into a new column called `'index'`. Unless you need those values for something special, it’s probably better to use the keyword `drop=True` so that you don’t end up with that extra column. If we run the command `df.reset_index(drop=True)`, we get a new DataFrame that looks like this:

| | First Name | Last Name |
| --- | --- | --- |
| 0 | John | Smith |
| 1 | Jane | Doe |
| 2 | Joe | Schmo |

`.reset_index()` returns a new DataFrame, but we usually just want to modify our existing DataFrame. With the keyword `inplace=True`, the existing DataFrame is modified directly and the method returns `None`.

In [21]:
df5 = df4.loc[[1, 3, 5]]
print(df5)

# With inplace=True, reset_index() returns None, so assigning the result is a mistake:
# df6 = df5.reset_index(drop=True, inplace=True)
# print(df6)

df5.reset_index(drop=True, inplace=True)
print(df5)

      month  clinic_east  clinic_north  clinic_south  clinic_west
1  February           51            45           145           45
3     April           80            80            54          180
5      June          112           109            79          129
      month  clinic_east  clinic_north  clinic_south  clinic_west
0  February           51            45           145           45
1     April           80            80            54          180
2      June          112           109            79          129
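
The three variants of `.reset_index()` can be compared on a filtered table. The `people` DataFrame and its values are invented for this sketch; the explicit `.copy()` is a precaution so that the in-place call acts on an independent object.

```python
import pandas as pd

people = pd.DataFrame({
    'first_name': ['John', 'Ann', 'Jane', 'Bob', 'Joe'],
    'age': [40, 20, 35, 25, 50],   # made-up ages
})

over_30 = people[people.age > 30]             # keeps the labels 0, 2, 4
renumbered = over_30.reset_index(drop=True)   # new DataFrame with labels 0, 1, 2
with_old_labels = over_30.reset_index()       # old labels move to an 'index' column

in_place = over_30.copy()
returned = in_place.reset_index(drop=True, inplace=True)   # modifies in_place, returns None

print(over_30.index.tolist(), renumbered.index.tolist())
print(with_old_labels.columns.tolist())
print(returned)
```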


In [22]:
orders = pd.read_csv('shoefly.csv')
print(orders.info())
print('-'*50)
print(orders.head())
print('-'*50)
emails = orders.email
print(emails)
print('-'*50)
frances_palmer = orders[(orders.first_name == 'Frances') & (orders.last_name == 'Palmer')]
print(frances_palmer)
print('-'*50)
comfy_shoes = orders[orders.shoe_type.isin(['clogs', 'boots', 'ballet flats'])]
print(comfy_shoes)
print('-'*50)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   id             20 non-null     int64 
 1   first_name     20 non-null     object
 2   last_name      20 non-null     object
 3   email          20 non-null     object
 4   shoe_type      20 non-null     object
 5   shoe_material  20 non-null     object
 6   shoe_color     20 non-null     object
dtypes: int64(1), object(6)
memory usage: 1.2+ KB
None
--------------------------------------------------
      id first_name last_name                         email     shoe_type  \
0  54791    Rebecca   Lindsay  RebeccaLindsay57@hotmail.com         clogs   
1  53450      Emily     Joyce        EmilyJoyce25@gmail.com  ballet flats   
2  91987      Joyce    Waller        Joyce.Waller@gmail.com       sandals   
3  14437     Justin  Erickson   Justin.Erickson@outlook.com         clogs   
4  79357     Andrew 

## Adding a Column I
Sometimes, we want to add a column to an existing DataFrame. We might want to add new information or perform a calculation based on the data that we already have.

One way that we can add a new column is by giving a list of the same length as the existing DataFrame.

Suppose we own a hardware store called The Handy Woman and have a DataFrame containing inventory information:

| Product ID | Product Description | Cost to Manufacture | Price |
| --- | --- | --- | --- |
| 1 | 3 inch screw | 0.50 | 0.75 |
| 2 | 2 inch nail | 0.10 | 0.25 |
| 3 | hammer | 3.00 | 5.50 |
| 4 | screwdriver | 2.50 | 3.00 |

It looks like the actual quantity of each product in our warehouse is missing!

Let’s use the following code to add that information to our DataFrame.

`df['Quantity'] = [100, 150, 50, 35]`

In [23]:
df = pd.DataFrame([
  [1, '3 inch screw', 0.5, 0.75],
  [2, '2 inch nail', 0.10, 0.25],
  [3, 'hammer', 3.00, 5.50],
  [4, 'screwdriver', 2.50, 3.00]
],
  columns=['Product ID', 'Description', 'Cost to Manufacture', 'Price']
)

# Add columns here
df['Sold in Bulk?'] = ['Yes', 'Yes', 'No', 'No']
print(df)

   Product ID   Description  Cost to Manufacture  Price Sold in Bulk?
0           1  3 inch screw                  0.5   0.75           Yes
1           2   2 inch nail                  0.1   0.25           Yes
2           3        hammer                  3.0   5.50            No
3           4   screwdriver                  2.5   3.00            No
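
The `Quantity` example from the text can be sketched as follows, along with what happens when the list has the wrong length. The `Shelf` column is a hypothetical name used only to trigger the error.

```python
import pandas as pd

inventory = pd.DataFrame({
    'Product ID': [1, 2, 3, 4],
    'Product Description': ['3 inch screw', '2 inch nail', 'hammer', 'screwdriver'],
})

inventory['Quantity'] = [100, 150, 50, 35]   # one value per existing row
print(inventory)

try:
    inventory['Shelf'] = ['A', 'B', 'C']     # hypothetical column, one value short
    mismatch_error = None
except ValueError as error:
    mismatch_error = str(error)
print(mismatch_error)
```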


## Adding a Column II
We can also add a new column that is the same for all rows in the DataFrame.

`df['In Stock?'] = True`

Now all of the rows have a column called `In Stock?` with the value `True`.

In [24]:
df['Is taxed?'] = 'Yes'
print(df)

   Product ID   Description  Cost to Manufacture  Price Sold in Bulk?  \
0           1  3 inch screw                  0.5   0.75           Yes   
1           2   2 inch nail                  0.1   0.25           Yes   
2           3        hammer                  3.0   5.50            No   
3           4   screwdriver                  2.5   3.00            No   

  Is taxed?  
0       Yes  
1       Yes  
2       Yes  
3       Yes  


## Adding a Column III
Finally, you can add a new column by performing a calculation on the existing columns.

Maybe we want to add a column to our inventory table with the amount of sales tax that we need to charge for each item. The following code multiplies each Price by 0.075, the sales tax for our state:

`df['Sales Tax'] = df.Price * 0.075`

Now our table has a column called `Sales Tax`.

In [25]:
df['Margin'] = df['Price'] - df['Cost to Manufacture']

print(df)

   Product ID   Description  Cost to Manufacture  Price Sold in Bulk?  \
0           1  3 inch screw                  0.5   0.75           Yes   
1           2   2 inch nail                  0.1   0.25           Yes   
2           3        hammer                  3.0   5.50            No   
3           4   screwdriver                  2.5   3.00            No   

  Is taxed?  Margin  
0       Yes    0.25  
1       Yes    0.15  
2       Yes    2.50  
3       Yes    0.50  
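
The `Sales Tax` calculation from the text, together with the constant `In Stock?` column from the previous section, can be sketched on a trimmed-down inventory table. The arithmetic is applied to every row at once.

```python
import pandas as pd

inventory = pd.DataFrame({
    'Description': ['3 inch screw', '2 inch nail', 'hammer', 'screwdriver'],
    'Price': [0.75, 0.25, 5.50, 3.00],
})

inventory['In Stock?'] = True                        # the same value for every row
inventory['Sales Tax'] = inventory.Price * 0.075     # computed row by row from Price
print(inventory)
```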


## Performing Column Operations
In the previous exercise, we learned how to add columns to a DataFrame.

Often, the column that we want to add is related to existing columns, but requires a calculation more complex than multiplication or addition.

We can use the `apply` method to apply a function to every value in a particular column. For example, this code overwrites the existing `'Name'` column by applying the built-in string method `str.upper` to every row in `'Name'`. (The `string.upper` function from Python 2 no longer exists in Python 3.)

`df['Name'] = df.Name.apply(str.upper)`

In [35]:
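# string.lower was removed in Python 3; the built-in str.lower does the same job.
# The line below assumes df has a 'Name' column.
# df['Lowercase Name'] = df.Name.apply(str.lower)
# print(df)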
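
In Python 3, the built-in `str.upper` and `str.lower` can be passed to `apply` directly. The sketch below uses a tiny invented DataFrame with a `Name` column and also shows the vectorized `.str` accessor, which gives the same values.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John Smith', 'Jane Doe']})   # made-up names

upper_via_apply = df.Name.apply(str.upper)   # call str.upper on each value
upper_vectorized = df.Name.str.upper()       # pandas string accessor, same values
df['Lowercase Name'] = df.Name.apply(str.lower)

print(upper_via_apply.tolist())
print(df)
```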

In [36]:
# First and last character of a string (the parameter is named s to avoid shadowing the built-in str)
mylambda = lambda s: s[0] + s[-1]
print(mylambda('Mikheltodd'))

Md


In [37]:
def myfunction(x):
    if x > 40:
        return 40 + (x - 40) * 1.50
    else:
        return x

In [38]:
myfunction = lambda x: 40 + (x - 40) * 1.50 if x > 40 else x

In [39]:
mylambda = lambda age: "Welcome to BattleCity!" if age >= 13 else "You must be 13 or older"
print(mylambda(15))

Welcome to BattleCity!


## Applying a Lambda to a Column
In pandas, we often use lambda functions to perform complex operations on columns. For example, suppose that we want to create a column containing the email provider for each email address:

`df['Email Provider'] = df.Email.apply(
    lambda x: x.split('@')[-1]
    )`
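
A runnable version of the email provider example might look like this. The addresses are taken from the `orders` table shown earlier; the DataFrame itself is built by hand for the sketch.

```python
import pandas as pd

df = pd.DataFrame({
    'Email': [
        'RebeccaLindsay57@hotmail.com',
        'EmilyJoyce25@gmail.com',
        'Justin.Erickson@outlook.com',
    ],
})

# Keep everything after the '@' in each address.
df['Email Provider'] = df.Email.apply(lambda x: x.split('@')[-1])
print(df)
```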

In [45]:
df = pd.read_csv('employees.csv')
get_last_name = lambda full_name: full_name.split(' ')[-1]
df['last_name'] = df.name.apply(get_last_name)
print(df)

       id               name  hourly_wage  hours_worked  last_name
0   10310      Lauren Durham           19            43     Durham
1   18656      Grace Sellers           17            40    Sellers
2   61254  Shirley Rasmussen           16            30  Rasmussen
3   16886        Brian Rojas           18            47      Rojas
4   89010    Samantha Mosley           11            38     Mosley
5   87246       Louis Guzman           14            39     Guzman
6   20578     Denise Mcclure           15            40    Mcclure
7   12869      James Raymond           15            32    Raymond
8   53461       Noah Collier           18            35    Collier
9   14746    Donna Frederick           20            41  Frederick
10  71127       Shirley Beck           14            32       Beck
11  92522    Christina Kelly            8            44      Kelly
12  22447        Brian Noble           11            39      Noble
13  61654          Randy Key           16            38       

## Applying a Lambda to a Row

We can also operate on multiple columns at once. If we use `apply` without specifying a single column and add the argument `axis=1`, the input to our lambda function will be an entire row, not a column. To access particular values of the row, we use the syntax `row.column_name` or `row['column_name']`.

If we want to add in the price with tax for each line, we’ll need to look at two columns: `Price` and `Is taxed?`.

If `Is taxed?` is `Yes`, then we’ll want to multiply `Price` by 1.075 (for 7.5% sales tax).

If `Is taxed?` is `No`, we’ll just have `Price` without multiplying it.

We can create this column using a lambda function and the keyword axis=1:

`df['Price with Tax'] = df.apply(lambda row:
     row['Price'] * 1.075
     if row['Is taxed?'] == 'Yes'
     else row['Price'],
     axis=1
)`
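
The row-wise calculation can be tried on a two-row table. In this sketch the `'No'` value is made up so that both branches of the lambda are exercised.

```python
import pandas as pd

df = pd.DataFrame({
    'Description': ['3 inch screw', 'hammer'],
    'Price': [0.75, 5.50],
    'Is taxed?': ['Yes', 'No'],   # the 'No' row is invented to reach the else branch
})

# axis=1 passes each row to the lambda, so it can read several columns at once.
df['Price with Tax'] = df.apply(
    lambda row: row['Price'] * 1.075 if row['Is taxed?'] == 'Yes' else row['Price'],
    axis=1
)
print(df)
```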

In [49]:
total_earned = lambda row: row['hours_worked'] * row['hourly_wage'] if row['hours_worked'] <= 40 else (40 * row['hourly_wage']) + (row['hours_worked'] - 40) * (row['hourly_wage'] * 1.50)

df['total_earned'] = df.apply(total_earned, axis = 1)

print(df[['hourly_wage','hours_worked', 'total_earned']][df['hourly_wage']<15])

    hourly_wage  hours_worked  total_earned
4            11            38         418.0
5            14            39         546.0
10           14            32         448.0
11            8            44         368.0
12           11            39         429.0
14           14            48         728.0
15           14            42         602.0
16           11            48         572.0
17           12            38         456.0
19           13            49         695.5


## Renaming Columns

When we get our data from other sources, we often want to change the column names. For example, we might want all of the column names to follow variable name rules, so that we can use `df.column_name` (which tab-completes) rather than `df['column_name']` (which takes up extra space).

You can change all of the column names at once by setting the `.columns` property to a different list. This is great when you need to change all of the column names at once, but be careful! You can easily mislabel columns if you get the ordering wrong. Here’s an example:

`df = pd.DataFrame({
    'name': ['John', 'Jane', 'Sue', 'Fred'],
    'age': [23, 29, 21, 18]
})
df.columns = ['First Name', 'Age']`

This command edits the existing DataFrame df.
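
The sketch below illustrates both the convenience and the risk of assigning to `.columns`: a list in the wrong order is accepted silently, while a list of the wrong length raises a `ValueError`. The `City` name is hypothetical and only used to trigger the error.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Sue', 'Fred'],
    'age': [23, 29, 21, 18]
})

df.columns = ['First Name', 'Age']        # correct order
print(df.columns.tolist())

mislabeled = df.copy()
mislabeled.columns = ['Age', 'First Name']   # wrong order, but no error is raised
print(mislabeled['Age'].tolist())            # now holds the names

try:
    df.columns = ['First Name', 'Age', 'City']   # one name too many
    length_error = None
except ValueError as error:
    length_error = str(error)
print(length_error)
```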

In [51]:
df = pd.read_csv('imdb.csv')
df.columns = ['ID', 'Title', 'Category', 'Year Released', 'Rating']
print(df)

      ID                                      Title Category  Year Released  \
0      1                                     Avatar   action           2009   
1      2                             Jurassic World   action           2015   
2      3                               The Avengers   action           2012   
3      4                            The Dark Knight   action           2008   
4      5  Star Wars: Episode I - The Phantom Menace   action           1999   
..   ...                                        ...      ...            ...   
215  216                                   Hannibal    drama           2001   
216  217                        Catch Me If You Can    drama           2002   
217  218                                  Big Daddy    drama           1999   
218  219                                      Se7en    drama           1995   
219  220                                      Seven    drama           1979   

     Rating  
0       7.9  
1       7.3  
2       8

## Renaming Columns II
You also can rename individual columns by using the `.rename` method. Pass a dictionary like the one below to the columns keyword argument:

`{'old_column_name1': 'new_column_name1', 'old_column_name2': 'new_column_name2'}`

Here’s an example:

`df = pd.DataFrame({
    'name': ['John', 'Jane', 'Sue', 'Fred'],
    'age': [23, 29, 21, 18]
})
df.rename(columns={
    'name': 'First Name',
    'age': 'Age'},
    inplace=True)`
    
    
The code above will rename `name` to `First Name` and `age` to `Age`.

Using rename with only the columns keyword will create a new DataFrame, leaving your original DataFrame unchanged. That’s why we also passed in the keyword argument `inplace=True`. Using `inplace=True` lets us edit the original DataFrame.

There are several reasons why `.rename` is preferable to `.columns`:

* You can rename just one column
* You can be specific about which column names are getting changed (with `.columns` you can accidentally switch column names if you’re not careful)

**_Note_**: _If you misspell one of the original column names, this command won’t fail. It just won’t change anything._
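
The silent behavior described in the note can be demonstrated, and `rename` also accepts `errors='raise'` to turn a misspelled name into a `KeyError`. The misspelling `'nmae'` below is deliberate and invented for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Sue', 'Fred'],
    'age': [23, 29, 21, 18]
})

renamed = df.rename(columns={'name': 'First Name'})   # renames only one column
unchanged = df.rename(columns={'nmae': 'First Name'}) # misspelled: nothing happens
print(renamed.columns.tolist())
print(unchanged.columns.tolist())

try:
    df.rename(columns={'nmae': 'First Name'}, errors='raise')
    rename_error = None
except KeyError as error:
    rename_error = str(error)
print(rename_error)
```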

In [59]:
df.rename(columns={'movie_title': 'Movie_Title'}, inplace=True)
print(df)
# df.rename(columns={'Movie_Title': 'Title'}, inplace=True)

      ID                                Movie_Title Category  Year Released  \
0      1                                     Avatar   action           2009   
1      2                             Jurassic World   action           2015   
2      3                               The Avengers   action           2012   
3      4                            The Dark Knight   action           2008   
4      5  Star Wars: Episode I - The Phantom Menace   action           1999   
..   ...                                        ...      ...            ...   
215  216                                   Hannibal    drama           2001   
216  217                        Catch Me If You Can    drama           2002   
217  218                                  Big Daddy    drama           1999   
218  219                                      Se7en    drama           1995   
219  220                                      Seven    drama           1979   

     Rating  
0       7.9  
1       7.3  
2       8

In [67]:
orders = pd.read_csv('shoefly_v2.csv')

print(orders.info())
print('-'*50)
print(orders.head())
print('-'*50)
orders['shoe_source'] = orders.shoe_material.apply(lambda x: 'animal' if x == 'leather' else 'vegan')

orders['salutation'] = orders.apply(lambda row: 'Dear Mr. ' + row['last_name'] if row['gender'] == 'male' else 'Dear Ms. ' + row['last_name'], axis=1)

print(orders[orders.shoe_source == 'vegan'][['first_name', 'last_name', 'gender', 'shoe_type']])

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   id             20 non-null     int64 
 1   first_name     20 non-null     object
 2   last_name      20 non-null     object
 3   gender         20 non-null     object
 4   email          20 non-null     object
 5   shoe_type      20 non-null     object
 6   shoe_material  20 non-null     object
 7   shoe_color     20 non-null     object
dtypes: int64(1), object(7)
memory usage: 1.4+ KB
None
--------------------------------------------------
      id first_name last_name  gender                         email  \
0  54791    Rebecca   Lindsay  female  RebeccaLindsay57@hotmail.com   
1  53450      Emily     Joyce  female        EmilyJoyce25@gmail.com   
2  91987      Joyce    Waller  female        Joyce.Waller@gmail.com   
3  14437     Justin  Erickson    male   Justin.Erickson@outlook.com   
4  7935