The following Python dictionary holds the data for a single person. You can turn it into a dataframe with a header row and 1 row of data.

In [1]:
person = {
    "first": "Drew",
    "last": "Dodds",
    "email": "drewdodds@email.com"
}
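
As a minimal sketch of that one-row case (assuming pandas is installed; `single_df` is just a name picked for this example): wrapping the dict in a list tells pandas to treat it as one record.

```python
import pandas as pd

person = {
    "first": "Drew",
    "last": "Dodds",
    "email": "drewdodds@email.com"
}

# A dict of plain (scalar) values has no row information, so wrap it in a list
# to say "this is one row". pd.DataFrame(person) on its own raises a ValueError.
single_df = pd.DataFrame([person])
print(single_df)
```

The keys become the column headers and the values fill the single row.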

This is a dictionary that represents multiple people. Each key now maps to a list of values, so it can turn into a dataframe with multiple rows. Notice that the values for each key are held in a list.

In [2]:
people = {
    "first": ["Drew", "Jane", "John"],
    "last": ["Dodds", "Doe", "Smith"],
    "email": ["drewdodds@email.com","JaneDoe@email.com","johnsmith@email.com"]
}

We can think of this as rows and columns: the keys are the columns and the items in each list are the rows. This is a 2-dimensional data structure. And even though we haven't turned it into a dataframe yet, we can access a key the same way we would access a dataframe column.

In [3]:
people['email']

['drewdodds@email.com', 'JaneDoe@email.com', 'johnsmith@email.com']

DataFrames are really similar to dicts but have much more functionality. They are much more than a dictionary of lists. Now, we are going to turn this into a dataframe.

In [4]:
import pandas as pd

In [5]:
df = pd.DataFrame(people)

In [6]:
print(df)

  first   last                email
0  Drew  Dodds  drewdodds@email.com
1  Jane    Doe    JaneDoe@email.com
2  John  Smith  johnsmith@email.com


We turned this into a dataframe. You can see the index on the left as well.

Let's now access the values and info within the dataframe. 
1. Let's access the values of a single column 

When you call a column by its label like this, you are directly accessing it. You can see the type it returns below, which is a series. Dataframes are a bunch of series put together.

In [7]:
df['email']

0    drewdodds@email.com
1      JaneDoe@email.com
2    johnsmith@email.com
Name: email, dtype: object

In [8]:
type(df['email'])

pandas.core.series.Series

A dataframe is rows and columns, while a series is the rows of a single column. So a dataframe is a container for multiple series.

Let's say we wanted to access multiple columns. There are two pairs of brackets here: the outer pair does the indexing and the inner pair is a list of the column names we want.

In [9]:
df[['last','email']]

Unnamed: 0,last,email
0,Dodds,drewdodds@email.com
1,Doe,JaneDoe@email.com
2,Smith,johnsmith@email.com


Also, since we are getting multiple columns, we no longer get a series back. We get a filtered-down dataframe: a new dataframe with just those two columns.

In [10]:
type(df[['last','email']])

pandas.core.frame.DataFrame

You can also see all of the column names by running this:

In [11]:
df.columns

Index(['first', 'last', 'email'], dtype='object')

Okay, so how do we access rows? We use "loc" and "iloc". iloc lets us access rows by integer location (hence the name), that is, by position. The difference between calling a column label and using loc or iloc is that calling a label returns the entire column, whereas loc and iloc return rows.

In [12]:
df.iloc[0]

first                   Drew
last                   Dodds
email    drewdodds@email.com
Name: 0, dtype: object

We passed in the 0 which gives us the first row. It returns a series that contains the values of that first row. Notice in the output - the index is now the column names not the 0,1,2, etc. above.  

What if we wanted more than 1 row? We can pass in a list of integers. See below. We then get the first 2 rows of data. You have to pass in a list though.

In [13]:
df.iloc[[0,1]]

Unnamed: 0,first,last,email
0,Drew,Dodds,drewdodds@email.com
1,Jane,Doe,JaneDoe@email.com


Now we are getting a dataframe with these multiple rows. With these indexers we can also select columns as well. And that is going to be the second value we pass into the outer brackets. If you think of iloc and loc as functions then we can think of the rows that we want as the first argument and the columns as the second arg. 

In [14]:
df.iloc[[0,1],2]

0    drewdodds@email.com
1      JaneDoe@email.com
Name: email, dtype: object

We got the first 2 rows by passing 0 and 1 as the first arg, and we got the third column b/c we passed 2 as the second arg.

Let's now look at the loc indexer. With iloc, we were searching by integer location. With loc, we are searching by label. When we are talking about labels for rows, these will be the index values. In this case, the index is a default range of integers, so loc looks the same as iloc here, but it really is not.

In [15]:
df.loc[0]

first                   Drew
last                   Dodds
email    drewdodds@email.com
Name: 0, dtype: object

This can be confusing b/c the row index is an integer, so it looks like iloc above, but here we are pulling the first row based on its label, not its integer position.
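
To make the label vs. position difference visible, here is a small sketch. The frame `demo` and its 10/20/30 row labels are made up for illustration; they are not part of the people data above.

```python
import pandas as pd

# Hypothetical frame whose row labels are NOT the default 0, 1, 2.
demo = pd.DataFrame({"first": ["Drew", "Jane", "John"]}, index=[10, 20, 30])

by_position = demo.iloc[0]   # first row, by integer position
by_label = demo.loc[10]      # the row whose label is 10

print(by_position["first"], by_label["first"])

# There is no row labelled 0 here, so loc raises a KeyError.
try:
    demo.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False
print(label_zero_found)
```

With the default index the two indexers happen to agree, which is why they look the same in the cells above.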

Now I will get two rows

In [16]:
df.loc[[0,1]]

Unnamed: 0,first,last,email
0,Drew,Dodds,drewdodds@email.com
1,Jane,Doe,JaneDoe@email.com


Just like with iloc, we can also pass in a list to specify multiple rows we want to return like above. 

And just like iloc, we can also pass a second value into the indexer to select specific columns. With iloc we use integers to pick the columns, but with loc we use the column labels. See the example below.

In [17]:
df.loc[[0,1], 'email']

0    drewdodds@email.com
1      JaneDoe@email.com
Name: email, dtype: object

We can also pass in a list for the columns as well. See below

In [18]:
df.loc[[0,1],['email', 'last']]

Unnamed: 0,email,last
0,drewdodds@email.com,Dodds
1,JaneDoe@email.com,Doe


Now watching vid 3: https://www.youtube.com/watch?v=W9XjRYFkkyw 

We are going to learn about indexing. First I will print the df again to see it.

In [19]:
print(df)

  first   last                email
0  Drew  Dodds  drewdodds@email.com
1  Jane    Doe    JaneDoe@email.com
2  John  Smith  johnsmith@email.com


You can see above the first column that has no name is the index. That is just a default index. Which is just a range of numbers that's basically an integer identifier for the rows. 

Sometimes it makes more sense to have a different identifier for a row. (in SQL tables this would be the primary key).

Pandas doesn't actually enforce that index values are unique, and sometimes they won't be. But most of the time they are unique values. So in the df above, email will probably be the best choice.
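
A quick sketch of what a non-unique index looks like in practice. The frame `dup` and its repeated email label are invented for this example.

```python
import pandas as pd

dup = pd.DataFrame(
    {"first": ["Drew", "Jane", "John"]},
    index=["a@email.com", "a@email.com", "b@email.com"],  # made-up, repeated label
)

# pandas accepts the duplicate labels without complaint.
print(dup.index.is_unique)

# A repeated label gives back every matching row (a dataframe, not a single series).
matches = dup.loc["a@email.com"]
print(len(matches))
```

Checking `index.is_unique` before relying on label lookups is one way to avoid surprises.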

What if we wanted to set the index of the df to email?

In [20]:
df.set_index('email')

email,first,last
drewdodds@email.com,Drew,Dodds
JaneDoe@email.com,Jane,Doe
johnsmith@email.com,John,Smith


Now, we can see the email is on the far left and it's bold. It looks like a normal column but it is now the index. However, if you print the df again, it looks like no change has taken place. See below.

In [21]:
df

Unnamed: 0,first,last,email
0,Drew,Dodds,drewdodds@email.com
1,Jane,Doe,JaneDoe@email.com
2,John,Smith,johnsmith@email.com


That is b/c set_index returns a new dataframe and leaves the original alone unless you explicitly tell it otherwise. That way, you can experiment before making the final change. Let's go ahead and make the email column the new index by passing inplace=True.

In [22]:
df.set_index('email', inplace=True)

df

email,first,last
drewdodds@email.com,Drew,Dodds
JaneDoe@email.com,Jane,Doe
johnsmith@email.com,John,Smith


There we go! Let's look at the index now. 

In [23]:
df.index

Index(['drewdodds@email.com', 'JaneDoe@email.com', 'johnsmith@email.com'], dtype='object', name='email')

Why is this useful? Think back to loc, which is label based. Before, the row labels were just the default integers. Now the labels are more meaningful b/c we can pass in a specific email.

In [24]:
df.loc['drewdodds@email.com']

first     Drew
last     Dodds
Name: drewdodds@email.com, dtype: object

What if we just wanted the last name of this person? The first arg is the row, and the second is the column we want to see.

In [25]:
df.loc['drewdodds@email.com','last']

'Dodds'

What if you want to reset the index? 

In [26]:
df.reset_index(inplace=True) # moves email back into a regular column and restores the default 0, 1, 2 index

Now if you actually know what you want the index to be when you're creating your dataframe then you can set it there instead of later. You can do that as you are loading data in from a csv for example.
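
Here is a sketch of setting the index while loading. The CSV text is made up to mirror the people data, and `io.StringIO` stands in for a real file path; `index_col` is the `read_csv` parameter that picks the index column.

```python
import io
import pandas as pd

# Stand-in for a file on disk; the content mirrors the people dictionary above.
csv_text = """first,last,email
Drew,Dodds,drewdodds@email.com
Jane,Doe,JaneDoe@email.com
John,Smith,johnsmith@email.com
"""

# index_col makes the email column the index right away.
people_df = pd.read_csv(io.StringIO(csv_text), index_col="email")
print(people_df.loc["JaneDoe@email.com", "last"])
```

With a real file you would pass the path instead of the `StringIO` object.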

Let's look at filtering now. Let's filter on last name and see if we can filter for where last name = 'Doe'. What the below code will return is a series object. 

In [27]:
df['last'] == 'Doe'

0    False
1     True
2    False
Name: last, dtype: bool

Maybe you thought we would get a dataframe back. But what we got back is a series of True/False values. These True/False values line up with the rows of the original dataframe.

Where you see 'True' means that those records met the filter criteria. And false is where it did not. So, this is what is called a filter mask.

Now, let's apply this filter to our dataframe.

In [28]:
filt = (df['last'] == 'Doe') # The parentheses are not required; they just separate the assignment (=) from the comparison (==)

Now that we have created the filter variable above, we can run it against the dataframe, as you will see below.

In [29]:
df[filt]

Unnamed: 0,email,first,last
1,JaneDoe@email.com,Jane,Doe


And now you can see the filter worked. You might see a variable made and used to filter, like the example above. Or you might see the comparison written directly within the brackets, like the code below. That gives you the same thing. It's just more difficult to read.

In [30]:
df[df['last'] == 'Doe']

Unnamed: 0,email,first,last
1,JaneDoe@email.com,Jane,Doe


Another way we can narrow down rows is using the .loc indexer. Pandas can be a little confusing for this reason: there are multiple ways to do things. The benefit of using .loc here is that we can also select which columns we want to see.

In [31]:
df.loc[filt]

Unnamed: 0,email,first,last
1,JaneDoe@email.com,Jane,Doe


In [32]:
df.loc[filt, 'first']

1    Jane
Name: first, dtype: object

So this returned a series of the first names for the rows where the last name was 'Doe'.

Okay, let's now learn how to combine filters. For "and" we have to use the "&" symbol, and for "or" we have to use the pipe operator "|".

So let's say we wanted to build a filter where the last name is 'Doe' and the first name is 'Jane'. See code below.

In [33]:
filt_2 = (df['last'] == 'Doe') & (df['first'] == 'Jane')

df.loc[filt_2, 'first']

1    Jane
Name: first, dtype: object

Now let's do an or statement

In [34]:
filt_3 = (df['last'] == 'Dodds') | (df['first'] == 'John')

df.loc[filt_3, 'first']

0    Drew
2    John
Name: first, dtype: object

We can also get the opposite of a filter. We could build a new filter for everything else, OR we can put a tilde "~" in front of the filter, which negates it and returns the opposite rows. See below. We should get Jane back.

In [35]:
df.loc[~filt_3, 'first']

1    Jane
Name: first, dtype: object
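
The three operators can be sketched together on a fresh copy of the original people data. `df_demo`, `is_doe` and `is_john` are names chosen for this example. It also shows why the plain Python `and` keyword does not work on a series.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "first": ["Drew", "Jane", "John"],
    "last": ["Dodds", "Doe", "Smith"],
})

is_doe = df_demo["last"] == "Doe"
is_john = df_demo["first"] == "John"

either = df_demo.loc[is_doe | is_john, "first"].tolist()      # rows matching at least one filter
both = df_demo.loc[is_doe & is_john, "first"].tolist()        # rows matching both filters
neither = df_demo.loc[~(is_doe | is_john), "first"].tolist()  # the negated "or" filter
print(either, both, neither)

# Python's `and` tries to reduce a whole series to a single bool, which pandas refuses to do.
try:
    is_doe and is_john
    and_worked = True
except ValueError:
    and_worked = False
print(and_worked)
```

Each comparison also needs its own parentheses when written inline, b/c `&` and `|` bind more tightly than `==`.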

Updating Rows and Columns: https://www.youtube.com/watch?v=DCDe29sIKcE

We will update columns first and then rows and then take the learnings to the large dataset from stack overflow data. 

In [36]:
df.columns

Index(['email', 'first', 'last'], dtype='object')

What if we wanted to rename the columns?

In [37]:
df.columns = ['email', 'first_name', 'last_name']
df.columns
df

Unnamed: 0,email,first_name,last_name
0,drewdodds@email.com,Drew,Dodds
1,JaneDoe@email.com,Jane,Doe
2,johnsmith@email.com,John,Smith


This is really used when you want to change the names of all your columns. It's not necessary if you just want to change one name.

Another common need is to change something specific about the column names in our dataframes. For example, what if we wanted all the column names to be upper case?

You can use a list comprehension in this case. 

In [38]:
df.columns = [x.upper() for x in df.columns]
df

Unnamed: 0,EMAIL,FIRST_NAME,LAST_NAME
0,drewdodds@email.com,Drew,Dodds
1,JaneDoe@email.com,Jane,Doe
2,johnsmith@email.com,John,Smith


Let's go back to lower

In [39]:
df.columns = [x.lower() for x in df.columns]
df

Unnamed: 0,email,first_name,last_name
0,drewdodds@email.com,Drew,Dodds
1,JaneDoe@email.com,Jane,Doe
2,johnsmith@email.com,John,Smith


We can also rename specific columns if we want using a dict. 


In [40]:
df.rename(columns={'first_name': 'first','last_name': 'last'}) # this change will not take place unless you set inplace to True. See below
df

Unnamed: 0,email,first_name,last_name
0,drewdodds@email.com,Drew,Dodds
1,JaneDoe@email.com,Jane,Doe
2,johnsmith@email.com,John,Smith


In [41]:
df.rename(columns={'first_name': 'first','last_name': 'last'}, inplace=True) # with inplace=True the rename is applied to df itself
df

Unnamed: 0,email,first,last
0,drewdodds@email.com,Drew,Dodds
1,JaneDoe@email.com,Jane,Doe
2,johnsmith@email.com,John,Smith


Now let's take a look at updating the data in our rows. Let's change Jane Doe's last name to Smith.

We could overwrite every value in the row, like the code below.

In [42]:
df.loc[1] = ['JaneSmith@email.com','Jane','Smith']
df

Unnamed: 0,email,first,last
0,drewdodds@email.com,Drew,Dodds
1,JaneSmith@email.com,Jane,Smith
2,johnsmith@email.com,John,Smith


But what if we had a lot of columns and only wanted to change a couple of values?

In [43]:
df.loc[1, ['last','email']] = ['Doe','JaneDoe@email.com']
df

Unnamed: 0,email,first,last
0,drewdodds@email.com,Drew,Dodds
1,JaneDoe@email.com,Jane,Doe
2,johnsmith@email.com,John,Smith


How do we change a single value? This time we don't need a list of columns and a list of new values, just one column label and one value.

In [44]:
df.loc[1, 'last']  = 'Smith'
df

Unnamed: 0,email,first,last
0,drewdodds@email.com,Drew,Dodds
1,JaneDoe@email.com,Jane,Smith
2,johnsmith@email.com,John,Smith


What if you wanted to filter the dataframe on specific criteria, and then change values on the matching records?

You can build the filters the same way you did above, BUT you still have to use the pandas indexers like loc and iloc to change the values.

In [45]:
filt_4 = (df['email'] == 'johnsmith@email.com')
df.loc[filt_4, 'last'] = 'Doe'
df

Unnamed: 0,email,first,last
0,drewdodds@email.com,Drew,Dodds
1,JaneDoe@email.com,Jane,Smith
2,johnsmith@email.com,John,Doe
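
A small sketch of that pattern on its own. `df_demo`, `mask` and the replacement value 'Smith-Jones' are made up for this example. The point is that the row filter and the column label go into one .loc call; the chained form `df[mask]['last'] = ...` generally writes to a temporary copy instead, and pandas warns about it.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "email": ["drewdodds@email.com", "JaneDoe@email.com", "johnsmith@email.com"],
    "last": ["Dodds", "Doe", "Smith"],
})

mask = df_demo["last"] == "Smith"

# Row selector and column selector in ONE .loc call, so the write hits df_demo itself.
df_demo.loc[mask, "last"] = "Smith-Jones"  # made-up replacement value
print(df_demo)
```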


Now, let's change the email addresses to all lower case.

In [46]:
df['email'].str.lower() # this will just return a series of the lower case values of that field. It doesn't actually make the change yet. 
# in order to make the actual change you have to do this (see below)
df['email'] = df['email'].str.lower()
df

Unnamed: 0,email,first,last
0,drewdodds@email.com,Drew,Dodds
1,janedoe@email.com,Jane,Smith
2,johnsmith@email.com,John,Doe


Now they are all lower! That is one way to change all the rows at once. Maybe we want to do something a little more advanced. 

There are several ways to do this, and we will go over each one individually and explain it in detail.

The 4 methods are:
1. apply
2. map
3. applymap
4. replace

Let's work on these in order.

Apply is used for calling a function on our values. Apply can work on either a dataframe or a series object, and the behavior differs depending on the object.
Let's first see how it works on a series. Let's say we wanted to see the length of all the email addresses. We can apply the len() function to each value in the series like this:


In [47]:
df['email'].apply(len) # this is a quick way to see the answers. But we can also use it to update the values as well. 

0    19
1    17
2    19
Name: email, dtype: int64

Now let's use apply to update values. First, a simple function that upper-cases an email:

In [48]:
def update_email (email):
    return email.upper()

In [70]:
# let's now apply this function to our emails. 
df['email'].apply(update_email)  # again ...this will just show what it would do but not update the values in the actual df yet. See how to apply below. 

0    DREWDODDS@EMAIL.COM
1      JANEDOE@EMAIL.COM
2    JOHNSMITH@EMAIL.COM
Name: email, dtype: object

In [50]:
df['email'] = df['email'].apply(update_email)
df

Unnamed: 0,email,first,last
0,DREWDODDS@EMAIL.COM,Drew,Dodds
1,JANEDOE@EMAIL.COM,Jane,Smith
2,JOHNSMITH@EMAIL.COM,John,Doe


And now the changes have taken place as seen in the output above. You can also use what are called lambda functions. Now let's convert these back to lower case using a lambda function,
which is a function passed in line without being defined with a name, i.e. it's "anonymous". See below.

In [51]:
df['email'] = df['email'].apply(lambda x: x.lower())
df

Unnamed: 0,email,first,last
0,drewdodds@email.com,Drew,Dodds
1,janedoe@email.com,Jane,Smith
2,johnsmith@email.com,John,Doe


We are working with strings here but you can also do numbers, etc. 

Here we were working with a series. When we run apply on a series, it runs the function on each value of the series. Now, let's see how it works with a dataframe.

When we run apply on a dataframe it runs a function on each row or column of that dataframe.

The difference is:
df['email'].apply() which is on a series
vs
df.apply() which will be on the entire dataframe

So running apply on a series applies a function to every value in the series. And running apply on a dataframe, as in the cells below, applies a function to every series in the dataframe.

In [52]:
df.apply(len) # you would expect this to operate the same, right? No, it is now applying the len function to each series. It tells you the number of rows in each column.

email    3
first    3
last     3
dtype: int64

In [53]:
len(df['email']) # which you can run that on each column and get the individual value (which is the number of values for that column)

3

In [54]:
# here we can also have this apply to rows too like this
df.apply(len, axis='columns') # it is saying row one has 3 values. 

0    3
1    3
2    3
dtype: int64

So when using apply on an entire dataframe, we want to use functions that make sense to run on a series object.
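
As a sketch of the row-wise case: with axis='columns' each row arrives as a series indexed by the column names, so the function can combine several columns. `df_demo` and the `initials` result are illustrative names, not part of the notebook above.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "first": ["Drew", "Jane"],
    "last": ["Dodds", "Doe"],
})

# axis="columns" hands each ROW to the function as a series keyed by column name.
initials = df_demo.apply(lambda row: row["first"][0] + row["last"][0], axis="columns")
print(initials.tolist())
```

The result is a series with one value per row.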

For example, let's say we wanted to grab the min value of each column. Series objects have a min method, so we can pass that into apply and see the min value for each series.

And the min will really be alphabetical since these are string objects. 

In [71]:
df.apply(pd.Series.min)

email    drewdodds@email.com
first                   Drew
last                   Dodds
dtype: object

So this shows the value that comes first alphabetically in email, then in first name, etc. We can use lambda functions with this as well.

In [56]:
df.apply(lambda x: x.min())

email    drewdodds@email.com
first                   Drew
last                   Dodds
dtype: object

So running apply on a series applies a function to every value in a series. And running apply to a dataframe like we did above applies a function to every series in the dataframe.

Is there a way to apply a function to every individual element in the dataframe? That's what applymap is used for.

applymap only works on a dataframe.

Let's see how this is different.

In [57]:
df.applymap(len)

Unnamed: 0,email,first,last
0,19,4,5
1,17,4,5
2,19,4,3


We can see that this applies the len function to each individual value in the dataframe,
instead of to each series.

So the first email has 19 characters, the first first name has 4, the first last name has 5, etc.

What if we wanted all of the strings to be lower case? Let's use applymap() here

In [58]:
df.applymap(str.lower)

Unnamed: 0,email,first,last
0,drewdodds@email.com,drew,dodds
1,janedoe@email.com,jane,smith
2,johnsmith@email.com,john,doe
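
One version note: in pandas 2.1 and later, DataFrame.applymap is deprecated in favor of DataFrame.map, which does the same element-wise job. The sketch below picks whichever one the installed pandas provides; `df_demo` and `elementwise` are names chosen for this example.

```python
import pandas as pd

df_demo = pd.DataFrame({"first": ["Drew", "Jane"], "last": ["Dodds", "Doe"]})

# Newer pandas has DataFrame.map; older versions only have DataFrame.applymap.
elementwise = df_demo.map if hasattr(df_demo, "map") else df_demo.applymap

lengths = elementwise(len)  # len is applied to every individual cell
print(lengths)
```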


Now, let's look at the map method. The map we use here is the series method. It is used for substituting each value in a series with another value.

For example, let's say we wanted to substitute a couple of our first names.

In [59]:
df['first'].map({'Drew':'Andrew','Jane':'Mary'})

0    Andrew
1      Mary
2       NaN
Name: first, dtype: object

This returns a series where those first names were substituted out. You can see too that the values we didn't substitute were converted to NaN values.
And that may or may not be what we wanted. 
What if we wanted to keep john but substitute the others? 

We can use the replace method. 

In [60]:
df['first'].replace({'Drew':'Andrew','Jane':'Mary'})

0    Andrew
1      Mary
2      John
Name: first, dtype: object

Now, we can see it's the same result as above but now we actually kept John. And again. This doesn't actually change the dataframe. If we wanted to make that change then
we would run this. 


In [61]:
df['first'] = df['first'].replace({'Drew':'Andrew','Jane':'Mary'})
df

Unnamed: 0,email,first,last
0,drewdodds@email.com,Andrew,Dodds
1,janedoe@email.com,Mary,Smith
2,johnsmith@email.com,John,Doe
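
The difference between the two methods can be sketched side by side. The series `first` and the dict `substitutions` are rebuilt here so the example runs on its own. It also shows one way to keep the unmapped values when using map: filling the NaN gaps from the original series.

```python
import pandas as pd

first = pd.Series(["Drew", "Jane", "John"], name="first")
substitutions = {"Drew": "Andrew", "Jane": "Mary"}

mapped = first.map(substitutions)                        # John is not in the dict, so it becomes NaN
mapped_filled = first.map(substitutions).fillna(first)   # fall back to the original value
replaced = first.replace(substitutions)                  # John is left untouched

print(mapped.tolist())
print(mapped_filled.tolist())
print(replaced.tolist())
```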


Now we will go over how to add/remove rows and columns from dataframes: https://www.youtube.com/watch?v=HQ6XO9eT-fc

First let's look at adding columns. We can simply create a column and pass in a series of values that we want that column to have. 

In [62]:
df['first'] + ' ' + df['last'] # so now we have an output series with the first and last values combined. 

0    Andrew Dodds
1      Mary Smith
2        John Doe
dtype: object

To store the combined values as a new column, we assign them to a new column name:

In [63]:
df['full_name'] = df['first'] + ' ' + df['last']
df

Unnamed: 0,email,first,last,full_name
0,drewdodds@email.com,Andrew,Dodds,Andrew Dodds
1,janedoe@email.com,Mary,Smith,Mary Smith
2,johnsmith@email.com,John,Doe,John Doe


So that's how we add a column. Let's say we no longer need the first and last name columns. Here is how we remove them.

In [64]:
df.drop(columns=['first','last']) # this is just a preview

Unnamed: 0,email,full_name
0,drewdodds@email.com,Andrew Dodds
1,janedoe@email.com,Mary Smith
2,johnsmith@email.com,John Doe


In [65]:
df.drop(columns=['first','last'], inplace=True)
df

Unnamed: 0,email,full_name
0,drewdodds@email.com,Andrew Dodds
1,janedoe@email.com,Mary Smith
2,johnsmith@email.com,John Doe


Now, if we wanted to reverse the process and we wanted to split the column into first and last columns then we can do this

In [66]:
df['full_name'].str.split(' ') # the result is the first and last name in a list. Here is how we assign these two new columns

0    [Andrew, Dodds]
1      [Mary, Smith]
2        [John, Doe]
Name: full_name, dtype: object

In [67]:
df['full_name'].str.split(' ', expand=True) # here is the preview. So now we need to set two columns in our df to these new returned columns

Unnamed: 0,0,1
0,Andrew,Dodds
1,Mary,Smith
2,John,Doe


In [68]:
df[['first', 'last']] = df['full_name'].str.split(' ', expand=True) # here we are passing a list of new columns.
df

Unnamed: 0,email,full_name,first,last
0,drewdodds@email.com,Andrew Dodds,Andrew,Dodds
1,janedoe@email.com,Mary Smith,Mary,Smith
2,johnsmith@email.com,John Doe,John,Doe


That worked. We can see we added the first and last name columns.
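
One thing to watch with the split: a name with more than one space produces more than two pieces. A hedged sketch, using a made-up name with a middle name, is to cap the number of splits with the `n` argument. `names` and `parts` are illustrative names.

```python
import pandas as pd

# The second name is invented to show what happens with a middle name.
names = pd.Series(["Andrew Dodds", "Mary Jane Smith"], name="full_name")

# n=1 splits on the first space only, so we always get exactly two columns.
parts = names.str.split(" ", n=1, expand=True)
parts.columns = ["first", "rest"]
print(parts)
```

Whether the extra words belong with the first or the last name depends on your data, so treat this as one option rather than the rule.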

So that was adding and removing columns. 

Let's now look at adding/removing rows of data.

There are a couple of ways. First, we might just want to add a single row of data. Second, maybe we want to combine dataframes by appending rows of data.

First, let's look at adding a row to the current dataframe. 

In [69]:
df.append({'first': 'Tony'}) # if we run this by itself we get an error b/c it doesn't have an index. 


TypeError: Can only append a dict if ignore_index=True

In [None]:
df.append({'first': 'Tony'}, ignore_index=True) # we only assigned this row a first name value so the rest are null. So we can add all the values if we want.


Unnamed: 0,email,full_name,first,last
0,drewdodds@email.com,Andrew Dodds,Andrew,Dodds
1,janedoe@email.com,Mary Smith,Mary,Smith
2,johnsmith@email.com,John Doe,John,Doe
3,,,Tony,


Let's now append a new dataframe to this one.

In [73]:
new_people = {
    "first": ["Tony", "Steve"],
    "last": ["Stark", "Rogers"],
    "email": ["ironman@avenge.com", "cap@avenge.com"]
}
df2 = pd.DataFrame(new_people)
df2

Unnamed: 0,first,last,email
0,Tony,Stark,ironman@avenge.com
1,Steve,Rogers,cap@avenge.com


Now, these have conflicting indexes and they also have columns that are not in the same order. So we want to ignore the index here as well.


In [74]:
df.append(df2, ignore_index=True, sort=False) # this is just the preview. Let's apply it


Unnamed: 0,email,full_name,first,last
0,drewdodds@email.com,Andrew Dodds,Andrew,Dodds
1,janedoe@email.com,Mary Smith,Mary,Smith
2,johnsmith@email.com,John Doe,John,Doe
3,ironman@avenge.com,,Tony,Stark
4,cap@avenge.com,,Steve,Rogers


When the columns of the two dataframes are not in the same order, older pandas versions give a warning that there are multiple ways to order them. Don't worry too much about that.
Passing sort=False, as the code above already does, gets rid of that warning.

In [75]:
df = df.append(df2, ignore_index=True, sort=False)
df


Unnamed: 0,email,full_name,first,last
0,drewdodds@email.com,Andrew Dodds,Andrew,Dodds
1,janedoe@email.com,Mary Smith,Mary,Smith
2,johnsmith@email.com,John Doe,John,Doe
3,ironman@avenge.com,,Tony,Stark
4,cap@avenge.com,,Steve,Rogers
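
A version note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a current install the cells above raise an AttributeError. A minimal sketch of the same steps with pd.concat follows. `df_a`, `df_b`, `combined` and `with_tony` are names made up for this example, and `df_a` is a one-row stand-in for the dataframe above.

```python
import pandas as pd

df_a = pd.DataFrame({
    "email": ["drewdodds@email.com"],
    "full_name": ["Andrew Dodds"],
    "first": ["Andrew"],
    "last": ["Dodds"],
})
df_b = pd.DataFrame({
    "first": ["Tony", "Steve"],
    "last": ["Stark", "Rogers"],
    "email": ["ironman@avenge.com", "cap@avenge.com"],
})

# Stack the rows of both frames; ignore_index=True renumbers them 0, 1, 2.
combined = pd.concat([df_a, df_b], ignore_index=True, sort=False)
print(combined)

# A single row can be added the same way by wrapping the dict in a one-row frame.
with_tony = pd.concat([df_a, pd.DataFrame([{"first": "Tony"}])], ignore_index=True)
print(with_tony)
```

Columns missing from one of the frames are filled with NaN, just like in the append output above.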


Now, let's look at how to drop a row. We can do that similar to how we drop a column but instead of specifying the columns we want to drop we can simply pass 
in the indexes we want to drop. 

In [None]:
df.drop(index=4) # this is just a preview. You can see we no longer have steve rogers

Unnamed: 0,email,full_name,first,last
0,drewdodds@email.com,Andrew Dodds,Andrew,Dodds
1,janedoe@email.com,Mary Smith,Mary,Smith
2,johnsmith@email.com,John Doe,John,Doe
3,ironman@avenge.com,,Tony,Stark


You might want to do something a little more complicated and drop rows based on a condition.

You can do this using loc. But we can also do it using drop.

If you want to drop all rows where the last name is 'Doe':

df.drop(index=df[df['last'] == 'Doe'].index)

I can't show much with this since my data doesn't have multiple people named Doe.

A version that is easier to read:

filt = df['last'] == 'Doe'
df.drop(index=df[filt].index)
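
Here is a runnable sketch of the conditional drop on made-up data that does have two 'Doe' rows. `df_demo`, `filt_doe`, `dropped` and `kept` are names chosen for this example. It also shows the equivalent "keep everything that is not Doe" form using the negated filter.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "first": ["Drew", "Jane", "John"],
    "last": ["Dodds", "Doe", "Doe"],  # two made-up 'Doe' rows
})

filt_doe = df_demo["last"] == "Doe"

# Drop by passing the index labels of the matching rows.
dropped = df_demo.drop(index=df_demo[filt_doe].index)

# Same result, expressed as keeping the rows that do NOT match.
kept = df_demo[~filt_doe]

print(dropped)
```

Both return a new dataframe and leave `df_demo` itself unchanged.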

Now we are going to learn about sorting data: https://www.youtube.com/watch?v=T11QYVfZoD0

We'll look at how to sort on a column, sort on multiple columns, and grab min and max values.

Let's say we want to sort this dataframe we have above. Maybe by last name. 

In [76]:
df.sort_values(by='last')

Unnamed: 0,email,full_name,first,last
0,drewdodds@email.com,Andrew Dodds,Andrew,Dodds
2,johnsmith@email.com,John Doe,John,Doe
4,cap@avenge.com,,Steve,Rogers
1,janedoe@email.com,Mary Smith,Mary,Smith
3,ironman@avenge.com,,Tony,Stark


What if we wanted to sort in descending order?

In [None]:
df.sort_values(by='last', ascending=False)

Unnamed: 0,email,full_name,first,last
3,ironman@avenge.com,,Tony,Stark
1,janedoe@email.com,Mary Smith,Mary,Smith
4,cap@avenge.com,,Steve,Rogers
2,johnsmith@email.com,John Doe,John,Doe
0,drewdodds@email.com,Andrew Dodds,Andrew,Dodds


Sometimes your sorts can get a little complicated like if you want to sort on multiple columns. In order to do this we can pass in a list of columns we want to sort on. 

In [None]:
df.sort_values(by=['last','first'], ascending=False)

Unnamed: 0,email,full_name,first,last
3,ironman@avenge.com,,Tony,Stark
1,janedoe@email.com,Mary Smith,Mary,Smith
4,cap@avenge.com,,Steve,Rogers
2,johnsmith@email.com,John Doe,John,Doe
0,drewdodds@email.com,Andrew Dodds,Andrew,Dodds


Sometimes you may want to sort multiple columns and you want one in asc order and another in desc. To do this we can pass in a list of boolean values to the ascending arg.

In [None]:
df.sort_values(by=['last','first'],ascending=[False, True]) # if you want to make this permanent, you can set the inplace arg to True

Unnamed: 0,email,full_name,first,last
3,ironman@avenge.com,,Tony,Stark
1,janedoe@email.com,Mary Smith,Mary,Smith
4,cap@avenge.com,,Steve,Rogers
2,johnsmith@email.com,John Doe,John,Doe
0,drewdodds@email.com,Andrew Dodds,Andrew,Dodds
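
In the people data every last name is different, so the second sort key never gets used. The sketch below uses made-up rows with a repeated last name to show the mixed ordering at work: last name descending, then first name ascending within each last name. `df_demo` and `ordered` are illustrative names.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "first": ["John", "Jane", "Drew", "Adam"],
    "last": ["Doe", "Doe", "Dodds", "Smith"],  # 'Doe' appears twice on purpose
})

# One True/False per column in `by`: last descending, first ascending.
ordered = df_demo.sort_values(by=["last", "first"], ascending=[False, True])
print(ordered)
```

Within the two 'Doe' rows, Jane comes before John b/c the first name is sorted ascending.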


In [None]:
df.sort_values(by=['last','first'],ascending=[False, True], inplace=True)
df

Unnamed: 0,email,full_name,first,last
3,ironman@avenge.com,,Tony,Stark
1,janedoe@email.com,Mary Smith,Mary,Smith
4,cap@avenge.com,,Steve,Rogers
2,johnsmith@email.com,John Doe,John,Doe
0,drewdodds@email.com,Andrew Dodds,Andrew,Dodds


If you wanted to switch it back to sort by index you can do this. 

In [None]:
df.sort_index()

Unnamed: 0,email,full_name,first,last
0,drewdodds@email.com,Andrew Dodds,Andrew,Dodds
1,janedoe@email.com,Mary Smith,Mary,Smith
2,johnsmith@email.com,John Doe,John,Doe
3,ironman@avenge.com,,Tony,Stark
4,cap@avenge.com,,Steve,Rogers


If you want to sort a single column you can also sort a series. If you wanted to just sort the names and not see the entire df then do this. 

In [None]:
df['last'].sort_values()

0     Dodds
2       Doe
4    Rogers
1     Smith
3     Stark
Name: last, dtype: object