# 06. Adding and Removing Rows and Columns

---

In the last notebook, we looked at how we can update data in Pandas. In this notebook, we explore adding/removing rows and columns in Pandas.

As usual, we start by creating a dictionary in pure Python and turning it into a Pandas DataFrame, so we can see how the examples look when applied:

In [1]:
import pandas as pd

In [2]:
people = {'First name': ['Adam', 'John', 'Jake', 'Jane'],
          'Last name': ['Smith', 'Doe', 'Doe', 'Snow'],
          'Email': ['adamsmith@gmail.com', 'johndoe@notreally.com', 'jakedoe@notawebsite.org', 'janesnow@mail.com']}

In [3]:
people_df = pd.DataFrame(people)

In [4]:
people_df

Unnamed: 0,First name,Last name,Email
0,Adam,Smith,adamsmith@gmail.com
1,John,Doe,johndoe@notreally.com
2,Jake,Doe,jakedoe@notawebsite.org
3,Jane,Snow,janesnow@mail.com


Let's say we want to add a new column to our DataFrame.

We can do that by simply specifying the name of the new column and the values of this column.

For example, let's create a `Full name` column by combining the first and last name of a person:

In [5]:
people_df['Full name'] = people_df['First name'] + ' ' + people_df['Last name']

In [6]:
people_df

Unnamed: 0,First name,Last name,Email,Full name
0,Adam,Smith,adamsmith@gmail.com,Adam Smith
1,John,Doe,johndoe@notreally.com,John Doe
2,Jake,Doe,jakedoe@notawebsite.org,Jake Doe
3,Jane,Snow,janesnow@mail.com,Jane Snow


As we see, the effect takes place immediately.

This works because the expression on the right-hand side evaluates to a Series, which Pandas stores as the new column. To create a column containing numerical values, we can, for example, use the `apply` method from the last notebook instead of our string expression.
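
As a minimal sketch of the `apply` approach, assuming a small stand-in DataFrame (the `Last name length` column is a made-up example, not part of the notebook's dataset):

```python
import pandas as pd

# Small stand-in for the DataFrame built above (values are illustrative).
df = pd.DataFrame({'First name': ['Adam', 'John'],
                   'Last name': ['Smith', 'Doe']})

# apply calls len on every value of the Series and returns a new Series
# of integers, which we assign to a new column using bracket notation.
df['Last name length'] = df['Last name'].apply(len)

print(df)
```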

**Note: We can't use dot notation to create a new column, as Pandas will assume we are trying to set an attribute on the DataFrame object rather than create a column. So we have to use brackets to create new columns.**
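
The difference can be checked with a short experiment. This is a sketch on a stand-in DataFrame; the `initials` name is hypothetical, and the warning is silenced only to keep the output clean:

```python
import warnings
import pandas as pd

df = pd.DataFrame({'First name': ['Adam', 'John']})

with warnings.catch_warnings():
    # Pandas warns when a list-like is assigned through a new attribute name;
    # we ignore the warning here for the sake of the demonstration.
    warnings.simplefilter('ignore')
    df.initials = ['A', 'J']      # sets a plain Python attribute, not a column

created_by_dot = 'initials' in df.columns
print(created_by_dot)             # False

df['initials'] = ['A', 'J']       # bracket notation creates the column
created_by_brackets = 'initials' in df.columns
print(created_by_brackets)        # True
```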

Let's say we now want to remove our first and last name columns, as we no longer need them. For that, we use Pandas' `drop` method:

In [7]:
people_df.drop(columns=['First name', 'Last name'], inplace=True)

In [8]:
people_df

Unnamed: 0,Email,Full name
0,adamsmith@gmail.com,Adam Smith
1,johndoe@notreally.com,John Doe
2,jakedoe@notawebsite.org,Jake Doe
3,janesnow@mail.com,Jane Snow


As we can see, passing `inplace=True` drops the columns from `people_df` itself. Without it, `drop` returns a new DataFrame and leaves the original unchanged, which is useful for previewing the result before committing to it.

Note: We pass a list of column names because we are dropping multiple columns at once.
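
To make the preview behaviour concrete, here is a small sketch on a stand-in DataFrame (the `preview` variable name is an assumption for illustration). It also shows that a single column can be passed as a plain string, and that reassigning the result is an alternative to `inplace=True`:

```python
import pandas as pd

df = pd.DataFrame({'First name': ['Adam', 'John'],
                   'Last name': ['Smith', 'Doe'],
                   'Email': ['adamsmith@gmail.com', 'johndoe@notreally.com']})

# Without inplace=True, drop returns a new DataFrame and leaves df untouched.
preview = df.drop(columns='Email')   # a single column name needs no list

print(list(preview.columns))         # ['First name', 'Last name']
print(list(df.columns))              # ['First name', 'Last name', 'Email']

# Reassigning the result has the same effect as inplace=True.
df = df.drop(columns=['First name', 'Last name'])
print(list(df.columns))              # ['Email']
```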

Now, let's go for something a bit different: splitting one column into multiple columns.

We can do that using the string method `split`, which Pandas exposes through the `.str` accessor:

In [9]:
people_df['Full name'].str.split(' ')

0    [Adam, Smith]
1      [John, Doe]
2      [Jake, Doe]
3     [Jane, Snow]
Name: Full name, dtype: object

This returns a list of names for each row in the column. To get the result as a DataFrame instead, we can pass `expand=True`:

In [10]:
people_df['Full name'].str.split(' ', expand=True)

Unnamed: 0,0,1
0,Adam,Smith
1,John,Doe
2,Jake,Doe
3,Jane,Snow


We can now assign these two resulting columns to two new column names to get the effect we're after:

In [11]:
people_df[['First', 'Last']] = people_df['Full name'].str.split(' ', expand=True)
people_df

Unnamed: 0,Email,Full name,First,Last
0,adamsmith@gmail.com,Adam Smith,Adam,Smith
1,johndoe@notreally.com,John Doe,John,Doe
2,jakedoe@notawebsite.org,Jake Doe,Jake,Doe
3,janesnow@mail.com,Jane Snow,Jane,Snow


**Remember: We must use 2 sets of square brackets to access multiple columns at once!**
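
One caveat worth sketching: a name with more than two parts would be split into more than two columns, which no longer matches two target column names. Assuming we want everything after the first space to count as the last name, the `n` argument of `str.split` limits the number of splits (the names below are made up):

```python
import pandas as pd

# Hypothetical names, one of which contains two spaces.
names = pd.DataFrame({'Full name': ['Adam Smith', 'Mary Jane Watson']})

# n=1 splits only at the first space, so the result always has two columns.
names[['First', 'Last']] = names['Full name'].str.split(' ', n=1, expand=True)

print(names)
```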

Now, let's look at how we can add rows to our DataFrame. There are multiple objectives we may want to achieve here:

- Adding a single row to our DataFrame.
- Combining DataFrames by appending the rows of one to the end of the other.

To add a single row to our DataFrame, we can use the `loc` indexer with a new index label, assigning either a dictionary containing the values for select columns or a list of all columns' values in order. For instance:

In [12]:
people_df.loc[len(people_df.index)] = {'Email': "hermionegranger@hogwarts.magic", 'Full name': 'Hermione Granger'}

In [13]:
people_df

Unnamed: 0,Email,Full name,First,Last
0,adamsmith@gmail.com,Adam Smith,Adam,Smith
1,johndoe@notreally.com,John Doe,John,Doe
2,jakedoe@notawebsite.org,Jake Doe,Jake,Doe
3,janesnow@mail.com,Jane Snow,Jane,Snow
4,hermionegranger@hogwarts.magic,Hermione Granger,,


Note: Any values that we do not specify are set to `NaN` by default.
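
The list form mentioned above can be sketched as follows, on a stand-in DataFrame with made-up values. The list must contain one value per column, in column order:

```python
import pandas as pd

df = pd.DataFrame({'Email': ['adamsmith@gmail.com'],
                   'Full name': ['Adam Smith']})

# For a default 0..n-1 index, len(df.index) is the next unused label.
df.loc[len(df.index)] = ['lunalovegood@hogwarts.magic', 'Luna Lovegood']  # made-up values

print(df)
```

Note that `len(df.index)` is only a safe label while the index is still the default 0 to n-1 range. If rows have been dropped, that label may already exist, in which case the assignment overwrites the existing row instead of adding a new one.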

To concatenate another DataFrame at the end of our DataFrame, we first create a second DataFrame to concatenate:

In [14]:
people_2 = {'First': ['Alex', 'Jordan'],
            'Last': ['Jones', 'Pricks'],
            'Email': ['ajones@gmail.com', 'jpricks@notreally.com'],
            'Favorite color': ['Blue', 'Red']}

new_df = pd.DataFrame(people_2)

In [15]:
new_df

Unnamed: 0,First,Last,Email,Favorite color
0,Alex,Jones,ajones@gmail.com,Blue
1,Jordan,Pricks,jpricks@notreally.com,Red


Now, we can append this new DataFrame to the original one by using the `pd.concat` function:

In [16]:
pd.concat([people_df, new_df])

Unnamed: 0,Email,Full name,First,Last,Favorite color
0,adamsmith@gmail.com,Adam Smith,Adam,Smith,
1,johndoe@notreally.com,John Doe,John,Doe,
2,jakedoe@notawebsite.org,Jake Doe,Jake,Doe,
3,janesnow@mail.com,Jane Snow,Jane,Snow,
4,hermionegranger@hogwarts.magic,Hermione Granger,,,
0,ajones@gmail.com,,Alex,Jones,Blue
1,jpricks@notreally.com,,Jordan,Pricks,Red


**Note: Since `pd.concat()` returns a new DataFrame and has no `inplace` flag, we have to reassign the result to the variable of the DataFrame we want to lengthen if we want the changes to persist.**
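
A minimal sketch of that reassignment, using two small stand-in DataFrames (`first` and `second` are hypothetical names, not from the notebook):

```python
import pandas as pd

first = pd.DataFrame({'Email': ['adamsmith@gmail.com'], 'Full name': ['Adam Smith']})
second = pd.DataFrame({'Email': ['ajones@gmail.com'], 'Favorite color': ['Blue']})

# Calling concat on its own builds a new DataFrame; first is unchanged.
pd.concat([first, second])
print(len(first))   # 1

# Reassigning the result is what makes the change stick.
first = pd.concat([first, second])
print(len(first))   # 2
```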

Alternatively, we can of course store the results in a new variable:

In [17]:
big_df = pd.concat([new_df, people_df], ignore_index=True)
big_df

Unnamed: 0,First,Last,Email,Favorite color,Full name
0,Alex,Jones,ajones@gmail.com,Blue,
1,Jordan,Pricks,jpricks@notreally.com,Red,
2,Adam,Smith,adamsmith@gmail.com,,Adam Smith
3,John,Doe,johndoe@notreally.com,,John Doe
4,Jake,Doe,jakedoe@notawebsite.org,,Jake Doe
5,Jane,Snow,janesnow@mail.com,,Jane Snow
6,,,hermionegranger@hogwarts.magic,,Hermione Granger


**Note: The resulting DataFrame takes its column order from the FIRST DataFrame in the list passed to `pd.concat`, with any columns that appear only in later DataFrames added after them. It contains all columns from all participating DataFrames, with any missing data set to `NaN`.**

**Note: Not using `ignore_index=True` results in a DataFrame with multiple rows having the same indices, as each participating DataFrame keeps its original indices!**
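
The effect of duplicate indices is easy to see on a tiny example. This is a sketch with stand-in DataFrames `a` and `b` (hypothetical names):

```python
import pandas as pd

a = pd.DataFrame({'Email': ['adamsmith@gmail.com', 'johndoe@notreally.com']})
b = pd.DataFrame({'Email': ['ajones@gmail.com']})

# Each DataFrame keeps its own labels, so label 0 appears twice.
stacked = pd.concat([a, b])
print(stacked.index.tolist())      # [0, 1, 0]
print(stacked.loc[0])              # selects two rows, not one

# ignore_index=True renumbers the rows from 0.
renumbered = pd.concat([a, b], ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2]
```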

Lastly, let's look at removing rows in Pandas. For that, we can also use the `drop` method, this time passing the index label of the row we want to remove:

In [18]:
big_df.drop(5, inplace=True)
big_df

Unnamed: 0,First,Last,Email,Favorite color,Full name
0,Alex,Jones,ajones@gmail.com,Blue,
1,Jordan,Pricks,jpricks@notreally.com,Red,
2,Adam,Smith,adamsmith@gmail.com,,Adam Smith
3,John,Doe,johndoe@notreally.com,,John Doe
4,Jake,Doe,jakedoe@notawebsite.org,,Jake Doe
6,,,hermionegranger@hogwarts.magic,,Hermione Granger


If we want to drop multiple rows at a time that satisfy a certain condition, we can do that using `loc`, but we can also do that using `drop` with a filter.

For example, let's say we want to remove every row with the last name "Doe" from our DataFrame using `drop`:

In [23]:
filter_df = big_df['Last'] == 'Doe'
big_df.drop(index=big_df[filter_df].index)

Unnamed: 0,First,Last,Email,Favorite color,Full name
0,Alex,Jones,ajones@gmail.com,Blue,
1,Jordan,Pricks,jpricks@notreally.com,Red,
2,Adam,Smith,adamsmith@gmail.com,,Adam Smith
6,,,hermionegranger@hogwarts.magic,,Hermione Granger


Remember: With `drop`, we have to pass `inplace=True` (or reassign the result) for the change to persist. The output above is only a preview.
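
For comparison, here is a sketch of both approaches on a stand-in DataFrame: dropping the matching index labels, and keeping the rows that do not match by negating the filter with `~` inside `loc`. The variable names are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({'First': ['Alex', 'John', 'Jake'],
                   'Last': ['Jones', 'Doe', 'Doe']})

is_doe = df['Last'] == 'Doe'                 # boolean Series, one value per row

dropped = df.drop(index=df[is_doe].index)    # remove the rows that match
kept = df.loc[~is_doe]                       # keep the rows that do not match

print(dropped.equals(kept))                  # True
```

Both calls return a new DataFrame and leave `df` unchanged.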

One other handy method we can use to drop rows, courtesy of Pinecone's James Briggs, is `drop_duplicates`, which we'll only hint at here. Using it would not make sense for this dataset, however, as it contains no duplicates.

Suppose, however, that we only wanted to keep one example of a given attribute/feature: only one "Kevin" from the Stack Overflow dataset, for example, or only one developer from each of the countries represented in the dataset. We could do the following to implement that:

In [1]:
# df.drop_duplicates(subset=["column name"], keep="first")

As with many other Pandas methods, `drop_duplicates` needs `inplace=True` (or a reassignment of the result) for the changes to persist. Otherwise, what we see is only a preview.
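
Since the Stack Overflow dataset is not loaded in this notebook, the following sketch uses a small made-up DataFrame (`devs`, with hypothetical `Name` and `Country` columns) to show the idea:

```python
import pandas as pd

# Made-up data standing in for a larger survey dataset.
devs = pd.DataFrame({'Name': ['Kevin', 'Maria', 'Kevin', 'Li'],
                     'Country': ['USA', 'Spain', 'Canada', 'Spain']})

# Keep only the first row for each distinct value in the subset column.
one_per_name = devs.drop_duplicates(subset=['Name'], keep='first')
one_per_country = devs.drop_duplicates(subset=['Country'], keep='first')

print(one_per_name)
print(one_per_country)
```

Here the second "Kevin" is removed from `one_per_name`, and the second developer from Spain is removed from `one_per_country`, while `devs` itself keeps all four rows.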