## Recap:

- We know how to create a `DataFrame` (the pandas name for a table).
- We know how to find the number of `rows` and `columns` of our table.
- We know how to display the first `n` rows of our table, for example the first 5 rows.
- We know how to create a new column and populate it with values.
- We know how to filter our table by a specific condition.
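The recap items above can be sketched in one cell. The `orders` table and its column names are made up for illustration only; they are not part of the earlier lesson.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the recap items
orders = pd.DataFrame(
    data=[[1, 'pen', 3], [2, 'book', 12], [3, 'lamp', 25]],
    columns=['order_id', 'item', 'price'],
)

n_rows, n_columns = orders.shape             # number of rows and columns
first_two = orders.head(2)                   # first n rows, here n = 2
orders['double_price'] = orders['price'] * 2 # create and populate a new column
expensive = orders[orders['price'] > 10]     # filter by a condition
```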

In [4]:
customer_data = [
    [1, 'Ella', 'emily@example.com'],
    [2, 'David', 'michael@example.com'],
    [3, 'Zachary', 'sarah@example.com'],
    [4, 'Alice', 'john@example.com'],
    [5, 'Finn', 'john@example.com'],
    [6, 'Violet', 'alice@example.com']
]

In [6]:
customer_data_columns = ['customer_id', 'name', 'email']

In [8]:
import pandas as pd

In [10]:
df = pd.DataFrame(data=customer_data, columns=customer_data_columns)

In [12]:
df

Unnamed: 0,customer_id,name,email
0,1,Ella,emily@example.com
1,2,David,michael@example.com
2,3,Zachary,sarah@example.com
3,4,Alice,john@example.com
4,5,Finn,john@example.com
5,6,Violet,alice@example.com


In [22]:
df['email']

0      emily@example.com
1    michael@example.com
2      sarah@example.com
3       john@example.com
4       john@example.com
5      alice@example.com
Name: email, dtype: object

In [24]:
df.drop_duplicates(subset='email', inplace=True)

In [26]:
df

Unnamed: 0,customer_id,name,email
0,1,Ella,emily@example.com
1,2,David,michael@example.com
2,3,Zachary,sarah@example.com
3,4,Alice,john@example.com
5,6,Violet,alice@example.com


## ^ Checkpoint: `.drop_duplicates()`

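removes rows that repeat a value already seen in an earlier row. By default the first occurrence is kept, which is why row 4 (Finn) was dropped above and row 3 (Alice) stayed.

`.drop_duplicates()` takes `subset=`, the column (or list of columns) to check for duplicates, and `inplace=True`, which applies the change to your original table instead of returning a new one.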
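A small sketch of two variations, assuming a cut-down `contacts` table (a hypothetical name): leaving out `inplace=True` so the original is untouched, and using `keep='last'` to keep the last occurrence instead of the first. `reset_index(drop=True)` is optional and just renumbers the rows after some were dropped.

```python
import pandas as pd

# Hypothetical table with one repeated email
contacts = pd.DataFrame(
    data=[
        [4, 'Alice', 'john@example.com'],
        [5, 'Finn', 'john@example.com'],
        [6, 'Violet', 'alice@example.com'],
    ],
    columns=['customer_id', 'name', 'email'],
)

# Without inplace=True a new table is returned; `contacts` keeps all 3 rows
keep_first = contacts.drop_duplicates(subset='email')              # keep='first' is the default
keep_last = contacts.drop_duplicates(subset='email', keep='last')  # keep the last occurrence

# Renumber the rows 0, 1, ... after dropping
tidy = keep_first.reset_index(drop=True)
```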

In [47]:
student_data = [
    [1, 'Ella', None],
    [2, 'David', 'michael@example.com'],
    [3, None, 'sarah@example.com'],
    [4, 'Alice', None],
    [5, None, 'john@example.com'],
    [6, 'Violet', 'alice@example.com']
]

In [49]:
student_data_columns = ['student_id', 'name', 'email']

In [51]:
df2 = pd.DataFrame(data=student_data, columns=student_data_columns)

In [55]:
df2.dropna(subset='name', inplace=True)

In [59]:
df2

Unnamed: 0,student_id,name,email
0,1,Ella,
1,2,David,michael@example.com
3,4,Alice,
5,6,Violet,alice@example.com


In [69]:
# If you want to drop empty values for more than one column,
# write the subset as a list [...]

# Similarly, for .drop_duplicates() you can select multiple 
# columns using a list

df2.dropna(subset=['name', 'email'], inplace=True)

In [71]:
df2

Unnamed: 0,student_id,name,email
1,2,David,michael@example.com
5,6,Violet,alice@example.com


## ^ Checkpoint: `.dropna()`

removes the rows that have a missing value (`None` / `NaN`) in the given column or columns.

`.dropna()` accepts the same two arguments we used with `.drop_duplicates()`: `subset=` and `inplace=True`.
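Before dropping anything, it can help to count the missing values per column. The sketch below uses a hypothetical `students` table and also shows the `how=` argument: with a list in `subset=`, the default `how='any'` drops a row if any listed column is missing, while `how='all'` drops it only if all of them are missing.

```python
import pandas as pd

# Hypothetical table with some missing names and emails
students = pd.DataFrame(
    data=[
        [1, 'Ella', None],
        [2, 'David', 'michael@example.com'],
        [3, None, None],
    ],
    columns=['student_id', 'name', 'email'],
)

# Count the missing values in each column
missing_per_column = students.isna().sum()

# Drop a row if name OR email is missing (default how='any')
any_missing = students.dropna(subset=['name', 'email'])

# Drop a row only if name AND email are both missing
all_missing = students.dropna(subset=['name', 'email'], how='all')
```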

In [75]:
salary_data = [
    ['Jack', 20000],
    ['Piper', 75000],
    ['Mia', 60000],
    ['Ulysses', 55000]
]

In [77]:
df3 = pd.DataFrame(data=salary_data, columns=['name', 'salary'])

In [79]:
df3

Unnamed: 0,name,salary
0,Jack,20000
1,Piper,75000
2,Mia,60000
3,Ulysses,55000


1. Focus in on the salary column
2. Double each value in each row
3. Update the salary column with the doubled salary

In [84]:
df3['salary'] = df3['salary'] * 2

In [86]:
df3

Unnamed: 0,name,salary
0,Jack,40000
1,Piper,150000
2,Mia,120000
3,Ulysses,110000


## ^ Updating existing values of a column:

- We access the values in the `salary` column and multiply each one by 2 (`df3['salary'] * 2`).
- We assign the result back to the `salary` column, replacing the old values (`df3['salary'] =`).
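The same pattern works with other arithmetic. As a sketch that goes one step beyond the cells above, `.loc` can update only the rows that match a condition. The `pay` table and the numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical salary table
pay = pd.DataFrame(
    data=[['Jack', 20000], ['Piper', 75000]],
    columns=['name', 'salary'],
)

# Update every row: add a flat 1000 to each salary
pay['salary'] = pay['salary'] + 1000

# Update only some rows: raise salaries below 50000 up to 50000
pay.loc[pay['salary'] < 50000, 'salary'] = 50000
```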

## So far...

1. `.drop_duplicates()` function
2. `.dropna()` function
3. Modifying the existing row values of a column

In [91]:
names_data = [
    [1, 'Ella', 'emily@example.com'],
    [2, 'David', 'michael@example.com'],
    [3, 'Zachary', 'sarah@example.com'],
    [4, 'Alice', 'john@example.com'],
    [5, 'Finn', 'john@example.com'],
    [6, 'Violet', 'alice@example.com']
]

In [107]:
names_data_columns = ['id', 'first', 'email']

In [109]:
df4 = pd.DataFrame(data=names_data, columns=names_data_columns)

In [111]:
df4

Unnamed: 0,id,first,email
0,1,Ella,emily@example.com
1,2,David,michael@example.com
2,3,Zachary,sarah@example.com
3,4,Alice,john@example.com
4,5,Finn,john@example.com
5,6,Violet,alice@example.com


In [113]:
df4.rename(columns={
    'id': 'student_id',
    'first': 'first_name',
    'email': 'email_address'
})

Unnamed: 0,student_id,first_name,email_address
0,1,Ella,emily@example.com
1,2,David,michael@example.com
2,3,Zachary,sarah@example.com
3,4,Alice,john@example.com
4,5,Finn,john@example.com
5,6,Violet,alice@example.com


## ^ Checkpoint: `.rename()`

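renames the columns of your table.

It takes an argument called `columns=`, where you provide a mapping from each old column name to its new column name. Without `inplace=True`, `.rename()` returns a new table and leaves `df4` unchanged, which is why `df4.dtypes` below still lists the old names.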
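To keep the renamed columns, store the table that `.rename()` returns. A minimal sketch with a hypothetical one-row `people` table:

```python
import pandas as pd

# Hypothetical table with short column names
people = pd.DataFrame(data=[[1, 'Ella']], columns=['id', 'first'])

# .rename() returns a new table; store it to keep the new names
renamed = people.rename(columns={'id': 'student_id', 'first': 'first_name'})

# The original table still has the old names
old_names = list(people.columns)
new_names = list(renamed.columns)
```

Passing `inplace=True`, as with `.drop_duplicates()` and `.dropna()`, would change `people` itself instead.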

In [120]:
df4.dtypes

id        int64
first    object
email    object
dtype: object

In [122]:
# .dtype is an attribute, not a function, so it takes no parentheses
df4['first'].dtype

dtype('O')