<center> <img src="https://yildirimcaglar.github.io/ds3000/ds3000.png"> </center>

<center> <h2> Data Wrangling Cont'd </h2></center>

## Outline
1. <a href='#1'>Multilevel/Hierarchical Indexing</a>
2. <a href='#2'>Working with Missing Values in DataFrames</a>
3. <a href='#3'>Working with Duplicates</a>
4. <a href='#4'>Replacing Values</a>
5. <a href='#5'>Renaming Axis Indices</a>

<a id="1"></a>

## 1. Multilevel/Hierarchical Indexing
* Can index a DataFrame using two or more columns as indices
* Use the **set_index** method and pass in a **list of column names**

In [1]:
import pandas as pd

In [2]:
# Requires the output of the previous notebook saved as a CSV file
df = pd.read_csv("res/ave_grades.csv")

In [3]:
df

Unnamed: 0,Student,House,Potion_Ave,Charm_Ave
0,Hermione Granger,Gryffindor,100.0,100.0
1,Anthony Goldstein,Ravenclaw,89.0,87.0
2,Harry Potter,Gryffindor,88.0,90.0
3,Lisa Turpin,Ravenclaw,86.5,84.0
4,Michael Corner,Ravenclaw,85.5,86.5
5,Draco Malfoy,Slytherin,84.5,81.0
6,Susan Bones,Hufflepuff,84.0,83.5
7,Ron Weasley,Gryffindor,83.0,87.5
8,Hannah Abbott,Hufflepuff,80.5,84.5
9,Ernie Macmillan,Hufflepuff,77.5,85.0


In [4]:
df = df.set_index("Student")

In [5]:
df

Student,House,Potion_Ave,Charm_Ave
Hermione Granger,Gryffindor,100.0,100.0
Anthony Goldstein,Ravenclaw,89.0,87.0
Harry Potter,Gryffindor,88.0,90.0
Lisa Turpin,Ravenclaw,86.5,84.0
Michael Corner,Ravenclaw,85.5,86.5
Draco Malfoy,Slytherin,84.5,81.0
Susan Bones,Hufflepuff,84.0,83.5
Ron Weasley,Gryffindor,83.0,87.5
Hannah Abbott,Hufflepuff,80.5,84.5
Ernie Macmillan,Hufflepuff,77.5,85.0


In [6]:
df.loc["Hermione Granger"]

House         Gryffindor
Potion_Ave         100.0
Charm_Ave          100.0
Name: Hermione Granger, dtype: object

#### reset_index() method
* resets the index
* promotes the current index into a column
* creates a default numbered index


In [7]:
df = df.reset_index()

In [8]:
df

Unnamed: 0,Student,House,Potion_Ave,Charm_Ave
0,Hermione Granger,Gryffindor,100.0,100.0
1,Anthony Goldstein,Ravenclaw,89.0,87.0
2,Harry Potter,Gryffindor,88.0,90.0
3,Lisa Turpin,Ravenclaw,86.5,84.0
4,Michael Corner,Ravenclaw,85.5,86.5
5,Draco Malfoy,Slytherin,84.5,81.0
6,Susan Bones,Hufflepuff,84.0,83.5
7,Ron Weasley,Gryffindor,83.0,87.5
8,Hannah Abbott,Hufflepuff,80.5,84.5
9,Ernie Macmillan,Hufflepuff,77.5,85.0
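The round trip between `set_index` and `reset_index` can be sketched on a small stand-in table. The three rows below are copied from the grades above, but the variable names are only illustrative.

```python
import pandas as pd

# Small stand-in for the grades table (names are illustrative)
grades = pd.DataFrame({
    "Student": ["Harry Potter", "Lisa Turpin", "Draco Malfoy"],
    "House": ["Gryffindor", "Ravenclaw", "Slytherin"],
    "Potion_Ave": [88.0, 86.5, 84.5],
})

indexed = grades.set_index("Student")   # Student becomes the row labels
restored = indexed.reset_index()        # Student returns as a regular column

print(list(indexed.columns))    # Student is no longer a column here
print(list(restored.columns))   # Student is a column again
print(list(restored.index))     # default numbered index
```

Because `reset_index()` returns a new DataFrame, the result has to be assigned to a name, as the cell above does with `df = df.reset_index()`.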


In [9]:
#set multiple indices
df = df.set_index(["House", "Student"])

In [10]:
df

House,Student,Potion_Ave,Charm_Ave
Gryffindor,Hermione Granger,100.0,100.0
Ravenclaw,Anthony Goldstein,89.0,87.0
Gryffindor,Harry Potter,88.0,90.0
Ravenclaw,Lisa Turpin,86.5,84.0
Ravenclaw,Michael Corner,85.5,86.5
Slytherin,Draco Malfoy,84.5,81.0
Hufflepuff,Susan Bones,84.0,83.5
Gryffindor,Ron Weasley,83.0,87.5
Hufflepuff,Hannah Abbott,80.5,84.5
Hufflepuff,Ernie Macmillan,77.5,85.0


In [11]:
df = df.sort_index()

In [12]:
df

House,Student,Potion_Ave,Charm_Ave
Gryffindor,Harry Potter,88.0,90.0
Gryffindor,Hermione Granger,100.0,100.0
Gryffindor,Ron Weasley,83.0,87.5
Hufflepuff,Ernie Macmillan,77.5,85.0
Hufflepuff,Hannah Abbott,80.5,84.5
Hufflepuff,Susan Bones,84.0,83.5
Ravenclaw,Anthony Goldstein,89.0,87.0
Ravenclaw,Lisa Turpin,86.5,84.0
Ravenclaw,Michael Corner,85.5,86.5
Slytherin,Draco Malfoy,84.5,81.0


* Each label in a multilevel index is a tuple, so the index displays as a list of tuples:

In [13]:
df.index

MultiIndex([('Gryffindor',      'Harry Potter'),
            ('Gryffindor',  'Hermione Granger'),
            ('Gryffindor',       'Ron Weasley'),
            ('Hufflepuff',   'Ernie Macmillan'),
            ('Hufflepuff',     'Hannah Abbott'),
            ('Hufflepuff',       'Susan Bones'),
            ( 'Ravenclaw', 'Anthony Goldstein'),
            ( 'Ravenclaw',       'Lisa Turpin'),
            ( 'Ravenclaw',    'Michael Corner'),
            ( 'Slytherin',      'Draco Malfoy'),
            ( 'Slytherin',     'Gregory Goyle'),
            ( 'Slytherin',    'Vincent Crabbe')],
           names=['House', 'Student'])

In [14]:
df.index.values

array([('Gryffindor', 'Harry Potter'), ('Gryffindor', 'Hermione Granger'),
       ('Gryffindor', 'Ron Weasley'), ('Hufflepuff', 'Ernie Macmillan'),
       ('Hufflepuff', 'Hannah Abbott'), ('Hufflepuff', 'Susan Bones'),
       ('Ravenclaw', 'Anthony Goldstein'), ('Ravenclaw', 'Lisa Turpin'),
       ('Ravenclaw', 'Michael Corner'), ('Slytherin', 'Draco Malfoy'),
       ('Slytherin', 'Gregory Goyle'), ('Slytherin', 'Vincent Crabbe')],
      dtype=object)
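A minimal sketch of working with those tuples, assuming a three-row stand-in for the grades table (the variable names are illustrative):

```python
import pandas as pd

# Three rows taken from the grades above, indexed by House and Student
grades = pd.DataFrame({
    "House": ["Gryffindor", "Gryffindor", "Slytherin"],
    "Student": ["Harry Potter", "Ron Weasley", "Draco Malfoy"],
    "Potion_Ave": [88.0, 83.0, 84.5],
}).set_index(["House", "Student"])

# Each row label is a (House, Student) tuple
first_label = grades.index[0]
print(first_label)

# One level of the index can be pulled out on its own
houses = list(grades.index.get_level_values("House").unique())
print(houses)

# The same index can be rebuilt directly from a list of tuples
rebuilt = pd.MultiIndex.from_tuples(list(grades.index), names=["House", "Student"])
print(rebuilt.equals(grades.index))
```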

### 1.1. Selecting Rows from a Hierarchically-Indexed DataFrame

In [15]:
df.loc["Gryffindor"]

Student,Potion_Ave,Charm_Ave
Harry Potter,88.0,90.0
Hermione Granger,100.0,100.0
Ron Weasley,83.0,87.5


In [16]:
df.loc["Gryffindor", "Harry Potter"]

Potion_Ave    88.0
Charm_Ave     90.0
Name: (Gryffindor, Harry Potter), dtype: float64

In [17]:
df.loc["Gryffindor"].describe()

Unnamed: 0,Potion_Ave,Charm_Ave
count,3.0,3.0
mean,90.333333,92.5
std,8.736895,6.614378
min,83.0,87.5
25%,85.5,88.75
50%,88.0,90.0
75%,94.0,95.0
max,100.0,100.0


In [18]:
df.loc["Slytherin"].mean()

Potion_Ave    71.666667
Charm_Ave     66.833333
dtype: float64

#### How would you modify the code snippet above to retrieve the mean score for Slytherin on Potion_Ave?

In [19]:
# Shown for Gryffindor; replace "Gryffindor" with "Slytherin" to answer the question above
df.loc["Gryffindor"]["Potion_Ave"].mean()

90.33333333333333
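One possible answer for Slytherin is sketched below on a toy table. Gregory Goyle's scores here are made up for illustration, so the resulting mean is not the value from the real dataset.

```python
import pandas as pd

# Toy table; Gregory Goyle's scores are hypothetical
grades = pd.DataFrame({
    "House": ["Gryffindor", "Slytherin", "Slytherin"],
    "Student": ["Harry Potter", "Draco Malfoy", "Gregory Goyle"],
    "Potion_Ave": [88.0, 84.5, 65.5],
    "Charm_Ave": [90.0, 81.0, 60.0],
}).set_index(["House", "Student"]).sort_index()

# Select the house first, then the column
house_first = grades.loc["Slytherin"]["Potion_Ave"].mean()

# Select the column first, then the house
column_first = grades["Potion_Ave"].loc["Slytherin"].mean()

print(house_first, column_first)
```

Both orderings give the same number; selecting the column first avoids building the intermediate DataFrame for the whole house.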

### 1.2. Selecting Multiple Rows
* When selecting multiple rows, provide a list of tuples corresponding to each row
    * e.g., if selecting two rows, provide a list of two tuples
        * Let's select Harry Potter and Draco Malfoy

In [20]:
df.loc[[("Gryffindor", "Harry Potter"), ("Slytherin", "Draco Malfoy")]]

House,Student,Potion_Ave,Charm_Ave
Gryffindor,Harry Potter,88.0,90.0
Slytherin,Draco Malfoy,84.5,81.0


### 1.3. Selecting Multiple Rows with Specific Columns
* Specify the list of columns as a second argument to **.loc[rows, columns]**

In [21]:
df.loc[[("Gryffindor", "Harry Potter"), ("Slytherin", "Draco Malfoy")], ["Potion_Ave"]]

House,Student,Potion_Ave
Gryffindor,Harry Potter,88.0
Slytherin,Draco Malfoy,84.5


<a id="2"></a>

## 2. Working with Missing Values in DataFrames
* Real-life datasets often contain missing values
* Need to do something about those missing values before analyzing the data

In [22]:
dada = pd.read_csv("res/DADA.csv")

In [23]:
dada

Unnamed: 0,Student,Expelliarmus,Stupefy,Protego,Accio,Petrificus Totalus,Expecto Patronum
0,Harry Potter,100.0,100.0,100.0,100.0,100.0,100.0
1,Hermione Granger,100.0,100.0,100.0,100.0,100.0,95.0
2,Ron Weasley,100.0,100.0,95.0,95.0,95.0,98.0
3,Hannah Abbott,87.0,78.0,82.0,90.0,85.0,
4,Susan Bones,77.0,84.0,75.0,87.0,81.0,75.0
5,Ernie Macmillan,81.0,81.0,76.0,81.0,89.0,
6,Michael Corner,83.0,82.0,79.0,87.0,84.0,
7,Anthony Goldstein,89.0,85.0,80.0,75.0,83.0,
8,Lisa Turpin,80.0,77.0,83.0,86.0,82.0,70.0
9,Draco Malfoy,87.0,82.0,91.0,86.0,83.0,


### Pandas and Missing Data
* pandas automatically assigns **NaN** ("Not a Number") to missing values while reading a file
* By default, empty cells are considered missing values and are therefore assigned NaN
* Can specify additional values that indicate missing data (e.g., a placeholder such as -99)
    * use the **na_values** keyword argument

In [24]:
pd.read_csv("res/DADA_NA.csv")

Unnamed: 0,Student,Expelliarmus,Stupefy,Protego,Accio,Petrificus Totalus,Expecto Patronum
0,Harry Potter,100,100,100,100,100,100
1,Hermione Granger,100,100,100,100,100,95
2,Ron Weasley,100,100,95,95,95,98
3,Hannah Abbott,87,78,82,90,85,-99
4,Susan Bones,77,84,75,87,81,75
5,Ernie Macmillan,81,81,76,81,89,-99
6,Michael Corner,83,82,79,87,84,-99
7,Anthony Goldstein,89,85,80,75,83,-99
8,Lisa Turpin,80,77,83,86,82,70
9,Drace Malfoy,87,82,91,86,83,-99


In [25]:
pd.read_csv("res/DADA_NA.csv", na_values = "-99")

Unnamed: 0,Student,Expelliarmus,Stupefy,Protego,Accio,Petrificus Totalus,Expecto Patronum
0,Harry Potter,100,100.0,100.0,100,100.0,100.0
1,Hermione Granger,100,100.0,100.0,100,100.0,95.0
2,Ron Weasley,100,100.0,95.0,95,95.0,98.0
3,Hannah Abbott,87,78.0,82.0,90,85.0,
4,Susan Bones,77,84.0,75.0,87,81.0,75.0
5,Ernie Macmillan,81,81.0,76.0,81,89.0,
6,Michael Corner,83,82.0,79.0,87,84.0,
7,Anthony Goldstein,89,85.0,80.0,75,83.0,
8,Lisa Turpin,80,77.0,83.0,86,82.0,70.0
9,Drace Malfoy,87,82.0,91.0,86,83.0,
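`na_values` also accepts a list of placeholders, or a dictionary that limits each placeholder to specific columns. The sketch below reads a small in-memory CSV instead of `res/DADA_NA.csv`; the text and the `"missing"` placeholder are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV text; -99 and "missing" both stand for "no score"
csv_text = """Student,Stupefy,Protego
Harry Potter,100,100
Hannah Abbott,-99,82
Ernie Macmillan,81,missing
"""

# A list marks several placeholders as missing in every column
scores = pd.read_csv(io.StringIO(csv_text), na_values=["-99", "missing"])
print(scores.isna().sum())

# A dictionary restricts each placeholder to the named column
per_column = pd.read_csv(
    io.StringIO(csv_text),
    na_values={"Stupefy": ["-99"], "Protego": ["missing"]},
)
print(per_column.isna().sum())
```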


### What to do with missing data?
* Filter out missing data
* Fill in missing data
* Ignore missing data
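The third option, ignoring missing data, is what many pandas summary methods do by default: they skip NaN values unless told otherwise. A minimal sketch on a made-up three-row table:

```python
import pandas as pd

# Made-up scores; Hannah Abbott has no Expecto Patronum score
scores = pd.DataFrame({
    "Student": ["Harry Potter", "Hannah Abbott", "Lisa Turpin"],
    "Expecto Patronum": [100.0, float("nan"), 70.0],
})

col = scores["Expecto Patronum"]

mean_ignoring_nan = col.mean()              # skipna=True is the default
mean_with_nan = col.mean(skipna=False)      # NaN propagates into the result
count_present = col.count()                 # counts only non-missing values

print(mean_ignoring_nan, mean_with_nan, count_present)
```

Ignoring missing values is convenient, but the mean is then based on fewer students, so it is worth checking `count()` alongside it.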

### 2.1. Filtering Out Missing Data
* The dropna() method allows you to drop rows and columns with missing data
* By default, dropna() drops any row containing a missing value

In [26]:
dada

Unnamed: 0,Student,Expelliarmus,Stupefy,Protego,Accio,Petrificus Totalus,Expecto Patronum
0,Harry Potter,100.0,100.0,100.0,100.0,100.0,100.0
1,Hermione Granger,100.0,100.0,100.0,100.0,100.0,95.0
2,Ron Weasley,100.0,100.0,95.0,95.0,95.0,98.0
3,Hannah Abbott,87.0,78.0,82.0,90.0,85.0,
4,Susan Bones,77.0,84.0,75.0,87.0,81.0,75.0
5,Ernie Macmillan,81.0,81.0,76.0,81.0,89.0,
6,Michael Corner,83.0,82.0,79.0,87.0,84.0,
7,Anthony Goldstein,89.0,85.0,80.0,75.0,83.0,
8,Lisa Turpin,80.0,77.0,83.0,86.0,82.0,70.0
9,Draco Malfoy,87.0,82.0,91.0,86.0,83.0,


In [27]:
dada_clean = dada.dropna()

In [28]:
dada_clean

Unnamed: 0,Student,Expelliarmus,Stupefy,Protego,Accio,Petrificus Totalus,Expecto Patronum
0,Harry Potter,100.0,100.0,100.0,100.0,100.0,100.0
1,Hermione Granger,100.0,100.0,100.0,100.0,100.0,95.0
2,Ron Weasley,100.0,100.0,95.0,95.0,95.0,98.0
4,Susan Bones,77.0,84.0,75.0,87.0,81.0,75.0
8,Lisa Turpin,80.0,77.0,83.0,86.0,82.0,70.0


In [29]:
dada

Unnamed: 0,Student,Expelliarmus,Stupefy,Protego,Accio,Petrificus Totalus,Expecto Patronum
0,Harry Potter,100.0,100.0,100.0,100.0,100.0,100.0
1,Hermione Granger,100.0,100.0,100.0,100.0,100.0,95.0
2,Ron Weasley,100.0,100.0,95.0,95.0,95.0,98.0
3,Hannah Abbott,87.0,78.0,82.0,90.0,85.0,
4,Susan Bones,77.0,84.0,75.0,87.0,81.0,75.0
5,Ernie Macmillan,81.0,81.0,76.0,81.0,89.0,
6,Michael Corner,83.0,82.0,79.0,87.0,84.0,
7,Anthony Goldstein,89.0,85.0,80.0,75.0,83.0,
8,Lisa Turpin,80.0,77.0,83.0,86.0,82.0,70.0
9,Draco Malfoy,87.0,82.0,91.0,86.0,83.0,


### 2.1.1. Filtering out empty rows
It appears that Row 12 is in the dataset by mistake.

**dropna(how = "all")** will only drop rows that are all NaN.
* If you want to drop columns in the same way, pass **axis=1**: **df.dropna(how="all", axis=1)**

In [30]:
dada = dada.dropna(how="all")

In [31]:
dada

Unnamed: 0,Student,Expelliarmus,Stupefy,Protego,Accio,Petrificus Totalus,Expecto Patronum
0,Harry Potter,100.0,100.0,100.0,100.0,100.0,100.0
1,Hermione Granger,100.0,100.0,100.0,100.0,100.0,95.0
2,Ron Weasley,100.0,100.0,95.0,95.0,95.0,98.0
3,Hannah Abbott,87.0,78.0,82.0,90.0,85.0,
4,Susan Bones,77.0,84.0,75.0,87.0,81.0,75.0
5,Ernie Macmillan,81.0,81.0,76.0,81.0,89.0,
6,Michael Corner,83.0,82.0,79.0,87.0,84.0,
7,Anthony Goldstein,89.0,85.0,80.0,75.0,83.0,
8,Lisa Turpin,80.0,77.0,83.0,86.0,82.0,70.0
9,Draco Malfoy,87.0,82.0,91.0,86.0,83.0,
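The column-wise variant (`axis=1`) is not run above, so here is a small sketch. The `Notes` column is hypothetical and exists only to show an entirely empty column.

```python
import pandas as pd

nan = float("nan")

# Made-up table; "Notes" is a hypothetical column with no data at all
scores = pd.DataFrame({
    "Student": ["Harry Potter", "Hannah Abbott"],
    "Stupefy": [100.0, 78.0],
    "Expecto Patronum": [100.0, nan],
    "Notes": [nan, nan],
})

# Drop only the columns in which every value is missing
empty_cols_dropped = scores.dropna(how="all", axis=1)
print(list(empty_cols_dropped.columns))

# Drop every column that contains at least one missing value
any_missing_dropped = scores.dropna(axis=1)
print(list(any_missing_dropped.columns))
```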


* Alternatively, blank lines in the file can be skipped while loading the dataset: **skip_blank_lines=True** (the default for **read_csv**)

In [32]:
pd.read_csv("res/DADA_NA.csv", skip_blank_lines=True) 

Unnamed: 0,Student,Expelliarmus,Stupefy,Protego,Accio,Petrificus Totalus,Expecto Patronum
0,Harry Potter,100,100,100,100,100,100
1,Hermione Granger,100,100,100,100,100,95
2,Ron Weasley,100,100,95,95,95,98
3,Hannah Abbott,87,78,82,90,85,-99
4,Susan Bones,77,84,75,87,81,75
5,Ernie Macmillan,81,81,76,81,89,-99
6,Michael Corner,83,82,79,87,84,-99
7,Anthony Goldstein,89,85,80,75,83,-99
8,Lisa Turpin,80,77,83,86,82,70
9,Drace Malfoy,87,82,91,86,83,-99


### 2.2. Filling In Missing Data
* Instead of filtering out missing data, you can fill in missing data with some other values.
* The **fillna()** method allows you to fill in missing data with specific values

In [33]:
dada.fillna(0)

Unnamed: 0,Student,Expelliarmus,Stupefy,Protego,Accio,Petrificus Totalus,Expecto Patronum
0,Harry Potter,100.0,100.0,100.0,100.0,100.0,100.0
1,Hermione Granger,100.0,100.0,100.0,100.0,100.0,95.0
2,Ron Weasley,100.0,100.0,95.0,95.0,95.0,98.0
3,Hannah Abbott,87.0,78.0,82.0,90.0,85.0,0.0
4,Susan Bones,77.0,84.0,75.0,87.0,81.0,75.0
5,Ernie Macmillan,81.0,81.0,76.0,81.0,89.0,0.0
6,Michael Corner,83.0,82.0,79.0,87.0,84.0,0.0
7,Anthony Goldstein,89.0,85.0,80.0,75.0,83.0,0.0
8,Lisa Turpin,80.0,77.0,83.0,86.0,82.0,70.0
9,Draco Malfoy,87.0,82.0,91.0,86.0,83.0,0.0


### Caution
* fillna() returns a new DataFrame and leaves the original unchanged
* To update the original DataFrame, assign the result back or set inplace=True

```python
dada.fillna(0, inplace = True)
```
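The difference is easy to check on a tiny made-up table: after `fillna(0)` the original still contains NaN until the result is assigned to a name.

```python
import pandas as pd

# Made-up two-row table with one missing score
original = pd.DataFrame({
    "Student": ["Hannah Abbott", "Lisa Turpin"],
    "Expecto Patronum": [float("nan"), 70.0],
})

filled = original.fillna(0)   # new DataFrame; original is untouched

original_has_nan = bool(original["Expecto Patronum"].isna().any())
filled_has_nan = bool(filled["Expecto Patronum"].isna().any())

print(original_has_nan, filled_has_nan)
```

Assigning the result (`dada = dada.fillna(0)`) has the same visible effect as `inplace=True` and keeps the style consistent with the other cells in this notebook.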

### 2.2.1. Filling In a Different Value for Each Column
* Can fill in each column with a different value
* Pass in a dictionary containing column names as keys and the desired fill values as values
    * **fillna(dictionary_of_values)**

In [34]:
default_col_values = {"Expelliarmus": 0, "Stupefy": 0, "Protego": 0, 
                      "Accio": 0, "Petrificus Totalus": 0, "Expecto Patronum": ""}

In [35]:
dada.fillna(default_col_values)

Unnamed: 0,Student,Expelliarmus,Stupefy,Protego,Accio,Petrificus Totalus,Expecto Patronum
0,Harry Potter,100.0,100.0,100.0,100.0,100.0,100.0
1,Hermione Granger,100.0,100.0,100.0,100.0,100.0,95.0
2,Ron Weasley,100.0,100.0,95.0,95.0,95.0,98.0
3,Hannah Abbott,87.0,78.0,82.0,90.0,85.0,
4,Susan Bones,77.0,84.0,75.0,87.0,81.0,75.0
5,Ernie Macmillan,81.0,81.0,76.0,81.0,89.0,
6,Michael Corner,83.0,82.0,79.0,87.0,84.0,
7,Anthony Goldstein,89.0,85.0,80.0,75.0,83.0,
8,Lisa Turpin,80.0,77.0,83.0,86.0,82.0,70.0
9,Draco Malfoy,87.0,82.0,91.0,86.0,83.0,
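A Series indexed by column name works the same way as the dictionary. As one possible use, the sketch below fills each numeric column with its own mean; the table is made up, and filling with the mean is only an illustration, not a recommendation for this dataset.

```python
import pandas as pd

nan = float("nan")

# Made-up scores with one gap per spell
scores = pd.DataFrame({
    "Student": ["Harry Potter", "Hannah Abbott", "Lisa Turpin"],
    "Protego": [100.0, nan, 80.0],
    "Accio": [90.0, 70.0, nan],
})

# One fill value per numeric column, keyed by column name
column_means = scores.mean(numeric_only=True)

filled = scores.fillna(column_means)
print(filled)
```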


<a id="3"></a>

## 3. Working with Duplicates
* Sometimes datasets contain duplicate records, which you may want to discard

In [36]:
dada

Unnamed: 0,Student,Expelliarmus,Stupefy,Protego,Accio,Petrificus Totalus,Expecto Patronum
0,Harry Potter,100.0,100.0,100.0,100.0,100.0,100.0
1,Hermione Granger,100.0,100.0,100.0,100.0,100.0,95.0
2,Ron Weasley,100.0,100.0,95.0,95.0,95.0,98.0
3,Hannah Abbott,87.0,78.0,82.0,90.0,85.0,
4,Susan Bones,77.0,84.0,75.0,87.0,81.0,75.0
5,Ernie Macmillan,81.0,81.0,76.0,81.0,89.0,
6,Michael Corner,83.0,82.0,79.0,87.0,84.0,
7,Anthony Goldstein,89.0,85.0,80.0,75.0,83.0,
8,Lisa Turpin,80.0,77.0,83.0,86.0,82.0,70.0
9,Draco Malfoy,87.0,82.0,91.0,86.0,83.0,


In [37]:
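# DataFrame.append was removed in pandas 2.0; pd.concat adds the first row again as a duplicate
dada = pd.concat([dada, dada.iloc[[0]]])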

In [38]:
dada

Unnamed: 0,Student,Expelliarmus,Stupefy,Protego,Accio,Petrificus Totalus,Expecto Patronum
0,Harry Potter,100.0,100.0,100.0,100.0,100.0,100.0
1,Hermione Granger,100.0,100.0,100.0,100.0,100.0,95.0
2,Ron Weasley,100.0,100.0,95.0,95.0,95.0,98.0
3,Hannah Abbott,87.0,78.0,82.0,90.0,85.0,
4,Susan Bones,77.0,84.0,75.0,87.0,81.0,75.0
5,Ernie Macmillan,81.0,81.0,76.0,81.0,89.0,
6,Michael Corner,83.0,82.0,79.0,87.0,84.0,
7,Anthony Goldstein,89.0,85.0,80.0,75.0,83.0,
8,Lisa Turpin,80.0,77.0,83.0,86.0,82.0,70.0
9,Draco Malfoy,87.0,82.0,91.0,86.0,83.0,


### 3.1. duplicated() method
Returns a boolean Series indicating whether each row is a duplicate (whether it has been observed in a previous row)

In [39]:
dada.duplicated()

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
0      True
dtype: bool
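The boolean Series can be used as a mask to inspect the duplicated rows before dropping them. A minimal sketch on a made-up table in which Harry Potter appears twice:

```python
import pandas as pd

# Made-up table; the last row repeats the first one
scores = pd.DataFrame({
    "Student": ["Harry Potter", "Ron Weasley", "Harry Potter"],
    "Stupefy": [100.0, 100.0, 100.0],
})

mask = scores.duplicated()          # True only for the later copy
repeats = scores[mask]              # just the repeated rows
n_duplicates = int(mask.sum())      # how many rows would be dropped

# keep=False marks every copy, including the first occurrence
all_copies = scores[scores.duplicated(keep=False)]

print(n_duplicates)
print(all_copies)
```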

### 3.2. drop_duplicates() method
* Returns a DataFrame where the duplicates have been dropped

In [40]:
dada.drop_duplicates()

Unnamed: 0,Student,Expelliarmus,Stupefy,Protego,Accio,Petrificus Totalus,Expecto Patronum
0,Harry Potter,100.0,100.0,100.0,100.0,100.0,100.0
1,Hermione Granger,100.0,100.0,100.0,100.0,100.0,95.0
2,Ron Weasley,100.0,100.0,95.0,95.0,95.0,98.0
3,Hannah Abbott,87.0,78.0,82.0,90.0,85.0,
4,Susan Bones,77.0,84.0,75.0,87.0,81.0,75.0
5,Ernie Macmillan,81.0,81.0,76.0,81.0,89.0,
6,Michael Corner,83.0,82.0,79.0,87.0,84.0,
7,Anthony Goldstein,89.0,85.0,80.0,75.0,83.0,
8,Lisa Turpin,80.0,77.0,83.0,86.0,82.0,70.0
9,Draco Malfoy,87.0,82.0,91.0,86.0,83.0,


* You can also specify a subset of columns to detect duplicates.
* Pass in a list of column names as an argument to the **drop_duplicates()** method call.

In [41]:
dada

Unnamed: 0,Student,Expelliarmus,Stupefy,Protego,Accio,Petrificus Totalus,Expecto Patronum
0,Harry Potter,100.0,100.0,100.0,100.0,100.0,100.0
1,Hermione Granger,100.0,100.0,100.0,100.0,100.0,95.0
2,Ron Weasley,100.0,100.0,95.0,95.0,95.0,98.0
3,Hannah Abbott,87.0,78.0,82.0,90.0,85.0,
4,Susan Bones,77.0,84.0,75.0,87.0,81.0,75.0
5,Ernie Macmillan,81.0,81.0,76.0,81.0,89.0,
6,Michael Corner,83.0,82.0,79.0,87.0,84.0,
7,Anthony Goldstein,89.0,85.0,80.0,75.0,83.0,
8,Lisa Turpin,80.0,77.0,83.0,86.0,82.0,70.0
9,Draco Malfoy,87.0,82.0,91.0,86.0,83.0,


In [42]:
dada.drop_duplicates(["Student"])

Unnamed: 0,Student,Expelliarmus,Stupefy,Protego,Accio,Petrificus Totalus,Expecto Patronum
0,Harry Potter,100.0,100.0,100.0,100.0,100.0,100.0
1,Hermione Granger,100.0,100.0,100.0,100.0,100.0,95.0
2,Ron Weasley,100.0,100.0,95.0,95.0,95.0,98.0
3,Hannah Abbott,87.0,78.0,82.0,90.0,85.0,
4,Susan Bones,77.0,84.0,75.0,87.0,81.0,75.0
5,Ernie Macmillan,81.0,81.0,76.0,81.0,89.0,
6,Michael Corner,83.0,82.0,79.0,87.0,84.0,
7,Anthony Goldstein,89.0,85.0,80.0,75.0,83.0,
8,Lisa Turpin,80.0,77.0,83.0,86.0,82.0,70.0
9,Draco Malfoy,87.0,82.0,91.0,86.0,83.0,


* By default, both duplicated() and drop_duplicates() keep the first observed value combination. 
* If you want to keep the last value, pass in **keep="last"**

In [43]:
dada

Unnamed: 0,Student,Expelliarmus,Stupefy,Protego,Accio,Petrificus Totalus,Expecto Patronum
0,Harry Potter,100.0,100.0,100.0,100.0,100.0,100.0
1,Hermione Granger,100.0,100.0,100.0,100.0,100.0,95.0
2,Ron Weasley,100.0,100.0,95.0,95.0,95.0,98.0
3,Hannah Abbott,87.0,78.0,82.0,90.0,85.0,
4,Susan Bones,77.0,84.0,75.0,87.0,81.0,75.0
5,Ernie Macmillan,81.0,81.0,76.0,81.0,89.0,
6,Michael Corner,83.0,82.0,79.0,87.0,84.0,
7,Anthony Goldstein,89.0,85.0,80.0,75.0,83.0,
8,Lisa Turpin,80.0,77.0,83.0,86.0,82.0,70.0
9,Draco Malfoy,87.0,82.0,91.0,86.0,83.0,


In [44]:
dada.drop_duplicates(["Student"], keep="last")

Unnamed: 0,Student,Expelliarmus,Stupefy,Protego,Accio,Petrificus Totalus,Expecto Patronum
1,Hermione Granger,100.0,100.0,100.0,100.0,100.0,95.0
2,Ron Weasley,100.0,100.0,95.0,95.0,95.0,98.0
3,Hannah Abbott,87.0,78.0,82.0,90.0,85.0,
4,Susan Bones,77.0,84.0,75.0,87.0,81.0,75.0
5,Ernie Macmillan,81.0,81.0,76.0,81.0,89.0,
6,Michael Corner,83.0,82.0,79.0,87.0,84.0,
7,Anthony Goldstein,89.0,85.0,80.0,75.0,83.0,
8,Lisa Turpin,80.0,77.0,83.0,86.0,82.0,70.0
9,Draco Malfoy,87.0,82.0,91.0,86.0,83.0,
10,Gregory Goyle,65.0,60.0,,72.0,,


<a id="4"></a>

## 4. Replacing Values
* Beyond filtering out or filling in missing data, we can replace certain values with new values.
* **df.replace(current_value, new_value)**

In [45]:
dada_NA = pd.read_csv("res/DADA_NA.csv")

In [46]:
dada_NA

Unnamed: 0,Student,Expelliarmus,Stupefy,Protego,Accio,Petrificus Totalus,Expecto Patronum
0,Harry Potter,100,100,100,100,100,100
1,Hermione Granger,100,100,100,100,100,95
2,Ron Weasley,100,100,95,95,95,98
3,Hannah Abbott,87,78,82,90,85,-99
4,Susan Bones,77,84,75,87,81,75
5,Ernie Macmillan,81,81,76,81,89,-99
6,Michael Corner,83,82,79,87,84,-99
7,Anthony Goldstein,89,85,80,75,83,-99
8,Lisa Turpin,80,77,83,86,82,70
9,Drace Malfoy,87,82,91,86,83,-99


In [47]:
import numpy as np
dada_NA.replace(-99, np.nan)  # np.nan marks the cells as missing; the string "NaN" would be stored as ordinary text

Unnamed: 0,Student,Expelliarmus,Stupefy,Protego,Accio,Petrificus Totalus,Expecto Patronum
0,Harry Potter,100,100.0,100.0,100,100.0,100.0
1,Hermione Granger,100,100.0,100.0,100,100.0,95.0
2,Ron Weasley,100,100.0,95.0,95,95.0,98.0
3,Hannah Abbott,87,78.0,82.0,90,85.0,
4,Susan Bones,77,84.0,75.0,87,81.0,75.0
5,Ernie Macmillan,81,81.0,76.0,81,89.0,
6,Michael Corner,83,82.0,79.0,87,84.0,
7,Anthony Goldstein,89,85.0,80.0,75,83.0,
8,Lisa Turpin,80,77.0,83.0,86,82.0,70.0
9,Drace Malfoy,87,82.0,91.0,86,83.0,


In [48]:
dada_NA

Unnamed: 0,Student,Expelliarmus,Stupefy,Protego,Accio,Petrificus Totalus,Expecto Patronum
0,Harry Potter,100,100,100,100,100,100
1,Hermione Granger,100,100,100,100,100,95
2,Ron Weasley,100,100,95,95,95,98
3,Hannah Abbott,87,78,82,90,85,-99
4,Susan Bones,77,84,75,87,81,75
5,Ernie Macmillan,81,81,76,81,89,-99
6,Michael Corner,83,82,79,87,84,-99
7,Anthony Goldstein,89,85,80,75,83,-99
8,Lisa Turpin,80,77,83,86,82,70
9,Drace Malfoy,87,82,91,86,83,-99


In [49]:
dada_NA.replace("Drace Malfoy", "Draco Malfoy", inplace=True)

In [50]:
dada_NA

Unnamed: 0,Student,Expelliarmus,Stupefy,Protego,Accio,Petrificus Totalus,Expecto Patronum
0,Harry Potter,100,100,100,100,100,100
1,Hermione Granger,100,100,100,100,100,95
2,Ron Weasley,100,100,95,95,95,98
3,Hannah Abbott,87,78,82,90,85,-99
4,Susan Bones,77,84,75,87,81,75
5,Ernie Macmillan,81,81,76,81,89,-99
6,Michael Corner,83,82,79,87,84,-99
7,Anthony Goldstein,89,85,80,75,83,-99
8,Lisa Turpin,80,77,83,86,82,70
9,Draco Malfoy,87,82,91,86,83,-99


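* If you want to replace multiple values at once, pass a list of values to replace, followed by a single substitute or a list of substitutes of the same length; a dictionary mapping old values to new values also works

```python
import numpy as np

dada_NA.replace([-99, -999], np.nan)        # both values become NaN
dada_NA.replace([-99, -999], [np.nan, 0])   # -99 becomes NaN, -999 becomes 0
dada_NA.replace({-99: np.nan, -999: 0})     # same mapping written as a dictionary
```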
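The three call styles can be tried on a small made-up table. This sketch uses `np.nan` as the substitute so that pandas treats the replaced cells as missing, and it assumes -99 and -999 are the placeholder codes.

```python
import numpy as np
import pandas as pd

# Made-up scores containing both placeholder codes
scores = pd.DataFrame({"Stupefy": [100, -99, 81], "Protego": [-999, 82, 76]})

# One substitute for every listed value
same_value = scores.replace([-99, -999], np.nan)

# One substitute per listed value, matched by position
pairwise = scores.replace([-99, -999], [np.nan, 0])

# The same mapping as a dictionary
by_dict = scores.replace({-99: np.nan, -999: 0})

print(same_value)
print(pairwise)
print(by_dict)
```

The dictionary form tends to be easier to read once there are more than two values, because each old value sits next to its substitute.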

<a id="5"></a>

## 5. Renaming Axis Indices
* Can rename rows and columns using the **rename()** method
* Use the *columns* and *index* keyword arguments
* Pass **inplace=True** if you want to modify the original df

In [51]:
dada_NA = dada_NA.set_index("Student")

In [52]:
dada_NA

Student,Expelliarmus,Stupefy,Protego,Accio,Petrificus Totalus,Expecto Patronum
Harry Potter,100,100,100,100,100,100
Hermione Granger,100,100,100,100,100,95
Ron Weasley,100,100,95,95,95,98
Hannah Abbott,87,78,82,90,85,-99
Susan Bones,77,84,75,87,81,75
Ernie Macmillan,81,81,76,81,89,-99
Michael Corner,83,82,79,87,84,-99
Anthony Goldstein,89,85,80,75,83,-99
Lisa Turpin,80,77,83,86,82,70
Draco Malfoy,87,82,91,86,83,-99


In [53]:
dada_NA = dada_NA.rename(index={"Harry Potter": "Harry", "Hermione Granger": "Hermione", "Ron Weasley": "Ron"})

In [54]:
dada_NA

Student,Expelliarmus,Stupefy,Protego,Accio,Petrificus Totalus,Expecto Patronum
Harry,100,100,100,100,100,100
Hermione,100,100,100,100,100,95
Ron,100,100,95,95,95,98
Hannah Abbott,87,78,82,90,85,-99
Susan Bones,77,84,75,87,81,75
Ernie Macmillan,81,81,76,81,89,-99
Michael Corner,83,82,79,87,84,-99
Anthony Goldstein,89,85,80,75,83,-99
Lisa Turpin,80,77,83,86,82,70
Draco Malfoy,87,82,91,86,83,-99
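Columns can be renamed in the same call through the `columns` keyword. The sketch below uses a two-row stand-in table, and the shortened column names are made up for illustration.

```python
import pandas as pd

# Two-row stand-in for dada_NA, indexed by student name
scores = pd.DataFrame(
    {"Petrificus Totalus": [100, 85], "Expecto Patronum": [100, -99]},
    index=["Harry Potter", "Hannah Abbott"],
)

# Rename row labels and column labels in one call
renamed = scores.rename(
    index={"Harry Potter": "Harry"},
    columns={"Petrificus Totalus": "Petrificus", "Expecto Patronum": "Patronus"},
)
print(renamed)

# A function can be passed instead of a dictionary
upper = scores.rename(columns=str.upper)
print(list(upper.columns))
```

Labels that are missing from the dictionary, such as "Hannah Abbott" here, are left as they are.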
