# Import Libraries

In [1]:
import pandas as pd
import numpy as np

# Load Datasets

In [4]:
test_scores = pd.read_csv('./test_scores.csv')

## Creating a Copy of the original dataset

In [5]:
df = test_scores.copy()

In [6]:
df.head()

Unnamed: 0,Name,Age,Test A Score
0,Amy Linn,15.0,95.0
1,Marc Fletcher,15.0,50.0
2,Naima Barry,,100.0
3,Kara Davis,15.0,
4,Zeeshan Gibson,14.0,100.0


# Finding all duplicated rows based on the `Name` and `Age` Columns

In [7]:
if_duplicated = df.duplicated(['Name','Age'])
if_duplicated

0     False
1     False
2     False
3     False
4     False
5      True
6     False
7     False
8     False
9      True
10     True
11     True
dtype: bool

# Getting the duplicated Rows

In [8]:
duplicate_rows = df.loc[df.duplicated(['Name','Age'])]
duplicate_rows

Unnamed: 0,Name,Age,Test A Score
5,Amy Linn,15,85
9,Marc Fletcher,15,32
10,Kara Davis,15,79
11,Amy Linn,15,95


## Getting all rows of "Amy Linn"

In [9]:
Amy = df.loc[df['Name'] == "Amy Linn"]
Amy

Unnamed: 0,Name,Age,Test A Score
0,Amy Linn,15,95
5,Amy Linn,15,85
11,Amy Linn,15,95


By default, `duplicated()` flags only the later occurrences of a repeated row, not the first one, so it does not return every row associated with the duplication.
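
To see every row involved in a duplication, including the first occurrence, `duplicated()` accepts `keep=False`. The following is a minimal sketch; `sample` is a small hypothetical frame standing in for `test_scores.csv`, not the actual dataset.

```python
import pandas as pd

# Hypothetical stand-in for the test scores data (assumption, not the real file)
sample = pd.DataFrame({
    "Name": ["Amy Linn", "Marc Fletcher", "Amy Linn", "Naima Barry", "Amy Linn"],
    "Age": [15, 15, 15, 14, 15],
    "Test A Score": [95, 50, 85, 100, 95],
})

# Default keep='first': only the later occurrences are flagged
later_only = sample[sample.duplicated(["Name", "Age"])]

# keep=False: every row that belongs to a duplicated group is flagged
all_involved = sample[sample.duplicated(["Name", "Age"], keep=False)]

print(later_only.index.tolist())    # [2, 4]
print(all_involved.index.tolist())  # [0, 2, 4]
```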

# Getting information about the duplicated rows

## Getting the number of duplicated rows

In [10]:
df.duplicated(['Name','Age']).sum()

4

## Looking for trends in the duplicated rows visually

One question to ask: are the duplicates limited to students who are 15 years old?

In [11]:
df

Unnamed: 0,Name,Age,Test A Score
0,Amy Linn,15,95
1,Marc Fletcher,15,50
2,Naima Barry,,100
3,Kara Davis,15,
4,Zeeshan Gibson,14,100
5,Amy Linn,15,85
6,Dewey Cobb,Fourteen,Sixty six
7,Zeeshan Gibson,120,108
8,Lie�m Gibson,14,
9,Marc Fletcher,15,32
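
Besides scanning the table by eye, the age question can be checked by counting ages among the flagged rows. This is only a sketch on a small hypothetical frame (`sample` is an assumption, not the real data).

```python
import pandas as pd

# Hypothetical stand-in for the test scores data (assumption, not the real file)
sample = pd.DataFrame({
    "Name": ["Amy Linn", "Marc Fletcher", "Zeeshan Gibson", "Amy Linn", "Marc Fletcher"],
    "Age": [15, 15, 14, 15, 15],
    "Test A Score": [95, 50, 100, 85, 32],
})

# Rows flagged as duplicates on Name and Age (later occurrences only)
dupes = sample[sample.duplicated(["Name", "Age"])]

# How often each age appears among the duplicates
age_counts = dupes["Age"].value_counts()
print(age_counts)
```

If only one age shows up in `age_counts`, the duplicates are concentrated in that age group, at least in the sample being inspected.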


# Determining which duplicated row to remove

Sometimes duplicate rows match on the search criteria (here `Name` and `Age`) but hold different values in the remaining columns.

In [14]:
Amy = df.loc[df['Name'] == 'Amy Linn']
Amy

Unnamed: 0,Name,Age,Test A Score
0,Amy Linn,15,95
5,Amy Linn,15,85
11,Amy Linn,15,95


To fix this:

1. The original data providers can be contacted about the data accuracy
2. Duplicate rows that are incorrect can be removed while the correct ones are kept
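
Before deciding which rows to remove, it can help to list the duplicate groups whose scores actually disagree. One possible sketch uses `groupby` with `nunique()`; `sample` is a hypothetical frame, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the test scores data (assumption, not the real file)
sample = pd.DataFrame({
    "Name": ["Amy Linn", "Marc Fletcher", "Amy Linn", "Kara Davis", "Kara Davis"],
    "Age": [15, 15, 15, 15, 15],
    "Test A Score": [95, 50, 85, 79, 79],
})

# Number of distinct scores recorded for each (Name, Age) pair.
# nunique() ignores NaN by default, so missing scores are not counted.
distinct_scores = sample.groupby(["Name", "Age"])["Test A Score"].nunique()

# Pairs with more than one distinct score are conflicting duplicates
conflicting = distinct_scores[distinct_scores > 1]
print(conflicting)
```

Here only the Amy Linn group is reported, because the two Kara Davis rows agree on the score.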

## Loading dataset with multiple test scores

Some duplicate scores on `Test A` are incorrect, but every score on `Test B` is correct.

In [16]:
multi_test_scores = pd.read_csv('./multiple_test_scores.csv')
multi_test_scores

Unnamed: 0,Name,Age,Test A Score,Test B Score
0,Amy Linn,15,95,34
1,Marc Fletcher,15,50,87
2,Naima Barry,,100,100
3,Kara Davis,15,,3
4,Zeeshan Gibson,14,100,20
5,Amy Linn,15,85,88
6,Dewey Cobb,Fourteen,Sixty six,Fifty three
7,Zeeshan Gibson,120,108,100
8,Lie�m Gibson,14,,75
9,Marc Fletcher,15,32,54


## Getting duplicate rows based on the `Age` and `Name` Columns

In [17]:
multi_test_scores[multi_test_scores.duplicated(['Age','Name'])]

Unnamed: 0,Name,Age,Test A Score,Test B Score
5,Amy Linn,15,85,88
9,Marc Fletcher,15,32,54
10,Kara Davis,15,79,90


To fix this:

1. The original data providers can be contacted about the data accuracy

   They could reply with:

      * Duplicate students' data in `Test A Score` is incorrect and incorrect rows should be removed
      * Duplicate students' data in `Test B Score` is correct and should be kept

2. Duplicate rows that are incorrect can be removed while the correct ones are kept

   The incorrect duplicate values in `Test A Score` can be marked as NaNs. Alternatively, a separate table can be created for repeated values in `Test B Score`.
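
The separate-table option could look roughly like the sketch below. It is one possible reading of the idea, not a prescribed method; `sample`, `repeat_test_b`, and `main` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical stand-in for multiple_test_scores.csv (assumption, not the real file)
sample = pd.DataFrame({
    "Name": ["Amy Linn", "Marc Fletcher", "Amy Linn", "Marc Fletcher"],
    "Age": [15, 15, 15, 15],
    "Test A Score": [95, 50, 85, 32],
    "Test B Score": [34, 87, 88, 54],
})

# True for the later occurrences of each (Name, Age) pair
is_repeat = sample.duplicated(["Name", "Age"])

# Separate table holding the repeated students' Test B scores
repeat_test_b = sample.loc[is_repeat, ["Name", "Age", "Test B Score"]]

# Main table keeps one row per student
main = sample.loc[~is_repeat]

print(repeat_test_b)
print(main)
```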

## Fixing the duplicated rows

### By using `drop_duplicates()`

By default, `drop_duplicates()` keeps the first occurrence of the duplicated data.

In [21]:
removed_dup = df.drop_duplicates(subset=['Name','Age'])
removed_dup

Unnamed: 0,Name,Age,Test A Score
0,Amy Linn,15,95
1,Marc Fletcher,15,50
2,Naima Barry,,100
3,Kara Davis,15,
4,Zeeshan Gibson,14,100
6,Dewey Cobb,Fourteen,Sixty six
7,Zeeshan Gibson,120,108
8,Lie�m Gibson,14,


To keep the last occurrence instead, add the parameter `keep='last'`.

In [22]:
df.drop_duplicates(subset=['Name','Age'],keep='last')

Unnamed: 0,Name,Age,Test A Score
2,Naima Barry,,100
4,Zeeshan Gibson,14,100
6,Dewey Cobb,Fourteen,Sixty six
7,Zeeshan Gibson,120,108
8,Lie�m Gibson,14,
9,Marc Fletcher,15,32
10,Kara Davis,15,79
11,Amy Linn,15,95
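
For comparison, the `keep` parameter of `drop_duplicates()` also accepts `False`, which drops every row of a duplicated group. The sketch below places the three settings side by side on a small hypothetical frame (`sample` is an assumption, not the real data).

```python
import pandas as pd

# Hypothetical stand-in for the test scores data (assumption, not the real file)
sample = pd.DataFrame({
    "Name": ["Amy Linn", "Marc Fletcher", "Amy Linn", "Naima Barry"],
    "Age": [15, 15, 15, 14],
    "Test A Score": [95, 50, 85, 100],
})

subset = ["Name", "Age"]

keep_first = sample.drop_duplicates(subset=subset)               # default: keep='first'
keep_last = sample.drop_duplicates(subset=subset, keep="last")   # keep the last occurrence
keep_none = sample.drop_duplicates(subset=subset, keep=False)    # drop all duplicated rows

print(keep_first.index.tolist())  # [0, 1, 3]
print(keep_last.index.tolist())   # [1, 2, 3]
print(keep_none.index.tolist())   # [1, 3]
```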


### Confirming all duplicates are removed

In [25]:
removed_dup.duplicated(['Name','Age']).sum()

0

### Dropping rows that are neither the first nor the last

`drop_duplicates()` can't handle this operation; it must be done manually.

In [26]:
Amy = df.loc[df['Name'] == "Amy Linn"]
Amy

Unnamed: 0,Name,Age,Test A Score
0,Amy Linn,15,95
5,Amy Linn,15,85
11,Amy Linn,15,95


In [28]:
Amy.drop(index=[5])

Unnamed: 0,Name,Age,Test A Score
0,Amy Linn,15,95
11,Amy Linn,15,95
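
When there are too many groups to drop indexes by hand, the middle rows can be located by combining two `duplicated()` masks. This is a sketch of one possible approach on a hypothetical frame (`sample` is an assumption), not the only way to do it.

```python
import pandas as pd

# Hypothetical stand-in for the test scores data (assumption, not the real file)
sample = pd.DataFrame({
    "Name": ["Amy Linn", "Marc Fletcher", "Amy Linn", "Amy Linn", "Marc Fletcher"],
    "Age": [15, 15, 15, 15, 15],
    "Test A Score": [95, 50, 85, 95, 32],
})

keys = ["Name", "Age"]

# True for every row except the first of its group
not_first = sample.duplicated(keys, keep="first")
# True for every row except the last of its group
not_last = sample.duplicated(keys, keep="last")

# A row that is neither first nor last sits in the middle of its group
middle = not_first & not_last
trimmed = sample.loc[~middle]

print(trimmed.index.tolist())  # [0, 1, 3, 4]
```

Groups with only two rows are left untouched, since each of their rows is either the first or the last.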


### Converting duplicate values to NaNs

Instead of dropping the duplicated values, they can be converted to NaNs and dealt with later.

In [29]:
dupe_index = multi_test_scores[multi_test_scores.duplicated(['Name','Age'])].index
dupe_index

Index([5, 9, 10], dtype='int64')

#### Setting `Test A Score` to NaN at the duplicated index

In [31]:
multi_test_scores.loc[dupe_index, 'Test A Score'] = np.nan

In [32]:
multi_test_scores

Unnamed: 0,Name,Age,Test A Score,Test B Score
0,Amy Linn,15,95,34
1,Marc Fletcher,15,50,87
2,Naima Barry,,100,100
3,Kara Davis,15,,3
4,Zeeshan Gibson,14,100,20
5,Amy Linn,15,,88
6,Dewey Cobb,Fourteen,Sixty six,Fifty three
7,Zeeshan Gibson,120,108,100
8,Lie�m Gibson,14,,75
9,Marc Fletcher,15,,54
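
The same marking step can be done on a copy with a boolean mask, which leaves the loaded table untouched and makes it easy to count how many values were blanked. The sketch below assumes a numeric `Test A Score` column; `sample` and `cleaned` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for multiple_test_scores.csv (assumption, not the real file)
sample = pd.DataFrame({
    "Name": ["Amy Linn", "Marc Fletcher", "Amy Linn"],
    "Age": [15, 15, 15],
    "Test A Score": [95.0, 50.0, 85.0],
    "Test B Score": [34, 87, 88],
})

# Work on a copy so the original frame is preserved
cleaned = sample.copy()

# Boolean mask of the later occurrences of each (Name, Age) pair
dupe_mask = cleaned.duplicated(["Name", "Age"])

# Blank out only Test A Score on those rows; Test B Score is kept
cleaned.loc[dupe_mask, "Test A Score"] = np.nan

# Number of scores that are now missing
missing_count = cleaned["Test A Score"].isna().sum()
print(missing_count)  # 1
```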
