### Prepping Data Challenge: Dealing with Duplication (week 5)


#### Requirements:

 1. Input the data 
 2. For each Client, work out who the most recent Account Manager is 
 3. Filter the data so that only the most recent Account Manager remains 
    - Be careful not to lose any attendees from the training sessions!
 4. In some instances, the Client ID has changed along with the Account Manager. Ensure only the most recent Client ID remains
 5. Output the data


### 1. Input the data 

In [1]:
#import libraries
import pandas as pd

In [2]:
df = pd.read_csv('WK5-Joined Dataset.csv')

df.head()

Unnamed: 0,Training,Contact Email,Contact Name,Client,Client ID,Account Manager,From Date
0,Prep 101 - 2020-10-01,abagael.matresse@brauninc.com,Abagael Matresse,Braun Inc,1200,Xiaoxuan Ma,31/12/2019
1,Prep 101 - 2020-10-01,abagail.macconnell@lakinllc.com,Abagail MacConnell,Lakin LLC,924,Lucy Stevenson,01/01/2019
2,Prep 101 - 2020-10-01,abagail.moodey@raynorinc.com,Abagail Moodey,Raynor Inc,444,Nancy Smith,01/07/2015
3,Prep 101 - 2020-10-01,abby.eager@paucekgroup.com,Abby Eager,Paucek Group,893,Nancy Smith,20/09/2018
4,Prep 101 - 2020-10-01,abelard.mechell@lehner.swiftanddickinson.com,Abelard Mechell,"Lehner, Swift and Dickinson",1323,Xiaoxuan Ma,31/12/2019


In [6]:
# Check for repeated clients: keep the rows whose 'Client' value has already appeared earlier in the dataset
duplicate_test = df[df.duplicated('Client')]
duplicate_test

Unnamed: 0,Training,Contact Email,Contact Name,Client,Client ID,Account Manager,From Date
27,Prep 101 - 2020-10-01,agathe.evill@o'reilly.hilpertandlittel.com,Agathe Evill,"O'Reilly, Hilpert and Littel",620,Florian Webb,30/11/2019
48,Prep 101 - 2020-10-01,alfy.such@langworth.koelpinandklein.com,Alfy Such,"Langworth, Koelpin and Klein",1067,Florian Webb,29/06/2020
51,Prep 101 - 2020-10-01,alisa.cramb@carterltd.com,Alisa Cramb,Carter Ltd,971,Louisa James,01/11/2017
60,Prep 101 - 2020-10-01,allin.braine@carterltd.com,Allin Braine,Carter Ltd,971,Louisa James,01/11/2017
67,Prep 101 - 2020-10-01,alphonse.halwill@emard-wunsch.com,Alphonse Halwill,Emard-Wunsch,1232,Lucy Stevenson,01/04/2019
...,...,...,...,...,...,...,...
13618,Tableau 104 - 2020-09-17,zenia.antonin@marks-gulgowski.com,Zenia Antonin,Marks-Gulgowski,1022,Ezazul Islam,01/02/2018
13619,Tableau 104 - 2020-09-17,zia.easterling@turcotte-borer.com,Zia Easterling,Turcotte-Borer,1300,Lucy Stevenson,31/07/2019
13620,Tableau 104 - 2020-10-08,zolly.bonson@hudson.runolfsdottirandabbott.com,Zolly Bonson,"Hudson, Runolfsdottir and Abbott",1092,Oscar Adams,01/09/2018
13621,Tableau 104 - 2020-10-08,zolly.duxfield@ortiz.harrisandbrown.com,Zolly Duxfield,"Ortiz, Harris and Brown",1164,Oscar Adams,31/05/2018
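
A repeated `Client` value does not by itself mean the Account Manager changed, since several attendees can belong to the same client. The sketch below, on hypothetical toy rows that are not taken from the challenge file, contrasts `duplicated('Client')` with a count of distinct managers per client. The names `toy`, `repeat_rows` and `changed_clients` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the challenge file
toy = pd.DataFrame({
    'Client': ['Acme', 'Acme', 'Acme', 'Beta'],
    'Account Manager': ['Old Manager', 'New Manager', 'New Manager', 'Only Manager'],
})

# Rows whose Client already appeared earlier (what duplicated('Client') flags)
repeat_rows = toy[toy.duplicated('Client')]

# Clients linked to more than one distinct Account Manager
managers_per_client = toy.groupby('Client')['Account Manager'].nunique()
changed_clients = managers_per_client[managers_per_client > 1].index.tolist()

print(repeat_rows)
print(changed_clients)
```

Counting distinct managers narrows the check to the clients that actually need deduplicating by date.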


### 2. For each Client, work out who the most recent Account Manager is 

In [6]:
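# 'From Date' holds dd/mm/yyyy strings, so sort on the parsed dates, not the raw text
A_Mgt = (df.sort_values(by='From Date', key=lambda s: pd.to_datetime(s, format='%d/%m/%Y'))
           .groupby('Client')[['Client ID', 'Account Manager', 'From Date']]
           .last().reset_index())
#A_Mgt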
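
Because `From Date` is stored as dd/mm/yyyy text, alphabetical order and chronological order can disagree. The following is a minimal sketch on hypothetical toy rows (the client and manager names are made up) that picks the latest manager per client from parsed dates and, for comparison, shows what sorting the raw strings would select.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the challenge file
toy = pd.DataFrame({
    'Client': ['Acme', 'Acme', 'Beta'],
    'Client ID': [1, 2, 3],
    'Account Manager': ['Old Manager', 'New Manager', 'Only Manager'],
    'From Date': ['31/12/2018', '01/06/2020', '15/03/2019'],
})

# Parse the dd/mm/yyyy strings so that ordering is chronological
parsed = pd.to_datetime(toy['From Date'], format='%d/%m/%Y')

# idxmax gives, per client, the row label that holds the latest date
latest_idx = parsed.groupby(toy['Client']).idxmax()
latest = toy.loc[latest_idx].reset_index(drop=True)

# For comparison: sorting the raw strings places '31/12/2018' after '01/06/2020'
naive = toy.sort_values('From Date').groupby('Client')['Account Manager'].last()

print(latest)
print(naive)
```

In this toy case the string sort returns the outdated manager for the client with two records, while the parsed dates return the current one.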

### 3. Filter the data so that only the most recent Account Manager remains

- Be careful not to lose any attendees from the training sessions!

In [8]:
# Keep one row per attendee per training session
A_Mgtd = df.drop_duplicates(subset=['Contact Email', 'Contact Name', 'Client', 'Training'])

In [10]:
# Attach each client's most recent Client ID, Account Manager and From Date
output = A_Mgtd.merge(A_Mgt, how='left', on=['Client'])
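
One way to guard against losing or duplicating attendees in the join is to let pandas validate the key relationship and then compare row counts. This is a sketch on hypothetical toy frames; `attendees` and `latest_manager` are assumed stand-ins for the deduplicated attendee rows and the per-client lookup.

```python
import pandas as pd

# Hypothetical toy data standing in for the attendee rows and the per-client lookup
attendees = pd.DataFrame({
    'Training': ['Prep 101', 'Prep 101', 'Tableau 104'],
    'Contact Email': ['a@acme.com', 'b@acme.com', 'c@beta.com'],
    'Client': ['Acme', 'Acme', 'Beta'],
})
latest_manager = pd.DataFrame({
    'Client': ['Acme', 'Beta'],
    'Client ID': [2, 3],
    'Account Manager': ['New Manager', 'Only Manager'],
})

# validate='many_to_one' raises MergeError if a client appears twice in the lookup,
# which would otherwise silently duplicate attendee rows
merged = attendees.merge(latest_manager, how='left', on='Client', validate='many_to_one')

# A left join against a unique key keeps the attendee row count unchanged
rows_preserved = len(merged) == len(attendees)
# Every attendee should have been matched to a manager
all_matched = bool(merged['Account Manager'].notna().all())

print(rows_preserved, all_matched)
```

The left join keeps every attendee row even when a client has no lookup entry, so the `notna` check is what surfaces unmatched clients.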

### 4. In some instances, the Client ID has changed along with the Account Manager. Ensure only the most recent Client ID remains

In [6]:
output.columns

Index(['Training', 'Contact Email', 'Contact Name', 'Client', 'Client ID_x',
       'Account Manager_x', 'From Date_x', 'Client ID_y', 'Account Manager_y',
       'From Date_y'],
      dtype='object')

In [7]:
output.drop(['Client ID_x','Account Manager_x','From Date_x'], axis=1, inplace=True)
output.rename(columns={'Account Manager_y':'Account Manager','Client ID_y':'Client ID','From Date_y':'From Date'}, inplace=True)

In [8]:
output.columns

Index(['Training', 'Contact Email', 'Contact Name', 'Client', 'Client ID',
       'Account Manager', 'From Date'],
      dtype='object')
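
An alternative that avoids the `_x`/`_y` suffixes altogether is to drop the stale columns before merging, so the lookup's columns arrive under their original names. The sketch below uses hypothetical toy frames; `attendees`, `latest` and `stale` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy rows: one client whose ID and manager changed over time
attendees = pd.DataFrame({
    'Training': ['Prep 101', 'Prep 101'],
    'Contact Email': ['a@acme.com', 'b@acme.com'],
    'Client': ['Acme', 'Acme'],
    'Client ID': [1, 2],
    'Account Manager': ['Old Manager', 'New Manager'],
    'From Date': ['31/12/2018', '01/06/2020'],
})
# Assumed per-client lookup holding only the most recent record
latest = pd.DataFrame({
    'Client': ['Acme'],
    'Client ID': [2],
    'Account Manager': ['New Manager'],
    'From Date': ['01/06/2020'],
})

# Remove the possibly outdated columns first, then bring in the current values
stale = ['Client ID', 'Account Manager', 'From Date']
tidy = attendees.drop(columns=stale).merge(latest, how='left', on='Client')

print(tidy.columns.tolist())
```

With no overlapping non-key columns, the merge adds no suffixes and no rename step is needed afterwards.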

### 5. Output the data

In [9]:
output.to_csv('WK5-Dealing with duplication Output.csv', index=False)

In [10]:
output.head()

Unnamed: 0,Training,Contact Email,Contact Name,Client,Client ID,Account Manager,From Date
0,Prep 101 - 2020-10-01,abagael.matresse@brauninc.com,Abagael Matresse,Braun Inc,1200,Xiaoxuan Ma,31/12/2019
1,Prep 101 - 2020-10-01,abagail.macconnell@lakinllc.com,Abagail MacConnell,Lakin LLC,924,Lucy Stevenson,01/01/2019
2,Prep 101 - 2020-10-01,abagail.moodey@raynorinc.com,Abagail Moodey,Raynor Inc,444,Nancy Smith,01/07/2015
3,Prep 101 - 2020-10-01,abby.eager@paucekgroup.com,Abby Eager,Paucek Group,893,Nancy Smith,20/09/2018
4,Prep 101 - 2020-10-01,abelard.mechell@lehner.swiftanddickinson.com,Abelard Mechell,"Lehner, Swift and Dickinson",1323,Xiaoxuan Ma,31/12/2019
