Requirements

1. Input the data
2. For the Transaction Path table:
    Make sure the field naming convention matches the other tables,
    e.g. Account_From should become Account From
3. For the Account Information table:
    Make sure there are no null values in the Account Holder ID.
    Ensure there is one row per Account Holder ID.
        Joint accounts have 2 Account Holders; we want a row for each of them
4. For the Account Holders table:
    Make sure the phone numbers start with 07
5. Bring the tables together
6. Filter out cancelled transactions 
7. Filter to transactions greater than £1,000 in value 
8. Filter out Platinum accounts
9. Output the data

In [2]:
import pandas as pd

Inputting Data

In [4]:
trans_path_df = pd.read_csv('Preppin Data Inputs/Transaction Path.csv')

In [5]:
trans_path_df

Unnamed: 0,Transaction ID,Account_To,Account_From
0,1957155,27356852,76206810
1,2147025,44242297,24826358
2,3065073,10295384,52104303
3,6622100,45519330,69315008
4,14877473,28680375,44586370
...,...,...,...
8776,9996102963,17925406,40530538
8777,9996177785,37678813,60789634
8778,9997003500,54458410,17810734
8779,9997164946,57426365,23333877


In [6]:
trans_details_df = pd.read_csv('Preppin Data Inputs/Transaction Detail.csv')

In [7]:
trans_details_df

Unnamed: 0,Transaction ID,Transaction Date,Value,Cancelled?
0,1957155,2023-02-01,128.78,N
1,28234510,2023-02-01,163.82,N
2,33688648,2023-02-01,54.71,N
3,41670299,2023-02-01,88.10,N
4,42825784,2023-02-01,217.22,Y
...,...,...,...,...
8776,9881408962,2023-02-14,38.18,N
8777,9889326485,2023-02-14,126.71,N
8778,9892097130,2023-02-14,157.43,N
8779,9951297137,2023-02-14,120.16,N


In [8]:
acc_info_df = pd.read_csv('Preppin Data Inputs/Account Information.csv')

In [9]:
acc_info_df

Unnamed: 0,Account Number,Account Type,Account Holder ID,Balance Date,Balance
0,10005367,Platinum,70390615,2023-01-31,728.25
1,10011977,Basic,20123998,2023-01-31,676.54
2,10024680,Platinum,54374080,2023-01-31,567.46
3,10031238,Basic,97027297,2023-01-31,576.52
4,10034341,Joint,"89920386, 97325900",2023-01-31,390.39
...,...,...,...,...,...
2995,99734848,Gold,56623581,2023-01-31,354.29
2996,99760030,Gold,66659633,2023-01-31,988.00
2997,99791709,Gold,94872412,2023-01-31,15.05
2998,99877007,Basic,90069774,2023-01-31,582.00


In [10]:
acc_holders_df = pd.read_csv('Preppin Data Inputs/Account Holders.csv')

In [11]:
acc_holders_df

Unnamed: 0,Account Holder ID,Name,Date of Birth,Contact Number,First Line of Address
0,70390615,Mahmoud Hehnke,28/08/1995,7479286250,18535 Loftsgordon Park
1,20123998,Maynord Surgeoner,21/08/1997,7716107305,6422 Buena Vista Plaza
2,54374080,Giraldo Kimbley,22/03/1995,7489940612,93005 Summer Ridge Avenue
3,97027297,Blake Dudson,30/06/1955,7253587445,2 Huxley Hill
4,89920386,Ajay Douce,19/12/1930,7395580534,90176 Miller Alley
...,...,...,...,...,...
3067,66659633,Mata Brownett,16/07/1931,7314541365,5 Kenwood Park
3068,94872412,Tabby Matteotti,13/10/1962,7586210387,7517 Aberg Plaza
3069,90069774,Cyndia Fosse,09/03/1961,7494132554,42 Trailsway Point
3070,45810412,Arch Segrave,05/06/2000,7289510957,14 3rd Center


For the Transaction Path table: make sure the field naming convention matches the other tables, e.g. Account_From should become Account From

In [13]:
trans_path_df = trans_path_df.rename(columns={'Account_To':'Account To', 'Account_From':'Account From'})

In [14]:
trans_path_df

Unnamed: 0,Transaction ID,Account To,Account From
0,1957155,27356852,76206810
1,2147025,44242297,24826358
2,3065073,10295384,52104303
3,6622100,45519330,69315008
4,14877473,28680375,44586370
...,...,...,...
8776,9996102963,17925406,40530538
8777,9996177785,37678813,60789634
8778,9997003500,54458410,17810734
8779,9997164946,57426365,23333877
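As an alternative to listing each column by hand, the underscores can be replaced across all column labels at once. The sketch below uses a made-up toy frame (`toy_path` is a hypothetical name, not part of the notebook) with the same column names as the Transaction Path input.

```python
import pandas as pd

# Toy frame with the same column names as the Transaction Path input (values are made up)
toy_path = pd.DataFrame({
    'Transaction ID': [1, 2],
    'Account_To': [111, 222],
    'Account_From': [333, 444],
})

# Replace every underscore in the column labels with a space (literal match, not a regex)
toy_path.columns = toy_path.columns.str.replace('_', ' ', regex=False)

print(list(toy_path.columns))
```

This only makes sense if every underscore in the headers is meant to become a space, which holds for the three columns shown here.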


For the Account Information table: 
    Make sure there are no null values in the Account Holder ID.
    Ensure there is one row per Account Holder ID. 
    Joint accounts have 2 Account Holders; we want a row for each of them

In [16]:
# Pass the subset as a list, which every pandas version accepts
acc_info_df = acc_info_df.dropna(subset=['Account Holder ID'])

In [17]:
acc_info_df

Unnamed: 0,Account Number,Account Type,Account Holder ID,Balance Date,Balance
0,10005367,Platinum,70390615,2023-01-31,728.25
1,10011977,Basic,20123998,2023-01-31,676.54
2,10024680,Platinum,54374080,2023-01-31,567.46
3,10031238,Basic,97027297,2023-01-31,576.52
4,10034341,Joint,"89920386, 97325900",2023-01-31,390.39
...,...,...,...,...,...
2995,99734848,Gold,56623581,2023-01-31,354.29
2996,99760030,Gold,66659633,2023-01-31,988.00
2997,99791709,Gold,94872412,2023-01-31,15.05
2998,99877007,Basic,90069774,2023-01-31,582.00


In [18]:
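# Split on the comma plus any surrounding whitespace, so the second holder of a joint account has no leading space
acc_info_df['Account Holder ID'] = acc_info_df['Account Holder ID'].str.split(r'\s*,\s*')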

In [19]:
acc_info_df

Unnamed: 0,Account Number,Account Type,Account Holder ID,Balance Date,Balance
0,10005367,Platinum,[70390615],2023-01-31,728.25
1,10011977,Basic,[20123998],2023-01-31,676.54
2,10024680,Platinum,[54374080],2023-01-31,567.46
3,10031238,Basic,[97027297],2023-01-31,576.52
4,10034341,Joint,"[89920386, 97325900]",2023-01-31,390.39
...,...,...,...,...,...
2995,99734848,Gold,[56623581],2023-01-31,354.29
2996,99760030,Gold,[66659633],2023-01-31,988.00
2997,99791709,Gold,[94872412],2023-01-31,15.05
2998,99877007,Basic,[90069774],2023-01-31,582.00


In [20]:
acc_info_df = acc_info_df.explode('Account Holder ID', ignore_index=True)

acc_info_df

Unnamed: 0,Account Number,Account Type,Account Holder ID,Balance Date,Balance
0,10005367,Platinum,70390615,2023-01-31,728.25
1,10011977,Basic,20123998,2023-01-31,676.54
2,10024680,Platinum,54374080,2023-01-31,567.46
3,10031238,Basic,97027297,2023-01-31,576.52
4,10034341,Joint,89920386,2023-01-31,390.39
...,...,...,...,...,...
3067,99760030,Gold,66659633,2023-01-31,988.00
3068,99791709,Gold,94872412,2023-01-31,15.05
3069,99877007,Basic,90069774,2023-01-31,582.00
3070,99937043,Joint,45810412,2023-01-31,918.42
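The drop-nulls, split and explode sequence can be checked on a small made-up frame. In this sketch `toy_info` and its values are illustrative only. The explicit `str.strip()` is there because joint IDs appear in the input as `"89920386, 97325900"`, with a space after the comma that would otherwise stay attached to the second ID.

```python
import pandas as pd

# Toy version of the Account Information table (values are made up for illustration)
toy_info = pd.DataFrame({
    'Account Number': [10034341, 10011977, 10099999],
    'Account Holder ID': ['89920386, 97325900', '20123998', None],
})

# 1. Drop rows with no Account Holder ID (copy to get an independent frame)
toy_info = toy_info.dropna(subset=['Account Holder ID']).copy()

# 2. Turn each ID string into a list of IDs
toy_info['Account Holder ID'] = toy_info['Account Holder ID'].str.split(',')

# 3. One row per list element, then remove stray whitespace around each ID
toy_info = toy_info.explode('Account Holder ID', ignore_index=True)
toy_info['Account Holder ID'] = toy_info['Account Holder ID'].str.strip()

print(toy_info)
```

After this, the joint account appears on two rows and every ID is a clean string that can be used as a join key.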


For the Account Holders table: Make sure the phone numbers start with 07

In [22]:
acc_holders_df['Contact Number'] = '0' + acc_holders_df['Contact Number'].astype(str)

In [23]:
acc_holders_df

Unnamed: 0,Account Holder ID,Name,Date of Birth,Contact Number,First Line of Address
0,70390615,Mahmoud Hehnke,28/08/1995,07479286250,18535 Loftsgordon Park
1,20123998,Maynord Surgeoner,21/08/1997,07716107305,6422 Buena Vista Plaza
2,54374080,Giraldo Kimbley,22/03/1995,07489940612,93005 Summer Ridge Avenue
3,97027297,Blake Dudson,30/06/1955,07253587445,2 Huxley Hill
4,89920386,Ajay Douce,19/12/1930,07395580534,90176 Miller Alley
...,...,...,...,...,...
3067,66659633,Mata Brownett,16/07/1931,07314541365,5 Kenwood Park
3068,94872412,Tabby Matteotti,13/10/1962,07586210387,7517 Aberg Plaza
3069,90069774,Cyndia Fosse,09/03/1961,07494132554,42 Trailsway Point
3070,45810412,Arch Segrave,05/06/2000,07289510957,14 3rd Center
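Prefixing every number with `0` works when all values were read as integers and lost their leading zero. A slightly more defensive sketch, on made-up data, adds the `0` only where the number does not already start with `07`. The mixed-type toy column is an assumption for illustration, not something seen in the input.

```python
import pandas as pd

# Toy column: one number read as an integer, one already stored as text with its leading zero
toy_holders = pd.DataFrame({'Contact Number': [7479286250, '07716107305']})

numbers = toy_holders['Contact Number'].astype(str)

# Keep values that already start with '07'; otherwise prepend '0'
toy_holders['Contact Number'] = numbers.where(numbers.str.startswith('07'), '0' + numbers)

# Simple check of the requirement
print(toy_holders['Contact Number'].str.startswith('07').all())
```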


Bring the tables together

In [25]:
df = trans_details_df.merge(right=trans_path_df, how='inner', on='Transaction ID')

df

Unnamed: 0,Transaction ID,Transaction Date,Value,Cancelled?,Account To,Account From
0,1957155,2023-02-01,128.78,N,27356852,76206810
1,28234510,2023-02-01,163.82,N,28745450,87373821
2,33688648,2023-02-01,54.71,N,48271608,98821521
3,41670299,2023-02-01,88.10,N,34128127,80808205
4,42825784,2023-02-01,217.22,Y,25006771,81049128
...,...,...,...,...,...,...
8776,9881408962,2023-02-14,38.18,N,19920990,25099734
8777,9889326485,2023-02-14,126.71,N,86864320,22404690
8778,9892097130,2023-02-14,157.43,N,44912164,88933315
8779,9951297137,2023-02-14,120.16,N,54288062,97521278
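Both transaction tables show the same number of rows as the joined result, which suggests one row per Transaction ID on each side. Under that assumption, `merge` can be asked to verify the relationship with its `validate` parameter. The frames below are toy stand-ins with made-up values.

```python
import pandas as pd

# Toy stand-ins for Transaction Detail and Transaction Path
details = pd.DataFrame({'Transaction ID': [1, 2, 3], 'Value': [10.0, 20.0, 30.0]})
path = pd.DataFrame({
    'Transaction ID': [1, 2, 3],
    'Account To': [11, 22, 33],
    'Account From': [44, 55, 66],
})

# validate='one_to_one' raises pandas.errors.MergeError if either side has duplicate keys
joined = details.merge(path, how='inner', on='Transaction ID', validate='one_to_one')

print(len(joined) == len(details))
```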


In [26]:
df = df.merge(right=acc_info_df, how='left', left_on='Account From', right_on='Account Number')

df

Unnamed: 0,Transaction ID,Transaction Date,Value,Cancelled?,Account To,Account From,Account Number,Account Type,Account Holder ID,Balance Date,Balance
0,1957155,2023-02-01,128.78,N,27356852,76206810,76206810,Platinum,91369374,2023-01-31,887.63
1,28234510,2023-02-01,163.82,N,28745450,87373821,87373821,Basic,57733624,2023-01-31,945.00
2,33688648,2023-02-01,54.71,N,48271608,98821521,98821521,Basic,62722448,2023-01-31,906.51
3,41670299,2023-02-01,88.10,N,34128127,80808205,80808205,Basic,14515913,2023-01-31,280.92
4,42825784,2023-02-01,217.22,Y,25006771,81049128,81049128,Basic,72939081,2023-01-31,694.91
...,...,...,...,...,...,...,...,...,...,...,...
8970,9881408962,2023-02-14,38.18,N,19920990,25099734,25099734,Gold,74513603,2023-01-31,872.31
8971,9889326485,2023-02-14,126.71,N,86864320,22404690,22404690,Gold,08051737,2023-01-31,726.06
8972,9892097130,2023-02-14,157.43,N,44912164,88933315,88933315,Gold,58527537,2023-01-31,289.79
8973,9951297137,2023-02-14,120.16,N,54288062,97521278,97521278,Basic,07369531,2023-01-31,833.78


In [27]:
acc_holders_df['Account Holder ID'] = acc_holders_df['Account Holder ID'].astype(str)

In [28]:
df = df.merge(right=acc_holders_df, how='left', on='Account Holder ID')

df

Unnamed: 0,Transaction ID,Transaction Date,Value,Cancelled?,Account To,Account From,Account Number,Account Type,Account Holder ID,Balance Date,Balance,Name,Date of Birth,Contact Number,First Line of Address
0,1957155,2023-02-01,128.78,N,27356852,76206810,76206810,Platinum,91369374,2023-01-31,887.63,Karolina Lorraine,06/06/1955,07353176639,5 Helena Junction
1,28234510,2023-02-01,163.82,N,28745450,87373821,87373821,Basic,57733624,2023-01-31,945.00,Rolland Pestricke,11/11/1938,07499827004,71 Vera Road
2,33688648,2023-02-01,54.71,N,48271608,98821521,98821521,Basic,62722448,2023-01-31,906.51,Dosi Cayley,04/02/1935,07314523135,2 Talmadge Court
3,41670299,2023-02-01,88.10,N,34128127,80808205,80808205,Basic,14515913,2023-01-31,280.92,Aymer Chessman,14/05/1979,07629430015,92952 Jackson Place
4,42825784,2023-02-01,217.22,Y,25006771,81049128,81049128,Basic,72939081,2023-01-31,694.91,Joanne Shrieves,21/12/1961,07961408348,31 Sauthoff Terrace
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8970,9881408962,2023-02-14,38.18,N,19920990,25099734,25099734,Gold,74513603,2023-01-31,872.31,Tripp Dehn,30/06/1979,07104308156,66531 Walton Alley
8971,9889326485,2023-02-14,126.71,N,86864320,22404690,22404690,Gold,08051737,2023-01-31,726.06,,,,
8972,9892097130,2023-02-14,157.43,N,44912164,88933315,88933315,Gold,58527537,2023-01-31,289.79,Dudley Tomek,04/03/1973,07457996003,5957 Scofield Street
8973,9951297137,2023-02-14,120.16,N,54288062,97521278,97521278,Basic,07369531,2023-01-31,833.78,,,,
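In the output above, the rows whose Account Holder ID starts with 0 (for example 08051737 and 07369531) have empty holder details. One possible cause, not confirmed by the data shown, is that the Account Holders IDs were read as integers, so `astype(str)` produces `8051737` rather than `08051737`. The sketch below reproduces that situation on toy data and uses `indicator=True` to expose unmatched rows. The names and the 8-character ID width are assumptions.

```python
import pandas as pd

# Left side: IDs kept as text, so leading zeros survive
left = pd.DataFrame({'Account Holder ID': ['08051737', '74513603']})
# Right side: IDs read as integers, so the leading zero is lost (names are placeholders)
right = pd.DataFrame({
    'Account Holder ID': [8051737, 74513603],
    'Name': ['Holder A', 'Holder B'],
})

# Plain astype(str) gives '8051737', which does not match '08051737'
naive_right = right.assign(**{'Account Holder ID': right['Account Holder ID'].astype(str)})
naive = left.merge(naive_right, how='left', on='Account Holder ID', indicator=True)

# Zero-padding to 8 characters (assumed ID width) restores the match
padded_right = right.assign(
    **{'Account Holder ID': right['Account Holder ID'].astype(str).str.zfill(8)}
)
fixed = left.merge(padded_right, how='left', on='Account Holder ID', indicator=True)

print(naive['_merge'].tolist())
print(fixed['_merge'].tolist())
```

Counting `left_only` values in the `_merge` column is a quick way to see how many transactions failed to find an account holder.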


Filter out cancelled transactions

In [30]:
df = df[df['Cancelled?'] == 'N']

df

Unnamed: 0,Transaction ID,Transaction Date,Value,Cancelled?,Account To,Account From,Account Number,Account Type,Account Holder ID,Balance Date,Balance,Name,Date of Birth,Contact Number,First Line of Address
0,1957155,2023-02-01,128.78,N,27356852,76206810,76206810,Platinum,91369374,2023-01-31,887.63,Karolina Lorraine,06/06/1955,07353176639,5 Helena Junction
1,28234510,2023-02-01,163.82,N,28745450,87373821,87373821,Basic,57733624,2023-01-31,945.00,Rolland Pestricke,11/11/1938,07499827004,71 Vera Road
2,33688648,2023-02-01,54.71,N,48271608,98821521,98821521,Basic,62722448,2023-01-31,906.51,Dosi Cayley,04/02/1935,07314523135,2 Talmadge Court
3,41670299,2023-02-01,88.10,N,34128127,80808205,80808205,Basic,14515913,2023-01-31,280.92,Aymer Chessman,14/05/1979,07629430015,92952 Jackson Place
5,57723869,2023-02-01,89.22,N,12859818,34902863,34902863,Gold,21811388,2023-01-31,31.36,Skip Hiddy,26/05/1979,07543428482,80447 Troy Street
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8970,9881408962,2023-02-14,38.18,N,19920990,25099734,25099734,Gold,74513603,2023-01-31,872.31,Tripp Dehn,30/06/1979,07104308156,66531 Walton Alley
8971,9889326485,2023-02-14,126.71,N,86864320,22404690,22404690,Gold,08051737,2023-01-31,726.06,,,,
8972,9892097130,2023-02-14,157.43,N,44912164,88933315,88933315,Gold,58527537,2023-01-31,289.79,Dudley Tomek,04/03/1973,07457996003,5957 Scofield Street
8973,9951297137,2023-02-14,120.16,N,54288062,97521278,97521278,Basic,07369531,2023-01-31,833.78,,,,


Filter to transactions greater than £1,000 in value

In [32]:
df = df[df['Value'] > 1000]

df

Unnamed: 0,Transaction ID,Transaction Date,Value,Cancelled?,Account To,Account From,Account Number,Account Type,Account Holder ID,Balance Date,Balance,Name,Date of Birth,Contact Number,First Line of Address
16,205795064,2023-02-01,1152.9,N,86053438,86893452,86893452,Platinum,92356297,2023-01-31,641.84,Guenevere Rubie,01/11/1978,07603634011,2715 Little Fleur Road
22,275656446,2023-02-01,1139.7,N,28911002,81655495,81655495,Basic,84399484,2023-01-31,315.33,Vernice Blonfield,15/03/1933,07981776102,718 Heath Way
46,664959989,2023-02-01,1255.3,N,39402264,70214700,70214700,Basic,08454965,2023-01-31,152.79,,,,
61,826394388,2023-02-01,1089.6,N,71718483,32221652,32221652,Platinum,55167963,2023-01-31,36.96,Cyndia Berrington,05/06/1944,07776843065,5 Waywood Drive
68,908048920,2023-02-01,1194.4,N,97294152,99729369,99729369,Basic,15559813,2023-01-31,751.01,Kerrill Dunstan,18/02/1952,07605122700,921 Merrick Way
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8890,8657927187,2023-02-14,1397.8,N,11110085,11662506,11662506,Basic,41390313,2023-01-31,573.89,Leigh Slimings,12/11/2002,07112224952,5841 Florence Circle
8918,9153064143,2023-02-14,1338.4,N,23295106,45029598,45029598,Gold,30124635,2023-01-31,558.87,Jerri McCrohon,04/07/2003,07696895888,95 Vernon Park
8941,9497069960,2023-02-14,1314.7,N,69540137,10799356,10799356,Gold,26582406,2023-01-31,915.04,Emmi Sinnie,29/01/1954,07304047316,76493 Jenifer Place
8952,9598995518,2023-02-14,1365.0,N,53376344,94973297,94973297,Basic,27881994,2023-01-31,761.68,Vida Chiverstone,15/04/1998,07775774111,6 Mayer Court


Filter out Platinum accounts

In [34]:
# Exact comparison keeps every account type other than Platinum
df = df[df['Account Type'] != 'Platinum']

df

Unnamed: 0,Transaction ID,Transaction Date,Value,Cancelled?,Account To,Account From,Account Number,Account Type,Account Holder ID,Balance Date,Balance,Name,Date of Birth,Contact Number,First Line of Address
22,275656446,2023-02-01,1139.7,N,28911002,81655495,81655495,Basic,84399484,2023-01-31,315.33,Vernice Blonfield,15/03/1933,07981776102,718 Heath Way
46,664959989,2023-02-01,1255.3,N,39402264,70214700,70214700,Basic,08454965,2023-01-31,152.79,,,,
68,908048920,2023-02-01,1194.4,N,97294152,99729369,99729369,Basic,15559813,2023-01-31,751.01,Kerrill Dunstan,18/02/1952,07605122700,921 Merrick Way
70,973893052,2023-02-01,1187.7,N,51266306,45706674,45706674,Basic,04564384,2023-01-31,390.73,,,,
131,2019114135,2023-02-01,1821.6,N,87876065,52814597,52814597,Gold,92201515,2023-01-31,899.39,Aluin Errey,11/01/1993,07748137461,65 Ramsey Court
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8890,8657927187,2023-02-14,1397.8,N,11110085,11662506,11662506,Basic,41390313,2023-01-31,573.89,Leigh Slimings,12/11/2002,07112224952,5841 Florence Circle
8918,9153064143,2023-02-14,1338.4,N,23295106,45029598,45029598,Gold,30124635,2023-01-31,558.87,Jerri McCrohon,04/07/2003,07696895888,95 Vernon Park
8941,9497069960,2023-02-14,1314.7,N,69540137,10799356,10799356,Gold,26582406,2023-01-31,915.04,Emmi Sinnie,29/01/1954,07304047316,76493 Jenifer Place
8952,9598995518,2023-02-14,1365.0,N,53376344,94973297,94973297,Basic,27881994,2023-01-31,761.68,Vida Chiverstone,15/04/1998,07775774111,6 Mayer Court
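The three filters can also be combined into a single boolean mask, which keeps the conditions in one place. This is a sketch on made-up rows, not a replacement for the cells above.

```python
import pandas as pd

# Toy rows covering each filter condition (values are made up)
toy = pd.DataFrame({
    'Cancelled?': ['N', 'Y', 'N', 'N'],
    'Value': [1139.7, 1500.0, 128.78, 1152.9],
    'Account Type': ['Basic', 'Gold', 'Gold', 'Platinum'],
})

# Keep rows that are not cancelled, exceed 1,000 in value, and are not Platinum accounts
keep = (
    (toy['Cancelled?'] == 'N')
    & (toy['Value'] > 1000)
    & (toy['Account Type'] != 'Platinum')
)
filtered = toy[keep]

print(filtered)
```

Only the first toy row passes all three conditions: the second is cancelled, the third is below the threshold, and the fourth is a Platinum account.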


Output the data

In [36]:
df.to_csv('Preppin Data Outputs/pd2023wk7_output.csv', index=False)