In [41]:
import pandas as pd
from utils import generate_student_id

In [42]:
df = pd.read_csv("data/TC_Aug1923_2024.csv")

Remove unnecessary rows: keep the first 34 and drop the rest.

In [43]:
df = df.drop(df.index[34:])

The column names are quite long, so let's shorten them by mapping each original column name to our desired name in a dictionary. We then use the *rename* method in pandas to change the dataframe's column names.

In [44]:
column_mapping = {
    "Name of Participant (First Name, Last Name)": "Full Name",
    "Nick Name": "Nickname",
    "Payment": "Payment Received",
    "School Attending": "School",
    "Main Contact e-mail address": "Main Contact E-Mail",
    "Main Contact phone number": "Main Contact Number",
    "Optional Second e-mail address": "Secondary E-Mail",
    "Choose your group": "Age Group",
    "Optional Second phone number": "Secondary Phone Number",
    "Thank your for signing up, please add any comments you would like us to know about.": "Additional Comments",
    "I, the parent or guardian of the player named above, acknowledge that when my child is playing/participating/performing basketball activities s/he may suffer injury. I release Top Flight Basketball Co. Ltd from any liability concerning any injury or harm suffered by my child during or as a consequence of participation in the activities.": "Injury Liability Waver",
    "I, the parent or legal guardian of the child named above grant Top Flight basketball Co. Ltd my permission to use the photographs taken at basketball sessions for any legal use, including but not limited to: publicity, copyright purposes, illustration, advertising, and web content. Furthermore, I understand that no royalty, fee or other compensation shall become payable to me by reason of such use.": "Photograph Release Agreement",
    "Once booking made payments can be made to the following bank account: Top Flight Basketball Company Limited - HSBC – 023-697444-838. Please send us proof of payment to INFO@TOPFLIGHTHONGKONG.COM with your child's name indicated.": "Payment Instruction Acknowledgement"
}

df.rename(columns=column_mapping, inplace=True)
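
One thing to watch with *rename*: by default it silently skips dictionary keys that match no column, so a typo in one of the long survey questions would go unnoticed. Below is a minimal sketch on a made-up toy frame (the frame and both mappings are illustrative, not the real survey export) that uses the `errors="raise"` option to surface such typos.

```python
import pandas as pd

# Toy frame standing in for the survey export (illustrative values only).
toy = pd.DataFrame({"Nick Name": ["A"], "School Attending": ["B"]})

good_mapping = {"Nick Name": "Nickname", "School Attending": "School"}
bad_mapping = {"Nick  Name": "Nickname"}  # extra space: matches no column

# errors="raise" makes pandas complain about keys that are not columns.
renamed = toy.rename(columns=good_mapping, errors="raise")

try:
    toy.rename(columns=bad_mapping, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True
```

With the default `errors="ignore"`, the second call would return the frame unchanged without any message.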

In [45]:
df.head()

Unnamed: 0,Timestamp,Email Address,Full Name,Nickname,Aug 19,Aug 20,Aug 21,Aug 22,Aug 23,Payment Received,...,School,Age Group,Main Contact E-Mail,Main Contact Number,Secondary E-Mail,Secondary Phone Number,Payment Instruction Acknowledgement,Injury Liability Waver,Photograph Release Agreement,Additional Comments
0,7/16/2024 13:31:46,pacificocean1977@gmail.com,Deniz Kabadayi,Deniz,True,True,True,True,True,"$2,000.00",...,DSC İnternational School,August 12-16 (Mon-Fri) at ESF / Ages 6-13 - 9:...,pacificocean1977@gmail.com,98099861,zafer.kabadayi@schindler.com,91918111.0,Confirm,Confirm,Confirm,
1,7/16/2024 14:01:06,chewku@gmail.com,"Damian, Chew",Damian,True,True,True,True,True,"$2,000.00",...,ESF Island School,August 12-16 (Mon-Fri) at ESF / Ages 6-13 - 9:...,Chewku@gmail.com,91836859,Chewdjm@gmail.com,91856675.0,Confirm,Confirm,Confirm,
2,7/26/2024 13:45:05,joyce_mslai@hotmail.com,Jonas Chiu,Jonas,True,True,False,True,True,"$1,800.00",...,South Island School,,joyce_mslai@hotmail.com,92633705,,,Confirm,Confirm,Confirm,
3,7/27/2024 6:34:32,Tereza_d@hotmail.com,Ethan Cheng,Ethan,True,True,False,False,True,,...,HKIS,,Tereza_d@hotmail.com,Tereza_d@hotmail.com,,,Confirm,Confirm,Confirm,
4,7/27/2024 6:36:31,Tereza_d@hotmail.com,Elliot Cheng,Elliot,True,True,True,True,True,,...,HKIS,,Tereza_d@hotmail.com,Tereza_d@hotmail.com,,,Confirm,Confirm,Confirm,


Let's remove anything that would give away the identity of our customers, along with columns we don't need. This includes "Email Address", "Nickname", "Main Contact Number", "Secondary Phone Number", "Main Contact E-Mail", "Age Group" and "Secondary E-Mail". *We will keep "Full Name" for now, as we need it to build the student ID.*

In [46]:
id_cols_remove = ["Email Address", "Main Contact Number", "Main Contact E-Mail", "Secondary E-Mail", "Nickname", "Secondary Phone Number", "Age Group"]
# Drop all identifying columns in one call instead of looping for side effects
df.drop(columns=id_cols_remove, inplace=True)

In [47]:
df

Unnamed: 0,Timestamp,Full Name,Aug 19,Aug 20,Aug 21,Aug 22,Aug 23,Payment Received,Jersey Sizes,Payment Date,Date of Birth,Age,Gender,School,Payment Instruction Acknowledgement,Injury Liability Waver,Photograph Release Agreement,Additional Comments
0,7/16/2024 13:31:46,Deniz Kabadayi,True,True,True,True,True,"$2,000.00",34.0,Aug 24,12/22/2011,12.0,Female,DSC İnternational School,Confirm,Confirm,Confirm,
1,7/16/2024 14:01:06,"Damian, Chew",True,True,True,True,True,"$2,000.00",36.0,,4/26/2012,12.0,Male,ESF Island School,Confirm,Confirm,Confirm,
2,7/26/2024 13:45:05,Jonas Chiu,True,True,False,True,True,"$1,800.00",34.0,,9/12/2012,11.0,Male,South Island School,Confirm,Confirm,Confirm,
3,7/27/2024 6:34:32,Ethan Cheng,True,True,False,False,True,,36.0,,4/10/2012,12.0,Male,HKIS,Confirm,Confirm,Confirm,
4,7/27/2024 6:36:31,Elliot Cheng,True,True,True,True,True,,34.0,,10/27/2014,9.0,Male,HKIS,Confirm,Confirm,Confirm,
5,7/30/2024 4:16:19,Luca Monguillot,True,True,True,False,True,"$1,600.00",32.0,Aug 23,8/22/2015,9.0,Male,IMS,Confirm,Confirm,Confirm,As discussed with Agnes Luca will attend daily...
6,7/30/2024 6:38:05,Adam Lai,True,True,True,True,True,"$2,000.00",34.0,July 30,10/17/2014,9.0,Male,HKIS,Confirm,Confirm,,
7,7/31/2024 6:49:26,Alexander Junas,True,True,True,True,True,"$2,000.00",32.0,Aug 1,6/7/2013,11.0,Male,Silvermine bay school,Confirm,Confirm,Confirm,
8,8/2/2024 11:00:47,Ryan Jenson LAM,False,False,False,False,False,,,,4/3/2012,12.0,Male,French International School,Confirm,Confirm,Confirm,
9,8/2/2024 11:52:26,Loic VAN HOOF,True,True,True,True,True,"$1,800.00",28.0,Aug 19,9/19/2017,6.0,Male,Victoria Shanghai Academy,Confirm,Confirm,Confirm,


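After skimming the data, I noticed some missing or incorrect information. Two names are missing, so let's fill them in at rows 32 and 33; both students attended every class and are male. One name in row 31 is misspelled, and row 15 lists two names even though only one child attended. Four students (rows 30 to 33) are also missing their date of birth, age or school, so let's add that information too.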

In [48]:
# Fix the misspelled name in row 31 and fill in the names at rows 32 and 33
df.at[31,'Full Name'] = 'Caleb Wan'
df.at[32, 'Full Name'] = 'Andrew Atayde'
df.at[33, 'Full Name'] = 'Anthony Atayde'
# Parent filled in two names, but only one of them attended
df.at[15, 'Full Name'] = 'George Tilton'
# Fill in Attendance at row 32 and 33 from Aug 19 - Aug 23
df.iloc[32, 2:7] = 'TRUE'
df.iloc[33, 2:7] = 'TRUE'
# Fill in Gender at row 32 and 33.
df.at[32, 'Gender'] = 'Male'
df.at[33, 'Gender'] = 'Male'
# Fill in missing Date of Birth for rows 30 to 33
df.at[30,'Date of Birth'] = '1/1/2017'
df.at[31,'Date of Birth'] = '1/1/2017'
df.at[32,'Date of Birth'] = '10/14/2009'
df.at[33,'Date of Birth'] = '5/31/2008'
# Fill in missing Ages for rows 30 to 33
df.at[30,'Age'] = '9'
df.at[31,'Age'] = '14'
df.at[32,'Age'] = '14'
df.at[33,'Age'] = '16'
# Fill in missing Schools
df.at[30,'School'] = 'Unknown'
df.at[31,'School'] = 'Canadian International School'
df.at[32,'School'] = 'DSC'
df.at[33,'School'] = 'DSC'

df

Unnamed: 0,Timestamp,Full Name,Aug 19,Aug 20,Aug 21,Aug 22,Aug 23,Payment Received,Jersey Sizes,Payment Date,Date of Birth,Age,Gender,School,Payment Instruction Acknowledgement,Injury Liability Waver,Photograph Release Agreement,Additional Comments
0,7/16/2024 13:31:46,Deniz Kabadayi,True,True,True,True,True,"$2,000.00",34.0,Aug 24,12/22/2011,12,Female,DSC İnternational School,Confirm,Confirm,Confirm,
1,7/16/2024 14:01:06,"Damian, Chew",True,True,True,True,True,"$2,000.00",36.0,,4/26/2012,12,Male,ESF Island School,Confirm,Confirm,Confirm,
2,7/26/2024 13:45:05,Jonas Chiu,True,True,False,True,True,"$1,800.00",34.0,,9/12/2012,11,Male,South Island School,Confirm,Confirm,Confirm,
3,7/27/2024 6:34:32,Ethan Cheng,True,True,False,False,True,,36.0,,4/10/2012,12,Male,HKIS,Confirm,Confirm,Confirm,
4,7/27/2024 6:36:31,Elliot Cheng,True,True,True,True,True,,34.0,,10/27/2014,9,Male,HKIS,Confirm,Confirm,Confirm,
5,7/30/2024 4:16:19,Luca Monguillot,True,True,True,False,True,"$1,600.00",32.0,Aug 23,8/22/2015,9,Male,IMS,Confirm,Confirm,Confirm,As discussed with Agnes Luca will attend daily...
6,7/30/2024 6:38:05,Adam Lai,True,True,True,True,True,"$2,000.00",34.0,July 30,10/17/2014,9,Male,HKIS,Confirm,Confirm,,
7,7/31/2024 6:49:26,Alexander Junas,True,True,True,True,True,"$2,000.00",32.0,Aug 1,6/7/2013,11,Male,Silvermine bay school,Confirm,Confirm,Confirm,
8,8/2/2024 11:00:47,Ryan Jenson LAM,False,False,False,False,False,,,,4/3/2012,12,Male,French International School,Confirm,Confirm,Confirm,
9,8/2/2024 11:52:26,Loic VAN HOOF,True,True,True,True,True,"$1,800.00",28.0,Aug 19,9/19/2017,6,Male,Victoria Shanghai Academy,Confirm,Confirm,Confirm,
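
Since the ages and dates of birth are typed in by hand, the two can drift apart. As a rough sketch, the age could instead be derived from the date of birth. The helper name `age_on` and the reference date (the first camp day, Aug 19, 2024) are assumptions for illustration; the Age column in the export may have been computed against a different date.

```python
import pandas as pd

def age_on(dob: pd.Series, ref: pd.Timestamp) -> pd.Series:
    """Whole years between each date of birth and the reference date."""
    dob = pd.to_datetime(dob, format="%m/%d/%Y")
    # True where the birthday falls after the reference date in that year.
    birthday_pending = (dob.dt.month > ref.month) | (
        (dob.dt.month == ref.month) & (dob.dt.day > ref.day)
    )
    return ref.year - dob.dt.year - birthday_pending.astype(int)

# Toy dates in the same M/D/YYYY format as the "Date of Birth" column.
toy_dob = pd.Series(["12/22/2011", "4/26/2012", "8/19/2015"])
ages = age_on(toy_dob, pd.Timestamp("2024-08-19"))
```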


In [49]:
school_mappings = {
    "DSC İnternational School": "DSC International School",
    "DSC": "DSC International School",
    "Dsc International School": "DSC International School",
    "ESF Island School": "ESF Island School",
    "Esf South Island School": "ESF South Island School",
    "South Island School": "ESF South Island School",
    "HKIS": "Hong Kong International School",
    "Hkis": "Hong Kong International School",
    "IMS": "International Montessori School",
    "Silvermine bay school": "Silvermine Bay School",
    "French International School": "French International School",
    "Victoria Shanghai Academy": "Victoria Shanghai Academy",
    "Hong Kong Harrow International School": "Harrow International School",
    "The ISF School": "ISF Academy",
    "CDNIS": "Canadian International School",
    "YMCA Christian Academy": "YMCA Christian Academy",
    "Chinese International School": "Chinese International School",
    "CIS": "Chinese International School",
    "Kellett": "Kellett School",
    "ESF SIS": "ESF South Island School",
    "Harrow School": "Harrow International School",
    "SJPS": "St. Joseph's Primary School",
    "AISHK": "Australian International School",
    "Australian International School": "Australian International School"
}

In [50]:
def standardize_school_name(school_name):
    return school_mappings.get(school_name.strip(), school_name)

df['School'] = df['School'].apply(standardize_school_name)
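
Note that `standardize_school_name` calls `.strip()` on every value, which would fail on a missing (NaN) school; here the missing schools were filled in beforehand. A possible alternative, sketched on a toy Series with a shortened mapping (both made up for illustration), is to let pandas handle the missing values:

```python
import numpy as np
import pandas as pd

# Shortened stand-in for school_mappings (illustrative only).
toy_mapping = {
    "HKIS": "Hong Kong International School",
    "DSC": "DSC International School",
}
schools = pd.Series([" HKIS ", "DSC", "Kellett School", np.nan])

# .str methods pass NaN through; .replace swaps only exact matches
# and leaves every other value as it is.
standardized = schools.str.strip().replace(toy_mapping)
```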

In [51]:
# df['School'] = df['School'].apply(lambda x: 'ESF' if x == 'ESF' else x.title())
df['School'] = df['School'].apply(lambda x: ' '.join(word if word.upper() == 'ESF' else word.title() for word in x.split()))

In [52]:
df

Unnamed: 0,Timestamp,Full Name,Aug 19,Aug 20,Aug 21,Aug 22,Aug 23,Payment Received,Jersey Sizes,Payment Date,Date of Birth,Age,Gender,School,Payment Instruction Acknowledgement,Injury Liability Waver,Photograph Release Agreement,Additional Comments
0,7/16/2024 13:31:46,Deniz Kabadayi,True,True,True,True,True,"$2,000.00",34.0,Aug 24,12/22/2011,12,Female,Dsc International School,Confirm,Confirm,Confirm,
1,7/16/2024 14:01:06,"Damian, Chew",True,True,True,True,True,"$2,000.00",36.0,,4/26/2012,12,Male,ESF Island School,Confirm,Confirm,Confirm,
2,7/26/2024 13:45:05,Jonas Chiu,True,True,False,True,True,"$1,800.00",34.0,,9/12/2012,11,Male,ESF South Island School,Confirm,Confirm,Confirm,
3,7/27/2024 6:34:32,Ethan Cheng,True,True,False,False,True,,36.0,,4/10/2012,12,Male,Hong Kong International School,Confirm,Confirm,Confirm,
4,7/27/2024 6:36:31,Elliot Cheng,True,True,True,True,True,,34.0,,10/27/2014,9,Male,Hong Kong International School,Confirm,Confirm,Confirm,
5,7/30/2024 4:16:19,Luca Monguillot,True,True,True,False,True,"$1,600.00",32.0,Aug 23,8/22/2015,9,Male,International Montessori School,Confirm,Confirm,Confirm,As discussed with Agnes Luca will attend daily...
6,7/30/2024 6:38:05,Adam Lai,True,True,True,True,True,"$2,000.00",34.0,July 30,10/17/2014,9,Male,Hong Kong International School,Confirm,Confirm,,
7,7/31/2024 6:49:26,Alexander Junas,True,True,True,True,True,"$2,000.00",32.0,Aug 1,6/7/2013,11,Male,Silvermine Bay School,Confirm,Confirm,Confirm,
8,8/2/2024 11:00:47,Ryan Jenson LAM,False,False,False,False,False,,,,4/3/2012,12,Male,French International School,Confirm,Confirm,Confirm,
9,8/2/2024 11:52:26,Loic VAN HOOF,True,True,True,True,True,"$1,800.00",28.0,Aug 19,9/19/2017,6,Male,Victoria Shanghai Academy,Confirm,Confirm,Confirm,
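
In the output above, row 0 now reads "Dsc International School": the title-casing step only protects "ESF", so other acronyms set by the mapping are lowered again. A minimal sketch of one way around this follows; the helper name `title_keep_acronyms` and the `KEEP_UPPER` list are assumptions, to be adjusted to the acronyms actually present in the data.

```python
KEEP_UPPER = {"ESF", "DSC", "ISF", "YMCA"}  # assumed list of acronyms

def title_keep_acronyms(name: str) -> str:
    """Capitalize each word, but keep known acronyms fully upper-case."""
    words = []
    for word in name.split():
        if word.upper() in KEEP_UPPER:
            words.append(word.upper())
        else:
            # capitalize() avoids str.title() turning "Joseph's" into "Joseph'S"
            words.append(word.capitalize())
    return " ".join(words)

fixed = title_keep_acronyms("Dsc International School")
```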


Let's create the student ID. We will use the initials of the first, middle and last names, followed by a hyphen ("-"), then the year and month of the date of birth (YYYYMM), another hyphen, and a gender code (Female = 0, Male = 1). For example: Scott Matthew Summers, 1977/09/22, Male = SMS-197709-1

In [53]:
student_id_cols = df[["Full Name", "Date of Birth", "Gender"]]
student_id_cols

Unnamed: 0,Full Name,Date of Birth,Gender
0,Deniz Kabadayi,12/22/2011,Female
1,"Damian, Chew",4/26/2012,Male
2,Jonas Chiu,9/12/2012,Male
3,Ethan Cheng,4/10/2012,Male
4,Elliot Cheng,10/27/2014,Male
5,Luca Monguillot,8/22/2015,Male
6,Adam Lai,10/17/2014,Male
7,Alexander Junas,6/7/2013,Male
8,Ryan Jenson LAM,4/3/2012,Male
9,Loic VAN HOOF,9/19/2017,Male


I created the function *generate_student_id* in my *utils.py* script; it is imported at the top of the notebook.
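
The *utils.py* script is not shown in this notebook, so the following is only a sketch of what such a function might look like, based on the ID format described above. The name `sketch_student_id` is hypothetical, and the real function may handle names and dates differently.

```python
from datetime import datetime

GENDER_CODES = {"Female": "0", "Male": "1"}

def sketch_student_id(row):
    """Build INITIALS-YYYYMM-G from a row with name, birth date and gender."""
    # Treat commas as separators so "Last, First" style names still split.
    words = row["Full Name"].replace(",", " ").split()
    initials = "".join(word[0].upper() for word in words)
    # Dates in the export look like M/D/YYYY.
    dob = datetime.strptime(row["Date of Birth"], "%m/%d/%Y")
    return f"{initials}-{dob:%Y%m}-{GENDER_CODES[row['Gender']]}"

example = sketch_student_id(
    {"Full Name": "Scott Matthew Summers", "Date of Birth": "9/22/1977", "Gender": "Male"}
)
```

Because it only indexes the row by column name, it could be applied with `df.apply(sketch_student_id, axis=1)` in the same way as the imported function.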

In [54]:
# assign() returns a new DataFrame, which avoids SettingWithCopyWarning on the slice
student_id_cols = student_id_cols.assign(**{"Student ID": df.apply(generate_student_id, axis=1)})

In [55]:
# Reassign instead of dropping in place on a slice of df
student_id_cols = student_id_cols.drop(columns=["Full Name"])


In [56]:
student_id_cols

Unnamed: 0,Date of Birth,Gender,Student ID
0,12/22/2011,Female,DK-201112-0
1,4/26/2012,Male,DC-201204-1
2,9/12/2012,Male,JC-201209-1
3,4/10/2012,Male,EC-201204-1
4,10/27/2014,Male,EC-201410-1
5,8/22/2015,Male,LM-201508-1
6,10/17/2014,Male,AL-201410-1
7,6/7/2013,Male,AJ-201306-1
8,4/3/2012,Male,RJL-201204-1
9,9/19/2017,Male,LVH-201709-1


In [57]:
df["Student ID"] = student_id_cols["Student ID"]
df

Unnamed: 0,Timestamp,Full Name,Aug 19,Aug 20,Aug 21,Aug 22,Aug 23,Payment Received,Jersey Sizes,Payment Date,Date of Birth,Age,Gender,School,Payment Instruction Acknowledgement,Injury Liability Waver,Photograph Release Agreement,Additional Comments,Student ID
0,7/16/2024 13:31:46,Deniz Kabadayi,True,True,True,True,True,"$2,000.00",34.0,Aug 24,12/22/2011,12,Female,Dsc International School,Confirm,Confirm,Confirm,,DK-201112-0
1,7/16/2024 14:01:06,"Damian, Chew",True,True,True,True,True,"$2,000.00",36.0,,4/26/2012,12,Male,ESF Island School,Confirm,Confirm,Confirm,,DC-201204-1
2,7/26/2024 13:45:05,Jonas Chiu,True,True,False,True,True,"$1,800.00",34.0,,9/12/2012,11,Male,ESF South Island School,Confirm,Confirm,Confirm,,JC-201209-1
3,7/27/2024 6:34:32,Ethan Cheng,True,True,False,False,True,,36.0,,4/10/2012,12,Male,Hong Kong International School,Confirm,Confirm,Confirm,,EC-201204-1
4,7/27/2024 6:36:31,Elliot Cheng,True,True,True,True,True,,34.0,,10/27/2014,9,Male,Hong Kong International School,Confirm,Confirm,Confirm,,EC-201410-1
5,7/30/2024 4:16:19,Luca Monguillot,True,True,True,False,True,"$1,600.00",32.0,Aug 23,8/22/2015,9,Male,International Montessori School,Confirm,Confirm,Confirm,As discussed with Agnes Luca will attend daily...,LM-201508-1
6,7/30/2024 6:38:05,Adam Lai,True,True,True,True,True,"$2,000.00",34.0,July 30,10/17/2014,9,Male,Hong Kong International School,Confirm,Confirm,,,AL-201410-1
7,7/31/2024 6:49:26,Alexander Junas,True,True,True,True,True,"$2,000.00",32.0,Aug 1,6/7/2013,11,Male,Silvermine Bay School,Confirm,Confirm,Confirm,,AJ-201306-1
8,8/2/2024 11:00:47,Ryan Jenson LAM,False,False,False,False,False,,,,4/3/2012,12,Male,French International School,Confirm,Confirm,Confirm,,RJL-201204-1
9,8/2/2024 11:52:26,Loic VAN HOOF,True,True,True,True,True,"$1,800.00",28.0,Aug 19,9/19/2017,6,Male,Victoria Shanghai Academy,Confirm,Confirm,Confirm,,LVH-201709-1
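
Initials, birth month and gender are not guaranteed to be unique (siblings or classmates could share all three), so it may be worth checking for clashes before relying on the ID. A small sketch on made-up IDs:

```python
import pandas as pd

# Toy IDs (illustrative); the first and last rows clash on purpose.
toy = pd.DataFrame(
    {"Student ID": ["XY-201204-1", "XY-201410-1", "AB-201410-1", "XY-201204-1"]}
)

# keep=False marks every member of a duplicated group, not just the repeats.
clashes = toy[toy["Student ID"].duplicated(keep=False)]
```

An empty `clashes` frame would mean every ID is unique.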


We want Student ID in the second column, so let's move it there.

In [58]:
cols = list(df.columns)
cols.insert(1, cols.pop(cols.index('Student ID')))
df = df[cols]
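
As an aside, the same move can be done in place with `pop` and `insert`, which avoids creating a sliced frame. A sketch on a toy frame (made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame(
    {"Timestamp": ["t"], "Full Name": ["n"], "Student ID": ["XY-200001-0"]}
)

# pop() removes the column and returns it; insert() puts it back at position 1.
toy.insert(1, "Student ID", toy.pop("Student ID"))
```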

Now let's remove the Full Name column, as we no longer need it, together with Additional Comments.

In [59]:
# df is now a slice (df[cols]), so reassign rather than drop in place to avoid SettingWithCopyWarning
df = df.drop(columns=["Full Name", "Additional Comments"])


In [60]:
df.to_csv('data/Anonymized_Data.csv', encoding="utf-8", index=False)
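
Before sharing the file, a quick check that none of the identifying columns slipped through can be useful. The sketch below uses a toy frame and an in-memory buffer rather than the real file; the `IDENTIFYING` set simply repeats the column names removed earlier plus "Full Name".

```python
import io
import pandas as pd

IDENTIFYING = {
    "Full Name", "Email Address", "Nickname", "Main Contact Number",
    "Main Contact E-Mail", "Secondary E-Mail", "Secondary Phone Number",
}

# Toy stand-in for the final frame (illustrative values only).
toy = pd.DataFrame({"Student ID": ["XY-201112-0"], "School": ["Some School"]})

# Round-trip through CSV in memory, as the export step would do on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
exported = pd.read_csv(buffer)

# Any overlap here means an identifying column was exported.
leaked = IDENTIFYING & set(exported.columns)
```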