# Everledger Code Test

Author: Ninad G. Wadekar   
Date: 01/12/2021

In [1]:
# Need xlrd library for reading data and formatting information 
# from Excel files in the historical .xls format.

!pip install xlrd



In [2]:
# Importing necessary libraries

import pandas as pd

# Task 1 - Parse the attached employee__1_.xls file

In [3]:
# Reading employee__1_.xls file and converting it to dataframe

df_employee = pd.read_excel('employee__1_.xls')
print('\n\nAnswer to Task 1:\n', df_employee)



Answer to Task 1:
     Emp ID First Name Last Name Gender     Father's Name       Mother's Name  \
0   677509       Lois    Walker      F     Donald Walker        Helen Walker   
1   940761     Brenda  Robinson      F  Raymond Robinson       Judy Robinson   
2   428945        Joe  Robinson      M    Scott Robinson  Stephanie Robinson   
3   408351      Diane     Evans      F       Jason Evans      Michelle Evans   
4   193819   Benjamin   Russell      M   Gregory Russell   Elizabeth Russell   
..     ...        ...       ...    ...               ...                 ...   
95  639892       Jose      Hill      M       Carlos Hill           Anna Hill   
96  704709     Harold    Nelson      M    Richard Nelson       Pamela Nelson   
97  461593     Nicole      Ward      F        Ralph Ward          Julia Ward   
98  392491    Theresa    Murphy      F     George Murphy   Jacqueline Murphy   
99  495141      Tammy     Young      F      Andrew Young        Brenda Young   

   Mother's Maiden

# Task 2 - Normalize the date fields into a standard format

In [4]:
# 1. Check 1
# Let's first check the shape of the dataframe, the data type of each
# column, and whether any column contains null values.

print('1. Shape of dataframe: ', df_employee.shape)

print('\n2. Datatype of each column in the dataframe:\n',
      df_employee.dtypes)

print('\n3 .Number of null values in each column:\n',
      df_employee.isna().sum())


1. Shape of dataframe:  (100, 16)

2. Datatype of each column in the dataframe:
 Emp ID                   int64
First Name              object
Last Name               object
Gender                  object
Father's Name           object
Mother's Name           object
Mother's Maiden Name    object
Date of Birth           object
Date of Joining         object
Quarter of Joining      object
Place Name              object
County                  object
City                    object
State                   object
Zip                      int64
User Name               object
dtype: object

3 .Number of null values in each column:
 Emp ID                  0
First Name              0
Last Name               0
Gender                  0
Father's Name           0
Mother's Name           0
Mother's Maiden Name    0
Date of Birth           0
Date of Joining         0
Quarter of Joining      0
Place Name              0
County                  0
City                    0
State                   0
Zi

In [5]:
# 2. Normalization
# The dtype listing above shows that Date of Birth and Date of Joining are
# stored as object (string) columns. Let's convert them to datetime64, which
# pandas displays in the standard YYYY-MM-DD format. With errors='coerce',
# any value that cannot be parsed becomes NaT instead of raising an error.

# Normalizing Date of Birth column
df_employee['Date of Birth'] = pd.to_datetime(df_employee['Date of Birth'],
                                              errors='coerce')

# Normalizing Date of Joining column
df_employee['Date of Joining'] = pd.to_datetime(df_employee['Date of Joining'],
                                                errors='coerce')

print('\n\nAnswer to Task 2:\n')
print('Date of Birth and Date of Joining datatype changed:\n', df_employee.dtypes)

print('\n\ndf_employee with normalized date column \n\n', df_employee)



Answer to Task 2:

Date of Birth and Date of Joining datatype changed:
 Emp ID                           int64
First Name                      object
Last Name                       object
Gender                          object
Father's Name                   object
Mother's Name                   object
Mother's Maiden Name            object
Date of Birth           datetime64[ns]
Date of Joining         datetime64[ns]
Quarter of Joining              object
Place Name                      object
County                          object
City                            object
State                           object
Zip                              int64
User Name                       object
dtype: object


df_employee with normalized date column 

     Emp ID First Name Last Name Gender     Father's Name       Mother's Name  \
0   677509       Lois    Walker      F     Donald Walker        Helen Walker   
1   940761     Brenda  Robinson      F  Raymond Robinson       Judy Robinson   
2  
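
Because errors='coerce' hides parsing failures, it can help to count the NaT values after conversion and, if string output is needed, to render the dates explicitly. The following is a minimal sketch on a hypothetical sample series (the raw values and the '%m/%d/%Y' input format are assumptions, since the original column contents are not shown here):

```python
import pandas as pd

# Hypothetical sample values; the real column contents may use another format.
raw = pd.Series(['1/15/1985', '12/3/1990', 'not a date'], name='Date of Birth')

# Parse with an explicit (assumed) format; unparseable entries become NaT.
parsed = pd.to_datetime(raw, format='%m/%d/%Y', errors='coerce')

# Count the rows that failed to parse before relying on the column.
n_failed = int(parsed.isna().sum())

# Render as YYYY-MM-DD strings; NaT entries come out as missing values.
normalized = parsed.dt.strftime('%Y-%m-%d')

print('Unparseable rows:', n_failed)
print(normalized.tolist())
```

If the count is non-zero on the real data, the affected rows would need to be inspected before grouping or sorting on the date columns.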

# Task 3 - Group the employee list based on the field "Quarter of Joining" and sorted by the field "Date of Birth" and print as dictionary {Q1 : [emp1, emp2, ...]}

In [6]:
# 1. Check 1
# Since we need to group by 'Quarter of Joining', let's first check the unique
# values of this column. This also reveals any incorrect values.

print('Unique values in Quarter of Joining column:\n',
      df_employee['Quarter of Joining'].unique())

# Only Q1-Q4 appear, so the values in Quarter of Joining are consistent.

Unique values in Quarter of Joining column:
 ['Q4' 'Q3' 'Q2' 'Q1']
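
The unique-value check confirms that the labels are well formed, but not that each label agrees with the joining date. A possible cross-check, sketched here on hypothetical toy rows and assuming that "Quarter of Joining" means the calendar quarter of "Date of Joining", derives the quarter from the date and compares:

```python
import pandas as pd

# Hypothetical toy rows; the last label is deliberately wrong.
toy = pd.DataFrame({
    'Date of Joining': pd.to_datetime(['2010-02-14', '2015-08-01', '2018-11-30']),
    'Quarter of Joining': ['Q1', 'Q3', 'Q3'],
})

# Derive the calendar quarter (1-4) from the date and format it as 'Q1'..'Q4'.
derived = 'Q' + toy['Date of Joining'].dt.quarter.astype(str)

# Rows where the stored label disagrees with the derived one.
mismatches = toy[derived != toy['Quarter of Joining']]

print('Number of mismatched rows:', len(mismatches))
```

If the company uses a fiscal year that does not start in January, the derived labels would need to be shifted accordingly.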


In [7]:
# 2. Check 2
# Since we need to display employee names, let's check whether
# more than one employee shares the same First Name.
print('Results show if First Name is same for multiple employees: \n',
      df_employee['First Name'].value_counts())

Results show if First Name is same for multiple employees: 
 Nancy       4
Judy        2
Margaret    2
Brenda      2
Cheryl      2
           ..
Douglas     1
Roger       1
Matthew     1
Kelly       1
Joyce       1
Name: First Name, Length: 79, dtype: int64


In [8]:
# We can see from the results above that some employees share the same
# first name. To improve the chance of uniquely identifying employees
# within a Quarter in the final results, we combine First Name and
# Last Name into a new column called Full Name.

df_employee['Full Name'] = (df_employee['First Name'] + ' '
                            + df_employee['Last Name'])

print('\n Dataframe with new column Full Name:\n',df_employee)


 Dataframe with new column Full Name:
     Emp ID First Name Last Name Gender     Father's Name       Mother's Name  \
0   677509       Lois    Walker      F     Donald Walker        Helen Walker   
1   940761     Brenda  Robinson      F  Raymond Robinson       Judy Robinson   
2   428945        Joe  Robinson      M    Scott Robinson  Stephanie Robinson   
3   408351      Diane     Evans      F       Jason Evans      Michelle Evans   
4   193819   Benjamin   Russell      M   Gregory Russell   Elizabeth Russell   
..     ...        ...       ...    ...               ...                 ...   
95  639892       Jose      Hill      M       Carlos Hill           Anna Hill   
96  704709     Harold    Nelson      M    Richard Nelson       Pamela Nelson   
97  461593     Nicole      Ward      F        Ralph Ward          Julia Ward   
98  392491    Theresa    Murphy      F     George Murphy   Jacqueline Murphy   
99  495141      Tammy     Young      F      Andrew Young        Brenda Young   
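
Combining first and last names reduces collisions but does not guarantee uniqueness. A small sketch of how one might test for duplicate full names and fall back to a label that includes Emp ID (the toy rows and the `labels` format are hypothetical, not part of the original data):

```python
import pandas as pd

# Hypothetical toy rows with a deliberate full-name collision.
toy = pd.DataFrame({
    'Emp ID': [1, 2, 3],
    'First Name': ['Nancy', 'Nancy', 'Paul'],
    'Last Name': ['Baker', 'Baker', 'Cooper'],
})
toy['Full Name'] = toy['First Name'] + ' ' + toy['Last Name']

# keep=False marks every member of a duplicated group, not just the repeats.
dupes = toy[toy['Full Name'].duplicated(keep=False)]

# A label that stays unique as long as Emp ID is unique.
labels = toy['Full Name'] + ' (' + toy['Emp ID'].astype(str) + ')'

print(dupes[['Emp ID', 'Full Name']])
print(labels.tolist())
```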


In [9]:
# 3. Sorting and Grouping
# Ref: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html
# groupby preserves the original row order within each group, so we can
# sort by Date of Birth first (descending, i.e. youngest employee first),
# then group by Quarter of Joining and convert the result to a dictionary.

grouped_sorted_emp_dict = (
    df_employee.sort_values('Date of Birth', ascending=False)
    .groupby('Quarter of Joining')['Full Name']
    .apply(list)
    .to_dict()
)

print(grouped_sorted_emp_dict)

{'Q1': ['Jack Alexander', 'Melissa Butler', 'Matthew Turner', 'Ann Cooper', 'Theresa Lee', 'Sharon Lopez', 'Martha Washington', 'Daniel Cooper', 'Ernest Martinez', 'Larry Miller', 'Diana Peterson', 'Steven Phillips', 'Ernest Washington', 'Cynthia Ramirez', 'Phillip White', 'Deborah Smith', 'Debra Wood', 'Roger Roberts', 'William Hernandez', 'Nicole Ward', 'Elizabeth Jackson', 'Cynthia White', 'Nancy Howard', 'Julia Scott', 'Judy Gonzales', 'Carol Murphy'], 'Q2': ['Pamela Wright', 'Antonio Roberts', 'Carol Edwards', 'Todd Hall', 'Linda Moore', 'Andrea Garcia', 'Henry Jenkins', 'Amanda Hughes', 'Lillian Brown', 'Margaret Allen', 'Amy Howard', 'Tammy Young', 'Alan Rivera', 'Diane Evans', 'Ralph Flores', 'Carl Collins', 'Thomas Lewis', 'Jeremy Sanchez', 'Mary Bryant', 'Rebecca Stewart', 'Joyce Jenkins', 'Frances Young', 'Paul Watson'], 'Q3': ['Wayne Watson', 'Nancy Baker', 'Jose Hill', 'Gregory Edwards', 'Richard Mitchell', 'Ryan Alexander', 'Margaret Brooks', 'Paul Cooper', 'Roy Griffin',
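
The sort-then-group pattern is easier to follow on a tiny example. This sketch uses four hypothetical employees (names 'A' to 'D' and their dates are made up) and passes kind='stable' so that ties on Date of Birth would keep their original order:

```python
import pandas as pd

# Hypothetical toy data: two employees per quarter.
toy = pd.DataFrame({
    'Full Name': ['A', 'B', 'C', 'D'],
    'Quarter of Joining': ['Q2', 'Q1', 'Q2', 'Q1'],
    'Date of Birth': pd.to_datetime(
        ['1980-01-01', '1990-01-01', '1985-01-01', '1970-01-01']),
})

# Sort youngest first, then group; row order within each group is preserved.
result = (
    toy.sort_values('Date of Birth', ascending=False, kind='stable')
    .groupby('Quarter of Joining')['Full Name']
    .apply(list)
    .to_dict()
)

print(result)  # {'Q1': ['B', 'D'], 'Q2': ['C', 'A']}
```

Switching to ascending=True would list the oldest employee first in each quarter instead.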

In [10]:
# 4. Verification of results by an alternate method

# Let's verify the results above by comparing the Q1 list with a list
# obtained by an alternate method.

# For verification, df_employee is filtered to rows whose Quarter of Joining
# is Q1 and sliced to keep only the Full Name and Date of Birth columns. The
# sliced dataframe is then sorted in descending order of Date of Birth. This
# should yield the same Q1 list as above.

verification_df = (
    df_employee.loc[df_employee['Quarter of Joining'] == 'Q1',
                    ['Full Name', 'Date of Birth']]
    .sort_values('Date of Birth', ascending=False)
)

name_list_for_verification = list(verification_df['Full Name'])


In [11]:
# Let's check whether the Q1 list in grouped_sorted_emp_dict matches
# name_list_for_verification
if name_list_for_verification == grouped_sorted_emp_dict['Q1']:
    print('\nLists are same. Hence, results are correct and verified.')
else:
    print('\nLists are different. Results might be incorrect.')


Lists are same. Hence, results are correct and verified.
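
The check above covers Q1 only. The same filter-then-sort comparison could be repeated for every quarter; the helper below is a hypothetical sketch (`mismatched_quarters` and the toy data are not part of the original notebook) that returns the quarters whose lists disagree:

```python
import pandas as pd

def mismatched_quarters(df, grouped):
    """Return the quarters whose list in `grouped` differs from filter-then-sort."""
    bad = []
    for quarter, names in grouped.items():
        expected = (
            df[df['Quarter of Joining'] == quarter]
            .sort_values('Date of Birth', ascending=False, kind='stable')
            ['Full Name']
            .tolist()
        )
        if expected != names:
            bad.append(quarter)
    return bad

# Hypothetical toy data and two candidate groupings.
toy = pd.DataFrame({
    'Full Name': ['A', 'B', 'C', 'D'],
    'Quarter of Joining': ['Q2', 'Q1', 'Q2', 'Q1'],
    'Date of Birth': pd.to_datetime(
        ['1980-01-01', '1990-01-01', '1985-01-01', '1970-01-01']),
})
grouped_ok = {'Q1': ['B', 'D'], 'Q2': ['C', 'A']}
grouped_bad = {'Q1': ['D', 'B'], 'Q2': ['C', 'A']}

print(mismatched_quarters(toy, grouped_ok))   # []
print(mismatched_quarters(toy, grouped_bad))  # ['Q1']
```

On the real data, the call would presumably be `mismatched_quarters(df_employee, grouped_sorted_emp_dict)`, with an empty list indicating agreement for all quarters.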


In [12]:
# 5. Displaying results
# Displaying dictionary in more readable format

print('\n\nAnswer to Task 3:\n')
for k, v in grouped_sorted_emp_dict.items():
    print(k, ':', v, '\n')



Answer to Task 3:

Q1 : ['Jack Alexander', 'Melissa Butler', 'Matthew Turner', 'Ann Cooper', 'Theresa Lee', 'Sharon Lopez', 'Martha Washington', 'Daniel Cooper', 'Ernest Martinez', 'Larry Miller', 'Diana Peterson', 'Steven Phillips', 'Ernest Washington', 'Cynthia Ramirez', 'Phillip White', 'Deborah Smith', 'Debra Wood', 'Roger Roberts', 'William Hernandez', 'Nicole Ward', 'Elizabeth Jackson', 'Cynthia White', 'Nancy Howard', 'Julia Scott', 'Judy Gonzales', 'Carol Murphy'] 

Q2 : ['Pamela Wright', 'Antonio Roberts', 'Carol Edwards', 'Todd Hall', 'Linda Moore', 'Andrea Garcia', 'Henry Jenkins', 'Amanda Hughes', 'Lillian Brown', 'Margaret Allen', 'Amy Howard', 'Tammy Young', 'Alan Rivera', 'Diane Evans', 'Ralph Flores', 'Carl Collins', 'Thomas Lewis', 'Jeremy Sanchez', 'Mary Bryant', 'Rebecca Stewart', 'Joyce Jenkins', 'Frances Young', 'Paul Watson'] 

Q3 : ['Wayne Watson', 'Nancy Baker', 'Jose Hill', 'Gregory Edwards', 'Richard Mitchell', 'Ryan Alexander', 'Margaret Brooks', 'Paul Coop