### Prepping Data Challenge: TC22 Session Attendance (Week 20)
There are 4 input datasets for this challenge:
 - Registrations
 - Sessions Lookup Table
 - Online Attendees
 - In Person Attendees
 
### Requirements
- Input the data
- In the Registrations Input, tidy up the Online/In Person field 
- From the Email field, extract the company name 
   - We define the company name as being the text following the @ symbol, up to the `.`
- Count the number of sessions each registered person is planning to attend 
- Join on the Session Lookup table to replace the Session ID with the Session name 
- Join the In Person Attendees dataset to the cleaned Registrations
   - You will need multiple join clauses
   - Think about the Join Type: we only want to return the names of those that did not attend the sessions they registered for
- Filter to only include those who registered to be In Person 
- Join the Online Attendees dataset to the cleaned Registrations
   - You will need multiple join clauses
   - Think about the Join Type: we only want to return the names of those that did not attend the sessions they registered for
- Filter to only include those who registered to be Online
- Union together these separate streams to get a complete list of those who were unable to attend the sessions they registered for 
- Count the number of sessions each person was unable to attend 
- Calculate the % of sessions each person was unable to attend 
  - Round this to 2 decimal places
- Remove unnecessary fields 
- Output the data

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Input the data.
with pd.ExcelFile("TC22 Input.xlsx") as xlsx:
    register = pd.read_excel(xlsx, 'Registrations')
    sess = pd.read_excel(xlsx, 'Sessions')
    online = pd.read_excel(xlsx, 'Online Attendees')
    inperson = pd.read_excel(xlsx, 'In Person Attendees')

In [3]:
register.head()

Unnamed: 0,First Name,Last Name,Email,Online/In Person,Session ID
0,Aaron,Arthey,aarthey47@de.vu,Online,5
1,Aaron,Scahill,ascahillh9@exblog.jp,Online,1
2,Abagael,Simants,asimants4g@sciencedirect.com,Online,1
3,Abagael,Simants,asimants4g@sciencedirect.com,Online,4
4,Abagael,Simants,asimants4g@sciencedirect.com,Online,5


In [4]:
sess.head()

Unnamed: 0,Session ID,Session
0,1,Keynote
1,2,Speed Tipping
2,3,Iron Viz
3,4,Prep Tips & Tricks
4,5,Devs on Stage


In [5]:
online.head(7)

Unnamed: 0,Session,Email
0,Keynote,ascahillh9@exblog.jp
1,Keynote,asimants4g@sciencedirect.com
2,Keynote,ahenningtoni9@theguardian.com
3,Keynote,asquibbscj@mashable.com
4,Keynote,abartolh8@squarespace.com
5,Keynote,aperschkejo@delicious.com
6,Keynote,ashawcroftdq@digg.com


In [6]:
inperson.head()

Unnamed: 0,Session,First Name,Last Name
0,Keynote,Abbey,Paten
1,Keynote,Abbie,Croy
2,Keynote,Abigael,Beton
3,Keynote,Adams,Lynds
4,Keynote,Adela,Tooting


In [7]:
#In the Registrations Input, tidy up the Online/In Person field
register["Online/In Person"].unique()

array(['Online', 'ONLINE', 'IN PERSON', 'In Person', 'Onlyne', 'Onlin',
       'In Persn', 'In Persoon', 'Im Person'], dtype=object)

In [8]:
# Map a regex for each family of misspellings to its canonical label.
spellcheck = {r'^O.*': 'Online', r'^I.*': 'In Person'}

register["Online/In Person"] = register["Online/In Person"].replace(spellcheck, regex=True)

In [9]:
register["Online/In Person"].unique()

array(['Online', 'In Person'], dtype=object)
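
The clean-up above relies on anchored regular expressions: any value starting with `O` is treated as "Online" and any value starting with `I` as "In Person". A minimal sketch on made-up values (the `raw` series is hypothetical, modelled on the misspellings listed earlier) shows the idea; it assumes no other category ever starts with those letters.

```python
import pandas as pd

# Hypothetical sample mirroring the kinds of misspellings seen above.
raw = pd.Series(['Online', 'ONLINE', 'Onlyne', 'Onlin',
                 'IN PERSON', 'In Persn', 'Im Person'])

# Each key is a regex matched against the whole value; each value is the
# canonical label that replaces any match.
tidy = raw.replace({r'^O.*': 'Online', r'^I.*': 'In Person'}, regex=True)

labels = sorted(tidy.unique())
print(labels)
```

Matching on the first letter only is convenient here, but it would silently mislabel any unrelated value that happened to begin with `O` or `I`, so it is worth re-checking `unique()` afterwards, as the notebook does.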

In [10]:
#From the Email field, extract the company name 
# We define the company name as being the text following the @ symbol, up to the `.`
# A raw string avoids an invalid-escape warning; the greedy .* captures up to the last dot.
register['Company'] = register['Email'].str.extract(r'@(.*)\.')
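
Because `.*` is greedy, the pattern used here captures everything up to the last dot in the domain, which is why values such as `ox.ac` and `miitbeian.gov` appear in the Company column later on. The sketch below compares that behaviour with an alternative pattern that stops at the first dot. The `emails` series is a two-row sample taken from addresses shown in this notebook, and the second pattern is only a possible reading of the requirement, not necessarily the challenge's intended one.

```python
import pandas as pd

emails = pd.Series(['asimants4g@sciencedirect.com', 'akift8a@ox.ac.uk'])

# Greedy: captures from the @ up to the LAST dot in the address.
greedy = emails.str.extract(r'@(.*)\.', expand=False)

# Alternative (an assumption about the intent): stop at the FIRST dot.
first_label = emails.str.extract(r'@([^.]+)\.', expand=False)

print(greedy.tolist())       # ['sciencedirect', 'ox.ac']
print(first_label.tolist())  # ['sciencedirect', 'ox']
```

The two patterns agree for single-suffix domains such as `.com` and differ only for multi-part suffixes such as `.ac.uk`.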

In [11]:
#Count the number of sessions each registered person is planning to attend 
register['no of sessions'] = register.groupby(['First Name','Last Name','Email'])['Session ID'].transform('count')
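
`transform('count')` differs from a plain `groupby(...).count()` in that it returns one value per original row rather than one per group, so the count can be assigned straight back as a new column. A small sketch with invented data (the `reg` frame and its e-mail addresses are hypothetical):

```python
import pandas as pd

reg = pd.DataFrame({
    'Email': ['a@x.com', 'a@x.com', 'a@x.com', 'b@y.com'],
    'Session ID': [1, 4, 5, 1],
})

# The group size is broadcast back onto every row of that group,
# so the result lines up with reg's index.
reg['no of sessions'] = reg.groupby('Email')['Session ID'].transform('count')

counts = reg['no of sessions'].tolist()
print(counts)  # [3, 3, 3, 1]
```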

In [12]:
#Join on the Session Lookup table to replace the Session ID with the Session name 
sess_dict = dict(zip(sess['Session ID'],sess['Session']))
register['Session ID'] = register['Session ID'].map(sess_dict)                                     

In [13]:
register.rename(columns = {'Session ID':'Session Name'}, inplace=True)   
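
The lookup above is done with a dictionary and `map`, which behaves like a left join on a single key. For comparison, here is a sketch of the same step written as an explicit `merge`; the `sessions` and `reg` frames are small invented stand-ins for the real inputs.

```python
import pandas as pd

sessions = pd.DataFrame({'Session ID': [1, 2],
                         'Session': ['Keynote', 'Speed Tipping']})
reg = pd.DataFrame({'Email': ['a@x.com', 'b@y.com', 'b@y.com'],
                    'Session ID': [1, 1, 2]})

# A left join keeps every registration row, then the ID column is
# dropped and the looked-up name takes its place.
joined = (reg.merge(sessions, on='Session ID', how='left')
             .drop(columns='Session ID')
             .rename(columns={'Session': 'Session Name'}))

names = joined['Session Name'].tolist()
print(names)  # ['Keynote', 'Keynote', 'Speed Tipping']
```

Either approach leaves unmatched IDs as `NaN`; `map` is shorter when only one column is being replaced.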

In [14]:
#Filter to only include those who registered to be In Person
re_inperson = register[register['Online/In Person'] == 'In Person']

In [15]:
#Join the In Person Attendees dataset to the cleaned Registrations
# You will need multiple join clauses
# Think about the Join Type, we only want to return the names of those that did not attend the sessions they registered for
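# The attendee list has no Email, so match on both name fields plus the session.
df = re_inperson.merge(inperson, how='outer', left_on=['First Name','Last Name','Session Name'], 
                    right_on = ['First Name','Last Name','Session'])
# Registrations with no matching attendee row have NaN in 'Session'; keep only those.
not_able = df[df['Session'].isna()]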

In [16]:
not_able.head()

Unnamed: 0,First Name,Last Name,Email,Online/In Person,Session Name,Company,no of sessions,Session
24,Adolph,Ikin,aikin2u@cdbaby.com,In Person,Prep Tips & Tricks,cdbaby,3,
27,Adolph,Sarver,asarveree@miitbeian.gov.cn,In Person,Iron Viz,miitbeian.gov,3,
30,Adolpho,Jean,ajean5w@statcounter.com,In Person,Prep Tips & Tricks,statcounter,2,
71,Aleda,Rolls,arollsnr@lulu.com,In Person,Prep Tips & Tricks,lulu,1,
86,Alexio,Tytherton,atythertonf8@gizmodo.com,In Person,Speed Tipping,gizmodo,4,
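
Keeping only the rows with no match on the other side is often called an anti-join. The notebook does this by checking for `NaN` in the joined `Session` column; a possible alternative, sketched below on invented data, is a left join with `indicator=True`, which adds a `_merge` column saying where each row came from. The `registered` and `attended` frames are hypothetical.

```python
import pandas as pd

registered = pd.DataFrame({
    'First Name': ['Ann', 'Ann', 'Bob'],
    'Last Name': ['Lee', 'Lee', 'Ray'],
    'Session Name': ['Keynote', 'Iron Viz', 'Keynote'],
})
attended = pd.DataFrame({
    'Session': ['Keynote', 'Keynote'],
    'First Name': ['Ann', 'Bob'],
    'Last Name': ['Lee', 'Ray'],
})

# Left join on name + session; _merge is 'left_only' for registrations
# that have no matching attendance record.
merged = registered.merge(
    attended,
    how='left',
    left_on=['First Name', 'Last Name', 'Session Name'],
    right_on=['First Name', 'Last Name', 'Session'],
    indicator=True,
)
missed = merged.loc[merged['_merge'] == 'left_only',
                    ['First Name', 'Last Name', 'Session Name']]

print(missed.values.tolist())  # [['Ann', 'Lee', 'Iron Viz']]
```

The indicator column makes the intent explicit and does not depend on the right-hand column being free of missing values.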


In [17]:
#Filter to only include those who registered to be Online
re_online = register[register['Online/In Person'] == 'Online']
re_online.head()

Unnamed: 0,First Name,Last Name,Email,Online/In Person,Session Name,Company,no of sessions
0,Aaron,Arthey,aarthey47@de.vu,Online,Devs on Stage,de,1
1,Aaron,Scahill,ascahillh9@exblog.jp,Online,Keynote,exblog,1
2,Abagael,Simants,asimants4g@sciencedirect.com,Online,Keynote,sciencedirect,3
3,Abagael,Simants,asimants4g@sciencedirect.com,Online,Prep Tips & Tricks,sciencedirect,3
4,Abagael,Simants,asimants4g@sciencedirect.com,Online,Devs on Stage,sciencedirect,3


In [18]:
#Join the Online Attendees dataset to the cleaned Registrations
# You will need multiple join clauses
# Think about the Join Type, we only want to return the names of those that did not attend the sessions they registered for
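# Online attendees are identified by Email, so match on Email plus the session.
df2 = re_online.merge(online, how='outer', left_on=['Email','Session Name'], 
                    right_on = ['Email','Session'])
# Registrations with no matching attendee row have NaN in 'Session'; keep only those.
not_able2 = df2[df2['Session'].isna()]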

In [19]:
not_able2.head()

Unnamed: 0,First Name,Last Name,Email,Online/In Person,Session Name,Company,no of sessions,Session
3,Abagael,Simants,asimants4g@sciencedirect.com,Online,Prep Tips & Tricks,sciencedirect,3,
19,Abbye,Armytage,aarmytagemy@drupal.org,Online,Prep Tips & Tricks,drupal,4,
37,Abner,Vassano,avassanofw@sitemeter.com,Online,Prep Tips & Tricks,sitemeter,2,
39,Abra,MacKay,amackay4f@businesswire.com,Online,Speed Tipping,businesswire,3,
73,Adler,Kift,akift8a@ox.ac.uk,Online,Speed Tipping,ox.ac,4,


In [20]:
#Union together these separate streams to get a complete list of those 
#who were unable to attend the sessions they registered for 
output = pd.concat([not_able,not_able2],ignore_index=True)

In [21]:
#Count the number of sessions each person was unable to attend 
output['no of sessions not attended'] = output.groupby(['First Name','Last Name','Email'])['Session Name'].transform('count')

In [22]:
#Calculate the % of sessions each person was unable to attend 
# Round this to 2 decimal places
output["Not Attended %"] = ((output['no of sessions not attended']/output['no of sessions'])*100).round(2)
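
As a worked example of the percentage step: someone who registered for 3 sessions and missed 1 gets 1 / 3 × 100 = 33.33, and someone who registered for 4 and missed 2 gets 50.0. The sketch below reproduces that arithmetic on an invented `missed` frame, where each row is one missed session and `no of sessions` carries the registration count computed earlier.

```python
import pandas as pd

# One row per missed session (hypothetical data).
missed = pd.DataFrame({
    'Email': ['a@x.com', 'b@y.com', 'b@y.com'],
    'Session Name': ['Iron Viz', 'Keynote', 'Iron Viz'],
    'no of sessions': [3, 4, 4],
})

# Count missed sessions per person, broadcast back to each row.
missed['no of sessions not attended'] = (
    missed.groupby('Email')['Session Name'].transform('count')
)

# Share of registered sessions that were missed, as a percentage.
missed['Not Attended %'] = (
    missed['no of sessions not attended'] / missed['no of sessions'] * 100
).round(2)

pct = missed['Not Attended %'].tolist()
print(pct)  # [33.33, 50.0, 50.0]
```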

In [23]:
output.rename(columns = {'Session Name':'Session not attended'}, inplace=True) 

In [24]:
#Remove unnecessary fields
output = output[['Company','First Name','Last Name','Online/In Person','Session not attended',"Not Attended %"]]

In [25]:
output.head(10)

Unnamed: 0,Company,First Name,Last Name,Online/In Person,Session not attended,Not Attended %
0,cdbaby,Adolph,Ikin,In Person,Prep Tips & Tricks,33.33
1,miitbeian.gov,Adolph,Sarver,In Person,Iron Viz,33.33
2,statcounter,Adolpho,Jean,In Person,Prep Tips & Tricks,50.0
3,lulu,Aleda,Rolls,In Person,Prep Tips & Tricks,100.0
4,gizmodo,Alexio,Tytherton,In Person,Speed Tipping,50.0
5,gizmodo,Alexio,Tytherton,In Person,Iron Viz,50.0
6,networksolutions,Allayne,Kibard,In Person,Keynote,40.0
7,networksolutions,Allayne,Kibard,In Person,Iron Viz,40.0
8,instagram,Ally,Brownill,In Person,Devs on Stage,50.0
9,google,Amory,Dracksford,In Person,Keynote,33.33


In [26]:
# Output the data
output.to_excel('wk20-output.xlsx', index=False)