## Table of Contents
- [Importing Packages](#Importing-Packages)
- [Webscraping Data](#Webscraping-Data)
- [Exploring Data](#Exploring-Data)
- [Output Files](#Output-Files)
- [Conclusion](#Conclusion)

## Importing Packages

In [92]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

## Webscraping Data

In [93]:
# target webpage
url = "https://www.dataiku.com/company/team/"

# Establishing the connection to the web page:
response = requests.get(url)

# The status code tells us how the target server responded to the request,
# e.g., 200 = OK, 400 = Bad Request, 403 = Forbidden, 404 = Not Found.
print(response.status_code)

# response.text gives the response body (the page's HTML) decoded as a Python string.
html = response.text

200
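
The status check above can be made a little more explicit. The sketch below uses only the standard library's `http.HTTPStatus` to turn a numeric code into a readable label; `describe_status` and `is_success` are hypothetical helper names introduced here for illustration and are not part of the original notebook. With `requests`, calling `response.raise_for_status()` is another way to stop early on a 4xx or 5xx response.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a label such as '404 Not Found' for an HTTP status code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not know about.
        return f"{code} Unknown"


def is_success(code: int) -> bool:
    """True for any 2xx status code."""
    return 200 <= code < 300


# Print labels for the codes mentioned in the cell above.
for code in (200, 400, 403, 404):
    print(describe_status(code), "| success:", is_success(code))
```

In the scraping cell, this could be used as `print(describe_status(response.status_code))` before reading `response.text`.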


In [94]:
# creating a BeautifulSoup object from the HTML using the lxml parser (BeautifulSoup is a Python library for pulling data out of HTML files)
soup = BeautifulSoup(html, 'lxml')

In [95]:
# team-listing div contains the entire team so we will store this as div
div = soup.find('div', {'class': 'team-listing'})

Below we iterate through all team cards, extract each team member's name and role, and store the results in a dataframe.

In [96]:
# creating a list of people
people = []

# iterating through each team-card
for row in div.find_all('div', 'team-card'):
    
    # creating a dictionary for this person
    person = {}
    # their names are listed in the 'h5' tags
    person['name'] = row.find('h5').text
    # their roles are listed in the 'p' tags
    # reading .text directly (as for the name) raised an error on a None type, meaning at least one card has no 'p' tag, so we check first
    role_tag = row.find('p')
    if role_tag:
        person['role'] = role_tag.text
    
    # adding this person's records to the list of people
    people.append(person)
    
# creating a dataframe from this list of people
team_listing = pd.DataFrame(people)
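
To illustrate why the `if` guard is needed, here is a minimal sketch on a small, made-up HTML snippet (the names and roles are placeholders, not scraped data). It assumes the same `team-listing` / `team-card` structure as above and uses Python's built-in `html.parser` instead of `lxml` so it runs without extra dependencies. `find` returns `None` when a tag is absent, so reading `.text` without a check would raise an `AttributeError`.

```python
from bs4 import BeautifulSoup

# Made-up sample: the second card has no <p> tag, like the record with a missing role.
sample_html = """
<div class="team-listing">
  <div class="team-card"><h5>Person A.</h5><p>Example Role</p></div>
  <div class="team-card"><h5>Person B.</h5></div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
listing = soup.find("div", {"class": "team-listing"})

people = []
for card in listing.find_all("div", "team-card"):
    name_tag = card.find("h5")
    role_tag = card.find("p")  # None when the card has no <p> tag
    people.append({
        "name": name_tag.get_text(strip=True) if name_tag else None,
        "role": role_tag.get_text(strip=True) if role_tag else None,
    })

print(people)
```

Storing an explicit `None` for the missing role, rather than omitting the key, produces the same null value once the list is passed to `pd.DataFrame`.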

## Exploring Data

In [97]:
# checking first 5 rows of dataframe
team_listing.head()

|   | name | role |
|---|------|------|
| 0 | Chia Y. | Sales Engineer |
| 1 | Hylke V. | Sales Director Netherlands |
| 2 | Marieme D. | Enterprise Customer Success Manager |
| 3 | Michael G. | Community Manager |
| 4 | Maïté L. | Technical Recruiter |


In [98]:
# checking number of rows and columns
team_listing.shape

(376, 2)

In [99]:
# checking information on this dataframe
team_listing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 376 entries, 0 to 375
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    376 non-null    object
 1   role    375 non-null    object
dtypes: object(2)
memory usage: 6.0+ KB


In [100]:
# checking null values, we found 1 for role
team_listing.isnull().sum()

name    0
role    1
dtype: int64

In [101]:
# identifying the record without a role
team_listing[team_listing['role'].isnull()]

|     | name | role |
|-----|------|------|
| 355 | Abby Z. | NaN |


In [102]:
# checking data types
team_listing.dtypes

name    object
role    object
dtype: object

In [103]:
# Checking the number of unique names
team_listing['name'].nunique()

362

In [104]:
# Checking the number of unique roles
team_listing['role'].nunique()

170

Below we check for duplicated names, based on first name and last initial.

In [105]:
# Getting counts of name column, sorted in descending order
team_listing.pivot_table(index=['name'], aggfunc='size').sort_values(ascending = False).head(14)

name
Kyle A.          2
Christiaan B.    2
Terry O.         2
Emilie C.        2
Nicolas O.       2
Tara C.          2
Ned M.           2
Liam S.          2
Aurélien V.      2
Cooper J.        2
Mike M.          2
Deborah G.       2
Ina J.           2
Louis B.         2
dtype: int64
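
As a cross-check, 376 rows minus 362 unique names leaves 14 surplus rows, which matches the 14 names that appear twice. The sketch below shows the same check with `value_counts`, an alternative to the `pivot_table` call above, on a small made-up series (the names are placeholders, not scraped data).

```python
import pandas as pd

# Made-up sample names; "Person A." appears twice.
names = pd.Series(["Person A.", "Person B.", "Person A.", "Person C."], name="name")

# Count occurrences of each name and keep only those seen more than once.
counts = names.value_counts()
repeated = counts[counts > 1]

# Rows beyond the first occurrence of each name.
surplus_rows = len(names) - names.nunique()

print(repeated)
print("surplus rows:", surplus_rows)
```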

In [106]:
# Checking duplicates for the combination of names and roles
team_listing[team_listing.duplicated()]

|     | name | role |
|-----|------|------|
| 40  | Mike M. | Account Executive |
| 66  | Ina J. | Marketing Enablement Specialist |
| 81  | Ned M. | DSS Training Specialist |
| 161 | Cooper J. | Business Development Specialist |
| 211 | Liam S. | Business Development Specialist |
| 216 | Terry O. | Corporate Controller |
| 240 | Tara C. | Data Science Technical Writer |
| 241 | Deborah G. | Senior Account Executive |
| 271 | Aurélien V. | Implementation Manager |
| 275 | Louis B. | Sales Development Representative |


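Of the 14 repeated names, 10 are exact duplicates on both name and role, as listed above. Emilie C. and Nicolas O. each turn out to be two different people who share the same first name and last initial.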
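
The difference between a repeated name and an exact duplicate can be sketched with `DataFrame.duplicated`. The example below uses made-up data (placeholder names and roles), and `needs_review` is a hypothetical variable introduced here to flag names that repeat with differing roles, which are the cases that need a manual look.

```python
import pandas as pd

# Made-up sample: "Person A." is an exact duplicate, "Person B." repeats with two roles.
sample = pd.DataFrame({
    "name": ["Person A.", "Person A.", "Person B.", "Person B.", "Person C."],
    "role": ["Role X", "Role X", "Role Y", "Role Z", "Role X"],
})

# Every row whose name appears more than once (keep=False marks all occurrences).
same_name = sample[sample.duplicated(subset="name", keep=False)]

# Rows that repeat an earlier row on both name and role.
exact_dupes = sample[sample.duplicated()]

# Names that repeat with more than one distinct role: candidates for manual review.
roles_per_name = same_name.groupby("name")["role"].nunique()
needs_review = roles_per_name[roles_per_name > 1].index.tolist()

print(exact_dupes)
print("needs review:", needs_review)
```

A repeated name with differing roles may be two different people or one person listed under two role descriptions; the data alone cannot tell these apart.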

## Output Files

In [107]:
# creating a csv file of the full list, do not include the index
team_listing.to_csv('../data/full_team_list.csv', index = False)

In [108]:
# creating a csv file of the duplicated members list, do not include the index
team_listing[team_listing.duplicated()].to_csv('../data/duplicated_list.csv', index = False)

## Conclusion

We identified 1 employee, Abby Z., a Talent Acquisition Specialist, who does not have a role assigned on the team website: https://www.dataiku.com/company/team/. We also identified 12 employees whose records are duplicated, with one photo being a zoomed-in version of the other. In some cases the first instance is zoomed in; in others it is the second. We recommend updating the site so that all roles are populated and duplicate records are removed, although which record to remove is up to Dataiku, as all photos look beautiful.