## Q3: LSE Department of Statistics (35 marks)


For this question, we want to collect some data about LSE Statistics academic staff via web scraping, and find out the distribution of different types of staff, the gender balance, etc.

Do the following:

1. [Data collection via web scraping] Use web scraping with Python modules `requests` and `BeautifulSoup` to extract the information about the academic staff of the Department of Statistics from https://www.lse.ac.uk/Statistics/People (and the linked pages). For simplicity, we only consider:
    > * All the staff under the tab "Academic faculty" 
    > * Guest lecturers/teachers and LSE fellows under the tab "Academic associates - Emeritus Professors, Guest Lecturers and Visiting Staff" 
        * i.e. we DO NOT consider any of the emeritus or visiting staff, as they are less involved and/or only joining for a short period

    For each staff fulfilling the requirements above, do the following:
    
    * (a) Get the URL of the staff's page from "More information". Example:
    <img src="figs/people.png" width="400"/>
    
    For example, for Dr. James Abdey, the link is https://www.lse.ac.uk/Statistics/People/Dr-James-Abdey
        * You may get some shorter version of the path first (like `Statistics/People/Dr-James-Abdey`), but you can construct the full URL from the short version
    
    * (b) With the URLs collected from (a), use web scraping with Python modules `requests` and `BeautifulSoup` to scrape the following information and store the result in a `pd.DataFrame` with 3 columns:
        * Name of the staff (e.g. James Abdey, with or without the title prefix)
        * Job title (e.g. Associate Professorial Lecturer)
        * Gender - depending on the page, you may be able to deduce the gender of the staff from the content of "About me"
            * You may want to use `np.nan` for staff whose gender cannot be determined automatically
        
        <img src="figs/james.png" width="400"/>

2. [Data exploration] Calculate the descriptive statistics on title and gender. Based on the descriptive statistics _only_, answer the following:
    * Are there any issues with the data collected? If yes, what are they?
    * Is the gender ratio of academic staff close to 50:50 within the department?
    
3. [Data wrangling] Clean and organise the data:

    * (a) Handle any issues of the data collected from part (1) based on what you have discovered in part (2), including:
        * If you have discovered that you did not collect the data correctly in part (1), fix it
            * If you are not able to fix the code in part (1), at the very least try to fix the data manually so that you can get reasonable results (and marks) in part (3) and (4)
        * If there is any missing data, handle them _appropriately_ and explain your rationale
    * (b) Extract the "role" and the "rank" of each staff from the `pd.DataFrame` in (3a):
        * "Role" of the staff (based on https://info.lse.ac.uk/staff/divisions/Human-Resources/The-recruitment-toolkit/Role-profiles): 
            * "teaching" (the job title should have the words like "Lecturer", "Teacher", "Teaching Fellow" or "LSE Fellow"), e.g. Associate Professorial Lecturer
            * "faculty" (the job title should have the _word_ "Professor"), e.g. Assistant Professor
        * "Rank" of the staff: 
            * "non-tenure" (non tenure-track, should contain the words like "Guest" or "Fellow"), e.g. Guest Teacher, LSE fellow
            * "assistant" (entry-level tenure-track, should contain the word "Assistant"), e.g. Assistant Professor
            * "associate" (mid-level tenure-track, should contain the word "Associate"), e.g. Associate Professorial Lecturer
            * "full" (senior-level tenure-track, does not have any of the keywords above), e.g. Professor
        
    and put the information about the role and rank as two additional columns to the data frame in (3a)
    * (c) Store the `pd.DataFrame` from (3b) into a csv file in the `data` folder, with the name of the file to be `lse_statistics_2022mmdd.csv` with `mm` the month and `dd` the day you have collected the data. This file will help us to verify your result
    
4. [Simple analysis] Use the `pd.DataFrame` from part (3b), find out if the gender ratio depends on the roles and/or the rank
    * Based on your results in this part, do you think the Department of Statistics has a good gender balance among its staff?
    
    
Please state the limitations of your answers.

---
### Note

* Data should be extracted as automatically as possible
* If you are not able to do some parts, you can hard code the values and continue to work on the rest of the question 
    * You will lose marks for not being able to solve the corresponding part, but at least you may get some marks from the other parts of the question
    * For example, if you cannot get the URLs from part 1(a), you may manually figure out the URLs of the staff to allow you to continue to work on part 1(b) and the rest of the question - but of course you will lose marks for part 1(a)
    
---
    
### Hints

* When finding the gender from the content of "About me", what pronouns do you expect to see with higher frequency if the staff is male? If the staff is female?
* You may find [`.str.contains()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html) and/or `.str.split()` useful for (3b)
    
---

In [2]:
## your attempt, please add the code cells and markdown cells for your answers. Make sure you:
## * use the right type of cells
## * state clearly which answer is for which part
## * show the output of the code cells

# Q3.1

In [3]:
#(a)

In [2]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [3]:
res = requests.get("https://www.lse.ac.uk/Statistics/People")
soup = BeautifulSoup(res.content,'lxml')

In [4]:
# For academic faculty
header = soup.find(text = "Academic faculty")
p = header.find_next("div", {"class": "accordion__content"})
people_data_1 = p.find_all('a',{"class": "sys_0 sys_t0"})

In [5]:
#check
people_data_1

[<a class="sys_0 sys_t0" href="/statistics/people/james-abdey" title="Dr James Abdey">More information</a>,
 <a class="sys_0 sys_t0" href="/statistics/people/mona-azadkia" title="Mona Azadkia">More information </a>,
 <a class="sys_0 sys_t0" href="/statistics/people/marcos-barreto" title="Dr Marcos Barreto">More information</a>,
 <a class="sys_0 sys_t0" href="/statistics/people/pauline-barrieu" title="Professor Pauline Barrieu">More information</a>,
 <a class="sys_0 sys_t0" href="/statistics/people/erik-baurdoux" title="Dr. Erik Baurdoux">More information</a>,
 <a class="sys_0 sys_t0" href="/statistics/people/wicher-bergsma" title="Dr. Wicher Bergsma">More information</a>,
 <a class="sys_0 sys_t0" href="/statistics/people/umut-cetin" title="Professor Umut Cetin">More information</a>,
 <a class="sys_0 sys_t0" href="/statistics/people/yining-chen" title="Dr. Yining Chen">More information</a>,
 <a class="sys_0 sys_t0" href="/statistics/people/yunxiao-chen" title="Dr Yunxiao Chen">More info

In [8]:
# scrape the base (short version) of the URL for people in Academic faculty
# and store the results in ul_1_base
ul_1_base = []
for i in range(len(people_data_1)):
    data = people_data_1[i]['href']
    ul_1_base.append(data)

In [10]:
#check
ul_1_base  

['/statistics/people/james-abdey',
 '/statistics/people/mona-azadkia',
 '/statistics/people/marcos-barreto',
 '/statistics/people/pauline-barrieu',
 '/statistics/people/erik-baurdoux',
 '/statistics/people/wicher-bergsma',
 '/statistics/people/umut-cetin',
 '/statistics/people/yining-chen',
 '/statistics/people/yunxiao-chen',
 '/statistics/people/angelos-dassios',
 '/statistics/people/daniela-escobar',
 '/statistics/people/piotr-fryzlewicz',
 '/statistics/people/sara-geneletti',
 '/statistics/people/kostas-kalogeropoulos',
 '/statistics/people/kostas-kardaras',
 '/statistics/people/jouni-kuha',
 '/statistics/people/clifford-lam',
 '/statistics/people/giulia-livieri',
 '/statistics/people/joshua-loftus',
 '/statistics/people/gelly-mitrodima',
 '/statistics/people/irini-moustaki',
 '/statistics/people/francesca-panero',
 '/statistics/people/xinghao-qiao',
 '/statistics/people/chengchun-shi',
 '/statistics/people/andreas-søjmark',
 '/statistics/people/fiona-steele',
 '/statistics/people/z

In [10]:
#construct the full URL for people in Academic faculty from the short version(ul_1_base)
#then get ul_1

ul_1=[]
for j in range(len(ul_1_base)):
    full_url ="https://www.lse.ac.uk"+ ul_1_base[j]
    ul_1.append(full_url)
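
The string concatenation above works because every `href` on the page is root-relative. As a hedged alternative, `urllib.parse.urljoin` from the standard library also copes with links that lack a leading slash or are already absolute. The names `BASE_URL` and `to_absolute` are introduced here for illustration only and are not part of the original answer.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.lse.ac.uk"  # assumed site root, same as in the cell above


def to_absolute(paths, base=BASE_URL):
    """Return absolute URLs for a list of (possibly relative) hrefs."""
    return [urljoin(base, path) for path in paths]


# Two sample hrefs: one root-relative, one without the leading slash.
sample_paths = ["/statistics/people/james-abdey", "statistics/people/mona-azadkia"]
full_urls = to_absolute(sample_paths)
print(full_urls)
```

With this helper, `ul_1 = to_absolute(ul_1_base)` would replace the explicit loop.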

In [11]:
#check 
# ul_1

In [12]:
# For Academic associates - Emeritus Professors, Guest Lecturers, Fellows and Visiting staff

In [13]:
header2 = soup.find(text = 'Research staff')
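p2 = header2.find_previous("div", {"class": "accordion__content"})  # not sure why the "Academic associates" header text cannot be matched directly, so search backwards from "Research staff"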
people_data_2 = p2.find_all("div", {"class": "accordion__txt"})

In [14]:
# scrape the base (short version) of the URL for people in Academic associates-EGFV,
# excluding any of the emeritus or visiting staff,
# and store the results in ul_2_base
ul_2_base = []
for i in range(len(people_data_2)):
    # do not consider any of the emeritus or visiting staff
    if (('Emeritus' not in people_data_2[i].text) and
        ('Visiting' not in people_data_2[i].text) and
        # the webpage misspells 'Visiting' as 'Visting', so check for that spelling too
        ('Visting' not in people_data_2[i].text)):
        data = people_data_2[i].find('a')['href']
        ul_2_base.append(data)  
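
The three substring checks above can also be expressed as one case-insensitive regular expression. The following is a minimal sketch under the assumption that only the spellings "Emeritus", "Visiting", and "Visting" need to be excluded. `EXCLUDE_PATTERN`, `keep_entry`, and the sample strings are illustrative and do not come from the webpage.

```python
import re

# Matches "Emeritus", "Visiting", and the misspelling "Visting", in any letter case.
EXCLUDE_PATTERN = re.compile(r"emeritus|visi?ting", re.IGNORECASE)


def keep_entry(entry_text):
    """True when the staff entry is neither emeritus nor visiting."""
    return EXCLUDE_PATTERN.search(entry_text) is None


# Made-up entry texts for demonstration.
samples = [
    "Guest Lecturer",
    "Emeritus Professor",
    "Visting Professor",
    "LSE Fellow",
    "Visiting Senior Fellow",
]
kept = [s for s in samples if keep_entry(s)]
print(kept)
```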

In [15]:
#check 
#ul_2_base 

In [16]:
#construct the full URL for people in Academic associates-EGFV  from the short version(ul_2_base)
#then get ul_2
ul_2=[]
for j in range(len(ul_2_base)):
    full_url ="https://www.lse.ac.uk"+ ul_2_base[j]
    ul_2.append(full_url)

In [17]:
#check
#ul_2 

In [18]:
# Combine ul_1 and ul_2 to get the full list of URL(ul_LSE_staff)
ul_LSE_staff = ul_1 + ul_2

In [19]:
ul_LSE_staff

['https://www.lse.ac.uk/Statistics/People/Dr-James-Abdey',
 'https://www.lse.ac.uk/Statistics/People/Dr-Mona-Azadkia',
 'https://www.lse.ac.uk/Statistics/People/Dr-Marcos-Barreto',
 'https://www.lse.ac.uk/Statistics/People/Professor-Pauline-Barrieu',
 'https://www.lse.ac.uk/Statistics/People/Dr-Erik-Baurdoux',
 'https://www.lse.ac.uk/Statistics/People/Professor-Wicher-Bergsma',
 'https://www.lse.ac.uk/Statistics/People/Professor-Umut-Cetin',
 'https://www.lse.ac.uk/Statistics/People/Dr-Yining-Chen',
 'https://www.lse.ac.uk/Statistics/People/Yunxiao-Chen',
 'https://www.lse.ac.uk/Statistics/People/Professor-Angelos-Dassios',
 'https://www.lse.ac.uk/Statistics/People/Dr-Daniela-Escobar',
 'https://www.lse.ac.uk/Statistics/People/Professor-Piotr-Fryzlewicz',
 'https://www.lse.ac.uk/Statistics/People/Dr-Sara-Geneletti',
 'https://www.lse.ac.uk/Statistics/People/Dr-Kostas-Kalogeropoulos',
 'https://www.lse.ac.uk/Statistics/People/Professor-Kostas-Kardaras',
 'https://www.lse.ac.uk/Statistic

In [20]:
#(b)

In [21]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np


In [22]:
#name and job title
name_list = []
jobt_list = []
for url in ul_LSE_staff : 
    res = requests.get(url)
    soup = BeautifulSoup(res.content,'lxml')
    people_headers = soup.find_all("header", {"class": "people__header"})
    name =(people_headers[0].h1.text).strip() # get the name from people_header
    jobt =(people_headers[0].h2.text).strip() # get the job_title from people_header
    name_list.append(name) 
    jobt_list.append(jobt) 
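
The loop above assumes every page has a `people__header` block with both an `h1` and an `h2`. A more defensive sketch is shown below: it parses a page once and returns `None` for any missing part, so a single unusual page does not stop the whole scrape. The function name `parse_staff_page` and the sample HTML (a fictional "Dr Jane Doe") are assumptions made for illustration. Only the class names `people__header` and `people__bio` are taken from the cells in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment that mimics the structure used in this notebook.
SAMPLE_HTML = """
<header class="people__header"><h1> Dr Jane Doe </h1><h2>Assistant Professor</h2></header>
<div class="people__bio"><p>She works on time series.</p></div>
"""


def parse_staff_page(html):
    """Extract name, job title, and bio text; missing parts come back as None."""
    soup = BeautifulSoup(html, "html.parser")
    header = soup.find("header", class_="people__header")
    bio = soup.find("div", class_="people__bio")
    name = header.h1.get_text(strip=True) if header and header.h1 else None
    title = header.h2.get_text(strip=True) if header and header.h2 else None
    bio_text = bio.get_text(" ", strip=True) if bio else None
    return name, title, bio_text


print(parse_staff_page(SAMPLE_HTML))
```

Returning the bio text from the same call would also avoid requesting each staff page a second time for the gender step.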

In [23]:
#check
#name_list

In [24]:
#check
#jobt_list

In [25]:
#Gender
male_pronouns = ["he", "him", "his", "himself"]
female_pronouns = ["she", "her", "hers", "herself"]
male_count = 0
female_count = 0

In [33]:
gender_list = []
for url in ul_LSE_staff : 
    res = requests.get(url)
    soup = BeautifulSoup(res.content,'lxml')
    about_me = soup.find_all("div", {"class": "people__bio"})
    #get the text from about_me above 
    t = about_me[0].text
    #remove the "\n", "\r" ,"\xa0" "\'s" "," , "." from t above 
    #to get the clean text 
    text = t.replace("\n", "").replace("\r", "").replace("\xa0", "").replace("\'s", "").replace(",", "").replace(".", "")
    for word in text.split(' '):
        if word.lower() in male_pronouns:
            male_count += 1
        if word.lower() in female_pronouns:
            female_count += 1
    
    if male_count >female_count :
        gender_list.append('M')
    elif female_count > male_count :
        gender_list.append('F')
    else:
        gender_list.append( np.nan )
   
    male_count = 0
    female_count= 0
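
One caveat with the cleaning step above: replacing `"\n"` with an empty string can glue the last word of one line to the first word of the next, which may hide a pronoun. A hedged alternative is to tokenise with a regular expression and count whole words only. The function `infer_gender` and the example sentences below are illustrative assumptions. The pronoun lists are the same as in the cell above.

```python
import re
from collections import Counter

MALE = {"he", "him", "his", "himself"}
FEMALE = {"she", "her", "hers", "herself"}


def infer_gender(bio_text):
    """Return 'M', 'F', or None by comparing pronoun counts in the text."""
    # Lower-case the text and keep runs of letters, so punctuation and
    # line breaks never join two words together.
    words = re.findall(r"[a-z]+", bio_text.lower())
    counts = Counter(words)
    male = sum(counts[w] for w in MALE)
    female = sum(counts[w] for w in FEMALE)
    if male > female:
        return "M"
    if female > male:
        return "F"
    return None  # tie or no pronouns: gender cannot be inferred


print(infer_gender("He joined LSE in 2010.\nHis research covers finance."))
print(infer_gender("Research interests: time series."))
```

A `None` result can be mapped to `np.nan` when the list is placed in the data frame.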
        

In [34]:
#check
print(gender_list)
len(gender_list)

['M', 'F', 'M', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'F', 'M', 'F', 'M', 'M', 'M', nan, 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F', 'M', 'M', 'M', nan, 'F', 'M', nan, 'F', 'F', 'M', 'M', 'F', 'M']


38

In [155]:
# store the results in a pd.DataFrame df
zipped = zip(name_list,jobt_list,gender_list)
df_staff = pd.DataFrame(zipped, columns=['Name of the staff', 'Job title', 'Gender'])
# df_staff

# Q3.2

In [156]:
print(df_staff[["Gender"]].value_counts(normalize=True))
print(df_staff[["Gender"]].value_counts())

Gender
M         0.676471
F         0.323529
dtype: float64
Gender
M         23
F         11
dtype: int64


No, the gender ratio of academic staff is not close to 50:50.

According to the results above, even though there are three staff members whose gender could not be determined, the proportion of men (about 68%) is much higher than that of women (about 32%). Even if all three were women, men would still outnumber women 23 to 14, so the gender ratio of academic staff is not close to 50:50.
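
By default `value_counts()` drops missing values, so the counts above do not show how many genders are unknown. The sketch below, on a small made-up data frame (not the scraped results), shows one way to include missing values and job titles in the descriptive statistics. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only, not the scraped results.
toy = pd.DataFrame(
    {
        "Job title": ["Professor", "Assistant Professor", "LSE Fellow", "Professor"],
        "Gender": ["M", "F", np.nan, "M"],
    }
)

gender_counts = toy["Gender"].value_counts(dropna=False)  # keeps NaN as its own category
n_missing = int(toy["Gender"].isna().sum())               # number of unknown genders
title_counts = toy["Job title"].value_counts()            # frequency of each job title

print(gender_counts)
print(title_counts)
print("missing gender:", n_missing)
```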

# Q3.3

In [157]:
#(a)

In [158]:
# fix the data
df_staff.loc[15 ,'Gender'] = 'M'
df_staff.loc[27 ,'Gender'] = 'M'
df_staff.loc[30 ,'Gender'] = 'M'

In [159]:
# check 
df_staff.head()

Unnamed: 0,Name of the staff,Job title,Gender
0,Dr James Abdey,Associate Professorial Lecturer,M
1,Dr Marcos Barreto,Assistant Professorial Lecturer,M
2,Professor Pauline Barrieu,Professor and Head of Department,F
3,Dr Erik Baurdoux,Associate Professor,M
4,Professor Wicher Bergsma,Professor and Deputy Head of Department (Teach...,M


In [160]:
#(b) 

In [161]:
teaching_word = ["Lecturer", "Teacher" , "Teaching Fellow" , "LSE Fellow" ]

In [162]:
Role_list = []
Rank_list = []

for title in df_staff["Job title"]:
    # Extract role: append exactly one entry per staff member
    if any(tw in title for tw in teaching_word):
        Role_list.append("teaching")
    elif "Professor" in title and "Professorial" not in title:
        Role_list.append("faculty")
    else:
        Role_list.append(np.nan)  # no role keyword found

    # Extract rank: the first matching keyword wins, so each title gets one rank
    if "Guest" in title or "Fellow" in title:
        Rank_list.append("non-tenure")
    elif "Assistant" in title:
        Rank_list.append("assistant")
    elif "Associate" in title:
        Rank_list.append("associate")
    else:
        Rank_list.append("full")  # none of the keywords above
 


In [163]:
df_staff["Role"] =Role_list
df_staff["Rank"] =Rank_list
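
The same classification can be written without an explicit loop, using `.str.contains()` from the hints together with `np.select`. This is a sketch on made-up job titles, not the scraped data. The `"unknown"` default and the word-boundary pattern for "Professor" are choices made for this illustration.

```python
import numpy as np
import pandas as pd

# Example titles for illustration only.
titles = pd.Series(
    [
        "Associate Professorial Lecturer",
        "Assistant Professor",
        "Professor",
        "LSE Fellow",
        "Guest Teacher",
    ]
)

is_teaching = titles.str.contains(r"Lecturer|Teacher|Teaching Fellow|LSE Fellow").to_numpy()
# \b marks a word boundary, so "Professorial" does not count as "Professor".
is_faculty = titles.str.contains(r"\bProfessor\b").to_numpy()

# np.select takes the first condition that is True for each row.
role = np.select([is_teaching, is_faculty], ["teaching", "faculty"], default="unknown")
rank = np.select(
    [
        titles.str.contains("Guest|Fellow").to_numpy(),
        titles.str.contains("Assistant").to_numpy(),
        titles.str.contains("Associate").to_numpy(),
    ],
    ["non-tenure", "assistant", "associate"],
    default="full",
)

result = pd.DataFrame({"Job title": titles, "Role": role, "Rank": rank})
print(result)
```

Because `np.select` returns one value per row, the new columns always have the same length as the data frame.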


In [164]:
#check 
df_staff.head()

Unnamed: 0,Name of the staff,Job title,Gender,Role,Rank
0,Dr James Abdey,Associate Professorial Lecturer,M,teaching,associate
1,Dr Marcos Barreto,Assistant Professorial Lecturer,M,teaching,assistant
2,Professor Pauline Barrieu,Professor and Head of Department,F,faculty,full
3,Dr Erik Baurdoux,Associate Professor,M,faculty,associate
4,Professor Wicher Bergsma,Professor and Deputy Head of Department (Teach...,M,faculty,full


In [165]:
#(c)

In [166]:
import os

os.getcwd()

'C:\\Users\\27615\\OneDrive\\桌面\\ST445_C\\ST445_PS5\\2022-ps-5-RUIYINGJIANG'

In [167]:
df_staff.to_csv('data/lse_statistics_20221102.csv')
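
The file name required by part (3c) has the form `lse_statistics_2022mmdd.csv`. A small sketch of how such a name could be built from a date rather than typed by hand is given below. The helper `csv_path_for` is a hypothetical name, and the date 2 November 2022 is used only as an example.

```python
import os
from datetime import date


def csv_path_for(day, folder="data"):
    """Build the path <folder>/lse_statistics_YYYYMMDD.csv for the given date."""
    return os.path.join(folder, f"lse_statistics_{day:%Y%m%d}.csv")


path = csv_path_for(date(2022, 11, 2))
print(path)
```

In the notebook this could be combined with `os.makedirs("data", exist_ok=True)` and `df_staff.to_csv(csv_path_for(date.today()))`, so that the folder exists and the name reflects the collection date.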

# Q3.4 [Simple analysis]

In [168]:
df_staff_rankg = df_staff.loc[ :,["Rank","Gender"]]
df_staff_roleg = df_staff.loc[ :,["Role","Gender"]]
df_staff_rankg.head()  # data frame that only contains columns 'rank' and 'gender' 

Unnamed: 0,Rank,Gender
0,associate,M
1,assistant,M
2,full,F
3,associate,M
4,full,M


In [169]:
# gender ratio depends on the roles and/or the rank
by_rank_gender = df_staff_rankg.groupby(['Rank']).value_counts(normalize=True)
print(by_rank_gender)

Rank        Gender
assistant   M         0.600000
            F         0.400000
associate   M         0.857143
            F         0.142857
full        M         0.769231
            F         0.230769
non-tenure  M         0.571429
            F         0.428571
dtype: float64


In [170]:
by_role_gender = df_staff_roleg.groupby(['Role']).value_counts(normalize=True)
by_role_gender

Role      Gender
faculty   M         0.8
          F         0.2
teaching  F         0.5
          M         0.5
dtype: float64
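
Proportions alone hide how small each group is, so it may help to look at counts and row proportions side by side. The sketch below uses `pd.crosstab` on made-up data (not the scraped results) to show both. The variable names are illustrative.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame(
    {
        "Role": ["faculty", "faculty", "faculty", "faculty", "teaching", "teaching"],
        "Gender": ["M", "M", "M", "F", "M", "F"],
    }
)

counts = pd.crosstab(toy["Role"], toy["Gender"])                     # raw counts per cell
shares = pd.crosstab(toy["Role"], toy["Gender"], normalize="index")  # proportions within each role

print(counts)
print(shares)
```

The same two calls with `"Rank"` in place of `"Role"` would give the breakdown by rank.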

According to the results above:


For rank:

the gender ratio is relatively balanced among 'assistant' (60% male) and 'non-tenure' (57% male) staff,
but it is unbalanced for the 'associate' and 'full' ranks, where the proportion of men (86% and 77%) is considerably higher than that of women.


For role:

the gender ratio is balanced in 'teaching' (50:50),
but it is unbalanced in 'faculty', where the proportion of men (80%) is four times that of women (20%).


In conclusion, the gender ratio of academic staff depends on both rank and role, and the Department of Statistics does not have a good gender balance.
The majority of academic staff are male, and men are more likely than women to hold faculty roles and higher ranks.

# limitations of answers

In data collection via web scraping:<br>
When collecting the gender of staff from 'About me', three staff members had no gendered pronouns
in the text under 'About me', so their gender could not be found automatically.

When excluding emeritus and visiting staff, we found a typo on the webpage, which misspells 'Visiting' as 'Visting', so we added another condition to the `if` statement.