# Webscraping the data

* This notebook introduces our project and acquires all the data needed to answer the research questions.


### Contents :

#### Introduction

#### Import Libraries

#### Scrape data for LSE university accommodation

#### Scrape data for King's College London accommodation

#### Scrape data for LSE private accommodation

#### Scrape data for Warwick University accommodation

#### Convert dataframes into csv files

###### The questions are then answered in separate notebooks for clarity
* Those notebooks focus on any further data preparation needed, the data analysis, and a conclusion that answers the question.



### Introduction

The motivation of our research project is to assess how good LSE accommodation is. We chose this topic because, as students at LSE, we believe accommodation is a key aspect of a student's life, and because it lets us help fellow students make a well-informed decision on where to stay during their time at university.

Our main research questions to answer are : 

* 1) How does the cost of university accommodation at LSE vary?
* 2) Is it better to have private accommodation or university accommodation at LSE?
* 3) What is LSE accommodation like compared to KCL?
* 4) What is LSE accommodation like relative to Warwick University?




### Import Libraries

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
#!pip install geopy
from geopy.distance import geodesic
from geopy.geocoders import Nominatim
import os


### Scrape data for LSE university accommodation

First, we'll gather the data from here: https://www.lse.ac.uk/student-life/accommodation/apply/types-of-contracts-halls-and-rooms

The links are stored in a table under a tab. We will use the table headers to determine which type of student each accommodation is for.


Next, we will also get the links to the individual hall pages, from which we'll collect the rest of the data.


In [2]:
url = "https://www.lse.ac.uk/student-life/accommodation/apply/types-of-contracts-halls-and-rooms"

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Find the parent element that contains the table
accordion_content = soup.find('div', class_='accordion__content')

# Find the table within the accordion content
table = accordion_content.find('table')

In [3]:
# Initialize lists for undergraduate (UG), postgraduate (PG), and mixed (MIX) halls, plus the hall page links
UG = []
PG = []
MIX = []
links = []
# Find all table headers
headers = table.find_all('th')

# Get the text content of each header
header_texts = [header.text.strip() for header in headers]


rows = table.find_all('tr')
for row in rows:
    table_data = row.find_all('td')
    for index, element in enumerate(table_data):
        cells = element.find_all('a')
        for  cell in cells:
            link = 'https://www.lse.ac.uk/'+ cell.get('href')
            links.append(link)
            title = cell.get('title')

            if header_texts[index] == 'Undergraduate only halls':
                UG.append(title)
            elif header_texts[index] == 'Graduate only halls':
                PG.append(title)
            elif header_texts[index] == 'Mixed halls':
                MIX.append(title)
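
The loop above builds each hall link by concatenating the site root with the `href`. As a minimal sketch of an alternative, assuming the `href` values may or may not start with a slash, `urllib.parse.urljoin` resolves both forms without producing a double slash. The helper name `absolute_link` and the example paths are hypothetical and only for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.lse.ac.uk/"


def absolute_link(href, base=BASE_URL):
    """Resolve an href found on the page against the site root."""
    return urljoin(base, href)


# Hypothetical href values, not taken from the real page
print(absolute_link("/student-life/accommodation/halls/example-hall"))
print(absolute_link("student-life/accommodation/halls/example-hall"))
```

Both calls resolve to the same absolute URL, and an `href` that is already absolute is returned unchanged.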


In [4]:
header = ['accom name','Student Level (UG/PG/mixed)', 'distance by foot','distance by bike','distance by public transport','Bed Size(D/S)','Bathroom Type(S/P)', 'Distance (km)' ,'price']
df_lse = pd.DataFrame(columns=header)

In [5]:
def shorten_bed_name(room_name): #Single(S) or Double(D)
  if 'single' in room_name.lower() and 'queen' not in room_name.lower(): #lets count queen as double 
    return 'S'

  else:
    return 'D'

def shorten_bath_name(bath_name): #Private(P) or Shared(S)
  if 'private' in bath_name.lower():
    return 'P'
  else:
    return 'S'

In [6]:
for link in links:
  response = requests.get(link)

  soup = BeautifulSoup(response.text, "html.parser")

  accom_name = soup.find('h1', class_='heroBanner__title').get_text()

  page_content = soup.find('article', class_='pageContent accommContent')
  page_content = str(page_content)

  #distance by foot/bike/public transport are all within the text on the page, not in separate classes
  #we'll have to look for these specific words on the page using string manipulation
  #First let's get the index of where these words start

  index_foot = page_content.find('On foot')
  index_bike = page_content.find('By bike')
  index_pt = page_content.find('By public transport')
  level = 'PG' if accom_name in PG else ('UG' if accom_name in UG else 'mixed')


  #Now lets look for the specific times using the index
  foot_time = page_content[index_foot:].split('<br/>')[0].split(':')[1].split()[0]
  bike_time = page_content[index_bike:].split('<br/>')[0].split(':')[1].split()[0]
  pt_time = page_content[index_pt:].split('<br/>')[0].split(':')[1].split()[0]

  #lets also get the distance from the campus
  distance = soup.find('div', class_='accommKeyDetails__dist').get_text()
  distance = distance.split(':')[1].split('km')[0].strip()

  all_rooms = soup.find('ul', class_='roomlist').find_all('li', class_='roomlist__room')
  rooms_data = str(all_rooms).split("</li>")

  # Loop through each room data
  for room_data in rooms_data:
    if len(room_data) > 1: #for some reason we get ']' as one of the elements in rooms_data, lets ignore this

      string_room_data = str(room_data)

      bed_size = shorten_bed_name(string_room_data)



      # Find room position
      position_start = room_data.find('class="roomlist__position">') + len('class="roomlist__position">')
      position_end = room_data.find('</p>', position_start)
      bathroom_type = room_data[position_start:position_end].strip()
      bathroom_type = shorten_bath_name(bathroom_type)

      # Find room price
      price_start = room_data.find('roomataGlance__figure">') + len('roomataGlance__figure">')
      price_end = room_data.find('<span class="roomataGlance__freq">', price_start)
      room_price = room_data[price_start:price_end].strip()

      if '-' in room_price: #sometimes there is a range of prices
        lower_price = room_price.split('<br/>')[0].split('-')[0].strip()
        higher_price = room_price.split('<br/>')[1].split('</span>')[1].strip()

        room_price = round((float(lower_price) + float(higher_price)) / 2, 2)



      list_append = [accom_name,level,foot_time, bike_time, pt_time, bed_size, bathroom_type,distance,room_price]
      df_lse.loc[len(df_lse.index)] = list_append
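
The travel times above are pulled out by chained `split` calls, which fail with an error if a label such as 'On foot' is missing from a page. A minimal sketch of a more forgiving approach, assuming the page text looks like `On foot: 27 minutes<br/>`, is a small regular-expression helper that returns `None` when the label is absent. The function name `extract_minutes` and the sample string are hypothetical.

```python
import re


def extract_minutes(html, label):
    """Return the first integer that follows `label` and a colon, or None if absent."""
    pattern = re.escape(label) + r"[^:<]*:\s*(\d+)"
    match = re.search(pattern, html)
    return int(match.group(1)) if match else None


# Hypothetical fragment shaped like the text the loop above expects
sample = (
    "On foot: 27 minutes<br/>"
    "By bike: 13 minutes<br/>"
    "By public transport: 24 minutes<br/>"
)

for label in ("On foot", "By bike", "By public transport", "By boat"):
    print(label, extract_minutes(sample, label))
```

A missing label then shows up as a missing value in the dataframe, where `isna()` can catch it, instead of stopping the scrape.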


In [7]:
df_lse.head()

Unnamed: 0,accom name,Student Level (UG/PG/mixed),distance by foot,distance by bike,distance by public transport,Bed Size(D/S),Bathroom Type(S/P),Distance (km),price
0,Bankside House,UG,27,13,24,S,S,1.5,259.7
1,Bankside House,UG,27,13,24,S,P,1.5,287.17
2,Bankside House,UG,27,13,24,S,P,1.5,176.93
3,Butler's Wharf Residence,PG,51,22,34,S,S,3.2,231.0
4,Butler's Wharf Residence,PG,51,22,34,S,S,3.2,140.53


df_lse:
* accom name
* Student Level (UG/PG/mixed) - UG means undergraduate, PG means postgraduate, mixed means both
* distance by foot / bike / public transport - the time taken to get from the accommodation to the LSE campus by each mode
* Bed Size(D/S) - D stands for double bed, S stands for single bed
* Bathroom Type(S/P) - S stands for shared bathroom and P stands for private bathroom
* Distance (km) - the walking distance from the accommodation to the LSE campus
* price - the weekly price in pounds

In [8]:
#Check to see if there is missing data
df_lse.isna().sum()

accom name                      0
Student Level (UG/PG/mixed)     0
distance by foot                0
distance by bike                0
distance by public transport    0
Bed Size(D/S)                   0
Bathroom Type(S/P)              0
Distance (km)                   0
price                           0
dtype: int64

### Scrape data for King's College London accommodation

Let's repeat this for KCL accommodation. We will get our data from https://www.kcl.ac.uk/accommodation/undergraduate and https://www.kcl.ac.uk/accommodation/postgraduate


Note that the postgraduate page shows the accommodation available only to PG students as well as the accommodation available to both (i.e. mixed).

Create a list for the links and add the undergraduate accommodation first.

In [9]:
links = []


url = "https://www.kcl.ac.uk/accommodation/undergraduate"


response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

all_accom = soup.find_all('div', class_='col-xs-12 col-sm-4 col-lg-3')

for accom in all_accom:
  link = accom.find('a')
  link = link.get('href')
  links.append(link)

In [10]:
links = links[:-4] #ignore the last 4 links as they are not related to the accommodation

Now let's also add the postgraduate (PG) accommodation to the list.

In [11]:
url = "https://www.kcl.ac.uk/accommodation/postgraduate"
mixed_accom_name = []


response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

PG_accom = soup.find_all('div', class_='col')
mixed_accom = soup.find_all('div', class_='col-xs-12 col-sm-4')

# Iterate through each accommodation div
for accom in PG_accom:
    # Find the anchor tag within the div
    link = accom.find('a')
    link = link.get('href')
    links.append(link)

for accom in mixed_accom:
    # Find the anchor tag within the div
    link = accom.find('a')
    accom_name = link.get('title') if link else None
    mixed_accom_name.append(accom_name)

In [12]:
header = ['accom name','Student Level (UG/PG/mixed)','Bed Size(D/S)','Bathroom Type(S/P)', 'price']
df_kings = pd.DataFrame(columns=header)

In [13]:
list_append = []

for link in links:
    response = requests.get(link)
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract accommodation name
    accom_name = soup.find('div', class_='block--hero__text').get_text().strip()

    # Determine level (UG or PG)
    level = 'PG' if 'Postgraduate' in accom_name else ('mixed' if accom_name in mixed_accom_name else 'UG')

    # Extract table content
    table = soup.find('div', class_='tab-pane tab-pane--border fade').find('table')
    rows = table.find_all('tr')[1:]  # Skip the header row


    # Parse table rows
    for row in rows:
        columns = row.find_all('td')
        #print(columns)
        bed_size = columns[2].get_text().strip().split('x')[0].strip().strip('cm')
        bed_size = 'D' if int(str(bed_size)) >= 120 else 'S'

        bathroom_type = str(columns[0].get_text()).lower()
        #bathroom_type = 'P' if 'ensuite' in str(columns[1].get_text()).lower() else 'S'

        if ('ensuite' in bathroom_type or 'studio' in bathroom_type) and 'non-ensuite' not in bathroom_type:
          bathroom_type = 'P'  # Ensuite
        else:
          bathroom_type = 'S'

        room_price = columns[3].get_text().strip('£')

        # Append the row to the dataframe
        list_append = [accom_name, level, bed_size, bathroom_type, room_price]
        df_kings.loc[len(df_kings.index)] = list_append

In [14]:
df_kings.head()

Unnamed: 0,accom name,Student Level (UG/PG/mixed),Bed Size(D/S),Bathroom Type(S/P),price
0,Angel Lane,mixed,D,P,163.0
1,Angel Lane,mixed,D,P,268.0
2,Angel Lane,mixed,D,P,268.0
3,Angel Lane,mixed,D,P,268.0
4,Angel Lane,mixed,D,P,394.0


df_kings:

* accom name
* Student Level (UG/PG/mixed) - UG means undergraduate, PG means postgraduate, mixed means both
* Bed Size(D/S) - D means double and S means single
* Bathroom Type(S/P) - P means private (ensuite or studio) and S means shared (non-ensuite)
* price
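
The bed-size rule in the scraping loop treats a bed that is at least 120 cm wide as a double. As a minimal sketch of that rule in isolation, assuming the size cell reads something like `90 x 190cm`, the helper below pulls the width out with a regular expression and returns `None` when no width can be found. The name `classify_bed` and the sample strings are hypothetical.

```python
import re

DOUBLE_MIN_WIDTH_CM = 120  # threshold taken from the scraping loop


def classify_bed(dimensions):
    """Return 'D' or 'S' from a size string such as '90 x 190cm', or None."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(?:cm)?\s*[x×]", dimensions.lower())
    if match is None:
        return None
    width = float(match.group(1))
    return "D" if width >= DOUBLE_MIN_WIDTH_CM else "S"


# Hypothetical size strings, for illustration only
for size in ("90 x 190cm", "135cm x 190cm", "120x200", "n/a"):
    print(size, classify_bed(size))
```

Returning `None` for an unreadable cell keeps the row in the dataframe and leaves the gap visible to the missing-data check.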

In [15]:
#Check to see if there is missing data
df_kings.isna().sum()

accom name                     0
Student Level (UG/PG/mixed)    0
Bed Size(D/S)                  0
Bathroom Type(S/P)             0
price                          0
dtype: int64

### Scrape data for LSE private accommodation

Scrape data from: "https://www.studentcrowd.com/best-halls-l1055195-s1008323-the_london_school_of_economics_and_political_science-central-london" 




In [16]:
#the url we will scrape from, and a list to store the accommodation urls
page_url = "https://www.studentcrowd.com/best-halls-l1055195-s1008323-the_london_school_of_economics_and_political_science-central-london"
base_url = "https://www.studentcrowd.com/"
accom_urls = []

response = requests.get(page_url)

if response.status_code == 200:
    soup1 = BeautifulSoup(response.text, "html.parser")
    container = soup1.find("ul", class_="tw-mt-8 tw-mx-0 tw-mb-0 list-style--none")

    if container:
        for li in container.find_all("li"):
            a_tag = li.find("a")
            if a_tag and a_tag.has_attr("href"):
                accom_urls.append(base_url+a_tag["href"])
else:
    print("Failed to retrieve the web page")

In [17]:

accom_info=[]

for url in accom_urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")

    rating = soup.find(class_="rating-value__score").get_text()
    name = soup.find(class_="mb-").get_text().replace(" Reviews","")
    postcode = soup.find(itemprop="postalCode").get_text()
    #look up the "Price from" label once, then read the value from the element next to it
    price_label = soup.find(string=lambda t: "Price from" in t)
    price_raw = price_label.parent.find_next_sibling().get_text() if price_label else "not found"
    price_match = re.search(r"£\d+\.\d{2}", price_raw)
    price = price_match.group() if price_match else "Not Found"


    accom_info.append([name,price,rating,postcode])


lse_private_accom_df= pd.DataFrame(accom_info,columns=["name","price","rating","postcode"])


lse_private_accom_df.head()

Unnamed: 0,name,price,rating,postcode
0,Bloomsbury Janet Poole House,£405.00,4.54,WC1E 6AA
1,Grosvenor House,£569.00,4.52,WC2B 5TB
2,East Central House,£349.00,4.35,EC1V 3RH
3,Elizabeth Croll House,£399.00,4.26,WC1X 9EJ
4,Camden Hawley Crescent,£554.00,4.23,NW1 8NP


In [18]:
#create a function to convert a postcode into its distance from the LSE campus (WC2A 3PH is the LSE postcode)
#note: geodesic returns the straight-line distance in km, not a routed walking distance
def calculate_walking_distance(origin):
    geolocator = Nominatim(user_agent="walking_distance_calculator")
    origin_location = geolocator.geocode(origin)
    destination_location = geolocator.geocode("WC2A 3PH")

    if origin_location is None or destination_location is None:
        return None

    origin_coords = (origin_location.latitude, origin_location.longitude)
    destination_coords = (destination_location.latitude, destination_location.longitude)

    distance = geodesic(origin_coords, destination_coords).kilometers
    return distance
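
To make clear what the distance above measures, here is a minimal sketch of a straight-line distance between two latitude/longitude pairs using the haversine formula. This is a swapped-in illustration, not what `geopy` does internally: `geodesic` uses an ellipsoidal Earth model, so its results differ slightly from this spherical approximation. The function name `haversine_km` and the approximate coordinates are assumptions for the example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(origin, destination):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(radians, origin)
    lat2, lon2 = map(radians, destination)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Approximate city-centre coordinates, for illustration only
london = (51.5074, -0.1278)
paris = (48.8566, 2.3522)
print(round(haversine_km(london, paris), 1))
```

Either way the result is the distance "as the crow flies", so an actual walking route will generally be longer.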

In [19]:
lse_private_accom_df["walking_distance"] = lse_private_accom_df["postcode"].apply(calculate_walking_distance)

lse_private_accom_df.head()


Unnamed: 0,name,price,rating,postcode,walking_distance
0,Bloomsbury Janet Poole House,£405.00,4.54,WC1E 6AA,1.489944
1,Grosvenor House,£569.00,4.52,WC2B 5TB,0.462801
2,East Central House,£349.00,4.35,EC1V 3RH,1.87175
3,Elizabeth Croll House,£399.00,4.26,WC1X 9EJ,
4,Camden Hawley Crescent,£554.00,4.23,NW1 8NP,3.430531


In [20]:
# remove duplicate data, as some LSE accommodation appears in the private accommodation list

names_to_remove = df_lse["accom name"].unique()
lse_private_accom_df_filtered = lse_private_accom_df[~lse_private_accom_df["name"].isin(names_to_remove)]

lse_private_accom_df_filtered.head()

Unnamed: 0,name,price,rating,postcode,walking_distance
0,Bloomsbury Janet Poole House,£405.00,4.54,WC1E 6AA,1.489944
1,Grosvenor House,£569.00,4.52,WC2B 5TB,0.462801
2,East Central House,£349.00,4.35,EC1V 3RH,1.87175
3,Elizabeth Croll House,£399.00,4.26,WC1X 9EJ,
4,Camden Hawley Crescent,£554.00,4.23,NW1 8NP,3.430531


In [21]:
#checking to see if there is missing data
lse_private_accom_df_filtered.isna().sum()

name                0
price               0
rating              0
postcode            0
walking_distance    1
dtype: int64

In [22]:
#finally remove all the rows that have missing data

lse_private_accom_df_semi_clean= lse_private_accom_df_filtered.dropna()
lse_private_accom_df_clean = lse_private_accom_df_semi_clean[lse_private_accom_df_semi_clean['price'] != 'Not Found']
lse_private_accom_df_clean.head()

Unnamed: 0,name,price,rating,postcode,walking_distance
0,Bloomsbury Janet Poole House,£405.00,4.54,WC1E 6AA,1.489944
1,Grosvenor House,£569.00,4.52,WC2B 5TB,0.462801
2,East Central House,£349.00,4.35,EC1V 3RH,1.87175
4,Camden Hawley Crescent,£554.00,4.23,NW1 8NP,3.430531
5,Prince Consort Village,£345.00,4.19,W12 9PL,8.655255


lse_private_accom_df_clean:
* name
* price
* rating - out of 5, 5 being the highest 0 being the lowest
* postcode
* walking_distance - straight-line (geodesic) distance from the postcode to the LSE campus, in km
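
The `price` column here is still text such as `£405.00`, whereas the prices in the other dataframes are numeric. A minimal sketch of a converter, assuming the values keep the `£` prefix shown in the output above, is given below. The name `parse_price` is hypothetical, and the thousands-separator case is an assumption that does not appear in the scraped sample.

```python
import re


def parse_price(text):
    """Return the first pound amount in `text` as a float, or None if there is none."""
    match = re.search(r"£\s*(\d+(?:,\d{3})*(?:\.\d+)?)", text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


print(parse_price("£405.00"))
print(parse_price("Not Found"))
print(parse_price("£1,250.50"))  # assumed format, not seen in the scraped data
```

It could be applied column-wise with something like `lse_private_accom_df_clean["price"].apply(parse_price)` before comparing prices across the datasets.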

In [23]:
#checking to see if there is missing data
lse_private_accom_df_clean.isna().sum()

name                0
price               0
rating              0
postcode            0
walking_distance    0
dtype: int64

### Scrape data for Warwick University accommodation

* We scrape Warwick because it offers similar courses to LSE and is a top 10 university overall, which makes it a suitable alternative to LSE. If a student is choosing between Warwick and LSE, and the two are fairly matched academically, this accommodation comparison should help them see which is better and come to a decision. The data comes from https://warwick.ac.uk/services/accommodation/students/ugresidences-2023


In [24]:
#defining and parsing the url using BeautifulSoup and requests
url = "https://warwick.ac.uk/services/accommodation/students/ugresidences-2023"
response = requests.get(url)


soup2 = BeautifulSoup(response.text, "html.parser")

In [25]:
#initialise a list to store the data
ww_ug_accom_data = []

#process to extract bathroom information as well as the prices
ul_tags = soup2.find_all("ul", class_="fa-ul")


for ul in ul_tags:
    for li in ul.find_all("li"):


        if "bathroom" in li.text.lower():
            bathroom = li.text.strip()

            if "en suite" in bathroom.lower():
                bathroom = "P"
            elif "shared" in bathroom.lower():
                bathroom = "S"
            else:
                bathroom = "Unknown"

        if "£" in li.text.lower():
            raw_price = li.text.strip()
            price_string= re.search(r'£(\d+)', raw_price)
            price=int(price_string.group(1))


    ww_ug_accom_data.append([bathroom,price])

#extracting the names
name_links = soup2.find_all("a", class_="button-ug")
names = [link.get_text(strip=True) for link in name_links]
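
In the loop above, `bathroom` and `price` keep their values from the previous residence when a list has no matching item, so a gap on the page would silently repeat the earlier values. A minimal sketch of the same parsing rule with a reset per residence is shown below, working on plain list-item texts. The function name `parse_residence` and the sample texts are hypothetical.

```python
import re


def parse_residence(items):
    """Return [bathroom, price] for one residence from its list-item texts."""
    bathroom, price = None, None  # reset per residence so nothing carries over
    for text in items:
        lowered = text.lower()
        if "bathroom" in lowered:
            if "en suite" in lowered:
                bathroom = "P"
            elif "shared" in lowered:
                bathroom = "S"
            else:
                bathroom = "Unknown"
        price_match = re.search(r"£(\d+)", text)
        if price_match:
            price = int(price_match.group(1))
    return [bathroom, price]


# Hypothetical list-item texts, for illustration only
print(parse_residence(["En suite bathroom", "£221 per week"]))
print(parse_residence(["Shared bathroom", "From £162"]))
print(parse_residence(["Kitchen shared between 8"]))
```

With this shape, a residence that lacks a bathroom or price line yields `None`, which the `isna()` check would then report.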

In [26]:
#putting all the data into a pandas dataframe

ww_ug_accom_df= pd.DataFrame(ww_ug_accom_data,columns=["bathroom_type","price/week"])

ww_ug_accom_df["name"]=names[:-1]

ww_ug_accom_df.head()

Unnamed: 0,bathroom_type,price/week,name
0,P,221,Arthur Vick
1,P,233,Bluebell
2,S,162,Claycroft
3,P,215,Cryfield Apartments
4,S,117,Cryfield Standard


In [27]:
#checking to see if there is missing data
ww_ug_accom_df.isna().sum()

bathroom_type    0
price/week       0
name             0
dtype: int64

We've extracted the UG accommodation details; now we will extract the PG accommodation details.

In [28]:
#defining and parsing the url using BeautifulSoup and requests
url = "https://warwick.ac.uk/services/accommodation/students/pgresidences-2023"
response = requests.get(url)


soup3 = BeautifulSoup(response.text, "html.parser")

In [29]:

#initialise a list to store the data
ww_pg_accom_data = []

#process to extract bathroom information as well as the prices
ul_tags = soup3.find_all("ul", class_="fa-ul")


for ul in ul_tags:
    for li in ul.find_all("li"):


        if "bathroom" in li.text.lower():
            bathroom = li.text.strip()

            if "en suite" in bathroom.lower():
                bathroom = "P"
            elif "shared" in bathroom.lower():
                bathroom = "S"
            else:
                bathroom = "Unknown"

        if "£" in li.text.lower():
            raw_price = li.text.strip()
            price_string= re.search(r'£(\d+)', raw_price)
            price=int(price_string.group(1))


    ww_pg_accom_data.append([bathroom,price])

#extracting the names
name_links = soup3.find_all("a", class_="button-pg")
names = [link.get_text(strip=True) for link in name_links]

In [30]:

#putting all the data into a pandas dataframe

ww_pg_accom_df= pd.DataFrame(ww_pg_accom_data,columns=["bathroom_type","price/week"])

ww_pg_accom_df["name"]=names

ww_pg_accom_df.head()

Unnamed: 0,bathroom_type,price/week,name
0,P,244,Benefactors
1,S,162,Claycroft
2,P,215,Cryfield Apartments
3,P,260,Cryfield Studios
4,P,210,Cryfield Townhouses


In [31]:
#checking to see if there is missing data
ww_pg_accom_df.isna().sum()

bathroom_type    0
price/week       0
name             0
dtype: int64

#### Convert dataframes into csv files stored under the data folder

In [32]:
import os

if not os.path.exists('data'):
    os.makedirs('data')

ww_ug_accom_df.to_csv("data/ww_undergrad_accom.csv", index=False)
lse_private_accom_df_clean.to_csv("data/lse_private_accom_df_clean.csv", index=False)
ww_pg_accom_df.to_csv("data/ww_postgrad_accom.csv", index=False)
df_lse.to_csv('data/df_lse.csv', index=False)
df_kings.to_csv('data/df_kings.csv', index=False)
