# Instructions

As a data scientist, you've found a data treasure on TampereBNB - the top online platform for short-term housing rentals in the city of Tampere. From thousands of listings, you can extract valuable insights to inform decision-making, marketing, and research. Your goal is to scrape and clean this data to enable further analysis.

## Accessing the HTML dataset
The quickest way to get HTML data is to save a page to your computer manually. However, a web page's HTML tends to change over time, and scraping code breaks easily when a site is redesigned, which makes scraping brittle and ill-suited to long-lived projects. To ease access and manage traffic, we have already scraped the TampereBNB website and stored the pages in the `data/accom` folder. Please use these provided HTML files and pretend that you saved them yourself. I recommend opening the HTML files in your preferred text editor to inspect how the website structures its information.

## TODO

- Students are expected to extract the following information:
    - Region
    - Price of the accommodation
    - Apartment type
    - Square meters m2
    - Apartment floor
    - Construction year
    - Apartment status
    - The availability of an elevator
    - Longitude
    - Latitude
- Change the data format, if necessary, of:
    - Floor
    - Size
    - Construction
    - Longitude
    - Latitude
- Save the data frame in a pickle file

Screenshot of how the dataframe is expected to look:

![df_screenshot](./data/images/df_screenshot.png)

## Notes

<div class="alert alert-block alert-danger">
<b>Do not:</b> change the Jupyter Notebook file's name.
</div>

<div class="alert alert-block alert-danger">
<b>Use:</b> the given list as column names.
</div>

<div class="alert alert-block alert-danger">
<b>Save:</b> the dataframe as a pickle file with the name "TampereBNB.pkl".
</div>

<div class="alert alert-block alert-info">
<b>Tip:</b> Please keep in mind that although this template is designed in a linear style, generally, data wrangling is an iterative process.
</div>


# Reminder

This section is devoted to refreshing your memory on prerequisites to complete the assignment. 

## How does web scraping work? 
Website data is written in HTML (HyperText Markup Language), which uses tags to structure the page. Because HTML and its tags are just text, they can be read with a parser. We'll be using a Python parser called [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/).

The following script can be used to download HTML files programmatically. **However, please note that this repository comes with an offline copy of the files, so there is no need to run this script.**

```python
import random
import time

import requests

for listing_index in range(0, 1080):
    print("getting the %1d url:" % listing_index)
    url = 'https://infotuni.github.io/tamperebnb/accom/M20%1d23' % listing_index
    print(url)
    page = requests.get(url)
    fname = 'data/accom/M20%1d23.html' % listing_index
    print(fname)
    # write the raw bytes so the page is stored exactly as served
    with open(fname, 'wb') as f:
        f.write(page.content)
    # pause 1-3 seconds between requests to avoid overloading the server
    time.sleep(1 + random.random() * 2)
```
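
The `%1d` placeholder in the script only sets a minimum field width of one character, so the listing index is inserted without zero padding. The following is a small sketch of the file names this produces; the helper name `listing_filename` is made up for illustration and is not part of the assignment.

```python
def listing_filename(listing_index):
    """Build the local file name for a listing (hypothetical helper)."""
    return 'data/accom/M20%1d23.html' % listing_index


print(listing_filename(0))     # data/accom/M20023.html
print(listing_filename(5))     # data/accom/M20523.html
print(listing_filename(1079))  # data/accom/M20107923.html
```

Because the index is not padded, the file names have different lengths, which is worth keeping in mind if you ever sort them alphabetically.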

## HTML file structure

The Hypertext Markup Language (or HTML) is the language used to create documents for the World Wide Web. You can use [W3Schools](https://www.w3schools.com/html/default.asp) to refresh your memory.

The HTML element is everything from the start tag to the end tag: <br />

`<opening tag> content...</closing tag>`

### HTML Elements
#### Heading
elements are used for section headings.

```
<h1>Heading 1</h1>
<h2>Heading 2</h2>
<h3>Heading 3</h3>
<h4>Heading 4</h4>
<h5>Heading 5</h5>
<h6>Heading 6</h6>
```

#### Paragraph
elements are used for standard blocks of text.

```
<p>This is just a block of text.</p>
```

#### Span
elements are used to group text within another block of text, often for styling.

```
<p>This block of text has a <span>element</span> inside it.</p>
```

#### Image
elements are used to embed images in a web page.

```
<img src="image-file.jpg" alt="text that describes the image" />
```

### Trees represent hierarchical data

Web developers use trees to represent the data that makes up websites. Elements belong to each other or are descended from one another.


![Tree](./data/images/trees.png)

We can create a tree structure in HTML by putting elements inside other elements. To do this we often use a `<div>` element as a container. `<div>` elements are used to group chunks of content together.

```
<div>
    <h1>This is a heading.</h1>
    <p>This is a paragraph.</p>
</div>
```

![HTML&Tree](./data/images/HTML&Tree.png)
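
The nesting in the `<div>` example can also be made visible in code. The following is a minimal sketch using Python's built-in `html.parser` module rather than Beautiful Soup; the class name `TreePrinter` is made up for illustration. It only works for snippets like this one, where every opening tag has a closing tag.

```python
from html.parser import HTMLParser


class TreePrinter(HTMLParser):
    """Collect each opening tag, indented by its depth in the tree."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # record the tag at the current depth, then step one level down
        self.lines.append("    " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # a closing tag takes us back up one level
        self.depth -= 1


html = """
<div>
    <h1>This is a heading.</h1>
    <p>This is a paragraph.</p>
</div>
"""

printer = TreePrinter()
printer.feed(html)
print("\n".join(printer.lines))
```

The output lists `div` at the top level with `h1` and `p` indented beneath it, which mirrors the parent and child relationship in the figure.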


## Using Beautiful Soup

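#### Using the find and select methods
`find()` is one of the most popular Beautiful Soup methods. It is similar to the find feature in a text editor and returns the first tag that matches. <br/>
`select_one()` returns only the first tag that matches a CSS selector. <br />
To match on a class, you can use any of the following (the first two with `find()`, the last with `select_one()`):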
* `{'class': 'class_name'}`
* `class_='class_name'`
* `.class_name`
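
As a minimal sketch, the three notations can be compared on a small HTML string. The HTML and the class name `type` are made up for illustration and do not come from the TampereBNB pages.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; the class name "type" is an assumption
html = '<div><p class="type">Two rooms</p><p>Other text</p></div>'
soup = BeautifulSoup(html, 'html.parser')

by_dict = soup.find('p', {'class': 'type'})   # attribute dictionary
by_keyword = soup.find('p', class_='type')    # class_ keyword argument
by_selector = soup.select_one('.type')        # CSS selector

# All three calls return the same element
print(by_dict.text)
print(by_dict == by_keyword == by_selector)
```

The keyword is spelled `class_` with a trailing underscore because `class` is a reserved word in Python.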

#### Example:
In its simplest form, `find()` takes the name of the tag to look for as a string:
```
soup.find('text_string')
```

When we apply this to get the title of our webpage:
```
soup.find('title')
```

We get the title element of the webpage with its opening and closing tags:

```
<title>Tampere BNB</title>
```

To get the webpage title only, we can use `.text`:
```
soup.find('title').text
```

which gives us:
```
'Tampere BNB'
```


## Installing packages

Using conda:
```shell
conda install <package name>
```

Using pip:
```shell
pip install <package name>
```

Using pip within Jupyter Notebook:
```python
!pip install <package name>
```

# Gather

In [77]:
# load packages
import os  # interact with the operating system (paths, working directory)
import glob  # find files with Unix-style pathname patterns

import re  # check whether a given string matches a given pattern
from bs4 import BeautifulSoup  # pull data out of HTML and XML files

import pandas as pd  # work with tabular data in DataFrames

In [78]:
# check the current working directory
path = os.getcwd()
print("The current working directory is %s" % path)

The current working directory is c:\Users\kalle\Documents\Ohjelmointi\joda2024\assignment\5.2-scraper


In [79]:
#reading files
indices = glob.glob('./data/accom/*.html')
len(indices)

1080

In [80]:
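# collect the bare file names of the local HTML files
local_files = list()
for fname in glob.glob('./data/accom/*.html'):
    # os.path.basename is separator-aware, unlike fname.split('/') on Windows
    local_files.append(os.path.basename(fname))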
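
Taking the file name off a path depends on the path separator, which differs between operating systems. The following is a small sketch using the standard-library `ntpath` and `posixpath` modules, which apply Windows and POSIX rules regardless of the machine you run them on. The two example paths are illustrative assumptions about what `glob` returns on each system.

```python
import ntpath      # Windows path rules, importable on any OS
import posixpath   # POSIX path rules

# Illustrative paths (assumed examples, not read from disk)
windows_style = './data/accom\\M20023.html'
posix_style = './data/accom/M20023.html'

print(windows_style.split('/')[-1])     # accom\M20023.html
print(ntpath.basename(windows_style))   # M20023.html
print(posixpath.basename(posix_style))  # M20023.html
```

In a notebook, `os.path.basename` picks the right set of rules for the current operating system automatically.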

In [81]:
# Please use the following list as column names
column = ["Region", "Price", "Type", "Size", "Floor", "Construction", "Condition", "Elevator", "Longitude", "Latitude"]

In [82]:
#creating dictionary of lists with pre-specified columns/keys
accom = dict()
for col in column:
    accom[col] = list()

In [83]:
# Create a helper function for extracting the first number (integer or decimal) from text
def extract_first_num(text):
    try:
        return re.search(r'-?\d+(\.\d+)?', text).group()
    except (AttributeError, TypeError):
        # no number found, or text was not a string
        return None
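
To see what the helper returns, here is a self-contained sketch with a few sample strings. The sample sentences are made up for illustration and are not taken from the listings; the function body is an equivalent rewrite that checks the match object instead of catching an exception.

```python
import re


def extract_first_num(text):
    """Return the first integer or decimal number in text as a string, or None."""
    match = re.search(r'-?\d+(\.\d+)?', text)
    return match.group() if match else None


# The sample strings below are made up for illustration
print(extract_first_num("The apartment is 43.5 square meters."))  # 43.5
print(extract_first_num("Built in 1975, renovated in 2010"))      # 1975
print(extract_first_num("Ground floor"))                          # None
```

Note that the result is still a string, which is why the columns need a type conversion later in the Clean step.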

In [84]:
for fname in indices:
    with open(fname) as f:
        content = f.read()
    soup = BeautifulSoup(content, 'html.parser')
    page_content = soup.find('div', {'class': 'container'})

    # get the price
    price = page_content.select_one('.basic-info .profile_price')
    if (price):
        price = price.text.split(" ")[0].strip()
        accom["Price"].append(price)
    else:
        accom["Price"].append(None)

    # get the region
    elements = page_content.select_one('.about_space_region p')
    # Check if any elements were found
    if elements:
        first_p_text = elements.text
        # Extracting the last word from the selected <p> tag's text
        last_word = first_p_text.split()[-1].rstrip('.')
        accom["Region"].append(last_word)
    else:
        accom["Region"].append(None)
 
    # get the apartment type (named apt_type to avoid shadowing the built-in type)
    apt_type = page_content.select_one('.about_space_type p')
    if apt_type:
        accom["Type"].append(apt_type.text)
    else:
        accom["Type"].append(None)

    # get the apartment size
    size = page_content.select_one('.about_space_area p')
    if (size):
        accom["Size"].append(extract_first_num(size.text))
    else:
        accom["Size"].append(None)

    # get the apartment floor
    floor = page_content.select_one('.about_space_floor p')
    if (floor):
        accom["Floor"].append(extract_first_num(floor.text))
    else:
        accom["Floor"].append(None)

    # get the apartment construction year
    year = page_content.select_one('.about_space_year p')
    if (year):
        accom["Construction"].append(extract_first_num(year.text))
    else:
        accom["Construction"].append(None)

    # get the apartment status
    status = page_content.select_one('.about_space_condition p')
    if (status):
        match = re.search(r'\b(\w+)\s+condition\b', status.text)
        if match:
            accom["Condition"].append(match.group(1))
        else:
            accom["Condition"].append(None)
    else:
        accom["Condition"].append(None)

    # get the availability of the elevator
    elevator = page_content.select_one('.offers tr')
    if (elevator):
        if ("does have an elevator" in elevator.text):
            result = "Yes"
        else:
            result = "No"
        accom["Elevator"].append(result)
    else:
        accom["Elevator"].append(None)

    # get the longitude
    details = page_content.select('.map p')
    for item in details:
        # Search for the pattern in the returned html
        match = re.search(r"longitude of ([\d.]+)", item.text)
        if match:
            longitude = match.group(1).rstrip('.')
            accom["Longitude"].append(longitude)
            break
    else:
        # no <p> matched (or none was found): append exactly one placeholder
        accom["Longitude"].append(None)

    # get the latitude
    details = page_content.select('.map p')
    for item in details:
        # Search for the pattern in the returned html
        match = re.search(r"latitude of ([\d.]+)", item.text)
        if match:
            latitude = match.group(1).rstrip('.')
            accom["Latitude"].append(latitude)
            break
    else:
        # no <p> matched (or none was found): append exactly one placeholder
        accom["Latitude"].append(None)
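
When several `<p>` elements are scanned for a pattern, exactly one value should be appended per listing so that every column keeps the same length. One way to guarantee this is Python's `for ... else`, where the `else` branch runs only if the loop finished without `break`. The following is a minimal sketch; the helper name `first_match` and the sample paragraphs are made up for illustration.

```python
import re


def first_match(texts, pattern):
    """Return group 1 of the first text matching pattern, else None (hypothetical helper)."""
    for text in texts:
        match = re.search(pattern, text)
        if match:
            result = match.group(1).rstrip('.')
            break
    else:
        # runs only if the loop finished without break
        result = None
    return result


# Made-up paragraph texts for illustration
paragraphs = [
    "A quiet neighbourhood close to the lake.",
    "Located at a longitude of 23.6966 and a latitude of 61.5242.",
]
print(first_match(paragraphs, r"longitude of ([\d.]+)"))  # 23.6966
print(first_match(paragraphs, r"latitude of ([\d.]+)"))   # 61.5242
print(first_match([], r"longitude of ([\d.]+)"))          # None
```

The `rstrip('.')` call matters for the latitude here: the character class `[\d.]` also matches the full stop that ends the sentence.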


In [85]:
for (key, value) in accom.items():
    print(key, len(value))

print()

# Print the first 20 elements of each list
for (key, value) in accom.items():
    print(key, value[:20])

Region 1080
Price 1080
Type 1080
Size 1080
Floor 1080
Construction 1080
Condition 1080
Elevator 1080
Longitude 1080
Latitude 1080

Region ['Niemenranta', 'Leinola', 'Härmälänranta', 'Amuri', 'Kissanmaa', 'Tampella', 'Rantaperkiö', 'Tampella', 'Annala', 'Linnainmaa', 'Hervantajärvi', 'Keskusta', 'Santalahti', 'Hakametsä', 'Kaukajärvi', 'Hämeenpuisto', 'Kaleva', 'Hervanta', 'Halkoniemi', 'Hervantajärvi']
Price ['€300', '€255', '€360', '€432', '€153', '€396', '€282', '€366', '€177', '€174', '€246', '€354', '€297', '€351', '€186', '€366', '€246', '€138', '€273', '€219']
Type ['Two rooms', 'Three rooms', 'Two rooms', 'Two rooms', 'Studio apartment', 'Two rooms', 'Two rooms', 'Two rooms', 'Studio apartment', 'Studio apartment', 'Two rooms', 'Three rooms', 'Two rooms', 'Three rooms', 'Three rooms', 'Two rooms', 'Studio apartment', 'Three rooms', 'Two rooms', 'Two rooms']
Size ['50.0', '80.0', '54.0', '48.0', '28.0', '52.0', '51.0', '43.5', '28.0', '27.5', '48.0', '64.0', '64.0', '63.0', '63.0

In [86]:
#constructing a DataFrame from a dict
df = pd.DataFrame.from_dict(accom)

# Assess

Assessing your data is the second step in data wrangling. When assessing, you're like a detective at work, inspecting your dataset for two things: data quality issues (i.e. content issues) and lack of tidiness (i.e. structural issues).
- Quality: issues with content. Low quality data is also known as dirty data.
- Tidiness: issues with structure that prevent easy analysis. Untidy data is also known as messy data. Tidy data requirements:
    - Each variable forms a column.
    - Each observation forms a row.
    - Each type of observational unit forms a table.

<div class="alert alert-block alert-info">
<b>Tip:</b> This part is not graded and the code is provided; you can just run the cells and assess your dataset.
</div>


In [87]:
# getting the first five rows of the DataFrame
df.head()

| | Region | Price | Type | Size | Floor | Construction | Condition | Elevator | Longitude | Latitude |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Niemenranta | €300 | Two rooms | 50.0 | 2 | 2020 | good | Yes | 23.69660557450159 | 61.52426921143939 |
| 1 | Leinola | €255 | Three rooms | 80.0 | 0 | 1975 | unknown | No | 23.910149658010493 | 61.48946062565256 |
| 2 | Härmälänranta | €360 | Two rooms | 54.0 | 2 | 2022 | unknown | No | 23.72300425566757 | 61.47551363767057 |
| 3 | Amuri | €432 | Two rooms | 48.0 | 1 | 2023 | unknown | Yes | 23.741643263744766 | 61.49968017394604 |
| 4 | Kissanmaa | €153 | Studio apartment | 28.0 | 3 | 1959 | good | No | 23.82277363783996 | 61.50021477921354 |


In [88]:
# getting the last five rows of the DataFrame
df.tail()

| | Region | Price | Type | Size | Floor | Construction | Condition | Elevator | Longitude | Latitude |
|---|---|---|---|---|---|---|---|---|---|---|
| 1075 | Härmälänranta | €330 | Two rooms | 54.0 | 7 | 2018 | good | Yes | 23.727866061687656 | 61.47573308469892 |
| 1076 | Amuri | €324 | Two rooms | 37.0 | 3 | 2023 | unknown | Yes | 23.74267272233312 | 61.49705474306858 |
| 1077 | Rantaperkiö | €372 | Three rooms | 86.5 | 1 | 2006 | good | No | 23.75351047076308 | 61.47293778780631 |
| 1078 | Keskusta | €225 | Studio apartment | 48.0 | 4 | 1929 | satisfactory | No | 24.058603556628885 | 61.46460833687141 |
| 1079 | Ratina | €396 | Two rooms | 44.0 | 5 | 2011 | good | Yes | 23.767109944707475 | 61.495878988285774 |


In [89]:
# getting a summary of the data set
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1080 entries, 0 to 1079
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Region        1080 non-null   object
 1   Price         1080 non-null   object
 2   Type          1080 non-null   object
 3   Size          1080 non-null   object
 4   Floor         1080 non-null   object
 5   Construction  1080 non-null   object
 6   Condition     1080 non-null   object
 7   Elevator      1080 non-null   object
 8   Longitude     1080 non-null   object
 9   Latitude      1080 non-null   object
dtypes: object(10)
memory usage: 84.5+ KB


In [90]:
#check null values in Price feature
df[df['Price'].isnull()]

| | Region | Price | Type | Size | Floor | Construction | Condition | Elevator | Longitude | Latitude |
|---|---|---|---|---|---|---|---|---|---|---|


In [91]:
#check duplicated values
df[df.duplicated()]

| | Region | Price | Type | Size | Floor | Construction | Condition | Elevator | Longitude | Latitude |
|---|---|---|---|---|---|---|---|---|---|---|


In [92]:
# getting summary statistics of the Construction feature
df[["Construction"]].describe()

| | Construction |
|---|---|
| count | 1080 |
| unique | 99 |
| top | 2022 |
| freq | 92 |


In [93]:
# counting how often each unique value of the apartment Type feature occurs
df["Type"].value_counts()

Type
Two rooms             463
Three rooms           262
Studio apartment      204
Four rooms or more    151
Name: count, dtype: int64

# Clean

In [94]:
#Make a copy of this object’s indices and data.
df_clean = df.copy()

In [95]:
# change the data format of Floor feature
df_clean["Floor"] = pd.to_numeric(df_clean["Floor"])

# change the data format of Size feature
df_clean["Size"] = pd.to_numeric(df_clean["Size"])

# change the data format of Longitude feature
df_clean["Longitude"] = pd.to_numeric(df_clean["Longitude"])

# change the data format of Latitude feature
df_clean["Latitude"] = pd.to_numeric(df_clean["Latitude"])

# change the data format of Construction feature
df_clean["Construction"] = pd.to_numeric(df_clean["Construction"])
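
`pd.to_numeric` chooses an integer or float dtype depending on the values it sees, and by default it raises an error when a value cannot be parsed. The following is a minimal sketch on made-up sample values, not the assignment data, showing both behaviours and the `errors="coerce"` option.

```python
import pandas as pd

# Made-up sample values; "n/a" stands in for a value that is not a number
sizes = pd.Series(["50.0", "43.5", "n/a"])
floors = pd.Series(["2", "0", "7"])

# errors="coerce" turns unparseable entries into NaN instead of raising
sizes_numeric = pd.to_numeric(sizes, errors="coerce")
floors_numeric = pd.to_numeric(floors)

print(sizes_numeric.dtype)         # float64
print(floors_numeric.dtype)        # int64
print(sizes_numeric.isna().sum())  # 1
```

Whether coercing is appropriate depends on the data: it hides parsing problems as missing values, so it is worth counting the resulting `NaN` entries afterwards.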

In [96]:
#check changes
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1080 entries, 0 to 1079
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Region        1080 non-null   object 
 1   Price         1080 non-null   object 
 2   Type          1080 non-null   object 
 3   Size          1080 non-null   float64
 4   Floor         1080 non-null   int64  
 5   Construction  1080 non-null   int64  
 6   Condition     1080 non-null   object 
 7   Elevator      1080 non-null   object 
 8   Longitude     1080 non-null   float64
 9   Latitude      1080 non-null   float64
dtypes: float64(3), int64(2), object(5)
memory usage: 84.5+ KB


In [97]:
#save and load latest changes
df_clean.to_pickle("TampereBNB.pkl")
unpickled_df = pd.read_pickle("TampereBNB.pkl")
unpickled_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1080 entries, 0 to 1079
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Region        1080 non-null   object 
 1   Price         1080 non-null   object 
 2   Type          1080 non-null   object 
 3   Size          1080 non-null   float64
 4   Floor         1080 non-null   int64  
 5   Construction  1080 non-null   int64  
 6   Condition     1080 non-null   object 
 7   Elevator      1080 non-null   object 
 8   Longitude     1080 non-null   float64
 9   Latitude      1080 non-null   float64
dtypes: float64(3), int64(2), object(5)
memory usage: 84.5+ KB
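
A pickle file stores the DataFrame together with its column dtypes, which is why the reloaded frame reports the same `info()` as before saving. The following is a minimal sketch of checking the round trip programmatically. The small frame and the file name `sample.pkl` are made up for illustration, and a temporary directory is used so that nothing is written next to the notebook.

```python
import os
import tempfile

import pandas as pd

# Small made-up frame standing in for df_clean
sample = pd.DataFrame(
    {"Region": ["Amuri", "Ratina"], "Size": [48.0, 44.0], "Floor": [1, 5]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.pkl")
    sample.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() compares values and dtypes column by column
print(restored.equals(sample))
print((restored.dtypes == sample.dtypes).all())
```

Both checks are expected to print `True`.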
