# Instructions

As a data scientist, you've found a data treasure on TampereBNB - the top online platform for short-term housing rentals in the city of Tampere. From thousands of listings, you can extract valuable insights to inform decision-making, marketing, and research. Your goal is to scrape and clean this data to enable further analysis.

## Accessing the HTML dataset
The quickest way to get HTML data is to save the HTML file to your computer manually. However, a web page's HTML changes over time, and scraping code can break easily when a website is redesigned, which makes scraping brittle and not recommended for long-lived projects. To ease access and manage the traffic, we have already scraped the TampereBNB website and stored the pages in the `data/accom` folder. Please use the HTML files provided and treat them as if you had saved them yourself. I recommend opening the HTML files in your preferred text editor to inspect how the website structures its information.

## TODO

- Extract the following information:
    - Region
    - Price of the accommodation
    - Apartment type
    - Size in square meters (m2)
    - Apartment floor
    - Construction year
    - Apartment status
    - The availability of an elevator
    - Longitude
    - Latitude
- Change the data format, if necessary, of:
    - Floor
    - Size
    - Construction
    - Longitude
    - Latitude
- Save the data frame in a pickle file

Screenshot of how the dataframe is expected to look:

![df_screenshot](./data/images/df_screenshot.png)

## Notes

<div class="alert alert-block alert-danger">
<b>Do not:</b> change the Jupyter Notebook file's name.
</div>

<div class="alert alert-block alert-danger">
<b>Use:</b> the given list as column names.
</div>

<div class="alert alert-block alert-danger">
<b>Save:</b> the dataframe as a pickle file with the name "TampereBNB.pkl".
</div>

<div class="alert alert-block alert-info">
<b>Tip:</b> Please keep in mind that although this template is designed in a linear style, generally, data wrangling is an iterative process.
</div>


# Reminder

This section is devoted to refreshing your memory on prerequisites to complete the assignment. 

## How does web scraping work? 
Website data is written in HTML (HyperText Markup Language), which uses tags to structure the page. Because HTML and its tags are just text, the text can be accessed using parsers. We'll be using a Python parser called [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/).

The following script can be used to download HTML files programmatically. **However, please note that this repository comes with an offline copy of the files, so there is no need to run this script.**

```python
import random
import time

import requests

for listing_index in range(0, 1080):
    print("getting the %1d url:" % listing_index)
    url = 'https://infotuni.github.io/tamperebnb/accom/M20%1d23' % listing_index
    print(url)
    page = requests.get(url)
    fname = 'data/accom/M20%1d23.html' % listing_index
    print(fname)
    with open(fname, 'wb') as f:
        f.write(page.content)
    # pause 1-3 seconds between requests to avoid overloading the server
    time.sleep(1 + random.random() * 2)
```

## HTML file structure

The Hypertext Markup Language (HTML) is the language used to create documents for the World Wide Web. You can use [W3Schools](https://www.w3schools.com/html/default.asp) to refresh your memory.

An HTML element is everything from the start tag to the end tag: <br />

`<opening tag> content...</closing tag>`

### HTML Elements
#### Heading
elements are used for section headings.

```
<h1>Heading 1</h1>
<h2>Heading 2</h2>
<h3>Heading 3</h3>
<h4>Heading 4</h4>
<h5>Heading 5</h5>
<h6>Heading 6</h6>
```

#### Paragraph
elements are used for standard blocks of text

```
<p>This is just a block of text.</p>
```

#### Span
elements are used to group text within another block of text, often for styling.
```
<p>This block of text has a <span>element</span> inside it.</p>
```

#### Image
elements are used to embed images in a web page
```
<img src="image-file.jpg" alt="text that describes the image" />
```

### Trees represent hierarchical data

Web developers use trees to represent the data that makes up websites. Elements belong to each other or are descended from one another.


![Tree](./data/images/trees.png)

We can create a tree structure in HTML by putting elements inside other elements. To do this we often use a `<div>` element as a container. `<div>` elements are used to group chunks of content together.

```
<div>
    <h1>This is a heading.</h1>
    <p>This is a paragraph.</p>
</div>
```

![HTML&Tree](./data/images/HTML&Tree.png)
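
As a rough illustration of this tree, the sketch below parses the `<div>` snippet above with Beautiful Soup, lists the direct children of the container, and walks from the paragraph back up to its parent. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div>
    <h1>This is a heading.</h1>
    <p>This is a paragraph.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

# find_all(recursive=False) returns only the direct child tags of the <div>
child_names = [child.name for child in div.find_all(recursive=False)]
print(child_names)

# Every tag knows its parent, so we can walk back up the tree
parent_name = soup.find("p").parent.name
print(parent_name)
```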


## Using Beautiful Soup

#### Use the find and select methods
`find()` is one of the most popular Beautiful Soup methods. It is similar to the find feature in a text editor and returns the first tag that matches. <br/>
`select_one()` returns only the first tag that matches a CSS selector. <br />
To find tags by class you can use:
* `{'class': 'class_name'}` as the second argument of `find()`
* `class_='class_name'` as a keyword argument of `find()`
* `.class_name` as a CSS selector for `select_one()`
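
The three forms can be compared side by side in a minimal sketch. The HTML string and the class name `price` are made up for illustration and are not taken from the TampereBNB pages.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; the class name "price" is an assumption
html = '<div class="listing"><span class="price">120 EUR per night</span></div>'
soup = BeautifulSoup(html, "html.parser")

by_attrs = soup.find("span", {"class": "price"})   # attribute dictionary
by_keyword = soup.find("span", class_="price")     # class_ keyword argument
by_selector = soup.select_one(".price")            # CSS selector

# All three calls locate the same tag
print(by_attrs.text, by_keyword.text, by_selector.text, sep=" | ")
```

The dictionary and keyword forms belong to `find()`, while the dotted form is CSS selector syntax and therefore goes with `select_one()`.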

#### Example:
The `find` method takes the name of the tag to search for:
```
soup.find('tag_name')
```

When we apply this to get the title of our webpage:
```
soup.find('title')
```

We get the title element of the webpage with its opening and closing tags:

```
<title>Tampere BNB</title>
```

To get the webpage title only, we can use `.text`:
```
soup.find('title').text
```

which gives us:
```
'Tampere BNB'
```


## Installing packages

Using conda:
```
conda install <package name>
```

Using pip:
```
pip install <package name>
```

Using pip within Jupyter Notebook:
```
!pip install <package name>
```

# Gather

In [None]:
# load packages
import os  # interact with the operating system (paths, working directory)
import glob  # Unix-style pathname pattern matching

import re  # check whether a given string matches a given pattern
from bs4 import BeautifulSoup  # pull data out of HTML and XML files

import pandas as pd  # work with tabular data in DataFrames

In [None]:
#check current directory
path = os.getcwd()
print("The current working directory is %s" % path)

In [None]:
#reading files
indices = glob.glob('./data/accom/*.html')
len(indices)

In [None]:
#collecting the file names from the local folder
local_files = list()
for fname in glob.glob('./data/accom/*.html'):
    # os.path.basename strips the directory part on any operating system
    local_files.append(os.path.basename(fname))
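
`glob.glob` returns paths in arbitrary order, and splitting on `/` only works for POSIX-style paths. A minimal sketch, using a temporary folder with made-up file names instead of `data/accom`, shows how `sorted` and `os.path.basename` could give a stable, portable list of file names:

```python
import glob
import os
import tempfile

# Build a throwaway folder with a few empty files (hypothetical names)
with tempfile.TemporaryDirectory() as folder:
    for name in ["M20223.html", "M20023.html", "M20123.html", "notes.txt"]:
        open(os.path.join(folder, name), "w").close()

    # sorted() makes the order reproducible between runs and machines
    paths = sorted(glob.glob(os.path.join(folder, "*.html")))

    # os.path.basename strips the directory part on any operating system
    file_names = [os.path.basename(p) for p in paths]

print(file_names)
```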

In [None]:
# Please use the following list as the column names
column = ["Region", "Price", "Type", "Size", "Floor", "Construction", "Condition", "Elevator", "Longitude", "Latitude"]

In [None]:
#creating dictionary of lists with pre-specified columns/keys
accom = dict()
for col in column:
    accom[col] = list()

In [None]:
for fname in indices:
    with open(fname) as f:
        content = f.read()
    soup = BeautifulSoup(content, 'html.parser')
    page_content = soup.find('div', {'class': 'container'})
    #getting the price
    price = page_content.select_one('.basic-info .profile_price').text
    price = price.split(" ")[0].strip()
    accom["Price"].append(price)

    # TODO: get all info on space
    
    # TODO: get the region
    region = page_content.select_one('.about_space_basic .about_space_region').text
    region = region.split(" ")[-1].strip()
    accom["Region"].append(region)
    
    
    # TODO: get the apartment type
    # named apt_type to avoid shadowing the built-in type()
    apt_type = page_content.select_one('.about_space_basic .about_space_type').text
    apt_type = apt_type.split(":")[-1].strip()
    accom["Type"].append(apt_type)
    

    # TODO: get the apartment size
    size = page_content.select_one('.about_space_basic .about_space_area').text
    size = size.split(":")[-1].strip().split(" ")[0].strip()
    accom["Size"].append(size)
    
    # TODO: get the apartment floor
    floor = page_content.select_one('.about_space_basic .about_space_floor').text
    floor = floor.split(":")[-1].strip().split(" ")[4].strip()
    accom['Floor'].append(floor)
    
    
    # TODO: get the apartment construction year
    construction = page_content.select_one('.about_space_basic .about_space_year').text
    construction = construction.split(":")[-1].strip().split(" ")[7]
    accom['Construction'].append(construction)
    

    # TODO: get the apartment status
    condition = page_content.select_one('.about_space_basic .Guest.about_space_condition').text
    condition = condition.split(":")[-1].strip().split(" ")[13].strip()
    accom['Condition'].append(condition)
    

    # TODO: get the availability of the elevator
    elevator = page_content.select_one('.offers').text
    elevator = elevator.split(" ")[8]
    accom['Elevator'].append(elevator)
    
    # TODO: get the longitude
    longitude = page_content.select_one('.map').text
    longitude = longitude.split("longitude")[1].strip().split(" ")[1]
    accom['Longitude'].append(longitude)
    

    # TODO: get the latitude
    latitude = page_content.select_one('.map').text
    latitude = latitude.split("latitude")[1].strip().split(" ")[1].rstrip('.')
    accom['Latitude'].append(latitude)
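
Picking a value by a fixed position such as `split(" ")[13]` depends on the exact whitespace in each file, and `select_one` returns `None` when a selector does not match. A more defensive variant could look like the sketch below. The helper name `extract_value` and the sample HTML are hypothetical; only the class names are taken from the loop above, and the real pages may be laid out differently.

```python
from bs4 import BeautifulSoup


def extract_value(parent, selector):
    """Return the text after the last ':' in the selected tag, or None."""
    tag = parent.select_one(selector)
    if tag is None:  # the selector did not match anything
        return None
    # " ".join(text.split()) collapses newlines and repeated spaces
    text = " ".join(tag.get_text().split())
    return text.split(":")[-1].strip()


# Made-up snippet that mimics a "label: value" layout
html = """
<div class="about_space_basic">
    <p class="about_space_floor">Floor:
        3</p>
    <p class="about_space_year">Construction year:   1987</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

floor = extract_value(soup, ".about_space_basic .about_space_floor")
year = extract_value(soup, ".about_space_basic .about_space_year")
missing = extract_value(soup, ".about_space_basic .about_space_area")
print(floor, year, missing)
```

Returning `None` for a missing field keeps all lists in `accom` the same length, so the gap shows up later as a null value instead of stopping the loop with an exception.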
    


In [None]:
# for key in accom:
#     print(key, len(accom[key]))

In [None]:
#constructing a DataFrame from a dict
df = pd.DataFrame.from_dict(accom)
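
`pd.DataFrame.from_dict` needs every list in the dictionary to have the same length, which is why it can help to check the lengths first. A small sketch with invented values (not taken from the data set):

```python
import pandas as pd

# Two columns of a toy dictionary of lists (values are invented)
toy = {"Region": ["Hervanta", "Kaleva"], "Price": ["55", "70"]}

# Every key must hold the same number of items, one per listing
lengths = {key: len(values) for key, values in toy.items()}
assert len(set(lengths.values())) == 1, lengths

toy_df = pd.DataFrame.from_dict(toy)
print(toy_df.shape)
```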

# Assess

Assessing your data is the second step in data wrangling. When assessing, you're like a detective at work, inspecting your dataset for two things: data quality issues (i.e. content issues) and lack of tidiness (i.e. structural issues).
- Quality: issues with content. Low quality data is also known as dirty data.
- Tidiness: issues with structure that prevent easy analysis. Untidy data is also known as messy data. Tidy data requirements:
    - Each variable forms a column.
    - Each observation forms a row.
    - Each type of observational unit forms a table.
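
As a quick illustration of what a quality check can look like, the sketch below builds a toy data frame with invented rows and counts two common content issues, missing values and duplicated rows:

```python
import pandas as pd

# Invented rows: one missing price and one exact duplicate
toy = pd.DataFrame({
    "Region": ["Hervanta", "Kaleva", "Kaleva", "Keskusta"],
    "Price": ["55", "70", "70", None],
})

n_missing = toy["Price"].isnull().sum()   # quality issue: missing values
n_duplicates = toy.duplicated().sum()     # quality issue: repeated rows
print(n_missing, n_duplicates)
```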

<div class="alert alert-block alert-info">
<b>Tip:</b> This part is not graded and the code is provided; you can just run the cells and assess your dataset.
</div>


In [None]:
#getting the first five rows
df.head()

In [None]:
#getting the last five rows
df.tail()

In [None]:
# getting a summary of the data set
df.info()

In [None]:
#check null values in Price feature
df[df['Price'].isnull()]

In [None]:
#check duplicated values
df[df.duplicated()]

In [None]:
#getting summary statistics of the Construction feature
df[["Construction"]].describe()

In [None]:
#counting how often each unique value occurs in the Type feature
df["Type"].value_counts()

# Clean

In [None]:
#Make a copy of this object’s indices and data.
df_clean = df.copy()

In [None]:
#change the data format

#changing the data format of floor feature
df_clean["Floor"] = pd.to_numeric(df_clean["Floor"])
# TODO: change the data format of Size feature
df_clean["Size"] = pd.to_numeric(df_clean["Size"])

# TODO: change the data format of Longitude feature
df_clean["Longitude"] = pd.to_numeric(df_clean["Longitude"])

# TODO: change the data format of Latitude feature
df_clean["Latitude"] = pd.to_numeric(df_clean["Latitude"])

# TODO: change the data format of Construction feature
df_clean["Construction"] = pd.to_numeric(df_clean["Construction"])
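
By default `pd.to_numeric` raises an error when it meets a value it cannot parse. If the scraped columns might contain such values, one option is `errors="coerce"`, sketched here on an invented series rather than on the real data:

```python
import pandas as pd

# Invented values; "unknown" stands in for any non-numeric entry
sizes = pd.Series(["45", "62.5", "unknown"])

# errors="coerce" turns unparseable entries into NaN instead of raising
numeric_sizes = pd.to_numeric(sizes, errors="coerce")
print(numeric_sizes.dtype)
print(numeric_sizes.isna().sum())
```

The coerced entries then show up as null values in `df_clean.info()`, where they can be inspected before deciding how to handle them.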

In [None]:
#check changes
df_clean.info()

In [None]:
#save and load latest changes
df_clean.to_pickle("TampereBNB.pkl")
unpickled_df = pd.read_pickle("TampereBNB.pkl")
unpickled_df.info()
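
To confirm that nothing is lost when saving, the loaded frame can be compared with the original. A minimal sketch on a toy data frame, written to a temporary folder under the hypothetical name `toy.pkl` so that it does not touch `TampereBNB.pkl`:

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"Size": [45.0, 62.5], "Floor": [2, 5]})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "toy.pkl")  # hypothetical file name
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() compares values and dtypes, so True means nothing was lost
round_trip_ok = toy.equals(restored)
print(round_trip_ok)
```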