# 00 Setting the Stage & Getting the Data

> "Unless you try to do something beyond what you have already mastered, you will never grow." ~ Ronald E. Osborn

## Outline for this notebook  

1. What is Data Analytics? 🐍  
2. What is a Data Analyst?
3. What is the Data Analytics Cycle?
4. Small, Medium, and Big Data
5. Problem Definition
6. Data Gathering
7. Questions

In this section, we will cover two of the most important stages of any data analytics process: problem definition and data gathering. We will begin by talking about what data analytics is, what a data analyst is, and what the data analytics cycle looks like.

## 1. What is Data Analytics? 🐍

![Data Analytics Cycle](../images/4.png)

## 2. What is a Data Analyst?

![Data Analytics Cycle](../images/5.png)

![solving](../images/7.png)

As data detectives, we want to make sure we have at least a loosely defined outline of what our data projects might look like. In particular, we want to be careful with projects involving large amounts of data, since errors can be very time consuming at best and very expensive at worst.

For our task, we are sick and tired of COVID and want to start planning our next vacation. More specifically, we would love to scratch some countries off our bucket list. Since this can be quite costly, we should start by figuring out where we are going, where we are staying, and what kind of prices we are looking at if we decide to go there.

Since hotels are expensive, we thought we would give Airbnb a try.

![Gathering Data](../images/9.png)

We will be using data scraped by a tool called [Inside Airbnb](http://insideairbnb.com/index.html). Yes, we will be scraping a bit of data from the scraper itself. More specifically, we will take the skeleton of the website (its HTML), download it, and then extract all of the links that will help us get the data from it.

We will start by importing the following packages to help us get the data we need.

- `os`
- `pandas`
- `numpy`
- `requests`
- `bs4`
- `wget`
- `glob`
- `urllib`
- `dask`

In [None]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import os
import wget
import dask
import numpy as np
from glob import glob
import urllib.request # import the submodule explicitly, since we call urllib.request.urlretrieve later on

pd.options.display.max_columns = None

Since we will be creating several directories, the first thing we will do is assign the path of the directory where all of our data will live to a variable.

In [None]:
path = '../data'

We will also create a function that takes an existing path as a starting point, plus any number of additional directory names that we might want or need to create over the course of this tutorial. The function joins all of its arguments into a single path, creates that directory if it does not exist yet, and returns the path.

You might have already seen the `*args` parameter a few times while using Python. It gives us the ability to pass any number of positional arguments to a function without explicitly listing each one in the function's definition.
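
To make `*args` concrete, here is a minimal sketch. The helper `join_parts` is a made-up name used only for illustration; it is not part of the notebook.

```python
def join_parts(first, *args):
    # args is a tuple holding every extra positional argument (possibly empty)
    return "/".join((first, *args))

single = join_parts("data")
nested = join_parts("data", "japan_data", "raw_data")

print(single)
print(nested)
```

The same function accepts one argument or several, which is the behaviour we rely on when building directory paths.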

In [None]:
def check_or_add(old_path, *args):
    
    """
    This function will help us check whether one or more directories exists, and
    if they don't exist, it will create, combine, and return a new directory.
    """
        
    new_path = os.path.join(old_path, *args)
    
    # exist_ok=True makes this a no-op when the directory already exists
    os.makedirs(new_path, exist_ok=True)

    return new_path
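
A quick way to see what the function does is to run it against a throwaway temporary directory. The sketch below carries its own copy of `check_or_add` (written with `exist_ok=True`) so that it can run on its own; the directory names are just examples.

```python
import os
import tempfile

def check_or_add(old_path, *args):
    # join every piece into one path and create it if it is missing
    new_path = os.path.join(old_path, *args)
    os.makedirs(new_path, exist_ok=True)
    return new_path

with tempfile.TemporaryDirectory() as base:
    first = check_or_add(base, "japan_data", "raw_data")
    # calling it again on an existing directory neither fails nor changes anything
    second = check_or_add(base, "japan_data", "raw_data")
    created = os.path.isdir(first)
    same_path = first == second

print(created, same_path)
```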

We will use Python's `requests` module to send a request to Inside Airbnb, use our path creation function to create a directory called `raw_files`, and then save the HTML file there as text using a context manager.

In [None]:
web_data = requests.get('http://insideairbnb.com/get-the-data.html')

In [None]:
path_4_source = check_or_add(path, 'raw_files')

In [None]:
with open(os.path.join(path_4_source, 'insideairbnb.html'), 'w', encoding='utf-8') as html:
    html.write(web_data.text)

We will add the path to our new file, plus the name of the file we just saved, to a variable called `html_doc`. We will then read it back in as text and parse the document using `BeautifulSoup`.

In [None]:
html_doc = os.path.join(path_4_source, 'insideairbnb.html')
html_doc

In [None]:
with open(html_doc, 'r', encoding='utf-8') as file:
    soup = BeautifulSoup(file, 'html.parser')

`BeautifulSoup` will allow us to extract the links we need without much hassle. While we could figure out a way to get the exact links we need, maybe with a regular expression or a similar approach, we will simply extract every link at this stage by parsing the HTML file, and filter them afterwards. For this, we will use a Python list comprehension.

In [None]:
list_of_links = [link.get('href') for link in soup.find_all('a')]

In [None]:
list_of_links[:10]

In [None]:
print(f"We have {len(list_of_links)} links. Wow!")
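
As an aside, the same idea can be sketched with nothing but the standard library. This swaps `BeautifulSoup` for Python's built-in `html.parser` and runs on a tiny, made-up HTML snippet (the `example.com` URLs and the `LinkCollector` class are illustrative assumptions, not part of Inside Airbnb). It also shows why some entries come back empty: an `<a>` tag without an `href` yields `None`.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, keeping None when href is missing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.append(dict(attrs).get("href"))

sample_html = """
<a href="http://example.com/japan/listings.csv.gz">listings</a>
<a href="http://example.com/japan/reviews.csv.gz">reviews</a>
<a name="anchor-without-href">no link here</a>
"""

collector = LinkCollector()
collector.feed(sample_html)

# keep only the non-empty links that point at a listings file
listing_links = [link for link in collector.links if link and link.endswith("listings.csv.gz")]
print(collector.links)
print(listing_links)
```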

Notice that the files we need are the ones that end in `listings.csv.gz`. To extract them (or filter out those we don't want), we can take advantage of pandas' many string methods. Let's begin by converting our list into a pandas Series, which can be thought of as a one-dimensional array with a malleable index.

In [None]:
our_list = pd.Series(list_of_links, name='links')

our_list.head() # let's examine the first five rows of our new pandas Series

We now have to get rid of `NaN` values, grab the listings links, and filter out the links we don't need with a mask. We will also reset the index, simply because it is nice to have an index that starts from 0.

In [None]:
our_list.dropna(inplace=True) # drop NaN's and keep the changes

condition = our_list.str.endswith('/listings.csv.gz') # let's find the listings we need

files_we_want = our_list[condition].reset_index(drop=True) # filter out what we don't need and reset the index

files_we_want.head() # make sure everything went through as expected
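
The drop, mask, and reset pattern is easier to follow on a toy Series. The four entries below are made up for illustration (the `example.com` URLs are not real Inside Airbnb links).

```python
import pandas as pd

toy = pd.Series([
    "http://example.com/japan/listings.csv.gz",
    None,
    "http://example.com/japan/reviews.csv.gz",
    "http://example.com/belgium/listings.csv.gz",
], name="links")

clean = toy.dropna()                            # the missing entry is gone
mask = clean.str.endswith("/listings.csv.gz")   # boolean Series, one value per link
wanted = clean[mask].reset_index(drop=True)     # keep the True rows, renumber from 0

print(mask.tolist())
print(wanted)
```

Note that the mask keeps the index of `clean`, which is why it lines up row by row when we use it to filter.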

Now that we have the links we need, let's go ahead and examine how many we have.

In [None]:
files_we_want.shape

That's certainly still a lot of files to download (but at least it is not 20k), so how about we have a look at how many files we have per country and, where possible, per city? After all, we can imagine places such as the US, the UK, and Australia having multiple cities with people doing business through Airbnb.

In [None]:
countries = files_we_want.str.split('/').str.get(3)

unique_countries = countries.unique()
unique_countries

In [None]:
for country in unique_countries:
    print(f"{country.title()} has ------> {len(files_we_want[countries == country])}")

## Exercise 1

Find out how many cities are represented in our dataset, and print each city along with how many files we have for it. Name the list of unique cities `unique_cities`.

Answers below! Don't peek! 👀

In [None]:
cities = files_we_want.str.split('/').str.get(5)
unique_cities = cities.unique()
unique_cities

In [None]:
for city in unique_cities:
    print(f"{city.title()} has ------> {len(files_we_want[cities == city])}")
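
To see why positions 3 and 5 are the ones we ask for, it helps to split a single link by hand. The URL below is made up; it only assumes the same layout implied by the indices used above (country in position 3, city in position 5).

```python
url = "http://example.com/japan/kanto/tokyo/2020-01-01/data/listings.csv.gz"

# splitting on "/" leaves an empty string between the two slashes after "http:"
parts = url.split("/")
for position, part in enumerate(parts):
    print(position, repr(part))

country = parts[3]
city = parts[5]
```

`files_we_want.str.split('/').str.get(3)` performs this same split and lookup for every link in the Series at once.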

It is time for us to pick a country or city for our analysis.

### Let's pick some countries and cities to visit.

In [None]:
my_country = 'japan'
my_country2 = 'belgium'
my_country3 = 'germany'
my_city = 'cape-town'
my_city2 = 'sydney'
my_city3 = 'washington-dc'

If you forget the number of files available for each country or city while making your decision, you can check them individually with the following function. There is also a table with more information coming up soon.

In [None]:
def check_len_files(country_city):
    """Return how many files we have for a country or city, or None if it is not on our lists."""
    
    if country_city in unique_countries or country_city in unique_cities:
        
        # mask the links that mention the country or city
        condition = files_we_want.str.contains(country_city)
        
        return len(files_we_want[condition])
    
    print("Sorry, your country or city is not on the list or it was misspelled")

In [None]:
print(f"{my_country.title()} has {check_len_files(my_country)} files")
print(f"{my_country2.title()} has {check_len_files(my_country2)} files")
print(f"{my_country3.title()} has {check_len_files(my_country3)} files")
print(f"{my_city.title()} has {check_len_files(my_city)} files")
print(f"{my_city2.title()} has {check_len_files(my_city2)} files")
print(f"{my_city3.title()} has {check_len_files(my_city3)} files")

The following is one of the most important functions in the whole notebook as it is the one that is going to allow us to get the data we need into our computers.

The function takes in the following arguments:
- `urls` --> This is strictly a pandas series with the list of urls we need
- `country_city` --> This is the country or city you want to get data for
- `path_to_files` --> This is where the data will be downloaded to
- `country_city_unique` --> This is the list of countries or cities available in the data
- `unique_num` --> If you do not need all files, you can specify how many you need. Default is all

And it operates as follows:

1. It first checks whether the country you have picked is in the list of unique countries
2. Then it creates a boolean array (aka a mask)
3. Passes it through our pandas series containing the urls to filter out the countries you don't need
4. Then it downloads the files you want and
5. Saves them into a new folder it creates called `raw_data` in the path you provided

In [None]:
def get_me_specific_data(urls, country_city, path_to_files, country_city_unique, unique_num=None):
    
    """
    Download the files for one country or city into
    <path_to_files>/<country_city>_data/raw_data.
    
    Arguments:
    urls: pandas series with the links to filter
    country_city: country or city to get data for
    path_to_files: path where the files will be saved
    country_city_unique: iterable with the available countries or cities
    unique_num: maximum number of files to download. Default is all
    """
    
    # we first check that the country or city is on our list
    if country_city in country_city_unique:
        
        # check where it appears in our list of urls and create a mask
        condition = urls.str.contains(country_city)
        
        # we pass that mask to our pandas series
        data_we_need = urls[condition]
        
        # create a new directory for the raw data
        new_dir = check_or_add(path_to_files, country_city + '_data', 'raw_data')
        
        # if a number of files was specified, keep only that many links
        if unique_num:
            data_we_need = data_we_need.iloc[:unique_num]
        
        # iterate over the links we want
        for num, data in enumerate(data_we_need):
            
            file_name = os.path.join(new_dir, f'{country_city}_{num}.csv.gz')
            
            # we first try to download the file with wget
            try:
                wget.download(data, file_name)
            except Exception:
                # if wget doesn't work, we try with urllib
                try:
                    urllib.request.urlretrieve(data, file_name)
                except Exception:
                    # if urllib doesn't work either, we move on to the next one
                    continue
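
The try-one-tool-then-fall-back idea at the heart of this function can be isolated and tested without touching the network. The sketch below is only an illustration: `download_with_fallback`, `broken_downloader`, and `fake_downloader` are hypothetical names, and the two stand-in downloaders simulate a failure and a success.

```python
import os
import tempfile

def download_with_fallback(url, destination, downloaders):
    """Try each downloader in order; return the name of the one that worked, or None."""
    for downloader in downloaders:
        try:
            downloader(url, destination)
            return downloader.__name__
        except Exception as error:
            # report the failure and give the next downloader a chance
            print(f"{downloader.__name__} failed for {url}: {error}")
    return None

def broken_downloader(url, destination):
    raise OSError("simulated network failure")

def fake_downloader(url, destination):
    with open(destination, "w", encoding="utf-8") as handle:
        handle.write(f"contents of {url}")

with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "japan_0.csv.gz")
    used = download_with_fallback("http://example.com/file", target, [broken_downloader, fake_downloader])
    file_exists = os.path.exists(target)
    nothing_worked = download_with_fallback("http://example.com/file", target, [broken_downloader])

print(used, file_exists, nothing_worked)
```

Because a failed link returns `None` instead of being retried, one dead URL cannot stall the whole loop.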

The following function should not be used in this tutorial but is here for reference. It downloads every available file from Inside Airbnb, in a similar fashion to the previous function.

```python
def get_me_all_data(urls, path_to_files, countries_unique):
    """
    NOTE: Only use this function if you intend to download all of the data.
    
    Arguments:
    urls: pandas series with the links to iterate over
    path_to_files: path where you would like to save your files at
    countries_unique: iterable with the countries where Airbnb operates
    
    """
    for country in countries_unique: # we go over every country
        
        condition = urls.str.contains(country) # create a mask for it
        
        data_we_need = urls[condition] # we pass that mask to our pandas series
        
        new_dir = check_or_add(path_to_files, country, 'raw_data') # create a new directory for the raw data
        
        for num, data in enumerate(data_we_need): # iterate over the links we want
        
            try: # we first try to download the file with wget
                wget.download(data, os.path.join(new_dir, f'{country}_{num}.csv.gz'))
            except:
                try: # if wget doesn't work, we try with urllib
                    urllib.request.urlretrieve(data, os.path.join(new_dir, f'{country}_{num}.csv.gz'))
                except:
                    continue # if urllib doesn't work, we move on to the next one
```

Let's put our new function to use and get the first batch of data we will be using. In honor of our host, we will pick Japan as our first country.

Here is a table with the countries, the number of cities and files available for each, and the total size of the compressed and decompressed files. The average size per file is roughly the total size divided by the number of files. The recommended way to pick a country and the number of files you should download is as follows:
1. Pick a reasonable GB size for your batch (somewhere between 2 and 4 GB should be perfect).
2. Pick a country.
3. If the files in that country don't add up to the size you chose in step 1, pick another country or pick multiple countries until you have the desired number of GB.
4. If you want to pick multiple countries but the total size of one of them is too large for the space you have left, divide the space you have left by that country's average file size. The result is roughly the number of files you should choose.
5. Use the `get_me_specific_data()` function with the appropriate parameters and wait for a bit.
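
The arithmetic behind this recipe can be made concrete with a small sketch. It assumes the files within a country are roughly equal in size, which the table does not guarantee; `files_that_fit` is a hypothetical helper, and the figures for Japan, Germany, and Belize are the decompressed sizes from the table below.

```python
import math

def files_that_fit(space_left_gb, country_total_gb, country_file_count):
    # average size of one file, assuming files are roughly equal in size
    average_file_gb = country_total_gb / country_file_count
    # never ask for more files than the country actually has
    return min(country_file_count, math.floor(space_left_gb / average_file_gb))

budget_gb = 3.0
space_left_gb = budget_gb - 1.2  # after taking all of Japan (1.2 G decompressed)

germany_files = files_that_fit(space_left_gb, 3.6, 63)   # Germany: 63 files, 3.6 G
belize_files = files_that_fit(space_left_gb, 0.18, 15)   # Belize: 15 files, 180 M

print(germany_files, belize_files)
```

The result is the value you would pass as `unique_num` when calling `get_me_specific_data()`.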


| Country         | # of Cities | # of Files | Size Compressed     | Size Decompressed   |
|:----------------|:------------|:-----------|:--------------------|:--------------------|
| The-Netherlands |     1       |     58     |        851 M        |        3.6 G        |
| Belgium         |     3       |     83     |        245 M        |        1.0 G        |
| United-States   |    28       |    859     |        8.4 G        |       35.0 G        |
| Greece          |     4       |     82     |        902 M        |        3.8 G        |
| Spain           |     9       |    259     |        2.7 G        |       12.0 G        |
| Australia       |     7       |    233     |        2.6 G        |       11.0 G        |
| China           |     3       |     57     |        1.1 G        |        4.9 G        |
| Belize          |     1       |     15     |         38 M        |        180 M        |
| Italy           |    10       |    246     |        4.0 G        |       16.0 G        |
| Germany         |     2       |     63     |        894 M        |        3.6 G        |
| France          |     3       |    117     |        3.1 G        |       13.0 G        |
| United-Kingdom  |     5       |    125     |        2.7 G        |       11.0 G        |
| Argentina       |     1       |     14     |        272 M        |        1.1 G        |
| South-Africa    |     1       |     24     |        452 M        |        1.9 G        |
| Denmark         |     1       |     27     |        505 M        |        2.2 G        |
| Ireland         |     2       |     45     |        550 M        |        2.3 G        |
| Switzerland     |     2       |     86     |        200 M        |        858 M        |
| Turkey          |     1       |     25     |        275 M        |        1.2 G        |
| Portugal        |     2       |     56     |        879 M        |        3.7 G        |
| Mexico          |     1       |     16     |        279 M        |        1.1 G        |
| Canada          |     7       |    191     |        1.4 G        |        6.0 G        |
| Norway          |     1       |     26     |        156 M        |        663 M        |
| Czech-Republic  |     1       |     25     |        317 M        |        1.3 G        |
| Brazil          |     1       |     27     |        731 M        |        2.9 G        |
| Chile           |     1       |      5     |         52 M        |        232 M        |
| Singapore       |     1       |     16     |        102 M        |        516 M        |
| Sweden          |     1       |     25     |        129 M        |        561 M        |
| Taiwan          |     1       |     25     |        281 M        |        1.1 G        |
| Japan           |     1       |     16     |        248 M        |        1.2 G        |
| Austria         |     1       |     52     |        433 M        |        1.8 G        |


Let's now put our function to use and get the data we need for our project.

In [None]:
%%time

get_me_specific_data(files_we_want, my_country, path, unique_countries)
get_me_specific_data(files_we_want, my_country2, path, unique_countries, 12)
get_me_specific_data(files_we_want, my_country3, path, unique_countries, 12)
get_me_specific_data(files_we_want, my_city, path, unique_cities, 12)
get_me_specific_data(files_we_want, my_city2, path, unique_cities, 12)
get_me_specific_data(files_we_want, my_city3, path, unique_cities, 12)

We can check the data we have gathered so far to see what we got back from Inside Airbnb. Since pandas has a nice `compression` parameter, we will not worry about decompressing our files with other tools and will let pandas handle it in the next few cells.

In [None]:
raw_files = check_or_add(path, my_country + '_data', 'raw_data') # let's add our new raw_data path to a variable
file_num = 5 # pick a number for the file you want to show.

In [None]:
df = pd.read_csv(os.path.join(raw_files, f'{my_country}_{file_num}.csv.gz'), compression='gzip', low_memory=False, encoding='utf-8')
df.info(memory_usage='deep')

In [None]:
df.head()
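
If you would like to see the `compression` parameter at work without downloading anything, here is a minimal round trip on a toy DataFrame. The column names and values are made up for illustration.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({"id": [1, 2, 3], "price": [50.0, 75.5, 120.0]})

with tempfile.TemporaryDirectory() as folder:
    gz_path = os.path.join(folder, "toy_listings.csv.gz")
    
    # write a gzip-compressed csv file
    toy.to_csv(gz_path, index=False, compression="gzip")
    
    # read it back, first stating the compression and then letting pandas infer it
    explicit = pd.read_csv(gz_path, compression="gzip")
    inferred = pd.read_csv(gz_path)  # the default, compression="infer", looks at the .gz extension

print(explicit)
```

Both reads return the same table, so stating `compression='gzip'` mostly serves to make the intent explicit.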

Let's have a quick look at how many files we downloaded for our first country.

In [None]:
print(f"Amount of files we downloaded for {my_country} --> {len(os.listdir(raw_files))}")

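Globbing lets us collect file paths that match a wildcard pattern, where `*` stands for any run of characters within a single file or directory name. The pattern below picks up every `.csv.gz` file inside the `raw_data` folder of every directory whose name ends in `_data`.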

In [None]:
files = glob(os.path.join(path, '*_data', 'raw_data', '*.csv.gz'))
len(files), files[:5]
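
Here is a small, self-contained sketch of how wildcard matching with `glob` behaves. It builds a throwaway directory tree that mimics our layout (the file names are made up) and shows that only the paths matching every part of the pattern are returned.

```python
import os
import tempfile
from glob import glob

with tempfile.TemporaryDirectory() as base:
    for place in ("japan", "belgium"):
        raw_dir = os.path.join(base, f"{place}_data", "raw_data")
        os.makedirs(raw_dir)
        # one file that matches the pattern and one that does not
        open(os.path.join(raw_dir, f"{place}_0.csv.gz"), "w").close()
        open(os.path.join(raw_dir, "notes.txt"), "w").close()
    
    # each * matches any run of characters within a single path component
    matches = glob(os.path.join(base, "*_data", "raw_data", "*.csv.gz"))
    names = sorted(os.path.basename(match) for match in matches)

print(names)
```

`glob` makes no promise about the order of its results, which is why the names are sorted before printing.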

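Next is a quick example of `dask.delayed`. Wrapping a function with `dask.delayed` records the call as a lazy task instead of running it right away, and nothing is executed until we call `.compute()` on the result.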
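
Before applying it to our files, here is a toy sketch of the lazy behaviour. The function `double` is made up for illustration; the only dask calls used are `dask.delayed`, `.compute()`, and `dask.compute`.

```python
import dask

def double(x):
    return 2 * x

# each call is recorded as a task; nothing has been computed yet
lazy_values = [dask.delayed(double)(n) for n in range(4)]

# delayed objects can be fed into other delayed calls to build a larger task graph
lazy_total = dask.delayed(sum)(lazy_values)

total = lazy_total.compute()          # runs the whole graph
doubled = dask.compute(*lazy_values)  # computes several tasks in one go, returns a tuple

print(total, doubled)
```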

In [None]:
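def get_csv_files(data, path_out, new_dir, country, nums):
    """
    Read one compressed csv file and save it back as a regular csv file.
    
    Arguments:
    data: path to the compressed (.csv.gz) file to read
    path_out: base path where the data directories live
    new_dir: name of the directory, inside <country>_data, for the csv files
    country: country or city the file belongs to
    nums: number used in the name of the new file
    """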
    
    df = pd.read_csv(data, compression='gzip',  low_memory=False, encoding='utf-8')
    
    df.to_csv(os.path.join(check_or_add(path_out, country + '_data', new_dir), f'{country}_{nums}.csv'), index=False, encoding='utf-8')
    
    print(f"Done Reading and Saving file {nums}!")

In [None]:
%%time

results = []

# the places we downloaded data for, in the order we want to match them
places = [my_country, my_country2, my_country3, my_city, my_city2, my_city3]

for num, file in enumerate(files):
    
    # find the first place whose name appears in the file path, if any
    place = next((name for name in places if name in file), None)
    
    if place is not None:
        result = dask.delayed(get_csv_files)(data=file, path_out=path, new_dir='csv_files', country=place, nums=num)
        results.append(result)

In [None]:
results[:5]

In [None]:
%%time

results_done = [result.compute() for result in results]

Double check that you have the correct number of decompressed files with the cell below.

In [None]:
csv_files = glob(os.path.join(path, '*_data', 'csv_files', '*.csv'))
len(csv_files)

# Awesome Work! Now to Clean and Reshape our Data!

![Cleaning](https://media.giphy.com/media/RjpE964WUAE5a/giphy.gif)