# 00 Setting the Stage & Getting the Data

> "Unless you try to do something beyond what you have already mastered, you will never grow." ~ Ronald E. Osborn

## Outline for this notebook  

1. What is Data Analytics? 🐍  
2. What is a Data Analyst?
3. What is the Data Analytics Cycle?
4. Small, Medium, and Big Data
5. Problem Definition
6. Data Gathering
7. Questions

In this section, we will cover two of the most important stages of any data analytics process: problem definition and data gathering. We will begin by talking about what data analytics is, what a data analyst is, and what the data analytics cycle looks like.

## 1. What is Data Analytics? 🐍

![Data Analytics Cycle](images/4.png)

## 2. What is a Data Analyst?

![Data Analytics Cycle](images/5.png)

![solving](images/7.png)

As data detectives, we want to make sure we have at least a loosely defined outline of what our projects involving data might look like. In particular, we want to be careful with those involving large amounts of data, since errors can be very time consuming at best and very expensive at worst.

For our task, we are currently sick and tired of COVID and we want to start planning our next vacation. More specifically, we would love to scratch some countries off our bucket list but, since this can be quite costly, we should start by figuring out where we are going, where we are staying, and what kind of prices we are looking at if we decide to go there.

Since hotels are expensive, we thought we would give Airbnb a try.

![Gathering Data](images/9.png)

We will be using data collected by a scraping tool called [Inside Airbnb](http://insideairbnb.com/index.html). Yes, we will be scraping a bit of data from the scraper itself. More specifically, we will take the skeleton (an html version of the website), download it, and then extract all of the links that will help us get the data from it.

We will start by importing the following packages to help us get the data we need.

- `os`
- `pandas`
- `numpy`
- `requests`
- `bs4`
- `wget`
- `glob`
- `urllib`
- `dask`

In [15]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import os
import wget
import dask
import numpy as np
from dask.diagnostics import ProgressBar
from glob import glob
import urllib.request  # urlretrieve lives in the urllib.request submodule

pd.options.display.max_columns = None

Since we will be creating several directories, the first thing we will do is assign to a variable the path of the directory that all of our data will be written to and read from.

In [16]:
path = 'data'

We will also create a function that takes in an existing path as a starting point, plus any additional directory names that we might need to create over the course of this tutorial. Our function will combine all of its arguments into one directory path, check whether that directory already exists, create it if it does not, and return the combined path.

You might have already seen the `*args` parameter in a function a few times while using Python. It gives us the ability to pass any number of positional arguments to a function without explicitly listing them in the function's definition.
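As a small illustration of `*args`, here is a minimal sketch with a hypothetical helper called `join_names` (it is not part of this notebook's code). Every extra positional argument we pass ends up inside a tuple called `args`.

```python
def join_names(first, *args):
    # `args` is a tuple holding every extra positional argument we received
    return "/".join((first,) + args)


# hypothetical calls: one argument, then several
only_one = join_names("data")
several = join_names("data", "japan", "raw_data")

print(only_one)
print(several)
```

Notice that the function never had to declare `japan` or `raw_data` as parameters. This is the same trick our directory function relies on.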

In [17]:
def check_or_add(old_path, *args):
    
    """
    This function will help us check whether one or more directories exists, and
    if they don't exist, it will create, combine, and return a new directory.
    """
        
    if not os.path.exists(os.path.join(old_path, *args)):
        os.makedirs(os.path.join(old_path, *args))

    return os.path.join(old_path, *args)
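To see the idea in action without touching our real `data` folder, the sketch below works inside a temporary directory and uses a hypothetical variant called `check_or_add_sketch`. It assumes we are happy to let `os.makedirs` handle the existence check through its `exist_ok` argument, which is a swapped-in shortcut rather than the approach used above.

```python
import os
import tempfile


def check_or_add_sketch(old_path, *args):
    # hypothetical variant: combine the pieces, then create the
    # directory only if it is missing (exist_ok=True avoids an error)
    new_path = os.path.join(old_path, *args)
    os.makedirs(new_path, exist_ok=True)
    return new_path


with tempfile.TemporaryDirectory() as tmp:
    first = check_or_add_sketch(tmp, 'japan', 'raw_data')
    created = os.path.isdir(first)
    # calling it again with the same arguments simply returns the same path
    second = check_or_add_sketch(tmp, 'japan', 'raw_data')

print(created, first == second)
```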

We will use Python's `requests` module to send a request to Inside Airbnb, use our path creation function to create a directory called `html_data` for the html file, and then save the file as text using a context manager.

In [28]:
web_data = requests.get('http://insideairbnb.com/get-the-data.html')

In [29]:
path_4_source = check_or_add(path, 'html_data')

In [30]:
with open(os.path.join(path_4_source, 'insideairbnb.html'), 'w') as html:
    html.write(web_data.text)

We will add the path to our new file, plus the name of the file we just saved, to a variable called `html_doc`. We will then read it back in as text, and parse the document using `BeautifulSoup`.

In [31]:
html_doc = os.path.join(path_4_source, 'insideairbnb.html')
html_doc

'data/html_data/insideairbnb.html'

In [32]:
with open(html_doc, 'r') as file: 
    soup = BeautifulSoup(file, 'html.parser')

`BeautifulSoup` will allow us to extract the links we need without much hassle. While we could figure out a way to get only the exact links we need, maybe with a regular expression or a similar approach, at this stage we will extract every link in the html file and filter out the ones we don't need afterwards. For this, we will use a Python list comprehension.

In [33]:
list_of_links = [link.get('href') for link in soup.find_all('a')]
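Here is a minimal sketch of the same list comprehension on a tiny, made-up page (the html and the `data.example.com` address are assumptions for illustration only). It shows that an `<a>` tag without an `href` attribute gives us `None`, which is where missing values in our list of links can come from.

```python
from bs4 import BeautifulSoup

# a tiny, made-up page: the second anchor has no href attribute
toy_html = """
<html><body>
  <a href="index.html">Home</a>
  <a name="top">Top of the page</a>
  <a href="http://data.example.com/japan/kanto/tokyo/2020-01-28/data/listings.csv.gz">listings.csv.gz</a>
</body></html>
"""

toy_soup = BeautifulSoup(toy_html, 'html.parser')

# same pattern as above: ask every <a> tag for its href
toy_links = [link.get('href') for link in toy_soup.find_all('a')]
print(toy_links)
```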

In [34]:
list_of_links[:10]

['index.html',
 'about.html',
 'behind.html',
 'get-the-data.html',
 'https://twitter.com/share',
 'about.html#disclaimers',
 'http://creativecommons.org/publicdomain/zero/1.0/',
 'amsterdam/',
 'http://data.insideairbnb.com/the-netherlands/north-holland/amsterdam/2020-09-09/data/listings.csv.gz',
 'http://data.insideairbnb.com/the-netherlands/north-holland/amsterdam/2020-09-09/data/calendar.csv.gz']

In [35]:
len(list_of_links)

20499

Notice that the files we need are the ones that end in `listings.csv.gz`. To extract them (or filter out those we don't want), we can take advantage of pandas' many string methods. Let's begin by converting our list into a pandas Series, which can also be thought of as a one-dimensional array with a malleable index.

In [36]:
our_list = pd.Series(list_of_links, name='links')

our_list.head() # let's examine the first five rows of our new pandas Series

0                   index.html
1                   about.html
2                  behind.html
3            get-the-data.html
4    https://twitter.com/share
Name: links, dtype: object

We now have to get rid of `NaN` values, grab the listings links, and filter out the links we don't need with a mask. We will also reset the index, just because it is nice to have an index that starts from 0.

In [37]:
our_list.dropna(inplace=True) # drop NaN's and keep the changes

condition = our_list.str.endswith('/listings.csv.gz') # let's find the listings we need
files_we_want = our_list[condition].reset_index(drop=True) # filter out what we don't need and reset the index
files_we_want.head() # make sure everything went through as expected

0    http://data.insideairbnb.com/the-netherlands/n...
1    http://data.insideairbnb.com/the-netherlands/n...
2    http://data.insideairbnb.com/the-netherlands/n...
3    http://data.insideairbnb.com/the-netherlands/n...
4    http://data.insideairbnb.com/the-netherlands/n...
Name: links, dtype: object
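If the mask is still a bit abstract, the following sketch repeats the three steps on a toy Series (the `example.com` links are made up for illustration). The mask is nothing more than a Series of `True` and `False` values that pandas uses to decide which rows to keep.

```python
import pandas as pd

# made-up links: one missing value, one file we want, one we don't
toy_links = pd.Series(
    [
        'index.html',
        None,
        'http://example.com/a/b/c/2020-01-01/data/listings.csv.gz',
        'http://example.com/a/b/c/2020-01-01/visualisations/listings.csv',
    ],
    name='links',
)

clean = toy_links.dropna()                    # step 1: drop the missing value
mask = clean.str.endswith('/listings.csv.gz') # step 2: True only for the files we want
kept = clean[mask].reset_index(drop=True)     # step 3: filter and restart the index at 0

print(mask.tolist())
print(kept)
```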

Now that we have the links we need, let's go ahead and examine how many we have.

In [38]:
files_we_want.shape

(2898,)

That's certainly still a lot of files to download (but at least it is not 20k), so how about we have a look at how many files we have per country and, where possible, per city? We can imagine places such as the US, the UK, and Australia having multiple cities with people doing business through Airbnb.

In [39]:
countries = files_we_want.str.split('/').str.get(3)
unique_countries = countries.unique()
unique_countries

array(['the-netherlands', 'belgium', 'united-states', 'greece', 'spain',
       'australia', 'china', 'belize', 'italy', 'germany', 'france',
       'united-kingdom', 'argentina', 'south-africa', 'denmark',
       'ireland', 'switzerland', 'turkey', 'portugal', 'mexico', 'canada',
       'norway', 'czech-republic', 'brazil', 'chile', 'singapore',
       'sweden', 'taiwan', 'japan', 'austria'], dtype=object)

In [40]:
for country in unique_countries:
    print(f"{country.title()} has ------> {len(files_we_want[countries == country])}")

The-Netherlands has ------> 58
Belgium has ------> 83
United-States has ------> 859
Greece has ------> 82
Spain has ------> 259
Australia has ------> 233
China has ------> 57
Belize has ------> 15
Italy has ------> 246
Germany has ------> 63
France has ------> 117
United-Kingdom has ------> 125
Argentina has ------> 14
South-Africa has ------> 24
Denmark has ------> 27
Ireland has ------> 45
Switzerland has ------> 86
Turkey has ------> 25
Portugal has ------> 56
Mexico has ------> 16
Canada has ------> 191
Norway has ------> 26
Czech-Republic has ------> 25
Brazil has ------> 27
Chile has ------> 5
Singapore has ------> 16
Sweden has ------> 25
Taiwan has ------> 25
Japan has ------> 16
Austria has ------> 52


## Exercise 1

Find out how many cities are represented in our dataset, and print the country, the city, and how many files we have for that city.

Answers below! Don't peek! 👀

In [None]:
cities = files_we_want.str.split('/').str.get(5)
unique_cities = cities.unique()
unique_cities

In [None]:
for city in unique_cities:
    print(f"{city.title()} has ------> {len(files_we_want[cities == city])}")
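The exercise also asks for the country next to each city. One possible way to get both, sketched here on a few made-up links (the `data.example.com` addresses and region names are assumptions), is to put the country and city pieces into a small DataFrame and count the rows in each group.

```python
import pandas as pd

# made-up links that follow the same country/region/city layout
toy_urls = pd.Series([
    'http://data.example.com/japan/kanto/tokyo/2020-01-28/data/listings.csv.gz',
    'http://data.example.com/japan/kanto/tokyo/2020-02-28/data/listings.csv.gz',
    'http://data.example.com/australia/vic/melbourne/2020-01-28/data/listings.csv.gz',
    'http://data.example.com/australia/nsw/sydney/2020-01-28/data/listings.csv.gz',
])

parts = toy_urls.str.split('/')
toy_table = pd.DataFrame({'country': parts.str.get(3), 'city': parts.str.get(5)})

# one row per (country, city) pair with the number of files for that pair
files_per_city = toy_table.groupby(['country', 'city']).size()

for (country, city), n_files in files_per_city.items():
    print(f"{country.title()} | {city.title()} has ------> {n_files}")
```

With the real data, `files_we_want` would take the place of `toy_urls`.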

It is time for us to pick a country or city for our analysis.

### Pick a country you would like to visit.

In [41]:
my_country = 'japan'

If you forget the number of files available in each country and/or city when trying to make a decision, you can check them individually with the following function. There is also a table with more information coming up soon.

In [42]:
def check_len_files(country_city):
    
    # the name has to be one of the countries or cities we extracted from the links
    if country_city in unique_countries or country_city in unique_cities:
        
        # flag every link that contains the name and count the matches
        condition = files_we_want.str.contains(country_city)
        data_we_need = files_we_want[condition]
        
        return len(data_we_need)
    
    else:
        print("Sorry, your country or city is not on the list or it was misspelled")

In [44]:
print(check_len_files(my_country))
print(check_len_files('japan'))
print(check_len_files('australia'))
# print(check_len_files('new-york-city'))

16
16
233
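One thing to keep in mind is that `str.contains` looks for the name anywhere in the link, so a short name can also match a longer one. The sketch below uses two made-up links with hypothetical city names to compare that loose match against a stricter comparison on the city piece of the link.

```python
import pandas as pd

# made-up links: the two city names overlap on purpose
toy_urls = pd.Series([
    'http://data.example.com/united-states/ny/new-york-city/2020-01-01/data/listings.csv.gz',
    'http://data.example.com/united-states/ny/new-york/2020-01-01/data/listings.csv.gz',
])

# loose: True whenever the text appears anywhere in the link
loose = toy_urls.str.contains('new-york', regex=False)

# strict: True only when the city piece of the link is exactly the name
strict = toy_urls.str.split('/').str.get(5) == 'new-york'

print(int(loose.sum()), int(strict.sum()))
```

For the country names in our list this should rarely matter, but it is worth checking the counts against the table before downloading anything large.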


The following is one of the most important functions in the whole notebook as it is the one that is going to allow us to get the data we need into our computers.

The function takes in the following arguments:
- `urls` --> This is strictly a pandas Series with the list of urls we need
- `country` --> This would be the country you want to get data for
- `path_to_files` --> This is where the data will be downloaded to
- `countries_unique` --> This is the list of countries available in our list of urls
- `unique_num` --> If you do not need all files, you can specify how many you need. The default is all of them

And it operates as follows:

1. It first checks whether the country you have picked is in the list of unique countries
2. Then it creates a boolean array (aka a mask)
3. Passes it through our pandas series containing the urls to filter out the countries you don't need
4. Then it downloads the files you want and
5. Saves them into a new folder it creates called `raw_data` in the path you provided

In [45]:
def get_me_specific_data(urls, country, path_to_files, countries_unique, unique_num = None):
    
    """
    Download the listings files for one country into the directory
    path_to_files/country/raw_data. If unique_num is given, only the
    first unique_num files are downloaded.
    """
    
    # we only continue if the country is on our list
    if country in countries_unique:
        
        # check which urls mention the country and create a mask
        condition = urls.str.contains(country)
        
        # we pass that mask to our pandas series
        data_we_need = urls[condition]
        
        # create a new directory for the raw data
        new_dir = check_or_add(path_to_files, country, 'raw_data')
        
        # if a number of files was specified, keep only the first unique_num links
        if unique_num:
            data_we_need = data_we_need.iloc[:unique_num]
        
        # iterate over the links we want
        for num, data in enumerate(data_we_need):
            
            file_out = os.path.join(new_dir, f'{country}_{num}.csv.gz')
            
            # we first try to download the file with wget
            try:
                wget.download(data, file_out)
            except Exception:
                # if wget doesn't work, we try with urllib
                try:
                    urllib.request.urlretrieve(data, file_out)
                except Exception:
                    # if urllib doesn't work either, we move on to the next file
                    # instead of retrying forever
                    continue
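The "try one tool, then fall back to another" idea can be looked at on its own. The sketch below is only an illustration under stated assumptions: `download_with_fallback`, `broken_tool`, and `working_tool` are hypothetical names, and the two fake tools stand in for `wget.download` and `urllib.request.urlretrieve` so that nothing is actually downloaded.

```python
def download_with_fallback(url, destination, downloaders):
    """Try each (name, function) pair in order and return the name that worked."""
    for name, downloader in downloaders:
        try:
            downloader(url, destination)
            return name
        except Exception as error:
            # report why this attempt failed, then try the next tool
            print(f"{name} failed for {url}: {error}")
    # every tool failed, so we let the caller decide what to do
    return None


# hypothetical stand-in for a download tool that fails
def broken_tool(url, destination):
    raise ConnectionError("pretend the server refused the connection")


# hypothetical stand-in for a download tool that succeeds
saved = {}
def working_tool(url, destination):
    saved[destination] = url


winner = download_with_fallback(
    'http://data.example.com/japan/listings.csv.gz',  # made-up link
    'japan_0.csv.gz',
    [('broken_tool', broken_tool), ('working_tool', working_tool)],
)
print(winner)
```

Catching `Exception` and printing the reason, rather than silently skipping, makes it easier to tell a missing file apart from a network problem.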

The following function should not be used in this tutorial but is here for reference. It gets all downloadable files from Inside Airbnb in a similar fashion to the previous function.

```python
def get_me_all_data(urls, path_to_files, countries_unique):
    """
    NOTE: Only use this function if you intend to download all of the data.
    
    Arguments:
    urls: pandas series with the links to iterate over
    path_to_files: path where you would like to save your files at
    countries_unique: iterable with the countries where Airbnb operates
    
    """
    for country in countries_unique: # we go over every country
        
        condition = urls.str.contains(country) # create a mask for it
        
        data_we_need = urls[condition] # we pass that mask to our pandas series
        
        new_dir = check_or_add(path_to_files, country, 'raw_data') # create a new directory for the raw data
        
        for num, data in enumerate(data_we_need): # iterate over the links we want
        
            try: # we first try to download the file with wget
                wget.download(data, os.path.join(new_dir, f'{country}_{num}.csv.gz'))
            except:
                try: # if wget doesn't work, we try with urllib
                    urllib.request.urlretrieve(data, os.path.join(new_dir, f'{country}_{num}.csv.gz'))
                except:
                    continue # if urllib doesn't work, we move on to the next one
```

Let's put our new function to use and get the first batch of data we will be using. In honor of our host, we will be picking Japan as our first country.

Here is a table with the countries, the number of cities and files available for each, and the total size of the compressed and the decompressed files (M stands for megabytes and G for gigabytes). The average size per file can be estimated by dividing a country's total size by its number of files. The recommended way to pick a country and the number of files you should download is as follows:
1. Pick a reasonable size in GB for your batch (somewhere between 2 and 4 GB should be perfect).
2. Pick a country.
3. If the files in that country don't add up to the size you chose in step 1, pick another country or pick multiple countries until you have the desired number of GB.
4. If you want to pick multiple countries but the total size of one or more of them is too large for what you think your computer can manage, divide the GB of space you have left by the average size per file for that country, and round down. That is the number of files you should choose.
5. Use the `get_me_specific_data()` function with the appropriate parameters and wait for a bit.
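The arithmetic in step 4 can be sketched as follows. The helper `files_that_fit` is hypothetical, the 3 GB budget is an assumption for the example, and the sizes are the decompressed totals listed for Australia and Japan in the table of this section.

```python
import math


def files_that_fit(total_gb, n_files, budget_gb):
    # average size of one file, assuming all files are roughly the same size
    average_gb = total_gb / n_files
    # how many average-sized files fit in the budget, capped at what exists
    return min(n_files, math.floor(budget_gb / average_gb))


# Australia: 11.0 G decompressed across 233 files, with 3 GB to spare
australia_files = files_that_fit(total_gb=11.0, n_files=233, budget_gb=3.0)

# Japan: 1.2 G decompressed across 16 files, with the same 3 GB
japan_files = files_that_fit(total_gb=1.2, n_files=16, budget_gb=3.0)

print(australia_files, japan_files)
```

The result could then be passed as `unique_num` to `get_me_specific_data()`. Keep in mind that real files vary in size, so this is only an estimate.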


| Country         | # of Cities | # of Files | Size Compressed     | Size Decompressed   |
|:----------------|:------------|:-----------|:--------------------|:--------------------|
| The-Netherlands |     1       |     58     |        851 M        |        3.6 G        |
| Belgium         |     3       |     83     |        245 M        |        1.0 G        |
| United-States   |    28       |    859     |        8.4 G        |       35.0 G        |
| Greece          |     4       |     82     |        902 M        |        3.8 G        |
| Spain           |     9       |    259     |        2.7 G        |       12.0 G        |
| Australia       |     7       |    233     |        2.6 G        |       11.0 G        |
| China           |     3       |     57     |        1.1 G        |        4.9 G        |
| Belize          |     1       |     15     |         38 M        |        180 M        |
| Italy           |    10       |    246     |        4.0 G        |       16.0 G        |
| Germany         |     2       |     63     |        894 M        |        3.6 G        |
| France          |     3       |    117     |        3.1 G        |       13.0 G        |
| United-Kingdom  |     5       |    125     |        2.7 G        |       11.0 G        |
| Argentina       |     1       |     14     |        272 M        |        1.1 G        |
| South-Africa    |     1       |     24     |        452 M        |        1.9 G        |
| Denmark         |     1       |     27     |        505 M        |        2.2 G        |
| Ireland         |     2       |     45     |        550 M        |        2.3 G        |
| Switzerland     |     2       |     86     |        200 M        |        858 M        |
| Turkey          |     1       |     25     |        275 M        |        1.2 G        |
| Portugal        |     2       |     56     |        879 M        |        3.7 G        |
| Mexico          |     1       |     16     |        279 M        |        1.1 G        |
| Canada          |     7       |    191     |        1.4 G        |        6.0 G        |
| Norway          |     1       |     26     |        156 M        |        663 M        |
| Czech-Republic  |     1       |     25     |        317 M        |        1.3 G        |
| Brazil          |     1       |     27     |        731 M        |        2.9 G        |
| Chile           |     1       |      5     |         52 M        |        232 M        |
| Singapore       |     1       |     16     |        102 M        |        516 M        |
| Sweden          |     1       |     25     |        129 M        |        561 M        |
| Taiwan          |     1       |     25     |        281 M        |        1.1 G        |
| Japan           |     1       |     16     |        248 M        |        1.2 G        |
| Austria         |     1       |     52     |        433 M        |        1.8 G        |


Let's now put our function to use and get the data we need for our project.

In [47]:
%%time


get_me_specific_data(urls=files_we_want, country=my_country, path_to_files=path, countries_unique=unique_countries)

CPU times: user 1.82 s, sys: 1.58 s, total: 3.4 s
Wall time: 2min 33s


We can check the data we have gathered so far to see what we got back from Inside Airbnb. Since pandas has a nice `compression` parameter, we will not worry about decompressing our files with other tools and will use pandas' in the next few cells.

In [48]:
raw_files = check_or_add(path, my_country, 'raw_data') # let's add our new raw_data path to a variable
file_num = 5 # pick a number for the file you want to show.

In [49]:
df = pd.read_csv(os.path.join(raw_files, f'{my_country}_{file_num}.csv.gz'), compression='gzip', low_memory=False, encoding='utf-8')
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14941 entries, 0 to 14940
Columns: 106 entries, id to reviews_per_month
dtypes: float64(22), int64(23), object(61)
memory usage: 159.5 MB
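As a small aside on the `compression` parameter, the sketch below writes a made-up two-row table to a temporary `.csv.gz` file and reads it back twice: once with `compression='gzip'` spelled out, and once relying on pandas' default of inferring the compression from the file extension. The table contents and the file name are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# a made-up table with prices stored as text, like in the real files
toy = pd.DataFrame({'id': [1, 2], 'price': ['$4,250.00', '$11,006.00']})

with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, 'toy_0.csv.gz')
    toy.to_csv(toy_path, index=False, compression='gzip')

    explicit = pd.read_csv(toy_path, compression='gzip')  # what we did above
    inferred = pd.read_csv(toy_path)                      # pandas guesses from '.gz'

print(explicit.equals(inferred))
```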


In [50]:
df.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,notes,transit,access,interaction,house_rules,thumbnail_url,medium_url,picture_url,xl_picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,street,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,city,state,zipcode,market,smart_location,country_code,country,latitude,longitude,is_location_exact,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,amenities,square_feet,price,weekly_price,monthly_price,security_deposit,cleaning_fee,guests_included,extra_people,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,35303,https://www.airbnb.com/rooms/35303,20200128150029,2020-01-28,"La Casa Gaienmae C Harajuku, Omotesando is nearby",This shared flat is locating at very close to ...,This apartment is 3 bedroom flat shared with t...,This shared flat is locating at very close to ...,none,10 min walking to Harajuku ~ Urahara ~ Omotesa...,Current tenants are living in this flat over 2...,"5min to subway, 10min to JR stations, you can ...","Your private room, Kitchen, Bathroom, Toilet, ...",I provide common space cleaning twice a week. ...,"If you would like to stay monthly, there is a ...",,,https://a0.muscache.com/im/pictures/67365319/c...,,151977,https://www.airbnb.com/users/show/151977,Miyuki,2010-06-25,"Shibuya, Tokyo, Japan",Hi I am Miyuki Kanda. I run a real estate & pr...,,,,f,https://a0.muscache.com/im/users/151977/profil...,https://a0.muscache.com/im/users/151977/profil...,Shibuya District,3,3,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,"Shibuya, Tokyo, Japan",Shibuya District,Shibuya Ku,,Shibuya,Tokyo,150-0001,Tokyo,"Shibuya, Japan",JP,Japan,35.67152,139.71203,t,Apartment,Private room,1,1.0,1.0,1.0,Real Bed,"{TV,Internet,Wifi,Kitchen,""Paid parking off pr...",,"$4,250.00",,"$110,000.00","$30,000.00","$5,000.00",1,$0.00,28,1125,28,28,1125,1125,28.0,1125.0,15 months ago,t,29,59,89,90,2020-01-28,18,0,2011-12-28,2018-07-28,94.0,9.0,9.0,9.0,10.0,10.0,9.0,t,Other reasons | \n弊社は不動産事業者であり賃貸住宅管理事業者でもあります。...,,f,f,strict_14_with_grace_period,f,f,3,2,1,0,0.18
1,197677,https://www.airbnb.com/rooms/197677,20200128150029,2020-01-28,Oshiage Holiday Apartment,,"We are happy to welcome you to our apartment, ...","We are happy to welcome you to our apartment, ...",none,,,,,,1. Smoking is NOT allowed inside the property....,,,https://a0.muscache.com/im/pictures/38437056/d...,,964081,https://www.airbnb.com/users/show/964081,Yoshimi & Marek,2011-08-13,Tokyo,Would love to travel all over the world and me...,within a day,100%,,t,https://a0.muscache.com/im/users/964081/profil...,https://a0.muscache.com/im/users/964081/profil...,Sumida District,1,1,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,"Sumida, Tokyo, Japan",Sumida District,Sumida Ku,,Sumida,Tokyo,,Tokyo,"Sumida, Japan",JP,Japan,35.71721,139.82596,t,Apartment,Entire home/apt,4,1.0,1.0,2.0,Futon,"{TV,Internet,Wifi,""Air conditioning"",Kitchen,""...",,"$11,006.00","$66,000.00","$240,000.00","$40,000.00","$5,000.00",1,$0.00,3,365,3,3,365,365,3.0,365.0,3 months ago,t,11,14,18,159,2020-01-28,163,11,2011-09-21,2019-12-03,95.0,10.0,10.0,10.0,10.0,9.0,10.0,t,M130003350,,f,f,moderate,f,f,1,1,0,0,1.6
2,289597,https://www.airbnb.com/rooms/289597,20200128150029,2020-01-29,Private apt in central Tokyo #203,,::::::::::::::::::::::::::::::::::::::::::::::...,::::::::::::::::::::::::::::::::::::::::::::::...,none,,,,,I can not see you in person as I don't live in...,No smoking inside. No Parties or loud noises. ...,,,https://a0.muscache.com/im/pictures/6454753/a8...,,341577,https://www.airbnb.com/users/show/341577,Hide&Kei,2011-01-10,"Tokyo, Japan",We love travelling all over the world.\r\nWe a...,within an hour,100%,,f,https://a0.muscache.com/im/users/341577/profil...,https://a0.muscache.com/im/users/341577/profil...,Nerima District,2,2,"['email', 'phone', 'reviews', 'jumio', 'govern...",t,t,"Nerima, Tokyo, Japan",Nerima District,Nerima Ku,,Nerima,Tokyo,,Tokyo,"Nerima, Japan",JP,Japan,35.74267,139.6581,t,Apartment,Entire home/apt,2,1.0,1.0,1.0,Real Bed,"{TV,Wifi,""Air conditioning"",Kitchen,""Hot tub"",...",220.0,"$4,252.00","$29,438.00","$152,642.00","$32,709.00","$5,452.00",1,"$1,090.00",30,180,30,30,180,180,30.0,180.0,8 months ago,t,11,11,20,115,2020-01-29,112,7,2012-06-15,2019-12-01,95.0,9.0,10.0,10.0,9.0,9.0,9.0,t,Other reasons | 1か月以上の賃貸借契約のみ対応とする。ゲストには賃貸契約の署...,,t,f,strict_14_with_grace_period,f,f,2,2,0,0,1.21
3,370759,https://www.airbnb.com/rooms/370759,20200128150029,2020-01-29,"Cozy flat #203, local area YET 10 mins to shib...","So close to busy centers, yet so peaceful! Jus...","Cozy and Relaxing, at home feeling in a reside...","So close to busy centers, yet so peaceful! Jus...",none,Peaceful and residential area just 10 mins awa...,January - February - July - August: During tho...,3 mins away to the station. Nice walks all aro...,It is your own private flat during your stay i...,We are very near if needed let us know. Will b...,Please check the following: -Thank you to put ...,,,https://a0.muscache.com/im/pictures/34594282-f...,,1573631,https://www.airbnb.com/users/show/1573631,"Gilles,Mayumi,Taiki",2012-01-06,"Imari, Saga, Japan",We are a French-Japanese couple working betwee...,within an hour,100%,,t,https://a0.muscache.com/im/pictures/user/a419d...,https://a0.muscache.com/im/pictures/user/a419d...,Setagaya District,3,3,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,"Tokyo, Tokyo, Japan",Setagaya District,Setagaya Ku,,Tokyo,Tokyo,156-0042,Tokyo,"Tokyo, Japan",JP,Japan,35.66344,139.65593,t,Apartment,Entire home/apt,2,1.0,0.0,1.0,Real Bed,"{TV,Internet,Wifi,""Air conditioning"",Kitchen,""...",270.0,"$6,542.00","$56,009.00","$198,032.00","$20,000.00","$6,000.00",1,"$1,000.00",29,1125,1,29,1125,1125,28.1,1125.0,4 days ago,t,11,11,20,265,2020-01-29,102,4,2014-03-25,2019-11-20,95.0,10.0,10.0,10.0,10.0,10.0,10.0,t,Other reasons | We called Setagaya ku hokenjo ...,,t,f,strict_14_with_grace_period,f,f,3,3,0,0,1.43
4,700253,https://www.airbnb.com/rooms/700253,20200128150029,2020-01-29,Private apt in central Tokyo #201,,::::::::::::::::::::::::::::::::::::::::::::::...,::::::::::::::::::::::::::::::::::::::::::::::...,none,,,,,I can not see you in person as I don't live in...,No smoking inside. No Parties or loud noises. ...,,,https://a0.muscache.com/im/pictures/9888693/af...,,341577,https://www.airbnb.com/users/show/341577,Hide&Kei,2011-01-10,"Tokyo, Japan",We love travelling all over the world.\r\nWe a...,within an hour,100%,,f,https://a0.muscache.com/im/users/341577/profil...,https://a0.muscache.com/im/users/341577/profil...,Nerima District,2,2,"['email', 'phone', 'reviews', 'jumio', 'govern...",t,t,"Nerima, Tokyo, Japan",Nerima District,Nerima Ku,,Nerima,Tokyo,,Tokyo,"Nerima, Japan",JP,Japan,35.74264,139.65832,t,Apartment,Entire home/apt,2,1.0,1.0,1.0,Real Bed,"{TV,Internet,Wifi,""Air conditioning"",Kitchen,""...",,"$4,034.00","$29,438.00","$152,642.00","$32,709.00","$5,452.00",1,"$1,090.00",30,180,30,30,180,180,30.0,180.0,8 months ago,t,27,27,34,64,2020-01-29,103,4,2012-10-17,2019-10-04,96.0,10.0,10.0,10.0,10.0,9.0,10.0,t,Other reasons | 1か月以上の賃貸借契約のみ対応とする。ゲストには賃貸契約の署...,,f,f,strict_14_with_grace_period,f,f,2,2,0,0,1.16


Let's have a quick look at how many files we downloaded.

In [51]:
print(f"Amount of files we downloaded --> {len(os.listdir(raw_files))}")

Amount of files we downloaded --> 16


In [52]:
files = glob(os.path.join(raw_files, f'{my_country}*.csv.gz'))
files[:5]

['data/japan/raw_data/japan_14.csv.gz',
 'data/japan/raw_data/japan_7.csv.gz',
 'data/japan/raw_data/japan_5.csv.gz',
 'data/japan/raw_data/japan_9.csv.gz',
 'data/japan/raw_data/japan_1.csv.gz']
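Notice that `glob` does not return the files in numerical order. If the order matters to you, one possible approach, sketched here with a hypothetical helper called `file_number`, is to sort the paths by the number in each file name.

```python
import os
import re


def file_number(file_path):
    # pull the digits that sit between the last underscore and '.csv.gz'
    match = re.search(r'_(\d+)\.csv\.gz$', os.path.basename(file_path))
    # paths that don't follow the pattern are sent to the front
    return int(match.group(1)) if match else -1


# a few paths in the same shape as the output above
toy_files = [
    'data/japan/raw_data/japan_14.csv.gz',
    'data/japan/raw_data/japan_7.csv.gz',
    'data/japan/raw_data/japan_1.csv.gz',
]

ordered = sorted(toy_files, key=file_number)
print(ordered)
```

A plain `sorted(toy_files)` would compare the names as text and place `japan_14` before `japan_7`, which is why the number is extracted first.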

In [62]:
@dask.delayed
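def get_csv_files(data, path_out, new_dir, country, nums):
    
    """
    Read one compressed csv file and save it, decompressed, as
    path_out/country/new_dir/country_nums.csv
    """
    
    # pandas decompresses the gzip file while reading it
    df = pd.read_csv(data, compression='gzip', low_memory=False)
    
    # write the plain csv into the output directory, creating it if needed
    df.to_csv(os.path.join(check_or_add(path_out, country, new_dir), f'{country}_{nums}.csv'), index=False)
    
    print(f"Done Reading and Saving file {nums}!")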

In [63]:
%%time

results = []
for num, file in enumerate(files):
    results.append(get_csv_files(file, path_out=path, new_dir='csv_files', country=my_country, nums=num))

CPU times: user 1.98 ms, sys: 926 µs, total: 2.91 ms
Wall time: 2.12 ms


In [64]:
%%time

dask.compute(results);

Done Reading and Saving file 9!
Done Reading and Saving file 0!
Done Reading and Saving file 15!
Done Reading and Saving file 11!
Done Reading and Saving file 6!
Done Reading and Saving file 3!
Done Reading and Saving file 13!
Done Reading and Saving file 14!
Done Reading and Saving file 5!
Done Reading and Saving file 4!
Done Reading and Saving file 12!
Done Reading and Saving file 1!
Done Reading and Saving file 8!
Done Reading and Saving file 10!
Done Reading and Saving file 2!
Done Reading and Saving file 7!
CPU times: user 50 s, sys: 6.04 s, total: 56 s
Wall time: 45.6 s


([None,
  None,
  None,
  None,
  None,
  None,
  None,
  None,
  None,
  None,
  None,
  None,
  None,
  None,
  None,
  None],)
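The timings above hint at how `dask.delayed` works: building the list of tasks took only milliseconds because nothing was actually run until we called `dask.compute`. Here is a minimal sketch of that behaviour with a hypothetical `square` function in place of our file conversion.

```python
import dask


@dask.delayed
def square(x):
    # nothing in here runs until dask.compute is called
    return x * x


# this only builds a list of lazy tasks, so it returns immediately
lazy_results = [square(n) for n in range(4)]

# dask.compute returns a tuple with one entry per argument we passed in,
# which is why the output above is a list wrapped in a tuple
(values,) = dask.compute(lazy_results)
print(values)
```

Our `get_csv_files` function returns nothing, which explains the list of `None` values we got back.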

Double check that you have the correct number of decompressed files with the cell below.

In [65]:
csv_files = check_or_add(path, my_country, 'csv_files')
len(os.listdir(csv_files))

16

## Exercise 2

Create a futures object and unzip the files for your second country.

# Awesome Work! Now to Clean and Reshape our Data!