# **Airbnb Data Collection**

## Objectives

* Fetch the data from Kaggle and save it as raw data for use in developing the project.

## Inputs

* Use the Kaggle JSON file (`kaggle.json`) to download the data, then load and inspect it.

## Outputs

* Generate the dataset that will be used in the analysis.


---

# Change working directory

* We assume the notebooks are stored in a subfolder; therefore, when running a notebook in the editor, you will need to change the working directory.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/europe-airbnb-prices/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/europe-airbnb-prices'
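
The cells above move up one level every time they run, so re-running them can step past the project root. A minimal sketch of a guarded version, assuming the notebooks live in a folder named `jupyter_notebooks` (as in the path shown above); the helper name `project_root_from` is hypothetical:

```python
import os

def project_root_from(path, notebooks_dir="jupyter_notebooks"):
    """Return the parent of `path` if `path` is the notebooks folder, else `path` unchanged."""
    normalised = os.path.normpath(path)
    if os.path.basename(normalised) == notebooks_dir:
        return os.path.dirname(normalised)
    return path

# Only change directory while the notebook is still inside the notebooks folder.
os.chdir(project_root_from(os.getcwd()))
```

With this guard, running the cell a second time leaves the working directory where it is.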

---

# Using data from Kaggle

Install Kaggle package to import data.

In [16]:
%pip install kaggle==1.5.12

Note: you may need to restart the kernel to use updated packages.


Register the **kaggle.json** file for the session by pointing `KAGGLE_CONFIG_DIR` at the current directory.

In [5]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

chmod: cannot access 'kaggle.json': No such file or directory
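
The `chmod` error above means `kaggle.json` was not in the working directory when the cell ran. A minimal sketch of a pre-flight check in pure Python, assuming the token is expected in the current working directory as configured above; the function name `prepare_kaggle_token` is hypothetical:

```python
import os
import stat

def prepare_kaggle_token(config_dir):
    """Return True if kaggle.json exists in config_dir, after restricting it to owner read/write."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        print(f"kaggle.json not found in {config_dir}; upload it before downloading.")
        return False
    # Same effect as `chmod 600 kaggle.json`
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    return True

token_ready = prepare_kaggle_token(os.getcwd())
```

Checking `token_ready` before the download cell avoids the authentication traceback when the token is missing.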


Set the dataset path (taken from the Kaggle URL) and the destination folder, then download the dataset.

In [4]:
KaggleDatasetPath = "cahyaalkahfi/airbnb-european-cities-join"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Traceback (most recent call last):
  File "/workspace/.pip-modules/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/workspace/.pip-modules/lib/python3.8/site-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/workspace/.pip-modules/lib/python3.8/site-packages/kaggle/api/kaggle_api_extended.py", line 164, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /home/gitpod/.kaggle. Or use the environment method.


Unzip the downloaded file, then delete the zip file and kaggle.json file.

In [6]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

unzip:  cannot find or open inputs/datasets/raw/*.zip, inputs/datasets/raw/*.zip.zip or inputs/datasets/raw/*.zip.ZIP.

No zipfiles found.
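
The shell command above reports an error when no archive has been downloaded. A sketch of the same unzip-and-delete step using Python's standard `zipfile` and `glob` modules, which does nothing when no `.zip` file is present; the helper name `extract_and_remove_zips` is hypothetical, and removing `kaggle.json` is left out here:

```python
import glob
import os
import zipfile

def extract_and_remove_zips(folder):
    """Extract every .zip archive in `folder` into it, delete the archive, return extracted names."""
    extracted = []
    for zip_path in glob.glob(os.path.join(folder, "*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        os.remove(zip_path)
    return extracted

extracted_files = extract_and_remove_zips("inputs/datasets/raw")
```

An empty `extracted_files` list signals that the download step did not produce an archive.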


---

# Loading, Inspecting and Processing Kaggle Data

Loading data.

In [7]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/airbnb_european-cities.csv")
df.head()

Unnamed: 0,realSum,room_type,room_shared,room_private,person_capacity,host_is_superhost,multi,biz,cleanliness_rating,guest_satisfaction_overall,bedrooms,dist,metro_dist,city,weekends
0,319.640053,Private room,False,True,2,False,0,1,9,88,1,4.76336,0.852117,Amsterdam,True
1,347.995219,Private room,False,True,2,False,0,1,9,87,1,5.74831,3.651591,Amsterdam,True
2,482.975183,Private room,False,True,4,False,0,1,9,90,2,0.384872,0.439852,Amsterdam,True
3,485.552926,Private room,False,True,2,True,0,0,10,98,1,0.544723,0.318688,Amsterdam,True
4,2771.541724,Entire home/apt,False,False,4,True,0,0,10,100,3,1.686798,1.458399,Amsterdam,True


DataFrame Summary 

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51707 entries, 0 to 51706
Data columns (total 15 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   realSum                     51707 non-null  float64
 1   room_type                   51707 non-null  object 
 2   room_shared                 51707 non-null  bool   
 3   room_private                51707 non-null  bool   
 4   person_capacity             51707 non-null  int64  
 5   host_is_superhost           51707 non-null  bool   
 6   multi                       51707 non-null  int64  
 7   biz                         51707 non-null  int64  
 8   cleanliness_rating          51707 non-null  int64  
 9   guest_satisfaction_overall  51707 non-null  int64  
 10  bedrooms                    51707 non-null  int64  
 11  dist                        51707 non-null  float64
 12  metro_dist                  51707 non-null  float64
 13  city                        517

Rename the `realSum`, `metro_dist` and `dist` columns to `daily_price`, `metro_dist_km` and `city_center_dist_km` so the data is easier to read and understand.

In [13]:
df = df.rename(columns={'realSum' : 'daily_price', 'metro_dist' : 'metro_dist_km', 'dist' : 'city_center_dist_km'})
df.head()

Unnamed: 0,daily_price,room_type,room_shared,room_private,person_capacity,host_is_superhost,multi,biz,cleanliness_rating,guest_satisfaction_overall,bedrooms,city_center_dist_km,metro_dist_km,city,weekends
0,319.640053,Private room,False,True,2,False,0,1,9,88,1,4.76336,0.852117,Amsterdam,True
1,347.995219,Private room,False,True,2,False,0,1,9,87,1,5.74831,3.651591,Amsterdam,True
2,482.975183,Private room,False,True,4,False,0,1,9,90,2,0.384872,0.439852,Amsterdam,True
3,485.552926,Private room,False,True,2,True,0,0,10,98,1,0.544723,0.318688,Amsterdam,True
4,2771.541724,Entire home/apt,False,False,4,True,0,0,10,100,3,1.686798,1.458399,Amsterdam,True


Check the data for empty cells.

In [15]:
blank_mask = df.isnull() | (df == '')
null_or_blank_rows = df[blank_mask.any(axis=1)]
print(null_or_blank_rows)

Empty DataFrame
Columns: [daily_price, room_type, room_shared, room_private, person_capacity, host_is_superhost, multi, biz, cleanliness_rating, guest_satisfaction_overall, bedrooms, city_center_dist_km, metro_dist_km, city, weekends]
Index: []
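
Printing the offending rows works when there are only a few; a per-column count is often easier to scan. A small sketch on a toy DataFrame (the sample values are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

sample = pd.DataFrame({
    "city": ["Amsterdam", "", "Paris"],
    "daily_price": [319.64, 348.0, None],
})

# Treat both NaN and empty strings as missing, then count them per column.
blank_counts = (sample.isnull() | (sample == "")).sum()
print(blank_counts)
```

A column with a count of zero has no missing or blank cells.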


Select only the columns that will be used in the analysis.

In [17]:
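# .copy() makes the selection an independent DataFrame, so later column assignments do not raise SettingWithCopyWarning
df = df[['city', 'bedrooms', 'city_center_dist_km', 'metro_dist_km', 'daily_price', 'weekends']].copy()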
df.head()

Unnamed: 0,city,bedrooms,city_center_dist_km,metro_dist_km,daily_price,weekends
0,Amsterdam,1,4.763,0.852,319.64,True
1,Amsterdam,1,5.748,3.652,348.0,True
2,Amsterdam,2,0.385,0.44,482.98,True
3,Amsterdam,1,0.545,0.319,485.55,True
4,Amsterdam,3,1.687,1.458,2771.54,True


Round the `city_center_dist_km` and `metro_dist_km` columns to 3 decimal places, and the `daily_price` column to 2 decimal places.

In [18]:
df['city_center_dist_km'] = df['city_center_dist_km'].round(3)
df['metro_dist_km'] = df['metro_dist_km'].round(3)
df['daily_price'] = df['daily_price'].round(2)
df.head()

Unnamed: 0,city,bedrooms,city_center_dist_km,metro_dist_km,daily_price,weekends
0,Amsterdam,1,4.763,0.852,319.64,True
1,Amsterdam,1,5.748,3.652,348.0,True
2,Amsterdam,2,0.385,0.44,482.98,True
3,Amsterdam,1,0.545,0.319,485.55,True
4,Amsterdam,3,1.687,1.458,2771.54,True
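
The three rounding assignments can also be written as a single call, since `DataFrame.round` accepts a mapping of column name to number of decimals. A sketch on a toy frame built from the first row displayed earlier:

```python
import pandas as pd

sample = pd.DataFrame({
    "city_center_dist_km": [4.76336],
    "metro_dist_km": [0.852117],
    "daily_price": [319.640053],
})

# Columns not named in the mapping would be left unchanged.
rounded = sample.round({"city_center_dist_km": 3, "metro_dist_km": 3, "daily_price": 2})
print(rounded)
```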


Convert the `weekends` column to integer: **True = 1** and **False = 0**.

In [21]:
# Casting a boolean column to int maps True to 1 and False to 0
df['weekends'] = df['weekends'].astype(int)
df.head()

Unnamed: 0,city,bedrooms,city_center_dist_km,metro_dist_km,daily_price,weekends
0,Amsterdam,1,4.763,0.852,319.64,1
1,Amsterdam,1,5.748,3.652,348.0,1
2,Amsterdam,2,0.385,0.44,482.98,1
3,Amsterdam,1,0.545,0.319,485.55,1
4,Amsterdam,3,1.687,1.458,2771.54,1


---

# Push files to Repo

* Save the processed dataset to `outputs/datasets/collection` so it can be pushed to the repo.

In [22]:
import os
# Create the output folder if it does not already exist
os.makedirs('outputs/datasets/collection', exist_ok=True)

df.to_csv("outputs/datasets/collection/AirbnbEuropeanCities.csv", index=False)
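
After saving, it can be worth reading the file back to confirm that the columns survived the round trip. A minimal sketch using a temporary folder and a toy frame (illustrative values only), so it does not touch the project outputs:

```python
import os
import tempfile

import pandas as pd

sample = pd.DataFrame({"city": ["Amsterdam"], "bedrooms": [1], "weekends": [1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "outputs", "datasets", "collection")
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, "AirbnbEuropeanCities.csv")
    # index=False keeps the row index out of the file, so no extra column appears on reload
    sample.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

columns_match = list(reloaded.columns) == list(sample.columns)
```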
