# **Airbnb Data Collection**

## Objectives

* Have data fetched from Kaggle and saved as raw data that will be used to develop the project.

## Inputs

* Have the data imported using the Kaggle JSON file (`kaggle.json`), and then loaded and inspected.

## Outputs

* Have the dataset that will be used in the analysis process generated.


---

# Change working directory

* Have the working directory changed from its current folder to its parent folder.
    + We access the current directory with `os.getcwd()`.

In [None]:
import os
current_dir = os.getcwd()
current_dir

* Have the parent of the current directory set up as the new current directory.
    + `os.path.dirname()` gets the parent directory;
    + `os.chdir()` defines the new current directory.

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

* Have the new current directory confirmed.

In [None]:
current_dir = os.getcwd()
current_dir
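Re-running the cell that calls `os.chdir()` moves the session up one more level each time. A minimal sketch of a guard against that is shown here, assuming the notebook lives in a folder named `jupyter_notebooks`; that folder name and the helper `move_to_parent_once` are hypothetical and not taken from this project.

```python
import os

def move_to_parent_once(notebook_folder="jupyter_notebooks"):
    """Change to the parent directory only while still inside the notebook folder.

    `notebook_folder` is an assumed folder name; adjust it to the real one.
    Returns the current working directory after the (possible) change.
    """
    current = os.getcwd()
    if os.path.basename(current) == notebook_folder:
        os.chdir(os.path.dirname(current))
    return os.getcwd()
```

Calling the helper a second time leaves the working directory unchanged, because the session is no longer inside the notebook folder.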

---

# Loading and Preparing Data

Have data loaded and prepared for the next steps of analysis.

Install Kaggle package to import data.

In [None]:
%pip install kaggle==1.5.12

Recognising the **kaggle.json** file in the session.

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

Getting the dataset path from the Kaggle URL.

In [None]:
KaggleDatasetPath = "cahyaalkahfi/airbnb-european-cities-join"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Unzip the downloaded file, then delete the zip file and kaggle.json file.

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
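The `unzip` and `rm` commands assume a Unix-like shell. As a hedged alternative, the extract-then-delete step can be sketched with Python's standard `zipfile` module. The helper name `extract_and_remove_zips` is illustrative only, and the sketch does not remove `kaggle.json`.

```python
import zipfile
from pathlib import Path

def extract_and_remove_zips(folder):
    """Extract every .zip archive found in `folder` into that folder,
    delete each archive afterwards, and return the extracted member names."""
    extracted = []
    # Materialise the list first so new files do not affect the iteration
    for zip_path in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        zip_path.unlink()
    return extracted
```

In this notebook it would be called as `extract_and_remove_zips(DestinationFolder)`.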

Loading data.

In [None]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/airbnb_european-cities.csv")
df.head()

DataFrame Summary 

In [None]:
df.info()

Have the `realSum`, `metro_dist` and `dist` columns renamed to `daily_price`, `metro_dist_km` and `city_center_dist_km` to make the data easier to present and understand.

In [None]:
df = df.rename(columns={'realSum' : 'daily_price', 'metro_dist' : 'metro_dist_km', 'dist' : 'city_center_dist_km'})
df.head()

Have data checked for empty cells.

In [None]:
blank_mask = df.isnull() | (df == '')
null_or_blank_rows = df[blank_mask.any(axis=1)]
print(null_or_blank_rows)
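Printing the affected rows can be hard to read on a large dataset. A per-column count is one possible complement, sketched here on a toy DataFrame; the values and the helper name `count_missing` are made up for illustration.

```python
import pandas as pd

def count_missing(frame):
    """Count, per column, the cells that are null or empty strings."""
    blank_mask = frame.isnull() | (frame == "")
    return blank_mask.sum()

# Toy data, not taken from the Airbnb dataset
toy = pd.DataFrame({"city": ["Rome", "", None], "bedrooms": [1, 2, 3]})
missing_per_column = count_missing(toy)
print(missing_per_column)
```

A column with a count of zero has no null or blank cells.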

Have column names and the values in the DataFrame cleaned up from any leading or trailing spaces.

In [None]:
# Trim spaces in column names
df.columns = df.columns.str.strip()

# Trim spaces in all string columns
for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].str.strip()
print(df.head())    
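One caveat of `.str.strip()` is that any non-string value in an object column comes back as missing. A minimal sketch of a variant that trims only real strings and leaves other values untouched follows; the helper name `strip_strings` and the toy data are assumptions for illustration.

```python
import pandas as pd

def strip_strings(frame):
    """Return a copy with whitespace trimmed from column names and string cells.

    Non-string cells (numbers, booleans, None) are passed through unchanged.
    """
    cleaned = frame.copy()
    cleaned.columns = cleaned.columns.str.strip()
    for col in cleaned.select_dtypes(include=["object"]).columns:
        cleaned[col] = cleaned[col].map(
            lambda value: value.strip() if isinstance(value, str) else value
        )
    return cleaned

# Toy data, not taken from the Airbnb dataset
toy = pd.DataFrame({
    " city ": pd.Series([" Rome ", True], dtype=object),
    "bedrooms": [1, 2],
})
trimmed = strip_strings(toy)
```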

Have only the data that will be used in the analysis selected.

In [None]:
# Correct columns to select
columns_to_select = ['city', 'bedrooms', 'room_type', 'city_center_dist_km', 'metro_dist_km', 'daily_price', 'weekends']

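# Select the columns; .copy() makes an independent DataFrame so later
# column assignments do not trigger a SettingWithCopyWarning
df = df[columns_to_select].copy()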
df.head()
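Selecting by a list of names raises a `KeyError` if any name is absent, for example when the rename step was skipped. A small check run before the selection is sketched here; the helper name `missing_columns` and the toy frame are hypothetical.

```python
import pandas as pd

def missing_columns(frame, wanted):
    """Return the names in `wanted` that are not columns of `frame`."""
    return [name for name in wanted if name not in frame.columns]

# Toy frame that still carries the original `realSum` name
toy = pd.DataFrame(columns=["city", "realSum"])
absent = missing_columns(toy, ["city", "daily_price"])
print(absent)
```

An empty list means every requested column is available.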

Have data in columns `city_center_dist_km` and `metro_dist_km` rounded to 3
decimal places, and in column `daily_price` to 2 decimal places.

In [None]:
df['city_center_dist_km'] = df['city_center_dist_km'].round(3)
df['metro_dist_km'] = df['metro_dist_km'].round(3)
df['daily_price'] = df['daily_price'].round(2)
df.head()
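The three rounding assignments can also be written as a single call, since `DataFrame.round` accepts a mapping from column name to number of decimals. The sketch below uses made-up values rather than rows from the dataset.

```python
import pandas as pd

# Toy values, not taken from the Airbnb dataset
toy = pd.DataFrame({
    "city_center_dist_km": [5.0229638],
    "metro_dist_km": [2.5393800],
    "daily_price": [194.0336981],
})

# Columns not listed in the mapping would be left unchanged
rounded = toy.round({
    "city_center_dist_km": 3,
    "metro_dist_km": 3,
    "daily_price": 2,
})
print(rounded)
```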

Have data in column `weekends` converted to integer: **True = 1** and **False = 0**.

In [None]:
df.weekends = df.weekends.replace({True: 1, False: 0})
df.head()
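If the `weekends` column already has a boolean dtype, which is an assumption worth confirming with `df.info()`, an explicit cast is a possible alternative to `replace` that states the target type directly. A sketch on toy data:

```python
import pandas as pd

# Toy boolean column, not taken from the Airbnb dataset
toy = pd.DataFrame({"weekends": [True, False, True]})

# Cast booleans to integers: True becomes 1 and False becomes 0
toy["weekends"] = toy["weekends"].astype(int)
print(toy["weekends"].tolist())
```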

---

# Pushing File to Repo

* Have new folder created to save the DataFrame that will be used in the next steps of the analysis.

In [None]:
import os

# exist_ok=True avoids an error when the folder already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

df.to_csv("outputs/datasets/collection/EuropeanCitiesAirbnb.csv", index=False)
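A quick way to confirm that the file was written as expected is to read it back and compare its shape and column names with the DataFrame in memory. The sketch below does this in a temporary folder with toy data; the helper name `save_and_verify` is hypothetical.

```python
import os
import tempfile

import pandas as pd

def save_and_verify(frame, folder, filename):
    """Save `frame` as CSV under `folder` and confirm that it reloads
    with the same shape and the same column names."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    frame.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    same_shape = reloaded.shape == frame.shape
    same_columns = list(reloaded.columns) == list(frame.columns)
    return same_shape and same_columns

# Toy data written to a temporary folder, not to the project outputs
toy = pd.DataFrame({"city": ["Rome", "Paris"], "daily_price": [194.03, 250.5]})
target = os.path.join(tempfile.mkdtemp(), "collection")
saved_ok = save_and_verify(toy, target, "toy.csv")
print(saved_ok)
```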


---

# Conclusion

In this second step of my analysis, data was collected, stored and processed to be used in the next step of the analysis.

---

# Next Step

Have variables and their distribution checked to finish the data processing step and then start the analysis process.