# Kaggle Dataset Analysis

### <span style="color:#32CAEC"> Contents </span>
- 0. Author Information
- 1. Introduction
    - 1.1 Basic information about the dataset
    - 1.2 What is Airbnb?
    - 1.3 About Amsterdam
    - 1.4 Detailed information about the dataset
- 2. The goal of this analysis
- 3. A first visualization of the data
    - 3.1 Import of the libraries
    - 3.2 Data reading

## <span style="color:navy"> Author Information </span>
- **Full name:** Alejandro Donaire Salvador
- **University ID (NIU):** 1600697
- **Personal e-mail:** aledonairesa@gmail.com

## <span style="color:navy"> 1. Introduction </span>

### <span style="color:#32CAEC"> 1.1 Basic information about the dataset </span>

The dataset is called "Airbnb Amsterdam" and can be found here: https://www.kaggle.com/datasets/erikbruin/airbnb-amsterdam. The data dates from December 6th, 2018 and comes from http://insideairbnb.com/. It is about 0.4 GB in size and had been downloaded more than 4000 times on Kaggle as of November 2022.

### <span style="color:#32CAEC"> 1.2 What is Airbnb? </span>

Airbnb is a **public company** that operates an **online marketplace** (a type of e-commerce) focused on **short-term homestays and experiences**. It is based in San Francisco, California, and was founded in 2008 by Brian Chesky and others. It currently operates worldwide and **accounts for more than 20% of the vacation rental industry** as a whole. *Sources: https://en.wikipedia.org/wiki/Airbnb and https://hospitable.com/competitors-for-airbnb/*.

Here's its logo:
<img src="airbnblogo.png" alt="drawing" width="250"/>

### <span style="color:#32CAEC"> 1.3 About Amsterdam </span>

Amsterdam is the **capital and most populous city of the Netherlands** (northwestern Europe). The **population of the city proper is about 910k** people, and the city has a large number of canals and other bodies of water. Its climate is oceanic: it is humid, the summers are cool (about 20ºC), the winters are mild (about 5ºC) and the annual temperature range is relatively narrow. *Sources: https://en.wikipedia.org/wiki/Amsterdam and https://www.wolframalpha.com/input?key=&i=climate+Amsterdam*.

Here's a map of Amsterdam showing Airbnb homes/apartments (red dots) and private rooms (green dots) as of September 2022 (*Source: http://insideairbnb.com/amsterdam*):

<img src="amsterdamairbnb.png" alt="drawing" width="450"/>

### <span style="color:#32CAEC"> 1.4 Detailed information about the dataset </span>

The entire dataset consists of **6 `.csv` files**:
- `listings.csv`
- `listings_details.csv`
- `calendar.csv`
- `neighbourhoods.csv`
- `reviews.csv`
- `reviews_details.csv`

We review each one in more detail now:

**`listings.csv`**:

Contains all the advertisements (listings) in Amsterdam as of December 6th, 2018. It includes 20030 observations (advertisements) and 16 different attributes. The attributes are:
- **id**
- **name**
- **host_id**
- **host_name**
- **neighbourhood_group**
- **neighbourhood**
- **latitude**
- **longitude**
- **room_type**
- **price**
- **minimum_nights**
- **number_of_reviews**
- **last_review**
- **reviews_per_month**
- **calculated_host_listings_count**
- **availability_365**

An advertisement can look like this on the official website (https://www.airbnb.com/amsterdam):

<img src="advertisementairbnb.png" alt="drawing" width="250"/>

We can readily see its name, the number of reviews and the price per night among other pieces of information.


**`listings_details.csv`**:

An extension of `listings.csv`: more attributes for the same advertisements in Amsterdam on December 6th 2018. It includes 20030 observations and 96 different attributes. All the attributes of `listings.csv` except "neighbourhood_group" are included in `listings_details.csv`. Since "neighbourhood_group" is a column consisting entirely of NaNs, in practice `listings.csv` is completely contained in `listings_details.csv`. Some of the new attributes are:

- **experiences_offered**
- **house_rules**
- **instant_bookable**
- plus 93 further attributes (96 in total, including those shared with `listings.csv`)
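
The statement that a column consists entirely of NaNs can be checked directly with pandas. The following is a minimal sketch on a toy frame: the name `toy_listings` and its values are invented for illustration, and on the real data the same calls would be applied to the frame loaded from `listings.csv`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for listings.csv; the values are invented for illustration only.
toy_listings = pd.DataFrame({
    "id": [1, 2, 3],
    "neighbourhood_group": [np.nan, np.nan, np.nan],
    "price": [59, 80, 125],
})

# Fraction of missing values per column (1.0 means the column is 100% NaN).
nan_fraction = toy_listings.isna().mean()

# Columns that carry no information at all.
empty_columns = nan_fraction[nan_fraction == 1.0].index.tolist()
print(empty_columns)

# Dropping such columns leaves only the informative ones.
cleaned = toy_listings.dropna(axis=1, how="all")
print(list(cleaned.columns))
```

If a column such as "neighbourhood_group" shows up in `empty_columns`, dropping it loses no information, which is what justifies treating one file as contained in the other.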

**`calendar.csv`**:

It has 365 records for each listing (advertisement). Each record specifies whether the listing is available on a particular day (up to 365 days ahead) and the price on that day. It consists of 7310950 observations (about 7.3M, that is, 20030 listings × 365 days) and 4 attributes. The attributes are:

- **listing_id**
- **date**
- **available**
- **price**
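
The size of `calendar.csv` follows from the number of listings and the number of days recorded for each one. The sketch below first checks the arithmetic with the figures quoted above, then shows, on a small invented frame (`toy_calendar` is a hypothetical stand-in, not the real data), how the "same number of rows per listing" property could be verified.

```python
import pandas as pd

# Figures quoted in the text.
n_listings = 20030
days_per_listing = 365
total_rows = n_listings * days_per_listing
print(total_rows)

# Toy calendar: 2 hypothetical listings x 3 days each (invented values).
toy_calendar = pd.DataFrame({
    "listing_id": [1, 1, 1, 2, 2, 2],
    "date": ["2018-12-06", "2018-12-07", "2018-12-08"] * 2,
    "available": ["t", "f", "t", "f", "f", "t"],
    "price": ["$100.00", None, "$100.00", None, None, "$80.00"],
})

# Number of calendar rows per listing; on the real data each count should be 365.
rows_per_listing = toy_calendar.groupby("listing_id").size()

# True when every listing has the same number of rows.
all_equal = rows_per_listing.nunique() == 1
print(rows_per_listing.to_dict(), all_equal)
```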

**`neighbourhoods.csv`**:


**`reviews.csv`**:


**`reviews_details.csv`**:



## <span style="color:navy"> 2. The goal of this analysis </span>

## <span style="color:navy"> 3. A first visualization of the data </span>

### <span style="color:#32CAEC"> 3.1 Import of the libraries </span>

In [30]:
import warnings
import numpy as np
import pandas as pd
# np.warnings was removed in NumPy 1.24; use the standard-library module instead
warnings.filterwarnings('ignore')

### <span style="color:#32CAEC"> 3.2 Data reading </span>

In [32]:
listings_df = pd.read_csv("listings.csv", sep=",")
listings_det_df = pd.read_csv("listings_details.csv", sep=",") 
calendar_df = pd.read_csv("calendar.csv", sep=",")
neigh_df = pd.read_csv("neighbourhoods.csv", sep=",")
reviews_df = pd.read_csv("reviews.csv", sep=",")
reviews_det_df = pd.read_csv("reviews_details.csv", sep=",")
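
Because `calendar.csv` has several million rows, it can help to set column types while reading rather than afterwards. The sketch below is one possible approach, not the method used in this notebook: it reads a tiny in-memory CSV (invented rows standing in for the real file) and assumes that listing ids fit in a 32-bit integer.

```python
import io
import pandas as pd

# Toy CSV text standing in for calendar.csv (rows invented for illustration).
toy_csv = io.StringIO(
    "listing_id,date,available,price\n"
    "2818,2019-12-05,f,\n"
    "29979667,2018-12-11,t,$139.00\n"
)

toy_calendar = pd.read_csv(
    toy_csv,
    parse_dates=["date"],           # parse the date column while reading
    dtype={"listing_id": "int32"},  # assumption: ids fit in 32 bits
)

print(toy_calendar.dtypes)
print(toy_calendar.memory_usage(deep=True).sum(), "bytes")
```

On the real file, the same keyword arguments would be passed to `pd.read_csv("calendar.csv", ...)`.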

In [33]:
calendar_df

,listing_id,date,available,price
0,2818,2019-12-05,f,
1,73208,2019-08-30,f,
2,73208,2019-08-29,f,
3,73208,2019-08-28,f,
4,73208,2019-08-27,f,
...,...,...,...,...
7310945,29979667,2018-12-11,t,$139.00
7310946,29979667,2018-12-10,t,$139.00
7310947,29979667,2018-12-09,t,$139.00
7310948,29979667,2018-12-08,t,$139.00
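
The output above shows that `available` is stored as the strings "t"/"f" and `price` as text with a currency symbol, so both need converting before any numerical analysis. A minimal sketch on invented rows follows; the frame name `toy_calendar` is hypothetical, and the handling of commas as thousands separators is an assumption, since no such price appears in the rows shown.

```python
import pandas as pd

# Toy rows mimicking the calendar output (values invented for illustration).
toy_calendar = pd.DataFrame({
    "listing_id": [2818, 73208, 29979667],
    "date": ["2019-12-05", "2019-08-30", "2018-12-11"],
    "available": ["f", "f", "t"],
    "price": [None, None, "$139.00"],
})

# "t"/"f" flags become proper booleans.
toy_calendar["available"] = toy_calendar["available"].map({"t": True, "f": False})

# Strip the currency symbol (and, by assumption, thousands separators),
# then convert to numbers; missing prices stay NaN.
toy_calendar["price"] = pd.to_numeric(
    toy_calendar["price"].str.replace(r"[$,]", "", regex=True)
)

print(toy_calendar)
```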


In [28]:
# Check whether the attributes of listings.csv are contained in listings_details.csv
for atr in listings_df.columns:
    if atr not in listings_det_df.columns:
        print(f"The attribute {atr} from listings.csv is NOT contained in listings_details.csv")

The attribute neighbourhood_group from listings.csv is NOT contained in listings_details.csv
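
The same containment check can also be written without an explicit loop, using the set-like operations of a pandas `Index`. This is an alternative sketch, not the notebook's own code: `small_df` and `large_df` are hypothetical empty frames whose column names are a small subset of those mentioned in the text.

```python
import pandas as pd

# Toy frames standing in for the two listing files (column names taken from
# the text, no data rows needed for a column comparison).
small_df = pd.DataFrame(columns=["id", "name", "neighbourhood_group", "price"])
large_df = pd.DataFrame(columns=["id", "name", "price", "house_rules", "instant_bookable"])

# Columns of the smaller file that are absent from the larger one.
missing = small_df.columns.difference(large_df.columns).tolist()
print(missing)

# Columns that only the larger file provides.
extra = large_df.columns.difference(small_df.columns).tolist()
print(extra)
```

`Index.difference` returns its result sorted, so the order of the printed names may differ from the column order in the files.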
