# MongoDB Data Lake
We decided to use MongoDB as a first repository for our project, to **control** our Big Data source.

We assume that our Big Data source is the Airbnb website, which continuously feeds us new listings, reviews and calendar data (booking plans for the following year).

We therefore treat it as a streaming source and use MongoDB to store each new entry as a document.

We simulate the stream from CSV files (listings.csv, reviews.csv) using the Spark Structured Streaming engine and the MongoDB Spark Connector ( https://www.mongodb.com/docs/spark-connector/current/structured-streaming/#structured-streaming-with-mongodb ).
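
The idea of replaying a static CSV file as a stream can be illustrated without Spark. The sketch below is a simplified stand-in, not the Spark Structured Streaming pipeline used in the project: it reads a CSV file in small batches and hands each batch to a collection object. The function names, `batch_size` and `delay_s` are hypothetical, and `collection` is assumed to expose `insert_many`, as a pymongo `Collection` does.

```python
import csv
import time
from itertools import islice


def read_csv_in_batches(path, batch_size=500):
    """Yield lists of row dicts from a CSV file, batch_size rows at a time."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        while True:
            batch = list(islice(reader, batch_size))
            if not batch:
                break
            yield batch


def replay_into_collection(path, collection, batch_size=500, delay_s=0.0):
    """Insert each batch into `collection`, pausing between batches.

    `collection` is any object with an insert_many(list_of_dicts) method
    (assumption: a pymongo Collection). Returns the number of rows sent.
    """
    sent = 0
    for batch in read_csv_in_batches(path, batch_size):
        collection.insert_many(batch)
        sent += len(batch)
        time.sleep(delay_s)  # crude pacing to mimic data arriving over time
    return sent
```

Note that `csv.DictReader` returns every value as a string, so numeric fields would still need type conversion before or after insertion.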

# Importing Data
In a real deployment, the data would flow from the Airbnb website into the database on MongoDB Atlas. In this project, we instead import the data in CSV format into our Atlas cluster by running the following command in a shell:

    mongoimport --uri mongodb+srv://analytics:analytics-password@mflix.ryqp8qp.mongodb.net/<database_name> --collection <collection_name> --type csv --headerline --file <filename>
    
We imported a database named 'Florence'.

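We can later update the database with new data using the option below. In upsert mode, mongoimport replaces documents that match an existing one (on `_id` by default, or on the fields given with `--upsertFields`) and inserts the rest: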

    --mode=upsert
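
The same upsert behaviour can be sketched from Python. This is a minimal illustration, assuming that listings are matched on their `id` field and that `collection` behaves like a pymongo `Collection` (its `replace_one` method accepts `upsert=True` and returns a result with an `upserted_id` attribute). The function name is hypothetical.

```python
def upsert_by_key(collection, docs, key="id"):
    """Replace documents whose `key` already exists and insert the rest.

    Assumes `collection` behaves like a pymongo Collection.
    Returns a tuple (inserted, replaced).
    """
    inserted = replaced = 0
    for doc in docs:
        # upsert=True inserts the document when the filter matches nothing
        result = collection.replace_one({key: doc[key]}, doc, upsert=True)
        if result.upserted_id is not None:
            inserted += 1
        else:
            replaced += 1
    return inserted, replaced
```

For large imports, one round trip per document is slow. Batching the operations would be the natural next step.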

In [None]:
from pprint import pprint
import pandas as pd
from pymongo import MongoClient

client = MongoClient('mongodb+srv://analytics:analytics-password@mflix.ryqp8qp.mongodb.net/?retryWrites=true&w=majority')  # connect to the Atlas cluster
client.list_database_names()

['Florence',
 'sample_airbnb',
 'sample_geospatial',
 'sample_guides',
 'sample_mflix',
 'sample_supplies',
 'admin',
 'local']

# MongoDB queries and data processing
We can then use MongoDB to perform fast, real-time queries on the newly imported data.

We can also use MongoDB to **create the sandbox environment**: we filter the data to keep only the city of Florence, clean it up a little, and use the same engine as before, in reverse, to create the listings_summary.csv and review_dates.csv files.
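
As a rough illustration of the export step, the sketch below writes an iterable of documents to a CSV file with Python's standard `csv` module. It is a stand-in for the Spark-based export described above, not the project's actual code. The function name is hypothetical, and `docs` is assumed to be any iterable of dicts, such as the cursor returned by a pymongo `find()` call.

```python
import csv


def export_to_csv(docs, path, fields):
    """Write `docs` to `path`, keeping only the columns in `fields`.

    Keys not listed in `fields` (for example `_id`) are dropped, and
    missing keys are written as empty cells. Returns the row count.
    """
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        for doc in docs:
            writer.writerow(doc)
            count += 1
    return count
```

The filtering itself stays in MongoDB: the query passed to `find()` decides which documents reach the file, and `fields` decides which columns are kept.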

In [None]:
mydb = client.Florence
mydb.list_collection_names()

['listings_summary', 'listings', 'reviews', 'reviews_dates']

# A sample document in the listings_summary collection


In [None]:
mydb.listings_summary.find_one()

{'_id': ObjectId('632374d6ba769086fbe413c6'),
 'id': 24469,
 'name': 'Fortezza/City Centre Modern Apt 2+2',
 'host_id': 99178,
 'host_name': 'Benedetta And Lorenzo',
 'neighbourhood_group': '',
 'neighbourhood': 'Centro Storico',
 'latitude': 43.7821,
 'longitude': 11.24392,
 'room_type': 'Entire home/apt',
 'price': 70,
 'minimum_nights': 2,
 'number_of_reviews': 1,
 'last_review': '2019-09-27',
 'reviews_per_month': 0.03,
 'calculated_host_listings_count': 4,
 'availability_365': 320,
 'number_of_reviews_ltm': 0,
 'license': ''}

In [None]:
filtered = mydb.listings_summary.find({'price':9})
for doc in filtered:
    pprint(doc)

    

{'_id': ObjectId('632374d6ba769086fbe41693'),
 'availability_365': 0,
 'calculated_host_listings_count': 7,
 'host_id': 6976636,
 'host_name': 'Alberto',
 'id': 1853860,
 'last_review': '2019-11-18',
 'latitude': 43.77068,
 'license': '',
 'longitude': 11.23978,
 'minimum_nights': 2,
 'name': 'Apartment Aurea',
 'neighbourhood': 'Isolotto Legnaia',
 'neighbourhood_group': '',
 'number_of_reviews': 31,
 'number_of_reviews_ltm': 0,
 'price': 9,
 'reviews_per_month': 0.31,
 'room_type': 'Entire home/apt'}
{'_id': ObjectId('632374e2ba769086fbe450e6'),
 'availability_365': 0,
 'calculated_host_listings_count': 1,
 'host_id': 13473984,
 'host_name': 'Riccardo',
 'id': 8062901,
 'last_review': '',
 'latitude': 43.75541,
 'license': '',
 'longitude': 11.2902,
 'minimum_nights': 1,
 'name': 'Clothing Optional House near to Center of Florence',
 'neighbourhood': 'Gavinana Galluzzo',
 'neighbourhood_group': '',
 'number_of_reviews': 0,
 'number_of_reviews_ltm': 0,
 'price': 9,
 'reviews_per_month

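Next, we find all listings with a price from 10 (inclusive) up to 15 (exclusive) and a minimum stay of one night.
The result shows only the host id, host name and price, plus `_id`, which MongoDB returns unless it is explicitly excluded.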

In [None]:
query = {"price":{"$gte":10,"$lt":15},'minimum_nights': 1}

projection = {"host_id": 1, "host_name": 1,  "price": 1,}

for doc in mydb.listings_summary.find(query, projection):
    pprint(doc)

{'_id': ObjectId('632374e2ba769086fbe44ee4'),
 'host_id': 6976636,
 'host_name': 'Alberto',
 'price': 10}
{'_id': ObjectId('632374fbba769086fbe4bcf7'),
 'host_id': 40507781,
 'host_name': 'Federica',
 'price': 14}
{'_id': ObjectId('6323750eba769086fbe5193e'),
 'host_id': 293412133,
 'host_name': 'Piercarlo',
 'price': 14}
{'_id': ObjectId('6323750eba769086fbe519a4'),
 'host_id': 255983967,
 'host_name': 'Brunel',
 'price': 10}
{'_id': ObjectId('6323750eba769086fbe51a60'),
 'host_id': 305439008,
 'host_name': 'Giovanni',
 'price': 10}
{'_id': ObjectId('6323750eba769086fbe51a61'),
 'host_id': 305439008,
 'host_name': 'Giovanni',
 'price': 10}
{'_id': ObjectId('6323750eba769086fbe51a62'),
 'host_id': 305439008,
 'host_name': 'Giovanni',
 'price': 10}
{'_id': ObjectId('6323750eba769086fbe51a65'),
 'host_id': 305439008,
 'host_name': 'Giovanni',
 'price': 10}
{'_id': ObjectId('6323750eba769086fbe51b1b'),
 'host_id': 321614951,
 'host_name': 'Giacomo',
 'price': 12}
{'_id': ObjectId('6323750

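The range query above can be parameterized. The helper below is a small sketch with a hypothetical name and signature: it builds the filter and projection dictionaries for a half-open price interval and excludes `_id` from the output.

```python
def price_range_query(low, high, minimum_nights, fields):
    """Build (filter, projection) for listings with low <= price < high."""
    query = {
        "price": {"$gte": low, "$lt": high},  # lower bound inclusive, upper exclusive
        "minimum_nights": minimum_nights,
    }
    projection = {name: 1 for name in fields}
    projection["_id"] = 0  # _id is returned by default, so exclude it explicitly
    return query, projection
```

Calling `price_range_query(10, 15, 1, ["host_id", "host_name", "price"])` gives the same filter as above, and the pair can be passed directly to `mydb.listings_summary.find(query, projection)`.
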
# Reshaping data into a DataFrame


In [None]:
filtered = mydb.listings_summary.find({'neighbourhood': 'Centro Storico'})
data = pd.DataFrame(list(filtered))
data.head()

Unnamed: 0,_id,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
0,632374d6ba769086fbe413c6,24469,Fortezza/City Centre Modern Apt 2+2,99178,Benedetta And Lorenzo,,Centro Storico,43.7821,11.24392,Entire home/apt,70,2,1,2019-09-27,0.03,4,320,0,
1,632374d6ba769086fbe413c7,24470,Fortezza/City Centre Modern Apt 2+1,99178,Benedetta And Lorenzo,,Centro Storico,43.78202,11.24399,Entire home/apt,70,2,3,2019-04-21,0.02,4,200,0,
2,632374d6ba769086fbe413c8,24471,Fortezza/City Centre Modern Apt 4+2,99178,Benedetta And Lorenzo,,Centro Storico,43.78202,11.24399,Entire home/apt,135,2,0,,,4,200,0,
3,632374d6ba769086fbe413c9,24472,Fortezza/City Centre Modern Apt 4+2,99178,Benedetta And Lorenzo,,Centro Storico,43.78202,11.24399,Entire home/apt,120,2,2,2012-04-11,0.02,4,328,0,
4,632374d6ba769086fbe413ca,26738,N4U Guest House Florence,113883,N4U Guest House,,Centro Storico,43.77017,11.25754,Private room,149,1,31,2019-07-03,0.22,2,331,0,
