## Import data from HDFS to MongoDB
---

### Steps:
- Prepare the MongoDB database and collection

```bash
# In the mongo shell, create a database (spark_db) and a collection (books).
# The notebook below writes to books_hypothesis_7, which MongoDB creates on first insert.
mongosh
use spark_db
db.createCollection('books')
```

- Connect to MongoDB using `pymongo`
- Connect to HDFS and read the data using `spark.read.csv`
- Select a subset of the Spark DataFrame to import using the `sample` method
- Convert the sample to a list of dictionaries using `toPandas` followed by the pandas `to_dict` method
- Insert the records into MongoDB using the `insert_many` method
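
For larger samples, it may help to pass the records to `insert_many` in batches rather than as one large list. The following is a minimal sketch of that idea, not part of the original notebook: the helper name `iter_batches`, the batch size of 1000, and the stand-in records are all assumptions made for illustration.

```python
from itertools import islice


def iter_batches(records, batch_size=1000):
    """Yield successive lists of at most batch_size records (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in records shaped like the output of to_dict(orient='records')
sample_records = [{"row": i} for i in range(2500)]
batch_sizes = [len(batch) for batch in iter_batches(sample_records)]

# With a live collection, each batch would be passed to books.insert_many(batch)
```

Because the generator never yields an empty batch, `insert_many` is never called with an empty list, which PyMongo rejects.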

In [1]:
# Connect to MongoDB

import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017/')
database = client['spark_db']
books = database['books_hypothesis_7']

In [2]:
# Connect to HDFS

import findspark
findspark.init()
import pyspark

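# Stop any Spark session left over from a previous run so a fresh context can be created
spark = pyspark.sql.SparkSession.builder.appName("hypotheses_testing").getOrCreate()
spark.stop()

# Create a new Spark context named after the hypothesis being tested
hypothesis_number = 7
sc = pyspark.SparkContext(appName=f"hypothesis_{hypothesis_number}")

# Create the Spark session on top of that context
spark_session = pyspark.sql.SparkSession(sc)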

# Load the data
df_joined = spark_session.read.csv("hdfs://localhost:9900/user/book_reviews/joined_tables_v2.csv", header=True, inferSchema=True, sep='\t')

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/09/08 12:12:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/09/08 12:12:58 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
23/09/08 12:12:58 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
23/09/08 12:12:59 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
23/09/08 12:12:59 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


In [3]:
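# Select a random subset of the data to import
N_to_sample = 200000
n_rows = df_joined.count()

# Without replacement, the sampling fraction must lie in [0, 1].
# The uncapped ratio N_to_sample / n_rows failed in the original run with:
# IllegalArgumentException: requirement failed: Sampling fraction (105.318588730911) must be on interval [0, 1] without replacement
fraction = min(1.0, N_to_sample / n_rows) if n_rows else 0.0
df_sample = df_joined.sample(withReplacement=False, fraction=fraction, seed=42)

# Convert to a list of dictionaries (one per row)
df_sample_dict = df_sample.toPandas().to_dict(orient='records')

# Insert into MongoDB (insert_many raises on an empty list)
if df_sample_dict:
    books.insert_many(df_sample_dict)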
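
The `IllegalArgumentException` reports a sampling fraction of 105.318588730911. For `N_to_sample = 200000`, that value implies the DataFrame held only about 1,899 rows, far fewer than the requested sample. The sketch below reproduces that arithmetic and shows one way to keep the fraction within [0, 1]; the helper name `capped_fraction` is hypothetical, and the row count is inferred from the error message rather than taken from the data.

```python
def capped_fraction(n_target, n_rows):
    """Return a sampling fraction in [0, 1] for drawing about n_target of n_rows rows."""
    if n_rows <= 0:
        return 0.0  # nothing to sample from
    return min(1.0, n_target / n_rows)


N_to_sample = 200000
reported_fraction = 105.318588730911  # value taken from the error message

# Back out the row count implied by the failing fraction (an inference, not a measured value)
implied_rows = round(N_to_sample / reported_fraction)

raw_fraction = N_to_sample / implied_rows              # greater than 1, so Spark rejects it
safe_fraction = capped_fraction(N_to_sample, implied_rows)  # capped at 1.0: keep every row
```

A row count this small may also be worth checking against the source file, since it can indicate that the CSV was not parsed as expected.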