## Import data from HDFS to MongoDB
---

### Steps:
- Prepare the MongoDB database and collection

```bash
# Open the mongo shell (mongosh), then create a database (spark_db) and a collection (books) at its prompt
mongosh
use spark_db
db.createCollection('books')
```

- Connect to MongoDB using `pymongo`
- Connect to HDFS and read the data using `spark.read.csv`
- Select a random subset of the Spark DataFrame to import using the `sample` method
- Convert the subset to a list of dictionaries using `toPandas` followed by the pandas `to_dict` method
- Insert the records into MongoDB using the `insert_many` method
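
The sampling step takes a fraction rather than a row count, so the number of rows it returns is only approximately the number requested. The sketch below illustrates this in plain Python, on the assumption that sampling without replacement keeps each row independently with probability `fraction`. The helper `bernoulli_sample` and the sizes used are hypothetical and are not part of the notebook.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


total_rows = 10_000   # hypothetical dataset size
target_rows = 2_000   # hypothetical number of rows wanted

# Cap the fraction at 1.0 in case the target exceeds the dataset size
fraction = min(1.0, target_rows / total_rows)

sample = bernoulli_sample(range(total_rows), fraction, seed=42)
print(f"asked for about {target_rows} rows, got {len(sample)}")
```

If an exact row count matters, one option is to sample slightly more than needed and then truncate the result.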

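In [1]:

```python
# Connect to MongoDB
import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017/')
database = client['spark_db']
# Note: this targets 'books_hypothesis_7', not the 'books' collection created above.
# MongoDB creates a missing collection automatically on the first insert.
books = database['books_hypothesis_7']
```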

In [4]:

```python
# Connect to HDFS

import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType

# Stop any Spark session left over from an earlier run,
# so that the context created below gets its own application name
spark = SparkSession.builder.appName("hypotheses_testing").getOrCreate()
spark.stop()

# Initialize the Spark context
hypothesis_number = 7
sc = pyspark.SparkContext(appName=f"hypothesis_{hypothesis_number}")

# Create the Spark session
spark_session = SparkSession(sc)

# Define the schema (all fields nullable)
joined_schema = StructType([
    StructField("Title", StringType(), True),
    StructField("description", StringType(), True),
    StructField("authors", StringType(), True),
    StructField("publisher", StringType(), True),
    StructField("publishedDate", StringType(), True),
    StructField("categories", StringType(), True),
    StructField("Price", FloatType(), True),
    StructField("User_id", IntegerType(), True),
    StructField("profileName", StringType(), True),
    StructField("review/helpfulness", StringType(), True),
    StructField("review/score", FloatType(), True),
    StructField("review/time", IntegerType(), True),
    StructField("review/summary", StringType(), True),
    StructField("review/text", StringType(), True),
])

# Load the tab-separated data from HDFS
df_joined = spark_session.read.csv(
    "hdfs://localhost:9900/user/book_reviews/joined_tables_v2.csv",
    header=True,
    schema=joined_schema,
    sep='\t',
)
```

```text
23/09/08 15:22:49 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.
23/09/08 15:22:49 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
23/09/08 15:22:49 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
```
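
The schema above declares 14 fields, while the `CSVHeaderChecker` warning in the notebook output reports a header with 13 columns. A small check like the one below could help locate such a mismatch before loading. It is a minimal sketch in plain Python: the helper `compare_header` and the sample header line are hypothetical, and the real header would have to be read from the file on HDFS.

```python
def compare_header(header_line, schema_names, sep="\t"):
    """Return the header column count plus schema names missing from the
    header and header columns that are absent from the schema."""
    header = header_line.rstrip("\n").split(sep)
    missing = [name for name in schema_names if name not in header]
    extra = [name for name in header if name not in schema_names]
    return len(header), missing, extra


# Hypothetical values for illustration only (not the real file header)
schema_names = ["Title", "description", "Price"]
header_line = "Title\tdescription\n"

n_columns, missing, extra = compare_header(header_line, schema_names)
print(n_columns, missing, extra)
```

In the notebook, the schema names could be taken from `joined_schema.fieldNames()`.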


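In [5]:

```python
# Select a random subset of the full dataset to import
N_to_sample = 200000
# `sample` takes a fraction, not a row count, so the result has roughly
# (not exactly) N_to_sample rows; the fraction is capped at 1.0
fraction = min(1.0, N_to_sample / df_joined.count())
df_sample = df_joined.sample(withReplacement=False, fraction=fraction, seed=42)

# Convert to a list of dictionaries, one per row (this collects the sample on the driver)
df_sample_dict = df_sample.toPandas().to_dict(orient='records')

# Insert into MongoDB
books.insert_many(df_sample_dict)
```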

```text
23/09/08 15:22:56 WARN CSVHeaderChecker: Number of column in CSV header is not equal to number of fields in the schema:
 Header length: 13, schema size: 14
CSV file: hdfs://localhost:9900/user/book_reviews/joined_tables_v2.csv

<pymongo.results.InsertManyResult at 0x14e7f3b50>
```
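
Two optional refinements to the insert step are sketched below, under stated assumptions. pandas represents missing numeric values as `NaN`, which would be stored in MongoDB as a NaN number rather than as `null`, and writing in smaller batches makes progress easier to follow. The helpers `clean_record` and `batched` and the sample records are hypothetical and do not come from the notebook.

```python
import math


def clean_record(record):
    """Replace float NaN values with None so they are stored as null."""
    return {
        key: (None if isinstance(value, float) and math.isnan(value) else value)
        for key, value in record.items()
    }


def batched(records, batch_size):
    """Yield consecutive slices of `records` with at most `batch_size` items."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


# Hypothetical records, shaped like the output of to_dict(orient='records')
records = [
    {"Title": "A", "Price": 9.5},
    {"Title": "B", "Price": float("nan")},
    {"Title": "C", "Price": 3.0},
]

cleaned = [clean_record(record) for record in records]
batches = list(batched(cleaned, batch_size=2))

# With a live connection, each batch could then be written with:
#     books.insert_many(batch)
```

The batch size is a tuning choice; the value used here is only for illustration.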