## Import data from HDFS to MongoDB
---

### Steps:
- Prepare the MongoDB database and collection

```bash
# Open the MongoDB shell
mongosh
# At the mongosh prompt, create the database (spark_db) and the collection (books)
use spark_db
db.createCollection('books')
```

- Connect to MongoDB using `pymongo`
- Start a Spark session and read the data from HDFS using `spark.read.csv`
- Select a random subset of the Spark DataFrame to import using the `sample` method
- Convert the subset into a list of dictionaries using `toPandas` followed by the pandas `to_dict` method
- Insert the records into MongoDB using the `insert_many` method

```python
# Connect to MongoDB

import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017/')
database = client['spark_db']
# Use the collection created above
books = database['books']
```

```python
# Start Spark and read the data from HDFS

import findspark
findspark.init()
import pyspark
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType

# Create (or reuse) the Spark session for this import
app_name = 'books_rating'
spark_session = pyspark.sql.SparkSession.builder.appName(app_name).getOrCreate()

# Define the schema
ratings_schema = StructType([
    StructField("Title", StringType(), True),
    StructField("Price", FloatType(), True),
    StructField("User_id", IntegerType(), True),
    StructField("profileName", StringType(), True),
    StructField("review/score", FloatType(), True),
    StructField("review/time", IntegerType(), True),
    StructField("review/summary", StringType(), True),
    StructField("review/text", StringType(), True),
    StructField("N_helpful", IntegerType(), True),
    StructField("Tot_votes", IntegerType(), True)
])

# Load the tab-separated data from HDFS
df_joined = spark_session.read.csv(
    "hdfs://localhost:9900/user/book_reviews/books_rating_cleaned.csv",
    header=True,
    schema=ratings_schema,
    sep='\t',
)
```


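```python
# Select a random subset of the data to import
N_to_sample = 500000
# sample() takes a fraction in [0, 1] and returns approximately that share of the rows
fraction = min(1.0, N_to_sample / df_joined.count())
df_sample = df_joined.sample(withReplacement=False, fraction=fraction, seed=42)

# Collect the sample to the driver and convert it to a list of dictionaries (one per row)
df_sample_dict = df_sample.toPandas().to_dict(orient='records')

# Insert into MongoDB
books.insert_many(df_sample_dict)
```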

<pymongo.results.InsertManyResult at 0x12cbb3670>
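
The conversion and insert steps can be made a little more defensive. Missing values in numeric columns typically arrive from pandas as `NaN`, which MongoDB would store as a `NaN` double rather than `null`, and inserting in smaller batches makes it easier to report progress or retry a failed chunk. The following is a minimal sketch, assuming that missing values should be stored as `null` and that the records fit in memory as a Python list. The helpers `clean_record` and `batched` and the example records are hypothetical illustrations, not part of `pymongo` or Spark.

```python
import math
from itertools import islice


def clean_record(record):
    """Return a copy of the record with float NaN values replaced by None."""
    return {
        key: None if isinstance(value, float) and math.isnan(value) else value
        for key, value in record.items()
    }


def batched(records, batch_size):
    """Yield successive lists of at most batch_size records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Small stand-in for df_sample_dict (illustrative values, not real data)
example_records = [
    {"Title": "Book A", "Price": float("nan"), "review/score": 4.0},
    {"Title": "Book B", "Price": 12.5, "review/score": 5.0},
    {"Title": "Book C", "Price": 8.0, "review/score": float("nan")},
]

cleaned = [clean_record(r) for r in example_records]
batches = list(batched(cleaned, batch_size=2))

# With a live collection, each batch could then be inserted separately
# (the batch size of 10,000 is an arbitrary assumption):
# for batch in batched(cleaned, batch_size=10_000):
#     books.insert_many(batch, ordered=False)
```

Passing `ordered=False` to `insert_many` lets MongoDB attempt the remaining documents in a batch after one fails instead of stopping at the first error. Whether that is desirable depends on how the import should handle bad rows.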