## Import data from HDFS to MongoDB
---

### Steps:
- Prepare the MongoDB database and collection

```bash
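# Open the MongoDB shell (mongosh), then create a database (spark_db) and a collection (books).
# MongoDB also creates collections automatically on first insert, so the books_joined and
# book_reviews collections written by the notebook cells need no explicit creation.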
mongosh
use spark_db
db.createCollection('books')
```

- Connect to MongoDB using `pymongo`
- Read the data from HDFS using `spark.read.csv`
- Draw a random subset of each Spark DataFrame using the `sample` method
- Convert the subset to a list of dictionaries using `toPandas()` followed by `to_dict(orient='records')`
- Insert the records into MongoDB using the `insert_many` method
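
To make the conversion step concrete, the pure-Python sketch below mimics the shape that `to_dict(orient='records')` produces: one dictionary per row, keyed by column name, which is the document format `insert_many` expects. The column names come from the schema used in this notebook; the row values are made up for illustration.

```python
# Column names taken from the notebook's schema; the rows are invented examples.
columns = ["Title", "Price", "review/score"]
rows = [
    ("Example book A", 9.99, 5.0),
    ("Example book B", None, 3.0),
]

# One dictionary per row, keyed by column name (the "records" orientation).
records = [dict(zip(columns, row)) for row in rows]

for record in records:
    print(record)
```

Each dictionary in `records` would become one MongoDB document.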

In [1]:
# Connect to MongoDB

import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017/')
database = client['spark_db']
books = database['books_joined']
reviews = database['book_reviews']

In [2]:
# Connect to HDFS

import findspark
findspark.init()  # locate the Spark installation before importing pyspark

import pyspark
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType

app_name = 'books_joined'
# Initialize the Spark session
# Note: spark.storage.memoryFraction and spark.shuffle.memoryFraction are legacy settings,
# deprecated since Spark 1.6 and without effect under the unified memory manager.
spark = pyspark.sql.SparkSession.builder.master("local[*]")\
    .config("spark.driver.memory", "5g")\
    .config("spark.executor.memory", "5g")\
    .config("spark.storage.memoryFraction", "0.5")\
    .config("spark.shuffle.memoryFraction", "0.5")\
    .config("spark.driver.maxResultSize", "0")\
    .appName(app_name).getOrCreate()


# Define the schema
ratings_schema = StructType([
    StructField("Title", StringType(), True),
    StructField("Price", FloatType(), True),
    StructField("User_id", IntegerType(), True),
    StructField("profileName", StringType(), True),
    StructField("review/score", FloatType(), True),
    StructField("review/time", IntegerType(), True),
    StructField("review/summary", StringType(), True),
    StructField("review/text", StringType(), True),
    StructField("N_helpful", IntegerType(), True),
    StructField("Tot_votes", IntegerType(), True)
])

# Schema for joined data
joined_schema = StructType([
    StructField("Title", StringType(), True),
    StructField("description", StringType(), True),
    StructField("authors", StringType(), True),
    StructField("publisher", StringType(), True),
    StructField("publishedDate", StringType(), True),
    StructField("categories", StringType(), True),
    StructField("Price", FloatType(), True),
    StructField("User_id", IntegerType(), True),
    StructField("profileName", StringType(), True),
    StructField("review/score", FloatType(), True),
    StructField("review/time", IntegerType(), True),
    StructField("review/summary", StringType(), True),
    StructField("review/text", StringType(), True),
    StructField("N_helpful", IntegerType(), True),
    StructField("Tot_votes", IntegerType(), True)
])

# Load the data
df_joined = spark.read.csv("hdfs://localhost:9900/user/book_reviews/joined_tables",
                           header=True, schema=joined_schema, sep='\t')
spark_reviews = spark.read.csv(
    "hdfs://localhost:9900/user/book_reviews/books_rating_cleaned.csv",
    header=True, schema=ratings_schema, sep='\t')

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/09/16 13:52:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### Insert a subset of the joined data into MongoDB

In [3]:
# Draw a random sample of roughly N_to_sample rows
# (sample() takes a fraction, so the resulting row count is approximate)
N_to_sample = 300000
fraction = min(1.0, N_to_sample / df_joined.count())  # the fraction must not exceed 1
df_sample = df_joined.sample(withReplacement=False, fraction=fraction, seed=42)

# Convert to a list of dictionaries, one per row
df_sample_dict = df_sample.toPandas().to_dict(orient='records')

# Insert into MongoDB
books.insert_many(df_sample_dict)

23/09/16 13:53:01 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: $1,265 Gold, Profitable trade set-ups from StockTwits leading traders One of the biggest secrets on Wall Street is that to become consistently profitable, you need to specialize in a distinct setup. That is, you need to know how to read the signals that can help you identify an opportunity to buy or sell. In The StockTwits Edge: 40 Actionable Trade Setups from Real Market Pros, both well-known professional masters of the market and lesser-known individual traders describe their highest probability setups to teach you about an assortment of time frame and asset class-related market methods along the way. Drawing on the wisdom of some of the top minds at StockTwits, the leading stock market social networking site, this book has something for everyone, giving you exactly what you need to come up with profitable ideas and avoid financial risk, every day. Includes key trading insights from the exper

<pymongo.results.InsertManyResult at 0x7f8c6278e5c0>
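
One caveat of the `toPandas().to_dict(orient='records')` route: null values in numeric columns typically arrive as float `NaN`, which MongoDB stores as a `NaN` double rather than as `null`. The sketch below shows one way to normalize records before inserting them. `nan_to_none` is a hypothetical helper that is not part of the original notebook, and the sample record is made up.

```python
import math


def nan_to_none(record):
    """Return a copy of `record` with float NaN values replaced by None."""
    return {
        key: None if isinstance(value, float) and math.isnan(value) else value
        for key, value in record.items()
    }


# Made-up record standing in for one element of df_sample_dict.
records = [{"Title": "Example title", "Price": float("nan"), "review/score": 4.0}]
cleaned = [nan_to_none(record) for record in records]
print(cleaned)
```

In the notebook, this would amount to passing `[nan_to_none(r) for r in df_sample_dict]` to `insert_many` instead of the raw list.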

### Insert a subset of the reviews into MongoDB

In [None]:
# Draw a random sample of roughly N_to_sample rows
# (sample() takes a fraction, so the resulting row count is approximate)
N_to_sample = 300000
fraction = min(1.0, N_to_sample / spark_reviews.count())  # the fraction must not exceed 1
df_sample = spark_reviews.sample(withReplacement=False, fraction=fraction, seed=42)

# Convert to a list of dictionaries, one per row
df_sample_dict = df_sample.toPandas().to_dict(orient='records')

# Insert into MongoDB
reviews.insert_many(df_sample_dict)
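
Collecting the whole sample with `toPandas()` holds every row in driver memory at once. A possible alternative, sketched here as an assumption rather than as the notebook's method, is to stream rows and insert them in fixed-size batches. `insert_in_batches` and its default `batch_size` are hypothetical; the function only assumes that `collection` exposes an `insert_many` method, as a `pymongo` collection does.

```python
from itertools import islice


def insert_in_batches(collection, documents, batch_size=10_000):
    """Insert documents from any iterable in fixed-size batches.

    Returns the number of documents passed to `collection.insert_many`.
    """
    iterator = iter(documents)
    inserted = 0
    while True:
        # Pull at most `batch_size` documents without materializing the rest.
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        collection.insert_many(batch)
        inserted += len(batch)
    return inserted


# Possible use with the notebook's objects (not executed here):
# insert_in_batches(reviews, (row.asDict() for row in df_sample.toLocalIterator()))
```

`DataFrame.toLocalIterator()` yields rows to the driver one partition at a time, and `Row.asDict()` turns each row into a dictionary, so only one batch of documents is held as Python dictionaries at any moment.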

In [None]:
spark.stop()