# CS 520 Data Curation Project - ETL Pipeline using Spark, Pandas and MongoDB

## Our goal for the project is to extract data from a CSV file.
## We transform the file's original schema into the required schema, which has fewer columns and renamed columns.
## We then explore and query the data.
## Finally, we load the data as JSON documents into a MongoDB collection.
## All of this is done using Spark and pandas in Python.



## We use the Medicare Open Payments data, supplied as a CSV file.
## Requirements: Python, Spark, Jupyter Notebook, and a MongoDB connection.

## An easy way to install PySpark for Jupyter Notebook if it is not already present

In [None]:
import sys
!{sys.executable} -m pip install pyspark

## Importing SparkSession

In [85]:
from pyspark.sql import SparkSession

## Building SparkSession 

In [86]:
spark = SparkSession.builder.appName('CS 520').getOrCreate()

## Reading the CSV file

In [3]:
df = spark.read.csv("payments.csv", header = True)

## Print the original schema 

In [4]:
df.printSchema()

root
 |-- Change_Type: string (nullable = true)
 |-- Covered_Recipient_Type: string (nullable = true)
 |-- Teaching_Hospital_CCN: string (nullable = true)
 |-- Teaching_Hospital_ID: string (nullable = true)
 |-- Teaching_Hospital_Name: string (nullable = true)
 |-- Physician_Profile_ID: string (nullable = true)
 |-- Physician_First_Name: string (nullable = true)
 |-- Physician_Middle_Name: string (nullable = true)
 |-- Physician_Last_Name: string (nullable = true)
 |-- Physician_Name_Suffix: string (nullable = true)
 |-- Recipient_Primary_Business_Street_Address_Line1: string (nullable = true)
 |-- Recipient_Primary_Business_Street_Address_Line2: string (nullable = true)
 |-- Recipient_City: string (nullable = true)
 |-- Recipient_State: string (nullable = true)
 |-- Recipient_Zip_Code: string (nullable = true)
 |-- Recipient_Country: string (nullable = true)
 |-- Recipient_Province: string (nullable = true)
 |-- Recipient_Postal_Code: string (nullable = true)
 |-- Physician_Primary_Ty

## Changing the data type of Amount from String to Double  

In [5]:
from pyspark.sql.types import DoubleType


In [6]:
df2 = df.withColumn("amount" , df["Total_Amount_of_Payment_USDollars"].cast(DoubleType()))

## Creating a global temporary view named payments1

In [7]:
df2.createGlobalTempView("payments1")

## A schema can also be specified explicitly when importing a file, as defined below (this schema is not applied in the cells that follow)

In [8]:
from pyspark.sql.types import StructField,StringType,IntegerType,StructType

In [9]:
data_schema = [StructField("physician_id", StringType(), True),StructField("date_payment", StringType(), True),StructField("record_id", StringType(), True),StructField("payer", StringType(), True),StructField("amount", DoubleType(), True),StructField("physician_speciality", StringType(), True),StructField("nature_of_payment", StringType(), True)]

In [10]:
final_struc = StructType(fields=data_schema)

## Selecting only the columns we need and renaming them

In [11]:
ds = spark.sql("select Physician_Profile_ID as physician_id,Date_of_Payment as date_payment, Record_ID as record_id, Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name as payer,amount, Physician_Specialty, Nature_of_Payment_or_Transfer_of_Value as Nature_of_payment from global_temp.payments1 where Physician_Profile_ID IS NOT NULL") 

## Required Schema 

In [12]:
ds.printSchema()

root
 |-- physician_id: string (nullable = true)
 |-- date_payment: string (nullable = true)
 |-- record_id: string (nullable = true)
 |-- payer: string (nullable = true)
 |-- amount: double (nullable = true)
 |-- Physician_Specialty: string (nullable = true)
 |-- Nature_of_payment: string (nullable = true)



In [13]:
ds.first()

Row(physician_id='673985', date_payment='01/21/2016', record_id='346039414', payer='DFINE, Inc', amount=286.2, Physician_Specialty='Allopathic & Osteopathic Physicians|Anesthesiology', Nature_of_payment='Travel and Lodging')

## Registering the transformed data as the global temporary view payments

In [14]:
ds.createOrReplaceGlobalTempView("payments")

## Sample data 

In [15]:
ds.show()

+------------+------------+---------+----------+------+--------------------+------------------+
|physician_id|date_payment|record_id|     payer|amount| Physician_Specialty| Nature_of_payment|
+------------+------------+---------+----------+------+--------------------+------------------+
|      673985|  01/21/2016|346039414|DFINE, Inc| 286.2|Allopathic & Oste...|Travel and Lodging|
|      673985|  01/21/2016|346039416|DFINE, Inc|  25.0|Allopathic & Oste...|Travel and Lodging|
|      673985|  02/19/2016|346039418|DFINE, Inc| 27.27|Allopathic & Oste...|Travel and Lodging|
|       93975|  04/15/2016|346039420|DFINE, Inc|  21.6|Allopathic & Oste...| Food and Beverage|
|      275444|  04/29/2016|346039422|DFINE, Inc| 22.57|Allopathic & Oste...| Food and Beverage|
|      132655|  02/05/2016|346039424|DFINE, Inc| 780.7|Allopathic & Oste...|Travel and Lodging|
|      132655|  02/05/2016|346039426|DFINE, Inc|  25.0|Allopathic & Oste...|Travel and Lodging|
|      132655|  02/19/2016|346039428|DFI

## Converting date_payment from a string in MM/dd/yyyy format to a date type (displayed as yyyy-MM-dd)

In [16]:
from pyspark.sql.functions import to_date
from pyspark.sql.functions import unix_timestamp

ds =ds.withColumn("date_payment", to_date(unix_timestamp(ds["date_payment"], "MM/dd/yyyy").cast("timestamp")))
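
The same conversion can be pictured with the Python standard library. This is a minimal sketch in plain Python rather than Spark; `to_iso_date` is a hypothetical helper, and returning `None` for input that does not match the pattern is an assumption meant to mirror how Spark, under its default settings, typically yields null instead of raising an error. Rows like that may be the source of the `null` month group that appears in the monthly average query.

```python
from datetime import datetime

def to_iso_date(text):
    """Parse an MM/dd/yyyy string; return None when it does not match."""
    try:
        return datetime.strptime(text, "%m/%d/%Y").date()
    except (TypeError, ValueError):
        # Missing or malformed input is mapped to None instead of failing.
        return None

print(to_iso_date("01/21/2016"))   # 2016-01-21
print(to_iso_date("not a date"))   # None
```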


In [17]:
ds.printSchema()

root
 |-- physician_id: string (nullable = true)
 |-- date_payment: date (nullable = true)
 |-- record_id: string (nullable = true)
 |-- payer: string (nullable = true)
 |-- amount: double (nullable = true)
 |-- Physician_Specialty: string (nullable = true)
 |-- Nature_of_payment: string (nullable = true)



In [18]:
ds.show()

+------------+------------+---------+----------+------+--------------------+------------------+
|physician_id|date_payment|record_id|     payer|amount| Physician_Specialty| Nature_of_payment|
+------------+------------+---------+----------+------+--------------------+------------------+
|      673985|  2016-01-21|346039414|DFINE, Inc| 286.2|Allopathic & Oste...|Travel and Lodging|
|      673985|  2016-01-21|346039416|DFINE, Inc|  25.0|Allopathic & Oste...|Travel and Lodging|
|      673985|  2016-02-19|346039418|DFINE, Inc| 27.27|Allopathic & Oste...|Travel and Lodging|
|       93975|  2016-04-15|346039420|DFINE, Inc|  21.6|Allopathic & Oste...| Food and Beverage|
|      275444|  2016-04-29|346039422|DFINE, Inc| 22.57|Allopathic & Oste...| Food and Beverage|
|      132655|  2016-02-05|346039424|DFINE, Inc| 780.7|Allopathic & Oste...|Travel and Lodging|
|      132655|  2016-02-05|346039426|DFINE, Inc|  25.0|Allopathic & Oste...|Travel and Lodging|
|      132655|  2016-02-19|346039428|DFI

In [19]:
ds.createOrReplaceGlobalTempView("payments")

In [69]:
ds.count()

11258361

## Querying and Exploring the data  

## Querying can be done in two ways: with Spark DataFrame functions or by writing SQL statements directly. Both are used below.

### Top 10 natures of payment by count

In [23]:
from pyspark.sql.functions import desc
ds.groupBy(ds["Nature_of_Payment"]).count().orderBy(desc("count")).show(10)

+--------------------+-------+
|   Nature_of_Payment|  count|
+--------------------+-------+
|   Food and Beverage|9975364|
|  Travel and Lodging| 575255|
|Compensation for ...| 251440|
|           Education| 245515|
|      Consulting Fee| 115316|
|                Gift|  41523|
|           Honoraria|  17795|
|  Royalty or License|  11814|
|Compensation for ...|   9171|
|       Entertainment|   8408|
+--------------------+-------+
only showing top 10 rows



### Counts by nature of payment for payments over $1000

In [24]:
ds.filter(ds["amount"] > 1000).groupBy(ds["Nature_of_Payment"]).count().show()

+--------------------+------+
|   Nature_of_Payment| count|
+--------------------+------+
|           Education|  8378|
|       Entertainment|    30|
|  Travel and Lodging| 24196|
|Charitable Contri...|   235|
|Current or prospe...|   523|
|           Honoraria| 13052|
|               Grant|  2154|
|Compensation for ...|  2032|
|  Royalty or License|  9151|
|Compensation for ...|  5688|
|   Food and Beverage|   430|
|Compensation for ...|190064|
|      Consulting Fee| 71516|
|                Gift|  1234|
+--------------------+------+



### Top five physicians by total amount

In [25]:
from pyspark.sql.functions import sum

In [26]:
spark.sql ("select physician_id , sum(amount) as revenue from global_temp.payments group by physician_id order by revenue desc limit 5").show() 

+------------+--------------------+
|physician_id|             revenue|
+------------+--------------------+
|      288926|       2.183833534E7|
|      311622|1.9940975750000004E7|
|      327184|1.3732734540000003E7|
|      281659|         1.3202939E7|
|       32719|1.2149519940000001E7|
+------------+--------------------+



### Top 10 natures of payment by total amount

In [27]:
spark.sql("select Nature_of_payment , sum(amount) as total from global_temp.payments group by Nature_of_payment order by total desc limit 10").show()

+--------------------+--------------------+
|   Nature_of_payment|               total|
+--------------------+--------------------+
|Compensation for ...| 5.618398276899997E8|
|  Royalty or License| 4.892801498499998E8|
|      Consulting Fee|3.6266093983000004E8|
|   Food and Beverage|2.4166747336000344E8|
|  Travel and Lodging|1.9028856042000213E8|
|Current or prospe...|6.2862144450000025E7|
|           Honoraria|       4.392361441E7|
|           Education| 3.895051414999967E7|
|               Grant|       2.389616791E7|
|Compensation for ...|       2.006624783E7|
+--------------------+--------------------+



## Average payment amount in each month

In [28]:
from pyspark.sql.functions import format_number,dayofmonth,hour,dayofyear,month,year,weekofyear,date_format

In [29]:
ds.groupBy(month(ds['date_payment'])).mean().show()

+-------------------+------------------+
|month(date_payment)|       avg(amount)|
+-------------------+------------------+
|                 12|203.47133688633699|
|               null|               1.0|
|                  1| 173.8491177234156|
|                  6|162.98720391815715|
|                  3|162.92386677567347|
|                  5|197.61441697505896|
|                  9|155.56513796119987|
|                  4|199.70469070432415|
|                  8|196.05558923083862|
|                  7|180.39087668301877|
|                 10|167.21379880949354|
|                 11|213.26558497269005|
|                  2|188.95287296417095|
+-------------------+------------------+



## Installing pymongo 

In [37]:
import sys
!{sys.executable} -m pip install pymongo

Collecting pymongo
  Downloading pymongo-3.6.1-cp36-cp36m-win_amd64.whl (291kB)
Installing collected packages: pymongo
Successfully installed pymongo-3.6.1




## Converting the PySpark DataFrame to an RDD of JSON strings

In [24]:

    
import json

# toJSON returns an RDD in which each element is one row serialized as a JSON string.
results = ds.toJSON()
    

## Sample RDD element

In [25]:
results.first()

'{"physician_id":"673985","date_payment":"2016-01-21","record_id":"346039414","payer":"DFINE, Inc","amount":286.2,"Physician_Specialty":"Allopathic & Osteopathic Physicians|Anesthesiology","Nature_of_payment":"Travel and Lodging"}'

## Because the full dataset caused memory problems in our environment, we use only the first 50000 rows. With more memory available, the same method can be applied to larger datasets.

In [72]:
ds2 = spark.sql("select * from global_temp.payments limit 50000")
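
If the row limit is only a workaround for memory pressure, one alternative is to handle rows in fixed-size batches instead of materializing everything at once. The sketch below is an assumption, not what this notebook does: `batched` is a hypothetical helper, and the `range` stands in for a row iterator such as the one returned by `DataFrame.toLocalIterator()`.

```python
from itertools import islice

def batched(rows, size):
    """Yield lists of at most `size` items from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

# Stand-in for a stream of rows; each batch could be converted and written separately.
for batch in batched(range(7), 3):
    print(batch)
```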

In [73]:
ds2.show()

+------------+------------+---------+----------+------+--------------------+------------------+
|physician_id|date_payment|record_id|     payer|amount| Physician_Specialty| Nature_of_payment|
+------------+------------+---------+----------+------+--------------------+------------------+
|      673985|  2016-01-21|346039414|DFINE, Inc| 286.2|Allopathic & Oste...|Travel and Lodging|
|      673985|  2016-01-21|346039416|DFINE, Inc|  25.0|Allopathic & Oste...|Travel and Lodging|
|      673985|  2016-02-19|346039418|DFINE, Inc| 27.27|Allopathic & Oste...|Travel and Lodging|
|       93975|  2016-04-15|346039420|DFINE, Inc|  21.6|Allopathic & Oste...| Food and Beverage|
|      275444|  2016-04-29|346039422|DFINE, Inc| 22.57|Allopathic & Oste...| Food and Beverage|
|      132655|  2016-02-05|346039424|DFINE, Inc| 780.7|Allopathic & Oste...|Travel and Lodging|
|      132655|  2016-02-05|346039426|DFINE, Inc|  25.0|Allopathic & Oste...|Travel and Lodging|
|      132655|  2016-02-19|346039428|DFI

In [74]:
from pyspark import SparkContext, SparkConf

## Converting the PySpark DataFrame to a pandas DataFrame

In [75]:
import pandas as pd

In [76]:
pdDf = ds2.toPandas()

## Sample pandas data frame 

In [77]:
pdDf.head()

Unnamed: 0,physician_id,date_payment,record_id,payer,amount,Physician_Specialty,Nature_of_payment
0,673985,2016-01-21,346039414,"DFINE, Inc",286.2,Allopathic & Osteopathic Physicians|Anesthesio...,Travel and Lodging
1,673985,2016-01-21,346039416,"DFINE, Inc",25.0,Allopathic & Osteopathic Physicians|Anesthesio...,Travel and Lodging
2,673985,2016-02-19,346039418,"DFINE, Inc",27.27,Allopathic & Osteopathic Physicians|Anesthesio...,Travel and Lodging
3,93975,2016-04-15,346039420,"DFINE, Inc",21.6,Allopathic & Osteopathic Physicians|Radiology|...,Food and Beverage
4,275444,2016-04-29,346039422,"DFINE, Inc",22.57,Allopathic & Osteopathic Physicians|Internal M...,Food and Beverage


## Creating a new index of the form physician_id__date_payment__Nature_of_payment so that records are easier to query and find in the database

In [78]:
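# Dates must be strings before they can be concatenated into the key.
pdDf['date_payment'] = pdDf['date_payment'].astype(str)
# Composite key: physician_id__date_payment__Nature_of_payment.
pdDf = pdDf.set_index([pdDf.physician_id + '__' + pdDf.date_payment + '__' + pdDf.Nature_of_payment])
# Note: this key is not unique (a physician can receive several payments of the
# same nature on one day), so rows that share a key do not all survive to_dict.
jsonDict = pdDf.to_dict('index')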
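
Because the index becomes the dictionary key, any two rows that share a physician, date, and nature of payment map to the same entry. The first two sample rows (records 346039414 and 346039416) are such a pair, and only the second appears in the dictionary printed further down. Below is a minimal sketch of the effect in plain Python using those two rows; `composite_key` is a hypothetical helper, and appending `record_id` to the key is a suggestion that assumes `record_id` is unique, not part of the original pipeline.

```python
rows = [
    {"physician_id": "673985", "date_payment": "2016-01-21",
     "record_id": "346039414", "amount": 286.2,
     "Nature_of_payment": "Travel and Lodging"},
    {"physician_id": "673985", "date_payment": "2016-01-21",
     "record_id": "346039416", "amount": 25.0,
     "Nature_of_payment": "Travel and Lodging"},
]

def composite_key(row, fields):
    """Join the chosen fields with a double underscore."""
    return "__".join(row[name] for name in fields)

original_fields = ["physician_id", "date_payment", "Nature_of_payment"]
unique_fields = original_fields + ["record_id"]  # assumption: record_id is unique

by_original = {composite_key(r, original_fields): r for r in rows}
by_unique = {composite_key(r, unique_fields): r for r in rows}

print(len(by_original))  # 1: the second row overwrote the first
print(len(by_unique))    # 2: both rows are kept
```

Recent pandas versions refuse `to_dict('index')` on a non-unique index and raise a `ValueError`, so a unique key also keeps that step runnable.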

## Sample dataframe with the new index 

In [79]:
pdDf.head()

Unnamed: 0,physician_id,date_payment,record_id,payer,amount,Physician_Specialty,Nature_of_payment
673985__2016-01-21__Travel and Lodging,673985,2016-01-21,346039414,"DFINE, Inc",286.2,Allopathic & Osteopathic Physicians|Anesthesio...,Travel and Lodging
673985__2016-01-21__Travel and Lodging,673985,2016-01-21,346039416,"DFINE, Inc",25.0,Allopathic & Osteopathic Physicians|Anesthesio...,Travel and Lodging
673985__2016-02-19__Travel and Lodging,673985,2016-02-19,346039418,"DFINE, Inc",27.27,Allopathic & Osteopathic Physicians|Anesthesio...,Travel and Lodging
93975__2016-04-15__Food and Beverage,93975,2016-04-15,346039420,"DFINE, Inc",21.6,Allopathic & Osteopathic Physicians|Radiology|...,Food and Beverage
275444__2016-04-29__Food and Beverage,275444,2016-04-29,346039422,"DFINE, Inc",22.57,Allopathic & Osteopathic Physicians|Internal M...,Food and Beverage


## JSON dictionary in the format we need for storing in the database
#### Format:
#### 'Index': {'physician_id': 'Value',
####          'date_payment': 'Value',
####          'record_id': 'Value',
####          'payer': 'Value',
####          'amount': Value,
####          'Physician_Specialty': 'Value',
####          'Nature_of_payment': 'Value'
####               }
     

In [80]:
jsonDict

{'673985__2016-01-21__Travel and Lodging': {'Nature_of_payment': 'Travel and Lodging',
  'Physician_Specialty': 'Allopathic & Osteopathic Physicians|Anesthesiology',
  'amount': 25.0,
  'date_payment': '2016-01-21',
  'payer': 'DFINE, Inc',
  'physician_id': '673985',
  'record_id': '346039416'},
 '673985__2016-02-19__Travel and Lodging': {'Nature_of_payment': 'Travel and Lodging',
  'Physician_Specialty': 'Allopathic & Osteopathic Physicians|Anesthesiology',
  'amount': 27.27,
  'date_payment': '2016-02-19',
  'payer': 'DFINE, Inc',
  'physician_id': '673985',
  'record_id': '346039418'},
 '93975__2016-04-15__Food and Beverage': {'Nature_of_payment': 'Food and Beverage',
  'Physician_Specialty': 'Allopathic & Osteopathic Physicians|Radiology|Vascular & Interventional Radiology',
  'amount': 21.6,
  'date_payment': '2016-04-15',
  'payer': 'DFINE, Inc',
  'physician_id': '93975',
  'record_id': '346039420'},
 '275444__2016-04-29__Food and Beverage': {'Nature_of_payment': 'Food and Beve

## Connecting to MongoDB with PyMongo

In [81]:
from pymongo import MongoClient  # must be imported before MongoClient is used
client = MongoClient('localhost', 27017)

In [82]:
from pymongo import MongoClient
client = MongoClient()

## Inserting the JSON dictionary into MongoDB and printing the id of the inserted document

In [84]:
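from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')

# Database name: 'test-database-'
mydb = client['test-database-']

# insert_one stores jsonDict as a single document in the 'mytable' collection.
# Collection.insert is deprecated in PyMongo 3 and removed in PyMongo 4.
record_id = mydb.mytable.insert_one(jsonDict).inserted_id

print(record_id)
# list_collection_names replaces the deprecated collection_names.
print(mydb.list_collection_names())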



5ad0c39434f2a343803e67f1
['mytable']
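
Storing the whole dictionary as one document makes individual payments hard to query, and a single MongoDB document is limited to 16 MB. A common alternative is one document per payment. The sketch below only reshapes the data and needs no database connection; `to_documents` is a hypothetical helper, the sample entry is copied from the dictionary shown earlier, and the commented `insert_many` call assumes the `mydb` handle from the cell above.

```python
def to_documents(json_dict):
    """Turn {key: record} into a list of documents that use the key as _id."""
    return [{"_id": key, **record} for key, record in json_dict.items()]

sample = {
    "93975__2016-04-15__Food and Beverage": {
        "physician_id": "93975",
        "date_payment": "2016-04-15",
        "record_id": "346039420",
        "amount": 21.6,
    },
}

documents = to_documents(sample)
print(documents[0]["_id"])

# With a live connection, the list could then be written in one call:
# mydb.mytable.insert_many(documents)
```

With one document per payment, a lookup by key becomes a query on `_id` instead of a scan through one large document.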
