# Health Insurance

#### Project Summary
This project gathers data on health insurance plans and analyzes it from several angles. It aims to answer questions such as:
- Which 5 networks have the largest number of organizations?

And many more analytical questions.

The project proceeds in the following steps:
* Step 1: Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Performance Optimization

## Step 1: Gather Data







In [1]:
import requests
import zipfile
from io import BytesIO
import os


parent_folder = 'health-insurance-raw-data'
domain = "https://download.cms.gov/marketplace-puf/"



In [2]:
def get_zip_urls(url):
    """Build one download URL per plan year (2014 through 2023) for the given PUF file."""
    return [domain + str(year) + url for year in range(2014, 2024)]


def download_data(urls, folder_name):
    """Download each zip archive and extract it into <parent_folder>/<folder_name>/<year>/."""
    for url in urls:
        response = requests.get(url)
        # Fail early on HTTP errors instead of handing an error page to ZipFile
        response.raise_for_status()
        # The year is the second-to-last path segment, e.g. ".../2014/rate-puf.zip"
        year = url.split("/")[-2]
        target_dir = os.path.join(parent_folder, folder_name, year)
        os.makedirs(target_dir, exist_ok=True)

        with zipfile.ZipFile(BytesIO(response.content)) as z:
            # Extract every member of the archive into the year folder
            z.extractall(target_dir)
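The URL handling above can be checked without touching the network. The following is a minimal sketch: `build_zip_urls` and `year_from_url` are illustrative names that do not appear in the notebook, and the `first_year` and `last_year` parameters are assumptions added to make the year range explicit.

```python
DOMAIN = "https://download.cms.gov/marketplace-puf/"  # same base URL as the notebook


def build_zip_urls(suffix, first_year=2014, last_year=2023, domain=DOMAIN):
    """Return one URL per year; first_year and last_year are illustrative parameters."""
    return [f"{domain}{year}{suffix}" for year in range(first_year, last_year + 1)]


def year_from_url(url):
    """The year is the second-to-last path segment of the URL."""
    return url.split("/")[-2]


urls = build_zip_urls("/rate-puf.zip")
print(urls[0])                  # https://download.cms.gov/marketplace-puf/2014/rate-puf.zip
print(len(urls))                # 10
print(year_from_url(urls[-1]))  # 2023
```

Printing the generated URLs before downloading is a cheap way to catch a missing slash or an off-by-one in the year range.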




In [3]:
folders = ["benefits-and-cost-sharing", "rate", "plan-attributes", "business-rules", "service-area", "network"]
remaining_urls = ["/benefits-and-cost-sharing-puf.zip", "/rate-puf.zip", "/plan-attributes-puf.zip", "/business-rules-puf.zip", "/service-area-puf.zip", "/network-puf.zip"]
# Download every yearly archive of each dataset into its own folder
for folder, remaining_url in zip(folders, remaining_urls):
    urls = get_zip_urls(remaining_url)
    download_data(urls, folder)
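The extraction step can also be tried offline before starting the full download. This is a small sketch under stated assumptions: the helper `extract_zip_bytes`, the archive contents, and the file name `Rate_PUF.csv` are made up for illustration and are not taken from the CMS files.

```python
import io
import os
import tempfile
import zipfile


def extract_zip_bytes(zip_bytes, target_dir):
    """Extract an in-memory zip archive into target_dir and return the member names."""
    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# Build a tiny archive in memory; the file name and contents are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("Rate_PUF.csv", "BusinessYear,StateCode\n2014,AK\n")

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "rate", "2014")
    names = extract_zip_bytes(buffer.getvalue(), target)
    extracted = os.path.exists(os.path.join(target, "Rate_PUF.csv"))

print(names, extracted)
```

Creating the target folder once per archive, rather than once per file, keeps the loop simple and also covers archives that contain subfolders.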

## Step 2: Explore and Assess the Data

In [4]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425344 sha256=d569d4457f899e7c2e21e79f0d3d03be8aa3809c0348640a5b7b2b923a661b24
  Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9ac9241e5e44a01940da8fbb17fc
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.0


In [5]:
from pyspark.sql import SparkSession

In [6]:
from pyspark.sql.functions import countDistinct
from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.functions import count,when,col


In [7]:
spark = SparkSession.builder \
    .appName("HealthInsurance") \
    .getOrCreate()

### 1- Plan Attributes Raw Data

In [8]:
df_plan_raw = spark.read.option("header", "true").option("inferSchema", "true").csv("/content/health-insurance-raw-data/plan-attributes/*/*.csv")


df_plan_raw.show()
# df_plan_raw.filter(df_plan_raw.BusinessYear == "2023").show()



In [9]:
df_plan_raw.count()

227036

In [10]:
# df_plan_raw.printSchema()

In [11]:
df_plan_raw.select(countDistinct("StandardComponentId")).show()


+-----------------------------------+
|count(DISTINCT StandardComponentId)|
+-----------------------------------+
|                              17814|
+-----------------------------------+



In [12]:
df_plan_raw.filter(df_plan_raw.IssuerMarketPlaceMarketingName == "HIOS").show()



In [13]:
df_plan_raw.filter(df_plan_raw.IssuerId == "21989").show(5)



In [14]:
df_plan_raw.filter(df_plan_raw.IssuerId == "38344").show(5)




From the three cells above, we conclude that some of the data is outdated: organization names were initially recorded as `HIOS` and were replaced with actual names in more recent years. We don't need the rows with outdated data, so we can drop every row whose `IssuerMarketPlaceMarketingName` value is `HIOS`.



In [15]:
df_plan_raw = df_plan_raw.filter(df_plan_raw.IssuerMarketPlaceMarketingName != "HIOS")
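
One detail to keep in mind with this filter: in Spark SQL, comparing NULL with `!=` yields NULL rather than true, so rows with a missing name are dropped together with the `HIOS` rows. The snippet below is a plain-Python model of that behavior, not Spark code; the rows and the helper `keep_row` are made up for illustration.

```python
def keep_row(name, excluded="HIOS"):
    """Model of SQL `name != excluded`: a comparison with NULL (None) is not true, so the row is dropped."""
    if name is None:
        return False
    return name != excluded


# Hypothetical rows; None stands for a NULL marketing name
rows = [
    {"IssuerId": "1", "IssuerMarketPlaceMarketingName": "HIOS"},
    {"IssuerId": "2", "IssuerMarketPlaceMarketingName": "Example Health"},
    {"IssuerId": "3", "IssuerMarketPlaceMarketingName": None},
]
kept = [r["IssuerId"] for r in rows if keep_row(r["IssuerMarketPlaceMarketingName"])]
print(kept)  # ['2']
```

If rows with a missing name should be kept, the Spark condition would need an explicit `isNull()` branch.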


In [16]:
df_plan_raw.select(countDistinct("StandardComponentId")).show()


+-----------------------------------+
|count(DISTINCT StandardComponentId)|
+-----------------------------------+
|                              14503|
+-----------------------------------+



In [17]:
df_plan_raw.createOrReplaceTempView("plan_staging")


### 2- Network Raw Data

In [32]:
df_network_raw = spark.read.option("header", "true").option("inferSchema", "true").csv("/content/health-insurance-raw-data/network/*/*.csv")


df_network_raw.show()


+------------+---------+--------+----------+----------+-------------------+---------+----------+--------------------+---------+--------------------+---------+--------------+----------+
|BusinessYear|StateCode|IssuerId|SourceName|VersionNum|         ImportDate|IssuerId2|StateCode2|         NetworkName|NetworkId|          NetworkURL|RowNumber|MarketCoverage|DentalOnly|
+------------+---------+--------+----------+----------+-------------------+---------+----------+--------------------+---------+--------------------+---------+--------------+----------+
|        2015|       GA|   89942|      HIOS|         7|2015-02-19 06:21:02|    89942|        GA|Kaiser Permanente...|   GAN001|   kp.org/gaprovider|       13|          NULL|      NULL|
|        2015|       GA|   93332|      HIOS|        10|2014-09-29 21:43:29|    93332|        GA|        Atlanta HMOx|   GAN001|https://www.human...|       13|          NULL|      NULL|
|        2015|       GA|   93332|      HIOS|        10|2014-09-29 21:43:29|

In [37]:
df_network_raw.createOrReplaceTempView("network_staging")


# Create Data Model


## Organization Table

In [34]:
organization_table = spark.sql("""
                            SELECT  DISTINCT p.IssuerId AS OrganizationId,
                                    p.IssuerMarketPlaceMarketingName AS Name,
                                    p.StateCode AS StateCode
                            FROM plan_staging p

""")

organization_table.createOrReplaceTempView('organization_table')

organization_table.show()


+--------------+--------------------+---------+
|OrganizationId|                Name|StateCode|
+--------------+--------------------+---------+
|         23435|         BannerAetna|       AZ|
|         15995|               SERFF|       AR|
|         29497|               SERFF|       DE|
|         34210| Renaissance Dental |       WI|
|         27833|Ambetter of Illinois|       IL|
|         60075|TRUASSURE INSURAN...|       AL|
|         15833|            Guardian|       FL|
|         91908|Oscar Insurance C...|       OK|
|         24601|           BEST Life|       TN|
|         47638|Retailers Insuran...|       MI|
|         50274|               SERFF|       KS|
|         35700|               SERFF|       MI|
|         14186|               SERFF|       SD|
|         66759|   Dominion National|       NC|
|         27811|               SERFF|       KS|
|         34930|               SERFF|       MI|
|         99969|           MedMutual|       OH|
|         24832|Renaissance Life ...|   

In [25]:
organization_table.count()

1071

In [26]:
organization_table.select(countDistinct("OrganizationId")).show()


+------------------------------+
|count(DISTINCT OrganizationId)|
+------------------------------+
|                           797|
+------------------------------+



In [27]:
organization_table = organization_table.drop_duplicates(['OrganizationId','Name','StateCode'])
organization_table.count()

1071

In [28]:
from pyspark.sql.functions import desc

most_common_value = organization_table.groupBy("OrganizationId").count().orderBy(desc("count")).first()[0]
most_common_value

'36096'

In [29]:
organization_table.filter(organization_table.OrganizationId =="36096").show()

+--------------+--------------------+---------+
|OrganizationId|                Name|StateCode|
+--------------+--------------------+---------+
|         36096|               SERFF|       IL|
|         36096|                 OPM|       IL|
|         36096|Blue Cross and Bl...|       IL|
|         36096|Blue Cross and Bl...|       IL|
+--------------+--------------------+---------+



In [30]:
organization_table.filter(organization_table.Name =="OPM").show(5)

+--------------+----+---------+
|OrganizationId|Name|StateCode|
+--------------+----+---------+
|         26065| OPM|       SC|
|         46944| OPM|       AL|
|         11512| OPM|       NC|
|         14002| OPM|       TN|
|         96751| OPM|       NH|
+--------------+----+---------+
only showing top 5 rows



From the exploration above, we conclude that the issuer ID alone is not unique: for example, `36096` appears under several names.
An organization can exist in several states, so a row is identified by the organization name together with the state code and the ID (which represents the ID of that organization in that state).
Because the ID has duplicates across the data, we need to add a primary key.
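
The uniqueness check behind this conclusion can be sketched in plain Python. The rows below are copied from the outputs above; the helper `duplicated_keys` is a hypothetical name introduced only for this illustration.

```python
from collections import Counter

# Rows taken from the outputs above: (OrganizationId, Name, StateCode)
rows = [
    ("36096", "SERFF", "IL"),
    ("36096", "OPM", "IL"),
    ("26065", "OPM", "SC"),
    ("46944", "OPM", "AL"),
]


def duplicated_keys(rows, key_indexes):
    """Return the key values that occur in more than one row."""
    counts = Counter(tuple(row[i] for i in key_indexes) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


by_id = duplicated_keys(rows, [0])
by_all = duplicated_keys(rows, [0, 1, 2])
print(by_id)   # [('36096',)]
print(by_all)  # []
```

The surrogate key added in the next cell comes from `monotonically_increasing_id()`, whose values are unique and increasing but not necessarily consecutive.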

In [31]:
organization_table = organization_table.withColumn("Id", monotonically_increasing_id())
organization_table.show(5)

+--------------+--------------------+---------+---+
|OrganizationId|                Name|StateCode| Id|
+--------------+--------------------+---------+---+
|         23435|         BannerAetna|       AZ|  0|
|         15995|               SERFF|       AR|  1|
|         29497|               SERFF|       DE|  2|
|         34210| Renaissance Dental |       WI|  3|
|         27833|Ambetter of Illinois|       IL|  4|
+--------------+--------------------+---------+---+
only showing top 5 rows



In [None]:
# from pyspark.sql.functions import col

# df_filtered = organization_table.groupBy(col("Id")).count().filter(col("count") > 1)
# df_filtered.show()


## Insurance Plan Table

In [39]:
insurance_plan_table = spark.sql("""
                            SELECT  DISTINCT p.StandardComponentId AS Id,
                                    p.IssuerId AS OrganizationId,
                                    p.IssuerMarketPlaceMarketingName AS OrganizationName,
                                    p.StateCode AS StateCode
                            FROM plan_staging p

""")
# insurance_plan_table = insurance_plan_table.withColumn("plan_id", monotonically_increasing_id())

insurance_plan_table.createOrReplaceTempView('insurance_plan_table')

insurance_plan_table.show(5)


+--------------+--------------+--------------------+---------+
|            Id|OrganizationId|    OrganizationName|StateCode|
+--------------+--------------+--------------------+---------+
|28725AL0110001|         28725|  Renaissance Dental|       AL|
|53901AZ1420008|         53901|Blue Cross Blue S...|       AZ|
|77352AZ0020005|         77352|           BEST Life|       AZ|
|97667AZ0110016|         97667|Cigna HealthCare ...|       AZ|
|19898FL0340031|         19898|               AvMed|       FL|
+--------------+--------------+--------------------+---------+
only showing top 5 rows



In [40]:
insurance_plan_table.count()

16823

In [41]:
insurance_plan_table.filter(insurance_plan_table.Id == "53901AZ1420008").show()


+--------------+--------------+--------------------+---------+
|            Id|OrganizationId|    OrganizationName|StateCode|
+--------------+--------------+--------------------+---------+
|53901AZ1420008|         53901|Blue Cross Blue S...|       AZ|
+--------------+--------------+--------------------+---------+



In [45]:
insurance_plan_table.select(countDistinct("Id")).show()


+------------------+
|count(DISTINCT Id)|
+------------------+
|             14503|
+------------------+



In [44]:
insurance_plan_table = insurance_plan_table.drop_duplicates(['Id'])
insurance_plan_table.count()

14504
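
The two counts differ by one: 14503 distinct ids versus 14504 rows after `drop_duplicates`. One possible explanation, not verified against the data, is a row with a NULL `Id`: `count(DISTINCT ...)` ignores NULLs, while `drop_duplicates` keeps one such row. The snippet below is a plain-Python model of that difference with made-up ids.

```python
ids = ["A1", "A1", "B2", None, None]  # hypothetical plan ids; None stands for NULL

# count(DISTINCT Id) ignores NULLs
distinct_non_null = len({i for i in ids if i is not None})

# drop_duplicates(['Id']) keeps one row per value, including a single NULL row
rows_after_dedup = len(set(ids))

print(distinct_non_null, rows_after_dedup)  # 2 3
```

Filtering the table for rows where `Id` is NULL would confirm or rule out this explanation.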

In [43]:
insurance_plan_table.filter(insurance_plan_table.OrganizationId == "28725").show()


+--------------+--------------+------------------+---------+
|            Id|OrganizationId|  OrganizationName|StateCode|
+--------------+--------------+------------------+---------+
|28725AL0110001|         28725|Renaissance Dental|       AL|
|28725AL0150001|         28725|Renaissance Dental|       AL|
|28725AL0120002|         28725|Renaissance Dental|       AL|
|28725AL0100003|         28725|Renaissance Dental|       AL|
|28725AL0110002|         28725|Renaissance Dental|       AL|
|28725AL0130001|         28725|Renaissance Dental|       AL|
|28725AL0100004|         28725|Renaissance Dental|       AL|
|28725AL0130002|         28725|Renaissance Dental|       AL|
|28725AL0120001|         28725|Renaissance Dental|       AL|
|28725AL0140001|         28725|Renaissance Dental|       AL|
+--------------+--------------+------------------+---------+



## Network Table

In [72]:
network_table = spark.sql("""
                            SELECT  DISTINCT n.NetworkId AS NetworkId,
                                    n.IssuerId AS OrganizationId,
                                    n.NetworkName AS NetworkName,
                                    n.StateCode AS StateCode

                            FROM network_staging n

""")
network_table = network_table.withColumn("Id", monotonically_increasing_id())

network_table.createOrReplaceTempView('network_table')

network_table.show(5)


+---------+--------------+--------------------+---------+---+
|NetworkId|OrganizationId|         NetworkName|StateCode| Id|
+---------+--------------+--------------------+---------+---+
|   AKN001|         84394|Principal Plan De...|       AK|  0|
|   AZN001|         23307|        Phoenix HMOx|       AZ|  1|
|   GAN001|         69677|Ameritas PPO Dent...|       GA|  2|
|   WIN002|         79475|Blue Priority X -...|       WI|  3|
|   WYN001|         76197|Ameritas PPO Dent...|       WY|  4|
+---------+--------------+--------------------+---------+---+
only showing top 5 rows



In [59]:
network_table.filter(network_table.NetworkName == "Molina Marketplace").show()

+---------+--------------+------------------+---------+----+
|NetworkId|OrganizationId|       NetworkName|StateCode|  Id|
+---------+--------------+------------------+---------+----+
|   FLN001|         54172|Molina Marketplace|       FL| 691|
|   MIN001|         40047|Molina Marketplace|       MI|1157|
|   NMN001|         19722|Molina Marketplace|       NM|2171|
+---------+--------------+------------------+---------+----+



In [60]:
network_table.filter(network_table.NetworkName == "Phoenix HMOx").show()

+---------+--------------+------------+---------+---+
|NetworkId|OrganizationId| NetworkName|StateCode| Id|
+---------+--------------+------------+---------+---+
|   AZN001|         23307|Phoenix HMOx|       AZ|  1|
+---------+--------------+------------+---------+---+



In [79]:
network_table.filter(network_table.OrganizationId == "19722").show()

+---------+--------------+--------------------+---------+----+
|NetworkId|OrganizationId|         NetworkName|StateCode|  Id|
+---------+--------------+--------------------+---------+----+
|   NMN001|         19722|  Molina Marketplace|       NM|2171|
|   NMN001|         19722|Molina New Mexico...|       NM|2623|
|   NMN001|         19722|Molina Healthcare...|       NM|2674|
+---------+--------------+--------------------+---------+----+



In [80]:
network_table.filter(network_table.OrganizationId == "40047").show()

+---------+--------------+------------------+---------+----+
|NetworkId|OrganizationId|       NetworkName|StateCode|  Id|
+---------+--------------+------------------+---------+----+
|   MIN001|         40047|Molina Marketplace|       MI|1157|
+---------+--------------+------------------+---------+----+



In [None]:
# unique_network_ids = network_table.select("NetworkId").distinct().collect()
# unique_network_ids

#### Clean Network table

In [77]:
# Drop rows whose NetworkId holds a non-id value ("No", "Yes", "Individual") or is missing
network_table = network_table.filter(~col("NetworkId").isin("No", "Yes", "Individual") & col("NetworkId").isNotNull())
network_table.show()


+---------+--------------+--------------------+---------+---+
|NetworkId|OrganizationId|         NetworkName|StateCode| Id|
+---------+--------------+--------------------+---------+---+
|   AKN001|         84394|Principal Plan De...|       AK|  0|
|   AZN001|         23307|        Phoenix HMOx|       AZ|  1|
|   GAN001|         69677|Ameritas PPO Dent...|       GA|  2|
|   WIN002|         79475|Blue Priority X -...|       WI|  3|
|   WYN001|         76197|Ameritas PPO Dent...|       WY|  4|
|   IAN003|         18973|OA POS_Iowa_Patie...|       IA|  5|
|   NEN001|         90142|Lincoln DentalCon...|       NE|  6|
|   ORN001|         85804|LifeWise Health P...|       OR|  7|
|   IAN001|         11738|            DenteMax|       IA|  8|
|   OKN002|         77760|DDOK Network Indi...|       OK|  9|
|   FLN002|         27357|    Enhanced Network|       FL| 15|
|   PAN010|         16322|SHOP 10 County Se...|       PA| 16|
|   TNN001|         79913| Dentegra Dental PPO|       TN| 17|
|   ILN0

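In the samples above, a network name can appear in several states, but within a state it belongs to a single organization (for example, Molina Marketplace in FL, MI, and NM).
Within a state, an organization uses a single `NetworkId` in these samples, although that id can carry several network names (organization 19722 has three names under NMN001).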
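
This observation can be checked by grouping on network name and state and collecting the organizations in each group. The sketch below is plain Python rather than Spark and uses only rows copied from the outputs above; the variable names are illustrative.

```python
from collections import defaultdict

# Rows copied from the outputs above: (NetworkId, OrganizationId, NetworkName, StateCode)
rows = [
    ("FLN001", "54172", "Molina Marketplace", "FL"),
    ("MIN001", "40047", "Molina Marketplace", "MI"),
    ("NMN001", "19722", "Molina Marketplace", "NM"),
    ("AZN001", "23307", "Phoenix HMOx", "AZ"),
]

orgs_per_name_state = defaultdict(set)
for network_id, org_id, name, state in rows:
    orgs_per_name_state[(name, state)].add(org_id)

# Name/state pairs used by more than one organization
shared = {key: orgs for key, orgs in orgs_per_name_state.items() if len(orgs) > 1}
states_for_molina = sorted(
    state for (name, state) in orgs_per_name_state if name == "Molina Marketplace"
)

print(shared)             # {}
print(states_for_molina)  # ['FL', 'MI', 'NM']
```

Grouping on `NetworkId` instead of `NetworkName` gives a different picture: the next cell shows AKN001 used by many organizations in AK, so the id alone does not identify a network.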

In [81]:
network_table.filter(network_table.NetworkId == "AKN001").show()

+---------+--------------+--------------------+---------+----+
|NetworkId|OrganizationId|         NetworkName|StateCode|  Id|
+---------+--------------+--------------------+---------+----+
|   AKN001|         84394|Principal Plan De...|       AK|   0|
|   AKN001|         21989|Delta Dental Premier|       AK|  43|
|   AKN001|         45858|Ameritas PPO Dent...|       AK|  49|
|   AKN001|         38536|Lincoln Dental Co...|       AK|  62|
|   AKN001|         73836|     Endeavor Select|       AK| 149|
|   AKN001|         74819|    CONNECTIONDental|       AK| 376|
|   AKN001|         81761|Ameritas PPO Dent...|       AK| 471|
|   AKN001|         21989|         ODS Premier|       AK| 750|
|   AKN001|         96211|Connection Dental...|       AK|1274|
|   AKN001|         38344|        HeritagePlus|       AK|1451|
|   AKN001|         47904|  Renaissance Dental|       AK|1475|
|   AKN001|         73836|Moda Plus AK Regi...|       AK|1640|
|   AKN001|         58670|           indemnity|       A

# Data Analysis


Show the Top 5 Networks with the greatest number of organizations

In [86]:
from pyspark.sql.functions import desc

# Count the rows of network_table for each network name
network_name_counts = network_table.groupBy("NetworkName").count()

# Keep the 5 network names with the highest counts
top_networks = network_name_counts.orderBy(desc("count")).limit(5)

top_networks.show()



+--------------------+-----+
|         NetworkName|count|
+--------------------+-----+
|Ameritas PPO Dent...|   96|
|            DenteMax|   54|
|        Delta Dental|   38|
| Dentegra Dental PPO|   36|
|  Renaissance Dental|   34|
+--------------------+-----+
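
The query above counts rows of `network_table` per network name, so an organization listed several times under the same name (for example, in two states) would be counted more than once. If the goal is the number of distinct organizations, one option is to count unique `OrganizationId` values per name; in Spark this could be done with `countDistinct("OrganizationId")` inside `agg`. The plain-Python sketch below shows the idea with made-up rows; `top_networks_by_organizations` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical rows: (NetworkName, OrganizationId, StateCode)
rows = [
    ("Network A", "111", "AK"),
    ("Network A", "111", "AZ"),  # same organization, second state
    ("Network A", "222", "AK"),
    ("Network B", "333", "FL"),
]


def top_networks_by_organizations(rows, n=5):
    """Rank network names by the number of distinct organizations using them."""
    orgs = defaultdict(set)
    for name, org_id, _state in rows:
        orgs[name].add(org_id)
    # Sort by descending organization count, then by name for a stable order
    ranked = sorted(orgs.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(name, len(ids)) for name, ids in ranked[:n]]


result = top_networks_by_organizations(rows)
print(result)  # [('Network A', 2), ('Network B', 1)]
```

With row counting, "Network A" would score 3 here; counting distinct organizations gives 2.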

