# RUN THIS NOTEBOOK IN ENV: online_reviews_analysis_env

In [1]:
!conda env list

# conda environments:
#
                         C:\Program Files\Orange
base                     C:\ProgramData\Anaconda3
AutoCoder                C:\Users\PhillipRashaad\.conda\envs\AutoCoder
ChatDev_conda_env        C:\Users\PhillipRashaad\.conda\envs\ChatDev_conda_env
PandasProfileEnv         C:\Users\PhillipRashaad\.conda\envs\PandasProfileEnv
SMOP_env                 C:\Users\PhillipRashaad\.conda\envs\SMOP_env
autogen_autobuild_env     C:\Users\PhillipRashaad\.conda\envs\autogen_autobuild_env
autogen_studio_env       C:\Users\PhillipRashaad\.conda\envs\autogen_studio_env
automemgpt_env           C:\Users\PhillipRashaad\.conda\envs\automemgpt_env
classy_env               C:\Users\PhillipRashaad\.conda\envs\classy_env
crewai_env               C:\Users\PhillipRashaad\.conda\envs\crewai_env
crewai_job_search_env     C:\Users\PhillipRashaad\.conda\envs\crewai_job_search_env
crewai_newsletter_env     C:\Users\PhillipRashaad\.conda\envs\crewai_newsletter_env
crewai_poetry_env        C:

# SYNTHETIC DATA NOTEBOOK OVERVIEW

1. Customer Demographic Table
2. Online Reviews Table

# 1. CUSTOMER DEMOGRAPHIC TABLE

### Synthetic Data Creation based on Aetna Insurance

This project uses Aetna Insurance as the model for the synthetic demographic data. The data covers age, gender, location, and plan type for a sample of customers.

#### Location
Aetna provides insurance in the following 17 states:
- Arizona, California, Delaware, Florida, Georgia, Illinois, Indiana, Kansas, Maryland, Missouri, North Carolina, New Jersey, Nevada, Ohio, Texas, Utah, Virginia

These states are grouped into four regions:
- **North**: Illinois, Indiana, Kansas, Ohio
- **West**: Arizona, California, Nevada, Utah
- **East**: Delaware, Maryland, New Jersey, Virginia
- **South**: Florida, Georgia, Missouri, North Carolina, Texas
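
The region grouping above can be written as a lookup table. This is a minimal sketch; the names `REGION_TO_STATES` and `STATE_TO_REGION` are illustrative assumptions and may not match what `data_generation` uses internally.

```python
# Hypothetical constant names, used for illustration only.
REGION_TO_STATES = {
    "North": ["Illinois", "Indiana", "Kansas", "Ohio"],
    "West": ["Arizona", "California", "Nevada", "Utah"],
    "East": ["Delaware", "Maryland", "New Jersey", "Virginia"],
    "South": ["Florida", "Georgia", "Missouri", "North Carolina", "Texas"],
}

# Invert the grouping so each state maps to exactly one region.
STATE_TO_REGION = {
    state: region
    for region, states in REGION_TO_STATES.items()
    for state in states
}

print(len(STATE_TO_REGION))           # number of service states covered
print(STATE_TO_REGION["New Jersey"])  # region lookup for a single state
```

Keeping the mapping keyed by region makes it easy to compare against the list above, while the inverted dict is the form that a `State` column can be mapped through.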

#### Plan Type
Aetna offers two main plan types:
1. Individual and Family Plans (Mean age = 40 years, Std = 15 years, 69% of users)
2. Medicare Plans (Mean age = 70 years, Std = 5 years, 31% of users)

#### Age
For synthetic data generation:
- Ages for Individual and Family plans follow a normal distribution with mean 40 and standard deviation 15, restricted to the range 25 to 64.
- Ages for Medicare plans follow a right-skewed gamma distribution, restricted to the range 65 to 80.
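
A rough sketch of how these two age distributions could be sampled with NumPy is shown below. The function names are hypothetical, and the gamma `shape` and `scale` values are assumptions chosen so the Medicare ages average about 70; the source does not state the parameters actually used. Out-of-range draws are simply redrawn, which is one of several ways to enforce the bounds.

```python
import numpy as np

def sample_individual_ages(n, rng, mean=40, std=15, low=25, high=64):
    """Draw n integer ages from a normal distribution, redrawing values outside [low, high]."""
    ages = np.rint(rng.normal(mean, std, size=n))
    out_of_range = (ages < low) | (ages > high)
    while out_of_range.any():
        ages[out_of_range] = np.rint(rng.normal(mean, std, size=out_of_range.sum()))
        out_of_range = (ages < low) | (ages > high)
    return ages.astype(int)

def sample_medicare_ages(n, rng, low=65, high=80, shape=2.0, scale=2.5):
    """Draw n integer ages as low + gamma noise (right-skewed), redrawing values above high.

    shape and scale are assumed values, not taken from the notebook.
    """
    ages = np.floor(low + rng.gamma(shape, scale, size=n))
    too_old = ages > high
    while too_old.any():
        ages[too_old] = np.floor(low + rng.gamma(shape, scale, size=too_old.sum()))
        too_old = ages > high
    return ages.astype(int)

rng = np.random.default_rng(42)
individual_ages = sample_individual_ages(700, rng)
medicare_ages = sample_medicare_ages(300, rng)
print(individual_ages.min(), individual_ages.max())
print(medicare_ages.min(), medicare_ages.max())
```

Because values outside the range are redrawn, the realized mean and standard deviation of the Individual and Family ages will differ somewhat from the nominal 40 and 15.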

#### Gender
The gender distribution is:
- Female: 55%
- Male: 40%
- Other: 5%
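
Sampling a categorical column with fixed proportions can be sketched with NumPy's `Generator.choice`. This is only an illustration of the technique under the percentages listed above; the seed and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Draw 1000 gender labels with the stated probabilities.
genders = rng.choice(["Female", "Male", "Other"], size=1000, p=[0.55, 0.40, 0.05])

# Compare the realized shares with the targets.
values, counts = np.unique(genders, return_counts=True)
shares = dict(zip(values.tolist(), (counts / counts.sum()).round(3).tolist()))
print(shares)
```

The realized shares will hover around the targets rather than match them exactly, since each customer is drawn independently.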

#### Customer Demographic Table
A table with 1000 customers includes:
- **Customer_ID**: Integer from 1 to 1000
- **Plan_Type**: "Individual and Family" (roughly 70%) or "Medicare" (roughly 30%)
- **Age**: Based on plan type distribution
- **Gender**: Female (55%), Male (40%), Other (5%)
- **State**: Randomly chosen from the 17 Aetna service states
- **Region**: Mapped from state using the predefined region grouping
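
One way to sanity-check a generated table against this specification is sketched below. The helper `check_customer_table` and its constants are hypothetical and not part of `data_generation`; the sample rows are the five printed by `customer_df.head()` in this notebook.

```python
import pandas as pd

# Age ranges per plan type, taken from the Age section of this notebook.
AGE_RANGES = {"Individual and Family": (25, 64), "Medicare": (65, 80)}
GENDERS = {"Female", "Male", "Other"}

def check_customer_table(df):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if df["Customer_ID"].tolist() != list(range(1, len(df) + 1)):
        problems.append("Customer_ID is not the sequence 1..n")
    if not set(df["Plan_Type"]).issubset(AGE_RANGES):
        problems.append("unexpected Plan_Type value")
    if not set(df["Gender"]).issubset(GENDERS):
        problems.append("unexpected Gender value")
    for plan, (low, high) in AGE_RANGES.items():
        ages = df.loc[df["Plan_Type"] == plan, "Age"]
        if not ages.between(low, high).all():
            problems.append(f"Age outside {low}-{high} for {plan}")
    return problems

sample = pd.DataFrame({
    "Customer_ID": [1, 2, 3, 4, 5],
    "Plan_Type": ["Individual and Family", "Medicare", "Medicare",
                  "Individual and Family", "Individual and Family"],
    "Age": [42, 68, 69, 46, 46],
    "Gender": ["Male", "Male", "Male", "Other", "Female"],
})
problems = check_customer_table(sample)
print(problems)
```

Returning a list of messages instead of raising on the first failure makes it easier to see every mismatch in a single run.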

In [6]:

import sys
import os

# Add the src directory to the system path so data_generation can be imported
sys.path.append(os.path.abspath(os.path.join('..', 'src')))


# Import the classes
from data_generation import CustomerDemographicTable, SyntheticReviewGenerator

# Generate customer demographic data
customer_table_generator = CustomerDemographicTable(n_customers=1000, random_seed=42)

customer_df = customer_table_generator.generate_data()
print("Customer Demographic Data")
print(customer_df.shape)

customer_df.head()

Customer Demographic Data
(1000, 6)


|   | Customer_ID | Plan_Type | Age | Gender | State | Region |
|---|---|---|---|---|---|---|
| 0 | 1 | Individual and Family | 42 | Male | New Jersey | East |
| 1 | 2 | Medicare | 68 | Male | Missouri | South |
| 2 | 3 | Medicare | 69 | Male | Georgia | South |
| 3 | 4 | Individual and Family | 46 | Other | Maryland | East |
| 4 | 5 | Individual and Family | 46 | Female | California | West |


In [None]:
# Save the customer data to a CSV file
customer_df.to_csv('customer_demographic_data.csv', index=False)

# 2. ONLINE REVIEWS TABLE

### HEALTH INSURANCE TOPICS

The review generator draws on a wider range of topics for both positive and negative reviews, which makes the synthetic data more varied and realistic.

#### Positive Topics
Ten topics that customers commonly praise in health insurance services:

1. **Comprehensive Coverage**: Praise for policies that offer extensive coverage for a wide range of medical services, treatments, and medications.
2. **Affordable Premiums**: Satisfaction with reasonable and affordable premium costs.
3. **Quick Claims Processing**: Positive feedback on fast and efficient claims processing.
4. **Helpful Customer Service**: Commendations for responsive, knowledgeable, and helpful customer service representatives.
5. **Network Quality**: Appreciation for a large network of in-network providers, including top hospitals and specialists.
6. **Easy Access to Care**: Praise for ease of access to necessary medical care without extensive preauthorization requirements.
7. **User-Friendly Online Tools**: Positive comments about the convenience of online portals and mobile apps for managing policies, claims, and healthcare needs.
8. **Preventive Care Coverage**: Satisfaction with coverage for preventive care services, such as annual check-ups, screenings, and vaccinations.
9. **Clear Communication**: Praise for clear, transparent communication regarding policy details, benefits, and coverage changes.
10. **Health and Wellness Programs**: Appreciation for additional health and wellness programs, such as fitness discounts, telehealth services, and wellness incentives.

These topics are often highlighted in customer reviews and surveys, showcasing the aspects of health insurance services that customers find most valuable and satisfactory.


#### Negative Topics
Ten topics that customers commonly complain about in health insurance services:

1. **Claim Denials**: Complaints about claims being denied or not being covered as expected.
2. **Billing Issues**: Problems with billing errors, unexpected charges, or difficulties understanding bills.
3. **Coverage Limitations**: Frustrations about certain treatments, medications, or services not being covered or having limited coverage.
4. **Customer Service**: Negative experiences with customer service representatives, including long wait times and unhelpful responses.
5. **Premium Costs**: Concerns about high premium costs or unexpected increases in premiums.
6. **Network Issues**: Difficulties finding in-network doctors or hospitals and dissatisfaction with the network of providers.
7. **Preauthorization Requirements**: Complaints about the complexity and delays associated with preauthorization processes for certain treatments or medications.
8. **Policy Changes**: Frustration with changes in policy terms or benefits, often with little notice.
9. **Lack of Transparency**: Issues with understanding what is covered and the terms of the policy due to unclear or confusing information.
10. **Appeals Process**: Problems with the appeals process for denied claims, including lengthy and complicated procedures.

These topics are commonly highlighted in customer reviews and surveys related to health insurance services.

## 2-A) CLASS OBJECT - Synthetic Generator

### PART 1 - Class Object
- Made the Rating/Sentiment align with the Review Text
- Added more variety to the positive and negative prompts
- Added Prompt column to evaluate topic modeling algorithms
- Added Company_Name to make this class object dynamic for other insurance companies.
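
A sketch of how a topic-aware prompt could be built and stored alongside each review follows. The exact wording used by `SyntheticReviewGenerator` is not shown in this notebook, so the template after the company name, the `build_prompt` helper, and the idea of returning the topic separately are all assumptions for illustration.

```python
import random

POSITIVE_TOPICS = [
    "Comprehensive Coverage", "Affordable Premiums", "Quick Claims Processing",
    "Helpful Customer Service", "Network Quality", "Easy Access to Care",
    "User-Friendly Online Tools", "Preventive Care Coverage",
    "Clear Communication", "Health and Wellness Programs",
]
NEGATIVE_TOPICS = [
    "Claim Denials", "Billing Issues", "Coverage Limitations", "Customer Service",
    "Premium Costs", "Network Issues", "Preauthorization Requirements",
    "Policy Changes", "Lack of Transparency", "Appeals Process",
]

def build_prompt(sentiment, company_name, rng):
    """Pick a topic that matches the sentiment and return (topic, prompt).

    The prompt template is an assumed wording, not the generator's actual one.
    """
    topics = POSITIVE_TOPICS if sentiment == "positive" else NEGATIVE_TOPICS
    topic = rng.choice(topics)
    prompt = (
        f"Write a {sentiment} customer review for {company_name}, "
        f"focusing on {topic.lower()}."
    )
    return topic, prompt

rng = random.Random(42)
topic, prompt = build_prompt("negative", "Aetna's health insurance service", rng)
print(topic)
print(prompt)
```

Storing the prompt (or the chosen topic) with each review gives a known label that the output of a topic model can later be compared against.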

### PART 2 - Class Object
- Added a year-over-year upward trend in Rating, with ratings drawn from a normal distribution.
- Incorporated a dict that maps each year to its desired Rating mean.
- Used a constant standard deviation of 2 for every year.
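
The year-dependent rating draw described above might look roughly like the sketch below. The helper names are hypothetical. Rounding and clipping to the 1 to 5 scale, and treating ratings of 4 or higher as positive, are assumptions; the notebook does not show how the class handles either step.

```python
import numpy as np

RATING_MEANS = {2020: 1.5, 2021: 1.9, 2022: 2.4, 2023: 3.1, 2024: 4.2}

def sample_rating(year, rng, rating_means=RATING_MEANS, std_dev=2.0):
    """Draw one rating from N(mean for that year, std_dev), rounded and clipped to 1..5."""
    raw = rng.normal(rating_means[year], std_dev)
    return int(np.clip(np.rint(raw), 1, 5))

def rating_to_sentiment(rating):
    # Assumed cutoff: 4 and 5 count as positive, 1 to 3 as negative.
    return "positive" if rating >= 4 else "negative"

rng = np.random.default_rng(42)
yearly_means = {}
for year in RATING_MEANS:
    ratings = [sample_rating(year, rng) for _ in range(2000)]
    yearly_means[year] = float(np.mean(ratings))
    print(year, round(yearly_means[year], 2))
```

With a standard deviation of 2 on a 5-point scale, clipping pulls the realized yearly means toward the middle of the scale, so they will not match the target means exactly even though the upward trend is preserved.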



In [28]:
from dotenv import load_dotenv
import os

# Load the environment variables from .env file
load_dotenv()



True

In [32]:
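# Get OpenAI API key
api_key = os.getenv("OPENAI_KEY")
if api_key is None:
    raise RuntimeError("OPENAI_KEY is not set; add it to the .env file")

# Print only the key length so the secret itself never appears in the output
print("LEN: ",len(api_key))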

LEN:  51


In [31]:

# Generate synthetic reviews
rating_mean_dict = {
    2020: 1.5,
    2021: 1.9,
    2022: 2.4,
    2023: 3.1,
    2024: 4.2
}


# Use the class object
generator = SyntheticReviewGenerator(
    api_key, "Aetna's health insurance service",
    num_customers=1000, num_reviews=1000,
    start_date="1/1/2020", end_date="6/16/2024",
    rating_mean_dict=rating_mean_dict, std_dev=2.0,
)


reviews_df = generator.generate_reviews()
print("Synthetic Reviews Data")
print(reviews_df.shape)

reviews_df.head()


Synthetic Reviews Data
(5, 8)


|   | Review_ID | Customer_ID | Review_Date | Rating | Review_Text | Sentiment | Prompt | Company_Name |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 9 | 2023-08-02 | 2 | "Aetna has been nothing but a headache since I... | negative | Write a negative customer review for Aetna's h... | Aetna's health insurance service |
| 1 | 2 | 4 | 2020-06-27 | 2 | Unfortunately, my experience with Aetna's heal... | negative | Write a negative customer review for Aetna's h... | Aetna's health insurance service |
| 2 | 3 | 9 | 2021-04-21 | 2 | I have been incredibly disappointed with Aetna... | negative | Write a negative customer review for Aetna's h... | Aetna's health insurance service |
| 3 | 4 | 1 | 2022-05-09 | 1 | I honestly wish I could give Aetna zero stars.... | negative | Write a negative customer review for Aetna's h... | Aetna's health insurance service |
| 4 | 5 | 5 | 2024-04-03 | 5 | I recently had an incredible experience with A... | positive | Write a positive customer review for Aetna's h... | Aetna's health insurance service |


In [None]:

# Save the reviews data to a CSV file
reviews_df.to_csv('synthetic_reviews_data.csv', index=False)

# 3. MERGE DATASETS

## Benefits of Merging Datasets

Merging the datasets offers several advantages for building dashboards:

- **Streamlined Process**: Consolidating all data into a single source simplifies the workflow, making it easier to manage and visualize.
- **Enhanced Performance**: A unified data source can shorten loading times and make filtering more efficient, because the dashboard does not have to join tables on the fly.
- **Consistency**: Combining datasets ensures uniformity across all data points, reducing discrepancies and potential errors.
- **Simplified Data Management**: Handling a single dataset is less complex than managing multiple sources, facilitating easier updates and maintenance.

**NOTE**: Merging very large datasets or live datasets may not be practical in all cases. Consider the size and nature of the data before deciding to merge.

In [None]:
import pandas as pd

# Left-join reviews onto customers, renaming each Customer_ID column so both are kept
merged_data_with_ids = pd.merge(
                                customer_df.rename(columns={"Customer_ID": "Customer_ID_Demo"}),
                                reviews_df.rename(columns={"Customer_ID": "Customer_ID_Reviews"}),
                                left_on="Customer_ID_Demo",
                                right_on="Customer_ID_Reviews",
                                how="left"
                            )
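
A left join keeps every customer, so customers without reviews appear once with empty review columns. The sketch below shows one way to make that visible and to guard against duplicate customer rows, using the `validate` and `indicator` options of `pd.merge`. The two small frames are made-up toy data, not rows from this notebook.

```python
import pandas as pd

# Toy data for illustration only.
customers = pd.DataFrame({
    "Customer_ID": [1, 2, 3],
    "Plan_Type": ["Medicare", "Individual and Family", "Medicare"],
})
reviews = pd.DataFrame({
    "Review_ID": [1, 2, 3],
    "Customer_ID": [1, 1, 3],
    "Rating": [2, 5, 4],
})

merged = pd.merge(
    customers,
    reviews,
    on="Customer_ID",
    how="left",
    validate="one_to_many",  # raises if a Customer_ID repeats on the customer side
    indicator=True,          # adds a _merge column: "both" or "left_only"
)
print(merged[["Customer_ID", "Review_ID", "_merge"]])

# Customers that have no review at all.
no_review = merged.loc[merged["_merge"] == "left_only", "Customer_ID"].tolist()
print(no_review)
```

Checking the `_merge` column before building a dashboard helps explain why some rows have missing ratings and review text.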


In [None]:
# Define a function to assign age groups based on the given criteria
def assign_age_group(age):
    if age < 25:
        return "18 to 24"
    elif age < 35:
        return "25 to 34"
    elif age < 45:
        return "35 to 44"
    elif age < 55:
        return "45 to 54"
    elif age < 65:
        return "55 to 64"
    elif age < 70:
        return "65 to 69"
    elif age < 75:
        return "70 to 74"
    else:
        return "75+"

# Apply the function to create the Age_Group column
merged_data_with_ids["Age_Group"] = merged_data_with_ids["Age"].apply(assign_age_group)

merged_data_with_ids.head()
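
As an alternative to the `if`/`elif` chain, the same age groups can be produced in a vectorized way with `pd.cut`. This is a sketch under the same cutoffs as `assign_age_group`; the constant names and the sample ages are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Left-closed bins that mirror the thresholds in assign_age_group.
AGE_BINS = [-np.inf, 25, 35, 45, 55, 65, 70, 75, np.inf]
AGE_LABELS = [
    "18 to 24", "25 to 34", "35 to 44", "45 to 54",
    "55 to 64", "65 to 69", "70 to 74", "75+",
]

ages = pd.Series([24, 25, 42, 64, 65, 69, 70, 74, 75, 80], name="Age")

# right=False makes each bin include its lower edge and exclude its upper edge.
age_groups = pd.cut(ages, bins=AGE_BINS, labels=AGE_LABELS, right=False)
print(age_groups.astype(str).tolist())
```

The result of `pd.cut` is an ordered categorical, so the groups sort in age order rather than alphabetically, which tends to be convenient for dashboard axes and filters.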


In [None]:
merged_data_with_ids.to_csv('03 - synthetic_merged_data.csv', index=False)