# Day 11 of 100 Days SQL Challenge

## About this dataset
<p>This dataset describes insurance claims and the demographics and health of the patients behind them. It records each patient's age, gender, BMI (Body Mass Index), blood pressure, diabetic status, number of children, smoking status, and region, along with the claim amount. Analyzing these factors across regions and across demographic groups such as age or gender shows who is most likely to receive an insurance claim. That understanding can inform decisions about potential customers and, on a broader scale, public policy that targets support at those who are most in need and vulnerable.</p>

 
**Dataset Link -** [Kaggle](https://www.kaggle.com/datasets/thedevastator/insurance-claim-analysis-demographic-and-health)

### DATA DICTIONARY 


1. **PatientID -** Unique identifier for each patient.
2. **age -** Age of the patient.
3. **gender -** Gender of the patient (e.g., male, female).
4. **bmi -** Body Mass Index, a measure of body fat calculated from the patient's height and weight.
5. **bloodpressure -** Blood pressure of the patient.
6. **diabetic -** Indicates whether the patient is diabetic (e.g., Yes, No).
7. **children -** Number of children the patient has.
8. **smoker -** Indicates whether the patient is a smoker (e.g., Yes, No).
9. **region -** Geographic region associated with the patient.
10. **claim -** Insurance claim amount for the patient.



In [1]:
import pandas as pd
import sqlite3

# Establish a connection to the SQLite database
db_path = "FinalDB.db"  # Replace with the correct path if not in the same directory
connection = sqlite3.connect(db_path)

In [2]:
%load_ext sql



In [3]:

%sql sqlite:///FinalDB.db


In [4]:

# path for the data
path = "insurance_data.csv"

# Load data into Pandas DataFrames from the provided path
insurance_df = pd.read_csv(path)


# Use the to_sql method to create tables and insert DataFrames into SQLite tables

insurance_df.to_sql('insurance', connection, index=False, if_exists='replace')


1340

In [7]:
%%sql
SELECT name FROM sqlite_master WHERE type='table';



 * sqlite:///FinalDB.db
Done.


name
insurance


In [9]:
%%sql
select count(*) as `Number of rows` from insurance

 * sqlite:///FinalDB.db
Done.


Number of rows
1340


**The cell above shows that the table contains 1340 rows.**

In [10]:
%%sql
PRAGMA table_info(insurance);


 * sqlite:///FinalDB.db
Done.


cid,name,type,notnull,dflt_value,pk
0,index,INTEGER,0,,0
1,PatientID,INTEGER,0,,0
2,age,REAL,0,,0
3,gender,TEXT,0,,0
4,bmi,REAL,0,,0
5,bloodpressure,INTEGER,0,,0
6,diabetic,TEXT,0,,0
7,children,INTEGER,0,,0
8,smoker,TEXT,0,,0
9,region,TEXT,0,,0


In [17]:
%%sql 
select * from insurance limit 10

 * sqlite:///FinalDB.db
Done.


index,PatientID,age,gender,bmi,bloodpressure,diabetic,children,smoker,region,claim
0,1,39.0,male,23.2,91,Yes,0,No,southeast,1121.87
1,2,24.0,male,30.1,87,No,0,No,southeast,1131.51
2,3,,male,33.3,82,Yes,0,No,southeast,1135.94
3,4,,male,33.7,80,No,0,No,northwest,1136.4
4,5,,male,34.1,100,No,0,No,northwest,1137.01
5,6,,male,34.4,96,Yes,0,No,northwest,1137.47
6,7,,male,37.3,86,Yes,0,No,northwest,1141.45
7,8,19.0,male,41.1,100,No,0,No,northwest,1146.8
8,9,20.0,male,43.0,86,No,0,No,northwest,1149.4
9,10,30.0,male,53.1,97,No,0,No,northwest,1163.46
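
The preview above shows several rows with no `age`. A minimal sketch of counting such gaps per column, assuming the blanks are stored as NULL (which is how pandas `to_sql` normally writes missing values). The table here is a toy in-memory copy with made-up rows, not the real data:

```python
import sqlite3

# Toy rows, made up for illustration (not taken from the Kaggle data).
rows = [
    (39.0, "southeast"),
    (None, "southeast"),   # missing age
    (None, "northwest"),   # missing age
    (19.0, None),          # missing region
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE insurance (age REAL, region TEXT)")
conn.executemany("INSERT INTO insurance VALUES (?, ?)", rows)

# COUNT(*) counts rows; COUNT(col) skips NULLs,
# so the difference is the number of NULLs in that column.
total_rows, missing_age, missing_region = conn.execute(
    """
    SELECT COUNT(*),
           COUNT(*) - COUNT(age),
           COUNT(*) - COUNT(region)
    FROM insurance
    """
).fetchone()
print(total_rows, missing_age, missing_region)
```

The same `SELECT` can be run in a `%%sql` cell against the real `insurance` table to see how many ages and regions are missing before aggregating.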


### Problem 1

How many patients have claimed more than the average claim amount for patients who are smokers and have at least one child, and belong to the southeast region?

In [13]:
%%sql
SELECT COUNT(*) AS `number of patients`
FROM insurance
WHERE smoker = 'Yes'
  AND children >= 1
  AND region = 'southeast'
  AND claim > (SELECT AVG(claim) FROM insurance);


 * sqlite:///FinalDB.db
Done.


number of patients
51
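
The query above compares each claim with the average over all patients. The problem statement can also be read as asking for the average within the subgroup itself (smokers with at least one child in the southeast). A sketch of that second reading, assuming this interpretation and using a toy in-memory table with made-up rows; the `target` CTE name is only illustrative:

```python
import sqlite3

# Toy rows, made up for illustration (not taken from the Kaggle data).
rows = [
    # smoker, children, region, claim
    ("Yes", 1, "southeast", 100.0),
    ("Yes", 2, "southeast", 300.0),
    ("Yes", 3, "southeast", 500.0),
    ("No", 1, "southeast", 9000.0),
    ("Yes", 0, "southeast", 9000.0),
    ("Yes", 1, "northwest", 9000.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE insurance (smoker TEXT, children INTEGER, region TEXT, claim REAL)"
)
conn.executemany("INSERT INTO insurance VALUES (?, ?, ?, ?)", rows)

# Reading A: compare against the average claim of the whole table.
overall_count = conn.execute(
    """
    SELECT COUNT(*) FROM insurance
    WHERE smoker = 'Yes' AND children >= 1 AND region = 'southeast'
      AND claim > (SELECT AVG(claim) FROM insurance)
    """
).fetchone()[0]

# Reading B: compare against the average claim of the subgroup only.
subgroup_count = conn.execute(
    """
    WITH target AS (
        SELECT claim FROM insurance
        WHERE smoker = 'Yes' AND children >= 1 AND region = 'southeast'
    )
    SELECT COUNT(*) FROM target
    WHERE claim > (SELECT AVG(claim) FROM target)
    """
).fetchone()[0]

print(overall_count, subgroup_count)
```

On the toy rows the two readings give different counts, so it is worth deciding which average the question means before reporting a number.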


### Problem 2
How many patients have claimed more than the average claim amount for patients who are not smokers and have a BMI greater than the average BMI for patients who have at least one child?

In [15]:
%%sql
SELECT COUNT(*) AS `Number of patients`
FROM insurance 
WHERE smoker = 'No' 
  AND children >= 1 
  AND bmi > (SELECT AVG(bmi) FROM insurance)
  AND claim > (SELECT AVG(claim) FROM insurance);


 * sqlite:///FinalDB.db
Done.


Number of patients
45


### Problem 3
How many patients have claimed more than the average claim amount for patients who have a BMI greater than the average BMI for patients who are diabetic, have at least one child, and are from the southwest region?

In [16]:
%%sql
SELECT COUNT(*) AS `Number of patients`
FROM insurance 
WHERE smoker = 'No'  -- extra filter: the problem statement does not mention smoking status
  AND children >= 1
  AND diabetic = 'Yes'
  AND region = 'southwest'
  AND bmi > (SELECT AVG(bmi) FROM insurance)
  AND claim > (SELECT AVG(claim) FROM insurance);


 * sqlite:///FinalDB.db
Done.


Number of patients
4


### Problem 4
What is the difference in the average claim amount between patients who are smokers and patients who are non-smokers, and have the same BMI and number of children?

In [18]:
%%sql
WITH AvgClaims AS (
    SELECT smoker,
           AVG(claim) AS avg_claim
    FROM insurance
    GROUP BY smoker, bmi, children
)

SELECT AVG(CASE WHEN smoker = 'Yes' THEN avg_claim ELSE -avg_claim END) AS avg_claim_difference
FROM AvgClaims;


 * sqlite:///FinalDB.db
Done.


avg_claim_difference
2813.9494850299347
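
The query above averages signed group means, so groups that exist for only one smoking status still contribute to the result. If "the same BMI and number of children" is read as a paired comparison, one option is to join smoker groups to non-smoker groups on `bmi` and `children` and average the differences. A sketch under that assumption, on a toy in-memory table with made-up rows; the `grp` CTE and the aliases `s` and `n` are illustrative names:

```python
import sqlite3

# Toy rows, made up for illustration (not taken from the Kaggle data).
rows = [
    # smoker, bmi, children, claim
    ("Yes", 30.0, 1, 5000.0),
    ("Yes", 30.0, 1, 7000.0),
    ("No", 30.0, 1, 2000.0),
    ("Yes", 25.0, 0, 3000.0),
    ("No", 25.0, 0, 1000.0),
    ("No", 40.0, 2, 800.0),    # no matching smoker group
    ("Yes", 22.0, 3, 9000.0),  # no matching non-smoker group
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE insurance (smoker TEXT, bmi REAL, children INTEGER, claim REAL)"
)
conn.executemany("INSERT INTO insurance VALUES (?, ?, ?, ?)", rows)

# Average claim per (smoker, bmi, children) group, then pair each smoker
# group with the non-smoker group that has the same bmi and children.
paired_difference = conn.execute(
    """
    WITH grp AS (
        SELECT smoker, bmi, children, AVG(claim) AS avg_claim
        FROM insurance
        GROUP BY smoker, bmi, children
    )
    SELECT AVG(s.avg_claim - n.avg_claim)
    FROM grp AS s
    JOIN grp AS n
      ON s.bmi = n.bmi AND s.children = n.children
    WHERE s.smoker = 'Yes' AND n.smoker = 'No'
    """
).fetchone()[0]

print(paired_difference)
```

Unmatched groups drop out of the inner join, so only like-for-like pairs contribute to the difference.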


### Problem 5
Find the average BMI for male and female patients separately.



In [22]:
%%sql
select gender, avg(bmi) as `average bmi`
from insurance
group by gender

 * sqlite:///FinalDB.db
Done.


gender,average bmi
female,30.379758308157097
male,30.951327433628315


### Problem 6
Identify the top 5 regions with the highest average claim amounts.

In [25]:
%%sql
select region, avg(claim) as `average claim`
from insurance
group by region
order by `average claim` desc
limit 5;

 * sqlite:///FinalDB.db
Done.


region,average claim
northeast,16889.044718614718
southeast,13058.52266365688
southwest,12723.129840764332
northwest,11672.088452722064
,1254.216666666667
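
The last row of the result has no region name, so it is a group of patients with a missing region and not a fifth region. A sketch of excluding it, assuming the blank is stored either as NULL or as an empty string; the table is a toy in-memory copy with made-up rows:

```python
import sqlite3

# Toy rows, made up for illustration (not taken from the Kaggle data).
rows = [
    ("northeast", 300.0),
    ("northeast", 100.0),
    ("southeast", 150.0),
    (None, 10.0),   # region missing (NULL)
    ("", 20.0),     # region missing (empty string)
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE insurance (region TEXT, claim REAL)")
conn.executemany("INSERT INTO insurance VALUES (?, ?)", rows)

# Drop rows without a usable region before grouping.
top_regions = conn.execute(
    """
    SELECT region, AVG(claim) AS avg_claim
    FROM insurance
    WHERE region IS NOT NULL AND TRIM(region) <> ''
    GROUP BY region
    ORDER BY avg_claim DESC
    LIMIT 5
    """
).fetchall()

print(top_regions)
```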


### Problem 7
Count the number of smokers and non-smokers in each region.

In [29]:
%%sql 
select region, smoker, count(*) as `number of smokers and non smoker`
from insurance
group by region,smoker
order by `number of smokers and non smoker` desc;

 * sqlite:///FinalDB.db
Done.


region,smoker,number of smokers and non smoker
southeast,No,352
northwest,No,291
southwest,No,256
northeast,No,164
southeast,Yes,91
northeast,Yes,67
northwest,Yes,58
southwest,Yes,58
,No,3
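
The result above has one row per region and smoking status. If one row per region is easier to read, conditional aggregation can put both counts side by side. A sketch on a toy in-memory table with made-up rows; the column names `smokers` and `non_smokers` are illustrative:

```python
import sqlite3

# Toy rows, made up for illustration (not taken from the Kaggle data).
rows = [
    ("southeast", "Yes"),
    ("southeast", "No"),
    ("southeast", "No"),
    ("northwest", "Yes"),
    ("northwest", "Yes"),
    ("northwest", "No"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE insurance (region TEXT, smoker TEXT)")
conn.executemany("INSERT INTO insurance VALUES (?, ?)", rows)

# Each CASE yields 1 for matching rows and 0 otherwise,
# so SUM acts as a per-status counter within each region.
per_region = conn.execute(
    """
    SELECT region,
           SUM(CASE WHEN smoker = 'Yes' THEN 1 ELSE 0 END) AS smokers,
           SUM(CASE WHEN smoker = 'No'  THEN 1 ELSE 0 END) AS non_smokers
    FROM insurance
    GROUP BY region
    ORDER BY region
    """
).fetchall()

print(per_region)
```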


### Problem 8
Find the age of the oldest patient for each BMI range.

In [31]:
%%sql
SELECT BMI_Range, MAX(age) AS max_age
FROM (
    SELECT CASE
               WHEN bmi < 18.5 THEN 'Underweight'
               WHEN bmi >= 18.5 AND bmi < 25 THEN 'Normal Weight'
               WHEN bmi >= 25 AND bmi < 30 THEN 'Overweight'
               WHEN bmi >= 30 THEN 'Obese'
           END AS BMI_Range,
           age
    FROM insurance
)
GROUP BY BMI_Range;


 * sqlite:///FinalDB.db
Done.


BMI_Range,max_age
Normal Weight,60.0
Obese,60.0
Overweight,60.0
Underweight,60.0
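
Because `CASE` checks its branches in order, the lower bounds in the query above are optional, and a row with a missing `bmi` matches no branch and ends up in a NULL group. A sketch that checks the boundary values 18.5, 25 and 30 and labels missing BMI explicitly, using a toy in-memory table with made-up rows; the `'Unknown'` label is an assumption, not part of the original query:

```python
import sqlite3

# Toy rows, made up for illustration (not taken from the Kaggle data).
rows = [
    # bmi, age
    (18.4, 20),
    (18.5, 30),   # lower edge of Normal Weight
    (24.9, 40),
    (25.0, 50),   # lower edge of Overweight
    (29.9, 35),
    (30.0, 60),   # lower edge of Obese
    (None, 45),   # missing BMI
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE insurance (bmi REAL, age INTEGER)")
conn.executemany("INSERT INTO insurance VALUES (?, ?)", rows)

# Branches are tested top to bottom, so each one only needs an upper bound.
oldest_by_range = dict(
    conn.execute(
        """
        SELECT CASE
                   WHEN bmi IS NULL THEN 'Unknown'
                   WHEN bmi < 18.5 THEN 'Underweight'
                   WHEN bmi < 25   THEN 'Normal Weight'
                   WHEN bmi < 30   THEN 'Overweight'
                   ELSE 'Obese'
               END AS BMI_Range,
               MAX(age) AS max_age
        FROM insurance
        GROUP BY BMI_Range
        """
    ).fetchall()
)

print(oldest_by_range)
```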


### Problem 9
Calculate the total number of children for patients in each region.

In [30]:
%%sql
select region, sum(children) as `Number of Children` 
from insurance
group by region
order by `Number of Children` desc;

 * sqlite:///FinalDB.db
Done.


region,Number of Children
southeast,485
northwest,375
southwest,370
northeast,235
,0
