# Project Business Statistics: Axis Insurance



### **Please read the instructions carefully before starting the project.**
This is a commented Jupyter IPython Notebook file in which all the instructions and tasks to be performed are mentioned.
* Blanks '_______' are provided in the notebook; each needs to be filled with appropriate code to get the correct result. With every '_______' blank, there is a comment that briefly describes what needs to be filled in.
* Identify the task to be performed correctly, and only then proceed to write the required code.
* Fill in the code wherever asked by commented lines like "# write your code here" or "# complete the code". Running incomplete code may throw an error.
* Please run the code cells sequentially from the beginning to avoid unnecessary errors.
* Add the results/observations (wherever mentioned) derived from the analysis in the presentation and submit the same. Any mathematical or computational details which are a graded part of the project can be included in the Appendix section of the presentation.





## Problem Statement


Leveraging customer information is of paramount importance for most businesses. In the case of an insurance company, the attributes of customers can be crucial in making business decisions. Hence, knowing how to explore such data and generate value from it can be an invaluable skill to have.

Suppose you are hired as a Data Scientist in an insurance company. The company wants a detailed understanding of the customer base for one of its insurance policies, 'MediClaim'. The idea is to generate insights about the customers and answer a few key questions with statistical evidence, using past data. The dataset 'AxisInsurance' contains customers' details like age, sex, charges, etc. Perform the statistical analysis to answer the following questions using the collected data.

1.	Explore the dataset and extract insights using Exploratory Data Analysis.

2.	Prove (or disprove) that the medical claims made by people who smoke are greater than those made by people who don't.

3.	Prove (or disprove) with statistical evidence that the BMI of females is different from that of males.

4.  Does the smoking habit of customers depend on their region?  [Hint: Create a contingency table using the pandas.crosstab() function]

5. Is the mean BMI of women with no children, one child, and two children the same? Explain your answer with statistical evidence.

The idea behind answering these questions is to help the company in making evidence-based business decisions.

## Assumptions
The health insurance customer data is a simple random sample from the population, and the observations are independent of each other.

## Data Dictionary

**AxisInsurance.csv**  contains the following information about customers of the Axis Health Insurance.
1.	Age - This is an integer indicating the age of the primary beneficiary (excluding those above 64 years, since they are generally covered by the government).
2.	Sex - This is the policy holder's gender, either male or female.
3.	BMI - This is the body mass index (BMI), which provides a sense of how over or under-weight a person is relative to their height. BMI is equal to weight (in kilograms) divided by height (in meters) squared. An ideal BMI is within the range of 18.5 to 24.9.
4.	Children - This is an integer indicating the number of children / dependents covered by the insurance plan.
5.	Smoker - This is yes or no depending on whether the insured person regularly smokes tobacco.
6.	Region - This is the beneficiary's place of residence in the U.S., divided into four geographic regions - northeast, southeast, southwest, or northwest.
7.	Charges - These are the individual medical costs billed by health insurance.
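The BMI definition in the data dictionary can be checked with a small worked example. This is only a sketch: the function names `bmi` and `in_ideal_range` and the sample weight and height are hypothetical, chosen for illustration, and are not part of the dataset or the project code.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


def in_ideal_range(value: float, low: float = 18.5, high: float = 24.9) -> bool:
    """True when the BMI falls inside the 18.5 to 24.9 range quoted in the data dictionary."""
    return low <= value <= high


# Hypothetical person: 70 kg and 1.75 m tall
example = bmi(70.0, 1.75)
print(round(example, 2), in_ideal_range(example))
```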

### Import all the necessary libraries

In [None]:
# Installing the libraries with the specified version.
!pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scipy==1.11.4 -q --user

**Note**: *After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.*

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns

# library for statistical tests
import scipy.stats as stats

In [None]:
sns.set() #setting the default seaborn style for our plots

## 1.	Explore the dataset and extract insights using Exploratory Data Analysis. (8 + 6 = 14 Marks)

### Exploratory Data Analysis - Step by step approach

Typical Data exploration activity consists of the following steps:
1.	Importing Data
2.	Variable Identification
3.  Variable Transformation/Feature Creation
4.  Missing value detection
5.	Univariate Analysis
6.	Bivariate Analysis
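As a rough illustration of the variable identification, missing value detection and summary steps listed above, here is a minimal sketch on a tiny made-up DataFrame. The name `toy` and all of its values are assumptions for demonstration only; they are not taken from AxisInsurance.csv.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only (not the project dataset)
toy = pd.DataFrame(
    {
        "age": [19, 33, 45, 52],
        "sex": ["female", "male", "male", "female"],
        "bmi": [27.9, 22.7, np.nan, 30.1],
        "smoker": ["yes", "no", "no", "no"],
    }
)

# Variable identification: column names and data types
print(toy.dtypes)

# Missing value detection: number of missing entries per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Summary statistics of the numeric columns (count, mean, std, min, quartiles, max)
summary = toy.describe()
print(summary)
```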

### Reading the Data into a DataFrame

In [None]:
# uncomment and run the following lines for Google Colab
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
# complete the code below to load the dataset
df = pd.read_csv('_______')

### Data Overview

In [None]:
# view a few rows of the data frame
df.head()

In [None]:
# view the shape of the data frame
df.shape

In [None]:
# check the data types of the columns in the data frame
df.info()

* There are a total of 1338 non-null observations in each of the columns.

* There are 7 columns named **'age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'** whose data types are **int64, object, float64, int64, object, object, float64** respectively.


* sex, smoker and region are of type object; we can convert them to categories.

### Fixing the data types

Converting `object` columns to `category` reduces the space required to store the DataFrame. It also helps in analysis.

In [None]:
df["sex"]=df["sex"].astype("category")
df["smoker"]=df["smoker"].astype("category")
df["region"]=df["region"].astype("category")
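The storage saving from the `category` dtype can be seen on a small synthetic column. This is a sketch under assumptions: the column is randomly generated (the name `toy` and the size of 1000 rows are arbitrary), so the exact byte counts will differ from those of the project data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic column with four repeated labels, for illustration only
toy = pd.DataFrame(
    {
        "region": rng.choice(
            ["northeast", "southeast", "southwest", "northwest"], size=1000
        )
    }
)

# deep=True counts the memory held by the Python string objects as well
as_object = toy["region"].astype("object").memory_usage(deep=True)
as_category = toy["region"].astype("category").memory_usage(deep=True)

print("object bytes:  ", as_object)
print("category bytes:", as_category)
```

A categorical column stores each distinct label once plus a small integer code per row, which is why it tends to be smaller when the labels repeat often.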

### Check for missing values

In [None]:
# write your code here

### Five Point Summary

In [None]:
# write your code here to print the summary statistics



In [None]:
df['sex'].value_counts()

In [None]:
df['smoker'].value_counts()

In [None]:
df['region'].value_counts()

### Univariate analysis

In [None]:
# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    # histogram; fall back to seaborn's default binning when bins is not given
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins if bins else "auto"
    )
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

#### Age

In [None]:
# plotting the distribution of 'age'
histogram_boxplot(df,'age')

#### BMI

In [None]:
# write the code to plot the distribution of 'bmi' column



#### Children

In [None]:
# write the code to plot the distribution of 'children' column



* As children takes only a handful of distinct integer values, we can also include it in the barplot of categorical variables.

#### Charges

In [None]:
# write the code to plot the distribution of 'charges' column



In [None]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(data=data, x=feature, palette="Paired", order=data[feature].value_counts().index[:n].sort_values())

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(100 * p.get_height() / total)  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(label, (x, y), ha="center", va="center", size=12, xytext=(0, 5), textcoords="offset points")  # annotate the percentage

    plt.show()  # show the plot

#### Sex

In [None]:
# plotting the barplot of 'sex'
labeled_barplot(df, 'sex', perc=True)

#### Children

In [None]:
# write the code to plot the barplot of 'children' column



#### Smoker

In [None]:
# write the code to plot the barplot of 'smoker' column



#### Region

In [None]:
# write the code to plot the barplot of 'region' column



### Bivariate Analysis

In [None]:
# write the code to plot the heatmap between the continuous variables
plt.figure(figsize=(15,5))
sns.heatmap(_______) # complete the code
plt.show()


In [None]:
# write the code to plot the pairplot between all possible attributes' pair
sns.pairplot(_______)  # complete the code
plt.show()


## 2. Prove (or disprove) that the medical claims made by people who smoke are greater than those made by people who don't.

In [None]:
# visual analysis of medical charges for smokers and non-smokers
plt.figure(figsize=(8,6))
sns.scatterplot(x = 'age', y = 'charges', hue='smoker', data = df, palette= ['red','green'] ,alpha=0.6)
plt.show()

### Step 1: Define the null and alternate hypotheses

### Step 2: Select Appropriate test

This is a one-tailed test concerning two population means from two independent populations. The population standard deviations are unknown. **Based on this information, select the appropriate test**.

### Step 3: Decide the significance level

As given in the problem statement, we select α = 0.05.

### Step 4: Collect and prepare data

In [None]:
# extract the values of charges for smokers
charges_smokers = df[df['smoker'] == 'yes']['charges']
# extract the values of charges for non-smokers
charges_non_smokers = #write your code here


In [None]:
print("The sample mean of the charges of smokers is:", round(charges_smokers.mean(),2))
print("The sample mean of the charges of non-smokers is:", round(charges_non_smokers.mean(),2))
print('The sample standard deviation of the charges of smokers is:', round(charges_smokers.std(),2))
print('The sample standard deviation of the charges of non-smokers is:', round(charges_non_smokers.std(),2))

**Based on the sample standard deviations of the two groups, decide whether the population standard deviations can be assumed to be equal or unequal**.
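One common way to act on this decision is the `equal_var` argument of `scipy.stats.ttest_ind`, which switches between the pooled two-sample t-test and Welch's t-test. The sketch below uses synthetic data, not the project data. The names `group_a` and `group_b` and the cutoff of 2 on the ratio of standard deviations are assumptions; the cutoff is an informal rule of thumb, not a formal test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic samples for illustration only
group_a = rng.normal(loc=30.0, scale=12.0, size=200)
group_b = rng.normal(loc=10.0, scale=4.0, size=300)

# Sample standard deviations (ddof=1 matches pandas' default .std())
std_a = group_a.std(ddof=1)
std_b = group_b.std(ddof=1)

# Assumed rule of thumb: treat variances as equal only if the ratio is below 2
equal_var = max(std_a, std_b) / min(std_a, std_b) < 2

# One-tailed test of H_a: mean(group_a) > mean(group_b)
t_stat, p_value = stats.ttest_ind(
    group_a, group_b, equal_var=equal_var, alternative="greater"
)

print("equal_var:", equal_var)
print("t statistic:", t_stat, "p-value:", p_value)
```

For a two-tailed question, the same call would use `alternative="two-sided"`, which is the default.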

### Step 5: Calculate the p-value

In [None]:
# complete the code to import the required function
from scipy.stats import ______

# write the code to calculate the p-value
test_stat, p_value =    #write your code here

print('The p-value is', p_value)

### Step 6: Compare the p-value with $\alpha$

In [None]:
# print the conclusion based on p-value
if p_value < 0.05:
    print(f'As the p-value {p_value} is less than the level of significance, we reject the null hypothesis.')
else:
    print(f'As the p-value {p_value} is greater than the level of significance, we fail to reject the null hypothesis.')

### Step 7:  Draw inference

## 3. Prove (or disprove) with statistical evidence that BMI of females is different from that of males.

### Perform Visual Analysis

In [None]:
# write the code to visually compare the BMI of females and males
plt.figure(figsize=(8,6))
sns.scatterplot(________) # complete the code
plt.show()


### Step 1: Define the null and alternate hypotheses

### Step 2: Select Appropriate test

This is a two-tailed test concerning two population means from two independent populations. The population standard deviations are unknown. **Based on this information, select the appropriate test**.

### Step 3: Decide the significance level

As given in the problem statement, we select α = 0.05.

### Step 4: Collect and prepare data

In [None]:
# extract the values of BMI for females
bmi_females = df[df['sex'] == 'female']['bmi']
# extract the values of BMI for males
bmi_males = #write your code here


In [None]:
print("The sample mean of the BMI's of females is:", round(bmi_females.mean(),2))
print("The sample mean of the BMI's of males is:", round(bmi_males.mean(),2))
print("The sample standard deviation of the BMI's of females is:", round(bmi_females.std(),2))
print("The sample standard deviation of the BMI's of males is:", round(bmi_males.std(),2))

**Based on the sample standard deviations of the two groups, decide whether the population standard deviations can be assumed to be equal or unequal**.

### Step 5: Calculate the p-value

In [None]:
# complete the code to import the required function
from scipy.stats import ______

# write the code to calculate the p-value
test_stat, p_value =    #write your code here

print('The p-value is', p_value)

### Step 6: Compare the p-value with $\alpha$

In [None]:
# print the conclusion based on p-value
if p_value < 0.05:
    print(f'As the p-value {p_value} is less than the level of significance, we reject the null hypothesis.')
else:
    print(f'As the p-value {p_value} is greater than the level of significance, we fail to reject the null hypothesis.')

### Step 7:  Draw inference

## 4. Does the smoking habit of customers depend on their region?


### Perform Visual Analysis

In [None]:
# write the code to plot a stacked bar plot between 'smoker' and 'region'.
pd.crosstab(________).plot(kind="bar", figsize=(8,8), stacked=True) # complete the code
plt.show()

### Step 1: Define the null and alternate hypotheses

### Step 2: Select Appropriate test

This is a problem of the test of independence, concerning two categorical variables - smoker and region. **Based on this information, select the appropriate test.**
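A test of independence between two categorical variables is commonly run with `scipy.stats.chi2_contingency` on a contingency table. The sketch below uses a small made-up table; the DataFrame `toy`, its region labels and its counts are assumptions for illustration and do not reflect the project data.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up data: three regions with 40 customers each
toy = pd.DataFrame(
    {
        "region": ["north"] * 40 + ["south"] * 40 + ["west"] * 40,
        "smoker": (
            ["yes"] * 10 + ["no"] * 30
            + ["yes"] * 12 + ["no"] * 28
            + ["yes"] * 11 + ["no"] * 29
        ),
    }
)

# Rows are regions, columns are smoker levels
table = pd.crosstab(toy["region"], toy["smoker"])

# Returns the test statistic, p-value, degrees of freedom and expected counts
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print("chi2:", chi2, "dof:", dof, "p-value:", p_value)
```

The degrees of freedom equal (rows - 1) × (columns - 1), and the expected counts are what each cell would hold if the two variables were independent.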

### Step 3: Decide the significance level

As given in the problem statement, we select α = 0.05.

### Step 4: Collect and prepare data

In [None]:
# complete the code to create a contingency table showing the distribution of smokers across regions
contingency_table = pd.crosstab(______)

contingency_table

### Step 5: Calculate the p-value

In [None]:
# complete the code to import the required function
from scipy.stats import _____

# write the code to calculate the p-value
chi2, p_value, dof, exp_freq =    # write your code here

print('The p-value is', p_value)

### Step 6: Compare the p-value with $\alpha$

In [None]:
# print the conclusion based on p-value
if p_value < 0.05:
    print(f'As the p-value {p_value} is less than the level of significance, we reject the null hypothesis.')
else:
    print(f'As the p-value {p_value} is greater than the level of significance, we fail to reject the null hypothesis.')

### Step 7:  Draw inference

## 5.	Is the mean BMI of women with no children, one child and two children the same? Explain your answer with statistical evidence.

### Perform Visual Analysis

In [None]:
# create a new DataFrame for customers who are female and have 0,1, or 2 children
df_new = df[(df['sex']=='female') & (df['children']<3)]

In [None]:
# write the code to visually plot the BMI of women with 0, 1, and 2 children
plt.figure(figsize=(8,6))
sns.boxplot(______) # complete the code
plt.show()


In [None]:
# write the code to calculate the mean BMI of women with 0, 1, and 2 children



### Step 1: Define the null and alternate hypotheses

### Step 2: Select Appropriate test

This is a problem concerning three population means. **Based on this information, select the appropriate test to compare the three population means.** Also, check the assumptions of normality and equality of variance for the three groups.

* For testing normality, Shapiro-Wilk’s test is applied to the response variable.

* For equality of variance, Levene’s test is applied to the response variable.

### Shapiro-Wilk’s test

We will test the null hypothesis

>$H_0:$ BMI of women follows a normal distribution

against the alternative hypothesis

>$H_a:$ BMI of women does not follow a normal distribution

In [None]:
# Assumption 1: Normality
# use Shapiro function for the test

# find the p-value
w, p_value = stats.shapiro(df_new['bmi'])
print('The p-value is', p_value)

Since the p-value of the test is much larger than the 5% significance level, we fail to reject the null hypothesis that the response follows a normal distribution.

### Levene’s test

We will test the null hypothesis

>$H_0$: All the population variances are equal

against the alternative hypothesis

>$H_a$: At least one variance is different from the rest

In [None]:
#Assumption 2: Homogeneity of Variance
# use the levene function for this test

# find the p-value
statistic, p_value = stats.levene(df_new[df_new['children']==0]['bmi'],
                             df_new[df_new['children']==1]['bmi'],
                             df_new[df_new['children']==2]['bmi'])

print('The p-value is', p_value)

Since the p-value is larger than the 5% significance level, we fail to reject the null hypothesis of homogeneity of variances.
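When both assumptions are considered plausible, one common choice for comparing three or more population means is one-way ANOVA, available as `scipy.stats.f_oneway`. The sketch below runs the two assumption checks and the ANOVA on synthetic groups. The group names, sizes and distribution parameters are assumptions for illustration, so the printed p-values say nothing about the project data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Synthetic groups drawn from the same normal distribution (illustration only)
g0 = rng.normal(loc=30.0, scale=6.0, size=120)
g1 = rng.normal(loc=30.0, scale=6.0, size=80)
g2 = rng.normal(loc=30.0, scale=6.0, size=60)

# Assumption 1: normality of the pooled response (Shapiro-Wilk)
_, p_normal = stats.shapiro(np.concatenate([g0, g1, g2]))

# Assumption 2: homogeneity of variance across the groups (Levene)
_, p_levene = stats.levene(g0, g1, g2)

# One-way ANOVA: H_0 is that all group means are equal
f_stat, p_anova = stats.f_oneway(g0, g1, g2)

print("Shapiro p-value:", p_normal)
print("Levene p-value: ", p_levene)
print("ANOVA F:", f_stat, "p-value:", p_anova)
```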

### Step 3: Decide the significance level

As given in the problem statement, we select α = 0.05.

### Step 4: Collect and prepare data

In [None]:
# extract the values of BMI of women with 0 children
bmi_women_zero = df_new[df_new['children']==0]['bmi']
# extract the values of BMI of women with 1 child
bmi_women_one =    # write your code here
# extract the values of BMI of women with 2 children
bmi_women_two =    # write your code here


### Step 5: Calculate the p-value

In [None]:
# complete the code to import the required function
from scipy.stats import ______

# write the code to calculate the p-value
test_stat, p_value =    # write your code here

print('The p-value is', p_value)

### Step 6: Compare the p-value with $\alpha$

In [None]:
# print the conclusion based on p-value
if p_value < 0.05:
    print(f'As the p-value {p_value} is less than the level of significance, we reject the null hypothesis.')
else:
    print(f'As the p-value {p_value} is greater than the level of significance, we fail to reject the null hypothesis.')

### Step 7:  Draw inference

## Conclusion and Business Recommendations