# **Statistical Tests**

## Objectives

* Perform distribution checks on the data
* Test project hypotheses through statistical tests

## Inputs

* Cleaned data from ETL pipeline

## Outputs

* Outputs are all contained within the notebook 

## Additional Comments

* If you have any additional comments that don't fit in the previous bullets, please state them here. 



---

# Import Data and Load Packages

Load packages required to run the notebook

In [1]:
import pandas as pd
import numpy as np

Import cleaned data produced by the ETL pipeline

In [2]:
# *Import raw data for testing
# Import cleaned data from ETL pipeline into dataframe

df = pd.read_csv("../data/insurance-cleaned.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   age           1338 non-null   int64  
 1   sex           1338 non-null   int64  
 2   bmi           1338 non-null   float64
 3   children      1338 non-null   int64  
 4   smoker        1338 non-null   int64  
 5   region        1338 non-null   object 
 6   charges       1338 non-null   float64
 7   bmi_category  1338 non-null   object 
dtypes: float64(2), int64(4), object(2)
memory usage: 83.8+ KB


Convert data types for categorical variables so that they are handled correctly in the code

In [6]:
df.astype({"sex": 'string'}).dtypes
df.astype({"smoker": 'object'}).dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   age           1338 non-null   int64  
 1   sex           1338 non-null   int64  
 2   bmi           1338 non-null   float64
 3   children      1338 non-null   int64  
 4   smoker        1338 non-null   int64  
 5   region        1338 non-null   object 
 6   charges       1338 non-null   float64
 7   bmi_category  1338 non-null   object 
dtypes: float64(2), int64(4), object(2)
memory usage: 83.8+ KB


---

# Distribution Checks

Create visualisations to check distributions to make sure that the correct statistical test is used

In [3]:
df.describe()

Unnamed: 0,age,sex,bmi,children,smoker,charges
count,1338.0,1338.0,1338.0,1338.0,1338.0,1338.0
mean,39.207025,0.505232,30.663397,1.094918,0.204783,13270.422265
std,14.04996,0.50016,6.098187,1.205493,0.403694,12110.011237
min,18.0,0.0,15.96,0.0,0.0,1121.8739
25%,27.0,0.0,26.29625,0.0,0.0,4740.28715
50%,39.0,1.0,30.4,1.0,0.0,9382.033
75%,51.0,1.0,34.69375,2.0,0.0,16639.912515
max,64.0,1.0,53.13,5.0,1.0,63770.42801


---

NOTE

* You may add as many sections as you want, as long as it supports your project workflow.
* All notebook's cells should be run top-down (you can't create a dynamic wherein a given point you need to go back to a previous cell to execute some task, like go back to a previous cell and refresh a variable content)

---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
except Exception as e:
  print(e)
