# U.S. Medical Insurance Costs

In this project, I investigate a **csv** file using the Python skills developed in the *Python Fundamentals* section of the *Data Scientist Path* on *Codecademy*.

The **insurance.csv** file contains real-world data about *US medical insurance costs*. The project does not give step-by-step instructions; instead, it asks me to perform my own independent analysis and only suggests a framework for structuring the exploration and analysis.

The project objectives are the following: 
 - Work locally on your own computer
 - Import a dataset into your program
 - Analyze a dataset by building out functions or class methods
 - Use libraries to assist in your analysis
 - Optional: Document and organize your findings
 - Optional: Make predictions about a dataset’s features based on your findings

### **Import the dataset**: 
*Import insurance.csv into your Python file. Make sure that the information is easy to access.*



The only library I need for this project is Python's built-in *csv* module:

In [2]:
import csv

### **Look over the dataset**:

*Open insurance.csv and take a look at the file. Take note of how information is organized. How will this affect how you analyze the data in Python? Is there anything of particular interest to you in the dataset that you want to investigate? Think about these things before you jump into Python.*



**insurance.csv** contains the following columns separated by **,** : 
 - **Age** : Patient Age
 - **Sex** : Patient Sex
 - **BMI** : Patient BMI
 - **Children**: Patient Number of Children
 - **Smoker**: Patient Smoking Status
 - **Region**: Patient US Geographical Region
 - **Charges**: Patient Yearly Medical Insurance Cost


Some important notes about this dataset:
 - There is no missing data.
 - There are seven columns.
 - Some columns are numerical while some are categorical.
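
The notes above can be checked programmatically. The following is a minimal sketch, assuming the header names are lowercase (only `age`, `smoker` and `region` are confirmed by the code later in this notebook) and using a tiny made-up sample in place of the real file; `inspect_csv` is a hypothetical helper name.

```python
import csv
import io

# Tiny made-up sample standing in for insurance.csv (values are placeholders).
# Header names other than age, smoker and region are assumptions.
SAMPLE = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.92
18,male,33.8,1,no,southeast,1725.55
28,male,,3,no,southeast,4449.46
"""

def inspect_csv(file_obj):
    """Return (column names, row count, missing-cell count per column)."""
    reader = csv.DictReader(file_obj)
    columns = list(reader.fieldnames)
    missing = {name: 0 for name in columns}
    row_count = 0
    for row in reader:
        row_count += 1
        for name in columns:
            value = row[name]
            # DictReader yields "" for empty cells and None for short rows.
            if value is None or value.strip() == "":
                missing[name] += 1
    return columns, row_count, missing

columns, row_count, missing = inspect_csv(io.StringIO(SAMPLE))
print(columns)
print(row_count)
print(missing)
```

With the real file, the same function could be called inside `with open("insurance.csv", newline="") as f:`. The sample deliberately contains one empty `bmi` cell so that the missing-value counter has something to report.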



### Scoping the project
*Now that you have looked over your dataset, plan out what you want to analyze. What is it that you want to find out about this dataset? Based on the way information is organized, certain inspections may be easier to perform than others. As you map out the process, consider the scope of your analysis as well.*
*Properly scoping your project will greatly benefit you; scoping creates structure while requiring you to think through your entire project before you begin. You should start by stating the goals for your project, then gathering the data, and considering the analytical steps required. A proper project scope can be a great road map for your project, but keep in mind that some down-stream tasks may become dead ends which will require adjustment to the scope.*

I define the scope of the project by answering the following 4 questions:

**1. What is the goal?**
   
 I want to determine the Region of the US with the highest number of smokers aged 21 or younger.
 Why?
 Because I then want to run a TV campaign against smoking in that Region, targeting local TV shows for kids and 
 teenagers. 
 Due to limited funds, I can run the campaign in only one Region, so I need to determine where the rate 
 of underage smokers is highest.
   
**2. What action will be undertaken because of this project?**

 The action that I foresee after the analysis is the TV campaign mentioned above, meant to persuade youngsters in 
 the Region to stop smoking or never to start. 

**3. What data do I need?**

 I need a dataset sample containing indicators regarding age, region and smoking status. 
 

**4. What analysis needs to be done? Does it involve description, detection, prediction, or behavior change? How will the analysis be validated?**

 I am going to run a descriptive analysis on the provided dataset to count the underage smokers in each region.



### Save your dataset via Python variables
*Organize the information from insurance.csv by storing them in variables that can be used for analysis. As you consider what types of variables to use and how many you plan to create, think ahead about the parameters you wish to investigate and how your organization will impact this analysis.*

Given the project scope, let's create 3 separate variables (lists): 
 - ages
 - smokers
 - regions

... and one big list joining the 3 separate lists:
 - ages_smokers_regions (built from the intermediate zip object one_list)

In [3]:
ages = []
with open("insurance.csv", newline = "") as age_data:
    age_reader = csv.DictReader(age_data, delimiter = ",")
    for row in age_reader:
        ages.append(row["age"])


In [4]:
print(ages)

['19', '18', '28', '33', '32', '31', '46', '37', '37', '60', '25', '62', '23', '56', '27', '19', '52', '23', '56', '30', '60', '30', '18', '34', '37', '59', '63', '55', '23', '31', '22', '18', '19', '63', '28', '19', '62', '26', '35', '60', '24', '31', '41', '37', '38', '55', '18', '28', '60', '36', '18', '21', '48', '36', '40', '58', '58', '18', '53', '34', '43', '25', '64', '28', '20', '19', '61', '40', '40', '28', '27', '31', '53', '58', '44', '57', '29', '21', '22', '41', '31', '45', '22', '48', '37', '45', '57', '56', '46', '55', '21', '53', '59', '35', '64', '28', '54', '55', '56', '38', '41', '30', '18', '61', '34', '20', '19', '26', '29', '63', '54', '55', '37', '21', '52', '60', '58', '29', '49', '37', '44', '18', '20', '44', '47', '26', '19', '52', '32', '38', '59', '61', '53', '19', '20', '22', '19', '22', '54', '22', '34', '26', '34', '29', '30', '29', '46', '51', '53', '19', '35', '48', '32', '42', '40', '44', '48', '18', '30', '50', '42', '18', '54', '32', '37', '47', '20

The ages are read from the file as strings, so I need to convert them into **integers**:

In [5]:
ages = [int(i) for i in ages] 

In [6]:
print(ages)

[19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56, 27, 19, 52, 23, 56, 30, 60, 30, 18, 34, 37, 59, 63, 55, 23, 31, 22, 18, 19, 63, 28, 19, 62, 26, 35, 60, 24, 31, 41, 37, 38, 55, 18, 28, 60, 36, 18, 21, 48, 36, 40, 58, 58, 18, 53, 34, 43, 25, 64, 28, 20, 19, 61, 40, 40, 28, 27, 31, 53, 58, 44, 57, 29, 21, 22, 41, 31, 45, 22, 48, 37, 45, 57, 56, 46, 55, 21, 53, 59, 35, 64, 28, 54, 55, 56, 38, 41, 30, 18, 61, 34, 20, 19, 26, 29, 63, 54, 55, 37, 21, 52, 60, 58, 29, 49, 37, 44, 18, 20, 44, 47, 26, 19, 52, 32, 38, 59, 61, 53, 19, 20, 22, 19, 22, 54, 22, 34, 26, 34, 29, 30, 29, 46, 51, 53, 19, 35, 48, 32, 42, 40, 44, 48, 18, 30, 50, 42, 18, 54, 32, 37, 47, 20, 32, 19, 27, 63, 49, 18, 35, 24, 63, 38, 54, 46, 41, 58, 18, 22, 44, 44, 36, 26, 30, 41, 29, 61, 36, 25, 56, 18, 19, 39, 45, 51, 64, 19, 48, 60, 27, 46, 28, 59, 35, 63, 40, 20, 40, 24, 34, 45, 41, 53, 27, 26, 24, 34, 53, 32, 19, 42, 55, 28, 58, 41, 47, 42, 59, 19, 59, 39, 40, 18, 31, 19, 44, 23, 33, 55, 40, 63, 54, 60, 24, 19, 29,

In [7]:
smokers = []
with open("insurance.csv", newline = "") as smoker_data:
    smoker_reader = csv.DictReader(smoker_data, delimiter = ",")
    for row in smoker_reader:
        smokers.append(row["smoker"])

In [8]:
print(smokers)

['yes', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'yes', 'no', 'no', 'yes', 'no', 'no', 'no', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no', 'no', 'no', 'no', 'yes', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no', 'no', 'yes', 'yes', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'no', 'no', 'no', 'no', 'no', 'yes', 'no', 'no', 'no', 'no', 'yes', 'yes', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'no', 'no', 'no', 'no', 'no', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'yes', 'no', 'no', 'no', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'yes', 'no', 'no', 'no', 'no', 'no', 'yes', 'no', 'no', 'yes', 'no', 'yes', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'yes', 'no', 'yes', 'no', 'yes', 'no', 'no', 'no', 'no', 'no', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'yes', 'no'

In [9]:
regions = []
with open("insurance.csv", newline = "") as region_data:
    region_reader = csv.DictReader(region_data, delimiter = ",")
    for row in region_reader:
        regions.append(row["region"])

In [10]:
print(regions)

['southwest', 'southeast', 'southeast', 'northwest', 'northwest', 'southeast', 'southeast', 'northwest', 'northeast', 'northwest', 'northeast', 'southeast', 'southwest', 'southeast', 'southeast', 'southwest', 'northeast', 'northeast', 'southwest', 'southwest', 'northeast', 'southwest', 'southeast', 'northeast', 'northwest', 'southeast', 'northeast', 'northwest', 'northwest', 'southwest', 'southwest', 'northeast', 'southwest', 'northwest', 'southwest', 'northwest', 'northwest', 'southwest', 'northeast', 'southwest', 'northeast', 'southeast', 'southeast', 'southeast', 'northeast', 'southwest', 'northeast', 'northwest', 'southeast', 'southeast', 'northeast', 'northwest', 'southwest', 'southeast', 'northwest', 'northwest', 'northeast', 'southeast', 'southeast', 'northwest', 'northeast', 'southeast', 'northwest', 'northwest', 'northwest', 'southwest', 'southwest', 'northwest', 'southeast', 'southeast', 'southeast', 'northeast', 'southwest', 'southeast', 'southwest', 'northwest', 'southeast'

In [11]:
one_list = zip(ages, smokers, regions)
ages_smokers_regions = list(one_list)

In [12]:
print(ages_smokers_regions)

[(19, 'yes', 'southwest'), (18, 'no', 'southeast'), (28, 'no', 'southeast'), (33, 'no', 'northwest'), (32, 'no', 'northwest'), (31, 'no', 'southeast'), (46, 'no', 'southeast'), (37, 'no', 'northwest'), (37, 'no', 'northeast'), (60, 'no', 'northwest'), (25, 'no', 'northeast'), (62, 'yes', 'southeast'), (23, 'no', 'southwest'), (56, 'no', 'southeast'), (27, 'yes', 'southeast'), (19, 'no', 'southwest'), (52, 'no', 'northeast'), (23, 'no', 'northeast'), (56, 'no', 'southwest'), (30, 'yes', 'southwest'), (60, 'no', 'northeast'), (30, 'no', 'southwest'), (18, 'no', 'southeast'), (34, 'yes', 'northeast'), (37, 'no', 'northwest'), (59, 'no', 'southeast'), (63, 'no', 'northeast'), (55, 'no', 'northwest'), (23, 'no', 'northwest'), (31, 'yes', 'southwest'), (22, 'yes', 'southwest'), (18, 'no', 'northeast'), (19, 'no', 'southwest'), (63, 'no', 'northwest'), (28, 'yes', 'southwest'), (19, 'no', 'northwest'), (62, 'no', 'northwest'), (26, 'no', 'southwest'), (35, 'yes', 'northeast'), (60, 'yes', 'so
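
The three separate reads and the `zip` above open the same file three times. As an alternative, here is a minimal sketch of a single-pass read that builds the joined list directly. The helper name `load_records` is hypothetical, and the inline sample holds only the first four records printed above, restricted to the three columns used in this analysis.

```python
import csv
import io

# First four records from the lists printed above, used as a stand-in
# for insurance.csv; only the three columns needed here are included.
SAMPLE = """age,smoker,region
19,yes,southwest
18,no,southeast
28,no,southeast
33,no,northwest
"""

def load_records(file_obj):
    """Read age, smoker and region in one pass as (int, str, str) tuples."""
    reader = csv.DictReader(file_obj)
    return [(int(row["age"]), row["smoker"], row["region"]) for row in reader]

records = load_records(io.StringIO(SAMPLE))
print(records)
```

With the real file, `load_records` could be called inside `with open("insurance.csv", newline="") as f:`; extra columns in the file are simply ignored by the dictionary lookups.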

### Build out analysis functions or class methods
*You now have everything you need to begin your analysis. You have organized the information from insurance.csv and have spent some time thinking about what it is you would like to investigate.
Now is the time to build out how you perform these investigations. Use the Python fundamentals you have learned so far to accomplish these tasks. There are many different ways you can achieve these analyses. In our hint, we will provide some ideas for how you can use Python to analyze data.*

First of all, as I am investigating from the geographical perspective, I need to be sure that the sample is well balanced across the 4 existing regions:

In [38]:
total_ne_values = regions.count('northeast')
total_nw_values = regions.count('northwest')
total_se_values = regions.count('southeast')
total_sw_values = regions.count('southwest')

In [40]:
print('In the dataset, there are ' + str(total_ne_values) + ' patients from NorthEast')
print('In the dataset, there are ' + str(total_nw_values) + ' patients from NorthWest')
print('In the dataset, there are ' + str(total_se_values) + ' patients from SouthEast')
print('In the dataset, there are ' + str(total_sw_values) + ' patients from SouthWest')

In the dataset, there are 324 patients from NorthEast
In the dataset, there are 325 patients from NorthWest
In the dataset, there are 364 patients from SouthEast
In the dataset, there are 325 patients from SouthWest


Except for a slight excess in the SouthEast (364 patients against 324 or 325 elsewhere), the sample is well balanced geographically.
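
To put a number on the balance, each region's share of the sample can be computed from the counts printed above. This is a small sketch; the dictionary simply restates those four counts.

```python
# Region counts copied from the output above.
region_totals = {
    "northeast": 324,
    "northwest": 325,
    "southeast": 364,
    "southwest": 325,
}

total = sum(region_totals.values())

# Share of the whole sample per region, as a percentage.
shares = {
    region: round(count / total * 100, 2)
    for region, count in region_totals.items()
}

print(total)
for region, share in shares.items():
    print(f"{region}: {share}%")
```

A perfectly even split would give each region 25% of the sample, so the printed shares show how far each region deviates from that.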



Now I must keep only the records that matter for my analysis, meaning all and only the records where the age is 21 or lower and the smoker status is yes:

In [41]:
ages_smokers_regions_new = []
for record in ages_smokers_regions:
    if record[0] <= 21 and record[1] == 'yes':
        ages_smokers_regions_new.append(record)

In [42]:
print(ages_smokers_regions_new)

[(19, 'yes', 'southwest'), (18, 'yes', 'southeast'), (20, 'yes', 'northwest'), (20, 'yes', 'northwest'), (19, 'yes', 'southwest'), (18, 'yes', 'northeast'), (18, 'yes', 'southeast'), (19, 'yes', 'southwest'), (19, 'yes', 'northwest'), (18, 'yes', 'northeast'), (19, 'yes', 'northwest'), (20, 'yes', 'southeast'), (19, 'yes', 'northwest'), (19, 'yes', 'southwest'), (19, 'yes', 'southwest'), (21, 'yes', 'southwest'), (19, 'yes', 'southeast'), (21, 'yes', 'northeast'), (19, 'yes', 'northwest'), (19, 'yes', 'southeast'), (18, 'yes', 'northeast'), (18, 'yes', 'southeast'), (19, 'yes', 'northwest'), (18, 'yes', 'southeast'), (18, 'yes', 'northeast'), (19, 'yes', 'northwest'), (18, 'yes', 'northeast'), (20, 'yes', 'northeast'), (19, 'yes', 'northwest'), (19, 'yes', 'southeast'), (18, 'yes', 'northeast'), (20, 'yes', 'northwest'), (19, 'yes', 'northwest'), (18, 'yes', 'southeast'), (20, 'yes', 'southeast'), (20, 'yes', 'southwest'), (20, 'yes', 'southwest'), (18, 'yes', 'northeast'), (20, 'yes',

In [43]:
# Keep only the region (third element) of each filtered record
regions_final_list = [region for _age, _smoker, region in ages_smokers_regions_new]

In [44]:
print(regions_final_list)

['southwest', 'southeast', 'northwest', 'northwest', 'southwest', 'northeast', 'southeast', 'southwest', 'northwest', 'northeast', 'northwest', 'southeast', 'northwest', 'southwest', 'southwest', 'southwest', 'southeast', 'northeast', 'northwest', 'southeast', 'northeast', 'southeast', 'northwest', 'southeast', 'northeast', 'northwest', 'northeast', 'northeast', 'northwest', 'southeast', 'northeast', 'northwest', 'northwest', 'southeast', 'southeast', 'southwest', 'southwest', 'northeast', 'southwest', 'southwest', 'southwest']
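
The filtering and region extraction above could also be wrapped in one reusable function, in line with the project objective of building out functions. This is a sketch using `collections.Counter`; the function name `smokers_by_region` is hypothetical, and the sample holds just a few records taken from the joined list printed earlier.

```python
from collections import Counter

# A few records taken from the joined list printed earlier (stand-in sample).
records = [
    (19, "yes", "southwest"),
    (18, "no", "southeast"),
    (28, "no", "southeast"),
    (33, "no", "northwest"),
    (62, "yes", "southeast"),
    (22, "yes", "southwest"),
]

def smokers_by_region(records, max_age):
    """Count smokers aged max_age or younger in each region."""
    return Counter(
        region
        for age, smoker, region in records
        if smoker == "yes" and age <= max_age
    )

print(smokers_by_region(records, 21))
print(smokers_by_region(records, 35))
```

Because the age limit is a parameter, the same function covers any age threshold without duplicating the loop. A `Counter` returns 0 for regions with no matching records, so no special handling is needed for empty regions.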


I can count the values for each region to determine which has the largest number of smokers aged 21 or younger:

I sort the list first so that the different values are easier to check:

In [17]:
sorted_region_list = sorted(regions_final_list)

In [18]:
print(sorted_region_list)

['northeast', 'northeast', 'northeast', 'northeast', 'northeast', 'northeast', 'northeast', 'northeast', 'northeast', 'northwest', 'northwest', 'northwest', 'northwest', 'northwest', 'northwest', 'northwest', 'northwest', 'northwest', 'northwest', 'northwest', 'southeast', 'southeast', 'southeast', 'southeast', 'southeast', 'southeast', 'southeast', 'southeast', 'southeast', 'southeast', 'southwest', 'southwest', 'southwest', 'southwest', 'southwest', 'southwest', 'southwest', 'southwest', 'southwest', 'southwest', 'southwest']


In [19]:
northeast_count = sorted_region_list.count('northeast')
northwest_count = sorted_region_list.count('northwest')
southeast_count = sorted_region_list.count('southeast')
southwest_count = sorted_region_list.count('southwest')

In [48]:
print('NorthEast has ' + str(northeast_count) + ' underage smokers patients out of ' +str(total_ne_values) + ' total patients')
print('NorthWest has ' + str(northwest_count) + ' underage smokers patients out of ' +str(total_nw_values) + ' total patients')
print('SouthEast has ' + str(southeast_count) + ' underage smokers patients out of ' +str(total_se_values) + ' total patients')
print('SouthWest has ' + str(southwest_count) + ' underage smokers patients out of ' +str(total_sw_values) + ' total patients')

NorthEast has 9 underage smokers patients out of 324 total patients
NorthWest has 11 underage smokers patients out of 325 total patients
SouthEast has 10 underage smokers patients out of 364 total patients
SouthWest has 11 underage smokers patients out of 325 total patients


**Southwest** and **Northwest** share the highest ratio of underage smokers in the dataset sample: 11 out of 325 patients each. Therefore, these are the Regions of the United States where I might want to focus the TV advertising campaign.
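
Since the regions have different sample sizes, comparing percentages is safer than comparing raw counts. The sketch below computes the rate of underage smokers per region from the figures printed above; the dictionary only restates those figures.

```python
# Figures copied from the output above: (underage smokers, total patients).
counts = {
    "northeast": (9, 324),
    "northwest": (11, 325),
    "southeast": (10, 364),
    "southwest": (11, 325),
}

# Underage smokers as a percentage of all patients in each region.
rates = {
    region: round(young / total * 100, 2)
    for region, (young, total) in counts.items()
}

for region, rate in rates.items():
    print(f"{region}: {rate}%")
```

Northwest and Southwest come out tied because both have 11 underage smokers out of 325 patients.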

How can I determine which Region to choose between the two? 

Let's try counting the **total number of smokers** in the two regions (Southwest and Northwest), not only the underage ones:

In [24]:
total_smokers_sw = []
for record in ages_smokers_regions:
    if record[1] == 'yes' and record[2] == 'southwest':
        total_smokers_sw.append(record)

In [25]:
print(total_smokers_sw)

[(19, 'yes', 'southwest'), (30, 'yes', 'southwest'), (31, 'yes', 'southwest'), (22, 'yes', 'southwest'), (28, 'yes', 'southwest'), (60, 'yes', 'southwest'), (48, 'yes', 'southwest'), (37, 'yes', 'southwest'), (64, 'yes', 'southwest'), (38, 'yes', 'southwest'), (19, 'yes', 'southwest'), (63, 'yes', 'southwest'), (19, 'yes', 'southwest'), (63, 'yes', 'southwest'), (50, 'yes', 'southwest'), (19, 'yes', 'southwest'), (27, 'yes', 'southwest'), (34, 'yes', 'southwest'), (64, 'yes', 'southwest'), (19, 'yes', 'southwest'), (26, 'yes', 'southwest'), (36, 'yes', 'southwest'), (33, 'yes', 'southwest'), (42, 'yes', 'southwest'), (54, 'yes', 'southwest'), (21, 'yes', 'southwest'), (29, 'yes', 'southwest'), (39, 'yes', 'southwest'), (30, 'yes', 'southwest'), (49, 'yes', 'southwest'), (37, 'yes', 'southwest'), (39, 'yes', 'southwest'), (29, 'yes', 'southwest'), (33, 'yes', 'southwest'), (30, 'yes', 'southwest'), (50, 'yes', 'southwest'), (37, 'yes', 'southwest'), (32, 'yes', 'southwest'), (25, 'yes',

In [26]:
count_southwest = len(total_smokers_sw)

In [49]:
print('The smokers patients in SouthWest are ' + str(count_southwest))

The smokers patients in SouthWest are 58


In [50]:
total_smokers_nw = []
for record in ages_smokers_regions:
    if record[1] == 'yes' and record[2] == 'northwest':
        total_smokers_nw.append(record)

In [51]:
print(total_smokers_nw)

[(58, 'yes', 'northwest'), (20, 'yes', 'northwest'), (45, 'yes', 'northwest'), (57, 'yes', 'northwest'), (20, 'yes', 'northwest'), (32, 'yes', 'northwest'), (30, 'yes', 'northwest'), (46, 'yes', 'northwest'), (42, 'yes', 'northwest'), (19, 'yes', 'northwest'), (56, 'yes', 'northwest'), (19, 'yes', 'northwest'), (19, 'yes', 'northwest'), (31, 'yes', 'northwest'), (45, 'yes', 'northwest'), (52, 'yes', 'northwest'), (23, 'yes', 'northwest'), (63, 'yes', 'northwest'), (56, 'yes', 'northwest'), (61, 'yes', 'northwest'), (49, 'yes', 'northwest'), (35, 'yes', 'northwest'), (48, 'yes', 'northwest'), (34, 'yes', 'northwest'), (19, 'yes', 'northwest'), (59, 'yes', 'northwest'), (44, 'yes', 'northwest'), (42, 'yes', 'northwest'), (40, 'yes', 'northwest'), (60, 'yes', 'northwest'), (19, 'yes', 'northwest'), (27, 'yes', 'northwest'), (33, 'yes', 'northwest'), (25, 'yes', 'northwest'), (64, 'yes', 'northwest'), (43, 'yes', 'northwest'), (34, 'yes', 'northwest'), (51, 'yes', 'northwest'), (27, 'yes',

In [52]:
count_northwest = len(total_smokers_nw)

In [53]:
print('The smokers patients in NorthWest are ' + str(count_northwest))

The smokers patients in NorthWest are 58


Again, the same number for the 2 regions: 58! 

Let's try investigating how many **smoking patients are 35 or younger** (rather than 21 or younger) in each Region (NW and SW):

In [64]:
sw_below_35 = []
for record in total_smokers_sw:
    if record[0] <= 35:
        sw_below_35.append(record)
        
nw_below_35 = []
for record in total_smokers_nw:
    if record[0] <= 35:
        nw_below_35.append(record)
        
print(sw_below_35)

print(nw_below_35)

[(19, 'yes', 'southwest'), (30, 'yes', 'southwest'), (31, 'yes', 'southwest'), (22, 'yes', 'southwest'), (28, 'yes', 'southwest'), (19, 'yes', 'southwest'), (19, 'yes', 'southwest'), (19, 'yes', 'southwest'), (27, 'yes', 'southwest'), (34, 'yes', 'southwest'), (19, 'yes', 'southwest'), (26, 'yes', 'southwest'), (33, 'yes', 'southwest'), (21, 'yes', 'southwest'), (29, 'yes', 'southwest'), (30, 'yes', 'southwest'), (29, 'yes', 'southwest'), (33, 'yes', 'southwest'), (30, 'yes', 'southwest'), (32, 'yes', 'southwest'), (25, 'yes', 'southwest'), (31, 'yes', 'southwest'), (24, 'yes', 'southwest'), (23, 'yes', 'southwest'), (20, 'yes', 'southwest'), (20, 'yes', 'southwest'), (20, 'yes', 'southwest'), (19, 'yes', 'southwest'), (25, 'yes', 'southwest'), (19, 'yes', 'southwest')]
[(20, 'yes', 'northwest'), (20, 'yes', 'northwest'), (32, 'yes', 'northwest'), (30, 'yes', 'northwest'), (19, 'yes', 'northwest'), (19, 'yes', 'northwest'), (19, 'yes', 'northwest'), (31, 'yes', 'northwest'), (23, 'yes'

In [65]:
count_sw_below_35 = len(sw_below_35)
print('In SouthWest ' + str(count_sw_below_35) + ' smokers patients are younger or 35 years old')

count_nw_below_35 = len(nw_below_35)
print('In NorthWest ' + str(count_nw_below_35) + ' smokers patients are younger or 35 years old')

In SouthWest 30 smokers patients are younger or 35 years old
In NorthWest 27 smokers patients are younger or 35 years old


To sum up: 
- in Southwest there are 58 smokers out of 325 total patients, of which 30 are 35 years old or younger, of which 11 are underage;
- in Northwest there are 58 smokers out of 325 total patients, of which 27 are 35 years old or younger, of which 11 are underage.

Given this finding, I pick **Southwest**: its share of patients who are smokers aged 35 or younger is slightly higher, as the percentages computed in the next cell show.


In [75]:
print('SW Ratio: ' + str(round(count_sw_below_35 / total_sw_values * 100, 2)))
print('NW Ratio: ' + str(round(count_nw_below_35 / total_nw_values * 100, 2)))

SW Ratio: 9.23
NW Ratio: 8.31
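
The step-by-step tie-breaking used above (first underage smokers, then all smokers, then smokers aged 35 or younger) can be expressed compactly with tuple comparison. This is only a sketch of that reasoning; the figures are copied from the summary above.

```python
# Figures copied from the summary above, in tie-breaking order:
# (smokers aged 21 or younger, all smokers, smokers aged 35 or younger).
candidates = {
    "southwest": (11, 58, 30),
    "northwest": (11, 58, 27),
}

# Tuples compare element by element, so a later entry only matters
# when all earlier entries are tied.
chosen = max(candidates, key=candidates.get)
print(chosen)
```

The first two entries are identical for both regions, so the third entry (30 against 27) decides the choice.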
