# Project Title

## Overview

A one-paragraph overview of the project, including the business problem, data, methods, results and recommendations.

To find potential factors in vaccine hesitancy during the COVID-19 pandemic, we used a dataset describing people who either got the H1N1 vaccine or didn't. We tested several machine learning models on the dataset for prediction accuracy and ended up choosing a ____ model.

## Business Problem

Summary of the business problem you are trying to solve, and the data questions that you plan to answer to solve them.

The CDC has hired us as data scientists to find lessons from the H1N1 pandemic. Early into the vaccine rollout, there has been massive public skepticism of the vaccine, slowing our ability to overcome the virus and re-open the country.

We have received a large dataset from 2009 containing demographic information and scaled opinions about the pandemic and the H1N1 vaccine for each respondent. The data records whether each person was vaccinated, which will allow us to build a model that predicts who gets the vaccine.

This model will help the CDC decide where to invest in public health awareness, surveying and modelling during the pandemic, based on the parameters that most influence an individual's vaccine choice.

Questions to consider:

- Who are your stakeholders?
- What are your stakeholders' pain points related to this project?
- Why are your predictions important from a business perspective?
- What exactly is your deliverable: your analysis, or the model itself?
- Does your business understanding/stakeholder require a specific type of model?
    - For example: a highly regulated industry would require a very transparent/simple/interpretable model, whereas a situation where the model itself is your deliverable would likely benefit from a more complex and thus stronger model
   

Additional questions to consider for classification:

- What does a false positive look like in this context?
- What does a false negative look like in this context?
- Which is worse for your stakeholder?
- What metric are you focusing on optimizing, given the answers to the above questions?

## Data Understanding

The dataset includes responses from roughly 26,700 respondents and covers 34 different characteristics. These provide demographic information such as age, sex, race, income, and education, along with assessments of each respondent's opinions and knowledge about the risk of the H1N1 virus.


The target variable is the 'h1n1_vaccine' column. It is binary: 0 means the respondent didn't get the vaccine and 1 means they did.

Questions to consider:

- Where did the data come from, and how do they relate to the data analysis questions?
- What do the data represent? Who is in the sample and what variables are included?
- What is the target variable?
- What are the properties of the variables you intend to use?

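In [1]:
# Load the survey features and the vaccination labels, both indexed by respondent_id

import pandas as pd

data = pd.read_csv('../Data/training_set_features.csv', index_col='respondent_id')
target = pd.read_csv('../Data/training_set_labels.csv', index_col='respondent_id')

# Join column-wise (axis=1) so each respondent's labels sit beside their features;
# the default axis=0 would stack the two tables and double the row count
df = pd.concat([data, target], axis=1)

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26707 entries, 0 to 26706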
Data columns (total 37 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   h1n1_concern                 26615 non-null  float64
 1   h1n1_knowledge               26591 non-null  float64
 2   behavioral_antiviral_meds    26636 non-null  float64
 3   behavioral_avoidance         26499 non-null  float64
 4   behavioral_face_mask         26688 non-null  float64
 5   behavioral_wash_hands        26665 non-null  float64
 6   behavioral_large_gatherings  26620 non-null  float64
 7   behavioral_outside_home      26625 non-null  float64
 8   behavioral_touch_face        26579 non-null  float64
 9   doctor_recc_h1n1             24547 non-null  float64
 10  doctor_recc_seasonal         24547 non-null  float64
 11  chronic_med_condition        25736 non-null  float64
 12  child_under_6_months         25887 non-null  float64
 13  health_worker   

## Data Preparation

Describe and justify the process for preparing the data for analysis.

Because the dataset also records whether each respondent got the seasonal flu vaccine, along with their scaled opinions on it, we dropped all seasonal-flu columns and focused on H1N1-related data to keep the model simple.

We then dropped the health insurance column: it had many null values, and our initial models were much more accurate without it.

The remaining data is a mix of ordinal variables (more than two discrete levels) and binary variables, stored as both strings and numbers. We converted all string data to integers; for example, 'Male' and 'Female' in the 'sex' column became 0 and 1, respectively.

We filled the remaining missing values in each column with that column's mode.
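
The four preparation steps above could be sketched roughly as follows. This is a minimal illustration on a made-up toy frame, not the project's actual preparation code: the `prepare` helper is hypothetical, and selecting seasonal columns by the substring 'seas' is an assumption about how those columns are named.

```python
import pandas as pd

# Toy stand-in for the survey data; all values are made up for illustration.
df = pd.DataFrame({
    "h1n1_concern": [1.0, 2.0, None, 3.0, 2.0],
    "doctor_recc_h1n1": [0.0, 1.0, 0.0, None, 0.0],
    "doctor_recc_seasonal": [0.0, 1.0, 1.0, 0.0, None],
    "health_insurance": [1.0, None, None, None, 0.0],
    "sex": ["Male", "Female", "Female", None, "Female"],
    "h1n1_vaccine": [0, 1, 0, 0, 1],
})


def prepare(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper applying the preparation steps to a copy of `frame`."""
    out = frame.copy()
    # 1. Drop seasonal-flu columns (assumed: any column whose name contains 'seas').
    seasonal_cols = [col for col in out.columns if "seas" in col]
    out = out.drop(columns=seasonal_cols)
    # 2. Drop the sparsely populated health insurance column.
    out = out.drop(columns=["health_insurance"])
    # 3. Convert string categories to integers.
    out["sex"] = out["sex"].map({"Male": 0, "Female": 1})
    # 4. Fill the remaining gaps with each column's mode (first row of DataFrame.mode()).
    out = out.fillna(out.mode().iloc[0])
    return out


prepared = prepare(df)
print(prepared)
```

Taking the mode column by column keeps every filled value inside the set of answers that actually occur for that question.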

Questions to consider:

- Were there variables you dropped or created?
- How did you address missing values or outliers?
- Why are these choices appropriate given the data and the business problem?
- Can you pipeline your preparation steps to use them consistently in the modeling process?
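
On the pipelining question, one possible approach is to wrap the imputation and encoding in a scikit-learn `Pipeline` so the same steps run at training and prediction time. The sketch below is only an illustration under assumptions: the toy data is made up, the column groupings are hypothetical, and the decision tree settings are placeholders rather than the project's chosen model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy features and labels; all values are made up for illustration.
X = pd.DataFrame({
    "h1n1_concern": [1.0, 2.0, np.nan, 3.0, 2.0, 0.0],
    "doctor_recc_h1n1": [0.0, 1.0, 0.0, np.nan, 0.0, 1.0],
    "sex": ["Male", "Female", "Female", np.nan, "Female", "Male"],
})
y = pd.Series([0, 1, 0, 0, 0, 1], name="h1n1_vaccine")

numeric_cols = ["h1n1_concern", "doctor_recc_h1n1"]  # hypothetical grouping
string_cols = ["sex"]

preprocess = ColumnTransformer([
    # Numeric columns: fill gaps with the most frequent value (the mode).
    ("num", SimpleImputer(strategy="most_frequent"), numeric_cols),
    # String columns: fill with the mode, then encode Male -> 0, Female -> 1.
    ("str", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OrdinalEncoder(categories=[["Male", "Female"]])),
    ]), string_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", DecisionTreeClassifier(max_depth=3, random_state=42)),
])

model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Because the imputer is fitted inside the pipeline, the modes are learned from the training data only, which avoids leaking information from a held-out set.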

In [1]:
# code here to prepare your data



## Modeling

Describe and justify the process for analyzing or modeling the data.

Questions to consider:

- How will you analyze the data to arrive at an initial approach?
- How will you iterate on your initial approach to make it better?
- What model type is most appropriate, given the data and the business problem?

## Evaluation

The evaluation of each model should accompany the creation of each model, and you should be sure to evaluate your models consistently.

Evaluate how well your work solves the stated business problem. 

Questions to consider:

- How do you interpret the results?
- How well does your model fit your data? How much better is this than your baseline model? Is it over or under fit?
- How well does your model/data fit any relevant modeling assumptions?

For the final model, you might also consider:

- How confident are you that your results would generalize beyond the data you have?
- How confident are you that this model would benefit the business if put into use?
- What does this final model tell you about the relationship between your inputs and outputs?

### Baseline Understanding

- What does a baseline, model-less prediction look like?

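At baseline, the target is skewed towards respondents who didn't get the vaccine, who make up 79% of the dataset. A model-less prediction that always guesses 'not vaccinated' would therefore be right about 79% of the time, and any model needs to beat that figure to be useful.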
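
A baseline like this could be computed along the following lines. This is a sketch on made-up labels built to mirror the 79/21 split described above, not the real survey data; using scikit-learn's `DummyClassifier` is one option among several and is an assumption here.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Made-up labels with the class balance described above (79% unvaccinated).
y = pd.Series([0] * 79 + [1] * 21, name="h1n1_vaccine")
# DummyClassifier ignores the features, so a placeholder column is enough.
X = pd.DataFrame({"placeholder": [0] * len(y)})

# Share of each class in the target.
class_shares = y.value_counts(normalize=True)

# Always predict the most frequent class and measure accuracy.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
baseline_accuracy = baseline.score(X, y)

print(class_shares)
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

Accuracy alone flatters this baseline: it never predicts a vaccinated respondent, so its recall on the vaccinated class is zero.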

In [None]:
# code here to arrive at a baseline prediction

### First 'Substandard' Model

Before going too far down the data preparation rabbit hole, be sure to check your work against a first 'substandard' model! What is the easiest way for you to find out how hard your problem is?

We examined how accurately a decision tree model could determine whether someone got vaccinated based on their characteristics.

In this case we dropped rows with missing values and used pd.get_dummies to one-hot encode the ordinal variables with 3 or more values.
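
A first model of this kind might look roughly like the sketch below. The toy data is made up, and the train/test split, the `random_state` values and the default tree settings are assumptions for illustration rather than the settings actually used.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the survey data; all values are made up for illustration.
df = pd.DataFrame({
    "h1n1_concern": [0, 1, 2, 3, 2, 1, 0, 3, 2, None, 1, 3],
    "h1n1_knowledge": [0, 1, 2, 2, 1, 0, 1, 2, 2, 1, 0, 2],
    "doctor_recc_h1n1": [0, 0, 1, 1, 0, 0, 0, 1, 1, 0, None, 1],
    "h1n1_vaccine": [0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1],
})

# Drop rows with any missing value.
complete = df.dropna()

# One-hot encode the ordinal columns that have 3 or more levels.
ordinal_cols = ["h1n1_concern", "h1n1_knowledge"]
X = pd.get_dummies(complete.drop(columns="h1n1_vaccine"), columns=ordinal_cols)
y = complete["h1n1_vaccine"]

# Hold out part of the data so the score reflects unseen respondents.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
accuracy = tree.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Dropping incomplete rows is the quickest way to get a working first model, at the cost of discarding every respondent who skipped even one question.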

In [None]:
# code here for your first 'substandard' model

In [None]:
# code here to evaluate your first 'substandard' model

### Modeling Iterations

Now you can start to use the results of your first model to iterate - there are many options!

In [None]:
# code here to iteratively improve your models

In [None]:
# code here to evaluate your iterations

### 'Final' Model

In the end, you'll arrive at a 'final' model - aka the one you'll use to make your recommendations/conclusions. This likely blends any group work. It might not be the one with the highest scores, but instead might be considered 'final' or 'best' for other reasons.

In [None]:
# code here to show your final model

In [None]:
# code here to evaluate your final model

## Conclusions

Provide your conclusions about the work you've done, including any limitations or next steps.

Questions to consider:

- What would you recommend the business do as a result of this work?
- How could the stakeholder use your model effectively?
- What are some reasons why your analysis might not fully solve the business problem?
- What else could you do in the future to improve this project (future work)?
