# A FLU-SHOT MACHINE LEARNING PROJECT

### 1:BUSINESS UNDERSTANDING

**1.1: OVERVIEW**

Vaccines play a crucial role in disease prevention. They work by providing immunity to individuals as well as communities against communicable diseases such as COVID-19, swine flu, seasonal flu and tuberculosis. The first vaccine was developed in 1796 by the English physician Edward Jenner, who used it to protect a person against smallpox. Today's vaccines have become more complex, robust and efficient thanks to the continuous growth of technology and research in the health industry. The H1N1 vaccine was developed and approved for use in 2009, after the swine flu pandemic broke out that year; the CDC estimates that the pandemic killed between roughly 150,000 and 575,000 people worldwide in its first year. The seasonal vaccine, also known as the flu shot, is commonly used to prevent seasonal influenza during the flu season.

The Centers for Disease Control and Prevention (CDC) wants to understand how individual personal characteristics influence the uptake of the H1N1 and seasonal vaccines, as well as vaccination patterns, which would provide guidance for future public health efforts.

**1.2:CHALLENGES**

Decisions about providing the H1N1 and seasonal vaccines to the population are not adequately data-driven.

**1.3:PROPOSED SOLUTION**

A machine learning model that predicts an individual's uptake of the H1N1 and seasonal vaccines from their personal characteristics, with a target area under the ROC curve (AUC) of 80%.
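
To make the 80% target concrete, the sketch below shows one way an area under the ROC curve could be computed for each of the two targets and then averaged. The function name `mean_roc_auc`, the toy arrays, and the choice to average the two scores are assumptions for illustration, not the method used later in this project.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def mean_roc_auc(y_true, y_prob):
    """Average the per-target ROC AUC over the columns of y_true.

    y_true: array of shape (n_samples, n_targets) holding 0/1 labels.
    y_prob: array of the same shape holding predicted probabilities.
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    scores = [
        roc_auc_score(y_true[:, i], y_prob[:, i])
        for i in range(y_true.shape[1])
    ]
    return float(np.mean(scores))


# Toy data (hypothetical): column 0 = H1N1 vaccine, column 1 = seasonal vaccine
toy_true = np.array([[0, 1], [1, 0], [0, 1], [1, 1]])
toy_prob = np.array([[0.2, 0.7], [0.8, 0.4], [0.3, 0.9], [0.6, 0.6]])

auc = mean_roc_auc(toy_true, toy_prob)
print(f"Mean ROC AUC: {auc:.2f}; 0.80 target met: {auc >= 0.80}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so a target of 0.80 asks the model to rank vaccinated respondents above unvaccinated ones most of the time.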

**1.4:BRIEF CONCLUSION**

Vaccines are crucial for disease prevention, protecting individuals and communities against diseases like COVID-19 and tuberculosis. The CDC is exploring how personal traits affect vaccine uptake. We propose a machine learning model that predicts H1N1 and seasonal flu vaccine uptake from individual characteristics, with a target area under the ROC curve of 80%.

**1.5:PROBLEM STATEMENT**

Vaccines are critical for preventing communicable diseases such as COVID-19, swine flu, seasonal influenza, and tuberculosis. However, making data-driven decisions about vaccine distribution and delivery remains difficult. The CDC seeks to understand how individual characteristics influence uptake of the vaccines in order to guide future public health efforts.

**1.6:OBJECTIVES**

**1.6.1:Main objective**

To predict how likely people are to get the H1N1 and seasonal flu vaccines.

**1.6.2:Specific Objectives**

To determine the distribution of the uptake of the H1N1 and seasonal vaccines.

To determine the correlation between the uptake of the two vaccines.

To determine which characteristics are likely to influence a person to take a particular vaccine.
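
As a rough illustration of the first two objectives, the sketch below computes the uptake rate of each vaccine and the correlation between them on a toy label table. The column names `h1n1_vaccine` and `seasonal_vaccine` and all values are assumptions made for the example.

```python
import pandas as pd

# Hypothetical labels; the column names h1n1_vaccine and seasonal_vaccine are assumed
toy_labels = pd.DataFrame({
    "h1n1_vaccine":     [0, 0, 1, 0, 1, 0, 0, 1],
    "seasonal_vaccine": [0, 1, 1, 0, 1, 0, 1, 1],
})

# Objective 1: the mean of a 0/1 column is the share of respondents who took the vaccine
uptake = toy_labels.mean()
print(uptake)

# Objective 2: Pearson correlation of two binary columns (the phi coefficient)
phi = toy_labels["h1n1_vaccine"].corr(toy_labels["seasonal_vaccine"])
print(f"Correlation between the two vaccines: {phi:.2f}")

# A cross-tabulation shows the joint distribution behind that correlation
joint = pd.crosstab(toy_labels["h1n1_vaccine"], toy_labels["seasonal_vaccine"])
print(joint)
```

The third objective needs the feature columns as well, so it is better addressed once the features and labels have been combined.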

### 2:DATA UNDERSTANDING

**2.1: DATA SOURCE**

The data was downloaded from https://www.drivendata.org/competitions/66/flu-shot-learning/data/. It comes from a phone survey conducted in 2009, courtesy of the United States National Center for Health Statistics.

**2.2:DATA DESCRIPTION**

There are two datasets:

**Features** contains the responses individuals gave during the phone survey (our predictor variables).

**Labels** contains the responses on whether they had taken the H1N1 or the seasonal vaccine (our two target variables, H1N1 and seasonal vaccine, each coded as 0 or 1).

Let's load them.


In [1]:
# Importing the necessary modules

# Data Manipulation
import pandas as pd 
import numpy as np 

# Visualisation
import seaborn as sns 
import matplotlib.pyplot as plt 
%matplotlib inline

# Modelling 
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import chi2,SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import tree
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix,roc_auc_score
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import GridSearchCV

In [2]:
# Loading the two datasets
feature_df = pd.read_csv(r'C:\Users\user\Documents\new\Flu-Shot-Learning-project\training_set_features.csv')
label_df = pd.read_csv(r'C:\Users\user\Documents\new\Flu-Shot-Learning-project\training_set_labels.csv')
print(feature_df)
print(label_df)

       respondent_id  h1n1_concern  h1n1_knowledge  behavioral_antiviral_meds  \
0                  0           1.0             0.0                        0.0   
1                  1           3.0             2.0                        0.0   
2                  2           1.0             1.0                        0.0   
3                  3           1.0             1.0                        0.0   
4                  4           2.0             1.0                        0.0   
...              ...           ...             ...                        ...   
26702          26702           2.0             0.0                        0.0   
26703          26703           1.0             2.0                        0.0   
26704          26704           2.0             2.0                        0.0   
26705          26705           1.0             1.0                        0.0   
26706          26706           0.0             0.0                        0.0   

       behavioral_avoidance

In [3]:
# Merge the two datasets on their shared respondent_id column using a left join
vaccine_df = feature_df.merge(label_df, how='left', on='respondent_id')
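
A left join silently duplicates rows if `respondent_id` repeats in either table, and leaves missing targets if a respondent has no label. The sketch below shows one way to guard against both using the `validate` and `indicator` parameters of `DataFrame.merge`. The toy frames and the `h1n1_vaccine` column name are assumptions standing in for `feature_df` and `label_df`.

```python
import pandas as pd

# Toy stand-ins for feature_df and label_df (hypothetical values)
toy_features = pd.DataFrame({"respondent_id": [0, 1, 2], "h1n1_concern": [1.0, 3.0, 1.0]})
toy_labels = pd.DataFrame({"respondent_id": [0, 1, 2], "h1n1_vaccine": [0, 0, 1]})

# validate="one_to_one" raises pandas.errors.MergeError if respondent_id repeats on either side;
# indicator=True adds a _merge column recording which table(s) each row came from
merged = toy_features.merge(
    toy_labels,
    how="left",
    on="respondent_id",
    validate="one_to_one",
    indicator=True,
)

# Rows marked "left_only" are respondents that have features but no labels
unmatched = int((merged["_merge"] == "left_only").sum())
print(f"Rows: {len(merged)}, respondents without labels: {unmatched}")

# Drop the helper column once the check has passed
merged = merged.drop(columns="_merge")
```

If both checks pass, the merged frame has exactly one row per respondent, which matches the 26707 rows reported for the merged dataset.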

In [4]:
# Create a DataUnderstanding class and initialize it to help us understand the merged dataset

# Define the DataUnderstanding class
class DataUnderstanding:
    def __init__(self, data):
        self.data = data

    def data_shape(self):
        print(f"Shape: {self.data.shape}")

    def summary_info(self):
        print("Dataframe Info:")
        self.data.info()
    
    def summary_statistics(self):
        print(f"Descriptive Statistics:\n{self.data.describe()}")

    def columns_summary(self):
        print(f"Columns:\n{self.data.columns.tolist()}")

    def data_types(self):
        print(f"Data Types for each column:\n{self.data.dtypes}")

    def first_rows(self, n=10):
        print(f"The first {n} rows:\n{self.data.head(n)}")

    def last_rows(self, n=10):
        print(f"The last {n} rows:\n{self.data.tail(n)}")

    def columns_value_counts(self):
        for column in self.data.columns:
            print(f"Value counts for column '{column}':")
            print(self.data[column].value_counts())
            print("\n")


# Initialize the DataUnderstanding class with the merged DataFrame
data_summary = DataUnderstanding(vaccine_df)

# Calling all the methods to see the outputs
data_summary.data_shape()
data_summary.summary_info()
data_summary.summary_statistics()
data_summary.columns_summary()
data_summary.data_types()
data_summary.first_rows()
data_summary.last_rows()
data_summary.columns_value_counts()


Shape: (26707, 38)
Dataframe Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 38 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   respondent_id                26707 non-null  int64  
 1   h1n1_concern                 26615 non-null  float64
 2   h1n1_knowledge               26591 non-null  float64
 3   behavioral_antiviral_meds    26636 non-null  float64
 4   behavioral_avoidance         26499 non-null  float64
 5   behavioral_face_mask         26688 non-null  float64
 6   behavioral_wash_hands        26665 non-null  float64
 7   behavioral_large_gatherings  26620 non-null  float64
 8   behavioral_outside_home      26625 non-null  float64
 9   behavioral_touch_face        26579 non-null  float64
 10  doctor_recc_h1n1             24547 non-null  float64
 11  doctor_recc_seasonal         24547 non-null  float64
 12  chronic_med_condition        25736 non-
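
The non-null counts above fall short of 26707 for several columns; `doctor_recc_h1n1`, for example, has 24547 non-null entries, which implies 2160 missing values. The sketch below shows one way to tabulate missing values per column. The helper name `missing_summary` and the toy frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return per-column missing counts and percentages, largest first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only the columns that actually have missing values
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame with hypothetical values
toy_df = pd.DataFrame({
    "respondent_id": [0, 1, 2, 3],
    "h1n1_concern": [1.0, np.nan, 2.0, 3.0],
    "doctor_recc_h1n1": [np.nan, np.nan, 0.0, 1.0],
})
toy_missing = missing_summary(toy_df)
print(toy_missing)
```

On the real data the same helper could be called as `missing_summary(vaccine_df)`.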

**2.3:COLUMN DISTRIBUTION**

We have 38 columns. Let's try to understand what they mean, courtesy of (**https://www.drivendata.org/competitions/66/flu-shot-learning/page/211/**).

h1n1_concern - Level of concern about the H1N1 flu.
0 = Not at all concerned; 1 = Not very concerned; 2 = Somewhat concerned; 3 = Very concerned.

h1n1_knowledge - Level of knowledge about H1N1 flu.
0 = No knowledge; 1 = A little knowledge; 2 = A lot of knowledge.

behavioral_antiviral_meds - Has taken antiviral medications. (binary)

behavioral_avoidance - Has avoided close contact with others with flu-like symptoms. (binary)

behavioral_face_mask - Has bought a face mask. (binary)

behavioral_wash_hands - Has frequently washed hands or used hand sanitizer. (binary)

behavioral_large_gatherings - Has reduced time at large gatherings. (binary)

behavioral_outside_home - Has reduced contact with people outside of own household. (binary)

behavioral_touch_face - Has avoided touching eyes, nose, or mouth. (binary)

doctor_recc_h1n1 - H1N1 flu vaccine was recommended by doctor. (binary)

doctor_recc_seasonal - Seasonal flu vaccine was recommended by doctor. (binary)

chronic_med_condition - Has any of the following chronic medical conditions: asthma or another lung condition, diabetes, a heart condition, a kidney condition, sickle cell anemia or other anemia, a neurological or neuromuscular condition, a liver condition, or a weakened immune system caused by a chronic illness or by medicines taken for a chronic illness. (binary)

child_under_6_months - Has regular close contact with a child under the age of six months. (binary)

health_worker - Is a healthcare worker. (binary)

health_insurance - Has health insurance. (binary)

opinion_h1n1_vacc_effective - Respondent's opinion about H1N1 vaccine effectiveness.
1 = Not at all effective; 2 = Not very effective; 3 = Don't know; 4 = Somewhat effective; 5 = Very effective.

opinion_h1n1_risk - Respondent's opinion about risk of getting sick with H1N1 flu without vaccine.
1 = Very Low; 2 = Somewhat low; 3 = Don't know; 4 = Somewhat high; 5 = Very high.

opinion_h1n1_sick_from_vacc - Respondent's worry of getting sick from taking H1N1 vaccine.
1 = Not at all worried; 2 = Not very worried; 3 = Don't know; 4 = Somewhat worried; 5 = Very worried.

opinion_seas_vacc_effective - Respondent's opinion about seasonal flu vaccine effectiveness.
1 = Not at all effective; 2 = Not very effective; 3 = Don't know; 4 = Somewhat effective; 5 = Very effective.

opinion_seas_risk - Respondent's opinion about risk of getting sick with seasonal flu without vaccine.
1 = Very Low; 2 = Somewhat low; 3 = Don't know; 4 = Somewhat high; 5 = Very high.

opinion_seas_sick_from_vacc - Respondent's worry of getting sick from taking seasonal flu vaccine.
1 = Not at all worried; 2 = Not very worried; 3 = Don't know; 4 = Somewhat worried; 5 = Very worried.

age_group - Age group of respondent.

education - Self-reported education level.

race - Race of respondent.

sex - Sex of respondent.

income_poverty - Household annual income of respondent with respect to 2008 Census poverty thresholds.

marital_status - Marital status of respondent.

rent_or_own - Housing situation of respondent.

employment_status - Employment status of respondent.

hhs_geo_region - Respondent's residence using a 10-region geographic classification defined by the U.S. Dept. of Health and Human Services. Values are represented as short random character strings.

census_msa - Respondent's residence within metropolitan statistical areas (MSA) as defined by the U.S. Census.

household_adults - Number of other adults in household, top-coded to 3.

household_children - Number of children in household, top-coded to 3.

employment_industry - Type of industry respondent is employed in. Values are represented as short random character strings.

employment_occupation - Type of occupation of respondent. Values are represented as short random character strings.

This dataset is relevant to our project because it pairs each respondent's characteristics with whether they took each vaccine.
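
Several columns store ordinal answers as numeric codes, which can be hard to read in tables and plots. The sketch below shows one way to attach the labels from the descriptions above to `h1n1_concern` and `h1n1_knowledge`. The toy frame and the new `_label` column names are assumptions for illustration.

```python
import pandas as pd

# Code-to-label mappings copied from the column descriptions
CONCERN_LABELS = {
    0: "Not at all concerned",
    1: "Not very concerned",
    2: "Somewhat concerned",
    3: "Very concerned",
}
KNOWLEDGE_LABELS = {0: "No knowledge", 1: "A little knowledge", 2: "A lot of knowledge"}

# Toy responses (hypothetical values), including missing answers
toy_df = pd.DataFrame({
    "h1n1_concern": [1.0, 3.0, None, 2.0],
    "h1n1_knowledge": [0.0, 2.0, 1.0, None],
})

# Series.map leaves missing answers as NaN, so they stay visible for later cleaning
toy_df["h1n1_concern_label"] = toy_df["h1n1_concern"].map(CONCERN_LABELS)
toy_df["h1n1_knowledge_label"] = toy_df["h1n1_knowledge"].map(KNOWLEDGE_LABELS)
print(toy_df)
```

The numeric codes are kept alongside the labels because the codes preserve the ordering of the answers, which may matter for modelling.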