### Assignments:
1. Perform an Exploratory Data Analysis (EDA) with visualization.
2. Generate a training and test set. The test set should be used only at the end. Your goal is to estimate an individual's “Physical Health” feature.
3. Preprocess the dataset (remove outliers, encode categorical features with one-hot encoding, not necessarily in this order).
4. Define whether this is a regression, classification or clustering problem, explain why, and choose your model design accordingly. Test at least 3 different models. First, create a validation set from the training set to analyze the behaviour with the default hyperparameters. Then use cross-validation to find the best set of hyperparameters. You must describe every hyperparameter tuned (the more, the better).
5. Select the best architecture using the right metric.
6. Compute the performance on the test set.
7. Explain your findings
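
As a rough illustration of how steps 2 to 6 could fit together, here is a minimal sketch using scikit-learn. It runs on synthetic stand-in data (not the medcenter records) and treats “Physical Health” as a numeric target purely for the sake of the example. The model (`Ridge`), the single tuned hyperparameter (`alpha`), the split ratios, the IQR rule and the metric are all assumptions, not the required solution.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (NOT the real dataset); only the column names mirror it.
rng = np.random.default_rng(0)
n = 200
toy_df = pd.DataFrame({
    "Hours of sleep": rng.integers(4, 11, size=n).astype(float),
    "Body Mass Index": rng.normal(27, 5, size=n).round(2),
    "Gender": rng.choice(["F", "M"], size=n),
    "How do you Feel": rng.choice(["Poor", "Fair", "Good"], size=n),
    "Physical Health": rng.integers(0, 31, size=n).astype(float),  # arbitrary toy range
})
X = toy_df.drop(columns="Physical Health")
y = toy_df["Physical Health"]

# Step 2: hold out a test set that is touched only at the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 3 (outliers): drop training rows whose BMI falls outside 1.5 * IQR.
q1, q3 = X_train["Body Mass Index"].quantile([0.25, 0.75])
iqr = q3 - q1
keep = X_train["Body Mass Index"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
X_train, y_train = X_train[keep], y_train[keep]

# Step 3 (encoding): one-hot encode the categorical columns inside a pipeline,
# so the encoder is fitted on training folds only.
categorical = ["Gender", "How do you Feel"]
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)
model = Pipeline([("preprocess", preprocess), ("regressor", Ridge())])

# Step 4a: validation set carved out of the training set, default hyperparameters.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)
model.fit(X_fit, y_fit)
val_mae = mean_absolute_error(y_val, model.predict(X_val))

# Step 4b: 5-fold cross-validation over a small hyperparameter grid.
search = GridSearchCV(
    model,
    {"regressor__alpha": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X_train, y_train)
best_alpha = search.best_params_["regressor__alpha"]

# Step 6: final, one-time evaluation on the held-out test set.
test_mae = mean_absolute_error(y_test, search.predict(X_test))
print(f"validation MAE={val_mae:.2f}, best alpha={best_alpha}, test MAE={test_mae:.2f}")
```

Wrapping the encoder and the model in one `Pipeline` keeps the preprocessing inside each cross-validation fold, which avoids leaking information from the validation folds into the fitted encoder.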

## Welcome to the MedCare Wellness Research Center!
Our primary objective is to better understand the health and well-being of the senior population across diverse communities. The dataset contains records of thousands of elderly individuals, detailing an array of health indicators and lifestyle factors. Our mission? To anticipate and predict health issues and concerns in our aging population and so improve their quality of life. Let's journey through this data-driven exploration to enrich the lives of our senior community!

In [12]:
# importing packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Exploratory Data Analysis (EDA)
The first step of our analysis is Exploratory Data Analysis (EDA), which gives us a deeper understanding of the dataset. EDA is central to any data analysis: it lets us understand the structure and meaning of the data, detect outliers, identify patterns through plots and visualizations, check data quality and integrity, and, above all, build a solid foundation for the steps that follow. It reduces the risk of mistakes and enables more accurate predictions and insights!
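
To make the idea concrete, here is a minimal sketch of the numeric side of an EDA. It uses only the first five rows of the dataset (as printed by `medcenter_df.head()`), restricted to a few columns, so the numbers are illustrative and say nothing about the full data. The variable names (`sample_df`, `mean_by_feel`, and so on) are hypothetical.

```python
import pandas as pd

# First five rows of the dataset, restricted to a few columns.
sample_df = pd.DataFrame({
    "Hours of sleep": [10.0, 7.0, 7.0, 8.0, 8.0],
    "How do you Feel": ["Good", "Fair", "Good", "Good", "Fair"],
    "Gender": ["F", "F", "M", "F", "M"],
    "Body Mass Index": [15.55, 38.62, 21.62, 22.14, 43.05],
    "Physical Health": [7.0, 2.0, 3.0, 0.0, 0.0],
})

# Summary statistics of the numeric columns (count, mean, std, quartiles, ...).
numeric_summary = sample_df.describe()

# Frequency of each category in a categorical column.
feel_counts = sample_df["How do you Feel"].value_counts()

# Average target value per category: a first look at a possible relationship.
mean_by_feel = sample_df.groupby("How do you Feel")["Physical Health"].mean()

# Pairwise linear correlation between the numeric columns.
corr = sample_df[["Hours of sleep", "Body Mass Index", "Physical Health"]].corr()

print(numeric_summary)
print(feel_counts)
print(mean_by_feel)
print(corr)
```

On the full dataset the same calls would be applied to `medcenter_df`, and the resulting tables are natural inputs for histograms, bar plots and a correlation heatmap.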


In [13]:
# Read data
# Forward slash avoids the invalid '\m' escape sequence and works on every OS
path = 'data/medcenter.csv'
medcenter_df = pd.read_csv(path)

In [14]:
# Preview the first rows to understand what each column means
medcenter_df.head()

Unnamed: 0,Walking Difficulty,Torsades de Pointes,Skin Cancer,Hours of sleep,How do you Feel,Asthma Status,Do you Exercise,Gender,Kidney Disease,Is Smoking,Ethnicity,Diabetes,How many Drinks per Week,Age Group,Mental Health,Body Mass Index,Physical Health,History of Stroke,Patient ID
0,Y,Y,N,10.0,Good,N,Y,F,N,Y,White,N,N,80 or older,0.0,15.55,7.0,Y,100074
1,N,Y,N,7.0,Fair,Y,N,F,N,N,White,Y,N,65-69,0.0,38.62,2.0,N,100086
2,N,Y,N,7.0,Good,N,N,M,N,N,White,N,N,60-64,0.0,21.62,3.0,N,100094
3,Y,Y,N,8.0,Good,N,N,F,N,Y,White,N,N,65-69,0.0,22.14,0.0,N,100154
4,Y,Y,N,8.0,Fair,N,Y,M,Y,N,White,Y,N,70-74,0.0,43.05,0.0,N,100158


In [22]:
# Overview of the medcenter dataset: dtypes, non-null counts and shape
print("Relevant information of the dataset:")
medcenter_df.info()  # info() prints its report directly and returns None
print("Shape of the medcenter dataset:", medcenter_df.shape)


Relevant information of the dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 261311 entries, 0 to 261310
Data columns (total 19 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   Walking Difficulty        261311 non-null  object 
 1   Torsades de Pointes       261311 non-null  object 
 2   Skin Cancer               261311 non-null  object 
 3   Hours of sleep            261311 non-null  float64
 4   How do you Feel           261311 non-null  object 
 5   Asthma Status             261311 non-null  object 
 6   Do you Exercise           261311 non-null  object 
 7   Gender                    261311 non-null  object 
 8   Kidney Disease            261311 non-null  object 
 9   Is Smoking                261311 non-null  object 
 10  Ethnicity                 261311 non-null  object 
 11  Diabetes                  261311 non-null  object 
 12  How many Drinks per Week  261311 non-null  object 
 13  Age Gro

##### Check data integrity

In [16]:
# Count the missing values in each column
missing_values = medcenter_df.isnull().sum()
print('missing values: \n', missing_values)

missing values: 
 Walking Difficulty          0
Torsades de Pointes         0
Skin Cancer                 0
Hours of sleep              0
How do you Feel             0
Asthma Status               0
Do you Exercise             0
Gender                      0
Kidney Disease              0
Is Smoking                  0
Ethnicity                   0
Diabetes                    0
How many Drinks per Week    0
Age Group                   0
Mental Health               0
Body Mass Index             0
Physical Health             0
History of Stroke           0
Patient ID                  0
dtype: int64
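
Missing values are only one aspect of data integrity. A minimal sketch of a few further checks follows, again on the five rows shown by `head()` rather than on the full dataset. The chosen checks and the 0 to 24 bound on “Hours of sleep” (assuming the column counts hours per day) are assumptions for illustration.

```python
import pandas as pd

# First five rows of the dataset, restricted to a few columns.
sample_df = pd.DataFrame({
    "Hours of sleep": [10.0, 7.0, 7.0, 8.0, 8.0],
    "Body Mass Index": [15.55, 38.62, 21.62, 22.14, 43.05],
    "Physical Health": [7.0, 2.0, 3.0, 0.0, 0.0],
    "Patient ID": [100074, 100086, 100094, 100154, 100158],
})

# Fully duplicated rows would indicate repeated records.
n_duplicate_rows = int(sample_df.duplicated().sum())

# Each patient should appear once if "Patient ID" is a true identifier.
ids_are_unique = sample_df["Patient ID"].is_unique

# Plausibility checks on numeric ranges (bounds are assumptions).
sleep_in_range = bool(sample_df["Hours of sleep"].between(0, 24).all())
bmi_positive = bool((sample_df["Body Mass Index"] > 0).all())

print("duplicate rows:", n_duplicate_rows)
print("unique patient IDs:", ids_are_unique)
print("sleep within 0-24 h:", sleep_in_range)
print("BMI strictly positive:", bmi_positive)
```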
