In [1]:
# Core data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# scikit-learn: data splitting, scaling, validation, models, metrics, PCA
from sklearn import tree
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    mean_absolute_error,
    mean_squared_error,
    precision_score,
    recall_score,
)
from sklearn.model_selection import (
    cross_val_predict,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Silence all warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

# <font color='#0000FF'>Table of Contents</font>

[1. Exploring the data](#1)

<p style="padding:10px;background-color:#B9B7BD;margin:0;color:#000C66;font-family:sans serif;font-size:240%;text-align:center; overflow:hidden; font-weight:500; font-style:italic"><a id='1'></a>1. Exploring the data</p>


Data exploration is a crucial step in the data analysis process. It helps you understand the nature of the data, identify trends and patterns, and lay the groundwork for more in-depth analyses. Here are some steps and techniques commonly used during data exploration:

- Understanding the Data: Start by examining basic features such as dataset size, variable types, and the first few rows to get an initial overview.

- Descriptive Statistics: Calculate descriptive statistics such as the mean, median, and standard deviation to get a sense of how the numerical variables are distributed.

- Visualization: Use graphs to visualize the data. Histograms, box plots, and scatter plots help show the distribution, dispersion, and relationships between variables.

- Correlation Analysis: Explore the relationships between variables by calculating correlations. This can reveal interesting associations or potential collinearity.

- Segmentation: If the data allows, segment it to identify homogeneous subgroups. This can help tailor analyses to specific characteristics.

- Preliminary Statistical Tests: If needed, run preliminary statistical tests to check assumptions such as normality and equality of variances.
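Several of the steps above can be sketched with plain pandas. The example below is only an illustration: `sample` is a small frame typed in by hand from a few columns of the five rows displayed by `df.head()` in this notebook, and `first_look` is a hypothetical helper name, not part of pandas. It covers the overview, descriptive statistics, correlation, and segmentation steps; plots and statistical tests are left out.

```python
import pandas as pd

# Hand-typed subset of the five rows shown by df.head() in this notebook.
# Five rows are far too few for real conclusions; this only shows the calls.
sample = pd.DataFrame({
    "Sex": ["Female", "Male", "Male", "Female", "Female"],
    "GeneralHealth": ["Very good", "Very good", "Very good", "Fair", "Good"],
    "PhysicalHealthDays": [4.0, 0.0, 0.0, 5.0, 3.0],
    "MentalHealthDays": [0.0, 0.0, 0.0, 0.0, 15.0],
    "SleepHours": [9.0, 6.0, 8.0, 9.0, 5.0],
    "HeightInMeters": [1.60, 1.78, 1.85, 1.70, 1.55],
    "WeightInKilograms": [71.67, 95.25, 108.86, 90.72, 79.38],
    "BMI": [27.99, 30.13, 31.66, 31.32, 33.07],
})


def first_look(frame, group_col):
    """Collect basic exploration summaries for `frame` (hypothetical helper)."""
    numeric = frame.select_dtypes(include="number")
    return {
        # Understanding the data: size, variable types, missing values
        "shape": frame.shape,
        "dtypes": frame.dtypes,
        "missing": frame.isna().sum(),
        # Descriptive statistics for the numerical variables
        "describe": numeric.describe(),
        # Correlation analysis (Pearson by default)
        "corr": numeric.corr(),
        # Segmentation: mean of each numerical variable per subgroup
        "by_group": frame.groupby(group_col)[numeric.columns.tolist()].mean(),
    }


summary = first_look(sample, group_col="Sex")
print(summary["shape"])
print(summary["describe"])
print(summary["by_group"])
```

On the full dataset the same calls would be applied to `df` once it has been loaded.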

In [2]:
df = pd.read_csv('heart_disease_2022_cleaned.csv')
df.head()

Unnamed: 0,State,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,RemovedTeeth,HadHeartAttack,...,HeightInMeters,WeightInKilograms,BMI,AlcoholDrinkers,HIVTesting,FluVaxLast12,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos
0,Alabama,Female,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,None of them,No,...,1.6,71.67,27.99,No,No,Yes,Yes,"Yes, received Tdap",No,No
1,Alabama,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,6.0,None of them,No,...,1.78,95.25,30.13,No,No,Yes,Yes,"Yes, received tetanus shot but not sure what type",No,No
2,Alabama,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,"6 or more, but not all",No,...,1.85,108.86,31.66,Yes,No,No,Yes,"No, did not receive any tetanus shot in the pa...",No,Yes
3,Alabama,Female,Fair,5.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,None of them,No,...,1.7,90.72,31.32,No,No,Yes,Yes,"No, did not receive any tetanus shot in the pa...",No,Yes
4,Alabama,Female,Good,3.0,15.0,Within past year (anytime less than 12 months ...,Yes,5.0,1 to 5,No,...,1.55,79.38,33.07,No,No,Yes,Yes,"No, did not receive any tetanus shot in the pa...",No,No
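The `Unnamed: 0` header in the output above may be a row index that was written into the CSV when the cleaned file was saved; that is an assumption, since the file itself is not shown here. If it is the case, a minimal sketch of how pandas treats such a column, using a small in-memory CSV (`csv_text` is made up for illustration, with values taken from the rows above):

```python
import io

import pandas as pd

# Illustrative CSV whose first header cell is empty, as happens when a
# DataFrame is saved with its index. Not the real data file.
csv_text = """,State,Sex,SleepHours
0,Alabama,Female,9.0
1,Alabama,Male,6.0
2,Alabama,Male,8.0
"""

# Default read: the unnamed first column becomes a regular column.
default = pd.read_csv(io.StringIO(csv_text))
print(default.columns.tolist())

# Treating the first column as the index keeps it out of the features.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed.columns.tolist())
```

Reading with `index_col=0`, or dropping the column after loading, would keep a leftover index from being used as a feature later on.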
