## First section - creating a virtual environment to run all the code

In [None]:
# Check Python version
import sys

python_version = sys.version.split()[0]
print(f"Your current Python version: {python_version}")

Your current Python version: 3.10.11
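
The printed string is fine for a visual check. If the notebook should also confirm that the interpreter matches the 3.10 release requested in the next cell, one possible sketch is shown below. It compares `sys.version_info` instead of parsing the version string; the helper name `is_required_version` and the constant `REQUIRED` are hypothetical and not part of the original notebook.

```python
import sys

REQUIRED = (3, 10)  # assumption: the major/minor version the venv cell asks for


def is_required_version(version_info=sys.version_info, required=REQUIRED):
    """Return True when the major and minor version match the required pair."""
    return tuple(version_info[:2]) == tuple(required)


if not is_required_version():
    print(f"Python {REQUIRED[0]}.{REQUIRED[1]} is expected, "
          f"found {sys.version.split()[0]}")
```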


In [None]:
# Create a virtual environment with Python 3.10 (install Python 3.10 first if you don't have it yet)
import os
import platform

system = platform.system()

if system == "Windows":
    !py -3.10 -m venv venv
else:  # macOS and Linux
    !python3.10 -m venv venv

if os.path.exists('venv'):
    print("Virtual environment was created successfully!")
else:
    print("Virtual environment was not created. Check that Python 3.10 is installed.")

Virtual environment was created successfully!


In [10]:
# Install packages from requirements.txt
!pip install -r requirements.txt



You should consider upgrading via the 'c:\users\bondi\appdata\local\programs\python\python39\python.exe -m pip install --upgrade pip' command.
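
Note that `!pip` runs whichever `pip` is found first on the PATH, which is not necessarily the one inside `venv` (the warning above points at a Python 3.9 installation). A minimal sketch of installing into the environment explicitly is shown below. It assumes the `venv` directory created earlier and the standard venv layout (`Scripts` on Windows, `bin` elsewhere); the helper names `venv_python` and `install_requirements` are hypothetical.

```python
import platform
import subprocess
from pathlib import Path


def venv_python(venv_dir="venv", system=None):
    """Return the path to the Python interpreter inside a virtual environment."""
    system = system or platform.system()
    if system == "Windows":
        return Path(venv_dir) / "Scripts" / "python.exe"
    return Path(venv_dir) / "bin" / "python"


def install_requirements(venv_dir="venv", requirements="requirements.txt"):
    """Install the requirements with the environment's own pip (python -m pip)."""
    python = venv_python(venv_dir)
    if not python.exists():
        raise FileNotFoundError(f"No interpreter found at {python}")
    subprocess.run(
        [str(python), "-m", "pip", "install", "-r", requirements],
        check=True,  # raise CalledProcessError if pip fails
    )
```

Calling `install_requirements()` would then install into `venv`. The notebook kernel itself still has to run from that environment for the installed packages to be importable.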


<br>

### 1. Dataset description

The dataset used for the model consists of 23 columns (22 features and smoker status) and 38,984 patient records.<br>
The dataset can be found on [Kaggle](https://www.kaggle.com/datasets/gauravduttakiit/smoker-status-prediction?resource=download&select=train_dataset.csv).


#### Problem overview

Analysing smoking and developing predictive models for smoking status using bio-signals is essential from an economic perspective because of its significant implications for healthcare costs and productivity losses. Smoking-related illnesses impose a heavy economic burden on societies worldwide, with costs arising from healthcare expenditures, decreased productivity due to illness or premature death, and other associated social costs. This study aims to develop and analyse a predictive model for smoker status using a dataset from Kaggle that includes various bio-signals and demographic information.

The adverse health effects of smoking are well-documented, including harm to nearly every organ, numerous diseases, and reduced life expectancy. According to a World Health Organization report, smoking is the leading cause of preventable morbidity and mortality, with smoking-related deaths projected to reach 10 million by 2030.

#### Feature Descriptions

| **Name**            | **Description**                                                                                                                                          |
|----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|
| Age                 | Age of patient, grouped by 5-year increments                                                                                                            |
| Height              | Height of patient, grouped by 5-cm increments                                                                                                           |
| Weight              | Weight of patient, grouped by 5-kg increments                                                                                                           |
| Waist               | Waist circumference in cm                                                                                                                              |
| Eyesight (left)     | Visual acuity in left eye from 0.1 to 2.0 (higher is better), where 1.0 is equivalent to 20/20, blindness is 9.9                                         |
| Eyesight (right)    | Visual acuity in right eye from 0.1 to 2.0 (higher is better), where 1.0 is equivalent to 20/20, blindness is 9.9                                        |
| Hearing (left)      | Hearing in left ear where 1=normal, 2=abnormal                                                                                                          |
| Hearing (right)     | Hearing in right ear where 1=normal, 2=abnormal                                                                                                         |
| Systolic            | Blood pressure, amount of pressure experienced by the arteries when the heart is contracting                                                           |
| Relaxation          | Blood pressure (diastolic), amount of pressure experienced by the arteries when the heart is relaxing                                                  |
| Fasting Blood Sugar | Blood sugar level (concentration per 100ml of blood) before eating                                                                                     |
| Cholesterol         | Sum of ester-type and non-ester-type cholesterol                                                                                                        |
| Triglyceride        | Amount of simple and neutral lipids in blood                                                                                                           |
| HDL                 | High Density Lipoprotein, "good" cholesterol, absorbs cholesterol in the blood and carries it back to the liver                                        |
| LDL                 | Low Density Lipoprotein, "bad" cholesterol, makes up most of body's cholesterol. High levels of this raise risk for heart disease and stroke.          |
| Hemoglobin          | Protein contained in red blood cells that delivers oxygen to the tissues                                                                               |
| Urine Protein       | Amount of protein mixed in urine                                                                                                                       |
| Serum Creatinine    | Creatinine level. Creatinine is a waste product in your blood that comes from your muscles. Healthy kidneys filter creatinine out of your blood.       |
| AST                 | Aspartate transaminase, an enzyme that helps the body break down amino acids. An increase in AST levels may mean liver damage, liver disease, or muscle damage. |
| ALT                 | Alanine transaminase, an enzyme found in the liver that helps convert proteins into energy for liver cells. When the liver is damaged, ALT increases.  |
| GTP                 | Gamma-glutamyltransferase (GGT), an enzyme in the blood. Higher-than-usual levels may mean liver or bile duct damage.                                   |
| Dental Caries       | Cavities, 0=absent, 1=present                                                                                                                           |
| Smoking             | 0=non-smoker, 1=smoker                                                                                                                                  |
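
The eyesight columns mix a measurement scale (0.1 to 2.0) with a sentinel code (9.9 for blindness), so the raw values should not be read as one continuous scale. One possible way to handle this is to move the code into a separate indicator column. This is a sketch, not necessarily the approach taken later in this notebook. The helper name `split_blindness_code`, the name of the added indicator columns, and the choice to replace 9.9 with a missing value are all assumptions.

```python
import numpy as np
import pandas as pd

BLINDNESS_CODE = 9.9  # sentinel value described in the feature table


def split_blindness_code(df, columns=("Eyesight (left)", "Eyesight (right)")):
    """Return a copy with 9.9 replaced by NaN and a 0/1 indicator per column."""
    out = df.copy()
    for col in columns:
        is_blind = np.isclose(out[col], BLINDNESS_CODE)
        out[f"{col} blind"] = is_blind.astype(int)  # hypothetical column name
        out[col] = out[col].mask(is_blind)          # keep only real acuity values
    return out


# Toy data for illustration only, not taken from the dataset
example = pd.DataFrame({
    "Eyesight (left)": [0.9, 9.9, 1.5],
    "Eyesight (right)": [1.0, 0.7, 9.9],
})
cleaned = split_blindness_code(example)
```

Whether to impute the resulting missing values or keep the indicator as a feature depends on the model used downstream.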



In [3]:
# Load the dataset into a pandas DataFrame
import pandas as pd

df = pd.read_csv('Data/dataset.csv')

# Rename the columns by position to the readable names used in the table above
df.columns = [
    'Age',
    'Height',
    'Weight',
    'Waist',
    'Eyesight (left)',
    'Eyesight (right)',
    'Hearing (left)',
    'Hearing (right)',
    'Systolic',
    'Relaxation',
    'Fasting Blood Sugar',
    'Cholesterol',
    'Triglyceride',
    'HDL',
    'LDL',
    'Hemoglobin',
    'Urine Protein',
    'Serum Creatinine',
    'AST',
    'ALT',
    'GTP',
    'Dental Caries',
    'Smoking'
]

df.head()

Unnamed: 0,Age,Height,Weight,Waist,Eyesight (left),Eyesight (right),Hearing (left),Hearing (right),Systolic,Relaxation,...,HDL,LDL,Hemoglobin,Urine Protein,Serum Creatinine,AST,ALT,GTP,Dental Caries,Smoking
0,35,170,85,97.0,0.9,0.9,1,1,118,78,...,70,142,19.8,1,1.0,61,115,125,1,1
1,20,175,110,110.0,0.7,0.9,1,1,119,79,...,71,114,15.9,1,1.1,19,25,30,1,0
2,45,155,65,86.0,0.9,0.9,1,1,110,80,...,57,112,13.7,3,0.6,1090,1400,276,0,0
3,45,165,80,94.0,0.8,0.7,1,1,158,88,...,46,91,16.9,1,0.9,32,36,36,0,0
4,20,165,60,81.0,1.5,0.1,1,1,109,64,...,47,92,14.9,1,1.2,26,28,15,0,0
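
Assigning `df.columns` by position relies on the CSV keeping its column order: pandas raises an error if the number of names differs, but it cannot notice a reordered file. A small sketch that builds the old-to-new mapping first, so it can be inspected before renaming, is shown below. The helper name `positional_rename_map` and the toy column names are hypothetical; the real names in `Data/dataset.csv` are not shown in this notebook.

```python
import pandas as pd


def positional_rename_map(old_names, new_names):
    """Pair existing column names with new ones by position."""
    old_names, new_names = list(old_names), list(new_names)
    if len(old_names) != len(new_names):
        raise ValueError(
            f"Expected {len(new_names)} columns, found {len(old_names)}: {old_names}"
        )
    return dict(zip(old_names, new_names))


# Toy frame with made-up original names, for illustration only
toy = pd.DataFrame({"age_raw": [35, 20], "smoking_raw": [1, 0]})
mapping = positional_rename_map(toy.columns, ["Age", "Smoking"])
print(mapping)  # review the pairs before applying them
toy = toy.rename(columns=mapping)
```

For the real data the same call would take `df.columns` and the 23 names listed in the cell above.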
