 <div style="background-color: #00008B; padding: 10px; color: white;">
<h2 style="color: #00BFFF; font-weight: bold;">Problem statement</h2>
<h3>
<li>We work with a dataset of health metrics from heart disease patients, including age, blood pressure, heart rate, and more. Our goal is to develop a predictive model that accurately identifies individuals with heart disease. Because a missed positive diagnosis has grave implications and a poor prognosis, our primary emphasis is on ensuring that the model identifies all potential patients, which makes recall for the positive class the crucial metric.</li>
</h3>
</div>

 <div style="background-color: #00008B; padding: 10px; color: white;">
<h2 style="color: #00BFFF; font-weight: bold;">Objective</h2>
<h3>
* Build a machine learning model to flag potential heart disease patients for closer attention

<div style="background-color: #00008B; padding: 10px; color: white;">
<h1 style="color: #00BFFF; font-weight: bold;">Approach</h1>
<h3>

1. Check the data and structure it into standardized forms by building a data pipeline

2. Explore the datasets through simple descriptive data analysis, aiming to increase domain knowledge
- Remove irrelevant features
- Address missing values
- Treat outliers
- Encode categorical variables
- Transform skewed features to achieve normal-like distributions
- Confirm that the data is normalized; if not, repeat
  
4. Exploratory data analysis proper

5. Package the data for modelling in CSV files

6. Report findings: for the tech team, document properly on Git and similar resources; for non-technical staff, use Power BI, Tableau, etc.

<div style="background-color: #00008B; padding: 10px; color: white; border-radius:15px 50px;">
<h2 style="color: white; font-weight: bold;">1. Check the data and structure it into standardized forms by building a data pipeline




In [1]:
import pandas as pd
import numpy as np
import kagglehub
import os



<div style="background-color: #00008B; padding: 10px; color: white;">
<h1 style="color: #00BFFF; font-weight: bold;">Initial observations</h1>
<h3>

- The data is space separated in the Kaggle preview and description, so I need to specify a separator when reading it
    
- Since we intend to structure raw data from four different databases by mapping the standard 14 columns out of the 76 columns used in research, and the mapping exists as pairs of source: col_index, a dict serves well to hold that info

- Also, since we are loading four databases, a function would reduce the redundancy

- We know from the documentation that the col_indices are the same (0 to 75), so df.shape[1] should return 76; otherwise there is either a missing or a duplicated column, and the loader should raise an error
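
The observations above can be sketched roughly as follows. This is only a minimal illustration, not the loader used later in the notebook: `COLUMN_MAP`, `check_width` and `select_columns` are hypothetical names, and the map holds just two illustrative entries (age at index 2 and sex at index 3, which is an assumption to verify against the dataset documentation); the remaining standard columns would be filled in from that documentation.

```python
import pandas as pd

EXPECTED_WIDTH = 76  # raw column count stated in the dataset documentation

# Hypothetical, partial name -> col_index map (assumption: age=2, sex=3).
COLUMN_MAP = {"age": 2, "sex": 3}


def check_width(df, source, expected=EXPECTED_WIDTH):
    """Raise if a raw frame does not have the expected number of columns."""
    if df.shape[1] != expected:
        raise ValueError(
            f"{source}: expected {expected} columns, found {df.shape[1]}"
        )
    return df


def select_columns(df, column_map=COLUMN_MAP):
    """Keep only the mapped positions and give them readable names."""
    subset = df.iloc[:, list(column_map.values())].copy()
    subset.columns = list(column_map.keys())
    return subset


# Toy frame with 76 columns standing in for one raw database.
demo = pd.DataFrame([list(range(76)), list(range(100, 176))])
picked = select_columns(check_width(demo, "demo"))
print(picked)
```

Keeping the width check separate from the column selection means the same two helpers can be reused for each of the four sources.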

## [Kaggle data source](https://www.kaggle.com/datasets/abdelazizsami/heart-disease)

In [2]:
path = kagglehub.dataset_download("abdelazizsami/heart-disease")
print(os.listdir(path))



In [3]:
# test read of the raw data (note: the file loaded here is switzerland.data)
df_hungarian = pd.read_csv(f'{path}/switzerland.data', sep=' ')
df_hungarian

Unnamed: 0,3001,0,65,1,1.1,1.2,1.3
-9.0,4,115.0,0.0,0,-9.0,-9.0,-9.0
0.0,-9,-9.0,0.0,1,9.0,85.0,0.0
1.0,1,0.0,1.0,12,8.3,-9.0,100.0
93.0,56,185.0,80.0,115,70.0,1.0,0.0
0.0,2,-9.0,-9.0,-9,-9.0,-9.0,-9.0
...,...,...,...,...,...,...,...
0.0,-9,-9.0,-9.0,-9,-9.0,-9.0,-9.0
-9.0,-9,-9.0,7.0,1,0.0,-9.0,7.0
4.0,85,1.0,1.0,1,1.0,1.0,1.0
1.0,1,1.0,2.0,1,1.0,1.0,1.0


### The test read of the data shows a wrong shape; I expected 76 columns. Could it be because of irregular spaces, or irregular delimiters like tabs or commas? The data card on Kaggle says the values are space separated, so I'll try pd.read_fwf, so that a run of spaces of any width is treated as one separator, along with sep='\s+' or sep=None
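
As a small aside, here is a sketch of what a whitespace-regex separator does on an in-memory sample (the sample string is made up for illustration). It handles irregular spacing, but the number of columns still follows the physical lines of the text, so it cannot by itself recover records that span several lines. Passing `header=None` also keeps the first record from being consumed as column names.

```python
import io

import pandas as pd

# Made-up sample: two physical lines with irregular runs of spaces.
sample = "3001 0  65 1\n1   1 -9 4\n"

# sep=r"\s+" treats any run of whitespace as a single delimiter;
# header=None keeps the first line as data instead of column names.
df_demo = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None)
print(df_demo.shape)
```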

In [4]:
df_hungarian_1 = pd.read_fwf(f'{path}/switzerland.data', sep=None)
df_hungarian_1.shape
df_hungarian_1.head(5)

Unnamed: 0,3001 0 65 1 1 1 1
0,-9 4 115 0 0 -9 -9 -9
1,0 -9 -9 0 1 9 85 0
2,1 1 0 1 12 8.3 -9 100
3,93 56 185 80 115 70 1 0
4,0 2 -9 -9 -9 -9 -9 -9


### The result shows that read_fwf does the same as read_csv, except that it packs the values into one column; it does not know where the rows end, and so it mutates the data
### So the traditional read functions can't seem to tell the rows apart:
- I could count off every 76 values to get each row, but that would also mutate the data if even one feature is missing from a row
- Since the file reads as one large block, I could try to build a custom read function that extracts the lines manually

In [5]:
with open(f'{path}/switzerland.data') as file:
    lines = file.readlines()
# Split each physical line on whitespace
data = [line.split() for line in lines]
df_raw = pd.DataFrame(data)
df_raw.shape

(1230, 8)

### The shape is still wrong, and I'm lowkey frustrated by the many errors
### I returned to the original data card on Kaggle and saw that the processed data, which didn't present these errors, had a proper line structure, while the unprocessed data was effectively one large string. So the problem was the line/row breaks, but there is HOPE:
### I saw that the data engineer had put in a dummy string (name) to maintain the anonymity of the records, and I instantly remembered that nonsense codons in DNA act as punctuation; so I could do the same here and use "name" as the delimiter without worrying about mutations

- This approach is also efficient: like a sorting algorithm that works on slices, it handles one record at a time instead of the whole bulk
- So I need to read the file as one large string and use the split() method to get a list of record strings, each of which is then split into its values
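
To make the contrast concrete, here is a toy sketch (not the notebook's data) comparing fixed-count chunking with splitting on the "name" sentinel. The helper names `chunk_by_count` and `split_by_sentinel` are hypothetical, and the toy records have three values each instead of 75; the second record is deliberately missing one value.

```python
def chunk_by_count(tokens, width):
    """Cut a flat token list into consecutive chunks of `width` tokens."""
    return [tokens[i:i + width] for i in range(0, len(tokens), width)]


def split_by_sentinel(text, sentinel="name"):
    """Split text on the sentinel and tokenise each non-empty piece."""
    pieces = (piece.split() for piece in text.split(sentinel))
    return [values for values in pieces if values]


# Three toy records ending in 'name'; the second one lacks a value.
toy = "1 a b name 2 c name 3 e f name"

by_count = chunk_by_count(toy.split(), 4)
by_sentinel = split_by_sentinel(toy)

print(by_count)     # the short record swallows a value from the next one
print(by_sentinel)  # every record keeps its own values
```

With counting, one missing value shifts every later record; with the sentinel, the damage stays inside the short record, which can then be detected by its length. One caveat of this sketch: it assumes the sentinel text never occurs inside a real value.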

In [6]:
with open(f'{path}/switzerland.data') as file:
    # newline characters interrupt the records, so replace them with spaces
    raw_text = file.read().replace('\n', ' ')
raw_rows = raw_text.split('name')
print(len(raw_rows))
print(f'the length of this  string apparently counts but characters which is such a relief: {len(raw_rows[0])}')
raw_rows[0]


124
the length of this  string apparently counts but characters which is such a relief: 194


'3001 0 65 1 1 1 1 -9 4 115 0 0 -9 -9 -9 0 -9 -9 0 1 9 85 0 1 1 0 1 12 8.3 -9 100 93 56 185 80 115 70 1 0 0 2 -9 -9 -9 -9 -9 -9 -9 -9 -9 7 -9 -9 -9 1 11 85 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 75 -9. '

In [7]:
clean_data = []
error_rows = []
for i, row in enumerate(raw_rows):
    values = row.split() 
    
    if not values:  # skip the empty piece after the last 'name', which yields an empty list
        continue
    
    values.append('name')  # add back the 'name' token removed by split, preserving the original data
    
    if len(values) == 76:
        clean_data.append(values)
    else:
        error_rows.append({
            "patient_index": i,
            "found_length": len(values)
        })
df = pd.DataFrame(clean_data)

In [8]:
df.shape

(123, 76)

In [9]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,66,67,68,69,70,71,72,73,74,75
0,3001,0,65,1,1,1,1,-9,4,115,...,1,1,1,1,1,1,1,75,-9.0,name
1,3002,0,32,1,0,0,0,-9,1,95,...,1,1,1,1,1,5,1,63,-9.0,name
2,3003,0,61,1,1,1,1,-9,4,105,...,2,1,1,1,1,1,1,67,-9.0,name
3,3004,0,50,1,1,1,1,-9,4,145,...,1,1,1,1,1,5,4,36,-9.0,name
4,3005,0,57,1,1,1,1,-9,4,110,...,2,1,1,1,1,1,1,60,-9.0,name


In [10]:
if error_rows:
    print(f"Total clean records: {len(df)}")
    print(f"Total broken records: {len(error_rows)}")
    print("Example error:", error_rows[0])
else: 
    print('no abnormal rows found')

no abnormal rows found
