##1. Importing the required modules
Import the modules used throughout the notebook: pandas and numpy for data operations, re for string manipulation, and itertools for iteration.

In [1]:
import pandas as pd
import numpy as np
import re
import itertools

##2. Connecting Google Drive
Mount Google Drive in Google Colab so that the data file stored on Drive can be accessed from the notebook.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


##3. Opening the hungarian.data file with Latin1 encoding
This file contains the heart disease data to be analysed. The path to the file is stored first; when the file is opened, the Latin1 encoding is used to avoid character decoding errors.
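
As a minimal sketch of why Latin1 avoids decoding errors, the snippet below decodes a hypothetical byte string (it is not taken from hungarian.data) with both UTF-8 and Latin-1. Latin-1 assigns a character to every possible byte value, so decoding cannot raise an error, although it can silently produce the wrong characters if the file was actually written in another encoding.

```python
# Hypothetical bytes for illustration: "nam" followed by the single byte 0xE9.
raw = bytes([0x6E, 0x61, 0x6D, 0xE9])

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # In UTF-8, 0xE9 starts a multi-byte sequence that is never completed here.
    utf8_ok = False

# Latin-1 maps each of the 256 byte values to exactly one character.
latin1_text = raw.decode("latin-1")

print(utf8_ok, latin1_text)
```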

In [3]:
dir = '/content/drive/My Drive/BK/hungarian.data'

In [4]:
# dir holds the path to the data file (note that the name shadows Python's built-in dir())
# Alternative for a copy of the file in the working directory:
# dir = 'hungarian.data'
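
The two path assignments above can be folded into one helper. This is only a sketch: `resolve_data_path` and `data_path` are hypothetical names that do not appear in the notebook, and the fallback to a local copy is an assumption about how the commented-out line is meant to be used.

```python
from pathlib import Path

def resolve_data_path(drive_path, local_path="hungarian.data"):
    """Return drive_path if it exists, otherwise fall back to local_path."""
    candidate = Path(drive_path)
    return str(candidate) if candidate.exists() else local_path

# On Colab with Drive mounted this yields the Drive path; elsewhere, the local name.
data_path = resolve_data_path("/content/drive/My Drive/BK/hungarian.data")
print(data_path)
```

Using a name such as `data_path` also avoids shadowing the built-in `dir()`.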

##4. Reading the data that will form the DataFrame
A DataFrame is a tabular data structure that makes data analysis and visualisation easier. In the raw file each record is spread over 10 lines, so the lines are read first and then have to be joined into one row per record.

In [5]:
with open(dir, encoding='Latin1') as file:
  lines = [line.strip() for line in file]

lines[0:10]

['1254 0 40 1 1 0 0',
 '-9 2 140 0 289 -9 -9 -9',
 '0 -9 -9 0 12 16 84 0',
 '0 0 0 0 150 18 -9 7',
 '172 86 200 110 140 86 0 0',
 '0 -9 26 20 -9 -9 -9 -9',
 '-9 -9 -9 -9 -9 -9 -9 12',
 '20 84 0 -9 -9 -9 -9 -9',
 '-9 -9 -9 -9 -9 1 1 1',
 '1 1 -9. -9. name']
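
To illustrate how one record is assembled, the sketch below joins the 10 lines shown in the output above and splits the result on whitespace. The variable names `sample_lines` and `record` are introduced here for illustration only.

```python
# The first 10 lines of the file, copied from the output above.
sample_lines = [
    '1254 0 40 1 1 0 0',
    '-9 2 140 0 289 -9 -9 -9',
    '0 -9 -9 0 12 16 84 0',
    '0 0 0 0 150 18 -9 7',
    '172 86 200 110 140 86 0 0',
    '0 -9 26 20 -9 -9 -9 -9',
    '-9 -9 -9 -9 -9 -9 -9 12',
    '20 84 0 -9 -9 -9 -9 -9',
    '-9 -9 -9 -9 -9 1 1 1',
    '1 1 -9. -9. name',
]

# Join the lines with spaces, then split on whitespace to get the fields.
record = ' '.join(sample_lines).split()

print(len(record), record[0], record[-1])
```

The 10 lines contribute 7 + 8 × 8 + 5 = 76 fields, which is the record length the notebook checks for in the next step.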

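##5. Building the DataFrame and identifying irrelevant columns
Every 10 lines are joined into one record of 76 fields, and the records are loaded into a DataFrame. The first column (which appears to be a record identifier) and the last column (the literal string 'name') carry no useful information, so they are dropped in the next step.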

In [6]:
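# Join every 10 lines into one record and split it into fields.
# takewhile stops at the first record that does not have exactly 76 fields.
data = itertools.takewhile(
    lambda x: len(x) == 76,
    (' '.join(lines[i:(i+10)]).split() for i in range(0, len(lines), 10))
)

df = pd.DataFrame.from_records(data)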
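
The behaviour of `itertools.takewhile` here is worth spelling out: it stops at the first record of the wrong length and discards everything after it, unlike a filter, which would skip only the bad record. The sketch below shows this on toy input with 2 lines and 3 fields per record; `chunk_records` and the toy lists are hypothetical and not part of the notebook.

```python
import itertools

def chunk_records(lines, lines_per_record, fields_per_record):
    """Join consecutive lines into records; stop at the first malformed one."""
    chunks = (
        ' '.join(lines[i:i + lines_per_record]).split()
        for i in range(0, len(lines), lines_per_record)
    )
    return list(itertools.takewhile(lambda rec: len(rec) == fields_per_record, chunks))

# A trailing incomplete record is dropped.
trailing = chunk_records(['1 2', '3', '4 5', '6', '7'], 2, 3)

# A malformed record in the middle also ends the stream:
# the valid third record ['6', '7', '8'] is never reached.
broken_middle = chunk_records(['1 2', '3', '4', '5', '6 7', '8'], 2, 3)

print(trailing)
print(broken_middle)
```

For a file that may contain a malformed record before the end, comparing the number of rows obtained with the expected record count is a quick sanity check.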

In [7]:
df.tail()

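| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 289 | 1053 | 0 | 48 | 0 | 0 | 0 | 0 | -9 | 2 | -9 | ... | -9 | -9 | 1 | 1 | 1 | 1 | 1 | -9.0 | -9.0 | name |
| 290 | 1054 | 0 | 36 | 1 | 1 | 0 | 0 | -9 | 2 | 120 | ... | -9 | -9 | 1 | 1 | 1 | 1 | 1 | -9.0 | -9.0 | name |
| 291 | 5001 | 0 | 48 | 1 | 0 | 0 | 0 | -9 | 3 | 110 | ... | -9 | -9 | 1 | 1 | 1 | 1 | 1 | -9.0 | -9.0 | name |
| 292 | 5000 | 0 | 47 | 0 | 0 | 0 | 0 | -9 | 2 | 140 | ... | -9 | -9 | 1 | 1 | 1 | 1 | 1 | -9.0 | -9.0 | name |
| 293 | 5002 | 0 | 53 | 1 | 1 | 1 | 1 | -9 | 4 | 130 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | -9.0 | -9.0 | name |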


In [8]:
df.dtypes

0     object
1     object
2     object
3     object
4     object
       ...  
71    object
72    object
73    object
74    object
75    object
Length: 76, dtype: object

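##6. Converting the object dtype to float
Columns of dtype object (here, strings) cannot be used for calculations or modelling, so they need to be converted to a numeric type, namely float. The first and last columns are dropped first; the last column holds the text 'name', which cannot be converted to a number.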

In [9]:
df = df.iloc[:, :-1]                 # drop the last column
df = df.drop(df.columns[0], axis=1)  # drop the first column

In [10]:
df = df.astype(float)
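
The order of these steps matters. As a small sketch on toy data (the second record and the names `toy`, `trimmed`, and `numeric` are made up for illustration), converting before dropping the text column fails, while converting afterwards succeeds, including for values written as '-9.'.

```python
import pandas as pd

# Toy string records shaped like the parsed data: an ID, two numeric fields, and the text 'name'.
toy = pd.DataFrame([['1254', '40', '-9.', 'name'],
                    ['1255', '49', '1', 'name']])

try:
    toy.astype(float)
    converts_as_is = True
except (ValueError, TypeError):
    # 'name' cannot be parsed as a number.
    converts_as_is = False

# Same order as the notebook: drop the last column, then the first, then convert.
trimmed = toy.iloc[:, :-1]
trimmed = trimmed.drop(trimmed.columns[0], axis=1)
numeric = trimmed.astype(float)

print(converts_as_is)
print(numeric)
```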

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294 entries, 0 to 293
Data columns (total 74 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   1       294 non-null    float64
 1   2       294 non-null    float64
 2   3       294 non-null    float64
 3   4       294 non-null    float64
 4   5       294 non-null    float64
 5   6       294 non-null    float64
 6   7       294 non-null    float64
 7   8       294 non-null    float64
 8   9       294 non-null    float64
 9   10      294 non-null    float64
 10  11      294 non-null    float64
 11  12      294 non-null    float64
 12  13      294 non-null    float64
 13  14      294 non-null    float64
 14  15      294 non-null    float64
 15  16      294 non-null    float64
 16  17      294 non-null    float64
 17  18      294 non-null    float64
 18  19      294 non-null    float64
 19  20      294 non-null    float64
 20  21      294 non-null    float64
 21  22      294 non-null    float64
 22  23

##7. Replacing -9.0 with NaN
In this dataset the value -9.0 marks missing or unknown data, so it is replaced with NaN (Not a Number), which pandas treats as a missing value in later processing.

In [12]:
df.replace(-9.0, np.nan, inplace=True)

In [13]:
df.isnull().sum()

1       0
2       0
3       0
4       0
5       0
     ... 
70      0
71      0
72      0
73    266
74    294
Length: 74, dtype: int64
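
The same replace-and-count pattern can be tried on a toy frame. This sketch uses made-up values and hypothetical names (`toy`, `cleaned`, `fully_missing`), and it returns a new DataFrame instead of using `inplace=True`, which is a stylistic choice and not what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy data: -9.0 is the missing-value sentinel.
toy = pd.DataFrame({'a': [1.0, -9.0, 3.0], 'b': [-9.0, -9.0, -9.0]})

# Replace the sentinel with NaN; toy itself is left unchanged.
cleaned = toy.replace(-9.0, np.nan)

# Count missing values per column.
missing_per_column = cleaned.isnull().sum()

# Columns in which every value is missing, like column 74 in the output above.
fully_missing = missing_per_column[missing_per_column == len(cleaned)].index.tolist()

print(missing_per_column)
print(fully_missing)
```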

In [14]:
df.head()

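| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 40.0 | 1.0 | 1.0 | 0.0 | 0.0 | NaN | 2.0 | 140.0 | 0.0 | ... | NaN | NaN | NaN | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | NaN | NaN |
| 1 | 0.0 | 49.0 | 0.0 | 1.0 | 0.0 | 0.0 | NaN | 3.0 | 160.0 | 1.0 | ... | NaN | NaN | NaN | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | NaN | NaN |
| 2 | 0.0 | 37.0 | 1.0 | 1.0 | 0.0 | 0.0 | NaN | 2.0 | 130.0 | 0.0 | ... | NaN | NaN | NaN | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | NaN | NaN |
| 3 | 0.0 | 48.0 | 0.0 | 1.0 | 1.0 | 1.0 | NaN | 4.0 | 138.0 | 0.0 | ... | NaN | 2.0 | NaN | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | NaN | NaN |
| 4 | 0.0 | 54.0 | 1.0 | 1.0 | 0.0 | 1.0 | NaN | 3.0 | 150.0 | 0.0 | ... | NaN | 1.0 | NaN | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | NaN | NaN |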


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294 entries, 0 to 293
Data columns (total 74 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   1       294 non-null    float64
 1   2       294 non-null    float64
 2   3       294 non-null    float64
 3   4       294 non-null    float64
 4   5       294 non-null    float64
 5   6       294 non-null    float64
 6   7       0 non-null      float64
 7   8       294 non-null    float64
 8   9       293 non-null    float64
 9   10      293 non-null    float64
 10  11      271 non-null    float64
 11  12      12 non-null     float64
 12  13      1 non-null      float64
 13  14      0 non-null      float64
 14  15      286 non-null    float64
 15  16      21 non-null     float64
 16  17      1 non-null      float64
 17  18      293 non-null    float64
 18  19      294 non-null    float64
 19  20      294 non-null    float64
 20  21      294 non-null    float64
 21  22      293 non-null    float64
 22  23

##8. Selecting 14 features

Features are the variables used for analysis or modelling. Of the 74 columns, only the 14 considered relevant are selected. Most of them are nearly complete, but the columns labelled 40, 43, and 50 still contain many missing values (104, 4, and 28 non-null entries out of 294).

In [16]:
df_selected = df.iloc[:, [1,2,7,8,10,14,17,30,36,38,39,42,49,56]]
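
Note that `iloc` selects by position, not by column label. Because the first column was dropped earlier, the remaining labels run from 1 to 74, so position 1 corresponds to label 2, and so on. The sketch below reproduces this on a stand-in frame (`toy` is hypothetical; only its column labels mirror `df`).

```python
import pandas as pd

# Stand-in frame: one row, columns labelled 1..74 like df after the first column is dropped.
toy = pd.DataFrame([list(range(1, 75))], columns=range(1, 75))

# The same positions as in the cell above.
positions = [1, 2, 7, 8, 10, 14, 17, 30, 36, 38, 39, 42, 49, 56]
by_position = toy.iloc[:, positions]

# Each selected label is its position plus one.
selected_labels = list(by_position.columns)

# Selecting by label with loc gives different columns for the same numbers.
by_label = toy.loc[:, [1, 2]]

print(selected_labels)
print(list(by_label.columns))
```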

In [17]:
df_selected.head()

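| | 2 | 3 | 8 | 9 | 11 | 15 | 18 | 31 | 37 | 39 | 40 | 43 | 50 | 57 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40.0 | 1.0 | 2.0 | 140.0 | 289.0 | 0.0 | 0.0 | 172.0 | 0.0 | 0.0 | NaN | NaN | NaN | 0.0 |
| 1 | 49.0 | 0.0 | 3.0 | 160.0 | 180.0 | 0.0 | 0.0 | 156.0 | 0.0 | 1.0 | 2.0 | NaN | NaN | 1.0 |
| 2 | 37.0 | 1.0 | 2.0 | 130.0 | 283.0 | 0.0 | 1.0 | 98.0 | 0.0 | 0.0 | NaN | NaN | NaN | 0.0 |
| 3 | 48.0 | 0.0 | 4.0 | 138.0 | 214.0 | 0.0 | 0.0 | 108.0 | 1.0 | 1.5 | 2.0 | NaN | NaN | 3.0 |
| 4 | 54.0 | 1.0 | 3.0 | 150.0 | NaN | 0.0 | 0.0 | 122.0 | 0.0 | 0.0 | NaN | NaN | NaN | 0.0 |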


In [18]:
df_selected.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294 entries, 0 to 293
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   2       294 non-null    float64
 1   3       294 non-null    float64
 2   8       294 non-null    float64
 3   9       293 non-null    float64
 4   11      271 non-null    float64
 5   15      286 non-null    float64
 6   18      293 non-null    float64
 7   31      293 non-null    float64
 8   37      293 non-null    float64
 9   39      294 non-null    float64
 10  40      104 non-null    float64
 11  43      4 non-null      float64
 12  50      28 non-null     float64
 13  57      294 non-null    float64
dtypes: float64(14)
memory usage: 32.3 KB


In [19]:
# Map the numeric column labels to descriptive feature names
column_mapping = {
    2: 'age',
    3: 'sex',
    8: 'cp',
    9: 'trestbps',
    11: 'chol',
    15: 'fbs',
    18: 'restecg',
    31: 'thalach',
    37: 'exang',
    39: 'oldpeak',
    40: 'slope',
    43: 'ca',
    50: 'thal',
    57: 'target',
}
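
A mapping like this is typically passed to `DataFrame.rename`. The sketch below assumes that is the intended use; it works on a toy frame with a three-entry subset of the mapping, and `toy_selected`, `mapping_subset`, and `renamed` are hypothetical names.

```python
import pandas as pd

# Subset of the mapping above; the keys are integers because the column labels are integers.
mapping_subset = {2: 'age', 3: 'sex', 57: 'target'}

# Toy frame with the same integer labels (values taken from the first row shown earlier).
toy_selected = pd.DataFrame([[40.0, 1.0, 0.0]], columns=[2, 3, 57])

# rename returns a new DataFrame; labels missing from the mapping would be left unchanged.
renamed = toy_selected.rename(columns=mapping_subset)

print(list(renamed.columns))
```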
