## Load the data

### Subtask:
Load the data from the file `pima-indians-diabetes.csv` into a DataFrame.


**Reasoning**:
The first step is to import the pandas library and load the data from the specified CSV file into a DataFrame as instructed.



In [1]:
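import pandas as pd

# Note: read_csv treats the first row as a header by default. This file has no
# header row, so its first record (6, 148, 72, ...) ends up as the column labels.
df = pd.read_csv('pima-indians-diabetes.csv')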
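
Because the CSV has no header row, `read_csv` with its default settings consumes the first record as column labels. A minimal sketch of one way to avoid that, assuming the column order used later in this notebook. The helper `load_pima` and the two-row in-memory sample are hypothetical and only illustrate the call:

```python
import io

import pandas as pd

# Column names as assigned in the preprocessing step of this notebook.
COLUMN_NAMES = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']


def load_pima(source):
    """Hypothetical helper: read a header-less CSV and label its columns."""
    # header=None keeps the first record as data; names= supplies the labels.
    return pd.read_csv(source, header=None, names=COLUMN_NAMES)


# Two-row in-memory sample standing in for the real file (illustrative values).
sample = io.StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
)
df_sample = load_pima(sample)
print(df_sample.shape)  # (2, 9)
```

With the real file, `load_pima('pima-indians-diabetes.csv')` would keep the first record as a data row instead of turning it into a header.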

## Explore the data

### Subtask:
Display basic information about the data, including the columns, data types, and missing values.


**Reasoning**:
Call `df.info()` to list the column names, data types, and non-null counts, then count the missing values per column with `df.isnull().sum()` to understand the data structure and spot missing data.



In [2]:
df.info()
print(df.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 767 entries, 0 to 766
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   6       767 non-null    int64  
 1   148     767 non-null    int64  
 2   72      767 non-null    int64  
 3   35      767 non-null    int64  
 4   0       767 non-null    int64  
 5   33.6    767 non-null    float64
 6   0.627   767 non-null    float64
 7   50      767 non-null    int64  
 8   1       767 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
6        0
148      0
72       0
35       0
0        0
33.6     0
0.627    0
50       0
1        0
dtype: int64
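
`isnull()` reports no missing values, but a zero in a physiological measurement might stand in for a missing reading. A rough sketch of how such zeros could be counted, using the feature names that the preprocessing step assigns. The three demo rows are made up for illustration and are not taken from the file:

```python
import pandas as pd

# Illustrative rows in the dataset's layout; not the real data.
df_demo = pd.DataFrame({
    'Glucose': [148, 85, 0],
    'BloodPressure': [72, 0, 64],
    'SkinThickness': [35, 29, 0],
    'Insulin': [0, 0, 94],
    'BMI': [33.6, 26.6, 0.0],
})

# Comparing to 0 gives a boolean frame; summing counts the True cells per column.
zero_counts = (df_demo == 0).sum()
print(zero_counts)
```

Running the same comparison on the real DataFrame, restricted to these five columns, would show how many values each column is likely missing.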


## Preprocess the data

### Subtask:
Perform the necessary preprocessing steps, such as handling missing values, converting data types, or standardizing the data.


**Reasoning**:
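Assign meaningful column names to the DataFrame. The file has no header row, so `read_csv` used the first record as column labels, which is why `df.info()` lists columns such as `6` and `148` and why that record is not among the 767 rows. Then, in the columns where a `0` might represent a missing value, replace each `0` with the column median.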



In [3]:
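# Label the columns; the file itself carries no header row.
column_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
                'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
df.columns = column_names

# Treat a zero in these columns as a missing reading and replace it with the
# column median (the median is computed with the zeros still included).
cols_with_zeros = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
for col in cols_with_zeros:
    df[col] = df[col].replace(0, df[col].median())

display(df.head())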

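| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 85 | 66 | 29 | 32 | 26.6 | 0.351 | 31 | 0 |
| 1 | 8 | 183 | 64 | 23 | 32 | 23.3 | 0.672 | 32 | 1 |
| 2 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 3 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 4 | 5 | 116 | 74 | 23 | 32 | 25.6 | 0.201 | 30 | 0 |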
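
The loop above computes each median with the zeros still in the column, which pulls the median down. One possible alternative, shown here only as a sketch on made-up values, is to mask the zeros as `NaN` first so the median reflects observed readings only:

```python
import numpy as np
import pandas as pd

# Made-up Insulin readings; 0 marks a missing measurement.
demo = pd.DataFrame({'Insulin': [0, 0, 94, 168, 0, 88]})

# Median with the zeros included, as in the loop above.
median_with_zeros = demo['Insulin'].median()

# Mask zeros as NaN; Series.median() skips NaN by default.
masked = demo['Insulin'].replace(0, np.nan)
median_without_zeros = masked.median()

# Fill the masked positions with the median of the observed values.
demo['Insulin_filled'] = masked.fillna(median_without_zeros)

print(median_with_zeros, median_without_zeros)  # 44.0 94.0
```

Which variant is appropriate depends on the analysis. The notebook's own output was produced with the zeros included.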


**Reasoning**:
Check the data types of the columns and apply StandardScaler to the numerical features, excluding the 'Outcome' column.



In [4]:
# df.info() prints its summary and returns None, so it needs no display() wrapper.
df.info()

from sklearn.preprocessing import StandardScaler

# Standardize every feature column; 'Outcome' is the target and stays unscaled.
numerical_cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
                  'BMI', 'DiabetesPedigreeFunction', 'Age']

scaler = StandardScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])

display(df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 767 entries, 0 to 766
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               767 non-null    int64  
 1   Glucose                   767 non-null    int64  
 2   BloodPressure             767 non-null    int64  
 3   SkinThickness             767 non-null    int64  
 4   Insulin                   767 non-null    int64  
 5   BMI                       767 non-null    float64
 6   DiabetesPedigreeFunction  767 non-null    float64
 7   Age                       767 non-null    int64  
 8   Outcome                   767 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.843726 | -1.203741 | -0.528017 | 0.181613 | -0.603974 | -0.850794 | -0.364265 | -0.18894 | 0 |
| 1 | 1.23424 | 2.017463 | -0.693352 | -0.468803 | -0.603974 | -1.330777 | 0.604701 | -0.103795 | 1 |
| 2 | -0.843726 | -1.072264 | -0.528017 | -0.468803 | -0.013946 | -0.63262 | -0.919684 | -1.040393 | 0 |
| 3 | -1.140579 | 0.505469 | -2.677375 | 0.83203 | 0.69028 | 1.549122 | 5.482732 | -0.01865 | 1 |
| 4 | 0.343683 | -0.184789 | 0.133324 | -0.468803 | -0.603974 | -0.996243 | -0.817052 | -0.274086 | 0 |
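
`StandardScaler` rescales each column to zero mean and unit variance, z = (x - mean) / std, using the population standard deviation (`ddof=0`). The same arithmetic can be sketched in plain pandas. The `Age` values below are the five shown in the earlier preview and serve only as a small demo:

```python
import pandas as pd

# Five Age values from the preview table, used purely for illustration.
demo = pd.DataFrame({'Age': [31.0, 32.0, 21.0, 33.0, 30.0]})

# Subtract the column mean and divide by the population standard deviation.
# pandas defaults to ddof=1, so ddof=0 is passed explicitly here.
z = (demo - demo.mean()) / demo.std(ddof=0)

print(z['Age'].mean())       # approximately 0
print(z['Age'].std(ddof=0))  # approximately 1
```

The resulting numbers differ from the scaled `Age` column in the notebook because the mean and standard deviation here come from five rows rather than all 767.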


## Save the processed data

### Subtask:
Save the preprocessed DataFrame to a new file.


**Reasoning**:
Save the preprocessed DataFrame to a new CSV file without the index.



In [5]:
df.to_csv('pima-indians-diabetes_preprocessed.csv', index=False)
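
To confirm that a file was written as intended, one option is to read it back and compare it with the original frame. A small sketch using a temporary directory and a hypothetical two-row frame, so that it does not touch the real output file:

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame standing in for the preprocessed DataFrame.
demo = pd.DataFrame({'BMI': [-0.850794, 1.549122], 'Outcome': [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo_preprocessed.csv')  # hypothetical file name
    # index=False keeps the row index out of the file, as in the cell above.
    demo.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Raises an AssertionError if columns, dtypes, or values differ beyond tolerance.
pd.testing.assert_frame_equal(demo, reloaded)
print(list(reloaded.columns))  # ['BMI', 'Outcome']
```

Because the index was not written, the reloaded frame has no extra `Unnamed: 0` column.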