# **1. Dataset Introduction**


The dataset used is the Titanic dataset from the Kaggle competition "Titanic: Machine Learning from Disaster". It is available at: https://www.kaggle.com/competitions/titanic/data

# **2. Import Libraries**

In [35]:
import pandas as pd
import numpy as np
from google.colab import files
from sklearn.preprocessing import LabelEncoder, StandardScaler
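
The `google.colab` module exists only inside Google Colab, so the import above raises `ImportError` in a local Jupyter session. The following is a minimal sketch of a guarded import, assuming the notebook may also be run outside Colab. `download_if_colab` is a hypothetical helper name introduced here for illustration; it is not part of any library.

```python
try:
    from google.colab import files  # only available inside Google Colab
except ImportError:
    files = None  # running elsewhere, e.g. a local Jupyter session


def download_if_colab(path):
    """Request a browser download in Colab; otherwise leave the file on disk.

    Returns True if a download was requested, False otherwise.
    """
    if files is None:
        print(f"Not running in Colab; '{path}' was left on disk.")
        return False
    files.download(path)
    return True
```

With this guard in place, the rest of the notebook can call `download_if_colab(file_name)` instead of `files.download(file_name)` and still run locally.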

# **3. Loading the Dataset**

In [36]:
df = pd.read_csv("train.csv")

# Show the first few rows of the dataset
df.head()

# Structural information about the dataset
df.info()

# Size of the dataset (rows, columns)
df.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


(891, 12)
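
The `df.info()` output reports non-null counts, so the number of missing values has to be read off as 891 minus each count: 177 for `Age`, 687 for `Cabin` (about 77% of the rows), and 2 for `Embarked`. A small helper can make this explicit. The sketch below is one possible way to do it; `missing_summary` is a hypothetical name, and the toy frame is made up for illustration and does not contain rows from `train.csv`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Return missing-value counts and percentages per column, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


# Toy frame for illustration only; these are not rows from train.csv.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})
print(missing_summary(toy))
```

Calling `missing_summary(df)` on the loaded dataset would give the same information as the non-null counts above, expressed directly as missing counts and percentages.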

# **4. Exploratory Data Analysis (EDA)**


In [37]:
df.info()

df.describe()

df['Survived'].value_counts()

df['Sex'].value_counts()

df['Pclass'].value_counts()

df.corr(numeric_only=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| PassengerId | 1.0 | -0.005007 | -0.035144 | 0.036847 | -0.057527 | -0.001652 | 0.012658 |
| Survived | -0.005007 | 1.0 | -0.338481 | -0.077221 | -0.035322 | 0.081629 | 0.257307 |
| Pclass | -0.035144 | -0.338481 | 1.0 | -0.369226 | 0.083081 | 0.018443 | -0.5495 |
| Age | 0.036847 | -0.077221 | -0.369226 | 1.0 | -0.308247 | -0.189119 | 0.096067 |
| SibSp | -0.057527 | -0.035322 | 0.083081 | -0.308247 | 1.0 | 0.414838 | 0.159651 |
| Parch | -0.001652 | 0.081629 | 0.018443 | -0.189119 | 0.414838 | 1.0 | 0.216225 |
| Fare | 0.012658 | 0.257307 | -0.5495 | 0.096067 | 0.159651 | 0.216225 | 1.0 |
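
With `numeric_only=True`, `df.corr` skips `Sex` and the other object columns, so the correlation matrix says nothing about them. Among the numeric columns, `Pclass` (-0.338) and `Fare` (0.257) show the strongest linear association with `Survived`. Grouped survival rates are one way to look at categorical columns as well. The sketch below illustrates both ideas on a made-up toy frame (not rows from `train.csv`); the variable names are assumptions chosen for this example.

```python
import pandas as pd

# Toy data for illustration only; not rows from train.csv.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
    "Pclass":   [3, 1, 3, 1, 3, 3, 2, 2],
    "Sex":      ["male", "female", "female", "female",
                 "male", "male", "female", "male"],
    "Fare":     [7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 30.07, 13.00],
})

# Survival rate per group: the mean of a 0/1 column is the share of ones.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

# Correlation of each numeric column with the target, strongest first.
corr_with_target = (
    toy.corr(numeric_only=True)["Survived"]
    .drop("Survived")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)

print(rate_by_sex, rate_by_class, corr_with_target, sep="\n\n")
```

On the real data, the same two `groupby` lines applied to `df` would report survival rates by `Sex` and by `Pclass`.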


# **5. Data Preprocessing**

In [38]:
# Record missing values and duplicate rows before cleaning
missing_values = df.isnull().sum()
duplicate_count = df.duplicated().sum()

# Fill missing Age values with the median
df['Age'] = df['Age'].fillna(df['Age'].median())

# Fill missing Embarked values with the mode
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Drop columns that are not used as features
df = df.drop(columns=['Name', 'Ticket', 'Cabin'])

# Label-encode Sex (classes are sorted, so female=0 and male=1)
le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])

# One-hot encode Embarked; drop_first=True drops the first category (C)
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)

# Standardize the numeric features (zero mean, unit variance)
scaler = StandardScaler()

fitur_numerik = ['Age', 'Fare', 'SibSp', 'Parch']
X_scaled = scaler.fit_transform(df[fitur_numerik])

df_scaled = pd.DataFrame(
    X_scaled,
    columns=fitur_numerik,
    index=df.index  # keep row labels aligned with df
)

df_preprocessed = df.copy()
df_preprocessed[fitur_numerik] = df_scaled

# Save the preprocessed data to CSV
file_name = 'titanic_preprocessed.csv'
df_preprocessed.to_csv(file_name, index=False)

print("Missing Values per Column:")
display(missing_values)

print("\nNumber of Duplicate Rows:")
print(duplicate_count)

print("\nPreprocessed Dataset (first 5 rows):")
display(df_preprocessed.head())

print(f"\nFile '{file_name}' was created and is ready to download.")

files.download(file_name)

Missing Values per Column:

| Column | Missing |
|---|---|
| PassengerId | 0 |
| Survived | 0 |
| Pclass | 0 |
| Name | 0 |
| Sex | 0 |
| Age | 177 |
| SibSp | 0 |
| Parch | 0 |
| Ticket | 0 |
| Fare | 0 |
| Cabin | 687 |
| Embarked | 2 |

Number of Duplicate Rows:
0

Preprocessed Dataset (first 5 rows):

| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 1 | -0.565736 | 0.432793 | -0.473674 | -0.502445 | False | True |
| 1 | 2 | 1 | 1 | 0 | 0.663861 | 0.432793 | -0.473674 | 0.786845 | False | False |
| 2 | 3 | 1 | 3 | 0 | -0.258337 | -0.474545 | -0.473674 | -0.488854 | False | True |
| 3 | 4 | 1 | 1 | 0 | 0.433312 | 0.432793 | -0.473674 | 0.42073 | False | True |
| 4 | 5 | 0 | 3 | 1 | 0.433312 | -0.474545 | -0.473674 | -0.486337 | False | True |

File 'titanic_preprocessed.csv' was created and is ready to download.
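
Assigning a DataFrame to a set of columns aligns on index labels, so the frame built from the scaled array needs the same index as `df`. With the default 0 to 890 index this holds automatically, but it would stop holding if rows were filtered or dropped before scaling. The sketch below shows the difference on a made-up toy frame with a non-default index (not rows from `train.csv`); `misaligned` and `aligned` are names chosen for this example.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame with a non-default index, as left behind by filtering rows.
# Illustration only; not rows from train.csv.
toy = pd.DataFrame(
    {"Age": [22.0, 38.0, 26.0], "Fare": [7.25, 71.28, 7.92]},
    index=[3, 5, 8],
)
cols = ["Age", "Fare"]

scaled = StandardScaler().fit_transform(toy[cols])

# Without index=..., the new frame is labelled 0, 1, 2. None of these
# labels match 3, 5, 8, so every assigned cell becomes NaN.
misaligned = toy.copy()
misaligned[cols] = pd.DataFrame(scaled, columns=cols)

# Reusing the original index keeps every row matched to its scaled values.
aligned = toy.copy()
aligned[cols] = pd.DataFrame(scaled, columns=cols, index=toy.index)

print("NaN cells without index:", misaligned.isna().sum().sum())
print("NaN cells with index:   ", aligned.isna().sum().sum())
```

Passing `index=df.index` when wrapping the scaler output is a cheap safeguard even when the index is still the default one.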


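
In the preprocessing cell, the median, the mode, and the scaler statistics are all computed on the full 891 rows. If the data is later split to evaluate a model, a common alternative is to learn these statistics from the training split only and then apply them to the held-out rows. The sketch below shows one way to do that with scikit-learn's `ColumnTransformer`. It is not the notebook's method: it swaps `LabelEncoder` and `pd.get_dummies` for `OneHotEncoder`, and the toy data, the column lists, and the name `preprocess` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data for illustration only; not rows from train.csv.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan, 54.0, 2.0, 27.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86, 21.07, 11.13],
    "Sex": ["male", "female", "female", "female",
            "male", "male", "male", "female"],
    "Embarked": ["S", "C", "S", "S", np.nan, "S", "Q", "S"],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1],
})
numeric_cols = ["Age", "Fare"]
categorical_cols = ["Sex", "Embarked"]

preprocess = ColumnTransformer([
    # Median imputation followed by standardization for numeric columns.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    # Mode imputation followed by one-hot encoding for categorical columns.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

X = toy.drop(columns="Survived")
y = toy["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Median, mode, mean, and standard deviation come from the training rows only.
X_train_prep = preprocess.fit_transform(X_train)
X_test_prep = preprocess.transform(X_test)
print(X_train_prep.shape, X_test_prep.shape)
```

`handle_unknown="ignore"` lets the encoder cope with a category that appears only in the held-out rows, which can easily happen with a column such as `Embarked` on a small split.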