## Investigate the following Titanic sample data:
1. Inspect the data with head, tail, sample, and info. What observations can you make?
2. Produce a statistical summary and extract insights from what you observe.
3. Check for duplicates. How should they be handled?
4. Check for missing values. If there are any, what percentage is missing, and how should they be handled?

## Import Libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

In [2]:
# import data
df = pd.read_excel('titanic.xlsx')


# Data Inspection

In [None]:
df.head()
# Show the first 5 rows.

Unnamed: 0,survived,name,sex,age
0,1,"Allen, Miss. Elisabeth Walton",female,29.0
1,1,"Allison, Master. Hudson Trevor",male,0.9167
2,0,"Allison, Miss. Helen Loraine",female,2.0
3,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0
4,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0


In [None]:
df.tail()
# Show the last 5 rows.

Unnamed: 0,survived,name,sex,age
495,1,"Mallet, Mrs. Albert (Antoinette Magnin)",female,24.0
496,0,"Mangiavacchi, Mr. Serafino Emilio",male,
497,0,"Matthews, Mr. William John",male,30.0
498,0,"Maybery, Mr. Frank Hubert",male,40.0
499,0,"McCrae, Mr. Arthur Gordon",male,32.0


In [None]:
df.sample(2)
# Show 2 randomly selected rows.

Unnamed: 0,survived,name,sex,age
473,0,"Kirkland, Rev. Charles Leonard",male,57.0
222,0,"Ovies y Rodriguez, Mr. Servando",male,28.5


In [None]:
df.info()
# Show summary information: row count, columns, non-null counts, dtypes, and memory usage.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   survived  500 non-null    int64  
 1   name      500 non-null    object 
 2   sex       500 non-null    object 
 3   age       451 non-null    float64
dtypes: float64(1), int64(1), object(2)
memory usage: 15.8+ KB


## Observations:
1. The data has 4 columns and 500 rows.
2. The survived column is binary (0 and 1).
3. The age column has missing values: only 451 of the 500 rows are non-null.
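
The observations above can also be checked programmatically instead of being read off the printed output. The following is a minimal sketch, assuming a small stand-in frame built from a few rows printed by `head()` and `tail()`; the name `sample_df` is hypothetical, and on the real data `df` would take its place.

```python
import numpy as np
import pandas as pd

# Stand-in frame built from rows printed above; use `df` on the real data.
sample_df = pd.DataFrame({
    "survived": [1, 1, 0, 0],
    "name": [
        "Allen, Miss. Elisabeth Walton",
        "Allison, Master. Hudson Trevor",
        "Allison, Miss. Helen Loraine",
        "Mangiavacchi, Mr. Serafino Emilio",
    ],
    "sex": ["female", "male", "female", "male"],
    "age": [29.0, 0.9167, 2.0, np.nan],
})

# Number of rows and columns
n_rows, n_cols = sample_df.shape

# Distinct labels in the target column
survived_values = sorted(sample_df["survived"].unique())

# Names of columns that contain at least one missing value
cols_with_missing = sample_df.columns[sample_df.isna().any()].tolist()

print(n_rows, n_cols)
print(survived_values)
print(cols_with_missing)
```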

# Statistical Summary


In [None]:
df.columns

Index(['survived', 'name', 'sex', 'age'], dtype='object')

In [None]:
categoricals = ['name', 'sex']
numericals = ['survived', 'age']

In [None]:
df[categoricals].describe()

Unnamed: 0,name,sex
count,500,500
unique,499,2
top,"Eustis, Miss. Elizabeth Mussey",male
freq,2,288


In [None]:
df[numericals].describe()

Unnamed: 0,survived,age
count,500.0,451.0
mean,0.54,35.917775
std,0.498897,14.766454
min,0.0,0.6667
25%,0.0,24.0
50%,1.0,35.0
75%,1.0,47.0
max,1.0,80.0


## Observations:
1. name has 499 unique values across 500 rows, so one name appears twice.
2. sex has 2 unique values: male and female.
3. Males outnumber females: 288 of the 500 rows are male, leaving 212 female.
4. More than half of the passengers in this sample survived (the mean of survived is 0.54, i.e. 54%).
5. Passenger ages range from infants (minimum 0.6667 years) to the elderly (maximum 80 years), with a median of 35.
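
Two of these observations can be computed directly: category counts via `value_counts()`, and the survival rate as the mean of a 0/1 column. This is a rough sketch on a hypothetical stand-in frame `toy` that reuses the five rows shown by `head()`; on the full data the same calls would be applied to `df`.

```python
import pandas as pd

# Stand-in frame using the five rows shown by head(); use `df` on the real data.
toy = pd.DataFrame({
    "survived": [1, 1, 0, 0, 0],
    "sex": ["female", "male", "female", "male", "female"],
})

# Count of rows per category
sex_counts = toy["sex"].value_counts()

# For a 0/1 column, the mean equals the share of 1s
survival_rate = toy["survived"].mean()

print(sex_counts)
print(f"Survival rate: {survival_rate:.0%}")
```

On the full sample, the mean of 0.54 reported by `describe()` is read the same way: 54% of the rows have survived equal to 1.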

# Checking for and handling duplicate data

In [None]:
len(df.drop_duplicates()) / len(df)

# Fraction of rows left after dropping duplicates; a value below 1 means duplicates exist.

0.998

In [None]:
duplicates = df[df.duplicated(keep=False)]

In [None]:
duplicates
# Show the duplicated rows.

Unnamed: 0,survived,name,sex,age
104,1,"Eustis, Miss. Elizabeth Mussey",female,54.0
349,1,"Eustis, Miss. Elizabeth Mussey",female,54.0


In [None]:
df = df.drop_duplicates()
# Remove the duplicate rows, keeping the first occurrence.

In [None]:
len(df.drop_duplicates()) / len(df)
# Check again: the ratio should now be 1 (no duplicates left).

1.0


1. The data contained one duplicate row, which was removed with drop_duplicates().
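
The ratio check works, but the duplicates can also be counted directly. The sketch below assumes a hypothetical stand-in frame `toy` that contains the duplicated row shown above; it contrasts the default `keep="first"` with `keep=False` and renumbers the index after dropping.

```python
import pandas as pd

# Stand-in frame containing the duplicated row shown above; use `df` on the real data.
toy = pd.DataFrame({
    "survived": [1, 1, 0],
    "name": [
        "Eustis, Miss. Elizabeth Mussey",
        "Eustis, Miss. Elizabeth Mussey",
        "Matthews, Mr. William John",
    ],
    "sex": ["female", "female", "male"],
    "age": [54.0, 54.0, 30.0],
})

# Default keep="first": only the repeated copies are flagged
n_extra = int(toy.duplicated().sum())

# keep=False: every row that has an identical twin is flagged
n_involved = int(toy.duplicated(keep=False).sum())

# Drop the repeats and renumber the index 0..n-1
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_extra, n_involved, len(deduped))
```

Resetting the index is optional. Without it, the dropped row leaves a gap in the index labels, which is why a frame of 499 rows can still report labels running from 0 to 499.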

# Checking for and handling missing values

In [None]:
df.isna().sum()

Unnamed: 0,0
survived,0
name,0
sex,0
age,49


In [None]:
df.isnull().sum()

Unnamed: 0,0
survived,0
name,0
sex,0
age,49


In [None]:
total_rows = len(df)
for column in df.columns:
    missing_count = df[column].isna().sum()
    missing_percentage = (missing_count / total_rows) * 100
    print(f"Column '{column}' has {missing_count} missing values ({missing_percentage:.2f}%)")

Column 'survived' has 0 missing values (0.00%)
Column 'name' has 0 missing values (0.00%)
Column 'sex' has 0 missing values (0.00%)
Column 'age' has 49 missing values (9.82%)
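
The same percentages can be obtained without a loop, because the mean of a boolean mask is the share of `True` values. A minimal sketch, assuming a hypothetical stand-in frame `toy` built from the last four rows shown by `tail()`:

```python
import numpy as np
import pandas as pd

# Stand-in frame from the last four rows shown by tail(); use `df` on the real data.
toy = pd.DataFrame({
    "survived": [0, 0, 0, 0],
    "sex": ["male", "male", "male", "male"],
    "age": [np.nan, 30.0, 40.0, 32.0],
})

# isna() gives a True/False mask; its column-wise mean is the missing share
missing_pct = toy.isna().mean() * 100

print(missing_pct.round(2))
```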


In [None]:
df['age'].median()

35.0

In [None]:
for column in df.columns:
    if df[column].dtype == 'object':
        # Categorical column: fill with the most frequent value (mode)
        df[column] = df[column].fillna(df[column].mode()[0])
    else:
        # Numeric column: fill with the median
        df[column] = df[column].fillna(df[column].median())


In [None]:
df.isna().sum()

Unnamed: 0,0
survived,0
name,0
sex,0
age,0


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 499 entries, 0 to 499
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   survived  499 non-null    int64  
 1   name      499 non-null    object 
 2   sex       499 non-null    object 
 3   age       499 non-null    float64
dtypes: float64(1), int64(1), object(2)
memory usage: 19.5+ KB



1. Only the age column had missing values: 49 rows, or 9.82% of the 499 rows left after deduplication.
2. The missing ages were filled with the column median (35.0).
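
Since only age has missing values, the fill can also target that column alone by passing a dict to `DataFrame.fillna`. The sketch below is one possible variant, not the notebook's own method; the frame `toy` is a hypothetical stand-in built from the last four rows shown by `tail()`.

```python
import numpy as np
import pandas as pd

# Stand-in frame from the last four rows shown by tail(); use `df` on the real data.
toy = pd.DataFrame({
    "sex": ["male", "male", "male", "male"],
    "age": [np.nan, 30.0, 40.0, 32.0],
})

# median() skips NaN by default, so this is the median of 30, 32 and 40
age_median = toy["age"].median()

# A dict maps column name to fill value; a new frame is returned, `toy` is unchanged
filled = toy.fillna({"age": age_median})

print(age_median)
print(filled["age"].isna().sum())
```

Because the result is assigned to a new frame, the original stays available for comparison, for example to check how much the fill shifts the age distribution.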