# Data Science Final Project - Rizky Maulana Saputra

## Data Understanding

The dataset used in this project was obtained from Spada UMS and covers Student Academic Performance.

The dataset has 1,000 rows and 16 columns, which represent:
- student_id : Unique ID of each student
- age : Age of the student
- gender : Gender of the student
- study_hours_per_day : Total hours spent studying per day
- social_media_hours : Total hours spent on social media per day
- netflix_hours : Total hours spent on Netflix per day
- part_time_job : Whether the student has a part-time job (Yes/No)
- attendance_percentage : Attendance percentage of the student
- sleep_hours : Total hours of sleep per day
- diet_quality : Diet quality of the student (Poor/Fair/Good)
- exercise_frequency : How often the student exercises
- parental_education_level : Highest education level of the student's parents
- internet_quality : Quality of the internet connection the student uses (Poor/Average/Good)
- mental_health_rating : Mental health rating of the student (1-10)
- extracurricular_participation : Whether the student takes part in extracurricular activities (Yes/No)
- exam_score : Exam score of the student

Data condition:
- Missing values : Checked during exploratory analysis; there are missing values in the columns gender (5), attendance_percentage (3), and parental_education_level (91)
- Duplicated data : Checked during exploratory analysis; the data contains no duplicate rows
- Outliers : Checked during exploratory analysis with the 1.5 × IQR rule; outliers appear in the columns study_hours_per_day (7), social_media_hours (5), netflix_hours (4), attendance_percentage (5), sleep_hours (2), and exam_score (2)

- Import Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

- Load Dataset

In [2]:
df = pd.read_csv('../data/student_habits.csv')

In [3]:
df.head()

| | student_id | age | gender | study_hours_per_day | social_media_hours | netflix_hours | part_time_job | attendance_percentage | sleep_hours | diet_quality | exercise_frequency | parental_education_level | internet_quality | mental_health_rating | extracurricular_participation | exam_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | S1000 | 23 | Female | 0.0 | 1.2 | 1.1 | No | 85.0 | 8.0 | Fair | 6 | Master | Average | 8 | Yes | 56.2 |
| 1 | S1001 | 20 | Female | 6.9 | 2.8 | 2.3 | No | 97.3 | 4.6 | Good | 6 | High School | Average | 8 | No | 100.0 |
| 2 | S1002 | 21 | Male | 1.4 | 3.1 | 1.3 | No | 94.8 | 8.0 | Poor | 1 | High School | Poor | 1 | No | 34.3 |
| 3 | S1003 | 23 | Female | 1.0 | 3.9 | 1.0 | No | 71.0 | 9.2 | Poor | 4 | Master | Good | 1 | Yes | 26.8 |
| 4 | S1004 | 19 | Female | 5.0 | 4.4 | 0.5 | No | 90.9 | 4.9 | Fair | 3 | Master | Good | 1 | No | 66.4 |


## Exploratory Data Analysis


1. Checking Data Structure, Missing Values, and Duplicate Data

In [5]:
df.dtypes

student_id                        object
age                                int64
gender                            object
study_hours_per_day              float64
social_media_hours               float64
netflix_hours                    float64
part_time_job                     object
attendance_percentage            float64
sleep_hours                      float64
diet_quality                      object
exercise_frequency                 int64
parental_education_level          object
internet_quality                  object
mental_health_rating               int64
extracurricular_participation     object
exam_score                       float64
dtype: object

In [6]:
df.isna().sum()

student_id                        0
age                               0
gender                            5
study_hours_per_day               0
social_media_hours                0
netflix_hours                     0
part_time_job                     0
attendance_percentage             3
sleep_hours                       0
diet_quality                      0
exercise_frequency                0
parental_education_level         91
internet_quality                  0
mental_health_rating              0
extracurricular_participation     0
exam_score                        0
dtype: int64

In [7]:
df.duplicated().sum()

np.int64(0)

2. Descriptive Statistics

In [8]:
df.describe()

| | age | study_hours_per_day | social_media_hours | netflix_hours | attendance_percentage | sleep_hours | exercise_frequency | mental_health_rating | exam_score |
|---|---|---|---|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 997.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 20.498 | 3.5501 | 2.5055 | 1.8197 | 83.995286 | 6.4701 | 3.042 | 5.438 | 69.6015 |
| std | 2.3081 | 1.46889 | 1.172422 | 1.075118 | 9.909688 | 1.226377 | 2.025423 | 2.847501 | 16.888564 |
| min | 17.0 | 0.0 | 0.0 | 0.0 | 56.0 | 3.2 | 0.0 | 1.0 | 18.4 |
| 25% | 18.75 | 2.6 | 1.7 | 1.0 | 77.9 | 5.6 | 1.0 | 3.0 | 58.475 |
| 50% | 20.0 | 3.5 | 2.5 | 1.8 | 84.4 | 6.5 | 3.0 | 5.0 | 70.5 |
| 75% | 23.0 | 4.5 | 3.3 | 2.525 | 91.1 | 7.3 | 5.0 | 8.0 | 81.325 |
| max | 24.0 | 8.3 | 7.2 | 5.4 | 100.0 | 10.0 | 6.0 | 10.0 | 100.0 |


3. Outliers

In [14]:
def check_outlier(column):
    # Tukey's rule: values more than 1.5 * IQR beyond the quartiles count as outliers
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR

    outlier = df[(df[column] < lower) | (df[column] > upper)]
    print(f"Column {column} : {len(outlier)} outliers")

# Run the check on every numeric column
numeric = df.select_dtypes(include='number').columns
for col in numeric:
    check_outlier(col)

Column age : 0 outliers
Column study_hours_per_day : 7 outliers
Column social_media_hours : 5 outliers
Column netflix_hours : 4 outliers
Column attendance_percentage : 5 outliers
Column sleep_hours : 2 outliers
Column exercise_frequency : 0 outliers
Column mental_health_rating : 0 outliers
Column exam_score : 2 outliers
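
As a sanity check on these counts, the 1.5 × IQR fences can be recomputed from the 25% and 75% rows of the `df.describe()` output. The sketch below hard-codes those quartiles for three columns. The `quartiles` dictionary and the `iqr_fences` helper are illustrative names, not part of the notebook.

```python
# Quartiles copied from the 25% and 75% rows of the df.describe() output.
quartiles = {
    "study_hours_per_day": (2.6, 4.5),
    "attendance_percentage": (77.9, 91.1),
    "exam_score": (58.475, 81.325),
}

def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Compute the fences per column and print them rounded to three decimals.
fences = {col: iqr_fences(q1, q3) for col, (q1, q3) in quartiles.items()}
for col, (lower, upper) in fences.items():
    print(f"{col}: lower={lower:.3f}, upper={upper:.3f}")
```

The upper fence for study_hours_per_day comes out at about 7.35, below the column maximum of 8.3, and the lower fence for exam_score at about 24.2, above the column minimum of 18.4. Both are consistent with the outliers reported for those columns.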


4. Visualization

## Data Preparation

1. Handling Missing Values
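
The notebook does not show how the missing values in gender, attendance_percentage, and parental_education_level are handled. One common option, sketched here on a small made-up frame, is to fill numeric gaps with the column median and categorical gaps with the most frequent value. The toy values, the `fill_missing` name, and the choice of median and mode are all assumptions, not the author's method.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with gaps in the same three columns.
toy = pd.DataFrame({
    "gender": ["Female", np.nan, "Male", "Female"],
    "attendance_percentage": [85.0, np.nan, 94.8, 71.0],
    "parental_education_level": ["Master", "High School", np.nan, "High School"],
})

def fill_missing(frame):
    """Return a copy with numeric NaNs set to the median and other NaNs to the mode."""
    out = frame.copy()
    for col in out.columns:
        if out[col].isna().sum() == 0:
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

clean = fill_missing(toy)
print(clean.isna().sum())
```

With 91 of 1,000 values missing in parental_education_level, filling with the mode noticeably shifts that column's distribution. A separate placeholder category may be worth comparing.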

2. Converting Data Types...

3. Handling Outliers

In [None]:
def handling_outlier(column):
    # Cap values outside the 1.5 * IQR fences at the nearest fence
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR

    df.loc[df[column] > upper, column] = upper
    df.loc[df[column] < lower, column] = lower
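
The cell above only defines the function and shows no call. The sketch below shows how the same capping rule behaves on a toy column. `cap_outliers` is a hypothetical standalone variant that takes the frame as an argument instead of relying on the global `df`. It uses `Series.clip`, which should give the same result as the two `.loc` assignments on float columns.

```python
import pandas as pd

def cap_outliers(frame, column, k=1.5):
    """Cap values of `column` at the k * IQR fences and return the fences."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # clip replaces values below `lower` or above `upper` with the fence itself
    frame[column] = frame[column].clip(lower=lower, upper=upper)
    return lower, upper

# Toy data (made-up values): 15.0 lies far above the rest.
toy = pd.DataFrame({"sleep_hours": [6.0, 6.5, 7.0, 7.5, 8.0, 15.0]})
lower, upper = cap_outliers(toy, "sleep_hours")
print(lower, upper)
print(toy["sleep_hours"].tolist())
```

In the notebook itself, the equivalent step would be to call `handling_outlier` for each column in which `check_outlier` reported outliers.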

4. Splitting Data
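
No splitting code is shown, but the notebook imports `train_test_split` from scikit-learn. A minimal sketch follows, assuming `exam_score` is the target and using an 80/20 split with a fixed seed. The target, the split ratio, the seed, and the random toy data are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the prepared frame (random values, not the real data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "study_hours_per_day": rng.uniform(0, 8, size=50),
    "sleep_hours": rng.uniform(3, 10, size=50),
    "exam_score": rng.uniform(18, 100, size=50),
})

X = toy.drop(columns="exam_score")  # features
y = toy["exam_score"]               # assumed target

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

If `StandardScaler` is applied afterwards, fitting it on `X_train` only and then transforming both sets keeps test information out of the training step.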

## Modelling

## Evaluation