# MLOps Project: Depression Students
- **Name:** Abid Juliant Indraswara
- **Email:** abidjuliant@gmail.com
- **Dicoding ID:** abidindraswara

This project builds a machine learning pipeline that predicts whether a student is experiencing depression. The dataset used is depression-student-dataset from Kaggle.

Dataset: https://www.kaggle.com/datasets/ikynahidwin/depression-student-dataset

## Install & Import Libraries

In [1]:
# Import general-purpose libraries
import os, shutil
from shutil import copyfile
import zipfile
import pathlib
from pathlib import Path
import numpy as np
import pandas as pd

In [2]:
# Import libraries for machine learning and the pipeline
import tensorflow as tf
import tensorflow_model_analysis as tfma
from sklearn.preprocessing import LabelEncoder
from tfx.components import (
    CsvExampleGen,
    StatisticsGen,
    SchemaGen,
    ExampleValidator,
    Transform,
    Trainer,
    Evaluator,
    Pusher,
    Tuner
)
from tfx.proto import example_gen_pb2, trainer_pb2, pusher_pb2
from tfx.types import Channel
from tfx.dsl.components.common.resolver import Resolver
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext
from tfx.types.standard_artifacts import Model, ModelBlessing
from tfx.dsl.input_resolution.strategies.latest_blessed_model_strategy import (
    LatestBlessedModelStrategy)

## Data Loading & Dataset Check

In [3]:
# Load the Depression Students dataset
depression_df = pd.read_csv("./depression_student/data/data.csv")
depression_df

| | Gender | Age | Academic Pressure | Study Satisfaction | Sleep Duration | Dietary Habits | Have you ever had suicidal thoughts ? | Study Hours | Financial Stress | Family History of Mental Illness | Depression |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | 28 | 2.0 | 4.0 | 7-8 hours | Moderate | Yes | 9 | 2 | Yes | No |
| 1 | Male | 28 | 4.0 | 5.0 | 5-6 hours | Healthy | Yes | 7 | 1 | Yes | No |
| 2 | Male | 25 | 1.0 | 3.0 | 5-6 hours | Unhealthy | Yes | 10 | 4 | No | Yes |
| 3 | Male | 23 | 1.0 | 4.0 | More than 8 hours | Unhealthy | Yes | 7 | 2 | Yes | No |
| 4 | Female | 31 | 1.0 | 5.0 | More than 8 hours | Healthy | Yes | 4 | 2 | Yes | No |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 497 | Male | 26 | 5.0 | 2.0 | More than 8 hours | Unhealthy | No | 8 | 3 | No | Yes |
| 498 | Male | 24 | 2.0 | 1.0 | Less than 5 hours | Unhealthy | Yes | 8 | 5 | No | Yes |
| 499 | Female | 23 | 3.0 | 5.0 | 5-6 hours | Healthy | No | 1 | 5 | Yes | No |
| 500 | Male | 33 | 4.0 | 4.0 | More than 8 hours | Healthy | No | 8 | 1 | Yes | No |


### Check for NaN Values

In [4]:
# Count the NaN values in each column
print('NaN check for the Depression Student dataset')
print(depression_df.isna().sum())

NaN check for the Depression Student dataset
Gender                                   0
Age                                      0
Academic Pressure                        0
Study Satisfaction                       0
Sleep Duration                           0
Dietary Habits                           0
Have you ever had suicidal thoughts ?    0
Study Hours                              0
Financial Stress                         0
Family History of Mental Illness         0
Depression                               0
dtype: int64


### Check Dataset Info

In [5]:
# Show column dtypes and non-null counts
depression_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 502 entries, 0 to 501
Data columns (total 11 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   Gender                                 502 non-null    object 
 1   Age                                    502 non-null    int64  
 2   Academic Pressure                      502 non-null    float64
 3   Study Satisfaction                     502 non-null    float64
 4   Sleep Duration                         502 non-null    object 
 5   Dietary Habits                         502 non-null    object 
 6   Have you ever had suicidal thoughts ?  502 non-null    object 
 7   Study Hours                            502 non-null    int64  
 8   Financial Stress                       502 non-null    int64  
 9   Family History of Mental Illness       502 non-null    object 
 10  Depression                             502 non-null    object 
dtypes: float64(2), int64(3), object(6)

### Check Categorical Features

In [6]:
# Collect the columns with the object dtype
category_per_columns = depression_df.select_dtypes(include=['object']).columns

In [7]:
# Print the unique values of each categorical column
for column in category_per_columns:
    category_value = depression_df[column].unique()
    print(f"{column}: \n{category_value}", '\n')

Gender: 
['Male' 'Female'] 

Sleep Duration: 
['7-8 hours' '5-6 hours' 'More than 8 hours' 'Less than 5 hours'] 

Dietary Habits: 
['Moderate' 'Healthy' 'Unhealthy'] 

Have you ever had suicidal thoughts ?: 
['Yes' 'No'] 

Family History of Mental Illness: 
['Yes' 'No'] 

Depression: 
['No' 'Yes'] 
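
The category counts above matter later in the notebook: the Trainer's model summary in the pipeline log reports input widths of 3, 5, 4 and 3 for `Gender_xf`, `SleepDuration_xf`, `DietaryHabits_xf` and the suicidal-thoughts feature, which is one more than the number of categories in each column. The transform module is not shown here, so the sketch below is only an assumption about how such widths could arise: each category is mapped to an integer index and one-hot encoded with one extra slot reserved for unseen values. The helper `one_hot_with_extra_slot` is hypothetical and is not taken from the project code.

```python
# Categories copied from the notebook output above.
CATEGORIES = {
    "Gender": ["Male", "Female"],
    "Sleep Duration": [
        "7-8 hours", "5-6 hours", "More than 8 hours", "Less than 5 hours",
    ],
    "Dietary Habits": ["Moderate", "Healthy", "Unhealthy"],
    "Have you ever had suicidal thoughts ?": ["Yes", "No"],
}


def one_hot_with_extra_slot(value, vocabulary):
    """Hypothetical helper: one-hot encode `value` with depth len(vocabulary) + 1.

    The last slot is reserved for values that are not in the vocabulary.
    """
    depth = len(vocabulary) + 1
    index = vocabulary.index(value) if value in vocabulary else depth - 1
    return [1 if i == index else 0 for i in range(depth)]


# Encoded width per feature: number of categories + 1.
widths = {
    name: len(one_hot_with_extra_slot(vocab[0], vocab))
    for name, vocab in CATEGORIES.items()
}
print(widths)
```

Under this assumption the computed widths match the shapes in the model summary, which at least makes the "categories + 1" reading plausible.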



## Label Encoding

In [8]:
# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Convert the 'Depression' column into numeric labels
depression_df['Depression'] = label_encoder.fit_transform(depression_df['Depression'])

# Display the result of label encoding
depression_df[['Depression']]

| | Depression |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 1 |
| 3 | 0 |
| 4 | 0 |
| ... | ... |
| 497 | 1 |
| 498 | 1 |
| 499 | 0 |
| 500 | 0 |
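
`LabelEncoder` sorts the distinct labels and assigns each one its position in that sorted order, which is why `No` becomes 0 and `Yes` becomes 1 in the output above. The pure-Python sketch below mirrors that behaviour for illustration only; `encode_labels` is a hypothetical helper, not part of scikit-learn.

```python
def encode_labels(labels):
    """Hypothetical stand-in for LabelEncoder.fit_transform on string labels."""
    classes = sorted(set(labels))  # sorted distinct labels, e.g. ['No', 'Yes']
    mapping = {label: idx for idx, label in enumerate(classes)}
    return [mapping[label] for label in labels], classes


# First five labels from the dataset preview.
encoded, classes = encode_labels(["No", "No", "Yes", "No", "No"])
print(classes)  # ['No', 'Yes']
print(encoded)  # [0, 0, 1, 0, 0]
```

With the real encoder, `label_encoder.classes_` exposes the same ordering, and `label_encoder.inverse_transform` maps numeric predictions back to the original labels.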


### Save the Dataset to a CSV File

In [9]:
# Save the encoded dataset to a CSV file
depression_df.to_csv("./data/data.csv", index=False)
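
`DataFrame.to_csv` does not create missing parent directories, so the cell above assumes that `./data` already exists. A minimal sketch of a guard that could run before the write is shown below; `ensure_parent_dir` is a hypothetical helper introduced for illustration.

```python
from pathlib import Path


def ensure_parent_dir(csv_path):
    """Create the parent directory of `csv_path` if needed and return the path."""
    path = Path(csv_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


output_path = ensure_parent_dir("./data/data.csv")
# The write itself stays as in the cell above:
# depression_df.to_csv(output_path, index=False)
```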

## Running the Machine Learning Pipeline

### Import TFX Libraries

In [10]:
import os
from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner
from modules.pipeline import init_local_pipeline
from modules.depression_components import init_components

### Initialize Pipeline Variables

In [11]:
PIPELINE_NAME = 'abidindraswara-pipeline'

DATA_ROOT = 'data'
TRANSFORM_MODULE_FILE = 'modules/depression_transform.py'
TRAINER_MODULE_FILE = 'modules/depression_trainer.py'

OUTPUT_BASE = 'output'
serving_model_dir = os.path.join(OUTPUT_BASE, 'serving_model')
pipeline_root = os.path.join(OUTPUT_BASE, PIPELINE_NAME)
metadata_path = os.path.join(pipeline_root, 'metadata.sqlite')
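
The three derived paths follow directly from `OUTPUT_BASE` and `PIPELINE_NAME`. The sketch below recomputes them with `pathlib` so the resulting layout is explicit. Using `PurePosixPath` is an illustrative choice made here to get platform-independent output; the notebook itself uses `os.path.join`, and the backslashes in the training log suggest the original run was on Windows.

```python
from pathlib import PurePosixPath

PIPELINE_NAME = "abidindraswara-pipeline"
OUTPUT_BASE = PurePosixPath("output")

# Same joins as in the cell above, expressed with pathlib.
serving_model_dir = OUTPUT_BASE / "serving_model"
pipeline_root = OUTPUT_BASE / PIPELINE_NAME
metadata_path = pipeline_root / "metadata.sqlite"

for name, path in [
    ("serving_model_dir", serving_model_dir),
    ("pipeline_root", pipeline_root),
    ("metadata_path", metadata_path),
]:
    print(f"{name}: {path}")
```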

### Running the Pipeline Locally

In [12]:
components = init_components(
    data_dir=DATA_ROOT,
    transform_module=TRANSFORM_MODULE_FILE,
    training_module=TRAINER_MODULE_FILE,
    training_steps=5000,
    eval_steps=1000,
    serving_model_dir=serving_model_dir
)

pipeline = init_local_pipeline(components, pipeline_root)
BeamDagRunner().run(pipeline)





Instructions for updating:
Use ref() instead.




Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 Gender_xf (InputLayer)         [(None, 3)]          0           []                               
                                                                                                  
 SleepDuration_xf (InputLayer)  [(None, 5)]          0           []                               
                                                                                                  
 DietaryHabits_xf (InputLayer)  [(None, 4)]          0           []                               
                                                                                                  
 Haveyoueverhadsuicidalthoughts  [(None, 3)]         0           []                               
 _xf (InputLayer)                                                                             

INFO:tensorflow:struct2tensor is not available.


INFO:tensorflow:tensorflow_decision_forests is not available.




INFO:tensorflow:tensorflow_text is not available.




INFO:tensorflow:Assets written to: output\abidindraswara-pipeline\Trainer\model\7\Format-Serving\assets




You must install pydot (`pip install pydot`) and install graphviz (see instructions at https://graphviz.gitlab.io/download/) for plot_model to work.












Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`


