### Import Library

In [None]:
import os
import math
import kagglehub
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split

## Download and Explore the Student Depression Dataset

This code downloads the latest version of the "Student Depression" dataset through `kagglehub` and then lists the files in the downloaded directory.

### Code Breakdown
- The function `kagglehub.dataset_download("hopesb/student-depression-dataset")` is used to download the dataset.
- The returned value, stored in the variable `path`, is the location where the dataset files are saved.


In [1]:
# Download latest version
path = kagglehub.dataset_download("hopesb/student-depression-dataset")

print("Path to dataset files:", path)

NameError: name 'kagglehub' is not defined

In [None]:
files = os.listdir(path) if os.path.isdir(path) else []
print("Folder contents:", files)

## Loading the Student Depression Dataset CSV File

This code snippet demonstrates how to load a CSV file containing the "Student Depression" dataset using `pandas` and how to handle potential exceptions during the file reading process.

### Code Breakdown

- The code tries to load the CSV file using `pd.read_csv()`.
- It constructs the file path by concatenating the `path` variable with the CSV file name `/Student Depression Dataset.csv`.
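- If the file is loaded successfully, it prints a success message.
- A `FileNotFoundError` or any other exception is caught and reported instead of stopping the notebook.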
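
The steps above could also be wrapped in a small helper. This is only a sketch: `load_dataset` is a hypothetical name, not part of the notebook, and it builds the path with `pathlib` rather than string concatenation. When the file is missing, it lists the CSV files that are present so a misspelled name is easy to spot.

```python
from pathlib import Path

import pandas as pd


def load_dataset(directory, filename="Student Depression Dataset.csv"):
    """Read `filename` from `directory`, raising a clear error if it is missing."""
    csv_path = Path(directory) / filename
    if not csv_path.is_file():
        # Show which CSV files exist so a wrong file name is easy to diagnose.
        available = sorted(p.name for p in Path(directory).glob("*.csv"))
        raise FileNotFoundError(
            f"{filename!r} not found in {directory!r}; CSV files present: {available}"
        )
    return pd.read_csv(csv_path)
```

In the notebook this would be called as `df = load_dataset(path)`, assuming `path` is the directory returned by `kagglehub.dataset_download`.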

In [None]:
try:
  df = pd.read_csv(path + '/Student Depression Dataset.csv')
  print("Dataset loaded successfully")
except FileNotFoundError:
  # Report the same file name that the read above actually uses
  print(f"Error: File 'Student Depression Dataset.csv' not found in {path}. Please check the file name and path.")
except Exception as e:
  print(f"An error occurred: {e}")


Dataset loaded successfully


## DataFrame Inspection and Summary Statistics

This code snippet is used to inspect and analyze a Pandas DataFrame (`df`). Each command provides a different perspective on the data, helping you understand its structure, contents, and quality.

## Code Explanation



### 1. View the First Few Rows
- `df.head()`
- Displays the first 5 rows (by default) of the DataFrame.
- Useful to get an initial look at the dataset structure and sample data.

In [None]:
df.head()

Unnamed: 0,id,Gender,Age,City,Profession,Academic Pressure,Work Pressure,CGPA,Study Satisfaction,Job Satisfaction,Sleep Duration,Dietary Habits,Degree,Have you ever had suicidal thoughts ?,Work/Study Hours,Financial Stress,Family History of Mental Illness,Depression
0,2,Male,33.0,Visakhapatnam,Student,5.0,0.0,8.97,2.0,0.0,5-6 hours,Healthy,B.Pharm,Yes,3.0,1.0,No,1
1,8,Female,24.0,Bangalore,Student,2.0,0.0,5.9,5.0,0.0,5-6 hours,Moderate,BSc,No,3.0,2.0,Yes,0
2,26,Male,31.0,Srinagar,Student,3.0,0.0,7.03,5.0,0.0,Less than 5 hours,Healthy,BA,No,9.0,1.0,Yes,0
3,30,Female,28.0,Varanasi,Student,3.0,0.0,5.59,2.0,0.0,7-8 hours,Moderate,BCA,Yes,4.0,5.0,Yes,1
4,32,Female,25.0,Jaipur,Student,4.0,0.0,8.13,3.0,0.0,5-6 hours,Moderate,M.Tech,Yes,1.0,1.0,No,0


### 2. View the Last Few Rows
- `df.tail()`
- Displays the last 5 rows (by default) of the DataFrame.
- Helps to check the end of the dataset, ensuring that no issues exist in the final rows.

In [None]:
df.tail()

Unnamed: 0,id,Gender,Age,City,Profession,Academic Pressure,Work Pressure,CGPA,Study Satisfaction,Job Satisfaction,Sleep Duration,Dietary Habits,Degree,Have you ever had suicidal thoughts ?,Work/Study Hours,Financial Stress,Family History of Mental Illness,Depression
27896,140685,Female,27.0,Surat,Student,5.0,0.0,5.75,5.0,0.0,5-6 hours,Unhealthy,Class 12,Yes,7.0,1.0,Yes,0
27897,140686,Male,27.0,Ludhiana,Student,2.0,0.0,9.4,3.0,0.0,Less than 5 hours,Healthy,MSc,No,0.0,3.0,Yes,0
27898,140689,Male,31.0,Faridabad,Student,3.0,0.0,6.61,4.0,0.0,5-6 hours,Unhealthy,MD,No,12.0,2.0,No,0
27899,140690,Female,18.0,Ludhiana,Student,5.0,0.0,6.88,2.0,0.0,Less than 5 hours,Healthy,Class 12,Yes,10.0,5.0,No,1
27900,140699,Male,27.0,Patna,Student,4.0,0.0,9.24,1.0,0.0,Less than 5 hours,Healthy,BCA,Yes,2.0,3.0,Yes,1



### 3. Get DataFrame Information
- `df.info()`
- Provides a concise summary of the DataFrame.
- Shows the data types, non-null counts, and memory usage.
- Useful for identifying columns with missing values or unexpected data types.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27901 entries, 0 to 27900
Data columns (total 18 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   id                                     27901 non-null  int64  
 1   Gender                                 27901 non-null  object 
 2   Age                                    27901 non-null  float64
 3   City                                   27901 non-null  object 
 4   Profession                             27901 non-null  object 
 5   Academic Pressure                      27901 non-null  float64
 6   Work Pressure                          27901 non-null  float64
 7   CGPA                                   27901 non-null  float64
 8   Study Satisfaction                     27901 non-null  float64
 9   Job Satisfaction                       27901 non-null  float64
 10  Sleep Duration                         27901 non-null  object 
 11  Di

### 4. Check for Missing Values
- `df.isnull().sum()`
- Computes the total number of missing values in each column.
- Helps in identifying which columns might need cleaning or imputation.
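
As a minimal sketch of how the raw counts can be turned into a short report, the example below adds a percentage column and keeps only the columns that contain gaps. The `toy` frame and the `report` name are made up for illustration and are not the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; not the real dataset.
toy = pd.DataFrame({
    "Age": [21.0, 25.0, 30.0, 19.0],
    "Financial Stress": [1.0, np.nan, 3.0, np.nan],
})

missing = toy.isnull().sum()
report = pd.DataFrame({
    "missing": missing,
    "percent": (missing / len(toy) * 100).round(2),
})

# Keep only the columns that actually contain missing values.
report = report[report["missing"] > 0]
print(report)
```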

In [None]:
df.isnull().sum()

Unnamed: 0,0
id,0
Gender,0
Age,0
City,0
Profession,0
Academic Pressure,0
Work Pressure,0
CGPA,0
Study Satisfaction,0
Job Satisfaction,0


### 5. Generate Descriptive Statistics
- `df.describe()`
- Provides statistical summaries for numerical columns.
- Includes metrics such as count, mean, standard deviation, minimum, and maximum values.
- Useful for a quick quantitative summary of the data.

In [None]:
df.describe()

Unnamed: 0,id,Age,Academic Pressure,Work Pressure,CGPA,Study Satisfaction,Job Satisfaction,Work/Study Hours,Financial Stress,Depression
count,27901.0,27901.0,27901.0,27901.0,27901.0,27901.0,27901.0,27901.0,27898.0,27901.0
mean,70442.149421,25.8223,3.141214,0.00043,7.656104,2.943837,0.000681,7.156984,3.139867,0.585499
std,40641.175216,4.905687,1.381465,0.043992,1.470707,1.361148,0.044394,3.707642,1.437347,0.492645
min,2.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
25%,35039.0,21.0,2.0,0.0,6.29,2.0,0.0,4.0,2.0,0.0
50%,70684.0,25.0,3.0,0.0,7.77,3.0,0.0,8.0,3.0,1.0
75%,105818.0,30.0,4.0,0.0,8.92,4.0,0.0,10.0,4.0,1.0
max,140699.0,59.0,5.0,5.0,10.0,5.0,4.0,12.0,5.0,1.0


### 6. Count Unique Values per Column
- The loop iterates over each column in the DataFrame and prints its number of unique values:
  ```python
  for col in df.columns:
      print(f"Column '{col}': {df[col].nunique()} unique values")
  ```

In [None]:

for col in df.columns:
    print(f"Column '{col}': {df[col].nunique()} unique values")


Column 'id': 27901 unique values
Column 'Gender': 2 unique values
Column 'Age': 34 unique values
Column 'City': 52 unique values
Column 'Profession': 14 unique values
Column 'Academic Pressure': 6 unique values
Column 'Work Pressure': 3 unique values
Column 'CGPA': 332 unique values
Column 'Study Satisfaction': 6 unique values
Column 'Job Satisfaction': 5 unique values
Column 'Sleep Duration': 5 unique values
Column 'Dietary Habits': 4 unique values
Column 'Degree': 28 unique values
Column 'Have you ever had suicidal thoughts ?': 2 unique values
Column 'Work/Study Hours': 13 unique values
Column 'Financial Stress': 5 unique values
Column 'Family History of Mental Illness': 2 unique values
Column 'Depression': 2 unique values


## Encoding Categorical Variables with LabelEncoder

This code snippet demonstrates how to preprocess a DataFrame by removing an unnecessary identifier column and encoding categorical (object-type) columns using scikit-learn's `LabelEncoder`. It also captures the mapping from original category names to their corresponding integer labels.

## Code Explanation

### 1. Remove the 'id' Column
- The `id` column is dropped from the DataFrame because it is only a unique row identifier and carries no information for analysis or modeling.

```python
df.drop("id", axis=1, inplace=True)
```

In [2]:
df.drop("id", axis=1, inplace=True)

NameError: name 'df' is not defined

In [3]:
encoded_data = {}

label_encoder = LabelEncoder()

for column in df.columns:
    if df[column].dtype == 'object':
        encoded_data[column] = {}
        df[column] = label_encoder.fit_transform(df[column])
        for i in range(len(label_encoder.classes_)):
            encoded_data[column][label_encoder.classes_[i]] = i
encoded_data


NameError: name 'LabelEncoder' is not defined
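
The loop above reuses a single `LabelEncoder`, so only the last column's fitted state survives. One possible variation, sketched below on toy data, keeps one encoder per column so the integer codes can be turned back into labels later. The `toy`, `encoders`, and `mappings` names are hypothetical and the values are borrowed from the dataset preview.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data for illustration only.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Dietary Habits": ["Healthy", "Moderate", "Healthy"],
    "Age": [33.0, 24.0, 31.0],
})

encoders = {}  # column name -> fitted LabelEncoder
mappings = {}  # column name -> {category: integer label}

for column in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[column]):
        continue  # numeric columns are left untouched
    encoder = LabelEncoder()
    toy[column] = encoder.fit_transform(toy[column])
    encoders[column] = encoder
    mappings[column] = {
        str(label): code for code, label in enumerate(encoder.classes_)
    }

print(mappings)

# Recover the original labels from the integer codes when needed.
print(encoders["Gender"].inverse_transform(toy["Gender"]))
```

`LabelEncoder` assigns codes in sorted order of the category names, which is why a lookup table such as `mappings` is worth keeping for any later filter on an encoded value.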

## Visualizing Data Distributions and Correlations

This code snippet demonstrates how to visualize the distributions of multiple DataFrame columns using histograms arranged in a grid layout, as well as how to create a correlation heatmap to examine relationships between numerical variables.

## Histogram Grid Visualization

### 1. Define Grid Parameters
- **`num_cols`:** Specifies the total number of histograms to plot (set to 16).
- **`cols_per_row`:** Maximum number of histograms per row (set to 6).
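
The grid size follows from these two parameters by rounding up. A minimal sketch, where `grid_shape` is a hypothetical helper name:

```python
import math


def grid_shape(num_plots, cols_per_row):
    """Return (rows, cols) for a grid that can hold `num_plots` subplots."""
    num_rows = math.ceil(num_plots / cols_per_row)
    return num_rows, cols_per_row


rows, cols = grid_shape(16, 6)
unused = rows * cols - 16  # subplots left empty in the last row
print(rows, cols, unused)  # 3 6 2
```

With 16 histograms and 6 per row, the grid has 3 rows and the last 2 axes stay empty, which is why unused axes need to be removed after plotting.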

```python
num_cols = 16
cols_per_row = 6  # Maximum histograms per row
```

In [None]:
# Number of histograms to draw
num_cols = 16
cols_per_row = 6  # Maximum histograms per row

# Number of rows needed for the grid
num_rows = math.ceil(num_cols / cols_per_row)

# Create the subplots
fig, axes = plt.subplots(num_rows, cols_per_row, figsize=(20, 4 * num_rows))

# Flatten the axes array so it can be indexed with a single counter
axes = axes.flatten()

# Plot one histogram per column
for i, col in enumerate(df.columns[:num_cols]):  # Only the first 16 columns
    sns.histplot(df[col], kde=True, ax=axes[i], orientation="vertical")

    axes[i].set_title(f'Distribution of {col}')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Frequency')

# Remove unused subplots when the histograms do not fill the grid
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

# Tidy up the layout
plt.tight_layout()
plt.show()



In [None]:
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()
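
A full heatmap can be hard to read when the main interest is the target column. As a rough sketch on made-up data, the correlations with `Depression` can be pulled out and ranked by absolute value. The `toy` frame and the `target_corr` name are assumptions for illustration.

```python
import pandas as pd

# Toy data for illustration only; not values from the real dataset.
toy = pd.DataFrame({
    "Academic Pressure": [5, 2, 3, 3, 4, 1],
    "Study Satisfaction": [2, 5, 5, 2, 3, 4],
    "Depression": [1, 0, 0, 1, 1, 0],
})

target_corr = (
    toy.corr()["Depression"]
    .drop("Depression")  # the target always correlates perfectly with itself
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```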


# Data Preprocessing and Normalization

This part of the notebook carries out the following preprocessing steps:

1. **Data Cleaning**  
   - Drop rows that contain missing values with `df.dropna(inplace=True)`.  
   - Check the number of missing values in each column with `df.isnull().sum()`.

2. **Data Filtering**  
   - Keep only the rows with a `CGPA` of at least 0.6 (`CGPA >= 0.6`).

3. **Feature Selection**  
   - Define the features to exclude (for example `Gender`, `Job Satisfaction`, `Profession`, `Work Pressure`).  
   - Build the list of features to use from the columns that are not in the exclusion list.

4. **Separating Features and Target**  
   - Separate the features (variable `X`) from the target (variable `y`, the `Depression` column).

5. **Splitting the Data**  
   - Split the dataset into 80% training data and 20% test data with `train_test_split`.

6. **Data Normalization**  
   - Use `MinMaxScaler` to normalize selected columns (such as `Age`, `Academic Pressure`, `CGPA`, `Study Satisfaction`).
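
One way to follow the order listed above (split first, then normalize) is sketched below on toy data. The scaler is fitted on the training rows only and then applied to the test rows, so no information from the test set influences the scaling. The `toy` frame is made up for illustration; the other names mirror the ones used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy data for illustration only.
toy = pd.DataFrame({
    "Age": [18.0, 21.0, 25.0, 30.0, 27.0, 33.0, 24.0, 28.0, 31.0, 19.0],
    "CGPA": [6.9, 7.8, 5.9, 8.9, 9.2, 9.0, 5.6, 8.1, 7.0, 6.6],
    "Depression": [1, 0, 0, 1, 1, 1, 0, 0, 0, 1],
})

X = toy.drop("Depression", axis=1)
y = toy["Depression"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

columns_to_normalize = ["Age", "CGPA"]
scaler = MinMaxScaler()

# Work on copies so the assignments below do not touch the original frame.
X_train = X_train.copy()
X_test = X_test.copy()

# Learn min and max from the training rows, then reuse them for the test rows.
X_train[columns_to_normalize] = scaler.fit_transform(X_train[columns_to_normalize])
X_test[columns_to_normalize] = scaler.transform(X_test[columns_to_normalize])

print(X_train.shape, X_test.shape)
```

Test values can fall slightly outside the 0 to 1 range with this approach, because the test rows may contain values beyond the training minimum or maximum.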


In [4]:

df.dropna(inplace=True)


NameError: name 'df' is not defined

In [None]:
df.isnull().sum()

In [None]:
# Filter out rows where CGPA is below 0.6
df = df[df['CGPA'] >= 0.6]

In [None]:
# Keep only the rows whose profession is Student (this is a student dataset);
# 11 is taken here to be the label-encoded value of 'Student'
df = df[df['Profession'] == 11]

In [6]:
# Initialize the MinMaxScaler
scaler = MinMaxScaler()

# Columns to normalize
columns_to_normalize = ['Age', 'Academic Pressure', 'CGPA', 'Study Satisfaction']

# Normalize the selected columns
df[columns_to_normalize] = scaler.fit_transform(df[columns_to_normalize])

# Show the normalized DataFrame
print(df.head())


NameError: name 'MinMaxScaler' is not defined

In [None]:
features_to_exclude = ['Gender', 'Job Satisfaction', 'Profession', 'Work Pressure']
selected_features = [col for col in df.columns if col not in features_to_exclude]

# Create a new DataFrame with the selected features
df_selected = df[selected_features]

In [None]:
# Separate features (X) and target variable (y)
X = df_selected.drop('Depression', axis=1)
y = df_selected['Depression']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # 80% training and 20% test

print("Training data shape:", X_train.shape, y_train.shape)
print("Testing data shape:", X_test.shape, y_test.shape)


### Test Dataset

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Initialize and train the Logistic Regression model
model = LogisticRegression(max_iter=1000) # Increased max_iter
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

print(classification_report(y_test, y_pred))

conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
conf_matrix
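
To make the confusion matrix easier to interpret, the sketch below derives accuracy, precision, and recall from its four cells by hand. The counts are hypothetical and are not results from this notebook. The layout follows scikit-learn's convention for binary labels: rows are true classes and columns are predicted classes.

```python
import numpy as np

# Hypothetical counts for illustration; not results from this notebook.
# Rows are true labels (0, 1); columns are predicted labels (0, 1).
conf_matrix = np.array([[50, 10],
                        [5, 35]])

tn, fp, fn, tp = conf_matrix.ravel()

accuracy = (tp + tn) / conf_matrix.sum()
precision = tp / (tp + fp)  # of the predicted positives, how many were correct
recall = tp / (tp + fn)     # of the actual positives, how many were found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```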


In [5]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# Assuming X_train, X_test, y_train, y_test are already defined from the previous code

# Define the ANN model
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid') # Output layer for binary classification
])

# Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy', # Use binary_crossentropy for binary classification
              metrics=['accuracy'])

# Define early stopping
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Define model checkpoint
checkpoint_filepath = 'best_model.h5'
model_checkpoint = ModelCheckpoint(filepath=checkpoint_filepath,
                                   monitor='val_accuracy',
                                   save_best_only=True,
                                   mode='max')


# Train the model with callbacks
history = model.fit(X_train, y_train, epochs=50, batch_size=32,
                    validation_split=0.2, callbacks=[early_stopping, model_checkpoint])

# Load the best model
best_model = keras.models.load_model(checkpoint_filepath)

# Evaluate the best model on the test set
loss, accuracy = best_model.evaluate(X_test, y_test, verbose=0)
print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")

# Make predictions
y_pred = (best_model.predict(X_test) > 0.5).astype("int32") # Convert probabilities to binary predictions

NameError: name 'X_train' is not defined
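
The last step of the cell above turns sigmoid probabilities into class labels. A small standalone sketch with made-up probabilities shows the effect. The `probabilities` array and the `threshold` name are assumptions for illustration; the final `ravel()` is an optional extra that flattens the `(n_samples, 1)` output into a 1-D array.

```python
import numpy as np

# Hypothetical sigmoid outputs with shape (n_samples, 1).
probabilities = np.array([[0.91], [0.12], [0.50], [0.73]])

threshold = 0.5
# Strictly greater than the threshold counts as class 1, so 0.50 maps to 0.
y_pred = (probabilities > threshold).astype("int32").ravel()
print(y_pred)  # [1 0 0 1]
```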