
# Lab 11: Understanding the data. 

Last 2 classes we have been working on understanding data. In this lab you will apply that.


**Context**
The goal of this lab is to understand a large dataset of 27901 rows and 18 columns Perform Exploratory Data Analysis (EDA) on the data. Look at how the data effects the end result **depression** (yes/no)

- Load and get a basic understanding of the dataset
- Clean up the data 
    - examples:
        - check and fix any missing data 
        - use both Hot and integer encoding
        - convert any numeric data from strings to ints
- Perform Exploratory Data Analysis (EDA) 
    - I will leave it to you to understand and explore this. 
        - Histplot
        - Countplot
        - Correlation
        - Heatmaps
        - boxplots
        - have fun

Make sure to communicate to me along the way. I want your to tell me what your assumptions are what your learning about the data and what you learned with EDA.  There are 18 data points for each student I expect an perform EDA on most of those points like what we did in class with MPG. (remember how we did a sns.pairplot(df[[ "cylinders", "mpg","model_year"]]) and sns.pairplot(df[["mpg", "horsepower", "weight", "displacement"]]) and others. ) 

remember to have fun with this 

----------------------------------------------------------------------------------------------------------------------------------------------

From: https://www.kaggle.com/datasets/adilshamim8/student-depression-dataset/data


Field Descriptions

**id** - A unique identifier assigned to each student record in the dataset.

**Gender** - The gender of the student (e.g., Male, Female, Other). This helps in analyzing gender-specific trends in mental health.

**Age** - The age of the student in years.

**City** - The city or region where the student resides, providing geographical context for the analysis.

**Profession** - The field of work or study of the student, which may offer insights into occupational or academic stress factors.

**Academic Pressure** - A measure indicating the level of pressure the student faces in academic settings. This could include stress from exams, assignments, and overall academic expectations.

**Work Pressure** - A measure of the pressure related to work or job responsibilities, relevant for students who are employed alongside their studies.

**CGPA** - The cumulative grade point average of the student, reflecting overall academic performance.

**Study Satisfaction** - An indicator of how satisfied the student is with their studies, which can correlate with mental well-being.

**Job Satisfaction** - A measure of the student’s satisfaction with their job or work environment, if applicable.

**Sleep Duration** - The average number of hours the student sleeps per day, which is an important factor in mental health.

**Dietary Habits** - An assessment of the student’s eating patterns and nutritional habits, potentially impacting overall health and mood.

**Degree** - The academic degree or program that the student is pursuing.

**Have you ever had suicidal thoughts ?** - A binary indicator (Yes/No) that reflects whether the student has ever experienced suicidal ideation.

**Work/Study Hours** - The average number of hours per day the student dedicates to work or study, which can influence stress levels.

**Financial Stress** - A measure of the stress experienced due to financial concerns, which may affect mental health.

**Family History of Mental Illness** - Indicates whether there is a family history of mental illness (Yes/No), which can be a significant factor in mental health predispositions.

**Depression** - The target variable that indicates whether the student is experiencing depression (Yes/No).

------------------------------------------------------------------------------------------------------------------------------------------------

*Provided for reference only*
``` python 
import kagglehub
import shutil

# Download latest version
path = kagglehub.dataset_download("adilshamim8/student-depression-dataset")

#move the dowload to the current directory
shutil.move(path, "./Lab_11_dataset")
````

In [3]:
# importing libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# loading the dataset
df = pd.read_csv("./Lab_11_dataset/student_depression_dataset.csv")
df.head()

ModuleNotFoundError: No module named 'numpy'

In [2]:
"""used AI for to help install pandas. i kept getting an error trying to import it. up until line 19"""

import subprocess
import sys
import pandas as pd  # Import pandas here

print("Python version:")
print(sys.version)
print("Pandas version:")
print(pd.__version__)

# Try to install pandas
try:
    subprocess.check_call(['pip', 'install', 'pandas'])
    print("Pandas installed successfully.")
except subprocess.CalledProcessError as e:
    print(f"Error installing pandas: {e}")
    print("Please make sure you have pip installed and try installing pandas manually: pip install pandas")
    # Don't exit here, try to continue if possible

csv_file_path = "Lab11/student_depression_dataset.csv"

data_frame = None

try:
    """Read file path"""
    data_frame = pd.read_csv(csv_file_path)
    print(f"Successfully loaded data from: {csv_file_path}")

except FileNotFoundError:

    print(f"Error: File not found at {csv_file_path}. Please make sure the file exists and the path is correct.")
    data_frame = pd.DataFrame()

except Exception as e:

    print(f"An error occurred while reading the CSV file: {e}")
    data_frame = pd.DataFrame()

if data_frame is not None: 

    numeric_columns_to_convert = ['Age', 'Academic Pressure', 'CGPA', 'Study Satisfaction', 'Work/Study Hours', 'Financial Stress']
    try:

        for column_name in numeric_columns_to_convert:
            data_frame[column_name] = pd.to_numeric(data_frame[column_name], errors='coerce')

    except Exception as e:

        print(f"Error converting columns: {e}")

    print("\nValue counts for 'Work Pressure' before cleaning:")

    try:
        print(data_frame['Work Pressure'].value_counts())

    except Exception as e:

        print(f"Error getting value counts for 'Work Pressure': {e}")

    print("\nValue counts for 'Job Satisfaction' before cleaning:")

    try:
        print(data_frame['Job Satisfaction'].value_counts())

    except Exception as e:
        print(f"Error getting value counts for 'Job Satisfaction': {e}")

    categorical_columns_to_clean = ['Gender', 'City', 'Profession', 'Sleep Duration', 'Dietary Habits', 'Degree','Have you ever had suicidal thoughts?', 'Family History of Mental Illness', 'Depression']
    
    try:
        for column_name in categorical_columns_to_clean:
            data_frame[column_name] = data_frame[column_name].str.strip()

    except Exception as e:
        print(f"Error cleaning categorical columns: {e}")

    try:
        data_frame['Sleep Duration'] = data_frame['Sleep Duration'].str.replace("'", "")
        data_frame['Dietary Habits'] = data_frame['Dietary Habits'].str.replace("'", "")

    except Exception as e:
        print(f"Error'Sleep Duration' and 'Dietary Habits': {e}")

    try:
        print("\nUnique values of 'Sleep Duration' after cleaning:")
        print(data_frame['Sleep Duration'].unique())
        print("\nUnique values of 'Dietary Habits' after cleaning:")
        print(data_frame['Dietary Habits'].unique())
        print("\nUnique values of 'Depression' after cleaning:")
        print(data_frame['Depression'].unique())

    except Exception as e:
        print(f"Error printing unique values: {e}")

    print("\nMissing values per column:")

    try:
        print(data_frame.isnull().sum())

    except Exception as e:

        print(f"Error checking for missing values: {e}")

    print("\nFirst 5 rows of the cleaned data:")

    try:
        print(data_frame.head())

    except Exception as e:
        print(f"Error printing head: {e}")

    print("\nDataFrame Info:")

    try:
        print(data_frame.info())

    except Exception as e:
        print(f"Error printing info: {e}")


Python version:
3.13.3 (tags/v3.13.3:6280bb5, Apr  8 2025, 14:47:33) [MSC v.1943 64 bit (AMD64)]
Pandas version:
2.2.3
Pandas installed successfully.
Error: File not found at Lab11/student_depression_dataset.csv. Please make sure the file exists and the path is correct.
Error converting columns: 'Age'

Value counts for 'Work Pressure' before cleaning:
Error getting value counts for 'Work Pressure': 'Work Pressure'

Value counts for 'Job Satisfaction' before cleaning:
Error getting value counts for 'Job Satisfaction': 'Job Satisfaction'
Error cleaning categorical columns: 'Gender'
Error'Sleep Duration' and 'Dietary Habits': 'Sleep Duration'

Unique values of 'Sleep Duration' after cleaning:
Error printing unique values: 'Sleep Duration'

Missing values per column:
Series([], dtype: float64)

First 5 rows of the cleaned data:
Empty DataFrame
Columns: []
Index: []

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 0 entries
Empty DataFrame
None
