# Initial Data Prep

Before any analysis, this notebook checks what processing the data types require and splits out the data for training.

In [5]:
import sys

sys.path.append("../..")

import pandas as pd

from src.params.project_params import DATA_PATHS

mh_raw = pd.read_csv(DATA_PATHS["mh_raw"])
mh_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 413768 entries, 0 to 413767
Data columns (total 16 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   Name                          413768 non-null  object 
 1   Age                           413768 non-null  int64  
 2   Marital Status                413768 non-null  object 
 3   Education Level               413768 non-null  object 
 4   Number of Children            413768 non-null  int64  
 5   Smoking Status                413768 non-null  object 
 6   Physical Activity Level       413768 non-null  object 
 7   Employment Status             413768 non-null  object 
 8   Income                        413768 non-null  float64
 9   Alcohol Consumption           413768 non-null  object 
 10  Dietary Habits                413768 non-null  object 
 11  Sleep Patterns                413768 non-null  object 
 12  History of Mental Illness     413768 non-nul

In [6]:
mh_raw.head()

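|    | Name | Age | Marital Status | Education Level | Number of Children | Smoking Status | Physical Activity Level | Employment Status | Income | Alcohol Consumption | Dietary Habits | Sleep Patterns | History of Mental Illness | History of Substance Abuse | Family History of Depression | Chronic Medical Conditions |
|----|------|-----|----------------|-----------------|--------------------|----------------|-------------------------|-------------------|--------|---------------------|----------------|----------------|---------------------------|----------------------------|------------------------------|----------------------------|
| 0  | Christine Barker | 31 | Married | Bachelor's Degree | 2 | Non-smoker | Active | Unemployed | 26265.67 | Moderate | Moderate | Fair | Yes | No | Yes | Yes |
| 1  | Jacqueline Lewis | 55 | Married | High School | 1 | Non-smoker | Sedentary | Employed | 42710.36 | High | Unhealthy | Fair | Yes | No | No | Yes |
| 2  | Shannon Church | 78 | Widowed | Master's Degree | 1 | Non-smoker | Sedentary | Employed | 125332.79 | Low | Unhealthy | Good | No | No | Yes | No |
| 3  | Charles Jordan | 58 | Divorced | Master's Degree | 3 | Non-smoker | Moderate | Unemployed | 9992.78 | Moderate | Moderate | Poor | No | No | No | No |
| 4  | Michael Rich | 18 | Single | High School | 0 | Non-smoker | Sedentary | Unemployed | 8595.08 | Low | Moderate | Fair | Yes | No | Yes | Yes |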


Upon initial inspection, all columns are non-null and most are categorical, including several binary columns represented as yes/no.

There is no ID or obvious ordering of the rows, so the data can be split by random subsampling.
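A random split of this kind could look something like the sketch below. It runs on a small toy frame rather than `mh_raw`, and the 80/20 ratio, the seed, and the names `toy`, `train` and `holdout` are all assumptions for illustration, not choices made in this notebook.

```python
import pandas as pd

# Toy stand-in for mh_raw (hypothetical values, ten rows).
toy = pd.DataFrame(
    {"Age": range(18, 28), "Income": [1000.0 * i for i in range(10)]}
)

TRAIN_FRACTION = 0.8  # assumed split ratio
SEED = 42  # assumed seed, fixed so the split is reproducible

# Draw a random subsample of rows for training.
train = toy.sample(frac=TRAIN_FRACTION, random_state=SEED)

# Every row not drawn into the training sample becomes the hold-out set.
holdout = toy.drop(index=train.index)

print(len(train), len(holdout))
```

Because the rows have no natural order, sampling by index like this should not leak any structure between the two sets; a stratified split would only be worth considering if a target column turned out to be heavily imbalanced.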

Proposed processing of the data:

|    | Column                       | Assessment                          |
|----|------------------------------|-------------------------------------|
| 0  | Name                         | No predictive value                 |
| 1  | Age                          | Leave as is                         |
| 2  | Marital Status               | OHE or MRE, review below            |
| 3  | Education Level              | Possibly ordinal, review below      |
| 4  | Number of Children           | Leave as is                         |
| 5  | Smoking Status               | Appears binary                      |
| 6  | Physical Activity Level      | Possibly ordinal, review below      |
| 7  | Employment Status            | Appears binary                      |
| 8  | Income                       | Review distribution after splitting |
| 9  | Alcohol Consumption          | Possibly ordinal, review below      |
| 10 | Dietary Habits               | Possibly ordinal, review below      |
| 11 | Sleep Patterns               | Possibly ordinal, review below      |
| 12 | History of Mental Illness    | Appears binary                      |
| 13 | History of Substance Abuse   | Appears binary                      |
| 14 | Family History of Depression | Appears binary                      |
| 15 | Chronic Medical Conditions   | Appears binary                      |

## 'Binary' Fields

In [7]:
binary_fields = [
    "Smoking Status",
    "Employment Status",
    "History of Mental Illness",
    "History of Substance Abuse",
    "Family History of Depression",
    "Chronic Medical Conditions",
]

for field in binary_fields:
    print(f"{field}:\n{mh_raw[field].unique()}")

Smoking Status:
['Non-smoker' 'Former' 'Current']
Employment Status:
['Unemployed' 'Employed']
History of Mental Illness:
['Yes' 'No']
History of Substance Abuse:
['No' 'Yes']
Family History of Depression:
['Yes' 'No']
Chronic Medical Conditions:
['Yes' 'No']


As shown above, smoking status is not binary, though the others are. The four yes/no fields can be mapped to binary immediately; employment status will be handled by making a binary 'employed' field.
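The mapping described here might be sketched as follows. The toy frame stands in for `mh_raw`, and the new column name `Employed` and the 1/0 convention (1 for "Yes" or "Employed") are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for mh_raw with the five genuinely binary columns.
toy = pd.DataFrame(
    {
        "Employment Status": ["Unemployed", "Employed", "Employed"],
        "History of Mental Illness": ["Yes", "No", "Yes"],
        "History of Substance Abuse": ["No", "No", "Yes"],
        "Family History of Depression": ["Yes", "No", "No"],
        "Chronic Medical Conditions": ["Yes", "Yes", "No"],
    }
)

yes_no_fields = [
    "History of Mental Illness",
    "History of Substance Abuse",
    "Family History of Depression",
    "Chronic Medical Conditions",
]

encoded = toy.copy()

# Map Yes/No to 1/0 (assumed convention).
for field in yes_no_fields:
    encoded[field] = encoded[field].map({"Yes": 1, "No": 0})

# Replace Employment Status with a binary flag (hypothetical column name).
encoded["Employed"] = (encoded["Employment Status"] == "Employed").astype(int)
encoded = encoded.drop(columns="Employment Status")

print(encoded)
```

Note that `Series.map` with a dictionary turns any value missing from the dictionary into `NaN`, so an unexpected label would show up as a null rather than being silently miscoded.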

## 'Ordinal' Fields

Ordinal encoding is most useful when the categories have an apparent order, as most predictive models can use this ordering. It's less useful for fields with no obvious order, where OHE/MRE is more suitable.
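For a field with no obvious order, one-hot encoding gives each category its own 0/1 column instead of forcing a ranking. The sketch below uses the marital status values seen in the first five rows; the variable names are illustrative, and nothing here has been decided for the real data yet.

```python
import pandas as pd

# Marital status values taken from the head() output above.
marital = pd.Series(
    ["Married", "Widowed", "Divorced", "Single"], name="Marital Status"
)

# One indicator column per category; no ordering is implied between them.
one_hot = pd.get_dummies(marital, prefix="Marital Status", dtype=int)

print(one_hot)
```

Each row has exactly one indicator set, so the encoding adds one column per category, which is the main cost compared with a single ordinal column.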

In [8]:
ordinal_fields = [
    "Education Level",
    "Physical Activity Level",
    "Alcohol Consumption",
    "Dietary Habits",
    "Sleep Patterns",
]

for field in ordinal_fields:
    print(f"{field}:\n{mh_raw[field].unique()}")

Education Level:
["Bachelor's Degree" 'High School' "Master's Degree" 'Associate Degree'
 'PhD']


Physical Activity Level:
['Active' 'Sedentary' 'Moderate']
Alcohol Consumption:
['Moderate' 'High' 'Low']
Dietary Habits:
['Moderate' 'Unhealthy' 'Healthy']
Sleep Patterns:
['Fair' 'Good' 'Poor']


All of these fields can be ordinally encoded, which may be more useful than OHE/MRE, but this will need investigating.
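One way the ordinal encoding could be done is with an ordered `pd.Categorical` per field, as sketched below on a toy frame. The category orders in `category_orders` are assumptions based on the usual reading of the labels (for example, "Poor" < "Fair" < "Good"); they are not confirmed by the data and should be reviewed before use.

```python
import pandas as pd

# Assumed low-to-high order for each field (hypothetical, to be reviewed).
category_orders = {
    "Education Level": [
        "High School",
        "Associate Degree",
        "Bachelor's Degree",
        "Master's Degree",
        "PhD",
    ],
    "Physical Activity Level": ["Sedentary", "Moderate", "Active"],
    "Alcohol Consumption": ["Low", "Moderate", "High"],
    "Dietary Habits": ["Unhealthy", "Moderate", "Healthy"],
    "Sleep Patterns": ["Poor", "Fair", "Good"],
}

# Toy stand-in for mh_raw.
toy = pd.DataFrame(
    {
        "Education Level": ["Bachelor's Degree", "High School", "PhD"],
        "Physical Activity Level": ["Active", "Sedentary", "Moderate"],
        "Alcohol Consumption": ["Moderate", "High", "Low"],
        "Dietary Habits": ["Moderate", "Unhealthy", "Healthy"],
        "Sleep Patterns": ["Fair", "Good", "Poor"],
    }
)

encoded = toy.copy()

# Replace each label with its position in the assumed order (0 = lowest).
for field, order in category_orders.items():
    encoded[field] = pd.Categorical(
        toy[field], categories=order, ordered=True
    ).codes

print(encoded)
```

Any label that is missing from the listed categories is encoded as -1, which makes unexpected values easy to spot after encoding.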