<a href="https://colab.research.google.com/github/Jrk373/MachineLearningDemo/blob/main/HealthcareAttritionModel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Create a Narrow AI using a Decision Tree Classifier
John Ryan Kivela  
The Narbha Institute  
January 2025


# Introduction

This is a walkthrough demonstration of fundamental Machine Learning concepts and techniques used in developing Artificial Intelligence (AI).

The activity will create a Narrow Artificial Intelligence, called Rayne 1.0, who will tell us whether a healthcare employee is likely to leave their job (attrition) based on employee and workplace indicators.

Narrow AI, also known as Weak AI, is a type of artificial intelligence designed to perform a specific task or solve a particular problem with high efficiency. Unlike General AI, Narrow AI is limited in scope and cannot adapt to tasks outside its predefined domain.

**Meet Rayne!**

They deliver attrition predictions, with attitude.

 <img src="https://github.com/Jrk373/MachineLearningDemo/blob/main/Kid.jpg?raw=true" alt="Rayne 1.0" width="250" height="400" />

## Intended Audience

- The intended audience is a discerning group of professionals with strong data literacy and advanced education in mathematics, including descriptive statistics (mean, median, mode, standard deviation, etc.), the Central Limit Theorem, and basic linear algebra.

- It is not necessary to understand computer languages for this activity. This notebook primarily speaks in Python, but it is programmed to run all code and calculations out-of-the-box.

## Materials

- This notebook is open source. All of these materials are located in a public GitHub repository.

  https://github.com/Jrk373/MachineLearningDemo

- The goal is for the instructor to walk through the Notebook with the audience watching and discussing. Then the instructor and the class go through the notebook together. The learner can then go on to use the Notebook on their own.

## CRISP-DM

CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It's a popular method used to guide data mining and data science projects. The process is divided into six main phases:

- **Business Understanding:** Understand the project's goals and requirements from a business perspective.
- **Data Understanding:** Collect and analyze the data to understand its characteristics.
- **Data Preparation:** Clean and prepare the data for analysis.
- **Modeling:** Apply different modeling techniques to the prepared data.
- **Evaluation:** Assess the models to ensure they meet the business objectives.
- **Deployment:** Implement the model in the real-world environment and monitor its performance.

## Object Oriented Programming

## Python

## Jupyter Notebooks

The notebook relies heavily on foundations from the ODSC West 2024 AI Bootcamp. It is referenced specifically throughout, but can also be acknowledged broadly as the inspiration for this notebook. This notebook also partners with AI as a generator of code and content.

## Enjoy!

# Stage 1: Business Understanding





The Business Understanding phase of CRISP-DM focuses on defining the project’s goals and objectives from a business perspective. This stage ensures that the data science work aligns with the organization’s needs and delivers value.

# Stage 2: Data Understanding

The Data Understanding phase of the CRISP-DM (Cross-Industry Standard Process for Data Mining) framework focuses on exploring and analyzing the available data to ensure it is suitable for the project's goals.

It involves the following steps:

- **Data Collection:** Gather initial data from relevant sources.

- **Data Description:** Summarize key attributes, including data types, formats, and basic statistics (e.g., means, counts, ranges).

- **Data Exploration:** Use visualizations and analyses to identify patterns, trends, or potential relationships in the data.

- **Data Quality Assessment:** Check for issues such as missing values, outliers, inconsistencies, or inaccuracies.

The objective is to develop insights into the data, identify challenges, and determine whether it can support the project's objectives effectively.

## About the Data Set

Attrition of nurses in the US Healthcare system is at an all-time high. It is a major area of focus, especially for hospitals.

This dataset contains employee and company data useful for supervised ML, unsupervised ML, and analytics. Attrition - whether an employee left or not - is included and can be used as the target variable.

The data is synthetic and based on the IBM Watson dataset for attrition. Employee roles and departments were changed to reflect the healthcare domain. Also, known outcomes for some employees were changed to help increase the performance of ML models.

https://www.kaggle.com/datasets/jpmiller/employee-attrition-for-healthcare
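Since Attrition is the target variable, it can be useful to look at how its two classes are balanced before modeling. The sketch below shows one way to do that with pandas. The frame `toy` and its values are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset (values are made up)
toy = pd.DataFrame({"Attrition": ["No", "No", "Yes", "No", "No", "Yes", "No", "No"]})

# Number of rows in each class
class_counts = toy["Attrition"].value_counts()

# Share of all rows in each class (fractions that sum to 1)
class_share = toy["Attrition"].value_counts(normalize=True)

print(class_counts)
print(class_share.round(2))
```

The same two calls could be applied to `df["Attrition"]` once the dataset has been loaded.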

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
from sklearn.impute import SimpleImputer


In [None]:
import urllib.request
import pandas as pd

# Corrected URL with raw content
url = 'https://raw.githubusercontent.com/Jrk373/MachineLearningDemo/main/watson_healthcare_modified.csv'
file_path = 'watson_healthcare_modified.csv'

# Download the file
urllib.request.urlretrieve(url, file_path)

# Load the dataset
try:
    df = pd.read_csv(file_path)
    print('Successfully downloaded', file_path)
    print('Data successfully loaded as data frame "df"')
except pd.errors.ParserError as e:
    print("ParserError encountered:", e)
    print("Attempting to load with alternative options...")
    # error_bad_lines was removed in pandas 2.0; on_bad_lines='skip' (pandas 1.3+) replaces it
    df = pd.read_csv(file_path, delimiter=',', on_bad_lines='skip')
    print("Data loaded with error handling.")


Successfully downloaded watson_healthcare_modified.csv
Data successfully loaded as data frame "df"


## Data Dictionary for Watson Healthcare Modified Dataset

| Column Name                 | Data Type | Description                                                                 |
|-----------------------------|-----------|-----------------------------------------------------------------------------|
| **EmployeeID**              | Integer   | Unique identifier for each employee.                                        |
| **Age**                     | Integer   | Age of the employee.                                                        |
| **Attrition**               | String    | Whether the employee has left the company ("Yes" or "No").                  |
| **BusinessTravel**          | String    | Frequency of business travel ("Travel_Rarely", "Travel_Frequently", etc.).  |
| **DailyRate**               | Integer   | Daily rate of the employee's salary.                                        |
| **Department**              | String    | Department where the employee works (e.g., "Cardiology", "Maternity").      |
| **DistanceFromHome**        | Integer   | Distance from home to the workplace (in miles or another unit).             |
| **Education**               | Integer   | Education level (e.g., 1 = High School, 2 = Bachelor's, etc.).              |
| **EducationField**          | String    | Field of education (e.g., "Life Sciences", "Medical").                      |
| **EmployeeCount**           | Integer   | Always 1 for each employee (likely a placeholder).                          |
| **EnvironmentSatisfaction** | Integer   | Satisfaction with the work environment (1 to 4).                            |
| **Gender**                  | String    | Gender of the employee ("Male", "Female").                                  |
| **HourlyRate**              | Integer   | Hourly rate of the employee's salary.                                       |
| **JobInvolvement**          | Integer   | Level of job involvement (1 to 4).                                          |
| **JobLevel**                | Integer   | Job level (e.g., 1 = Entry level).                                          |
| **JobRole**                 | String    | Job role (e.g., "Manager", "Technician").                                   |
| **JobSatisfaction**         | Integer   | Job satisfaction level (1 to 4).                                            |
| **MaritalStatus**           | String    | Marital status (e.g., "Single", "Married").                                 |
| **MonthlyIncome**           | Integer   | Monthly income of the employee.                                             |
| **MonthlyRate**             | Integer   | Monthly rate of the employee's salary.                                      |
| **NumCompaniesWorked**      | Integer   | Number of companies the employee has worked at previously.                  |
| **Over18**                  | String    | Whether the employee is over 18 ("Yes").                                    |
| **OverTime**                | String    | Whether the employee works overtime ("Yes" or "No").                        |
| **PercentSalaryHike**       | Integer   | Percentage of salary hike during the last appraisal.                        |
| **PerformanceRating**       | Integer   | Performance rating (1 to 4).                                                |
| **RelationshipSatisfaction**| Integer   | Satisfaction with relationships at work (1 to 4).                           |
| **StandardHours**           | Integer   | Standard working hours (typically 80).                                      |
| **Shift**                   | Integer   | Shift timing (e.g., 0 = Morning, 1 = Night).                                |
| **TotalWorkingYears**       | Integer   | Total working years of the employee.                                        |
| **TrainingTimesLastYear**   | Integer   | Number of training sessions attended last year.                             |
| **WorkLifeBalance**         | Integer   | Work-life balance satisfaction (1 to 4).                                    |
| **YearsAtCompany**          | Integer   | Number of years the employee has been at the company.                       |
| **YearsInCurrentRole**      | Integer   | Number of years in the current role.                                        |
| **YearsSinceLastPromotion** | Integer   | Number of years since the last promotion.                                   |
| **YearsWithCurrManager**    | Integer   | Number of years working with the current manager.                           |


---

## Sources

- Dataset: [Employee Attrition for Healthcare](https://www.kaggle.com/datasets/jpmiller/employee-attrition-for-healthcare)
- Data Dictionary: Compiled from the dataset's description and metadata.

*Note:* Columns like `EmployeeCount` and `StandardHours` have constant values across all records and may not provide useful information for analysis.
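A quick way to confirm which columns are constant is to count the distinct values in each one. This is a minimal sketch on a made-up frame. Only the column names mirror the dataset, and `toy` and `constant_columns` are names chosen for the example.

```python
import pandas as pd

# Hypothetical toy frame; only the column names mirror the real dataset
toy = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "EmployeeCount": [1, 1, 1, 1],
    "StandardHours": [80, 80, 80, 80],
})

# A column with a single distinct value carries no information for a model
constant_columns = [col for col in toy.columns
                    if toy[col].nunique(dropna=False) == 1]

print("Constant columns:", constant_columns)
```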


## Data Shape

Assessing the shape of data helps identify its dimensionality (rows and columns), which is crucial for understanding its structure and determining suitable analysis techniques. It ensures the dataset is in the expected format, enabling error detection and proper preprocessing. Additionally, knowing the data shape aids in resource optimization and selecting the right tools for analysis.

In [None]:
# Import necessary packages
import pandas as pd

# Check the shape (rows, columns)
print('Data set rows and columns:', df.shape)

Data set rows and columns: (1676, 35)


In [None]:
# Import necessary packages
import pandas as pd

# Print off the first 5 rows
print(df.head(5))

   EmployeeID  Age Attrition     BusinessTravel  DailyRate  Department  \
0     1313919   41        No      Travel_Rarely       1102  Cardiology   
1     1200302   49        No  Travel_Frequently        279   Maternity   
2     1060315   37       Yes      Travel_Rarely       1373   Maternity   
3     1272912   33        No  Travel_Frequently       1392   Maternity   
4     1414939   27        No      Travel_Rarely        591   Maternity   

   DistanceFromHome  Education EducationField  EmployeeCount  ...  \
0                 1          2  Life Sciences              1  ...   
1                 8          1  Life Sciences              1  ...   
2                 2          2          Other              1  ...   
3                 3          4  Life Sciences              1  ...   
4                 2          1        Medical              1  ...   

   RelationshipSatisfaction StandardHours  Shift  TotalWorkingYears  \
0                         1            80      0                  8  

In [None]:
# Import necessary packages
import pandas as pd

# Let's get more basic information on columns, datatypes etc. using .info()
print('Feature Information:')
df.info()  # .info() prints its report directly and returns None, so it is not wrapped in print()

Feature Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1676 entries, 0 to 1675
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   EmployeeID                1676 non-null   int64 
 1   Age                       1676 non-null   int64 
 2   Attrition                 1676 non-null   object
 3   BusinessTravel            1676 non-null   object
 4   DailyRate                 1676 non-null   int64 
 5   Department                1676 non-null   object
 6   DistanceFromHome          1676 non-null   int64 
 7   Education                 1676 non-null   int64 
 8   EducationField            1676 non-null   object
 9   EmployeeCount             1676 non-null   int64 
 10  EnvironmentSatisfaction   1676 non-null   int64 
 11  Gender                    1676 non-null   object
 12  HourlyRate                1676 non-null   int64 
 13  JobInvolvement            1676 non-null   int64 
 14  Job

# Stage 3: Data Preparation

## Data Wrangling

### Drop unnecessary columns

In [None]:
# Columns to drop
columns_to_drop = ['EmployeeID',
                   'StandardHours',
                   'Over18',
                   'MonthlyRate',
                   'EmployeeCount']

# Drop those Columns like they're hot
df = df.drop(columns = columns_to_drop)

### Missing Values

In [None]:
# Deal with NA values
## Identify Variable with NaN values
def find_columns_with_nan(df):
    columns_with_nan = [col for col in df.columns if df[col].isna().any()]
    return columns_with_nan

### Identify Variable with NaN values
columns_with_nan = find_columns_with_nan(df)

if columns_with_nan:
    print("Columns with NaN values:", columns_with_nan)
else:
    print("There are no NaN values in the dataset.")

There are no NaN values in the dataset.


In [None]:
# Impute values for NA with numbers
## Columns to impute
columns_to_impute = []

## make a function
def impute_selected_columns(df, columns_to_impute):
    # Nothing to impute: SimpleImputer raises an error when given zero columns
    if not columns_to_impute:
        return df

    # Use SimpleImputer with strategy='mean'
    imputer = SimpleImputer(strategy='mean')

    # Select columns to impute
    df_to_impute = df[columns_to_impute]

    # Impute NaN values in selected columns, keeping the original row index
    df_imputed = pd.DataFrame(imputer.fit_transform(df_to_impute),
                              columns = columns_to_impute,
                              index = df.index)

    # Update original DataFrame with imputed values
    df[columns_to_impute] = df_imputed

    # return the data frame
    return df

## Impute missing values for selected columns
df = impute_selected_columns(df,
                             columns_to_impute)
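This dataset has no missing values, so the imputation step has nothing to change here. To show what mean imputation does when values are missing, here is a small sketch on made-up data. The frame `toy` and its numbers are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing Age value (not from the real dataset)
toy = pd.DataFrame({"Age": [30.0, np.nan, 50.0],
                    "DailyRate": [100.0, 200.0, 300.0]})

# Mean imputation replaces each NaN with the mean of the observed values
# in the same column
imputer = SimpleImputer(strategy="mean")
toy[["Age"]] = imputer.fit_transform(toy[["Age"]])

print(toy)
```

The missing Age is filled with the mean of the two observed ages. Other strategies, such as `strategy="median"`, may suit skewed columns better.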

### Check data for matching data types

## Data Transformation

### One Hot Encoding

In [None]:
# Identify categorical variables
categorical_columns = df.select_dtypes(include=['object', 'category']).columns
print(categorical_columns)


Index(['Attrition', 'BusinessTravel', 'Department', 'EducationField', 'Gender',
       'JobRole', 'MaritalStatus', 'OverTime'],
      dtype='object')


In [None]:
## One-hot encoding
### List of variables to encode
variables_to_encode = ['BusinessTravel',
                       'Department',
                       'EducationField',
                       'Gender',
                       'JobRole',
                       'MaritalStatus',
                       'OverTime']

### Perform one-hot encoding for the variables in variables_to_encode
df_hot = pd.get_dummies(df, columns=variables_to_encode)
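To make the effect of one-hot encoding concrete, the sketch below applies `pd.get_dummies` to a tiny made-up frame. The values are illustrative, and `toy` and `toy_hot` are names chosen for the example.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only
toy = pd.DataFrame({"Age": [41, 49, 37],
                    "OverTime": ["Yes", "No", "Yes"]})

# Each category becomes its own indicator column, named <column>_<value>
toy_hot = pd.get_dummies(toy, columns=["OverTime"])

print(list(toy_hot.columns))
print(toy_hot)
```

Keeping every indicator column is usually fine for a tree model. For linear models, `drop_first=True` is a common option to avoid redundant columns.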

In [None]:
df_hot.dtypes

Age                         int64
Attrition                  object
DailyRate                   int64
DistanceFromHome            int64
Education                   int64
EnvironmentSatisfaction     int64
HourlyRate                  int64
JobInvolvement              int64
JobLevel                    int64
JobSatisfaction             int64


# Stage 4: Modeling

## Define Variables

In [None]:
## Define Variables
TargetVariable = 'Attrition'

X = df_hot.drop(columns = TargetVariable)
y = df_hot[TargetVariable]

## Train Test Split

In [None]:
## split the preprocessed data into training and validation
train_X, valid_X, train_y, valid_y = train_test_split(X,
                                                      y,
                                                      test_size = 0.3,
                                                      train_size = 0.7,
                                                      random_state = 373
                                                     )
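When one class of the target is much rarer than the other, a purely random split can leave the training and validation sets with different class proportions. One option, which this notebook does not use, is the `stratify` argument of `train_test_split`. The sketch below uses made-up data, and all `toy_` names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 25% "Yes" (values are made up)
toy_X = pd.DataFrame({"Age": range(20, 40)})
toy_y = pd.Series(["Yes"] * 5 + ["No"] * 15, name="Attrition")

# stratify=toy_y keeps the Yes/No proportions similar in both splits
toy_train_X, toy_valid_X, toy_train_y, toy_valid_y = train_test_split(
    toy_X, toy_y, test_size=0.4, random_state=373, stratify=toy_y
)

print(toy_train_y.value_counts(normalize=True))
print(toy_valid_y.value_counts(normalize=True))
```

Both splits keep roughly the same share of "Yes" rows as the full toy data.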

## Create a Decision Tree Classifier Model

In [None]:
# Create a tree model with defaults
clf = DecisionTreeClassifier(criterion="gini",
                               #splitter="best",
                               max_depth=None,
                               min_samples_split=2,
                               #min_samples_leaf=1,
                               #min_weight_fraction_leaf=0.0,
                               #max_features=None,
                               random_state=373,
                               #max_leaf_nodes=None,
                               min_impurity_decrease=0.0,
                               #class_weight=None,
                               #ccp_alpha=0.0
                            )

## Fit the Model

In [21]:
# Fit (train) the model
clf.fit(X = train_X,
        y = train_y,
        #sample_weight=None,
        #check_input=True
       )

## Evaluate the Model

### Cross Validation

In [22]:
# Set some names
FeatureNames = list(valid_X.columns)
ClassNames = list(clf.classes_)

In [23]:
# Cross-validation on the full dataset (X, y)
# cross_val_score refits a fresh copy of clf on each of the 5 folds
cv_scores = cross_val_score(clf, X, y, cv=5)

print("Cross-validation scores on full dataset:", cv_scores)
print("Mean CV accuracy on full dataset:", cv_scores.mean())
# CV accuracy = estimate of how well the model generalizes to new data.

Cross-validation scores on full dataset: [0.87797619 0.85970149 0.87761194 0.87164179 0.89552239]
Mean CV accuracy on full dataset: 0.8764907604832979
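Cross-validation can be complemented by scoring the fitted model on the held-out validation split. The sketch below shows one way to do that with `accuracy_score` and `confusion_matrix`. It runs on synthetic data from `make_classification` so that it is self-contained. All `demo_` names are hypothetical, and the numbers it prints say nothing about the attrition model.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical; not the attrition dataset)
demo_X, demo_y = make_classification(n_samples=300, n_features=8, random_state=373)

demo_train_X, demo_valid_X, demo_train_y, demo_valid_y = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=373
)

# Fit on the training split only, then predict rows the model has not seen
demo_clf = DecisionTreeClassifier(random_state=373)
demo_clf.fit(demo_train_X, demo_train_y)
demo_pred = demo_clf.predict(demo_valid_X)

demo_accuracy = accuracy_score(demo_valid_y, demo_pred)
demo_cm = confusion_matrix(demo_valid_y, demo_pred)

print("Validation accuracy:", round(demo_accuracy, 3))
print("Confusion matrix (rows = actual, columns = predicted):")
print(demo_cm)
```

In this notebook the same pattern would use `clf.predict(valid_X)` and compare the result with `valid_y`.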


### Feature Importance