
# Day 5 - Part 2 Practice

## 🧠 Level 1 - Basic ML

### What is the target variable?
The target variable is **LeaveOrNot**.
It represents whether an employee leaves the company (1) or stays (0).

### What are the feature columns?
All columns except LeaveOrNot.
These include:
- Age
- Gender
- City (encoded)
- Education (encoded)
- PaymentTier
- ExperienceInCurrentDomain
- EverBenched
- JoiningYear

### What is train-test split and why is it needed?
Train-test split divides the dataset into:
- Training data (used to train the model)
- Testing data (used to evaluate the model)

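Evaluating on held-out data gives an honest estimate of how well the model generalizes to unseen data. The split does not prevent overfitting by itself, but a large gap between training and test scores is a sign of it.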
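
To make the mechanics concrete, here is a minimal pure-Python sketch of an 80/20 split. The helper `simple_split` is a hypothetical name for illustration only; this notebook uses scikit-learn's `train_test_split`, whose shuffling differs in detail, so only the resulting sizes are meant to match.

```python
import math
import random


def simple_split(rows, test_fraction=0.2, seed=42):
    """Shuffle row indices and hold out a fraction of them for testing.

    Hypothetical helper for illustration; not scikit-learn's implementation.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)           # reproducible shuffle
    n_test = math.ceil(len(rows) * test_fraction)  # size of the held-out set
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


rows = list(range(2764))  # stand-in for the 2764 employee records
train, test = simple_split(rows)
print(len(train), len(test))  # prints: 2211 553
```

The two parts never share a row, which is what makes the test score a fair check on unseen data.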


In [14]:
# Load Dataset & Prepare Data

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Load dataset
df = pd.read_csv("/content/sample_data/cleaned_employee_attrition.csv")
df.columns = df.columns.str.lower()

df.head()


Unnamed: 0,education,joiningyear,city,paymenttier,age,gender,everbenched,experienceincurrentdomain,leaveornot
0,Bachelors,2017,Bangalore,3,34,Male,No,0,0
1,Bachelors,2013,Pune,1,28,Female,No,3,1
2,Bachelors,2014,New Delhi,3,38,Female,No,2,0
3,Masters,2016,Bangalore,3,27,Male,No,5,1
4,Masters,2017,Pune,3,24,Male,Yes,2,1


In [15]:
# Encode Categorical Variables

df = pd.get_dummies(df, columns=["city", "education"], drop_first=True)

df["gender"] = df["gender"].map({"Male":1, "Female": 0})
df["everbenched"] = df["everbenched"].map({"Yes":1, "No":0})

df.head()


Unnamed: 0,joiningyear,paymenttier,age,gender,everbenched,experienceincurrentdomain,leaveornot,city_New Delhi,city_Pune,education_Masters,education_PHD
0,2017,3,34,1,0,0,0,False,False,False,False
1,2013,1,28,0,0,3,1,False,True,False,False
2,2014,3,38,0,0,2,0,True,False,False,False
3,2016,3,27,1,0,5,1,False,False,True,False
4,2017,3,24,1,1,2,1,False,True,True,False


In [16]:
# Define Features and target

X = df.drop("leaveornot", axis=1)
y = df["leaveornot"]

print("Features Shape:", X.shape)
print("Target shape", y.shape)

Features Shape: (2764, 10)
Target shape (2764,)


In [17]:
# Train-Test Split (80-20)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Train samples:", X_train.shape[0])
print("Testing samples:", X_test.shape[0])

Train samples: 2211
Testing samples: 553


In [18]:
# Model Training

# Train Logistic Regression
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

In [19]:
# Predict on test Data
predictions = model.predict(X_test)

In [20]:
# Calculate Accuracy
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)

Accuracy: 0.6726943942133815


In [21]:
# Confusion Matrix
cm = confusion_matrix(y_test, predictions)
print("Confusion Matrix:\n", cm)

Confusion Matrix:
 [[282  51]
 [130  90]]


In [22]:
# Classification Report

print("Classification Report:\n")
print(classification_report(y_test, predictions))

Classification Report:

              precision    recall  f1-score   support

           0       0.68      0.85      0.76       333
           1       0.64      0.41      0.50       220

    accuracy                           0.67       553
   macro avg       0.66      0.63      0.63       553
weighted avg       0.67      0.67      0.65       553



# Level 3 - Data Scientist thinking

## Is accuracy good enough to judge an attrition model? Why?

No, not on its own.
In this test set 333 of 553 employees stay, so a model predicting "Stay" for everyone
would already reach about 60% accuracy, not far below the 67% achieved here.
We must also check:
- Precision
- Recall
- F1-score
- Confusion Matrix
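
As a quick check, the metrics listed above can be recomputed by hand from the confusion matrix printed earlier in this notebook. This is a minimal sketch that assumes scikit-learn's layout (rows are actual classes, columns are predicted classes) and treats class 1 (leave) as the positive class.

```python
# Confusion matrix printed earlier in this notebook:
# rows are actual classes (0 = stay, 1 = leave), columns are predicted classes.
cm = [[282, 51],
      [130, 90]]

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted leavers, how many really left
recall = tp / (tp + fn)      # of actual leavers, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.67 precision=0.64 recall=0.41 f1=0.50
```

The recall of 0.41 means the model misses 130 of the 220 employees who actually left, a weakness that the 0.67 accuracy hides.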

## Which features influence leaving the most?

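Check the logistic regression coefficients (`model.coef_`).
Features with larger absolute coefficients influence the prediction more, but the comparison is only fair when the features are on similar scales: `joiningyear` takes values around 2015 while `gender` is 0 or 1, so standardize the features (or scale each coefficient by its feature's spread) before ranking them.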
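
One possible way to compare coefficients across differently scaled features is to multiply each coefficient by the standard deviation of its feature. The sketch below illustrates that idea in pure Python; `rank_by_scaled_effect` is a hypothetical helper, and the values passed to it are toy numbers, not coefficients from the model trained in this notebook.

```python
import statistics


def rank_by_scaled_effect(columns, coefficients):
    """Rank features by |coefficient * standard deviation|.

    columns: dict mapping feature name -> list of values
    coefficients: dict mapping feature name -> fitted coefficient
    Hypothetical helper for illustration only.
    """
    effects = {
        name: abs(coefficients[name] * statistics.pstdev(values))
        for name, values in columns.items()
    }
    return sorted(effects.items(), key=lambda item: item[1], reverse=True)


# Toy values only, NOT taken from the trained model in this notebook.
toy_columns = {
    "joiningyear": [2012, 2014, 2016, 2018],
    "gender": [0, 1, 0, 1],
}
toy_coefficients = {"joiningyear": 0.2, "gender": -0.6}

ranked = rank_by_scaled_effect(toy_columns, toy_coefficients)
for name, effect in ranked:
    print(name, round(effect, 3))
```

In this toy case `joiningyear` ranks first even though its raw coefficient is smaller, because its values vary more. With the notebook's objects, the coefficients could be paired with feature names via `X_train.columns` and `model.coef_[0]`.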

## What happens if the model predicts everyone stays?

On this test set accuracy would still be about 60% (333 of 553 employees stay),
but recall for employees who leave would be 0: the model would never flag a single leaver.
That makes it useless for HR.
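
The "everyone stays" baseline can be worked out directly from the test-set class counts in the classification report above (333 stayed, 220 left). A minimal sketch:

```python
# Test-set class counts from the classification report above:
# 333 employees stayed (0) and 220 left (1).
y_true = [0] * 333 + [1] * 220
y_pred = [0] * len(y_true)  # baseline: predict "stay" for everyone

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_leavers = sum(t == 1 for t in y_true)
recall_leavers = true_positives / actual_leavers

print(f"baseline accuracy={accuracy:.3f}, recall for leavers={recall_leavers:.1f}")
# prints: baseline accuracy=0.602, recall for leavers=0.0
```

Any attrition model should be judged against this baseline rather than against 0% accuracy.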

## How would HR use this model in real life?

HR can:
- Identify high-risk employees
- Offer retention programs
- Improve engagement strategies
- Reduce hiring costs
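
As an illustration of the first point, a shortlist of high-risk employees could be built by thresholding predicted leave probabilities. The helper `flag_high_risk`, the employee IDs, and the probabilities below are all hypothetical; in this notebook the probabilities could come from `model.predict_proba(X_test)[:, 1]`.

```python
def flag_high_risk(employee_ids, leave_probabilities, threshold=0.5):
    """Return (id, probability) pairs at or above the threshold, riskiest first.

    Hypothetical helper; the 0.5 default is an assumption, not a recommendation.
    """
    flagged = [
        (emp_id, prob)
        for emp_id, prob in zip(employee_ids, leave_probabilities)
        if prob >= threshold
    ]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Made-up IDs and probabilities, for illustration only.
ids = ["E01", "E02", "E03", "E04"]
probs = [0.15, 0.72, 0.48, 0.91]

shortlist = flag_high_risk(ids, probs, threshold=0.4)
print(shortlist)  # prints: [('E04', 0.91), ('E02', 0.72), ('E03', 0.48)]
```

Lowering the threshold flags more employees, which tends to raise recall for leavers at the cost of precision. That trade-off may matter here, since the model's recall for leavers is only 0.41.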


In [23]:
# BONUS TASK - Training Function
from sklearn.metrics import accuracy_score

def train_and_evaluate(model, X_train, X_test, y_train, y_test):
    """Fit the model on the training data and return its test-set accuracy."""
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return accuracy_score(y_test, preds)