
# Day 5 – Machine Learning Basics & First Model

## What is Machine Learning?
Machine learning is an approach in which a computer learns patterns from data
and uses them to make predictions, instead of following rules hand-written for each case.

## What is Supervised Learning?
Supervised learning uses labeled data: we provide the inputs (X) together with
the correct outputs (y), and the model learns the relationship between them.

## Classification vs Regression
- Classification → Predict categories (Yes/No, 0/1)
- Regression → Predict continuous values (Salary, Price)
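
To make the two bullets above concrete, here is a minimal sketch on made-up toy data (the experience values, labels, and salaries are invented for illustration and are not from the attrition dataset). The same single feature is used once to predict a category and once to predict a number.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy feature (hypothetical): years of experience.
X_toy = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])

# Classification: the target is a category (0 or 1).
y_class = np.array([0, 0, 0, 0, 1, 1, 1, 1])
clf = LogisticRegression().fit(X_toy, y_class)
class_pred = clf.predict([[2], [7]])   # discrete class labels

# Regression: the target is a continuous value (salary in thousands).
y_reg = np.array([30.0, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0, 65.0])
reg = LinearRegression().fit(X_toy, y_reg)
reg_pred = reg.predict([[2], [7]])     # real-valued predictions

print("Predicted classes:", class_pred)
print("Predicted salaries:", reg_pred)
```

The classifier can only return one of the labels it was trained on, while the regressor can return any real number.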

## What is a Model?
A model is a mathematical function that maps input features (X) 
to output predictions (y).


In [None]:

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


In [None]:

df = pd.read_csv("cleaned_employee_attrition.csv")
df.columns = df.columns.str.lower()

df.head()


In [None]:

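# One-hot encode the multi-category columns; drop_first=True drops one level
# per column so the remaining dummy columns are not perfectly collinear.
df = pd.get_dummies(df, columns=["city", "education"], drop_first=True)

# Encode the two binary columns as 0/1. Values outside these mappings become NaN.
df["gender"] = df["gender"].map({"Male": 1, "Female": 0})
df["everbenched"] = df["everbenched"].map({"Yes": 1, "No": 0})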

df.head()


In [None]:

X = df.drop("leaveornot", axis=1)
y = df["leaveornot"]

print("Feature shape:", X.shape)
print("Target shape:", y.shape)


In [None]:

# stratify=y keeps the class proportions the same in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training samples:", X_train.shape[0])
print("Testing samples:", X_test.shape[0])


In [None]:

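# max_iter is raised from the default of 100 to give the solver more room to
# converge, since no feature scaling is applied in this notebook.
log_model = LogisticRegression(max_iter=1000)
log_model.fit(X_train, y_train)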

y_pred_log = log_model.predict(X_test)

accuracy_log = accuracy_score(y_test, y_pred_log)
print("Logistic Regression Accuracy:", accuracy_log)


In [None]:

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_log))

print("\nClassification Report:\n")
print(classification_report(y_test, y_pred_log))


In [None]:

tree_model = DecisionTreeClassifier(random_state=42)
tree_model.fit(X_train, y_train)

y_pred_tree = tree_model.predict(X_test)

accuracy_tree = accuracy_score(y_test, y_pred_tree)
print("Decision Tree Accuracy:", accuracy_tree)


In [None]:

rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)

accuracy_rf = accuracy_score(y_test, y_pred_rf)
print("Random Forest Accuracy:", accuracy_rf)


In [None]:

print("Logistic Regression:", accuracy_log)
print("Decision Tree:", accuracy_tree)
print("Random Forest:", accuracy_rf)


In [None]:

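# Positive coefficients raise the log-odds of the positive class (leaveornot = 1);
# negative ones lower it. Sorting descending lists the most positive first.
coefficients = pd.DataFrame({
    "Feature": X.columns,
    "Coefficient": log_model.coef_[0]
}).sort_values(by="Coefficient", ascending=False)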

coefficients.head(10)


In [None]:

feature_importance = pd.DataFrame({
    "Feature": X.columns,
    "Importance": rf_model.feature_importances_
}).sort_values(by="Importance", ascending=False)

feature_importance.head(10)



# Notes

## What is a Model?
A model is the pattern learned from training data, stored as parameters
(for example coefficients or tree splits), that predicts outcomes for new inputs.

## What is Accuracy?
Accuracy = Correct Predictions / Total Predictions
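
As a quick check of the formula above, this small sketch computes accuracy by hand and compares it with `accuracy_score`. The two label arrays are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions for 10 samples.
y_true_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])

# Correct predictions / total predictions.
correct = int((y_true_demo == y_pred_demo).sum())
total = len(y_true_demo)
manual_accuracy = correct / total

sklearn_accuracy = accuracy_score(y_true_demo, y_pred_demo)

print("Correct:", correct, "of", total)
print("Manual accuracy:", manual_accuracy)
print("sklearn accuracy:", sklearn_accuracy)
```

Two of the ten predictions are wrong, so both calculations give 0.8.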

## Why Is Accuracy Alone Not Enough?
If the classes are imbalanced, a model can always predict the majority class
and still achieve high accuracy while never identifying the minority class.
We should therefore also check precision, recall, and F1-score.
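
The point about imbalance can be seen with a small made-up example: 95 samples of class 0, 5 of class 1, and a "model" that always predicts 0. The class counts are assumptions chosen for illustration, not the proportions in the attrition data.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical imbalanced labels: 95 negatives, 5 positives.
y_true_imb = np.array([0] * 95 + [1] * 5)

# A trivial predictor that always outputs the majority class.
y_pred_imb = np.zeros_like(y_true_imb)

acc = accuracy_score(y_true_imb, y_pred_imb)
# zero_division=0 returns 0 instead of warning when no positives are predicted.
prec = precision_score(y_true_imb, y_pred_imb, zero_division=0)
rec = recall_score(y_true_imb, y_pred_imb, zero_division=0)
f1 = f1_score(y_true_imb, y_pred_imb, zero_division=0)

print("Accuracy:", acc)
print("Precision:", prec)
print("Recall:", rec)
print("F1-score:", f1)
```

Accuracy is 0.95 even though the predictor never finds a single positive case, which is why recall and F1 are both 0.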

## What Confused Me Today?
- Difference between Logistic Regression and Tree models
- How coefficients translate to probability
- Why tree models do not require scaling
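
On the second question above, a rough sketch of how logistic regression coefficients become a probability: the model computes a weighted sum of the features (the log-odds) and passes it through the sigmoid function. The data here is synthetic and the variable names are my own, so treat this as an illustration under those assumptions rather than an analysis of the attrition model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption): two features and a noisy linear decision rule.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise > 0).astype(int)

demo_model = LogisticRegression().fit(X_demo, y_demo)

# Step 1: log-odds z = intercept + sum(coefficient_i * feature_i).
z = demo_model.intercept_[0] + X_demo @ demo_model.coef_[0]

# Step 2: the sigmoid squashes z into a probability between 0 and 1.
manual_proba = 1.0 / (1.0 + np.exp(-z))

# predict_proba returns [P(class 0), P(class 1)] per row; take class 1.
sklearn_proba = demo_model.predict_proba(X_demo)[:, 1]

# exp(coefficient) is the factor by which the odds change when that
# feature increases by one unit, holding the others fixed.
odds_ratios = np.exp(demo_model.coef_[0])

print("Max difference:", np.abs(manual_proba - sklearn_proba).max())
print("Odds ratios:", odds_ratios)
```

The manual calculation matches `predict_proba`, so a coefficient changes the log-odds linearly, and its effect on the probability depends on where the sample already sits on the sigmoid curve.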
