## 1. Problem Definition
**Definition:**  
Deciding exactly what the model should predict, and from which inputs.

**Example:**  
Predicting a category such as Yes/No (classification) or a number (regression) from the given data.

---

## 2. Data Collection
**Definition:**  
Gathering the data needed to solve the problem.

**Example:**  
Collecting data from files, databases, or online sources.
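
A minimal sketch of the sources named above, assuming pandas is installed. The CSV text, the table name `patients`, and the column names are made up for illustration; a real project would pass a file path or a URL to `pd.read_csv` instead of the in-memory stand-ins used here.

```python
import io
import sqlite3

import pandas as pd

# File: read_csv accepts a path, a URL, or any file-like object.
# A StringIO object stands in for a file on disk.
csv_text = "age,bmi,stroke\n67,36.6,1\n61,,1\n49,34.4,0\n"
from_file = pd.read_csv(io.StringIO(csv_text))

# Database: an in-memory SQLite database stands in for a real one.
conn = sqlite3.connect(":memory:")
from_file.to_sql("patients", conn, index=False)
from_db = pd.read_sql_query(
    "SELECT age, stroke FROM patients WHERE age > 50", conn
)
conn.close()

print(from_file.shape)
print(from_db)
```

Both routes end in a DataFrame, so the later steps do not need to know where the data came from.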

---

## 3. Data Preprocessing
**Definition:**  
Cleaning the data and converting it into a form a model can learn from.

**Example:**  
Removing irrelevant columns and duplicate rows, filling missing values, and converting text categories to numbers.
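
The three cleaning steps can be sketched on a toy table. The frame `raw` and its values are invented for this example, and mean imputation is only one of several possible ways to fill a gap.

```python
import pandas as pd

# Toy data: one repeated row (id 2) and one missing BMI value
raw = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "gender": ["Male", "Female", "Female", "Male"],
    "bmi": [36.6, 30.0, 30.0, None],
})

clean = raw.drop_duplicates()        # remove the repeated row
clean = clean.drop(columns=["id"])   # an identifier is not a feature
clean["bmi"] = clean["bmi"].fillna(clean["bmi"].mean())          # fill the gap
clean["gender"] = clean["gender"].map({"Female": 0, "Male": 1})  # text to numbers

print(clean)
```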

---

## 4. Model Selection
**Definition:**  
Choosing a suitable machine learning model.

**Example:**  
Selecting a model like Logistic Regression or Decision Tree.
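
One hedged way to compare the two candidates is cross-validation on the same data. The sketch below uses synthetic data from `make_classification` rather than a real dataset, and the `max_depth=3` setting is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

# Mean accuracy over 5 folds for each candidate
scores = {}
for name, model in candidates.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean accuracy {scores[name]:.3f}")
```

Scoring every candidate on the same folds keeps the comparison fair; which model wins depends on the data.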

---

## 5. Training & Evaluation
**Definition:**  
Fitting the model to training data and checking its performance on data it has not seen.

**Example:**  
Training the model, then measuring accuracy (for classification) or error (for regression) on a held-out test set.
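
The stroke example later in these notes measures accuracy, so this sketch shows the other case: an error metric for a numeric target. The data is synthetic, and mean absolute error is one error measure among several.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem with some noise
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the training part, measure error on the held-out part
model = LinearRegression().fit(X_train, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"Mean absolute error on held-out data: {mae:.2f}")
```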

---

## 6. Improvement & Deployment
**Definition:**  
Improving the model and using it in real applications.

**Example:**  
Making the model more accurate and using it in a system or app.
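
Using a model "in a system or app" usually starts with saving the fitted model so that another process can load it. A minimal sketch with the standard-library `pickle` module follows; the data is synthetic, and a real deployment would also need input validation and versioning, which are not shown.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialize the fitted model; an app would write these bytes to a file
blob = pickle.dumps(model)

# A serving process would load the bytes back and call predict
restored = pickle.loads(blob)
same = (restored.predict(X[:5]) == model.predict(X[:5])).all()
print(same)
```

Only unpickle data from a trusted source, since loading a pickle can execute arbitrary code.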

---

Reference: https://www.youtube.com/watch?v=sTTLL0q9Yh0

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

In [2]:
df = pd.read_csv("healthcare-dataset-stroke-data.csv")
df.head()


| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |
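
Before cleaning, it helps to count missing values and look at the label distribution. The sketch below retypes a few columns of the five preview rows into a small frame (the name `preview` is chosen here); on the full `df` the same two calls apply.

```python
import io

import pandas as pd

# A subset of the columns from the five rows shown above
preview_csv = """id,gender,age,bmi,stroke
9046,Male,67.0,36.6,1
51676,Female,61.0,,1
31112,Male,80.0,32.5,1
60182,Female,49.0,34.4,1
1665,Female,79.0,24.0,1
"""
preview = pd.read_csv(io.StringIO(preview_csv))

missing = preview.isna().sum()  # missing values per column
print(missing)
print(preview["stroke"].value_counts())  # rows per class
```

All five preview rows have `stroke` equal to 1, so `head()` alone says nothing about class balance; `value_counts()` on the full frame does.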


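In [9]:
# Drop the row identifier; it labels a record and is not a predictor
df = df.drop(columns=['id'])

# Fill missing BMI values with the column mean
df['bmi'] = df['bmi'].fillna(df['bmi'].mean())
df.head()

| | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | 28.893237 | never smoked | 1 |
| 2 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |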


In [12]:
le = LabelEncoder()

categorical_cols = [
    'gender',
    'ever_married',
    'work_type',
    'Residence_type',
    'smoking_status'
]

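# Replace each text column with integer codes (categories are numbered in sorted order).
# fit_transform refits `le` on every pass, so only the last column's mapping stays in `le`.
for col in categorical_cols:
    df[col] = le.fit_transform(df[col])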
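
A possible variation, not the approach used in the cell above: keep one encoder per column so that each mapping can be inspected or reversed later. The frame `demo` and the dictionary `encoders` are names invented for this sketch.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

demo = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "smoking_status": ["formerly smoked", "never smoked", "smokes"],
})

# One encoder per column, stored by column name
encoders = {}
for col in demo.columns:
    encoders[col] = LabelEncoder()
    demo[col] = encoders[col].fit_transform(demo[col])

print(demo)
print(encoders["gender"].classes_)                        # code 0, code 1
print(encoders["smoking_status"].inverse_transform([2]))  # code back to text
```

Integer codes imply an order between categories that may not exist. For a linear model, one-hot encoding (for example with `pd.get_dummies`) is a common alternative.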


In [13]:
X = df.drop(columns=['stroke'])  # features: every column except the label
y = df['stroke']                 # target: 1 = stroke, 0 = no stroke


In [14]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
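
With a rare positive class, a plain random split can leave the test set with a different class ratio than the training set. The sketch below shows the `stratify` argument of `train_test_split` on made-up labels with 5% positives. It is not what the cell above does, and the results reported later come from the unstratified split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 1000 rows, 5% positive labels
y_demo = np.array([1] * 50 + [0] * 950)
X_demo = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of positives in each part
print(y_tr.mean(), y_te.mean())
```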

In [15]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

In [16]:
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))

Accuracy: 0.9393346379647749

Classification Report:

              precision    recall  f1-score   support

           0       0.94      1.00      0.97       960
           1       0.00      0.00      0.00        62

    accuracy                           0.94      1022
   macro avg       0.47      0.50      0.48      1022
weighted avg       0.88      0.94      0.91      1022
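
The report above shows a recall of 0.00 for class 1, which is consistent with the model predicting class 0 for every test row. The sketch below rebuilds labels from the support column (960 and 62) and scores an always-zero predictor; the variable names are chosen here for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Labels rebuilt from the support column: 960 negatives, 62 positives
y_true = np.array([0] * 960 + [1] * 62)
# A "model" that always predicts class 0
y_always_zero = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_always_zero)  # 960 / 1022
rec = recall_score(y_true, y_always_zero)    # share of stroke cases found
cm = confusion_matrix(y_true, y_always_zero)

print(acc)
print(rec)
print(cm)
```

The accuracy equals 960 / 1022, the same value reported above, yet no stroke case is detected. This is why accuracy alone is misleading on imbalanced data.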





In [None]:
# Improvements:
# - Try Decision Tree / Random Forest
# - Handle class imbalance (stroke cases are rare: 62 of the 1022 test rows)

# Deployment:
# - Integrate into healthcare systems
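
A hedged sketch of the two improvement ideas listed above, run on synthetic imbalanced data rather than the stroke dataset. `class_weight="balanced"` is one way to handle imbalance; resampling is another and is not shown. The scores will differ on real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 6% positives
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.94, 0.06], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

models = {
    "plain logistic regression": LogisticRegression(max_iter=1000),
    "balanced logistic regression": LogisticRegression(
        max_iter=1000, class_weight="balanced"
    ),
    "balanced random forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
}

# Recall on the minority class for each model
recalls = {}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    recalls[name] = recall_score(y_te, m.predict(X_te))
    print(f"{name}: recall on minority class {recalls[name]:.2f}")
```

Comparing recall rather than accuracy shows whether a change actually helps with the rare class.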


##### Dataset: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset?
##### Guide: https://www.kaggle.com/code/nikunjmalpani/stroke-prediction-step-by-step-guide?
##### Reference: https://www.youtube.com/watch?v=jZSlg64rnoE

## Common Mistakes Beginners Make in AI/ML

- Not understanding the problem before building a model  
- Using dirty or incomplete data without cleaning  
- Ignoring missing values and duplicates  
- Choosing complex models instead of simple ones  
- Trusting accuracy alone without checking other metrics such as precision and recall  
- Using ID or other irrelevant columns as features  
- Overfitting the model to training data  
- Copying code without understanding it  
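
Overfitting, one of the mistakes listed above, can be made visible by comparing training and test accuracy. The sketch uses synthetic data with deliberately noisy labels (`flip_y=0.2`) and an unrestricted decision tree; both choices are assumptions made for the demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: a share of the labels is assigned at random
X, y = make_classification(
    n_samples=400, n_features=10, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# No depth limit: the tree can memorize the training rows, noise included
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_acc = deep.score(X_tr, y_tr)
test_acc = deep.score(X_te, y_te)
print(f"train accuracy {train_acc:.2f}, test accuracy {test_acc:.2f}")
```

A large gap between the two numbers is the usual sign that the model has memorized the training data instead of learning a pattern that generalizes.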

---

## Ethics and Responsibility in AI

- Protecting user and patient data privacy  
- Avoiding biased or unfair predictions  
- Using AI to support humans, not replace them  
- Making AI decisions transparent and explainable  
- Taking responsibility for AI outcomes  
- Using AI only for positive and ethical purposes  

---
Reference: https://www.youtube.com/watch?v=4Aa-ynT3UN0