In [1]:
# Step 1: Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

In [None]:
'''🔍 Line by Line:
✅ import pandas as pd
- pandas is a powerful data-handling tool; a DataFrame works a bit like a dictionary of columns.
- It is useful for reading, cleaning, and exploring datasets.
👉 We give it the short name pd for easy reference.

✅ from sklearn.model_selection import train_test_split
- We use this function to divide our dataset into a training part and a testing part.
👉 For example: keep 70% for learning and set aside 30% for testing. (This notebook uses an 80/20 split in Step 5.)

✅ from sklearn.linear_model import LogisticRegression
- This is our ML model.
👉 Logistic Regression is used for classification problems (Yes/No, 1/0 types).

✅ from sklearn.metrics import classification_report, accuracy_score
- Metric tools for checking whether the model predicts correctly.

accuracy_score → Overall fraction of correct predictions
classification_report → Precision, recall, F1-score, and support per class

🧠 Analogy:
Imagine you're preparing for a cooking competition:
pandas → Ingredients manager
train_test_split → Separate practice batch vs test batch
LogisticRegression → Recipe (model)
accuracy_score, classification_report → Judges giving marks 😎
'''

In [2]:
# Step 2: Load dataset
df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv')

In [4]:
# Step 3: Explore the dataset
print(df.head())

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  


In [6]:
print(df.describe())

       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin  \
count   768.000000  768.000000     768.000000     768.000000  768.000000   
mean      3.845052  120.894531      69.105469      20.536458   79.799479   
std       3.369578   31.972618      19.355807      15.952218  115.244002   
min       0.000000    0.000000       0.000000       0.000000    0.000000   
25%       1.000000   99.000000      62.000000       0.000000    0.000000   
50%       3.000000  117.000000      72.000000      23.000000   30.500000   
75%       6.000000  140.250000      80.000000      32.000000  127.250000   
max      17.000000  199.000000     122.000000      99.000000  846.000000   

              BMI  DiabetesPedigreeFunction         Age     Outcome  
count  768.000000                768.000000  768.000000  768.000000  
mean    31.992578                  0.471876   33.240885    0.348958  
std      7.884160                  0.331329   11.760232    0.476951  
min      0.000000                  

In [None]:
print(df['Outcome'].value_counts())  # 0 = No diabetes, 1 = Has diabetes

Outcome
0    500
1    268
Name: count, dtype: int64


In [8]:
# Step 4: Split into features and target
X = df.drop('Outcome', axis=1)
y = df['Outcome']

In [None]:
'''In this step we split the dataset into 2 parts:
X → input features (independent variables)
y → output target (dependent variable)

🔍 Line by Line:
✅ X = df.drop('Outcome', axis=1)
👉 We remove the Outcome column and keep the remaining columns in X. (axis=1 means "drop a column"; df itself is not modified.)
👉 These are the features, such as age, blood pressure, and glucose.

✅ y = df['Outcome']
👉 We keep the Outcome column alone in y.
This is the target (what the model has to predict).
Ex: whether the person is diabetic or not (1 or 0)

🍲 Cooking Analogy:
Dataset = recipe book
X = ingredients (features like sugar, salt...)
y = final dish (Outcome = sweet/diabetic)
Now we prepare a model that uses the ingredients (X) to predict the final dish (y)!'''
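The feature/target split can be tried on a tiny table first. This is a minimal sketch: the DataFrame `toy` and the names `X_toy` and `y_toy` are hypothetical, and the three rows reuse two columns from the first rows of `df.head()` purely for illustration.

```python
import pandas as pd

# Hypothetical toy table with the same layout idea as the diabetes dataset
toy = pd.DataFrame({
    "Glucose": [148, 85, 183],
    "BMI": [33.6, 26.6, 23.3],
    "Outcome": [1, 0, 1],
})

X_toy = toy.drop("Outcome", axis=1)  # every column except the target
y_toy = toy["Outcome"]               # the target column alone

print(list(X_toy.columns))        # ['Glucose', 'BMI']
print(y_toy.tolist())             # [1, 0, 1]
print("Outcome" in toy.columns)   # True: drop() returned a new DataFrame
```

Because `drop` returns a new DataFrame by default, the original table still contains `Outcome` afterwards.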

In [9]:
# Step 5: Split into training/testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
'''✅ train_test_split(X, y, test_size=0.2, random_state=42)
test_size=0.2 → 20% test set, 80% training set (with 768 rows: 614 for training, 154 for testing)
random_state=42 → Fixes the random seed so the same split is produced on every run
(like getting the same question paper again and again for practice)

👉 This will return:
X_train = Training features
X_test = Testing features
y_train = Training labels (correct answers)
y_test = Testing labels (for evaluation)

📚 Analogy:
Imagine exam preparation:
80% of the material is set aside for studying (X_train, y_train)
20% is held back as practice questions for testing (X_test, y_test)
So now everything is ready for model training!'''
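To see what `test_size` and `random_state` do, the split can be run twice on a small array. This is only a sketch; `X_demo`, `y_demo`, and the other variable names are hypothetical and chosen so they do not overwrite the notebook's own `X_train` and `X_test`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)  # 10 rows, 2 features
y_demo = np.array([0, 1] * 5)          # 10 labels

# Same arguments, called twice
a_train, a_test, b_train, b_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)
c_train, c_test, d_train, d_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

print(a_train.shape, a_test.shape)     # (8, 2) (2, 2)
print(np.array_equal(a_test, c_test))  # True: same seed, same split
```

Without `random_state`, each call would shuffle differently, so the accuracy reported later could change from run to run.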

In [10]:
# Step 6: Train Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

In [None]:
'''✅ model = LogisticRegression(max_iter=1000)
LogisticRegression → An ML algorithm used for classification problems
(example: 0 or 1, Yes or No, Diabetic or Not)

max_iter=1000 → If the model struggles to converge,
the solver may take up to 1000 iterations to learn (iteration limit)

✅ model.fit(X_train, y_train)
We give the model the training features (X_train) and the correct answers (y_train)
and let it learn from them.

This is the training process: the phase where the student (the model) studies!

📚 Analogy:
🎓 model = student
📖 X_train = questions
✅ y_train = correct answers

fit() = the student studying and getting ready for the exam!
(once fit() returns, the model has completed its learning stage)'''
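What `fit()` leaves behind can be inspected on a small example. The sketch below uses made-up one-feature data (`X_small`, `y_small`, and `clf` are hypothetical names, not part of the notebook) and looks at the learned weights, the number of iterations the solver actually needed, and the predictions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: low values belong to class 0, high values to class 1
X_small = np.array([[1.0], [2.0], [3.0], [4.0], [6.0], [7.0], [8.0], [9.0]])
y_small = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression(max_iter=1000)
clf.fit(X_small, y_small)

new_points = np.array([[2.5], [7.5]])
print(clf.coef_.shape, clf.intercept_.shape)  # one weight per feature, one intercept
print(clf.n_iter_)                            # iterations used, at most max_iter
print(clf.predict(new_points))                # predicted class labels
print(clf.predict_proba(new_points))          # class probabilities, each row sums to 1
```

`max_iter` is only an upper bound: on easy data the solver usually stops well before reaching it.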

In [11]:
# Step 7: Predict on test set
y_pred = model.predict(X_test)

In [12]:
# Step 8: Evaluate performance
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.7467532467532467

Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.79      0.80        99
           1       0.64      0.67      0.65        55

    accuracy                           0.75       154
   macro avg       0.73      0.73      0.73       154
weighted avg       0.75      0.75      0.75       154



In [None]:
'''✅ accuracy_score(y_test, y_pred)
y_test = real answers
y_pred = what the model predicted
It checks how many of the two match and returns the overall accuracy as a fraction between 0 and 1 (here 0.7467, about 75%)

✅ classification_report(y_test, y_pred)
Gives an in-depth result:
Precision → Of the cases the model called “yes”, how many were correct
Recall → Of all the truly “yes” cases, how many the model identified
F1-score → Balance between precision & recall

📚 Analogy:
Imagine model is a student who wrote an exam.
accuracy_score = total mark percentage
classification_report = detailed mark sheet
(a breakdown of how much in maths, how much in science)

So now you have the full result sheet showing how well your model has learned!'''
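Accuracy is simply the share of predictions that match the true labels, so it can be computed by hand. The labels below are hypothetical and are not taken from the notebook's test set; `accuracy_score` applied to the same two lists should give the same value.

```python
# Hypothetical true labels and predictions for 10 samples
y_true_demo = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred_demo = [0, 1, 1, 0, 0, 1, 0, 0, 1, 0]

# Count positions where the prediction equals the true label
matches = sum(t == p for t, p in zip(y_true_demo, y_pred_demo))
accuracy = matches / len(y_true_demo)

print(matches, accuracy)  # 8 0.8
```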

In [None]:
'''📊 What is a Classification Report?
A classification report is like a mark sheet 📝 for your machine learning model.
It shows how well your model predicted the categories (e.g., spam vs. not spam, diabetic vs. not diabetic).
It gives 4 main metrics for each class (like 0 and 1): '''

In [None]:
'''| Metric        | Meaning (Simple)                                                              | Analogy                                   |
|---------------|-------------------------------------------------------------------------------|-------------------------------------------|
| **Precision** | Out of all the cases the model predicted as **"yes"**, how many were actually **yes**? | How careful is the model? (False alarms?) |
| **Recall**    | Out of all the actual **"yes"** cases, how many did the model catch?          | How good is it at finding the target?     |
| **F1-score**  | Harmonic mean of **precision** and **recall**; high only when both are high.  | Combined score (if both matter)           |
| **Support**   | Number of actual samples of that class in the test set                        | How many examples are there to evaluate?  |'''
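The four metrics in the table can be worked out by hand from counts of true positives, false positives, and false negatives. This is a sketch for class 1 only, on hypothetical labels (`actual` and `predicted` are made-up names and values, not the notebook's results).

```python
# Hypothetical true labels and predictions for 10 samples
actual    = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
predicted = [0, 1, 1, 0, 0, 1, 1, 0, 1, 0]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # correctly predicted "yes"
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false alarms
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # missed "yes" cases

precision = tp / (tp + fp)                          # of predicted "yes", how many were right
recall = tp / (tp + fn)                             # of actual "yes", how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
support = sum(1 for a in actual if a == 1)          # actual class-1 samples

print(tp, fp, fn)  # 3 2 1
print(round(precision, 2), round(recall, 2), round(f1, 2), support)  # 0.6 0.75 0.67 4
```

Here the model finds most of the real positives (recall 0.75) but raises two false alarms, which pulls precision down to 0.6.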