<div style="font-size:32px; font-weight:bold; color:#FABF8F; font-family:Arial;">
  🤖 Lesson 2: Building a Simple Model
</div>


 <h2 style="color:#0288D1">📥 Step 1: Import Libraries & Load Data</h2>

In [53]:
# 🧰 Core libraries for data processing and modeling
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor


In [54]:
# 📂 Load the dataset
Data = pd.read_csv('student_exam.csv')

<h2 style="color:#6A1B9A">
🔍 Step 2: Check for Missing Data</h2>

In [55]:
# 🕵️‍♂️ Find columns with missing values (NaNs)
columns_who_has_nan = []

for col in Data.columns:
    if Data[col].isnull().sum() > 0:      # ❗ Missing values exist
        columns_who_has_nan.append(col)   # ➕ Track those columns

len(columns_who_has_nan)  # ✅ 0 means dataset is clean

0

<p>We check for missing values before moving ahead. If <code>len(columns_who_has_nan)</code> returns 0, no imputation is required.</p>
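<p>The same check can be written without an explicit loop. The sketch below uses a small made-up frame (the column values are hypothetical, not taken from <code>student_exam.csv</code>) and also reports how many values are missing per column.</p>

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the lesson's data (hypothetical values)
toy = pd.DataFrame({
    "age": [15, 16, np.nan, 17],
    "school": ["GP", "MS", "GP", None],
    "G3": [10, 12, 9, 14],
})

# Count missing values per column
missing_counts = toy.isnull().sum()

# Keep only the names of columns with at least one missing value
columns_with_nan = missing_counts[missing_counts > 0].index.tolist()

print(missing_counts)
print(columns_with_nan)
```

<p>An empty list here plays the same role as a length of 0 in the loop version.</p>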


 <h2 style="color:#D84315">🧹 Step 3: Categorical Encoding</h2>

In [56]:
# 🎭 Isolate categorical features
categorical_features = Data.select_dtypes(include='object')
object_cols = categorical_features.columns

# 🔄 One-hot encode them
encoder = OneHotEncoder(sparse_output=False)  # ✅ New syntax for sklearn >=1.2
encoded = encoder.fit_transform(Data[object_cols])

# 🧱 Combine encoded columns back with numeric ones
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(object_cols), index=Data.index)
Data_encoded = pd.concat([Data.drop(columns=object_cols), encoded_df], axis=1)

# 🔍 Replace the original frame, then preview the encoded dataset
Data = Data_encoded
Data.head()

<p>This converts every categorical (<code>object</code>-dtype) column into 0/1 indicator columns using <code>OneHotEncoder</code>, so the whole dataset is numeric and ready for scikit-learn models.</p>


 <h2 style="color:#009688">🎯 Step 4: Define Target Variable</h2>

In [57]:
# 🎯 Convert G3 into binary: Pass (G3 >= 10) = True, Fail = False
Data['G3'] = Data['G3'] >= 10

# 🧮 Check the distribution
Data['G3'].value_counts()

G3
True     814
False    230
Name: count, dtype: int64
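
<p>The classes are imbalanced, so it helps to know what a trivial model would score. The arithmetic below uses only the two counts printed above; a model that always predicts the majority class ("pass") would already be right about 78% of the time, which is the baseline any later accuracy should be compared against.</p>

```python
# Counts taken from the value_counts() output above
pass_count, fail_count = 814, 230

total = pass_count + fail_count

# Accuracy of always predicting the majority class
baseline_accuracy = max(pass_count, fail_count) / total

print(total)
print(round(baseline_accuracy, 3))
```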

<h2 style="color:#5D4037">✂️ Step 5: Split Data</h2>

In [58]:
# 📦 Separate features and target
X = Data.drop('G3', axis=1)
Y = Data['G3']

# 📐 Create training/test split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, shuffle=True, random_state=45)

<p>We split the data for model training (<code>X_train, Y_train</code>) and evaluation (<code>X_test, Y_test</code>).</p>
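<p>Because the target is imbalanced, a stratified split is worth knowing about. The sketch below is not what the lesson's cell does: it runs on synthetic data with a roughly 78/22 class ratio and adds <code>stratify</code>, which keeps that ratio the same in the training and test sets.</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 78% True / 22% False (assumed ratio)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([True] * 780 + [False] * 220)

# stratify=y_demo preserves the class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=45
)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())  # share of True in each split
```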


<h2 style="color:#3F51B5">🧠 Step 6: Train Logistic Regression</h2>

In [59]:
# 🧠 Initialize and train the model
model = LogisticRegression(random_state=42)
model.fit(X_train, Y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


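<p>We train a simple logistic regression model to classify student exam outcomes. The message above is a convergence warning: the solver stopped at its iteration limit before it converged. The model is still fitted and usable, but scaling the features or raising <code>max_iter</code> usually resolves the warning.</p>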
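<p>One common way to address a convergence warning is to standardize the features before fitting. The sketch below is an illustration on synthetic data from <code>make_classification</code>, not a re-run of the lesson's cell; the deliberately rescaled feature and the <code>max_iter=1000</code> value are assumptions chosen for the demo.</p>

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; one feature is put on a much larger scale
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_demo[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=45
)

# The scaler is fit on the training data only, then applied before the model
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
scaled_model.fit(X_tr, y_tr)

# Number of solver iterations actually used, and accuracy on the held-out set
n_iter = scaled_model.named_steps["logisticregression"].n_iter_[0]
test_accuracy = scaled_model.score(X_te, y_te)
print(n_iter, test_accuracy)
```

<p>Wrapping both steps in a pipeline means the same scaling is applied automatically at prediction time.</p>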


 <h2 style="color:#7B1FA2">📈 Step 7: Make Predictions & Evaluate</h2>

In [60]:
# 🔮 Make predictions
Y_pred = model.predict(X_test)

# 🎯 Evaluate accuracy
accuracy = accuracy_score(Y_test, Y_pred)
report = classification_report(Y_test, Y_pred)

# 📊 Show results
accuracy

0.9118773946360154

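<p>We evaluate the model using <code>accuracy_score</code> and <code>classification_report</code>. About 91% of the test-set predictions are correct. The per-class precision and recall are stored in <code>report</code> and can be displayed with <code>print(report)</code>.</p>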
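<p>Accuracy alone hides which class the errors fall on. The sketch below uses eight hand-written labels (hypothetical, not the lesson's predictions) to show how a confusion matrix and the classification report break the result down per class.</p>

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hand-written labels for illustration only
y_true = np.array([True, True, True, False, False, True, False, True])
y_pred = np.array([True, True, False, False, True, True, False, True])

acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, in the order [False, True]
cm = confusion_matrix(y_true, y_pred, labels=[False, True])

# Text table with precision, recall, and F1 for each class
report_demo = classification_report(y_true, y_pred)

print(acc)
print(cm)
print(report_demo)
```

<p>Here 6 of 8 predictions are correct, with one failing student predicted to pass and one passing student predicted to fail.</p>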


 <h2 style="color:#4E342E">💾 Step 8: Save Outputs</h2>

In [61]:
# 💾 Save outputs as CSV
Data.to_csv('student_exam_new.csv')
Y_test.to_csv('Y_test.csv')
pd.DataFrame(Y_pred).to_csv('Y_pred.csv')
pd.DataFrame(model.classes_).to_csv('modelclasses_.csv')

<p>All key outputs, including the predictions and the transformed dataset, are saved as CSV files for later use.</p>
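<p>When predictions are saved separately from the true labels, the original row index is easy to lose. As a possible alternative, the sketch below stores both in one frame keyed by the test-set index and reads it back. The values, the <code>row_id</code> label, and the in-memory buffer (used here instead of a file on disk) are all assumptions for the demo.</p>

```python
import io

import numpy as np
import pandas as pd

# Hypothetical test labels with a shuffled index, plus matching predictions
y_test_demo = pd.Series([True, False, True], index=[17, 4, 42], name="G3")
y_pred_demo = np.array([True, True, True])

# Keep actual and predicted values side by side under the original row index
results = pd.DataFrame(
    {"actual": y_test_demo, "predicted": y_pred_demo},
    index=y_test_demo.index,
)

# Write to CSV (an in-memory buffer here) and read it back with the same index
buffer = io.StringIO()
results.to_csv(buffer, index_label="row_id")
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="row_id")
print(restored)
```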


 <h2 style="color:#388E3C">🌳 Step 9: Try Random Forest</h2>

In [62]:
# 🌲 Train a random forest regressor on the boolean target (True = 1, False = 0)
model2 = RandomForestRegressor(n_estimators=300, max_depth=6)
model2.fit(X_train, Y_train)

# 🔮 Predict, then round the continuous output to 0/1 class labels
Y_pred2 = model2.predict(X_test).round(0)

# 📈 Evaluate accuracy (true labels first, predictions second)
accuracy = accuracy_score(Y_test, Y_pred2)
accuracy


0.9042145593869731

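<p>We experiment with a <b>Random Forest</b> model, which can capture nonlinear relationships and feature interactions. Here a <code>RandomForestRegressor</code> is fit on the True/False target and its output is rounded to a class label; <code>RandomForestClassifier</code> is the more direct choice for a classification task. On this split its accuracy (about 0.904) is slightly below that of logistic regression (about 0.912).</p>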
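<p>For comparison, here is a minimal sketch of the classifier variant. It runs on synthetic data from <code>make_classification</code> rather than the student dataset, so its score says nothing about the lesson's results; <code>random_state=42</code> is an added assumption to make the run repeatable.</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=45
)

# Same forest size and depth as the lesson, but as a classifier
clf = RandomForestClassifier(n_estimators=300, max_depth=6, random_state=42)
clf.fit(X_tr, y_tr)

# predict() returns class labels directly, so no rounding is needed
y_pred_rf = clf.predict(X_te)

# predict_proba() gives one probability per class for each row
proba = clf.predict_proba(X_te)

rf_accuracy = accuracy_score(y_te, y_pred_rf)
print(rf_accuracy)
```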

 <h2 style="color:#00695C">🧠 What You Should Learn</h2>

By completing this notebook, you should now be able to:

✅ Follow an end-to-end machine learning workflow, from loading data to saving results  
✅ Load and inspect a real-world dataset using `pandas`  
✅ Detect missing data and decide whether imputation is needed  
✅ Encode categorical variables using `OneHotEncoder`  
✅ Convert a numeric score (`G3`) into a binary classification target  
✅ Split your data into training and testing sets  
✅ Train a Logistic Regression model to predict pass/fail  
✅ Evaluate your model using accuracy and classification report  
✅ Try a Random Forest model as a nonlinear alternative  
✅ Save your processed data and model outputs to CSV for later use


 <h2 style="color:#00695C">🎯 Challenges for You</h2>

Now that you've built a working ML model, try these hands-on challenges to deepen your skills:

🧩 **Challenge 1: Try a Different Classifier**  
    Swap `LogisticRegression` for another scikit-learn classifier and compare its test accuracy.

🔁 **Challenge 2: Cross-Validate Your Model**  
    Apply `cross_val_score` to evaluate both Logistic Regression and Random Forest across folds.
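
As a possible starting point for the cross-validation challenge, the sketch below scores two models over 5 folds. It uses synthetic data from `make_classification`, so you would substitute your own `X` and `Y`. It also assumes `RandomForestClassifier` rather than the regressor, because `cross_val_score` would score a regressor with R² by default instead of accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; replace with your own features and target
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Models to compare (hyperparameters here are illustrative choices)
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000, random_state=42),
    "random_forest": RandomForestClassifier(
        n_estimators=100, max_depth=6, random_state=42
    ),
}

# One accuracy score per fold for each model
cv_results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

The spread across folds gives a rough sense of how much a single train/test split can overstate or understate a model's accuracy.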
