<a href="https://colab.research.google.com/github/Shnku/pythoning_stuff/blob/main/mathml/Placement_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🎯Placement Data Prediction: KNN & Logistic Regression


---

# 👨‍💻INTRODUCTION:

This project utilizes the **Placement Prediction Dataset** from Kaggle, which comprises students' academic records and training details.

* [Predicting student placements based on historical data on Kaggle](https://www.kaggle.com/datasets/ruchikakumbhar/placement-prediction-dataset)

* The objective is to predict whether a student will be placed based on features such as CGPA, internships, aptitude test score, and placement training.

* Two machine learning techniques are used:
  1. K-Nearest Neighbors (KNN)
  2. Logistic Regression

* Goal: We aim to analyze and compare their effectiveness in forecasting placement outcomes.

---


# Importing the Data from Kaggle

In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
download_path = kagglehub.dataset_download('ruchikakumbhar/placement-prediction-dataset')

print('Data source import complete.\nlocation: ', download_path)


Data source import complete.
location:  /kaggle/input/placement-prediction-dataset


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
data_path=""
for dirname, _, filenames in os.walk(download_path):
    for filename in filenames:
        data_path=os.path.join(dirname, filename)
        print(data_path)

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/placement-prediction-dataset/placementdata.csv


# Viewing the Data

In [None]:
dataset = pd.read_csv('/kaggle/input/placement-prediction-dataset/placementdata.csv')
dataset

Unnamed: 0,StudentID,CGPA,Internships,Projects,Workshops/Certifications,AptitudeTestScore,SoftSkillsRating,ExtracurricularActivities,PlacementTraining,SSC_Marks,HSC_Marks,PlacementStatus
0,1,7.5,1,1,1,65,4.4,No,No,61,79,NotPlaced
1,2,8.9,0,3,2,90,4.0,Yes,Yes,78,82,Placed
2,3,7.3,1,2,2,82,4.8,Yes,No,79,80,NotPlaced
3,4,7.5,1,1,2,85,4.4,Yes,Yes,81,80,Placed
4,5,8.3,1,2,2,86,4.5,Yes,Yes,74,88,Placed
...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,7.5,1,1,2,72,3.9,Yes,No,85,66,NotPlaced
9996,9997,7.4,0,1,0,90,4.8,No,No,84,67,Placed
9997,9998,8.4,1,3,0,70,4.8,Yes,Yes,79,81,Placed
9998,9999,8.9,0,3,2,87,4.8,Yes,Yes,71,85,Placed


In [None]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   StudentID                  10000 non-null  int64  
 1   CGPA                       10000 non-null  float64
 2   Internships                10000 non-null  int64  
 3   Projects                   10000 non-null  int64  
 4   Workshops/Certifications   10000 non-null  int64  
 5   AptitudeTestScore          10000 non-null  int64  
 6   SoftSkillsRating           10000 non-null  float64
 7   ExtracurricularActivities  10000 non-null  object 
 8   PlacementTraining          10000 non-null  object 
 9   SSC_Marks                  10000 non-null  int64  
 10  HSC_Marks                  10000 non-null  int64  
 11  PlacementStatus            10000 non-null  object 
dtypes: float64(2), int64(7), object(3)
memory usage: 937.6+ KB


# Identifying the Features (X) and Target (y)

**The dataset has 12 columns.**

**The last column, `PlacementStatus`, is the target (`y`).**

**The remaining columns (`0` to `10`) are the features (`x`).**

In [None]:
x = dataset.drop('PlacementStatus', axis=1)  # every column except the target
y = dataset['PlacementStatus']  # the last column is the target

print("x or features are:\n ", x.columns, '\n')
print("y or target is:", y.name)
print(y)

x or features are:
  Index(['StudentID', 'CGPA', 'Internships', 'Projects',
       'Workshops/Certifications', 'AptitudeTestScore', 'SoftSkillsRating',
       'ExtracurricularActivities', 'PlacementTraining', 'SSC_Marks',
       'HSC_Marks'],
      dtype='object') 

y or target is: PlacementStatus
0       NotPlaced
1          Placed
2       NotPlaced
3          Placed
4          Placed
          ...    
9995    NotPlaced
9996       Placed
9997       Placed
9998       Placed
9999    NotPlaced
Name: PlacementStatus, Length: 10000, dtype: object
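
One thing worth checking at this step: `StudentID` is a row identifier rather than a student attribute, so it may be reasonable to leave it out of the features. The sketch below is an assumption, not what this notebook does. It uses a tiny hand-made frame (`toy`, with values copied from the first preview rows) to show how the identifier and the target could be dropped together.

```python
import pandas as pd

# Tiny illustrative frame (hypothetical); column names match a subset of the dataset
toy = pd.DataFrame({
    "StudentID": [1, 2, 3],
    "CGPA": [7.5, 8.9, 7.3],
    "Internships": [1, 0, 1],
    "PlacementStatus": ["NotPlaced", "Placed", "NotPlaced"],
})

# Drop the target and the identifier in a single call
toy_x = toy.drop(columns=["PlacementStatus", "StudentID"])
toy_y = toy["PlacementStatus"]

print(list(toy_x.columns))
```

The results reported later in this notebook were produced with `StudentID` still included, so they would likely change if this variant were used.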


## The `y` values are the strings `Placed` and `NotPlaced`, not 0 and 1

**So we need to convert them** to integers with pandas.


In [None]:
# Map each label to an integer explicitly (Placed -> 1, NotPlaced -> 0)
y_refined = y.map({'Placed': 1, 'NotPlaced': 0}).astype(int)
y_refined


Unnamed: 0,PlacementStatus
0,0
1,1
2,0
3,1
4,1
...,...
9995,0
9996,1
9997,1
9998,1


In [None]:
y_refined.info()

<class 'pandas.core.series.Series'>
RangeIndex: 10000 entries, 0 to 9999
Series name: PlacementStatus
Non-Null Count  Dtype
--------------  -----
10000 non-null  int64
dtypes: int64(1)
memory usage: 78.3 KB


## Two feature columns, `ExtracurricularActivities` and `PlacementTraining`, also hold strings 😓

In [None]:
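# Convert Yes/No to 1/0.
# Caution: .astype(int) on the whole frame also truncates the float columns
# (CGPA 7.5 -> 7, SoftSkillsRating 4.4 -> 4), as the output below shows.
x_refined = x.replace(['Yes','No'], [1,0]).astype(int)
x_refined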


Unnamed: 0,StudentID,CGPA,Internships,Projects,Workshops/Certifications,AptitudeTestScore,SoftSkillsRating,ExtracurricularActivities,PlacementTraining,SSC_Marks,HSC_Marks
0,1,7,1,1,1,65,4,0,0,61,79
1,2,8,0,3,2,90,4,1,1,78,82
2,3,7,1,2,2,82,4,1,0,79,80
3,4,7,1,1,2,85,4,1,1,81,80
4,5,8,1,2,2,86,4,1,1,74,88
...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,7,1,1,2,72,3,1,0,85,66
9996,9997,7,0,1,0,90,4,0,0,84,67
9997,9998,8,1,3,0,70,4,1,1,79,81
9998,9999,8,0,3,2,87,4,1,1,71,85


In [None]:
x_refined.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   StudentID                  10000 non-null  int64
 1   CGPA                       10000 non-null  int64
 2   Internships                10000 non-null  int64
 3   Projects                   10000 non-null  int64
 4   Workshops/Certifications   10000 non-null  int64
 5   AptitudeTestScore          10000 non-null  int64
 6   SoftSkillsRating           10000 non-null  int64
 7   ExtracurricularActivities  10000 non-null  int64
 8   PlacementTraining          10000 non-null  int64
 9   SSC_Marks                  10000 non-null  int64
 10  HSC_Marks                  10000 non-null  int64
dtypes: int64(11)
memory usage: 859.5 KB
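
The `info()` output above lists `CGPA` and `SoftSkillsRating` as `int64`, which means their decimal parts were lost when the whole frame was cast to `int`. A minimal sketch of one alternative, assuming only the two Yes/No columns need encoding, is shown below. The frame `toy_x` and the list `yes_no_cols` are hypothetical names used for illustration.

```python
import pandas as pd

# Hypothetical two-row frame with the same column names as the dataset
toy_x = pd.DataFrame({
    "CGPA": [7.5, 8.9],
    "SoftSkillsRating": [4.4, 4.0],
    "ExtracurricularActivities": ["No", "Yes"],
    "PlacementTraining": ["No", "Yes"],
})

yes_no_cols = ["ExtracurricularActivities", "PlacementTraining"]

encoded = toy_x.copy()
for col in yes_no_cols:
    # Encode only the string columns; numeric columns keep their dtype
    encoded[col] = encoded[col].map({"Yes": 1, "No": 0}).astype(int)

print(encoded.dtypes)
```

With this approach the float columns keep their original precision. The accuracies reported later in this notebook come from the truncated version, so they are not directly comparable.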


# Methods

## 📊K-Nearest Neighbors (KNN)

- Non-parametric algorithm
- Classifies a point by majority vote among its k closest training instances in feature space
- Simple implementation, effective for small datasets
- Pros:
  - Intuitive understanding
  - Handles non-linear relationships well
- Cons:
  - Computationally expensive for large datasets
  - Sensitive to noise in data
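
To make the idea concrete, here is a minimal from-scratch sketch of the KNN vote using NumPy. The helper `knn_predict` and the four training points are made up for illustration; the notebook itself uses scikit-learn's `KNeighborsClassifier`.

```python
from collections import Counter

import numpy as np


def knn_predict(train_X, train_y, query, k=3):
    """Hypothetical helper: majority vote among the k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(train_X - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours
    votes = Counter(int(train_y[i]) for i in nearest)
    return votes.most_common(1)[0][0]


# Made-up points: columns are (CGPA, AptitudeTestScore), labels are 0/1
train_X = np.array([[6.0, 60.0], [6.5, 65.0], [8.5, 88.0], [9.0, 90.0]])
train_y = np.array([0, 0, 1, 1])

print(knn_predict(train_X, train_y, np.array([8.8, 89.0]), k=3))
```

Every prediction computes a distance to all training points, which is why the cost grows with the size of the dataset.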


## 📊Logistic Regression

- Parametric algorithm
- Models probability of binary outcomes
- Linear model whose output is passed through a sigmoid function
- Pros:
  - Fast computation
  - Interpretable coefficients
  - Works well with high-dimensional data
- Cons:
  - Assumes a linear relationship between the features and the log-odds of the outcome
  - May overfit if not regularized
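
As a rough illustration of how the probability is formed, the sketch below computes a weighted sum of the features and squashes it with the sigmoid. The weights and bias are invented for the example and are not coefficients learned from this dataset.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Hypothetical helper: probability of the positive class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Made-up parameters for two features (CGPA, AptitudeTestScore)
weights = [0.8, 0.05]
bias = -10.0

p = predict_proba([8.0, 80.0], weights, bias)
label = int(p >= 0.5)  # threshold at 0.5 to get a class label
print(round(p, 3), label)
```

Each weight is the change in log-odds per unit of its feature, which is what makes the coefficients interpretable.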


---

## 📈 Model Evaluation Metrics

* **Confusion Matrix**
  A table showing:

  * True Positives (TP)
  * True Negatives (TN)
  * False Positives (FP)
  * False Negatives (FN)

* **Accuracy**
  Proportion of correct predictions:

  $$
  \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
  $$

* **Precision**
  Correct positive predictions out of all predicted positives:

  $$
  \text{Precision} = \frac{TP}{TP + FP}
  $$
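
The formulas above can be checked by hand on a small example. The sketch below counts TP, TN, FP, and FN for two made-up label lists and derives accuracy and precision from them. It also computes recall, TP / (TP + FN), and F1, the harmonic mean of precision and recall, since both are imported later in this notebook.

```python
# Made-up ground truth and predictions (1 = Placed, 0 = NotPlaced)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, tn, fp, fn)
print(accuracy, precision, recall, f1)
```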

---


# Main Part (Importing Libraries)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

# Splitting the Dataset into Train and Test Sets

In [None]:
train_x,test_x, train_y, test_y = train_test_split(x_refined,y_refined, test_size=0.2, random_state=42)
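
A possible refinement, not used in this notebook, is to pass `stratify=` so that the train and test sets keep the same class proportions. The sketch below shows the effect on synthetic data with a 60/40 class split. The arrays `X` and `y` are generated for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 3 features, 60 negatives and 40 positives
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 60 + [1] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Share of positives in each split
print(y_tr.mean(), y_te.mean())
```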

In [None]:
from IPython.display import display
display(train_x.head(10))
display(train_y.head())
display(test_x.head())
display(test_y.head())

Unnamed: 0,StudentID,CGPA,Internships,Projects,Workshops/Certifications,AptitudeTestScore,SoftSkillsRating,ExtracurricularActivities,PlacementTraining,SSC_Marks,HSC_Marks
9254,9255,8,1,3,3,87,4,1,0,83,88
1561,1562,7,1,2,1,87,3,1,1,61,85
1670,1671,8,1,3,2,86,4,1,1,76,65
6087,6088,7,1,1,0,64,4,0,0,55,59
6669,6670,8,2,3,2,90,4,1,0,80,87
5933,5934,7,1,2,2,86,4,1,1,77,81
8829,8830,8,1,3,2,88,4,1,1,74,88
7945,7946,8,1,3,2,90,4,1,1,70,84
3508,3509,6,1,2,2,83,4,1,0,65,80
2002,2003,8,1,2,2,74,4,1,1,72,81


Unnamed: 0,PlacementStatus
9254,1
1561,0
1670,1
6087,0
6669,1


Unnamed: 0,StudentID,CGPA,Internships,Projects,Workshops/Certifications,AptitudeTestScore,SoftSkillsRating,ExtracurricularActivities,PlacementTraining,SSC_Marks,HSC_Marks
6252,6253,7,0,1,0,80,3,0,0,60,71
4684,4685,7,1,1,0,81,3,1,1,56,58
1731,1732,6,1,1,1,60,4,0,0,55,64
4742,4743,8,1,3,2,90,4,1,1,88,87
4521,4522,8,1,0,2,90,4,1,1,73,88


Unnamed: 0,PlacementStatus
6252,0
4684,0
1731,0
4742,0
4521,1


# Results

## Using Logistic Regression Classifier

In [None]:
classifier = LogisticRegression()

In [None]:
# Train the model with `fit`
classifier.fit(train_x,train_y)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
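
The warning above means the solver stopped at its iteration limit before converging. It suggests two remedies: scaling the features or raising `max_iter`. The sketch below applies both inside a scikit-learn pipeline. It runs on synthetic data from `make_classification`, with one feature deliberately blown up in scale, so its numbers say nothing about the placement dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; one column is given a much larger scale on purpose
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# The scaler is fit on the training data only, then reused at predict time
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

n_iter = model.named_steps["logisticregression"].n_iter_
score = model.score(X_te, y_te)
print("iterations used:", n_iter, "accuracy:", score)
```

Putting the scaler in the pipeline also keeps test-set statistics out of the preprocessing step.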


In [None]:
y_pred_logi = classifier.predict(test_x)
print(y_pred_logi)

[0 0 0 ... 1 1 0]


In [None]:
logi_accuracy = accuracy_score(test_y, y_pred_logi)
print("logistic regression Accuracy:", logi_accuracy)

logistic regression Accuracy: 0.768


## Using KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(train_x, train_y)

In [None]:
y_pred_knn = knn.predict(test_x)
print(y_pred_knn)

[0 1 0 ... 1 0 0]


In [None]:
knn_accuracy = accuracy_score(test_y, y_pred_knn)
print("KNN Accuracy:", knn_accuracy)

KNN Accuracy: 0.739


## Using KNN with Standardized Features

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit the scaler on the training data only, then apply the same transform to the test data
x_train_scaled = scaler.fit_transform(train_x)
x_test_scaled = scaler.transform(test_x)

# Fit the KNN model (default n_neighbors=5)
knn2 = KNeighborsClassifier()
knn2.fit(x_train_scaled, train_y)

# Make predictions
y_pred_norm = knn2.predict(x_test_scaled)

# Evaluate the model
knn_norm = accuracy_score(test_y, y_pred_norm)
print('normalized accuracy= ',knn_norm)
print(classification_report(test_y,y_pred_norm))

normalized accuracy=  0.7685
              precision    recall  f1-score   support

           0       0.80      0.80      0.80      1172
           1       0.72      0.72      0.72       828

    accuracy                           0.77      2000
   macro avg       0.76      0.76      0.76      2000
weighted avg       0.77      0.77      0.77      2000
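
The runs above use the default `n_neighbors=5`. One way to choose k more systematically is cross-validation over a pipeline that scales and then classifies. The sketch below uses synthetic data and an arbitrary candidate list for k, both of which are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the placement dataset
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Scaling happens inside each cross-validation fold, so no fold sees the others' statistics
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

candidate_k = [3, 5, 7, 9, 11]
search = GridSearchCV(pipe, {"knn__n_neighbors": candidate_k}, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```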





---
# ✅ Conclusion

* Logistic Regression achieved an accuracy of **76.8%**
* K-Nearest Neighbors (KNN) achieved **73.9%**
* KNN with feature scaling (standardization) improved to **76.85%**

📌 Feature scaling raised KNN's accuracy by about 3 percentage points (from 73.9% to 76.85%),
bringing it on par with Logistic Regression, which was trained on the unscaled features.

🔍 Feature scaling is essential for distance-based models like KNN.
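
A small numeric sketch of why this happens: with raw values, a wide-ranging column such as `StudentID` dominates the Euclidean distance, while after standardization each column contributes on a comparable scale. The four rows below are made up for illustration.

```python
import numpy as np

# Made-up rows: columns are (StudentID, CGPA)
points = np.array([
    [1.0, 9.0],     # a
    [2.0, 6.0],     # b: neighbouring ID, very different CGPA
    [500.0, 9.0],   # c: distant ID, identical CGPA
    [1000.0, 7.0],
])
a, b, c = points[0], points[1], points[2]

# Raw distances: the ID column decides almost everything
raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)

# Standardize each column (zero mean, unit variance), then measure again
scaled = (points - points.mean(axis=0)) / points.std(axis=0)
scaled_ab = np.linalg.norm(scaled[0] - scaled[1])
scaled_ac = np.linalg.norm(scaled[0] - scaled[2])

print("raw:   ", round(raw_ab, 2), round(raw_ac, 2))
print("scaled:", round(scaled_ab, 2), round(scaled_ac, 2))
```

On the raw values, `b` looks like the nearer neighbour of `a` only because its ID is adjacent. After scaling, `c`, which shares the same CGPA, becomes the nearer one.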


