## GRADUATE EMPLOYABILITY PREDICTOR - BY: Timur, Anish, Sean, Rahul and David

In [1]:
import pandas as pd

In [2]:
# Load Dataset
try:
    df = pd.read_csv('CollegePlacement.csv')
    display(df.head())
except FileNotFoundError:
    print("Error: CollegePlacement.csv not found. Please make sure the file is uploaded to the /content/ directory or specify the correct path.")

Unnamed: 0,College_ID,IQ,Prev_Sem_Result,CGPA,Academic_Performance,Internship_Experience,Extra_Curricular_Score,Communication_Skills,Projects_Completed,Placement
0,CLG0030,107,6.61,6.28,8,No,8,8,4,No
1,CLG0061,97,5.52,5.37,8,No,7,8,0,No
2,CLG0036,109,5.36,5.83,9,No,3,1,1,No
3,CLG0055,122,5.47,5.75,6,Yes,1,6,1,No
4,CLG0004,96,7.91,7.69,7,No,8,10,2,No


## Prepare Data for Model Training

### Subtask:
This step preprocesses the data in four stages: it drops the `College_ID` column, since it is an identifier rather than a feature; encodes `Internship_Experience` and `Placement` from 'Yes'/'No' to 1/0; scales all features using `StandardScaler`; and splits the dataset into training (80%) and testing (20%) sets.


In [3]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# 1. Remove the 'College_ID' column
df = df.drop('College_ID', axis=1)
print("Dropped 'College_ID' column.")

# 2. Convert 'Internship_Experience' and 'Placement' to numerical values
df['Internship_Experience'] = df['Internship_Experience'].map({'Yes': 1, 'No': 0})
df['Placement'] = df['Placement'].map({'Yes': 1, 'No': 0})
print("Encoded 'Internship Experience' and 'Placement' columns.")

# 3. Separate features (X) and target (y)
X = df.drop('Placement', axis=1)
y = df['Placement']
print("Separated features (X) and target (y).")

# 4. Initialize StandardScaler
scaler = StandardScaler()

# 5. Apply StandardScaler to the features DataFrame X
X_scaled = scaler.fit_transform(X)
X = pd.DataFrame(X_scaled, columns=X.columns)
print("Scaled features using StandardScaler.")

# 6. Split the scaled features X and target y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Split data into training (80%) and testing (20%) sets.")

print("\nShape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Dropped 'College_ID' column.
Encoded 'Internship Experience' and 'Placement' columns.
Separated features (X) and target (y).
Scaled features using StandardScaler.
Split data into training (80%) and testing (20%) sets.

Shape of X_train: (8000, 8)
Shape of X_test: (2000, 8)
Shape of y_train: (8000,)
Shape of y_test: (2000,)
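
The cell above fits `StandardScaler` on all rows before splitting, so the test rows contribute to the scaling statistics. A common alternative is to split first and let a pipeline fit the scaler on the training rows only. The sketch below shows that pattern on synthetic data from `make_classification`, used here as a stand-in for `CollegePlacement.csv`; the `_demo`, `X_tr` and `X_te` names are hypothetical, and the results reported in this notebook were not produced this way.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in data (1000 rows, 8 features, ~16% positives).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, n_informative=5,
    weights=[0.84, 0.16], random_state=42,
)

# Split first; stratify so both sets keep a similar class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The pipeline fits the scaler on the training rows only, then reuses
# those statistics when it transforms the test rows.
pipe_demo = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
pipe_demo.fit(X_tr, y_tr)

demo_accuracy = pipe_demo.score(X_te, y_te)
print(f"Demo test accuracy: {demo_accuracy:.4f}")
```

With a large dataset the difference between the two approaches is often small, but the pipeline keeps the test set fully unseen during preprocessing.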


In [4]:
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
model = LogisticRegression(random_state=42, max_iter=200)

# Train the model using the training data
model.fit(X_train, y_train)

print("Logistic Regression model trained successfully.")

Logistic Regression model trained successfully.


In [14]:
from sklearn.metrics import accuracy_score, recall_score, f1_score, confusion_matrix

# 1. Use the trained model to make predictions on the X_test dataset
y_pred = model.predict(X_test)
print("Predictions made on the test set.")

# 2. Calculate the accuracy, recall, and F1-score of the model
accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Logistic Regression Performance Metrics:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

# 3. Generate the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"\nConfusion Matrix:\n{conf_matrix}")

Predictions made on the test set.
Logistic Regression Performance Metrics:
Accuracy: 0.9035
Recall: 0.6074
F1-Score: 0.6723

Confusion Matrix:
[[1609   65]
 [ 128  198]]


## Summary:

### Data Analysis Key Findings

*   **Data Preparation**:
    *   The `College_ID` column was removed, and the categorical fields `Internship_Experience` and `Placement` were successfully converted from Yes/No to 1/0.
    *   All features were scaled using `StandardScaler` to bring them onto a consistent scale.
    *   The dataset was then split into an 80 percent training set and a 20 percent testing set, giving `X_train` with 8000 samples and `X_test` with 2000 samples.
*   **Model Training**: A Logistic Regression model was created and trained on the processed training data without any issues.
*   **Model Evaluation**: The model's performance on the test set produced the following results:
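    *   **Accuracy**: 0.9035, meaning the model correctly predicted 90.35 percent of test cases. For context, always predicting "no placement" would score 83.7 percent, because 1674 of the 2000 test students were not placed.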
    *   **Recall**: 0.6074, indicating that it correctly captured about 60 percent of the students who were actually placed.
    *   **F1-Score**: 0.6723, which summarizes the balance between precision and recall.
    *   **Confusion Matrix**:
        *   True Negatives (correctly predicted no placement): 1609
        *   False Positives (incorrectly predicted placement): 65
        *   False Negatives (incorrectly predicted no placement when there was one): 128
        *   True Positives (correctly predicted placement): 198
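
Each reported metric can be re-derived from the four confusion-matrix counts listed above. The sketch below does this in plain Python; the variable names, the precision figure and the majority-class baseline are illustrative additions, not output of the notebook.

```python
# Counts copied from the Logistic Regression confusion matrix above.
tn, fp, fn, tp = 1609, 65, 128, 198
total = tn + fp + fn + tp

accuracy = (tp + tn) / total          # share of all predictions that are correct
recall = tp / (tp + fn)               # share of placed students that were found
precision = tp / (tp + fp)            # share of predicted placements that were right
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial model that always predicts "not placed".
baseline_accuracy = (tn + fp) / total

print(f"Accuracy:  {accuracy:.4f}")           # 0.9035
print(f"Recall:    {recall:.4f}")             # 0.6074
print(f"Precision: {precision:.4f}")          # 0.7529
print(f"F1-Score:  {f1:.4f}")                 # 0.6723
print(f"Baseline:  {baseline_accuracy:.4f}")  # 0.8370
```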

### Insights

*   The model reaches high accuracy but lower recall: it misses 128 of the 326 students who were actually placed (about 39 percent). Examining the false negatives or testing other classification methods could help improve recall.
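
Two low-cost ways to explore the recall trade-off with the same Logistic Regression are lowering the decision threshold and reweighting the classes. The sketch below runs on synthetic stand-in data from `make_classification`, not on the placement dataset; the 0.3 threshold and all variable names are arbitrary illustrative choices, and whether either option helps here would need to be checked against precision.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Assumption: synthetic imbalanced data standing in for the real features.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, n_informative=5,
    weights=[0.84, 0.16], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Reference model with the default 0.5 decision threshold.
plain = LogisticRegression(max_iter=200).fit(X_tr, y_tr)
recall_default = recall_score(y_te, plain.predict(X_te))

# Option 1: flag a student as "placed" at a lower probability (0.3 is arbitrary).
proba = plain.predict_proba(X_te)[:, 1]
recall_low_threshold = recall_score(y_te, (proba >= 0.3).astype(int))

# Option 2: weight errors on the minority class more heavily during training.
balanced = LogisticRegression(class_weight="balanced", max_iter=200).fit(X_tr, y_tr)
recall_balanced = recall_score(y_te, balanced.predict(X_te))

print(f"Recall, default threshold: {recall_default:.4f}")
print(f"Recall, threshold 0.3:     {recall_low_threshold:.4f}")
print(f"Recall, balanced weights:  {recall_balanced:.4f}")
```

Lowering the threshold can only keep or raise recall, since every student flagged at 0.5 is still flagged at 0.3, but it usually costs precision, so both numbers should be compared before choosing a setting.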


## Decision Tree Classifier Model

In [10]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, recall_score, f1_score, confusion_matrix

# 1. Create and train decision tree model
dt_model = DecisionTreeClassifier(
    criterion='gini',        # default split metric
    max_depth=None,          # allow full depth unless you want to tune
    random_state=42
)

dt_model.fit(X_train, y_train)
print("Decision Tree model trained successfully.")

# 2. Make predictions
dt_y_pred = dt_model.predict(X_test)
print("Predictions made using Decision Tree on the test set.")


Decision Tree model trained successfully.
Predictions made using Decision Tree on the test set.


In [15]:
# 3. Evaluate performance
dt_accuracy = accuracy_score(y_test, dt_y_pred)
dt_recall = recall_score(y_test, dt_y_pred)
dt_f1 = f1_score(y_test, dt_y_pred)

print("\nDecision Tree Performance Metrics:")
print(f"Accuracy: {dt_accuracy:.4f}")
print(f"Recall: {dt_recall:.4f}")
print(f"F1-Score: {dt_f1:.4f}")

# 4. Confusion matrix
dt_conf_matrix = confusion_matrix(y_test, dt_y_pred)
print(f"\nConfusion Matrix:\n{dt_conf_matrix}")


Decision Tree Performance Metrics:
Accuracy: 1.0000
Recall: 1.0000
F1-Score: 1.0000

Confusion Matrix:
[[1674    0]
 [   0  326]]
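
A perfect score on every metric is unusual, so it is worth confirming before relying on it. Possible explanations include a label that is a deterministic function of the features, or identical rows appearing in both the training and test sets; neither can be verified from the output above. A minimal sketch of two checks, assuming a feature DataFrame and a label Series shaped like `X` and `y` in this notebook. The helper name `sanity_check_tree` and the toy data are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def sanity_check_tree(X, y, cv=5):
    """Return (mean cross-validated accuracy, number of duplicated rows)."""
    # Cross-validation scores the tree on several different held-out folds.
    scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=cv)
    # Count rows whose features and label repeat an earlier row exactly.
    n_duplicates = int(X.assign(_target=y.values).duplicated().sum())
    return scores.mean(), n_duplicates

# Toy data (hypothetical): ten distinct rows, each repeated once.
toy_X = pd.DataFrame({"a": list(range(10)) * 2, "b": [1, 0] * 10})
toy_y = pd.Series([0] * 5 + [1] * 5 + [0] * 5 + [1] * 5)

mean_acc, n_dup = sanity_check_tree(toy_X, toy_y, cv=2)
print(f"Mean CV accuracy: {mean_acc:.4f}, duplicated rows: {n_dup}")
```

On the real data the call would be `sanity_check_tree(X, y)`. A high duplicate count or a cross-validated accuracy that stays at 1.0 would each point toward one of the explanations above.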
