Purpose
----------------

In this notebook we learn about data leakage and its effect on model results: leakage produces overly optimistic scores that do not match how the model would perform in the real world.

We can define leakage as “when information concerning the ground truth is artificially and unintentionally introduced within the training feature data, or training metadata”, as stated by Michael Kim.

Data
----------------

The data comes from a publicly available dataset on Kaggle:
https://www.kaggle.com/datasets/mahad049/job-placement-dataset

The data describes the job placement of university students. The model is supposed to predict which students will get a job placement and which will not.
We will score the predictions and look for ways to mitigate leakage.


| variable              | description                                                                                          |
|-----------------------|------------------------------------------------------------------------------------------------------|
| `id`                  | unique identifier for each student.                                                                  |
| `name`                | name of the student.                                                                                 |
| `gender`              | gender of the student: male or female.                                                               |
| `age`                 | age of the student.                                                                                  |
| `degree`              | type of degree: Bachelor's.                                                                          |
| `stream`              | the speciality the student studied at university, for example Computer Science, Electrical Engineering, Mechanical Engineering. |
| `college_name`        | the name of the university where the student studied.                                                |
| `placement_status`    | whether the student got a job placement or not.                                                      |
| `salary`              | the salary per year for each student with a job placement.                                           |
| `gpa`                 | the student's academic standing under the university GPA grading system.                             |
| `years_of_experience` | total number of years of work experience.                                                            |


Setup
----------------

We will start by importing relevant libraries, setting up our notebook, reading in the data, and checking that it was loaded correctly.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

import matplotlib.pyplot as plt
import random
%matplotlib inline
import seaborn as sns

The dataset is publicly available on Kaggle (see the link in the Data section above).

It describes students and whether they got a job placement, along with characteristics such as gender, university, stream and work experience.

In [2]:
df = pd.read_csv('job_placement.csv',index_col="id")

df.head()


| id | name | gender | age | degree | stream | college_name | placement_status | salary | gpa | years_of_experience |
|----|------|--------|-----|--------|--------|--------------|------------------|--------|-----|---------------------|
| 1 | John Doe | Male | 25 | Bachelor's | Computer Science | Harvard University | Placed | 60000 | 3.7 | 2.0 |
| 2 | Jane Smith | Female | 24 | Bachelor's | Electrical Engineering | Massachusetts Institute of Technology | Placed | 65000 | 3.6 | 1.0 |
| 3 | Michael Johnson | Male | 26 | Bachelor's | Mechanical Engineering | Stanford University | Placed | 58000 | 3.8 | 3.0 |
| 4 | Emily Davis | Female | 23 | Bachelor's | Information Technology | Yale University | Not Placed | 0 | 3.5 | 2.0 |
| 5 | David Brown | Male | 24 | Bachelor's | Computer Science | Princeton University | Placed | 62000 | 3.9 | 2.0 |


Preprocessing Data
----------------

We explore the data by checking for missing values.

We also check the data types of the entries to see whether we need encoding to convert non-numerical data to numerical data, and to decide how to fill any missing data.

In [9]:
# Check for missing data
missing_data = df.isnull().sum()

missing_data

name                   0
gender                 0
age                    0
degree                 0
stream                 0
college_name           0
placement_status       0
salary                 0
gpa                    0
years_of_experience    0
dtype: int64

In [10]:
#checking the data types
df.dtypes

name                    object
gender                  object
age                      int64
degree                  object
stream                  object
college_name            object
placement_status        object
salary                   int64
gpa                    float64
years_of_experience    float64
dtype: object

Here we look for the rows with missing data so that we can see the other features associated with them and decide how to fill them. (The reasons for replacing the missing data instead of dropping it are given under Cleaning data.)

In [11]:
missing_rows = df[df.isnull().any(axis=1)]
missing_rows


| id | name | gender | age | degree | stream | college_name | placement_status | salary | gpa | years_of_experience |
|----|------|--------|-----|--------|--------|--------------|------------------|--------|-----|---------------------|


**Cleaning data**

For the missing data in our dataset we chose to impute rather than drop, for the following reasons:
 - Preserve data, since our dataset is quite small.
 - The record with the missing value matters to the analysis because of the small size of our dataset.

The student associated with the missing data has been placed and has a salary, which suggests they have some work experience.

We compute the mean, median and mode of years_of_experience for students who studied Mechanical Engineering, the same stream as our student, to understand the feature better and decide how to fill the missing value.

In [14]:
mechanical_engineering_df = df[df['stream'] == 'Mechanical Engineering']
mean_experience = mechanical_engineering_df['years_of_experience'].mean()
median_experience = mechanical_engineering_df['years_of_experience'].median()
mode_experience = mechanical_engineering_df['years_of_experience'].mode()[0]

print("Mean experience:", mean_experience)
print("Median experience:", median_experience)
print("Mode experience:", mode_experience)


Mean experience: 2.118181818181818
Median experience: 2.0
Mode experience: 3.0


Here we clean the data by replacing the missing entry with the mode of years_of_experience so that our model can be trained.

Step 1: Fill missing data with pandas

We used the mode to fill the missing value because it is the most frequently occurring value in that column.
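As a small illustration of mode imputation, the sketch below fills one missing value in a toy series. The values are made up for the example and are not taken from the dataset; the result is assigned back rather than relying on `inplace=True`.

```python
import numpy as np
import pandas as pd

# Toy series with one missing value (illustrative values only)
experience = pd.Series([1.0, 3.0, 2.0, 3.0, np.nan])

# mode() returns a Series because ties are possible; [0] takes the first mode
most_common = experience.mode()[0]

# Assign the filled result back instead of modifying in place
filled = experience.fillna(most_common)
print(filled.tolist())
```

Here the missing entry takes the value 3.0, the only value that occurs twice.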

In [4]:
df['years_of_experience'] = df['years_of_experience'].fillna(df['years_of_experience'].mode()[0])

Let's confirm the missing data is filled.

In [20]:
row_545 = df.iloc[544]
row_545


name                                         Sophia Johnson
gender                                               Female
age                                                      24
degree                                           Bachelor's
stream                               Mechanical Engineering
college_name           University of California--Santa Cruz
placement_status                                     Placed
salary                                                60000
gpa                                                     3.7
years_of_experience                                     3.0
Name: 545, dtype: object

**Step 2: Convert non-numerical data to numerical data**


At this point we have to choose the features that will help our model predict whether a student has been placed or not.

The features that we will use: gender, age, stream, gpa, years_of_experience.

We are predicting: placement_status of the students.

*Encoding*

We are going to use encoding to change our categorical variables into numerical values that machine learning algorithms can work with.

We use OneHotEncoder from the sklearn preprocessing module to help us handle categorical data in machine learning tasks.

The idea behind OneHotEncoder is to create a binary vector for each value of a categorical variable.
Each binary vector has a length equal to the number of unique categories in the variable. The vector contains all zeros except for a single one at the index corresponding to the category it represents.
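To make the binary-vector idea concrete, here is a minimal sketch on a toy column. The four values are illustrative and stand in for the `stream` column; they are not read from the dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy column with three distinct categories (illustrative values)
streams = np.array([["Computer Science"],
                    ["Mechanical Engineering"],
                    ["Electrical Engineering"],
                    ["Computer Science"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense
encoded = encoder.fit_transform(streams).toarray()

# Categories are stored in sorted order; each row contains a single 1
print(encoder.categories_[0])
print(encoded)
```

Each row has length three (one slot per category) and exactly one entry equal to 1.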

In [16]:
df.head()

| id | name | gender | age | degree | stream | college_name | placement_status | salary | gpa | years_of_experience |
|----|------|--------|-----|--------|--------|--------------|------------------|--------|-----|---------------------|
| 1 | John Doe | Male | 25 | Bachelor's | Computer Science | Harvard University | Placed | 60000 | 3.7 | 2.0 |
| 2 | Jane Smith | Female | 24 | Bachelor's | Electrical Engineering | Massachusetts Institute of Technology | Placed | 65000 | 3.6 | 1.0 |
| 3 | Michael Johnson | Male | 26 | Bachelor's | Mechanical Engineering | Stanford University | Placed | 58000 | 3.8 | 3.0 |
| 4 | Emily Davis | Female | 23 | Bachelor's | Information Technology | Yale University | Not Placed | 0 | 3.5 | 2.0 |
| 5 | David Brown | Male | 24 | Bachelor's | Computer Science | Princeton University | Placed | 62000 | 3.9 | 2.0 |


At this point, the non-numerical features we are going to use are stream and gender.

In [17]:
from sklearn.preprocessing import  OneHotEncoder
from sklearn.compose import ColumnTransformer


categorical_features = ["gender","stream"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                  remainder="passthrough")

transformed_X = transformer.fit_transform(df)
transformed_X

array([[0.0, 1.0, 1.0, ..., 60000, 3.7, 2.0],
       [1.0, 0.0, 0.0, ..., 65000, 3.6, 1.0],
       [0.0, 1.0, 0.0, ..., 58000, 3.8, 3.0],
       ...,
       [0.0, 1.0, 1.0, ..., 65000, 3.8, 3.0],
       [1.0, 0.0, 0.0, ..., 66000, 3.7, 2.0],
       [0.0, 1.0, 0.0, ..., 0, 3.6, 1.0]], dtype=object)

Now that we have encoded our data into numerical values that the machine learning model can interpret, let us see what it looks like.

In [18]:
pd.DataFrame(transformed_X)

|     | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|-----|---|---|---|---|---|---|---|---|---|---|----|----|----|----|----|
| 0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | John Doe | 25 | Bachelor's | Harvard University | Placed | 60000 | 3.7 | 2.0 |
| 1 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | Jane Smith | 24 | Bachelor's | Massachusetts Institute of Technology | Placed | 65000 | 3.6 | 1.0 |
| 2 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | Michael Johnson | 26 | Bachelor's | Stanford University | Placed | 58000 | 3.8 | 3.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | Emily Davis | 23 | Bachelor's | Yale University | Not Placed | 0 | 3.5 | 2.0 |
| 4 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | David Brown | 24 | Bachelor's | Princeton University | Placed | 62000 | 3.9 | 2.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 695 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | Lucas Taylor | 23 | Bachelor's | University of Washington | Placed | 67000 | 3.8 | 3.0 |
| 696 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | Emma Martinez | 26 | Bachelor's | University of California--Berkeley | Placed | 66000 | 3.9 | 3.0 |
| 697 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | Aiden Davis | 24 | Bachelor's | University of Illinois--Urbana-Champaign | Placed | 65000 | 3.8 | 3.0 |
| 698 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | Mia Wilson | 23 | Bachelor's | University of Colorado--Boulder | Placed | 66000 | 3.7 | 2.0 |


We drop the columns we will not use when training our model (7: name, 9: degree, 10: college_name) and create a new DataFrame with the relevant data.

In [19]:
df2=pd.DataFrame(transformed_X)
df2.drop([7,9,10],axis =1,inplace=True)
df2.head()


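|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 8 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|----|----|----|----|
| 0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 25 | Placed | 60000 | 3.7 | 2.0 |
| 1 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 24 | Placed | 65000 | 3.6 | 1.0 |
| 2 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 26 | Placed | 58000 | 3.8 | 3.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 23 | Not Placed | 0 | 3.5 | 2.0 |
| 4 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 24 | Placed | 62000 | 3.9 | 2.0 |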


We still have to encode placement_status, which ended up in column 11 after the transformation.

We will use the LabelEncoder class from sklearn.preprocessing to convert this categorical data into numerical data.
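As a quick sketch of what LabelEncoder does with the two placement labels, the toy list below (illustrative, not read from the dataset) is mapped to integers. The encoder stores its classes in sorted order, so "Not Placed" becomes 0 and "Placed" becomes 1.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels using the same two strings as placement_status
labels = ["Placed", "Not Placed", "Placed"]

label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(labels)

# classes_ lists the original labels in the order of their integer codes
print(list(label_encoder.classes_))
print(encoded_labels.tolist())
```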

In [20]:
from sklearn.preprocessing import LabelEncoder

# Create an instance of LabelEncoder
encoder = LabelEncoder()

# Fit the encoder to the values in column 11
encoder.fit(df2[11])

# Transform the values in column 11 using the fitted encoder
df2[11] = encoder.transform(df2[11])

df2.head()

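|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 8 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|----|----|----|----|
| 0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 25 | 1 | 60000 | 3.7 | 2.0 |
| 1 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 24 | 1 | 65000 | 3.6 | 1.0 |
| 2 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 26 | 1 | 58000 | 3.8 | 3.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 23 | 0 | 0 | 3.5 | 2.0 |
| 4 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 24 | 1 | 62000 | 3.9 | 2.0 |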


Step 3: Split data into features and labels

A feature is a measurable characteristic that is fed into the model to make predictions.

A label is the target variable we want to predict. In our case it is the job placement status of the students.

In [21]:
# Split the data into X and Y
X = df2.drop([11],axis=1)
Y = df2[[11]]


Y.head()

|   | 11 |
|---|----|
| 0 | 1 |
| 1 | 1 |
| 2 | 1 |
| 3 | 0 |
| 4 | 1 |


Step 4: Convert non-numerical data to numerical data

The categorical columns were already encoded in Step 2, so no further conversion is needed here.

Step 5: Split data into training and test sets

We split the encoded data into a training set (80%) and a test set (20%). The training set is used to fit the model, while the test set is used to provide an unbiased evaluation of the model's performance on unseen data.

We used the train_test_split function from the sklearn.model_selection module to split our data.
We seed NumPy's random number generator with np.random.seed(42) so that the random split is reproducible from run to run; passing the random_state parameter to train_test_split would also make the split reproducible.
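The effect of fixing the seed can be sketched on toy data: two calls with the same `random_state` return the same split. The arrays `X_toy` and `y_toy` are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with one feature each (illustrative only)
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.array([0, 1] * 5)

# The same random_state yields the same split on every call
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

X_train_a, X_test_a = split_a[0], split_a[1]
X_train_b, X_test_b = split_b[0], split_b[1]

print("Test rows:", X_test_a.ravel())
print("Identical splits:", np.array_equal(X_test_a, X_test_b))
```

A fixed seed makes results repeatable; it does not by itself reduce bias or leakage.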


In [22]:
np.random.seed(42) 
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2)


In [23]:
Y_train.head()

|     | 11 |
|-----|----|
| 82 | 1 |
| 51 | 1 |
| 220 | 0 |
| 669 | 1 |
| 545 | 0 |


Step 6: Fit the model to the training data

In [24]:
model = RandomForestClassifier()
# Y_train is a one-column DataFrame; flatten it to 1-D to avoid a DataConversionWarning
model.fit(X_train, Y_train.values.ravel())


Step 7: Score the model on the test data

In [25]:
model.score(X_test,Y_test)

1.0

Question 1
_ _ _ _ _ _ _ _ _ 

What machine learning model are we likely to use with this dataset, and why?

We are going to use a classification model because we want to predict the placement status of students, which is a categorical variable, making this a classification task.

Here is a cheatsheet to help in understanding more about this:
https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

Question 2
_ _ _ _ _ 
What feature is likely to be a source of leakage, and why?

Question 3
- - - - - - - -
This classification problem has achieved a performance score of 100% on the test data in this experiment. What should your next step be to identify and fix the problem?
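One possible diagnostic, offered as a sketch rather than the intended answer, is to score each feature on its own with a very simple model: a single feature that predicts the target almost perfectly deserves a closer look. The toy frame below is made up and only mimics the notebook's column names; the one-split decision tree is an assumption chosen for simplicity.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy frame mimicking the notebook's columns (values are made up)
toy = pd.DataFrame({
    "gpa":    [3.7, 3.6, 3.8, 3.5, 3.9, 3.4, 3.6, 3.7, 3.5, 3.8],
    "salary": [60000, 65000, 58000, 0, 62000, 0, 61000, 0, 0, 0],
    "placed": [1, 1, 1, 0, 1, 0, 1, 0, 0, 0],
})

# Score each feature on its own with a one-split tree
solo_scores = {}
for column in ["gpa", "salary"]:
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    scores = cross_val_score(stump, toy[[column]], toy["placed"], cv=5)
    solo_scores[column] = scores.mean()

print(solo_scores)
```

In this toy data `salary` is zero exactly when the student is not placed, so it separates the classes on its own.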

Question 4
_ _ _ _ _ _ _

For the different causes of leakage, how can we fix the problem so that our score is less optimistic?

**1. pre-processing before splitting into training/test sets**

_ _ _ _ _ _

*Solution*

We call the fit_transform() method of OneHotEncoder on the training data only, so that the encoder does not learn anything from the test data, which is supposed to stay hidden.


This ensures that the encoding is learned only from the training data and not from the test data, preventing data leakage.

After learning the encoding from the training data, the same encoder is used to transform the test data using
the transform() method. 

This ensures consistency in the encoding process between the two datasets.

By separating the encoding process into two steps (fitting on training data and transforming on test data), we are following the best practice of preventing data leakage and ensuring that the model is evaluated on unseen data.
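The fit-on-train, transform-on-test pattern can also be bundled into a scikit-learn Pipeline, which applies it automatically. The sketch below is an alternative arrangement, not the notebook's own cell; the toy frame is made up and only borrows the notebook's column names.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data with the notebook's column names (values are made up)
toy = pd.DataFrame({
    "gender": ["Male", "Female"] * 10,
    "stream": ["Computer Science", "Mechanical Engineering",
               "Electrical Engineering", "Information Technology"] * 5,
    "gpa": [3.5 + 0.02 * i for i in range(20)],
    "placement_status": [1, 1, 0, 1] * 5,
})
X = toy.drop(columns="placement_status")
y = toy["placement_status"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10)

# Encode only the categorical columns; numeric columns pass through unchanged
preprocess = ColumnTransformer(
    [("one_hot", OneHotEncoder(handle_unknown="ignore"), ["gender", "stream"])],
    remainder="passthrough")

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(random_state=0)),
])

# fit() learns the encoding from X_train only;
# score() reuses that fitted encoder on X_test
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

Keeping the encoder inside the pipeline also means that cross-validation helpers refit it on each training fold.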

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Remove duplicated rows
df = df.drop_duplicates()

df['years_of_experience'] = df['years_of_experience'].fillna(df['years_of_experience'].mode()[0])


# Define your features and target
X = df.drop("placement_status", axis=1)
Y = df["placement_status"]

# Define a mapping for placement status
placement_status_mapping = {"Placed": 1, "Not Placed": 0}

# Apply the mapping to the Y series
Y = Y.map(placement_status_mapping)

# Split your data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=10)

# Initialize the encoder
encoder = OneHotEncoder(handle_unknown='ignore')

# Fit and transform the encoder on the training set
X_train_encoded = encoder.fit_transform(X_train)

# Transform the test set using the same encoder
X_test_encoded = encoder.transform(X_test)

# Initialize and train your model
model = RandomForestClassifier()
model.fit(X_train_encoded, Y_train)

# Score the model on the test data
score = model.score(X_test_encoded, Y_test)

print("Model score: ", score)


Model score:  1.0


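A run of this cell scored 0.992857, slightly below the original 1.0, while the run shown above scores 1.0; the random forest is not seeded, so the score varies between runs. Note that `salary` is still among the features in this cell, which likely explains why the score stays so close to 1.0.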

**2. feature selection before splitting into training/test sets**
_ _ _ _ _ _ 

*Solution*

To prevent feature selection from accessing information from the test set during training, you should perform feature selection only on the training data and then apply the same feature selection to the test data.

We perform feature selection only on the training data, after fitting the model on it. Then we reduce both the training and test sets to the selected features before training and evaluating the model, respectively. This ensures that no information from the test set leaks into feature selection.

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel



# Define your features and target
X = df.drop(["placement_status","salary"], axis=1)
Y = df["placement_status"]

# Define a mapping for placement status
placement_status_mapping = {"Placed": 1, "Not Placed": 0}

# Apply the mapping to the Y series
Y = Y.map(placement_status_mapping)

# Split your data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=10)

# Initialize the encoder
encoder = OneHotEncoder(handle_unknown='ignore')

# Fit and transform the encoder on the training set
X_train_encoded = encoder.fit_transform(X_train)

# Transform the test set using the same encoder
X_test_encoded = encoder.transform(X_test)

# Initialize the model
model = RandomForestClassifier()

# Fit the model on the training set
model.fit(X_train_encoded, Y_train)



We use the SelectFromModel class, which is a meta-transformer for selecting features based on importance weights.


We pass the model as the estimator parameter, which is the base estimator used for feature selection.
The fit method of SelectFromModel is called on the training data to fit the underlying estimator on it.

The get_support method is called to retrieve a boolean mask in which True marks the selected features.

The training and test sets are then reduced to the selected features, keeping only the features that the selection step deemed important.

The underlying estimator is then trained on the selected features using the fit method, so that the model is trained only on the relevant features.

Finally we print the score.

In [10]:
# Perform feature selection on the training set
selector = SelectFromModel(model, threshold='median')
selector.fit(X_train_encoded, Y_train)

# Get selected features
selected_features = selector.get_support()

# Transform the training and test set using selected features
X_train_selected = X_train_encoded[:, selected_features]
X_test_selected = X_test_encoded[:, selected_features]

# Train the model on the selected features
model.fit(X_train_selected, Y_train)

# Score the model on the test data
score = model.score(X_test_selected, Y_test)

print("Model score: ", score)

Model score:  1.0


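A run of this cell scored 0.97857, which is lower than the original 1.0 (the run shown above scores 1.0; the unseeded random forest gives slightly different scores from run to run). A lower score here is consistent with reduced leakage, since `salary` has been dropped and the features are selected on the training data only.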
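Train-only feature selection can also be expressed as a pipeline step, so that the selector is never fitted on test rows. The sketch below uses synthetic data in which feature 0 alone determines the label; the data, the seeds, and the pipeline layout are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic data: feature 0 determines the label, the other five are noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10)

pipeline = Pipeline([
    # The selector is fitted on the training data only
    ("select", SelectFromModel(RandomForestClassifier(random_state=0),
                               threshold="median")),
    ("model", RandomForestClassifier(random_state=0)),
])
pipeline.fit(X_train, y_train)

# Boolean mask of the features kept by the selector
mask = pipeline.named_steps["select"].get_support()
test_accuracy = pipeline.score(X_test, y_test)
print("Kept features:", np.flatnonzero(mask))
print("Test accuracy:", test_accuracy)
```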

**3. Duplicated data in both test and training sets**
_ _ _ _ _ _
Here we remove the duplicated rows from the DataFrame using the drop_duplicates() method from pandas.

We do this before splitting our data into training and test sets, so that the same row cannot appear in both sets and inflate the performance estimate.

In [58]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
import pandas as pd

# Remove duplicated rows
df = df.drop_duplicates()

# Define your features and target
X = df.drop(["placement_status","salary"], axis=1)
Y = df["placement_status"]

# Define a mapping for placement status
placement_status_mapping = {"Placed": 1, "Not Placed": 0}

# Apply the mapping to the Y series
Y = Y.map(placement_status_mapping)

# Split your data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=10)

# Initialize the encoder
encoder = OneHotEncoder(handle_unknown='ignore')

# Fit and transform the encoder on the training set
X_train_encoded = encoder.fit_transform(X_train)

# Transform the test set using the same encoder
X_test_encoded = encoder.transform(X_test)

# Initialize the model
model = RandomForestClassifier()

# Fit the model on the training set
model.fit(X_train_encoded, Y_train)

# Score the model on the test data
score = model.score(X_test_encoded, Y_test)

print("Model score: ", score)

Model score:  0.9785714285714285


The score in the run shown above is 0.978571 (another run gave 0.992857), which is lower than 1.0 and consistent with reduced leakage.
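Before dropping duplicates it can help to count them. The sketch below does this on a toy frame with one repeated row; the values are made up and are not taken from the dataset.

```python
import pandas as pd

# Toy frame in which the third row repeats the first (values are made up)
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female"],
    "gpa": [3.7, 3.6, 3.7, 3.5],
    "placement_status": ["Placed", "Placed", "Placed", "Not Placed"],
})

# duplicated() flags every repeat after the first occurrence
n_duplicates = int(toy.duplicated().sum())
print("Duplicated rows:", n_duplicates)

# Drop the repeats before any train/test split
deduplicated = toy.drop_duplicates()
print("Rows before:", len(toy), "after:", len(deduplicated))
```

Note that drop_duplicates() only removes rows that match in every column, so rows that differ in an identifier such as `name` are kept.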

**4. Temporal leakage**
_ _ _ _ _ _ _ 

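Here we treat the data as having a temporal structure, meaning the order of the observations matters and is meaningful. The dataset has no explicit date column, so this relies on the assumption that the rows are stored in chronological order.

Under that assumption, we can perform time series cross-validation to ensure that the model evaluation does not leak information from the future into the past.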
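To see how time series cross-validation keeps the future out of the training data, the sketch below prints the fold indices for eight toy samples. The sample count and `n_splits=3` are chosen for readability and are not the notebook's settings.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Eight time-ordered samples (illustrative only)
X_toy = np.arange(8).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X_toy))
for train_index, test_index in folds:
    # Every training index precedes every test index
    print("train:", train_index, "test:", test_index)
```

The training window grows with each fold, while the test fold always lies after it.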



In [8]:
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

# Handle missing values
df['years_of_experience'] = df['years_of_experience'].fillna(0)

# Remove duplicated rows
df = df.drop_duplicates()

# Define your features and target
X = df.drop(["placement_status","salary"], axis=1)
Y = df["placement_status"]

# Define a mapping for placement status
placement_status_mapping = {"Placed": 1, "Not Placed": 0}

# Apply the mapping to the Y series
Y = Y.map(placement_status_mapping)

# Initialize the encoder
encoder = OneHotEncoder(handle_unknown='ignore')

# Fit the encoder on the entire dataset
# (note: this lets the encoder see rows that later serve as test folds)
X_encoded = encoder.fit_transform(X)



TimeSeriesSplit is initialized with n_splits=20, so it produces 20 successive train/test splits for cross-validation; in each split the test fold comes after all of the training samples.

A for loop is used to iterate over each fold generated by the TimeSeriesSplit object. In each iteration, the dataset is split into training and testing sets using the indices obtained from the split() method.

We initialized a list called scores to store the accuracy scores of each fold.

The training and testing sets for both the features (X_train, X_test) and the target variable (Y_train, Y_test) are extracted from the encoded dataset and the target variable Y. The model is trained on the training data using the fit() method.

The model is used to predict the target variable for the testing data using the predict() method.

The accuracy of the model is evaluated by comparing the predicted values (Y_pred) with the actual values (Y_test) using the accuracy_score() function from the sklearn.metrics module.

After all the folds have been processed, the average model accuracy is calculated by summing up all the scores and dividing by the number of folds.



In [9]:
# Initialize TimeSeriesSplit
# You can adjust the number of splits as needed
tscv = TimeSeriesSplit(n_splits=20)  

scores = []
for train_index, test_index in tscv.split(X_encoded):
    X_train, X_test = X_encoded[train_index], X_encoded[test_index]
    Y_train, Y_test = Y.iloc[train_index], Y.iloc[test_index]

    # Initialize the model
    model = RandomForestClassifier()
    # Train your model
    model.fit(X_train, Y_train)

    # Predict on the test data
    Y_pred = model.predict(X_test)

    # Evaluate the model using accuracy score
    score = accuracy_score(Y_test, Y_pred)
    scores.append(score)


print("Average Model Accuracy:", sum(scores) / len(scores))

Test Score: 1.0
Average Model Accuracy: 0.9636363636363636


The expression sum(scores) adds up all the scores in the list, and len(scores) returns the length of the list, which is the total number of scores. Dividing the sum by the number of scores gives the average model accuracy.

The average accuracy is about 0.96 (0.9636 in the run shown above), which is a bit lower than 1.0 and consistent with reduced leakage.

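The code below shows k-fold cross-validation. In this case, k is set to 10 (cv=10), which means the validation set is split into 10 roughly equal parts. A copy of the model is then trained and evaluated 10 times, each time using a different one of the 10 parts for validation and the remaining 9 parts for training. The average of the 10 validation scores is then used as the overall validation score. Because cross_val_score refits a fresh copy of the model in each round, these scores do not measure the model that was fitted on the training set; that model is scored separately on the test set.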

In [6]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

# Remove duplicated rows
df = df.drop_duplicates()

# Define your features and target
X = df.drop(["placement_status", "salary"], axis=1)
Y = df["placement_status"]

# Define a mapping for placement status
placement_status_mapping = {"Placed": 1, "Not Placed": 0}

# Apply the mapping to the Y series
Y = Y.map(placement_status_mapping)

# Split your data into training, validation, and test sets
X_train, X_temp, Y_train, Y_temp = train_test_split(X, Y, test_size=0.3, random_state=10)
X_val, X_test, Y_val, Y_test = train_test_split(X_temp, Y_temp, test_size=0.5, random_state=10)

# Initialize the encoder
encoder = OneHotEncoder(handle_unknown='ignore')

# Fit and transform the encoder on the training set
X_train_encoded = encoder.fit_transform(X_train)

# Transform the validation and test sets using the same encoder
X_val_encoded = encoder.transform(X_val)
X_test_encoded = encoder.transform(X_test)

# Initialize and train your model
model = RandomForestClassifier()
model.fit(X_train_encoded, Y_train)

# Use cross-validation to evaluate the model
scores = cross_val_score(model, X_val_encoded, Y_val, cv=10)

print("Cross-validation scores: ", scores)
print("Average cross-validation score: ", scores.mean())

# Score the model on the test data
score = model.score(X_test_encoded, Y_test)

print("Test score: ", score)

Cross-validation scores:  [0.90909091 1.         0.90909091 1.         0.81818182 0.8
 0.9        0.9        1.         0.9       ]
Average cross-validation score:  0.9136363636363637
Test score:  0.9904761904761905
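The fold mechanics behind the cross-validation scores above can be sketched on toy data. Plain KFold with five folds is used here for readability; with an integer `cv` and a classifier, cross_val_score uses the stratified variant, which also keeps class proportions similar across folds.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples split into five folds (the notebook cell uses cv=10)
X_toy = np.arange(10).reshape(-1, 1)

kfold = KFold(n_splits=5)
validation_counts = np.zeros(len(X_toy), dtype=int)
for train_index, val_index in kfold.split(X_toy):
    # Each round holds out a different slice for validation
    validation_counts[val_index] += 1
    print("train:", train_index, "validation:", val_index)

# Every sample is held out exactly once
print(validation_counts)
```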


**5. Group leakage**
_ _ _ _ _

GroupKFold is used instead of train_test_split. The split method of GroupKFold takes an additional argument, groups, an array-like object that defines the group membership of each sample; here we use the 'stream' column. This ensures that the same group is never represented in both the training and test sets of any split, preventing group leakage.
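The grouping guarantee can be checked on a toy example. The six samples and the three group labels below are made up; they stand in for rows grouped by `stream`.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Six samples in three groups (illustrative labels)
X_toy = np.arange(6).reshape(-1, 1)
y_toy = np.array([1, 0, 1, 1, 0, 1])
groups = np.array(["CS", "CS", "EE", "EE", "ME", "ME"])

gkf = GroupKFold(n_splits=3)
overlaps = []
for train_index, test_index in gkf.split(X_toy, y_toy, groups):
    train_groups = set(groups[train_index])
    test_groups = set(groups[test_index])
    # Record any group that appears on both sides of the split
    overlaps.append(train_groups & test_groups)
    print("train groups:", sorted(train_groups),
          "test groups:", sorted(test_groups))
```

Every entry in `overlaps` is an empty set, which is the property that prevents group leakage.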

In [61]:
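from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import numpy as np

# Handle missing values
df['years_of_experience'] = df['years_of_experience'].fillna(0)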

# Remove duplicated rows
df = df.drop_duplicates()

# Define your features and target
X = df.drop(["placement_status","salary"], axis=1)
Y = df["placement_status"]

# Define a mapping for placement status
placement_status_mapping = {"Placed": 1, "Not Placed": 0}

# Apply the mapping to the Y series
Y = Y.map(placement_status_mapping)

# Initialize the encoder
encoder = OneHotEncoder(handle_unknown='ignore')

# Initialize the model
model = RandomForestClassifier()

# Initialize GroupKFold
gkf = GroupKFold(n_splits=5)

# Use the 'stream' column to define group membership
groups = df['stream']

scores = []
for train_index, test_index in gkf.split(X, Y, groups):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    Y_train, Y_test = Y.iloc[train_index], Y.iloc[test_index]

    # Fit and transform the encoder on the training set
    X_train_encoded = encoder.fit_transform(X_train)

    # Transform the test set using the same encoder
    X_test_encoded = encoder.transform(X_test)

    # Train your model
    model.fit(X_train_encoded, Y_train)

    # Predict on the test data
    Y_pred = model.predict(X_test_encoded)

    # Evaluate the model using accuracy score
    score = accuracy_score(Y_test, Y_pred)
    scores.append(score)

print("Average Model Accuracy:", np.mean(scores))

Average Model Accuracy: 0.9715385568619705


The average accuracy is about 0.97 (0.9715 in the run shown above; 0.978327 in another run), which is a more realistic score than 1.0.

References

