<div style="text-align: center;">
  <img src="Images/Placement.png" alt="Placement Illustration" width="600"/>
</div>

## Hello!
This project uses the **Placement Prediction Dataset** to analyze and predict student placement outcomes. We aim to identify the key factors that influence placement success and to build machine learning models that classify placement status.  

The dataset, available on [Kaggle](https://www.kaggle.com/datasets/ruchikakumbhar/placement-prediction-dataset), contains **10,000 rows** and **12 columns**, providing detailed academic and extracurricular records of students, along with their placement status. It is well-structured with no missing values, making it suitable for analysis and classification tasks. 

#### Dataset Overview
The dataset includes the following features:  

1. **CGPA**: Cumulative Grade Point Average, representing a student’s overall academic performance.  
2. **Internships**: The number of internships completed by a student.  
3. **Projects**: Number of projects undertaken by the student.  
4. **Workshops/Certifications**: Participation in workshops and online courses for skill enhancement.  
5. **Aptitude Test Score**: Measures a student’s quantitative and logical reasoning skills, often used in recruitment.  
6. **Soft Skills Rating**: Assesses communication and interpersonal skills.  
7. **Extracurricular Activities**: Indicates student involvement in non-academic activities.  
8. **Placement Training**: Training programs provided by colleges to help students prepare for placement.  
9. **SSC & HSC Marks**: Marks obtained in the Secondary School Certificate (SSC) and Higher Secondary Certificate (HSC) examinations.  
10. **Placement Status**: The target variable, indicating whether a student was **Placed** or **Not Placed**.  

#### Project Inspiration  
This project is inspired by the need to analyze key factors affecting student placements and build predictive models to assist career guidance efforts. Understanding the most influential attributes can help students and institutions improve placement strategies.  

#### Goals of the Project  
1. **Data Analysis**: Explore the dataset to identify trends and correlations among academic, training, and extracurricular features.  
2. **Classification Models**: Develop and evaluate machine learning models to predict placement status based on the provided features.  

> ⚠️ *Disclaimer*: The goal is to build predictive tools and derive insights—not to stereotype or oversimplify student capabilities. Ethical analysis and transparency are essential. 

# **Step 1: Data Wrangling**

This notebook covers the **data wrangling and preprocessing phase** of the Placement Prediction project. The goal is to clean, transform, and prepare the raw dataset for exploratory data analysis (EDA), modeling, and dashboard development.

---

### Objectives of This Notebook

1. [Import Libraries and Load the Dataset](#import)  
2. [Initial Inspection](#inspection)  
3. [Handle Missing Values](#missing)  
4. [Remove Duplicate Records](#duplicates)  
5. [Rename Columns for Consistency](#rename)  
6. [Save the Cleaned Dataset](#save)

---

### Next Steps

- Step 2: [Exploratory Data Analysis (EDA) – Visual](./02_eda_visualization.ipynb)
- Step 3: [EDA – SQL Queries](./03_eda_sql_queries.ipynb) 
- Step 4: [Modeling & Prediction](./04_modeling_prediction.ipynb)

---

<a id="import"></a>

## **1.1 Import Libraries and Load the Dataset**

We start by importing the necessary Python libraries and loading the dataset into a DataFrame.

In [1]:
# Pandas is a software library written for the Python programming language for data manipulation and analysis.
import pandas as pd

# NumPy is a Python library that supports fast operations on large, multi-dimensional arrays and provides a wide range of mathematical functions.
import numpy as np

# Import display function to render DataFrames or outputs neatly in the notebook
from IPython.display import display

In [2]:
# Load the dataset
print("Previewing the raw dataset:")
df = pd.read_csv('placement_data.csv')
display(df.head())

Previewing the raw dataset:


Unnamed: 0,StudentID,CGPA,Internships,Projects,Workshops/Certifications,AptitudeTestScore,SoftSkillsRating,ExtracurricularActivities,PlacementTraining,SSC_Marks,HSC_Marks,PlacementStatus
0,1,7.5,1,1,1,65,4.4,No,No,61,79,NotPlaced
1,2,8.9,0,3,2,90,4.0,Yes,Yes,78,82,Placed
2,3,7.3,1,2,2,82,4.8,Yes,No,79,80,NotPlaced
3,4,7.5,1,1,2,85,4.4,Yes,Yes,81,80,Placed
4,5,8.3,1,2,2,86,4.5,Yes,Yes,74,88,Placed


---

<a id="inspection"></a>

## **1.2 Initial Inspection**

We inspect the structure, data types, and basic info of the dataset.

In [3]:
# Drop 'StudentID': it is a row identifier, not a feature for analysis or modeling
df.drop(columns=['StudentID'], inplace=True)

In [4]:
# Display basic structure of the dataset
print("Dataset Info:")
# df.info() prints its report directly and returns None, so it is not wrapped in display()
df.info()

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   CGPA                       10000 non-null  float64
 1   Internships                10000 non-null  int64  
 2   Projects                   10000 non-null  int64  
 3   Workshops/Certifications   10000 non-null  int64  
 4   AptitudeTestScore          10000 non-null  int64  
 5   SoftSkillsRating           10000 non-null  float64
 6   ExtracurricularActivities  10000 non-null  object 
 7   PlacementTraining          10000 non-null  object 
 8   SSC_Marks                  10000 non-null  int64  
 9   HSC_Marks                  10000 non-null  int64  
 10  PlacementStatus            10000 non-null  object 
dtypes: float64(2), int64(6), object(3)
memory usage: 859.5+ KB

In [5]:
# Summary statistics for numerical columns
print("Numerical Summary:")
display(df.describe().T)

Numerical Summary:


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
CGPA,10000.0,7.69801,0.640131,6.5,7.4,7.7,8.2,9.1
Internships,10000.0,1.0492,0.665901,0.0,1.0,1.0,1.0,2.0
Projects,10000.0,2.0266,0.867968,0.0,1.0,2.0,3.0,3.0
Workshops/Certifications,10000.0,1.0132,0.904272,0.0,0.0,1.0,2.0,3.0
AptitudeTestScore,10000.0,79.4499,8.159997,60.0,73.0,80.0,87.0,90.0
SoftSkillsRating,10000.0,4.32396,0.411622,3.0,4.0,4.4,4.7,4.8
SSC_Marks,10000.0,69.1594,10.430459,55.0,59.0,70.0,78.0,90.0
HSC_Marks,10000.0,74.5015,8.919527,57.0,67.0,73.0,83.0,88.0


In [6]:
# Summary statistics for categorical columns
print("Categorical Summary:")
display(df.describe(include=[object]).T)

Categorical Summary:


Unnamed: 0,count,unique,top,freq
ExtracurricularActivities,10000,2,Yes,5854
PlacementTraining,10000,2,Yes,7318
PlacementStatus,10000,2,NotPlaced,5803


In [7]:
# Checking the shape of the dataset (rows, columns)
print(f"The dataset contains {df.shape[0]:,} rows and {df.shape[1]} columns.")

The dataset contains 10,000 rows and 11 columns.


In [8]:
# List the column names of the DataFrame
print("The dataset columns include:")
display(df.columns)

The dataset columns include:


Index(['CGPA', 'Internships', 'Projects', 'Workshops/Certifications',
       'AptitudeTestScore', 'SoftSkillsRating', 'ExtracurricularActivities',
       'PlacementTraining', 'SSC_Marks', 'HSC_Marks', 'PlacementStatus'],
      dtype='object')

---

<a id="missing"></a>

## **1.3 Handling Missing Values**

We count the missing values in each column. None are found, so no imputation or row removal is needed.

In [9]:
# Check for missing values
print("Missing Values per Column:")
display(df.isnull().sum())

Missing Values per Column:


CGPA                         0
Internships                  0
Projects                     0
Workshops/Certifications     0
AptitudeTestScore            0
SoftSkillsRating             0
ExtracurricularActivities    0
PlacementTraining            0
SSC_Marks                    0
HSC_Marks                    0
PlacementStatus              0
dtype: int64
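
Since every column reports zero missing values, the notebook applies no imputation. For reference only, here is a minimal sketch of one common strategy, using a hypothetical toy frame (`toy`) and a hypothetical helper (`fill_missing`) that are not part of this project: numeric gaps are filled with the column median and categorical gaps with the column mode.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps (not taken from the placement dataset)
toy = pd.DataFrame({
    "cgpa": [7.5, np.nan, 8.3, 7.9],
    "placementtraining": ["Yes", "No", None, "Yes"],
})

def fill_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric NaNs replaced by the column median
    and non-numeric NaNs replaced by the column mode."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            # mode() ignores missing values; take the most frequent category
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

filled = fill_missing(toy)
print(filled.isnull().sum())
```

Median and mode are only one reasonable default; the right choice depends on why values are missing.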

---

<a id="duplicates"></a>

## **1.4 Remove Duplicate Records**

To ensure data quality, we check and remove duplicate rows.

In [10]:
# Count duplicate rows
print(f"Duplicate Rows Found: {df.duplicated().sum()}")

Duplicate Rows Found: 72


In [11]:
# Drop duplicate rows, keeping the first occurrence of each
df.drop_duplicates(inplace=True)
print("Duplicate Rows dropped")

Duplicate Rows dropped
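
Because `StudentID` was dropped in section 1.2, `duplicated()` compares feature values only. If `StudentID` is unique per row, the 72 flagged rows may belong to different students whose records happen to match. The sketch below uses hypothetical toy data (not rows from the dataset) to show how removing an ID column changes the duplicate count.

```python
import pandas as pd

# Hypothetical toy data: students 1 and 2 have identical feature values
toy = pd.DataFrame({
    "StudentID": [1, 2, 3],
    "CGPA": [7.5, 7.5, 8.9],
    "PlacementStatus": ["NotPlaced", "NotPlaced", "Placed"],
})

# With the unique ID present, no row is an exact duplicate of another
dupes_with_id = int(toy.duplicated().sum())

# Without the ID, the second row repeats the first
features = toy.drop(columns=["StudentID"])
dupes_without_id = int(features.duplicated().sum())

# keep="first" (the default) retains the first occurrence of each repeated row
deduped = features.drop_duplicates(keep="first").reset_index(drop=True)

print(dupes_with_id, dupes_without_id, len(deduped))
```

Whether such rows should be removed is a modeling decision; this notebook removes them.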


---

<a id="rename"></a>

## **1.5 Rename Columns for Consistency**

Standardizing column names improves readability and downstream processing.

In [12]:
# Renaming Columns
df.rename(columns={'ExtracurricularActivities':'ECAs',
                     'Workshops/Certifications':'Trainings'},inplace=True)

# Standardize column names: lowercase and replace spaces with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

In [13]:
# Reset index to avoid any issues from dropped rows
df.reset_index(drop=True, inplace=True)
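
The original column names contain no spaces, so the standardization above only lowercases them (for example, `AptitudeTestScore` becomes `aptitudetestscore`). If underscore-separated names were preferred, one possible alternative is sketched below. The helper `to_snake_case` is hypothetical and is not used in this project; adopting it would change the column names that later notebooks expect.

```python
import re

def to_snake_case(name: str) -> str:
    """Convert a CamelCase, spaced, or slash-separated column name to snake_case."""
    # Insert an underscore where a lowercase letter or digit is followed by an uppercase letter
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Replace runs of whitespace or slashes with a single underscore
    name = re.sub(r"[\s/]+", "_", name)
    return name.lower()

columns = ["CGPA", "AptitudeTestScore", "SoftSkillsRating", "SSC_Marks", "ECAs"]
print([to_snake_case(c) for c in columns])
```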

#### Observations from the Placement Dataset:

1. **Dataset Dimensions**:  
   - The raw dataset contains **10,000 rows** and **12 columns**; **11 columns** remain after dropping `StudentID`.  
   - Removing **72 duplicate rows** leaves **9,928 rows**, as confirmed by the shape check in section 1.6.  

2. **Data Quality**:  
   - **No missing values** are present across all columns.  
   - **Duplicates were found** (72 rows) but were successfully removed.  

3. **Numerical Features Summary** (computed before duplicate removal):  
   - **CGPA** ranges from **6.5 to 9.1**, with a mean of **7.70**.  
   - **Internships** range between **0 and 2**, with a median of **1** and a mean of **1.05**.  
   - **Projects** range between **0 and 3**, with a mean of **2.03**.  
   - **Workshops/Certifications** (renamed **Trainings**) range between **0 and 3**, averaging **1.01** per student.  
   - **Aptitude Test Score** is between **60 and 90**, with an average score of **79.45**.  
   - **Soft Skills Rating** ranges from **3.0 to 4.8**, with an average of **4.32**.  
   - **SSC Marks** vary from **55% to 90%**, with a mean of **69.16%**.  
   - **HSC Marks** range from **57% to 88%**, averaging **74.50%**.  

4. **Categorical Features Summary**:  
   - **Extracurricular Activities**: **Two categories** ("Yes" / "No"), with **5,854 students** (5,784 after duplicate removal) engaged in extracurricular activities.  
   - **Placement Training**: **Two categories** ("Yes" / "No"), with **7,318 students** (7,246 after duplicate removal) having taken placement training.  
   - **Placement Status**: **Two categories** ("Placed" / "NotPlaced"), with **5,803 students not placed** (5,801 after duplicate removal), making up the majority.  

5. **General Observations**:  
   - The median student has a **CGPA of 7.7** and **one internship**.  
   - **Aptitude Test Score and Soft Skills Rating** are generally high, with medians of 80 and 4.4.  
   - **Placement Training is common** (about 73% of students); whether it is associated with placement is a question for the EDA step.  
   - About **58% of students are not placed** even though academic and skill metrics are generally strong, so the EDA should examine which features separate the two groups.
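
The row counts and class balance reported above can be cross-checked with simple arithmetic. The sketch below uses only figures printed in this notebook: 10,000 raw rows, 72 duplicates, and 5,801 `NotPlaced` rows in the cleaned data (from the categorical summary in section 1.6).

```python
# Figures taken from the notebook outputs
rows_raw = 10_000
duplicates = 72
not_placed_after = 5_801  # NotPlaced count in the cleaned data

rows_clean = rows_raw - duplicates
placed_after = rows_clean - not_placed_after
not_placed_share = not_placed_after / rows_clean

print(f"Rows after cleaning: {rows_clean:,}")
print(f"NotPlaced: {not_placed_after:,} ({not_placed_share:.1%})")
print(f"Placed:    {placed_after:,} ({1 - not_placed_share:.1%})")
```

A split of roughly 58% to 42% is a mild class imbalance, which is worth keeping in mind when choosing evaluation metrics in the modeling step.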

---

<a id="save"></a>

## **1.6 Save the Cleaned Dataset**

After completing the data cleaning and preprocessing steps, save the cleaned dataset to a CSV file for future use and reproducibility.

In [14]:
# Create a clean copy of the DataFrame for further processing and cleaning steps.
df_clean = df.copy()

In [15]:
# Display the first and last few rows of the cleaned dataset to verify changes
display(df_clean)

Unnamed: 0,cgpa,internships,projects,trainings,aptitudetestscore,softskillsrating,ecas,placementtraining,ssc_marks,hsc_marks,placementstatus
0,7.5,1,1,1,65,4.4,No,No,61,79,NotPlaced
1,8.9,0,3,2,90,4.0,Yes,Yes,78,82,Placed
2,7.3,1,2,2,82,4.8,Yes,No,79,80,NotPlaced
3,7.5,1,1,2,85,4.4,Yes,Yes,81,80,Placed
4,8.3,1,2,2,86,4.5,Yes,Yes,74,88,Placed
...,...,...,...,...,...,...,...,...,...,...,...
9923,7.5,1,1,2,72,3.9,Yes,No,85,66,NotPlaced
9924,7.4,0,1,0,90,4.8,No,No,84,67,Placed
9925,8.4,1,3,0,70,4.8,Yes,Yes,79,81,Placed
9926,8.9,0,3,2,87,4.8,Yes,Yes,71,85,Placed


In [16]:
# Check the shape of the dataset after cleaning
print("Dataset shape after cleaning:")
print(df_clean.shape)

Dataset shape after cleaning:
(9928, 11)


In [17]:
# Recheck the dataframe info to verify datatypes and non-null counts
print("Dataset info:")
# df_clean.info() prints its report directly and returns None, so it is not wrapped in display()
df_clean.info()

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9928 entries, 0 to 9927
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   cgpa               9928 non-null   float64
 1   internships        9928 non-null   int64  
 2   projects           9928 non-null   int64  
 3   trainings          9928 non-null   int64  
 4   aptitudetestscore  9928 non-null   int64  
 5   softskillsrating   9928 non-null   float64
 6   ecas               9928 non-null   object 
 7   placementtraining  9928 non-null   object 
 8   ssc_marks          9928 non-null   int64  
 9   hsc_marks          9928 non-null   int64  
 10  placementstatus    9928 non-null   object 
dtypes: float64(2), int64(6), object(3)
memory usage: 853.3+ KB

In [18]:
# Summary statistics for numerical columns
print("Numerical Summary:")
display(df_clean.describe())

Numerical Summary:


Unnamed: 0,cgpa,internships,projects,trainings,aptitudetestscore,softskillsrating,ssc_marks,hsc_marks
count,9928.0,9928.0,9928.0,9928.0,9928.0,9928.0,9928.0,9928.0
mean,7.693946,1.042808,2.019944,1.006849,79.376209,4.320679,69.093372,74.425766
std,0.639961,0.663699,0.867118,0.903612,8.140884,0.411211,10.428709,8.901786
min,6.5,0.0,0.0,0.0,60.0,3.0,55.0,57.0
25%,7.4,1.0,1.0,0.0,73.0,4.0,59.0,67.0
50%,7.7,1.0,2.0,1.0,80.0,4.4,70.0,73.0
75%,8.2,1.0,3.0,2.0,87.0,4.7,78.0,83.0
max,9.1,2.0,3.0,3.0,90.0,4.8,90.0,88.0


In [19]:
# Summary statistics for categorical columns
print("Categorical Summary:")
display(df_clean.describe(include=[object]))

Categorical Summary:


Unnamed: 0,ecas,placementtraining,placementstatus
count,9928,9928,9928
unique,2,2,2
top,Yes,Yes,NotPlaced
freq,5784,7246,5801


In [20]:
# Save this lightly cleaned dataset for EDA and Dashboard
df_clean.to_csv('placement_cleaned.csv', index=False)
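
Passing `index=False` keeps the row index out of the file, so reloading it does not produce an extra `Unnamed: 0` column. A quick way to confirm that a saved file matches the DataFrame is a round-trip check. The sketch below is an assumption-laden illustration: it uses a hypothetical stand-in frame (`sample`) and an in-memory buffer rather than `df_clean` and `placement_cleaned.csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for df_clean with the same kinds of columns
sample = pd.DataFrame({
    "cgpa": [7.5, 8.9],
    "internships": [1, 0],
    "placementstatus": ["NotPlaced", "Placed"],
})

# Write to an in-memory buffer instead of disk so the sketch has no side effects
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# Read the CSV back and confirm that values and dtypes survived the round trip
reloaded = pd.read_csv(buffer)
pd.testing.assert_frame_equal(sample, reloaded)
print("Round trip OK:", reloaded.shape)
```

The same pattern should work on the real file by replacing the buffer with the CSV path.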

#### With the cleaned dataset saved, we can proceed to Exploratory Data Analysis (EDA).