# Campus Placement Prediction

### 1) Problem statement
         
- Predict whether a student will be recruited in campus placements, based on the factors available in the dataset.

### 2) Data collection

- Dataset source: https://www.kaggle.com/competitions/ml-with-python-course-project/data
- The train data consists of 15 columns and 215 rows.
- The test data consists of 3 columns and 43 rows.

### 2.1 Import Data and Required Packages
#### Importing the Pandas, NumPy, Matplotlib, Seaborn and Warnings libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

#### Import the CSV Data as Pandas DataFrame

In [4]:
df = pd.read_csv('data/train.csv')
df_test = pd.read_csv('data/test.csv')

In [5]:
df.shape, df_test.shape

((215, 15), (43, 3))

#### Show Top 5 Records

In [13]:
df.head()

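| | sl_no | gender | ssc_p | ssc_b | hsc_p | hsc_b | hsc_s | degree_p | degree_t | workex | etest_p | specialisation | mba_p | status | salary |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 67.0 | Others | 91.0 | Others | Commerce | 58.0 | Sci&Tech | No | 55.0 | Mkt&HR | 58.8 | Placed | 270000.0 |
| 1 | 2 | 0 | 79.33 | Central | 78.33 | Others | Science | 77.48 | Sci&Tech | Yes | 86.5 | Mkt&Fin | 66.28 | Placed | 200000.0 |
| 2 | 3 | 0 | 65.0 | Central | 68.0 | Central | Arts | 64.0 | Comm&Mgmt | No | 75.0 | Mkt&Fin | 57.8 | Placed | 250000.0 |
| 3 | 4 | 0 | 56.0 | Central | 52.0 | Central | Science | 52.0 | Sci&Tech | No | 66.0 | Mkt&HR | 59.43 | Not Placed | NaN |
| 4 | 5 | 0 | 85.8 | Central | 73.6 | Central | Commerce | 73.3 | Comm&Mgmt | No | 96.8 | Mkt&Fin | 55.5 | Placed | 425000.0 |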


### 2.2 Dataset information

- **sl_no**: anonymous ID unique to each student.
- **gender**: student gender, stored as an integer code (0 or 1). Binary feature.
- **ssc_p**: SSC is the Secondary School Certificate (Class 10th). ssc_p is the percentage of marks secured in Class 10th.
- **ssc_b**: SSC board (`Central` or `Others`). Binary feature.
- **hsc_p**: HSC is the Higher Secondary Certificate (Class 12th). hsc_p is the percentage of marks secured in Class 12th.
- **hsc_b**: HSC board (`Central` or `Others`). Binary feature.
- **hsc_s**: HSC subject stream (`Commerce`, `Science` or `Arts`). Feature with three categories.
- **degree_p**: percentage of marks secured while acquiring the degree.
- **degree_t**: branch in which the degree was acquired (`Sci&Tech`, `Comm&Mgmt` or `Others`). Feature with three categories.
- **workex**: whether the student has prior work experience (`Yes` or `No`). Binary feature.
- **etest_p**: percentage of marks secured in the placement exam.
- **specialisation**: the student's specialisation (`Mkt&HR` or `Mkt&Fin`). Binary feature.
- **mba_p**: percentage of marks secured by the student in the MBA.
- **status**: whether the student was placed or not (`Placed` or `Not Placed`). Binary feature. Target variable.
- **salary**: annual compensation at which a student was hired.
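
Since five of the columns are described as percentages, one quick sanity check is that their values stay between 0 and 100. The sketch below is only an illustration: it rebuilds the five records shown by `df.head()` rather than reading `data/train.csv`, and the names `PERCENT_COLUMNS`, `sample` and `within_range` are introduced here, not taken from the notebook.

```python
import pandas as pd

# Columns described above as percentages of marks.
PERCENT_COLUMNS = ["ssc_p", "hsc_p", "degree_p", "etest_p", "mba_p"]

# The first five records, copied from the df.head() output shown earlier.
sample = pd.DataFrame({
    "ssc_p": [67.0, 79.33, 65.0, 56.0, 85.8],
    "hsc_p": [91.0, 78.33, 68.0, 52.0, 73.6],
    "degree_p": [58.0, 77.48, 64.0, 52.0, 73.3],
    "etest_p": [55.0, 86.5, 75.0, 66.0, 96.8],
    "mba_p": [58.8, 66.28, 57.8, 59.43, 55.5],
})

# For each column, True when every value lies in the closed interval [0, 100].
within_range = sample[PERCENT_COLUMNS].apply(lambda col: col.between(0, 100).all())
print(within_range)
```

On the full data, the same expression could be applied to `df[PERCENT_COLUMNS]`.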

### 3) Data checks to perform

- Check Missing values
- Check Duplicates
- Check data type
- Check the number of unique values of each column
- Check statistics of data set
- Check the categories present in each categorical column
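
The checks listed above can also be gathered into a single helper, which makes it easy to rerun them on the test data. This is a minimal sketch, not the notebook's method: `run_basic_checks` is a hypothetical function name, and the small `sample` frame reuses a few values from the `df.head()` output instead of loading the CSV.

```python
import pandas as pd


def run_basic_checks(frame: pd.DataFrame) -> dict:
    """Collect the basic data-quality checks in one dictionary."""
    # Treat every non-numeric column as categorical for this summary.
    categorical = frame.select_dtypes(exclude="number").columns
    return {
        "missing": frame.isna().sum().to_dict(),
        "duplicates": int(frame.duplicated().sum()),
        "dtypes": frame.dtypes.astype(str).to_dict(),
        "n_unique": frame.nunique().to_dict(),
        "statistics": frame.describe(),
        "categories": {col: sorted(frame[col].dropna().unique()) for col in categorical},
    }


# Small illustrative frame using values from the first four records.
sample = pd.DataFrame({
    "sl_no": [1, 2, 3, 4],
    "ssc_p": [67.0, 79.33, 65.0, 56.0],
    "workex": ["No", "Yes", "No", "No"],
    "status": ["Placed", "Placed", "Placed", "Not Placed"],
    "salary": [270000.0, 200000.0, 250000.0, None],
})

report = run_basic_checks(sample)
print(report["missing"])
print(report["duplicates"])
print(report["categories"])
```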

### 3.1 Check Missing values

In [16]:
df.isna().sum()

sl_no              0
gender             0
ssc_p              0
ssc_b              0
hsc_p              0
hsc_b              0
hsc_s              0
degree_p           0
degree_t           0
workex             0
etest_p            0
specialisation     0
mba_p              0
status             0
salary            67
dtype: int64

#### Observation
- The `salary` column has **67** missing values, which suggests that 67 of the 215 students were not placed and therefore have no salary recorded.
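
The link between a missing salary and placement status can be checked directly rather than inferred from the count. The sketch below assumes the `status` labels `Placed` and `Not Placed` seen in the data, and uses the five `df.head()` records as a stand-in for the full frame; `salary_missing_iff_not_placed` is a name introduced for illustration.

```python
import pandas as pd

# status and salary for the first five records, as shown by df.head().
sample = pd.DataFrame({
    "status": ["Placed", "Placed", "Placed", "Not Placed", "Placed"],
    "salary": [270000.0, 200000.0, 250000.0, None, 425000.0],
})

# Cross-tabulate placement status against whether salary is missing.
missing_by_status = pd.crosstab(sample["status"], sample["salary"].isna())
print(missing_by_status)

# True only if salary is missing exactly on the "Not Placed" rows.
salary_missing_iff_not_placed = bool(
    (sample["salary"].isna() == (sample["status"] == "Not Placed")).all()
)
print(salary_missing_iff_not_placed)
```

If the same expression evaluates to `True` on `df`, the missing salaries are structural rather than data-entry gaps, so they would not need imputing for the placement prediction.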


In [17]:
df

| | sl_no | gender | ssc_p | ssc_b | hsc_p | hsc_b | hsc_s | degree_p | degree_t | workex | etest_p | specialisation | mba_p | status | salary |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 67.00 | Others | 91.00 | Others | Commerce | 58.00 | Sci&Tech | No | 55.0 | Mkt&HR | 58.80 | Placed | 270000.0 |
| 1 | 2 | 0 | 79.33 | Central | 78.33 | Others | Science | 77.48 | Sci&Tech | Yes | 86.5 | Mkt&Fin | 66.28 | Placed | 200000.0 |
| 2 | 3 | 0 | 65.00 | Central | 68.00 | Central | Arts | 64.00 | Comm&Mgmt | No | 75.0 | Mkt&Fin | 57.80 | Placed | 250000.0 |
| 3 | 4 | 0 | 56.00 | Central | 52.00 | Central | Science | 52.00 | Sci&Tech | No | 66.0 | Mkt&HR | 59.43 | Not Placed | NaN |
| 4 | 5 | 0 | 85.80 | Central | 73.60 | Central | Commerce | 73.30 | Comm&Mgmt | No | 96.8 | Mkt&Fin | 55.50 | Placed | 425000.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 210 | 211 | 0 | 80.60 | Others | 82.00 | Others | Commerce | 77.60 | Comm&Mgmt | No | 91.0 | Mkt&Fin | 74.49 | Placed | 400000.0 |
| 211 | 212 | 0 | 58.00 | Others | 60.00 | Others | Science | 72.00 | Sci&Tech | No | 74.0 | Mkt&Fin | 53.62 | Placed | 275000.0 |
| 212 | 213 | 0 | 67.00 | Others | 67.00 | Others | Commerce | 73.00 | Comm&Mgmt | Yes | 59.0 | Mkt&Fin | 69.72 | Placed | 295000.0 |
| 213 | 214 | 1 | 74.00 | Others | 66.00 | Others | Commerce | 58.00 | Comm&Mgmt | No | 70.0 | Mkt&HR | 60.23 | Placed | 204000.0 |


### 3.2 Check Duplicates

In [18]:
df.duplicated().sum()

0

#### There are no duplicate rows in the data set
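
One caveat: because `sl_no` is unique for every row, a whole-row comparison can never report a duplicate, even if two students had identical values everywhere else. A possible refinement is to repeat the check with the identifier left out. The rows below are invented purely to show the difference and do not come from the dataset.

```python
import pandas as pd

# Made-up rows: the first two differ only in their identifier.
sample = pd.DataFrame({
    "sl_no": [1, 2, 3],
    "ssc_p": [67.0, 67.0, 65.0],
    "status": ["Placed", "Placed", "Placed"],
})

# Whole-row comparison: the unique sl_no hides the repeated record.
full_row_duplicates = int(sample.duplicated().sum())

# Same check, ignoring the identifier column.
feature_columns = [col for col in sample.columns if col != "sl_no"]
feature_duplicates = int(sample.duplicated(subset=feature_columns).sum())

print(full_row_duplicates, feature_duplicates)
```

On the notebook's frame the equivalent call would be `df.duplicated(subset=[c for c in df.columns if c != 'sl_no']).sum()`.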

### 3.3 Check data types

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215 entries, 0 to 214
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   sl_no           215 non-null    int64  
 1   gender          215 non-null    int64  
 2   ssc_p           215 non-null    float64
 3   ssc_b           215 non-null    object 
 4   hsc_p           215 non-null    float64
 5   hsc_b           215 non-null    object 
 6   hsc_s           215 non-null    object 
 7   degree_p        215 non-null    float64
 8   degree_t        215 non-null    object 
 9   workex          215 non-null    object 
 10  etest_p         215 non-null    float64
 11  specialisation  215 non-null    object 
 12  mba_p           215 non-null    float64
 13  status          215 non-null    object 
 14  salary          148 non-null    float64
dtypes: float64(6), int64(2), object(7)
memory usage: 25.3+ KB


### 3.4 Checking the number of unique values of each column

In [20]:
df.nunique()

sl_no             215
gender              2
ssc_p             103
ssc_b               2
hsc_p              97
hsc_b               2
hsc_s               3
degree_p           89
degree_t            3
workex              2
etest_p           100
specialisation      2
mba_p             205
status              2
salary             45
dtype: int64
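
The unique counts can be used to sort columns into rough roles: a column with as many unique values as rows (here `sl_no`, 215 of 215) behaves like an identifier, and a column with exactly two values is binary. The sketch below applies that rule to a few columns from the five `df.head()` records; `id_like` and `binary` are names introduced for illustration.

```python
import pandas as pd

# Four columns from the first five records shown by df.head().
sample = pd.DataFrame({
    "sl_no": [1, 2, 3, 4, 5],
    "hsc_s": ["Commerce", "Science", "Arts", "Science", "Commerce"],
    "workex": ["No", "Yes", "No", "No", "No"],
    "status": ["Placed", "Placed", "Placed", "Not Placed", "Placed"],
})

unique_counts = sample.nunique()

# Columns with one distinct value per row carry no predictive signal.
id_like = unique_counts[unique_counts == len(sample)].index.tolist()

# Columns with exactly two distinct values.
binary = unique_counts[unique_counts == 2].index.tolist()

print(id_like)
print(binary)
```

An identifier such as `sl_no` would normally be dropped before modelling, since it only labels the row.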

### 3.5 Check statistics of data set

In [21]:
df.describe()

| | sl_no | gender | ssc_p | hsc_p | degree_p | etest_p | mba_p | salary |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 215.0 | 215.0 | 215.0 | 215.0 | 215.0 | 215.0 | 215.0 | 148.0 |
| mean | 108.0 | 0.353488 | 67.303395 | 66.333163 | 66.370186 | 72.100558 | 62.278186 | 288655.405405 |
| std | 62.209324 | 0.479168 | 10.827205 | 10.897509 | 7.358743 | 13.275956 | 5.833385 | 93457.45242 |
| min | 1.0 | 0.0 | 40.89 | 37.0 | 50.0 | 50.0 | 51.21 | 200000.0 |
| 25% | 54.5 | 0.0 | 60.6 | 60.9 | 61.0 | 60.0 | 57.945 | 240000.0 |
| 50% | 108.0 | 0.0 | 67.0 | 65.0 | 66.0 | 71.0 | 62.0 | 265000.0 |
| 75% | 161.5 | 1.0 | 75.7 | 73.0 | 72.0 | 83.5 | 66.255 | 300000.0 |
| max | 215.0 | 1.0 | 89.4 | 97.7 | 91.0 | 98.0 | 77.89 | 940000.0 |


#### Observation
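- The percentage columns (`ssc_p`, `hsc_p`, `degree_p`, `etest_p`, `mba_p`) stay within a plausible range of roughly 37 to 98, with no obviously invalid values.
- `salary` is right-skewed: its mean (about 288,655) exceeds its median (265,000), and its maximum of 940,000 lies far above the 75th percentile of 300,000, so the upper end likely contains outliers.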
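
A quick way to sanity-check for outliers from the summary statistics alone is Tukey's 1.5 × IQR rule, which flags values beyond the fences `Q1 - 1.5 * IQR` and `Q3 + 1.5 * IQR`. The sketch below plugs in the `salary` quartiles, minimum and maximum from the `df.describe()` output; the rule is one common convention among several, not something the notebook itself prescribes.

```python
# Summary statistics of `salary`, copied from the df.describe() output.
salary_min = 200000.0
q1 = 240000.0
q3 = 300000.0
salary_max = 940000.0

# Interquartile range and Tukey fences.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

has_low_outliers = salary_min < lower_fence
has_high_outliers = salary_max > upper_fence

print(iqr, lower_fence, upper_fence)
print(has_low_outliers, has_high_outliers)
```

With these numbers the upper fence is 390,000, so the maximum salary of 940,000 falls well outside it while the minimum stays inside the lower fence. On the full data the quartiles could be obtained with `df['salary'].quantile([0.25, 0.75])`.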

### 3.6 Exploring Data

In [23]:
df.head()

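| | sl_no | gender | ssc_p | ssc_b | hsc_p | hsc_b | hsc_s | degree_p | degree_t | workex | etest_p | specialisation | mba_p | status | salary |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 67.0 | Others | 91.0 | Others | Commerce | 58.0 | Sci&Tech | No | 55.0 | Mkt&HR | 58.8 | Placed | 270000.0 |
| 1 | 2 | 0 | 79.33 | Central | 78.33 | Others | Science | 77.48 | Sci&Tech | Yes | 86.5 | Mkt&Fin | 66.28 | Placed | 200000.0 |
| 2 | 3 | 0 | 65.0 | Central | 68.0 | Central | Arts | 64.0 | Comm&Mgmt | No | 75.0 | Mkt&Fin | 57.8 | Placed | 250000.0 |
| 3 | 4 | 0 | 56.0 | Central | 52.0 | Central | Science | 52.0 | Sci&Tech | No | 66.0 | Mkt&HR | 59.43 | Not Placed | NaN |
| 4 | 5 | 0 | 85.8 | Central | 73.6 | Central | Commerce | 73.3 | Comm&Mgmt | No | 96.8 | Mkt&Fin | 55.5 | Placed | 425000.0 |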


In [24]:
# Define numerical and categorical columns by dtype ('O' is the object dtype)
numeric_features = [feature for feature in df.columns if df[feature].dtype != 'O']
categorical_features = [feature for feature in df.columns if df[feature].dtype == 'O']

# Print the columns in each group
print(f'We have {len(numeric_features)} numerical features : {numeric_features}')
print(f'\nWe have {len(categorical_features)} categorical features : {categorical_features}')

We have 8 numerical features : ['sl_no', 'gender', 'ssc_p', 'hsc_p', 'degree_p', 'etest_p', 'mba_p', 'salary']

We have 7 categorical features : ['ssc_b', 'hsc_b', 'hsc_s', 'degree_t', 'workex', 'specialisation', 'status']
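
A dtype-based split counts `sl_no` and `gender` as numerical, although the first is an identifier and the second is a two-valued code. One possible refinement, sketched below on a few values from `df.head()`, is to start from `select_dtypes` and then move such columns by hand. The lists `ID_COLUMNS` and `CODED_CATEGORICALS` are assumptions made for this example, not part of the notebook.

```python
import pandas as pd

# A few columns from the first three records shown by df.head().
sample = pd.DataFrame({
    "sl_no": [1, 2, 3],
    "gender": [0, 0, 0],
    "ssc_p": [67.0, 79.33, 65.0],
    "workex": ["No", "Yes", "No"],
    "salary": [270000.0, 200000.0, 250000.0],
})

# Assumed roles: sl_no only labels rows, gender is a category stored as 0/1.
ID_COLUMNS = ["sl_no"]
CODED_CATEGORICALS = ["gender"]

# Start from the dtype-based split.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Remove identifiers and coded categories from the numeric group.
numeric_cols = [c for c in numeric_cols if c not in ID_COLUMNS + CODED_CATEGORICALS]
categorical_cols = categorical_cols + CODED_CATEGORICALS

print(numeric_cols)
print(categorical_cols)
```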


In [25]:
for feature in categorical_features:
    print(f'unique values in "{feature}" column: ', df[feature].unique())
    print()

unique values in "ssc_b" column:  ['Others' 'Central']

unique values in "hsc_b" column:  ['Others' 'Central']

unique values in "hsc_s" column:  ['Commerce' 'Science' 'Arts']

unique values in "degree_t" column:  ['Sci&Tech' 'Comm&Mgmt' 'Others']

unique values in "workex" column:  ['No' 'Yes']

unique values in "specialisation" column:  ['Mkt&HR' 'Mkt&Fin']

unique values in "status" column:  ['Placed' 'Not Placed']
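
Because `status` is the target and has exactly two labels, it is often useful to look at the class shares and to derive a 0/1 version for modelling. The sketch below does both on the five `df.head()` records; the mapping `STATUS_CODES` and the new column name `placed` are hypothetical choices made for this example.

```python
import pandas as pd

# status for the first five records, as shown by df.head().
sample = pd.DataFrame({
    "status": ["Placed", "Placed", "Placed", "Not Placed", "Placed"],
})

# Fraction of rows in each class.
class_share = sample["status"].value_counts(normalize=True)
print(class_share)

# Hypothetical encoding: 1 for placed students, 0 otherwise.
STATUS_CODES = {"Not Placed": 0, "Placed": 1}
sample["placed"] = sample["status"].map(STATUS_CODES)
print(sample)
```

The shares printed here describe only the five sample rows; the full-data proportions would come from running `df['status'].value_counts(normalize=True)`.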



#### Observations
- The categorical columns contain only the expected labels, with no misspelled or inconsistent categories.
- The column names could be made more descriptive for easier reading.
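
As an illustration of clearer naming, the abbreviated columns could be renamed with `DataFrame.rename`. Every new name in `COLUMN_RENAMES` below is a suggestion invented for this sketch; the original names are the ones listed by `df.info()`.

```python
import pandas as pd

# Hypothetical, more descriptive names for the abbreviated columns.
COLUMN_RENAMES = {
    "sl_no": "student_id",
    "ssc_p": "secondary_pct",
    "ssc_b": "secondary_board",
    "hsc_p": "higher_secondary_pct",
    "hsc_b": "higher_secondary_board",
    "hsc_s": "higher_secondary_stream",
    "degree_p": "degree_pct",
    "degree_t": "degree_field",
    "workex": "work_experience",
    "etest_p": "placement_test_pct",
    "mba_p": "mba_pct",
}

# Column names as listed by df.info(); an empty frame is enough to show the rename.
original_columns = [
    "sl_no", "gender", "ssc_p", "ssc_b", "hsc_p", "hsc_b", "hsc_s", "degree_p",
    "degree_t", "workex", "etest_p", "specialisation", "mba_p", "status", "salary",
]
sample = pd.DataFrame(columns=original_columns)

# Columns missing from the mapping keep their original names.
renamed = sample.rename(columns=COLUMN_RENAMES)
print(renamed.columns.tolist())
```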