# Student Campus Placement Prediction Model

### DATASET INFORMATION: 

This dataset contains placement data for students on our campus. It includes secondary and higher secondary school percentages and specializations, degree specialization and type, work experience, and the salary offered to placed students.



* sl_no == Serial number
* gender == Gender: Male='M', Female='F'
* ssc_p == Secondary education percentage (10th grade)
* ssc_b == Board of education (secondary): Central/Others
* hsc_p == Higher secondary education percentage (12th grade)
* hsc_b == Board of education (higher secondary): Central/Others
* hsc_s == Specialization in higher secondary education
* degree_p == Degree percentage
* degree_t == Undergraduate degree type (field of degree education)
* workex == Work experience
* etest_p == Employability test percentage (conducted by the college)
* specialisation == Postgraduate (MBA) specialization
* mba_p == MBA percentage
* status == Placement status: Placed/Not placed
* salary == Salary offered by the corporate to candidates

____________

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import accuracy_score

### Load the dataset

In [None]:
df=pd.read_csv('../input/factors-affecting-campus-placement/Placement_Data_Full_Class.csv')
df.head()

### 1.	Data Understanding (8 marks)

a. Read the dataset (tab, csv, xls, txt, inbuilt dataset). What are the number of rows and columns, and the types of variables (continuous, categorical, etc.)? (1 mark)

b. Calculate the five-point summary for numerical variables. (1 mark)

c. Summarize observations for categorical variables: number of categories and % of observations in each category. (2 marks)

d. Check for defects in the data such as missing values, nulls, outliers, etc., and also check for class imbalance. (4 marks)


In [None]:
#dataset shape
print('Number of rows:',df.shape[0])
print('Number of columns:',df.shape[1])

In [None]:
# datatype of variables
df.info()

In [None]:
# 5 point summary
df.describe()

In [None]:
#describe categorical variables
#np.object was removed from NumPy, so the dtype is given by name
df.describe(include=['object'])

### Values count in each categories

In [None]:
a=df['gender'].value_counts()
percent=(a.values/df.shape[0])*100 #% observations
b=pd.DataFrame()
b['Type']=a.index #same order as the counts; unique() follows order of appearance instead
b['Percentage']=percent
b

In [None]:
a=df['ssc_b'].value_counts()
percent=(a.values/df.shape[0])*100 #% observations
b=pd.DataFrame()
b['Type']=a.index #same order as the counts
b['Percentage']=percent
b

In [None]:
a=df['hsc_b'].value_counts()
percent=(a.values/df.shape[0])*100 #% observations
b=pd.DataFrame()
b['Type']=a.index #same order as the counts
b['Percentage']=percent
b

In [None]:
a=df['status'].value_counts()
percent=(a.values/df.shape[0])*100 #% observations
b=pd.DataFrame()
b['Type']=a.index #same order as the counts
b['Percentage']=percent
b
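The four cells above repeat the same steps for each column. A minimal sketch of how this could be wrapped in one helper is shown below; `category_summary` is a hypothetical name and the toy series is made up, standing in for a column such as `df['status']`.

```python
import pandas as pd

def category_summary(series: pd.Series) -> pd.DataFrame:
    """Return the count and percentage of observations for each category."""
    counts = series.value_counts()
    percent = series.value_counts(normalize=True) * 100
    # Both Series share the category labels as index, so they align by label
    return pd.DataFrame({"Count": counts, "Percentage": percent.round(2)})

# Toy stand-in for a categorical column (values are made up)
toy = pd.Series(["Placed", "Placed", "Not Placed", "Placed"], name="status")
summary = category_summary(toy)
print(summary)
```

Because the labels come from the index of `value_counts()`, each percentage stays attached to its own category.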

In [None]:
#Check for missing values
print(df.isnull().sum())

plt.figure(figsize=(10,6))
sns.heatmap(df.isnull())
plt.show()

#There are 67 missing values in the salary column

In [None]:

# We will impute null values

print('Skewness in salary :',df['salary'].skew())

In [None]:

#Since salary is right skewed, we impute null values with the median
#(assigning back avoids the chained inplace fillna, which newer pandas warns about)

df['salary']=df['salary'].fillna(df['salary'].median())

In [None]:
print(df.isnull().sum())
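The choice of the median over the mean for a right-skewed column can be illustrated with a small sketch. The salary values below are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed salaries with two missing entries
salary = pd.Series([200000, 220000, 250000, 260000, 940000, np.nan, np.nan])

print("skew  :", salary.skew())     # positive for a right-skewed column
print("mean  :", salary.mean())     # pulled upwards by the single large value
print("median:", salary.median())   # unaffected by how large that value is

# Impute with the median and assign the result back
filled = salary.fillna(salary.median())
print(filled.isnull().sum())
```

With one very large value in the column, the mean sits well above most observations, while the median stays near the typical salary.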

In [None]:
# Checked for outliers

plt.figure(figsize=(10,6))
df.boxplot()
plt.show()


In [None]:
#Individual boxplots are plotted
df1=df.select_dtypes(exclude='object')
for col in df1.columns:
    sns.boxplot(x=df1[col]) #pass the column by keyword; newer seaborn treats a positional argument as data
    plt.show()

#We can observe that salary and HSC percentage have outliers


In [None]:
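#We will remove outliers using an IQR fence on the numeric columns
#Note: the fence used here is 1*IQR, which is stricter than the conventional 1.5*IQR

num=df.select_dtypes(exclude='object') #quantiles are only defined for numeric columns
q1=num.quantile(0.25)
q3=num.quantile(0.75)
iqr=q3-q1

ll=q1-iqr
ul=q3+iqr

df=df[~((num<ll)|(num>ul)).any(axis=1)]
df=df.reset_index(drop=True)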


plt.figure(figsize=(10,6))
df.boxplot()
plt.show()
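The outlier step above can also be written as a reusable function with the fence width as a parameter. This is a minimal sketch: `iqr_filter` and its `k` argument are hypothetical names, and the toy frame is made up.

```python
import pandas as pd

def iqr_filter(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Drop rows where any numeric column lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    num = frame.select_dtypes(include="number")
    q1 = num.quantile(0.25)
    q3 = num.quantile(0.75)
    iqr = q3 - q1
    outside = (num < q1 - k * iqr) | (num > q3 + k * iqr)
    return frame[~outside.any(axis=1)].reset_index(drop=True)

# Toy data: 97 is far above the other percentages
toy = pd.DataFrame({"hsc_p": [60, 62, 65, 66, 68, 97],
                    "status": ["Placed"] * 6})
clean = iqr_filter(toy, k=1.5)
print(len(toy), "->", len(clean))
```

Calling it with `k=1` would reproduce the stricter fence used in the cell above, while `k=1.5` is the usual boxplot convention.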

In [None]:
# Check for class imbalance

sns.countplot(x=df['status'])
plt.show()


#The placement status is imbalanced: more students were placed than not placed.
#Even so, both classes of the target variable are reasonably well represented.
 

In [None]:

#Correlation matrix (numeric columns only)
plt.figure(figsize=(10,6))
sns.heatmap(df.corr(numeric_only=True),annot=True)
plt.show()

### EDA

In [None]:
plt.figure(figsize=(10,6))
sns.scatterplot(x=df['ssc_p'],y=df['hsc_p'],hue=df['status'])
plt.show()

In [None]:
df.columns

In [None]:
plt.figure(figsize=(10,6))
sns.scatterplot(x=df['degree_p'],y=df['hsc_p'],hue=df['status'])
plt.show()

In [None]:
sns.barplot(x=df['status'],y=df['ssc_p'],hue=df['ssc_b'])
plt.show()

sns.barplot(x=df['status'],y=df['hsc_p'])
plt.show()

sns.barplot(x=df['status'],y=df['degree_p'])
plt.show()

sns.barplot(x=df['status'],y=df['mba_p'])
plt.show()

In [None]:
sns.barplot(x=df['status'],y=df['etest_p'])
plt.show()

sns.countplot(x=df['status'],hue=df['workex'])
plt.show()

### INFERENCES:
1. From the scatterplots above, students with higher SSC, HSC and degree percentages are more likely to be placed than those with lower percentages.

2. From the bar plots above, the average percentages of placed students are higher than those of students who were not placed.

3. Placement therefore depends largely on previous marks (SSC, HSC, degree).

4. Students with work experience are more likely to be placed.

In [None]:
# standard deviation of the numeric columns
df.std(numeric_only=True)

In [None]:
# We remove serial number
df.drop(columns=['sl_no'],inplace=True)
df.head()

In [None]:
#Label encoding & dummy variable encoding
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
df['status']=le.fit_transform(df['status'])

In [None]:
# Categorical vars converted into dummy vars

df_cat1=df.drop(columns='status').select_dtypes('object')
df_cat=pd.get_dummies(df_cat1,drop_first=True)
df_cat.head()
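The effect of `drop_first=True` in the cell above can be seen on a tiny frame. This is only a sketch; the values are made up and reuse two column names from the dataset.

```python
import pandas as pd

# Toy categorical columns (values are made up)
toy = pd.DataFrame({"workex": ["Yes", "No", "Yes"],
                    "hsc_s": ["Commerce", "Science", "Arts"]})

# One dummy per category, minus the first (alphabetically) of each column
dummies = pd.get_dummies(toy, drop_first=True)
print(dummies.columns.tolist())
# ['workex_Yes', 'hsc_s_Commerce', 'hsc_s_Science']
```

The dropped categories (`No` and `Arts` here) become the baseline: a row with all dummies of a column equal to zero belongs to that baseline category.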

In [None]:
# Numerical variables are scaled
from scipy.stats import zscore
df_num1=df.select_dtypes(exclude='object')
df_num=df_num1.apply(zscore)
df_num.head()
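As a quick check of what `zscore` computes, the sketch below reproduces it by hand on made-up values. It also shows why a later call to pandas `.std()` on scaled data does not return exactly 1: `zscore` divides by the population standard deviation (`ddof=0`), while pandas defaults to the sample standard deviation (`ddof=1`).

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

x = np.array([50.0, 60.0, 70.0, 80.0])   # made-up percentages

manual = (x - x.mean()) / x.std()        # np.std defaults to ddof=0, like zscore
scaled = zscore(x)

print(np.allclose(manual, scaled))       # the two computations agree
print(pd.Series(scaled).std())           # pandas uses ddof=1, so slightly above 1
print(pd.Series(scaled).std(ddof=0))     # population std of the scaled values
```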

In [None]:
# Concatenating df_num & df_cat

df_x=pd.concat([df_num,df_cat] , axis=1)
df_x.head()

In [None]:
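# train test split
# Note: salary is only offered to placed students (its missing values were median-imputed above),
# so keeping it in x may leak the target; consider dropping it along with status.

from sklearn.model_selection import train_test_split
x=df_x.drop(columns='status')
y=df['status']
xtrain,xtest,ytrain,ytest=train_test_split( x , y , test_size=0.3 , random_state=10)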
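Since the target is imbalanced, one option is to stratify the split so that both parts keep the same class mix. The sketch below uses synthetic data with a made-up 70/30 class ratio; it is an alternative to the plain split above, not what the notebook ran.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(10)
X = rng.normal(size=(100, 3))            # synthetic features
y = np.array([1] * 70 + [0] * 30)        # made-up 70/30 class mix

# stratify=y keeps the class proportions equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=10, stratify=y)

print("train share of class 1:", y_tr.mean())
print("test share of class 1 :", y_te.mean())
```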

In [None]:
# To check whether xtrain & xtest represent the data fairly,
# we plot the distribution of one numerical feature in each
# (histplot replaces the deprecated distplot)

print('Skewness train:',xtrain['ssc_p'].skew())
sns.histplot(xtrain['ssc_p'],kde=True)
plt.show()


print('Skewness test:',xtest['ssc_p'].skew())
sns.histplot(xtest['ssc_p'],kde=True)
plt.show()

In [None]:
xtrain.std()

In [None]:
xtest.std()

## Model Building

In [None]:
#Three models are built below: Decision Tree, Random Forest and KNN
#The first model is fitted using a decision tree classifier
#because it can capture non-linearity in the data

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold,cross_val_score
dt=DecisionTreeClassifier(criterion='entropy')

dt.fit(xtrain,ytrain)

ypred=dt.predict(xtest)
ypredt=dt.predict(xtrain)
ypred_prob=dt.predict_proba(xtest)[:,1]

score=cross_val_score(dt, xtrain,ytrain,scoring='accuracy', cv=5)

bias_error=np.mean(1-score)
var_error=np.std(score)
print('Bias_error',bias_error)
print('Variance_error:',var_error)
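The cell above uses the mean cross-validation error as a rough proxy for bias and the spread of the fold scores as a rough proxy for variance. A self-contained sketch of the same calculation on synthetic data (not the placement data) follows; the names `bias_proxy` and `variance_proxy` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=200, n_features=8, random_state=10)

tree = DecisionTreeClassifier(criterion="entropy", random_state=10)
scores = cross_val_score(tree, X, y, scoring="accuracy", cv=5)

bias_proxy = np.mean(1 - scores)     # average error across the 5 folds
variance_proxy = np.std(scores)      # how much the fold scores vary

print("Fold accuracies:", scores)
print("Bias proxy     :", bias_proxy)
print("Variance proxy :", variance_proxy)
```

These are informal indicators rather than a formal bias-variance decomposition, but they are convenient for comparing models on the same folds.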

In [None]:
# ROC CURVE FOR Decision tree model

from sklearn.metrics import roc_curve,roc_auc_score

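print('Area under the roc :',roc_auc_score(ytest,ypred_prob))
fpr,tpr,threshold=roc_curve(ytest,ypred_prob)
plt.plot(fpr,tpr)
plt.plot([0,1],[0,1],'r--') #diagonal = random classifier
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.show()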




In [None]:
#Random forest model is built
rf=RandomForestClassifier(n_estimators=20 , criterion='entropy',random_state=10)
rf.fit(xtrain,ytrain)
ypredr=rf.predict(xtest)


In [None]:
#KNN model is built
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

knn=KNeighborsClassifier()
param={'n_neighbors':np.arange(1,80)}

gs=GridSearchCV(knn, param_grid=param , scoring='roc_auc')
gs.fit(xtrain,ytrain)
gs.best_params_
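Rather than copying a value of `n_neighbors` by hand, the fitted search object can supply the tuned model directly. The sketch below shows this on synthetic data, with a smaller grid than the one above to keep it quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for xtrain / ytrain
X, y = make_classification(n_samples=200, n_features=6, random_state=10)

grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": np.arange(1, 30)},
                    scoring="roc_auc", cv=5)
grid.fit(X, y)

best_k = grid.best_params_["n_neighbors"]
best_knn = grid.best_estimator_   # refit on all of X, y because refit=True by default

print("Best n_neighbors:", best_k)
print("Mean CV ROC AUC :", grid.best_score_)
```

Using `best_estimator_` keeps the final model consistent with whatever the search selected.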

In [None]:
knn=KNeighborsClassifier(n_neighbors=15,weights='distance')
knn.fit(xtrain,ytrain)
ypredk=knn.predict(xtest)
ypredk_train=knn.predict(xtrain)
print('Overall accuracy of the knn test data:',accuracy_score(ytest,ypredk))
print('Overall accuracy of the knn train data:',accuracy_score(ytrain,ypredk_train))
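One caveat when reading the KNN train accuracy: with `weights='distance'`, every training point is its own nearest neighbour at distance zero, so it is predicted from its own label. The sketch below illustrates this on synthetic data; the `train_acc` dictionary is an illustrative name.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=150, n_features=6, random_state=10)

train_acc = {}
for weights in ("uniform", "distance"):
    knn = KNeighborsClassifier(n_neighbors=15, weights=weights).fit(X, y)
    train_acc[weights] = accuracy_score(y, knn.predict(X))
    print(weights, "train accuracy:", train_acc[weights])
```

A perfect train score under distance weighting is therefore expected and is not, on its own, evidence of overfitting; the test or cross-validated score is the more informative number.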

In [None]:
from sklearn.metrics import accuracy_score


print('Overall accuracy of the Decision tree model test data:',accuracy_score(ytest,ypred))
print('Overall accuracy of the Decision tree model train data:',accuracy_score(ytrain,ypredt))
print('Overall accuracy of the Random forest model test data:',accuracy_score(ytest,ypredr))

### INFERENCES:
1. All the models built above have higher accuracy on the training data than on the test data.

2. This means the models are overfitted to the training data.

3. We need to reduce the variance error, which can be done using bagging techniques.

In [None]:

from sklearn.metrics import classification_report

#Classification report for the model built using Decision tree
print(classification_report(ytest,ypred))


#From the classification report above, students who get placed
#are predicted with a per-class score of about 96%.
#Students who do not get placed are predicted with a per-class score of about 82%.
# The model is overfitted and the data is also imbalanced.
# We will apply the bagging ensemble technique to overcome this.





1. For all three models built above, the training accuracy is higher than the testing accuracy, so the models overfit the training data.

2. The variance error of the models is therefore high. To reduce it, we will use the bagging technique.


In [None]:
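#BAGGING with Decision tree

from sklearn.ensemble import BaggingClassifier
dt=DecisionTreeClassifier()
#base model is passed positionally: the keyword is base_estimator in older scikit-learn and estimator in newer releases
bg=BaggingClassifier(dt,n_estimators=30,random_state=10)

bg.fit(xtrain,ytrain)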

In [None]:
ypredb=bg.predict(xtest)
ypredb_prob=bg.predict_proba(xtest)[:,1]
ypredb_train=bg.predict(xtrain)

print('Overall accuracy of the Decision tree model with bagging test data:',accuracy_score(ytest,ypredb))
print('Overall accuracy of the Decision tree model with bagging  train data:',accuracy_score(ytrain,ypredb_train))


In [None]:
# ROC CURVE FOR Decision tree model with bagging

from sklearn.metrics import roc_curve,roc_auc_score

print('Area under the roc :',roc_auc_score(ytest,ypredb_prob))
fpr,tpr,threshold=roc_curve(ytest,ypredb_prob)
plt.plot(fpr,tpr)
plt.plot([0,1],[0,1],'r--')
plt.show()

In [None]:
# PLOTTING PREVIOUS & BAGGING ROC

print('Area under the roc (without bagging):',roc_auc_score(ytest,ypred_prob))
fpr,tpr,threshold=roc_curve(ytest,ypred_prob)
plt.plot(fpr,tpr , label='WITHOUT BAGGING')

print('Area under the roc (with bagging):',roc_auc_score(ytest,ypredb_prob))
fpr,tpr,threshold=roc_curve(ytest,ypredb_prob)
plt.plot(fpr,tpr, label='WITH BAGGING')
plt.plot([0,1],[0,1],'r--') #diagonal = random classifier, drawn once
plt.legend()
plt.show()

#### INFERENCES:
1. Testing accuracy has improved over the previous decision tree model.

2. The decision tree model without bagging has lower testing accuracy than the decision tree model with bagging.

3. From the ROC curves above, the area under the curve for the model built with bagging is larger than the area under the curve for the model built without bagging.

4. Both the ROC AUC score and the testing accuracy have improved after bagging the decision tree model.
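The comparison behind these inferences can be sketched in one loop. The example below uses synthetic data, so its numbers say nothing about the placement results; the `results` dictionary and the model labels are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the placement features (not the real data)
X, y = make_classification(n_samples=300, n_features=10, random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=10)

models = {
    "single tree": DecisionTreeClassifier(random_state=10),
    "bagged trees": BaggingClassifier(DecisionTreeClassifier(random_state=10),
                                      n_estimators=30, random_state=10),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = {
        "train_acc": accuracy_score(y_tr, model.predict(X_tr)),
        "test_acc": accuracy_score(y_te, model.predict(X_te)),
        "test_auc": roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]),
    }
    print(name, results[name])
```

Looking at the gap between `train_acc` and `test_acc` for each model, alongside `test_auc`, gives a compact view of whether bagging reduced overfitting on a given dataset.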


### INFERENCES:
1. From the EDA, we can see that marks from previous classes matter in placements.
Students with higher marks in SSC, HSC and degree are more likely to be placed than those with lower marks.

2. Marks are therefore one of the important criteria for predicting whether a student will be placed.

3. Work experience also matters in placement. An experienced candidate has a higher chance of being placed.

4. We then built a decision tree model.
The model is overfitted to the training dataset, so we get lower accuracy on the testing dataset.
The imbalance in the data is also reflected in the precision and recall values.

5. The data contains more placed students, so a wider variety of characteristics is available for this group when building the model and predicting who will be placed.
The data for students who were not placed is smaller, so the variety of characteristics for this class may be missing or insufficient to predict that a student will not be placed.

6. Ensemble techniques are used to overcome overfitting and underfitting of a model.
Since the model above is overfitted, we use the bagging ensemble technique to reduce the variance error.

7. We therefore built a bagging model with a decision tree as the base model and thereby reduced the overfitting problem.
The accuracy on the testing dataset has improved over the previous model.

8. We compared the results using the ROC curve and the ROC AUC score.