![Right_Candidate](http://www.prevuehr.com/drive/uploads/2016/11/AdobeStock_95250209-Converted-1024x695.png?_t=1484780827)

In this notebook, I perform a detailed exploratory data analysis (EDA) of the Student Placement dataset with visualizations, and extract insights from the data that may be helpful for fresh candidates. At the end, I build a machine learning model that predicts whether a particular candidate was placed, based on the available features.

**I hope you like my work!**

<a id='top'></a>
<div class="list-group" id="list-tab" role="tablist">
<p style="background-color:#47b7ed;font-family:newtimeroman;color:#000000;font-size:170%;text-align:center;border-radius:20px 80px;">📋 TABLE OF CONTENTS</p>   

    
* [1. Importing Libraries](#1)
    
* [2. Meta information of dataframe](#2)
    
* [3. Statistical information](#3)  
    
* [4. NaN values](#4)  
    
* [5. EDA & Visualization](#5)
    * [5.1. Mean Age of Students](#5.1)
    
    * [5.2. Total Male & Female](#5.2)
    
    * [5.3. Total Male & Female Pass Placement Exam](#5.3)
    
    * [5.4. Higher CGPA VS Placement](#5.4)
    
    * [5.5. Lower CGPA VS Placement](#5.5)
    
    * [5.6. Analyze Stream](#5.6)
    
    * [5.7. No Internship Experience VS Placement](#5.7)
    
    * [5.8. Top words in Stream](#5.8)

* [6. Preprocess data for Machine Learning](#6)
    * [6.1. One-Hot Encoding](#6.1)
    
    * [6.2. Scaling Features](#6.2)

* [7. Visualize Correlation](#7)
 
* [8. Train Test Split](#8)

* [9. Create & Train Model](#9)

* [10. Visualize Model Score](#10)

* [11. Hyperparameter tuning using RandomizedSearchCV](#11)

* [12. Best estimator and best hyperparameters](#12)

* [13. Training model with best hyperparameters](#13)

* [14. Plot Confusion Matrix](#14)









    

<a id="1"></a>
# <p style="background-color:#47b7ed;font-family:newtimeroman;color:#000000;font-size:120%;text-align:center;border-radius:20px 80px;">📥 Importing libraries</p>


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from wordcloud import WordCloud
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score, plot_confusion_matrix
from datetime import datetime
import warnings

warnings.filterwarnings("ignore", category = FutureWarning)


sns.set(style="darkgrid")

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
data = pd.read_csv('/kaggle/input/engineering-placements-prediction/collegePlace.csv')

data.head()

<a id="2"></a>
# <p style="background-color:#47b7ed;font-family:newtimeroman;color:#000000;font-size:120%;text-align:center;border-radius:20px 80px;">📝 Meta information of Dataframe</p>


In [None]:
print(f"Shape of Dataframe is: {data.shape}")

In [None]:
print('Datatype in Each Column\n')
pd.DataFrame(data.dtypes, columns=['Datatype']).rename_axis("Column Name")

<a id="3"></a>
# <p style="background-color:#47b7ed;font-family:newtimeroman;color:#000000;font-size:120%;text-align:center;border-radius:20px 80px;">📈 Statistical information of Dataframe</p>


In [None]:
data.describe().T.style.bar(subset=['mean'], color='#205ff2').background_gradient(subset=['std'], cmap='Reds').background_gradient(subset=['50%'], cmap='coolwarm')


<a id="4"></a>
# <p style="background-color:#47b7ed;font-family:newtimeroman;color:#000000;font-size:120%;text-align:center;border-radius:20px 80px;">🔎 Checking for NaN values</p>


In [None]:
pd.DataFrame(data.isnull().sum(), columns=["Null Values"]).rename_axis("Column Name")

### **Fortunately, the data has no missing values**

<a id="5"></a>
# <p style="background-color:#47b7ed;font-family:newtimeroman;color:#000000;font-size:120%;text-align:center;border-radius:20px 80px;">🔥 EDA & Visualization</p>


<a id="5.1"></a>
# <p style="background-color:#73d1ff;font-family:newtimeroman;color:#000000;font-size:120%;text-align:center;border-radius:20px 80px;">✔️ Mean Age of Student</p>


In [None]:
fig = px.histogram(data, 'Age',
                   title="<b>Average Age of Student</b>")

fig.add_vline(x=data['Age'].mean(), line_width=2, line_dash="dash", line_color="red")

fig.show()

In [None]:
fig = px.histogram(data, 'Age',             
                   color="Gender",
                   title="<b>Average Age Gender wise</b>")

fig.add_vline(x=data['Age'].mean(), line_width=2, line_dash="dash", line_color="black")

fig.show()

<a id="5.2"></a>
# <p style="background-color:#73d1ff;font-family:newtimeroman;color:#000000;font-size:120%;text-align:center;border-radius:20px 80px;">✔️ Total Male & Female</p>


In [None]:
# Name the index and the count column explicitly, so the result does not depend on the pandas version's default names
data['Gender'].value_counts().rename_axis("Gender").to_frame("Counts")

In [None]:
px.histogram(data, x = "Gender", title = "<b>Total Male and Female</b>")

In [None]:
fig = px.pie(data, names = "Gender",
             title = "<b>Counts in Gender</b>",
             hole = 0.5, template = "plotly_dark")

fig.update_traces(textposition='inside',
                  textinfo='percent+label',
                  marker=dict(line=dict(color='#000000', width = 1.5)))


fig.show()


<a id="5.3"></a>
# <p style="background-color:#73d1ff;font-family:newtimeroman;color:#000000;font-size:120%;text-align:center;border-radius:20px 80px;">✔️ Total Male and Female Pass Placement</p>




In [None]:
male = data[data['Gender'] == "Male"]
female = data[data['Gender'] == "Female"]

In [None]:
total_male = male.shape[0]
total_female = female.shape[0]

In [None]:
total_male_pass = male[male['PlacedOrNot'] == 1].shape[0]
total_female_pass = female[female['PlacedOrNot'] == 1].shape[0]

In [None]:
pass_male_percentage = np.round((total_male_pass * 100) / total_male,2)
pass_female_percentage = np.round((total_female_pass * 100) / total_female,2)

In [None]:
details = {"Total Male": [total_male],
             "Total Female": [total_female],
             "Total male pass" : [total_male_pass],
             "Total female pass" : [total_female_pass],
             "% of Passed Male" : [pass_male_percentage],
             "% of Passed Female" : [pass_female_percentage]}

In [None]:
details

In [None]:
gender_wise = pd.DataFrame(details, index=["Detail"])
gender_wise.T
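
The per-gender totals and percentages computed step by step above can also be obtained in one pass with `groupby`. The following is a minimal sketch on a small made-up frame: the rows are illustrative and not taken from the dataset, and only the column names `Gender` and `PlacedOrNot` come from the notebook.

```python
import pandas as pd

# Toy rows for illustration only; not the real placement data.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female"],
    "PlacedOrNot": [1, 0, 1, 1, 0],
})

# count = students per gender, sum = placed students, mean = share placed
gender_summary = toy.groupby("Gender")["PlacedOrNot"].agg(
    total="count", placed="sum", placed_pct="mean"
)
# Convert the share to a percentage rounded to two decimals
gender_summary["placed_pct"] = (gender_summary["placed_pct"] * 100).round(2)
print(gender_summary)
```

Because `PlacedOrNot` is a 0/1 column, its mean is the placement rate, so no separate division is needed.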

In [None]:
fig = px.histogram(data_frame = data,
             x = "Stream",
             color="PlacedOrNot", title="<b>Counts of Stream</b>",
             pattern_shape_sequence=['x'],
             template='plotly_dark')

fig.show()

**The majority of candidates are Computer Science students, and this stream also has the largest number of placed students compared with the other streams.**

<a id="5.4"></a>
# <p style="background-color:#73d1ff;font-family:newtimeroman;color:#000000;font-size:120%;text-align:center;border-radius:20px 80px;">✔️ Higher CGPA Vs Placement</p>




## **Displaying all those records whose CGPA is above average**

In [None]:
cgpa_above_avg = data[data['CGPA'] > data['CGPA'].mean()]

cgpa_above_avg

In [None]:
fig = px.histogram(data_frame = cgpa_above_avg,
                   x = 'CGPA',
                   color='PlacedOrNot',
                   title = "<b>Above Average CGPA Vs Placement</b>",
                   template='plotly')

fig.update_layout(bargap=0.2)

fig.show()

**The graph above shows that students whose CGPA is above average have passed the placement test, so a higher CGPA appears to increase a candidate's chance of placement.**


<a id="5.5"></a>
# <p style="background-color:#73d1ff;font-family:newtimeroman;color:#000000;font-size:120%;text-align:center;border-radius:20px 80px;">✔️ Lower CGPA Vs Placement</p>




## **Candidates whose CGPA is below average**

In [None]:
cgpa_below_avg = data[data['CGPA'] < data['CGPA'].mean()]

cgpa_below_avg

In [None]:
fig = px.histogram(data_frame = cgpa_below_avg,
                   x = 'CGPA',
                   color='PlacedOrNot',
                   title = "<b>Below Average CGPA Vs Placement</b>",
                   template='plotly_dark', barmode='group')

fig.update_layout(bargap=0.2)

fig.show()

**From the graph above, a student with a below-average CGPA has a reduced chance of placement.**
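
To put a number on the two CGPA observations, one option is to compare the placement rate on each side of the mean. This is a minimal sketch on made-up rows, not the notebook's own analysis. The `cgpa_group` column and its labels are names I introduce for illustration, and students exactly at the mean are grouped with the lower side here, whereas the filters above use strict `>` and `<`.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; not the real placement data.
toy = pd.DataFrame({
    "CGPA": [6, 6, 7, 8, 8, 9],
    "PlacedOrNot": [0, 0, 1, 1, 1, 1],
})

mean_cgpa = toy["CGPA"].mean()

# Label each student by which side of the mean their CGPA falls on
toy["cgpa_group"] = np.where(
    toy["CGPA"] > mean_cgpa, "above average", "average or below"
)

# The mean of a 0/1 column is the share of placed students in each group
placement_rate = toy.groupby("cgpa_group")["PlacedOrNot"].mean()
print(placement_rate)
```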


<a id="5.6"></a>
# <p style="background-color:#73d1ff;font-family:newtimeroman;color:#000000;font-size:120%;text-align:center;border-radius:20px 80px;">✔️ Analyze important features stream wise</p>




In [None]:
stream_wise = data.groupby('Stream').agg({'Age':'mean',
                                          'Internships' : 'sum',                            
                                           "CGPA":'mean',
                                           'PlacedOrNot':'sum'})

stream_wise.style.highlight_max()

In [None]:
px.bar(data_frame=stream_wise, barmode='group',
       title = "<b>Stream wise Analyzing</b>",template="plotly_dark")

**The Computer Science stream has the largest number of placed students.**

<a id="5.7"></a>
# <p style="background-color:#73d1ff;font-family:newtimeroman;color:#000000;font-size:120%;text-align:center;border-radius:20px 80px;">✔️ Effect of No Internship Experience on Placement</p>




In [None]:
no_internship = data[data['Internships'] == 0]

no_internship

In [None]:
fig = px.histogram(data_frame = no_internship,
                   x = "PlacedOrNot",
                   color="PlacedOrNot",
                   title = "<b>No Internship Experience Vs Placement</b>")

fig.update_layout(bargap=0.2)

fig.show()

In [None]:
fig = px.pie(no_internship, names = "PlacedOrNot",
             hole = 0.5)

fig.update_traces(textposition='inside',
                  textinfo='percent+label',
                  marker=dict(line=dict(color='#000000', width = 1.5)))


fig.show()

**So, from the graphs above, having no internship experience does not have much effect on placement: the majority of students without any internship experience have still passed the placement exam.**

<a id="5.8"></a>
# <p style="background-color:#73d1ff;font-family:newtimeroman;color:#000000;font-size:120%;text-align:center;border-radius:20px 80px;">✔️ Top words in Stream</p>




In [None]:
def plot_word_cloud(df, col_name):    
    text = ' '.join(df[col_name].str.lower())

    wordcloud = WordCloud(width = 2000, height = 900,
                          background_color ='black',
                          collocations=False,
                          max_words=500,
                          min_font_size = 15).generate(text)

    plt.figure(figsize=(12, 8), facecolor = 'k', edgecolor = 'k' )
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad = 0) 
    plt.show()

In [None]:
plot_word_cloud(data, "Stream")

<a id="6"></a>
# <p style="background-color:#47b7ed;font-family:newtimeroman;color:#000000;font-size:120%;text-align:center;border-radius:20px 80px;">⚙️ Preprocessing data for Machine Learning</p>




<a id="6.1"></a>
# <p style="background-color:#73d1ff;font-family:newtimeroman;color:#000000;font-size:120%;text-align:center;border-radius:20px 80px;">✔️ One-Hot Encoding</p>




In [None]:
dummy_gender = pd.get_dummies(data['Gender'])
dummy_stream = pd.get_dummies(data['Stream'])

In [None]:
data = pd.concat([data.drop(["Gender", "Stream"], axis = 1), dummy_gender, dummy_stream], axis = 1)

data
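
As an aside, `pd.get_dummies` can encode several columns in a single call through its `columns` argument, which avoids the separate `drop` and `concat` steps. The sketch below runs on made-up rows and is only an illustration. Note that with the default settings the new columns are prefixed with the source column name (for example `Gender_Male`), unlike the bare names used in this notebook.

```python
import pandas as pd

# Toy rows for illustration only; not the real placement data.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Stream": ["Civil", "Mechanical", "Civil"],
    "CGPA": [7, 8, 6],
})

# Encode only the listed columns; dtype=int yields 0/1 instead of booleans
encoded = pd.get_dummies(toy, columns=["Gender", "Stream"], dtype=int)
print(encoded)
```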

## **Rearrange columns**

In [None]:
data = data[['Age', 'Male', 'Female',
             'Electronics And Communication',
             'Computer Science', 'Information Technology',
             'Mechanical', 'Electrical', "Civil",
             "Internships","CGPA",'Hostel',
             'HistoryOfBacklogs', 'PlacedOrNot']]

data

<a id="6.2"></a>
# <p style="background-color:#73d1ff;font-family:newtimeroman;color:#000000;font-size:120%;text-align:center;border-radius:20px 80px;">✔️ Scaling features</p>




In [None]:
scaler = StandardScaler()

scaler.fit(data.drop('PlacedOrNot',axis=1))

scaled_features = scaler.transform(data.drop('PlacedOrNot',axis=1))

In [None]:
scaled_features = pd.DataFrame(scaled_features, columns = data.columns[:-1])
scaled_features.head()
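
The scaler above is fitted on the full dataset before the train/test split. A common alternative, sketched here as an assumption and not as this notebook's method, is to fit the scaler on the training rows only, so that test-set statistics do not influence preprocessing. The data below is synthetic, and the names `split_scaler`, `X_tr` and `X_te` are introduced only for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and labels for illustration only
rng = np.random.default_rng(0)
X = rng.normal(loc=7.0, scale=1.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit on the training rows only, then reuse those statistics for the test rows
split_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = split_scaler.transform(X_tr)
X_te_scaled = split_scaler.transform(X_te)

# Training columns end up with mean ~0 and standard deviation ~1
print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```

The test columns will generally not have exactly zero mean after this transform, which is expected.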


<a id="7"></a>
# <p style="background-color:#47b7ed;font-family:newtimeroman;color:#000000;font-size:120%;text-align:center;border-radius:20px 80px;">🎰 Visualize correlation of independent features with the dependent feature</p>




In [None]:
corrmat = data.corr()

plt.figure(figsize=(20,15))

# Plot the correlation matrix as a heat map
g = sns.heatmap(corrmat, annot=True, cmap="RdYlGn")

**Internships and CGPA are highly correlated with the dependent feature, i.e. PlacedOrNot**
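
Reading the target column off a large heat map can be hard, so it may help to list the correlations with `PlacedOrNot` directly, sorted by absolute value. This is a minimal sketch on made-up rows. The values it prints say nothing about the real dataset.

```python
import pandas as pd

# Toy rows for illustration only; not the real placement data.
toy = pd.DataFrame({
    "Internships": [0, 1, 2, 0, 3, 1],
    "CGPA": [6, 7, 8, 6, 9, 7],
    "Hostel": [1, 0, 1, 0, 1, 0],
    "PlacedOrNot": [0, 0, 1, 0, 1, 1],
})

# Take the target's column of the correlation matrix, drop its
# self-correlation, and order the rest by strength regardless of sign
target_corr = (
    toy.corr()["PlacedOrNot"]
    .drop("PlacedOrNot")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```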




<a id="8"></a>
# <p style="background-color:#47b7ed;font-family:newtimeroman;color:#000000;font-size:120%;text-align:center;border-radius:20px 80px;">🧾 Train test split</p>




In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(scaled_features,
                                                    data['PlacedOrNot'],
                                                    test_size = 0.25,
                                                    random_state = 0)


In [None]:
print(f"Shape of X_train is: {X_train.shape}")
print(f"Shape of X_test is: {X_test.shape}\n")

print(f"Shape of y_train is: {y_train.shape}")
print(f"Shape of y_test is: {y_test.shape}")


<a id="9"></a>
# <p style="background-color:#47b7ed;font-family:newtimeroman;color:#000000;font-size:120%;text-align:center;border-radius:20px 80px;">🤖 Create & Train Model</p>




In [None]:
def models_score(models, X_train, X_test, y_train, y_test):    
    
    scores = {}
    
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = model.score(X_test,y_test)

    model_scores = pd.DataFrame(scores, index=['Score']).transpose()
    model_scores = model_scores.sort_values('Score')
        
    return model_scores

In [None]:
models = {"DecisionTree":DecisionTreeClassifier(),
         "RandomForest":RandomForestClassifier(),
         "XgBoost": XGBClassifier(),
         "KNeighborsClassifier":KNeighborsClassifier()}


<a id="10"></a>
# <p style="background-color:#47b7ed;font-family:newtimeroman;color:#000000;font-size:120%;text-align:center;border-radius:20px 80px;">📊 Visualize Model Score</p>




In [None]:
model_scores = models_score(models, X_train, X_test, y_train, y_test)

In [None]:
model_scores.style.highlight_max()

In [None]:
model_scores = model_scores.reset_index().rename({"index":"Algorithms"}, axis = 1)

model_scores.style.bar()

In [None]:
fig = px.bar(data_frame = model_scores,
             x="Algorithms",
             y="Score",
             color="Algorithms", title = "<b>Models Score</b>", template = 'plotly_dark')

fig.update_layout(bargap=0.2)

fig.show()

In [None]:
label = model_scores['Algorithms']
value = model_scores['Score']

fig = go.Figure(data=[go.Pie(labels = label, values = value, rotation = 90)])

fig.update_traces(textposition='inside',
                  textinfo='percent+label',
                  marker=dict(line=dict(color='#000000', width = 1.5)))

fig.update_layout(title_x=0.5,
                  title_font=dict(size=20),
                  uniformtext_minsize=15)

fig.show()
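
The scores above come from a single train/test split, so they can shift with a different `random_state`. One way to get a steadier comparison is k-fold cross-validation. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_classification` and only two of the four models (XGBoost is left out to keep the example dependency-free), and the names `demo_models` and `cv_scores` are mine.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

demo_models = {
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "KNeighborsClassifier": KNeighborsClassifier(),
}

# Mean accuracy over 5 folds for each model
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
    for name, model in demo_models.items()
}
print(cv_scores)
```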

<a id="11"></a>
# <p style="background-color:#47b7ed;font-family:newtimeroman;color:#000000;font-size:120%;text-align:center;border-radius:20px 80px;">📻 Hyperparameter tuning using RandomizedSearchCV</p>





In [None]:
## Hyper Parameter Optimization

params={
 "learning_rate"    : [0.05, 0.10, 0.15, 0.20, 0.25, 0.30 ] ,
 "max_depth"        : [ 3, 4, 5, 6, 8, 10, 12, 15],
 "min_child_weight" : [ 1, 3, 5, 7 ],
 "gamma"            : [ 0.0, 0.1, 0.2 , 0.3, 0.4 ],
 "colsample_bytree" : [ 0.3, 0.4, 0.5 , 0.7 ]
    
}

In [None]:
def timer(start_time=None):
    # Call with no argument to start timing; call with the start time to print the elapsed time
    if not start_time:
        return datetime.now()
    thour, temp_sec = divmod((datetime.now() - start_time).total_seconds(), 3600)
    tmin, tsec = divmod(temp_sec, 60)
    print(f'\nTime taken: {int(thour)} hours {int(tmin)} minutes and {round(tsec, 2)} seconds.')

In [None]:
# For now, tune only the XGBoost classifier
xgb_classifier = XGBClassifier()

In [None]:
random_search = RandomizedSearchCV(xgb_classifier,
                                   param_distributions=params,
                                   n_iter=5,
                                   scoring='roc_auc',
                                   n_jobs=-1,
                                   cv=5, verbose=3)

In [None]:
start_time = timer(None) 

random_search.fit(X_train, y_train)

timer(start_time)
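
The search above does not set `random_state`, so the five sampled parameter combinations (and therefore the best one) can differ between runs. The sketch below shows the same `RandomizedSearchCV` pattern with a fixed seed. It is only an illustration: I swap in a `DecisionTreeClassifier` and synthetic data so the example runs without XGBoost, and `demo_params` and `demo_search` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Small illustrative grid; these are decision-tree parameters, not XGBoost ones
demo_params = {
    "max_depth": [2, 3, 4, 5, 6],
    "min_samples_leaf": [1, 3, 5, 7],
}

demo_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=demo_params,
    n_iter=5,           # number of sampled combinations
    scoring="roc_auc",
    cv=5,
    random_state=0,     # fixes which combinations are sampled
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(round(demo_search.best_score_, 3))
```

With the default `refit=True`, `best_estimator_` is already refitted on the data passed to `fit` using the best parameters found.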

<a id="12"></a>
# <p style="background-color:#47b7ed;font-family:newtimeroman;color:#000000;font-size:120%;text-align:center;border-radius:20px 80px;">✨ Checking for best Estimator and best Hyperparameters</p>





In [None]:
xgb_best_params = random_search.best_estimator_

In [None]:
xgb_best_params

In [None]:
random_search.best_params_

In [None]:
classifier = xgb_best_params


<a id="13"></a>
# <p style="background-color:#47b7ed;font-family:newtimeroman;color:#000000;font-size:120%;text-align:center;border-radius:20px 80px;">🤓 Training model with Best Hyperparameters</p>





In [None]:
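# Set eval_metric on the estimator; recent xgboost releases no longer accept it in fit()
classifier.set_params(eval_metric='logloss')
classifier.fit(X_train, y_train)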

In [None]:
pred = classifier.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, pred)}")

<a id="14"></a>
# <p style="background-color:#47b7ed;font-family:newtimeroman;color:#000000;font-size:120%;text-align:center;border-radius:20px 80px;">(⌐■_■) Plotting confusion matrix</p>






In [None]:
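# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(classifier,
                                      X_test, y_test,
                                      cmap=plt.cm.Blues,
                                      display_labels=['Not Placed', 'Placed'])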
plt.grid(False)
plt.show();
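
The four cells of a binary confusion matrix are enough to derive accuracy, precision and recall by hand, which can make the plot easier to interpret. This is a small worked example on made-up labels, not on the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only (1 = placed, 0 = not placed)
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1]

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted "placed" that are right
recall = tp / (tp + fn)                      # share of actual "placed" that are found
print(accuracy, precision, recall)
```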

<a id="6.2"></a>
# <p style="background-color:#73d1ff;font-family:newtimeroman;color:#000000;font-size:110%;text-align:center;border-radius:150px 150px;">I hope my kernel is helpful for you. If you like my work, don't forget to upvote. Thank you! 🙂</p>


