### Problem Statement:

Predict the placement status of a candidate from scores obtained at several examinations. Evaluate the predictions using a classification report and the ROC curve with its AUC.

Source: https://www.kaggle.com/benroshan/factors-affecting-campus-placement

#### Data Dictionary (Understand & explain each feature in the data)

#### Import required packages
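A minimal sketch of the imports the early cells appear to rely on, judging by the `pd`, `plt` and `sns` aliases used further down. The `numpy` import is an assumption; nothing in the notebook requires it explicitly.

```python
# Data handling
import numpy as np   # assumed helper, not referenced explicitly in the notebook
import pandas as pd

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
```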

#### Load and view the data

In [None]:
df = pd.read_csv('placement.csv')

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
# Dimension of the dataset


In [None]:
# Check the structure of data using info command


In [None]:
# Data summary for all features in the data


In [None]:
# Check the distribution of the labelled field (Class variable)


##### Observations:

sl_no is a unique identifier for each observation. It carries no predictive information, so it can be dropped.

Special characters such as '&' appear in the data. They can be replaced (here with 'And') to avoid possible errors while processing the data.

There are 215 observations in the data, with 6 numeric fields and 8 categorical fields.

There are no missing values in the data.

'status' is the dependent variable. 69% of the observations correspond to a status of 'Placed' and the remaining 31% to 'Not Placed', so the classes are moderately imbalanced.
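The checks behind these observations can be sketched as follows. The rows are a made-up toy frame, not the placement data, and the name `toy` is hypothetical; on the real data the same calls would be made on `df`.

```python
import pandas as pd

# Toy stand-in for placement.csv (made-up rows, for illustration only)
toy = pd.DataFrame({
    "sl_no": [1, 2, 3, 4, 5],
    "ssc_p": [67.0, 79.3, 65.0, 56.0, 85.8],
    "workex": ["No", "Yes", "No", "No", "No"],
    "status": ["Placed", "Placed", "Placed", "Not Placed", "Placed"],
})

print(toy.shape)                    # (rows, columns)
toy.info()                          # dtypes and non-null counts per column
print(toy.describe(include="all"))  # summary of numeric and categorical fields
print(toy.isnull().sum())           # missing values per column

# Share of each class in the target, as a percentage
class_share = toy["status"].value_counts(normalize=True).mul(100).round(1)
print(class_share)
```

Using `normalize=True` gives proportions directly, which makes an imbalance between 'Placed' and 'Not Placed' easy to read off.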

### Data Cleanup

In [None]:
# Drop sl_no field as it will not be used in the model


In [None]:
# View bad data in degree_t feature
df.degree_t.head()

In [None]:
# Check the frequency distribution of the categorical levels in degree_t feature


In [None]:
# Replace '&' with 'And' in degree_t 


In [None]:
# Check the frequency distribution of the categorical levels in degree_t feature and compare with the original data


In [None]:
# Cleanup specialisation feature in similar way

In [None]:
# Check for duplicates


In [None]:
# Final dimension of the dataframe


In [None]:
# Final structure of dataframe after cleanup
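The cleanup steps in this section could look roughly like the sketch below. It runs on a made-up toy frame; the category labels are assumed to resemble those in the dataset, and `toy` is a hypothetical name standing in for `df`.

```python
import pandas as pd

# Toy rows with '&' in the text columns (illustrative values only)
toy = pd.DataFrame({
    "sl_no": [1, 2, 3],
    "degree_t": ["Sci&Tech", "Comm&Mgmt", "Others"],
    "specialisation": ["Mkt&HR", "Mkt&Fin", "Mkt&Fin"],
})

# Drop the identifier column, which is not used in the model
toy = toy.drop(columns="sl_no")

# Replace the literal '&' with 'And' in both text columns
for col in ["degree_t", "specialisation"]:
    toy[col] = toy[col].str.replace("&", "And", regex=False)

print(toy["degree_t"].value_counts())             # levels after the replacement
print("Duplicate rows:", toy.duplicated().sum())  # count of fully repeated rows
print(toy.shape)                                  # final dimension
```

Passing `regex=False` treats '&' as a plain character, so the call behaves the same regardless of the pandas default for pattern matching.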


### EDA

#### Univariate Analysis

In [None]:
# Histogram and box plot of ssc_p, side by side
fig_dims = (15, 5)
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=fig_dims)
sns.histplot(x='ssc_p', data=df, ax=axs[0])
sns.boxplot(x='ssc_p', data=df, ax=axs[1])

In [None]:
# Univariate analysis for hsc_p



In [None]:
# Univariate analysis for degree_p



In [None]:
# Histogram and box plot of etest_p, side by side
fig_dims = (15, 5)
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=fig_dims)
sns.histplot(x='etest_p', data=df, ax=axs[0])
sns.boxplot(x='etest_p', data=df, ax=axs[1])

#### Observation:

Numeric fields are slightly right-skewed. The fields 'degree_p' and 'hsc_p' show outliers.
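Skewness and outliers can also be checked numerically rather than read off the plots. The sketch below uses made-up scores and assumes the usual box plot convention of 1.5 times the interquartile range for the whiskers.

```python
import pandas as pd

# Made-up percentage scores with one low and one high extreme value
scores = pd.Series([37.0, 60.0, 62.0, 65.0, 66.0, 68.0, 70.0, 73.0, 97.7], name="hsc_p")

# Positive skewness indicates a longer right tail
skewness = scores.skew()
print(f"Skewness: {skewness:.2f}")

# Whisker limits as drawn by a standard box plot (1.5 * IQR rule)
q1, q3 = scores.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the whiskers are the points a box plot shows as outliers
outliers = scores[(scores < lower) | (scores > upper)]
print(outliers)
```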

In [None]:
sns.countplot(x='degree_t', data=df)

#### Observation:

Most candidates hold a 'Communication and Management' degree.

In [None]:
# Countplot for workex


#### Observation:

Most candidates have no work experience.

#### Bivariate Analysis

In [None]:
df_num = df.select_dtypes(exclude='object')
num_list = df_num.columns
num_list

In [None]:
df_cat = df.select_dtypes(include='object')
cat_list = df_cat.columns
cat_list

In [None]:
# Pairplot for numeric features


In [None]:
# Save coefficient of correlation in corr object


In [None]:
# Plot a heatmap to analyze correlation
sns.heatmap(corr, annot=True)

#### Observation:

The numeric fields are only weakly correlated with each other, so multicollinearity does not appear to be a concern.
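One way to build the `corr` object used by the heatmap and to screen it for strongly related pairs is sketched here. The data are made up, and the 0.8 cut-off is an assumed rule of thumb rather than a value taken from this notebook.

```python
import pandas as pd

# Made-up scores; 'status' is included to show that text columns are skipped
toy = pd.DataFrame({
    "ssc_p": [67.0, 79.3, 65.0, 56.0, 85.8, 55.0],
    "hsc_p": [91.0, 78.3, 68.0, 52.0, 73.6, 49.8],
    "degree_p": [58.0, 77.5, 64.0, 52.0, 73.3, 67.3],
    "status": ["Placed", "Placed", "Placed", "Not Placed", "Placed", "Not Placed"],
})

# Pearson correlation between the numeric columns only
corr = toy.select_dtypes(include="number").corr()
print(corr.round(2))

# List each pair once if its absolute correlation exceeds the assumed cut-off
threshold = 0.8
high_pairs = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```

An empty `high_pairs` list would support the reading that no pair of numeric fields is strongly correlated.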

In [None]:
# Bivariate analysis of workex and etest_p


The median etest score of candidates with work experience is slightly higher than that of candidates without.
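The medians that a box plot of `etest_p` by `workex` displays can be computed directly with a group-by. This is a sketch on made-up values, so the numbers are not those of the placement data.

```python
import pandas as pd

# Made-up test scores for candidates with and without work experience
toy = pd.DataFrame({
    "workex": ["No", "Yes", "No", "Yes", "No", "Yes"],
    "etest_p": [55.0, 86.5, 75.0, 66.0, 96.8, 78.0],
})

# Median etest score per work-experience group
medians = toy.groupby("workex")["etest_p"].median()
print(medians)
```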

In [None]:
# Bivariate analysis of specialisation and status



More candidates with the 'MktAndFin' specialisation have been placed than candidates with the other specialisation.
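A cross-tabulation gives the counts behind this comparison, and normalising by row turns them into placement rates per specialisation. The rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows; labels follow the cleaned 'And' spelling
toy = pd.DataFrame({
    "specialisation": ["MktAndHR", "MktAndFin", "MktAndFin", "MktAndHR", "MktAndFin", "MktAndHR"],
    "status": ["Placed", "Placed", "Placed", "Not Placed", "Not Placed", "Not Placed"],
})

# Number of candidates per specialisation and status
counts = pd.crosstab(toy["specialisation"], toy["status"])
print(counts)

# Share of each status within a specialisation (each row sums to 1)
rates = pd.crosstab(toy["specialisation"], toy["status"], normalize="index")
print(rates.round(2))
```

Rates are worth checking alongside counts, because a specialisation with more candidates can show more placements without having a higher placement rate.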

#### Multivariate Analysis

In [None]:
sns.scatterplot(x='ssc_p', y='degree_p', hue='status', data=df)

Candidates with higher ssc and degree scores are more likely to have been placed.

In [None]:
# Multivariate analysis of hsc_s, etest_p and status



The median etest score of candidates who have been placed is higher than that of candidates who have not.

The median etest score of 'Commerce' and 'Science' candidates is higher than that of 'Arts' candidates.
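Both comparisons can be read from a single pivot table of medians, with the stream in the rows and the status in the columns. The sketch uses made-up scores, not the placement data.

```python
import pandas as pd

# Made-up etest scores by higher-secondary stream and placement status
toy = pd.DataFrame({
    "hsc_s": ["Commerce", "Commerce", "Commerce", "Commerce",
              "Science", "Science", "Arts", "Arts"],
    "status": ["Placed", "Placed", "Not Placed", "Not Placed",
               "Placed", "Not Placed", "Placed", "Not Placed"],
    "etest_p": [80.0, 90.0, 60.0, 70.0, 75.0, 65.0, 70.0, 55.0],
})

# Median etest score for every stream and status combination
pivot = toy.pivot_table(index="hsc_s", columns="status", values="etest_p", aggfunc="median")
print(pivot)
```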

### Data Preparation

#### Import libraries required for data preparation

##### Label Encoding

In [None]:
le = LabelEncoder()

In [None]:
# Perform label encoding using le object
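A minimal sketch of the encoding step on a made-up frame. The column list is written out by hand here as an assumption; on the real data it would be the categorical columns collected in `cat_list`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows with two text columns and one numeric column
toy = pd.DataFrame({
    "workex": ["No", "Yes", "No", "Yes"],
    "status": ["Placed", "Placed", "Not Placed", "Placed"],
    "ssc_p": [67.0, 79.3, 56.0, 85.8],
})

le = LabelEncoder()
for col in ["workex", "status"]:
    # fit_transform assigns integers following the sorted order of the labels
    toy[col] = le.fit_transform(toy[col])

print(toy)
print(le.classes_)  # labels of the last column fitted, here 'status'
```

Because labels are sorted before being numbered, 'Not Placed' becomes 0 and 'Placed' becomes 1. Reusing one encoder object overwrites its `classes_` on every fit, so a separate encoder per column would be needed to decode the values later.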



In [None]:
df.head()

##### Outlier Treatment

In [None]:
df[num_list].boxplot()
plt.xticks(rotation=90)
plt.show()

In [None]:
# Define a function treat_outlier to impute whisker values to the outliers
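One possible shape for `treat_outlier`, assuming the whiskers follow the common 1.5 times IQR rule; the notebook does not state which rule it intends, so treat the multiplier as an assumption.

```python
import pandas as pd

def treat_outlier(col):
    """Cap values beyond the box plot whiskers at the whisker limits."""
    q1, q3 = col.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr  # assumed whisker rule
    upper = q3 + 1.5 * iqr
    # clip leaves values inside the limits unchanged
    return col.clip(lower=lower, upper=upper)

# Made-up scores with one low and one high extreme value
scores = pd.Series([37.0, 60.0, 62.0, 65.0, 66.0, 68.0, 70.0, 73.0, 97.7])
capped = treat_outlier(scores)
print(capped.tolist())
```

Capping keeps every row in the data, which matters with only 215 observations, at the cost of flattening the extreme values.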





In [None]:
for i in num_list:
    df[i]=treat_outlier(df[i])

In [None]:
df[num_list].boxplot()
plt.xticks(rotation=90)
plt.show()

In [None]:
df.info()

In [None]:
df.status.value_counts()

#### Import all libraries required for building and evaluating the model

#### Logistic Regression Model 

In [None]:
# Capture independent variables in x and dependent in y



In [None]:
# Perform train test split with 30% test set
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.30, random_state=1)  # random_state is an assumed seed

In [None]:
# Create an object of Logistic Regression and train the model using the fit function



In [None]:
# Check accuracy of train data


In [None]:
# Check accuracy of test data
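The steps from capturing `x` and `y` through to the two accuracy checks can be sketched end to end as below. The data come from scikit-learn's `make_classification` and are synthetic, not the placement data. The `random_state`, `stratify` and `max_iter` settings are assumptions added for reproducibility and convergence.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded placement data (not the real dataset)
features, target = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
toy = pd.DataFrame(features, columns=[f"f{i}" for i in range(5)])
toy["status"] = target

# Independent variables in x, dependent variable in y
x = toy.drop(columns="status")
y = toy["status"]

# Hold out 30% of the rows; stratify keeps the class ratio similar in both parts
X_train, X_test, Y_train, Y_test = train_test_split(
    x, y, test_size=0.30, random_state=1, stratify=y
)

# Fit the model on the training part only
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, Y_train)

# score returns the mean accuracy on the given data
train_acc = logreg.score(X_train, Y_train)
test_acc = logreg.score(X_test, Y_test)
print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy:  {test_acc:.3f}")
```

A train accuracy far above the test accuracy would point to overfitting, which is the comparison the accuracy cells are meant to support.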


In [None]:
# Predict y for train and test data
Y_train_predict = logreg.predict(X_train)
Y_test_predict = logreg.predict(X_test)

In [None]:
# Plot confusion matrix for train


In [None]:
# Print classification report for train data


In [None]:
# Plot confusion matrix for test



In [None]:
# Print classification report for test data
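The confusion matrix and classification report can be produced with `sklearn.metrics`. The labels below are hand-made for illustration, with 0 standing for 'Not Placed' and 1 for 'Placed', as label encoding would assign them. In the notebook the inputs would be `Y_test` and `Y_test_predict`.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hand-made labels for illustration (0 = Not Placed, 1 = Placed)
y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1, 0, 1]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Precision, recall and F1 per class, plus overall accuracy
print(classification_report(y_true, y_pred, target_names=["Not Placed", "Placed"]))
```

The array `cm` can be passed to `sns.heatmap(cm, annot=True)` to draw it, in the same way the correlation heatmap is drawn earlier.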


In [None]:
df_pred = pd.DataFrame(Y_test.values, columns=['Actual'])
df_pred['Class_Pred'] = Y_test_predict
df_pred['Prob_Pred_1'] = logreg.predict_proba(X_test)[:,1]
df_pred.head()

In [None]:
sns.boxplot(x='Class_Pred', y='Prob_Pred_1', data=df_pred)

In [None]:
sns.boxplot(x='Actual', y='Prob_Pred_1', data=df_pred)

In [None]:
# predict probabilities
probs = logreg.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(Y_train, probs)
print('AUC: %.3f' % auc)
# calculate roc curve
train_fpr, train_tpr, train_thresholds = roc_curve(Y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(train_fpr, train_tpr);

In [None]:
# Plot ROC Curve for test data
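The test-set version mirrors the train cell, with `X_test` and `Y_test` in place of the train data. The sketch below is self-contained on synthetic data, so the model and split are stand-ins rather than the notebook's own, and the `test_` prefixed names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic data and a stand-in model (not the placement data)
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.30, random_state=1)
logreg = LogisticRegression(max_iter=1000).fit(X_train, Y_train)

# Probability of the positive class for the held-out rows
test_probs = logreg.predict_proba(X_test)[:, 1]

# Area under the ROC curve on the test data
test_auc = roc_auc_score(Y_test, test_probs)
print('AUC: %.3f' % test_auc)

# False and true positive rates at each probability threshold
test_fpr, test_tpr, test_thresholds = roc_curve(Y_test, test_probs)
```

The curve itself can then be drawn as in the train cell, with `plt.plot([0, 1], [0, 1], linestyle='--')` for the diagonal followed by `plt.plot(test_fpr, test_tpr)`.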


##### Conclusion:

The logistic regression model reaches an overall accuracy of 86% on the test dataset.

Recall is 0.90 for class 0 and 0.85 for class 1.

Accuracy is 88% on the train dataset and 86% on the test dataset. Because the gap is small, the model shows no clear sign of overfitting.

AUC for the test data is 0.97.

Given these results, the above model can be taken as the final one.

##### Never Stop Learning!!!