# 1. Introduction

MBA is the common abbreviation for the Master of Business Administration degree. An MBA is a common stepping stone to C-suite positions in large businesses and gives aspiring entrepreneurs a springboard to success. This project analyzes how a person’s MBA post-score is influenced by earlier academic performance and demographics.

In [30]:
# Load dplyr and tibble explicitly
library(dplyr)
library(tibble)

# Both are part of the R "tidyverse", which this call loads in full
library(tidyverse)

In [31]:
# read csv file
admissions <- read.csv("../input/others/MBA_ADMISSIONS.csv")
head(admissions)

***1.1 Basic Visualization***

In [32]:
# Bar charts showing the distribution of Age, and of Gender by Marital Status
# table() returns counts, so the y-axis is labelled "Count"
barplot(table(admissions$Age), ylim = c(0,170), main = 'Distribution of Age', ylab = "Count", xlab = "Age in years", cex.lab = 1.2, cex.names = 1, xlim = c(0,10), cex.axis = 1.2, beside = TRUE, col = c(10,11))
barplot(table(admissions$Gender, admissions$Marital_status), ylim = c(0,300), main = 'Distribution of Marital Status and Gender', ylab = "Count", xlab = "Marital status", cex.lab = 1.2, cex.names = 1, xlim = c(0,10), cex.axis = 1.2, legend.text = TRUE, beside = TRUE, col = c(10,11))

> Figure 1: Upper: Distribution of Age; Lower: Distribution of Gender and Marital Status

Figure 1 shows that most MBA learners are around 22 years old and single. Interestingly, while single male learners outnumber single female learners, there are more married female learners than married male learners.

In [33]:
# Pie charts showing the distribution of Specialization and Previous Degrees
pie(table(admissions$Specialization), main = "Distribution of Specialization", col = c(3,4), cex = 1.3, radius = 1.2)
pie(table(admissions$Previous_Degree), main = "Distribution of Previous Degrees", col = c(3,4), cex = 1.3, radius = 1.2)

> Figure 2: Upper: Distribution of Specialization; Lower: Distribution of Previous Degree

Figure 2 shows that roughly half of the MBA learners hold an undergraduate degree in Engineering, followed by Commerce at around 25%. Within the MBA, most learners specialize in Marketing, followed by Finance, LOS, and HR, in that order.

In [34]:
# Box plots showing the distribution of MBA pre-scores
boxplot(admissions$pre_score,
        main = "Pre-scores of MBA Applicants",
        xlab = "MBA Applicants",
        ylab = "Pre-score",
        col = "pink",
        border = "blue",
        notch = TRUE)
# Box plots showing the distribution of MBA post-scores
boxplot(admissions$post_score,
        main = "Post-scores of MBA Applicants",
        xlab = "MBA Applicants",
        ylab = "Post-score",
        col = "pink",
        border = "blue",
        notch = TRUE)

> Figure 3: Upper: Distribution of Pre-score; Lower: Distribution of Post-score

Figure 3 shows the distribution of the learners’ pre-scores and post-scores. The median pre-score is 68 out of 100, and around 75% of learners have a pre-score above 60. The median post-score is around 78 out of 100, and around 75% of learners score above 70.

***1.2 Summary of methods***

In this report, I use clustering, regression, and classification to analyze the MBA_ADMISSIONS.csv data set.

> Clustering: K-means Clustering Method


To perform K-means clustering, I define a target number K, which is the number of centroids I need in the data set. A centroid is the imaginary or real location representing the center of a cluster. Every data point is then allocated to its nearest centroid, and the centroids are updated so that the total within-cluster sum of squares is as small as possible.
-	K = 7
-	Included variables: pre_score, Percentage_in_10_Class, Percentage_in_12_Class, Percentage_in_Under_Graduate, percentage_MBA
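The assign-and-update idea described above can be sketched on toy data. This is only an illustration under stated assumptions: the simulated matrix `toy`, the choice of two clusters, and `nstart = 10` are not part of the original analysis.

```r
# Toy data: two well-separated groups in two dimensions (illustrative only)
set.seed(1)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2)
)

# nstart = 10 reruns the algorithm from 10 random starts and keeps the best fit
toy_kmeans <- kmeans(toy, centers = 2, nstart = 10)

toy_kmeans$centers          # one row per centroid
table(toy_kmeans$cluster)   # number of points assigned to each cluster
toy_kmeans$tot.withinss     # total within-cluster sum of squares being minimized
```

The same three components (`centers`, `cluster`, `tot.withinss`) are available on any `kmeans()` result, including one fitted to the admissions data.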

> Linear Regression: 

I will take the admission data and estimate the effect that the independent variables (potential predictors listed below) have on the response variable post_score by fitting a linear model with R's lm() function.
-	Response variable: post_score
-	Predictors selection: using forward selection, which begins with an empty model and adds in variables one by one. In each forward step, I will add the one variable that gives the single best improvement to my model
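A single forward step can be sketched as follows. The built-in `mtcars` data and the three candidate predictors are stand-ins chosen for illustration, not the admissions variables; `add1()` scores each candidate by AIC and the lowest value wins.

```r
# Built-in mtcars data stands in for the admissions data (assumption)
null_model <- lm(mpg ~ 1, data = mtcars)

# Score each candidate predictor for the first forward step
candidates <- add1(null_model, scope = ~ wt + hp + qsec)
print(candidates)

# The candidate with the lowest AIC gives the best single-variable improvement
is_candidate <- rownames(candidates) != "<none>"
best <- rownames(candidates)[is_candidate][which.min(candidates$AIC[is_candidate])]
best

# Add the winning variable to the model; repeating this is forward selection
step_one <- update(null_model, as.formula(paste(". ~ . +", best)))
```

Forward selection repeats this step until no remaining candidate lowers the AIC.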

> Classification: Logistic Regression

I will split the admission dataset into a training set to train the model on and a testing set to test the model on. Then, I will predict the test data with the fitted model and evaluate its accuracy.

-	Binary response variable: Performance (derived from post_score)

  * Performance = 1 means high MBA performance
  * Performance = 0 means low MBA performance

# 2. Models

***2.1 K-means clustering***

Since MBA_ADMISSIONS.csv is a large data set with many variables, hierarchical clustering, which is generally suited to small data sets, is not ideal. I therefore chose K-means clustering, which is more efficient.

To use K-means, we need to specify K, the number of clusters we want to group our data into. To determine K, I use the elbow curve method.

In [35]:
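# Keep the numeric columns only by dropping columns 8 to 14
admission_num <- admissions[, -c(8:14)]
K_list <- 2:30
# Total within-cluster sum of squares for each candidate K
total_dist <- sapply(K_list, function(k) kmeans(admission_num, k)$tot.withinss)
# Plot against K itself so the elbow can be read directly off the x-axis
plot(K_list, total_dist, type = "b",
     xlab = "Number of clusters K", ylab = "Total within-cluster sum of squares")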

> Figure 4: Elbow Curve for Kmeans

Figure 4 shows that at K = 7, the sum of squared distance begins to flatten out and we can see an inflection point. Therefore, K = 7 is a good choice for the number of clusters.

For the K-means clustering model, I eliminated all non-numerical variables. R's kmeans() function assigns points by squared Euclidean distance and has no Manhattan option, so Euclidean distance is the measure used here. Figure 5 shows the result from K-means clustering: a scatter plot of the clusters in terms of MBA post-score (post_score) and undergraduate performance (Percentage_in_Under_Graduate).

In [36]:
admissions_kmeans <- kmeans(admission_num, 7)
for (i in 1:7){
  if (i == 1){
    plot(admissions[which(admissions_kmeans$cluster == i), c('Percentage_in_Under_Graduate', 'post_score')],
         xlim = c(min(admissions$Percentage_in_Under_Graduate), max(admissions$Percentage_in_Under_Graduate)),
         ylim = c(min(admissions$post_score), max(admissions$post_score)))
  }else{
    points(admissions[which(admissions_kmeans$cluster == i), c('Percentage_in_Under_Graduate', 'post_score')], col = i)
  }
}

> Figure 5: Clusters in terms of MBA Post-score and Undergraduate Performance

***2.2 Linear Regression***

* Variable selection

I used a linear regression model to predict the MBA post-score from the most relevant predictors, chosen by forward selection. With post_score as the response variable, I added one variable at a time and kept the predictors listed below, each of which improved the model.

In [37]:
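library(MASS)
# Intercept-only model: the starting point for forward selection
data_null <- lm(post_score ~ 1, data = admissions)
# Model with every available predictor: the upper bound of the search
data_full <- lm(post_score ~ ., data = admissions)

# Add one variable per step, choosing the one that lowers AIC the most
forward_lm <- stepAIC(data_null, direction = "forward",
                      scope = list(upper = data_full, lower = data_null))
summary(forward_lm)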

Potential predictors are: *perceived.Job.Skill, Percentage_in_Under_Graduate, Percentage_in_10_Class, STATE, Gender*

In [38]:
pairs(admission_num)

> Figure 6: Pairs plot of the numeric variables

Figure 6 shows a pairs plot, which displays the pairwise relationships between the numeric variables in the data set.

In [39]:
# results from linear regression
admissions_lm <- lm(post_score ~ perceived.Job.Skill + Percentage_in_Under_Graduate + Percentage_in_10_Class + STATE + Gender, data = admissions) 
summary(admissions_lm)

* Fitted model for the response variable and the 5 selected predictors

**post_score = 71.72 - 5.84 * perceived.Job.Skillprefered skills - 12.69 * perceived.Job.Skillrequired skills + 0.4 * Percentage_in_Under_Graduate - 0.26 * Percentage_in_10_Class - 3.48 * STATEEast Zone + 18.52 * STATENorth East -  1.35 * STATENorth Zone - 1.11 * STATESouth Zone - 9.11 * STATEWest Zone + 3.53 * GenderMale**

* Interaction between predictors

In [40]:
lm_int <- lm(formula = post_score ~ Percentage_in_Under_Graduate*Gender, data = admissions)
summary(lm_int)

I suspect an interaction between the two predictors Percentage_in_Under_Graduate and Gender because both are statistically significant in the model above. Moreover, I suspect that the relationship between undergraduate performance and post_score may differ by gender.

**post_score = 64.83 + 0.13 * Percentage_in_Under_Graduate - 25.9 * GenderMale + 0.4 * (Percentage_in_Under_Graduate * GenderMale)**

Based on the model, undergraduate performance has a positive association with post-score, and the association is stronger for male learners: the slope is 0.13 for female learners and 0.13 + 0.4 = 0.53 for male learners. The male intercept is 25.9 points lower, so a male learner is predicted to score lower than a female learner at low undergraduate percentages, and higher once the undergraduate percentage exceeds 25.9 / 0.4 = 64.75.
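The reported interaction coefficients can be checked numerically. The sketch below hard-codes the rounded estimates from the equation above; the helper `predict_post()` and the example percentages 50 and 80 are illustrative choices, not part of the original notebook.

```r
# Rounded coefficients copied from the fitted interaction model above
b0     <- 64.83   # intercept (female baseline)
b_ug   <- 0.13    # slope of Percentage_in_Under_Graduate for female learners
b_male <- -25.9   # intercept shift for male learners
b_int  <- 0.4     # extra slope for male learners

# Hypothetical helper: predicted post_score for undergraduate percentage ug
predict_post <- function(ug, male) {
  b0 + b_ug * ug + b_male * male + b_int * ug * male
}

slope_female <- b_ug            # slope for female learners
slope_male   <- b_ug + b_int    # slope for male learners
crossover    <- -b_male / b_int # percentage at which the two lines cross

predict_post(c(50, 80), male = 0)  # female predictions at 50% and 80%
predict_post(c(50, 80), male = 1)  # male predictions at 50% and 80%
```

Comparing the two prediction vectors shows the male line starting lower and rising faster than the female line.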

***2.3 Logistic Regression***

I chose logistic regression for classification because it provides p-values, standard errors, and related statistics, which give insight into which features are important and how they act. For the classification, I created Performance as the binary response variable, determined by:
* If post_score <= 70: Performance = 0, which is low MBA performance
* If post_score > 70: Performance = 1, which is high MBA performance

In [41]:
# Rebuild the data set with a binary response, then split it into training and validation sets
admissions <- read.csv("../input/others/MBA_ADMISSIONS.csv")
options(scipen=200)
admissions <- na.omit(admissions)
# Performance = 1 when post_score is above 70, otherwise 0
admissions <- mutate(admissions, Performance = ifelse(post_score > 70, 1, 0))
# Drop columns 8 to 14 and post_score itself, which defines the response
admissions <- admissions[, -c(8:14)]
admissions <- subset(admissions, select = -post_score)

# Fix the random seed (arbitrary value) so the 80/20 split is reproducible
set.seed(42)
training <- sample(1:nrow(admissions), 0.8*nrow(admissions))
trainingset <- admissions[training,]
validation <- setdiff(1:nrow(admissions), training)
validationset <- admissions[validation,]

# Fit the model using generalized linear model
admission_log <- glm(Performance~., trainingset, family = "binomial")
summary(admission_log)

**Interpretation**

Only 4 out of 6 predictors are significantly associated with the outcome: pre_score, Percentage_in_10_Class, Percentage_in_Under_Graduate, and percentage_MBA.

The coefficient estimate for pre_score is b = 0.050989, which is positive: a higher MBA pre-score is associated with a higher probability of high MBA performance. The coefficient for Percentage_in_10_Class is b = -0.066698, which is negative: holding the other predictors fixed, a higher Grade 10 score is associated with a lower probability of high MBA performance.

Some variables (Age_in_years, Percentage_in_12_Class) are not statistically significant. Keeping them in the model may contribute to overfitting, so they should be eliminated. This can also be done automatically, for example with stepwise selection, to obtain a reduced set of variables without compromising model accuracy.
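Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below uses only the two estimates quoted above; the variable names `b_pre_score` and `b_class_10` are introduced here for illustration.

```r
# Coefficient estimates quoted in the interpretation above (log-odds scale)
b_pre_score <- 0.050989
b_class_10  <- -0.066698

# exp() converts a log-odds coefficient into an odds ratio per one-unit increase
odds_ratios <- exp(c(pre_score = b_pre_score,
                     Percentage_in_10_Class = b_class_10))
round(odds_ratios, 3)

# Percentage change in the odds of high performance per one-point increase
round((odds_ratios - 1) * 100, 1)
```

An odds ratio above 1 means the odds of high performance rise with the predictor, and below 1 means they fall. For a fitted model object, the same conversion is `exp(coef(model))`.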

In [42]:
admission_log <- glm(Performance~pre_score + Percentage_in_10_Class + Percentage_in_Under_Graduate + percentage_MBA, trainingset, family = "binomial")
summary(admission_log)

**Making predictions**

I will make predictions using the validation data set in order to evaluate the performance of my logistic regression model. 
To do this, I will follow the procedure:
*   Predict the probabilities of getting high MBA performance
*   Predict the class of individuals

> Predict the probabilities of getting high MBA performance

In [43]:
# Predict the probabilities of getting high MBA performance
admission_log_pred <- predict(admission_log, validationset, type = "response")
head(admission_log_pred)

The output is the predicted probability that a learner's MBA performance is high. These values are probabilities of "high" rather than "low" because Performance is coded 1 for high and 0 for low, and glm() models the probability of the outcome coded 1.

> Predict the class of individuals:

To do this, I first have to categorize individuals into two groups based on their predicted probabilities of high performance, which requires a cutoff probability p. To find p, I compare the validation accuracy across a range of thresholds.

With the optimal p, I will then predict the class of individuals.

In [44]:
# Threshold list
threshold_list <- seq(0.05, 0.95, 0.1)
# Accuracy on the validation set for each candidate threshold p
Acc <- sapply(threshold_list, function(p) {
  admission_log_performance <- ifelse(admission_log_pred >= p, 1, 0)
  # Element-wise comparison stays correct even if every prediction falls in one class
  mean(admission_log_performance == validationset$Performance)
})
plot(threshold_list, Acc,
     xlab = "Threshold p", ylab = "Accuracy")

> Figure 8: Threshold list

Figure 8 shows the accuracy on the validation set for each threshold. Accuracy peaks at a threshold of about 0.7. Therefore, p = 0.7.

In [45]:
# Predict the class of individuals
predicted.admissions <- ifelse(admission_log_pred > 0.7, "high", "low")
head(predicted.admissions)
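Once classes are predicted, they can be compared with the actual classes in a confusion matrix. The sketch below uses made-up vectors `actual` and `predicted` rather than the validation set, and fixes the factor levels so the table stays 2 x 2 even if one class is never predicted.

```r
# Hypothetical actual and predicted classes (illustrative values only)
actual    <- c(1, 1, 0, 1, 0, 0, 1, 1)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 1)

# Fixing the levels keeps both rows and both columns in the table
conf_mat <- table(
  Actual    = factor(actual, levels = c(0, 1)),
  Predicted = factor(predicted, levels = c(0, 1))
)
conf_mat

accuracy    <- sum(diag(conf_mat)) / sum(conf_mat)        # share of correct predictions
sensitivity <- conf_mat["1", "1"] / sum(conf_mat["1", ])  # share of actual 1s predicted as 1
specificity <- conf_mat["0", "0"] / sum(conf_mat["0", ])  # share of actual 0s predicted as 0
```

To apply this to the notebook's objects, the "high"/"low" labels would first need to be mapped to the 1/0 coding used by Performance.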

# 3. Conclusion


Using K-means clustering, linear regression, and logistic regression, I analyzed how an MBA learner's post-score is influenced by earlier academic performance and demographics. The analysis reveals these key findings:
- Five predictors most strongly influence the MBA post-score: perceived.Job.Skill, Percentage_in_Under_Graduate, Percentage_in_10_Class, STATE, and Gender.
- Undergraduate performance has a positive association with post-score, and the association is stronger for male learners (slope 0.53 versus 0.13 for female learners). The male intercept is 25.9 points lower, so male learners are predicted to score lower than female learners at low undergraduate percentages and higher once the undergraduate percentage exceeds about 65.
- A high MBA pre-score and high grades in undergraduate and MBA studies are associated with an increase in the probability of good MBA performance (a high MBA post-score).