# 1. Introduction

MBA is the common abbreviation for the Master of Business Administration degree. An MBA is a common stepping stone to C-suite positions in large businesses, and it also gives aspiring entrepreneurs a springboard to success. This project analyzes how a person’s MBA post-score is influenced by earlier academic performance and demographics. The potential predictors are gender, marital status, demographics, performance in lower-level study, and undergraduate specialization.

In [11]:
# read csv file
admissions <- read.csv("../input/others/MBA_ADMISSIONS.csv")
head(admissions)


 ***1.1 Basic Visualization***

In [12]:
# Bar charts showing the distribution of Age, and of Gender by Marital Status
# table() returns counts, so the y-axis is labelled "Count"
barplot(table(admissions$Age), ylim = c(0, 170), xlim = c(0, 10),
        main = "Distribution of Age", ylab = "Count", xlab = "Age in years",
        cex.lab = 1.2, cex.names = 1, cex.axis = 1.2, col = c(10, 11))
# Bars are grouped by marital status; the legend distinguishes gender
barplot(table(admissions$Gender, admissions$Marital_status), ylim = c(0, 300), xlim = c(0, 10),
        main = "Distribution of Marital Status and Gender", ylab = "Count", xlab = "Marital status",
        cex.lab = 1.2, cex.names = 1, cex.axis = 1.2,
        legend.text = TRUE, beside = TRUE, col = c(10, 11))

> Figure 1: Upper: Distribution of Age; Lower: Distribution of Gender and Marital Status

Figure 1 shows that most MBA learners are around 22 years old and single. Interestingly, while single male learners outnumber single female learners, there are more married female learners than married male learners.

In [13]:
# Pie charts showing the distribution of Specialization and Previous Degree
pie(table(admissions$Specialization), main = "Distribution of Specialization", col = c(3,4), cex = 1.3, radius = 1.2)
pie(table(admissions$Previous_Degree), main = "Distribution of Previous Degrees", col = c(3,4), cex = 1.3, radius = 1.2)

> Figure 2: Upper: Distribution of Specialization; Lower: Distribution of Previous Degree

Figure 2 shows that roughly half of the MBA learners hold an undergraduate Engineering degree, followed by Commerce at around 25%. Within the MBA, most learners specialize in Marketing, followed by Finance, LOS, and HR, in that order.

In [14]:
# Box plot showing the distribution of MBA pre-scores
boxplot(admissions$pre_score,
        main = "Pre-scores of MBA Applicants",
        xlab = "MBA Applicants",
        ylab = "Pre-score",
        col = "pink",
        border = "blue",
        notch = TRUE)
#Box plots showing the distribution of MBA post-scores
boxplot(admissions$post_score,
        main = "Post-scores of MBA Applicants",
        xlab = "MBA Applicants",
        ylab = "Post-score",
        col = "pink",
        border = "blue",
        notch = TRUE)

> Figure 3: Upper: Distribution of Pre-score; Lower: Distribution of Post-score

Figure 3 shows the distribution of the learners’ pre-scores and post-scores. The median pre-score is 68 out of 100, and around 75% of learners have a pre-score above 60. The median post-score is around 78 out of 100, and around 75% of learners score above 70.
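
The values read off a box plot can also be checked numerically. The following is a minimal sketch that uses a small made-up vector, `toy_scores`, instead of the real `admissions$pre_score` column; the numbers are invented for illustration only.

```r
# Toy scores, invented for illustration only (not taken from the data set)
toy_scores <- c(55, 60, 62, 66, 68, 70, 74, 79, 85)

# Median: the thick line inside the box
toy_median <- median(toy_scores)

# First quartile: roughly 75% of the values lie above it
toy_q1 <- unname(quantile(toy_scores, 0.25))

# Share of values above a threshold of 60
share_above_60 <- mean(toy_scores > 60)

print(c(median = toy_median, q1 = toy_q1, share_above_60 = share_above_60))
```

Replacing `toy_scores` with `admissions$pre_score` or `admissions$post_score` should give the corresponding figures for the real data. Note that `boxplot()` draws the box at the hinges, which can differ slightly from the default `quantile()` values.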

***1.2 Summary of methods***

In this report, I use clustering, regression, and classification to analyze the MBA_ADMISSIONS.csv data set.

> Clustering: K-means Clustering Method


To perform K-means clustering, I define a target number k, which is the number of centroids I need in the dataset. A centroid is the imaginary or real location representing the center of a cluster. Every data point is then allocated to its nearest centroid, while keeping the within-cluster sum of squared distances as small as possible.
-	K = 7
-	Included variables: pre_score, Percentage_in_10_Class, Percentage_in_12_Class, Percentage_in_Under_Graduate, percentage_MBA
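
For reference, the quantity that K-means tries to make small can be written as follows. The notation is introduced here for illustration and is not part of the original analysis: $C_1, \dots, C_k$ are the clusters and $\mu_j$ is the centroid (mean) of cluster $C_j$. In R, `kmeans()` reports the minimized value as `tot.withinss`.

$$
\min_{C_1,\dots,C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
$$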

> Linear Regression: 

I will take the admissions data and estimate the effect that the independent variables (the potential predictors listed below) have on the response variable post_score by fitting a linear model with lm().
-	Response variable: post_score
-	Predictors selection: using forward selection, which begins with an empty model and adds in variables one by one. In each forward step, I will add the one variable that gives the single best improvement to my model
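
A single forward step can be sketched with `add1()` from base R. The example below uses the built-in `mtcars` data and an arbitrary candidate set (`wt`, `hp`, `qsec`) purely for illustration; it is not the MBA data. The AIC criterion shown is the one `stepAIC()` uses.

```r
# Illustration on the built-in mtcars data, not the admissions data
null_model <- lm(mpg ~ 1, data = mtcars)

# Score each candidate predictor as a one-variable addition to the empty model
candidates <- as.data.frame(add1(null_model, scope = ~ wt + hp + qsec))
print(candidates)

# Drop the "<none>" row (the current model), then pick the lowest AIC
candidates <- candidates[rownames(candidates) != "<none>", ]
best <- rownames(candidates)[which.min(candidates$AIC)]

# Add the winner to the model; a full forward search repeats this step
# until no candidate improves on the current model
next_model <- update(null_model, as.formula(paste(". ~ . +", best)))
print(formula(next_model))
```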

> Classification: Logistic Regression

I will split the admissions dataset into a training set, on which the model is fit, and a testing set, on which it is evaluated. I will then predict the test data with the fitted model and evaluate its accuracy.

-	Binary response variable: Performance (converted by post_score) 

  * Performance = 0 means this is a high MBA percentage
  * Performance = 1 means this is a low MBA percentage
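
The split, fit, predict, and evaluate steps can be sketched as follows. Everything in this example is an assumption made for illustration: the data frame `toy` is simulated, the single predictor `ug` stands in for the real predictors, and the cutoff used to convert `post_score` into `Performance` (the median) is hypothetical, since the actual threshold is not stated here. The 0/1 coding follows the definition above.

```r
set.seed(1)  # make the simulated data and the split reproducible

# Simulated stand-in for the admissions data (not the real data set)
n <- 200
toy <- data.frame(ug = runif(n, 50, 90))
toy$post_score <- 40 + 0.5 * toy$ug + rnorm(n, sd = 5)

# Hypothetical cutoff: 0 = high post-score, 1 = low post-score
cutoff <- median(toy$post_score)
toy$Performance <- ifelse(toy$post_score >= cutoff, 0, 1)

# 70/30 train/test split
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Logistic regression fitted on the training set only
fit <- glm(Performance ~ ug, data = train, family = binomial)

# Predicted probability of Performance = 1, turned into a class at 0.5
prob <- predict(fit, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)

# Accuracy: share of test rows classified correctly
accuracy <- mean(pred == test$Performance)
print(accuracy)
```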

# 2. Models

***2.1 K-means clustering***

Since MBA_ADMISSIONS.csv is a large data set with many variables, hierarchical clustering, which is generally better suited to small data sets, is not ideal. I therefore chose K-means clustering, which is more efficient.

To use K-means, we need to specify K, the number of clusters we want to group the data into. To determine K, I use the elbow curve method.

In [15]:
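# Drop columns 8 to 14 to keep the numeric variables used for clustering
admission_num <- admissions[, -c(8:14)]
K_list <- 2:30
total_dist <- numeric(0)
for (k in K_list){
  kmeans_num_choose <- kmeans(admission_num, k)
  # tot.withinss is the total within-cluster sum of squares for this k
  total_dist <- c(total_dist, kmeans_num_choose$tot.withinss)
}
# Plot against K itself so the elbow can be read directly off the x-axis
plot(K_list, total_dist, type = "b", xlab = "K", ylab = "Total within-cluster sum of squares")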

> Figure 4: Elbow Curve for Kmeans

Figure 4 shows that at K = 7 the total within-cluster sum of squares begins to flatten out, forming an elbow in the curve. Therefore, K = 7 is a good choice for the number of clusters.

For the K-means clustering model, I eliminated all non-numerical variables. R's kmeans() minimizes the within-cluster sum of squares, so distances are measured with squared Euclidean distance rather than Manhattan distance. Figure 5 shows the result of the clustering as a scatter plot of the clusters against MBA post-score (denoted: post_score) and undergraduate performance (denoted: Percentage_in_Under_Graduate).

In [16]:
admissions_kmeans <- kmeans(admission_num, 7)
# Scatter plot of undergraduate percentage against post-score, coloured by cluster
plot(admissions$Percentage_in_Under_Graduate, admissions$post_score,
     col = admissions_kmeans$cluster,
     xlab = "Percentage_in_Under_Graduate", ylab = "post_score")

> Figure 5: Clusters in terms of MBA Post-score and Undergraduate Performance

***2.2 Linear Regression***

* Variable selection

I used a linear regression model to predict the MBA post-score from the most relevant predictors, chosen by forward selection. With post_score as the response variable, forward selection picked 9 potential predictors that improved the model as variables were added one at a time.

In [17]:
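library(MASS)  # provides stepAIC()
# Intercept-only model: the starting point for forward selection
data_null <- lm(post_score ~ 1, data = admissions)
# Model with every available predictor: the upper bound of the search
data_full <- lm(post_score ~ ., data = admissions)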

forward_lm <- stepAIC(data_null, direction = "forward",
                      scope = list(upper = data_full, lower = data_null))
summary(forward_lm)

Potential predictors are: *perceived.Job.Skill, Percentage_in_Under_Graduate, Percentage_in_10_Class, STATE, Gender*

In [18]:
pairs(admission_num)

> Figure 6: Pairs plot of the numeric variables

Figure 6 is a pairs plot, which shows the pairwise relationships between the numeric variables in the dataset.

In [25]:
# results from linear regression
admissions_lm <- lm(post_score ~ perceived.Job.Skill + Percentage_in_Under_Graduate + Percentage_in_10_Class + STATE + Gender, data = admissions) 
summary(admissions_lm)

* Full model based on the response variable and the 5 selected predictors

**post_score = 71.72 - 5.84 * perceived.Job.Skillprefered skills - 12.69 * perceived.Job.Skillrequired skills + 0.4 * Percentage_in_Under_Graduate - 0.26 * Percentage_in_10_Class - 3.48 * STATEEast Zone + 18.52 * STATENorth East -  1.35 * STATENorth Zone - 1.11 * STATESouth Zone - 9.11 * STATEWest Zone + 3.53 * GenderMale**
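
To make the fitted equation concrete, the sketch below plugs in a hypothetical applicant. The applicant's values (70% undergraduate, 80% in class 10) are invented, and the applicant is assumed to sit at the reference levels of perceived.Job.Skill and STATE, so those dummy terms contribute nothing. The coefficients are the rounded values from the equation above.

```r
# Coefficients copied from the fitted equation in the text (rounded)
intercept <- 71.72
b_ug      <- 0.4    # Percentage_in_Under_Graduate
b_class10 <- -0.26  # Percentage_in_10_Class
b_male    <- 3.53   # GenderMale

# Hypothetical applicant, at the reference levels of
# perceived.Job.Skill and STATE (all of their dummy terms are 0)
ug      <- 70
class10 <- 80

# Prediction for the reference gender level, then for GenderMale
pred_reference <- intercept + b_ug * ug + b_class10 * class10
pred_male      <- pred_reference + b_male

print(c(reference = pred_reference, male = pred_male))
```

Under these assumptions the predicted post-score is 78.92 for the reference gender level and 82.45 for a male applicant, so the GenderMale coefficient shifts the prediction by 3.53 points with everything else held fixed.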

* Interaction between predictors

In [None]:
lm_int <- lm(formula = post_score ~ Percentage_in_Under_Graduate*Percentage_in_10_Class, data = admissions)
summary(lm_int)

I suspect an interaction between the two predictors Percentage_in_Under_Graduate and Percentage_in_10_Class, since both are statistically significant predictors of the outcome (post_score). Moreover, a person’s performance in grade 10 may influence their undergraduate performance.

**post_score = 123.65 - 0.9 * Percentage_in_10_Class - 0.36 * Percentage_in_Under_Graduate + 0.008 * (Percentage_in_10_Class * Percentage_in_Under_Graduate)**
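
One way to read the interaction term is through the slope of each predictor. Writing $T$ for Percentage_in_10_Class and $U$ for Percentage_in_Under_Graduate (shorthand introduced here), and taking the rounded coefficients above at face value, the effect of one extra undergraduate percentage point depends on $T$:

$$
\widehat{\text{post\_score}} = 123.65 - 0.9\,T - 0.36\,U + 0.008\,T\,U
\quad\Longrightarrow\quad
\frac{\partial\, \widehat{\text{post\_score}}}{\partial U} = -0.36 + 0.008\,T
$$

For example, at $T = 80$ the slope is $-0.36 + 0.008 \times 80 = 0.28$, and it changes sign at $T = 0.36 / 0.008 = 45$. This is only a sketch based on rounded coefficients, so the exact crossover point should not be over-interpreted.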