# MBA Specialization Classification

![](mba1.png)

### Importing required libraries and dataset

In [1]:
# Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
plt.style.use('ggplot')

In [2]:
data = pd.read_csv('../DataSets/MBA_ADMISSIONS.csv')
data.head()

| | pre_score | Age_in_years | Percentage_in_10_Class | Percentage_in_12_Class | Percentage_in_Under_Graduate | percentage_MBA | post_score | Gender | STATE | Previous_Degree | Marital_status | Place_you_belong_to | perceived#Job#Skill | Specialization |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 22 | 71.0 | 74.8 | 72.0 | 61.0 | 83.333333 | Male | Central Zone | Engineering | Single | Urban | prefered skills | Marketing |
| 1 | 71.666667 | 25 | 77.6 | 82.6 | 76.9 | 66.85 | 76.666667 | Male | Central Zone | Engineering | Single | Semi Urban | prefered skills | LOS |
| 2 | 76.666667 | 26 | 93.2 | 83.8 | 77.0 | 74.97 | 75.0 | Female | Central Zone | Engineering | Single | Urban | desired skills | Finance |
| 3 | 66.666667 | 22 | 91.2 | 80.0 | 67.0 | 68.3 | 60.0 | Male | Central Zone | Commerce | Single | Semi Urban | prefered skills | Finance |
| 4 | 71.666667 | 24 | 79.8 | 61.6 | 60.33 | 69.28 | 76.666667 | Female | Central Zone | Engineering | Single | Urban | prefered skills | Finance |


**Data Pre-processing**

In [3]:
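# Cast the numeric columns to int (this truncates decimals); errors='ignore' leaves the string columns unchanged
df = data.astype(int, errors='ignore')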

In [4]:
df.head()

| | pre_score | Age_in_years | Percentage_in_10_Class | Percentage_in_12_Class | Percentage_in_Under_Graduate | percentage_MBA | post_score | Gender | STATE | Previous_Degree | Marital_status | Place_you_belong_to | perceived#Job#Skill | Specialization |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75 | 22 | 71 | 74 | 72 | 61 | 83 | Male | Central Zone | Engineering | Single | Urban | prefered skills | Marketing |
| 1 | 71 | 25 | 77 | 82 | 76 | 66 | 76 | Male | Central Zone | Engineering | Single | Semi Urban | prefered skills | LOS |
| 2 | 76 | 26 | 93 | 83 | 77 | 74 | 75 | Female | Central Zone | Engineering | Single | Urban | desired skills | Finance |
| 3 | 66 | 22 | 91 | 80 | 67 | 68 | 60 | Male | Central Zone | Commerce | Single | Semi Urban | prefered skills | Finance |
| 4 | 71 | 24 | 79 | 61 | 60 | 69 | 76 | Female | Central Zone | Engineering | Single | Urban | prefered skills | Finance |
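
Casting floats with `astype(int)` truncates the decimal part rather than rounding it, which is why 74.8 becomes 74 in the output above. The sketch below shows the difference on a few values copied from the first rows of the dataset; the `sample` frame and the `truncated` / `rounded` names are only for illustration and are not part of the original notebook.

```python
import pandas as pd

# Illustrative sample: values copied from the first three rows shown above
sample = pd.DataFrame({
    "Percentage_in_12_Class": [74.8, 82.6, 83.8],
    "percentage_MBA": [61.0, 66.85, 74.97],
})

# astype(int) drops the fractional part (truncation toward zero)
truncated = sample.astype(int)

# Rounding first keeps each value at its nearest integer
rounded = sample.round().astype(int)

print(truncated)
print(rounded)
```

Whether truncation or rounding is preferable depends on the analysis; the notebook uses truncation.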


**Finding out the correlation between the attributes of the dataset**

In [5]:
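# Correlate only the numeric columns; newer pandas versions raise an error if string columns reach corr()
df.select_dtypes(include='number').corr()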

| | pre_score | Age_in_years | Percentage_in_10_Class | Percentage_in_12_Class | Percentage_in_Under_Graduate | percentage_MBA | post_score |
|---|---|---|---|---|---|---|---|
| pre_score | 1.0 | 0.250699 | -0.140844 | -0.191213 | 0.00959 | 0.053699 | 0.265957 |
| Age_in_years | 0.250699 | 1.0 | 0.038332 | -0.207313 | -0.141367 | 0.296556 | -0.036615 |
| Percentage_in_10_Class | -0.140844 | 0.038332 | 1.0 | 0.441987 | 0.410374 | 0.490386 | -0.075099 |
| Percentage_in_12_Class | -0.191213 | -0.207313 | 0.441987 | 1.0 | 0.418453 | 0.286299 | 0.052577 |
| Percentage_in_Under_Graduate | 0.00959 | -0.141367 | 0.410374 | 0.418453 | 1.0 | 0.36345 | 0.161584 |
| percentage_MBA | 0.053699 | 0.296556 | 0.490386 | 0.286299 | 0.36345 | 1.0 | 0.067904 |
| post_score | 0.265957 | -0.036615 | -0.075099 | 0.052577 | 0.161584 | 0.067904 | 1.0 |
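
To read the matrix with the target in mind, it can help to rank the other attributes by the strength of their correlation with `percentage_MBA`. The sketch below does this with the values copied from the `percentage_MBA` row of the output above; the `corr_with_mba` and `ranked` names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Correlations with percentage_MBA, copied from the matrix above
corr_with_mba = pd.Series({
    "pre_score": 0.053699,
    "Age_in_years": 0.296556,
    "Percentage_in_10_Class": 0.490386,
    "Percentage_in_12_Class": 0.286299,
    "Percentage_in_Under_Graduate": 0.36345,
    "post_score": 0.067904,
})

# Rank by absolute value so strong negative correlations would count too
ranked = corr_with_mba.abs().sort_values(ascending=False)
print(ranked)
```

With the full dataframe loaded, the same ranking could be obtained from the `percentage_MBA` column of the correlation matrix.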


**Creating the dataset for training the model**

In [6]:
x = df[['Percentage_in_10_Class','Percentage_in_Under_Graduate','Percentage_in_12_Class']]

In [7]:
y = df['percentage_MBA']

<a id="train-test-split"></a>
**Splitting the dataset into training and testing sets using `train_test_split`**
  
  * Import the function from `sklearn.model_selection`
  * Split the dataset in an 80:20 ratio
  * `x_train1` and `y_train1` form the training dataset
  * `x_test1` and `y_test1` form the testing dataset
  * Once the dataset is split, the models are ready to be trained!

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
x_train1, x_test1, y_train1, y_test1 = train_test_split(x,y, test_size = 0.2)

In [10]:
from sklearn import preprocessing
from sklearn import utils

In [11]:
x_train1.shape

(378, 3)
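
The training shape also lets us work out how many rows end up in the test set. The sketch below assumes scikit-learn's documented behaviour for a fractional `test_size`, where the number of test rows is rounded up and the training set receives the remainder; the variable names are illustrative, and the resulting totals are inferred, not read from the dataset.

```python
from math import ceil

train_rows = 378   # from x_train1.shape above
test_size = 0.2    # from the train_test_split call

# Assumption: test rows = ceil(test_size * n); train rows = n - test rows.
# Search for the total row counts that would yield exactly 378 training rows.
candidates = [
    n for n in range(train_rows, 2 * train_rows)
    if n - ceil(test_size * n) == train_rows
]

total_rows = candidates[0]
test_rows = total_rows - train_rows
print(candidates, test_rows)
```

Under that assumption each accuracy score reported below corresponds to a whole number of correctly predicted test rows.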

## Classification Algorithms
Classification is a task that requires the use of machine learning algorithms that learn how to assign a class label to examples from the problem domain. An easy to understand example is classifying emails as “spam” or “not spam.”

There are many different types of classification tasks that you may encounter in machine learning and specialized approaches to modeling that may be used for each.

Here I am going to use 10 classification algorithms. Each model will be trained on the training set and then evaluated using its accuracy score on the test set.

**We are going to use the following models:**
  * **Logistic Regression** : Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) is estimating the parameters of a logistic model (a form of binary regression).
  
  
  * **Decision Tree Classifier** : Decision Tree is a Supervised learning technique that can be used for both classification and Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome.
  
  
  * **Random Forest Classifier** : Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It can be used for both Classification and Regression problems in ML. It is based on the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the performance of the model.
  
  
  * **Gaussian NB** : Gaussian Naive Bayes is a probabilistic classifier based on Bayes' theorem. It assumes that the features are conditionally independent given the class and that each feature follows a normal (Gaussian) distribution within each class. scikit-learn's `GaussianNB` also provides a `partial_fit` method for out-of-core or online learning, which is useful when the whole dataset is too big to fit in memory at once; it is better to call `partial_fit` on chunks of data that are as large as the memory budget allows, to hide its overhead.
  
  
  * **KNN algorithm** : K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on the Supervised Learning technique. The K-NN algorithm assumes similarity between the new case and the available cases and puts the new case into the category it is most similar to. K-NN stores all the available data and classifies a new data point based on similarity, so when new data appears it can be easily assigned to a well-suited category.
  
  
  * **Support Vector Machine Algorithm** : Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is used for Classification as well as Regression problems. However, primarily, it is used for Classification problems in Machine Learning. The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put the new data point in the correct category in the future. This best decision boundary is called a hyperplane.
  

  * **Stochastic Gradient Descent Classifier** : Stochastic Gradient Descent (SGD) is a simple yet efficient optimization algorithm used to find the values of parameters/coefficients of functions that minimize a cost function. In other words, it is used for discriminative learning of linear classifiers under convex loss functions such as SVM and Logistic regression.


  * **Linear Discriminant Analysis (LDA)** : Linear Discriminant Analysis (LDA) is best known as a dimensionality reduction technique. As the name implies, dimensionality reduction techniques reduce the number of dimensions (i.e. variables) in a dataset while retaining as much information as possible. LDA can also be used directly as a classifier, which is how it is used here.


  * **Gradient Boosting** : Gradient boosting is a machine learning technique for regression, classification and other tasks, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.


  * **MLP Classifier** : Multi-layer perceptrons (MLP) make powerful classifiers that may provide superior performance compared with other classifiers, but are often criticized for the number of free parameters. Parameter selection for optimal performance is performed using measures that correlate well with generalisation error.


 
 
We are going to train these ten algorithms and pick the best-fitting one based on the models' scores. Now let's check out the algorithms.
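
The train-and-compare workflow described above can also be written as a single loop. The following is a minimal sketch, not the notebook's own code: it runs on synthetic stand-in data from `make_classification` (an assumption, so the scores it prints say nothing about the MBA dataset), and the `models` dictionary and its labels are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption): 3 features, 3 classes
X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, n_classes=3,
                           n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# The ten classifiers listed above, mostly with default settings
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gaussian NB": GaussianNB(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(),
    "SGD": SGDClassifier(random_state=0),
    "LDA": LinearDiscriminantAnalysis(),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "MLP": MLPClassifier(random_state=1, max_iter=1000),
}

# Fit every model on the training split and record its test accuracy
scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in models.items()}

# Print the models from best to worst accuracy
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<20} {score:.3f}")
```

Fixing `random_state` in the split and in the models makes the comparison repeatable from run to run.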

In [12]:
# Classifier imports (pandas, numpy and train_test_split are already imported above)
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

### Logistic Regression 

Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) is estimating the parameters of a logistic model (a form of binary regression).

In [13]:
from sklearn.linear_model import LogisticRegression
logReg = LogisticRegression(max_iter = 1000)
logReg.fit(x_train1, y_train1)

LogisticRegression(max_iter=1000)

In [14]:
logReg.score(x_test1, y_test1)

0.17894736842105263
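
To make the "logistic function" in the description concrete, here is a small sketch of the binary case, where a linear score `z` is mapped to a probability between 0 and 1. The `logistic` helper and the sample scores are illustrative only; they are not part of the fitted model above.

```python
import numpy as np

def logistic(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative linear scores: strongly negative, neutral, strongly positive
z = np.array([-4.0, 0.0, 4.0])
probs = logistic(z)
print(probs)
```

A score of zero maps to a probability of 0.5, and scores of equal size and opposite sign map to probabilities that sum to 1.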

### Decision Tree Classifier Algorithm

Decision Tree is a Supervised learning technique that can be used for both classification and Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome.

In [15]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(x_train1, y_train1)
dtc.score(x_test1, y_test1)

0.968421052631579

In [16]:
y_pred = np.round(dtc.predict(x_test1), decimals=2)
pd.DataFrame({'Actual Marks': y_test1, 'Predicted Marks': y_pred})

| | Actual Marks | Predicted Marks |
|---|---|---|
| 445 | 61 | 61 |
| 253 | 66 | 66 |
| 73 | 68 | 68 |
| 406 | 69 | 69 |
| 46 | 68 | 68 |
| ... | ... | ... |
| 129 | 71 | 71 |
| 231 | 71 | 73 |
| 119 | 70 | 70 |
| 399 | 77 | 77 |


### Random Forest Classifier Algorithm

Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It can be used for both Classification and Regression problems in ML. It is based on the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the performance of the model.

In [17]:
rfc = RandomForestClassifier()
rfc.fit(x_train1, y_train1)
rfc.score(x_test1, y_test1)

0.968421052631579

### K-Nearest Neighbours Algorithm

K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on the Supervised Learning technique. The K-NN algorithm assumes similarity between the new case and the available cases and puts the new case into the category it is most similar to. K-NN stores all the available data and classifies a new data point based on similarity, so when new data appears it can be easily assigned to a well-suited category.

In [18]:
from sklearn.neighbors import KNeighborsClassifier  
classifier= KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2 ) 

In [19]:
classifier.fit(x_train1, y_train1)  

KNeighborsClassifier()

In [20]:
classifier.score(x_test1, y_test1)

0.5368421052631579
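
The neighbour vote described above can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: `knn_predict` is a hypothetical helper, the toy rows are made up, and distances are Euclidean (the Minkowski metric with `p=2` used above).

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Predict the label of x_new by majority vote among its k nearest rows."""
    # Euclidean distance from x_new to every training row
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training rows
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Made-up toy data: three percentage features per row, one integer label each
X_train = np.array([[70, 72, 74], [71, 70, 75], [90, 85, 88],
                    [92, 86, 90], [69, 73, 72], [91, 84, 89]], dtype=float)
y_train = np.array([61, 61, 77, 77, 61, 77])

prediction = knn_predict(X_train, y_train, np.array([89.0, 85.0, 87.0]), k=3)
print(prediction)
```

Here the three closest rows all carry the label 77, so the vote returns 77.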

### Gaussian NB Algorithm

Gaussian Naive Bayes is a probabilistic classifier based on Bayes' theorem. It assumes that the features are conditionally independent given the class and that each feature follows a normal (Gaussian) distribution within each class. scikit-learn's `GaussianNB` also provides a `partial_fit` method for out-of-core or online learning, which is useful when the whole dataset is too big to fit in memory at once; it is better to call `partial_fit` on chunks of data that are as large as the memory budget allows, to hide its overhead.

In [21]:
clf = GaussianNB()
clf.fit(x_train1, y_train1)

GaussianNB()

In [22]:
clf.score(x_test1, y_test1)

0.22105263157894736

### Support Vector Machine Algorithm

Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is used for Classification as well as Regression problems. However, primarily, it is used for Classification problems in Machine Learning. The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put the new data point in the correct category in the future. This best decision boundary is called a hyperplane.

In [23]:
svm = SVC()
svm.fit(x_train1, y_train1)

SVC()

In [24]:
svm.score(x_test1, y_test1)

0.3473684210526316

### Gradient Boosting
Gradient boosting is a machine learning technique for regression, classification and other tasks, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.

In [25]:
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(n_estimators=300, learning_rate=1.0,max_depth=1, random_state=0)

In [26]:
clf.fit(x_train1, y_train1)

GradientBoostingClassifier(learning_rate=1.0, max_depth=1, n_estimators=300,
                           random_state=0)

In [27]:
clf.score(x_test1, y_test1)

0.37894736842105264

### MLP Classifier
Multi-layer perceptrons (MLP) make powerful classifiers that may provide superior performance compared with other classifiers, but are often criticized for the number of free parameters. Parameter selection for optimal performance is performed using measures that correlate well with generalisation error.

In [28]:
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(random_state=1, max_iter=1000)

In [29]:
clf.fit(x_train1, y_train1)

MLPClassifier(max_iter=1000, random_state=1)

In [30]:
clf.score(x_test1, y_test1)

0.2

### Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent (SGD) is a simple yet efficient optimization algorithm used to find the values of parameters/coefficients of functions that minimize a cost function. In other words, it is used for discriminative learning of linear classifiers under convex loss functions such as SVM and Logistic regression.

In [31]:
from sklearn.linear_model import SGDClassifier
sdg = SGDClassifier()

In [32]:
sdg.fit(x_train1, y_train1)

SGDClassifier()

In [33]:
sdg.score(x_test1, y_test1)

0.08421052631578947

### Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA) is best known as a dimensionality reduction technique. As the name implies, dimensionality reduction techniques reduce the number of dimensions (i.e. variables) in a dataset while retaining as much information as possible. LDA can also be used directly as a classifier, which is how it is used here.

In [34]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()

In [35]:
lda.fit(x_train1, y_train1)

LinearDiscriminantAnalysis()

In [36]:
lda.score(x_test1, y_test1)

0.16842105263157894

-----------------------------------

### Conclusion
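The leading models are Random Forest and Decision Tree, each with a test accuracy of about 0.97. The next best model, K-Nearest Neighbours, reaches about 0.54, and all remaining models score below 0.40.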
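
For a side-by-side view, the accuracy scores reported in the cells above can be collected and sorted. The sketch below copies those values into a pandas Series; the labels and the `scores` / `ranking` names are illustrative.

```python
import pandas as pd

# Test accuracies copied from the outputs above
scores = pd.Series({
    "Logistic Regression": 0.17894736842105263,
    "Decision Tree": 0.968421052631579,
    "Random Forest": 0.968421052631579,
    "KNN": 0.5368421052631579,
    "Gaussian NB": 0.22105263157894736,
    "SVM": 0.3473684210526316,
    "Gradient Boosting": 0.37894736842105264,
    "MLP": 0.2,
    "SGD": 0.08421052631578947,
    "LDA": 0.16842105263157894,
})

# Sort from best to worst accuracy
ranking = scores.sort_values(ascending=False)
print(ranking.round(3))
```

Because the split in this notebook was made without a fixed `random_state`, these numbers may differ from run to run.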
