# Exploratory Data Analysis

In [64]:
import pandas as pd
import numpy as np

Let's load the data and look at the first few rows.

In [65]:
df = pd.read_csv('MBA.csv')
df.head()

Unnamed: 0,application_id,gender,international,gpa,major,race,gmat,work_exp,work_industry,admission
0,1,Female,False,3.3,Business,Asian,620.0,3.0,Financial Services,Admit
1,2,Male,False,3.28,Humanities,Black,680.0,5.0,Investment Management,
2,3,Female,True,3.3,Business,,710.0,5.0,Technology,Admit
3,4,Male,False,3.47,STEM,Black,690.0,6.0,Technology,
4,5,Male,False,3.35,STEM,Hispanic,590.0,5.0,Consulting,


In [66]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6194 entries, 0 to 6193
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   application_id  6194 non-null   int64  
 1   gender          6194 non-null   object 
 2   international   6194 non-null   bool   
 3   gpa             6194 non-null   float64
 4   major           6194 non-null   object 
 5   race            4352 non-null   object 
 6   gmat            6194 non-null   float64
 7   work_exp        6194 non-null   float64
 8   work_industry   6194 non-null   object 
 9   admission       1000 non-null   object 
dtypes: bool(1), float64(3), int64(1), object(5)
memory usage: 441.7+ KB


Next, let's list the unique values of each categorical column.

In [67]:
unique_values_for_col = {}
for col in ['gender', 'major', 'race', 'work_industry']:
    unique_values_for_col.update({col: df[col].unique()})
unique_values_for_col

{'gender': array(['Female', 'Male'], dtype=object),
 'major': array(['Business', 'Humanities', 'STEM'], dtype=object),
 'race': array(['Asian', 'Black', nan, 'Hispanic', 'White', 'Other'], dtype=object),
 'work_industry': array(['Financial Services', 'Investment Management', 'Technology',
        'Consulting', 'Nonprofit/Gov', 'PE/VC', 'Health Care',
        'Investment Banking', 'Other', 'Retail', 'Energy', 'CPG',
        'Real Estate', 'Media/Entertainment'], dtype=object)}

Note that the dataset is already imbalanced across some of the race categories.

Since about 30% of the 'race' values are missing (1,842 of 6,194), let's impute them by sampling from the observed distribution.

In [68]:
df['race'].value_counts()

race
White       1456
Asian       1147
Black        916
Hispanic     596
Other        237
Name: count, dtype: int64

In [69]:
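# Observed category frequencies (NaN is excluded by default)
probability_race = df['race'].value_counts(normalize=True)
missing_mask = df['race'].isnull()
# Fill each missing row with a category drawn in proportion to those frequencies
# (no seed is set, so the imputed values change between runs)
df.loc[missing_mask, 'race'] = np.random.choice(
    probability_race.index,
    size=missing_mask.sum(),
    p=probability_race.values
)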
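
The proportional imputation above can be made reproducible with a seeded generator. The following is a minimal sketch on a toy column, not the notebook's actual data: the values in `race`, the seed, and the variable names are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy column standing in for df['race']; the values are illustrative only.
race = pd.Series(["White", "Asian", None, "White", "Black", None, "White", None])

rng = np.random.default_rng(0)  # fixed seed so the imputed values are reproducible

# Observed category frequencies (missing values are excluded by default).
observed = race.value_counts(normalize=True)
missing_mask = race.isna()

# Draw one category per missing row, weighted by the observed frequencies.
imputed = race.copy()
imputed[missing_mask] = rng.choice(
    observed.index.to_numpy(),
    size=int(missing_mask.sum()),
    p=observed.to_numpy(),
)

print(imputed.value_counts())
```

Sampling this way keeps the category proportions roughly unchanged, but the imputed labels are random guesses, so any relationship between 'race' and the other columns is not preserved for those rows.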

The dataset description says that NA values in 'admission' mean the application was denied. Let's impute 'Deny' for the missing values so that we can use supervised machine learning techniques to predict whether a candidate's admission status will be 'Admit'.

In [70]:
df['admission'].isna().sum()

np.int64(5194)

In [71]:
df['admission'] = df['admission'].fillna('Deny')

In [72]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6194 entries, 0 to 6193
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   application_id  6194 non-null   int64  
 1   gender          6194 non-null   object 
 2   international   6194 non-null   bool   
 3   gpa             6194 non-null   float64
 4   major           6194 non-null   object 
 5   race            6194 non-null   object 
 6   gmat            6194 non-null   float64
 7   work_exp        6194 non-null   float64
 8   work_industry   6194 non-null   object 
 9   admission       6194 non-null   object 
dtypes: bool(1), float64(3), int64(1), object(5)
memory usage: 441.7+ KB


In [74]:
df['race'].value_counts(normalize=True)

race
White       0.335486
Asian       0.263804
Black       0.209881
Hispanic    0.134485
Other       0.056345
Name: proportion, dtype: float64

In [75]:
df['admission'].value_counts(normalize=True)

admission
Deny        0.838553
Admit       0.145302
Waitlist    0.016145
Name: proportion, dtype: float64

Now it's time to remove redundant columns from our dataset.

Note that 'gpa' and 'gmat' are positively correlated.

In [76]:
string_columns = ['gender', 'major', 'race', 'work_industry', 'admission']
df_numerical = df.drop(columns=string_columns)
df_numerical.corr()

Unnamed: 0,application_id,international,gpa,gmat,work_exp
application_id,1.0,0.008045,0.013872,0.004694,0.0031
international,0.008045,1.0,-0.02854,-0.014784,-0.010341
gpa,0.013872,-0.02854,1.0,0.577539,0.000346
gmat,0.004694,-0.014784,0.577539,1.0,-0.000999
work_exp,0.0031,-0.010341,0.000346,-0.000999,1.0


Null Hypothesis (H₀): There is no linear relationship between GPA and GMAT scores. In other words, the population correlation coefficient ρ = 0.

Alternative Hypothesis (H₁): There is a linear relationship between GPA and GMAT scores. This can be expressed as ρ ≠ 0, meaning the population correlation coefficient is different from 0 (a two-tailed test).

In [77]:
from scipy.stats import pearsonr
pearson_corr, p_value = pearsonr(df['gmat'].values, df['gpa'].values)
print(f"Pearson correlation value {pearson_corr:.3f}")
print(f"P-value {p_value:.3f}")

Pearson correlation value 0.578
P-value 0.000


A Pearson correlation of 0.578 indicates a moderate to strong positive linear relationship between GPA and GMAT scores.

Since the p-value is below any typical significance level (e.g., α = 0.05 or 0.01), we reject the null hypothesis H₀.

The p-value prints as 0.000 because it is smaller than the three decimals displayed. The observed correlation is therefore highly statistically significant, and it is very unlikely that this relationship is due to random chance. Significance alone does not make a column redundant (r² ≈ 0.33, so each score explains about a third of the variance in the other), but the two columns do carry overlapping information, which motivates the decision to remove one of them.
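
A p-value shown as 0.000 mostly reflects the three-decimal formatting. The sketch below uses synthetic data (not the MBA dataset) to show how scientific notation and r² give a fuller picture. The sample size matches the dataset, but the coefficients 0.6 and 0.8 are assumptions chosen only to produce a correlation near 0.6.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n = 6194  # same number of rows as the dataset; the values themselves are synthetic

x = rng.normal(size=n)
y = 0.6 * x + 0.8 * rng.normal(size=n)  # constructed to correlate at roughly 0.6

r, p = pearsonr(x, y)

print(f"r = {r:.3f}")
print(f"p (fixed-point) = {p:.3f}")  # tiny p-values all display as 0.000
print(f"p (scientific)  = {p:.3e}")  # may still show 0 if p underflows to 0.0
print(f"r^2 = {r**2:.3f}")           # share of variance explained by a linear fit
```

With thousands of rows, even a modest correlation yields an extremely small p-value, so the size of r (or r²) is the more informative number when deciding whether two columns are redundant.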

In [78]:
df_final = df.drop(columns=['gmat'])

Finally, we need to convert our categorical columns to numerical values so that we can use them with scikit-learn machine learning techniques.

In [79]:
from sklearn.preprocessing import LabelEncoder

# Initialize LabelEncoder
label_encoder_gender = LabelEncoder()
label_encoder_major = LabelEncoder()
label_encoder_race = LabelEncoder()
label_encoder_work_industry = LabelEncoder()

# Transform the categorical variables
df_final['gender'] = label_encoder_gender.fit_transform(df_final['gender'])
df_final['major'] = label_encoder_major.fit_transform(df_final['major'])
df_final['race'] = label_encoder_race.fit_transform(df_final['race'])
df_final['work_industry'] = label_encoder_work_industry.fit_transform(df_final['work_industry'])
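
To see which integer each category received, the fitted encoder's `classes_` attribute can be inspected. This is a small sketch on a toy series; the `majors` values are copied from the unique values listed earlier, and the variable names are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column using the three 'major' categories from the dataset.
majors = pd.Series(["Business", "Humanities", "STEM", "Business"])

encoder = LabelEncoder()
codes = encoder.fit_transform(majors)

# classes_ holds the categories in sorted order; the position is the code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes.
print(encoder.inverse_transform(codes))
```

Because the codes follow alphabetical order, they impose an arbitrary ordering on nominal categories, which linear and distance-based models may treat as meaningful.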

In [80]:
df_final.head()

Unnamed: 0,application_id,gender,international,gpa,major,race,work_exp,work_industry,admission
0,1,0,False,3.3,0,0,3.0,3,Admit
1,2,1,False,3.28,1,1,5.0,6,Deny
2,3,0,True,3.3,0,1,5.0,13,Admit
3,4,1,False,3.47,2,1,6.0,13,Deny
4,5,1,False,3.35,2,2,5.0,1,Deny


In [81]:
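# Keep only 'Admit' and 'Deny' rows so the target is binary
df_final = df_final.drop(index=df_final[df_final['admission'] == 'Waitlist'].index)
# application_id is just a row identifier and carries no predictive signal
df_final = df_final.drop(columns=['application_id'])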

# Machine learning model

Let's check the accuracy of LogisticRegression, RidgeClassifier and KNeighborsClassifier with their default parameters.

In [82]:
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score

X = df_final.drop(columns=['admission'])
y = df_final['admission']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.3)

accuracy_train = {}
accuracy_test = {}
models = [
    LogisticRegression(),
    RidgeClassifier(),
    KNeighborsClassifier()
]

for model in models:
    model.fit(X_train, y_train)
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    accuracy_train.update({model:train_score})
    accuracy_test.update({model:test_score})

print(f"Train scores: {accuracy_train}\n\nTest scores {accuracy_test}")

Train scores: {LogisticRegression(): 0.8518171160609613, RidgeClassifier(): 0.8543962485345838, KNeighborsClassifier(): 0.8710433763188745}

Test scores {LogisticRegression(): 0.8414434117003827, RidgeClassifier(): 0.8480043739748496, KNeighborsClassifier(): 0.8326954620010935}
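
Since the classes are heavily imbalanced, a useful reference point for these accuracies is a model that always predicts the majority class. The sketch below rebuilds only the label counts implied by the earlier outputs (5194 'Deny' and 900 'Admit' once 'Waitlist' is dropped); the feature matrix is a placeholder, and the counts are an inference from the printed proportions rather than a value taken from the notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class counts implied by the outputs above after dropping 'Waitlist':
# 5194 imputed 'Deny' rows and 900 'Admit' rows (0.145302 * 6194).
y_counts = np.array(["Deny"] * 5194 + ["Admit"] * 900)
X_placeholder = np.zeros((len(y_counts), 1))  # features are ignored by the dummy model

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_counts)
baseline_accuracy = baseline.score(X_placeholder, y_counts)

print(f"Majority-class accuracy: {baseline_accuracy:.4f}")
```

Under these assumptions the baseline is roughly 0.852 on the full data, so accuracies in the 0.83 to 0.85 range are close to what predicting 'Deny' for everyone would achieve. Metrics such as recall on 'Admit' or balanced accuracy may be more informative here.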


Now we will run a randomized search over each model's hyperparameters to look for a better solution.

In [83]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from scipy.stats import uniform, randint

# Define the models
models = {
    'KNeighborsClassifier': KNeighborsClassifier(),
    'LogisticRegression': LogisticRegression(max_iter=100),
    'RidgeClassifier': RidgeClassifier(),
    'SVC': SVC(),
    'DecisionTreeClassifier': DecisionTreeClassifier(),
    'RandomForestClassifier': RandomForestClassifier(),
    'GradientBoostingClassifier': GradientBoostingClassifier(),
}

# Define the hyperparameters and their distributions for each model
param_dists = {
    'KNeighborsClassifier': {
        'n_neighbors': randint(1, 10),
        'weights': ['uniform', 'distance']
    },
    'LogisticRegression': {
        'C': uniform(0.01, 100),
        'penalty': ['l1', 'l2'],  # Note: 'l1' is not supported by 'lbfgs', so those sampled combinations fail to fit
        'solver': ['lbfgs', 'liblinear']
    },
    'RidgeClassifier': {
        'alpha': uniform(0.01, 10)
    },
    'SVC': {
        'C': uniform(0.1, 10),
        'kernel': ['linear', 'rbf', 'poly'],
        'gamma': ['scale', 'auto']
    },
    'DecisionTreeClassifier': {
        'criterion': ['gini', 'entropy'],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': randint(2, 11)
    },
    'RandomForestClassifier': {
        'n_estimators': randint(1, 10),
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': randint(2, 11)
    },
    'GradientBoostingClassifier': {
        'n_estimators': randint(1, 10),
        'learning_rate': uniform(0.01, 0.2),
        'max_depth': randint(3, 8)
    }
}

# Loop through models and perform Randomized Search
for model_name, model in models.items():
    print(f"\nTuning hyperparameters for {model_name}...")
    random_search = RandomizedSearchCV(estimator=model, param_distributions=param_dists[model_name],
                                       n_iter=10, scoring='accuracy', cv=5, verbose=1, n_jobs=-1, random_state=42)
    random_search.fit(X, y)
    
    # Best parameters and best score
    print(f"{model_name} Best Parameters: {random_search.best_params_}")
    print(f"{model_name} Best Score: {random_search.best_score_}")


Tuning hyperparameters for KNeighborsClassifier...
Fitting 5 folds for each of 10 candidates, totalling 50 fits
KNeighborsClassifier Best Parameters: {'n_neighbors': 8, 'weights': 'uniform'}
KNeighborsClassifier Best Score: 0.8360680845561047

Tuning hyperparameters for LogisticRegression...
Fitting 5 folds for each of 10 candidates, totalling 50 fits
LogisticRegression Best Parameters: {'C': np.float64(2.0684494295802445), 'penalty': 'l2', 'solver': 'liblinear'}
LogisticRegression Best Score: 0.8513293218619801

Tuning hyperparameters for RidgeClassifier...
Fitting 5 folds for each of 10 candidates, totalling 50 fits


25 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "/home/lucas/TI/data-science-projects/venv/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/lucas/TI/data-science-projects/venv/lib/python3.9/site-packages/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/home/lucas/TI/data-science-projects/venv/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1194, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/home/lucas/TI/d

RidgeClassifier Best Parameters: {'alpha': np.float64(9.51714306409916)}
RidgeClassifier Best Score: 0.8524779389281101

Tuning hyperparameters for SVC...
Fitting 5 folds for each of 10 candidates, totalling 50 fits
SVC Best Parameters: {'C': np.float64(6.274815096277165), 'gamma': 'auto', 'kernel': 'rbf'}
SVC Best Score: 0.8526418731335141

Tuning hyperparameters for DecisionTreeClassifier...
Fitting 5 folds for each of 10 candidates, totalling 50 fits
DecisionTreeClassifier Best Parameters: {'criterion': 'entropy', 'max_depth': 10, 'min_samples_split': 7}
DecisionTreeClassifier Best Score: 0.8140777320234761

Tuning hyperparameters for RandomForestClassifier...
Fitting 5 folds for each of 10 candidates, totalling 50 fits
RandomForestClassifier Best Parameters: {'max_depth': 10, 'min_samples_split': 9, 'n_estimators': 6}
RandomForestClassifier Best Score: 0.8450920092514391

Tuning hyperparameters for GradientBoostingClassifier...
Fitting 5 folds for each of 10 candidates, totalling 5
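
The "25 fits failed" warning in the output above points at `_logistic.py`, so it most likely comes from the LogisticRegression search, where the 'l1' penalty was sampled together with the 'lbfgs' solver. One way to avoid such invalid combinations is to pass a list of dictionaries, one per valid solver/penalty pairing. The following is a sketch on synthetic data; `X_demo`, `y_demo` and the `max_iter` value are assumptions, not part of the original notebook.

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification data standing in for X and y.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Each dict only contains solver/penalty pairs that are valid together.
param_distributions = [
    {"solver": ["lbfgs"], "penalty": ["l2"], "C": uniform(0.01, 100)},
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": uniform(0.01, 100)},
]

search = RandomizedSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_distributions=param_distributions,
    n_iter=10,
    scoring="accuracy",
    cv=5,
    random_state=42,
    error_score="raise",  # surface any failing fit instead of scoring it as nan
)
search.fit(X_demo, y_demo)

print(search.best_params_)
```

When a list is given, RandomizedSearchCV first picks one dictionary and then samples parameters from it, so every candidate is a valid configuration and all ten iterations contribute a score.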

From the results shown above, the model with the best mean cross-validated accuracy (about 0.853) is SVC, closely followed by RidgeClassifier (about 0.852). The best SVC parameters are:

- 'C' = 6.274815096277165;
- 'gamma' = 'auto';
- 'kernel' = 'rbf'.