# University applicants

## Data exploration

In [1]:
import pandas as pd
import numpy as np

In [2]:
adm = pd.read_csv("../data/admission_univ.csv")

In [3]:
adm.head()

Unnamed: 0.1,Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,1,2,316,104,3,3.0,3.5,8.0,1,0.72
2,2,3,322,110,3,3.5,2.5,8.67,1,0.8
3,3,4,314,103,2,2.0,3.0,8.21,0,0.65
4,4,5,330,115,5,4.5,3.0,9.34,1,0.9


In [4]:
adm.columns = adm.columns.str.lstrip()
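# regex=False treats " " and "." as literal characters ("." is a regex wildcard otherwise)
adm.columns = adm.columns.str.replace(" ", "_", regex=False).str.replace(".", "", regex=False)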

In [5]:
adm.head()

Unnamed: 0,Unnamed:_0,Serial_No,GRE_Score,TOEFL_Score,University_Rating,SOP,LOR,CGPA,Research,Chance_of_Admit
0,0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,1,2,316,104,3,3.0,3.5,8.0,1,0.72
2,2,3,322,110,3,3.5,2.5,8.67,1,0.8
3,3,4,314,103,2,2.0,3.0,8.21,0,0.65
4,4,5,330,115,5,4.5,3.0,9.34,1,0.9


In [6]:
adm.drop("Unnamed:_0", axis=1, inplace=True)

In [7]:
adm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 385 entries, 0 to 384
Data columns (total 9 columns):
Serial_No            385 non-null int64
GRE_Score            385 non-null int64
TOEFL_Score          385 non-null int64
University_Rating    385 non-null int64
SOP                  385 non-null float64
LOR                  385 non-null float64
CGPA                 385 non-null float64
Research             385 non-null int64
Chance_of_Admit      385 non-null float64
dtypes: float64(4), int64(5)
memory usage: 27.1 KB


Before working with the graduate admissions data, we verify that the dataset contains neither missing values nor duplicate rows.

In [8]:
adm.isna().sum()

Serial_No            0
GRE_Score            0
TOEFL_Score          0
University_Rating    0
SOP                  0
LOR                  0
CGPA                 0
Research             0
Chance_of_Admit      0
dtype: int64

In [9]:
adm.duplicated().sum()

0

In [10]:
adm.Serial_No.duplicated().sum()

0

In [11]:
adm["grades_identifier"] = adm[["GRE_Score", "CGPA"]].apply(lambda x: "_".join(map(str,x)), axis=1)

In [12]:
adm.head()

Unnamed: 0,Serial_No,GRE_Score,TOEFL_Score,University_Rating,SOP,LOR,CGPA,Research,Chance_of_Admit,grades_identifier
0,1,337,118,4,4.5,4.5,9.65,1,0.92,337.0_9.65
1,2,316,104,3,3.0,3.5,8.0,1,0.72,316.0_8.0
2,3,322,110,3,3.5,2.5,8.67,1,0.8,322.0_8.67
3,4,314,103,2,2.0,3.0,8.21,0,0.65,314.0_8.21
4,5,330,115,5,4.5,3.0,9.34,1,0.9,330.0_9.34


In [13]:
adm.grades_identifier.count()

385

In [14]:
adm.grades_identifier.nunique()

385

All 385 identifiers are distinct, so the combination of `GRE_Score` and `CGPA` uniquely identifies each row.
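
The same uniqueness check can be done without building a string key. The sketch below uses `DataFrame.duplicated` with a `subset`; the `toy` frame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data standing in for the admissions table (illustrative values only).
toy = pd.DataFrame({
    "GRE_Score": [337, 316, 322, 316],
    "CGPA": [9.65, 8.00, 8.67, 8.21],
})

# True for every row whose (GRE_Score, CGPA) pair already appeared in an earlier row.
repeated_pairs = toy.duplicated(subset=["GRE_Score", "CGPA"])

# Zero repeated pairs means the pair identifies each row uniquely.
n_repeated = int(repeated_pairs.sum())
print(n_repeated)
```

Note that a GRE score on its own may repeat (316 appears twice here); only the pair has to be unique.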

In [15]:
adm.set_index(["grades_identifier"], drop=True, inplace=True)

In [16]:
adm.head()

grades_identifier,Serial_No,GRE_Score,TOEFL_Score,University_Rating,SOP,LOR,CGPA,Research,Chance_of_Admit
337.0_9.65,1,337,118,4,4.5,4.5,9.65,1,0.92
316.0_8.0,2,316,104,3,3.0,3.5,8.0,1,0.72
322.0_8.67,3,322,110,3,3.5,2.5,8.67,1,0.8
314.0_8.21,4,314,103,2,2.0,3.0,8.21,0,0.65
330.0_9.34,5,330,115,5,4.5,3.0,9.34,1,0.9


However, one column already identifies the applicants more simply: the serial number. It is probably a better index than our own home-made one.

In [17]:
adm.reset_index(drop=True).head()

Unnamed: 0,Serial_No,GRE_Score,TOEFL_Score,University_Rating,SOP,LOR,CGPA,Research,Chance_of_Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,316,104,3,3.0,3.5,8.0,1,0.72
2,3,322,110,3,3.5,2.5,8.67,1,0.8
3,4,314,103,2,2.0,3.0,8.21,0,0.65
4,5,330,115,5,4.5,3.0,9.34,1,0.9


Let's keep the serial number as a regular column for now (`drop=False`). Dropping it would lose nothing, since it could be recovered from the index.

In [18]:
adm.set_index("Serial_No", drop=False, inplace=True)

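Moving forward, I learnt a lesson: it is safer to keep the automatic index, so I restored the original one. (The cell numbers show that this step was run later in the session, which is why the output below already contains columns that are only created further down.)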

In [38]:
adm.reset_index(drop=True, inplace=True)

In [39]:
adm.head()

Unnamed: 0,Serial_No,GRE_Score,TOEFL_Score,University_Rating,SOP,LOR,CGPA,Research,Chance_of_Admit,Adjusted_GRE_Score,CGPA_std,GRE_adjusted_std,LOR_std,deciding_column
0,1,337,118,4,4.5,4.5,9.65,1,0.92,347.0,1.750174,1.821086,1.193197,LOR_std
1,2,316,104,3,3.0,3.5,8.0,1,0.72,316.0,-0.992501,-0.270193,0.07684,LOR_std
2,3,322,110,3,3.5,2.5,8.67,1,0.8,322.0,0.121191,0.134571,-1.039517,CGPA_std
3,4,314,103,2,2.0,3.0,8.21,0,0.65,314.0,-0.643433,-0.405114,-0.481338,LOR_std
4,5,330,115,5,4.5,3.0,9.34,1,0.9,340.0,1.234884,1.348862,-0.481338,GRE_adjusted_std
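
One reason the automatic index is safer is that label-based and position-based access stop agreeing once the index no longer starts at 0. The sketch below illustrates this on a made-up three-row frame; `toy`, `by_serial` and `reset` are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({"Serial_No": [1, 2, 3], "CGPA": [9.65, 8.00, 8.67]})
by_serial = toy.set_index("Serial_No", drop=False)

cgpa_pos = by_serial.columns.get_loc("CGPA")

# .loc looks rows up by label, .iloc by position.
label_1 = by_serial.loc[1, "CGPA"]          # row labelled 1 (the first row)
position_1 = by_serial.iloc[1, cgpa_pos]    # row at position 1 (the second row)

# With the default RangeIndex, label and position coincide again.
reset = by_serial.reset_index(drop=True)
same = reset.loc[1, "CGPA"] == reset.iloc[1, cgpa_pos]
```

With `Serial_No` as the index, `label_1` and `position_1` refer to different applicants, so code that feeds an index label into `.iloc` silently reads the wrong row.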


## Finding the right applicants

#### Let's start by getting some info from the applicants.

Total applicants

In [19]:
len(adm)

385

Total applicants who have already conducted academic research and whose CGPA is greater than or equal to 9

In [20]:
adm[(adm.Research == 1) & (adm.CGPA >= 9)].Serial_No.count()

104

Next, take the applicants whose CGPA is greater than 9 and whose SOP score is less than 3.5, and compute their average chance of admission as a percentage.

In [21]:
(adm[(adm.CGPA > 9) & (adm.SOP < 3.5)].Chance_of_Admit.mean())*100

80.19999999999999

Now suppose it is decided that an applicant with a university rating of 4 or higher should be given a 10-point boost on their GRE score.

In [22]:
adm.loc[adm.University_Rating >= 4, "Adjusted_GRE_Score"] = adm.GRE_Score+10

In [23]:
adm.Adjusted_GRE_Score = adm.Adjusted_GRE_Score.fillna(adm.GRE_Score)

In [24]:
adm.head()

Serial_No,Serial_No,GRE_Score,TOEFL_Score,University_Rating,SOP,LOR,CGPA,Research,Chance_of_Admit,Adjusted_GRE_Score
1,1,337,118,4,4.5,4.5,9.65,1,0.92,347.0
2,2,316,104,3,3.0,3.5,8.0,1,0.72,316.0
3,3,322,110,3,3.5,2.5,8.67,1,0.8,322.0
4,4,314,103,2,2.0,3.0,8.21,0,0.65,314.0
5,5,330,115,5,4.5,3.0,9.34,1,0.9,340.0
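
The boost can also be applied in one step with `np.where`, which avoids the intermediate missing values. This is a sketch on made-up data; `toy` and the constant `BOOST` are hypothetical names.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "GRE_Score": [337, 316, 330],
    "University_Rating": [4, 3, 5],
})

BOOST = 10  # points added for a university rating of 4 or higher

# Where the condition holds take the boosted score, otherwise the original one.
toy["Adjusted_GRE_Score"] = np.where(
    toy.University_Rating >= 4, toy.GRE_Score + BOOST, toy.GRE_Score
)
print(toy)
```

Because no `NaN` is ever created, the adjusted column stays integer here, whereas the two-step version above produces floats such as `347.0`.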


#### Getting to the decision factors
We would like to create a deciding factor column for each student. We standardize several columns and then pick one of them at random as the most important factor for each applicant. If the standardized value of that factor is 0.8 or higher, the student is accepted.

So, first let's standardize the columns that are going to be taken into account for the decision. In this way, the acceptance threshold can be set regardless of the factor chosen.

In [25]:
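def standardize(col):
    """Return the z-scores of col: (value - mean) / standard deviation."""
    mean = np.mean(col)
    std = np.std(col)  # population standard deviation (ddof=0)

    # Vectorized over the whole column, so no Python-level loop is needed
    return (col - mean) / std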
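
As a quick sanity check, a standardized column should have mean 0 and standard deviation 1, since each value is mapped to z = (x - mean) / std. The sketch below works this through on a made-up series (`scores` is not part of the dataset). It also shows one detail worth keeping in mind: `np.std` defaults to the population estimate (`ddof=0`), while `Series.std` defaults to the sample estimate (`ddof=1`).

```python
import numpy as np
import pandas as pd

scores = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Population z-score, matching np.std's default (ddof=0).
z = (scores - scores.mean()) / scores.std(ddof=0)

pop_std = scores.std(ddof=0)   # divides by n
sample_std = scores.std()      # pandas default divides by n - 1

print(z.tolist())
print(pop_std, sample_std)
```

Either convention works for ranking applicants, but the 0.8 threshold corresponds to slightly different raw scores depending on which one is used.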

Let's just assume that the relevant factors are the following:
`CGPA_std`, `GRE_adjusted_std` and `LOR_std`.

In [27]:
adm["CGPA_std"]=standardize(adm.CGPA)
adm["GRE_adjusted_std"]=standardize(adm.Adjusted_GRE_Score)
adm["LOR_std"]=standardize(adm.LOR)

In [28]:
adm.head()

Serial_No,Serial_No,GRE_Score,TOEFL_Score,University_Rating,SOP,LOR,CGPA,Research,Chance_of_Admit,Adjusted_GRE_Score,CGPA_std,GRE_adjusted_std,LOR_std
1,1,337,118,4,4.5,4.5,9.65,1,0.92,347.0,1.750174,1.821086,1.193197
2,2,316,104,3,3.0,3.5,8.0,1,0.72,316.0,-0.992501,-0.270193,0.07684
3,3,322,110,3,3.5,2.5,8.67,1,0.8,322.0,0.121191,0.134571,-1.039517
4,4,314,103,2,2.0,3.0,8.21,0,0.65,314.0,-0.643433,-0.405114,-0.481338
5,5,330,115,5,4.5,3.0,9.34,1,0.9,340.0,1.234884,1.348862,-0.481338


We will generate the decision choice at random for each applicant using the code below.

In [29]:
from random import choices

In [30]:
std_columns = ['CGPA_std', 'GRE_adjusted_std', 'LOR_std']

decision_choice = choices(std_columns, k=adm.shape[0])
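
The draw above is unseeded, so every run of the notebook assigns different factors. If reproducible results are wanted, one option is a dedicated `random.Random` instance with a fixed seed. In this sketch `SEED` and `N_APPLICANTS` are arbitrary, hypothetical values; in the notebook the count would be `adm.shape[0]`.

```python
import random

std_columns = ["CGPA_std", "GRE_adjusted_std", "LOR_std"]

SEED = 42          # arbitrary value, chosen only for illustration
N_APPLICANTS = 5   # stand-in for adm.shape[0]

# Two generators seeded identically produce the same sequence of choices.
rng = random.Random(SEED)
first_draw = rng.choices(std_columns, k=N_APPLICANTS)

rng = random.Random(SEED)
second_draw = rng.choices(std_columns, k=N_APPLICANTS)

print(first_draw == second_draw)
```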

Now store the randomly chosen factor names from `decision_choice` in a new column of the `adm` dataframe called `deciding_column`. The corresponding values are looked up in a later step.

In [31]:
adm["deciding_column"] = decision_choice

In [32]:
adm.head()

Serial_No,Serial_No,GRE_Score,TOEFL_Score,University_Rating,SOP,LOR,CGPA,Research,Chance_of_Admit,Adjusted_GRE_Score,CGPA_std,GRE_adjusted_std,LOR_std,deciding_column
1,1,337,118,4,4.5,4.5,9.65,1,0.92,347.0,1.750174,1.821086,1.193197,LOR_std
2,2,316,104,3,3.0,3.5,8.0,1,0.72,316.0,-0.992501,-0.270193,0.07684,LOR_std
3,3,322,110,3,3.5,2.5,8.67,1,0.8,322.0,0.121191,0.134571,-1.039517,CGPA_std
4,4,314,103,2,2.0,3.0,8.21,0,0.65,314.0,-0.643433,-0.405114,-0.481338,LOR_std
5,5,330,115,5,4.5,3.0,9.34,1,0.9,340.0,1.234884,1.348862,-0.481338,GRE_adjusted_std


Once the deciding column has been chosen at random for each applicant, let's grab the corresponding value with a lookup function.

In [33]:
def pandas_lookup(col):
    """For each row of adm, return the value of the column named in row[col]."""
    decision_value = []

    for _, row in adm.iterrows():
        # row[col] holds a column name such as "LOR_std"; reading it from the
        # row itself avoids passing index labels to iloc, which only works
        # when the index is the default 0..n-1 range
        decision_value.append(row[row[col]])

    return decision_value
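
The row-by-row lookup can also be written without an explicit loop. Older pandas releases offered `DataFrame.lookup` for this, but it has since been deprecated and removed, so the sketch below uses NumPy integer indexing instead. The `toy` frame, `value_cols` and `col_pos` are hypothetical names, and only two of the three factors are included to keep the example short.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "CGPA_std": [1.75, -0.99, 0.12],
    "LOR_std": [1.19, 0.08, -1.04],
    "deciding_column": ["LOR_std", "LOR_std", "CGPA_std"],
})

value_cols = ["CGPA_std", "LOR_std"]
values = toy[value_cols].to_numpy()

# Position of each row's chosen factor within value_cols.
col_pos = pd.Index(value_cols).get_indexer(toy["deciding_column"])

# Pair row i with its chosen column position and read the value.
toy["deciding_value"] = values[np.arange(len(toy)), col_pos]
print(toy)
```

This works purely by position, so it does not depend on what the dataframe index looks like.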


Then we can simply store it in a new column.

In [40]:
adm["deciding_value"] = pandas_lookup("deciding_column")

As stated earlier, a single acceptance threshold of 0.8 applies to whichever factor was chosen, so we can now flag which applicants are accepted and which are not.

In [41]:
adm["decision"] = np.where(adm.deciding_value >= 0.8, 1, 0)

In [42]:
adm.head()

Unnamed: 0,Serial_No,GRE_Score,TOEFL_Score,University_Rating,SOP,LOR,CGPA,Research,Chance_of_Admit,Adjusted_GRE_Score,CGPA_std,GRE_adjusted_std,LOR_std,deciding_column,deciding_value,decision
0,1,337,118,4,4.5,4.5,9.65,1,0.92,347.0,1.750174,1.821086,1.193197,LOR_std,1.193197,1
1,2,316,104,3,3.0,3.5,8.0,1,0.72,316.0,-0.992501,-0.270193,0.07684,LOR_std,0.07684,0
2,3,322,110,3,3.5,2.5,8.67,1,0.8,322.0,0.121191,0.134571,-1.039517,CGPA_std,0.121191,0
3,4,314,103,2,2.0,3.0,8.21,0,0.65,314.0,-0.643433,-0.405114,-0.481338,LOR_std,-0.481338,0
4,5,330,115,5,4.5,3.0,9.34,1,0.9,340.0,1.234884,1.348862,-0.481338,GRE_adjusted_std,1.348862,1


So, how many applicants will be accepted to the program using the decision column? And what is the acceptance proportion?

In [43]:
adm[adm.decision==1].decision.sum()

96

In [44]:
(adm[adm.decision==1].decision.sum() / adm.shape[0])*100

24.935064935064936

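96 applicants are accepted with this criterion, which corresponds to about 24.9% of the total. Because the deciding factor is drawn at random without a fixed seed, these figures change from run to run.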
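
Since the decision column only contains 0 and 1, its sum is the number of accepted applicants and its mean is the acceptance proportion. The following sketch shows this on a made-up series (`decision` here is illustrative, not the notebook's column).

```python
import pandas as pd

decision = pd.Series([1, 0, 0, 0, 1, 0, 1, 0])

accepted = int(decision.sum())          # number of ones
acceptance_pct = decision.mean() * 100  # share of ones, as a percentage

print(accepted, acceptance_pct)
```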

*Incidentally, the same result can be reached another way, shown below, when the deciding value itself does not need to be stored. (This is actually simpler and accomplishes the goal just fine, but I wanted to challenge myself a bit.)*

In [46]:
adm["decision_2"] = np.where(
    ((adm.deciding_column == "CGPA_std") & (adm.CGPA_std >= 0.8))
    | ((adm.deciding_column == "GRE_adjusted_std") & (adm.GRE_adjusted_std >= 0.8))
    | ((adm.deciding_column == "LOR_std") & (adm.LOR_std >= 0.8)),
    1, 0)

In [47]:
adm[adm.decision_2==1].decision_2.sum()

96

In [48]:
(adm[adm.decision_2==1].decision_2.sum() / adm.shape[0])*100

24.935064935064936
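
Matching counts suggest the two approaches agree, but they do not prove it row by row. A direct comparison is a stricter check. The sketch below runs both approaches on a made-up frame with two factors; `toy`, `THRESHOLD` and `agree` are hypothetical names.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "CGPA_std": [1.75, -0.99, 0.12],
    "LOR_std": [1.19, 0.08, -1.04],
    "deciding_column": ["LOR_std", "LOR_std", "CGPA_std"],
    "deciding_value": [1.19, 0.08, 0.12],
})
THRESHOLD = 0.8

# Approach 1: threshold the looked-up value.
decision = np.where(toy.deciding_value >= THRESHOLD, 1, 0)

# Approach 2: combine one condition per possible factor.
decision_2 = np.where(
    ((toy.deciding_column == "CGPA_std") & (toy.CGPA_std >= THRESHOLD))
    | ((toy.deciding_column == "LOR_std") & (toy.LOR_std >= THRESHOLD)),
    1, 0,
)

# True only if every row received the same decision from both approaches.
agree = bool((decision == decision_2).all())
print(agree)
```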