# Will I Get Into TJ?

My mom, being the Asian mom that she is, somehow found TJ admissions data on the [FCAG website](http://fcag.org).

Me, being the nerd that I am, perked up when she mentioned the words "data set".

So I wrote a classifier that, given some information, can determine (with roughly 86-87% accuracy) whether a semifinalist will make it into TJ.

Now, there are some very important caveats here, of course.

1. I'm using the data from the Classes of 2017 and 2018. There is also admissions data for a couple of other years that I just didn't bother using. I'm sure that the admissions policies change to some degree every year, so just keep that in mind.
2. I actually have no way of verifying the validity of the data, but I'll trust FCAG because it's a relatively reputable institution.
3. This assumes that the student has reached the semifinalist stage, since it asks for admissions test scores.
4. This data does not include summer admissions.
5. The TJ admissions test is changing for the Class of 2022, so this classifier will actually be pretty useless for anything other than trying it out for fun.

With that out of the way, let's dig into the good stuff!

In [10]:
# Import all of the things we want

import pandas as pd # pandas for reading in data
import numpy as np  # numpy because numpy is wonderful
import tflearn      # for the neural net™

In [2]:
data = pd.read_excel('./201718.xlsx') # data at fcag.org/tjstatistics.shtml

Ok, let's take a look at this data set!

In [3]:
data[25:30]

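| | ID | Application Year | FC Resident | AAP | Gender | M/S GPA | GPA | Verbal Score | Math Score | Semifinalist | Final Decision | CombineScore | Math and Verbal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25 | 26 | 2012-2013 | Yes | Yes | M | | 2.945 | 39.0 | 29.0 | N | | | 68.0 |
| 26 | 27 | 2012-2013 | Yes | Yes | F | | 3.9714 | 27.0 | 20.0 | N | | | 47.0 |
| 27 | 28 | 2012-2013 | Yes | Yes | F | 3.5875 | 3.781 | 40.0 | 47.0 | Y | N | 88.0 | 87.0 |
| 28 | 29 | 2012-2013 | Yes | | M | | 2.8333 | | | Withdrawn | | | |
| 29 | 30 | 2012-2013 | Yes | Yes | M | | 3.855 | 25.0 | 29.0 | N | | | 54.0 |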


So a couple of things may be confusing here. 

1. M/S GPA, from what I can tell, means a student's math/science GPA. This is not provided for all students, possibly because their middle school does not provide this information. 
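2. The 'CombineScore' and 'Math and Verbal' columns are different. In row 27, for example, the verbal and math scores add up to 87 (the 'Math and Verbal' value), while 'CombineScore' is 88. I think this is because math is weighted more than verbal in the 'CombineScore' column.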
3. The final decision is not binary; there are 3 options: 'Y' for admitted, 'N' for rejected, or 'W' for waitlisted.
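As a quick sanity check on point 2, here is a minimal sketch that rebuilds three of the rows shown above by hand (rather than reading the spreadsheet) and compares the two total columns. The names `sample`, `section_sum`, and `gap` are made up for this illustration.

```python
import pandas as pd

nan = float('nan')

# Three rows copied by hand from the output above (index 25, 26 and 27).
sample = pd.DataFrame({
    'Verbal Score':    [39.0, 27.0, 40.0],
    'Math Score':      [29.0, 20.0, 47.0],
    'CombineScore':    [nan, nan, 88.0],
    'Math and Verbal': [68.0, 47.0, 87.0],
}, index=[25, 26, 27])

# 'Math and Verbal' matches the plain sum of the two section scores.
section_sum = sample['Verbal Score'] + sample['Math Score']
print((section_sum == sample['Math and Verbal']).all())

# 'CombineScore' is blank for the two non-semifinalists here,
# and differs from the plain sum for the semifinalist.
sample['gap'] = sample['CombineScore'] - sample['Math and Verbal']
print(sample['gap'])
```

Three rows are nowhere near enough to work out how 'CombineScore' is actually computed; this only confirms that the two columns are not the same thing.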

In [4]:
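def preprocess(data):
    # Drop the columns we don't use as features
    del data['M/S GPA']
    del data['Semifinalist']
    del data['ID']
    del data['Application Year']
    del data['CombineScore']
    del data['Math and Verbal']

    # Blank AAP means 'No'; blank final decision means rejected
    data['AAP'] = data['AAP'].fillna(value='No')
    data['Final Decision'] = data['Final Decision'].fillna(value='N')

    # Drop rows that still have missing values, then renumber the index
    # (the original data.reindex() call discarded its result, so it did nothing)
    data = data.dropna().reset_index(drop=True)

    # .values gives the same array as as_matrix(), which newer pandas no longer has
    np_data = data.values

    # Encode the categorical values as numbers
    np_data[np_data == 'Yes'] = 1
    np_data[np_data == 'No'] = 0
    np_data[np_data == 'F'] = 1
    np_data[np_data == 'M'] = 0
    np_data[np_data == 'Y'] = 2
    np_data[np_data == 'N'] = 0
    np_data[np_data == 'W'] = 1

    return np_data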

We clean up the data a little bit, because there are some things that we want to get rid of.

We drop:
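- 'M/S GPA' because not every student has that.
- 'Semifinalist', which `preprocess` also deletes; students who did not reach that round are covered by the 'Final Decision' fill described below.
- 'ID' because it's useless.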
- 'Application Year', despite it potentially impacting the outcome, because I primarily want to predict future admissions, in which case historical admission year data wouldn't be as significant a factor. The exception here is if there is a clear trend in admissions across years, but I'm not convinced that that's the case, and I don't have enough data from year to year to determine that conclusively.
- 'CombineScore' and 'Math and Verbal', since those can be relatively trivially derived from the individual section scores.

We fill in all of the blank 'AAP' rows with 'No', and fill in all of the blank 'Final Decision' rows with an 'N' for rejected, since they are only blank if the student did not make it to the semifinalist round.

We then convert all of the categorical data to numerical data.
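For comparison, here is a sketch of the same encoding done with one explicit mapping per column instead of replacing values across the whole array. The `sample` rows are invented for illustration (they are not from the FCAG data), and `encodings` is a name I made up; the numeric codes are the ones `preprocess` uses.

```python
import pandas as pd

# Hypothetical mini data set with the same columns that survive preprocess().
sample = pd.DataFrame({
    'FC Resident':    ['Yes', 'Yes', 'No'],
    'AAP':            ['Yes', None, 'No'],
    'Gender':         ['M', 'F', 'F'],
    'GPA':            [3.9, 3.5, 4.0],
    'Verbal Score':   [40.0, 30.0, 45.0],
    'Math Score':     [47.0, 25.0, 44.0],
    'Final Decision': ['Y', None, 'W'],
})

# Same fill rules as preprocess(): blank AAP -> 'No', blank decision -> 'N'.
sample['AAP'] = sample['AAP'].fillna('No')
sample['Final Decision'] = sample['Final Decision'].fillna('N')

# One mapping per categorical column, using the same codes as preprocess().
encodings = {
    'FC Resident':    {'Yes': 1, 'No': 0},
    'AAP':            {'Yes': 1, 'No': 0},
    'Gender':         {'F': 1, 'M': 0},
    'Final Decision': {'Y': 2, 'W': 1, 'N': 0},
}
for column, mapping in encodings.items():
    sample[column] = sample[column].map(mapping)

print(sample)
```

The per-column version is a little longer, but a value can only be rewritten in the column it belongs to, and anything not covered by a mapping shows up as NaN instead of silently staying a string.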

In [5]:
data = preprocess(data)

In [6]:
trX = data[...,:6] # input data

In [7]:
trY = data[..., 6] # labels

We create a small neural network. I use Adam because it's the default optimizer, ReLU because it works pretty well, and softmax on the output layer because we want three class probabilities that sum to 1.

In [8]:
optimizer = tflearn.optimizers.Adam(learning_rate=0.01)

net = tflearn.input_data(shape=[None, 6])
net = tflearn.fully_connected(net, 32, activation='relu')
net = tflearn.fully_connected(net, 32, activation='relu')
net = tflearn.fully_connected(net, 3, activation='softmax')
net = tflearn.regression(net, to_one_hot=True, n_classes=3, optimizer=optimizer, shuffle_batches=True)
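To make the softmax part concrete, here is a small NumPy sketch of what a softmax output layer does with its three raw scores. This is the textbook formula, not tflearn's internal code, and the `logits` values are invented for illustration.

```python
import numpy as np

def softmax(logits):
    # Subtracting the max keeps exp() from overflowing; it doesn't change the result.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

# Hypothetical raw scores for the classes [rejected, waitlisted, admitted].
logits = np.array([0.5, -1.0, 1.8])
probs = softmax(logits)

print(probs)        # three positive numbers, largest for the largest raw score
print(probs.sum())  # sums to 1, up to floating point rounding
```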

I originally ran this for 1000 epochs, but it gets stuck at a minimum pretty early, so it's not necessarily worth running that many; the cell below uses 50. I kept a pretty large batch size because I didn't want batches that had exclusively rejections.
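A rough way to see why batch size matters for that last point: if a fraction `reject_rate` of the rows are rejections and a batch is drawn more or less at random, the chance that every row in the batch is a rejection is about `reject_rate ** batch_size`. The 0.85 below is a made-up rate for illustration, not a number measured from the data, and `all_rejections` is a hypothetical helper.

```python
def all_rejections(reject_rate, batch_size):
    # Probability that a randomly drawn batch contains only rejections,
    # treating each row as an independent draw (an approximation).
    return reject_rate ** batch_size

reject_rate = 0.85  # hypothetical share of rejections, for illustration only

for batch_size in (10, 50, 300):
    print(batch_size, all_rejections(reject_rate, batch_size))
```

With small batches an all-rejection batch is fairly likely under this assumption; at 300 it is vanishingly unlikely.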

In [9]:
model = tflearn.DNN(net)
model.fit(trX, trY, n_epoch=50, batch_size=300, show_metric=True, validation_set=.2)

Training Step: 799  | total loss: 0.31892 | time: 0.066s
| Adam | epoch: 050 | loss: 0.31892 - acc: 0.8658 -- iter: 4500/4736
Training Step: 800  | total loss: 0.31685 | time: 1.078s
| Adam | epoch: 050 | loss: 0.31685 - acc: 0.8687 | val_loss: 0.31303 - val_acc: 0.8598 -- iter: 4736/4736
--


Aight, so the training has finished. We can make a prediction with the model now. Let's try my admissions data.

In [29]:
model.predict([[1, 0, 0, 4, 48, 49]])

[[0.2051711529493332, 0.04664463549852371, 0.7481841444969177]]

I have a 75% chance of getting into TJ? Sweet.
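For anyone reading the raw output: the three numbers line up with the label encoding from `preprocess` ('N' is 0, 'W' is 1, 'Y' is 2). Here is a small sketch that turns the prediction above into a readable answer. The `labels` list and the variable names are mine, and the probabilities are copied by hand from the output rather than produced by calling the model.

```python
import numpy as np

# Index -> meaning, following the encoding in preprocess(): N=0, W=1, Y=2.
labels = ['rejected', 'waitlisted', 'admitted']

# Output of model.predict() above, copied by hand.
prediction = np.array([[0.2051711529493332, 0.04664463549852371, 0.7481841444969177]])

# Pick the class with the highest probability for the first (only) row.
best = int(np.argmax(prediction[0]))
print(labels[best], round(float(prediction[0][best]), 3))
```

As far as I can tell from the column order left after preprocessing, the six inputs passed to `model.predict` are FC Resident, AAP, Gender, GPA, Verbal Score, and Math Score, in that order.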