## Import Libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.model_selection import KFold

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

## Collect Data
I also make copies of the data to experiment with feature selection.

In [2]:
filepath = "/content/drive/MyDrive/Colab Notebooks/archive/Admission_Predict_Ver1.1.csv"
train_df = pd.read_csv(filepath)
train_df_copy = train_df.copy()
test_df = train_df.pop('Chance of Admit ')
test_df_copy = train_df_copy.pop('Chance of Admit ')

X_train, X_test, y_train, y_test = train_test_split(train_df, test_df, test_size=0.2)

## Feature Selection
Since I did not do much feature selection on past projects, I wanted to use the sklearn 'SelectKBest' function to select a few different feature-set sizes and test the curse of dimensionality.

In [3]:
three_feat = SelectKBest(chi2, k=3).fit_transform(train_df_copy.astype('int'), test_df_copy.astype('int'))
X_train_3, X_test_3, y_train_3, y_test_3= train_test_split(three_feat, test_df_copy, test_size=0.2)
feat3 = pd.DataFrame(three_feat)
             
five_feat = SelectKBest(chi2, k=5).fit_transform(train_df_copy.astype('int'), test_df_copy.astype('int'))
X_train_5, X_test_5, y_train_5, y_test_5 = train_test_split(five_feat, test_df_copy, test_size=0.2)
feat5 = pd.DataFrame(five_feat)
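
One caveat on the cell above: chi2 is a score meant for classification targets and non-negative features, and casting a target to int truncates every value below 1 to 0. Assuming the target is a chance between 0 and 1, as the column name suggests, the selector sees almost no variation to score against. A minimal sketch of one possible alternative is below, using f_regression, which scores each feature against a continuous target. The data is synthetic and the column names (feat_0 and so on) are hypothetical stand-ins, not the admissions columns.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)

# Synthetic stand-in data with hypothetical column names.
X = pd.DataFrame(rng.normal(size=(200, 6)),
                 columns=[f"feat_{i}" for i in range(6)])
# The continuous target depends only on feat_0 and feat_3; no integer cast is needed.
y = 2.0 * X["feat_0"] - 1.5 * X["feat_3"] + rng.normal(scale=0.1, size=200)

# Score every feature against the continuous target and keep the best two.
selector = SelectKBest(f_regression, k=2).fit(X, y)
selected = X.columns[selector.get_support()]

# Indexing the original frame keeps the column names and the row index.
X_selected = X.loc[:, selected]
print(list(selected))
```

Selecting columns from the original DataFrame, rather than wrapping the transformed array, keeps the feature names available for later inspection.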

## K-Fold
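To score all of the models I build, I wanted to use a single 5-fold cross-validation function. Since every model here is a regressor, the score method returns the coefficient of determination (R²), so the numbers below are R² values averaged over the folds.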

In [4]:
def k_fold_cross_val(data_x, data_y, model, num_folds=5):
  kf = KFold(n_splits=num_folds)
  avg = []
  for train_index, test_index in kf.split(data_x):
    X_train, X_test = data_x.iloc[train_index], data_x.iloc[test_index]
    y_train, y_test = data_y.iloc[train_index], data_y.iloc[test_index]
    model.fit(X_train, y_train)
    avg += [model.score(X_test, y_test)]
  return sum(avg)/len(avg)
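
For comparison, sklearn ships a helper that performs the same loop. The sketch below, on synthetic data with made-up column names, uses cross_val_score with the same unshuffled KFold, then repeats it with shuffling turned on, which may matter if the rows of a file are stored in a meaningful order. Whether that applies to this dataset is an assumption I have not checked.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in data; the column names are hypothetical.
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
y = pd.Series(X["a"] - 2.0 * X["b"] + rng.normal(scale=0.5, size=100))

# Same unshuffled 5-fold split as the manual loop above.
kf = KFold(n_splits=5)
scores = cross_val_score(LinearRegression(), X, y, cv=kf)
mean_score = scores.mean()

# Shuffling with a fixed seed gives repeatable folds that ignore row order.
kf_shuffled = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled_scores = cross_val_score(LinearRegression(), X, y, cv=kf_shuffled)

print(mean_score, shuffled_scores.mean())
```

cross_val_score clones the estimator for each fold and uses the estimator's own score method by default, which is R² for regressors.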

## Linear Regression
I feel the most basic model to start with is linear regression. This model assumes the data follows a linear trend throughout. Normalizing the data does not add any benefit here, since rescaling the features of an ordinary least-squares fit only rescales the coefficients and leaves the predictions unchanged. From these models, it's clear that more features result in a better score, but the leap between 5 and 8 features is likely the result of overfitting.

In [6]:
lrm = LinearRegression()
print(k_fold_cross_val(train_df,test_df,lrm))

lrm = LinearRegression()
print(k_fold_cross_val(feat3,test_df,lrm))

lrm = LinearRegression()
print(k_fold_cross_val(feat5,test_df,lrm))

0.7902348853116391
0.7181017449412218
0.7282410096350164
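
The point about normalization can be checked directly. The sketch below, on synthetic data with features on deliberately different scales, compares a plain LinearRegression against the same model behind a StandardScaler in a pipeline. The two cross-validated scores should agree up to floating-point error. This holds for ordinary least squares with an intercept; it would not carry over to regularized models such as ridge or lasso.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Three synthetic features on very different scales.
X = rng.normal(size=(150, 3)) * np.array([1.0, 50.0, 300.0])
y = X @ np.array([0.5, 0.02, -0.001]) + rng.normal(scale=0.3, size=150)

cv = KFold(n_splits=5)

# Plain least squares on the raw features.
raw = cross_val_score(LinearRegression(), X, y, cv=cv).mean()

# The same model, with the scaler fit on each training fold only.
scaled_model = make_pipeline(StandardScaler(), LinearRegression())
scaled = cross_val_score(scaled_model, X, y, cv=cv).mean()

print(raw, scaled)
```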


## Nearest Neighbors
Nearest neighbors as a form of regression is something I have not worked with since coming to IU, and it allowed me to vary my model in another dimension, the number of neighbors, for testing. Immediately, we can see that having too many features destroys the score of a nearest neighbors model, which makes sense. Having so many ways to evaluate closeness means it's likely that unimportant features are weighted the same as important ones.

Another interesting finding is that 20 neighbors and 3 features begins to rival five features and any number of neighbors. This makes me wonder if 20 neighbors is beginning to cause unintended biases with 3 features and how these three variables are correlated.

In [7]:
nnr = KNeighborsRegressor(n_neighbors=2)
print(k_fold_cross_val(train_df,test_df,nnr))

nnr = KNeighborsRegressor(n_neighbors=5)
print(k_fold_cross_val(train_df,test_df,nnr))

nnr = KNeighborsRegressor(n_neighbors=10)
print(k_fold_cross_val(train_df,test_df,nnr))

nnr = KNeighborsRegressor(n_neighbors=20)
print(k_fold_cross_val(train_df,test_df,nnr))

0.24904138398204623
0.1945693987602522
0.16405767160981316
0.14268738614869997


In [8]:
nnr = KNeighborsRegressor(n_neighbors=2)
print(k_fold_cross_val(feat3,test_df,nnr))

nnr = KNeighborsRegressor(n_neighbors=5)
print(k_fold_cross_val(feat3,test_df,nnr))

nnr = KNeighborsRegressor(n_neighbors=10)
print(k_fold_cross_val(feat3,test_df,nnr))

nnr = KNeighborsRegressor(n_neighbors=20)
print(k_fold_cross_val(feat3,test_df,nnr))

0.4007445750865699
0.4777529307487294
0.5746809641270213
0.6214390990332485


In [9]:
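# Note: the first score printed below was recorded while this variable was misnamed "nnrnnr",
# so it reflects the leftover n_neighbors=20 model from the previous cell (it matches the last score).
nnr = KNeighborsRegressor(n_neighbors=2)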
print(k_fold_cross_val(feat5,test_df,nnr))

nnr = KNeighborsRegressor(n_neighbors=5)
print(k_fold_cross_val(feat5,test_df,nnr))

nnr = KNeighborsRegressor(n_neighbors=10)
print(k_fold_cross_val(feat5,test_df,nnr))

nnr = KNeighborsRegressor(n_neighbors=20)
print(k_fold_cross_val(feat5,test_df,nnr))

0.6339938038391458
0.5646759345550985
0.6277398144963
0.6339938038391458
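
Unlike linear regression, nearest neighbors is sensitive to feature scale, because a feature with a large numeric range dominates the distance calculation. The sketch below illustrates this on synthetic data: one small-scale feature drives the target and one large-scale feature is pure noise. It is an illustration of the general effect, not a diagnosis of this dataset, and the variable names are made up.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

informative = rng.normal(size=300)          # small scale, drives the target
noise = rng.normal(scale=1000.0, size=300)  # large scale, unrelated to the target
X = np.column_stack([informative, noise])
y = 3.0 * informative + rng.normal(scale=0.1, size=300)

cv = KFold(n_splits=5)

# Raw features: distances are dominated by the large-scale noise column.
knn_raw = KNeighborsRegressor(n_neighbors=5)
raw_score = cross_val_score(knn_raw, X, y, cv=cv).mean()

# Standardized features: both columns contribute on an equal footing.
knn_scaled = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
scaled_score = cross_val_score(knn_scaled, X, y, cv=cv).mean()

print(raw_score, scaled_score)
```

Putting the scaler inside the pipeline means it is refit on each training fold, so no information from the held-out fold leaks into the scaling.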


## Decision Tree
Last but not least, let's look at a decision tree. Here, I wanted to play around with setting a maximum tree depth to see how that impacted the various models. At a glance, a depth of ten is around where the trees max out on their own, and shallower trees scored better with the full feature set and with five features, while three features scored about the same at every depth. Like linear regression, trees can often suffer from overfitting as features are added, but surprisingly both the 3- and 5-feature models held their scores quite well.

In [10]:
dtr = DecisionTreeRegressor()
print(k_fold_cross_val(train_df,test_df,dtr))

dtr = DecisionTreeRegressor(max_depth=3)
print(k_fold_cross_val(train_df,test_df,dtr))

dtr = DecisionTreeRegressor(max_depth=5)
print(k_fold_cross_val(train_df,test_df,dtr))

dtr = DecisionTreeRegressor(max_depth=10)
print(k_fold_cross_val(train_df,test_df,dtr))

0.6733133286141734
0.7104746451388882
0.73902789755243
0.6681541972267249


In [11]:
dtr = DecisionTreeRegressor()
print(k_fold_cross_val(feat3,test_df,dtr))

dtr = DecisionTreeRegressor(max_depth=3)
print(k_fold_cross_val(feat3,test_df,dtr))

dtr = DecisionTreeRegressor(max_depth=5)
print(k_fold_cross_val(feat3,test_df,dtr))

dtr = DecisionTreeRegressor(max_depth=10)
print(k_fold_cross_val(feat3,test_df,dtr))

0.683499681752247
0.6612554606991324
0.6818847918740596
0.683499681752247


In [12]:
dtr = DecisionTreeRegressor()
print(k_fold_cross_val(feat5,test_df,dtr))

dtr = DecisionTreeRegressor(max_depth=3)
print(k_fold_cross_val(feat5,test_df,dtr))

dtr = DecisionTreeRegressor(max_depth=5)
print(k_fold_cross_val(feat5,test_df,dtr))

dtr = DecisionTreeRegressor(max_depth=10)
print(k_fold_cross_val(feat5,test_df,dtr))

0.5966979152818437
0.6718757196950599
0.649684132379097
0.6068445771442722
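
The depth comparison can be written as a loop, and the depth an unrestricted tree actually reaches can be read off with get_depth instead of estimated. The sketch below does both on synthetic data. It also fixes random_state, since sklearn trees break ties between equally good splits at random, so scores can otherwise drift slightly from run to run.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data with a nonlinear and a linear component.
X = rng.uniform(-3, 3, size=(300, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

# How deep does a tree grow when no limit is set?
full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)
full_depth = full_tree.get_depth()

# Cross-validated R² for each depth limit; None means unrestricted.
cv = KFold(n_splits=5)
depth_scores = {}
for depth in (3, 5, 10, None):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    depth_scores[depth] = cross_val_score(tree, X, y, cv=cv).mean()

print(full_depth, depth_scores)
```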


## Conclusion
Having tested a variety of models for linear regression, decision tree regression, and nearest neighbor regression, I believe decision trees to be the best for regression. From what I know from classes and readings, the model can perform very well with little tweaking (especially when abstracted into a forest for ensemble learning). Nearest neighbors was certainly the most volatile, due to the differences in how it is computed compared with the other models tested. There are tons of other regression models out there, though, and I only played around in the space a small bit. I may add more models to the project, but they'll come after this section unless I have more findings.

To add for fun:
- sklearn.ensemble.RandomForestRegressor
- Neural networks or SVMs?
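
As a starting point for the first item, here is a minimal sketch of scoring a random forest with the same 5-fold setup. The data is synthetic and the hyperparameters (100 trees, fixed seed) are arbitrary assumptions, not tuned values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in data with a nonlinear and a linear component.
X = rng.uniform(-3, 3, size=(300, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

# An ensemble of 100 trees, each fit on a bootstrap sample of the training fold.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest_scores = cross_val_score(forest, X, y, cv=KFold(n_splits=5))

print(forest_scores.mean())
```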