# Logistic Regression with College Admission Data

This exercise uses a data set of college graduate admissions to predict whether or not a student will be accepted based on their GRE score, GPA, and ranking.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Load Data

- After loading the data, print out a sample of ten random rows to visually inspect the data

In [None]:
pd.options.display.float_format = '{:,.2f}'.format
dataset = "datasets/admission.csv"
data = pd.read_csv(dataset)
data.sample(10)


## Exploratory Data Analysis (EDA)

- As before, get an idea of what the data distribution looks like
- We can also see the distribution of the target labels
- The labels don't seem to be grossly skewed, so we don't worry about oversampling or undersampling

In [None]:
data.describe()

In [None]:
data['admit'].value_counts()

In [None]:
data['admit'].value_counts(normalize=True)
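
One way to make the "not grossly skewed" judgement concrete is to compare the class shares with the accuracy of always predicting the most common class. The sketch below uses a small made-up label series (`admit_demo` is hypothetical, not the real `data['admit']`), so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical labels standing in for data['admit'] (0 = rejected, 1 = admitted)
admit_demo = pd.Series([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])

# Share of each class, as value_counts(normalize=True) reports it
class_share = admit_demo.value_counts(normalize=True)

# Accuracy of a "model" that always predicts the most common class
baseline_accuracy = class_share.max()

print(class_share)
print("Majority-class baseline accuracy:", baseline_accuracy)
```

Any accuracy reported later is only meaningful if it beats this baseline, which is why the class balance is worth checking before training.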

## Shape Data

- As in the previous example, we split our data into a set of input features (`x`) and a set of labels (`y`)

In [None]:
x = data[['gre', 'gpa', 'rank']]
y = data['admit']

print(x)
print(y)

In [None]:
print('x : ', x.shape)
print('y : ', y.shape)

## Split train/test

- Unlike the last example, we are splitting the data into two parts.
- One part is the training data that is used to train the model
- The other part is the test data used to test our model
- In this case, 80% of the data is used to train the model and the remaining 20% (`test_size=0.2`) is held out for testing

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=123)

print("x_train :", x_train.shape)
print("x_test :", x_test.shape)
print("y_train :", y_train.shape)
print("y_test :", y_test.shape)
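
As an optional variation on the split above, `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in both parts. The sketch below is not what the notebook does; it uses synthetic stand-in arrays (`x_demo`, `y_demo` are made up) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, 30% positive labels
rng = np.random.default_rng(123)
x_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 30 + [0] * 70)

# stratify=y_demo keeps the 30/70 class ratio in both the train and test parts
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=123, stratify=y_demo
)

print("train positive share:", y_tr.mean())
print("test positive share :", y_te.mean())
```

Without `stratify`, a small test set can end up with a noticeably different label mix from the training set, which makes the two accuracy scores harder to compare.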

## Logistic Regression

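- We use scikit-learn's built-in `LogisticRegression` class to create a logistic regression model
- Once the model is trained, we examine the coefficients and notice that GPA seems to have a strong influence. Coefficient sizes depend on each feature's scale, so treat this as a rough indication rather than a ranking of importance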
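
For background on what the coefficients mean: a binary logistic regression computes a linear score from the features and passes it through the sigmoid function to get a probability. The sketch below checks this by hand against `predict_proba`, assuming scikit-learn's default settings. It uses synthetic data (`x_demo`, `y_demo`, and `demo_model` are made-up names), not the admission data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for three numeric features; values are made up
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 3))
noise = rng.normal(size=200)
y_demo = (x_demo[:, 0] + 0.5 * x_demo[:, 1] - x_demo[:, 2] + noise > 0).astype(int)

demo_model = LogisticRegression().fit(x_demo, y_demo)

# Linear score z = w . x + b, then squash it through the sigmoid
z = x_demo @ demo_model.coef_.ravel() + demo_model.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

# predict_proba returns one column per class; column 1 is P(label == 1)
sklearn_proba = demo_model.predict_proba(x_demo)[:, 1]

print("manual and sklearn probabilities agree:",
      np.allclose(manual_proba, sklearn_proba))
```

A positive coefficient pushes the probability of label 1 up as that feature increases, and a negative one pushes it down.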

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
model = lr.fit(x_train, y_train)
print('coef : ', model.coef_)
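
Because raw coefficients depend on the units of each feature, one common way to compare them is to standardize the features first. The sketch below is an illustration on synthetic data, not part of the original exercise: `big` and `small` are made-up features on very different scales that contribute equally to the label by construction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales
rng = np.random.default_rng(1)
big = rng.normal(600, 100, size=300)     # values in the hundreds
small = rng.normal(3.4, 0.4, size=300)   # values around 3 to 4
x_demo = np.column_stack([big, small])

# Both features matter equally when measured in standard deviations
score = (big - 600) / 100 + (small - 3.4) / 0.4
y_demo = (score + rng.normal(size=300) > 0).astype(int)

# Fit once on raw features and once on standardized features
raw = LogisticRegression(max_iter=5000).fit(x_demo, y_demo)
scaled = make_pipeline(StandardScaler(), LogisticRegression()).fit(x_demo, y_demo)

raw_coefs = raw.coef_.ravel()
scaled_coefs = scaled[-1].coef_.ravel()

print("raw coefficients   :", raw_coefs)
print("scaled coefficients:", scaled_coefs)
```

On the raw features the small-scale feature gets a much larger coefficient simply because of its units; after standardizing, the two coefficients come out at a similar size.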



## Model Evaluation

- Now we use the test data to make predictions and compare them against the labels
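- Notice that the score on the test data is higher than on the training data. With a test set this small, that can happen by chance, so it should not be read as the model generalizing better than it fits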


In [None]:
y_pred = model.predict(x_test)
y_pred

In [None]:
train_accuracy = model.score(x_train, y_train)
print("Train accuracy: ", train_accuracy)

In [None]:
test_accuracy = model.score(x_test, y_test)
print("Test accuracy: ", test_accuracy)
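
A single train/test split gives one accuracy number that can shift with the random seed. One way to see how much it varies is k-fold cross-validation. The sketch below uses `cross_val_score` on synthetic stand-in data (`x_demo` and `y_demo` are made up), so the scores say nothing about the admission data itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 400 rows, 3 features, made-up labels
rng = np.random.default_rng(42)
x_demo = rng.normal(size=(400, 3))
noise = rng.normal(size=400)
y_demo = (x_demo[:, 0] - x_demo[:, 2] + noise > 0.5).astype(int)

# Five folds: each fold serves once as the held-out test set
scores = cross_val_score(LogisticRegression(), x_demo, y_demo, cv=5)

print("fold accuracies:", scores)
print("mean accuracy  :", scores.mean())
print("std deviation  :", scores.std())
```

If the spread across folds is about as large as the gap between train and test accuracy, that gap is probably noise.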

### Confusion Matrix

- This shows us the misclassifications. The actual values are on the side (rows) and the predicted values are on the top (columns)
- In this case, for label 0, the model got 52 right but wrongly identified 21 as 1
- For label 1, it correctly identified 7 as 1

In [None]:
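from sklearn.metrics import confusion_matrix

# Pass the labels explicitly so the matrix order matches the DataFrame index
cm_labels = np.unique(y)
cm_array = confusion_matrix(y_test, y_pred, labels=cm_labels)

# Rows are actual labels, columns are predicted labels
cm_df = pd.DataFrame(cm_array, index=cm_labels, columns=cm_labels)
cm_df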
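
The four cells of a binary confusion matrix can be combined into accuracy, precision, and recall. The sketch below works through the arithmetic on hypothetical labels (`y_true_demo` and `y_pred_demo` are made up and unrelated to the results above).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels for ten samples
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0])

# Rows are actual labels, columns are predicted labels
cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])

# For a 2x2 matrix, ravel() yields: true neg, false pos, false neg, true pos
tn, fp, fn, tp = cm_demo.ravel()

accuracy = (tp + tn) / cm_demo.sum()   # share of all predictions that are right
precision = tp / (tp + fp)             # of predicted 1s, how many are really 1
recall = tp / (tp + fn)                # of actual 1s, how many were found

print(cm_demo)
print("accuracy :", accuracy)
print("precision:", precision)
print("recall   :", recall)
```

Precision and recall for label 1 are often more informative than accuracy alone when one class is less common than the other.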

In [None]:
import seaborn as sns

plt.figure(figsize=(8, 5))

# colormaps : cmap="YlGnBu" , cmap="Greens", cmap="Blues",  cmap="Reds"
ax = sns.heatmap(cm_df, annot=True, cmap="Reds", fmt='d')
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")
plt.show()