# LOGISTIC REGRESSION FOR STUDENT ADMISSION

In this exercise, which I found on GitHub (https://github.com/jdwittenauer/ipython-notebooks), the task is to estimate each applicant's chance of admission based on their results on two exams. As administrators of a university department, we have historical data from previous applicants that can be used as a training set. For each training example, we have the applicant's scores on the two exams and the admission decision. To accomplish this, we are going to build a classification model that estimates the probability of admission from the exam scores, using a technique called logistic regression.
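
As a rough illustration of what "estimating the probability of admission" means, logistic regression passes a weighted sum of the two exam scores through the sigmoid function. The sketch below is not the notebook's fitted model: the function `admission_probability` and the weights `w0`, `w1`, `w2` are hypothetical values chosen only to show the shape of the computation.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def admission_probability(exam1, exam2, w0, w1, w2):
    """Probability of admission under a logistic model with the given weights."""
    return sigmoid(w0 + w1 * exam1 + w2 * exam2)

# Hypothetical weights (an assumption, not values learned from the data set).
p = admission_probability(exam1=60.0, exam2=70.0, w0=-25.0, w1=0.2, w2=0.2)
print(round(float(p), 3))  # 0.731, since -25 + 12 + 14 = 1 and sigmoid(1) is about 0.731
```

Training the model amounts to choosing the weights so that these probabilities agree as well as possible with the recorded admission decisions.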

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import os
# Build the path with os.path.join so that it works on any operating system
path = os.path.join(os.getcwd(), 'data', 'ex2data1.txt')
data = pd.read_csv(path, header=None, names=['Exam 1', 'Exam 2', 'Admitted'])
data.head()  # Admitted = 1: the student was admitted
             # Admitted = 0: the student was not admitted

|   | Exam 1 | Exam 2 | Admitted |
|---|--------|--------|----------|
| 0 | 34.62366 | 78.024693 | 0 |
| 1 | 30.286711 | 43.894998 | 0 |
| 2 | 35.847409 | 72.902198 | 0 |
| 3 | 60.182599 | 86.308552 | 1 |
| 4 | 79.032736 | 75.344376 | 1 |


In this case, no data pre-processing is needed: the data set contains no text columns or null values, and the two exam scores do not have a scaling problem.

Once we have the hypothesis, we will need to evaluate it to check that it generalizes correctly and does not overfit. We will therefore split the data into three sets: the training set, the validation set and the test set.


In [3]:
len(data)  # We have a total of 100 examples


# 60% of the data will be assigned to the training set.
# The remaining 40% will be split in half: 20% of the data for the validation set
# and 20% for the test set.

100

In [6]:
# We split the data set into the train and test set as mentioned before

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(data, test_size=0.4, random_state=42)

In [7]:
train_set.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 60 entries, 49 to 51
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Exam 1    60 non-null     float64
 1   Exam 2    60 non-null     float64
 2   Admitted  60 non-null     int64  
dtypes: float64(2), int64(1)
memory usage: 1.9 KB


In [8]:
# We split the test set: half will be the test set and the other the validation set (50/50).
val_set, test_set = train_test_split(test_set, test_size=0.5, random_state=42)

In [9]:
print("Training Set length:", len(train_set))
print("Validation Set length:", len(val_set))
print("Test Set length:", len(test_set))

Training Set length: 60
Validation Set length: 20
Test Set length: 20
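
With sets this small, the share of admitted applicants can differ noticeably between the three splits. One option, not used in the notebook above, is the `stratify` argument of `train_test_split`, which keeps the class ratio roughly equal in each part. The sketch below runs on a synthetic stand-in frame (`demo`, with an assumed 60/40 class ratio), not on the real exam data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the exam data: 100 rows with an assumed 60/40 class ratio.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Exam 1': rng.uniform(30, 100, size=100),
    'Exam 2': rng.uniform(30, 100, size=100),
    'Admitted': [1] * 60 + [0] * 40,
})

# 60% train, then the remaining 40% is split in half into validation and test.
# stratify keeps the proportion of admitted applicants similar in every part.
train_demo, rest_demo = train_test_split(
    demo, test_size=0.4, random_state=42, stratify=demo['Admitted'])
val_demo, test_demo = train_test_split(
    rest_demo, test_size=0.5, random_state=42, stratify=rest_demo['Admitted'])

for name, part in [('train', train_demo), ('val', val_demo), ('test', test_demo)]:
    print(name, len(part), round(part['Admitted'].mean(), 2))
```

Each printed line shows the split name, its size, and the fraction of admitted applicants it contains.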


In [16]:
x_test = test_set[['Exam 1', 'Exam 2']]
y_test = test_set['Admitted']

x_train = train_set[['Exam 1', 'Exam 2']]
y_train = train_set['Admitted']

x_val = val_set[['Exam 1', 'Exam 2']]
y_val = val_set['Admitted']


In [17]:
#Let's train our algorithm using the train data set:

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(x_train, y_train)

LogisticRegression()
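
The stated goal is a probability of admission, whereas `predict` returns only hard 0/1 labels. scikit-learn's `LogisticRegression` also provides `predict_proba`, which returns one column of probabilities per class. The sketch below illustrates this on synthetic data; `X_demo`, `y_demo`, `demo_clf` and the admission rule (scores summing above 120) are assumptions made for the example, not part of the exercise.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: an applicant is "admitted" when the two scores sum above 120.
rng = np.random.default_rng(0)
X_demo = rng.uniform(30, 100, size=(200, 2))
y_demo = (X_demo.sum(axis=1) > 120).astype(int)

demo_clf = LogisticRegression(max_iter=1000)
demo_clf.fit(X_demo, y_demo)

# Columns of predict_proba follow demo_clf.classes_, so column 1 is P(admitted).
applicants = np.array([[45.0, 50.0], [85.0, 90.0]])
proba_admitted = demo_clf.predict_proba(applicants)[:, 1]
print(proba_admitted)

# predict() corresponds to thresholding these probabilities at 0.5.
print(demo_clf.predict(applicants))
```

On the real model, the equivalent call would be `clf.predict_proba(x_val)[:, 1]`.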

In [26]:
# Now let's make predictions with the logistic regression model trained above.
# We will first use the validation set to evaluate the model.

y_pred = clf.predict(x_val)


In [36]:
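# Note: judging by the execution counter, this cell appears to have run after the
# test-set prediction, so the `y_pred` printed here holds the test-set predictions.
# That would explain why it does not line up with the validation labels printed next to it.
print("Prediction:\n", y_pred)
print("\nReal labels:\n", y_val)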

Prediction:
 [1 1 1 1 0 1 0 0 0 0 1 0 0 1 1 1 0 1 0 0]

Real labels:
 39    0
30    1
53    0
72    1
88    1
70    0
11    0
66    1
45    0
5     0
42    1
85    1
18    1
26    1
12    1
55    0
80    1
90    1
9     1
35    0
Name: Admitted, dtype: int64


In [27]:
#Let's get the accuracy of the result:
from sklearn.metrics import accuracy_score

print('Accuracy: {:.3f}'.format(accuracy_score(y_val, y_pred)))



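# A perfect score on only 20 validation examples is a noisy estimate. It is not by itself
# a sign of overfitting, because the model was not trained on the validation set.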

Accuracy: 1.000


In [28]:
#Let's try now the prediction with the test set:

y_pred = clf.predict(x_test)

In [29]:
from sklearn.metrics import accuracy_score

print('Accuracy: {:.3f}'.format(accuracy_score(y_test, y_pred)))

# The accuracy is lower than on the validation set. With only 20 examples per set, each error
# moves accuracy by 5 points, so part of this gap may be sampling noise rather than a flaw in the model.

Accuracy: 0.700


In [35]:
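# Let's try f1_score, which takes both precision and recall into account and complements accuracy.
# It indicates how well the algorithm finds the positive class (the closer to 1, the better).
# Each set gets its own predictions: reusing `y_pred` here would score the test-set predictions
# against the validation labels and give a misleading 0.545 for the validation set.

from sklearn.metrics import f1_score

y_val_pred = clf.predict(x_val)
y_test_pred = clf.predict(x_test)

print("F1 score with the validation_set:", f1_score(y_val, y_val_pred))
print("F1 score with the test_set:", f1_score(y_test, y_test_pred))


F1 score with the validation_set: 1.0
F1 score with the test_set: 0.75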
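
To make the F1 computation concrete, the sketch below works it out by hand from confusion counts, using the two arrays printed by the `In [36]` cell of this notebook. The variable names (`y_hat`, `y_true`, `tp`, `fp`, `fn`) are introduced here for illustration only.

```python
# Arrays copied from the "Prediction" and "Real labels" output of the In [36] cell.
y_hat = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0]
y_true = [0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0]

# Confusion counts for the positive (admitted) class.
tp = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_hat) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 0)

precision = tp / (tp + fp)  # share of predicted admissions that were correct
recall = tp / (tp + fn)     # share of real admissions that were found
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn)    # 6 4 6
print(round(f1, 4))  # 0.5455
```

Pairing these two particular arrays gives 6/11, about 0.545, which shows how sensitive the score is to which predictions are compared with which labels.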


## Conclusion

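Accuracy alone is not always the best metric for evaluating a classifier. A model can achieve high accuracy by simply predicting the majority class every time while failing to predict the minority class at all, which precision, recall and the F1 score would expose. That is not what happens on the validation set here, though: an accuracy of 1.000 means every validation prediction matches its label, so the validation F1 score must be 1.0 as well. A validation F1 of 0.545 only arises when `f1_score(y_val, y_pred)` is called while `y_pred` still holds the test-set predictions, which compares test predictions with validation labels and says nothing about the model.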
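
The majority-class effect can be seen on a small made-up example. The sketch below uses hypothetical labels (18 negatives, 2 positives) and a "classifier" that always predicts the majority class; the helper functions `accuracy` and `f1` are written out by hand so the block has no dependencies.

```python
def accuracy(y_true, y_hat):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)

def f1(y_true, y_hat):
    """F1 score for the positive class (label 1); 0.0 when no positive is found."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_hat))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_hat))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_hat))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical imbalanced labels: 18 rejected applicants, 2 admitted.
labels = [0] * 18 + [1] * 2
always_majority = [0] * 20  # predicts "rejected" for everyone

print(accuracy(labels, always_majority))  # 0.9
print(f1(labels, always_majority))        # 0.0
```

The accuracy looks good while the F1 score shows that not a single admitted applicant was identified.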

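The model cannot be overfitting to the validation set, because that set was used neither for training nor for choosing hyperparameters. With only 20 examples per set, each misclassified applicant shifts accuracy by 5 percentage points, so the gap between 1.000 on the validation set and 0.700 on the test set (six errors) may come largely from which examples happened to land in each split.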

On the test set, the accuracy is 0.700 and the F1 score is 0.75. These two numbers measure different things, so one being higher than the other is not a contradiction: accuracy counts every correct prediction, while F1 only looks at how well the positive (admitted) class is found. Neither the validation set nor the test set was seen during training, so both are estimates of performance on new data, and with 20 examples each, both estimates are noisy.

Overfitting occurs when the model learns the noise in the training data rather than the underlying pattern. This results in a model that performs well on the training data but poorly on new, unseen data. To address this issue, we could reduce the complexity of the model by strengthening regularization or adjusting other hyperparameters, or we could collect more training data. Data augmentation techniques are another way to increase the variability of the data.
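
As one possible way to adjust regularization, scikit-learn's `LogisticRegression` exposes the parameter `C` (the inverse of the regularization strength, so smaller values regularize more). The sketch below picks `C` on a validation set and reports the test score once at the end. It runs on synthetic data, and the candidate values of `C`, the noise level and the admission rule are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a noisy admission rule (an assumption for this sketch).
rng = np.random.default_rng(0)
X_demo = rng.uniform(30, 100, size=(300, 2))
noise = rng.normal(0, 10, size=300)
y_demo = (X_demo.sum(axis=1) + noise > 130).astype(int)

# Same 60/20/20 scheme as in the notebook.
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42)
X_va, X_te, y_va, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

# Fit one model per candidate C and score it on the validation set only.
results = {}
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
    results[C] = f1_score(y_va, model.predict(X_va))

# Keep the best C, then touch the test set a single time.
best_C = max(results, key=results.get)
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_tr, y_tr)
test_f1 = f1_score(y_te, final_model.predict(X_te))
print(best_C, test_f1)
```

Using the validation set for the choice and the test set only for the final number keeps the test score an honest estimate.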

It's important to use a combination of metrics when evaluating the performance of a classifier, as no single metric can give a complete picture of the model's performance.

A good model would have an F1 score close to 1 on both the validation and test sets.

