# 1. Introduction to the Data

In the previous mission, we learned about classification, logistic regression, and how to use scikit-learn to fit a logistic regression model to a dataset on graduate school admissions. We'll continue to work with the dataset, which contains data on 644 applications with the following columns:

* gre - applicant's score on the Graduate Record Exam, a generalized test for prospective graduate students.
  Score ranges from 200 to 800.
* gpa - college grade point average.
  Continuous between 0.0 and 4.0.
* admit - binary value, 0 or 1, where 1 means the applicant was admitted to the program and 0 means the applicant was rejected.

## TODO:
* Use the LogisticRegression method predict to return the label for each observation in the dataset, admissions. Assign the returned list to labels.
* Add a new column to the admissions DataFrame named predicted_label that contains the values from labels.
* Use the Series method value_counts and the print function to display the distribution of the values in the predicted_label column.
* Use the DataFrame method head and the print function to display the first 5 rows in admissions.

In [1]:
import pandas as pd
admissions=pd.read_csv('admissions.csv')

In [2]:
from sklearn.linear_model import LogisticRegression

In [3]:
lr=LogisticRegression(solver='lbfgs')
lr.fit(admissions[['gpa']],admissions['admit'])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

In [4]:
labels=lr.predict(admissions[['gpa']])
admissions['predicted_label']=labels

In [5]:
admissions['predicted_label'].value_counts()

0    507
1    137
Name: predicted_label, dtype: int64

In [6]:
admissions.head()

Unnamed: 0,admit,gpa,gre,predicted_label
0,0,3.177277,594.102992,0
1,0,3.412655,631.528607,0
2,0,2.728097,553.714399,0
3,0,3.093559,551.089985,0
4,0,3.141923,537.184894,0


# 2. Accuracy

The simplest way to determine the effectiveness of a classification model is **prediction accuracy**. Accuracy helps us answer the question:

**What fraction of the predictions were correct (actual label matched predicted label)**?

**`Prediction accuracy boils down to the number of labels that were correctly predicted divided by the total number of observations:`**

$Accuracy = \dfrac{\text{# of Correctly Predicted}}{\text{# of Observations}}$

To decide who gets admitted, we set a threshold and accept all of the students whose computed probability exceeds that threshold. **`This threshold is called the discrimination threshold`** and scikit-learn sets it to 0.5 by default when predicting labels. If the predicted probability is greater than 0.5, the label for that observation is 1. If it is instead less than 0.5, the label for that observation is 0.
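The threshold rule and the accuracy formula above can be sketched together in plain Python. This is only an illustration: the intercept, coefficient, GPAs, and actual labels below are hypothetical values, not the ones fitted on admissions.csv, and the helper names `predict_probability` and `predict_label` are made up for this sketch.

```python
import math

def predict_probability(gpa, intercept, coef):
    """Logistic function: estimated probability of admission for one GPA."""
    return 1.0 / (1.0 + math.exp(-(intercept + coef * gpa)))

def predict_label(probability, threshold=0.5):
    """Assign label 1 when the probability exceeds the threshold, else 0."""
    return 1 if probability > threshold else 0

# Hypothetical model parameters and toy data (assumptions for illustration).
intercept, coef = -7.0, 2.0
gpas = [2.5, 3.0, 3.4, 3.6, 3.9]
actual = [0, 0, 1, 1, 1]

probabilities = [predict_probability(g, intercept, coef) for g in gpas]
predicted = [predict_label(p) for p in probabilities]

# Accuracy = number of correctly predicted labels / number of observations.
correct = sum(1 for a, p in zip(actual, predicted) if a == p)
accuracy = correct / len(actual)

print(predicted)
print(accuracy)
```

In this toy data, the applicant with a GPA of 3.4 was admitted but falls just under the 0.5 threshold, so the sketch gets 4 of the 5 labels right.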

## TODO:
* Rename the admit column from the admissions DataFrame to actual_label so it's clearer which column contains the predicted labels (predicted_label) and which column contains the actual labels (actual_label).
* Compare the predicted_label column with the actual_label column.
  * Use a double equals sign (==) to compare the 2 Series objects and assign the resulting Series object to matches.
* Use conditional filtering to filter admissions to just the rows where matches is True. Assign the resulting DataFrame to correct_predictions.
  * Display the first 5 rows in correct_predictions to make sure the values in the predicted_label and actual_label columns are equal.
* Calculate the accuracy and assign the resulting float value to accuracy.
  * Display accuracy using the print function.

In [7]:
admissions.rename(columns={'admit':'actual_label'},inplace=True)


In [8]:
admissions.head()

Unnamed: 0,actual_label,gpa,gre,predicted_label
0,0,3.177277,594.102992,0
1,0,3.412655,631.528607,0
2,0,2.728097,553.714399,0
3,0,3.093559,551.089985,0
4,0,3.141923,537.184894,0


In [9]:
matches=admissions['actual_label']==admissions['predicted_label']
correct_predictions=admissions[matches]

In [10]:
correct_predictions.head()

Unnamed: 0,actual_label,gpa,gre,predicted_label
0,0,3.177277,594.102992,0
1,0,3.412655,631.528607,0
2,0,2.728097,553.714399,0
3,0,3.093559,551.089985,0
4,0,3.141923,537.184894,0


In [11]:
accuracy=len(correct_predictions)/len(admissions)
print(accuracy)

0.6847826086956522


# 3. Binary classification outcomes

Calculating the accuracy of a model on the dataset used for training is a useful initial step just to make sure the model at least beats randomly assigning a label for each observation. However, prediction accuracy doesn't tell us much more.

**`The accuracy doesn't tell us how the model performs on data it wasn't trained on`**. A model that achieves 100% accuracy on its training set tells us nothing about how well it works on data it has never seen before (and wasn't trained on). `Accuracy also doesn't help us discriminate between the different types of outcomes a binary classification model can make.`
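To see why accuracy alone can hide the type of mistakes a model makes, here is a small sketch with made-up, imbalanced labels (9 rejections and 1 admission). The "model" here is a hypothetical one that simply rejects everyone.

```python
# Toy, imbalanced labels (an assumption for illustration):
# 9 rejected applicants and 1 admitted applicant.
actual = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# A hypothetical "model" that predicts rejection for every applicant.
predicted = [0] * len(actual)

# Fraction of predictions that match the actual labels.
accuracy = sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)

# How many of the truly admitted applicants did the model identify?
admitted_total = sum(1 for a in actual if a == 1)
admitted_caught = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)

print(accuracy)
print(admitted_caught, "of", admitted_total, "admitted applicants identified")
```

The accuracy looks high even though the model never identifies an admitted applicant, which is exactly the kind of detail a single accuracy number can't show.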

In this mission, we'll focus on the principles of evaluating binary classification models by testing our model's effectiveness on the training data.

To start, let's discuss the 4 different outcomes of a binary classification model:

| Prediction (rows) / Observation (columns) | Admitted (1) | Rejected (0) |
|---|---|---|
| Admitted (1) | True Positive (TP) | False Positive (FP) |
| Rejected (0) | False Negative (FN) | True Negative (TN) |

By segmenting a model's predictions into these different outcome categories, we can start to think about other measures of effectiveness that give us more granularity than simple accuracy.
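The four outcomes can be tallied with a short helper. This is a minimal sketch in plain Python: `count_outcomes` is a hypothetical name, and the labels are toy values rather than rows from admissions.

```python
def count_outcomes(actual, predicted):
    """Tally the four binary classification outcomes from paired 0/1 labels."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            counts["TP"] += 1   # predicted admitted, actually admitted
        elif p == 1 and a == 0:
            counts["FP"] += 1   # predicted admitted, actually rejected
        elif p == 0 and a == 1:
            counts["FN"] += 1   # predicted rejected, actually admitted
        else:
            counts["TN"] += 1   # predicted rejected, actually rejected
    return counts

# Toy labels, made up for illustration.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 0]

outcomes = count_outcomes(actual, predicted)
print(outcomes)
```

Every observation lands in exactly one of the four categories, so the counts always add up to the number of observations.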

# 4. Binary classification outcomes

## TODO
* Extract all of the rows where predicted_label and actual_label both equal 1. Then, calculate the number of true positives and assign to true_positives.

* Extract all of the rows where predicted_label and actual_label both equal 0. Then, calculate the number of true negatives and assign to true_negatives.

* Display both true_positives and true_negatives.

In [12]:
true_positives_filter=(admissions['predicted_label']==1) & (admissions['actual_label']==1)
true_positivess=admissions[true_positives_filter]

In [13]:
true_positivess.head()

Unnamed: 0,actual_label,gpa,gre,predicted_label
401,1,3.652733,662.854261,1
409,1,3.968276,621.354786,1
411,1,3.582525,574.138603,1
412,1,3.880944,628.912737,1
415,1,3.512116,653.74426,1


In [14]:
true_positives=len(true_positivess)

In [15]:
true_negatives=len(admissions[(admissions['predicted_label']==0)&(admissions['actual_label']==0)])

In [16]:
print(true_positives)
print(true_negatives)

89
352


# 5. Sensitivity

**Sensitivity or True Positive Rate** - The proportion of applicants that were correctly admitted:

### $TPR=\dfrac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$

Of all of the students that should have been admitted (True Positives + False Negatives), what fraction did the model correctly admit (True Positives)? More generally, this measure helps us answer the question:

**How effective is this model at identifying positive outcomes?**

`If the True Positive Rate is low, it means that the model isn't effective at catching positive cases.` For certain problems, high sensitivity is incredibly important. `If we're building a model to predict which patients have cancer, every patient that is missed by the model could mean a loss of life.` We want a highly sensitive model that is able to "catch" all of the positive cases (in this case, the positive case is a patient with cancer).
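As a minimal sketch of the formula above, sensitivity can be computed from the two counts it depends on. The function name and the toy labels are assumptions for illustration. The guard against an empty denominator is a design choice of this sketch, since the rate is undefined when there are no actual positives.

```python
def sensitivity_from_counts(true_positives, false_negatives):
    """True Positive Rate: TP / (TP + FN). Returns None if undefined."""
    actual_positives = true_positives + false_negatives
    if actual_positives == 0:
        return None  # no actual positives, so the rate is undefined
    return true_positives / actual_positives

# Toy labels, made up for illustration (1 = positive, 0 = negative).
actual = [1, 1, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

toy_sensitivity = sensitivity_from_counts(tp, fn)
print(toy_sensitivity)
```

Here the toy model catches 2 of the 4 actual positives, so half of the positive cases are missed.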

## TODO:
* Calculate the number of false negatives (where the model predicted rejected but the student was actually admitted) and assign to false_negatives.
* Calculate the sensitivity and assign the computed value to sensitivity.
* Display sensitivity.

In [17]:
false_negatives=len(admissions[(admissions['predicted_label']==0)&(admissions['actual_label']==1)])

In [18]:
sensitivity=true_positives/(true_positives+false_negatives)
print(sensitivity)

0.36475409836065575


# 6. Specificity

The sensitivity of our model is around 36.5%, which means only about 1 in 3 students who should have been admitted were actually predicted to be admitted. In the context of predicting student admissions, this probably isn't too bad of a thing. `Graduate schools can only admit a select number of students into their programs and by definition they end up rejecting many qualified students that would have succeeded.`

In the healthcare context, however, `low sensitivity could mean a severe loss of life`. If a classification model is only catching 12.7% of positive cases for an illness, then around 7 of 8 people are going undiagnosed (being classified as false negatives). Hopefully you're beginning to acquire a sense for the tradeoffs predictive models make and the importance of understanding the various measures.

**Specificity or True Negative Rate** - The proportion of applicants that were correctly rejected:

### $TNR=\dfrac{\text{True Negatives}}{\text{False Positives} + \text{True Negatives}}$

This measure helps us answer the question:

**How effective is this model at identifying negative outcomes?**


A high specificity means that the model is really good at predicting which applicants should be rejected.
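Specificity mirrors sensitivity but looks only at the observations that are actually negative. Here is a small sketch under the same kind of assumptions as before: a hypothetical function name and made-up toy labels.

```python
def specificity_from_counts(true_negatives, false_positives):
    """True Negative Rate: TN / (FP + TN). Returns None if undefined."""
    actual_negatives = true_negatives + false_positives
    if actual_negatives == 0:
        return None  # no actual negatives, so the rate is undefined
    return true_negatives / actual_negatives

# Toy labels, made up for illustration (1 = admitted, 0 = rejected).
actual = [0, 0, 0, 0, 0, 1, 1]
predicted = [0, 0, 0, 1, 0, 1, 0]

tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)

toy_specificity = specificity_from_counts(tn, fp)
print(toy_specificity)
```

In this toy data, 4 of the 5 applicants who were actually rejected were also predicted as rejected.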

## TODO:
* Calculate the number of false positives (where the model predicted admitted but the student was actually rejected) and assign to false_positives.
* Calculate the specificity and assign the computed value to specificity.
* Display specificity.

In [19]:
# Parenthesize each comparison fully: & binds more tightly than ==
false_positives=len(admissions[(admissions['predicted_label']==1)&(admissions['actual_label']==0)])

In [20]:
specificity=true_negatives/(true_negatives+false_positives)
print(specificity)

0.88


In this mission, we learned about some of the different ways of evaluating how well a binary classification model performs. 
These measures are just a starting point, however, and aren't super useful by themselves. In the next mission, we'll dive into cross-validation, where we'll evaluate our model's accuracy on new data that it wasn't trained on. In addition, we'll explore how varying the discrimination threshold affects the measures we learned about in this mission. These important techniques help us gain a much more complete understanding of a classification model's performance.