# **Conclusions**

## Objectives

- Round up our data analysis and machine learning model training

## Inputs

- Classification reports and confusion matrices of all final 2-bin models

## Outputs

- Conclusions

## Data Analysis

In the data analysis notebooks, we validated the project hypotheses as they were investigated. For the sake of completeness, we summarise the outcome of each hypothesis here as well. Before discussing these, I will briefly discuss the distribution analysis.

### Distribution Analysis

After plotting a number of KDE histograms and conducting the Shapiro-Wilk test, we determined that the dataset's numerical variables are not normally distributed. We tried to correct this by transforming the data with the Box-Cox and Yeo-Johnson transformers and, when that failed, by removing outlier records. Neither approach worked, so we were forced to conclude that the dataset was not normally distributed, and that we could not correct this.
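The check described above can be sketched roughly as follows. This is a minimal illustration, not the code from the analysis notebooks: it uses `scipy.stats` functions as a stand-in for whichever transformer classes the notebooks used, the helper name `normality_report` is hypothetical, and the data is a synthetic left-skewed sample rather than the project dataset.

```python
import numpy as np
from scipy import stats


def normality_report(values, alpha=0.05):
    """Shapiro-Wilk results for the raw values and two power transforms."""
    values = np.asarray(values, dtype=float)
    candidates = {"raw": values}
    # Yeo-Johnson accepts any real-valued input.
    candidates["yeo_johnson"], _ = stats.yeojohnson(values)
    # Box-Cox is only defined for strictly positive values.
    if np.all(values > 0):
        candidates["box_cox"], _ = stats.boxcox(values)

    report = {}
    for name, data in candidates.items():
        statistic, p_value = stats.shapiro(data)
        report[name] = {
            "W": float(statistic),
            "p_value": float(p_value),
            # Working criterion: normality is not rejected at level alpha.
            "looks_normal": bool(p_value > alpha),
        }
    return report


# Synthetic, left-skewed stand-in for a score column (NOT the project data).
rng = np.random.default_rng(seed=0)
synthetic_scores = 100 - np.minimum(rng.exponential(scale=15, size=500), 99)

for name, result in normality_report(synthetic_scores).items():
    print(f"{name:12s} W={result['W']:.3f} p={result['p_value']:.4f} "
          f"looks_normal={result['looks_normal']}")
```

If every row reports `looks_normal=False`, neither transform has produced a distribution that passes the test, which is the situation described above.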

### Primary Hypotheses

#### Hypothesis 1

I hypothesized that a student's gender affects their test scores. This is true. Male students perform better in maths by 6.5 points, whereas female students perform better in reading by 5 points and in writing by 7.5 points. Averaged across the three subjects, this means that female students perform better by 2 points ((5 + 7.5 - 6.5) / 3 = 2).

#### Hypothesis 2

I hypothesized that a student's ethnicity affects their test scores. This is true. Ethnicity E performs the best, closely followed by ethnicity D. Ethnicities B and C perform the worst. 

#### Hypothesis 3

I hypothesized that students of better educated parents achieve higher test scores. This is true. Students of better educated parents tend to perform better. The effect is most notable between students whose parents have only some high school education and students whose parents have completed high school, where we see a large increase in performance.

#### Hypothesis 4

I hypothesized that a student's lunch program affects their exam performance. This is true, and considerably so. Students who participate in the standard lunch program achieve significantly better exam scores than students who participate in the free/reduced lunch program.

#### Hypothesis 5

I hypothesized that students who participate in the test preparation course achieve higher exam scores. This is true. Students who complete the test preparation course achieve higher test scores.

### Secondary Hypotheses

#### Hypothesis 6

I hypothesized that increased levels of parental education correlate with increased participation in the test preparation course. This is slightly true. Students of better educated parents participate in the test preparation course at slightly higher levels. 

#### Hypothesis 7

I hypothesized that increased levels of parental education correlate with increased participation in the standard lunch program. This is slightly true. Students of better educated parents participate in the standard lunch program at a slightly higher rate.

#### Hypothesis 8

I hypothesized that parental education is linked to ethnicity, or more simply that the parents of students of certain ethnicities are better educated. This is true, and markedly so. Students of ethnicity groups A and B have parents who are noticeably better educated, whereas students of ethnicity groups D and E have parents who are noticeably less well educated.

#### Hypothesis 9

I hypothesized that student ethnicity is linked to their lunch program, or more simply that certain ethnicities participate in the different lunch programs at different rates. This is slightly true. Students of ethnicity D participate in the standard lunch program at a higher rate, whereas students of ethnicity E participate in the free/reduced lunch program at a higher rate. The other ethnicity groups have no real difference in lunch program participation rates.

#### Hypothesis 10

I hypothesized that student ethnicity is linked to their participation in the test preparation course, or more simply that certain ethnicities participate in the test preparation course at higher rates than others. This is slightly true. Ethnicity group C completes the test preparation course at a slightly lower rate, whereas ethnicity group A completes the test preparation course at a slightly higher rate.

#### Hypothesis 11

I hypothesized that a student's gender affects their participation in the test preparation course, or more simply that one gender participates in the test preparation course at a higher rate than the other. This is false, and we have hypothesized incorrectly. There is no discernible link between a student's gender and their participation in the test preparation course.

## Machine Learning Pipelines

We have trained 3 machine learning pipelines for predicting a student's math score, reading score and writing score. For the sake of completeness, we will round up the performance of these pipelines in one place, so as to evaluate performance side-by-side.

### Terminology
- Recall is the percentage of the actual members of a particular class that were correctly predicted as belonging to that class
- Precision is the percentage of the predictions for a particular class that were actually correct
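
As a small worked illustration of these two definitions, the sketch below computes both metrics for a single class from raw counts. The function name `precision_recall` is hypothetical, and the counts are those of the lower-performing class in the math score train-set confusion matrix reported further down.

```python
def precision_recall(true_positive, false_positive, false_negative):
    """Return (precision, recall) for one class from its confusion-matrix counts."""
    predicted_as_class = true_positive + false_positive   # everything the model flagged
    actually_in_class = true_positive + false_negative    # everything it should have flagged
    precision = true_positive / predicted_as_class if predicted_as_class else 0.0
    recall = true_positive / actually_in_class if actually_in_class else 0.0
    return precision, recall


# Lower-performing class, math score train set:
# 263 correctly flagged, 110 flagged but actually above average, 125 missed.
precision, recall = precision_recall(
    true_positive=263, false_positive=110, false_negative=125
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The two metrics share a numerator and differ only in the denominator: precision divides by everything predicted as the class, recall by everything that truly belongs to it.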

### Math Score

The math score pipeline required a classification task, after a regression task failed to perform well. We identified the lunch_program, ethnicity and parental_education variables as being the most important, and when the other variables were eliminated, we saw no loss of performance.

Below are the classification report and confusion matrix for the final math_score classification model, copied wholesale from that notebook, and transposed into markdown tables:

**Train Set**

| Confusion Matrix                         |                      |                               |
|------------------------------------------|----------------------|-------------------------------|
|                                          | Actual below average | Actual better than average    |
| Prediction below average (<66.5)         |         263          |            110                |
| Prediction better than average (>66.4)   |        125           |          302                  |

<br>
<br>

| Classification Report |             |                 |              |          |
|-----------------------|-------------|-----------------|--------------|----------|
|                       | precision   | recall          | f1-score     |  support |
| below average         |     0.71    |  0.68           |    0.69      |    388   |
| better than average   |     0.71    |  0.73           | 0.72         |     412  |
| accuracy              |             |                 |  0.71        | 800      |
| macro avg             | 0.71        | 0.71            |  0.71        | 800      |
| weighted avg          | 0.71        | 0.71            | 0.71         | 800      |
 
<br>

The key figures here are the recall score on the first class (68%) and the precision on that same class (71%). Given that our stated task is to identify as many students who are likely to underperform as possible, my determination is that recall is the most important score. This determination was aided by [this Medium article](https://towardsdatascience.com/multi-class-metrics-made-simple-part-i-precision-and-recall-9250280bddc2). In particular, the paragraph that begins 'What is more important, precision or recall?' was useful. The author uses the example of a classifier built to detect patients with diabetes. In that case, the classifier needs to be able to correctly identify diabetics, and therefore a high recall score is needed. If we substitute underperforming students for diabetics, then the same logic applies. A recall of 68% means that 263 of the 388 students who actually underperformed were correctly identified.


**Test Set**                                       

| Confusion Matrix                  |                     |                               |
|-----------------------------------|---------------------|-------------------------------|
|                                   |Actual below average | Actual better than average    |
| Prediction below average          |     57              |      37                       |
| Prediction  better than average   |    55               |     51                        |

<br>
<br>

| Classification Report |             |                 |              |          |
|-----------------------|-------------|-----------------|--------------|----------|
|                       | precision   | recall          | f1-score     |  support |
| below average         |   0.61      |   0.51          |  0.55        |   112    |
| better than average   | 0.48        |     0.58        |   0.53       |  88      |
| accuracy              |             |                 |    0.54      |   200    |
| macro avg             |    0.54     |   0.54          |   0.54       |   200    |
| weighted avg          |    0.55     |  0.54           |  0.54        |    200   |

If we examine the Test Set results, we see a recall score of 51% on the lower-performing class. This matches what we would normally expect to see in a model, where performance on the train set is higher than on the test set. Recall scores of 68% and 51% are not great, but neither are they poor. From here on, I will say that performance is *decent*, as a shorthand for not great and not poor.
<br>
We also do not see excellent performance on the train set combined with poor performance on the test set, which would indicate overfitting. Therefore, we can say that the math score pipeline is likely slightly underfit. The Code Institute Predictive Analytics notebooks indicate that underfitting can be caused by:

- a too-small dataset
- poor algorithm selection
- insufficiently informative feature variables
- a too-small number of features
- ineffective hyperparameters

In the math score notebook, we took pains to select the correct algorithm and optimise the hyperparameters, and we also used a feature selection step in our exploratory pipeline to determine the relevant features, so these are unlikely to be contributory factors. Our dataset consists of only 1000 records, so we conclude that the decent performance of the model is most likely due to the small dataset, and possibly to feature variables that are not overly informative. That said, the dataset's variables may actually be informative: the ethnicity variable has 5 possible values and the parental education variable has 6 possible values, which, combined with the binary lunch_program variable, gives 60 possible combinations (6 x 5 x 2).
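
The combination count can be checked with a short sketch. The parental education labels below are placeholders, not the dataset's actual category names, and the per-combination figures are averages that assume records are spread evenly, which a real dataset will not be.

```python
from itertools import product

# Placeholder labels: only the number of levels (6, 5 and 2) comes from the text.
parental_education_levels = [f"education_level_{i}" for i in range(1, 7)]
ethnicity_groups = ["A", "B", "C", "D", "E"]
lunch_programs = ["standard", "free/reduced"]

combinations = list(
    product(parental_education_levels, ethnicity_groups, lunch_programs)
)

total_records = 1000   # size of the full dataset
train_records = 800    # size of the train set (support in the train reports)

records_per_combination = total_records / len(combinations)
train_records_per_combination = train_records / len(combinations)

print(f"combinations: {len(combinations)}")
print(f"average records per combination: {records_per_combination:.1f}")
print(f"average train records per combination: {train_records_per_combination:.1f}")
```

With roughly 13 train records per feature combination on average, each combination is thinly represented, which is consistent with the small dataset being a limiting factor.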

### Reading score

The reading score pipeline required a classification task, which we used from the outset, as we predicted that a regression model would perform poorly, given the problems we had with the math score regression model. We identified the lunch_program and test_preparation_course variables as being the most important. As above, the classification report and confusion matrix for the final model are below, transposed exactly and rendered into markdown tables:

**Train Set**

| Confusion Matrix                      |                      |                            |
|---------------------------------------|----------------------|----------------------------|
|                                       | Actual below average | Actual average or above    |
| Prediction below average              |         360          |            256             |
| Prediction average or above           |         50           |           134              |

<br>
<br>

| Classification Report |             |                 |              |          |
|-----------------------|-------------|-----------------|--------------|----------|
|                       | precision   | recall          | f1-score     |  support |
| below average         |     0.58    |  0.88           |    0.70      |    410   |
| average or better     |     0.73    |  0.34           | 0.47         |     390  |
| accuracy              |             |                 |  0.62        | 800      |
| macro avg             | 0.66        | 0.61            |  0.58        | 800      |
| weighted avg          | 0.65        | 0.62            | 0.59         | 800      |

As above, the most important metrics are the recall score on the lower-performing class (0.88) and the precision score on that same class (0.58). Per the reasoning above, the recall score is the more important metric. The recall score of 0.88 is much better than on the train set of the math score pipeline, and is in fact excellent. A score of 0.88 means that 88% of students who underperform were correctly identified.

The recall performance of 0.34 on the higher-performing class is poor. However, as noted in the reading score notebook, this is immaterial, since the business requirements call for predictive performance on the lowest scoring class to be as high as possible.


**Test Set**                                       

| Confusion Matrix                |                      |                          |
|---------------------------------|----------------------|--------------------------|
|                                 | Actual below average | Actual average or better |
| Prediction below average        |     91               |      68                  |
| Prediction average or better    |    16                |     25                   |

<br>
<br>

| Classification Report |             |                 |              |          |
|-----------------------|-------------|-----------------|--------------|----------|
|                       | precision   | recall          | f1-score     |  support |
| below average         |   0.57      |   0.85          |  0.68        |  107     |
| average or better     | 0.61        |     0.27        |   0.37       |  93      |
| accuracy              |             |                 |    0.58      |   200    |
| macro avg             |    0.59     |   0.56          |   0.53       |   200    |
| weighted avg          |    0.59     |  0.58           |  0.54        |    200   |

The recall score on the test set is 0.85, which is slightly less than that of the train set. This is perfectly in line with expected behaviour: the small gap between train and test recall gives no sign of overfitting, and the high recall on both sets gives no sign of underfitting for the lower-performing class.


### Writing Score

The writing score pipeline required a classification task, which we used from the outset, as we predicted that a regression model would perform poorly, given the problems we had with the math score regression model. As with the math score and reading score pipelines, we identified the most important variables, which here are lunch_program, test_preparation_course and parental_education. As above, the classification report and confusion matrix for the final model are below, transposed exactly and rendered into markdown tables:

**Train Set**

| Confusion Matrix                  |                        |                               |
|-----------------------------------|------------------------|-------------------------------|
|                                   |Actual average or below | Actual better than average    |
| Prediction average or below       |         403            |            333                |
| Prediction better than average    |         7              |           57                  |

<br>
<br>

| Classification Report |             |                 |              |          |
|-----------------------|-------------|-----------------|--------------|----------|
|                       | precision   | recall          | f1-score     |  support |
| average or below      |     0.55    |  0.98           |    0.70      |    410   |
| better than average   |     0.89    |  0.15           | 0.25         |     390  |
| accuracy              |             |                 |  0.57        | 800      |
| macro avg             | 0.72        | 0.56            |  0.48        | 800      |
| weighted avg          | 0.71        | 0.57            | 0.48         | 800      |


As above, the most important metric here is the recall score of 0.98 on the lower-performing class. This is excellent performance: nearly all students in that class (403 out of 410) were correctly identified. As with the reading score pipeline, recall performance on the higher-performing class is poor, which is not ideal, but is acceptable given the stated objective of the project.


**Test Set**                                       

| Confusion Matrix                  |                        |                               |
|-----------------------------------|------------------------|-------------------------------|
|                                   |Actual average or below | Actual better than average    |
| Prediction average or below       |     101                |      83                       |
| Prediction better than average    |    4                   |     12                        |

<br>
<br>

| Classification Report |             |                 |              |          |
|-----------------------|-------------|-----------------|--------------|----------|
|                       | precision   | recall          | f1-score     |  support |
| average or below      |   0.55      |   0.96          |  0.70        |  105     |
| better than average   | 0.75        |     0.13        |   0.22       |  95      |
| accuracy              |             |                 |    0.56      |   200    |
| macro avg             |    0.65     |   0.54          |   0.46       |   200    |
| weighted avg          |    0.64     |  0.56           |  0.47        |    200   |

The recall score of the test set is slightly lower at 0.96, but is still excellent.

### Final conclusions

Our models are quite different from one another. The math_score model has the lowest performance, but we predicted that this might be the case when we noted that, in the standard dataset, it has high skew and kurtosis coefficients. That said, the math_score pipeline's performance is not poor, and is acceptable. The math_score model was trained on the lunch_program, ethnicity and parental_education variables.

The reading_score pipeline has much better performance, with very high recall scores for the lower class that indicate excellent predictive performance for that class. The reading_score model was trained on the lunch_program and test_preparation_course variables.

The writing_score pipeline has even better performance, with exceptionally high recall scores for the lower class that indicate near-perfect recall for that class. The writing_score model was trained on the lunch_program, test_preparation_course and parental_education variables.
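
To view the three pipelines side by side, the lower-class recall and precision can be recomputed from the confusion matrices above. This is only a summary sketch: the names `LOWER_CLASS_COUNTS` and `summarise` are hypothetical, and the counts are copied by hand from the tables in this document.

```python
# Lower-class counts read from the confusion matrices above, per split:
# (correctly flagged, missed, wrongly flagged)
LOWER_CLASS_COUNTS = {
    "math_score": {"train": (263, 125, 110), "test": (57, 55, 37)},
    "reading_score": {"train": (360, 50, 256), "test": (91, 16, 68)},
    "writing_score": {"train": (403, 7, 333), "test": (101, 4, 83)},
}


def summarise(counts):
    """Compute lower-class recall and precision for every pipeline and split."""
    summary = {}
    for pipeline, splits in counts.items():
        summary[pipeline] = {}
        for split, (hit, missed, false_alarm) in splits.items():
            summary[pipeline][split] = {
                "recall": hit / (hit + missed),
                "precision": hit / (hit + false_alarm),
            }
    return summary


summary = summarise(LOWER_CLASS_COUNTS)

print(f"{'pipeline':15s}{'train recall':>14s}{'test recall':>13s}"
      f"{'gap':>7s}{'test precision':>16s}")
for pipeline, splits in summary.items():
    # Gap between train and test recall, as a rough generalisation check.
    gap = splits["train"]["recall"] - splits["test"]["recall"]
    print(f"{pipeline:15s}{splits['train']['recall']:>14.2f}"
          f"{splits['test']['recall']:>13.2f}{gap:>7.2f}"
          f"{splits['test']['precision']:>16.2f}")
```

Printing precision next to recall is a deliberate choice: it shows at a glance how many of the students flagged as likely to underperform actually did, alongside how many underperforming students were found.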