### Fitting a simple model

Notes:

As in tutorial 1, we start by importing some useful data science libraries. We will also use some functions from the `scikit-learn` library in this tutorial, but since we don't want to import the whole library, we will import those later in the code.

func-ai -> common imports

In [None]:
## Open func-ai, and click on common imports. Copy the code here and then execute the cell.

import pandas as pd
import numpy as np
import scipy as sp
import seaborn as sns
import matplotlib   ### note to func-ai --> this is probably unnecessary
from matplotlib import pyplot as plt


Note: Again, we will load the [adult](https://archive.ics.uci.edu/dataset/2/adult) dataset, an open source dataset based on a sample of US census data collected in 1994.

We use the pandas `read_csv` function to load this dataset, and name it `data`.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
data = pd.read_csv('/content/drive/MyDrive/SampleData/adult.csv')

Note --- First we will look at a small sample of the data.

You can try to do this yourself; it's the same as in tutorial 1.

We'll also check the column names again.

We see that the table has 15 columns, each column having a name such as 'age', 'workclass', etc. Further information about the data can be found on [this website](https://archive.ics.uci.edu/dataset/2/adult). It is important to notice that some columns contain numeric data, others contain text. Later we will see how we can use data science techniques with text data.



Let's look at the data type for each column.

func-ai --> Exploratory Data Analysis --> Data Type


In this tutorial we will try to use the dataset features to predict whether each person earns more or less than \$50k per year. This is recorded in the `class` column. We will call this our "target" column, and you may also see it referred to as "y". This name comes from the mathematical view of machine learning, in which we try to find a function $f$ that accurately maps our input data $X$ to an output column $y$.

You will notice that most of the columns in this dataset are not numerical. For our first model, we will only use the numeric columns, as these can be used directly with our modelling functions. Later we will see how to convert non-numeric columns into something which can be used by a model.
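
As a minimal sketch of how the numeric columns could be picked out, the example below uses a small made-up DataFrame in place of `data` (the name `toy` and its values are assumptions for illustration only). It relies on the pandas methods `dtypes` and `select_dtypes`.

```python
import pandas as pd

# Toy stand-in for the adult dataset; the values are made up for illustration.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
    "hours-per-week": [40, 13, 40],
    "class": ["<=50K", "<=50K", ">50K"],
})

# Show the data type of each column.
print(toy.dtypes)

# Keep only the names of the numeric columns.
numeric_columns = toy.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```

The same two calls should work on `data` once it has been loaded.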

First we can check whether the target depends on any of the features. We can do this using a chi-squared test, but there are also many other approaches - see [here](https://scikit-learn.org/stable/modules/feature_selection.html).

func-ai -> menu -> exploratory data analysis -> feature importance
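
The func-ai feature importance code is not reproduced here. As a rough sketch of the idea, assuming scikit-learn's `chi2` function from `sklearn.feature_selection` is an acceptable stand-in, the example below scores two synthetic features against a synthetic target. The column names `related` and `unrelated` are hypothetical, and note that `chi2` expects non-negative feature values.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2

# Synthetic data: one feature depends on the target, the other does not.
rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "related": target * 5 + rng.integers(0, 3, size=n),
    "unrelated": rng.integers(0, 8, size=n),
})

# chi2 returns one test statistic and one p-value per feature.
scores, p_values = chi2(toy, target)

result = pd.DataFrame(
    {"feature": toy.columns, "chi2": scores, "p_value": p_values}
).sort_values("chi2", ascending=False)
print(result)
```

A larger chi-squared score (and a smaller p-value) suggests a stronger dependence between that feature and the target.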

We can see that capital-gain and capital-loss are the most important features. We'll use a Random Forest classifier to try and predict the income class of each person using these two features.

In order to measure the predictive performance of a model, we set aside a random sample from the dataset before training the model. This is called a 'holdout' split, or a 'test' split. Doing this means that the model is tested on data it has not been trained on, which gives us a more realistic indication of its performance on unseen data. A typical size for this split is 20-30% of the data, depending on the size of your total dataset.
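
A holdout split can be sketched with scikit-learn's `train_test_split`. The toy arrays below are made up, and the `random_state` and `stratify` arguments are optional additions (not taken from the func-ai code) that make the split repeatable and keep the class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 rows with 2 features each, and a balanced binary target.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```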

We will measure the performance using a [classification report](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), which calculates the precision, recall and f1 score for each class, and a [confusion matrix](https://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix).
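
To make these two outputs concrete, here is a small sketch with hand-written labels (the label lists are made up for illustration). In the confusion matrix returned by scikit-learn, rows are the actual classes and columns are the predicted classes, in the order given by `labels`.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hand-written example labels, not taken from the adult dataset.
y_true = ["<=50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K"]
y_pred = ["<=50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K"]

labels = ["<=50K", ">50K"]

# Rows: actual class, columns: predicted class.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Precision, recall and f1 score for each class.
print(classification_report(y_true, y_pred, labels=labels))
```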

A Random Forest Classifier is an ensemble of Decision Tree Classifiers, and func-ai provides code to plot one of these decision trees. The plot gives insight into the decision rules used to split the dataset into groups, each of which is associated with one of the classes in the target.

func-ai -> modelling -> random forests
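
The func-ai plotting code is not shown here. As a text-based alternative sketch, the fitted trees of a scikit-learn random forest are available in its `estimators_` attribute, and `sklearn.tree.export_text` prints the decision rules of a single tree. The data and the feature names `feature_a` and `feature_b` below are synthetic placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Synthetic two-feature classification problem.
X, y = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ["feature_a", "feature_b"]

# A small, shallow forest so the rules stay readable.
forest = RandomForestClassifier(n_estimators=10, max_depth=2, random_state=0)
forest.fit(X, y)

# Each element of estimators_ is one fitted decision tree.
first_tree = forest.estimators_[0]
rules = export_text(first_tree, feature_names=feature_names)
print(rules)
```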

#### Exercises:
1. Train a random forest classifier on the features age, hours-per-week and education-num. How does the performance differ from that of the model we trained?

2. `scikit-learn` contains many different classification models, and the power of the library is that all of the models follow the same input-output structure (known as an API). This means that all we have to do to train a different model is use a different model name. You can try this yourself:

   - Make a copy of the modelling code for the Random Forest in a cell below. Do not include the code to plot a decision tree.
   - Replace the line `from sklearn.ensemble import RandomForestClassifier` with `from sklearn.neighbors import KNeighborsClassifier`
   - Replace the line `model = RandomForestClassifier(...)` with `model = KNeighborsClassifier()` and rerun the code.
   - Compare the KNeighbors model with the Random Forest. Then compare the performance if you use age, hours-per-week and education-num instead of capital-gain and capital-loss.
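
The shared API described in exercise 2 can be sketched as a loop over models: each one is trained with `fit` and used with `predict`, so only the model object changes. The data here are synthetic (generated with `make_classification`), so the scores say nothing about the adult dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-feature classification problem.
X, y = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "RandomForest": RandomForestClassifier(random_state=0),
    "KNeighbors": KNeighborsClassifier(),
}

# The same fit/predict calls work for every model.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracies[name] = accuracy_score(y_test, y_pred)
    print(name, round(accuracies[name], 3))
```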

In [None]:
### Answer for 1

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.ensemble import RandomForestClassifier


column_names = ['age','hours-per-week','education-num']
X = data.dropna()[column_names]
y = data.dropna()['class']   ### `class` is a reserved keyword in Python, so `data.class` is a syntax error; use `data['class']`

# FUNC TIP:  if you don't want to optimise for a single class, replace the line above with the following one
# y = array_name['target_feature']

### i think it should be made clear that in the multi-class case,
### 'optimise for one class' actually means 'predict the presence of a single class'


### Create the train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3)

### Fit the model to the data
model = RandomForestClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

## calculate the performance scores on the test set
print('Accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
cf_matrix = confusion_matrix(y_test, y_pred)
print(cf_matrix)

fig, ax = plt.subplots(figsize=(5,5))  ## i think 5,5, is big enough
sns.heatmap(cf_matrix, annot=True, cmap="Reds", fmt='.0f', ax=ax)  ## annot=True writes the counts in each cell

ax.set_title('Confusion Matrix')
ax.set_xlabel('\n Predicted Outcome')
ax.set_ylabel('Actual Outcome')
plt.show()

In [None]:
### Answer for 2

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.neighbors import KNeighborsClassifier


column_names = ['age','hours-per-week','education-num']
X = data.dropna()[column_names]
y = data.dropna()['class']


### Create the train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3)

### Fit the model to the data
model = KNeighborsClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

## calculate the performance scores on the test set
print('Accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
cf_matrix = confusion_matrix(y_test, y_pred)
print(cf_matrix)

fig, ax = plt.subplots(figsize=(5,5))  ## i think 5,5, is big enough
sns.heatmap(cf_matrix, annot=True, cmap="Reds", fmt='.0f', ax=ax)  ## annot=True writes the counts in each cell

ax.set_title('Confusion Matrix')
ax.set_xlabel('\n Predicted Outcome')
ax.set_ylabel('Actual Outcome')
plt.show()

## Fit a model to categorical columns

Scikit-learn models require numerical inputs, but our dataset has a number of categorical columns. One way to handle these is to use an 'encoder', which creates a mapping from categories to numbers. In func-ai, we have included a pre-model encoder which can handle different types of categorical data and convert them to an appropriate numerical format.

func-ai -> data-wrangling -> PIPELINES Auto Encoder

Note: The preprocessed data has many more columns than our original dataset, because it uses [one-hot encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) to convert categorical columns. In order to use the modelling pipeline, you will need to run the preprocessor on the training data - the variable called `X_train` in the modelling cells above. The preprocessor has two methods: `fit_transform`, which adapts the preprocessor to a dataset and then applies the preprocessing functions, and `transform`, which applies an already adapted preprocessor to a new dataset. This matters because we should only `fit` on our training data and then use `transform` on the test data. Fitting on the test data would leak information about it into the pipeline and give an over-optimistic measure of model performance.

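To do this, create the modelling pipeline using func-ai -> modelling -> forests, and add the following lines (the first one is already provided by func-ai):

```
# Assumption: the func-ai preprocessor object is named `preprocessor`; adjust the name to match your notebook.
X_train = preprocessor.fit_transform(X_train)   # fit on the training data only, then transform it
X_test = preprocessor.transform(X_test)         # reuse the fitted preprocessor on the test data
```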


Then we use this processed data in our modelling pipeline as before, and you can compare the results with the other models.

Exercise: Now that we have more features, we could explore tuning the random forest further to make the most of them. Try increasing the `max_depth` parameter of the Random Forest to 10.
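
As a hedged sketch of what changing this parameter looks like, the example below sets `max_depth=10` on a scikit-learn `RandomForestClassifier` trained on synthetic data (not the preprocessed adult data) and checks how deep the fitted trees actually are.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with 20 features, standing in for the preprocessed dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# max_depth limits how many splits each tree can make from root to leaf.
model = RandomForestClassifier(max_depth=10, random_state=0)
model.fit(X_train, y_train)

# No tree in the forest should be deeper than the limit.
deepest = max(tree.get_depth() for tree in model.estimators_)
print("deepest tree:", deepest)
print("test accuracy:", model.score(X_test, y_test))
```

Deeper trees can capture more detailed rules but are also more likely to overfit, so it is worth comparing the test scores for a few different values.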