# First Model with scikit-learn

Building on the previous analysis of the adult_census data, we now develop a simple k-nearest neighbors (KNN) predictive model using the scikit-learn API: https://scikit-learn.org/stable/index.html

Here, we base our predictions only on numerical data, which can be fed directly into a machine learning model.


In [2]:
import pandas as pd 

adult_census = pd.read_csv("adult_census.csv")

In [3]:
adult_census.shape

(48842, 15)

In [4]:
adult_census.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


### Identify and remove the categorical variables, keeping only the numerical fields that will later be fed into the machine learning algorithm

In [5]:
categorical_columns = [
    'workclass', 'education', 'marital-status', 'occupation',
    'relationship', 'race', 'sex', 'native-country']

adult_census_numerical = adult_census.drop(columns=categorical_columns)

In [6]:
adult_census_numerical = adult_census_numerical.drop(columns='fnlwgt')

In [7]:
adult_census_numerical.columns

Index(['age', 'education-num', 'capital-gain', 'capital-loss',
       'hours-per-week', 'class'],
      dtype='object')
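As an alternative to listing the categorical columns by hand, pandas can select numeric columns by dtype. The sketch below is only an illustration on a tiny made-up frame (the name `toy` and its values are assumptions, not the census file). Note that this approach also keeps `fnlwgt`, which still has to be dropped explicitly, and it discards the string-valued target, which has to be added back by name.

```python
import pandas as pd

# Tiny made-up frame standing in for adult_census (values are illustrative only).
toy = pd.DataFrame({
    "age": [25, 38, 28],
    "workclass": ["Private", "Private", "Local-gov"],
    "fnlwgt": [226802, 89814, 336951],
    "hours-per-week": [40, 50, 40],
    "class": ["<=50K", "<=50K", ">50K"],
})

# Keep numeric columns only, then drop fnlwgt explicitly, as the notebook does.
numeric_only = toy.select_dtypes(include="number").drop(columns="fnlwgt")

# The target is a string column, so it is re-attached by name.
toy_numerical = numeric_only.join(toy["class"])
print(list(toy_numerical.columns))
```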

In [8]:
adult_census_numerical.sample(10)

Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week,class
20143,41,11,0,1848,48,>50K
42299,23,10,0,0,23,<=50K
9087,48,10,0,0,45,<=50K
40503,58,9,0,0,40,<=50K
46232,17,6,0,0,15,<=50K
30964,50,13,0,0,40,<=50K
15099,51,13,0,0,45,>50K
10707,43,11,3103,0,40,>50K
23094,22,9,0,0,40,<=50K
4394,37,13,0,0,40,<=50K


## Separate the data and the target
The next step is to separate the data (the input features) from the target (the class to predict).

In [9]:
target_name = 'class'
target = adult_census_numerical[target_name]
target.head()

0     <=50K
1     <=50K
2      >50K
3      >50K
4     <=50K
Name: class, dtype: object

In [10]:
data = adult_census_numerical.drop(columns=[target_name])
data.head()

Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week
0,25,7,0,0,40
1,38,9,0,0,50
2,28,12,0,0,40
3,44,10,7688,0,40
4,18,10,0,0,30


In [13]:
print(f"The dataset contains {data.shape[0]} samples and "
      f"{data.shape[1]} features")

The dataset contains 48842 samples and 5 features


In [14]:
data.columns

Index(['age', 'education-num', 'capital-gain', 'capital-loss',
       'hours-per-week'],
      dtype='object')

## Let's fit a model and make predictions
Here, we use the k-nearest neighbors (KNN) algorithm from the scikit-learn API. To predict the target of a new sample, KNN finds the k closest samples in the training set and predicts the majority target among them. Note: for a two-class problem such as this one, choosing an odd k avoids a tie in the vote. `KNeighborsClassifier` uses k = 5 (`n_neighbors=5`) by default.
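The voting rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not scikit-learn's implementation: the function `knn_predict`, the Euclidean distance choice, and the sample points are all assumptions made up for the example.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up (age, hours-per-week) points; not taken from the census data.
points = [(25, 40), (30, 45), (28, 38), (50, 60), (55, 55)]
labels = ["<=50K", "<=50K", "<=50K", ">50K", ">50K"]

print(knn_predict(points, labels, query=(52, 58), k=3))  # two of the three nearest are ">50K"
print(knn_predict(points, labels, query=(27, 40), k=3))  # all three nearest are "<=50K"
```

With an odd k and two classes, one label always holds a strict majority, which is why the vote cannot end in a draw.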

In [17]:
# Configure scikit-learn to render estimators as diagrams in the notebook.
from sklearn import set_config
set_config(display='diagram')


In [19]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()
# The fit method trains the model from the input features (data) and the target.
model.fit(data, target)

It is important to remember that the `fit` method combines two components: (i) a learning algorithm and (ii) some model state. The learning algorithm takes the training data and training target and sets the model state. This state is used later to predict (for classifiers and regressors) or to transform (for transformers).
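To make the idea of model state concrete, the sketch below fits a classifier on a few made-up rows (the names `X_toy`, `y_toy`, and `toy_model` are assumptions for this example) and inspects some of the attributes that `fit` sets. In scikit-learn, attributes whose names end with an underscore are set during fitting.

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up training data: two features, two classes.
X_toy = [[25, 40], [30, 45], [28, 38], [50, 60], [55, 55], [48, 50]]
y_toy = ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"]

toy_model = KNeighborsClassifier(n_neighbors=3)
toy_model.fit(X_toy, y_toy)

# Attributes ending in "_" are set by fit and make up the model state.
print(toy_model.classes_)         # class labels seen during fit
print(toy_model.n_features_in_)   # number of input features
print(toy_model.n_samples_fit_)   # number of stored training samples
```

For KNN, the state is essentially the training set itself, which is why `n_samples_fit_` matches the number of rows passed to `fit`.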

In [21]:
# Using the same dataset, we can use our model to predict.
target_predicted = model.predict(data)

The model's prediction function combines the input data with the model state to make predictions. This function is specific to each model type, since it depends on the learning algorithm and the model state.
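For a KNN classifier, the link between model state and prediction can be inspected directly. The sketch below, on made-up data (`X_toy`, `y_toy`, and `query` are assumptions for this example), uses the `kneighbors` method to look up which stored training samples are closest to a query, then compares their labels with the output of `predict`.

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up training data: two features, two classes.
X_toy = [[25, 40], [30, 45], [28, 38], [50, 60], [55, 55], [48, 50]]
y_toy = ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"]
toy_model = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

query = [[52, 58]]
# Indices of the 3 stored training samples closest to the query.
_, neighbor_idx = toy_model.kneighbors(query)
neighbor_labels = [y_toy[i] for i in neighbor_idx[0]]

print(neighbor_labels)            # labels that take part in the vote
print(toy_model.predict(query))   # majority label among them
```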

In [25]:
# We can now take a look at our predicted targets:
target_predicted[:5]

array([' <=50K', ' <=50K', ' <=50K', ' >50K', ' <=50K'], dtype=object)

In [27]:
# We can also compare the predictions to the actual data. 
target[:5]

0     <=50K
1     <=50K
2      >50K
3      >50K
4     <=50K
Name: class, dtype: object

In [28]:
target[:5] == target_predicted[:5]

0     True
1     True
2    False
3     True
4     True
Name: class, dtype: bool

In [34]:
data.shape

(48842, 5)

In [35]:
print(f"Number of correct predictions: "
      f"{(target == target_predicted).sum()} / {len(target)}")

Number of correct predictions: 41719 / 48842


In [36]:
# To summarize this as a single number, we compute the average success rate (accuracy)
(target == target_predicted).mean()

0.8541624012120715

This evaluation cannot be trusted, however. It looks too good to be true, because we tested the model on the same data it was trained on.
To get a better assessment of the model's performance, we should evaluate the model on data it has not seen during training. This is where the train-test strategy comes in.

## Split Data into Train and Test Sets
When building a machine learning model, it is important to remember that generalization is more than memorization: we want a model that performs well on new data, not one that has merely memorized its training data. Hence, we would like to test our model on data it has not seen before, because predicting never-before-seen instances is harder than predicting instances already seen.

This approach allows for a fair evaluation of our model: we leave out a subset of the data when training the model and use it afterwards to evaluate the model's performance.
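One common way to leave out such a subset is scikit-learn's `train_test_split`. The sketch below is an illustration under stated assumptions, not the notebook's own procedure: it uses synthetic data (`X`, `y`) in place of the census features and target, and the 25% test size and `random_state` values are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for `data` and `target`; not the census data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.where(X[:, 0] + X[:, 1] > 0, ">50K", "<=50K")

# Hold out 25% of the rows; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

holdout_model = KNeighborsClassifier()
holdout_model.fit(X_train, y_train)   # the model only ever sees the training rows

train_accuracy = holdout_model.score(X_train, y_train)
test_accuracy = holdout_model.score(X_test, y_test)
print(f"train accuracy: {train_accuracy:.3f}, test accuracy: {test_accuracy:.3f}")
```

The key point is that `fit` is called only on the training rows, so the test accuracy measures how the model behaves on samples it has never stored.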

In [37]:
adult_census_numerical.shape

(48842, 6)

In [39]:
# Note: these are the same rows the model was fitted on, not a held-out subset.
target_test = adult_census_numerical[target_name]
data_test = adult_census_numerical.drop(columns=[target_name])

In [40]:
print(f"The testing dataset contains {data_test.shape[0]} samples and "
      f"{data_test.shape[1]} features")

The testing dataset contains 48842 samples and 5 features


At this point, we can compute the statistical performance with the `score` method instead of manually computing the average success rate. For classifiers, `score` returns the mean accuracy on the given data and labels.
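The equivalence between the manual average and `score` can be checked on a small example. The sketch below uses made-up data (`X_toy`, `y_toy` are assumptions for this example) and also compares against `accuracy_score` from `sklearn.metrics`, which computes the same quantity from labels and predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Made-up data: two features, two classes.
X_toy = np.array([[25, 40], [30, 45], [28, 38], [50, 60], [55, 55], [48, 50]])
y_toy = np.array(["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"])
toy_model = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

predictions = toy_model.predict(X_toy)

manual = (predictions == y_toy).mean()           # fraction of correct predictions
via_metric = accuracy_score(y_toy, predictions)  # same quantity from sklearn.metrics
via_score = toy_model.score(X_toy, y_toy)        # predicts internally, then computes accuracy

print(manual, via_metric, via_score)
```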

In [45]:
accuracy = model.score(data_test, target_test)
model_name = model.__class__.__name__

print(f"The test accuracy using a {model_name} is "
      f"{accuracy:.3f}")

The test accuracy using a KNeighborsClassifier is 0.854


### Conclusion
Comparing the accuracy obtained on the training set (0.854) with the score above (0.854) shows no difference, because `data_test` here contains the same rows the model was trained on. With a genuinely held-out test set, the score is typically lower, which would reveal that the training-set evaluation was optimistic. This is why the statistical performance of a model should always be tested on data different from the data used to train it.

In this notebook, we used the scikit-learn API to fit a KNN model with the `fit` method and evaluated the statistical performance of the model with the `score` method.

In [46]:
# Save the adult census numerical data to a CSV file for use in other analyses.
adult_census_numerical.to_csv("adult_census_numerical.csv")