<a href="https://colab.research.google.com/github/GabeMaldonado/sklearn_FUN/blob/main/sklearn_First_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Building a model with sklearn

Build a predictive model on a tabular dataset using only numerical features.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# load data

df = pd.read_csv("/content/adult_census.csv")
df.sample(5)


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
42575,27,Private,134566,HS-grad,9,Never-married,Adm-clerical,Not-in-family,White,Male,0,0,40,United-States,<=50K
3372,21,Private,163595,Some-college,10,Never-married,Craft-repair,Not-in-family,White,Male,0,0,45,United-States,<=50K
13184,44,Private,219155,Assoc-acdm,12,Married-civ-spouse,Sales,Husband,White,Male,0,0,40,United-States,>50K
33449,63,?,64448,HS-grad,9,Married-civ-spouse,?,Husband,White,Male,0,0,40,United-States,<=50K
37194,45,Private,259323,Some-college,10,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States,<=50K


Isolate the target variable

In [4]:
target_name = "class"
target = df[target_name]
target

0         <=50K
1         <=50K
2          >50K
3          >50K
4         <=50K
          ...  
48837     <=50K
48838      >50K
48839     <=50K
48840     <=50K
48841      >50K
Name: class, Length: 48842, dtype: object

Separate the features from the target and store them in a different dataframe.

In [6]:
data = df.drop(columns=[target_name])
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       48842 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      48842 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  48842 non-null  object
dtypes: int64(6), object(8)
memory usage: 5.2+ MB


Now, keep only the numeric features that are relevant for the analysis (`age`, `capital-gain`, `capital-loss`, `hours-per-week`).

In [10]:
data.drop(columns=['workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
                    'occupation', 'relationship', 'race', 'sex', 'native-country'], inplace=True)

In [11]:
data.head()

Unnamed: 0,age,capital-gain,capital-loss,hours-per-week
0,25,0,0,40
1,38,0,0,50
2,28,0,0,40
3,44,7688,0,40
4,18,0,0,30


In [12]:
data.columns

Index(['age', 'capital-gain', 'capital-loss', 'hours-per-week'], dtype='object')
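
Instead of listing every column to drop, the columns to keep can be selected directly. The sketch below uses a small made-up dataframe (`toy` and its values are assumptions, not the census file) and shows two options: naming the columns explicitly, or selecting by dtype with `select_dtypes`.

```python
import pandas as pd

# Toy rows mimicking a few census columns (values are illustrative only).
toy = pd.DataFrame({
    "age": [25, 38],
    "workclass": ["Private", "Private"],
    "capital-gain": [0, 0],
    "capital-loss": [0, 0],
    "hours-per-week": [40, 50],
})

# Option 1: name the columns to keep.
numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]
selected = toy[numerical_columns]

# Option 2: let pandas pick every numeric column by dtype.
by_dtype = toy.select_dtypes(include="number")

print(list(selected.columns))
print(list(by_dtype.columns))
```

On the full dataset, selecting by dtype would also keep `fnlwgt` and `education-num`, which this notebook drops, so naming the columns explicitly is the closer match to the cell above.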

In [13]:
# print the number of features and samples

print(f"The dataset contains {data.shape[1]} features and {data.shape[0]} samples")

The dataset contains 4 features and 48842 samples


## Fit a model and make predictions

We build a classification model using the *k-nearest neighbors* algorithm. To predict the target of a new sample, a k-nearest neighbors model takes into account its *k* closest samples in the training set and predicts the majority target of these samples.
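
The neighbor vote described above can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation: `knn_predict` is a hypothetical helper, the distance is Euclidean, and the five training points (age, hours-per-week) are made up.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the majority label among the k training points closest to query."""
    # Pair each training label with its Euclidean distance to the query.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Keep the labels of the k nearest points and take a majority vote.
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up (age, hours-per-week) samples and their labels.
train_points = [(25, 40), (30, 38), (45, 50), (50, 60), (23, 35)]
train_labels = [" <=50K", " <=50K", " >50K", " >50K", " <=50K"]

prediction = knn_predict(train_points, train_labels, query=(48, 55), k=3)
print(prediction)
```

Two of the three nearest neighbors of `(48, 55)` carry the label `' >50K'`, so that label wins the vote.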

<div class="admonition caution alert alert-warning">
<p class="first admonition-title" style="font-weight: bold;">Caution!</p>
<p class="last">We use a k-nearest neighbors model here. However, be aware that it is seldom useful
in practice. We use it because it is an intuitive algorithm. Later in this
notebook, we will introduce a better model.</p>
</div>

The `fit` method is called to train the model from the input (features) and target data.

In [14]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()
model.fit(data, target)

KNeighborsClassifier()

The method `fit` is composed of two elements: (i) a **learning algorithm**
and (ii) some **model states**. The learning algorithm takes the training
data and training target as input and sets the model states. These model
states will be used later to either predict (for classifiers and regressors)
or transform data (for transformers).

Both the learning algorithm and the type of model states are specific to each
type of model.
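
To make the split between learning algorithm, model state, and prediction concrete, here is a toy estimator written from scratch. `MajorityClassifier` is a hypothetical class for illustration only; it mimics the `fit`/`predict` convention but is not part of scikit-learn.

```python
from collections import Counter

class MajorityClassifier:
    """Toy estimator that always predicts the most frequent training label."""

    def fit(self, X, y):
        # Learning algorithm: count the labels and store the most frequent one.
        # The stored attribute is the model state.
        self.majority_class_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        # Prediction function: uses the input (here only its length)
        # together with the stored model state.
        return [self.majority_class_ for _ in X]

# Made-up training data: one feature (age) per sample.
toy_features = [[25], [38], [28], [44]]
toy_target = [" <=50K", " <=50K", " >50K", " <=50K"]

toy_model = MajorityClassifier().fit(toy_features, toy_target)
toy_predictions = toy_model.predict([[30], [60]])
print(toy_model.majority_class_, toy_predictions)
```

The trailing underscore on `majority_class_` follows the scikit-learn naming habit for attributes that are set during `fit`.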

In [15]:
# Use the model to make some predictions using the same dataset

target_predicted = model.predict(data)

To predict, a model uses a **prediction function** that will use the input
data together with the model states. As for the learning algorithm and the
model states, the prediction function is specific for each type of model.

Take a look at the first five predictions:

In [18]:
target_predicted[:5]

array([' <=50K', ' <=50K', ' <=50K', ' >50K', ' <=50K'], dtype=object)

Compare these predictions to the actual data

In [19]:
target[:5]

0     <=50K
1     <=50K
2      >50K
3      >50K
4     <=50K
Name: class, dtype: object

Check element by element which predictions match the actual targets:


In [20]:
target[:5] == target_predicted[:5]

0     True
1     True
2    False
3     True
4     True
Name: class, dtype: bool

In [21]:
print(f"Number of correct predictions: "
      f"{(target[:5] == target_predicted[:5]).sum()} / 5")

Number of correct predictions: 4 / 5


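Calculate the average success rate over the whole dataset. Because the model is evaluated on the same data it was trained on, this estimate is optimistic: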

In [22]:
(target == target_predicted).mean()

0.8175340895131239
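
Taking the mean of a boolean comparison amounts to dividing the number of matches by the number of samples. A plain-Python sketch, using the first five predicted and actual values shown earlier (the variable names are illustrative):

```python
# First five predictions and true targets, copied from the outputs above.
predicted = [" <=50K", " <=50K", " <=50K", " >50K", " <=50K"]
actual = [" <=50K", " <=50K", " >50K", " >50K", " <=50K"]

# True where the prediction equals the true target, False otherwise.
matches = [p == a for p, a in zip(predicted, actual)]

# Booleans count as 1 and 0, so the sum is the number of correct predictions.
n_correct = sum(matches)
accuracy = n_correct / len(matches)
print(f"{n_correct} / {len(matches)} correct, accuracy = {accuracy}")
```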

## Train-test split the dataset

Scikit-learn provides the helper function
`sklearn.model_selection.train_test_split` which is used to automatically
split the dataset into two subsets.

In [26]:
from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42, test_size=0.25)

<div class="admonition tip alert alert-warning">
<p class="first admonition-title" style="font-weight: bold;">Tip</p>
<p class="last">In scikit-learn, setting the <tt class="docutils literal">random_state</tt> parameter gives
deterministic results when we use a random number generator. In the
<tt class="docutils literal">train_test_split</tt> case, the randomness comes from shuffling the data, which
decides how the dataset is split into a train and a test set.</p>
</div>

When calling the function `train_test_split`, we specified that we would like
to have 25% of samples in the testing set while the remaining samples (75%)
will be available in the training set. We can check quickly if we got
what we expected.

In [27]:
print(f"Number of samples in testing: {data_test.shape[0]} => "
      f"{data_test.shape[0] / data.shape[0] * 100:.1f}% of the"
      f" original set")

Number of samples in testing: 12211 => 25.0% of the original set


In [29]:
print(f"Number of samples in training: {data_train.shape[0]} => "
      f"{data_train.shape[0] / data.shape[0] * 100:.1f}% of the"
      f" original set")

Number of samples in training: 36631 => 75.0% of the original set
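
The split sizes printed above and the effect of a fixed seed can be reproduced with a simplified shuffle-and-split in plain Python. This is only a conceptual sketch: `shuffle_split` is a hypothetical helper, it works on row indices rather than dataframes, and it will not produce the same rows as scikit-learn.

```python
import math
import random

def shuffle_split(n_samples, test_size=0.25, random_state=42):
    """Shuffle the row indices with a fixed seed, then cut them in two."""
    indices = list(range(n_samples))
    random.Random(random_state).shuffle(indices)
    # Assumption: the test size is rounded up, which matches the counts above.
    n_test = math.ceil(test_size * n_samples)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffle_split(48842)
print(len(train_idx), len(test_idx))

# The same seed gives the same split; a different seed gives another one.
same_train, same_test = shuffle_split(48842, random_state=42)
other_train, other_test = shuffle_split(48842, random_state=0)
print(same_test == test_idx, other_test == test_idx)
```

The two index lists never overlap, which is what keeps the test samples unseen during training.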


Earlier in this notebook, we used a k-nearest neighbors model. While this
model is intuitive to understand, it is not widely used in practice. Now, we
will use a more useful model, called logistic regression, which belongs to
the linear models family.

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p>In short, linear models find a set of weights to combine features linearly
and predict the target. For instance, the model can come up with a rule such
as:</p>
<ul class="simple">
<li>if <tt class="docutils literal">0.1 * age + 3.3 * <span class="pre">hours-per-week</span> - 15.1 &gt; 0</tt>, predict <tt class="docutils literal"><span class="pre">high-income</span></tt></li>
<li>otherwise predict <tt class="docutils literal"><span class="pre">low-income</span></tt></li>
</ul>
<p class="last">Linear models, and in particular logistic regression, will be covered in
more detail in the "Linear models" module later in this course. For now the
focus is to use this logistic regression model in scikit-learn rather than
understand how it works in detail.</p>
</div>
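
The rule in the note above can be written out directly. The weights and intercept below are the illustrative numbers from the note, not values fitted on the census data, and the function names are hypothetical. Logistic regression additionally passes the linear score through a sigmoid to turn it into a probability for the positive class.

```python
import math

def linear_score(age, hours_per_week):
    # Illustrative weights and intercept taken from the note, not fitted values.
    return 0.1 * age + 3.3 * hours_per_week - 15.1

def predict_income(age, hours_per_week):
    # A positive score falls on the "high-income" side of the decision boundary.
    return "high-income" if linear_score(age, hours_per_week) > 0 else "low-income"

def sigmoid(z):
    # Maps any real-valued score to a value strictly between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

print(predict_income(40, 40))
print(predict_income(30, 2))
print(sigmoid(linear_score(30, 2)))
```

A score of exactly zero corresponds to a probability of 0.5, which is why the rule thresholds the score at zero.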

To create a logistic regression model in scikit-learn you can do:

In [30]:
# to display nice model diagram
from sklearn import set_config
set_config(display='diagram')

In [31]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

Now that the model has been created, you can use it exactly the same way as
we used the k-nearest neighbors model earlier in this notebook. In
particular, we can use the `fit` method to train the model using the training
data and labels:

In [32]:
model.fit(data_train, target_train)

We can also use the `score` method to check the model generalization performance
on the test set.

In [33]:
accuracy = model.score(data_test, target_test)
print(f"Accuracy of logistic regression: {accuracy:.3f}")

Accuracy of logistic regression: 0.807


In scikit-learn, the `score` method of a classification model returns the accuracy,
i.e. the fraction of correctly classified samples. In this case, the logistic
regression predicts the right income of a person about 8 times out of 10.
Now the real question is: is this generalization performance indicative
of a good predictive model? Find out by solving the next exercise!

In this notebook, we learned to:

* identify numerical data in a heterogeneous dataset;
* select the subset of columns corresponding to numerical data;
* use the scikit-learn `train_test_split` function to separate data into
  a train and a test set;
* train and evaluate a logistic regression model.

## Compare performance of the previous classifier with some simple baseline classifiers

Use a `DummyClassifier` such that the resulting classifier will always
predict the class `' >50K'`. What is the accuracy score on the test set?
Repeat the experiment by always predicting the class `' <=50K'`.


In [34]:
from sklearn.dummy import DummyClassifier

# always predict the class ' >50K'
class_to_predict =  ' >50K'
high_rev_clf = DummyClassifier(strategy='constant',
                               constant=class_to_predict)
high_rev_clf.fit(data_train, target_train)
score = high_rev_clf.score(data_test, target_test)

print(f"Accuracy of a model predicting only high revenue: {score:.3f}")

Accuracy of a model predicting only high revenue: 0.234


In [35]:
# always predict the class ' <=50K'
class_to_predict = ' <=50K'
low_rev_clf = DummyClassifier(strategy='constant',
                              constant=class_to_predict)
low_rev_clf.fit(data_train, target_train)
score = low_rev_clf.score(data_test, target_test)

print(f"Accuracy of a model predicting only low revenue: {score:.3f}")

Accuracy of a model predicting only low revenue: 0.766
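
A constant classifier scores exactly the share of its class in the test set, which is why the two accuracies above add up to 1. The sketch below illustrates this on synthetic labels (the 3:1 class ratio and the placeholder features are assumptions) and uses `strategy="most_frequent"`, which picks the majority class automatically instead of naming it.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic, imbalanced labels (assumption: roughly a 3:1 class ratio).
y_train = np.array([" <=50K"] * 75 + [" >50K"] * 25)
y_test = np.array([" <=50K"] * 30 + [" >50K"] * 10)

# DummyClassifier ignores the features, so placeholder zeros are enough.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

baseline_accuracy = baseline.score(X_test, y_test)
majority_share = np.mean(y_test == " <=50K")
print(baseline_accuracy, majority_share)
```

Read against this baseline, the logistic regression's 0.807 is only about four points above the 0.766 obtained by always predicting `' <=50K'`.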
