# Classical Machine Learning

If you do not have a local environment, you can launch this repository from one of the following:

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/Talan-TechForData/datascience-solutions/HEAD?labpath=problem.ipynb) 
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Talan-TechForData/datascience-solutions/blob/main/1_exploratory_data_analysis/taxinyc_analysis/problem.ipynb)

## Situational description 

In this case we consider the classical adult census dataset, which describes individuals and whether their income exceeds $50K per year.

In [None]:
%pip install pandas matplotlib scikit-learn tensorflow seaborn

Here's a description of the features in this dataset:

1. **age**: The age of the individual. This is a numerical feature indicating how old the person is.

2. **education-num**: The number of years of education the individual has completed. This is a numerical feature representing educational attainment.

3. **capital-gain**: The capital gains of the individual, which is a numerical feature indicating income from investment profits.

4. **capital-loss**: The capital losses of the individual, which is a numerical feature indicating losses from investments.

5. **hours-per-week**: The number of hours the individual works per week. This is a numerical feature representing workload.

6. **class**: The target variable indicating whether the individual's income exceeds $50K per year. This is a categorical feature with two possible values:
   - `<=50K`: The individual earns $50,000 or less per year.
   - `>50K`: The individual earns more than $50,000 per year.

In [None]:
import pandas as pd

adult_census = pd.read_csv(
    'https://raw.githubusercontent.com/INRIA/scikit-learn-mooc/main/datasets/adult-census-numeric-all.csv',
)
adult_census.head()

1. What kinds of problems can be solved with this dataset using ML?

2. Consider the following piece of code. Explain what it does and what conclusions can be drawn from it.

In [None]:
import seaborn as sns

sns.pairplot(adult_census.sample(4000), hue='class')

The exploratory analysis with the pairplot indicates that:

 - The target `class` has two modalities (`<=50K` and `>50K`).
 - The values of `hours-per-week` are widely spread.
 - The data does not show explicit linear relationships, so regular linear regressors might underperform.
 - Binary classifiers such as trees or logistic regression could provide accurate class predictions.
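
Visual impressions from a pairplot can be cross-checked with a few summary statistics. The sketch below is only illustrative: `toy` is a hypothetical, hand-made frame that reuses the column names `hours-per-week` and `class` from the dataset, and its values are invented. On the real data you would apply the same two calls to `adult_census`.

```python
import pandas as pd

# Hypothetical toy data (invented values), reusing two column names
# from the adult census dataset for illustration only.
toy = pd.DataFrame(
    {
        "hours-per-week": [40, 38, 45, 20, 40, 35, 60, 50],
        "class": ["<=50K"] * 6 + [">50K"] * 2,
    }
)

# Share of each class: reveals how imbalanced the target is.
class_share = toy["class"].value_counts(normalize=True)

# Mean and standard deviation of hours-per-week within each class:
# a numeric counterpart to the spread seen in the pairplot.
hours_by_class = toy.groupby("class")["hours-per-week"].agg(["mean", "std"])

print(class_share)
print(hours_by_class)
```

A strongly imbalanced `class_share` is worth noting early, because it determines how well a trivial classifier can score on accuracy alone.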

3. What does the following code do?

In [None]:
from sklearn.model_selection import ShuffleSplit

cv = ShuffleSplit(n_splits=10, test_size=0.5, random_state=0)

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

classifier = make_pipeline(StandardScaler(), LogisticRegression())

data, target = adult_census.drop(columns="class"), adult_census["class"]

from sklearn.model_selection import cross_validate

cv_results_logistic_regression = cross_validate(
    classifier, data, target, cv=cv, n_jobs=2
)

test_score_logistic_regression = pd.Series(
    cv_results_logistic_regression["test_score"], name="Logistic Regression"
)
test_score_logistic_regression
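
To see what the `ShuffleSplit` strategy used above produces, it can help to run it on a tiny array. This is a minimal sketch under stated assumptions: `toy_X` is a hypothetical 10-sample array and `n_splits=3` is chosen only to keep the output short (the cell above uses 10 splits on the full data).

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Hypothetical dataset of 10 samples with a single feature.
toy_X = np.arange(10).reshape(-1, 1)

# Same test_size and random_state as above, fewer splits for readability.
toy_cv = ShuffleSplit(n_splits=3, test_size=0.5, random_state=0)

# Each iteration reshuffles the indices and yields a new train/test partition.
splits = list(toy_cv.split(toy_X))
for i, (train_idx, test_idx) in enumerate(splits):
    print(f"split {i}: train={sorted(train_idx)} test={sorted(test_idx)}")
```

Each split is drawn independently, so a given sample may appear in the test set of several splits. `cross_validate` fits the pipeline once per split and reports one test score for each.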


4. What is the objective of the following lines with respect to the previous
work?

In [None]:
from sklearn.dummy import DummyClassifier

most_frequent_classifier = DummyClassifier(strategy="most_frequent")
cv_results_most_frequent = cross_validate(
    most_frequent_classifier, data, target, cv=cv, n_jobs=2
)
test_score_most_frequent = pd.Series(
    cv_results_most_frequent["test_score"],
    name="Most frequent class predictor",
)
test_score_most_frequent

5. By observing the results, what can you conclude?

In [None]:
# solution
all_test_scores = pd.concat(
    [test_score_logistic_regression, test_score_most_frequent],
    axis="columns",
)
all_test_scores

6. Let us consider the following additional baselines: Dummy Classifiers that use other kinds of strategies.

- Stratified: The `predict_proba` method randomly samples one-hot vectors from a multinomial distribution parametrized by the empirical class prior probabilities. The `predict` method returns the class label which got probability one in the one-hot vector of `predict_proba`.
- Uniform: Generates predictions uniformly at random from the list of unique classes observed in `y`.
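
The accuracy each dummy strategy can be expected to reach follows directly from the class proportions. As a rough sketch, assuming the class priors are the same in the training and test sets: always predicting the most frequent class scores $\max_k p_k$, stratified sampling scores $\sum_k p_k^2$, and uniform sampling scores $1/K$ for $K$ classes. The helper name `expected_dummy_accuracy` and the priors `[0.75, 0.25]` are hypothetical and chosen only for illustration.

```python
import numpy as np


def expected_dummy_accuracy(class_priors):
    """Expected accuracy of three dummy strategies for given class priors.

    Assumes the priors are identical in the training and test sets.
    """
    p = np.asarray(class_priors, dtype=float)
    return {
        # Always predict the majority class: correct with probability max(p).
        "most_frequent": float(p.max()),
        # Predict class k with probability p_k, independently of the truth.
        "stratified": float(np.sum(p**2)),
        # Predict each of the K classes with probability 1/K.
        "uniform": 1.0 / len(p),
    }


# Hypothetical class proportions for a binary problem.
expected = expected_dummy_accuracy([0.75, 0.25])
print(expected)
```

Under these assumptions, the most frequent strategy is never worse than the other two on accuracy, which makes it the strictest of the three baselines for a model to beat.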

Which of these baselines is more convenient to evaluate against, and why?

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Start below 0.5 so that uniform-dummy scores just under chance level are not dropped
bins = np.linspace(start=0.4, stop=1.0, num=100)

stratified_dummy = DummyClassifier(strategy="stratified")
cv_results_stratified = cross_validate(
    stratified_dummy, data, target, cv=cv, n_jobs=2
)
test_score_dummy_stratified = pd.Series(
    cv_results_stratified["test_score"], name="Stratified class predictor"
)
uniform_dummy = DummyClassifier(strategy="uniform")
cv_results_uniform = cross_validate(
    uniform_dummy, data, target, cv=cv, n_jobs=2
)
test_score_dummy_uniform = pd.Series(
    cv_results_uniform["test_score"], name="Uniform class predictor"
)
all_test_scores = pd.concat(
    [
        test_score_logistic_regression,
        test_score_most_frequent,
        test_score_dummy_stratified,
        test_score_dummy_uniform,
    ],
    axis="columns",
)
all_test_scores.plot.hist(bins=bins, edgecolor="black")
plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
plt.xlabel("Accuracy")  # scores are fractions in [0, 1], not percentages
_ = plt.title("Distribution of the test scores")

# End