# Supervised Learning with scikit-learn
- [Supervised Learning with scikit-learn](https://app.datacamp.com/learn/courses/supervised-learning-with-scikit-learn)
- first course on the track "[Machine Learning Scientist in Python](https://app.datacamp.com/learn/career-tracks/machine-learning-scientist-with-python)"
- **Supervised Learning** - the values to be predicted are already known for the training data; the goal is to build a model that predicts those values for previously unseen data

## types
- **Classification** - predict the label or category of an observation (e.g. is a transaction fraudulent or not)
- **Regression** - predict a continuous value (e.g. the price of a house based on size, number of bedrooms, ...)

## terminology
- **features** - independent variables, predictor variables, variables being input
- **target variable** - dependent variable, response variable, variable being predicted

## prerequisites
- data must not have missing values
- data must be numeric
- data is usually stored in pandas DataFrames or NumPy arrays
- do Exploratory Data Analysis (EDA) first to confirm the data meets these requirements
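
A minimal sketch of how the first two prerequisites could be checked during EDA. The DataFrame and its `plan` column are made up for illustration and are not part of the course data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values and the "plan" column are illustrative only.
df = pd.DataFrame({
    "account_length": [101, 73, np.nan],
    "total_day_charge": [45.85, 22.30, 24.62],
    "plan": ["basic", "premium", "basic"],
})

# Count missing values per column; any non-zero count needs handling first.
missing_per_column = df.isna().sum()
print(missing_per_column)

# List the columns that are not numeric and would need encoding or dropping.
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

Here `account_length` reports one missing value and `plan` is flagged as non-numeric, so both would have to be dealt with before fitting a model.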

## scikit-learn syntax
- [scikit-learn](https://scikit-learn.org/stable/)
- the home page is a handy way to browse the library by task: classification, regression, clustering, dimensionality reduction, model selection, preprocessing
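
A minimal sketch of the usual scikit-learn pattern: instantiate a model, fit it to labeled data, then predict on new data. The toy arrays are made up; `KNeighborsClassifier` is used only because it is the model from Example 1, and other estimators follow the same `fit` / `predict` steps:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up toy data: 6 observations, 2 features, binary labels.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])

model = KNeighborsClassifier(n_neighbors=3)  # 1. instantiate the model
model.fit(X, y)                              # 2. fit it to features and target

X_new = np.array([[1.1, 1.0], [5.1, 5.0]])   # new observations, same 2 features
predictions = model.predict(X_new)           # 3. predict labels for the new data
print(predictions)  # [0 1]
```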

## Example 1
- **Binary Classification** - classification where there are only two outcomes to choose between
- **k-Nearest Neighbors** - predict the label of a data point by looking at the `k` closest labeled data points
- so for `k=5`, you find the `5` closest labeled points to the point you want to classify and give it the label held by the majority of them (with two classes, picking an odd `k` avoids tied votes)
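
The majority vote described above can be sketched in plain Python. This is only an illustration of the idea, not how scikit-learn implements it; the function name `knn_predict` and the toy points are made up:

```python
from collections import Counter
import math


def knn_predict(points, labels, query, k):
    """Return the majority label among the k points closest to query."""
    # Sort the indices of the labeled points by Euclidean distance to the query.
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    nearest_labels = [labels[i] for i in order[:k]]
    # most_common(1) returns [(label, count)] for the most frequent label.
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up toy data: two well-separated groups with labels 0 and 1.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (6.5, 7.0), (7.0, 6.0)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, query=(2.0, 2.0), k=3))  # 0
```

With `k=5` the same query still gets label `0`, because three of its five nearest neighbors carry that label.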

In [2]:
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
from pathlib import Path
import numpy as np

## Extracting DataFrames from DataCamp
- ran `to_csv()` on the `churn_df` DataFrame (`churn_df.to_csv()`) in the interactive window
- `Shift` + `Down` to select the text
- copy and paste into a `csv` file
- Find and Replace the literal text `\n` with `/` (Regex mode off)
- activate Regex mode (the `.*` button)
- Find and Replace `/` with `\n`, which now inserts real line breaks
- remove the extra `'` at the start and end of the text
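
The same cleanup could be done in Python instead of with Find and Replace. This is a sketch under the assumption that the pasted text is a single-quoted string containing literal `\n` sequences; the `pasted` value is a shortened, made-up sample:

```python
import io

import pandas as pd

# Hypothetical pasted text: literal backslash-n sequences and wrapping single quotes.
pasted = r"',account_length,churn\n0,101,1\n1,73,0\n'"

# Strip the wrapping quotes, then turn each literal "\n" into a real newline.
csv_text = pasted.strip("'").replace("\\n", "\n")

# Parse the cleaned text; the first (unnamed) column becomes the index.
df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df)
```

Unlike the `/` placeholder approach, this does not require that the data itself contain no `/` characters.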

In [3]:
# Load the exported CSV, using its first column as the index
churn_df = pd.read_csv(
    Path.cwd() / "datasets" / "churn.csv",
    index_col=0,
)
churn_df

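| Unnamed: 0 | account_length | total_day_charge | total_eve_charge | total_night_charge | total_intl_charge | customer_service_calls | churn |
|---|---|---|---|---|---|---|---|
| 0 | 101 | 45.85 | 17.65 | 9.64 | 1.22 | 3 | 1 |
| 1 | 73 | 22.30 | 9.05 | 9.98 | 2.75 | 2 | 0 |
| 2 | 86 | 24.62 | 17.53 | 11.49 | 3.13 | 4 | 0 |
| 3 | 59 | 34.73 | 21.02 | 9.66 | 3.24 | 1 | 0 |
| 4 | 129 | 27.42 | 18.75 | 10.11 | 2.59 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3328 | 89 | 51.66 | 22.18 | 14.04 | 1.43 | 1 | 1 |
| 3329 | 141 | 43.96 | 18.87 | 14.69 | 3.02 | 0 | 0 |
| 3330 | 111 | 42.47 | 20.60 | 10.43 | 3.13 | 0 | 1 |
| 3331 | 135 | 46.48 | 13.09 | 11.06 | 3.32 | 1 | 0 |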


In [4]:
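# Features: two charge columns; target: the churn label
X = churn_df[["total_day_charge", "total_eve_charge"]].values
y = churn_df["churn"].values
print(f"feature shape: {X.shape}, target shape: {y.shape}")

# Fit a classifier that votes among the 15 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X, y)

# Three new observations with the same two features, in the same column order
X_new = np.array([[56.8, 17.5], [24.4, 24.1], [50.1, 10.9]])
print(f"new shape: {X_new.shape}")
predictions = knn.predict(X_new)
predictions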

feature shape: (3333, 2), target shape: (3333,)
new shape: (3, 2)


array([1, 0, 0])