# Classification 

### Binary Classification

Classification is an example of a supervised machine learning technique, which means it relies on data that includes known feature values as well as known label values.

*A classification algorithm is used to fit a subset of the data to a function that can calculate the probability for each class label from the feature values.* The remaining data is used to evaluate the model by comparing the predictions it generates from the features to the known class labels.

**The label (the y value) is what we must train our model to predict. The other columns are potential features (x values).**

Summary: "Classification is a form of supervised machine learning in which you train a model to use the features (the x values in our function) to predict a label (y) that calculates the probability of the observed case belonging to each of a number of possible classes, and predicting an appropriate label."
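
As a minimal sketch of the idea in the summary, the snippet below turns feature values into a class probability with a logistic (sigmoid) function and then picks a label using a 0.5 threshold. The weights, bias, and feature values are made-up numbers for illustration only, not values learned from any data, and the helper name `predict_probability` is hypothetical.

```python
import math

def predict_probability(features, weights, bias):
    """Return P(class = 1) for one case using a logistic (sigmoid) function."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and bias, chosen by hand for illustration only.
weights = [0.8, -0.5]
bias = -0.2

case = [1.5, 0.4]  # two made-up feature values (x)
p_positive = predict_probability(case, weights, bias)
p_negative = 1.0 - p_positive

# Predict the class with the higher probability (threshold of 0.5).
predicted_label = 1 if p_positive >= 0.5 else 0
print(f"P(y=0)={p_negative:.3f}, P(y=1)={p_positive:.3f}, label={predicted_label}")
```

A trained classifier does the same kind of calculation, except that the weights are fitted to the training data rather than chosen by hand.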

#### Example 

In [1]:
import os
import pathlib
import urllib.request

import pandas as pd
from dotenv import load_dotenv
from matplotlib import pyplot as plt

load_dotenv()  # reads DATA_DIR from a local .env file

CSV_URL = "https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/diabetes.csv"

csv_path = pathlib.Path(os.environ["DATA_DIR"]) / "diabetes.csv"

urllib.request.urlretrieve(CSV_URL, csv_path)  # download the CSV into DATA_DIR

diabetes = pd.read_csv(csv_path, delimiter=",", header="infer")

diabetes


| | PatientID | Pregnancies | PlasmaGlucose | DiastolicBloodPressure | TricepsThickness | SerumInsulin | BMI | DiabetesPedigree | Age | Diabetic |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1354778 | 0 | 171 | 80 | 34 | 23 | 43.509726 | 1.213191 | 21 | 0 |
| 1 | 1147438 | 8 | 92 | 93 | 47 | 36 | 21.240576 | 0.158365 | 23 | 0 |
| 2 | 1640031 | 7 | 115 | 47 | 52 | 35 | 41.511523 | 0.079019 | 23 | 0 |
| 3 | 1883350 | 9 | 103 | 78 | 25 | 304 | 29.582192 | 1.282870 | 43 | 1 |
| 4 | 1424119 | 1 | 85 | 59 | 27 | 35 | 42.604536 | 0.549542 | 22 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14995 | 1490300 | 10 | 65 | 60 | 46 | 177 | 33.512468 | 0.148327 | 41 | 1 |
| 14996 | 1744410 | 2 | 73 | 66 | 27 | 168 | 30.132636 | 0.862252 | 38 | 1 |
| 14997 | 1742742 | 0 | 93 | 89 | 43 | 57 | 18.690683 | 0.427049 | 24 | 0 |
| 14998 | 1099353 | 0 | 132 | 98 | 18 | 161 | 19.791645 | 0.302257 | 23 | 0 |



The `Diabetic` column is the label (y). It contains:

- 0 for patients who tested negative for diabetes
- 1 for patients who tested positive

We separate the features (x) from the label (y):

In [3]:
features = ['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']
label = 'Diabetic'

X, y = diabetes[features].values, diabetes[label].values

Split the data into 70% for training and 30% for testing:

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
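
To make the effect of `test_size=0.30` concrete, here is a small sketch that runs the same call on synthetic stand-in data and reports how many cases land in each subset. The arrays `X_demo` and `y_demo` are randomly generated placeholders, not the diabetes data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 cases with 8 features and a binary label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 8))
y_demo = rng.integers(0, 2, size=1000)

# Same split settings as the notebook cell above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0
)

print(f"Training cases: {X_tr.shape[0]}")
print(f"Test cases: {X_te.shape[0]}")
```

Fixing `random_state` makes the shuffle reproducible, so repeated runs give the same split.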

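To train a logistic regression model, we need a regularization parameter in addition to the training features and labels. It counteracts any bias in the sample and helps the model generalize by avoiding overfitting to the training data. In scikit-learn's `LogisticRegression`, `C` is the inverse of the regularization strength, so a regularization rate of 0.01 is passed as `C=1/reg`.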

In [5]:
from sklearn.linear_model import LogisticRegression

reg = 0.01  # regularization rate

model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

print(model)

LogisticRegression(C=100.0, solver='liblinear')
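
To illustrate what the regularization rate does, the sketch below fits the same kind of model twice on synthetic data, once with weak and once with strong regularization, and compares the overall size of the learned weights. The data comes from `make_classification`, not from the diabetes dataset, and the helper name `weight_norm` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (not the diabetes dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)

def weight_norm(reg):
    """Fit a model with regularization rate `reg` and return the size of its weights."""
    fitted = LogisticRegression(C=1 / reg, solver="liblinear").fit(X_demo, y_demo)
    # liblinear also penalizes the intercept, so include it in the norm.
    return float(np.linalg.norm(np.append(fitted.coef_.ravel(), fitted.intercept_)))

weak = weight_norm(0.01)   # C = 100: weak regularization
strong = weight_norm(100)  # C = 0.01: strong regularization

print(f"Weight norm with reg=0.01: {weak:.3f}")
print(f"Weight norm with reg=100:  {strong:.3f}")
```

Stronger regularization pulls the weights toward zero, which limits how closely the model can fit quirks of the training sample.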


In [6]:
predictions = model.predict(X_test)
print('Predicted labels: ', predictions)
print('Actual labels:    ', y_test)

Predicted labels:  [0 0 0 ... 0 1 0]
Actual labels:     [0 0 1 ... 1 1 1]
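
The predicted labels are derived from per-class probabilities. As a sketch on synthetic data (not the diabetes dataset), `predict_proba` returns one probability per class for each case, and `predict` returns the class with the higher probability. The names `X_demo`, `y_demo`, and `demo_model` are placeholders introduced for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
demo_model = LogisticRegression(C=100.0, solver="liblinear").fit(X_demo, y_demo)

# One row per case, one column per class (ordered as in demo_model.classes_).
probabilities = demo_model.predict_proba(X_demo[:5])
labels = demo_model.predict(X_demo[:5])

for probs, label in zip(probabilities, labels):
    print(f"P(0)={probs[0]:.3f}  P(1)={probs[1]:.3f}  -> predicted {label}")
```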


Let's evaluate the accuracy:

- Accuracy: the proportion of the test data that the model predicts correctly

In [7]:
from sklearn.metrics import accuracy_score

print(f"Accuracy: {accuracy_score(y_test, predictions)}")

Accuracy: 0.7893333333333333
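
The accuracy definition can be checked by hand. This sketch uses eight made-up labels (not the notebook's test set) and shows that counting the matching predictions and dividing by the total gives the same value as `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for eight cases, for illustration only.
actual = np.array([0, 0, 1, 1, 1, 0, 1, 0])
predicted = np.array([0, 0, 0, 1, 1, 0, 1, 1])

# Accuracy = correct predictions / total predictions.
manual_accuracy = float(np.mean(predicted == actual))

print(f"Manual accuracy:  {manual_accuracy}")
print(f"accuracy_score:   {accuracy_score(actual, predicted)}")
```

Six of the eight predictions match, so both lines report 0.75.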
