# Classification-Notebook Version 0.8

<b>This notebook can be used to work on classification tasks.</b><br>
<b>Bernward Asprion, 29.11.2024</b><br>
<b>Version 0.8</b>

# 1&nbsp;Table of contents

The contents of this notebook are divided into various categories which are given as follows:

<ol>
<li><b>Table of contents</b>
<li><b>Import libraries</b>
<li><b>Import data and check import</b><br>
    3.1 Import Data<br>
    3.2 Check Import<br>
<li><b>Data Preparation for Analysis</b><br>
    4.1 Remove unwanted columns<br>
    4.2 Rename column names<br>
    4.3 Check and handle Null values<br>
    4.4 Check and change data types<br>
    4.5 Remove special characters<br>
    4.6 Replace values<br>
    4.7 Replace rare values<br>
<li><b>Data Understanding</b><br>
    5.1 Univariate - statistical analysis<br>
    5.2 Univariate - visualizations<br>
    5.3 Bivariate - correlation matrix<br>
    5.4 Bivariate - cross table<br>
    5.5 Bivariate - visualizations<br>
    5.6 Trivariate - visualization<br>
<li><b>Data Preparation</b><br>
    6.1 Feature Engineering<br>
    6.2 Scaling<br>
    6.3 Delete data records<br>
    6.4 Dummy Encoding<br>
<li><b>Create X_train, X_test, y_train, y_test</b>
<li><b>Define positive and negative class</b>
<li><b>Logistic Regression</b><br>
    9.1 Standard<br>
    9.2 Thresholding<br>
<li><b>Decision Tree</b><br>
<li><b>Random Forest</b><br>
<li><b>Cross validation</b><br>
<li><b>Create final model</b><br>
</ol>

# 2&nbsp;Import libraries

<b>Brief description of the used libraries</b><br>
<ul>
<li><b>numpy (numerical python):</b><br>
supports <b>efficient numerical operations</b> on large quantities of data.
<li><b>pandas (derived from 'panel data'):</b><br>
is a very popular library for working with data.<br>
DataFrames are at the center of pandas.<br>
It is based on numpy.
<li><b>matplotlib:</b><br>
is a library for creating static, animated, and interactive <b>visualizations</b>.
<li><b>seaborn:</b><br>
is a data <b>visualization</b> library based on matplotlib.<br>
It provides a high-level interface for drawing attractive and informative statistical graphics.
<li><b>sklearn (scikit-learn)</b><br>
Built on NumPy, SciPy, and matplotlib<br>
Simple and efficient tools for predictive data analysis
<li><b>statsmodels</b><br>
provides classes and functions for the estimation of many different statistical models,<br>
as well as for conducting statistical tests, and statistical data exploration.
</ul>

In [None]:
# import io
import math
import matplotlib.cm as cm
import matplotlib.pyplot as plt  # for plotting and visualisations
import numpy as np

# import os
import pandas as pd
import seaborn as sns  # for plotting and visualisations - sometimes nicer than plt
import sys

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
import sklearn.metrics as metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold  # Repeated k-fold cross validation
from sklearn.model_selection import (
    train_test_split,
)  # split a dataset in training and test


# from sklearn.model_selection import GridSearchCV

# 3&nbsp;Import Data and check import

## 3.1&nbsp;Import Data

<b>(A) For Colab-Users and loading from your local drive:</b>

The code will prompt you to select a file.<br>
Click on “Choose Files” then select the file to be imported (e.g., 'housing-california.csv').<br>
<b>Wait for the file to be 100% uploaded.</b><br>
You should see the name of the file once Colab has uploaded it.

In [None]:
"""
import sys
if "google.colab" in sys.modules:   # checks if google is used
  from google.colab import files
  uploaded = files.upload()
"""

<b>After executing this code cell, it should be commented out.</b><br>
This can be done by putting ''' (or """) at the beginning and at the end of that cell.<br>
Advantage: you can run the code again and again from the beginning.

The following code imports the uploaded file into a DataFrame.<br>
Make sure the filename in the code matches the name of the uploaded file - tip: copy the filename with extension (after 'to').<br>
sep: the separator between the columns - adjust if necessary<br>
decimal: the character used as decimal point - adjust if necessary

In [None]:
if "google.colab" in sys.modules:
    file = "phishing-URL (1).csv"  # change filename

    import io

    df = pd.read_csv(
        io.BytesIO(uploaded[file]), sep=",", decimal="."
    )  # change values for sep and decimal
    # Dataset is now stored in a Pandas Dataframe named df
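The effect of `sep` and `decimal` can be tried out on a small in-memory CSV before loading a real file. The following is a minimal sketch; the text in `csv_text` and the name `demo` are made up for illustration and are not part of the notebook's data.

```python
import io

import pandas as pd

# Hypothetical CSV text: ';' separates the columns and ',' is the decimal mark.
csv_text = "price;rooms\n1,5;3\n2,25;4\n"

# With the matching settings, 'price' is parsed as a float column.
demo = pd.read_csv(io.StringIO(csv_text), sep=";", decimal=",")
print(demo.dtypes)
print(demo)
```

If `sep` does not match the file, the whole line usually ends up in a single column; if `decimal` does not match, the numbers are read as text.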

**(B) For Colab-Users and loading from Google Drive:**

In [None]:
"""
from google.colab import drive
drive.mount('/content/drive')
"""

Click on the folder symbol on the left side (hovering over it shows "Files").<br>
Then you can see folders and files under "content/drive".<br>
Go through the folders and look for the file you want to import.<br> Select the file and click on the three dots next to it. Click on "Copy path".<br>
Paste the copied path into the following code cell, as in the example.

In [None]:
"""
df = pd.read_csv('/content/drive/MyDrive/DAS/Regression/Housing/housing-california.csv',sep=',', decimal= '.')
"""

**(C) For those, who do not use Colab:**

<b>In this case, you have to remove one "#" in the following code cell,<br>
and adjust the path.</b>

In [None]:
"""
#df = pd.read_csv(r'C:/data_folder/house-california.csv')   # absolute path
#df = pd.read_csv("../data_folder/house-california.csv")    # relative path
"""

## 3.2&nbsp;Check import

<b>pandas.DataFrame.info</b><br>
This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html

In [None]:
df.info()

<b>pandas.DataFrame.sample</b><br>
Return a random sample of items from an axis of the object.<br>
axis=0: the rows<br>
axis=1: the columns<br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html

In [None]:
df.sample(3, axis=0)  # show three randomly selected rows with all cols

<b>pandas.DataFrame.shape</b><br>
Return a tuple representing the dimensionality of the DataFrame.<br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html

In [None]:
df.shape

# 4&nbsp;First Data Preparation for Analysis


## 4.1&nbsp;Remove unwanted columns

<b>pandas.DataFrame.drop</b><br>
Remove rows or columns by specifying label names and corresponding axis, or by directly specifying index or column names.<br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html

Columns that are not needed can be removed.

In [None]:
"""
df = df.drop('name_of_column_to_be_dropped', axis = 1)          # axis=1:  means columns (axis=0 are rows)
"""

## 4.2&nbsp;Rename column names

<b>pandas.DataFrame.columns</b><br>
The column labels of the DataFrame.<br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html

If the column names are very long or include white spaces,<br>
it makes sense to rename them.<br>
It can be useful to keep the original long names for possible later use.

I prefer column names with lower case letters:

In [None]:
df.columns = df.columns.str.lower()
df.columns
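One possible way to keep the original column names before they are changed is a dictionary that maps each new name to the old one. This is only a sketch on a made-up DataFrame; `demo`, its columns and `original_names` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Hypothetical DataFrame with long / mixed-case column names.
demo = pd.DataFrame({"Screen Resolution": [1], "RAM": [8]})

# Map every lower-case name to its original spelling, then rename.
original_names = dict(zip(demo.columns.str.lower(), demo.columns))
demo.columns = demo.columns.str.lower()

print(demo.columns.tolist())
print(original_names)
```

The dictionary can later be used to label plots or reports with the original names.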

In case you'd like to rename columns:

In [None]:
"""
oldname = 'screenresolution'
newname = 'resolution'

df = df.rename(columns={oldname: newname})
df.columns
"""

## 4.3&nbsp;Check and handle Null values

<b>pandas.isnull</b><br>
Detect missing values for an array-like object.<br>
https://pandas.pydata.org/docs/reference/api/pandas.isnull.html

In [None]:
df.info()

Remark:<br>
<b>In Python you can calculate with boolean values.<br>
"False" corresponds to the value 0, "True" corresponds to the value 1.</b>
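Because "True" counts as 1, summing the boolean result of `isnull()` gives the number of missing values per column. A minimal sketch on a small made-up DataFrame (the name `demo` and its values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in 'a' and two in 'b'.
demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, None]})

# isnull() returns True/False per cell; sum() adds the True values per column.
missing_per_column = demo.isnull().sum()
print(missing_per_column)
```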


<b>pandas.DataFrame.dropna</b><br>
Remove missing values.<br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html

<b>Delete every row, that has 'any' Null value in it:</b>

In [None]:
"""
df = df.dropna(how='any', axis=0)
df.info()                              # always check what you have done!
"""

## 4.4&nbsp;Check and change data types

<b>A change of the data type is mandatory if a categorical variable has been coded as a number!<br>
Changing the type tells the DataFrame that the column is a categorical variable.</b>

In [None]:
df.info()

We can see data types such as 'float', 'integer' and 'object'.<br>
'object' is for text or mixed text and numeric values.

<b>pandas.DataFrame.astype</b><br>
Cast a pandas object to a specified dtype.<br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html

Remark: 'object' and 'category' are very similar.<br>
When the range of possible values is fixed and finite, 'category' has advantages with respect to speed and memory.<br>
It is also possible to order categories (ordinal data).
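Ordered categories can be sketched as follows. The series `sizes` and the category names are made up for illustration; `pd.CategoricalDtype` with `ordered=True` is one way to define the order.

```python
import pandas as pd

# Hypothetical ordinal variable.
sizes = pd.Series(["small", "large", "medium", "small"])

# Define the allowed values and their order, then cast the column.
size_type = pd.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
sizes_cat = sizes.astype(size_type)

print(sizes_cat.sort_values().tolist())  # sorted by category order, not alphabetically
print((sizes_cat < "large").sum())       # comparisons respect the order
```

Without `ordered=True`, comparisons such as `<` are not defined for a categorical column.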


<u><b>Cast one single column to another datatype:</b></u><br>
If several columns are to be cast, the cell must be copied and the column name adjusted (for reproducibility).


In [None]:
"""
col = 'ocean'                        # change column name
df[col] = df[col].astype('category') # change type, e.g.: 'category', 'int', 'float', 'str'
df.info()
"""

<u><b>Cast all columns to another datatype:</b></u>

In [None]:
"""
df = df.astype('category')    # change type, e.g.: 'category', 'int', 'float', 'str'
df.info()
"""

## 4.5 Remove special characters

When encoding categorical variables, values become column names.<br>
It may therefore be necessary to remove the following characters from the entire DataFrame.

In [None]:
"""
df = df.replace(' ','', regex=True)
df = df.replace('-','', regex=True)
df = df.replace(r'\+','', regex=True)
df = df.replace('/','', regex=True)
df = df.replace(r'\.','', regex=True)    # the dot must be escaped, otherwise it matches any character
df = df.replace('<','', regex=True)
df = df.replace('>','', regex=True)
df = df.replace(r'\[','', regex=True)
df = df.replace(']','', regex=True)
df = df.replace(r'\(','', regex=True)
df = df.replace(r'\)','', regex=True)
"""
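The same characters can also be removed in one pass with a single regular-expression character class. The sketch below applies it to one text column of a made-up DataFrame; `demo`, its values and `pattern` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical text column containing several special characters.
demo = pd.DataFrame({"cpu": ["Intel Core-i5 (2.5GHz)", "AMD <Ryzen> 5"], "ram": [8, 16]})

# Character class: space - + / . < > [ ] ( )
pattern = r"[ \-+/.<>\[\]()]"

# Remove every matching character from the text column only.
demo["cpu"] = demo["cpu"].str.replace(pattern, "", regex=True)
print(demo)
```

Restricting the replacement to selected text columns leaves numeric columns untouched.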

## 4.6&nbsp;Replace values

Sometimes it is necessary to replace certain values with others in the entire dataframe.

In [None]:
"""
oldvalue = 'old'     # change string
newvalue = 'new'     # change string
df = df.replace(oldvalue,newvalue , regex=True)
"""

## 4.7&nbsp;Replace rare values

Sometimes it is necessary to combine rarely occurring values in categorical variables into a group ('rare').

In [None]:
"""
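rare = 10                            # values occurring fewer than 'rare' times are grouped
cols_cat = df.select_dtypes(include=['object','string','category']).columns
for col in cols_cat:
  counts = df[col].value_counts()
  to_replace = counts[counts < rare].index
  # replace exact matches in this column only (the result has dtype 'object')
  df[col] = df[col].astype('object').where(~df[col].isin(to_replace), 'rare')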
"""

In [None]:
"""
col = ['company']                   # change column name
print(df[col].value_counts())
"""

# 5&nbsp;Data Understanding

In [None]:
"""
df.select_dtypes(include=['object', 'string', 'category']).describe()
"""

In [None]:
df.info()

## 5.1 Univariate - statistical analysis

Check whether a column contains only one distinct value (standard deviation == 0).

In [None]:
df.nunique()  # print the number of unique values per column

<b><u>numerical columns</u></b>

<b>pandas.DataFrame.describe</b><br>
Generate descriptive statistics.<br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html

In [None]:
"""
df.select_dtypes(include=['bool', 'float', 'integer']).describe().round(1)    # select a suitable number of decimal places
"""

<u><b>categorical columns</b></u>

In [None]:
"""
df.select_dtypes(include=['object', 'string', 'category']).describe()
"""

<b>pandas.Series.value_counts</b><br>
Return a Series containing counts of unique values.<br>
https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html

In [None]:
"""
col = ['company']                   # change column name
print(df[col].value_counts())
"""

## 5.2 Univariate - visualizations

<b><u>Count Plot</u></b>

<b>Count Plot of all categorical variables</b>

In [None]:
"""
width  = 4           # change width
height = 1.5         # change height

col_categorials = df.select_dtypes(include=['object','string','category']).columns  # column names of categorials
for col in col_categorials:
  plt.figure(figsize=(width, height))
  ax = sns.countplot(y=df[col])
  #plt.set(ax.get_xticklabels(), rotation=rotation)
  plt.title('Count Plot: ' + col, fontsize = 10)
  plt.xlabel('');
"""

<b>Count Plot of one categorical variable</b>

In [None]:
"""
width  = 4           # change width
height = 2           # change height
col    = 'has_ip'    # change column

plt.figure(figsize=(width, height))
ax = sns.countplot(x=df[col])
#plt.set(ax.get_xticklabels(), rotation=rotation)
plt.title('Count Plot: ' + col, fontsize = 10)
plt.xlabel('');
"""

<u><b>Histogram of all numerical variables</b></u>

In [None]:
# calculation recommended number of bins in histogram
if df.shape[0] <= 1000:
    bins = round(math.sqrt(df.shape[0]) + 0.5)
else:
    bins = round(10 * math.log10(df.shape[0]) + 0.5)
print("Calculated recommended number of bins: ", bins)

In [None]:
"""
bins   = 20         # change number of bins
width  = 12          # change width
height = 10          # change height

df.hist(bins=bins, figsize=(width, height));
"""

<u><b>Creation of boxplots for all numerical variables</b></u>

In [None]:
"""
kind   = 'box'
title  = 'Boxplots of all numerical variables'  # change title
width  = 30                                  # change width
height = 5                                   # change height

df.plot(kind=kind, subplots=True, sharey=False, title= title, figsize=(width, height));
"""

<u><b>Creation of boxplots of one numerical variable</b></u>

In [None]:
"""
col    = 'inches'      # change column name
kind   = 'box'
title  = 'Boxplot'   # change title
width  = 4           # change width
height = 4           # change height

df[col].plot(kind=kind, title = title, figsize=(width, height));
"""

## 5.3&nbsp;Bivariate - correlation matrix

<b>pandas.DataFrame.corr</b><br>
Compute pairwise correlation of columns, excluding NA/null values.<br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html

In [None]:
numeric_only = False  # change to: True or False

df.corr(numeric_only=numeric_only).round(2)

<b>seaborn.heatmap</b><br>
Plot rectangular data as a color-encoded matrix.<br>
https://seaborn.pydata.org/generated/seaborn.heatmap.html

In [None]:
width = 10  # change width
height = 10  # change height
size_number = 5
numeric_only = False  # change to: True or False
title = "Correlation heat map"  # change title

plt.figure(figsize=(width, height))
plt.title(title)
a = sns.heatmap(
    df.corr(numeric_only=numeric_only),
    square=True,
    annot=True,
    fmt=".2f",
    linecolor="white",
    vmin=-1,
    vmax=1,
    center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    annot_kws={"size": size_number},
)
a.set_xticklabels(a.get_xticklabels(), rotation=90)
a.set_yticklabels(a.get_yticklabels(), rotation=0);

## 5.4&nbsp;Bivariate - crosstable

In [None]:
"""
col1 = 'traffic'        # change column name
col2 = 'target'         # change column name

pd.crosstab(df[col1], df[col2])
"""
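Besides absolute counts, a cross table can also show row-wise shares, which makes groups of different size easier to compare. A minimal sketch on made-up data (the DataFrame `demo` and its values are illustrative; `normalize="index"` is the `pd.crosstab` option for row-wise shares):

```python
import pandas as pd

# Hypothetical data: one categorical feature and a binary target.
demo = pd.DataFrame(
    {
        "traffic": ["high", "high", "low", "low", "low"],
        "target": [1, -1, -1, -1, 1],
    }
)

counts = pd.crosstab(demo["traffic"], demo["target"])                      # absolute counts
shares = pd.crosstab(demo["traffic"], demo["target"], normalize="index")   # each row sums to 1

print(counts)
print(shares.round(2))
```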

## 5.5&nbsp;Bivariate - visualization

<u><b>Pairplot: scatterplot of all combinations of numerical features</b></u>

<b>seaborn.pairplot</b><br>
Plot pairwise relationships in a dataset.<br>
https://seaborn.pydata.org/generated/seaborn.pairplot.html

In [None]:
"""
sns.pairplot(df);
"""

<u><b>Scatterplot of two numerical features</b></u>

<b>pandas.DataFrame.plot</b><br>
Make plots of Series or DataFrame.<br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html

In [None]:
df.select_dtypes(include=["float", "int"]).columns  # Show numerical features

In [None]:
"""
col_x  = 'col1'          # column for x-axis
col_y  = 'col2'          # column for y-axis
width  = 4               # change width
height = 4               # change height
title  = 'Scatterplot: ' + col_y + ' vs. ' + col_x
alpha  = 0.1             # regulate the transparency of a graph plot using the alpha attribute.

df.plot.scatter(title= title, x= col_x, y= col_y, alpha=alpha, figsize=(width, height));
"""

<u><b>Creation of "side-by-side" boxplots</b></u>

In [None]:
print("Numerical features:")
print(df.select_dtypes(include=["float", "int"]).columns)
print("Categorical features:")
print(df.select_dtypes(include=["object", "string", "category"]).columns)

In [None]:
"""
target   = 'price'                 # change column name
width    = 8                       # change width
height   = 4                       # change height
rotation = 90                      # change orientation degree x-label

col_categorial = df.select_dtypes(include=['object','string','category']).columns  # column names of categoricals

for col in col_categorial:
  plt.figure(figsize=(width, height))
  title  = 'Boxplots: ' + target + ' vs. ' + col
  ax = sns.boxplot(x=df[col], y=df[target])
  plt.setp(ax.get_xticklabels(), rotation=rotation)
  plt.title(title, fontsize = 16 )
  plt.xlabel('');
"""

## 5.6 Trivariate - visualisation

<u><b>Scatterplot of selected two features with color-information of third feature</b></u>

In [None]:
"""
col_x     = 'ram'         # change column name (for x axis)
col_y     = 'weight'      # change column name (for y axis)
var_color = 'price'       # change column name (different colors)
alpha     =  0.2          # change parameter to see density of data points
width  = 5                # change width
height = 3                # change height

df.plot(kind="scatter", x=col_x, y=col_y, alpha=alpha,
        c = var_color,  cmap=plt.get_cmap("jet"), colorbar=True, figsize=(width, height));
"""

# 6&nbsp;Data Preparation

In [None]:
df.info()

## 6.1&nbsp;Feature Engineering

<b>The task is to generate features, that (probably) have a high correlation with the target.</b>

In [None]:
"""
df['weight2'] = df.weight**2          # change column names and formula
"""

<b>The correlations of all features with the target are shown.</b>

In [None]:
"""
target = 'price'                                  # change column
numeric_only = True                               # choose True or False

corr_matrix = df.corr(numeric_only = numeric_only)
corr_matrix[target].sort_values(ascending=False)
"""

## 6.2&nbsp;Scaling

In linear regression, scaling is not mandatory.<br>
Scaling can still help here because it makes the parameters easier to interpret.

<b>Show min, median and max of every feature</b>.

In [None]:
"""
df.describe().loc[['min','50%','max'],:]
"""

In [None]:
"""
col = 'price'                  # replace column name
df[col] = df[col] / 1000       # replace factor
"""

## 6.3&nbsp;Delete data records

In [None]:
"""
col = 'age'                    # change column

# keep only records that fulfill the following condition
# to drop a row with value of age=999 use: df = df[ df[col] != 999]

df = df[ df[col] >= 0 ]       # change condition
"""

## 6.4&nbsp;Dummy Encoding

<b>pandas.get_dummies</b><br>
Convert categorical variable into dummy/indicator variables.<br>
Each variable is converted in as many 0/1 variables as there are different values.<br>
https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html

In [None]:
df.info()

In [None]:
# replace column names that are to be encoded
col2enc = [
    "long_url",
    "pref_suf",
    "has_sub_domain",
    "ssl_state",
    "long_domain",
    "url_of_anchor",
    "tag_links",
    "domain_age",
    "traffic",
    "page_rank",
    "links_to_page",
]

In [None]:
"""
dummy_var = pd.get_dummies(df[col2enc], columns=col2enc, drop_first=True)  # encoding (columns= also encodes numerically coded columns)
df = pd.concat([df, dummy_var], axis=1)                                    # add encoded columns to df
df = df.drop(col2enc, axis=1)                                              # remove original columns
"""
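The effect of `drop_first` can be seen on a small example: without it, a variable with three different values becomes three 0/1 columns; with `drop_first=True`, the first category is dropped and serves as the reference. The column `ssl_state` and its values below are made up for illustration.

```python
import pandas as pd

# Hypothetical categorical column with three different values.
demo = pd.DataFrame({"ssl_state": ["ok", "bad", "unknown", "ok"]})

# dtype=int gives 0/1 columns instead of True/False.
full = pd.get_dummies(demo["ssl_state"], prefix="ssl_state", dtype=int)
reduced = pd.get_dummies(demo["ssl_state"], prefix="ssl_state", dtype=int, drop_first=True)

print(full.columns.tolist())     # one column per value
print(reduced.columns.tolist())  # first category ('bad') is the reference
```

Dropping one column avoids redundant information, because the dropped category is implied when all remaining dummy columns are 0.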

In [None]:
df.info()

We now replace white spaces in the column names (if there are any).

In [None]:
df.columns = df.columns.str.replace(" ", "_")
df.columns

# 7&nbsp;Create X_train, X_test, y_train, y_test

y: Vector with target values<br>
X: Matrix with feature values

In [None]:
target = "target"  # change column

y = df[target].values  # y is the target vector
X = df.drop(target, axis=1).values  # X is the feature matrix (target is dropped)

<b>sklearn.model_selection.train_test_split</b><br>
Split arrays or matrices into random train and test subsets.<br>
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [None]:
seed = 12345
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=seed
)
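The split can be tried on a tiny made-up data set to see how `test_size` determines the shapes. The sketch additionally uses the `stratify` parameter, which the notebook itself does not use; it keeps the class proportions equal in the training and test parts. The names `X_demo`, `y_demo`, `X_tr` and so on are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 records, 2 features, balanced classes 1 and -1.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([1, -1] * 5)

# 20 % of the records go to the test set; stratify keeps the class balance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=12345, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(sorted(y_te.tolist()))
```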

In [None]:
print("Shape from X:", X.shape)
print("Shape from y:", y.shape)
print()
print("Shape from X_train:", X_train.shape)
print("Shape from y_train:", y_train.shape)
print()
print("Shape from X_test:", X_test.shape)
print("Shape from y_test:", y_test.shape)

# 8&nbsp;Define positive and negative class

<b>This defines which values of the target are assigned to the positive class and which value of the target is assigned to the negative class.</b>

In [None]:
class_pos = 1  # change value
class_neg = -1  # change value

In [None]:
# check whether the positive class comes before the negative class in sorted order
# (scikit-learn sorts the class labels, which determines the column order of predict_proba).
class_pos_before_neg = class_pos < class_neg

# 9&nbsp;Logistic Regression

## 9.1&nbsp;Standard

<b>sklearn.linear_model.LogisticRegression</b><br>
Logistic Regression (aka logit, MaxEnt) classifier.<br>
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [None]:
logreg = LogisticRegression().fit(X_train, y_train)

y_pred_train = logreg.predict(X_train)
y_pred_test = logreg.predict(X_test)

In [None]:
print(y_pred_test[0:10])
print(y_test[0:10])

<b>sklearn.metrics.confusion_matrix</b><br>
Compute confusion matrix to evaluate the accuracy of a classification.<br>
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

In [None]:
labels = [class_pos, class_neg]  # class_pos and class_neg must be defined in section 8.

cm_train = confusion_matrix(y_train, y_pred_train, labels=labels)
cm_test = confusion_matrix(y_test, y_pred_test, labels=labels)

print("TRAINING DATA:")
print(cm_train)
print("Accuracy(train): ", "{:.3f}".format(accuracy_score(y_train, y_pred_train)), "\n")
print("TEST DATA:")
print(cm_test)
print("Accuracy(test):  ", "{:.3f}".format(accuracy_score(y_test, y_pred_test)))
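With `labels=[class_pos, class_neg]`, the rows of the confusion matrix are the true classes and the columns the predicted classes, both starting with the positive class, so the layout is `[[TP, FN], [FP, TN]]`. A small sketch with made-up labels (`y_true` and `y_hat` are illustrative):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted classes (positive = 1, negative = -1).
y_true = [1, 1, 1, -1, -1, -1, -1]
y_hat = [1, 1, -1, 1, -1, -1, -1]

# Positive class first, so the matrix reads [[TP, FN], [FP, TN]].
cm_demo = confusion_matrix(y_true, y_hat, labels=[1, -1])
tp, fn, fp, tn = cm_demo.ravel()

print(cm_demo)
print("TP:", tp, "FN:", fn, "FP:", fp, "TN:", tn)
print("Accuracy:", (tp + tn) / (tp + fn + fp + tn))
```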

In [None]:
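print(sum(y_train == class_pos))  # number of positive records in the training data
print(sum(y_train == class_neg))  # number of negative records in the training data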

## 9.2&nbsp;Thresholding

<ul>
<li>The threshold is a parameter of the classifier that is additionally defined.
<li>Thresholding is possible with any classifier that calculates the probability of class membership.
<li>By default: Threshold = 0.5
<li>The classification can be decisively influenced with this parameter.
</ul>
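How the threshold changes the classification can be sketched with a few made-up probabilities for the positive class. The array `proba_pos` and the class values 1 and -1 are assumptions for illustration.

```python
import numpy as np

# Hypothetical estimated probabilities for the positive class.
proba_pos = np.array([0.9, 0.45, 0.3, 0.05])
pos, neg = 1, -1

# A record is assigned to the positive class if its probability exceeds the threshold.
pred_05 = np.where(proba_pos > 0.5, pos, neg)
pred_02 = np.where(proba_pos > 0.2, pos, neg)

print("Threshold 0.5:", pred_05)
print("Threshold 0.2:", pred_02)
```

Lowering the threshold assigns more records to the positive class, which typically increases both the true positives and the false positives.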

<b>predict_proba(X)</b><br>
Output: Estimates of probabilities: how sure is the classifier?<br>
The returned estimates for all classes <b>are ordered by the label of classes</b>.

In [None]:
y_pred_train_prob = logreg.predict_proba(X_train)
y_pred_test_prob = logreg.predict_proba(X_test)

print("Estimates of the probabilities of the first 5 items of training set:")
print(y_pred_train_prob[0:5], "\n")
print("Estimates of the classes of the first 5 items of training set:")
print(y_pred_train[0:5])

We only keep probabilities for the positive outcome:

In [None]:
if class_pos_before_neg:
    y_pred_train_prob_pos = y_pred_train_prob[:, 0]
    y_pred_test_prob_pos = y_pred_test_prob[:, 0]
else:
    y_pred_train_prob_pos = y_pred_train_prob[:, 1]
    y_pred_test_prob_pos = y_pred_test_prob[:, 1]
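An alternative to comparing the two class values is to look up the column of the positive class in the fitted model's `classes_` attribute, which lists the labels in the same order as the columns of `predict_proba`. This is a sketch on made-up data; `X_demo`, `y_demo`, `class_pos_demo` and `model` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one feature, classes -1 and 1.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([-1, -1, 1, 1])
class_pos_demo = 1

model = LogisticRegression().fit(X_demo, y_demo)

# classes_ holds the sorted labels; its order is the column order of predict_proba.
pos_col = list(model.classes_).index(class_pos_demo)
proba_pos = model.predict_proba(X_demo)[:, pos_col]

print(model.classes_, "-> positive class is column", pos_col)
print(proba_pos.round(3))
```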

In [None]:
y_pred_train_prob_pos[0:5]

In [None]:
thresholds = [0.5, 0.4, 0.3, 0.2, 0.1, 0.05]  # change thresholds

tp_train, fn_train, fp_train, tn_train = [], [], [], []

for threshold in thresholds:
    y_pred_train_class = np.where(
        y_pred_train_prob_pos > threshold, class_pos, class_neg
    )
    tp, fn, fp, tn = confusion_matrix(
        list(y_train), list(y_pred_train_class), labels=labels
    ).ravel()
    tp_train.append(tp)
    fn_train.append(fn)
    fp_train.append(fp)
    tn_train.append(tn)
    print("TRAINING, Threshold=", threshold)
    print(confusion_matrix(list(y_train), list(y_pred_train_class), labels=labels))
    print("Accuracy(train):", round(accuracy_score(y_train, y_pred_train_class), 3))
    print("")

In [None]:
tp_test, fn_test, fp_test, tn_test = [], [], [], []

for threshold in thresholds:
    y_pred_test_class = np.where(y_pred_test_prob_pos > threshold, class_pos, class_neg)
    tp, fn, fp, tn = confusion_matrix(
        list(y_test), list(y_pred_test_class), labels=labels
    ).ravel()
    tp_test.append(tp)
    fn_test.append(fn)
    fp_test.append(fp)
    tn_test.append(tn)
    print("TEST, Threshold=", threshold)
    print(confusion_matrix(list(y_test), list(y_pred_test_class), labels=labels))
    print("Accuracy(test):", round(accuracy_score(y_test, y_pred_test_class), 3))
    print("")

In [None]:
if class_pos_before_neg:
    y_pred_train_proba = logreg.predict_proba(X_train)[::, 0]
else:
    y_pred_train_proba = logreg.predict_proba(X_train)[::, 1]

fpr, tpr, ths = metrics.roc_curve(y_train, y_pred_train_proba, pos_label=class_pos)

# create ROC curve
plt.plot(fpr, tpr, label="ROC curve")
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")

# create point linked with certain threshold
colors = cm.rainbow(np.linspace(0, 1, len(thresholds)))
for i, thres in enumerate(thresholds):
    plt.scatter(
        fp_train[i] / (fp_train[i] + tn_train[i]),
        tp_train[i] / (tp_train[i] + fn_train[i]),
        color=colors[i],
        label="Threshold =" + str(thres),
    )

# ROC line for 'random classifier'
x_values = np.linspace(0, 1, 10)
y_values = x_values
plt.plot(x_values, y_values, label="random classifier")

plt.legend(loc="lower right")
plt.title("ROC: Logistic Regression (Train)")
plt.grid(visible=True);

In [None]:
if class_pos_before_neg:
    y_pred_test_proba = logreg.predict_proba(X_test)[::, 0]
else:
    y_pred_test_proba = logreg.predict_proba(X_test)[::, 1]

fpr, tpr, ths = metrics.roc_curve(y_test, y_pred_test_proba, pos_label=class_pos)

# create ROC curve
plt.plot(fpr, tpr, label="ROC curve")
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")

# create point linked with certain threshold
colors = cm.rainbow(np.linspace(0, 1, len(thresholds)))
for i, thres in enumerate(thresholds):
    plt.scatter(
        fp_test[i] / (fp_test[i] + tn_test[i]),
        tp_test[i] / (tp_test[i] + fn_test[i]),
        color=colors[i],
        label="Threshold =" + str(thres),
    )

# ROC line for 'random classifier'
x_values = np.linspace(0, 1, 10)
y_values = x_values
plt.plot(x_values, y_values, label="random classifier")

plt.legend(loc="lower right")
plt.title("ROC: Logistic Regression (Test)")
plt.grid(visible=True);
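A ROC curve is often summarised in a single number, the area under the curve (AUC): 0.5 corresponds to the random classifier and 1.0 to a perfect ranking. The notebook does not compute it; the following sketch uses `roc_auc_score` from `sklearn.metrics` on made-up values (`y_true` and `scores` are illustrative).

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true classes and estimated probabilities of the positive class (1).
y_true = np.array([-1, -1, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr_demo, tpr_demo, ths_demo = roc_curve(y_true, scores, pos_label=1)
auc_demo = roc_auc_score(y_true, scores)  # the larger label (1) is treated as positive

print("FPR:", fpr_demo)
print("TPR:", tpr_demo)
print("AUC:", auc_demo)
```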

# 10&nbsp;Decision Tree

<b>sklearn.tree.DecisionTreeClassifier</b><br>
A decision tree classifier.<br>
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

A decision tree classifier is trained.

In [None]:
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier().fit(X_train, y_train)

The classifier is used for prediction.

In [None]:
y_train_pred = dtc.predict(X_train)
y_test_pred = dtc.predict(X_test)

In [None]:
feature_names = df.drop(target, axis=1).columns.tolist()  # same column order as in X (np.setdiff1d would sort the names)

Graphic representation of the tree.

In [None]:
from sklearn import tree
import graphviz

dot_data = tree.export_graphviz(
    dtc,
    out_file=None,
    feature_names=feature_names,
    class_names=[str(c) for c in dtc.classes_],  # class names must be given in ascending label order
    filled=True,
    rounded=True,
    special_characters=True,
)
graph = graphviz.Source(dot_data)
graph

In [None]:
# X_roc = X_train
# y_roc = y_train
X_roc = X_test
y_roc = y_test

# define metrics

if class_pos_before_neg:
    y_pred_proba = dtc.predict_proba(X_roc)[::, 0]
else:
    y_pred_proba = dtc.predict_proba(X_roc)[::, 1]

fpr, tpr, ths = metrics.roc_curve(y_roc, y_pred_proba, pos_label=class_pos)
# create ROC curve
plt.plot(fpr, tpr, label="ROC Curve")
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
x_values = np.linspace(0, 1, 10)
y_values = x_values
plt.plot(x_values, y_values, label="random classifier")
plt.grid()
plt.title("ROC: decision tree classifier")
plt.legend(loc="lower right");

In [None]:
print("Training Data")
print(confusion_matrix(list(y_train), list(y_train_pred), labels=labels))
print("Accuracy: ", round(accuracy_score(y_train, y_train_pred), 3))

In [None]:
print("Test Data")
print(confusion_matrix(list(y_test), list(y_test_pred), labels=labels))
print("Accuracy: ", round(accuracy_score(y_test, y_test_pred), 3))

# 11&nbsp;Random Forest

In [None]:
rfc = RandomForestClassifier().fit(X_train, y_train)
y_pred_train = rfc.predict(X_train)
y_pred_test = rfc.predict(X_test)

In [None]:
print("Training Data")
print(confusion_matrix(list(y_train), list(y_pred_train), labels=labels))
print("Accuracy Training: ", round(accuracy_score(y_train, y_pred_train), 3))

In [None]:
print("Test Data")
print(confusion_matrix(list(y_test), list(y_pred_test), labels=labels))
print("Accuracy Test: ", round(accuracy_score(y_test, y_pred_test), 3))

# 12&nbsp;Cross validation

<b>RepeatedKFold:</b><br>
Repeats K-Fold n times with different randomization in each repetition.<br>
https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.RepeatedKFold.html

<b>Definition of the cross validation:</b>

In [None]:
n_splits = 5  # change number of splits
n_repeats = 3  # change number of repeats
random_state = 3  # change number of random_state

cv = RepeatedKFold(
    n_splits=n_splits, n_repeats=n_repeats, random_state=random_state
)  # cross validation is defined
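With these settings, each repetition splits the data into 5 folds and the procedure is repeated 3 times, so a model is trained and evaluated 15 times. The sketch below checks this on a made-up array; `cv_demo` and `X_demo` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

cv_demo = RepeatedKFold(n_splits=5, n_repeats=3, random_state=3)

# Hypothetical feature matrix with 20 records.
X_demo = np.zeros((20, 2))

n_rounds = cv_demo.get_n_splits(X_demo)  # n_splits * n_repeats
test_sizes = [len(test_idx) for _, test_idx in cv_demo.split(X_demo)]

print("Number of train/test rounds:", n_rounds)
print("Size of each test fold:", test_sizes)
```

Accordingly, `cross_val_score` returns one score per round, which is why minimum, mean and maximum are reported afterwards.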

In [None]:
model = LogisticRegression()  # choose model

scores = cross_val_score(
    model, X, y, scoring="accuracy", cv=cv, n_jobs=-1
)  # scores of crossvalidation are stored
print("Accuracy", type(model).__name__, ":", "\n", scores.round(3), "\n")
print(
    "Min/Mean/Max: {:.3f}".format(np.min(scores)),
    "{:.3f}".format(np.mean(scores)),
    "{:.3f}".format(np.max(scores)),
)

In [None]:
model = DecisionTreeClassifier()  # choose model
scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv, n_jobs=-1)
print("Accuracy", type(model).__name__, ":", "\n", scores.round(3), "\n")
print(
    "Min/Mean/Max: {:.3f}".format(np.min(scores)),
    "{:.3f}".format(np.mean(scores)),
    "{:.3f}".format(np.max(scores)),
)

In [None]:
model = RandomForestClassifier()  # choose model
scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv, n_jobs=-1)
print("Accuracy", type(model).__name__, ":", "\n", scores.round(3), "\n")
print(
    "Min/Mean/Max: {:.3f}".format(np.min(scores)),
    "{:.3f}".format(np.mean(scores)),
    "{:.3f}".format(np.max(scores)),
)

# 13&nbsp;Create final model

<b>Task: Create the best final Model for predicting classes of new URL items.</b>