## Basic Python Tutorial with `Pandas`, `Scikit-learn`, and `NumPy`

Return to the [castle](https://github.com/Nkluge-correa/teeny-tiny_castle).

`Pandas` is a popular open-source library for data manipulation and analysis in `Python`. It provides powerful data structures for working with structured data, such as tables, series, and data frames.

The library is widely used in data science, finance, and other fields where large amounts of data need to be processed and analyzed. `Pandas` allows users to easily load, manipulate, and analyze data from various sources, such as CSV files, Excel spreadsheets, and SQL databases.

Its core data structure, the `DataFrame`, provides a flexible and efficient way to work with data, allowing users to perform operations such as filtering, grouping, and aggregation. `Pandas` also offers a wide range of data analysis functions and statistical tools, making it an essential tool for data scientists and analysts.

![cool_pandas](https://c.tenor.com/_TV6qVC4toAAAAAM/panda-dancing.gif)

[Source](https://tenor.com/pt-BR/view/panda-dancing-moves-funny-shaking-gif-17764808).

If you are new to `Pandas`, check our [Basic Python Tutorial](https://github.com/Nkluge-correa/teeny-tiny_castle/blob/fa17764aa8800c388d0d298b750c686757e0861e/ML%20Intro%20Course/2_Basic_Python_Tutorial.ipynb).

### What are the types of DATA out there?

A very simplistic answer would be `N.O.I.R` (_Nominal-Ordinal-Interval-Ratio_):

- Nominal – _a set of items that can be distinguished by name or category (e.g., Nationality)._
- Ordinal – _items that can be ordered, such as military rank, or units of government, but whose degree of difference can’t be measured._
- Interval – _items that have a measurable distance between them, but no meaningful (non-arbitrary) zero point, such as Fahrenheit and Celsius temperatures._
- Ratio – _measurements that have a meaningful zero and can be divided meaningfully, such as the Kelvin temperature scale._

![data-types](https://www.voxco.com/wp-content/uploads/2021/03/Ordinal-Data3.jpg)

[Source](https://hotcore.info/act/kareff-080627.html).

For this part of the tutorial, we are gonna use some Pokemon data. 👾

Note: The column names in the original data frame contain capital letters, spaces, and periods (e.g., `Type 1`, `Sp. Atk`), which does not play well with attribute-style access in Python. For example:

This does not work (will give an error):

```python
df.pokemon name.sum()
```

But this will:

```python
df.pokemon_name.sum()
```

Hence, let us format our column names to `snake_case` using the function below.

- The `replace()` function takes two main arguments: the text to be replaced and the text that will replace it. The `lower()` function (you guessed it) converts all characters in a string to lowercase.

In [None]:
import pandas as pd

df = pd.read_csv(
    'https://gist.githubusercontent.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6/raw/92200bc0a673d5ce2110aaad4544ed6c4010f687/pokemon.csv')

# regex=False treats '.' and ' ' as literal characters rather than regex patterns
df.columns = df.columns.str.lower().str.replace('.', '', regex=False).str.replace(' ', '_', regex=False)

display(df)

Unnamed: 0,#,name,type_1,type_2,total,hp,attack,defense,sp_atk,sp_def,speed,generation,legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,700,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True
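
When cleaning names with `str.replace`, it helps to be explicit about whether the pattern is a literal string or a regular expression, because `.` is a regex metacharacter and, depending on the pandas version, a single-character pattern may or may not be interpreted as a regex. The sketch below uses made-up column names (an assumption, not the real CSV header) and shows two spellings that should behave the same way:

```python
import pandas as pd

# Hypothetical raw column names, similar in style to a CSV header
raw_columns = pd.Index(['Type 1', 'Sp. Atk', 'HP'])

# Option 1: literal replacement (no regex involved)
literal = (
    raw_columns.str.lower()
    .str.replace('.', '', regex=False)
    .str.replace(' ', '_', regex=False)
)

# Option 2: regex replacement with the dot escaped explicitly
escaped = (
    raw_columns.str.lower()
    .str.replace(r'\.', '', regex=True)
    .str.replace(' ', '_', regex=False)
)

print(list(literal))
print(list(escaped))
```

Both variants leave letters and digits untouched and only strip the period before turning spaces into underscores.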


How can we find the unique types of Pokemon and sort them in alphabetical order? By using `unique()` and `sorted()`:

- `unique` returns all unique values in a given `pandas.Series`.
- `sorted` sorts these values alphabetically. If you pass `reverse=True`, the sorting order is reversed.

In [None]:
print(sorted(df.type_1.unique()))
print(sorted(df.type_1.unique(), reverse=True))

['Bug', 'Dark', 'Dragon', 'Electric', 'Fairy', 'Fighting', 'Fire', 'Flying', 'Ghost', 'Grass', 'Ground', 'Ice', 'Normal', 'Poison', 'Psychic', 'Rock', 'Steel', 'Water']
['Water', 'Steel', 'Rock', 'Psychic', 'Poison', 'Normal', 'Ice', 'Ground', 'Grass', 'Ghost', 'Flying', 'Fire', 'Fighting', 'Fairy', 'Electric', 'Dragon', 'Dark', 'Bug']


What about how many Pokemon there are of each type? We can find out by using `value_counts()`. Wrapping the result in `pd.DataFrame()` gives us a new pandas `DataFrame`.

In [None]:
type_1_df = pd.DataFrame(df.type_1.value_counts())

display(type_1_df)

Unnamed: 0,type_1
Water,112
Normal,98
Grass,70
Bug,69
Psychic,57
Fire,52
Electric,44
Rock,44
Dragon,32
Ground,32
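
If you prefer the counts as a regular two-column table rather than a frame indexed by type, one possible approach is to name the index and reset it. This is a minimal sketch on a toy `Series` (the values are made up for illustration):

```python
import pandas as pd

# Toy stand-in for df.type_1
types = pd.Series(['Water', 'Water', 'Water', 'Fire', 'Fire', 'Grass'], name='type_1')

counts_df = (
    types.value_counts()            # counts per category, largest first
    .rename_axis('type_1')          # name the index so it becomes a column
    .reset_index(name='count')      # move the index into a regular column
)

print(counts_df)
```

The result has a `type_1` column and a `count` column, which is often easier to pass to plotting functions.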


And what is the average speed of each type (a.k.a. which one is the fastest)? We can write a for loop to get this!

In [None]:
average_speed = []

for poketype in sorted(df.type_1.unique()): # for pokemon type in all unique types

    temp_df = df[df['type_1'] == poketype] # select only pokemons from this type

    average_speed.append(temp_df.speed.mean()) # append the mean speed to the average_speed list

# to create a DataFrame from scratch, you can pass a dictionary as the data source
average_speed_df = pd.DataFrame({
    'porkemon_type' : sorted(df.type_1.unique()),
    'average_speed': average_speed})

display(average_speed_df)

Unnamed: 0,porkemon_type,average_speed
0,Bug,61.681159
1,Dark,76.16129
2,Dragon,83.03125
3,Electric,84.5
4,Fairy,48.588235
5,Fighting,66.074074
6,Fire,74.442308
7,Flying,102.5
8,Ghost,64.34375
9,Grass,61.928571


Flying pokemons are the fastest!
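
The loop above is a good exercise, but pandas can compute per-group statistics in a single expression with `groupby`. Here is a minimal sketch on a toy frame (the numbers are invented, not taken from the Pokemon data):

```python
import pandas as pd

# Toy data: a category column and a numeric column
toy = pd.DataFrame({
    'type_1': ['Fire', 'Fire', 'Water', 'Water', 'Flying'],
    'speed': [60, 80, 40, 50, 120],
})

# Group rows by type, take the mean speed of each group,
# and keep the group key as a regular column
average_speed_toy = (
    toy.groupby('type_1', as_index=False)['speed']
    .mean()
    .rename(columns={'speed': 'average_speed'})
)

print(average_speed_toy)
```

By default `groupby` sorts the group keys, so the rows come out in alphabetical order of type, just like the `sorted()` call in the loop.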

Besides looking at the `DataFrame`, we could get this value by using `idxmax()` and `loc`.

- `idxmax` returns the index label of the row with the highest value in a given column (with the default index, this is the row number).
- `loc` is used to select a subset of a `DataFrame` by label (access a group of rows and columns).
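
The difference between an index label and a row position only shows up when the index is not the default `0, 1, 2, ...`. The following sketch uses a toy frame with a custom index (all names and values are illustrative assumptions):

```python
import pandas as pd

# Toy frame with a non-default index to separate labels from positions
speeds = pd.DataFrame(
    {'pokemon_type': ['Fire', 'Flying', 'Water'],
     'average_speed': [70.0, 120.0, 45.0]},
    index=[10, 20, 30],
)

label = speeds['average_speed'].idxmax()                 # index label of the max
by_label = speeds.loc[label, 'pokemon_type']             # label-based lookup

position = speeds['average_speed'].to_numpy().argmax()   # integer position of the max
by_position = speeds.iloc[position]['pokemon_type']      # position-based lookup

print(label, by_label)
print(position, by_position)
```

Here the label is `20` while the position is `1`, yet both lookups reach the same row because `loc` is paired with the label and `iloc` with the position.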

In [None]:
print('Pokemon type with highest Speed is: ' +
        average_speed_df
        .loc[average_speed_df.average_speed.idxmax()]
        ['porkemon_type']
)

print('Pokemon type with lowest Speed is: ' +
        average_speed_df
        .loc[average_speed_df.average_speed.idxmin()]
        ['porkemon_type']
)

Pokemon type with highest Speed is: Flying
Pokemon type with lowest Speed is: Fairy


If you want a quick view of the statistics of your DataFrame, you can use `describe()` or `info()`.

In [None]:
display(df.describe(include='all')) # includes even nominal/categorical data
display(df.info())

Unnamed: 0,#,name,type_1,type_2,total,hp,attack,defense,sp_atk,sp_def,speed,generation,legendary
count,800.0,800,800,414,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800
unique,,800,18,18,,,,,,,,,2
top,,Bulbasaur,Water,Flying,,,,,,,,,False
freq,,1,112,97,,,,,,,,,735
mean,362.81375,,,,435.1025,69.25875,79.00125,73.8425,72.82,71.9025,68.2775,3.32375,
std,208.343798,,,,119.96304,25.534669,32.457366,31.183501,32.722294,27.828916,29.060474,1.66129,
min,1.0,,,,180.0,1.0,5.0,5.0,10.0,20.0,5.0,1.0,
25%,184.75,,,,330.0,50.0,55.0,50.0,49.75,50.0,45.0,2.0,
50%,364.5,,,,450.0,65.0,75.0,70.0,65.0,70.0,65.0,3.0,
75%,539.25,,,,515.0,80.0,100.0,90.0,95.0,90.0,90.0,5.0,


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   #           800 non-null    int64 
 1   name        800 non-null    object
 2   type_1      800 non-null    object
 3   type_2      414 non-null    object
 4   total       800 non-null    int64 
 5   hp          800 non-null    int64 
 6   attack      800 non-null    int64 
 7   defense     800 non-null    int64 
 8   sp_atk      800 non-null    int64 
 9   sp_def      800 non-null    int64 
 10  speed       800 non-null    int64 
 11  generation  800 non-null    int64 
 12  legendary   800 non-null    bool  
dtypes: bool(1), int64(9), object(3)
memory usage: 75.9+ KB


None

Now, let us plot some stuff!

### `Plotly`

`Plotly` is a data visualization library that allows users to create interactive, high-quality graphs and charts. It provides a range of visualization tools, including scatter plots, line charts, bar charts, heatmaps, and more.

The library offers a simple syntax for creating interactive visualizations that can be easily customized and shared. With `Plotly`, users can add annotations, hover labels, and zoom and pan features to their visualizations, making it easy to explore and understand complex data sets.

Let us go through the most common type of plots.

#### Histograms

A histogram is an _approximate representation of the distribution of some data_. `Histograms`, just like `pies`, `donuts`, and `bar graphs`, are good for showing the distribution of nominal data.

In [None]:
import plotly.express as px
import plotly.offline as py

fig = px.histogram(df, x='type_1')

fig.update_layout(
    template='plotly_dark',
    title='Distribution of Types of Pokemon',
    paper_bgcolor='rgba(0, 0, 0, 0)',
    plot_bgcolor='rgba(0, 0, 0, 0)'
)
fig.show()

# If you want to save your plot, you can uncomment the line below.
# If you are using your own workstation, `auto_open=False` prevents the
# HTML file from being opened automatically when it is saved.

#py.plot(fig, filename='Distribution of Types of Pokemon.html', auto_open=False)



If you don't like the `plotly_dark` theme, these are the templates available in `Plotly` that you can try:

- 'ggplot2', 'seaborn', 'simple_white', 'plotly', 'plotly_white', 'plotly_dark', 'presentation', 'xgridoff', 'ygridoff', 'gridon', 'none'.

With `plotly.express`, users can create a variety of visualizations, including scatter plots, line charts, bar charts, histograms, and more. It is ideal for users who are new to Plotly or who want to quickly create interactive visualizations without having to write a lot of code.

`plotly.graph_objects`, on the other hand, is a lower-level interface that provides more fine-grained control over the visualizations. It allows users to create visualizations using a more object-oriented approach and provides more flexibility in terms of customizing the visualizations.

In [None]:
import plotly.graph_objects as go

fig = go.Figure(data=[go.Histogram(x=df.hp)])
fig.update_layout(
    template='plotly_dark',
    title='Distribution of HP between Pokemon',
    paper_bgcolor='rgba(0, 0, 0, 0)',
    plot_bgcolor='rgba(0, 0, 0, 0)'
)
fig.show()
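
For a numeric column such as `hp`, a histogram essentially counts how many values fall into each bin. A minimal sketch with `NumPy`, using made-up values and hand-picked bin edges (both are assumptions for illustration):

```python
import numpy as np

# Toy HP values (illustrative, not the real dataset)
hp = np.array([45, 60, 80, 80, 39, 44, 59, 78, 120])

# Count the values that fall in [0, 50), [50, 100), and [100, 150]
counts, edges = np.histogram(hp, bins=[0, 50, 100, 150])

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f'{left:>3} to {right:>3}: {count}')
```

Plotting libraries pick the bin edges automatically, which is why the same data can look different with a different number of bins.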

#### Scatter Plot

A scatter plot is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables. _They canonically are used to look for the relation between two variables._ Let us look for the relationship between `defense` and `attack`.


In [None]:

fig = px.scatter(df, x='attack', y='defense', trendline='lowess',
                 trendline_options=dict(frac=0.1))
fig.update_layout(
    template='plotly_dark',
    title='Relationship Between Defense and Attack',
    paper_bgcolor='rgba(0, 0, 0, 0)',
    plot_bgcolor='rgba(0, 0, 0, 0)'
)
fig.show()


Again, `plotly.graph_objects` lets us create more intricate plots.

In [None]:
fig = go.Figure(data=go.Scatter(
    x=df.sp_atk,
    y=df.sp_def,
    mode='markers',
    marker=dict(
        size=4,
        color=df.sp_atk,  # set color equal to a variable
        colorscale='viridis',  # one of plotly colorscales
        showscale=True
    )
))
fig.update_layout(
    template='plotly_dark',
    title='Relationship Between SP. Def and SP. Atk',
    xaxis_title='SP. Atk',
    yaxis_title='SP. Def',
    paper_bgcolor='rgba(0, 0, 0, 0)',
    plot_bgcolor='rgba(0, 0, 0, 0)'
)
fig.show()


Plotly also allows you to create 3D scatter plots using `scatter_3d`.

In [None]:

fig = px.scatter_3d(df, x='hp', y='defense', z='speed',
                    color='type_1')

fig.update_layout(
    template='plotly_dark',
    title='Relationship Between HP, Defense, and Speed',
    paper_bgcolor='rgba(0, 0, 0, 0)',
    plot_bgcolor='rgba(0, 0, 0, 0)'
)
fig.show()


Enough working with Pokemons. Let us now work with some real data and Machine Learning.

### Scikit-learn

[Scikit-learn](https://scikit-learn.org/stable/) is a popular open-source machine learning library for `Python`. It provides a wide range of machine learning algorithms and tools for classification, regression, clustering, and dimensionality reduction, making it a valuable tool for data scientists and machine learning practitioners.

`Scikit-learn` is built on top of `NumPy`, `SciPy`, and `matplotlib`. The library offers a range of preprocessing and feature extraction methods, as well as tools for model selection, cross-validation, and hyperparameter tuning.

Let us train an ML model on the [`Breast Cancer Wisconsin (Diagnostic) Dataset`](https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data).

This dataset contains features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

These features are:

1. ID number.
2. Diagnosis (M = malignant, B = benign).
3. radius (mean of distances from center to points on the perimeter).
4. texture (standard deviation of gray-scale values).
5. perimeter.
6. area.
7. smoothness (local variation in radius lengths).
8. compactness (perimeter^2 / area - 1.0).
9. concavity (severity of concave portions of the contour).
10. concave points (number of concave portions of the contour).
11. symmetry.
12. fractal dimension ("coastline approximation" - 1).

First, let us load the `CSV` file containing our dataset from the Hugging Face Hub 🤗. For that, we use `load_dataset`, and afterwards we turn the dataset into a `pandas.DataFrame`.

But first, let us install the required libraries for this tutorial by running the cell below.

In [1]:
# Install the required libraries
%pip install datasets ydata_profiling -q

Note: you may need to restart the kernel to use updated packages.


In [None]:
from datasets import load_dataset

# load the dataset from the hub
dataset = load_dataset("dieineb/data_cancer")

# turn the dataset into a pandas.DataFrame
df = dataset['train'].to_pandas()

display(df.head())


Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


Remember the `describe()` and `info()` functions? A better way to get an overview of your dataset is to use tools like `ydata_profiling`.

[`ydata_profiling`](https://pypi.org/project/ydata-profiling/) generates profile reports from a pandas `DataFrame`. Extending a pandas `DataFrame` with `df.profile_report()` will automatically generate a standardized univariate and multivariate report for data understanding.

Since we are using a Colab notebook, we can show our profile by using the `to_notebook_iframe` or `to_widgets` methods. You can also save your report with the `to_file` method.

If you are using Colab, run the following to save in your Drive:

```python
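from google.colab import drive
from ydata_profiling import ProfileReport

# Mount your Google Drive so the report can be written to it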
drive.mount('/content/drive')

profile = ProfileReport(df, title="Pandas Profiling Report (Cancer Dataset)")
profile.to_notebook_iframe()
profile.to_file("/content/drive/MyDrive/pandas_profiling_report_cancer.html")
```

In [None]:
from ydata_profiling import ProfileReport

profile = ProfileReport(df, title="Pandas Profiling Report (Cancer Dataset)")
profile.to_notebook_iframe()
profile.to_file("pandas_profiling_report_cancer.html")


Our label feature is called `diagnosis` and has two categorical values (M, B). We need to turn these strings into numbers. For this, we will use the `LabelEncoder()`:

- The `LabelEncoder()` encodes target labels with values between `0` and `n_classes-1`.

In [None]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

df.diagnosis = encoder.fit_transform(df.diagnosis)  # Now "M" = 1 and "B" = 0

feature_names = df.columns[1:] # features are everything but the Label ("diagnosis")

display(df)


Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,1,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,1,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,1,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,1,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,1,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,1,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,1,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,1,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400
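
To see which number was assigned to which class, you can inspect the fitted encoder. The sketch below uses a short, made-up list of labels rather than the real column:

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels in the same style as the `diagnosis` column
labels = ['M', 'B', 'B', 'M', 'B']

toy_encoder = LabelEncoder()
encoded = toy_encoder.fit_transform(labels)

# classes_ holds the original labels in sorted order;
# the position of each label is the integer it was mapped to
print(toy_encoder.classes_)
print(encoded)

# inverse_transform maps the integers back to the original labels
decoded = toy_encoder.inverse_transform(encoded)
print(decoded)
```

Because the classes are sorted, `B` is mapped to `0` and `M` to `1`.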


In machine learning, it is important to evaluate the performance of a model on data that it has not seen before. This is known as "_testing_" the model. To accomplish this, we typically split our dataset into two parts: a "_training_" set, which we use to train the model, and a "_testing_" set, which we use to evaluate the model's performance.

The `train_test_split` function in `Scikit-learn` is a convenient way to split our dataset into these two parts. It randomly splits the data into a training set and a testing set, with the option to specify the size of each set. By default, the split is purely random; if you pass the labels to the `stratify` argument, the function also preserves the distribution of classes in both sets, which helps keep the training and testing sets representative of the overall data.


In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    df[feature_names].values,
    df.diagnosis,
    test_size=0.2,
    random_state=42
)

print(f'Training set has {x_train.shape[0]} samples, each one having {x_train.shape[1]} features.')
print(f'Test set has {x_test.shape[0]} samples, each one having {x_test.shape[1]} features.')

Training set has 455 samples, each one having 30 features.
Test set has 114 samples, each one having 30 features.
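
`train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both splits. The sketch below uses synthetic, imbalanced labels (an assumption for illustration, not the cancer dataset) to compare a plain split with a stratified one:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 20% of them in the positive class
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 20 + [0] * 80)

# Plain random split: the share of positives in the test set can drift
_, _, _, y_test_plain = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Stratified split: the test set keeps the 20% share of positives
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(f'Positives in plain test set: {y_test_plain.mean():.2f}')
print(f'Positives in stratified test set: {y_test_strat.mean():.2f}')
```

Stratification matters most for small or imbalanced datasets, where a purely random split can noticeably change the class balance.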


We will now create our classifier using `LogisticRegression`.

`LogisticRegression`, despite its name, is a classification algorithm rather than a regression algorithm. Based on a given set of independent variables, it is used to predict discrete values (class labels).

We pass two parameters to this model:

- `solver='lbfgs'`: the optimizer, Limited-memory Broyden–Fletcher–Goldfarb–Shanno.
- `max_iter`: the maximum number of iterations.

After this we will:

- `fit` our model to the training data ("_the learning_").
- Calculate the accuracy of our model with `score`, using our test set.
- Use `predict` to create predictions for all samples in our test set.
- Plot the `confusion_matrix`, which is basically a table that compares our predictions with the ground truth.


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
import plotly.express as px

model = LogisticRegression(solver='lbfgs', max_iter=5000)

model.fit(x_train, y_train)

score = model.score(x_test,  y_test)

preds = model.predict(x_test)

matrix = confusion_matrix(y_test, preds)

print(f'Model Accuracy: {score * 100:.2f} %')

fig = px.imshow(matrix,
                labels=dict(x="Predicted", y="True label"),
                x=['Benign', 'Malignant'],
                y=['Benign', 'Malignant'],
                text_auto=True
                )
fig.update_xaxes(side='top')
fig.update_layout(template='plotly_dark',
                  title='Confusion Matrix',
                  coloraxis_showscale=False,
                  paper_bgcolor='rgba(0, 0, 0, 0)',
                  plot_bgcolor='rgba(0, 0, 0, 0)')
fig.show()

Model Accuracy: 95.61 %
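
To read a binary confusion matrix, it can help to unpack its four cells. A minimal sketch with hand-written labels and predictions (both invented for illustration):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy ground truth and predictions (0 = benign, 1 = malignant)
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

# Rows are true labels, columns are predicted labels;
# ravel() flattens the 2x2 matrix in row order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)

print(f'TN={tn}, FP={fp}, FN={fn}, TP={tp}')
print(f'Accuracy: {accuracy:.2f}')
```

Accuracy is the sum of the diagonal cells (`TN + TP`) divided by the total number of samples, so it hides whether the errors are false positives or false negatives. In a diagnostic setting, that distinction is usually worth checking.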


Now, let us do some interpretability investigation.

Let us first measure the correlation of our features with our label variable.

Correlation coefficients measure the *linear association between variables*. We can interpret such values as follows:

- +1: Full positive correlation.
- +0.8: Strong positive correlation.
- +0.6: Moderate positive correlation.
- 0: No correlation at all.
- -0.6: Moderate negative correlation.
- -0.8: Strong negative correlation.
- -1: Full negative correlation.


In [None]:
corr_matrix = df.corr()

most_correlated_features = (
    pd.DataFrame(corr_matrix.loc['diagnosis'])
    .iloc[1:]  # drop the label's correlation with itself
    .sort_values(by='diagnosis', ascending=False)
    .head(5)
)

most_correlated_features.columns = ['Most Correlated Features']

display(most_correlated_features)

least_correlated_features = (
    pd.DataFrame(corr_matrix.loc['diagnosis'])
    .iloc[1:]  # drop the label's correlation with itself
    .sort_values(by='diagnosis', ascending=True)
    .head(5)
)

least_correlated_features.columns = ['Least Correlated Features']

display(least_correlated_features)

Unnamed: 0,Most Correlated Features
concave points_worst,0.793566
perimeter_worst,0.782914
concave points_mean,0.776614
radius_worst,0.776454
perimeter_mean,0.742636


Unnamed: 0,Least Correlated Features
smoothness_se,-0.067016
fractal_dimension_mean,-0.012838
texture_se,-0.008303
symmetry_se,-0.006522
fractal_dimension_se,0.077972
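
Note that sorting signed correlations in ascending order surfaces the most negative values first, which is not necessarily the same as the values closest to zero. If what you want is the weakest linear association, one option is to rank by absolute value. A minimal sketch on a made-up frame (column names and values are assumptions):

```python
import pandas as pd

# Toy frame: one positively related, one negatively related,
# and one unrelated feature
toy = pd.DataFrame({
    'label': [0, 0, 1, 1, 0, 1],
    'pos':   [1, 2, 5, 6, 1, 7],
    'neg':   [9, 8, 2, 1, 9, 3],
    'noise': [3, 1, 2, 3, 2, 1],
})

# Correlation of every feature with the label (label itself dropped)
corr_with_label = toy.corr()['label'].drop('label')

lowest_signed = corr_with_label.sort_values().index[0]           # most negative
weakest_absolute = corr_with_label.abs().sort_values().index[0]  # closest to zero

print(corr_with_label)
print(lowest_signed, weakest_absolute)
```

In this toy example, the signed sort puts `neg` first, while the absolute-value sort puts `noise` first.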


We can also explore the model's coefficients. The coefficients in a model (_when they are not too many_) can provide insight into which characteristics are most important for a classification.

In [None]:
coefs = pd.DataFrame(
    model.coef_.T,
    columns=['coefficients'],
    index=feature_names)

fig = go.Figure(go.Bar(
    x=coefs.coefficients,
    y=feature_names,
    orientation='h'))

fig.update_xaxes(range=[model.coef_.min()
    + (model.coef_.min() * 0.1),
    model.coef_.max()
    + (model.coef_.max() * 0.1)])

fig.update_layout(
    xaxis=dict(
        tickmode='linear',
        tick0=0,
        dtick=0.5
    ),
    template='plotly_dark',
    title='Model Coefficients',
    paper_bgcolor='rgba(0, 0, 0, 0)',
    plot_bgcolor='rgba(0, 0, 0, 0)'
)

fig.show()
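
One caveat when comparing coefficients: their size depends on the scale of each feature, so a feature measured in large units tends to get a small coefficient. A common way to make the magnitudes more comparable is to standardize the features first. The sketch below assumes the copy of the Wisconsin diagnostic data bundled with `Scikit-learn` (`load_breast_cancer`), whose target encoding may differ from the one used in this notebook:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled copy of the dataset (assumption: used here to avoid a download)
data = load_breast_cancer()

# Scale every feature to zero mean and unit variance, then fit the classifier
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
pipeline.fit(data.data, data.target)

# make_pipeline names each step after its lowercased class name
scaled_coefs = pd.Series(
    pipeline.named_steps['logisticregression'].coef_[0],
    index=data.feature_names,
)

# Features with the largest absolute standardized coefficients
print(scaled_coefs.abs().sort_values(ascending=False).head())
```

Even after scaling, strongly correlated features (such as radius, perimeter, and area) share their influence, so the ranking should be read as a rough guide rather than a definitive measure of importance.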

In the other notebooks of this repository, you will see many other examples of how to train and inspect ML models. You can also reuse the code in this notebook for other applications and datasets.

---

Return to the [castle](https://github.com/Nkluge-correa/teeny-tiny_castle).