## Basic Python Tutorial with `Pandas`, `Scikit-learn`, and `NumPy`

Return to the [castle](https://github.com/Nkluge-correa/teeny-tiny_castle).

**[Pandas](https://pandas.pydata.org/) is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.**

![cool_pandas](https://c.tenor.com/_TV6qVC4toAAAAAM/panda-dancing.gif)

Quick tip: _[13 Most Important Pandas Functions for Data Science](https://www.analyticsvidhya.com/blog/2021/05/pandas-functions-13-most-important/)_.


### What is DATA?

**N.O.I.R (_Nominal-Ordinal-Interval-Ratio_)**

- **Nominal** – a set of items that can be distinguished by name or category (e.g., Nationality).

- **Ordinal** – items that can be ordered, such as military rank, or units of government, but whose degree of difference can’t be measured.
- **Interval** – items that have a measurable distance between them, but no meaningful (non-arbitrary) zero point, such as Fahrenheit and Celsius temperatures.
- **Ratio** – measurements that have a meaningful zero and can be divided meaningfully, such as the Kelvin temperature scale.

![data-types](https://www.voxco.com/wp-content/uploads/2021/03/Ordinal-Data3.jpg)


# Recap from our Basic Python Tutorial 🐍

## `pd.read`

- `read_excel()`: reads an Excel file into a pandas DataFrame.
- `read_csv()`: reads a CSV file into a pandas DataFrame.
- `read_json()`: convert a JSON string to pandas object.

---

- example: `df = pd.read_csv('folder/your_file.csv')`

---

## `df.columns` and `df.index`

- The `columns` method gives you the name of all columns in a DataFrame.
- The `index` method gives you the positions of all indexes in a DataFrame.

---

- the `list()` function can turn objects (like `pandas series`) into lists.

---

## `Pandas` functions that give statistical insigth.

```python

- .mean() # mean of the distribution
- .max() # maximum value in the distribution
- .min() # minimum value in the distribution
- .std() # the standard deviation of the distribution
- .var() # the variance of the distribution

```

---

- The `round()` function _rounds a number_ to a certain decimal place:

```python

x = 3.7777
round(x, 3) = 3.777
round(x, 2) = 3.77
round(x, 1) = 3.7

```


In [1]:
# Import your librarys/modules

import pandas as pd
import plotly.graph_objects as go
import plotly.offline as py
import plotly.express as px


In [2]:
# get some data

df = pd.read_csv(
    'https://gist.githubusercontent.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6/raw/92200bc0a673d5ce2110aaad4544ed6c4010f687/pokemon.csv')

display(df)


Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,700,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True


### `Unique()` and `Sorted()`

- **The `unique` function shows all unique values in a certain list or, in this case, column of a data frame.**
- **The `sorted` function sorts these values alphabetically.**


In [3]:
l = list(df['Type 1'].unique())

print(f'Types of pokemon: {sorted(l)}\n.')


Types of pokemon: ['Bug', 'Dark', 'Dragon', 'Electric', 'Fairy', 'Fighting', 'Fire', 'Flying', 'Ghost', 'Grass', 'Ground', 'Ice', 'Normal', 'Poison', 'Psychic', 'Rock', 'Steel', 'Water']
.


### Visualizing Data with Plotly

**[Plotly](https://plotly.com/) provides online graphing, analytics, and statistics tools for individuals and collaboration, as well as scientific graphing libraries for Python, R, MATLAB, Perl, Julia, Arduino, and REST.**

![cool_graph](https://i.stack.imgur.com/7MGHV.gif)


#### Histograms

**Histogram is an *approximate representation of some data distribution.* `Histograms` are good, just like `pies`, `donuts` and `bar graphs`, for showing the distribution of nominal data.**


In [4]:
fig = px.histogram(df, x='Type 1')  # histogram with plolty.express

fig.update_layout(
    template='plotly_dark',  # make the plot look fancy!
    title='Distribution of Types of Pokemon',
    paper_bgcolor='rgba(0, 0, 0, 0)',
    plot_bgcolor='rgba(0, 0, 0, 0)'
)
fig.show()  # show the plot
# py.plot(fig, filename=f'Distribution of Types of Pokemon') # saves the .html file in your folder


**Available templates in Plotly:**

```python
['plotly_dark', 'seaborn', 'simple_white', 'plotly',
'plotly_white', 'plotly_dark', 'presentation', 'xgridoff',
'ygridoff', 'gridon', 'none']

```


In [5]:
fig = go.Figure(data=[go.Histogram(x=df['HP'])])
fig.update_layout(
    template='plotly_dark',
    title='Distribution of HP between Pokemon',
    paper_bgcolor='rgba(0, 0, 0, 0)',
    plot_bgcolor='rgba(0, 0, 0, 0)'
)
fig.show()
#py.plot(fig, filename=f'Distribution of HP between Pokemon')


### `idmax()` and `loc()`

- **`idmax` returns the index (row number) of the highest value in a given column.**
- **`loc` is used to specify a certain range whitin a dataFrame (access a group of rows and columns).**


In [6]:
# use this if you want to print Markdown
from IPython.display import display, Markdown

display(Markdown('## *Pokemon with highest HP is:* ' +
        df.loc[df['HP'].idxmax()]['Name']))
print(df.loc[df['HP'].idxmax()])
print('\n')

display(Markdown('## *Pokemon with lowest HP is:* ' +
        df.loc[df['HP'].idxmin()]['Name']))
print(df.loc[df['HP'].idxmin()])


## *Pokemon with highest HP is:* Blissey

#                 242
Name          Blissey
Type 1         Normal
Type 2            NaN
Total             540
HP                255
Attack             10
Defense            10
Sp. Atk            75
Sp. Def           135
Speed              55
Generation          2
Legendary       False
Name: 261, dtype: object




## *Pokemon with lowest HP is:* Shedinja

#                  292
Name          Shedinja
Type 1             Bug
Type 2           Ghost
Total              236
HP                   1
Attack              90
Defense             45
Sp. Atk             30
Sp. Def             30
Speed               40
Generation           3
Legendary        False
Name: 316, dtype: object


### Scatter Plot

**A scatter plot is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data. _They can be used to look for the relation between two variables._**


In [7]:

fig = px.scatter(df, x='Attack', y='Defense', trendline='lowess',
                 trendline_options=dict(frac=0.1))  # 'ols' for linear tends, 'loweess' for non-linear trends
fig.update_layout(                                 # level of smoothing can be controlled via the frac trendline option
    template='plotly_dark',
    title='Relationship Between Defense and Attack',
    paper_bgcolor='rgba(0, 0, 0, 0)',
    plot_bgcolor='rgba(0, 0, 0, 0)'
)
fig.show()
#py.plot(fig, filename=f'Relationship Between Defense and Attack')

mean_attack = df['Attack'].mean()
mean_defense = df['Defense'].mean()

print(f'\nMean attack value among pokemons: {round(mean_attack, 2)}. \n')
print(f'Mean defense value among pokemons: {round(mean_defense, 2)}. \n')



Mean attack value among pokemons: 79.0. 

Mean defense value among pokemons: 73.84. 



In [8]:
fig = go.Figure(data=go.Scatter(
    x=df['Sp. Atk'],
    y=df['Sp. Def'],
    mode='markers',
    marker=dict(
        size=4,
        color=df['Sp. Atk'],  # set color equal to a variable
        colorscale='Viridis',  # one of plotly colorscales
        showscale=True
    )
))
fig.update_layout(
    template='plotly_dark',
    title='Relationship Between SP. Def and SP. Atk',
    xaxis_title='SP. Atk',
    yaxis_title='SP. Def',
    paper_bgcolor='rgba(0, 0, 0, 0)',
    plot_bgcolor='rgba(0, 0, 0, 0)'
)
fig.show()
#py.plot(fig, filename=f'Relationship Between SP. Def and SP. Atk')


In [9]:

fig = px.scatter_3d(df, x='HP', y='Defense', z='Speed',
                    color='Type 1')

fig.update_layout(
    template='plotly_dark',
    title='Relationship Between HP, Defence, and Speed',
    paper_bgcolor='rgba(0, 0, 0, 0)',
    plot_bgcolor='rgba(0, 0, 0, 0)'
)
fig.show()
#py.plot(fig, filename=f'Relationship Between HP, Defence, and Speed')


### Scikit-learn

***Scikit-learn* (formerly *scikits.learn* and also known as *sklearn*) is a [free software](https://en.wikipedia.org/wiki/Free_software "Free software") [machine learning](https://en.wikipedia.org/wiki/Machine_learning "Machine learning") [library](<https://en.wikipedia.org/wiki/Library_(computing)> "Library (computing)") for the [Python](<https://en.wikipedia.org/wiki/Python_(programming_language)> "Python (programming language)") [programming language](https://en.wikipedia.org/wiki/Programming_language "Programming language"). It features various [classification](https://en.wikipedia.org/wiki/Statistical_classification "Statistical classification"), [regression](https://en.wikipedia.org/wiki/Regression_analysis "Regression analysis") and [clustering](https://en.wikipedia.org/wiki/Cluster_analysis "Cluster analysis") algorithms including [support-vector machines](https://en.wikipedia.org/wiki/Support_vector_machine "Support vector machine"), [random forests](https://en.wikipedia.org/wiki/Random_forests "Random forests"), [gradient boosting](https://en.wikipedia.org/wiki/Gradient_boosting "Gradient boosting"), [_k_-means](https://en.wikipedia.org/wiki/K-means_clustering "K-means clustering") and [DBSCAN](https://en.wikipedia.org/wiki/DBSCAN "DBSCAN"), and is designed to interoperate with the Python numerical and scientific libraries [NumPy](https://en.wikipedia.org/wiki/NumPy "NumPy") and [SciPy](https://en.wikipedia.org/wiki/SciPy "SciPy").**

<img src="https://miro.medium.com/max/1200/1*-eLjPY7UGSoQhSyW5qC6gw.jpeg" alt="drawing" width="800" height="400"/>


#### Breast Cancer Wisconsin (Diagnostic) Data Set

**This dataset contains features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.**

**Attribute Information:**

- 1. ID number.
- 2. Diagnosis (M = malignant, B = benign).
- 3. radius (mean of distances from center to points on the perimeter).
- 4. texture (standard deviation of gray-scale values).
- 5. perimeter.
- 6. area.
- 7. smoothness (local variation in radius lengths).
- 8. compactness (perimeter^2 / area - 1.0).
- 9. concavity (severity of concave portions of the contour).
- 10. concave points (number of concave portions of the contour).
- 11. symmetry.
- 12. fractal dimension ("coastline approximation" - 1).

**All feature values are recoded with four significant digits.**

**Class distribution: 357 benign, 212 malignant**

_Missing attribute values: none_

Learn more about this dataset [here](https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data).


### `info()` and `describe()`

- **The `info()` method prints information about the DataFrame. The information contains the number of columns, column labels, column data types, memory usage, range index, and the number of cells in each column (non-null values).**
- **The `describe()` method is used for calculating some statistical data like percentile, mean and std of the numerical values of the Series or DataFrame.**


In [2]:
df = pd.read_csv('data_cancer\data_cancer.csv')
display(df)


Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,
1,842517,M,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,
2,84300903,M,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,
4,84358402,M,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,926424,M,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,
565,926682,M,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,
566,926954,M,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,
567,927241,M,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,


In [3]:
df.describe(include='all')  # includes even nominal/categorical data


Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
count,569.0,569,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,0.0
unique,,2,,,,,,,,,...,,,,,,,,,,
top,,B,,,,,,,,,...,,,,,,,,,,
freq,,357,,,,,,,,,...,,,,,,,,,,
mean,30371830.0,,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,...,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946,
std,125020600.0,,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,...,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061,
min,8670.0,,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,...,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504,
25%,869218.0,,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,...,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146,
50%,906024.0,,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,...,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004,
75%,8813129.0,,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,...,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208,


In [4]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

## FACETS

**[Facets](https://pair-code.github.io/facets/) its a lybrary for data visualization. It contains two robust visualizations to aid in understanding and analyzing machine learning datasets. Get a sense of the shape of each feature of your dataset using Facets Overview, or explore individual observations using Facets Dive.**

![facets](https://3.bp.blogspot.com/-T0dTxdse9Ow/WWz0u431RpI/AAAAAAAAB5M/rBvToJjx1L0FVVpXkgNOAwzXASyZC_JWwCLcBGAs/s640/image4.gif)


In [None]:
%pip install facets-overview==1.0.


In [15]:
from facets_overview.feature_statistics_generator import FeatureStatisticsGenerator
from IPython.display import display, HTML
import base64

fsg = FeatureStatisticsGenerator()
dataframes = [
    {'table': df, 'name': 'trainData'}]
censusProto = fsg.ProtoFromDataFrames(dataframes)
protostr = base64.b64encode(censusProto.SerializeToString()).decode('utf-8')


HTML_TEMPLATE = '''<script src='https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js'></script>
        <link rel='import' href='https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html'>
        <facets-overview id='elem'></facets-overview>
        <script>
          document.querySelector('#elem').protoInput = '{protostr}';
        </script>'''
html = HTML_TEMPLATE.format(protostr=protostr)
display(HTML(html))
with open("overview_data_cancer.html", "w") as fp:
    fp.write(html)
    fp.close()
display(HTML("<a href='overview_data_cancer.html' target='_blank'>overview_data_cancer.html</a>"))


In [16]:
SAMPLE_SIZE = 500

data_dive = df.sample(SAMPLE_SIZE).to_json(orient='records')

HTML_TEMPLATE = """<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
        <link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
        <facets-dive id="elem" height="600"></facets-dive>
        <script>
          var data = {jsonstr};
          document.querySelector("#elem").data = data;
        </script>"""
html = HTML_TEMPLATE.format(jsonstr=data_dive)

print('''
For some reason, the Dive HTML does not open well in VScode notebooks, 
but you can see the dash following the link below:
''')
with open("dive_data_cancer.html", "w") as fp:
    fp.write(html)
    fp.close()
display(HTML("<a href='dive_data_cancer.html' target='_blank'>dive_data_cancer.html</a>"))



For some reason, the Dive HTML does not open well in VScode notebooks, 
but you can see the dash following the link below:



## Pandas Profiling

![pandas-profiling](https://warehouse-camo.ingress.cmh1.psfhosted.org/6d498300bf33179bd2299e521adf386991ac9ba4/68747470733a2f2f70616e6461732d70726f66696c696e672e79646174612e61692f646f63732f6173736574732f6c6f676f5f6865616465722e706e67)

**Another way to find information about a `pd.DataFrame`, besides using `facets` and functions like `describe` and `info`, is by using the `pandas-profiling` [module](https://pypi.org/project/pandas-profiling/).**

**`pandas-profiling` generates profile reports from a pandas `DataFrame`. Extending a pandas `DataFrame` with `df.profile_report()`, will automatically generate a standardized univariate and multivariate report for data understanding.**

**Let's first install this library.**

In [None]:
%pip install pandas-profiling

**Since we are using a jupyter notebook, we can show our profile by using the `to_notebook_iframe` or `to_widgets` methods. You can also save your report with the `to_file` method.**

In [None]:
from pandas_profiling import ProfileReport

profile = ProfileReport(df, title="Pandas Profiling Report")
profile.to_notebook_iframe()
profile.to_file("pandas_profiling_report.html")

### `LabelEncoder()` and `drop()`

- **the `LabelEncoder()` Encode target labels with value between `0` and `n_classes-1`. This transformer should be used to encode target values, i.e. `y`, and not the input `X`.**
- **The `drop` method removes the specified row or column.**


In [4]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

encoder = LabelEncoder()

df['diagnosis'] = encoder.fit_transform(
    df['diagnosis'])  # Now "M" = 1 and "B" = 0
# drop the 'id' nd 'Unnamed: 32'. axis = 1 is for columns, 0 is for rows.
df = df.drop(['id', 'Unnamed: 32'], axis=1)
# save the name of the colomns (except for the diagnosis, 0 index, column)
feature_names = df.columns[1:]
display(df)


Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,1,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,1,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,1,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,1,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,1,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,1,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,1,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,1,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


### `.values` and `train_test_split()`

- **The `.values` method returns a Numpy representation (array) of the DataFrame.**
- **The `train_test_split` method splits arrays or matrices into random subsets for train and test data, respectively.**


In [5]:
df = df.values
X, y = df[:, 1:31], df[:, 0]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=666
)
print(X_train.shape, y_train.shape)
# .shape gives dimensions of the array, just a sanity check
print(X_test.shape, y_test.shape)


(455, 30) (455,)
(114, 30) (114,)


### `LogisticRegression`, `fit()`, `score()`, `predict()`, and `confusion_matrix()`

- **the `LogisticRegression` method, despite its name, is a classification algorithm rather than regression algorithm. Based on a given set of independent variables, it is used to estimate discrete value (0 or 1, yes/no, true/false).**
- **`lbfgs`: Limited-memory Broyden–Fletcher–Goldfarb–Shanno (the optimizer).**
- **`max_iter`: max nº of iterations.**
- **The `fit` method trains the algorithm on the training data, after the model is initialized.**
- **The `score` method calculates accuracy (a very common metric).**
- **The `predict` method that predict an output value, once the model is trained.**
- **The `confusion_matrix` is a method that returns a table that is used to define the performance of a classification algorithm.**


In [6]:
model = LogisticRegression(solver='lbfgs', max_iter=5000)
model.fit(X_train, y_train)

score = model.score(X_test,  y_test)
preds = model.predict(X_test)
matrix = confusion_matrix(y_test, preds)
# '{:.2f}'.format() is another fancy way of formating strings
print(f'Model Accuracy: ' + '{:.2f}'.format(score * 100) + ' %')


Model Accuracy: 94.74 %


In [7]:
import plotly.express as px

fig = px.imshow(matrix,
                labels=dict(x="Predicted", y="True label"),
                x=['Benign', 'Malignant'],
                y=['Benign', 'Malignant'],
                text_auto=True
                )
fig.update_xaxes(side='top')
fig.update_layout(template='plotly_dark',
                  title='Confusion Matrix',
                  coloraxis_showscale=False,
                  paper_bgcolor='rgba(0, 0, 0, 0)',
                  plot_bgcolor='rgba(0, 0, 0, 0)')
fig.show()
#py.plot(fig, filename=f'model_confusion_matrix.html')


# NumPy

**[NumPy](https://numpy.org/) is a [library](<https://en.wikipedia.org/wiki/Library_(computing)> "Library (computing)") for the [Python programming language](<https://en.wikipedia.org/wiki/Python_(programming_language)> "Python (programming language)"), adding support for large, multi-dimensional [arrays](https://en.wikipedia.org/wiki/Array_data_structure "Array data structure") and [matrices](<https://en.wikipedia.org/wiki/Matrix_(mathematics)> "Matrix (mathematics)"), along with a large collection of [high-level](https://en.wikipedia.org/wiki/High-level_programming_language "High-level programming language") [mathematical](https://en.wikipedia.org/wiki/Mathematics "Mathematics") [functions](<https://en.wikipedia.org/wiki/Function_(mathematics)> "Function (mathematics)") to operate on these arrays.**

![numpy](https://miro.medium.com/max/1400/1*Nhz7M4r_x8MuJtZ1orr0jg.gif)

**Learn more about NumPy functions in [here](https://www.w3schools.com/python/numpy/default.asp) and [here](https://numpy.org/doc/stable/user/quickstart.html).**


### Correlation Coefficients

**Correlation coefficients measure the *linear association between variables*. We can interpret such values as follows:**

- $+1$: Full positive correlation;
- $+0.8$: Strong positive correlation;
- $+0.6$: Moderate positive correlation;
- $0$: No correlation at all;
- $-0.6$: Moderate negative correlation;
- $-0.8$: Strong negative correlation;
- $-1$: Strong negative correlation.

### `corrcoef()`

- **The `corrcoef` method returns returns coefficients between two random variables $A$ and $B$.**


In [20]:
import numpy as np

array = []
for i in range(0, len(feature_names)):
    A, B = df[:, i], df[:, len(feature_names)]
    coef = np.corrcoef(A, B)
    coef = pd.DataFrame(coef, index=[feature_names[i], 'Cancer Type (M/B)'],
                        columns=[feature_names[i], 'Cancer Type (M/B)'])
    array.append(coef[feature_names[i]][1])
coef = pd.DataFrame(array, columns=['Correlation Coefficients'],
                    index=feature_names)
display(coef)


Unnamed: 0,Correlation Coefficients
radius_mean,0.323872
texture_mean,0.007066
perimeter_mean,0.119205
area_mean,0.051019
smoothness_mean,0.003738
compactness_mean,0.499316
concavity_mean,0.687382
concave points_mean,0.51493
symmetry_mean,0.368661
fractal_dimension_mean,0.438413


In [21]:
coefs = pd.DataFrame(
    model.coef_,
    columns=feature_names,
    index=['Coefficients'])
coefs = coefs.transpose()

fig = go.Figure(go.Bar(
    x=coefs['Coefficients'],
    y=feature_names,
    orientation='h'))
fig.update_xaxes(range=[model.coef_.min(
) + (model.coef_.min() * 0.1), model.coef_.max() + (model.coef_.max() * 0.1)])
fig.update_layout(
    xaxis=dict(
        tickmode='linear',
        tick0=0,
        dtick=0.5
    ),
    title='Model Coefficients',
)
fig.update_layout(template='plotly_dark',
                  title_text='Model Coefficients',
                  paper_bgcolor='rgba(0, 0, 0, 0)',
                  plot_bgcolor='rgba(0, 0, 0, 0)')
fig.show()
#py.plot(fig, filename=f'Model Coefficients.html')


**The coefficients in a model (when they are not too many) can provide insight into which characteristics are most important for a given classification. Something that in the end is a simple interpretability tool!** 🙃

---

Return to the [castle](https://github.com/Nkluge-correa/teeny-tiny_castle).